Proactive Fault Tolerance in Large Systems

PPL Paper Number: 04-14

Authors:
Sayantan Chakravorty, Celso L. Mendes and Laxmikant V. Kale
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Accepted at HPCRI workshop 05


Abstract

High-performance systems with thousands of processors have been introduced in the recent past, and systems with hundreds of thousands of processors should become available in the near future. Since failures are likely to be frequent in such systems, schemes for dealing with faults are important. In this paper, we introduce a new fault tolerance solution for parallel applications that proactively migrates execution away from a processor where a failure is imminent. Our approach assumes that some failures are predictable, and leverages the fact that current hardware devices contain various features supporting early indication of faults. Using processor virtualization in Charm++ and Adaptive MPI (AMPI), we describe a mechanism that migrates objects off a processor when a failure is expected there, without requiring spare processors. After the objects are migrated and a load balancing scheme is applied, the execution of an MPI application can proceed efficiently on the remaining processors. We also modify the implementation of collective operations, such as reductions, so that they continue to operate efficiently even after a processor is evacuated and crashes. To demonstrate the feasibility of our approach, we present preliminary performance data.
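
The evacuate-then-rebalance idea in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the Charm++ or AMPI implementation: the `evacuate` function, the `placement` dictionary, and the greedy "heaviest object to the least-loaded survivor" rule are all hypothetical and chosen for illustration. The point it shows is that objects leave the processor that raised a fault warning and are spread over the processors already in use, so no spare processor is needed.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (object name, estimated load).
Obj = Tuple[str, float]


def evacuate(placement: Dict[int, List[Obj]], failing: int) -> Dict[int, List[Obj]]:
    """Return a new placement with every object moved off `failing`.

    Toy greedy rule (an assumption, not the paper's load balancer):
    take evacuated objects heaviest first and put each one on the
    surviving processor that currently has the smallest total load.
    """
    survivors = {p: list(objs) for p, objs in placement.items() if p != failing}
    if not survivors:
        raise ValueError("no processor left to receive the evacuated objects")

    loads = {p: sum(load for _, load in objs) for p, objs in survivors.items()}

    evacuated = sorted(placement.get(failing, []), key=lambda o: o[1], reverse=True)
    for obj in evacuated:
        # Ties are broken by the lower processor id to keep the result deterministic.
        target = min(loads, key=lambda p: (loads[p], p))
        survivors[target].append(obj)
        loads[target] += obj[1]
    return survivors


# Three processors; processor 2 reports an imminent failure.
placement = {
    0: [("a", 2.0), ("b", 1.0)],
    1: [("c", 3.0)],
    2: [("d", 2.0), ("e", 2.0)],
}
after = evacuate(placement, failing=2)
for proc, objs in sorted(after.items()):
    print(proc, [name for name, _ in objs], sum(load for _, load in objs))
```

In this example the two objects from processor 2 end up on processors 0 and 1, one each, and both survivors carry the same total load. A real runtime would also have to redirect messages and rebuild collective communication structures, which this model leaves out.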

