
Proactive Fault Tolerance in Large Systems
Workshop on High Performance Computing Reliability Issues at HPCA (HPCRI) 2005
Publication Type: Paper
High-performance systems with thousands of processors have been introduced in the recent past, and systems with hundreds of thousands of processors should become available in the near future. Since failures are likely to be frequent in such systems, schemes for dealing with faults are important. In this paper, we introduce a new fault tolerance solution for parallel applications that proactively migrates execution from a processor where a failure is imminent. Our approach assumes that some failures are predictable, and leverages the fact that current hardware devices contain various features supporting early indication of faults. By using the concepts of processor virtualization in Charm++ and Adaptive MPI (AMPI), we describe a mechanism that migrates objects when a failure is expected to arise in a given processor, without requiring spare processors. After migrating objects, and applying a load balancing scheme, the execution of an MPI application can proceed and achieve optimized efficiency. We modify the implementation of collective operations, such as reductions, so that they continue to operate efficiently even after a processor is evacuated and crashes. To demonstrate the feasibility of our approach, we present preliminary performance data.
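The evacuation step described in the abstract can be illustrated with a minimal sketch. This is not the Charm++/AMPI implementation: the function name `evacuate`, the dictionary-based bookkeeping, and the greedy least-loaded placement are all assumptions made for illustration, standing in for the runtime's object migration and load balancing. It shows only the core idea that objects on a processor with a predicted failure are redistributed across the surviving processors, with no spare processor needed.

```python
def evacuate(placement, loads, failing, alive):
    """Move every object off `failing` onto surviving processors.

    placement: dict mapping object id -> processor id (hypothetical bookkeeping)
    loads:     dict mapping object id -> measured load
    failing:   processor id for which a failure warning was received
    alive:     set of processor ids currently in use (includes `failing`)
    Returns (new_placement, survivors).
    """
    survivors = set(alive) - {failing}
    if not survivors:
        raise RuntimeError("no surviving processor to receive objects")

    # Total load already assigned to each surviving processor.
    proc_load = {p: 0.0 for p in survivors}
    for obj, proc in placement.items():
        if proc in survivors:
            proc_load[proc] += loads[obj]

    new_placement = dict(placement)
    # Greedy heuristic (an assumption, not the paper's scheme): place the
    # heaviest evacuated objects first, each on the least-loaded survivor.
    evacuated = sorted(
        (o for o, p in placement.items() if p == failing),
        key=lambda o: loads[o],
        reverse=True,
    )
    for obj in evacuated:
        target = min(proc_load, key=lambda p: (proc_load[p], p))
        new_placement[obj] = target
        proc_load[target] += loads[obj]
    return new_placement, survivors


if __name__ == "__main__":
    placement = {"a": 0, "b": 0, "c": 1, "d": 2}
    loads = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 1.0}
    new_placement, survivors = evacuate(placement, loads, failing=0, alive={0, 1, 2})
    print(new_placement, survivors)
```

In the real system, a separate load balancing pass would follow the evacuation; the sketch folds a simple balancing heuristic into the placement decision to stay short.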
Sayantan Chakravorty, Celso Mendes and L. V. Kale, "Proactive Fault Tolerance in Large Systems", HPCRI Workshop in conjunction with HPCA 2005, 2005.