Proactive Fault Tolerance in MPI Applications via Task Migration
IEEE International Conference on High Performance Computing (HiPC) 2006
Publication Type: Paper
Repository URL: fault-avoidance-sc
Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.
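The evacuation step described in the abstract can be illustrated with a minimal sketch. It is not the Charm++ or AMPI API and not the authors' implementation. It assumes that each task (virtual processor) has a known load estimate, that a fault warning names a single processor, and that a greedy "heaviest task to least-loaded survivor" policy is acceptable. The function name `evacuate` and all parameter names are hypothetical.

```python
def evacuate(placement, task_load, processors, failing):
    """Return a new task placement with every task moved off `failing`.

    placement  : dict mapping task id -> processor id (hypothetical layout)
    task_load  : dict mapping task id -> estimated load of that task
    processors : list of all processor ids in the job
    failing    : processor id for which a fault warning was received

    Assumption: a greedy policy stands in for whatever load balancer
    the real runtime would use.
    """
    survivors = [p for p in processors if p != failing]
    if not survivors:
        raise ValueError("no healthy processor left to receive tasks")

    # Current load on each surviving processor.
    proc_load = {p: 0.0 for p in survivors}
    for task, proc in placement.items():
        if proc != failing:
            proc_load[proc] += task_load[task]

    new_placement = dict(placement)

    # Tasks that must leave the failing processor, heaviest first.
    to_move = sorted(
        (t for t, p in placement.items() if p == failing),
        key=lambda t: task_load[t],
        reverse=True,
    )

    # Send each task to the survivor that is least loaded at that moment.
    for task in to_move:
        target = min(survivors, key=lambda p: proc_load[p])
        new_placement[task] = target
        proc_load[target] += task_load[task]

    return new_placement
```

Tasks on healthy processors keep their placement in this sketch. Only the tasks on the processor expected to fail are moved, so the work done per warning is proportional to the number of tasks on that one processor.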
Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kale, "Proactive Fault Tolerance in MPI Applications via Task Migration," HiPC, Springer, Lecture Notes in Computer Science, vol. 4297, pp. 485-496, 2006.