Proactive Fault Tolerance in MPI Applications via Task Migration
Authors:
Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale
Parallel Programming Laboratory, Department of Computer Science, University
of Illinois at Urbana-Champaign
In Proceedings of HIPC 2006, LNCS volume 4297, page 485
Failures are likely to be more frequent in systems with thousands of processors. Therefore, schemes for dealing with faults become increasingly important. In this paper, we present a fault tolerance solution for parallel applications that proactively migrates execution from processors where failure is imminent. Our approach assumes that some failures are predictable, and leverages the features in current hardware devices supporting early indication of faults. We use the concepts of processor virtualization and dynamic task migration, provided by Charm++ and Adaptive MPI (AMPI), to implement a mechanism that migrates tasks away from processors which are expected to fail. To demonstrate the feasibility of our approach, we present performance data from experiments with existing MPI applications. Our results show that proactive task migration is an effective technique to tolerate faults in MPI applications.