Gengbin Zheng, Chao Huang and Laxmikant V. Kalé {gzheng,chuang10,kale}@cs.uiuc.edu
Department of Computer Science
University of Illinois at Urbana-Champaign
As high-performance clusters scale up, the probability
of system failure grows substantially, posing an
increasingly significant challenge for scalability. Checkpoint-based
fault tolerance methods are an effective way to deal with faults. With
these methods, the state of the entire parallel application is checkpointed to
reliable storage. When a fault occurs, the application is restarted from
a recent checkpoint. However, the application developer is required to write
significant additional code for checkpointing and restarting. This paper
describes disk-based and memory-based checkpointing fault tolerance schemes
that automate the task of checkpointing and restarting. The schemes
also allow the program to be restarted on a different number of processors.
These schemes are based on self-checkpointable, migratable objects supported by
the Charm++ and Adaptive MPI (AMPI) runtime systems and can be applied to a wide
class of applications written using MPI or message-driven languages.
We demonstrate the effectiveness of the strategies and evaluate their
performance.
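
To make the idea of self-checkpointable, migratable objects concrete, the following is a minimal sketch in C++. It is not the Charm++ or AMPI API: the names Archive, Worker, pup, checkpoint and restart are hypothetical, and the in-memory byte buffer and round-robin placement are assumptions made for illustration only. The sketch shows one traversal routine per object serving both checkpoint and restart, and a restart that redistributes the restored objects over a processor count different from the one used at checkpoint time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical serializer: the same io() calls either append to the
// buffer (Pack) or read back from it (Unpack).
class Archive {
public:
    enum class Mode { Pack, Unpack };

    explicit Archive(Mode m) : mode_(m) {}
    Archive(Mode m, std::vector<unsigned char> bytes)
        : mode_(m), buf_(std::move(bytes)) {}

    bool unpacking() const { return mode_ == Mode::Unpack; }

    // Plain-old-data fields are copied byte for byte.
    template <typename T>
    void io(T& value) {
        static_assert(std::is_trivially_copyable<T>::value,
                      "io<T> handles trivially copyable types only");
        if (mode_ == Mode::Pack) {
            const unsigned char* p =
                reinterpret_cast<const unsigned char*>(&value);
            buf_.insert(buf_.end(), p, p + sizeof(T));
        } else {
            assert(pos_ + sizeof(T) <= buf_.size());
            std::memcpy(&value, buf_.data() + pos_, sizeof(T));
            pos_ += sizeof(T);
        }
    }

    // Variable-length fields store their length first.
    void io(std::vector<double>& v) {
        std::size_t n = v.size();
        io(n);
        if (unpacking()) v.resize(n);
        for (double& x : v) io(x);
    }

    const std::vector<unsigned char>& bytes() const { return buf_; }

private:
    Mode mode_;
    std::vector<unsigned char> buf_;
    std::size_t pos_ = 0;
};

// Hypothetical application object that describes its own state once.
struct Worker {
    int id = 0;
    int iteration = 0;
    std::vector<double> data;

    void pup(Archive& ar) {
        ar.io(id);
        ar.io(iteration);
        ar.io(data);
    }
};

// Checkpoint: pack every object into one buffer (a stand-in for disk or
// a buddy processor's memory).
std::vector<unsigned char> checkpoint(std::vector<Worker>& objs) {
    Archive ar(Archive::Mode::Pack);
    std::size_t n = objs.size();
    ar.io(n);
    for (Worker& w : objs) w.pup(ar);
    return ar.bytes();
}

// Restart: rebuild the objects and place them round-robin on `procs`
// processors, which need not match the count used at checkpoint time.
std::vector<std::vector<Worker>> restart(
        const std::vector<unsigned char>& bytes, std::size_t procs) {
    assert(procs > 0);
    Archive ar(Archive::Mode::Unpack, bytes);
    std::size_t n = 0;
    ar.io(n);
    std::vector<std::vector<Worker>> placement(procs);
    for (std::size_t i = 0; i < n; ++i) {
        Worker w;
        w.pup(ar);
        placement[i % procs].push_back(std::move(w));
    }
    return placement;
}
```

Because the checkpoint records objects rather than per-processor memory images, nothing in the saved data ties an object to the processor it ran on, which is what allows the restart step in this sketch to choose a new placement.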