Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++
    
    Operating and Runtime Systems for High-end Computing Systems 2006
    Publication Type: Paper
    Repository URL: ftCompare
    Abstract
    As high-performance clusters grow in size, the
probability of system failure grows substantially, posing an
increasingly significant challenge for scalability.
Checkpoint-based fault tolerance methods are an effective way
of dealing with faults. With these methods, the state of the entire
parallel application is checkpointed to reliable storage. When a
fault occurs, the application is restarted from a recent
checkpoint. However, the application developer is required to write
significant additional code for checkpointing and restarting. This
paper describes disk-based and memory-based checkpointing fault
tolerance schemes that automate the task of checkpointing and
restarting. The schemes also allow the program to be restarted on a
different number of processors. These schemes are based on
self-checkpointable, migratable objects supported by the Charm++
and Adaptive MPI (AMPI) runtime systems and can be applied to a wide class
of applications written using MPI or message-driven languages. We
demonstrate the effectiveness of the strategies and evaluate their
performance.
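
The checkpoint and restart cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the Charm++ or AMPI implementation: the `Chunk` class, its `pack`/`unpack` methods, the round-robin `place` function, and the JSON file format are all hypothetical names chosen for illustration. The sketch assumes that each object can serialize its own state, which is what lets the saved objects be redistributed over a different number of processors at restart.

```python
import json
import os
import tempfile


class Chunk:
    """Hypothetical migratable object whose full state can be packed to a dict."""

    def __init__(self, index, step=0, value=0.0):
        self.index = index
        self.step = step
        self.value = value

    def advance(self):
        # Stand-in for one iteration of real computation.
        self.step += 1
        self.value += self.index

    def pack(self):
        return {"index": self.index, "step": self.step, "value": self.value}

    @classmethod
    def unpack(cls, state):
        return cls(**state)


def place(chunks, num_procs):
    """Assign objects to simulated processors round-robin by object index."""
    procs = [[] for _ in range(num_procs)]
    for chunk in sorted(chunks, key=lambda c: c.index):
        procs[chunk.index % num_procs].append(chunk)
    return procs


def checkpoint(procs, path):
    """Write every object's packed state to one file, independent of placement."""
    states = [chunk.pack() for proc in procs for chunk in proc]
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(states, f)
    # Rename into place so a crash mid-write does not clobber the old checkpoint.
    os.replace(tmp, path)


def restart(path, num_procs):
    """Rebuild all objects from the checkpoint and place them on num_procs."""
    with open(path) as f:
        states = json.load(f)
    return place([Chunk.unpack(s) for s in states], num_procs)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    procs = place([Chunk(i) for i in range(8)], num_procs=4)
    for _ in range(3):
        for proc in procs:
            for chunk in proc:
                chunk.advance()
    checkpoint(procs, path)
    # Simulate a failure: drop the in-memory state and restart on 3 processors.
    return restart(path, num_procs=3)
```

Calling `demo()` runs eight objects on four simulated processors, checkpoints them to disk, and restarts them on three. Because the checkpoint stores per-object state rather than per-processor state, the application code in `Chunk.advance` needs no knowledge of where it runs.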
    TextRef
      
        Gengbin Zheng, Chao Huang, and Laxmikant V. Kale, "Performance Evaluation of
Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++", ACM SIGOPS
Operating Systems Review: Operating and Runtime Systems for High-end Computing
Systems, vol. 40, April 2006.
      