Thesis: System Support for Checkpoint/Restart of Charm++ and AMPI Applications
Authors:
Chao Huang
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign
Master's Thesis, Dept. of Computer Science, University of Illinois 2004
As both modern supercomputers and new-generation scientific computing applications grow in size and complexity, the probability of system failure rises commensurately. Making parallel computing fault tolerant has therefore become an increasingly important issue. A checkpoint/restart mechanism provides fault tolerance as well as other benefits for parallel programmers. This thesis describes the On-Disk Checkpoint/Restart Mechanism for the Charm++ and Adaptive MPI (AMPI) programming frameworks, including its motivation, design, and implementation. The mechanism has proven useful in practice and can be extended to implement other fault-tolerance techniques.