Thesis: System Support for Checkpoint/Restart of Charm++ and AMPI Applications

PPL Paper Number: 04-07

Authors:
Chao Huang
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Master's Thesis, Dept. of Computer Science, University of Illinois, 2004


Abstract

As both modern supercomputers and new-generation scientific computing applications grow in size and complexity, the probability of system failure rises commensurately. Making parallel computing fault tolerant has therefore become an increasingly important issue. A checkpoint/restart mechanism provides fault tolerance as well as other benefits for parallel programmers. This thesis describes the motivation, design, and implementation of the on-disk checkpoint/restart mechanism for the Charm++ and Adaptive MPI (AMPI) programming framework. The mechanism has proven useful in practice and can also be extended to implement other fault-tolerance techniques.

