CHARM++ offers a range of fault tolerance capabilities through its checkpoint/restart mechanism. A typical chare-array-based CHARM++ application, including an AMPI application, can be checkpointed to files on disk and later restarted from those files.
The basic idea is straightforward: checkpointing an application is like migrating its parallel objects from the processors onto disk, and restarting is the reverse. Thanks to migration utilities such as PUP'ing (Section 3.18), users can decide what data to save in checkpoints and how to save it.
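The following is a minimal sketch of that idea in plain C++, with no dependency on CHARM++. The names Archive, Worker, checkpoint, and restart are hypothetical and are not part of the CHARM++ API; the real PUP framework is described in Section 3.18. The sketch only shows how a single pup routine can serve both directions, and how the object itself chooses which fields go into the checkpoint.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a PUP-style serializer. This is NOT the Charm++
// PUP::er class; it only illustrates the pack/unpack idea.
class Archive {
public:
    // Packing archive: starts empty and grows as data is written.
    Archive() : unpacking_(false), pos_(0) {}
    // Unpacking archive: reads back from previously saved bytes.
    explicit Archive(std::vector<char> saved)
        : unpacking_(true), buf_(std::move(saved)), pos_(0) {}

    bool unpacking() const { return unpacking_; }
    const std::vector<char>& bytes() const { return buf_; }

    // Move n raw bytes between the memory at p and the buffer, in the
    // direction given by the archive's mode.
    void raw(void* p, std::size_t n) {
        if (n == 0) return;
        char* c = static_cast<char*>(p);
        if (unpacking_) {
            assert(pos_ + n <= buf_.size());  // refuse to read past the end
            std::memcpy(c, buf_.data() + pos_, n);
            pos_ += n;
        } else {
            buf_.insert(buf_.end(), c, c + n);
        }
    }

    // Convenience wrapper for fixed-size, trivially copyable values.
    template <typename T>
    void value(T& v) { raw(&v, sizeof(T)); }

private:
    bool unpacking_;
    std::vector<char> buf_;
    std::size_t pos_;
};

// Hypothetical parallel object. Its pup routine decides what is saved.
struct Worker {
    int iteration = 0;
    std::vector<double> data;
    std::vector<double> scratch;  // derived data, deliberately not saved

    // The same routine is used for saving and for restoring.
    void pup(Archive& ar) {
        ar.value(iteration);
        std::size_t n = data.size();
        ar.value(n);                         // length first, so unpacking can resize
        if (ar.unpacking()) data.resize(n);
        ar.raw(data.data(), n * sizeof(double));
    }
};

// "Migrate" the object into a byte buffer, as a checkpoint would.
std::vector<char> checkpoint(Worker& w) {
    Archive ar;
    w.pup(ar);
    return ar.bytes();
}

// Rebuild an object from a saved byte buffer, as a restart would.
Worker restart(const std::vector<char>& saved) {
    Archive ar(saved);
    Worker w;
    w.pup(ar);
    return w;
}
```

Calling checkpoint on a Worker and passing the result to restart yields an object with the same iteration and data, while scratch comes back empty because pup never touches it. Writing the returned bytes to a file and reading them back on a later run would correspond to the disk-based case described above.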
Two fault tolerance schemes are implemented.
May 26, 2012