The disk-based fault-tolerance scheme described previously is very basic: when a failure occurs, the whole program is killed and the user has to restart the application manually from the checkpoint files. The double checkpoint/restart protocol described in this subsection provides an automatic fault-tolerance solution. When a failure occurs, the program automatically detects the failure and restarts from the checkpoint. Furthermore, this protocol does not rely on reliable storage, as the previous method does. Instead, it stores two copies of the checkpoint data in two different locations, which can be in memory or on disk. Double checkpointing ensures that one checkpoint remains available if the other is lost. The double in-memory checkpoint/restart scheme is useful and efficient for applications with a small memory footprint at the checkpoint state. The double in-disk variation stores checkpoints on local disk and can therefore be useful for applications with a large memory footprint.
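The idea of keeping two copies in two locations can be illustrated with a small self-contained simulation. This is only a sketch and not the CHARM++ implementation: the Node type, the ring-style buddyOf placement, and the function names are all hypothetical, and the text above does not say how the second location is chosen.

```cpp
#include <cassert>
#include <initializer_list>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one processor's memory: the checkpoints it holds,
// keyed by the PE that owns the checkpointed state.
struct Node {
    bool alive = true;
    std::map<int, std::string> stored;  // owner PE -> checkpoint data
};

// Assumption: the second copy goes to the next PE in a ring.
int buddyOf(int pe, int numPes) { return (pe + 1) % numPes; }

// Store every PE's state twice: once locally and once on its buddy.
void doubleCheckpoint(std::vector<Node>& nodes,
                      const std::vector<std::string>& state) {
    assert(nodes.size() == state.size());
    const int n = static_cast<int>(nodes.size());
    for (int pe = 0; pe < n; ++pe) {
        nodes[pe].stored[pe] = state[pe];
        nodes[buddyOf(pe, n)].stored[pe] = state[pe];
    }
}

// A crash loses everything that was held in that node's memory.
void crash(std::vector<Node>& nodes, int pe) {
    nodes[pe].alive = false;
    nodes[pe].stored.clear();
}

// Return a surviving copy of the owner's checkpoint, if one exists.
std::optional<std::string> recover(const std::vector<Node>& nodes, int owner) {
    const int n = static_cast<int>(nodes.size());
    for (int holder : {owner, buddyOf(owner, n)}) {
        const Node& node = nodes[holder];
        auto it = node.stored.find(owner);
        if (node.alive && it != node.stored.end()) return it->second;
    }
    return std::nullopt;
}
```

In this model a single crash never loses a checkpoint, because the other copy lives on a different node. A checkpoint becomes unrecoverable only if both of its holders fail before the next checkpoint is taken.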
The function that a user can call to initiate checkpointing in a Chare array-based application is:
where cb has the same meaning as in Section 6.1.1. As with the disk checkpoint described above, it is up to the programmer to decide what to save. The programmer is also responsible for choosing when to activate checkpointing so that the size of the global checkpoint state is minimal.
In AMPI applications, the user only needs to call the following function to start checkpointing:
When a processor crashes, the restart protocol is invoked automatically to recover all objects from the last checkpoints. The program then continues to run on the surviving processors. This assumes that there are no extra processors to replace the crashed ones.
However, if there is a pool of extra processors to replace the crashed ones, the fault-tolerance protocol can take advantage of it by grabbing one free processor and letting the program run on the same number of processors as before the crash. To achieve this, CHARM++ needs to be compiled with the macro option CK_NO_PROC_POOL turned on.
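The difference between the two restart modes can be sketched as a placement decision. The following is a hypothetical model, not the runtime's actual recovery logic: RestartPlan, planRestart, and the round-robin redistribution over surviving processors are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <vector>

// Outcome of a hypothetical restart after one processor has crashed.
struct RestartPlan {
    int activeProcessors;         // processors the program runs on afterwards
    std::vector<int> objectToPe;  // new home PE of each object
    int sparesLeft;               // free processors remaining in the pool
};

RestartPlan planRestart(const std::vector<int>& objectToPe, int numPes,
                        int crashedPe, int spares) {
    assert(numPes > 1 && crashedPe >= 0 && crashedPe < numPes && spares >= 0);
    RestartPlan plan{numPes, objectToPe, spares};
    if (spares > 0) {
        // A free processor takes over the crashed PE's rank, so the object
        // placement and the processor count stay the same as before.
        plan.sparesLeft = spares - 1;
        return plan;
    }
    // No pool: objects from the crashed PE are spread round-robin over the
    // survivors (an assumed policy), and one fewer processor is in use.
    plan.activeProcessors = numPes - 1;
    int next = 0;
    for (int& pe : plan.objectToPe) {
        if (pe != crashedPe) continue;
        if (next == crashedPe) next = (next + 1) % numPes;
        pe = next;
        next = (next + 1) % numPes;
    }
    return plan;
}
```

With a spare available, the plan keeps the original processor count. Without one, the recovered objects are folded onto the survivors and the program continues on fewer processors.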
A variation of double in-memory checkpoint/restart, double in-disk checkpoint/restart, can be applied to applications with a large memory footprint. In this scheme, checkpoints are stored on the local disk instead of in memory. The checkpoint files are named "ckpt[CkMyPe]-[idx]-XXXXXX" and are stored under /tmp.
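The trailing XXXXXX in the file name resembles the template that the POSIX mkstemp function expects, which replaces those six characters with a unique suffix. The sketch below creates a file following that naming pattern. It assumes that [CkMyPe] stands for the processor number and [idx] for an integer index, and createCheckpointFile is a hypothetical helper; whether the runtime itself uses mkstemp is likewise an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <stdlib.h>  // mkstemp (POSIX)
#include <unistd.h>  // close, access, unlink (POSIX)

// Create an empty, uniquely named file under /tmp that follows the
// "ckpt[CkMyPe]-[idx]-XXXXXX" pattern. Returns the final path, or an
// empty string if the file could not be created.
std::string createCheckpointFile(int pe, int idx) {
    assert(pe >= 0 && idx >= 0);
    const std::string tmpl = "/tmp/ckpt" + std::to_string(pe) + "-" +
                             std::to_string(idx) + "-XXXXXX";
    // mkstemp rewrites the template in place, so it needs a writable buffer.
    std::vector<char> buf(tmpl.begin(), tmpl.end());
    buf.push_back('\0');
    const int fd = mkstemp(buf.data());
    if (fd == -1) return "";
    close(fd);
    return std::string(buf.data());
}
```

Calling createCheckpointFile(3, 0) would yield a path that starts with /tmp/ckpt3-0- followed by six generated characters. Because the files live on node-local disk, they are lost together with the node, which is why two copies are still kept.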
A programmer can use the runtime option +ftc_disk to switch to this mode. For example:
May 26, 2012