The pack-unpack subroutines written for migration ensure that the current state of the program is correctly packed (serialized) so that it can be restarted on a different processor. The same subroutines can also be used to save the state of the program to disk, so that if the program crashes abruptly, or if its allocated time expires before execution completes, it can be restarted from the most recently checkpointed state. Thus, the pack-unpack subroutines serve as the key facility for checkpointing in addition to their usual role in migration.
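The idea of reusing one pair of pack-unpack routines for both purposes can be illustrated with a minimal sketch in plain C. It is not AMPI's actual pack-unpack interface: the type SolverState and the functions pack_state, unpack_state, save_state, and load_state are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical program state; not an AMPI type. */
typedef struct {
    int iteration;
    double residual;
} SolverState;

/* Number of bytes the packed form occupies. */
#define PACKED_SIZE (sizeof(int) + sizeof(double))

/* Pack the state into a flat byte buffer; returns the bytes written. */
size_t pack_state(const SolverState *s, unsigned char *buf) {
    assert(s != NULL && buf != NULL);
    size_t off = 0;
    memcpy(buf + off, &s->iteration, sizeof s->iteration);
    off += sizeof s->iteration;
    memcpy(buf + off, &s->residual, sizeof s->residual);
    off += sizeof s->residual;
    return off;
}

/* Unpack the state from a buffer produced by pack_state. */
size_t unpack_state(SolverState *s, const unsigned char *buf) {
    assert(s != NULL && buf != NULL);
    size_t off = 0;
    memcpy(&s->iteration, buf + off, sizeof s->iteration);
    off += sizeof s->iteration;
    memcpy(&s->residual, buf + off, sizeof s->residual);
    off += sizeof s->residual;
    return off;
}

/* Checkpoint: reuse pack_state and write the bytes to disk. */
int save_state(const SolverState *s, const char *path) {
    unsigned char buf[PACKED_SIZE];
    size_t n = pack_state(s, buf);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t written = fwrite(buf, 1, n, f);
    if (fclose(f) != 0 || written != n) return -1;
    return 0;
}

/* Restart: read the bytes back and reuse unpack_state. */
int load_state(SolverState *s, const char *path) {
    unsigned char buf[PACKED_SIZE];
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(buf, 1, sizeof buf, f);
    fclose(f);
    if (got != sizeof buf) return -1;
    unpack_state(s, buf);
    return 0;
}
```

In this sketch the same packed bytes could equally be sent to another processor (migration) or written to a file (checkpointing); only the destination of the buffer differs.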
AMPI provides a subroutine for checkpointing: void MPI_Checkpoint(char *dirname); This subroutine takes a directory name as its argument. It is a collective function, meaning that every virtual processor in the program must call it and specify the same directory name. (Typically, in an iterative AMPI program, the iteration number, converted to a character string, can serve as the checkpoint directory name.) The directory is created, and the entire state of the program is checkpointed to it. The program can be restarted from the checkpointed state by specifying "+restart dirname" on the command line. This capability is provided by the CHARM++ runtime system. For more information about the CHARM++ checkpoint/restart mechanism, please refer to the CHARM++ manual.
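The iteration-number naming convention mentioned above might look like the following sketch. So that it compiles outside AMPI, the checkpoint call is passed in as a function pointer with the same signature as MPI_Checkpoint; in an AMPI program one would pass MPI_Checkpoint itself and have every virtual processor execute the same loop. The names make_checkpoint_dirname, run_with_checkpoints, record_checkpoint, and last_checkpoint_dir are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same signature as MPI_Checkpoint(char *dirname). */
typedef void (*checkpoint_fn)(char *dirname);

/* Stand-in for MPI_Checkpoint outside AMPI: remembers the last name. */
char last_checkpoint_dir[32] = "";
void record_checkpoint(char *dirname) {
    strncpy(last_checkpoint_dir, dirname, sizeof last_checkpoint_dir - 1);
    last_checkpoint_dir[sizeof last_checkpoint_dir - 1] = '\0';
}

/* Convert the iteration number to a string for use as a directory name.
   Returns 0 on success, -1 if the buffer is too small. */
int make_checkpoint_dirname(char *buf, size_t len, int iteration) {
    assert(buf != NULL && len > 0);
    int n = snprintf(buf, len, "%d", iteration);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}

/* Run num_iters iterations, checkpointing every `interval` iterations.
   Returns the number of checkpoints taken, or -1 on a naming error. */
int run_with_checkpoints(int num_iters, int interval, checkpoint_fn checkpoint) {
    char dirname[32];
    int taken = 0;
    for (int iter = 1; iter <= num_iters; iter++) {
        /* ... one iteration of the application's computation ... */
        if (interval > 0 && iter % interval == 0) {
            if (make_checkpoint_dirname(dirname, sizeof dirname, iter) != 0)
                return -1;
            /* Collective: every virtual processor passes the same name. */
            checkpoint(dirname);
            taken++;
        }
    }
    return taken;
}
```

With ten iterations and an interval of five, this sketch would checkpoint into directories named "5" and "10"; the run could then be resumed from the later one by specifying "+restart 10" on the command line.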
January 17, 2008