As both modern supercomputers and new generation scientific computing applications grow
in size and complexity, the probability of system failure rises commensurately.
Making parallel computing fault tolerant has become an increasingly important
issue.
At PPL, we have research (and resultant software) on multiple schemes
that support fault tolerance:(1) automated traditional checkpoint with
restart on a different number of processors (2) FTC Charm++: an
in-remote-memory (or in-local-disk) double checkpointing
(3) an ongoing research project that avoids sending all processors back to their checkpoints, and speeds up restart (4) a proactive fault avoidance scheme that evacuates a processor while letting the computation continue.
On-disk checkpoint/restart is the simplest approach. It
requires synchronization in the system. In AMPI, this is invoked by a
collective call. All the data is saved as disk files on a file
server. When a processor crashes, the application is killed and then
restarted from its checkpoint. This approach allows the scheduler more flexibility,
since a job can be restarted on a different number of processors.
FTC-Charm++: double checkpoint-based fault tolerance for Charm++ and AMPI.
The above traditional disk-based method of dealing with faults is to
checkpoint the state of the entire application periodically to reliable
storage and restart from the recent checkpoint.
The recovery of the application from faults involves (often manually)
restarting applications on all processors and having it read the data from
disks on all processors. The restart can therefore take minutes after it has
been initiated. Traditional strategies often require that the failed processor
can be replaced so that the number of processors at checkpoint-time and
recovery-time are the same.
FTC-Charm++ is a fault-tolerant runtime based on a scheme for fast and
scalable in-memory checkpoint and restart. The scheme
- does not require any individual component to be fault-free
because checkpoint data is stored in two copies on two different locations.
(although not infalliable)
- provides fast and efficient checkpoint and restart using memory for
checkpointing. It can take advantage of high performance network to speedup
checkpointing process.
- at restart, it provides fault tolerance support for both cases with
or without backup processors.
If there are no backup processors, the program can continue to
run on the remaining processors without performance penalty due to
load imbalance.
- the memory-based method is useful for applications whose memory footprint
is small at the checkpoint state, while a variation of this scheme ---
in-disk checkpoint/restart can be applied to applications with large memory
footprint. It uses local scratch disk for storing checkpoints, which is
much more efficient than using centralized reliable file server.
FTL-Charm++:is a sender based pessimistic fault
tolerant protocol for Charm++ and AMPI. The sender saves a copy of the
message in its memory while sending a message to an object on another
processor. It sends a ticket request to the receiver, which replies
with a ticket. A ticket is the sequence in which the receiver will
process this message. The sender saves the ticket in its memory and
sends the message to the receiver. For messages sent to an object on
the same processor, the ticket number is fetched through a function
call and along with the sender and receiver's id is saved on a buddy processor. This allows
message logging to work along with virtualization. While
restarting after a crash, this protocol can spread the objects on
the restarted processor among other processors. This provides for much
faster restarts compared to other message logging protocols where all
the work of the restarted processor is done on one processor.
Checkpoint based protocols, which rollback all processors to their
previous checkpoint when one processor crashes, of course cannot
redistribute work from the restarted processor as all processors are
busy. So our protocol is unique in allowing restarts much faster than
the time between the crash and the previous checkpoint. Work is going
on to combine the message logging protocol with migration of objects to
allow us to do dynamic runtime load balancing.
Proactive Fault Tolerance: is a scheme for reacting
to fault warnings for Charm++ and AMPI. Modern hardware and fault
prediction schemes can predict many faults with a good degree of
accuracy. We developed a scheme for reacting to such faults. When a
node is warned that it might crash, the charm++ objects on it are
migrated away. The runtime system is also changed so that message
delivery can continue seamlessly even if the warned node crashes.
Reduction trees are also modified to remove the warned node from the
tree so that its crash does not cause the tree to become disconnected.
A node in the tree is replaced by one of its children. If the tree
becomes too unbalanced, the whole reduction tree can be recreated from
scratch. The proactive fault tolerance protocol does not require any
extra nodes, it continues working on the remaining ones. It can deal
with multiple simultaneous warnings, which is useful for real life
cases when all the nodes in a rack are about to go down.
Comparisons of these schemes:
Scheme
|
Diskless
|
Require Backup Processors
|
Transparent Checkpointing
|
Synchronized Checkpointing
|
Automatic Restart
|
On-disk Checkpoint/Restart
|
central file server
|
not necessarily
|
No
|
Yes
|
No
|
FTC-Charm++ (double memory)
|
Yes
|
not necessarily |
No
|
Yes
|
Yes
|
| FTC-Charm++ (double disk) |
local disk
|
not necessarily |
No
|
Yes
|
Yes
|
FTL-Charm++ (message logging)
|
Yes
|
currently yes, but not necessarily(*)
|
Yes
|
No
|
Yes
|
(*) The basic approach of FTL does not require a backup processor; However,
our current implementation assumes availability of backup processors.
This will change in later versions...
|
- 06-12
Sayantan Chakravorty, Laxmikant V. Kale, A Fault Tolerance Protocol with Fast Fault Recovery, Proceedings of the 21st International Parallel and Distributed Processing Symposium, 2007, Long Beach California
- 06-11
Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale, Proactive Fault Tolerance in MPI Applications via Task Migration, In Proceedings of HIPC 2006, LNCS volume 4297, page 485
- 06-04
Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale, Terry Jones, Andrew Tauferner, Todd Inglett, Jose Moreira, HPC-Colony: Services and Interfaces for Very Large Systems, accepted by OSR Special Issue on HEC OS/Runtimes
- 06-03
Gengbin Zheng, Chao Huang, Laxmikant V. Kale, Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++, ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Computing Systems, 2006
- 04-14
Sayantan Chakravorty, Celso L. Mendes and Laxmikant V. Kale, Proactive Fault Tolerance in Large Systems, Accepted at HPCRI workshop 05
- 04-07
Chao Huang, Thesis: System Support for Checkpoint/Restart of Charm++ and AMPI Applications, Master's Thesis, Dept. of Computer Science, University of Illinois 2004
- 04-06
Gengbin Zheng, Lixia Shi, Laxmikant V. Kale, FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI, Cluster 2004
- 04-03
Sayantan Chakravorty and Laxmikant V. Kale, A Fault Tolerant Protocol for Massively Parallel Systems, published in the proceedings of the FTPDS workshop at IPDPS'04
|