Fault Tolerance Support

As both modern supercomputers and new generation scientific computing applications grow in size and complexity, the probability of system failure rises commensurately. Making parallel computing fault tolerant has become an increasingly important issue.

At PPL, we have research (and resultant software) on multiple schemes that support fault tolerance:(1) automated traditional checkpoint with restart on a different number of processors (2) FTC Charm++: an in-remote-memory (or in-local-disk) double checkpointing (3) an ongoing research project that avoids sending all processors back to their checkpoints, and speeds up restart (4) a proactive fault avoidance scheme that evacuates a processor while letting the computation continue.

  • On-disk checkpoint/restart is the simplest approach. It requires synchronization in the system. In AMPI, this is invoked by a collective call. All the data is saved as disk files on a file server. When a processor crashes, the application is killed and then restarted from its checkpoint. This approach allows the scheduler more flexibility, since a job can be restarted on a different number of processors.
  • FTC-Charm++: double checkpoint-based fault tolerance for Charm++ and AMPI.
    The above traditional disk-based method of dealing with faults is to checkpoint the state of the entire application periodically to reliable storage and restart from the recent checkpoint. The recovery of the application from faults involves (often manually) restarting applications on all processors and having it read the data from disks on all processors. The restart can therefore take minutes after it has been initiated. Traditional strategies often require that the failed processor can be replaced so that the number of processors at checkpoint-time and recovery-time are the same. FTC-Charm++ is a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. The scheme
    • does not require any individual component to be fault-free because checkpoint data is stored in two copies on two different locations. (although not infalliable)
    • provides fast and efficient checkpoint and restart using memory for checkpointing. It can take advantage of high performance network to speedup checkpointing process.
    • at restart, it provides fault tolerance support for both cases with or without backup processors. If there are no backup processors, the program can continue to run on the remaining processors without performance penalty due to load imbalance.
    • the memory-based method is useful for applications where the memory footprint is small at the checkpoint state, while a variation of this scheme --- in-disk checkpoint/restart can be applied to applications with large memory footprint. It uses local scratch disk for storing checkpoints, which is much more efficient than using centralized reliable file server.
  • FTL-Charm++:is a sender based pessimistic fault tolerant protocol for Charm++ and AMPI. The sender saves a copy of the message in its memory while sending a message to an object on another processor. It sends a ticket request to the receiver, which replies with a ticket. A ticket is the sequence in which the receiver will process this message. The sender saves the ticket in its memory and sends the message to the receiver. For messages sent to an object on the same processor, the ticket number is fetched through a function call and along with the sender and receiver's id is saved on a buddy processor. This allows message logging to work along with virtualization. While restarting  after a crash, this protocol can spread the objects on the restarted processor among other processors. This provides for much faster restarts compared to other message logging protocols where all the work of the restarted processor is done on one processor. Checkpoint based protocols, which rollback all processors to their previous checkpoint when one processor crashes, obviously cannot redistribute work from the restarted processor as all processors are busy. So our protocol is unique in allowing restarts much faster than the time between the crash and the previous checkpoint. Work is going on to combine the message logging protocol with migration of objects to allow us to do dynamic runtime load balancing.

  • Proactive Fault Tolerance: is a scheme for reacting to fault warnings for Charm++ and AMPI. Modern hardware and fault prediction schemes can predict many faults with a good degree of accuracy. We developed a scheme for reacting to such faults. When a node is warned that it might crash, the charm++ objects on it are migrated away. The runtime system is also changed so that message delivery can continue seamlessly even if the warned node crashes. Reduction trees are also modified to remove the warned node from the tree so that its crash does not cause the tree to become disconnected. A node in the tree is replaced by one of its children. If the tree becomes too unbalanced, the whole reduction tree can be recreated from scratch. The proactive fault tolerance protocol does not require any extra nodes, it continues working on the remaining ones. It can deal with multiple simultaneous warnings, which is useful for real life cases where all the nodes in a rack are about to go down.
  • Comparisons of these schemes:

    Scheme
    Diskless
    Require Backup Processors
    Transparent Checkpointing
    Synchronized Checkpointing
    Automatic Restart
    On-disk Checkpoint/Restart
    central file server
    not necessarily
    No
    Yes
    No
    FTC-Charm++ (double memory)
    Yes
    not necessarily No
    Yes
    Yes
    FTC-Charm++ (double disk) local  disk
    not necessarily No
    Yes
    Yes
    FTL-Charm++ (message logging)
    Yes
    currently yes, but not necessarily(*)
    Yes
    No
    Yes
    (*) The basic approach of FTL does not require a backup processor; However, our current implementation assumes availability of backup processors. This will change in later versions...
     

    People
    Papers
    • 06-12    Sayantan Chakravorty, Laxmikant V. Kale,  A Fault Tolerance Protocol with Fast Fault Recovery,  Proceedings of the 21st International Parallel and Distributed Processing Symposium, 2007, Long Beach California
    • 06-11    Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale,  Proactive Fault Tolerance in MPI Applications via Task Migration,  In Proceedings of HIPC 2006, LNCS volume 4297, page 485
    • 06-04    Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale, Terry Jones, Andrew Tauferner, Todd Inglett, Jose Moreira,  HPC-Colony: Services and Interfaces for Very Large Systems,  accepted by OSR Special Issue on HEC OS/Runtimes
    • 06-03    Gengbin Zheng, Chao Huang, Laxmikant V. Kale,  Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++,  ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Computing Systems, 2006
    • 04-14    Sayantan Chakravorty, Celso L. Mendes and Laxmikant V. Kale,  Proactive Fault Tolerance in Large Systems,  Accepted at HPCRI workshop 05
    • 04-07    Chao Huang,  Thesis: System Support for Checkpoint/Restart of Charm++ and AMPI Applications,  Master's Thesis, Dept. of Computer Science, University of Illinois 2004
    • 04-06    Gengbin Zheng, Lixia Shi, Laxmikant V. Kale,  FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI,  Cluster 2004
    • 04-03    Sayantan Chakravorty and Laxmikant V. Kale,  A Fault Tolerant Protocol for Massively Parallel Systems,  published in the proceedings of the FTPDS workshop at IPDPS'04

    This page maintained by Esteban Meneses. Back to the PPL Research Page