A Fault Tolerance Protocol with Fast Fault Recovery

PPL Paper Number: 06-12

Authors:
Sayantan Chakravorty, Laxmikant V. Kale
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Proceedings of the 21st International Parallel and Distributed Processing Symposium, 2007, Long Beach, California


Abstract

Large machines with tens or even hundreds of thousands of processors are currently in use. Fault tolerance is an important issue for these machines and for the even larger machines of the future. Checkpoint-based methods, currently used on most machines, roll back all processors to their previous checkpoints after a crash. This wastes a significant amount of computation, as all processors have to redo all the computation from that checkpoint onwards. In addition, recovery time in checkpoint-based fault tolerance protocols is bounded by the time between the last checkpoint and the crash. Protocols based on message logging avoid the problem of rolling back all processors to an earlier state. However, the recovery time of existing message logging protocols is no smaller than the time between the last checkpoint and the crash. In this paper, we present a fault tolerance protocol that provides fast restarts by using the ideas of message logging and processor virtualization. We evaluate our implementation of the protocol in the Charm++/Adaptive MPI runtime system. We show that our protocol not only provides fast restarts but also has low fault-free overhead for many applications.

