Design and Analysis of a Message Logging Protocol for Fault Tolerant Multicore Systems
PPL Technical Report 2011
Publication Type: Paper
A look at Exascale reveals a future with multicore supercomputers that will inexorably experience frequent failures. Providing scalable and efficient fault tolerance support is one of the major concerns to pave the road for the next generation of machines. Checkpoint/restart remains as the standard de facto approach to provide fault tolerance in supercomputers. However, its high recovery cost has brought the attention of the community to an alternative mechanism, message logging. In this paper we present the design of a message logging protocol that targets multicore machines based on two fundamental assumptions. First, a multicore node is the minimum unit of failure and very frequently only one node goes down per failure. Second, the shared memory is a key resource to bring down the overhead of message logging. This paper also presents an analysis of failure data from recent supercomputers that show that most of the time a failure involves one single computational node. We offer two different distributions to model the data. Using those distributions we build a model for the survivability of the message logging protocol to multiple concurrent failures. We demonstrate our technique has a low overhead. The results of an experiment with a stencil program show the execution time penalty is below 5% when the program scales up to 1024 cores. Moreover, even when the protocol was designed to tolerate one single failure at a time, it provides a high probability of survival to a failure involving any number of nodes. Using real-world data from recent supercomputers we demonstrate the chances of survive any failure are higher than 99%.