Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications
Publication Type: PhD Thesis
An important set of challenges emerges as the High Performance Computing (HPC) community aims to reach extreme scale. Resilience and energy consumption are two of those challenges. Extreme-scale machines are expected to have a high failure frequency, an inevitable consequence of the mismatch between two trends: the number of components assembled in supercomputers grows exponentially, while the reliability of each individual component improves much more slowly. At the same time, the vast number of components in a single machine will consume a non-trivial amount of energy. To keep a supercomputer within operational margins, HPC systems have to be both reliable and energy-aware. For an application to run and make progress in spite of constant interruptions, it has to incorporate some form of fault tolerance. Rollback-recovery techniques provide a framework to overcome crashes in the system by periodically saving the state of the application and rolling back to a saved checkpoint when a failure occurs. Two well-known rollback-recovery techniques are checkpoint/restart and message-logging. The former is easier to implement and has become the de facto standard for making applications fault tolerant. It has, however, a high performance and energy cost during recovery. Message-logging, on the other hand, makes it possible to recover faster from a failure and to consume less energy. Its downside is the overhead it exhibits in the failure-free scenario: memory and performance overheads may offset its advantages. This thesis focuses on techniques to alleviate the downsides of message-logging. It presents a mechanism based on high-level programming language constructs to decrease the performance overhead of message-logging. It also introduces two strategies to reduce the memory overhead created by the message log. Additionally, it addresses important architectural constraints of modern supercomputers.
Based on large-scale experimental results and projections from an analytical model, we conclude that message-logging is a promising strategy for providing fault tolerance at a low energy cost on extreme-scale machines.