Parallel Programming Laboratory

Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection

Jonathan Lifflander | Phil Miller | Laxmikant Kale

ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP) 2013

Publication Type: Paper

Repository URL: papers/201204_TermDetection


Abstract

Termination detection signals the completion of a distributed operation, meaning that all processors are idle and no messages are in flight. It is relevant to many such operations, including work-stealing algorithms, dynamic data exchange, and dynamically structured computations. As supercomputers grow, each job becomes more likely to encounter faults, so systems that rely on termination detection need an algorithm that tolerates them. We provide a trio of new practical fault tolerance schemes for a standard approach to termination detection. They are easy to implement, present low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities we can calculate for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
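
The termination condition stated in the abstract (all processors idle, no messages in flight) can be illustrated with a small counter-based sketch. This is not the paper's algorithm or its fault tolerance schemes. It assumes a generic scheme in which each process counts messages sent and received, a wave sums those counters, and termination is declared only when two consecutive waves agree. The names `Proc`, `wave`, and `terminated` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    """Hypothetical per-process state: message counters and an idle flag."""
    sent: int = 0
    received: int = 0
    idle: bool = True


def wave(procs):
    """One reduction wave: sum the counters and AND the idle flags."""
    return (
        sum(p.sent for p in procs),
        sum(p.received for p in procs),
        all(p.idle for p in procs),
    )


def terminated(previous, current):
    """Declare termination only if two consecutive waves agree.

    Requiring a repeated, balanced result guards against a message that
    was still in flight while the first wave was being collected.
    """
    sent, received, all_idle = current
    return all_idle and sent == received and previous == current


# Usage: one message is in flight during the first wave.
procs = [Proc() for _ in range(4)]
procs[0].sent += 1                # process 0 sends a message
first = wave(procs)               # sent != received: not yet terminated
procs[1].received += 1            # the message is delivered
second = wave(procs)              # balanced, but differs from `first`
third = wave(procs)               # unchanged from `second`
```

In this sketch `terminated(first, second)` is false because the totals changed between waves, while `terminated(second, third)` is true.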

TextRef

Jonathan Lifflander, Phil Miller, Laxmikant V. Kale. "Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection". In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13), 2013.


Research Areas

Fault Tolerance Support
