Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection
ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP) 2013
Publication Type: Paper
Repository URL: papers/201204_TermDetection
Termination detection signals the completion of a distributed operation, meaning that all processors are idle and no messages are in flight. It underlies many such operations, including work-stealing algorithms, dynamic data exchange, and dynamically structured computations. As supercomputers grow, each job becomes more likely to encounter faults, so systems that rely on termination detection need an algorithm that tolerates them. We provide a trio of new, practical fault tolerance schemes for a standard approach to termination detection. They are easy to implement, present low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities we can calculate for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
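To make the termination condition concrete (all processors idle and no messages in flight), here is a minimal Python sketch of a counter-based check. It is an illustration under stated assumptions, not the paper's protocol: the names `Proc`, `wave`, and `terminated` are hypothetical, each wave is modeled as an instantaneous snapshot, and no fault tolerance is included. It declares termination only when two consecutive waves report identical totals, every process idle, and as many messages received as sent.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    """Hypothetical per-process state: message counters and an idle flag."""
    sent: int = 0
    received: int = 0
    idle: bool = True


def wave(procs):
    """One reduction wave: sum the counters and AND the idle flags."""
    return (
        sum(p.sent for p in procs),
        sum(p.received for p in procs),
        all(p.idle for p in procs),
    )


def terminated(prev, curr):
    """Declare termination only if two consecutive waves agree.

    Requires identical totals in both waves, all processes idle,
    and sent == received (no message still in flight).
    """
    if prev is None:
        return False
    sent, received, idle = curr
    return prev == curr and idle and sent == received
```

For example, if one process sends a message that has not yet been delivered, two matching waves still report `sent != received`, so the check does not fire. Once the message is delivered, the first wave afterward differs from the previous one, and only the next matching wave declares termination.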
Jonathan Lifflander, Phil Miller, Laxmikant V. Kale. "Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection". In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2013 (PPoPP '13).
Research Areas