Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection
ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP) 2013
Publication Type: Paper
Repository URL: papers/201204_TermDetection
Abstract
Termination detection signals the completion of a distributed
operation: all processors are idle and no messages are in flight. Many
such operations depend on it, including work-stealing algorithms,
dynamic data exchange, and dynamically structured computations. As
supercomputers grow and each job becomes more likely to encounter
faults, systems that rely on termination detection need an algorithm
that can itself tolerate those faults. We provide a trio of new practical fault
tolerance schemes for a standard approach to termination detection
that are easy to implement, incur low overhead in both theory and
practice, and have scalable costs when recovering from faults. These
schemes tolerate all single-process faults, and are
probabilistically tolerant of faults affecting multiple
processes. We combine the theoretical failure probabilities we can
calculate for each algorithm with historical fault records from real
machines to show that these algorithms have excellent overall
survivability.
TextRef
Jonathan Lifflander, Phil Miller, Laxmikant V. Kale. "Adoption Protocols for Fanout-Optimal Fault-Tolerant Termination Detection". In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13), 2013.