A Fault Tolerance Protocol for Fast Recovery
Publication Type: PhD Thesis
Repository URL: sayantan-thesis
Large machines with tens or even hundreds of thousands of processors are currently in use. As the number of components increases, the mean time between failures will decrease further. Fault tolerance is therefore an important issue for these machines and for the even larger machines of the future, as borne out by the significant amount of work on fault tolerance for parallel computing. However, in all current fault tolerance protocols, the recovery time after a crash is no smaller than the time between the last checkpoint and the crash. This wastes valuable computation time, because all the remaining processors wait for the crashed processors to recover. This thesis presents research aimed at developing a fault tolerance protocol that is relevant to parallel computing and provides fast restarts. We propose to combine the ideas of message logging and object-based virtualization. We leverage two facts: protocols based on message logging do not require all processors to roll back when one processor crashes, and object-based virtualization allows work to be moved from one processor to another. We develop a message logging protocol that operates in conjunction with object-based virtualization. We evaluate and study an implementation of our protocol in the Charm++/AMPI run-time. We use benchmarks and real-world applications to investigate and improve the performance of different aspects of our protocol. We also modify the load balancing framework of the Charm++ run-time to work with the message logging protocol. We show that, in the presence of faults, an application using our fault tolerance protocol takes less time to complete than the same application using a traditional checkpoint-based protocol.
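The recovery-time argument in the abstract can be illustrated with a toy cost model. This is a sketch under stated assumptions and is not taken from the thesis: the function names are hypothetical, every object on the crashed processor is assumed to carry an equal share of the lost work, and the costs of logging and replaying messages are ignored. It compares a checkpoint-based restart, where the lost work is redone at normal speed, with a restart in which the crashed processor's objects are spread over several helper processors and re-executed in parallel.

```python
def checkpoint_restart_recovery(work_since_checkpoint: float) -> float:
    """Recovery time when the lost work is simply redone at normal speed.

    This matches the abstract's lower bound: recovery takes at least as long
    as the time between the last checkpoint and the crash.
    """
    return work_since_checkpoint


def parallel_recovery(work_since_checkpoint: float,
                      objects_on_crashed: int,
                      helpers: int) -> float:
    """Recovery time when the crashed processor's objects are redistributed.

    Assumptions (hypothetical, for illustration only): each object holds an
    equal share of the lost work, and replay overhead is zero.
    """
    if objects_on_crashed < 1 or helpers < 1:
        raise ValueError("need at least one object and one helper")
    work_per_object = work_since_checkpoint / objects_on_crashed
    # Ceiling division: the most heavily loaded helper determines the time.
    max_objects_per_helper = -(-objects_on_crashed // helpers)
    return work_per_object * max_objects_per_helper


if __name__ == "__main__":
    lost = 60.0  # seconds of work since the last checkpoint
    print(checkpoint_restart_recovery(lost))
    print(parallel_recovery(lost, objects_on_crashed=8, helpers=4))
```

Under these assumptions, the benefit is limited by the number of objects per processor: with a single helper the model reduces to the checkpoint-based case, and adding helpers beyond the number of objects gives no further reduction.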
Sayantan Chakravorty, A Fault Tolerance Protocol for Fast Recovery, Dept. of Computer Science, University of Illinois. 2008.