A Fault Tolerance Protocol for Fast Recovery
Thesis 2008
Publication Type: PhD Thesis
Repository URL: sayantan-thesis
Abstract
Large machines with tens or even hundreds of thousands of
processors are currently in use. As the number of components
increases, the mean time between failures will decrease further.
Fault tolerance is an important issue for these and the even larger
machines of the future. This is borne out by the significant amount
of work in the field of fault tolerance for parallel computing.
However, the recovery time after a crash in all current fault tolerance
protocols is no smaller than the time between the last checkpoint
and the crash. This wastes valuable computation time as all the
remaining processors wait for the crashed processors to recover.
This thesis presents research aimed at developing a fault-tolerant
protocol that is relevant in the context of parallel computing and
provides fast restarts. We propose to combine the ideas of message
logging and object-based virtualization. We leverage the facts that
message-logging protocols do not require all processors to
roll back when one processor crashes and that object-based
virtualization allows work to be moved from one processor to
another. We develop a message logging protocol that operates in
conjunction with object-based virtualization. We evaluate and study
the implementation of our protocol in the Charm++/AMPI run-time. We
use benchmarks and real-world applications to investigate and
improve the performance of different aspects of our protocol. We
also modify the load balancing framework of the Charm++ run-time to
work with the message logging protocol. We show that in the
presence of faults, an application using our fault tolerance
protocol takes less time to complete than one using a traditional
checkpoint-based protocol.
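
The property the abstract relies on, that message logging lets only the crashed processor roll back, can be illustrated with a minimal sketch. This is a toy model and not the protocol from the thesis or the Charm++/AMPI implementation: the class names Accumulator and LoggingSender and the function run_demo are hypothetical, there is a single sender, and message ordering and nondeterminism are ignored.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Accumulator:
    """Toy receiver object (hypothetical); its whole state is a running total."""
    total: int = 0

    def handle(self, value: int) -> None:
        self.total += value


@dataclass
class LoggingSender:
    """Toy sender (hypothetical) that logs messages sent since the receiver's last checkpoint."""
    log: list = field(default_factory=list)

    def send(self, receiver: Accumulator, value: int) -> None:
        self.log.append(value)   # log first, then deliver
        receiver.handle(value)

    def on_receiver_checkpoint(self) -> None:
        self.log.clear()         # messages before the checkpoint are no longer needed

    def replay(self, receiver: Accumulator) -> None:
        for value in self.log:   # re-deliver in the original send order
            receiver.handle(value)


def run_demo() -> tuple:
    sender = LoggingSender()
    receiver = Accumulator()

    sender.send(receiver, 1)
    sender.send(receiver, 2)
    checkpoint = copy.deepcopy(receiver)   # receiver checkpoints its state
    sender.on_receiver_checkpoint()

    sender.send(receiver, 10)
    sender.send(receiver, 20)
    state_before_crash = receiver.total

    # Crash: the receiver's in-memory state is lost. Only the receiver is
    # restored from its checkpoint; the sender does not roll back and simply
    # replays the messages it logged.
    recovered = copy.deepcopy(checkpoint)
    sender.replay(recovered)
    return state_before_crash, recovered.total
```

Calling run_demo() returns the receiver's total just before the crash and its total after recovery; the two match because the replayed log covers exactly the messages sent after the checkpoint. In this sketch the restored object could just as well be placed on a different processor before replay, which is the kind of work movement that object-based virtualization permits.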
TextRef
Sayantan Chakravorty, A Fault Tolerance Protocol for Fast Recovery, Dept. of Computer Science, University of Illinois. 2008.