As modern supercomputers used for scientific computing applications continue to grow in size and
complexity, the probability of system failure also rises. Making parallel computing fault tolerant
has become an increasingly important issue. Charm++ provides multiple schemes that support
fault tolerance with minimal effort from the application developer. Charm++ offers newer recovery
schemes that require neither centralized file storage nor a manual restart, and that can be
dramatically faster than the traditional approach. One scheme relies on message logging in
addition to checkpoints: when a processor fails, only the crashed processor is rolled back to its
previous checkpoint. This protocol can also spread the objects from the failed processor among the
remaining functioning processors, leading to much faster restarts. By providing sophisticated fault
tolerance schemes to applications without the need for the programmer to change the structure of
their application, we can ease the task of application development significantly.
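
The message-logging idea described above can be illustrated with a small toy simulation. This is only a sketch under simplifying assumptions, not Charm++ code and not the Charm++ API: the names `Processor`, `Cluster`, and `crash_and_recover` are hypothetical, each processor's state is reduced to a running total, and the message log is assumed to live somewhere that survives the receiver's crash. It does not model spreading the failed processor's objects across the survivors.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    """Toy processor (hypothetical): its whole state is a running total."""
    total: int = 0
    delivered: int = 0          # number of messages processed so far
    checkpoint: tuple = (0, 0)  # (total, delivered) at the last checkpoint

    def deliver(self, value):
        self.total += value
        self.delivered += 1

    def take_checkpoint(self):
        self.checkpoint = (self.total, self.delivered)


class Cluster:
    def __init__(self, n):
        self.procs = [Processor() for _ in range(n)]
        # Assumption: the log is stored outside the receiver,
        # so it is still available after the receiver crashes.
        self.log = {p: [] for p in range(n)}

    def send(self, dest, value):
        self.log[dest].append(value)     # log first, then deliver
        self.procs[dest].deliver(value)

    def checkpoint_all(self):
        for proc in self.procs:
            proc.take_checkpoint()

    def crash_and_recover(self, failed):
        total, delivered = self.procs[failed].checkpoint
        # Only the failed processor is rolled back to its checkpoint.
        recovered = Processor(total, delivered, (total, delivered))
        # Replay the messages logged after that checkpoint, in order.
        for value in self.log[failed][delivered:]:
            recovered.deliver(value)
        self.procs[failed] = recovered


cluster = Cluster(3)
cluster.send(0, 5)
cluster.send(1, 7)
cluster.checkpoint_all()
cluster.send(1, 2)
cluster.send(2, 4)
before = [p.total for p in cluster.procs]
cluster.crash_and_recover(1)
after = [p.total for p in cluster.procs]
print(before, after)
```

In this sketch, processor 1 is rebuilt from its checkpoint plus one replayed message, while processors 0 and 2 are never touched, so the totals before and after the simulated crash are identical.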
Investigator: