Center for Petascale Computing  
A collaboration led by Laxmikant Kalé (Computer Science) and Duane Johnson (Materials Science and Engineering) on a research theme within IACAT

Fault Tolerance

As modern supercomputers used for scientific computing continue to grow in size and complexity, the probability of system failure also rises, and making parallel computing fault tolerant has become increasingly important. Charm++ provides multiple schemes that support fault tolerance with minimal effort from the application developer. Recovery in Charm++ does not require centralized file storage or a manual restart, and can be dramatically faster than traditional disk-based checkpoint and restart. One scheme relies on message logging in addition to checkpoints. When a processor fails, only the crashed processor is rolled back to its previous checkpoint, while the other processors keep their progress. This protocol can spread the objects from the failed processor among the remaining functioning processors, leading to much faster restarts. By providing sophisticated fault tolerance schemes that do not require programmers to change the structure of their applications, we can ease the task of application development significantly.
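The message-logging idea described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not Charm++ code and not the Charm++ API: the names `ToySystem`, `Obj`, `send`, `checkpoint`, and `fail` are hypothetical, an object's state is a single running total, and the message log is kept in one shared list rather than in the memory of each sender.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """Hypothetical migratable work unit; its state is a running total."""
    name: str
    home: int       # processor currently hosting the object
    total: int = 0


@dataclass
class ToySystem:
    """Toy model of checkpointing plus message logging (not Charm++)."""
    objects: dict = field(default_factory=dict)      # name -> Obj
    checkpoints: dict = field(default_factory=dict)  # name -> saved total
    log: list = field(default_factory=list)          # (target, value) since last checkpoint

    def add(self, name, home):
        self.objects[name] = Obj(name, home)

    def checkpoint(self):
        # Save every object's state; older log entries are no longer needed.
        self.checkpoints = {n: o.total for n, o in self.objects.items()}
        self.log.clear()

    def send(self, target, value):
        # Log the message before delivering it so it can be replayed later.
        self.log.append((target, value))
        self.objects[target].total += value

    def fail(self, proc, survivors):
        # Only objects hosted on the failed processor are rolled back.
        lost = sorted(n for n, o in self.objects.items() if o.home == proc)
        for i, name in enumerate(lost):
            obj = self.objects[name]
            obj.total = self.checkpoints.get(name, 0)  # restore saved state
            obj.home = survivors[i % len(survivors)]   # spread over survivors
        # Replay, in original order, the logged messages for the lost objects.
        for target, value in self.log:
            if target in lost:
                self.objects[target].total += value
        return lost


def demo():
    system = ToySystem()
    for name, home in [("a", 0), ("b", 1), ("c", 1), ("d", 2)]:
        system.add(name, home)
    system.send("a", 1)
    system.send("b", 2)
    system.checkpoint()
    system.send("b", 10)
    system.send("c", 5)
    system.send("d", 7)
    before = {n: o.total for n, o in system.objects.items()}
    lost = system.fail(1, survivors=[0, 2])
    after = {n: o.total for n, o in system.objects.items()}
    homes = {n: o.home for n, o in system.objects.items()}
    return before, lost, after, homes
```

In `demo()`, processor 1 fails after the checkpoint. Objects `b` and `c` are restored from their saved state, moved to processors 0 and 2, and brought up to date by replaying their logged messages, while `a` and `d` are never rolled back. Distributing the lost objects over several survivors is what lets the replay work proceed in parallel in a real system.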
 

Investigator:

Fault Tolerance Information