Parallel Programming Laboratory

Toward Exascale Resilience: 2014 update

Supercomputing Frontiers And Innovations 2014

Publication Type: Paper


Abstract

Resilience is a major roadblock for HPC executions on future exascale systems. These systems will typically gather millions of CPU cores running up to a billion threads. Projections from current large systems and technology evolution predict errors will happen in exascale systems many times per day. These errors will propagate and generate various kinds of malfunctions, from simple process crashes to result corruptions.

The past five years have seen extraordinary technical progress in many domains related to exascale resilience. Several technical options, initially considered inapplicable or unrealistic in the HPC context, have demonstrated surprising successes. Despite this progress, the exascale resilience problem is not solved, and the community is still facing the difficult challenge of ensuring that exascale applications complete and generate correct results while running on unstable systems. Since 2009, many workshops, studies, and reports have improved the definition of the resilience problem and provided refined recommendations. Some projections made during the previous decades and some priorities established from these projections need to be revised. This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.

People

Franck Cappello
Al Geist
William Gropp
Laxmikant Kale
Bill Kramer
Marc Snir
