Why Charm++ for Exascale
Challenges Facing Parallel Application Development
Exascale computing is on the horizon. Yet, it is widely expected that programming exascale machines will present formidable challenges. Many of the adaptive strategies developed at PPL can facilitate this transition. To see how, let us first examine the challenges facing exascale software.
- Power is probably the biggest constraint on an exascale system. How can we rein in the power consumed by the individual components as well as the power required for cooling the whole system?
- Component failures will be frequent, with the mean time between failures measured in hours (or even minutes!). Can software help applications tolerate these faults?
- Data movement will become the most expensive operation in the coming years, in terms of both execution time and power. How will exascale software hide or mitigate the impact of communication?
- Heterogeneity and dynamic changes to the execution environment will become the norm on future machines. How can software stacks maximize compute resource utilization in the face of such diversity in component capabilities?
- Strong scaling will become the norm at exascale. Problems of interest in many domains don’t get larger, but need to be solved faster. In addition, the total memory in a system is expected to grow much slower than compute power. How can we ease the way for applications to transition into this regime?
- Generating parallelism for strong scaling will be a challenge. Exascale compute resources will offer billion-way parallelism. Plain data-decomposition techniques will fall short of providing sufficient parallelism. What fundamental programming paradigm shift(s) are needed to express adequate parallelism?
- Load imbalances in applications running on extreme machines will have numerous origins. There will be persistent as well as transient load imbalances. As an example, adaptive refinements to increase simulation fidelity in regions of interest are already a necessity for many applications. How will we maintain high compute throughput in the face of such imbalances?
- Many applications will incorporate multiple modules due to the multi-physics nature of simulations and the need to use existing libraries. Thus there will be a need to cast expensive parallel software into modular, parallel libraries without the seams between the modules translating into performance penalties.
Charm++ and its Runtime System
A collection of research activities and techniques developed in PPL at Illinois directly address many of these challenges. These techniques are centered on the core idea of intelligent and adaptive runtime systems and are embodied in production quality software systems—Charm++, Adaptive MPI, and their associated tools and libraries.
The essential innovation behind this approach is the idea that computation—its data-units as well as work-units—should be decomposed by the programmer (or compiler) into a large number of logical units.
These units are then assigned to and scheduled on physical processor cores by an adaptive runtime system. This simple idea empowers the runtime system to help solve many of the challenges listed above, while taking away programming complexity and tedium from the users. How?
Expresses More Parallelism
Using units of work and data that are native to the problem allows the programmer to express much more parallelism than simply the number of processors available for execution. This approach also allows data parallelism and task parallelism to be expressed intuitively, in a unified manner. Expressing all the parallelism available in the algorithm is central to strong scaling. We have demonstrated this in the context of several production applications, and have a Gordon Bell award to show for it.
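As a concrete illustration, here is a minimal sketch of overdecomposition in Charm++ (all names are illustrative, and the actual work is elided): a 1D array of chares with many more elements than processors, which the runtime system maps onto the available cores and schedules as their messages arrive.

```cpp
// block.ci -- Charm++ interface file (illustrative names throughout)
//
// mainmodule block {
//   readonly CProxy_Main mainProxy;
//   mainchare Main {
//     entry Main(CkArgMsg* m);
//     entry void done();
//   };
//   array [1D] Block {
//     entry Block();
//     entry void doWork();
//   };
// };

// block.cpp
#include "block.decl.h"

CProxy_Main mainProxy;   // readonly: set once in Main, visible everywhere

class Main : public CBase_Main {
  int remaining;
public:
  Main(CkArgMsg* m) {
    delete m;
    int numChares = 8 * CkNumPes();   // overdecompose: 8 chares per core
    remaining = numChares;
    mainProxy = thisProxy;
    CProxy_Block blocks = CProxy_Block::ckNew(numChares);
    blocks.doWork();                   // broadcast to every element
  }
  void done() { if (--remaining == 0) CkExit(); }
};

class Block : public CBase_Block {
public:
  Block() {}
  Block(CkMigrateMessage* m) {}
  void doWork() {
    // ... compute on this chare's piece of the data ...
    mainProxy.done();
  }
};

#include "block.def.h"
```

This builds with the charmc wrapper (`charmc block.ci`, `charmc -c block.cpp`, `charmc -o block block.o`) and runs with, e.g., `./charmrun +p4 ./block`. The 8-chares-per-core ratio is an arbitrary choice for the sketch; the point is that the decomposition is chosen by the problem, not by the processor count.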
Hides Latencies
By using a large number of logical work units, one gets, for free, adaptive overlap of communication and computation. While one work unit is waiting for data, another may be ready to proceed, and the runtime scheduler will automatically pick the ready unit. Furthermore, the communication is more evenly spread within individual iterations, and thus does not require an over-engineered, expensive, and power-hungry interconnection network.
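To make this concrete, here is a fragment of a 1D stencil chare in the same style as the sketch above (it assumes a .ci file declaring the Stencil array with these entry methods, plus a readonly variable numChares; all names are illustrative). The key point is that there is no blocking receive anywhere: each method runs to completion when its data arrives, so the scheduler overlaps one chare's wait with another chare's work.

```cpp
// Fragment of a 1D stencil chare (illustrative; assumes the corresponding
// .ci declarations for Stencil, exchange(), and recvGhost()).
extern int numChares;          // readonly, set by the main chare
enum { LEFT = 0, RIGHT = 1 };

class Stencil : public CBase_Stencil {
  std::vector<double> data;    // this chare's slice of the domain
  double ghost[2];             // boundary values from the two neighbors
  int ghostsArrived = 0;
public:
  Stencil() {}
  Stencil(CkMigrateMessage* m) {}

  void exchange() {
    int left  = (thisIndex - 1 + numChares) % numChares;
    int right = (thisIndex + 1) % numChares;
    thisProxy[left].recvGhost(RIGHT, data.front());
    thisProxy[right].recvGhost(LEFT, data.back());
    // No blocking receive: this method returns immediately, so the
    // scheduler is free to run any other chare whose data has arrived.
  }
  void recvGhost(int dir, double val) {
    ghost[dir] = val;
    if (++ghostsArrived == 2) {  // both neighbors have reported in
      ghostsArrived = 0;
      computeInterior();         // local sequential work
      exchange();                // start the next iteration
    }
  }
  void computeInterior() { /* ... update data using ghost values ... */ }
};
```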
Automatically Balances Load
With this approach, automatically dealing with load imbalances becomes feasible. As the application evolves, either because of adaptive refinements or due to slow changes (e.g., atoms moving across cell boundaries in an MD simulation), some processors get substantially more work than the average, and the whole computation slows down. The same thing happens when one of the processors itself slows down, say because its cache memory is transiently producing a larger number of (correctable) errors. Such imbalances can have a huge (and potentially unbounded) impact on overall performance. Exascale exacerbates the problem in several ways. Strong scaling implies less work per processor, so a small increase in load on an individual processor causes a proportionally larger imbalance. Further, with a very large number of cores, the chance that at least one of them is substantially overloaded is higher.

So what can be done? The adaptive runtime system comes to the rescue. Since it schedules the work-units, it can measure the computation cost of each one; and since it delivers the messages among them, it knows which objects communicate with which others. It can therefore monitor load imbalances and, when necessary, migrate objects (work- or data-units) across processors to restore balance.
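In Charm++, an application opts into this measurement-based load balancing with only a few lines. The sketch below extends the earlier Block example (again, names are illustrative, and it assumes iterate() is declared as an entry method in the .ci file and that pup_stl.h is included for PUPing std::vector): the constructor opts in with usesAtSync, each iteration ends at AtSync(), and a PUP routine tells the runtime how to serialize the object's state so it can be migrated.

```cpp
// Extends the earlier Block sketch to be migratable and load-balanced.
class Block : public CBase_Block {
  std::vector<double> data;           // this chare's state
public:
  Block() { usesAtSync = true; }      // opt in to sync-based balancing
  Block(CkMigrateMessage* m) {}

  void iterate() {
    // ... one timestep of work on this block ...
    AtSync();                         // yield to the load balancer
  }
  void ResumeFromSync() {             // invoked after (possible) migration
    thisProxy[thisIndex].iterate();   // continue, wherever we now live
  }

  // PUP routine: how to pack/unpack this object's state, so the runtime
  // can migrate (or checkpoint) it transparently.
  void pup(PUP::er& p) {
    CBase_Block::pup(p);
    p | data;
  }
};
```

A balancing strategy is then selected at job launch, e.g. `./charmrun +p8 ./block +balancer GreedyLB`.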
Automatically Handles Heterogeneous Systems
We have demonstrated some really cool techniques to remove the pain from programming heterogeneous systems. The application programmer simply annotates, via additional keywords, the work-units that are suitable for acceleration. The Charm++ framework can then generate multiple versions of the kernel representing the work-unit, making it seamless to target multiple architectures. Our runtime system also takes care of launching the kernels and overlapping data transfers with computation. It can agglomerate work-units where needed, tighten synchronization to achieve higher throughput, and so on. The runtime can also employ its observation-based load balancing to automatically achieve high utilization across multiple heterogeneous cores within a single execution. Thus, many of the big challenges facing today's accelerator programmers are defused by our runtime system.
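The exact annotation syntax is beyond the scope of this article, but the underlying idea can be sketched generically. The following is a hypothetical illustration (it is not the Charm++ accelerator API): an annotated work-unit yields two generated kernels, and a scheduler dispatches each invocation to the accelerator while it has capacity, falling back to the CPU otherwise.

```cpp
#include <atomic>
#include <functional>

// Hypothetical sketch (NOT the actual Charm++ accelerator API).
struct WorkUnit {
  std::function<void()> hostKernel;    // stands in for the CPU version
  std::function<void()> deviceKernel;  // stands in for the GPU version
};

class Scheduler {
  std::atomic<int> deviceSlots;        // pretend device queue capacity
public:
  explicit Scheduler(int slots) : deviceSlots(slots) {}
  void dispatch(WorkUnit& wu) {
    if (deviceSlots.fetch_sub(1) > 0) {
      wu.deviceKernel();               // offload (data transfer elided)
      deviceSlots.fetch_add(1);
    } else {
      deviceSlots.fetch_add(1);        // undo the over-decrement
      wu.hostKernel();                 // keep the CPU busy instead
    }
  }
};

int main() {
  Scheduler sched(2);
  WorkUnit wu{ []{ /* CPU kernel */ }, []{ /* GPU kernel */ } };
  for (int i = 0; i < 8; ++i) sched.dispatch(wu);
}
```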
Is Green
Energy is consumed in significant amounts when a floating-point operation is performed, or when data moves between units. As the heat generated is dissipated, the core temperature rises, and the core must be kept cool by a power-hungry cooling system. An adaptive runtime system can help here. By monitoring core temperatures individually, it can reduce a core's frequency when that core starts to heat up. This does, of course, create load imbalances, but, as we saw above, the runtime system can handle those by migrating work away from the slower cores. With this adaptive control of temperature, hot spots are eliminated, and the machine room can be run at a much warmer temperature, saving cooling energy. Further, by noticing that different pieces of code have different intensities of floating-point operations, memory accesses, and cache misses, the runtime can change clock frequencies selectively before executing code-blocks. Since dynamic power grows non-linearly with frequency (roughly as frequency times the square of voltage, and voltage must rise with frequency), this selective control of frequencies can reduce power consumption substantially.
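On Linux, the mechanism such a runtime needs is readily available through sysfs. Below is a minimal sketch of the idea; the sysfs paths are the standard kernel interfaces, but the threshold and frequencies are made-up policy values, and a real runtime would poll every core continuously rather than once.

```cpp
#include <fstream>
#include <string>

// Read a thermal zone's temperature in millidegrees Celsius.
long readMilliCelsius(int zone) {
  std::ifstream f("/sys/class/thermal/thermal_zone" +
                  std::to_string(zone) + "/temp");
  long t = 0;
  f >> t;
  return t;
}

// Cap a core's clock frequency (in kHz). Needs root privileges.
void capFrequencyKHz(int cpu, long khz) {
  std::ofstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                  "/cpufreq/scaling_max_freq");
  f << khz;
}

int main() {
  const long HOT = 70000;               // 70 C in millidegrees (made up)
  if (readMilliCelsius(0) > HOT)
    capFrequencyKHz(0, 1200000);        // throttle the hot core to 1.2 GHz
  else
    capFrequencyKHz(0, 2400000);        // restore the nominal 2.4 GHz
}
```

The resulting slowdown of the throttled core is exactly the kind of imbalance the load balancer described above already corrects.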
Is Resilient
If the runtime system notices (via sensors, cache monitors, etc.) that one of the nodes is about to fail, it can migrate work/data units away from that node and remove it proactively from the computation. To deal with a node that fails without warning, the runtime system can also deploy a protocol we have developed that logs copies of messages at their senders and saves checkpoints of objects at appropriate times. When a node fails unexpectedly, the objects it hosted are restored on other nodes from their most recent checkpoints. The restored objects then fast-forward to the current time by having their logged messages "played back" to them. The restart is quick, because the failed node's work is re-executed on multiple nodes in parallel. This means the computation can make progress even when the mean time between component failures (MTBF) drops below the checkpoint period! Further, we save energy because the remaining processors do not have to roll their state back to an older checkpoint.
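At the API level, taking a checkpoint in Charm++ is a single call, and the object state it saves is exactly what the PUP routines shown earlier describe. Here is a hedged fragment, assuming a chare Main that drives an iterative computation; iter, CHECKPOINT_PERIOD, and resume() are illustrative names, while CkStartMemCheckpoint is the Charm++ call for a double in-memory checkpoint of the kind the fast-restart scheme builds on.

```cpp
// Fragment (illustrative; assumes Main and its entry method resume()
// are declared in the .ci file).
void Main::nextIteration() {
  if (iter % CHECKPOINT_PERIOD == 0) {
    // Double in-memory checkpoint: every object's state (captured by its
    // PUP routine) is stored on two nodes. If a node dies, its objects
    // are rebuilt from the surviving copies, then fast-forwarded by
    // replaying the logged messages.
    CkCallback cb(CkIndex_Main::resume(), thisProxy);
    CkStartMemCheckpoint(cb);
  } else {
    resume();
  }
}
```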
Allows Modularity with High Performance
Since parallel programming is generally more difficult than sequential programming, it costs more to produce parallel software. This makes it all the more important to enable reuse of parallel software modules. Yet it is more challenging to make truly reusable, composable, parallel modules. Composability is not just a programming-level consideration: in a performance-sensitive context, libraries must compose well at runtime too. Most current methods require one to either sequence the execution of the modules one after the other, or assign disjoint sets of processors to each, or some combination of the two. This misses the opportunity for true resource pooling; in particular, idle time in one module cannot be used for computations in the other, at least not without complex, modularity-breaking hacks. Message-driven scheduling solves this problem: work-units from different modules can easily (and automatically) interleave their execution based on the availability of data.
Enhances Productivity
The same premium on parallel software also increases the utility of composable libraries for commonly needed tasks, and frameworks for commonly needed data structures. Because of the modularity and compositional properties mentioned above, Charm++ is well suited to be a substrate for such components. Indeed, it already supports several such components.
To simplify parallel programming itself, we are also engaged in developing new parallel programming abstractions (or "languages"). These languages are useful not just for exascale machines, but also for the parallel desktops and "small" clusters that are becoming ubiquitous. Each language is designed to elegantly express a specific pattern of interaction that captures a significant fraction of parallel algorithms. Although no such language is "complete" by itself, each simplifies programming for a significant subset of needs. Coupled with one another and backed by complete paradigms such as Charm++ and AMPI, these languages can be used to program any general-purpose application effectively. We have developed two such specialized languages:
- Charisma, which elegantly expresses parallel computations where data flow (communication and sequencing) among parallel entities is unchanging (static). Here, the content and size of “messages” may change, but the message-exchange pattern is unchanged.
- MSA (Multiphase Shared Arrays) expresses a disciplined form of shared-address space that can be efficiently implemented on distributed memory machines.
The adaptive runtime system in Charm++ also helps with scalable performance analysis, interactive debugging, and on-line visualization. These capabilities are specifically enabled by message-driven execution. Similarly, one can leverage the overdecomposition to predict performance on future exascale machines using runs on existing machines, via the BigSim framework. But these topics deserve a separate article of their own.
How can I use it?
Your application's modules have to be written using one of several supported parallel libraries/notations. You can use Adaptive MPI (AMPI), which is just plain MPI (with C, C++, or Fortran) with a few stylistic restrictions on how global variables are used. We provide several methods (fully or partially automated) to make existing MPI modules conform to AMPI's requirements. You can also develop new modules in Charm++ or one of its affiliated higher-level notations and frameworks.
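For instance, an ordinary MPI program such as the one below runs under AMPI essentially unchanged, as long as it avoids mutable global variables (or has them privatized by our tools). Compiled with AMPI's wrappers, it can then be run with more virtual ranks than physical cores, giving the runtime the freedom to migrate and balance them.

```cpp
// hello.cpp -- plain MPI code; also valid AMPI code, since it keeps no
// mutable globals. (The rank/size printout is just for illustration.)
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::printf("virtual rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}
```

Built with, e.g., `ampicxx hello.cpp -o hello`, it can be launched as `./charmrun +p4 ./hello +vp16`, i.e., 16 virtual MPI ranks multiplexed over 4 cores.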
Do I have to learn a new programming language?
Charm++ programs are just plain C++ programs that call Charm++ libraries and inherit from system-provided classes.
Is this a mature platform?
You bet! Charm++ has existed in its current form for about a decade, and its principles have evolved over 20 years of research. A diverse collection of production applications runs atop Charm++ every day on supercomputers on five continents! The simulations cover disciplines ranging from cosmology and computational molecular biology to materials design. We have had major successes with applications whose developers are willing to adapt in the quest for more productivity and performance.
Do I have to drop all my existing code?
No, you can mix traditional MPI programs with AMPI/Charm++/Charisma/MSA modules. Of course, the benefits above will be available only to the new or converted modules, but you can incrementally convert or develop a parallel application by mixing modules in this fashion. You can also use all your sequential code as it is.
Our group will continue to develop newer techniques and strategies that are motivated by application needs and new machines. Once tested, these techniques will also find their way into the Charm++ runtime system. However, its application interface will remain unchanged! So, start developing with Charm++ now.
Are these benefits a free lunch?
No. By asking the application experts (you) to specify the computation in its natural work and data units, we are actually setting the stage for our runtime system to deliver these benefits.
And, of course, it is not a free lunch for us, the developers of Charm++. We need to develop new runtime strategies that match the complexities of exascale systems: load balancers of various hues, scalable collective communication libraries, interfaces for the newer interconnects, accelerator support for GPGPUs, MICs, and others, new language abstractions, and techniques for data exchange across distributed modules. Even for us (and those of you who want to contribute or collaborate), though, the existing Charm++ infrastructure is useful, because various capabilities can be provided as standardized "plug-ins".