Subsections

1 Introduction

This manual describes Adaptive MPI (AMPI), which is an implementation of a significant subset1 of MPI-1.1 Standard over CHARM++. CHARM++ is a C++-based parallel programming library being developed by Prof. L. V. Kalé and his students back from 1992 until now at University of Illinois.

We first describe our philosophy behind this work (why we do what we do). Later we give a brief introduction to CHARM++ and rationale for AMPI (tools of the trade). We then describe AMPI in detail. Finally we summarize the changes required for original MPI codes to get them working with AMPI (current state of our work). Appendices contain the gory details of installing AMPI, building and running AMPI programs.


1.1 Overview

Developing parallel Computational Science and Engineering (CSE) applications is a complex task. One has to implement the right physics, develop or choose and code appropriate numerical methods, decide and implement the proper input and output data formats, perform visualizations, and be concerned with correctness and efficiency of the programs. It becomes even more complex for multi-physics coupled simulations such as the solid propellant rocket simulation application. In addition, many applications are dynamic and adaptively refined so load imbalance is a major challenge. Our philosophy is to lessen the burden of the application developers by providing advanced programming paradigms and versatile runtime systems that can handle many common programming and performance concerns automatically and let the application programmers focus on the actual application content.

Many of these concerns can be addressed using processor virtualization and over-decomposition philosophy of CHARM++. Thus, the developer only sees virtual processors and lets the runtime system deal with underlying physical processors. This is implemented in AMPI by mapping MPI ranks to CHARM++ user-level threads as illustrated in Figure1. As an immediate and simple benefit, the programmer can use as many virtual processors ("MPI ranks") as the problem can be easily decomposed to them. For example, suppose the problem domain has $n*2^n$ parts that can be easily distributed but programming for general number of MPI processes is burdensome, then the developer can have $n*2^n$ virtual processors on any number of physical ones using AMPI.

Figure 1: MPI processes are implemented as user-level threads in AMPI
Image virtualization

AMPI's execution model consists of multiple user-level threads per process and, typically, there is one process per physical processor. CHARM++ scheduler coordinates execution of these threads (also called Virtual Processors or VPs) and controls execution as shown in Figure 2. These VPs can also migrate between processors because of load balancing or other reasons. The number of VPs per processor specifies the virtualization ratio (degree of over-decomposition). For example, in Figure 2 virtualization ratio is four (there are four VPs per each processor). Figure 3 show how the problem domain is over-decomposed in AMPI's VPs as opposed to other MPI implementations.

Figure 2: VPs are managed by CHARM++ scheduler
Image ratio

Figure 3: Problem domain is over-decomposed to more VPs
Image prac

Another benefit of virtualization is communication and computation overlap which is automatically achieved without programming effort. Techniques such as software pipelining require significant programming effort to achieve this goal and improve performance. However, one can use AMPI to have more virtual processors than physical processors to overlap communication and computation. Each time a VP is blocked for communication, CHARM++ scheduler picks the next VP among those that are ready to execute. In this manner, while some of the VPs of a physical processor are waiting for a message to arrive, others can continue their execution. Thus, performance will be improved without any change to the source code.

A potential benefit is that of better cache utilization. With over-decomposition, a smaller subdomain is accessed by a VP repeatedly in different function calls before getting blocked by communication and switching to another VP. That smaller subdomain may fit into cache if over-decomposition is enough. This concept is illustrated in Figure 4 where each AMPI subdomain (such as 12) is smaller than corresponding MPI subdomain (such as 3) and may fit into cache memory. Thus, there is a potential performance improvement without changing the source code.

Figure 4: Smaller subdomains may fit into cache and result in better performance
Image cache

One important concern is that of load imbalance. New generation parallel applications are dynamically varying, meaning that processors' load is shifting during execution. In a dynamic simulation application such as rocket simulation, burning solid fuel, sub-scaling for a certain part of the mesh, crack propagation, particle flows all contribute to load imbalance. Centralized load balancing strategy built into an application is impractical since each individual modules are developed almost independently by various developers. In addition, embedding a load balancing strategy in the code complicates it and programming effort increases significantly. Thus, the runtime system support for load balancing becomes even more critical. Figure 5 shows migration of a VP because of load imbalance. For instance, this domain may correspond to a weather forecast model where there is a tornado in top-left side, which requires more computation to simulate. AMPI will then migrate VP 13 to balance the division of work across processors and improve performance. Note that incorporating this sort of load balancing inside the application code may take a lot of effort and complicates the code.

Figure 5: AMPI migrates VPs across processors for load balancing
Image migrate

There are different load balancing strategies built into CHARM++ that can be selected. Among those, some may fit better for an application depending on its characteristics. Moreover, one can write a new load balancer, best suited for an application, by the simple API provided inside CHARM++ infrastructure. Our approach is based on actual measurement of load information at runtime, and on migrating computations from heavily loaded to lightly loaded processors.

For this approach to be effective, we need the computation to be split into pieces many more in number than available processors. This allows us to flexibly map and re-map these computational pieces to available processors. This approach is usually called ``multi-domain decomposition''.

CHARM++, which we use as a runtime system layer for the work described here, simplifies our approach. It embeds an elaborate performance tracing mechanism, a suite of plug-in load balancing strategies, infrastructure for defining and migrating computational load, and is interoperable with other programming paradigms.

1.2 Terminology

Module
A module refers to either a complete program or a library with an orchestrator subroutine2 . An orchestrator subroutine specifies the main control flow of the module by calling various subroutines from the associated library and does not usually have much state associated with it.

Thread
A thread is a lightweight process that owns a stack and machine registers including program counter, but shares code and data with other threads within the same address space. If the underlying operating system recognizes a thread, it is known as kernel thread, otherwise it is known as user-thread. A context-switch between threads refers to suspending one thread's execution and transferring control to another thread. Kernel threads typically have higher context switching costs than user-threads because of operating system overheads. The policy implemented by the underlying system for transferring control between threads is known as thread scheduling policy. Scheduling policy for kernel threads is determined by the operating system, and is often more inflexible than user-threads. Scheduling policy is said to be non-preemptive if a context-switch occurs only when the currently running thread willingly asks to be suspended, otherwise it is said to be preemptive. AMPI threads are non-preemptive user-level threads.

Chunk
A chunk is a combination of a user-level thread and the data it manipulates. When a program is converted from MPI to AMPI, we convert an MPI process into a chunk. This conversion is referred to as chunkification.

Object
An object is just a blob of memory on which certain computations can be performed. The memory is referred to as an object's state, and the set of computations that can be performed on the object is called the interface of the object.

February 12, 2012
AMPI Homepage
Charm Homepage