Subsections

1 Introduction

This manual describes Adaptive MPI (AMPI), which is an implementation of a significant subset1 of MPI-1.1 Standard over CHARM++. CHARM++ is a C++-based parallel programming library developed by Prof. L. V. Kalé and his students over the last 10 years at University of Illinois.

We first describe our philosophy behind this work (why we do what we do). Later we give a brief introduction to CHARM++ and rationale for AMPI (tools of the trade). We describe AMPI in detail. Finally we summarize the changes required for original MPI codes to get them working with AMPI (current state of our work). Appendices contain the gory details of installing AMPI, building and running AMPI programs.

1.1 Our Philosophy

Developing parallel Computational Science and Engineering (CSE) applications is a complex task. One has to implement the right physics, develop or choose and code appropriate numerical methods, decide and implement the proper input and output data formats, perform visualizations, and be concerned with correctness and efficiency of the programs. It becomes even more complex for multi-physics coupled simulations such as the solid propellant rocket simulation application. Our philosophy is to lessen the burden of the application developers by providing advanced programming paradigms and versatile runtime systems that can handle many common performance concerns automatically and let the application programmers focus on the actual application content.

One such concern is that of load imbalance. In a dynamic simulation application such as rocket simulation, burning solid fuel, sub-scaling for a certain part of the mesh, crack propagation, particle flows all contribute to load imbalance. Centralized load balancing strategy built into an application is impractical since each individual modules are developed almost independently by various developers. Thus, the runtime system support for load balancing becomes even more critical.

Automatic load balancing is infeasible for a program about which nothing is known. Other approaches to automatic load balancing therefore require the applications to provide hints about the load to the runtime system, or restrict load balance to a certain kind of algorithms such as Adaptive Mesh Refinement or to certain architectures such as shared memory machines. Our approach is based on actual measurement of load information at runtime, and on migrating computations from heavily loaded to lightly loaded processors.

For this approach to be effective, we need the computation to be split into pieces many more in number than available processors. This allows us to flexibly map and re-map these computational pieces to available processors. This approach is usually called ``multi-domain decomposition''.

CHARM++, which we use as a runtime system layer for the work described here exemplifies our approach. It embeds an elaborate performance tracing mechanism, a suite of plug-in load balancing strategies, infrastructure for defining and migrating computational load, and is interoperable with other programming paradigms.

1.2 Terminology

Module
A module refers to either a complete program or a library with an orchestrator subroutine2 . An orchestrator subroutine specifies the main control flow of the module by calling various subroutines from the associated library and does not usually have much state associated with it.

Thread
A thread is a lightweight process that owns a stack and machine registers including program counter, but shares code and data with other threads within the same address space. If the underlying operating system recognizes a thread, it is known as kernel thread, otherwise it is known as user-thread. A context-switch between threads refers to suspending one thread's execution and transferring control to another thread. Kernel threads typically have higher context switching costs than user-threads because of operating system overheads. The policy implemented by the underlying system for transferring control between threads is known as thread scheduling policy. Scheduling policy for kernel threads is determined by the operating system, and is often more inflexible than user-threads. Scheduling policy is said to be non-preemptive if a context-switch occurs only when the currently running thread willingly asks to be suspended, otherwise it is said to be preemptive. AMPI threads are non-preemptive user-level threads.

Chunk
A chunk is a combination of a user-level thread and the data it manipulates. When a program is converted from MPI to AMPI, we convert an MPI process into a chunk. This conversion is referred to as chunkification.

Object
An object is just a blob of memory on which certain computations can be performed. The memory is referred to as an object's state, and the set of computations that can be performed on the object is called the interface of the object.

November 07, 2009
AMPI Homepage
Charm Homepage