Orion Lawlor Milind Bhandarkar L. V. Kalé
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
olawlor@acm.org,{bhandark,l-kale1}@uiuc.edu
``Adaptive MPI'', or AMPI, implements virtual MPI processors, several of which may reside on a single physical processor. This virtualization allows MPI applications to use an automatic migration-based load balancer, automatically overlap computation and communication, and provides several other benefits. In this paper, we present the design of and recent work on AMPI, its low-level and application performance, and some of the advanced capabilities enabled by virtualization.
The new generation of parallel applications are complex, involve simulation of dynamically varying systems, use adaptive techniques such as multiple timestepping and adaptive refinements, and often involve multiple parallel modules. Typical implementations of the MPI do not support the dynamic nature of these applications well. As a result, programming productivity and parallel efficiency suffer. We present AMPI, an adaptive implementation of MPI, that is better suited for adaptive applications, while still retaining the familiar programming model of MPI.
The basic idea behind AMPI is to separate the issue of mapping work to
processors from that of identifying work to be done in parallel.
Standard MPI programs divide the computation into
processes, one for
each of the
processors. Algorithmic considerations
often restrict the number of processors to a power of 2,
or a cube (or both).
In contrast, an AMPI programmer divides the computation into a large
number
of virtual processors, independent of the number of
physical processors. The virtual processors are programmed in MPI as before.
Physical processors are no longer visible to the programmer, as
the responsibility for assigning virtual processors to physical
processors is taken over by the runtime system. This provides an
effective division of labor between the system and the programmer: the
programmer decides what to do in parallel, and the runtime system
decides where and when to do it.
This allows the programmer to use the most natural decomposition
for his problem, rather than being restricted by the physical machine.
So, for example,
can still be a cube even though
is prime.
Note that the number of virtual processors
is typically much larger
than
. Using multiple virtual processors per physical processor brings
several additional benefits.
We first describe how our virtual processors are implemented and migrated. Section 3 describes the design and implementation strategies for specific features, such as checkpointing. We then present performance data showing that these adaptive features are affordable in real programs. Finally, we will demonstrate some adaptive capabilities quantitatively in the context of a conjugate gradient solver for sparse linear systems.
The virtualization concept embodied by AMPI is very old, and Fox et al. [1] make a convincing case for virtualizing parallel programs. Unlike Fox's work, AMPI virtualizes at the runtime layer rather than manually at the user level, and AMPI can use adaptive load balancers.
There are several excellent, complete, publicly available non-virtualized implementations of MPI, such as MPICH [2] and LAM [3]. Many researchers have described partially virtualized MPI implementations for checkpointing, often built on top of one of the free implementations of MPI. Several workers have described fully virtualized MPI implementations for fault-tolerance, such as FT-MPI [4], MPI/FT [5], and StarFish [6]. AMPI differs from these efforts in that we virtualize to improve performance and allow load balancing rather than solely for checkpointing or for fault tolerance. Some of these works also impose unacceptable runtime overheads or require extensive changes to the user code, problems AMPI largely manages to avoid.
An older version of AMPI was described by Bhandarkar et al. [7]. Our automatic load balancing framework was described in detail by Kalé et al. [8].
AMPI is built on CHARM++, and uses its communication facilities, load balancing strategies and threading model.
CHARM++ uses an object based model: programs consist of a collection of communicating objects, which are mapped to processors by the CHARM++ runtime. CHARM++ supports migration of objects by providing efficient forwarding of messages, when necessary. Migration can be used by the built-in measurement-based load balancing [8], adapting to changing load on workstation clusters [9], and even shrinking/expanding jobs for timeshared machines [10]. Migration presents interesting problems for basic and collective communication which are nicely solved by CHARM++ [11].
AMPI's virtual processors are implemented as CHARM++ ``user-level'' threads--threads which are created and scheduled by ordinary code, not by the operating system kernel. The advantages of user-level threads are fast1context switching, control over scheduling, and control over stack allocation. CHARM++'s user-level threads are scheduled non-preemptively.
CHARM++ natively supports object migration; but thread migration required several interesting additions to the runtime system, as described in the following sections.
A user-level thread, when suspended, consists of a stack and a set of
preserved machine registers. During migration, the machine registers
are simply copied to the new processor. The stack, unfortunately, is
very difficult to move--consider the variable
below:
During the call to bar, the stack-allocated variable
cannot be moved, since its address is stored by bar.2In a distributed memory parallel machine, if the stack is moved to
a new machine, it will almost undoubtably be allocated at a different
location, so bar's pointer to
will become dangling when the
stack moves. We cannot
reliably update all the pointers to stack-allocated variables, because
these pointers are stored in machine registers and stack frames,
whose layout is highly machine- and compiler-dependent.
Our solution is to ensure that even after a migration, a thread's stack will stay at the same address in memory that it had on the old processor. This means all the pointers embedded in the stack will still work properly. Luckily, any operating system with virtual memory support has the ability to map arbitrary pages in and out of memory. So in practice we merely need to mmap the appropriate address range into memory on the new machine and use it for our stack. Because this uses the hardware's built-in virtual memory support, when migration is not occurring this approach does not affect performance.
Of course, we must ensure that each thread allocates its stack at
a globally unique range of addresses. This is
accomplished by simply dividing the total virtual address space
into
regions; threads created on processor
then get their
stacks allocated from region
. This system thus has globally-unique
memory addresses like a software shared memory system (DSM), but
here the data movement is proactive--when a thread migrates, it takes
all its data with it.
This ``isomalloc'' approach to thread migration comes from
PM
[12].
Another obvious problem with migrating an arbitrary program is
dynamically allocated storage--for example, the array
in:
Clearly, if this thread is migrated to a new processor,
must
come along as well. But unlike the thread stack, which the system
allocated,
's location is known only to the user program.
The previous version of AMPI required the user to code a
``pack/unpack'' routine to capture all allocated heap data.
This routine was fairly easy to write, but rather difficult to
maintain.
Worse, this tiny amount of extra code prevented a straightforward switch
from ordinary MPI to AMPI.
The ``isomalloc'' strategy available in the latest version of AMPI uses the same virtual address allocation method used for stacks to allocate all heap data. This means the user's heap data is given globally unique virtual addresses, so it can be moved to any running processor without changing its address. Thus migration is transparent to the user code, even for arbitrarily interlinked, dynamically allocated data structures.
To do this, AMPI must intercept and handle all memory allocations done by the user code. On many UNIX systems, this can be done by providing our own implementation of ``malloc''. Because nearly all languages can be linked together with C code, even the C++ new and FORTRAN90 ALLOCATE runtime calls eventually result in a call to malloc. However, some implementations of these language runtimes perform caching of free memory blocks, which must be disabled.
Unfortunately, the isomalloc heap approach is only of limited use on 32-bit systems. Since the virtual address range on these machines is limited to 4GB, and since this space is divided among all processors when using the isomalloc approach, we run out of allocatable space very quickly. For example, dividing the 4GB address space3among 100 processors means each processor can only allocate 40MB of memory; a significant limit. Thus on 32-bit machines, the pack/unpack method is generally required. Machines with 64-bit pointers, which are becoming increasingly common, have many terabytes of free virtual address space and hence can fully benefit from isomalloc heaps.
Although not specified by the MPI standard, many actual MPI programs assume that global variables can be used independently on each processor. However, in AMPI, all the threads on a processor share a single set of global variables; and when a thread migrates, it leaves its global variables behind. This means many MPI programs cannot run unmodified under AMPI.
A simple solution is to manually remove all the global variables from the code. For example, all the formal globals can be collected into a single struct named ``Global'', which is then passed into each function. This process, though mechanical, is cumbersome and can indeed be automated.
AMPIzer [13] is our source-to-source translator that removes global variables from arbitrary FORTRAN77 or FORTRAN90 code. AMPIzer is based on the Polaris compiler front end [14]. For simple4 uses of the heap, Ampizer can also generate a pack/unpack routine if isomalloc heaps are not desired.
As Stellner describes in his paper on his checkpointing framework [15], process migration can easily be layered on top of any checkpointing system by simply rearranging the checkpoint files before restart. AMPI implements checkpointing in exactly the opposite way. In AMPI, rather than migration being a special kind of checkpoint/restart, checkpoint/restart is seen as a special kind of migration-migration to and from the disk.
A running AMPI thread checkpoints itself by calling MPI_Checkpoint with a directory name. Each thread drains its network queue, migrates a copy of itself into a separate file in that directory, and then continues normally. The checkpoint time is dominated by the cost of the I/O, since very little communication is required.
Because AMPI checkpoints threads rather than physical processors, an AMPI program may be restored on a larger or smaller number of physical processors than was it started on. Thus a checkpoint on 1000 processors can easily be restarted on 999 processors if, for example, a processor fails during the run.
Large scientific programs are often written in a modular fashion by combining multiple MPI modules into a single program. These MPI modules are often derived from independent MPI programs.
Current MPI programs transfer control from one module to another strictly via subroutine calls. Even if two modules are independent, idle time in one cannot be overlapped with computations in the other without breaking the abstraction boundaries between the two modules. In contrast, AMPI allows multiple separately developed modules to interleave execution based on the availability of messages. Each module may have its own ``main'', and its own flow of control. AMPI provides cross-communicators to communicate between such modules.
As described above, AMPI modules can be used in-process with other AMPI modules. AMPI modules can also coexist with other CHARM++ modules. For example, a program using another threaded CHARM++ framework, such as the CHARM++ Finite Element Method (FEM) Framework [16], or the CHARM++ collision detection system [17], can still use AMPI.
In this ``bound'' mode, a single thread of user code can make calls to the FEM and AMPI frameworks; when the thread migrates, the support data required by both frameworks automatically migrates as well. This keeps the users' code simple, since they do not have to synchronize two separate threads of control. Of course, it is also possible to run an FEM framework module and an AMPI framework module in their own separate sets of threads.
We have described the results from experiments involving real scientific applications running on AMPI in another work [7].
The times for various low-level MPI operations on various machines and MPI implementations are shown in Table 3.3. The serial machine is an AMD Athlon 1800XP running two AMPI virtual processors. The 100baseT cluster is a set of 4-way 500Mhz Pentium III SMP nodes running Linux, connected using switched fast Ethernet; AMPI is on UDP and MPICH is on p4. The Myrinet cluster is a set of 2-way 1GHz Pentium III SMP nodes running Linux, connected using a Myricom interconnect; AMPI and MPICH both ran on GM directly. The SGI Origin2000 is a single 50-processor 195MHz R10000 node; AMPI ran on the native SGI MPI implementation. IBM SP3 is a set of 8-way 375MHZ Power3 nodes; AMPI again ran on the native MPI implementation.
``Send/Recv'' performs a simple ping-pong operation--one processor sends while the other receives, then receives while the other sends. ``Repeated Send'' is nearly the same, except one processor always sends while the other always receives, and reports the time as measured by the sending processor. This send overhead is important during broadcast-style communication exchanges. MPICH, both on Ethernet and GM, had extremely poor performance for this test. ``Barrier'', ``Bcast'', and ``Allreduce'' are simply the equivalent MPI operation, in this case on just two processors, and with the time measured from the root. ``Bandwidth'' is the end-to-end large-message transmission rate, as measured by the time to exchange a one-megabyte buffer.
AMPI was occasionally several times slower than the non-adaptive MPI implementations. Part of this is simply the fundamental cost of AMPI's virtualization; but part is simply our implementation and we should be able to soon show substantial improvements. In particular, the non-blocking operations AMPI requires, such as MPI_Isend, have very poor performance on many MPI implementations; we have begun experimenting with direct implementations for many parallel machines. Despite these low-level results, application performance under AMPI is often quite good.
This application is a partial differential equation solver which uses a sparse, matrix-free form of the conjugate gradient method to solve the Poisson problem on a regular 2D grid. It is an iterative method, and typically performs several thousand steps during one solution.
Each (virtual) processor is responsible for computing the solution on a rectangular region of the mesh. Since the solution residual for a grid point depends on the solutions for its nearest four neighbors, each processor maintains a one-element-thick ghost region. In each step, messages are exchanged to fill these ghost regions, and there are two short global reductions. Like many scientific codes, this application is normally memory bandwidth bound.
Figure 1 shows the time per step of the solver on a single physical processor, while varying the number of virtual processors between 1 and 4096. Because AMPI's virtual processors are implemented as user-level threads, there is very little overhead in managing a large number of threads. On our Pentium IV system, with a relatively small cache but very fast RDRAM memory, simulating 100 virtual processors led to only a slight (10%) slowdown. However, on the Athlon and Pentium III Xenon, improved cache usage while simulating 100 virtual processors actually resulted in slightly better performance than using the single physical processor normally.
![]() |
![]() |
Figure 2 shows the time per step of the solver on an actual parallel machine, ASCI Red. This is with a larger mesh-6000x6000 elements, for a 36 million row matrix. As shown, the runtime cost for AMPI's virtualization is small.
AMPI normally migrates virtual processors for load balance, but this capability can also be used to respond to the changing properties of the parallel machine. For example, Figure 3 shows the conjugate gradient solver described above responding to the availability of several new processors. The time per step drops dramatically as virtual processors are migrated onto the new physical processors.
![]() |
We have presented AMPI, an adaptive implementation of MPI on top of CHARM++. AMPI implements migratable virtual MPI processors; and in particular allows the use of several virtual processors per physical processor. This efficient virtualization provides a number of benefits, such as the ability to automatically load balance arbitrary computations, automatically overlap computation and communication, emulate large machines on small ones, and respond to a changing physical machine.
We have much future work planned for AMPI. We hope to achieve full MPI1.1 standards conformance soon, and MPI2.0 shortly thereafter. We are rapidly improving the performance of AMPI, and should soon be quite near that of non-migratable MPI. The CHARM++ performance analysis tools need to be updated to provide more direct support for AMPI programs. We hope to extend our suite of automatic load balancing strategies to provide machine-topology specific strategies. Finally, we hope to apply our communication optimization libraries to programs running under AMPI.
This document was generated using the LaTeX2HTML translator Version 99.2beta6 (1.42)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 0 -white -antialias -local_icons paper.tex
The translation was initiated by Orion Lawlor on 2002-05-09