For MPI chunks to migrate, we have added a few calls to AMPI. These include ability to register thread-specific data with the run-time system, to pack all the thread's data, and to express willingness to migrate.
When the AMPI runtime system decides that load imbalance exists within the application, it will invoke one of its internal load balancing strategies, which determines the new mapping of AMPI chunks so as to balance the load. Then AMPI runtime has to pack up the chunk's state and move it to its new home processor. AMPI packs up any internal data in use by the chunk, including the thread's stack in use. This means that the local variables declared in subroutines in a chunk, which are created on stack, are automatically packed up by the AMPI runtime system. However, it has no way of knowing what other data are in use by the chunk. Thus upon starting execution, a chunk needs to notify the system about the data that it is going to use (apart from local variables.) Even with the data registration, AMPI cannot determine what size the data is, or whether the registered data contains pointers to other places in memory. For this purpose, a packing subroutine also needs to be provided to the AMPI runtime system along with registered data. (See next section for writing packing subroutines.) The call provided by AMPI for doing this is MPI_Register. This function takes two arguments: A data item to be transported alongwith the chunk, and the pack subroutine, and returns an integer denoting the registration identifier. In C/C++ programs, it may be necessary to use this return value after migration completes and control returns to the chunk, using function MPI_Get_userdata. Therefore, the return value should be stored in a local variable.
The AMPI runtime system could detect load imbalance by itself and invoke the load balancing strategy. However, since the application code is going to pack/unpack the chunk's data, writing the pack subroutine will be complicated if migrations occur at a stage unknown to the application. For example, if the system decides to migrate a chunk while it is in initialization stage (say, reading input files), application code will have to keep track of how much data it has read, what files are open etc. Typically, since initialization occurs only once in the beginning, load imbalance at that stage would not matter much. Therefore, we want the demand to perform load balance check to be initiated by the application.
AMPI provides a subroutine MPI_Migrate for this purpose. Each chunk periodically calls MPI_Migrate. Typical CSE applications are iterative and perform multiple time-steps. One should call MPI_Migrate in each chunk at the end of some fixed number of timesteps. The frequency of MPI_Migrate should be determined by a tradeoff between conflicting factors such as the load balancing overhead, and performance degradation caused by load imbalance. In some other applications, where application suspects that load imbalance may have occurred, as in the case of adaptive mesh refinement; it would be more effective if it performs a couple of timesteps before telling the system to re-map chunks. This will give the AMPI runtime system some time to collect the new load and communication statistics upon which it bases its migration decisions. Note that MPI_Migrate does NOT tell the system to migrate the chunk, but merely tells the system to check the load balance after all the chunks call MPI_Migrate. To migrate the chunk or not is decided only by the system's load balancing strategy.
Once the AMPI runtime system decides which chunks to send to which processors, it calls the specified pack subroutine for that chunk, with the chunk-specific data that was registered with the system using MPI_Register. This section explains how a subroutine should be written for performing pack/unpack.
There are three steps to transporting the chunk's data to other processor. First, the system calls a subroutine to get the size of the buffer required to pack the chunk's data. This is called the ``sizing'' step. In the next step, which is called immediately afterward on the source processor, the system allocates the required buffer and calls the subroutine to pack the chunk's data into that buffer. This is called the ``packing'' step. This packed data is then sent as a message to the destination processor, where first a chunk is created (alongwith the thread) and a subroutine is called to unpack the chunk's data from the buffer. This is called the ``unpacking'' step.
Though the above description mentions three subroutines called by the AMPI runtime system, it is possible to actually write a single subroutine that will perform all the three tasks. This is achieved using something we call a ``pupper''. A pupper is an external subroutine that is passed to the chunk's pack-unpack-sizing subroutine, and this subroutine, when called in different phases performs different tasks. An example will make this clear:
Suppose the chunk data is defined as a user-defined type in Fortran 90:
Then the pack-unpack subroutine chunkpup for this chunk module is written as:
There are several things to note in this example. First, the same subroutine pup (declared in module pupmod) is called to size/pack/unpack any type of data. This is possible because of procedure overloading possible in Fortran 90. Second is the integer argument p. It is this argument that specifies whether this invocation of subroutine chunkpup is sizing, packing or unpacking. Third, the integer parameters declared in the type chunk need not be packed or unpacked since they are guaranteed to be constants and thus available on any processor.
A few other functions are provided in module pupmod. These functions
provide more control over the packing/unpacking process. Suppose one modifies
the chunk type to include allocatable data or pointers that are
allocated dynamically at runtime. In this case, when the chunk is packed, these
allocated data structures should be deallocated after copying them to buffers,
and when the chunk is unpacked, these data structures should be allocated
before copying them from the buffers. For this purpose, one needs to know
whether the invocation of chunkpup is a packing one or unpacking one.
For this purpose, the pupmod module provides functions
fpup_isdeleting(fpup_isunpacking). These functions return logical value
.TRUE. if the invocation is for packing (unpacking), and .FALSE.
otherwise. Following example demonstrates this:
Suppose the type dchunk is declared as:
Then the pack-unpack subroutine is written as:
One more function fpup_issizing is also available in module pupmod
that returns .TRUE. when the invocation is a sizing one. In practice one
almost never needs to use it.
June 29, 2008
AMPI Homepage
Charm Homepage