Subsections

3.4 Extensions for Migrations

For MPI chunks to migrate, we have added a few calls to AMPI. These include ability to register thread-specific data with the run-time system, to pack all the thread's data, and to express willingness to migrate.

3.4.1 Registering Chunk data

When the AMPI runtime system decides that load imbalance exists within the application, it will invoke one of its internal load balancing strategies, which determines the new mapping of AMPI chunks so as to balance the load. Then AMPI runtime has to pack up the chunk's state and move it to its new home processor. AMPI packs up any internal data in use by the chunk, including the thread's stack in use. This means that the local variables declared in subroutines in a chunk, which are created on stack, are automatically packed up by the AMPI runtime system. However, it has no way of knowing what other data are in use by the chunk. Thus upon starting execution, a chunk needs to notify the system about the data that it is going to use (apart from local variables.) Even with the data registration, AMPI cannot determine what size the data is, or whether the registered data contains pointers to other places in memory. For this purpose, a packing subroutine also needs to be provided to the AMPI runtime system along with registered data. (See next section for writing packing subroutines.) The call provided by AMPI for doing this is MPI_Register. This function takes two arguments: A data item to be transported alongwith the chunk, and the pack subroutine, and returns an integer denoting the registration identifier. In C/C++ programs, it may be necessary to use this return value after migration completes and control returns to the chunk, using function MPI_Get_userdata. Therefore, the return value should be stored in a local variable.

3.4.2 Migration

The AMPI runtime system could detect load imbalance by itself and invoke the load balancing strategy. However, since the application code is going to pack/unpack the chunk's data, writing the pack subroutine will be complicated if migrations occur at a stage unknown to the application. For example, if the system decides to migrate a chunk while it is in initialization stage (say, reading input files), application code will have to keep track of how much data it has read, what files are open etc. Typically, since initialization occurs only once in the beginning, load imbalance at that stage would not matter much. Therefore, we want the demand to perform load balance check to be initiated by the application.

AMPI provides a subroutine MPI_Migrate for this purpose. Each chunk periodically calls MPI_Migrate. Typical CSE applications are iterative and perform multiple time-steps. One should call MPI_Migrate in each chunk at the end of some fixed number of timesteps. The frequency of MPI_Migrate should be determined by a tradeoff between conflicting factors such as the load balancing overhead, and performance degradation caused by load imbalance. In some other applications, where application suspects that load imbalance may have occurred, as in the case of adaptive mesh refinement; it would be more effective if it performs a couple of timesteps before telling the system to re-map chunks. This will give the AMPI runtime system some time to collect the new load and communication statistics upon which it bases its migration decisions. Note that MPI_Migrate does NOT tell the system to migrate the chunk, but merely tells the system to check the load balance after all the chunks call MPI_Migrate. To migrate the chunk or not is decided only by the system's load balancing strategy.

3.4.3 Packing/Unpacking Thread Data

Once the AMPI runtime system decides which chunks to send to which processors, it calls the specified pack subroutine for that chunk, with the chunk-specific data that was registered with the system using MPI_Register. This section explains how a subroutine should be written for performing pack/unpack.

There are three steps to transporting the chunk's data to other processor. First, the system calls a subroutine to get the size of the buffer required to pack the chunk's data. This is called the ``sizing'' step. In the next step, which is called immediately afterward on the source processor, the system allocates the required buffer and calls the subroutine to pack the chunk's data into that buffer. This is called the ``packing'' step. This packed data is then sent as a message to the destination processor, where first a chunk is created (alongwith the thread) and a subroutine is called to unpack the chunk's data from the buffer. This is called the ``unpacking'' step.

Though the above description mentions three subroutines called by the AMPI runtime system, it is possible to actually write a single subroutine that will perform all the three tasks. This is achieved using something we call a ``pupper''. A pupper is an external subroutine that is passed to the chunk's pack-unpack-sizing subroutine, and this subroutine, when called in different phases performs different tasks. An example will make this clear:

Suppose the chunk data is defined as a user-defined type in Fortran 90:

!FORTRAN EXAMPLE
MODULE chunkmod
  TYPE, PUBLIC :: chunk
      INTEGER , parameter :: nx=4, ny=4, tchunks=16
      REAL(KIND=8) t(22,22)
      INTEGER xidx, yidx
      REAL(KIND=8), dimension(400):: bxm, bxp, bym, byp
  END TYPE chunk
END MODULE

//C Example
struct chunk{
  double t;
  int xidx, yidx;
  double bxm,bxp,bym,byp;
};

Then the pack-unpack subroutine chunkpup for this chunk module is written as:

!FORTRAN EXAMPLE
SUBROUTINE chunkpup(p, c)
  USE pupmod
  USE chunkmod
  IMPLICIT NONE
  INTEGER :: p
  TYPE(chunk) :: c

  call pup(p, c%t)
  call pup(p, c%xidx)
  call pup(p, c%yidx)
  call pup(p, c%bxm)
  call pup(p, c%bxp)
  call pup(p, c%bym)
  call pup(p, c%byp)
end subroutine

//C Example
void chunkpup(pup_er p, struct chunk c){
  pup_double(p,c.t);
  pup_int(p,c.xidx);
  pup_int(p,c.yidx);
  pup_double(p,c.bxm);
  pup_double(p,c.bxp);
  pup_double(p,c.bym);
  pup_double(p,c.byp);
}

There are several things to note in this example. First, the same subroutine pup (declared in module pupmod) is called to size/pack/unpack any type of data. This is possible because of procedure overloading possible in Fortran 90. Second is the integer argument p. It is this argument that specifies whether this invocation of subroutine chunkpup is sizing, packing or unpacking. Third, the integer parameters declared in the type chunk need not be packed or unpacked since they are guaranteed to be constants and thus available on any processor.

A few other functions are provided in module pupmod. These functions provide more control over the packing/unpacking process. Suppose one modifies the chunk type to include allocatable data or pointers that are allocated dynamically at runtime. In this case, when the chunk is packed, these allocated data structures should be deallocated after copying them to buffers, and when the chunk is unpacked, these data structures should be allocated before copying them from the buffers. For this purpose, one needs to know whether the invocation of chunkpup is a packing one or unpacking one. For this purpose, the pupmod module provides functions fpup_isdeleting(fpup_isunpacking). These functions return logical value .TRUE. if the invocation is for packing (unpacking), and .FALSE. otherwise. Following example demonstrates this:

Suppose the type dchunk is declared as:

!FORTRAN EXAMPLE
MODULE dchunkmod
  TYPE, PUBLIC :: dchunk
      INTEGER :: asize
      REAL(KIND=8), pointer :: xarr(:), yarr(:)
  END TYPE dchunk
END MODULE

//C Example
struct dchunk{
  int asize;
  double* xarr, *yarr;
};

Then the pack-unpack subroutine is written as:

!FORTRAN EXAMPLE
SUBROUTINE dchunkpup(p, c)
  USE pupmod
  USE dchunkmod
  IMPLICIT NONE
  INTEGER :: p
  TYPE(dchunk) :: c

  pup(p, c%asize)
  
  IF (fpup_isunpacking(p)) THEN       !! if invocation is for unpacking
    allocate(c%xarr(asize))
    ALLOCATE(c%yarr(asize))
  ENDIF
  

  pup(p, c%xarr)
  pup(p, c%yarr)
  
  IF (fpup_isdeleting(p)) THEN        !! if invocation is for packing
    DEALLOCATE(c%xarr(asize))
    DEALLOCATE(c%yarr(asize))
  ENDIF
  


END SUBROUTINE

//C Example
void dchunkpup(pup_er p, struct dchunk c){
  pup_int(p,c.asize);
  if(pup_isUnpacking(p)){
    c.xarr = (double *)malloc(sizeof(double)*c.asize);
    c.yarr = (double *)malloc(sizeof(double)*c.asize);
  }
  pup_doubles(p,c.xarr,c.asize);
  pup_doubles(p,c.yarr,c.asize);
  if(pup_isPacking(p)){
    free(c.xarr);
    free(c.yarr);
  }
}

One more function fpup_issizing is also available in module pupmod that returns .TRUE. when the invocation is a sizing one. In practice one almost never needs to use it.

June 29, 2008
AMPI Homepage
Charm Homepage