3 AMPI

AMPI utilizes the dynamic load balancing capabilities of CHARM++ by associating a ``user-level'' thread with each CHARM++ migratable object. The user's code runs inside this thread, so it can issue blocking receive calls as in MPI and still give the underlying scheduler an opportunity to schedule other computations on the same processor. The runtime system keeps track of the computation load of each thread as well as the communication graph between AMPI threads, and can migrate these threads to balance the overall load while simultaneously minimizing communication overhead.

3.1 AMPI Status

Currently all the MPI-1.1 Standard functions are supported in AMPI, along with a collection of our extensions explained in detail in this manual. The one-sided communication calls of MPI-2 are implemented, but they do not take advantage of RMA features yet. ROMIO has also been integrated to support parallel I/O features. Link with -lampiromio to take advantage of this library.

The following MPI-1.1 basic datatypes are supported in AMPI. (Some are not available in the Fortran binding. Refer to the MPI-1.1 Standard for details.)

MPI_DATATYPE_NULL  MPI_BYTE            MPI_UNSIGNED_LONG MPI_LONG_DOUBLE_INT
MPI_DOUBLE         MPI_PACKED          MPI_LONG_DOUBLE   MPI_2FLOAT
MPI_INT            MPI_SHORT           MPI_FLOAT_INT     MPI_2DOUBLE
MPI_FLOAT          MPI_LONG            MPI_DOUBLE_INT    MPI_LB
MPI_COMPLEX        MPI_UNSIGNED_CHAR   MPI_LONG_INT      MPI_UB
MPI_LOGICAL        MPI_UNSIGNED_SHORT  MPI_2INT
MPI_CHAR           MPI_UNSIGNED        MPI_SHORT_INT

The following MPI-1.1 reduction operations are supported in AMPI.

MPI_MAX   MPI_MIN   MPI_SUM   MPI_PROD  MPI_MAXLOC  MPI_MINLOC
MPI_LAND  MPI_LOR   MPI_LXOR  MPI_BAND  MPI_BOR     MPI_BXOR

The following are AMPI extension calls, which are explained in detail in this manual.

MPI_Migrate     MPI_Checkpoint  MPI_Restart     MPI_Register    MPI_Get_userdata
MPI_Ialltoall   MPI_Iallgather  MPI_Iallreduce  MPI_Ireduce     MPI_IGet

3.2 Name for Main Program

To convert an existing program to use AMPI, the main function or program may need to be renamed. The changes should be made as follows:

3.2.1 Fortran

You must declare the main program as a subroutine called ``MPI_MAIN''. Do not declare the main subroutine as a program because it will never be called by the AMPI runtime.

3.2.2 C or C++

The main function can be left as is, provided mpi.h is included before the main function. This header file has a preprocessor macro that renames main, and the renamed version is called by the AMPI runtime from each thread.

3.3 Global Variable Privatization

For dynamic load balancing to be effective, one needs to map multiple user-level threads onto a processor. Traditional MPI programs assume that the entire processor is allocated to them, and that only one thread of control exists within the process's address space. This is why some transformations of the original MPI program are needed in order for it to run correctly with AMPI.

The basic transformation needed to port an MPI program to AMPI is privatization of global variables. With the MPI process model, each MPI node can keep its own copy of its ``permanent variables'', that is, variables that are accessible from more than one subroutine without being passed as arguments. Module variables, ``saved'' subroutine local variables, and common blocks in Fortran 90 belong to this category. If such a program is executed without privatization on AMPI, all the AMPI threads that reside on one processor will access the same copy of such variables, which is clearly not the desired semantics. To ensure correct execution of the original source program, it is necessary to make such variables ``private'' to individual threads. There are two choices: automatic globals swapping and manual code modification.

3.3.1 Automatic Globals Swapping

Thanks to the ELF Object Format, we have successfully automated the procedure of switching the set of user global variables when switching thread contexts. The only thing the user needs to do is set the flag -swapglobals at compile and link time. Currently this feature only works on x86 and x86_64 (i.e. amd64) platforms that fully support ELF. Thus it will not work on PPC or Itanium, or on some microkernels such as Catamount. When this feature does not work for you, you are advised to make the modification manually, as detailed in the following section.

3.3.2 Manual Change

We have employed a strategy of argument passing to do this privatization transformation. That is, the global variables are bunched together in a single user-defined type, which is allocated by each thread dynamically. Then a pointer to this type is passed from subroutine to subroutine as an argument. Since the subroutine arguments are passed on a stack, which is not shared across threads, each subroutine, when executing within a thread, operates on a private copy of the global variables.

This scheme is demonstrated in the following examples. The original Fortran 90 code contains a module shareddata. This module is used in the main program and in a subroutine subA.

!FORTRAN EXAMPLE
MODULE shareddata
  INTEGER :: myrank
  DOUBLE PRECISION :: xyz(100)
END MODULE

SUBROUTINE MPI_MAIN
  USE shareddata
  include 'mpif.h'
  INTEGER :: i, ierr
  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  DO i = 1, 100
    xyz(i) =  i + myrank
  END DO
  CALL subA
  CALL MPI_Finalize(ierr)
END SUBROUTINE

SUBROUTINE subA
  USE shareddata
  INTEGER :: i
  DO i = 1, 100
    xyz(i) = xyz(i) + 1.0
  END DO
END SUBROUTINE

//C Example
#include <mpi.h>

int myrank;
double xyz[100];

void subA();
int main(int argc, char** argv){
  int i;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* the rank is returned through a pointer */
  for(i=0;i<100;i++)
    xyz[i] = i + myrank;
  subA();
  MPI_Finalize();
  return 0;
}

void subA(){
  int i;
  for(i=0;i<100;i++)
    xyz[i] = xyz[i] + 1.0;
}

AMPI executes the main subroutine inside a user-level thread as a subroutine.

Now we transform this program using the argument passing strategy. We first group the shared data into a user-defined type.

!FORTRAN EXAMPLE
MODULE shareddata
  TYPE chunk
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END TYPE
END MODULE

//C Example
struct shareddata{
  int myrank;
  double xyz[100];
};

Now we modify the main subroutine to dynamically allocate this data and change the references to them. Subroutine subA is then modified to take this data as argument.

!FORTRAN EXAMPLE
SUBROUTINE MPI_Main
  USE shareddata
  USE AMPI
  INTEGER :: i, ierr
  TYPE(chunk), pointer :: c
  CALL MPI_Init(ierr)
  ALLOCATE(c)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
  DO i = 1, 100
    c%xyz(i) =  i + c%myrank
  END DO
  CALL subA(c)
  CALL MPI_Finalize(ierr)
END SUBROUTINE

SUBROUTINE subA(c)
  USE shareddata
  TYPE(chunk) :: c
  INTEGER :: i
  DO i = 1, 100
    c%xyz(i) = c%xyz(i) + 1.0
  END DO
END SUBROUTINE

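//C Example
#include <stdlib.h>
#include <mpi.h>   /* renames main for the AMPI runtime, see Section 3.2.2 */

void subA(struct shareddata *c);

int main(int argc, char** argv){
  int i,ierr;
  struct shareddata *c;
  ierr = MPI_Init(&argc, &argv);
  /* each thread allocates its own private copy of the former globals */
  c = (struct shareddata*)malloc(sizeof(struct shareddata));
  ierr = MPI_Comm_rank(MPI_COMM_WORLD, &c->myrank);
  for(i=0;i<100;i++)
    c->xyz[i] = i + c->myrank;
  subA(c);
  free(c);
  ierr = MPI_Finalize();
  return 0;
}

void subA(struct shareddata *c){
  int i;
  for(i=0;i<100;i++)
    c->xyz[i] = c->xyz[i] + 1.0;
}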

With these changes, the above program becomes thread-safe. Note that it is not strictly necessary to allocate chunk dynamically. One could have declared it as a local variable in the main subroutine. (Or, for a small example such as this, one could have removed the shareddata module altogether and declared both xyz and myrank as local variables.) This is indeed a good idea if the shared data are small. For large shared data, heap allocation is preferable because in AMPI the stack sizes are fixed at startup (they can be specified on the command line) and stacks do not grow dynamically.

3.4 Extensions for Migrations

For MPI chunks to migrate, we have added a few calls to AMPI. These include the ability to register thread-specific data with the run-time system, to pack all of the thread's data, and to express willingness to migrate.

3.4.1 Registering Chunk data

When the AMPI runtime system decides that load imbalance exists within the application, it invokes one of its internal load balancing strategies, which determines a new mapping of AMPI chunks that balances the load. The AMPI runtime then has to pack up each migrating chunk's state and move it to its new home processor. AMPI packs up any internal data in use by the chunk, including the thread's stack. This means that the local variables declared in subroutines in a chunk, which are created on the stack, are automatically packed up by the AMPI runtime system. However, it has no way of knowing what other data are in use by the chunk. Thus, upon starting execution, a chunk needs to notify the system about the data that it is going to use (apart from local variables). Even with the data registration, AMPI cannot determine how large the data is, or whether the registered data contains pointers to other places in memory. For this purpose, a packing subroutine also needs to be provided to the AMPI runtime system along with the registered data. (See Section 3.4.3 for writing packing subroutines.)

The call provided by AMPI for doing this is MPI_Register. This function takes two arguments, a data item to be transported along with the chunk and the pack subroutine, and returns an integer denoting the registration identifier. In C/C++ programs, it may be necessary to use this return value after migration completes and control returns to the chunk, using the function MPI_Get_userdata. Therefore, the return value should be stored in a local variable.

3.4.2 Migration

The AMPI runtime system could detect load imbalance by itself and invoke the load balancing strategy. However, since the application code is going to pack/unpack the chunk's data, writing the pack subroutine would be complicated if migrations could occur at a stage unknown to the application. For example, if the system decided to migrate a chunk while it is in its initialization stage (say, reading input files), the application code would have to keep track of how much data it has read, what files are open, etc. Typically, since initialization occurs only once at the beginning, load imbalance at that stage does not matter much. Therefore, we want the request to perform a load balance check to be initiated by the application.

AMPI provides a subroutine MPI_Migrate for this purpose. Each chunk periodically calls MPI_Migrate. Typical CSE applications are iterative and perform multiple time-steps. One should call MPI_Migrate in each chunk at the end of some fixed number of timesteps. The frequency of MPI_Migrate calls should be determined by a tradeoff between conflicting factors such as the load balancing overhead and the performance degradation caused by load imbalance. In other applications, where the application itself suspects that load imbalance may have occurred (as in the case of adaptive mesh refinement), it is more effective to perform a couple of timesteps before telling the system to re-map chunks. This gives the AMPI runtime system some time to collect the new load and communication statistics upon which it bases its migration decisions. Note that MPI_Migrate does NOT tell the system to migrate the chunk; it merely tells the system to check the load balance after all the chunks have called MPI_Migrate. Whether to migrate a chunk is decided only by the system's load balancing strategy.
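The calling pattern described above can be sketched as a time-step loop that asks for a load balance check after every fixed number of steps. This is only an illustration in plain C: run_timesteps, do_step and request_lb are hypothetical names, and request_lb is a stand-in that, in a real AMPI program, would be a small wrapper around the MPI_Migrate call.

```c
#include <assert.h>
#include <stddef.h>

/* Runs nsteps time-steps and requests a load balance check after every
 * `period` steps. Both callbacks are hypothetical placeholders:
 *   do_step    - the application's work for one time-step (may be NULL)
 *   request_lb - stand-in for a wrapper that calls MPI_Migrate (may be NULL)
 * Returns the number of load balance requests that were issued. */
int run_timesteps(int nsteps, int period,
                  void (*do_step)(int step),
                  void (*request_lb)(void)) {
  int step, requests = 0;
  assert(period > 0);
  for (step = 1; step <= nsteps; step++) {
    if (do_step != NULL)
      do_step(step);
    /* The decision depends only on the step number, so every chunk
     * reaches the request at the same steps. */
    if (step % period == 0) {
      if (request_lb != NULL)
        request_lb();
      requests++;
    }
  }
  return requests;
}
```

With 100 time-steps and a period of 20, the loop issues five requests. A larger period lowers the load balancing overhead, while a smaller one reacts to imbalance sooner.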

3.4.3 Packing/Unpacking Thread Data

Once the AMPI runtime system decides which chunks to send to which processors, it calls the specified pack subroutine for that chunk, with the chunk-specific data that was registered with the system using MPI_Register. This section explains how a subroutine should be written for performing pack/unpack.

There are three steps to transporting the chunk's data to another processor. First, the system calls a subroutine to get the size of the buffer required to pack the chunk's data. This is called the ``sizing'' step. In the next step, which is performed immediately afterward on the source processor, the system allocates the required buffer and calls the subroutine to pack the chunk's data into that buffer. This is called the ``packing'' step. The packed data is then sent as a message to the destination processor, where first a chunk is created (along with the thread) and then a subroutine is called to unpack the chunk's data from the buffer. This is called the ``unpacking'' step.

Though the above description mentions three subroutines called by the AMPI runtime system, it is possible to write a single subroutine that performs all three tasks. This is achieved using something we call a ``pupper''. A pupper is an external subroutine that is passed to the chunk's pack-unpack-sizing subroutine, and this subroutine, when called in different phases, performs different tasks. An example will make this clear:

Suppose the chunk data is defined as a user-defined type in Fortran 90:

!FORTRAN EXAMPLE
MODULE chunkmod
  INTEGER , parameter :: nx=4, ny=4, tchunks=16
  TYPE, PUBLIC :: chunk
      REAL(KIND=8) t(22,22)
      INTEGER xidx, yidx
      REAL(KIND=8), dimension(400):: bxm, bxp, bym, byp
  END TYPE chunk
END MODULE

//C Example
struct chunk{
  double t;
  int xidx, yidx;
  double bxm,bxp,bym,byp;
};

Then the pack-unpack subroutine chunkpup for this chunk module is written as:

!FORTRAN EXAMPLE
SUBROUTINE chunkpup(p, c)
  USE pupmod
  USE chunkmod
  IMPLICIT NONE
  INTEGER :: p
  TYPE(chunk) :: c

  call pup(p, c%t)
  call pup(p, c%xidx)
  call pup(p, c%yidx)
  call pup(p, c%bxm)
  call pup(p, c%bxp)
  call pup(p, c%bym)
  call pup(p, c%byp)
end subroutine

//C Example
void chunkpup(pup_er p, struct chunk *c){
  /* the chunk is passed by pointer so that unpacking can fill it in */
  pup_double(p,&c->t);
  pup_int(p,&c->xidx);
  pup_int(p,&c->yidx);
  pup_double(p,&c->bxm);
  pup_double(p,&c->bxp);
  pup_double(p,&c->bym);
  pup_double(p,&c->byp);
}

There are several things to note in this example. First, the same subroutine pup (declared in module pupmod) is called to size/pack/unpack any type of data. This is possible because of procedure overloading in Fortran 90. Second is the integer argument p. It is this argument that specifies whether this invocation of subroutine chunkpup is sizing, packing or unpacking. Third, the integer parameters declared with the type chunk need not be packed or unpacked, since they are guaranteed to be constants and thus available on any processor.
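To make the role of the argument p concrete, the following self-contained sketch shows how one routine can serve the sizing, packing and unpacking steps. It does not use AMPI's pup library: sk_mode, sk_pupper, sk_pup_bytes, sk_chunkpup and sk_transport are hypothetical names invented for this illustration, and the real pupper may work differently internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a pupper; this is NOT AMPI's pup_er. */
typedef enum { SK_SIZING, SK_PACKING, SK_UNPACKING } sk_mode;

typedef struct {
  sk_mode mode;  /* which phase this invocation performs */
  char *buf;     /* buffer written or read (unused while sizing) */
  size_t pos;    /* bytes counted, written, or read so far */
} sk_pupper;

/* One primitive serves all three phases, depending on p->mode. */
void sk_pup_bytes(sk_pupper *p, void *data, size_t n) {
  if (p->mode == SK_PACKING)
    memcpy(p->buf + p->pos, data, n);   /* chunk -> buffer */
  else if (p->mode == SK_UNPACKING)
    memcpy(data, p->buf + p->pos, n);   /* buffer -> chunk */
  p->pos += n;                          /* sizing only counts bytes */
}

struct chunk {
  double t;
  int xidx, yidx;
  double bxm, bxp, bym, byp;
};

/* The fields are listed once; the pupper decides what happens to them. */
void sk_chunkpup(sk_pupper *p, struct chunk *c) {
  sk_pup_bytes(p, &c->t, sizeof c->t);
  sk_pup_bytes(p, &c->xidx, sizeof c->xidx);
  sk_pup_bytes(p, &c->yidx, sizeof c->yidx);
  sk_pup_bytes(p, &c->bxm, sizeof c->bxm);
  sk_pup_bytes(p, &c->bxp, sizeof c->bxp);
  sk_pup_bytes(p, &c->bym, sizeof c->bym);
  sk_pup_bytes(p, &c->byp, sizeof c->byp);
}

/* Runs the three steps in order and returns the packed size in bytes.
 * buf must hold at least that many bytes (cap is its capacity). */
size_t sk_transport(struct chunk *src, struct chunk *dst,
                    char *buf, size_t cap) {
  sk_pupper p = { SK_SIZING, NULL, 0 };
  size_t need;
  sk_chunkpup(&p, src);                 /* sizing step */
  need = p.pos;
  assert(need <= cap);
  p.mode = SK_PACKING; p.buf = buf; p.pos = 0;
  sk_chunkpup(&p, src);                 /* packing step */
  p.mode = SK_UNPACKING; p.pos = 0;
  sk_chunkpup(&p, dst);                 /* unpacking step */
  return need;
}
```

Because the field list appears in only one place, the three phases cannot get out of step with one another, which is the main reason for writing a single routine instead of three.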

A few other functions are provided in module pupmod. These functions provide more control over the packing/unpacking process. Suppose one modifies the chunk type to include allocatable data or pointers that are allocated dynamically at runtime. In this case, when the chunk is packed, these allocated data structures should be deallocated after copying them to buffers, and when the chunk is unpacked, these data structures should be allocated before copying them from the buffers. To do so, one needs to know whether the invocation of chunkpup is a packing one or an unpacking one. The pupmod module provides the functions fpup_isdeleting and fpup_isunpacking for this purpose. fpup_isdeleting returns the logical value .TRUE. if the invocation is for packing, and fpup_isunpacking returns .TRUE. if the invocation is for unpacking; each returns .FALSE. otherwise. The following example demonstrates this:

Suppose the type dchunk is declared as:

!FORTRAN EXAMPLE
MODULE dchunkmod
  TYPE, PUBLIC :: dchunk
      INTEGER :: asize
      REAL(KIND=8), pointer :: xarr(:), yarr(:)
  END TYPE dchunk
END MODULE

//C Example
struct dchunk{
  int asize;
  double* xarr, *yarr;
};

Then the pack-unpack subroutine is written as:

!FORTRAN EXAMPLE
SUBROUTINE dchunkpup(p, c)
  USE pupmod
  USE dchunkmod
  IMPLICIT NONE
  INTEGER :: p
  TYPE(dchunk) :: c

  CALL pup(p, c%asize)

  IF (fpup_isunpacking(p)) THEN       !! if invocation is for unpacking
    ALLOCATE(c%xarr(c%asize))
    ALLOCATE(c%yarr(c%asize))
  ENDIF

  CALL pup(p, c%xarr)
  CALL pup(p, c%yarr)

  IF (fpup_isdeleting(p)) THEN        !! if invocation is for packing
    DEALLOCATE(c%xarr)
    DEALLOCATE(c%yarr)
  ENDIF

END SUBROUTINE

//C Example
void dchunkpup(pup_er p, struct dchunk *c){
  pup_int(p,&c->asize);
  if(pup_isUnpacking(p)){       /* allocate before copying from the buffer */
    c->xarr = (double *)malloc(sizeof(double)*c->asize);
    c->yarr = (double *)malloc(sizeof(double)*c->asize);
  }
  pup_doubles(p,c->xarr,c->asize);
  pup_doubles(p,c->yarr,c->asize);
  if(pup_isDeleting(p)){        /* mirrors fpup_isdeleting in the Fortran version */
    free(c->xarr);
    free(c->yarr);
  }
}

One more function fpup_issizing is also available in module pupmod that returns .TRUE. when the invocation is a sizing one. In practice one almost never needs to use it.
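The ordering rule for dynamically allocated data (allocate before unpacking, free after packing) can be tried out with the self-contained sketch below. It is not AMPI's pup library: ds_mode, ds_pupper, ds_pup_bytes, ds_dchunkpup and ds_migrate are hypothetical names, and the deleting flag is an assumed stand-in for what fpup_isdeleting reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a pupper; this is NOT AMPI's pup_er. */
typedef enum { DS_SIZING, DS_PACKING, DS_UNPACKING } ds_mode;

typedef struct {
  ds_mode mode;
  int deleting;  /* nonzero when the source copy is discarded after packing */
  char *buf;
  size_t pos;
} ds_pupper;

void ds_pup_bytes(ds_pupper *p, void *data, size_t n) {
  if (p->mode == DS_PACKING)
    memcpy(p->buf + p->pos, data, n);
  else if (p->mode == DS_UNPACKING)
    memcpy(data, p->buf + p->pos, n);
  p->pos += n;
}

struct dchunk {
  int asize;
  double *xarr, *yarr;
};

void ds_dchunkpup(ds_pupper *p, struct dchunk *c) {
  size_t bytes;
  /* The size travels first, so the unpacking side knows how much to allocate. */
  ds_pup_bytes(p, &c->asize, sizeof c->asize);
  bytes = (size_t)c->asize * sizeof(double);
  if (p->mode == DS_UNPACKING) {        /* allocate before reading */
    c->xarr = malloc(bytes);
    c->yarr = malloc(bytes);
    assert(c->xarr != NULL && c->yarr != NULL);
  }
  ds_pup_bytes(p, c->xarr, bytes);
  ds_pup_bytes(p, c->yarr, bytes);
  if (p->deleting) {                    /* free only after the data is copied */
    free(c->xarr);
    free(c->yarr);
    c->xarr = c->yarr = NULL;
  }
}

/* Sizing, packing (with deletion) and unpacking in sequence.
 * Returns the number of bytes that were transported. */
size_t ds_migrate(struct dchunk *src, struct dchunk *dst) {
  ds_pupper p = { DS_SIZING, 0, NULL, 0 };
  size_t need;
  ds_dchunkpup(&p, src);
  need = p.pos;
  p.buf = malloc(need);
  assert(p.buf != NULL);
  p.mode = DS_PACKING; p.deleting = 1; p.pos = 0;
  ds_dchunkpup(&p, src);
  p.mode = DS_UNPACKING; p.deleting = 0; p.pos = 0;
  ds_dchunkpup(&p, dst);
  free(p.buf);
  return need;
}
```

After ds_migrate returns, the source chunk no longer owns any arrays and the destination chunk owns freshly allocated copies, which is the state one would expect after a migration.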

3.5 Extensions for Checkpointing

The pack-unpack subroutines written for migrations make sure that the current state of the program is correctly packed (serialized) so that it can be restarted on a different processor. Using the same subroutines, it is also possible to save the state of the program to disk, so that if the program were to crash abruptly, or if the allocated time for the program expires before completing execution, the program can be restarted from the previously checkpointed state. Thus, the pack-unpack subroutines act as the key facility for checkpointing in addition to their usual role for migration.

A subroutine for checkpointing has been added to AMPI: void MPI_Checkpoint(char *dirname). This subroutine takes a directory name as its argument. It is a collective function, meaning every virtual processor in the program needs to call this subroutine and specify the same directory name. (Typically, in an iterative AMPI program, the iteration number, converted to a character string, can serve as a checkpoint directory name.) This directory is created, and the entire state of the program is checkpointed to this directory. One can restart the program from the checkpointed state by specifying "+restart dirname" on the command line. This capability is powered by the CHARM++ runtime system. For more information about the CHARM++ checkpoint/restart mechanism, please refer to the CHARM++ manual.
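As a small illustration of the naming convention mentioned above, the helper below converts an iteration number to a character string suitable for use as a directory name. checkpoint_dirname is a hypothetical helper written for this sketch and is not part of AMPI; the MPI_Checkpoint call itself appears only in a comment because it needs the AMPI runtime.

```c
#include <assert.h>
#include <stdio.h>

/* Writes the decimal form of `iteration` into out (capacity cap bytes).
 * Returns 0 on success and -1 if the name does not fit. */
int checkpoint_dirname(char *out, size_t cap, int iteration) {
  int n;
  assert(out != NULL && cap > 0);
  n = snprintf(out, cap, "%d", iteration);
  return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}

/* Inside an AMPI program one might then write, on every virtual processor:
 *
 *   char dirname[32];
 *   if (checkpoint_dirname(dirname, sizeof dirname, iter) == 0)
 *     MPI_Checkpoint(dirname);
 *
 * All virtual processors compute the same name because they share the
 * same iteration number, as the collective call requires. */
```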

3.6 Extensions for Memory Efficiency

MPI functions usually require the user to preallocate the data buffers needed before the functions are called. For nonblocking communication primitives, the user may sometimes prefer to defer memory allocation until the data actually arrives, which gives the opportunity to write more memory-efficient programs. We provide a set of AMPI functions as an extension to the standard MPI-2 one-sided calls, built around a split-phase MPI_Get called MPI_IGet. MPI_IGet has semantics similar to MPI_Get, except that no user buffer is provided to hold the incoming data. MPI_IGet_Wait blocks until the requested data arrives; the runtime system takes care of allocating space and doing the appropriate unpacking based on the data type before returning. MPI_IGet_Free lets the runtime system free the resources used for this get request, including the data buffer. MPI_IGet_Data is the utility function that returns the actual data.


int MPI_IGet(MPI_Aint orgdisp, int orgcnt, MPI_Datatype orgtype, int rank,
             MPI_Aint targdisp, int targcnt, MPI_Datatype targtype, MPI_Win win,
             MPI_Request *request);

int MPI_IGet_Wait(MPI_Request *request, MPI_Status *status, MPI_Win win);

int MPI_IGet_Free(MPI_Request *request, MPI_Status *status, MPI_Win win);

char* MPI_IGet_Data(MPI_Status status);

3.7 Extensions for Interoperability

Interoperability between different modules is essential for coding coupled simulations. In this extension to AMPI, each MPI application module runs within its own group of user-level threads distributed over the physical parallel machine. In order to let AMPI know which chunks are to be created, and in what order, a top level registration routine needs to be written. A real-world example will make this clear. Suppose we have an MPI code for fluids and another MPI code for solids, each with its own main program. We first transform each individual code to run correctly under AMPI as a standalone code. This involves the usual ``chunkification'' transformation so that multiple chunks from the application can run on the same processor without overwriting each other's data. This also involves making the main program into a subroutine and naming it MPI_Main.

Thus we now have two MPI_Mains, one for the fluids code and one for the solids code. We make these codes co-exist within the same executable by first renaming these MPI_Mains as Fluids_Main and Solids_Main, and then writing a subroutine called MPI_Setup.

!FORTRAN EXAMPLE
SUBROUTINE MPI_Setup
  USE ampi
  CALL MPI_Register_main(Solids_Main)
  CALL MPI_Register_main(Fluids_Main)
END SUBROUTINE

//C Example
void MPI_Setup(){
  MPI_Register_main(Solids_Main);
  MPI_Register_main(Fluids_Main);
}

This subroutine is called from the internal initialization routines of AMPI and tells AMPI how many distinct chunk types (modules) exist, and which orchestrator subroutines they execute.

The number of chunks to create for each chunk type is specified on the command line when an AMPI program is run. Appendix B explains how AMPI programs are run, and how to specify the number of chunks (+vp option). In the above case, suppose one wants to create 128 chunks of Solids and 64 chunks of Fluids on 32 physical processors, one would specify those with multiple +vp options on the command line as:

> charmrun gen1.x +p 32 +vp 128 +vp 64

This will ensure that multiple chunk types representing different complete applications can co-exist within the same executable. They can also continue to communicate among their own chunk types using the same AMPI function calls to send and receive, with MPI_COMM_WORLD as the communicator argument. But this would be completely useless if these individual applications could not communicate with each other, which is essential for building efficient coupled codes. For this purpose, we have extended the AMPI functionality to allow multiple ``COMM_WORLDs'', one for each application. These world communicators form a ``communicator universe'': an array of communicators aptly called MPI_COMM_UNIVERSE. This array of communicators is indexed [1 ... MPI_MAX_COMM]. In the current implementation, MPI_MAX_COMM is 8, that is, a maximum of 8 applications can co-exist within the same executable.

The order of these COMM_WORLDs within MPI_COMM_UNIVERSE is determined by the order in which individual applications are registered in MPI_Setup.

Thus, in the above example, the communicator for the Solids module would be MPI_COMM_UNIVERSE(1) and communicator for Fluids module would be MPI_COMM_UNIVERSE(2).

Now any chunk within one application can communicate with any chunk in the other application using the familiar send or receive AMPI calls, by specifying the appropriate communicator and the chunk number within that communicator in the call. For example, if Solids chunk number 36 wants to send data to chunk number 47 within the Fluids module, it calls:

!FORTRAN EXAMPLE
INTEGER , PARAMETER :: Fluids_Comm = 2
CALL MPI_Send(InitialTime, 1, MPI_Double_Precision, 47, &
              tag, MPI_Comm_Universe(Fluids_Comm), ierr)

//C Example
int Fluids_Comm = 2;
ierr = MPI_Send(InitialTime, 1, MPI_DOUBLE, 47,
                tag, MPI_Comm_Universe(Fluids_Comm));

The Fluids chunk has to issue a corresponding receive call to receive this data:

!FORTRAN EXAMPLE
INTEGER , PARAMETER :: Solids_Comm = 1
CALL MPI_Recv(InitialTime, 1, MPI_Double_Precision, 36, &
              tag, MPI_Comm_Universe(Solids_Comm), stat, ierr)

//C Example
int Solids_Comm = 1;
ierr = MPI_Recv(InitialTime, 1, MPI_DOUBLE, 36,
                tag, MPI_Comm_Universe(Solids_Comm), &stat);

3.8 Extensions for Sequential Re-run of a Parallel Node

In some scenarios, a sequential re-run of a parallel node is desired. One example is instruction-level accurate architecture simulation, in which case the user may wish to repeat the execution of one node of a parallel run in the sequential simulator. AMPI provides support for such needs by logging the changes in the MPI environment on a given processor. To activate the feature, build the AMPI module with the variable ``AMPIMSGLOG'' defined, as in the following command issued in the charm directory. (Linking with zlib, ``-lz'', might be required for generating a compressed log file.)

> ./build AMPI net-linux -DAMPIMSGLOG

The feature is used in two phases: writing (logging) the environment and repeating the run. The first logging phase is invoked by a parallel run of the AMPI program with some additional command line options.

> ./charmrun ./pgm +p4 +vp4 +msgLogWrite +msgLogRank 2 +msgLogFilename "msg2.log"

In the above example, a parallel run with 4 processors and 4 VPs will be executed, and the changes in the MPI environment of processor 2 (also VP 2, counting from 0) will be logged into the file "msg2.log".

Unlike the first run, the re-run is a sequential program, so it is not invoked by charmrun (and charmrun options like +p4 and +vp4 are omitted). Additional command line options are required as well.

> ./pgm +msgLogRead +msgLogRank 2 +msgLogFilename "msg2.log"

3.9 Communication Optimizations for AMPI

AMPI is now powered by the CHARM++ communication optimization support. Currently the user needs to specify the communication pattern with a command line option. In the future this can be done automatically by the system.

Currently there are four strategies available: USE_DIRECT, USE_MESH, USE_HYPERCUBE and USE_GRID. USE_DIRECT sends the message directly. USE_MESH imposes a 2D mesh virtual topology on the processors, so each processor sends messages to its neighbors in its row and column of the mesh, which forward the messages to their correct destinations. USE_HYPERCUBE and USE_GRID impose a hypercube and a 3D grid topology on the processors, respectively. USE_HYPERCUBE does best for very small messages and small numbers of processors, USE_GRID performs better for slightly larger messages, and beyond that USE_MESH starts performing best. The programmer is encouraged to try out all the strategies. (This description is taken from the CommLib manual by Sameer.)

For more details please refer to the CommLib paper.

Specifying the strategy is as simple as a command line option +strategy. For example:

> ./charmrun +p64 alltoall +vp64 1000 100 +strategy USE_MESH
This tells the system to use the MESH strategy for CommLib. By default, USE_DIRECT is used.

3.10 User Defined Initial Mapping

You can define the initial mapping of virtual processors (vp) to physical processors (p) as a runtime option. You can choose from predefined initial mappings or define your own mapping. The following predefined mappings are available:

Round Robin

This mapping scheme maps virtual processors to physical processors in round-robin fashion, i.e. if there are 8 virtual processors and 2 physical processors, then virtual processors indexed 0,2,4,6 will be mapped to physical processor 0 and virtual processors indexed 1,3,5,7 will be mapped to physical processor 1.

> ./charmrun ./hello +p2 +vp8 +mapping RR_MAP

Block Mapping

This mapping scheme maps virtual processors to physical processors in contiguous blocks, i.e. if there are 8 virtual processors and 2 physical processors, then virtual processors indexed 0,1,2,3 will be mapped to physical processor 0 and virtual processors indexed 4,5,6,7 will be mapped to physical processor 1.

> ./charmrun ./hello +p2 +vp8 +mapping BLOCK_MAP

Proportional Mapping

This scheme takes the processing capability of the physical processors into account when mapping virtual processors to physical processors, i.e. if there are 2 processors with different processing power, then the number of virtual processors mapped to each processor will be in proportion to its processing power.

> ./charmrun ./hello +p2 +vp8 +mapping PROP_MAP
> ./charmrun ./hello +p2 +vp8

If you want to define your own mapping scheme, please contact us for help.
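The round-robin and block examples given above can be reproduced with two small index computations. The sketch below is an illustration only: rr_map and block_map are hypothetical names, and the handling of a virtual processor count that does not divide evenly (rounding the block size up) is an assumption rather than documented AMPI behavior.

```c
#include <assert.h>

/* Round robin: virtual processor vp goes to physical processor vp mod num_pe. */
int rr_map(int vp, int num_vp, int num_pe) {
  assert(num_pe > 0 && vp >= 0 && vp < num_vp);
  return vp % num_pe;
}

/* Block: consecutive virtual processors share a physical processor.
 * Assumption: the block size is num_vp / num_pe rounded up. */
int block_map(int vp, int num_vp, int num_pe) {
  int per_pe;
  assert(num_pe > 0 && vp >= 0 && vp < num_vp);
  per_pe = (num_vp + num_pe - 1) / num_pe;
  return vp / per_pe;
}
```

For 8 virtual processors on 2 physical processors, rr_map places 0,2,4,6 on processor 0 and 1,3,5,7 on processor 1, while block_map places 0,1,2,3 on processor 0 and 4,5,6,7 on processor 1, matching the descriptions of RR_MAP and BLOCK_MAP.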

3.11 Compiling AMPI Programs

CHARM++ provides a cross-platform compile-and-link script called charmc to compile C, C++, Fortran, CHARM++ and AMPI programs. This script resides in the bin subdirectory of the CHARM++ installation directory. The main purpose of this script is to deal with the differences in compiler names and command-line options across the various machines on which CHARM++ runs. While charmc handles C and C++ compiler differences most of the time, the support for Fortran 90 is new and may have bugs. The CHARM++ developers are aware of this problem and are working to fix them. Even in its alpha stage of Fortran 90 support, charmc still handles many of the compiler differences across many machines, and it is recommended that charmc be used to compile and link AMPI programs. One major advantage of using charmc is that one does not have to specify which libraries are to be linked to ensure that C++ and Fortran 90 codes are linked together correctly. The appropriate libraries required for linking such modules together are known to charmc for various machines.

In spite of the platform-neutral syntax of charmc, one may have to specify some platform-specific options for compiling and building AMPI codes. Fortunately, if charmc does not recognize particular options on its command line, it passes them on to all the individual compilers and linkers it invokes to compile the program.

November 23, 2009