AMPI utilizes the dynamic load balancing capabilities of Charm++ by associating a user-level thread with each Charm++ migratable object. The user's code runs inside this thread, so it can issue blocking receive calls as in MPI while still giving the underlying scheduler an opportunity to schedule other computations on the same processor. The runtime system keeps track of the computational load of each thread as well as the communication graph between AMPI threads, and can migrate these threads in order to balance the overall load while simultaneously minimizing communication overhead.
For dynamic load balancing to be effective, one needs to map multiple user-level threads onto each processor. Traditional MPI programs assume that the entire processor is allocated to them, and that only one thread of control exists within the process's address space. This is why the original MPI program must undergo some transformations in order to run correctly with AMPI.
The basic transformation needed to port the MPI program to AMPI is privatization of global variables. Typical Fortran MPI programs contain three types of global variables.
With the MPI process model, each MPI process keeps its own copy of permanent variables -- variables that are accessible from more than one subroutine without being passed as arguments. Module variables, saved subroutine local variables, and common blocks in Fortran 90 belong to this category. If such a program is executed without privatization on AMPI, all the AMPI threads residing on one processor access the same copy of these variables, which is clearly not the desired semantics. To ensure correct execution of the original source program, such variables must be made private to individual threads.
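For concreteness, here is a minimal sketch of the three kinds of permanent variables that require privatization; the names (globals, accumulate, params, useCommon) are hypothetical and not part of Gen1:

MODULE globals
  ! Module variable: visible to every routine that USEs this module.
  INTEGER :: counter
END MODULE

SUBROUTINE accumulate(x)
  DOUBLE PRECISION :: x
  ! Saved local variable: retains its value across calls, so it is
  ! effectively global state.
  DOUBLE PRECISION, SAVE :: total = 0.0
  total = total + x
END SUBROUTINE

SUBROUTINE useCommon
  ! Common block: storage shared by every routine that declares it.
  INTEGER :: nsteps
  COMMON /params/ nsteps
  nsteps = nsteps + 1
END SUBROUTINE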
While experimenting with Gen1, we have employed two strategies for this transformation. The first is argument passing: the global variables are bundled together in a single user-defined type, which each thread allocates dynamically. A pointer to this type is then passed from subroutine to subroutine as an argument. Since subroutine arguments are passed on the stack, which is not shared across threads, each subroutine, when executing within a thread, operates on a private copy of the global variables.
The second method we have employed is called dimension increment. Here, the dimension of each global data item is increased by one, and the added dimension is used to access thread-private data by indexing it with the thread number (which can be obtained with the MPI_Comm_rank subroutine). This scheme has the distinct disadvantage of wasting space, so it should be used only for small global data. It has a further disadvantage, that of false sharing, if the global data items are not aligned properly. Because of these disadvantages, we do not recommend using this scheme at all.
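For illustration only, the following is a sketch of the dimension-increment scheme applied to a hypothetical module; the names globaldata, work, and updateWork, as well as the fixed thread count NTHREADS, are assumptions and not taken from Gen1:

MODULE globaldata
  INTEGER, PARAMETER :: NTHREADS = 64   ! assumed upper bound on AMPI threads
  ! The added last dimension selects the calling thread's private slice.
  DOUBLE PRECISION :: work(100, NTHREADS)
END MODULE

SUBROUTINE updateWork
  USE globaldata
  INCLUDE 'mpif.h'
  INTEGER :: i, myrank, ierr
  CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  DO i = 1, 100
    ! Index the added dimension with rank + 1 (Fortran arrays start at 1).
    work(i, myrank + 1) = work(i, myrank + 1) + 1.0
  END DO
END SUBROUTINE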
The scheme we have used in transforming Gen1 to AMPI is demonstrated in the following examples. The original Fortran 90 code contains a module shareddata. This module is used in the main program and a subroutine subA.
MODULE shareddata
  INTEGER :: myrank
  DOUBLE PRECISION :: xyz(100)
END MODULE

PROGRAM MAIN
  USE shareddata
  include 'mpif.h'
  INTEGER :: i, ierr
  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  DO i = 1, 100
    xyz(i) = i + myrank
  END DO
  CALL subA
  CALL MPI_Finalize(ierr)
END PROGRAM

SUBROUTINE subA
  USE shareddata
  INTEGER :: i
  DO i = 1, 100
    xyz(i) = xyz(i) + 1.0
  END DO
END SUBROUTINE
AMPI executes the main program inside a user-level thread as a subroutine. For this purpose, the main program needs to be converted into a subroutine named MPI_Main.
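As a minimal sketch of this step in isolation, only the program header and ending change; the body of the original main program is left untouched (the elided statements are those shown above):

SUBROUTINE MPI_Main
  USE shareddata
  include 'mpif.h'
  INTEGER :: i, ierr
  CALL MPI_Init(ierr)
  ! ... body of the original main program, unchanged ...
  CALL MPI_Finalize(ierr)
END SUBROUTINE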
Now we transform this program using the first strategy. We first group the shared data into a user-defined type.
MODULE shareddata
  TYPE chunk
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END TYPE
END MODULE
Now we modify the main program to dynamically allocate this data and change the references to it. Subroutine subA is then modified to take this data as an argument.
SUBROUTINE MPI_Main
  USE shareddata
  USE AMPI
  INTEGER :: i, ierr
  TYPE(chunk), pointer :: c
  CALL MPI_Init(ierr)
  ALLOCATE(c)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
  DO i = 1, 100
    c%xyz(i) = i + c%myrank
  END DO
  CALL subA(c)
  CALL MPI_Finalize(ierr)
END SUBROUTINE

SUBROUTINE subA(c)
  USE shareddata
  TYPE(chunk) :: c
  INTEGER :: i
  DO i = 1, 100
    c%xyz(i) = c%xyz(i) + 1.0
  END DO
END SUBROUTINE
With these changes, the program is now thread-safe. Note that it is not strictly necessary to dynamically allocate chunk: one could instead declare it as a local variable in subroutine MPI_Main (or, for a small example such as this, simply remove the shareddata module and declare both xyz and myrank as local variables), as sketched below. This is a good idea when the shared data are small. For large shared data, heap allocation is preferable because in AMPI the stack sizes are fixed at startup (they can be specified on the command line) and stacks do not grow dynamically.
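For example, a sketch of the stack-allocated alternative (a reasonable choice only when the shared data are small):

SUBROUTINE MPI_Main
  USE shareddata
  USE AMPI
  INTEGER :: i, ierr
  TYPE(chunk) :: c   ! lives on the thread's private stack instead of the heap
  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
  DO i = 1, 100
    c%xyz(i) = i + c%myrank
  END DO
  CALL subA(c)
  CALL MPI_Finalize(ierr)
END SUBROUTINE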
With either allocation strategy, the program is thread-safe and ready to run with AMPI.