AMPI utilizes the dynamic load balancing capabilities of Charm++ by associating a user-level thread with each Charm++ migratable object. The user's code runs inside this thread, so it can issue blocking receive calls as in MPI while still giving the underlying scheduler an opportunity to schedule other computations on the same processor. The runtime system keeps track of the computational load of each thread as well as the communication graph between AMPI threads, and can migrate these threads in order to balance the overall load while simultaneously minimizing communication overhead.
For dynamic load balancing to be effective, one needs to map multiple user-level threads onto each processor. Traditional MPI programs assume that the entire processor is allocated to them, and that only one thread of control exists within the process's address space. This is why the original MPI program must undergo some transformations in order to run correctly with AMPI.
The basic transformation needed to port the MPI program to AMPI is privatization of global variables. Typical Fortran MPI programs contain three types of global variables.
With the MPI process model, each MPI process keeps its own copy of permanent variables -- variables that are accessible from more than one subroutine without being passed as arguments. Module variables, saved subroutine local variables, and common blocks in Fortran 90 belong to this category. If such a program is executed without privatization on AMPI, all the AMPI threads residing on one processor access the same copy of these variables, which is clearly not the desired semantics. To ensure correct execution of the original source program, such variables must be made private to individual threads.
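For concreteness, here is a minimal sketch of the three kinds of permanent variables that require privatization; the names (globals, accumulate, params, useCommon) are hypothetical and not part of Gen1:

MODULE globals
  ! Module variable: visible to every routine that USEs this module.
  INTEGER :: counter
END MODULE

SUBROUTINE accumulate(x)
  DOUBLE PRECISION :: x
  ! Saved local variable: retains its value across calls, so it is
  ! effectively global state.
  DOUBLE PRECISION, SAVE :: total = 0.0
  total = total + x
END SUBROUTINE

SUBROUTINE useCommon
  ! Common block: storage shared by every routine that declares it.
  INTEGER :: nsteps
  COMMON /params/ nsteps
  nsteps = nsteps + 1
END SUBROUTINE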
While experimenting with Gen1, we have employed two strategies for this transformation. The first is argument passing: the global variables are bundled together in a single user-defined type, which each thread allocates dynamically. A pointer to this type is then passed from subroutine to subroutine as an argument. Since subroutine arguments are passed on the stack, which is not shared across threads, each subroutine, when executing within a thread, operates on a private copy of the global variables.
The second method we have employed is called dimension increment. Here, the dimension of each global data item is increased by one, and the added dimension is used to access thread-private data by indexing it with the thread number (which can be obtained with the MPI_Comm_rank subroutine). This scheme has the distinct disadvantage of wasting space, so it should be used only for small global data. It has a further disadvantage, that of false sharing, if the global data items are not aligned properly. Because of these disadvantages, we do not recommend using this scheme at all.
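For illustration only, the following is a sketch of the dimension-increment scheme applied to a hypothetical module; the names globaldata, work, and updateWork, as well as the fixed thread count NTHREADS, are assumptions and not taken from Gen1:

MODULE globaldata
  INTEGER, PARAMETER :: NTHREADS = 64   ! assumed upper bound on AMPI threads
  ! The added last dimension selects the calling thread's private slice.
  DOUBLE PRECISION :: work(100, NTHREADS)
END MODULE

SUBROUTINE updateWork
  USE globaldata
  INCLUDE 'mpif.h'
  INTEGER :: i, myrank, ierr
  CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  DO i = 1, 100
    ! Index the added dimension with rank + 1 (Fortran arrays start at 1).
    work(i, myrank + 1) = work(i, myrank + 1) + 1.0
  END DO
END SUBROUTINE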
The scheme we have used in transforming Gen1 to AMPI is demonstrated in the following examples. The original Fortran 90 code contains a module shareddata. This module is used in the main program and a subroutine subA.
MODULE shareddata
  INTEGER :: myrank
  DOUBLE PRECISION :: xyz(100)
END MODULE

PROGRAM MAIN
  USE shareddata
  include 'mpif.h'
  INTEGER :: i, ierr
  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
  DO i = 1, 100
    xyz(i) = i + myrank
  END DO
  CALL subA
  CALL MPI_Finalize(ierr)
END PROGRAM

SUBROUTINE subA
  USE shareddata
  INTEGER :: i
  DO i = 1, 100
    xyz(i) = xyz(i) + 1.0
  END DO
END SUBROUTINE
AMPI executes the main program inside a user-level thread as a subroutine. For this purpose, the main program needs to be converted into a subroutine named MPI_Main.
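As a minimal sketch of this step in isolation, only the program header and ending change; the body of the original main program is left untouched (the elided statements are those shown above):

SUBROUTINE MPI_Main
  USE shareddata
  include 'mpif.h'
  INTEGER :: i, ierr
  CALL MPI_Init(ierr)
  ! ... body of the original main program, unchanged ...
  CALL MPI_Finalize(ierr)
END SUBROUTINE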
Now we transform this program using the first strategy. We first group the shared data into a user-defined type.
MODULE shareddata
  TYPE chunk
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END TYPE
END MODULE
Now we modify the main program to dynamically allocate this data and change the references to it. Subroutine subA is then modified to take this data as an argument.
SUBROUTINE MPI_Main
  USE shareddata
  USE AMPI
  INTEGER :: i, ierr
  TYPE(chunk), pointer :: c
  CALL MPI_Init(ierr)
  ALLOCATE(c)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
  DO i = 1, 100
    c%xyz(i) = i + c%myrank
  END DO
  CALL subA(c)
  CALL MPI_Finalize(ierr)
END SUBROUTINE

SUBROUTINE subA(c)
  USE shareddata
  TYPE(chunk) :: c
  INTEGER :: i
  DO i = 1, 100
    c%xyz(i) = c%xyz(i) + 1.0
  END DO
END SUBROUTINE
With these changes, the program is now thread-safe. Note that it is not strictly necessary to dynamically allocate chunk: one could instead declare it as a local variable in subroutine MPI_Main (or, for a small example such as this, simply remove the shareddata module and declare both xyz and myrank as local variables), as sketched below. This is a good idea when the shared data are small. For large shared data, heap allocation is preferable because in AMPI the stack sizes are fixed at startup (they can be specified on the command line) and stacks do not grow dynamically.
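For example, a sketch of the stack-allocated alternative (a reasonable choice only when the shared data are small):

SUBROUTINE MPI_Main
  USE shareddata
  USE AMPI
  INTEGER :: i, ierr
  TYPE(chunk) :: c   ! lives on the thread's private stack instead of the heap
  CALL MPI_Init(ierr)
  CALL MPI_Comm_rank(MPI_COMM_WORLD, c%myrank, ierr)
  DO i = 1, 100
    c%xyz(i) = i + c%myrank
  END DO
  CALL subA(c)
  CALL MPI_Finalize(ierr)
END SUBROUTINE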
With either allocation strategy, the program is thread-safe and ready to run with AMPI.