Deciding the characteristics of an ideal programming environment for a massively parallel machine like Blue Gene is challenging, because dealing with tens of thousands or even millions of processors requires a qualitative change in both the programming environment and the runtime system.
To address this, we have developed a multi-tier programming model in which an object layer forms the middle tier. It is supported from below by a low-level explicit model and in turn supports higher-level components, including domain-specific languages and libraries. This section briefly describes the middle and lower layers; the higher level is presented in Section 4.1.
The lowest-level model strives to provide direct access to the machine's capabilities. In the programmer's view, each node consists of a number of hardware-supported threads with common shared memory. A runtime library call allows a thread to send a short message to a destination node. The header of each message encodes a handler function to be invoked at the destination. A designated number of threads continuously monitor the incoming buffer for arriving messages, extract them, and invoke the designated handler function. We believe this low-level abstraction of petaflops architectures is general enough to encompass a wide variety of parallel machines with different numbers of processors and co-processors on each node.
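The message-and-handler scheme described above can be sketched in a few lines of Python. This is only an illustrative toy, not the emulator's API: the names `Node`, `register`, `send`, and `run_demo` are hypothetical, a `queue.Queue` stands in for the incoming buffer, and ordinary Python threads stand in for hardware-supported threads.

```python
import queue
import threading


class Node:
    """Toy model of one node: an incoming buffer plus polling threads."""

    def __init__(self, node_id, num_workers=2):
        self.node_id = node_id
        self.inbox = queue.Queue()   # stands in for the incoming message buffer
        self.handlers = {}           # handler id -> handler function
        self.workers = [
            threading.Thread(target=self._poll, daemon=True)
            for _ in range(num_workers)
        ]

    def register(self, handler_id, fn):
        """Associate a handler id (carried in the message header) with a function."""
        self.handlers[handler_id] = fn

    def start(self):
        for worker in self.workers:
            worker.start()

    def _poll(self):
        # Each designated thread repeatedly extracts a message and
        # invokes the handler named in its header.
        while True:
            handler_id, payload = self.inbox.get()
            try:
                self.handlers[handler_id](self, payload)
            finally:
                self.inbox.task_done()


def send(dest, handler_id, payload):
    """Hypothetical send call: the 'header' here is just the handler id."""
    dest.inbox.put((handler_id, payload))


def run_demo():
    nodes = [Node(i) for i in range(2)]
    received = []
    lock = threading.Lock()

    def on_ping(node, payload):
        with lock:
            received.append((node.node_id, payload))

    for node in nodes:
        node.register(0, on_ping)
        node.start()

    send(nodes[1], 0, "hello")
    send(nodes[0], 0, "world")
    for node in nodes:
        node.inbox.join()            # wait until every message has been handled
    return sorted(received)
```

Calling `run_demo()` delivers one message to each of two nodes and returns the handled messages sorted by node id. Note that the sender names the destination node explicitly, which is exactly the burden the higher layers are meant to remove.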
We have developed a software emulator based on this low-level model. The details of the emulator and its API are presented in [3].
In this base-level model, the programmer must decide which computations run on which node. A higher-level programming environment relieves the application programmer of this burden of deciding where each subcomputation runs.
In this context, we have evaluated CHARM++ as a parallel programming language for petaflops machines and as an alternative to the popular MPI methodology. CHARM++ is an object-based, portable parallel programming language that embodies message-driven execution. A CHARM++ program consists of parallel objects and object arrays [4], which communicate via asynchronous method invocations. CHARM++ includes a powerful runtime system that supports automatic load balancing based on migratable objects. CHARM++ has been ported to the emulator, as described in [5].
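To make message-driven execution concrete, the following minimal sketch models asynchronous method invocation on an array of objects. It is not CHARM++ syntax and does not use the CHARM++ runtime; `Scheduler`, `RingElement`, and `run_ring` are hypothetical names, and a single sequential queue stands in for the distributed scheduler.

```python
from collections import deque


class Scheduler:
    """Toy message-driven scheduler: invocations are queued, not called directly."""

    def __init__(self):
        self.queue = deque()

    def invoke(self, obj, method, *args):
        # Asynchronous invocation: enqueue the request and return immediately.
        self.queue.append((obj, method, args))

    def run(self):
        # Deliver queued invocations one at a time until none remain.
        delivered = 0
        while self.queue:
            obj, method, args = self.queue.popleft()
            getattr(obj, method)(*args)
            delivered += 1
        return delivered


class RingElement:
    """One element of an object array; it only reacts to incoming invocations."""

    def __init__(self, index, array, sched):
        self.index = index
        self.array = array
        self.sched = sched
        self.visits = 0

    def pass_token(self, hops_left):
        self.visits += 1
        if hops_left > 0:
            nxt = self.array[(self.index + 1) % len(self.array)]
            self.sched.invoke(nxt, "pass_token", hops_left - 1)


def run_ring(n, hops):
    sched = Scheduler()
    array = []
    for i in range(n):
        array.append(RingElement(i, array, sched))
    sched.invoke(array[0], "pass_token", hops)
    delivered = sched.run()
    return delivered, [element.visits for element in array]
```

For example, `run_ring(4, 8)` circulates a token around four objects and returns the number of delivered invocations together with the per-object visit counts. Because the caller never names a processor, a runtime is free to place or migrate the objects, which is the property the load balancer relies on.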
Adaptive MPI (AMPI) is an implementation and extension of MPI, built on the CHARM++ message-driven runtime system, that supports processor virtualization [1]. AMPI implements virtual MPI processes (VPs), several of which may be mapped to a single physical processor. Taking advantage of CHARM++'s migratable objects, AMPI also supports adaptive load balancing by migrating MPI threads.
In this environment, conventional MPI is the special case of AMPI in which exactly one VP is mapped to each physical processor.
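The relationship between VPs and physical processors can be illustrated with a small sketch. The block mapping and the greedy rebalancing strategy below are assumptions chosen for illustration; they are not AMPI's actual mapping or load-balancing algorithms, and the function names `block_map`, `greedy_rebalance`, and `processor_loads` are hypothetical.

```python
import heapq


def block_map(num_vps, num_procs):
    """Assign VP ranks to processors in contiguous blocks."""
    return [vp * num_procs // num_vps for vp in range(num_vps)]


def greedy_rebalance(loads, num_procs):
    """Place the heaviest VPs first, each on the currently least-loaded processor.

    loads[vp] is the measured load of that VP; returns placement[vp] = processor.
    """
    heap = [(0, p) for p in range(num_procs)]   # (accumulated load, processor)
    heapq.heapify(heap)
    placement = [None] * len(loads)
    for vp in sorted(range(len(loads)), key=lambda v: -loads[v]):
        total, proc = heapq.heappop(heap)
        placement[vp] = proc
        heapq.heappush(heap, (total + loads[vp], proc))
    return placement


def processor_loads(loads, placement, num_procs):
    """Sum the VP loads assigned to each processor."""
    totals = [0] * num_procs
    for vp, proc in enumerate(placement):
        totals[proc] += loads[vp]
    return totals
```

With eight VPs on four processors, `block_map(8, 4)` places two VPs on each processor, whereas `block_map(4, 4)` yields the identity mapping, which corresponds to the plain MPI case of one VP per processor. Moving a VP from its old placement to the one returned by `greedy_rebalance` is the step that would correspond to migrating an MPI thread.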