Blue Gene Emulator
                                                                       last updated: Jan 22, 2002


What's New:

1-20-2002: There are a few changes to the Blue Gene emulator API. The new API makes it easier to port other programming models, including Charm++, to the emulator. See the Changes section for details.

3-25-2001: The Blue Gene emulator has been completely rewritten on top of Converse instead of Charm++, while the API supported by the original emulator is kept without major changes. Converse is a lower-layer communication library, so building the emulator on it avoids cross-layer overhead and achieves better performance. Switching to the Converse Blue Gene emulator also allows the Charm++ parallel language to be ported onto the emulator.

   New features have also been added in the Converse Blue Gene emulator, including thread-committed messages that can be sent to a specific thread in a Blue Gene node, and broadcast at both the Blue Gene node level and the thread level.

Changes:

1-20-2002: In the new API, while most of the function calls remain the same, there are a few changes:

(i)   The message format has changed. A user-defined message must now include CmiBlueGeneMsgHeaderSizeBytes bytes of pre-allocated space as the message header, which is used by the runtime communication library.

(ii)  In the new API, once you pass a message to a packet send call, you can no longer use or free the message in user code. The system holds the message pointer until it is done with it. This transfer of ownership avoids buffer copying in the emulator and improves performance.

(iii) The handler registration calls must be made in BgNodeStart. This is because in the new API the handler table is created for each Blue Gene node, instead of one for each emulator machine.

(iv)  Node private variable macros (Bnv) are now implemented. Node private variables can be declared individually, instead of being declared together in one big data structure and accessed through functions.
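Changes (i) and (ii) can be pictured with a small sketch. It is only an illustration under stated assumptions: the header size of 16 is a placeholder so the code compiles on its own, the header is assumed to sit at the start of the message, malloc stands in for whatever allocator the emulator expects, and RingMsg, ring_msg_new and fake_send_and_release are hypothetical names that are not part of the emulator API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder so the sketch compiles without the emulator headers.
 * The real value comes from the emulator; 16 is NOT the real size. */
#ifndef CmiBlueGeneMsgHeaderSizeBytes
#define CmiBlueGeneMsgHeaderSizeBytes 16
#endif

/* Hypothetical user message: reserved header space, then the payload.
 * Placing the header first is an assumption of this sketch. */
typedef struct {
    char   header[CmiBlueGeneMsgHeaderSizeBytes]; /* reserved for the runtime */
    int    iteration;                             /* user payload */
    double value;                                 /* user payload */
} RingMsg;

/* Allocate a message on the heap so that ownership can be handed over. */
RingMsg *ring_msg_new(int iteration, double value) {
    RingMsg *m = malloc(sizeof *m);
    assert(m != NULL);
    memset(m->header, 0, sizeof m->header);
    m->iteration = iteration;
    m->value = value;
    return m;
}

/* Stand-in for a packet send call: it takes ownership of the message.
 * Clearing the caller's pointer makes accidental reuse fail fast. */
void fake_send_and_release(RingMsg **msg) {
    assert(msg != NULL && *msg != NULL);
    free(*msg);   /* a real runtime would release it once it is done */
    *msg = NULL;
}
```

After the send call the caller's pointer is no longer valid, so any data needed later should be copied out of the message before sending it.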

Objectives:

The Blue Gene emulator environment is designed with the following objectives:

(i)   To support a realistic Blue Gene API on existing parallel machines.

(ii)   To obtain first-order performance estimates of algorithms.

(iii) To facilitate implementations of alternate programming models for Blue Gene; Charm++ can be one of the parallel languages on top of the emulator.

The "Blue Gene" machine supported by the emulator consists of a three-dimensional grid of 1-chip nodes. The user may specify the size of the machine along each dimension (e.g. 34x34x36). Each chip supports k threads (e.g. 200), each with its own integer unit. The proximity of an integer unit to individual memory modules within a chip is not currently modeled.
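To get a feel for the scale of such a machine, the node and thread counts follow directly from the three dimensions and k. The sketch below is plain arithmetic; MachineShape and the two functions are hypothetical helpers, not emulator calls.

```c
#include <assert.h>

/* Hypothetical description of an emulated machine. */
typedef struct {
    int sx, sy, sz;        /* grid size along each dimension */
    int threads_per_node;  /* k threads on each 1-chip node */
} MachineShape;

/* Number of nodes in the three-dimensional grid. */
long machine_node_count(MachineShape m) {
    assert(m.sx > 0 && m.sy > 0 && m.sz > 0);
    return (long)m.sx * m.sy * m.sz;
}

/* Total number of threads across the whole machine. */
long machine_thread_count(MachineShape m) {
    assert(m.threads_per_node > 0);
    return machine_node_count(m) * m.threads_per_node;
}
```

For the 34x34x36 example with 200 threads per chip, this gives 41,616 nodes and 8,323,200 threads.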

The API supported by the emulator can be broken down into several components:

Level 0: Low-level API for chip-to-chip communication

Level 1a: Mid-level API that supports local micro-tasking with a chip level scheduler

Level 1b:  Features such as read-only variables, reductions, broadcasts, distributed tables, and get/put operations

Level 2:  Migratable objects with automatic load balancing support

Of these, the first two (Level 0 and Level 1a) have been implemented. The simple time stamping algorithm, without error correction, has also been implemented. More sophisticated timing algorithms, specifically aimed at error correction, more sophisticated features (1b, 2 and others), and libraries of commonly needed parallel operations are part of the proposed future work.

The following sections define the appropriate parts of the API, with example programs and instructions for executing them.
 
 

Blue Gene Programming Environment

The basic philosophy of the Blue Gene Emulator is to hide the intricate details of the Blue Gene machine from the application developer. The application developer needs to provide only the initialization details (the Blue Gene dimensions and the number of communication and worker threads) and the handler functions, and gets results as though running on a real machine. Communication, thread creation, time stamping, etc. are done by the emulator.

Blue Gene API: Level 0

void addBgNodeInbuffer(bgMsg *msgPtr, int nodeID)
(low-level primitive invoked by the Blue Gene emulator to put a message into the inbuffer queue of a node.)
                msgPtr - pointer to the message to be sent to the target node
                nodeID - node ID of the target node, i.e. the serial number of the Blue Gene node within the emulator's physical node

void addBgThreadMessage(bgMsg *msgPtr, int threadID)
(add a message to a thread's affinity queue; these messages can be executed only by the specific thread indicated by threadID.)

void addBgNodeMessage(bgMsg *msgPtr)
(add a message to a node's non-affinity queue; these messages can be executed by any thread in the node.)

CmiHandler msgHandlerFunc(char *msg)
(handler to process the msg)

void BgSendNonLocalPacket(int x, int y, int z, int threadID, int handlerID, WorkType type, int numbytes, char * data)
(chip-to-chip communication function; it sends a message to Node[x][y][z].)
            threadID   - thread that must execute the message (affinity message); -1 means any thread
            handlerID - ID of the handler which executes on this message
            type            - defines whether the handler is to be executed by a communication thread or a worker thread
            numbytes  - size of the message
            data           - pointer to the message to be sent
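The threadID convention can be read as a routing decision between a thread's affinity queue and the node's non-affinity queue. The sketch below is not the emulator's implementation; it only models the decision with counters, and NodeQueues, ANY_THREAD, MAX_THREADS and enqueue_by_affinity are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define ANY_THREAD  (-1)  /* the "-1 means any thread" convention */
#define MAX_THREADS 8     /* hypothetical per-node limit for this sketch */

/* Hypothetical model of one node's queues, tracking only their lengths. */
typedef struct {
    int num_threads;
    int affinity_pending[MAX_THREADS]; /* one affinity queue per thread */
    int shared_pending;                /* the node's non-affinity queue */
} NodeQueues;

/* Route one message. Returns the thread whose affinity queue received it,
 * or ANY_THREAD when it went to the non-affinity queue. */
int enqueue_by_affinity(NodeQueues *n, int threadID) {
    assert(n != NULL);
    if (threadID == ANY_THREAD) {
        n->shared_pending++;           /* any thread in the node may run it */
        return ANY_THREAD;
    }
    assert(threadID >= 0 && threadID < n->num_threads);
    n->affinity_pending[threadID]++;   /* only this thread may run it */
    return threadID;
}
```

A message sent with threadID 2 ends up where only thread 2 will pick it up, while a message sent with -1 can be taken by whichever thread becomes free first.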

void BgSendLocalPacket(int threadID, int handlerID, WorkType type, int numbytes, char * data)
(create a micro-task, i.e. work for some thread in the same node as the invoking thread.
Arguments have the same meaning as those of BgSendNonLocalPacket described above.)

boolean checkReady()
(invoked by communication thread to see if there is any unattended message in the inBuffer.)

bgMsg * getFullBuffer()
(invoked by communication thread to retrieve the unattended message in inBuffer.)

typedef void (*BgHandler)(void*)
(It represents a handler function that returns nothing and takes a (void *))

Initialization API: Level 1a

        All the functions defined in API Level 0 are used internally to implement Blue Gene node communication and worker threads. The functions defined from this level on are exposed to users for writing Blue Gene programs on the emulator.

        Because the emulator machine emulates several Blue Gene nodes on each physical node, the emulator program defines the function BgEmulatorInit(int argc, char **argv) to initialize each emulator node. In this function, the user program can define the Blue Gene machine size and the number of communication/worker threads, and check the command line arguments.

        The size of the Blue Gene machine being emulated and the number of threads per node are determined either by the command line arguments or by calling the following functions.
void BgSetSize(int sx, int sy, int sz);
( set Blue Gene Machine size.)
void BgSetNumWorkThread(int num);
( set number of worker threads per node.)
void BgSetNumCommThread(int num);
( set number of communication threads per node.)

  User message handler functions are registered with the Blue Gene emulator via:
int BgRegisterHandler(BgHandler h);
(registers a handler h and returns the global identifier for that handler)
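One way to picture registration is as appending a function pointer to a table and returning its index, which a send call can later name as handlerID. The sketch below assumes that model; it is not the emulator's code. The BgHandler typedef is the one given in the Level 0 section, while HandlerTable, MAX_HANDLERS, table_register, table_dispatch and count_handler are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*BgHandler)(void *);  /* typedef as given in this document */

#define MAX_HANDLERS 32             /* hypothetical limit for this sketch */

/* Hypothetical handler table; the new API keeps one per Blue Gene node. */
typedef struct {
    BgHandler fn[MAX_HANDLERS];
    int       count;
} HandlerTable;

/* Store the handler and return its identifier (its index in the table). */
int table_register(HandlerTable *t, BgHandler h) {
    assert(t != NULL && h != NULL);
    assert(t->count < MAX_HANDLERS);
    t->fn[t->count] = h;
    return t->count++;
}

/* Look up a handler by identifier and run it on a message. */
void table_dispatch(const HandlerTable *t, int handlerID, void *msg) {
    assert(t != NULL);
    assert(handlerID >= 0 && handlerID < t->count);
    t->fn[handlerID](msg);
}

/* Example handler: treats the message as an int counter and increments it. */
void count_handler(void *msg) {
    ++*(int *)msg;
}
```

If every node registers the same handlers in the same order, the identifiers agree across nodes, which is what lets a sender name a handler on a remote node by number.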
 

For each Blue Gene node, execution starts at BgNodeStart(int argc, char **argv), which the emulator calls for each Blue Gene node. Application handlers can be registered there, and computation is triggered by creating a task at the required nodes.

Similar to pthread's thread specific data, each Blue Gene node can have its own node specific data associated with it. To do this, the user defines the node specific variables encapsulated in a struct definition and registers the pointer to the data with the emulator using the following function:

void BgSetNodeData(char *data);

To retrieve the node specific data, call:
char *BgGetNodeData();

A set of Bnv macros is implemented to add flexibility to declaring and using node private data:
BnvDeclare(int, data);
BnvStaticDeclare(int, data);
BnvInitialize(int, data);
BnvExtern(int, data);
BnvAccess(data);

After completion of execution, the user program invokes:
void BgShutdown()

Handler Function API: Level 1a

The following functions can be called in the user's application program to retrieve the Blue Gene machine information, get the thread execution time, and perform communication.
void BgGetSize(int *sx, int *sy, int *sz);
int BgGetNumWorkThread();
int BgGetNumCommThread();
int BgGetThreadID();
int BgGetGlobalThreadID();
double BgGetTime();
void BgSendPacket(int x, int y, int z, int threadID, int handlerID, WorkType type, int numbytes, char* data);
 
 

Writing a Blue Gene Application

Application Skeleton

Handler declarations
Struct definitions encapsulating Node (specific) variables

void  BgEmulatorInit(int argc, char **argv)
            (Set the Blue Gene machine configuration parameters, including the machine size and the node thread configuration.)

void *BgNodeStart(int argc, char **argv)
             (Register the handlers in this function, since the handler table is created for each Blue Gene node.
            The usual practice is then to send an initial message to trigger the execution.
            You can also register node specific data in this function.)

Handler Function 1, void handlerName(char *info)
Handler Function 2, void handlerName(char *info)
..
Handler Function N, void handlerName(char *info)
----------------------------------------------------------
     sample application 1
     /* Application: Each node starting at [0,0,0] sends a packet to next node in the ring order.
      *                          After node [0,0,0] gets message from last node in the ring, first iteration ends.
      *                          After doing 20000 iterations the execution ends.
      */
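The "next node in the ring order" step of sample application 1 can be sketched as follows. The ordering is an assumption (x varies fastest, then y, then z, wrapping back to [0,0,0]); the sample's actual ordering is defined by its source code. Coord, ring_next and ring_length are hypothetical names.

```c
#include <assert.h>

/* Hypothetical node coordinate in the three-dimensional grid. */
typedef struct { int x, y, z; } Coord;

/* Next node in the ring, assuming x varies fastest, then y, then z.
 * The last node wraps around to [0,0,0]. */
Coord ring_next(Coord c, int sx, int sy, int sz) {
    assert(sx > 0 && sy > 0 && sz > 0);
    assert(c.x >= 0 && c.x < sx);
    assert(c.y >= 0 && c.y < sy);
    assert(c.z >= 0 && c.z < sz);
    if (++c.x == sx) {
        c.x = 0;
        if (++c.y == sy) {
            c.y = 0;
            if (++c.z == sz) {
                c.z = 0;
            }
        }
    }
    return c;
}

/* Number of hops in one iteration: start at [0,0,0] and follow the ring
 * until the message arrives back at [0,0,0]. */
int ring_length(int sx, int sy, int sz) {
    Coord c = { 0, 0, 0 };
    int hops = 0;
    do {
        c = ring_next(c, sx, sy, sz);
        ++hops;
    } while (c.x != 0 || c.y != 0 || c.z != 0);
    return hops;
}
```

Under this ordering, one iteration visits every node exactly once, so it takes as many hops as there are nodes.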

    sample application 2
    /* Application: Find the maximum element.
    *                          Each node computes the maximum of its elements and the max values it received from other nodes
    *                          and sends the result to next node in the reduction sequence.
    *
    *                          Reduction Sequence: Reduce max data to X-Y Plane
    *                                                                  Reduce max data to Y Axis
    *                                                                  Reduce max data to origin.
    */
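The three-stage reduction sequence of sample application 2 can be simulated sequentially on an array standing in for the grid. This is a sketch under assumptions: one value per node, a fixed 3x2x4 grid, and reduction along x in the second stage to reach the Y axis. The function name and the SX, SY, SZ constants are hypothetical, and no message passing is modeled.

```c
#include <assert.h>

/* Hypothetical fixed grid size for this sketch. */
#define SX 3
#define SY 2
#define SZ 4

/* Reduce the maximum of all values to the origin in three stages.
 * The grid is modified in place; grid[0][0][0] holds the result. */
int reduce_max_3d(int grid[SX][SY][SZ]) {
    int x, y, z;
    assert(grid != 0);

    /* Stage 1: reduce along z onto the X-Y plane (z == 0). */
    for (x = 0; x < SX; ++x)
        for (y = 0; y < SY; ++y)
            for (z = 1; z < SZ; ++z)
                if (grid[x][y][z] > grid[x][y][0])
                    grid[x][y][0] = grid[x][y][z];

    /* Stage 2: reduce along x onto the Y axis (x == 0, z == 0). */
    for (y = 0; y < SY; ++y)
        for (x = 1; x < SX; ++x)
            if (grid[x][y][0] > grid[0][y][0])
                grid[0][y][0] = grid[x][y][0];

    /* Stage 3: reduce along y to the origin. */
    for (y = 1; y < SY; ++y)
        if (grid[0][y][0] > grid[0][0][0])
            grid[0][0][0] = grid[0][y][0];

    return grid[0][0][0];
}
```

Each stage shrinks the set of nodes still holding a partial result (whole grid, then a plane, then a line, then a single node), which is why the origin ends up with the global maximum.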

     sample application 3
    /*  Application: Find the number of primes in a given range
     */
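A natural way to spread sample application 3 over the machine is to split the range into contiguous blocks, count primes in each block independently, and add up the counts. The sketch below assumes that block partitioning and trial division; the sample's actual scheme is defined by its source code. All function names here are hypothetical, and the per-node loop runs sequentially instead of sending messages.

```c
#include <assert.h>

/* Trial-division primality test; adequate for a small illustration. */
int is_prime(long n) {
    long d;
    if (n < 2) return 0;
    for (d = 2; d * d <= n; ++d)
        if (n % d == 0) return 0;
    return 1;
}

/* Count primes in the inclusive range [lo, hi]; an empty range gives 0. */
long count_primes(long lo, long hi) {
    long n, count = 0;
    for (n = lo; n <= hi; ++n)
        count += is_prime(n);
    return count;
}

/* Inclusive sub-range handled by node `rank` out of `nodes`, using a block
 * partition that gives the first (total % nodes) nodes one extra element. */
void node_subrange(long lo, long hi, int rank, int nodes,
                   long *sub_lo, long *sub_hi) {
    long total, base, extra;
    assert(nodes > 0 && rank >= 0 && rank < nodes && lo <= hi);
    total = hi - lo + 1;
    base = total / nodes;
    extra = total % nodes;
    *sub_lo = lo + rank * base + (rank < extra ? rank : extra);
    *sub_hi = *sub_lo + base + (rank < extra ? 1 : 0) - 1;
}

/* Sum of the per-node counts, standing in for a reduction over the nodes. */
long count_primes_partitioned(long lo, long hi, int nodes) {
    long total = 0, a, b;
    int rank;
    for (rank = 0; rank < nodes; ++rank) {
        node_subrange(lo, hi, rank, nodes, &a, &b);
        total += count_primes(a, b);
    }
    return total;
}
```

Because the blocks do not overlap and together cover the whole range, the sum of the per-node counts equals the count over the full range, whatever the number of nodes.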

Compiling and Running

Download the source code, and see the README and Makefile for instructions on compiling and running.