Charm++ on the Cell Processor

Charm++
We believe that the Charm++ programming model is a good fit for the Cell processor for several reasons, including data encapsulation, virtualization, and the ability to peek ahead in the message queue.
  • Data Encapsulation: In the Charm++ programming model, the application is broken down into objects called chares, which communicate with one another by sending messages. This encapsulation of data (within the chare and within the arriving message) gives Charm++ applications spatial locality that we can take advantage of when porting to the Cell processor. In particular, it makes it easy for the Charm++ runtime system to identify the data an entry method needs and to DMA that data to and from the SPEs' local stores. Furthermore, chares tend to be small (in both data and code), which allows multiple chares to fit within an SPE's local store at a time, even though each local store is limited in size (currently 256KB).
  • Message Queue Peek-Ahead: As messages arrive at a processor, the Charm++ runtime system queues them until it is ready to process them. When a message arrives, the Charm++ runtime system knows which chare object the message is for and which entry method is to be executed. In the case of the Cell processor, the Charm++ runtime system can peek ahead in the message queue. If it finds a message (along with its associated chare object and entry method) that can be offloaded onto one of the SPEs, it can schedule the execution of the entry method on that SPE rather than on the PPE itself. Because the message has already arrived, the entry method has to run eventually, so the work done on the SPE is known to be useful computation.
  • Virtualization: The idea of virtualization is an important part of the Charm++ programming model. Here, virtualization refers to having many chare objects per physical processor. The idea is that, at any given moment, at least one chare object should have a message waiting in the message queue (and thus be ready to execute). As this chare object executes, messages that other chare objects are waiting on will arrive and be queued by the Charm++ runtime system. This is the mechanism by which the Charm++ runtime system overlaps communication and computation and thus effectively hides the cost of sending messages over the interconnect. The same idea can be extended to the DMA transactions in the Cell architecture. For any given SPE, one chare can be executing on the SPE while another chare's input data (and code) is being DMA'ed into the SPE's local store. Once the SPE is done executing the first chare, it can immediately start executing the second chare, since its data has already arrived. While the SPE is executing the second chare, the DMA controller for the SPE can be moving the results of the first chare back to main memory. This effectively hides the latency of the DMA transactions needed to move data between the SPEs and system memory.
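The latency hiding described in the Virtualization item can be illustrated with a small timing model. The following is only a sketch under stated assumptions: the names `ChareCost`, `serial_time`, and `overlapped_time` are hypothetical and are not part of Charm++, the time units are arbitrary, and DMA transfers are assumed not to slow down the SPE or one another. It compares running each chare's DMA-in, execution, and DMA-out strictly in sequence against prefetching the next chare's input while the current chare executes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-chare costs in arbitrary time units (not measured values).
struct ChareCost {
    int dma_in;   // time to DMA input data (and code) into the local store
    int exec;     // time the SPE spends executing the entry method
    int dma_out;  // time to DMA the results back to main memory
};

// No overlap: each chare is fetched, executed, and written back in turn.
int serial_time(const std::vector<ChareCost>& chares) {
    int total = 0;
    for (const ChareCost& c : chares) {
        assert(c.dma_in >= 0 && c.exec >= 0 && c.dma_out >= 0);
        total += c.dma_in + c.exec + c.dma_out;
    }
    return total;
}

// Overlapped: the DMA-in for a chare starts when the previous chare begins
// executing, and each DMA-out proceeds while later chares execute.
// Assumption: DMA transfers do not slow the SPE or one another.
int overlapped_time(const std::vector<ChareCost>& chares) {
    int prev_exec_start = 0;  // when the previous chare began executing
    int prev_exec_done = 0;   // when the previous chare finished executing
    int finish = 0;           // latest DMA-out completion seen so far
    for (const ChareCost& c : chares) {
        assert(c.dma_in >= 0 && c.exec >= 0 && c.dma_out >= 0);
        int in_done = prev_exec_start + c.dma_in;
        // The SPE needs both the input data and a free execution slot.
        int exec_start = std::max(in_done, prev_exec_done);
        int exec_done = exec_start + c.exec;
        finish = std::max(finish, exec_done + c.dma_out);
        prev_exec_start = exec_start;
        prev_exec_done = exec_done;
    }
    return finish;
}
```

For three identical chares with costs `{2, 5, 2}`, `serial_time` returns 27 while `overlapped_time` returns 19: only the first DMA-in and the last DMA-out remain exposed in this model, and the other transfers are hidden behind execution.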

  • Notes for building Charm++ on the Cell architecture can be found here.
    Offload API
    We have developed an interface called the Offload API, which will be used by the Charm++ runtime system to offload entry method execution onto the SPEs. The Offload API is independent of Charm++. That is, one can write an application using the Offload API directly without using Charm++. However, the design of the Offload API has been specifically geared towards the needs of the Charm++ runtime system.

    In the Offload API model, the computation-heavy portions of an application are broken down into chunks called work requests. Each work request can have multiple input and output buffers. On each SPE, there is a small SPE Runtime that executes continuously. When the application code creates a work request via the Offload API on the PPE, the Offload API decides which SPE the work request should be executed on and then passes the work request to that SPE. The SPE Runtime then takes care of moving the input data, allocating memory in the local store, executing the work request, and eventually moving the results of the work request back into system memory. The life of a work request is depicted in Figure 1.
    Figure 1: Work Request Flow
    (Note: Figure not drawn to scale.)
    [1] : The application code on the PPE issues a work request to the Offload API. The Offload API decides which SPE should execute the work request and sends the work request to that SPE.
    [2] : The SPE Runtime notices that it has a new work request and issues a DMA-Get to bring the input data from system memory into the SPE's local store.
    [3] : The DMA controller for the SPE moves the input data. During this time, the SPE is free to do other work, including executing another work request.
    [4] : Once the input data for the work request has arrived, the SPE is free to execute the work request. Once the work request has been executed, the SPE Runtime issues a DMA-Put to place the results into system memory.
    [5] : The DMA controller for the SPE moves the results. During this time, the SPE is free to do other work, including executing another work request.
    [6] : Once the DMA-Put has finished moving the data, the SPE Runtime notifies the PPE that the work request has been completed.
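The steps in Figure 1 can be sketched as a small state machine. This is an illustration only, not the real Offload API: every name here (`WorkState`, `Buffer`, `WorkRequest`, `Dispatcher`, `next_state`, `local_store_bytes`) is hypothetical. The round-robin choice of SPE is also an assumption made for the example, since the text above does not say how the Offload API picks an SPE.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is not the real Offload API.
enum class WorkState {
    Issued,         // [1] the PPE issued the request and an SPE was chosen
    FetchingInput,  // [2]-[3] DMA-Get of the input data is in flight
    Executing,      // [4] input has arrived and the SPE runs the request
    StoringOutput,  // [4]-[5] DMA-Put of the results is in flight
    Completed       // [6] the PPE has been notified
};

struct Buffer {
    void* data;         // address of the buffer in system memory
    std::size_t bytes;  // size of the buffer
};

// A work request may carry multiple input and output buffers.
struct WorkRequest {
    std::vector<Buffer> inputs;
    std::vector<Buffer> outputs;
    int spe = -1;  // SPE chosen for this request, -1 until issued
    WorkState state = WorkState::Issued;
};

// Advance a request by one step of its life cycle.
WorkState next_state(WorkState s) {
    switch (s) {
        case WorkState::Issued:        return WorkState::FetchingInput;
        case WorkState::FetchingInput: return WorkState::Executing;
        case WorkState::Executing:     return WorkState::StoringOutput;
        case WorkState::StoringOutput: return WorkState::Completed;
        case WorkState::Completed:     return WorkState::Completed;
    }
    return s;  // not reached for valid states
}

// Bytes of buffer space the request needs in the SPE's local store.
std::size_t local_store_bytes(const WorkRequest& wr) {
    std::size_t total = 0;
    for (const Buffer& b : wr.inputs) total += b.bytes;
    for (const Buffer& b : wr.outputs) total += b.bytes;
    return total;
}

// Assumption: SPEs are chosen round-robin; the real policy may differ.
class Dispatcher {
public:
    explicit Dispatcher(int num_spes) : num_spes_(num_spes) {
        assert(num_spes > 0);
    }
    void issue(WorkRequest& wr) {
        wr.spe = next_spe_;
        next_spe_ = (next_spe_ + 1) % num_spes_;
        wr.state = WorkState::Issued;
    }
private:
    int num_spes_;
    int next_spe_ = 0;
};
```

Because a request in `FetchingInput` or `StoringOutput` is waiting only on the DMA controller, an SPE runtime built along these lines could run a different request that is in the `Executing` state during that time, which is the overlap described in steps [3] and [5].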

    While the Offload API is independent of Charm++, it is distributed as part of the Charm++ distribution. Currently, only the nightly build of Charm++ includes the Offload API.


    For more information on Charm++ on Cell and the Offload API, please refer to the papers and posters listed below.
     
    Papers
    • 06-16  David Kunzman, Charm++ on the Cell Processor, Master's Thesis, Department of Computer Science, University of Illinois, 2006.
    • 06-14  David Kunzman, Gengbin Zheng, Eric Bohm, Laxmikant V. Kale, Charm++, Offload API, and the Cell Processor, In PMUP Workshop at PACT'06, September 2006.
    Posters
    • 06-03  Charm++ Simplifies Programming for the Cell Processor, David Kunzman, Gengbin Zheng, Eric Bohm, Laxmikant V. Kale.
    • 06-01  Charm++ on Cell, David Kunzman, Gengbin Zheng, Eric Bohm, Laxmikant Kale.

    This page is maintained by David Kunzman.