Charm++ Tutorial
Parallel Programming Laboratory, UIUC

Data Driven Execution in Charm++
Object-based Parallelization
Charm++ features with simple examples
Chares and Chare Arrays
Parameter Marshalling
Structured Dagger Construct
Load Balancing

Technical Approach

Object - based Parallelization

Virtualization: Object-based Decomposition
Divide the computation into a large number of pieces
Independent of number of processors
Typically larger than number of processors
Let the system map objects  to processors

Chares – Concurrent Objects
Can be dynamically created on any available processor
Can be accessed from remote processors
Send messages to each other asynchronously
Contain “entry methods”

“Hello World!”

Compile and run the program

Data Driven Execution in Charm++

Charm++ solution: proxy classes
Proxy class generated for each chare class
For instance, CProxy_Y is the proxy class generated for chare class Y.
Proxy objects know where the real object is
Methods invoked on this object simply put the data in an “envelope” and send it out to the destination
Given a proxy p, you can invoke methods

Ring program

Chare Arrays

Slide 13

Slide 14

Sorting numbers
Sort n integers in increasing order.
Create n chares, each keeping one number.
In every odd iteration chares numbered 2i swaps with chare 2i+1 if required.
In every even iteration chares 2i swaps with chare 2i-1 if required.
After each iteration all chares report to the mainchare. After everybody reports mainchares signals next iteration. Sorting completes in n iterations.

Slide 16

Slide 17

Remember :
Message passing is asynchronous.
Messages can be delivered out of order.

Slide 19

Basic Entities in Charm++ Programs
Sequential Objects
 - ordinary sequential C++ code and objects
- concurrent objects
Chare Arrays
  - an indexed collection of chares

Illustrative example: Jacobi 1D

Illustrative example: Jacobi 1D
Input: 2D array of values with boundary  conditions
In each iteration, each array element is computed as the average of itself and its neighbors
Iterations are repeated till some threshold error value is reached

Jacobi 1D: Parallel Solution!

Jacobi 1D: Parallel Solution!
Slice up the 2D array into sets of columns
Chare = computations in one set
At the end of each iteration
Chares exchange boundaries
Determine maximum change in computation
Output result when threshold is reached

Arrays as Parameters
Array cannot be passed as pointer
specify the length of the array in the interface file
   entry void bar(int n,double arr[n])

Jacobi Code

Apply a single operation (add, max, min, ...) to data items scattered across many processors
Collect the result in one place
Reduce x across all elements
contribute(sizeof(x), &x,                         CkReduction::sum_int,processResult );
 Function “processResult()”
All contribute calls from one array must name the same function

A generic way to transfer control back to a client after a library has finished.
After finishing a reduction, the results have to be passed to some chare's entry method.
To do this, create an object of type CkCallback with chare's ID & entry method index
Different types of callbacks
One commonly used type:
  CkCallback cb(<chare’s entry method>,<chare’s proxy>);

void Ar1::doWork(int sendersID, int n, double arr[n])
//Code on previous slide
   if (((rightmsg == 1) && (leftmsg == 1)) || ((thisIndex == 0) &&   
   (rightmsg == 1)) || ((thisIndex ==K-1) && (leftmsg == 1)))
     // Both messages have been received and we can now
         compute the new values of the matrix
     // Use a reduction to find determine if all of the maximum
        errors on each processor had a maximum change that
        is below our threshold value.
  CkCallback cb(CkIndex_Ar1::doCheck(NULL),a1);
  contribute(sizeof(double), &maxChange, CkReduction::max_double, cb);

Types of Reductions
Predefined Reductions – A number of reductions are predefined, including ones that
Sum values or arrays
Calculate the product of values or arrays
Calculate the maximum contributed value
Calculate the minimum contributed value
Calculate the logical and of integer values
Calculate the logical or of contributed integer values
Form a set of all contributed values
Concatenate bytes of all contributed values
Plus, you can create your own

Structured Dagger
What is it?
A coordination language built on top of Charm++
To reduce the complexity of program development without adding any overhead

Structured Dagger Constructs
atomic {code} – Specifies that no structured dagger constructs appear inside of the code so it executes atomically.
overlap {code} – Enables all of its component constructs concurrently and can execute these constructs in any order.
when <entrylist> {code} – Specifies dependencies between computation and message arrival.

Structure Dagger Constructs Continued
if / else/ while / for – These are the same as their C++ conterparts, except that they can contain when blocks in their respective code segments.  Hence execution can be suspended while they wait for messages.
forall –  Functions like a for statement, but enables its component constructs for its entire iteration space at once. As a result it doesn’t need to execute its iteration space in strict sequence.

Jacobi Example Using Structured Dagger

Another Example of Structured Dagger : LeanMD
LeanMD is a molecular dynamics simulation application written in Charm++ and Structured Dagger.
Here, at every timestep, each cell sends its atom positions to its neighbors.Then it receives forces and integrates the information to calculate new positions.

Another Example of Structured Dagger : LeanMD

Load Balancing
Object Migration
Load Balancing Strategies
Using Load Balancing

Projections: Quick Introduction
Projections is a tool used to analyze the performance of your application
The tracemode option is used when you build your application to enable tracing
You get one log file per processor, plus a separate file with global information
These files are read by Projections so you can use the Projections views to analyze performance

Screen shots – Load imbalance

Timelines – load imbalance

Array objects can migrate from one PE to another
To migrate, must implement pack/unpack or pup method
Need this because migration creates a new object on the destination processor while destroying the original
pup combines 3 functions into one
Data structure traversal : compute message size, in bytes
Pack : write object into message
Unpack : read object out of message
Basic Contract : here are my fields (types, sizes and a pointer)

Pup – How to write it?

Load Balancing
All you need is a working pup
link a LB module
-module <strategy>
RefineLB, NeighborLB, GreedyCommLB, others…
EveryLB will include all load balancing strategies
compile time option (specify default balancer)
-balancer RefineLB
runtime option
+balancer RefineLB

Centralized Load Balancing
Uses information about activity on all processors to make load balancing decisions
Advantage: since it has the entire object communication graph, it can make the best global decision
Disadvantage: Higher communication costs/latency, since this requires information from all running chares

Neighborhood Load Balancing
Load balances among a small set of processors (the neighborhood) to decrease communication costs
Advantage: Lower communication costs, since communication is between a smaller subset of processors
Disadvantage: Could leave a system which is globally poorly balanced

Main Centralized Load Balancing Strategies
GreedyCommLB – a “greedy” load balancing strategy which uses the process load and communications graph to map the processes with the highest load onto the processors with the lowest load, while trying to keep communicating processes on the same processor
RefineLB – move objects off overloaded processors to under-utilized processors to reach average load
Others – the manual discusses several other load balancers which are not used as often, but may be useful in some cases; also, more are being developed

Neighborhood Load Balancing Strategies
NeighborLB – neighborhood load balancer, currently uses a neighborhood of 4 processors

When to Re-balance Load?
Programmer Control: AtSync load balancing

AtSync method: enable load balancing at specific point
Object ready to migrate
Re-balance if needed
AtSync() called when your chare is ready to be load balanced – load balancing may not start right away
ResumeFromSync() called when load balancing for this chare has finished

Processor Utilization: After Load Balance

Timelines: Before and After Load Balancing

Advanced Features
Node Groups
Entry Method Attributes
Communications Optimization

Advanced Features: Groups
With arrays, the standard method of communications between elements is message passing
With a large number of chares, this can lead to large numbers of messages in the system
For global operations like reductions, this could make the receiving chare a bottleneck
Can this be fixed?

Advanced Features: Groups
Solution: Groups
Groups are more of a “system level” programming feature of Charm++, versus “user level” arrays
Groups are similar to arrays, except only one element is on each processor – the index to access the group is the processor ID
Groups can be used to batch messages from chares running on a single processor, which cuts down on the message traffic
Disadvantage: Does not allow for effective load balancing, since groups are stationary (they are not virtualized)

Advanced Features: Node Groups
Similar to groups, but only one per node, instead of one per processor – the index is the node number
Can be used to solve similar problems as well – with one node group per SMP node, the node group could act as a collection point for messages on the node, lowering message traffic on interconnects between nodes

Advanced Features: Priorities
In general, messages in Charm++ are unordered: but what if an order is needed?
Solution: Priorities
Messages can be assigned different priorities
The simplest priorities just specify that the message should either go on the end of the queue (standard behavior) or the beginning of the queue
Specific priorities can also be assigned to messages, using either numbers or bit vectors
Note that messages are non-preemptive: a lower priority message will continue processing, even if a higher priority message shows up

Advanced Features: Entry Method Attributes
entry [attribute1, ..., attributeN] void EntryMethod(parameters);
entry methods which are run in their own non- preemptible threads
methods return message as a result

Advanced Features: Communications Optimization
Used to optimize communication patterns in your application
Can use either bracketed strategies or streaming strategies
Bracketed strategies are those where a specific start and end point for the communication are flagged
Streaming strategies use a preset time interval for bracketing messages

Advanced Features: Communications Optimization
For strategies, you need to specify a communications topology, which specifies the message pattern you will be using
When compiling with the communications optimization library, you must include –module commlib

Advanced Features: Checkpoint/Restart
If you have a long running application, it would be nice to be able to save its state, just in case something happens
Checkpointing gives you this ability
When you checkpoint an application, it uses the migrate code already present for load balancing to store the state of array objects
State information is saved in a directory of your choosing

Advanced Features: Checkpoint/Restart
The application can be restarted, with a different number of processors, using the checkpoint information
The charmrun option ++restart <dir> is used to restart
You can also restart groups by marking them migratable and writing a PUP routine – they still will not load balance, though

Other Advanced Features
Custom array indexes
Array creation/mapping options
Additional load balancers
Local versus proxied calls

Benefits of Virtualization
Better Software Engineering
Logical Units decoupled from “Number of processors”
Message Driven Execution
Adaptive overlap between computation and communication
Predictability of execution
Flexible and dynamic mapping to processors
Flexible mapping on clusters
Change the set of processors for a given job
Automatic Checkpointing
Principle of Persistence

More Information
Download files