Subsections

3 CHARM++ Programming

3.0.1 What's the basic programming model for Charm++?

Parallel objects using "Asynchronous Remote Method Invocation":

Asynchronous
in that you do not block until the method returns-the caller continues immediately.

Remote
in that the two objects may be separated by a network.

Method Invocation
in that it's just C++ classes calling each other's methods.

3.0.2 What is an ``entry method''?

Entry methods are all the methods of a chare where messages can be sent by other chares. They are declared in the .ci files, and they must be defined as public methods of the C++ object representing the chare.

3.0.3 When I invoke a remote method, do I block until that method returns?

No! This is one of the biggest differences between Charm++ and most other ``remote procedure call'' systems like CORBA, Java RMI, or RPC. ``Invoke an asynchronous method'' and ``send a message'' have exactly the same semantics and implementation. Since the invoking method does now wait for the remote method to terminate, it normally cannot receive any return value. (see later for a way to return values)

3.0.4 Why does Charm++ use asynchronous methods?

Asynchronous method invocation is more efficient because it can be implemented as a single message send. Unlike with synchronous methods, thread blocking and unblocking and a return message are not needed.

Another big advantage of asynchronous methods is that it's easy to make things run in parallel. If I execute:

a->foo();
b->bar();
Now foo and bar can run at the same time; there's no reason bar has to wait for foo.

3.0.5 Can I make a method synchronous? Can I then return a value?

Yes. If you want synchronous methods, so the caller will block, use the [sync] keyword before the method in the .ci file. This requires the sender to be a threaded entry method, as it will be suspended until the callee finishes. Sync entry methods are allowed to return values to the caller.

3.0.6 What is a threaded entry method? How does one make an entry method threaded?

A threaded entry method is an entry method for a chare that executes in a separate user-level thread. It is useful when the entry method wants to suspend itself (for example, to wait for more data). Note that threaded entry methods have nothing to do with kernel-level threads or pthreads; they run in user-level threads that are scheduled by Charm++ itself.

In order to make an entry method threaded, one should add the keyword threaded withing square brackets after the entry keyword in the interface file:

module M {
  chare X {
    entry [threaded] E1(void);
  };
};

3.0.7 If I don't want to use threads, how can an asynchronous method return a value?

The usual way to get data back to your caller is via another invocation in the opposite direction:

void A::start(void) {
  b->giveMeSomeData();
}
void B::giveMeSomeData(void) {
  a->hereIsTheData(data);
}
void A::hereIsTheData(myclass_t data) {
  ...use data somehow...
}
This is contorted, but it exactly matches what the machine has to do. The difficulty of accessing remote data encourages programmers to use local data, bundle outgoing requests, and develop higher-level abstractions, which leads to good performance and good code.

3.0.8 Isn't there a better way to send data back to whoever called me?

The above example is very non-modular, because b has to know that a called it, and what method to call a back on. For this kind of request/response code, you can abstract away the ``where to return the data'' with a CkCallback object:

void A::start(void) {
  b->giveMeSomeData(CkCallback(CkIndex_A::hereIsTheData,thisProxy));
}
void B::giveMeSomeData(CkCallback returnDataHere) {
  returnDataHere.send(data);
}
void A::hereIsTheData(myclass_t data) {
  ...use data somehow...
}
Now b can be called from several different places in a, or from several different modules.

3.0.9 Why should I prefer the callback way to return data rather than using [sync] entry methods?

There are a few reasons for that:

3.0.10 How does the initializazion in Charm work?

Each processor executes the following operations strictly in order:

  1. All methods registered as initnode;
  2. All methods registered as initproc;
  3. On processor zero, all mainchares constructor method is invoked (the ones taking a CkArgMsg*);
  4. The read-onlies are propagated from processor zero to all other processors;
  5. The nodegroups are created;
  6. The groups are created. During this phase, for all the chare arrays have been created with a block allocation, the corresponding array elements are instantiated;
  7. Initialization terminated and all messages are available for processing, including the messages responsible for the instantiation of array elements manually inserted.

This implies that you can assume that the previous steps has completely finished before the next one starts, and any side effect from all the previous steps are committed (and can therefore be used).

Inside a single step there is no order guarantee. This implies that, for example, two groups allocated from mainchare can be instantiated in any order. The only exception to this is processor zero, where chare objects are instantiated immediately when allocated in the mainchare, i.e if two groups are allocated, their order is fixed by the allocation order in the mainchare constructing them. Again, this is only valid for processor zero, and in no other processor this assumption should be made.

To notice that if array elements are allocated in block (by specifying the number of elements at the end of the ckNew function), they are all instantiated before normal execution is resumed; if manual insertion is used, each element can be constructed at any time on its home processor, and not necessarily before other regular communication messages have been delivered to other chares (including other array elements part of the same array).

3.0.11 Does Charm++ support C and Fortran?

3.0.12 What is a proxy?

A proxy is a local C++ class that represents a remote C++ class. When you invoke a method on a proxy, it sends the request across the network to the real object it represents. In Charm++, all communication is done using proxies.

A proxy class for each of your classes is generated based on the methods you list in the .ci file.

3.0.13 What are the different ways one can can create proxies?

Proxies can be:

3.0.14 What is wrong if I do A *ap = new CProxy_A(handle)?

This will not compile, because a CProxy_A is not an A. What you want is CProxy_A *ap = new CProxy_A(handle).

3.0.15 Why is the def.h usually included at the end? Is it necessary or can I just include it at the beginning?

You can include the def.h file once you've actually declared everything it will reference- all your chares and readonly variables. If your chares and readonlies are in your own header files, it is legal to include the def.h right away.

However, if the class declaration for a chare isn't visible when you include the def.h file, you'll get a confusing compiler error. This is why we recommend including the def.h file at the end.

3.0.16 How can I use a global variable across different processors?

Make the global variable ``readonly'' by declaring it in the .ci file. Remember also that read-onlies can be safely set only in che mainchare constructor. Any change after the mainchare constructor has finished will be local to the processor that made the change. To change a global variable later in the program, every processor must modify it accordingly (e.g by using a chare group. Note that chare arrays are not guaranteed to cover all processors)

3.0.17 Can I have a class static read-only variable?

One can have class-static variables as read-onlies. Inside a chare, group or array declaration in the .ci file, one can have a readonly variable declaration. Thus:

chare someChare {
  ...
  readonly CkGroupID someGroup;
  ...
};
is fine. In the .h declaration for class someChare, you will have have to put someGroup as a public static variable, and you are done.

You then refer to the variable in your program as someChare::someGroup.

3.0.18 How do I measure the time taken by a program or operation?

You can use CkWallTimer() to determine the time on some particular processor. To time some parallel computation, you need to call CkWallTimer on some processor, do the parallel computation, then call CkWallTimer again on the same processor and subtract.

3.0.19 What do CmiAssert and CkAssert do?

These are just like the standard C++ assert calls in <assert.h>- they call abort if the condition passed to them is false.

We use our own version rather than the standard version because we have to call CkAbort, and because we can turn our asserts off when CMK_OPTIMIZE is defined.

3.0.20 Can I know how many messages are being sent to a chare?

No.

There is no nice library to solve this problem, as some messages might be queued on the receiving processor, some on the sender, and some on the network. You can still:

3.0.21 What is "quiescence"? How does it work?

Quiescence is When nothing is happening anywhere on the parallel machine.

A low-level background task counts sent and received messages. When, across the machine, all the messages that have been sent have been received, and nothing is being processed, quiescence is triggered.

3.0.22 Should I use quiescence detection?

Probably not.

In some ways, quiescence is a very strong property (it guarentees nothing is happening anywhere) so if some other library is doing something, you won't reach quiescence. In other ways, quiescence is a very weak property, since it doesn't guarentee anything about the state of your application like a reduction does, only that nothing is happening. Because quiescence detection is on the one hand so strong it breaks modularity, and on the other hand is too weak to guarentee anything useful, it's often better to use something else.

Often global properties can be replaced by much easier-to-compute local properties. For example, my object could wait until all its neighbors have sent it messages (a local property my object can easily detect by counting message arrivals), rather than waiting until all neighbor messages across the whole machine have been sent (a global property that's difficult to determine). Sometimes a simple reduction is needed instead of quiescence, which has the benefits of being activated explicitly (each element of a chare array or chare group has to call contribute) and allows some data to be collected at the same time. A reduction is also a few times faster than quiescence detection. Finally, there are a few situations, such as some tree-search problems, where quiescence detection is actually the most sensible, efficient solution.

3.1 CHARM++ Arrays

3.1.1 How do I know which processor a chare array element is running on?

At any given instant, you can call CkMyPe() to find out where you are. There is no reliable way to tell where another array element is; even if you could find out at some instant, the element might immediately migrate somewhere else!

3.1.2 Should I use Charm++ Arrays in my program?

Yes! Most of your computation should happen inside array elements. Arrays are the main way to automatically balance the load using one of the load balancers available.

3.1.3 How many array elements should I have per processor?

To do load balancing, you need more than one array element per processor. To keep the time and space overheads reasonable, you probably don't want more than a few thousand array elements per processor. The optimal value depends on the program, but is usually between 10 and 100. If you come from an MPI background, this may seem like a lot.

3.1.4 What does the term reduction refer to?

You can reduce a set of data to a single value. For example, finding the sum of values, where each array element contributes a value to the final sum. Reductions are supported directly by Charm++ arrays, and some operations most commonly used are predefined. Other more complicated reductions can implement if needed.

3.1.5 Can I do multiple reductions on an array?

You can have several reductions happen one after another; but you cannot mix up the execution of two reductions over the same array. That is, if you want to reduce A, then B, every array element has to contribute to A, then contribute to B; you cannot have some elements contribute to B, then contribute to A.

3.1.6 Does Charm++ do automatic load balancing without the user asking for it?

No. You only get load balancing if you explicitly ask for it either at link-time with the +balancer option, or at runtime with the -balancer option.

3.1.7 What is the migration constructor and why do I need it?

The migration constructor (a constructor that takes CkMigrateMessage * as parameter) is invoked when an array element migrates to a new processor. If there is anything you want to do when you migrate, you could put it here. However, even if you don't want to do anything, you must create it, as it is called from the code automatically generated by the Charm++ translator, and constructors aren't inherited, so we can't just put a migration constructor in the base class.

The migration constructor should not be declared in the .ci file. Of course the array element will require also at least one regular constructor so that it can be created, and these must be declared in the .ci file.

3.1.8 What happens to the old copy of an array element after it migrates?

After sizing and packing a migrating array element, the array manager deletes the old copy. As long as all the array element destructors in the non-leaf nodes of your inheritance hierarchy are virtual destructors, with declaration syntax:

class foo : ... {
  ...
  virtual ~foo(); // <- virtual destructor
};
then everything will get deleted properly.
Note that deleting things in a packing pup happens to work for the current array manager, but WILL NOT work for checkpointing, debugging, or any of the (many) other uses for packing puppers we might dream up - so DON'T DO IT!

3.1.9 Is it possible to turn migratability on and off for an individual array element?

3.1.10 Is it possible to insist that a particular array element gets migrated at the next AtSync()?

3.1.11 When not using AtSync for LB, when does the LB start up? Where is the code that periodically checks if load balancing can be done?

If not using usesAtSync, the load balancer can start up at anytime. There is a dummy AtSync for each array element which by default tells the load balancer that it is always ready. The LDBD manager has a syncer (LBDB::batsyncer) which periodically calls AtSync roughly every 1ms to trigger the load balancing (this timeout can be changed with the +LBPeriod option). In this load balancing mode, users have to make sure all migratable objects are always ready to migrate (e.g. not depending on a global variable which cannot be migrated).

3.1.12 Should I use AtSync explicitely, or leave it to the system?

You almost certaintly want to use AtSync directly. In most cases there are points in the execution where the memory in use by a chare is bigger due to transitory data, which does not need to be transferred if the migration happens at predefined points.

3.2 CHARM++ Groups and Nodegroups

3.2.1 What are groups and nodegroups used for?

They are used for optimizations at the processor and node level respectively.

3.2.2 Should I use groups?

Probably not. People with an MPI background often overuse groups, which results in MPI-like Charm++ programs. Arrays should generally be used instead, because arrays can be migrated to acheive load balance.

Groups tend to be most useful in constructing communication optimization libraries. For example, all the array elements on a processor can contribute something to their local group, which can then send a combined message to another processor. This can be much more efficient than having each array element send a separate message.

3.2.3 Is it safe to use a local pointer to a group, such as from ckLocalBranch?

Yes. Groups never migrate, so a local pointer is safe. The only caveat is to make sure you don't migrate without updating the pointer.

A local pointer can be used for very efficient access to data held by a group.

3.2.4 What are migratable groups?

Migratable groups are declared so by adding the ``[migratable]'' attribute in the .ci file. They cannot migrate from one processor to another during normal execution, but only to disk for checkpointing purposes.

Migratable groups must declare a migration constructor (taking CkMigrateMessage * as a parameter) and a pup routine. The migration construtor must call the superclass migration constructor as in this example:

class MyGroup : public CBase_MyGroup {
  ...
  MyGroup (CkMigrateMessage *msg) : CBase_MyGroup(msg) { }
  ...
}

3.2.5 Should I use nodegroups?

Almost certainly not. You should use arrays for most computation, and even quite low-level communication optimizations are often best handled by groups. Nodegroups are very difficult to get right.

3.2.6 What's the difference between groups and nodegroups?

There's one group element per processor (CkNumPes() elements); and one nodegroup element per node (CkNumNodes() elements). Because they execute on a node, nodegroups have very different semantics from the rest of Charm++.

Note that on a non-SMP machine, groups and nodegroups are identical.

3.2.7 Do nodegroup entry methods execute on one fixed processor of the node, or on the next available processor?

Entries in node groups execute on the next available processor. Thus, if two messages were sent to a branch of a nodegroup, two processors could execute one each simultaneously.

3.2.8 Are nodegroups single-threaded?

No. They can be accessed by multiple threads at once.

3.2.9 Do we have to worry about two entry methods in an object executing simultaneously?

Yes, which makes nodegroups different from everything else in Charm++.

If a nodegroup method accesses a data structure in a non-threadsafe way (such as writing to it), you need to lock it, for example using a CmiNodeLock.

3.3 CHARM++ Messages

3.3.1 What are messages?

A bundle of data sent, via a proxy, to another chare. A message is a special kind of heap-allocated C++ object.

3.3.2 Should I use messages?

It depends on the application. We've found parameter marshalling to be less confusing and error-prone than messages for small parameters. Nevertheless, messages can be more efficient, especially if you need to buffer incoming data, or send complicated data structures (like a portion of a tree).

3.3.3 What is the best way to pass pointers in a message?

You can't pass pointers across processors. This is a basic fact of life on distributed-memory machines.

You can, of course, pass a copy of an object referenced via a pointer across processors-either dereference the pointer before sending, or use a varsize message.

3.3.4 Can I allocate a message on the stack?

No. You must allocate messages with new.

3.3.5 Do I need to delete messages that are sent to me?

Yes, or you will leak memory! If you receive a message, you are responsible for deleting it. This is exactly opposite of parameter marshalling, and much common practice. The only exception are entry methods declared as [nokeep]; for these the system will free the message automatically at the end of the method.

3.3.6 Do I need to delete messages that I allocate and send?

No, this will certainly corrupt both the message and the heap! Once you've sent a message, it's not yours any more. This is again exactly the opposite of parameter marshalling.

3.3.7 What can a variable-length message contain?

Variable-length messages can contain arrays of any type, both primitive type or any user-defined type. The only restriction is that they have to be 1D arrays.

3.3.8 Do I need to delete the arrays in variable-length messages?

No, this will certainly corrupt the heap! These arrays are allocated in a single contiguous buffer together with the message itself, and is deleted when the message is deleted.

3.3.9 What are priorities?

Priorities are special values that can be associated with messages, so that the Charm++ scheduler will prefer higher priority messages when choosing a message to deliver. Priorities are respected by Charm++ as much as possible: until there are higher priority messages in the queue, lower priority message will never be delivered. Nevertheless, this is not a guarantee, and a lower priority message can be delivered before a higher priority one.

Messages with priorities are typically used to perform optimizations on the order of the computation.

For integer priorities, the smaller the priority value, the higher the priority of the message. Negative value are therefore higher priority than positive ones. To enable and set a message's priority there is a special new syntax and CkPriorityPtr function; see the manual for details. If no priority is set, messages have a default priority of zero.

3.3.10 Can messages have multiple inheritance in Charm++?

3.3.11 What is the difference between new and alloc?

3.4 PUP Framework

3.4.1 How does one write a pup for a dynamically allocated 2-dimensional array?

The usual way: pup the size(s), allocate the array if unpacking, and then pup all the elements.

For example, if you have a 2D grid like this:

class foo {
 private:
  int wid,ht;
  double **grid;
  ...other data members

  //Utility allocation/deallocation routines
  void allocateGrid(void) {
    grid=new double*[ht];
    for (int y=0;y<ht;y++)
      grid[y]=new double[wid];
  }
  void freeGrid(void) {
    for (int y=0;y<ht;y++)
      delete[] grid[y];
    delete[] grid;
    grid=NULL;
  }

 public:
  //Regular constructor
  foo() {
    ...set wid, ht...
    allocateGrid();
  }
  //Migration constructor
  foo(CkMigrateMessage *) {}
  //Destructor
  ~foo() {
    freeGrid();
  }

  //pup method
  virtual void pup(PUP::er &p) {
    p(wid); p(ht);
    if (p.isUnpacking()) {
      //Now that we know wid and ht, allocate grid
      allocateGrid(wid,ht);
    }
    //Pup grid values element-by-element
    for (int y=0;y<ht;y++)
      for (int x=0; x<wid; x++)
        p|grid[y][x];
    ...pup other data members...
  }
};

3.4.2 When using automatic allocation via PUP::able, what do these calls mean? PUPable_def(parent); PUPable_def(child);

For the automatic allocation described in Automatic allocation via PUP::able of the manual, each class needs four things:

See charm/tests/charm++/megatest/marshall.[hC] for an executable example.

3.4.3 What is the difference between p|data; and p(data);? Which one should I use?

For most system- and user-defined structure someHandle, you want p|someHandle; instead of p(someHandle);

The reason for the two incompatible syntax varieties is that the bar operator can be overloaded outside pup.h (just like the std::ostream's operator<<); while the parenthesis operator can take multiple arguments (which is needed for efficiently PUPing arrays).

The bar syntax will be able to copy any structure, whether it has a pup method or not. If there is no pup method, the C++ operator overloading rules decay the bar operator into packing the bytes of the structure, which will work fine for simple types on homogenous machines. For dynamically allocated structures or heterogeneous migration, you'll need to define a pup method for all packed classes/structures. As an added benefit, the same pup methods will get called during parameter marshalling.

November 23, 2009
Charm Homepage