This manual describes CHARM++, an object oriented portable parallel programming language based on C++. Its program structure, execution model, interface language constructs and runtime system calls are described here1.
CHARM++ has continuously evolved since the OOPSLA 1993 paper. The earlier versions modified the C++ syntax to support CHARM++ primitives, and contained a full-fledged CHARM++ translator that parsed the CHARM++ syntactic extensions as well as the C++ syntax to produce a C++ program, which was later compiled using a C++ compiler. The current version does not augment the C++ syntax, and does not use a CHARM++ translator as in previous versions. Instead, the older constructs are converted to calls into the runtime library, several new constructs are added, and minimal language constructs are used to describe the interfaces.
CHARM++ is an explicitly parallel language based on C++ with a runtime library for supporting parallel computation called the Charm kernel. It provides a clear separation between sequential and parallel objects. The execution model of CHARM++ is message driven, thus helping one write programs that are latency-tolerant. CHARM++ supports dynamic load balancing while creating new work as well as periodically, based on object migration. Several dynamic load balancing strategies are provided. CHARM++ supports both irregular as well as regular, data-parallel applications. It is based on the CONVERSE interoperable runtime system for parallel programming.
Currently the parallel platforms supported by CHARM++ are the BlueGene/L,BlueGene/P, PSC Lemieux, IBM SP, SGI Origin2000, Cray XT3/4, Cray X1, Cray T3E, a single workstation or a network of workstations from Sun Microsystems (Solaris), IBM RS-6000 (AIX) SGI (IRIX 5.3 or 6.4), HP (HP-UX), Intel x86 (Linux, Windows 98/2000/XP), Intel IA64, Intel x86_64, multicore x86 and x86_64, and Apple Mac. The communication protocols and infrastructures supported by CHARM++ are UDP, TCP, Myrinet, Infiniband, Quadrics Elan, Shmem, MPI and NCSA VMI. CHARM++ programs can run without changing the source on all these platforms. Please see the CHARM++/CONVERSE Installation and Usage Manual for details about installing, compiling and running CHARM++ programs.
CHARM++ is an object oriented parallel language. What sets CHARM++ apart from traditional programming models such as message passing and shared variable programming is that the execution model of CHARM++ is message-driven. Therefore, computations in CHARM++ are triggered based on arrival of associated messages. These computations in turn can fire off more messages to other (possibly remote) processors that trigger more computations on those processors.
At the heart of any CHARM++ program is a scheduler that repetitively chooses a message from the available pool of messages, and executes the computations associated with that message.
The programmer-visible entities in a CHARM++ program are:
CHARM++ starts a program by creating a single instance of each mainchare on processor 0, and invokes constructor methods of these chares. Typically, these chares then creates a number of other chares, possibly on other processors, which can simultaneously work to solve the problem at hand.
Each chare contains a number of entry methods, which are methods that can be invoked from remote processors. The CHARM++ runtime system needs to be explicitly told about these methods, via an interface in a separate file. The syntax of this interface specification file is described in the later sections.
CHARM++ provides system calls to asynchronously create remote chares and to asynchronously invoke entry methods on remote chares by sending messages to those chares. This asynchronous message passing is the basic interprocess communication mechanism in CHARM++. However, CHARM++ also permits wide variations on this mechanism to make it easy for the programmer to write programs that adapt to the dynamic runtime environment. These possible variations include prioritization (associating priorities with method invocations), conditional message packing and unpacking (for reducing messaging overhead), quiescence detection (for detecting completion of some phase of the program), and dynamic load balancing (during remote object creation). In addition, several libraries are built on top of CHARM++ that can simplify otherwise arduous parallel programming tasks.
The following sections provide detailed information about various features of CHARM++ programming system.
The CHARM software was developed as a group effort of the Parallel Programming Laboratory at the University of Illinois at Urbana-Champaign. Researchers at the Parallel Programming Laboratory keep CHARM++ updated for the new machines, new programming paradigms, and for supporting and simplifying development of emerging applications for parallel processing. The earliest prototype, Chare Kernel(1.0), was developed in the late eighties. It consisted only of basic remote method invocation constructs available as a library. The second prototype, Chare Kernel(2.0), a complete re-write with major design changes. This included C language extensions to denote Chares, messages and asynchronous remote method invocation. CHARM(3.0) improved on this syntax, and contained important features such as information sharing abstractions, and chare groups (called Branch Office Chares). CHARM(4.0) included CHARM++ and was released in fall 1993. CHARM++ in its initial version consisted of syntactic changes to C++ and employed a special translator that parsed the entire C++ code while translating the syntactic extensions. CHARM(4.5) had a major change that resulted from a significant shift in the research agenda of the Parallel Programming Laboratory. The message-driven runtime system code of the CHARM++ was separated from the actual language implementation, resulting in an interoperable parallel runtime system called CONVERSE. The CHARM++ runtime system was retargetted on top of CONVERSE, and popular programming paradigms such as MPI and PVM were also implemented on CONVERSE. This allowed interoperability between these paradigms and CHARM++. This release also eliminated the full-fledged CHARM++ translator by replacing syntactic extensions to C++ with C++ macros, and instead contained a small language and a translator for describing the interfaces of CHARM++ entities to the runtime system. This version of CHARM++, which, in earlier releases was known as Interface Translator CHARM++, is the default version of CHARM++ now, and hence referred simply as CHARM++. In early 1999, the runtime system of CHARM++ was formally named the Charm Kernel, and was rewritten in C++. Several new features were added. The interface language underwent significant changes, and the macros that replaced the syntactic extensions in original CHARM++, were replaced by natural C++ constructs. Late 1999, and early 2000 reflected several additions to CHARM++, when a load balancing framework and migratable objects were added to CHARM++.
We think that CHARM++ is easy to use if you are familiar with object-based programming. (But of course that is our opinion, if your opinion differs, you are encouraged to let us know the reasons, and features that you would like to see in CHARM++.) Object-based programming is built around the concept of ``encapsulation'' of data. As implemented in C++, data encapsulation is achieved by grouping together data and methods (also known as functions, subroutines, or procedures) inside of an object.
A class is a blueprint for an object. The encapsulated data is said to be ``private'' to the object, and only the methods of that class can manipulate that data. A method that has the same name as the class is a ``blessed'' method, called a ``Constructor'' for that class. A constructor method is typically responsible for initializing the encapsulated data of an object. Each method, including the constructor can optionally be supplied data in the form of parameters (or arguments). In C++, one can create objects with the new operator that returns a pointer to the object. This pointer can be used to refer to the object, and call methods on that object.
CHARM++ is built on top of C++, and also based on ``encapsulation''. Similar to C++, CHARM++ entities can contain private data, and public methods. The major difference is that these methods can be invoked from remote processors asynchronously. Asynchronous method invocation means that the caller does not wait for the method to be actually executed and does not wait for the method's return value. Therefore, CHARM++ methods (called entry methods) do not have a return value3. Since the actual CHARM++ object on which the method is being invoked may be on a remote processor4, the C++ way of referring to an object, via a pointer, is not valid in CHARM++. Instead, we refer to a remote chare via a ``proxy'', as explained below.
Those familiar with various component models (such as CORBA) in the distributed computing world will recognize ``proxy'' to be a dummy, standin entity that refers to an actual entity. For each chare type, a ``proxy'' class exists.5 The methods of this ``proxy'' class correspond to the remote methods of the actual class, and act as ``forwarders''. That is, when one invokes a method on a proxy to a remote object, the proxy forwards this method invocation to the actual remote object. All entities that are created and manipulated remotely in CHARM++ have such proxies. Proxies for each type of entity in CHARM++ have some differences among the features they support, but the basic syntax and semantics remain the same - that of invoking methods on the remote object by invoking methods on proxies.
You can have several proxies that all refer to the same object.
Historically, handles (which are basically globally unique identifiers) were used to uniquely identify CHARM++ objects. Unlike pointers, they are valid on all processors and so could be sent as parameters in messages. They are still available, but now proxies also have the same feature.
Handles (like CkChareID, CkArrayID, etc.) and proxies (like CProxy_foo) are just bytes and can be sent in messages, pup'd, and parameter marshalled. This is now true of almost all objects in Charm++: the only exceptions being entire Chares (Array Elements, etc.) and, paradoxically, messages themselves.
A CHARM++ program consists of a number of CHARM++ objects distributed across the available number of processors. Thus, the basic unit of parallel computation in CHARM++ programs is the chare, a CHARM++ object that can be created on any available processor and can be accessed from remote processors. A chare is similar to a process, an actor, an ADA task, etc. Chares are created dynamically, and many chares may be active simultaneously. Chares send messages to one another to invoke methods asynchronously. Conceptually, the system maintains a ``work-pool'' consisting of seeds for new chares, and messages for existing chares. The runtime system (called Charm Kernel) may pick multiple items, non-deterministically, from this pool and execute them.
Methods of a chare that can be remotely invoked are called entry methods. Entry methods may take marshalled parameters, or a pointer to a message object. Since chares can be created on remote processors, obviously some constructor of a chare needs to be an entry method. Ordinary entry methods6 are completely non-preemptive- CHARM++ will never interrupt an executing method to start any other work, and all calls made are asynchronous.
CHARM++ provides dynamic seed-based load balancing. Thus location (processor number) need not be specified while creating a remote chare. The Charm Kernel will then place the remote chare on a least loaded processor. Thus one can imagine chare creation as generating only a seed for the new chare, which may take root on the most fertile processor. Charm Kernel identifies a chare by a ChareID. Since user code does not need to name a chares' processor, chares can potentially migrate from one processor to another. (This behavior is used by the dynamic load-balancing framework for chare containers, such as arrays.)
Other CHARM++ objects are collections of chares. They are: chare-arrays, chare-groups, and chare-nodegroups, referred to as arrays, groups, and nodegroups throughout this manual. An array is a collection of arbitrary number of migratable chares, indexed by some index type, and mapped to processors according to a user-defined map group. A group (nodegroup) is a collection of chares, one per processor (SMP node), that is addressed using a unique system-wide name.
Every CHARM++ program must have at least one mainchare. Each mainchare is created by the system on processor 0 when the CHARM++ program starts up. Execution of a CHARM++ program begins with the Charm Kernel constructing all the designated mainchares. For a mainchare named X, execution starts at constructor X() or X(CkArgMsg *) which are equivalent. Typically, the mainchare constructor starts the computation by creating arrays, other chares, and groups. It can also be used to initialize shared readonly objects.
The only method of communication between processors in CHARM++ is asynchronous entry method invocation on remote chares. For this purpose, Charm Kernel needs to know the types of chares in the user program, the methods that can be invoked on these chares from remote processors, the arguments these methods take as input etc. Therefore, when the program starts up, these user-defined entities need to be registered with Charm Kernel, which assigns a unique identifier to each of them. While invoking a method on a remote object, these identifiers need to be specified to Charm Kernel. Registration of user-defined entities, and maintaining these identifiers can be cumbersome. Fortunately, it is done automatically by the CHARM++ interface translator. The CHARM++ interface translator generates definitions for proxy objects. A proxy object acts as a handle to a remote chare. One invokes methods on a proxy object, which in turn carries out remote method invocation on the chare.
In addition, the CHARM++ interface translator provides ways to enhance the basic functionality of Charm Kernel using user-level threads and futures. These allow entry methods to be executed in separate user-level threads. These threaded entry methods may block waiting for data by making synchronous calls to remote object methods that return results in messages.
CHARM++ program execution is terminated by the CkExit call. Like the exit system call, CkExit never returns. The Charm Kernel ensures that no more messages are processed and no entry methods are called after a CkExit. CkExit need not be called on all processors; it is enough to call it from just one processor at the end of the computation.
This section describes various entities in a typical CHARM++ program.
A CHARM++ program typically consists mostly of ordinary sequential C++ code and objects. Such entities are only accessible locally, are not known to the CHARM++ runtime system, and thus need not be mentioned in the module interface files.
CHARM++ does not affect the syntax or semantics of such C++ entities, except that changes to global variables (or static data members of a class) on one node will not be visible on other nodes. Global data changes must be explicitly sent between processors. For processor- and thread-private storage, refer to the ``Global Variables'' section of the Converse manual.
Messages supply data arguments to the asynchronous remote method invocation. These objects are treated differently from other objects in CHARM++ by the runtime system, and therefore they must be specified in the interface file of the module. With parameter marshalling, the system creates and handles the message completely internally. Other messages are instances of C++ classes that are subclassed from a special class that is generated by the CHARM++ interface translator. Another variation of communication objects is conditionally packed and unpacked. This variation should be used when one wants to send messages that contain pointers to the data rather than the actual data to other processors. This type of communication objects contains two static methods: pack, and unpack. The third variation of communication objects is called varsize messages. Varsize messages is an effective optimization on conditionally packed messages, and can be declared with special syntax in the interface file.
Chares are the most important entities in a CHARM++ program. These concurrent objects are different from sequential C++ objects in many ways. Syntactically, Chares are instances of C++ classes that are derived from a system-provided class called Chare. Also, in addition to the usual C++ private and public data and method members, they contain some public methods called entry methods. These entry methods do not return anything (they are void methods), and take at most one argument, which is a pointer to a message. Chares are accessed using a proxy (an object of a specialized class generated by the CHARM++ interface translator) or using a handle (a CkChareID structure defined in CHARM++), rather than a pointer as in C++. Semantically, they are different from C++ objects because they can be created asynchronously from remote processors, and their entry methods also could be invoked asynchronously from the remote processors. Since the constructor method is invoked from remote processor (while creating a chare), every chare should have its constructors as entry methods (with at most one message pointer parameter). These chares and their entry methods have to be specified in the interface file.
Chare arrays are collections of chares. However, unlike chare groups or nodegroups, arrays are not constrained by characteristics of the underlying parallel machine such as number of processors or nodes. Thus, chare arrays can have any number of elements. The array elements themselves are chares, and methods can be invoked on individual array elements as usual. Each element of an array has a globally unique index, and messages are addressed to that index.
Unlike other entities in CHARM++ the dynamic load balancing framework (LB Framework) treats array elements as objects that can be migrated across processors. Thus, the runtime system keeps track of computational load across the system, and also the time spent in execution of entry methods on array elements, and then employs one of several strategies to redistribute array elements across the available processors.
Chare Groups7 are a special type of concurrent objects. Each chare group is a collection of chares, with one representative (group member) on each processor. All the members of a chare group share a globally unique name (handle, defined by Charm kernel to be of type CkGroupID). An entire chare group could be addressed using this global handle, and an individual member of a chare group can be addressed using the global handle, and a processor number. Chare groups are instances of C++ classes subclassed from a system-provided class called Group. The Charm kernel has to be notified that these chares are semantically different, and therefore chare groups have a different declaration in the interface specification file.
Chare nodegroups are very similar to chare groups except that instead of having one group member on each processor, the nodegroup has one member on each shared memory multiprocessor node. Note that CHARM++ (and its underlying runtime system Converse) distinguish between processors and nodes. A node consists of one or more processors that share an address space. The last few years have seen emergence of fast SMP systems of small (2-4 processors) to large (32-64 processors) number of processors per node. A network of such SMP nodes is the most general model of parallel computers, making pure distributed and pure shared memory systems mere special cases. CHARM++ is built on top of this machine abstraction, and Chare nodegroups embody this abstraction in a higher level language construct. Semantically, methods invoked on a nodegroup member could be executed on any processor within that node. This fact can be utilized for supporting load balance across processors within a node. However, this also means that different processors within a node could be executing methods of the same nodegroup member simultaneously, thus leading to common problems associated with shared address space programming. However, CHARM++ eases such problems by allowing the programmer to specify an entry method of a nodegroup to be exclusive, thus guaranteeing that no other exclusive method of that nodegroup member can execute simultaneously within the node.
A CHARM++ program is structurally similar to a C++ program. Most of a CHARM++ program is C++ code.8 The main syntactic units in a CHARM++ program are class definitions. A CHARM++ program can be distributed across several source code files.
There are five disjoint categories of objects (classes) in CHARM++:
The user's code is written in C++ and interfaces with the CHARM++ system as if it were a library containing base classes, functions, etc. A translator is used to generate the special code needed to handle CHARM++ constructs. This translator generates C++ code that needs to be compiled with the user's code.
Interfaces to the CHARM++ objects (such as messages, chares, readonly variables etc.) have to be declared in CHARM++ interface files. Typically, such entities are grouped into modules. A CHARM++ program may consist of multiple modules. One of these modules is declared to be a mainmodule. All the modules that are ``reachable'' from the mainmodule via the extern construct are included in a CHARM++ program.
The CHARM++ interface file has the suffix ``.ci''. The CHARM++ interface translator parses this file and produces two files (with suffixes ``.decl.h'' and ``.def.h'', for each module declared in the ``.ci'' file), that contain declarations (interface) and definitions (implementation)of various translator-generated entities. If the name of a module is MOD, then the files produced by the CHARM++ interface translator are named MOD.decl.h and MOD.def.h.9 We recommend that the declarations header file be included at the top of the header file (MOD.h) for module MOD, and the definitions file be included at the bottom of the code for module (MOD.C).10
A simple CHARM++ program is given below:
HelloMain is designated a mainchare. Thus the Charm Kernel starts execution of this program by creating an instance of HelloMain on processor 0. The HelloMain constructor creates a chare group HelloGroup, and stores a handle to itself and returns. The call to create the group returns immediately after directing Charm Kernel to perform the actual creation and invocation. Shortly after, the Charm Kernel will create an object of type HelloGroup on each processor, and call its constructor. The constructor will then print ``Hello World...'' and then call the PrintDone method of HelloMain. The PrintDone method calls CkExit after all group members have called it (i.e., they have finished printing ``Hello World...''), and the CHARM++program exits.
The decl.h file provides declarations for the proxy classes of the concurrent objects declared in the ``.ci'' file (from which the decl.h file is generated). So the Hello.decl.h file will have the declaration of the class CProxy_HelloMain. Similarly it will also have the declaration for the HelloGroup class.
This class will have functions to create new instances of the chares and groups, like the function ckNew. For HelloGroup this function creates an instance of the class HelloGroup on all the processors.
The proxy class also has functions corresponding to the entry methods defined in the ``.ci'' file. In the above program the method wait is declared in CProxy_HelloMain (proxy class for HelloMain).
The proxy class also provides static registration functions used by the CHARM++ runtime. The def.h file has a registration function (__registerHello in the above program) which calls all the registration functions corresponding to the readonly variables and entry methods declared in the module.
In CHARM++, chares, groups and nodegroups communicate using remote method invocation. These ``remote entry'' methods may either take marshalled parameters, described in the next section; or special objects called messages. Messages are lower level, more efficient, more flexible, and more difficult to use than parameter marshalling.
An entry method is always a part of a chare- there are no global entry methods in CHARM++. Entry methods are declared in the the interface file as:
Parameters is either a list of marshalled parameters, (e.g., ``int i, double x''), or a message description (e.g., ``MyMessage *msg''). See section 3.3 and section 3.4 for details on these types of parameters.
Entry methods typically do not return data- in C++, they have return type ``void''. An entry method with the same name as its enclosing class is a constructor. Constructors in C++ have no return type. Finally, sync methods, described below, may return a message.
CHARM++ provides a handful of special attributes that entry methods may have. In order to give a particular entry method an attribute, you must specify the keyword for the desired attribute in the attribute list of that entry method's .ci file declaration. The syntax for this is as follows:
CHARM++ currently offers the following attributes that one may assign to an entry method: threaded, sync, exclusive, nokeep, notrace, immediate, expedited, inline, local, python.
In CHARM++, chares, groups and nodegroups communicate by invoking each others methods. The methods may either take several parameters, described here; or take a special message object as described in the next section. Since parameters get marshalled into a message before being sent across the network, in this manual we use ``message'' to mean either a literal message object or a set of marshalled parameters.
For example, a chare could have this entry method declaration in the interface (.ci) file:
Since CHARM++ runs on distributed memory machines, we cannot pass an array via a pointer in the usual C++ way. Instead, we must specify the length of the array in the interface file, as:
This also means the data must be copied on the sending side, and to be kept must be copied again at the receive side. Especially for large arrays, this is less efficient than messages, as described in the next section.
Array parameters and other parameters can be combined in arbitrary ways, as:
The marshalling system uses the pup framework to copy data, meaning every user class that is marshalled needs either a pup routine, a ``PUPbytes'' declaration, or a working operator|. See the PUP description in Section 3.16 for more details on these routines.
Any user-defined types in the argument list must be declared before including the ``.decl.h'' file. As usual in C++, it is often dramatically more efficient to pass a large structure by reference (as shown) than by value.
For efficiency, arrays (like refChars above) are always copied as blocks of bytes and passed via pointers. This means classes that need their pup routines to be called, such as those with dynamically allocated data or virtual methods cannot be passed as arrays-use CkVec or STL vectors to pass lists of complicated user-defined classes. For historical reasons, pointer-accessible structures cannot appear alone in the parameter list (because they are confused with messages).
The order of marshalling operations on the send side is:
|a'' on each marshalled parameter with a sizing PUP::er.
|a'' on each marshalled parameter with a packing PUP::er.
The order of marshalling operations on the receive side is:
|a'' on each marshalled parameter using an unpacking PUP::er.
Finally, very large structures are most efficiently passed via messages, because messages are an efficient, low-level construct that minimizes copying and overhead; but very complicated structures are easiest to pass via marshalling, because marshalling uses the high-level pup framework.
A message encapsulates all the parameters sent to an entry method. Since the parameters are already encapsulated, sending messages is often more efficient than parameter marshalling. In addition, messages are easier to queue and store on the receive side.
The largest difference between parameter marshalling and messages is that entry methods keep the messages passed to them. Thus each entry method must be passed a new message. On the receiving side, the entry method must either store the passed message or explicitly delete it, or else the message will never be destroyed, wasting memory.
Several kinds of message are available. Regular CHARM++ messages are objects of fixed size. One can have messages that contain pointers or variable length arrays (arrays with sizes specified at runtime) and still have these pointers to be valid when messages are sent across processors, with some additional coding. Also available is a mechanism for assigning priorities to messages that applies all kinds of messages. A detailed discussion of priorities appears later in this section.
Like all other entities involved in asynchronous method invocation, messages need to be declared in the .ci file. In the .ci file (the interface file), a message is declared as:
A message that contains variable length arrays is declared as:
If the name of the message class is MessageType, the class must inherit publicly from a class whose name is CMessage_MessageType. This class is generated by the charm translator. Then message definition has the form:
Messages are allocated using the C++ new operator:
The optional arguments to the new operator are used when allocating messages with variable length arrays or prioritized messages. sz1, sz2, ... denote the size (in appropriate units) of the memory blocks that need to be allocated and assigned to the pointers that the message contains. The priobits argument denotes the size of a bitfield (number of bits) that will be used to store the message priority.
For example, to allocate a message whose class declaration is:
do the following:
To allocate a message whose class declaration is:
do the following:
This allocates a VarsizeMessage, in which firstArray points to an array of 10 ints and secondArray points to an array of 20 doubles. This is explained in detail in later sections.
To add a priority bitfield to this message,
Note, you must provide number of bits which is used to store the priority as the priobits parameter. The section on prioritized execution describes how this bitfield is used.
In Section 3.4.3 we explain how messages can contain arbitrary pointers, and how the validity of such pointers can be maintained across processors in a distributed memory machine.
When a message is sent to a chare, the programmer relinquishes control of it; the space allocated to the message is freed by the system. When a message is received at an entry point it is not freed by the runtime system. It may be reused or deleted by the programmer. Messages can be deleted using the standard C++ delete operator.
There are no limitations of the methods of message classes except that the message class may not redefine operators new or delete.
An ordinary message in CHARM++ is a fixed size message that is allocated internally with an envelope which encodes the size of the message. Very often, the size of the data contained in a message is not known until runtime. One can use packed messages to alleviate this problem. However, it requires multiple memory allocations (one for the message, and another for the buffer.) This can be avoided by making use of a varsize message. In varsize messages, the space required for these variable length arrays is allocated with the message such that it is contiguous to the message.
Such a message is declared as
in CHARM++ interface file. The class mtype has to inherit from CMessage_mtype. In addition, it has to contain variables of corresponding names pointing to appropriate types. If any of these variables (data members) are private or protected, it should declare class CMessage_mtype to be a ``friend'' class. Thus the mtype class declaration should be similar to:
An Example
Suppose a CHARM++ message contains two variable length arrays of types int and double:
Then in the .ci file, this has to be declared as:
We specify the types and actual names of the fields that contain variable length arrays. The dimensions of these arrays are NOT specified in the interface file, since they will be specified in the constructor of the message when message is created. In the .h or .C file, this class is declared as:
The interface translator generates the CMessage_VarsizeMessage class, which contains code to properly allocate, pack and unpack the VarsizeMessage.
One can allocate messages of the VarsizeMessage class as follows:
Another way of allocating a varsize message is to pass a sizes in an array instead of the parameter list. For example,
No special handling is needed for deleting varsize messages.
The CHARM++ interface translator generates implementation for three static methods for the message class CMessage_mtype. These methods have the prototypes:
One may choose not to use the translator-generated methods and may override these implementations with their own alloc, pack and unpack static methods of the mtype class. The alloc method will be called when the message is allocated using the C++ new operator. The programmer never needs to explicitly call it. Note that all elements of the message are allocated when the message is created with new. There is no need to call new to allocate any of the fields of the message. This differs from a packed message where each field requires individual allocation. The alloc method should actually allocate the message using CkAllocMsg, whose signature is given below:
For varsize messages, these static methods alloc, pack, and unpack are generated by the interface translator. For example, these methods for the VarsizeMessage class above would be similar to:
The pointers in a varsize message can exist in two states. At creation, they are valid C++ pointers to the start of the arrays. After packing, they become offsets from the address of the pointer variable to the start of the pointed-to data. Unpacking restores them to pointers.
In many cases, a message must store a non-linear data structure using pointers. Examples of these are binary trees, hash tables etc. Thus, the message itself contains only a pointer to the actual data. When the message is sent to the same processor, these pointers point to the original locations, which are within the address space of the same processor. However, when such a message is sent to other processors, these pointers will point to invalid locations.
Thus, the programmer needs a way to ``serialize'' these messages only if the message crosses the address-space boundary. CHARM++ provides a way to do this serialization by allowing the developer to override the default serialization methods generated by the CHARM++ interface translator. Note that this low-level serialization has nothing to do with parameter marshalling or the PUP framework described later.
Packed messages are declared in the .ci file the same way as ordinary messages:
Like all messages, the class PMessage needs to inherit from CMessage_PMessage and should provide two static methods: pack and unpack. These methods are called by the CHARM++ runtime system, when the message is determined to be crossing address-space boundary. The prototypes for these methods are as follows:
Typically, the following tasks are done in pack method:
On the receiving processor, the unpack method is called. Typically, the following tasks are done in the unpack method:
Here is an example of a packed-message implementation:
While serializing an arbitrary data structure into a flat buffer, one must be very wary of any possible alignment problems. Thus, if possible, the buffer itself should be declared to be a flat struct. This will allow the C++ compiler to ensure proper alignment of all its member fields.
By default, CHARM++ will process the messages you send in roughly FIFO order. For most programs, this behavior is fine. However, some programs need more explicit control over the order in which messages are processed. CHARM++ allows you to control queueing behavior on a per-message basis.
The simplest call available to change the order in which messages are processed is CkSetQueueing.
void CkSetQueueing(MsgType message, int queueingtype)
where queueingtype is one of the following constants:
The first two options, CK_QUEUEING_FIFO and CK_QUEUEING_LIFO, are used as follows:
When message msg1 arrives at its destination, it will be pushed onto the end of the message queue as usual. However, when msg2 arrives, it will be pushed onto the front of the message queue.
The other four options involve the use of priorities. To attach a priority field to a message, one needs to set aside space in the message's buffer while allocating the message. To achieve this, the size of the priority field in bits should be specified as a placement argument to the new operator, as described in Section 3.4.1. Although the size of the priority field is specified in bits, it is always padded to an integral number of ints. A pointer to the priority part of the message buffer can be obtained with this call:
unsigned int *CkPriorityPtr(MsgType msg)
There are two kinds of priorities which can be attached to a message: integer priorities and bitvector priorities. Integer priorities are quite straightforward. One allocates a message, setting aside enough space (in bits) in the message to hold the priority, which is an integer. One then stores the priority in the message. Finally, one informs the system that the message contains an integer priority using CkSetQueueing:
The predefined constant CK_QUEUEING_IFIFO indicates that the message contains an integer priority, and that if there are other messages of the same priority, they should be sequenced in FIFO order (relative to each other). Similarly, a CK_QUEUEING_ILIFO is available. Note that MAXINT is the lowest priority, and NEGATIVE_MAXINT is the highest priority.
Bitvector priorities are somewhat more complicated. Bitvector priorities are arbitrary-length bit-strings representing fixed-point numbers in the range 0 to 1. For example, the bit-string ``001001'' represents the number .001001binary. As with the simpler kind of priority, higher numbers represent lower priorities. Unlike the simpler kind of priority, bitvectors can be of arbitrary length, therefore, the priority numbers they represent can be of arbitrary precision.
Arbitrary-precision priorities are often useful in AI search-tree applications. Suppose we have a heuristic suggesting that tree node should be searched before tree node . We therefore designate that node and its descendants will use high priorities, and that node and its descendants will use lower priorities. We have effectively split the range of possible priorities in two. If several such heuristics fire in sequence, we can easily split the priority range in two enough times that no significant bits remain, and the search begins to fail for lack of meaningful priorities to assign. The solution is to use arbitrary-precision priorities, i.e. bitvector priorities.
To assign a bitvector priority, two methods are available. The first is to obtain a pointer to the priority field using CkPriorityPtr, and to then manually set the bits using the bit-setting operations inherent to C. To achieve this, one must know the format of the bitvector, which is as follows: the bitvector is represented as an array of unsigned integers. The most significant bit of the first integer contains the first bit of the bitvector. The remaining bits of the first integer contain the next 31 bits of the bitvector. Subsequent integers contain 32 bits each. If the size of the bitvector is not a multiple of 32, then the last integer contains 0 bits for padding in the least-significant bits of the integer.
The second way to assign priorities is only useful for those who are using the priority range-splitting described above. The root of the tree is assigned the null priority-string. Each child is assigned its parent's priority with some number of bits concatenated. The net effect is that the entire priority of a branch is within a small epsilon of the priority of its root.
It is possible to utilize unprioritized messages, integer priorities, and bitvector priorities in the same program. The messages will be processed in roughly the following order:
A final warning about prioritized execution: CHARM++ always processes messages in roughly the order you specify; it never guarantees to deliver the messages in precisely the order you specify. However, it makes a serious attempt to be ``close'', so priorities can strongly affect the efficiency of your program.
Immediate messages are specical messages that skip the Charm scheduler, they can be executed in an ``immediate'' fashion even in the middle of a normal running entry method. They are supported only in nodegroup. Also see Section 3.2.1 and example in charm/pgms/charm++/megatest/immediatering.C.
Chares are concurrent objects with methods that can be invoked remotely. These methods are known as entry methods, and must be specified in the interface (.ci) file:
A corresponding chare definition in the .h file would have the form:
Chares are concurrent objects encapsulating medium-grained units of work. Chares can be dynamically created on any processor; there may be thousands of chares on a processor. The location of a chare is usually determined by the dynamic load balancing strategy; however, once a chare commences execution on a processor, it does not migrate to other processors11. Chares do not have a default ``thread of control'': the entry methods in a chare execute in a message driven fashion upon the arrival of a message12.
The entry method definition specifies a function that is executed without interruption when a message is received and scheduled for processing. Only one message per chare is processed at a time. Entry methods are defined exactly as normal C++ function members, except that they must have the return value void (except for the constructor entry method which may not have a return value, and for a synchronous entry method, which is invoked by a threaded method in a remote chare) and they must have exactly one argument which is a pointer to a message.
Each chare instance is identified by a handle which is essentially a global pointer, and is unique across all processors. The handle of a chare has type CkChareID. The variable thishandle holds the handle of the chare whose entry function or public function is currently executing. thishandle is a public instance variable of the chare object which is inherited from the system-defined superclass CBase_ClassType. Following the older syntax, chares are also allowed to inherit directly for the superclass Chare instead of CBase_ClassType, although this form is not suggested. thishandle can be used to set fields in a message. This mechanism allows chares to send their handles to other chares.
First, a chare needs to be declared, both in .ci file and in .h file, as stated earlier. The following is an example of declaration for a chare of user-defined type C, where M1 and M2 are user-defined message types, and someEntry is an entry method.
In the mod.ci file we have:
and in the mod.h file:
Now one can use the class CProxy_chareType to create a new instance of a chare. Here chareType gets replaced with whatever chare type we want. For the above example, proxies would be of type CProxy_C. A number of chare creation calls exist as static or instance methods of class CProxy_chareType:
Each item above is optional, and:
The chare creation method deposits the seed for a chare in a pool of seeds and returns immediately. The chare will be created later on some processor, as determined by the dynamic load balancing strategy. When a chare is created, it is initialized by calling its constructor entry method with the message parameter specified to the chare creation method. The method operator does not return any value but fills in the virtual handle to the newly created chare if specified.
The following are some examples on how to use the chare creation method to create chares.
A message may be sent to a chare using the notation:
chareProxy.EntryMethod(parameters)
This invokes the entry method EntryMethod on the chare referred to by the proxy chareProxy. This call is asynchronous and non-blocking; it returns immediately after sending the message.
You can get direct access to a local chare using the proxy's ckLocal method, which returns an ordinary C++ pointer to the chare if it exists on the local processor; and NULL if the chare does not exist or is on another processor.
Since CHARM++ does not allow global variables for keeping programs portable across a wide range of machines, it provides a special mechanism for sharing data amongst all objects. Read-only variables, messages and arrays are used to share information that is obtained only after the program begins execution and does not change after they are initialized in the dynamic scope of main::main() function. They can be accessed from any chare on any processor as ``global'' variables.When a variable is declared as read only,it is PUPped so that it can be accessed from any chare on any processor. Large data structures containing pointers can be made available as read-only variables using read-only messages or read-only arrays. Read-only variables, messages and arrays can be used just like local variables for each processor, but the user has to allocate space for read-only messages using new to create the message in the main function of the mainchare.
Read-only variables, messages, and arrays are declared by using the type modifier readonly, which is similar to const in C++. Read-only data is specified in the .ci file (the interface file) as:
The variable ReadonlyVarName is declared to be a read-only variable of type Type. Type must be a single token and not a type expression.
The variable ReadonlyMsgName is declared to be a read-only message of type MessageType. Pointers are not allowed to be readonly variables unless they are pointers to message types. In this case, the message will be initialized on every processor.
The variable ReadonlyArrayName is declared to be a read-only array of type Type. Type must be a single token and not a type expression.
Read-only variables, messages and arrays must be declared either as global or as public class static data, and these declarations have the usual form:
Similar declarations preceded by extern would appear in the .h file.
Note: The current CHARM++ translator cannot prevent assignments to read-only variables. The user must make sure that no assignments occur in the program.
Arrays are arbitrarily-sized collections of chares. The entire array has a globally unique identifier of type CkArrayID, and each element has a unique index of type CkArrayIndex. A CkArrayIndex can be a single integer (i.e. 1D array), several integers (i.e. a multidimensional array), or an arbitrary string of bytes (e.g. a binary tree index).
Array elements can be dynamically created and destroyed on any processor, and messages for the elements will still arrive properly. Array elements can be migrated at any time, allowing arrays to be efficiently load balanced. Array elements can also receive array broadcasts and contribute to array reductions.
You can declare a one-dimensional chare array as:
Just as every Chare inherits from the system class CBase_ClassName, every array element inherits from the system class CBase_ClassName. Just as a Chare inherits ``thishandle'', each array element inherits ``thisArrayID'', the CkArrayID of its array, and ``thisIndex'', the element's array index. As well as chares are allowed to inherit directly from class Chare, array elements are allowed to inherit from ArrayElement1D if 1D array, ArrayElement2D if 2D array, and so on up to 6D.
Note A's odd migration constructor, which is normally empty:
Read the section ``Migratable Array Elements'' for more information on the CkMigrateMessage constructor.
You always create an array using the CProxy_Array::ckNew routine. This returns a proxy object, which can be kept, copied, or sent in messages. To create a 1D array containing elements indexed (0, 1, ..., num_elements-1), use:
The constructor is invoked on each array element. For creating higher-dimensional arrays, or for more options when creating the array, see section 3.8.2.
An array proxy responds to the appropriate index call- for 1D arrays, use [i] or (i); for 2D use (x,y); for 3D use (x,y,z); and for user-defined types use [f] or (f).
To send a message to an array element, index the proxy and call the method name:
You may invoke methods on array elements that have not yet been created- by default, the system will buffer the message until the element is created13.
Messages are not guarenteed to be delivered in order. For example, if I invoke a method A, then method B; it is possible for B to be executed before A.
Messages sent to migrating elements will be delivered after the migrating element arrives. It is an error to send a message to a deleted array element.
To broadcast a message to all the current elements of an array, simply omit the index, as:
The broadcast message will be delivered to every existing array element exactly once. Broadcasts work properly even with ongoing migrations, insertions, and deletions.
A reduction applies a single operation (e.g. add, max, min, ...) to data items scattered across many processors and collects the result in one place. CHARM++ supports reductions on the elements of a Chare array.
The data to be reduced comes from each array element, which must call the contribute method:
Reductions are described in more detail in Section 3.14.
To destroy an array element- detach it from the array, call its destructor, and release its memory-invoke its Array destroy method, as:
You must ensure that no messages are sent to a deleted element. After destroying an element, you may insert a new element at its index.
The basic array features described above (creation, messaging, broadcasts, and reductions) are needed in almost every CHARM++ program. The more advanced techniques that follow are not universally needed; but are still often useful.
CHARM++ contains direct support for multidimensional and even user-defined index arrays. These arrays can be declared as:
The last declaration expects an array index of type CkArrayIndexFoo, which must be defined before including the .decl.h file (see ``User-defined array index type'' below).
A 1D array element can access its index via its inherited ``thisIndex'' field; a 2D via ``thisIndex.x'' and ``thisIndex.y'', and a 3D via ``thisIndex.x'', ``thisIndex.y'', and ``thisIndex.z''. The subfields of 4D, 5D, and 6D are respectively {w,x,y,z}, {v,w,x,y,z}, and {x1,y1,z1,x2,y2,z2}. A user-defined index array can access its index as ``thisIndex''.
Likewise, you can create a dense multidimensional array by passing the extents at creation time to ckNew.
For 4D, 5D, 6D and user-defined arrays, this functionality cannot be used. You need to insert the array elements individually (Section 3.8.7).
There are several ways to control the array creation process. You can adjust the map and bindings before creation, change the way the initial array elements are created, create elements explicitly during the computation, and create elements implicitly, ``on demand''.
You can create all your elements using any one of these methods, or create different elements using different methods. An array element has the same syntax and semantics no matter how it was created.
The array creation method ckNew actually takes a parameter of type CkArrayOptions. This object describes several optional attributes of the new array.
The most common form of CkArrayOptions is to set the number of initial array elements. A CkArrayOptions object will be constructed automatically in this special common case. Thus the following code segments all do exactly the same thing:
Note that the ``numElements'' in an array element is simply the numElements passed in when the array was created. The true number of array elements may grow or shrink during the course of the computation, so numElements can become out of date. This ``bulk'' constructor approach should be preferred where possible, especially for large arrays. Bulk construction is handled via a broadcast which will be significantly more efficient in the number of messages required than inserting each element individually which will require one message send per element.
You can use CkArrayOptions to specify a ``map object'' for an array. The map object is used by the array manager to determine the ``home'' processor of each element. The home processor is the processor responsible for maintaining the location of the element.
There is a default map object, which maps 1D array indices in a round-robin fashion to processors, and maps other array indices based on a hash function.
A custom map object is implemented as a group which inherits from CkArrayMap and defines these virtual methods:
For example, a simple 1D blockmapping scheme. Actual mapping is handled in the procNum function.
The map object described above can also be used to create the initial set of array elements in a distributed fashion. An array's initial elements are created by its map object, by making a call to populateInitial on each processor.
You can create your own set of elements by creating your own map object and overriding this virtual function of CkArrayMap:
In this call, arrayHdl is the value returned by registerArray, numInitial is the number of elements passed to CkArrayOptions, msg is the constructor message to pass, and mgr is the array to create.
populateInitial creates new array elements using the method void CkArrMgr::insertInitial(CkArrayIndex idx,void *ctorMsg). For example, to create one row of 2D array elements on each processor, you would write:
Thus calling ckNew(10) on a 3-processor machine would result in 30 elements being created.
You can ``bind'' a new array to an existing array using the bindTo method of CkArrayOptions. Bound arrays act like separate arrays in all ways except for migration- corresponding elements of bound arrays always migrate together. For example, this code creates two arrays A and B which are bound together- A[i] and B[i] will always be on the same processor.
An arbitrary number of arrays can be bound together- in the example above, we could create yet another array C and bind it to A or B. The result would be the same in either case- A[i], B[i], and C[i] will always be on the same processor.
There is no relationship between the types of bound arrays- it is permissible to bind arrays of different types or of the same type. It is also permissible to have different numbers of elements in the arrays, although elements of A which have no corresponding element in B obey no special semantics. Any method may be used to create the elements of any bound array.
Bound arrays are often useful if A[i] and B[i] perform different aspects of the same computation, and thus will run most efficiently if they lie on the same processor. Bound array elements are guaranteed to always be able to interact using ckLocal (see section 3.8.12), although the local pointer must be refreshed after any migration. This should be done during the pup routine. When migrated, all elements that are bound together will be created at the new processor before pup is called on any of them, ensuring that a valid local pointer to any of the bound objects can be obtained during the pup routine of any of the others.
For example, an array Alibrary is implemented as a library module. It implements a certain functionality by operating on a data array dest which is just a pointer to some user provided data. A user defined array UserArray is created and bound to the array Alibrary to take advanatage of the functionality provided by the library. When bound array element migrated, the data pointer in UserArray is re-allocated in pup(), thus UserArray is responsible to refresh the pointer dest stored in Alibrary.
In addition to creating initial array elements using ckNew, you can also create array elements during the computation.
You insert elements into the array by indexing the proxy and calling insert. The insert call optionally takes parameters, which are passed to the constructor; and a processor number, where the element will be created. Array elements can be inserted in any order from any processor at any time. Array elements need not be contiguous.
If using insert to create all the elements of the array, you must call CProxy_Array::doneInserting before using the array.
The doneInserting call starts the reduction manager (see ``Array Reductions'') and load balancer (see 3.11.1)- since these objects need to know about all the array's elements, they must be started after the initial elements are inserted. You may call doneInserting multiple times, but only the first call actually does anything. You may even insert or destroy elements after a call to doneInserting, with different semantics- see the reduction manager and load balancer sections for details.
If you do not specify one, the system will choose a procesor to create an array element on based on the current map object.
Normally, invoking an entry method on a nonexistant array element is an error. But if you add the attribute [createhere] or [createhome] to an entry method, the array manager will ``demand create'' a new element to handle the message.
With [createhome], the new element will be created on the home processor, which is most efficient when messages for the element may arrive from anywhere in the machine. With [createhere], the new element is created on the sending processor, which is most efficient if when messages will often be sent from that same processor.
The new element is created by calling its default (taking no paramters) constructor, which must exist and be listed in the .ci file. A single array can have a mix of demand-creation and classic entry methods; and demand-created and normally created elements.
CHARM++ array indices are arbitrary collections of integers. To define a new array index, you create an ordinary C++ class which inherits from CkArrayIndex and sets the ``nInts'' member to the length, in integers, of the array index.
For example, if you have a structure or class named ``Foo'', you can use a Foo object as an array index by defining the class:
Note that Foo's size must be an integral number of integers- you must pad it with zero bytes if this is not the case. Also, Foo must be a simple class- it cannot contain pointers, have virtual functions, or require a destructor. Finally, there is a CHARM++ configuration-time option called CK_ARRAYINDEX_MAXLEN which is the largest allowable number of integers in an array index. The default is 3; but you may override this to any value by passing ``-DCK_ARRAYINDEX_MAXLEN=n'' to the CHARM++ build script as well as all user code. Larger values will increase the size of each message.
You can then declare an array indexed by Foo objects with
Note that since our CkArrayIndexFoo constructor is not declared with the explicit keyword, we can equivalently write the last line as:
When you implement your array element class, as shown above you can inherit from CBase_ClassName, a class templated by the index type Foo. In the old syntax, you could also inherit directly from ArrayElementT. The array index (an object of type Foo) is then accessible as ``thisIndex''. For example:
Array objects can migrate from one PE to another. For example, the load balancer (see section 3.11.1) might migrate array elements to better balance the load between processors. For an array element to migrate, it must implement a pack/unpack or ``pup'' method:
Please note that if your object contains Structured Dagger code (see section ``Structured Dagger'') you must use the following syntax to correctly pup the object:
See the section ``PUP'' for more details on pup routines and the PUP::er type.
The system uses one pup routine to do both packing and unpacking by passing different types of PUP::ers to it. You can determine what type of PUP::er has been passed to you with the isPacking(), isUnpacking(), and isSizing() calls.
An array element can migrate by calling the migrateMe(destination processor) member function- this call must be the last action in an element entry point. The system can also migrate array elements for load balancing (see the section 3.11.3).
To migrate your array element to another processor, the CHARM++ runtime will:
Migration constructors, then, are normally empty- all the unpacking and allocation of the data items is done in the element's pup routine. Deallocation is done in the element destructor as usual.
see section 3.11.1
You can get direct access to a local array element using the proxy's ckLocal method, which returns an ordinary C++ pointer to the element if it exists on the local processor; and NULL if the element does not exist or is on another processor.
Note that if the element migrates or is deleted, any pointers obtained with ckLocal are no longer valid. It is best, then, to either avoid ckLocal or else call ckLocal each time the element may have migrated; e.g., at the start of each entry method.
CHARM++ supports array section which is a subset of array elements in a chare array. A special proxy for an array section can be created given a list of array indexes of elements. Multicast operations are directly supported in array section proxy with an unoptimized direct-sending implementation. Section reduction is not directly supported by the section proxy. However, an optimized section multicast/reduction library called ''CkMulticast'' is provided as a separate library module, which can be plugged in as a delegation of a section proxy for performing section-based multicasts and reductions.
For each chare array "A" declared in a ci file, a section proxy of type "CProxySection_A" is automatically generated in the decl and def header files. In order to create an array section, a user needs to provide array indexes of all the array section members. You can create an array section proxy in your application by invoking ckNew() function of the CProxySection. For example, for a 3D array:
Alternatively, one can do the same thing by providing [lbound:ubound:stride] for each dimension:
The above codes create a section proxy that contains array elements of [0:9, 0:19:2, 0:29:2].
For user-defined array index other than CkArrayIndex1D to CkArrayIndex6D, one needs to use the generic array index type: CkArrayIndexMax.
Once you have the array section proxy, you can do multicast to all the section members, or send messages to one member using its index that is local to the section, like these:
You can move the section proxy in a message to another processor, and still safely invoke the entry functions to the section proxy.
In the multicast example above, for a section with k members, total number of k messages will be sent to all the memebers, which is considered inefficient when several members are on a same processor, in which case only one message needs to be sent to that processor and delivered to all section members on that processor locally. To support this optimization, a separate library called CkMulticast is provided. This library also supports section based reduction.
Note: Use of the bulk array constructor (dimensions given in the CkNew or CkArrayOptions rather than individual insertion) will allow construction to race ahead of several other startup procedures, this creates some limitation on the construction delegation and use of array section proxies. For safety, array sections should be created in a post constructor entry method.
To use the library, you need to compile and install CkMulticast library and link your applications against the library using -module:
CkMulticast library is implemented using delegation(Sec. 3.20). A special ''CkMulticastMgr'' Chare Group is created as a deletegation for section multicast/reduction - all the messages sent by the section proxy will be passed to the local delegation branch.
To use the CkMulticast delegation, one needs to create the CkMulticastMgr Group first, and then setup the delegation relationship between the section proxy and CkMulticastMgr Group. One only needs to create one CkMulticastMgr Group globally. CkMulticastMgr group can serve all multicast/reduction delegations for different array sections in an application:
By default, CkMulticastMgr group builds a spanning tree for multicast/reduction with a factor of 2 (binary tree). One can specify a different factor when creating a CkMulticastMgr group. For example,
Note, to use CkMulticast library, all multicast messages must inherit from CkMcastBaseMsg, as the following. Note that CkMcastBaseMsg must come first, this is IMPORTANT for CkMulticast library to retrieve section information out of the message.
Due to this restriction, you need to define message explicitly for multicast entry functions and no parameter marshalling can be used for multicast with CkMulticast library.
Since an array element can be members for multiple array sections, there has to be a way for each array element to tell for which array section it wants to contribute. For this purpose, a data structure called ''CkSectionInfo'' is created by CkMulticastMgr for each array section that the array element belongs to. When doing section reduction, the array element needs to pass the CkSectionInfo as a parameter in the contribute(). The CkSectionInfo can be retrieved from a message in a multicast entry function using function call CkGetSectionInfo:
Note that the cookie cannot be used as a one-time local variable in the function, the same cookie is needed for the next contribute. This is because cookie includes some context sensive information for example the reduction counter. Function CkGetSectionInfo() only update some part of the data in cookie, not creating a brand new one.
Similar to array reduction, to use section based reduction, a reduction client CkCallback object need to be created. You may pass the client callback as an additional parameter to contribute. If different contribute calls pass different callbacks, some (unspecified, unreliable) callback will be chosen for use. See the followin example:
If no member passes a callback to contribute, the reduction will use the default callback. You set the default callback for an array section using the setReductionClient call by the section root member. A CkReductionMsg message will be passed to this callback, which must delete the message when done.
Same as in array reduction, users can use built-in reduction types(Section 3.14.2) or define his/her own reducer functions (Section 3.14.3).
Using multicast/reduction, you don't need to worry about array migrations. When migration happens, array element in the array section can still use the CkSectionInfo it stored previously for doing reduction. Reduction messages will be correctly delivered but may not be as efficient until a new multicast spanning tree is rebuilt internally in CkMulticastMgr library. When a new spanning tree is rebuilt, a updated CkSectionInfo is passed along with a multicast message, so it is recommended that CkGetSectionInfo() function is always called when a multicast message arrives (as shown in the above SayHi example).
In case when a multicast root migrates, one needs to reconstruct the spanning tree to get optimal performance. One will get the following warning message if not doing so: "Warning: Multicast not optimized after multicast root migrated." In current implementation, user needs to initiate the rebuilding process like:
A group14 is a collection of chares where there exists one chare (or branch) on each processor. Each branch has its own data members. Groups have a definition syntax similar to normal chares, and they have to inherit from the system defined class CBase_ClassName. 15.
In the interface file, we declare
In the .h file, we define GroupType as follows:
A group is identified by a globally unique group identifier, whose type is CkGroupID. This identifier is common to all of the group's branches and can be obtained from the variable thisgroup, which is a public local variable of the Group superclass. For groups, thishandle is the handle of the particular branch in which the function is executing: it is a normal chare handle.
Groups can be used to implement data-parallel operations easily. In addition to sending messages to a particular branch of a group, one can broadcast messages to all branches of a group. There can be many instances corresponding to a group type. Each instance has a different group handle, and its own set of branches.
Given a .ci file as follows:
and the following .h file:
we can create a group in a manner similar to a regular chare.
Before sending a message to a group via an entry method, we need to get a proxy of that group.
A message may be sent to a particular branch of group using the notation:
This sends the given parameters to the branch of the group referred to by groupProxy which is on processor number Processor at the entry method EntryMethod, which must be a valid entry method of that group type. This call is asynchronous and non-blocking; it returns immediately after sending the message.
A message may be broadcast to all branches of a group (i.e., to all processors) using the notation :
This sends the given parameters to all branches of the group at the entry method EntryMethod, which must be a valid entry method of that group type. This call is asynchronous and non-blocking; it returns immediately after sending the message.
Sequential objects, chares and other groups can gain access to the local (i.e., on their processor) group object using:
This call returns a regular C++ pointer to the actual object (not a proxy) referred to by the proxy groupProxy. Once a proxy to the local branch of a group is obtained, that branch can be accessed as a regular C++ object. Its public methods can return values, and its public data is readily accessible.
Thus a dynamically created chare can call a public method of a group without needing to know which processor it actually resides: the method executes in the local branch of the group.
One very nice use of Groups is to reduce the number of messages sent between processors by collecting the data from all the chares on a single processor before sending that data to the mainchare. To do this, create basic chares to break up the work of a problem. Also, create a group. When a particular chare finishes its work, it reports its findings to the local branch of the group. When all the chares on one processor are complete, the local branch of the group can then report to the main chare. This reduces the number of messages sent to main from the number of chares created to the number of processors.
Node groups are very similar to the group objects already discussed in that node groups are collections of chares as well. Node groups, however, have one chare per node rather than one chare per processor. So, each node contains a branch of the node group, each containing one set of data members. When an entry method of a node group is executed, it runs on only one processor within each node.
Node groups have a definition syntax that is very similar to groups. 16For example, in the interface file, we declare:
In the .h file, we define NodeGroupType as follows:
Like groups, nodegroups are identified by a globally unique identifier of type CkGroupID. Just like with groups, this identifier is common to all branches of the nodegroup and can be obtained from the variable thisgroup, and once again, thishandle is the handle of the particular branch in which the function is executing.
Node groups may possess exclusive entry methods. These are entry methods that will not run while other exclusive entry methods of that node group are running on the same node. For instructions for making an entry method exclusive, refer to section 3.2.1.
For certain applications, node groups can be used in the place of regular groups to cut down on messaging overhead when shared memory access is possible. For example, consider a parallel program that does one calculation that can be decomposed into several mutually exclusive subcalculations. The program distributes the work amongst all of the processors, the subresults are all stored in the local branch of a group, and when the local branch has recieved all of its results, it relays everything to one particular processor where the subresults are put together into the final result. When normal groups are used, the number of messages sent is (# of processors). However, if node groups are used, a number of message sends will be replaced by local memory accesses if there is more than one processor per node. Instead, the number of messages sent is (# of nodes).
Just like groups, there can be many instances corresponding to a single node group type, and each instance has a different group handle, and its own set of branches.
Methods can be invoked either on a particular branch of a nodegroup by specifying a node number as a method parameter. In the absence of such a parameter, the call is treated as broadcast on a nodegroup, i.e. executed by all nodes. When a method is invoked on a particular branch of a nodegroup, it may be executed by ANY processor in that node. Thus two invocations of a specific method on a particular branch of a nodegroup may be carried out simultaneously by two different processors of the node. If that method contains code that should be executed by only one processor at a time, the method should be flagged exclusive in the interface file. If a method M of a nodegroup NG is marked exclusive, it means that while that method is being executed by any processor within a node, no other processor within the same node may execute any other exclusive method of that nodegroup branch. Other processors are free to execute other non-exclusive methods of that nodegroup branch, however.
The local branch of a nodegroup can be accessed using CkLocalNodeBranch() function. Thus data members could be accessed/modified or methods could be invoked on a branch of a nodegroup using this function. Note that such accesses are not thread-safe by default. Concurrent invocation of a method on a nodegroup by different processors within a node may result in unpredictable runtime behavior. One way to avoid this is to use node-level locks (described in Converse manual.)
CkLocalNodeBranch returns a generic (void *) pointer, similar to CkLocalBranch. Also, the static method ckLocalNodeBranch of the proxy class of appropriate nodegroup can be called to get the correct type of pointer.
Charm++ supports Load Balancing, enabled by the fact there are a large number of chares or chare-array-elements typically available to map to existing processors, and that they can be migrated at runtime.
Many parallel applications, especially physical simulations, are iterative in nature. They may contain a series of time-steps, and/or iterative solvers that run to convergence. For such computations, typically, the heuristic principle that we call "principle of persistence" holds: the computational loads and communication patterns between objects (chares) tend to persist over time, even in dynamic applications. In such cases, recent past is a good predictor of near future. Measurement-based chare migration strategies are useful in this context. Currently these apply to chare-array elements, but they may be extended to chares in the future.
For applications without such iterative structure, or with iterative structure but without the predictability (i.e. where the principle of persistence does not apply), Charm++ supports "seed balancers" that move seeds for new chares among processors (possibly repeatedly) to achieve load balance. These strategies are currently available for both chares and chare-arrays. Seed balancers were the original load balancers provided in Charm since the late '80s. They are extremely useful for state-space search applications, and are also useful in other computations, as well as in conjunction with migration strategies.
For iterative computations when there is a correlation between iterations/steps but either it is not strong or the machine environment is not predictable (noise due to OS interrupts on small time steps, or time-shared desk-top machines), one can use a combination of the two kinds of strategies. The base-line load balancing is provided by migration strategies; But in each iteration one also spawns off work in the form of chares that can run on any processor. The seed balancer will handle such work as it arises.
In CHARM++, objects(except groups, nodegroups) can migrate from processor to processor at run-time. Object migration can potentially improve the performance of the parallel program by migrating objects from overloaded processors to underloaded ones.
CHARM++ implements a generic, measurement-based load balancing framework which automatically instruments all CHARM++ objects, collects computation load and communication structure during execution and stores them into a load balancing database. CHARM++ then provides a collection of load balancing strategies whose job is to decide on a new mapping of objects to processors based on the information from the database. Such measurement based strategies are efficient when we can reasonably assume that objects in CHARM++ application tend to exhibit temporal correlation in their computation and communication patterns, i.e. future can be to some extent predicted using the historical measurement data, allowing effective measurement-based load balancing without application-specific knowledge.
Here are the two terms often used in CHARM++ load balancing framework:
Load balancing can be performed in either a centralized, fully distributed or hierarchical (or hybrid) fashion.
In centralized approaches, the entire machine's load and communication structure are accumulated to a single point, typically processor 0, followed by a decision making process to determine the new distribution of CHARM++ objects. Centralized load balancing requires synchronization which may incur an overhead and delay. However, due to the fact that the decision process has a high degree of the knowledge about the entire machine, it tends to be more accurate.
In distributed approaches, machine states are only exchanged among neighboring processors. There is no global synchronization. However, they will not, in general, provide an immediate restoration for load balance - the process is iterated until the load balance can be achieved.
In hierarchical approaches, processors are divided into independent autonomous sets of processor groups and these groups are organized in hierarchies, therefore decentralizing the load balancing task. Hybrid strategies are used to load balancing load on processors inside each processor group, and processors across groups in a hierarchical fashion.
Listed below are some of the available nontrivial centralized load balancers and their brief descriptions:
Listed below are the distributed load balancers:
User can choose any load balancing strategy he or she thinks is good for the application. The compiler and run-time options are described in section 3.11.6.
Load balancing framework is well integrated with Chare array implementation - when a chare array is created, it automatically registers its elements with the load balancing framework. The instrumentation of compute time (wall/cpu time) and communication pattern are done automatically and APIs are provided for users to trigger the load balancing.
To use the load balancer, you must make your array elements migratable (see migration section above) and choose a load balancing strategy (see the section 3.11.2 for a description of available load balancing strategies).
We implemented three methods to use load balancing for chare arrays to meet different needs of the applications. These methods are different in how and when a load balancing phase starts. The three methods are: periodic load balancing mode, and manual mode.
In periodic load balancing mode, a user just needs to specify how often he wants the load balancing to occur, using +LBPeriod runtime option to specify a time interval.
In sync mode, users can tell load balancer explicitly when is a good time to trigger load balancing by inserting a function call in the user code.
In the above two load balancing modes, users don't need to worry about how to start load balancing. However, in one scenario, the above automatic load balancers will fail to work - array elements are created by dynamic insertion. This is because the above two load balancing modes require an application to have fixed number of objects at the time of load balancing. The array manager needs to maintain a head count of local array elements for the local barrier. In this case, users have to use the manual mode to trigger load balancer themselves. The API is described below.
The detailed APIs of these three methods are described as follows:
Note that AtSync() is not a blocking call, it just gives a hint to load balancing that it is time for load balancing. During the time between AtSync and ResumeFromSync, the object may be migrated. One can choose to let objects continue working with incoming messages, however keep in mind the object may suddenly show up in another processor and make sure no operations that could possibly prevent migration be performed. The most commonly used approach is to force the object to be idle until load balancing finishes, the object can start working again when ResumeFromSync() is called.
The function call StartLB() starts load balancing immediately. This call should be made at only one place on only one processor. This function is also not blocking, the object will continue to process messages and the load balancing when triggered happens at the background.
TurnManualLBOff() turns off manual load balancing and switches back to the automatic Load balancing mode.
Load balancers migrate objects automatically. For an array element to migrate, user can refer to section 3.8.10 for how to write a ``pup'' for an array element.
In general one needs to pack the whole snapshot of the member data in an array element in the pup subroutine. This is because the migration of the object may happen at any time. In certain load balancing scheme where user explicitly control when the load balancing happens, user may choose to pack only a part of the data and may skip those temporary data.
There are several utility functions that can be called in applications to configure the load balancer, etc. These functions are:
Load balancing strategies are implemented as libraries in CHARM++. This allows programmers to easily experiment with different existing strategies by simply linking a pool of strategy modules and choosing one to use at run-time via a command line option.
Please note that linking a load balancing module is different from activating it:
Below are the descriptions about the compiler and run-time options:
The list of existing load balancers are in section 3.11.2. Note: you can have multiple -module *LB options. LB modules are linked into a program, but they are not activated automatically at runtime. Using -balancer at compile time in order to activate load balancers automatically at run time. Having -balancer A implies -module A, so you don't have to write -module A again, although it does not hurt. Using EveryLB is a convenient way to link against all existing load balancers. One of the load balancers called MetisLB requires the METIS library which is located at: charm/src/libs/ck-libs/parmetis/METISLib/. You need to compile METIS library by "make METIS" under charm/tmp after you compile Charm++.
Run-time options are doing the same thing as compile time options, but they can override the compile time options.
Note: +balancer option works only if you have already linked the load balancers module at compile time. Giving +balancer with a wrong LB name will result in a runtime error. When you have used -balancer A as compile time option, you don't need to use +balancer A again to activate it at runtime. However, you can use +balancer B to override the compile time option and choose to activate B instead of A.
When you don't activate any of the load balancers at compile time or run time, and your program counts on a load balancer because you use AtSync() and expect ResumeFromSync() to be called to continue, be assured that your program can still run. A special load balancer called NullLB is automatically created in this case which just calls ResumeFromSync() after AtSync(). This default load balancer keeps a program from hanging after calling AtSync(). The NullLB is smart enough to keep silent if another load balancer is created.
There are a few other run-time options for load balancing that may be useful:
The simulation feature of load balancing framework allows the users to collect information about the compute wall/cpu time and communication of the chares during a particular run of the program and use this information to later test different load balancing strategies to see which one is suitable for the programs behaviour. Currently, this feature is supported only for the centralized load balancing strategies. For this, the load balancing framework accepts the following command line options:
./charmrun pgm +balancer <Strategy to test> +LBSim 2 +LBSimSteps 4
+LBDumpFile dump.dat [+LBSimProcs 900]
When objects do not follow the assumption that the future workload will be the same as the past, the load balancer might not have the correct information to do a correct rebalancing job. To prevent this the user can provide a transition function to the load balancer to predict what will be the future workload, given the past, instrumented one. As said, the user might provide a specific class which inherits from LBPredictorFunction and implement the appropriate functions. Here is the abstract class:
class LBPredictorFunction {
public:
int num_params;
virtual void initialize_params(double *x);
virtual double predict(double x, double *params) =0;
virtual void print(double *params) {PredictorPrintf("LB: unknown model\n");};
virtual void function(double x, double *param, double &y, double *dyda) =0;
};
double predict(double x, double *param) {return (param[0]*x + param[1]);}
void function(double x, double *param, double &y, double *dyda) {
y = predict(x, param);
dyda[0] = x;
dyda[1] = 1;
}
If the model behind the computation is not known, the user can leave the system to use a predefined default function.
As seen, the function can have several parameters which will be learned during the execution of the program. For this, two parameters can be setup at command line to specify the learning behaviour:
Seed load balancing involves the movement of object creation messages, or "seeds", to create a balance of work across a set of processors. This load balancing scheme is used for load balancing chares only at creation time. When the chare is created on a processor, there is no movement of the chare due to the seed load balancer. The measurement based load balancer described in previous subsection perform the task of moving chares during work to achieve load balance.
Several variations of strategies have been designed and analyzed.
Other strategies can also be explored follow the simple API of the
seed load balancer.
Seed load balancers for Chares:
Seed load balancers can be directly used for load balancing Chares. The default seed load balancer which is always linked is the random seed load balancer. Users can choose another strategy listed above and link as a plugin module into binary as described below.
Seed load balancers for Array Elements:
Seed load balancers can also be used for array elements in the same way as they are used for individual chares. Chare array is a collection of individual Chares in Charm++. Since Chare Array has its internal strategy of static mapping of individual array elements to processors using CkArrayMap 3.8.4 19, a special CkArrayMap called CldMap must be created and passed into array creation calls to interface with seed load balancer.
For creating an empty array and then inserting chares into it, the API is as follows:
For initially populating the array with chares at time of creation the API is as follows:
The details about array creation are explained in section 3.8 of the manual.
Compile and run time options for seed load balancers
To choose a seed load balancer other than the default rand strategy, use link time command line option -balance foo.
When using neighbor seed load balancer, one can also specify the virtual topology at runtime. Use +LBTopo topo, where topo can be one of: (a) ring, (b) mesh2d, (c) mesh3d and (d) graph.
To write a seed load balancer, name your file as cldb.foo.c, where foo is the strategy name. Compile it in the form of library under charm/lib, named as libcldb-foo.a, where foo is the strategy name used above. Now one can use -balance foo as compile time option to charmc to link with the foo seed load balancer.
A simple example of how to use a load balancer in sync mode in one's application is presented below.
/*** lbexample.ci ***/
mainmodule lbexample {
readonly CProxy_Main mainProxy;
readonly int nElements;
mainchare Main {
entry Main(CkArgMsg *m);
entry void done(void);
};
array [1D] LBExample {
entry LBExample(void);
entry void doWork();
};
};
-----------------------------------------------------
/*** lbexample.C ***/
#include <stdio.h>
#include "lbexample.decl.h"
/*readonly*/ CProxy_Main mainProxy;
/*readonly*/ int nElements;
#define MAX_WORK_CNT 50
#define LB_INTERVAL 5
/*mainchare*/
class Main : public Chare
{
private:
int count;
public:
Main(CkArgMsg* m)
{
/*....Initialization....*/
mainProxy = thishandle;
CProxy_LBExample arr = CProxy_LBExample::ckNew(nElements);
arr.doWork();
};
void done(void)
{
count++;
if(count==nElements){
CkPrintf("All done\n");
CkExit();
}
};
};
/*array [1D]*/
class LBExample : public CBase_LBExample
{
private:
int workcnt;
public:
LBExample()
{
workcnt=0;
/* May initialize some variables to be used in doWork */
//Must be set to CmiTrue to make AtSync work
usesAtSync=CmiTrue;
}
LBExample(CkMigrateMessage *m) { /* Migration constructor -- invoked when chare migrates */ }
/* Must be written for migration to succeed */
void pup(PUP::er &p){
CBase_LBExample::pup(p);
p|workcnt;
/* There may be some more variables used in doWork */
}
void doWork()
{
/* Do work proportional to the chare index to see the effects of LB */
workcnt++;
if(workcnt==MAX_WORK_CNT)
mainProxy.done();
if(workcnt%LB_INTERVAL==0)
AtSync();
else
doWork();
}
void ResumeFromSync(){
doWork();
}
};
#include "lbexample.def.h"
Charm++ programmers can control CPU load data in the load balancing database before a load balancing phase is started (which is the time when load balancing database is collected and used by load balancing strategies).
In an array element, the following function can be invoked to overwrite the CPU load that is measured by load balancing framework.
setObjTime() is defined as a method of class CkMigratable, which is the superclass of all array elements.
The users can also retrieve the current timing that the load balancing runtime has measured for the current array element.
This is useful when the users want to derive a new CPU load based on the existing one.
Charm++ programmers can also choose to feed load balancer with their own CPU timing of each Chare based on certain computational model of the applications.
To do so, first turn off automatic CPU load measurement completely by setting:
in array element's constructor.
Then the users need to implement the following function to the chare array classes:
This function served as a callback that is called on each chare object when AtSync() is called and ready to do load balancing. The implementation of UserSetLBLoad() is simply to set the current chare object's CPU load to load balancer framework. setObjTime() described above can be used for this.
Charm++ programmers can pick load balancing strategy from Charm++'s built-in strategies(see 3.11.2) for the best performance based on the characteristics of their applications, they can also choose to write their own load balancing strategies.
Charm++ load balancing framework provides a simple scheme to incorporate new load balancing strategies. To write a new load balancing strategy involves the following steps (We use an example of writing a centralized load balancer fooLB to illustrate the steps).
To write a load balancing strategy, one may want to know what information is measured during the runtime and how it is represented in the load balancing database data structure?
There are mainly 3 categories of information: a) processor information including processor speed, background load; b) object information including per object cpu/wallclock compute time and c) communication information .
The database data structure named LDStats is defined in CentralLB.h:
struct ProcStats { // per processor
double total_walltime;
double total_cputime;
double idletime;
double bg_walltime;
double bg_cputime;
int pe_speed;
double utilization;
CmiBool available;
int n_objs;
}
struct LDStats { // load balancing database
ProcStats *procs;
int count;
int n_objs;
int n_migrateobjs;
LDObjData* objData;
int n_comm;
LDCommData* commData;
int *from_proc, *to_proc;
}
In CHARM++, quiescence is defined as the state in which no processor is executing an entry point, and no messages are awaiting processing.
CHARM++ provides two facilities for detecting quiescence: CkStartQD and CkWaitQD.
CkStartQD registers with the system a callback that should be made the next time quiescence is detected. CkStartQD has two variants which expect the following arguments:
To retrieve the corresponding index of a particular entry method, you must use a static method contained within the CkIndex object corresponding to the chare containing that entry method. The syntax of this call is as follows:
where ChareName is the name of the chare containing the desired entry method, EntryMethod is the name of that entry method, and parameters are the parameters taken by the method. These parameters are only used to resolve the proper EntryMethod; they are otherwise ignored.
Upon quiescence detection, specified callback is called with no parameters.
CkWaitQD, by contrast, does not register a callback. Rather, CkWaitQD blocks and does not return until quiescence is detected. It takes no parameters and returns no value. A call to CkWaitQD simply looks like this:
Keep in mind that CkWaitQD should only be called from threaded entry methods because a call to CkWaitQD suspends the current thread of execution, and if it were called outside of a threaded entry method it would suspend the main thread of execution of the processor from which CkWaitQD was called and the entire program would come to a grinding halt on that processor.
A reduction applies a single operation (e.g. add, max, min, ...) to data items scattered across many processors and collects the result in one place. CHARM++ supports reductions over the members of an array or group.
The data to be reduced comes from a call to the member contribute method:
This call contributes nBytes bytes starting at data to the reduction type (see reduction types, below). Unlike sending a message, you may use data after the call to contribute. All members must call contribute, and all must use the same reduction type.
When you create a new member, it is expected to contribute to the next reduction not already in progress on that processor. The reduction will complete properly even if members are migrated or deleted during the reduction.
For example, if we want to sum each member's single integer myInt, we would use:
The built-in reduction types (see below) can also handle arrays of numbers. For example, if each element of an array has a pair of doubles forces[2] which need to be summed up (separately) across every element, from each element call:
Note that since C++ arrays (like forces[2]) are already pointers, we don't use &forces.
Sometimes it is not important the data to be reduced, but only the fact that all elements have reached a synchronization point. In this case a simpler version of contribute can be used:
In all cases, the result of the reduction operation is passed to the reduction client. Many different kinds of reduction clients can be used, as explained below (Section 3.14.1).
After the data is reduced, it is passed to a you via a callback object, as described in section 3.15. The message passed to the callback is of type CkReductionMsg. The important members of CkReductionMsg are getSize(), which returns the number of bytes of reduction data; and getData(), which returns a ``void *'' to the actual reduced data.
You may pass the client callback as an additional parameter to contribute. If different contribute calls pass different callbacks, some (unspecified, unreliable) callback will be chosen for use.
In the case of the reduced version used for synchronization purposes, the callback parameter will be the only input parameter:
If no member passes a callback to contribute, the reduction will use the default callback. You set the default callback for an array or group using the ckSetReductionClient proxy call on processor zero. Again, a CkReductionMsg message will be passed to this callback, which must delete the message when done.
So, for the previous reduction on chare array arr:
and the actual entry point:
(See pgms/charm++/RedExample for a complete example).
For backward compatability, rather than a general callback you can specify a peculiar kind of C function using ckSetReductionClient or setReductionClient. This C function takes a user-defined parameter (passed to setReductionClient) and the actual reduction data, which it must not deallocate.
CHARM++ includes several built-in reduction types, used to combine the separate contributions. Any of them may be passed as an CkReduction::reducerType type to contribute.
The first four reductions (sum, product, max, and min) work on int, float, or double data as indicated by the suffix. The logical reductions (and, or) only work on integer data. All the built-in reductions work on either single numbers (pass a pointer) or arrays- just pass the correct number of bytes to contribute.
CkReduction::set returns a collection of CkReduction::setElement objects, one per contribution. This class has definition:
To extract the contribution of each array element from a reduction set, use the next routine repeatedly:
The reduction set order is undefined. Add a source field to your contribution if you need to know which array element gave a particular contribution. This will require you to do your own serialize/unserialize operation on your element structure if your reduction element data is complex. Consider using the PUP interface see 3.16 to simplify your object serialization needs.
If your data is order dependant, or if your data is just too heterogenous to be handled elegantly by the predefined types and you don't want to undertake multiple reductions, it may be best to define your own reduction type. See the next section (Section 3.14.3) for details.
It is possible to define a new type of reduction, performing a user-defined operation on user-defined data. A reduction function combines separate contributions (from this or other processors) into a single combined value.
The input to a reduction function is a list of CkReductionMsgs. A CkReductionMsg is a thin wrapper around a buffer of untyped data to be reduced. The output of a reduction function is a single CkReductionMsg containing the reduced data, which you should create using the CkReductionMsg::buildNew(int nBytes,const void *data) method.
Thus every reduction function has the prototype:
For example, a reduction function to add up contributions consisting of two machine short integers would be:
You must register your reduction function with CHARM++ using CkReduction::addReducer from an initcall routine (see section 3.18 for details on the initcall mechanism). CkReduction::addReducer returns a CkReduction::reducerType which you can later pass to contribute. Since initcall routines are executed once on every node, you can safely store the CkReduction::reducerType in a global or class-static variable. For the example above:
Note that you cannot call CkReduction::addReducer from anywhere but in an initcall routine.
A callback is a generic way to transfer control back to a client after a library has finished. For example, after finishing a reduction, you might want the results passed to some chare's entry method. To do this, you create an object of type CkCallback with the chare's CkChareID and entry method index, then pass the callback object to the reduction library.
You can create a CkCallback object in a number of ways, depending on what you want to have happen when the callback is finally invoked. The callback will be invoked with a CHARM++ message; but the message type will depend on the library that actually invokes the callback. Check the library documentation to see what kind of message the library will send to your callback. In any case, you are required to free the message passed to you via the callback.
The callbacks that go to chares require an ``entry method index'', an integer that identifies which entry method will be called. You can get an entry method index using the syntax:
Here, ChareName is the name of the chare (group, or array) containing the desired entry method, EntryMethod is the name of that entry method, and parameters are the parameters taken by the method. These parameters are only used to resolve the proper EntryMethod; they are otherwise ignored. An entry method index is the CHARM++ version of a function pointer.
There are a number of ways to build callbacks, depending on what you want to have happen when the callback is invoked:
This function will be called on the processor where the callback was created, so param is allowed to point to heap-allocated data. Of course, you are required to free any storage referenced by param.
One final type of callback, a CkCallback(CkCallback::resumeThread), can only be used from within threaded entry methods. This type of callback is typically hidden within a thread-capable library, so is discussed further in the library section.
Here, a ``library'' is simply any code which can be called from several different places. From the point of view of a library, a CkCallback is a destination for the library's result. CkCallback objects can be freely copied, marshalled, or even sent in messages.
Postponing threads for a moment, the only thing you can do with a CkCallback is to move it around or send a message to it:
A CkCallback will accept any message type, or even NULL. The message is immediately sent to the user's client function or entry point, so you do need to document the type of message you will send to the callback so the user knows what to expect.
In alternative to ``send'', the callback can be used in a contribute collective operation. This will internally invoke the ``send'' method on the callback when the contribute operation has finished.
Thread clients are a bit more complicated as they need to suspend while waiting for the operation invoked finishes. They will resume when the ``send'' method is invoked on the callback. In these situations, the class CkCallbackResumeThread is more useful. This class is a subclass of CkCallback with specific functionality for threads. This class automatically suspends the thread when its destructor is called. It can be used in situations when the return value is not needed, and only the synchronization is important. For example:
Alternatively, if doWork returns a value of interest, this can be retrieved by passing a pointer to CkCallbackResumeThread. This pointer will be modified by CkCallbackResume thread to point to the incoming message. Notice that the input pointer has to be cast to (void*&).
Notice that the instance of CkCallbackResumeThread is constructed on-the-fly as a parameter to the ``doWork'' call. This insures that the callback is destroyed as soon as the function returns, therefore suspending the thread.
It is also possible to allocate a CkCallbackResumeThread on the heap or on the stack. We suggest to avoid such usage, and favor the on-the-fly construction shown above. For completeness, we still report code for heap and stack allocation of CkCallbackResumeThread callbacks.
For heap allocation, the user will have to explicitely call ``delete'' to suspend the thread.
For callbacks allocated on the stack, its destructor will be called only at the end of the function, when the current stack goes out of scope. In this situation, the function ``thread_delay'' can be called on the callback to force the thread to suspend. This works also for heap allocated callbacks.
Notice: a CkCallbackResumeThread can be used to suspend a thread only once.
Deprecated usage: in the past, ``thread_delay'' was used to retrieve the incoming message from the callback. While that is still allowed for backward compatibility, its usage is deprecated. The old usage is subject to memory leaking and dangling pointers.
The PUP framework is a generic way to describe the data in an object and to use that description for any task requiring serialization. The CHARM++ system can use this description to pack the object into a message, and unpack the message into a new object on another processor. The name thus is a contraction of the words Pack and UnPack (PUP).
Like many C++ concepts, the PUP framework is easier to use than describe:
This class's pup routine describes the fields of a foo to CHARM++. This allows CHARM++ to: marshall parameters of type foo across processors, translate foos across processor architectures, read and write foos to disk files, inspect and modify foo objects in the debugger, and checkpoint and restart calculations involving foos.
Your object's pup routine must save and restore all your object's data. As shown, you save and restore a class's contents by writing a routine c alled ``pup'' which passes all the parts of the class to an object of type PUP::er, which does the saving or restoring. We often use ``pup'' as a verb, meaning ``to save/restore the value of'' or equivalently, ``to call the pup routine of''.
Pup routines for complicated objects normally call the pup routines for their simpler parts. Since all objects depend on their immediate superclass, the first line of every pup routine is a call to the superclass's pup routine--the only time you shouldn't call your superclass's pup routine is when you don't have a superclass. If your superclass has no pup routine, you must pup the values in the superclass yourself.
The recommended way to pup any object a is to use p|a;.
This syntax is an operator | applied to the PUP::er p
and the user variable a.
The p|a; syntax works wherever a is:
p|a; copies the data in-place.
This is equivalent to passing the type directly to the PUP::er
using p(a).
p|a; calls the object's pup routine.
This is equivalent to the statement a.pup(p);.
p|a; allocates and copies the appropriate subclass.
p|a; copies the object as plain bytes, like memcpy.
operator | defined.
In this case, p|a; calls the custom operator |.
For container types, you must simply pup each element of the container. For arrays, you can use the utility routine PUParray, which takes the PUP::er, the array base pointer, and the array length. This utility routine is defined for user-defined types T as:
If the variable is from the C++ Standard Template Library, you can include
operator|'s for STL vector, map, list, pair, and string, templated
on anything, by including the header ``pup_stl.h''.
As usual in C++, pointers and allocatable objects usually require special handling. Typically this only requires a p.isUnpacking() conditional block, where you perform the appropriate allocation. See Section 3.16.3 for more information and examples.
If the object does not have a pup routine, and you cannot add one or use
PUPbytes, you can define an operator| to pup the object.
For example, if myClass contains two fields a and b, the
operator| might look like:
For classes and structs with many fields, it can be tedious and error-prone to list all the fields in the pup routine. You can avoid this listing in two ways, as long as the object can be safely copied as raw bytes--this is normally the case for simple structs and classes without pointers.
PUPbytes(myClass) macro in your header file.
This lets you use the p|*myPtr; syntax
to pup the entire class as sizeof(myClass) raw bytes.
p((void *)myPtr,sizeof(myClass)); in the pup
routine. This is a direct call to pup a set of bytes.
p((char *)myCharArray,arraySize); in the pup
routine. This is a direct call to pup a set of bytes.
Other primitive types may also be used.
Note that pupping as bytes is just like using `memcpy': it does nothing to the data but copy it whole. For example, if the class contains any pointers, you must make sure to do any allocation needed, and pup the referenced data yourself.
Pupping as bytes will prevent your pup routine from ever being able to work across different machine architectures. We don't do this very often yet, but eventually may, so pupping as bytes is currently discouraged.
The PUP::er overhead is very small--one virtual function call for each item or array to be packed/unpacked. The actual packing/unpacking is normally a simple memory-to-memory binary copy.
For arrays of builtin types like ``int" and ``double", or arrays of a type with the ``PUPbytes'' declaration, PUParray uses an even faster block transfer, with one virtual function call per array.
Please note that if your object contains Structured Dagger code (see section ``Structured Dagger'') you must call the generated routine __sdag_pup to correctly pup the Structured Dagger state:
CHARM++ uses your pup routine to both pack and unpack, by passing different types of PUP::ers to it. The routine p.isUnpacking() returns true if your object is being unpacked--that is, your object's values are being restored. Your pup routine must work properly in sizing, packing, and unpacking modes; and to save and restore properly, the same fields must be passed to the PUP::er, in the exact same order, in all modes. This means most pup routines can ignore the pup mode.
Three modes are used, with three separate types of PUP::er: sizing, which only computes the size of your data without modifying it; packing, which reads/saves values out of your data; and unpacking, which writes/restores values into your data. You can determine exactly which type of PUP::er was passed to you using the p.isSizing(), p.isPacking(), and p.isUnpacking() routines. However, sizing and packing should almost always be handled identically, so most programs should use p.isUnpacking() and !p.isUnpacking(). Any program that calls p.isPacking() and does not also call p.isSizing() is probably buggy, because sizing and packing must see exactly the same data.
The p.isDeleting() flag indicates the object will be deleted after calling the pup routine. This is normally only needed for pup routines called via the C or f90 interface, as provided by AMPI or the FEM framework. Other CHARM++ array elements, marshalled parameters, and other C++ interface objects have their destructor called when they are deleted, so the p.isDeleting() call is not normally required--instead, memory should be deallocated in the destructor as usual.
The life cycle of an object with a pup routine is shown in Figure 1. As usual in C++, objects are constructed, do some processing, and are then destroyed.
Objects can be created in one of two ways: they can
be created using a normal constructor as usual; or they
can be created using their pup constructor. The pup constructor
for CHARM++ array elements and PUP::able objects
is a ``migration constructor'' that takes a single ``CkMigrateMessage *";
for other objects, such as parameter marshalled objects,
the pup constructor has no parameters. The pup constructor
is always followed by a call to the object's pup routine in
isUnpacking mode.
Once objects are created, they respond to regular user methods
and remote entry methods as usual. At any time, the object
pup routine can be called in isSizing or isPacking
mode. User methods and sizing or packing pup routines can be called
repeatedly over the object lifetime.
Finally, objects are destroyed by calling their destructor as usual.
If your class has fields that are dynamically allocated, when unpacking these need to be allocated (in the usual way) before you pup them. Deallocation should be left to the class destructor as usual.
The simplest case is when there is no dynamic allocation.
The next simplest case is when we contain a class that is always allocated during our constructor, and deallocated during our destructor. Then no allocation is needed within the pup routine.
If we need values obtained during the pup routine before we can allocate the class, we must allocate the class inside the pup routine. Be sure to protect the allocation with ``if (p.isUnpacking())''.
For example, if we keep an array of doubles, we need to know how many doubles there are before we can allocate the array. Hence we must first pup the array length, do our allocation, and then pup the array data. We could allocate memory using malloc/free or other allocators in exactly the same way.
If our allocated object may be NULL, our allocation becomes much more complicated. We must first check and pup a flag to indicate whether the object exists, then depending on the flag, pup the object.
This sort of code is normally much longer and more error-prone if split into the various packing/unpacking cases.
An array of actual classes can be treated exactly the same way
as an array of basic types. PUParray will pup each
element of the array properly, calling the appropriate operator|.
An array of pointers to classes must handle each element separately, since the PUParray routine does not work with pointers. An ``allocate'' routine to set up the array could simplify this code. More ambitious is to construct a ``smart pointer'' class that includes a pup routine.
Note that this will not properly handle the case where some elements of the array are actually subclasses of foo, with virtual methods. The PUP::able framework described in the next section can be helpful in this case.
If the class foo above might have been a subclass, instead of simply using new foo above we would have had to allocate an object of the appropriate subclass. Since determining the proper subclass and calling the appropriate constructor yourself can be difficult, the PUP framework provides a scheme for automatically determining and dynamically allocating subobjects of the appropriate type.
Your superclass must inherit from PUP::able, which provides the basic machinery used to move the class. A concrete superclass and all its concrete subclasses require these four features:
An abstract superclass--a superclass that will never actually be packed--only needs to inherit from PUP::able and include a PUPable_abstract(className) macro in their body. For these abstract classes, the .ci file, PUPable_decl macro, and constructor are not needed.
For example, if parent is a concrete superclass and child its subclass,
With these declarations, then, we can automatically allocate and pup a pointer to a parent or child using the vertical bar PUP::er syntax, which on the receive side will create a new object of the appropriate type:
This will properly pack, allocate, and unpack obj whether it is actually a parent or child object. The child class can use all the usual C++ features, such as virtual functions and extra private data.
If obj is NULL when packed, it will be restored to NULL when unpacked. For example, if the nodes of a binary tree are PUP::able, one may write a recursive pup routine for the tree quite easily:
This same implementation will also work properly even if the tree's internal nodes are actually subclasses of treeNode.
You may prefer to use the macros PUPable_def(className) and PUPable_reg(className) rather than using PUPable in the .ci file. PUPable_def provides routine definitions used by the PUP::able machinery, and should be included in exactly one source file at file scope. PUPable_reg registers this class with the runtime system, and should be executed exactly once per node during program startup.
Finally, a PUP::able superclass like parent above must normally be passed around via a pointer or reference, because the object might actually be some subclass like child. Because pointers and references cannot be passed across processors, for parameter marshalling you must use the special templated smart pointer classes CkPointer and CkReference, which only need to be listed in the .ci file.
A CkReference is a read-only reference to a PUP::able object--it is only valid for the duration of the method call. A CkPointer transfers ownership of the unmarshalled PUP::able to the method, so the pointer can be kept and the object used indefinitely.
For example, if the entry method bar needs a PUP::able parent object for in-call processing, you would use a CkReference like this:
If the entry method needs to keep its parameter, use a CkPointer like this:
Both CkReference and CkPointer are read-only from the send side--unlike messages, which are consumed when sent, the same object can be passed to several parameter marshalled entry methods. In the example above, we could do:
C and Fortran programmers can use a limited subset of the PUP::er capability. The routines all take a handle named pup_er. The routines have the prototype:
The first call is for use with a single element; the second call is for use with an array. The supported types are char, short, int, long, uchar, ushort, uint, ulong, float, and double, which all have the usual C meanings.A byte-packing routine
pup_isSizing, pup_isPacking, pup_isUnpacking, and pup_isDeleting calls are also available. Since C and Fortran have no destructors, you should actually deallocate all data when passed a deleting pup_er.
C and Fortran users cannot use PUP::able objects, seeking, or write custom PUP::ers. Using the C++ interface is recommended.
The most common PUP::ers used are PUP::sizer, PUP::toMem, and PUP::fromMem. These are sizing, packing, and unpacking PUP::ers, respectively.
PUP::sizer simply sums up the sizes of the native binary representation of the objects it is passed. PUP::toMem copies the binary representation of the objects passed into a preallocated contiguous memory buffer. PUP::fromMem copies binary data from a contiguous memory buffer into the objects passed. All three support the size method, which returns the number of bytes used by the objects seen so far.
Other common PUP::ers are PUP::toDisk, PUP::fromDisk, and PUP::xlater. The first two are simple filesystem variants of the PUP::toMem and PUP::fromMem classes; PUP::xlater translates binary data from an unpacking PUP::er into the machine's native binary format, based on a machineInfo structure that describes the format used by the source machine.
It may rarely occur that you require items to be unpacked in a different order than they are packed. That is, you want a seek capability. PUP::ers support a limited form of seeking.
To begin a seek block, create a PUP::seekBlock object with your current PUP::er and the number of ``sections'' to create. Seek to a (0-based) section number with the seek method, and end the seeking with the endBlock method. For example, if we have two objects A and B, where A's pup depends on and affects some object B, we can pup the two with:
Note that without the seek block, A's fields would be unpacked over B's memory, with disasterous consequences. The packing or sizing path must traverse the seek sections in numerical order; the unpack path may traverse them in any order. There is currently a small fixed limit of 3 on the maximum number of seek sections.
System-level programmers may occasionally find it useful to define their own PUP::er objects. The system PUP::er class is an abstract base class that funnels all incoming pup requests to a single subroutine:
The parameters are, in order, the field address, the number of items, the size of each item, and the type of the items. The PUP::er is allowed to use these fields in any way. However, an isSizing or isPacking PUP::er may not modify the referenced user data; while an isUnpacking PUP::er may not read the original values of the user data. If your PUP::er is not clearly packing (saving values to some format) or unpacking (restoring values), declare it as sizing PUP::er.
CHARM++ provides both C and C++ style methods of doing terminal I/O.
In place of C-style printf and scanf, CHARM++ provides CkPrintf and CkScanf. These functions have interfaces that are identical to their C counterparts, but there are some differences in their behavior that should be mentioned.
A recent change to CHARM++ is to also support all forms of printf, cout, etc. in addition to the special forms shown below. The special forms below are still useful, however, since they obey well-defined (but still lax) ordering requirements.
int CkPrintf(format [, arg]*)
This call is used for atomic terminal output. Its usage is similar to
printf in C. However, CkPrintf has some special properties
that make it more suited for parallel programming on networks of
workstations. CkPrintf routes all terminal output to the charmrun,
which is running on the host computer. So, if a
chare on processor 3 makes a call to CkPrintf, that call
puts the output in a TCP message and sends it to host
computer where it will be displayed. This message passing is an asynchronous
send, meaning that the call to CkPrintf returns immediately after the
message has been sent, and most likely before the message has actually
been received, processed, and displayed. 20
void CkError(format [, arg]*))
Like CkPrintf, but used to print error messages on stderr.
int CkScanf(format [, arg]*)
This call is used for atomic terminal input. Its usage is similar to
scanf in C. A call to CkScanf, unlike CkPrintf,
blocks all execution on the processor it is called from, and returns
only after all input has been retrieved.
For C++ style stream-based I/O, CHARM++ offers ckout and ckerr in the place of cout, and cerr. The C++ streams and their CHARM++ equivalents are related in the same manner as printf and scanf are to CkPrintf and CkScanf. The CHARM++ streams are all used through the same interface as the C++ streams, and all behave in a slightly different way, just like C-style I/O.
Some registration routines need be executed exactly once before the computation begins. You may choose to declare a regular C++ subroutine initnode in the .ci file to ask CHARM++to execute the routine exactly once on every node before the computation begins, or to declare a regular C++ subroutine initproc to be executed exactly once on every processor.
This code will execute the routines fooNodeInit and static bar::barNodeInit once on every node and fooProcInit and bar::barProcInit on every processor before the main computation starts. Initnode calls are always executed before initproc calls. Both init calls (declared as static member function) can be used in chare, group or chare arrays.
Note that these routines should only do registration, not computation since Charm run-time initialization does not start yet -- use a mainchare instead, which gets executed on only processor 0, to begin the computation. Initcall routines are typically used to do special registrations and global variable setup before the computation actually begins.
The following calls provide information about the machines upon which the parallel program is executing. Processing Element refers to a single CPU. Node refers to a single machine- a set of processing elements which share memory (i.e. an address space). Processing Elements and Nodes are numbered, starting from zero.
Thus if a parallel program is executing on one 4-processor workstation and one 2-processor workstation, there would be 6 processing elements (0, 1 ,2, 3, 4, and 5) but only 2 nodes (0 and 1). A given node's processing elements are numbered sequentially.
int CkNumPes()
returns the total number of processors, across all nodes.
int CkMyPe()
returns the processor number on which the call was made.
int CkMyRank()
returns the rank number of the processor on which the call was made.
Processing elements within a node are ranked starting from zero.
int CkMyNode()
returns the address space number (node number) on which the call was made.
int CkNumNodes()
returns the total number of address spaces.
int CkNodeFirst(int node)
returns the processor number of the first processor in this address space.
int CkNodeSize(int node)
returns the number of processors in the address space on which the call was made.
int CkNodeOf(int pe)
returns the node number on which the call was made.
int CkRankOf(int pe)
returns the rank of the given processor within its node.
The following calls provide commonly needed functions.
void CkAbort(const char *message)
Cause the program to abort, printing the given error message.
This routine never returns.
void CkExit()
This call informs the Charm kernel that computation on all processors
should terminate. After the currently executing entry method completes, no
more messages or entry methods will be called on any other processor.
This routine never returns.
void CkExitAfterQuiescence()
This call informs the Charm kernel that computation on all processors
should terminate as soon as the machine becomes completely idle-that is,
after all messages and entry methods are finished. This is the state of
quiescence, as described further in Section 3.13.
This routine returns immediately.
double CkCpuTimer()
Returns the current value of the system timer in seconds. The system
timer is started when the program begins execution. This timer measures process
time (user and system).
double CkWallTimer()
Returns the elapsed time in seconds since the program has started from the wall
clock timer.
double CkTimer()
This is an alias for either CkWallTimer on dedicated machines (such as ASCI Red) or
CkCpuTimer for machines with multiple user processes per CPU (such as workstation cluster.)
Delegation is a means by which a library writer can intercept messages sent via a proxy. This is typically used to construct communication libraries. A library creates a special kind of Group called a DelegationManager, which receives the messages sent via a delegated proxy.
There are two parts to the delegation interface- a very small client-side interface to enable delegation, and a more complex manager-side interface to handle the resulting redirected messages.
All proxies (Chare, Group, Array, ...) in CHARM++ support the following delegation routines.
void CProxy::ckDelegate(CkGroupID delMgr);
Begin delegating messages sent via this proxy to the
given delegation manager. This only affects
the proxy it is called on- other proxies for the
same object are not changed. If the proxy is
already delegated, this call changes the delegation manager.
CkGroupID CProxy::ckDelegatedIdx(void) const;
Get this proxy's current delegation manager.
void CProxy::ckUndelegate(void);
Stop delegating messages sent via this proxy.
This restores the proxy to normal operation.
One use of these routines might be:
The client interface is very simple; but it is often not called by users directly. Often the delegate manager library needs some other initialization, so a more typical use would be:
Sync entry methods, group and nodegroup multicast messages, and messages for virtual chares that have not yet been created are never delegated. Instead, these kinds of entry methods execute as usual, even if the proxy is delegated.
A delegation manager is a group which inherits from CkDelegateMgr and overrides certain virtual methods. Since CkDelegateMgr does not do any communication itself, it need not be mentioned in the .ci file; you can simply declare a group as usual and inherit the C++ implementation from CkDelegateMgr.
Your delegation manager will be called by CHARM++ any time a proxy delegated to it is used. Since any kind of proxy can be delegated, there are separate virtual methods for delegated Chares, Groups, NodeGroups, and Arrays.
These routines are called on the send side only. They are called after parameter marshalling; but before the messages are packed. The parameters passed in have the following descriptions.
The CkDelegateMgr superclass implements all these methods; so you only need to implement those you wish to optimize. You can also call the superclass to do the final delivery after you've sent your messages.
The communication framework in Charm++/Converse is aimed at optimizing certain communication patterns. Currently the programmer has to specify the communication pattern it intends to optimize, together with the strategy to be used. The communications library uses the delegation framework (3.20) in order to enable easy and transparent access to the framework by the programmer.
For AMPI programs, the communication optimization is done by the AMPI layer, so that the user does not need to worry about that. In Charm++, however, the user must create the strategies in the program explicitly. Charm++ programs are normally based on communicating arrays of chares, that compute and then invoke entry methods on local or remote chares by sending them messages. These array elements send messages to each other through proxies. The messages are passed to the Charm++ runtime which calls lower level network APIs to communicate. To optimize communication in Charm++, the user can redirect a communication call to go through an instance of a strategy.
To access the communication framework, the user first creates and initializes a communication library strategy. He then needs to make a copy of the array proxy and associate it with that strategy. In order to use the framework, the receiving entry methods need to receive messages (see 3.4), and not marshalled parameters. The user can create several instances of the same or different strategies, to optimize different communication calls in the application. In order to access the class signatures, the file ``comlib.h'' should be included.
Each communication operation is associated with a proxy, through which the message is sent. These proxies can be associated in the mainchare constructor (useful for all-to-all strategies), or later in the single chare array elements (useful for section multicasts). In both cases, some information has to be kept, either the CProxy or the ComlibInstanceHandle, and this can be done in readonly variables, or as internal variables of the objects.
An example on how to use commlib can be found in the charm distribution, under ``examples/charm++/commlib/multicast/ '', where the proxies are associated in the chare arrays.
One thing typically useful is having the the proxy associated with the strategy, or an instance of the strategy (to be used for future associations) to be declared as readonly variable, although this in not necessary. This is done by declaring them readonly (see 3.6 for more information).
The creation of all the strategies needed, and their registration must be done in the constructor of the mainchare (for more on array creation see 3.8.2):
In this example, after aproxy has been associated with comlibproxy it can only be used with commlib, and cannot send anymore regular messages. For this, if regular messages without commlib are desired, a copy of the original proxy should be made (like here).
In the chare array element, if cinst has been defined, other proxies can be created and associated, like here a CProxySection_MyArray, which allows to send multicasts (see 3.8.13 for more on section proxies).
After a proxy has been associated in some way to commlib, it can be used to send messages with commlib:
In case a bracketed strategy is used, two additional function calls have to be added before starting to send the messages and after finishing. These are discussed later in 3.21.4.
The signatures of the functions used here are the following:
The Communication optimization framework supports both loadbalancing and array migration. It enables migration through message forwarding. Messages sent by a migrated array are forwarded to the processor where it is mapped to, and from here they get accounted. Messages sent to migrated arrays are forwarded from the processor where they are mapped to their current destination.
This mapping of array elements to processors can be updated by the user by calling ComlibResetProxy for array proxies, and ComlibResetSectionProxy for section proxies. This should be done especially during load-balancing, where most of the migrations happen. As shown in the following example, these calls should be made inside the resumeFromSync method.
A migrating array element containing associated proxies or instances should pup them all at the source and destination.
All user programs that use the communication library should use the linker option -module comlib. For example,
The communication framework now supports four different communication operations:
There are two types of strategies in the communication framework:
The usage of the strategy becomes:
The class EachToManyMulticastStrategy optimizes both all-to-all personalized and all-to-all multicast communication using several virtual topologies like 2-D Mesh, 3-D Mesh and Hypercube. Personalized communication happens when a chare sends different messages to the other chares, multicast communication happens when a chare sends the same message to all other chares. EachToManyMulticastStrategy also optimizes the special cases of many-to-many multicast where not all the chares in an array are involved in the collective operation.
The charm level strategy collects all the messages from the chares and delivers them to the destination, while the low level (processor-to-processor) communication is performed through converse level routers and implements the various virtual topologies.
EachToManyMulticastStrategy requires that all local messages be deposited before they can be packed into single messages. Hence, it needs to be a bracketed strategy. This strategy can also be used to optimize all-to-all collectives between charm groups.
As for the constructors to be used in the main chare, the two prototypes follow. The first one is for groups, the second for arrays. The optional parameters allow to specify the many-to-many behaviour, passing the lists of source and destination elements participating in the operation. If they are left to the default value, the collective is an all-to-all.
Both have as first parameter the virtual topology that the strategy will use for the low level optimization. The possible values are:
USE_HYPERCUBE will do best for very small messages and small number of processors, 3d has better performance for slightly higher message sizes and then Mesh starts performing best. The programmer is encouraged to try out all the topologies.
There are two strategies of this type: BroadcastStrategy and PipeBroadcastStrategy. The first works only for group broadcast, while the second works for both groups and arrays.
BroadcastStrategy performs a broadcast through a hypercube (default) or a tree, and the constructor is:
PipeBroadcastStrategy performs a broadcast through a ring or a hypercube (default). The characteristic of this strategy is that it fragments the message into small chunks that fit a predetermined size (passed as argument to the constructor), and it reassembles them before delivery. The constructor prototypes for groups and arrays respectively are:
The subclasses of MulticastStrategy can multicast a message to the entire array or a section of array elements (MulticastStrategy itself is abstract). The multicast strategies are non-bracketed, and the message is processed when the application deposits it. These strategies do not combine messages, but they may sequence the destinations of the multicast to minimize contention on a network.
In order to use these strategies, the message sent must inherit from class CkMcastBaseMsg. (For an example see ``examples/charm++/commlib/multicast/'').
These are the subclass strategies that are available:
For these, the constructors are of the form:
For section multicast, the user must create a section proxy and delegate it to the communication library. Invocations on section proxies are passed on to the section multicast strategy.
This strategy optimizes the scenario where chares send several small messages to other chares. The StreamingStrategy collects messages destined to the same physical processor and, after a timeout or when a certain number of messages have been collected, it sends them as a single message. This results in sending fewer messages of larger size. The timeout is a floating-point parameter to the StreamingStrategy. It needs to be specified in milliseconds, with a default value of 1ms. Micro-second timeouts can also be specified by passing values less than 1. For example, represents .
The Streaming Strategy is a non-bracketed strategy. Since messages can be delayed due to the timeout present, it is possible to call ComlibEnd() to flush all the messages to be sent immediately.
The prototype of the constructor is:
There are two variants of this strategy:
Optimization algorithms are implemented as Strategies in the communication library. Strategies can be implemented at the Object (CHARM++) level or the processor (CONVERSE) level. Code reuse is possible by having a few object managers perform object level optimizations and then call several other processor level optimization schemes. For example, to optimize all-to-all communication the processor level strategies could use the different virtual topologies.
All processor (CONVERSE) level strategies inherit from the class Strategy defined below and override its virtual methods.
The class method insertMessage is called to deposit messages with the strategy. MessageHolder is a wrapper for converse messages. When a processor has sent all its messages doneInserting is invoked on the strategy.
At the CHARM++ level, all strategies inherit from the class CharmStrategy reported here.
CHARM++ level strategies also have to implement the insertMessage and doneInserting methods. Here insertMessage takes a CharmMessageHolder which is a CHARM++ message wrapper. The call to beginProcessing initializes the strategies on each processor. This additional call is needed because the constructor of the strategy is called by user code in main::main on processor 0, while the strategy needs to be constructed everywhere. Along with initializing its data, beginProcessing can also register message handlers, as the communication library strategies use Converse handlers to communicate between processors. The flags isArray and isGroup store the type of objects that call the strategy and the flag isStrategyBracketed specifies if the CharmStrategy is bracketed or not. Bracketed strategies require that the application deposits messages in brackets demarcated by the calls ComlibBegin and ComlibEnd.
The Python scripting language in CHARM++ allows the user to dynamically execute pieces of code inside a running application, without the need to recompile. This is performed through the CCS (Converse Client Server) framework (see ``Converse Manual'' for more information about this). The user specifies which elements of the system will be accessible through the interface, as we will see later, and then run a client which connects to the server.
In order to exploit this functionality, Python interpreter needs to be installed
into the system, and CHARM++ LIBS need to be built with:
./build LIBS arch options
The interface provides three different types of requests:
There are three modes to run code on the server, ordered here by increase of functionality, and decrease of dynamic flexibility:
The description will follow the client implementation first, and continuing then on the server implementation.
In order to facilitate the interface between the client and the server, some classes are available to the user to include into the client. Currently C++ and java interfaces are provided.
C++ programs need to include PythonCCS-client.h into their code. This file is among the CHARM++ include files. For java, the package charm.ccs needs to be imported. This is located under the java directory on the CHARM++ distribution, and it provides both the Python and CCS interface classes.
There are three main classes provided: PythonExecute, PythonPrint, and PythonFinished which are used for the three different types of request.
All of them have two common methods to enable communication across different platforms:
A tipical invocation to send a request from the client to the server has the following format:
To execute a Python script on a running server, the client has to create an instance of PythonExecute, the two constructors have the following signature (java has a correspondent functionality):
The second one is used for iterative requests (see 3.22.4). The only required argument is the code, a null terminated string, which will not be modified by the system. All the other parameters are optional. They refer to the possible variants that an execution request can be. In particular, this is a list of all the options present:
These flags can be set and checked with the following routines (CmiUInt4 represent a 4 byte unsigned integer):
From a PythonExecute request, the server will answer with a 4 byte integer value, which is a handle for the interpreter that is running. It can be used to request for prints, check if the script has finished, and for reusing the same interpreter (if it was persistent).
A value of 0 means that there was an error and the script didn't run. This is typically due to a request to reuse of an existing interpreter which is not available, either because it was not persistent or because another script is still running on that interpreter.
When a Python script is run inside a CHARM++ application, two Python modules are made available by the system. One is ck, the other is charm. The first one is always present and it represent basic functions, the second is related to high level scripting and it is present only when this is enabled (see 3.22.2 for how to enable it, and 3.22.11 for a description on how to implement charm functions).
The methods present in the ck module are the following:
Sometimes some operations need to be iterated over all the elements in the system. This ``iterative'' functionality provides a shortcut for the client user to do this. As an example, suppose we have a system which contains particles, with their position, velocity and mass. If we implement read and write routines which allow us to access single particle attributes, we may upload a script which doubles the mass of the particles with velocity greater than 1:
Instead of all these read and writes, it will be better to be able to write:
This is what the ``iterative'' functionality provides. In order for this to work, the server has to implement two additional functions (see 3.22.9), and the client has to pass some more information together with the code. This information is the name of the function that has to be called (which can be defined in the ``code'' or have already been uploaded to a persistent interpreter), and a user defined structure which specifies over what data the function should be invoked. These values can be specified either while constructing the PythonExecute variable (see the second constructor in section 3.22.2), or with the following methods:
As for the PythonIterator object, it has to be a class defined by the user, and the user has to insure that the same definition is present inside both the client and the server. The CHARM++ system will simply pass this structure as a void pointer. This structure needs to inherit from PythonIterator. It is recommended that no pointers are used inside this class, and no dynamic memory allocation. If this is the case, nothing else needs to be done.
If instead pointers and dynamic memory allocation is used, the following methods have to be reimplemented:
The first returns the size of the class/structure after being packed. The second returns a pointer to a newly allocated memory containing all the packed data, the returned memory must be compatible with the class itself, since later on this same memory a call to unpack will be performed. Finally, the third will do the work opposite to pack and fix all the pointers. This method will not return anything and is supposed to fix the pointers ``inline''.
In order to receive the output printed by the Python script, the client needs to send a PythonPrint request to the server. The constructor is:
PythonPrint(CmiUInt4 interpreter, bool Wait=true, bool Kill=false);
The interpreter for which the request is made is mandatory. The other parameters are optional. The wait parameter represents whether a reply will be sent back immediately to the client even if there is no output (false), or if the answer will be delayed until there is an output (true). The kill option set to true means that this is not a normal request, but a signal to unblock the latest print request which was blocking.
The returned data will be a non null-terminated string if some data is present (or if the request is blocking), or a 4 byte zero data if nothing is present. This zero reply can happen in different situations:
As for a print kill request, no data is expected to come back, so it is safe to call CcsNoResponse(server).
The two options can also be dynamically set with the following methods:
In order to know when a Python code has finished executing, especially when using persistent interpreters, and a serialization of the scripts is needed, a PythonFinished request is available. The constructor is the following:
PythonFinished(CmiUInt4 interpreter, bool Wait=true);
The interpreter corresponds to the handle for which the request was sent, while the wait option refers to a blocking call (true), or immediate return (false).
The wait option can be dynamically modified with the two methods:
This request will return a 4 byte integer containing the same interpreter value if the Python script has already finished, or zero if the script is still running.
In order for a CHARM++ object (chare, array, node, or nodegroup) to receive python requests, it is necessary to define it as python-compliant. This is done through the keyword python placed in square brackets before the object name in the .ci file. Some examples follow:
In order to register a newly created object to receive Python scripts, the method registerPython of the proxy should be called. As an example, the following code creates a 10 element array myArray, and then registers it to receive scripts directed to ``pycode''. The argument of registerPython is the string that CCS will use to address the Python scripting capability of the object.
As explained previously in subsection 3.22.3, some functions are automatically made available to the scripting code through the ck module. Two of these, read and write are only available if redefined by the object. The signatures of the two methods to redefine are:
The read function receives as a parameter an object specifying from where the data will be read, and returns an object with the information required. The write function will receive two parameters: where the data will be written and what data, and will perform the update. All these PyObjects are generic, and need to be coherent with the protocol specified by the application. In order to parse the parameters, and create the value of the read, please refer to the manual ``Extending and Embedding the Python Interpreter'', and in particular to the functions PyArg_ParseTuple and Py_BuildValue.
In order to use the iterative mode as explained in subsection 3.22.4, it is necessary to implement two functions which will be called by the system. These two functions have the following signatures:
The first one is called once before the first execution of the Python code, and receives two parameters. The first is a pointer to an empty PyObject to be filled with the data needed by the Python code. In order to manage this object, some utility functions are provided. They are explained in subsection 3.22.10.
The second is a void pointer containing information of what the iteration should run over. This parameter may contain any data structure, and an agreement between the client and the user object is necessary. The system treats it as a void pointer since it has no information of what user defined data it contains.
The second function (nextIteratorUpdate) has three parameters. The first contains the object to be filled like in buildIterator, but this time the object contains the PyObject which was provided for the last iteration, potentially modified by the Python function. Its content can be read with the provided routines, used to retrieve the next logical element in the iterator (with which to update the parameter itself), and possibly update the content of the data inside the CHARM++ object. The second parameter is the object returned by the last call to the Python function, and the third parameter is the same data structure passed to buildIterator.
Both functions return an integer which will be interpreted by the system as follows:
They are inherited when declaring an object as Python-compliant, and therefore they are available inside the object code. All of them accept a PyObject pointer where to read/write the data, a string with the name of a field, and one or two values containing the data to be read/written (note that to read the data from the PyObject, a pointer needs to be passed). The strings used to identify the fields will be the same strings that the Python script will use to access the data inside the object.
The name of the function identifies the type of Python object stored inside the PyObject container (i.e String, Int, Long, Float, Complex), while the parameter of the functions identifies the C++object type.
To handle more complicated structures like Dictionaries, Lists or Tuples, please refer to ``Python/C API Reference Manual''.
When in addition to the definition of the CHARM++ object as python, an
entry method is also defined as python, this entry method can be accessed
directly by a Python script through the charm module. For example, the
following definition will be accessible with the python call:
result = charm.highMethod(var1, var2, var3)
It can accept any number of parameters (even complex like tuples or
dictionaries), and it can return an object as complex as needed.
The method must have the following signature:
The parameter is a handle that is passed by the system, and can be used in subsequent calls to return values to the Python code.
The arguments passed by the Python caller can be retrieved using the function:
PyObject *pythonGetArg(int handle);
which returns a PyObject. This object is a Tuple containing a vector of all parameters. It can be parsed using PyArg_ParseTuple to extract the single parameters.
When the CHARM++'s entry method terminates (by means of return or termination of the function), control is returned to the waiting Python script. Since the python entry methods execute within an user-level thread, it is possible to suspend the entry method while some computation is carried on in CHARM++. To start parallel computation, the entry method can send regular messages, as every other threaded entry method (see 3.15.2 for more information on how this can be done using CkCallbackResumeThread callbacks). The only difference with other threaded entry methods is that here the callback CkCallbackPython must be used instead of CkCallbackResumeThread. The more specialized CkCallbackPython callback works exactly like the other one, except that it handles correctly Python internal locks.
At the end of the computation, if a value needs to be returned to the Python script, the following special returning function has to be used:
void pythonReturn(int handle, PyObject* result);
where the second parameter is the Python object representing the returned value. The function Py_BuildValue can be used to create this value. This function in itself does not terminate the entry method, but only sets the returning value for Python to read when the entry method terminates.
A characteristic of Python is that in a multithreaded environment (like the one provided in CHARM++), the running thread needs to keep a lock to prevent other threads to access any variable. When using high level scripting, and the Python script is suspended for long periods of time while waiting for the CHARM++ application to perform the required task, the Python internal locks are automatically released and re-acquired by the CkCallbackPython class when it suspends.
CHARM++ supports inheritance among CHARM++ objects such as chares, groups, and messages. This, along with facilities for generic programming using C++ style templates for CHARM++ objects, is a major enhancement over the previous versions of CHARM++.
Chare inheritance makes it possible to remotely invoke methods of a base chare from a proxy of a derived chare. Suppose a base chare is of type BaseChare, then the derived chare of type DerivedChare needs to be declared in the CHARM++ interface file to be explicitly derived from BaseChare. Thus, the constructs in the .ci file should look like:
Note that the access specifier public is omitted, because CHARM++ interface translator only needs to know about the public inheritance, and thus public is implicit. A Chare can inherit privately from other classes too, but the CHARM++ interface translator does not need to know about it, because it generates support classes (proxies) to remotely invoke only public methods.
The class definitions of both these chares should look like:
Now, it is possible to create a derived chare, and invoke methods of base chare from it, or to assign a derived chare proxy to a base chare proxy as shown below:
Note that C++ calls the default constructor of the base class from any constructor for the derived class where base class constructor is not called explicitly. Therefore, one should always provide a default constructor for the base class, or explicitly call another base class constructor.
Multiple inheritance is also allowed for Chares and Groups. Often, one should make each of the base classes inherit ``virtually'' from Chare or Group, so that a single copy of Chare or Group exists for each multiply derived class.
Entry methods are inherited in the same manner as methods of sequential C++ objects. To make an entry method virtual, just add the keyword virtual to the corresponding chare method- no change is needed in the interface file. Pure virtual entry methods also require no special description in the interface file.
Messages cannot inherit from other messages. A message can, however, inherit from a regular C++ class. For example:
Messages cannot contain virtual methods or virtual base classes unless you use a packed message. Parameter marshalling has complete support for inheritance, virtual methods, and virtual base classes via the PUP::able framework.
One can write ``templated'' code for Chares, Groups, Messages and other CHARM++ entities using familiar C++ template syntax (almost). The CHARM++ interface translator now recognizes most of the C++ templates syntax, including a variety of formal parameters, default parameters, etc. However, not all C++ compilers currently recognize templates in ANSI drafts, therefore the code generated by CHARM++ for templates may not be acceptable to some current C++ compilers
Since many modern C++ compilers22 require that the template definitions (in addition to the template declarations) be available in all sources which use them, you will need to include the templated Charm definitions in your header file. That is, given a module stlib, in addition to having a line #include "stlib.decl.h" in your header file (e.g. stlib.h), you also need the following lines towards the end of the file:
This has the effect of including into the header file only those declarations which relate to templates. You will still need to include the file stlib.def.h again in your implementation sources (i.e., stlib.C) in order to pick up the rest of the (non-template-related) definitions. Note that for completely template-based libraries, this means that you might need to create an implementation file stlib.C when you otherwise wouldn't solely for the purpose of making sure that the non-template definitions in stlib.def.h are included and compiled.
The CHARM++ interface file should contain the template definitions as well as the instantiation. For example, if a message class TMessage is templated with a formal type parameter DType, then every instantiation of TMessage should be specified in the CHARM++ interface file. An example will illustrate this better:
Note the use of default template parameters. It is not necessary for template definitions and template instantiations to be part of the same module. Thus, templates could be defined in one module, and could be instantiated in another module , as long as the module defining a template is imported into the other module using the extern module construct. Thus it is possible to build a standard CHARM++ template library. Here we give a flavor of possibilities:
The Singleton message is a template for storing one element of any dtype. The Reducer is a group template for a spanning-tree reduction, which is started by submitting data to the local branch. It also contains a public method to register the ReductionClient (or any of its derived types), which acts as a callback to receive results of a reduction.
CHARM++ offers a range of fault tolerance capabilities through its checkpoint/restart mechanism. Usual Chare array-based CHARM++ application including AMPI application can be checkpointed to disk files and later on restarting from the files.
The basic idea behind this is straightforward: Checkpointing an application is like migrating its parallel objects from the processors onto disks, and restarting is the reverse. Thanks to the migration utilities like PUP'ing(Section 3.16), users can decide what data to save in checkpoints and how to save them.
Two schemes of fault tolerance protocols are implemented.
The API to checkpoint the application is:
The string dirname is the destination directory where the checkpoint files will be stored, and cb is the callback function which will be invoked after the checkpoint is done, as well as when the restart is complete. Here is an example of a typical use:
A chare array usually has a PUP routine for the sake of migration. The PUP routine is also used in the checkpointing and restarting process. Therefore, it is up to the programmer what to save and restore for the application. One illustration of this flexbility is a complicated scientific computation application with 9 matrices, 8 of which holding the intermediate results and 1 holding the final results of each timestep. To save resource, the PUP routine can well omit the 8 intermediate matrices and checkpoint the matrix with final results of each timestep.
Group and nodegroup objects(Section 3.9) are normally not meant to be migrated. In order to checkpoint them, however, the user wants to write PUP routines for the groups and declare them as [migratable] in the .ci file. Some programs use mainchares to hold key control data like global object counts, and thus needs mainchares be checkpointed too. To do this, the programmer should write a PUP routine for the mainchare and declare them as [migratable] in the .ci file, just as in the case of Group and NodeGroup. In addition, the programmer also needs to put the proxy to the mainchare (usually noted as mainproxy) as a read-only data in the code, and make sure processor 0, which holds the mainchare, initiates the checkpoint.
After CkStartCheckpoint is executed, a directory of the designated name is created and a collection of checkpoint files are written into it.
The user can choose to run the CHARM++ application in restart mode, i.e., restarting execution from last checkpoint. The command line option -restart DIRNAME is required to invoke this mode. For example:
Restarting is the reverse process of checkpointing. CHARM++ allows restarting the old checkpoint on different number of physical processor. This provides the flexibility to expand or shrink your application when the availability of computing resource changes.
Note that on restart, if the old reduction client was set to a static function, the function pointer might be lost and the user needs to register it again. A better alternative is to always use entry method of a chare object. Since all the entry methods are registered inside CHARM++ system, in restart phase, the reduction client will be automatically restored.
After a failure, the system may consist less number of processors. After a problem fixed, some processors may become available again. Therefore, the user may need to flexibility to restart on different number of processors than in the checkpointing phase. This is allowable by giving different +pN option at runtime. One thing to note is that the new load distribution might differ from the previous one at checkpoint time, so running a load balancing (See Section 3.11) is suggested.
If restart is not done on the same number of processors, the processor-specific data in a group/nodegroup branch cannot (and usually should not) be restored individually. A copy from processor 0 will be propagate to all the processors.
In your programs, you may use chare groups for different types of purposes. For example, groups holding read-only data can avoid excessive data copying, while groups maintaining processor-specific information is used as a local manager of the processor. In the latter situation, the data is sometimes too complicated to save and restore but easy to re-compute. For the read-only data, you want to save and restore it in the PUP'er routing and leave empty the migration constructor, via which the new object is created during restart. For the easy-to-recompute type of data, we just omit the PUP'er routine and do the data reconstruction in the group's migration constructor.
A similar example is the program mentioned above, where there aree two types of chare arrays, one maintaining intermediate results while the other type holding the final result for each timestep. The programmer can take advantage of the flexibility by omitting PUP'er routine empty for intermediate objects, and do save/restore only for the important objects.
The previous disk-based fault-tolerance scheme is a very basic scheme in that when a failure occurs, the whole program gets killed and the user has to manually restart the application from the checkpoint files. The double checkpoint/restart protocol described in this subsection provides an automatic fault tolerance solution. When a failure occurs, the program can automatically detect the failure and restart from the checkpoint. Further, this fault-tolerance protocol does not rely on any reliable storage (as needed in the previous method). Instead, it stores two copies of checkpoint data to two different locations (can be memory or disk). This double checkpointing ensures the availability of one checkpoint in case the other is lost. The double in-memory checkpoint/restart scheme is useful and efficient for applications with small memory footprint at the checkpoint state. The double in-disk variation stores checkpoints into local disk, thus can be useful for applications with large memory footprint.
The function that user can call to initiate a checkpointing in a Chare array-based application is:
where cb has the same meaning as in the Section 5.1.1 . Just like the above disk checkpoint described, it is up to programmer what to save. The programmer is responsible for choosing when to activate checkpointing so that the size of a global checkpoint state can be minimal.
In AMPI applications, user just needs to call the following function to start checkpointing:
When a processor crashes, the restart protocol will be automatically invoked to recover all objects using the last checkpoints. And then the program will continue to run on the survived processors. This is based on the assumption that there are no extra processors to replace the crashed ones.
However, if there are a pool of extra processors to replace the crashed ones, the fault-toerlance protocol can also take advantage of this to grab one free processor and let the program run on the same number of processors as before crash. In order to achieve this, CHARM++ needs to be compiled with the macro option CK_NO_PROC_POOL turned on.
A variation of double memory checkpoint/restart, double in-disk checkpoint/restart, can be applied to applcaitions with large memory footprint. In this scheme, instead of storing checkpoints in the memory, it stores them in the local disk. The checkpoint files are named "ckpt[CkMyPe]-[idx]-XXXXXX" and are stored under /tmp.
A programmer can use runtime option +ftc_disk to switch to this mode. For example:
CHARM++ is based on the Message-Driven parallel programming paradigm. The message-driven programming style avoids the use of blocking receives and allows overlap of computation and communication by scheduling computations depending on availability of data. This programing style enables CHARM++ programs to tolerate communication latencies adaptively. Threads suffer from loss of performance due to context-switching overheads and limited scalability due to large and unpredictable stack memory requirements, when used in a data-driven manner to coordinate a sequence of remotely triggered actions.
The need to sequence remotely triggered actions arises in many situations. Let us consider an example:
Consider an algorithm for computing cutoff-based pairwise interactions
between atoms in a molecular dynamics application, where interaction
between atoms is considered only when they are within some cutoff
distance of each other. This algorithm is based on a combination of
task and spatial decompositions of the molecular system. The bounding
box for the molecule is divided into a number of cubes (Patches)
each containing some number of atoms. Since each patch contains a
different number of atoms and these atoms migrate between patches as
simulation progresses, a dynamic load balancing scheme is used. In
this scheme, the task of computing the pairwise interactions between
atoms of all pairs of patches is divided among a number of Compute Objects. These compute objects are assigned at runtime to
different processors. The initialization message for each compute
object contains the indices of the patches. The patches themselves are
distributed across processors. Mapping information of patches to
processors is maintained by a replicated object called PatchManager. Figure
illustrates the CHARM++ implementation of the compute object. Each compute object requests
information about both patches assigned to it from the
PatchManager. PatchManager then contacts the appropriate processors
and delivers the patch information to the requesting compute
object. The compute object, after receiving information about each
patch, determines which atoms in a patch do not interact with atoms in
another patch since they are separated by more than the cut-off
distance. This is done in method filter. Filtering could be
done after both patches arrive. However, in order to increase
processor utilization, we do it immediately after any patch
arrives. Since the patches can arrive at the requesting compute object
in any order, the compute object has to buffer the received patches,
and maintain state information using counters or flags. This example
has been chosen for simplicity in order to demonstrate the necessity
of counters and buffers. In general, a parallel algorithm may have
more interactions leading to the use of many counters, flags, and
message buffers, which complicates program development significantly.
Threads are typically used to perform the abovementioned sequencing. Lets us code our previous example using threads.
Contrast the compute chare-object example in figure
with
a thread-based implementation of the same scheme in
figure
. Functions getFirst, and getSecond send
messages asynchronously to the PatchManager, requesting that the specified
patches be sent to them, and return immediately. Since these messages with
patches could arrive in any order, two threads, recvFirst and
recvSecond, are created. These threads block, waiting for messages to
arrive. After each message arrives, each thread performs the filtering
operation. The main thread waits for these two threads to complete, and then
computes the pairwise interactions. Though the programming complexity of
buffering the messages and maintaining the counters has been eliminated in this
implementation, considerable overhead in the form of thread creation, and
synchronization in the form of join has been added. Let us now code the
same example in Structured Dagger. It reduces the parallel programming complexity without
adding any significant overhead.
Structured Dagger is a coordination language built on top of CHARM++ that supports the sequencing mentioned above, while overcoming limitations of thread-based languages, and facilitating a clear expression of flow of control within the object without losing the performance benefits of adaptive message-driven execution. In other words, Structured Dagger is a structured notation for specifying intra-process control dependences in message-driven programs. It combines the efficiency of message-driven execution with the explicitness of control specification. Structured Dagger allows easy expression of dependences among messages and computations and also among computations within the same object using when-blocks and various structured constructs. Structured Dagger is adequate for expressing control-dependencies that form a series-parallel control-flow graph. Structured Dagger has been developed on top of CHARM++.Structured Dagger allows CHARM++ entry methods (in chares, groups or arrays) to specify code (a when-block body) to be executed upon occurrence of certain events. These events (or guards of a when-block) are entry methods of the object that can be invoked remotely. While writing a Structured Dagger program, one has to declare these entries in CHARM++ interface file. The implementation of the entry methods that contain the when-block is written using the Structured Dagger language. Grammar of Structured Dagger is given in the EBNF form below.
Structured Dagger code can be inserted into the .ci file for any array, group, or chare's entry methods.
If you've added Structured Dagger code to your class, you must link in the code by:
For example, an array named ``Foo'' that uses sdag code might contain:
For more details regarding Structured Dagger, look at the example located in the examples/charm++/hello/sdag directory in the CHARM++ distribution.
For starters, see the publications, reports, and manuals on the Parallel Programming Laboratory website: http://charm.cs.uiuc.edu/.
Several tools and libraries are provided for CHARM++. PROJECTIONS is an automatic performance analysis tool which provides the user with information about the parallel behavior of CHARM++ programs. The purpose of implementing CHARM++ standard libraries is to reduce the time needed to develop parallel applications with the help of a set of efficient and re-usable modules. Most of the libraries have been described in a separate manual.
PROJECTIONS is a performance visualization and feedback tool. The system has a much more refined understanding of user computation than is possible in traditional tools.
PROJECTIONS displays information about the request for creation and the actual creation of tasks in CHARM++ programs. Projections also provides the function of post-mortem clock synchronization. Additionally, it can also automatically partition the execution of the running program into logically separate units, and automatically analyzes each individual partition.
Future versions will be able to provide recommendations/suggestions for improving performance as well.
While we can promise neither bug-free software nor immediate solutions to all problems, CHARM++ is a stable system and it is our intention to keep it as up-to-date and usable as our resources will allow by responding quickly to questions and bug reports. To that end, there are mechanisms in place for contacting Charm users and developers.
Our software is made available for research use and evaluation. For the latest software distribution, further information about CONVERSE/CHARM++ and information on how to contact the Parallel Programming laboratory, see our website at http://charm.cs.uiuc.edu/.
If retrieval of a publication via these channels is not possible, please send electronic mail to kale@cs.uiuc.edu or postal mail to:
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -white -antialias -local_icons -long_titles 1 -show_section_numbers -top_navigation -address '
November 23, 2009
Charm Homepage' -split 0 manual.tex
The translation was initiated by root on 2009-11-23
November 23, 2009
Charm Homepage