L. V. Kalé
B. Ramkumar
A. B. Sinha
A. Gürsoy

The system thus provides ease of programming on MIMD platforms without sacrificing performance.
Many parallel machines are commercially available. These include shared memory machines such as the Sequent Symmetry, Encore Multimax, and the KSR-1, and distributed memory machines such as the IBM SP-1, NCUBE-II, Intel iPSC/860 and Paragon, Convex Exemplar, and TMC CM-5. Many of these machines include hundreds of processors, each of which is a powerful microprocessor with performance of the order of tens of MFLOPs.
At the same time, there are many computation intensive applications that can benefit from parallel processing. These include applications in computational biology, computational fluid dynamics, design automation, weather prediction, various calculations/simulations in physics, discrete optimization, and many AI computations such as heuristic search, problem-solving, and planning.
However, programming these parallel machines remains a challenging and difficult task, except for a few highly regular kernels. The complexity of parallel programming arises from the inherent asynchrony of computational agents, with the concomitant correctness issues, and the need to consider performance issues beyond those in sequential programming, such as load balancing, network contention, and communication latencies.
In addition, different machines have different primitives, modes of
operations, and cost tradeoffs.
Even among MIMD
architectures, the programming
environment supported on a machine is significantly affected by the
characteristics exhibited by the architecture.
For example, the memory organization of a machine often
determines the support for inter-process communication on the machine.
Shared memory machines, like the Sequent Symmetry,
tend to support communication through shared variables whose access is
controlled using mutual exclusion primitives.
On distributed memory
machines supporting a global address space like the KSR-1, communication is
also supported through
shared variables. However, a significant difference exists between the
cost of accessing local memory and non-local memory. As a result,
they are also called NUMA (Non-Uniform Memory Access) machines.
Nonshared memory machines, like the Intel Paragon,
support a significantly different style of
inter-process communication --- message passing using sends and
receives.
This form of communication isn't directly compatible with the shared variable
approach.
As a result of the differences in memory organization,
programmers need to develop different versions of the same
parallel program for different target architectures.
This lack of
portability is exacerbated by the rapid evolution of parallel
machines, due to which even machines from the same vendor may not
be directly compatible with their earlier machines.
So, porting a parallel application written for one machine to another
is expensive, thus adding to the development and maintenance costs of
parallel software.
The complexity of parallel programming, exacerbated by the issue of portability, constitute hurdles that must be overcome before one can bring the substantial power of parallel processing to bear upon the broad class of significant applications. This paper describes Charm, a parallel programming language and its associated runtime system that is aimed at controlling the complexity of parallel programming. Charm was developed at the University of Illinois over the last several years, and is aimed at providing productive and effective parallel programming support for MIMD parallel machines. The targeted machines include both shared memory and distributed memory machines.
A general technique for dealing with complexity involves employing a hierarchical structure. In order to identify the appropriate hierarchy to deal with the conceptual complexity of parallel programming, we note that the task of writing a parallel program for a specific machine involves:
In addition, approaches can also be distinguished by their generality or specialization. An approach may become specialized either by limiting itself to specific application domains, or to narrow parallel programming paradigms. We will examine extant classes of approaches with this hierarchy in mind, to motivate our approach.
Figure 1: The spectrum of approaches to parallel programming.
The seven classes and their relative positions along the spectrum of low-level to high-level and (orthogonally) from general-purpose to specialized, is shown in Figure 1. Looking at this spectrum of approaches, it seemed apparent to us early in our research that there was a need for a general purpose language at a higher level in the hierarchy [31]. We need a language and system that satisfies the following requirements:
Figure 2: Intended uses of the Charm parallel
programming system.
A high-level, yet general purpose, approach has a further advantage. In the absence of domain specific knowledge, it provides an appropriate division of labor between the programmer and the system: the programmer specifies the decomposition of the computation into parallel actions, while the system can implement resource management and scheduling effectively.
Charm was designed to fit this niche. It provides more substantial support for parallel programming than the simple ``portable mechanism'' oriented systems. In addition, as shown in Figure 2, it can be used as a common base language for implementing other specialized high level languages, systems and packages, simplifying the task of developing them. Note that in order to support classes of approaches numbered 2 and 3, such a system must allow the programmer to over-ride its automatic mapping and/or scheduling functions. Such an over-ride is also necessary to give the programmer the upper hand in cases when they can do a better job of mapping/scheduling based on their knowledge of the application.
A general-purpose, high-level parallel language should also satisfy the following requirements:
The latency of communication --- the fact that remote data will take longer to access than local data --- is, and is likely to remain, a constant part of parallel processing hardware. Moreover, responses from a remote processor may encounter unpredictable delays due to ongoing computation on that processor. The language must provide features so that users can write programs which tolerate or hide such communication latencies.
For efficient portability, the primitives in the language
should have a well defined cost model.
Only with such a cost model can a programmer hope to write efficient and
portable programs. So, the cost of the system primitives should not
vary significantly from machine to machine. Nor should the compiler
restructure the user code for optimization, making the cost of primitives
unpredictable and requiring the user to second-guess the
compiler
.
Since the system is meant to abstract over shared memory and distributed memory machine capabilities, it should provide mechanisms that ensure efficient implementation of algorithms on both kinds of machines. An algorithm expressed in this language must run without change on both kinds of machines, yet be competitive with any implementation of the algorithm that directly uses the machine's capabilities.
The language we define must allow specification of such computational tasks, and allow for their dynamic creation. The mapping of these actions to processors and their scheduling should be automatically handled by the system, unless the user specifies otherwise.
Many explicitly parallel languages are based on a universal information sharing mechanism: a single mechanism that is deemed appropriate for various modes of information sharing. It is much more intuitive for the programmer to specify a particular information sharing mode, than to fit it into the single abstraction provided by such a ``universal'' mechanism based language. The language must provide specific abstractions to support various modes of information sharing.
The Charm language described below is designed to satisfy these requirements. The basic features of the language, including chares (tasks or objects) and messages, are described in Section 2. Chares are dynamically created and allow for expressing computations where dynamic task creation is necessary. They can be automatically mapped and scheduled, thereby providing the user with high-level capabilities. The message driven execution model of Charm and the rationale is described in Section 3. Section 5 discusses how specific information sharing mechanisms in Charm provide more expressive and efficient ways of sharing information across a multitude of parallel programming architectures. An additional kind of Charm process, called branch office chare, aimed at satisfying generality and modularity requirements, is described in Section 6. Section 7 provides a discussion on how Charm supports modularity. Additional features of Charm, such as conditional packing, quiescence detection, and prioritization, are discussed in Sections 8, 9, and 10. Section 11 describes the cost model for the different system primitives. The paper concludes with Section 12, which includes a comparison with some other systems, including those that occupy the same niche as Charm.
A parallel program includes many sub-computations that
are sequential in nature. There is no reason to invent
a new language for expressing these sub-computations.
We chose to employ C as the base language for Charm
.
As a consequence,
a Charm program may include function and type definitions as in C.
In addition, any sequential threads of control in the
parallel constructs must be expressible in C.
With this decision, it becomes possible to retain large portions of sequential
application code while parallelizing them.
C is also a pragmatic choice as it is available on all parallel machines,
and has an acceptable performance.
Next, we must define the parallel constructs in the language. Parallelism entails the existence of multiple focii of control. So we need a construct that captures the notion of a focus of control. One possibility is to associate each processor with one focus of control. This leads to the ``one process per processor'' view which is supported by most vendor supplied software. However this conflicts with requirement R4, which stipulates that dynamic creation of work be allowed. In addition, as stipulated by requirement R2, the user need not have to specify mapping of work to processors. Therefore we choose to separate the notion of a processor from the construct that encapsulates a focus of control. We call this construct, which specifies data and computation that will be mapped as a unit to a single processor, a chare --- for chore or a small task. There may exist zero, one or more chares per processor at one time.
The chares will need to exchange information with other chares. A common mode of information exchange occurs when a chare produces data that is needed by another chare. This mode is supported by the notion of a message, which is a directed communication from one chare to another. Syntactically, a message is defined to be a collection of data, and in Charm it has the same syntax as that of a C structure declaration (see Figure 7).
A chare is allowed to handle multiple messages addressed to it: a separate section of code within a chare handles each incoming message. One possible way of specifying such sections is to require the programmer to provide a single function for handling each type of incoming message. However in different contexts and in different phases of its lifetime, a chare may have to deal with messages of the same type in very different manners. So we provide the notion of a named entry-point function. An entry-point has a single message type and an arbitrary C-code block associated with it. With this it is possible to have a single message type associated with many distinct entry-points. Note that this is directly analogous to the notion of methods in object-oriented programming.
When a chare sends a message to another chare it directs the message to a
specific entry-point. The code in the entry-point function, which is
triggered by the message, may access
the fields of the message and the local variables of the function.
Each chare instance may also have local variables that can
be accessed from all of its entry points.
The code at different entry-points may need to execute similar or
identical computations. Such computations can be expressed as
private functions. Private functions
can be called only from within a chare,
and they, like entry-points, can access the local variables of
the chare.
A chare is very similar to an object. It provides data encapsulation.
However, chares in Charm
,
do not provide other object attributes, such as
inheritance and polymorphism.
An example of the syntax of a chare definition appears in Figure 7. A chare definition includes the declaration of its data area (local variables), followed by declaration of a sequence of entry-point definitions. Each entry-point definition consists of an entry-point name, followed by a declaration of a message associated with it, and a block of C code. This block may contain arbitrary C code, including function calls. In addition, it may contain calls to Charm primitives, such as those described in Section 2.1.
A Charm program, as defined so far, is a collection of chare definitions. At runtime, a single instance of a special chare, called main, exists. Execution of a Charm program begins with the creation of an instance of the main chare and the execution of a special entry function in the main chare called CharmInit. Other new chare instances can be created (from inside the main chare or other subsequently created chares) using the CreateChare call. It takes as parameters the name of the chare that is to be created, the entry-point to which the message is addressed, and a pointer to a message of the type associated with the entry-point. An entry-point is a variable of type EntryPointType. It can be specified as chare_name@entry, which refers to the entry-point entry in the chare chare_name. If the CreateChare call is made inside chare_name, then the entry-point could just be specified as entry. Note that an entry-point is a first class object. One can set a variable, say entry_name, of system defined type EntryPointType as: entry_name = chare_name@entry.
The CreateChare primitive, like all other Charm primitives, is non-blocking. From the programmer's viewpoint, the call deposits a new-chare message in a pool of such messages, and immediately returns. Eventually, a chare instance is created on some processor under the control of the runtime system of Charm (called the Chare Kernel). As soon as the chare is created, it executes the message using the code at the entry-point named in the creation call.
When a chare instance is created, it is identified by a unique chare-id.
This address
of a chare instance may be obtained by the system call
MyChareID(&chare_id), where chare_id is a variable of type
ChareIDType.
The addresses of chare instances can be then sent as fields in messages to
other chares, which can pass them along to other chares, and so on.
A message can be sent from one chare instance to another using
the SendMsg call.
It takes as parameters a pointer to a message, the address of the
destination chare,
and an entry point where the message needs to go.
In traditional message passing, a processor, after issuing a request for a receive, must idle until the specified message arrives. This waiting may not always be dictated by the algorithm, i.e., the algorithm may have more relaxed synchronization requirements. This is particularly true for global operations, such as reductions. Yet the use of blocking primitives forces unnecessary synchronization and may cause idle time. This idle time can be decreased by moving the sends earlier and postponing the receives as much as possible in the code. In many cases, such local rearrangement of communication can increase the utilization of processors. However, this strategy cannot handle cases with more complex dependences and unpredictable latencies, nor can it handle global operations [34].
In accordance with our latency tolerance requirement (R3), we want to avoid the idling of processors under this condition. This is accomplished in Charm by allowing multiple chares to exist on each processor
and by employing a message driven
execution model. In this model:
Conceptually, one can think of the system operating with a pool of
messages
, a collection of passive chares, and a collection of
active processors. Each processor may pick a message from the
message pool, identify the chare that it is meant for,
and schedule the execution of the code at the entry-point
designated by the message.
Note that there is no receive call, synchronous or asynchronous, in Charm.
In fact, there are no calls that allow chares to
interact synchronously with chares on remote processors. For efficiency
and convenience, Charm does provide local synchronous calls between
chares through public function calls.
Figure 3: This figure illustrates the split-phase programming style.
The message driven programming model and the absence of any non-local synchronous primitives, such as receive, in Charm leads to a unique style of programming called ``split-phase'' or ``continuation-passing''. In this style, a synchronous request must be split over into two code blocks: the first block issues the request and provides the servicer of the request with the address of the second block of code to which the reply must be sent. The second block of code is activated whenever the reply arrives and must therefore be capable of processing the reply. A simple example in Figure 3 illustrates the split-phase programming style. Let A and B be instances of the client and the server chares, respectively. Chare A needs some data from Chare B, so it sends Chare B a request message for the data, along with the address to which the data must be returned: in this case it is another entry point ( ProcessReply) in Chare A. Chare B, on receiving the request, sends back the data to Chare A at the specified entry point, which is invoked and handles the data when the message arrives. Notice that Chare A suspends after the entry point MakeRequest is executed, and is reactivated with the arrival of the reply from Chare B. The request is therefore split across two phases: make request and receive data.
A message-driven execution model allows one to adaptively schedule computations within and across modules as illustrated below.
A more detailed exposition of the advantages of message driven execution can be found in [34].
Consider the following example abstracted and modified from a real application
--- a core routine in parallelized version of a molecular mechanics code,
CHARMM. Each processor has an array A of size n. The computation
requires
each processor to compute the values of the elements of the array and
to compute the global sum of the array across all processors. Thus, the
element of A on every processor after the operation is the
sum of the
elements computed by all the processors.
In the SPMD model, this computation can be expressed with a single call to the system reduction library (e.g., gssum, on an Intel machine) preceded by the computation of the array on every processor. Alternatively, one can divide the array A into k parts, and in a loop, compute each partition and call the reduction library for each segment separately. Each call to the reduction library is a blocking one, i.e. the code cannot initiate the local computation belonging to the block before receiving the result of the current reduction.
However, the computations for each partition are completely independent. In particular, computation of the next k items (i.e., the next partitions) are not dependent on the result of the reduction, and so could be started even before the reduction results from the previous partitions are available. With this (message driven) strategy, a process that has just finished computing a partition is willing to either process the result of the reduction of any previous partition or compute the next partition.
This computation was programmed in C (using the Intel supplied native reduction library) for the blocking SPMD version and in Charm for the message driven version, and run on an Intel/Paragon machine. Figure 4-(a) shows the performance results of the case k=160, n = 40960, and up to 64 processors. The effect of pipelining of reductions in Charm is apparent from the flattening of the curve beyond eight processors. The increase up to eight processors with Charm can be attributed to the increase in the branching factor of the spanning tree used by the reduction library.
Figure 4: Concurrent Reductions, k=160,n=40960
Each processor in the above experiment did a fixed amount of computation per element of the array before calling the reduction. In real applications, this computation is likely to vary from processor to processor. Figure 4-(b) shows results of the computation for the same parameters but with a random amount of computation added to each partition. The performance benefits of message driven execution become more significant when there exist irregularities in the computation. The blocking version makes every processor wait at a barrier for the last processor to arrive at the barrier, thus making the completion time the sum of maxima (for all partitions) as opposed to the maximum of the sum for the message driven version.
The message-driven paradigm allows different modules that might have some concurrent computations to share processor time. Consider the computation (taken from Figure [34]) shown in Figure 5: module A invokes two other modules B and C. In the SPMD model, module A cannot activate B and C concurrently even if the computations in B and C are independent of each other. As a result, the processor time is not fully utilized, as illustrated in the same figure. In a message-driven paradigm, the idle times on a processor can be utilized by another module if it has some work to do. Such a scenario is illustrated in Figure 6 (taken from [34]). Module C gets processor time (by virtue of having its message selected by the scheduler) while B waits for some data, and vice versa, thus achieving a better overlap than the SPMD program.
Figure 5: SPMD modules cannot share the processor time.
Figure 6: Message-driven modules share the processor time.
Libraries constitute an important part of the software development process. They provide reusable, portable code, and they hide details from application programmers. There are many SPMD parallel libraries for commonly used kernel operations, such as numerical solvers, FFT, etc. The SPMD style does not encourage use of multiple concurrent libraries. When faced with performance loss in a situation, such as in Figure 5, an SPMD programmer typically breaks the library abstraction, combines modules B and C with A, and then tries to achieve better overlap by static movement of code augmented by ``wild-card'' receives. On the other hand, message-driven style encourages creation of smaller and more reusable modules. Therefore, we expect libraries to be a major strength of message-driven systems in the future.
A simple Charm program is shown in Figure 7. In this program, a number of new chares are created in CharmInit. The user input is read using the CkScanf call, which is similar to scanf of C. The message that is sent to each chare is allocated using the CkAllocMsg system call, and contains the value of the seed, the address of the main chare, and some user-defined common data.
1 1
Figure 7: A Simple Charm Program.
1 1
Each new chare is eventually mapped and scheduled by the runtime system on some processor selected by it. Thereafter, upon creation, each chare instance calculates, in the Start entry-point, a value which depends on the seed and the common data. This value is returned in a message to the Return entry-point of the main chare, where the values calculated by all chare instances are added. Messages are deallocated using the CkFreeMsg system call. The ChareExit call is used to signal to the system that a chare has completed execution. This call results in the de-allocation of memory occupied by the chare.
Once all chares have returned their computed values to the main chare, the aggregate value is printed out using the CkPrintf system call. This call is similar to the printf call, except it is guaranteed to be atomic, i.e., multiple CkPrintf calls from different chares will not be garbled together. The main chare signals that the program has completed execution by calling the CkExit system call. This call results in the termination of all Charm processes, and the collection of performance and debugging data that the user might have requested.
The primitives described in Section 2 permit two forms of information sharing:
Charm supports five specific modes in which information can be shared by chares. Each mode is provided as an abstract data type (ADT). Variables of each such type can be created statically (by initializing them inside the CharmInit entry-point of the main chare) or dynamically (any time during the execution of the program), and can be accessed and mutated only via the defined functions of the corresponding ADTs. Each of the ADTs may then be implemented by the runtime system differently on different machine architectures, thereby ensuring that the abstractions provide efficient portability.
ReadOnly variables: In some computations, many chares need read access to values that are created at the beginning of the computation (but are not known at compile time), and are not altered thereafter. Such information sharing can be specified by declaring a variable to be readonly in the declaration section. of the program.
These variables can be assigned values only in the CharmInit entry-point of the main chare. They can be accessed from any chare using the ReadValue(variable_id) system call, which simply returns the value of the variable.
In the sample program in Figure 7, the data needed to calculate the value (in compute) was initialized in the CharmInit entry point. This data was passed to all the chare instances in the message; a more efficient implementation would be possible with the readonly abstraction. The common data could be declared as a readonly variable, initialized using the ReadInit(data) call, and accessed using the ReadValue(data) call. Messages would no longer carry the data for every chare that is created. Write-once variables are the dynamic counterpart of readonly variable; they can be initialized from anywhere in the program.
Distributed table: A distributed table is a set of entries, where each entry is a ``record'' with an integer key, and an arbitrary (untyped) data field. A special data-type called table is defined by Charm. One may declare many different tables of this type in a program. Distributed tables are accessed and modified only via the three calls: Insert, Delete, and Find. Unlike read-only and write-once variables, for which the access is synchronous, and immediate, accesses to the entries in the table are all asynchronous. Thus a call to find the data corresponding to a given key does not return with the data. Instead, it deposits a request to send the data to a given chare at a given entry-point.
Accumulators: Consider a computation in which many dynamically mapped chares are sending messages to each other, and we would like to count the total number of messages generated during the entire computation (or during a particular phase of the computation). Of course, each chare could store its own count, and send this count to a counter chare before it terminates. However, this method is not desirable because it is not scalable --- with a large number of chares running on many processors, the ``counter'' chare will become a bottleneck. As this type of information sharing requirement is quite common, Charm provides the accumulator abstraction. More generally, an accumulator object has a commutative-associative operator as the sole mutator, and an identity element with respect to this operator.
We define the data associated with the accumulator as a message (this also facilitates conditional packing of the data; see Section 8). In addition to the data, the definition of an accumulator requires three user defined functions: one for initializing the accumulator data ( initfn, one for ``adding'' to the accumulator ( addfn, and one for combining two copies of the accumulator ( combinefn, if needed.
The accumulator abstraction is defined by three calls: CreateAcc, Accumulate, and CollectValue. The accumulator may be created in the CharmInit entry-point of the main chare by using the call: CreateAcc(ACC_TYPE, msg). This call creates an instance of the accumulator, and initializes it by calling the initfn with the given message as a parameter. The call returns a unique address (of type AccIDType) for the accumulator. This address can be used to access and modify the accumulator in the rest of the program. The address can be sent in messages to other chares, or it can be assigned to a readonly variable, so all chares can use it. The accumulator may also be created dynamically from any chare using the variant of the call CreateAcc(ACC_TYPE, message, entry, chare_id), which signals the runtime system to send the address of the new accumulator instance to the named entry-point ( entry) of the designated chare (with the chare_id), after it has been created.
Any chare which knows the address of an accumulator instance (say accid) may ``add'' to it by using the call: Accumulate(accid, addfn(..)). The final value of an accumulator can be read by calling CollectValue(accid, entry, chare_id), which returns immediately without any value, but a message is eventually sent to the entry-point entry of the chare instance designated by chare_id, containing the final value of the accumulator. The CollectValue call results in destruction of the accumulator variable. Hence the call should be used only once, and only when one is sure there are no more Accumulate operations possible on that variable.
The user program never calls the combineFn() function explicitly. It is used by the runtime system in case it has made multiple copies of the accumulator for efficiency on the target architecture, e.g. in a non-shared memory machine implementation. (See part II of this paper.)
Monotonics: Sometimes many chares need to read and update a shared variable, but the update operation is idempotent (i.e. repeated application of the same update operation are equivalent to one update operation) as well as commutative-associative, and the variable successively takes on monotonically ``decreasing'' values in some metric. In branch-and-bound computations, such a variable is needed to store the cost of best solution known so far. Every chare needs to know what the current best bound is, and when someone finds a new solution, the best bound may have to change to a smaller value, if the new value is smaller. Such information sharing is specified by declaring the variable as a monotonic variable.
The monotonic data abstraction is defined by three calls: CreateMono, NewValue, and MonoValue. The CreateMono call has the same set of parameters as the CreateAcc call, and works in an identical manner. A new value can be deposited into a monotonic variable by using the call: NewValue(monoID, updateFn(args..)). An upper bound of the current value of a monotonic variable can be obtained by the call: MonoValue(monoID). The value returned by the MonoValue call will be either the value assigned during initialization, or provided thereafter by some NewValue call, and be better than or equal to the best value provided by a NewValue call by the same process. In addition, the system will make efforts to provide the best value of the monotonic variable supplied by any NewValue call until that point in time.
The programmer's model of a Charm computation, as described so far, includes chares that may dynamically create other chares, send messages to each other, and share information via other specifically shared ``global'' variables. Note that the ``processor'' is not a part of the ontology so far. We now introduce a construct that brings in the notion of processors.
In many parallel applications, similar work is done by each processor. Such an application could be programmed using chares, where one chare is created on each processor. However substantial initial bookkeeping is required to make sure that each chare knows the address of a chare on a particular processor. As this is a common case, we prefer to have more convenient way of expressing it using a new construct: branch office chare (BOC), which is a replicated process. A branch chare of the BOC exists on every processor. However all the branches of an instance of a branch-office chare can be referenced by one name.
Figure 8: This figure shows two instances of
a branch office chare, and how different branches can interact with others using
public function calls if they are on the same processor or messages
if they are on distinct processors.
A chare and a branch of a BOC on the same processor, or branches of two BOCs on the same processor could interact with each other using messages. However messages have overheads of creation and scheduling. Such overheads can be eliminated with a synchronous, sequential function call interface between the local ``branches'' of the application and other chares. Therefore, in addition to receiving messages at entry-points like chares, BOCs also provide public functions. A chare (which happens to be running on some processor under the control of the dynamic load balancing strategy) may interact with the local branch of a BOC via a sequential public function call. This combination of features makes the branch office chare a versatile and useful abstraction. Figure 8 shows two branch office chares, and how different branches can interact with others using public function calls if they are on the same processor or messages if they are on distinct processors.
The syntax of a branch-office chare is similar to that of a chare, and an example is shown in Figure 9. A BOC declaration consists of its data area (local variables), entry point definitions, and definition of private and public function calls (identical to those of a chare).
The CreateBoc system call is used to create an instance of a branch office chare. It takes as parameters the name of the BOC to be created, a creation message, and the entry point to which the message is addressed. BOCs can be created statically from the CharmInit entry-point in the main chare. In this case, the call returns the address of the new branch-office chare. BOCs can also be created dynamically in the middle of a computation from any chare. In this case, the last two parameters are required, and the address of the BOC is sent to the chare identified by chare_id, at the specified entry-point entry2.
In the following discussion, the address of a new branch-office chare instance is denoted by the variable boc. Different branches of the same or different BOCs can communicate with each other using the SendMsgBranch call. A branch can send a message to all other branches using the BroadcastMsgBranch call. Chares and other BOC branches on a processor may call a public function fn of a BOC identified by boc, which is an instance of a BOC named boc_name, using the BranchCall system function.
Branch-office chares can be used in a variety of contexts:
1 1
1 1
Figure 9 shows a simple program written using a BOC: a message is sent around in a ring on the available processors starting at processor 0. The BOC is created using the CreateBoc call in the main chare.
Supporting modularity and reuse is more difficult in a parallel context than in a sequential context. First, there are a seemingly mundane set of issues that need to be addressed, such as the fact that pointers are not valid across address space boundaries. So, for example, one cannot pass a function-pointer to another module directly. Second, a reusable module must be able to work with a large variety of ways in which the input data is distributed among processors. Third, for scalability, modules should be able to exchange or pass data in a fully distributed fashion. When a module with entities spread over hundreds of processors, wishes to pass data to another module spread over hundreds of processors, one must ensure that the data exchange is not centralized. In this section, we describe the features in Charm that support modularity and reuse, and briefly discuss how these features achieve the objectives.
A Charm program is written as a set of modules. Charm supports separate compilation of modules. A module can contain names of chares, BOCs, C functions, messages, C type definitions, and specifically shared variables. These names are internal to the module. Modules can interact with each other by referencing external names (defined in other modules). An external name is referred to by specifying the module and the name being referenced. E.g., accesses to a function F, or a chare C, or an entry point E inside chare C which are defined in a module M would be made as M::F, M::C, M::C@E, respectively.
Each module has an interface statement that includes prototypes of all the names in the module that may be referenced by other modules. A module includes the interface statement of each module with which it interacts, as well as its own interface statement.
Names, external and internal, can be passed in messages or via functions to other modules. This mechanism can be used by a client to request a certain service from a server process defined in another module, such that the result of the server's computation is sent to a specified chare at a specified entry-point. This is very useful, because the server might have been written separately without any knowledge of the static entities or the names of possible clients. Similarly, consider a situation when a module M1 invokes M2, and wishes to have M2 call a function F defined in M1 possibly at some other processor than the one where the invocation occurred. Assume again that M2 was written independently of M1 (and so cannot contain an interface statement for M1). M1 cannot pass a pointer to the function F, because it will not be valid on the processor where F is to be invoked. Charm provides a mechanism for converting locally valid function pointers to globally valid function references for this purpose. Such references can be propagated in messages, and dereferenced via another system call when the function is to be invoked.
Charm supports distributed exchange of data among modules via BOCs (Section 6) and distributed tables (Section 5). A module may send data to another via a BranchCall carried out on all the processors. Alternatively, two modules may exchange data via a distributed table, thus obviating the need for either module to know where the particular data items are located. Thus, BOCs and tables act as a ``glue'' to interconnect modules through distributed interfaces. Finally, as illustrated in Section 3.2, message-driven execution also allows Charm modules to be composed efficiently.
The data structure to be passed in messages may sometimes be large and complex. Consider an array being passed in a CreateChare message. On a shared memory system, the message need only store a pointer to the base of the array and its size. However, on nonshared memory systems, a pointer is not valid across processors. So the whole array must be copied in each message. A chare may also want to send a dynamically created data-structure, such as a graph or a tree, which uses pointers. Again, the data structure must be copied (or ``packed'') into a contiguous structure without pointers before it can be sent in a message.
Charm encourages a programming style that counters the unpredictability of available work by creating many small chares in the hope of being able to distribute them as needed. So in a message passing system, each processor, typically, creates many chares that are not actually sent out to any other processor, but executed locally. When the system is in saturation (all the processors have sufficient work), this happens to most chares. This state of affairs is desirable because it offers the flexibility of responding to load fluctuations as they arise. Now, however, packing each message in a format suitable for across-processor transmission seems quite wasteful. Also, such packing is unnecessary and wasteful on shared memory machines. It would be better to pack only those messages that actually leave address-space boundaries, leaving other messages free to contain pointers. However, the programmer doesn't know which one of the created chares (or messages) will end up going to another processor, as it is the decision of the dynamic load balancing strategy. Charm, on the other hand, cannot know how to pack messages since their structure is known only to the corresponding application code.
Charm provides an interesting solution to this problem. It allows messages to contain pointers. However, if a message needs to move across address space boundaries, the kernel calls the appropriate code in the user program for packing the message into a contiguous space and eliminating explicit pointers. Conversely, before a message received from outside the address space is scheduled for execution, the system calls another entry-point to unpack the message. The functions for packing and unpacking are provided by the programmer along with the definition of the message-type.
Thus, only those messages that are actually sent out are packed, and the rest of the computation can proceed as it would on a uni-processor or a shared memory machine, using pointers to represent data structures efficiently. This feature is instrumental in satisfying the requirement R3 about ensuring competitive efficiency on shared memory machines. As experimental evidence, conditional packing led to three-fold improvements in speed on iPSC/2 for a parallel prolog interpreter [36] implemented using Charm.
1 1
Figure 10: The declaration and allocation of a variable size
message.
1 1
The special case of variable size arrays occurs very frequently in many applications. Charm provides a varSize array field in message definitions for this special case. When a message with varSize fields is allocated, the user must specify the sizes of all the varSize array fields in the message. The system then handles the packing and unpacking of this messages automatically.
Figure 10 shows the declaration of variable sized message-type MSG. The CkAllocMsg call takes an additional parameter ( sizes), which is an array whose elements are the sizes of the variable sized arrays in the message.
In some computations, it is useful to know when there are no more messages in the system.
Charm allows a user to specify an entry function of a chare with address to which a message is sent when the system becomes quiescent. The user-defined code at that entry-point then decides the course of action. In simple cases, the action may be just to terminate the execution by calling CkExit(). However, other interesting uses are also possible. For example, a read-eval loop for languages such Prolog can be written by having the code at the quiescence entry-point read the next query and start its execution. In general, quiescence detection can be used to detect the end of a phase of computation which involves arbitrary and unpredictable amount of communication per processor.
Quiescence detection has been efficiently implemented [37], so that (a) the condition is detected very quickly after it occurs, and (b) the overhead of the algorithm is very small.
In many computations, the order in which available tasks are selected for execution [38] can affect various performance metrics. In Charm, the programmer can assign priorities to messages; the message-queuing strategy chosen by the user will then schedule the highest priority member of the queue. The priority can be an integer, or an arbitrarily long bit-vector, depending on the queuing strategy option chosen. Bit-vector priorities are especially useful for obtaining good and consistent speedups in state-space search and related problems [39,40] which involve speculative work. Integer priorities are useful in many seemingly regular computations which may have critical paths [41] that must be prioritized, particularly in the presence of message-driven execution.
Charm has many different load balancing strategies for message passing machines. Different strategies may be effective in different application-specific and sometime machine-specific contexts. Therefore, Charm allows the user to link any one of the available strategies from its library, and to specify parameters for the strategy to tune it further.
Typically, for reasons of scalability and to avoid bottlenecks, it is desirable that load balancing strategies be distributed in nature. However, experimental results [42] have shown that existing fully distributed load balancing strategies do not balance priorities well resulting in the concentrations of low and high priority work neighborhoods. Therefore, in addition to distributed strategies, Charm also provides fully and partly centralized load balancing strategies. These load balancing strategies fare much better in balancing priorities than fully distributed strategies.
Charm provides a relatively simple cost model to the programmer. The cost of various operations can be understood in terms of the cost of a message and the cost of a function call. The cost of a message has two components, one fixed and one a linear function of the message size, the proportionality constants vary somewhat from machine to machine.
The cost of BranchCall and PrivateCall is that of a sequential function call, and so is the cost of the following calls for accessing and updating specifically shared variables: ReadInit, ReadValue, DerefWriteOnce, Accumulate, MonoValue, and NewValue. The cost of the following calls is that of a single message: CreateChare, SendMsg, and SendMsgBranch. The cost of the distributed table operations Insert, Find, and Delete is that of two messages. There are a set of calls whose cost to the overall system is that of a single message per processor; in terms of critical path (i.e., the time between the call and the completion of the action) the cost can be considered to be log p messages, where p is a number of processors. The calls in this category are: BroadcastMsgBranch, CreateBOC, WriteOnce, and CollectValue.
It might be argued that the cost model is inadequate because it does not account for interconnection topologies, and the variation in communication latencies and processor speeds across machines. Clearly, for portable design of parallel programs, it is desirable to be able to ignore these machine-dependent features. But more important, with the current generation of parallel computers, it is becoming clear that the interconnection topology is not a significant determinant of performance. The end-to-end message delays on machines with advanced routing networks (such as wormhole routing) vary very little with the inter-processor distance. The network bandwidth is affected by the average communication distance, and for the rare set of problems where the communication bandwidth is a significant issue (over-riding processor utilization, say), the user may indeed have to refine the cost model to include the interconnection topology. The relative communication latencies are rendered unimportant due to the message-driven execution in Charm --- as long as there is sufficient work on each processor, the actual network latency of messages does not affect the performance of an application. The only overhead the programmer needs to be concerned with is the per-message overhead incurred by Charm and the underlying operating system overheads while sending and receiving a message. This is a software overhead, and it scales at the same rate as the application code with variation in CPU speed. So, the user's cost model doesn't have to change due to variation in latencies and processor speeds, for most applications.
One of the initial motivations we gave for developing Charm was to fill the niche identified in the spectrum of approaches to parallel programming shown in Figure 1. Charm certainly accomplishes this objective, as chares are dynamically load balanced, their execution is scheduled automatically with the arrival of messages, and Charm programs run portably and efficiently across a range of MIMD machines. How this efficient implementation is accomplished is described in part II of this paper. In addition to filling this niche, Charm provides a new parallel programming paradigm. This paradigm:
Charm was intended as as a general-purpose high-level language, which could be used to support other language design efforts, as shown in Figure 2. This objective has also been attained as is substantiated by the fact that Charm has been used as a back-end for a data parallel language called DP [43], a parallel Prolog compiler [44], a high-level synchronization language called Dagger [45], a domain specific language called Divide-and-conquer [46], and an Actor language called Hal [35].
Charm is one of the first languages to employ message-driven execution in stock multi-computers [31]. The idea of message-driven execution is clearly implicit in earlier work on Data Flow machines, which depended on special-purpose hardware to support it. Special purpose hardware for message-driven execution was also the focus of projects such as the J-Machine [47] and Mosaic [48]. The work on macro data flow [49] has focussed on bringing these concepts on general purpose hardware (stock multicomputers).
The work on Active Messages [50] is more recent than Charm. In this model, a message interrupts the recipient process, and invokes the handler rountine specified in the message. Active messages only provides a low-level mechanism for writing message driven programs. For example, if a second message arrives while the first one is being processed, the user's handler code must handle it explicitly, possibly by buffering it. In contrast, in Charm, the runtime system buffers and schedules the messages automatically. Active Messages is a single-process based model, unlike Charm which supports multiple objects per processor. Active messages implementations carried out at the operating system level can deliver messages even faster than the vendor's send-receive primitives on some machines. In fact, the communication layer of Charm has recently been implemented on the CM-5 using Active Messages.
Split-C [51] is a programming language that provides a global data space, where accesses to global data are through unique split-phase operators, which separates the request for data from its use (this aspect is similar to ``futures''). The primary differences with Charm are that it is also a single-process model and that it provides a global address space, and its primitives do not permit an adaptive overlap of computation and communication.
The observation that information is shared in many specific modes was also made independently in [52]. However, they used it for annotations and optimizations in a parallelizing compiler. The dataflow community also developed similar notions --- often called ``sideways'' communication primitives --- in the context of functional programs. In particular, their notion of accumulator is very similar to that in Charm with an important difference. In a functional program, it is trivial to ensure the safety of the read operation, whereas in a Charm program, the user must make sure that all potential operations that can add to the accumulator have terminated before accessing the final value of the accumulator. Distributed tables have similarities with the I-structures in dataflow to some extent, as also with the tuple spaces of Linda. In Linda, the accesses to tuples are blocking whereas in Charm an entry from a distributed table is accessed in a non-blocking split-phase manner. Moreover, tuples are the only information sharing mechanism in Linda, whereas tables are one of many in Charm.
Actors [53] , a construct proposed by Hewitt and developed by Gul Agha, embodied one of the early proposals for message driven execution. Each actor has a behavior associated with it. Actors do not issue ``receive'' statements, but execute only when triggered by a message. Thus the basic notion of chares has much similarity with Actors. Actors, however, permit further concurrency within a single actor, while Charm uses chares to define a boundary between parallel and sequential --- only one method within a chare may execute at a time. The branch-office chare construct, and the use of information sharing abstractions other than messages further distinguish Charm from Actors. The Actor model provides a theoretical background that is applicable to a system such as Charm. One of the first implementations of the Actor model on stock multicomputers was carried out using Charm by Houck and Agha [35].
The notion of concurrent aggregates was developed at MIT by Chien and Dally [28,29] at the same time that the branch-office chares were implemented in Charm [54]. Concurrent aggregates were designed for fine-grained machines, and were implemented in a simulator. The members of a concurrent aggregate are analogous to a branch of a branch-office chare, except that they do not necessarily have a member on every processor. So, calls to a concurrent aggregate may go to a remote processor. The BOCs, on the other hand, are explicitly designed to provide a local, sequential access to branches.
One of the important attributes of Charm is the richness and specificity of the constructs it provides. As a result, the Charm run-time system has a clear understanding of the events in the application program, at a level much closer to the application than the machine. For example, whereas a traditional SPMD system will be able to note that a message went from processor X to processor Y (along with its message type), the Charm run-time system can discern between messages for creation of new chares, messages to existing chares, messages for requesting and fetching data from distributed tables, along with information about the entry-points and chare instances at which a message is directed. This specificity can be exploited in a variety of ways for supporting parallel programming with Charm. In particular, performance feedback and debugging tools can be built that provide the user with application-level feedback allowing them to home onto the trouble-spots in their source program easily. A preliminary step in this direction is represented by Projections [41], a graphical performance display tool, which exploits the specificity only minimally by distinguishing between different kinds of messages.
The extensive support for modularity in Charm, the ability to compose modules without losing the efficiency of message-driven execution, and the mechanisms for distributed data-exchange across modules (provided by branch-office chares and distributed tables) makes Charm an excellent framework for developing flexible and reusable parallel libraries. We expect that this capability of Charm will be leveraged by us and others for developing libraries for various application areas in computational science and engineering.
How does Charm measure up to the requirements for a parallel
``substrate'' language we identified in Section
?
We now discuss the requirements and how Charm satisfies them one at a time.
Charm allows dynamic creation of work with the CreateChare and CreateBoc primitives.
Charm provides a rich set of specific information sharing [] primitives.
The message-driven execution model of Charm allows users to write programs that tolerate latency: the waiting time for remote requests can be used to perform other tasks.
Charm provides efficient portability with mechanisms such as conditional packing and abstractions such as specifically shared variables. Conditional packing allows Charm to exploit the single, shared space on shared memory machines, and implementing abstractions efficiently (and differently) for different machines allows Charm to provide a portable efficient interface for information sharing.
Charm allows user programs to be written in separate modules. The message-driven execution model allows modules to be written independently, since no module needs to know the nature or the order of communication of other modules. This allows for greater re-usability of modules. Charm also provides mechanisms for distributed interface between modules, which promote their efficient usage.
The branch office chare mechanism along with the corresponding send, receive, and broadcast primitives is sufficient to specify any computation that can be carried out on a MIMD machine. Thus, Charm satisfies the completeness argument.
We started with the objective of supporting machine independent parallel programming on all MIMD machines, for a broad variety of applications. We proposed a language to act as a common supporting substrate for many approaches to parallel programming. The language can be used to directly code applications, in addition to its use for supporting implementations of implicitly parallel high level language, other explicitly parallel languages, and parallelizing compilers. The language and the support system it encapsulates are meant to abstract over the abilities of all MIMD machines, shared memory as well as distributed memory. The core language, and some of the advanced features have been implemented on many parallel machines. Preliminary performance studies show that the abstractions are supported efficiently by Charm, delivering performances close to sequential C programs on one processor, and very good speedups with multiple processors. With its dynamic load balancing, the system is particularly suitable for Artificial Intelligence and other irregularly structured computations.
The Charm Parallel Programming Language and System: Part I --- Description of Language Features
This document was generated using the LaTeX2HTML translator Version 95.1 (Fri Jan 20 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 paper.tex.
The translation was initiated by Joshua M. Yelon on Sat Nov 9 12:58:29 CST 1996