Software re-use via libraries is a strategy that allows the cost of software to be amortized. A parallel programming system must support the ability to develop modules that can be ``fitted together'' in a variety of contexts. For example, one should be able to use a (previously written) parallel linear system solver module and a parallel FFT module to construct a computational fluid dynamics application program.
Parallel programs and modules are more difficult to produce than sequential ones; as a result their is a higher premium on being able to reuse parallel libraries. Yet, the nature of parallel software makes it more difficult to reuse modules developed independently: different modules may employ different data distributions, control and data transfer may occur asynchronously or may have to be explicitly synchronized, and one may have pay significant costs in efficiency for modularizing a program to use libraries.
In sequential programs, the module interfaces are simple; a module is invoked by calling a subroutine with appropriate parameters. The control flow and data flow are co-incident. In case of parallel modules, one must find suitable ways in which modules can exchange data. Some of the requirements for promising reuse in parallel programs are:
In traditional approaches, library computations are invoked by regular function calls. The library call blocks the caller on all processors. After completion, the library module returns the result and control to the calling modules. Some disadvantages of libraries in this style are:
A message-driven system, such as Charm [1], supports multiple objects per processor, and uses a pool of messages on every processors. An object is scheduled for execution when there is a message for it. In such a system, one can invoke multiple library modules concurrently, allowing them to naturally overlap their idle times with useful computations. This is a substantial boost for encouraging use of libraries.
As an example, taken from [4], consider an application A that invokes two independent library modules B and C as shown in Figure 1. The B and C themselves are parallel computations which may have their own idle times on all processors due to message latency and dependencies among sub-computations. In the traditional approach, once the control is given to the library B, the idle times of B can not be used by other modules. However, in message-driven execution model, the control can be passed to the module C while B is idle provided that there is a message for the C (or vice versa). Therefore, the idle times across library modules can be used effectively in message-driven execution model.
Charm is a portable object-based message-driven parallel programming system. A Charm program/computation consist of potentially small-grained processes or objects, called chares, and a special type of replicated objects, called branch-office chares. Charm supports dynamic creation of chares, by providing dynamic (as well as static) load balancing strategies.
A chare consists of local data, entry-point functions, private and public functions. Private functions are not visible to other chares, and can be called only inside the owner chare. Public functions can be called by any object on the same processor. Entry functions are invoked asynchronously by an object on any processor. Invoking an entry function in a remote object can also be thought of as sending a message to it. A full description of the Charm language and its C++ based version, Charm++, can be found in [2,3].
The Charm runtime system is message driven. It repeatedly selects one of the available messages from a pool of messages in accordance with a user selected queueing strategy, restores the context of the chare to which it is directed, and initiates the execution of the code at the entry point.
The utility of message driven execution for efficient modularization has been already been described above. In addition, features supported in Charm that support modularity and reuse include branched objects, information sharing abstractions, and the module construct.
A branched object (BOC) is an object with a branch on every processor; all of the branches answer to the same name. One can call public functions of the local branch of a BOC, send a message to a particular branch of the BOC, or broadcast the message to all of its branches. BOCs provide a versatile abstraction that can be used to implement static load balancing, local services (e.g. memory management), distributed data structures, and inter-module interfaces. Two library modules, each spread over all of the processors of the system, can exchange control and data easily through public calls to each others local branches. This provides a simple mechanism for distributed flow of data across modules. Of course, function calls used in the traditional programming model also allow such exchange but they do not support encapsulation of the state of the library --- so, for example, names of global variables (which have to be used to express state) in the two libraries can conflict in such situations.
Information is shared in a few but multiple, distinct ways in a parallel program. A ``message'' is not an adequate mechanism for expressing information sharing, nor is any single generic mechanism such as Linda's tuples [6]. On the other hand, a shared variable is too amorphous a mechanism --- its generality is not needed in most situations and costs too much to implement. Recognizing this, Charm provides six specific information sharing abstractions in addition to messages. Further information about these abstractions can be found elsewhere [5]. Many of these mechanisms provide flexible ways of data exchange across modules. For example, a distributed table holds a collection of data items, each indexed by a distinct key. On distributed memory machines, this collection may be distributed across the processors. One can insert, delete, and find data from such tables asynchronously. Thus, a module may deposit data in a distributed table from which another module may extract it, thus obviating the need for ``hardwiring'' data distribution requirements in module interfaces.
Charm supports a well developed module system. As its first simple benefit,
names in one module do not conflict with those in others. Names that are
exported to other modules are referred to with a module prefix.
For example, an
entry function f of a chare c in module m is invoked
as: m::c@f(msg). Thus, a library
developer can freely use names --- even including the names that they wish
to export --- without worrying about possible conflicts.
The recent concept of "contexts" in MPI [7] eliminates conflicts on tags
across modules, but other name conflicts remain.
Complex libraries
often require the use of callbacks: a library module invoked by its client
might, during the course of its parallel computation, require further
information from the client. As the library is written independent of the
client, it doesn't know the names of entities in the client. The client must
pass references to such entities dynamically at invocation. Charm supports
such dynamic interfaces by treating function names, chare names, entry
function names, and chare id's as first class objects --- i.e. one can have
variables in which to hold such entities and one can pass them around.
All of these features are supported with separate compilation, which is
essential for promotion of practical reuse in a commercial environment. Thus,
multiple proprietary libraries can exist in a larger application program
without requiring access to their source code.
The interface between message-driven programs and message-driven libraries are different from the interface for SPMD style. Since computations are split-phase in message-driven style, library calls must provide a return address for the result or completion. The three separate steps of a simple library interaction in message-driven style are: (a) creation of the library object, (b) invocation of a library computation, and (c) reception of the result at a later point. The caller might execute other code before the result is returned, including possibly invoking other library computations.
There are various ways of interaction between parallel library computations. However, most interactions exhibit patterns that can be classified into a few groups (Figure 2). One of such interactions, which we will call as distributed interface, is: when every object/node makes a request to the same library and receives the result of the library computation. An example of this case is a reduction/broadcast operation which is very common in many scientific applications. A second pattern is the client/server interaction. An object makes a request to the library. Here the library is a server. Then the library computation continues concurrently with the rest of the application. The library may involve parallel tasks on multiple processes. Eventually, the result is returned to the requester by the library. The third pattern involves a single request from a client which triggers a distributed computation and returns the result to multiple agents, (e.g., all branches of a BOC) as specified by the client.
Another important interaction between library and computations is the delayed data access by the library. In this interaction, the caller invokes the library, and then continues with some other computation. Later, the library may request some data from the caller.
Figure 2: Pattern of Library Interactions
The first step in a typical Charm program is to create
instances of library modules.
Once the instance is created, it can be invoked by all the
participating computations using its unique instance identifier.
This requires the participating computations (or chares) to have the
library's instance identifier. The creation phase, thus, consists
of creating the instances of the library and distributing
the instance identifier to users of the library (clients).
The creation of a library module can be encapsulated into
a function. The library exports this function, create, that
handles all the steps of a creating the library (the user does not have to
know the details of the library). There are two ways to invoke the
create function:
lib_instance = LIB::create() or
LIB::create(chareid,entrypoint)
In the latter case, the instance id is returned asynchronously
to the given chare id via the named entry function.
A library can be invoked by sending a message
to it. The library module exports definitions of types that clients
need, such as messages and names of chares and entry points.
This function needs
input data from the caller (optional) and a return address.
The result can be returned in a message. In this case, the
library invocation call is as follows:
LIB::request(lib_instance,data,my_chare_instance,entry_point)
The library instance lib_instance is invoked, and the result
will be returned in a message to entry_point of the caller
chare. A second option is to receive the result by function call
LIB::request(lib_instance,data,result_buffer,function_ptr)
This call provides the library with input data, a pointer to the
result area, result_buffer,
where the library directly puts the result there, and function_ptr
which is a public function that the library module calls when
the result is ready.
Concurrent library invocations: If the same
library module needs to be called multiple times and concurrently,
what should the interface be? There are two options; the first one
is to create multiple instances of the library module as
shown below:
id1 = LIB::create();
id2 = LIB::create();
and different instances can be invoked concurrently as follows:
LIB::request(id1,data1,mychareid,e1);
LIB::request(id2,data2,mychareid,e2);
The second option is to design the library module such that it
handles concurrent calls.
The caller provides a reference number. The library maintains
a separate environment for each reference number to service the
requests concurrently.
A typical usage of this scheme is to create an instance of the library
and invoke the same instance with different reference numbers for
each distinct request.
id1 = LIB::create();
LIB::request(id1,data1,REQUEST_1,mychareid,e1);
LIB::request(id1,data2,REQUEST_2,mychareid,e1);
The reference number of the result message is set to
the REQUEST_1 or REQUEST_2. If the return is by a function call,
the reference number can be passed as an additional parameter
to the return function.
This example [4] is abstracted and modified from a real application
--- a core routine in parallelized version of a molecular mechanics code,
CHARMM. Each processor has an array A of size n. The computation requires
each processor to compute the values of the elements of the array and
to compute the global sum of the array across all processors. Thus, the
element of A on every processor after the operation is the
sum of the
elements computed by all the processors.
One can divide the array A into k parts, and in a loop, compute
each partition and call the reduction library for each segment separately
and concurrently.
Table 1 shows the completion time of the traditional
and message-driven implementation of this example.
The advantage arises from being able to invoke multiple reductions which
execute concurrently.
Table 1: Completion time (sec) of concurrent reductions, n=40960,k=160
Traditional library interface techniques do not allow one overlap idle times in one library with useful computations in the client or another library. This encourages programmers to merge multiple library modules together and tune them for efficiency. Message-driven execution, in contrast, supports modularity without sacrificing efficiency. A programming system that combines message-driven execution with facilities for flexible exchange of data and control across parallel modules, and supports module with static and dynamic linkages and separate compilation, thus provides an excellent substrate for building libraries.
L.V. Kale, The Chare Kernel Parallel Programming Language and System, Proc. ICCP90, Vol II, Aug 1990, pp. 17--25.
This document was generated using the LaTeX2HTML translator Version 95.1 (Fri Jan 20 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 paper.tex.
The translation was initiated by Joshua M. Yelon on Sat Nov 9 13:18:47 CST 1996