At its core, the Charm++ interface model is a sequential extension of the C++ interface model that allows invoking methods on remote objects. A chare is essentially a sequential component that may reside on a remote processor. Note that a chare may be implemented using more than one sequential C++ object, but this is hidden from the user of the chare. The chare interface defines the access point for the chare and not for its constituent sequential objects. A parallel component would typically consist of a number of such sequential components. A chare array presents an appropriate abstraction for implementing a parallel component, since it provides a level of encapsulation over a collection of sequential components. However, the interface to a chare array is presented in terms of the interfaces to its constituent chares. Therefore, communication between such components would take place by components invoking methods on individual chares in other components, thus making them dependent on each other's internal parallel structures. One could hide the internal structure of components by providing wrapper objects for interfaces, as described next.
Solution 1 [Sequential Wrapper] : A straightforward extension of object-based interface models, such as the Charm++ interface description, to parallel components is to provide a sequential component wrapper for the parallel component, in which the functionality of the parallel component is presented as a sequential method invocation (Figure 3.8). This imposes a serialization bottleneck on the component. For example, a parallel CSE application that interfaces with a linear system solver will have to serialize its data structures before invoking the solver. This is clearly not feasible for most large linear system representations, which need gigabytes of memory.
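The bottleneck can be made concrete with a minimal sketch in plain C++ (not Charm++ code). The names Piece, SequentialSolverWrapper, and gatherAll are hypothetical, and the "solver" is only a stand-in that sums its input. The point is that every distributed piece must be copied into one contiguous buffer before the single sequential call can be made.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical: one locally owned piece of a distributed vector.
struct Piece {
    std::vector<double> values;
};

// Hypothetical sequential wrapper: the whole parallel component is hidden
// behind a single ordinary method call.
class SequentialSolverWrapper {
public:
    // Stand-in for a real solve: just sums the entries it is given.
    double solve(const std::vector<double>& all) const {
        return std::accumulate(all.begin(), all.end(), 0.0);
    }
};

// The caller must serialize every piece into one buffer before the call.
// That buffer is as large as the entire data set.
std::vector<double> gatherAll(const std::vector<Piece>& pieces) {
    std::size_t total = 0;
    for (const Piece& p : pieces) total += p.values.size();

    std::vector<double> all;
    all.reserve(total);
    for (const Piece& p : pieces) {
        all.insert(all.end(), p.values.begin(), p.values.end());
    }
    assert(all.size() == total);
    return all;
}
```

In this sketch the memory needed by gatherAll grows with the total problem size rather than with the size of a single piece, which is the cost the text describes.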
Solution 2 [Isomorphic Wrapper] : Another extension of the object-based interface models is to treat each parallel component as a collection of sequential components. In this model, the interaction between two parallel components takes place by having the corresponding sequential components invoke methods on each other (Figure 3.9). Thus, the interaction between components is defined in terms of interactions between sub-components of each component. While this model removes the serialization bottleneck, it imposes rigid restrictions on the structure of parallel components. For example, a parallel finite element solver will have to partition its mesh boundary into the same number of pieces as the neighboring block-structured CFD solver, while making sure that the corresponding pieces contain adjacent nodes.
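The rigidity of the isomorphic wrapper can be illustrated with a small plain C++ sketch (not Charm++ code). SolidPiece, FluidPiece, and exchangeIsomorphic are hypothetical names. Piece i of one component communicates only with piece i of the other, so the exchange is possible only when both decompositions have the same number of pieces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical boundary piece of a structural (finite element) solver.
struct SolidPiece {
    double displacement = 0.0;
};

// Hypothetical boundary piece of a fluid (CFD) solver.
struct FluidPiece {
    double received = 0.0;
    void acceptDisplacement(double d) { received += d; }
};

// Piece i of one component talks only to piece i of the other, so the two
// decompositions must have identical sizes. Returns false when they do not.
bool exchangeIsomorphic(const std::vector<SolidPiece>& solid,
                        std::vector<FluidPiece>& fluid) {
    if (solid.size() != fluid.size()) {
        return false;  // the rigid restriction: piece counts must match
    }
    for (std::size_t i = 0; i < solid.size(); ++i) {
        fluid[i].acceptDisplacement(solid[i].displacement);
    }
    return true;
}
```

The sketch checks only the piece count. The further requirement stated above, that corresponding pieces contain adjacent nodes, would constrain the partitioning itself and is not modeled here.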
Solution 3 [Processor-based Parallel Wrapper] : In order to avoid the serialization bottleneck in data and control transfer among components, as in Figure 3.8, while not imposing rigidity on component interaction, as in Figure 3.9, one can provide parallel wrappers around components as shown in Figure 3.10. The parallel wrapper consists of a group of objects (e.g., Pa0 and Pa1 in Figure 3.10), which are mapped to processors so that there is exactly one object per wrapper per processor. Objects belonging to a component always communicate with their local representative object of the parallel wrapper. They wait for data to be delivered to them by the local wrapper object and, after computing, deliver the results to the local wrapper object as well. Communication between components is thus mediated by the parallel wrapper objects of those components. The local parallel wrapper objects of two interacting components are bound together. If each component contains such parallel wrapper objects, the components themselves do not need to know about the connection topology of the peer components.
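The role of a local representative can be sketched in plain C++ as follows. This is an illustration under simplifying assumptions, not Charm++ code: WrapperRep is a hypothetical class, a single address space stands in for one processor, and binding is a raw pointer to the peer representative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical local representative of a parallel wrapper: one such object
// per component per (simulated) processor.
class WrapperRep {
public:
    // Bind this representative to the peer component's representative on
    // the same processor.
    void bind(WrapperRep* peer) { peer_ = peer; }

    // Called by local objects of the owning component to hand over data.
    void deposit(double value) { outgoing_.push_back(value); }

    // Forward everything deposited so far to the bound peer.
    void flush() {
        assert(peer_ != nullptr);
        for (double v : outgoing_) {
            peer_->incoming_.push_back(v);
        }
        outgoing_.clear();
    }

    // Data delivered by the peer, waiting for local objects to pick it up.
    const std::vector<double>& delivered() const { return incoming_; }

private:
    WrapperRep* peer_ = nullptr;
    std::vector<double> outgoing_;
    std::vector<double> incoming_;
};
```

Objects in either component see only deposit and delivered on their own local representative, so the number and placement of objects on the other side never appear in their code.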
We have implemented this scheme of component interaction in NAMD (section 2.6) to mediate the interaction between the short-range electrostatics module, based on Charm++, and the long-range electrostatics module, DPMTA. The short-range electrostatics module is implemented as a dynamically load-balanced chare array in Charm++, where each array element represents a cubical portion of space (called a Patch) containing atoms. DPMTA (written in PVM) uses a tree structure and partitions its computations into pieces mapped one-to-one onto processors. Since the two components use different partitioning schemes, atoms need to be re-partitioned according to their positions in space each time control transfers between these electrostatics modules. A parallel wrapper is implemented for the short-range electrostatics module using chare groups in Charm++. The chare array elements (patches) of the short-range electrostatics module deposit their atoms' positions with the local representative of the wrapper chare group (called PatchManager), which combines data from local patches and delivers it to the re-partitioning code in DPMTA via a library function call. When the parallel wrapper receives the results of the long-range electrostatics computations from DPMTA, it re-partitions the received atoms and delivers them back to the local patches. While this method eliminates the serialization bottleneck, it results in a lot of ``glue'' code in the form of parallel wrappers. Also, since this method of component interaction is still based on method calls (between wrapper objects), the data exchange and control transfer between components are hard-wired within the components. As a result, this method does not provide control points for flexible application composition or for effective resource management by the runtime system.
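The kind of glue code involved can be suggested by a small plain C++ sketch. It is not taken from NAMD or DPMTA: the Atom record, the functions combine and repartition, and the one-dimensional layout in which patch i owns the interval [i * width, (i + 1) * width) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical atom record: a 1-D position plus a computed force.
struct Atom {
    double x;
    double force;
};

// Combine the atoms deposited by every local patch into one buffer, as a
// local wrapper representative would before handing data to another module.
std::vector<Atom> combine(const std::vector<std::vector<Atom>>& patches) {
    std::vector<Atom> all;
    for (const std::vector<Atom>& p : patches) {
        all.insert(all.end(), p.begin(), p.end());
    }
    return all;
}

// Route each returned atom to the patch that owns its position.
// Assumption: patch i covers the interval [i * width, (i + 1) * width).
std::vector<std::vector<Atom>> repartition(const std::vector<Atom>& atoms,
                                           double width,
                                           std::size_t numPatches) {
    assert(width > 0.0);
    std::vector<std::vector<Atom>> patches(numPatches);
    for (const Atom& a : atoms) {
        assert(a.x >= 0.0);
        std::size_t idx = static_cast<std::size_t>(std::floor(a.x / width));
        assert(idx < numPatches);
        patches[idx].push_back(a);
    }
    return patches;
}
```

Even in this reduced form, both functions encode knowledge of how the two sides partition their data, which is why such wrappers tend to accumulate component-specific code.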
Lack of a control point at data exchange leads to reduced reusability of components. For example, suppose a physical system simulation component interacts with a sparse linear system solver component, and the data exchange between them is modeled as message sends or as parameters to a method call. In that case, the simulation component needs to transform its matrices to the storage format accepted by the solver prior to calling the solver methods. This transformation code is part of the simulation component. Suppose a better solver becomes available, but it uses a different storage format for sparse matrices. The simulation component code then needs to be changed to transform its matrices to the new format required by the solver. If the interface model provided a control point at data exchange, one could use the simulation component without change, while inserting a transformer component between the simulation and the new solver.
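As an illustration of such a transformer, the following plain C++ sketch converts a matrix from coordinate (COO) triplets to compressed sparse row (CSR) form. The choice of these two formats is an assumption made for the example, since the text does not name the formats involved, and CooMatrix, CsrMatrix, and cooToCsr are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Format produced by the (hypothetical) simulation: coordinate triplets.
struct CooMatrix {
    std::size_t rows = 0;
    std::vector<std::size_t> rowIdx;
    std::vector<std::size_t> colIdx;
    std::vector<double> val;
};

// Format expected by the (hypothetical) new solver: compressed sparse row.
struct CsrMatrix {
    std::vector<std::size_t> rowPtr;  // size rows + 1
    std::vector<std::size_t> colIdx;
    std::vector<double> val;
};

// Transformer component: sits between producer and consumer, so neither
// needs to know the other's storage format.
CsrMatrix cooToCsr(const CooMatrix& a) {
    assert(a.rowIdx.size() == a.val.size());
    assert(a.colIdx.size() == a.val.size());

    CsrMatrix c;
    c.rowPtr.assign(a.rows + 1, 0);
    // Count the entries in each row, then prefix-sum into row offsets.
    for (std::size_t r : a.rowIdx) {
        assert(r < a.rows);
        ++c.rowPtr[r + 1];
    }
    for (std::size_t r = 0; r < a.rows; ++r) {
        c.rowPtr[r + 1] += c.rowPtr[r];
    }

    // Scatter each triplet into the next free slot of its row.
    c.colIdx.resize(a.val.size());
    c.val.resize(a.val.size());
    std::vector<std::size_t> next(c.rowPtr.begin(), c.rowPtr.end() - 1);
    for (std::size_t k = 0; k < a.val.size(); ++k) {
        std::size_t dst = next[a.rowIdx[k]]++;
        c.colIdx[dst] = a.colIdx[k];
        c.val[dst] = a.val[k];
    }
    return c;
}
```

With a control point at data exchange, a function like cooToCsr could be connected between the two components at composition time instead of being compiled into the simulation.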
Lack of a control point for the runtime system at control transfer prevents the runtime system from utilizing resources effectively. For example, with blocking method invocation semantics for control transfer, the runtime system cannot schedule other useful computations belonging to a parallel component while it is waiting for results from remote method invocations. Asynchronous remote method invocation provides a control point for the runtime system at control transfer. It allows the runtime system to be flexible in scheduling other computations to maximize resource utilization. However, when we extend functional interface representations to use asynchronous remote method invocations, the resulting components have to supply continuations explicitly to their connected components. These are referred to as ``compositional callbacks''.
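The difference from a blocking call can be sketched in plain C++ with a toy scheduler. This is not the Charm++ runtime: Scheduler and asyncSquare are hypothetical, and a FIFO queue of std::function objects stands in for message-driven scheduling. The caller passes an explicit continuation and returns immediately, which leaves the scheduler free to run other pending work before the result is delivered.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical message-driven scheduler: a FIFO queue of pending invocations.
class Scheduler {
public:
    void enqueue(std::function<void()> task) { tasks_.push(std::move(task)); }

    // Run queued invocations until no work remains.
    void run() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            task();
        }
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Asynchronous invocation: returns immediately. The result is delivered
// later, as a separate scheduled invocation of the supplied continuation.
void asyncSquare(Scheduler& s, int x, std::function<void(int)> continuation) {
    assert(continuation);  // the caller must say where the result goes
    s.enqueue([&s, x, continuation]() {
        int result = x * x;
        s.enqueue([continuation, result]() { continuation(result); });
    });
}
```

Because the reply is itself a queued invocation, any work enqueued in the meantime runs before the continuation, which is the scheduling freedom that a blocking call would remove.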
Solution 4 [Compositional Callbacks] : When a component (caller) invokes services from another component (callee) using asynchronous remote method invocation, it has to supply the callee with its own unique ID, and the callee has to know which method of the caller to call to deposit the results. This is illustrated with a simple client-server transaction in Figure 3.11. Note that the client has to know the server's interface (in particular the name of the method service). Also, the server has to know the client's interface. In addition, both have to agree upon and hard-code the types of data they deposit or accept.
void Client::invokeService() {
    ServiceMessage *m = new ServiceMessage();
    // ...
    m->myID = thishandle;      // caller's ID, so that the server can reply
    ProxyServer ps(serverID);
    ps.service(m);             // asynchronous remote method invocation
}
void Server::service(ServiceMessage *m) {
    // ... perform service
    ResultMessage *rm = new ResultMessage();
    // ... construct proxy to the client
    ProxyClient pc(m->myID);
    pc.deposit(rm);            // deliver the results back to the caller
}
void Client::deposit(ResultMessage *m) {
    // ...
}
The mechanism of component interaction used in Figure 3.11 is referred to as the ``compositional callback'' mechanism. It is useful in developing an application with pre-written server libraries. In that setting, the server does not need to know the full interface of each client; instead, the client must be a subclass of a generic client class defined by the server. The compositional callback mechanism is equivalent to building an object communication graph (object network) at run-time. Such a dynamic object network misses out on certain optimizations that can be performed on a static object network [77]. For example, if the runtime system were involved in establishing connections between communicating objects, it could place these objects closer together (typically on the same processor).
Another problem associated with the callback mechanism is that it leads to a proliferation of interfaces, increasing programming complexity. For example, suppose a class called Compute needs to perform asynchronous reductions using a system component called ReductionManager and also participates in a gather-scatter collective operation using a system component called GatherScatter. It will act as a client of these system services. For ReductionManager and GatherScatter to recognize Compute as their client, the Compute class will have to implement separate interfaces that are recognized by ReductionManager and GatherScatter respectively. This is shown in Figure 3.12. Thus, each service that a component provides results in two interfaces: one for the service, and another for the client of that service. If a component uses multiple services, it will have to implement the client interfaces for all of those services. In addition to the proliferation of interfaces, this model makes it difficult to have several concurrent invocations of the same service. For example, in Figure 3.12, if the class Compute needs to use the reduction service at different places within the code, it needs to explicitly encode the continuation in its state before it invokes the reduction service. When the reduction results arrive via the reductionResults method, it has to explicitly deliver the results to the stored continuation.
class ReductionClient {
public:
    virtual void reductionResults(ReductionData *msg) = 0;
};
class GatherScatterClient {
public:
    virtual void gsResults(GSData *msg) = 0;
};
class Compute : public ReductionClient, public GatherScatterClient
{
public:
    // ....
    void reductionResults(ReductionData *msg) { ... }
    void gsResults(GSData *msg) { ... }
};
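The bookkeeping needed to use the reduction service from several places can be sketched in plain C++ as follows. This is an illustration rather than Charm++ code: the ReductionManager below is a hypothetical stand-in that queues requests and delivers them in order, reduceThen is a hypothetical helper, and the sketch assumes that results arrive in the order in which they were requested.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

struct ReductionData {
    double value;
};

class ReductionClient {
public:
    virtual ~ReductionClient() = default;
    virtual void reductionResults(ReductionData* msg) = 0;
};

// Hypothetical stand-in for the reduction service: it queues requests and
// later delivers each result through the single client callback.
class ReductionManager {
public:
    void request(ReductionClient* client, double value) {
        pending_.push({client, value});
    }
    void deliverAll() {
        while (!pending_.empty()) {
            std::pair<ReductionClient*, double> p = pending_.front();
            pending_.pop();
            ReductionData d{p.second};
            p.first->reductionResults(&d);
        }
    }

private:
    std::queue<std::pair<ReductionClient*, double>> pending_;
};

class Compute : public ReductionClient {
public:
    // Store the continuation in the object's state before invoking the
    // service, since every result comes back through reductionResults.
    void reduceThen(ReductionManager& rm, double v,
                    std::function<void(double)> continuation) {
        continuations_.push(std::move(continuation));
        rm.request(this, v);
    }

    // Single entry point for all results: hand each one to the oldest
    // stored continuation (assumes in-order delivery).
    void reductionResults(ReductionData* msg) override {
        assert(!continuations_.empty());
        std::function<void(double)> k = std::move(continuations_.front());
        continuations_.pop();
        k(msg->value);
    }

private:
    std::queue<std::function<void(double)>> continuations_;
};
```

If results could arrive out of order, the queue would have to be replaced by a table keyed on a request tag, which adds further state that the component author must manage by hand.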