Supporting Dynamic Parallel Object Arrays


Submitted to ISCOPE 2001, Stanford, CA.

Orion S. Lawlor

Univ. of Illinois at Urbana-Champaign
1304 W. Springfield, 3315 DCL
Urbana, IL 61801-2987
1 (217) 333-5827

olawlor@acm.org

Laxmikant V. Kalé

Univ. of Illinois at Urbana-Champaign
1304 W. Springfield, 3304 DCL
Urbana, IL 61801-2987
1 (217) 244-0094

kale@cs.uiuc.edu



ABSTRACT

We present efficient support schemes for generalized arrays of parallel data driven objects. The "array elements" are scattered across a parallel machine. Each array element is an object that can be thought of as a virtual processor. The individual elements are addressed by their "index", which can be an arbitrary object rather than a simple integer. For example, it can be a series of numbers, supporting multidimensional sparse arrays; a bit vector, supporting collections of quadtree nodes; or a string. Messages can be sent to any individual array element from any processor, and the elements can participate in reductions and broadcasts. Individual elements can be created or deleted dynamically at any time. More importantly, the elements can migrate from processor to processor at any time. The paper discusses support for message delivery and collective operations in face of such dynamic behavior. The migration capabilities of array elements have proved to be extremely useful, for example, in implementing flexible load balancing strategies and for exploiting workstation clusters adaptively.

INTRODUCTION

Many of today's emerging high-end parallel applications are characterized by irregular and dynamic computational structure. Techniques such a latency hiding and dynamic load balancing are needed to efficiently parallelize these applications. However, incorporating these techniques into a parallel program written using the prevalent processor-centric programming model, exemplified by MPI, requires significant programming effort.

An alternative approach abandons the processor-centric model for an object-centric model. The computational work is divided into a large number of parallel objects. Parallel objects resemble processors in that they are self-contained, and can send and receive messages. Unlike processors, however, they can be created or deleted, scheduled dynamically, and migrated at run-time to improve load balance.

This approach, which we call multi-partition decomposition, separates the task of specifying parallelism from the issues of load balancing, and efficient execution in general. The programmer then specifies which actions are to be computed in parallel, and the system decides when and where these actions execute.

In this paper, we present a parallel construct called a dynamic parallel object array that supports this approach. Individual objects are called array elements, and can send and receive messages, participate in broadcasts and reductions, and migrate as needed. Each element of the array is identified by a unique array index, which may be variable-length. Because elements can be individually scheduled and migrated, an object array is quite distinct from the array objects found in HPF, POOMA[13], P++[14], Global Arrays[12] and elsewhere. In our construct, each element of the array is a relatively coarse-grained1 C++ object, with full support for remote method invocation. We add migration and reductions to pC++[8], to which our work is otherwise quite similar. Unlike Concurrent Aggregates[9], Linda[7], or Orca[10], there is no duplication or replication-- message sends address exactly one array element across the entire machine. This work is complemented by fast collective communication libraries such as [11]; but not dependent on them.

For example, a large dynamic structural simulation modeled using the finite element method may include 10 million elements in an unstructured mesh. Using our method, the application programmer may decide to partition this mesh into 5,000 chunks using a mesh partitioner such as METIS or Chako. Each chunk is then implemented as a data-driven[1] array element, making a 5,000 element object array. The programmer specifies communication between these elements without worrying about which processor they reside on.

This approach, which we have been exploring for the past several years, has several advantages:

Even more important are the advantages that accrue by separating the message target (the array element) from the target's location (the processor). With our approach, the run-time system is free to migrate objects across the parallel machine as it pleases, without affecting the semantics of the user program.

The run-time system can use this freedom to effect measurement-based load balancing, for example. During the computation, it can measure the load presented by each element, along with the element communication patterns. It can then remap the objects so as to minimize load imbalance and communication overheads. Even for dynamic applications, such measurement-based load balancing works effectively when load patterns change either slowly or infrequently.

If the parallel program is using idle desktop workstations, the run-time system can "vacate" the processors when their owners start using them, as described in [2]. Time-shared clusters can also be supported efficiently, by shrinking or expanding jobs to the available set of processors[5] using object migration.

Our research group has been engaged in developing this approach. The measurement based load balancing framework has been described in [2]. This paper presents the object array construct, its efficient implementation with support for routing and forwarding of messages, and its support for collective operations such as broadcasts and reductions.

Our approach is implemented in Charm++[1], a parallel library for C++. However, due to the popularity of MPI, and to allow existing MPI codes to use the load balancing and other facilities of Charm++, we have implemented AMPI [6], an adaptive implementation of MPI. In AMPI, an MPI "processor" is virtualized as a thread running in an array element. The system then simulates multiple MPI processors on each real processor, allowing latency hiding and load balancing. Thus this research is applicable to the wide of class of parallel applications written using MPI as well.

In the next section, we examine another application example to motivate the need for some of the features supported by object arrays. We then present the array construct itself, and its API. This is followed by a description of the algorithms used for message delivery and collective operations, in face of ongoing migrations, insertions, and deletions of array elements. We conclude with performance results.

2.MOTIVATION

2.1 Example

Consider a simple heat flow simulation application which discretizes its domain with an adaptive 2D mesh. The mesh is implemented as a quadtree, as in Figure 1. Using a separate array element for each leaf of the quadtree would likely result in a tiny grain size and poor performance; so each array element is the root of a small subtree of the mesh.

A quadtree admits a natural indexing scheme-- the directions taken at each level of the tree from the root. Concatenating a binary representation of these directions results in a variable-length bit string which uniquely identifies a leaf of the tree. For example, the element labeled "10;11" in Figure 1 can be reached from the outer box by moving to the lower-left (10) subbox, then to that box's lower-right (11) subbox. For addressing messages to these array elements, the variable-length index supported by this work is ideal.

Figure 1. Adaptive 2D quadtree mesh with seven elements, showing element array index and subtrees (in gray).

While running the program, the runtime load balancer[2] will collect the object communication graph shown in Figure 2.

Figure 2. Communication graph for example array.

As the computation proceeds, elements will send each other messages to exchange temperatures with their neighbors. They will perform local calculations to propagate heat around their part of the mesh. In a steady-state problem, elements will occasionally contribute their local error values to a reduction to determine whether the convergence criteria have been satisfied. Once the convergence criteria have been met, the program will broadcast a "report results" message to all elements.

The program may decide to create new array elements to refine an existing region. The program may delete array elements when coarsening a region. During the computation, the Charm++ run-time load balancer will migrate elements to improve the load balance. Clearly, the ongoing messaging, broadcasts, and reductions must continue to work even in the face of these migrations, creations, and deletions.

3.API

Array elements are implemented as ordinary C++ classes defined by the user. An array whose elements are of type A is referenced from other processors via a small "proxy" object of type CProxy_A. The Charm++ system defines the proxy object automatically based on an IDL-like interface file which describes the object's remotely accessible methods-- see details in [4].

A static proxy method is used to create a new array:

CProxy_A ap=CProxy_A::ckNew();


The proxy insert call is then used to add array elements:

ap[7].insert(message);

This creates an array element on some processor; the version insert(message,processor) specifies the initial processor.

The proxy destroy method deletes elements:

ap[7].destroy();

Elements may be created or destroyed at any time.

User-defined2 element methods may be invoked as:

ap[7].foo(message);

Like an ordinary C++ method invocation, this call passes the given data to the given element method. Unlike C++, the target element need not reside on the same processor, or even in the same address space. Of course, the method may be inherited or dynamically dispatched in the usual C++ fashion.

Like Smalltalk, we refer to C++ method invocation as sending an object a message. The interface file determines whether the remote call is synchronous, for ordinary blocking function call semantics; or asynchronous, for message semantics.

As usual, remote message delivery is out-of-order. Charm++ is also nonpreemptive-- messages that arrive on a processor during the execution of another message are queued.

The array broadcast syntax resembles the method syntax, but omits the index:

ap.foo(message);

This call passes the given data to the given method of every array element. An element may also call contribute to pass a value to a reduction; or migrate to move to another processor.

3.1 Indexing

For convenience, the system predefines 1D, 2D, and 3D index types. 2D and 3D types are indexed as:

int x, y, z;

ap2(x,y).foo(message);

ap3(x,y,z).foo(message);

The more appealing square-bracket `[x,y]' syntax cannot be used, because Charm++ inherits C++'s unfortunate comma operator.

By inheriting from a system index type, a program may define custom array index types.

class myIndex : public CkArrayIndex

{

...index data...

public:

myIndex(...) {nInts=2;...}

};

The interpretation of the index data is left to the application, which allows the system to support contiguous 1D, sparse 5D, or tree-structured computations uniformly. Once defined, a user-defined array index type may be used as:

apT[myIndex(...)].foo(message);

4.MESSAGE DELIVERY

A scalable implementation of this API is rather subtle. In particular, the user may create an element 42 on some processor C, then send a message to it from processor A. A must be able to deliver the message despite the fact that A may never have communicated with C. Worse, 42 may migrate to some new processor D while the message is in transit.

Non-scalable location determination schema are easy to imagine3. Processors could be required to broadcast the location of all new or migrated elements. This solution, however, would waste bandwidth and require every processor to keep track of every array element, which may require prohibitive amounts of storage. Alternately, a central registry could maintain the locations of all array elements. This conserves bandwidth, but still may have enormous non-distributed storage requirements and also presents a serial bottleneck. Our solution conserves bandwidth, has modest storage requirements, and is well distributed.

4.1 Scalable Location Determination

To solve the location problem scalably, the system can map any array index to a home, a processor that always knows where the corresponding element can be reached. The default index to home function simply returns the hashed array index modulo the number of processors; but user-defined functions are also supported. An element need not reside at its home processor, but must keep its home informed of its current location. In the example above, A will map the index 42 to its home processor B, which will know that 42 is currently living on processor C.

Thus, A sends its message to 42's home B, who then forwards the message to C. Since this forwarding is inefficient, C sends a (short circuit) routing update back to A, advising it to send future messages for 42 directly to C.

Figure 3. Message forwarding among processors:
A, the source; B, the home; and C, the destination

The forward-free alternative-- A asks B where to send, B replies, A sends directly to C-- may use less total bandwidth for large messages, but requires an additional hop in the critical path. Forwarding also generalizes more smoothly to the migration case. With either approach, the common case of repeated communication quickly settles to 1 hop-- that is, zero added communication overhead.

Since elements and homes are scattered across the machine, most forwarded messages must cross the machine twice, wasting cross-section bandwidth.

A simple generalization of this scheme is to use k separate mappings to assign k homes to each element. Several homes allow messages to be forwarded via the nearest home, saving bandwidth, but also requires elements to inform k processors when they are created or moved. With k=p, every processor knows the location of every element, eliminating forwarding; but creations, migrations, and deletions all require a broadcast. The best value of k depends on the relative frequency of message forwarding and creations, migrations, and deletions.

4.2 Creation

To create an element, the system need only inform the element's home and call the element constructor. If no processor is explicitly specified, the element is created at its home processor, which eliminates later message forwarding. It is an error to attempt to create two elements at the same index.

Infrequently, a message may arrive for an element before the element has been created. The system buffers these early messages until the element is finally created. Sending messages to elements which are never created is an error.

4.3 Deletion

To delete an element, the system invokes the object's destructor and informs the element's home processor. No other processors are informed. Any routing cache entries on other processors will remain unused until they eventually expire and are deleted.

Alternative, more complex methods to reclaim deleted element routing cache entries could be used. When deleting an element, a processor could broadcast a funeral notice. Elements could keep track of which processors may have cached their location and send a funeral notice to each. A systolic body wagon could propagate through the system at a low priority. These alternatives all use network bandwidth; while simple expiration can be completely local and quite efficient.

It is an error to send messages to deleted objects. However, a new element is allowed to reuse an array index vacated by a deleted element.

4.4 Migration

Migration is always under user control-- either explicitly, via a "migrate" call; or implicitly, by enabling run-time load balancing. To migrate an element, the system stops the object, packs it into a message, and sends it to its new location. Once the element arrives it is unpacked and the element's home processor is informed of the element's new location.

A message that arrives for a departed element is forwarded to its last known location, with the usual short circuit routing update once it arrives. If an element migrates rapidly and repeatedly, messages may be forwarded an arbitrary number of times (see Figure 4). Of course, migration is normally infrequent, so this pathological case is rare.

Processors which may have cached a migrator's old location are not informed of the migration. Any stale routing cache entries will be updated upon the next message sent. This lazy update prevents unnecessary traffic and keeps migrations fast. The alternative, to actively inform all others of your current location, saves time on the first message at the cost of significantly more expensive migration.

Of course, for the common case of repeated communication with stationary elements, the system quickly settles to 1 hop.

Figure 4. Delivery may require several hops
during an element migration.

4.5 Protocol Diagram

Each processor must keep track of each array's local elements, the locations of its home elements, and maintain a routing cache of "last-seen" locations. All this information can efficiently be kept in a per-processor, per-array hashtable, keyed by the array index.

To deliver a message addressed to an array index, the system looks the index up in its hashtable. The represented element will be in one of these states:

The element state can change according to the transitions of the finite state machine of Figure 5.

Figure 5. Finite state machine for element information

5.COLLECTIVE OPERATIONS

In addition to communicating point-to-point, array elements often need to participate in global operations such as broadcasts and reductions.

5.1 Broadcasts

The semantics of a broadcast are that every existing array element will receive each broadcast message exactly once. Since processors have no shared clock, "existing" means created but not destroyed at the instant the broadcast is received on that processor. Array broadcasts are thus first sent to each processor, then delivered to each processor's current local elements. However, this is not enough if there are ongoing migrations.

For example, consider the case where a migrating element leaves processor A before the broadcast is delivered, and arrives on processor B where the broadcast has already been delivered. The migrator may miss the broadcast. Or, reversing the situation, an element may receive a broadcast on processor C, then migrate to D where the broadcast has not yet arrived. When the broadcast reaches D, the migrator may erroneously receive the broadcast again (Figure 6).

Figure 6. Broadcast delivery problems. Processors A and D have not received the broadcast; processors B and C have.

To solve these problems, the broadcasts are serialized, and processors and elements each maintain a broadcast count. When an element is created, it takes the local processor's broadcast count.

To prevent duplicate delivery, when a broadcast arrives the system compares its count with each element's broadcast count. The system delivers the broadcast only if the count indicates the element has not yet received that broadcast.

To prevent missed broadcasts, the system maintains a buffer of past broadcasts. When an element arrives from migration, the system again compares its broadcast count with the element's. If the element missed any broadcasts while migrating, the element count will be too low, and the element is brought up to date from the broadcast buffer. The broadcast buffer is periodically garbage collected on each processor, removing broadcasts older than any plausible migration time.

5.2 Reductions

A reduction combines many values scattered across a parallel machine into a single value. A reduction function defines what "value" means and performs the combination. The semantics of a reduction are that each existing element will contribute exactly one value, and the reduction function will be applied to these values in an unspecified order. As before, "existing" means created but not deleted at the time the local reduction completes. Of course, other work may proceed during the reduction.

Reductions can be implemented efficiently by first reducing the values within each processor (the local reduction), then reducing these values across processors. As with broadcasts, in the presence of migration this simple algorithm is not enough.

Figure 7. Timeline: reduction skips a migrator. We must ensure the migrator's contribution is included.

The problem is that during the time a migrating element is in transit, it belongs to no processor6. That is, the source processor cannot wait for the migrator's contribution because it already left; while the destination processor cannot know it is on the way (Figure 7). Thus the source and destination processors might both complete their local reductions, missing the migrator. However, the reduction must wait until all elements, even migrators, have contributed.

One sensible solution is to count the number of contributed values as the reduction data is collected, and not allow the reduction to complete until the number of values matches the number of elements. Unfortunately, the total number of elements is not available on any processor; and a simple sum of the local element counts will still miss migrating elements.

The approach we use is to sum the net births-- the total number of elements created on a processor minus the total destroyed on that processor. Because of migration, this number may be negative if elements migrate in and are destroyed (e.g., on "graveyard" processors).

Since for each processor i,

Thus

Summed across all processors, then, the net births gives the total number of elements created but not yet deleted, which is the global element count.

Thus the reduction algorithm actually used is:

6.PERFORMANCE

We have extensively analyzed the performance of the array support, as summarized below.

6.1 Theoretical

Notation:

p the number of processors on the parallel machine

n the total number of array elements

li the number of local array elements on processor i

ri the number of remote elements recently referenced by processor i

hi the number of elements with processor i as their home

Element creation and deletion, since they only involve the current processor and the element's home, require O(1) time and 1 message. Migration requires O(1) time and 2 messages7. Message delivery may require an unbounded number of messages, but only if the element migrates as fast as the message travels. Repeated messages to stationary elements take O(1) time and 1 message.

The local, element-wise operations during reductions and broadcasts require time in O(li) on processor i. The cross-processor phase of a broadcast or reduction tree requires p-1 messages and completes in élogb pù hops, with b the tree branching factor (typically 2 to 16).

The storage consumed by the element hashtable on processor i is O(li+ri+hi). If each element communicates with a bounded number of other elements, ri is in O(li). If elements and home processors are distributed relatively uniformly, li and hi will both be near n/p. Subject to these assumptions, each processor's hashtable requires storage in O(n/p). In the worst case, li, ri and hi are all bounded by n, so the storage is still in O(n).

6.2 Single-Processor

The system was implemented on Charm++ [1], which also includes non-indexable, non-migratable parallel objects called chares. Table 1 compares the single-processor software overhead for preparing, scheduling, and receiving a message using these non-migratable objects and the array elements described in this paper.

Table 1. Comparison of software overhead with non-arrays

Type

Linux PC8

IBM SP39

Cray T3E10

Chares

0.92 us

1.62 us

2.03 us

Array Elements

1.85 us

4.33 us

9.64 us11

The migration layer adds a few microseconds of overhead to each message. For grain sizes over a few hundred microseconds, array elements add negligible overhead.

6.3 Multiple-Processor

Below, we plot the total time taken for various array operations for varying numbers of processors. In these plots, "Bcast/Redn" means a broadcast operation reaching every array element followed by a reduction across all array elements. "Migration" means the time for an array element to be packed, shipped across the network, unpacked, and the home processor informed. "Message" means the time to send a short message from one array element to another across processors.

The array elements are distributed in 1D with 16 elements per processor, scaling up with processors. The operations run on every element across the machine simultaneously, and are repeated 1,000 times to factor out startup overhead and include any induced non-critical-path load. For migration and messaging, the time reported is the total time divided by 16,000 (giving time per element-message); for broadcast/reduction the time reported is the total time divided by 1000 (giving total time per broadcast/reduction cycle). The first data point is with two processors, so migration is meaningful.

Figure 8. Time per array operation for IBM SP3

Figure 9. Time per array operation for Cray T3E

The system is indeed highly scalable. Theoretically, we expect the broadcast/reduction time curve to be logarithmic in the number of processors; and the messaging/migration curves to be flat. The system meets both theoretical expectations. The slow message time increase with processors for the T3E is due to the limitations of its machine architecture.


7.APPLICATION

In an application, an array element may:

7.1 Programs

The array support has been used by a number of highly scalable Charm++ libraries and programs.

CONCLUSIONS

We have presented efficient support for general arrays of communicating objects. The array index is a user-defined structure, supporting multidimensional, sparse arrays as well as structures such as trees. Objects may be efficiently created, deleted, or migrated at any time; and even in the face of these operations, the system supports array-wide broadcasts and reductions efficiently.

This system has proved a robust and useful foundation for several significant applications. Future work on this system will include: further optimization of the implementation; implementing the k-homes approach described in section 3.1; implementing a more aggressive broadcast deletion algorithm; and implementing the parallel object garbage collector of [15].

REFERENCES

  1. L.V. Kalé and S. Krishnan. "Charm++: Parallel Programming with Message-Driven Objects", Gregory V. Wilson and Paul Lu, editors, Parallel Programming using C++, pages 175-213. MIT Press, 1996

  2. R. Brunner and L.V. Kalé. "Adapting to Load on Workstation Clusters", Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computation, pages 106-112. IEEE Computer Society Press, 1999

  3. S. Krishnan and L.V. Kalé. "A parallel array abstraction for data-driven objects", Proc. Parallel Object-Oriented Methods and Applications Conference, February 1996

  4. L.V. Kalé and others. Charm++ Programmer's Manual, http://charm.cs.uiuc.edu/, 2000

  5. L. V. Kalé, S. Kumar, J. DeSouza. An Adaptive Job Scheduler for Timeshared Parallel Machines, PPL Technical Report 00-02, University of Illinois at Urbana-Champaign, Sept. 2000

  6. M. Bhandarkar, L.V. Kalé, E. Sturler, J. Hoeflinger. Object-Based Adaptive Load Balancing for MPI Programs, PPL Technical report 00-03, University of Illinois at Urbana-Champaign, Sept. 2000

  7. S. Ahuja, N. Carriero, D. Gerlenter. "Linda and Friends", IEEE Computer, pages 26-34, August 1986

  8. F. Bodin, P. Beckman, D. Gannon, and others. "Distributed pC++: Basic Ideas for an Object Parallel Language", Scientific Programming, Volume 2/Number 3 Fall 1993

  9. A. Chien and W. Dally. "Concurrent Aggregates" , Proceedings of the Second ACM SIGPLAN, March 1990, Seattle, WA

  10. H. Bal, R. Bhoedjang, R. Hofman, and others. Orca: a Portable User-Level Shared Object System, Technical Report IR-408, Vrije Universiteit, Amsterdam, June 1996

  11. M. Barnett, S. Gupta, D. Payne, L. Shuler, R. van de Geijn and J. Watts. "Interprocessor Collective Communication Library," Supercomputing 1994, Nov. 1994

  12. J. Nieplocha, RJ Harrison, and RJ Littlefield. "Global Arrays: A nonuniform memory access programming model for high-performance computers", The Journal of Supercomputing, 10:197-220, 1996

  13. S. Atlas, S. Banerjee, J. C. Cummings, and others (presented by J. Reynders). "POOMA: A high performance distributed simulation environment for scientific applications," Supercomputing 1995, Nov. 1995

  14. M. Lemke, D. Quinlan. "P++, a Parallel C++ Array Class Library for Architecture-Independent Development of Structured Grid Applications", ACM SIGPLAN Workshop 1992. pp 21-23

  15. J. Piquer. "Indirect distributed garbage collection: handling object migration", ACM Trans. Program. Lang. Syst. 18, 5, pp 615 - 647, Sep. 1996

1The target grain size is from hundreds of microseconds to a few dozen milliseconds of work.

2Unlike system names, user-defined names are displayed here in italic type.

3And frequently implemented in real code!

4As calculated by mapping the array index in the usual way.

5This is a rare and short-lived state, but needed because messages may arrive out of order.

6A non-blocking control handoff without an in-between period is impossible-- it is an n-way handshake problem.

7One message transports the element, one updates the home processor's routing table.

8400 MHz AMD K6-3, Linux 2.4.0t10, egcs-2.91.66 -O3

9375 MHz IBM Power3, AIX 4.3.3, VisualAge C++ 5 -O

10450 MHz DEC Alpha, UNICOS 2.0.5.44, Cray C++ 3.3.0 -O

11The Cray C++ compiler does not support templated member functions, so this version is implemented using function pointers, which cannot be inlined and are significantly slower.