Automating Runtime Optimizations for Load Balancing in Irregular Problems

In order to reduce the effort required for attaining good performance for parallel programs, it is necessary to use automated performance optimizing techniques. In this paper we describe run-time optimizations for load balancing, and techniques to automate them without programmer intervention using post-mortem analysis of parallel program execution. These techniques are very useful for applications having irregular, non-uniform or dynamic load patterns. We classify the characteristics of parallel programs with respect to object placement (which determines load balance), then describe techniques to discover these characteristics by post-mortem analysis, and present heuristics to choose appropriate load balancing schemes based on these characteristics. Our ideas have been developed in the framework of the Paradise post-mortem analysis tool for the parallel object-oriented language Charm++. We also present results for optimizing simple parallel programs running on the Thinking Machines CM-5.
Parallel program development cost has been recognized as one of the bottlenecks preventing more widespread use of parallel computers. The increased cost of parallel software is due to several new issues which have to be tackled before parallel programs can attain the peak performance that parallel computers provide. The goals of high performance and better programmability are difficult to achieve simultaneously. The challenge is to attain good performance at low programming cost.
Parallel programming skills are also not widespread among application programmers, and moreover, the programmer may not know enough about the characteristics of the application and implementation without investing considerable effort. E.g. the choice of a load balancing scheme for an application requires knowledge of the application's load characteristics. Thus it is difficult to anticipate and respond to all potential performance problems during the design of the first prototype of a parallel program. The first prototype is hence likely to have several serious performance flaws such as load imbalance, leading to active research for identifying performance problems and solving them using optimization techniques. These steps lead to a significant portion of the programming cost in the parallel program development cycle. Since most parallel programmers are not skilled in identifying and solving performance problems, expert parallel programming knowledge must be embodied in tools which are available to the parallel programmer.
In particular, using expert knowledge either in the compiler or the run-time system to automating program optimizations can significantly help to reduce the parallel software development effort. When most performance problems are solved in this manner, there will be fewer iterations of the development cycle. Even in cases where completely automatic techniques are not possible, it is beneficial to automate the optimization steps to the extent possible. This paper is concerned with precise post-mortem analysis which can be used for automating run-time optimizations for load balancing, especially for irregular, non-uniform or dynamic problems.
This work is part of a larger project to develop a framework for automatic runtime optimizations [1,2].
The need for runtime optimizations arises because compiler transformations alone are not sufficient to optimize all parallel programs. In particular, compiler optimizations cannot take into account unpredictable run-time execution environments and unknown application characteristics. Static compiler analysis is also restricted by the difficulties of precise dependence and type analysis in the presence of pointers, and by separate compilation of program modules. Finally, for library based systems such as PVM and MPI, runtime optimization is the only means to get better performance.
Run-time optimizations such as for load balancing need information about characteristics of the parallel program in order to enable optimization mechanisms and guide strategies for selecting them. Since compilers cannot statically deduce the required information in many cases, programmers often manually analyze the execution of a program in order to discover program characteristics and optimizations (Figure 1a). Automatic post-mortem analysis may thus reduce the effort required for performance optimization.
Figure 1: Parallel program development cycles (reproduced
from [2]).
Most parallel programs needing optimizations such as for load balancing fall into one of two categories: the first consists of those for which good optimization strategies can be derived from well known heuristics; for such programs an automated expert system for guiding optimization can potentially achieve good results. The second category consists of programs which need new algorithms and techniques for optimizing them; for such programs the intervention of a human programmer is obviously needed. However, experience with parallel applications has shown that many of them have well-recognized performance problems for which optimizations can be found using a few good heuristics.
This paper presents current results from an ongoing project to develop a framework for automating run-time optimizations (Figure 1b). The framework involves the expert post-mortem analyzer PARADISE (PARallel programming ADvISEr) which analyses traces of program execution, finds program characteristics and suggests optimization hints. Paradise works in close cooperation with a run-time system which uses the hints to parameterize optimizations and select between alternate optimization strategies. Paradise builds a representation of the program's execution from traces, determines characteristics of the program, uses heuristics to find optimizations that will solve the performance problems, and generates concise hints which are communicated to runtime libraries by a ``hints file''.
Automatic post-mortem analysis has been used in performance analysis tools for parallel programs, to give the user insights into performance problems such as load imbalance using expert analysis [3,4,5,6]. Our framework aims to go a step further in the direction of automation: Paradise not only finds performance problems, but also solutions in terms of optimizations for the problem areas, and in co-operation with the run-time libraries, incorporates the optimizations in the program without programmer intervention. Some compilers for data-parallel languages such as HPF [7] use profile information to accurately find the cost of various computation and communication operations. In [8], profile information is used to manually calculate weights to be used in weighted graph decomposition, for irregular data-parallel applications. To the best of our knowledge, our framework is one of the first efforts towards using post-mortem analysis for automating selection of load balancing strategies, and incorporating them into the program without programmer intervention. Also, our optimizations apply to dynamic, non data-parallel and irregular applications as well as regular ones.
Another unique aspect of our work is that it is in the context of a parallel object-oriented language which allows the run-time system the flexibility to choose strategies for placement (mapping) of computations. Thus there are significantly greater opportunities and challenges for automatic load balancing strategies. In contrast, message-passing layers such as PVM or MPI require the programmer to explicitly specify the placement of computations, and also do not provide facilities for dynamic creation of tasks, thus restricting the extent of automatic load balancing. Again, data-parallel languages such as HPF present a much simpler regular computational model for which optimizations are easier to perform, as compared to a parallel object-oriented model which involves dynamic creation of tasks and asynchronous communication.
For completeness, we include a brief description of the programming model and its post-mortem representation, taken from [2].
This work is based on the parallel object-oriented language Charm++, [9,10] which is an extension of C++. The basic unit of work in Charm++ is a chare, which is a medium-grained concurrent C++ object. Chares are dynamically created; there may be thousands of chares per processor. A chare type is a C++ class containing data and functions which may be triggered by the arrival of messages. Functions inside chares are atomic (they may not be pre-empted).
Chares communicate by sending messages to functions (invoking methods) in other chares asynchronously. An essential feature of the Charm++ parallel programming model is asynchronous message driven execution, which helps latency tolerance by overlapping communication and computation. All calls to the run-time are non-blocking, and there are no ``receive'' calls. Remote accesses are performed in a split-phase manner.
Each processor has multiple chares, and a pool of messages targeted to methods in the chares. The scheduler picks messages from the pool one by one, and ``processes'' them by invoking their target methods in the proper chare objects.
Charm++ also provides multidimensional parallel arrays of objects which are distributed over processors using a user-specified mapping function. Array elements may communicate with each other in a point-to-point manner or using multicast communication primitives.
Post-mortem representation:
The execution of a Charm++ program is represented as an event graph,
which is essentially a dynamic task graph
constructed using traces collected at run-time.
Issues in collecting trace data, reducing perturbation, and
constructing the basic event graph are discussed
in [6] and are beyond the scope of this paper.
A simple version of the event graph was originally used for the
Projections [11]
performance visualization and analysis tool.
The event graph constructed by Paradise consists of vertices representing
entry-function executions, edges
representing messages between entry functions and
edges for dependences between methods (these dependences must be specified
in the language or generated by the compiler). Also, all vertices
belonging to the same object instance are grouped together.
A load balancing strategy determines the placement of objects onto processors. In order to select a suitable object placement strategy which balances load, Paradise attempts to systematically discover the characteristics of the parallel program from the event graph. We first describe four important characteristics which affect object placement, and discuss how they are inferred from the event graph.
The grainsize of a method is the amount of work done (load) in the method execution, and is computed as the difference in time from the start to the completion of the method (since all methods in Charm++ are atomic and all operations are non-blocking). The grainsize of an object instance is the sum of the grainsizes of all method executions for that object. The grainsize of an object type is the average over all objects of that type. The average grainsize of a program is computed by taking into account all parallel objects on all processors. If the average grainsize of a program is too small (e.g. comparable to the message latency on the machine it was run on), it indicates that the program might suffer from large overheads, for which corrective optimizations may be needed. If the average grainsize is too large, it indicates less parallelism leading to significant idle times on processors.
A useful characteristic of a program is the amount of variation in object loads. A large variation in loads requires more complex dynamic load balancing strategies. If all grainsizes are nearly the same, simpler strategies may suffice, or more accurate optimizations may be possible.
Object creation patterns tell us which processors create objects, and at what times in the program execution they are created. The patterns of object creation determine the times and locations where object mapping decisions have to be made, and thus determine the strategy for collection and distribution of load information which is needed for making the object mapping decisions at run-time. E.g. For a data-parallel program where arrays of objects are created at the beginning, no load information needs to be collected; on the other hand, for state-space search where the search tree is expanded in the course of the program, load information must be continuously updated so that new objects can be sent to underloaded processors.
The locations of object creation may be centralized (one processor creates all the objects) or distributed (many processors create objects). This characteristic can be easily inferred by counting the number of objects created on each processor. In general, distributed object creation requires more complex load balancing strategies. The times of object creation may be continuous (objects are created continuously) or bursty (objects are created in bursts, with intervening periods when no objects are created; a special case of this is when objects are created just once, at the beginning of the computation).
The behavior of a program with respect to data locality tells us the extent of access to non-local data. It is desirable to increase data locality in a program so that the amount of data accessed from remote processors decreases. Data locality can be increased by taking into account interactions between objects while making object mapping decisions: closely interacting objects should be mapped to the same processor. In order to deduce patterns of inter-object interactions, we construct an object-interaction graph in which the nodes are objects and edges represent communication between objects. The weight of an edge represents the amount of communication (e.g. number of messages) between the pair of objects it connects.
If there is very little inter-object communication or if there is ``all-to-all'' communication, it is not worthwhile trying to increase locality. However, if the communication is tree-structured (each object communicates only with its creator or child objects) or graph-structured (e.g. neighbor communication) it is worthwhile trying to increase locality by mapping closely interacting objects to the same processor.
There are two main aims of an object placement strategy: to balance load across processors; and to maintain locality by moving objects only when necessary and keeping heavily interacting objects on the same processor.
The control points for dynamic object placement are from the time of
seed
creation
through the time the seed is dispatched by the run-time for creating
the new object.
There are three components of a load balancing scheme. The load collection component determines how load information from different processors is collected. The initial mapping component determines the processor to which a newly created seed is sent. Finally the rebalancing component is responsible for redistributing seeds and objects after they have been initially assigned to a processor.
Several schemes may be used for dynamic object placement, with different levels of sophistication, overheads and for different types of load balancing problems. Some of the schemes commonly used are:
In addition to the application-specific characteristics described in Section 4, the following input-specific information is useful for dynamic object placement:
For programs which create objects dynamically throughout the execution of a program, Paradise chooses a load balancing scheme depending on the program's characteristics. The hint generated contains the chosen scheme and any necessary parameters. At execution time, this hint is read in by the load balancing module in the Charm runtime system and is used to activate the chosen scheme. Currently Paradise chooses one of three schemes: round-robin, neighbor averaging, and distributed manager. The heuristics used by Paradise to automatically choose a scheme are:
if ( object creation is centralized )
if ( all objects have the same grainsize )
Choose the round-robin scheme.
else
Choose the distributed-manager scheme (the processor
creating all objects is the manager).
endif
else
if ( there is significant inter-object communication )
Choose the neighbor-averaging scheme (it maintains locality
by only moving objects when necessary to balance load).
else if ( the average grainsize is sufficiently large )
Choose the distributed-manager scheme (grainsize is large
enough, so there are not too many objects; overhead of sending
seeds to the manager will not be significant).
else if ( all objects have the same grainsize )
Choose the round-robin scheme.
else
Choose the neighbor-averaging scheme (large number of objects
with varying grainsize: none of the other two will work).
endif
endif
These rules embody some of the expertise we have accumulated while optimizing several applications requiring dynamic load balancing schemes [12,13]. Note that this expertise can be brought to bear on this problem only because the post-mortem analysis based on the enhanced event graph is able to identify the relevant characteristics of the parallel computation.
In order to test Paradise's ability to correctly choose a dynamic object placement strategy for programs which dynamically create work, we used the following test programs.
Variable grainsize objects:
This is an artificial program which creates a number of objects of
varying grain-sizes, in the form of an irregular, large-branchfactor tree.
Paradise was able to deduce the following characteristics
with respect to placement:
Table 1: Time (in milliseconds) with the default and automatically
chosen load balancing strategies for the variable-grainsize program.
Heavily communicating objects:
This is an artificial program which creates a number of objects of
the same grain-size, but which communicate heavily with each other.
The structure of the program is an irregular tree, with large messages
being sent from children to parents in the tree.
Paradise was able to deduce the following characteristics
with respect to placement:
Table 2: Time (in milliseconds) with the default and automatically
chosen load balancing strategies for the heavily-communicating objects program.
Many applications, especially array-based applications in science and engineering, do not create objects dynamically: all objects are created at the beginning of the program. In such a case it is possible to place objects at the beginning of execution using a static placement strategy. As for dynamic placement, the two considerations for a static placement strategy are to balance load and maintain communication locality. The control point for allowing the run-time to determine static object placement can be provided by a function call to a partitioning or placement library, at the beginning of the program.
Work on compiler techniques for automatic data partitioning in array-based Fortran and HPF programs has achieved considerable success; block and cyclic mappings of regular arrays can be generated by compilers. However, there are many other types of irregular/dynamic applications for which static placement cannot be done by only compile-time analysis. Even for array-based scientific programs, block/cyclic mappings are not sufficient for many applications. Finally, since the number of processors and the size of the array and other parameters can vary from run to run of a parallel program, we need run-time decisions about the processor on which a particular object should be placed.
When the amount of work done in the element objects of a parallel object array varies significantly, these load variations must be taken into account while assigning objects to processors, in order to balance load across processors. Often variations in load arise due to input-dependent load patterns. E.g. In a particle simulation, the simulation space is divided into regions and each region is assigned to an object. The load of an object is proportional to the number of particles in its region. When the input particle distribution is non-uniform, there may be a large variation in the number of particles (hence load) per object.
Paradise detects a variation in load by comparing the grainsizes of the array element objects and checking to see if they vary significantly. If there is significant variation in load, the program is classified as an irregular program. For such programs, an input-dependent runtime object placement scheme must be used. Currently the scheme used for partitioning a parallel object array is Orthogonal Recursive Bisection (ORB).
ORB is a well known low-overhead scheme which recursively bisects a multidimensional space into two partitions by planes orthogonal to the coordinate axes, such that the load in each partition is approximately equal. The bisection process stops when the number of partitions is equal to the number of processors. Since the partition assigned to each processor is a rectangular (convex) subspace of the original multidimensional space, the communication (which is proportional to the boundary of the region) volume is also reduced. Thus at the end of ORB each processor gets a set of objects corresponding to a rectangular subarray of the original parallel object array, such that all processors have equal loads.
Thus for irregular programs Paradise generates a hint to the runtime libraries to use ORB for partitioning the parallel object array. Additionally, the runtime system needs information about when the partitioning must be applied. Since ORB partitioning requires information about the load of each object, it can only be applied after the object array has been initialized. An appropriate point at which ORB can be applied is the first synchronization point in the program. Accordingly, Paradise finds the phase number corresponding to the first synchronization point, and includes this phase number in the optimization hint.
Another important issue is the problem of conveying load information for
each object to the ORB library. This is solved using inheritance.
All parallel array objects are required to inherit from the `` loadarray''
class which contains a ``load'' variable. Each object is required to set
this variable in its constructor (e.g. the load may simply be the number
of particles in the object's region)
.
When the ORB algorithm is initiated,
each processor can find the load of all objects it contains by reading
the ``load'' variable in each object. All load values are collected on
processor 0, which then applies the ORB algorithm (thus the actual ORB
algorithm itself is just a sequential algorithm), and broadcasts the
resulting mapping to all processors.
Thus Paradise enables object placement for input-dependent irregular programs without programmer intervention. Section 6.1 gives results for an example irregular program.
We demonstrate automatic static object placement for irregular programs using an idealized particle simulation application. This type of program occurs in several scientific applications, including molecular dynamics and gas flow simulations. The program uses a two dimensional parallel object array to represent a computational space in which particles are distributed in a non-uniform manner. Thus each object (which represents a region of the space) contains a variable input-dependent number of particles. The computation consists of each object exchanging particles with neighboring objects, and thereafter performing some computation. This continues for several iterations.
Table 3: Time in seconds for particle simulation program for the
default (cyclic-cyclic) placement and
automatically optimized (using runtime ORB) versions
for a uniform and a non-uniform distribution of particles.
From the event graph for this program, Paradise was able to find that the grainsizes of objects varied significantly, and hence the program was classified as an irregular one. The phase number corresponding to the first synchronization point in the program was found, and a hint to perform Orthogonal Recursive Bisection (ORB) at that phase was generated by Paradise. The program is required to be modified by the programmer to set the load variable of the system base class ``loadarray'' in the constructor of each object. During the next run of the program, the runtime system automatically initiated the ORB library at the phase specified by Paradise. The ORB library accessed the load for each object and generated a partition of the parallel object array. This partition was encoded as a new mapping function. The ORB was followed by a remap operation using the new mapping function, after which the rest of the program was allowed to continue. Table 3 presents results for running the program on 16 processors of the CM-5. The simulation involved 1000 particles, distributed over a two dimensional parallel object array of size 16x16. The default mapping of the parallel object array was cyclic-cyclic. Results are presented for a uniform and a non-uniform input distribution. From the results it is clear that the ORB optimization which was automatically enabled by Paradise significantly improves performance for the nonuniform distribution, without introducing overhead for the uniform case.
In this paper we have described techniques for automating load balancing especially for irregular or dynamic applications. The specific contributions of this work are:
Our results have shown that load balancing schemes can be automatically selected and incorporated into the application program, thus decreasing the effort required from application programmers for developing parallel programs with good performance. In future work, we intend to evaluate the optimizations using real applications, as well as broaden the set of optimizations and techniques in the automatic optimization framework.
This document was generated using the LaTeX2HTML translator Version 95.1 (Fri Jan 20 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 paper.tex.
The translation was initiated by Joshua M. Yelon on Sat Nov 9 13:42:34 CST 1996