To appear in the 10th ACM International Conference
on Supercomputing, May 1996, Philadelphia.
Sanjeev Krishnan and Laxmikant V. Kale
Department of Computer Science,
University of Illinois, Urbana-Champaign.
Email : {sanjeev,kale}@cs.uiuc.edu
Phone : (217) 244-0094
Developing efficient software for parallel computers has been recognized as one of the bottlenecks preventing more widespread use of parallel computers. This is because several new issues have to be tackled before parallel programs can attain the peak performance that parallel computers provide. Whereas simple parallel implementations often suffer from bad speedups and other symptoms of low performance, good performance can usually be attained only at a high programming cost. The challenge is thus to attain good performance at low programming cost.
Ideally, an expert parallel programmer would be able to anticipate and respond to all potential performance problems during the design of the first prototype. However, parallel programming skills are not widespread, and moreover, the programmer may not know enough about the characteristics of the application and implementation without investing considerable effort. The first prototype of a parallel program is hence likely to have several serious performance flaws, leading to active research for identifying performance problems and optimization techniques. These steps lead to a significant portion of the programming cost in the parallel program development cycle. Since most parallel programmers are not skilled in identifying and solving performance problems, expert parallel programming knowledge must be embodied in tools which are available to the parallel programmer.
Automating program optimizations using expert knowledge either in the compiler or the run-time system can significantly help to reduce the parallel software development effort. When most performance problems are solved in this manner, there will be fewer iterations of the development cycle. Even in cases where completely automatic techniques are not possible, it is beneficial to automate the optimization steps to the extent possible. This paper is concerned with precise post-mortem analysis which can be used for automating run-time optimizations without programmer intervention.
This section motivates run-time optimizations, the use of post-mortem analysis for automating them, describes our framework, and relates it to previous work.
Traditionally, research in optimization techniques has concentrated on compiler transformations of parallel programs for exposing parallelism, automatic grainsize control, determining data distribution, improving locality, reducing communication overhead, etc. The disadvantages of optimizations performed only at compile time are :

Thus the compiler may have insufficient information to decide whether an optimization is necessary, how to do it, and when (at what point in the program) to do it. Optimizations performed at run-time can take care of many cases where compiler transformations are inadequate.
Run-time optimizations need information about characteristics of the parallel program in order to enable optimization mechanisms and guide strategies for selecting them. Since compilers cannot statically deduce the required information in many cases, programmers manually analyse the execution of a program in order to discover program characteristics and optimizations (Figure 1a). Automatic post-mortem analysis may thus reduce the effort required for performance optimization.
Figure 1: Parallel program development cycles.
Different executions of a parallel program are often very similar, if not identical, in their structure, communication patterns and performance characteristics. This leads us to observe that analyzing one or more executions of a program can give us information about performance problems and possible optimizations for the program. In a traditional program development cycle, all the tasks of analyzing performance data, identifying performance problems, designing optimizations, and incorporating them in the parallel program need to be done by the programmer. By incorporating expert knowledge in a post-mortem analysis tool, it is possible to do some of these tasks automatically and generate information for automatic compile-time as well as run-time optimization.
Most parallel programs needing optimization fall into one of two categories: the first consists of those for which good optimization strategies can be derived from well known heuristics; for such programs an automated expert system for guiding optimization can potentially achieve good results. The second category consists of programs which need new algorithms and techniques for optimizing them; for such programs the intervention of a human programmer is obviously needed. However, experience with parallel applications has shown that many of them have well-recognized performance problems for which optimizations can be found using a few good heuristics.
This paper presents current results from an ongoing project to develop a framework for automating run-time optimizations (Figure 1b). The framework involves the expert post-mortem analyzer PARADISE (PARallel programming ADvISEr) which analyzes traces of program execution, finds program characteristics and suggests optimization hints. Paradise works in close cooperation with a run-time system which uses the hints to parameterize optimizations and select between alternate optimization strategies. Paradise builds a representation of the program's execution from traces, determines characteristics of the program, uses heuristics to find optimizations that will solve the performance problems, and generates concise hints which are communicated to runtime libraries by a ``hints file''.
Performance analysis research has concentrated on visually displaying performance data [1,2], relating performance data to high level language constructs [3,4], or giving the user insights into performance problems using expert analysis [5,6]. Our framework aims to go a step further in the direction of automation : Paradise not only finds performance problems, but also solutions in terms of optimizations for the problem areas, and in co-operation with the run-time libraries, incorporates the optimizations in the program without programmer intervention.
Automatic compiler optimizations have achieved success for automatic data partitioning and communication schedule generation in array-based Fortran programs [7,8]. The use of profile information for compiler optimizations is well-known. Several sequential compilers use profile information to predict branch probabilities (e.g. [9]). Some compilers for data-parallel languages such as HPF [10]) use profile information to accurately find the cost of various computation and communication operations. To the best of our knowledge, our framework is one of the first efforts towards using post-mortem analysis for automating run-time optimizations. Also, the scope of our optimizations is much broader than in traditional compiler optimizations, and moreover applies to dynamic and irregular applications as well as regular ones.
Another unique aspect of our work is that it is in the context of a parallel object-oriented language which allows the run-time system the flexibility to choose strategies for placement (mapping) and scheduling of computations and managing communication between objects. Thus there are significantly greater opportunities and challenges for automatic optimizations. In contrast, message-passing layers such as PVM or MPI require the programmer to explicitly specify the placement and scheduling of computations and the communication between them, and also do not provide facilities for dynamic creation of tasks, thus restricting the extent of automatic optimizations. Again, data-parallel languages such as HPF present a much simpler regular computational model for which optimizations are easier to perform, as compared to a parallel object-oriented model which involves dynamic creation of tasks and asynchronous communication.
This work is based on the parallel object-oriented language Charm++, [11,12] which is an extension of C++. The basic unit of work in Charm++ is a chare, which is a medium-grained concurrent C++ object. Chares are dynamically created; there may be thousands of chares per processor. A chare type is a C++ class containing data and functions which may be triggered by the arrival of messages. Functions inside chares are atomic (they may not be pre-empted).
Chares communicate by sending messages to functions (invoking methods) in other chares asynchronously. An essential feature of the Charm++ parallel programming model is asynchronous message driven execution, which helps latency tolerance by overlapping communication and computation. All calls to the run-time are non-blocking, and there are no ``receive'' calls. Remote accesses are performed in a split-phase manner.
Each processor has multiple chares, and a pool of messages targeted to methods in the chares. The scheduler picks messages from the pool one by one, and ``processes'' them by invoking their target methods in the proper chare objects.
Charm++ also provides multidimensional parallel arrays of objects which are distributed over processors using a user-specified mapping function. Array elements may communicate with each other in a point-to-point manner or using multicast communication primitives.
Post-mortem representation:
The execution of a Charm++ program is represented as an event graph,
which is essentially a task graph
constructed using traces collected at run-time.
Issues in collecting trace data, reducing perturbation, and
constructing the basic event graph are discussed
in [6] and are beyond the scope of this paper.
A simple version of the event graph was originally used for the
Projections [2]
performance visualization and analysis tool.
The event graph constructed by Paradise consists of vertices representing
entry-function executions, edges
representing messages between entry functions and
edges for dependences between methods (these dependences must be specified
in the language or generated by the compiler). Also, all vertices
belonging to the same object instance are grouped together.
The validity of inferences drawn from post-mortem analysis of an execution depends on how similar subsequent executions are to that execution. For programs with completely deterministic behavior (such as some SPMD programs), or for executions with the same inputs, the inferences will be very accurate. Suggestions for optimization may be made in terms of individual message and object instances. For irregular or non-deterministic programs, variations in the input parameters can affect program execution, so that fewer inferences will be valid. In such cases we need techniques to specify properties of a set of instances (e.g. all messages of a particular type in the source program), because we cannot make suggestions at the level of individual instances. Essentially, we need to infer program-level characteristics (e.g. whether the program has graph-structured or tree-structured communication) by analyzing an execution.
Completely deterministic programs always produce the same set of tasks and messages. However, most parallel programs contain non-determinism because of the following factors:
To deal with non-determinism due to adaptive scheduling and dynamic object placement, we need to ensure that the program representation remains invariant even if there is such non-determinism. As described in the previous subsection, the event graph contains edges between methods for dependences : this ensures that the graph for operations such as a ``join'' --- where a method executes only if several preceding methods have executed --- does not change even if the order of execution of the preceding methods changes. Also, by grouping vertices belonging to the same object instance, we get an object-interaction graph which is independent of the exact assignment of objects to processors, thus removing the effects of non-deterministic placement. Moreover, this allows us to analyze the interactions between object instances, instead of restricting ourselves to a less precise processor-level. This is very useful when we try to analyze patterns of communication between objects.
There are two main aims of an object placement strategy: to balance load across processors; and to maintain locality by moving objects only when necessary and keeping heavily interacting objects on the same processor.
In order to select a suitable object placement strategy, Paradise attempts to systematically discover the characteristics of the parallel program from the event graph. We first describe some important characteristics which affect object placement, and discuss how they are inferred from the event graph.
This characteristic tells us if there are repeatedly occurring phases in the program. Separating the program into independent phases helps to focus analysis [6], because each phase can be analyzed independently. The phase structure is also important if all objects are not active in all phases: ignoring the phase structure may result in load imbalance within a phase in which only part of the objects are active. The following types of phases are commonly found in parallel programs:
Figure 2: Different types of phases in parallel programs.
The behavior of a program with respect to data locality tells us the extent of access to non-local data. It is desirable to increase data locality in a program so that the amount of data accessed from remote processors decreases. Data locality can be increased by taking into account interactions between objects while making object mapping decisions: closely interacting objects should be mapped to the same processor. In order to deduce patterns of inter-object interactions, we construct an object-interaction graph in which the nodes are objects and edges represent communication between objects. The weight of an edge represents the amount of communication (e.g. number of messages) between the pair of objects it connects. Whether data locality can be exploited for a particular program or not depends on the inherent communication patterns for the program. Most programs fall into one of the categories described below.
Locality is unimportant: There are two cases when locality need not be taken into consideration while mapping objects to processors. The first is when all objects execute independently, without communicating with each other. This is detected in the object-interaction graph when the total number of edges is close to the total number of nodes (since each object must have at least one edge connecting it with its creator object). The second is when each object communicates with almost every other object (e.g. when all objects do a broadcast), so that there is no locality in the communication pattern. This is detected when the average degree of a node in the object-interaction graph is close to the total number of nodes. If no locality exists in the program, object mapping needs to only take care of load balance, as determined by the other characteristics.
Tree-structured communication : Here each object communicates only with its parent (creator) object or its child objects. This property is detected in the object-interaction graph by a simple graph traversal.
Graph-structured communication : arises when objects communicate with neighboring objects. Usually this pattern of communication is found in spatially decomposed applications, where objects represent regions of space and communicate with adjacent regions.
Regular graph structures arise when the communication pattern between objects is regular (e.g. neighbor communication among objects in a parallel object array). Input dependent graph-structured communication occurs when the exact nature of the object-interaction graph depends on the input data (e.g. when each object constructs a list of interacting objects using input data).
Object creation patterns tell us which processors create objects, and at what times in the program execution they are created. The patterns of object creation determine the times and locations where object mapping decisions have to be made, and thus determine the strategy for collection and distribution of load information which is needed for making the object mapping decisions at run-time. E.g. For a data-parallel program where arrays of objects are created at the beginning, no load information needs to be collected; on the other hand, for state-space search where the search tree is expanded in the course of the program, load information must be continuously updated so that new objects can be sent to underloaded processors.
The locations of object creation may be centralized (one processor creates all the objects) or distributed (many processors create objects). This characteristic can be easily inferred by counting the number of objects created on each processor. In general distributed object creation requires more complex load balancing strategies. The times of object creation may be continuous (objects are created continuously) or bursty (objects are created in bursts, with intervening periods when no objects are created; a special case of this is when objects are created just once, at the beginning of the computation).
In static object placement, objects are created and placed only at the beginning of the program. Paradise currently chooses static object placement strategies for programs written using Charm++'s parallel object array construct. Essentially, Paradise chooses a strategy which partitions the array among processors so as to balance load as well as reduce inter-processor communication. When the communication pattern matches with commonly observed regular communication patterns (e.g. a grid-based decomposition where each object communicates with the two adjacent objects in each dimension), the partitioning strategy need not be dynamic (i.e. it need not execute at run-time). In such cases it is possible for Paradise to automatically infer a precise pattern for mapping objects to processors.
The first task to be done for array partitioning is to align interacting object arrays of the same dimensions. The strategy here is the one used by Li & Chen [13] and Gupta & Banerjee [7] for compilers, involving construction and partitioning of a graph whose vertices represent array dimensions. Each set of aligned arrays represents a template grid (as in HPF), and each object in each array is assigned to a grid point.
Next, Paradise partitions the template grids across processors. For regular array-based programs without phases, Paradise uses a (block, block, ...) pattern, where processors are arranged in a grid and a contiguous block of array elements is placed on each processor. For each template grid, the amount of communication along each dimension is calculated by creating a plane perpendicular to the dimension which bisects the grid, and finding the number of messages crossing the plane. The amount of communication is used to determine the block size in that dimension : the more the communication, the larger the block size. The final hint generated specifies the aspect ratio of the processor grid. At run-time, the object mapping library uses this aspect ratio and the array size to determine the block size and assign array elements to processors. E.g. for a Jacobi relaxation program where each object communicates with its two adjacent objects in each dimension, Paradise suggests square regions for each processor. E.g. for an array where communication occurs only along one dimension, processors will get long slabs oriented along the direction of communication so that there is no inter-processor communication.
The phase structure of a program is important only if object activity patterns vary from phase to phase. Paradise first constructs a list of phases in which a significant number of objects are not active. It then generates a set of load balance constraints on object mapping, and evaluates the suitability of well-known mapping patterns with respect to the constraints.
A constraint is specified between a pair of objects to indicate that both objects should preferably not be assigned to the same processor. Within each phase, constraints are attached to every pair of objects. Thus satisfying as many constraints as possible would ensure that the objects are evenly distributed across processors. In future we intend to automate object migration to allow remapping of objects between phases, if there are conflicting constraints on object placement across phases.
Finding patterns of object placement:
It is desirable to express the mapping of objects to processors
concisely so that it can be
communicated to the runtime libraries as a hint. Ideally, the mapping
should be an expression which maps the object's id to a processor number.
For parallel object arrays in Charm++, the expression
uses the coordinates of an object in the 1, 2, or 3-dimensional object array
to map it to a processor.
Currently, the mapping patterns supported include :
E.g. for a 3-D array this has the form

We demonstrate automatic object placement for a simple program which performs Gaussian Elimination (only the triangularization step, without pivoting). The Charm++ program for this has an object for each row in the matrix, and a parallel array of row objects which constitutes the matrix. When a row becomes the pivot row, it multicasts its elements to all rows below it in the matrix.
From the event graph, Paradise was able to deduce that there is no locality in the communication pattern (since each row object needs to communicate with all other rows). Further, the program has phases separated by initiation points corresponding to the multicasts made by the pivot rows. The number of objects active in each phase varies, thus it is necessary to balance load in each phase separately. Hence Paradise maps objects to processors phase by phase. The mapping pattern which matches the load-balance constraints best corresponds to a cyclic mapping. Paradise now generates a hint (consisting of the values of the constants in the mapping expression) to the object-mapping module in the run-time through the optimization hints file. Table 1 presents times in seconds for running the Gauss-Elimination program on 32 processors of the CM-5 for a 1000x1000 matrix, with a random object mapping strategy and the automatically optimized strategy.
Table 1: Object mapping for Gauss-Elimination.
Dynamic load balancing is an extensively researched area, and there are several general-purpose as well as application-specific strategies. If we know the problem structure, we can sometimes use better-tuned strategies. Recognizing this, Paradise uses a special load balancing strategy for tree-structured computations where there is some amount of uniformity in the tree. In tree-structured programs, each object communicates only with its parent (creator) object or its child objects.
The load balancing strategy for tree-structured computations essentially tries to partition the tree equally across all processors by assigning a subtree to each processor. This maintains locality, results in better load balance, distributes the tree across processors quickly, as well as avoids other overheads of general-purpose load balancing strategies. The types of trees which can be partitioned by this strategy are :
Paradise first does a traversal over the object-interaction graph to determine if the program is tree-structured. It then systematically traverses the tree in a depth-first manner, calculating the loads of all leaf and internal nodes. Then it analyses the branching factors and subtree loads to check if they are uniform. If so, it generates an appropriate hint.
If the tree is too irregular, Paradise reverts to one of the general purpose load balancing strategies provided with the Charm++ run-time system: currently it chooses between a round-robin strategy (objects are placed on processors in a round-robin manner), neighbor averaging (load is balanced among processors in a neighborhood) and hierarchical manager (a cluster of processors is controlled by a manager). The choice is made depending on the grain-sizes of objects (e.g. round-robin will not work well if the grain-sizes are widely varying), the amount of locality needed (e.g. neigbor-averaging works better when locality is needed), and the number of objects to be balanced.
We demonstrate automatic dynamic object mapping for tree-structured programs using a simple parallel program for computing the factorial of a large number. The program has essentially a divide-and-conquer structure. Each node in the divide-and-conquer tree is a Charm++ object, which creates two child nodes for computing partial products. The child nodes reply back to the parent when the result is found. This program was compiled and run on the TMC CM-5, the traces of program execution were collected and fed into Paradise. Paradise was able to deduce that the program has a uniform uniform tree structure (all leaves are at equal depth), and a branch factor of 2 at every internal node. Moreover, the object creation is continous and there is no phase-structure. Hence Paradise chose the object mapping strategy for trees, and also provided the additional information that the tree was regular and binary.
The hints generated by Paradise were fed back to the object-mapping module using an optimization hint file. The factorial program was again run on the CM-5 to get the optimized results. Table 2 gives times (in milliseconds) for computing the factorial of 4096 on 16 processors (which results in 8191 objects in the tree). Results are presented for a random object mapping strategy and the automatically optimized tree strategy.
Table 2: Dynamic object mapping for factorial program.
The scheduler in the run-time system determines the order of execution of method invocations in objects on a processor. This order becomes important when there are multiple concurrent objects on a processor, any of which can be allowed to execute. The main aim
of the scheduling strategy is to
ensure that all computation paths
in the parallel program finish at about the same time.
In programs where some computational paths are longer than others,
there are likely to be one or more critical paths.
We can optimize the scheduling by executing tasks on critical paths
with minimal delay. This is a difficult problem because no global information
is normally available: the scheduler does not know which of
the methods to be invoked lies on the critical path.
Scheduling may be affected through prioritization: objects or messages can be assigned a priority value, which is used to select a message for processing when there are many messages in the scheduler queue. Charm++ already allows the programmer to assign a priority to a message before sending it. To implement prioritization automatically, the scheduler needs to know what priorities need to be assigned to messages, and at what times in the program.
Paradise first determines if the program has one or more significant critical paths. The algorithm used to determine critical paths is based on a longest-path heuristic: a depth-first traversal of the event graph reveals the longest path from each node to the end of the program. Paradise then determines the method types or message types lying on the critical paths and the priority values to be assigned to them. The priority of a type is higher if it occurs more often on critical paths, and lower if it occurs more often on non-critical paths. Further, Paradise determines which objects are on critical paths and assigns priorities to objects. The priority of an object is higher if it occurs more often on the critical path, and also higher if it occurs earlier on the critical path. Then Paradise tries to find a pattern for relating the object-id to the priority, using a linear priority expression.
The Gauss-Elimination program described in section 4.3.1 has a critical path consisting of the initiation points when a row broadcasts itself to all rows below it in the matrix. Each object occurs exactly once on the critical path, and object id (row number) i occurs before object i+1 on the critical path. Thus Paradise deduces that objects should be assigned a priority based on their object ids. The priority expression matching routine now tries to relate the object-id to its priority. Since Charm++ assigns higher priority to lower numbers, the priority of an object can be set to its row-id. This hint is generated by Paradise and read in by the run-time system during the next run of the program. The performance of the program with and without prioritization is given in table 3.
Table 3: Prioritization for Gauss-Elimination
Communication optimizations are necessary to reduce communication volume and overheads, as well as tolerate latency. Here we discuss two of them that are commonly used in parallel programs.
Message pipelining
Instead of sending a large message
at the end of a large computation to an idle processor,
it is possible
that by sending pieces of the message earlier, the idle processor could
start processing the pieces earlier, thus reducing idle time by overlapping
communication and computation.
However, in general, the choice of the number of pieces the
computation should be divided into
(the degree of pipelining) must be made at run-time,
depending on the communication
latencies and bandwidth of the target machine, as well as the
sizes of the messages and computations in different phases of the program.
In order to enable this pipelining optimization,
the methods to be pipelined must be parameterized so as
to give the run-time system a control point where the degree of pipelining parameter can be set. Paradise determines the degree of pipelining for different message types using heuristics which reduce idle time and overheads [6].
Message aggregation or combining
When a processor sends many small messages to the same remote processor
and the messages are not immediately processed, the messages can be
combined into one large message before being sent out, and then broken up after
being received. This reduces message transmission overhead.
When such a message is sent by a processor,
the run-time at the sender needs to know whether to send the
message to the receiving processor right away or
delay the message so that it can be combined with another message to
the same receiver. In this situation Paradise
provides valuable
hints regarding which messages to combine, and at what times in the program.
We give an example of a simple parallel data-collection operation which benefits from message combining. The operation involves a parallel object array with a large number of elements distributed over all processors, each of which sends messages to one object on processor 0. Paradise detects this operation, and generates a hint to the runtime for enabling message combining for the particular message type. The number of messages to combine is set to the number of objects of the array on a processor. Note that the runtime uses this number only as a hint: a time-out value is used to limit the maximum amount of time a message may be delayed in order to wait for another message to the same processor. Table 4 gives results for the total time taken for broadcast to and data-collection from 512 objects on 32 processors of the CM-5, with each object sending a 4 byte message to processor 0.
Table 4: Time (in milliseconds) with and without message combining
for broadcast and data-collection over 512 objects on 32 processors.
In this paper we have described techniques for automating parallel run-time optimizations using post-mortem analysis. The specific contributions of this work are:
Our results have shown that run-time optimizations can be automated, thus decreasing the effort required from application programmers for developing parallel programs with good performance. In future work, we intend to evaluate the optimizations using real applications, as well as broaden the set of optimizations and techniques as the need arises.
Automating Parallel Runtime Optimizations Using Post-Mortem Analysis
This document was generated using the LaTeX2HTML translator Version 95.1 (Fri Jan 20 1995) Copyright © 1993, 1994, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 paper.tex.
The translation was initiated by Joshua M. Yelon on Sat Nov 9 13:37:50 CST 1996