L. V. Kalé, Milind Bhandarkar and Robert Brunner
Dept. of Computer Science, and
Theoretical Biophysics Group, Beckman Institute,
University of Illinois,
Urbana Illinois 61801
{kale,milind,brunner}@ks.uiuc.edu
Although typical simulations may run for weeks, they consist of a large number of relatively small-grained steps. Each simulations step typically simulates the behavior of the molecular system for a few femtoseconds, so millions of such steps are required to generate several nanoseconds of simulation data required for understanding the underlying phenomena. As the computation involved in each timestep is relatively small, effective parallelization is correspondingly more difficult.
Although the bonds between atoms, and the forces due to them, have the greatest influence on the evolving structure of the molecular system, these forces do not constitute the largest computational component. The non-bonded forces, the van der Waals and electrostatic (Coulomb) forces between charged atoms consume a significant fraction of the computation time. As the non-bonded forces decrease as the square of interaction distance, a common approach taken in biomolecular simulations is to restrict the calculation of electrostatic forces within a certain radius around each atom (cutoff radius). Cutoff simulation efficiency is important even when full-range electrostatic forces are used, since it is common to perform several integration steps using only cutoff forces between each full-range integration step. This paper therefore focuses on cutoff simulations performance.
In this paper, we describe the load balancing problems that arise in parallelization of such applications in context of a production quantity molecular dynamics program, NAMD 2 [5], which is being developed in a collaboratory research effort. The performance of the load balancing strategies employed by this program are evaluated for a real biomolecular system. The load balancing strategy employed is based on the use of migratable objects, which are supported in Charm++ [7], a C++ based parallel programming system. It relies on actual measurement of time spent by each object, instead of predictions of their computational load, to achieve an efficient load distribution.
NAMD simulates the motions of large molecules by computing the forces acting on each atom of the molecule by other atoms, and integrating the equations of motions repeatedly over time. The forces acting on atoms include bond forces, which are spring-like forces approximating the chemical bonds between specific atoms, and non-bonded forces, simulating the electrostatic and van der Waals forces between all pairs of atoms. NAMD, like most other molecular dynamics program, uses a cutoff radius for computing non-bonded forces (although it also allows the user to perform full-range electrostatics simulations computation using DPMTA library[9]).
Two decomposition schemes are usually employed to parallelize molecular dynamics simulations: force decomposition and spatial decompositon. In the more commonly used scheme, force decomposition, a list of atom pairs is distributed evenly among the processors, and each processor computes forces for its assigned pairs. The advantage of this scheme is that perfect load balance is obtained trivially by giving each processor the same number of forces to evaluate. In practice, this method is not very scalable because of the large amount of communication that results from the distribution of atom-pairs without regard for locality. NAMD uses a spatial decomposition method. The simulation space is divided into cubical regions called patches. The dimensions of the patches are chosen to be slightly larger than the cutoff radius, so that each patch only needs the coordinates from the 26 neighboring patches to compute the non-bonded forces. The disadvantage of spatial decomposition is that the load represented by a patch varies considerably because of the variable density of atoms in space, so a smarter load balancing strategy is necessary.
The original version of NAMD balanced the load by distributing patches
among processors and giving each patch responsibility for obtaining
forces exerted upon its constituent atoms. (In practice, Newton's
third law insures that the force exerted by atom i due to atom jis the same as that exerted on j by i. Such pair interactions are
computed by one of the owning patches, and the resulting force sent to
the other patch, halving the amount of computation at the expense of
extra communication.) We discovered that this method did not scale
efficiently to larger numbers of processors, since out of a few
hundred patches, only a few dozen containing the densest part of a
biomolecular system accounted for the majority of the computation.
The latest version of NAMD adds another abstraction, compute
objects (see figure 1). Each compute object is
responsible for computing the forces between one pair of neighboring
patches. These compute objects may be assigned to any processor,
regardless of where the associated patches are assigned. Since there
is a much larger number of compute objects than patches, the program
achieves much finer control over load balancing than previously
possible. In fact, we found that the program could reach a good load
balance just by moving compute objects, so we also refer to them as
migratable objects. Migratable objects represent work that is
not bound to a specific processor.
![]() |
Most communication in NAMD is handled by a third type of object, called proxy patches. A proxy patch is an object responsible for caching patch data on remote processors for compute objects. Whenever data from a patch X is needed on a remote node, a proxy PX for patch X is created on that node. All compute objects on that remote node then access the data of patch X through proxy patch PX, so that the patch data is sent to a particular processor only once per simulation step. The main implication of this for load balancing is that once a proxy patch is required on a particular processor, placing additional compute objects on that processor does not result in increased communication cost.
NAMD uses a predictive computational load model for initial load balancing. The computational load has two components. One component, proportional to the number of atoms, accounts for communication costs, integration, and bond force computation. The second component resulting from the non-bonded force computation, is based on the number of atom pairs in neighboring patches. Profiling indicates that this component can consume as much as eighty percent of typical simulations, so good overall load balance depends on good distribution of this work.
The first stage of load balancing uses a recursive-bisection algorithm to distribute the patches among the processors so that the number of atoms on each processor is approximately equal. The recursive bisection algorithm ensures that adjacent patches are usually assigned to the same processor.
The second stage is the distribution of compute objects that carry out
the non-bonded force computations. First, the compute objects
responsible for self-interactions (interactions between a pair of
atoms where both atoms are owned by the same patch) are assigned to
the processor where the patch has been assigned, and the load for the
processor incremented by the
.
Then the
compute objects for each pair of neighboring patches are considered.
If the patches reside on the same processor, the compute object is
assigned to that processor; otherwise the compute object is assigned
to the least-loaded of the two processors. Then the load-balancer
increments the load for that processor by
.
The
weights take into account the geometric relationship between the two
patches. There are three weights corresponding to whether the patches
touch at a corner, edge or face, since patches which have an entire
face in common contribute more pairs which actually must be computed
than patches which share only a corner.
This method yields better load balance than earlier implementations, which only balanced the number of atoms. However, it does not take advantage of the freedom to place compute objects on processors not associated with either patch. It is possible to build a more sophisticated static load balancing algorithm that uses that degree of freedom, and even tries to account for some communication related processor-load. However, the geometric distributions of atoms within a patch impacts the load considerably, making it harder to predict by static methods. So, we decided to supplement this scheme with a dynamic load balancer.
If the initial patch assignment makes reasonable allowances for the load represented by each patch, compute object migration alone provides a good final load balance. During the simulation, each migratable object informs the load balance coordinator when it begins and ends execution, and the coordinator accumulates the total time consumed by each migratable object. Furthermore, the Converse runtime system [6] provides callbacks from its central message-driven scheduler, which allows the coordinator to compute idle time for each processor during the same period. All other computation, including the bond-force computations and integration, is considered background load. The time consumed by the background load is computed by subtracting the idle time and the migratable object times from the total time.
After simulating several time steps, the load balance coordinator takes this data and passes it to the selected load balancing strategy object (see section 4.2). The strategy object returns new processor assignments for each migratable object. The coordinator analyzes this list and determines where new proxy patches are required. The coordinator creates these new proxy patches, moves the selected migratable objects, and then resumes the simulation.
The first rebalancing results in many migratable object reassignments. The large number of reassignments usually results in changes to the background load, due to (difficult to model) changes in communication patterns. Therefore, after a few more steps of timing, a second load balancing step is performed, using an algorithm designed to minimize changes in assignments (and therefore in communication load). The second balancing pass produces a small number of additional changes, which do not change the background load significantly, but result in an improved final load distribution.
The load balancing strategy object receives the following pieces of information from the load balance coordinator (all times/loads are measured for several recent timesteps):
Based on this information, the strategy must create a new mapping of migratable objects to processors, so as to minimize execution time while not increasing the communication overhead significantly. We implemented several load balancing strategies to experiment with this scenario. Two of the most successful ones are briefly described below.
It is worth noting that a simple greedy strategy is adequate for this problem if balancing computation were the sole criterion. In a standard greedy strategy, all the migratable objects are sorted in order of decreasing load. The processors are organized in a heap (i.e. a prioritized queue), so that the least loaded processor is at the top of the heap. Then, in each pass, the heaviest unassigned migratable object is assigned to the least loaded processor and the heap is reordered to account for the affected processor's load.
However, such a greedy strategy totally ignores communication costs. Each patch is involved in the electrostatic force computations with 26 neighboring patches in space. In the worst-case, the greedy strategy may end up requiring each patch to send 26 messages. In contrast, even static assignments, using reasonable heuristics, can lead to at most six or seven messages per patch. In general, since more than one patch resides on each processor, message-combining and multicast mechanisms can further reduce the number of messages per patch if locality is considered. Since the communication costs (including not just the cost of sending and receiving messages, but also the cost of managing various data structures related to proxies) constitute a significant fraction of the overall execution time in each timestep, it is essential that the load balancing algorithm also considers these costs.
One of the strategies we implemented modifies the greedy algorithm to take communication into account. Processors are still organized as a heap, and the migratable objects sorted in order of decreasing load. At each step, the ``heaviest'' unassigned migratable object is considered. The algorithm iterates through all the processors, and selects three candidate processors: (1) the least loaded processor where both patches (or proxies) the migratable object requires already exist (so no new communication is added), (2) the least loaded processor on which one of the two patches exists (so one additional proxy is incurred), and (3) the least loaded processor overall. The algorithm assigns the migratable object the best of these three candidates, considering both the increase in communication and the load imbalance. This step is biased somewhat towards minimizing communication, so the load balance obtained may not be best possible.
After all the migratable objects are tentatively assigned, the
algorithm uses a refinement procedure to reduce the remaining load
imbalance. During this step, all the overloaded processors (whose
computed load exceeds the average by a certain amount) are arranged in
a heap, as are all the under-loaded processors. The algorithm
repeatedly picks a migratable object from the highest loaded
processor, and assigns it to a suitable under-loaded
processor.
The result of implementing these new object assignments, in addition to anticipated load changes, creates new communication load shifts. Also, a part of the background work depends on the number (and size) of proxy patches on each processor, and increases or decrease as a result of the new assignment. As a result, the new load balance is not as good as the load balance strategy object expected. Rather than devising more complex heuristics to account for this load shift, we chose a simpler option. As described in section 4.1, a second load balancing pass is performed to correct for communication-induced load imbalance. This pass uses the refinement procedure only. Since the load is already fairly well balanced, only a few migratable objects are moved, improving load balance further without significant change in communication load. We find two passes sufficient to produce good load balance.
Our performance results are from an actual molecular system (ER-GRE, an estrogen receptor system) being studied by the Theoretical Biophysics Group. The system is composed of protein and DNA fragments in a sphere of water. Approximately 37,000 atoms are simulated using an 8.5 Å cutoff. The simulations are run on 16 processors of CRAY T3E.
Figures 2 and 3 show a typical improvement obtained with our load balancer. Figure 2 shows measured execution times for the first eight steps of the simulation (where load balancing is based on a predictive static load balancing strategy described in section 3), and figure 3 shows times for the eight steps after load balancing. Since total execution time is determined by the time required by the slowest processor, reducing the maximum processor time by ten percent decreases execution time by the same amount. We have also observed that for larger numbers of processors, the static load balancer produces even more load variation. The measurement-based load balance does not produce such variation, producing greater performance improvement. Comparisons of the average times (the rightmost bar on each plot) shows that the new load distribution does not increase computational overhead to obtain better balance.
Figure 4 shows the speedup compared to one processor for the same simulation, using measurement-based load balancing. Since the number of computational objects is fixed, and total communication increases with increasing number of processors, load balancing grows more complex. Our load balancer exhibits good speedup even for larger numbers of processors.
Future work will focus on improvements on the load balancing strategy. Smarter strategies may be able to reduce communication costs in addition to balancing the load, producing speed improvements beyond that obtained through equalizing load without great attention to communication. The current load balancer also does not address memory limitations; for large simulations it may be impossible to place objects on certain processors if memory on those processors is already consumed by simulation data. Although the measurement-based scheme is more capable of adjusting to changes in processor and communication speeds than other schemes, we will also study the algorithms behavior for other architectures, including workstation networks and shared memory machines. Also, the current implementation of load-balancing strategy is centralized on processor 0. We plan to provide a framework for parallel implementation for future load-balancing strategies.
This document was generated using the LaTeX2HTML translator Version 98.1p1 release (March 2nd, 1998)
Copyright © 1993, 1994, 1995, 1996, 1997, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 paper.tex.
The translation was initiated by Robert Brunner on 1999-03-18