| Introduction
|
|
The largest and fastest supercomputers in the top500 list deploy a scalable,
three-dimensional torus or mesh interconnect. For these machines, the number of
links (hops) traversed by a message has a direct effect on the time required to
reach the destination. This is especially true in presence of bandwidth
congestion, when multiple messages share links from source to destination. For
large parallel machines with a significant diameter, this can become a serious
performance bottleneck. Traditionally, application developers have neglected
this fact because of the advantages of virtual cut-through and wormhole routing
for most message sizes on small machines. This might not be true any longer due
to the large diameters of machines.
This research will demonstrate the effect of network contention on message
latencies and propose and evaluate techniques to minimize communication traffic
and hence, bandwidth congestion on the network. This will be achieved by
topology-aware mapping of tasks in an application. By placing communication
tasks on processors which are in physical proximity on the network,
communication can be restricted to near neighbors. This reduces link-sharing
among messages and leads to a better utilization of the available bandwidth.
Our aim is to minimize hop-bytes, which is a weighted sum of the number
of hops between the source and destination for all messages, the weights being
the message sizes. This can minimize the communication time and hence, lead to
significant speed-ups for parallel applications and in certain cases, also
remove scaling bottlenecks. The research will involve developing a general
automatic topology-aware mapping framework which takes the task graph and
processor graph as input, and outputs near-optimal mapping solutions.
|
| Study of Interconnects
|
|
Contention Studies:
We wrote a MPI benchmark called WICON (With Contention) to quantify message
latencies in presence of contention, which is a regime not handled by the basic
model of wormhole routing discussed earlier. In this benchmark, all MPI tasks are
grouped into pairs and the smaller rank in the pair sends messages of size B
bytes to its partner and awaits a reply. All pairs do this communication
simultaneously. The average time for the message sends is recorded for
different message sizes. To quantify the effect of hops on message latencies
this benchmark is run in two modes:
- Near Neighbor Mode (NN): The ranks which form a pair only differ by one. This
ensures that everyone is sending messages only 1 hop away (in a torus).
- Random Processor Mode (RND): The pairs are chosen randomly and thus they
are separated by a random number of links.
 |
 |
|
Figure 1. Plots showing the results of the WICON benchmark on
Blue Gene/P and XT3 |
Figure 1 shows the results of running WICON in the NN and RND modes on Blue
Gene/P and XT3. The first plot shows the results of WICON on 4,096 cores of
BG/P. It is clear that the random-processor (RND) latencies are more than the
near-neighbor (NN) latencies (by a factor of 1.75 for large messages.) This is
expected based on the assertion that hops have a significant impact on the
message latencies in the presence of contention, which increases with larger
messages because of a proportional increase in packets on the network. Similar
experiments were repeated on XT3 to understand the effects of contention on
Cray XT machines. The second plot in Figure 1 presents the results for WICON
benchmark on 2,048 cores of XT3. We see a significant difference between the NN
and RND lines (a factor of 2.25 at 1 MB messages which is greater than that on
BG/P.) This is not unexpected and a quantum chemistry code has shown huge
benefits (up to 40%) from topology-aware mapping on XT3 [Bohm07].
|
The benchmark in the previous section injects random contention on the network.
To quantify the effects of contention under controlled conditions, WICON was
modified to conduct a controlled experiment. Again, all ranks are divided into
pairs but now the pairs are chosen such that they are a fixed number of hops,
say n, away from each other. All pairs send messages
simultaneously and the average time for message sends of different sizes for
varying hops is recorded. Pairs are chosen only along one dimension of the
torus, in this case, the Z dimension.
Figure 2 shows the results of running the WICON2 benchmark on Blue Gene/P. On
each plot there are several lines, one each for a specific pairing which is
n hops away. The tests were done on a torus of dimensions 8 X 8 X 16.
Since messages are sent along Z, maximum number of hops possible is 8 and hence
there are 8 lines on the plot. The Blue Gene/P plot on the right shows that the
message latencies for large messages for the 1 hop and 8 hops case can differ
by a factor of 8!
As all messages travel more hops, links are shared by a greater number of
messages, increasing the contention on the network and decreasing the available
effective bandwidth. This is what applications have to deal with during
communication in practice. This huge difference between message latencies
indicates that it is very important to keep communicating tasks close by and
minimize contention on the network [Bhatele09a]. This is especially true for
communication bound applications. Next, we look at mapping successes for two
applications which have different communication characterstics.
|
|

Figure 2. Plots showing the results of WICON2 on BG/P
|
|
| Application-specific Topology-aware Mapping
|
|
Dynamic Irregular Communication: NAMD is a production Molecular Dynamics
(MD) application used for simulation of bio-molecules [Bhatele08b]. NAMD is
parallelized by use of Charm++ objects called patches and computes. The
simulation box is spatially divided into smaller cells called patches. The
force calculation for every pair of patches is assigned to a different compute.
Thus, communication in NAMD consists of section multicasts from patches to
computes and back. Every patch multicasts its atom data to multiple computes,
whereas each compute receives data from only two patches. Patches are
statically assigned to a few processors during start-up and computes are
distributed evenly by a load balancer. Let us now see the deployment of
topology-aware techniques in the static placement of patches and the load
balancers.
Topology placement of patches: Since patches form a geometric
decomposition of the simulation space, they constitute a 3D group of objects
which can be mapped nicely onto the 3D torus of machines. An ORB (Orthogonal
Recursive Bisection) of the torus is used to obtain partitions equal in number
to the patches and then, a one-to-one mapping of the patches to the processor
partitions is done.
 |
 |
|
Figure 3. (a) Placement of a compute withinn the inner brick, (b) Improvement in
hop-bytes using topology-aware mapping on Blue Gene/P |
Topology-aware Load Balancers: Once patches have been statically
assigned onto the processor torus, computes which interact with these patches
should be placed around them. Consider Figure 3(a), which shows the entire 3D
torus on which the job is running. When placing a compute, it should be placed
topologically close to the two processors that house the patches it interacts
with. The two patches define a smaller brick within the 3D torus (shown in dark
grey in the figure). The sum of distances from any processor within this brick
to the two patches is minimum.
Figure 3(b) shows the hop-bytes for all messages per iteration
when running NAMD on Blue Gene/P on different sized partitions. A standard
benchmark used in the MD community was used for the runs: 92,227-atom
ApoLipoprotein-A1 (ApoA1). As we would expect, hop-bytes consistently increase
as we go from a smaller partition to a larger one. The three strategies
compared are: topology oblivious mapping of patches and computes (Topology
Oblivious), topology-aware static placement of patches (TopoPlace Patches) and
topology-aware placement for both patches and load balancing for computes
(TopoAware LDBs).
Topology aware schemes for the placement of patches and the load balancer help
in reducing the hop-bytes for all processor counts. Also, the decrease in
hop-bytes becomes more significant as we go to larger-sized partitions. This is
due to the fact that the average distance traveled by each message increases as
we increase the partition size in the case of default mapping; however, it
becomes controlled when we do a topology-aware mapping. Since the actual
performance of the load balancers depends on several metrics, the question
remains as to whether the reduction in hop-bytes leads to an actual
improvement in performance. As it turns out, we also see a reduction in the
number of proxies and in the max-to-average ratio for topology-aware
load balancers, which is reflected in the overall performance of NAMD on Blue
Gene/P (see table above). The topology oblivious scheme stops scaling around
4,096 cores and hence, we did not obtain numbers for it beyond that. We see an
improvement of 10% at 16,384 cores with the use of topology-aware load
balancers. For further details, please refer to [Bhatele09b].
Static Regular Communication: OpenAtom is a fine-grained parallelization
of the CPAIMD method to understand dynamics of atoms at a quantum
scale [Bohm07]. Computation in OpenAtom is divided into a large
number of objects, enabling scaling to tens of thousands of processors.
Calculating the electrostatic energy involves computing several terms. Hence,
CPAIMD computations involve a large number of phases with high inter-processor
communication. These phases are discretized into a large number of objects,
which generate a lot of communication, but ensures efficient interleaving of
work. The entire computation is divided into ten phases, which are parallelized
by decomposing the physical system into fifteen chare arrays.
Since multiple chare arrays interact among one another, the communication
dependencies are complex and mapping is a challenging task. OpenAtom provides
us with a scenario where the load on each object is static (under the CPAIMD
method) and the communication is regular and clearly understood. Hence, it
should be possible to intelligently map the arrays in this application to
minimize inter-processor communication and maintain load balance. Because of
space limitations and a fairly involved mapping acheme, we will not go into the
details of how the mapping is done but present performance results.
 |
 |
We studied the strong scaling (fixed problem size) performance of OpenAtom with
and without topology aware mapping. Two benchmarks commonly used in the CPMD
community: the minimization of WATER_32M_70Ry and WATER_256M_70Ry were used.
As shown in the table on the left, performance improvements from topology-aware
mapping for Blue Gene/P (BG/P) can be quite significant. As the number of
cores and likewise, the diameter of the torus grows, the performance impact
increases until there is 40% improvement for WATER_32M_70Ry at 4096 and 50%
for WATER_256M_70Ry at 8192 cores. The improvements from topological awareness
on Cray XT3, presented in the table on the right, are comparable to those on
BG/P. There is an improvement of 20% on XT3 for WATER_256_70Ry at 1024 cores,
compared to the improvement of 38% on BG/P at 1024 cores.
|
| Automatic Mapping Framework
|
|
The final goal of this research is to utilize the experience from
application-specific mapping, in developing an automatic framework which can
develop topology-aware mapping solutions. The work on the automatic mapping
framework will relieve the application writers of doing the mapping themselves.
We also foresee acceptance of the idea that applications running on Cray XT
machines will benefit as much as those on Blue Gene machines. This might even
led to modification of the batch schedulers on these machines to allocate
contiguous 3D mesh partitions for jobs.
|
|
- 09-08
Abhinav Bhatele and Laxmikant V. Kale, Quantifying Network Contention on Large Parallel Machines, submitted to Parallel Processing Letters (Special Issue on Large-Scale Parallel Processing), 2009.
- 09-05
Abhinav Bhatele, Laxmikant V. Kale, Nicholas Chen and Ralph E. Johnson, A Pattern Language for Topology Aware Mapping, Workshop on Parallel Programming Patterns (ParaPLOP 2009)
- 09-02
Abhinav Bhatele, Laxmikant V. Kale and Sameer Kumar, Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications, To appear in 23rd ACM International Conference on Supercomputing, 2009.
- 09-01
Abhinav Bhatele and Laxmikant V. Kale, An Evaluative Study on the Effect of Contention on Message Latencies in Large Supercomputers, Workshop on Large-Scale Parallel Processing (IPDPS), 2009.
- 08-10
Abhinav Bhatele, Eric Bohm and Laxmikant V. Kale, A Case Study of Communication Optimizations on 3D Mesh Interconnects, To appear in Proceedings of Euro-Par (Topic 13 - High Performance Networks), 2009
- 08-07
Abhinav Bhatele, Laxmikant V. Kale, Benefits of Topology Aware Mapping for Mesh Interconnects, Parallel Processing Letters (Special issue on Large-Scale Parallel Processing), Vol: 18 Issue: 4 Pages: 549-566, 2008
- 08-02
Abhinav Bhatele, Laxmikant V. Kale, Application-specific Topology-aware Mapping for Three Dimensional Topologies, Workshop on Large-Scale Parallel Processing (IPDPS), 2008
- 07-12
Abhinav Bhatele, Application-specific Topology-aware Mapping and Load Balancing for three-dimensional Torus Topologies, Master's Thesis, Department of Computer Science, University of Illinois, 2007
|