Topology Aware Mapping

Introduction

The largest and fastest supercomputers in the top500 list deploy a scalable, three-dimensional torus or mesh interconnect. For these machines, the number of links (hops) traversed by a message has a direct effect on the time required to reach the destination. This is especially true in presence of bandwidth congestion, when multiple messages share links from source to destination. For large parallel machines with a significant diameter, this can become a serious performance bottleneck. Traditionally, application developers have neglected this fact because of the advantages of virtual cut-through and wormhole routing for most message sizes on small machines. This might not be true any longer due to the large diameters of machines.

This research will demonstrate the effect of network contention on message latencies and propose and evaluate techniques to minimize communication traffic and hence, bandwidth congestion on the network. This will be achieved by topology-aware mapping of tasks in an application. By placing communication tasks on processors which are in physical proximity on the network, communication can be restricted to near neighbors. This reduces link-sharing among messages and leads to a better utilization of the available bandwidth. Our aim is to minimize hop-bytes, which is a weighted sum of the number of hops between the source and destination for all messages, the weights being the message sizes. This can minimize the communication time and hence, lead to significant speed-ups for parallel applications and in certain cases, also remove scaling bottlenecks. The research will involve developing a general automatic topology-aware mapping framework which takes the task graph and processor graph as input, and outputs near-optimal mapping solutions.


Study of Interconnects

Contention Studies: We wrote a MPI benchmark called WICON (With Contention) to quantify message latencies in presence of contention, which is a regime not handled by the basic model of wormhole routing discussed earlier. In this benchmark, all MPI tasks are grouped into pairs and the smaller rank in the pair sends messages of size B bytes to its partner and awaits a reply. All pairs do this communication simultaneously. The average time for the message sends is recorded for different message sizes. To quantify the effect of hops on message latencies this benchmark is run in two modes:

  • Near Neighbor Mode (NN): The ranks which form a pair only differ by one. This ensures that everyone is sending messages only 1 hop away (in a torus).
  • Random Processor Mode (RND): The pairs are chosen randomly and thus they are separated by a random number of links.
Figure 1. Plots showing the results of the WICON benchmark on Blue Gene/P and XT3

Figure 1 shows the results of running WICON in the NN and RND modes on Blue Gene/P and XT3. The first plot shows the results of WICON on 4,096 cores of BG/P. It is clear that the random-processor (RND) latencies are more than the near-neighbor (NN) latencies (by a factor of 1.75 for large messages.) This is expected based on the assertion that hops have a significant impact on the message latencies in the presence of contention, which increases with larger messages because of a proportional increase in packets on the network. Similar experiments were repeated on XT3 to understand the effects of contention on Cray XT machines. The second plot in Figure 1 presents the results for WICON benchmark on 2,048 cores of XT3. We see a significant difference between the NN and RND lines (a factor of 2.25 at 1 MB messages which is greater than that on BG/P.) This is not unexpected and a quantum chemistry code has shown huge benefits (up to 40%) from topology-aware mapping on XT3 [Bohm07].

The benchmark in the previous section injects random contention on the network. To quantify the effects of contention under controlled conditions, WICON was modified to conduct a controlled experiment. Again, all ranks are divided into pairs but now the pairs are chosen such that they are a fixed number of hops, say n, away from each other. All pairs send messages simultaneously and the average time for message sends of different sizes for varying hops is recorded. Pairs are chosen only along one dimension of the torus, in this case, the Z dimension.

Figure 2 shows the results of running the WICON2 benchmark on Blue Gene/P. On each plot there are several lines, one each for a specific pairing which is n hops away. The tests were done on a torus of dimensions 8 X 8 X 16. Since messages are sent along Z, maximum number of hops possible is 8 and hence there are 8 lines on the plot. The Blue Gene/P plot on the right shows that the message latencies for large messages for the 1 hop and 8 hops case can differ by a factor of 8!

As all messages travel more hops, links are shared by a greater number of messages, increasing the contention on the network and decreasing the available effective bandwidth. This is what applications have to deal with during communication in practice. This huge difference between message latencies indicates that it is very important to keep communicating tasks close by and minimize contention on the network [Bhatele09a]. This is especially true for communication bound applications. Next, we look at mapping successes for two applications which have different communication characterstics.

  

Figure 2. Plots showing the results of WICON2 on BG/P

Application-specific Topology-aware Mapping

Dynamic Irregular Communication: NAMD is a production Molecular Dynamics (MD) application used for simulation of bio-molecules [Bhatele08b]. NAMD is parallelized by use of Charm++ objects called patches and computes. The simulation box is spatially divided into smaller cells called patches. The force calculation for every pair of patches is assigned to a different compute. Thus, communication in NAMD consists of section multicasts from patches to computes and back. Every patch multicasts its atom data to multiple computes, whereas each compute receives data from only two patches. Patches are statically assigned to a few processors during start-up and computes are distributed evenly by a load balancer. Let us now see the deployment of topology-aware techniques in the static placement of patches and the load balancers.

Topology placement of patches: Since patches form a geometric decomposition of the simulation space, they constitute a 3D group of objects which can be mapped nicely onto the 3D torus of machines. An ORB (Orthogonal Recursive Bisection) of the torus is used to obtain partitions equal in number to the patches and then, a one-to-one mapping of the patches to the processor partitions is done.

Figure 3. (a) Placement of a compute withinn the inner brick, (b) Improvement in hop-bytes using topology-aware mapping on Blue Gene/P

Topology-aware Load Balancers: Once patches have been statically assigned onto the processor torus, computes which interact with these patches should be placed around them. Consider Figure 3(a), which shows the entire 3D torus on which the job is running. When placing a compute, it should be placed topologically close to the two processors that house the patches it interacts with. The two patches define a smaller brick within the 3D torus (shown in dark grey in the figure). The sum of distances from any processor within this brick to the two patches is minimum.

Figure 3(b) shows the hop-bytes for all messages per iteration when running NAMD on Blue Gene/P on different sized partitions. A standard benchmark used in the MD community was used for the runs: 92,227-atom ApoLipoprotein-A1 (ApoA1). As we would expect, hop-bytes consistently increase as we go from a smaller partition to a larger one. The three strategies compared are: topology oblivious mapping of patches and computes (Topology Oblivious), topology-aware static placement of patches (TopoPlace Patches) and topology-aware placement for both patches and load balancing for computes (TopoAware LDBs).

Topology aware schemes for the placement of patches and the load balancer help in reducing the hop-bytes for all processor counts. Also, the decrease in hop-bytes becomes more significant as we go to larger-sized partitions. This is due to the fact that the average distance traveled by each message increases as we increase the partition size in the case of default mapping; however, it becomes controlled when we do a topology-aware mapping. Since the actual performance of the load balancers depends on several metrics, the question remains as to whether the reduction in hop-bytes leads to an actual improvement in performance. As it turns out, we also see a reduction in the number of proxies and in the max-to-average ratio for topology-aware load balancers, which is reflected in the overall performance of NAMD on Blue Gene/P (see table above). The topology oblivious scheme stops scaling around 4,096 cores and hence, we did not obtain numbers for it beyond that. We see an improvement of 10% at 16,384 cores with the use of topology-aware load balancers. For further details, please refer to [Bhatele09b].

Static Regular Communication: OpenAtom is a fine-grained parallelization of the CPAIMD method to understand dynamics of atoms at a quantum scale [Bohm07]. Computation in OpenAtom is divided into a large number of objects, enabling scaling to tens of thousands of processors. Calculating the electrostatic energy involves computing several terms. Hence, CPAIMD computations involve a large number of phases with high inter-processor communication. These phases are discretized into a large number of objects, which generate a lot of communication, but ensures efficient interleaving of work. The entire computation is divided into ten phases, which are parallelized by decomposing the physical system into fifteen chare arrays.

Since multiple chare arrays interact among one another, the communication dependencies are complex and mapping is a challenging task. OpenAtom provides us with a scenario where the load on each object is static (under the CPAIMD method) and the communication is regular and clearly understood. Hence, it should be possible to intelligently map the arrays in this application to minimize inter-processor communication and maintain load balance. Because of space limitations and a fairly involved mapping acheme, we will not go into the details of how the mapping is done but present performance results.

We studied the strong scaling (fixed problem size) performance of OpenAtom with and without topology aware mapping. Two benchmarks commonly used in the CPMD community: the minimization of WATER_32M_70Ry and WATER_256M_70Ry were used. As shown in the table on the left, performance improvements from topology-aware mapping for Blue Gene/P (BG/P) can be quite significant. As the number of cores and likewise, the diameter of the torus grows, the performance impact increases until there is 40% improvement for WATER_32M_70Ry at 4096 and 50% for WATER_256M_70Ry at 8192 cores. The improvements from topological awareness on Cray XT3, presented in the table on the right, are comparable to those on BG/P. There is an improvement of 20% on XT3 for WATER_256_70Ry at 1024 cores, compared to the improvement of 38% on BG/P at 1024 cores.

Automatic Mapping Framework

The final goal of this research is to utilize the experience from application-specific mapping, in developing an automatic framework which can develop topology-aware mapping solutions. The work on the automatic mapping framework will relieve the application writers of doing the mapping themselves. We also foresee acceptance of the idea that applications running on Cray XT machines will benefit as much as those on Blue Gene machines. This might even led to modification of the batch schedulers on these machines to allocate contiguous 3D mesh partitions for jobs.


 
People
Papers
  • 09-08    Abhinav Bhatele and Laxmikant V. Kale,  Quantifying Network Contention on Large Parallel Machines,  submitted to Parallel Processing Letters (Special Issue on Large-Scale Parallel Processing), 2009.
  • 09-05    Abhinav Bhatele, Laxmikant V. Kale, Nicholas Chen and Ralph E. Johnson,  A Pattern Language for Topology Aware Mapping,  Workshop on Parallel Programming Patterns (ParaPLOP 2009)
  • 09-02    Abhinav Bhatele, Laxmikant V. Kale and Sameer Kumar,  Dynamic Topology Aware Load Balancing Algorithms for Molecular Dynamics Applications,  To appear in 23rd ACM International Conference on Supercomputing, 2009.
  • 09-01    Abhinav Bhatele and Laxmikant V. Kale,  An Evaluative Study on the Effect of Contention on Message Latencies in Large Supercomputers,  Workshop on Large-Scale Parallel Processing (IPDPS), 2009.
  • 08-10    Abhinav Bhatele, Eric Bohm and Laxmikant V. Kale,  A Case Study of Communication Optimizations on 3D Mesh Interconnects,  To appear in Proceedings of Euro-Par (Topic 13 - High Performance Networks), 2009
  • 08-07    Abhinav Bhatele, Laxmikant V. Kale,  Benefits of Topology Aware Mapping for Mesh Interconnects,  Parallel Processing Letters (Special issue on Large-Scale Parallel Processing), Vol: 18 Issue: 4 Pages: 549-566, 2008
  • 08-02    Abhinav Bhatele, Laxmikant V. Kale,  Application-specific Topology-aware Mapping for Three Dimensional Topologies,  Workshop on Large-Scale Parallel Processing (IPDPS), 2008
  • 07-12    Abhinav Bhatele,  Application-specific Topology-aware Mapping and Load Balancing for three-dimensional Torus Topologies,  Master's Thesis, Department of Computer Science, University of Illinois, 2007
Posters
  • 09-01    Topology Aware Task Mapping Techniques: An API and Case Study,  Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale
  • 08-02    Effects of Contention on Message Latencies in Large Supercomputers,  Abhinav Bhatele, Laxmikant V. Kale
  • 08-01    Automatic Topology-Aware Task Mapping for Parallel Applications Running on Large Parallel Machines,  Abhinav Bhatele, Laxmikant V. Kale
Related Links

This page maintained by Abhinav Bhatele. Back to the PPL Research Page