1 On the Mapping Problem, Shahid H Bokhari, IEEE Transactions on Computers, Vol. C-30, No. 3, March 1981 [pdf]
Graph Isomorphism: One-to-one and onto mapping between vertices of two
graphs G and H such that two vertices are adjacent in G if and only if the
corresponding vertices are adjacent in H. --- [link]
Bandwidth Reduction problem --- [link]
Quadratic assignment problem (mapping of facilities to locations) --- [link]
Assumption in this problem: No. of processes is less than or equal to the
number of processors (specifically array processors)
Quality of mapping is determined by the number of problem edges that fall on
array edges. This number is called the cardinality.
Brute force method: tries pairwise interchange of a node with every other
node to improve the cardinality
Time complexity: O(N^3) where N is the number of nodes in the FEM (Finite
Element Machine)
Summary: This paper uses the method of pairwise exchanges between
the problem nodes and random jumps (to avoid dead ends) to come up with a
mapping of the problem to system graph. This solution works only for
topological variations and the metric used is cardinality (number of problem
edges that fall on the array edges). Results on a specific array processor
(FEM) - each node is connected to eight other nodes.
2 On Mapping Parallel Algorithms into Parallel Architectures, Francis Berman and Lawrence Snyder, Journal of Parallel and Distributed Computing, Vol. 4, Pages 439--458, May 1987 [pdf]
Shuffle exchange networks --- [link]
Type 2 grammars are context-free grammars (CFGs) accepted by Push Down
Automata (PDAs)
Summary: Mostly theoretical work where the problem to be mapped
should be representable as a contractable edge grammar. Try to solve both
cardinality and topological variation by first contracting a graph, laying
out this one and then multiplexing the original one it. Results are upper
bounds for the mappings produced (on the contraction weighting functions and
layout expansion costs.
3 A Mapping Strategy for Parallel Processing, Soo-Young Lee and J. K. Aggarwal, IEEE Transactions on Computers, Vol. C-36, No. 4, April 1987 [pdf]
This paper comes up with some objective functions by dividing all parallel
processing tasks into different kinds: 1. only one problem edge is needed
in each communication phase, 2. there is only one phase, 3. type 1 and 2
combined.
Mapping for image processing (model, objective function and algorithm)
Summary: The algorithm is similar to Bokhari's: 1. works only for 1
process per processor, 2. does an initial assignment (greedy strategy) and
then pairwise interchange (to see if OF can be lowered). Results on
mappings of some example graphs onto a 8-node and 16-node hypercube.
4 Interprocessor Traffic Scheduling Algorithm for Multiple-Processor Networks, Ronald P. Bianchini Jr. and John Paul Shen, IEEE Transactions on Computers, Vol. C-36, No. 4, April 1987 [pdf]
Summary: This paper is not on mapping but concerned with scheduling
traffic on links given a mapping of processes to processors.
5 Processor and Link Assignment in Multicomputers using Simulated Annealing, S. Wayne Bollinger and Scott F. Midkiff, International Conference on Parellel Processing, Pages 1-7, 1988 [pdf]
On Simulated Annealing --- [link]
Simulated annealing is used for process mapping and traffic scheduling
(governed by the traffic rules of the host architecture)
Summary: An initial mapping and then optimizations using annealing is
the idea. Pairwise exchanges are done here too both for processor and link
assignment. Objective functions take the communication contention at a link
into account. First paper which looks at traffic scheduling. Results are for
mapping hypercube and tree image graphs onto a hypercube (whose size is not
mentioned, probably always larger than the image/graph size)
6 On the Communication Complexity of Generalized 2-D Convolution on Array Processors, Zhixi Fang, Xiaobo Li and Lionel M. Ni, IEEE Transactions on Computers, Vol. 38, No. 2, February 1989 [pdf]
Generalized 2-D Convolution: Used in image-processing, object labeling,
morphological operations (dialtion and erosion) --- [link]
Summary: This paper is not on mapping exactly but on transformation
of general convolution algorithms so that communication gets optimized
depending on the network we are running on (mesh, hypercube or
shuffle-exchange)
7 Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning, F. Ercal, J. Ramanujam and P. Sadayappan, Journal of Parallel and Distributed Computing, Vol. 10, No. 1, Pages 35-44, 1990 [pdf]
Uses the Kernighan-Lin mincut bisection heuristic
Handles both communication and load balancing
Two kinds of cost models: minimax cost model and summed total
cost model. In the minimax model, the maximum cost of communication +
computation is minimized. In the summed total cost model, the objective
function is the sum of the penalty for communication and the deviation from
perfect load balance.
Summary: Two approaches to mincut heuristic: two-phase approach, 2PM
(clustering tasks into 'no-of-processors' clusters and processor assignment)
and direct approach called ARM (partial processor assignments as
partitioning is being done). ARM is done in two phases: Cut-minimization
phase and Fine-tuning phase. 2PM and ARM are compared to each other and to
simulated annealing.
8 Heuristic Technique for Processor and Link Assignment in Multicomputers, S. Wayne Bollinger and Scott F. Midkiff, IEEE Transactions on Computers, Vol. 40, No. 3, March 1991 [pdf]
9 Graph Contraction for Physical Optimization Methods: A Quality-Cost Tradeoff for Mapping Data on Parallel Computers, N. Mansour, R. Ponnusamy, A. Choudhary and G. C. Fox, Proceedings of the 7th international conference on Supercomputing, 1993 [pdf]
10 Genetic Algorithm Based Heuristics for the Mapping Problem, T. Chockalingam and S. Arunkumar, Computers and Operations Research, Vol. 22, Issue 1, Pages 55-64, January 1995 [pdf]
Summary: This paper has a nice literature survey of the field and discusses the
ideas of genetic algorithms applied to the NP-hard mapping problem.
11 PE mapping and the congestion problem in the T3E, M. Muller and Michael Reschand, Proceedings of the Fourth European Cray-SGI MPP Workshop, 1998 [pdf]
Summary: Discusses the effect of pe mappings on the bandwidth on Cray T3E.
They discuss a simple algorithm to come up with good reordering of pe ranks by
sorting the ranks on the basis of their logical co-ordinates (x, y, z).
12 Impact of PE Mapping on Cray T3E Message-Passing Performance, Eduardo Huedo, Manuel Prieto, Ignacio M. Llorente, and Francisco Tirado, Proceedings from the 6th International Euro-Par Conference on Parallel Processing, 2000 [pdf]
Summary: Work similar to the paper above except that they use a different
greedy strategy to arrive at pe reorderings.
13 A New Heuristic for the Process-Processor Mapping Problem, Zoltan Juhasz and Stephen J. Turner, The Kluwer International Series In Engineering And Computer Science, Pages 91-94, 2000 [pdf]
14 Topology-Aware Algorithms for Large-Scale Communication, Luis Rodrigues and Paulo Verissimo, Lecture Notes in Computer Science 1752, Springer, Pages 127-156, 2000 [pdf]
15 A New Task Mapping Technique for Communication-Aware Scheduling Strategies, J. M. Orduna, F. Silla and J. Duato, Parallel Processing Workshops, 2001 [pdf]
16 On the development of a communication-aware task mapping technique, Juan Manuel Orduna, Federico Silla and Jose Duato, Journal of Systems Architecture, Vol. 50, Pages 207-220, 2004 [pdf]
17 Optimizing task layout on the Blue Gene/L supercomputer, G. Bhanot, A. Gara, P. Heidelberger, E. Lawless, J. C. Sexton and R. Walkup, IBM Journal of Research and Development, Vol. 49, No. 2/3, Pages 489--500, May 2005 [pdf]
Difference from older work: Larger than target architectures considered earlier,
3D torus adds to the complexity, less attention has been paid to situations
where distance between processors is important (most work on load balance
and minimizing inter-module communication)
Objective fuction: communicated bytes * distance (hop-bytes). Distance
can be hop-count or 3/D*hop-count where D is the number of dimensions used
for message sending or the Euclidean distance metric
Approach: Map a domain and all its neighbors next to it. Map another one and
its neighbors around it and so on. We can also map one domain, its neighbors
and their neighbors and so on (this is done taking load balancing into
account). Then do simulated annealing
Summary: The approach is to create a heuristic map (described above)
and then use simulated annealing over it, if needed. Since SA takes a long
time, so METIS is used to create smaller sub-graphs over which SA is
performed. They show an improvement of nearly 1.7 times in the communication
time for different applications.
NOTES: For SAGE, they do not consider the full code with AMR. Also they show
improvements in communication time not overall running time. Comparisons with
MPI rank order and random mapping.
18 Performance Effects of Node Mappings on the IBM Blue Gene/L Machine, Brian E. Smith and Brett Bode, Lecture Notes in Computer Science 3648, Springer, Pages 1005-1013, 2005 [pdf]
Summary: A very simple paper on how changing the order in which
ranks are mapped to actual physical nodes on BG/L affects performance.
This is because of the communication characterstics of specific NAS
benchmarks. It is seen that for toruses with a large X dimension (like 4K
nodes which is a 32 X 8 X 16 torus), it is best if the mapping is YZX so
that the latency in X is hidden. Apart from permutations of X, Y and Z,
also discussed are gray code mappings and random ones.
19 Topology-aware task mapping for reducing communication contention on large parallel machines, Tarun Agarwal, Amit Sharma and Laxmikant V. Kale, Parallel and Distributed Processing Symposium, April 2006 [pdf]
20 Topology Mapping for Blue Gene/L Supercomputer, Hao Yu, I-Hsin Chung and Jose Moreira, Proceedings of the ACM/IEEE Supercomputing Conference, 2006 [pdf]
Summary: This work has no information about the communication
patterns of the application. The focus of the paper is embedding of virtual guest
graphs (rings, lines, grids, tori) to guest graphs (2D and 32D tori). I think a
lot of the work has been done before and this paper just outlines the
implementation on Blue Gene/L. Results are shown on NAS benchmarks (I guess it is
easy to come up with virtual topologies for them) up to 4096 processors.
21 Topology-Aware Parallel Molecular Dynamics Simulation Algorithm, Hideaki Kikuchi, Bijaya B. Karki and Subhash Saini, The 2006 International Conference on Parallel and Distributed Processing Techniques and Applications, 2006 [pdf]
22 Topology-aware Tile Mapping for Clusters of SMPs, Daniel Chavarria-Miranda, Jarek Nieplocha and Vinod Tipparaju, Proceedings of the 3rd conference on Computing frontiers, 2006 [pdf]
This work is not exactly on topology-aware mapping but smp-aware mapping.
The study was done by creating a mapping interface in Global Arrays and
it used the advantages of SMP shared memory through ARMCI.
Target Benchmarks: with near neighbor communication Jacobi-like
Summary: The idea is to create tiles of the array instead of
coloumns which reduces the surface-to-volume ratio. These tiles can then
be mapped to processor groups in the SMP cluster. Simulations show the
reduction in inter-node communication and real results on a 16-way SMP
show up to 30% improvement in performance.
23 , and, [pdf]
24 , and, [pdf]
|