Topology-aware mapping requires information about two things: the processor
graph and the object graph (or communication graph). The idea is to obtain
this information automatically at runtime. This furthers the goal of automatic
mapping by the runtime, hidden from the application writer. In addition, the
application itself can use this information for communication optimizations.
Information about the topology of the machine (such as the dimensions of the
3D mesh/torus) is needed to map objects or VPs to processors. The application
should be able to query the runtime for information such as the dimensions of
the allocated processor partition and the mapping of ranks to physical nodes.
However, the mapping interface should be simple and should hide
machine-specific details from the application. We have implemented a "Topology
Manager" interface which gives useful information for torus interconnects like
Blue Gene/L, XT3 and Blue Gene/P and for n-way SMP machines like NCSA's Abe and
TACC's Ranger.
We now describe an API, which we call the Topology Manager, that any
application can use for mapping objects to processors. In this paper,
the API is used in the context of Charm++ applications; however, it is generic
and can be used in any parallel program.
The API provides different functions which can be grouped into the following
categories:
- Size and properties of the allocated partition: At runtime, the
application needs to know the dimensions of the allocated partition (getDimNX,
getDimNY, getDimNZ), number of cores per node (getDimNT) and whether we have a
torus or mesh in each dimension (isTorusX, isTorusY, isTorusZ).
- Properties of an individual node: The interface also provides simple
calls to convert from ranks to physical coordinates and vice versa
(rankToCoordinates, coordinatesToRank).
- Additional functionality: Mapping algorithms often need to calculate the
number of hops between two ranks or pick the closest rank to a given rank from
a list. Hence, the API provides functions like getHopsBetweenRanks,
pickClosestRank and sortRanksByHops to facilitate mapping algorithms.
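To make the three categories concrete, the following is a minimal sketch of how such an interface could behave on a 3D torus/mesh. It is not the Charm++ implementation: the struct MockTopoManager, its fields, the assumed rank layout (core index varies fastest, then X, Y, Z) and the exact parameter lists are hypothetical and chosen only for illustration. Only the function names are taken from the list above.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical stand-in for the Topology Manager. NOT the Charm++ code;
// only the function names come from the text.
struct MockTopoManager {
    int nx, ny, nz, nt;           // partition dimensions and cores per node
    bool torusX, torusY, torusZ;  // wraparound links in each dimension

    // Size and properties of the allocated partition.
    int getDimNX() const { return nx; }
    int getDimNY() const { return ny; }
    int getDimNZ() const { return nz; }
    int getDimNT() const { return nt; }
    bool isTorusX() const { return torusX; }
    bool isTorusY() const { return torusY; }
    bool isTorusZ() const { return torusZ; }

    // Assumed layout: t varies fastest, then x, then y, then z.
    void rankToCoordinates(int rank, int &x, int &y, int &z, int &t) const {
        assert(rank >= 0 && rank < nx * ny * nz * nt);
        t = rank % nt; rank /= nt;
        x = rank % nx; rank /= nx;
        y = rank % ny;
        z = rank / ny;
    }

    int coordinatesToRank(int x, int y, int z, int t) const {
        return t + nt * (x + nx * (y + ny * z));
    }

    // Distance along one dimension; on a torus, take the shorter way around.
    static int hops1D(int a, int b, int dim, bool torus) {
        int d = std::abs(a - b);
        return (torus && dim - d < d) ? dim - d : d;
    }

    // Network hops between the nodes hosting two ranks (cores on the same
    // node are zero hops apart).
    int getHopsBetweenRanks(int r1, int r2) const {
        int x1, y1, z1, t1, x2, y2, z2, t2;
        rankToCoordinates(r1, x1, y1, z1, t1);
        rankToCoordinates(r2, x2, y2, z2, t2);
        return hops1D(x1, x2, nx, torusX) + hops1D(y1, y2, ny, torusY) +
               hops1D(z1, z2, nz, torusZ);
    }

    // Returns the candidate with the fewest hops from src (first one on ties).
    int pickClosestRank(int src, const std::vector<int> &candidates) const {
        assert(!candidates.empty());
        int best = candidates[0];
        for (int r : candidates)
            if (getHopsBetweenRanks(src, r) < getHopsBetweenRanks(src, best))
                best = r;
        return best;
    }
};
```

A mapping algorithm written against these calls never needs to know which machine it runs on. For example, with an 8 x 8 x 8 partition that is a torus in X, the nodes at X = 0 and X = 7 are one hop apart rather than seven.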
We now discuss the process of extracting this information from the system at
runtime and why it is useful to use the Topology Manager API on different
machines:
IBM Blue Gene machines: On Blue Gene/L and Blue Gene/P, topology
information is available through system calls to the "BGLPersonality" and
"BGPPersonality" data structures, respectively. Using the Topology Manager API
instead of the system calls is useful for two reasons. First, these system
calls can be expensive (especially on Blue Gene/L), so it is advisable to
avoid making too many of them. The API makes a few system calls to obtain
enough information to construct the topology information itself. Once an
instance of the TopoManager class has been created, it does not make any
further system calls. Second, on Blue Gene/L and Blue Gene/P there is a
smallest partition size that can be allocated (32 nodes on the Watson BG/L and
64 nodes on the ANL BG/P). If fewer nodes than this smallest unit are
requested, the smallest partition is still allocated, though the application
uses only a subset of its nodes. In these cases, the lower-level system calls
give information about the entire booted partition and not the nodes actually
being used. Our API calculates which portion of the allocated partition is in
use and gives the correct information.
Cray XT machines: Cray machines have been designed with significant
overall bandwidth and, possibly for this reason, documentation for topology
information was not readily available at the installations we used. (We thank
Shawn Brown from PSC, Larry Kaplan from Cray Inc. and William Renaud at ORNL
for helping us obtain topology information through personal communication.)
There is no documentation or publication which provides a user with system
calls to obtain topology information on these machines. We hope that the
information provided here will be useful to other application programmers.
Obtaining topology information on XT machines is a two-step process:
1. Getting the node ID (nid) corresponding to a given MPI rank (pid), which
tells us which physical node a given MPI rank is on. This is done through
different system calls on XT3 and XT4, respectively: cnos_get_nidpid_map,
available through "catamount/cnos_mpi_os.h", and PMI_Portals_get_nidpid_map,
available from "pmi.h". These calls provide a map of all ranks in the current
job to their corresponding node IDs.
2. Obtaining the physical coordinates for a given node ID. This is done with
the system call rca_get_meshcoord from "rca_lib.h".
Once we have the physical coordinates for all ranks in the job, the API
derives information such as the extent of the allocated partition by itself
(this assumes that the machine has been reserved and we have a contiguous
partition). Using the size of the partition and the size of the total machine
(11 X 12 X 16 for BigBen and 21 X 16 X 24 for Jaguar), the API can tell if we
have a mesh or torus in each direction. Again, once the TopoManager object is
instantiated, it stores all this information and does not make system calls
again.
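The derivation of the partition extent and of the mesh/torus property can be sketched as follows. This is an illustrative reconstruction, not the TopoManager source: the names Coord, PartitionInfo and derivePartition are hypothetical, the coordinates are assumed to have been collected already (for example via the calls named above), and the rule "a dimension is a torus only if the partition spans the full machine in that dimension" is an assumption about how the comparison with the machine size is used. Like the text, it assumes a contiguous partition.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

using Coord = std::array<int, 3>;  // (x, y, z) of one physical node

struct PartitionInfo {
    Coord origin;                 // smallest coordinate seen in each dimension
    Coord extent;                 // number of nodes spanned in each dimension
    std::array<bool, 3> isTorus;  // assumed: wraparound only if full span
};

// Hypothetical helper: derive the bounding box of the job's nodes and
// compare it with the dimensions of the whole machine.
PartitionInfo derivePartition(const std::vector<Coord> &coords,
                              const Coord &machine) {
    assert(!coords.empty());
    Coord lo = coords[0], hi = coords[0];
    for (const Coord &c : coords) {
        for (int d = 0; d < 3; ++d) {
            lo[d] = std::min(lo[d], c[d]);
            hi[d] = std::max(hi[d], c[d]);
        }
    }
    PartitionInfo p{};
    for (int d = 0; d < 3; ++d) {
        p.origin[d] = lo[d];
        p.extent[d] = hi[d] - lo[d] + 1;
        // Wraparound links are usable only when the partition covers the
        // entire machine along this dimension; otherwise treat it as a mesh.
        p.isTorus[d] = (p.extent[d] == machine[d]);
    }
    return p;
}
```

For instance, a job on a machine of size 11 X 12 X 16 that occupies every X and Z position but only four consecutive Y positions would be reported as a torus in X and Z and a mesh in Y.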
The Topology Manager API provides a uniform interface to the application
developer, so the application only needs to know that it is running on a 3D
torus or mesh topology. Application-specific task mapping decisions require no
architecture- or machine-specific knowledge (BG/L or XT3, for example), because
the application does not need to use the lower-level system calls for topology
information. Our API provides an easy-to-use wrapper for this functionality.
Obtaining the object graph is not as simple. We have used two methods for this
so far: (1) using the information available from the application writers
directly, and (2) using the load balancing framework in Charm++ to instrument
and record this information. Obtaining this information automatically is
a part of the proposed work.