The fastest supercomputers today, such as Blue Gene/L and XT3, are connected
by a 3-dimensional torus/mesh interconnect. Applications running on these machines
can benefit from topology-aware mapping of tasks to
processors at runtime. By co-locating communicating tasks on nearby
processors, the distance traveled by messages and hence the communication
traffic can be minimized, thereby reducing communication latency and contention
on the network. This paper describes preliminary work using this technique and
the resulting performance improvements in the context of an n-dimensional k-point
stencil program. It shows that for a fine-grained application with a high
communication-to-computation ratio, topology-aware mapping has a significant
impact on performance. Automated topology-aware mapping by the runtime, using similar
ideas, can relieve the application writer of this burden and result in better
performance. Preliminary work towards achieving this for a molecular dynamics
application, NAMD, is also presented. Results on up to 32,768 processors of IBM's Blue
Gene/L and 2,048 processors of Cray's XT3 support the ideas discussed in the
paper.
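
As a rough illustration of why co-locating communicating tasks reduces the distance traveled by messages, the following minimal sketch compares the average hop count of a 3-dimensional 7-point stencil under two mappings onto a small torus. It is not the mapping scheme or code used in this work. The 4 x 4 x 4 torus size, the periodic stencil boundaries, the random baseline, and all function names (torus_distance, stencil_neighbors, average_hops) are assumptions made for the example.

```python
import itertools
import random

def torus_distance(a, b, dims):
    """Hop count between coordinates a and b on a torus with wraparound links."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def stencil_neighbors(task, dims):
    """Neighbors of a task in a periodic 7-point stencil (two per axis)."""
    for axis in range(len(dims)):
        for step in (-1, 1):
            n = list(task)
            n[axis] = (n[axis] + step) % dims[axis]
            yield tuple(n)

def average_hops(mapping, dims):
    """Average hops per message if each task sends one message to each neighbor."""
    total = 0
    count = 0
    for task, proc in mapping.items():
        for nb in stencil_neighbors(task, dims):
            total += torus_distance(proc, mapping[nb], dims)
            count += 1
    return total / count

# Assumed problem: a 4x4x4 task grid placed on a 4x4x4 torus of processors.
dims = (4, 4, 4)
coords = list(itertools.product(*(range(d) for d in dims)))

# Topology-aware placement: task (x, y, z) runs on processor (x, y, z),
# so every stencil neighbor is exactly one hop away.
aware_mapping = {c: c for c in coords}

# Topology-oblivious baseline: tasks are scattered over processors at random.
rng = random.Random(0)
shuffled = coords[:]
rng.shuffle(shuffled)
random_mapping = dict(zip(coords, shuffled))

aware_hops = average_hops(aware_mapping, dims)
random_hops = average_hops(random_mapping, dims)
print("average hops, topology-aware:", aware_hops)
print("average hops, random:", random_hops)
```

The sketch measures only hop counts. It does not model link contention or message sizes, both of which also affect the performance discussed above.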