Optimal Bucket Algorithms for Large MPI Collectives on Torus Interconnects
International Conference on Supercomputing (ICS) 2010
Publication Type: Paper
Repository URL:
Collectives are an important and frequently used component of MPI. Bucket algorithms, also known as "large vector" algorithms, were introduced in the early 90's and have since evolved as a well known paradigm for large MPI collectives. Many modern day supercomputers such as the IBM Blue Gene and Cray XT are based on torus interconnects that offer a highly scalable interconnection architecture for distributed memory systems. While near optimal algorithms have been developed for torus interconnects in other paradigms, such as spanning trees, bucket algorithms have not been optimally extended to these networks. In this paper, we study the basic "divide, distribute and gather" MPI collectives for bucket algorithms -- Allgather, Reduce-scatter and Allreduce -- for large messages on torus interconnects. We present bucket-based algorithms for these collectives on bidirectional links. We show that these algorithms are optimal in terms of bandwidth and computation for symmetric torus networks (i.e. when all the dimensions are equal), matching the theoretical lower bounds For an asymmetric torus, our algorithms are asymptotically optimal and converge to the lower bound for large dimension sizes. We also argue that our bucket algorithms are more scalable on multi-cores in comparison to spanning tree algorithms. Previous studies of bucket algorithms on torus interconnects have focused on unidirectional links and have been unable to obtain tight lower bounds and optimal algorithms. We close this gap by providing stronger lower bounds and showing that our bidirectional algorithms can easily be adapted to the unidirectional case, matching our lower bounds in terms of bandwidth and computational complexity. We implement our algorithms on the IBM Blue Gene/P Supercomputer, which has quad-core nodes connected in a 3-dimensional torus, using the low level communication interface. We demonstrate that our algorithms perform within 7--30% of the lower bounds for different MPI collectives. We demonstrate good scaling using multicores. We also demonstrate a factor of 3 to 17 speedup for various collectives in comparison to the latest optimized MPI implementation.
Research Areas