Evaluating HPC Networks via Simulation of Parallel Workloads
International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2016
Publication Type: Paper
Repository URL: papers/201409-cost-perf
This paper presents an evaluation and comparison of three topologies that are popular for building interconnection networks in large-scale supercomputers: torus, fat-tree, and dragonfly. To perform this evaluation for production multi-job workloads, a comprehensive methodology and a scalable packet-level network simulator, TraceR, are presented. Our methodology includes design of prototype systems that are being evaluated, use of proxy applications to determine computation and communication load, simulating individual proxy applications and multi-job workloads, and combining aggregate performance metrics with network procurement costs. Using the proposed methodology, torus, fat-tree, and dragonfly based prototype systems with up to 730K endpoints (MPI ranks) executed on 46K nodes are compared for multi-job workloads from capability and capacity systems. The fat-tree topology is shown to provide the highest performance per dollar for a 180 Petaflop/s system due to its consistently good performance and lower cost.
Research Areas