Parallel Programming Laboratory

Evaluating HPC Networks via Simulation of Parallel Workloads

International Conference for High Performance Computing, Networking, Storage and Analysis (SC) 2016

Publication Type: Paper

Repository URL: papers/201409-cost-perf

Abstract

This paper presents an evaluation and comparison of three topologies that are popular for building interconnection networks in large-scale supercomputers: torus, fat-tree, and dragonfly. To perform this evaluation for production multi-job workloads, a comprehensive methodology and a scalable packet-level network simulator, TraceR, are presented. Our methodology includes design of prototype systems that are being evaluated, use of proxy applications to determine computation and communication load, simulating individual proxy applications and multi-job workloads, and combining aggregate performance metrics with network procurement costs. Using the proposed methodology, torus, fat-tree, and dragonfly based prototype systems with up to 730K endpoints (MPI ranks) executed on 46K nodes are compared for multi-job workloads from capability and capacity systems. The fat-tree topology is shown to provide the highest performance per dollar for a 180 Petaflop/s system due to its consistently good performance and lower cost.
