End-to-end Performance Modeling of Distributed GPU Applications
International Conference on Supercomputing (ICS) 2020
Publication Type: Paper
Repository URL:
Abstract
With the growing number of GPU-based supercomputing platforms and GPU-enabled applications, the ability to accurately model the performance of such applications is becoming increasingly important. Most current performance models for GPU-enabled applications are limited to single-node performance. In this work, we propose a methodology for end-to-end performance modeling of distributed GPU applications. Our work strives to create performance models that are both accurate and easily applicable to any distributed GPU application. We combine trace-driven simulation of MPI communication, based on the TraceR-CODES framework, with a profiling-based roofline model for GPU kernels. We make substantial modifications to these models to capture the complex effects of both on-node and off-node networks in today's multi-GPU supercomputers. We validate our model against empirical data from GPU platforms, and we vary the model's tunable parameters to observe how they affect application performance.
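To make the combination described in the abstract concrete, here is a minimal sketch of how a roofline estimate for GPU kernels might be added to a communication estimate. It is not the authors' model: the paper uses trace-driven simulation with TraceR-CODES and a modified roofline, while this sketch assumes the textbook roofline bound and substitutes a simple latency-bandwidth formula for the network simulation. All function names and hardware numbers are hypothetical and chosen only for illustration.

```python
def roofline_kernel_time(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Textbook roofline bound: a kernel is limited by compute or by memory traffic."""
    compute_time = flops / peak_flops            # seconds if compute-bound
    memory_time = bytes_moved / peak_bandwidth   # seconds if memory-bound
    return max(compute_time, memory_time)


def message_time(message_bytes, latency, bandwidth):
    """Latency-bandwidth estimate for one message (a stand-in for network simulation)."""
    return latency + message_bytes / bandwidth


def end_to_end_time(kernels, messages, gpu, link):
    """Sum kernel and message estimates, assuming no compute/communication overlap."""
    compute = sum(
        roofline_kernel_time(k["flops"], k["bytes"], gpu["peak_flops"], gpu["peak_bandwidth"])
        for k in kernels
    )
    communication = sum(
        message_time(size, link["latency"], link["bandwidth"]) for size in messages
    )
    return compute + communication


# Hypothetical hardware parameters (illustrative, not measured values).
gpu = {"peak_flops": 1e13, "peak_bandwidth": 1e12}   # FLOP/s, bytes/s
link = {"latency": 1e-6, "bandwidth": 1e10}          # seconds, bytes/s

kernels = [{"flops": 2e9, "bytes": 4e9}]             # one memory-bound kernel
messages = [1e6]                                     # one 1 MB message

total = end_to_end_time(kernels, messages, gpu, link)
print(f"estimated time per iteration: {total:.6f} s")
```

Varying `gpu` or `link` in this sketch mirrors, in a very reduced form, the idea of changing tunable model parameters to see how predicted application performance responds.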
People
- Jaemin Choi
- David Richards
- Laxmikant Kale
- Abhinav Bhatele
Research Areas