Improving Scalability with GPU-Aware Asynchronous Tasks
International Workshop on High-Level Parallel Programming Models and Supportive Environments at IPDPS (HIPS) 2022
Publication Type: Paper
Repository URL: https://arxiv.org/abs/2202.11819
Abstract
Asynchronous tasks, when created with overdecomposition, enable automatic computation-communication overlap, which can substantially improve performance and scalability. This applies not only to traditional CPU-based systems but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks, in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using Jacobi3D, a proxy application that performs the Jacobi iterative method on GPUs. In addition to optimizations for minimizing host-device synchronization and increasing the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to combat fine-grained overheads at scale.
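The Jacobi3D source itself is not reproduced here. As a rough, hedged illustration of the 7-point Jacobi update such a proxy performs on each block of the 3D grid, here is a minimal NumPy sketch; the function name, grid size, and boundary setup are my own and are not taken from the paper.

```python
import numpy as np

def jacobi3d_step(u):
    """One Jacobi sweep of the 7-point stencil: each interior point
    is replaced by the average of its six face neighbors.
    Boundary values are held fixed (Dirichlet conditions)."""
    new = u.copy()
    new[1:-1, 1:-1, 1:-1] = (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +   # -x / +x neighbors
        u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +   # -y / +y neighbors
        u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]     # -z / +z neighbors
    ) / 6.0
    return new

# Hypothetical setup: an 8^3 grid, zero everywhere except one face
# held at 1.0; repeated sweeps diffuse that value into the interior.
n = 8
u = np.zeros((n, n, n))
u[0, :, :] = 1.0
for _ in range(50):
    u = jacobi3d_step(u)
```

In a GPU implementation of this pattern, each overdecomposed block would run this update as a kernel on its own interior and exchange halo (ghost) faces with neighboring blocks; that halo exchange is the communication the paper overlaps with computation and makes GPU-aware.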
TextRef
https://arxiv.org/abs/2202.11819
People
- Jaemin Choi
- David Richards
- Laxmikant Kale
Research Areas