Improving Scalability with GPU-Aware Asynchronous Tasks
International Workshop on High-Level Parallel Programming Models and Supportive Environments at IPDPS (HIPS) 2022
Publication Type: Paper
Repository URL: https://arxiv.org/abs/2202.11819
Abstract
Asynchronous tasks, when created with overdecomposition, enable automatic computation-communication overlap, which can substantially improve performance and scalability. This applies not only to traditional CPU-based systems but also to modern GPU-accelerated platforms. While the ability to hide communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks, in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using Jacobi3D, a proxy application that performs the Jacobi iterative method on GPUs. In addition to optimizations for minimizing host-device synchronization and increasing the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to combat fine-grained overheads at scale.
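The Jacobi3D source itself is not reproduced here. As a rough, hedged illustration of the 7-point Jacobi update such a proxy performs on each block of the 3D grid, here is a minimal NumPy sketch; the function name, grid size, and boundary setup are my own and are not taken from the paper.

```python
import numpy as np

def jacobi3d_step(u):
    """One Jacobi sweep of the 7-point stencil: each interior point
    is replaced by the average of its six face neighbors.
    Boundary values are held fixed (Dirichlet conditions)."""
    new = u.copy()
    new[1:-1, 1:-1, 1:-1] = (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +   # -x / +x neighbors
        u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +   # -y / +y neighbors
        u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]     # -z / +z neighbors
    ) / 6.0
    return new

# Hypothetical setup: an 8^3 grid, zero everywhere except one face
# held at 1.0; repeated sweeps diffuse that value into the interior.
n = 8
u = np.zeros((n, n, n))
u[0, :, :] = 1.0
for _ in range(50):
    u = jacobi3d_step(u)
```

In a GPU implementation of this pattern, each overdecomposed block would run this update as a kernel on its own interior and exchange halo (ghost) faces with neighboring blocks; that halo exchange is the communication the paper overlaps with computation and makes GPU-aware.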
TextRef
https://arxiv.org/abs/2202.11819
People
- Jaemin Choi
- David Richards
- Laxmikant Kale
Research Areas