Techniques for communication optimization of parallel programs in an adaptive runtime system
Thesis 2020
Publication Type: PhD Thesis
Repository URL: http://hdl.handle.net/2142/108622
Abstract
With the continuation of Moore's law and the presumed end of single-core performance improvements, high performance computing (HPC) has turned to increased on-node parallelism in order to harness ever growing transistor counts and address ever larger problems. While this has sustained the growth of overall compute performance, supercomputer networks have lagged far behind in their development and are now often the primary bottleneck to performance and scalability in modern HPC applications. New machines are consistently built with 'deeper' nodes that improve single-node compute performance, as measured by achievable floating point operations per second (FLOPs), relative to earlier generations, without a corresponding increase in network bandwidth or a sufficient decrease in latency. This unequal growth has previously been partially addressed by partitioning duties between a shared-memory runtime at the node level, e.g. OpenMP, and a distributed-memory communication layer, e.g. MPI, in a model known as MPI+X. In this work, we present an alternative approach to improving the performance of modern HPC applications on current generation supercomputer networks. We combine several benefits of the Charm++ programming model, namely overdecomposition, with OpenMP and the ability to 'spread' work across several cores. This allows applications to smoothly inject messages onto the network, constantly overlapping communication with compute phases; this overlap is the central focus of this work. We further describe a complementary suite of techniques to fully utilize modern supercomputers and balance FLOPs against communication, developing these techniques through micro-benchmark studies and integrating them into the production-scale Charm++ runtime. We then turn from internode communication optimization to apply these same techniques to intranode communication between hardware devices, i.e. CPUs and graphics processing units (GPUs). Finally, we discuss many of the tradeoffs of these approaches and attempt to quantify their general effect. While embodied in the Charm++ runtime system, these ideas are applicable to a wide swath of communication-bound applications, a class of programs we expect only to grow as the gap between node and network performance continues to widen.
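The central mechanism named in the abstract, overdecomposition, can be made concrete with a small sketch. The following minimal Charm++ chare-array example is not drawn from the thesis; the names Block, recvGhost, and numBlocks are hypothetical. It illustrates the general idea: decomposing a domain into many more chares than cores lets the message-driven scheduler run one element's compute while another element's messages are still in flight.

```cpp
// block.ci (Charm++ interface file, shown here as a comment):
//
//   mainmodule block {
//     readonly int numBlocks;
//     array [1D] Block {
//       entry Block();
//       entry void step();
//       entry void recvGhost(double ghost);
//     };
//   };

#include "block.decl.h"   // generated by charmc from block.ci

/*readonly*/ int numBlocks;  // set by the main chare before array creation

class Block : public CBase_Block {
  double value;                     // this element's local state
 public:
  Block() : value(thisIndex) {}
  Block(CkMigrateMessage*) {}       // required for migratable chare arrays

  // Asynchronously send our boundary value to the right neighbor.
  // The send returns immediately; because there are many more chares
  // than cores, the scheduler can run another element's compute while
  // this message is on the network -- the communication/compute
  // overlap that overdecomposition enables.
  void step() {
    thisProxy[(thisIndex + 1) % numBlocks].recvGhost(value);
  }

  // Message-driven entry method: runs whenever the neighbor's ghost
  // value arrives, independently of the other array elements.
  void recvGhost(double ghost) {
    value = 0.5 * (value + ghost);  // stand-in for real computation
  }
};

#include "block.def.h"    // generated definitions close the module
```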