Achieving High Performance on Extremely Large Parallel Machines: Performance Prediction and Load Balancing

PPL Paper Number: 05-06
PPL CVS: gengbin-thesis

Authors:
Gengbin Zheng
Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

Ph.D. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 2005


Abstract

Parallel machines with an extremely large number of processors (at least tens of thousands processors) are now in operation. For example, the IBM BlueGene/L machine with 128K processors is currently being deployed. It is going to be a significant challenge for application developers to write parallel programs in order to exploit the enormous compute power available and manually scale their applications on such machines. Solving these problems involves finding suitable parallel programming models for such machines and addressing issues like load imbalance. In this thesis, we explore Charm++ programming model and its migratable objects for programming such machines and dynamic load balancing techniques to help parallel applications to easily scale on a large number of processors. We also present a parallel simulator that is capable of predicting parallel performance to help analysis and tuning of the parallel performance and facilitate the development of new load balancing techniques, even before such machines are built.

We evaluate the idea of virtualization and its usefulness in helping a programmer to write applications with high degree of parallelism. We demonstrate it by developing several mini-applications with million-way parallelism. We show that Charm++ and AMPI (an extension to MPI) with migratable objects and support for load balancing are suitable programming model for programming very large machines.

This thesis explores Parallel Discrete Event Simulation (PDES) techniques with an optimistic synchronization protocol to simulate parallel applications running on a very large number of processors. We optimize the synchronization protocol by exploiting the inherent determinacy that is normally found in parallel applications to reduce the synchronization overhead significantly.

We study load balancing techniques and develop a spectrum of load balancing strategies based on studies of the characteristics of applications. These load balancing strategies are motivated by applications such as LeanMD, NAMD (both are classical molecular dynamics applications) and Fractography3D (a dynamic 3D crack propagation simulation program). We have successfully scaled NAMD to 1-teraFLOPS of peak performance on 3000 processors of PSC LeMieux, using the load balancing techniques presented in this thesis.

We further study the performance of existing load balancing strategies in the context of very large parallel machines using the parallel simulator we developed. We demonstrate the weaknesses of the centralized and fully distributed load balancing schemes, and explore a new scalable load balancing scheme suitable for machines with a very large number of processors.


[postscript] [PDF] [bibtex] [text reference]