Parallel machines of enormous scale are now being built, consisting of tens of thousands of processors and capable of hundreds of teraflops of peak performance. For example, the Blue Gene (BG/L) machine, being developed by IBM and slated for delivery in early 2005, will have 128,000 processors and a peak performance of 360 teraflops. Ambitious projects in computational modeling for science and engineering are gearing up to exploit this power to achieve breakthroughs in areas such as rational drug design, genomics, proteomics, engineering design, and computational astronomy.
Developing a programming environment for such machines is a significant challenge. It is also important to understand the performance issues of specific algorithms thoroughly, so that next-generation applications can be built to scale to such large machines. We have been engaged in a project¹ to address these challenges for over two years. In this paper we summarize our progress and findings so far, with a focus on recent unpublished results.
We explored CHARM++ as a programming model for large machines because of its ability to virtualize processors [1], which frees programmers from having to specify which actions run on which processors. This property seems essential for dealing with large machines, because it would be impractical to reason about what is running where on 100,000 processors. Further, CHARM++ provides a solution to the issues that arise from the fine-grained computations that result from using large machines. We first describe issues, explored using an emulator, in scaling CHARM++ and Adaptive MPI [2] (built on CHARM++) to run on large machines. Next we present our performance prediction system, which is based on parallel discrete event simulation, along with novel ideas for avoiding re-execution during optimistic simulation. We then discuss recent performance results obtained with the simulator for structural dynamics computations involving the Finite Element Method and unstructured grids. Progress on detailed architecture simulation of multi-processor nodes, which is needed to accurately predict individual processor performance in the context of a large simulation, is then summarized. The final section gives an overview of ongoing and future research issues.