PPL: Parallel Performance Analysis, Visualization and Optimization

The Projections Performance Analysis Framework

An Introduction to Projections

The significant gap between peak and realized performance of parallel machines motivates the need for effective performance analysis and tuning of applications running on those machines. To this end, we have developed a framework for performance analysis and visualization called Projections for Charm++.

Performance Instrumentation

The Charm++ runtime system gives Projections' instrumentation component the ability to record detailed performance information about events as an application executes. Examples of these events are the start and end of Charm++ entry methods and message sends. This data is recorded in per-processor log buffers and written out as log files at the end of the run. These log files are then used for post-mortem performance analysis through the visualization component of Projections. Instrumentation is enabled automatically whenever the application developer links the application with Projections' tracing modules. We also provide runtime options and APIs that let the user control the intrusiveness and volume of data collection as well as the resolution of the performance data collected, from full event traces down to a summary profile of entry method utilization.
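The buffering scheme described above can be pictured with a minimal sketch. It is an illustration only, not the Charm++ implementation: the `LogBuffer` class, the `BEGIN`/`END`/`SEND` event kinds, the text log format and the `summarize` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical event record: (kind, entry_method, timestamp_us).
Event = Tuple[str, str, int]


@dataclass
class LogBuffer:
    """Illustrative per-processor buffer; not the Charm++ data structure."""
    pe: int
    events: List[Event] = field(default_factory=list)

    def record(self, kind: str, entry_method: str, timestamp_us: int) -> None:
        # Events are appended in memory while the application runs.
        self.events.append((kind, entry_method, timestamp_us))

    def dump(self) -> str:
        # Serialize once, at the end of the run, one event per line.
        return "\n".join(f"{kind} {name} {ts}" for kind, name, ts in self.events)


def summarize(buffer: LogBuffer) -> Dict[str, int]:
    """Collapse a full trace into total busy time per entry method."""
    totals: Dict[str, int] = {}
    open_since: Dict[str, int] = {}
    for kind, name, ts in buffer.events:
        if kind == "BEGIN":
            open_since[name] = ts
        elif kind == "END" and name in open_since:
            totals[name] = totals.get(name, 0) + ts - open_since.pop(name)
    return totals


buf = LogBuffer(pe=0)
buf.record("BEGIN", "compute", 0)
buf.record("SEND", "compute", 40)
buf.record("END", "compute", 100)
buf.record("BEGIN", "reduce", 120)
buf.record("END", "reduce", 150)
print(summarize(buf))
```

The full event list corresponds to the detailed end of the resolution range, while `summarize` shows how the same data might be reduced to a per-entry-method summary when less detail is needed.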

Performance Visualization and Analysis

In its current form, the visualization component of Projections relies on manual analysis by the user. It is implemented in Java and supports this analysis through application views and abstractions such as utilization graphs, histograms and event timelines.

Performance analysis is human-centric, as illustrated in Figures 1a to 1c below. Starting from visual distillations of overall application performance characteristics, the analyst combines application domain knowledge with experience of the visual cues expressed through Projections to identify general areas (e.g. a set of processors and time intervals) of potential performance problems. The analyst then zooms in for more detail and/or seeks additional perspectives through the aggregation of information across data dimensions (e.g. processors). The same process is repeated, usually at higher levels of detail, as the analyst homes in on a problem or zooms into another area to correlate problems. The richness of information, coupled with the tool's ability to provide relevant visual cues, contributes greatly to the efficacy of this analysis process.

Figure 1a: Overview of a 512-processor run of NAMD over several seconds of execution.
Figure 1b: A Time Profile of 512 processors over a 70ms range of interest of the same NAMD simulation.
Figure 1c: Detailed Timeline of events on a user-selected subset of processors over the same 70ms range.

As shown above, the Overview (Figure 1a) gives the user a general picture of application behavior in terms of utilization across processors and over time. The Time Profile (Figure 1b) provides a breakdown of entry method activity over time, summed across all processors, effectively providing another, more detailed, perspective of the data provided by Overview. The Timeline (Figure 1c) offers the most detailed look into exactly what performance events occurred on each selected processor, allowing the examination of causal effects and other runtime information.
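The relationship between the Overview and the Time Profile can be sketched as two aggregations over the same trace. This is a simplified illustration under stated assumptions: the `(processor, entry_method, start_us, end_us)` record layout and the function names are hypothetical, and the real tool works from Projections log files rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical trace: (processor, entry_method, start_us, end_us).
EXECUTIONS = [
    (0, "integrate", 0, 60),
    (0, "exchange", 60, 100),
    (1, "integrate", 0, 30),
    (1, "exchange", 120, 200),
]


def overlap(start, end, lo, hi):
    """Length of the intersection of [start, end) and [lo, hi)."""
    return max(0, min(end, hi) - max(start, lo))


def overview(executions, num_pes, bin_us, num_bins):
    """Utilization (0.0 to 1.0) per processor per time bin."""
    grid = [[0.0] * num_bins for _ in range(num_pes)]
    for pe, _name, start, end in executions:
        for b in range(num_bins):
            grid[pe][b] += overlap(start, end, b * bin_us, (b + 1) * bin_us) / bin_us
    return grid


def time_profile(executions, bin_us, num_bins):
    """Busy time per entry method per time bin, summed across processors."""
    bins = [defaultdict(int) for _ in range(num_bins)]
    for _pe, name, start, end in executions:
        for b in range(num_bins):
            busy = overlap(start, end, b * bin_us, (b + 1) * bin_us)
            if busy:
                bins[b][name] += busy
    return [dict(b) for b in bins]


util = overview(EXECUTIONS, num_pes=2, bin_us=100, num_bins=2)
profile = time_profile(EXECUTIONS, bin_us=100, num_bins=2)
print(util)
print(profile)
```

The first function keeps the processor dimension and drops the entry method; the second drops the processor dimension and keeps the entry method, which is why the two views complement each other.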

Other views include the Usage Profile (Figure 2), which reveals the overall workload on each processor over a specified time range and is particularly useful for identifying Charm++ events that contribute to computational load imbalance in the program.

Figure 2: Usage Profile of various Charm++ Events across processors
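As a rough sketch of the kind of data behind a usage profile, the example below totals busy time per processor and per entry method within a time range, then reports a simple max-over-mean ratio per entry method. The record layout, function names and the choice of max-over-mean as an imbalance indicator are assumptions made for illustration; they are not taken from the Projections source.

```python
from collections import defaultdict

# Hypothetical trace: (processor, entry_method, start_us, end_us).
EXECUTIONS = [
    (0, "integrate", 0, 80),
    (0, "exchange", 80, 100),
    (1, "integrate", 0, 30),
    (1, "exchange", 30, 50),
]


def usage_profile(executions, lo, hi):
    """Busy time per processor and entry method within [lo, hi)."""
    usage = defaultdict(lambda: defaultdict(int))
    for pe, name, start, end in executions:
        busy = max(0, min(end, hi) - max(start, lo))
        if busy:
            usage[pe][name] += busy
    return {pe: dict(methods) for pe, methods in usage.items()}


def imbalance_by_method(usage):
    """Max-over-mean busy time per entry method; 1.0 means perfectly even."""
    pes = list(usage)
    methods = {m for per_pe in usage.values() for m in per_pe}
    result = {}
    for m in methods:
        loads = [usage[pe].get(m, 0) for pe in pes]
        result[m] = max(loads) / (sum(loads) / len(loads))
    return result


usage = usage_profile(EXECUTIONS, lo=0, hi=100)
ratios = imbalance_by_method(usage)
print(usage)
print(ratios)
```

In this toy data the `integrate` entry method has a ratio well above 1.0 while `exchange` is even, which is the kind of contrast an analyst would look for in the view. Processors with no activity in the range are omitted from `usage` in this sketch, so a fuller version would need to account for them.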

Keeping Performance Analysis Effective

Issues and Motivation

In general, analyzing and subsequently tuning an application is a non-trivial task for the analyst/developer. It is time-consuming, and as an application scales to larger numbers of processors and larger simulations, locating performance bottlenecks can become intractable. This is due to the growth in the volume of performance data that such scaling inevitably produces. The consequences are twofold: the performance tool must read and process much more data, which takes more time and reduces responsiveness; and the performance information presented visually to the analyst can quickly become overwhelming. Our current research efforts in performance tools are directed at these challenges in order to keep Projections an effective and useful tool.

Automating Performance Problem Discovery

We have been developing ways to help automate the discovery of performance bottlenecks for analysts and to present this information quickly through the Projections visualization tool.

One of these ways is our NoiseMiner tool, which automatically locates the precise sections of the performance space where unusually long time durations (relative to the rest of the application's activity) are spent (Figure 3). Such long events may be symptoms of operating system interference, software interference, or computational noise. The analyst may then browse these sections of the performance space in mini-timelines (Figure 4).

Figure 3: NoiseMiner lists groups of events with similar duration stretches. Each group is likely affected by the same type of operating system interference, software interference, or computational noise.

Figure 4: 36 mini-timelines can be viewed for each noise component detected by NoiseMiner. In the middle of each timeline is an event whose duration is longer than expected.
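The general idea of flagging unusually long events and grouping them by how much they were stretched can be sketched as follows. This is not the NoiseMiner algorithm, which is described in the paper listed below; it is a simplified stand-in that assumes a per-entry-method median as the expected duration, a fixed multiplier as the threshold, and fixed-width buckets for grouping. All names and parameters are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical events: (processor, entry_method, start_us, duration_us).
EVENTS = [
    (0, "integrate", 0, 10),
    (0, "integrate", 20, 11),
    (1, "integrate", 0, 10),
    (1, "integrate", 15, 12),
    (2, "integrate", 0, 11),
    (2, "integrate", 30, 62),   # stretched
    (3, "integrate", 5, 65),    # stretched
    (0, "reduce", 40, 5),
    (1, "reduce", 40, 5),
    (2, "reduce", 95, 6),
]


def typical_durations(events):
    """Median duration per entry method, used as the expected duration."""
    by_method = defaultdict(list)
    for _pe, name, _start, duration in events:
        by_method[name].append(duration)
    return {name: median(durations) for name, durations in by_method.items()}


def group_stretched(events, factor=3.0, bucket_us=10):
    """Group events longer than factor x typical by the size of the stretch."""
    typical = typical_durations(events)
    groups = defaultdict(list)
    for event in events:
        _pe, name, _start, duration = event
        if duration > factor * typical[name]:
            stretch = duration - typical[name]
            groups[(name, int(stretch // bucket_us))].append(event)
    return dict(groups)


groups = group_stretched(EVENTS)
for key, members in groups.items():
    print(key, members)
```

Each resulting group collects events with a similar stretch, loosely mirroring the grouping shown in Figure 3; the start time and processor kept in each record are what a viewer would need to centre a mini-timeline on the event.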

Performance Tool Scalability

As applications scale to larger datasets and larger processor counts, the volume of performance data grows significantly. For performance tools to remain effective and relevant, the scalability of the performance analysis process has to be addressed. We are currently pursuing research and development in the following directions:

Performance Tool Scalability - the tool itself must be capable of handling a large volume of data gracefully and responsively.
Data Scalability - we are researching ways to reduce performance data without losing too much information about performance bottlenecks, so that tools can continue to work effectively. One such method uses clustering techniques and heuristics to select only a subset of processor logs to retain at trace generation time.
Visualization Scalability - this is related to the above-mentioned research on automation. We are developing ways to present pertinent performance information to the analyst quickly. Scaling produces a greater volume of performance data and, more importantly, a larger performance space. Naively extending the processing and display capabilities of performance tools (i.e. handling more data, displaying timelines for more processors) without better visual aids simply overwhelms the analyst with too much information.
Turn-around Time - gathering performance traces to study the effects of extreme scaling traditionally requires submitting large job requests, which can wait an extremely long time in the job submission queues of most supercomputing centers. This is in spite of the fact that, for performance analysis purposes, the instrumented application only needs to execute for a few seconds to an hour. Since the performance tuning cycle generally requires several rounds of hypothesis testing and code correction, the turn-around time of having these large-scale jobs wait in the queue can be highly significant. We are currently developing a way of using our BigSim Charm++ application simulation package to generate different performance traces under different conditions such as Charm++ object-to-processor placement, load balancing schemes, network topology and other machine characteristics. This will potentially allow us to test performance hypotheses by re-simulation on a smaller number of processors instead of having to modify application code or parameters and then submit a large-scale run.
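To make the data-reduction idea under Data Scalability more concrete, here is a minimal sketch of selecting a subset of processor logs by clustering. The text above only says that clustering techniques and heuristics are used; the choice of k-means, the two per-processor features, and the heuristic of keeping one representative and one extreme member per cluster are assumptions made for this example.

```python
import math

# Hypothetical per-processor features: (utilization, messages sent in thousands).
FEATURES = {
    0: (0.90, 1.0),
    1: (0.92, 1.1),
    2: (0.85, 0.9),
    3: (0.40, 3.0),
    4: (0.42, 3.2),
    5: (0.38, 3.1),
}


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic start spread over the sorted points."""
    ordered = sorted(points)
    centroids = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iterations):
        assignment = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Recompute once more so the assignment matches the final centroids.
    return centroids, [nearest(p) for p in points]


def select_logs(features, k):
    """Keep, per cluster, the processor nearest to and farthest from the centroid."""
    pes = list(features)
    centroids, assignment = kmeans([features[pe] for pe in pes], k)
    keep = set()
    for c in range(k):
        members = [pe for pe, a in zip(pes, assignment) if a == c]
        if members:
            def distance(pe):
                return math.dist(features[pe], centroids[c])
            keep.add(min(members, key=distance))  # representative
            keep.add(max(members, key=distance))  # most atypical member
    return sorted(keep)


selected = select_logs(FEATURES, k=2)
print(selected)
```

Keeping an atypical member alongside each representative is one way to avoid discarding the processors most likely to show a bottleneck; other heuristics are equally plausible.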

Papers

10-03 Abhinav Bhatele, Lukasz Wesolowski, Eric Bohm, Edgar Solomonik and Laxmikant V. Kale, Understanding application performance via micro-benchmarks on three large supercomputers: Intrepid, Ranger and Jaguar, International Journal for High Performance Computing Applications (IJHPCA), Vol: 24, Issue: 4, Pages: 411-427, 2010.
Full text available at SAGE website.
09-13 Chee Wai Lee, Techniques in Scalable and Effective Parallel Performance Analysis, PhD Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, December 2009.
09-08 Isaac Dooley, Chee Wai Lee, and Laxmikant Kale, Continuous Performance Monitoring for Large-Scale Parallel Applications, 16th annual IEEE International Conference on High Performance Computing (HiPC 2009)
09-05 Scott Biersdorff, Chee Wai Lee, Allen D. Malony, Laxmikant V. Kale, Integrated Performance Views in Charm++: Projections Meets TAU, ICPP'09 Workshop
08-05 Chee Wai Lee, Celso Mendes and Laxmikant V. Kale, Towards Scalable Performance Analysis and Visualization through Data Reduction, 13th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2008) held in conjunction with IPDPS 2008.
08-04 Isaac Dooley, Chao Mei, Laxmikant V. Kale, NoiseMiner: An Algorithm for Scalable Automatic Computational Noise and Software Interference Detection, In Proceedings of HIPS Workshop at IEEE International Parallel and Distributed Processing Symposium 2008
07-06 Chee Wai Lee and Laxmikant V. Kale, Scalable Techniques for Performance Analysis, Technical Report, Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL, May 2007.
05-19 Chee Wai Lee, Terry L. Wilmarth and Laxmikant V. Kale, Performance Visualization and Analysis of Parallel Discrete Event Simulations with Projections, PPL Technical Report 05-19, UIUC.
04-05 Laxmikant V. Kale, Gengbin Zheng, Chee Wai Lee, Sameer Kumar, Scaling Applications to Massively Parallel Machines Using Projections Performance Analysis Tool, Future Generation Computer Systems Journal
03-03 Laxmikant V. Kale, Sameer Kumar, Gengbin Zheng, Chee Wai Lee, Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study, Terascale Performance Analysis Workshop, International Conference on Computational Science (ICCS), 2003
99-05 Parthasarathy Ramachandran and Laxmikant V. Kale, Web-based Interaction and Monitoring for Parallel Programs (Via Conspector), Internal report.
99-01 Laxmikant Kale, Robert Brunner, James Phillips and Krishnan Varadarajan, Application Performance of a Linux Cluster using Converse, 3rd Workshop on Runtime Systems for Parallel Programming
96-13 Sanjeev Krishnan, Automating Runtime Optimizations for Parallel Object-Oriented Programming, Ph.D. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, June 1996.
96-12 Amitabh Sinha and L. V. Kale, Towards Automatic Performance Analysis, Proceedings of International Conference on Parallel Processing, August 1996, Volume III, pp. 53-60.
Available from IEEE Xplore.
96-08 Sanjeev Krishnan and L. V. Kale, Automating Parallel Runtime Optimizations Using Post-Mortem Analysis, Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, May 1996.
96-07 Sanjeev Krishnan and L. V. Kale, Automating Runtime Optimizations for Load Balancing in Irregular Problems, Proceedings of the Conference on Parallel and Distributed Processing Technology and Applications, San Jose, August 1996.
92-03 Laxmikant V. Kale and Amitabh B. Sinha, Projections: a Preliminary Performance Tool for Charm, Parallel Systems Fair, International Symposium on Parallel Processing, Newport Beach, CA, April 1993.