The Projections Performance Analysis Framework
An Introduction to Projections
The significant gap between peak and realized performance of
parallel machines motivates the need for effective performance
analysis and tuning of applications running on those machines. To
this end, we have developed a framework for performance analysis and
visualization called Projections for Charm++.
Performance Instrumentation
The Charm++ runtime system gives Projections' instrumentation
component the ability to record detailed performance information
about events as an application executes. Examples of these events are
the start and end of Charm++ entry methods and message sends. This
data is recorded in per-processor log buffers and written out as log
files at the end of the run. These log files are then used for
post-mortem performance analysis through the visualization component
of Projections. Instrumentation is provided automatically whenever
the application developer links the application with Projections'
tracing modules. We also provide various runtime options and APIs
that let the user flexibly control the intrusiveness and size of data
collection, as well as the resolution of the performance data
collected, from full event traces to a summary profile of entry
method utilization.
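The recording scheme described above can be illustrated with a minimal Python sketch. It is not the Charm++ tracing API or the Projections log format: `TraceBuffer`, `record`, `to_log_lines` and `summarize` are hypothetical names chosen for illustration, and the `enabled` flag merely stands in for the kind of runtime control mentioned above. The sketch shows one buffer per processor collecting begin/end events, and a reduction of the full trace to a per-entry-method summary.

```python
from dataclasses import dataclass, field


@dataclass
class TraceBuffer:
    """Hypothetical per-processor event buffer (not the Projections log format)."""
    pe: int
    enabled: bool = True          # stand-in for a runtime switch that pauses tracing
    events: list = field(default_factory=list)

    def record(self, time_us, kind, entry):
        """Append one (time, BEGIN/END, entry method) event if tracing is on."""
        if self.enabled:
            self.events.append((time_us, kind, entry))

    def to_log_lines(self):
        """Render the buffer as text lines, as it might be written at exit."""
        return [f"{self.pe} {t} {kind} {entry}" for t, kind, entry in self.events]


def summarize(events):
    """Reduce a full event trace to total execution time per entry method."""
    started = {}
    totals = {}
    for time_us, kind, entry in events:
        if kind == "BEGIN":
            started[entry] = time_us
        elif kind == "END":
            begin = started.pop(entry, None)
            if begin is not None:      # ignore an END whose BEGIN was not traced
                totals[entry] = totals.get(entry, 0) + (time_us - begin)
    return totals


buf = TraceBuffer(pe=0)
buf.record(0, "BEGIN", "compute")
buf.record(40, "END", "compute")
buf.enabled = False                    # the user turns tracing off for a phase
buf.record(45, "BEGIN", "setup")
buf.record(48, "END", "setup")
buf.enabled = True
buf.record(50, "BEGIN", "compute")
buf.record(80, "END", "compute")

print(buf.to_log_lines())              # full event trace for this processor
print(summarize(buf.events))           # summary profile of the same data
```

The summary keeps only one number per entry method, which is far smaller than the full trace but no longer shows when each event occurred.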
Performance Visualization and Analysis
In its current form, the visualization component of Projections
relies on manual analysis by the user. It is implemented in Java and
supports this analysis through useful application views and
abstractions such as utilization graphs, histograms and event
timelines.
Performance analysis is human-centric, as illustrated in Figures 1a
to 1c below. Starting from visual distillations
of overall application performance characteristics, the analyst
employs a mixture of application domain knowledge and experience with
visual cues expressed through Projections in order to identify general
areas (e.g. over a set of processors and time intervals) of potential
performance problems. The analyst then zooms in for more detail and/or
seeks additional perspectives through the aggregation of information
across data dimensions (e.g. processors). The same process is
repeated, usually with higher levels of detail, as the analyst homes
in on a problem or zooms into another area to correlate problems. The
richness of information, coupled with the tool's ability to provide
relevant visual cues, contributes greatly to the effectiveness of this
analysis process.
Figure 1a: Overview of 512-processor run of NAMD over several seconds of execution.
Figure 1b: A Time Profile of 512 processors over a 70ms range of interest of the same NAMD simulation.
Figure 1c: Detailed Timeline of events on a user-selected subset of processors over the same 70ms range.
As shown above, the Overview (Figure 1a) gives the user a general
picture of application behavior in terms of utilization across
processors and over time. The Time Profile (Figure 1b) provides a
breakdown of entry method activity over time, summed across all
processors, giving another, more detailed perspective on the data
shown in the Overview. The Timeline (Figure 1c) offers the most
detailed look at exactly which performance events occurred on each
selected processor, allowing the examination of causal effects and
other runtime information.
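As a rough illustration of what a Time Profile aggregates, the following Python sketch bins entry method execution intervals into fixed-width time bins and sums them across processors. It is only a sketch under stated assumptions: the tuple layout `(pe, entry, start, end)` and the function name `time_profile` are hypothetical and do not reflect how Projections stores or processes its logs.

```python
from collections import defaultdict


def time_profile(intervals, bin_width, num_bins):
    """Return one dict per time bin mapping entry method -> busy time,
    summed over all processors. Times are integers in the same unit."""
    profile = [defaultdict(int) for _ in range(num_bins)]
    for _pe, entry, start, end in intervals:
        first = max(0, start // bin_width)
        last = min(num_bins - 1, end // bin_width)
        for b in range(first, last + 1):
            # Portion of this interval that falls inside bin b.
            overlap = min(end, (b + 1) * bin_width) - max(start, b * bin_width)
            if overlap > 0:
                profile[b][entry] += overlap
    return [dict(p) for p in profile]


# Hypothetical executions: (processor, entry method, start, end)
intervals = [
    (0, "force", 0, 15),
    (1, "force", 5, 10),
    (1, "integrate", 10, 20),
]
for b, usage in enumerate(time_profile(intervals, bin_width=10, num_bins=2)):
    print(b, usage)
```

An interval that straddles a bin boundary is split between the two bins, so the totals per bin never exceed the bin width times the number of processors.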
Other views include the Usage Profile (Figure 2), which reveals
information about the overall workloads across processors over a
specified time range. It is particularly useful in identifying
Charm++ events that contribute to computational load imbalance in the
program.
Figure 2: Usage Profile of various Charm++ Events across processors
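The idea behind a Usage Profile can be sketched in Python as a per-processor breakdown of time by entry method over a chosen time range, together with a simple imbalance indicator. The names `usage_profile` and `imbalance`, the `(pe, entry, start, end)` tuple layout, and the max-over-mean metric are all assumptions made for illustration; they are not taken from Projections.

```python
def usage_profile(intervals, num_pes, t0, t1):
    """Per-processor time spent in each entry method within [t0, t1)."""
    usage = [dict() for _ in range(num_pes)]
    for pe, entry, start, end in intervals:
        overlap = min(end, t1) - max(start, t0)   # clip to the selected range
        if overlap > 0:
            usage[pe][entry] = usage[pe].get(entry, 0) + overlap
    return usage


def imbalance(usage):
    """Ratio of the most loaded processor to the mean load (1.0 = balanced)."""
    loads = [sum(u.values()) for u in usage]
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Hypothetical executions: (processor, entry method, start, end)
intervals = [
    (0, "force", 0, 80),
    (1, "force", 0, 40),
    (1, "integrate", 40, 60),
]
usage = usage_profile(intervals, num_pes=2, t0=0, t1=100)
print(usage)
print(round(imbalance(usage), 3))
```

Comparing the per-entry breakdown of the most loaded processor against the others points to the entry methods responsible for the extra load.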
Keeping Performance Analysis Effective
Issues and Motivation
In general, analyzing and subsequently tuning an application is a
non-trivial task for the analyst/developer. It is time-consuming, and
as an application scales to larger numbers of processors and larger
simulations, locating performance bottlenecks can become
intractable. This is due to the growth in the volume of performance
data that such scaling inevitably produces. The consequences are
twofold: the performance tool must read and process much more data,
which takes more time and reduces responsiveness; and the performance
information presented visually to the analyst can quickly become
overwhelming. Our current research efforts in performance tools are
directed at meeting these challenges in order to keep Projections an
effective and useful tool.
Automating Performance Problem Discovery
We have been developing ways to help automate the discovery of
performance bottlenecks for analysts and to present this information
quickly and visually via the Projections visualization tool.
One of these ways is our NoiseMiner tool, which automatically locates
precise sections of the performance space where events take unusually
long relative to the rest of the application's activity (Figure 3).
Such long events may be symptoms of operating system interference,
software interference, or computational noise. The analyst may then
browse these sections of the performance space in mini-timelines
(Figure 4).
Figure 3: NoiseMiner lists groups of events with similar duration stretches. Each group is likely affected by the same type of operating system interference, software interference, or computational noise.
Figure 4: 36 mini-timelines can be viewed for each noise component detected by NoiseMiner. In the middle of each timeline is an event whose duration is longer than expected.
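The general idea of flagging unusually long events and grouping them by similar stretch can be sketched in Python as follows. This is not the NoiseMiner algorithm: the median baseline, the `factor` threshold, the fixed-width stretch buckets and the names `find_stretched_events` and `group_by_stretch` are all simplifying assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def find_stretched_events(events, factor=3.0):
    """Flag events much longer than the typical duration of their entry method.

    events: list of (pe, entry, start, duration).
    Returns (pe, entry, start, stretch), where stretch is the excess over
    the median duration of that entry method.
    """
    by_entry = defaultdict(list)
    for event in events:
        by_entry[event[1]].append(event)
    stretched = []
    for entry, group in by_entry.items():
        typical = median(duration for _pe, _entry, _start, duration in group)
        for pe, _entry, start, duration in group:
            if duration > factor * typical:
                stretched.append((pe, entry, start, duration - typical))
    return stretched


def group_by_stretch(stretched, bucket_width):
    """Group flagged events whose stretch falls in the same bucket."""
    groups = defaultdict(list)
    for pe, entry, start, stretch in stretched:
        groups[int(stretch // bucket_width)].append((pe, entry, start))
    return dict(groups)


# Hypothetical trace: most "force" executions take about 10 time units.
events = [
    (0, "force", 0, 10), (0, "force", 20, 11),
    (1, "force", 0, 10), (1, "force", 20, 62),
    (2, "force", 0, 9), (2, "force", 20, 65),
]
flagged = find_stretched_events(events)
print(group_by_stretch(flagged, bucket_width=10))
```

Events that land in the same bucket were delayed by a similar amount, which suggests (but does not prove) a common source of interference.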
Performance Tool Scalability
As applications scale to larger datasets and larger processor counts,
the volume of performance data grows significantly. For performance
tools to remain effective and relevant, the scalability of the
performance analysis process has to be addressed. Currently, we are
pursuing research and development in the following directions of
scalability:
- Performance Tool Scalability - the tool itself must be
capable of handling a large volume of data gracefully and responsively.
- Data Scalability - we are researching ways by which performance
data may be reduced without losing too much performance bottleneck
information, so that tools may continue to work effectively. One such
method uses clustering techniques and heuristics to select only a
subset of processor logs to retain at trace generation time.
- Visualization Scalability - this is related to the above-mentioned
research on automation. We are developing various ways by which
pertinent performance information is quickly presented to the
analyst. Scaling produces a greater volume of performance data and,
more importantly, a larger performance space. Naively extending the
processing and display capabilities of performance tools
(i.e. handling more data, displaying timelines for more processors)
without better visual aids simply overwhelms the analyst with too
much information.
- Turn-around Time - gathering performance traces to study the effect
of extreme scaling traditionally requires submitting large job
requests that can wait an extremely long time in the job submission
queues of most supercomputing centers. This is in spite of the fact
that, for performance analysis purposes, the instrumented application
need only execute for a few seconds to an hour. Since the performance
tuning cycle generally requires several rounds of hypothesis testing
and code correction, the turn-around time of having these large-scale
jobs wait in the queue can be highly significant.
We are currently developing a way of using our BigSim Charm++
application simulation package to generate different performance
traces under different conditions, such as Charm++ object-to-processor
placement, load balancing schemes, network topology and other machine
characteristics. This will potentially allow us to test performance
hypotheses by re-simulation on a smaller number of processors instead
of having to modify application code or parameters and then submit
them for a large-scale run.
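The data-reduction idea mentioned under Data Scalability above, retaining only a subset of processor logs chosen by clustering, can be sketched in Python as follows. This is an illustrative sketch only, not the technique used in Projections: it assumes each processor is described by a single hypothetical feature (its total busy time), uses a plain one-dimensional k-means with a deterministic start, and keeps the most representative and the most atypical processor of each cluster as a stand-in for the heuristics mentioned above. It also assumes `k` does not exceed the number of processors.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain 1-D k-means with a deterministic start (assumes k <= len(values))."""
    ordered = sorted(values)
    # Spread the initial centroids evenly over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for index, value in enumerate(values):
            nearest = min(range(k), key=lambda c: abs(value - centroids[c]))
            clusters[nearest].append(index)
        centroids = [
            sum(values[i] for i in members) / len(members) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    return centroids, clusters


def select_logs(busy_time, k):
    """Pick the processors whose logs would be retained.

    For each cluster, keep the processor closest to the centroid
    (representative) and the one farthest from it (most atypical).
    """
    centroids, clusters = kmeans_1d(busy_time, k)
    keep = set()
    for centroid, members in zip(centroids, clusters):
        if members:
            keep.add(min(members, key=lambda pe: abs(busy_time[pe] - centroid)))
            keep.add(max(members, key=lambda pe: abs(busy_time[pe] - centroid)))
    return sorted(keep)


# Hypothetical busy time per processor; processors 4, 5 and 7 are overloaded.
busy_time = [50, 51, 48, 53, 90, 92, 52, 89]
print(select_logs(busy_time, k=2))
```

Keeping an atypical member alongside the representative is one way to avoid discarding exactly the processors that exhibit a bottleneck; other selection heuristics are equally plausible.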
- 08-05
Chee Wai Lee and Celso Mendes and Laxmikant V. Kale, Towards Scalable Performance Analysis and Visualization through Data Reduction, 13th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2008) held in conjunction with IPDPS 2008.
- 08-04
Isaac Dooley, Chao Mei, Laxmikant V. Kale, NoiseMiner: An Algorithm for Scalable Automatic Computational Noise and Software Interference Detection, To appear in Proceedings of HIPS Workshop at IEEE International Parallel and Distributed Processing Symposium 2008
- 07-06
Chee Wai Lee and Laxmikant V. Kale, Scalable Techniques for Performance Analysis, Technical Report, Computer Science Department, University of Illinois at Urbana-Champaign, Urbana, IL, May 2007.
- 05-19
Chee Wai Lee, Terry L. Wilmarth and Laxmikant V. Kale, Performance Visualization and Analysis of Parallel Discrete Event Simulations with Projections, PPL Technical Report 05-19, UIUC.
- 04-05
Laxmikant V. Kale, Gengbin Zheng, Chee Wai Lee, Sameer Kumar, Scaling Applications to Massively Parallel Machines Using Projections Performance Analysis Tool, Future Generation Computer Systems Journal
- 03-03
Laxmikant V. Kale, Sameer Kumar, Gengbin Zheng, Chee Wai Lee, Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study, Terascale Performance Analysis Workshop, International Conference on Computational Science(ICCS), 2003
- 99-05
Parthasarathy Ramachandran and Laxmikant V. Kale, Web-based Interaction and Monitoring for Parallel Programs (Via Conspector), Internal report.
- 99-01
Laxmikant Kale, Robert Brunner, James Phillips and Krishnan Varadarajan, Application Performance of a Linux Cluster using Converse, 3rd Workshop on Runtime Systems for Parallel Programming
- 96-13
Sanjeev Krishnan, Automating Runtime Optimizations for Parallel Object-Oriented Programming, Ph.D. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, June 1996.
- 96-12
Amitabh Sinha and L. V. Kale, Towards Automatic Performance Analysis, Proceedings of International Conference on Parallel Processing, August 1996, Volume III, pp. 53-60.
- 96-08
Sanjeev Krishnan and L. V. Kale, Automating Parallel Runtime Optimizations Using Post-Mortem Analysis, Proceedings of the 10th ACM International Conference on Supercomputing, Philadelphia, May 1996.
- 96-07
Sanjeev Krishnan and L. V. Kale, Automating Runtime Optimizations for Load Balancing in Irregular Problems, Proceedings of the Conference on Parallel and Distributed Processing Technology and Applications, San Jose, August 1996.
- 92-03
Laxmikant V. Kale and Amitabh B. Sinha, Projections: a Preliminary Performance Tool for Charm, Parallel Systems Fair, International Symposium on Parallel Processing, Newport Beach, CA, April 1993.