Morning: Tutorials (Session Chair: Viraj Paropkari)
10:30 am - 12:00 pm | Tutorial
Debugging Parallel Programs with CharmDebug
Filippo Gioachin
In this tutorial, I will present the basics of CharmDebug, a parallel debugger tailored to Charm++. I will start with an overview of the system and how an application can be prepared and started under CharmDebug. The tutorial will then continue with a case study: a Jacobi 2D stencil computation program (similar to the one in examples/charm++/jacobi2d-iter of the Charm++ distribution). This program has been modified to introduce a bug, which we will detect using CharmDebug's introspection capabilities. Finally, the same program will serve as a testbed for detecting various memory leaks.
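As a flavor of the case study, here is a schematic Charm++ fragment (ours, not the actual tutorial code) of a Jacobi-style chare that leaks a buffer of the kind CharmDebug's memory views can flag; BLOCK, the entry names, and the omitted mainchare boilerplate are illustrative assumptions.

    // jacobi.ci (interface file), sketched as a comment for context:
    //   mainmodule jacobi {
    //     array [2D] Jacobi {
    //       entry Jacobi();
    //       entry void recvGhost(int dir, int n, double row[n]);
    //     };
    //   };

    #include <algorithm>
    #include "jacobi.decl.h"          // generated by charmc from jacobi.ci

    static const int BLOCK = 64;      // per-chare block size (illustrative)

    class Jacobi : public CBase_Jacobi {
      double* grid;
    public:
      Jacobi() : grid(new double[BLOCK * BLOCK]()) {}
      ~Jacobi() { delete[] grid; }

      void recvGhost(int dir, int n, double* row) {
        double* copy = new double[n];   // deliberate leak: never freed, so a
        std::copy(row, row + n, copy);  // memory-leak scan can point at it
        // ... apply the ghost row for direction 'dir', relax the interior ...
      }
    };

    #include "jacobi.def.h"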
12:00 noon - 1:00 pm | Lunch Break
1:00 pm - 2:30 pm | Tutorial
Advanced Charm++ and Virtualization
Eric Bohm
This tutorial introduces advanced Charm++ features such as custom load balancers, custom initial placement, groups, nodegroups, threads, delegation, multicasts, and the Structured Dagger (SDAG) framework for expressing control flow.
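As a taste of SDAG, the illustrative .ci fragment below (our sketch, with invented entry names, and 'iter'/'numIters' assumed to be member variables of the chare class) expresses a stencil's per-iteration dependences declaratively: each iteration waits for both ghost messages, matched by the [iter] reference number, with no hand-written message counters.

    array [1D] Stencil {
      entry Stencil();
      entry void recvLeft(int it, int n, double ghost[n]);
      entry void recvRight(int it, int n, double ghost[n]);
      entry void run() {
        for (iter = 0; iter < numIters; iter++) {
          serial { sendGhosts(); }              // send boundaries to neighbors
          when recvLeft[iter](int it, int n, double l[n]),
               recvRight[iter](int it, int n, double r[n])
            serial { relax(l, r, n); }          // runs once both have arrived
        }
      };
    };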
Afternoon: Opening Session (Session Chair: Dr. Gengbin Zheng)
2:30 pm - 3:15 pm | Talk
Charm++ and Affiliated Research: Recent Developments
Prof. Laxmikant V. Kale
Charm++, and the surprisingly rich research agenda engendered by its idea of object-based over-decomposition, made significant progress during the past year. I will review the basic concepts that have been the foundation of our approach to parallel programming, and highlight specific achievements of the past year. These include progress on our production-quality, collaboratively developed science and engineering applications, such as NAMD (biophysics), OpenAtom (quantum chemistry), and ChaNGa (astronomy). I will also highlight some of the progress and challenges in our agenda of developing higher-level parallel languages.
3:15 pm - 3:30 pm | Break
3:30 pm - 4:00 pm | Talk
Blue Waters and PPL's Role
Celso Mendes and Eric Bohm
The Blue Waters project will provide a computational system capable of sustained petaflop performance on a range of science and engineering applications. The system will be deployed at the University of Illinois at Urbana-Champaign in 2011, under NCSA's administration. In this talk, we will provide an overview of the main aspects of the Blue Waters system and describe the work related to Blue Waters being conducted in the Parallel Programming Laboratory. That work comprises preparing the Charm++/AMPI infrastructure for deployment on Blue Waters, supporting the efficient porting of the NAMD code to Blue Waters, and deploying a version of the BigSim simulation system capable of simulating machines at the scale of Blue Waters, targeting early application development.
4:00 pm - 4:30 pm | Submitted Paper
Integrated Performance Views in Charm++: Projections meets TAU
Prof. Allen D. Malony
Department of Computer and Information Science, University of Oregon
The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest, representing Charm++ scheduling operations, application methods/routines, and communication events, for observation by alternative performance modules configured to implement different measurement features. The paper describes Charm++'s performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are often complementary, providing the user with different performance perspectives on Charm++ applications based on performance-data detail and temporal and spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that exercises many of Charm++'s unique features.
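As a small illustration of the measurement side, the sketch below (our example; the class, entry, and event names are invented) registers a named user event through Charm++'s tracing API and brackets a compute phase with it, which the trace-based views can then display alongside the runtime's own events.

    #include "worker.decl.h"   // generated from an assumed worker.ci

    class Worker : public CBase_Worker {
      int evtCompute;          // trace event id for the compute phase
    public:
      Worker() {
        // Register a named user event (id 100 is an arbitrary choice);
        // registration would normally be done once per PE.
        evtCompute = traceRegisterUserEvent("compute phase", 100);
      }
      void compute() {
        double t0 = CkWallTimer();
        // ... numerical kernel ...
        traceUserBracketEvent(evtCompute, t0, CkWallTimer());
      }
    };

    #include "worker.def.h"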
4:30 pm - 6:00 pm | Panel Discussion
A Single Programming Model for Clusters and Multiprocessor Nodes: Dream, Nightmare, Reality, or Vision?
Panelists: William D. Gropp, Laxmikant V. Kale, David A. Padua, Arch Robison, Marc Snir
Multiprocessor nodes are now commonplace in clusters as well as in client desktops. Is it feasible to think of a single programming model that will work whether one is programming an individual multiprocessor node, a cluster of uniprocessor nodes (a disappearing dinosaur), or a cluster of SMP nodes? Within the HPC/CSE community, are there organizing principles that allow us to think of a unified model? Or are the concerns so different that they deserve different models? Does the class of application being pursued make a difference? It could be that it is impractical to pursue this idea (a "dream"), or that a unified model will be possible but a monstrosity to learn and program with (a "nightmare"), or that it is already a reality (MPI programmers have typically used the MPI-everywhere model, with sporadic use of MPI-OpenMP hybrids, and Charm++ supports a low-level unified model), or perhaps it is a vision that should motivate us to develop new programming models with this explicit aim. The panelists, with expertise spanning a wide spectrum of programming models, will weigh in on this question. We hope to leave adequate time for discussion with audience participation.
6:00 pm onwards | Workshop Banquet (for registered participants only)
8:30 am onwards
Morning: Keynote and Technical Session (Session Chair: Prof. Kale)
9:00 am - 9:45 am | Keynote
PBGL: A High-Performance Parallel Distributed-Memory Graph Library
Prof. Andrew Lumsdaine
Computer Science Department, Indiana University
The increasing complexity of parallel architectures, coupled with the growing importance of new classes of parallel applications, calls for new tools and new development paradigms. Of particular importance is the need to develop software in an architecture-independent fashion while still being able to take advantage of architecture-specific features. In this talk, we present the design and implementation of the Parallel Boost Graph Library, a library of high-performance reusable software components for distributed graph computation. Like the sequential Boost Graph Library (BGL) upon which it is based, the Parallel BGL applies the paradigm of generic programming to the domain of graph computations. To illustrate how the Parallel BGL was built from the sequential BGL, we revisit the abstractions comprising the BGL in the context of distributed-memory parallelism and lift away the implicit requirements of sequential execution and a single shared address space. Through this process, we are able to create generic algorithms that have a sequential expression and require only the introduction of external (distributed) data structures for parallel execution. More importantly, the generic implementation retains its sequential interface and semantics, such that other distributed algorithms can be built upon it, just as algorithms are layered in the sequential case. By characterizing these extensions as well as the extension process, we develop general principles and patterns for using (and reusing) generic parallel software libraries. We demonstrate that the resulting algorithm implementations are both efficient and scalable, with performance results for several algorithms implemented in the open-source Parallel Boost Graph Library.
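To make the retained sequential interface concrete, here is a small sequential BGL program computing BFS distances (our example, not from the talk); on the talk's premise, the same generic call pattern carries over to the Parallel BGL's distributed graph types.

    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/breadth_first_search.hpp>
    #include <boost/graph/visitors.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      typedef boost::adjacency_list<boost::vecS, boost::vecS,
                                    boost::undirectedS> Graph;
      Graph g(5);
      boost::add_edge(0, 1, g);
      boost::add_edge(1, 2, g);
      boost::add_edge(0, 3, g);
      boost::add_edge(3, 4, g);

      // Record BFS distances from vertex 0; a raw pointer works as the
      // distance property map because vertex descriptors are indices here.
      std::vector<int> dist(boost::num_vertices(g), 0);
      boost::breadth_first_search(g, boost::vertex(0, g),
          boost::visitor(boost::make_bfs_visitor(
              boost::record_distances(&dist[0], boost::on_tree_edge()))));

      for (std::size_t i = 0; i < dist.size(); ++i)
        std::cout << "dist(0, " << i << ") = " << dist[i] << "\n";
      return 0;
    }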
9:45 am - 10:15 am | Invited Talk
Parallel Rendering in the GPU Era
Prof. Orion S. Lawlor
Department of Computer Science, University of Alaska Fairbanks
The classic model for parallel rendering, including Charm++ liveViz, is that a parallel array of CPUs computes, composites, and sends pixels to a low-sophistication client. The rise of programmable graphics processing unit (GPU) hardware provides an opportunity to significantly increase delivered application rendering performance, but taking full advantage of the GPU's power requires a modified application model. In this talk, we will describe our modification of liveViz to transmit "volume impostors," which are composited on the client's graphics hardware; our parallelizing powerwall rendering library, MPIglut; and lessons learned, pitfalls to avoid, and directions for future parallel GPU rendering research.
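MPIglut is, as we understand it, designed for source compatibility with GLUT, so an ordinary GLUT program like the sketch below (ours; the exact build and link steps against MPIglut are an assumption) can be rebuilt to run distributed across a powerwall.

    #include <GL/glut.h>   // with MPIglut, the same source is rebuilt against it

    void display(void) {
      glClear(GL_COLOR_BUFFER_BIT);
      glBegin(GL_TRIANGLES);           // one triangle; nothing powerwall-aware
        glColor3f(1, 0, 0); glVertex2f(-0.5f, -0.5f);
        glColor3f(0, 1, 0); glVertex2f( 0.5f, -0.5f);
        glColor3f(0, 0, 1); glVertex2f( 0.0f,  0.5f);
      glEnd();
      glutSwapBuffers();
    }

    int main(int argc, char** argv) {
      glutInit(&argc, argv);           // a parallelizing GLUT can intercept
      glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);   // these calls to carve
      glutInitWindowSize(512, 512);                  // up the display wall
      glutCreateWindow("powerwall demo");
      glutDisplayFunc(display);
      glutMainLoop();
      return 0;
    }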
10:15 am - 10:45 am | Invited Talk
ChaNGa: Charm N-body GrAvity solver
Prof. Thomas R. Quinn
Department of Astronomy, University of Washington
Simulations of galaxies forming in their cosmological context pose a number of challenges for performance on large parallel machines. The first is the highly non-local nature of gravitational forces: galaxies are influenced by gravitational forces originating tens of megaparsecs away, requiring significant communication in the force solver. Second is the enormous spatial dynamic range involved, from megaparsecs down to sub-parsec scales, requiring dynamic hierarchical data structures. Third is the vast range of time scales involved, from less than a million years to the age of the Universe, posing significant challenges for load balancing. This talk will present how these challenges have been addressed in the design of ChaNGa, the Charm N-body GrAvity solver.
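The standard way tree codes tame this non-locality is a multipole acceptance criterion: a node far enough away, judged by its size over its distance, is replaced by its center of mass. The sketch below shows that generic Barnes-Hut logic (our illustration, not ChaNGa's implementation; theta and the softening constant are invented).

    #include <cmath>

    struct Node {
      double mass, cx, cy, cz;   // total mass and center of mass
      double size;               // edge length of the node's bounding box
      Node* child[8];            // all null for leaf nodes
    };

    // Accumulate the acceleration on a particle at (px, py, pz), in G = 1
    // units: distant nodes contribute as a point mass, near ones are opened,
    // giving roughly O(log N) work per particle instead of O(N).
    void accumulateForce(const Node* n, double px, double py, double pz,
                         double theta, double f[3]) {
      if (!n || n->mass == 0) return;
      double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
      double r = std::sqrt(dx*dx + dy*dy + dz*dz) + 1e-12;  // softened
      if (n->child[0] == nullptr || n->size / r < theta) {
        double a = n->mass / (r * r * r);
        f[0] += a * dx; f[1] += a * dy; f[2] += a * dz;
      } else {
        for (Node* c : n->child)       // too close: open the node and recurse
          accumulateForce(c, px, py, pz, theta, f);
      }
    }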
10:45 am - 11:00 am | Break
Morning: Technical Session (Session Chair: Dr. Celso Mendes)
11:00 am - 11:30 am | Invited Talk
ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
Prof. Christopher D. Carothers
Computer Science Department, Rensselaer Polytechnic Institute (RPI)
We present the design of a robust Time Warp simulator, using Rensselaer's Optimistic Simulation System (ROSS), that delivers scalable parallel performance over a variety of communication loads and for an extremely large number of processors, up to 131,072. At 65,536 processors, ROSS produces a peak event rate of 12.26 billion events per second at 10% remote events and 4 billion events per second at 100% remote events, the largest ever reported. Additionally, for the Transmission Line Matrix (TLM) model, which approximates Maxwell's equations for electromagnetic wave propagation, we report an event rate in excess of 100 million events per second on 5,000 processors with 200 million grid-LPs. The TLM model is used to model highly accurate radio wave propagation for tens of thousands of radio devices much faster than real time.
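For readers unfamiliar with Time Warp, the miniature logical process below shows the optimistic core in isolation (a generic sketch of the idea, not ROSS's API): events execute speculatively against snapshotted state, and a straggler triggers a rollback rather than a global synchronization.

    #include <map>

    struct LogicalProcess {
      long state = 0;                    // the LP's simulation state
      double now = 0;                    // local virtual time
      std::map<double, long> snapshots;  // timestamp -> state before event

      void process(double ts, long delta) {
        if (ts < now) rollback(ts);      // straggler: undo speculative work
        snapshots[ts] = state;           // save state before this event
        state += delta;                  // "execute" the event
        now = ts;
      }

      void rollback(double ts) {
        auto it = snapshots.lower_bound(ts);     // first event to undo
        if (it == snapshots.end()) return;
        state = it->second;                      // restore pre-event state
        snapshots.erase(it, snapshots.end());
        now = snapshots.empty() ? 0 : snapshots.rbegin()->first;
        // a full Time Warp also re-enqueues the undone events and sends
        // anti-messages to cancel any messages they had generated
      }
    };

    int main() {
      LogicalProcess lp;
      lp.process(1.0, 10);
      lp.process(3.0, 20);   // speculative execution
      lp.process(2.0, 5);    // straggler: rolls back the ts = 3.0 event
      return 0;
    }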
11:30 am - 12:00 pm | Talk
Application Experience with the GPU
Isaac Dooley
We describe our experiences porting an explicit finite element application to CUDA. We describe our approach to structuring the application and the resulting near-perfect scaling on up to 128 nodes of the Lincoln cluster. Each Lincoln node contains two Intel quad-core processors and two GPUs (half of one NVIDIA Tesla unit).
12:00 pm - 12:30 pm | Talk
Developing an Abstraction for Accelerator Programming
David Kunzman
In recent years, the use of accelerators such as GPUs, FPGAs, and the SPEs on the Cell processor has become commonplace. The desire to use these specialized devices stems from the desire for performance. However, the devices are also considerably harder to program than typical commodity cores. In this talk, which will focus on the Cell processor, we describe extensions to the Charm++ programming model that allow applications to take advantage of accelerators.
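The flavor of such an abstraction, sketched below with entirely hypothetical names (this is not the actual Charm++ extension syntax), is that the programmer writes one kernel against plain buffers plus a descriptor of what moves in and out, and the runtime decides whether it runs on a host core or on an accelerator such as a Cell SPE.

    #include <cstdio>
    #include <vector>

    // Hypothetical work descriptor: tells the runtime which buffers to stage
    // into accelerator local store and which to copy back afterwards.
    struct WorkRequest {
      const double* in;   // staged to the accelerator before the kernel runs
      double* out;        // copied back to the host when the kernel finishes
      int n;
    };

    // The kernel body is identical either way: on a Cell SPE the runtime
    // would hand it local-store addresses, on a host core the originals.
    inline void scaleKernel(const WorkRequest& w) {
      for (int i = 0; i < w.n; ++i) w.out[i] = 2.0 * w.in[i];
    }

    // Stand-in for the runtime's dispatch decision: here, always the host.
    inline void enqueue(const WorkRequest& w) { scaleKernel(w); }

    int main() {
      std::vector<double> a(4, 1.5), b(4);
      enqueue(WorkRequest{a.data(), b.data(), 4});
      std::printf("%f\n", b[0]);   // prints 3.000000
      return 0;
    }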
12:30 pm - 1:30 pm | Lunch Break
Afternoon: Technical Session (Session Chair: Eric Bohm)
1:30 pm - 2:00 pm | Talk
Using Charm++ to Effectively Exploit Multicore SMPs
Prof. Laxmikant V. Kale
When Charm++ (actually, its precursor, the Chare Kernel) was originally developed, one of the motivations was to provide a portable programming model spanning the prevailing shared-memory and distributed-memory machines, and it has had several features aimed at this objective. This talk will present these, along with recently extended mechanisms, showing how and why Charm++ is a powerful methodology for programming a single multicore desktop client, and not just clusters.
We will examine the importance of locality-centric programming models for performance, and how to utilize shared memory within this context in Charm++.
2:00 pm - 2:30 pm | Talk
New Paradigms in Parallel Programming
Aaron Becker, Pritish Jetley, and Phil Miller
Parallel programming is becoming more common, but it is not getting any easier. Although many agree that we need new models of parallel programming to facilitate the next generations of high-performance applications, the exact form such a model will take is unclear. We present an approach based not on one new model that solves the problems of exploding parallelism, but on many incomplete models. By intentionally addressing only a subset of the problems raised by parallel applications, these models can provide increased safety and simpler semantics without sacrificing performance.
2:30 pm - 3:00 pm | Invited Talk
Charon: Parallel Linear Algebra Made Easy
Gregory M. Crosswhite, Graduate Student, Department of Physics, University of Washington
In this talk we present current work on Charon, a framework in Charm++ for parallel linear algebra. The motivation behind this framework is the fact that current linear algebra libraries, while powerful, can be awkward to use in practice. One particularly striking example is that each library tends to specialize in a particular class of operations, such as eigenproblem solving, so that to obtain a full spectrum of linear algebra operations one may need to juggle several libraries with different conventions.
The goal of the Charon framework is to leverage Charm++'s asynchronous communication and MPI virtualization to make it easier for users to write parallel linear algebra codes in terms of higher-level operations, taking care of the low-level glue "under the hood." The framework has two basic components: a distributed array object that supports basic linear algebra operations using a master/slave model, and a layer of "AMPI slaves" that wrap more sophisticated operations implemented in other MPI libraries.
In this talk, we shall discuss our progress so far on the design and implementation of Charon, and in particular describe some challenges we have met, such as the need to implement automatic global-variable privatization to assist in using MPI libraries.
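To illustrate the intended user experience, the fragment below mocks the kind of array-level interface described; every name is hypothetical (the talk predates a public API), and a trivial serial implementation stands in for the distributed machinery so the sketch is self-contained.

    #include <cstdio>
    #include <vector>

    // Mock of a Charon-style high-level interface: in the real framework the
    // data would live in a distributed (master/slave) array, and operations
    // could be forwarded to wrapped MPI libraries running as "AMPI slaves".
    class DistMatrix {
      std::vector<double> a;   // row-major; distributed in the real thing
      int n;
    public:
      explicit DistMatrix(int n_, double fill = 0.0)
        : a(n_ * n_, fill), n(n_) {}
      double& at(int i, int j) { return a[i * n + j]; }

      // In the described model this would launch asynchronous work and
      // return a handle immediately; the mock computes eagerly.
      DistMatrix operator*(const DistMatrix& rhs) const {
        DistMatrix c(n);
        for (int i = 0; i < n; ++i)
          for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
              c.a[i * n + j] += a[i * n + k] * rhs.a[k * n + j];
        return c;
      }
    };

    int main() {
      DistMatrix a(2, 1.0), b(2, 2.0);   // two 2x2 constant matrices
      DistMatrix c = a * b;
      std::printf("%f\n", c.at(0, 0));   // prints 4.000000
      return 0;
    }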
3:00 pm - 3:15 pm | Break
Afternoon: Technical Session (Session Chair: Ramprasad Venkataraman)
3:15 pm - 3:45 pm | Talk
Works in Progress: 1. A Generic Adaptive Runtime Autotuning Framework; 2. Effects of Network Contention on Messaging
Isaac Dooley and Abhinav Bhatele
This talk will present ongoing work in two areas:
1. Isaac will present his ongoing research in creating an intelligent performance-tuning framework for Charm++. The framework tunes a set of parameters exposed by the program, by libraries, or by the runtime system. The parameters are tuned intelligently using knowledge about the parameters along with an analysis of observed performance characteristics, including critical-path profiles, memory usage statistics, and computational load statistics. He will discuss preliminary results from sample programs exhibiting various performance problems.
2. Abhinav will discuss his study showing that, with the emergence of very large supercomputers typically connected as a 3D torus or mesh, topology effects have become important again. He will present an evaluative study of the effect of contention on message latencies in torus and mesh networks.
3:45 pm - 4:15 pm | Talk
OpenAtom
Dr. Glenn J. Martyna
Physical Sciences Division, IBM T. J. Watson Research Center
OpenAtom is the production release of LeanCP, a quantum chemistry application that implements the Car-Parrinello ab initio molecular dynamics (CPAIMD) method. This talk will cover new scientific discoveries being made using OpenAtom and the recent performance of the code.
4:15 pm - 4:45 pm | Submitted Paper
ParTopS: Compact Topological Framework for Parallel Fragmentation Simulations
Rodrigo de Souza Lima Espinha
Graduate Student, Computer Science Department, Pontifical Catholic University of Rio de Janeiro
Cohesive models are used for the simulation of fracture, branching, and fragmentation phenomena at various scales. Those models require high levels of mesh refinement at the crack-tip region so that nonlinear behavior can be captured and physically meaningful results obtained. This imposes the use of large meshes whose computational and memory costs are prohibitively expensive for a single traditional workstation. If an extrinsic cohesive model is to be used, support for the dynamic insertion of cohesive elements is also required. This paper proposes a topological framework (ParTopS) for supporting parallel adaptive fragmentation simulations that provides operations for the dynamic insertion of cohesive elements, in a uniform way, for both two- and three-dimensional unstructured meshes. Cohesive elements are represented explicitly and treated like any other regular element. The framework is built as an extension of a compact adjacency-based serial topological data structure (TopS), which can natively handle the representation of cohesive elements. Symmetrical modifications of duplicated entities are used to reduce the communication of topological changes among mesh partitions and also to avoid the use of locks. The correctness and efficiency of the proposed framework are demonstrated by a series of arbitrary insertions of cohesive elements into sample meshes.
4:45 pm - 5:15 pm | Event
Annual PPL Group Photograph
8:30 am onwards
Morning: Tutorials (Session Chair: Ryan Mokos)
9:00 am - 10:15 am | Tutorial
Analyzing Program Performance with Projections
Chee Wai Lee
Projections is a performance tool used to analyze Charm++ applications. The tutorial will guide attendees through the basics of instrumentation, trace generation, and the various visualization features. We will use a simple Charm++ toy program as a case study, and we will also explore more advanced features that aid the identification of bottlenecks in larger datasets. We will also explore how TAU profiles may be generated from Charm++ programs.
10:15 am - 10:30 am | Break
10:30 am - 12:00 pm | Tutorial
BigSim
Dr. Gengbin Zheng and Ryan Mokos
This tutorial will describe how to build and use the BigSim large-machine simulator. Topics covered will include building applications on BigSim, instrumenting applications with the BigSim LogAPI, using the interpolation tool to integrate results from other simulators, using BigNetSim, and implementing new interconnects in BigNetSim.
12:00 pm - 1:00 pm | Tour
Tour of NCSA Supercomputers