Morning: Tutorials (Session Chair: Viraj Paropkari)
10:30 am - 12:00 pm | Tutorial
Debugging Parallel Programs with CharmDebug
Filippo Gioachin
In this tutorial, I will present the basics of CharmDebug, a parallel debugger tailored to Charm++. I will start with an overview of the system and how an application can be prepared and started under CharmDebug. The tutorial will then continue with a case study: a Jacobi 2D stencil computation program (similar to the one in examples/charm++/jacobi2d-iter of the Charm++ distribution). This program has been modified to introduce a bug, which we will detect using CharmDebug's introspection capabilities. Finally, the same program will serve as a testbed for detecting various memory leaks.
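As a flavor of the case study, here is a schematic Charm++ fragment (ours, not the actual tutorial code) of a Jacobi-style chare that leaks a buffer of the kind CharmDebug's memory views can flag; BLOCK, the entry names, and the omitted mainchare boilerplate are illustrative assumptions.

    // jacobi.ci (interface file), sketched as a comment for context:
    //   mainmodule jacobi {
    //     array [2D] Jacobi {
    //       entry Jacobi();
    //       entry void recvGhost(int dir, int n, double row[n]);
    //     };
    //   };

    #include <algorithm>
    #include "jacobi.decl.h"          // generated by charmc from jacobi.ci

    static const int BLOCK = 64;      // per-chare block size (illustrative)

    class Jacobi : public CBase_Jacobi {
      double* grid;
    public:
      Jacobi() : grid(new double[BLOCK * BLOCK]()) {}
      ~Jacobi() { delete[] grid; }

      void recvGhost(int dir, int n, double* row) {
        double* copy = new double[n];   // deliberate leak: never freed, so a
        std::copy(row, row + n, copy);  // memory-leak scan can point at it
        // ... apply the ghost row for direction 'dir', relax the interior ...
      }
    };

    #include "jacobi.def.h"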
12:00 noon - 1:00 pm | Lunch Break
1:00 pm - 2:30 pm | Tutorial
Advanced Charm++ and Virtualization
Eric Bohm
This tutorial introduces advanced Charm++ features such as custom load balancers, custom initial placement, groups, nodegroups, threads, delegation, multicasts, and the Structured Dagger (SDAG) framework for expressing control flow.
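As a taste of SDAG, the illustrative .ci fragment below (our sketch, with invented entry names, and 'iter'/'numIters' assumed to be member variables of the chare class) expresses a stencil's per-iteration dependences declaratively: each iteration waits for both ghost messages, matched by the [iter] reference number, with no hand-written message counters.

    array [1D] Stencil {
      entry Stencil();
      entry void recvLeft(int it, int n, double ghost[n]);
      entry void recvRight(int it, int n, double ghost[n]);
      entry void run() {
        for (iter = 0; iter < numIters; iter++) {
          serial { sendGhosts(); }              // send boundaries to neighbors
          when recvLeft[iter](int it, int n, double l[n]),
               recvRight[iter](int it, int n, double r[n])
            serial { relax(l, r, n); }          // runs once both have arrived
        }
      };
    };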
Afternoon: Opening Session (Session Chair: Dr. Gengbin Zheng)
2:30 pm - 3:15 pm | Talk
Charm++ and Affiliated Research: Recent Developments
Prof. Laxmikant V. Kale
Charm++, and the surprisingly rich research agenda engendered by its idea of object-based over-decomposition, made significant progress during the past year. I will review the basic concepts that have been the foundation of our approach to parallel programming, and highlight specific achievements of the past year. These include progress on our production-quality, collaboratively developed science and engineering applications, such as NAMD (biophysics), OpenAtom (quantum chemistry), and ChaNGa (astronomy). I will also highlight some of the progress and challenges in our agenda of developing higher-level parallel languages.
3:15 pm - 3:30 pm | Break
3:30 pm - 4:00 pm | Talk
Blue Waters and PPL's Role
Celso Mendes and Eric Bohm
The Blue Waters project will provide a computational system capable of sustained petaflop performance on a range of science and engineering applications. The system will be deployed at the University of Illinois at Urbana-Champaign in 2011, under NCSA's administration. In this talk, we will provide an overview of the main aspects of the Blue Waters system and describe the work related to Blue Waters being conducted in the Parallel Programming Laboratory. That work comprises preparing the Charm++/AMPI infrastructure for deployment on Blue Waters, supporting the efficient porting of the NAMD code to Blue Waters, and deploying a version of the BigSim simulation system capable of simulating machines at the scale of Blue Waters, targeting early application development.
4:00 pm - 4:30 pm | Submitted Paper
Integrated Performance Views in Charm++: Projections meets TAU
Prof. Allen D. Malony
Department of Computer and Information Science, University of Oregon
The Charm++ parallel programming system provides a modular performance interface that can be used to extend its performance measurement and analysis capabilities. The interface exposes execution events of interest, representing Charm++ scheduling operations, application methods/routines, and communication events, for observation by alternative performance modules configured to implement different measurement features. The paper describes Charm++'s performance interface and how the Charm++ Projections tool and the TAU Performance System can provide integrated trace-based and profile-based performance views. These two tools are often complementary, providing the user with different performance perspectives on Charm++ applications based on performance-data detail and temporal and spatial analysis. How the tools work in practice is demonstrated in a parallel performance analysis of NAMD, a scalable molecular dynamics code that exercises many of Charm++'s unique features.
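As a small illustration of the measurement side, the sketch below (our example; the class, entry, and event names are invented) registers a named user event through Charm++'s tracing API and brackets a compute phase with it, which the trace-based views can then display alongside the runtime's own events.

    #include "worker.decl.h"   // generated from an assumed worker.ci

    class Worker : public CBase_Worker {
      int evtCompute;          // trace event id for the compute phase
    public:
      Worker() {
        // Register a named user event (id 100 is an arbitrary choice);
        // registration would normally be done once per PE.
        evtCompute = traceRegisterUserEvent("compute phase", 100);
      }
      void compute() {
        double t0 = CkWallTimer();
        // ... numerical kernel ...
        traceUserBracketEvent(evtCompute, t0, CkWallTimer());
      }
    };

    #include "worker.def.h"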
4:30 pm - 6:00 pm | Panel Discussion
A Single Programming Model for Clusters and Multiprocessor Nodes: Dream, Nightmare, Reality, or Vision?
Panelists: William D. Gropp, Laxmikant V. Kale, David A. Padua, Arch Robison, Marc Snir
Multiprocessor nodes are now commonplace in clusters as well as in client desktops. Is it feasible to think of a single programming model that will work whether one is programming an individual multiprocessor node, a cluster of uniprocessor nodes (a disappearing dinosaur), or a cluster of SMP nodes? Within the HPC/CSE community, are there organizing principles that allow us to think of a unified model? Or are the concerns so different that they deserve different models? Does the class of application being pursued make a difference? It could be that it is impractical to pursue this idea (a "dream"), or that a unified model will be possible but a monstrosity to learn and program with (a "nightmare"), or that it is already a reality (MPI programmers have typically used the MPI-everywhere model, with sporadic use of MPI-OpenMP hybrids, and Charm++ supports a low-level unified model), or perhaps it is a vision that should motivate us to develop new programming models with this explicit aim. The panelists, with expertise spanning a wide spectrum of programming models, will weigh in on this question. We hope to leave adequate time for discussion with audience participation.
6:00 pm onwards | Workshop Banquet (for registered participants only)
8:30 am onwards
Morning: Keynote and Technical Session (Session Chair: Prof. Kale)
9:00 am - 9:45 am | Keynote
PBGL: A High-Performance Parallel Distributed-Memory Graph Library
Prof. Andrew Lumsdaine
Computer Science Department, Indiana University
The increasing complexity of parallel architectures, coupled with the growing importance of new classes of parallel applications, calls for new tools and new development paradigms. Of particular importance is the need to develop software in an architecture-independent fashion while still being able to take advantage of architecture-specific features. In this talk, we present the design and implementation of the Parallel Boost Graph Library, a library of high-performance reusable software components for distributed graph computation. Like the sequential Boost Graph Library (BGL) upon which it is based, the Parallel BGL applies the paradigm of generic programming to the domain of graph computations. To illustrate how the Parallel BGL was built from the sequential BGL, we revisit the abstractions comprising the BGL in the context of distributed-memory parallelism and lift away the implicit requirements of sequential execution and a single shared address space. Through this process, we are able to create generic algorithms that have a sequential expression and require only the introduction of external (distributed) data structures for parallel execution. More importantly, the generic implementation retains its sequential interface and semantics, such that other distributed algorithms can be built upon it, just as algorithms are layered in the sequential case. By characterizing these extensions as well as the extension process, we develop general principles and patterns for using (and reusing) generic parallel software libraries. We demonstrate that the resulting algorithm implementations are both efficient and scalable, with performance results for several algorithms implemented in the open-source Parallel Boost Graph Library.
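To make the retained sequential interface concrete, here is a small sequential BGL program computing BFS distances (our example, not from the talk); on the talk's premise, the same generic call pattern carries over to the Parallel BGL's distributed graph types.

    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/breadth_first_search.hpp>
    #include <boost/graph/visitors.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      typedef boost::adjacency_list<boost::vecS, boost::vecS,
                                    boost::undirectedS> Graph;
      Graph g(5);
      boost::add_edge(0, 1, g);
      boost::add_edge(1, 2, g);
      boost::add_edge(0, 3, g);
      boost::add_edge(3, 4, g);

      // Record BFS distances from vertex 0; a raw pointer works as the
      // distance property map because vertex descriptors are indices here.
      std::vector<int> dist(boost::num_vertices(g), 0);
      boost::breadth_first_search(g, boost::vertex(0, g),
          boost::visitor(boost::make_bfs_visitor(
              boost::record_distances(&dist[0], boost::on_tree_edge()))));

      for (std::size_t i = 0; i < dist.size(); ++i)
        std::cout << "dist(0, " << i << ") = " << dist[i] << "\n";
      return 0;
    }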
9:45 am - 10:15 am | Invited Talk
Parallel Rendering in the GPU Era
Prof. Orion S. Lawlor
Department of Computer Science, University of Alaska Fairbanks
The classic model for parallel rendering, including Charm++ liveViz, is that a parallel array of CPUs computes, composites, and sends pixels to a low-sophistication client. The rise of programmable graphics processing unit (GPU) hardware provides an opportunity to significantly increase delivered application rendering performance, but taking full advantage of the GPU's power requires a modified application model. In this talk, we will describe our modification of liveViz to transmit "volume impostors," which are composited on the client's graphics hardware; our parallelizing powerwall rendering library, MPIglut; and lessons learned, pitfalls to avoid, and directions for future parallel GPU rendering research.
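MPIglut is, as we understand it, designed for source compatibility with GLUT, so an ordinary GLUT program like the sketch below (ours; the exact build and link steps against MPIglut are an assumption) can be rebuilt to run distributed across a powerwall.

    #include <GL/glut.h>   // with MPIglut, the same source is rebuilt against it

    void display(void) {
      glClear(GL_COLOR_BUFFER_BIT);
      glBegin(GL_TRIANGLES);           // one triangle; nothing powerwall-aware
        glColor3f(1, 0, 0); glVertex2f(-0.5f, -0.5f);
        glColor3f(0, 1, 0); glVertex2f( 0.5f, -0.5f);
        glColor3f(0, 0, 1); glVertex2f( 0.0f,  0.5f);
      glEnd();
      glutSwapBuffers();
    }

    int main(int argc, char** argv) {
      glutInit(&argc, argv);           // a parallelizing GLUT can intercept
      glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);   // these calls to carve
      glutInitWindowSize(512, 512);                  // up the display wall
      glutCreateWindow("powerwall demo");
      glutDisplayFunc(display);
      glutMainLoop();
      return 0;
    }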
10:15 am - 10:45 am | Invited Talk
ChaNGa: Charm N-body GrAvity solver
Prof. Thomas R. Quinn
Department of Astronomy, University of Washington
Simulations of galaxies forming in their cosmological context pose a number of challenges for performance on large parallel machines. The first is the highly non-local nature of gravitational forces: galaxies are influenced by gravitational forces originating tens of megaparsecs away, requiring significant communication in the force solver. Second is the enormous spatial dynamic range involved, from megaparsecs down to sub-parsec scales, requiring dynamic hierarchical data structures. Third is the vast range of time scales involved, from less than a million years to the age of the Universe, posing significant challenges for load balancing. This talk will present how these challenges have been addressed in the design of ChaNGa, the Charm N-body GrAvity solver.
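The standard way tree codes tame this non-locality is a multipole acceptance criterion: a node far enough away, judged by its size over its distance, is replaced by its center of mass. The sketch below shows that generic Barnes-Hut logic (our illustration, not ChaNGa's implementation; theta and the softening constant are invented).

    #include <cmath>

    struct Node {
      double mass, cx, cy, cz;   // total mass and center of mass
      double size;               // edge length of the node's bounding box
      Node* child[8];            // all null for leaf nodes
    };

    // Accumulate the acceleration on a particle at (px, py, pz), in G = 1
    // units: distant nodes contribute as a point mass, near ones are opened,
    // giving roughly O(log N) work per particle instead of O(N).
    void accumulateForce(const Node* n, double px, double py, double pz,
                         double theta, double f[3]) {
      if (!n || n->mass == 0) return;
      double dx = n->cx - px, dy = n->cy - py, dz = n->cz - pz;
      double r = std::sqrt(dx*dx + dy*dy + dz*dz) + 1e-12;  // softened
      if (n->child[0] == nullptr || n->size / r < theta) {
        double a = n->mass / (r * r * r);
        f[0] += a * dx; f[1] += a * dy; f[2] += a * dz;
      } else {
        for (Node* c : n->child)       // too close: open the node and recurse
          accumulateForce(c, px, py, pz, theta, f);
      }
    }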
10:45 am - 11:00 am | Break
Morning: Technical Session (Session Chair: Dr. Celso Mendes)
11:00 am - 11:30 am | Invited Talk
ROSS: Parallel Discrete-Event Simulations on Near Petascale Supercomputers
Prof. Christopher D. Carothers
Computer Science Department, Rensselaer Polytechnic Institute (RPI)
We present the design of a robust Time Warp simulator, using Rensselaer's Optimistic Simulation System (ROSS), that delivers scalable parallel performance over a variety of communication loads and for an extremely large number of processors, up to 131,072. At 65,536 processors, ROSS produces a peak event rate of 12.26 billion events per second at 10% remote events and 4 billion events per second at 100% remote events, the largest ever reported. Additionally, for the Transmission Line Matrix (TLM) model, which approximates Maxwell's equations for electromagnetic wave propagation, we report an event rate in excess of 100 million events per second on 5,000 processors with 200 million grid-LPs. The TLM model is used to model highly accurate radio wave propagation for tens of thousands of radio devices much faster than real time.
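For readers unfamiliar with Time Warp, the miniature logical process below shows the optimistic core in isolation (a generic sketch of the idea, not ROSS's API): events execute speculatively against snapshotted state, and a straggler triggers a rollback rather than a global synchronization.

    #include <map>

    struct LogicalProcess {
      long state = 0;                    // the LP's simulation state
      double now = 0;                    // local virtual time
      std::map<double, long> snapshots;  // timestamp -> state before event

      void process(double ts, long delta) {
        if (ts < now) rollback(ts);      // straggler: undo speculative work
        snapshots[ts] = state;           // save state before this event
        state += delta;                  // "execute" the event
        now = ts;
      }

      void rollback(double ts) {
        auto it = snapshots.lower_bound(ts);     // first event to undo
        if (it == snapshots.end()) return;
        state = it->second;                      // restore pre-event state
        snapshots.erase(it, snapshots.end());
        now = snapshots.empty() ? 0 : snapshots.rbegin()->first;
        // a full Time Warp also re-enqueues the undone events and sends
        // anti-messages to cancel any messages they had generated
      }
    };

    int main() {
      LogicalProcess lp;
      lp.process(1.0, 10);
      lp.process(3.0, 20);   // speculative execution
      lp.process(2.0, 5);    // straggler: rolls back the ts = 3.0 event
      return 0;
    }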
11:30 am - 12:00 pm | Talk
Application Experience with the GPU
Isaac Dooley
We describe our experiences porting an explicit finite element application to CUDA. We describe our approach to structuring the application and the resulting near-perfect scaling on up to 128 nodes of the Lincoln cluster. Each Lincoln node contains two Intel quad-core processors and two GPUs (half of one NVIDIA Tesla unit).
12:00 pm - 12:30 pm | Talk
Developing an Abstraction for Accelerator Programming
David Kunzman
In recent years, the use of accelerators such as GPUs, FPGAs, and the SPEs on the Cell processor has become commonplace. The desire to use these specialized devices stems from the desire for performance. However, the devices are also considerably harder to program than typical commodity cores. In this talk, which will focus on the Cell processor, we describe extensions to the Charm++ programming model that allow applications to take advantage of accelerators.
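The flavor of such an abstraction, sketched below with entirely hypothetical names (this is not the actual Charm++ extension syntax), is that the programmer writes one kernel against plain buffers plus a descriptor of what moves in and out, and the runtime decides whether it runs on a host core or on an accelerator such as a Cell SPE.

    #include <cstdio>
    #include <vector>

    // Hypothetical work descriptor: tells the runtime which buffers to stage
    // into accelerator local store and which to copy back afterwards.
    struct WorkRequest {
      const double* in;   // staged to the accelerator before the kernel runs
      double* out;        // copied back to the host when the kernel finishes
      int n;
    };

    // The kernel body is identical either way: on a Cell SPE the runtime
    // would hand it local-store addresses, on a host core the originals.
    inline void scaleKernel(const WorkRequest& w) {
      for (int i = 0; i < w.n; ++i) w.out[i] = 2.0 * w.in[i];
    }

    // Stand-in for the runtime's dispatch decision: here, always the host.
    inline void enqueue(const WorkRequest& w) { scaleKernel(w); }

    int main() {
      std::vector<double> a(4, 1.5), b(4);
      enqueue(WorkRequest{a.data(), b.data(), 4});
      std::printf("%f\n", b[0]);   // prints 3.000000
      return 0;
    }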
12:30 pm - 1:30 pm | Lunch Break
Afternoon: Technical Session (Session Chair: Eric Bohm)
1:30 pm - 2:00 pm | Talk
Using Charm++ to Effectively Exploit Multicore SMPs
Prof. Laxmikant V. Kale
When Charm++ (actually, its precursor, the Chare Kernel) was originally developed, one of the motivations was to provide a portable programming model spanning the prevailing shared-memory and distributed-memory machines, and it has had several features aimed at this objective. This talk will present these, along with recently extended mechanisms, showing how and why Charm++ is a powerful methodology for programming a single multicore desktop client, and not just clusters.
We will examine the importance of locality-centric programming models for performance, and how to utilize shared memory within this context in Charm++.
2:00 pm - 2:30 pm | Talk
New Paradigms in Parallel Programming
Aaron Becker, Pritish Jetley, and Phil Miller
Parallel programming is becoming more common, but it is not getting any easier. Although many agree that we need new models of parallel programming to facilitate the next generations of high-performance applications, the exact form such a model will take is unclear. We present an approach based not on one new model that solves the problems of exploding parallelism, but on many incomplete models. By intentionally addressing only a subset of the problems raised by parallel applications, these models can provide increased safety and simpler semantics without sacrificing performance.
2:30 pm - 3:00 pm | Invited Talk
Charon: Parallel Linear Algebra Made Easy
Gregory M. Crosswhite, Graduate Student, Department of Physics, University of Washington
In this talk we present current work on Charon, a framework in Charm++ for parallel linear algebra. The motivation behind this framework is the fact that current linear algebra libraries, while powerful, can be awkward to use in practice. One particularly striking example is that each library tends to specialize in a particular class of operations, such as eigenproblem solving, so that to obtain a full spectrum of linear algebra operations one may need to juggle several libraries with different conventions.
The goal of the Charon framework is to leverage Charm++'s asynchronous communication and MPI virtualization to make it easier for users to write parallel linear algebra codes in terms of higher-level operations, taking care of the low-level glue "under the hood." The framework has two basic components: a distributed array object that supports basic linear algebra operations using a master/slave model, and a layer of "AMPI slaves" that wrap more sophisticated operations implemented in other MPI libraries.
In this talk, we shall discuss our progress so far on the design and implementation of Charon, and in particular describe some challenges we have met, such as the need to implement automatic global-variable privatization to assist in using MPI libraries.
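To illustrate the intended user experience, the fragment below mocks the kind of array-level interface described; every name is hypothetical (the talk predates a public API), and a trivial serial implementation stands in for the distributed machinery so the sketch is self-contained.

    #include <cstdio>
    #include <vector>

    // Mock of a Charon-style high-level interface: in the real framework the
    // data would live in a distributed (master/slave) array, and operations
    // could be forwarded to wrapped MPI libraries running as "AMPI slaves".
    class DistMatrix {
      std::vector<double> a;   // row-major; distributed in the real thing
      int n;
    public:
      explicit DistMatrix(int n_, double fill = 0.0)
        : a(n_ * n_, fill), n(n_) {}
      double& at(int i, int j) { return a[i * n + j]; }

      // In the described model this would launch asynchronous work and
      // return a handle immediately; the mock computes eagerly.
      DistMatrix operator*(const DistMatrix& rhs) const {
        DistMatrix c(n);
        for (int i = 0; i < n; ++i)
          for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
              c.a[i * n + j] += a[i * n + k] * rhs.a[k * n + j];
        return c;
      }
    };

    int main() {
      DistMatrix a(2, 1.0), b(2, 2.0);   // two 2x2 constant matrices
      DistMatrix c = a * b;
      std::printf("%f\n", c.at(0, 0));   // prints 4.000000
      return 0;
    }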
3:00 pm - 3:15 pm | Break
Afternoon: Technical Session (Session Chair: Ramprasad Venkataraman)
3:15 pm - 3:45 pm | Talk
Works in Progress: 1. A Generic Adaptive Runtime Autotuning Framework; 2. Effects of Network Contention on Messaging
Isaac Dooley and Abhinav Bhatele
This talk will present ongoing work in two areas:
1. Isaac will present his ongoing research in creating an intelligent performance-tuning framework for Charm++. The framework tunes a set of parameters exposed by the program, by libraries, or by the runtime system. The parameters are tuned intelligently using knowledge about the parameters along with an analysis of observed performance characteristics, including critical-path profiles, memory usage statistics, and computational load statistics. He will discuss preliminary results from sample programs exhibiting various performance problems.
2. Abhinav will discuss his study showing that, with the emergence of very large supercomputers typically connected as a 3D torus or mesh, topology effects have become important again. He will present an evaluative study of the effect of contention on message latencies in torus and mesh networks.
3:45 pm - 4:15 pm | Talk
OpenAtom
Dr. Glenn J. Martyna
Physical Sciences Division, IBM T. J. Watson Research Center
OpenAtom is the production release of LeanCP, a quantum chemistry application that implements the Car-Parrinello ab initio molecular dynamics (CPAIMD) method. This talk will cover new scientific discoveries being made using OpenAtom and the recent performance of the code.
4:15 pm - 4:45 pm | Submitted Paper
ParTopS: Compact Topological Framework for Parallel Fragmentation Simulations
Rodrigo de Souza Lima Espinha
Graduate Student, Computer Science Department, Pontifical Catholic University of Rio de Janeiro
Cohesive models are used for the simulation of fracture, branching, and fragmentation phenomena at various scales. Those models require high levels of mesh refinement at the crack-tip region so that nonlinear behavior can be captured and physically meaningful results obtained. This imposes the use of large meshes whose computational and memory costs are prohibitively expensive for a single traditional workstation. If an extrinsic cohesive model is to be used, support for the dynamic insertion of cohesive elements is also required. This paper proposes a topological framework (ParTopS) for supporting parallel adaptive fragmentation simulations that provides operations for the dynamic insertion of cohesive elements, in a uniform way, for both two- and three-dimensional unstructured meshes. Cohesive elements are represented explicitly and treated like any other regular element. The framework is built as an extension of a compact adjacency-based serial topological data structure (TopS), which can natively handle the representation of cohesive elements. Symmetrical modifications of duplicated entities are used to reduce the communication of topological changes among mesh partitions and also to avoid the use of locks. The correctness and efficiency of the proposed framework are demonstrated by a series of arbitrary insertions of cohesive elements into sample meshes.
4:45 pm - 5:15 pm | Event
Annual PPL Group Photograph
8:30 am onwards
Morning: Tutorials (Session Chair: Ryan Mokos)
9:00 am - 10:15 am | Tutorial
Analyzing Program Performance with Projections
Chee Wai Lee
Projections is a performance tool used to analyze Charm++ applications. The tutorial will guide attendees through the basics of instrumentation, trace generation, and the various visualization features. We will use a simple Charm++ toy program as a case study, and we will also explore more advanced features that aid the identification of bottlenecks in larger datasets. We will also explore how TAU profiles may be generated from Charm++ programs.
10:15 am - 10:30 am | Break
10:30 am - 12:00 pm | Tutorial
BigSim
Dr. Gengbin Zheng and Ryan Mokos
This tutorial will describe how to build and use the BigSim large-machine simulator. Topics covered will include building applications on BigSim, instrumenting applications with the BigSim LogAPI, using the interpolation tool to integrate results from other simulators, using BigNetSim, and implementing new interconnects in BigNetSim.
12:00 pm - 1:00 pm | Tour
Tour of NCSA Supercomputers