Time | Type | Description | Slides | Video
8:30 am - 9:00 am |
Continental Breakfast / Registration - NCSA 1st Floor Lobby |
Morning |
Opening Session - NCSA Auditorium 1122 |
9:00 am - 9:20 am |
Welcome
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
9:20 am - 10:20 am |
Keynote
Building a Better Astrophysics AMR Code with Charm++: Enzo-P/Cello
Prof. Michael Norman, University of California San Diego
In order to effectively utilize modern HPC architectures and maximize programmer productivity, new abstractions are required that effect a "separation of concerns" between hardware-specific issues, parallelism, data structures, and application-specific solvers. Object-oriented programming can achieve this abstraction without sacrificing performance. A new scalable adaptive mesh refinement (AMR) method for astrophysics code applications at extreme scale has emerged from this approach -- the Enzo-P/Cello project under development at SDSC [1]. Cello is an extremely scalable AMR software infrastructure for post-petascale architectures, and Enzo-P is a multiphysics application for astrophysics and cosmology simulations built on Cello. Cello implements a forest-of-octrees spatial mesh on top of the Charm++ parallel objects framework. Leaf nodes of the octrees are blocks of fixed size, and are represented as chares in a hierarchical chare array. Cello supports particle and field classes and operators for the construction of explicit finite-difference/volume methods, N-body methods, and sparse linear system solvers called from Enzo-P. Cello is application agnostic, and can be adapted to any hybrid simulation involving particles and fields. We report on our progress and experiences developing with Charm++. We present illustrative performance and scaling results on several test problems involving hydrodynamic fields, particles, and self-gravity.
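The forest-of-octrees layout described above can be pictured with a small standalone sketch. The C++ below (illustrative only, not Enzo-P/Cello source) shows an octree whose leaf nodes are fixed-size field blocks; in Cello each such leaf would be an element of a Charm++ chare array so the runtime can place, migrate, and load-balance it.

```cpp
// Standalone sketch (not Enzo-P/Cello code): an octree whose leaves are
// fixed-size field blocks. In Cello, each leaf block is a chare in a chare array.
#include <array>
#include <memory>
#include <vector>

constexpr int BLOCK_SIZE = 16;                  // cells per block edge (illustrative)

struct Block {
    int level;                                  // refinement level of this node
    std::vector<double> field;                  // fixed-size field data (leaves only)
    std::array<std::unique_ptr<Block>, 8> kids; // children; all null for a leaf

    explicit Block(int lvl)
        : level(lvl), field(BLOCK_SIZE * BLOCK_SIZE * BLOCK_SIZE, 0.0) {}

    bool isLeaf() const { return !kids[0]; }

    // Refine a leaf into 8 children, each again a fixed-size block covering
    // one octant of the parent at the next refinement level.
    void refine() {
        for (auto& k : kids) k = std::make_unique<Block>(level + 1);
        field.clear();                          // interior nodes carry no field data
    }
};

int main() {
    Block root(0);
    root.refine();                              // a tiny two-level tree
    return 0;
}
```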
10:20 am - 10:45 am |
Morning |
Technical Session: Applications I (Chair: Ronak Buch) - NCSA Auditorium 1122 |
10:45 am - 11:15 am |
Talk
Adaptive MPI: Performance and Application Studies
Sam White, University of Illinois at Urbana-Champaign
Adaptive MPI (AMPI) is an implementation of the MPI standard written on top of Charm++. AMPI provides high-level, application-independent features such as over-decomposition, dynamic load balancing, and automatic fault tolerance to MPI codes. In this talk, we give an overview of AMPI's features, compare its performance to other MPI implementations on a variety of benchmarks, and showcase recent results from applications.
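To make over-decomposition concrete: the sketch below (not from the talk) is an ordinary MPI program that also builds unmodified against AMPI, where each MPI rank becomes a lightweight, migratable virtual processor, so many more ranks than cores can be launched and rebalanced by the runtime.

```cpp
// Plain MPI; the same source can be compiled against AMPI, where each rank is a
// migratable user-level thread (a "virtual processor") rather than an OS process.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Each rank does an artificially imbalanced amount of work.
    double local = 0.0;
    for (long i = 0; i < 1000000L * (rank + 1); ++i) local += 1e-9;

    // Under AMPI, a call to its migration API could be placed at iteration
    // boundaries here so the runtime can move overloaded ranks between cores.
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f across %d ranks\n", total, nranks);
    MPI_Finalize();
    return 0;
}
```

With AMPI, the number of virtual ranks is chosen at launch time (for example via its +vp option), independently of the number of physical cores.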
11:15 am - 11:45 am |
Talk
Experiences with Charm++ and NAMD on Knights Landing Supercomputers
Dr. Jim Phillips, University of Illinois at Urbana-Champaign
The biomolecular simulation program NAMD has been ported to two major machines based on the Intel Xeon Phi Knights Landing (KNL) processor: Argonne Theta, with the Cray Aries interconnect, and TACC Stampede KNL, with the new Intel Omni-Path interconnect. With 64 or 68 low-power cores per single-socket host, up to 13 processes per host are required to relieve the Charm++ communication thread bottleneck. This bottleneck is particularly severe on Omni-Path, which "on-loads" much of the communication workload to the CPU, and for which the specialized verbs and OFI Charm++ machine layers have failed to outperform the generic MPI machine layer. Optimization opportunities include aggregation of NAMD communication within both processes and hosts, and the introduction of multiple communication threads and a PSM2 machine layer in Charm++. The observed performance issues will be even more severe on the much larger Argonne Aurora Knights Hill machine arriving in late 2018.
11:45 am - 12:15 pm |
Talk
DARMA: A C++ Abstraction Layer for Large-Scale Asynchronous Tasking
Jonathan Lifflander, Sandia National Laboratories
DARMA is a C++ abstraction layer for asynchronous many-task runtimes that seeks to 1) facilitate the expression of coarse-grained tasking via intuitive programming model semantics, and 2) provide a single interface that enables application scientists to explore the effectiveness of different backend runtime system implementations. This talk will provide a high-level summary of DARMA and its role in Sandia's research portfolio. An overview of a DARMA-Charm++ backend runtime system will be presented, along with initial performance results for several benchmarks relevant to Sandia application teams.
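For readers unfamiliar with the coarse-grained tasking style the abstract refers to, here is a generic illustration in standard C++ (std::async) of expressing work as asynchronous tasks whose results feed a dependent task. It deliberately does not use DARMA's API; the talk covers those semantics.

```cpp
// Generic coarse-grained tasking in standard C++; illustrative only, not DARMA.
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> a(1 << 20, 1.0), b(1 << 20, 2.0);

    // Two independent coarse-grained tasks run concurrently...
    auto sumA = std::async(std::launch::async,
                           [&] { return std::accumulate(a.begin(), a.end(), 0.0); });
    auto sumB = std::async(std::launch::async,
                           [&] { return std::accumulate(b.begin(), b.end(), 0.0); });

    // ...and a dependent step consumes their results when they are ready.
    double total = sumA.get() + sumB.get();
    std::printf("total = %f\n", total);
    return 0;
}
```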
12:15 pm - 1:15 pm |
Lunch - Provided - NCSA 1st Floor Lobby - Michaels |
Afternoon |
Technical Session: Dynamic Load Balancing (Chair: Michael Robson) - NCSA Auditorium 1122 |
1:15 pm - 1:45 pm |
Talk
Using an Adaptive Mesh Refinement Proxy Code to Assess Dynamic Load Balancing Capabilities for Exascale
Dr. Rob Van Der Wijngaart, Intel
Among the most dreaded obstacles facing scientific high performance computing workloads at exascale are sudden, discrete, localized disruptions ("system noise") that trigger load imbalances due to required synchronizations. These disturbances may be caused by local power capping to stay within a total power envelope, rescheduling of work onto resources to recover from component failure, sudden network congestion, etc. We developed a workload in which sudden local load variations are proxied in a controlled fashion via abrupt injection and removal of chunks of work. The workload is derived from Adaptive Mesh Refinement codes and has been incorporated into the Parallel Research Kernels suite. We describe the design and several reference implementations of the kernel, including one in Adaptive MPI. In addition, we present experiments carried out to determine under what circumstances the automatic dynamic load balancing enabled by the Charm++ runtime underlying Adaptive MPI is beneficial.
1:45 pm - 2:15 pm |
Talk
Meta-Balancer: Automated Selection of Load Balancing Strategies
Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
Several HPC applications require dynamic load balancing to achieve high performance. Because different HPC applications have different characteristics, selecting the best strategy from among the many available load balancing strategies is a complex process. Rule-of-thumb solutions might not always work and can lead to suboptimal performance. In this work, we present Meta-Balancer, a framework that automatically captures application features at runtime and selects the optimal load balancing strategy for the application and the given input dataset. We use a random forest machine learning algorithm to select the load balancer based on application features. We discuss the selection of optimal load-balancing strategies for several mini-applications.
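The sketch below shows, in plain C++, the general shape of such a selection step: runtime-measured features are fed to a trained model that chooses among candidate load balancers. The features, thresholds, and strategy names here are invented for illustration and are not Meta-Balancer's.

```cpp
// Illustrative only: a hand-written stand-in for a trained classifier that picks
// a load balancing strategy from runtime-measured application features.
#include <cstdio>
#include <string>

struct Features {
    double loadImbalance;   // max load / average load
    double migrationCost;   // estimated cost of moving objects (hypothetical metric)
    double commToCompRatio; // communication volume relative to computation
};

// A trained random forest would be evaluated here; this stub mimics its output.
std::string selectLoadBalancer(const Features& f) {
    if (f.loadImbalance < 1.05) return "NoLB";            // balanced enough already
    if (f.commToCompRatio > 0.5) return "CommAwareLB";    // communication-dominated
    if (f.migrationCost > 0.2)   return "RefineLB-like";  // move only a few objects
    return "GreedyLB-like";                               // rebalance from scratch
}

int main() {
    Features f{1.4, 0.05, 0.1};
    std::printf("chosen strategy: %s\n", selectLoadBalancer(f).c_str());
    return 0;
}
```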
2:15 pm - 2:45 pm |
Talk
Balancing Speculative Loads in Parallel Discrete Event Simulation
Eric Mikida, University of Illinois at Urbana-Champaign
In Parallel Discrete Event Simulation (PDES), care must be taken to ensure that event order is maintained while still exposing enough parallelism to run efficiently at large scales. One way to do so is to execute events speculatively and allow rollbacks when causality violations are detected. The speculative execution profile of a given simulation both affects, and is affected by, load balance. In this talk we will show how load balancing can have a positive impact on the speculative execution profile of a simulation, and present techniques that take advantage of simulation characteristics to provide even more benefit.
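A minimal, generic C++ sketch of the speculative execution mechanism: events are executed optimistically as they arrive, and a late "straggler" forces a rollback of events with later timestamps. This is not the scheduler discussed in the talk.

```cpp
// Generic optimistic ("Time Warp" style) event processing sketch, illustrative only.
// Events arrive in the order another processor happened to send them, which may not
// be timestamp order; a late straggler forces a rollback.
#include <cstdio>
#include <vector>

struct Event { double timestamp; };

int main() {
    // Arrival order: the event at t=1.5 arrives after t=2.0 has already executed.
    std::vector<Event> arrivals = {{1.0}, {2.0}, {1.5}, {3.0}};
    std::vector<Event> executed;   // history kept so speculative work can be undone
    double lvt = 0.0;              // local virtual time

    for (const Event& e : arrivals) {
        if (e.timestamp < lvt) {
            // Causality violation: roll back speculatively executed events with later
            // timestamps (state changes would be undone here), then redo them in order.
            std::vector<Event> redo;
            while (!executed.empty() && executed.back().timestamp > e.timestamp) {
                redo.push_back(executed.back());
                executed.pop_back();
                std::printf("rolled back event at t=%.1f\n", redo.back().timestamp);
            }
            executed.push_back(e);
            std::printf("executed straggler at t=%.1f\n", e.timestamp);
            for (auto it = redo.rbegin(); it != redo.rend(); ++it) {
                executed.push_back(*it);
                std::printf("re-executed event at t=%.1f\n", it->timestamp);
            }
            lvt = executed.back().timestamp;
        } else {
            executed.push_back(e);  // speculative execution of an in-order event
            lvt = e.timestamp;
            std::printf("executed event at t=%.1f\n", e.timestamp);
        }
    }
    return 0;
}
```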
2:45 pm - 3:00 pm |
Afternoon |
Technical Session: Heterogeneous Computing (Chair: Eric Bohm) - NCSA Auditorium 1122 |
3:00 pm - 3:30 pm |
Talk
Scaling Clustered N-body/SPH Simulations
Prof. Tom Quinn, University of Washington
Simulations of galaxy clusters present a number of challenges to scaling on massively parallel machines. First, the amount of computation per data element can vary by an order of magnitude for a single force calculation. Second, there is a large range of timescales, so that much of the time only a small subset of the domain needs updated forces. Third, there is a variety of physical processes to calculate, such as gravity, star formation, and supermassive black hole evolution, each of which has a different distribution of computational effort. I will present some of the successes and continued challenges in using the Charm++ runtime system to address these difficult load balancing problems on Blue Waters.
3:30 pm - 4:00 pm |
Talk
Heterogeneous Computing in Charm++
Michael Robson, University of Illinois at Urbana-Champaign
With the increased adoption of and reliance on accelerators, particularly GPUs, to achieve more performance in current and next generation supercomputers, effectively utilizing these devices has become very important. However, there has not been a commensurate increase in the ability to program and interact with these devices. We seek to bridge the GPU usability and programmability gap in Charm++ through a variety of GPU frameworks that programmers can utilize. Our ultimate goal is to enable our users to easily and automatically leverage the compute power of these devices without having to rewrite significant portions of their code. In this talk we will present the various frameworks available in Charm++ for programmers interacting with accelerators, their current features and trade-offs, and a brief overview of some major Charm applications that currently utilize various pieces of the Charm++ accelerator stack. We will also present some preliminary performance results and review the programmability enhancements these frameworks offer. Finally, we will examine Charm's future directions as nodes grow in size, new accelerators are introduced, and heterogeneous load balancing at various levels and across different node types becomes increasingly important.
4:00 pm - 4:30 pm |
Talk
Selected Topics in Dynamic Load Balancing
Ronak Buch, University of Illinois at Urbana-Champaign
Dynamic load balancing has long been a hallmark of Charm++, but new hardware and usage patterns continually present new challenges. In this talk, we discuss several improvements that have been made to the load balancing infrastructure and strategies of Charm++. Additionally, we discuss the addition of heterogeneous and vector load balancers to Charm++, which allow loads to be balanced across accelerator devices and using alternative metrics to processor load.
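As background for the strategies discussed, the snippet below sketches in plain C++ the classic greedy centralized approach (heaviest object to the currently least-loaded processor) that several Charm++ balancers build on; it is a generic illustration, not Charm++'s implementation.

```cpp
// Generic greedy load balancing sketch: assign objects, heaviest first, to the
// currently least-loaded processor. Illustrative only, not Charm++'s GreedyLB.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    std::vector<double> objectLoads = {5.0, 1.0, 3.0, 2.0, 4.0, 2.5};
    const int numProcs = 3;

    // Min-heap of (current load, processor id).
    using Proc = std::pair<double, int>;
    std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::sort(objectLoads.rbegin(), objectLoads.rend());  // heaviest first
    for (double load : objectLoads) {
        Proc least = procs.top();                          // least-loaded processor
        procs.pop();
        std::printf("object with load %.1f -> processor %d\n", load, least.second);
        procs.push({least.first + load, least.second});
    }
    return 0;
}
```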
4:30 pm - 4:45 pm |
Afternoon |
Technical Session: PPL Talks (Chair: Nitin Bhat) - NCSA Auditorium 1122 |
4:45 pm - 5:05 pm |
Talk
Neural Network-Based Power Optimizations in Runtime
Bilge Acun, University of Illinois at Urbana-Champaign
The increasing scale and density of data centers and supercomputers, combined with temperature variations among processors, pose significant challenges for building power- and energy-efficient systems and cooling infrastructures. An accurate temperature prediction model is necessary to support proactive cooling decisions that can reduce power and save energy. We propose a neural network-based modeling approach for predicting core temperatures for different workloads under different core frequencies, fan speed levels, and ambient temperatures. The model provides guidance for cooling control (i.e., fan speed control) as well as for cooling-aware algorithms such as frequency control algorithms and thermal-aware load balancing. We propose a preemptive fan control mechanism that can reduce the maximum cooling power by 53% on average by foreseeing the temperature of the cores. Moreover, through our decoupled fan control method and thermal-aware load balancing algorithm, we show that temperature variations can be reduced from 25 C to 2 C, making the cooling system more efficient with minimal performance overhead.
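A minimal sketch of the kind of model described: a tiny feed-forward network mapping core frequency, fan speed, and ambient temperature to a predicted core temperature. The layer size and weights below are placeholders, not the trained model from the talk.

```cpp
// Tiny feed-forward temperature predictor; all weights below are placeholders,
// not the trained model from the talk.
#include <algorithm>
#include <cstdio>

// One hidden layer of 4 ReLU units. Inputs: core frequency (GHz), fan speed
// (normalized 0..1), ambient temperature (C). Output: predicted core temperature (C).
double predictTemp(double freqGHz, double fan, double ambientC) {
    const double W1[4][3] = {{ 8.0, -6.0, 0.5},
                             { 5.0, -2.0, 0.3},
                             { 3.0, -4.0, 0.4},
                             { 6.0, -1.0, 0.2}};
    const double b1[4] = {1.0, 0.5, 0.2, 0.1};
    const double W2[4] = {1.2, 0.8, 1.0, 0.9};
    const double b2 = 20.0;

    double out = b2;
    for (int j = 0; j < 4; ++j) {
        double h = W1[j][0] * freqGHz + W1[j][1] * fan + W1[j][2] * ambientC + b1[j];
        out += W2[j] * std::max(0.0, h);  // ReLU activation
    }
    return out;
}

int main() {
    std::printf("predicted core temperature: %.1f C\n", predictTemp(2.4, 0.6, 25.0));
    return 0;
}
```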
5:05 pm - 5:25 pm |
Talk
Sorting Algorithm/Library
Vipul Harsh, University of Illinois at Urbana-Champaign
We characterize Histogram sort with sampling, an adaptation of the popular Histogram sort algorithm. We show that Histogram sort with sampling is more efficient than Sample sort algorithms that achieve the same level of load balance, both in theory and in practice, especially for massively parallel applications scaling to tens of thousands of processors. We also demonstrate the practicality of Histogram sort with sampling on large modern clusters by exploiting shared memory within nodes to improve the performance of the algorithm.
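To illustrate the sampling idea on a single array: candidate splitters are drawn from a random sample so that the induced buckets are roughly balanced. Histogram sort with sampling then refines the splitters iteratively with a parallel histogram, which this single-node sketch omits.

```cpp
// Choosing approximate splitters from a random sample; single-node sketch of the
// sampling step only, with the iterative histogramming refinement omitted.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::vector<int> data(100000);
    for (int& x : data) x = dist(rng);

    const int numBuckets = 8;
    const int sampleSize = 64 * numBuckets;   // oversampling factor of 64

    // Draw a random sample and sort it (cheap relative to sorting all the data).
    std::vector<int> sample(sampleSize);
    std::uniform_int_distribution<size_t> pick(0, data.size() - 1);
    for (int& s : sample) s = data[pick(rng)];
    std::sort(sample.begin(), sample.end());

    // Every (sampleSize/numBuckets)-th sample element becomes a splitter candidate.
    std::vector<int> splitters;
    for (int b = 1; b < numBuckets; ++b)
        splitters.push_back(sample[b * sampleSize / numBuckets]);

    // Check how balanced the induced buckets are.
    std::vector<size_t> counts(numBuckets, 0);
    for (int x : data) {
        size_t b = std::upper_bound(splitters.begin(), splitters.end(), x)
                   - splitters.begin();
        ++counts[b];
    }
    for (int b = 0; b < numBuckets; ++b)
        std::printf("bucket %d: %zu elements\n", b, counts[b]);
    return 0;
}
```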
5:25 pm - 5:45 pm |
Talk
A Parallel Union-Find Library in Charm++
Karthik Senthil, University of Illinois at Urbana-Champaign
Large-scale graph applications play a pivotal role in parallel and high-performance computing. Various scientific problems are translated into graphs to obtain better and faster solutions, while many graph processing frameworks are pushing the limits of modern computing hardware. A well-explored and commonly used graph algorithm is Connected Components detection, and extensive research in the field of parallel computing has resulted in several well-optimized implementations. The union-find operation lies at the core of this algorithm and of many other significant graph operations, such as Kruskal's minimum spanning tree algorithm, network connectivity problems, etc.
We present a library that provides the functionality of performing union-find operations on large-scale graphs in a completely distributed and asynchronous fashion. It has been implemented in Charm++ and can be used in any generic Charm++ application. In this talk we present the current status of the library and analyze its performance with the particular use-case of probabilistic meshes. We also discuss various existing strategies and planned optimizations targeting real-world applications.
This material is based in part upon work supported by the NSF, SI2-SSI: Collaborative Research: ParaTreet: Parallel Software for Spatial Trees in Simulation and Analysis (NSF #1550554).
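For reference, the snippet below is the textbook sequential union-find (disjoint-set) structure with path compression and union by rank in plain C++. The library described in the talk performs this operation in a distributed, asynchronous fashion across chares, which this sketch does not attempt.

```cpp
// Sequential union-find (disjoint-set) with path compression and union by rank.
// The Charm++ library distributes this operation; this is only the textbook core.
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent, rank_;
    explicit UnionFind(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);   // each vertex is its own root
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];            // path compression (halving)
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);     // union by rank
        parent[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }
};

int main() {
    UnionFind uf(6);
    uf.unite(0, 1); uf.unite(1, 2); uf.unite(4, 5);
    std::printf("0 and 2 connected: %d\n", uf.find(0) == uf.find(2));  // prints 1
    std::printf("0 and 4 connected: %d\n", uf.find(0) == uf.find(4));  // prints 0
    return 0;
}
```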
5:45 pm - 6:00 pm |
Discussion
Dr. Phil Miller, CharmWorks Inc.
7:00 pm - 9:00 pm |
Workshop Banquet (for registered participants only) - Siebel Center, 2nd Floor Atrium |
8:30 am - 9:00 am |
Continental Breakfast - NCSA 1st Floor Lobby |
Morning |
Opening Session (Chair: Sanjay Kale) - NCSA Auditorium 1122 |
9:00 am - 10:00 am |
Keynote
Exascale Computing Project: Software Technology Perspective
Dr. Rajeev Thakur, Argonne National Laboratory
The DOE Exascale Computing Project (ECP) was established to accelerate delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 50 times more performance on mission-critical applications than the nation's most powerful supercomputers in use today. Its scope includes application development, software technology, hardware technology, advanced system engineering, and early testbed platforms. This presentation provides an overview of the Exascale Computing Project, its current status, and future plans, with a particular emphasis on activities in software technology.
10:00 am - 10:30 am |
Talk
Using Charm++ to Support Multiscale Multiphysics on the Trinity Supercomputer
Robert Pavel (presenting), Christoph Junghans, Susan M. Mniszewski, Timothy C. Germann, Los Alamos National Laboratory
As part of the Trinity Open Science project, we are using the TaBaSCo proxy application to demonstrate the feasibility of at-scale heterogeneous computations. TaBaSCo is a proxy application for multiscale physics, specifically viscoplasticity, developed as part of the ExMatEx exascale computing co-design center. Because of its reliance on database-assisted interpolation, this proxy application can lead to very unbalanced workloads with drastically varying task lengths. To minimize the impact of this imbalance, we have used the Charm++ runtime.
We have run this program on the Trinity supercomputer as a way to investigate the problems and scalability of this approach on large-scale machines. In this talk, we will explain our approach and present our results from phase one of the Trinity Open Science project, in which we ran on the Intel Haswell partition of Trinity. We will describe the advantages and disadvantages of the Charm++ version of the code.
Furthermore, as part of phase two of the Trinity Open Science project, we intend to utilize the Intel Knights Landing partition of Trinity in conjunction with the Intel Haswell partition. Specifically, we will schedule tasks to the type of node on which they run best. As one solution, we use Charm++ to map the chares to the appropriate physical node. In this talk we will present our early results and the benefits of distributing work in a manner that best takes advantage of the different node configurations, and discuss the benefits and obstacles presented by the Charm++ runtime.
10:30 am - 11:00 am |
Morning |
Technical Session: Applications II (Chair: Sam White) - NCSA Auditorium 1122 |
11:00 am - 11:30 am |
Talk
Quinoa: Adaptive Computational Fluid Dynamics
Jozsef Bakosi, Los Alamos National Laboratory
We are developing a set of tools on top of Charm++ that enables research and numerical analysis in fluid dynamics. We currently focus on two solvers: (1) a numerical integrator for stochastic differential equations, used for the design of statistical moment approximations required for, e.g., modeling mixing materials in turbulence, and (2) a finite element solver for the Navier-Stokes (NS) equations on 3D unstructured grids with automatic solution-adaptive mesh refinement (AMR). Using the NS-AMR problem we explore what it takes to scale such high-load-imbalance simulations, representative of large production multiphysics codes, to very large problems using an asynchronous runtime system. On top of Charm++ we combine fully asynchronous data and task parallelism, allowing arbitrary overlap of communication, computation, and I/O. We aim to demonstrate that such an approach scales to large distributed-memory many-core architectures and that this can be done in a portable, extensible, and maintainable fashion. We develop the code in production style, and host it publicly on GitHub (https://github.com/quinoacomputing/quinoa), released under a permissive BSD license, to facilitate collaboration, enable transparency, and solicit feedback. The talk will discuss the high-level software architecture with some algorithm details, e.g., asynchronous linear system assembly, the current state, and our near-future plan.
11:30 am - 12:00 pm |
Talk
SpECTRE: A Next-Generation Relativistic Astrophysics Code
Nils Deppe, Cornell University
Advanced LIGO has begun the exciting new era of gravitational wave astronomy with its groundbreaking discoveries of gravitational waves (GWs) from the merger of two black holes (BHBH). In addition to BHBH mergers, the most frequent sources of GWs expected for LIGO and its partner observatories are the mergers of compact binaries with either two neutron stars (NSNS) or one stellar-mass black hole and a neutron star (BHNS). Unfortunately, even using simplified models of the properties of neutron stars, the computational errors are too large (1-10%) and often not even quantifiable with current algorithmic and hardware limitations. Also, the simulations take too long: several months on present supercomputers even at the current low accuracy. Moreover, the methods do not scale well to upcoming exascale machines. Our code SpECTRE is designed from the ground up to solve problems our current codes struggle with, with a focus on petascale and exascale platforms. It features two key new methods that will make breakthrough simulations of neutron star mergers and core-collapse supernovae possible: Discontinuous Galerkin (DG) discretization and task-based parallelism. Task-based parallelism in SpECTRE is implemented using the Charm++ library.
Currently, SpECTRE can numerically evolve strongly hyperbolic time-dependent partial differential equations in one to three spatial dimensions. The relativistic Euler and MHD systems have been implemented for simple equations of state, and SpECTRE is capable of running large-scale simulations efficiently using all cores of NSF's Blue Waters (~360,000 floating-point cores). This recent work has culminated in our first code paper describing challenging benchmark tests, scaling results, and task-based scheduling. We demonstrated that the DG algorithm together with task-based parallelism scales to massive core counts. To our knowledge this is the first combination of DG methods with task-based parallelism, and the first DG evolution of the relativistic MHD system. Because of Charm++ this scaling test succeeded with code that was moved from a laptop to the supercomputer with no code modifications at all.
12:00 pm - 1:15 pm |
Lunch - Provided - NCSA 1st Floor Lobby - Sitara |
1:15 pm - 2:15 pm |
Panel
Will exascale computing help raise the "missing middle"? Can it? Will it? How?
Panelists: Dr. Rajeev Thakur, Mr. Keven Hofstetter, Dr. Rasmus Tamstorf, Dr. Laxmikant Kale; Moderator: Dr. Phil Miller
High performance technical computing is done on single desktops (engineering workstations) or on large supercomputers, but its penetration in mid-size systems (say, 10-1000 nodes) has been limited. In other words, the engineering and manufacturing industries have not been using cluster computing with as much intensity as they could. This has been called the "missing middle". The question for this panel is: will the push for extreme-scale computing (with exascale as one milepost along that push) help raise the missing middle, i.e., increase and broaden the usage of cluster-scale computing in industry?
2:15 pm - 2:45 pm |
Afternoon |
Technical Session: Interfaces (Chair: Eric Mikida) - NCSA Auditorium 1122 |
2:45 pm - 3:15 pm |
Talk
Early Experience with Integrating Charm++ Support to Green-Marl DSL
Alex Frolov, NICEVT
The paper presents the implementation of a code generation mechanism in the Green-Marl domain-specific language (DSL) compiler targeting the Charm++ framework. Green-Marl is used for parallel static graph analysis and adopts an imperative shared-memory programming model, while Charm++ implements a message-driven execution model. We describe the graph representation in the generated Charm++ code, as well as the translation of common Green-Marl constructs to Charm++. An evaluation on typical graph algorithms (Single-Source Shortest Path (SSSP), Connected Components (CC), and PageRank) showed that Green-Marl programs translated to Charm++ achieve the same performance as native Charm++ implementations.
3:15 pm - 3:45 pm |
Talk
Applying Logistic Regression Model on HPX Parallel Loops
Zahra Khatami, Louisiana State University
The performance of many parallel applications depends on loop-level parallelism. However, manually parallelizing all loops may degrade parallel performance, as some loops cannot scale desirably to larger numbers of threads. In addition, the overhead of manually setting chunk sizes might prevent an application from reaching its maximum parallel performance. We illustrate how machine learning techniques can be applied to address these challenges. In this research, we develop a framework that is able to automatically capture the static and dynamic information of a loop. Moreover, we advocate a novel method for determining the execution policy and chunk size of a loop within an application by feeding this captured information into our learning model. Our evaluation shows that the proposed technique can speed up execution by up to 35%.
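A generic C++ illustration of the decision such a framework might make: a logistic model over simple loop features estimates whether parallel execution (and which chunk size) is likely to pay off. The features and coefficients are invented, and this is not HPX code.

```cpp
// Logistic model deciding whether to parallelize a loop; coefficients are
// placeholders rather than a trained model, and no HPX API is used.
#include <cmath>
#include <cstdio>

struct LoopFeatures {
    double iterations;        // number of loop iterations
    double workPerIteration;  // estimated operations per iteration
    double numThreads;        // threads available
};

// P(parallel execution is profitable) via a logistic function.
double profitProbability(const LoopFeatures& f) {
    const double w0 = -4.0, w1 = 0.0005, w2 = 0.01, w3 = 0.1;   // placeholder weights
    double z = w0 + w1 * f.iterations + w2 * f.workPerIteration + w3 * f.numThreads;
    return 1.0 / (1.0 + std::exp(-z));
}

int main() {
    LoopFeatures f{10000, 200, 16};
    double p = profitProbability(f);
    if (p > 0.5) {
        // Pick a chunk size that leaves several chunks per thread for balance.
        int chunk = static_cast<int>(f.iterations / (f.numThreads * 4));
        std::printf("run parallel (p=%.2f), chunk size %d\n", p, chunk);
    } else {
        std::printf("run sequentially (p=%.2f)\n", p);
    }
    return 0;
}
```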
3:45 pm - 4:15 pm |
Afternoon |
Technical Session: Applications III (Chair: Bilge Acun) - Siebel Center 1304 |
4:15 pm - 5:15 pm |
Talk
OpenAtom: Ground and Excited States of Electrons from First Principles
Prof. Sohrab Ismail-Beigi, Subhasish Mandal, Minjung Kim, Yale University; Dr. Glenn Martyna, Dr. Qi Li, IBM; Eric Bohm, Eric Mikida, Kavitha Chandrasekar, University of Illinois
The goal of the OpenAtom project is to statistically sample complex environments in order to understand important and useful materials systems. We describe our progress towards fast ab initio computations that provide the ground-state energy surface describing the systems of interest within Density Functional Theory, as well as excited-state properties and spectroscopy within the GW/Bethe-Salpeter approach.
5:15 pm - 5:45 pm |
Talk
Multilevel Summation Method for Calculating Electrostatic Interactions in NAMD
David Hardy, University of Illinois at Urbana-Champaign
Molecular dynamics (MD) simulation has for decades been a valuable computational approach for investigating biomolecules. The emergence of exascale computing provides the opportunity for a commensurate increase in simulated system sizes to enable the study of macromolecular assemblies comprising billions of atoms. The rate-limiting part of MD is the calculation of the nonbonded electrostatic forces, which must be computed billions of times when simulating microsecond timescales. For MD simulations performed with the program NAMD, it is most common to employ the particle-mesh Ewald (PME) method to calculate electrostatics. However, PME has two significant shortcomings: (1) its use necessitates the adoption of periodic boundary conditions within a simulation, and (2) each PME evaluation requires the calculation of two 3D FFTs, which poses a bottleneck to parallel scalability. Both shortcomings are addressed by an alternative approach, the multilevel summation method (MSM), that is currently being developed in NAMD.
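As a rough illustration of the idea behind MSM, the sketch below splits the pairwise Coulomb kernel 1/r into a short-range part evaluated directly within a cutoff and a smooth remainder; MSM interpolates that smooth remainder on a hierarchy of grids rather than computing it pairwise, which this toy code does not do. The particular splitting polynomial is one common choice, not necessarily NAMD's.

```cpp
// Toy illustration of kernel splitting for electrostatics: 1/r is split into a
// short-range part (computed directly within a cutoff) and a smooth long-range
// remainder. MSM approximates that remainder on nested grids; here it is only
// evaluated for single pair distances. Not NAMD code.
#include <cstdio>

// Smooth approximation g(r) of 1/r that matches 1/r (value and slope) at the cutoff,
// so that the short-range part 1/r - g(r) vanishes beyond the cutoff.
double smoothPart(double r, double cutoff) {
    double s = r / cutoff;
    if (s >= 1.0) return 1.0 / r;   // beyond the cutoff everything is "smooth"
    return (1.0 / cutoff) * (1.875 - 1.25 * s * s + 0.375 * s * s * s * s);
}

int main() {
    const double cutoff = 10.0;     // illustrative cutoff distance
    const double rs[] = {2.0, 5.0, 9.0, 12.0};
    for (double r : rs) {
        double full = 1.0 / r;
        double smooth = smoothPart(r, cutoff);
        double shortRange = full - smooth;   // zero for r >= cutoff
        std::printf("r=%5.1f  1/r=%.4f  short-range=%.4f  smooth=%.4f\n",
                    r, full, shortRange, smooth);
    }
    return 0;
}
```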
5:45 pm - 6:00 pm |
Closing Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
12:00 pm - 1:15 pm |
Dinner - Provided - Big Grove Tavern |
Morning |
Tutorial Session - Siebel Center 4405 |
9:00 am - 1:00 pm |
Tutorial
PPL
1:00 pm - 1:30 pm |
Lunch - Provided - Potbelly |
1:30 pm - 3:00 pm |
Tutorial
Projections
PPL