Time | Type | Description | Slides | Video
8:30 am - 9:00 am |
Continental Breakfast / Registration - NCSA 1st Floor Lobby |
Morning |
Opening Session - NCSA Auditorium 1122 |
9:00 am - 9:20 am |
Welcome
Opening Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
9:20 am - 10:20 am |
Keynote
Building a Better Astrophysics AMR Code with Charm++: Enzo-P/Cello
Prof. Michael Norman, University of California San Diego
In order to effectively utilize modern HPC architectures and maximize programmer productivity, new abstractions are required that effect a "separation of concerns" between hardware-specific issues, parallelism, data structures, and application-specific solvers. Object-oriented programming can achieve this abstraction without sacrificing performance. A new scalable adaptive mesh refinement (AMR) method for astrophysics code applications at extreme scale has emerged from this approach -- the Enzo-P/Cello project under development at SDSC [1]. Cello is an extremely scalable AMR software infrastructure for post-petascale architectures, and Enzo-P is a multiphysics application for astrophysics and cosmology simulations built on Cello. Cello implements a forest-of-octrees spatial mesh on top of the Charm++ parallel objects framework. Leaf nodes of the octrees are blocks of fixed size, and are represented as chares in a hierarchical chare array. Cello supports particle and field classes and operators for the construction of explicit finite-difference/volume methods, N-body methods, and sparse linear system solvers called from Enzo-P. Cello is application agnostic, and can be adapted to any hybrid simulation involving particles and fields. We report on our progress and experiences developing with Charm++. We present illustrative performance and scaling results on several test problems involving hydrodynamic fields, particles, and self-gravity.
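The forest-of-octrees layout described above can be pictured with a small standalone sketch. The C++ below (illustrative only, not Enzo-P/Cello source) shows an octree whose leaf nodes are fixed-size field blocks; in Cello each such leaf would be an element of a Charm++ chare array so the runtime can place, migrate, and load-balance it.

```cpp
// Standalone sketch (not Enzo-P/Cello code): an octree whose leaves are
// fixed-size field blocks. In Cello, each leaf block is a chare in a chare array.
#include <array>
#include <memory>
#include <vector>

constexpr int BLOCK_SIZE = 16;                  // cells per block edge (illustrative)

struct Block {
    int level;                                  // refinement level of this node
    std::vector<double> field;                  // fixed-size field data (leaves only)
    std::array<std::unique_ptr<Block>, 8> kids; // children; all null for a leaf

    explicit Block(int lvl)
        : level(lvl), field(BLOCK_SIZE * BLOCK_SIZE * BLOCK_SIZE, 0.0) {}

    bool isLeaf() const { return !kids[0]; }

    // Refine a leaf into 8 children, each again a fixed-size block covering
    // one octant of the parent at the next refinement level.
    void refine() {
        for (auto& k : kids) k = std::make_unique<Block>(level + 1);
        field.clear();                          // interior nodes carry no field data
    }
};

int main() {
    Block root(0);
    root.refine();                              // a tiny two-level tree
    return 0;
}
```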
10:20 am - 10:45 am |
Morning |
Technical Session: Applications I (Chair: Ronak Buch) - NCSA Auditorium 1122 |
10:45 am - 11:15 am |
Talk
Adaptive MPI: Performance and Application Studies
Sam White, University of Illinois at Urbana-Champaign
Adaptive MPI (AMPI) is an implementation of the MPI standard written on top of Charm++. AMPI provides high-level, application-independent features such as over-decomposition, dynamic load balancing, and automatic fault tolerance to MPI codes. In this talk, we give an overview of AMPI's features, compare its performance to other MPI implementations on a variety of benchmarks, and showcase recent results from applications.
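To make over-decomposition concrete: the sketch below (not from the talk) is an ordinary MPI program that also builds unmodified against AMPI, where each MPI rank becomes a lightweight, migratable virtual processor, so many more ranks than cores can be launched and rebalanced by the runtime.

```cpp
// Plain MPI; the same source can be compiled against AMPI, where each rank is a
// migratable user-level thread (a "virtual processor") rather than an OS process.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Each rank does an artificially imbalanced amount of work.
    double local = 0.0;
    for (long i = 0; i < 1000000L * (rank + 1); ++i) local += 1e-9;

    // Under AMPI, a call to its migration API could be placed at iteration
    // boundaries here so the runtime can move overloaded ranks between cores.
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f across %d ranks\n", total, nranks);
    MPI_Finalize();
    return 0;
}
```

With AMPI, the number of virtual ranks is chosen at launch time (for example via its +vp option), independently of the number of physical cores.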
11:15 am - 11:45 am |
Talk
Experiences with Charm++ and NAMD on Knights Landing Supercomputers
Dr. Jim Phillips, University of Illinois at Urbana-Champaign
The biomolecular simulation program NAMD has been ported to two major machines based on the Intel Xeon Phi Knights Landing (KNL) processor: Argonne Theta, with the Cray Aries interconnect, and TACC Stampede KNL, with the new Intel Omni-Path interconnect. With 64 or 68 low-power cores per single-socket host, up to 13 processes per host are required to relieve the Charm++ communication thread bottleneck. This bottleneck is particularly severe on Omni-Path, which "on-loads" much of the communication workload to the CPU, and for which the specialized verbs and OFI Charm++ machine layers have failed to outperform the generic MPI machine layer. Optimization opportunities include aggregation of NAMD communication within both processes and hosts, and the introduction of multiple communication threads and a PSM2 machine layer in Charm++. The observed performance issues will be even more severe on the much larger Argonne Aurora Knights Hill machine arriving in late 2018.
11:45 am - 12:15 pm |
Talk
DARMA: A C++ Abstraction Layer for Large-Scale Asynchronous Tasking
Jonathan Lifflander, Sandia National Laboratories
DARMA is a C++ abstraction layer for asynchronous many-task runtimes that seeks to 1) facilitate the expression of coarse-grained tasking via intuitive programming model semantics, and 2) provide a single interface that enables application scientists to explore the effectiveness of different backend runtime system implementations. This talk will provide a high-level summary of DARMA and its role in Sandia's research portfolio. An overview of a DARMA-Charm++ backend runtime system will be presented, along with initial performance results for several benchmarks relevant to Sandia application teams.
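For readers unfamiliar with the coarse-grained tasking style the abstract refers to, here is a generic illustration in standard C++ (std::async) of expressing work as asynchronous tasks whose results feed a dependent task. It deliberately does not use DARMA's API; the talk covers those semantics.

```cpp
// Generic coarse-grained tasking in standard C++; illustrative only, not DARMA.
#include <future>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> a(1 << 20, 1.0), b(1 << 20, 2.0);

    // Two independent coarse-grained tasks run concurrently...
    auto sumA = std::async(std::launch::async,
                           [&] { return std::accumulate(a.begin(), a.end(), 0.0); });
    auto sumB = std::async(std::launch::async,
                           [&] { return std::accumulate(b.begin(), b.end(), 0.0); });

    // ...and a dependent step consumes their results when they are ready.
    double total = sumA.get() + sumB.get();
    std::printf("total = %f\n", total);
    return 0;
}
```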
12:15 pm - 1:15 pm |
Lunch - Provided - NCSA 1st Floor Lobby - Michaels |
Afternoon |
Technical Session: Dynamic Load Balancing (Chair: Michael Robson) - NCSA Auditorium 1122 |
1:15 pm - 1:45 pm |
Talk
Using an Adaptive Mesh Refinement Proxy Code to Assess Dynamic Load Balancing Capabilities for Exascale
Dr. Rob Van Der Wijngaart, Intel
Among the most dreaded obstacles facing scientific high performance computing workloads at exascale are sudden, discrete, localized disruptions ("system noise") that trigger load imbalances due to required synchronizations. These disturbances may be caused by local power capping to stay within a total power envelope, rescheduling of work onto resources to recover from component failure, sudden network congestion, etc. We developed a workload in which sudden local load variations are proxied in a controlled fashion via abrupt injection and removal of chunks of work. The workload is derived from Adaptive Mesh Refinement codes and has been incorporated into the Parallel Research Kernels suite. We describe the design and several reference implementations of the kernel, including one in Adaptive MPI. In addition, we present experiments carried out to determine under what circumstances the automatic dynamic load balancing enabled by the Charm++ runtime underlying Adaptive MPI is beneficial.
1:45 pm - 2:15 pm |
Talk
Meta-Balancer: Automated Selection of Load Balancing Strategies
Kavitha Chandrasekar, University of Illinois at Urbana-Champaign
Several HPC applications require dynamic load balancing to achieve high performance. Because different HPC applications have different characteristics, selecting the best strategy from among the many available load balancing strategies is a complex process. Rule-of-thumb solutions might not always work and can lead to suboptimal performance. In this work, we present Meta-Balancer, a framework that automatically captures application features at runtime and selects the optimal load balancing strategy for the application and the given input dataset. We use a random forest machine learning algorithm to select the load balancer based on application features. We discuss the selection of optimal load-balancing strategies for several mini-applications.
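The sketch below shows, in plain C++, the general shape of such a selection step: runtime-measured features are fed to a trained model that chooses among candidate load balancers. The features, thresholds, and strategy names here are invented for illustration and are not Meta-Balancer's.

```cpp
// Illustrative only: a hand-written stand-in for a trained classifier that picks
// a load balancing strategy from runtime-measured application features.
#include <cstdio>
#include <string>

struct Features {
    double loadImbalance;   // max load / average load
    double migrationCost;   // estimated cost of moving objects (hypothetical metric)
    double commToCompRatio; // communication volume relative to computation
};

// A trained random forest would be evaluated here; this stub mimics its output.
std::string selectLoadBalancer(const Features& f) {
    if (f.loadImbalance < 1.05) return "NoLB";            // balanced enough already
    if (f.commToCompRatio > 0.5) return "CommAwareLB";    // communication-dominated
    if (f.migrationCost > 0.2)   return "RefineLB-like";  // move only a few objects
    return "GreedyLB-like";                               // rebalance from scratch
}

int main() {
    Features f{1.4, 0.05, 0.1};
    std::printf("chosen strategy: %s\n", selectLoadBalancer(f).c_str());
    return 0;
}
```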
2:15 pm - 2:45 pm |
Talk
Balancing Speculative Loads in Parallel Discrete Event Simulation
Eric Mikida, University of Illinois at Urbana-Champaign
In Parallel Discrete Event Simulation (PDES), care must be taken to ensure that event order is maintained while still exposing enough parallelism to run efficiently at large scales. One way to do so is to execute events speculatively and allow rollbacks when causality violations are detected. The speculative execution profile of a given simulation both affects, and is affected by, load balance. In this talk we will show how load balancing can have a positive impact on the speculative execution profile of a simulation, and present techniques that take advantage of simulation characteristics to provide even more benefit.
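A minimal, generic C++ sketch of the speculative execution mechanism: events are executed optimistically as they arrive, and a late "straggler" forces a rollback of events with later timestamps. This is not the scheduler discussed in the talk.

```cpp
// Generic optimistic ("Time Warp" style) event processing sketch, illustrative only.
// Events arrive in the order another processor happened to send them, which may not
// be timestamp order; a late straggler forces a rollback.
#include <cstdio>
#include <vector>

struct Event { double timestamp; };

int main() {
    // Arrival order: the event at t=1.5 arrives after t=2.0 has already executed.
    std::vector<Event> arrivals = {{1.0}, {2.0}, {1.5}, {3.0}};
    std::vector<Event> executed;   // history kept so speculative work can be undone
    double lvt = 0.0;              // local virtual time

    for (const Event& e : arrivals) {
        if (e.timestamp < lvt) {
            // Causality violation: roll back speculatively executed events with later
            // timestamps (state changes would be undone here), then redo them in order.
            std::vector<Event> redo;
            while (!executed.empty() && executed.back().timestamp > e.timestamp) {
                redo.push_back(executed.back());
                executed.pop_back();
                std::printf("rolled back event at t=%.1f\n", redo.back().timestamp);
            }
            executed.push_back(e);
            std::printf("executed straggler at t=%.1f\n", e.timestamp);
            for (auto it = redo.rbegin(); it != redo.rend(); ++it) {
                executed.push_back(*it);
                std::printf("re-executed event at t=%.1f\n", it->timestamp);
            }
            lvt = executed.back().timestamp;
        } else {
            executed.push_back(e);  // speculative execution of an in-order event
            lvt = e.timestamp;
            std::printf("executed event at t=%.1f\n", e.timestamp);
        }
    }
    return 0;
}
```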
2:45 pm - 3:00 pm |
Afternoon |
Technical Session: Heterogeneous Computing (Chair: Eric Bohm) - NCSA Auditorium 1122 |
3:00 pm - 3:30 pm |
Talk
Scaling Clustered N-body/SPH Simulations
Prof. Tom Quinn, University of Washington
Simulations of galaxy clusters present a number of challenges to scaling on massively parallel machines. First, the amount of computation per data element can vary by an order of magnitude for a single force calculation. Second, there is a large range of timescales, so that much of the time only a small subset of the domain needs updated forces. Third, there is a variety of physical processes to calculate, such as gravity, star formation, and supermassive black hole evolution, each of which has a different distribution of computational effort. I will present some of the successes and continued challenges in using the Charm++ runtime system to address these difficult load balancing problems on Blue Waters.
3:30 pm - 4:00 pm |
Talk
Heterogeneous Computing in Charm++
Michael Robson, University of Illinois at Urbana-Champaign
With the increased adoption of and reliance on accelerators, particularly GPUs, to achieve more performance in current and next generation supercomputers, effectively utilizing these devices has become very important. However, there has not been a commensurate increase in the ability to program and interact with these devices. We seek to bridge the GPU usability and programmability gap in Charm++ through a variety of GPU frameworks that programmers can utilize. Our ultimate goal is to enable our users to easily and automatically leverage the compute power of these devices without having to rewrite significant portions of their code. In this talk we will present the various frameworks available in Charm++ for programmers interacting with accelerators, their current features and trade-offs, and a brief overview of some major Charm applications that currently utilize various pieces of the Charm++ accelerator stack. We will also present some preliminary performance results and review the programmability enhancements these frameworks offer. Finally, we will examine Charm's future directions as nodes grow in size, new accelerators are introduced, and heterogeneous load balancing at various levels and across different node types becomes increasingly important.
4:00 pm - 4:30 pm |
Talk
Selected Topics in Dynamic Load Balancing
Ronak Buch, University of Illinois at Urbana-Champaign
Dynamic load balancing has long been a hallmark of Charm++, but new hardware and usage patterns continually present new challenges. In this talk, we discuss several improvements that have been made to the load balancing infrastructure and strategies of Charm++. Additionally, we discuss the addition of heterogeneous and vector load balancers to Charm++, which allow loads to be balanced across accelerator devices and using alternative metrics to processor load.
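As background for the strategies discussed, the snippet below sketches in plain C++ the classic greedy centralized approach (heaviest object to the currently least-loaded processor) that several Charm++ balancers build on; it is a generic illustration, not Charm++'s implementation.

```cpp
// Generic greedy load balancing sketch: assign objects, heaviest first, to the
// currently least-loaded processor. Illustrative only, not Charm++'s GreedyLB.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    std::vector<double> objectLoads = {5.0, 1.0, 3.0, 2.0, 4.0, 2.5};
    const int numProcs = 3;

    // Min-heap of (current load, processor id).
    using Proc = std::pair<double, int>;
    std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::sort(objectLoads.rbegin(), objectLoads.rend());  // heaviest first
    for (double load : objectLoads) {
        Proc least = procs.top();                          // least-loaded processor
        procs.pop();
        std::printf("object with load %.1f -> processor %d\n", load, least.second);
        procs.push({least.first + load, least.second});
    }
    return 0;
}
```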
4:30 pm - 4:45 pm |
Afternoon |
Technical Session: PPL Talks (Chair: Nitin Bhat) - NCSA Auditorium 1122 |
4:45 pm - 5:05 pm |
Talk
Neural Network-Based Power Optimizations in Runtime
Bilge Acun, University of Illinois at Urbana-Champaign
The increasing scale and density of data centers and supercomputers, combined with temperature variations among processors, pose significant challenges for building power- and energy-efficient systems and cooling infrastructures. An accurate temperature prediction model is necessary to support proactive cooling decisions that can reduce power and save energy. We propose a neural network-based modeling approach for predicting core temperatures for different workloads under different core frequencies, fan speed levels, and ambient temperatures. The model provides guidance for cooling control (i.e., fan speed control) as well as for cooling-aware algorithms such as frequency control algorithms and thermal-aware load balancing. We propose a preemptive fan control mechanism that can reduce the maximum cooling power by 53% on average by foreseeing the temperature of the cores. Moreover, through our decoupled fan control method and thermal-aware load balancing algorithm, we show that temperature variations can be reduced from 25 C to 2 C, making the cooling system more efficient with minimal performance overhead.
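A minimal sketch of the kind of model described: a tiny feed-forward network mapping core frequency, fan speed, and ambient temperature to a predicted core temperature. The layer size and weights below are placeholders, not the trained model from the talk.

```cpp
// Tiny feed-forward temperature predictor; all weights below are placeholders,
// not the trained model from the talk.
#include <algorithm>
#include <cstdio>

// One hidden layer of 4 ReLU units. Inputs: core frequency (GHz), fan speed
// (normalized 0..1), ambient temperature (C). Output: predicted core temperature (C).
double predictTemp(double freqGHz, double fan, double ambientC) {
    const double W1[4][3] = {{ 8.0, -6.0, 0.5},
                             { 5.0, -2.0, 0.3},
                             { 3.0, -4.0, 0.4},
                             { 6.0, -1.0, 0.2}};
    const double b1[4] = {1.0, 0.5, 0.2, 0.1};
    const double W2[4] = {1.2, 0.8, 1.0, 0.9};
    const double b2 = 20.0;

    double out = b2;
    for (int j = 0; j < 4; ++j) {
        double h = W1[j][0] * freqGHz + W1[j][1] * fan + W1[j][2] * ambientC + b1[j];
        out += W2[j] * std::max(0.0, h);  // ReLU activation
    }
    return out;
}

int main() {
    std::printf("predicted core temperature: %.1f C\n", predictTemp(2.4, 0.6, 25.0));
    return 0;
}
```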
5:05 pm - 5:25 pm |
Talk
Sorting Algorithm/Library
Vipul Harsh, University of Illinois at Urbana-Champaign
We characterize Histogram sort with sampling, an adaptation of the popular Histogram sort algorithm. We show that Histogram sort with sampling is more efficient than Sample sort algorithms that achieve the same level of load balance, both in theory and in practice, especially for massively parallel applications scaling to tens of thousands of processors. We also demonstrate the practicality of Histogram sort with sampling on large modern clusters by exploiting shared memory within nodes to improve the performance of the algorithm.
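To illustrate the sampling idea on a single array: candidate splitters are drawn from a random sample so that the induced buckets are roughly balanced. Histogram sort with sampling then refines the splitters iteratively with a parallel histogram, which this single-node sketch omits.

```cpp
// Choosing approximate splitters from a random sample; single-node sketch of the
// sampling step only, with the iterative histogramming refinement omitted.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 1000000);

    std::vector<int> data(100000);
    for (int& x : data) x = dist(rng);

    const int numBuckets = 8;
    const int sampleSize = 64 * numBuckets;   // oversampling factor of 64

    // Draw a random sample and sort it (cheap relative to sorting all the data).
    std::vector<int> sample(sampleSize);
    std::uniform_int_distribution<size_t> pick(0, data.size() - 1);
    for (int& s : sample) s = data[pick(rng)];
    std::sort(sample.begin(), sample.end());

    // Every (sampleSize/numBuckets)-th sample element becomes a splitter candidate.
    std::vector<int> splitters;
    for (int b = 1; b < numBuckets; ++b)
        splitters.push_back(sample[b * sampleSize / numBuckets]);

    // Check how balanced the induced buckets are.
    std::vector<size_t> counts(numBuckets, 0);
    for (int x : data) {
        size_t b = std::upper_bound(splitters.begin(), splitters.end(), x)
                   - splitters.begin();
        ++counts[b];
    }
    for (int b = 0; b < numBuckets; ++b)
        std::printf("bucket %d: %zu elements\n", b, counts[b]);
    return 0;
}
```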
5:25 pm - 5:45 pm |
Talk
A Parallel Union-Find Library in Charm++
Karthik Senthil, University of Illinois at Urbana-Champaign
Large-scale graph applications play a pivotal role in parallel and high-performance computing. Various scientific problems are translated into graphs to obtain better and faster solutions, while many graph processing frameworks are pushing the limits of modern computing hardware. A well-explored and commonly used graph algorithm is Connected Components detection, and extensive research in the field of parallel computing has resulted in several well-optimized implementations. The union-find operation lies at the core of this algorithm and of many other significant graph operations, such as Kruskal's minimum spanning tree algorithm, network connectivity problems, etc.
We present a library that provides the functionality of performing union-find operations on large-scale graphs in a completely distributed and asynchronous fashion. It has been implemented in Charm++ and can be used in any generic Charm++ application. In this talk we present the current status of the library and analyze its performance with the particular use-case of probabilistic meshes. We also discuss various existing strategies and planned optimizations targeting real-world applications.
This material is based in part upon work supported by the NSF, SI2-SSI: Collaborative Research: ParaTreet: Parallel Software for Spatial Trees in Simulation and Analysis (NSF #1550554).
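For reference, the snippet below is the textbook sequential union-find (disjoint-set) structure with path compression and union by rank in plain C++. The library described in the talk performs this operation in a distributed, asynchronous fashion across chares, which this sketch does not attempt.

```cpp
// Sequential union-find (disjoint-set) with path compression and union by rank.
// The Charm++ library distributes this operation; this is only the textbook core.
#include <cstdio>
#include <numeric>
#include <utility>
#include <vector>

struct UnionFind {
    std::vector<int> parent, rank_;
    explicit UnionFind(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);   // each vertex is its own root
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];            // path compression (halving)
            x = parent[x];
        }
        return x;
    }
    void unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (rank_[a] < rank_[b]) std::swap(a, b);     // union by rank
        parent[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
    }
};

int main() {
    UnionFind uf(6);
    uf.unite(0, 1); uf.unite(1, 2); uf.unite(4, 5);
    std::printf("0 and 2 connected: %d\n", uf.find(0) == uf.find(2));  // prints 1
    std::printf("0 and 4 connected: %d\n", uf.find(0) == uf.find(4));  // prints 0
    return 0;
}
```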
5:45 pm - 6:00 pm |
Discussion
Dr. Phil Miller, CharmWorks Inc.
7:00 pm - 9:00 pm |
Workshop Banquet (for registered participants only) - Siebel Center, 2nd Floor Atrium |
8:30 am - 9:00 am |
Continental Breakfast - NCSA 1st Floor Lobby |
Morning |
Opening Session (Chair: Sanjay Kale) - NCSA Auditorium 1122 |
9:00 am - 10:00 am |
Keynote
Exascale Computing Project: Software Technology Perspective
Dr. Rajeev Thakur, Argonne National Laboratory
The DOE Exascale Computing Project (ECP) was established to accelerate delivery of a capable exascale computing system that integrates hardware and software capability to deliver approximately 50 times more performance on mission-critical applications than the nation's most powerful supercomputers in use today. Its scope includes application development, software technology, hardware technology, advanced system engineering, and early testbed platforms. This presentation provides an overview of the Exascale Computing Project, its current status, and future plans, with a particular emphasis on activities in software technology.
10:00 am - 10:30 am |
Talk
Using Charm++ to Support Multiscale Multiphysics on the Trinity Supercomputer
Robert Pavel (presenting), Christoph Junghans, Susan M. Mniszewski, Timothy C. Germann, Los Alamos National Laboratory
As part of the Trinity Open Science project, we are using the TaBaSCo proxy application to demonstrate the feasibility of at-scale heterogeneous computations. TaBaSCo is a proxy application for multiscale physics, specifically viscoplasticity, developed as part of the ExMatEx exascale computing co-design center. Because of its reliance on database-assisted interpolation, this proxy application can lead to very unbalanced workloads with drastically varying task lengths. To minimize the impact of this imbalance, we have used the Charm++ runtime.
We have run this program on the Trinity supercomputer as a way to investigate the problems and scalability of this approach on large-scale machines. In this talk, we will explain our approach and present our results from phase one of the Trinity Open Science project, in which we ran on the Intel Haswell partition of Trinity. We will describe the advantages and disadvantages of the Charm++ version of the code.
Furthermore, as part of phase two of the Trinity Open Science project, we intend to utilize the Intel Knights Landing partition of Trinity in conjunction with the Intel Haswell partition. Specifically, we will schedule tasks to the type of node on which they run best. As one solution, we use Charm++ to map the chares to the appropriate physical node. In this talk we will present our early results and the benefits of distributing work in a manner that best takes advantage of the different node configurations, and discuss the benefits and obstacles presented by the Charm++ runtime.
10:30 am - 11:00 am |
Morning |
Technical Session: Applications II (Chair: Sam White) - NCSA Auditorium 1122 |
11:00 am - 11:30 am |
Talk
Quinoa: Adaptive Computational Fluid Dynamics
Jozsef Bakosi, Los Alamos National Laboratory
We are developing a set of tools on top of Charm++ that enables research and numerical analysis in fluid dynamics. We currently focus on two solvers: (1) a numerical integrator for stochastic differential equations, used for the design of statistical moment approximations required for, e.g., modeling mixing materials in turbulence, and (2) a finite element solver for the Navier-Stokes (NS) equations on 3D unstructured grids with automatic solution-adaptive mesh refinement (AMR). Using the NS-AMR problem we explore what it takes to scale such high-load-imbalance simulations, representative of large production multiphysics codes, to very large problems using an asynchronous runtime system. On top of Charm++ we combine fully asynchronous data and task parallelism, allowing arbitrary overlap of communication, computation, and I/O. We aim to demonstrate that such an approach scales to large distributed-memory many-core architectures and that this can be done in a portable, extensible, and maintainable fashion. We develop the code in production style, and host it publicly on GitHub (https://github.com/quinoacomputing/quinoa), released under a permissive BSD license, to facilitate collaboration, enable transparency, and solicit feedback. The talk will discuss the high-level software architecture with some algorithm details, e.g., asynchronous linear system assembly, the current state, and our near-future plan.
11:30 am - 12:00 pm |
Talk
SpECTRE: A Next-Generation Relativistic Astrophysics Code
Nils Deppe, Cornell University
Advanced LIGO has begun the exciting new era of gravitational wave astronomy with its groundbreaking discoveries of gravitational waves (GWs) from the merger of two black holes (BHBH). In addition to BHBH mergers, the most frequent sources of GWs expected for LIGO and its partner observatories are the mergers of compact binaries with either two neutron stars (NSNS) or one stellar-mass black hole and a neutron star (BHNS). Unfortunately, even using simplified models of the properties of neutron stars, the computational errors are too large (1-10%) and often not even quantifiable with current algorithmic and hardware limitations. Also, the simulations take too long: several months on present supercomputers even at the current low accuracy. Moreover, the methods do not scale well to upcoming exascale machines. Our code SpECTRE is designed from the ground up to solve problems our current codes struggle with, with a focus on petascale and exascale platforms. It features two key new methods that will make breakthrough simulations of neutron star mergers and core-collapse supernovae possible: Discontinuous Galerkin (DG) discretization and task-based parallelism. Task-based parallelism in SpECTRE is implemented using the Charm++ library.
Currently, SpECTRE can numerically evolve strongly hyperbolic time-dependent partial differential equations in one to three spatial dimensions. The relativistic Euler and MHD systems have been implemented for simple equations of state, and SpECTRE is capable of running large-scale simulations efficiently using all cores of NSF's Blue Waters (~360,000 floating-point cores). This recent work has culminated in our first code paper describing challenging benchmark tests, scaling results, and task-based scheduling. We demonstrated that the DG algorithm together with task-based parallelism scales to massive core counts. To our knowledge this is the first combination of DG methods with task-based parallelism, and the first DG evolution of the relativistic MHD system. Because of Charm++ this scaling test succeeded with code that was moved from a laptop to the supercomputer with no code modifications at all.
12:00 pm - 1:15 pm |
Lunch - Provided - NCSA 1st Floor Lobby - Sitara |
1:15 pm - 2:15 pm |
Panel
Will exascale computing help raise the "missing middle"? Can it? Will it? How?
Panelists: Dr. Rajeev Thakur, Mr. Keven Hofstetter, Dr. Rasmus Tamstorf, Dr. Laxmikant Kale; Moderator: Dr. Phil Miller
High performance technical computing is done on single desktops (engineering workstations) or on large supercomputers, but its penetration in mid-size systems (say, 10-1000 nodes) has been limited. In other words, the engineering and manufacturing industries have not been using cluster computing with as much intensity as they could. This has been called the "missing middle". The question for this panel is: will the push for extreme-scale computing (with exascale as one milepost along that push) help raise the missing middle, i.e., increase and broaden the usage of cluster-scale computing in industry?
2:15 pm - 2:45 pm |
Afternoon |
Technical Session: Interfaces (Chair: Eric Mikida) - NCSA Auditorium 1122 |
2:45 pm - 3:15 pm |
Talk
Early Experience with Integrating Charm++ Support to Green-Marl DSL
Alex Frolov, NICEVT
The paper presents the implementation of a code generation mechanism in the Green-Marl domain-specific language (DSL) compiler targeting the Charm++ framework. Green-Marl is used for parallel static graph analysis and adopts an imperative shared-memory programming model, while Charm++ implements a message-driven execution model. We describe the graph representation in the generated Charm++ code, as well as the translation of common Green-Marl constructs to Charm++. An evaluation on typical graph algorithms (Single-Source Shortest Path (SSSP), Connected Components (CC), and PageRank) showed that Green-Marl programs translated to Charm++ achieve the same performance as native Charm++ implementations.
3:15 pm - 3:45 pm |
Talk
Applying Logistic Regression Model on HPX Parallel Loops
Zahra Khatami, Louisiana State University
The performance of many parallel applications depends on loop-level parallelism. However, manually parallelizing all loops may degrade parallel performance, as some loops cannot scale desirably to larger numbers of threads. In addition, the overhead of manually setting chunk sizes might prevent an application from reaching its maximum parallel performance. We illustrate how machine learning techniques can be applied to address these challenges. In this research, we develop a framework that is able to automatically capture the static and dynamic information of a loop. Moreover, we advocate a novel method for determining the execution policy and chunk size of a loop within an application by feeding this captured information into our learning model. Our evaluation shows that the proposed technique can speed up execution by up to 35%.
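A generic C++ illustration of the decision such a framework might make: a logistic model over simple loop features estimates whether parallel execution (and which chunk size) is likely to pay off. The features and coefficients are invented, and this is not HPX code.

```cpp
// Logistic model deciding whether to parallelize a loop; coefficients are
// placeholders rather than a trained model, and no HPX API is used.
#include <cmath>
#include <cstdio>

struct LoopFeatures {
    double iterations;        // number of loop iterations
    double workPerIteration;  // estimated operations per iteration
    double numThreads;        // threads available
};

// P(parallel execution is profitable) via a logistic function.
double profitProbability(const LoopFeatures& f) {
    const double w0 = -4.0, w1 = 0.0005, w2 = 0.01, w3 = 0.1;   // placeholder weights
    double z = w0 + w1 * f.iterations + w2 * f.workPerIteration + w3 * f.numThreads;
    return 1.0 / (1.0 + std::exp(-z));
}

int main() {
    LoopFeatures f{10000, 200, 16};
    double p = profitProbability(f);
    if (p > 0.5) {
        // Pick a chunk size that leaves several chunks per thread for balance.
        int chunk = static_cast<int>(f.iterations / (f.numThreads * 4));
        std::printf("run parallel (p=%.2f), chunk size %d\n", p, chunk);
    } else {
        std::printf("run sequentially (p=%.2f)\n", p);
    }
    return 0;
}
```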
3:45 pm - 4:15 pm |
Afternoon |
Technical Session: Applications III (Chair: Bilge Acun) - Siebel Center 1304 |
4:15 pm - 5:15 pm |
Talk
OpenAtom: Ground and Excited States of Electrons from First Principles
Prof. Sohrab Ismail-Beigi, Subhasish Mandal, Minjung Kim, Yale University; Dr. Glenn Martyna, Dr. Qi Li, IBM; Eric Bohm, Eric Mikida, Kavitha Chandrasekar, University of Illinois
The goal of the OpenAtom project is to statistically sample complex environments in order to understand important and useful materials systems. We describe our progress towards fast ab initio computations that provide the ground-state energy surface describing the systems of interest within Density Functional Theory, as well as excited-state properties and spectroscopy within the GW/Bethe-Salpeter approach.
5:15 pm - 5:45 pm |
Talk
Multilevel Summation Method for Calculating Electrostatic Interactions in NAMD
David Hardy, University of Illinois at Urbana-Champaign
Molecular dynamics (MD) simulation has for decades been a valuable computational approach for investigating biomolecules. The emergence of exascale computing provides the opportunity for a commensurate increase in simulated system sizes to enable the study of macromolecular assemblies comprising billions of atoms. The rate-limiting part of MD is the calculation of the nonbonded electrostatic forces, which must be computed billions of times when simulating microsecond timescales. For MD simulations performed with the program NAMD, it is most common to employ the particle-mesh Ewald (PME) method to calculate electrostatics. However, PME has two significant shortcomings: (1) its use necessitates the adoption of periodic boundary conditions within a simulation, and (2) each PME evaluation requires the calculation of two 3D FFTs, which poses a bottleneck to parallel scalability. Both shortcomings are addressed by an alternative approach, the multilevel summation method (MSM), that is currently being developed in NAMD.
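As a rough illustration of the idea behind MSM, the sketch below splits the pairwise Coulomb kernel 1/r into a short-range part evaluated directly within a cutoff and a smooth remainder; MSM interpolates that smooth remainder on a hierarchy of grids rather than computing it pairwise, which this toy code does not do. The particular splitting polynomial is one common choice, not necessarily NAMD's.

```cpp
// Toy illustration of kernel splitting for electrostatics: 1/r is split into a
// short-range part (computed directly within a cutoff) and a smooth long-range
// remainder. MSM approximates that remainder on nested grids; here it is only
// evaluated for single pair distances. Not NAMD code.
#include <cstdio>

// Smooth approximation g(r) of 1/r that matches 1/r (value and slope) at the cutoff,
// so that the short-range part 1/r - g(r) vanishes beyond the cutoff.
double smoothPart(double r, double cutoff) {
    double s = r / cutoff;
    if (s >= 1.0) return 1.0 / r;   // beyond the cutoff everything is "smooth"
    return (1.0 / cutoff) * (1.875 - 1.25 * s * s + 0.375 * s * s * s * s);
}

int main() {
    const double cutoff = 10.0;     // illustrative cutoff distance
    const double rs[] = {2.0, 5.0, 9.0, 12.0};
    for (double r : rs) {
        double full = 1.0 / r;
        double smooth = smoothPart(r, cutoff);
        double shortRange = full - smooth;   // zero for r >= cutoff
        std::printf("r=%5.1f  1/r=%.4f  short-range=%.4f  smooth=%.4f\n",
                    r, full, shortRange, smooth);
    }
    return 0;
}
```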
5:45 pm - 6:00 pm |
Closing Remarks
Prof. Laxmikant V. Kale, University of Illinois at Urbana-Champaign
12:00 pm - 1:15 pm |
Dinner - Provided - Big Grove Tavern |
Morning |
Tutorial Session - Siebel Center 4405 |
9:00 am - 1:00 pm |
Tutorial
PPL
1:00 pm - 1:30 pm |
Lunch - Provided - Potbelly |
1:30 pm - 3:00 pm |
Tutorial
Projections
PPL