Mitigating Variability in HPC Systems and Applications For Performance and Power Efficiency
Thesis 2017
Publication Type: PhD Thesis
Power consumption and process variability are two important, interconnected challenges for future large-scale High Performance Computing (HPC) data centers. For example, current production petaflop supercomputers consume more than 10 megawatts of machine and cooling power, which costs millions of dollars every year [1]. As HPC moves towards exascale computing, these costs will increase, and power consumption is expected to become a major concern. The dynamic behavior of both HPC applications and HPC systems makes it challenging to optimize the performance and power efficiency of large-scale applications. Dynamic application behavior arises in irregular or imbalanced applications. Dynamic system behavior includes thermal, power, and frequency variations among processors. Smart and adaptive runtime systems have great potential to handle these challenges transparently to the application. In this dissertation, I first analyze frequency, temperature, and power variations in large-scale HPC systems using thousands of cores and different applications. After identifying the cause of each of these variations, I propose solutions that mitigate them to improve performance and power efficiency. When analyzing frequency variation, I identify manufacturing-related intrinsic differences in the chips’ power efficiency as the cause of frequency variation under dynamic overclocking. I propose speed-aware dynamic load balancing strategies to mitigate the performance overhead due to frequency variation. When analyzing temperature variation, I focus on inefficiencies in fan-based air cooling systems. I propose proactive and decoupled fan control mechanisms that reduce temperature variations and cooling power consumption by predicting core temperatures with a learning-based model. When analyzing power variation, I identify both static and dynamic manufacturing-related sources of power variation. I propose different variation-aware node assembly methods to mitigate the power variation. Finally, I propose a fine-grained, runtime-based technique that mitigates application-level variations caused by the characteristics of the application itself (for example, applications with different kernel types or phases) in order to reduce energy consumption.