Performance Optimization Under Thermal and Power Constraints For High Performance Computing Data Centers
Publication Type: Talk
Energy, power, and resilience are the major challenges that the HPC community faces in moving to larger supercomputers. Data centers worldwide consumed 235 billion kWh of energy in 2010, a level of consumption that is fast becoming a problem. A significant portion of that power and energy is devoted to cooling. This thesis proposes a scheme that combines limiting processor temperatures through Dynamic Voltage and Frequency Scaling (DVFS) with frequency-aware load balancing to reduce cooling energy consumption and prevent hot spot formation.
Recent reports have expressed concern that reliability at exascale could degrade to the point where failures become the norm rather than the exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns, while research on improving hardware reliability has been making progress independently. The second component of this thesis bridges the gap between these two lines of work and explores the potential of combining software and hardware techniques to improve the reliability of HPC machines.
The power consumption of present-day HPC systems, on the order of 10 MW, is becoming a bottleneck. Energy bills will increase significantly with machine size, but power consumption is the harder constraint and must be addressed directly. Intel's Running Average Power Limit (RAPL) interface is a recent feature that enables power capping of the CPU and memory subsystems on modern hardware. The ability to constrain the maximum power consumption of these subsystems below the vendor-assigned Thermal Design Power (TDP) allows us to add more nodes in an overprovisioned system while ensuring that the total power consumption of the data center does not exceed its power budget. The final component of this thesis proposes an interpolation scheme that uses an application's profile to choose the number of nodes and the distribution of power between the CPU and memory subsystems so as to minimize execution time under a strict power budget.
We also present a resource management scheme including a scheduler that uses CPU power capping, hardware overprovisioning, and job malleability to improve the throughput of a data center under a strict power budget.
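The overprovisioning trade-off described above can be illustrated with a minimal sketch. It is not the thesis's interpolation scheme: it is a plain exhaustive search over node counts and per-node CPU and memory power caps, and the function names (best_configuration, toy_time_model), the cap values, the 95 W and 30 W reference powers, and the square-root speed model are all hypothetical, chosen only to show why running more nodes below TDP can beat running fewer nodes at TDP under the same budget.

```python
from itertools import product
from math import sqrt


def best_configuration(budget_w, max_nodes, cpu_caps_w, mem_caps_w, time_model):
    """Search (nodes, cpu_cap, mem_cap) choices that fit a total power budget.

    time_model(nodes, cpu_cap_w, mem_cap_w) returns a predicted execution time.
    Returns (time, nodes, cpu_cap_w, mem_cap_w) for the fastest feasible
    choice, or None if not even one node fits in the budget.
    """
    best = None
    for cpu_cap, mem_cap in product(cpu_caps_w, mem_caps_w):
        per_node_w = cpu_cap + mem_cap
        # Largest node count whose total capped power stays within the budget.
        feasible_nodes = min(max_nodes, int(budget_w // per_node_w))
        for nodes in range(1, feasible_nodes + 1):
            candidate = (time_model(nodes, cpu_cap, mem_cap), nodes, cpu_cap, mem_cap)
            if best is None or candidate[0] < best[0]:
                best = candidate
    return best


def toy_time_model(nodes, cpu_cap_w, mem_cap_w):
    """Hypothetical model: per-node speed has diminishing returns in power.

    Speed is limited by the tighter of the two caps, measured against made-up
    reference powers of 95 W (CPU) and 30 W (memory), with perfect scaling
    across nodes. A real model would come from application profiling.
    """
    per_node_speed = min(sqrt(cpu_cap_w / 95.0), sqrt(mem_cap_w / 30.0))
    return 1000.0 / (nodes * per_node_speed)


# Illustrative inputs only: a 1000 W budget and a 20-node machine.
CPU_CAPS_W = [50, 65, 80, 95]
MEM_CAPS_W = [15, 20, 25, 30]
best = best_configuration(1000, 20, CPU_CAPS_W, MEM_CAPS_W, toy_time_model)
print(best)
```

Under this toy model the search prefers many nodes at low caps over the eight nodes that fit at the reference powers, because per-node speed falls off more slowly than per-node power. With a model in which speed is linear in power, that advantage disappears, which is why the choice of time model matters more than the search itself.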