Services and Interfaces to Support Systems with Very Large
Numbers of Processors
Principal Investigator(s):
Terry Jones (Lawrence Livermore National Lab.)
Laxmikant V. Kale (Univ. Illinois)
Jose Moreira (IBM)
Project Period:
Starting Date: April 1st, 2005
Ending Date: March 31st, 2008
Documentation:
Subcontract between LLNL and UIUC: B551028
Project Summary:
We will research and develop system software that enables
general-purpose operating and runtime systems for machines with tens of
thousands of processors. Making an operating system scale to such levels
with the desired performance and functionality requires new technology in
each of four areas: memory management, fault management, parallel
resource management, and global system management. Our focus will
include a consolidated approach to all of these interrelated issues in a
large parallel context. Woven together into a more capable and efficient
operating and runtime system, these technologies will be demonstrated on
multiple platforms, including IBM's BlueGene-class machines.
Publications:
Sayantan Chakravorty and Laxmikant V. Kale.
A Fault Tolerance Protocol with Fast Fault Recovery,
Proceedings of the IEEE International Parallel and Distributed
Processing Symposium (IPDPS), California, March 2007.
Sayantan Chakravorty, Celso L. Mendes and Laxmikant V. Kale.
Proactive Fault Tolerance in MPI Applications via Task Migration,
Accepted for International Conference on High Performance Computing (HiPC),
Bangalore/India, December 2006.
Gengbin Zheng, Orion Sky Lawlor and Laxmikant V. Kale.
Multiple Flows of Control in Migratable Parallel Programs,
The 8th Workshop on High Performance Scientific and Engineering Computing,
Columbus/OH, August 2006.
Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale, Terry Jones,
Andrew Tauferner, Todd Inglett and Jose Moreira.
HPC-Colony: Services and Interfaces for Very Large Systems,
ACM SIGOPS Operating Systems Review, Special Issue on
Operating and Runtime Systems for High-End Computing Systems, April 2006.
Gengbin Zheng, Chao Huang, Laxmikant V. Kale.
Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for
AMPI and Charm++,
ACM SIGOPS Operating Systems Review, Special Issue on
Operating and Runtime Systems for High-End Computing Systems, April 2006.
Tarun Agarwal, Amit Sharma and Laxmikant V. Kale.
Topology-aware Task Mapping for Reducing Communication Contention on
Large Parallel Machines,
Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS),
Rhodes Island/Greece, April 2006.
Sameer Kumar, Gheorghe Almasi, Chao Huang and Laxmikant V. Kale.
Achieving Strong Scaling with NAMD on Blue Gene/L,
Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS),
Rhodes Island/Greece, April 2006.
Chao Huang, Gengbin Zheng, Sameer Kumar and Laxmikant V. Kale.
Performance Evaluation of Adaptive MPI,
Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP), New York/NY, March 2006.
Tarun Agarwal.
Strategies for Topology-Aware Task Mapping and for Rebalancing
with Bounded Migrations,
MS Thesis, Dept. of Computer Science, University of Illinois, June 2005.
Gengbin Zheng.
Achieving High Performance on Extremely Large Parallel Machines,
PhD Thesis, Dept. of Computer Science, University of Illinois, May 2005.