Among the applications hardest to parallelize scalably are
those that perform a relatively small amount of computation per
iteration.
Multiple interacting performance challenges must be identified and solved
to attain high parallel efficiency in such cases.
We present a case study involving NAMD, a
parallel molecular dynamics application, and the effort to scale it to
3000 processors with TeraFLOPS-level performance. NAMD is
implemented in Charm++, and the performance analysis was carried out
using ``projections'', the performance visualization/analysis tool
associated with Charm++.
We showcase a series of optimizations facilitated by
projections. The resulting performance of NAMD led to a Gordon Bell
award at SC2002.