--
EricBohm - 24 Jun 2008
Thomas Dunning:
- Center is critical to Illinois strategy to become the academic center for extreme scaling computing.
- Allocations on Abe will be available
- 50% of Abe is for IACAT, apply for time through Kale and D Johnson
General:
David Padua: Transforming Programs Automatically
- Compilers
- Vikram A will be contributing a multicore oriented compiler.
- ETA ~ 1 year
- Generators
- schemes for specifying algorithms and then having the code expressed for target machines or tasks
- Libraries
- Short term work can begin on hand/self tuned libraries oriented towards IACAT-CPC needs.
- Auto tuning BLAS routines. (A Duchateau)
- strategy searches unrolling space and loop parameters around innermost loop.
- works by annotation, good for use on critical path computation
Ralph Johnson: Refactoring
- Original design no longer sufficient for current needs.
- Flossing vs Root Canal. Incremental refactoring vs major refactoring
- Have automated test suite
- make small change (15 minutes to 1-2 hours)
- test change to verify
- break large changes into smaller changes
- when adding a feature, do the refactoring first, won't need new tests.
- Optimizations
- qualitatively different in that they add no feature and usually make the code harder to understand
- these manual changes are actually a kind of program transformation
- a refactoring aware version control system can propagate changes forward down branches
- Ask Ralph to help with refactoring issues, improving documentation, and adapting to new technologies
- Automated/interactive refactoring tools (Photran) not yet for general consumption, ~ 1 year
Sanjay Kale: Charm++ and its use in scalable CSE applications
- overdecomposition
- express as much parallelism as there is available, more than the number of cores
- adaptive runtime system
- dynamic load balancing
- performance tools, scalable with automated data mining to highlight performance bottlenecks
- new "languages"
- early performance development via BigSim
- PMPI layer available for using projections tool with MPI codes (without AMPI) idooley@uiuc.edu
David Ceperly: QMCPack
- QMC stochastic sampling to solve many body Schrodinger
- Monte Carlo methods to use random walk approaches
- Branches occur which split walkers which creates some imbalance
- Walkers are weakly coupled, not communication bound
- Computing from the determinant e.g. 100x100 done ~ 10^12 times
- Lapack sherman morrison single column update for matrix inverse to get the determinant
- can be done using 3D spline interpolation table requiring a few hundred operations
- has memory constraint due to table size
- computational bias is 1/number of walkers.
- you need enough to approach normal noise
- too many consumes more resource with little benefit
- will need another level of decomposition, parallelization within walkers
- cpu time per walker per DMC strp grows N^3
- will need fault tolerance
- need SMP awareness
- memory issues with size of one-body orbitals
- finer level parallelism, perhaps 1 walker per node
- Padua suggests a code walkthrough
- D Johnson recommends profiling first to identify code segments for a walkthrough
Duane Johnson: KKR (hybrid real and k-space) order-N DFT code
- overall characteristics
- green's function, but without eigenvalue spectra etc
- screened structure constants (sparsity)
- sparse matrix technique ( memory and size)
- Iterative inversion technique (size)
- hybrid k-space/r-space (accuracy)
- r-space only (speed)
- Coherent potential approx (disorder)
- complex non-hermitian matrices
- G= G_0 (1- t G_0)^-1 (repeated until convergence) is the core of the computational time
- evaluated at each energy
- G_0 structure
- t is
- create reference state system using square well
- solve for true system relative to reference system G=G_r[I - (t -T_r)G_r]-1
- G_r is sparse <= 1% nonzero for N=2000. this is the whole point of doing it this way.
- sadly, off diagonals same magnitude as on diagonal, which makes banded approaches unsuitable
- parallelize by atom vectors, or groups of atom vectors
- pre-conditioners have so far not yielded good results
- need iterative schemes for non-hermitian case with good scaling properties
Paul Ricker: FLASH
- AMR
- Type 1a supernovae, novae, x-ray bursts
- grid determined by power spectrum initial fluctuation
- physics being modeled
- hydrodynamics, euler equations, magnetohydrodynamics, multifuild advection
- Reactions: coupled ODE
- Gravity: Poisson equation (heavy computational component)
- Particles: Vlasov-Boltzmann equation
- Modular framework, managed by python script to mix and match for specified target
- multigrid poisson solving has complex storage issues with ghost/guard boundary exchange
- use space filling curve for load balancing in Paramesh
- worth discussing a non-SFC solution (oct-tree?)
- multigrid poisson solver has discontinuity at fine grid boundaries
- oct-tree intpolate on all boundary points, residual on composite sources single layer potential
- weak scaling to 128k BG/L, flat for Eulerian, less optimal for lagrangian
- good strong scaling on itanium,
- strong scaling on Jaguar good to about 8x minimum number to fit
- Load balancing
- multigrid poisson solver
- movement of particles among processors
- addition of time refinement
- coping with hybrid architectures
- results out
- efficient I/O for raw checkpoint data (one run can produce terabytes of data)
- fault tolerance (error recovery)
- on-the fly analysis to reduce I/O requirements
- migrate to AMPI
- Tune multigrid
- hybrid parallelization for multi-core
- re-vectorization (sort of like it was way back in the T1 days)
Jim Phillips: Towards Petascale Biomolecular Simulations with NAMD
- All atom MD long time scales (10 ns minumum, more is better), usually biological systems in liquid
- 20,000 users
- 10-20 MB data for 100,000 atoms
- Full electrostatics, multiple timestepping for long range electrostatics
- Hybrid spatial decomposition, compute objects for pairwise patch force computation
- Uses dynamic load balancing based on instrumentation of computation time per object (Charm++)
- Jim notes that automatic refactoring/transformation tools could be useful for MPI codes
- e.g. a get based implementation could be swapped for a put based implementation if the underlying fabric has faster puts than gets
- Challenges derive from determining the origin of performance problems and locating the appropriate workaround
- this should be done automatically for the end user
Tools for
OpenMP? optimization should be studied. (Acumem, plus some direct help from Intel are worth pursuing).
Commonalites:
- What differential is being solved?
- What kernel contains the computation?
- Which of these characterizes performance?
- Flow Equation
- Poisson Equation,
- FFTW
- Linear Algebra routines
- How do we identify problems in the code
- how do we identify the real cause of the problem?
- A tool to figure out which cores to avoid on a target machine with daemon interference would be helpful
Key needs:
- D Johnson will consult with the application teams to identify their specific key needs wrt libraries, compilers, etc
- Kale will consult with the CS teams to identify specific techniques for IACAT-CPC use