In summary, there are a number of practical bottlenecks to execution on very large machines. First, large meshes must be generated; this is difficult with today's tools. Second, the meshes must be partitioned for parallel execution. Finally, the resulting computation may still have small grainsize, so messaging performance is important.
Our runs with BigSim also exposed a number of unexpected bottlenecks and limitations to scalability. For example, the serial partitioning library we use consumes memory proportional to the number of output pieces, not the total size of the mesh; so even our 4GB machine ran out of memory when partitioning a relatively small 5M element mesh into more than 16K pieces. Hopefully ParMetis will solve this problem.
Similarly, even though our MPI implementation,
AMPI, was designed to be scalable, while trying to simulate very
large machines we discovered our implementation used
total
memory for
processors. The culprit was a simple linear message
ordering table kept by each processor, because the table's length
was proportional to the number of processors. For today's machines,
where
=1000, the total amount of memory used was 16MB; but for
=100,000, the tables would use 160GB! The solution was to
break the tables into (software) pages and only allocate pages when
referenced; this dramatically reduces the storage requirements for
large machines because most processors only communicate directly with
a small subset of other processors.