
FEM

We studied the performance of a CHARM++ FEM Framework program that performs a simple 2D structural simulation on an unstructured triangle mesh. We chose a relatively small problem, a 5-million-element mesh, so as to stress efficiency issues. Because each 2D element takes a little under a microsecond of CPU time per timestep, the whole mesh amounts to less than 5 seconds of serial work per timestep.
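The work estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, the per-element cost is taken as the 1 microsecond upper bound implied by the text, and the mesh is assumed to be divided evenly among processors.

```python
import math

ELEMENTS = 5_000_000      # mesh size quoted in the text
SEC_PER_ELEMENT = 1e-6    # upper bound: "a little under a microsecond"


def serial_work_bound(elements=ELEMENTS, sec_per_element=SEC_PER_ELEMENT):
    """Upper bound on serial CPU seconds per timestep."""
    return elements * sec_per_element


def elements_per_processor(processors, elements=ELEMENTS):
    """Average number of mesh elements per processor (even split assumed)."""
    return elements / processors


print(f"serial work per step: under {serial_work_bound():.1f} s")
for p in (125, 16_000):   # smallest and largest simulated machine sizes
    print(f"{p:>6} processors: {elements_per_processor(p):,.1f} elements each")
```

At 16,000 processors each processor holds only a few hundred elements, which is why this problem size stresses efficiency.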

Figure 5 shows the predicted execution time per step for 125 to 16,000 simulated processors, using only 32 Lemieux processors to run the simulation. The time per step is 23.5 milliseconds on 125 processors and drops to 640 microseconds on 16,000 processors. Figure 6 shows the corresponding speedup, normalized to the 125-processor time. It shows that the program scales well to at least several thousand processors.
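The two endpoint timings quoted above are enough to compute the speedup and parallel efficiency relative to the 125-processor run. The following is a minimal sketch; the function names are hypothetical, and the final scaling step assumes that "normalized" means the 125-processor run is assigned a speedup of 125, which the text does not state explicitly.

```python
def relative_speedup(t_base, t_n):
    """Speedup of a run taking t_n seconds relative to a baseline run taking t_base."""
    return t_base / t_n


def relative_efficiency(t_base, p_base, t_n, p_n):
    """Relative speedup divided by the ideal speedup p_n / p_base."""
    return relative_speedup(t_base, t_n) / (p_n / p_base)


# Endpoint values quoted in the text.
P_BASE, T_BASE = 125, 23.5e-3      # 23.5 ms per step on 125 processors
P_MAX, T_MAX = 16_000, 640e-6      # 640 us per step on 16,000 processors

speedup = relative_speedup(T_BASE, T_MAX)
ideal = P_MAX / P_BASE
efficiency = relative_efficiency(T_BASE, P_BASE, T_MAX, P_MAX)

# Assumed normalization: the baseline run counts as a speedup of P_BASE.
scaled_speedup = P_BASE * speedup

print(f"relative speedup: {speedup:.1f} (ideal {ideal:.0f})")
print(f"relative efficiency: {efficiency:.1%}")
print(f"speedup scaled to a 125-processor baseline: {scaled_speedup:.0f}")
```

Relative to 125 processors, the 16,000-processor run is about 37 times faster against an ideal of 128, or roughly 29% relative efficiency.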

Beyond several thousand processors, when the simulated time per step drops below a few milliseconds, the parallel efficiency begins to drop. Sub-millisecond cycle times are indeed extremely challenging even on today's small machines, and we continue to seek methods to improve this performance on even larger machines.

Figure 5: Predicted execution time

Figure 6: Predicted speedup

We also demonstrate the benefits of processor virtualization in CHARM++ for the same FEM program. We use different numbers of MPI virtual processors, each with a separate chunk of the problem mesh, on each simulated processor. Virtualization allows dynamic overlap of computation and communication, and can improve cache utilization because each virtual processor's data is small.

The predicted performance for various degrees of virtualization is illustrated in Figure 7. The problem size in this test is still the same, a 5-million-element mesh, and the simulated machine size is fixed at 2000 processors. Even a low degree of virtualization dramatically improves performance by allowing computation and communication to be overlapped; higher degrees of virtualization provide little additional benefit, and eventually the overhead of the extra virtual processors only slows the program down.
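To give a feel for how the degree of virtualization changes the chunk size, the sketch below computes the average number of elements per virtual processor on the 2000-processor simulated machine. The degrees listed are hypothetical examples, not the values used in the experiment, and the mesh is assumed to be split evenly; the names are illustrative.

```python
ELEMENTS = 5_000_000       # mesh size quoted in the text
SIM_PROCESSORS = 2000      # simulated machine size quoted in the text


def elements_per_virtual_processor(degree, elements=ELEMENTS,
                                   processors=SIM_PROCESSORS):
    """Average chunk size when each simulated processor hosts `degree` virtual processors."""
    return elements / (processors * degree)


# Hypothetical degrees of virtualization, chosen only for illustration.
for degree in (1, 2, 4, 8, 16):
    chunk = elements_per_virtual_processor(degree)
    print(f"degree {degree:>2}: {chunk:8.2f} elements per virtual processor")
```

Smaller chunks are more likely to fit in cache, but each additional virtual processor adds scheduling and communication overhead, which is consistent with the diminishing returns described above.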

Figure 7: Predicted execution time vs. degree of virtualization


Gengbin Zheng 2004-01-21