FEM computations have a characteristic parallel communication pattern--each processor first exchanges data with neighboring processors, then performs local computation and repeats the process. As can be seen in Table 1, the time spent in local computation in each step can be quite small, especially for larger machines and smaller meshes. This small ``grainsize'' means communication happens more often, which can lead to poor performance.
For example, with a 1M element mesh running on 100,000 processors, each processor might only have 10us of computation between messaging phases. Since each message takes on the order of 10us, processors will spend all their time communicating and efficiency will be very low.
Communication latency can be hidden to a large extent with the technique of ``processor virtualization'', in which the problem is decomposed into more pieces than processors, and the pieces scheduled dynamically based on which messages are available. CHARM++ and the FEM framework fully support virtualization, and in fact require no extra user code for a virtualized run.
Another complementary approach to handle communication latency is the ghost cell expansion method [12], where redundant computations around each processor's border are used to decrease the frequency of message exchange. This multiple-ghost approach has only been implemented for structured grids, however, and the extension to unstructured grids, while conceptually straightforward, would be complicated to implement.