We have compared AMPI with the original message-driven multi-partitioning approach to evaluate overheads associated with each of them using a Computational Fluid Dynamics (CFD) kernel that performs Jacobi relaxation on large grids (where each partition contains 1000 grid points.) We ran this application on a single 250 MHz MIPS R10000 processor of the Origin2000 machine at National Center for Supercomputing Applications (NCSA), with different number of chunks, scaling the mesh to keep the chunk-size constant. Two different decompositions, 1-D and 3-D, were used. These decompositions vary in number of context-switches (blocking receives) per chunk. While the 1-D chunks have 2 blocking receive calls per chunk per iteration, the 3-D chunks have 6 blocking receive calls per chunk per iteration. However, in both cases, on the average only half of these calls actually block waiting for data, resulting in 1 and 3 context switches per chunk per iteration respectively. As can be seen from figure 5.12, the optimization due to availability of local variables across blocking calls, as well as larger subroutines in the AMPI version neutralizes thread context-switching overheads for a reasonable number of chunks per processor. Thus, thread-based Adaptive MPI can be effectively used for component coordination without incurring any significant overheads.
|
Encouraged by these results, we converted some large MPI applications using AMPI as part of the Center for Simulation of Advanced Rockets (CSAR). At CSAR, we are developing a detailed multi-physics rocket simulation and virtual prototyping tool [31]. GEN1, a first generation integrated rocket simulation code is composed of three coupled modules: Rocflo (an explicit fluid dynamics code), Rocsolid (an implicit structural simulation code), and Rocface (a parallel interface between Rocflo and Rocsolid) [63]. Rocflo and Rocsolid were written using FORTRAN 90 (about 10000 and 12000 lines respectively), and use MPI as parallel programming environment. Rocflo and Rocsolid were the the first application codes to be converted to AMPI5.3. This conversion, using the techniques described in the last section, resulted in very few changes to original code (in fact, the changed codes can be linked with MPI, without any changes), and did not take much time for us, even as we were unfamiliar with the codes (about a week for one person for each of these codes). Conversion of Rocface was even quicker5.4. This quick manual conversion was possible because Rocface was a modularly written code written in C++ with very few global variables.
The overhead of using AMPI instead of MPI is shown (tables 5.1 and 5.2) to be minimal, even with the original decomposition of one partition per processor. We expect the performance of AMPI to be better when multiple partitions are mapped per processor, as depicted in Figure 5.1. Also, the ability of AMPI to respond to dynamic load variations outweighs these overheads.
|
|
|
[Setup]
[Fluids Update]
[Solids Update]
[Predictor-Corrector]
[Total]
|
Figure 5.13 shows the performance of the integrated rocket simulation code implemented with AMPI-based components, and its comparison with the original MPI-based code. Setup stage is carried out once at startup (and exhibits a serial bottleneck at the file-system.) Each timestep of the integrated code consists of fluids update, solids update, and a predictor-corrector step. Timings of the AMPI component-based code are comparable with the original MPI-based codes, with the maximum 3% overhead of AMPI. These experiments were carried out on the NCSA Origin2000 (250MHz MIPS R10000 processors). The AMPI implementation was deliberately chosen to run on the version of Converse that used MPI as its underlying communication library. Also, for this comparison, only one virtual processor of the AMPI version of the rocket simulation code was mapped to each physical processor. Note that this is scaled problem, where the problem size grows with the number of processors.
Workstation clusters are often built incrementally, with individual machines that typically reflect technology improvements over the time it took to build a cluster. An example of such a cluster is the ``Turing'' cluster at the Department of Computer Science at University of Illinois. At the time this experiment was performed, it consisted of 208 dual-processor machines, ranging in architecture from 400 MHz Pentium II to 1 GHz Pentium III, connected with the Myrinet network. Synchronizations and near-neighbor communications make the faster processors wait for the slower processors to finish work. Therefore, for all except the most trivial parallel applications, the slowest processor dictates the time taken on such heterogeneous clusters. Because of the virtualization strategy employed by AMPI, the runtime system has the freedom to balance the load, taking into account the differences in processor speeds. This is demonstrated in figure 5.14. We ran a Jacobi relaxation application written using AMPI on 62 processors of the Turing cluster. The set of processors available to us had the following composition: 7 Pentium II (400 MHz), 17 Pentium II (450 MHz), 36 Pentium III (550 MHz) and 2 Pentium III (1 GHz). We varied the number of virtual processors from 64 to 512, scaling the mesh so that the individual partition-size was kept unchanged, and compared the timings for each timestep with and without using the processor speeds information. Indeed, we see a significant improvement in performance when processor speeds are taken into account. As the number of virtual processors increase, AMPI has more flexibility in mapping them to available physical processors to balance the load, resulting in greater performance improvements.
Figure 5.15 shows the performance improvements for the various components of the GEN1 codes when running on the heterogeneous Turing cluster.
The dynamic load balancing capabilities of AMPI result in improved utilization of dynamic computing platforms such as workstation clusters, where availability of computational resources may change with time. Figure 5.16 shows the conjugate gradient solver described above initially running on 16 processors in a workstation cluster. As 16 new processors become available after timestep 600, AMPI redistributes load to all available 32 processors, and the time required for each timestep of the conjugate gradient solver reduces to about half (as expected) of the earlier time. Such flexible allocation of processor resources is possible using an adaptive job scheduler developed by Sameer Kumar and Jay Desouza [48]. This experiment was carried out on an IA-32 based ``Platinum'' cluster (Pentium III 1 GHz, Myrinet) at NCSA5.5.
Dynamic and/or irregular applications cause load imbalance at runtime even on
homogeneous computing platforms with constant availability. Examples of such
problems include scientific applications that use techniques such as adaptive
mesh refinement. The measurement-based load balancing framework in Charisma
deals with such dynamic applications by redistributing workload among available
processors in order to balance the load as shown in
figure 5.17. This figure shows the performance of a
neighborhood averaging component in AMPI based on Jacobi-relaxation on
eight processors of SGI Origin 2000 at NCSA. The grid is partitioned into 64
partitions, that are mapped to eight processors by the Charisma runtime. In
the
iteration, one of the partitions is refined eight times, thus
increasing the computational load of that partition in proportion. Even though
the computation of the entire application increases only by 11%, the
throughput of the application is reduced by more than 30% due to this load
imbalance, since the neighboring partitions are idle waiting for the
overloaded partition. The load balancing framework of Charisma is activated
every 20 iterations, which detects this load imbalance and moves some
partitions from the heavily loaded processors to the lightly loaded processors,
bringing the throughput to the desired levels.