Response to the Paper by Nikos Chrisochoides and Kevin Barker at SC'03

Summary:

The paper at SC03 compares their PREMA framework with Charm++ (and PARMETIS) for automatic loadbalancing, and concludes that PREMA does much better than Charm++ in particular, for highly dynamic computations such as crack propagation. They did the comparison using a simple benchmark that creates an imbalanced load of simple "work units".

However, the claim is not justified because (a) the charm++ periodic load balancers they used are suitable for iterative computations, when there is a significant correlation between times taken by most objects across consecutive timesteps, but the benchmark they used isn't iterative (leave alone with such a correlation); (b) when such correlations don't exist, one is supposed to use "seed balancers" in Charm++ (that have been in Charm since the early 90's). With such seed load balancers, their benchmark parallelizes effectively with Charm++.

Overall, the PREMA approach is good, and their adoption of processor virtualization goes on to validate the utility of that approach. However, the comparison with Charm++ performance in the paper, which appears to show that Charm++ load balancers have no effect on performance, is not quite accurate.


A more detailed explanation will follow.

1. Load Balancing in Charm++

(From the updated Charm++ online manual:)

Since the basic motive of Charm++ is to free programmers from having to do resource management, Charm++ provides a variety of mechanisms and strategies to automate the task of load balancing, in a variety of application contexts. Here we will introduce these methods with a series of examples.

To understand the different kinds of load balancing needs, we first note that there are two different kinds of objects being load balanced: chares, and chare-array elements. In case of AMPI, each AMPI ``process'' is also a chare-array element.

The load balancing contexts, from the point of view of Charm++ strategies, can be summarized as follows:
In all these contexts, programmers have the option of leaving everything to the system: by employing the load balancing mechanisms described below, the run-time system is at least able to use default strategies in each context.  As the research on adaptive run-time systems progresses, one hopes that this is all that one would have to do.  However, at least for now, it also important for programmers to select strategies that are most appropriate for their application.

Charm's balancers (esp. the iterative ones) come in two varieties: fully centralized ones, and distributed ones. Centralized strategies should be used when the load changes relatively slowly (so, loadbalancing after 100s of timesteps/iterations is ok), AND the number of processors is not too large. Otherwise, one should use distributed balancers in both measurement based or monitoring based balancers.

2. Barker/Chrisochoide benchmark:

This benchmark creates N objects, with P percent of them being "heavy" and others "light". The heavy objects are twice as much work as the light ones. N, P are parameters to the benchmark. To stress the load balancers, they initially assign heavy objects to the lower P percent processors. There is only one "iteration".

For Charm++ this clearly indicates using seed balancers. However, (since they were probably not aware of seed balancers), they used the periodic measurement-based balancers. The measurement based balancers run for a few iterations, measure the load presented by objects, and use that information to make periodic changes to load. Since there are no natural iterations in their benchmark, they created artificial iterations. The N objects were divided into N/I groups, and each charm++ object was assigned I of the original objects! Now they can think of Charm++ benchmark as an iterative computation, and so they used 4 points during the iterations to call Charm++'s centralized load balancer. There is no real correlation between the I objects assigned to each chare--and even if there were, the periodic balancers are quite ineffective, as one would expect. In fact they also have another graph that shows without the artificial iterations charm++ does nothing to improve the load. But this is because the periodic load balancer does not run! The only reason the periodic load balancer even slightly improves performance for this non-iterative application is because of the essentially random migration of objects.

The right way to do this benchmark is using Chares and Seed balancers. In this benchmark, we used the distributed seed load balancer with 3D torus virtual topology.

The benchmark program in Charm++ can be found here

Below we show the results obtained on 64 processors.

   Before LB 
Figure 1.  Percentage of processor utilization of the benchmark without seed load balancing.
10% of the 80,000 objects are twice as heavy as the rest. The heavier objects are initially assigned to first 10% processors.
Completion time: 99.85s

after LB
Figure 2.  Percentage of processor utilization of the same benchmark with seed load balancing turned on.
Completion time is reduced to 55.52s

3. Crack propagation itself

Since the SC03 paper doesn't actually have results comparing Charm++ and PREMA on crack propoagation application with multiple timesteps (as would be needed in the full application) we will speculate a bit about that:
  1. 1. In crack propagation (which is also one of our applications) a subset of elements get refined every timestep. Still, depending on the numerical techniques used there may be significant correlation between consecutive timesteps for most of the parallel objects (since most of them won't get refined/coarsened). So, we expect the periodic balancers of Charm++ to do ok here. But not perfectly well.
  2. 2. If the dynamic variation is significant, substep balancing, as done in our namd application, is an option. One could use either a distributed balancer, in which each processor monitors load on neighboring processors and migrates work as needed, or fire each generation of work as free-floating chares and use seed balancers as we did for the benchmark above.