Subsections

3.11 Load Balancing

Charm++ supports Load Balancing, enabled by the fact there are a large number of chares or chare-array-elements typically available to map to existing processors, and that they can be migrated at runtime.

Many parallel applications, especially physical simulations, are iterative in nature. They may contain a series of time-steps, and/or iterative solvers that run to convergence. For such computations, typically, the heuristic principle that we call "principle of persistence" holds: the computational loads and communication patterns between objects (chares) tend to persist over time, even in dynamic applications. In such cases, recent past is a good predictor of near future. Measurement-based chare migration strategies are useful in this context. Currently these apply to chare-array elements, but they may be extended to chares in the future.

For applications without such iterative structure, or with iterative structure but without the predictability (i.e. where the principle of persistence does not apply), Charm++ supports "seed balancers" that move seeds for new chares among processors (possibly repeatedly) to achieve load balance. These strategies are currently available for both chares and chare-arrays. Seed balancers were the original load balancers provided in Charm since the late '80s. They are extremely useful for state-space search applications, and are also useful in other computations, as well as in conjunction with migration strategies.

For iterative computations when there is a correlation between iterations/steps but either it is not strong or the machine environment is not predictable (noise due to OS interrupts on small time steps, or time-shared desk-top machines), one can use a combination of the two kinds of strategies. The base-line load balancing is provided by migration strategies; But in each iteration one also spawns off work in the form of chares that can run on any processor. The seed balancer will handle such work as it arises.

3.11.1 Measurement-based Object Migration Strategies

In CHARM++, objects(except groups, nodegroups) can migrate from processor to processor at run-time. Object migration can potentially improve the performance of the parallel program by migrating objects from overloaded processors to underloaded ones.

CHARM++ implements a generic, measurement-based load balancing framework which automatically instruments all CHARM++ objects, collects computation load and communication structure during execution and stores them into a load balancing database. CHARM++ then provides a collection of load balancing strategies whose job is to decide on a new mapping of objects to processors based on the information from the database. Such measurement based strategies are efficient when we can reasonably assume that objects in CHARM++ application tend to exhibit temporal correlation in their computation and communication patterns, i.e. future can be to some extent predicted using the historical measurement data, allowing effective measurement-based load balancing without application-specific knowledge.

Here are the two terms often used in CHARM++ load balancing framework:

3.11.2 Available Load Balancing Strategies

Load balancing can be performed in either a centralized, fully distributed or hierarchical (or hybrid) fashion.

In centralized approaches, the entire machine's load and communication structure are accumulated to a single point, typically processor 0, followed by a decision making process to determine the new distribution of CHARM++ objects. Centralized load balancing requires synchronization which may incur an overhead and delay. However, due to the fact that the decision process has a high degree of the knowledge about the entire machine, it tends to be more accurate.

In distributed approaches, machine states are only exchanged among neighboring processors. There is no global synchronization. However, they will not, in general, provide an immediate restoration for load balance - the process is iterated until the load balance can be achieved.

In hierarchical approaches, processors are divided into independent autonomous sets of processor groups and these groups are organized in hierarchies, therefore decentralizing the load balancing task. Hybrid strategies are used to load balancing load on processors inside each processor group, and processors across groups in a hierarchical fashion.

Listed below are some of the available nontrivial centralized load balancers and their brief descriptions:

Listed below are the distributed load balancers:

User can choose any load balancing strategy he or she thinks is good for the application. The compiler and run-time options are described in section 3.11.6.

3.11.3 Load Balancing Chare Arrays

Load balancing framework is well integrated with Chare array implementation - when a chare array is created, it automatically registers its elements with the load balancing framework. The instrumentation of compute time (wall/cpu time) and communication pattern are done automatically and APIs are provided for users to trigger the load balancing.

To use the load balancer, you must make your array elements migratable (see migration section above) and choose a load balancing strategy (see the section 3.11.2 for a description of available load balancing strategies).

We implemented three methods to use load balancing for chare arrays to meet different needs of the applications. These methods are different in how and when a load balancing phase starts. The three methods are: periodic load balancing mode, and manual mode.

In periodic load balancing mode, a user just needs to specify how often he wants the load balancing to occur, using +LBPeriod runtime option to specify a time interval.

In sync mode, users can tell load balancer explicitly when is a good time to trigger load balancing by inserting a function call in the user code.

In the above two load balancing modes, users don't need to worry about how to start load balancing. However, in one scenario, the above automatic load balancers will fail to work - array elements are created by dynamic insertion. This is because the above two load balancing modes require an application to have fixed number of objects at the time of load balancing. The array manager needs to maintain a head count of local array elements for the local barrier. In this case, users have to use the manual mode to trigger load balancer themselves. The API is described below.

The detailed APIs of these three methods are described as follows:

  1. periodical load balancing mode: By default, the array elements may be asked to migrate at any time provided that they are not in the middle of executing an entry method. The array element's variable usesAtSync being CmiFalse attributes to this default behavior. In the default setting, load balancing happens whenever the array elements are ready with interval of 1 second. It is desirable for the application to set a larger interval using +LBPeriod runtime option. For example "+LBPeriod 5" to start load balancing roughly every 5 seconds.

  2. automatic with Sync: Using the AtSync method, elements can only be migrated at certain points in the execution when user calls AtSync(). For the AtSync method, set usesAtSync to CmiTrue in your array element constructor. When an element is ready to migrate, call AtSync() 17. When all local elements call AtSync, the load balancer is triggered. Once all migrations are completed, the load balancer calls the virtual function ArrayElement::ResumeFromSync() on each of the array elements.

    Note that AtSync() is not a blocking call, it just gives a hint to load balancing that it is time for load balancing. During the time between AtSync and ResumeFromSync, the object may be migrated. One can choose to let objects continue working with incoming messages, however keep in mind the object may suddenly show up in another processor and make sure no operations that could possibly prevent migration be performed. The most commonly used approach is to force the object to be idle until load balancing finishes, the object can start working again when ResumeFromSync() is called.

  3. manual mode: The load balancer can be programmed to be started manually. To switch to the manual mode, you should call TurnManualLBOn() on every processor to prevent load balancer from starting automatically. TurnManualLBOn() should be called as early as possible in the program. It could be called at the initialization part of the program, for example from a global variable constructor, or in an initcall 3.18. It can also be called in the constructor of a static array and definitely before the doneInserting call for a dynamic array. It can be called multiple times on one processor, but only the last one takes effect.

    The function call StartLB() starts load balancing immediately. This call should be made at only one place on only one processor. This function is also not blocking, the object will continue to process messages and the load balancing when triggered happens at the background.

    TurnManualLBOff() turns off manual load balancing and switches back to the automatic Load balancing mode.

3.11.4 Migrating objects

Load balancers migrate objects automatically. For an array element to migrate, user can refer to section 3.8.10 for how to write a ``pup'' for an array element.

In general one needs to pack the whole snapshot of the member data in an array element in the pup subroutine. This is because the migration of the object may happen at any time. In certain load balancing scheme where user explicitly control when the load balancing happens, user may choose to pack only a part of the data and may skip those temporary data.

3.11.5 Other utility functions

There are several utility functions that can be called in applications to configure the load balancer, etc. These functions are:

3.11.6 Compiler and run-time options to use load balancing module

Load balancing strategies are implemented as libraries in CHARM++. This allows programmers to easily experiment with different existing strategies by simply linking a pool of strategy modules and choosing one to use at run-time via a command line option.

Please note that linking a load balancing module is different from activating it:

Below are the descriptions about the compiler and run-time options:

  1. compile time options:

    The list of existing load balancers are in section 3.11.2. Note: you can have multiple -module *LB options. LB modules are linked into a program, but they are not activated automatically at runtime. Using -balancer at compile time in order to activate load balancers automatically at run time. Having -balancer A implies -module A, so you don't have to write -module A again, although it does not hurt. Using EveryLB is a convenient way to link against all existing load balancers. One of the load balancers called MetisLB requires the METIS library which is located at: charm/src/libs/ck-libs/parmetis/METISLib/. You need to compile METIS library by "make METIS" under charm/tmp after you compile Charm++.

  2. run-time options:

    Run-time options are doing the same thing as compile time options, but they can override the compile time options.

    Note: +balancer option works only if you have already linked the load balancers module at compile time. Giving +balancer with a wrong LB name will result in a runtime error. When you have used -balancer A as compile time option, you don't need to use +balancer A again to activate it at runtime. However, you can use +balancer B to override the compile time option and choose to activate B instead of A.

  3. When there is no load balancer activated

    When you don't activate any of the load balancers at compile time or run time, and your program counts on a load balancer because you use AtSync() and expect ResumeFromSync() to be called to continue, be assured that your program can still run. A special load balancer called NullLB is automatically created in this case which just calls ResumeFromSync() after AtSync(). This default load balancer keeps a program from hanging after calling AtSync(). The NullLB is smart enough to keep silent if another load balancer is created.

  4. Other useful run-time options

    There are a few other run-time options for load balancing that may be useful:

3.11.7 Load Balancing Simulation

The simulation feature of load balancing framework allows the users to collect information about the compute wall/cpu time and communication of the chares during a particular run of the program and use this information to later test different load balancing strategies to see which one is suitable for the programs behaviour. Currently, this feature is supported only for the centralized load balancing strategies. For this, the load balancing framework accepts the following command line options:

  1. +LBDump StepStart
    This will dump the instrument/communication data collected by the load balancing framework starting from the load balancing step StepStart into a file on the disk. The name of the file is given by the +LBDumpFile option. The first step in the program is number 0. Negative numbers will be converted to 0.
  2. +LBDumpSteps StepsNo
    This option specifies the number of steps for which data will be dumped to disk. If omitted, default value is 1. The program will exit after StepsNo files are dumped.
  3. +LBDumpFile FileName
    This option specifies the base name of the file into which the load balancing data is dumped. If this option is not specified, the framework uses the default file lbdata.dat. Since multiple steps are allowed, a number is appended to the filename in the form Filename.#; this applies to both dump and simulation.
  4. +LBSim StepStart
    This option instructs the framework to do the simulation during the first load balancing step. When this option is specified, the load balancing data from the file specified in the +LBDumpFile option, with the addition of the step number, will be read and this data will be used for the load balancing. The program will print the results of the balancing for a number of steps given by the +LBSimSteps option, and then will exit.
  5. +LBSimSteps StepsNo
    This option has the same meaning of +LBDumpSteps, except that apply for the simulation mode. Default value is 1.
  6. +LBSimProcs
    This option may change the number of processors target of the load balancer strategy. It may be used to test the load balancer in conditions where some processor crashes or someone becomes available. If this number is not changed since the original run, starting from the second step file the program will print other additional information about how the simulated load differs from the real load during the run (considering all strategies that were applied while running). This may be used to test the validity of a load balancer prediction over the reality. If the strategies used during run and simulation differ, the additional data printed may not be useful.
As an example, we can collect the data for a 1000 processor run of a program using:
./charmrun pgm +p 1000 +balancer RandCentLB +LBDump 2 +LBDumpSteps 4 +LBDumpFile dump.dat
This will collect data on files data.dat.2,3,4,5. Then, we can use this data to observe various centralized strategies using:
./charmrun pgm +balancer <Strategy to test> +LBSim 2 +LBSimSteps 4 
               +LBDumpFile dump.dat [+LBSimProcs 900]

3.11.8 Future load predictor

When objects do not follow the assumption that the future workload will be the same as the past, the load balancer might not have the correct information to do a correct rebalancing job. To prevent this the user can provide a transition function to the load balancer to predict what will be the future workload, given the past, instrumented one. As said, the user might provide a specific class which inherits from LBPredictorFunction and implement the appropriate functions. Here is the abstract class:

class LBPredictorFunction {
public:
  int num_params;
 
  virtual void initialize_params(double *x);

  virtual double predict(double x, double *params) =0;
  virtual void print(double *params) {PredictorPrintf("LB: unknown model\n");};
  virtual void function(double x, double *param, double &y, double *dyda) =0;
};
Other than these function, the user should provide a constructor which must initialize num_params to the number of parameters the model has to learn. This number is the dimension of param and dyda in the previous functions. For the given example, the constructor is {num_params = 2;}.

If the model behind the computation is not known, the user can leave the system to use a predefined default function.

As seen, the function can have several parameters which will be learned during the execution of the program. For this, two parameters can be setup at command line to specify the learning behaviour:

  1. +LBPredictorWindow size
    This parameter will specify how many statistics the load balancer will keep. The greater this number is, the better the approximation of the workload will be, but more memory is required to store the intermediate information. The default is 20.
  2. +LBPredictorDelay steps
    This will tell how many load balancer steps to wait before considering the function parameters learnt and start using the mode. The load balancer will collect statistics for a +LBPredictorWindow steps, but it will start using the model as soon as +LBPredictorDelay information are collected. The default is 10.
Moreover another flag can be set to enable the predictor from command line: +LBPredictor.
Other than the command line options, there are some methods callable from user program to modify the predictor. These methods are:

3.11.9 Seed load balancers - load balancing Chares at creation time

Seed load balancing involves the movement of object creation messages, or "seeds", to create a balance of work across a set of processors. This load balancing scheme is used for load balancing chares only at creation time. When the chare is created on a processor, there is no movement of the chare due to the seed load balancer. The measurement based load balancer described in previous subsection perform the task of moving chares during work to achieve load balance.

Several variations of strategies have been designed and analyzed.

  1. random
    A strategy that places seeds randomly when they are created and does no movement of seeds thereafter. This is used as the default seed load balancer.
  2. neighbor
    a strategy which imposes a virtual topology on the processors, load exchange happens to neighbors only. The overloaded processors initiate the load balancing, where a processor sends work to its neighbors when it becomes overloaded. The default topology is mesh2D, one can use command line option to choose other topology such as ring, mesh3D and dense graph.
  3. spray
    a strategy which imposes a spanning tree organization on the processors, results in communication via global reduction among all processors to compute global average load via periodic reduction. It uses averaging of loads to determine how seeds should be distributed.

Other strategies can also be explored follow the simple API of the seed load balancer.

Seed load balancers for Chares:

Seed load balancers can be directly used for load balancing Chares. The default seed load balancer which is always linked is the random seed load balancer. Users can choose another strategy listed above and link as a plugin module into binary as described below.

Seed load balancers for Array Elements:

Seed load balancers can also be used for array elements in the same way as they are used for individual chares. Chare array is a collection of individual Chares in Charm++. Since Chare Array has its internal strategy of static mapping of individual array elements to processors using CkArrayMap 3.8.4 19, a special CkArrayMap called CldMap must be created and passed into array creation calls to interface with seed load balancer.

For creating an empty array and then inserting chares into it, the API is as follows:

  CkArrayOptions opt;
  CkGroupID cldmapID = CProxy_CldMap::ckNew();
  opt.setMap(cldmapID);
  CProxy_WorkUnit arr = CProxy_WorkUnit::ckNew(param, opt);
  for (int i=0; i<numChares; i++)
    arr[i].insert(param);

For initially populating the array with chares at time of creation the API is as follows:

  CkArrayOptions opt(numChares);
  CkGroupID cldmapID = CProxy_CldMap::ckNew();
  opt.setMap(cldmapID);
  CProxy_WorkUnit arr = CProxy_WorkUnit::ckNew(param, opt);

The details about array creation are explained in section 3.8 of the manual.

Compile and run time options for seed load balancers

To choose a seed load balancer other than the default rand strategy, use link time command line option -balance foo.

When using neighbor seed load balancer, one can also specify the virtual topology at runtime. Use +LBTopo topo, where topo can be one of: (a) ring, (b) mesh2d, (c) mesh3d and (d) graph.

To write a seed load balancer, name your file as cldb.foo.c, where foo is the strategy name. Compile it in the form of library under charm/lib, named as libcldb-foo.a, where foo is the strategy name used above. Now one can use -balance foo as compile time option to charmc to link with the foo seed load balancer.

3.11.10 Simple Load Balancer Usage Example - Automatic with Sync LB

A simple example of how to use a load balancer in sync mode in one's application is presented below.

/*** lbexample.ci ***/
mainmodule lbexample {
  readonly CProxy_Main mainProxy;
  readonly int nElements;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry void done(void);
  };

  array [1D] LBExample {
    entry LBExample(void);
    entry void doWork();
  };
};
-----------------------------------------------------
/*** lbexample.C ***/
#include <stdio.h>
#include "lbexample.decl.h"

/*readonly*/ CProxy_Main mainProxy;
/*readonly*/ int nElements;

#define MAX_WORK_CNT 50
#define LB_INTERVAL 5

/*mainchare*/
class Main : public Chare
{
private:
  int count;
public:
  Main(CkArgMsg* m)
  {
    /*....Initialization....*/
    mainProxy = thishandle;
    CProxy_LBExample arr = CProxy_LBExample::ckNew(nElements);
    arr.doWork();
  };

  void done(void)
  {
    count++;
    if(count==nElements){
      CkPrintf("All done\n");
      CkExit();
    }
  };
};

/*array [1D]*/
class LBExample : public CBase_LBExample
{
private:
  int workcnt;
public:
  LBExample()
  {
    workcnt=0;
    /* May initialize some variables to be used in doWork */
    //Must be set to CmiTrue to make AtSync work
    usesAtSync=CmiTrue;
  }

  LBExample(CkMigrateMessage *m) { /* Migration constructor -- invoked when chare migrates */ }
  
  /* Must be written for migration to succeed */
  void pup(PUP::er &p){
    CBase_LBExample::pup(p);
    p|workcnt;
    /* There may be some more variables used in doWork */
  }
	
  void doWork()
  {
    /* Do work proportional to the chare index to see the effects of LB */
    
    workcnt++;
    if(workcnt==MAX_WORK_CNT)
      mainProxy.done();
    
    if(workcnt%LB_INTERVAL==0)
      AtSync();
    else
      doWork();
  }
  
  void ResumeFromSync(){
    doWork();
  }
};

#include "lbexample.def.h"

November 23, 2009
Charm Homepage