Charm++ supports Load Balancing, enabled by the fact there are a large number of chares or chare-array-elements typically available to map to existing processors, and that they can be migrated at runtime.
Many parallel applications, especially physical simulations, are iterative in nature. They may contain a series of time-steps, and/or iterative solvers that run to convergence. For such computations, typically, the heuristic principle that we call "principle of persistence" holds: the computational loads and communication patterns between objects (chares) tend to persist over time, even in dynamic applications. In such cases, recent past is a good predictor of near future. Measurement-based chare migration strategies are useful in this context. Currently these apply to chare-array elements, but they may be extended to chares in the future.
For applications without such iterative structure, or with iterative structure but without the predictability (i.e. where the principle of persistence does not apply), Charm++ supports "seed balancers" that move seeds for new chares among processors (possibly repeatedly) to achieve load balance. These strategies are currently available for both chares and chare-arrays. Seed balancers were the original load balancers provided in Charm since the late '80s. They are extremely useful for state-space search applications, and are also useful in other computations, as well as in conjunction with migration strategies.
For iterative computations when there is a correlation between iterations/steps but either it is not strong or the machine environment is not predictable (noise due to OS interrupts on small time steps, or time-shared desk-top machines), one can use a combination of the two kinds of strategies. The base-line load balancing is provided by migration strategies; But in each iteration one also spawns off work in the form of chares that can run on any processor. The seed balancer will handle such work as it arises.
In CHARM++, objects(except groups, nodegroups) can migrate from processor to processor at run-time. Object migration can potentially improve the performance of the parallel program by migrating objects from overloaded processors to underloaded ones.
CHARM++ implements a generic, measurement-based load balancing framework which automatically instruments all CHARM++ objects, collects computation load and communication structure during execution and stores them into a load balancing database. CHARM++ then provides a collection of load balancing strategies whose job is to decide on a new mapping of objects to processors based on the information from the database. Such measurement based strategies are efficient when we can reasonably assume that objects in CHARM++ application tend to exhibit temporal correlation in their computation and communication patterns, i.e. future can be to some extent predicted using the historical measurement data, allowing effective measurement-based load balancing without application-specific knowledge.
Here are the two terms often used in CHARM++ load balancing framework:
Load balancing can be performed in either a centralized, fully distributed or hierarchical (or hybrid) fashion.
In centralized approaches, the entire machine's load and communication structure are accumulated to a single point, typically processor 0, followed by a decision making process to determine the new distribution of CHARM++ objects. Centralized load balancing requires synchronization which may incur an overhead and delay. However, due to the fact that the decision process has a high degree of the knowledge about the entire machine, it tends to be more accurate.
In distributed approaches, machine states are only exchanged among neighboring processors. There is no global synchronization. However, they will not, in general, provide an immediate restoration for load balance - the process is iterated until the load balance can be achieved.
In hierarchical approaches, processors are divided into independent autonomous sets of processor groups and these groups are organized in hierarchies, therefore decentralizing the load balancing task. Hybrid strategies are used to load balancing load on processors inside each processor group, and processors across groups in a hierarchical fashion.
Listed below are some of the available nontrivial centralized load balancers and their brief descriptions:
Listed below are the distributed load balancers:
User can choose any load balancing strategy he or she thinks is good for the application. The compiler and run-time options are described in section 3.11.6.
Load balancing framework is well integrated with Chare array implementation - when a chare array is created, it automatically registers its elements with the load balancing framework. The instrumentation of compute time (wall/cpu time) and communication pattern are done automatically and APIs are provided for users to trigger the load balancing.
To use the load balancer, you must make your array elements migratable (see migration section above) and choose a load balancing strategy (see the section 3.11.2 for a description of available load balancing strategies).
We implemented three methods to use load balancing for chare arrays to meet different needs of the applications. These methods are different in how and when a load balancing phase starts. The three methods are: periodic load balancing mode, and manual mode.
In periodic load balancing mode, a user just needs to specify how often he wants the load balancing to occur, using +LBPeriod runtime option to specify a time interval.
In sync mode, users can tell load balancer explicitly when is a good time to trigger load balancing by inserting a function call in the user code.
In the above two load balancing modes, users don't need to worry about how to start load balancing. However, in one scenario, the above automatic load balancers will fail to work - array elements are created by dynamic insertion. This is because the above two load balancing modes require an application to have fixed number of objects at the time of load balancing. The array manager needs to maintain a head count of local array elements for the local barrier. In this case, users have to use the manual mode to trigger load balancer themselves. The API is described below.
The detailed APIs of these three methods are described as follows:
Note that AtSync() is not a blocking call, it just gives a hint to load balancing that it is time for load balancing. During the time between AtSync and ResumeFromSync, the object may be migrated. One can choose to let objects continue working with incoming messages, however keep in mind the object may suddenly show up in another processor and make sure no operations that could possibly prevent migration be performed. The most commonly used approach is to force the object to be idle until load balancing finishes, the object can start working again when ResumeFromSync() is called.
The function call StartLB() starts load balancing immediately. This call should be made at only one place on only one processor. This function is also not blocking, the object will continue to process messages and the load balancing when triggered happens at the background.
TurnManualLBOff() turns off manual load balancing and switches back to the automatic Load balancing mode.
Load balancers migrate objects automatically. For an array element to migrate, user can refer to section 3.8.10 for how to write a ``pup'' for an array element.
In general one needs to pack the whole snapshot of the member data in an array element in the pup subroutine. This is because the migration of the object may happen at any time. In certain load balancing scheme where user explicitly control when the load balancing happens, user may choose to pack only a part of the data and may skip those temporary data.
There are several utility functions that can be called in applications to configure the load balancer, etc. These functions are:
Load balancing strategies are implemented as libraries in CHARM++. This allows programmers to easily experiment with different existing strategies by simply linking a pool of strategy modules and choosing one to use at run-time via a command line option.
Please note that linking a load balancing module is different from activating it:
Below are the descriptions about the compiler and run-time options:
The list of existing load balancers are in section 3.11.2. Note: you can have multiple -module *LB options. LB modules are linked into a program, but they are not activated automatically at runtime. Using -balancer at compile time in order to activate load balancers automatically at run time. Having -balancer A implies -module A, so you don't have to write -module A again, although it does not hurt. Using EveryLB is a convenient way to link against all existing load balancers. One of the load balancers called MetisLB requires the METIS library which is located at: charm/src/libs/ck-libs/parmetis/METISLib/. You need to compile METIS library by "make METIS" under charm/tmp after you compile Charm++.
Run-time options are doing the same thing as compile time options, but they can override the compile time options.
Note: +balancer option works only if you have already linked the load balancers module at compile time. Giving +balancer with a wrong LB name will result in a runtime error. When you have used -balancer A as compile time option, you don't need to use +balancer A again to activate it at runtime. However, you can use +balancer B to override the compile time option and choose to activate B instead of A.
When you don't activate any of the load balancers at compile time or run time, and your program counts on a load balancer because you use AtSync() and expect ResumeFromSync() to be called to continue, be assured that your program can still run. A special load balancer called NullLB is automatically created in this case which just calls ResumeFromSync() after AtSync(). This default load balancer keeps a program from hanging after calling AtSync(). The NullLB is smart enough to keep silent if another load balancer is created.
There are a few other run-time options for load balancing that may be useful:
The simulation feature of load balancing framework allows the users to collect information about the compute wall/cpu time and communication of the chares during a particular run of the program and use this information to later test different load balancing strategies to see which one is suitable for the programs behaviour. Currently, this feature is supported only for the centralized load balancing strategies. For this, the load balancing framework accepts the following command line options:
./charmrun pgm +balancer <Strategy to test> +LBSim 2 +LBSimSteps 4
+LBDumpFile dump.dat [+LBSimProcs 900]
When objects do not follow the assumption that the future workload will be the same as the past, the load balancer might not have the correct information to do a correct rebalancing job. To prevent this the user can provide a transition function to the load balancer to predict what will be the future workload, given the past, instrumented one. As said, the user might provide a specific class which inherits from LBPredictorFunction and implement the appropriate functions. Here is the abstract class:
class LBPredictorFunction {
public:
int num_params;
virtual void initialize_params(double *x);
virtual double predict(double x, double *params) =0;
virtual void print(double *params) {PredictorPrintf("LB: unknown model\n");};
virtual void function(double x, double *param, double &y, double *dyda) =0;
};
double predict(double x, double *param) {return (param[0]*x + param[1]);}
void function(double x, double *param, double &y, double *dyda) {
y = predict(x, param);
dyda[0] = x;
dyda[1] = 1;
}
If the model behind the computation is not known, the user can leave the system to use a predefined default function.
As seen, the function can have several parameters which will be learned during the execution of the program. For this, two parameters can be setup at command line to specify the learning behaviour:
Seed load balancing involves the movement of object creation messages, or "seeds", to create a balance of work across a set of processors. This load balancing scheme is used for load balancing chares only at creation time. When the chare is created on a processor, there is no movement of the chare due to the seed load balancer. The measurement based load balancer described in previous subsection perform the task of moving chares during work to achieve load balance.
Several variations of strategies have been designed and analyzed.
Other strategies can also be explored follow the simple API of the
seed load balancer.
Seed load balancers for Chares:
Seed load balancers can be directly used for load balancing Chares. The default seed load balancer which is always linked is the random seed load balancer. Users can choose another strategy listed above and link as a plugin module into binary as described below.
Seed load balancers for Array Elements:
Seed load balancers can also be used for array elements in the same way as they are used for individual chares. Chare array is a collection of individual Chares in Charm++. Since Chare Array has its internal strategy of static mapping of individual array elements to processors using CkArrayMap 3.8.4 19, a special CkArrayMap called CldMap must be created and passed into array creation calls to interface with seed load balancer.
For creating an empty array and then inserting chares into it, the API is as follows:
For initially populating the array with chares at time of creation the API is as follows:
The details about array creation are explained in section 3.8 of the manual.
Compile and run time options for seed load balancers
To choose a seed load balancer other than the default rand strategy, use link time command line option -balance foo.
When using neighbor seed load balancer, one can also specify the virtual topology at runtime. Use +LBTopo topo, where topo can be one of: (a) ring, (b) mesh2d, (c) mesh3d and (d) graph.
To write a seed load balancer, name your file as cldb.foo.c, where foo is the strategy name. Compile it in the form of library under charm/lib, named as libcldb-foo.a, where foo is the strategy name used above. Now one can use -balance foo as compile time option to charmc to link with the foo seed load balancer.
A simple example of how to use a load balancer in sync mode in one's application is presented below.
/*** lbexample.ci ***/
mainmodule lbexample {
readonly CProxy_Main mainProxy;
readonly int nElements;
mainchare Main {
entry Main(CkArgMsg *m);
entry void done(void);
};
array [1D] LBExample {
entry LBExample(void);
entry void doWork();
};
};
-----------------------------------------------------
/*** lbexample.C ***/
#include <stdio.h>
#include "lbexample.decl.h"
/*readonly*/ CProxy_Main mainProxy;
/*readonly*/ int nElements;
#define MAX_WORK_CNT 50
#define LB_INTERVAL 5
/*mainchare*/
class Main : public Chare
{
private:
int count;
public:
Main(CkArgMsg* m)
{
/*....Initialization....*/
mainProxy = thishandle;
CProxy_LBExample arr = CProxy_LBExample::ckNew(nElements);
arr.doWork();
};
void done(void)
{
count++;
if(count==nElements){
CkPrintf("All done\n");
CkExit();
}
};
};
/*array [1D]*/
class LBExample : public CBase_LBExample
{
private:
int workcnt;
public:
LBExample()
{
workcnt=0;
/* May initialize some variables to be used in doWork */
//Must be set to CmiTrue to make AtSync work
usesAtSync=CmiTrue;
}
LBExample(CkMigrateMessage *m) { /* Migration constructor -- invoked when chare migrates */ }
/* Must be written for migration to succeed */
void pup(PUP::er &p){
CBase_LBExample::pup(p);
p|workcnt;
/* There may be some more variables used in doWork */
}
void doWork()
{
/* Do work proportional to the chare index to see the effects of LB */
workcnt++;
if(workcnt==MAX_WORK_CNT)
mainProxy.done();
if(workcnt%LB_INTERVAL==0)
AtSync();
else
doWork();
}
void ResumeFromSync(){
doWork();
}
};
#include "lbexample.def.h"
November 23, 2009
Charm Homepage