The interpolation tool is part of the regular CHARM++ distribution and can be found under the directory charm/examples/bigsim/tools/rewritelog with a README file describing its use in more detail than this manual.
The user must insert startTraceBigSim() and endTraceBigSim() calls around the main computational regions in the parallel application. These two calls bracket the region of interest and print out a record for that computational region. The functions should be called at most once during any SEB. The output produced by endTraceBigSim() is a line similar to
``TRACEBIGSIM: event:{ PairCalculator::bwMultiplyHelper } time:{ 0.002586 } params:{ 16384.00 1.00 220.00 128.00 128.00 0.00 0.00 0.00 }.''
The event name and the values (in double-precision floating-point) for up to 20 parameters are specified in the call to endTraceBigSim(); the time field records the duration of the bracketed region of sequential code.
To run in a cycle-accurate simulator such as IBM's MAMBO, the startTraceBigSim() and endTraceBigSim() functions would be modified to switch between the ``fast forward'' mode used during the rest of the program and the cycle-accurate mode during the bracketed region of code. The functions are provided in C++ source files under the directory charm/examples/bigsim/tools/rewritelog/traceBigSim and their calls must be added to an application's source file manually.
To understand how the interpolation tool works, it is instructive to consider the format
of logs produced by the BigSim Emulator.
A BigSim log file (i.e. bgTrace log) contains data from emulation of the full parallel application.
There is an entry for each SEB, with the following fields: ID, Name,
,
, Back, Forward, Message ID, Source Node,
Message ID, Sent Messages. The final field is actually a list of records for each
message sent by the execution block; each record contains the following fields:
Message ID,
,
, Destination PE, Size, Group.
The interpolation tool will rewrite the durations of the SEBs by correcting the
field for the
SEB and the
fields for each message sent. The new durations of all SEBs will be based upon
some model
.
Each SEB can be decomposed into three temporal regions as shown in Figure 2.
The entire SEB is associated with execution of a Charm++ entry method, while the middle region
is the computational kernel of interest, bracketed by the user's startTraceBigSim() and endTraceBigSim() calls. The model is used only to approximate the new duration of the middle temporal region;
the durations of the beginning and ending regions are simply scaled by a constant factor.
Internally, the interpolation tool takes the ID for each SEB and looks up its associated parameters.
When those parameters are found, they are used as input for evaluation of the new duration
for the SEB. The end time is then modified to be
.
![]() |
![]() |
The messages in the message list for each SEB must also have their
times rewritten. This is accomplished by linearly mapping the old
value from to the new range for the enclosing SEB region, as shown in Figure 3. Any message sent during the first portion will be mapped linearly onto the new first portion of the SEB. The new message
times are ignored by BigSimulator, so they do not need to be modified.
In simple cases, a sufficient approximation of the performance of a parallel application can be obtained
by simply scaling the SEB durations by a constant factor.
As an example, a user may know that a desired target machine has processors that will execute
each SEB twice as fast as on an existing machine.
The application is emulated on the existing machine and the observed SEB durations are scaled
by a factor of
. Although simple, this method may be sufficient in many cases. It becomes
unnecessary to use the startTraceBigSim() and endTraceBigSim() calls.
The scaling factor is hard coded in the interpolation tool as time_dilation_factor.
It is used to scale all blocks unless a suitable advanced model has a better method for approximating the
block's duration. It will always be used to scale any portions of blocks that are not bracketed with
the calls startTraceBigSim() and endTraceBigSim().
The user can simply insert the bracketing calls startTraceBigSim() and endTraceBigSim() around the computational kernels to log the times taken for each kernel. In practice, the duration of the SEB will likely depend upon the data distribution and access patterns for the parallel application. Thus, the user must specify parameters likely to influence the SEB duration. The parameters can include variables indicating number of loop iterations, number of calls to computational kernels, or sizes of accessed portions of data arrays. A model is built to approximate the duration of any SEB based upon its specified parameters.
As an example, NAMD uses a number of different types of objects. The compute objects will spend varying amounts of time depending upon the lengths of their associated atom lists. If an atom list is large, more interactions are computed and thus more computation is performed.
Meanwhile, assume that a Charm++ entry method called doWork(atomList) is where the majority
of the work from an application occurs. The function computes forces on atoms of various types.
Different calls to the function will contain different numbers and types of atoms.
The source code for doWork(atomList) will be modified by the user to contain calls to
startTraceBigSim() at the entry and endTraceBigSim() at the exit of the function.
The program will be run, and the resulting timed samples will be used to build a model. Assume the expected runtime of doWork(atomList) is quadratic in the atomList length and linear in the number of carbon atoms in the atomList. The endTraceBigSim() call would be provided with a descriptive name and a set of parameters, such as endTraceBigSim(``doWork()'',
,
), where parameter
is the length of atomList and parameter
is the number of carbon atoms in atomList.
The goal of using a model is to be able to predict the execution time of any arbitrary call to
doWork(), given its parameters. The application can be run on an existing processor
or parallel cluster for only a few timesteps with the modified doWork() method.
This run will produce a list of {
} records.
A least squares method is applied to fit a curve
approximating the durations of the records. The least square method minimizes the sum of the
squares of the difference between the function
evaluated at each parameter set and the actual
timing observed at those parameters. The least square method is provided
for each sample point and produces the coefficients
in
.
An arbitrary set of parameters (in the current implementation, up to twenty) can be input
to
to produce an approximation of the runtime of doWork() even though the particular instance was never timed before.
In this case, a cycle accurate simulator can be used to simulate a small fraction of all SEBs for a run of the application. The partial execution is used to build a model which applies to the whole execution. Parameterizations can be used as previously described, so that only some fraction of the SEBs will be run in the expensive cycle-accurate simulator. In NAMD, for example, a sufficient model can be built from a random sample of 2% of the cycle-accurate SEB durations from four timeloop iterations.
February 12, 2012
Charm Homepage