Synchronization Strategies for POSE
===================================

cons (Basic Conservative): relies on GVT for advancement; no lookahead,
    no bells, no whistles [NOT MAINTAINED]

opt (Basic Optimistic): performs events in best-first order on the
    processor, one at a time

opt2 (Optimized Optimistic): same as opt, but when an object gets
    control, it will execute all events that have the same lowest
    timestamp

opt3 (Constrained Opt2): same as opt2, but will not execute events
    beyond a fixed speculative window

spec (Speculative Window): no attempt made at best-first ordering; when
    an object gets control, it will execute all events within the fixed
    speculative window

adapt (Basic Adaptive): same as spec, only the size of the speculative
    window changes based on rollback behavior: it starts at a minimum
    size and expands by one for every successful set of forward
    executions (up to a maximum size); when a rollback occurs, the
    speculative window size is reset to the minimum

adapt2 (Experimental Adaptive): changes to regular adapt reflecting
    what I'm currently experimenting with...

adapt5 (added by Ryan Mokos in 2009-2010; this documentation added in
    July 2011)

The following are excerpts from a rough draft of the 2010 NSF annual
report that discuss the algorithmic choices available in adapt5.

In BigNetSim, calculations are performed by each parallel object, the
objects communicate with each other and synchronize their work, and
then the virtual time of the simulation is moved forward. This
iterative process continues until the simulation is complete. Because
BigNetSim is an *optimistic* parallel simulator, the parallel objects
usually perform speculative execution when doing their calculations
(executing events that will occur at a future virtual time) to
increase grain size and thus execution efficiency (more work done for
each synchronization operation). If this speculative work is correct,
the object is done for that iteration. If not, the object has to undo
its calculations and perform them again, resulting in longer execution
times. Therefore, a balance must be struck: we want to perform the
maximum amount of speculative execution while keeping the amount of
work that must be undone to a minimum. To accomplish this, a virtual
time leash (or speculative window) is used to control how much
speculative work is allowed to occur.

Analysis of the algorithm previously in use revealed that the time
leash was always being set to 1, leading to little speculative work
and many synchronization iterations. We thus decided to develop a new
algorithm based on older algorithmic ideas and further analysis of the
problem.

BigNetSim simulates various components of a network, including
processing cores (procs), nodes, switches, and channels, all of which
are represented by unique parallel objects. Each proc executes a
sequence of events (each of which represents a block of computation
time), some of which send messages to other procs. These messages are
first packetized and then sent through the network of switches and
channels. This modeling scheme leads to greatly varying behavior among
the different types of objects. For example, consider Table 1, which
shows the differences in virtual time between consecutive events on
two different objects at the start of a MILC simulation: a proc and a
switch. The difference between the proc's first and second events is
0 ns, the difference between the second and third events is
187,991,750 ns, and so on.
           Proc        Switch
              0             0
      187991750     187991376
       12012626             1
           4693            11
       32935783      12012619
       51839515             1
       14169597             6
        2825424          4686
           1573             1
       15261816             6
        5991642      32935771
       11975431             1
           3771            11
       45977764      51839503
         654698             1
         420133            11
          17902      14169590
         958661             1
        6966380             6
           6392       2825412
        1208322             1
          21457            11

Table 1: Difference in virtual time (ns) between consecutive events on
a proc and a switch at the beginning of a MILC simulation

As one might expect, the proc's events occur at irregular intervals
determined by the timing of the computational blocks, which are
contained in its trace file. The switch's events are much more regular
and periodic, with three events very close in virtual time followed by
a large gap (in this part of the simulation). This "burst" behavior is
indicative of packets being sent through the switch, each of which has
a few events associated with it: receive, process/route, and send.
While these patterns can become more complicated later in the
simulation--especially during periods of network contention and if the
periodicity changes--the same basic structure seems to remain.

Additionally, we attempted to determine how the number of
synchronization iterations and the number of rollbacks (events that
are undone) affect the simulation execution time by varying a
hard-coded time leash value (some results are shown in Table 2). In
the end, only a loose correlation was found. However, the study did
suggest that some rollbacks--on the order of 2-3 per iteration on
average--were not only acceptable but desirable. Severely restricting
the time leash to greatly discourage rollbacks led to so many extra
iterations that the extra time spent in those iterations overshadowed
the time saved by having fewer rollbacks. Thus, we decided to restrict
the number of rollbacks allowed, but not attempt to eliminate them.

Time leash    Iterations    Rollbacks    Rollbacks/Iter    Exec. Time (sec)
       500     1,086,151      220,436         0.2               113.6
     1,000       626,029      135,299         0.22               93.3
     2,500       283,553       90,938         0.32               76.9
     5,000       150,975      102,857         0.68               71.8
     7,500       103,195      108,164         1.05               69
    10,000        79,021      112,325         1.42               68.4
    12,500        64,379      123,748         1.92               70.2
    15,000        54,529      129,032         2.37               68.8
    17,500        47,577      141,140         2.97               72.3
    20,000        42,427      160,732         3.79               71.7
    22,500        38,533      190,073         4.93               73.8
    25,000        35,439      220,217         6.21               72.8
    27,500        32,975      265,164         8.04               73.1

Table 2: Results of running a 32-VP MILC simulation with different
hard-coded time leashes

Examination of these patterns and previous synchronization algorithms
led to several attempts at improving performance. All attempts use at
least one of the following quantities to set the time leash value (a
rough sketch of how these statistics might be maintained follows the
list):

A. Recent_Average_Rollback_Leash - the average value to which the time
   leash was set the last few times a rollback occurred

B. Recent_Average_Event_Sparsity - the number of virtual time units
   per event over the last several events

C. Average_Rollbacks_Per_Iteration - the total number of rollbacks
   divided by the total number of iterations so far in the simulation

D. Average_Events_Per_Rollback - the total number of events divided by
   the total number of rollbacks so far in the simulation

E. Average_of_3rd_and_4th_Differences - the most recent 16 virtual
   time differences (like those shown in Table 1) are recorded for the
   object; the highest two values are ignored, and the next two
   highest values are averaged to set the time leash
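The following is a rough, hypothetical C++ sketch of how a per-object
strategy might maintain quantities A-D (E is sketched separately after
Attempt 4 below); the class and method names are illustrative
assumptions and are not taken from the adapt5 source.

    // Hypothetical illustration only -- not the actual adapt5 code.
    // One way the per-object statistics A-D could be maintained as
    // events execute and rollbacks occur.
    #include <deque>
    #include <numeric>

    class LeashStats {
    public:
      // Called after each executed event with the virtual time
      // difference from the previous event on this object.
      void recordEvent(double timeDiff) {
        ++totalEvents;
        recentDiffs.push_back(timeDiff);
        if (recentDiffs.size() > WINDOW) recentDiffs.pop_front();
      }

      // Called whenever a rollback occurs, with the leash value that
      // was set when the rollback happened.
      void recordRollback(double leashAtRollback) {
        ++totalRollbacks;
        recentRollbackLeashes.push_back(leashAtRollback);
        if (recentRollbackLeashes.size() > WINDOW)
          recentRollbackLeashes.pop_front();
      }

      // Called once per GVT/synchronization iteration.
      void recordIteration() { ++totalIterations; }

      // A: average leash value at the last few rollbacks
      double recentAverageRollbackLeash() const {
        return average(recentRollbackLeashes);
      }
      // B: virtual time units per event over the last several events
      double recentAverageEventSparsity() const {
        return average(recentDiffs);
      }
      // C: total rollbacks / total iterations
      double averageRollbacksPerIteration() const {
        return totalIterations ? (double)totalRollbacks / totalIterations
                               : 0.0;
      }
      // D: total events / total rollbacks
      double averageEventsPerRollback() const {
        return totalRollbacks ? (double)totalEvents / totalRollbacks : 0.0;
      }

    private:
      static const size_t WINDOW = 16;  // "last few" -- size is a guess
      static double average(const std::deque<double>& v) {
        return v.empty()
                   ? 0.0
                   : std::accumulate(v.begin(), v.end(), 0.0) / v.size();
      }
      std::deque<double> recentDiffs, recentRollbackLeashes;
      unsigned long totalEvents = 0, totalRollbacks = 0,
                    totalIterations = 0;
    };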
These are the four attempts made so far:

Attempt 1:

This algorithm simply uses the formula

    Recent_Average_Event_Sparsity * Average_Events_Per_Rollback

in an attempt to set the time leash to the average number of time
units per rollback (i.e., set the leash so that it just reaches the
next rollback, on average).

Attempt 2:

This works as follows:

    Every time an event is executed:
        Add Recent_Average_Event_Sparsity to the time leash
        If Average_Rollbacks_Per_Iteration > a set constant (around 2)
            Constrain the leash to
                (a_multiplier_constant * Recent_Average_Event_Sparsity *
                 Average_Events_Per_Rollback) /
                Average_Rollbacks_Per_Iteration
        Else
            Constrain the leash to a reasonably high number, like 1000

The idea behind this algorithm is to increase the time leash by the
average time between events (extending the leash to include the next
event) and then constrain it: if too many rollbacks have occurred so
far (i.e., more than about 2 per iteration), the leash is constrained
to a value that should avoid rollbacks (a fraction of the average time
between rollbacks); if the number of rollbacks per iteration is low,
the leash is given more room (constrained only to 1000).

Attempt 3:

This works as follows (a rough sketch of this rule appears at the end
of this document):

    If a rollback occurs
        If leash > Recent_Average_Rollback_Leash
            Set leash = Recent_Average_Rollback_Leash
        Else
            Set leash = Recent_Average_Rollback_Leash / 2
    Else (no rollback)
        Add Recent_Average_Event_Sparsity to the time leash

    If Average_Rollbacks_Per_Iteration > a set constant (around 2)
        Constrain the leash to Recent_Average_Rollback_Leash / 2
    Else
        Constrain the leash to
            a_multiplier_constant * Recent_Average_Event_Sparsity *
            Average_Events_Per_Rollback

When a rollback occurs, this algorithm first sets the time leash to
(at most) the average leash value at recent rollbacks (i.e., it tries
to set the leash so that it includes no more than one rollback); when
there is no rollback, it increases the time leash by the average time
between events (extending the leash to include the next event). The
algorithm then constrains the leash: if too many rollbacks have
occurred so far (i.e., more than about 2 per iteration), the leash is
limited to half the average leash value at recent rollbacks (to try to
avoid another rollback); if the number of rollbacks per iteration is
low, the leash is given more room (constrained to several times the
average number of time units per rollback).

Attempt 4:

This simply uses Average_of_3rd_and_4th_Differences to set the leash
value (a sketch of this heuristic also appears at the end of this
document). A number of variations on this method are also being
examined, such as using a different number of virtual time
differences, averaging different differences, and not averaging at
all. The basic idea behind this algorithm is that we'd like to execute
an entire event "burst" (e.g., the handling of a packet) in each
iteration. This algorithm attempts to find a time leash value that
will include all events between the peak values shown in the Switch
column of Table 1.

All of these new algorithms are currently undergoing testing and
tuning, such as adding or changing multiplicative constants in some of
the formulas, to determine which works best across multiple
applications.
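As an illustration of the Attempt 3 update rule (building on the
hypothetical LeashStats sketch above), the logic might look roughly
like the following; the constant values and function names are
assumptions, not the actual adapt5 implementation.

    // Hypothetical illustration of the Attempt 3 rule -- not the
    // actual adapt5 code.  Uses the LeashStats sketch from above.
    #include <algorithm>

    double updateLeashAttempt3(double leash, bool rolledBack,
                               const LeashStats& stats,
                               double rollbacksPerIterLimit = 2.0,
                               double multiplier = 4.0 /* a guess */) {
      if (rolledBack) {
        // Aim for a leash that includes no more than one rollback.
        double avgRollbackLeash = stats.recentAverageRollbackLeash();
        leash = (leash > avgRollbackLeash) ? avgRollbackLeash
                                           : avgRollbackLeash / 2.0;
      } else {
        // Extend the leash to include (on average) the next event.
        leash += stats.recentAverageEventSparsity();
      }

      // Constrain the leash depending on how rollback-heavy the
      // simulation has been so far.
      double cap;
      if (stats.averageRollbacksPerIteration() > rollbacksPerIterLimit)
        cap = stats.recentAverageRollbackLeash() / 2.0;
      else
        cap = multiplier * stats.recentAverageEventSparsity() *
              stats.averageEventsPerRollback();
      return std::min(leash, cap);
    }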
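Similarly, here is a minimal sketch of the
Average_of_3rd_and_4th_Differences heuristic behind Attempt 4; the
window size of 16 and the "ignore the two largest, average the next
two" rule come from item E above, while the class and method names are
assumptions.

    // Hypothetical illustration of Attempt 4's heuristic -- not the
    // actual adapt5 code.  Keeps the most recent 16 virtual time
    // differences, ignores the two largest, and averages the next two
    // largest (the 3rd and 4th differences) to set the leash.
    #include <algorithm>
    #include <deque>
    #include <functional>
    #include <vector>

    class DifferenceWindow {
    public:
      void recordDifference(double diff) {
        diffs.push_back(diff);
        if (diffs.size() > 16) diffs.pop_front();
      }

      // Average of the 3rd- and 4th-largest recent differences, or 0
      // if fewer than 4 differences have been recorded.
      double averageOf3rdAnd4thDifferences() const {
        if (diffs.size() < 4) return 0.0;
        std::vector<double> sorted(diffs.begin(), diffs.end());
        std::sort(sorted.begin(), sorted.end(), std::greater<double>());
        return (sorted[2] + sorted[3]) / 2.0;  // skip the two largest
      }

    private:
      std::deque<double> diffs;
    };

The intent, per the description above, is a leash that is smaller than
the largest inter-burst gaps in the Switch column of Table 1 but large
enough to cover the small intra-burst differences, so that a whole
packet burst can be executed in one iteration.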