Benutzer:Hauck/OS Scheduling Experiments/Load-balancing/Detect Dynamic Load-balancing Strategy Experiment

Detecting if Active or Lazy Load Balancing Is Used

In the following experiment, we detect when load balancing is initiated. We detect three kinds of policies:

  • Lazy load balancing: Load balancing is only done if a processor is idle
  • Active load balancing: Load balancing is done upon certain events, such as task sleep or task finish, but not immediately
  • Immediate load balancing: Load balancing is done upon certain events and immediately

Experiment Design

LazyOrActiveLoadBalancing.png
  • Assumptions:
    • With N cores, x*N parallel tasks are distributed evenly to the cores.
      • We can detect whether tasks are distributed evenly with a simple experiment (all tasks should have the same duration times)
    • The scheduler runs a separate run queue for each core.
      • If we don't know this, then the experiment result "active load balancing strategy" is ambiguous. The results could also stem from a single run queue. We then need a subsequent experiment to detect if the scheduler uses a single run queue or multiple run queues.
      • If the experiment result is "lazy load balancing", then we can assume that multiple run queues are used. This follows from the fact that the results indicate imbalanced cores.
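
As a rough illustration, the experiment could be implemented along the following lines (a minimal sketch, not the original harness; the core count, iteration counts, and work chunk below are assumptions): spawn x*N CPU-bound processes, let each one repeatedly time a fixed chunk of work, and record the per-iteration duration times per task.

```python
# Hypothetical sketch of the detection experiment (not the original harness):
# spawn x*N CPU-bound processes; each one repeatedly times a fixed work chunk
# and logs its per-iteration duration times.
import multiprocessing
import time

CORES = 4              # assumed number of cores (N)
FACTOR = 3             # x in "x*N parallel tasks"
BASE_ITERATIONS = 400  # assumed; varied per task so tasks finish one after another

def busy_work():
    """A fixed, purely CPU-bound chunk of work."""
    s = 0
    for i in range(2_000_000):
        s += i * i
    return s

def task(task_id, iterations):
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        busy_work()
        durations.append(time.perf_counter() - start)
    with open(f"task_{task_id}.csv", "w") as f:
        f.writelines(f"{d}\n" for d in durations)

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=task, args=(i, BASE_ITERATIONS + 50 * i))
             for i in range(CORES * FACTOR)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```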


Evaluating the Experiment

The following figure helps to explain how to interpret the results:

ActiveOrLazyLoadBalancing ResultsExample.png

Experiment Results

Windows 7

Time Series Results for Experiment on a 4 Core Machine Running Windows 7

On a 4 core machine running Windows 7, the experiment shows clearly that a lazy load balancing policy is employed. The experiment spawns 12 processes at the same time. We assume that all 4 cores are equally utilized with 3 tasks per core. After 300 iterations, the first task completes. This leads to decreased duration times of 2 other tasks (task 6 and task 11), while the duration times of all other tasks remain the same. This shows that tasks 6 and 11 have more CPU time available, because they no longer have to share their CPU with a third task.

After a while, task 6 completes as well. Now task 11 has a complete CPU available to itself, although the other cores are still utilized with 3 tasks each. Windows does not initiate load balancing at this point in time, but waits until the first core is idle. This happens when task 11 completes. Now there is one idle core and 9 tasks remain running. Windows then moves one task (task 9) to the idle core, leaving tasks 5 and 8 together on one core.
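
A quick back-of-the-envelope check of this interpretation (the 1000 ms chunk length is an assumed value, not one measured in this run): with per-core run queues, a work chunk of length w takes roughly k*w when k CPU-bound tasks time-share one core.

```python
# Back-of-the-envelope check: per-iteration duration grows with the number of
# tasks sharing the core. The 1000 ms chunk length is an assumption.
def expected_iteration_ms(tasks_on_core, chunk_ms=1000):
    return tasks_on_core * chunk_ms

print(expected_iteration_ms(3))  # 3 tasks per core           -> ~3000 ms
print(expected_iteration_ms(2))  # one of them has finished   -> ~2000 ms
print(expected_iteration_ms(1))  # core exclusively available  -> ~1000 ms
```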

Linux 2.6.22

Time Series Results for Experiment on a 4 Core Machine Running Linux 2.6.22

The results for a machine running the Linux 2.6.22 scheduler look similar at first glance, but reveal an important difference. Once two tasks have finished (task 1 and task 3), the scheduler immediately detects the imbalance (one core running 1 task (task 5) vs. 3 cores running 3 tasks each) and transfers one of the tasks 7, 10, and 11 to the core which runs just task 5. In the graph, the times of task 7 and task 10 are hidden by the times of task 11. One of these three tasks is transferred, which leads to decreased duration times of all three tasks (each of them now runs on a core together with only one other task). Compared to lazy load balancing, this transfer takes place although no core is idle. Instead, the scheduler detects the imbalance upon an event (a task finishes, i.e. goes to sleep). Thus, it takes much less time until all running tasks benefit from the decreased overall system load. Here, an active load balancing policy is detected.

Linux 2.6.31

Time Series Results for Experiment on a 4 Core Machine Running Linux 2.6.31

On a 4 core machine running Linux 2.6.31, load balancing already kicks in as soon as the first task completes. The remaining tasks are then redistributed evenly across all cores. Here, an immediate load balancing strategy is detected.

However, the experiment results would look similar if a scheduler employed a lazy load balancing strategy but used only a single run queue for all cores. Hence, we need a subsequent experiment to detect the number of run queues. Alternatively, we can argue that the single-run-queue case need not be considered, since load balancing is not applicable with a single run queue anyway.

Deriving a Scheduling Model

Based on the experiment results, we can calculate the duration time of an amount of work w in a process p as follows:

Let t_start be the time at which processing of w starts. Let t_1, t_2, ..., t_x be the points in time at which another process finishes or is put to sleep while w is being processed (i.e. while w is not completed yet).

To calculate the total duration time dt_w of w, we calculate how much work can be accomplished while additional processes run on the machine.

As above, if the total number n of parallel running CPU-bound processes is less than or equal to the number of cores c, dt_w is equal to w. Events such as other processes finishing or sleeping have no impact on the duration time.

Otherwise, we calculate the amount of work that can be completed for process p until the first event occurs, i.e. until t_1, as described above.

If the experiment results lead to the assumption that a single run queue is used or load balancing is triggered immediately, we use the processor sharing approximation as described above. Thus, we can calculate the amount of work as done for the time interval [t_start, t_1] with a reduced number of parallel running processes.
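
The processor sharing approximation itself is described on a separate page; as a hedged sketch, it could be evaluated per event interval as follows, assuming each of n processes gets a fair share min(1, c/n) of a core (this rate is an assumption, not the original definition).

```python
# Hedged sketch of a processor sharing approximation for the immediate
# load-balancing / single-run-queue case. The effective rate min(1, c/n)
# is an assumption; the approximation referenced above may differ in detail.
def work_completed(interval_ms, n_processes, n_cores):
    """Work (in ms of pure CPU time) one process completes in the interval."""
    if n_processes <= n_cores:
        return interval_ms                       # process has a core to itself
    return interval_ms * n_cores / n_processes   # fair share of all cores

# Example: 4 cores, 12 CPU-bound processes, 300 ms between two events.
print(work_completed(300, 12, 4))  # -> 100.0 ms of work per process
```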


If the experiment results indicate a lazy or active load balancing strategy, we calculate the duration time dt_w as follows.

We assume that the processes p_i are assigned round-robin to the cores, i.e. process p_i runs on core c_j with j = ((i - 1) mod c) + 1. Let cp_j be a list containing all processes p_i that are currently running on core c_j.

The amount of work w_a,b that can be completed between two events (i.e. in the time interval [ t_a, t_b ]) can be calculated as follows:

  • Step 1: Update the lists cp_j
  • Step 2: Calculate the amount of work for the time interval [ t_a , t_b ]

Step 1 consists of two sub-steps:

Step 1a: Remove the process p_i that finished at t_a from the list cp_j it was located in.

Step 1b is responsible for maintaining the lists if processes have to be moved to different cores. We assume that, if load balancing is performed, it occurs immediately at this point in time, not later. This step depends on the detected load balancing strategy:

  • If lazy load balancing is detected, processes are only moved to a different list if one list is empty. If we don't have further experiments or knowledge telling us which process to choose, we choose one process randomly from a random list.
  • If active load balancing is detected, we move one process from list cp_j1 to cp_j2 if size( cp_j1 ) > size( cp_j2 ) + 1, i.e. if at least two more processes run on core c_j1 than on core c_j2. If multiple lists meet the criterion for cp_j1, we choose one of them randomly (if we don't have further experiments or knowledge). If we don't have further experiments or knowledge telling us which process to choose, we choose one process randomly from this list (see the sketch after this list).
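
A minimal sketch of Step 1 under these rules (the function and the list-of-lists representation are illustrative, not part of the original formalism):

```python
import random

# Illustrative sketch of Step 1: update the per-core lists cp_j after the
# process `finished` has completed. The random tie-breaking mirrors the
# "choose randomly" fallbacks described above.
def step1(core_lists, finished, strategy):
    # Step 1a: remove the finished process from the list it was located in.
    for cp in core_lists:
        if finished in cp:
            cp.remove(finished)
            break
    # Step 1b: rebalance according to the detected strategy.
    if strategy == "lazy":
        # Only move a process if a core has become idle (its list is empty).
        for cp in core_lists:
            if not cp:
                donors = [d for d in core_lists if len(d) > 1]
                if donors:
                    donor = random.choice(donors)
                    cp.append(donor.pop(random.randrange(len(donor))))
    elif strategy == "active":
        # Move one process if some core runs at least two more processes
        # than another core, choosing randomly among candidate donor lists.
        receiver = min(core_lists, key=len)
        donors = [d for d in core_lists if len(d) > len(receiver) + 1]
        if donors:
            donor = random.choice(donors)
            receiver.append(donor.pop(random.randrange(len(donor))))
    return core_lists
```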

Step 2, calculating the amount of work for the time interval [ t_a , t_b ], is then done as follows:

w_a,b = ( t_b - t_a ) / size( cp_j )

t_b - t_a is the time span between the two events t_a and t_b. The amount of work that can be completed depends on the number of processes that are concurrently assigned to the same core during this period, i.e. the size of the list cp_j. For example, if t_b - t_a = 300 ms and three processes share core c_j, each of them completes 100 ms worth of work in this interval.


If the sum of all completed amounts of work reaches w at time t_end, dt_w is the time span between t_start and t_end.
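
Putting the steps together, the duration time dt_w could be computed along the following lines (a sketch only; it reuses the step1 function sketched above, and all names are illustrative rather than part of the original formalism):

```python
# Hedged sketch of the complete dt_w computation for the lazy/active case.
# `events` is a list of (finished_process, t_i) pairs with t_1 < t_2 < ...;
# each event is assumed to be another process finishing.
def duration_time(w, t_start, events, core_lists, p, strategy):
    t_prev = t_start
    remaining = w
    for finished, t_event in events:
        # Step 2: w_{a,b} = (t_b - t_a) / size(cp_j) for the list containing p.
        share = 1.0 / len(next(cp for cp in core_lists if p in cp))
        done = (t_event - t_prev) * share
        if done >= remaining:                   # w is completed inside this interval
            return (t_prev + remaining / share) - t_start
        remaining -= done
        step1(core_lists, finished, strategy)   # Step 1: update the lists cp_j
        t_prev = t_event
    # No further events: p keeps its current share until w is completed.
    share = 1.0 / len(next(cp for cp in core_lists if p in cp))
    return (t_prev + remaining / share) - t_start
```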

Open issues

  • Is the formalism readable?
  • Is the formalism / derived model appropriate?
  • Is the formalism correct?

Determining How Long Load Balancing Takes

  • No such time can be measured for Linux 2.6.31.
  • How can these times be detected for Windows 7 / Linux 2.6.22?


Detecting Further Immediate Load Balancing Features

If we detected immediate load balancing in the prior experiment, we can run a similar experiment to obtain further information about the scheduler.

Experiment Design

ImmediateLoadBalancing.png

Evaluating the Experiment

3 times as many tasks are run in parallel as cores are available. After a certain time, the tasks finish one after another. Thus, during the experiment, we get results for every number of parallel running tasks between 1 and 3*#Cores.

Since we have detected immediate load balancing, we assume that the duration times of all concurrently running tasks do not differ too much, because we expect the scheduler to try to achieve fairness among all running tasks, i.e. all running tasks should get a similar amount of execution time.

As the experiment stops the running tasks one after another, we group the time series results based on the event times at which tasks finished.

Again, let N be the number of tasks running in parallel at the beginning of the experiment (this means that #Cores = N / 3). Each task t_n repeatedly measures the same amount of work w.

Let end_time_i be the time at which t_i finished. We can then create an ordered list of end times. Because every task runs a different number of iterations, no two tasks finish at the same time. For each interval [end_time_i, end_time_i+1], we can assume that j = N - i tasks are still running in parallel. We then calculate the mean e_j and the standard deviation sigma_j of all duration times measured in this interval.

These numbers give further information about the expected duration time of a task depending on the number of parallel running tasks.
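
A minimal sketch of this evaluation (the input format, a pooled list of (timestamp, duration) samples, is an assumption about how the measurements are stored):

```python
import statistics

# Illustrative sketch of the evaluation above. `samples` is assumed to be a
# pooled list of (timestamp, duration) measurements from all tasks, and
# `end_times` the ordered task end times end_time_1 < ... < end_time_N.
def interval_statistics(samples, end_times, n_tasks):
    stats = {}
    bounds = [0.0] + list(end_times)
    for i in range(len(end_times)):
        j = n_tasks - i   # tasks still running in parallel in this interval
        durations = [d for (t, d) in samples if bounds[i] <= t < bounds[i + 1]]
        if len(durations) > 1:
            stats[j] = (statistics.mean(durations), statistics.stdev(durations))
    return stats          # maps j -> (e_j, sigma_j)
```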

Experiment Results

Time Series Results for Experiment on a 4 Core Machine Running Linux 2.6.31

The experiment results indicate that, although load balancing is done immediately, the variation of the duration times depends on the number of concurrently running tasks. The vertical dotted lines in the diagram indicate the event times at which a task finished, i.e. they delimit the task end time intervals mentioned above.

For Linux 2.6.31 on a 4 core machine, the average mean and standard deviation of a 1000 ms task, depending on the number of concurrently running tasks, are as follows:

Number of concurrent tasks | Avg. mean (ms) | Standard deviation (ms)
 1 | 1004.10 |   1.129
 2 | 1003.91 |   1.102
 3 | 1003.94 |   0.741
 4 | 1006.18 |  19.242
 5 | 1257.03 | 166.102
 6 | 1511.71 | 154.492
 7 | 1758.99 | 290.666
 8 | 2011.40 |  31.898
 9 | 2262.19 | 277.587
10 | 2512.40 | 117.962
11 | 2764.91 | 135.611
12 | 3019.13 |  50.920


As the standard deviation does not increase monotonically with the number of concurrently running tasks, we assume that the deviation depends on the number of concurrently running tasks relative to the number of available cores: if the number of concurrently running tasks is a multiple of the number of cores, the deviation is lower.
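
As a small plausibility check of this hypothesis against the table above (4 cores, as in the experiment setup):

```python
# Standard deviations from the table above, keyed by the number of
# concurrently running tasks on the 4-core machine.
stddev = {1: 1.129, 2: 1.102, 3: 0.741, 4: 19.242, 5: 166.102, 6: 154.492,
          7: 290.666, 8: 31.898, 9: 277.587, 10: 117.962, 11: 135.611,
          12: 50.920}
cores = 4
multiples = [s for n, s in stddev.items() if n > cores and n % cores == 0]
others    = [s for n, s in stddev.items() if n > cores and n % cores != 0]
print(max(multiples), min(others))  # 50.92 vs. 117.962: multiples stay lower
```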