Process Scheduler and Balancer in Linux Kernel
![Page 1: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/1.jpg)
Process Scheduler and Balancer in Kernel
Haifeng Li
2014-3-3
![Page 2: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/2.jpg)
Process Scheduler
![Page 3: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/3.jpg)
Outline
• Introduction of scheduler
• Scheduler History
– Round-Robin Scheduler
– O(N)
– O(1)
• Completely Fair Scheduler
• Real Time Scheduler
![Page 4: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/4.jpg)
Introduction of Scheduler
• Scheduler
– Determines which process runs when multiple processes are runnable.
• Linux Scheduler history

| Linux Version | Scheduler |
|---|---|
| Before 2.4 | Round Robin Scheduler |
| 2.4 | O(N) |
| 2.5.17 to 2.6.23 | O(1) |
| 2.6.23 to now | Completely Fair Scheduler |
![Page 5: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/5.jpg)
Round Robin Scheduler
• Algorithm
– Init: p->counter = current->counter >> 1
– At each tick: current->counter--
– When current->counter == 0, the system picks the thread with the highest counter to run.
– When every thread's counter is 0, reset the counters: p->counter = (p->counter >> 1) + p->priority
struct task_struct {
    long counter;
    long priority;
    …
};
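The counter rules above can be tried out in user space. The following is a minimal sketch, not kernel code: `struct task` and the `rr_*` helper names are hypothetical, and only the two fields shown on the slide are modelled.

```c
#include <stddef.h>

/* Hypothetical, stripped-down task record with the two fields from the slide. */
struct task {
    long counter;   /* ticks left in the current epoch */
    long priority;  /* static priority, used to refill the counter */
};

/* Init rule from the slide: a new task starts with half of the creator's counter. */
static void rr_init_child(struct task *p, const struct task *current)
{
    p->counter = current->counter >> 1;
}

/* Charge one timer tick to the running task. */
static void rr_tick(struct task *current)
{
    if (current->counter > 0)
        current->counter--;
}

/* Start a new epoch: each task keeps half of its leftover counter plus its priority. */
static void rr_recharge(struct task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = (tasks[i].counter >> 1) + tasks[i].priority;
}

/* Return the index of the task with the highest counter.
 * If every counter is 0, the counters are reset first. */
static size_t rr_pick_next(struct task *tasks, size_t n)
{
    int all_zero = 1;
    size_t best = 0;

    for (size_t i = 0; i < n; i++)
        if (tasks[i].counter > 0)
            all_zero = 0;
    if (all_zero)
        rr_recharge(tasks, n);

    for (size_t i = 1; i < n; i++)
        if (tasks[i].counter > tasks[best].counter)
            best = i;
    return best;
}
```

Calling `rr_tick()` on the running task and `rr_pick_next()` whenever its counter reaches 0 reproduces the epoch behaviour: a task that slept through an epoch carries half of its unused counter into the next one.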
![Page 6: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/6.jpg)
O(N) Scheduler
• Algorithm
– All runnable tasks are kept in one global list.
– The time slice depends on the priority and CONFIG_HZ.
– When picking the next task, choose the task with the largest weight among those with p->counter != 0, where weight = p->counter + (20 - nice).
– After all tasks have used up their time slices, the counters are recalculated.
#if HZ < 200
#define TICK_SCALE(x) ((x) >> 2)
#elif HZ < 400
#define TICK_SCALE(x) ((x) >> 1)
…
#endif
#define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)
struct task_struct {
    …
    long counter;
    long nice;
    …
};
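The linear scan that gives the O(N) scheduler its name can be sketched as follows. This is an illustration under stated assumptions: `HZ` is fixed at 100, the `#else` branch stands in for the cases the slide elides, and `struct task` and `pick_next` are hypothetical names.

```c
#include <stddef.h>

#define HZ 100  /* assumed tick rate for this sketch */

#if HZ < 200
#define TICK_SCALE(x) ((x) >> 2)
#elif HZ < 400
#define TICK_SCALE(x) ((x) >> 1)
#else
#define TICK_SCALE(x) (x)  /* simplification: the slide elides the remaining cases */
#endif
#define NICE_TO_TICKS(nice) (TICK_SCALE(20 - (nice)) + 1)

/* Hypothetical task record with the two fields from the slide. */
struct task {
    long counter;  /* remaining ticks */
    long nice;     /* -20 (most favourable) to 19 */
};

/* Scan every runnable task and return the index of the one with the
 * largest weight = counter + (20 - nice). Tasks whose counter is 0 are
 * skipped. Returns -1 when all counters are exhausted, which is the
 * point where the counters would be recalculated. */
static int pick_next(const struct task *tasks, size_t n)
{
    int best = -1;
    long best_weight = 0;

    for (size_t i = 0; i < n; i++) {
        long weight;

        if (tasks[i].counter == 0)
            continue;
        weight = tasks[i].counter + (20 - tasks[i].nice);
        if (best < 0 || weight > best_weight) {
            best = (int)i;
            best_weight = weight;
        }
    }
    return best;
}
```

With `HZ` at 100, `NICE_TO_TICKS(0)` evaluates to 6 ticks and `NICE_TO_TICKS(19)` to 1 tick, so the cost of every scheduling decision grows with the length of the list while the slice length depends only on the nice value.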
![Page 7: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/7.jpg)
O(1) Scheduler (1)
• This scheduler uses two priority arrays per processor (active and expired) to keep track of the processor's ready tasks.
![Page 8: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/8.jpg)
O(1) Scheduler (2)
• Time Slice
#define SCALE_PRIO(x, prio) \
    max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO/2), MIN_TIMESLICE)
static unsigned int task_timeslice(task_t *p)
{
    if (p->static_prio < NICE_TO_PRIO(0))
        return SCALE_PRIO(DEF_TIMESLICE*4, p->static_prio);
    else
        return SCALE_PRIO(DEF_TIMESLICE, p->static_prio);
}
• Dynamic Priority
– dynamic_priority = max(100, min(static_priority - bonus + 5, 139))
![Page 9: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/9.jpg)
O(1) Scheduler (3)
• The bonus is derived from the task's average sleep time.
• MAX_SLEEP_AVG is 1000 ms; MAX_BONUS is 10.
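To make the relation between sleep time, bonus and dynamic priority concrete, here is a small sketch. It assumes the bonus scales linearly with the average sleep time (0 ms gives 0, MAX_SLEEP_AVG gives MAX_BONUS); the function names and the millisecond units are illustrative, not the kernel's.

```c
#define MAX_RT_PRIO   100   /* first priority value of normal tasks */
#define MAX_PRIO      140   /* normal tasks use 100..139 */
#define MAX_SLEEP_AVG 1000  /* ms, from the slide */
#define MAX_BONUS     10    /* from the slide */

/* Assumed linear mapping of average sleep time (ms) to a bonus in 0..MAX_BONUS. */
static int sleep_bonus(int sleep_avg_ms)
{
    if (sleep_avg_ms < 0)
        sleep_avg_ms = 0;
    if (sleep_avg_ms > MAX_SLEEP_AVG)
        sleep_avg_ms = MAX_SLEEP_AVG;
    return sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG;
}

/* dynamic priority = max(100, min(static_priority - bonus + 5, 139)) */
static int dynamic_prio(int static_prio, int sleep_avg_ms)
{
    int prio = static_prio - sleep_bonus(sleep_avg_ms) + MAX_BONUS / 2;

    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;
    return prio;
}
```

For a task with static priority 120 (nice 0), a task that never sleeps ends up at 125, while one that sleeps the full 1000 ms ends up at 115. Interactive tasks are therefore favoured by up to five priority levels and CPU hogs penalised by up to five.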
![Page 10: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/10.jpg)
CFS Scheduler: Concept(1)
• "Ideal multi-tasking CPU" is a (non-existent :-)) CPU which can run each task at precise equal speed and equal share.[1]
[1] Documentation/scheduler/sched-design-CFS.txt
![Page 11: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/11.jpg)
CFS Scheduler: Concept(2)
• Real hardware can run only a single task at a time, which is obviously not fair:
• So the concept of "virtual runtime" is introduced.
Picture source: Completely Fair Scheduler, Linux Journal, Issue #184, August 2009
![Page 12: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/12.jpg)
CFS Scheduler: Virtual Runtime (1)
• The virtual runtime of a task specifies when its next time slice would start execution on the ideal multi-tasking CPU.[1]
• CFS tries to maintain an equal virtual runtime for each task in a CPU's run_queue at all times.
– Reason: tasks would execute simultaneously and no task would ever get "out of balance" from the "ideal" share of CPU time.[1]
• CFS always tries to run the task with the smallest virtual runtime value.
[1] Documentation/scheduler/sched-design-CFS.txt
![Page 13: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/13.jpg)
CFS scheduler: Virtual Runtime (2)
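• One period for all tasks: equation (1)
• Time slice for a task on the real processor: equation (2)
• Virtual runtime: equation (3)
• Combining (2) and (3) gives equation (4)
[Equations (1) to (4) were images on the slide and are not reproduced in this transcript.]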
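The equation images for this slide did not survive in the transcript. Assuming they follow the mainline CFS code of that period, the four relations can be written as below. The symbols are chosen here for illustration: $L$ is the targeted scheduling latency, $g$ the minimum granularity, $n$ the number of runnable tasks, $n_L$ the task count above which the period is stretched, $w_i$ the weight of task $i$, and $w_0$ the weight of a nice-0 task (1024).

```latex
\begin{align}
P &= \begin{cases} L, & n \le n_L \\ n \cdot g, & n > n_L \end{cases} \tag{1} \\
\mathrm{slice}_i &= P \cdot \frac{w_i}{\sum_{j} w_j} \tag{2} \\
\Delta v_i &= \Delta t_i \cdot \frac{w_0}{w_i} \tag{3} \\
\Delta v_i \Big|_{\Delta t_i = \mathrm{slice}_i} &= P \cdot \frac{w_0}{\sum_{j} w_j} \tag{4}
\end{align}
```

Relation (4) no longer depends on $i$: after every task has consumed its slice, all virtual runtimes have advanced by the same amount, which is why picking the smallest virtual runtime shares the CPU in proportion to the weights.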
![Page 14: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/14.jpg)
A demo: understanding virtual runtime
• Thread 1: weight 2; Thread 2: weight 5
• Period clock: P = 10 ms (HZ: 100)

| Clock sequence | Virtual runtime 1 | Virtual runtime 2 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1/2 * P | 0 |
| 2 | 1/2 * P | 1/5 * P |
| 3 | 1/2 * P | 2/5 * P |
| 4 | 1/2 * P | 3/5 * P |
| 5 | 1 * P | 3/5 * P |
| 6 | 1 * P | 4/5 * P |
| 7 | 1 * P | 1 * P |
| … | … | … |
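The table can be reproduced with a few lines of C. This sketch assumes, as the demo appears to, that one tick of real time advances a thread's virtual runtime by P / weight and that ties go to the lower-numbered thread. Virtual runtime is stored in tenths of P so the arithmetic stays exact for weights 2 and 5; `struct thread` and `run_tick` are hypothetical names.

```c
/* Virtual runtime is tracked in units of P/10, which is exact only for
 * weights that divide 10 (2 and 5 in the demo). */
struct thread {
    int weight;
    int vruntime;  /* in units of P/10 */
};

/* Run one clock tick: pick the thread with the smallest virtual runtime
 * (ties go to the lower index), charge it P/weight, and return its index. */
static int run_tick(struct thread *t, int n)
{
    int next = 0;

    for (int i = 1; i < n; i++)
        if (t[i].vruntime < t[next].vruntime)
            next = i;
    t[next].vruntime += 10 / t[next].weight;
    return next;
}
```

Starting from `{2, 0}` and `{5, 0}`, seven calls pick thread 1 twice and thread 2 five times, matching the 2:5 weight ratio, and both virtual runtimes end at 10 (that is, 1 * P).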
![Page 15: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/15.jpg)
CFS Scheduler: Priority & Weight
static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
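Because CFS shares the CPU in proportion to weight, the table translates directly into CPU percentages. The sketch below repeats the table so that it is self-contained; `nice_to_weight` and `cpu_share_percent` are hypothetical helper names, and the result is rounded down.

```c
/* Same table as on the slide; index = nice + 20. */
static const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/* Look up the weight for a nice value in -20..19. */
static int nice_to_weight(int nice)
{
    return prio_to_weight[nice + 20];
}

/* CPU share in percent (rounded down) of task `idx` among `n` runnable
 * tasks whose nice values are given in `nice`. */
static int cpu_share_percent(const int *nice, int n, int idx)
{
    long total = 0;

    for (int i = 0; i < n; i++)
        total += nice_to_weight(nice[i]);
    return (int)(100L * nice_to_weight(nice[idx]) / total);
}
```

For two tasks at nice 0 and nice 1 this gives 55% and 44%: adjacent entries differ by a factor of roughly 1.25, so one nice level shifts about 10% of the CPU between two competing tasks.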
[Chart: weight (0 to 100000) plotted against nice value (-20 to 19)]
![Page 16: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/16.jpg)
CFS scheduler: implementation (1)
• CFS uses a virtual runtime-ordered red-black tree to build a "timeline" of future task execution.
![Page 17: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/17.jpg)
CFS scheduler: implementation (2)
More: http://www.ibm.com/developerworks/library/l-completely-fair-scheduler/
![Page 18: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/18.jpg)
Real Time Scheduler
• The real-time scheduler has to ensure system-wide strict real-time priority scheduling (SWSRPS)
• Only the N highest-priority tasks should be running at any given point in time, where N is the number of CPUs.
• Frequent task balancing can introduce cache thrashing and contention for global data (such as runqueue locks), which can degrade throughput.
• Two policies
– SCHED_RR
– SCHED_FIFO
![Page 19: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/19.jpg)
Key Structures
struct cpupri_vec {
    atomic_t count;
    cpumask_var_t mask;
};
struct cpupri {
    struct cpupri_vec pri_to_cpu[CPUPRI_NR_PRIORITIES];
    int cpu_to_pri[NR_CPUS];
};
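One way to read these two structures: `pri_to_cpu` answers "which CPUs are currently running at priority p", and `cpu_to_pri` is the reverse map. The following user-space sketch shows how such a table could be searched for a CPU running lower-priority work. It is a simplification under stated assumptions: a `uint64_t` stands in for the CPU mask, a plain `int` for the atomic counter, a higher index means a higher priority, and the `*_sketch` names are hypothetical.

```c
#include <stdint.h>

#define NR_CPUS        8    /* assumption: few enough CPUs to fit one uint64_t mask */
#define NR_PRIORITIES  102  /* assumption: 0 = idle, 1 = normal, 2..101 = RT levels */

struct cpupri_vec_sketch {
    int      count;  /* number of CPUs currently at this priority */
    uint64_t mask;   /* bit c set: CPU c runs at this priority */
};

struct cpupri_sketch {
    struct cpupri_vec_sketch pri_to_cpu[NR_PRIORITIES];
    int cpu_to_pri[NR_CPUS];  /* -1: nothing recorded yet */
};

static void cpupri_init_sketch(struct cpupri_sketch *cp)
{
    for (int p = 0; p < NR_PRIORITIES; p++) {
        cp->pri_to_cpu[p].count = 0;
        cp->pri_to_cpu[p].mask = 0;
    }
    for (int c = 0; c < NR_CPUS; c++)
        cp->cpu_to_pri[c] = -1;
}

/* Record that `cpu` now runs at priority index `newpri`. */
static void cpupri_set_sketch(struct cpupri_sketch *cp, int cpu, int newpri)
{
    int oldpri = cp->cpu_to_pri[cpu];

    if (oldpri == newpri)
        return;
    if (oldpri >= 0) {
        cp->pri_to_cpu[oldpri].mask &= ~(UINT64_C(1) << cpu);
        cp->pri_to_cpu[oldpri].count--;
    }
    cp->pri_to_cpu[newpri].mask |= UINT64_C(1) << cpu;
    cp->pri_to_cpu[newpri].count++;
    cp->cpu_to_pri[cpu] = newpri;
}

/* Scan from the lowest priority upward and return the mask of allowed
 * CPUs at the first non-empty level below `task_pri`; 0 if none. */
static uint64_t cpupri_find_sketch(const struct cpupri_sketch *cp,
                                   int task_pri, uint64_t allowed)
{
    for (int p = 0; p < task_pri; p++) {
        const struct cpupri_vec_sketch *vec = &cp->pri_to_cpu[p];

        if (vec->count == 0)
            continue;
        if (vec->mask & allowed)
            return vec->mask & allowed;
    }
    return 0;
}
```

The per-priority `count` lets the scan skip empty levels without touching the mask, and the search never takes a runqueue lock, which relates to the contention concern raised on the previous slide.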
![Page 20: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/20.jpg)
Overview of RT scheduler Algorithm
• The scheduler has to address several scenarios:
– Where to place a task optimally on wakeup (that is, pre-balance).
– What to do with a lower-priority task when it wakes up but is on a runqueue running a task of higher priority.
– What to do with a low-priority task when a higher-priority task on the same runqueue wakes up and preempts it.
– What to do when a task lowers its priority and thereby causes a previously lower-priority task to have the higher priority.
More: http://www.linuxjournal.com/magazine/real-time-linux-kernel-scheduler
![Page 21: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/21.jpg)
A Demo of RT scheduler Algorithm
![Page 22: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/22.jpg)
Scheduler decision
1. Start with the top scheduler class.
2. Runnable task available in this class?
– N: pick the next scheduler class and repeat step 2.
– Y: pick the next task of this scheduler class.
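The decision flow above amounts to walking a linked list of scheduler classes from highest to lowest priority. The sketch below illustrates it with three classes (real-time, fair, idle). The class order is a simplification, and the `struct rq` layout and `pick_*` helpers are hypothetical stand-ins for the per-class runqueues.

```c
#include <stddef.h>

struct task {
    const char *name;
};

/* Hypothetical run queue: one candidate task per class, NULL when the
 * class has nothing runnable. The idle task is always present. */
struct rq {
    struct task *rt_task;
    struct task *fair_task;
    struct task *idle_task;
};

struct sched_class {
    const struct sched_class *next;  /* next lower-priority class */
    struct task *(*pick_next_task)(struct rq *rq);
};

static struct task *pick_rt(struct rq *rq)   { return rq->rt_task; }
static struct task *pick_fair(struct rq *rq) { return rq->fair_task; }
static struct task *pick_idle(struct rq *rq) { return rq->idle_task; }

static const struct sched_class idle_class = { NULL,        pick_idle };
static const struct sched_class fair_class = { &idle_class, pick_fair };
static const struct sched_class rt_class   = { &fair_class, pick_rt };

/* Start with the top class; fall through to the next class whenever
 * the current one has no runnable task. */
static struct task *pick_next_task(struct rq *rq)
{
    for (const struct sched_class *c = &rt_class; c != NULL; c = c->next) {
        struct task *p = c->pick_next_task(rq);

        if (p != NULL)
            return p;
    }
    return NULL;
}
```

Because the idle class always returns a task, the loop terminates with a non-NULL result as long as `idle_task` is set. A runnable real-time task always wins over any fair task, regardless of nice values.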
![Page 23: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/23.jpg)
CFS Load Balancer
![Page 24: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/24.jpg)
Outline
• Objective
• How to balance among cores
– Hierarchy & Key Data Structures
• Scenarios of balance
![Page 25: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/25.jpg)
Objective
1. Prevent processors from being idle while other processors still have tasks waiting to execute. [1]
2. Keep the difference in the number of ready tasks across all processors as small as possible. [1]
In addition: try to save power when the load is light. [2]
[1] Chun-Yu Lai, Performance Evaluation of Linux Kernel Load Balancing Mechanisms, 2006
[2] Suresh Siddha, Chip Multi Processing aware Linux Kernel Scheduler, 2006 Linux Symposium
![Page 26: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/26.jpg)
Hierarchy
• Scheduling Domain: Each scheduling domain spans a number of CPUs.
• Scheduling Group: Each scheduling domain must have one or more CPU groups, which are organized as a circular singly linked list.
• Balancing within a scheduling domain occurs between groups.
More information: http://lwn.net/Articles/80911/
![Page 27: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/27.jpg)
A Demo of Hierarchy
![Page 28: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/28.jpg)
Key members of sched_domain
struct sched_domain {
/* These fields must be setup */
struct sched_domain *parent; /* top domain must be null terminated */
struct sched_domain *child; /* bottom domain must be null terminated */
struct sched_group *groups; /* the balancing groups of the domain */
…
unsigned int busy_factor; /* less balancing by factor if busy */
unsigned int imbalance_pct; /* No balance until over watermark */
…
int flags; /* See SD_* */
…
unsigned long last_balance; /* init to jiffies. units in jiffies */
unsigned int balance_interval; /* initialise to 1. units in ms. */
unsigned int span_weight;
unsigned long span[0];
};
/* sched-domains (multiprocessor balancing) declarations: */
#ifdef CONFIG_SMP
#define SD_LOAD_BALANCE 0x0001 /* Do load balancing on this domain. */
#define SD_BALANCE_NEWIDLE 0x0002 /* Balance when about to become idle */
#define SD_BALANCE_EXEC 0x0004 /* Balance on exec */
#define SD_BALANCE_FORK 0x0008 /* Balance on fork, clone */
#define SD_BALANCE_WAKE 0x0010 /* Balance on wakeup */
#define SD_WAKE_AFFINE 0x0020 /* Wake task to waking CPU */
#define SD_SHARE_CPUPOWER 0x0080 /* Domain members share cpu power */
![Page 29: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/29.jpg)
Key members of sched_group
struct sched_group {
struct sched_group *next; /* Must be a circular list */
…
unsigned int group_weight;
struct sched_group_power *sgp;
…
unsigned long cpumask[0];
};
struct sched_group_power {
…
unsigned int power;
…
};
![Page 30: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/30.jpg)
An example of sched_domain
#define SD_CPU_INIT (struct sched_domain) { \
.busy_factor = 64, \
.imbalance_pct = 125, \
.flags = 1*SD_LOAD_BALANCE \
| 1*SD_BALANCE_NEWIDLE \
| 1*SD_BALANCE_EXEC \
| 1*SD_BALANCE_FORK \
| 0*SD_BALANCE_WAKE \
| 1*SD_WAKE_AFFINE \
, \
.last_balance = jiffies, \
.balance_interval = 1, \
}
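The `imbalance_pct` value of 125 acts as a watermark: a domain is treated as imbalanced only when the busiest group's load exceeds the local load by more than 25%. A minimal sketch of that comparison, with a hypothetical function name and plain integers in place of the kernel's load statistics:

```c
/* Return 1 when the busiest load is above the watermark relative to the
 * local load, 0 when the domain counts as balanced. imbalance_pct is a
 * percentage, so 125 means "more than 25% busier". */
static int exceeds_watermark(unsigned long busiest_load,
                             unsigned long this_load,
                             unsigned int imbalance_pct)
{
    return 100 * busiest_load > imbalance_pct * this_load;
}
```

With `imbalance_pct = 125` and a local load of 1000, a busiest load of 1250 is still considered balanced, while 1251 is not. The slack keeps tasks from bouncing between CPUs over small differences.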
![Page 31: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/31.jpg)
CFS Load Balancing: How to
• load_balance offloads tasks from the busiest runqueue of the busiest group (the one with the most runnable tasks), preferring tasks that are:
– inactive (likely to be cache cold)
– high priority
• load_balance skips tasks that are:
– currently running on a CPU
– not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct)
– still cache warm on their current CPU
![Page 32: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/32.jpg)
How busiest is the busiest group?
• Within the current domain level, the group with the largest average load is the busiest group.
– If the current processor is idle, the busiest group must have more running threads than it has cores.
– Else: [condition shown as an image on the slide, not reproduced in this transcript]
• If a busiest group is found, the domain is considered unbalanced.
![Page 33: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/33.jpg)
Restore balance
• How much load to actually move to equalize the imbalance: equations (1) to (3) [images on the slide, not reproduced in this transcript]
• Offload min(imbalance_x) from the busiest runqueue in the busiest group to restore balance.
• The busiest runqueue is the one with the maximum load weight in the busiest group.
![Page 34: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/34.jpg)
Load Balancing: idle balancing
• Idle balancing
– In schedule(), if this CPU is about to become idle, it attempts to pull a task from the busiest CPU.
for_each_domain(this_cpu, sd) {
if (!(sd->flags & SD_LOAD_BALANCE))
continue;
pulled_task = load_balance(this_cpu, this_rq,
sd, CPU_NEWLY_IDLE, &balance);
if (pulled_task)
break;
}
![Page 35: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/35.jpg)
Load Balancing: Periodic balancing
• On a timer tick, if the current time is after rq->next_balance, SCHED_SOFTIRQ is raised.
• The current processor starts from the lowest-level scheduling domain and walks up the domain hierarchy to decide whether rebalancing is needed:
– Current time > sd->last_balance + interval
– The current domain is unbalanced
– If rebalancing is needed, pull tasks from the busiest runqueue to the current runqueue.
• After one round of periodic balancing, rq->next_balance is updated to current time + highest-level interval.
interval = sd->balance_interval;
if (idle != CPU_IDLE)
    interval *= sd->busy_factor;
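The interval logic can be exercised outside the kernel with a small sketch. The assumptions are: `HZ` is fixed at 100, the millisecond-to-jiffy conversion rounds up, the result is never below one jiffy, and `struct sd_sketch` with its helpers are hypothetical stand-ins for `struct sched_domain` and the kernel's time helpers.

```c
#define HZ 100  /* assumed tick rate: one jiffy = 10 ms */

enum cpu_idle_type { CPU_IDLE, CPU_NOT_IDLE, CPU_NEWLY_IDLE };

/* Only the fields the interval computation needs. */
struct sd_sketch {
    unsigned long last_balance;     /* in jiffies */
    unsigned int  balance_interval; /* in ms */
    unsigned int  busy_factor;
};

/* Convert milliseconds to jiffies, rounding up. */
static unsigned long ms_to_jiffies_sketch(unsigned long ms)
{
    return (ms * HZ + 999) / 1000;
}

/* Balance less often on a busy CPU: the base interval is multiplied
 * by busy_factor unless the CPU is idle. */
static unsigned long balance_interval_jiffies(const struct sd_sketch *sd,
                                              enum cpu_idle_type idle)
{
    unsigned long interval = sd->balance_interval;

    if (idle != CPU_IDLE)
        interval *= sd->busy_factor;
    interval = ms_to_jiffies_sketch(interval);
    return interval ? interval : 1;
}

/* Wraparound-safe check for "now >= last_balance + interval". */
static int balance_due(const struct sd_sketch *sd, unsigned long now,
                       enum cpu_idle_type idle)
{
    unsigned long due = sd->last_balance + balance_interval_jiffies(sd, idle);

    return (long)(now - due) >= 0;
}
```

With `balance_interval = 1` and `busy_factor = 64`, an idle CPU rebalances this domain every jiffy, while a busy CPU waits 64 ms, which is 7 jiffies at the assumed tick rate.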
![Page 36: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/36.jpg)
Other Methods to keep Balance
• Exec balancing (SD_BALANCE_EXEC)
– Where to put a task when it calls exec
• Fork balancing (SD_BALANCE_FORK)
– Where to put a newly spawned thread
• Wake balancing (SD_BALANCE_WAKE)
– Where to put the wakee thread
• ILB (idle load balancer) balancing
![Page 37: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/37.jpg)
Exec balancing
• Search for the idlest group, from the highest-level scheduling domain down to the lowest-level domain.
– The idlest group is the one with the minimum avg_load
– It must also meet a condition [shown as an image on the slide, not reproduced in this transcript]
• Search for the idlest CPU in the idlest group.
– The idlest CPU is the one with the minimum avg_load in the idlest group
• Pack this task into a work item and add it to the &per_cpu(cpu_stopper, cpu) list.
• Wake up the stopper->thread running on the idlest CPU.
![Page 38: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/38.jpg)
Fork Balancing
• In do_fork(), select the idlest CPU and insert the new thread into that CPU's runqueue.
![Page 39: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/39.jpg)
Wake Balancing
• The waker is currently running on CPU X; the wakee last ran on CPU Y.
• If this_cpu_load + wakee_weight <= prev_cpu_load, the target CPU is close to X; otherwise it is close to Y.
• Within the last-level-cache domain, choose an idle CPU. If there is no idle CPU, choose X or Y as decided above.
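The two-step decision can be sketched as two small functions. This is an illustration only: the function names, the flat `cpu_is_idle` array and the explicit list of CPUs sharing the last-level cache are assumptions made for this example, not the kernel's interfaces.

```c
/* Step 1: decide whether the wakee should land near the waker's CPU
 * (this_cpu, X) or near the CPU it last ran on (prev_cpu, Y). */
static int wake_affine_target(unsigned long this_cpu_load,
                              unsigned long prev_cpu_load,
                              unsigned long wakee_weight,
                              int this_cpu, int prev_cpu)
{
    if (this_cpu_load + wakee_weight <= prev_cpu_load)
        return this_cpu;
    return prev_cpu;
}

/* Step 2: prefer an idle CPU among those sharing the last-level cache
 * with the target. llc_cpus lists the n CPU ids of that domain;
 * cpu_is_idle is indexed by CPU id. Falls back to the target itself. */
static int select_idle_in_llc(int target, const int *llc_cpus, int n,
                              const int *cpu_is_idle)
{
    if (cpu_is_idle[target])
        return target;
    for (int i = 0; i < n; i++)
        if (cpu_is_idle[llc_cpus[i]])
            return llc_cpus[i];
    return target;
}
```

Staying inside the last-level-cache domain means the wakee can still find its data in a shared cache even when it does not land on exactly X or Y.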
![Page 40: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/40.jpg)
Idle Load Balance(1)
• When one of the busy CPUs notices that idle rebalancing may be needed, it kicks the idle load balancer, which then does idle load balancing for all the idle CPUs. The kick requires:
– now >= nohz.next_balance
– Number of running tasks >= 2
– nohz.nr_cpus is non-zero.
![Page 41: Process Scheduler and Balancer in Linux Kernel](https://reader034.fdocuments.in/reader034/viewer/2022052321/554fa337b4c905ad218b4b92/html5/thumbnails/41.jpg)
Idle Load Balance(2)
• Routine
– Find an idle load balancer (ilber) and send an IPI_RESCHEDULE IPI to it
– After the ilber wakes up from the IPI:
• It does idle balancing for itself.
• It helps the other idle processors do load balancing.
• If it pulls tasks for another processor, it sends IPI_RESCHEDULE to that processor.
– Update nohz.next_balance to the ilber's next_balance