Multi-core scheduling optimizations for soft real-time applications
A cooperation-aware approach
co-authors:
Liria Matsumoto Sato
Patrick Bellasi
William Fornaciari
Wolfgang Betz
Lucas De Marchi
sponsors:
Agenda
- Introduction
- Motivation
- Objectives
- Analysis
- Optimization description
- Experimental results
- Conclusions & future works
Introduction – motivation
SMP + RT
- Multiple processing units inside a processor
- Determinism
Parallel programming paradigms
- Data Level Parallelism (DLP): competitive tasks
- Task Level Parallelism (TLP): cooperative tasks
Introduction – motivation
[Diagram: DLP – four independent tasks T0–T3; TLP – three cooperating tasks T0–T2]
Characterization:
- Synchronization
- Communication
Introduction – motivation
Linux RT scheduler (mainline):
- Run as soon as possible (based on priority) ⇔ use as many CPUs as possible
- OK for DLP!
But what about TLP? And why do we care about TLP at all?
Objectives
Study the behavior of the Linux RT scheduler for cooperative tasks
Optimize the RT scheduler:
- Smooth integration into the mainline kernel
- Don't throw everything away
Analysis – benchmark
Simulation of a scenario where SW replaces HW
Multimedia-like
Mixed workload: DLP + TLP
Challenge: map N tasks onto M cores optimally
Analysis – metrics
Throughput: sample mean of the production time
Determinism: sample variance of the production time
Analysis – metrics
Influencing factors:
- Preemption-disabled sections
- IRQ-disabled sections
- Locality of wake-up
Test machine: Intel i7 920, 4 cores, 2.8 GHz
Analysis – locality of wake-up: migrations
wave0:   112010101010101010101030111301010101010131010321010033
wave1:   033333333333333333333313202133333333333303333202332121
wave2:   022022222222222222222222233302222222222222222213322330
wave3:   111101010101010101010101011110101010101010101010101002
mixer0:  023033333333333333333333333202333333333333333333322233
mixer1:  103302201010101010101010101010330101010101010101010101
mixer2:  230201010101010101010101012020101010101010101010101021
monitor: 111122222222222222222222231112222222222222222233322303
a) migration patterns (wave0, mixer1 and mixer2)
Analysis – locality of wake-up: migrations
wave0:   302121331102303323303311032320011021111111120101010101
wave1:   223303013333212010121032121012333210320230333333333333
wave2:   033332232302003111201211330021322233323330220222222222
wave3:   110211120211101030031121003030120101101111111010101010
mixer0:  322210301110320231310303021213120013220320230333333333
mixer1:  001003321202220131103203212130320331201031033022010101
mixer2:  230101020201202020121020021010130101201202302010101010
monitor: 113022133333131312032133303222223233123111111222222222
b) occasional migrations
Analysis – locality of wake-up
Cache-miss rate measurements:

          1 CPU    2 CPUs   4 CPUs    Increase (1 → 4)
  Load    7.58%    8.99%     9.44%    +24.5%
  Store   9.29%    9.78%    11.62%    +25.1%
Analysis – conclusion
Why do we care about TLP? It is a common parallelization technique.
What about TLP? The current state of the Linux scheduler is not as good as we want.
Solution – benchmark
Abstraction: one application level sends data to another
- Serialization (task cooperation)
- Parallelization (task competition)
[Diagram: workers w0–w3 feed mixers m0–m2]
Solution – benchmark
Abstraction: one application level sends data to another
Reality: shared buffers + synchronization
- Serialization (task cooperation)
- Parallelization (task competition)
[Diagram: workers w0–w3 feed mixers m0–m2 through a shared buffer]
Solution – benchmark
Abstraction: one application level sends data to another
Reality: shared buffers + synchronization
Dependencies: defined among tasks in the opposite direction of the data flow
- Serialization (task cooperation)
- Parallelization (task competition)
[Diagram: workers w0–w3 feed mixers m0–m2 through a shared buffer]
If data do not go to the tasks, then the tasks go to where the data were produced: make tasks run on the same CPU as their dependencies.
Solution – idea: dependency followers
Measurement tools
- Ftrace
- sched_switch tool (Carsten Emde)*
- gtkwave
- perf
- ad hoc scripts
* http://www.osadl.org/Single-View.111+M5d51b7830c8.0.html
Solution – task-affinity: dependency followers
Task-affinity: the selection of the CPU on which a task (e.g. m0) executes takes into consideration the CPUs on which its dependencies (e.g. w0 and w1) last ran.
[Diagram: workers w0–w3 feed mixers m0–m2]
Solution – task-affinity: implementation
Two lists inside each task_struct:
- taskaffinity_list
- followme_list
Two system calls to add/delete affinities:
- sched_add_taskaffinity
- sched_del_taskaffinity
Experimental results – cache-miss rates

Store cache-miss rate:
                  1 core   2 cores   4 cores
  mainline         9.29%    9.78%    11.62%
  task-affinity   10.17%    8.68%    10.35%

Load cache-miss rate:
                  1 core   2 cores   4 cores
  mainline         7.58%    8.99%     9.44%
  task-affinity    7.92%    7.88%     8.58%

More data/instructions hit the L1 cache; fewer cache lines are invalidated.
Measurements without and with task-affinity.
Experimental results – what exactly to evaluate?
[Diagram: workers w0–w3 feed mixers m0–m2]
Cache-miss rate is not exactly what we want to optimize.
Optimization objectives:
- Lower the time to produce a single sample
- Increase determinism in the production of several samples
Experimental results – average execution time of each task (µs)

                  wave0   wave1   wave2   wave3   mixer0   mixer1   mixer2
  task-affinity    8.81    9.47    9.21    9.19    16.34    15.41    17.55
  mainline         9.15    9.64    9.47    9.84    17.39    16.69    17.82
Experimental results – variance of the execution time of each task (µs²)

                  wave0   wave1   wave2   wave3   mixer0   mixer1   mixer2
  task-affinity    0.22    0.31    0.29    0.36    0.61     0.38     0.35
  mainline         0.57    0.51    0.48    0.83    1.45     2.19     0.47
Experimental results – production time of each single sample
Results obtained for 150,000 samples, mainline vs. task-affinity.
Experimental results – production time of each single sample
Empirical repartition function. Real-time metric (assuming a normal distribution): average + 2 × standard deviation.
[Plots: empirical repartition (CDF) of the sample production times, mainline and task-affinity]

                  Max (µs)   Min (µs)   Avg (µs)   Variance (µs²)   avg + 2σ (µs)
  mainline           114        29        37.8         3.22             41.2
  task-affinity       51        35        38.0         0.21             38.96
Experimental results – summary
Worst-case variance of the execution time drops from 3.22 to 0.21 µs²: a ~15× improvement in determinism.
Conclusion & future works
Average execution time is almost the same; determinism for real-time applications is improved.
Future works:
- Better focus on temporal locality
- Improve task-affinity configuration
- Test on other architectures
- Clean up the repository
Conclusion & future works
Still a work in progress
Git repository: git://git.politreco.com/linux-lcs.git
Contact: [email protected]@gmail.com
Q & A
Solution – Linux scheduler: dependency followers
Linux scheduler: change the decision process for the CPU on which a task executes when it is woken up.
[Diagram: the main and periodic schedulers drive the core scheduler, which selects a task and performs the context switch on CPUi. Scheduler classes: RT, Fair, Idle. Scheduler policies: RR and FIFO (RT); Normal and Batch (Fair); Idle]