Shouqing Hao Institute of Computing Technology, Chinese Academy of Sciences [email protected]
Shouqing Hao
Institute of Computing Technology, Chinese Academy of Sciences
Processes Scheduling on Heterogeneous Multi-core Architecture with Hardware Support
Contents
Introduction
Hardware support for LLC-miss latency
LA-ACMP scheduling algorithm
Evaluation and analysis
Introduction
Heter-CMP: Heterogeneous Chip Multi-Processor
−Composed of some big cores and some small cores
  Big cores: large area, high power, high performance
  • Suited to CPU-bound programs, serial programs, ...
  Small cores: small area, low power, low performance
  • Suited to memory-bound programs, parallel programs, ...
−Advantage
  Makes good use of chip resources
  Reduces power and performance waste
−Challenge
  Identify applications’ behaviors during execution
  Schedule the proper programs to the proper cores
Hardware Support (1)
Identify programs’ behaviors
−Last level cache (LLC) miss latency
  LLC miss → memory access
  • Memory accesses induce high latency
  • They reduce programs’ efficiency during execution
  • The core’s performance cannot be fully used
  Schedule rules
  • Programs with high LLC miss latency should be scheduled to small cores
  • Programs with low LLC miss latency should be scheduled to big cores
Hardware Support (2)
Identify programs’ behaviors
−Last level cache (LLC) miss latency
  Mechanism
  • LLC miss delay is the period between the miss request and the miss response
    – Un-overlapped, overlapped
  • Record LLC miss latency for each core, with hardware support
[Figure: timing diagrams of memory accesses. Un-overlapped: successive mem-access request/response pairs take t1, t2, ...; delay = t1 + t2 + ... Overlapped: a second access (request2/response2) is issued while the first is still outstanding; labels t1, t2, delay.]
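The un-overlapped formula above (delay = t1 + t2 + ...) can be sketched in a few lines. The slide does not spell out how overlapped misses are accounted for, so the sketch below assumes that overlapping request/response windows are counted once (the time during which at least one miss is outstanding). The function name `llc_miss_delay` and the interval representation are illustrative, not taken from the original work.

```python
def llc_miss_delay(intervals):
    """Total time during which at least one LLC miss is outstanding.

    intervals: iterable of (request_time, response_time) pairs.
    Assumption: overlapped misses are counted once, not summed.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Un-overlapped: close the previous window and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapped: extend the current window instead of double counting.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total
```

For two separate misses lasting 10 and 5 time units the result is their sum; for two misses that overlap, the result is the length of the combined window.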
Hardware Support (3)
−Implemented based on Godson-3A
  Record LLC miss requests and responses for each core, with hardware support
[Figure: data path L1 CACHE, L2 CACHE, DDR controller. L1 miss request / L1 miss response between L1 and L2; mem access request / mem access response between L2 and the DDR controller. Per-core counters: LLC_miss_request_times_core0 to LLC_miss_request_times_core3 and LLC_miss_response_times_core0 to LLC_miss_response_times_core3.]
Hardware Support (4)
[Figure: counter logic. When mem_req_valid is asserted, request_id is compared against core ids 0 to 3 and the matching counter L2_miss_req_0 ... L2_miss_req_3 is incremented. When mem_res_valid is asserted, response_id is compared against core ids 0 to 3 and the matching counter L2_miss_res_0 ... L2_miss_res_3 is incremented. For each core, if the request and response counters are equal, L2_miss_ok_0 ... L2_miss_ok_3 <= 1.]
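The counter logic on this slide can be modeled in software as a rough behavioral sketch. The per-core request and response counters and the "ok when equal" comparison follow the signal names in the figure. The `tick` method and the `delay_cycles` accumulator are hypothetical additions, assumed here only to show how such a flag could be turned into a per-core delay; the slides do not describe that step.

```python
class LLCMissCounters:
    """Behavioral model of per-core LLC miss request/response counters."""

    def __init__(self, num_cores=4):
        self.req = [0] * num_cores           # models L2_miss_req_<id>
        self.res = [0] * num_cores           # models L2_miss_res_<id>
        self.delay_cycles = [0] * num_cores  # hypothetical accumulator (assumption)

    def on_request(self, core_id):
        # A valid memory request tagged with core_id increments that core's counter.
        self.req[core_id] += 1

    def on_response(self, core_id):
        # A valid memory response tagged with core_id increments that core's counter.
        self.res[core_id] += 1

    def miss_ok(self, core_id):
        # Models L2_miss_ok_<id>: set when every request has been answered.
        return self.req[core_id] == self.res[core_id]

    def tick(self):
        # Assumption: one cycle of delay is charged to each core that still
        # has at least one outstanding miss.
        for core_id in range(len(self.req)):
            if not self.miss_ok(core_id):
                self.delay_cycles[core_id] += 1
```

Because a core is charged only while its counters differ, overlapping misses from the same core are charged once per cycle rather than once per miss.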
LA-ACMP Schedule Algorithm(1)
LA-ACMP: Latency-Aware Asymmetry CMP
−Identify heterogeneity of cores
  Based on Linux kernel 2.6.18
  Calculate the BogoMIPS value of each core to evaluate each core’s performance
−Workload assignment balance
  Using the Scaled Load method
  • L = N/P: each core’s scaled load
    – N: number of workloads in the queue
    – P: processor’s performance
  • If Lmax – Lmin <= 1, the workload assignment is balanced
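The scaled-load balance test can be illustrated with a short sketch. The function names and the example performance values are assumptions chosen for illustration; in the original work P is derived from each core's BogoMIPS value.

```python
def scaled_loads(queue_lengths, performances):
    """Scaled load L = N / P for each core."""
    return [n / p for n, p in zip(queue_lengths, performances)]


def is_balanced(queue_lengths, performances):
    """Balanced when Lmax - Lmin <= 1."""
    loads = scaled_loads(queue_lengths, performances)
    return max(loads) - min(loads) <= 1


# Hypothetical relative performance: one fast core (2.0), three slow cores (1.0).
performances = [2.0, 1.0, 1.0, 1.0]
print(is_balanced([4, 2, 2, 2], performances))  # every core has L = 2.0
print(is_balanced([6, 1, 1, 1], performances))  # fast core L = 3.0, others L = 1.0
```

With the first assignment the faster core holds twice as many tasks but has the same scaled load as the others, so the system counts as balanced; in the second, Lmax – Lmin = 2, so it does not.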
LA-ACMP Schedule Algorithm(2)
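−LLC-delay buffer
  Attach an LLC-delay buffer to each run-queue
  Save each task’s LLC miss latency
[Figure (a): a run-queue and its LLC-delay buffer. With both slots idle, the buffer entries are x, x. When thread0 becomes ready, the run-queue holds thread0 and idle, and the buffer entries are 0 and x.]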
LA-ACMP Schedule Algorithm(3)
−Update LLC-delay buffer
  When a thread starts running, clear its LLC-delay value
  When it exhausts its time slice, save its LLC-delay value
  When a thread migrates from queue-A to queue-B, its LLC-delay value migrates with it
[Figure: buffer updates. On the processor: thread0’s LLC-delay value is cleared when it executes and saved after it executes (buffer entry 0->delay0). On migration: in Run-queueA, thread0->idle and its buffer entry delay0->0, while thread1 keeps delay1; in Run-queueB, idle->thread0 and the buffer entry 0->delay0.]
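The three update rules for the LLC-delay buffer can be sketched as follows. This is a simplified user-space model, not the kernel implementation: the `RunQueue` class, its method names, and the use of a dictionary keyed by thread id are all assumptions made for illustration.

```python
class RunQueue:
    """Run-queue paired with an LLC-delay buffer (illustrative model)."""

    def __init__(self):
        # thread id -> LLC miss delay saved at the end of its last time slice
        self.llc_delay = {}

    def enqueue(self, tid):
        # A newly ready thread starts with a delay of 0.
        self.llc_delay.setdefault(tid, 0)

    def on_start(self, tid):
        # Rule 1: when the thread starts running, clear its LLC-delay value.
        self.llc_delay[tid] = 0

    def on_slice_end(self, tid, measured_delay):
        # Rule 2: when the time slice is exhausted, save the measured delay.
        self.llc_delay[tid] = measured_delay


def migrate(tid, src, dst):
    # Rule 3: the LLC-delay value moves with the thread to the new queue.
    dst.llc_delay[tid] = src.llc_delay.pop(tid)
```

Keeping the value with the thread across migrations means the destination queue can classify the thread immediately, without waiting for it to run a full time slice on the new core.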
LA-ACMP Schedule Algorithm(4)
−LA-ACMP algorithm
  Executed when judging balance
  Doesn’t destroy balance
[Figure: flowchart of the LA-ACMP algorithm. Nodes: Time slice over; Update buffer; Load imbalance?; busiest-idlest-heter; proper thread exists?; Thread migration; do balance; processor-bound thread on slower core?; memory-bound thread on fast core?; Exchange pairs of threads; do nothing.]
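The "exchange pairs of threads" step can be sketched under some stated assumptions. The slides do not say how a thread is classified as memory-bound or processor-bound, nor which pair is chosen, so the fixed `threshold` and the "largest delay on the fast core, smallest delay on the slow core" choice below are illustrative guesses. The function names are hypothetical as well.

```python
def pick_exchange(fast_queue, slow_queue, threshold):
    """Pick one (thread_on_fast, thread_on_slow) pair to swap, or None.

    fast_queue / slow_queue: dict of thread id -> saved LLC miss delay.
    Assumption: delay > threshold means memory-bound, otherwise processor-bound.
    """
    mem_bound_on_fast = [t for t, d in fast_queue.items() if d > threshold]
    cpu_bound_on_slow = [t for t, d in slow_queue.items() if d <= threshold]
    if not mem_bound_on_fast or not cpu_bound_on_slow:
        return None  # nothing misplaced on one side: do nothing
    # Assumption: swap the most memory-bound thread on the fast core with
    # the least memory-bound thread on the slow core.
    worst = max(mem_bound_on_fast, key=fast_queue.get)
    best = min(cpu_bound_on_slow, key=slow_queue.get)
    return worst, best


def exchange(fast_queue, slow_queue, pair):
    """Swap the pair; each thread's LLC-delay value moves with it."""
    on_fast, on_slow = pair
    slow_queue[on_fast] = fast_queue.pop(on_fast)
    fast_queue[on_slow] = slow_queue.pop(on_slow)
```

Because threads are exchanged one for one, the number of tasks in each queue is unchanged, so the scaled loads stay the same and the exchange does not destroy balance.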
Evaluation and analysis(1)
Platform
−Godson-3A-heter
  Four cores: one runs at 1 GHz, three run at 500 MHz
  Asynchronous FIFOs are used for synchronization
Benchmark
−SPEC CPU2000
[Figure: Godson-3A-heter block diagram. Cores P0 to P3; X1 Switch with ports m0 to m5 and s0 to s5; blocks s0 to s3; two DMA Controllers; two HT blocks; X2 Switch; memory controllers MC0 and MC1; Asynchronous FIFO.]
Evaluation and analysis(2)
Applications’ execution speedup
−Compared to the original OS
−LLC miss rate: 15.4% performance improvement
−LLC miss delay: 19.8% performance improvement
−Application groups with higher heterogeneity get higher performance improvement
  The third group has the highest improvement
  The second group has the lowest improvement
Thanks !