Consolidation of Real-Time Systems into a Multi-Core...
Transcript of Consolidation of Real-Time Systems into a Multi-Core...
CMAS CPSWeek’15
Consolidation of Real-Time Systems
into a Multi-Core Platform
[email protected] [email protected]
Raj Rajkumar Hyoseung Kim
CMAS CPSWeek’15
Consolidation onto Multicores
• Benefits
– Reduces the number of CPUs and wiring harness among them
– Leads to a significant reduction in cost and space requirements
Vendor #1 Vendor #2
Vendor #3 Vendor #4
Real-time System
Consolidation
Multi-core SBC
Single-core SBCs
2
CMAS CPSWeek’15
Key Challenges
Timing predictability without losing efficiency
Temporal isolation among consolidated workloads
(Quantification & minimization of temporal interference)
Use of standardized COTS multi-core platforms
3
CMAS CPSWeek’15
Multi-Core Memory Hierarchy
Core 1 Core 2 Core 3 Core P
L1/L2
Cache
L1/L2
Cache
L1/L2
Cache
L1/L2
Cache
…
Last-level Shared Cache
Memory Controller
DRAM
Bank 1
DRAM
Bank 2
DRAM
Bank 3
DRAM
Bank B
…
Cache
interference
Memory bus
interference
DRAM bank
interference
4
CMAS CPSWeek’15
Last-Level Shared Cache
• A large cache shared among processing cores
– Reduces memory access time (7 – 10x faster than DRAM access)
– Mitigates DRAM bandwidth consumption
– Allows high cache utilization
Intel Core i7
8-15 MB L3 Cache
Freescale P4080
2MB L3 Cache
5
CMAS CPSWeek’15
Uncontrolled Use of Shared Cache
1. Inter-core Interference
P1 P2 P3
L1
L2
L1
L2
L1
L2
L3
P4
L1
L2
Core
Private
Cache
Shared
Cache
2. Intra-core Interference
P1 P2 P3
L1
L2
L1
L2
L1
L2
L3
P4
L1
L2
Core
Private
Cache
Shared
Cache
43% Slowdown 27% Slowdown (C:2.5ms, T:10ms)
* PARSEC Benchmark on Intel i7 6
CMAS CPSWeek’15
Coordinated Cache Management
• Challenges
– Uncontrolled shared cache: Cache interference penalties
– Cache partitioning:
• Limited number of cache partitions
• Cache under-utilization
• Key idea: Controlled sharing of partitioned caches
while ensuring timing predictability
1. Provides predictability on multi-core real-time systems
2. Addresses the limitations of cache partitioning
[ECRTS14] Hyoseung Kim, Arvind Kandhalu, and Raj Rajkumar. A Coordinated Approach for Practical OS-Level Cache Management in Multi-Core Real-Time Systems. In ECRTS, 2013.
7
CMAS CPSWeek’15
Coordinated Cache Management
Tasks
+
+
+
Bounded Penalties
Memory partitions
Cache partitions
1
2
3
NP -1
NP
1. Per-core Cache
Reservation
2. Reserved Cache Sharing
3. Cache-Aware
Task Allocation
Task Parameters
Coordinated Cache
Management
τi :(Ci
p, Ti, Di, Mi )
Core
1
Core
2
Core
NC
…
Pa
rtit
ion
ed
Fix
ed
-pri
ori
ty S
ch
ed
uli
ng
…
…
Pa
ge
Co
lori
ng
(C
ac
he
Pa
rtit
ion
ing
)
Per-core cache reservation
Prevent Inter-core cache interference Reserved cache sharing: Mitigate the problems with page coloring
Considerations 1. Analyzing intra-core interference
2. Guaranteeing memory requirements
8
CMAS CPSWeek’15
Coordinated Cache Management
Tasks
+
+
+
Bounded Penalties
Memory partitions
Cache partitions
1
2
3
NP -1
NP
1. Per-core Cache
Reservation
2. Reserved Cache Sharing
3. Cache-Aware
Task Allocation
Coordinated Cache
Management Core
1
Core
2
Core
NC
…
Pa
rtit
ion
ed
Fix
ed
-pri
ori
ty S
ch
ed
uli
ng
…
…
Pa
ge
Co
lori
ng
(C
ac
he
Pa
rtit
ion
ing
)
Task Parameters
τi :(Ci
p, Ti, Di, Mi )
Cache-Aware Task Allocation
• Algorithm to allocate tasks and cache partitions to cores
• Exploits the benefits of cache sharing by load concentration
9
CMAS CPSWeek’15
Multi-Core Memory Hierarchy
Core 1 Core 2 Core 3 Core P
L1/L2
Cache
L1/L2
Cache
L1/L2
Cache
L1/L2
Cache
…
Last-level Shared Cache
Memory Controller
DRAM
Bank 1
DRAM
Bank 2
DRAM
Bank 3
DRAM
Bank B
…
Cache
interference
Memory bus
interference
DRAM bank
interference
10
CMAS CPSWeek’15
DRAM Organization
DRAM Rank CHIP
1
CHIP
2
CHIP
3
CHIP
4
CHIP
5
CHIP
6
CHIP
7
CHIP
8
Data bus 8-bit
Address bus
Command bus
64-bit
DRAM Chip
Bank 1
Com
ma
nd d
eco
de
r
Data bus
Address bus
Command bus
8-bit
Bank ... Bank 8
Columns
Row
s
Row buffer
Row
de
co
der
Column decoder
Ro
w a
dd
ress
Column address
DRAM access latency depends on which row is currently in the row buffer
Row hit
Row conflict
11
CMAS CPSWeek’15
Memory Controller
Request buffer Bank 1 Priority Queue
Bank 2 Priority Queue
Bank B Priority Queue
...
Bank 1 Scheduler
Bank 2 Scheduler
Bank B Scheduler
...
Channel Scheduler
Memory scheduler
Read/
Write
Buffers
DRAM address/command buses
Processor data bus
DRAM data bus
Memory requests from CPU cores
Two-level hierarchical
scheduling structure
(FR-FCFS)
Inter-bank interference at channel scheduler
Intra-bank interference at bank scheduler Memory interference
12
CMAS CPSWeek’15
DRAM Bank Partitioning & Sharing
• DRAM bank partitioning: software method to assign dedicated DRAM
banks to each core (or task)
• Problems
– Limited number of DRAM bank partitions
– Reduced usable memory size
Bank 1 Scheduler
Bank 2 Scheduler
Channel Scheduler
Core 1 Core 2
(1) w/o bank partitioning
Bank 1 Scheduler
Bank 2 Scheduler
Channel Scheduler
Core 1 Core 2
(2) w/ bank partitioning
Intra-bank and inter-bank
interference Inter-bank interference only
DRAM bank sharing
13
CMAS CPSWeek’15
Bounding Memory Interference Delay
• Provides an upper bound on memory interference delay
– Intra-bank and inter-bank interference delays
– Private and shared DRAM banks
Response-Time Based
Schedulability Analysis
[RTAS14] Hyoseung Kim et al. Bounding Memory Interference Delay in COTS-based Multi-Core Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2014. Best Paper Award
Request-driven bounding
Task’s own memory requests
Job-driven bounding
Interfering memory requests
during the task execution
14
CMAS CPSWeek’15
Multi-Core Memory Hierarchy
Core 1 Core 2 Core 3 Core P
L1/L2
Cache
L1/L2
Cache
L1/L2
Cache
L1/L2
Cache
…
Last-level Shared Cache
Memory Controller
DRAM
Bank 1
DRAM
Bank 2
DRAM
Bank 3
DRAM
Bank B …
Cache
interference
Memory bus
interference
DRAM bank
interference
Operating System
Task τ1
Task τ2
Task τ3
Task τ4
Task τ5
Task τ6
15
CMAS CPSWeek’15
Linux/RK
• CMU’s open source real-time extensions to the Linux kernel
• Resource kernel (RK) approach
• Resource reservation: tasks can reserve a portion of system resources
– Ex) CPU reserve: (budget, replenishment period, CPU affinity, policy)
• Guarantees resource allocations at admission time
• Enforces resource usage
Resource
Set
CPU
Reserve Task
1
Network
Reserve
Memory
Reserve
Task
2
Task
3
16
CMAS CPSWeek’15
Multi-Core Memory Management
• Linux/RK memory manager
– Supports software cache and DRAM partitioning
– Allows tasks to specify their memory size, cache, DRAM bank demands
Task i: memory reservation - Memory size = m pages - Cache partitions = h - DRAM bank partitions = b
Real-time task
c Task i: allocated pages
Linux/RK Memory Manager
Cache 1 Cache 2 Cache 3 Cache H ...
Bank 1
Bank 2
Bank B
...
...
...
Bank 1
Bank 2
Bank B
...
...
...
17
Unallocated physical pages
CMAS CPSWeek’15
Cache Interference
• Task execution time and L3 misses
– Intel i7 2600 3.4GHz quad-core, 8MB L3 cache, DDR3-1333
Task Instance
10
11
12
13
14
0 10 20 30 40 50 60 70 80 90 100
Exe
cuti
on
Tim
e
(mse
c)
w/o Cache Reservation
w/ Cache Reservation (p=12)
0
3
6
9
12
0 10 20 30 40 50 60 70 80 90 100
L3 m
isse
s p
er
mse
c (x
10
00
)
w/o Cache Reservation
w/ Cache Reservation (p=12)
Solo WCET
Solo WC L3 misses
18
CMAS CPSWeek’15
Memory Interference
• Shared DRAM Bank
0
200
400
600
800
1000
1200
1400
No
rm. R
esp
on
se T
ime
(%
)
black- scholes
body- track
canneal ferret fluid- animate
freq- mine
ray- trace
stream- cluster
swap- tions
vips x264
12x increase due to
memory interference
Observed
Predicted by analysis
19
CMAS CPSWeek’15
Memory Interference
• Private DRAM Bank
black- scholes
body- track
canneal ferret fluid- animate
freq- mine
ray- trace
stream- cluster
swap- tions
vips x264 0
100
200
300
400
500
No
rm. R
esp
on
se T
ime
(%
)
Observed
Predicted by analysis
4.1x increase DRAM bank partitioning
helps reducing the memory interference
Our analysis enables the quantification of the benefit of DRAM bank partitioning
20
CMAS CPSWeek’15
Memory Interference
• Private DRAM Bank under moderate memory contention
0
20
40
60
80
100
120
140
160
180
No
rm. R
esp
on
se T
ime
(%
) Average over-estimates are 8%
(13% for a shared bank) Observed
Predicted by analysis
black- scholes
body- track
canneal ferret fluid- animate
freq- mine
ray- trace
stream- cluster
swap- tions
vips x264
Our analysis bounds memory interference delay with low pessimism
under both severe and moderate memory contentions
21
CMAS CPSWeek’15
Multi-Core Virtualization
22
CMAS CPSWeek’15
Real-Time System Virtualization
• Barrier to consolidation
– Each app. could have been developed
independently by different vendors
• Bare-metal / Proprietary OS
• Linux
– Legacy software
– Different license issues
• Consolidation via virtualization
– Each application can maintain
its own implementation
– Minimizes re-certification process
– IP protection, license segregation
– Fault isolation
Virtualization
Multi-core CPU
Real-Time Hypervisor
23
CMAS CPSWeek’15
VM Scheduling Structure
• Two-level hierarchical scheduling structure
– Task scheduling and VCPU scheduling
VM1
VCPU1
Task τ1
Task Scheduler
Task τ2
VCPU2
Task τ3
Task Scheduler
Task τ4
Hypervisor
Physical Core 1 (PCPU1)
VCPU Scheduler
VM2
VCPU3
Task τ5
Task Scheduler
Task τ6
VCPU4
Task τ7
Task Scheduler
Task τ8
Physical Core 2 (PCPU2)
VCPU Scheduler
24
CMAS CPSWeek’15
Virt/RK
• Real-time virtualization with resource kernel approach
– CPU reservation for VCPUs
– Memory reservation for VMs
– Current implementation: Virt/RK::KVM-x86
– Under development: Virt/RK::KVM-ARM, Virt/RK::L4
VM1
VCPU1
Task τ1
Task Scheduler
Task τ2
VCPU2
Task τ3
Task Scheduler
Task τ4
VM Resource Reservation
• VCPU1: 30% of physical CPU
• VCPU2: 30% of physical CPU
• VM1: 20% of host memory w/
cache & DRAM bank
partitioning
25
CMAS CPSWeek’15
Resource Sharing in Virtualization
• Consolidation inevitably causes the sharing of resources
– Sensors
– Network interfaces
– I/O devices
– Shared data
• Long blocking time in a virtualized environment
– VCPU level preemptions
– VCPU budget depletion
Requires mutually-exclusive locks
to avoid race conditions
Need a synchronization mechanism with bounded blocking times
for multi-core real-time virtualization
26
CMAS CPSWeek’15
vMPCP
• Virtualization-aware Multiprocessor Priority Ceiling Protocol
– Provides bounded blocking time on accessing shared resources
in multi-core virtualization
• Hierarchical priority ceilings
• Two-level priority queue for a mutex waiting list
– Optional VCPU budget overrun to reduce blocking times
– VCPU budget replenishment policies
• Periodic server
• Deferrable server
– Implemented on Virt/RK::KVM-x86
[RTSS14] Hyoseung Kim, Shige Wang, and Raj Rajkumar. vMPCP: A Synchronization Framework for Multi-Core Virtual Machines. In IEEE Real-Time Systems Symposium (RTSS), 2014.
27
CMAS CPSWeek’15
Case Study: Effect of vMPCP
Baseline
Virtualization-unaware
synchronization protocol
(MPCP)
vMPCP
Virtualization-aware
synchronization protocol
τ1
τ2
τ3
τ4
τ5
τ6
τ7
τ8 (μsec)
τ1
τ2
τ3
τ4
τ5
τ6
τ7
τ8 (μsec)
vMPCP yields 29% shorter task response time with shared resources
28
CMAS CPSWeek’15
Summary
• Workload Consolidation into a Multi-core Platform
– Predictability without losing efficiency
– Quantification and minimization of temporal interference
• Multi-core memory hierarchy
– Coordinated cache management
• Cache partitioning, reserve, and sharing
– Bounding memory interference
• DRAM partitioning and sharing Quantifies the benefit of partitioning
• Memory-bus and DRAM bank interference analysis
• Real-time virtualization
– Virt/RK: resource reservation approach
– vMPCP: resource sharing for real-time virtualization
29
CMAS CPSWeek’15
What next?
• Modeling state-of-the-art computer architecture techniques
– Hardware prefetchers
– Future memory controllers
– Future DRAM designs
• Extending existing real-time systems work to virtualization
– Diverse locking protocols (e.g., RW-lock)
– Fault tolerance mechanisms
– Hardware accelerations (e.g., GPGPU)
30