WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using...
Transcript of WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using...
WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache
Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund
Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier, Toulouse
Outline
• Cache Coherence in Todays Architectures
• Cache Coherence and Static WCET-Analysis
• In Short: The On-Demand Coherent Cache
• Static WCET-Analysis with ODC²
• Evaluation
• Conclusion
2
Cache Coherence in Todays Architectures
• Caches are part of almost every multi/many-core architecture.
• Accesses on shared data induce inconsistence of data in caches.
• Coherent accesses are preserved by cache coherence protocols.
• Coherence protocols affect the timing of data accesses.
• Coherence protocols affect the state and content of the caches.
• Coherence protocols affect a static worst-case execution time analysis.
3
Inconsistent data
Core
Cache X = 0
Core
Cache X = 1
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Policy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
4
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
[1] Aeroflex Gaisler LEON4 Product Sheet
Quad Core LEON4 Block Diagram [1]
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Strategy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
5
Core 0
Cache
Core 1
Cache
Shared Memory Data X = 0
Data X = 0 Data X = 0
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Strategy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
6
Core 0
Cache
Core 1
Cache
Shared Memory Data X = 0
Data X = 1 Data X = 0
Write X = 1
Data X = 1 Data X = 1
Invalid X
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Strategy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
7
Core 0
Cache
Core 1
Cache
Shared Memory Data X = 1
Data X = 1
Read X = 1
Data X = 1
Cache Coherence and Static WCET-Analysis
8
• Estimation of the WCET using a static analysis tool
• Abstraction of the hardware architecture incl. caches
• Cache Analysis to calculate data access latencies
Overview of the OTAWA framework [2]
• Prediction of cache-hits and cache-misses • Abstraction of possible cache states (Must/May Analysis) • Modification of cache state by data accesses in program • Analysis of private cache in isolation from other cores
Cache Analysis
[2] Ballabriga, C. et al.: OTAWA: an Open Toolbox for Adaptive WCET Analysis. In: The 8th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems
Cache Analysis
Cache Coherence and Static WCET-Analysis
9
CORE
Cache
• External Invalidations turn the cache into a shared resource
• Target of invalidation is unknown -> Invalidation of any data must be assumed
• Time of invalidation is unknown -> Invalidation must be assumed at any possible time
• Prediction of cache-hits is nearly impossible
External invalidations prohibit a feasible WCET estimation.
Data X = 1
External Invalidation
Data X = 0
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Policy
• Alternative: Pure software-based cache coherence using the Cache-Flush instruction.
10
Cache-Flush: Invalidation of the complete cache triggered by a special instruction.
Quad Core LEON4 Block Diagram [1]
[1] Aeroflex Gaisler LEON4 Product Sheet
On-Demand Coherent Cache (ODC²)
• Software/Hardware co-approach
• Handling of coherence on demand
• Different handling of accesses to private and shared data
• Access to shared data only in protected code regions
• Invalidation of shared data after use
• No coherence transactions
11
Core ODC Cache
Replacement information
Control flags (Valid, Modified)
Indication of possible carrier of shared data
Shared Bit Crtl Repl. info Data Tag
ODC² Block-Frame
On-Demand Coherent Cache (ODC²)
12
…
- Accesses on private data -
…
Mutex Lock / Barrier;
enter_shared_mode();
…
- Accesses on private/shared data -
…
exit_shared_mode();
Mutex Unlock / Barrier;
…
- Accesses on private data -
…
ODC² in Private-Mode
• No coherence-operations
• Functionality and timing equal to incoherent cache
ODC² in Shared-Mode
• Newly loaded cache-lines are marked as shared
• Checking of target address
• Write-Through Policy
ODC² in Restore-Procedure
• Invalidation of marked cache-lines
ODC² in Private-Mode
• No coherence-operations
• Functionality and timing equal to incoherent cache
Abstract Program ODC² Functionality
Cache Analysis
Static WCET-Analysis with ODC²
13
CORE
ODC Cache
• Cache remains private resource • Time and target of invalidation is known • Invalidation increases cache-miss rate • Invalidation improve efficiency • Influence on timing is bounded and
predictable • Additional Shared Analysis to identify
accesses to shared data • New classification of accesses in Must/May
Analysis: Always Shared, Never Shared, Sometimes Shared
Data X = 0
Evaluation
14
OTAWA evaluation framework
parMERASA real-time capable many-core architecture incl. ODC²
Parallel benchmarks
Program flow and synchro-nisation
Cache analysis enhanced with ODC² model
WCET analysis of ODC² and common alternatives using the OTAWA toolbox
System architecture:
• Tree-type interconnect
• PPC-based cores with private L1-Caches
• Shared memory
Parallel Benchmarks:
• Matrix Multiplication
• Fast Fourier Transformation
• Dijkstra Shortest Path
Evaluated Platforms:
• ODC²
• Uncached shared data
• Software Cache-Flush
• Magic Coherence
Evaluation
15
• WCET Analysis regarding parallel execution with 2-8 cores.
• Significantly reduced WCET compared to Uncached and Cache-Flush
• Reduced accesses to shared memory with ODC²
• Coherence overhead of ODC² depends on access pattern to shared memory
Matrix FFT Dijkstra
0
50000
100000
150000
200000
250000
300000
350000
Num
be
r o
f A
cce
sse
s
Matrix FFT Djikstra
0,00
0,50
1,00
1,50
2,00
2,50
3,00
3,50
Uncached
Flush
ODC²
Magic
No
rmalis
ed
WC
ET
Matrix FFT Dijkstra
0
50000
100000
150000
200000
250000
300000
Num
be
r o
f W
T A
cce
sse
s
Conclusion
16
• Existing hardware cache coherence techniques are not suitable for a static WCET analysis, software coherence techniques lack in performance.
• ODC² permits time-predictable hardware/software co-approach.
• Static timing analysis using ODC² allows significant reduction of WCET compared to common alternatives.
• Promising approach for hard real-time multicore systems.
Thank you for your attention
Any questions?