Transcript of posters: SPM Management Framework / Caches in WCET Analysis (…esaule/NSF-PI-CSR-2017-website/posters…)
Scaling the Real-time Capabilities of Powertrain Controllers in Automotive Systems
Background
- The complexity and the size of automotive control systems are increasing.
- Such systems have hard real-time constraints, thus high-performance yet predictable control systems are needed.
PI: Aviral Shrivastava, Compiler Microarchitecture Lab, Arizona State University
[Framework diagram: TaskInfo (source code, task period) and ArchitectureInfo (# of cores, SPM size per core, core speed) feed mapping/scheduling. An Inter-task Analyzer performs inter-task SPM partitioning, inter-task runtime management, and WCRT estimation; an Intra-task Analyzer performs WCET estimation and intra-task SPM partitioning; a Code Synthesizer performs code generation.]
[Figure: application size/complexity over time, from 4-bit microprocessors in the 80's, to multiple processors over an in-vehicle network in the 2000's, to vehicles communicating with other vehicles and road infrastructure in the 2020's.]
• Increasing regulations on fuel efficiency & emissions
• More and more ADAS (Advanced Driver Assistance Systems)
…
Problem
- Current control system design in practice relies on testing, even though correctness can only be guaranteed by static analysis at design time, not by testing.
- This is because of the unacceptable amount of pessimism in the analysis results.
[Figure: histogram of execution times (# of measurements vs. execution time). The range observed by testing covers only part of all execution scenarios, while the safe upper bound found by analysis lies far beyond it: too much overestimation!]
Caches: The main culprit
- HW uncertainties are becoming more significant due to ever-increasing application sizes and system complexity.
- Caches are the main source that increases the uncertainties.
[Figure: significance of timing uncertainties over time: SW uncertainties (loop bounds, pointer values, infeasible paths, …) vs. microarchitectural-state uncertainties (caches, pipeline, branch predictors).]
• MISRA C coding standards
• user annotations
...
• Application size grows beyond on-chip memory sizes
• Resource sharing in multi-tasking on multi-core architectures
Solution: Using scratchpad memories (SPMs)
Caches make:
- WCET analysis difficult and pessimistic
- WCRT analysis practically impossible
Replace caches with scratchpad memories (SPMs). SPMs have the following benefits:
- Tight WCET estimation thanks to explicit management
- Better performance than caches
- [DAC 2013] SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs)
- [CODES+ISSS 2013] CMSM: An Efficient and Effective Code Management for Software Managed Multicores
SPM Management Framework
[Figure: an SPM divided into regions 0 and 1; code partitions A, B, C are loaded by explicit DMA: … DMA load A to 0 … DMA load B to 1 … DMA load C to 0 …]
Predictable memory behavior
- Using SPMs enables various optimizations in compiler and scheduler to improve real-time capabilities, tailored to each application (no hardwired logic).
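The loader behavior in the diagram above is fully deterministic, which is what these optimizations exploit. A minimal Python sketch (the function name, trace, and mapping are illustrative, not the framework's API) replays a call trace against a partition-to-region mapping and records exactly which DMA transfers occur:

```python
def spm_dma_loads(trace, region_of):
    """Replay a code-partition access trace against an SPM divided into regions.

    region_of maps each code partition to the region it is assigned to.
    A DMA load is needed only when the region currently holds a different
    partition; the behavior depends only on the trace and the mapping,
    which is what makes the WCET estimation tight.
    """
    resident = {}               # region id -> partition currently loaded
    loads = []
    for part in trace:
        r = region_of[part]
        if resident.get(r) != part:
            loads.append((part, r))   # explicit DMA transfer
            resident[r] = part
    return loads

# The poster's example: A and C share region 0, B lives in region 1.
mapping = {"A": 0, "B": 1, "C": 0}
print(spm_dma_loads(["A", "B", "C"], mapping))
# -> [('A', 0), ('B', 1), ('C', 0)]
```

Because the resident contents depend only on the trace and the mapping, an analysis can count these transfers exactly instead of approximating cache states.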
Results with Code Management
[Figure: normalized WCET estimates, cache vs. SPM, for twelve benchmarks annotated with (code size, max function size): matmult (1.1KB, 0.3KB), lms (3.9KB, 1KB), compress (3.1KB, 0.9KB), sha (3.2KB, 1.2KB), edn (4.6KB, 1.9KB), adpcm (8.3KB, 2.3KB), rijndael (9.3KB, 3.1KB), statemate (11KB, 3.5KB), 1REG (28KB, 7.6KB), DAP1 (36KB, 28KB), susan (51KB, 11KB), DAP3 (56KB, 42KB). Legend: Cache C, Cache L, SPM C, SPM L, SPM M. Per-benchmark WCET changes: -0.1%, -0.2%, -24%, -43%, -8%, -9%, -81%, -70%, -76%, -90%, -89%, -79%.]
Figure 5: WCET is reduced by up to 90% compared to a 4-way set-associative cache of 4KB with LRU replacement policy.
the cache miss overhead is relatively significant due to low iteration counts (< 50), so our approach shows a more significant WCET reduction thanks to the burst-mode DMA transfers. The WCET reduction is significant for larger benchmarks. This is mainly caused by large nested loops with if-then-else or switch-case conditions, which frequently appear in these benchmarks. Without condition statements, once an instruction is loaded into the cache line by a cold miss, fetches of the instruction can become cache hits for the rest of the iterations. However, when there are many condition statements in a loop, the number of instructions competing for a cache line in an abstract cache state increases and finally becomes larger than the associativity of the cache, which results in always-miss (assuming cache misses for all iterations) [4]. This effect becomes significant in nested loops since the iteration counts can be high. With the SPM, on the other hand, DMA operations in loops can be avoided in most cases thanks to the code mapper, which can find an optimal partition-to-region mapping for WCET in the given SPM size. Splitting functions into partitions greatly helps here because it can shrink functions' footprints by keeping only the important parts, such as loop bodies, in the SPM. The largest functions of '1REG', 'DAP1', 'susan', and 'DAP3' are much larger than 4KB. These benchmarks cannot run on a 4KB SPM without function splitting.
5. CONCLUSIONS
This paper presents a new code management technique for SMM architectures. The proposed technique splits functions into partitions, whereas all previous techniques manage code at the granularity of functions. Splitting functions can improve SPM space utilization and remove the limitation on application sizes caused by function-granularity management. The functions to split and the number of partitions are selected heuristically using a feedback loop from an optimal code mapping technique that allocates SPM space for each partition and estimates the WCET. The results on various benchmarks, including automotive control applications, show that the proposed technique can significantly reduce the WCET estimates compared to the optimal solutions found by the function-granularity technique and a 4-way set-associative cache of the same size. Our future work includes making the function splitting algorithm optimal and finding the minimal SPM size for a task to meet its deadline.
6. REFERENCES
[1] K. Bai, J. Lu, A. Shrivastava, and B. Holton. CMSM: An Efficient and Effective Code Management for Software Managed Multicores. In Proc. CODES+ISSS, 2013.
[2] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, 2010.
[3] D. Broman, P. Derler, and J. C. Eidson. Temporal issues in cyber-physical systems. JIISc, 93(3), 2013.
[4] C. Cullmann. Cache persistence analysis: Theory and practice. ACM TECS, 12(1s), 2013.
[5] H. Falk and J. C. Kleinsorge. Optimal Static WCET-Aware Scratchpad Allocation of Program Code. In Proc. DAC, 2009.
[6] H. Falk and H. Kotthaus. WCET-driven cache-aware code positioning. In Proc. CASES, 2011.
[7] Gurobi Optimization, Inc. Gurobi Optimizer Reference Manual, 2013.
[8] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The Mälardalen WCET Benchmarks - Past, Present and Future. In Proc. WCET, 2010.
[9] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. WWC, 2001.
[10] S. Hepp and F. Brandner. Splitting functions into single-entry regions. In Proc. CASES, 2014.
[11] S. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, 2010.
[12] Y. Kim, D. Broman, J. Cai, and A. Shrivastava. WCET-Aware Dynamic Code Management on Scratchpads for Software-Managed Multicores. In Proc. RTAS, 2014.
[13] M. Schoeberl. A time predictable instruction cache for a Java processor. LNCS, 3292, 2004.
[14] A. Schranzhofer, J.-J. Chen, and L. Thiele. Timing analysis for TDMA arbitration in resource sharing systems. In Proc. RTAS, 2010.
[15] L. Thiele and R. Wilhelm. Design for timing predictability. Real-Time Systems, 28, 2004.
[16] J. Whitham and N. Audsley. Implementing Time-Predictable Load and Store Operations. In Proc. EMSOFT, 2009.
[17] H. Wu, J. Xue, and S. Parameswaran. Optimal WCET-Aware Code Selection for Scratchpad Memory. In Proc. EMSOFT, 2010.
Toyota proprietary powertrain controller models
[Results-chart legend: management overhead, DMA operations overhead, computation time, cache miss overhead.]
4KB SPM vs. 4KB 4-way set-associative cache (LRU)
Acknowledgement
This work was supported in part by the Center for Hybrid and Embedded Software Systems (CHESS) at UC Berkeley, Toyota Motors, and the National Science Foundation grant CNS 1525855. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the sponsors.
- SPM space is partitioned per task and per type of data to privatize memory accesses.
- SPM space is allocated to minimize memory interference in the worst-case execution path.
[Figure: multicore with a local SPM per core. Tasks τ1, τ2, τ3 are mapped to cores (task mapping, inter-task SPM partitioning); within a task's SPM partition, space is further partitioned per type of data (code, stack, global) and allocated to functions f1, f2, f3, f4 (intra-task SPM partitioning, SPM space allocation).]
Caches in WCET Analysis
Jan Reineke, Saarland University
Contact Info: eMail: [email protected] · Web: http://rw4.cs.uni-sb.de/~reineke · Phone: +49 681 302 55 73 · Fax: +49 681 302 30 65
Uncertainty in Cache Analysis
[Figure: control-flow fragment with the accesses read z, read y, read x, write z.]
1. Initial cache contents unknown.
2. Need to combine information.
3. Cannot resolve address of z.
⇒ Amount of uncertainty determined by ability to recover information
Jan Reineke · Caches in WCET Analysis · November 7th, 2008
Uncertainty in Cache Analysis: Predictability Metrics
[Figure: Evict/Fill illustration on cache-set states [dex], [fde], [gfd], [hgf], [fec], [gfe], [fed] under the access sequence ⟨a, …, e, f, g, h⟩.]
Jan Reineke Caches in WCET Analysis November 7th , 2008 11 / 33
Predictability Metrics: Evaluation of Policies

Policy | Evict(k)           | Fill(k)                | Evict(8) | Fill(8)
LRU    | k                  | k                      | 8        | 8
FIFO   | 2k − 1             | 3k − 1                 | 15       | 23
MRU    | 2k − 2             | ∞ / 3k − 4             | 14       | ∞ / 20
PLRU   | (k/2) log2 k + 1   | (k/2) log2 k + k − 1   | 13       | 19

LRU is optimal w.r.t. the metrics. Other policies are much less predictable.
⇒ Use LRU.
How to obtain may- and must-information within the given limits for other policies?
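As a quick sanity check on the closed forms in the table, the 8-way column can be recomputed from the k-parameterized columns (a sketch; only LRU, FIFO, and PLRU are included, since the MRU fill entry is not a single closed form, and PLRU requires k to be a power of two):

```python
from math import log2

# Closed forms from the Evict/Fill table (associativity k).
evict = {
    "LRU":  lambda k: k,
    "FIFO": lambda k: 2 * k - 1,
    "PLRU": lambda k: k // 2 * int(log2(k)) + 1,
}
fill = {
    "LRU":  lambda k: k,
    "FIFO": lambda k: 3 * k - 1,
    "PLRU": lambda k: k // 2 * int(log2(k)) + k - 1,
}

for p in evict:
    print(p, evict[p](8), fill[p](8))
# LRU 8 8, FIFO 15 23, PLRU 13 19: matching the table's 8-way columns
```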
Evaluation of Policies
LRU is optimal w.r.t. the metrics; FIFO, MRU, and PLRU are much less predictable.
⇒ use LRU
Relative Competitiveness of Cache Replacement Policies
Jan Reineke, Daniel Grund
Saarland University
Introduction
In order to fulfill stringent performance requirements, caches are now also used in hard real-time systems. In such systems, upper and lower bounds on the execution times of a task have to be computed. To obtain tight bounds, timing analyses must take into account the cache architecture. However, developing cache analyses – analyses that determine whether a memory access is a hit or a miss – is a difficult problem for some cache architectures.
Goal
Determine safe bounds on the number of cache hits and misses by a task T under FIFO(k), PLRU(l), or any other replacement policy.
Approach
1. Determine the competitiveness of the desired policy P relative to a policy Q: m_P ≤ k · m_Q + c.
2. Compute a performance prediction of task T for policy Q by cache analysis: m_Q(T).
3. Calculate upper bounds on the number of misses for P using the cache analysis results for Q and the competitiveness results of P relative to Q: m_P(T).
Relative Competitiveness
Competitiveness (Sleator and Tarjan): worst-case performance of an online policy relative to the optimal offline policy.
Relative competitiveness: worst-case performance of an online policy relative to another online policy.
Let m_P(q, s) be the number of misses incurred by policy P, starting in cache-set state q, processing memory access sequence s.
Definition 1 (Relative Miss-Competitiveness). Policy P is k-miss-competitive relative to policy Q with additive constant c, if

m_P(p, s) ≤ k · m_Q(q, s) + c

for all access sequences s ∈ S and compatible cache-set states p ∈ C_P, q ∈ C_Q.
"Policy P will incur at most k times the number of misses of policy Q plus constant c on any access sequence."
Hit-competitiveness is defined analogously. The competitive miss ratio c^m_{P,Q} is the smallest k such that P is k-miss-competitive relative to Q with some additive constant.
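To make the definition concrete (an illustration, not the authors' analysis tool), one can simulate single LRU and FIFO cache sets from empty, compatible states and check the miss bound; the pair (k, c) = (4, 3) is the "FIFO vs LRU" entry at associativity 4 in the results table:

```python
import random

def misses(policy, assoc, seq):
    """Count misses of one LRU or FIFO cache set of the given
    associativity, starting empty, on an access sequence."""
    contents, m = [], 0       # oldest element first
    for x in seq:
        if x in contents:
            if policy == "LRU":           # refresh recency; FIFO keeps order
                contents.remove(x)
                contents.append(x)
        else:
            m += 1
            if len(contents) == assoc:
                contents.pop(0)           # evict least-recent / first-in
            contents.append(x)
    return m

# Empirically check: FIFO(4) is 4-miss-competitive relative to LRU(4)
# with additive constant 3, i.e. m_FIFO <= 4 * m_LRU + 3.
rng = random.Random(0)
for _ in range(200):
    seq = [rng.choice("abcdefgh") for _ in range(50)]
    assert misses("FIFO", 4, seq) <= 4 * misses("LRU", 4, seq) + 3
print("bound m_FIFO <= 4*m_LRU + 3 held on all sampled sequences")
```

Random sampling can of course only corroborate the bound, not prove it; the poster's transition-system construction is what establishes it for all sequences.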
Computing Competitive Ratios
P and Q induce a transition system that captures how the two policies act relative to each other, processing the same memory accesses:
[Figure 1: small part of the transition system in the computation of competitiveness results for FIFO(4) vs. LRU(4). Nodes are pairs of cache-set states such as ([eabc]_FIFO, [eabc]_LRU); edges are labeled with the memory access and the pair of hit/miss outcomes, e.g. e (m, m), c (h, m), d (m, h).]
Competitive ratio = maximum ratio of misses in policy P relative to the number of misses in policy Q in the transition system.
Quotient Transition System
Problem: The induced transition system is infinitely large.
Goal: Construct a finite transition system with the same properties.
Observation: Only the relative positions of elements matter. For example, [abc]_LRU, [b e]_FIFO and [fgl]_LRU, [g m]_FIFO are equivalent, as are their (h, m) successors [cab]_LRU, [cb ]_FIFO (on access c) and [lfg]_LRU, [lg ]_FIFO (on access l).
Two pairs of cache-set states are ∼-equivalent if they can be transformed into each other by a renaming of their contents. Merging equivalent pairs of cache-set states yields a finite quotient transition system that retains the competitiveness properties:
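The renaming equivalence ∼ can be sketched as a canonicalization that maps every pair of cache-set states to a representative (the function name and first-occurrence renaming scheme are mine; blank slots in the poster's states are omitted):

```python
def canonical(pair):
    """Rename the blocks in a pair of cache-set states in order of first
    occurrence, so that two pairs differing only by a renaming of their
    contents map to the same representative."""
    names = {}
    def rename(state):
        out = []
        for b in state:
            if b not in names:
                names[b] = len(names)   # next fresh name
            out.append(names[b])
        return tuple(out)
    return tuple(rename(s) for s in pair)

# The poster's example: [abc]_LRU,[b e]_FIFO and [fgl]_LRU,[g m]_FIFO
# are the same pair up to renaming.
p1 = canonical((("a", "b", "c"), ("b", "e")))
p2 = canonical((("f", "g", "l"), ("g", "m")))
print(p1 == p2)
```

Merging states with equal representatives is exactly what collapses the infinite system into the finite quotient.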
[Figure 2: ∼-equivalent states are similarly colored in Figure 1. Merging equivalent states yields the quotient structure: states ([abcd]_FIFO, [abcd]_LRU), ([abcd]_FIFO, [dabc]_LRU), ([eabc]_FIFO, [edab]_LRU), ([eabc]_FIFO, [ceda]_LRU) connected by (h, h), (m, m), (m, h), and (h, m) transitions.]
Results
Miss-competitiveness (ratios, constants) (k, c) relating FIFO, PLRU, and LRU:

Associativity:  2    3    4    5    6    7    8
LRU vs FIFO    2,1  3,2  4,3  5,4  6,5  7,6  8,7
FIFO vs LRU    2,1  3,2  4,3  5,4  6,5  7,6  8,7
LRU vs PLRU    1,0   –   2,1   –    –    –   5,4
PLRU vs LRU    1,0   –    ∞    –    –    –    ∞
FIFO vs PLRU   2,1   –   4,4   –    –    –   8,8
PLRU vs FIFO   2,1   –    ∞    –    –    –    ∞
Example: LRU(4) is 2-miss-competitive relative to PLRU(4) with constant 1.PLRU(4) is not miss-competitive relative to LRU(4) at all.
Hit-competitiveness (ratios, constants) (k, c) relating FIFO, PLRU, and LRU:

Associativity:    2        3       4        5      6        7      8
LRU vs FIFO      0,0      0,0     0,0      0,0    0,0      0,0    0,0
FIFO vs LRU     1/2,1/2  1/2,1   1/2,3/2  1/2,2  1/2,5/2  1/2,3  1/2,7/2
LRU vs PLRU      1,0       –     1/2,1     –      –        –     1/8,15/8
PLRU vs LRU      1,0       –     1/2,1     –      –        –     1/4,3/2
FIFO vs PLRU    1/2,1/2    –     1/4,5/4   –      –        –     1/11,19/11
PLRU vs FIFO     0,0       –      0,0      –      –        –      0,0
Generalizations to arbitrary k
Previously unknown relations:
• PLRU(k) is 1-competitive relative to LRU(1 + log2 k).
• FIFO(k) is 1/2-hit-competitive relative to LRU(k), whereas
• LRU(k) is not l-hit-competitive relative to FIFO(k) for any l, but
• LRU(2k − 1) is 1-competitive relative to FIFO(k) with constant 0 for all k ⇒ yields the first may-analysis for FIFO.
• PLRU(k) is not l-miss-competitive relative to LRU(k) for k ≥ 4 and any l. ⇒ PLRU(k) is not l-miss-competitive (in the classical sense) for k ≥ 4.
Reference: Jan Reineke and Daniel Grund. Relative Competitive Analysis of Cache Replacement Policies. In ACM SIGPLAN/SIGBED 2008 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2008).
Contact information: http://rw4.cs.uni-sb.de/ · {reineke|grund}@cs.uni-sb.de
Automatic Computation
Finite Quotient System
Relative Competitiveness
Results and Generalizations
Problem: Cache analyses for LRU only
Idea: Transfer guarantees for LRU to FIFO, PLRU, etc.
Solution: Relative Competitiveness
Definition – Relative Miss-Competitiveness
Notation: m_P(p, s) = number of misses that policy P incurs on access sequence s ∈ M* starting in state p ∈ C_P.
Definition (Relative miss competitiveness). Policy P is (k, c)-miss-competitive relative to policy Q if

m_P(p, s) ≤ k · m_Q(q, s) + c

for all access sequences s ∈ M* and cache-set states p ∈ C_P, q ∈ C_Q that are compatible, p ∼ q.
Definition (Competitive miss ratio of P relative to Q). The smallest k such that P is (k, c)-miss-competitive relative to Q for some c.
Generalizations
Identified patterns and proved generalizations by hand, aided by example sequences generated by the tool.
Previously unknown facts:
PLRU(k) is (1, 0)-competitive relative to LRU(1 + log2 k),
⇒ LRU-must-analysis can be used for PLRU.
FIFO(k) is (1/2, (k − 1)/2)-hit-competitive relative to LRU(k), whereas LRU(k) is (0, 0)-hit-competitive relative to FIFO(k), but
LRU(2k − 1) is (1, 0)-competitive relative to FIFO(k), and LRU(2k − 2) is (1, 0)-competitive relative to MRU(k).
⇒ LRU-may-analysis can be used for FIFO and MRU
⇒ optimal with respect to the predictability metric Evict
A FIFO-may-analysis was used in the analysis of the branch target buffer of the MOTOROLA POWERPC 56X.
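One of the relations above, LRU(2k − 1) being (1, 0)-competitive relative to FIFO(k), can be spot-checked by simulation for k = 2, i.e. m_LRU(3) ≤ m_FIFO(2) starting from empty (compatible) cache sets. This is an illustration over random sequences only, not a proof:

```python
import random

def misses(policy, assoc, seq):
    """Misses of one LRU or FIFO cache set, starting empty."""
    contents, m = [], 0       # oldest element first
    for x in seq:
        if x in contents:
            if policy == "LRU":       # refresh recency; FIFO keeps order
                contents.remove(x)
                contents.append(x)
        else:
            m += 1
            if len(contents) == assoc:
                contents.pop(0)       # evict least-recent / first-in
            contents.append(x)
    return m

rng = random.Random(1)
for _ in range(500):
    seq = [rng.choice("abcde") for _ in range(40)]
    assert misses("LRU", 3, seq) <= misses("FIFO", 2, seq)
print("m_LRU(3) <= m_FIFO(2) held on all sampled sequences")
```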
6.4. RESULTS
Associativity:   2    3    4    5    6     7    8
LRU vs FIFO     2,1  3,2  4,3  5,4  6,5   7,6  8,7
LRU vs PLRU     1,0   –   2,1   –    –     –   5,4
LRU vs MRU      1,0  2,1  3,2  4,3  5,4   6,5  7,6
FIFO vs LRU     2,1  3,2  4,3  5,4  6,5   7,6  8,7
FIFO vs PLRU    2,1   –   4,4   –    –     –   8,8
FIFO vs MRU     2,1  3,3  4,4  5,5  6,6   MEM  MEM
PLRU vs LRU     1,0   –    ∞    –    –     –    ∞
PLRU vs FIFO    2,1   –    ∞    –    –     –    ∞
PLRU vs MRU     1,0   –    ∞    –    –     –   MEM
MRU vs LRU      1,0  2,1  3,2  4,3  5,4   6,5  7,6
MRU vs FIFO     2,1  4,3  6,5  8,7  10,9  MEM  MEM
MRU vs PLRU     1,0   –   4,3   –    –     –   MEM
Figure 6.6: Miss-competitiveness ratios k and additive constants c relating FIFO, PLRU, LRU, and MRU at the same associativity. PLRU is only defined for powers of two. As an example of how this should be read, LRU(4) is 2-miss-competitive relative to PLRU(4) with additive constant 1, whereas PLRU(4) is not miss-competitive relative to LRU(4) at all. ∞ indicates that there is no k such that the policy on the left is k-miss-competitive relative to the policy on the right. MEM indicates that the computation required more than 1.5GB of heap space.
Associativity:    2         3      4        5      6        7      8
LRU vs FIFO      0         0      0        0      0        0      0
LRU vs PLRU      1,0        –     1/2,1     –      –        –     1/8,15/8
LRU vs MRU       1,0       0      0        0      0        0      0
FIFO vs LRU     1/2,1/2   1/2,1  1/2,3/2  1/2,2  1/2,5/2  1/2,3  1/2,7/2
FIFO vs PLRU    1/2,1/2     –     1/4,5/4   –      –        –     1/11,19/11
FIFO vs MRU     1/2,−1/2   0      0        0      0        MEM    MEM
PLRU vs LRU      1,0        –     1/2,1     –      –        –     1/4,3/2
PLRU vs FIFO     0          –      0        –      –        –      0
PLRU vs MRU      1,0        –      0        –      –        –     MEM
MRU vs LRU       1,0       0      0        0      0        0      0
MRU vs FIFO      0         0      0        0      0        MEM    MEM
MRU vs PLRU      1,0        –      0        –      –        –     MEM
Figure 6.7: Hit-competitiveness ratios k and subtractive constants c relating FIFO, PLRU, LRU, and MRU at the same levels of associativity. Again, MEM indicates that the computation required more than 1.5GB.
Measurement-Based Timing Analysis
Run the program on a number of inputs and initial states. Combine measurements for basic blocks to obtain a WCET estimation. Sensitivity analysis demonstrates this approach may be dramatically wrong.
Influence of Initial Cache State
[Figure: execution-time axis with BCET, WCET, and the upper bound; a wide variation in execution time is due to the initial cache state.]
Definition (Miss sensitivity). Policy P is (k, c)-miss-sensitive if

m_P(p, s) ≤ k · m_P(p′, s) + c

for all access sequences s ∈ M* and cache-set states p, p′ ∈ C_P.
1. Measure execution times of basic blocks
2. Combine measurements for basic blocks to obtain estimate of WCET
Influence of Initial State
How wrong can this get?
Sensitivity Results

Policy   2     3     4     5     6     7     8
LRU     1,2   1,3   1,4   1,5   1,6   1,7   1,8
FIFO    2,2   3,3   4,4   5,5   6,6   7,7   8,8
PLRU    1,2    –     ∞     –     –     –     ∞
MRU     1,2   3,4   5,6   7,8   MEM   MEM   MEM
LRU is optimal: performance varies in the least possible way. For FIFO, PLRU, and MRU the number of misses may vary strongly. A case study based on a simple model of execution time by Hennessy and Patterson (2003) shows that the WCET may be 3 times higher than a measured execution time for 4-way FIFO.
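The strong influence of the initial state on FIFO can be reproduced with a tiny simulation (sequence and block names are illustrative): the same 12-access sequence misses twice as often from one ordering of the initial contents as from the other, consistent with FIFO's (2, 2) entry at associativity 2 in the table:

```python
def fifo_misses(initial, seq):
    """FIFO cache set: list in insertion order, first-in at index 0."""
    contents, m = list(initial), 0
    for x in seq:
        if x not in contents:
            m += 1
            contents.pop(0)           # evict the first-in block
            contents.append(x)
    return m

seq = ["c", "a", "b"] * 4             # periodic sequence over 12 accesses
print(fifo_misses(["a", "b"], seq), fifo_misses(["b", "a"], seq))
# -> 12 6
```

From [a, b] every access misses (the needed block is always evicted just before it is reused), while from [b, a] half the accesses hit, so a measurement starting from the lucky state underestimates the misses by a factor of 2.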
Sensitivity Results
LRU is optimal: minimal influence of initial state
FIFO, MRU, PLRU: great influence of initial state
Predictability: Limits on the Precision of Cache Analysis
Competitiveness: Cache Analyses for FIFO, PLRU, etc.
Sensitivity: Caches in Measurement-based WCET Analysis
References:
Jan Reineke. Caches in WCET Analysis. PhD thesis, Universität des Saarlandes, Saarbrücken, November 2008.
Jan Reineke and Daniel Grund. Relative competitive analysis of cache replacement policies. In LCTES 2008.
Jan Reineke, Daniel Grund, Christoph Berg, and Reinhard Wilhelm. Timing predictability of cache replacement policies. Real-Time Systems, 37(2):99-122, November 2007.
[Courtesy of J. Reineke]
[Figure: schedule of tasks τ1, τ2, τ3 over time: indefinite delays due to corrupted cache states.]