Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators:...
![Page 1: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/1.jpg)
Using Criticality to Attack Performance Bottlenecks
Brian Fields, UC-Berkeley
(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)
![Page 2: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/2.jpg)
Bottleneck Analysis
Bottleneck Analysis: Determining the performance effect of an event on execution time
An event could be:
• an instruction’s execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.
![Page 3: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/3.jpg)
Why is Bottleneck Analysis Important?
![Page 4: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/4.jpg)
Bottleneck Analysis Applications
Run-time Optimization
• Resource arbitration
  • e.g., how to schedule memory accesses?
• Effective speculation
  • e.g., which branches to predicate?
• Dynamic reconfiguration
  • e.g., when to enable hyperthreading?
• Energy efficiency
  • e.g., when to throttle frequency?
Design Decisions
• Overcoming technology constraints
  • e.g., how to mitigate the effect of long wire latencies?
Programmer Performance Tuning
• Where have the cycles gone?
  • e.g., which cache misses should be prefetched?
![Page 5: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/5.jpg)
Why is Bottleneck Analysis Hard?
![Page 6: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/6.jpg)
Current state-of-art
Event counts:
Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time
where:
Mem. cycles = Number of cache misses * Miss penalty
[Figure: miss 1 (100 cycles) and miss 2 (100 cycles) overlapping in time]
2 misses but only 1 miss penalty
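The mismatch on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the talk's method: the function names and the `(start_cycle, latency)` representation of a miss are assumptions made for illustration. It compares the event-count formula with the number of cycles during which at least one miss is actually outstanding.

```python
def event_count_mem_cycles(num_misses, miss_penalty):
    """Event-count model: every miss is charged the full miss penalty."""
    return num_misses * miss_penalty


def overlapped_mem_cycles(misses):
    """Cycles during which at least one miss is outstanding.

    `misses` is a list of (start_cycle, latency) pairs (hypothetical format).
    """
    intervals = sorted((start, start + latency) for start, latency in misses)
    busy = 0
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start <= cur_end:
            # This miss overlaps the current busy interval: extend it.
            cur_end = max(cur_end, end)
        else:
            # Gap between misses: close the current interval, start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
    busy += cur_end - cur_start
    return busy


parallel_misses = [(0, 100), (0, 100)]
print(event_count_mem_cycles(len(parallel_misses), 100))  # 200
print(overlapped_mem_cycles(parallel_misses))             # 100
```

For two fully overlapped 100-cycle misses, the event-count model charges 200 cycles while only 100 cycles are spent waiting on memory.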
![Page 7: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/7.jpg)
Parallelism in systems complicates performance understanding
Parallelism
• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
• Two parallel cache misses
• Two parallel threads
![Page 8: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/8.jpg)
Criticality Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
![Page 9: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/9.jpg)
Our Approach
![Page 10: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/10.jpg)
Our Approach: Criticality
Critical events affect execution time, non-critical do not
Bottleneck Analysis: Determining the performance effect of an event on execution time
![Page 11: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/11.jpg)
Defining criticality
Need Performance Sensitivity
• slowing down a “critical” event should slow down the entire program
• speeding up a “noncritical” event should leave execution time unchanged
![Page 12: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/12.jpg)
Standard Waterfall Diagram
[Figure: waterfall diagram over cycles 1 to 15 showing when each instruction is fetched (F), executed (E), and committed (C)]
R5 = 0
R3 = 0
R1 = #array + R3
R6 = ld[R1]
R3 = R3 + 1
R5 = R6 + R5
cmp R6, 0
bf L1
R5 = R5 + 100
R0 = R5
Ret R0
![Page 13: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/13.jpg)
Annotated with Dependence Edges
[Figure: the same waterfall diagram with dependence edges drawn between the F, E, and C nodes; one edge is labeled (MISP)]
![Page 14: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/14.jpg)
Annotated with Dependence Edges
[Figure: the same waterfall diagram; the legend distinguishes four edge types: Fetch BW, ROB, Data Dep, and Branch Misp.]
![Page 15: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/15.jpg)
Edge Weights Added
[Figure: the same waterfall diagram with a numeric weight on each dependence edge]
![Page 16: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/16.jpg)
Convert to Graph
[Figure: the waterfall diagram redrawn as a graph with one F, E, and C node per instruction and weighted edges between the nodes]
![Page 18: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/18.jpg)
Smaller graph instance
[Figure: a five-instruction dependence graph with F, E, and C nodes and weighted edges; two edges are called out]
• Non-critical, but how much slack?
• Critical Icache miss, but how costly?
![Page 19: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/19.jpg)
Add “hidden” constraints
[Figure: the same five-instruction dependence graph with additional weighted edges; the same two edges are called out]
• Non-critical, but how much slack?
• Critical Icache miss, but how costly?
![Page 20: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/20.jpg)
Add “hidden” constraints
[Figure: the same five-instruction dependence graph with the two called-out edges now quantified]
Slack = 13 – 7 = 6 cycles
Cost = 13 – 7 = 6 cycles
![Page 21: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/21.jpg)
Slack “sharing”
[Figure: the same five-instruction dependence graph; two edges are each annotated “Slack = 6 cycles”]
Can delay one edge by 6 cycles, but not both!
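Slack, cost, and slack sharing can all be computed from longest paths in the dependence graph. The sketch below is an illustration under stated assumptions, not the graph drawn on the slides: the node names and edge weights are made up so that the critical path is 13 cycles and the competing path is 7 cycles, which reproduces the 13 – 7 = 6 arithmetic. Cost is modeled as the speedup from idealizing an edge (setting its weight to 0).

```python
from collections import defaultdict


def longest_paths(edges, order):
    """Longest distance from the source to each node and from each node to the sink.

    edges: list of (u, v, weight); order: all nodes in topological order.
    """
    succ = defaultdict(list)
    for u, v, w in edges:
        succ[u].append((v, w))
    fwd = {n: 0 for n in order}
    for u in order:
        for v, w in succ[u]:
            fwd[v] = max(fwd[v], fwd[u] + w)
    bwd = {n: 0 for n in order}
    for u in reversed(order):
        for v, w in succ[u]:
            bwd[u] = max(bwd[u], w + bwd[v])
    return fwd, bwd


def critical_path_length(edges, order):
    fwd, _ = longest_paths(edges, order)
    return max(fwd.values())


def edge_slack(edges, order, edge):
    """Cycles the edge can be delayed before the critical path grows."""
    fwd, bwd = longest_paths(edges, order)
    u, v, w = edge
    return max(fwd.values()) - (fwd[u] + w + bwd[v])


def edge_cost(edges, order, edge):
    """Speedup obtained by idealizing the edge (weight set to 0)."""
    idealized = [(u, v, 0) if (u, v, w) == edge else (u, v, w) for u, v, w in edges]
    return critical_path_length(edges, order) - critical_path_length(idealized, order)


# Hypothetical graph: critical path s->x->t (13 cycles), other path s->a->b->t (7 cycles).
ORDER = ["s", "x", "a", "b", "t"]
EDGES = [("s", "x", 10), ("x", "t", 3), ("s", "a", 2), ("a", "b", 3), ("b", "t", 2)]

print(edge_slack(EDGES, ORDER, ("s", "a", 2)))  # 6
print(edge_slack(EDGES, ORDER, ("a", "b", 3)))  # 6
print(edge_cost(EDGES, ORDER, ("s", "x", 10)))  # 6

# Slack is shared: delaying both non-critical edges by 6 lengthens execution.
both_delayed = [("s", "x", 10), ("x", "t", 3), ("s", "a", 8), ("a", "b", 9), ("b", "t", 2)]
print(critical_path_length(both_delayed, ORDER))  # 19
```

Each non-critical edge reports 6 cycles of slack on its own, yet spending both at once pushes the path to 19 cycles, which is why per-edge slack cannot simply be added up.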
![Page 22: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/22.jpg)
Machine Imbalance
[Figure: percent of dynamic instructions (y-axis, 0 to 100) versus number of cycles of slack (x-axis, 0 to 100) for perl, with one curve for apportioned slack and one for global slack]
~80% of insts have at least 5 cycles of apportioned slack
![Page 23: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/23.jpg)
Criticality Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
![Page 24: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/24.jpg)
Simple criticality not always enough
Sometimes events have nearly equal criticality
miss #1 (99)
miss #2 (100)
Want to know
• how critical is each event?
• how far from critical is each event?
Actually, even that is not enough
![Page 25: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/25.jpg)
Our solution: measure interactions
Two parallel cache misses
miss #1 (99)
miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100
Aggregate cost > Sum of individual costs (100 > 0 + 1) → Parallel interaction
icost = aggregate cost – sum of individual costs
      = 100 – 0 – 1 = 99
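The numbers on this slide follow from a very small model. The sketch below assumes a toy machine in which the two misses run fully in parallel and nothing else matters, so execution time is the longer of the two latencies; the function names are hypothetical. The cost of a set of events is the speedup obtained by idealizing all of them (latency 0).

```python
def exec_time(latencies, idealized=frozenset()):
    """Toy model: all events run in parallel, so time is the longest latency."""
    return max(0 if name in idealized else lat for name, lat in latencies.items())


def cost(latencies, events):
    """Speedup from idealizing every event in `events`."""
    return exec_time(latencies) - exec_time(latencies, frozenset(events))


def icost(latencies, a, b):
    """Interaction cost: aggregate cost minus the sum of individual costs."""
    return cost(latencies, {a, b}) - cost(latencies, {a}) - cost(latencies, {b})


misses = {"miss1": 99, "miss2": 100}
print(cost(misses, {"miss1"}))           # 0
print(cost(misses, {"miss2"}))           # 1
print(cost(misses, {"miss1", "miss2"}))  # 100
print(icost(misses, "miss1", "miss2"))   # 99
```

The positive icost says the two misses must be optimized together: removing either one alone exposes the other.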
![Page 26: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/26.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost → parallel interaction
[Figure: miss #1 and miss #2 overlapping in time]
2. Zero icost → ?
![Page 27: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/27.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost → parallel interaction
[Figure: miss #1 and miss #2 overlapping in time]
2. Zero icost → independent
[Figure: miss #1 . . . miss #2, far apart in time]
3. Negative icost → ?
![Page 28: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/28.jpg)
Negative icost
Two serial cache misses (data dependent)
[Figure: miss #1 (100) followed by miss #2 (100), in parallel with ALU latency (110 cycles)]
Cost(miss #1) = ?
![Page 29: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/29.jpg)
Negative icost
Two serial cache misses (data dependent)
[Figure: miss #1 (100) followed by miss #2 (100), in parallel with ALU latency (110 cycles)]
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost – sum of individual costs
      = 90 – 90 – 90 = -90
Negative icost → serial interaction
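The serial case can be checked the same way. This sketch assumes the timing implied by the slide: the two data-dependent misses form a 200-cycle chain that runs in parallel with a 110-cycle ALU chain, so execution time is the longer of the two chains. The function names and the dictionary of latencies are illustrative choices.

```python
MISS_LATENCY = {"miss1": 100, "miss2": 100}
ALU_LATENCY = 110


def exec_time(idealized=frozenset()):
    """Serial miss chain overlapped with the ALU chain; idealized misses take 0 cycles."""
    miss_chain = sum(0 if name in idealized else lat for name, lat in MISS_LATENCY.items())
    return max(miss_chain, ALU_LATENCY)


def cost(events):
    """Speedup from idealizing every miss in `events`."""
    return exec_time() - exec_time(frozenset(events))


def icost(a, b):
    """Interaction cost: aggregate cost minus the sum of individual costs."""
    return cost({a, b}) - cost({a}) - cost({b})


print(cost({"miss1"}))           # 90
print(cost({"miss2"}))           # 90
print(cost({"miss1", "miss2"}))  # 90
print(icost("miss1", "miss2"))   # -90
```

Removing either miss already drops the miss chain below the ALU chain, so removing the second one buys nothing more; that redundancy is what the negative icost measures.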
![Page 30: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/30.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
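1. Positive icost → parallel interaction
[Figure: miss #1 and miss #2 overlapping in time]
2. Zero icost → independent
[Figure: miss #1 . . . miss #2, far apart in time]
3. Negative icost → serial interaction
[Figure: miss #1 followed by miss #2, in parallel with ALU latency]
Other event labels on the slide: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall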
![Page 31: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/31.jpg)
Why care about serial interactions?
[Figure: miss #1 (100) followed by miss #2 (100), in parallel with ALU latency (110 cycles)]
Reason #1: We are over-optimizing!
Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize
Prefetching miss #2 has the same effect as prefetching miss #1
![Page 32: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/32.jpg)
Icost Case Study: Deep pipelines
Looking for serial interactions!
Dcache (DL1)
1 4
![Page 36: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/36.jpg)
Icost Breakdown (6 wide, 64-entry window)
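| | gcc | gzip | vortex |
|---|---|---|---|
| DL1 | 18.3 % | 30.5 % | 25.8 % |
| DL1+window | -4.2 | -15.3 | -24.5 |
| DL1+bw | 10.0 | 6.0 | 15.5 |
| DL1+bmisp | -7.0 | -3.4 | -0.3 |
| DL1+dmiss | -1.4 | -0.4 | -1.4 |
| DL1+alu | -1.6 | -8.2 | -4.7 |
| DL1+imiss | 0.1 | 0.0 | 0.4 |
| ... | ... | ... | ... |
| Total | 100.0 | 100.0 | 100.0 |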
![Page 37: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/37.jpg)
Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and weighted edges; the legend distinguishes DL1 access edges from window edges]
![Page 43: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/43.jpg)
Criticality Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
![Page 44: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/44.jpg)
Exploit in Hardware
• Criticality Analyzer
  • Online, fast-feedback
  • Limited to critical/not critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs entire graph
![Page 45: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/45.jpg)
Only last-arriving edges can be critical
Observation: R1 ← R2 + R3
If dependence into R2 is on critical path, then value of R2 arrived last.
critical ⇒ arrives last
arrives last ⇏ critical
[Figure: E node with incoming operand edges R2 and R3; one edge is labeled “Dependence resolved early”]
![Page 46: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/46.jpg)
Determining last-arrive edges
Observe events within the machine
last_arrive[F] =
  EF if branch misp.
  CF if ROB stall
  FF otherwise
last_arrive[E] =
  FE if data ready on fetch
  EE observe arrival order of operands
last_arrive[C] =
  EC if commit pointer is delayed
  CC otherwise
[Figure: each rule is illustrated on a small graph of F, E, and C nodes]
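The rules above amount to a small classification per dynamic instruction. The following sketch encodes them as written; the `Observation` record and its field names are hypothetical stand-ins for whatever signals the machine exposes, and checking the branch mispredict before the ROB stall is an assumed priority, since the slide lists the cases without ordering them. Identifying which producer supplies the EE edge requires observing operand arrival order, which this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Signals observed for one dynamic instruction (field names are hypothetical)."""
    fetched_after_branch_misp: bool
    fetch_blocked_by_rob_stall: bool
    data_ready_on_fetch: bool
    commit_pointer_delayed: bool


def last_arrive_edges(obs):
    """Return the type of the last-arriving edge into the F, E, and C nodes."""
    if obs.fetched_after_branch_misp:
        into_f = "EF"
    elif obs.fetch_blocked_by_rob_stall:
        into_f = "CF"
    else:
        into_f = "FF"
    # EE means an operand arrived last; which operand is not modeled here.
    into_e = "FE" if obs.data_ready_on_fetch else "EE"
    into_c = "EC" if obs.commit_pointer_delayed else "CC"
    return {"F": into_f, "E": into_e, "C": into_c}


example = Observation(
    fetched_after_branch_misp=False,
    fetch_blocked_by_rob_stall=True,
    data_ready_on_fetch=False,
    commit_pointer_delayed=True,
)
print(last_arrive_edges(example))  # {'F': 'CF', 'E': 'EE', 'C': 'EC'}
```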
![Page 47: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/47.jpg)
Last-arrive edges
The last-arrive rule
CP consists only of “last-arrive” edges
[Figure: dependence graph with rows of F, E, and C nodes]
![Page 48: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/48.jpg)
Prune the graph
Only need to put last-arrive edges in graph
No other edges could be on CP
[Figure: the graph with rows of F, E, and C nodes, pruned to last-arrive edges; the newest node is marked]
![Page 49: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/49.jpg)
…and we’ve found the critical path!
Backward propagate along last-arrive edges
[Figure: the pruned graph with rows of F, E, and C nodes; the walk starts at the newest node]
Found CP by only observing last-arrive edges, but still requires constructing entire graph
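Once every node records only the source of its last-arriving edge, the backward walk is a simple pointer chase. The sketch below assumes a hypothetical representation in which nodes are `(instruction, stage)` pairs and `last_arrive` maps each node to the node its last-arriving edge comes from, with `None` for the first node.

```python
def critical_path(last_arrive, newest):
    """Walk backward from the newest node along last-arrive edges.

    last_arrive: dict mapping node -> source node of its last-arriving edge,
    or None for the node where the walk should stop.
    """
    path = [newest]
    while last_arrive.get(path[-1]) is not None:
        path.append(last_arrive[path[-1]])
    path.reverse()  # report the path oldest node first
    return path


# Hypothetical two-instruction example.
last_arrive = {
    ("i1", "F"): None,
    ("i1", "E"): ("i1", "F"),
    ("i1", "C"): ("i1", "E"),
    ("i2", "F"): ("i1", "F"),
    ("i2", "E"): ("i1", "E"),
    ("i2", "C"): ("i2", "E"),
}
print(critical_path(last_arrive, ("i2", "C")))
# [('i1', 'F'), ('i1', 'E'), ('i2', 'E'), ('i2', 'C')]
```

The walk itself is cheap, but it needs the whole `last_arrive` map in memory, which is the storage problem the slide points out.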
![Page 50: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/50.jpg)
Step 2. Reducing storage reqs
CP is a “long” chain of last-arrive edges. The longer a given chain of last-arrive edges, the more likely it is part of the CP.
Algorithm: find sufficiently long last-arrive chains
1. Plant token into a node n
2. Propagate forward, only along last-arrive edges
3. Check for token after several hundred cycles
4. If token alive, n is assumed critical
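The four steps above can be sketched in software as follows. This is an illustration under simplifying assumptions, not the hardware design: nodes are numbered in time order, each node has exactly one last-arrive source, the horizon is counted in nodes rather than cycles, and the token counts as alive if any of the most recent `window` nodes holds it. The parameter names are hypothetical.

```python
def token_survives(last_arrive, planted, horizon, window=8):
    """Plant a token at node `planted` and propagate it along last-arrive edges.

    last_arrive[n] is the source node of node n's last-arriving edge (or None).
    Returns True if the token is still held by a recent node after `horizon` nodes.
    """
    holders = {planted}
    end = min(len(last_arrive), planted + 1 + horizon)
    for node in range(planted + 1, end):
        # A node inherits the token only through its last-arriving edge.
        if last_arrive[node] in holders:
            holders.add(node)
    recent = range(max(planted + 1, end - window), end)
    return any(node in holders for node in recent)


# Hypothetical graph: node 2 hangs off node 0 and nothing depends on it.
last_arrive = [None, 0, 0, 1, 3, 4]
print(token_survives(last_arrive, planted=1, horizon=5, window=2))  # True
print(token_survives(last_arrive, planted=2, horizon=5, window=2))  # False
```

Only a set of token holders is kept, not the graph itself, which is the point of step 2: storage no longer grows with the length of the observed chain.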
![Page 51: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/51.jpg)
Online Criticality Detection
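Forward propagate token
[Figure: last-arrive graph with rows of F, E, and C nodes; a token is planted at one node (PlantToken) and propagates forward toward the newest node]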
![Page 52: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/52.jpg)
Online Criticality Detection
Forward propagate token
[Figure: the same last-arrive graph with a token planted at one node (PlantToken)]
Tokens “Die”
![Page 53: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/53.jpg)
Online Criticality Detection
Forward propagate token
[Figure: the same last-arrive graph with a token planted at a different node (PlantToken)]
Token survives!
![Page 54: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/54.jpg)
Putting it all together
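[Figure: block diagram connecting the OOO Core, the Token-Passing Analyzer, and the CP prediction table]
• Training Path: last-arrive edges (producer → retired instr) go from the OOO Core to the Token-Passing Analyzer, which trains the CP prediction table
• Prediction Path: the PC indexes the CP prediction table, which answers “E-critical?” for the OOO Core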
![Page 55: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/55.jpg)
Results
• Performance (Speed)
  • Scheduling in clustered machines
    • 10% speedup
  • Selective value prediction
  • Deferred scheduling (Crowe, et al.)
    • 11% speedup
  • Heterogeneous cache (Rakvic, et al.)
    • 17% speedup
• Energy
  • Non-uniform machine: fast and slow pipelines
    • ~25% less energy
  • Instruction queue resizing (Sasanka, et al.)
  • Multiple frequency scaling (Semeraro, et al.)
    • 19% less energy with 3% less performance
  • Selective pre-execution (Petric, et al.)
![Page 56: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/56.jpg)
Exploit in Hardware
• Criticality Analyzer
  • Online, fast-feedback
  • Limited to critical/not critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs entire graph
![Page 57: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/57.jpg)
Profiling goal
Goal:
• Construct graph
[Figure: a long stream of many dynamic instructions]
Constraint:
• Can only sample sparsely
![Page 58: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/58.jpg)
Profiling goal
Goal:
• Construct graph
Constraint:
• Can only sample sparsely
Genome sequencing
[Figure: a DNA strand]
![Page 59: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/59.jpg)
“Shotgun” genome sequencing
DNA
![Page 62: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/62.jpg)
“Shotgun” genome sequencing
[Figure: short samples taken from the DNA strand]
Find overlaps among samples
![Page 63: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/63.jpg)
Mapping “shotgun” to our situation
[Figure: a stream of many dynamic instructions, each marked with one of four event types: Icache miss, Dcache miss, Branch misp., or No event]
![Page 64: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/64.jpg)
Profiler hardware requirements
![Page 65: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/65.jpg)
Profiler hardware requirements
Match!
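The matching step borrowed from shotgun sequencing can be illustrated with a suffix-prefix overlap search. This is only a sketch of the general idea, not the profiler described in the talk: each sampled fragment is assumed to be a string of per-instruction event signatures (`I` for Icache miss, `D` for Dcache miss, `B` for branch mispredict, `-` for no event), and both the encoding and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def merge(a, b, min_len=3):
    """Stitch two fragments together if they overlap by at least `min_len` symbols."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


frag1 = "--D-B-I"
frag2 = "B-I--D"
print(overlap(frag1, frag2))  # 3
print(merge(frag1, frag2))    # --D-B-I--D
```

Longer required overlaps make false matches less likely but demand longer or denser samples, which is the trade-off behind the hardware requirements.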
![Page 66: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/66.jpg)
Sources of error
| Error Source | Gcc | Parser | Twolf |
|---|---|---|---|
| Modeling execution as a graph | 2.1 % | 6.0 % | 0.1 % |
| Errors in graph construction | 5.3 % | 1.5 % | 1.6 % |
| Sampling only a few graph fragments | 4.8 % | 6.5 % | 7.2 % |
| Total | 12.2 % | 14.0 % | 8.9 % |
![Page 67: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/67.jpg)
Conclusion: Grand Challenges
• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
Callouts on the slide: modeling, token-passing analyzer, parallel interactions, serial interactions, shotgun profiling
![Page 68: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/68.jpg)
Conclusion: Bottleneck Analysis Applications
Run-time Optimization
• Effective speculation: Selective value prediction
• Resource arbitration: Scheduling and steering in clustered processors
• Dynamic reconfiguration: Resize instruction window
• Energy efficiency: Non-uniform machines
Design Decisions
• Overcoming technology constraints: Helped cope with high-latency dcache
Programmer Performance Tuning
• Where have the cycles gone? Measured cost of cache misses/branch mispredicts
![Page 69: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/69.jpg)
Outline
Simple Criticality
• Definition (ISCA ’01)
• Detection (ISCA ’01)
• Application (ISCA ’01-’02)
Advanced Criticality
• Interpretation (MICRO ’03)
  • What types of interactions are possible?
• Hardware Support (MICRO ’03, TACO ’04)
  • Enhancement to performance counters
![Page 70: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/70.jpg)
Simple criticality not always enough
Sometimes events have nearly equal criticality
miss #1 (99)
miss #2 (100)
Want to know • how critical is each event?
• how far from critical is each event?
Actually, even that is not enough
![Page 71: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/71.jpg)
Our solution: measure interactions
Two parallel cache misses
miss #1 (99)
miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 1): parallel interaction
icost = aggregate cost – sum of individual costs
= 100 – 0 – 1 = 99
![Page 72: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/72.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1, miss #2)
2. Zero icost: ?
![Page 73: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/73.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1, miss #2)
2. Zero icost: independent (miss #1 . . . miss #2)
3. Negative icost: ?
![Page 74: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/74.jpg)
Negative icost
Two serial cache misses (data dependent)
miss #1 (100)
miss #2 (100)
Cost(miss #1) = ?
ALU latency (110 cycles)
![Page 75: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/75.jpg)
Negative icost
Two serial cache misses (data dependent)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
ALU latency (110 cycles)
miss #1 (100)
miss #2 (100)
icost = aggregate cost – sum of individual costs
= 90 – 90 – 90 = -90
Negative icost: serial interaction
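The numbers on this slide and on the earlier parallel-miss slide can be reproduced with a small sketch. It assumes a deliberately simplified model that is not part of the talk: execution time is the longest of a few independent paths, and "idealizing" an event sets its latency to zero. The names `exec_time`, `cost`, and `icost` are illustrative only.

```python
def exec_time(paths, idealized=frozenset()):
    """Execution time is the longest path; idealized events take 0 cycles."""
    return max(
        sum(lat for name, lat in path if name not in idealized)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved when every event in `events` is idealized together."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def icost(paths, a, b):
    """Interaction cost: aggregate cost minus the sum of individual costs."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


# Two parallel cache misses (99 and 100 cycles), each on its own path.
parallel = [[("miss1", 99)], [("miss2", 100)]]

# Two serial (data-dependent) 100-cycle misses next to 110 cycles of ALU latency.
serial = [[("miss1", 100), ("miss2", 100)], [("alu", 110)]]

print(icost(parallel, "miss1", "miss2"))  # 99  (positive: parallel interaction)
print(icost(serial, "miss1", "miss2"))    # -90 (negative: serial interaction)
```

In the serial case each miss alone has cost 90 because the 110-cycle ALU path becomes critical once either miss is removed, so removing the second miss as well saves nothing more.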
![Page 76: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/76.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1, miss #2)
2. Zero icost: independent (miss #1 . . . miss #2)
3. Negative icost: serial interaction (miss #1, miss #2, ALU latency)
Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall
![Page 77: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/77.jpg)
Why care about serial interactions?
ALU latency (110 cycles)
miss #1 (100)
miss #2 (100)
Reason #1: We are over-optimizing!
Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize
Prefetching miss #2 has the same effect as prefetching miss #1
![Page 78: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/78.jpg)
Outline
Simple Criticality
• Definition (ISCA ’01)
• Detection (ISCA ’01)
• Application (ISCA ’01-’02)
Advanced Criticality
• Interpretation (MICRO ’03)
• What types of interactions are possible?
• Hardware Support (MICRO ’03, TACO ’04)
• Enhancement to performance counters
![Page 79: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/79.jpg)
Profiling goal
Goal:
• Construct graph (many dynamic instructions)
Constraint:
• Can only sample sparsely
![Page 80: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/80.jpg)
Profiling goal
Goal:
• Construct graph
Constraint:
• Can only sample sparsely
DNA
DNA strand
Genome sequencing
![Page 81: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/81.jpg)
“Shotgun” genome sequencing
DNA
![Page 82: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/82.jpg)
“Shotgun” genome sequencing
DNA
![Page 83: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/83.jpg)
“Shotgun” genome sequencing
DNA
![Page 84: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/84.jpg)
“Shotgun” genome sequencing
Find overlaps among samples
DNA
![Page 85: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/85.jpg)
Mapping “shotgun” to our situation
many dynamic instructions
Icache miss
Dcache miss
Branch misp.
No event
![Page 86: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/86.jpg)
Profiler hardware requirements
![Page 87: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/87.jpg)
Profiler hardware requirements
Match!
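The "find overlaps" step can be pictured with a toy sketch. Everything here is an assumption made for illustration: each dynamic instruction is reduced to a one-character event code (the four categories from the previous slide), a long run of codes stands in for a sparse sample, and a short fragment is placed wherever its codes match exactly. The real profiler hardware may encode and match samples differently.

```python
# Hypothetical one-character event codes for the four categories.
NO_EVENT, ICACHE_MISS, DCACHE_MISS, BRANCH_MISP = "-", "I", "D", "B"


def find_overlaps(signature, fragment):
    """Return every offset at which `fragment` matches inside `signature`."""
    n, m = len(signature), len(fragment)
    return [i for i in range(n - m + 1) if signature[i:i + m] == fragment]


# A long event signature covering 16 dynamic instructions.
signature = "--D-B---I-D--B--"

# A short fragment sampled elsewhere, covering 6 dynamic instructions.
fragment = "I-D--B"

matches = find_overlaps(signature, fragment)
print(matches)  # [8]
```

A unique match, as in this example, says where the fragment belongs; an ambiguous or empty result would mean the fragment cannot be placed from these codes alone.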
![Page 88: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/88.jpg)
Sources of error
Error Source Gcc Parser Twolf
![Page 89: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/89.jpg)
Sources of error
| Error Source | Gcc | Parser | Twolf |
|---|---|---|---|
| Modeling execution as a graph | 2.1% | 6.0% | 0.1% |
![Page 90: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/90.jpg)
Sources of error
| Error Source | Gcc | Parser | Twolf |
|---|---|---|---|
| Modeling execution as a graph | 2.1% | 6.0% | 0.1% |
| Errors in graph construction | 5.3% | 1.5% | 1.6% |
![Page 91: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/91.jpg)
Sources of error
| Error Source | Gcc | Parser | Twolf |
|---|---|---|---|
| Modeling execution as a graph | 2.1% | 6.0% | 0.1% |
| Errors in graph construction | 5.3% | 1.5% | 1.6% |
| Sampling only a few graph fragments | 4.8% | 6.5% | 7.2% |
| Total | 12.2% | 14.0% | 8.9% |
![Page 92: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/92.jpg)
Conclusion: Bottleneck Analysis Applications
Run-time Optimization
• Effective speculation: selective value prediction
• Resource arbitration: scheduling and steering in clustered processors
• Dynamic reconfiguration: resize instruction window
• Energy efficiency: non-uniform machines
Design Decisions
• Overcoming technology constraints: helped cope with high-latency dcache
Programmer Performance Tuning
• Where have the cycles gone? Measured cost of cache misses/branch mispredicts
![Page 93: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/93.jpg)
Conclusion: Grand Challenges
• Cost
  • How much speedup is possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
modeling
token-passing analyzer
parallel interactions
serial interactions
shotgun profiling
![Page 94: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/94.jpg)
Backup Slides
![Page 95: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/95.jpg)
Related Work
![Page 96: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/96.jpg)
Criticality Prior Work
Critical-Path Method, PERT charts
• Developed for the Navy’s “Polaris” project (1957)
• Used as a project management tool
• Simple critical-path, slack concepts
“Attribution” Heuristics
• Rosenblum et al.: SOSP-1995, and many others
• Marks instruction at head of ROB as critical, etc.
• Empirically, has limited accuracy
• Does not account for interactions between events
![Page 97: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/97.jpg)
Related Work: Microprocessor Criticality
Latency tolerance analysis
• Srinivasan and Lebeck: MICRO-1998
Heuristics-driven criticality predictors
• Tune et al.: HPCA-2001
• Srinivasan et al.: ISCA-2001
“Local” slack detector
• Casmira and Grunwald: Kool Chips Workshop-2000
ProfileMe with pair-wise sampling
• Dean, et al.: MICRO-1997
![Page 98: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/98.jpg)
Unresolved Issues
![Page 99: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/99.jpg)
Alternative I: Addressing Unresolved Issues
Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
• Efficient algorithms
• Release tool for measuring cost/slack
Hardware
• Detailed design for criticality analyzer
• Shotgun profiler simplifications
  • gradual path from counters
Optimization
• Explore heuristics for exploiting interactions
![Page 100: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/100.jpg)
Alternative II: Chip-Multiprocessors
Design Decisions
• Should each core support out-of-order execution?
• Should SMT be supported?
• How many processors are useful?
• What is the effect of inter-processor latency?
Programmer Performance Tuning
Parallelizing applications
• What makes a good division into threads?
• How can we find them automatically, or at least help programmers to find them?
![Page 101: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/101.jpg)
Unresolved issues
Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
  • In other words, unanticipated side effects
Example instruction sequence:
1. ld r2, [Mem]
2. add r3 ← r2 + 1
3. ld r4, [Mem]
4. add r6 ← r4 + 1
Original Execution: instructions 1 and 3 are both cache misses
Altered Execution (to compute cost of inst #3 cache miss): instruction 1 is a cache miss, instruction 3 is a cache hit
[Figure: dependence graphs with F, E, and C nodes per instruction and latency-weighted edges for the two executions. Labels: No contention; Adder contention; Contention edge; Incorrect critical path due to contention edge; Should not be here.]
![Page 102: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/102.jpg)
Unresolved issues
Modeling and Measurement (cont.)
• How should processor policies be modeled?
  • relationship to icost definition
• Efficient algorithms for measuring icosts
  • pairs of events, etc.
• Release tool for measuring cost/slack
![Page 103: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/103.jpg)
Unresolved issues
Hardware
• Detailed design for criticality analyzer
  • help to convince industry-types to build it
• Shotgun profiler simplifications
  • gradual path from counters
Optimization
• Explore icost optimization heuristics
  • icosts are difficult to interpret
![Page 104: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/104.jpg)
Validation
![Page 105: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/105.jpg)
Validation: can we trust our model?
Run two simulations:
• Reduce CP latencies: expect “big” speedup
• Reduce non-CP latencies: expect no speedup
![Page 106: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/106.jpg)
Validation: can we trust our model?
[Figure: bar chart of speedup per cycle reduced (y-axis, 0 to 1) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; series: Reducing CP Latencies, Reducing non-CP Latencies.]
![Page 107: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/107.jpg)
Validation
Two steps:
1. Increase latencies of insts. by their apportioned slack
• for three apportioning strategies:
  1) latency + 1
  2) 5 cycles to as many instructions as possible
  3) 12 cycles to as many loads as possible
2. Compare to baseline (no delays inserted)
![Page 108: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/108.jpg)
Validation
[Figure: bar chart of percent of execution time (y-axis, 0% to 120%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: baseline, latency + 1, 12 cycles to loads, five cycles.]
Worst case: Inaccuracy of 0.6%
![Page 109: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/109.jpg)
Slack Measurements
![Page 110: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/110.jpg)
Three slack variants
Local slack: # cycles latency can be increased without delaying any subsequent instructions
Global slack: # cycles latency can be increased without delaying the last instruction in the program
Apportioned slack: distribute global slack among instructions using an apportioning strategy
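The difference between local and global slack can be seen on a tiny dependence graph. The sketch below is an illustration under stated assumptions, not the talk's measurement method: instructions are listed in program order, each starts as soon as its producers finish, resource limits are ignored, and the completion time of the whole graph stands in for "the last instruction in the program". Apportioned slack is not modeled.

```python
def slack(latency, deps):
    """latency: dict instruction -> cycles, in program (topological) order.
    deps: dict instruction -> list of producer instructions.
    Returns (local_slack, global_slack) as dicts of cycles."""
    order = list(latency)

    # Earliest finish time of every instruction.
    finish = {}
    for n in order:
        start = max((finish[p] for p in deps.get(n, [])), default=0)
        finish[n] = start + latency[n]
    end = max(finish.values())

    consumers = {n: [] for n in order}
    for n in order:
        for p in deps.get(n, []):
            consumers[p].append(n)

    # Latest finish time that does not delay the end of execution.
    latest = {}
    for n in reversed(order):
        latest[n] = min((latest[c] - latency[c] for c in consumers[n]), default=end)

    global_slack = {n: latest[n] - finish[n] for n in order}
    # Local slack: gap between this finish and the earliest consumer start.
    local_slack = {
        n: min((finish[c] - latency[c] - finish[n] for c in consumers[n]),
               default=end - finish[n])
        for n in order
    }
    return local_slack, global_slack


# A 10-cycle load and a short chain a -> b both feed instruction c.
latency = {"load": 10, "a": 1, "b": 1, "c": 1}
deps = {"b": ["a"], "c": ["load", "b"]}
local, glob = slack(latency, deps)
print(local)  # {'load': 0, 'a': 0, 'b': 8, 'c': 0}
print(glob)   # {'load': 0, 'a': 8, 'b': 8, 'c': 0}
```

Instruction `a` has no local slack, because delaying it delays `b`, yet it has 8 cycles of global slack, because `b` can absorb the delay before `c` needs its result.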
![Page 111: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/111.jpg)
Slack measurements
[Figure: percent of dynamic instructions (y-axis, 0 to 100) versus number of cycles of slack (x-axis, 0 to 100) for perl; curve: local.]
~21% insts have at least 5 cycles of local slack
![Page 112: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/112.jpg)
Slack measurements
[Figure: percent of dynamic instructions (y-axis, 0 to 100) versus number of cycles of slack (x-axis, 0 to 100) for perl; curves: local, global.]
~90% insts have at least 5 cycles of global slack
![Page 113: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/113.jpg)
Slack measurements
[Figure: percent of dynamic instructions (y-axis, 0 to 100) versus number of cycles of slack (x-axis, 0 to 100) for perl; curves: local, apportioned, global.]
~80% insts have at least 5 cycles of apportioned slack
A large amount of exploitable slack exists
![Page 114: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/114.jpg)
Application-centered Slack Measurements
![Page 115: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/115.jpg)
Load slack
Can we tolerate a long-latency L1 hit?
design: wire-constrained machine, e.g. Grid
non-uniformity: multi-latency L1
apportioning strategy: apportion ALL slack to load instructions
![Page 116: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/116.jpg)
Apportion all slack to loads
[Figure: percent of dynamic loads (y-axis, 0 to 100) versus number of cycles of slack on load instructions (x-axis, 0 to 100); curves: gcc, perl, gzip.]
Most loads can tolerate an L2 cache hit
![Page 117: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/117.jpg)
Multi-speed ALUs
Can we tolerate ALUs running at half frequency?
design: fast/slow ALUs
non-uniformity: multi-latency execution latency, bypass
apportioning strategy: give slack equal to original latency + 1
![Page 118: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/118.jpg)
Latency+1 apportioning
[Figure: bar chart of percent of dynamic instructions (y-axis, 0% to 100%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average.]
Most instructions can tolerate doubling their latency
![Page 119: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/119.jpg)
Slack Locality and Prediction
![Page 120: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/120.jpg)
Predicting slack
Two steps to PC-indexed, history-based prediction:
1. Measure slack of a dynamic instruction
2. Store in array indexed by PC of static instruction
Two requirements:
1. Locality of slack
2. Ability to measure slack of a dynamic instruction
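The two steps can be sketched as a small table-based predictor. This is a minimal illustration under assumptions that are not in the talk: the class name `SlackPredictor`, the table size, the index function (which assumes 4-byte instructions), and the last-value update policy are all hypothetical.

```python
class SlackPredictor:
    """PC-indexed, history-based slack predictor (illustrative sketch)."""

    def __init__(self, entries=1024):
        self.entries = entries
        # Predicted slack in cycles; 0 means "treat as critical" until measured.
        self.table = [0] * entries

    def _index(self, pc):
        # Drop the low 2 bits (assumes 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def update(self, pc, measured_slack):
        """Step 2: store the slack measured for one dynamic instance."""
        self.table[self._index(pc)] = measured_slack

    def predict(self, pc):
        """Predict slack for the next dynamic instance of this static PC."""
        return self.table[self._index(pc)]


predictor = SlackPredictor()
predictor.update(0x1000, 5)        # step 1 measured 5 cycles of slack
print(predictor.predict(0x1000))   # 5
print(predictor.predict(0x1004))   # 0 (never measured)
```

The predictor is only useful if slack has locality, that is, if later dynamic instances of the same static instruction tend to have similar slack, which is the first requirement listed above.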
![Page 121: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/121.jpg)
Locality of slack
[Figure: bar chart of percent of (weighted) static instructions (y-axis, 0 to 100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: ideal.]
![Page 122: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/122.jpg)
Locality of slack
[Figure: bar chart of percent of (weighted) static instructions (y-axis, 0 to 100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: ideal, 100%.]
![Page 123: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/123.jpg)
Locality of slack
[Figure: bar chart of percent of (weighted) static instructions (y-axis, 0 to 100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: ideal, 100%, 95%, 90%.]
PC-indexed, history-based predictor can capture most of the available slack
![Page 124: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/124.jpg)
Slack Detector
Problem #1: Iterating repeatedly over same dynamic instruction
Solution: Only sample dynamic instruction once
Problem #2: Determining if overall execution time increased
Solution: Check if delay made instruction critical
“Delay and observe” is effective for a hardware predictor
![Page 125: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/125.jpg)
Slack Detector
Goal: Determine whether instruction has n cycles of slack
1. Delay the instruction by n cycles
2. Check if critical (via critical-path analyzer)
3. No: instruction has n cycles of slack
4. Yes: instruction does not have n cycles of slack
“Delay and observe”
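The delay-and-observe test can be imitated in software on a small dependence graph. This sketch is only a stand-in for the hardware: the graph model (instructions start when their producers finish, no resource limits) and the function names are assumptions, and `is_critical` plays the role of the critical-path analyzer by checking whether the instruction lies on a longest path.

```python
def is_critical(latency, deps, node):
    """True if `node` lies on a longest path of the dependence graph.
    latency: dict instruction -> cycles, in program (topological) order.
    deps: dict instruction -> list of producer instructions."""
    order = list(latency)
    finish = {}
    for n in order:
        finish[n] = max((finish[p] for p in deps.get(n, [])), default=0) + latency[n]
    end = max(finish.values())
    latest = {}
    for n in reversed(order):
        users = [c for c in order if n in deps.get(c, [])]
        latest[n] = min((latest[c] - latency[c] for c in users), default=end)
    return latest[node] == finish[node]


def has_slack(latency, deps, node, n):
    """Delay `node` by n cycles, then observe whether it became critical."""
    delayed = dict(latency)
    delayed[node] += n
    return not is_critical(delayed, deps, node)


latency = {"load": 10, "a": 1, "b": 1, "c": 1}
deps = {"b": ["a"], "c": ["load", "b"]}
print(has_slack(latency, deps, "a", 5))     # True
print(has_slack(latency, deps, "a", 9))     # False
print(has_slack(latency, deps, "load", 1))  # False
```

The check is conservative at the boundary: a delay that exactly uses up the available slack puts the instruction on the critical path, so it is reported as not having that much slack.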
![Page 126: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/126.jpg)
Slack Application
![Page 127: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/127.jpg)
Fast/slow cluster microarchitecture
[Figure: block diagram with Fetch + Rename, Steer, a fast 3-wide cluster and a slow 3-wide cluster (each with WIN, Reg, and ALUs), Data Cache, and Bypass Bus.]
Aggressive non-uniform design:
• Higher execution latencies
• Increased (cross-domain) bypass latency
• Decreased effective issue bandwidth
P ∝ F²: save ~37% core power
![Page 128: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/128.jpg)
Picking bins for the slack predictor
Two decisions:
1. Steer to fast/slow cluster
2. Schedule with high/low priority within a cluster
Use implicit slack predictor with four bins:
1. Steer to fast cluster + schedule with high priority
2. Steer to fast cluster + schedule with low priority
3. Steer to slow cluster + schedule with high priority
4. Steer to slow cluster + schedule with low priority
![Page 129: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/129.jpg)
Slack-based policies
[Figure: bar chart of normalized IPC (y-axis, 0 to 1.1) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: 2 fast, high-power clusters; slack-based policy; reg-dep steering.]
10% better performance from hiding non-uniformities
![Page 130: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/130.jpg)
CMP case study
![Page 131: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/131.jpg)
Multithreaded Execution Case Study
Two questions:
• How should a program be divided into threads?
  • what makes a good cutpoint?
  • how can we find them automatically, or at least help programmers find them?
• What should a multiple-core design look like?
  • should each core support out-of-order execution?
  • should SMT be supported?
  • how many processors are useful?
  • what is the effect of inter-processor latency?
![Page 132: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/132.jpg)
Parallelizing an application
Why parallelize a single-thread application?
• Legacy code, large code bases
• Difficult-to-parallelize apps
  • Interpreted code, kernels of operating systems
• Like to use better programming languages
  • Scheme, Java instead of C/C++
![Page 133: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/133.jpg)
Parallelizing an application
Simplifying assumption
• Program binary unchanged
Simplified problem statement
• Given a program of length L, find the cutpoint that divides the program into two threads with maximum speedup
Must consider:
• data dependences, execution latencies, control dependences, proper load balancing
![Page 134: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/134.jpg)
Parallelizing an application
Naive solution:
• try every possible cutpoint
Our solution:
• efficiently determine the effect of every possible cutpoint
• model execution before and after every cut
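For contrast, the naive approach can be sketched directly: evaluate every cutpoint and keep the best. The timing model below is a toy assumption, not the graph model from the talk: each thread starts at most one instruction per cycle in program order, an instruction waits for its producers, and a cross-thread dependence pays an optional `sync` penalty. The function names are illustrative.

```python
def run_time(latency, deps, cut, sync=0):
    """Completion time with instructions [0, cut) on one core and
    [cut, L) on another. cut == 0 means a single thread.
    latency: list of cycles per instruction, in program order.
    deps: dict instruction index -> list of producer indices."""
    start, finish = [], []
    for i, lat in enumerate(latency):
        # In-order start within a thread: one instruction per cycle.
        first_in_thread = (i == 0) or (i == cut)
        ready = 0 if first_in_thread else start[i - 1] + 1
        for p in deps.get(i, []):
            crosses = (p < cut) != (i < cut)
            ready = max(ready, finish[p] + (sync if crosses else 0))
        start.append(ready)
        finish.append(ready + lat)
    return max(finish)


def best_cutpoint(latency, deps, sync=0):
    """Naive search: try every possible cutpoint and keep the fastest."""
    return min(range(1, len(latency)),
               key=lambda cut: run_time(latency, deps, cut, sync))


# Two independent three-instruction chains, back to back.
latency = [3, 1, 1, 3, 1, 1]
deps = {1: [0], 2: [1], 4: [3], 5: [4]}
print(run_time(latency, deps, 0))    # 10 (single thread)
print(best_cutpoint(latency, deps))  # 3
print(run_time(latency, deps, 3))    # 5
```

Each call to `run_time` re-evaluates the whole program, so the search costs on the order of L evaluations of length L, which is the inefficiency the graph-based approach on this slide aims to avoid.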
![Page 135: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/135.jpg)
Solution
[Figure: dependence graph with F, E, and C nodes for each instruction, from the first instruction to the last instruction, with latency-weighted edges and a start node.]
![Page 136: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/136.jpg)
Parallelizing an application
Considerations:
• Synchronization overhead
  • add latency to EE edges
  • synchronization may involve turning EE to EF
• Scheduling of threads
  • additional CF edges
Challenges:
• State behavior (one thread to multiple processors)
  • caches, branch predictor
• Control behavior
  • limits where cutpoints can be made
![Page 137: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/137.jpg)
Parallelizing an application
More general problem:
• Divide a program into N threads
• NP-complete
Icost can help:
• icost(p1,p2) << 0 implies p1 and p2 redundant
• action: move p1 and p2 further apart
![Page 138: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/138.jpg)
Preliminary Results
Experimental Setup
• Simulator, based loosely on SimpleScalar
• Alpha SpecInt binaries
Procedure
1. Assume execution trace is known
2. Look at each 1k run of instructions
3. Test every possible cutpoint using 1k graphs
![Page 139: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/139.jpg)
Dynamic Cutpoints
[Figure: “Cost Distribution of Dynamic Cutpoints”: cumulative percent of cutpoints (y-axis, 0 to 100) versus execution time reduction in cycles (x-axis, 0 to 100); curves: bzip, crafty, eon, gap, gcc, parser, perl, twolf, vpr.]
Only 20% of cuts yield benefits of > 20 cycles
![Page 140: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/140.jpg)
Usefulness of cost-based policy
[Figure: “Speedups from parallelizing programs for a two-processor system”: speedup % (y-axis, 0 to 30) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr; series: fixed-interval, simple cost-based.]
![Page 141: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/141.jpg)
Static Cutpoints
[Figure: Cost Distribution of Static Cutpoints. x-axis: avg. per-dynamic-instance cost of static instructions, 0 to 180; y-axis: cumulative pct. of instructions, 0 to 100; one curve per benchmark: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr]
Up to 60% of cuts yield benefits of > 20 cycles
![Page 142: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/142.jpg)
Future Avenues of Research
• Map cutpoints back to actual code
• Compare automatically generated cutpoints to human-generated ones
• See what performance gains are in a simulator, as opposed to just on the graph
• Look at the effect of synchronization operations
• What additional overhead do they introduce?
• Deal with state, control problems
• Might need some technique outside of the graph
![Page 143: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/143.jpg)
Multithreaded Execution Case Study
Two possible questions:
• How should a program be divided into threads?
• what makes a good cutpoint?
• how can we find them automatically, or at least help programmers find them?
• What should a multiple-core design look like?
• should each core support out-of-order execution?
• should SMT be supported?
• how many processors are useful?
• what is the effect of inter-processor latency?
![Page 144: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/144.jpg)
CMP design study
What we can do:
• Try out many configurations quickly
• dramatic changes in architecture are often only small changes in the graph
• Identifying bottlenecks
• especially interactions
![Page 145: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/145.jpg)
CMP design study: Out-of-orderness
Is out-of-order execution necessary in a CMP?
Procedure
• model execution with different configurations
• adjust CD edges
• compute breakdowns
• notice resource/events interacting with CD edges
![Page 146: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/146.jpg)
CMP design study: Out-of-orderness
[Figure: dependence graph from the first instruction to the last instruction, with rows of F, E, and C nodes and latency-weighted edges]
![Page 147: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/147.jpg)
CMP design study: Out-of-orderness
Results summary
• Single-core: Performance taps out at 256 entries
• CMP: Performance gains up through 1024 entries
• some benchmarks see gains up to 16k entries
Why more beneficial?
• Use breakdowns to find out.....
![Page 148: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/148.jpg)
CMP design study: Out-of-orderness
Components of window cost
• cache misses holding up retirement?
• long strands of data dependencies?
• predictable control flow?
Icost breakdowns give quantitative and qualitative answers
![Page 149: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/149.jpg)
CMP design study: Out-of-orderness
cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) = 0
[Figure: window cost on a 0% to 100% scale, split into ALU, cache-miss, and interaction components for three cases: Independent, Parallel Interaction, and Serial Interaction; label: equal]
![Page 150: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/150.jpg)
Summary of Preliminary Results
icost(window, ALU operations) << 0
• primarily communication between processors
• window often stalled waiting for data
Implications
• larger window may be overkill
• need a cheap non-blocking solution
• e.g., continual-flow pipelines
![Page 151: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/151.jpg)
CMP design study: SMT?
Benefits
• reduced thread start-up latency
• reduced communication costs
How we could help
• distribution of thread lengths
• breakdowns to understand effect of communication
![Page 152: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/152.jpg)
CMP design study: How many processors?
[Figure: thread timeline diagram; labels: #1, #2, Start #1]
![Page 153: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/153.jpg)
CMP design study: Other Questions
What is the effect of inter-processor communication latency?
• understand hidden vs. exposed communication
Allocating processors to programs
• methodology for O/S to better assign programs to processors
![Page 154: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/154.jpg)
Waterfall To Graph Story
![Page 155: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/155.jpg)
Standard Waterfall Diagram
[Figure: waterfall diagram over cycles 1 to 15; each instruction below has F, E, and C events placed on the time axis]
R5 = 0
R3 = 0
R1 = #array + R3
R6 = ld[R1]
R3 = R3 + 1
R5 = R6 + R5
cmp R6, 0
bf L1
R5 = R5 + 100
R0 = R5
Ret R0
![Page 156: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/156.jpg)
Annotated with Dependence Edges
[Figure: the same waterfall diagram (cycles 1 to 15, instructions R5 = 0 through Ret R0), with dependence edges drawn between F, E, and C events]
![Page 157: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/157.jpg)
Annotated with Dependence Edges
[Figure: the same waterfall diagram with dependence edges; edge legend: Fetch BW, Data Dep, ROB, Branch Misp.]
![Page 158: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/158.jpg)
Edge Weights Added
[Figure: the same waterfall diagram with a latency weight on each dependence edge]
![Page 159: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/159.jpg)
Convert to Graph
[Figure: the instruction sequence R5 = 0 through Ret R0 redrawn as a graph with F, E, and C nodes per instruction and weighted edges]
![Page 160: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/160.jpg)
Find Critical Path
[Figure: the same graph of F, E, and C nodes with weighted edges; the critical path is highlighted]
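Finding the critical path in such a graph amounts to finding the longest weighted path through a DAG. The sketch below is an illustration under stated assumptions, not the deck's tool: the two-instruction graph, its node names (`F0`, `E0`, ...) and its edge weights are hypothetical, because the weights in the slide's figure did not survive extraction.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def critical_path(edges, source, sink):
    """Return (length, path) of the longest path from source to sink in a DAG.

    edges: list of (u, v, weight) tuples.
    """
    preds = {}
    for u, v, w in edges:
        preds.setdefault(v, []).append((u, w))
        preds.setdefault(u, [])
    # static_order() yields every node after all of its predecessors.
    order = TopologicalSorter(
        {v: [u for u, _ in ps] for v, ps in preds.items()}
    ).static_order()
    dist = {source: 0}
    best_pred = {}
    for v in order:
        for u, w in preds[v]:
            if u in dist and (v not in dist or dist[u] + w > dist[v]):
                dist[v] = dist[u] + w
                best_pred[v] = u
    # Walk back from the sink along the recorded best predecessors.
    path = [sink]
    while path[-1] != source:
        path.append(best_pred[path[-1]])
    return dist[sink], path[::-1]


# Hypothetical graph: instruction 0 is a load, instruction 1 consumes its result.
EDGES = [
    ("F0", "E0", 1), ("E0", "C0", 1),   # instruction 0: fetch, execute, commit
    ("F0", "F1", 1),                    # in-order fetch
    ("F1", "E1", 1), ("E0", "E1", 3),   # data dependence with a 3-cycle latency
    ("E1", "C1", 1), ("C0", "C1", 1),   # in-order commit
]

length, path = critical_path(EDGES, "F0", "C1")
print(length, path)
```

In this toy graph the data dependence E0 to E1 dominates, so the path runs through both execute nodes rather than along the fetch or commit chains.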
![Page 161: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/161.jpg)
Add Non-last-arriving Edges
[Figure: the same graph of F, E, and C nodes for instructions R5 = 0 through Ret R0, with the non-last-arriving edges and their weights added]
![Page 162: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/162.jpg)
Graph Alterations
[Figure: the same graph of F, E, and C nodes for instructions R5 = 0 through Ret R0, altered so that the branch misprediction is made correct]
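The effect of a graph alteration can be measured as the change in critical-path length. The following is a minimal sketch, assuming a misprediction is modeled as an E to F edge carrying the recovery penalty and that "made correct" means dropping that edge. The graph, the 10-cycle penalty, and all names are hypothetical.

```python
from functools import lru_cache


def longest_path(edges, sink):
    """Length of the longest path ending at sink in a DAG of (u, v, weight) edges."""
    preds = {}
    for u, v, w in edges:
        preds.setdefault(v, []).append((u, w))

    @lru_cache(maxsize=None)
    def dist(v):
        # Nodes without predecessors start at time 0.
        return max((dist(u) + w for u, w in preds.get(v, [])), default=0)

    return dist(sink)


# Hypothetical graph: instruction 0 is a mispredicted branch, instruction 1 follows it.
EDGES = [
    ("F0", "E0", 1), ("E0", "C0", 1),
    ("F0", "F1", 1),                    # in-order fetch
    ("E0", "F1", 10),                   # misprediction recovery (assumed 10 cycles)
    ("F1", "E1", 1), ("E1", "C1", 1),
    ("C0", "C1", 1),                    # in-order commit
]

base = longest_path(EDGES, "C1")
# Alteration: treat the branch as correctly predicted by removing the E0 -> F1 edge.
altered = [e for e in EDGES if (e[0], e[1]) != ("E0", "F1")]
ideal = longest_path(altered, "C1")
misprediction_cost = base - ideal
print(base, ideal, misprediction_cost)
```

The difference between the two lengths is the cost of the event, in the same sense that the icost slides use the word.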
![Page 163: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/163.jpg)
Token-passing analyzer
![Page 164: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/164.jpg)
Step 1. Observing
Observation: R1 ← R2 + R3
If dependence into R2 is on critical path, then value of R2 arrived last.
critical ⇒ arrives last
arrives last ⇏ critical
[Figure: E node with incoming dependences from R2 and R3; label: Dependence resolved early]
![Page 165: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/165.jpg)
Determining last-arrive edges
Observe events within the machine
last_arrive[F] =
• EF if branch misp.
• CF if ROB stall
• FF otherwise
last_arrive[E] =
• FE if data ready on fetch
• EE observe arrival order of operands
last_arrive[C] =
• EC if commit pointer is delayed
• CC otherwise
[Figure: small F, E, C node diagrams illustrating each edge type]
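The rules above can be written as three small selection functions. This is a sketch only: the argument names are hypothetical stand-ins for whatever signals the hardware observes, and checking the branch misprediction before the ROB stall is an assumed priority that the slide does not state.

```python
def last_arrive_F(branch_mispredicted, rob_stalled):
    """Last-arrive edge into a fetch (F) node."""
    if branch_mispredicted:   # assumed to take priority over a ROB stall
        return "EF"
    if rob_stalled:
        return "CF"
    return "FF"


def last_arrive_E(data_ready_on_fetch, operand_arrival):
    """Last-arrive edge into an execute (E) node.

    operand_arrival: dict mapping producer id -> cycle its value arrived.
    Returns (edge type, producer id or None).
    """
    if data_ready_on_fetch or not operand_arrival:
        return ("FE", None)
    # Observe the arrival order of operands: the latest one is last-arriving.
    producer = max(operand_arrival, key=operand_arrival.get)
    return ("EE", producer)


def last_arrive_C(commit_pointer_delayed):
    """Last-arrive edge into a commit (C) node."""
    return "EC" if commit_pointer_delayed else "CC"


print(last_arrive_F(False, True))
print(last_arrive_E(False, {"i3": 7, "i5": 9}))
print(last_arrive_C(False))
```

Each node receives exactly one last-arrive edge under these rules, which is what lets later steps follow a single chain through the graph.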
![Page 166: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/166.jpg)
Last-arrive edges: a CPU stethoscope
[Figure: a CPU emitting a stream of observed last-arrive edges of types EC, EE, FE, CF, FF, EF, and CC]
![Page 167: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/167.jpg)
Last-arrive edges
[Figure: dependence graph with rows of F, E, and C nodes and latency-weighted edges; the last-arrive edges are highlighted]
![Page 168: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/168.jpg)
Remove latencies
[Figure: the same graph with rows of F, E, and C nodes, drawn without edge latencies]
Do not need explicit weights
![Page 169: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/169.jpg)
Last-arrive edges
The last-arrive rule
CP consists only of “last-arrive” edges
[Figure: graph with rows of F, E, and C nodes; last-arrive edges are marked]
![Page 170: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/170.jpg)
Prune the graph
Only need to put last-arrive edges in graph
No other edges could be on CP
[Figure: pruned graph with rows of F, E, and C nodes; the newest node is marked]
![Page 171: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/171.jpg)
…and we’ve found the critical path!
Backward propagate along last-arrive edges
[Figure: graph with rows of F, E, and C nodes; propagation starts at the newest node]
Found CP by only observing last-arrive edges but still requires constructing entire graph
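Backward propagation is a short loop once each node's last-arrive producer is recorded. The sketch below assumes nodes are `(type, instruction index)` pairs and that every node has at most one last-arrive edge. The small edge map is hypothetical.

```python
def backward_critical_path(last_arrive, newest):
    """Follow last-arrive edges backward from the newest node.

    last_arrive: dict mapping a node to the producer of its last-arrive edge.
    Returns the path in forward (oldest-first) order.
    """
    path = [newest]
    while path[-1] in last_arrive:
        path.append(last_arrive[path[-1]])
    return path[::-1]


# Hypothetical last-arrive edges for two instructions.
LAST_ARRIVE = {
    ("C", 1): ("E", 1),   # EC edge
    ("E", 1): ("E", 0),   # EE edge: operand from instruction 0 arrived last
    ("E", 0): ("F", 0),   # FE edge
    ("F", 1): ("F", 0),   # FF edge (not on the critical path)
    ("C", 0): ("E", 0),   # EC edge (not on the critical path)
}

cp = backward_critical_path(LAST_ARRIVE, ("C", 1))
print(cp)
```

The map has to cover the whole run before the walk can start, which is the drawback the slide points out.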
![Page 172: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/172.jpg)
Step 2. Efficient analysis
CP is a “long” chain of last-arrive edges.
The longer a given chain of last-arrive edges, the more likely it is part of the CP.
Algorithm: find sufficiently long last-arrive chains
1. Plant token into a node n
2. Propagate forward, only along last-arrive edges
3. Check for token after several hundred cycles
4. If token alive, n is assumed critical
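The four steps can be simulated in software. This is a minimal sketch under stated assumptions: nodes are `(instruction index, type)` pairs, the stream lists each node with the producer of its last-arrive edge in an order where producers come first, and the check distance is measured in instructions rather than cycles. The stream and the horizon of 5 are hypothetical and far smaller than the several hundred cycles the slide mentions.

```python
def token_alive(stream, plant, horizon):
    """Plant a token at node `plant` and pass it forward along last-arrive edges.

    stream: list of (node, producer) pairs, producers appearing before consumers.
    Returns True if some node at least `horizon` instructions later holds the token.
    """
    holders = {plant}
    for node, producer in stream:
        if producer in holders:          # token moves only along last-arrive edges
            holders.add(node)
    return any(inst >= plant[0] + horizon for inst, _ in holders)


# Hypothetical stream: each E node waits on the previous E node (a dependence chain),
# each F node follows the previous F node, and each C node waits on its own E node.
stream = []
for i in range(1, 6):
    stream += [
        ((i, "F"), (i - 1, "F")),
        ((i, "E"), (i - 1, "E")),
        ((i, "C"), (i, "E")),
    ]

e_critical = token_alive(stream, (0, "E"), horizon=5)   # token rides the E chain
c_critical = token_alive(stream, (0, "C"), horizon=5)   # no edge leaves (0, "C")
print(e_critical, c_critical)
```

Only the set of current token holders is kept, so the graph itself never has to be stored.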
![Page 173: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/173.jpg)
Token-passing example
1. plant token
2. propagate token
3. is token alive?
4. yes, train critical
[Figure: a token propagating along a chain of last-arrive edges; labels: Critical, ROB Size]
Found CP without constructing entire graph
![Page 174: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/174.jpg)
Implementation: a small SRAM array
[Figure: Token Queue SRAM array with Read and Write ports; inputs: last-arrive producer node (inst id, type) and committed (inst id, type)]
Size of SRAM: 3 bits × ROB size < 200 bytes
Simply replicate for additional tokens
![Page 175: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/175.jpg)
Putting it all together
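[Figure: block diagram. Training path: the OOO core sends last-arrive edges (producer → retired instr) to the Token-Passing Analyzer, which updates the CP prediction table. Prediction path: the table is indexed by PC and answers “E-critical?” for the OOO core]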
![Page 176: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/176.jpg)
Scheduling and Steering
![Page 177: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/177.jpg)
Case Study #1: Clustered architectures
[Figure: clustered pipeline with steering, issue window, and scheduling]
1. Current state of art (Base)
2. Base + CP Scheduling
3. Base + CP Scheduling + CP Steering
![Page 178: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/178.jpg)
Current State of the Art
[Figure: bar chart of normalized IPC (0.60 to 1.10) for eon, crafty, gcc, gzip, perl, vortex, galgel, mesa; series: unclustered, 2 cluster, 4 cluster]
Constant issue width, clock frequency
Avg. clustering penalty for 4 clusters: 19%
![Page 179: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/179.jpg)
CP Optimizations
Base + CP Scheduling
[Figure: bar chart of normalized IPC (0.60 to 1.10) for eon, crafty, gcc, gzip, perl, vortex, galgel, mesa; series: unclustered, 2 cluster, 4 cluster]
![Page 180: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/180.jpg)
CP Optimizations
Base + CP Scheduling + CP Steering
[Figure: bar chart of normalized IPC (0.60 to 1.10) for eon, crafty, gcc, gzip, perl, vortex, galgel, mesa; series: unclustered, 2 cluster, 4 cluster]
Avg. clustering penalty reduced from 19% to 6%
![Page 181: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/181.jpg)
Token-passing Vs. Heuristics
![Page 182: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/182.jpg)
Local Vs. Global Analysis
[Figure: bar chart of speedup (-5.0% to 25.0%) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; series: oldest-uncommitted, oldest-unissued, token-passing]
Previous CP predictors: local resource-sensitive predictions (HPCA 01, ISCA 01)
CP exploitation seems to require global analysis
![Page 183: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/183.jpg)
Icost case study
![Page 184: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/184.jpg)
Icost Case Study: Deep pipelines
Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …
But can often mitigate them indirectly
Assume 4-cycle DL1 access; how to mitigate?
• Increase cache ports?
• Increase window size?
• Increase fetch BW?
• Reduce cache misses?
Really, looking for serial interactions!
![Page 185: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/185.jpg)
Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and weighted edges; the 4-cycle DL1 access edges and a window edge are labeled]
![Page 195: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/195.jpg)
Icost Breakdown (6 wide, 64-entry window)

| | gcc | gzip | vortex |
|---|---|---|---|
| DL1 | 18.3 % | 30.5 % | 25.8 % |
| DL1+window | -4.2 | -15.3 | -24.5 |
| DL1+bw | 10.0 | 6.0 | 15.5 |
| DL1+bmisp | -7.0 | -3.4 | -0.3 |
| DL1+dmiss | -1.4 | -0.4 | -1.4 |
| DL1+alu | -1.6 | -8.2 | -4.7 |
| DL1+imiss | 0.1 | 0.0 | 0.4 |
| ... | ... | ... | ... |
| Total | 100.0 | 100.0 | 100.0 |
![Page 197: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/197.jpg)
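Vortex Breakdowns, enlarging the window

| Window size | 64 | 128 | 256 |
|---|---|---|---|
| DL1 | 25.8 | 8.9 | 3.9 |
| DL1+window | -24.5 | -7.7 | -2.6 |
| DL1+bw | 15.5 | 16.7 | 13.2 |
| DL1+bmisp | -0.3 | -0.6 | -0.8 |
| DL1+dmiss | -1.4 | -2.1 | -2.8 |
| DL1+alu | -4.7 | -2.5 | -0.4 |
| DL1+imiss | 0.4 | 0.5 | 0.3 |
| ... | ... | ... | ... |
| Total | 100.0 | 80.8 | 75.0 |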
![Page 198: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/198.jpg)
Shotgun Profiling
![Page 199: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/199.jpg)
Profiling goal
Goal:
• Construct graph
Constraint:
• Can only sample sparsely
[Figure: a stream of many dynamic instructions]
![Page 200: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/200.jpg)
Profiling goal
Goal:
• Construct graph
Constraint:
• Can only sample sparsely
Genome sequencing
[Figure: a DNA strand]
![Page 204: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/204.jpg)
“Shotgun” genome sequencing
[Figure: a DNA strand and many short sampled fragments]
Find overlaps among samples
![Page 205: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/205.jpg)
Mapping “shotgun” to our situation
[Figure: a stream of many dynamic instructions; legend: Icache miss, Dcache miss, Branch misp., No event]
![Page 207: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/207.jpg)
Profiler hardware requirements
[Figure: sampled fragments of the instruction stream; label: Match!]
![Page 208: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/208.jpg)
Offline Profiler Algorithm
long sample
detailed samples
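One way to picture the offline matching step is as a search for a detailed sample's signature inside the long sample. This is a sketch only, and it assumes a representation the slides do not spell out: the long sample is a string of per-instruction signature bits, and each detailed sample carries the signature bits of the instructions it covers. The bit strings are hypothetical.

```python
def find_matches(long_sig, detailed_sig):
    """Return every offset in long_sig where detailed_sig occurs."""
    n, m = len(long_sig), len(detailed_sig)
    return [i for i in range(n - m + 1) if long_sig[i:i + m] == detailed_sig]


# Hypothetical signatures: one bit per dynamic instruction.
LONG_SAMPLE = "0010110010"
matches = find_matches(LONG_SAMPLE, "101")
no_matches = find_matches(LONG_SAMPLE, "111")
print(matches, no_matches)
```

Wherever a detailed sample matches, its graph fragment can stand in for that stretch of the long sample. Longer signatures make accidental matches less likely.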
![Page 209: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/209.jpg)
Design issues
Identify microexecution context
• Choosing signature bits
• Determining PCs (for better detailed sample matching)
[Figure: a long sample beginning at Start PC, followed by PCs 12, 16, 20, 24, 56, 60, . . .; at the branch, encode the taken/not-taken bit in the signature; comparison labels: “if =”, “then =”]
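The PC sequence on this slide (12, 16, 20, 24, 56, 60) can be recovered from a start PC plus one taken/not-taken bit per branch. The sketch below assumes fixed 4-byte instructions and a table of static branch targets read from the binary. Both are assumptions, as are the function and parameter names.

```python
def reconstruct_pcs(start_pc, branch_targets, taken_bits, count, inst_size=4):
    """Rebuild the PCs of `count` dynamic instructions.

    branch_targets: dict mapping a branch's PC to its taken target.
    taken_bits: one bool per dynamic branch encountered, in order.
    """
    pcs = []
    pc = start_pc
    bits = iter(taken_bits)
    for _ in range(count):
        pcs.append(pc)
        if pc in branch_targets and next(bits):
            pc = branch_targets[pc]      # taken branch: jump to its target
        else:
            pc += inst_size              # fall through to the next instruction
    return pcs


# The branch at PC 24 jumps to 56 when taken (target is hypothetical).
taken = reconstruct_pcs(12, {24: 56}, [True], 6)
not_taken = reconstruct_pcs(12, {24: 56}, [False], 6)
print(taken)
print(not_taken)
```

With one bit per branch the long sample stays compact, yet every instruction's PC is still available when matching detailed samples.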
![Page 217: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/217.jpg)
Sources of error

| Error Source | Gcc | Parser | Twolf |
|---|---|---|---|
| Building graph fragments | 5.3 % | 1.5 % | 1.6 % |
| Sampling only a few graph fragments | 4.8 % | 6.5 % | 7.2 % |
| Modeling execution as a graph | 2.1 % | 6.0 % | 0.1 % |
| Total | 12.2 % | 14.0 % | 8.9 % |
![Page 218: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/218.jpg)
Icost vs. Sensitivity Study
![Page 219: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/219.jpg)
Compare Icost and Sensitivity Study
Corollary to DL1 and ROB serial interaction: As load latency increases, the benefit from enlarging the ROB increases.
[Figure: dependence graph over instructions i1 to i6, with F, E, and C nodes, weighted edges, and the DL1 access edge marked]
![Page 220: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/220.jpg)
Compare Icost and Sensitivity Study
[Figure: speedup (y-axis, 0 to 25) versus ROB size (x-axis: 64, 128, 192, 256), with one curve per DL1 latency]
![Page 221: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/221.jpg)
Compare Icost and Sensitivity Study
Sensitivity Study Advantages
• More information
• e.g., concave or convex curves
Interaction Cost Advantages
• Easy (automatic) interpretation
• Sign and magnitude have well-defined meanings
• Concise communication
• e.g., "DL1 and ROB interact serially"
![Page 222: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/222.jpg)
Outline
• Definition (ISCA ’01)
• what does it mean for an event to be critical?
• Detection (ISCA ’01)
• how can we determine what events are critical?
• Interpretation (MICRO ’04, TACO ’04)
• what does it mean for two events to interact?
• Application (ISCA ’01-’02, TACO ’04)
• how can we exploit criticality in hardware?
![Page 223: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/223.jpg)
Our solution: measure interactions
Two parallel cache misses (each 100 cycles)
miss #1 (100)
miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction
icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100
![Page 224: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/224.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1 in parallel with miss #2)
2. Zero icost: ?
![Page 225: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/225.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1 in parallel with miss #2)
2. Zero icost: independent (miss #1 . . . miss #2)
3. Negative icost: ?
![Page 226: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/226.jpg)
Negative icost
Two serial cache misses (data dependent)
miss #1 (100)
miss #2 (100)
Cost(miss #1) = ?
ALU latency (110 cycles)
![Page 227: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/227.jpg)
Negative icost
Two serial cache misses (data dependent)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
ALU latency (110 cycles)
miss #1 (100)
miss #2 (100)
icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90
Negative icost: serial interaction
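The two worked examples (parallel misses and serial misses next to a 110-cycle ALU path) can be reproduced with a small sketch. It assumes a deliberately simplified model that is not from the slides: execution time is the longest of several parallel paths, each path is a serial list of events, and "removing" an event sets its latency to zero. The names `exec_time`, `cost`, `icost`, and `classify` are hypothetical helpers.

```python
def exec_time(paths, removed=frozenset()):
    """Longest parallel path; events in `removed` are idealized to 0 cycles."""
    return max(
        sum(0 if name in removed else latency for name, latency in path)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved by idealizing every event in `events` at once."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def icost(paths, a, b):
    """Aggregate cost minus the sum of the individual costs."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


def classify(value):
    """Map the sign of an icost to the interaction type named on the slides."""
    if value > 0:
        return "parallel"
    if value < 0:
        return "serial"
    return "independent"


# Two parallel cache misses, 100 cycles each (assumed to be the only paths).
parallel_case = [[("miss1", 100)], [("miss2", 100)]]

# Two data-dependent misses in series, alongside 110 cycles of ALU latency.
serial_case = [[("miss1", 100), ("miss2", 100)], [("alu", 110)]]

for label, paths in (("parallel misses", parallel_case), ("serial misses", serial_case)):
    value = icost(paths, "miss1", "miss2")
    print(label, value, classify(value))
```

Under this model the parallel case gives an icost of 100 and the serial case gives -90, matching the slide arithmetic.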
![Page 228: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/228.jpg)
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
1. Positive icost: parallel interaction (miss #1 in parallel with miss #2)
2. Zero icost: independent (miss #1 . . . miss #2)
3. Negative icost: serial interaction (miss #1 followed by miss #2, alongside ALU latency)
Other event labels on the slide: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall
![Page 229: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/229.jpg)
Why care about serial interactions?
ALU latency (110 cycles)
miss #1 (100)
miss #2 (100)
Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us).
Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
![Page 230: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/230.jpg)
Outline
• Definition (ISCA ’01)
• what does it mean for an event to be critical?
• Detection (ISCA ’01)
• how can we determine what events are critical?
• Interpretation (MICRO ’04, TACO ’04)
• what does it mean for two events to interact?
• Application (ISCA ’01-’02, TACO ’04)
• how can we exploit criticality in hardware?
![Page 231: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/231.jpg)
Criticality Analyzer (ISCA ’01)
Goal
• Detect criticality of dynamic instructions
Procedure
1. Observe last-arriving edges
• uses simple rules
2. Propagate a token forward along last-arriving edges
• at worst, a read-modify-write sequence to a small array
3. If token dies, non-critical; otherwise, critical
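The token-passing idea can be illustrated in software with a minimal sketch. It makes several assumptions that are not on the slide: nodes are integers numbered in execution order, each node records only the single predecessor whose edge arrived last, and a token counts as "alive" if it reaches the final node of the observed fragment. The names `is_critical` and `last_arriving` are hypothetical, and the hardware analyzer itself is not modeled.

```python
def is_critical(last_arriving, start):
    """Plant a token at `start` and pass it forward along last-arriving edges.

    last_arriving maps each node to the predecessor whose edge arrived last
    (None for a node with no predecessor). Predecessors are assumed to have
    smaller numbers than their successors.
    """
    nodes = sorted(last_arriving)
    holders = {start}
    for node in nodes:
        # A later node inherits the token only through its last-arriving edge.
        if node > start and last_arriving[node] in holders:
            holders.add(node)
    # The token survived if it reached the end of the fragment.
    return nodes[-1] in holders


# Hypothetical fragment: node 1 hangs off node 0 but nothing waits on it last.
last_arriving = {0: None, 1: 0, 2: 0, 3: 2, 4: 3, 5: 4}

print([n for n in last_arriving if is_critical(last_arriving, n)])
```

In this fragment the token planted at node 1 dies immediately, so node 1 is reported as non-critical, while the chain 0, 2, 3, 4, 5 is reported as critical.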
![Page 232: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/232.jpg)
Slack Analyzer (ISCA ’02)
Goal
• Detect likely slack of static instructions
Procedure
1. Delay the instruction by n cycles
2. Check if critical (via critical-path analyzer)
• No, instruction has n cycles of slack
• Yes, instruction does not have n cycles of slack
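A rough software analogue of the delay-and-check procedure is sketched below. It swaps the hardware critical-path analyzer for a simpler test, which is an assumption of this sketch: an instruction is treated as having n cycles of slack if adding n cycles to its latency leaves the total execution time of a small dependence graph unchanged. The function names and the example graph are hypothetical.

```python
def finish_times(latency, deps):
    """Finish time of each node; `latency` must list nodes in topological order."""
    finish = {}
    for node in latency:
        # A node starts once all of its dependences have finished.
        start = max((finish[d] for d in deps.get(node, ())), default=0)
        finish[node] = start + latency[node]
    return finish


def has_slack(latency, deps, node, n):
    """True if delaying `node` by n cycles does not lengthen execution."""
    baseline = max(finish_times(latency, deps).values())
    delayed = dict(latency)
    delayed[node] += n  # step 1: delay the instruction by n cycles
    # Step 2 (stand-in for the criticality check): did execution get longer?
    return max(finish_times(delayed, deps).values()) <= baseline


# Hypothetical graph: "use" waits for a 100-cycle load and a 10-cycle ALU op.
latency = {"load": 100, "alu": 10, "use": 1}
deps = {"use": ["load", "alu"]}

print(has_slack(latency, deps, "alu", 90))   # ALU op can absorb 90 cycles
print(has_slack(latency, deps, "alu", 91))   # one more cycle lengthens execution
print(has_slack(latency, deps, "load", 1))   # the load has no slack
```

Probing several values of n in this way brackets the slack of an instruction without computing the full critical path.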
![Page 233: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)](https://reader031.fdocuments.in/reader031/viewer/2022012918/56649e415503460f94b339ed/html5/thumbnails/233.jpg)
Shotgun Profiling (TACO ’04)
Goal
• Create representative graph fragments
Procedure
• Enhance ProfileMe counters with context
• Use context to piece together counter samples