Taming Hardware Event Samples for FDO Compilation
![Page 1: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/1.jpg)
1
Taming Hardware Event Samples for FDO Compilation
Dehao Chen (Tsinghua University)
Neil Vachharajani, Robert Hundt, Shih-wei Liao (Google)
Vinodha Ramasamy
Paul Yuan (Peking University)
Wenguang Chen, Weimin Zheng (Tsinghua University)
![Page 2: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/2.jpg)
2
Why FDO?
• Feedback Directed Optimization
• Performance Improvements
  – 5% speedup on SPEC2000 INT
  – Small? Huge for millions of computers
• Not widely adopted
![Page 3: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/3.jpg)
3
Instrumentation based FDO
1. gcc -fprofile-generate … → Instrumented Binary
2. Run the instrumented binary on a Representative Workload → .gcda files
3. gcc -fprofile-use … (FDO) → optimized binary
Drawbacks:
1. Have to build twice
2. Instrumentation run is slow
3. Need representative input
4. Perturbs execution
![Page 4: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/4.jpg)
4
Sample based FDO
gcc -O2 -g … → Normal Binary
Running Environment: Normal Binary + Real-World Workload, observed by Profiling Tools (Oprofile, Pfmon) → Profile Data
gcc -fsample-profile … (FDO) → optimized binary
1. Previous deployment/test binary to collect profile
2. Profiling input: real traffic
3. Profiling does not perturb code
![Page 5: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/5.jpg)
5
PMU Sampling
• Performance monitoring unit (PMU)
  o Captures events generated by CPU
    – cache miss
    – instruction retired
    – clock tick
  o Configurable counters increment on selected events
  o Optional interrupt on counter overflow
• Sampling
  o On interrupt capture instruction pointer (IP)
  o Can also sample other state
    – registers
    – other PMU counters
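The counter-overflow mechanism above can be sketched as a small simulation. This is only an illustration under stated assumptions: the trace, the addresses `0x10` and `0x20`, the period of 7, and the function name `sample_on_overflow` are all hypothetical, and real PMUs are programmed through hardware registers or tools such as Oprofile rather than Python.

```python
from collections import Counter

def sample_on_overflow(event_stream, period):
    """Simulate one PMU counter: count events and record the IP at each overflow."""
    samples = Counter()
    counter = 0
    for ip in event_stream:        # one entry per event (e.g. instruction retired)
        counter += 1
        if counter == period:      # counter overflow raises the sampling interrupt
            samples[ip] += 1       # the handler captures the instruction pointer
            counter = 0            # re-arm the counter for the next period
    return samples

# Hypothetical trace: an instruction at 0x10 retires 9 times per iteration,
# an instruction at 0x20 retires once per iteration.
trace = ([0x10] * 9 + [0x20]) * 100
samples = sample_on_overflow(trace, period=7)
print(samples)
```

The frequently executed address collects most of the samples, which is the property sample based FDO relies on.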
![Page 6: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/6.jpg)
6
Sampling Instructions Retired
Instruction Samples (count, address)
1499 0x76a1f4
1517 0x76a48a
1489 0x76aea1
1527 0x76c09d
1 0x77e3cf
733 0x77ee7e
1242 0x78109d
Symbolizer → Symbolized Samples
0x76a1f4 : 1499 foo.c:1183
0x76a48a : 1517 foo.c:992
0x76aea1 : 1489 foo.c:906
0x76c09d : 1527 foo.h:1821
0x77e3cf : 1 bar.c:3481
0x77ee7e : 733 bar.c:4759
0x78109d : 1242 bar.c:4762
GCC → per-block sample counts
foo.c:853 : 0
foo.c:906, foo.c:992 : 3006
foo.c:1183 : 1499
foo.c:1325 : 0
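The step from symbolized samples to per-block counts can be sketched as follows. The sample values are taken from the slide; the `BLOCKS` mapping from basic blocks to source lines, and the names `bb_853` and so on, are assumptions made for illustration, since the slide only shows the lines and the resulting counts (3006 = 1489 + 1517).

```python
from collections import defaultdict

# Symbolized samples from the slide: (address, sample count, source location).
SYMBOLIZED = [
    (0x76a1f4, 1499, "foo.c:1183"),
    (0x76a48a, 1517, "foo.c:992"),
    (0x76aea1, 1489, "foo.c:906"),
    (0x76c09d, 1527, "foo.h:1821"),
    (0x77e3cf, 1, "bar.c:3481"),
    (0x77ee7e, 733, "bar.c:4759"),
    (0x78109d, 1242, "bar.c:4762"),
]

# Assumed mapping from basic block (hypothetical names) to its source lines.
BLOCKS = {
    "bb_853": ["foo.c:853"],
    "bb_906_992": ["foo.c:906", "foo.c:992"],
    "bb_1183": ["foo.c:1183"],
    "bb_1325": ["foo.c:1325"],
}

def block_counts(symbolized, blocks):
    """Sum samples per source line, then per basic block."""
    per_line = defaultdict(int)
    for _addr, count, loc in symbolized:
        per_line[loc] += count
    # Lines that never received a sample contribute 0.
    return {bb: sum(per_line[loc] for loc in locs) for bb, locs in blocks.items()}

counts = block_counts(SYMBOLIZED, BLOCKS)
print(counts)
```

Blocks whose lines were never sampled come out as 0, which is the gap the flow propagation on the next slide fills in.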
![Page 7: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/7.jpg)
7
Sampling Instructions Retired
[Same instruction samples and symbolized samples as on the previous slide; the per-block counts are now made flow consistent.]
GCC → per-block counts after propagation
foo.c:853 : 4505
foo.c:906, foo.c:992 : 3006
foo.c:1183 : 1499
foo.c:1325 : 4505
Edge counts: 3006, 3006, 1499, 1499
[Levin et al. Complementing Missing and Inaccurate Profiling Using a Minimum Cost Circulation Algorithm. HiPEAC’08]
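A minimal flow-consistency check for the counts on this slide might look like the sketch below. It assumes the four blocks form a diamond (one entry, two branches, one exit), which the slide suggests but does not state. The block labels and the helper `inconsistent_blocks` are hypothetical, and this only detects inconsistencies rather than running the minimum cost circulation algorithm.

```python
def inconsistent_blocks(block_count, edge_count):
    """Return blocks whose count differs from their incoming or outgoing edge flow."""
    bad = []
    for bb, count in block_count.items():
        incoming = [c for (_src, dst), c in edge_count.items() if dst == bb]
        outgoing = [c for (src, _dst), c in edge_count.items() if src == bb]
        # Entry blocks have no incoming edges and exit blocks no outgoing ones,
        # so only check the sides that actually exist.
        if (incoming and sum(incoming) != count) or (outgoing and sum(outgoing) != count):
            bad.append(bb)
    return bad

# Assumed diamond-shaped CFG with the edge counts shown on the slide.
EDGES = {
    ("foo.c:853", "foo.c:906-992"): 3006,
    ("foo.c:853", "foo.c:1183"): 1499,
    ("foo.c:906-992", "foo.c:1325"): 3006,
    ("foo.c:1183", "foo.c:1325"): 1499,
}
raw = {"foo.c:853": 0, "foo.c:906-992": 3006, "foo.c:1183": 1499, "foo.c:1325": 0}
fixed = {"foo.c:853": 4505, "foo.c:906-992": 3006, "foo.c:1183": 1499, "foo.c:1325": 4505}

print(inconsistent_blocks(raw, EDGES))
print(inconsistent_blocks(fixed, EDGES))
```

With the raw sampled counts the entry and exit blocks violate flow conservation; after propagation (4505 = 3006 + 1499) no block does.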
![Page 8: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/8.jpg)
8
Accuracy Challenge
![Page 9: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/9.jpg)
9
Accuracy Challenge
![Page 10: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/10.jpg)
10
Accuracy Challenge
![Page 11: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/11.jpg)
11
Improving the Accuracy
• Flow Consistency
  – Use Minimum Cost Circulation Algorithm
  – Control flow → Network flow
• Predict Aggregation/Shadow Effect
  – Sampling Multiple Events
  – Using the prediction to adjust the frequency
![Page 12: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/12.jpg)
[Figure: minimum cost circulation example on a flow network with Source and Sink nodes. Sampled counts: Branch 7954, Taken 7922, Join 1049. Other labels in the figure: 0, 32, 6873, 6905 (6905 = 6873 + 32).]
12
[Levin et al. Complementing Missing and Inaccurate Profiling Using a Minimum Cost Circulation Algorithm. HiPEAC’08]
![Page 13: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/13.jpg)
13
Use Prediction to Adjust Profile
• Cost Function in MCC
  – Each basic block has two edges attached
    • Forward (flow represents increasing the count)
    • Backward (flow represents decreasing the count)
  – Cost function for each edge
    • A larger cost discourages changing the count in that direction
• Using the prediction
  – Over-sampled: high cost on forward edge
  – Under-sampled: high cost on backward edge
[Figure: nodes BB1’ and BB1’’ connected by a Forward Edge and a Backward Edge]
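The effect of the two edge costs can be illustrated with a small sketch. It is not the minimum cost circulation algorithm: it uses a brute-force search over one shared count for blocks on a hypothetical straight-line path, and the cost values (1, 2, 10), the constant names and the function names are assumptions chosen only to show the direction of the bias. The counts 7954 and 7922 are reused from the earlier example purely as sample numbers.

```python
NEUTRAL, OVER_SAMPLED, UNDER_SAMPLED = "neutral", "over", "under"

def edge_costs(prediction, low=1, base=2, high=10):
    """Return (forward_cost, backward_cost) per unit of count change."""
    if prediction == OVER_SAMPLED:
        return high, low       # high forward cost: discourage raising the count
    if prediction == UNDER_SAMPLED:
        return low, high       # high backward cost: discourage lowering the count
    return base, base

def adjustment_cost(sampled, new_count, prediction):
    """Cost of moving one block from its sampled count to new_count."""
    forward, backward = edge_costs(prediction)
    if new_count >= sampled:
        return forward * (new_count - sampled)
    return backward * (sampled - new_count)

def best_common_count(blocks):
    """Brute-force stand-in for MCC: pick one count shared by all blocks.

    blocks is a list of (sampled_count, prediction) pairs for blocks that
    must end up with the same count (e.g. a straight-line path).
    """
    lowest = min(s for s, _ in blocks)
    highest = max(s for s, _ in blocks)
    return min(range(lowest, highest + 1),
               key=lambda x: sum(adjustment_cost(s, x, p) for s, p in blocks))

print(best_common_count([(7954, OVER_SAMPLED), (7922, NEUTRAL)]))
print(best_common_count([(7954, NEUTRAL), (7922, UNDER_SAMPLED)]))
```

When the first block is predicted over-sampled, the cheapest fix lowers it to 7922; when the second is predicted under-sampled, the cheapest fix raises it to 7954.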
![Page 14: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/14.jpg)
14
Predict Aggregation/Shadow
• Model Aggregation Effects
  – Long latency instructions
  – Sample major long latency events
    • Branch Mispredict, Cache/DTLB Miss, etc
    • Estimate the stalls these events will cause
    • Skid has little influence on long latency events
![Page 15: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/15.jpg)
15
Predict Aggregation/Shadow
• Model Shadow Effects
  – CPU_CORE_CYCLES event
    • Time based sampling
    • Skid will only shift the profile
    • CPU_CYCLE - INST_RETIRED → Stalled Cycle (with skid)
    • Each stalled cycle will set a shadow area
• Aggregation and Shadow co-exist
  – Heuristic to check which one dominates
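One possible reading of the "CPU_CYCLE - INST_RETIRED" bullet is a per-address difference between the two sample profiles. The sketch below assumes both events were sampled with the same period and are directly comparable, which the slide does not state. The addresses, counts and the function name `stalled_cycle_estimate` are hypothetical, and the heuristic that decides whether aggregation or shadow dominates is not modeled.

```python
def stalled_cycle_estimate(cycle_samples, retired_samples):
    """Per-address estimate: cycle samples minus instructions-retired samples, floored at 0."""
    addresses = set(cycle_samples) | set(retired_samples)
    return {
        addr: max(0, cycle_samples.get(addr, 0) - retired_samples.get(addr, 0))
        for addr in addresses
    }

# Hypothetical sample counts keyed by instruction address.
cycles = {0x100: 40, 0x104: 900, 0x108: 35}
retired = {0x100: 38, 0x104: 45, 0x108: 36}

stalls = stalled_cycle_estimate(cycles, retired)
print(stalls)
```

An address with many more cycle samples than retired-instruction samples (0x104 here) is a candidate stall site, and the instructions after it would fall in its shadow area.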
![Page 16: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/16.jpg)
16
Evaluation: Accuracy
[Figure: bar chart of profile accuracy, y-axis from 50% to 100%; series: Static Estimation, MCC, Our Prediction, Perfect Prediction]
![Page 17: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/17.jpg)
17
Evaluation: Performance
[Figure: bar chart per benchmark, y-axis from -5% to 15%; benchmarks: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, Geomean; series: Sample FDO, Instr FDO]
![Page 18: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/18.jpg)
18
Conclusion and Future Work
• Sampling based FDO is promising
• The artifacts in PMU data can be compensated for with appropriate understanding and heuristics, which improves the accuracy by 6%
• Sample based Value Profiling
• Future: Last Branch Register
  – More precise edge profile at binary level
• Sample based LIPO
![Page 19: Taming Hardware Event Samples for FDO Compilation](https://reader034.fdocuments.in/reader034/viewer/2022050806/568165b3550346895dd8abc0/html5/thumbnails/19.jpg)
19
Questions?