Benchmarking and Performance Evaluations
Transcript of Benchmarking and Performance Evaluations
Todd Mytkowicz, Microsoft Research
Let’s poll for an upcoming election
I ask 3 of my co-workers who they are voting for.
• My approach does not deal with:
  – Variability
  – Bias
Issues with my approach
Variability (source: http://www.pollster.com)
My approach is not reproducible
Issues with my approach (II)
Bias (source: http://www.pollster.com)
My approach is not generalizable
Take Home Message
• Variability and Bias are two different things
  – The difference between reproducible and generalizable!
Do we have to worry about Variability and Bias when we benchmark?
Let’s evaluate the speedup of my whizbang idea
What do we do about Variability?
• Statistics to the rescue:
  – mean
  – confidence interval
Intuition for T-Test
• Each face 1–6 is equally likely (p = 1/6)
• Throw the die 10 times; calculate the mean
| Trial | Mean of 10 throws |
|-------|-------------------|
| 1     | 4.0               |
| 2     | 4.3               |
| 3     | 4.9               |
| 4     | 3.8               |
| 5     | 4.3               |
| 6     | 2.9               |
| …     | …                 |
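The intuition above can be sketched with a quick simulation (an illustrative sketch, not from the slides): individual throws range over 1–6, but the means of 10 throws cluster around the true mean of 3.5, and it is the spread of those sample means that a t-test reasons about.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def mean_of_throws(n=10):
    """Throw a fair six-sided die n times and return the mean."""
    return statistics.mean(random.randint(1, 6) for _ in range(n))

# Each individual throw is 1..6, but means of 10 throws cluster near 3.5.
trial_means = [mean_of_throws() for _ in range(1000)]
```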
Back to our Benchmark: Managing Variability
```
> x = scan('file')
Read 20 items
> t.test(x)

        One Sample t-test

data:  x
t = 49.277, df = 19, p-value < 2.2e-16
95 percent confidence interval:
 1.146525 1.248241
sample estimates:
mean of x
 1.197383
```
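The interval t.test reports can be reproduced by hand. A minimal sketch in Python (with the t critical value looked up from a table rather than computed, since the standard library has no t distribution):

```python
import math
import statistics

def t_confidence_interval(samples, t_crit):
    """95% confidence interval for the mean: mean ± t_crit * s / sqrt(n).
    t_crit is the two-sided t critical value for n - 1 degrees of
    freedom (e.g. 2.093 for n = 20), taken from a t table."""
    n = len(samples)
    mean = statistics.mean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(n)
    return mean - t_crit * stderr, mean + t_crit * stderr
```

With the 20 timing samples above and t_crit = 2.093, this reproduces the interval R prints.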
So we can handle Variability. What about Bias?
Evaluating compiler optimizations
System = gcc -O2 perlbench
System + Innovation = gcc -O3 perlbench
Madan: speedup = 1.18 ± 0.0002 → conclusion: O3 is good
Todd: speedup = 0.84 ± 0.0002 → conclusion: O3 is bad

Why does this happen?
Differences in our experimental setup
Madan: HOME=/home/madan
Todd: HOME=/home/toddmytkowicz
[Figure: memory layout of each run. The environment strings sit above the stack and the program text below, so a longer $HOME shifts where the stack starts.]
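One way to probe this effect (a sketch under assumptions, not from the slides; BENCH_PAD is a made-up padding variable) is to rerun the same binary with environments of different sizes, since the environment block's total size moves stack addresses, and with them alignment-sensitive timings:

```python
import os
import subprocess
import sys

def run_with_env_size(cmd, pad_bytes):
    """Run cmd with the environment padded by pad_bytes extra bytes.
    The environment strings are copied above the stack, so their total
    size shifts stack addresses between otherwise identical runs."""
    env = dict(os.environ)
    env["BENCH_PAD"] = "x" * pad_bytes  # hypothetical padding variable
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Example: time the same benchmark under several environment sizes.
# for pad in (0, 512, 1024, 2048):
#     run_with_env_size(["./perlbench", "input"], pad)
```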
Runtime of SPEC CPU 2006 perlbench depends on who runs it!
Bias from linking orders
[Figure: speedup of 32 randomly generated linking orders]
Order of .o files can lead to contradictory conclusions
Where exactly does Bias come from?
Interactions with hardware buffers
[Figure sequence: under O2, the hot code, dead code, and the code affected by O3 occupy particular positions within Page N and Page N+1; compiling with O3 shifts this layout. Depending on whether the hot code straddles a page (or cache line) boundary, one layout makes O3 better than O2 and another makes O2 better than O3.]
Other Sources of Bias
• JIT
• Garbage Collection
• CPU Affinity
• Domain specific (e.g. size of input data)

How do we manage these?
– JIT:
  • ngen to remove the impact of JIT
  • a “warmup” phase to JIT code before measurement
– Garbage Collection:
  • Try different heap sizes (JVM)
  • a “warmup” phase to build data structures
  • Ensure the program is not “leaking” memory
– CPU Affinity:
  • Try to bind threads to CPUs (SetProcessAffinityMask)
– Domain Specific:
  • Up to you!
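The “warmup” advice above can be sketched as a simple measurement harness (an illustrative sketch, not the slides' own tooling): run the workload a few times untimed, then time each trial separately so a mean and confidence interval can be reported.

```python
import time

def measure(workload, warmup=5, trials=20):
    """Run workload `warmup` times untimed (letting JITs compile and
    caches fill), then time `trials` runs individually and return the
    list of per-run times in seconds."""
    for _ in range(warmup):
        workload()
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return times
```

The per-run times it returns are exactly what feeds a t-test, as in the R session earlier.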
R for the T-Test
• Where to download: http://cran.r-project.org
• Simple intro to get data into R
• Simple intro to do t.test
Some Conclusions
• Performance Evaluations are hard!
  – Variability and Bias are not easy to deal with
• Other experimental sciences go to great effort to work around variability and bias
  – We should too!