Should you trust your experimental results?
![Page 1: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/1.jpg)
Should you trust your experimental results?
Amer Diwan, Google
Stephen M. Blackburn, ANU
Matthias Hauswirth, U. Lugano
Peter F. Sweeney, IBM Research
Attendees of Evaluate '11 workshop
![Page 2: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/2.jpg)
Why worry?
Experiment Innovate
For scientific progress we need sound experiments
![Page 3: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/3.jpg)
Unsound experiments
Make a bad idea look great!
Unsound Experiment Bad Idea
![Page 4: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/4.jpg)
Unsound experiments
Unsound Experiment Great Idea
Make a great idea look bad!
![Page 5: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/5.jpg)
Thesis
Sound experimentation is critical but requires
• Creativity
• Diligence
As a community, we must
• Learn how to design and conduct sound experiments
• Reward sound experimentation
![Page 6: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/6.jpg)
A simple experiment
Goal: To characterise the speedup of optimization O
Experiment: Measure program P on unloaded machine M with/without O
Claim: O speeds up programs by 10%
M
P P/O
T1 T2
![Page 7: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/7.jpg)
<< Scope of claim
Why is this unsound?
Scope of experiment
The relationship of the two scopes determines if an experiment is sound
![Page 8: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/8.jpg)
Sound experiments
Sufficient for sound experiment: Scope of claim <= Scope of experiment
Option 1: Reduce claim
Option 2: Extend experiment
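One way to make the scope rule concrete is to treat each scope as a set of configurations and check for containment. The sketch below is only an illustration: the function `is_sound`, the (program, machine) encoding, and the names Q, R, and N are assumptions, not part of the talk.

```python
from itertools import product

def is_sound(claim_scope: set, experiment_scope: set) -> bool:
    """Sufficient condition from the slide: scope of claim <= scope of experiment."""
    return claim_scope <= experiment_scope

# Hypothetical configurations, encoded as (program, machine) pairs.
experiment_scope = {("P", "M")}

# "O speeds up programs": implicitly covers many programs and machines.
broad_claim = set(product(["P", "Q", "R"], ["M", "N"]))

# Option 1 (reduce claim): "O speeds up P on M".
reduced_claim = {("P", "M")}

# Option 2 (extend experiment): measure everything the broad claim covers.
extended_experiment = set(broad_claim)

print(is_sound(broad_claim, experiment_scope))
print(is_sound(reduced_claim, experiment_scope))
print(is_sound(broad_claim, extended_experiment))
```

Real scopes are rarely enumerable like this, but writing down what the claim implicitly covers is often enough to expose the gap.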
What are the common causes of unsound experiments?
![Page 9: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/9.jpg)
The four fatal sins
It is our pleasure to inform you that your paper titled "Envy of PLDI authors" was accepted to PLDI ...
The deadly sins do not stand in the way of a PLDI acceptance:
But the four fatal sins might!
![Page 10: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/10.jpg)
Sin 1: Ignorance
Defn: Ignoring components necessary for Claim
Experiment: a particular computer
Claim: all computers
![Page 11: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/11.jpg)
Sin 1: Ignorance
Defn: Ignoring components necessary for Claim
Experiment: one benchmark
avora
Ignorance systematically biases results
Claim: full suite
![Page 12: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/12.jpg)
Ignorance is not obvious!
A is better than B I found just the opposite
Have you had this conversation with a collaborator?
![Page 13: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/13.jpg)
Ignoring Linux environment variables
Changing the environment can change the outcome of your experiment!
[Mytkowicz et al., ASPLOS 2009]
Todd's results
My results
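A quick way to check whether your own measurement is sensitive to the environment is to rerun it with the environment padded to different sizes. This is a minimal sketch, not the methodology of the cited paper: the variable name `PADDING`, the helper names, and the stand-in command are all assumptions.

```python
import os
import subprocess
import sys
import time

def env_with_padding(base, n_bytes):
    """Copy `base` and add a dummy variable holding n_bytes characters.

    PADDING is a hypothetical name; its only purpose is to change the
    size of the environment block handed to the child process.
    """
    env = dict(base)
    env["PADDING"] = "x" * n_bytes
    return env

def time_command(cmd, env, repeats=3):
    """Run `cmd` `repeats` times under `env`; return all wall-clock times."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(cmd, env=env, check=True)
        times.append(time.perf_counter() - start)
    return times

if __name__ == "__main__":
    # Stand-in for program P; substitute the real benchmark command.
    cmd = [sys.executable, "-c", "sum(range(100000))"]
    for n_bytes in (0, 1024, 4096):
        times = time_command(cmd, env_with_padding(os.environ, n_bytes))
        print(f"padding={n_bytes:5d} bytes  min={min(times):.4f}s  max={max(times):.4f}s")
```

If the timings shift with the padding, the environment is part of the experiment and belongs in its scope.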
![Page 14: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/14.jpg)
Ignoring heap size
Changing heap size can change the outcome of your experiment!
Graph from [Blackburn et al., OOPSLA 2006]
SS is worst!
SS is best!
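The same point can be checked mechanically: sweep the parameter and see whether the winner changes. In the sketch below the timings are invented for illustration and are not taken from the cited graph; "MS" and "GenMS" are placeholder labels for competing collectors.

```python
def best_per_setting(results):
    """For each setting, return the system with the lowest time."""
    return {setting: min(times, key=times.get) for setting, times in results.items()}

# Invented run times in seconds, for illustration only.
results = {
    "1.5x min heap": {"SS": 14.0, "MS": 10.0, "GenMS": 9.0},
    "6x min heap": {"SS": 7.5, "MS": 8.5, "GenMS": 8.0},
}

winners = best_per_setting(results)
print(winners)
if len(set(winners.values())) > 1:
    print("The winner depends on heap size: report the whole sweep.")
```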
![Page 15: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/15.jpg)
Ignoring profiler bias
Different profilers can yield contradictory conclusions!
[Mytkowicz et al., PLDI 2010]
![Page 16: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/16.jpg)
Sin 2: Inappropriateness
Defn: Using components irrelevant for Claim
Experiment: Server applications
Claim: Mobile performance
![Page 17: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/17.jpg)
Sin 2: Inappropriateness
Defn: Using components irrelevant for Claim
Experiment: Compute benchmarks
Claim: GC performance
Inappropriateness produces unsupported claims
http://www.ivankuznetsov.com/
![Page 18: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/18.jpg)
Inappropriateness is not obvious!
Has your optimization ever delivered a 10% improvement
...which never materialized in the "wild"?
![Page 19: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/19.jpg)
Inappropriate statistics
Have you ever been fooled by a lucky outlier?
[Georges and Eeckhout, 2007]:
(SemiSpace is best by far) (SemiSpace is one of the best)
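One common guard against lucky outliers is to compare confidence intervals for the mean rather than best-of-N times. The sketch below uses a normal approximation for brevity, which is an assumption; with so few runs a Student's t interval would be wider. The run times are invented.

```python
import math
import statistics

def confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `samples`."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem

def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Invented run times in seconds; x contains one lucky outlier (9.1).
x = [10.2, 10.4, 9.1, 10.3, 10.5, 10.1, 10.4, 10.2]
y = [9.9, 10.0, 10.1, 9.8, 10.0, 10.2, 9.9, 10.1]

print("best-of-N says x wins:", min(x) < min(y))
ci_x, ci_y = confidence_interval(x), confidence_interval(y)
print("95% CI x:", ci_x)
print("95% CI y:", ci_y)
print("intervals overlap:", intervals_overlap(ci_x, ci_y))
```

Here the best single run favors x, yet the intervals overlap, so this data does not support a claim that either system is faster.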
![Page 20: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/20.jpg)
Inappropriate data analysis
A single Google search = 100s of RPCs
99th percentile affects a majority of the requests!
A mean is inappropriate if long-tail latency matters!
A: Mean: 45.0, 99pc: 450
B: Mean: 45.0, 99pc: 50
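The sketch below builds two synthetic latency distributions with the same mean but very different tails, then estimates how many fan-out requests touch the tail. The data is constructed for illustration and is not the data behind the slide; the fan-out estimate assumes 100 independent RPCs per request.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Synthetic latencies (ms), built so that both means are equal.
a = [20] * 95 + [520] * 5
b = [40] * 50 + [50] * 50

for name, xs in (("A", a), ("B", b)):
    print(name, "mean:", statistics.mean(xs), "p99:", percentile(xs, 99))

# Chance that at least one of n independent RPCs falls in the slowest 1%.
n = 100
p_slow = 1 - 0.99 ** n
print(f"{p_slow:.0%} of requests hit at least one 99th-percentile RPC")
```

With a fan-out of 100, roughly 63% of requests wait on at least one tail RPC, which is why the mean alone hides the difference between A and B.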
![Page 21: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/21.jpg)
Inappropriate data analysis
Mean
Do you check the shape of your data before summarizing it?
Cache Hit Cache Miss
Layered systems often use caches at each level:
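A crude text histogram is often enough to reveal that the data is bimodal before a mean is reported. The latencies below are invented (80 cache hits, 20 misses), and `histogram` is an illustrative helper rather than anything from the talk.

```python
import statistics
from collections import Counter

def histogram(samples, bin_width):
    """Return {bin_start: count}, sorted by bin, so the shape is visible."""
    counts = Counter((x // bin_width) * bin_width for x in samples)
    return dict(sorted(counts.items()))

# Invented access latencies (ns): 80 cache hits and 20 cache misses.
latencies = [2] * 80 + [100] * 20

mean = statistics.mean(latencies)
hist = histogram(latencies, bin_width=10)
for start, count in hist.items():
    print(f"{start:4d}-{start + 9:<4d} {'#' * count}")
print("mean:", mean)

# The bin that contains the mean holds no samples at all.
mean_bin = (int(mean) // 10) * 10
print("samples in the mean's bin:", hist.get(mean_bin, 0))
```

The mean describes a latency that no access actually experienced, so reporting the hit and miss populations separately is more honest.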
![Page 22: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/22.jpg)
Inappropriate metric
Have you ever picked a metric that was not ends-based?
With extra nops
![Page 23: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/23.jpg)
Inappropriate metric
Pointer analysis A Pointer analysis B
Program Program
Mean points-to-set = 2 Mean points-to-set = 2
Claim: B is simpler yet just as precise as A
Have you ever used a metric that was inconsistent with "better"?
versus
P Q PR Q R
![Page 24: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/24.jpg)
Sin 3: Inconsistency
Defn: Experiment compares A to B in different contexts
![Page 25: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/25.jpg)
Sin 3: Inconsistency
Claim: B > A
Defn: Experiment compares A to B in different contexts
Experiment: They used P; We used Q
System A: Suite P
System B: Suite Q
d D
Inconsistency misleads!
![Page 26: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/26.jpg)
Inconsistency is not obvious
Workload, context, and metrics must be the same
Measurement Context
System A
Workload
Metrics
Measurement Context
System B
Workload
Metrics
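A comparison can be screened for this sin by recording each setup explicitly and diffing everything except the system under test. A minimal sketch; the `Setup` record and its field values are hypothetical.

```python
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Setup:
    system: str
    workload: str
    context: str
    metric: str

def inconsistencies(a, b):
    """Names of fields, other than `system`, that differ between two setups."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if key != "system" and da[key] != db[key]]

# Hypothetical setups mirroring "They used P; We used Q".
theirs = Setup(system="A", workload="Suite P", context="machine M", metric="run time")
ours = Setup(system="B", workload="Suite Q", context="machine M", metric="run time")

print(inconsistencies(theirs, ours))
```

Any non-empty result means the difference between A and B cannot be attributed to the systems alone.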
![Page 27: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/27.jpg)
Inconsistent workload
I want to evaluate a new optimization for Gmail
Has the workload ever changed from under you?
Optimization enabled
![Page 28: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/28.jpg)
Inconsistent metric
Issued instructions
Retired instructions
Do you (or even vendors) know what each hardware metric means?
![Page 29: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/29.jpg)
Sin 4: Irreproducibility
Irreproducibility makes it harder to identify unsound experiments
Defn: Others cannot reproduce your experiment
Experiment:
Report:
Measurement Context
System
Workload
Metrics
![Page 30: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/30.jpg)
Irreproducibility is not obvious
Omitting any biases can make results irreproducible
Measurement Context
System
Workload
Metrics
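One low-cost habit is to snapshot the measurement context alongside every result. The sketch below records a few items that are easy to capture from Python; the field selection and the benchmark command are assumptions, and a real experiment will likely need more (compiler flags, heap size, hardware counters used).

```python
import json
import os
import platform
import sys

def capture_context(command):
    """Collect details of the measurement context that reports often omit."""
    return {
        "command": command,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version,
        "cwd": os.getcwd(),
        # May contain secrets: filter before publishing.
        "environment": dict(os.environ),
    }

# Hypothetical benchmark invocation.
context = capture_context(["./run_benchmark", "--suite", "P"])
report = json.dumps(context, indent=2, sort_keys=True)
print(report[:400])
```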
![Page 31: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/31.jpg)
Revisiting the thesis
The four fatal sins
• affect all aspects of experiments
• cannot be eliminated with a silver bullet
o (even with a much longer history, other sciences have them too)
It will take creativity and diligence to overcome these sins!
![Page 32: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/32.jpg)
But I can give you one tip
Look your gift horse in the mouth!
![Page 33: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/33.jpg)
Back of the envelope
• Your optimization eliminates memory loads
o Can the count of eliminated loads explain speedup?
• You blame "cache effects" for results you cannot explain...
o Does the variation in cache misses explain results?
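The first check can be done with a few lines of arithmetic. Every number below is hypothetical, including the rough latencies of 4 cycles for a cache hit and 200 cycles for a miss to memory; substitute measured values for your own machine.

```python
def explainable_saving(eliminated_loads, cycles_per_load, clock_hz):
    """Seconds saved if each eliminated load cost `cycles_per_load` cycles."""
    return eliminated_loads * cycles_per_load / clock_hz

# All values are hypothetical.
observed_saving = 1.0          # seconds, e.g. 10.0 s before and 9.0 s after
eliminated_loads = 5_000_000   # counted by the optimization or a profiler
clock_hz = 3e9

best_case = explainable_saving(eliminated_loads, cycles_per_load=4, clock_hz=clock_hz)
worst_case = explainable_saving(eliminated_loads, cycles_per_load=200, clock_hz=clock_hz)

print(f"loads explain between {best_case:.4f}s and {worst_case:.4f}s")
if worst_case < observed_saving:
    print("Eliminated loads cannot explain the speedup: look for another cause.")
```

Even if every eliminated load had missed all the way to memory, these numbers account for about a third of the observed saving, so something else must be going on.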
![Page 34: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/34.jpg)
Rewarding good experimentation
Novelty of algorithm
Quality of experiments
Reject
Loch Ness Monster
Often rejected
Often rejected
Safe Bet
Is this where we want to be?
No evidence that the idea works...
Scope of a paper: Evaluates existing ideas; no new algorithms...
![Page 35: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/35.jpg)
Novel ideas can stand on their own
Novel (and carefully reasoned) ideas expose
• New paths for exploration
• New ways of thinking
A groundbreaking idea and no evaluation >> A groundbreaking idea and misleading evaluation
![Page 36: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/36.jpg)
Insightful experiments can stand on their own!
An insightful experiment may
o Give insight into leading alternatives
o Open up new investigations
o Increase confidence in prior results or approaches
An insightful evaluation and no algorithm >> An insightful evaluation and a lame algorithm
![Page 37: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/37.jpg)
But sound experiments take time!
But not as much as chasing a false lead for years...
How would you feel if you built a product... based on incorrect data?
Do you prefer to build upon:
![Page 38: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/38.jpg)
Why you should care (revisited)
• Has your optimization ever yielded an improvement
o ...even when you had not enabled it?
• Have you ever obtained fantastic results
o ...which even your collaborators could not reproduce?
• Have you ever wasted time chasing a lead
o ...only to realize your experiment was flawed?
• Have you ever read a paper
o ...and immediately decided to ignore the results?
![Page 39: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/39.jpg)
The end
• Experiments are difficult and not just for us
o Jonah Lehrer's "The truth wears off"
• Other sciences have established methods
o It is our turn to learn from them and establish ours!
• Want to learn more?
o The Evaluate collaboratory (http://evaluate.inf.usi.ch)
![Page 40: Should you trust your experimental results?](https://reader035.fdocuments.in/reader035/viewer/2022062315/56814f0e550346895dbca1cf/html5/thumbnails/40.jpg)
Acknowledgements
• Todd Mytkowicz
• Evaluate 2011 attendees: José Nelson Amaral, Vlastimil Babka, Walter Binder, Tim Brecht, Lubomír Bulej, Lieven Eeckhout, Sebastian Fischmeister, Daniel Frampton, Robin Garner, Andy Georges, Laurie J. Hendren, Michael Hind, Antony L. Hosking, Richard E. Jones, Tomas Kalibera, Philippe Moret, Nathaniel Nystrom, Victor Pankratius, Petr Tuma
• My mentors: Mike Hind, Kathryn McKinley, Eliot Moss