Statistical Methods in Computer Science
The Basis for Experiment Design
for Hypothesis Testing
Ido Dagan
Reminders:
1. Instructions for participating in the experiment are on the course website
2. Excel recitations:
   - Wednesday: in computer room 604/203
   - Thursday: same room next week; no class this week
** Your BIU-CS login should be active **
Empirical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan
Experimental Lifecycle
Model/Theory -> Hypothesis -> Experiment -> Analysis -> (back to Model/Theory)
Proving a Theory?
Methods of proving a proposition:
- An experiment supports it
- We can mathematically prove it

Some propositions cannot be verified empirically:
- "This compiler has linear run-time"
- There are infinitely many possible inputs --> cannot prove the claim empirically

But such propositions may still be disproved:
- e.g., an input that causes the compiler to run in non-linear time
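We cannot test all inputs, but a single input family exhibiting super-linear growth suffices to disprove the claim. A minimal sketch (with an invented workload, counting basic operations deterministically rather than timing wall-clock):

```python
import math

# A sketch of empirically *refuting* a linearity claim: we count basic
# operations deterministically instead of timing, and the workload is invented.

def op_count(n):
    """Hypothetical workload whose cost we probe (secretly quadratic)."""
    ops = 0
    for _ in range(n):
        for _ in range(n):
            ops += 1
    return ops

def growth_exponent(f, n1=200, n2=400):
    """If f were linear, doubling n would roughly double the count (exponent ~ 1)."""
    return math.log(f(n2) / f(n1), 2)

print(round(growth_exponent(op_count), 2))  # ~2.0: such an input family refutes "linear"
```

One refuting observation settles the question; no number of linear-looking runs could have proved it.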
Karl Popper's Philosophy of Science
Popper advanced a particular philosophy of science: falsifiability.
For a theory to be considered scientific, it must be falsifiable:
- There must be some way to refute it, in principle
- Not falsifiable <==> not scientific

Examples:
- "All crows are black": falsifiable by finding a white crow
- "Compiles in linear time": falsifiable by observing non-linear performance

A theory is tested on its predictions.
Proving by disproving...
Platt ("Strong Inference", 1964) offers a specific method:
1) Devise alternative hypotheses for the observations
2) Devise experiment(s) allowing elimination of hypotheses
3) Carry out the experiments to obtain a clean result
4) Go to 1

The idea is to advance by eliminating (rejecting) hypotheses.
Forming Hypotheses
So, to support theory X, we:
1) Construct falsification hypotheses X1, ..., Xn, ...
2) Systematically experiment to disprove X by trying to prove each Xi
3) If all falsification hypotheses are eliminated, this lends support to the theory

Note that further falsification hypotheses may be formed in the future:
- The theory must continue to hold against such "attacks"
- Popper: scientific evolution, "survival of the fittest theory"
- E.g., Newton's theory
How does this view hold in computer science?
Forming Hypotheses in CS
(1) Carefully identify the theoretical claim we are studying:
- e.g., "the relation between input size and run-time is linear"
- e.g., "the display improves user performance"

(2) Identify the falsification hypothesis (null hypothesis) H0:
- e.g., "there is an input size for which run-time is non-linear"
- e.g., "the display will have no effect on user performance"

(3) Now, experiment to eliminate H0
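One common way to try to eliminate such an H0 is a two-sample significance test. Below is a sketch of a permutation test on hypothetical user-performance data (all numbers invented); a small p-value is evidence against H0:

```python
import random
import statistics

def permutation_test(a, b, n_iter=2000, seed=0):
    """Two-sample permutation test: estimates P(a |mean difference| at least as
    large as observed, assuming H0: both samples come from one distribution)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # under H0, group labels are arbitrary
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    return extreme / n_iter

# Invented data: task completion times (seconds) with / without the new display.
with_display = [12.1, 11.4, 10.9, 12.6, 11.2, 10.8]
without_display = [13.5, 14.1, 12.9, 13.8, 14.4, 13.2]

p = permutation_test(with_display, without_display)
print(p)  # a small p-value is evidence against H0 ("no effect")
```

Rejecting H0 does not prove the theory; it only eliminates one falsification hypothesis, in the spirit of strong inference.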
The Basics of Experiment Design
Experiments identify a relation between variables X, Y, ...
Simple experiments provide an indication of a relation:
- better/worse, linear or non-linear, ...

Advanced experiments help identify causes and interactions:
- e.g., linear in input size, but the constant factor depends on the type of data
Types of Experiments and Variables
Manipulation experiments:
- Manipulate (set the value of) independent variables (e.g., input size)
- Observe (measure the value of) dependent variables (e.g., run time)

Observation experiments:
- Observe predictor variables (e.g., a person's height)
- Observe response variables (e.g., running speed)
- Also system run time, if observing a system in actual use

Other variables:
- Endogenous: on the causal path between independent and dependent variables
- Exogenous: other variables influencing the dependent variables
An example of an observation experiment
Theory: gender affects test scores

Falsifying hypothesis: gender does not affect test scores
- I.e., men and women perform the same

We cannot use manipulation experiments:
- We cannot control gender

We must use observation experiments instead.
An example observation experiment
(à la "Empirical Methods in AI", Cohen 1995)

Two observed records:

Child 1: # siblings: 2; mother: artist; gender: male; height: 145cm; teacher's attitude; child confidence; test score: 650
Child 2: # siblings: 3; mother: doctor; gender: female; height: 135cm; teacher's attitude; child confidence; test score: 720

Classifying the variables:
- Independent (predictor) variable: gender
- Dependent (response) variable: test score
- Endogenous variables (on the causal path from gender to score): teacher's attitude, child confidence
- Exogenous variables (other influences on the score): # siblings, mother's occupation, height
Experiment Design: Introduction
Different experiment types explore different hypotheses. For instance, a very simple design: the treatment experiment, sometimes known as a lesion study.

Variables: V0 (the manipulated independent variable), exogenous variables V1, V2, ..., Vn, and the dependent variable:

treatment:  Ind1       Ex1  Ex2  ...  Exn   ->  Dep1
control:    Not(Ind1)  Ex1  Ex2  ...  Exn   ->  Dep2

- Treatment condition: independent variable set to "with treatment"
- Control condition: independent variable set to "no treatment"
- The populations are "identical" in all other variables
- This determines the relation between the categorical variable V0 and the dependent variable
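A lesion study of this shape can be simulated: run the same workload with a component enabled (treatment) and disabled (control), holding everything else fixed. The system, workload, and cost model below are all invented for illustration:

```python
import random

# A lesion-study sketch: the same workload runs with a component enabled
# (treatment) and disabled (control); the system and cost model are invented.

def run_system(workload, use_cache):
    """Hypothetical system; `use_cache` is the manipulated independent variable."""
    cost = 0
    seen = set()
    for item in workload:
        if use_cache and item in seen:
            cost += 1        # cache hit: cheap
        else:
            cost += 10       # full computation
            seen.add(item)
    return cost

rng = random.Random(42)
workload = [rng.randrange(20) for _ in range(200)]  # identical input for both conditions

treatment = run_system(workload, use_cache=True)
control = run_system(workload, use_cache=False)
print(treatment, control)  # the manipulated variable is the only difference
```

Because the workload is byte-for-byte identical in both conditions, any cost difference can be attributed to the treatment.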
Single-Factor Treatment Experiments
A generalization of treatment experiments that allows comparison of several conditions.

Variables: V0 (the independent variable), exogenous variables V1, V2, ..., Vn, and the dependent variable:

treatment1:  Ind1      Ex1  Ex2  ...  Exn   ->  Dep1
treatment2:  Ind2      Ex1  Ex2  ...  Exn   ->  Dep2
[control:    Not(Ind)  Ex1  Ex2  ...  Exn   ->  Dep3]

- Compare the performance of algorithm A to B to C ...
- The control condition is optional (e.g., to establish a baseline)
Careful!
An effect on the dependent variable may not be as expected
Example experiment:
- Hypothesis: a fly's ears are on its wings
- Fly with two wings: make a loud noise, observe flight
- Fly with one wing: make a loud noise, no flight
- Conclusion: a fly with only one wing cannot hear!

What's going on here?
- First, over-interpretation by the experimenter
- But also, lack of sufficient falsifiability: there are other possible explanations for why the fly wouldn't fly; another variable (the wing) affects the dependent variable (flying)
Controlling for other factors
Often, we cannot manipulate all exogenous variables. Then we need to make sure they are sampled randomly:
- Randomization averages out their effect

This can be difficult:
- e.g., suppose we are trying to relate gender and math scores
- We control for the effect of # of siblings by random sampling
- But # of siblings may itself be related to gender: parents continue to have children hoping for a boy (Beal 1994), so # of siblings is tied to gender
- We must therefore separate the results based on # of siblings
Factorial Experiment Designs
• Every combination of factor values is sampled
  – The hope is to exclude or reveal interactions
• This creates a combinatorial number of experiments
  – N factors with k values each = k^N combinations
• Strategies for eliminating values:
  – Merge values into categories; skip values
  – Focus on extremes, to get a general trend
• But this may hide behavior at intermediate values
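The combinatorial blow-up is easy to see by enumerating a full factorial grid; the factor names and values below are invented for illustration:

```python
from itertools import product

# Enumerating a full factorial design; the factors and values are invented.
factors = {
    "algorithm": ["A", "B", "C"],
    "input_size": [1000, 10000],
    "data_type": ["sorted", "random"],
}

# One experimental condition per combination of factor values.
conditions = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(conditions))  # 3 * 2 * 2 = 12 runs; with k values per factor, k**N
```

Adding one more value to each factor, or one more factor, multiplies the number of runs, which is why value-elimination strategies matter.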
Tips for Factorial Experiments
For "numerical" variables, 2 value ranges are not enough:
- They don't give a good sense of the function relating the variables

Measure, measure, measure:
- Piggybacking measurements on planned experiments is cheaper than re-running the experiments
Experiment Validity
Types of validity: internal and external.

Internal validity: the experiment shows the relationship (the independent variable causes the dependent one)

External validity: the degree to which results generalize to other conditions

Threats: uncontrolled conditions threatening validity
Internal validity threats: Examples
Order effects:
- Practice effects in human or animal test subjects
  - E.g., user performance improves over successive user-interface tasks
  - Solution: randomize the order of presentation to subjects
- A bug or side-effect in the testing system leaves the system "unclean" for the next trial: the system must be "cleaned" between experiments
- Treatment and control given in two different orders
  - E.g., running with/without the new algorithm, for the same users
  - The order may be good for treatment and bad for control (or vice versa)
  - Solution: counter-balancing (use all possible orders)

Demand effects:
- The experimenter influences the subjects, e.g., by guiding them

Confounding effects: the variable relations aren't clear
- See "fly with no wings cannot hear"
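Counter-balancing can be sketched mechanically: enumerate every possible presentation order of the conditions and spread subjects evenly across them (the subject IDs below are invented):

```python
from itertools import permutations

# Counter-balancing: assign subjects across all possible presentation orders
# of the conditions, so order effects average out. Subject IDs are invented.
conditions = ["control", "treatment"]
orders = list(permutations(conditions))   # all 2! = 2 possible orders

subjects = ["subject_%d" % i for i in range(6)]
assignment = {s: orders[i % len(orders)] for i, s in enumerate(subjects)}

for subject, order in assignment.items():
    print(subject, "->", order)
```

With more than a few conditions, all n! orders become impractical, and a balanced subset (e.g., a Latin square) is used instead.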
External threats to validity
Outline:
- Sampling bias: non-representative samples
  - e.g., non-representative external factors
- Floor and ceiling effects
  - The problems tested are too hard or too easy
- Regression effects
  - Results have nowhere to go but up (or down)

Solution approach: run pilot experiments.
Sampling Bias
The setting prefers measuring specific values over others. For instance:
- "Random" manual selection of mice from a cage for an experiment: the mice actually picked are the slow ones that don't bite (not aggressive), ...
- Including only results that were found by some deadline

Solution: detect, and remove
- e.g., by visualization, looking for non-normal distributions
- e.g., a surprising distribution of the dependent data for different values of the independent variable
Baselines: Floor and Ceiling Effects
How do we know A is good? Bad? Maybe the problems are too simple? Too hard?

For example:
- A new machine learning algorithm has 95% accuracy
- Is this good?

Controlling for floor/ceiling effects:
- Establish baselines
- Show whether a "silly" approach achieves a close result
- Compare to a strawman (easy to beat) or an ironman (hard to beat)
- Baselines may be misleading if not chosen appropriately
Regression Effects
General phenomenon: "regression towards the mean"
- Repeated measurements converge towards mean values

Example threat:
- Run a program on 100 different inputs
- Inputs 6, 14, 15 get very low scores
- We now fix the problem that affected only these inputs, and want to re-test
- If chance has anything to do with scoring, then we must re-run them all
- Why? The scores on 6, 14, 15 have nowhere to go but up, so re-running only these inputs will show improvement by chance

Solution:
- Re-run the complete tests, or sample conditions uniformly
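Regression toward the mean is easy to reproduce in simulation. In the sketch below (invented score distribution), scores are pure chance and nothing is actually "fixed", yet re-testing only the worst-scoring inputs shows apparent improvement:

```python
import random
import statistics

# A simulation sketch with invented scores: results are pure chance and nothing
# is actually "fixed", yet re-testing only the worst inputs shows improvement.
rng = random.Random(1)

first_run = [rng.gauss(70, 10) for _ in range(100)]          # 100 inputs, one score each
worst = sorted(range(100), key=lambda i: first_run[i])[:10]  # the 10 lowest-scoring inputs

# Re-test only those inputs; the new scores come from the same distribution.
second_run = {i: rng.gauss(70, 10) for i in worst}

before = statistics.mean(first_run[i] for i in worst)
after = statistics.mean(second_run.values())
print(round(before, 1), "->", round(after, 1))  # the bottom scores regress up toward the mean
```

Since the "fix" changed nothing, the entire improvement is regression toward the mean; only re-running all 100 inputs (or a uniform sample) would reveal that.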
Summary
Defensive thinking:
- If I were trying to disprove the claim, what would I do?
- Then think of ways to counter any possible attack on the claim

Strong Inference, Popper's falsification ideas:
- Science moves by (empirically) disproving theories

Experiment design:
- Ideal independent variables: easy to manipulate
- Ideal dependent variables: measurable, sensitive, and meaningful
- Carefully think through the threats