Approximate Randomization tests
February 5th, 2013
Classic t-test
Why AR testing?
• Classic tests often assume a given distribution (Student's t, normal, …) of the variable
• This is ≈ok for recall, but not for precision or F-score
• The set of hypotheses that can be tested with non-parametric tests is limited
Illustration
• 30,000 runs, 1000 instances, 500 of class A
• True positives (TP): 400 (stdev: 80)
• False positives (FP): 60 (stdev: 15)
• Assumption: true and false positives for class A are normally distributed. This is already an approximation, since TP and FP are bounded by 0 and the number of instances.
Definitions
• Recall = correctly predicted A / A in reference = correctly predicted A / constant. If the number of correctly predicted A is normal, recall is normal.
• Precision = correctly predicted A / A in system. A in system = TP + FP, so precision is a non-linear combination of TP and FP. Precision is not normal.
• F-score: non-linear combination of recall and precision. Not normal.
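The effect of these definitions can be checked with a small simulation. The sketch below is only an illustration under the stated normality assumption: it draws TP and FP from normal distributions with the means and standard deviations of the illustration, does not clip them to the valid range, and compares the skewness of recall and precision. The function names `skewness` and `simulate` are hypothetical and not part of the slides.

```python
import random
import statistics


def skewness(xs):
    """Population skewness: third central moment divided by stdev cubed."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum((x - mean) ** 3 for x in xs) / (len(xs) * sd ** 3)


def simulate(runs=30_000, n_class_a=500, seed=0):
    """Draw TP and FP per run and derive recall, precision and F-score.

    Assumption: TP ~ N(400, 80) and FP ~ N(60, 15), independent and unclipped.
    """
    rng = random.Random(seed)
    recall, precision, fscore = [], [], []
    for _ in range(runs):
        tp = rng.gauss(400, 80)
        fp = rng.gauss(60, 15)
        recall.append(tp / n_class_a)           # linear in TP
        precision.append(tp / (tp + fp))        # ratio: non-linear in TP and FP
        # F1 = 2PR / (P + R) = 2TP / (2TP + FP + FN), with FN = n_class_a - TP
        fscore.append(2 * tp / (tp + fp + n_class_a))
    return recall, precision, fscore


recall, precision, fscore = simulate()
print("skewness recall   :", round(skewness(recall), 3))
print("skewness precision:", round(skewness(precision), 3))
print("skewness F-score  :", round(skewness(fscore), 3))
```

Recall is a rescaled copy of TP, so its skewness should stay close to zero, while the ratio in precision produces a visibly skewed distribution.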
Approximate randomization test
• No assumption on distribution
• Can handle complicated statistics
• Only assumption: independence between shuffled elements
• References:
– Computer Intensive Methods for Testing Hypotheses, Noreen, 1989.
– More accurate tests for the statistical significance of result differences, Yeh, 2000.
Basic idea
• Exact randomization test

| | Glass 1 | Glass 2 | Glass 3 | Glass 4 |
| --- | --- | --- | --- | --- |
| Contents | Polish | Premium | Russian | Budget |
| Expert | Polish | Premium | Budget | Russian |

Exact probability
H0: expert is independent of contents
P(ncorrect ≥ 2) = 7/24 = 0.29
Thus, do not reject H0 because the probability is larger than alpha=0.05.
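The exact probability can be reproduced by enumerating all 4! = 24 orderings of the expert's answers and counting those with at least as many correct glasses as the actual answer. This is a minimal sketch; the helper names `n_correct` and `exact_p` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of glasses for which the guess matches the contents."""
    return sum(g == t for g, t in zip(guess, truth))


def exact_p(guess, truth):
    """Exact randomization test: P(ncorrect >= actual) over all permutations."""
    actual = n_correct(guess, truth)
    perms = list(permutations(guess))
    nge = sum(n_correct(perm, truth) >= actual for perm in perms)
    return Fraction(nge, len(perms))


p = exact_p(expert, contents)
print(p)  # fraction of permutations with at least 2 correct glasses
```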
Approximate probability
• The number of permutations is n!, which grows very quickly with n
• If there are too many permutations to compute, approximate: P = (nge + 1) / (NS + 1)
– nge: number of times the pseudostatistic ≥ the actual statistic
– NS: number of shuffles
– +1: correction for validity
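As a sketch of the approximation, the glass-tasting data can be shuffled NS times at random instead of enumerating every permutation. The function name `approximate_p`, the default NS = 9999 and the fixed seed are assumptions made for this illustration.

```python
import random

contents = ["Polish", "Premium", "Russian", "Budget"]
expert = ["Polish", "Premium", "Budget", "Russian"]


def n_correct(guess, truth):
    """Number of glasses for which the guess matches the contents."""
    return sum(g == t for g, t in zip(guess, truth))


def approximate_p(guess, truth, n_shuffles=9999, seed=0):
    """Approximate randomization: P = (nge + 1) / (NS + 1)."""
    rng = random.Random(seed)
    actual = n_correct(guess, truth)
    shuffled = list(guess)
    nge = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # one random permutation of the guesses
        if n_correct(shuffled, truth) >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


p = approximate_p(expert, contents)
print(p)  # should lie close to the exact value 7/24
```

With only four glasses the exact test is of course preferable; the example only shows that the approximation converges to the same probability.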
DIFFERENT SETUPS
Translation to instances
• Each glass is an instance
• Contents and expert are two labeling systems
• Contents has an accuracy of 100%, expert has an accuracy of 50%
• Statistic is precision, F-score, recall, … instead of accuracy
Stratified shuffling
• For labeled instances, it makes no sense to move the class label of one instance to another instance
• Only shuffle labels per instance, i.e. swap the labels that the two systems assigned to the same instance
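Stratified shuffling can be sketched as follows: for every instance, the labels of the two systems are swapped with probability 0.5, and the statistic is recomputed on the shuffled outputs. This is an illustration and not the code of the ART script; the names `precision`, `stat_a` and `art_stratified`, the two-sided comparison, and the convention that precision is 0 when a label is never predicted are assumptions.

```python
import random


def precision(gold, predicted, label):
    """Precision for one label; 0.0 by convention if the label is never predicted."""
    gold_of_predicted = [g for g, p in zip(gold, predicted) if p == label]
    if not gold_of_predicted:
        return 0.0
    return sum(g == label for g in gold_of_predicted) / len(gold_of_predicted)


def stat_a(gold, predicted):
    """Statistic used in the example: precision for label A."""
    return precision(gold, predicted, "A")


def art_stratified(gold, sys1, sys2, statistic, n_shuffles=9999, seed=0):
    """Two-sided approximate randomization test with per-instance swaps."""
    rng = random.Random(seed)
    actual = abs(statistic(gold, sys1) - statistic(gold, sys2))
    nge = 0
    for _ in range(n_shuffles):
        out1, out2 = [], []
        for a, b in zip(sys1, sys2):
            if rng.random() < 0.5:  # swap the two labels of this instance
                a, b = b, a
            out1.append(a)
            out2.append(b)
        if abs(statistic(gold, out1) - statistic(gold, out2)) >= actual:
            nge += 1
    return (nge + 1) / (n_shuffles + 1)


gold = list("AABBA")
p = art_stratified(gold, list("AABAA"), list("BABBB"), stat_a, n_shuffles=999)
print(p)
```

Any other statistic (recall, F-score, accuracy) can be passed instead of `stat_a`, since the test itself never inspects how the statistic is computed.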
MBT
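• Assumption of independence between instances
• Tokens within a sentence are not independent, so shuffle per sentence rather than per token

| Token | System 1 | System 2 |
| --- | --- | --- |
| This | DT | NNS |
| is | VBZ | VB |
| nice | JJ | RB |
| . | . | . |
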
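A sentence-level shuffle could look like the sketch below: the tag sequences of both systems are grouped per sentence, and a coin flip decides for each sentence whether the two complete sequences are exchanged. The function name `shuffle_per_sentence` and the data layout (one list of tags per sentence) are assumptions for illustration.

```python
import random


def shuffle_per_sentence(sys1, sys2, rng):
    """Swap whole sentences between two systems, never individual tokens."""
    out1, out2 = [], []
    for tags1, tags2 in zip(sys1, sys2):
        if rng.random() < 0.5:  # exchange the complete tag sequences
            tags1, tags2 = tags2, tags1
        out1.append(list(tags1))
        out2.append(list(tags2))
    return out1, out2


# One sentence, "This is nice .", tagged by two systems.
system1 = [["DT", "VBZ", "JJ", "."]]
system2 = [["NNS", "VB", "RB", "."]]

rng = random.Random(0)
shuffled1, shuffled2 = shuffle_per_sentence(system1, system2, rng)
print(shuffled1, shuffled2)
```

The statistic is then computed on the shuffled outputs exactly as in the instance-based case.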
Term extraction
• Shuffling extracted terms between output of two term extraction systems
Reference System 1 System 2
happy happy sad
good good
lively happy
angry
Script
• http://www.clips.ua.ac.be/~vincent/software.html#art
• http://www.clips.ua.ac.be/scripts/art
• Options:
– Exact and approximate randomization tests
– Instance based, also for MBT
– Term extraction based
– Stratified shuffling
– Two-sided / one-sided (check code!)
Remarks on usage
• It makes no sense to shuffle if exact randomization can be computed
• The value of p depends on NS: the smallest attainable p is 1 / (NS + 1), so the larger NS, the lower p can be
• Validity check
– Sign test
– Re-test: to alleviate bad randomization
Sign test
• Can be compared with P for accuracy
• H0: correctness is independent of system, i.e. P(green) = 0.5
• Binomial test
[Figure: System 1, System 2]
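A sign test of this kind can be sketched with an exact binomial test. The sketch assumes the standard setup: only instances on which exactly one system is correct are counted, and under H0 each system is equally likely to be the correct one. The function name `sign_test` is hypothetical.

```python
from math import comb


def sign_test(gold, sys1, sys2):
    """Two-sided exact binomial sign test on instances where the systems disagree in correctness."""
    only1 = sum(a == g and b != g for g, a, b in zip(gold, sys1, sys2))
    only2 = sum(b == g and a != g for g, a, b in zip(gold, sys1, sys2))
    n = only1 + only2
    if n == 0:
        return 1.0  # no informative instances
    k = max(only1, only2)
    # P(X >= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


gold = list("AAAAA")
print(sign_test(gold, list("AAAAA"), list("BBBBB")))  # system 1 wins on all 5 instances
```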
Interpretation (1)

| Reference | System 1 | System 2 |
| --- | --- | --- |
| A | A | B |
| B | A | B |
| C | A | B |

How much do these two systems differ based on precision for the A label?
- Maximally
- Intermediate
- Minimally
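Interpretation (2)
Each cell under A, B, C gives the System 1 label followed by the System 2 label for the instance with that reference label.

| A | B | C | Precision A, System 1 | Precision A, System 2 | Δ |
| --- | --- | --- | --- | --- | --- |
| AB | AB | AB | 1/3 | 0 | 1/3 |
| BA | AB | AB | 0 | 1 | -1 |
| AB | AB | BA | 1/2 | 0 | 1/2 |
| BA | BA | AB | 0 | 1/2 | -1/2 |
| BA | AB | BA | 0 | 1/2 | -1/2 |
| AB | BA | BA | 1 | 0 | 1 |
| BA | BA | BA | 0 | 1/3 | -1/3 |
| AB | BA | AB | 1/2 | 0 | 1/2 |
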
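The 2^3 = 8 per-instance swaps of this example can be enumerated exactly. The sketch below recomputes the precision difference for label A under every swap pattern and derives the exact probabilities; the helper names are hypothetical, and precision is taken to be 0 when a system never predicts A.

```python
from fractions import Fraction
from itertools import product

reference = ["A", "B", "C"]
system1 = ["A", "A", "A"]
system2 = ["B", "B", "B"]


def precision_a(gold, predicted):
    """Precision for label A; 0 by convention if A is never predicted."""
    hits = [g == "A" for g, p in zip(gold, predicted) if p == "A"]
    return Fraction(sum(hits), len(hits)) if hits else Fraction(0)


def all_deltas(gold, sys1, sys2):
    """Precision difference (system 1 minus system 2) for every swap pattern."""
    deltas = []
    for swaps in product([False, True], repeat=len(gold)):
        out1 = [b if swap else a for a, b, swap in zip(sys1, sys2, swaps)]
        out2 = [a if swap else b for a, b, swap in zip(sys1, sys2, swaps)]
        deltas.append(precision_a(gold, out1) - precision_a(gold, out2))
    return deltas


deltas = all_deltas(reference, system1, system2)
actual = deltas[0]  # first pattern contains no swaps
one_sided = Fraction(sum(d >= actual for d in deltas), len(deltas))
two_sided = Fraction(sum(abs(d) >= abs(actual) for d in deltas), len(deltas))
print(actual, one_sided, two_sided)
```

Under these assumptions every swap pattern yields a difference at least as large in absolute value as the actual one, so the two-sided probability is 1: measured by precision for A, the two systems differ minimally.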
Conclusion
• Approximate randomization testing can be used for many applications.
• The basic idea is to determine how (im)probable the actual difference between two systems is when all possible permutations of the outputs are evaluated.
• The difference can be computed in many ways, as long as the shuffled elements are independent.