Single-Photon Emission Computed Tomography/Computed Tomography in Endocrinology
Confidence Intervals and Hypothesis Testing€¦ · s 15 One-sided and Two-sided Intervals So far,...
Transcript of Confidence Intervals and Hypothesis Testing€¦ · s 15 One-sided and Two-sided Intervals So far,...
Universität Potsdam Institut für Informatik
Lehrstuhl Maschinelles Lernen
Confidence Intervals and Hypothesis Testing
Niels Landwehr
Inte
lligent D
ata
Analy
sis
Agenda
Confidence Intervals
Statistical Tests
2
Inte
lligent D
ata
Analy
sis
Agenda
Confidence Intervals
Statistical Tests
3
Inte
lligent D
ata
Analy
sis
Recap: Risk Estimation
Recap: risk estimation.
We have learned a model
Interested in risk of model: the expected loss on novel test
instances drawn from the data distribution
Because is unknown, risk needs to be estimated from
sample where are
independent samples.
Risk estimate („empirical risk“)
If context is clear, we denote risk by R and empirical risk by 4
: .f
( [ ( , ( ))] ( , () )) ( , )R E y f y f p y d dy
x x x x
( , )yx ( , ).p yx
( , )p yx
1
1ˆ ( ) ( , ( ))S
n
i iinR y f
x
1 1( , ),...,( , )
n nS y y x x ( ~ (, , ))
iipy yx x
ˆ .S
R
Inte
lligent D
ata
Analy
sis
Recap: Risk Estimation Zero-one loss
For this lecture, we will assume
Learning task is binary classification,
Loss is zero-one loss,
This means that for follows a
Bernoulli distribution: there is either a mistake or not (coin toss).
We also assume that model is evaluated on independent test
set, such that the error estimate is unbiased.
5
0 : ( )( , ( ))
1: otherwise
y fy f
xx
{0,1}.
( , ( ))i i
y f
x , ) ~ ,( ( ) i i
y p yxx
Inte
lligent D
ata
Analy
sis
6
Idea Confidence Intervals
Risk estimate is always uncertain – depends on sample S.
Idea confidence interval:
Specify interval around risk estimate
Such that the true risk lies within the interval „most of
the time“.
Quantifies uncertainty of risk estimate.
Route to confidence interval: analyse the distribution of the
random variable
ˆS
R[ ]
ˆS
R
ˆ .S
R
R
of confidence inwidt teh rvalR
Inte
lligent D
ata
Analy
sis
7
Central Limit Theorem
Central Limit Theorem. Let be independent draws
from a distribution with and .
Then it holds that
Central limit theorem gives approximate distribution of mean:
2
10,
1( )
n
jjn z
n
1
2
( , ) (approximately, for large 1
)~n
jjz
nn
n
1
2(0, ) (approximately, for 1
( ) ~ large )n
jjn
nnz
1,...,
nz z
( )p z [ ]z 2Var[ ]z
average of convergence in distribution (for ) n 1,..., .
nz z
divide by , add n
Inte
lligent D
ata
Analy
sis
8
Example Central Limit Theorem
Example central limit theorem: average of Bernoulli variables.
Let be independent draws from a Bernoulli
distribution, that is
Average follows (rescaled) Binomial distribution.
Binomial distribution approaches Normal distribution.
1,...,
nz z
) (coin toss with success pro~ Bern( | bability )i i
z z
1
1 n
jjz
n
Inte
lligent D
ata
Analy
sis
9
Central Limit Theorem: Error Estimator
Application of central limit theorem to error estimator.
Error estimator
is an average over the Bernoulli-distributed variables
Because the error estimate is unbiased,
Variance of Bernoulli random variable is
Central limit theorem says:
First result for distribution of , but depends on
1
1( , ( ))ˆ n
S j jjy fR
n
x
Var[ ( , ( ))] (1 ).j jy f R R x
(1 )ˆ ~ ( , ) (approximately, large enough )S
R RR R n
n
( , ( )).j j
y f x
( , ( ))] .[ j jy f R x
ˆSR .R
Inte
lligent D
ata
Analy
sis
10
Mean and Variance of Error Estimator
First result: Approximate distribution of error estimator is
Unbiased estimator, therefore the mean is the true risk
The variance of the estimator falls with n: the more instances
in the test set S, the less variance.
Variance
Standard deviation („standard error“)
2
ˆ
(1 ).
SR
R R
n
ˆ
(1 ).
SR
R R
n
ˆ ~ ((1 )
, ).S
RR
RR
n
.R
Characterizes how much risk estimate fluctuates with S
Inte
lligent D
ata
Analy
sis
Distribution of Error Estimator
Distribution of error estimator:
Problem: true risk has to be known in order to determine
variance
Idea: replace true variance by variance estimate
11
2
ˆ
(1 ).
SR
R R
n
R
ˆ
2,~ .( )ˆS
S RRR
ˆ
ˆ~ (0, )1
SR
SR R
2
ˆSR
2
ˆ
(1 ˆ.
ˆ )
S
S S
Rs
n
R R
Inte
lligent D
ata
Analy
sis
Variance Estimate and t-Distribution
If true variance is replaced by variance estimate, the normal
distribution becomes a Student‘s t-distribution:
However, for large n the t-distribution becomes a normal
distribution again, so we can keep working with the normal.
12
ˆ
ˆ~
SR
SR
s
Rt n
degrees of freedomn
Convergence: lim 0,1n t n
t-distribution has similar form
but more mass in the tails. Normal distribution.
t(2)
t(5)
Inte
lligent D
ata
Analy
sis
13
Bound For True Risk
So what does the empirical risk tell us about the true risk?
From empirical risk compute empirical variance
One-sided upper bound for true risk: probability that true risk is
at most above estimated risk.
ˆ ˆ
ˆ
ˆ ˆ( ) ( )
ˆ ( )
S S
S
R R
R
S S
S
p R R p R R
R Rp
s s
s
"cumulative distribution function of standard norm
( | 0,1)
al distribution"
x
xx dx
ˆS
R
R
]
ˆSR
ˆSR
2
ˆ .SR
s
ˆ
ˆ~ 0,1
SR
SR
s
R
Inte
lligent D
ata
Analy
sis
14
Bound For True Risk
Symmetric lower bound: because the distribution of is
symmetric around (normal distribution), we can similarly
compute probability that true risk is at most below estimated
risk.
Two-sided interval: What is the probability that true risk is at
most away from estimated risk?
ˆ
ˆ ˆ ˆ(| | ) 1 ( ( )
1 2 1
S
S S
R
Sp R R p R R p R R
s
ˆ
ˆ( )
S
S
R
p R Rs
ˆSR
R
ˆS
R
R
[
above interval below interval
Inte
lligent D
ata
Analy
sis
15
One-sided and Two-sided Intervals
So far, we have computed probability that a bound holds
for a particular interval size .
Idea: choose in such a way that bounds hold with a
certain prespecified probability 1- (e.g. =0.05).
One-sided 1--confidence interval: bound such that
Two-sided 1--confidence interval: bound such that
For symmetric distributions (here: normal) it always holds that:
for one-sided 1--interval = for two-sided 1-2 interval.
for one-sided 95%-interval = for two-sided 90% interval.
Thus, it suffices to derive for one-sided interval.
ˆ( ) 1Sp R R
ˆ(| | ) 1Sp R R
Inte
lligent D
ata
Analy
sis
16
Size of Interval
Compute one-sided 1--confidence interval: Determine such
that bound holds with probability 1-.
Two-sided confidence interval is (confidence level
1-2)
ˆ
1
ˆ
1
ˆ
ˆ( ) 1
1
(1 )
1
S
S
S
R
R
S
R
p R R
s
s
s
ˆ ˆ[ , ]S SR R
1( ) inverse of ).(xx
Result from Slide 13
Inte
lligent D
ata
Analy
sis
17
Confidence Interval: Example
Example:
We have observed an empirical risk of on
test instances.
Compute
Choosing confidence level (one-sided level, two-
sided will be )
Compute
The confidence interval contains the true risk in
90% of the cases.
ˆ 0.08SR
ˆ
0.08 0.920.027 empirical standard deviation
100SRs
ˆ ˆ, [ ]S SR R
1
ˆ 1 0.027 1.645 0.045.SR
s
100m
0.05
2
Inte
lligent D
ata
Analy
sis
18
Interpretation of Confidence Intervals
Care should be used when interpreting confidence intervals:
the random variable is the empirical risk and the resulting
interval, not the true risk
Correct:
Wrong:
"The probability of obtaining a confidence interval that
contains the true risk from an experiment is
95%"
ˆSR
"We have obtained a confidence interval
The probability that the interval contains th
fr
e t
om an experimen
rue risk is .
t.
95%"
.R
Inte
lligent D
ata
Analy
sis
Agenda
Confidence Intervals
Statistical Tests
19
Inte
lligent D
ata
Analy
sis
Statistical Tests: Motivation
Motivation: we have developed a new learning algorithm
(Algorithm 1) and compare it to an older algorithm (Algorithm 2)
on 10 data sets.
Algorithm 1 seems better (won on 8 data sets, lost on 2).
But maybe this is just a random result, based on the
particular choice of data sets?
Statistical test: rigorous procedure to decide whether it is likely
that Algorithm 1 is indeed giving better accuracy.
20
0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74
0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70
Accuracy Algorithm 1
Accuracy Algorithm 2
+ + - + + - + + + +
Inte
lligent D
ata
Analy
sis
Statistical Tests: Framework
Formulate a null hypothesis H0.
For example, H0 could be „Algorithm 1 and Algorithm 2
perform equally well“.
If the observations are very unlikely under H0, we reject it
and conclude the alternative hypothesis H1: one algorithm
is better.
Formulate a test statistic T that can be computed from data.
For example, the observed number of „wins“.
We will reject the null hypothesis if the test statistic exceeds a
threshold c.
For example, reject if one algorithm wins more than 90
times out of 100.
21
Inte
lligent D
ata
Analy
sis
Statistical Tests: Framework
Asymmetry in test: we can only reject the null hypothesis,
never conclude that it is true.
Possible outcomes of hypothesis testing:
Type I error is worst case (publish a study claiming that new
drug cures cancer when in fact it does not).
22
0 1 rejected conclude .H H
0 not rejected cannot conclude anything, no new information.H
Type I error (wrong
conclusion, very bad)
no new information but
also no error (ok)
correct conclusion
(good)
Type II error (not enough
power, kind of bad)
0 trueH
1 trueH
0 rejectedH 0 not rejectedH
Inte
lligent D
ata
Analy
sis
Statistical Tests: More Formally
More formally, let denote a true parameter of interest
(for example, is the probability that Algorithm 1 wins over
Algorithm 2 on a randomly drawn data set).
Let the null hypothesis be (for example, ).
The alternative hypothesis is
Let be the observations (for example, accuracies of
algorithms on the multiple data sets).
Let be the test statistic.
We reject the null hypothesis (and conclude that the
alternative hypothesis is true) if
23
00 :H
0 0: .5H
1 1 0: \ .H
X
:T
0H
1H ( ) .T X c
Inte
lligent D
ata
Analy
sis
Statistical Tests: Size
Size of a test: (maximal) probability of rejecting the null
hypothesis when the null hypothesis is true (bad!).
We don‘t want Type I errors, so we have to limit
For example, : formulate test in such a way that there is
at most 5% probability of rejecting null hypothesis wrongly.
Of course, depends on c
If we choose c very large, we are conservative and is low.
If we choose c smaller, we are less conservative.
Trading Type I for Type II error.
24
0
sup ( | ).p T c
.
0.05
Inte
lligent D
ata
Analy
sis
Sign Test
Sign test: decide whether the medians of two populations differ.
Motivation: we evaluate two learning algorithms on 10 datasets.
More formally: Let be independently
sampled as
Let („probability that Algorithm 1 wins on
randomly drawn data set“).
Let , 25
2
1 1( , ),..., ,( )m mba b a
( , ) ~ ( , ).i ia b p a b
( ) [0,1]p a b
0 0: .5H 1 [0,1] \{0: .5}.H
0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74
0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70
Accuracy Algorithm 1
Accuracy Algorithm 2
+ + - + + - + + + +
Inte
lligent D
ata
Analy
sis
Sign Test
Sign test: decide whether the medians of two populations differ.
Motivation: we evaluate two learning algorithms on 10 datasets.
Let (observed accuracies).
Let
We will reject the null hypothesis if , that is, if we see
more than c wins of either algorithm.
26
1 1, ),..., ( )}( ,{ m mb a bX a
„#wins of better algorithm“
T c
0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74
0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70
Accuracy Algorithm 1
Accuracy Algorithm 2
+ + - + + - + + + +
max { | } , { | } .i i i iT i a ab i b
Inte
lligent D
ata
Analy
sis
Sign Test: Distribution under H0
How do we choose c ?
Limit probability of Type I error, given by
Because are sampled independently, the logical
variable behaves like a coin toss.
Thus, the probability of seeing i wins for Algorithm 1 is given
by a Binomial distribution.
How likely is it to observe more than c wins (for either
algorithm) if ?
27
( | 0.5).p T c
0.5
0.5,
1
( | 0.5 2 Bin () )m
m
i c
p T c i
Probability of seeing extreme #wins
under a fair coin toss model c
( , ) ~ ( , )i ia b p a b
( )i ia b
Inte
lligent D
ata
Analy
sis
Sign Test: Distribution under H0
So
So far, computed for a given threshold c.
We can ensure any prespecified by solving for c:
E.g. for we set
28
1
0.5, / 2).BinCDF (1mc
0.5,
1
0.5,
2 Bin ( )
( | 0.
2(1 BinC
5
(
D )
)
F )
m
m
i c
m
p T c
i
c
c
0.5,BinCDF ( )m c
0.05 1
0.5, ).BinCDF (0.975mc
Inte
lligent D
ata
Analy
sis
Sign Test: p-value
After observing the value T of the test statistics, we can also
compute for the maximum threshold c=T-1 that would still
reject the null hypothesis. This is called the p-value.
The p-value is the smallest for which the test would reject H0.
Typically,
p < 0.001: very sure that H0 can be rejected.
p < 0.01: sure that H0 can be rejected.
p < 0.05 reasonably sure that H0 can be rejected.
p < 0.1 likely that H0 can be rejected.
29
1T
0.5,2(1 BinCDF ( 1))m Tp
Inte
lligent D
ata
Analy
sis
Sign Test: Example
Example sign test:
Compute test statistic: T=8.
Compute p-value:
Test would reject null hypothesis for , but not for
This is not considered statistically significant.
30
0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74
0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70
Accuracy Algorithm 1
Accuracy Algorithm 2
+ + - + + - + + + +
T
0.2 0.1.
0.5,102(1 BinCDF (7)) 0.1094p
Inte
lligent D
ata
Analy
sis
Sign Test: Discussion
Summary: sign test can be applied when we have paired data
and want to decide if
Advantages of sign test:
Few assumptions: the only need to be independent.
Disadvantages:
Only uses whether or , not the actual values.
This discards some information and can make it harder to
reject the null hypothesis.
Compares medians rather than means: if algorithm is
usually slightly better but in some cases much worse, it
would be declared the winner.
31
( ),i ia b
i ia b i ia b
2
1 1( , ),..., ,( )m mba b a ( ) 0.5.p a b
Inte
lligent D
ata
Analy
sis
Two-Tailed Paired t-Test
Paired t-test: standard test to determine if means between
populations differ (example: do risks of two models differ?).
Let be independently sampled from ,
that is,
Let , let , and let
Let denote the difference in population means.
Null hypothesis , that is,
Test statistic , reject if
32
0 : 0H
i i ia b 1
1i
m
im
.T c
1
2 21( ) .i
m
i
sm
| |mT
s
Variance estimate i
[ ] [ ]a b
[ [ ]] .ba
2
1 1( , ),..., ,( )m mba b a
( , ) ~ ( , ).i ia b p a b
( , )p a b
Inte
lligent D
ata
Analy
sis
Paired t-Test: Probability of Type I Error
Paired t-test intuition: if null hypothesis holds,
would expect small and therefore T. Seeing a large
(absolute) T is thus very unlikely under the null hypothesis.
What is the probability of rejecting the null hypothesis when
the null hypothesis is true?
33
[ [ ]]a b
( | 0)p T c
Inte
lligent D
ata
Analy
sis
Paired t-Test: Probability of Type I Error
Distribution of T if :
Because are independent, Central Limit Theorem says:
With estimated variance, becomes t-distributed:
Thus, test statistic T follows a t-distribution.
Probability that T exceeds c:
34
(0,1) ~m
true variance of i
( 1) ~ m
ms
t
estimated variance
zero mean because 0
0
| 0 2 ( | 1)c
mp c t x m dx
s
cc
i
probability of
exceeding c
0
Inte
lligent D
ata
Analy
sis
Paired t-Test: p-Value
Formulate using cumulative distribution function:
Can again compute a threshold c for a prespecified α: if we set
, we ensure that the Type I error is at most
α (for example, α = 0.05).
For observed value T of test statistic, we can again compute
the p-value: the smallest for which H0 would be rejected.
35
12 ( | 1) 2(1 tCDF ( ))m
c
t x m dx c
cumulative distribution function of t-distribution
12 ( | 1) 2(1 tCD ( )F )T
mp t x m dx T
cc
/2 /2
TT
p/2 p/2
1
1tCDF / 2)(1mc
set to c T
Inte
lligent D
ata
Analy
sis
Example Paired t-Test
Example: Comparing the risks of two predictive models.
We evaluate models and on test set of size m=20.
Let be the difference in loss on the different test
examples, that is,
Compute and
Let‘s say and
Compute
Compute
We can reject H0 for , but not for
Weakly significant. 36
oldf newf
1 20,...,
( ( ), ) ( (, )).i i iold w ineiy f y f x x
20
1
1
20ii
1
20
221
( ) .20
i T
i
s
0.25 2 0.3026s
| | 20 0.2
0.3026
52.03.
mT
s
1(2.2( 031 tC ))DF 0.056.mp
0.1 0.05.
Inte
lligent D
ata
Analy
sis
Discussion t-Test
Summary: paired t-test test can be applied when we have
paired data and want to decide if
Advantages t-test
Compares means rather than medians (often more
adequate).
Usually more powerful than sign test.
Disadvantages t-test
It critically relies on assuming that the test statistics is t-
distributed. This holds in the limit according to central limit
theorem, but might not be satisfied for small m.
The test can give wrong results when this assumption is
not satisfied.
37
2
1 1( , ),..., ,( )m mba b a [ ] [ ].a b
Inte
lligent D
ata
Analy
sis
Statistical Tests: Summary and Outlook
Statistical testing can determine whether observed empirical
differences likely indicate true differences between populations.
Formulate a null hypothesis.
Define a test statistic based on the observations.
Reject null hypothesis if observed value for test statistic is
very unlikely under null hypothesis.
Statistical testing is a large field, and many more tests exist
Unpaired test, would have to be used when models are
evaluated on different test sets.
Wilcoxon signed rank test, - test, …
One-tailed vs. two-tailed tests.
38
2