Confidence Intervals and Hypothesis Testing€¦ · s 15 One-sided and Two-sided Intervals So far,...

Universität Potsdam Institut für Informatik

Lehrstuhl Maschinelles Lernen

Confidence Intervals and Hypothesis Testing

Niels Landwehr

Inte

lligent D

ata

Analy

sis

Agenda

Confidence Intervals

Statistical Tests

2

Inte

lligent D

ata

Analy

sis

Agenda


Statistical Tests

3

Inte

lligent D

ata

Analy

sis

Recap: Risk Estimation

Recap: risk estimation.

We have learned a model

Interested in risk of model: the expected loss on novel test

instances drawn from the data distribution

Because is unknown, risk needs to be estimated from

sample where are

independent samples.

Risk estimate („empirical risk“)

If context is clear, we denote risk by R and empirical risk by 4

: .f

( [ ( , ( ))] ( , () )) ( , )R E y f y f p y d dy

x x x x

( , )yx ( , ).p yx

( , )p yx

1

1ˆ ( ) ( , ( ))S

n

i iinR y f

x

1 1( , ),...,( , )

n nS y y x x ( ~ (, , ))

iipy yx x

ˆ .S

R

Inte

lligent D

ata

Analy

sis

Recap: Risk Estimation Zero-one loss

For this lecture, we will assume

Learning task is binary classification,

Loss is zero-one loss,

This means that for follows a

Bernoulli distribution: there is either a mistake or not (coin toss).

We also assume that model is evaluated on independent test

set, such that the error estimate is unbiased.

5

0 : ( )( , ( ))

1: otherwise

y fy f

xx

{0,1}.

( , ( ))i i

y f

x , ) ~ ,( ( ) i i

y p yxx

Inte

lligent D

ata

Analy

sis

6

Idea Confidence Intervals

Risk estimate is always uncertain – depends on sample S.

Idea confidence interval:

Specify interval around risk estimate

Such that the true risk lies within the interval „most of

the time“.

Quantifies uncertainty of risk estimate.

Route to confidence interval: analyse the distribution of the

random variable

ˆS

R[ ]

ˆS

R

ˆ .S

R

R

of confidence inwidt teh rvalR

Inte

lligent D

ata

Analy

sis

7

Central Limit Theorem

Central Limit Theorem. Let be independent draws

from a distribution with and .

Then it holds that

Central limit theorem gives approximate distribution of mean:

2

10,

1( )

n

jjn z

n

1

2

( , ) (approximately, for large 1

)~n

jjz

nn

n

1

2(0, ) (approximately, for 1

( ) ~ large )n

jjn

nnz

1,...,

nz z

( )p z [ ]z 2Var[ ]z

average of convergence in distribution (for ) n 1,..., .

nz z

divide by , add n

Inte

lligent D

ata

Analy

sis

8

Example Central Limit Theorem

Example central limit theorem: average of Bernoulli variables.

Let be independent draws from a Bernoulli

distribution, that is

Average follows (rescaled) Binomial distribution.

Binomial distribution approaches Normal distribution.

1,...,

nz z

) (coin toss with success pro~ Bern( | bability )i i

z z

1

1 n

jjz

n

Inte

lligent D

ata

Analy

sis

9

Central Limit Theorem: Error Estimator

Application of central limit theorem to error estimator.

Error estimator

is an average over the Bernoulli-distributed variables

Because the error estimate is unbiased,

Variance of Bernoulli random variable is

Central limit theorem says:

First result for distribution of , but depends on

1

1( , ( ))ˆ n

S j jjy fR

n

x

Var[ ( , ( ))] (1 ).j jy f R R x

(1 )ˆ ~ ( , ) (approximately, large enough )S

R RR R n

n

( , ( )).j j

y f x

( , ( ))] .[ j jy f R x

ˆSR .R

Inte

lligent D

ata

Analy

sis

10

Mean and Variance of Error Estimator

First result: Approximate distribution of error estimator is

Unbiased estimator, therefore the mean is the true risk

The variance of the estimator falls with n: the more instances

in the test set S, the less variance.

Variance

Standard deviation („standard error“)

2

ˆ

(1 ).

SR

R R

n

ˆ

(1 ).

SR

R R

n

ˆ ~ ((1 )

, ).S

RR

RR

n

.R

Characterizes how much risk estimate fluctuates with S

Inte

lligent D

ata

Analy

sis

Distribution of Error Estimator

Distribution of error estimator:

Problem: true risk has to be known in order to determine

variance

Idea: replace true variance by variance estimate

11

2

ˆ

(1 ).

SR

R R

n

R

ˆ

2,~ .( )ˆS

S RRR

ˆ

ˆ~ (0, )1

SR

SR R

2

ˆSR

2

ˆ

(1 ˆ.

ˆ )

S

S S

Rs

n

R R

Inte

lligent D

ata

Analy

sis

Variance Estimate and t-Distribution

If true variance is replaced by variance estimate, the normal

distribution becomes a Student‘s t-distribution:

However, for large n the t-distribution becomes a normal

distribution again, so we can keep working with the normal.

12

ˆ

ˆ~

SR

SR

s

Rt n

degrees of freedomn

Convergence: lim 0,1n t n

t-distribution has similar form

but more mass in the tails. Normal distribution.

t(2)

t(5)

Inte

lligent D

ata

Analy

sis

13

Bound For True Risk

So what does the empirical risk tell us about the true risk?

From empirical risk compute empirical variance

One-sided upper bound for true risk: probability that true risk is

at most above estimated risk.

ˆ ˆ

ˆ

ˆ ˆ( ) ( )

ˆ ( )

S S

S

R R

R

S S

S

p R R p R R

R Rp

s s

s

"cumulative distribution function of standard norm

( | 0,1)

al distribution"

x

xx dx

ˆS

R

R

]

ˆSR

ˆSR

2

ˆ .SR

s

ˆ

ˆ~ 0,1

SR

SR

s

R

Inte

lligent D

ata

Analy

sis

14

Bound For True Risk

Symmetric lower bound: because the distribution of is

symmetric around (normal distribution), we can similarly

compute probability that true risk is at most below estimated

risk.

Two-sided interval: What is the probability that true risk is at

most away from estimated risk?

ˆ

ˆ ˆ ˆ(| | ) 1 ( ( )

1 2 1

S

S S

R

Sp R R p R R p R R

s

ˆ

ˆ( )

S

S

R

p R Rs

ˆSR

R

ˆS

R

R

[

above interval below interval

Inte

lligent D

ata

Analy

sis

15

One-sided and Two-sided Intervals

So far, we have computed probability that a bound holds

for a particular interval size .

Idea: choose in such a way that bounds hold with a

certain prespecified probability 1- (e.g. =0.05).

One-sided 1--confidence interval: bound such that

Two-sided 1--confidence interval: bound such that

For symmetric distributions (here: normal) it always holds that:

for one-sided 1--interval = for two-sided 1-2 interval.

for one-sided 95%-interval = for two-sided 90% interval.

Thus, it suffices to derive for one-sided interval.

ˆ( ) 1Sp R R

ˆ(| | ) 1Sp R R

Inte

lligent D

ata

Analy

sis

16

Size of Interval

Compute one-sided 1--confidence interval: Determine such

that bound holds with probability 1-.

Two-sided confidence interval is (confidence level

1-2)

ˆ

1

ˆ

1

ˆ

ˆ( ) 1

1

(1 )

1

S

S

S

R

R

S

R

p R R

s

s

s

ˆ ˆ[ , ]S SR R

1( ) inverse of ).(xx

Result from Slide 13

Inte

lligent D

ata

Analy

sis

17

Confidence Interval: Example

Example:

We have observed an empirical risk of on

test instances.

Compute

Choosing confidence level (one-sided level, two-

sided will be )

Compute

The confidence interval contains the true risk in

90% of the cases.

ˆ 0.08SR

ˆ

0.08 0.920.027 empirical standard deviation

100SRs

ˆ ˆ, [ ]S SR R

1

ˆ 1 0.027 1.645 0.045.SR

s

100m

0.05

2

Inte

lligent D

ata

Analy

sis

18

Interpretation of Confidence Intervals

Care should be used when interpreting confidence intervals:

the random variable is the empirical risk and the resulting

interval, not the true risk

Correct:

Wrong:

"The probability of obtaining a confidence interval that

contains the true risk from an experiment is

95%"

ˆSR

"We have obtained a confidence interval

The probability that the interval contains th

fr

e t

om an experimen

rue risk is .

t.

95%"

.R

Inte

lligent D

ata

Analy

sis

Agenda


Statistical Tests

19

Inte

lligent D

ata

Analy

sis

Statistical Tests: Motivation

Motivation: we have developed a new learning algorithm

(Algorithm 1) and compare it to an older algorithm (Algorithm 2)

on 10 data sets.

Algorithm 1 seems better (won on 8 data sets, lost on 2).

But maybe this is just a random result, based on the

particular choice of data sets?

Statistical test: rigorous procedure to decide whether it is likely

that Algorithm 1 is indeed giving better accuracy.

20

0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74

0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70

Accuracy Algorithm 1


+ + - + + - + + + +

Inte

lligent D

ata

Analy

sis

Statistical Tests: Framework

Formulate a null hypothesis H0.

For example, H0 could be „Algorithm 1 and Algorithm 2

perform equally well“.

If the observations are very unlikely under H0, we reject it

and conclude the alternative hypothesis H1: one algorithm

is better.

Formulate a test statistic T that can be computed from data.

For example, the observed number of „wins“.

We will reject the null hypothesis if the test statistic exceeds a

threshold c.

For example, reject if one algorithm wins more than 90

times out of 100.

21

Inte

lligent D

ata

Analy

sis

Statistical Tests: Framework

Asymmetry in test: we can only reject the null hypothesis,

never conclude that it is true.

Possible outcomes of hypothesis testing:

Type I error is worst case (publish a study claiming that new

drug cures cancer when in fact it does not).

22

0 1 rejected conclude .H H

0 not rejected cannot conclude anything, no new information.H

Type I error (wrong

conclusion, very bad)

no new information but

also no error (ok)

correct conclusion

(good)

Type II error (not enough

power, kind of bad)

0 trueH

1 trueH

0 rejectedH 0 not rejectedH

Inte

lligent D

ata

Analy

sis

Statistical Tests: More Formally

More formally, let denote a true parameter of interest

(for example, is the probability that Algorithm 1 wins over

Algorithm 2 on a randomly drawn data set).

Let the null hypothesis be (for example, ).

The alternative hypothesis is

Let be the observations (for example, accuracies of

algorithms on the multiple data sets).

Let be the test statistic.

We reject the null hypothesis (and conclude that the

alternative hypothesis is true) if

23

00 :H

0 0: .5H

1 1 0: \ .H

X

:T

0H

1H ( ) .T X c

Inte

lligent D

ata

Analy

sis

Statistical Tests: Size

Size of a test: (maximal) probability of rejecting the null

hypothesis when the null hypothesis is true (bad!).

We don‘t want Type I errors, so we have to limit

For example, : formulate test in such a way that there is

at most 5% probability of rejecting null hypothesis wrongly.

Of course, depends on c

If we choose c very large, we are conservative and is low.

If we choose c smaller, we are less conservative.

Trading Type I for Type II error.

24

0

sup ( | ).p T c

.

0.05

Inte

lligent D

ata

Analy

sis

Sign Test

Sign test: decide whether the medians of two populations differ.

Motivation: we evaluate two learning algorithms on 10 datasets.

More formally: Let be independently

sampled as

Let („probability that Algorithm 1 wins on

randomly drawn data set“).

Let , 25

2

1 1( , ),..., ,( )m mba b a

( , ) ~ ( , ).i ia b p a b

( ) [0,1]p a b

0 0: .5H 1 [0,1] \{0: .5}.H

0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74

0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70



+ + - + + - + + + +

Inte

lligent D

ata

Analy

sis

Sign Test

Sign test: decide whether the medians of two populations differ.

Motivation: we evaluate two learning algorithms on 10 datasets.

Let (observed accuracies).

Let

We will reject the null hypothesis if , that is, if we see

more than c wins of either algorithm.

26

1 1, ),..., ( )}( ,{ m mb a bX a

„#wins of better algorithm“

T c

0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74

0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70



+ + - + + - + + + +

max { | } , { | } .i i i iT i a ab i b

Inte

lligent D

ata

Analy

sis

Sign Test: Distribution under H0

How do we choose c ?

Limit probability of Type I error, given by

Because are sampled independently, the logical

variable behaves like a coin toss.

Thus, the probability of seeing i wins for Algorithm 1 is given

by a Binomial distribution.

How likely is it to observe more than c wins (for either

algorithm) if ?

27

( | 0.5).p T c

0.5

0.5,

1

( | 0.5 2 Bin () )m

m

i c

p T c i

Probability of seeing extreme #wins

under a fair coin toss model c

( , ) ~ ( , )i ia b p a b

( )i ia b

Inte

lligent D

ata

Analy

sis

Sign Test: Distribution under H0

So

So far, computed for a given threshold c.

We can ensure any prespecified by solving for c:

E.g. for we set

28

1

0.5, / 2).BinCDF (1mc

0.5,

1

0.5,

2 Bin ( )

( | 0.

2(1 BinC

5

(

D )

)

F )

m

m

i c

m

p T c

i

c

c

0.5,BinCDF ( )m c

0.05 1

0.5, ).BinCDF (0.975mc

Inte

lligent D

ata

Analy

sis

Sign Test: p-value

After observing the value T of the test statistics, we can also

compute for the maximum threshold c=T-1 that would still

reject the null hypothesis. This is called the p-value.

The p-value is the smallest for which the test would reject H0.

Typically,

p < 0.001: very sure that H0 can be rejected.

p < 0.01: sure that H0 can be rejected.

p < 0.05 reasonably sure that H0 can be rejected.

p < 0.1 likely that H0 can be rejected.

29

1T

0.5,2(1 BinCDF ( 1))m Tp

Inte

lligent D

ata

Analy

sis

Sign Test: Example

Example sign test:

Compute test statistic: T=8.

Compute p-value:

Test would reject null hypothesis for , but not for

This is not considered statistically significant.

30

0.85 0.76 0.60 0.70 0.95 0.88 0.73 0.89 0.98 0.74

0.81 0.73 0.61 0.66 0.91 0.89 0.65 0.82 0.97 0.70



+ + - + + - + + + +

T

0.2 0.1.

0.5,102(1 BinCDF (7)) 0.1094p

Inte

lligent D

ata

Analy

sis

Sign Test: Discussion

Summary: sign test can be applied when we have paired data

and want to decide if

Advantages of sign test:

Few assumptions: the only need to be independent.

Disadvantages:

Only uses whether or , not the actual values.

This discards some information and can make it harder to

reject the null hypothesis.

Compares medians rather than means: if algorithm is

usually slightly better but in some cases much worse, it

would be declared the winner.

31

( ),i ia b

i ia b i ia b

2

1 1( , ),..., ,( )m mba b a ( ) 0.5.p a b

Inte

lligent D

ata

Analy

sis

Two-Tailed Paired t-Test

Paired t-test: standard test to determine if means between

populations differ (example: do risks of two models differ?).

Let be independently sampled from ,

that is,

Let , let , and let

Let denote the difference in population means.

Null hypothesis , that is,

Test statistic , reject if

32

0 : 0H

i i ia b 1

1i

m

im

.T c

1

2 21( ) .i

m

i

sm

| |mT

s

Variance estimate i

[ ] [ ]a b

[ [ ]] .ba

2

1 1( , ),..., ,( )m mba b a

( , ) ~ ( , ).i ia b p a b

( , )p a b

Inte

lligent D

ata

Analy

sis

Paired t-Test: Probability of Type I Error

Paired t-test intuition: if null hypothesis holds,

would expect small and therefore T. Seeing a large

(absolute) T is thus very unlikely under the null hypothesis.

What is the probability of rejecting the null hypothesis when

the null hypothesis is true?

33

[ [ ]]a b

( | 0)p T c

Inte

lligent D

ata

Analy

sis

Paired t-Test: Probability of Type I Error

Distribution of T if :

Because are independent, Central Limit Theorem says:

With estimated variance, becomes t-distributed:

Thus, test statistic T follows a t-distribution.

Probability that T exceeds c:

34

(0,1) ~m

true variance of i

( 1) ~ m

ms

t

estimated variance

zero mean because 0

0

| 0 2 ( | 1)c

mp c t x m dx

s

cc

i

probability of

exceeding c

0

Inte

lligent D

ata

Analy

sis

Paired t-Test: p-Value

Formulate using cumulative distribution function:

Can again compute a threshold c for a prespecified α: if we set

, we ensure that the Type I error is at most

α (for example, α = 0.05).

For observed value T of test statistic, we can again compute

the p-value: the smallest for which H0 would be rejected.

35

12 ( | 1) 2(1 tCDF ( ))m

c

t x m dx c

cumulative distribution function of t-distribution

12 ( | 1) 2(1 tCD ( )F )T

mp t x m dx T

cc

/2 /2

TT

p/2 p/2

1

1tCDF / 2)(1mc

set to c T

Inte

lligent D

ata

Analy

sis

Example Paired t-Test

Example: Comparing the risks of two predictive models.

We evaluate models and on test set of size m=20.

Let be the difference in loss on the different test

examples, that is,

Compute and

Let‘s say and

Compute

Compute

We can reject H0 for , but not for

Weakly significant. 36

oldf newf

1 20,...,

( ( ), ) ( (, )).i i iold w ineiy f y f x x

20

1

1

20ii

1

20

221

( ) .20

i T

i

s

0.25 2 0.3026s

| | 20 0.2

0.3026

52.03.

mT

s

1(2.2( 031 tC ))DF 0.056.mp

0.1 0.05.

Inte

lligent D

ata

Analy

sis

Discussion t-Test

Summary: paired t-test test can be applied when we have

paired data and want to decide if

Advantages t-test

Compares means rather than medians (often more

adequate).

Usually more powerful than sign test.

Disadvantages t-test

It critically relies on assuming that the test statistics is t-

distributed. This holds in the limit according to central limit

theorem, but might not be satisfied for small m.

The test can give wrong results when this assumption is

not satisfied.

37

2

1 1( , ),..., ,( )m mba b a [ ] [ ].a b

Inte

lligent D

ata

Analy

sis

Statistical Tests: Summary and Outlook

Statistical testing can determine whether observed empirical

differences likely indicate true differences between populations.

Formulate a null hypothesis.

Define a test statistic based on the observations.

Reject null hypothesis if observed value for test statistic is

very unlikely under null hypothesis.

Statistical testing is a large field, and many more tests exist

Unpaired test, would have to be used when models are

evaluated on different test sets.

Wilcoxon signed rank test, - test, …

One-tailed vs. two-tailed tests.

38

2

Confidence Intervals and Hypothesis Testing€¦ · s 15 One-sided and Two-sided Intervals So far,...

Documents

Transcript of Confidence Intervals and Hypothesis Testing€¦ · s 15 One-sided and Two-sided Intervals So far,...