Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1...

61
Introduction to Statistical Analysis Cancer Research UK – 5 th of November 2018 D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

Transcript of Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1...

Page 1: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Introduction to Statistical Analysis

Cancer Research UK – 5th of November 2018

D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

Page 2: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Timeline

9:30 – Morning

I ⇠ 45mn Lecture: data type, summary statistics and graphical displaysI ⇠ 15mn Quiz

10:30 – 15mn Co↵ee & Tea break

I ⇠ 60mn Lecture: some statistical distributions + CLTI ⇠ 15mn Exercises & discussion

12:00 – Lunch break

13:00 – Afternoon

I ⇠ 45mn Lecture: One-sample location testI ⇠ 30mn Exercises with shiny app & discussion

I ⇠ 45mn Lecture: Two-sample location test

15:15 – 15mn Co↵ee & Tea break

I ⇠ 30mn Exercises with shiny app & discussion

16:15 – Group based exercisesI ⇠ 60mn

2

Page 3: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Grand Picture of Statistics

Population Sample

Data

(x1, x2, ..., xn)

Statistics

bµb�2

b⇡

Parameters

µ

�2

3

UK SMOKERS

Bias

MEASUREMENT ERROR

MISSING VALUES

c¢ ¢POEEFNFNNQEI.ge#sfuAt*INFERENCE

y

PROBABILITY OF@ No suitable == 1 = 0.5

ifmnotnaomsnoenhers } Test 5

Page 4: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Data Types

x1 x2 x3 · · · xn

Cancer status C �C �C · · · C

Nucleic acid sequence C T T · · · A

5-level pain score 3 1 5 · · · 4

# of daily admissions at A&E 16 23 12 · · · 17

Gene expression intensity 882.1 379.5 528.3 · · · 120.9

4

SUMMARY STATISTICS . MUTUAL EXCWSN

LOCATION VARIABILITY( typical value ) - RANKING

-

=FatIII⇐

.se#r.//EauioisI

- MODE- MEAN

- IQR- MEDIAN

- MAD ✓ /QUALITATIVE ( NOMINAL )

¥✓ x x

t¥n¥EI÷e if:/DISCRETE ✓ ✓ ✓

f continuous✓ ✓ ✓

QUANTITATIVE ( NUMERICAL ) HIGH

GRAPHICAL DISPLAYS

÷ARPLOT

- HISTOGRAM

- BOXPLOT

Page 5: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Summary statistics and plots for qualitative data

5-level answers of 21 patients to the question”How much did pain due to your ureteric stones interfere with yourday to day activities ?”:

3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4,

where

I 1 = ”Not at all”,

I 2 = ”A little bit”,

I 3 = ”Somewhat”,

I 4 = ”Quite a bit”,

I 5 = ”Very much”.

5

^

Tc = .§,I{xi=I where I(xi= c) = 1 if xi = c and

= O otherwise

7 -

7121 -

II.it:#t.HN2=1 °AT1 2 3 4 r

TWO - STAGE PROCESS :

A . .TN#fEnENcEYEFmoDE = YES

B : IF INTERFERENCE : INTERF . LEVEL MODE

= 5

Page 6: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Summary statistics and plots for quantative data

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77

6

LOCATION :

MEAN : µ^ = I = f §,

xi = 1.83

4¥ ) if " is °&= 1.93

Meenan : need = }or= { (x÷÷÷+D y u is era

g, if replaces sy

by 109,

→ impact on

mean

- > no impact on

melian.

0

Page 7: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Summary statistics and plots for quantative data

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77

6

VARIABILITYRANGE: Xlu ,

- 29 , ) =2. 77 .

0.46 =2.31

Variance :

^F={ E.

(79-

E)"

=

(0.46-1.83)-+4.11-1.835=-0.5u

st.

osev : E=

Fi = @= >-0¥MEDIAN ABS .

Dev :MID = med ( l x ; -

vied ( x ) ) )

( INTERQJARTIE RANGE : IOTR = 9^0#- 9^0.25

Robust To OUTLIERS

.

.

Page 8: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Summary statistics and plots for quantative data

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77

6

HISTOGRAMJYarea proportionate, .tg⇒"Frieden.to

"

=¥SPLITDATA RANGE#±gTEsnafter } 1 ° 4 ' 2 8 2 f. Into

^ 2 3 4 6 T 6 INTERVALS

Page 9: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Summary statistics and plots for quantative data

0.0 0.5 1.0 1.5 2.0 2.5 3.0

● ● ●

● ● ● ●● ● ●●●●●● ●●●●● ●● ●● ●●● ● ● ●

Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77

6

#

* ,

Fk¥IE¥raeac.BOXPL= ' . STNMETRKAL

/€¥A 9%15.19:*uotsuitatle

Fyfe,aBjY€?BA"

ottersTotter

Page 10: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample case: independent versus paired samples

Permeability constants of a placental membrane at term (X) and between 12 to

26 weeks gestational age (Y).

1 2 3 4 5 6 7 8 9 10X 0.80 0.83 1.89 1.04 1.45 1.38 1.91 1.64 0.73 1.46Y 1.15 0.88 0.90 0.74 1.21

Hamilton depression scale factor measurements in 9 patients with mixed anxiety

and depression, taken at the first (X) and second (Y) visit after initiation of a

therapy (administration of a tranquilizer).

1 2 3 4 5 6 7 8 9X 1.83 0.50 1.62 2.48 1.68 1.88 1.55 3.06 1.30Y 0.88 0.65 0.60 2.05 1.06 1.29 1.06 3.14 1.29

Y�X �0.95 0.15 �1.02 �0.43 �0.62 �0.59 �0.49 0.08 �0.01

7

÷TRENT AND SAIE PARTICIPANTS

UNRELATED PARTICIPANTS MEASURED TW

TIi

- xi

mean of the { Elyseedifferences .

Page 11: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Quiz TimeSections 1 to 4

https://docs.google.com/forms/d/e/1FAIpQLScblQ_

-ISfSCGp_EIVPPI_mnrJHttaKxln8vVoyjJFvS8BL1w/viewform

8

Page 12: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Binomial distribution

If

I n independent experiments,

I outcome of each experiment is dichotomous (success/failure),

I the probability of success ⇡ is the same for all experiments,

then,

I the number of successes out of n trials (experiments), Y =P

n

i=1 Xi,

follows a binomial distribution with parameters n and ⇡:

Y ⇠ Bin(n,⇡),

I the probability of observing exactly y successes out of n experiments, is

given by

P (Y = y|n,⇡) = n!(n� y)!y!

⇡y(1� ⇡)n�y

.

10

Page 13: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Binomial distribution

If

I n independent experiments,

I outcome of each experiment is dichotomous (success/failure),

I the probability of success ⇡ is the same for all experiments,

then,

I the number of successes out of n trials (experiments), Y =P

n

i=1 Xi,

follows a binomial distribution with parameters n and ⇡:

Y ⇠ Bin(n,⇡),

I the probability of observing exactly y successes out of n experiments, is

given by

P (Y = y|n,⇡) = n!(n� y)!y!

⇡y(1� ⇡)n�y

.

Number of successes out of 50 experiments

prob

abilit

y

0.00

0.05

0.10

0.15

0.20

0.25

0 1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49

10

←,

Plt = ol YTIII ) = ¥#.0¥( e . o .os to = oas'

~ t.rs

Page 14: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Poisson distribution

If, during a time interval or in a given area,

I events occur independently,

I at the same rate,

I and the probability of an event to occur in a small interval (area) is

proportional to the length of the interval (size of the area),

then,

I the number of events occurring in a fixed time interval or in a given area,

X, may be modelled by means of a Poisson distribution with parameter �:

X ⇠ Poisson(�),

I the probability of observing x during a fixed time interval or in a given area

is given by

P (X = x|�) = �xe��

x!.

11

# EVENTS PER # Hourly ADMISSION AT ANE

UNCT OF TIME ANDFOR # of DAILY ORGAN DONORS in THE VK

PER AREA

Page 15: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Poisson distribution

If, during a time interval or in a given area,

I events occur independently,

I at the same rate,

I and the probability of an event to occur in a small interval (area) is

proportional to the length of the interval (size of the area),

then,

I the number of events occurring in a fixed time interval or in a given area,

X, may be modelled by means of a Poisson distribution with parameter �:

X ⇠ Poisson(�),

I the probability of observing x during a fixed time interval or in a given area

is given by

P (X = x|�) = �xe��

x!.

Number of chronic conditions per patient (US National Medical Expenditure Survey)

Probab

ility

0 1 2 3 4 5 6 7 8 9 10 11

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

11

-

mean an 5

variance

} = 1. 55

! rr

-

Page 16: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Continuous distrib.D

ensi

ty

−10 −5 0 5 10 15 20

0.0

0.1

0.2

0.3

ex−Gaussianskew tGammaInverse GammaGaussian

12

Page 17: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Normal distribution

X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

Probability density function, fZ(z), of a standard normal:

Density

−4 −3 −2 −1 0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

13

÷ 6 g5%1.96

Page 18: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Normal distribution

X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

Probability density function, fZ(z), of a standard normal:

99.73%

µ� 3� µ+ 3�

95.45%

µ� 2� µ+ 2�

68.27%µ� � µ+ �µ

0.0

0.1

0.2

0.3

0.4

13

Page 19: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Normal distribution

X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

(i) Suitable modelling for a lot of variables

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

13

^µ= 1.83

f2= 0.5

->

✓( 1.83,

0.5 )

[0.83

~ 95% 2.83

Page 20: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Normal distribution

X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

(i) Suitable modelling for a lot of variables: IQ

99.73%

55 145

95.45%

70 130

68.27%85 115100

0.0000

0.0266

13

Page 21: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Some parametric distributions: Normal distribution

X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

(ii) Central limit theorem (Lindeberg-Levy CLT). Let (X1, ..., Xn) be n independent and identically distributed

(iid) random variables drawn from distributions of expectedvalues given by µ and finite variances given by �2,

. then

bµ = X =

Pni=1 Xi

nd! N

✓µ,

�2

n

◆.

If Xi ⇠ N(µ,�2), this result is true for all sample sizes.

13

Page 22: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Central limit theorem shiny app:Distribution of the meanhttp://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/

14

#p⇐¥tuna%cIµegI ?

p

¥E

,

[0,;ozT

µ.gg z, [ a. ;q ]

(;g3 " t.it?#_tµ¥*⇐

' s °o° Miao [9 ; az ] X

Page 23: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)

I if X ⇠ N(µ,�2), then X ⇠ N

⇣µ,

�2

n

⌘,

I if X ⇠ N(µ,�2), then Z = X�µ

�⇠ N(0, 1),

P

< <

!= 0.95

15

ozu large and Xinii-4,4

- 1.96 Z 1.96

p( - a. 96 < II < i. 96 ) = o.gr

re

P( - i. 96Pa . I < I - n a 1.96ft -E) -.

o .9r

p( I -1.96 FE <

µ< E + 1.96¥ ) = o.gr

Page 24: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)

I if X ⇠ N(µ,�2), then X ⇠ N

⇣µ,

�2

n

⌘,

I if X ⇠ N(µ,�2), then Z = X�µ

�⇠ N(0, 1),

I if � unknown, then T = X�µ

s⇠ Stn�1.

P

< <

!= 0.95

Density

-4.303 4.30395%-2.228 2.22895%-2.042 2.04295%-1.96 1.9695%

X ⇠ N(0, 1)T ⇠ St30T ⇠ St10T ⇠ St2

fT (t|n� 1) =�(n�2

2 )�(n�1

2 )[(n�1)⇡]1/2�1+ t2

n�1

�n2

-5 -4 -3 -2 -1 0 1 2 3 4 5

0.0

0.1

0.2

0.3

0.4

15

of n large aus Xin iisln . ry

F-

ttnoatrlfnµ

It

tnioetrlsrnTito .97r ) = 2. oyz for 4=31

e>

Page 25: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)

I if X ⇠ N(µ,�2), then X ⇠ N

⇣µ,

�2

n

⌘,

I if X ⇠ N(µ,�2), then Z = X�µ

�⇠ N(0, 1),

I if � unknown, then T = X�µ

s⇠ Stn�1.

P

< <

!= 0.95

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

15

Page 26: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µ, the population mean,when Xi ⇠ iid(µ, �2)

I CLT: Xd! N

⇣µ,

�2

n

⌘,

I if X ⇠ N(µ,�2), then Z = X�µ

�⇠ N(0, 1),

I if � unknown, then T = X�µ

s⇠ Stn�1.

P

< <

!= 0.95

Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)

Probab

ility

0 1 2 3 4 5 6 7 8 9 10 11

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

16

Assuming Xi ~ Poisson

µ = £2 #1.55

1. rr. i. as

'f÷iµ 1.rrtr.at#s

.

Page 27: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µ, the population mean,when Xi ⇠ iid(µ, �2)

I CLT: Xd! N

⇣µ,

�2

n

⌘,

I if X ⇠ N(µ,�2), then Z = X�µ

�⇠ N(0, 1),

I if � unknown, then T = X�µ

s⇠ Stn�1.

P

< <

!= 0.95

Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)

Probab

ility

0 1 2 3 4 5 6 7 8 9 10 11

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

CINormal(µ, 0.95) = [1.5021; 1.5818]

CIStudent(µ, 0.95) = [1.5021; 1.5819]

CIBootstrap(µ, 0.95) = [1.5020; 1.5819]

CINB�GLM (µ, 0.95) = [1.5027; 1.5823]

16

Page 28: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µY � µX , the di↵erencebetween population means

If we haveI Xi ⇠ iid(µX ,�2

X), i = 1, ..., nX ,I Yi ⇠ iid(µY ,�2

Y ), i = 1, ..., nY ,

thenI if �2

X = �2Y [Student’s t-test equation],

. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,nX+nY �2sp

q1

nX+ 1

nY

where sp = (nX�1)s2X+(nY �1)s2YnX+nY �2 ,

I if �2X 6= �2

Y [Welch-Satterthwaite’s t-test equation],

. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,df

qs2XnX

+s2YnY

, where

df =

✓s2X

nX+

s2Y

nY

◆2

✓s2X

nX

◆2

nX�1 +

✓s2Y

nY

◆2

nY �1

.

17

Page 29: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

95% Confidence interval for µY � µX , the di↵erencebetween population means

If we haveI Xi ⇠ iid(µX ,�2

X), i = 1, ..., nX ,I Yi ⇠ iid(µY ,�2

Y ), i = 1, ..., nY ,

thenI if �2

X = �2Y [Student’s t-test equation],

. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,nX+nY �2sp

q1

nX+ 1

nY

where sp = (nX�1)s2X+(nY �1)s2YnX+nY �2 ,

I if �2X 6= �2

Y [Welch-Satterthwaite’s t-test equation],

. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,df

qs2XnX

+s2YnY

, where

df =

✓s2X

nX+

s2Y

nY

◆2

✓s2X

nX

◆2

nX�1 +

✓s2Y

nY

◆2

nY �1

.

17

Page 30: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Central limit theorem shiny app:Coverage of Student’s asymptotic confidence intervalshttp://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/

18

Page 31: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Quiz TimePractical 1

http://bioinformatics-core-shared-training.github.io/

IntroductionToStats/practical.html

19

Page 32: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

PART II:

Parametric and non-parametric

one-sample location tests

Cancer Research UK – 5th of November 2018

D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

Page 33: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Grand Picture of Statistics

Population Sample

Data

(x1, x2, ..., xn)

Statistics

bµb�2

b⇡

Parameters

µ

�2

20

Page 34: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Statistical hypothesis testing

A hypothesis test describes a phenomenon by means oftwo non-overlapping idealised models/descriptions:

I the null hypothesis H0, “generally assumed to be true until evidenceindicates otherwise”

I the alternative hypothesis H1.

The aim of the test is to reject the null hypothesis in favour of thealternative hypothesis, and conclude, with a probability ↵ of being wrong,that the idealised model/description of H1 is true.

Theory 1: Dieters lose more fat than the exercisers

Theory 2: There is no majority for Brexit now

Theory 3: Serum vitamin C is reduced in patients

22

. 5

Hi : Ty > 0.5

/* " ¥=°

Ho : µ ,= Me

Hi :

µ , > ME

(>

to : µp=µ¢Hi : Mp < up

Page 35: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Statistical hypothesis testing

Many options:

I One-sided versus two-sided tests,

I Exact versus asymptotic tests,

I Parametric versus non-parametric tests,

23

Page 36: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric or non-parametric ?

T-test Outcome(s) normally distributed

Yes Mildly No

Sample size

Small

Medium

Large

Situations which may suggest the use of non-parametric statistics:I When there is a small sample size or very unequal groups,I When the data has notable outliers,I When one outcome has a distribution other than normal,I When the data are ordered with many ties or are rank ordered.

32

! %¥

.

Page 37: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric location test

(One-sided) t-test

We test:

H0: µIQ = 100,H1: µIQ > 100.

We have Xi ⇠ N(µ,�2), i = 1, ..., n,

We know

I X ⇠ N

⇣µ,

�2

n

⌘,

I Z = X�µ�pn

⇠ N(0, 1),

Thus, if H0 is true, we have:

I Z = X�µ0�pn

⇠ N(0, 1).

Define the p-value:

I p� value = P (T > Tobs)

Density

−4 −3 −2 −1 0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

99.73%

55 145

95.45%

70 130

68.27%85 115100

0.0000

0.0266

25

Page 38: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Statistical hypothesis testing4 possible outcomes

Conclude:I if p-value > ↵ ! do not reject H0.I if p-value < ↵ ! reject H0 in favour of H1.

Test Outcome

H0 not rejected H1 accepted

Unknown Truth H0 true 1� ↵ ↵

H1 true � 1� �

where

I ↵ is the type I error,

I � is the type II error.

25

HIV test

c- ) ( t )

f) FALSE Pos.

( + )FALSE NEG .

C)( > POWER

Function ofDESIGN a

Sample size ,

i , ,

Page 39: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric location test

(One-sided) binomial exact test

We test:

H0: ⇡ = 5%,

H1: ⇡ > 5%.

We have Xi ⇠ Bernoulli(⇡), i = 1, ..., n,

We know

I Y =P

n

i=1 Xi ⇠ Binomial (⇡, n) ,

Thus, if H0 is true, we have:

I Y =P

n

i=1 Xi ⇠ Binomial (5%, n) ,

Define the p-value:

I p� value = P (Y > Yobs)

27

Page 40: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric location test

(One-sided) binomial exact test

We test:

H0: ⇡ = 5%,

H1: ⇡ > 5%.

We have Xi ⇠ Bernoulli(⇡), i = 1, ..., n,

We know

I Y =P

n

i=1 Xi ⇠ Binomial (⇡, n) ,

Thus, if H0 is true, we have:

I Y =P

n

i=1 Xi ⇠ Binomial (5%, n) ,

Define the p-value:

I p� value = P (Y > Yobs)

Number of successes out of n experiments

Prob

abilit

y

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

0.0

0.1

0.2

0.3

0.4

n = 25n = 100

27

Page 41: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Non-parametric location test

Wilcoxon sign-rank test

A location model is assumed for Xi, i = 1, ..., n:

Xi = ✓ + ei,

where ei ⇠ iid(µe = 0,�2e).

Interest for H0: ✓ = ✓0 against H1: ✓ < ✓0 or ✓ 6= ✓0 or ✓ > ✓0.

Test statistics : W+ =

Pn

i=1 ◆(Xi � ✓0 > 0) Rank(|Xi � ✓0|).

Distribution of W under H0: W+

has no closed-form distribution.

Wilcoxon signed rank test

data: golub[1042, gol.fac == "ALL"]V = 268, p-value = 0.05847alternative hypothesis: true location is not equal to 1.7595 percent confidence interval:1.73868 2.09106sample estimates:(pseudo)median

1.926475

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

28

Page 42: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Non-parametric location test

Wilcoxon sign-rank test

A location model is assumed for Xi, i = 1, ..., n:

Xi = ✓ + ei,

where ei ⇠ iid(µe = 0,�2e).

Interest for H0: ✓ = ✓0 against H1: ✓ < ✓0 or ✓ 6= ✓0 or ✓ > ✓0.

Test statistics : W+ =

Pn

i=1 ◆(Xi � ✓0 > 0) Rank(|Xi � ✓0|).

Distribution of W under H0: W+

has no closed-form distribution.

Wilcoxon signed rank test

data: golub[1042, gol.fac == "ALL"]V = 268, p-value = 0.05847alternative hypothesis: true location is not equal to 1.7595 percent confidence interval:1.73868 2.09106sample estimates:(pseudo)median

1.926475

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

28

Page 43: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Introduction to Shiny Apps and Exercises

29

Page 44: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

PART III:

Parametric and non-parametric

two-sample location tests

Cancer Research UK – 5th of November 2018

D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

Page 45: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample case

Many options:

I One-sided versus two-sided tests,

I Exact versus asymptotic tests,

I Parametric versus non-parametric tests,

I Tests for paired versus independent data.

31

Page 46: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric two-sample location test

Two-sample two-sided Student-s & Welch’s t-tests

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

We test H0: µY � µX = 0 against H1: µY � µX 6= 0.

We know:

I Student’s t-test [assume �2X

= �2Y]: (Y �X)�(µY �µX )

sp

q1

nX+ 1

nY

⇠ t1�↵2 ,nX+nY �2

I Welch’s t-test [assume �2X

6= �2Y]: (Y �X)�(µY �µX )r

s2X

nX+

s2Y

nY

⇠ t1�↵2 ,df

32

Page 47: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric two-sample location test

Two-sample two-sided Student-s & Welch’s t-tests

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

We test H0: µY � µX = 0 against H1: µY � µX 6= 0.

We know:

I Student’s t-test [assume �2X

= �2Y]: (Y �X)�(µY �µX )

sp

q1

nX+ 1

nY

⇠ t1�↵2 ,nX+nY �2

I Welch’s t-test [assume �2X

6= �2Y]: (Y �X)�(µY �µX )r

s2X

nX+

s2Y

nY

⇠ t1�↵2 ,df

Two Sample t-test

data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]t = 6.7983, df = 36, p-value = 6.046e-08alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.8829143 1.6336690sample estimates:mean of x mean of y1.8938826 0.6355909

32

Page 48: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Parametric two-sample location test

Two-sample two-sided Student-s & Welch’s t-tests

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

We test H0: µY � µX = 0 against H1: µY � µX 6= 0.

We know:

I Student’s t-test [assume �2X

= �2Y]: (Y �X)�(µY �µX )

sp

q1

nX+ 1

nY

⇠ t1�↵2 ,nX+nY �2

I Welch’s t-test [assume �2X

6= �2Y]: (Y �X)�(µY �µX )r

s2X

nX+

s2Y

nY

⇠ t1�↵2 ,df

Welch Two Sample t-test

data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]t = 6.3186, df = 16.118, p-value = 9.871e-06alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.8363826 1.6802008sample estimates:mean of x mean of y1.8938826 0.6355909

32

Page 49: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Non-parametric two-sample location test

Mann-Whitney-Wilcoxon test

Let

I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,

I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .

Interest for H0: � = �0 against H1: � < �0 or � 6= �0 or � > �0.

Standardised test statistic: z =PnY

i=1 R(Yi)�[nY (nX+nY +1)/2]pnXnY (nX+nY +1)/12

,

where R(Yi) denotes the rank of Yi amongst the combined samples, i.e.,

amongst (X1, ..., XnX , Y1, ..., YnY ).

Distribution of Z under H0: Z ⇠ N(0, 1).

Implementation 1:statistic = -4.361334 , p-value = 1.292716e-05

Implementation 2:W = 284, p-value = 6.15e-07alternative hypothesis: true location shift is not equal to 095 percent confidence interval:0.89647 1.57023sample estimates:difference in location

1.21951

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

33

Page 50: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Non-parametric two-sample location test

Mann-Whitney-Wilcoxon test

Let

I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,

I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .

Interest for H0: � = �0 against H1: � < �0 or � 6= �0 or � > �0.

Standardised test statistic: z =PnY

i=1 R(Yi)�[nY (nX+nY +1)/2]pnXnY (nX+nY +1)/12

,

where R(Yi) denotes the rank of Yi amongst the combined samples, i.e.,

amongst (X1, ..., XnX , Y1, ..., YnY ).

Distribution of Z under H0: Z ⇠ N(0, 1).

Implementation 1:statistic = -4.361334 , p-value = 1.292716e-05

Implementation 2:W = 284, p-value = 6.15e-07alternative hypothesis: true location shift is not equal to 095 percent confidence interval:0.89647 1.57023sample estimates:difference in location

1.21951

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

Density

−4 −3 −2 −1 0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

33

Page 51: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample proportion test

�2goodness-of-fit test

A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing

tumour size in patients with ovarian cancer:

Observed frequencies Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 44 40 (84)

Placebo 24 16 (40)

(68) (56) (124)

I H0 : No association between treatment group and tumour shrinkage,

I H1 : Some association.

Expected frequencies under H0 Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment

84⇥68124 = 46.06 84⇥58

124 = 37.94

(84)

Placebo

40⇥68124 = 21.94 40⇥56

124 = 18.71

(40)

(68) (56) (124)

We have 2 categorical variables with a total of J = 4 cells (categories).

I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).

Pearson’s Chi-squared test with Yates’ continuity correction

data: MX-squared = 0.36474, df = 1, p-value = 0.5459

34

Page 52: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample proportion test

�2goodness-of-fit test

A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing

tumour size in patients with ovarian cancer:

Observed frequencies Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 44 40 (84)

Placebo 24 16 (40)

(68) (56) (124)

I H0 : No association between treatment group and tumour shrinkage,

I H1 : Some association.Expected frequencies under H0 Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment

84⇥68124 = 46.06 84⇥58

124 = 37.94

(84)

Placebo

40⇥68124 = 21.94 40⇥56

124 = 18.71

(40)

(68) (56) (124)

We have 2 categorical variables with a total of J = 4 cells (categories).

I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).

Pearson’s Chi-squared test with Yates’ continuity correction

data: MX-squared = 0.36474, df = 1, p-value = 0.5459

34

Page 53: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample proportion test

�2goodness-of-fit test

A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing

tumour size in patients with ovarian cancer:

Observed frequencies Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 44 40 (84)

Placebo 24 16 (40)

(68) (56) (124)

I H0 : No association between treatment group and tumour shrinkage,

I H1 : Some association.Expected frequencies under H0 Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 84⇥68124 = 46.06 84⇥58

124 = 37.94 (84)

Placebo 40⇥68124 = 21.94 40⇥56

124 = 18.71 (40)

(68) (56) (124)

We have 2 categorical variables with a total of J = 4 cells (categories).

I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).

Pearson’s Chi-squared test with Yates’ continuity correction

data: MX-squared = 0.36474, df = 1, p-value = 0.5459

34

Page 54: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample proportion test

�2goodness-of-fit test

A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing

tumour size in patients with ovarian cancer:

Observed frequencies Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 44 40 (84)

Placebo 24 16 (40)

(68) (56) (124)

I H0 : No association between treatment group and tumour shrinkage,

I H1 : Some association.Expected frequencies under H0 Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment

84⇥68124 = 46.06 84⇥58

124 = 37.94

(84)

Placebo

40⇥68124 = 21.94 40⇥56

124 = 18.71

(40)

(68) (56) (124)

We have 2 categorical variables with a total of J = 4 cells (categories).

I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).

Pearson’s Chi-squared test with Yates’ continuity correction

data: MX-squared = 0.36474, df = 1, p-value = 0.545934

Page 55: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample proportion test

Fisher’s exact test of independence

�2goodness-of-fit test not suitable when

I n is small

I Ej < 5 for at least one cell.

Observed frequencies Variable 1

Category 1 Category 2

Variable 2 Category 1 a b (a+b)

Category 2 c d (c+d)

(a+c) (b+d) (a+b+c+d=n)

Fisher showed that, under H0 (independence),

P (observed table | H0) = P (X = a) and X ⇠ Hypergeometric(n, a+ c, a+ b).To compute the Fisher’s test:

I Define P (X = a) for all possible tables having the observed marginal

counts,

I Calculate the p� value by defining the percentage of these tables that get

a probability equal to or smaller than the one observed.

Fisher’s Exact Test for Count Data

data: Mp-value = 0.4471alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3160593 1.6790135sample estimates:odds ratio0.7351707

35

Page 56: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Two-sample proportion test

Fisher’s exact test of independence

�2goodness-of-fit test not suitable when

I n is small

I Ej < 5 for at least one cell.

Observed frequencies Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 44 40 (84)

Placebo 24 16 (40)

(68) (56) (124)

Fisher showed that, under H0 (independence),

P (observed table | H0) = P (X = a) and X ⇠ Hypergeometric(n, a+ c, a+ b).To compute the Fisher’s test:

I Define P (X = a) for all possible tables having the observed marginal

counts,

I Calculate the p� value by defining the percentage of these tables that get

a probability equal to or smaller than the one observed.

Fisher’s Exact Test for Count Data

data: Mp-value = 0.4471alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3160593 1.6790135sample estimates:odds ratio0.7351707

35

Page 57: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

F-test of equality of variances

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

We test H0: �2Y

= �2X

against H1: �2Y

6= �2X.

We know:

I F-test [assume Xi ⇠ N(µX ,�X) and Yi ⇠ N(µY ,�Y )]:s2Y

s2X

⇠ FnY �1,nX�1

36

Page 58: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

F-test of equality of variances

●●●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

We test H0: �2Y

= �2X

against H1: �2Y

6= �2X.

We know:

I F-test [assume Xi ⇠ N(µX ,�X) and Yi ⇠ N(µY ,�Y )]:s2Y

s2X

⇠ FnY �1,nX�1

F test to compare two variances

data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]F = 0.71164, num df = 26, denom df = 10, p-value = 0.4652alternative hypothesis: true ratio of variances is not equal to 195 percent confidence interval:0.2127735 1.8428387sample estimates:ratio of variances

0.7116441

36

Page 59: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Warning

Multiplicity correction

For each test, the probability of rejecting H0 (and accept H1) when H0 istrue equals ↵.

For k tests, the probability of rejecting H0 (and accept H1) at least 1 timewhen H0 is true, ↵k, is given by

↵k = 1� (1� ↵)k.

Thus, for ↵ = 0.05,I if k = 1, ↵1 = 1� (1� ↵)1 = 0.05,I if k = 2, ↵2 = 1� (1� ↵)2 = 0.0975,I if k = 10, ↵10 = 1� (1� ↵)10 = 0.4013.

Idea: change the level of each test so that ↵k = 0.05:

I Bonferroni correction : ↵ = ↵k

k ,

I Dunn-Sidak correction: ↵ = 1� (1� ↵k)1/k.

37

Page 60: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Warning

Non-parametric is not assumption free: Type I error

Simulate 2500 samples with

I Xi ⇠ Uniform(1.5, 2.5), i = 1, ..., nX ,

I Yi ⇠ Uniform(0, 4), i = 1, ..., nY ,

so that E[Xi] = E[Yi] = 2 (i.e., same mean, same median).

Assume

I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,

I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .

Test H0: � = �0 against H1: � 6= �0, at the 5% level, by means of

I Mann-Whitney-Wilcoxon test (MWW),

I T-test,

I Welch-test.

b↵ Tests

MWW Student’s t-test Welch’s test

Sample size nX = 200, nY = 70 0.145 0.202 0.055

nX = 20, nY = 7 0.148 0.240 0.062

38

Page 61: Introduction to Statistical Analysisbioinformatics-core-shared-training.github.io/...Data Types x 1 x 2 x 3 ááá x n Cancer status C!C!C ááá C Nucleic acid sequence CT Tááá

Exercises

39