Statistical Models for sequencing data: from Experimental Design...

101
1 Statistical Models for sequencing data: from Experimental Design to Generalized Linear Models Oscar M. Rueda Breast Cancer Functional Genomics Group. CRUK Cambridge Research Institute (a.k.a. Li Ka Shing Centre) [email protected] Best practices in the analysis of RNA-Seq and CHiP-Seq data 27 th -31 st , July 2015 University of Cambridge, Cambridge, UK

Transcript of Statistical Models for sequencing data: from Experimental Design...

Page 1: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

1

Statistical Models for sequencing data: from Experimental Design

to Generalized Linear Models

Oscar M. Rueda Breast Cancer Functional Genomics Group. CRUK Cambridge Research Institute (a.k.a. Li Ka Shing Centre) [email protected]!!

Best practices in the analysis of RNA-Seq and CHiP-Seq data 27th-31st, July 2015

University of Cambridge, Cambridge, UK

Page 2: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Outline  

•  Experimental  Design  •  Design  and  Contrast  matrices  •  Generalized  linear  models  •  Models  for  coun:ng  data  

2

Page 3: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

3

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Sir  Ronald  Fisher  (1890-­‐1962)  

[evolu:onary  biologist,  gene:cist  and  sta:s:cian]  

Page 4: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

4

An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.

John  Tukey (1915-­‐2000)  

[Sta:s:cian]  

Page 5: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

5

An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts - for support rather than for illumination.

Andrew  Lang  (1844-­‐1912)  

[Poet,  novelist  and  literary  cri:c]  

Page 6: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Experimental  Design  

Page 7: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  of  an  experiment  

•  Select  biological  ques:ons  of  interest  •  Iden:fy  an  appropriate  measure  to  answer  that  ques:on  

•  Select  addi:onal  variables  or  factors  that  can  have  an  influence  in  the  result  of  the  experiment  

•  Select  a  sample  size  and  the  sample  units  •  Assign  samples  to  lanes/flow  cells.  

7

Page 8: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Principles  of  Sta:s:cal  Design  of  Experiments  

•  R.  A.  Fisher:  – Replica:on  – Blocking  – Randomiza:on.  

•  They  have  been  used  in  microarray  studies  from  the  beginning.  

•  Bar  coding  makes  easy  to  adapt  them  to  NGS  studies.  

8

Page 9: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Sampling  hierarchy  

There  are  three  levels  of  sampling:  •  Subject  Sampling  •  RNA  sampling  •  Fragment  sampling.  

Auer and Doerge. Genetics 185:405-416(2010) 9

Page 10: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Unreplicated  Data  

Inferences  for  RNA  and  fragment-­‐level  can  be  obtained  through  Fisher’s  test.  But  they  don’t  reflect  biological  variability.  

Auer and Doerge. Genetics 185:405-416(2010) 10

Page 11: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Replicated  Data  

Inferences  for  treatment  effect  using  generalized  linear  models  (more  on  this  later).  

Auer and Doerge. Genetics 185:405-416(2010) 11

Is  this  a  good  design?  We  should  randomize  within  block!  

Page 12: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Balanced  Block  Designs  

•  Avoids  confounding  effects:  – Lane  effects  (any  errors  from  the  point  where  the  sample  is  input  to  the  flow  cell  un:l  the  data  output).  Examples:  systema:cally  bad  sequencing  cycles,  errors  in  base  calling…    

– Batch  effects  (any  errors  ager  random  fragmenta:on  of  the  RNA  un:l  it  is  input  to  the  flow  cell).  Examples:  PCR  amplifica:on,  reverse  transcrip:on  ar:facts…  

– Other  effects  non  related  to  treatment.  

Auer and Doerge. Genetics 185:405-416(2010) 12

Page 13: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Balanced  blocks  by  mul:plexing  

Auer and Doerge. Genetics 185:405-416(2010)

Page 14: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Balanced  incomplete  block  design  and  blocking  without  mul:plexing  

•  Usually  there  are  restric:ons  with  the  number  of  treatments,  replicates,  unique  bar  codes  and  available  lanes  for  sequencing.  

•  3  treatments  (T1,  T2,  T3)  •  1  subject  per  treatment  (T11,  T21,  T31)  •  2  technical  replicates  (T111,  T112,  T211,  

T212,  T311,  T312)  

Auer and Doerge. Genetics 185:405-416(2010)

Page 15: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Simula:ons  

Auer and Doerge. Genetics 185:405-416(2010)

Page 16: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Results  

Auer and Doerge. Genetics 185:405-416(2010) 16

Page 17: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Benefits  of  a  proper  design  

•  NGS  is  benefited  with  design  principles  •  Technical  replicates  can  not  replace  biological  replicates  

•  It  is  possible  to  avoid  mul:plexing  with  enough  biological  replicates  and  sequencing  lanes  

•  The  advantages  of  mul:plexing  are  bigger  than  the  disadvantages  (cost,  loss  of  sequencing  depth,  bar-­‐code  bias…)  

17

Page 18: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  and  contrast  matrices  

Page 19: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Sta:s:cal  models  – We want to model the expected result of an

outcome (dependent variable) under given values of other variables (independent variables)

E(Y ) = f (X)Y = f (X)+ε

Expected value of variable Y

Arbitrary function (any shape)

A set of k independent variables (also called factors)

This is the variability around the expected mean of y

19

Page 20: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

20

Design  matrix  

– Represents the independent variables that have an influence in the response variable, but also the way we have coded the information and the design of the experiment.

–  For now, let’s restrict to models

Y = βX +ε

Response variable Parameter vector

Design matrix

Stochastic error

Page 21: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Types  of  designs  considered  

•  Models  with  1  factor  – Models  with  two  treatments  – Models  with  several  treatments  

•  Models  with  2  factors  –  Interac:ons  

•  Paired  designs  •  Models  with  categorical  and  con:nuous  factors  •  TimeCourse  Experiments  •  Mul:factorial  models.  

  21

Page 22: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Strategy  

•  Define  our  set  of  samples  •  Define  the  factors,  type  of  factors  (con:nuous,  categorical),  number  of  levels…  

•  Define  the  set  of  parameters:  the  effects  we  want  to  es:mate  

•  Build  the  design  matrix,  that  relates  the  informa:on  that  each  sample  contains  about  the  parameters.  

•  Es:mate  the  parameters  of  the  model:  tes:ng  •  Further  es:ma:on  (and  tes:ng):  contrast  matrices.  

Page 23: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  1  factor,  2  levels  

   

Treatme  Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Control  

Sample  3   Treatment  A  

Sample  4   Control  

Sample  5   Treatment  A  

Sample  6   Control  

Number  of  samples:  6  Number  of  factors:  1  

 Treatment:  Number  of  levels:  2  

Possible  parameters  (What  differences  are  important)?    -­‐  Effect  of  Treatment  A  -­‐  Effect  of  Control   23

Page 24: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  matrix  for  models  with  1  factor,  2  levels  

   

Design  Matrix  

1 00 11 00 11 00 1

!

"

#######

$

%

&&&&&&&

Treat.  A  

Control  

Sample  1  

Sample  2  Sample  3  

Sample  4  

Sample  5  

Sample  6  

Parameters  (coefficients,  levels  of  the  variable)  S1

S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=   TC

!

"#

$

%&

24 Equivalent  to  a  t-­‐test  

Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Control  

Sample  3   Treatment  A  

Sample  4   Control  

Sample  5   Treatment  A  

Sample  6   Control  

C  is  the  mean  expression  of  the  control  T  is  the  mean  expression  of  the  treatment  

Page 25: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  matrix  for  models  with  1  factor,  2  levels  

   

Design  Matrix  

1 00 11 00 11 00 1

!

"

#######

$

%

&&&&&&&

Treat.  A  

Control  

Sample  1  

Sample  2  Sample  3  

Sample  4  

Sample  5  

Sample  6  

Parameters  (coefficients,  levels  of  the  variable)  S1

S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=   TC

!

"#

$

%&

25 Equivalent  to  a  t-­‐test  

Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Control  

Sample  3   Treatment  A  

Sample  4   Control  

Sample  5   Treatment  A  

Sample  6   Control  

Page 26: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Intercepts  

   

Different  parameteriza:on:  using  intercept  

26

Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Control  

Sample  3   Treatment  A  

Sample  4   Control  

Sample  5   Treatment  A  

Sample  6   Control  

Let’s  now  consider  this  parameteriza:on:    C=  Baseline  expression  TA=  Baseline  expression  +  effect  of  treatment    So  the  set  of  parameters  are:    C  =  Control  (mean  expression  of  the  control)  a  =  TA  –  Control  (mean  change  in  expression  under  treatment  

Page 27: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Intercept  

   

Design  Matrix  

1 11 01 11 01 11 0

!

"

#######

$

%

&&&&&&&

Intercep

t  

Treatm

ent  A

 

Sample  1  

Sample  2  Sample  3  

Sample  4  

Sample  5  

Sample  6  

Parameters  (coefficients,  levels  of  the  variable)  S1

S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  β0a

!

"##

$

%&&

Different  parameteriza:on:  using  intercept  

Intercept  measures  the  baseline  expression.  a  measures  now  the  differen:al  expression  between  Treatment  A  and  Control  

27

Page 28: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Contrast  matrices  Are  the  two  parameteriza:ons  equivalent?  

Contrast  matrix  Contrast  matrices  allow  us  to  es:mate  (and  test)  linear  combina:ons  of  our  coefficients.    

28

1 −1"#

$%

TC

"

#&&

$

%''= T −C

Page 29: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  1  factor,  more  than  2  levels  

   

Treatme  Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Treatment  B  

Sample  3   Control  

Sample  4   Treatment  A  

Sample  5   Treatment  B  

Sample  6   Control  

Number  of  samples:  6  Number  of  factors:  1  

 Treatment:  Number  of  levels:  3  Possible  parameters  (What  differences  are  important)?  -­‐  Effect  of  Treatment  A  -­‐  Effect  of  Treatment  B  -­‐  Effect  of  Control  -­‐  Differences  between  treatments?  

29

ANOVA  models  

Page 30: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  matrix  for  ANOVA  models  

1 0 00 1 00 0 11 0 00 1 00 0 1

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

TATBC

!

"

####

$

%

&&&&

1 1 01 0 11 0 01 1 01 1 11 0 0

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

β0ab

!

"

####

$

%

&&&&

30

Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Treatment  B  

Sample  3   Control  

Sample  4   Treatment  A  

Sample  5   Treatment  B  

Sample  6   Control  

Page 31: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  matrix  for  ANOVA  models  

1 0 00 1 00 0 11 0 00 1 00 0 1

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

TATBC

!

"

####

$

%

&&&&

1 1 01 0 11 0 01 1 01 0 11 0 0

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

β0ab

!

"

####

$

%

&&&&

31

Sample   Treatment  

Sample1   Treatment  A  

Sample  2   Treatment  B  

Sample  3   Control  

Sample  4   Treatment  A  

Sample  5   Treatment  B  

Sample  6   Control  

Control  =  Baseline  TA  =  Baseline  +  a  TB  =  Baseline  +  b  

Page 32: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Baseline  levels  

1 0 01 1 01 0 11 0 01 1 01 0 1

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

β0bc

!

"

####

$

%

&&&&

The  model  with  intercept  always  take  one  level  as  a  baseline:  

The  baseline  is  treatment  A,  the  coefficients  are  comparisons  against  it!    

32

By  default,  R  uses  the  first  level  as  baseline  

Page 33: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

R  code  

R  code:    > Treatment <- rep(c(“TreatmentA”, “TreatmentB”, “Control”), 2)!> design.matrix <- model.matrix(~ Treatment) (model with intercept)!> design.matrix <- model.matrix(~ -1 + Treatment) (model without intercept)!> design.matrix <- model.matrix(~ 0 + Treatment) (model without intercept)!    

     

33

Page 34: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Exercise  

Build  contrast  matrices  for  all  pairwise  comparisons  for  this  design:  

1 0 00 1 00 0 11 0 00 1 00 0 1

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

TATBC

!

"

####

$

%

&&&&

TATBC

!

"

####

$

%

&&&&

!

"

###

$

%

&&&

34

Page 35: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Exercise  

Build  contrast  matrices  for  all  pairwise  comparisons  for  these  designs:  

1 0 00 1 00 0 11 0 00 1 00 0 1

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

TATBC

!

"

####

$

%

&&&&

TATBC

!

"

####

$

%

&&&&

1 0 −10 1 −11 −1 0

"

#

$$$

%

&

'''

35

Page 36: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Exercise  

Build  contrast  matrices  for  all  pairwise  comparisons  for  these  designs:  

!

"

###

$

%

&&&

1 1 01 0 11 0 01 1 01 1 11 0 0

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

β0ab

!

"

####

$

%

&&&&

β0ab

!

"

####

$

%

&&&&

36

Page 37: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Exercise  

Build  contrast  matrices  for  all  pairwise  comparisons  for  these  designs:  

0 1 00 0 10 1 −1

"

#

$$$

%

&

'''

1 1 01 0 11 0 01 1 01 1 11 0 0

!

"

#######

$

%

&&&&&&&

S1S2S3S4S5S6

!

"

#######

$

%

&&&&&&&

=  

β0ab

!

"

####

$

%

&&&&

β0ab

!

"

####

$

%

&&&&

37

Page 38: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  

   

Treatme  Sample   Treatment   ER  status  

Sample1   Treatment  A   +  

Sample  2   No  Treatment   +  

Sample  3   Treatment  A   +  

Sample  4   No  Treatment   +  

Sample  5   Treatment  A   -­‐  

Sample  6   No  Treatment   -­‐  

Sample  7   Treatment  A   -­‐  

Sample  8   No  Treatment   -­‐  Number  of  samples:  8  Number  of  factors:  2  

 Treatment:  Number  of  levels:  2    ER:  Number  of  levels:  2  

38

Page 39: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Adapted from Natalie Thorne, Nuno L. Barbosa Morais

Treat effect"

Both effects"

ER effect"

Treat x ER positive interaction

Treat x ER negative interaction

Understanding  Interac:ons  

Treat + ER effects

Treat effect"

Both effects"

ER effect"

Treat x ER effects

Treat x ER effects

No  Treat   Treat  A  

ER  -­‐     S6,  S8   S5,  S7  

ER  +   S2,  S4   S1,  S3  

39

Page 40: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  and  no  interac:on  

 Number  of  coefficients  (parameters):    Intercept  +  (♯levels  Treat  -­‐1)  +  (♯levels  ER  -­‐1)  =  3    If  we  remove  the  intercept,  the  addi:onal  parameter  comes  from  the  missing  level  in  one  of  the  variables,  but  in  models  with  more  than  1  factor  it  is  a  good  idea  to  keep  the  intercept.  

Model  with  no  interac:on:  only  main  effects    

40

Page 41: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  (no  interac:on)  

S1S2S3S4S5S6S7S8

!

"

##########

$

%

&&&&&&&&&&

=

1 1 11 0 11 1 11 0 11 1 01 0 01 1 01 0 0

'

(

))))))))))

*

+

,,,,,,,,,,

β0aer +

!

"

####

$

%

&&&&

R  code:  >  design.matrix <- model.matrix(~Treatment+ER)  (model  with  intercept)                              

In  R,  the  baseline  for  each  variable  is  the  first  level.    

41

No  Treat   Treat  A  

ER  -­‐     S6,  S8   S5,  S7  

ER  +   S2,  S4   S1,  S3  

Page 42: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  (no  interac:on)  

S1S2S3S4S5S6S7S8

!

"

##########

$

%

&&&&&&&&&&

=

1 1 11 0 11 1 11 0 11 1 01 0 01 1 01 0 0

'

(

))))))))))

*

+

,,,,,,,,,,

β0aer +

!

"

####

$

%

&&&&

42

No  Treat   Treat  A  

ER  -­‐     S6,  S8   S5,  S7  

ER  +   S2,  S4   S1,  S3  

R  code:  >  design.matrix <- model.matrix(~Treatment+ER)  (model  with  intercept)                              

Page 43: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  and  interac:on  

 Number  of  coefficients  (parameters):    Intercept  +  (♯levels  Treat  -­‐1)  +  (♯levels  ER  -­‐1)  +((♯levels  Treat  -­‐1)  *  (♯levels  ER  -­‐1))      =  4    

Model  with  interac:on:  main  effects  +  interac/on    

43

Page 44: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  (interac:on)  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 1 1 11 0 1 01 1 1 11 0 1 01 1 0 01 0 0 01 1 0 01 0 0 0

'

(

))))))))))

*

+

,,,,,,,,,,

β0a

er +a.er +

!

"

######

$

%

&&&&&&

R  code:  > design.matrix <- model.matrix(~Treatment*ER)    (model  with  intercept)                              

“Extra  effect”  of  Treatment  A  on  ER+  samples  

44

No  Treat   Treat  A  

ER  -­‐     S6,  S8   S5,  S7  

ER  +   S2,  S4   S1,  S3  

Page 45: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  with  2  factors  (interac:on)  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 1 1 11 0 1 01 1 1 11 0 1 01 1 0 01 0 0 01 1 0 01 0 0 0

'

(

))))))))))

*

+

,,,,,,,,,,

β0a

er +a.er +

!

"

######

$

%

&&&&&&

“Extra  effect”  of  Treatment  A  on  ER+  samples  

45

R  code:  > design.matrix <- model.matrix(~Treatment*ER)    (model  with  intercept)                              

No  Treat   Treat  A  

ER  -­‐     S6,  S8   S5,  S7  

ER  +   S2,  S4   S1,  S3  

Page 46: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

2  by  3  factorial  experiment  

•  Identify DE genes that have different time profiles between different mutants. α = time effect, β = strains, αβ = interaction effect

0 12 24"

Strain B!Strain A!

time

α > 0 β = 0 αβ=0

Exp

0 12 24"

Strain B!

Strain A!

time

α > 0 β > 0 αβ=0

Exp

0 12 24"

Strain A!Strain B!

time

αβ>0 Exp

0 12 24" time

Strain A!

Strain B!

α=0 β> 0 αβ=0

Exp

Slide by Natalie Thorne, Nuno L. Barbosa Morais 46

Page 47: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Paired  Designs  Sample   Type  

Sample  1   Tumour  

Sample  2   Matched  Normal  

Sample  3   Tumour  

Sample  4   Matched  Normal  

Sample  5   Tumour  

Sample  6     Matched  Normal  

Sample  7   Tumour  

Sample  8   Matched  Normal  

Number  of  samples:  4  Number  of  factors:  2  

 Sample:  Number  of  levels:  4    Type:  Number  of  levels:  2  

Number  of  samples:  8  Number  of  factors:  1  

 Type:  Number  of  levels:  2  

Sample   Type  

Sample  1   Tumour  

Sample  1   Matched  Normal  

Sample  2   Tumour  

Sample  2   Matched  Normal  

Sample  3   Tumour  

Sample  3     Matched  Normal  

Sample  4   Tumour  

Sample  4   Matched  Normal  

47

Page 48: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Design  matrix  for  Paired  experiments  We  can  gain  precision  in  our  es:mates  with  a  paired  design,  because  individual  variability  is  removed  when  we  compare  the  effect  of  the  treatment  within  the  same  sample.  

R  code:  > design.matrix <- model.matrix(~-1 +Type) (unpaired;  model  without  intercept)    

           > design.matrix <- model.matrix(~-1 +Sample+Type)    (paired;  model  without  intercept)                              

These  effects  only  reflect  biological  differences  not  related  to  tumour/normal  effect.  

48

Sample   Type  

Sample  1   Tumour  

Sample  1   Matched  Normal  

Sample  2   Tumour  

Sample  2   Matched  Normal  

Sample  3   Tumour  

Sample  3     Matched  Normal  

Sample  4   Tumour  

Sample  4   Matched  Normal  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 0 0 0 11 0 0 0 00 1 0 0 10 1 0 0 00 0 1 0 10 0 1 0 00 0 0 1 10 0 0 1 0

'

(

))))))))))

*

+

,,,,,,,,,,

S1S2S3S4t

!

"

######

$

%

&&&&&&

Page 49: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Analysis  of  covariance  (Models  with  categorical  and  con:nuous  variables)  

Sample   ER   Dose  

Sample  1   +   37  

Sample  2   -­‐   52  

Sample  3   +   65  

Sample  4   -­‐   89  

Sample  5   +   24  

Sample  6     -­‐   19  

Sample  7   +   54  

Sample  8   -­‐   67  

Number  of  samples:  8  Number  of  factors:  2  

 ER:  Number  of  levels:  2    Dose:  Con:nuous  

49

Page 50: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Analysis  of  covariance  (Models  with  categorical  and  con:nuous  variables)  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 1 371 0 521 1 651 0 891 1 241 0 191 1 541 0 67

'

(

))))))))))

*

+

,,,,,,,,,,

β0er +d

!

"

####

$

%

&&&&

R  code:  > design.matrix <- model.matrix(~ ER + dose)      

If  we  consider  the  effect  of  dose  linear  we  use  1  coefficient  (degree  of  freedom).  We  can  also  model  it  as  non-­‐linear  (using  splines,  for  example).  

50

Sample   ER   Dose  

Sample  1   +   37  

Sample  2   -­‐   52  

Sample  3   +   65  

Sample  4   -­‐   89  

Sample  5   +   24  

Sample  6     -­‐   19  

Sample  7   +   54  

Sample  8   -­‐   67  

Page 51: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Analysis  of  covariance  (Models  with  categorical  and  con:nuous  variables)  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 1 37 371 0 52 01 1 65 651 0 89 01 1 24 241 0 19 01 1 54 541 0 67 0

'

(

))))))))))

*

+

,,,,,,,,,,

β0er +d

er +.d

!

"

####

$

%

&&&&

If  the  interac:on  is  significant,  the  effect  on  the  dose  is  different  depending  on  the  levels  of  ER.  

Interac:on:  Is  it  the  effect  of  dose  equal  in  ER  +  and  ER  -­‐?  

51

R  code:  > design.matrix <- model.matrix(~ ER * dose)      

Sample   ER   Dose  

Sample  1   +   37  

Sample  2   -­‐   52  

Sample  3   +   65  

Sample  4   -­‐   89  

Sample  5   +   24  

Sample  6     -­‐   19  

Sample  7   +   54  

Sample  8   -­‐   67  

Page 52: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Time  Course  experiments  Sample   Time  

Sample  1   0h  

Sample  1   1h  

Sample  1   4h  

Sample  1   16h  

Sample  2   0h  

Sample  2     1h  

Sample  2   4h  

Sample  2   16h  

Number  of  samples:  2  Number  of  factors:  2  

 Sample:  Number  of  levels:  2    Time:  Con:nuous  or  categorical?  

Main  ques:on:  how  does  expression  change  over  :me?  

If  we  model  :me  as  categorical,  we  don’t  make  assump:ons  about  its  effect,  but  we  use  too  many  degrees  of  freedom.  

If  we  model  :me  as  con:nuous,  we  use  less  degrees  of  freedom  but  we  have  to  make  assump:ons  about  the  type  of  effect.  

52 Intermediate  solu:on:  splines  

Page 53: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Time  Course  experiments:  no  assump:ons  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 0 0 0 01 0 1 0 01 0 0 1 01 0 0 0 10 1 0 0 00 1 1 0 00 1 0 1 00 1 0 0 1

'

(

))))))))))

*

+

,,,,,,,,,,

S1S2T1T4T16

!

"

#######

$

%

&&&&&&&

R  code:  > design.matrix <- model.matrix(~Sample + factor(Time))      

We  can  use  contrasts  to  test  differences  at  :me  points.  

53

Sample   Time  

Sample  1   0h  

Sample  1   1h  

Sample  1   4h  

Sample  1   16h  

Sample  2   0h  

Sample  2     1h  

Sample  2   4h  

Sample  2   16h  

Page 54: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Time  Course  experiments  

Y1Y2Y3Y4Y5Y6Y7Y8

!

"

###########

$

%

&&&&&&&&&&&

=

1 0 01 0 11 0 41 0 160 1 00 1 10 1 40 1 16

'

(

))))))))))

*

+

,,,,,,,,,,

S1S2X

!

"

####

$

%

&&&&

R  code:  > design.matrix <- model.matrix(~Sample + Time)      

We  are  assuming  a  linear  effect  on  :me  

54

time

small coef x

Big coef x

Large neg coef x Intermediate  models  are  possible:  splines  

Sample   Time  

Sample  1   0h  

Sample  1   1h  

Sample  1   4h  

Sample  1   16h  

Sample  2   0h  

Sample  2     1h  

Sample  2   4h  

Sample  2   16h  

Page 55: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Mul:  factorial  models  

•  We  can  fit  models  with  many  variables  •  Sample  size  must  be  adequate  to  the  number  of  factors  •  Same  rules  for  building  the  design  matrix  must  be  used:    

•  There  will  be  one  column  in  design  matrix  for  the  intercept  •  Con:nuous  variables  with  a  linear  effect  will  need  one  column  in  the  design  

matrix  •  Categorical  variable  will  need  ♯levels  -­‐1  columns  •  Interac:ons  will  need  (♯levels  -­‐1)  x    (♯levels  -­‐1)    •  It  is  possible  to  include  interac:ons  of  more  than  2  variables,  but  the  number  of  

samples  needed  to  accurately  es:mate  those  interac:ons  is  large.  

55

Page 56: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Generalized  linear  models  

Page 57: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Sta:s:cal  models  

– We  want  to  model  the  expected  result  of  an  outcome  (dependent  variable)  under  given  values  of  other  variables  (independent  variables)  

E(Y ) = f (X)Y = f (X)+ε

Expected  value  of  variable  y  Arbitrary  func:on  (any  shape)  

A  set  of  k  independent  variables  (also  called  factors)  

This  is  the  variability  around  the  expected  mean  of  y  

57

Page 58: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Fixed  vs  Random  effects  

–  If  we  consider  an  independent  variable  Xi  as  fixed,    that  is  the  set  of  observed  values  has  been  fixed  by  the  design,  then  it  is  called  a  fixed  factor.  

–  If  we  consider  an  independent  variable  Xj  as  random,  that  is  the  set  of  observed  values  comes  from  a  realiza:on  of  a  random  process,  it  is  called  a  random  factor.  

– Models  that  include  random  effects  are  called  MIXED  MODELS.  

–  In  this  course  we  will  only  deal  with  fixed  factors.  

58

Page 59: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Linear  models    

– The  observed  value  of  Y  is  a  linear  combina:on  of  the  effects  of  the  independent  variables  

–  If  we  include  categorical  variables  the  model  is  called  General  Linear  Model  

E(Y ) = β0 +β1X1 +β2X2 +...+βkXk

E(Y ) = β0 +β1X1 +β2X21 +...+βpX

p1

E(Y ) = β0 +β1 log(X1)+β2 f (X2 )+...+βkXk

Arbitrary  number  of  independent  variables  

Polynomials  are  valid  

We  can  use  func:ons  of  the  variables  if  the  effects  are  linear  

Smooth    func:ons:  not  exactly  the  same  as  the  so-­‐called  addi/ve  models  

59

Page 60: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Model  Es:ma:on  We  use  least  squares  esImaIon    

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●●●

●●

●●●●

0 20 40 60 80 100

−20

24

68

1012

X

Y

●●●

●●●

●●

●●

●●

●●●

●●

X

Y

−0.6

−0.4

−0.2

0.0

0.2

0.4

0.6

Male Female

Given  n  observa:ons  (y1,..yn,x1,..xn)    minimize  the  differences  between  the  observed  and  the  predicted  values  

60

Page 61: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Model  Es:ma:on  Y = βX +ε

β

β

se(β)

Parameter  of  interest  (effect  of  X  on  Y)  

EsImator  of  the  parameter  of  interest  

Standard  Error  of  the  es:mator  of  the  parameter  of  interest  

β = (XTX)−1XTY

se(βi ) =σ ciwhere ci is the i

th diagonal element of XTX( )−1

y = βxe = y− y

Fived  values  (predicted  by  the  model)  

Residuals  (observed  errors)   61

Page 62: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Model  Assump:ons  In  order  to  conduct  sta:s:cal  inferences  on  the  parameters  on  the  model,  some  assump:ons  must  be  made:  •  The  observa:ons  1,..,n  are  independent  •  Normality  of  the  errors:  

•  Homoscedas:city:  the  variance  is  constant.  •  Linearity.  

εi ~ N(0,σ2 )

62

Page 63: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Generalized  linear  models  

– Extension  of  the  linear  model  to  other  distribu:ons  and  non-­‐linearity  in  the  structure  (to  some  degree)  

– Y  must  follow  a  probability  distribu:on  from  the  exponen:al  family  (Bernoulli,  Binomial,  Poisson,  Gamma,  Normal,…)  

– Parameter  es:ma:on  must  be  performed  using  an  itera:ve  method  (IWLS).    

g(E(Y )) = XβLink  func:on  

63

Page 64: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Example:  Logis:c  Regression  

– We  want  to  study  the  rela:onship  between  the  presence  of  an  amplifica:on  in  the  ERBB2  gene  and  the  size  of  the  tumour  in  a  specific  type  of  breast  cancer.  

– Our  dependent  variable  Y,  takes  two  possible  values:  “AMP”,  “NORMAL”  (“YES”,  “NO”)  

– X  (size)  takes  con:nuous  values.  

64

Page 65: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Example:  Logis:c  Regression  

It is very difficult to see the relationship. Let’s model the “probability of success”: in this case, the probability of amplification

65

● ●● ●●● ●●● ●● ●●● ● ●● ●● ●●● ●●● ●●● ● ●● ●● ●● ●● ●● ●● ●● ● ●● ●● ● ●

● ●● ● ● ● ●● ●● ● ● ●● ● ● ●● ●● ● ●●● ●● ●● ●●● ● ●● ● ● ●● ●● ● ●● ● ● ●●● ● ●

Size

ERBB

2 Am

plific

atio

n

NOYE

S

0 5 10 15 20

Page 66: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Example:  Logis:c  Regression  

Some predictions are out of the possible range for a probability

66

● ● ● ● ● ●

● ●

● ●

● ● ● ● ● ●

0 5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

Size

Prob.Amplification

Page 67: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Example:  Logis:c  Regression  We can transform the probabilities to a scale that goes from –Inf to Inf using log odds

logodds = log p1− p"

#$

%

&'

67

● ● ● ● ● ●

● ●

●●

● ●●

● ● ● ● ● ●

0 5 10 15 20

−10

−8−6

−4−2

Size

log

odds

Am

plifi

catio

n

Page 68: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Example:  Logis:c  Regression  

How  does  this  relate  to  the  generalized  linear  model?  •  Y  follows  a  Bernoulli  distribu:on;  it  can  take  two  values  (YES  or  NO)  

•  The  expecta:on  of  Y,  p  is  the  probability  of  YES  (EY=p)  •  We  assume  that  there  is  a  linear  rela:onship  between  size  and  a  func:on  of  the  expected  value  of  Y:  the  log  odds  (the  link  func:on)  

logodds(prob.amplif ) = β0 +β1Sizeg(EY ) = βX

68

Page 69: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Binomial  Distribu:on  

•  It  is  the  distribu:on  of  the  number  of  events  in  a  series  of  n  independent  Bernoulli  experiments,  each  with  a  probability  of  success  p.  

•  Y  can  take  integer                        values  from  0  to  n  •  EY=np  •  VarY=  np(1-­‐p)  

69

0 1 2 3 4 5 6 7 8 9 10

Binomial distribution. n=10, p=0.3

0.00

0.05

0.10

0.15

0.20

0.25

Page 70: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Poisson  Distribu:on  

•  Let  Y  ~  B(n,p).  If  n  is  large  and  p  is  small  then  Y  can  be  approximated  by  a  Poisson  Distribu:on  (Law  of  rare  events)  

•  Y  ~  P(λ)  •  EY=λ  •  VarY=λ  

70 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Poisson Distributionλ = 2

0.00

0.05

0.10

0.15

0.20

0.25

Page 71: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Nega:ve  Binomial  Distribu:on  

•  Let  Y  ~  NB(r,p)  •  Represents  the  number  of  successes  in  a  Bernoulli  experiment  

un:l  r  failures  occur.    •  It  is  also  the  distribu:on  of  a  con:nuous  mixture  of  Poisson  

distribu:ons  where  λ  follows  a  Gamma  distribu:on.    •  It  can  be  seen  as  a  overdispersed  Poisson  distribu:on.  

p = µσ 2

r = µ 2

σ 2 −µ

Overdispersion  parameter  

Loca:on  parameter  

71

Negative Binomial distribution. r=10, p=0.3

0.00

0.01

0.02

0.03

0.04

Page 72: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Hypothesis  tes:ng  

•  Everything  starts  with  a  biological  ques:on  to  test:    –  What  genes  are  differenIally  expressed  under  one  treatment?  –  What  genes  are  more  commonly  amplified  in  a  class  of  tumours?  

–  What  promoters  are  methylated  more  frequently  in  cancer?  •  We  must  express  this  biological  ques:on  in  terms  of  a  

parameter  in  a  model.  •  We  then  conduct  an  experiment,  obtain  data  and  es:mate  

the  parameter.  •  How  do  we  take  into  account  uncertainty  in  order  to  

answer  our  ques:on  based  on  our  es:mate?  

72

Page 73: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Sampling  and  tes:ng  

10% red balls and 90% blue balls

Random sample of 10 balls from the box

Discrete observations

When do I think that I am not sampling from this box anymore?

How many reds could I expect to get just by chance alone!

#red = 3

Slide by Natalie Thorne, Nuno L. Barbosa Morais 73

Page 74: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

10% red balls and 90% blue balls

Random sample of 10 balls from the box

Discrete observations

Sample

Null hypothesis (about the population that is being sampled)

Rejection criteria (based on your observed sample,

do you have evidence to reject the hypothesis

that you sampled from the null population)

#red = 3

Test statistic

Slide by Natalie Thorne, Nuno L. Barbosa Morais 74

Page 75: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

•  Null Hypothesis: Our population follows a (known) distribution defined by a set of parameters: H0 : X ~ f(θ1,…θk)

•  Take a random sample (X1,…Xn) = (x1,…xn) and observe test statistic

T(X1,…Xn)= t(x1,…xn)

•  The distribution of T under H0 is known (g(.)) •  p-value : probability under H0 of observing a

result as extreme as t(x1,…xn)

75

Hypothesis  tes:ng  

Page 76: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

observed  z-­‐score  

High  probability  of  gezng  a  more  extreme  score  just  by  chance  

P-­‐value  is  high!  

76 Slide by Natalie Thorne, Nuno L. Barbosa Morais

Page 77: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

observed  z-­‐score  

Low  probability  of  gezng  a  more  extreme  score  just  by  chance  

P-­‐value  is  low  

Reject  null  hypothesis  

77 Slide by Natalie Thorne, Nuno L. Barbosa Morais

Page 78: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Type  I  and  Type  II  errors  

•  Type  I  error:  probability  of  rejec:ng  the  null  hypothesis  when  it  is  true.  Usually,  it  is  the  significance  level  of  the  test.  It  is  denoted  as  α  

•  Type  II  error:  probability  of  not  rejec:ng  the  null  hypothesis  when  it  is  false  It  is  denoted  as  β  

•  Decreasing  one  type  of  error  increases  the  other,  so  in  prac:ce  we  fix  the  type  I  error  and  choose  the  test  that  minimizes  type  II  error.  

78

Page 79: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

The  power  of  a  test  •  The  power  of  a  test  is  the  

probability  of  rejec:ng  the  null  hypothesis  at  a  given  significance  level  when  a  specific  alterna:ve  is  true  

•  For  a  given  significance  level  and  a  given  alterna:ve  hypothesis  in  a  given  test,  the  power  is  a  func:on  of  the  sample  size  

•  What  is  the  difference  between  sta:s:cal  significance  and  biological  significance?  

0 1000 2000 3000 4000 5000

0.2

0.4

0.6

0.8

1.0

t−test: true diff:0.1 std=1 sig.lev=0.05

Sample Size

Powe

r

With  enough  sample  size,  we  can  detect  any  alterna:ve  hypothesis  (if  the  es:mator  is  consistent,  its  standard  error  converges  to  zero  as  the  sample  size  increases)    

79

Page 80: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

The  Likelihood  Ra:o  Test  (LRT)  

•  We  are  working  with  models,  therefore  we  would  like  to  do  hypothesis  tests  on  coefficients  or  contrasts  of  those  models  

•  We  fit  two  models  M1  without  the  coefficient  to  test  and  M2  with  the  coefficient.    

•  We  compute  the  likelihoods  of  the  two  models  (L1  and  L2)  and  obtain  LRT=-­‐2log(L1  /L2)  that  has  a  known  distribu:on  under  the  null  hypothesis  that  the  two  models  are  equivalent.  This  is  also  known  as  model  selec/on.  

80

Page 81: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Mul:ple  tes:ng  problem  

•  In  High  throughput  experiments  we  are  fizng  one  model  for  each  gene/exon/sequence  of  interest,  and  therefore  performing  thousands  of  tests.  

•  Type  I  error  is  not  equal  to  the  significance  level  of  each  test.  

•  Mul:ple  test  correc:ons  try  to  fix  this  problem  (Bonferroni,  FDR,…)  

81

Page 82: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Distribu:on  of  p-­‐values  

82  

Re-cap of hypothesis testing for a single hypothesis

Crucial property of p-values

Since p is a transformation of t, we can also imagine performing theexperiment many times, but rather than plotting t each time, we plot p.What will the histogram look like?

If the null hypothesis is true, the p-values from the repeated experimentscome from a Uniform(0,1) distribution.

Histogram of p

p

Frequency

0.0 0.2 0.4 0.6 0.8 1.0

01000

2000

3000

4000

5000

Alex Lewin (Biostatistics, St Mary’s) Multiple Testing March 2011 13 / 69

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly  

Page 83: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Controlling  the  number  of  errors  

Null  Hypothesis  True   AlternaIve  Hypothesis  True  

Total  

Not  Significant  (don’t  reject)   ♯True  Nega:ve   ♯False  Nega:ve  

(Type  II  error)   N-­‐♯  Rejec:ons  

Significant  (Reject)   ♯False  posi:ve    (Type  I  error)  

♯True  posi:ve   ♯Total  Rejec:ons  

Total   n0   N-­‐n0   N  

N  =  number  of  hypothesis  tested  R  =  number  of  rejected  hypothesis  n0  =  number  of  true  hypothesis  

Page 84: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Bonferroni  Correc:on  

If  the  tests  are  independent:  P(at  least  one  false  posi:ve  |  all  null  hypothesis  are  true)  =    P(at  least  one  p-­‐value  <  α|  all  null  hypothesis  are  true)  =  1  –  (1-­‐α)m  

Usually,  we  set  a  threshold  at  α/  n.  Bonferroni  correc:on:  reject  each  hypothesis  at  α/N  level  It  is  a  very  conserva:ve  method  

Page 85: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

False  Discovery  Rate  (FDR)  N  =  number  of  hypothesis  tested  R  =  number  of  rejected  hypothesis  n0  =  number  of  true  hypothesis  

FDR  aims  to  control  the  set  of  false  posi:ves  among  the  rejected  null  hypothesis.  

Family  Wise  Error  Rate:  FWER  =  P(V≥1)  False  Discovery  Rate:  FDR  =  E(V/R  |  R>0)  P(R>0)  

Null  Hypothesis  True   AlternaIve  Hypothesis  True  

Total  

Not  Significant  (don’t  reject)   ♯True  Nega:ve   ♯False  Nega:ve  

(Type  II  error)   N-­‐♯  Rejec:ons  

Significant  (Reject)   V=♯False  posi:ve    (Type  I  error)  

♯True  posi:ve   R=♯Total  Rejec:ons  

Total   n0   N-­‐n0   N  

Page 86: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

86  

Classical approaches to multiple testing (FWER and FDR)

Benjamini & Hochberg (BH) step-up method to controlFDR

Benjamini & Hochberg proposed the idea of controlling FDR, and used astep-wise method for controlling it.

Step 1: compare largest p-value to the specified significance level ↵:If pordm > ↵ then don’t reject corresponding null hypothesis

Step 2: compare second largest p-value to a modified threshold:If pordm�1

> ↵ ⇤ (m � 1)/m then don’t reject corresponding null hypothesis

Step 3:If pordm�2

> ↵ ⇤ (m � 2)/m then don’t reject corresponding null hypothesis

. . .

Stop when a p-value is lower than the modified threshold:

All other null hypotheses (with smaller p-values) are rejected.

Alex Lewin (Biostatistics, St Mary’s) Multiple Testing March 2011 35 / 69

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly  

Page 87: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

87  

Classical approaches to multiple testing (FWER and FDR)

Adjusted p-values for BH FDR

The final threshold on p-values below which all null hypotheses arerejected is ↵j⇤/m where j⇤ is the index of the largest such p-value.

BH:compare pi to ↵j⇤/m ! compare mpi/j⇤ to ↵

Can define ’adjusted p-values’ as mpi/j⇤

But these ’adjusted p-values’ tell you the level of FDR which is beingcontrolled (as opposed to the FWER in the Bonferroni and Holm cases).

We will see versions of FDR in later sections, in Bayesian mixture modelsand in the Bayesian-inspired approaches in the last section.

Alex Lewin (Biostatistics, St Mary’s) Multiple Testing March 2011 36 / 69

Slide by Alex Lewin, Ernest Turro, Paul O’Reilly  

Page 88: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Mul:ple  power  problem  

•  We  have  another  problem  related  to  the  power  of  each  test.  Each  unit  tested  has  a  different  test  sta:s:c  that  depends  on  the  variance  of  the  distribu:on.  This  variance  is  usually  different  for  each  gene/transcript,…  

•  This  means  that  the  probability  of  detec:ng  a  given  difference  is  different  for  each  gene;  if  there  is  low  variability  in  a  gene  we  will  reject  the  null  hypothesis  under  a  smaller  difference  

•  Methods  that  shrinkage  variance  (like  the  empirical  Bayes  in  limma  for  microarrays)  deal  with  this  problem.  

88

Page 89: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Models  for  coun:ng  data  

Page 90: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Microarray  expression  data  

Adapted  from  slides  by  Benilton  Carvalho  

0 10000 30000 500000.00000

0.00010

0.00020

RG densities

Intensity

Density

4 6 8 10 12 14 16

0.00

0.05

0.10

0.15

0.20

RG densities

Intensity

Density

log yij ~ N(µ j,σ j2 )

Data  are  color  intensi:es  

90

Page 91: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Gene   Sample  1   Sample  2  

ERBB2   0   45  

MYC   14   23  

ESR1   56   2  

Sequencing  data  

Adapted  from  slides  by  Benilton  Carvalho,  Tom  Hardcastle  

•  Transcript  (or  sequence,  or  methyla:on)  i    in  sample  j  is  generated  at  a  rate  λij  

•  A  fragment  avaches  to  the  flow  cell  with  a  probability  of  pij  (small)  

•  The  number  of  observed  tags  yij  follows  a  Poisson  distribu:on  with  a  rate  that  is  propor:onal  to  λijpij.  

Number of reads

Den

sity

0 20 40 60

0.0

0.1

0.2

0.3

0.4

0.5

0.6

The  variance  in  a  Poisson  distribu:on  is  equal  to  the  mean  

91

Page 92: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Extra  variability  

Adapted  from  slides  by  Benilton  Carvalho  

Squared  coeffi

cien

t  of  varia:o

n  

92

Page 93: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Nega:ve  binomial  model  for  sequencing  data  

Adapted  from  slides  by  Benilton  Carvalho   93

Page 94: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Es:ma:ng  Overdispersion  with  edgeR  

•  edgeR  (Robinson,  McCarthy,  Chen  and  Smyth)  •  Total  CV2=Technical  CV2  +  Biological  CV2  

•  Borrows  informa:on  from  all  genes  to  es:mate  BCV.  – Common  dispersion  for  all  tags  – Empirical  Bayes  to  shrink  each  dispersion  to  the  common  dispersion.    

Variability  in  gene  abundance  between  replicates  

Decreases  with  sequencing  depth  

94

Page 95: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Es:ma:ng  Overdispersion  with  DESeq  

•  DESeq  (Anders,  Huber)  •  Var  =  sμ  +  αs2μ2  

•  es:mateDispersions()  1.  Dispersion  value  for  each  gene    2.  Fits  a  curve  through  the  es:mates  3.  Each  gene  gets  an  es:mate  between  (1)  and  (2).  

Expected  normalized  count  value  

Size  factor  for  the  sample  

Dispersion  of  the  gene  

95

Page 96: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Reproducibility    

Slide  by  Wolfgang  Huber   96

Page 97: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

A  few  number  of  genes  get  most  of  the  reads  

Slide  by  Wolfgang  Huber  97

Page 98: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Effec:ve  library  sizes  

•  Also  called  normaliza:on  (although  the  counts  are  not  changed!!!)  

•  We  must  es:mate  the  effec:ve  library  size  of  each  sample,  so  our  counts  are  comparable  between  genes  and  samples  

•  Gene  lengths?  •  This  library  sizes  are  included  in  the  model  as  an  offset  (a  

parameter  with  a  fixed  value)  

98

Page 99: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Es:ma:ng  library  size  with  edgeR  

•  edgeR  (Robinson,  McCarthy,  Chen  and  Smyth)  •  Adjust  for  sequencing  depth  and  RNA  composi:on  (total  RNA  output)  

•  Choose  a  set  of  genes  with  the  same  RNA  composi:on  between  samples  (with  the  log  fold  change  of  normalised  counts)  ager  trimming  

•  Use  the  total  reads  of  that  set  as  the  es:mate.  

99

Page 100: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

Es:ma:ng  library  size  with  DESeq  

•  DESeq  (Anders,  Huber)  •  Adjust  for  sequencing  depth  and  RNA  composi:on  (total  RNA  output)  

•  Compute  the  ra:o  between  the  log  counts  in  each  gene  and  each  sample  and  the  log  mean    for  that  gene  on  all  samples.  

•  The  median  on  all  genes  is  the  es:mated  library  size.  

100

Page 101: Statistical Models for sequencing data: from Experimental Design …bioinformatics-core-shared-training.github.io/cruk... · 2015-11-25 · 1! Statistical Models for sequencing data:

References  

•  Anders  and  Huber.  Genome  Biology,  2010;  11:R106  •  Auer  and  Doerge.  Gene:cs  2010,  185:405-­‐416  •  Harrell.  Regression  Modeling  Strategies  •  Robles  et  al.  BMC  Genomics  2012,  13:484  •  Venables  and  Ripley.  Modern  Applied  Sta:s:cs  with  S  

101