
1-3-2004 1

Identifying Differentially Expressed Genes

• Assume we have two experimental conditions (j = 1, 2).

• We measure the expression of all genes n times under both experimental conditions (n two-channel microarrays).

• For a specific gene (focusing on a single gene), $x_{ij}$ is the i-th measurement under condition j.

• Statistical models for expression measurements under the two different conditions:

$$x_{i1} \sim N(\mu_1, \sigma^2), \qquad x_{i2} \sim N(\mu_2, \sigma^2)$$

• $\mu_1$, $\mu_2$, and $\sigma$ are unknown model parameters: $\mu_j$ represents the average expression measurement over a large number of replicated experiments, and $\sigma$ represents the variability of measurements.

• The question of whether the gene is differentially expressed corresponds to assessing whether $\mu_1 \neq \mu_2$.

• The strength of evidence in the observed data that this is the case is expressed in terms of a p-value.
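As a rough illustration of this model, the sketch below simulates n measurements of a single gene under each condition. The parameter values (`mu1`, `mu2`, `sigma`, `n`) are hypothetical and chosen only for illustration; they do not come from the lecture data.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)
n     <- 6     # replicates per condition
mu1   <- 6.5   # mean log2 expression under condition 1
mu2   <- 7.0   # mean log2 expression under condition 2
sigma <- 0.4   # common standard deviation of the measurements

# x[i, j] = i-th measurement under condition j
x <- cbind(rnorm(n, mean = mu1, sd = sigma),
           rnorm(n, mean = mu2, sd = sigma))
colnames(x) <- c("condition1", "condition2")

# The column means are the natural estimates of mu1 and mu2
xbar <- colMeans(x)
print(x)
print(xbar)
```

With only a handful of replicates the two sample means can differ noticeably even when the true means are equal, which is why a formal measure of evidence is needed.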


P-value

• Estimate the model parameters based on the data:

$$\hat{\mu}_j = \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad s_j^2 = \frac{\sum_{i=1}^{n}\left(x_{ij}-\bar{x}_j\right)^2}{n-1}, \qquad \hat{\sigma}^2 = s^2 = \frac{(n-1)s_1^2 + (n-1)s_2^2}{2n-2}$$

• Calculate the t-statistic, which summarizes the information about our hypothesis of interest ($\mu_1 \neq \mu_2$):

$$t^* = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{2/n}}$$

• Establish the null-distribution of the t-statistic (the distribution assuming the “null-hypothesis” that $\mu_1 = \mu_2$).

• The “null-distribution” in this case turns out to be the t-distribution with 2n-2 degrees of freedom (n-1 from each condition, because the variance estimate is pooled).

• The p-value is the probability, under the “null-distribution”, of observing a value as extreme as or more extreme than the one calculated from the data (t*).
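The steps above can be sketched as a small R function. This is a minimal sketch that assumes n replicates per condition and a common variance; `pooled_t` is a hypothetical helper name, and the data are simulated with made-up parameters. The result is compared with R's built-in `t.test(..., var.equal = TRUE)`.

```r
# pooled_t is a hypothetical helper, not part of the original lecture code
pooled_t <- function(x1, x2) {
  n <- length(x1)
  stopifnot(length(x2) == n, n >= 2)
  # Pooled variance estimate s^2 from the two sample variances
  ssq <- ((n - 1) * var(x1) + (n - 1) * var(x2)) / (2 * n - 2)
  # t* = difference in means divided by its estimated standard error
  tstar <- (mean(x1) - mean(x2)) / sqrt(ssq * 2 / n)
  df <- 2 * n - 2
  # Two-tailed p-value under the t-distribution with 2n-2 degrees of freedom
  p <- 2 * pt(abs(tstar), df, lower.tail = FALSE)
  c(t = tstar, df = df, p.value = p)
}

set.seed(2)
x1 <- rnorm(6, mean = 6.5, sd = 0.4)   # simulated condition 1
x2 <- rnorm(6, mean = 7.0, sd = 0.4)   # simulated condition 2
res <- pooled_t(x1, x2)
ref <- t.test(x1, x2, var.equal = TRUE)
print(res)
# Should print TRUE if the hand calculation agrees with t.test
print(all.equal(unname(res["p.value"]), ref$p.value))
```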


t-distribution

• The number of experimental replicates affects the precision at two levels:

1. Everything else being equal, an increase in sample size increases t*.

2. Everything else being equal, an increase in sample size “shrinks” the “null-distribution”.

• Suppose that t* = 3. How does the p-value differ depending on the sample size alone?

[Figure: t-distribution densities for df = 1, 2, 10, and 100, plotted for x from -4 to 4. Two-tailed p-values for t* = 3: 0.2 (df = 1), 0.1 (df = 2), 0.01 (df = 10), 0.003 (df = 100).]
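Both effects of the sample size can be seen together in a short calculation. The sketch below assumes the pooled two-sample setting with n replicates per condition (so 2n-2 degrees of freedom) and holds a hypothetical observed difference `d` and pooled standard deviation `s` fixed; these numbers are illustrative only.

```r
# Hypothetical values: a fixed observed mean difference and pooled SD
d <- 0.5                 # observed difference in mean log2 expression
s <- 0.4                 # pooled standard deviation
n <- c(2, 3, 6, 51)      # replicates per condition

tstar <- d / (s * sqrt(2 / n))   # effect 1: t* grows with sqrt(n)
df    <- 2 * n - 2               # effect 2: more df, lighter-tailed null distribution
p     <- 2 * pt(tstar, df, lower.tail = FALSE)

print(data.frame(n, df, tstar, p))
```

The same difference and variability yield a steadily smaller p-value as n grows, because the statistic gets larger and the reference distribution gets narrower at the same time.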


t-distribution

[Figure: t-distribution densities for df = 1, 2, 10, and 100, plotted for x from -4 to 4]

#Plotting t-distributions with different degrees of freedom
x<-seq(-5,5,.1)
plot(x,dt(x,100),type="l",col="black",lwd=2,ylab="")
lines(x,dt(x,10),col="green",lwd=2)
lines(x,dt(x,2),col="blue",lwd=2)
lines(x,dt(x,1),col="red",lwd=2)
legend(2, y=0.4, c("df = 1","df = 2","df = 10","df = 100"),
       col=c("red","blue","green","black"),
       lty=rep("solid",4), lwd=2)

#Calculating two-tailed p-values
> 2*pt(3,100,lower.tail=FALSE)
[1] 0.003407915
> 2*pt(3,10,lower.tail=FALSE)
[1] 0.01334366
> 2*pt(3,2,lower.tail=FALSE)
[1] 0.09546597
> 2*pt(3,1,lower.tail=FALSE)
[1] 0.2048328


Performing t-test

> loadURL("http://eh3.uc.edu/SimpleData.RData")

> SimpleData[1,]

Name ID W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6

1 no name Rn30000100 85 57 91 71 67 111 72 86 88 108 124 171

> LSimpleData<-SimpleData

> LSimpleData[,3:14]<-log(SimpleData[,3:14],base=2)

> LSimpleData[1,]

Name ID W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6

1 no name Rn30000100 6.409391 5.83289 6.507795 6.149747 6.066089 6.794416 6.169925 6.426265 6.459432 6.754888 6.954196 7.417853

> grep("W",dimnames(SimpleData)[[2]])

[1] 3 5 7 10 12 14

> grep("C",dimnames(SimpleData)[[2]])

[1] 4 6 8 9 11 13

> W<-grep("W",dimnames(SimpleData)[[2]])

> C<-grep("C",dimnames(SimpleData)[[2]])

> t.test(LSimpleData[1,W],LSimpleData[1,C],var.equal=TRUE)

Two Sample t-test

data: LSimpleData[1, W] and LSimpleData[1, C]

t = 0.7974, df = 10, p-value = 0.4437

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.3653337 0.7725582

sample estimates:

mean of x mean of y

6.597047 6.393434
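The same test can be repeated for every gene rather than just the first row. The sketch below is an assumption-laden illustration: it uses a simulated matrix `expr` in place of `LSimpleData` (which may not be available), with made-up column names `W1..W6` and `C1..C6` and an artificial shift added to the first 10 genes.

```r
set.seed(3)
n_genes <- 100
# Simulated log2 expression values; rows are genes, columns are arrays
expr <- matrix(rnorm(n_genes * 12, mean = 6.5, sd = 0.4), nrow = n_genes)
colnames(expr) <- c(paste0("W", 1:6), paste0("C", 1:6))
# Shift the first 10 genes in the W condition so that they are differentially expressed
expr[1:10, 1:6] <- expr[1:10, 1:6] + 1.5

# Column indices of the two conditions, found the same way as above
W <- grep("W", colnames(expr))
C <- grep("C", colnames(expr))

# One equal-variance two-sample t-test per gene (row)
pvals <- apply(expr, 1, function(g) t.test(g[W], g[C], var.equal = TRUE)$p.value)

print(head(sort(pvals)))
print(sum(pvals < 0.05))   # number of genes with p-value below 0.05
```

Using `apply` over rows keeps the per-gene logic identical to the single-gene call; for very large arrays a vectorized calculation of the means and variances would be faster.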


Performing t-test

> MW<-mean(t(LSimpleData[1,W]))
> MW
[1] 6.597047
> MC<-mean(t(LSimpleData[1,C]))
> MC
[1] 6.393434
> VW<-var(t(LSimpleData[1,W]))
> VW
          1
1 0.2105798
> VC<-var(t(LSimpleData[1,C]))
> VC
          1
1 0.1806291
> NW<-sum(!is.na(LSimpleData[1,W]))
> NW
[1] 6
> NC<-sum(!is.na(LSimpleData[1,C]))
> NC
[1] 6
> VWC<-(((NW-1)*VW)+((NC-1)*VC))/(NC+NW-2)
> VWC
          1
1 0.1956045
> DF<-NW+NC-2
> DF
[1] 10
> TStat<-abs(MW-MC)/((VWC*((1/NW)+(1/NC)))^0.5)
> TStat
          1
1 0.7973981
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> TPvalue
          1
1 0.4437415
> t.test(LSimpleData[1,W],LSimpleData[1,C],var.equal=TRUE)

Two Sample t-test

data: LSimpleData[1, W] and LSimpleData[1, C]

t = 0.7974, df = 10, p-value = 0.4437

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

-0.3653337 0.7725582

sample estimates:

mean of x mean of y

6.597047 6.393434

source("http://eh3.uc.edu/RSimpleTTest.R")
source("http://eh3.uc.edu/MySimpleTTest.R")


Statistical Inference and Statistical Significance – P-value

• Statistical inference consists of drawing conclusions about the measured phenomenon (e.g. gene expression) in terms of probabilistic statements based on observed data. The p-value is one way of doing this.

• The p-value is NOT the probability of the null hypothesis being true.

• Rigorous interpretation of the p-value is tricky.

• It was introduced to measure the level of evidence against the “null-hypothesis” or, better to say, in favor of a “positive experimental finding”.

• In this context a p-value of 0.0001 could be interpreted as stronger evidence than a p-value of 0.01.

• Establishing statistical significance (is a difference in expression level statistically significant or not) requires that we establish “cut-off” points for our “measure of significance” (p-value).

• For various historic reasons the cut-off 0.05 is generally used to establish “statistical significance”.

• It’s a rather arbitrary cut-off, but it is taken as a gold standard.

• Originally the p-value was introduced as a descriptive measure to be used in conjunction with other criteria to judge the strength of evidence one way or another.


Statistical Inference and Statistical Significance – Hypothesis Testing

• The 5% cut-off point comes from the hypothesis testing world.

• In this world the exact magnitude of the p-value does not matter. It only matters whether it is smaller than the pre-specified statistical significance cut-off (α).

• The null hypothesis is rejected in favor of the alternative hypothesis at a significance level of α = 0.05 if p-value < 0.05.

• A Type I error is committed when the null hypothesis is falsely rejected.

• A Type II error is committed when the null hypothesis is not rejected but it is false.

• By following this “decision making scheme” you will on average falsely reject 5% of true null hypotheses.

• If such a “decision making scheme” is adopted to identify differentially expressed genes on a microarray, on average 5% of non-differentially expressed genes will be falsely implicated as differentially expressed.

• A family-wise Type I error is committed if any of a set of null hypotheses is falsely rejected.

• Establishing statistical significance is a necessary but not sufficient step in assuring the “reproducibility” of a scientific finding. This important point will be further discussed when we start talking about issues in experimental design.

• The other essential ingredient is a “representative sample” from the “population of interest”.

• This is still a murky point in molecular biology experimentation.
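The 5% false rejection rate and the growth of the family-wise error can be checked with a small simulation. This is a sketch under stated assumptions: the number of genes `m`, the number of replicates `n`, and the independence of the tests are all hypothetical choices made for illustration.

```r
set.seed(4)
alpha <- 0.05
m <- 2000   # hypothetical number of non-differentially expressed genes
n <- 6      # replicates per condition

# Both conditions are drawn from the same distribution,
# so the null hypothesis is true for every simulated gene
pvals <- replicate(m, t.test(rnorm(n), rnorm(n), var.equal = TRUE)$p.value)

# Fraction of true null hypotheses falsely rejected; expected to be close to alpha
false_pos_rate <- mean(pvals < alpha)
print(false_pos_rate)

# Family-wise Type I error rate for k independent true nulls: 1 - (1 - alpha)^k
k <- c(1, 10, 100)
fwer <- 1 - (1 - alpha)^k
print(data.frame(k, fwer))
```

Even though each individual test keeps its 5% error rate, the chance of at least one false rejection approaches 1 quickly as the number of tested genes grows, which is why the family-wise error is treated separately.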