1-3-2004
Identifying Differentially Expressed Genes
• Assume we have two experimental conditions (j = 1, 2)
• We measure expression of all genes n times under both experimental conditions (n two-channel microarrays)
• For a specific gene (focusing on a single gene), x_ij = ith measurement under condition j
• Statistical models for expression measurements under the two conditions:

  x_i1 ~ N(μ1, σ1²),   x_i2 ~ N(μ2, σ2²)

• μ1, μ2, σ1², σ2² are unknown model parameters: μ_j represents the average expression measurement in a large number of replicated experiments, and σ_j² represents the variability of the measurements
• The question of whether the gene is differentially expressed corresponds to assessing whether μ1 ≠ μ2
• The strength of evidence in the observed data that this is the case is expressed in terms of a p-value
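The deck's worked examples below use R; as a language-neutral sketch of the model above (with made-up parameter values, not taken from the slides), one can simulate n replicate measurements per condition and check that the sample means track the true μ_j:

```python
import random
import statistics

random.seed(0)  # for reproducibility

# Hypothetical model parameters (illustrative only, not from the slides)
mu1, mu2, sigma = 6.0, 7.0, 0.5
n = 10000  # large n so the sample means land close to mu1, mu2

# x_ij ~ N(mu_j, sigma^2): n replicate measurements per condition
x1 = [random.gauss(mu1, sigma) for _ in range(n)]
x2 = [random.gauss(mu2, sigma) for _ in range(n)]

print(round(statistics.mean(x1), 2))  # close to mu1
print(round(statistics.mean(x2), 2))  # close to mu2
```

With small n, as on a real array experiment, the sample means scatter around μ_j, which is exactly why a formal test is needed.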
P-value
• Estimate the model parameters based on the data:

  x̄_j = (1/n) Σᵢ x_ij,   s_j² = Σᵢ (x_ij − x̄_j)² / (n − 1)

• Calculate the t-statistic, which summarizes the information about our hypothesis of interest (μ1 ≠ μ2), using the pooled variance estimate:

  ŝ² = [(n − 1)s1² + (n − 1)s2²] / (2n − 2),   t* = (x̄1 − x̄2) / (ŝ √(2/n))

• Establish the null-distribution of the t-statistic (the distribution assuming the "null-hypothesis" that μ1 = μ2)
• The "null-distribution" in this case turns out to be the t-distribution with 2n − 2 degrees of freedom (n − 1 from each condition)
• The p-value is the probability of observing, under the "null-distribution", a value as extreme as or more extreme than the one calculated from the data (t*)
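A minimal sketch of these formulas (pure Python, mirroring the pooled-variance computation the R session performs by hand later in the deck; written for possibly unequal group sizes, which reduces to ŝ√(2/n) when n1 = n2 = n):

```python
import math

def pooled_t(x, y):
    """Equal-variance two-sample t-statistic and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    # Unbiased sample variances s_j^2
    v1 = sum((xi - m1) ** 2 for xi in x) / (n1 - 1)
    v2 = sum((yi - m2) ** 2 for yi in y) / (n2 - 1)
    # Pooled variance estimate
    s2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Tiny hand-checkable example: means 2 and 3, both variances 1
t_star, df = pooled_t([1, 2, 3], [2, 3, 4])
print(round(t_star, 4), df)  # -1.2247 4
```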
t-distribution
• The number of experimental replicates affects the precision at two levels:
  1. Everything else being equal, an increase in sample size increases t*
  2. Everything else being equal, an increase in sample size "shrinks" the "null-distribution"
• Suppose that t* = 3. What is the difference in p-values depending on the sample size alone?
[Figure: t-distribution densities for df = 1, 2, 10, 100; for t* = 3 the corresponding two-tailed p-values are 0.2, 0.1, 0.01, and 0.003]
t-distribution
#Plotting t-distributions with different degrees of freedom
x<-seq(-5,5,.1)
plot(x,dt(x,100),type="l",col="black",lwd=2,ylab="")
lines(x,dt(x,10),col="green",lwd=2)
lines(x,dt(x,2),col="blue",lwd=2)
lines(x,dt(x,1),col="red",lwd=2)
legend(2, y=0.4, c("df = 1","df = 2","df = 10","df = 100"),
col = c("red","blue","green","black"),
lty=rep("solid",4), lwd=2)
#Calculating two-tailed p-values
> 2*pt(3,100,lower.tail=FALSE)
[1] 0.003407915
> 2*pt(3,10,lower.tail=FALSE)
[1] 0.01334366
> 2*pt(3,2,lower.tail=FALSE)
[1] 0.09546597
> 2*pt(3,1,lower.tail=FALSE)
[1] 0.2048328
Performing t-test
> loadURL("http://eh3.uc.edu/SimpleData.RData")
> SimpleData[1,]
Name ID W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6
1 no name Rn30000100 85 57 91 71 67 111 72 86 88 108 124 171
> LSimpleData<-SimpleData
> LSimpleData[,3:14]<-log(SimpleData[,3:14],base=2)
> LSimpleData[1,]
Name ID W1 C1 W2 C2 W3 C3 C4 W4 C5 W5 C6 W6
1 no name Rn30000100 6.409391 5.83289 6.507795 6.149747 6.066089 6.794416 6.169925 6.426265 6.459432 6.754888 6.954196 7.417853
> grep("W",dimnames(SimpleData)[[2]])
[1] 3 5 7 10 12 14
> grep("C",dimnames(SimpleData)[[2]])
[1] 4 6 8 9 11 13
> W<-grep("W",dimnames(SimpleData)[[2]])
> C<-grep("C",dimnames(SimpleData)[[2]])
> t.test(LSimpleData[1,W],LSimpleData[1,C],var.equal=TRUE)
Two Sample t-test
data: LSimpleData[1, W] and LSimpleData[1, C]
t = 0.7974, df = 10, p-value = 0.4437
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3653337 0.7725582
sample estimates:
mean of x mean of y
6.597047 6.393434
Performing t-test
> MW<-mean(t(LSimpleData[1,W]))
> MW
[1] 6.597047
> MC<-mean(t(LSimpleData[1,C]))
> MC
[1] 6.393434
> VW<-var(t(LSimpleData[1,W]))
> VW
          1
1 0.2105798
> VC<-var(t(LSimpleData[1,C]))
> VC
          1
1 0.1806291
> NW<-sum(!is.na(LSimpleData[1,W]))
> NW
[1] 6
> NC<-sum(!is.na(LSimpleData[1,C]))
> NC
[1] 6
> VWC<-(((NW-1)*VW)+((NC-1)*VC))/(NC+NW-2)
> VWC
          1
1 0.1956045
> DF<-NW+NC-2
> DF
[1] 10
> TStat<-abs(MW-MC)/((VWC*((1/NW)+(1/NC)))^0.5)
> TStat
          1
1 0.7973981
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> TPvalue
          1
1 0.4437415
> t.test(LSimpleData[1,W],LSimpleData[1,C],var.equal=TRUE)
Two Sample t-test
data: LSimpleData[1, W] and LSimpleData[1, C]
t = 0.7974, df = 10, p-value = 0.4437
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3653337 0.7725582
sample estimates:
mean of x mean of y
6.597047 6.393434
source("http://eh3.uc.edu/RSimpleTTest.R")
source("http://eh3.uc.edu/MySimpleTTest.R")
Statistical Inference and Statistical Significance – P-value
• Statistical inference consists of drawing conclusions about the measured phenomenon (e.g. gene expression) in terms of probabilistic statements based on observed data. The p-value is one way of doing this.
• The p-value is NOT the probability of the null hypothesis being true.
• A rigorous interpretation of the p-value is tricky.
• It was introduced to measure the level of evidence against the "null-hypothesis", or better to say, in favor of a "positive experimental finding".
• In this context a p-value of 0.0001 could be interpreted as stronger evidence than a p-value of 0.01.
• Establishing statistical significance (is a difference in expression level statistically significant or not) requires that we establish "cut-off" points for our "measure of significance" (the p-value).
• For various historical reasons the cut-off 0.05 is generally used to establish "statistical significance".
• It is a rather arbitrary cut-off, but it is taken as a gold standard.
• Originally the p-value was introduced as a descriptive measure to be used in conjunction with other criteria to judge the strength of evidence one way or another.
Statistical Inference and Statistical Significance – Hypothesis Testing
• The 5% cut-off point comes from the hypothesis-testing world.
• In this world the exact magnitude of the p-value does not matter; it only matters whether it is smaller than the pre-specified statistical significance cut-off (α).
• The null hypothesis is rejected in favor of the alternative hypothesis at a significance level of α = 0.05 if p-value < 0.05.
• A Type I error is committed when the null-hypothesis is falsely rejected.
• A Type II error is committed when the null-hypothesis is not rejected but it is false.
• By following this "decision making scheme" you will on average falsely reject 5% of true null-hypotheses.
• If such a "decision making scheme" is adopted to identify differentially expressed genes on a microarray, 5% of non-differentially expressed genes will be falsely implicated as differentially expressed.
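This "5% of true nulls rejected" behavior can be checked by simulation. As a stdlib-only sketch (using a two-sample z-test with known σ instead of the deck's t-test, and made-up parameter values, so the p-value can be computed with `math.erfc`):

```python
import math
import random

random.seed(1)  # reproducible sketch

def z_pvalue(x, y, sigma):
    """Two-sided p-value for a two-sample z-test with known sigma."""
    n1, n2 = len(x), len(y)
    z = (sum(x) / n1 - sum(y) / n2) / (sigma * math.sqrt(1 / n1 + 1 / n2))
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) under N(0, 1)

# 5000 "genes", none differentially expressed: both conditions ~ N(6, 1)
m, n, sigma = 5000, 6, 1.0
rejected = 0
for _ in range(m):
    x = [random.gauss(6.0, sigma) for _ in range(n)]
    y = [random.gauss(6.0, sigma) for _ in range(n)]
    if z_pvalue(x, y, sigma) < 0.05:
        rejected += 1

print(rejected / m)  # close to 0.05
```

Roughly 5% of these purely null "genes" are declared significant, as the decision scheme predicts.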
• A family-wise Type I error is committed if any of a set of null hypotheses is falsely rejected.
• Establishing statistical significance is a necessary but not sufficient step in assuring the "reproducibility" of a scientific finding – an important point that will be discussed further when we start talking about issues in experimental design.
• The other essential ingredient is a "representative sample" from the "population of interest".
• This is still a murky point in molecular biology experimentation.