Correlation and Large-Scale Simultaneous Significance...
![Page 1: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/1.jpg)
Correlation and Large-Scale Simultaneous Significance Testing, Bradley Efron, 2007, JASA
Stat 300C: Final Presentation
Leonid Pekelis
June 03, 2011
![Page 2: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/2.jpg)
Main Points
- Correlation between test statistics can have varied effects on multiple hypothesis testing procedures, making it harder to trust FDR procedures that don't account for correlation.
- Under some assumptions, one can formalize a model that describes how correlations propagate to false discovery estimates.
- There is some evidence that this model is actually how the world works (at least for microarrays).
- It is straightforward to adjust FDR procedures to account for such correlations.
![Page 3: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/3.jpg)
Effect of Correlations
![Page 4: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/4.jpg)
Effect of Correlations
1. Breast cancer study (BC): compared gene activity between groups of patients observed to have one of two different genetic mutations known to increase breast cancer risk, “BRCA1” or “BRCA2”, Hedenfalk et al. (2001)
   - 7 BRCA1, 8 BRCA2, 15 patients total
   - N = 3225 genes measured
2. HIV study, van’t Wout et al. (2003)
   - 4 HIV positive, 4 HIV negative controls
   - N = 7680 genes per microarray
![Page 5: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/5.jpg)
Ensemble Distribution
$$z_i = \Phi^{-1}(G_0(t_i)) \sim N(0, 1), \quad i = 1, 2, \ldots, N$$

$$z_i^{\mathrm{bc}} \sim N(-0.09,\ 1.55^2), \qquad z_i^{\mathrm{HIV}} \sim N(-0.11,\ 0.75^2)$$
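As a rough illustration of the transformation $z_i = \Phi^{-1}(G_0(t_i))$, the sketch below assumes that $G_0$ is the CDF of a Student's t distribution. The slide does not say which null CDF was used, so that choice, the helper names `t_pdf`, `t_cdf`, `t_to_z`, and the Simpson integration are all illustrative assumptions, not the author's code.

```python
import math
from statistics import NormalDist


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t: float, df: int, steps: int = 2000) -> float:
    """CDF of Student's t via Simpson's rule on [0, |t|] plus symmetry.

    steps must be even for Simpson's rule.
    """
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if t > 0 else 0.5 - half


def t_to_z(t: float, df: int) -> float:
    """z = Phi^{-1}(G0(t)), with G0 assumed to be the t CDF with df d.o.f.

    For very extreme t the CDF can round to 0 or 1, in which case
    NormalDist.inv_cdf raises StatisticsError.
    """
    return NormalDist().inv_cdf(t_cdf(t, df))


if __name__ == "__main__":
    # df = 13 is what a two-sample t-test on 7 + 8 patients would give
    # (an assumption used only for this example).
    for t in (-3.0, 0.0, 2.16):
        print(t, round(t_to_z(t, 13), 3))
```

If every gene were null and independent, the resulting z-values would follow N(0, 1); the fitted centers reported for the two studies are much wider and much narrower than that, respectively.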
![Page 6: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/6.jpg)
Outline of the talk
1. Count vector model
1.1 Covariance of count vector under correlation
2. Poisson process model for counts
3. Numerical examples of model’s accuracy
4. Conditional FDR estimates
5. Numerical simulation comparing conditional to traditional FDR
6. Data example, NBA
![Page 7: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/7.jpg)
Counts Model
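$K = 82$ bins of width $\Delta = 0.1$ from $-4.1$ to $4.1$, $\mathcal{Z} = \bigcup_{k=1}^{K} \mathcal{Z}_k$

Count vector $y$, $y_k = \#\{z_i \text{ in } k\text{th bin}\}$

$$\pi_k(i) = P(z_i \in \mathcal{Z}_k), \qquad \pi_{k\cdot} = \sum_{i=1}^{N} \pi_k(i)/N \doteq \Delta\,\phi(z[k])$$

$$\gamma_{kl}(i,j) = P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l), \qquad \gamma_{kl\cdot} = \frac{\sum_{i \neq j} \gamma_{kl}(i,j)}{N(N-1)}$$

$$E(y) = N\pi, \qquad \mathrm{Cov}(y) = C_0 + C_1$$

$$C_0 = N\left(\mathrm{diag}(\pi) - \pi\pi'\right)$$

$$C_1 = N(N-1)\,\mathrm{diag}(\pi)\,\delta\,\mathrm{diag}(\pi), \qquad \delta_{kl} = \frac{\gamma_{kl\cdot}}{\pi_{k\cdot}\,\pi_{l\cdot}} - 1$$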
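The binning step can be sketched in a few lines. This is a minimal illustration, assuming that $z[k]$ denotes the midpoint of bin $k$ and that z-values outside $[-4.1, 4.1)$ are simply dropped; both choices, and the function names, are assumptions made for the example.

```python
import math
from statistics import NormalDist


def bin_centers(lo=-4.1, hi=4.1, width=0.1):
    """Midpoints z[k] of the K = (hi - lo) / width bins (assumed convention)."""
    k = round((hi - lo) / width)
    return [lo + (j + 0.5) * width for j in range(k)]


def count_vector(z, lo=-4.1, hi=4.1, width=0.1):
    """Count vector y, where y[k] = number of z-values falling in bin k.

    Values outside [lo, hi) are ignored.
    """
    k = round((hi - lo) / width)
    counts = [0] * k
    for zi in z:
        idx = math.floor((zi - lo) / width)
        if 0 <= idx < k:
            counts[idx] += 1
    return counts


def null_bin_probs(lo=-4.1, hi=4.1, width=0.1):
    """Approximate bin probabilities pi_k = Delta * phi(z[k]) under N(0, 1)."""
    phi = NormalDist().pdf
    return [width * phi(c) for c in bin_centers(lo, hi, width)]


if __name__ == "__main__":
    y = count_vector([0.05, 0.06, -4.05, 7.0])
    print(len(y), sum(y))          # 82 bins, 3 values counted (7.0 is dropped)
    print(sum(null_bin_probs()))   # close to 1
```

With these pieces, `[N * p for p in null_bin_probs()]` gives the expected count vector $N\pi$ under the theoretical null.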
![Page 8: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/8.jpg)
Counts Model
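Further assume bivariate normality, $\mathrm{Cov}(z_i, z_j) = \rho_{ij}$.

$$\gamma_{kl}(i,j) = \int_{\mathcal{Z}_k}\int_{\mathcal{Z}_l} \psi_2(z_i, z_j, \rho_{ij})\,dz \doteq \frac{\Delta^2}{2\pi\sqrt{1-\rho_{ij}^2}}\,\exp\left\{-\frac{1}{2}\,\frac{z[k]^2 - 2\rho_{ij}\,z[k]\,z[l] + z[l]^2}{1-\rho_{ij}^2}\right\}$$

$$\delta_{kl} + 1 = \frac{N}{N-1}\cdot\frac{\sum_{i \neq j} P(z_i \in \mathcal{Z}_k \cap z_j \in \mathcal{Z}_l)}{\sum_i P(z_i \in \mathcal{Z}_k)\,\sum_j P(z_j \in \mathcal{Z}_l)}$$

$$\doteq \int_{-1}^{1} \frac{1}{\sqrt{1-\rho^2}}\,\exp\left\{-\frac{\rho}{2(1-\rho^2)}\left(\rho\,z[k]^2 - 2\,z[k]\,z[l] + \rho\,z[l]^2\right)\right\} g(\rho)\,d\rho$$

$$= \int_{-1}^{1} R_{kl}(\rho)\,g(\rho)\,d\rho$$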
![Page 9: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/9.jpg)
Counts Model
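Suppose $\rho \sim (0, \alpha^2)$, $\alpha^2 = \int_{-1}^{1} \rho^2 g(\rho)\,d\rho$; then a second-order Taylor approximation of $R_{kl}(\rho)$ around $\rho = 0$ gives

$$\delta \doteq \alpha^2 qq', \qquad q_k = (z[k]^2 - 1)/\sqrt{2}.$$

Putting the previous results together (Theorem 1):

$$\mathrm{Cov}(y) \doteq N\left(\mathrm{diag}(\pi) - \pi\pi'\right) + \frac{N(N-1)}{2}\,\alpha^2\,ww'$$

$$w_k = \Delta\,w(z[k]), \qquad w(z) = \phi''(z) = \phi(z)(z^2 - 1)$$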
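A quick numerical check of the second-order approximation can make the rank-one structure more concrete. The sketch below computes the ratio of the bivariate normal density to the product of its marginals directly from the densities, averages it over a toy correlation distribution that puts mass 1/2 on each of $\rho = +\alpha$ and $\rho = -\alpha$ (so mean 0 and variance $\alpha^2$), and compares the result with $\alpha^2 q_k q_l$. The two-point distribution and all function names are assumptions chosen for illustration.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def psi2(x, y, rho):
    """Standard bivariate normal density with correlation rho."""
    det = 1 - rho * rho
    quad = (x * x - 2 * rho * x * y + y * y) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))


def density_ratio(x, y, rho):
    """Bivariate density divided by the product of the N(0, 1) marginals."""
    return psi2(x, y, rho) / (phi(x) * phi(y))


def q(x):
    """q(z) = (z^2 - 1) / sqrt(2)."""
    return (x * x - 1) / math.sqrt(2)


def delta_exact(x, y, alpha):
    """delta when rho is +alpha or -alpha with probability 1/2 each (toy g)."""
    avg = 0.5 * (density_ratio(x, y, alpha) + density_ratio(x, y, -alpha))
    return avg - 1


def delta_taylor(x, y, alpha):
    """Second-order approximation alpha^2 * q(x) * q(y)."""
    return alpha ** 2 * q(x) * q(y)


if __name__ == "__main__":
    for alpha in (0.05, 0.20, 0.50):
        print(alpha, delta_exact(2.0, 1.5, alpha), delta_taylor(2.0, 1.5, alpha))
```

The two values agree closely for small $\alpha$ and drift apart as $\alpha$ grows, which is the regime the numerical examples explore.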
![Page 10: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/10.jpg)
Poisson Model
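Suppose $y \mid u \sim \mathrm{Po}(u)$, $u \sim (v, \Gamma)$; this will need $N \sim \mathrm{Po}(N_0)$.

Simplifies $\mathrm{Cov}(y) \doteq N\,\mathrm{diag}(\pi) + \frac{N^2}{2}\,\alpha^2\,ww'$. Match with $y \sim (v, \mathrm{diag}(v) + \Gamma)$ ⇒

$$y \sim \mathrm{Po}\left(N\pi + A\,\frac{N}{\sqrt{2}}\,w\right), \qquad A \sim (0, \alpha^2)$$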
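To see what the random factor $A$ does to the Poisson intensity, the following sketch evaluates $u = N\pi + A\,(N/\sqrt{2})\,w$ on the 82-bin grid. It assumes bin midpoints for $z[k]$ and $\pi_k = \Delta\phi(z[k])$; the function names and the values of $N$ and $A$ are illustrative only.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def w(z):
    """w(z) = phi''(z) = phi(z) * (z^2 - 1)."""
    return phi(z) * (z * z - 1)


def poisson_intensity(a, n, lo=-4.1, hi=4.1, width=0.1):
    """Intensity vector u = N*pi + A*(N/sqrt(2))*w over the bins.

    pi_k = width * phi(z[k]) and w_k = width * w(z[k]), with z[k] taken
    to be the bin midpoint (an assumption of this sketch).
    """
    k = round((hi - lo) / width)
    centers = [lo + (j + 0.5) * width for j in range(k)]
    scale = a * n / math.sqrt(2)
    return [n * width * phi(c) + scale * width * w(c) for c in centers]


if __name__ == "__main__":
    n = 3000
    for a in (-0.2, 0.0, 0.2):
        u = poisson_intensity(a, n)
        # Bin 41 is centred at 0.05, bin 70 at 2.95.
        print(a, round(sum(u), 1), round(u[41], 1), round(u[70], 2))
```

Because $w$ integrates to roughly zero, the total expected count stays near $N$ for any $A$; a positive $A$ moves mass from the center ($|z| < 1$) to the tails, and a negative $A$ does the opposite.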
![Page 11: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/11.jpg)
Numerical Examples, α = 0.05
![Page 12: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/12.jpg)
Numerical Examples, α = 0.10
![Page 13: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/13.jpg)
Numerical Examples, α = 0.15
![Page 14: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/14.jpg)
Numerical Examples, α = 0.20
![Page 15: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/15.jpg)
Numerical Examples, α = 0.25
![Page 16: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/16.jpg)
Numerical Examples, α = 0.30
![Page 17: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/17.jpg)
Numerical Examples, α = 0.35
![Page 18: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/18.jpg)
Numerical Examples, α = 0.40
![Page 19: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/19.jpg)
Numerical Examples, α = 0.45
![Page 20: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/20.jpg)
Numerical Examples, α = 0.50
![Page 21: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/21.jpg)
Numerical Examples
![Page 22: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/22.jpg)
Numerical Examples
| α | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 |
|---|---|---|---|---|---|---|---|---|---|---|
| C1 | 0.9958 | 0.9925 | 0.9828 | 0.9657 | 0.9291 | 0.8679 | 0.8085 | 0.7758 | 0.7748 | 0.8081 |
| Cnorm | 0.1007 | 0.2776 | 0.4582 | 0.5962 | 0.6794 | 0.7059 | 0.6996 | 0.6938 | 0.7043 | 0.7390 |
| Cpois | 0.1074 | 0.2790 | 0.4563 | 0.5931 | 0.6765 | 0.7036 | 0.6978 | 0.6923 | 0.7028 | 0.7374 |

Table: Proportion of total variance explained by first eigenvector, as a function of α.
![Page 23: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/23.jpg)
Conditional FDR
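Given $A$, can approximate

$$u = N\pi + A\,\frac{N}{\sqrt{2}}\,w, \qquad u_k \doteq N\Delta\,f_A(z[k]), \qquad f_A(z) = \phi(z)\left(1 + A\,q(z)\right).$$

Matching moments, can approximate $u_k \doteq N\,\frac{1}{\sigma_A}\,\psi(x/\sigma_A)$, with $\sigma_A^2 = 1 + \sqrt{2}\,A$.

I took the second term in the Edgeworth expansion,

$$f_A(x) \doteq \frac{1}{\sigma_A}\,\psi(x/\sigma_A)\left(1 + \frac{\mu_4 - 3\sigma^4}{24\,\sigma^4}\,H_4(x)\right).$$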
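The moment-matching step can be checked numerically. The sketch below integrates $f_A(z) = \phi(z)(1 + A\,q(z))$ with $q(z) = (z^2 - 1)/\sqrt{2}$ on a fine grid and compares its second moment with $1 + \sqrt{2}A$. The midpoint rule, the integration range, and the function names are choices made for this example.

```python
import math


def f_a(z, a):
    """f_A(z) = phi(z) * (1 + A * q(z)), with q(z) = (z^2 - 1) / sqrt(2)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return phi * (1 + a * (z * z - 1) / math.sqrt(2))


def mass_and_second_moment(a, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule estimates of the integrals of f_A and z^2 * f_A."""
    h = (hi - lo) / steps
    mass = 0.0
    second = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        d = f_a(z, a) * h
        mass += d
        second += z * z * d
    return mass, second


if __name__ == "__main__":
    for a in (-0.2, 0.0, 0.2):
        mass, second = mass_and_second_moment(a)
        print(a, round(mass, 6), round(second, 6), round(1 + math.sqrt(2) * a, 6))
```

Since $f_A$ is symmetric about zero, its mean is 0 and the second moment is its variance, so a normal density with variance $1 + \sqrt{2}A$ matches the first two moments.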
![Page 24: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/24.jpg)
Conditional FDR
![Page 25: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/25.jpg)
Conditional FDR
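Use a GLM to fit $y_k \sim \mathrm{Po}\left(e^{\beta_0 + \beta_1 z[k] + \beta_2 z[k]^2}\right)$ for $k \in K_0$.

Using the normal approximation for $u_k$, with $p_0$ the proportion of nulls, gives $E(y_k) = p_0 u_k$, hence

$$\hat\sigma_A = (-2\hat\beta_2)^{-1/2}$$

Estimate $p_0$ by $\hat p_0 = \hat P_0 / P_0(\hat\sigma_A)$, where $P_0(\sigma) = 2\Phi(x_0; \sigma) - 1$ and $\hat P_0 = Y_0/N$.

$$\mathrm{Fdr}(x \mid \hat\sigma_A) = N\,\hat p_0\,\bar\Phi(x; \hat\sigma_A)\,/\,T(x)$$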
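The last two formulas translate fairly directly into code. In this sketch, $\bar\Phi(x;\sigma)$ is read as the upper tail of $N(0, \sigma^2)$ and $T(x)$ as the number of z-values at or above $x$; the slide does not define either symbol, so both readings are assumptions, as are the function names. The GLM fit itself is not shown; `sigma_from_beta2` only converts an already-fitted quadratic coefficient.

```python
from statistics import NormalDist


def sigma_from_beta2(beta2):
    """sigma_hat = (-2 * beta2)^(-1/2); requires a negative quadratic term."""
    if beta2 >= 0:
        raise ValueError("beta2 must be negative for a normal-shaped fit")
    return (-2.0 * beta2) ** -0.5


def conditional_fdr(z, x, sigma, p0):
    """Fdr(x | sigma) = N * p0 * P(N(0, sigma^2) >= x) / T(x).

    T(x) is taken to be #{z_i >= x} (an assumption of this sketch).
    """
    n = len(z)
    tail_count = sum(1 for zi in z if zi >= x)
    if tail_count == 0:
        raise ValueError("no z-values at or above x")
    null_tail = 1.0 - NormalDist(0.0, sigma).cdf(x)
    return n * p0 * null_tail / tail_count


if __name__ == "__main__":
    z = [0.0] * 8 + [3.0, 4.0]           # toy data, not from the talk
    print(sigma_from_beta2(-0.125))      # a fitted beta2 of -1/8 gives sigma = 2
    print(conditional_fdr(z, 2.5, 1.0, 1.0))
    print(conditional_fdr(z, 2.5, 2.0, 1.0))
```

Widening the null from $\sigma = 1$ to $\sigma = 2$ raises the estimated Fdr at the same cutoff, which is the direction of the adjustment applied in the data example.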
![Page 26: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/26.jpg)
Simulation
![Page 27: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/27.jpg)
Data Example: NBA
1. Which professional basketball players can really be called exceptional?
2. Data from http://www.databasebasketball.com/
3. 1946-2009, stats on every player, each year, ≈ 22,000 entries
4. Will focus on ppm = (points scored in season) / (minutes played in season)
5. Idea: get a z-value for each player, apply the BH procedure to determine non-null players
6. Can hypothesise there is some correlation between players' ppm scores.
7. Cleaned data (year > 1950, minutes ≥ 10)
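The BH step in item 5 can be sketched as follows. This is a generic Benjamini-Hochberg step-up implementation, assuming two-sided p-values computed from the z-values under a $N(0, \sigma^2)$ null; the talk does not state whether one- or two-sided p-values were used, and the function names are illustrative.

```python
from statistics import NormalDist


def two_sided_p(z, sigma=1.0):
    """Two-sided p-value of z under a N(0, sigma^2) null (assumed convention)."""
    return 2.0 * (1.0 - NormalDist(0.0, sigma).cdf(abs(z)))


def benjamini_hochberg(p_values, q=0.10):
    """Return the sorted indices rejected by the BH(q) step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its BH threshold.
        if p_values[i] <= q * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    z_values = [0.3, -0.8, 4.2, 1.1, -3.9, 0.2]   # toy z-values
    p_theoretical = [two_sided_p(z) for z in z_values]
    p_wide_null = [two_sided_p(z, sigma=2.0) for z in z_values]
    print(benjamini_hochberg(p_theoretical, q=0.10))
    print(benjamini_hochberg(p_wide_null, q=0.10))
```

Passing a wider null standard deviation to `two_sided_p` inflates every p-value, so the same BH level yields fewer rejections.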
![Page 28: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/28.jpg)
Data Example: NBA
![Page 29: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/29.jpg)
Data Example: NBA
![Page 30: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/30.jpg)
Data Example: NBA
![Page 31: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/31.jpg)
Data Example: NBA
![Page 32: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/32.jpg)
Data Example: NBA
- Detrend: year effect, shot clock (1954), 3-pointer (1979), center
- Aggregate years by players, keep only careers ≥ 5 years
- Gives N = 1535 players
- Calculate $t_k = \dfrac{\sum_{i=1}^{c_k} \mathrm{ppm}_i}{c_k\,SE}$, where $c_k$ is career length
- Convert to z-values, $z_k = \Phi^{-1}(T_{c_k - 1}(t_k))$
![Page 33: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/33.jpg)
Data Example: NBA
Max = 6.74 (Kareem Abdul-Jabbar, ’69-’89), Min = −6.43 (E.C. Coleman, ’94-’00)

Wilt Chamberlain (’59-’72) = 3.31, Michael Jordan (’84-’02) = 6.49
![Page 34: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/34.jpg)
Data Example: NBA
- Naive BH(.10) procedure gives 891 rejections
- Estimated correlation from central spread via Poisson GLM, $z_{\mathrm{null}} \sim N(0, 2^2)$
- Trying BH(.10) with the correlated null gives 1 rejection
- Third approach: estimate $\hat p_0 = \hat P_0 / P_0(1.92) \approx 0.588$, $\hat P_0 = Y_0(1)/N$
- Conditional Fdr estimates: Fdr(naive | 2) = .347, Fdr(cor | 2) = 0.673
- Both > .10!
- $x^* = \arg\max\{x : \mathrm{Fdr}(x \mid 2) \le 0.10\}$, gives 36 rejections
- Actually used $x^* = \arg\min \mathrm{Fdr}(x \mid 2)$, since $\min \mathrm{Fdr}(x \mid 2) = .121 > .10$.
![Page 35: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/35.jpg)
Data Example: NBA
Theoretical null distribution $N(0, 1)$, correlated null distribution $N(0, 2^2)$
![Page 36: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/36.jpg)
Data Example: NBA (Best Players)
[1] "Kareem , Abdul-jabbar" "Tim , Duncan" "Shaquille , O'neal"
[4] "Michael , Jordan" "Karl , Malone" "Julius , Erving"
[7] "Walter , Davis" "Glenn , Robinson" "Jerry , West"
[10] "Dominique , Wilkins" "Tim , Thomas" "Calvin , Murphy"
[13] "Bob , Pettit" "Eddie , Johnson" "Sam , Cassell"
[16] "James , Worthy" "George , Gervin" "John , Drew"
[19] "Allen , Iverson" "Dan , Issel"
![Page 37: Correlation and Large-Scale Simultaneous Significance ...statweb.stanford.edu/~lpekelis/talks/11_5_300c_efron07.pdf · Stat 300C: Final Presentation Leonid Pekelis June 03, 2011.](https://reader034.fdocuments.in/reader034/viewer/2022050217/5f633b22e0d94c744d362696/html5/thumbnails/37.jpg)
Data Example: NBA (Worst Players)
[1] "Charles , Jones" "Tree , Rollins" "Ben , Wallace"
[4] "Nate , Mcmillan" "Greg , Kite" "Manute , Bol"
[7] "Harvey , Catchings" "Paul , Mokeski" "Don , Buse"
[10] "Adonal , Foyle" "Kurt , Rambis" "Bo , Outlaw"
[13] "Matt , Guokas" "Bruce , Bowen" "George , Johnson"
[16] "Chris , Dudley"