
A MIXTURE MODEL APPROACH TO EMPIRICAL BAYES TESTING AND ESTIMATION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS
AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Omkar Muralidharan
May 2011


© 2011 by Omkar Muralidharan. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License: http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/pp730hw1567


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Bradley Efron, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Robert Tibshirani

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Nancy Zhang

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Abstract

Many modern statistical problems require making similar decisions or estimates for many different entities. For example, we may ask whether each of 10,000 genes is associated with some disease, or try to measure the degree to which each is associated with the disease. As in this example, the entities can often be divided into a vast majority of “null” objects and a small minority of interesting ones.

Empirical Bayes is a useful technique for such situations, but finding the right empirical Bayes method for each problem can be difficult. Mixture models, however, provide an easy and effective way to apply empirical Bayes. This thesis motivates mixture models by analyzing a simple high-dimensional problem, and shows their practical use by applying them to detecting single nucleotide polymorphisms.


Acknowledgements

I’d like to thank a few of the many people who made this research possible. Brad Efron was the perfect advisor. Nancy Zhang was a fantastic collaborator and mentor. The other faculty in the Statistics department were a constant source of valuable advice, especially Rob Tibshirani and Trevor Hastie. My fellow students were full of entertaining and interesting conversations, especially Ryan Tibshirani, Jacob Bien, Nelson Ray, Noah Simon, Brad Klingenberg, Yi Liu and Ya Xu. Finally, I couldn’t have done this without the support of my family: my wife Aditi, my brother Shravan, and my parents Sudha and Murali.


Contents

Abstract

Acknowledgements

1 Introduction and Outline
   1.1 High-Dimensional Data and the “Too much, too little” problem
      1.1.1 The Normal Means Problem
      1.1.2 Sharing Information, Empirical Bayes and Mixture Models
   1.2 Outline of Thesis
      1.2.1 Literature Review
      1.2.2 Motivation: Marginal Densities and Regret
      1.2.3 Methodology: Mixture Models for Normal Means
      1.2.4 Application: Calling SNPs

2 Previous Work
   2.1 Introduction
   2.2 Testing
      2.2.1 False Discovery Rates
         2.2.1.1 Estimating π0
         2.2.1.2 FDR/fdr estimators
      2.2.2 Empirical Nulls
   2.3 Estimation
      2.3.1 James-Stein, Parametric Empirical Bayes and Sparsity
      2.3.2 Robbins’ Formula and Nonparametric Empirical Bayes
   2.4 This Thesis’ Place

3 Marginal Densities and Regret
   3.1 Introduction
   3.2 A Tempered NPEB Method
      3.2.1 Setup, Regret and the Proposed Estimator
      3.2.2 Regret Bounds
   3.3 Example: Simultaneous Chi-Squared Estimation
      3.3.1 Specializing Theoretical Results
      3.3.2 An Empirical Comparison
         3.3.2.1 The UMVU estimator and Berger’s estimator
         3.3.2.2 A Parametric EB estimator
         3.3.2.3 Tempered NPEB estimators
         3.3.2.4 Testing Scenarios
         3.3.2.5 Results
   3.4 Implied Densities and General Estimators
      3.4.1 Implied Densities
      3.4.2 Regret Bounds
   3.5 Summary
      3.5.1 Extensions
   3.6 Proofs
      3.6.1 Proof of Lemma 1
      3.6.2 Proof of Theorem 1
      3.6.3 Proof of Corollary 1
      3.6.4 Proof of Corollary 2
      3.6.5 Proof of Theorem 2
      3.6.6 Proof of Corollary 3

4 Mixture Models for Testing and Estimation
   4.1 Introduction and Setup
   4.2 Mixture Model
      4.2.1 A Mixture Model
      4.2.2 Fitting and Parameter Choice
      4.2.3 Identifiability Concerns
      4.2.4 Parameter Choice
      4.2.5 Example: Binomial Data
         4.2.5.1 Brown’s Analysis
         4.2.5.2 Mixture Model
         4.2.5.3 Results
   4.3 Normal Performance
      4.3.1 Effect Size Estimation
         4.3.1.1 An Asymptotic Comparison
      4.3.2 fdr estimation
   4.4 Summary

5 Finding SNPs
   5.1 Introduction
      5.1.1 Overview
   5.2 Mixture Model
      5.2.1 Model
      5.2.2 Calling, Filtering and Genotyping
      5.2.3 Fitting
         5.2.3.1 E-Step
         5.2.3.2 CM-Step
         5.2.3.3 Starting Points
   5.3 A Single Sample Nonparametric FDR estimator
   5.4 Results
      5.4.1 Yoruban SNP Calls
      5.4.2 Power: Spike-In Simulation
      5.4.3 Model-based Simulation
   5.5 Summary

Bibliography


List of Tables

3.1 Mean squared errors $\frac{1}{N}\sum(\hat{\theta} - \theta)^2$ from the simulations under the priors in the text. The methods are the UMVU estimator, Berger’s estimator, a parametric EB method based on a Gamma prior, a nonparametric EB method based on a log-spline density estimator, and two mixture methods, one with Gamma mixture groups and another with approximate point prior mixture groups. The quantities shown are averages over 100 simulations, with standard deviations given in parentheses.

3.2 Average relative regret from the simulations under the priors in the text. The average relative regret is $\left[\frac{1}{N}\sum \mathrm{MSE}(\hat{\theta})/\mathrm{MSE}(\hat{\theta}_{bayes})\right] - 1$; if $\mathrm{MSE}(\hat{\theta}_{bayes})$ is near 0, this can be different from the ratio of entries in Table 3.1, $\frac{\sum \mathrm{MSE}(\hat{\theta})}{\sum \mathrm{MSE}(\hat{\theta}_{bayes})} - 1$. The quantities shown are averages and standard deviations (in parentheses) over 100 simulations. The relative regret is infinite for all methods on prior 6, as the Bayes risk is 0.

4.1 Estimated estimation accuracy (equation 4.4) for the methods. The naive estimator is normalized to have error 1. Values for all methods except the binomial mixture model are from [Brown, 2008]. The first column gives the errors on the data as a whole (single model), and the next two give errors for pitchers and non-pitchers considered separately. Standard errors range from 0.05 to 0.2 on non-pitchers, are higher for pitchers, and are in between for the overall data [Brown, 2008].

4.2 Mean and median relative error for the methods over the simulation scenarios. The relative error is the average of the squared error $\sum(\hat{\theta}_i - \theta_i)^2$ over the 100 replications, divided by the average squared error for the Bayes estimator.

5.1 Example counts, reference base G. For the spike-in simulations later, we used A as the alternative base.


5.2 Calls on the Yoruban sample by various methods, with estimated FDPs. We used an estimated mean non-null HNRF of b = 0.5 and an estimated mean null HNRF of a = 0.1 (the 99th percentile of the mean HNRF over all positions for the Yoruban sample). The overall FDP estimate was calculated by combining FDP estimates on Bentley’s calls and new calls. The ratios of FDP estimates between methods are more reliable than the individual levels.

5.3 Noisy null position, reference base T, spiked alternative base G.

5.4 Binned true fdr and estimated fdr.


List of Figures

3.1 Simulation priors as described in the text.

3.2 Implied densities for the normal case, scaled so that $f_t(0) = 1$ and plotted on the log scale. The thresholding methods both had threshold 2.

3.3 Implied marginal densities after tempering in the normal case. The heavy curve is the implied marginal for $t = \frac{3}{4}z$, which is $f_t = N(0, 4)$. The other curves show the effect of tempering at various ρ.

4.1 Simulation results for the one-sided scenario. Each panel corresponds to one value of K (5, 50 or 500). Within each panel, µ increases from 2 to 5. The y-axis plots the squared error $\sum(\hat{\theta}_i - \theta_i)^2$, averaged over 100 replications. Errors are normalized so that the Bayes estimator for each choice of K and µ has error 1. Estimation methods are listed in the text. In the dense case, the universal soft and hard thresholding methods are hidden because their relative errors range from 4 to 40.

4.2 Simulation results for the two-sided scenario. Each panel corresponds to one value of K (5, 50 or 500). Within each panel, µ increases from 2 to 5. The y-axis plots the squared error $\sum(\hat{\theta}_i - \theta_i)^2$, averaged over 100 replications. Errors are normalized so that the Bayes estimator for each choice of K and µ has error 1. Estimation methods are listed in the text. In the dense case, the universal soft and hard thresholding methods are hidden because their relative errors range from 4 to 50.

4.3 Relative errors for various parameter choices. Each panel corresponds to one value of K (5, 50 or 500). Within each panel, µ increases from 2 to 5. The y-axis plots the squared error $\sum(\hat{\delta}_i - \delta_i)^2$, averaged over 100 replications. Errors are normalized so that the Bayes estimator for each choice of K and µ has error 1. The parameter J gives the number of groups in the mixture model, and P is a penalization parameter.


4.4 $v(z)$ for log-spline (black) and mixture model (red) estimators, for four true densities. $f_1$ is a sparse model, with 90% $\mu = 0$ and 10% $\mu \sim \mathrm{Unif}(2, 3)$. $f_2$ is a continuous model, with $\mu \sim N(0, 1)$. $f_3$ is a sparse heavy-tailed model, with 99% $\mu = 0$ and 1% $\mu \sim \mathrm{Exponential}(1)$. $f_4$ is a dense heavy-tailed model, with 90% $\mu \sim N(0, 1)$ and 10% $\mu \sim \mathrm{Exponential}(1)$.

4.5 $E(\widehat{fdr}(z))$ and $sd(\widehat{fdr}(z))$ for various values of z and the methods under consideration. “Th” means the theoretical null was used, while “Emp” means an empirical null was used. Locfdr MLE and CM use the truncated maximum likelihood and central matching empirical null estimates, respectively.

4.6 $E(\widehat{FDR}(z))$ and $sd(\widehat{FDR}(z))$ for various values of z and the methods under consideration. “Th” means the theoretical null was used, while “Emp” means an empirical null was used. Locfdr MLE and CM use the truncated maximum likelihood and central matching empirical null estimates, respectively.

4.7 Expectation and standard deviation of rejection threshold estimates $\hat{t}(q)$ for the various methods. The thresholds are fdr based.

4.8 Expectation and standard deviation of rejection threshold estimates $\hat{t}(q)$ for the various methods. The thresholds are FDR based.

5.1 Coverage for one of the samples in our example data set (309,474 total positions).

5.2 Error rates for samples 1 and 2, plotted on a log-log scale. Points with no error are not shown. Most of the variability in the histogram is actually from binomial noise, because of the low depth, as Figure 5.3 illustrates. Here, we treat all non-reference counts as errors, which is true for most positions.

5.3 Error rates for positions with high coverage (at least 10,000) in both samples 1 and 2, plotted on a log-log scale. Points with no error are not shown. The dramatically lower spread around the x = y line shows how low depth contributes most of the variability in Figure 5.2. As in that figure, we treat all non-reference counts as errors, which is true for most positions.

5.4 The simplex on the left shows four null genome positions of varying noisiness taken from the example data set, coded by different colors. Within each color, each point is for a different sample. There are two T’s (blue and green), one A (yellow), and one G (red). The simplex on the right shows a true SNP that has genotypes CC, CG, and GG among the samples.

5.5 Coverage needed for 80% power at fdr ≤ 0.1 for clean and noisy positions. The coverage needed for noisy positions increases at 28 heterozygous samples because at that stage, our algorithm begins considering the position extremely noisy.

5.6 Estimated fdr vs. true weighted fdr, plotted on the logit scale. Points plotted at y = 50 have estimated fdr numerically equal to 1.


Chapter 1

Introduction and Outline

1.1 High-Dimensional Data and the “Too much, too little” problem

Microarray experiments measure the expression levels of thousands of genes, but usually only have a few samples for each gene. DNA sequencing experiments can sequence hundreds of thousands of positions in the genome, but each position may only be read tens of times. Search engines log data for billions of queries and ad impressions, but most advertisements receive few clicks and many queries are only seen once.

Many modern statistical problems suffer from what Bradley Efron calls the “too much, too little” data problem. These problems have data sets so massive that computation is a serious problem. On the other hand, this massive amount of data is spread out over a large number of interesting objects, with little data for each object. We commonly want to find interesting objects, or estimate a parameter for each object, so it is important to use what little information we have about each object well. “Too much, too little” data sets combine modern computational challenges with a pressing need for statistical efficiency.

1.1.1 The Normal Means Problem

Perhaps the simplest such problem is the normal means problem, where we have N parameters $\theta_i$, and we observe an independent normal measurement $z_i$ of each $\theta_i$, $z_i \sim N(\theta_i, 1)$. We want to estimate the $\theta_i$ based on the $z_i$s. Although N may be very large, we only have one measurement for each $\theta_i$. This problem arises in signal processing and microarray analysis, so it is interesting in its own right. More importantly, studying the normal means problem can help us understand more complicated high-dimensional problems. The normal means problem and its exponential family generalization are simple enough to analyze theoretically, but still present “too much, too little” difficulties.

1.1.2 Sharing Information, Empirical Bayes and Mixture Models

When we have so much data, but so little data about each object, we can achieve great gains by sharing information across objects. We may not have much information about a particular object, but we may have information on a great many similar objects. For example, we may not know much about a particular advertisement, but we can use information about similar advertisements to predict the number of times it will be clicked. Pooling information is particularly effective when we know something about the structure of the data.

One way to share information is to use Empirical Bayes (EB) methods. EB methods typically assume that the parameter of interest for each object comes from some prior distribution. If we knew the prior, we would use the Bayes procedure for the prior and our problem. Instead of specifying the prior in advance, EB methods consider the prior to be unknown, and try to estimate the prior using the data (they can also directly estimate the Bayes procedure). Fitting the prior lets us combine information across different objects, and the Bayes procedure for the fitted prior combines the data on the object of interest with the prior information from other objects.

EB methods are particularly effective when we know something about the structure of the data, since this information can help us fit the prior. Parametric EB methods try to use such prior information efficiently. On the other hand, nonparametric EB methods strive for generality, and try to adapt to any possible prior.

In this thesis, we argue that mixture models are particularly well suited to the empirical Bayes approach. First, mixture models are general: they can take advantage of structural knowledge, but can still adapt to general priors. Second, they are accurate: our mixture models are more efficient than current nonparametric and parametric EB approaches. Finally, and most importantly, they are simple: mixture models are easy to devise and interpret, and their simplicity can be useful in complicated situations.

1.2 Outline of Thesis

1.2.1 Literature Review

We will begin with a brief review of current empirical Bayes testing and estimation methods for the normal means problem and its exponential family generalization. EB testing methods for this problem have become prominent recently, in the form of “false discovery rate” estimation. The false discovery rate of an object is the posterior probability that the object is uninteresting. When “uninteresting” has a fixed meaning (like $\theta_i = 0$ in the normal means problem), the null distribution of uninteresting cases is known, and a variety of methods can estimate false discovery rates quite well. When “uninteresting” has a vaguely defined meaning, the null distribution must be estimated from the data, and estimating false discovery rates is difficult.

The EB estimation literature on the normal means problem has taken two major paths. Nonparametric EB methods try to adapt to general priors, but cannot take advantage of structural information like sparsity. On the other hand, parametric EB methods can efficiently use structural information, but perform badly when the structural information turns out to be wrong. No current method can efficiently use structural information while retaining the ability to adapt to general priors.

1.2.2 Motivation: Marginal Densities and Regret

Next, we study the continuous exponential family generalization of the normal means problem in detail. Our main idea is to connect estimating the $\theta_i$s with estimating the marginal density of the $z_i$s. We will introduce a way to translate estimates of the marginal density into estimates of the $\theta_i$s; the better the estimate of the marginal density, the better the resulting estimates of the $\theta_i$s. From the other direction, we will see how any estimator of the $\theta_i$s can be viewed as coming from an implied estimate of the marginal density, and the better the estimator of the $\theta_i$s is, the better the implied marginal density.

Taken together, these results establish a loose equivalence between estimating the $\theta_i$s and estimating the marginal density - to estimate the $\theta_i$s well, it is necessary and sufficient to estimate the marginal density well. This equivalence gives us a useful diagnostic for the estimation problem, since we can assess the fit of a marginal density using the histogram of the $z_i$s. It also gives us a way to choose parameters for a $\theta_i$ estimation method, or to choose between methods - pick the method or parameters that give the best marginal density fit. Finally, this tells us what to focus on when devising methods to estimate the $\theta_i$s. A method will work well if and only if it gives us good explicit or implicit marginal density estimates.

1.2.3 Methodology: Mixture Models for Normal Means

We then turn to mixture models. We consider how to use mixture models for the exponential family estimation problem, particularly when the $\theta_i$s are known to be sparse. The mixture approach yields an excellent estimator of the $\theta_i$s, combining the strengths of nonparametric and parametric EB methods - our estimator can efficiently take advantage of sparsity and adapt to non-sparse, general priors. It also gives us a slightly better false discovery rate estimator than existing methods. The fdr estimator does about as well as other methods when using the theoretical $\theta = 0$ null, and slightly better when estimating an empirical null.

The connection between density and θ estimation helps explain the mixture model’s success. Mixture models are good density estimators. With enough mixture components, they can adapt to any situation. On the other hand, they can take advantage of structural information through penalties on mixture parameters and initialization of the fitting procedure. Mixture models’ biggest problems in classical situations - unidentifiability and unstable group fits - are not problems here, since the marginal density estimated by the mixtures is quite stable.

1.2.4 Application: Calling SNPs

Finally, we apply mixture models to a real problem. Single nucleotide polymorphisms (SNPs) are positions in the human genome that vary from person to person. Finding these positions using DNA sequencing data is an important problem. We use the mixture model approach to extend empirical null ideas from the normal means problem to the complicated and discrete SNP problem. The approach lets us easily incorporate structural information about sequencing data into our model. Despite its simplicity, our method is much more precise than existing SNP calling methods, with little to no loss of power.


Chapter 2

Previous Work

2.1 Introduction

In this chapter, we will briefly review previous work on high-dimensional estimation and testing, focusing on the normal means problem. In this problem, we observe N independent normal measurements, $z_i \sim N(\mu_i, 1)$, each with its own mean. We often know that the $\mu_i$ are sparse, that is, most $\mu_i$ are 0. The normal means problem is important in its own right, and as a way to understand more complicated high-dimensional problems.

We have two goals. First, we want to find nonzero $\mu_i$s, or interestingly large $\mu_i$s. Second, we want to estimate the $\mu_i$s, taking advantage of any sparsity.

The normal means problem comes up in signal processing, data mining and multiple testing. Our motivating example is searching for differentially expressed genes in microarray data. Suppose we have N gene expression measurements for $n_1$ treatment patients and $n_2$ controls. We want to find genes that are differentially expressed between treatment and control, and we think that relatively few genes are differentially expressed.

Let $t_i$ be the two-sample t-statistic for the $i$th gene, and $z_i = \Phi^{-1}(F_t(t_i))$, where $F_t$ is the t-cdf with $n_1 + n_2 - 2$ degrees of freedom and $\Phi$ is the normal cdf. If gene $i$ is not differentially expressed, $z_i \sim N(0, 1)$. Efron [2009a] shows that if $n_1, n_2$ are reasonably large, the distribution of $z$ is approximately shifted by differential expression. We can thus estimate the extent of differential expression by estimating $\mu_i = E(z_i)$. Although this formally matches the normal means problem, there is an important difference: genes may be highly correlated.
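To make the transformation concrete, here is a minimal sketch in Python (our own illustration, not code from the thesis; the matrix names `treat` and `control` are hypothetical):

```python
import numpy as np
from scipy import stats

def t_to_z(treat, control):
    """Convert two-sample t-statistics to z-values via z = Phi^{-1}(F_t(t))."""
    n1, n2 = treat.shape[1], control.shape[1]
    t, _ = stats.ttest_ind(treat, control, axis=1)  # pooled-variance t per gene
    df = n1 + n2 - 2
    # F_t(t): t-cdf with n1 + n2 - 2 degrees of freedom, then normal quantile
    return stats.norm.ppf(stats.t.cdf(t, df))

rng = np.random.default_rng(0)
treat = rng.normal(size=(10000, 8))    # 10,000 genes, 8 treatment arrays
control = rng.normal(size=(10000, 8))  # 8 control arrays
z = t_to_z(treat, control)             # approximately N(0,1) for null genes
```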

2.2 Testing

We first consider how to find nonzero or interestingly large $\mu_i$s. This is usually framed as a multiple testing problem: we form a p-value or z-value for each gene, and test whether it follows the appropriate null distribution. Accordingly, we discuss general multiple testing procedures.

We want to test the N null hypotheses $H_i : \mu_i = 0$. The classical approach to multiple testing controls the Family-Wise Error Rate, $FWER = P(\text{any false rejection} \mid \text{all } H_i \text{ null})$. If N is large, however, the FWER is a very stringent criterion. We can use the Bonferroni bound to guarantee $FWER \le \alpha$ by testing each $H_i$ at level $\alpha/N$. Using such a stringent threshold for significance gives us low power. Many procedures in the literature improve on Bonferroni, but not by much, and controlling the FWER generally costs too much power to be practical.

Another classical approach is to ignore multiple testing altogether and test each $H_i$ at level α. This is often referred to as controlling the per-comparison error rate. As Soric [1989] pointed out, however, this approach often leads to many false positives. Suppose m of our hypotheses are non-null, the other $N - m$ are null, and we have perfect power. Testing each hypothesis separately at level α yields $\alpha(N - m) + m$ rejections on average, of which $\alpha(N - m)$ are false rejections and m are true rejections. If $N \gg m$ and α is only moderately small (for example, 1-5%), then $\alpha(N - m) \gg m$, and our rejections are dominated by false positives. For example, with $N = 10{,}000$, $m = 100$, and $\alpha = 0.05$, we expect 495 false rejections alongside the 100 true ones.

2.2.1 False Discovery Rates

Multiple testing is usually exploratory, and any rejections will be studied further. We want our rejections to have a high chance of standing up to further study. Motivated by this, Benjamini and Hochberg [1995] proposed a new error criterion, the False Discovery Rate (FDR). The FDR is the expected proportion of false positives among the rejected hypotheses, defined to be 0 if no hypotheses are rejected:

$$FDR = E\left(\frac{\text{false rejections}}{\max(\text{total rejections}, 1)}\right).$$

Benjamini and Hochberg [1995] gave a step-up testing procedure to control the FDR. Their procedure can be formulated as follows [Benjamini and Hochberg, 2000]. First, for a given p-value threshold t, we expect Nt false rejections if all $H_i$ are true. We can thus estimate the false discovery rate by

$$\widehat{FDR}(t) = \frac{Nt}{\text{total rejections}}.$$

Next, adjust t until $\widehat{FDR}(t) \le q$, and reject all cases with p-value less than or equal to t. Benjamini and Hochberg proved that this whole testing procedure has $FDR \le q$. Benjamini and Hochberg [2001] extended the procedure to dependent p-values.
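Viewed this way, the procedure is easy to implement. The following is a minimal sketch of the estimation formulation (our own illustration):

```python
import numpy as np

def bh_reject(pvals, q):
    """Benjamini-Hochberg via FDR estimation: find the largest p-value
    threshold t with FDRhat(t) = N*t / #{p_i <= t} <= q, reject below it."""
    p = np.sort(pvals)
    N = len(p)
    fdr_hat = N * p / np.arange(1, N + 1)     # FDRhat at t = each sorted p-value
    below = np.nonzero(fdr_hat <= q)[0]
    if len(below) == 0:
        return np.zeros(N, dtype=bool)        # nothing rejected
    t = p[below.max()]                        # largest threshold with FDRhat <= q
    return pvals <= t

# Example: 9,900 null and 100 non-null p-values
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=9900),
                        rng.beta(0.1, 1, size=100)])  # non-nulls pile up near 0
print(bh_reject(pvals, q=0.1).sum(), "rejections at q = 0.1")
```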

Storey [2002] gave FDRs a Bayesian interpretation. Suppose that each hypothesis is null with probability $\pi_0$, and that if $H_i$ is non-null, z has distribution $F_1$. Let $F_0 = \Phi$ be the null distribution of z, so $F = \pi_0 F_0 + \pi_1 F_1$ is the marginal distribution of z. Consider a rejection region $Z$. Storey defined the positive false discovery rate

$$pFDR(Z) = E\left(\frac{\text{false rejections}}{\text{total rejections}} \,\middle|\, \text{total rejections} > 0\right).$$

Since typically $P(\text{total rejections} > 0) \approx 1$, $pFDR \approx FDR$. Storey showed that

$$pFDR(Z) = \frac{\pi_0 \Phi(Z)}{F(Z)}.$$

But by Bayes’ rule, the right hand side is just $P(H_i \text{ null} \mid z_i \in Z)$. The FDR can thus be interpreted as a Bayesian posterior probability. With this interpretation, the Benjamini-Hochberg procedure is an empirical Bayes hypothesis test. Storey [2002] showed that the Benjamini-Hochberg procedure is equivalent to estimating F by the empirical cdf of z and taking $\pi_0 = 1$.

Storey [2002] also proposed estimating $\pi_0$ by assuming all cases with p-value greater than some threshold λ were null. Storey et al. [2004] proved that using this estimator of $\pi_0$ still controls the FDR. Storey’s reformulation of the problem - estimating FDRs instead of controlling them - would prove particularly influential.

Efron et al. [2001] went further down the empirical Bayes path, suggesting the estimation of the posterior probability that each $H_i$ is null. Bayes’ rule gives

$$P(H_i \text{ null} \mid z_i = z) = \frac{\pi_0 f_0(z)}{f(z)},$$

where $f_0 = \varphi$ is the null density, $f_1$ the alternative, and $f = \pi_0 f_0 + \pi_1 f_1$ is the marginal density. Efron [2005] called this quantity the local false discovery rate, denoted $fdr(z)$. By estimating f and bounding or estimating $\pi_0$, we can estimate $fdr(z)$. The local false discovery rate is exactly what its name implies: if our rejection region $Z$ shrinks to an infinitesimal interval around z, $FDR(Z) \to fdr(z)$.
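For intuition, here is a hedged sketch of a two-group local fdr estimate, using a Gaussian kernel density estimate for f and the conservative bound $\pi_0 = 1$ (our own simplifications; Efron [2005] fits f with a Poisson glm instead):

```python
import numpy as np
from scipy import stats

def local_fdr(z, pi0=1.0):
    """Estimate fdr(z) = pi0 * phi(z) / f(z), with f a kernel density
    estimate of the marginal. Bounding pi0 at 1 is conservative."""
    f = stats.gaussian_kde(z)            # estimate of the marginal density f
    fdr = pi0 * stats.norm.pdf(z) / f(z)
    return np.minimum(fdr, 1.0)          # fdr is a probability

rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(size=9500),            # null: N(0,1)
                    rng.normal(3.0, 1.0, size=500)])  # non-null: shifted
fdr = local_fdr(z)
print("mean fdr for |z| > 3:", fdr[np.abs(z) > 3].mean().round(3))
```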

In any given rejection region, some hypotheses are more likely to be null than others. $FDR(Z)$ averages over these different posterior probabilities. The local false discovery rate gives the posterior probability that each hypothesis is null, which can be more informative. On the other hand, the local false discovery rate is usually harder to estimate.

When N is large, we can estimate F and f quite accurately. Since $\pi_0$ does not affect the FDR/fdr too much, we can usually bound $\pi_0$ at 1. Nevertheless, the literature is full of methods to estimate $\pi_0$, F and f, which we will very briefly review.

2.2.1.1 Estimating π0

Roughly speaking, estimators of $\pi_0$ fall into three types. The first type looks at extreme values of densities or cdfs. Genovese and Wasserman [2004] look at the discrepancy between F and the null $F_0$; the smaller $\pi_0$ is, the larger the maximum discrepancy. They use this idea to give one-sided confidence limits for $\pi_0$. Meinshausen and Rice [2006] make this idea more precise and investigate it theoretically. Swanepoel [1999] gives a method that is similar in spirit: on the p-value scale, if $\inf f_1 = 0$, then $\inf f = \pi_0$. Cai et al. [2007] investigate density based estimates for the one-sided normal means problem.

The second type is based on characteristic functions of normal mixtures [Jin and Cai, 2007, Jin, 2008, Jin and Cai, 2009]. Suppose we have $f_0 = N(0, 1)$ and $f_1 = N(\mu, 1)$. The characteristic function of f is

$$f^*(t) = \exp\left(-\tfrac{1}{2}t^2\right)\left(1 - \pi_1\left(1 - \exp(it\mu)\right)\right).$$

Characteristic function approaches estimate $f^*$ by the empirical characteristic function and work backward to estimate $\pi_0$.

The third type looks at the marginal cdf near 1 (or 0, for z-values). The methods proposed by Storey [2002] and Efron [2005] fall into this category - they assume all cases with large enough p-value (or small enough z-value) are null. Benjamini and Hochberg [2000] and Nettleton et al. [2006] use the same idea, but slightly different estimators.

2.2.1.2 FDR/fdr estimators

FDR and fdr estimators also roughly fall into three types, depending on the way they estimate F and f. The first group is based on the empirical cdf of the data, like Benjamini and Hochberg [1995] and Storey [2002]. Genovese and Wasserman [2004] analyze the Benjamini-Hochberg estimator further, and use the least concave majorant of the empirical cdf, under the assumption that the true mixture cdf is concave. Pounds and Cheng [2004] modify the Benjamini-Hochberg estimator to accommodate p-values from one-sided and discrete tests. Strimmer [2008] uses the Grenander density estimator.

The second type uses Beta mixtures. Allison et al. [2002] model the p-value density as a mixture of betas, with the first beta component corresponding to the null; Pounds and Morris [2003] do something similar. Tsai et al. [2003] model the p-value differences as coming from a Beta distribution.

The third type uses z-value models. Efron [2005] uses a Poisson glm to estimate f on the z-value scale. Lonnstedt and Speed [2002] use a two group normal prior, with the first group an atom at 0. Newton et al. [2004] use a gamma mixture and hyperpriors for the gamma parameters. Pan et al. [2003] use normal mixtures, with a fixed $N(0, 1)$ null component. Pounds and Cheng [2004] transform p-values to z-values and then use LOESS to smooth them and estimate the overall density. Sun and Cai [2007] show that using z-values is more efficient than using the usual two-sided p-values.

2.2.2 Empirical Nulls

As we noted before, when N is large, it is easy to estimate the FDR/fdr. The situation is very different when we are no longer sure of the null distribution $f_0$.

Until now, we have assumed two things. First, we assumed that null test statistics are $N(0, 1)$. Second, we assumed that most $z_i$ are null. Together, these assumptions imply that the marginal distribution of z is approximately $N(0, 1)$, particularly near its center.


Efron [2008] pointed out that this is often not true for microarray datasets. He gave four reasons why this might happen. First, theoretical assumptions (like normality) may fail, changing the null distribution of the test statistics.

Second, a chance imbalance of covariates unrelated to the question of interest can cause overdispersion. If we observed these covariates, we could subtract out their effects, but we cannot subtract out the effects of unobserved covariates. In the microarray situation, unobserved covariates can lead to each gene having an uninteresting, relatively small differential expression between treatment and control due to an imbalanced design. Efron [2004] shows the resulting overdispersion cannot be corrected with permutation tests.

Third, correlation between arrays can lead to highly over- or underdispersed t-statistics [Allen and Tibshirani, 2010]. Although most statistical methods assume independent arrays, lab and preparation effects make this assumption questionable [Cosgrove et al., 2010, Piper et al., 2002]. In fact, Efron [2009b] shows that many common microarray data sets show signs of array correlation. Permutation tests assume independent arrays, and cannot account for array correlation.

Finally, Efron [2008] showed that correlation between the $z_i$ can lead to random over- or underdispersion. Correlation does not change the marginal distribution of z, but it does make the realized marginal more variable. In particular, the central width of the z histogram is more variable. If the z’s happen to be overdispersed and we use the theoretical null, our FDR/fdrs will be correct on average but quite misleading for the realized data.

When the z histogram is far from $N(0, 1)$, we must either change our null or abandon our assumption that most $z_i$ are null. Efron [2004] suggested changing the null to match the center of the data. This changes the character of our hypothesis test. Instead of testing whether each z has a fixed mean 0, we now seek “interestingly” large z’s. A $z_i$ is interesting if it is much further from zero than we would expect, given the rest of the data.

Efron [2008] gave a simple way to change the null - assume the null is still normal, but with unknown mean and variance, and fit it to the center of the data. Assuming normality is reasonable since both gene and array correlation tend to produce normal nulls (though over- or underdispersed). In any case, we must assume something to take information from the center of the z-value histogram and translate it into information about the tails.

There are many ways to fit a normal null. Efron [2008] gives two: a maximum truncated likelihood method that assumes all $|z| \le \lambda$ are null, and a “geometric” method that fits a quadratic to $\log f$ at zero. Strimmer [2008] uses Efron’s likelihood approach, choosing λ to maximize the proportion of null cases with $|z| \le \lambda$. Schwartzman [2008] extends Efron’s approach to exponential families beyond the normal. McLachlan et al. [2006] take a different approach, fitting a normal mixture model. Jin and Cai [2007] use a characteristic function approach, which they pursue further in [Jin and Cai, 2009].
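As a rough illustration of the truncated-likelihood idea, here is a minimal sketch under our own simplifications (it ignores the $\pi_0$ estimate and is not the locfdr implementation):

```python
import numpy as np
from scipy import stats, optimize

def empirical_null_mle(z, lam=2.0):
    """Fit a N(delta, sigma^2) null to the z-values in [-lam, lam], treating
    them as draws from a normal truncated to that window (assumes the center
    of the z histogram is mostly null)."""
    z0 = z[np.abs(z) <= lam]

    def neg_loglik(params):
        delta, log_sigma = params
        sigma = np.exp(log_sigma)
        # truncated normal log-likelihood: density over mass of [-lam, lam]
        mass = (stats.norm.cdf(lam, delta, sigma)
                - stats.norm.cdf(-lam, delta, sigma))
        return -(stats.norm.logpdf(z0, delta, sigma) - np.log(mass)).sum()

    res = optimize.minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])  # fitted null mean and sd

rng = np.random.default_rng(3)
z = np.concatenate([rng.normal(0.1, 1.3, size=9500),   # overdispersed nulls
                    rng.normal(4.0, 1.0, size=500)])
print(empirical_null_mle(z))  # roughly (0.1, 1.3)
```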

None of these methods is without problems. Efron [2008] noted that although the empirical null FDR/fdr estimates are less misleading than the theoretical null estimates, they are much more variable, especially in the far tail. Estimating the null is quite difficult in general, and the best method may change from dataset to dataset.

2.3 Estimation

We now turn to estimation. It would be impossible to review all the research into high-dimensional estimation methods. We will ignore most theoretical and incremental work (Zhang [2003] has a more extensive review) and focus on methodology and ideas relevant to this thesis. Such research seems to have followed two approaches. The first approach has its roots in the James-Stein estimator, while the second has its roots in the approach of Robbins [1954].

2.3.1 James-Stein, Parametric Empirical Bayes and Sparsity

Stein [1956] proved that if $N \ge 3$, the MLE $\hat{\mu} = z$ is inadmissible for squared error loss in the normal means problem; this led to the celebrated estimator of James and Stein [1961]. The James-Stein estimator

$$\hat{\mu}_i = \left(1 - \frac{N - 2}{\sum z_j^2}\right) z_i$$

has lower mean squared error than the MLE for all $\mu_i$. Stein’s inadmissibility result can be extended to many other high-dimensional problems [Bock, 1975, Brown, 1966, Berger, 1980].
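A minimal numerical check of this dominance (our own illustration, not code from the thesis):

```python
import numpy as np

def james_stein(z):
    """James-Stein estimator: shrink z toward 0 by a data-determined factor."""
    n = len(z)
    return (1 - (n - 2) / np.sum(z**2)) * z

rng = np.random.default_rng(4)
mu = rng.normal(0.0, 1.0, size=1000)           # true means
mse_mle, mse_js = 0.0, 0.0
for _ in range(200):                           # 200 simulated data sets
    z = mu + rng.normal(size=1000)
    mse_mle += np.mean((z - mu)**2)
    mse_js += np.mean((james_stein(z) - mu)**2)
print(f"MLE MSE: {mse_mle/200:.3f}, James-Stein MSE: {mse_js/200:.3f}")
# With mu ~ N(0,1), the James-Stein MSE is close to the Bayes value 0.5.
```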

Efron and Morris [1973] motivated the James-Stein estimator as an empirical Bayes estimator. Suppose the $\mu_i$ come from a $N(0, \sigma^2)$ prior, where $\sigma^2$ is unknown. Then the Bayes estimator is $\hat{\mu}_i = c z_i$, where $c = \sigma^2/(\sigma^2 + 1)$ is a function of $\sigma^2$. We can estimate c using the marginal distribution of z, which is $N(0, \sigma^2 + 1)$. Efron and Morris show that estimating c unbiasedly yields the James-Stein estimator. The James-Stein estimator can thus be viewed as a parametric empirical Bayes method, that is, a method that assumes a certain form for the prior or Bayes estimator, then fits the resulting Bayes estimator. Morris [1983] reviews later applications of parametric empirical Bayes methods.

Although they developed separately from parametric empirical Bayes methods, methods based on sparsity often have a similar form: they fix the form of the estimator, and try to choose parameters based on the data. In the context of wavelet shrinkage, Donoho and Johnstone [1994] showed that soft thresholding is nearly optimal for the sparse normal means problem. This sparked much work on how to pick the threshold. Donoho and Johnstone [1995] used Stein’s Unbiased Risk Estimate, while Abramovich et al. [2006] used an FDR based method. These methods can be implicitly viewed as parametric empirical Bayes methods. In contrast, Johnstone and Silverman [2004] take an explicitly empirical Bayes approach and fit a two-parameter prior by maximum marginal likelihood.

Current parametric empirical Bayes methods usually fail to adapt when their assumptions are violated. This is a particular problem for methods that rely on sparsity, which can perform poorly if the $\mu_i$ are dense.

2.3.2 Robbins’ Formula and Nonparametric Empirical Bayes

Another approach to the estimation problem comes from Robbins [1954] (Robbins actually credits Tweedie [1947] for the idea, but it cannot be found in Tweedie’s publications). Robbins models µ as coming from a prior G, then writes the Bayes estimator $E(\mu \mid z)$ in terms of estimable quantities. In the normal means problem, for example,

$$E(\mu \mid z) = z + \frac{f_G'(z)}{f_G(z)},$$

where $f_G(z) = \int \varphi(z - \mu)\, dG(\mu)$ is the marginal density of z. If $z \sim \mathrm{Poisson}(\mu)$, then $E(\mu \mid z) = (z + 1) \frac{f_G(z+1)}{f_G(z)}$. Robbins estimates the Bayes estimator by estimating these marginal quantities. If the Bayes estimator is estimated consistently, this approach asymptotically achieves the Bayes risk for any prior. Efron [2009c] notes that Robbins’ formulas connect fdr estimation with estimating µ.
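Robbins’ Poisson formula is especially easy to implement, since the marginal frequencies can be estimated by counting. A minimal sketch (our own illustration, with an arbitrary Gamma prior for the demonstration):

```python
import numpy as np

def robbins_poisson(z):
    """Robbins' estimator for z_i ~ Poisson(mu_i): E(mu|z) = (z+1) f(z+1)/f(z),
    with f estimated by the empirical frequencies of the counts."""
    z = np.asarray(z)
    counts = np.bincount(z, minlength=z.max() + 2)
    f = counts / len(z)                  # empirical marginal frequencies
    return (z + 1) * f[z + 1] / f[z]     # f[z] > 0 for observed z; the largest
                                         # count gets estimate 0 (a known quirk)

rng = np.random.default_rng(5)
mu = rng.gamma(2.0, 1.0, size=100000)    # unknown prior G = Gamma(2, 1)
z = rng.poisson(mu)
muhat = robbins_poisson(z)
print("MSE, MLE:", np.mean((z - mu)**2).round(3),
      "Robbins:", np.mean((muhat - mu)**2).round(3))
```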

Various ways of estimating the Bayes estimator have been proposed and studied: empirical cdfs [Robbins, 1954, 1980, Johns and Ryzin, 1971, Singh, 1976], spline-based methods [Cox, 1983], kernel methods [Johns and Ryzin, 1972, Brown and Greenshtein, 2009, Zhang, 1997], orthogonal polynomials [Singh, 1979, Singh and Wei, 1992], and maximum likelihood estimates of the prior [Jiang and Zhang, 2009].

All these nonparametric empirical Bayes methods perform reasonably well when the prior on µ is smooth. They do not, however, take advantage of sparsity. Although nonparametric empirical Bayes methods asymptotically achieve the Bayes risk for any prior, they can be outperformed by parametric methods in finite samples when the $\mu_i$ are sparse.

2.4 This Thesis’ Place

In this thesis, we will show that mixture models are useful tools for empirical Bayes testing and estimation, in the normal means problem and beyond. For the normal means problem, our approach will yield a marginally better FDR/fdr estimator than existing methods: our estimator does about as well as existing methods when the null is known, and perhaps slightly better in estimating an empirical null. More importantly, our mixture model approach makes it easy to extend empirical null ideas to more complicated situations. We illustrate this by using mixture models to detect SNPs. The resulting SNP caller substantially outperforms existing SNP detection tools.

Our approach is a bigger step forward in estimation. The mixture model combines the adaptivity of nonparametric methods with the ability of parametric methods to take advantage of sparsity. It outperforms existing methods and nearly achieves the Bayes error for the normal means problem across a wide range of sparsity.

Finally, we prove theoretical results that motivate the mixture model approach and may be of separate interest. We generalize results of Jiang and Zhang [2009] to establish a loose equivalence between estimating µ well and estimating $f_G$ well. This connection explains why the mixture model performs so well, and tells us how to choose parameters for the mixture model and other empirical Bayes methods.


Chapter 3

Marginal Densities and Regret

3.1 Introduction

Consider the following high-dimensional estimation problem. We have independent, real-valued data $z_1, \ldots, z_N$, and each $z_i$ has distribution $f_{\theta_i}$, where

$$f_\theta(z) = \exp(\theta z - \psi(\theta)) f_0(z)$$

is an absolutely continuous natural exponential family. We want to estimate $\theta_1, \ldots, \theta_N$ under squared error loss. We saw this problem in Chapter 2; it is perhaps the simplest generalization of the normal means problem.

It is well known that the MLE is a poor estimator of θ when N is large. Empirical Bayes (EB) methods have been successfully used to construct better estimators for this problem. EB methods model θ as iid from a prior G, making the full model

$$\theta_i \sim G, \qquad z_i \mid \theta_i \sim f_{\theta_i}$$

independently for $i = 1, \ldots, N$. In this model, we want an estimator $\hat{\theta} = t(z)$ that minimizes the Bayes risk $R(t, G) = E_G\left((t(z) - \theta)^2\right)$. For a fixed prior G, Robbins [1954] showed that the Bayes estimator is

$$t_G(z) \equiv E_G(\theta \mid z) = -\frac{f_0'(z)}{f_0(z)} + \frac{f_G'(z)}{f_G(z)}, \tag{3.1}$$

where $f_G$ is the marginal distribution of z and $f_0$ is the carrier density of the family. This result holds as long as $f_0$ is absolutely continuous [Berger, 1980].

EB methods treat G as unknown and estimate $t_G$. Parametric EB methods assume G (or $t_G$) has a certain parametric form, then fit it using the data. Nonparametric EB (NPEB) methods attempt to estimate $t_G$ consistently for all G. As we saw in Section 2.3, the EB approach has produced estimators with good theoretical and practical properties.

We do two things in this chapter. First, we propose a particular kind of NPEB method. Second, we investigate the more general connection between estimating θ and estimating $f_G$.

We first study the NPEB method (Section 3.2). Based on estimates $\hat{f}$ and $\hat{f}'$ of $f_G$ and $f_G'$, we suggest using the estimator

$$t_\rho = -\frac{f_0'}{f_0} + \frac{\hat{f}'}{\hat{f} \vee \rho}.$$

This estimator is simple: we just plug $\hat{f}$ and $\hat{f}'$ into Robbins’ formula, but to avoid dividing by a near-zero quantity, we replace $\hat{f}$ by $\hat{f} \vee \rho = \max(\hat{f}, \rho)$. This kind of estimate was suggested for the normal problem by Zhang [1997] and studied further by Jiang and Zhang [2009]. Their methods, however, are readily generalized to all exponential families with absolutely continuous $f_0$, that is, to all families where Robbins’ formula holds. We will bound the regret of $t_\rho$ (the difference in risk between $t_\rho$ and $t_G$) in terms of the error in $\hat{f}$ and $\hat{f}'$. The bounds can be used to show that $t_\rho$ asymptotically achieves the Bayes risk.

We illustrate the general theory by applying it to the simultaneous chi-squared estimation problem, where $z_i$ comes from a chi-squared distribution with inverse scale $\theta_i$. This problem arises in microarray data: the $\theta_i$ are needed to construct t-statistics, which in turn are used by many simultaneous inference procedures to find differentially expressed genes [Efron, 2008]. We study the finite-sample performance of $t_\rho$ for different choices of $\hat{f}$ and $\hat{f}'$ by simulation, and specialize our theoretical results to the chi-squared case.

After studying our estimator, we study the relationship between estimating θ and estimating $f_G$ more generally. Our estimator shows that a good estimate of $f_G$ can yield a good estimate of θ. We might think, however, that there are other good estimators of θ, not based on empirical Bayes and estimates of $f_G$. Surprisingly, this is false - a good estimator of θ can always be viewed as an empirical Bayes method using a good estimator of $f_G$.

To make this precise, consider a general estimator $\hat{\theta} = t(z)$. We can invert Robbins’ formula to get a “density” $f_t$ such that $t = -\frac{f_0'}{f_0} + \frac{f_t'}{f_t}$. That is, $f_t$ is the marginal density that yields t when plugged into Robbins’ formula. We will prove that if t has low regret, $f_t$ must be close to $f_G$. This provides a useful diagnostic for general estimators. If an estimator t corresponds to an $f_t$ that doesn’t match the observed data, t will probably be outperformed by methods that match $f_G$ more closely.
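To see the inversion concretely, rearranging Robbins’ formula gives a first-order equation for $f_t$ (a short derivation we add for clarity; the base point and constant are fixed by normalization):

```latex
% Rearranging t = -f_0'/f_0 + f_t'/f_t gives (log f_t)' = t + (log f_0)',
% which integrates to
\[
  f_t(z) \;=\; c \, f_0(z) \, \exp\!\left( \int_{z_0}^{z} t(u)\, du \right),
\]
% for an arbitrary base point z_0, with c > 0 chosen so f_t integrates to 1
% (hence the scare quotes around "density": some t admit no such normalization).
```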

This, along with our previous results, shows that there is a close connection between estimating θ and estimating $f_G$. Roughly speaking, to estimate θ well, it is necessary and sufficient to estimate $f_G$ well. This connection is an important motivation for mixture models. Although mixture models have many statistical drawbacks, they make good density estimators, and our results in this chapter show that this is enough to guarantee that they yield good estimates of θ.

3.2 A Tempered NPEB Method

In this section, we outline our proposed NPEB method and prove bounds on its regret.

3.2.1 Setup, Regret and the Proposed Estimator

For the rest of the chapter, we will work with the Bayesian model

$$\theta \sim G, \qquad z \mid \theta \sim f_\theta.$$

We assume that there are N previous draws $(\theta_i, z_i)$ from this model, and we observe $z_1, \ldots, z_N$. We want to estimate θ based on z for a new draw $(\theta, z)$. We will use $z_1, \ldots, z_N$ to construct an estimator $t(z) = t(z; z_1, \ldots, z_N)$. We condition on $z_1, \ldots, z_N$ throughout; all expectations and probabilities are conditional on $z_1, \ldots, z_N$, though our notation suppresses this. This lets us treat $t(z)$ as a fixed function of z.

Our goal is to construct an estimator that performs nearly as well as the Bayes estimator $t_G$. Let the Bayes risk of an estimator $t(z)$ be

$$R(t, G) = E_G\left((\theta - t(z))^2\right).$$

The regret of t is how much extra Bayes risk we get by using t instead of the Bayes estimator $t_G$:

$$\Delta(t, G) = R(t, G) - R(t_G, G).$$

Because we use squared-error loss, $t_G$ is just $E_G(\theta \mid z)$, and the definition of conditional expectation implies that

$$\Delta(t, G) = E_G\left((t(z) - t_G(z))^2\right). \tag{3.2}$$

This actually holds even if θ is not square integrable, as long as $R(t_G, G)$ is finite [Singh, 1979, Brown, 1971]. Achieving low regret is thus equivalent to estimating $t_G$ well under squared error loss.
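The step behind equation 3.2 is the usual orthogonality property of conditional expectations; spelled out (a derivation we add for completeness):

```latex
% Expand the risk around the Bayes rule t_G(z) = E_G(theta | z):
\[
  R(t, G)
    = E_G\big((t(z) - t_G(z))^2\big)
    + 2\, E_G\big((t(z) - t_G(z))\,(t_G(z) - \theta)\big)
    + R(t_G, G).
\]
% Conditioning on z kills the cross term, since E_G(t_G(z) - theta | z) = 0,
% leaving Delta(t, G) = R(t, G) - R(t_G, G) = E_G((t(z) - t_G(z))^2).
```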

We propose estimating $t_G$ using a tempered NPEB estimator. The most obvious approach based on equation 3.1 would be to use $z_1, \ldots, z_N$ to estimate $f_G$ and $f_G'$ by, say, $\hat{f}$ and $\hat{f}'$, then plug in to estimate $t_G$. But if $\hat{f}$ is too small, $\frac{\hat{f}'}{\hat{f}}$ may be too large, and we may overshrink. Zhang [1997] introduced a simple solution in the normal case - replace the $\hat{f}$ in our plug-in estimator by $\hat{f} \vee \rho = \max(\hat{f}, \rho)$ for some small ρ, but keep $\hat{f}'$ the same. Using this approach for other exponential families gives us a tempered EB estimator

$$t_\rho = -\frac{f_0'}{f_0} + \frac{\hat{f}'}{\hat{f} \vee \rho}. \tag{3.3}$$

Tempering protects us from overshrinking. In the tails, $\hat{f}' \to 0$ and $\hat{f} \vee \rho \to \rho$, so $\frac{\hat{f}'}{\hat{f} \vee \rho} \to 0$. So in the tails, $t_\rho$ approaches $-\frac{f_0'}{f_0}$, the UMVU estimator of θ [Sharma, 1973]. This is sensible, since the tails are exactly where we have the least information about $f_G$. Tempered EB estimators are similar to the limited translation estimators introduced by Efron and Morris [1971].

3.2.2 Regret Bounds

We now bound the regret of $t_\rho$ in terms of the error in $\hat f$ and $\hat f'$. Our bounds generalize results of Zhang [1997] and Jiang and Zhang [2009]. Lemma 1, below, shows that the regret is bounded by the integrated mean squared errors of $\hat f$ and $\hat f'$, up to a tempering term and constants that only depend on $f_G$. Recall that we are conditioning on $z_1, \ldots, z_N$, so the regret is a conditional expectation and $\hat f$ and $\hat f'$ are fixed functions of $z$.

Lemma 1. Suppose that $\frac{f_G'}{f_G \vee \rho} \le A(\rho)$. Then
\[
\Delta(t_\rho, G)^{\frac{1}{2}} \le \frac{1}{\rho}\left(\int \left(\hat f' - f_G'\right)^2 f_G\,dz\right)^{\frac{1}{2}} + \frac{A(\rho)}{\rho}\left(\int \left(\hat f - f_G\right)^2 f_G\,dz\right)^{\frac{1}{2}} + T(\rho, f_G)
\]
where $T(\rho, g) = \left(\int \left(1 - \frac{g}{\rho}\right)_+^2 \left(\frac{g'}{g}\right)^2 f_G\,dz\right)^{\frac{1}{2}}$.

Lemma 1 has two unfamiliar features, a tempering term and a bound $A(\rho)$. The tempering term $T(\rho, f_G)$ depends on the heaviness of the tail of $f_G$ and behaves roughly like $\rho^{\frac{1}{2}}$. If $f_G$ has exponential or lighter tails, it behaves like $\rho^{\frac{1}{2}}$ with some log factors, and if $f_G$ falls as $z^{-k}$, it behaves like $\rho^{\frac{1}{2} - \frac{1}{2k}}$. The bound $A(\rho)$ measures how quickly $f_G'$ drops off compared to $f_G$. We always have $A(\rho) \le \frac{1}{\rho}\sup \|f_\theta'\|_\infty = \frac{1}{\rho}\sup_{z,\theta} |f_\theta'(z)|$, but sometimes we can do better. In the normal case, Jiang and Zhang [2009] get $A(\rho) = O(\log\rho)$.

Lemma 1 bounds the regret by the error in $\hat f$ and $\hat f'$. If $\hat f$ and $f_G$ are smooth, we can reduce this to a bound in terms of the error in $\hat f$ alone. This makes sense, since if $\hat f$ and $f_G$ are smooth and $\hat f$ is close to $f_G$, $\hat f'$ should be close to $f_G'$.

Theorem 1, below, makes this precise. As Jiang and Zhang [2009] found in the normal case, the right kind of smoothness turns out to be the decay of the Fourier transforms of the densities. Sometimes $z$ is not supported on the whole real line, so it is natural for $f_G$ and $\hat f$ to have discontinuities at the boundary of the support, giving their Fourier transforms heavy tails. In this case, the theorem can be applied to smooth extensions of $f_G$ and $\hat f$ that agree with the originals on the support of $z$.

Theorem 1. Let $f^*$ be the Fourier transform of a function $f$. Suppose $|f_G^*(u)|, |\hat f^*(u)| \le H(u)$ for almost all $|u| \ge C$, where $\int u^2 H(u)^2 < \infty$. Let $L(a) = \frac{1}{a^2}\int_{|u| \ge a} u^2 H(u)^2\,du$; $L(a) \downarrow 0$ as $a \to \infty$. Then
\[
\Delta(t_\rho, G)^{\frac{1}{2}} \le \frac{1}{\rho}\,\|f_G\|_\infty^{\frac{1}{2}}\left(\sqrt{\frac{5}{2\pi}}\,L^{-1}\left(d(\hat f, f_G)^2\right) + A(\rho)\right)d(\hat f, f_G) + T(\rho, f_G)
\]
where $d(\hat f, f_G) = \left(\int (\hat f - f_G)^2\,dz\right)^{\frac{1}{2}}$.

Theorem 1 generalizes a result of Jiang and Zhang [2009] from the normal case to exponential families and more general density estimators. It shows that if our densities are smooth, the regret is bounded by the density estimation error, up to smoothness and tempering terms. The tempering term is the same as in Lemma 1. The smoothness term $L^{-1}(d^2)$ depends on how fast the characteristic functions of $f_G$ and $\hat f$ decay. If they decay exponentially, $L^{-1}(d^2)$ behaves like $\log d$, while if they decay as $u^{-k}$, it behaves like $d^{-2/(2k-1)}$. Since we are conditioning on $z_1, \ldots, z_N$, what matters is the smoothness of $f_G$ and the realized $\hat f$. Although it is more general, Theorem 1 is not as sharp as the result of Jiang and Zhang [2009] when applied to the normal case.

Lemma 1 and Theorem 1 show that if $\hat f$ and $\hat f'$ are good estimates of $f_G$ and $f_G'$, our tempered EB method will perform well. Tempering lets us use any estimators $\hat f$ and $\hat f'$ that have low integrated mean squared error - they do not need to have any additional properties.

3.3 Example: Simultaneous Chi-Squared Estimation

3.3.1 Specializing Theoretical Results

We illustrate our method in the simultaneous chi-squared problem, where $z_i$ is a chi-squared random variable with $k$ degrees of freedom and scale $1/\theta_i$. The distribution of $z \mid \theta$ is
\[
f_\theta(z) = C_k\,\theta^{k/2} z^{k/2 - 1}\exp(-\theta z/2).
\]

This is an exponential family with natural parameter $\theta$ and sufficient statistic $-z/2$. We try to estimate $\theta$ well under squared error loss. There are other loss functions that may be of more interest, for example, the loss functions considered by Berger [1980]. Extending our results to more general losses seems difficult, as we discuss in the conclusion of this chapter.

The previous theory is easily adapted to give estimates of $\theta$ based on $z$. Robbins' formula becomes
\[
E(\theta \mid z) = \frac{k-2}{z} - 2\,\frac{f_G'(z)}{f_G(z)}
\]
where the factor of $-2$ comes from the fact that the sufficient statistic is $-z/2$, not $z$. We construct $t_\rho$ by constructing estimates $\hat f$ and $\hat f'$, then plugging in to get
\[
t_\rho = \frac{k-2}{z} - 2\,\frac{\hat f'}{\hat f \vee \rho}.
\]
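In code, this plug-in rule is a one-liner. The following is a minimal R sketch; the function name is ours, and it assumes $\hat f$ and $\hat f'$ are available as vectorized functions (for example, from the log-spline or mixture fits described below).

```r
# Tempered NPEB rule for the chi-squared problem:
#   t_rho(z) = (k - 2)/z - 2 * fhat'(z) / max(fhat(z), rho).
tempered_npeb <- function(z, k, fhat, fhat_deriv, rho = 1e-6) {
  (k - 2) / z - 2 * fhat_deriv(z) / pmax(fhat(z), rho)
}
```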


Corollary 1 specializes Theorem 1 to the chi-squared problem. For $f_G$ to be smooth, $G$ must have reasonably light tails. When $\theta$ is large, $f_\theta$ becomes nearly a spike at zero. If $G$ has heavy tails, $f_G$ becomes too spiky.

Corollary 1. Suppose $G$ is integrable and the degrees of freedom $k \ge 5$. Suppose that for some $\alpha \in \left(0, 1 - \frac{4}{k}\right)$, $P_G(\theta \ge m) \le D\,m^{-\frac{1-\alpha}{\alpha}\frac{k}{2}}$ for all $m \ge M$, and $|\hat f^*(u)| \le B\,u^{-(1-\alpha)\frac{k}{2}}$ for $u \ge C$. Then for some constant $F$ that depends on $E_G(\theta)$, $B$, $C$, $D$, $M$:
\[
\Delta(t_\rho, G)^{\frac{1}{2}} \le \frac{F}{\rho}\left(d(\hat f, f_G)^{-\frac{2}{1 + (1-\alpha)k}} + A(\rho)\right)d(\hat f, f_G) + T(\rho, f_G).
\]

The condition $P_G(\theta \ge m) = O\left(m^{-\frac{1-\alpha}{\alpha}\frac{k}{2}}\right)$ is satisfied, for example, if $G$ has tails like a Gamma distribution. In this case, the smoothness of $\hat f$ becomes the limiting factor. Interestingly, the constant in Corollary 1 only depends on $G$ through its mean and tail behavior. The corollary thus holds uniformly in classes of priors with bounded mean and constrained tail behavior.

If the support of $G$ is bounded and bounded away from 0, and $\hat f$ is smooth enough, we can give the rate at which the regret converges more explicitly. The constants in Corollary 2 only depend on the support of $G$, so the result holds uniformly across all priors with the same support.

Corollary 2. Suppose $k \ge 5$, $0 < M_1 \le \theta \le M_2 < \infty$, and $|\hat f^*(u)| \le B\,u^{-\frac{k}{2}}$ for all $u \ge C$. Then if we choose $\rho = O\left(d(\hat f, f_G)^{2/5}\right)$, $\Delta(t_\rho, G) = O\left(d(\hat f, f_G)^{2/5}\right)$, with constants that depend on $M_1, M_2, B, C$.

3.3.2 An Empirical Comparison

We now compare our tempered NPEB estimator to the UMVU estimator, a conjugate-prior parametric EB estimator and an estimator introduced by Berger [1980]. Our theory used a sequential setup, where we observe $z_1, \ldots, z_N$ and estimate $\theta$ for a new observation $z$. Our simulations use the more realistic situation where we observe $z_1, \ldots, z_N$ and estimate $\theta_1, \ldots, \theta_N$. Since each $z_i$ can be treated as the "new observation," and each $z_i$ only affects the density estimate slightly, our theory still applies here.

3.3.2.1 The UMVU estimator and Berger’s estimator

The UMVU estimator for $\theta$ is $\frac{k-2}{z}$; it is the multiple of $\frac{1}{z}$ with lowest mean squared error, dominating the MLE $\frac{k}{z}$ (the MLE is also the Jeffreys-prior posterior mean). Berger [1980] found an estimator that dominates the UMVU estimator:
\[
\hat\theta_i = \frac{k-2}{z_i} + \frac{c\,z_i}{b + \sum z_i^2}
\]
where $b \ge 0$ and $c \in (0, 4(N-1))$. Berger left the choice of $b$ and $c$ open, so we tried many different values. All of them performed nearly the same, and none substantially improved on the UMVU estimator. We finally used $b = c = N$.

3.3.2.2 A Parametric EB estimator

Our parametric EB estimator uses a conjugate prior whose parameters are estimated by the method of moments. Efron and Morris [1973] take this approach in the normal case to obtain an empirical Bayes construction of the James-Stein estimator.

The conjugate prior for the chi-squared distribution is the Gamma distribution, $\mathrm{Gamma}(\alpha, \beta)(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha-1}\exp(-x\beta)$. If $\theta \sim \mathrm{Gamma}(\alpha, \beta)$ and $z = \frac{1}{\theta}\chi^2_k$, then it is easy to show, for $\alpha > 2$,
\[
E(z) = \frac{k\beta}{\alpha - 1}, \qquad E(z^2) = \frac{k(k+2)\,\beta^2}{(\alpha-1)(\alpha-2)}, \qquad E(\theta \mid z) = \left(\frac{z/2}{\beta + z/2}\right)\frac{k}{z} + \left(\frac{\beta}{\beta + z/2}\right)\frac{\alpha}{\beta}.
\]
We estimate $\alpha, \beta$ by the method of moments, then plug in to estimate $E(\theta \mid z)$. Let $m_1 = \frac{\bar z}{k}$ and $m_2 = \frac{\overline{z^2}}{k(k+2)}$. Then the method of moments estimates are $\hat\alpha = \max\left(1 + \frac{m_2}{m_2 - m_1^2},\,3\right)$ (we fix $\hat\alpha \ge 3$ to ensure that $z$ has finite variance) and $\hat\beta = \frac{m_1 m_2}{m_2 - m_1^2}$. Plugging these in gives an estimate of $E(\theta \mid z)$. If the prior is very concentrated, $m_2 - m_1^2$ can be negative. In this case we fit a very concentrated Gamma by taking $\hat\alpha$ to be essentially infinite ($10^8$) and $\hat\beta = \hat\alpha\bar z$.
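A minimal R sketch of this moment-matching recipe follows; the function name is ours, and the posterior mean is computed in the equivalent form $E(\theta \mid z) = (\alpha + k/2)/(\beta + z/2)$, which matches the convex-combination expression above.

```r
# Conjugate-prior parametric EB rule via method of moments (names are ours).
gamma_eb <- function(z, k) {
  m1 <- mean(z) / k
  m2 <- mean(z^2) / (k * (k + 2))
  if (m2 - m1^2 <= 0) {                 # very concentrated prior
    alpha <- 1e8
    beta  <- alpha * mean(z)            # as in the text's concentrated fit
  } else {
    alpha <- max(1 + m2 / (m2 - m1^2), 3)
    beta  <- m1 * m2 / (m2 - m1^2)
  }
  # Posterior: theta | z ~ Gamma(alpha + k/2, rate beta + z/2), so
  (alpha + k / 2) / (beta + z / 2)      # E(theta | z), vectorized over z
}
```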

3.3.2.3 Tempered NPEB estimators

Specifying the tempered NPEB estimator requires fixing $\hat f$, $\hat f'$ and $\rho$. We use two choices for $\hat f$ and $\hat f'$, one off-the-shelf and one specially constructed for this problem.

Our first estimator is an off-the-shelf log-spline estimator. Our density estimate takes the form
\[
\hat f(z) \propto \exp\left(\sum \hat\beta_i c_i(z)\right)
\]
where the $c_i(z)$ are a natural spline basis. We use the default natural spline basis supplied by R, with 15 degrees of freedom and boundary knots at the 1st and 99th percentiles of $z$. We fit $\hat f$ by binning $z$ and fitting a Poisson GLM. Efron [2009c] used this approach in the normal case; the reference contains details on the fitting method. Since cubic splines are smooth, $\hat f$ is smooth, and its characteristic function should decay quickly.
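A minimal sketch of this Poisson-GLM fit in R, under our own choices of bin count and helper names (the text does not specify these details):

```r
# Log-spline density estimate: bin z, fit Poisson counts on a natural-spline
# basis, then normalize exp(fitted log-mean) into a density on the midpoints.
library(splines)
logspline_fit <- function(z, df = 15, nbins = 200) {
  br   <- seq(min(z), max(z), length = nbins + 1)
  mids <- (br[-1] + br[-(nbins + 1)]) / 2
  cnts <- hist(z, breaks = br, plot = FALSE)$counts
  bnd  <- quantile(z, c(0.01, 0.99))
  fit  <- glm(cnts ~ ns(mids, df = df, Boundary.knots = bnd), family = poisson)
  dens <- exp(predict(fit, newdata = data.frame(mids = mids)))
  list(x = mids, f = dens / sum(dens * diff(br)))   # normalized density
}
```

The derivative estimate $\hat f'$ can then be taken by finite differences on the returned grid, or by differentiating the fitted spline directly.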

We chose the degrees of freedom to give $\hat f$ enough flexibility to model all the test scenarios. We made the choice by plotting histograms of $z$ and assessing the fit by eye, mimicking the process we would use for real data. Corollary 1 tells us that fitting the marginal density well will result in the best estimation performance, so this is a reasonable approach to take.

Our second density estimator is a Gamma mixture model. We model the prior $G$ as a mixture, $G = \sum \pi_i\,\mathrm{Gamma}(a_i, b_i)$. We fix $a$ and $b$ to a grid of values, then fit $\pi$ by the EM algorithm.

We choose $a, b$ as follows. We first specify a number of groups $\ell$. Next, we fit a Gamma prior $G$ by method of moments as for the parametric EB method, and find $\mu = E_G(\log\theta)$ and $\sigma = \mathrm{Var}_G(\log\theta)$. We then take a sequence of means from $\mu - 3\sigma$ to $\mu + 3\sigma$, $\mu = \mathrm{seq}(\mu - 3\sigma, \mu + 3\sigma, \mathrm{length} = \ell)$ in R notation. Finally, we initialize $a, b$ so each group has approximate log-mean $\mu_j$ and approximate log-variance $\tilde\sigma = (\mu_2 - \mu_1)^2$. To do this, we take $a = 1/\tilde\sigma$ and $b = \exp(\psi(a) - \mu_j)$, where $\psi$ is the digamma function.
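A small R sketch of this initialization, using the Gamma identities $E(\log\theta) = \psi(a) - \log b$ and $\mathrm{Var}(\log\theta) = \psi'(a) \approx 1/a$; the function name is ours, and we read the text's $\psi$ as the digamma function.

```r
# Grid initialization for the Gamma mixture (names are ours).
# mu, sigma: log-scale mean and variance from the method-of-moments Gamma fit.
init_gamma_grid <- function(mu, sigma, ell = 10) {
  mu_j    <- seq(mu - 3 * sigma, mu + 3 * sigma, length = ell)
  sig_tld <- (mu_j[2] - mu_j[1])^2       # per-group log-variance
  a       <- rep(1 / sig_tld, ell)       # Var(log theta) = psi'(a) ~ 1/a
  b       <- exp(digamma(a) - mu_j)      # E(log theta) = psi(a) - log b
  pi0     <- dnorm(mu_j, mu, sigma)      # approximately lognormal start for pi
  list(a = a, b = b, pi = pi0 / sum(pi0))
}
```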

For the EM algorithm, we initialize $\pi$ to be approximately lognormal, $\pi \propto \mathrm{dnorm}(\mu, \mu, \sigma)$ in R notation. For the E-step, we estimate
\[
g_{ij} = P(z_i \text{ from group } j) = \frac{\pi_j f_j(z_i)}{\sum_j \pi_j f_j(z_i)}
\]
where
\[
f_j(x) = \frac{1}{x}\left(\frac{\Gamma\left(a + \frac{k}{2}\right)}{\Gamma(k/2)\,\Gamma(a)}\right)\left(1 - \frac{x/2}{b + x/2}\right)^{a}\left(\frac{x/2}{b + x/2}\right)^{k/2}
\]
is the marginal distribution corresponding to a $\mathrm{Gamma}(a, b)$ prior. For the M-step, we estimate $\pi_j \leftarrow \frac{1}{N}\sum_i g_{ij}$.
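One full EM pass is only a few lines of code. Below is a minimal R sketch with the grid $(a, b)$ held fixed as in the text; all names are ours.

```r
# Marginal f_j for a Gamma(a, b) prior, as displayed above
# (log-gammas are used for numerical stability).
marg_gamma <- function(x, a, b, k) {
  (1 / x) * exp(lgamma(a + k / 2) - lgamma(k / 2) - lgamma(a)) *
    (1 - (x / 2) / (b + x / 2))^a * ((x / 2) / (b + x / 2))^(k / 2)
}

# EM for the mixture weights pi; a, b, k fixed, pi0 from the initialization.
em_weights <- function(z, a, b, k, pi0, iters = 200) {
  Fz <- sapply(seq_along(a), function(j) marg_gamma(z, a[j], b[j], k))  # N x ell
  pi_hat <- pi0
  for (it in seq_len(iters)) {
    g <- sweep(Fz, 2, pi_hat, "*")   # E-step numerators pi_j f_j(z_i)
    g <- g / rowSums(g)              # posterior memberships g_ij
    pi_hat <- colMeans(g)            # M-step: pi_j <- (1/N) sum_i g_ij
  }
  pi_hat
}
```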

We used 10 mixture groups. As for the log-spline, we assessed the fit of $\hat f$ by eye, and chose the number of groups to give $\hat f$ enough flexibility to model the test scenarios. Given the fitted $\hat G$, we used the density estimator $\hat f = f_{\hat G}$. Since $\hat f$ corresponds to a prior $\hat G$ with a Gamma-like tail, it is quite smooth.

To demonstrate the flexibility of the mixture method, we also used a mixture of approximate point priors, initialized the same way as the Gamma mixture, but with $\tilde\sigma$ (the log-variance of the Gamma groups) fixed to $10^{-7}$, and 100 mixture groups.

Both methods were insensitive to the choice of $\rho$, as long as it was small. We took $\rho = 10^{-6}$, but $\rho = 0$ performed just as well.

3.3.2.4 Testing Scenarios

We tested the methods under distributions of $\theta$ ranging from smooth to sparse. We used the following priors, shown in Figure 3.1.

1. Gamma (10, 1), a smooth prior.

2. Unif (2, 4), another smooth prior.

3. 50% Gamma (10, 7), 50% Gamma (10, 20), a smooth but bimodal prior.

4. 25% $\theta = 1$, 50% $\theta = 2$, 25% $\theta = 10$, a three-point prior with one extreme point.

5. 75% $\mathrm{Gamma}(1000, 1000)$, 25% $\mathrm{Gamma}(1000, 333)$, an approximate two-point prior.

6. A point mass at θ = 1.

[Figure 3.1 here: histograms of draws from Priors 1-6; x-axes show theta, y-axes show frequency.]

Figure 3.1: Simulation priors as described in the text.

We tested the methods with $k = 10$; increasing $k$ improved all methods' performance, but did not substantially change their relative performance. We took $N = 10{,}000$, a size typical of microarray studies.

These priors are fair, in the sense that none of the methods are fitting the true model (except the parametric EB method on prior 1). The UMVU, Berger, parametric EB and log-spline estimators are clearly not tailored to these priors. The Gamma mixture method also has no unfair advantage in fitting the priors used here. It uses 10 Gamma groups for a nonparametric fit. Because the Gamma group parameters were fixed after initializing them with the parametric EB model, the mixture is not able to fit the Gamma priors directly. That the Gamma mixture method has no special advantage is supported by the fact that the mixture of point priors has similar results on the Gamma priors. The point prior mixture, of course, has an unfair advantage on the point priors, and is mainly used to illustrate that the Gamma mixture method has no special advantage on the Gamma priors.

3.3.2.5 Results

Our simulation results are in Tables 3.1 and 3.2. Berger's estimator dominates the UMVU estimator, but its advantage is small. The parametric EB method does well on the smooth priors, including the smooth bimodal prior, but does very badly on the point priors, doing even worse than the UMVU estimator. It does well, however, on the point mass, because the moment-based fitting can detect that the prior is extremely concentrated.

Method               Prior 1          Prior 2          Prior 3          Prior 4          Prior 5          Prior 6
UMVU                 12.698 (0.4745)  1.167 (0.0351)   0.1577 (0.0070)  3.3966 (0.1677)  0.3756 (0.0157)  0.125 (0.0034)
Berger               13.252 (0.4778)  1.114 (0.0352)   0.1556 (0.0070)  3.3786 (0.1678)  0.3668 (0.0157)  0.118 (0.0035)
Parametric EB         5.239 (0.1002)  0.2417 (0.0031)  0.1002 (0.0020)  6.413 (0.0479)   0.3970 (0.0045)  2.10e-5 (2.9e-5)
Log-spline NPEB       6.378 (0.2321)  0.3296 (0.0187)  0.1146 (0.0050)  2.967 (0.1396)   0.1830 (0.0099)  9.24e-3 (2.8e-3)
Mixture Gamma NPEB    5.248 (0.1021)  0.0031 (0.0032)  0.0831 (0.0018)  0.9825 (0.0392)  0.1688 (0.0057)  3.21e-5 (5.3e-5)
Mixture Point NPEB    5.244 (0.1014)  0.0032 (0.0032)  0.0800 (0.0017)  0.4728 (0.0321)  0.1265 (0.0050)  2.08e-5 (2.8e-5)
Bayes                 5.235 (0.1004)  0.0031 (0.0031)  0.0798 (0.0017)  0.3218 (0.0354)  0.1241 (0.0051)  0 (0)

Table 3.1: Mean squared errors $\frac{1}{N}\sum(\hat\theta - \theta)^2$ from the simulations under the priors in the text. The methods are the UMVU estimator, Berger's estimator, a parametric EB method based on a Gamma prior, a nonparametric EB method based on a log-spline density estimator, and two mixture methods, one with Gamma mixture groups, and another with approximate point prior mixture groups. The quantities shown are averages over 100 simulations, with standard deviations given in parentheses.


Method              Prior 1          Prior 2         Prior 3         Prior 4         Prior 5
UMVU                1.6165 (0.083)   3.8593 (0.149)  0.9765 (0.085)  9.673 (1.234)   2.029 (0.170)
Berger              1.5313 (0.083)   3.6354 (0.149)  0.9490 (0.085)  9.627 (1.228)   1.959 (0.169)
Parametric EB       0.0007 (0.0001)  0.0059 (0.002)  0.2554 (0.021)  19.167 (2.262)  2.203 (0.125)
Log-spline NPEB     0.2181 (0.038)   0.3295 (0.075)  0.4362 (0.058)  8.321 (1.061)   0.4753 (0.073)
Mixture Gamma NPEB  0.0023 (0.0018)  0.0114 (0.004)  0.0419 (0.009)  2.080 (0.273)   0.3584 (0.037)
Mixture Point NPEB  0.0016 (0.0014)  0.0062 (0.003)  0.0022 (0.002)  0.478 (0.101)   0.0186 (0.006)

Table 3.2: Average relative regret from the simulations under the priors in the text. The average relative regret is $\left[\frac{1}{N}\sum \mathrm{MSE}(\hat\theta)/\mathrm{MSE}(\hat\theta_{bayes})\right] - 1$; if $\mathrm{MSE}(\hat\theta_{bayes})$ is near 0, this can be different from the ratio of entries in Table 3.1, $\frac{\sum \mathrm{MSE}(\hat\theta)}{\sum \mathrm{MSE}(\hat\theta_{bayes})} - 1$. The quantities shown are averages and standard deviations (in parentheses) over 100 simulations. The relative regret is infinite for all methods on prior 6, as the Bayes risk is 0.

The NPEB methods do well on all the priors. The mixture model does best. Both the mixture model and the parametric EB method essentially achieve the Bayes error on priors 1, 2 (the smooth unimodal priors) and 6 (the point mass), and the mixture model is much better on the rest. The log-spline estimator is not as good, but it does remarkably well for an off-the-shelf estimator. It is a bit worse than the parametric EB method and mixture model on priors 1, 2, 6. On the rest, it trails the mixture model but outperforms the other estimators.

We can understand the relative performance of the parametric and nonparametric EB methods by considering their performance as density estimators. The parametric EB method can approximate the marginal density when the prior is smooth, but fails for point priors. The connection between regret and marginal density estimation means that the poor density estimation performance translates into poor estimates of $\theta$. The mixture and log-spline methods, on the other hand, can fit all the marginal densities well.

These results suggest that our tempered NPEB method can match a conjugate-prior parametric EB approach on smooth unimodal priors, and substantially outperform it when the prior is bimodal or sparse. Using the off-the-shelf log-spline yields good performance, but we can improve by using an appropriate density estimator for the problem, in this case the mixture model. We will see similar results for the normal means problem in the next chapter.

3.4 Implied Densities and General Estimators

In this section, we consider the link between estimating θ and estimating fG more generally.


3.4.1 Implied Densities

Consider a general estimator $t(z)$. $t$ can be expressed as
\[
t(z) = -\frac{f_0'}{f_0} + \frac{f_t'}{f_t}
\]
where $\log f_t = \int_0^z \left(t(x) + \frac{f_0'(x)}{f_0(x)}\right)dx$. We can view $t$ as coming from Robbins' formula with $f_t$ plugged in as an estimate of $f_G$; remember that we are conditioning on $z_1, \ldots, z_N$, so both $t$ and $f_t$ can depend on these previous observations. Roughly speaking, $f_t$ is the marginal density of $z$ that would make $t$ Bayes. It is important to remember, though, that $f_t$ may not be a proper density. Working backward from an estimator to a marginal density using Robbins' formula has previously been used to prove admissibility results in the normal case [Brown and Zhao, 2009, Berger and Srinivasan, 1978].

We illustrate implied marginals with a few examples.

Example 1 (Normal Location). Consider the normal location problem, $f_\theta = N(\theta, 1)$. A linear estimator $t(z) = \lambda z$ corresponds to a normal implied marginal. Hard thresholding with threshold $\lambda$ gives an improper implied marginal that is normal for $|z| \le \lambda$ and flat for $|z| \ge \lambda$. Soft thresholding gives a Huber implied marginal
\[
f_t(z) =
\begin{cases}
\exp\left(-z^2/2\right) & |z| \le \lambda \\
\exp\left(-\lambda|z| + \lambda^2/2\right) & |z| \ge \lambda
\end{cases}
\]
that is, normal for $|z| \le \lambda$, exponential for $|z| \ge \lambda$. Finally, the MLE $t(z) = z$ has a flat implied marginal. These densities are shown in Figure 3.2.
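The implied marginal is easy to compute numerically for any estimator. In the normal family $-f_0'/f_0 = z$, so $\log f_t(z) = \int_0^z (t(x) - x)\,dx$; the short R check below (our own code, with $\lambda = 2$) recovers the Huber form above.

```r
# Implied log-density of an estimator t in the normal case, scaled so f_t(0) = 1.
soft <- function(x, lam = 2) sign(x) * pmax(abs(x) - lam, 0)
log_ft <- function(z, t_fun) {
  g <- function(x) t_fun(x) - x
  sapply(z, function(zz)
    if (zz >= 0) integrate(g, 0, zz)$value else -integrate(g, zz, 0)$value)
}

z     <- seq(-4, 4, by = 0.5)
huber <- ifelse(abs(z) <= 2, -z^2 / 2, 2^2 / 2 - 2 * abs(z))
max(abs(log_ft(z, soft) - huber))   # numerically ~ 0
```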

Example 2 (Regularization). We can use implied marginal densities to understand the role of tempering in our NPEB method. If we fix $\hat f(0) = f_{t_\rho}(0) = 1$, the implied density of $t_\rho$ is
\[
\log f_{t_\rho} = \int_0^z \frac{\hat f'}{\hat f \vee \rho}\,dx.
\]
For simplicity, suppose that $\hat f$ is unimodal with a mode at 0 and $\rho$ is small. Then $f_{t_\rho}$ is the same as $\hat f$ near 0. After $\hat f$ drops below $\rho$, $f_{t_\rho}$ decreases slower than $\hat f$. Finally, $f_{t_\rho}$ becomes flat as $z \to \infty$. Tempering effectively forces $\hat f$ to have a heavy tail. The effect of tempering a normal $\hat f$ in the normal case is shown in Figure 3.3.

3.4.2 Regret Bounds

We showed that our NPEB method can take a good estimate of $f_G$ and give a good estimator of $\theta$. We now show the reverse is true: Theorem 2 shows that for $t$ to have low regret, $f_t$ must be close to $f_G$. Since $f_t$ is only determined up to scale, we need to choose a scaling. One approach is to simply fix $f_t(z_0) = f_G(z_0)$ at some arbitrary point $z_0$.

[Figure 3.2 here: implied log-densities for the linear ($0.1z$), hard-threshold, soft-threshold, and MLE estimators, plotted against $z$.]

Figure 3.2: Implied densities for the normal case, scaled so that $f_t(0) = 1$ and plotted on the log scale. The thresholding methods both had threshold 2.

[Figure 3.3 here: implied marginal densities under tempering, plotted against $z$.]

Figure 3.3: Implied marginal densities after tempering in the normal case. The heavy curve is the implied marginal for $t = \frac{3}{4}z$, which is $f_t = N(0, 4)$. The other curves show the effect of tempering at various $\rho$.

Theorem 2. Let $F_G$ be the cdf of $z$ under $G$ and
\[
P(z) =
\begin{cases}
\dfrac{F_G(z)}{f_G(z)} & z \le z_0 \\[1ex]
\dfrac{1 - F_G(z)}{f_G(z)} & z \ge z_0.
\end{cases}
\]
Then if $f_t$ is scaled so $f_t(z_0) = f_G(z_0)$,
\[
\int \left|\log\frac{f_G}{f_t}\right| f_G\,dz \le \left(\int P(z)^2 f_G(z)\,dz\right)^{\frac{1}{2}} \Delta(t, G)^{\frac{1}{2}},
\]
where $\Delta(t, G) = E_G\left((t_G(z) - t(z))^2\right)$ is the regret of $t$.

Alternatively, if $f_t$ is integrable and scaled to be a density, then a simple modification of Theorem 2 shows that
\[
D_{KL}(f_t \| f_G) \le \inf_{z_0}\left[\left(\int P(z)^2 f_G(z)\,dz\right)^{\frac{1}{2}} \Delta(t, G)^{\frac{1}{2}} + \log\frac{f_t(z_0)}{f_G(z_0)}\right]
\]
where $D_{KL}(f \| g) = \int \log\frac{g}{f}\,g\,dz$ is the Kullback-Leibler divergence. Versions of Theorem 2 also hold for our tempered NPEB method.

Theorem 2 has two interesting statistical aspects. The first is motivational. Every estimator $t$ with low regret corresponds to an $f_t$ that is close to $f_G$, plugged into Robbins' formula. The lower the regret, the closer $f_t$ must be to $f_G$, up to scale. This suggests that if we restrict our attention to EB methods based on estimates of $f_G$, we do not overlook different techniques with low regret.

The second aspect is diagnostic. Given an estimator $t$, we can work out $f_t$, and see if it matches the observed distribution of the data. A glaring mismatch indicates that $t$ has high regret. For example, consider soft thresholding in the normal case. The soft thresholding estimator has a Huber implied marginal. If $z_1, \ldots, z_N$ look unlikely to have come from a Huber density, Theorem 1 suggests that soft thresholding will be outperformed by methods that match $f_G$ more closely.

Taken together, Theorems 2 and 1 say that to estimate $\theta$ well, it is necessary and sufficient to estimate $f_G$ well. If $\hat f$ is a good estimate of $f_G$, Theorem 1 shows $t_\rho$ is a good estimator of $\theta$. Conversely, if $t$ is any good estimator of $\theta$, Theorem 2 shows that its implied marginal $f_t$ must be close to $f_G$.

This argument has an inelegant aspect. If $t$ is a good estimator, we can guarantee the accuracy of $f_t$, the marginal corresponding to using Robbins' formula untempered. On the other hand, if $\hat f = f_t$ is a good estimate of $f_G$, we can guarantee the performance of $t_\rho$, the tempered NPEB estimator - but the implied marginal of $t_\rho$ is no longer $f_t$.


Corollary 3 resolves this small point, at the cost of an extra term depending on $t$ in the upper bound on the regret. Ignoring the technical details, it says that for a general estimator $t$, $t$ has low regret if and only if $f_t$ is close to $f_G$.

Corollary 3. Let $t$ be an estimator and $f_t$ its implied density, scaled so $f_t(z_0) = f_G(z_0)$. Suppose the conditions of Theorem 1 are satisfied. Then
\[
\Delta(t, G)^{\frac{1}{2}} \le \inf_\rho\left[\frac{1}{\rho}\|f_G\|_\infty^{\frac{1}{2}}\left(\sqrt{\frac{5}{2\pi}}\,L^{-1}\left(d(f_t, f_G)^2\right) + A(\rho)\right)d(f_t, f_G) + T(\rho, f_G) + T(\rho, f_t)\right]
\]
and
\[
\Delta(t, G)^{\frac{1}{2}} \ge \left(\int P(z)^2 f_G(z)\,dz\right)^{-\frac{1}{2}} \int \left|\log\frac{f_G}{f_t}\right| f_G\,dz.
\]

3.5 Summary

In this chapter, we proposed a tempered NPEB method based on estimates $\hat f$, $\hat f'$ of the marginal density $f_G$ and its derivative $f_G'$. We proved that our method will perform well if $\hat f$ and $\hat f'$ are good estimates. We illustrated our method on the simultaneous chi-squared estimation problem by specializing our theoretical results and showing that our method performs well empirically.

Next, we considered the relationship between estimating $\theta$ and estimating $f_G$ more generally. We defined the concept of an implied marginal density, and showed that if a general estimator $t$ has low regret, its implied marginal $f_t$ must be close to the true marginal. In fact, roughly speaking, $t$ has low regret if and only if $f_t$ is close to $f_G$. This gave us a useful diagnostic for general estimators.

The connection between estimating $\theta$ and estimating $f_G$ is an important motivation for mixture models. It shows us that any modeling method that matches the marginal density of the data will give us a good estimator of $\theta$. Although mixture models have many statistical drawbacks, they make good density estimators. This chapter's results tell us that mixture models' drawbacks do not matter in estimating $\theta$.

3.5.1 Extensions

Many questions remain. First, we have only considered continuous exponential families. Robbins [1954] himself was largely concerned with discrete families, such as the Poisson, but Robbins' formula as presented here fails for discrete families. Robbins, however, was able to find similar formulas for the discrete families he studied, still expressing Bayes estimators in terms of the marginal distributions. Our approach may extend to these cases, even if this chapter's techniques do not.

Second, our method only applies to squared-error loss, but we are often interested in other loss functions. In the chi-squared problem, for example, Berger [1980] suggests scaled loss functions of the form $\theta^m\left(1 - \frac{\hat\theta}{\theta}\right)^2$. Our results depend heavily on Robbins' formula to express the posterior mean in terms of $f_G$ and $f_G'$, but Bayes estimators for other loss functions are not so cleanly expressible. On the other hand, higher order versions of Robbins' formula express the posterior cumulants of $\theta$ in terms of $f_G$ and its derivatives. Our results may extend to losses for which the Bayes estimator is approximately a function of the first few posterior cumulants.

3.6 Proofs

3.6.1 Proof of Lemma 1

Let $\|g\|_h = \left(\int g^2 h\,dz\right)^{\frac{1}{2}}$. We have
\[
\Delta(t_\rho, G)^{\frac{1}{2}} = \left\|\frac{\hat f'}{\hat f \vee \rho} - \frac{f_G'}{f_G}\right\|_{f_G} \le \left\|\frac{\hat f'}{\hat f \vee \rho} - \frac{f_G'}{f_G \vee \rho}\right\|_{f_G} + \left\|\frac{f_G'}{f_G} - \frac{f_G'}{f_G \vee \rho}\right\|_{f_G}.
\]
The second term is the tempering term: $\left\|\frac{f_G'}{f_G} - \frac{f_G'}{f_G \vee \rho}\right\|_{f_G}^2 = \int \left(1 - \frac{f_G}{\rho}\right)_+^2\left(\frac{f_G'}{f_G}\right)^2 f_G\,dz$. The first term is
\[
\begin{aligned}
\left\|\frac{\hat f'}{\hat f \vee \rho} - \frac{f_G'}{f_G \vee \rho}\right\|_{f_G}
&= \left\|\frac{\hat f' - f_G'}{\hat f \vee \rho} - \frac{f_G'\left(\hat f \vee \rho - f_G \vee \rho\right)}{(f_G \vee \rho)\left(\hat f \vee \rho\right)}\right\|_{f_G} \\
&\le \left\|\frac{\hat f' - f_G'}{\hat f \vee \rho}\right\|_{f_G} + \left\|\frac{f_G'\left(\hat f \vee \rho - f_G \vee \rho\right)}{(f_G \vee \rho)\left(\hat f \vee \rho\right)}\right\|_{f_G} \\
&\le \frac{1}{\rho}\left\|\hat f' - f_G'\right\|_{f_G} + \frac{A(\rho)}{\rho}\left\|\hat f \vee \rho - f_G \vee \rho\right\|_{f_G}.
\end{aligned}
\]
Using $\left|\hat f \vee \rho - f_G \vee \rho\right| \le \left|\hat f - f_G\right|$ completes the proof.


3.6.2 Proof of Theorem 1

Proof. We first bound $\int \left(\hat f' - f_G'\right)^2 dz = \left\|\hat f' - f_G'\right\|_1^2$ (this is the $L_2$ norm with weight 1, not the $L_1$ norm).
\[
\begin{aligned}
\left\|\hat f' - f_G'\right\|_1^2
&= \frac{1}{2\pi}\int u^2\left(\hat f^* - f_G^*\right)^2 du \\
&\le \frac{1}{2\pi}\left(\int a^2\left(\hat f^* - f_G^*\right)^2 du + \int_{|u| \ge a} u^2\left(\hat f^* - f_G^*\right)^2 du\right) \\
&= a^2\left(\left\|\hat f - f_G\right\|_1^2 + \frac{1}{a^2}\int_{|u| \ge a} u^2\left(\hat f^* - f_G^*\right)^2 du\right) \\
&\le a^2\left(\left\|\hat f - f_G\right\|_1^2 + 4\,\frac{1}{a^2}\int_{|u| \ge a} u^2 H(u)^2\,du\right) \\
&= a^2\left(\left\|\hat f - f_G\right\|_1^2 + 4L(a)\right)
\end{aligned}
\]
for all $a \ge C$. We know $L(a) \to 0$ as $a \to \infty$ and $L$ is monotone. Let $a = L^{-1}\left(\left\|\hat f - f_G\right\|_1^2\right)$, or if $L < \left\|\hat f - f_G\right\|_1^2$, take $a = C$. Then
\[
\left\|\hat f' - f_G'\right\|_1^2 \le \frac{5}{2\pi}\,L^{-1}\left(\left\|\hat f - f_G\right\|_1^2\right)^2\left\|\hat f - f_G\right\|_1^2.
\]
Now plug this bound into Lemma 1. For any function $g$, we have $\|g\|_{f_G} \le \|f_G\|_\infty^{\frac{1}{2}}\|g\|_1$. Thus
\[
\begin{aligned}
\Delta(t_\rho, G)^{\frac{1}{2}}
&\le \frac{1}{\rho}\left\|\hat f' - f_G'\right\|_{f_G} + \frac{A(\rho)}{\rho}\left\|\hat f - f_G\right\|_{f_G} + T(\rho, f_G) \\
&\le \frac{1}{\rho}\|f_G\|_\infty^{\frac{1}{2}}\left(\sqrt{\frac{5}{2\pi}}\,L^{-1}\left(\left\|\hat f - f_G\right\|_1^2\right) + A(\rho)\right)\left\|\hat f - f_G\right\|_1 + T(\rho, f_G).
\end{aligned}
\]


3.6.3 Proof of Corollary 1

We have
\[
|f_G^*(u)| \le \int |f_\theta^*(u)|\,dG = \int\left(1 + \frac{4u^2}{\theta^2}\right)^{-\frac{k}{4}} dG \le \left(1 + \frac{4u^2}{m^2}\right)^{-\frac{k}{4}} + P(\theta \ge m).
\]
Take $m = u^\alpha$. Then $P(\theta \ge m) = O\left(u^{-(1-\alpha)\frac{k}{2}}\right)$, and
\[
|f_G^*(u)| \le \left(1 + 4u^{2 - 2\alpha}\right)^{-k/4} + P(\theta \ge m) = O\left(u^{-(1-\alpha)\frac{k}{2}}\right).
\]
So for some constants $E, M$, $|f_G^*(u)|, |\hat f^*(u)| \le M u^{-(1-\alpha)\frac{k}{2}}$ for all $|u| \ge E$. Thus we can take $H(u) = M u^{-(1-\alpha)\frac{k}{2}}$. Also, note $\|f_G\|_\infty \le \int \|f_\theta\|_\infty\,dG = E_G(\theta)\,\|f_1\|_\infty$. Plugging into Theorem 1 completes the proof.

3.6.4 Proof of Corollary 2

In this proof, $C$ will denote a generic constant, not necessarily the same from line to line. If $\theta$ is bounded, we can take $A(\rho) = \frac{1}{\rho}\sup_\theta \|f_\theta'\|_\infty \le \frac{C}{\rho}$. Also, $P(\theta \ge m) = 0$ for $m$ large, so we can take any $\alpha < 1 - \frac{4}{k}$. Then by Corollary 1,
\[
\Delta(t_\rho, G)^{\frac{1}{2}} \le C\left[\frac{1}{\rho}\,d(\hat f, f_G)^{1 - \frac{2}{1 + (1-\alpha)k}} + \frac{1}{\rho^2}\,d(\hat f, f_G) + T(\rho, f_G)\right].
\]

Now we find the order of the tempering term. We have
\[
T(\rho, f_G)^2 = \int \left(1 - \frac{f_G}{\rho}\right)_+^2\left(\frac{f_G'}{f_G}\right)^2 f_G\,dz \le P_G(f_G(z) < \rho)^{\frac{1}{2}}\,E_G\left(\left(\frac{f_G'}{f_G}\right)^4\right)^{\frac{1}{2}}
\]
so we need to bound $P_G(f_G(z) < \rho)$. If $\rho$ is small, $f_G(z) < \rho$ if $z$ is large or small. Consider the case of large $z$. For $z$ sufficiently large, $f_G(z) \ge f_{M_1}(z) = C_k z^{k/2 - 1} M_1^{k/2}\exp(-M_1 z/2)$. Let $z_0$ be the largest $z$ such that $\rho = f_{M_1}(z_0)$. Then
\[
\begin{aligned}
P_G(f_G(z) < \rho,\ z \text{ large}) &\le P_G(f_{M_1}(z) \le \rho,\ z \text{ large}) \\
&= P_{M_2}(f_{M_1}(z) \le \rho,\ z \text{ large}) \\
&= P_{M_2}(z \ge z_0) \\
&\approx C z_0^{\frac{k}{2}-1}\exp(-M_2 z_0/2) \\
&= C\rho\exp((M_1 - M_2)z_0) \\
&\le C\rho
\end{aligned}
\]
using the asymptotic expansion $\lim_{x \to \infty} \frac{\int_x^\infty t^{s-1}\exp(-t)\,dt}{x^{s-1}\exp(-x)} = 1$. Similarly $P_G(f_G(z) < \rho,\ z \text{ small}) \le C\rho$. So $T(\rho, f_G) = O\left(\rho^{\frac{1}{2}}\right)$. Now choose $\rho = O\left(\left\|\hat f - f_G\right\|_1^{2/5}\right)$. Then $\Delta(t_\rho, G) = O\left(\left\|\hat f - f_G\right\|_1^{2/5}\right)$.

3.6.5 Proof of Theorem 2

Note that $(\log f_t)' = t + \frac{f_0'}{f_0}$, so $(\log f_G - \log f_t)' = t_G - t$, and $\log\frac{f_G}{f_t} = \int_{z_0}^z (t_G - t)(s)\,ds$. The rest of the proof is a simple application of Fubini's theorem. We have
\[
\int \left|\log\frac{f_G}{f_t}\right| f_G\,dz \le \int_{-\infty}^{\infty}\int_{z_0}^{z} |t_G - t|(s)\,f_G(z)\,ds\,dz.
\]
The integrand is positive, so we can apply Fubini's theorem:
\[
\int_{-\infty}^{\infty}\int_{z_0}^{z} |t_G(s) - t(s)|\,f_G(z)\,ds\,dz = \int_{z_0}^{\infty} |t_G(s) - t(s)|(1 - F_G(s))\,ds + \int_{-\infty}^{z_0} |t_G(s) - t(s)|\,F_G(s)\,ds = \int_{-\infty}^{\infty} |t_G(s) - t(s)|\,P(s)\,f_G(s)\,ds.
\]
Applying Cauchy-Schwarz finishes the proof.

3.6.6 Proof of Corollary 3

We have $\left\|t - t_\rho\right\|_{f_G} = T(\rho, f_t)$ where $t_\rho = -\frac{f_0'}{f_0} + \frac{f_t'}{f_t \vee \rho}$. Now use the regret bounds and minimize over $\rho$.

Chapter 4

Mixture Models for Testing and Estimation

4.1 Introduction and Setup

This chapter proposes a mixture model approach to the basic high-dimensional estimation and testing problem considered in Chapter 3. We have $N$ parallel cases, each with some effect size $\theta_i$, and observe a measurement $z_i \sim f_{\theta_i}$ independently for each case. We want to estimate how big each effect is and zero in on the few cases of interest with nonzero $\theta$. To do this, we must estimate $\theta_i$ and either the local false discovery rate, $\mathrm{fdr}(z) = P(\theta_i = 0 \mid z_i)$, or the tail-area false discovery rate, $\mathrm{FDR}(z) = P(\theta_i = 0 \mid |z_i| \ge z)$. Chapter 3 tells us that doing this is roughly equivalent to estimating the marginal distribution of $z$.

We present a mixture model empirical Bayes method to solve this problem in Section 4.2. A simple hierarchical model lets us estimate effect sizes and false discovery rates in a flexible, conceptually neat way. The approach works for general exponential families $f_\theta$, and can estimate an empirical null. We illustrate the method for binomial data in Section 4.2.5. Simulation results in Section 4.3 show that the method performs well on normal data: it estimates $\theta$ nearly as well as the Bayes rule, and is a slightly better fdr estimator than existing methods.

4.2 Mixture Model

We continue to work with a Bayesian model
\[
\theta \sim G, \qquad z \mid \theta \sim f_\theta \tag{4.1}
\]
independently for $i = 1, \ldots, N$. We estimate $G$ and use the estimated prior to estimate $E(\theta \mid z)$, $\mathrm{fdr}(z)$ and $\mathrm{FDR}(z)$. We assume that $f_\theta$ is an exponential family, but not necessarily continuous.

4.2.1 A Mixture Model

We model $G$ as a mixture of $J$ priors $G_j$:
\[
G(\theta) = \sum_{j=0}^{J-1} \pi_j G_j(\theta). \tag{4.2}
\]
The priors $G_j$ are taken from some parametric family of priors for $\theta$, and each has a hyperparameter vector $\alpha_j$.

We usually believe that the $\theta_i$ are sparse, that is, many $\theta_i = 0$. This gives the marginal distribution $f_G(z)$ a large null component $f_0$. To model this, we fix $\alpha_0$ so that $G_0$ is a point mass at 0 and think of the 0th mixture component as null. We estimate the other parameters $\alpha_j$ and the mixture proportions $\pi_j$ by marginal maximum likelihood via the EM algorithm (details in Subsection 4.2.2).

We can choose any family of priors as long as we can calculate the posteriors, and the family is rich enough to model $G$ nonparametrically given enough components. With such a family, we can go from a strongly parametric model to a nearly nonparametric model by increasing $J$. We will use conjugate priors for computational convenience; Dalal and Hall [1983] show that conjugate prior mixtures can approximate any prior.

The mixture model makes it easy to calculate $E(\theta \mid z)$, $\mathrm{fdr}(z)$ and $\mathrm{FDR}(z)$. Let $f_{G_j} = \int f_\theta\,dG_j(\theta)$ be the $j$th group marginal, so the marginal distribution of $z$ is $f_G(z) = \sum \pi_j f_{G_j}(z)$, and let $F_{G_j}$ and $F_G$ be the corresponding cdfs. Let
\[
p_j(z) = \frac{\pi_j f_{G_j}(z)}{f_G(z)}
\]
be the posterior probability that $(\theta, z)$ came from group $j$, and $G_j(\theta \mid z)$ be the posterior corresponding to prior $G_j$. Then under the mixture model, it is easy to show the posterior distribution is a mixture:
\[
dG(\theta \mid z) = \frac{f_\theta(z)}{f_G(z)}\sum_{j=0}^{J-1} \pi_j\,dG_j(\theta) = \sum_{j=0}^{J-1} p_j(z)\,\frac{f_\theta(z)}{f_{G_j}(z)}\,dG_j(\theta) = \sum_{j=0}^{J-1} p_j(z)\,dG_j(\theta \mid z). \tag{4.3}
\]

In particular, this gives us our estimates:
\[
\begin{aligned}
\mathrm{fdr}(z) &= p_0(z) \\
\mathrm{FDR}(z) &= \frac{\pi_0\left(1 - F_{G_0}(z) + F_{G_0}(-z)\right)}{1 - F_G(z) + F_G(-z)} \\
E(\theta \mid z) &= \sum_{j=0}^{J-1} p_j(z)\,E_{G_j}(\theta \mid z).
\end{aligned}
\]
Other quantities, like the posterior variance $\mathrm{Var}(\theta \mid z)$, can be calculated easily using equation 4.3.

This model can accommodate empirical nulls by penalizing the mixture proportions and allowing

the null component $G_0$ to vary. As Efron [2008] points out, in many microarray data sets, it is not true that most $z \sim f_0$. This makes the theoretical null inappropriate; instead, Efron suggests fitting an empirical null to the center of the data, so that most $z$ have the empirical null distribution. The mixture model gives us a simple way to do this: allow $G_0$ to vary, but insist that $\pi_0$ is large. We can force $\pi_0$ to be large by putting a prior on $\pi$. A $\mathrm{Dirichlet}(\beta)$ prior on $\pi$ works well and is convenient for fitting; we used $\beta$ of the form $(P, 0, 0, \ldots, 0)$, with the choice of $P$ described later. Penalizing $\pi$ can be useful even for the theoretical null - it stabilizes the $\pi$ and $\alpha$ estimates by mitigating the effect of the likelihood's multiple local maxima. We will see in Subsection 4.2.3 that this is not too important for estimating $\theta$, but can be important when estimating $\mathrm{fdr}(z)$ and $\mathrm{FDR}(z)$ under an empirical null.

The Bayesian model, equation 4.1, is unnecessary if we are only interested in false discovery rates. False discovery rate methods, local or tail-area, only depend on the marginal density of the data. They estimate this marginal density directly, while we estimate it through the Bayesian model. In fdr/FDR estimation terms, using the Bayesian model this way amounts to restricting our marginal estimate to be one that can be realized from the model. For example, if $f_\theta$ is $N(\theta, 1)$, using the Bayesian model can never give a density estimate of $N\left(0, \frac{1}{2}\right)$. Our modeling approach is more efficient than modeling the marginal directly when the Bayesian model is true, but will fail when the Bayesian model fails.

4.2.2 Fitting and Parameter Choice

We fit the mixture model's parameters using marginal maximum likelihood and the EM algorithm. Let $g_{ij}$ be an indicator of group membership - $g_{ij} = 1$ if $z_i$ comes from group $j$. Then our penalized log-likelihood is
\[
L(z, g_{ij}) = \sum_{i=1}^{N}\sum_{j=0}^{J-1} g_{ij}\left(\log\pi_j + \log f_{G_j}(z_i)\right) - \log D(\pi, \beta)
\]

where $D(\pi, \beta)$ is the $\mathrm{Dirichlet}(\beta)$ distribution. The E-step of the EM algorithm is simple:
\[
E(L \mid z, \alpha, \pi) = \sum_{i=1}^{N}\sum_{j=0}^{J-1} p_j(z_i)\left(\log\pi_j + \log f_{G_j}(z_i)\right) - \log D(\pi, \beta),
\]
where $p_j(z_i) = \pi_j f_{G_j}/f_G = E_{\alpha,\pi}(g_{ij} \mid z)$.

For the M-step, the update for $\pi$ is easy to calculate:
\[
\pi_j = \frac{\sum_{i=1}^{N} p_j(z_i) + \beta_j}{\sum_{j=0}^{J-1}\sum_{i=1}^{N} p_j(z_i) + \sum \beta_j}.
\]

The update for the prior hyperparameters $\alpha_j$ will be different for each case. In general, $\alpha_j = \arg\max \sum_{i=1}^{N} p_j(z_i)\log f_{G_j}(z_i; \alpha_j)$. For example, if we use a normal mixture prior for the normal means problem with $G_j = N(\mu_j, \sigma_j)$ and $\alpha_j = (\mu_j, \sigma_j)$, a straightforward calculation shows that
\[
\mu_j = \frac{\sum_{i=1}^{N} p_j(z_i)\,z_i}{\sum_{i=1}^{N} p_j(z_i)}, \qquad \sigma_j^2 = \left(\frac{\sum_{i=1}^{N} p_j(z_i)\,z_i^2}{\sum_{i=1}^{N} p_j(z_i)} - \mu_j^2 - 1\right)_+.
\]
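A minimal sketch of this M-step in R (our own code; `p` is the $N \times J$ matrix of E-step membership probabilities):

```r
# One M-step for the normal-means mixture, following the displayed updates.
m_step_normal <- function(z, p) {
  mu <- colSums(p * z) / colSums(p)     # weighted group means
  s2 <- sapply(seq_len(ncol(p)), function(j)
    max(sum(p[, j] * z^2) / sum(p[, j]) - mu[j]^2 - 1, 0))  # (.)_+ truncation
  list(mu = mu, sigma2 = s2)
}
```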

The EM algorithm converges to a local maximum of the likelihood, but the starting point can be important. It is important to start with reasonable parameters $\alpha, \pi$ that reflect the desired final model; what is reasonable will change depending on the data set and family $f_\theta$. In general, the convergence is fairly quick, and each EM iteration can be done quickly. If we wanted to speed up convergence, for example, in very large data sets where each pass over the data is expensive, we could switch to Newton's method when we get close to convergence.

We have used the following initialization procedure with good results on the sparse normal means problem. First, order the data. Next, divide the data into $J$ groups, putting the central 75% of the data into the 0th group, and the rest of the data into $J - 1$ equal groups. For $J = 5$, for example, this would give groups with the lowest 12.5%, the second lowest 12.5%, the middle 75%, the second highest 12.5%, and then the highest 12.5% of the data points. Finally, let $\pi_j$ be the proportion of the data points in each group, $\mu_j$ the group means, and $\sigma_j^2 = (s_j^2 - 1)_+$, where $s_j$ is the standard deviation within each group. This choice of starting point corresponds to what we want to see in the final model - $J$ relatively distinct groups, with a big central group. Using this starting point along with a few random starts and "almost all null" starts gives good results.

4.2.3 Identifiability Concerns

Mixture models can be nearly unidentifiable, and this unidentifiability makes them notoriously unstable. Our method is not affected by this problem if we are estimating $\theta$ or using a theoretical null. Robbins' formula shows that our estimator $E_G(\theta \mid z)$ only depends on $G$ through the marginal $f_G$ and its derivative $f_G'$ (Robbins' formula only holds for continuous exponential families, but $E_G(\theta \mid z)$ can be written in terms of $f_G$ for some discrete families as well). The same is nearly true for fdr/FDR estimation under a theoretical null. Since $\widehat{\mathrm{fdr}}(z) = \pi_0 f_0/f_G$, and $\pi_0$ is typically close to 1, the fdr estimate approximately only depends on $G$ through $f_G$ (the same argument holds for the FDR estimate). This means that our estimates of $\theta$ and theoretical null fdr/FDR estimates are not affected by the near-unidentifiability of the mixture model. Although different priors can yield very similar marginals, our estimates only depend on the prior through the marginals, which are stable.

Two concerns remain. First, there may be fitted components of the marginal that are nearly, but not exactly null. Penalizing the likelihood usually mitigates this problem, since it encourages the null to absorb these nearly-null groups. If we still see nearly null groups, though, we need to decide whether to include these components in the null when estimating fdr's and FDR's. Efron [2004] argues that the answer depends on whether the nearly null components are still interesting in the presence of strongly null components. The nearly null components, however, are usually highly unstable and sensitive to tuning parameters. It is thus usually best to include the nearly null components in the null. If the components are insensitive to parameter choice, though, Efron's answer is correct, and the question becomes a scientific one.

Second, and more importantly, identifiability can be a problem for empirical nulls, since they depend on estimating $G_0$, the null subset. This is an unavoidable problem for all empirical null methods. Using an empirical null corresponds to assuming that the center of the data is null, and looking for more points in the tail than we would expect, given the shape of the center. To make this precise, we have to decide how much central data is null. In the mixture model, we can adjust our penalization - the stronger the penalization, the more of the center we assume to be null.

4.2.4 Parameter Choice

This method has two tuning parameters - the penalization parameter $\beta$ and the number of mixture components $J$. The parameters are easy to choose if we are interested in estimating $\theta$ or theoretical null fdr/FDR estimates. Estimating these quantities well is roughly equivalent to estimating $f_G$ well, so we should choose the parameters to best fit the marginal.

Because the marginal fit of the mixture model is relatively insensitive to parameters, the exact choice does not matter too much. The literature on mixture models has many methods to choose $J$ [McLachlan and Peel, 2000]; one easy method is to use the Bayes Information Criterion. For most purposes, however, we can just fix $J$. Taking $J = 3$ works particularly well. This choice gives a group each to null, positive effect and negative effect cases. For $\beta$, it is usually best to choose $\beta = (P, 0, 0, \ldots, 0)$, and the exact value of $P$ is not too important.

With empirical nulls, however, $P$ can be more important. A larger $P$ forces a bigger null group, which can widen the null and have a big effect on fdr/FDR estimates. We can choose $P$ with a simple parametric bootstrap calibration scheme. First, list some candidate penalizations $P_1, \ldots, P_K$ (we used 20 penalizations evenly spaced between 100 and $N/2$). Next, fit a preliminary model $m$ to the data using some reasonable default penalization (we used $P = 15N$). Then create perturbed models $m_1, \ldots, m_L$ by changing the null parameters slightly, and possibly changing the alternatives. Finally, sample data sets from the perturbed models and choose the $P_k$ that best estimates the perturbed models from the sample data sets.

4.2.5 Example: Binomial Data

We illustrate the mixture model by predicting Major League Baseball batting averages. The data consist of batting records for Major League Baseball players in the 2005 season. We assume that each player has a true batting average $\theta_i$, and that his hit total $H_i$ is $\mathrm{Binomial}(N_i, \theta_i)$, where $N_i$ is the number of at bats. The goal is to estimate each player's batting average based on the first half of the season. We restrict our attention to players with at least 11 at bats in this period (567 players).

4.2.5.1 Brown’s Analysis

Brown [2008] analyzes the data using a normalizing and variance stabilizing transformation. He transforms the data $(H, N)$ to
\[
X_i = \arcsin\sqrt{\frac{H_i + \frac{1}{4}}{N_i + \frac{1}{2}}},
\]
and the transformed data are approximately normal:
\[
X_i \mathrel{\dot\sim} N\left(\mu_i, \frac{1}{4N_i}\right), \qquad \mu_i = \arcsin\sqrt{\theta_i}.
\]

He estimates µi using the following methods:

• The naive estimator, $\hat\mu_i = X_i$.

• The overall mean, $\hat\mu_i = \bar X$.

• A parametric empirical Bayes method that models $\mu_i \sim N(\mu, \tau^2)$. The prior parameters $\mu$ and $\tau$ are fit either by method of moments or maximum likelihood.

• A nonparametric empirical Bayes method. First, Brown estimates the marginal density of each $X_i$ with a kernel density estimator. Then he uses Robbins' formula to estimate $\mu$.

• The positive part James-Stein estimator.


• A Bayesian estimator that models µi ∼ N (µ, τ2), µ ∼ Unif(R), τ2 ∼ Unif(0,∞).

Finally, Brown estimates the estimation error of these methods using their prediction error on the second half of the season. Let $(\tilde H_i, \tilde N_i)$ be the data for the second half of the season, and $\tilde X_i$ the corresponding transformed values. Brown's error criterion is
\[
TSE = \sum\left[\left(\hat\mu_i - \tilde X_i\right)^2 - \frac{1}{4\tilde N_i}\right]. \tag{4.4}
\]
By construction, $E(TSE) = \sum(\hat\mu_i - \mu_i)^2$. The methods are assessed over all players who had at least 11 at bats in each half of the data (499 players).
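In R, the criterion is a one-liner; this sketch (our naming) takes the first-half estimates and the second-half transformed data:

```r
# Brown's total squared error criterion, equation 4.4.
tse <- function(mu_hat, X2, N2) sum((mu_hat - X2)^2 - 1 / (4 * N2))
```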

4.2.5.2 Mixture Model

We can analyze the data on the original scale using a binomial mixture model. Our model is
\[
\theta_i \sim G, \qquad H_i \mid \theta_i \sim \mathrm{Binomial}(N_i, \theta_i)
\]
and we model $G$ as a mixture of Beta distributions
\[
G(\theta) = \sum_{j=0}^{J} \pi_j\,\mathrm{Beta}(\theta; \alpha_j, \beta_j).
\]
This model makes the marginal distribution of $H_i$ a mixture of Beta-binomial distributions, $f(H_i; N_i) = \sum \pi_j f_{G_j}(H_i; N_i)$. The conjugacy of the Beta prior makes the posterior distributions simple:
\[
g(\theta_i \mid H_i) = \sum_{j=0}^{J} p_j(H_i)\,\mathrm{Beta}\left(\theta_i; \alpha_j + H_i, \beta_j + N_i - H_i\right),
\]
where
\[
p_j(H_i) = \frac{\pi_j f_{G_j}(H_i; N_i)}{f_G(H_i; N_i)}.
\]
We fit the parameters $\pi$, $\alpha$ and $\beta$ by marginal maximum likelihood via the EM algorithm. For easy comparison with Brown's results, we estimate $\mu_i$ by its posterior mean $E(\arcsin\sqrt{\theta} \mid z)$ and evaluate performance using $TSE$.
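A sketch of that posterior-mean computation in R, for one player, given fitted hyperparameters (all names are ours; the beta-binomial marginal is evaluated on the log scale for stability):

```r
# Posterior mean of arcsin(sqrt(theta)) under the fitted Beta mixture.
post_mean_mu <- function(H, N, pi, alpha, beta) {
  logm <- sapply(seq_along(pi), function(j)   # beta-binomial log-marginals
    lchoose(N, H) + lbeta(alpha[j] + H, beta[j] + N - H) - lbeta(alpha[j], beta[j]))
  p <- pi * exp(logm)
  p <- p / sum(p)                             # group memberships p_j(H)
  post <- sapply(seq_along(pi), function(j)   # per-group posterior means
    integrate(function(th) asin(sqrt(th)) * dbeta(th, alpha[j] + H, beta[j] + N - H),
              0, 1)$value)
  sum(p * post)
}
```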

4.2.5.3 Results

Table 4.1 compares the mixture model to Brown's methods. The mixture model is a good performer, but not the best. It performs about 15% worse than the nonparametric empirical Bayes and James-Stein estimators. Brown observes that the number of at bats is correlated with the batting averages - better batters bat more. This violates all methods' assumptions, but has a particularly strong effect on the more parametric methods.

                                      Overall  Pitchers  Non-pitchers
Number of training players              567        81         486
Number of test players                  499        64         435
Naive                                     1         1           1
Group Mean                            0.852     0.127       0.378
Parametric empirical Bayes (Moments)  0.593     0.129       0.387
Parametric empirical Bayes (ML)       0.902     0.117       0.398
Nonparametric Empirical Bayes         0.508     0.212       0.372
Bayesian Estimator                    0.884     0.128       0.391
James-Stein                           0.525     0.164       0.359
Binomial Mixture Model                0.588     0.156       0.314

Table 4.1: Estimated estimation accuracy (equation 4.4) for the methods. The naive estimator is normalized to have error 1. Values for all methods except the binomial mixture model are from [Brown, 2008]. The first column gives the errors on the data as a whole (single model), and the next two give errors for pitchers and non-pitchers considered separately. Standard errors range from 0.05 to 0.2 on non-pitchers, are higher for pitchers, and are in between for the overall data [Brown, 2008].

Splitting the players into pitchers (81 training, 64 test) and non-pitchers (486 training, 435 test) reduces this effect. The results, also in Table 4.1, show that splitting makes the mixture model the best performer for non-pitchers and an average performer for pitchers. Splitting also reduces the differences between the methods. Both Brown's nonparametric empirical Bayes estimator and the binomial mixture model do better on non-pitchers than on pitchers. This is probably because the smaller number of pitchers makes it difficult to estimate the marginal density. Simple simulations show that the binomial mixture model is probably truly better than the other methods for non-pitchers, but no firm conclusions can be drawn about the methods' relative performance on pitchers or the combined data.

The question of whether to split the players or keep pitchers and non-pitchers in the same model points to a deep question about relevance. When does it make sense to think of two entities as coming from the same prior? In our setting, this is a bias-variance tradeoff: by combining pitchers and non-pitchers, we get a more stable estimate, but of a prior that is wrong for both. Efron [2010] discusses this in detail (Chapter 10), and gives a way to combine information across multiple classes to get stable and relevant estimates.

The binomial mixture model has advantages beyond possible performance gains. It removes the need for a normalizing and variance stabilizing transformation by working with the original data. It can estimate any function $h(\theta)$, since $E(h(\theta) \mid z)$ can be calculated numerically. Finally, the mixture prior can be informative. For example, the estimated prior for non-pitchers was a single $\mathrm{Beta}(302, 884)$ distribution, while the estimated pitchers' prior was a mixture of $\mathrm{Beta}(90, 983)$ and $\mathrm{Beta}(219, 928)$ distributions. These prior estimates were stable under different choices of $J$ and starting points for the EM algorithm. This could indicate that non-pitchers are about the same across the league, but pitchers come in two different types.


4.3 Normal Performance

In this section we shall see that the mixture model performs very well on the normal means problem. We model the prior $G$ as a normal mixture:
\[
dG(\theta) = \sum_{j=0}^{J-1} \pi_j\,\varphi(\theta; \mu_j, \sigma_j^2),
\]
where $\varphi(x; \mu, \sigma^2)$ is the $N(\mu, \sigma^2)$ density. This model makes the marginal $f_G$ a normal mixture, $f_G(z) = \sum \pi_j\,\varphi(z; \mu_j, \sigma_j^2 + 1)$. We can use a theoretical null by fixing $\mu_0 = 0$, $\sigma_0 = 0$, or an empirical null by letting them vary. The posterior quantities are
\[
\begin{aligned}
G(\theta \mid z) &= \sum_{j=0}^{J-1} p_j(z)\,\varphi\left(\theta;\ \frac{1}{\sigma_j^2 + 1}\mu_j + \frac{\sigma_j^2}{\sigma_j^2 + 1}z,\ \frac{\sigma_j^2}{\sigma_j^2 + 1}\right) \\
\mathrm{fdr}(z) &= p_0(z) \\
E(\theta \mid z) &= \sum p_j(z)\left(\frac{1}{\sigma_j^2 + 1}\mu_j + \frac{\sigma_j^2}{\sigma_j^2 + 1}z\right),
\end{aligned}
\]
where
\[
p_j(z) = \frac{\pi_j\,\varphi(z; \mu_j, \sigma_j^2 + 1)}{f_G(z)}.
\]
The parameters $\pi$, $\mu$ and $\sigma$ are estimated by marginal maximum likelihood via the EM algorithm, as detailed in Subsection 4.2.2. We used a $\mathrm{Dirichlet}(P, 0, \ldots, 0)$ penalty on $\pi$ to stabilize the model. Code to fit the normal mixture model is implemented in an R package "mixfdr," available from CRAN.
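Given fitted parameters (however obtained), these posterior quantities take only a few lines of R. The sketch below uses our own naming and is not the mixfdr interface:

```r
# Posterior quantities for a fitted normal mixture; z is a vector, and
# pi/mu/sigma are the fitted parameters with column 1 (the text's group 0) null.
posterior_normal_mix <- function(z, pi, mu, sigma) {
  J  <- length(pi)
  fj <- sapply(1:J, function(j) pi[j] * dnorm(z, mu[j], sqrt(sigma[j]^2 + 1)))
  pj <- fj / rowSums(fj)                 # membership probabilities p_j(z)
  w  <- sigma^2 / (sigma^2 + 1)          # per-group shrinkage factors
  list(fdr      = pj[, 1],               # local false discovery rate
       postmean = drop(pj %*% (mu / (sigma^2 + 1))) + z * drop(pj %*% w))
}
```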

4.3.1 Effect Size Estimation

We investigate the effect size estimation performance of the normal mixture model with simulations closely based on those done by Johnstone and Silverman [2004]. We generate $z_i \sim N(\theta_i, 1)$, for $i = 1, \ldots, N = 1000$. The goal is to minimize the squared error $\sum(\hat\theta_i - \theta_i)^2$ in estimating $\theta_i$ based on $z$. For our simulations, $K$ of the $\theta_i$ were nonzero. In the one-sided scenarios, the nonzero $\theta_i$ were iid $\mathrm{Unif}(\mu - \frac{1}{2}, \mu + \frac{1}{2})$, and in the two-sided scenarios, two-thirds of the $\theta_i$ were $\mathrm{Unif}(\mu - \frac{1}{2}, \mu + \frac{1}{2})$ and one-third were $\mathrm{Unif}(-\mu - \frac{1}{2}, -\mu + \frac{1}{2})$. We used different values of $K$ and $\mu$ to simulate different combinations of sparsity and effect strengths.
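For concreteness, here is a small R sketch generating one such data set (our code; it draws the signs randomly in the stated 2:1 proportion, whereas the text uses exact thirds):

```r
# One two-sided scenario: K nonzero effects among N, observed with N(0,1) noise.
make_scenario <- function(N = 1000, K = 50, mu = 3, two_sided = TRUE) {
  theta <- numeric(N)
  signs <- if (two_sided) sample(c(1, -1), K, replace = TRUE, prob = c(2/3, 1/3)) else 1
  theta[sample(N, K)] <- signs * runif(K, mu - 1/2, mu + 1/2)
  list(theta = theta, z = rnorm(N, theta))
}
```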

We compare the mixture model to the following effect size estimation methods (the abbreviationsin parentheses are used in the plots):

• A nonparametric log-spline method used by Efron [2009a] ("Spline"). This method fits the density as $\exp(h(z))$, where $h$ is a 5th degree natural cubic spline. This yields an estimator $t(z) = z + h'(z)$.

• EBayesThresh, a parametric empirical Bayes method proposed by Johnstone and Silverman [2004] ("EBThresh"). This method fits the prior as a mixture of a double-exponential density and a point mass at zero.

• SUREShrink, a method based on minimizing Stein's Unbiased Risk Estimate for thresholding [Donoho and Johnstone, 1995] ("SURE"). This method uses soft-thresholding, but picks the threshold to minimize Stein's unbiased risk estimate.

• FDR-based thresholding [Abramovich et al., 2006], at threshold $q = 0.1$ ("FDR"). This method estimates $\hat\theta = z$ for all points with $FDR \le q$, and $\hat\theta = 0$ for points with higher $FDR$.

• Soft and hard thresholding using the "universal threshold" $\sqrt{2\log N} \approx 3.7$ from [Donoho and Johnstone, 1994] ("UnivSoft" and "UnivHard", respectively).

The log-spline method is a nonparametric empirical Bayes method that does not take advantage of sparsity. The other methods try to take advantage of the sparsity of $\theta$ while maintaining performance on dense data.

All methods use the known variance of $z$, and when applicable, assume a theoretical $N(0, 1)$ null. All methods' tuning parameters were hand-picked for good performance over the simulation scenarios, but none were rigorously optimized (including the mixture model, which used $J = 10$ and $P = 50$). The whole simulation was repeated 100 times, and the same random noise was used for each scenario and each method.

The mixture model was clearly the best performer in our simulations, outperforming the sparse methods on sparse scenarios and the nonparametric log-spline method on dense scenarios. Figures 4.1 and 4.2 show the performance of the various methods relative to the Bayes estimator for each scenario. The mixture model does better than the other methods on sparse $\theta$ ($K = 5$) and nearly achieves the Bayes error for moderate and dense $\theta$ ($K = 50, 500$). Table 4.2 gives the mean and median relative error over the 24 scenarios; the mixture model is often within 5% of the Bayes rule.

The mixture model’s strong performance is not due to fitting the true model, since the true model is not a finite normal mixture. Neither is it due to careful tuning: performance was insensitive to parameter choice, as Figure 4.3 shows. The number of groups J does not matter much, and as long as there is some penalization, the exact value of P is not too important, especially in the moderate and dense cases. Finally, our results are not sensitive to the exact simulation scenarios. We tested the methods using the exact setup of Johnstone and Silverman [2004] and the results were even more pronounced; we did not include these results because there the true model is a normal mixture (on those simulation scenarios, the mixture model also slightly outperforms the method of Jiang and Zhang [2009], who fit a different normal mixture).


[Figure 4.1 appears here: “Comparison of Effect Size Estimation Methods,” one-sided scenario; relative error versus µ = 2, . . . , 5 in three panels, Sparse (5), Middle (50) and Dense (500), with curves for Bayes, MixFdr, EBThresh, SURE, FDR, UnivSoft, UnivHard and Spline.]

Figure 4.1: Simulation results for the one-sided scenario. Each panel corresponds to one value of K (5, 50 or 500). Within each panel, µ increases from 2 to 5. The y-axis plots the squared error $\sum_i (\hat\theta_i - \theta_i)^2$, averaged over 100 replications. Errors are normalized so that the Bayes estimator for each choice of K and µ has error 1. Estimation methods are listed in the text. In the dense case, the universal soft and hard thresholding methods are hidden because their relative errors range from 4 to 40.

Method                            Mean   Median
Mixture Model (J = 10, P = 50)    1.10     1.04
Spline                            2.08     1.43
EBayesThresh                      1.70     1.39
FDR                               1.92     1.70
SUREShrink                        2.11     1.64
Universal Hard                    3.60     2.47
Universal Soft                    8.24     4.52

Table 4.2: Mean and median relative error for the methods over the simulation scenarios. The relative error is the average of the squared error $\sum_i (\hat\theta_i - \theta_i)^2$ over the 100 replications, divided by the average squared error for the Bayes estimator.


[Figure 4.2 appears here: “Comparison of Effect Size Estimation Methods,” two-sided scenario; same layout and methods as Figure 4.1.]

Figure 4.2: Simulation results for the two-sided scenario. Each panel corresponds to one value of K (5, 50 or 500). Within each panel, µ increases from 2 to 5. The y-axis plots the squared error $\sum_i (\hat\theta_i - \theta_i)^2$, averaged over 100 replications. Errors are normalized so that the Bayes estimator for each choice of K and µ has error 1. Estimation methods are listed in the text. In the dense case, the universal soft and hard thresholding methods are hidden because their relative errors range from 4 to 50.


[Figure 4.3 appears here: “Parameter Choice Comparison,” six panels (Sparse (5), Middle (50), Dense (500) for the one-sided and two-sided scenarios); relative error versus µ = 2, . . . , 5 for J = 3 and J = 10 with no penalization and P = 10, 50, 100, 200, plus the Bayes rule.]

Figure 4.3: Relative errors for various parameter choices. Each panel corresponds to one value of K (5, 50 or 500). Within each panel, µ increases from 2 to 5. The y-axis plots the squared error $\sum_i (\hat\theta_i - \theta_i)^2$, averaged over 100 replications. Errors are normalized so that the Bayes estimator for each choice of K and µ has error 1. The parameter J gives the number of groups in the mixture model, and P is a penalization parameter.

4.3.1.1 An Asymptotic Comparison

Mixture models are typically thought of as unstable, but they can sometimes have more favorable asymptotic properties than traditional estimators. We briefly compare a mixture model to a log-spline model to illustrate this.

Consider the following density estimator: fix a parametric family of densities $h_\eta$, then fit $\eta$ by maximum likelihood. Suppose our data $z_1, \ldots, z_N$ are generated from some distribution $f$ (not assumed to be in the $h_\eta$ family). Let $L(\eta) = \log h_\eta(z)$, and suppose that as $N \to \infty$, $\hat\eta \to \eta_0$. A standard Taylor series argument shows that if $\dot L(z)$ has finite variance,

$$\sqrt{N}\,(\hat\eta - \eta_0) \to N(m, V),$$

where

$$m = -\ddot L(\eta_0)^{-1} E\big(\dot L(\eta_0)\big), \qquad V = \ddot L(\eta_0)^{-1}\,\mathrm{Cov}\big(\dot L(\eta_0)\big)\,\ddot L(\eta_0)^{-1},$$

dots denote differentiation with respect to $\eta$, and all expectations are under $f$.

Recall that estimating $E(\theta|z)$ is equivalent to estimating $f'(z)/f(z)$. We can estimate $f'(z)/f(z)$


by $\hat\ell(z) = h'_{\hat\eta}(z)/h_{\hat\eta}(z)$. By the delta method, this has asymptotic variance $v(z) = \dot\ell_{\eta_0}(z)^{\mathsf T} V\, \dot\ell_{\eta_0}(z)$, where $\dot\ell_{\eta_0}(z)$ is the derivative of $\ell_\eta(z) = h'_\eta(z)/h_\eta(z)$ with respect to $\eta$ at $\eta_0$. Since $v(z)$ only depends on the choice of family $h_\eta$, it gives us a simple way to quantify the asymptotic stability of the family for our density and derivative estimation problem.
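Spelled out, the delta-method step is the usual first-order expansion (stated under the assumptions above):

$$\hat\ell(z) = \ell_{\hat\eta}(z) \approx \ell_{\eta_0}(z) + \dot\ell_{\eta_0}(z)^{\mathsf T}(\hat\eta - \eta_0), \qquad \sqrt{N}\,(\hat\eta - \eta_0) \to N(m, V),$$

so $\sqrt{N}\big(\hat\ell(z) - \ell_{\eta_0}(z)\big) \to N\big(\dot\ell_{\eta_0}(z)^{\mathsf T} m,\ v(z)\big)$ with $v(z) = \dot\ell_{\eta_0}(z)^{\mathsf T} V\, \dot\ell_{\eta_0}(z)$.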

Figure 4.4 shows $v(z)$ for a normal mixture model and a log-spline estimator under a few different choices of $f$. The two models each have seven free parameters. For the mixture model, we used three groups, and fixed the first group to be N(0, 1). For the log-spline model, we used a natural spline basis with 7 degrees of freedom (excluding the intercept, which simply makes the estimated density integrate to one). The figure shows that the mixture model can be asymptotically stabler than the log-spline model. The mixture model is generally stabler than the log-spline estimator near the center of the data, but is sometimes less stable in the far tails.

4.3.2 fdr estimation

We investigate the mixture model’s fdr and FDR estimation performance by examining a specific simulation. We generate $z_i \sim N(\theta_i, 1)$, $i = 1, \ldots, N = 1000$; 950 of the θi were 0, and the other 50 were drawn from a Unif(2, 4) distribution. Various methods were used to estimate the curves $\mathrm{fdr}(z) = P(\theta_i\ \mathrm{null} \mid z_i = z)$ and $\mathrm{FDR}(z) = P(\theta_i\ \mathrm{null} \mid |z_i| \ge z)$ based on the $z_i$, using either theoretical or empirical nulls.

• The normal mixture model with J = 3 and P = 50. For this simulation, nearly null components were counted as null.

• Locfdr, from [Efron, 2008]. This fits the overall density using spline estimation. It fits the empirical null by truncated maximum likelihood (“ML”) or by fitting a quadratic to log f near the center (“CM,” for central matching). We used the implementation in the R package “locfdr.”

• Fdrtool, from [Strimmer, 2008]. This fits the overall density using the Grenander density estimator, and the empirical null by truncated maximum likelihood. We used the implementation in the R package “fdrtool.”

We ran the whole simulation 100 times, and the same random noise was used for each method. The results are similar for other scenarios and parameter choices.
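One replication of this comparison can be sketched in R as follows (assuming the locfdr and fdrtool packages are installed; the arguments shown are illustrative, not the exact settings used here):

```r
library(locfdr)    # Efron's local fdr estimator
library(fdrtool)   # Strimmer's Grenander-based estimator

theta <- c(rep(0, 950), runif(50, 2, 4))
z <- rnorm(1000, mean = theta)

fit_th <- locfdr(z, nulltype = 0, plot = 0)   # theoretical N(0,1) null
fit_ml <- locfdr(z, nulltype = 1, plot = 0)   # ML empirical null
fit_ft <- fdrtool(z, statistic = "normal", plot = FALSE, verbose = FALSE)

head(fit_th$fdr)   # locfdr's estimated local fdr at each z
head(fit_ft$lfdr)  # fdrtool's estimated local fdr
```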

The mixture model is probably the best fdr and FDR estimator, but not by much. Figure 4.5 shows the expectation and standard deviation of $\hat{\mathrm{fdr}}(z)$ for the various methods. Fdrtool’s high bias and variance, and central matching’s high variance, make them poor fdr estimators. This leaves Locfdr (and its ML empirical null method) as the mixture model’s only real competitor. Both methods are nearly unbiased for positive z, and their bias for negative z is unlikely to be misleading. The mixture model is slightly more stable than Locfdr, especially in the tails, but the difference is small. Results for FDR estimation, seen in Figure 4.6, were similar.

The mixture model is nevertheless a little better, especially if we need an empirical null. We typically use fdr and FDR estimates to find rejection regions of the form $\{z : \hat{\mathrm{fdr}}(z) \le q\}$. The


[Figure 4.4 appears here: four panels plotting $v(z)$ on a log scale against z for the log-spline and mixture estimators, one panel per true density f1 through f4.]

Figure 4.4: $v(z)$ for log-spline (black) and mixture model (red) estimators, for four true densities. f1 is a sparse model, with 90% µ = 0 and 10% µ ∼ Unif(2, 3). f2 is a continuous model, with µ ∼ N(0, 1). f3 is a sparse heavy-tailed model, with 99% µ = 0 and 1% µ ∼ Exponential(1). f4 is a dense heavy-tailed model, with 90% µ ∼ N(0, 1) and 10% µ ∼ Exponential(1).


[Figure 4.5 appears here: two panels, “Expected fdr estimates” and “Standard Deviation of fdr estimates,” versus z, with curves for the truth, MixFdr Th, Locfdr Th, Fdrtool Th, MixFdr Emp, Locfdr MLE, Locfdr CM and Fdrtool Emp.]

Figure 4.5: $E(\hat{\mathrm{fdr}}(z))$ and $\mathrm{sd}(\hat{\mathrm{fdr}}(z))$ for various values of z and the methods under consideration. “Th” means the theoretical null was used, while “Emp” means an empirical null was used. Locfdr MLE and CM use the truncated maximum likelihood and central matching empirical null estimates, respectively.

mixture model is a bit stabler in the tails, and this makes it a stabler estimator of the rejection region than Locfdr.

In our simulation, the rejection region for a given q corresponds to rejecting all z greater than some threshold t(q). We can use the fdr estimation methods to estimate the rejection thresholds. Figure 4.7 shows the expectation and standard deviation of $\hat t(q)$ for the various methods. Both the mixture model and Locfdr are nearly unbiased for the true threshold, for both theoretical and empirical nulls. Locfdr, however, gives somewhat more variable threshold estimates, especially with an empirical null. This makes the mixture model a slightly better choice for threshold estimation. This result held for almost all parameter choices, and is true for FDR-based thresholds as well (Figure 4.8).

These simulations suggest that the mixture model may be a slightly better fdr estimator than Locfdr when the data are generated from the Bayesian model. Locfdr, however, does not assume that the data come from the Bayesian model, while the mixture model does. For fdr estimation, this assumption translates into restrictions on the marginal density. The mixture model's slightly better performance may not outweigh the risk that the restriction is false. Fortunately, the mixture model will not fail silently: if the marginal density of the data truly violates the restrictions of the Bayesian model, the mixture model will clearly misfit the density, and we can use a more


[Figure 4.6 appears here: two panels, “Expected FDR estimates” and “Standard Deviation of FDR estimates,” versus z, with the same methods as Figure 4.5.]

Figure 4.6: $E(\hat{\mathrm{FDR}}(z))$ and $\mathrm{sd}(\hat{\mathrm{FDR}}(z))$ for various values of z and the methods under consideration. “Th” means the theoretical null was used, while “Emp” means an empirical null was used. Locfdr MLE and CM use the truncated maximum likelihood and central matching empirical null estimates, respectively.


[Figure 4.7 appears here: two panels, “Expected Threshold Estimate” and “Standard Deviation of Threshold Estimate,” versus q from 0.05 to 0.20, with the same methods as Figure 4.5.]

Figure 4.7: Expectation and standard deviation of rejection threshold estimates $\hat t(q)$ for the various methods. The thresholds are fdr-based.

[Figure 4.8 appears here: same layout as Figure 4.7.]

Figure 4.8: Expectation and standard deviation of rejection threshold estimates $\hat t(q)$ for the various methods. The thresholds are FDR-based.


appropriate estimator.

4.4 Summary

In this chapter, we proposed a mixture model approach to estimating effect sizes and false discovery rates for a simple high-dimensional problem. The method is simple, easy to fit and quite accurate, especially for effect size estimation. The method has two tuning parameters, the number of mixture components and the penalization. The tuning parameters have little influence on effect size and theoretical-null fdr/FDR estimates, but the penalization can be important for empirical nulls.

The mixture model can be extended beyond exponential families, though without Robbins’ formula, we may have problems caused by near-unidentifiability. More importantly, the simplicity of the mixture model approach lets us consider more complicated situations. In Chapter 5, we will use mixture models to get empirical nulls for SNP calling.


Chapter 5

Finding SNPs

The research in this chapter is joint work with Nancy Zhang and our biological collaborators from the Clinical Cancer Genomics Group: Georges Natsoulis, John Bell, Daniel Newburger, Hua Xu, Itai Kela and Hanlee Ji.

5.1 Introduction

5.1.1 Overview

In this chapter, we apply the mixture model approach to a more complicated problem: finding SNPs.

Single nucleotide polymorphisms - single letter changes in the genetic code - are the simplest and most frequent type of genome variation. Each person has two letters (“bases”) at each position in their DNA. The vast majority of these positions are the same for all people - everyone has two copies of the human genome reference base. For example, if the reference base at such a position is A, everyone has the two bases AA at that position. Around 3 positions in every thousand, however, vary from person to person. At most such positions, a person can have two copies of the reference base, one reference base and one non-reference base, or two of the same non-reference base. The person is said to be homozygous for the reference base, heterozygous, or homozygous for the alternative base at that position, respectively, and these possibilities are called genotypes. For example, if the reference base is A and the alternative C, the three possible genotypes are AA, AC, CC. Positions at which people can have different genotypes are called SNPs, and are an important part of the genetic variation between people.

New sequencing technologies let us search large parts of the human genome for SNPs. In a typical sequencing experiment, DNA from each sample is copied and randomly broken into short fragments. Each fragment is then read by the sequencer, yielding a read, that is, a fixed-length sequence of bases. The reads are then mapped to a reference genome by an alignment algorithm,


yielding {A, C, G, T} counts for each position in each sample. Table 5.1 shows counts for an example position. Most counts are of the reference base, but sequencing and alignment errors produce some other counts as well.

Because the sequencing and mapping process depends on the true DNA sequence, the error in this process varies substantially across positions [Van Tassell et al., 2008, Li et al., 2009, Hoberman et al., 2009]. The total number of reads at each position (the coverage or depth) can vary dramatically as well (Figure 5.1).

Distinguishing true SNPs from sequencing and alignment errors can be difficult. To avoid being overwhelmed by false positives when we search for SNPs in large genomic regions, we need to detect SNPs in a statistically rigorous way. In this chapter, we will use the mixture model approach from Chapter 4 to detect SNPs.

The intuition behind our approach is simple. Suppose there are N genome positions of interest; for whole-genome sequences, N is near 3 billion, and for targeted sequencing of selected genomic regions, N can be in the hundreds of thousands. Sequencing yields an N × 4 table of A, C, G, T counts for each sample. We can view SNP calling as a multiple testing problem, where the null hypothesis for each position is that all samples are homozygous for the reference base. If we knew the null distribution of counts at each position - the distribution of base counts if the sample were truly homozygous for the reference base at the position - we could apply standard multiple testing ideas to call SNPs and control the false discovery rate of our calls.

Although error rates vary substantially across positions, they are consistent across samples. One major driver of error rate variation is the local genomic environment of each position. For example, some genomic regions are more repetitive than others, and are thus more prone to alignment errors, and certain combinations of DNA bases can lead to higher sequencing error. Since the genomic environment is largely the same in each sample, error rates should be consistent across samples. Figures 5.2 and 5.3 show that this is indeed the case - the positional error rates are quite correlated across samples.

This consistency means that we can estimate the null distribution for each position by pooling information across samples. Figure 5.4 illustrates this idea. The left simplex shows A, C, G, T frequencies for four homozygous positions with quite different error rates, while the right simplex shows a true SNP position with genotypes {GG, GT, TT}. If we were to analyze each sample separately, we might falsely conclude that some of the noisier positions were SNPs, since they show high nonreference base frequencies. With a cross-sample model, we see that some positions are noisier than others, and distinguish noisy positions from true SNPs.

We also pool information across positions, taking advantage of genome-wide error patterns to improve our null distribution estimates. This shrinks our error estimate for each position toward a genome-wide consensus. The example in this chapter has N = 300,000 positions and M = 30 samples, so pooling information across positions can substantially improve our null distribution estimates.


Sample     A    C      G    T
1          0    0    400    0
2          0    1    500    1
3          0    2    594    2
4          0    0    471    4
5          0    0    615    0
6          0    0     11    0
7          0    5   4991    6
8         13    7   3907    2
9          3    2   2036    4
10         0    1    602    0
11         0    0    583    0
12         2    0    663    0
13         4    3   1009    1
14         0    1   1126    0
15         0    0    984    1
16         0    1   1053    0
17         1    1   1123    1
18         0    1   2103    0
19         1    0    279    0
20         0    1    227    1
21         0    0    424    0
22         0    0    154    0
23         0    0    411    0
24         2    1    582    0
25         0    2   1148    2
26         3    3   2346    2
27         1    3   1503    3
28         0    2   2639    1
29         1    0   1899    2
30         0    0    255    1

Table 5.1: Example counts, reference base G. For the spike-in simulations later, we used A as the alternative base.


[Figure 5.1 appears here: histogram of log10(coverage) for sample 1; 36,755 zero-coverage positions not shown.]

Figure 5.1: Coverage for one of the samples in our example data set (309,474 total positions).


Figure 5.2: Error rates for samples 1 and 2, plotted on a log-log scale. Points with no error are not shown. Most of the variability in the figure is actually from binomial noise, because of the low depth, as Figure 5.3 illustrates. Here, we treat all non-reference counts as errors, which is true for most positions.


Figure 5.3: Error rates for positions with high coverage (at least 10,000) in both samples 1 and 2, plotted on a log-log scale. Points with no error are not shown. The dramatically lower spread around the x = y line shows how low depth contributes most of the variability in Figure 5.2. As in that figure, we treat all non-reference counts as errors, which is true for most positions.


Figure 5.4: The simplex on the left shows four null genome positions of varying noisiness taken from the example data set, coded by different colors. Within each color, each point is for a different sample. There are two T’s (blue and green), one A (yellow), and one G (red). The simplex on the right shows a true SNP that has genotypes CC, CG, and GG among the samples.

This particularly benefits low-coverage positions, which contain little information in themselves.

Our fundamental idea, pooling data to estimate the null distributions, is simply the empirical null idea, stated in sequencing terms. Standard empirical null methods are designed for continuous data, and require a z-score for each hypothesis tested. The SNP detection problem is quite different, since the data is discrete, many positions have very few counts, and the number of counts at each position varies dramatically. The mixture model approach from Chapter 4, however, is readily translated. As before, we model parameters as coming from a mixture of conjugate priors, with some mixture components corresponding to null parameters and others corresponding to alternatives. This lets us estimate the null distributions efficiently while accounting for the discreteness and depth variation of sequencing data.

5.2 Mixture Model

5.2.1 Model

We want to test N null hypotheses, where the ith null hypothesis is that all the samples are homozygous for the reference base at position i, and the ith alternative is that at least one of the samples is heterozygous or homozygous for the alternative base. Since most positions in the human genome are the same for all people, most positions in any data set will be null, and thus most null hypotheses will be true. To test our hypotheses rigorously, we need to estimate the null and alternative distributions at each position.

We propose an empirical Bayes mixture model that shares information across samples and positions to efficiently estimate these distributions. We use the following generative model for the counts at each position. First, for each position i, we generate 4-dimensional null and alternative frequency vectors pi and qi from null and alternative priors. Next, each sample is assigned to the null or alternative: we generate indicators δij for each sample, and each sample is assigned to the


null (δij = 0) or the alternative (δij = 1) at position i. Finally, we generate counts for each sample from the appropriate multinomial distribution, using the observed coverage Nij and the generated null and alternative frequency vectors. Expressed mathematically, this gives the following model for position i:

$$p_i \sim G_{\mathrm{null}}, \qquad q_i \sim G_{\mathrm{alt}}, \qquad \delta_{ij} \sim \mathrm{Ber}(\pi_i),$$
$$X_{ij} \mid \delta_{ij}, p_i, q_i \sim \mathrm{Multinomial}\big((1 - \delta_{ij})\,p_i + \delta_{ij}\,q_i,\; N_{ij}\big).$$

With this model, we proceed in three steps. We first estimate Gnull, Galt, πi, pi and qi by maximum likelihood using a modified EM algorithm outlined in Subsection 5.2.3. We then use the estimated parameters to find the posterior probability that each sample is not homozygous for the reference base at each position (that is, we find E(δij|X)). Last, we use the estimated posterior probabilities to call SNPs. Estimating the priors Gnull, Galt lets us share information across positions, and estimating the non-null probability πi and the position-specific null and alternative frequency vectors pi and qi lets us share information across samples.
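To make the generative model concrete, here is a small R sketch that simulates counts at one position under hypothetical parameter values (all names and numbers ours, chosen only for illustration):

```r
# Simulate counts at one position for M samples under the generative model above.
simulate_position <- function(M = 30,
                              p = c(0.002, 0.002, 0.994, 0.002),  # null frequencies
                              q = c(0.500, 0.002, 0.496, 0.002),  # alternative (AG het)
                              pi_i = 0, coverage = rpois(M, 500)) {
  delta <- rbinom(M, 1, pi_i)              # 0 = null sample, 1 = alternative sample
  X <- t(sapply(seq_len(M), function(j) {
    freq <- if (delta[j] == 1) q else p    # pick the sample's frequency vector
    rmultinom(1, size = coverage[j], prob = freq)
  }))
  colnames(X) <- c("A", "C", "G", "T")
  list(X = X, delta = delta, coverage = coverage)
}

# Example: a true SNP position where about half the samples are non-null.
head(simulate_position(pi_i = 0.5)$X)
```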

In the rest of this subsection, we will explain our modeling approach in more depth, but we defer the full details of our fitting algorithm to Subsection 5.2.3.

We model the priors Gnull and Galt as G-component Dirichlet mixtures:

$$G_{\mathrm{null}} = \sum_{g=1}^{G} \theta_{\mathrm{null},g}\,\mathrm{Dirichlet}(\alpha_g), \qquad G_{\mathrm{alt}} = \sum_{g=1}^{G} \theta_{\mathrm{alt},g}\,\mathrm{Dirichlet}(\alpha_g).$$

Mixture models give us convenient conjugate priors, but with the flexibility to adapt to many different error distributions.

To estimate Gnull and Galt more efficiently, we impose extra constraints on the mixture parameters. First, we require that all the Dirichlets have the same precision $\sum_{l=1}^{4} \alpha_{gl}$. Second, we choose our mixture components to take advantage of the structure of SNP data. The simplest version of our approach would be to use 4 null components (one for each homozygous genotype in {A, C, G, T}) and 6 alternative components (one for each heterozygous combination of {A, C, G, T}). Geometrically, the null components put probability near the corners of the {A, C, G, T} simplex, while the alternative components put probability near the edge midpoints. We require that the null mixture probabilities θnull,g be nonzero only on the null components and the alternative mixture probabilities θalt,g be nonzero only on the alternative components.


This basic approach, however, does not fit our data well. We found that nearly all positions are “clean,” with very low error rates, but a small proportion are “noisy,” with much higher error rates. This error rate distribution is not modeled well by a single Dirichlet group for each corner and edge midpoint of the simplex, and using this basic model yields many false positive SNP calls. Instead, we use a mixture of two Dirichlets at each corner and edge midpoint to model the error distribution more flexibly. This gives us 8 null and 12 alternative mixture components.

We do not explicitly model the possibility that a position is homozygous for a non-reference base. Instead, we fit our model in a way that ensures each position is very likely to have a null distribution generated from the homozygous group corresponding to the reference base. This means that if a position is homozygous for a non-reference base, it will strongly appear to have been generated from an alternative group, and will thus be correctly detected as a SNP. For example, suppose the reference base at a position is C and the position is homozygous AA in a given sample. The position is far more likely to have been generated from the AC alternative than the CC null, so even though our model does not explicitly consider the possibility that the position is AA, we will still detect it as a SNP. We re-genotype all called positions in postprocessing, so all that matters is that our model makes the right SNP calling decision.

Since we have so many positions and most positions are null, the fitted null parameters are quite accurate. But since SNPs are rare, our data will typically contain few positions displaying each heterozygous genotype. This makes the alternative mixture components very difficult to estimate. We solve this issue by constraining the alternative Dirichlet parameters αg. We require αg for the heterozygote groups g ∈ {AC, AG, AT, CG, CT, GT} to equal the averages of their corresponding homozygote groups. For example,

$$\alpha_{AC,\mathrm{clean}} = \tfrac{1}{2}\,(\alpha_{AA,\mathrm{clean}} + \alpha_{CC,\mathrm{clean}}).$$

This constraint significantly stabilizes our parameter estimates.

5.2.2 Calling, Filtering and Genotyping

Given our parameter estimates and posterior estimates of δij, we call SNPs as follows. We estimate the positional false discovery rate, the posterior probability that all of the samples at a position are homozygous for the reference base:

$$\mathrm{fdr}_i = P(\delta_{ij} = 0\ \ \forall j \mid X).$$


We estimate $\mathrm{fdr}_i$ by taking a weighted product of the estimated δij s, downweighting very low-coverage samples:

$$\hat{\mathrm{fdr}}_i = \exp\left( \sum_j w_{ij} \log(1 - \hat\delta_{ij}) \Big/ \sum_j w_{ij} \right).$$

We use weights $w_{ij} = \min\big((N_{ij} - 3)_+,\, 20\big)$. This gives samples with coverage of 3 or less no weight, since these are particularly noisy in our data, and saturates the weights at an arbitrary coverage of 23, since increasing N beyond such coverage does not make $\hat\delta$ more accurate. Given the estimated fdr, we make a list of putative SNPs with low $\hat{\mathrm{fdr}}$ (we used a threshold of 0.1).
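This weighting is a few lines of R (delta_hat is the vector of estimated E(δij | X) for one position, N_cov the matching coverages; names ours):

```r
# Weighted positional fdr estimate, as in the formula above.
weighted_fdr <- function(delta_hat, N_cov) {
  w <- pmin(pmax(N_cov - 3, 0), 20)   # zero weight for coverage <= 3, capped at 20
  exp(sum(w * log(1 - delta_hat)) / sum(w))
}

# Example: one clearly non-null sample drags the positional fdr down.
weighted_fdr(delta_hat = c(0.01, 0.02, 0.90, 0.01), N_cov = c(100, 2, 50, 40))
```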

We then fit a Hardy-Weinberg model by maximum likelihood to genotype the called positions. First, we reduce the counts to the reference and highest-nonreference base counts; suppose for concreteness that these are A and C, respectively. This produces an M × 2 count matrix Y for each putative SNP, with each row Yj corresponding to one sample. Next, we fit the following generative model for Yj: each sample j is first assigned a genotype gj with probability p = (pAA, pAC, pCC). The nonreference base counts are then binomial, $Y_{j2} \sim \mathrm{Binomial}(N_j, \pi_{g_j})$, where Nj is the coverage for sample j and πg is the expected non-reference proportion for genotype g. For example, if the reference base is A, πAA is the error probability.

We use the EM algorithm to fit this model under the Hardy-Weinberg restriction that $p = (p_A^2,\; 2p_A(1 - p_A),\; (1 - p_A)^2)$. We constrain the πg so the homozygous genotypes have π near 0 and 1, and the heterozygous genotype has π near 0.5. The estimated group membership indicators from the EM algorithm give us estimated genotypes for each sample at the position.
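A minimal EM sketch of this genotyping step (simplified by fixing the genotype proportions π at (0.01, 0.5, 0.99) instead of estimating them under constraints; names and values ours):

```r
# EM for Hardy-Weinberg genotyping at one called position.
# y: non-reference counts; n: coverages; pi_g: per-genotype non-reference proportions.
hw_genotype <- function(y, n, pi_g = c(0.01, 0.5, 0.99), n_iter = 100) {
  pA <- 0.5                                            # reference allele frequency
  for (it in seq_len(n_iter)) {
    p_geno <- c(pA^2, 2 * pA * (1 - pA), (1 - pA)^2)   # Hardy-Weinberg weights
    # E-step: posterior genotype probabilities for each sample.
    lik <- sapply(1:3, function(g) p_geno[g] * dbinom(y, n, pi_g[g]))
    post <- lik / rowSums(lik)
    # M-step: genotypes carry 2, 1, 0 reference alleles out of 2 per sample.
    pA <- sum(post %*% c(2, 1, 0)) / (2 * length(y))
  }
  list(pA = pA, genotype = max.col(post))  # 1 = ref hom, 2 = het, 3 = alt hom
}

# Example: 10 samples, three of them heterozygous.
hw_genotype(y = c(0, 1, 48, 0, 55, 2, 0, 51, 0, 1),
            n = c(100, 90, 100, 80, 110, 95, 70, 100, 60, 85))
```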

This calling procedure assumes we are interested in detecting SNPs. If we are instead interested in detecting nonreference positions in particular samples, we can look at the Hardy-Weinberg genotype results, or simply consider the estimated indicators $\hat\delta_{ij}$ from the mixture model. Although they do not distinguish between heterozygosity and homozygosity for a nonreference base, the $\hat\delta_{ij}$ are typically more accurate indicators of SNP status than the Hardy-Weinberg genotypes for very low coverage samples. For the single-sample comparison in the results, we detect SNPs as above using $\hat{\mathrm{fdr}}$, then detect SNPs in the sample of interest using the genotypes, but require that $\hat\delta_{ij} \ge 0.9$ for low coverage (N ≤ 4) positions.

5.2.3 Fitting

We fit the mixture model using a regularized ECM algorithm: ECM algorithms replace the M-step with a series of partial maximizations. They are often computationally simpler than the usual EM algorithm, yet enjoy many of the same convergence properties [Meng and Rubin, 1993]. Our algorithm is quite fast on current data and easily parallelizable, so it will be able to handle larger datasets to come.


We first define mixture component indicators

$$\xi^g_{i,\mathrm{null}} = I(p_i \text{ drawn from component } g), \qquad \xi^g_{i,\mathrm{alt}} = I(q_i \text{ drawn from component } g).$$

We treat the mixture component indicators $\xi = \{\xi^g_{i,\mathrm{null}}, \xi^g_{i,\mathrm{alt}} : i = 1, \ldots, N;\ g = 1, \ldots, G\}$ and the heterozygosity indicators $\delta = \{\delta_{ij} : i = 1, \ldots, N;\ j = 1, \ldots, M\}$ as missing data.

5.2.3.1 E-Step

In the E-step, we compute E(ξ|X) and E(δ|X) at the current values of the other parameters p, q, π, θ, α. The Xij are conditionally independent given p and q, so

$$E(\delta_{ij} \mid X, p, q, \pi, \alpha) = \frac{\pi_i \prod_{l=1}^{4} q_{il}^{X_{ijl}}}{\pi_i \prod_{l=1}^{4} q_{il}^{X_{ijl}} + (1 - \pi_i) \prod_{l=1}^{4} p_{il}^{X_{ijl}}},$$

since δij = 1 assigns sample j to the alternative frequencies qi.

Similarly, the E(ξ|X) are ratios of Dirichlet densities fα:

$$E(\xi^g_{i,\mathrm{null}} \mid X, p, q, \pi, \alpha) = \frac{\theta_{\mathrm{null},g}\, f_{\alpha_g}(p_i)}{\sum_{g'} \theta_{\mathrm{null},g'}\, f_{\alpha_{g'}}(p_i)}, \qquad E(\xi^g_{i,\mathrm{alt}} \mid X, p, q, \pi, \alpha) = \frac{\theta_{\mathrm{alt},g}\, f_{\alpha_g}(q_i)}{\sum_{g'} \theta_{\mathrm{alt},g'}\, f_{\alpha_{g'}}(q_i)}.$$

5.2.3.2 CM-Step

In the CM-step, we estimate p, q, π, θ, α using the expected values of the indicators. For simplicity, we write ξ for E(ξ|X), δ for E(δ|X), and so on. We sequentially optimize over π, θ, α, p, q.

To estimate π, we do not use the MLE $\hat\pi_i = M^{-1} \sum_j \delta_{ij}$; this estimator behaves poorly. If πi = 0, as it does for the majority of our data, the MLE is trivially biased upward, since δij ∈ (0, 1). This creates a feedback cycle, since a higher π̂ increases the δ estimates, increasing the next estimate of π yet further. The bias of the MLE is worst for low-depth positions, but it falls as the depth increases, since E(δij) → 0 exponentially fast under the null.

Instead, we use a weighted shrinkage estimator for πi that downweights low-depth positions and shrinks all the πi toward the overall mean of the δs. Our estimator is

$$\hat\pi_i = \frac{\sum_j w_{ij}\, \delta_{ij} + a\, \bar\delta}{\sum_j w_{ij} + a},$$

where the weights are $w_{ij} = \min\big((N_{ij} - 3)_+,\, 20\big)$. We give samples with depth of 3 or less no


weight, since these positions are particularly noisy in our data. The weights increase until Nij = 23 and then remain constant. We bound the weights because the bias of the MLE is negligible when the depth is high (the specific choice of 23 does not make much difference). We took a = 10 for our data, but the exact choice did not seem to matter much either.
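In R, this shrinkage update might look like the following (delta is the N × M matrix of current E(δ|X) values, N_cov the matching coverage matrix; names ours):

```r
# Weighted shrinkage update for pi, as in the estimator above.
update_pi <- function(delta, N_cov, a = 10) {
  w <- pmin(pmax(N_cov - 3, 0), 20)   # same weights as the fdr combination
  (rowSums(w * delta) + a * mean(delta)) / (rowSums(w) + a)
}
```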

To estimate θ, we simply use the MLE. For example,

$$\hat\theta_{\mathrm{null},g} = \frac{1}{N} \sum_i \xi^g_{i,\mathrm{null}}.$$

That leaves p, q and α. Let A be the common precision of the α’s, so $A = \sum_{l=1}^{4} \alpha_{gl}$, and let $\bar\alpha_g = \alpha_g / A$. We first estimate $\bar\alpha$ unbiasedly. For each null group g, our estimate is

$$\hat{\bar\alpha}_g = \frac{\sum_{i,j} X_{ij}\, (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}}{\sum_{i,j} N_{ij}\, (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}}.$$

For the non-null groups, we set $\bar\alpha$ to be the average of the appropriate null groups.

Next, using the fitted $\bar\alpha$’s, we estimate A by maximum likelihood, marginalizing over p and q. In our model, $\sum_j X_{ij}\, (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}}$ has a Dirichlet-multinomial distribution for each i, g:

$$\sum_j X_{ij}\, (1 - \delta_{ij})\, \xi^g_{i,\mathrm{null}} \;\sim\; f_{A\bar\alpha_g,\ \sum_j N_{ij}(1 - \delta_{ij})\xi^g_{i,\mathrm{null}}},$$

where $f_{\alpha, N}$ is the Dirichlet-multinomial distribution with parameters α and N. Similarly, $\sum_j X_{ij}\, \delta_{ij}\, \xi^g_{i,\mathrm{alt}}$ has a Dirichlet-multinomial distribution for each i, g. We plug in the estimated δ, ξ, $\bar\alpha$ and estimate A by maximum likelihood (using Newton’s method).

Finally, given $\bar\alpha$ and A, we estimate p and q by their posterior means:

$$\hat p_i = \sum_g \xi^g_{i,\mathrm{null}} \left( \frac{\alpha_g + \sum_j (1 - \delta_{ij})\, X_{ij}}{A + \sum_j (1 - \delta_{ij})\, N_{ij}} \right), \qquad \hat q_i = \sum_g \xi^g_{i,\mathrm{alt}} \left( \frac{\alpha_g + \sum_j \delta_{ij}\, X_{ij}}{A + \sum_j \delta_{ij}\, N_{ij}} \right).$$

5.2.3.3 Starting Points

We use the starting points to regularize this procedure and incorporate our intuition about what parameter values are reasonable. By starting out with parameters of the form we desire and allowing the EM iterations to fine-tune them, we hope to get reasonable final parameter estimates.

We start with $\pi_i = 10^{-5}$. For α, we initialize $A = \sum_l \alpha_{gl}$ and $\bar\alpha = \alpha/A$ separately. We start with A = 20. We initialize the clean null $\bar\alpha$’s to (0.95, 0.0033, 0.0033, 0.0033) (changing the location of the maximum as appropriate). For noisy nulls, we use (0.85, 0.05, 0.05, 0.05). This difference in starting points is the only explicit difference between clean and noisy mixture components; afterward, the


algorithm ignores the clean/noisy distinction. The alternative $\bar\alpha$’s are initialized to the averages of the corresponding null $\bar\alpha$’s. For θ, we initialize θnull to put probability 0.245 on each clean null and 0.005 on each noisy null, and θalt to put probability 1/12 on each clean and noisy alternative.

Finally, we initialize p and q as follows. Let $Z \in \mathbb{R}^{N \times 4}$ be the matrix obtained by summing the counts X over all samples for each position. For each i, let $l_i$ be the index of the reference base (1 to 4), and let $k_i$ be the index of the highest-frequency nonreference base in $Z_i$. We initialize $p_i$ to be roughly a null group at base $l_i$, with some error in the base $k_i$ direction, depending on how much error $Z_i$ shows. More precisely, we set every coordinate of $p_i$ to 1, then set $p_{il_i} = \max(Z_i)$ and $p_{ik_i} = Z_{ik_i}$, and finally scale so that $p_i$ sums to 1 and puts probability between 0.85 and 0.99 on the reference base. To initialize $q_i$, we simply put probability 0.4 on bases $l_i$ and $k_i$ and 0.1 on the other two bases.

5.3 A Single-Sample Nonparametric FDP Estimator

Assessing the accuracy of a set of SNP calls is difficult, since we do not know the true SNP status of most positions. We propose a simple nonparametric way to estimate the proportion of false discoveries (FDP) for a set of calls in a single sample. Our FDP estimator is not intended to be a rigorous false discovery rate estimator. It is designed to be a simple, fair way to compare the precision of SNP calling methods.

Our method is based on the highest non-reference base frequency (HNRF) and a simple general inequality. Consider a set R of candidate SNP positions. Let Zi be any random variable observed for each candidate position i ∈ R. Let Hi indicate whether position i is actually non-null (so Hi = 0 means position i is null), and let η = P(Hi = 0 | i ∈ R) be the FDP for this candidate set. Suppose we know that

$$E[Z_i \mid H_i = 0, i \in R] = a, \qquad E[Z_i \mid H_i = 1, i \in R] = b, \tag{5.1}$$

with a ≤ b. Then

$$\mu \equiv E[Z_i \mid i \in R] = \eta a + (1 - \eta) b,$$

and hence

$$\eta = \frac{b - \mu}{b - a}. \tag{5.2}$$

If a and b are upper bounds for the conditional expectations in equation 5.1, then the right side of equation 5.2 is an upper bound on η.

If we have such a quantity Zi, we can estimate µ using the data,

$$\hat\mu = |R|^{-1} \sum_{i \in R} Z_i,$$

and use equation 5.2 to estimate the FDP η. If we know a and b exactly, we obtain an unbiased estimate of the FDP, and if we have upper bounds on a and b, we obtain an estimated upper bound


on the FDP.

We use the highest non-reference base frequency (HNRF) as our Zi. For each position, let

$$Z_i = \max_{b \ne b_i} X_{ib} / N_i$$

be the HNRF, where $b_i$ is the reference base and $N_i$ the total coverage at position i. Due to the low overall sequencing error rate, the HNRF should be near 0 for true null positions; that is, E(Zi | Hi = 0) should be small. On the other hand, the HNRF should be close to 0.5 for true non-nulls; that is, E(Zi | Hi = 1) should be close to 0.5.

We can estimate the mean non-null HNRF E(Zi | Hi = 1, i ∈ R) using the counts of known SNPs, or we can use b = 0.5 as an upper bound - biases in alignment decrease the expected non-null HNRF, but seldom increase it. Even if we condition on rejection, the average non-null HNRF should be less than 0.5, since the upward bias caused by conditioning on rejection should be outweighed by the downward bias of alignment. We can either estimate the expected null HNRF E(Zi | Hi = 0, i ∈ R) using some high quantile of the HNRF over all positions, or simply bound it by some very generous error rate, for example a = 0.25.

By plugging the estimated mean null and non-null HNRFs (or upper bounds) $\hat a$, $\hat b$ and the mean HNRF $\hat\mu$ over R into equation 5.2, we can estimate or bound the FDP for R. This FDP estimate lets us assess the precision of a given set of SNP calls.
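The whole estimator is a few lines of R (hnrf holds the HNRFs of the called positions; a and b as above; names ours):

```r
# Plug-in FDP estimate from equation 5.2: (b - mu.hat) / (b - a).
estimate_fdp <- function(hnrf, a = 0.25, b = 0.5) {
  (b - mean(hnrf)) / (b - a)
}

# Example: calls whose HNRFs cluster near 0.5 give a low estimated FDP.
estimate_fdp(c(0.48, 0.51, 0.45, 0.12, 0.49))
```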

Our FDP estimator is a rough estimate of precision, not a rigorous estimator of the false discovery rate, and it has two obvious limitations. First, because it assumes a constant probability of being null, it works best when the set of calls is relatively homogeneous. If the set of calls can be split into a priori distinct groups (for example, known SNPs and new discoveries), it makes sense to estimate the FDP in each group separately and combine the estimates to get an overall FDP. Second, even within a homogeneous group, estimating a and b can be tricky. This is because they are null and non-null expectations conditioned on rejection, and rejection may be a complicated event. Using simple bounds for a and b gets around this problem, but if the bounds are loose, the resulting FDP estimate will be very conservative.

Despite these drawbacks, however, our FDP estimator is a useful way to compare the precision of different SNP calling methods. This is because the ratio of FDP estimates for different sets of calls is more reliable than the value of each FDP estimate. If we use the same a, b for two SNP calling methods, then

$$\frac{\widehat{FDP}_1}{\widehat{FDP}_2} = \frac{b - \hat\mu_1}{b - \hat\mu_2}.$$

The estimated FDP ratio between the two methods only depends on b, $\hat\mu_1$ and $\hat\mu_2$, and does not involve a. Since the sample means $\hat\mu_i$ are easy to calculate and we can use the bound b = 0.5, the FDP ratio is more reliable than the individual FDP estimates.


Method                                    Our Method   SNIP-Seq    GATK
Positions Called                                 758       1088      497
Bentley Positions Called (out of 264)            243        250      235
Percentage Bentley Positions Called            92.0%      94.7%    89.1%
Estimated FDP of all calls                      8.6%      47.4%    18.6%
Estimated FDP of new calls                     13.1%      61.5%    35.2%

Table 5.2: Calls on the Yoruban sample by various methods, with estimated FDPs. We used an estimated mean non-null HNRF of b = 0.5 and an estimated mean null HNRF of a = 0.1 (the 99th percentile of the mean HNRF over all positions for the Yoruban sample). The overall FDP estimate was calculated by combining FDP estimates on Bentley’s calls and new calls. The ratios of FDP estimates between methods are more reliable than the individual levels.

5.4 Results

5.4.1 Yoruban SNP Calls

We analyzed a collection of 29 normal T-cell-derived DNA samples. We also included one sample derived from NA18507, a fully sequenced individual of Yoruban descent [Bentley, 2008].

We used our method, SNIP-Seq [Bansal et al., 2010] and GATK [McKenna et al., 2010, DePristo et al., 2011] to call SNPs on the Yoruban sample. GATK is a widely used single-sample SNP caller, and SNIP-Seq is a cross-sample method that relies on “quality scores” produced by the sequencer for each read.

To assess the recall of the three methods, we found the overlap of each method’s calls with the SNPs identified by Bentley [2008], removing positions identified by Bentley that had very low coverage in our data set (since none of the methods can call these reliably). To assess precision, we used our nonparametric FDP estimator to estimate the false discovery proportion of each method’s full set of calls and of each method’s new calls (calls not made by Bentley [2008]).

Table 5.2 shows the results. All three methods have about the same recall, but make dramatically different precision tradeoffs to achieve that recall. For example, SNIP-Seq makes many more calls than our method, and its calls have low precision. Indeed, SNIP-Seq’s novel calls have an estimated FDP of 61.5%, indicating that they include many false positives. In contrast, our method’s novel calls have an estimated FDP of 13.1%, indicating that they are mostly trustworthy. GATK attempts to be conservative but is not able to increase its precision: its novel calls have an estimated FDP of 35.2%.

5.4.2 Power: Spike-In Simulation

A spike-in simulation also shows that our method has good power on this data. We simulated SNP data and added it to our experimental data. We took a clean null position (Table 5.1) or a noisy null position (Table 5.3) and replaced the samples with counts corresponding to SNPs, row by row,


Sample     A    C    G     T
1          0    0    1     7
2          0    0    0     8
3          0    0    0     9
4          0    0    0     1
5          0    0    0     7
6          0    0    0     0
7          0    1    2    92
8          0    3    3   102
9          1    0    0    32
10         1    0    0     3
11         0    0    0     4
12         0    1    1    37
13         0    0    0    11
14         0    0    1    17
15         0    0    0     8
16         0    0    1    20
17         0    1    1    19
18         0    0    0    24
19         0    0    0     4
20         0    0    0    12
21         0    0    0     0
22         0    0    0    11
23         0    0    0    13
24         0    0    0     0
25         0    1    1    13
26         0    0    0     5
27         0    0    0     6
28         0    0    0     2
29         0    0    0    15
30         0    0    0     6

Table 5.3: Noisy null position, reference base T, spiked alternative base G.

starting at the top. The SNP rows were $(x, 0, n - x, 0)$, where n is the assumed coverage for the SNP and $x \sim \mathrm{Binomial}(n, \tfrac{1}{2})$.
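A sketch of the spike-in step in R (counts is the M × 4 matrix of the chosen null position, columns ordered A, C, G, T; we spike k heterozygous samples at coverage n; names ours):

```r
# Replace the first k rows of a null position with simulated heterozygote counts.
# For reference base G and alternative base A, reads split roughly evenly:
# (x, 0, n - x, 0) with x ~ Binomial(n, 1/2).
spike_in <- function(counts, k, n) {
  x <- rbinom(k, size = n, prob = 0.5)
  counts[seq_len(k), ] <- cbind(x, 0, n - x, 0)
  counts
}
```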

Figure 5.5 shows the spike depth n needed to call a SNP with 80% power at fdr ≤ 0.1. Because our method borrows information across samples, the power depends on the number of heterozygous samples. The depth required to call the SNP falls quickly as the number of heterozygous samples increases. If two or more samples are heterozygous, our method only needs a depth of 7 to achieve good power on the clean position (8 on the noisy position).


[Figure 5.5 appears here: coverage needed for 80% power versus number of heterozygous samples (0 to 30), for the clean and noisy positions.]

Figure 5.5: Coverage needed for 80% power at fdr ≤ 0.1 for clean and noisy positions. The coverage needed for the noisy position increases at 28 heterozygous samples because at that stage, our algorithm begins considering the position extremely noisy.


True \ Estimated fdr   [0, 10⁻⁵]   (10⁻⁵, 0.001]   (0.001, 0.1]   (0.1, 0.5]   (0.5, 1]     Total
[0, 10⁻⁵]                   2134             8              1            0          5       2148
(10⁻⁵, 0.001]                  6             8              6            0          1         21
(0.001, 0.1]                   0             4             19            7         13         43
(0.1, 0.5]                     0             0              5            7         22         34
(0.5, 1]                       2             0              4            7     278397     278410
Total                       2142            20             35           21     278438     280656

Table 5.4: Binned true fdr and estimated fdr.

5.4.3 Model-based Simulation

Finally, we investigated the fdr estimation accuracy of our procedure with a parametric simulation. We generated a synthetic dataset from our model, with parameters based on the fit to real data. Given the true priors and parameters, calculating E(δ|X) is still intractable, so we approximated E(δ|X) as in our fitting procedure: we found E(p|X) and E(q|X) using the true priors, then used those to find E(δ | X, p = E(p|X), q = E(q|X)). Then we found the weighted fdr

$$\mathrm{fdr}_i = \exp\left( \sum_j w_{ij} \log\big(1 - E(\delta_{ij} \mid X)\big) \Big/ \sum_j w_{ij} \right),$$

where $w_{ij} = \min\big((N_{ij} - 3)_+,\, 20\big)$. We then refit our model on the synthetic dataset and compared

the fitted $\hat{\mathrm{fdr}}$s to the true weighted fdrs.

Figure 5.6 shows that the estimated fdr tracks the true fdr well for positions with low true fdr, with a slight upward bias. For positions with high true fdrs, our estimator is quite upwardly biased, but not so biased as to be misleading, since the exact fdr for high-fdr positions does not usually matter. Table 5.4 shows that if we bin the estimated and true fdrs as extremely low, low, moderate, large and very large, our estimator is usually in the correct bin.

These simulations show that our method can conservatively estimate the true fdr when the data are generated from the model we are fitting. This test validates our fitting method: estimating the fdr is nontrivial, even when fitting the true model, since the likelihood is nonconvex with many local optima. Our method’s good fdr estimation performance on this parametric simulation suggests that our estimated fdrs are likely to be at least roughly accurate when our model fits the data.

5.5 Summary

In this chapter, we applied our empirical Bayes mixture model ideas to SNP calling. The mixture model approach let us extend the empirical null approach to discrete data, while efficiently sharing information across positions and samples. Our method has high power to detect SNPs, yet has a lower false discovery rate than existing SNP detection methods.

Sequencing data is becoming more common, and empirical Bayes ideas have many potential applications.

[Figure 5.6 appears here.]

Figure 5.6: Estimated fdr vs. true weighted fdr, plotted on the logit scale. Points plotted at y = 50 have estimated fdr numerically equal to 1.

One direction that we are currently pursuing is the detection of rare mutations. In this situation, we want to detect mutations in a small portion of a sample. For example, a small subpopulation of a virus sample may have evolved drug resistance. This problem is more difficult than SNP detection, since the alternative is not fixed at one of the edge midpoints of the A, C, G, T simplex. Sharing information across samples becomes crucial, and empirical Bayes ideas are quite useful. Surprisingly, a mixture model approach based on our SNP calling method performs quite poorly on tumor data, with many false positives. Finding the best way to pool information in each problem is an important direction for future research.


Bibliography

Felix Abramovich, Yoav Benjamini, David L. Donoho, and Iain M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2):584–653, 2006.

Genevera Allen and Robert Tibshirani. Inference with transposable data: Modeling the effects of row and column correlations. 2010.

David B. Allison, Gary L. Gadbury, Moonseong Heo, Jose R. Fernandez, Cheol-Koo Lee, Tomas A. Prolla, and Richard Weindruch. A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis, 1(1):1–20, 2002.

Vikas Bansal, Olivier Harismendy, Ryan Tewhey, Sarah S. Murray, Nicholas J. Schork, Eric J. Topol, and Kelly A. Frazer. Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Research, 20(4):537–545, 2010. doi: 10.1101/gr.100040.109. URL http://genome.cshlp.org/content/20/4/537.abstract.

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300, 1995.

Yoav Benjamini and Yosef Hochberg. On the adaptive control of the false discovery rate in multiple testing with independent statistics. Journal of Educational and Behavioral Statistics, 25(1):60–83, 2000.

Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.

David R. Bentley et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59, November 2008. ISSN 1476-4687. doi: 10.1038/nature07517. URL http://dx.doi.org/10.1038/nature07517.

James Berger. Improving on inadmissible estimators in continuous exponential families with applications to simultaneous estimation of gamma scale parameters. The Annals of Statistics, 8(3):545–571, 1980. ISSN 00905364. URL http://www.jstor.org/stable/2240592.


James O. Berger and C. Srinivasan. Generalized Bayes estimators in multivariate problems. The Annals of Statistics, 6(4):783–801, 1978. ISSN 00905364. URL http://www.jstor.org/stable/2958855.

M. E. Bock. Minimax estimators of the mean of a multivariate normal distribution. The Annals of Statistics, 3:209–218, 1975.

Lawrence D. Brown. On the admissibility of invariant estimators of one or more location parameters. Annals of Mathematical Statistics, 37:1087–1136, 1966.

Lawrence D. Brown. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Annals of Mathematical Statistics, 42(3):855–903, 1971.

Lawrence D. Brown. In-season prediction of batting averages: A field test of empirical Bayes and Bayes methodologies. Annals of Applied Statistics, 2(1):113–152, 2008.

Lawrence D. Brown and Eitan Greenshtein. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, 37:1685–1704, 2009.

Lawrence D. Brown and Linda Zhao. Estimators for Gaussian models having a block-wise structure. Statistica Sinica, 19:885–903, 2009.

Tony Cai, Jiashun Jin, and Mark Low. Estimation and confidence sets for sparse normal mixtures. The Annals of Statistics, 35:2421–2449, 2007.

Elissa Cosgrove, Timothy Gardner, and Eric Kolaczyk. On the choice and number of microarrays for transcriptional regulatory network inference. BMC Bioinformatics, 11(1):454, 2010. ISSN 1471-2105. doi: 10.1186/1471-2105-11-454. URL http://www.biomedcentral.com/1471-2105/11/454.

Dennis D. Cox. A penalty method for nonparametric estimation of the logarithmic derivative of a density function. Annals of the Institute of Statistical Mathematics, 37:271–288, 1983.

S. R. Dalal and W. J. Hall. Approximating priors by mixtures of natural conjugate priors. Journal of the Royal Statistical Society, Series B (Methodological), 45(2):278–286, 1983. ISSN 00359246. URL http://www.jstor.org/stable/2345533.

M. DePristo, E. Banks, R. Poplin, K. Garimella, J. Maguire, C. Hartl, A. Philippakis, G. del Angel, M. A. Rivas, M. Hanna, A. McKenna, T. Fennell, A. Kernytsky, A. Sivachenko, K. Cibulskis, S. Gabriel, D. Altshuler, and M. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 2011.


David L. Donoho and Iain M. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika,81(3):425–455, 1994.

David L. Donoho and Iain M. Johnstone. Adapting to unknown smoothness via wavelet shrinkage.Journal of the Americal Statistical Association, 90(432):1200–1224, 1995.

Bradley Efron. Large-scale simultaneous hypothesis testing. Journal of the Americal StatisticalAssociation, 99:96–104, 2004.

Bradley Efron. Local false discovery rates. 2005. URL http://stat.stanford.edu/~brad.

Bradley Efron. Microarrays, empirical bayes and the two-groups model. Statistical Science, 23(1):1–22, 2008.

Bradley Efron. Correlated z-values and the accuracy of large-scale statistical estimates. 2009a.

Bradley Efron. Are a set of microarrays independent of each other? Annals of Applied Statistics, 3, 2009b.

Bradley Efron. Empirical Bayes estimates for large-scale prediction problems. Journal of the American Statistical Association, 104:1015–1028, 2009c.

Bradley Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2010.

Bradley Efron and Carl Morris. Limiting the risk of Bayes and empirical Bayes estimators–Part I: The Bayes case. Journal of the American Statistical Association, 66(336):807–815, 1971. ISSN 01621459. URL http://www.jstor.org/stable/2284231.

Bradley Efron and Carl Morris. Stein's estimation rule and its competitors–an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.

Bradley Efron, Robert Tibshirani, John D. Storey, and Virginia Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.

Christopher Genovese and Larry Wasserman. A stochastic process approach to false discovery control. The Annals of Statistics, 32(3):1035–1061, 2004.

Rose Hoberman, Joana Dias, Bing Ge, Eef Harmsen, Michael Mayhew, Dominique J. Verlaan, Tony Kwan, Ken Dewar, Mathieu Blanchette, and Tomi Pastinen. A probabilistic approach for SNP discovery in high-throughput human resequencing data. Genome Research, 19(9):1542–1552, September 2009. ISSN 1088-9051. doi: 10.1101/gr.092072.109. URL http://dx.doi.org/10.1101/gr.092072.109.


W. James and Charles Stein. Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press, Berkeley, Calif., 1961.

Wenhua Jiang and Cun-Hui Zhang. General maximum likelihood empirical Bayes estimation of normal means. The Annals of Statistics, 37:1647–1684, 2009.

Jiashun Jin. Proportion of non-zero normal means: universal oracle equivalences and uniformly consistent estimators. Journal of the Royal Statistical Society, Series B, 70:461–493, 2008.

Jiashun Jin and Tony Cai. Estimating the null and the proportion of non-null effects in large-scale multiple comparisons. Journal of the American Statistical Association, 102:495–506, 2007.

Jiashun Jin and Tony Cai. Optimal rates of convergence for estimating the null density and proportion of non-null effects in large-scale multiple testing. The Annals of Statistics, 2009.

M. V. Johns, Jr. and J. Van Ryzin. Convergence rates for empirical Bayes two-action problems I. Discrete case. The Annals of Mathematical Statistics, 42(5):1521–1539, 1971. ISSN 00034851. URL http://www.jstor.org/stable/2240276.

M. V. Johns, Jr. and J. Van Ryzin. Convergence rates for empirical Bayes two-action problems II. Continuous case. The Annals of Mathematical Statistics, 43(3):934–947, 1972. ISSN 00034851. URL http://www.jstor.org/stable/2240388.

Iain M. Johnstone and Bernard W. Silverman. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. Annals of Statistics, 32(4):1594–1649, 2004.

Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen, and Jun Wang. SNP detection for massively parallel whole-genome resequencing. Genome Research, 19(6):1124–1132, June 2009. ISSN 1088-9051. doi: 10.1101/gr.088013.108. URL http://dx.doi.org/10.1101/gr.088013.108.

Ingrid Lonnstedt and Terry Speed. Replicated microarray data. Statistica Sinica, 12:31–46, 2002.

Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297–1303, 2010. URL http://genome.cshlp.org/content/20/9/1297.abstract.

G. J. McLachlan, R. W. Bean, and L. Ben-Tovim Jones. A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics, 22(13):1608–1615, 2006.

Geoffrey McLachlan and David Peel. Finite Mixture Models. Wiley-Interscience, 2000.


Nicolai Meinshausen and John Rice. Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. The Annals of Statistics, 34(1):373–393, 2006.

Xiao-Li Meng and Donald B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267–278, 1993. doi: 10.1093/biomet/80.2.267. URL http://biomet.oxfordjournals.org/content/80/2/267.abstract.

Carl N. Morris. Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association, 78(381):47–55, 1983. ISSN 01621459. URL http://www.jstor.org/stable/2287098.

Dan Nettleton, J. T. Gene Hwang, Rico A. Caldo, and Roger P. Wise. Estimating the number of true null hypotheses from a histogram of p values. Journal of Agricultural, Biological, and Environmental Statistics, 11(3):337–356, 2006.

Michael A. Newton, Amine Noueiry, Deepayan Sarkar, and Paul Ahlquist. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5(2):155–176, 2004.

W. Pan, J. Lin, and C. T. Le. A mixture model approach to detecting differentially expressed genes with microarray data. Functional and Integrative Genomics, 3(3):117–124, 2003.

Matthew D. W. Piper, Pascale Daran-Lapujade, Christoffer Bro, Birgitte Regenberg, Steen Knudsen, Jens Nielsen, and Jack T. Pronk. Reproducibility of oligonucleotide microarray transcriptome analyses. Journal of Biological Chemistry, 277(40):37001–37008, 2002. doi: 10.1074/jbc.M204490200. URL http://www.jbc.org/content/277/40/37001.abstract.

Stan Pounds and Cheng Cheng. Improving false discovery rate estimation. Bioinformatics, 20(11):1737–1745, 2004.

Stan Pounds and Stephan W. Morris. Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics, 19:1236–1242, 2003.

Herbert Robbins. An empirical Bayes approach to statistics. In Jerzy Neyman, editor, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, pages 157–163, 1954.

Herbert Robbins. Estimation and prediction for mixtures of the exponential distribution. Proceedings of the National Academy of Sciences of the United States of America, 77(5):2382–2383, 1980. URL http://www.pnas.org/content/77/5/2382.abstract.

Armin Schwartzman. Empirical null and false discovery rate inference for exponential families. Annals of Applied Statistics, 2(4):1332–1359, 2008.


Divakar Sharma. Asymptotic equivalence of two estimators for an exponential family. The Annals of Statistics, 1:973–980, 1973.

R. S. Singh. Empirical Bayes estimation with convergence rates in noncontinuous Lebesgue exponential families. The Annals of Statistics, 4(2):431–439, 1976. ISSN 00905364. URL http://www.jstor.org/stable/2958217.

R. S. Singh. Empirical Bayes estimation in Lebesgue-exponential families with rates near the best possible rate. The Annals of Statistics, 7(4):890–902, 1979. ISSN 00905364. URL http://www.jstor.org/stable/2958935.

R. S. Singh and Laisheng Wei. Empirical Bayes with rates and best rates of convergence in u(x)c(θ) exp(−x/θ)-family: Estimation case. Annals of the Institute of Statistical Mathematics, 44:435–449, 1992.

Branko Soric. Statistical "discoveries" and effect-size estimation. Journal of the American Statistical Association, 84(406):608–610, 1989. ISSN 01621459. URL http://www.jstor.org/stable/2289950.

Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and Los Angeles, 1956. University of California Press.

J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64(Part 3):479–498, 2002.

John Storey, Jonathan Taylor, and David Siegmund. Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach. Journal of the Royal Statistical Society, Series B, 66:187–205, 2004.

Korbinian Strimmer. A unified approach to false discovery rate estimation. BMC Bioinformatics, 9:303, 2008.

Wenguang Sun and Tony Cai. Oracle and adaptive compound decision rules for false discovery rate control. Journal of the American Statistical Association, 102(479):901–912, 2007.

Jan W. H. Swanepoel. The limiting behavior of a modified maximal symmetric 2s-spacing with applications. The Annals of Statistics, 27(1):24–35, 1999. ISSN 00905364. URL http://www.jstor.org/stable/120116.

Chen-An Tsai, Huey-miin Hsueh, and James J. Chen. Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics, 59(4):1071–1081, 2003. ISSN 0006341X. URL http://www.jstor.org/stable/3695348.


M. C. K. Tweedie. Functions of a statistical variate with given means, with special reference to Laplacian distributions. Mathematical Proceedings of the Cambridge Philosophical Society, 43(1):41–49, 1947. doi: 10.1017/S0305004100023185. URL http://dx.doi.org/10.1017/S0305004100023185.

Curtis P. Van Tassell, Timothy P. Smith, Lakshmi K. Matukumalli, Jeremy F. Taylor, Robert D. Schnabel, Cynthia Taylor Lawley, Christian D. Haudenschild, Stephen S. Moore, Wesley C. Warren, and Tad S. Sonstegard. SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nature Methods, 5(3):247–252, March 2008. ISSN 1548-7105. doi: 10.1038/nmeth.1185. URL http://dx.doi.org/10.1038/nmeth.1185.

Cun-Hui Zhang. Empirical Bayes and compound estimation of normal means. Statistica Sinica, 7:181–193, 1997.

Cun-Hui Zhang. Compound decision theory and empirical Bayes methods: invited paper. The Annals of Statistics, 31:379–390, 2003.