What is Empirical Bayes? A brief history, well-known examples & connections.
Berkeley Stanford Joint Colloquium (BSJC) in Statistics
Leonid Pekelis
Apr 24, 2012
A brief (incomplete, biased) history.
- von Mises (1940s)
- Robbins (1955) - "An Empirical Bayes Approach to Statistics"
- Efron & Morris (1977) - "Stein's Paradox in Statistics"
  - Stein (1956), James & Stein (1961)
- Morris (1983) - "Parametric Empirical Bayes Inference: Theory & Applications"
- Casella (1985) - "An Introduction to Empirical Bayes Data Analysis"
- Efron, Tibshirani, Storey & Tusher (2001) - "Empirical Bayes Analysis of a Microarray Experiment"
- Benjamini & Hochberg (1995)
- Efron (2010) - "Large Scale Inference"
A brief history.
Robbins - "Our reason for doing this ... is that we hope for large N to be able to extract information about the [prior] from the values which have been observed, hopefully in such a way that [EBayes] will be close to the optimal but unknown [Bayes] ..."
Efron - "Empirical Bayes blurs the line between testing and estimation as well as between frequentism and Bayesianism."
Morris - "Should statisticians use empirical Bayes modeling in most multiparameter inference problems? Probably not."
A brief history.
- Non-Parametric Empirical Bayes
  - leaves the prior completely unspecified (Robbins - 1955, Tweedie's formula)
- Parametric Empirical Bayes
  - specifies a family of priors and estimates its hyperparameters (James-Stein)
Robbins Original Formulation
X ∈ 𝒳, with a σ-finite measure ν(x)
µ ∈ Q, µ ∼ G(µ)
when the parameter is µ, X|µ ∼ f_µ
for any loss L(t, µ), with t a decision function:
R(t, G) = E_G E_{f_µ}[L(t(X), µ)]
R(G) = min_t R(t, G)
Robbins Original Formulation
R(G) = min_t R(t, G)
What if G is not known? But you have samples (x_i, µ_i), i = 1, ..., N, iid with µ_i ∼ G, x_i ∼ f_{µ_i}. Then you can estimate t_N(x) = t_N(x; x_1, ..., x_N) and get
R_N({t_N}, G) = E_x R(t_N, G) ≥ R(G)
Definition: T = {t_N} is asymptotically optimal relative to G if
lim_{N→∞} R_N(T, G) = R(G)
We want the above to hold on a class 𝒢 containing the true G.
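The canonical example from Robbins (1955) makes this concrete: for x|µ ∼ Poisson(µ), the Bayes rule E(µ|x) = (x+1) f(x+1)/f(x) depends on G only through the marginal f, so the observed values can stand in for the unknown prior. A minimal sketch, with a Gamma prior chosen purely for illustration (it gives a closed-form Bayes rule to compare against):

```python
import numpy as np

rng = np.random.default_rng(0)

# Robbins' setting: mu_i ~ G (here Gamma, for illustration), x_i | mu_i ~ Poisson(mu_i).
N = 100_000
mu = rng.gamma(shape=2.0, scale=1.5, size=N)
x = rng.poisson(mu)

# Bayes rule: E(mu | x) = (x + 1) * f(x + 1) / f(x), f the marginal pmf.
# The nonparametric EB estimate replaces f by empirical frequencies.
counts = np.bincount(x, minlength=x.max() + 2)
freq = counts / N

def robbins(k):
    """Nonparametric EB estimate of E(mu | x = k)."""
    return (k + 1) * freq[k + 1] / freq[k]

# For a Gamma(a, scale=s) prior the true Bayes rule is (k + a) * s / (s + 1).
for k in [0, 1, 3, 5]:
    print(k, robbins(k), (k + 2.0) * 1.5 / 2.5)
```

For large N the empirical frequencies are close to f, so the EB rule tracks the unknown Bayes rule, which is exactly the asymptotic-optimality claim above.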
Example: James-Stein Estimator
Suppose G = N(0, A) and f_µ = N(µ, 1).
Then E(µ|x) = (1 − 1/(A+1)) x
And E((N−2)/S) = 1/(A+1), where S = ‖x‖².
So µ̂_JS(x) = (1 − (N−2)/S) x
Theorem: For N ≥ 3, µ̂_JS everywhere dominates µ̂_MLE for L(t, µ) = ‖t − µ‖².
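The theorem is easy to see by simulation. A minimal sketch under squared-error loss, with a fixed µ drawn once from N(0, A) (dominance holds for every fixed µ, so any choice works; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, A, reps = 10, 1.0, 2000

# A fixed mean vector; JS dominance holds for *every* mu when N >= 3.
mu = rng.normal(0.0, np.sqrt(A), size=N)

risk_mle = risk_js = 0.0
for _ in range(reps):
    x = mu + rng.normal(size=N)          # x | mu ~ N(mu, I)
    S = x @ x                            # S = ||x||^2
    js = (1 - (N - 2) / S) * x           # James-Stein shrinkage
    risk_mle += np.sum((x - mu) ** 2)    # MLE is x itself
    risk_js += np.sum((js - mu) ** 2)

print(risk_mle / reps, risk_js / reps)   # JS risk should come out smaller
```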
Example: Benjamini-Hochberg / FDR
BH procedure: Given p-values p_1, ..., p_N:
Fix some q ∈ (0, 1), and take i_max as the largest index such that p_(i) ≤ (i/N) q.
Reject H_0(i) for i ≤ i_max.
Theorem: If p_i ∼ U(0, 1) under H_0i, and the p_i are independent for all i, then
E(#FR / #R) ≤ q
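The step-up rule above can be sketched as follows (the helper name `bh_reject` and the example p-values are hypothetical):

```python
import numpy as np

def bh_reject(p, q):
    """Benjamini-Hochberg step-up: reject the hypotheses whose sorted
    p-values satisfy p_(i) <= (i / N) * q up to the largest such i."""
    p = np.asarray(p)
    N = len(p)
    order = np.argsort(p)
    thresh = (np.arange(1, N + 1) / N) * q
    below = np.nonzero(p[order] <= thresh)[0]
    reject = np.zeros(N, dtype=bool)
    if below.size:
        i_max = below.max()                  # largest index on/below the line
        reject[order[: i_max + 1]] = True    # reject all ranked <= i_max
    return reject

# Tiny worked example (hypothetical p-values):
p = [0.001, 0.008, 0.039, 0.041, 0.30, 0.90]
print(bh_reject(p, q=0.05))   # rejects the two smallest p-values
```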
Example: BH / FDR - Empirical Bayes
Suppose the p_i correspond to test statistics z_i.
Specifically µ ∼ π_0 1(µ = 0) + π_1 g(µ), and x|µ ∼ N(µ, 1). And test H_0i : µ_i = 0.
Define F̄ as the empirical distribution of the z_i, F_0 as the common null distribution, and F̂dr(z) = π_0 F_0(z) / F̄(z).
Corollary: The procedure rejecting for i ≤ i_max, with i_max the largest index such that F̂dr(z_(i)) < q, controls the false discovery proportion at level q as before.
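A minimal sketch of the corollary's rule, taking F_0 = Φ (left tail) and a conservative π_0 = 1; the function name and the simulated mixture are illustrative only:

```python
import numpy as np
from math import erf

def phi(z):
    """Standard normal cdf, applied elementwise."""
    return np.array([0.5 * (1 + erf(v / 2 ** 0.5)) for v in np.atleast_1d(z)])

def eb_fdr_reject(z, q, pi0=1.0):
    """Fdr_hat(z_(i)) = pi0 * F0(z_(i)) / Fbar(z_(i)), with Fbar(z_(i)) = i/N;
    reject the i_max smallest z, i_max the largest i with Fdr_hat < q."""
    z = np.asarray(z, dtype=float)
    N = len(z)
    order = np.argsort(z)                      # ascending: smallest z first
    fdr_hat = pi0 * phi(z[order]) / (np.arange(1, N + 1) / N)
    reject = np.zeros(N, dtype=bool)
    below = np.nonzero(fdr_hat < q)[0]
    if below.size:
        reject[order[: below.max() + 1]] = True
    return reject

# Hypothetical two-groups sample: mostly null N(0,1), a few shifted left.
rng = np.random.default_rng(2)
z = np.concatenate([rng.normal(0, 1, 95), rng.normal(-5, 1, 5)])
print(eb_fdr_reject(z, q=0.1).sum(), "rejections")
```

Note that F̂dr(z_(i)) < q is algebraically the BH condition p_(i) ≤ (i/N) q/π_0, which is why the corollary inherits the BH guarantee.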
Example: Tweedie's formula
µ ∼ g(µ), z|µ ∼ f_µ(z) = e^{µz − ψ(µ)} f_0(z)
Bayes rule: g(µ|z) = f_µ(z) g(µ) / f(z), where f(z) = ∫ f_µ(z) g(µ) dµ
⇒ g(µ|z) = e^{zµ − λ(z)} (g(µ) e^{−ψ(µ)}), λ(z) = log(f(z) / f_0(z))
⇒ E(µ|z) = λ′(z), Var(µ|z) = λ″(z)
⇒ µ|z ∼ (l′(z) − l′_0(z), l″(z) − l″_0(z)), where l(z) = log f(z), l_0(z) = log f_0(z)
For z|µ ∼ N(µ, 1): µ|z ∼ (z + l′(z), 1 + l″(z)).
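In the conjugate normal case the formula can be checked in closed form: with µ ∼ N(0, A) the marginal f is N(0, A+1), so l′(z) = −z/(A+1) and z + l′(z) recovers the usual posterior mean Az/(A+1). A small numeric sketch of that check:

```python
import numpy as np

A = 2.0                     # prior variance: mu ~ N(0, A), z | mu ~ N(mu, 1)
z = np.linspace(-3, 3, 7)

# Marginal is N(0, A + 1), so l(z) = -z^2 / (2 (A + 1)) + const
# and l'(z) = -z / (A + 1). Tweedie: E(mu | z) = z + l'(z).
tweedie = z + (-z / (A + 1))

# Conjugate-normal posterior mean, for comparison: A z / (A + 1).
conjugate = A * z / (A + 1)

print(np.allclose(tweedie, conjugate))   # the two formulas agree exactly
```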
Example: Tweedie's formula
The Empirical Bayes part comes from assuming
f(z) = e^{∑_{j=0}^J β_j z^j}
Then β̂ can be estimated by a Poisson GLM (Lindsey's Method).
And finally l̂′(z) = ∑_{j=1}^J j β̂_j z^{j−1}.
Theorem: Under the above assumptions, Lindsey's Method produces nearly unbiased estimates of f.
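A minimal sketch of Lindsey's method under these assumptions (degree J = 2, a hand-rolled Newton/IRLS loop in place of a GLM library; all names illustrative). Bin the z values, treat the bin counts as Poisson, fit log f as a polynomial, then plug l̂′ into Tweedie's formula:

```python
import numpy as np

rng = np.random.default_rng(3)

# mu ~ N(0, A), z | mu ~ N(mu, 1); marginally z ~ N(0, A + 1).
A, N, J = 1.0, 5000, 2
z = rng.normal(0.0, np.sqrt(A), N) + rng.normal(0.0, 1.0, N)

# Lindsey's method: bin z, treat bin counts as independent Poisson, and fit
# log f(z) = sum_j beta_j z^j by a Poisson GLM (plain Newton/IRLS here).
K = 60
counts, edges = np.histogram(z, bins=K)
mid = 0.5 * (edges[:-1] + edges[1:])
X = np.vander(mid, J + 1, increasing=True)     # basis columns 1, z, z^2

beta = np.zeros(J + 1)
beta[0] = np.log(counts.mean() + 1.0)          # sensible starting intercept
for _ in range(50):
    w = np.exp(np.clip(X @ beta, -30, 30))     # fitted Poisson means
    beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (counts - w))

def mu_hat(zz):
    """Tweedie estimate z + l_hat'(z), l_hat'(z) = sum_j j beta_j z^(j-1)."""
    return zz + sum(j * beta[j] * zz ** (j - 1) for j in range(1, J + 1))

# With J = 2 this should be close to the exact linear shrinkage A z / (A + 1).
print(mu_hat(2.0), A / (A + 1) * 2.0)
```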
Tweedie & James-Stein
Suppose µ ∼ N(0, A), and take J = 2.
µ̂_Tweedie(z) = (1 − N/S) z
µ̂_JS(z) = (1 − (N−2)/S) z
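The two estimators differ only in N versus N − 2 in the shrinkage factor, which is easy to see on a fixed (hypothetical) data vector:

```python
import numpy as np

# A hypothetical data vector with N = 10 and S = ||z||^2 = 20.
z = np.array([2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, -1.0, 0.0, 0.0])
N, S = len(z), z @ z

mu_tweedie = (1 - N / S) * z        # (1 - 10/20) z = 0.5 z
mu_js = (1 - (N - 2) / S) * z       # (1 - 8/20) z  = 0.6 z
print(mu_tweedie[0], mu_js[0])      # 1.0 and 1.2: Tweedie shrinks harder here
```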
Other Connections
- under the 2-groups model, fdr(z) = π_0 f_0(z) / f(z), and
  −(∂/∂z) log(fdr(z)) = l′(z) − l′_0(z) = E(µ|z)
- F̂dr is similar to, and more general than, Tweedie's formula, as it only requires specification of the null density f_0(z).
- g_BH(µ) = π_0 1(µ = 0) + π_1 g(µ), while g_JS(µ) = N(0, A)
- JS can have N = 10, but Fdr or Tweedie with J > 2 need N ≫ 10
- Suggests a bias/variance tradeoff in empirical Bayes estimates.
The Last Slide
Robbins - "... it is obvious that any theory of statistical inference will find itself in and out of fashion as the winds of doctrine blow. Here, then, are some remarks and references for further reading which I hope will interest my audience in thinking the matter through for themselves."
Me - "Thanks!"