Post on 14-Apr-2018
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 1/20
Introduction to Bayesian Statistics
Yanwei (Wayne) Zhang
Statistical ResearchCNA Insurance CompanyYanwei.Zhang@cna.com
http://www.actuaryzhang.com
June 6, 2011Chicago
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 1 / 20
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 2/20
Outline
Today’s agenda
Bayesian analysis as a measure of uncertainty
Probability and Bayes’ TheoremFrequentists’ confidence intervals
Specification of prior probabilities
Exchangeability
Summarizing posterior distribution
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 2 / 20
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 3/20
Bayesian analysis as a measure of uncertainty
What is probability
1 Frequency interpretation
Limiting proportion of times the event occurs in an infinite sequence of independent repetitions of the experimentExamples: toss of coins, dicesThe requirement that an experiment can be repeated is rather limiting:does it make sense to make the statement of the form “probability of rain tomorrow =0.5”?What does it mean for “independent” experiment? (circular use of probability)
2 Subjective probability
Concern the assessments of a given person about uncertain outcomesInterpreted as a personal belief or a statement about uncertainty
Based on the present knowledge of the event - knowledge base changeleads to probability updating
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 3 / 20
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 4/20
Bayesian analysis as a measure of uncertainty
The fundamental theorem
The inference process in a Bayesian analysis proceeds as follows:
Specify a prior distribution p (θ), which reflects our knowledge oruncertainty about θ prior to observing the data D ;
Specify a probability model p (D |θ);
Compute a posterior distribution p (θ|D ) for θ using the Bayes’Theorem, which reflects our knowledge of θ after observing the data.
Bayes’ Theorem
Posterior distribution∝ data distribution× prior distribution
Liklihood Posterior Prior
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 4 / 20
B i l i f i
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 5/20
Bayesian analysis as a measure of uncertainty
Example
You are given:
1 A portfolio of independent risks is divided into two classes, Class A andClass B.
2 There are twice as many risks in Class A as in Class B.
3 The number of claims for each insured during a single year follows a
Bernoulli distribution.4 Classes A and B have claim size distributions as follows:
Claim Size Class A Class B50,000 0.60 0.36
100,000 0.40 0.64
5 The expected number of claims per year is 0.22 for Class A and 0.11for Class B.
One insured is chosen at random. The insured’s loss for two years combinedis 100,000. You are asked to determine which class the selected insured ismost likely to belong to.
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 5 / 20
Ba esian anal sis as a meas re of ncertaint
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 6/20
Bayesian analysis as a measure of uncertainty
Solution
Denote D as the event that the two years combined loss is 100,000, and in
this case θ ∈ {A, B } .Prior distribution: p (θ = A) = 0.6667, p (θ = B ) = 0.3333.
Data distribution:
p (D |A) = 0.222 · 0.62 + 2 · 0.22 · (1 − 0.22) · 0.4 = 0.1547
p (D |B ) = 0.112 · 0.362 + 2 · 0.11 · (1 − 0.11) · 0.64 = 0.1269
Posterior distribution:
P (A|D ) =p (D |A)P (A)
p (D |A)P (A) + p (D |B )P (B )
=0.1547 · 0.6667
0.1547 · 0.6667 + 0.1269 · 0.3333
= 0.71
P (B |D ) = 1 − P (A|D ) = 0.29
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 6 / 20
Bayesian analysis as a measure of uncertainty
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 7/20
Bayesian analysis as a measure of uncertainty
What do we learn?
It illustrates how data are combined with prior information to come to
a conclusionThe conclusion includes a measure of uncertainty
Were we to use any other method, say MLE (choose θ that maximizep (D |θ)), we will not be able to get an estimate of uncertainty
In that case, getting a confidence interval is also meaningless: itsuggests either complete confidence or no confidence
In fact, the frequentists’ construction of confidence intervals encountermany difficulties.
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 7 / 20
Frequentists’ confidence interval
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 8/20
Frequentists confidence interval
Confidence interval
Consider the following example from Berger (1985):
Suppose that X 1 and X 2 are independent with identical distributiongiven by
P (X i = θ − 1) = P (X i = θ + 1) = 1/2.
Then, a frequentist 75% confidence procedure is: (i.e.,P (δ (X ) = θ) = 0.75 for all θ)
δ (X ) =
(X 1 + X 2)/2, X 1 = X 2
X 1 − 1, X 1 = X 2
Now suppose you do an experiment, the outcome being x 1 and x 2
If x 1 = x 2, we know (x 1 + x 2)/2 = θ, thus δ (x ) is 100% certain tocontain θ;If x 1 = x 2, δ (x ) is 50% certain to contain θ;Neither is 75%! What’s going wrong?
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 8 / 20
Frequentists’ confidence interval
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 9/20
Frequentists confidence interval
Confidence interval cont’d
What the frequentist’s interval means is that in a sequence of repeated
experiments, where each δ (x ) is constructed based on the experimentoutcomes as above, the confidence intervals will cover the true θ about75% of the time.
It is true with regard to overall performance because half the time theactual performance will be 100% and half the time the actual
performance will be 50%.However, it clearly makes little sense to conduct an experiment, useδ (x ), and actually report 75% as the measure of accuracy!
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 9 / 20
Specification of prior probabilities
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 10/20
p p p
Prior information
1 A lady, who adds milk to her tea, claims to be able to tell whether the
tea or the milk was poured into the cup first. In all of ten trialsconducted to test this, she correctly determines which was poured first.
2 A music expert claims to be able to distinguish a page of Haydn scorefrom a page of Mozart score. In ten trials conducted to test this, hemakes a correct determination each time.
3 A drunken friend says he can predict the outcome of a flip of a faircoin. In ten trials conducted to test this, he is correct each time.
In all three situations, the unknown quantity θ is the probability of theperson answering correctly. A classical significance test of the various claimswould reject the hypothesis that θ = 0.5 (i.e., the person is guessing) with a
(one-tailed) significance level of 2−10.
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 10 / 20
Specification of prior probabilities
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 11/20
p p p
Prior information cont’d
In situation 2 we would have no reason to doubt this conclusion. (The
outcome is quite plausible with respect to our prior beliefs.)In situation 3, however, our prior opinion that this prediction isimpossible (barring a belief in extrasensory perception) would tend tocause us to ignore the experimental evidence as being a lucky streak.
In situation 1 it is not quite clear what to think, and different people
will draw different conclusions according to their prior beliefs of theplausibility of the claim.
In these three identical statistical situations, prior information clearly cannotbe ignored.
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 11 / 20
Specification of prior probabilities
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 12/20
Specification of prior distributions
Subjective: the prior is chosen by making a serious attempt to quantify
a state of knowledge prior to observing the data. E.g., I expect the lossratio to follow log θ ∼ N (log (0.6), 0.32).
Objective: prior is chosen to satisfy some objective criterion.
Noninformative prior: a prior that “favors” no particular values of theparameter over others, e.g., θ ∼ U (0, 2), θ ∼ N (0, 1002).
Improper prior: a prior whose integral is infinite, but the resultantposterior is a proper probability distribution, e.g., p (µ) ∝ 1
In many cases, noninformative proper prior is in effect informative in thesense of favoring some parameter valuesIt may lead to procedures with attractive frequentist properties
Jeffreys prior: motivated by the desire that inference should not dependon how a model is parameterized.
E.g., if θ ∼ U [0, 1],√
θ has a non-uniform distribution with higherdensity near 1 than 0.Obtained as p (θ) ∼
I (θ), I (θ) being the Fisher information
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 12 / 20
Specification of prior probabilities
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 13/20
Conjugate priors
DefinitionIf F is a class of sampling distribution p (D |θ), and P is a class of priordistributions p (θ), then the class P is conjugate for F if
p (θ|D ) ∈ P ∀ p (D |θ) ∈ F , p (θ) ∈ P
Some of the commonly used conjugate families are listed here:
data distribution conjugate prior posterior
Binomial Beta(α, β ) Beta(α +
x i , β +
ni −
x i )Poisson Gamma(α, β ) Gamma(α +
x i , β + n)
Exponential Gamma(α, β ) Gamma(α + n, β + x i )
Normal (σ known) N (µ0, σ20 ) N
µ0
σ20
+
x i σ2 · v , v
v =
1σ
20
+ n
σ2
−
1
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 13 / 20
Exchangeability
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 14/20
Exchangeability
DefinitionThe random quantities x 1, · · · , x n are said to be finitely exchangeable if
P (x 1 ∈ E 1, · · · , x n ∈ E n) = P (x σ(1) ∈ E 1, · · · , x
σ(n) ∈ E n), (1)
for any permutation σ on the set 1, · · · , n, and any sets E 1, · · · , E n of
possible values.
Intuitively, it says distributions of x 1, · · · , x n do not depend on thelabeling or order
“independent and identically distributed” is a special case
This formalizes the notion of “the future being predictable on the basisof past experience”.
An infinite sequence x 1, x 2, · · · is exchangeable if every finite sequenceis exchangeable
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 14 / 20
Exchangeability
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 15/20
Exchangeability cont’d
de Finetti’s TheoremIf x 1, x 2, · · · is an infinite exchangeable sequence of random variables withprobability measure P , there exists a probability measure Q on F , the set of all distributions on R, such that the joint distribution of x 1, · · · , x n has theform
P (x 1, · · · , x n) = F
ni =1
F (x i )dQ (F ) (2)
It says we can think of an exchangeable sequence x 1, x 2, · · · as arising froma “two-stage” randomization
1
Pick a distribution F according to the measure Q ;2 Conditional on F , x i ∼ F , i .i .d
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 15 / 20
Summarizing posterior distribution
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 16/20
Summarizing posterior distribution
The posterior p (θ|D ) encapsulates the beliefs post observing the data, and
various summaries are availablePoint estimate
Mean, median and modeMeasure of spread such as variance, interquartile, quantiles
Interval estimate
100(1 − α)% credible interval C such that C p (θ)d θ = 1 − α.
Highest probability density region (HPD):p (θ1) ≥ p (θ2),∀θ1 ∈ C , θ2 /∈ C
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 16 / 20
Summarizing posterior distribution
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 17/20
Example
Suppose that an insurance policy incurred 5 claims each year in a period of
two years (n = 2):We assume the claim count follows a Poisson distributiony i |θ ∼ Pois (θ)
The prior comes from a conjugate Gamma distribution such thatθ ∼ Gamma(α, β )
Then we have the posterior θ|y ∼ Gamma(α +
y i , β + n).We rely on posterior simulation to draw inference, as this is morestraightforward and can work when the posterior is not of closed form.
Prior Posterior
mean variation mean median 50% interval3 1 3.79 3.72 [3.16, 4.37]5 3 4.99 4.79 [3.98, 5.83]8 4 5.56 5.42 [4.50, 6.47]8 20 5.05 4.93 [3.99, 5.97]8 100 5.03 4.84 [3.89, 5.98]
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 17 / 20
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 18/20
0 2 4 6 8 10 12 14
prior
m e a n = 3 , s t d = 1
posteior
0 2 4 6 8 10 12 14
0 2 4 6 8 10 12 14
prior
m e a n = 8 , s
t d = 4
posteior
0 2 4 6 8 10 12 14
0 2 4 6 8 10 12 14
prior
m e a n = 8 , s t d = 1 0 0
posteior
0 2 4 6 8 10 12 14
Summarizing posterior distribution
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 19/20
Posterior as the number of observations varies
0 2 4 6 8 10 12
n= 2
0 2 4 6 8 10 12
n= 5
0 2 4 6 8 10 12
n= 20
0 2 4 6 8 10 12
n= 100
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 19 / 20
Summarizing posterior distribution
7/29/2019 topic4 Bayes with R
http://slidepdf.com/reader/full/topic4-bayes-with-r 20/20
Nonconjugate prior
Now suppose I want to use a Uniform prior that log θ ∼ N (0, 1002), then
the posterior distribution will be
θ|y ∝ θ
y i −1 exp
−θ −
(log θ)2
2 · 1002
(3)
This is not a distribution (as a function of θ) we are familiar with, so howdo we simulate samples from it?
Wayne Zhang (CNA insurance company) Bayesian statistics June 6, 2011 Chicago 20 / 20