Likelihood, Bayesian and Decision Theory
Kenneth Yu
History

• The likelihood principle was first introduced by R. A. Fisher in 1922. The law of likelihood was identified by Ian Hacking.
• "Modern statisticians are familiar with the notion that any finite body of data contains only a limited amount of information on any point under examination; that this limit is set by the nature of the data themselves…the statistician's task, in fact, is limited to the extraction of the whole of the available information on any particular issue." R. A. Fisher
Likelihood Principle
• All relevant information in the data is contained in the likelihood function L(θ | x) = P(X = x | θ).

Law of Likelihood
• The extent to which the evidence supports one parameter value over another can be measured by taking the ratio of their likelihoods.

• These two concepts allow us to use the likelihood for inferences on θ.
Motivation and Applications

• Likelihood (especially the MLE) is used in a range of statistical models, such as structural equation modeling, confirmatory factor analysis, linear models, etc., to make inferences on the parameter of a function. Its importance comes from the need to find the "best" parameter value subject to error.
• This makes use of only the evidence and disregards the prior probability of the hypothesis. By making inferences on unknown parameters from our past observations, we are able to estimate the true θ value for the population.
• The likelihood is a function of the form:

L(θ | X) ∈ { α P(X | θ) : α > 0 }

• This represents how "likely" θ is if we have the prior outcomes X. It is proportional to the probability of X happening given the parameter θ.
• Likelihood functions are equivalent if they differ only by a constant α (they are proportional). Inferences on the parameter θ are the same if based on equivalent functions.
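The equivalence-up-to-a-constant point can be checked numerically. Below is a minimal Python sketch (the Bernoulli data counts and the constant α = 5 are made up for illustration): two proportional likelihood functions are maximized at exactly the same θ, so they lead to the same inference.

```python
def bernoulli_likelihood(theta, heads, tails):
    # L(theta | x) = theta^heads * (1 - theta)^tails
    return theta**heads * (1 - theta)**tails

def argmax_on_grid(f, grid):
    # Return the grid point where f is largest.
    return max(grid, key=f)

grid = [i / 1000 for i in range(1, 1000)]
heads, tails = 13, 7   # hypothetical coin-toss counts

L1 = lambda t: bernoulli_likelihood(t, heads, tails)
L2 = lambda t: 5.0 * bernoulli_likelihood(t, heads, tails)  # alpha = 5

# Proportional likelihoods give the same inference about theta:
assert argmax_on_grid(L1, grid) == argmax_on_grid(L2, grid)
```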
Maximum Likelihood Method

By Hanchao

Main topics include:
• 1. Why use the maximum likelihood method?
• 2. The likelihood function
• 3. Maximum likelihood estimators
• 4. How to calculate the MLE?
1. Why use the Maximum Likelihood Method?

Difference between the method of moments and the method of maximum likelihood:

• Mostly, the same!
• However, the method of maximum likelihood does yield "good" estimators:
1. It is an after-the-fact calculation.
2. It is a more versatile method for fitting parametric statistical models to data.
3. It is well suited to large data samples.
2. Likelihood Function

• Definition: Let f(x₁, …, xₙ; θ), θ ∈ Rᵏ, be the joint probability (or density) function of the n random variables X₁, …, Xₙ with sample values x₁, …, xₙ. The likelihood function of the sample is given by:

L(x₁, …, xₙ; θ) = f(x₁, …, xₙ; θ)
• If X₁, …, Xₙ are discrete iid random variables with probability function p(x, θ), then the likelihood function is given by:

L(θ) = P(X₁ = x₁, …, Xₙ = xₙ) = ∏ᵢ₌₁ⁿ P(Xᵢ = xᵢ) = ∏ᵢ₌₁ⁿ p(xᵢ, θ)
• In the continuous case, if the density is f(x, θ), then the likelihood function is given by:

L(θ) = ∏ᵢ₌₁ⁿ f(xᵢ, θ)

Example: Let X₁, …, Xₙ be iid N(μ, σ²) random variables. Find the likelihood function.

L(μ, σ²) = ∏ᵢ₌₁ⁿ (1/(√(2π) σ)) exp(−(xᵢ − μ)²/(2σ²)) = (2πσ²)^(−n/2) exp(−Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²))
4. Procedure of one approach to find the MLE

• 1) Define the likelihood function L(θ).
• 2) Take the natural logarithm (ln) of L(θ).
• 3) Differentiate ln L(θ) with respect to θ, and then equate the derivative to 0.
• 4) Solve for the parameter θ; we obtain the estimate θ̂.
• 5) Check whether it is a maximum (ideally a global maximum).

• Still confused?
Ex. 1: Suppose X₁, …, Xₙ are random samples from a Poisson distribution with parameter λ. Find the MLE λ̂.

We have the pmf:

p(x) = e^(−λ) λˣ / x!,  x = 0, 1, 2, …;  λ > 0

Hence, the likelihood function is:

L(λ) = ∏ᵢ₌₁ⁿ e^(−λ) λ^(xᵢ) / xᵢ! = e^(−nλ) λ^(Σᵢ xᵢ) / ∏ᵢ₌₁ⁿ xᵢ!
Differentiating ln L(λ) with respect to λ results in:

d ln L(λ)/dλ = −n + (Σᵢ₌₁ⁿ xᵢ)/λ

Setting the result equal to zero:

−n + (Σᵢ₌₁ⁿ xᵢ)/λ = 0

That is, Σᵢ xᵢ = nλ. Hence, the MLE of λ is:

λ̂ = (1/n) Σᵢ₌₁ⁿ Xᵢ = X̄
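The closed-form result λ̂ = X̄ can be sanity-checked numerically. A small Python sketch (the sample below is invented for illustration) compares the analytic MLE against a grid search over the log-likelihood:

```python
import math

def poisson_log_lik(lam, xs):
    # ln L(lambda) = -n*lambda + (sum_i x_i) * ln(lambda) - sum_i ln(x_i!)
    n = len(xs)
    return -n * lam + sum(xs) * math.log(lam) - sum(math.lgamma(x + 1) for x in xs)

xs = [2, 0, 3, 1, 4, 2, 2, 1]   # hypothetical Poisson counts
mle = sum(xs) / len(xs)         # closed form: lambda_hat = x-bar = 1.875

# Grid search: no lambda on a fine grid beats x-bar.
grid = [i / 100 for i in range(1, 1001)]
best = max(grid, key=lambda lam: poisson_log_lik(lam, xs))
assert abs(best - mle) < 0.01
```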
Ex. 2: Let X₁, …, Xₙ be N(μ, σ²).
a) If μ is unknown and σ² = σ₀² is known, find the MLE for μ.
b) If μ = μ₀ is known and σ² is unknown, find the MLE for σ².
c) If μ and σ² are both unknown, find the MLEs for μ and σ².

• Ans: The likelihood function is:

L(μ, σ²) = (2πσ²)^(−n/2) exp(−Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²))
So after taking the natural log we have:

ln L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − Σᵢ₌₁ⁿ (xᵢ − μ)²/(2σ²)

a) When σ² = σ₀² is known, we only need to solve for the unknown parameter μ:

∂ ln L(μ, σ₀²)/∂μ = Σᵢ₌₁ⁿ (xᵢ − μ)/σ₀² = 0

so Σᵢ₌₁ⁿ (xᵢ − μ) = 0, i.e. nμ = Σᵢ xᵢ, giving:

μ̂ = x̄
• b) When μ = μ₀ is known, we only need to solve for the one parameter σ²:

∂ ln L(μ₀, σ²)/∂σ² = −n/(2σ²) + Σᵢ₌₁ⁿ (xᵢ − μ₀)²/(2σ⁴) = 0

σ̂² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − μ₀)²

• c) When both μ and σ² are unknown, we need to differentiate with respect to both parameters, and mostly follow the same steps as in parts a) and b).
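A quick numerical check of these results in Python (the observations are made up for illustration): perturbing either estimate away from (x̄, σ̂²) can only decrease the log-likelihood.

```python
import math

def normal_log_lik(mu, sigma2, xs):
    # ln L(mu, sigma^2) = -(n/2) ln(2 pi sigma^2) - sum (x_i - mu)^2 / (2 sigma^2)
    n = len(xs)
    return (-n / 2) * math.log(2 * math.pi * sigma2) \
        - sum((x - mu) ** 2 for x in xs) / (2 * sigma2)

xs = [1.2, 0.7, 2.1, 1.5, 0.9, 1.8]   # hypothetical observations
n = len(xs)
mu_hat = sum(xs) / n                                   # mu_hat = x-bar
sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n    # joint MLE of sigma^2

# Nudging either estimate in any direction cannot raise the log-likelihood:
base = normal_log_lik(mu_hat, sigma2_hat, xs)
for eps in (-0.05, 0.05):
    assert normal_log_lik(mu_hat + eps, sigma2_hat, xs) <= base
    assert normal_log_lik(mu_hat, sigma2_hat + eps, xs) <= base
```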
Real-world example: sound localization

[Figure: two microphones (Mic1, Mic2) feeding an MCU]

Robust Sound Localization, IEEE Transactions on Signal Processing, Vol. 53, No. 6, June 2005

The ideal case vs. reality: noise and reverberation corrupt the signal from the sound source.

[Figure: signals received at Mic1 and Mic2 at 1 meter, angle 60°, frequency 1 kHz; the Fourier transform (amplitude vs. frequency, in units of 100 Hz) shows noise]
Algorithm:

1. Signal collection (original signal samples in the time domain):

m₁(t) = s₁(t) + n₁(t)
m₂(t) = s₂(t) + n₂(t)

2. Cross-correlation (received signals after the DFT, in the frequency domain):

τ̃ = argmax_τ ∫ M₁(ω) M₂*(ω) e^(jωτ) dω
• However, noise is mixed into the signal, so the weighted cross-correlation algorithm becomes:

τ̃ = argmax_τ ∫ W(ω) M₁(ω) M₂*(ω) e^(jωτ) dω

• where the ML method supplies the weighting function W(ω), used to reduce sensitivity to noise and reverberation:

W(ω) = |M₁(ω)| |M₂(ω)| / ( |N₁(ω)|² |M₂(ω)|² + |N₂(ω)|² |M₁(ω)|² )
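The frequency-domain details above are specific to the paper; the core idea of step 2, picking the delay that maximizes the cross-correlation of the two microphone signals, can be sketched in the time domain. Everything here (sampling rate, signal shape, the 5-sample delay) is a made-up illustration, not the paper's setup:

```python
import math

def estimate_delay(m1, m2, max_lag):
    # Choose the lag maximizing sum_t m1[t] * m2[t + lag]: the discrete,
    # time-domain analogue of the argmax over the integral above.
    def xcorr(lag):
        return sum(m1[t] * m2[t + lag]
                   for t in range(len(m1)) if 0 <= t + lag < len(m2))
    return max(range(-max_lag, max_lag + 1), key=xcorr)

# Hypothetical setup: a decaying 1 kHz burst sampled at 16 kHz reaches Mic2
# 5 samples after Mic1 (noise-free; the ML weighting above targets the noisy case).
fs, f = 16000, 1000
s = [math.sin(2 * math.pi * f * t / fs) * math.exp(-t / 40) for t in range(200)]
m1 = s
m2 = [0.0] * 5 + s   # delayed copy at Mic2

assert estimate_delay(m1, m2, max_lag=20) == 5
```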
The disadvantages of MLE
• Complicated calculation (slow) → it is often nearly the last approach used to solve a problem
• Approximate results (not exact)

References:
[1] Halupka, 2005, "Robust sound localization in 0.18 µm CMOS."
[2] S. Zucker, 2003, "Cross-correlation and maximum-likelihood analysis: a new approach to combining cross-correlation functions."
[3] Tamhane & Dunlop, Statistics and Data Analysis: From Elementary to Intermediate, Chap. 15.
[4] Kandethody M. Ramachandran and Chris P. Tsokos, Mathematical Statistics with Applications, pp. 235–252.
Likelihood ratio test

Ji Wang

Brief Introduction

• The likelihood ratio test was first proposed by Neyman and E. S. Pearson in 1928. This test method is widely used and often has some kind of optimality.
• In statistics, a likelihood ratio test is used to compare the fit of two models, one of which is nested within the other. This often occurs when testing whether a simplifying assumption for a model is valid, as when two or more model parameters are assumed to be related.
Introduction to the most powerful test

For the hypothesis H₀: θ ∈ Θ₀ vs. H₁: θ ∈ Θ₁, suppose we have two test functions Y₁ and Y₂. If

E_θ(Y₁) ≥ E_θ(Y₂) for every θ ∈ Θ₁,  (*)

then we call Y₁ more powerful than Y₂. If there is a test function Y satisfying the inequality (*) against every test function, then we call Y the uniformly most powerful test.
The advantage of the likelihood ratio test compared to the significance test

• The significance test can only deal with hypotheses on specific values, like:

H₀: θ = θ₀ vs. H₁: θ = θ₁

but cannot handle the very common hypothesis:

H₀: θ ∈ Θ₀ vs. H₁: θ ∈ Θ₁

because we cannot use the method of the significance test to find the rejection region.
Definition of the likelihood ratio test statistic

• X₁, …, Xₙ are iid samples from the family of distributions F = { f(x, θ) : θ ∈ Θ }. For the test

H₀: θ ∈ Θ₀ vs. H₁: θ ∈ Θ₁ = Θ − Θ₀,

let

Λ(X) = max_{θ∈Θ₀} l(x₁, …, xₙ | θ) / max_{θ∈Θ} l(x₁, …, xₙ | θ)

We call Λ(X) the likelihood ratio of the above-mentioned hypothesis. Sometimes it is also called the generalized likelihood ratio.

# From the definition of the likelihood ratio test statistic, we can see that if the value of Λ(X) is small, the alternative hypothesis H₁: θ ∈ Θ₁ is more plausible than the null hypothesis H₀: θ ∈ Θ₀, so it is reasonable for us to reject the null hypothesis.

Thus, this test rejects H₀ if Λ(X) ≤ C.
The definition of the likelihood ratio test

• We use Λ(X) as the test statistic for the test:

H₀: θ ∈ Θ₀ vs. H₁: θ ∈ Θ₁

The rejection region is {Λ(X) ≤ C}, where C satisfies the inequality

P_θ(Λ(X) ≤ C) ≤ α for every θ ∈ Θ₀.

Then this test is the likelihood ratio test of level α.

# If we do not know the distribution of Λ(X) under the null hypothesis, it is very difficult for us to find the critical value of the LRT. However, if there is a statistic T(X) which is monotone in Λ(X), and we know its distribution under the null hypothesis, then we can make a significance test based on T(X).
The steps to make a likelihood ratio test

• Step 1: Find the likelihood function of the sample X₁, …, Xₙ.
• Step 2: Find the likelihood ratio Λ(X), the test statistic, or some other statistic which is monotone in Λ(X).
• Step 3: Construct the rejection region by controlling the type I error at the significance level α.
• Example: X₁, …, Xₙ are random samples having the pdf:

f(x, θ) = e^(−(x−θ)),  x ≥ θ,  θ ∈ R.

Derive the rejection region for the hypothesis H₀: θ = 0 vs. H₁: θ > 0 at level α.
Solution:
● Step 1: The sample distribution is:

f(x, θ) = exp(−Σᵢ₌₁ⁿ (xᵢ − θ)) · I(x₍₁₎ ≥ θ),

and it is also the likelihood function. The parameter space is Θ = [0, ∞), with Θ₀ = {0}. Then we derive:

max_{Θ₀} l(x₁, …, xₙ) = e^(−Σᵢ xᵢ),  max_{Θ} l(x₁, …, xₙ) = e^(−Σᵢ xᵢ + n x₍₁₎)
● Step 2: The likelihood ratio test statistic is:

Λ(X) = e^(−Σᵢ xᵢ) / e^(−Σᵢ xᵢ + n x₍₁₎) = e^(−n X₍₁₎)

We can just use 2nX₍₁₎, because it is monotone in Λ(X).

● Step 3: Under the null hypothesis, 2nX₍₁₎ ~ χ²(2), so the critical value is obtained from P₀(2nX₍₁₎ ≥ C) = α, i.e. C = χ²_α(2). That is to say, 2nX₍₁₎ is the (transformed) likelihood ratio test statistic and the rejection region is:

{2nX₍₁₎ ≥ χ²_α(2)}
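The distributional claim in Step 3 can be verified by simulation. A Python sketch, assuming only that under H₀ the observations are standard exponentials: 2nX₍₁₎ should match the mean 2 and variance 4 of a χ²(2) variable.

```python
import random

random.seed(42)
n, reps = 10, 20000

# Under H0 (theta = 0) the X_i are standard exponentials, so X_(1) is the
# minimum of n Exp(1) draws, and 2*n*X_(1) should behave like chi-square
# with 2 df (mean 2, variance 4).
stats = []
for _ in range(reps):
    x_min = min(random.expovariate(1.0) for _ in range(n))
    stats.append(2 * n * x_min)

mean = sum(stats) / reps
var = sum((s - mean) ** 2 for s in stats) / reps
assert abs(mean - 2) < 0.1
assert abs(var - 4) < 0.3
```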
Wald Sequential Probability Ratio Test

So far we assumed that the sample size is fixed in advance. What if it is not fixed?

Abraham Wald (1902–1950) developed the sequential probability ratio test (SPRT) by applying the idea of likelihood ratio testing, sampling sequentially by taking observations one at a time.

Xiao Yu

Hypothesis: H₀: θ = θ₀ vs. H₁: θ = θ₁

After n observations, the likelihood ratio is:

Λₙ(x₁, x₂, …, xₙ) = L(θ₁ | x₁, …, xₙ) / L(θ₀ | x₁, …, xₙ) = ∏ᵢ₌₁ⁿ f(xᵢ | θ₁) / ∏ᵢ₌₁ⁿ f(xᵢ | θ₀)

• If Λₙ(x₁, x₂, …, xₙ) ≤ A, stop sampling and decide not to reject H₀.
• If A < Λₙ(x₁, x₂, …, xₙ) < B, continue sampling.
• If Λₙ(x₁, x₂, …, xₙ) ≥ B, stop sampling and decide to reject H₀.

Here A < 1 < B.
SPRT for a Bernoulli Parameter

• An electrical parts manufacturer receives a large lot of fuses from a vendor. The lot is regarded as "satisfactory" if the fraction defective p is no more than 0.1; otherwise it is regarded as "unsatisfactory".

H₀: p = p₀ = 0.1 vs. H₁: p = p₁ = 0.3

With sₙ defectives among the first n fuses:

Λₙ = (p₁/p₀)^(sₙ) ((1 − p₁)/(1 − p₀))^(n − sₙ)

With α = 0.10 and β = 0.20: ln A = ln(β/(1 − α)) = −1.504 and ln B = ln((1 − β)/α) = 2.079, so sampling continues while

−1.114 + 0.186 n < sₙ < 1.540 + 0.186 n.
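A Python sketch of this SPRT (the function name and simulation settings are mine, not from the slides): it accumulates the log likelihood ratio one fuse at a time and stops at Wald's boundaries ln A = ln(β/(1−α)) and ln B = ln((1−β)/α).

```python
import math
import random

def sprt_bernoulli(p_true, p0=0.1, p1=0.3, alpha=0.10, beta=0.20, max_n=10000):
    # Wald's boundaries on the log likelihood ratio.
    log_a = math.log(beta / (1 - alpha))   # about -1.504
    log_b = math.log((1 - beta) / alpha)   # about  2.079
    llr = 0.0
    for n in range(1, max_n + 1):
        defective = random.random() < p_true
        # log of f(x | p1) / f(x | p0) for one fuse
        llr += math.log(p1 / p0) if defective else math.log((1 - p1) / (1 - p0))
        if llr <= log_a:
            return "accept H0", n
        if llr >= log_b:
            return "reject H0", n
    return "undecided", max_n

random.seed(1)
# With p = p0 = 0.1 the test should accept H0 the vast majority of the time.
accepts = sum(sprt_bernoulli(p_true=0.1)[0] == "accept H0" for _ in range(500))
assert accepts > 350
```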
Fisher Information

I(θ) = E[(d ln f(X | θ)/dθ)²] = −E[d² ln f(X | θ)/dθ²]

The quantity d ln f(X | θ)/dθ is called the score; its expectation is zero:

E[d ln f(X | θ)/dθ] = 0

Cramér–Rao Lower Bound:

Var(θ̂) ≥ 1/(n I(θ))
Single-Parameter Bernoulli Experiment

• The Fisher information contained in n independent Bernoulli trials may be calculated as follows. In the following, A represents the number of successes and B the number of failures (A + B = n):

I(θ) = −E[d²/dθ² ln f(A; θ)] = −E[d²/dθ² ln((A + B)!/(A! B!) · θ^A (1 − θ)^B)]
     = −E[d²/dθ² (A ln θ + B ln(1 − θ))]
     = E[A/θ² + B/(1 − θ)²]
     = nθ/θ² + n(1 − θ)/(1 − θ)²
     = n/θ + n/(1 − θ) = n/(θ(1 − θ))

Up to the factor n², this is the reciprocal of the variance nθ(1 − θ) of the number of successes in n Bernoulli trials: the more the variance, the less the Fisher information.
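The closed form n/(θ(1 − θ)) can be cross-checked by computing the expectation −E[d² ln f/dθ²] directly: exactly over A ~ Bin(n, θ), with the second derivative taken by a central finite difference. A Python sketch (n = 20 and θ = 0.3 are arbitrary choices):

```python
import math

def bernoulli_fisher_info(theta, n, h=1e-5):
    # I(theta) = -E[ d^2/dtheta^2 ln f(A; theta) ], A ~ Bin(n, theta);
    # the expectation is exact, the second derivative a central difference.
    def log_pmf(a, t):
        return (math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(n - a + 1)
                + a * math.log(t) + (n - a) * math.log(1 - t))
    total = 0.0
    for a in range(n + 1):
        prob = math.exp(log_pmf(a, theta))
        second = (log_pmf(a, theta + h) - 2 * log_pmf(a, theta)
                  + log_pmf(a, theta - h)) / h ** 2
        total += prob * second
    return -total

n, theta = 20, 0.3
approx = bernoulli_fisher_info(theta, n)
exact = n / (theta * (1 - theta))   # the closed form derived above
assert abs(approx - exact) < 1e-3 * exact
```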
Large-Sample Inferences Based on the MLEs

For large n, the MLE θ̂ (the root of d ln L(θ)/dθ = 0) is approximately distributed as:

θ̂ ≈ N(θ, 1/(n I(θ)))

• An approximate large-sample (1 − α)-level confidence interval (CI) is given by:

[θ̂ − z_{α/2} / √(n I(θ̂)),  θ̂ + z_{α/2} / √(n I(θ̂))]

Plugging in the Fisher information of Bernoulli trials, we can see this is consistent with what we have learned.
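For Bernoulli data the interval reduces to the familiar θ̂ ± z_{α/2} √(θ̂(1 − θ̂)/n). A Python sketch, applied to the 13-heads-in-20-tosses coin example used later in these slides:

```python
import math

def bernoulli_wald_ci(successes, n, z=1.96):
    # theta_hat +/- z / sqrt(n * I(theta_hat)); per-trial I(theta) = 1/(theta(1-theta)),
    # so 1/sqrt(n * I(theta_hat)) = sqrt(theta_hat * (1 - theta_hat) / n).
    theta_hat = successes / n
    half = z * math.sqrt(theta_hat * (1 - theta_hat) / n)
    return theta_hat - half, theta_hat + half

lo, hi = bernoulli_wald_ci(13, 20)   # approximate 95% CI for 13 heads in 20 tosses
assert lo < 13 / 20 < hi
```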
Bayes' Theorem

Thomas Bayes (1702–1761)
- English mathematician and a Presbyterian minister, born in London
- A specific case of the theorem (Bayes' theorem) was published after his death (by Richard Price)

Jaeheun Kim

Bayesian Inference

• Bayesian inference is a method of statistical inference in which some kind of evidence or observations are used to calculate the probability that a hypothesis may be true, or else to update its previously calculated probability.
• "Bayesian" comes from its use of Bayes' theorem in the calculation process.
BAYES’ THEOREM
Bayes' theorem shows the relation between two conditional probabilities
)(
)()|()|(
)()|()()|()(
AP
BPBAPABP
APABPBPBAPBAP
• we can make updated probability(posterior probability) from the initial probability(prior probability) using new information.
• we call this updating process Bayes' Theorem
Prior prob.
New info.
Posterior prob.
Using Bayes thm
MONTY HALL

Should we switch doors or stay?

A contestant chose door 1 and then the host opened one of the other doors (door 3). Would switching from door 1 to door 2 increase the chances of winning the car?

http://en.wikipedia.org/wiki/Monty_Hall_problem
Let Dᵢ = {Door i conceals the car} and Oⱼ = {Host opens door j after the contestant chooses door 1}. Then:

p(D₁) = p(D₂) = p(D₃) = 1/3
p(O₃ | D₁) = 1/2,  p(O₃ | D₂) = 1,  p(O₃ | D₃) = 0

p(D₁ | O₃) = p(O₃ | D₁) p(D₁) / [p(O₃ | D₁) p(D₁) + p(O₃ | D₂) p(D₂) + p(O₃ | D₃) p(D₃)]
           = (1/2 · 1/3) / (1/2 · 1/3 + 1 · 1/3 + 0 · 1/3) = 1/3  (when you stay)

p(D₂ | O₃) = p(O₃ | D₂) p(D₂) / [p(O₃ | D₁) p(D₁) + p(O₃ | D₂) p(D₂) + p(O₃ | D₃) p(D₃)]
           = (1 · 1/3) / (1/2 · 1/3 + 1 · 1/3 + 0 · 1/3) = 2/3  (when you switch)
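The 1/3 vs. 2/3 answer can be confirmed by simulation. A minimal Python sketch (the contestant always starts at door 1, matching the setup above):

```python
import random

def monty_trial(switch, rng):
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = 1   # contestant always starts with door 1
    # Host opens a door that is neither the pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 20000
stay = sum(monty_trial(False, rng) for _ in range(n)) / n
swit = sum(monty_trial(True, rng) for _ in range(n)) / n
assert abs(stay - 1/3) < 0.02 and abs(swit - 2/3) < 0.02
```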
15.3.1 Bayesian Estimation

Zhenrui & friends

Premises for doing a Bayesian estimation:
1. Prior knowledge about the unknown parameter θ
2. The probability distribution of θ: π(θ) (the prior distribution)

General equation:

π*(θ) = f(x₁, x₂, …, xₙ | θ) π(θ) / ∫ f(x₁, x₂, …, xₙ | θ) π(θ) dθ = f(x₁, x₂, …, xₙ | θ) π(θ) / f(x₁, x₂, …, xₙ)

The denominator f(x₁, x₂, …, xₙ) is the marginal p.d.f. of X₁, X₂, …, Xₙ: just a normalizing constant to make ∫ π*(θ) dθ = 1.

π*(θ): posterior distribution
π(θ): prior distribution
θ: unknown parameter from a distribution with pdf/pmf f(x | θ); considered a r.v. in Bayesian estimation
f(x₁, x₂, …, xₙ | θ): likelihood function of θ based on the observed values x₁, x₂, …, xₙ

The µ* and σ*² of π*(θ) are called the posterior mean and variance, respectively. µ* can be used as a point estimate of θ (the Bayes estimate).

π*(θ) ∝ f(X | θ) π(θ)
Bayesian Estimation (continued)

Conjugate priors: a family of prior distributions such that the posterior distribution is of the same form as the prior distribution.

Examples of conjugate priors (from the textbook, Examples 15.25 and 15.26):
• The normal distribution is a conjugate prior on µ of N(µ, σ²) (if σ² is already known)
• The beta distribution is a conjugate prior on p of a binomial distribution Bin(n, p)

A question: If I only know the possible value range of θ, but can't summarize it in the form of a probability distribution, can I still do Bayesian estimation?

No! To apply Bayes' theorem, every term in the equation

π*(θ) ∝ f(X | θ) π(θ)

has to be a probability term: a prior distribution π(θ) works, but a bare range of θ values does not.

Criticisms of the Bayesian approach:
1. Perceptions of prior knowledge differ from person to person: 'subjective'.
2. It may be too fuzzy to quantify the prior knowledge of θ in the form of a distribution.
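The Beta/Binomial pairing above is easy to demonstrate: the update only shifts the Beta parameters, so the posterior stays in the prior's family. A Python sketch (the Beta(1, 1) prior and the 13-heads/7-tails data are illustrative choices):

```python
def beta_binomial_update(a, b, heads, tails):
    # Beta(a, b) prior on p + binomial data -> Beta(a + heads, b + tails) posterior.
    return a + heads, b + tails

# Uniform prior Beta(1, 1); observe 13 heads and 7 tails.
a, b = beta_binomial_update(1, 1, 13, 7)
posterior_mean = a / (a + b)   # Bayes estimate of p: 14/22
assert (a, b) == (14, 8)

# Updating one toss at a time lands on the same posterior: the Beta family
# is closed under the update, which is exactly what "conjugate" means here.
a2, b2 = 1, 1
for _ in range(13):
    a2, b2 = beta_binomial_update(a2, b2, 1, 0)
for _ in range(7):
    a2, b2 = beta_binomial_update(a2, b2, 0, 1)
assert (a2, b2) == (a, b)
```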
15.3.2 Bayesian Testing

Simple vs. simple hypothesis test:

H₀: θ = θ₀ vs. H₁: θ = θ₁

Prior probabilities of H₀ and H₁: π₀ = π(θ₀) and π₁ = π(θ₁) = 1 − π₀.

Let

a = π₀ f(x₁, x₂, …, xₙ | θ₀),  b = π₁ f(x₁, x₂, …, xₙ | θ₁)

Then the posterior probabilities are

π₀* = a/(a + b),  π₁* = b/(a + b) = 1 − π₀*

and the posterior odds ratio is

π₁*/π₀* = π₁ f(x₁, x₂, …, xₙ | θ₁) / (π₀ f(x₁, x₂, …, xₙ | θ₀))

A Bayesian test rejects H₀ if π₁*/π₀* > k, where k > 0 is a suitably chosen critical constant. A large value of k corresponds to a small value of α.
Bayesian Testing (continued): Bayesian test vs. Neyman–Pearson likelihood ratio test (15.18)

Neyman–Pearson lemma: reject H₀ if

L(θ₁ | x₁, x₂, …, xₙ) / L(θ₀ | x₁, x₂, …, xₙ) = f(x₁, x₂, …, xₙ | θ₁) / f(x₁, x₂, …, xₙ | θ₀) > k

Bayesian test: reject H₀ if

π₁*/π₀* = b/a = (π₁/π₀) · f(x₁, x₂, …, xₙ | θ₁) / f(x₁, x₂, …, xₙ | θ₀) > k,

i.e. f(x₁, x₂, …, xₙ | θ₁) / f(x₁, x₂, …, xₙ | θ₀) > k (π₀/π₁) = k*

The Bayesian test can be considered a specialized Neyman–Pearson likelihood ratio test where the probability of each hypothesis (H₀ and H₁) being true is known: π₀ and π₁.

If π₀ = π₁ = 1/2, then k* = k and the rejection condition becomes

f(x₁, x₂, …, xₙ | θ₁) / f(x₁, x₂, …, xₙ | θ₀) > k:

the Bayesian test becomes the Neyman–Pearson likelihood ratio test.
Bayesian Inference for One Parameter

A biased coin:
• Bernoulli random variable
• Prob(Head) = ϴ
• ϴ is unknown

Bingqi Cheng

Bayesian statistics: three ingredients
• Prior distribution: an initial guess or prior knowledge about the parameter ϴ (highly subjective)
• Likelihood function: fits or describes the distribution of the real data (e.g., a sequence of heads and tails when tossing the coin)
• Bayes' theorem: updates the prior distribution with the real data, giving the posterior distribution Prob(ϴ | data)
Prior Distribution

Beta distributions are the conjugate prior to Bernoulli distributions: if the prior is Beta and the likelihood function is Bernoulli, then the posterior is Beta.

[Figure: densities of the prior distribution, likelihood function, and posterior distribution over ϴ ∈ [0, 1]]

Calculation Steps

For this biased coin: prior distribution × likelihood function → posterior distribution.

[Figure: density of the posterior distribution over ϴ ∈ [0, 1]]
Predictive Probability
Bayesian vs. MLE with the calculus method

Back to the example of the biased coin: again we have 20 trials and get 13 heads.

f(p) = C(20, 13) p¹³ (1 − p)⁷

f′(p) = C(20, 13) p¹² (1 − p)⁶ (13 − 20p) = 0

p̂ = 13/20 = 0.65
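Comparing this MLE with the Bayes estimate under a uniform Beta(1, 1) prior, a Python sketch (exact arithmetic via fractions; the 10×-data comparison is my own illustration): the posterior mean 14/22 ≈ 0.636 is pulled slightly toward the prior mean 1/2, and the two estimates converge as data accumulate.

```python
from fractions import Fraction

heads, tails = 13, 7
mle = Fraction(heads, heads + tails)             # 13/20 = 0.65
# Uniform Beta(1, 1) prior -> Beta(14, 8) posterior; its mean is:
bayes = Fraction(heads + 1, heads + tails + 2)   # 14/22, about 0.636

assert float(mle) == 0.65
assert abs(float(bayes) - 14/22) < 1e-12

# With 10x the data at the same ratio, the two estimates nearly agree:
big_mle = Fraction(130, 200)
big_bayes = Fraction(131, 202)
assert abs(float(big_mle) - float(big_bayes)) < 0.01
```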
Xiao Yu

Jeffreys Prior:

p(θ) ∝ √(det I(θ))
Why bother to use Bayesian methods? With a large amount of data, the Bayesian computation can be easier to handle.

• MLE with the calculus method: find the parameter quickly and directly, if possible → one huge step.
• Bayesian: initial guess + approximation + convergence → a different starting line + small steps + maybe not the best value.
• This is a Gaussian mixture; the observations are vectors and C is the covariance matrix. Finding the maximum likelihood estimate for a mixture by direct application of calculus is tough:

log L(p₁, C₁, C₂ | x₁, …, xₙ) = Σⱼ₌₁ⁿ log( p₁ (1/√(2π C₁)) e^(−(xⱼ − μ₁)²/(2C₁)) + p₂ (1/√(2π C₂)) e^(−(xⱼ − μ₂)²/(2C₂)) )
Bayesian Learning

• The more evidence we have, the more we learn. The more flips we do, the more we know about the probability of getting a head, which is the parameter of the binomial distribution.

An application: the EM (Expectation–Maximization) algorithm, which can beautifully handle some regression problems.

Two-coins game: Suppose now that there are two coins which can be flipped. The probability of heads is p1 for the first coin and p2 for the second coin. We decide on each flip which of the two coins will be flipped, and our objective is to maximize the number of heads that occur (p1 and p2 are unknown).
Matlab code for the strategy:

function [] = twocoin2(p1,p2,n)
H1 = 0; T1 = 0;
H2 = 0; T2 = 0;
syms ps1;
syms ps2;
for k = 1:n,
    % posterior probability that coin 1 has the larger heads probability
    temp = int(ps2^H2*(1-ps2)^T2,0,ps1);
    p(k) = double(int(ps1^H1*(1-ps1)^T1*temp,0,1) ...
                  /(beta(H1+1,T1+1)*beta(H2+1,T2+1)));
    if rand < p(k),
        guess(k) = 1;
        y(k) = rand < p1;
        H1 = H1 + y(k);
        T1 = T1 + (1 - y(k));
    else
        guess(k) = 2;
        y(k) = rand < p2;
        H2 = H2 + y(k);
        T2 = T2 + (1 - y(k));
    end
end
disp('Guesses: ')
tabulate(guess)
disp('Outcomes: ')
tabulate(y)
figure(2)
plot(p)
end
p1 = 0.4, p2 = 0.6

[Figure: value of L(p1 > p2 | H1, T1, H2, T2) over the trials]
Statistical Decision Theory

ABRAHAM WALD (1902–1950)
• Hungarian mathematician
• Major contributions: geometry, econometrics, statistical sequential analysis, and decision theory
• Died in an airplane accident in 1950

Hans Schneeweiss, "Abraham Wald", Department of Statistics, University of Munich, Akademiestr. 1, 80799 München, Germany

Kicheon Park
Why is decision theory needed?

Limits of classical statistics:
I. Prior information and loss
II. Initial and final precision
III. Formulational inadequacy

Limits of Classical Statistics

• Prior information and loss: relevant effects from past experience and the losses from each possible decision
• Initial and final precision: before and after observation of sample information, which is the result of a long series of identical experiments
• Formulational inadequacy: a limit on reaching a meaningful decision in the majority of problems
Classical Statistics vs. Decision Theory

• Classical statistics: direct use of sample information
• Decision theory: combines the sample information with other relevant aspects of the problem to reach the best decision

→ The goal of decision theory is to make decisions based not only on the available statistical knowledge but also on the uncertainties (θ) that are involved in the decision problem.

Two types of relevant information:

I. Knowledge of the possible consequences of the decisions → the loss resulting from each possible decision
II. Prior information → effects from past experience with similar situations
Statistical Decision Theory - Elements

• Sample space 𝒳
• Unknown parameter θ
• Decision space D
• Loss function L(θ, d)

Sources: J. Wolfowitz, "Abraham Wald", Annals of Mathematical Statistics; Tamhane & Dunlop, Statistics & Data Analysis, Prentice Hall; Berger, Statistical Decision Theory, Springer-Verlag.

Mun Sang Yue
Statistical Decision Theory - Decision Rules

A decision rule δ maps the observed sample to a decision; its risk function R(δ, θ) is the expected loss. The minimax rule minimizes the maximum risk:

min_δ { max_θ R(δ, θ) }
Statistical Decision Theory - Example

• A retailer must decide whether to purchase a large lot of items containing an unknown fraction p of defectives. Before making the decision of whether to purchase the lot (decision d1) or not to purchase the lot (decision d2), 2 items are randomly selected from the lot for inspection. The retailer wants to evaluate two decision rules. Prior: π(p) = 2(1 − p).

Example - Continued

No. of defectives x   Decision rule δ1   Decision rule δ2
0                     d1                 d1
1                     d2                 d1
2                     d2                 d2

• Loss functions: L(d1, p) = 8p − 1, and L(d2, p) = 2
• Risk functions:

R(δ1, p) = L(d1, p) P(δ1 chooses d1 | p) + L(d2, p) P(δ1 chooses d2 | p)
         = (8p − 1) P(X = 0 | p) + 2 P(X = 1 or 2 | p)

R(δ2, p) = L(d1, p) P(δ2 chooses d1 | p) + L(d2, p) P(δ2 chooses d2 | p)
         = (8p − 1) P(X = 0 or 1 | p) + 2 P(X = 2 | p)
Example - Continued

[Figure: R(δ1, p) and R(δ2, p) plotted against p]

max_p R(δ1, p) = 2.289,  max_p R(δ2, p) = 3.329
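The two risk curves and their maxima can be reproduced numerically. A Python sketch using X ~ Bin(2, p) for the inspection sample, as in the rules above:

```python
def risk_delta1(p):
    # R(delta1, p) = (8p - 1) P(X=0|p) + 2 P(X=1 or 2|p), X ~ Bin(2, p)
    p0 = (1 - p) ** 2
    return (8 * p - 1) * p0 + 2 * (1 - p0)

def risk_delta2(p):
    # R(delta2, p) = (8p - 1) P(X=0 or 1|p) + 2 P(X=2|p)
    p01 = (1 - p) ** 2 + 2 * p * (1 - p)
    return (8 * p - 1) * p01 + 2 * p ** 2

grid = [i / 10000 for i in range(10001)]
max1 = max(risk_delta1(p) for p in grid)
max2 = max(risk_delta2(p) for p in grid)
assert abs(max1 - 2.289) < 0.005
assert abs(max2 - 3.329) < 0.005
assert max1 < max2   # delta1 has the smaller maximum risk (the minimax rule)
```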
Statistical Decision Theory - Example

• A shipment of transistors was received by a radio company. A sampling plan was used to check the shipment as a whole, to ensure that the contractual requirement of a 0.05 defect rate was not exceeded. A random sample of n transistors was chosen from the shipment and tested. Based upon X, the number of defective transistors in the sample, the shipment will be accepted or rejected.

Example - Continued

• The proportion of defective transistors in the shipment is θ.
• Decision rule:
  a1: accept the lot if X/n ≤ 0.05
  a2: reject the lot if X/n > 0.05
• Loss functions: L(a1, θ) = 10θ; L(a2, θ) = 1
• π(θ) can be estimated based on prior experience
• R(δ, θ) can then be calculated
Summary

• Maximum likelihood estimation selects the estimate of the unknown parameter that maximizes the likelihood function.
• The likelihood ratio test compares the likelihood of the observed outcomes under the null hypothesis to the likelihood under the alternative hypothesis.
• Bayesian methods treat unknown models or variables as random variables with known (prior) distributions, instead of as deterministic quantities that happen to be unknown.

Summary (continued)

• Statistical decision theory moves statistics beyond its traditional role of just drawing inferences from incomplete information. The theory focuses on the problem of statistical actions rather than inference.

"Here in the 21st Century … a combination of Bayesian and frequentist ideas will be needed to deal with our increasingly intense scientific environment."
Bradley Efron, 164th ASA Presidential Address

Questions?

THANK YOU!