Brief Review Probability and Statistics. Probability distributions Continuous distributions.

danielanash 
Brief Review
Probability and Statistics
Probability distributions
Continuous distributions
Defn (density function)
Let x denote a continuous random variable. Then f(x) is called the density function of x if
1) f(x) ≥ 0
2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$
3) $\int_a^b f(x)\,dx = P[a \le x \le b]$
Defn (Joint density function)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables then
f(x) = f(x1 ,x2 ,x3 , ... , xn)
is called the joint density function of x = (x1 ,x2 ,x3 , ... , xn)
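if
1) f(x) ≥ 0
2) $\int f(\mathbf{x})\,d\mathbf{x} = 1$
3) $\int_R f(\mathbf{x})\,d\mathbf{x} = P[\mathbf{x} \in R]$
Note:
$\int f(\mathbf{x})\,d\mathbf{x} = \int \cdots \int f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$
$\int_R f(\mathbf{x})\,d\mathbf{x} = \int \cdots \int_R f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$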
Defn (Marginal density function)
The marginal density of x1 = (x1 ,x2 ,x3 , ... , xp) (p < n) is defined by:
$f_1(\mathbf{x}_1) = \int f(\mathbf{x})\,d\mathbf{x}_2 = \int f(\mathbf{x}_1, \mathbf{x}_2)\,d\mathbf{x}_2$
where x2 = (xp+1 ,xp+2 ,xp+3 , ... , xn)
The marginal density of x2 = (xp+1 ,xp+2 ,xp+3 , ... , xn) is defined by:
$f_2(\mathbf{x}_2) = \int f(\mathbf{x}_1, \mathbf{x}_2)\,d\mathbf{x}_1 = \int f(\mathbf{x})\,d\mathbf{x}_1$
where x1 = (x1 ,x2 ,x3 , ... , xp)
Defn (Conditional density function)
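The conditional density of x1 given x2 (defined in previous slide) (p < n) is defined by:
$f_{1|2}(\mathbf{x}_1 \mid \mathbf{x}_2) = \frac{f(\mathbf{x})}{f_2(\mathbf{x}_2)} = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_2(\mathbf{x}_2)}$
The conditional density of x2 given x1 is defined by:
$f_{2|1}(\mathbf{x}_2 \mid \mathbf{x}_1) = \frac{f(\mathbf{x})}{f_1(\mathbf{x}_1)} = \frac{f(\mathbf{x}_1, \mathbf{x}_2)}{f_1(\mathbf{x}_1)}$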
Marginal densities describe how the subvector xi behaves ignoring xj
Conditional densities describe how the subvector xi behaves when the subvector xj is held fixed
Defn (Independence)
The two subvectors (x1 and x2) are called independent if:
f(x) = f(x1, x2) = f1(x1)f2(x2)
= product of marginals
or
the conditional density of xi given xj :
$f_{i|j}(\mathbf{x}_i \mid \mathbf{x}_j) = f_i(\mathbf{x}_i)$ = marginal density of xi
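Example (p-variate Normal)
The random vector x (p × 1) is said to have the
p-variate Normal distribution with
mean vector μ (p × 1) and
covariance matrix Σ (p × p)
(written x ~ Np(μ, Σ)) if:
$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}$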
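Example (bivariate Normal)
The random vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ is said to have the bivariate Normal distribution with mean vector
$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$
and covariance matrix
$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$
if
$f(x_1, x_2) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left\{-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\} = \frac{1}{2\pi\,(\sigma_{11}\sigma_{22} - \sigma_{12}^2)^{1/2}} \exp\left\{-\tfrac{1}{2} Q(x_1, x_2)\right\}$
where
$Q(x_1, x_2) = (\mathbf{x}-\boldsymbol{\mu})'\,\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = \frac{\sigma_{22}(x_1-\mu_1)^2 - 2\sigma_{12}(x_1-\mu_1)(x_2-\mu_2) + \sigma_{11}(x_2-\mu_2)^2}{\sigma_{11}\sigma_{22} - \sigma_{12}^2}$
In terms of σ1, σ2 and the correlation ρ:
$f(x_1, x_2) = \frac{1}{2\pi\,\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\tfrac{1}{2} Q(x_1, x_2)\right\}$
$Q(x_1, x_2) = \frac{\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2}{1-\rho^2}$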
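As a quick numerical check, the bivariate Normal density can be evaluated in two ways: from the covariance entries σ11, σ12, σ22, and from σ1, σ2 and the correlation ρ, with σ12 = ρσ1σ2. The sketch below is only an illustration; the function names and the parameter values are assumptions, not part of the slides.

```python
import math

def bvn_density_matrix(x1, x2, mu1, mu2, s11, s12, s22):
    """Density written with the covariance entries s11, s12, s22."""
    det = s11 * s22 - s12 ** 2                     # determinant of Sigma
    d1, d2 = x1 - mu1, x2 - mu2
    # Q = (x - mu)' Sigma^{-1} (x - mu), using the explicit 2 x 2 inverse
    q = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def bvn_density_rho(x1, x2, mu1, mu2, sd1, sd2, rho):
    """Density written with standard deviations and the correlation rho."""
    z1, z2 = (x1 - mu1) / sd1, (x2 - mu2) / sd2
    q = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (1 - rho ** 2)
    return math.exp(-0.5 * q) / (2 * math.pi * sd1 * sd2 * math.sqrt(1 - rho ** 2))

# Illustrative parameter values (assumed, not taken from the slides)
mu1, mu2, sd1, sd2, rho = 1.0, -2.0, 2.0, 0.5, 0.6
s11, s22, s12 = sd1 ** 2, sd2 ** 2, rho * sd1 * sd2

f_matrix = bvn_density_matrix(0.3, -1.7, mu1, mu2, s11, s12, s22)
f_rho = bvn_density_rho(0.3, -1.7, mu1, mu2, sd1, sd2, rho)
print(f_matrix, f_rho)   # the two forms should agree up to rounding
```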
[Figure: The Bivariate Normal Distribution. Three panels showing f(x,y) plotted against x and y.]
[Figure: Contour Plots of the Bivariate Normal Distribution. Three panels, axes x and y.]
[Figure: Scatter Plots of data from the Bivariate Normal Distribution.]
Theorem (Transformations)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables with joint density function f(x1 ,x2 ,x3 , ... , xn) = f(x). Let
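y1 = φ1(x1 ,x2 ,x3 , ... , xn)
y2 = φ2(x1 ,x2 ,x3 , ... , xn)
...
yn = φn(x1 ,x2 ,x3 , ... , xn)
define a 1-1 transformation of x into y.
Then the joint density of y is g(y) given by:
g(y) = f(x)|J| where
$J = \frac{\partial(\mathbf{x})}{\partial(\mathbf{y})} = \frac{\partial(x_1, x_2, x_3, \ldots, x_n)}{\partial(y_1, y_2, y_3, \ldots, y_n)} = \det\begin{bmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n} \end{bmatrix}$
= the Jacobian of the transformation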
Corollary (Linear Transformations)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables with joint density function f(x1 ,x2 ,x3 , ... , xn) = f(x). Let
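y1 = a11x1 + a12x2 + a13x3 + ... + a1nxn
y2 = a21x1 + a22x2 + a23x3 + ... + a2nxn
...
yn = an1x1 + an2x2 + an3x3 + ... + annxn
define a 1-1 transformation of x into y.
Then the joint density of y is g(y) given by:
$g(\mathbf{y}) = f(\mathbf{x})\,\frac{1}{|\det(A)|} = f(A^{-1}\mathbf{y})\,\frac{1}{|\det(A)|}$
where
$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}$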
Corollary (Linear Transformations for Normal Random variables)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables having an n-variate Normal distribution with mean vector μ and covariance matrix Σ,
i.e. x ~ Nn(μ, Σ). Let
y1 = a11x1 + a12x2 + a13x3 + ... + a1nxn
y2 = a21x1 + a22x2 + a23x3 + ... + a2nxn
...
yn = an1x1 + an2x2 + an3x3 + ... + annxn
define a 1-1 transformation of x into y.
Then y = (y1 ,y2 ,y3 , ... , yn) ~ Nn(Aμ, AΣA')
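To make the corollary concrete, the following sketch computes the mean vector Aμ and covariance matrix AΣA' of y = Ax for a 2 × 2 case. The values of μ, Σ and A are made up for illustration, and the helper functions `matmul` and `transpose` are assumptions written here to keep the example free of external libraries.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]

# Illustrative values (assumed): x ~ N2(mu, sigma)
mu = [[1.0], [2.0]]
sigma = [[4.0, 1.0],
         [1.0, 9.0]]
# y1 = x1 + x2, y2 = x1 - x2 (a 1-1 linear transformation)
A = [[1.0, 1.0],
     [1.0, -1.0]]

mean_y = matmul(A, mu)                            # A mu
cov_y = matmul(matmul(A, sigma), transpose(A))    # A Sigma A'

print(mean_y)   # mean vector of y
print(cov_y)    # covariance matrix of y
```

Here the diagonal of AΣA' reproduces Var[x1 + x2] = 4 + 9 + 2(1) and Var[x1 - x2] = 4 + 9 - 2(1), and the off-diagonal entry is Var[x1] - Var[x2].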
Defn (Expectation)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables with joint density function
f(x) = f(x1 ,x2 ,x3 , ... , xn).
Let U = h(x) = h(x1 ,x2 ,x3 , ... , xn)
Then
$E[U] = E[h(\mathbf{x})] = \int h(\mathbf{x})\,f(\mathbf{x})\,d\mathbf{x}$
Defn (Conditional Expectation)
Let x = (x1 ,x2 ,x3 , ... , xn) = (x1 , x2 ) denote a vector of continuous random variables with joint density function
f(x) = f(x1 ,x2 ,x3 , ... , xn) = f(x1 , x2 ).
Let U = h(x1) = h(x1 ,x2 ,x3 , ... , xp)
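Then the conditional expectation of U given x2 is
$E[U \mid \mathbf{x}_2] = E[h(\mathbf{x}_1) \mid \mathbf{x}_2] = \int h(\mathbf{x}_1)\,f_{1|2}(\mathbf{x}_1 \mid \mathbf{x}_2)\,d\mathbf{x}_1$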
Defn (Variance)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables with joint density function
f(x) = f(x1 ,x2 ,x3 , ... , xn).
Let U = h(x) = h(x1 ,x2 ,x3 , ... , xn)
Then
$\sigma_U^2 = Var[U] = E\left[(U - E[U])^2\right] = E\left[(h(\mathbf{x}) - E[h(\mathbf{x})])^2\right]$
Defn (Conditional Variance)
Let x = (x1 ,x2 ,x3 , ... , xn) = (x1 , x2 ) denote a vector of continuous random variables with joint density function
f(x) = f(x1 ,x2 ,x3 , ... , xn) = f(x1 , x2 ).
Let U = h(x1) = h(x1 ,x2 ,x3 , ... , xp)
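Then the conditional variance of U given x2 is
$Var[U \mid \mathbf{x}_2] = E\left[\left(h(\mathbf{x}_1) - E[h(\mathbf{x}_1) \mid \mathbf{x}_2]\right)^2 \,\Big|\, \mathbf{x}_2\right]$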
Defn (Covariance, Correlation)
Let x = (x1 ,x2 ,x3 , ... , xn) denote a vector of continuous random variables with joint density function
f(x) = f(x1 ,x2 ,x3 , ... , xn).
Let U = h(x) = h(x1 ,x2 ,x3 , ... , xn) and
V = g(x) = g(x1 ,x2 ,x3 , ... , xn)
Then the covariance of U and V is
$Cov[U, V] = E\left[(U - E[U])(V - E[V])\right] = E\left[(h(\mathbf{x}) - E[h(\mathbf{x})])(g(\mathbf{x}) - E[g(\mathbf{x})])\right]$
and the correlation is
$\rho_{UV} = \frac{Cov[U, V]}{\sqrt{Var[U]\,Var[V]}}$
Properties
• Expectation
• Variance
• Covariance
• Correlation
1. E[a1x1 + a2x2 + a3x3 + ... + anxn]
= a1E[x1] + a2E[x2] + a3E[x3] + ... + anE[xn]
or E[a'x] = a'E[x]
2. E[UV] = E[h(x1)g(x2)]
= E[U]E[V] = E[h(x1)]E[g(x2)]
if x1 and x2 are independent
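3. Var[a1x1 + a2x2 + a3x3 + ... + anxn]
$= \sum_{i=1}^{n} a_i^2\,Var[x_i] + 2\sum_{i<j} a_i a_j\,Cov[x_i, x_j]$
or Var[a'x] = a′Σa
where
$\Sigma = \begin{bmatrix} Var(x_1) & Cov(x_1, x_2) & \cdots & Cov(x_1, x_n) \\ Cov(x_2, x_1) & Var(x_2) & \cdots & Cov(x_2, x_n) \\ \vdots & \vdots & & \vdots \\ Cov(x_n, x_1) & Cov(x_n, x_2) & \cdots & Var(x_n) \end{bmatrix}$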
4. Cov[a1x1 + a2x2 + ... + anxn ,
b1x1 + b2x2 + ... + bnxn]
$= \sum_{i=1}^{n} a_i b_i\,Var[x_i] + \sum_{i \ne j} a_i b_j\,Cov[x_i, x_j]$
or Cov[a'x, b'x] = a′Σb
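5. $E[U] = E_{\mathbf{x}_2}\left[E[U \mid \mathbf{x}_2]\right]$
6. $Var[U] = E_{\mathbf{x}_2}\left[Var[U \mid \mathbf{x}_2]\right] + Var_{\mathbf{x}_2}\left[E[U \mid \mathbf{x}_2]\right]$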
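Properties 5 and 6 (E[U] equals the expectation of E[U | x2], and Var[U] equals the expectation of the conditional variance plus the variance of the conditional expectation) can be checked exactly on a small discrete analogue, where sums replace the integrals. The joint probabilities and the function h below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from fractions import Fraction as F

# Hypothetical joint probabilities P[x1 = i, x2 = j] (discrete stand-in
# for a joint density; sums replace integrals)
joint = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (1, 1): F(1, 4)}

def h(x1):
    return 3 * x1 + 1          # U = h(x1), an arbitrary illustrative choice

# Unconditional mean and variance of U
e_u = sum(p * h(x1) for (x1, _), p in joint.items())
var_u = sum(p * (h(x1) - e_u) ** 2 for (x1, _), p in joint.items())

# Marginal distribution of x2
p_x2 = {}
for (_, x2), p in joint.items():
    p_x2[x2] = p_x2.get(x2, 0) + p

# Conditional mean and variance of U given each value of x2
cond_mean, cond_var = {}, {}
for b, pb in p_x2.items():
    cond_mean[b] = sum(p / pb * h(x1) for (x1, x2), p in joint.items() if x2 == b)
    cond_var[b] = sum(p / pb * (h(x1) - cond_mean[b]) ** 2
                      for (x1, x2), p in joint.items() if x2 == b)

e_of_cond_mean = sum(p_x2[b] * cond_mean[b] for b in p_x2)
e_of_cond_var = sum(p_x2[b] * cond_var[b] for b in p_x2)
var_of_cond_mean = sum(p_x2[b] * (cond_mean[b] - e_of_cond_mean) ** 2 for b in p_x2)

print(e_u, e_of_cond_mean)                          # property 5
print(var_u, e_of_cond_var + var_of_cond_mean)      # property 6
```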
Statistical Inference
Making decisions from data
There are two main areas of Statistical Inference
• Estimation: deciding on the value of a parameter
  – Point estimation
  – Confidence Interval, Confidence region Estimation
• Hypothesis testing
  – Deciding if a statement (hypothesis) about a parameter is True or False
The general statistical model
Most data fits this situation
Defn (The Classical Statistical Model)
The data vector
x = (x1 ,x2 ,x3 , ... , xn)
The model
Let f(x|θ) = f(x1 ,x2 , ... , xn | θ1 , θ2 ,... , θp) denote the joint density of the data vector x = (x1 ,x2 ,x3 , ... , xn) of observations, where the unknown parameter vector θ ∈ Ω (a subset of p-dimensional space).
An Example
The data vector
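x = (x1 ,x2 ,x3 , ... , xn), a sample from the normal distribution with mean μ and variance σ²
The model
Then f(x|μ, σ²) = f(x1 ,x2 , ... , xn | μ, σ²), the joint density of x = (x1 ,x2 ,x3 , ... , xn), takes on the form:
$f(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}$
where the unknown parameter vector θ = (μ, σ²) ∈ Ω = {(x,y) | -∞ < x < ∞ , 0 ≤ y < ∞}.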
Defn (Sufficient Statistics)
Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is called a set of sufficient statistics for the parameter vector θ if the conditional distribution of x given S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is not functionally dependent on the parameter vector θ.
A set of sufficient statistics contains all of the information concerning the unknown parameter vector.
A Simple Example illustrating Sufficiency
Suppose that we observe a Success-Failure experiment n = 3 times. Let θ denote the probability of Success. Suppose that the data that is collected is x1, x2, x3 where xi takes on the value 1 if the ith trial is a Success and 0 if the ith trial is a Failure.
The following table gives possible values of (x1, x2, x3).
(x1, x2, x3)   f(x1, x2, x3|θ)   S = Σxi   g(S|θ)         f(x1, x2, x3|S)
(0, 0, 0)      (1 - θ)^3         0         (1 - θ)^3      1
(1, 0, 0)      θ(1 - θ)^2        1         3θ(1 - θ)^2    1/3
(0, 1, 0)      θ(1 - θ)^2        1         3θ(1 - θ)^2    1/3
(0, 0, 1)      θ(1 - θ)^2        1         3θ(1 - θ)^2    1/3
(1, 1, 0)      θ^2(1 - θ)        2         3θ^2(1 - θ)    1/3
(1, 0, 1)      θ^2(1 - θ)        2         3θ^2(1 - θ)    1/3
(0, 1, 1)      θ^2(1 - θ)        2         3θ^2(1 - θ)    1/3
(1, 1, 1)      θ^3               3         θ^3            1
The data can be generated in two equivalent ways:
1. Generating (x1, x2, x3) directly from f(x1, x2, x3|θ) or
2. Generating S from g(S|θ) then generating (x1, x2, x3) from f(x1, x2, x3|S). Since the second step does not involve θ, no additional information about θ will be obtained by knowing (x1, x2, x3) once S is determined.
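The claim that the conditional distribution of (x1, x2, x3) given S does not involve the Success probability can be verified by direct enumeration. The sketch below uses exact fractions; the function name `conditional_given_s` and the two probability values tried are assumptions made for the illustration.

```python
from fractions import Fraction as F
from itertools import product

def conditional_given_s(theta):
    """Return {outcome: P[outcome | S]} for 3 Bernoulli(theta) trials."""
    # Joint probability of each outcome (x1, x2, x3)
    f = {x: theta ** sum(x) * (1 - theta) ** (3 - sum(x))
         for x in product((0, 1), repeat=3)}
    # Distribution of S = x1 + x2 + x3
    g = {}
    for x, p in f.items():
        g[sum(x)] = g.get(sum(x), 0) + p
    # Conditional probability of each outcome given its value of S
    return {x: p / g[sum(x)] for x, p in f.items()}

# Two different (arbitrary) values of the Success probability
cond_a = conditional_given_s(F(1, 4))
cond_b = conditional_given_s(F(7, 10))

for outcome in sorted(cond_a):
    print(outcome, cond_a[outcome], cond_b[outcome])
```

Both columns printed are identical, which is what sufficiency of S means in this example.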
The Sufficiency Principle
Any decision regarding the parameter θ should be based on a set of Sufficient statistics S1(x), S2(x), ...,Sk(x) and not otherwise on the value of x.
A useful approach in developing a statistical procedure
1. Find sufficient statistics
2. Develop estimators, tests of hypotheses, etc. using only these statistics
Defn (Minimal Sufficient Statistics)
Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of Minimal Sufficient statistics for the parameter vector θ if S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of Sufficient statistics and can be calculated from any other set of Sufficient statistics.
Theorem (The Factorization Criterion)
Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of Sufficient statistics for the parameter vector θ if
f(x|θ) = h(x)g(S, θ)
= h(x)g(S1(x) ,S2(x) ,S3(x) , ... , Sk(x), θ).
This is useful for finding Sufficient statistics,
i.e. if you can factor out the θ-dependence with a set of statistics then these statistics are a set of Sufficient statistics.
Defn (Completeness)
Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Then S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of Complete Sufficient statistics for the parameter vector θ if S = (S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) is a set of Sufficient statistics and whenever
E[φ(S1(x) ,S2(x) ,S3(x) , ... , Sk(x))] = 0
then
P[φ(S1(x) ,S2(x) ,S3(x) , ... , Sk(x)) = 0] = 1
Defn (The Exponential Family)
Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then f(x|θ) is said to be a member of the exponential family of distributions if:
$f(\mathbf{x} \mid \boldsymbol{\theta}) = \begin{cases} h(\mathbf{x})\,g(\boldsymbol{\theta})\,\exp\left\{\sum_{i=1}^{k} S_i(\mathbf{x})\,p_i(\boldsymbol{\theta})\right\} & a_i \le x_i \le b_i \\ 0 & \text{otherwise} \end{cases}$
θ ∈ Ω, where
1) -∞ < ai < bi < ∞ are not dependent on θ.
2) Ω contains a nondegenerate k-dimensional rectangle.
3) g(θ), ai, bi and pi(θ) are not dependent on x.
4) h(x), ai, bi and Si(x) are not dependent on θ.
If in addition
5) The Si(x) are functionally independent for i = 1, 2,..., k.
6) ∂[Si(x)]/∂xj exists and is continuous for all i = 1, 2,..., k; j = 1, 2,..., n.
7) pi(θ) is a continuous function of θ for all i = 1, 2,..., k.
8) R = {[p1(θ), p2(θ), ..., pk(θ)] | θ ∈ Ω} contains a nondegenerate k-dimensional rectangle.
Then
the set of statistics S1(x), S2(x), ...,Sk(x) form a Minimal Complete set of Sufficient statistics.
Defn (The Likelihood function)
Let x have joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then for a given value of the observation vector x, the Likelihood function, Lx(θ), is defined by:
Lx(θ) = f(x|θ) with θ ∈ Ω
The log Likelihood function lx(θ) is defined by:
lx(θ) = ln Lx(θ) = ln f(x|θ) with θ ∈ Ω
The Likelihood Principle
Any decision regarding the parameter θ should be based on the likelihood function Lx(θ) and not otherwise on the value of x.
If two data sets result in the same likelihood function the decision regarding θ should be the same.
Some statisticians find it useful to plot the likelihood function Lx(θ) given the value of x.
It summarizes the information contained in x regarding the parameter vector θ.
An Example
The data vector
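x = (x1 ,x2 ,x3 , ... , xn), a sample from the normal distribution with mean μ and variance σ²
The joint distribution of x
Then f(x|μ, σ²) = f(x1 ,x2 , ... , xn | μ, σ²), the joint density of x = (x1 ,x2 ,x3 , ... , xn), takes on the form:
$f(\mathbf{x} \mid \mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}$
where the unknown parameter vector θ = (μ, σ²) ∈ Ω = {(x,y) | -∞ < x < ∞ , 0 ≤ y < ∞}.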
The Likelihood function
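Assume the data vector is known
x = (x1 ,x2 ,x3 , ... , xn)
Then L(μ, σ) = f(x|μ, σ) = f(x1 ,x2 , ... , xn | μ, σ²),
$L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2}$
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{1}{2\sigma^2}\left[\sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2\right]}$
or
$L(\mu, \sigma) = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n\bar{x}^2 - 2n\bar{x}\mu + n\mu^2\right]}$
since $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$ or $\sum_{i=1}^{n} x_i^2 = (n-1)s^2 + n\bar{x}^2$
and since $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ then $\sum_{i=1}^{n} x_i = n\bar{x}$
hence
$L(\mu, \sigma) = \frac{1}{(2\pi)^{n/2}\,\sigma^n}\,e^{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]}$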
Now consider the following data: (n = 10)
57.1 72.3 75.0 57.8 50.3 48.0 49.6 53.1 58.5 53.7
mean = 57.54, s = 9.2185
$L(\mu, \sigma) = \frac{1}{6.2832^{5}\,\sigma^{10}}\,e^{-\frac{1}{2\sigma^2}\left[9(9.2185)^2 + 10(57.54-\mu)^2\right]}$
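The n = 10 example can be reproduced numerically. The sketch below recomputes the mean and s from the ten observations and evaluates the Normal log likelihood in two ways: directly from the observations, and from the summary values n, mean and s alone. Working on the log scale is a choice made here to avoid very small numbers; the function names are assumptions.

```python
import math

data = [57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 49.6, 53.1, 58.5, 53.7]
n = len(data)
xbar = sum(data) / n                                   # sample mean
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)      # sample variance
s = math.sqrt(s2)

def loglik_direct(mu, sigma):
    """ln L(mu, sigma) as a sum over the individual observations."""
    return sum(-0.5 * math.log(2 * math.pi) - math.log(sigma)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def loglik_sufficient(mu, sigma):
    """ln L(mu, sigma) written with n, the mean and s only."""
    return (-(n / 2) * math.log(2 * math.pi) - n * math.log(sigma)
            - ((n - 1) * s2 + n * (xbar - mu) ** 2) / (2 * sigma ** 2))

print(round(xbar, 2), round(s, 4))
print(loglik_direct(60.0, 10.0), loglik_sufficient(60.0, 10.0))
```

The two log likelihood values agree for any (mu, sigma), so the likelihood depends on the data only through the mean and s.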
[Figure: Likelihood n = 10]
[Figure: Contour Map of Likelihood n = 100]
Now consider the following data: (n = 100)
57.1 72.3 75.0 57.8 50.3 48.0 49.6 53.1 58.5 53.7
77.8 43.0 69.8 65.1 71.1 44.4 64.4 52.9 56.4 43.9
49.0 37.6 65.5 50.4 40.7 66.9 51.5 55.8 49.1 59.5
64.5 67.6 79.9 48.0 68.1 68.0 65.8 61.3 75.0 78.0
61.8 69.0 56.2 77.2 57.5 84.0 45.5 64.4 58.7 77.5
81.9 77.1 58.7 71.2 58.1 50.3 53.2 47.6 53.3 76.4
69.8 57.8 65.9 63.0 43.5 70.7 85.2 57.2 78.9 72.9
78.6 53.9 61.9 75.2 62.2 53.2 73.0 38.9 75.4 69.7
68.8 77.0 51.2 65.6 44.7 40.4 72.1 68.1 82.2 64.7
83.1 71.9 65.4 45.0 51.6 48.3 58.5 65.3 65.9 59.6
mean = 62.02, s = 11.8571
$L(\mu, \sigma) = \frac{1}{6.2832^{50}\,\sigma^{100}}\,e^{-\frac{1}{2\sigma^2}\left[99(11.8571)^2 + 100(62.02-\mu)^2\right]}$
[Figure: Likelihood n = 100]
[Figure: Contour Map of Likelihood n = 100]
The Sufficiency Principle
Any decision regarding the parameter θ should be based on a set of Sufficient statistics S1(x), S2(x), ...,Sk(x) and not otherwise on the value of x.
If two data sets result in the same values for the set of Sufficient statistics the decision regarding θ should be the same.
Theorem (Birnbaum - Equivalency of the Likelihood Principle and Sufficiency Principle)
$L_{\mathbf{x}_1}(\boldsymbol{\theta}) \propto L_{\mathbf{x}_2}(\boldsymbol{\theta})$
if and only if
S1(x1) = S1(x2),..., and Sk(x1) = Sk(x2)
The following table gives possible values of (x1, x2, x3).
(x1, x2, x3)   f(x1, x2, x3|θ)   S = Σxi   g(S|θ)         f(x1, x2, x3|S)
(0, 0, 0)      (1 - θ)^3         0         (1 - θ)^3      1
(1, 0, 0)      θ(1 - θ)^2        1         3θ(1 - θ)^2    1/3
(0, 1, 0)      θ(1 - θ)^2        1         3θ(1 - θ)^2    1/3
(0, 0, 1)      θ(1 - θ)^2        1         3θ(1 - θ)^2    1/3
(1, 1, 0)      θ^2(1 - θ)        2         3θ^2(1 - θ)    1/3
(1, 0, 1)      θ^2(1 - θ)        2         3θ^2(1 - θ)    1/3
(0, 1, 1)      θ^2(1 - θ)        2         3θ^2(1 - θ)    1/3
(1, 1, 1)      θ^3               3         θ^3            1
[Figure: The Likelihood function. Four panels, S = 0, S = 1, S = 2 and S = 3, each with a horizontal axis running from 0 to 1.]
Estimation Theory
Point Estimation
Defn (Estimator)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Then an estimator of the parameter φ(θ) = φ(θ1 ,θ2 , ... , θk) is any function T(x) = T(x1 ,x2 ,x3 , ... , xn) of the observation vector.
Defn (Mean Square Error)
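Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) be an estimator of the parameter φ(θ). Then the Mean Square Error of T(x) is defined to be:
$M.S.E._{T(\mathbf{x})}(\boldsymbol{\theta}) = E\left[(T(\mathbf{x}) - \phi(\boldsymbol{\theta}))^2\right] = \int (T(\mathbf{x}) - \phi(\boldsymbol{\theta}))^2\,f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x}$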
Defn (Uniformly Better)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) and T*(x) be estimators of the parameter φ(θ). Then T(x) is said to be uniformly better than T*(x) if:
$M.S.E._{T(\mathbf{x})}(\boldsymbol{\theta}) \le M.S.E._{T^*(\mathbf{x})}(\boldsymbol{\theta})$ whenever θ ∈ Ω
Defn (Unbiased )
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let T(x) be an estimator of the parameter φ(θ). Then T(x) is said to be an unbiased estimator of the parameter φ(θ) if:
$E[T(\mathbf{x})] = \int T(\mathbf{x})\,f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} = \phi(\boldsymbol{\theta})$
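Theorem (Cramér-Rao Lower bound)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Suppose that:
i) $\frac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$ exists for all x and for all θ ∈ Ω.
ii) $\frac{\partial}{\partial \boldsymbol{\theta}} \int f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} = \int \frac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\,d\mathbf{x}$
iii) $\frac{\partial}{\partial \boldsymbol{\theta}} \int t(\mathbf{x})\,f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} = \int t(\mathbf{x})\,\frac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\,d\mathbf{x}$
iv) $0 < E\left[\left(\frac{\partial f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \theta_i}\right)^2\right] < \infty$ for all θ ∈ Ω
Let M denote the p × p matrix with ijth element
$m_{ij} = -E\left[\frac{\partial^2 \ln f(\mathbf{x} \mid \boldsymbol{\theta})}{\partial \theta_i\,\partial \theta_j}\right], \quad i, j = 1, 2, \ldots, p$
Then V = M^{-1} is the lower bound for the covariance matrix of unbiased estimators of θ.
That is, $var(\mathbf{c}'\hat{\boldsymbol{\theta}}) = \mathbf{c}'\,var(\hat{\boldsymbol{\theta}})\,\mathbf{c} \ge \mathbf{c}'M^{-1}\mathbf{c} = \mathbf{c}'V\mathbf{c}$ where $\hat{\boldsymbol{\theta}}$ is a vector of unbiased estimators of θ.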
Defn (Uniformly Minimum Variance Unbiased Estimator)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Then T*(x) is said to be the UMVU (Uniformly minimum variance unbiased) estimator of φ(θ) if:
1) E[T*(x)] = φ(θ) for all θ ∈ Ω.
2) Var[T*(x)] ≤ Var[T(x)] for all θ ∈ Ω whenever E[T(x)] = φ(θ).
Theorem (Rao-Blackwell)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let S1(x), S2(x), ...,Sk(x) denote a set of sufficient statistics. Let T(x) be any unbiased estimator of φ(θ). Then T*[S1(x), S2(x), ...,Sk(x)] = E[T(x) | S1(x), S2(x), ...,Sk(x)] is an unbiased estimator of φ(θ) such that:
Var[T*(S1(x), S2(x), ...,Sk(x))] ≤ Var[T(x)] for all θ ∈ Ω.
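The Rao-Blackwell construction can be illustrated with three Success-Failure trials and S = x1 + x2 + x3 as the sufficient statistic. Starting from the crude unbiased estimator T(x) = x1, conditioning on S gives T* = E[x1 | S]. The sketch below is an exact enumeration under an assumed Success probability of 3/10; the function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

theta = F(3, 10)   # illustrative Success probability (assumed)

# Probability of each outcome (x1, x2, x3) of three independent trials
outcomes = {x: theta ** sum(x) * (1 - theta) ** (3 - sum(x))
            for x in product((0, 1), repeat=3)}

def mean_var(t):
    """Exact mean and variance of the estimator t over all outcomes."""
    m = sum(p * t(x) for x, p in outcomes.items())
    v = sum(p * (t(x) - m) ** 2 for x, p in outcomes.items())
    return m, v

def t_crude(x):
    return F(x[0])                 # T(x) = x1, unbiased but wasteful

def t_star(x):
    """T* = E[x1 | S], computed from the outcomes sharing the same S."""
    block = {y: p for y, p in outcomes.items() if sum(y) == sum(x)}
    total = sum(block.values())
    return sum(p * y[0] for y, p in block.items()) / total

mean_crude, var_crude = mean_var(t_crude)
mean_star, var_star = mean_var(t_star)
print(mean_crude, var_crude)   # same mean as T*, larger variance
print(mean_star, var_star)
```

In this example T* works out to S/3, the sample proportion, and its variance is one third of the variance of x1.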
Theorem (Lehmann-Scheffé)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Let S1(x), S2(x), ...,Sk(x) denote a set of complete sufficient statistics.
Let T*[S1(x), S2(x), ...,Sk(x)] be an unbiased estimator of φ(θ). Then:
T*(S1(x), S2(x), ...,Sk(x)) is the UMVU estimator of φ(θ).
Defn (Consistency)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let Tn(x) be an estimator of φ(θ). Then Tn(x) is called a consistent estimator of φ(θ) if for any ε > 0:
$\lim_{n \to \infty} P\left[\,|T_n(\mathbf{x}) - \phi(\boldsymbol{\theta})| \ge \varepsilon\,\right] = 0$ for all θ ∈ Ω
Defn (M. S. E. Consistency)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Let Tn(x) be an estimator of φ(θ). Then Tn(x) is called a M. S. E. consistent estimator of φ(θ) if:
$\lim_{n \to \infty} M.S.E._{T_n}(\boldsymbol{\theta}) = \lim_{n \to \infty} E\left[(T_n(\mathbf{x}) - \phi(\boldsymbol{\theta}))^2\right] = 0$ for all θ ∈ Ω
Methods for Finding Estimators
1. The Method of Moments
2. Maximum Likelihood Estimation
Let x1, … , xn denote a sample from the density function
f(x; θ1, … , θp) = f(x; θ)
Method of Moments
The kth moment of the distribution being sampled is defined to be:
$\mu_k(\theta_1, \ldots, \theta_p) = E[x^k] = \int_{-\infty}^{\infty} x^k\,f(x; \theta_1, \ldots, \theta_p)\,dx$
The kth sample moment is defined to be:
$m_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k$
To find the method of moments estimator of θ1, … , θp we set up the equations:
$\mu_1(\theta_1, \ldots, \theta_p) = m_1$
$\mu_2(\theta_1, \ldots, \theta_p) = m_2$
...
$\mu_p(\theta_1, \ldots, \theta_p) = m_p$
We then solve the equations for θ1, … , θp.
The solutions $\tilde{\theta}_1, \ldots, \tilde{\theta}_p$
are called the method of moments estimators
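As an illustration of the method of moments, consider a sample from a normal distribution with mean mu and variance sigma². The first two moments are mu and sigma² + mu², so equating them to the sample moments m1 and m2 gives the estimators m1 for mu and m2 - m1² for sigma². The sketch below applies this to the ten observations used in the likelihood example; pairing that data with this method is an assumption made for illustration.

```python
# Ten observations from the slides' n = 10 likelihood example
data = [57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 49.6, 53.1, 58.5, 53.7]
n = len(data)

# Sample moments m_k = (1/n) * sum of x_i ** k
m1 = sum(data) / n
m2 = sum(x ** 2 for x in data) / n

# Solve  mu = m1  and  sigma^2 + mu^2 = m2  for the two parameters
mu_mom = m1
sigma2_mom = m2 - m1 ** 2

print(mu_mom, sigma2_mom)
```

Note that the method of moments variance estimate divides by n rather than n - 1, so it is slightly smaller than the sample variance s².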
The Method of Maximum Likelihood
Suppose that the data x1, … , xn has joint density function
f(x1, … , xn ; θ1, … , θp)
where θ = (θ1, … , θp) are unknown parameters assumed to lie in Ω (a subset of p-dimensional space).
We want to estimate the parameters θ1, … , θp
Definition: Maximum Likelihood Estimation
Suppose that the data x1, … , xn has joint density function
f(x1, … , xn ; θ1, … , θp)
Then the Likelihood function is defined to be
L(θ) = L(θ1, … , θp)
= f(x1, … , xn ; θ1, … , θp)
The Maximum Likelihood estimators of the parameters θ1, … , θp are the values $\hat{\theta}_1, \ldots, \hat{\theta}_p$ that maximize L(θ) = L(θ1, … , θp), that is, such that
$L(\hat{\theta}_1, \ldots, \hat{\theta}_p) = \max_{\theta_1, \ldots, \theta_p} L(\theta_1, \ldots, \theta_p)$
Note: maximizing L(θ1, … , θp) is equivalent to maximizing
l(θ1, … , θp) = ln L(θ1, … , θp),
the log-likelihood function
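For a normal sample the log likelihood can be maximized in closed form: the maximizing mean is the sample mean, and the maximizing variance is the average squared deviation from it (dividing by n). The sketch below computes these for the ten observations from the likelihood example and compares the result with a coarse grid search, as a sanity check rather than a general-purpose optimizer. The grid limits are arbitrary assumptions.

```python
import math

data = [57.1, 72.3, 75.0, 57.8, 50.3, 48.0, 49.6, 53.1, 58.5, 53.7]
n = len(data)

def loglik(mu, sigma):
    """Log likelihood of a normal sample with mean mu and std. dev. sigma."""
    return (-(n / 2) * math.log(2 * math.pi) - n * math.log(sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

# Closed-form maximum likelihood estimates
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

# Coarse grid: mu from 50 to 70, sigma from 1 to 20, in steps of 0.5
grid = [(50 + 0.5 * i, 1 + 0.5 * j) for i in range(41) for j in range(39)]
best_on_grid = max(loglik(m, sd) for m, sd in grid)

print(mu_hat, sigma_hat)
print(loglik(mu_hat, sigma_hat), best_on_grid)
```

No grid point should exceed the log likelihood at the closed-form estimates.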
Application
The General Linear Model
Consider the random variable Y with
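1. E[Y] = g(U1 ,U2 , ... , Uk)
= β1φ1(U1 ,U2 , ... , Uk) + β2φ2(U1 ,U2 , ... , Uk) + ... + βpφp(U1 ,U2 , ... , Uk)
$= \sum_{i=1}^{p} \beta_i\,\phi_i(U_1, U_2, \ldots, U_k)$
and
2. var(Y) = σ²
• where β1, β2 , ... ,βp are unknown parameters
• and φ1 ,φ2 , ... , φp are known functions of the nonrandom variables U1 ,U2 , ... , Uk.
• Assume further that Y is normally distributed.
Thus the density of Y is:
f(Y | β1, β2 , ... ,βp, σ²) = f(Y | β, σ²)
$= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}\left[Y - g(U_1, U_2, \ldots, U_k)\right]^2\right\}$
$= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}\left[Y - \sum_{i=1}^{p} \beta_i\,\phi_i(U_1, U_2, \ldots, U_k)\right]^2\right\}$
$= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}\left[Y - \beta_1 X_1 - \beta_2 X_2 - \cdots - \beta_p X_p\right]^2\right\}$
where Xi = φi(U1 ,U2 , ... , Uk), i = 1,2, … , p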
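Now suppose that n independent observations of Y,
(y1, y2, ..., yn) are made
corresponding to n sets of values of (U1 ,U2 , ... , Uk): (u11 ,u12 , ... , u1k),
(u21 ,u22 , ... , u2k),...
(un1 ,un2 , ... , unk).
Let xij = φj(ui1 ,ui2 , ... , uik), j = 1, 2, ..., p; i = 1, 2, ..., n.
Then the joint density of y = (y1, y2, ... yn) is:
f(y1, y2, ..., yn | β1, β2 , ... ,βp, σ²) = f(y | β, σ²)
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[y_i - g(u_{i1}, u_{i2}, \ldots, u_{ik})\right]^2\right\}$
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{p} \beta_j\,\phi_j(u_{i1}, u_{i2}, \ldots, u_{ik})\right]^2\right\}$
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[y_i - \sum_{j=1}^{p} \beta_j\,x_{ij}\right]^2\right\}$
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}(\mathbf{y} - X\boldsymbol{\beta})'(\mathbf{y} - X\boldsymbol{\beta})\right\}$
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}\left[\mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta} + \boldsymbol{\beta}'X'X\boldsymbol{\beta}\right]\right\}$
$= \frac{1}{(2\pi)^{n/2}\,\sigma^n} \exp\left\{-\frac{1}{2\sigma^2}\boldsymbol{\beta}'X'X\boldsymbol{\beta}\right\} \exp\left\{-\frac{1}{2\sigma^2}\left[\mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta}\right]\right\}$
$= h(\mathbf{y})\,g(\boldsymbol{\beta}, \sigma^2)\,\exp\left\{-\frac{1}{2\sigma^2}\left[\mathbf{y}'\mathbf{y} - 2\mathbf{y}'X\boldsymbol{\beta}\right]\right\}$
where X is the n × p matrix with elements xij.
Thus f(y | β, σ²) is a member of the exponential family of distributions
and S = (y'y, X'y) is a Minimal Complete set of Sufficient Statistics.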
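One way to see why (y'y, X'y) carries all the information in y is that the sum of squares in the exponent, (y - Xb)'(y - Xb) = y'y - 2 y'Xb + b'X'Xb, can be rebuilt from these two statistics together with the known matrix X'X, for any coefficient vector b. The sketch below checks this on a small straight-line model; the data values, the trial coefficients and the function names are all hypothetical.

```python
# Hypothetical data for a straight-line model: x_i1 = 1, x_i2 = u_i
u = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
X = [[1.0, ui] for ui in u]
n, p = len(y), 2

# The sufficient statistics y'y and X'y, plus the known matrix X'X
yty = sum(yi ** 2 for yi in y)
Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
       for a in range(p)]

def rss_direct(beta):
    """(y - X beta)'(y - X beta) computed from the raw observations."""
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def rss_from_statistics(beta):
    """The same quantity computed only from y'y, X'y and X'X."""
    cross = sum(Xty[j] * beta[j] for j in range(p))
    quad = sum(beta[a] * XtX[a][b] * beta[b]
               for a in range(p) for b in range(p))
    return yty - 2 * cross + quad

trial_beta = [0.1, 1.9]   # arbitrary trial value of the coefficients
print(rss_direct(trial_beta), rss_from_statistics(trial_beta))
```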
Hypothesis Testing
Defn (Test of size α)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Let ω be any subset of Ω.
Consider testing the Null Hypothesis
H0: θ ∈ ω
against the alternative hypothesis
H1: θ ∉ ω.
Let A denote the acceptance region for the test (all values of x = (x1 ,x2 ,x3 , ... , xn) such that the decision to accept H0 is made)
and let C denote the critical region for the test (all values of x = (x1 ,x2 ,x3 , ... , xn) such that the decision to reject H0 is made).
Then the test is said to be of size α if
$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha$ for all θ ∈ ω and
$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha$ for at least one θ0 ∈ ω
Defn (Power)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Consider testing the Null Hypothesis
H0: θ ∈ ω
against the alternative hypothesis
H1: θ ∉ ω.
where ω is any subset of Ω. Then the Power of the test with critical region C for θ ∉ ω is defined to be:
$P_{\boldsymbol{\theta}}[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x}$
Defn (Uniformly Most Powerful (UMP) test of size α)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω. Consider testing the Null Hypothesis
H0: θ ∈ ω against the alternative hypothesis
H1: θ ∉ ω, where ω is any subset of Ω. Let C denote the critical region for the test. Then the test is called the UMP test of size α if:
$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha$ for all θ ∈ ω and
$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha$ for at least one θ0 ∈ ω
and for any other critical region C* such that:
$P[\mathbf{x} \in C^*] = \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha$ for all θ ∈ ω and
$P[\mathbf{x} \in C^*] = \int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha$ for at least one θ0 ∈ ω
then
$\int_{C^*} f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x}$ for all θ ∉ ω.
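Theorem (Neyman-Pearson Lemma)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω = (θ0, θ1).
Consider testing the Null Hypothesis
H0: θ = θ0
against the alternative hypothesis
H1: θ = θ1.
Then the UMP test of size α has critical region:
$C = \left\{\mathbf{x} \,\Big|\, \frac{f(\mathbf{x} \mid \boldsymbol{\theta}_0)}{f(\mathbf{x} \mid \boldsymbol{\theta}_1)} \le K\right\}$
where K is chosen so that $\int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha$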
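Defn (Likelihood Ratio Test of size α)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Consider testing the Null Hypothesis
H0: θ ∈ ω
against the alternative hypothesis
H1: θ ∉ ω.
where ω is any subset of Ω. Then the Likelihood Ratio (LR) test of size α has critical region:
$C = \left\{\mathbf{x} \,\Big|\, \lambda(\mathbf{x}) = \frac{\max_{\boldsymbol{\theta} \in \omega} f(\mathbf{x} \mid \boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} f(\mathbf{x} \mid \boldsymbol{\theta})} \le K\right\}$
where K is chosen so that
$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta})\,d\mathbf{x} \le \alpha$ for all θ ∈ ω and
$P[\mathbf{x} \in C] = \int_C f(\mathbf{x} \mid \boldsymbol{\theta}_0)\,d\mathbf{x} = \alpha$ for at least one θ0 ∈ ω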
Theorem (Asymptotic distribution of Likelihood ratio test criterion)
Let x = (x1 ,x2 ,x3 , ... , xn) denote the vector of observations having joint density f(x|θ) where the unknown parameter vector θ ∈ Ω.
Consider testing the Null Hypothesis
H0: θ ∈ ω
against the alternative hypothesis
H1: θ ∉ ω.
where ω is any subset of Ω.
Let $\lambda(\mathbf{x}) = \frac{\max_{\boldsymbol{\theta} \in \omega} f(\mathbf{x} \mid \boldsymbol{\theta})}{\max_{\boldsymbol{\theta} \in \Omega} f(\mathbf{x} \mid \boldsymbol{\theta})}$
Then, under proper regularity conditions, U = -2 ln λ(x) possesses an asymptotic Chi-square distribution with degrees of freedom equal to the difference between the number of independent parameters in Ω and ω.
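A small worked case may help with the likelihood ratio criterion. For n Success-Failure trials with S successes, testing the null hypothesis that the Success probability equals a fixed value theta0 leaves no free parameter under the null and one in the full model, so U = -2 ln(lambda) is compared with a Chi-square distribution with 1 degree of freedom. The counts below (62 successes in 100 trials, theta0 = 0.5) are hypothetical, the helper names are assumptions, and 3.841 is the usual upper 5% point of the Chi-square distribution with 1 degree of freedom.

```python
import math

def xlogy(x, y):
    """x * ln(y), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)

def lr_statistic(successes, n, theta0):
    """U = -2 ln(lambda) for testing H0: theta = theta0 in n Bernoulli trials."""
    theta_hat = successes / n      # maximizes the likelihood over (0, 1)
    log_null = xlogy(successes, theta0) + xlogy(n - successes, 1 - theta0)
    log_full = xlogy(successes, theta_hat) + xlogy(n - successes, 1 - theta_hat)
    return -2 * (log_null - log_full)

CHI2_1DF_95 = 3.841                # upper 5% point, 1 degree of freedom

# Hypothetical data: 62 successes in 100 trials, H0: theta = 0.5
u_stat = lr_statistic(62, 100, 0.5)
reject = u_stat > CHI2_1DF_95
print(u_stat, reject)
```

Since the Chi-square distribution is only an asymptotic approximation, the resulting test has size approximately, not exactly, 5%.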