ST 522 Slides
7/29/2019 ST 522 Slides
ST 522: Statistical Theory II
Subhashis Ghoshal, North Carolina State University
Useful Results from Calculus
We recapitulate some facts from calculus we need throughout.
Theorem (Binomial theorem)
(a+b)^n = C(n,0) a^n b^0 + C(n,1) a^(n−1) b^1 + ⋯ + C(n,n−1) a^1 b^(n−1) + C(n,n) a^0 b^n,
where C(n,k) = n!/(k!(n−k)!) is the binomial coefficient.
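As a quick numerical sanity check of the binomial theorem (the values 2, 3, 5 are arbitrary choices, not from the slides):

```python
import math

def binomial_expansion(a, b, n):
    """Sum of binomial-theorem terms C(n,k) * a^(n-k) * b^k for k = 0..n."""
    return sum(math.comb(n, k) * a**(n - k) * b**k for k in range(n + 1))

# The expansion reproduces (a + b)^n exactly for integer inputs.
assert binomial_expansion(2, 3, 5) == (2 + 3)**5  # 3125
assert binomial_expansion(1, 1, 10) == 2**10      # a binomial row sums to 2^n
```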
Common infinite series
Geometric series
a + ar + ⋯ + ar^(n−1) = a (r^n − 1)/(r − 1) = a (1 − r^n)/(1 − r), r ≠ 1.
Infinite Geometric series
a + ar + ar^2 + ⋯ = a/(1 − r), |r| < 1.
(1 − x)^(−1) = 1 + x + x^2 + ⋯, |x| < 1.
(1 + x)^(−1) = 1 − x + x^2 − x^3 + ⋯, |x| < 1.
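The finite and infinite geometric-series formulas can be verified numerically (a, r, n below are arbitrary illustrative values):

```python
# Partial sum a + ar + ... + ar^(n-1) versus the closed form a(1 - r^n)/(1 - r).
a, r, n = 2.0, 0.5, 10
partial = sum(a * r**k for k in range(n))
closed = a * (1 - r**n) / (1 - r)
assert abs(partial - closed) < 1e-12

# The infinite sum a/(1 - r) is the n -> infinity limit for |r| < 1:
# the remainder after n terms is a * r^n / (1 - r).
assert abs(a / (1 - r) - partial) <= a * abs(r)**n / (1 - r) + 1e-12
```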
Common infinite series (contd.)
Infinite binomial series
(1 − x)^(−2) = 1 + 2x + 3x^2 + 4x^3 + ⋯, |x| < 1.
(1 − x)^(−r) = 1 + Σ_{n=1}^∞ C(r+n−1, n) x^n, |x| < 1, where for any real number α, C(α, n) = α(α−1)⋯(α−n+1)/n!, the generalized binomial coefficient. In particular, C(r+n−1, n) = r(r+1)⋯(r+n−1)/n!. Also note that for α > 0, C(−α, r) = (−1)^r α(α+1)⋯(α+r−1)/r!.
Exponential series
e^x = 1 + x/1! + x^2/2! + ⋯
Logarithmic series
log(1 + x) = x − x^2/2 + x^3/3 − ⋯, |x| < 1.
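Truncating the exponential and logarithmic series gives accurate approximations inside their regions of validity; a quick check at the arbitrary point x = 0.7:

```python
import math

# Truncated exponential series 1 + x/1! + x^2/2! + ... approximates e^x.
x = 0.7
exp_series = sum(x**k / math.factorial(k) for k in range(20))
assert abs(exp_series - math.exp(x)) < 1e-12

# Truncated logarithmic series x - x^2/2 + x^3/3 - ... approximates
# log(1 + x) for |x| < 1 (convergence is slower, so more terms are used).
log_series = sum((-1)**(k + 1) * x**k / k for k in range(1, 200))
assert abs(log_series - math.log(1 + x)) < 1e-10
```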
Useful limits
lim_{n→∞} (1 + 1/n)^n = e.
lim_{n→∞} (1 + a_n/n)^n = e^a for any a_n → a.
lim_{x→0} (1 + ax)^(1/x) = e^a.
lim_{x→0} log(1 + x)/x = 1.
lim_{x→0} sin x / x = 1.
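These limits can be observed numerically by plugging in a large n or a small x (the particular values and tolerances below are arbitrary):

```python
import math

# (1 + 1/n)^n approaches e as n grows; the error is roughly e/(2n).
n = 10**6
assert abs((1 + 1/n)**n - math.e) < 1e-4

# (1 + a*x)^(1/x) approaches e^a as x -> 0.
a, x = 2.0, 1e-7
assert abs((1 + a*x)**(1/x) - math.exp(a)) < 1e-4

# log(1 + x)/x and sin(x)/x both approach 1 as x -> 0.
assert abs(math.log(1 + x)/x - 1) < 1e-6
assert abs(math.sin(x)/x - 1) < 1e-6
```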
Derivatives
d/dx x^n = n x^(n−1).
d/dx e^(ax) = a e^(ax).
d/dx a^x = a^x log a.
d/dx log x = 1/x.
d/dx sin x = cos x.
d/dx cos x = −sin x.
d/dx tan x = 1 + tan^2 x.
d/dx sin^(−1) x = 1/√(1 − x^2).
d/dx tan^(−1) x = 1/(1 + x^2).
d/dx (a f(x) + b g(x)) = a f′(x) + b g′(x).
d/dx f(x)g(x) = f′(x)g(x) + f(x)g′(x).
d/dx (f(x)/g(x)) = (f′(x)g(x) − f(x)g′(x))/g^2(x).
d/dx f(g(x)) = f′(g(x)) g′(x).
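A central-difference quotient can numerically confirm a few of these rules at an arbitrary point (x = 0.8 below):

```python
import math

def num_deriv(f, x, h=1e-6):
    """Central-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.8

# d/dx tan x = 1 + tan^2 x
assert abs(num_deriv(math.tan, x) - (1 + math.tan(x)**2)) < 1e-5

# d/dx arcsin x = 1/sqrt(1 - x^2)
assert abs(num_deriv(math.asin, x) - 1/math.sqrt(1 - x**2)) < 1e-5

# Product rule for f(x) = x^3 sin x
def f(t):
    return t**3 * math.sin(t)

fprime = 3*x**2*math.sin(x) + x**3*math.cos(x)
assert abs(num_deriv(f, x) - fprime) < 1e-5
```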
Integration
∫ x^n dx = x^(n+1)/(n+1), n ≠ −1.
∫ x^(−1) dx = log x.
∫ e^(ax) dx = e^(ax)/a, a ≠ 0.
∫ f′(x)/f(x) dx = log f(x).
Integration by substitution
∫ g(f(x)) f′(x) dx = ∫ g(y) dy, y = f(x).
Integration by parts
∫ u(x)v(x) dx = u(x)V(x) − ∫ V(x)u′(x) dx,
where V(x) = ∫ v(x) dx; u(x) is called the first function and v(x) the second.
Integration (contd.)
Integration by partial fractions
Applies when integrating the ratio of two polynomials P(x) and Q(x), where the degree of P is less than the degree of Q without loss of generality. Factorize Q(x) into linear and quadratic factors. The ratio can then be written uniquely as a linear combination of reciprocals of the linear factors and linear-over-quadratic factors. The resulting expression can be integrated term by term. Consult any standard calculus text such as Apostol.
Definite Integral
∫_a^b f(x) dx = F(x)]_a^b = F(b) − F(a),
where F(x) = ∫ f(x) dx.
Order Statistics
Given a random sample, we are interested in the smallest, largest, or middle observations.
the highest flood waters
the lowest winter temperature recorded in the last 50 years
the median price of houses sold in the last month
the median salary of NBA players
Definition: Given a random sample X1, . . . , Xn, the sample order statistics are the sample values placed in ascending order:
X(1) = min_{1≤i≤n} X_i, X(2) = second smallest X_i, . . . , X(n) = max_{1≤i≤n} X_i.
Example: Suppose four numbers are observed as a sample of size 4. The sample values are x1 = 6, x2 = 9, x3 = 3, x4 = 8. What are the order statistics?
Order Statistics (contd.)
Order statistics are random variables themselves (as functions of a random sample).
Order statistics satisfy
X(1) ≤ X(2) ≤ ⋯ ≤ X(n).
Though the samples X1, . . . , Xn are independently and identically distributed, the order statistics X(1), . . . , X(n) are never independent because of the order restriction.
We will study their marginal distributions and joint distributions.
Order Statistics - Marginal distributions
Assume X1, . . . , Xn are from a continuous population with cdf F(x) and pdf f(x).
The nth order statistic, or the sample maximum, X(n) has the pdf
f_{X(n)}(x) = n [F(x)]^(n−1) f(x).
The first order statistic, or the sample minimum, X(1) has the pdf
f_{X(1)}(x) = n [1 − F(x)]^(n−1) f(x).
More generally, the jth order statistic X(j) has the pdf
f_{X(j)}(x) = n!/((j−1)!(n−j)!) f(x) [F(x)]^(j−1) [1 − F(x)]^(n−j).
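Both claims about the jth order statistic can be checked numerically for the uniform case F(x) = x on [0, 1], where X(j) ∼ Beta(j, n+1−j) with mean j/(n+1); the sample size, j, seed, and tolerances below are arbitrary choices:

```python
import math
import random

random.seed(0)
n, j, reps = 5, 2, 20000

# Monte Carlo: the mean of the j-th order statistic of n U(0,1) draws
# should be close to the Beta(j, n+1-j) mean j/(n+1).
vals = []
for _ in range(reps):
    sample = sorted(random.random() for _ in range(n))
    vals.append(sample[j - 1])
mean = sum(vals) / reps
assert abs(mean - j / (n + 1)) < 0.01

# Midpoint rule: the marginal pdf n!/((j-1)!(n-j)!) x^(j-1) (1-x)^(n-j)
# integrates to 1 over [0, 1].
coef = math.factorial(n) / (math.factorial(j - 1) * math.factorial(n - j))
m = 20000
total = sum(coef * ((k + 0.5)/m)**(j - 1) * (1 - (k + 0.5)/m)**(n - j) / m
            for k in range(m))
assert abs(total - 1) < 1e-4
```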
Order Statistics - Joint distributions
For 1 ≤ i < j ≤ n, the joint pdf of X(i) and X(j) is
f_{X(i),X(j)}(u, v) = n!/((i−1)!(j−i−1)!(n−j)!) f(u)f(v) [F(u)]^(i−1) [F(v) − F(u)]^(j−i−1) [1 − F(v)]^(n−j)
if −∞ < u < v < ∞; = 0 otherwise.
Special case: the joint pdf of X(1) and X(n).
The joint pdf of X(1), . . . , X(n) is
f_{X(1),...,X(n)}(u1, . . . , un) = n! f(u1) ⋯ f(un) 1{−∞ < u1 < ⋯ < un < ∞}.
Illustration
Example: X1, . . . , Xn are iid from Unif[0, 1]. Show that X(j) ∼ Beta(j, n + 1 − j). Compute E[X(j)] and Var[X(j)].
The joint pdf of X(1) and X(n).
Let n = 5. Derive the joint pdf of X(2) and X(4).
X(1)|X(n) ∼ X(n) Beta(1, n − 1). For any i < j, X(i)|X(j) ∼ X(j) Beta(i, j − i).
Let n = 5. Derive the joint pdf of X(1), . . . , X(5).
Example
Compute P(X(1) > 1, X(n) ≤ 2).
P(X(1) > x, X(n) ≤ y) = ∏_{i=1}^n P(x < X_i ≤ y) = [F(y) − F(x)]^n.
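The identity P(X(1) > x, X(n) ≤ y) = [F(y) − F(x)]^n can be checked by simulation in the U(0, 1) case, where F(t) = t (sample size, seed, and the points x, y below are arbitrary):

```python
import random

random.seed(1)
n, reps = 4, 50000
x, y = 0.2, 0.7

# Count samples whose minimum exceeds x and whose maximum is at most y.
hits = 0
for _ in range(reps):
    s = [random.random() for _ in range(n)]
    if min(s) > x and max(s) <= y:
        hits += 1

# For U(0,1), F(t) = t, so the formula gives (y - x)^n.
assert abs(hits / reps - (y - x)**n) < 0.01
```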
Common statistics based on order statistics
sample range: R = X(n) − X(1)
sample midrange: V = (X(n) + X(1))/2
sample median:
M = X((n+1)/2) if n is odd; (X(n/2) + X(n/2+1))/2 if n is even.
sample percentile: for any 0 < p < 1, the (100p)th sample percentile is the observation such that about np of the observations are less than this observation and n(1 − p) of the observations are larger.
The sample median M is the 50th sample percentile (the second sample quartile); denote by Q1 the 25th sample percentile (the first sample quartile) and by Q3 the 75th sample percentile (the third sample quartile); the interquartile range IQR = Q3 − Q1 describes the spread about the median.
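These statistics can be computed directly from the sorted sample; reusing the four-observation example from the earlier slide (x = 6, 9, 3, 8):

```python
data = [6, 9, 3, 8]
order = sorted(data)                    # order statistics X_(1) <= ... <= X_(n)
n = len(order)

sample_range = order[-1] - order[0]     # R = X_(n) - X_(1)
midrange = (order[-1] + order[0]) / 2   # V = (X_(n) + X_(1))/2

# Median: middle value for odd n, average of the two middle values for even n.
if n % 2 == 1:
    median = order[(n + 1)//2 - 1]
else:
    median = (order[n//2 - 1] + order[n//2]) / 2

assert order == [3, 6, 8, 9]
assert sample_range == 6
assert midrange == 6.0
assert median == 7.0
```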
Remarks
Sample Mean vs Sample Median
Sample Median vs Population Median
Principles of data reduction
Data X = (X1, . . . , Xn): probability distribution P completely or partially unknown.
Distribution often modeled by standard ones such as Poisson or normal.
A few parameters control the distribution of the data: P = P_θ.
Parameter θ: unknown, the object of interest.
Inference: any conclusion about parameter values based on data.
Three main inference problems: point estimation, hypothesis testing, interval estimation.
Statistic T = T(X): any function of the data; a summary measure of the data.
Statistics may be used as point estimators, test statistics, and upper and lower confidence limits.
Inductive reasoning
Role of probability theory: the extent of randomness of T is controlled by θ. Probabilistic characteristics such as expectation, variance, moments, and the distribution involve θ.
Conversely, the value of T reflects knowledge about θ. For instance, if T has expectation θ and θ is unknown, then θ can be estimated by T. Intuitively, if we observe a large value of T, we tend to conclude that θ must be large.
We need to assess the extent of the error.
Frequentist approach: randomness of the error means we must judge based on the average error over repeated sampling. Thus we need to study the sampling distribution of T.
Sufficiency
As T summarizes the data X, the first natural question is whether there is any loss of information due to summarization.
The data contain many pieces of information; some are relevant for θ and some are not.
Dropping irrelevant information is desirable, but dropping relevant information is undesirable.
How do we compare the amount of information about θ in the data and in T? Is it sufficient to consider only the reduced data T?
Definition (Sufficient statistic)
A statistic T is called sufficient if the conditional distribution of X given T is free of θ (that is, the conditional is a completely known distribution).
Example
Toss a coin 100 times. The probability of a head, p, is unknown. T = number of heads obtained.
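The defining property can be verified exactly for a small number of tosses (3 rather than 100, for brevity): given T = t, every pattern with t heads has conditional probability 1/C(n, t), whatever p is.

```python
from math import comb

def cond_prob(pattern, p):
    """P(X = pattern | T = t) for iid Bernoulli(p) tosses, where t = sum(pattern)."""
    n, t = len(pattern), sum(pattern)
    joint = p**t * (1 - p)**(n - t)                # P(X1, ..., Xn = pattern)
    prob_t = comb(n, t) * p**t * (1 - p)**(n - t)  # P(T = t)
    return joint / prob_t

# The conditional probability is 1/C(3, 2) = 1/3, free of p.
for p in (0.2, 0.5, 0.9):
    assert abs(cond_prob((1, 0, 1), p) - 1 / comb(3, 2)) < 1e-12
```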
Sufficiency principle
If T is sufficient, the extra information carried by X is worthless as far as θ is concerned. It is then only natural to consider inference procedures that do not use this extra irrelevant information. This leads to the principle of sufficiency.
Definition (Sufficiency principle)
Any inference procedure should depend on the data only through asufficient statistic.
How to check sufficiency?
Theorem (Neyman-Fisher Factorization theorem)
T is sufficient iff f(x; θ) can be written as the product g(T(x); θ)h(x), where the first factor depends on x only through T(x) and the second factor is free of θ.
Example
X1, . . . , Xn iid:
N(θ, 1).
Bin(1, θ).
Poi(θ).
N(μ, σ²), θ = (μ, σ).
Ga(α, β), θ = (α, β). (Includes exponential.)
U(0, θ), range of X depends on θ.
Exponential family
f(x; θ) = c(θ)h(x) exp[Σ_{j=1}^k w_j(θ) t_j(x)], θ = (θ1, . . . , θd), d ≤ k.
Theorem
Let X1, . . . , Xn be iid observations from the above exponential family. Then T(X) = (Σ_{i=1}^n t_1(X_i), . . . , Σ_{i=1}^n t_k(X_i)) is sufficient for θ = (θ1, . . . , θd).
Applications
Beta(α, β).
Curved exponential family: N(θ, θ²).
Old examples revisited: binomial, Poisson, normal, exponential, gamma (except uniform). Exercise.
More applications
Discrete uniform: P(X = x) = 1/θ, x = 1, . . . , θ, θ a positive integer.
f(x, θ) = e^(−(x−θ)), x > θ.
A universal example: iid with density f. The order statistic T = (X(1), . . . , X(n)) is sufficient.
Remarks
In the order statistics example, the dimension of T is the same as the dimension of the data. Still, this is a nontrivial reduction, as n! different values of the data correspond to one value of T.
Often one finds better reductions for specific parametric families, as seen in the many examples before.
Trivially, X is always sufficient for itself, with no gain.
When one statistic is a mathematical function of the other and vice versa (i.e., there is a one-to-one correspondence), they carry exactly the same amount of information, so they are equivalent.
More generally, if T is sufficient for θ and T = c(U), a mathematical function of some other statistic U, then U is also sufficient.
Examples of insufficiency
X1, X2 iid Poi(θ). T = X1 − X2 is not sufficient.
X1, . . . , Xn iid pmf f(x; θ). T = (X1, . . . , Xn−1) is not sufficient.
Minimal sufficiency
Maximum possible reduction.
Definition (Minimal sufficient statistic)
T is a minimal sufficient statistic if, given any other sufficient statistic T′, there is a function c(·) such that T = c(T′).
Equivalently, T is minimal sufficient if, given any other sufficient statistic T′, whenever x and y are two data values such that T′(x) = T′(y), then T(x) = T(y).
Checking minimal sufficiency
Theorem (Lehmann–Scheffé theorem)
A statistic T is minimal sufficient if the following property holds: for any two sample points x and y, f(x; θ)/f(y; θ) does not depend on θ if and only if T(x) = T(y).
Corollary
A minimal sufficient statistic is not unique, but any two are in one-to-one correspondence, so they are equivalent.
Examples
iid N(μ, σ²).
iid U(θ, θ + 1).
iid Cauchy(θ).
iid U(−θ, θ).
Minimal sufficiency in exponential family
Theorem
For iid observations from an exponential family
f(x; θ) = c(θ)h(x) exp[Σ_{j=1}^k w_j(θ) t_j(x)]
such that no affine (linear plus constant) relationship exists between w_1(θ), . . . , w_k(θ), the statistic T(X) = (Σ_{i=1}^n t_1(X_i), . . . , Σ_{i=1}^n t_k(X_i)) is minimal sufficient for θ = (θ1, . . . , θd).
Examples
N(μ, σ²).
Ga(α, β).
Be(α, β).
N(θ, θ²).
Be(θ, 1 − θ), 0 < θ < 1.
Ancillary statistic
Definition
A statistic T is called ancillary if its distribution does not depend on the parameter.
The induced family is a singleton, completely known, containing no information about θ: the opposite of sufficiency.
A function of an ancillary statistic is ancillary.
Examples
iid U(θ, θ + 1).
Location family: iid f(x − θ). Scale family: iid σ^(−1) f(x/σ).
iid N(θ, 1). X1, X2 iid N(0, σ²).
X1, . . . , Xn iid N(μ, σ²). T = ((X1 − X̄)/S, . . . , (Xn − X̄)/S), where S is the sample standard deviation, is ancillary.
Results
Location family f(x − θ). Suppose T is a location-invariant statistic, i.e., T(x1 + b, . . . , xn + b) = T(x1, . . . , xn). Then T is ancillary. In particular, the sample sd S is ancillary (and so are other estimates of scale).
Location-scale family σ^(−1) f((x − μ)/σ). Suppose T is a location-scale-invariant statistic, i.e., T(ax1 + b, . . . , axn + b) = T(x1, . . . , xn). Then T is ancillary. If T1 and T2 are such that T1(ax1 + b, . . . , axn + b) = aT1(x1, . . . , xn) and T2(ax1 + b, . . . , axn + b) = aT2(x1, . . . , xn), then T1/T2 is ancillary.
Question: An ancillary statistic does not contain any information about θ. Then why do we study it? It indicates how good the given sample is.
Example: X1, . . . , Xn iid U(θ − 1, θ + 1). θ is estimated by the midrange (X(1) + X(n))/2. The range R = X(n) − X(1) is ancillary.
Question: Can addition or removal of ancillary information change the information content about θ? Intuitively, one may think that an ancillary statistic contains no information about θ, so it should not change the information content. But this interpretation is false.
U(θ, θ + 1).
A more dramatic example: (X, Y) ∼ BVN(0, 0, 1, 1, ρ).
Completeness
Let a parametric family {f(x, θ) : θ ∈ Θ} be given. Let T be a statistic, with induced family of distributions f_T(t, θ), θ ∈ Θ.
Definition
A statistic T is called complete (for the family {f(x, θ) : θ ∈ Θ}), or equivalently the induced family f_T(t, θ), θ ∈ Θ, is called complete, if E_θ(g(T)) = 0 for all θ implies g(T) = 0 a.s. P_θ for all θ.
In other words, no non-constant function of T can have constant expectation (in θ).
Completeness depends not only on the statistic but also on the family. For instance, no nontrivial statistic is complete if the family is a singleton.
In order to find optimal estimators and tests, one sometimes needs to find complete sufficient statistics.
Examples
X ∼ Bin(n, θ), 0 < θ < 1.
X ∼ Poi(θ), θ > 0.
X ∼ N(θ, 1), −∞ < θ < ∞.
Theorem
Let X1, . . . , Xn be iid observations from the above exponential family. Then T(X) = (Σ_{i=1}^n t_1(X_i), . . . , Σ_{i=1}^n t_k(X_i)) is complete if the parameter space contains an open set in R^k (i.e., d = k).
A non-exponential example: iid U(0, ), T = X(n).
Useful facts
If T is complete and S = ψ(T) is a function of T, then S is also complete.
The constant statistic is complete for any family.
A non-constant ancillary statistic cannot be complete.
A statistic is called first-order ancillary if its expectation is free of θ. If a non-constant function of a statistic T is first-order ancillary, then T cannot be complete.
Connection with minimal sufficiency
Theorem
If T is complete and sufficient, and a minimal sufficient statisticexists, then T is also minimal sufficient.
As a consequence, in the search for complete sufficient statistics, it is enough to check completeness of a minimal sufficient statistic (if one exists and is easily found). This implies that no complete sufficient statistic exists for the U(θ, θ + 1) family or the Cauchy(θ) family.
Basu's theorem
A complete sufficient statistic T carries all relevant information about θ; an ancillary statistic S carries no information about θ. The following remarkable result shows that they are statistically independent.
Theorem (Basu's theorem)
A complete sufficient statistic is independent of all ancillary statistics.
Completeness cannot be dropped, even if T is minimal sufficient: iid U(θ, θ + 1).
Applications
iid exponential. Then T = Σ_{i=1}^n X_i and (W1, . . . , Wn) are independent, where W_j = X_j/T. Also calculate E(W_j).
iid normal. T = X̄ and the sample standard deviation S are independent.
iid U(0, θ). Then X(n) and X(1)/X(n) are independent. Also calculate E(X(1)/X(n)).
iid Ga(α, β), α > 0 known. Let U = (∏_{i=1}^n X_i)^(1/n). Then U/X̄ is ancillary, independent of X̄. Also E[(U/X̄)^k] = E(U^k)/E(X̄^k).
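The uniform example can be checked by simulation: since X(1)/X(n) is independent of the complete sufficient X(n), E(X(1)) = E(X(1)/X(n)) E(X(n)), which gives E(X(1)/X(n)) = (θ/(n+1))/(nθ/(n+1)) = 1/n. The sample size, seed, and the value θ = 3 below are arbitrary choices:

```python
import random

random.seed(2)
n, reps, theta = 5, 40000, 3.0

ratios = []
for _ in range(reps):
    s = sorted(random.uniform(0, theta) for _ in range(n))
    ratios.append(s[0] / s[-1])   # X_(1)/X_(n)

# Basu's theorem predicts E(X_(1)/X_(n)) = 1/n, regardless of theta.
mean_ratio = sum(ratios) / reps
assert abs(mean_ratio - 1 / n) < 0.01
```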
Likelihood
X ∼ f(·, θ), a pmf or pdf. X = x is observed.
Definition
The likelihood function is a function of the parameter with an observed sample, and is given by L(θ|x) = f(x, θ). Same expression, but now x is fixed and θ is variable.
Examples
Binomial experiment: decide to stop after 10 trials; 3 successes obtained.
Negative binomial experiment: decide to stop after 3 successes; 10 trials were needed.
Likelihood can be viewed as the degree of plausibility. An estimate of θ may be obtained by choosing the most plausible value, i.e., the value where the likelihood function is maximized. This leads to one of the most important methods of estimation: the maximum likelihood estimator (more details in Chapter 7). For instance, in either example above, the likelihood function is maximized at 0.3.
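Both likelihoods from the two stopping rules are proportional to θ^3 (1 − θ)^7, so a grid search confirms the common maximizer 0.3:

```python
from math import comb

# Binomial likelihood: 3 successes in 10 fixed trials.
def binom_lik(t):
    return comb(10, 3) * t**3 * (1 - t)**7

# Negative binomial likelihood: the 3rd success arrives on the 10th trial.
def negbin_lik(t):
    return comb(9, 2) * t**3 * (1 - t)**7

grid = [k / 1000 for k in range(1, 1000)]
assert max(grid, key=binom_lik) == 0.3
assert max(grid, key=negbin_lik) == 0.3
```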
More examples
iid Poisson(θ)
iid N(μ, σ²)
iid U(0, θ)
Exponential family
Bayesian approach
Suppose that θ can be considered as a random quantity with some marginal distribution π(θ), a pre-experiment assessment called the prior distribution. Then we can legitimately calculate the posterior distribution of θ given the data by Bayes' theorem. This posterior distribution will be the source of any inference about θ.
Theorem (Bayes' theorem)
π(θ|X) = π(θ)f(X, θ) / ∫ π(t)f(X, t) dt.
Examples
iid Bin(1, θ), prior U(0, 1).
iid Poi(θ), prior standard exponential.
Difficulty: θ is fixed, nonrandom.
How to specify a prior?
Bayesians' response:
Probability is a quantification of uncertainty of any type.
The arbitrariness of prior choice can be rectified to someextent by the use of automatic priors which arenon-informative. (More later)
Point Estimation
Find estimators for the unknown parameter θ or its function τ(θ).
Evaluate your estimators (are they good?)
Definition
A point estimator of θ is a function θ̂ = W(X1, . . . , Xn). Given a sample of realized observations, the number W(x1, . . . , xn) is called a point estimate of θ.
Methods of point estimation
method of moments
maximum likelihood estimator (MLE)
Bayes estimators
Method of Moments
Let X1, . . . , Xn be a sample from a population with pdf or pmf f(x|θ1, . . . , θk). Estimate θ = (θ1, . . . , θk) by solving the k equations formed by matching the first k sample and population raw moments:
m1 = (1/n) Σ_{i=1}^n X_i,   μ1′ = E(X);
m2 = (1/n) Σ_{i=1}^n X_i^2,   μ2′ = E(X^2);
. . . , . . .
mk = (1/n) Σ_{i=1}^n X_i^k,   μk′ = E(X^k).
Examples
X1, . . . , Xn iid N(μ, σ²), both μ and σ² unknown.
X1, . . . , Xn iid Bin(1, p).
X1, . . . , Xn iid Ga(α, β), with (α, β) unknown.
X1, . . . , Xn iid Unif(θ1, θ2), where θ1 < θ2, both unknown.
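For the normal case, matching E(X) = μ and E(X²) = μ² + σ² gives μ̂ = m1 and σ̂² = m2 − m1² (the divide-by-n sample variance). A sketch on an arbitrary made-up sample:

```python
data = [4.1, 5.0, 3.7, 6.2, 5.5, 4.9]   # illustrative sample, not from the slides
n = len(data)

m1 = sum(data) / n                      # first sample raw moment
m2 = sum(x**2 for x in data) / n        # second sample raw moment

mu_hat = m1                 # matches E(X) = mu
sigma2_hat = m2 - m1**2     # matches E(X^2) = mu^2 + sigma^2

# sigma2_hat equals the (biased, divide-by-n) sample variance.
assert abs(sigma2_hat - sum((x - mu_hat)**2 for x in data) / n) < 1e-12
```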
Features
Easy to implement
Computationally cheap
Converges to the parameter with increasing probability (called consistency)
Does not necessarily give the asymptotically most efficient estimator
Often used as an initial estimator in iterative methods
Maximum Likelihood Estimator
Recall that the likelihood function is
L(θ|X) = L(θ|X1, . . . , Xn) = ∏_{i=1}^n f(X_i|θ).
Definition
The maximum likelihood estimator (MLE) of θ is the location at which L(θ|X) attains its maximum as a function of θ. Its numerical value is often called the maximum likelihood estimate.
How to find the MLE?
We want to find the global maximum of L(θ|X). If L(θ|X) is differentiable in (θ1, . . . , θk), we solve the likelihood equations
∂L(θ|X)/∂θ_j = 0, j = 1, . . . , k.
The solutions to these likelihood equations locate only extreme points in the interior of Θ, and provide possible candidates for the MLE. They can be local or global minima, local or global maxima, or inflection points. Our job is to find a global maximum.
(d²/dθ²) L(θ) evaluated at θ = θ̂ being negative is sufficient for a local maximum. We also need to check the boundary points separately.
If there is only one local maximum, then it must be the unique global maximum.
Many examples fall into this category, so no further work is needed then.
How to find the MLE? (contd.)
In practice, we often work with log L(θ|X), i.e., solve
∂ log L(θ|X)/∂θ_j = 0, j = 1, . . . , k.
We consider several different situations:
one-parameter case
non-differentiable L(θ|X)
restricted-range MLE (e.g., Θ is not the whole real line)
discrete parameter space
two-parameter case
Examples: One-parameter case
X1, . . . , Xn iid N(θ, 1), with θ unknown.
X1, . . . , Xn iid Poi(θ).
X1, . . . , Xn iid Exp(θ).
(numerical/iterative method): X1, . . . , Xn iid Weibull(θ).
(numerical/iterative method): X1, . . . , Xn iid Gamma(α, 1).
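In the Poisson case the likelihood equation gives θ̂ = X̄, which can be confirmed by checking that the log-likelihood is larger at X̄ than at nearby values (the data below are arbitrary):

```python
from math import log, factorial

data = [2, 3, 0, 4, 1, 2]   # illustrative counts, not from the slides
n = len(data)
xbar = sum(data) / n

def loglik(theta):
    """Log-likelihood of iid Poi(theta): sum(x) log(theta) - n theta - sum(log x!)."""
    return sum(data) * log(theta) - n * theta - sum(log(factorial(x)) for x in data)

# The log-likelihood is strictly concave with its maximum at xbar.
for t in (xbar - 0.1, xbar + 0.1, 0.5, 3.0):
    assert loglik(xbar) > loglik(t)
```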
Restricted MLE
The parameter space Θ is a proper subset of the set of all possible values of the parameter. Special attention is needed to make sure the maximizer lies in Θ.
X1, . . . , Xn iid N(θ, 1), θ ≥ 0. But what if θ > 0?
X1, . . . , Xn iid N(θ, σ²), a ≤ θ ≤ b.
Non-differentiable likelihood
X1, . . . , Xn iid Unif(0, θ], θ > 0.
X1, . . . , Xn iid the exponential location family with pdf f(x) = e^(−(x−θ)) if x ≥ θ.
X1, . . . , Xn iid Unif(θ − 1/2, θ + 1/2).
Discrete parameter space
Example
Let X be a single observation taking values in {0, 1, 2} according to P_θ, where θ = 0 or 1. The distribution of X is summarized:
        x = 0   x = 1   x = 2
θ = 0    0.8     0.1     0.1
θ = 1    0.2     0.3     0.5
Examples: Two-parameter case
For a differentiable likelihood, this needs calculus of several variables in general, but often simple tricks help reduce the problem to one dimension.
X1, . . . , Xn iid N(μ, σ²).
X1, . . . , Xn iid the location-scale exponential family, with pdf f(x; μ, σ) = (1/σ) e^(−(x−μ)/σ) if x ≥ μ.
Remarks about the MLE
The MLE is the value of θ for which the observed sample x is most likely; it possesses some optimal properties (discussed later).
In exponential families, it coincides with the method of moments estimator.
The MLE can be numerically sensitive to variation in the data if the likelihood function is discontinuous.
If T is sufficient for θ, then the MLE must be a function of T.
The MLE is the value of θ that maximizes g(T(X), θ), where g(t, θ) is the pdf or pmf of T = T(X) at t.
Induced likelihood
If η = τ(θ) is a parametric function, then the likelihood for η is defined by
L*(η|X) = sup_{θ: τ(θ) = η} L(θ|X).
Theorem (Invariance Principle)
If θ̂ is the MLE of θ, then for any function τ(θ), the MLE of τ(θ) is τ(θ̂).
Examples
X1, . . . , Xn iid Bin(1, θ). Find the MLE of θ(1 − θ).
X1, . . . , Xn iid Poi(θ). Find the MLE of P(X ≤ 1).
X1, . . . , Xn iid N(μ, σ²).
Find the MLE of μ/σ.
Find the MLE of the population median.
Find the MLE of c = c(μ, σ) such that P_{μ,σ}(X > c) = 0.025 (the 97.5th percentile of the distribution of X).
EM-algorithm
Useful numerical algorithm to compute the MLE with
missing data.
Iterative method repeating the E-step (Expectation) and M-step (Maximization).
Given data Y, missing vital X. Augmented data (X, Y).
Actual likelihood L(θ|Y) = E[L(θ|X, Y)|Y].
Start with an initial estimator θ̂0.
Calculate E_{θ=θ̂0}(log L(θ|X, Y)|Y). Maximize with respect to θ to get the update θ̂1.
Repeat the procedure, replacing the old estimate by the new, until convergence.
Example
Multinomial((θ + 1)/2, θ/4, θ/4, 1/2 − θ).
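A sketch of EM for this multinomial, splitting the first cell (θ + 1)/2 = 1/2 + θ/2 into a known 1/2 part and a latent θ/2 part. The cell counts, the E-step formula b = y1 θ/(1 + θ), and the M-step formula θ = s/(2(s + y4)) are my own illustrative derivations under the parametrization printed above, not from the slides:

```python
from math import log

# Hypothetical cell counts for Multinomial((1+t)/2, t/4, t/4, 1/2 - t), 0 < t < 1/2.
y1, y2, y3, y4 = 60, 10, 12, 18

def obs_loglik(t):
    """Observed-data log-likelihood (up to the multinomial coefficient)."""
    return y1*log((1 + t)/2) + (y2 + y3)*log(t/4) + y4*log(0.5 - t)

t = 0.25                        # initial estimate
prev = obs_loglik(t)
for _ in range(200):
    # E-step: expected count from the t/2 part of the first cell.
    b = y1 * t / (1 + t)
    # M-step: complete-data MLE, from (b + y2 + y3)/t = y4/(1/2 - t).
    s = b + y2 + y3
    t = s / (2 * (s + y4))
    cur = obs_loglik(t)
    assert cur >= prev - 1e-9   # EM never decreases the observed-data log-likelihood
    prev = cur

assert 0 < t < 0.5
```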
Bayes Estimators
Recall that in the Bayesian approach θ is considered as a quantity whose variation can be described by a probability distribution (called the prior distribution). A sample is then taken from a population indexed by θ, and the prior distribution is updated with this sample information. The updated prior is called the posterior distribution.
Prior distribution of θ: π(θ)
Posterior distribution of θ: π(θ|X) = f(X|θ)π(θ)/m(X)
Marginal distribution of X: m(X) = ∫ f(X|θ)π(θ) dθ
The mean of the posterior distribution, E(θ|X), can be used as the Bayes estimator of θ.
Examples
X1, . . . , Xn iid Bin(1, θ). Assume the prior distribution on θ is Beta(α, β). Find the posterior distribution of θ and the Bayes estimator of θ.
Special case: π(θ) ∼ Unif(0, 1).
X1, . . . , Xn iid N(θ, 1), θ ∈ [0, 1], prior U[0, 1].
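For the first example the posterior works out to Beta(α + s, β + n − s), where s is the number of successes, with Bayes estimator (α + s)/(α + β + n). A quick check with arbitrary prior parameters and data:

```python
# Beta(alpha, beta) prior with iid Bernoulli(theta) data: the posterior is
# Beta(alpha + s, beta + n - s), where s = number of successes.
alpha, beta = 2.0, 3.0                     # illustrative prior parameters
data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]     # illustrative sample
n, s = len(data), sum(data)

post_a, post_b = alpha + s, beta + n - s
bayes_est = post_a / (post_a + post_b)     # posterior mean E(theta | X)

assert (post_a, post_b) == (8.0, 7.0)
assert abs(bayes_est - 8/15) < 1e-12
# The posterior mean lies between the prior mean and the sample proportion.
assert min(alpha/(alpha+beta), s/n) < bayes_est < max(alpha/(alpha+beta), s/n)
```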
Conjugate family
Let F denote the class of pdfs or pmfs f(x|θ). A class Π of prior distributions is a conjugate family for F if the posterior distribution is in the class Π for all f ∈ F, all priors in Π, and all observation values x.
Examples:
The beta family is conjugate for the binomial family.
The normal family is conjugate for the normal family.
Subhashis Ghoshal, North Carolina State University ST 522: Statistical Theory II
Methods of Evaluating Estimators
Various criteria to evaluate and compare different point estimators:
mean squared error
best unbiased estimators or UMVUE (Uniform Minimum Variance Unbiased Estimator)
optimality for a general loss function and risk
Unbiasedness and Mean Squared Error
The bias of a point estimator W of θ is Bias_θ(W) = E_θ W − θ.
An estimator whose bias is equal to 0 is called unbiased.
An unbiased estimator satisfies E_θ W = θ for all θ.
The mean squared error (MSE) of an estimator W of θ is defined by E_θ(W − θ)².
The MSE is a function of θ, and has the representation
E_θ(W − θ)² = Var_θ W + (Bias_θ W)².
The MSE incorporates two components, one measuring the variability of the estimator (precision) and the other measuring its bias (accuracy). A small MSE implies small combined variance and bias. Unbiased estimators do a good job of controlling bias. A smaller MSE indicates a smaller probability of W being far from θ, because
P_θ(|W − θ| > ε) ≤ ε⁻² E_θ(W − θ)² = ε⁻² MSE(W)
by Chebyshev's inequality.
In general, there will not be one best estimator. Often the MSE functions of two estimators cross each other, showing that each estimator is better in only a portion of the parameter space.
Example
Let X1, X2 be iid from Bin(1, p) with 0 < p < 1. Compare three
estimators with respect to their MSE.
p̂1 = X1
p̂2 = (X1 + X2)/2
p̂3 = 0.5.
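The three MSE functions can be computed exactly (Python sketch):

```python
def mse_three_estimators(p):
    """Exact MSE of p1 = X1, p2 = (X1+X2)/2, p3 = 0.5 for X1, X2 iid Bin(1, p)."""
    mse1 = p * (1 - p)        # unbiased, so MSE = Var(X1)
    mse2 = p * (1 - p) / 2    # unbiased, Var of the mean of two observations
    mse3 = (0.5 - p) ** 2     # constant estimator: zero variance, pure squared bias
    return mse1, mse2, mse3
```

Near p = 0.5 the constant estimator p̂3 wins, while for p far from 0.5 it loses: no estimator dominates everywhere.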
Illustration
Let X1, . . . , Xn be iid N(μ, σ²). Show X̄ is unbiased for μ and S² is unbiased for σ², and compute their MSEs. What about non-normal distributions with mean μ and variance σ²?
Let X1, . . . , Xn be iid N(μ, σ²). Show the estimator σ̂² = (1/n) Σ_{i=1}^n (Xi − X̄)² is biased for σ², but it has a smaller MSE than S². More generally, find the MSE of cS².
Uniformly Minimum Variance Unbiased Estimator
If the estimator W is unbiased for τ(θ), then its MSE is equal to
Var_θ(W). Therefore, choosing a better unbiased estimator is equivalent to choosing the one with smaller variance.
Definition
An estimator W* is a best unbiased estimator of τ(θ) if it satisfies:
E_θ W* = τ(θ) for all θ;
for any other estimator W with E_θ W = τ(θ), we have Var_θ W* ≤ Var_θ W for all θ.
W* is also called a uniform minimum variance unbiased estimator (UMVUE).
Example
X1, . . . , Xn iid Poi(λ). Both X̄ and S² are unbiased for λ.
How to find a best unbiased estimator?
If B(θ) is a lower bound on the variance of any unbiased estimator of τ(θ), and if W* is an unbiased estimator satisfying Var_θ W* = B(θ), then W* is a UMVUE.
Cramér-Rao Inequality
Theorem
Let X be a sample with pdf f(x, θ). Suppose W(X) is an estimator satisfying
E_θ W(X) = τ(θ) for all θ;
Var_θ W(X) < ∞.
If differentiation under the integral sign can be carried out, then
Var_θ(W(X)) ≥ [τ′(θ)]² / E_θ((∂/∂θ) log f(X|θ))².
In the i.i.d. case, the bound reduces to [τ′(θ)]²/(n I(θ)), where
I(θ) = E_θ((∂/∂θ) log f(X|θ))²
is called the Fisher information (per observation).
Score function: s(X, θ) = (∂/∂θ) log f(X|θ) = (1/f(X|θ)) (∂/∂θ) f(X|θ).
Lemma (Expressions for I(θ))
If differentiation and integration are interchangeable,
I(θ) = E_θ(s(X, θ))² = Var_θ(s(X, θ))
= −E_θ[(∂²/∂θ²) log f(X, θ)]
= ∫ ((∂/∂θ) log f(x, θ))² f(x, θ) dx
= ∫ ((∂/∂θ) f(x, θ))² / f(x, θ) dx
= −∫ ((∂²/∂θ²) log f(x, θ)) f(x, θ) dx.
Examples
X1, . . . , Xn iid Poi(λ). Find the Fisher information number and a UMVUE for λ.
X1, . . . , Xn iid N(μ, σ²), μ unknown but σ² known. Find a UMVUE for μ using the Cramér-Rao bound.
When can we exchange differentiation and integration?
Yes for the exponential family.
Not always true for non-exponential families. We have to check whether (d/dθ) ∫ h(x)f(x, θ) dx and ∫ h(x)(∂/∂θ)[f(x, θ)] dx match.
Example
X1, . . . , Xn iid from Unif(0, θ).
The Cramér-Rao bound does not work here!
Attainability of the Cramér-Rao bound
The Cramér-Rao inequality says that if W achieves the variance bound then it is a UMVUE. In the one-parameter
exponential family case, we can find such an estimator. But there is no guarantee that this lower bound is sharp (attainable) in other situations. It is possible that the value of the Cramér-Rao bound is strictly smaller than the variance of any unbiased estimator.
Corollary
Let X1, . . . , Xn be iid with pdf f(x, θ), where f(x, θ) satisfies the assumptions of the Cramér-Rao bound theorem. Let L(θ|x) = ∏_{i=1}^n f(xi, θ) denote the likelihood function. If W(X) is unbiased for τ(θ), then W(X) attains the Cramér-Rao Lower Bound if and only if
a(θ)[W(X) − τ(θ)] = s(X, θ)
for some function a(θ).
Attainability in one-parameter exponential family
Theorem
Let X1, . . . , Xn be iid from a one-parameter exponential family with the pdf f(x, θ) = c(θ)h(x) exp{w(θ)T(x)}. Assume E_θ[T(X)] = τ(θ). Then n⁻¹ Σ_{i=1}^n T(Xi), as an unbiased estimator of τ(θ), attains the Cramér-Rao Lower Bound, i.e.
Var_θ(n⁻¹ Σ_{i=1}^n T(Xi)) = [τ′(θ)]² / (n I(θ)).
Examples
X1, . . . , Xn iid from Bin(1, θ). Find a UMVUE of θ and show it attains the Lower Bound.
X1, . . . , Xn iid N(μ, σ²), with (μ, σ²) both unknown. Consider estimation of σ². What is the Cramér-Rao Lower Bound and is it attainable?
Constructing UMVUE using Rao-Blackwell Method
An important method of finding/constructing UMVUEs with the help of conditioning on a complete and sufficient statistic.
Review of conditional expectation:
E(X) = E[E(X|Y)], for any X, Y.
Var(X) = Var[E(X|Y)] + E[Var(X|Y)], for any X, Y.
E(g(X)|Y) = ∫ g(x) f_{X|Y}(x|y) dx, and it is a function of Y.
Cov(E(X|Y), Y) = Cov(X, Y).
Rao-Blackwell Theorem
Theorem
Let W be unbiased for τ(θ) and T be a sufficient statistic for θ. Define φ(T) = E(W|T). Then the following hold:
E_θ φ(T) = τ(θ);
Var_θ φ(T) ≤ Var_θ W for all θ.
Thus, E(W|T) is a uniformly better unbiased estimator of τ(θ) than W.
Conditioning any unbiased estimator on a sufficient statistic will result in a uniform improvement, so we need only consider statistics that are functions of a sufficient statistic when searching for best unbiased estimators.
Examples
Let X1, X2 be iid N(θ, 1). Show X1 is unbiased for θ and E(X1|X̄) is uniformly better.
Let X1, . . . , Xn be iid Unif(0, θ). Show Y = (n + 1)X(1) is unbiased for θ and E(Y|X(n)) is uniformly better.
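A Monte Carlo sketch of the first example (Python; here E(X1|X̄) = X̄, with Var(X1) = 1 versus Var(X̄) = 1/2):

```python
import random
import statistics

def rao_blackwell_demo(theta=2.0, reps=20000, seed=42):
    """Simulate X1, X2 iid N(theta, 1) and compare the variance of the
    raw unbiased estimator X1 with its Rao-Blackwellization Xbar."""
    rng = random.Random(seed)
    x1s, xbars = [], []
    for _ in range(reps):
        x1 = rng.gauss(theta, 1)
        x2 = rng.gauss(theta, 1)
        x1s.append(x1)
        xbars.append((x1 + x2) / 2)
    return statistics.variance(x1s), statistics.variance(xbars)

v1, v2 = rao_blackwell_demo()
```

Both estimators are unbiased; conditioning on the sufficient statistic halves the variance.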
Uniqueness of UMVUE
Theorem
If W is a UMVUE of τ(θ), then W is unique.
UMVUE and unbiased estimators of zero
Theorem
If E_θ W = τ(θ), then W is the best unbiased estimator of τ(θ) if and only if W is uncorrelated with all unbiased estimators of 0.
Example
Let X be an observation from Unif(θ, θ + 1).
Show that X − 1/2 is unbiased for θ.
Show that h(X) = sin(2πX) is an unbiased estimator of zero.
Show X − 1/2 and h(X) are correlated. So X − 1/2 is not best.
Lehmann-Scheffé theorem
Theorem
Let T be a complete sufficient statistic for a parameter θ, and let φ(T) be any estimator based on T. Then φ(T) is the unique best unbiased estimator of its expected value.
Thus:
Find a complete sufficient statistic T for a parameter θ.
Find an unbiased estimator h(X) of τ(θ).
Then φ(T) = E(h(X)|T) is the best unbiased estimator of τ(θ).
Examples
Let X1, . . . , Xn be iid Bin(k, θ).
X1, . . . , Xn are iid from Unif(0, θ).
Find the UMVUE of θ.
Find the UMVUE of g(θ), where g is differentiable on (0, ∞).
Suppose X1, . . . , Xn are iid from Poi(λ).
Find the UMVUE of λ.
Find the UMVUE of g(λ) = λʳ, r ≥ 1 an integer.
Find the UMVUE of g(λ) = e^{−λ}.
More Examples
Suppose that the random variables Y1, . . . , Yn satisfy
Yi = θxi + εi, i = 1, . . . , n,
where x1, . . . , xn are fixed constants, and ε1, . . . , εn are iid N(0, σ²) with σ² known. Find the MLE of θ and show it is UMVUE.
Suppose X1, . . . , Xn are iid from Exp(λ), λ > 0.
Find the UMVUE for λ.
Find the UMVUE for τ(λ) = 1 − F(s) = P(X1 > s).
Find the UMVUE for e^{−1/λ}.
More Examples (contd.)
Suppose X1, . . . , Xn are iid from N(μ, σ²), both (μ, σ²) unknown.
Find the UMVUE for μ.
Find the UMVUE for σ².
Find the UMVUE for μ².
Normal probability. X1, . . . , Xn iid N(μ, 1). τ(μ) = P(X1 ≤ c) = Φ(c − μ).
Ridiculous UMVUE. X1, . . . , Xn iid Poi(λ). τ(λ) = e^{−3λ}.
Loss Function Optimality
Observations X1, . . . , Xn are iid with pdf f(x, θ), θ ∈ Θ. To evaluate an estimator δ(X), various loss functions can be used.
The loss function L(θ, δ) measures the closeness of δ and θ.
squared error loss: L(θ, δ) = (δ − θ)²
absolute error loss: L(θ, δ) = |δ − θ|
a loss that penalizes overestimation more than underestimation is
L(θ, δ) = (δ − θ)² I(δ < θ) + 10(δ − θ)² I(δ ≥ θ)
a loss that penalizes errors more when θ is near 0 than when |θ| is large:
L(θ, δ) = (δ − θ)² / (|θ| + 1)
Loss Function Optimality (contd.)
To compare estimators, we use the expected loss, called the risk function,
R(θ, δ) = E_θ L(θ, δ(X)).
If R(θ, δ1) < R(θ, δ2) for all θ, then δ1 is the preferred estimator because it performs better for all θ. In particular, for the squared error loss, the risk function is the MSE.
Example
X1, . . . , Xn iid from Bin(1, θ). Compare two estimators in terms of their MSE.
MLE: θ̂1 = X̄
Bayes estimator: prior π(θ) ∼ Beta(α, β) with α = β = √(n/4), giving
θ̂B = (Σ_{i=1}^n Xi + √(n/4)) / (n + √n).
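A sketch verifying that this Bayes estimator has constant risk, unlike the MLE (Python; n = 16 is chosen arbitrarily for illustration):

```python
import math

def risks(n, p):
    """Exact MSE of the MLE Xbar and of the Beta(sqrt(n)/2, sqrt(n)/2)-prior
    Bayes estimator (S + sqrt(n)/2) / (n + sqrt(n)), S ~ Bin(n, p)."""
    mse_mle = p * (1 - p) / n
    c = math.sqrt(n) / 2
    var = n * p * (1 - p) / (n + 2 * c) ** 2
    bias = c * (1 - 2 * p) / (n + 2 * c)
    # var + bias^2 simplifies to n / (4 (n + sqrt(n))^2), free of p
    return mse_mle, var + bias ** 2
```

The constant-risk property is what makes this Bayes estimator minimax; the MLE still wins for p near 0 or 1.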
Minimaxity
Risk functions are generally overlapping: one estimator cannot beat everyone else.
Example
X1, . . . , Xn iid N(μ, σ²). Consider the estimators of the form δb(X) = bS².
Minimaxity: compare the worst-case scenarios, i.e., compare the maximum risks. Find the estimator which has the smallest maximum risk: the minimax estimator.
Downsides:
Problems with unbounded risk: the maximum is infinity.
Not easy to find the minimax estimator.
Too pessimistic.
Bayes Rule
The Bayes risk is the average risk with respect to the prior π,
∫ R(θ, δ)π(θ) dθ.
By definition, the Bayes risk can be written as
∫ R(θ, δ)π(θ) dθ = ∫ [∫ L(θ, δ(x)) f(x|θ) dx] π(θ) dθ.
Note f(x|θ)π(θ) = π(θ|x)m(x), where π(θ|x) is the posterior distribution of θ and m(x) is the marginal distribution of X; then the Bayes risk becomes
∫ R(θ, δ)π(θ) dθ = ∫ [∫ L(θ, δ(x)) π(θ|x) dθ] m(x) dx.
The quantity ∫ L(θ, δ(x)) π(θ|x) dθ is called the posterior expected loss. To minimize the Bayes risk, we only need to find δ minimizing the posterior expected loss for each x.
Bayes Rule (contd.)
The Bayes rule with respect to a prior π is an estimator that yields the smallest value of the Bayes risk.
For squared error loss, the posterior expected loss is
∫ (θ − a)² π(θ|x) dθ = E((θ − a)² | x),
therefore the Bayes rule is E(θ|x).
For absolute error loss, the posterior expected loss is E(|θ − a| | x). The Bayes rule is the median of π(θ|x).
Examples
X1, . . . , Xn are iid from N(θ, σ²) and let π(θ) be N(μ, τ²). The values σ², μ, τ² are known.
X1, . . . , Xn are iid from Bin(1, θ) and let π(θ) be Beta(α, β).
Hypothesis Testing
Point estimation: provide a single estimate of θ.
Hypothesis testing: test a statement about θ.
A hypothesis is a statement about a population parameter.
Two complementary hypotheses in a hypothesis testing problem are called the null hypothesis and the alternative hypothesis. Let Θ0 be a subset of the parameter space, called the null region. The hypotheses are denoted by H0 and H1,
H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ0ᶜ.
Illustration
Example
An ideal manufacturing process requires that all products are
non-defective. This is very seldom the case. The goal is to keep the proportion of defective items as low as possible. Let θ be the proportion of defective items, and 0.01 be the maximum acceptable proportion of defective items.
Statement 1: θ ≥ 0.01 (the proportion of defectives is unacceptably high)
Statement 2: θ < 0.01 (acceptable quality)
Example
Let θ be the average change in a patient's blood pressure after taking a drug. An experimenter might be interested in testing
H0 : θ = 0 (the drug has no effect on blood pressure)
H1 : θ ≠ 0 (there is some effect)
Different Types of Hypotheses
Simple hypotheses: both H0 and H1 consist of only one probability distribution.
Composite hypotheses: either H0 or H1 contains more than one possible distribution.
One-sided hypotheses: H1 : θ > θ0 or H1 : θ < θ0.
Two-sided hypotheses: H0 : θ = θ0 vs H1 : θ ≠ θ0.
Rejection region
A hypothesis testing procedure or hypothesis test is a rule that specifies:
for which sample values the decision is made to accept H0 as
true;
for which sample values H0 is rejected and H1 is accepted as true.
The subset of the sample space for which H0 will be rejected is R: the rejection region or critical region.
The complement of the rejection region is Rᶜ: the acceptance region.
The rejection region R of a hypothesis test is usually defined by a test statistic W(X), a function of the sample:
R = {X : W(X) > c} ⟹ reject H0.
Rᶜ = {X : W(X) ≤ c} ⟹ accept H0.
Methods of Evaluating Tests
In deciding to accept or reject the null hypothesis H0, we might make a mistake no matter what the decision is. There are two
types of errors:
Type I error: H0 is actually true, i.e. θ ∈ Θ0, but the test incorrectly decides to reject H0.
Type II error: H0 is actually false, i.e. θ ∈ Θ0ᶜ, but the test incorrectly decides to accept H0.

                 Decision
                 Accept H0          Reject H0
Truth  H0        Correct decision   Type I error
       H1        Type II error      Correct decision
Power Function
Definition
The power function of a hypothesis test with rejection region R is the function of θ defined by
β(θ) = P_θ(X ∈ R)
= probability of Type I error if θ ∈ Θ0;
= 1 − probability of Type II error if θ ∈ Θ0ᶜ.
Note P(Type I error) = β(θ) for θ ∈ Θ0, and P(Type II error) = 1 − β(θ) for θ ∈ Θ0ᶜ.
Ideal test: β(θ) = 0 for all θ ∈ Θ0; β(θ) = 1 for all θ ∈ Θ0ᶜ.
Good test:
β(θ) is near 0 (small) for most θ ∈ Θ0; β(θ) is near 1 (large) for most θ ∈ Θ0ᶜ.
Example (Binomial power function)
X ∼ Bin(5, θ).
H0 : θ ≤ 1/2 versus H1 : θ > 1/2.
Test 1: reject H0 if and only if all successes are observed, i.e. R = {5}.
Test 2: reject H0 if X = 3, 4, or 5.
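The two power functions can be tabulated directly (Python sketch):

```python
from math import comb

def power(theta, reject, n=5):
    """Power beta(theta) = P_theta(X in reject) for X ~ Bin(n, theta)."""
    return sum(comb(n, x) * theta ** x * (1 - theta) ** (n - x)
               for x in reject)

beta1 = lambda t: power(t, {5})          # Test 1: reject iff X = 5
beta2 = lambda t: power(t, {3, 4, 5})    # Test 2: reject iff X >= 3
```

Test 1 has a tiny Type I error (β1(1/2) = 1/32) but low power; Test 2 is more powerful at every θ at the price of a larger size.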
Likelihood Ratio Tests (LRT)
Definition
The likelihood ratio test statistic for testing H0 : θ ∈ Θ0 vs H1 : θ ∈ Θ0ᶜ is
λ(x) = sup_{Θ0} L(θ|x) / sup_Θ L(θ|x).
A likelihood ratio test (LRT) has a rejection region
R = {x : λ(x) ≤ c},
where c is any number satisfying 0 ≤ c ≤ 1. This should be reduced to the simplest possible form.
Rationale of LRT
The numerator of λ(x) is the maximum probability of the
observed sample, computed over parameters in H0. The denominator of λ(x) is the maximum probability of the observed sample over all possible parameters.
The numerator says which θ ∈ Θ0 makes the observed data most likely; the denominator says which θ ∈ Θ makes the observed data most likely.
The ratio of these two maxima is small if there are parameter points in H1 for which the observed sample is much more likely than for any parameter in H0. In this situation, the LRT criterion says H0 should be rejected and H1 accepted as true.
Relation between LRT and MLE
Let θ̂0 be the MLE of θ in the null set Θ0 (restricted maximization).
Let θ̂ be the MLE of θ in the full set Θ (unrestricted maximization). Then the LRT statistic, a function of x (not θ), is
λ(x) = sup_{Θ0} L(θ|x) / sup_Θ L(θ|x) = L(θ̂0|x) / L(θ̂|x).
In R = {x : λ(x) ≤ c}, different c give different rejection regions and hence different tests.
Examples
X1, . . . , Xn iid N(μ, σ²) with μ unknown (σ² known). Consider testing
H0 : μ = μ0 versus H1 : μ ≠ μ0, where μ0 is a number fixed by the experimenter prior to the experiment.
Find the LRT and its power function.
Comment on the decision rules given by different c's.
Let X1, . . . , Xn be a random sample from a location-exponential family
f(x, θ) = e^{−(x−θ)} if x ≥ θ,
where −∞ < θ < ∞. Consider testing H0 : θ ≤ θ0 versus H1 : θ > θ0. Find the LRT.
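For the normal-mean example, λ(x) = exp{−n(x̄ − μ0)²/(2σ²)}, so λ(x) ≤ c reduces to |x̄ − μ0| ≥ σ√(−2 log c)/√n; a quick check of the equivalence (Python, data hypothetical):

```python
import math

def lrt_lambda(xs, mu0, sigma):
    """LRT statistic for H0: mu = mu0 vs H1: mu != mu0 with sigma known:
    lambda(x) = exp(-n (xbar - mu0)^2 / (2 sigma^2))."""
    n = len(xs)
    xbar = sum(xs) / n
    return math.exp(-n * (xbar - mu0) ** 2 / (2 * sigma ** 2))

def reject_equivalent(xs, mu0, sigma, c):
    """lambda(x) <= c  iff  |xbar - mu0| >= sigma * sqrt(-2 log c / n)."""
    n = len(xs)
    xbar = sum(xs) / n
    return abs(xbar - mu0) >= sigma * math.sqrt(-2 * math.log(c) / n)
```

This is the "reduce to the simplest possible form" step: a cutoff on λ becomes a cutoff on |x̄ − μ0|.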
LRT and sufficiency
Theorem
If T(X) is a sufficient statistic for θ, λ*(t) is the LRT statistic based on T, and λ(x) is the LRT statistic based on x, then
λ*(T(x)) = λ(x)
for every x in the sample space.
Thus the simplified expression for λ(x) should depend on x only through T(x) if T(X) is a sufficient statistic for θ.
Examples
X1, . . . , Xn iid N(μ, σ²) with σ² known. Test
H0 : μ = μ0 versus H1 : μ ≠ μ0.
Let X1, . . . , Xn be a random sample from a location-exponential family. Test H0 : θ ≤ θ0 versus H1 : θ > θ0.
Nuisance parameter case
Likelihood ratio tests are also useful when there are nuisance
parameters, which are present in the model but not of direct interest.
Example
X1, . . . , Xn iid N(μ, σ²), both μ and σ² unknown. Test H0 : μ ≤ μ0 versus H1 : μ > μ0.
Specify Θ and Θ0.
Find the LRT and the power function.
Bayesian Tests
Using the posterior density π(θ|x), compute
P(θ ∈ Θ0 | x) = P(H0 is true | x),
P(θ ∈ Θ0ᶜ | x) = P(H1 is true | x).
Decide in favor of the hypothesis which has the greater posterior probability: accept H0 if P(θ ∈ Θ0 | x) ≥ 1/2.
This does not work if Θ0 is a point and θ is given a prior density. One will need to put a prior mass at the point.
Example
Let X1, . . . , Xn be iid N(θ, σ²) and the prior distribution on θ be N(μ, τ²), where σ², μ, τ² are known. Test H0 : θ ≤ θ0 against H1 : θ > θ0.
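A sketch of the resulting test (Python; data hypothetical): the normal-normal posterior is itself normal, so P(θ ≤ θ0 | x) is a normal cdf evaluation, and we accept H0 when it is at least 1/2.

```python
import math

def phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def posterior_prob_H0(xs, sigma2, mu, tau2, theta0):
    """P(theta <= theta0 | x) under a N(mu, tau2) prior and N(theta, sigma2)
    data: the posterior is normal with precision n/sigma2 + 1/tau2."""
    n = len(xs)
    xbar = sum(xs) / n
    v = 1 / (n / sigma2 + 1 / tau2)              # posterior variance
    m = v * (n * xbar / sigma2 + mu / tau2)      # posterior mean
    return phi((theta0 - m) / math.sqrt(v))

# Hypothetical data: accept H0 iff the returned probability is >= 1/2
p0 = posterior_prob_H0([0.2, 0.8, 0.4, 0.6], sigma2=1.0, mu=0.0, tau2=1.0,
                       theta0=0.0)
```

Here the posterior mean shrinks x̄ toward the prior mean μ before the comparison with θ0.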
Unbiased Test
Definition
A test with power function β(θ) is unbiased if
β(θ′) ≥ β(θ″) for every θ′ ∈ Θ0ᶜ and θ″ ∈ Θ0.
In most problems, there are many unbiased tests.
Recall β(θ) = P_θ(reject H0). An unbiased test is one for which the probability of rejecting H0 when H0 is true is smaller than the probability of rejecting H0 when H0 is false.
Examples
X ∼ Bin(5, θ). Consider testing
H0 : θ ≤ 1/2 versus H1 : θ > 1/2
and reject H0 if X = 5.
X1, . . . , Xn ∼ N(μ, σ²), with σ² known. Consider testing
H0 : μ ≤ μ0 versus H1 : μ > μ0.
The LRT is unbiased.
Draw the graph of the power function.
Controlling Type I error
For a fixed sample size, it is usually impossible to make both types of error arbitrarily small. Common approach:
Control the Type I error probability at a specified level α.
Within this class of tests, make the Type II error probability as small as possible; equivalently, maximize the power.
Size α and level α tests
Definition
For 0 ≤ α ≤ 1, a test with power function β(θ) is a size α test if
sup_{θ ∈ Θ0} β(θ) = α.
Definition
For 0 ≤ α ≤ 1, a test with power function β(θ) is a level α test if
sup_{θ ∈ Θ0} β(θ) ≤ α.
If these relations hold only in the limit as n → ∞, we call the tests respectively asymptotically size (level) α. [More details in the final chapter]
Notations and remarks
Typical choices of α are: 0.01, 0.05, 0.10.
We use z_{α/2} to denote the point having probability α/2 to the right of it for the standard normal pdf. By convention, we have
P(Z > z_α) = α, where Z ∼ N(0, 1)
P(T_{n−1} > t_{n−1,α/2}) = α/2, where T_{n−1} ∼ t_{n−1}
P(χ²_p > χ²_{p,1−α}) = 1 − α, chi-square with d.f. p
Note −z_α = z_{1−α}.
Commonly used cutoffs: z_{0.05} = 1.645, z_{0.025} = 1.96, z_{0.01} = 2.33, z_{0.005} = 2.58.
How to specify H0 and H1?
If an experimenter expects an experiment to indicate a phenomenon, he or she should choose H1 to be the theory being proposed.
H1 is sometimes called the researcher's hypothesis. By using a level α test with small α, the experimenter is guarding against saying the data support the research hypothesis when it is false.
Announcing a new phenomenon when in fact nothing has happened is usually more serious than missing something new that has in fact occurred.
Similarly, in the judicial system evidence is collected to decide whether the accused is innocent or guilty. To prevent the possibility of penalizing an innocent person incorrectly, the test should be set up as H0 : innocent versus H1 : guilty.
How to choose the critical value of the LRT
In order to make an LRT a size α test, we choose c such that
sup_{θ ∈ Θ0} P_θ(λ(X) ≤ c) = α.
iid N(μ, σ²), σ² known. H0 : μ ≤ μ0 vs H1 : μ > μ0.
iid N(μ, σ²), σ² known. Consider testing H0 : μ = μ0 vs H1 : μ ≠ μ0.
Let X1, . . . , Xn be iid from N(μ, σ²), σ² unknown. Consider testing H0 : μ = μ0 versus H1 : μ ≠ μ0. Show that the LRT that rejects H0 if |X̄ − μ0| > t_{n−1,α/2} S/√n is a test of size α.
iid location-exponential distribution. Consider testing H0 : θ ≤ θ0 vs H1 : θ > θ0. Find the size α LRT.
Sample size calculation
For a fixed sample size, it is usually impossible to make both types of error probabilities arbitrarily small. But if we can choose the
sample size, it is possible to achieve the desired power level.
Example
iid N(μ, σ²), σ² known. Test H0 : μ ≤ μ0 vs H1 : μ > μ0. The LRT rejects H0 if (X̄ − μ0)/(σ/√n) > C, and has the power function
β(μ) = 1 − Φ(C + (μ0 − μ)/(σ/√n)).
Note β(μ) is increasing in μ.
Notes
The maximum Type I error probability is
sup_{μ ≤ μ0} β(μ) = β(μ0) = 1 − Φ(C).
For the size α test, C = z_α.
After C is chosen, it is possible to increase β(μ) for μ > μ0 by increasing the sample size n. Thus we can minimize the Type II error (remember: the Type I error is under control already). Draw the picture of the power function for small n and large n.
Assume C = z_α. How do we choose n such that the maximum Type II error is 0.2 if μ ≥ μ0 + σ?
Compute n if α = 0.05 in (3).
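A sketch of this computation (Python, Φ via the error function): with C = z_α, the power at μ = μ0 + σ is 1 − Φ(z_α − √n), so we take the smallest n making the Type II error at most 0.2.

```python
import math

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def sample_size(alpha=0.05, beta_max=0.2):
    """Smallest n with Type II error <= beta_max at mu = mu0 + sigma for the
    size-alpha one-sided z test: power = 1 - Phi(z_alpha - sqrt(n))."""
    # invert Phi by bisection to get z_alpha (P(Z > z_alpha) = alpha)
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if phi(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    z_alpha = (lo + hi) / 2
    n = 1
    while 1 - phi(z_alpha - math.sqrt(n)) < 1 - beta_max:
        n += 1
    return n
```

With α = 0.05 the condition √n ≥ z_{0.05} + z_{0.2} ≈ 1.645 + 0.84 gives n = 7.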
Example
Let X ∼ Bin(n, θ). Testing:
H0 : θ ≥ 3/4 vs H1 : θ < 3/4.
The LRT for this problem is to reject H0 if X ≤ c.
Choose c and n such that the following hold simultaneously:
If θ = 3/4, we have Pr(reject H0 | θ) = 0.01 (control of Type I error);
If θ = 1/2, we have Pr(reject H0 | θ) = 0.99 (control of Type II error).
Most Powerful Tests
Given that the maximum probability of Type I error is less than or equal to α, the most powerful level α test minimizes the probability of Type II error, or, equivalently, maximizes the power function at a θ′ ∈ Θ0ᶜ.
If this occurs for all θ′ ∈ Θ0ᶜ, such a test is called the uniformly most powerful (UMP) level α test.
Test function
Given a rejection region R, define a test function φ on the sample space to be
φ(x) = 1 if x ∈ R, and φ(x) = 0 if x ∉ R.
Interpret φ(X) as the probability of rejecting the null hypothesis given the sample X.
This also opens the door to randomized tests, where φ(X) can even take values strictly between 0 and 1.
Note the expected value of φ is the power function: E_θ[φ(X)] = P_θ(X ∈ R) = β(θ).
Existence of UMP tests
Lemma (Neyman-Pearson)
Consider testing H0 : θ = θ0 versus H1 : θ = θ1, where the pdf or
pmf corresponding to θi is f(x, θi), i = 0, 1. Consider any test function φ satisfying
φ(x) = 1 if f(x, θ1) > k f(x, θ0),
φ(x) = 0 if f(x, θ1) < k f(x, θ0),
for some k ≥ 0, and E_{θ0} φ(X) = α. Then φ(X) is a UMP size α test;
if k > 0, any other UMP level α test must have size α and can differ from φ only on the set {x : f(x, θ1) = k f(x, θ0)}.
Subhashis Ghoshal, North Carolina State University ST 522: Statistical Theory II
Examples
X ∼ Bin(2, θ), one observation. H0 : θ = 1/2 versus H1 : θ = 3/4. Obtain the UMP level 1/8 test and the UMP level 1/2 test.
X ∼ Exp(θ), H0 : θ = 1 versus H1 : θ = 2.
X ∼ Cauchy(θ), H0 : θ = 0 versus H1 : θ = 1.
X ∼ Unif(0, θ), H0 : θ = 1 versus H1 : θ = 2.
X ∼ Unif(θ, θ + 1), H0 : θ = 0 versus H1 : θ = 2.
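A worked sketch of the binomial example (Python): the likelihood ratio f(x, 3/4)/f(x, 1/2) is increasing in x, so Neyman-Pearson tests reject for large x, with randomization used to hit sizes that are not attainable exactly.

```python
from math import comb

def pmf(x, theta, n=2):
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

# Likelihood ratios f(x, 3/4) / f(x, 1/2) for x = 0, 1, 2
ratios = [pmf(x, 0.75) / pmf(x, 0.5) for x in range(3)]

# Non-randomized sizes under theta = 1/2
size_reject_2 = pmf(2, 0.5)                  # reject iff X = 2: size 1/4
size_reject_12 = pmf(1, 0.5) + pmf(2, 0.5)   # reject iff X in {1, 2}: size 3/4

# Level 1/8 requires randomization: phi(2) = 1/2 gives size 1/8
size_level_eighth = 0.5 * pmf(2, 0.5)
# Level 1/2: phi(2) = 1, phi(1) = 1/2 gives size 1/4 + (1/2)(1/2) = 1/2
size_level_half = pmf(2, 0.5) + 0.5 * pmf(1, 0.5)
```

The computed ratios (1/4, 3/4, 9/4) confirm that the rejection region should favor the largest values of X.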
Sufficient statistic and UMP test
Let T(X) be a sufficient statistic for θ and g(t, θ) the pdf or pmf of T corresponding to θ. Then a UMP level α test φ(T) based on T is given by
φ(t) = 1 if g(t, θ1) > k g(t, θ0),
φ(t) = 0 if g(t, θ1) < k g(t, θ0),
for some k ≥ 0, where α = E_{θ0} φ(T).
Examples
UMP normal test for mean: X1, . . . , Xn iid from N(μ, σ²) with σ² known, H0 : μ = μ0 versus H1 : μ = μ1, where μ1 > μ0.
UMP normal test for variance: X1, . . . , Xn iid from N(μ0, σ²) with σ² unknown. H0 : σ² = σ0² versus H1 : σ² = σ1², where σ1² > σ0².
Comments
Discrete case: suppose θ has only two possible values θ0 or θ1, and X is a discrete variable taking finitely many values a1, . . . , ak with
P_{θi}(X = aj), j = 1, . . . , k; i = 0, 1. Test H0 : θ = θ0 vs H1 : θ = θ1. The rejection region R of the UMP level α test satisfies
max over R of Σ_{aj ∈ R} P_{θ1}(X = aj)
subject to Σ_{aj ∈ R} P_{θ0}(X = aj) ≤ α.
The N-P test is the LRT for H0 : θ = θ0 vs H1 : θ = θ1.
For simple hypotheses, the UMP level α test is unbiased, i.e. β(θ1) > β(θ0) = α.
UMP test for one-sided composite alternative
iid N(θ, 1). H0 : θ = θ0 vs H1 : θ > θ0.
Monotone Likelihood Ratio (MLR)
Definition
A family of pdfs or pmfs {g(t, θ) : θ ∈ Θ} for a univariate random variable T with real-valued parameter θ has a monotone likelihood ratio (MLR) if, for every θ2 > θ1, g(t, θ2)/g(t, θ1) is an increasing function of t on {t : g(t, θ1) > 0 or g(t, θ2) > 0}.
Examples
Normal, Poisson, Binomial all have the MLR property.
If T is from an exponential family with density f(t, θ) = h(t)c(θ)e^{w(θ)t}, then T has an MLR if w(θ) is a nondecreasing function of θ.
If X1, . . . , Xn iid from N(μ, σ²) with σ known, then X̄ has an MLR.
If X1, . . . , Xn iid from N(μ, σ²) with μ known, then Σ_{i=1}^n (Xi − μ)² has an MLR.
iid Unif(0, θ): T = X(n) has the MLR property.
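The MLR property for the binomial can be checked directly (Python sketch): for θ2 > θ1 the ratio g(t, θ2)/g(t, θ1) grows by the constant factor (θ2/θ1)·((1 − θ1)/(1 − θ2)) > 1 with each unit increase in t.

```python
from math import comb

def bin_pmf(t, theta, n=10):
    return comb(n, t) * theta ** t * (1 - theta) ** (n - t)

# MLR check: g(t, theta2)/g(t, theta1) should be increasing in t
theta1, theta2 = 0.3, 0.6
ratios = [bin_pmf(t, theta2) / bin_pmf(t, theta1) for t in range(11)]
is_increasing = all(a < b for a, b in zip(ratios, ratios[1:]))
```

Here the step factor is (0.6/0.3)·(0.7/0.4) = 3.5, so the ratio is strictly increasing in t.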
Stochastically increasing
Definition
A statistic T with family of pdfs {f(t, θ), θ ∈ Θ} is called stochastically increasing in θ if θ1 < θ2 implies that
P_{θ1}(T > c) ≤ P_{θ2}(T > c) for every c, or equivalently, F_{θ2}(c) ≤ F_{θ1}(c), where F_θ is the cdf.
Useful facts
Lemma
If a family T has the MLR property, then it is stochastically increasing in its parameter.
A location family T is stochastically increasing in its location parameter.
Let a test have rejection region R = {T > c}. If T has the MLR property, then the power function β(θ) = P_θ(T ∈ R) = P_θ(T > c) is non-decreasing in θ.
Karlin-Rubin Theorem
Theorem
Let T(X) be a sufficient statistic for θ, and suppose the family {g(t, θ) : θ ∈ Θ} of its pdfs/pmfs has the MLR property. Then:
For testing H0 : θ ≤ θ0 vs H1 : θ > θ0, the UMP level α test rejects H0 if and only if T > t0, where α = Pθ0(T > t0).
For testing H0 : θ ≥ θ0 vs H1 : θ < θ0, the UMP level α test rejects H0 if and only if T < t0, where α = Pθ0(T < t0).
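A sketch of the theorem in the simplest normal case (the numbers and the simulation check are ours): for X1, . . . , Xn iid N(μ, σ²) with σ known, X̄ is sufficient with MLR, so the UMP level α test of H0 : μ ≤ μ0 rejects iff X̄ > t0, where t0 = μ0 + z(1−α) σ/√n makes α = Pμ0(X̄ > t0).

```python
import random
from statistics import NormalDist, fmean

# Sketch: UMP level-alpha test of H0: mu <= mu0 vs H1: mu > mu0 for
# N(mu, sigma^2) with sigma known, via Karlin-Rubin: reject iff the
# sufficient statistic Xbar exceeds t0 = mu0 + z_{1-alpha}*sigma/sqrt(n).

random.seed(0)
mu0, sigma, n, alpha = 0.0, 1.0, 9, 0.05
t0 = mu0 + NormalDist().inv_cdf(1 - alpha) * sigma / n ** 0.5

def reject(sample):
    return fmean(sample) > t0

# Monte Carlo check of the size at the boundary mu = mu0.
reps = 100_000
size = sum(reject([random.gauss(mu0, sigma) for _ in range(n)])
           for _ in range(reps)) / reps
assert abs(size - alpha) < 0.01
```

The simulated rejection rate at μ = μ0 comes out close to α, as the choice of t0 guarantees.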
Examples
Let X1, . . . , Xn be iid from N(μ, σ²), σ² known.
Find the UMP level α test for testing H0 : μ ≤ μ0 vs H1 : μ > μ0.
Find the UMP level α test for testing H0 : μ ≥ μ0 vs H1 : μ < μ0.
Let X1, . . . , Xn be iid from N(μ0, σ²), σ² unknown, μ0 known. Find the UMP level α test for testing H0 : σ² ≤ σ0² vs H1 : σ² > σ0².
Nonexistence of UMP test
For many problems with a two-sided alternative, there is no UMP level α test, because the class of level α tests is so large that no single test dominates all the others in terms of power.
Instead, search for a UMP test within some subset of the class of level α tests, for example, the subset of all unbiased tests.
Example
Let X1, . . . , Xn be iid from N(μ, σ²), σ² known. Consider testing
H0 : μ = μ0 vs H1 : μ ≠ μ0. There is no UMP level α test.
Find the UMP level α test within the class of unbiased tests.
p-value
The choice of α is subjective. Different people may have different tolerance levels α.
If α is small, the decision is conservative.
If α is large, the decision is overly liberal.
If you reject (or accept) H0, is it a strong or a borderline rejection (acceptance)?
p-value (contd.)
Definition
A p-value is the smallest possible level α at which H0 would be rejected.
Note
A p-value is a test statistic, taking values 0 ≤ p(x) ≤ 1 for each sample x.
Small values of p(X) give evidence that H1 is true.
The smaller the p-value, the stronger the evidence for rejecting H0.
Rejecting H0 at level α is equivalent to the p-value being less than α.
p-value for composite null
A p-value p(X) is called valid if, for every θ ∈ Θ0 and every 0 ≤ α ≤ 1, we have Pθ(p(X) ≤ α) ≤ α.
Theorem
Let W(X) be a test statistic such that large values of W give evidence that H1 is true. For each sample point x, define
p(x) = sup_{θ ∈ Θ0} Pθ(W(X) ≥ W(x)).
Then p(X) is a valid p-value.
Examples
Two-sided normal p-value:
Let X1, . . . , Xn be iid from N(μ, σ²), σ² unknown. Consider testing H0 : μ = μ0 versus H1 : μ ≠ μ0, using the LRT statistic W(X) = |X̄ − μ0|/(S/√n).
Let μ0 = 1, n = 16, observed x̄ = 1.5, s² = 1. Do you reject the hypothesis μ = 1 at level 0.05? At level 0.1?
One-sided normal p-value:
In the above example, consider testing H0 : μ ≤ μ0 versus H1 : μ > μ0.
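A numerical answer to the two-sided question above, as a sketch (Monte Carlo stands in for exact t(15) tail probabilities): the observed statistic is W = |1.5 − 1|/(1/√16) = 2.0, and under H0 the statistic has the |t(n−1)| distribution, which is free of μ0 and σ.

```python
import random
from statistics import fmean, stdev

# Sketch: Monte Carlo two-sided p-value for W(X) = |Xbar - mu0|/(S/sqrt(n)).
# Observed W = |1.5 - 1| / (1/sqrt(16)) = 2.0; the null distribution of W
# is |t_{n-1}|, simulated here with standard normal data.

random.seed(0)
n, w_obs = 16, 2.0

def t_stat():
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    return abs(fmean(x)) / (stdev(x) / n ** 0.5)

reps = 100_000
p_value = sum(t_stat() >= w_obs for _ in range(reps)) / reps
assert 0.05 < p_value < 0.10
```

The p-value comes out roughly 0.06, so we reject μ = 1 at level 0.1 but not at level 0.05.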
p-value and sufficient statistic
Sometimes there is a non-trivial statistic S that is sufficient for the null model. Then defining a p-value by conditioning on the sufficient statistic effectively reduces the composite null to a point null:
p(x) = P(W(X) ≥ W(x) | S = S(x)).
Fisher's Exact Test
Let S1 and S2 be independent observations with S1 ∼ Bin(n1, p1) and S2 ∼ Bin(n2, p2). Consider testing H0 : p1 = p2 versus H1 : p1 > p2.
Goal: form an exact (non-asymptotic) level α test.
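A sketch of the conditional construction (the function name and the numbers are ours): given S = S1 + S2 = s, under H0 the count S1 is hypergeometric regardless of the common p, and the conditional p-value for H1 : p1 > p2 sums the upper tail of that pmf.

```python
from math import comb

# Fisher's exact (conditional) p-value for H1: p1 > p2. Under H0, given
# S1 + S2 = s, S1 is Hypergeometric(n1 + n2, n1, s), so the p-value is
# the upper-tail sum of that pmf starting at the observed s1.

def fisher_exact_pvalue(s1, n1, s2, n2):
    s = s1 + s2
    denom = comb(n1 + n2, s)
    hi = min(n1, s)
    return sum(comb(n1, k) * comb(n2, s - k) for k in range(s1, hi + 1)) / denom

# A balanced outcome gives no evidence against H0 ...
assert fisher_exact_pvalue(2, 5, 2, 5) > 0.5
# ... while a lopsided one gives a very small conditional p-value.
assert fisher_exact_pvalue(9, 10, 1, 10) < 0.01
```

Because the conditional distribution is free of the nuisance parameter p, rejecting when this p-value is below α gives an exact level α test.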
Interval Estimation
Interval estimate: (L(X), U(X)).
Confidence coefficient: min_θ Pθ(θ ∈ (L(X), U(X))) = 1 − α.
Method of inversion
One-to-one correspondence between tests and confidence intervals.
Hypothesis testing: fixing the parameter value asks which sample values (the acceptance region) are consistent with that fixed value.
Confidence set: fixing the sample value asks which parameter values make this sample most plausible.
For each θ0 ∈ Θ, let A(θ0) be the acceptance region of a level α test of H0 : θ = θ0. Define the set C(x) = {θ0 : x ∈ A(θ0)}. Then C(x) is a (1 − α)-confidence set.
Example
X1, . . . , Xn iid N(μ, σ²), σ unknown, μ the parameter of interest.
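A sketch of the inversion in the simpler known-σ case (an assumption on our part; the example above has σ unknown and would invert a t-test instead): inverting the level α z-test of H0 : μ = μ0 gives C(x) = {μ0 : |x̄ − μ0| ≤ z(1−α/2) σ/√n}, the usual (1 − α) interval.

```python
import random
from statistics import NormalDist, fmean

# Sketch: invert the two-sided z-test of H0: mu = mu0 (sigma known).
# The acceptance region |xbar - mu0| <= z * sigma/sqrt(n) inverts to the
# interval xbar -/+ z * sigma/sqrt(n) for mu.

random.seed(0)
sigma, n, alpha = 2.0, 25, 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)

def ci(sample):
    half = z * sigma / n ** 0.5
    m = fmean(sample)
    return m - half, m + half

# Monte Carlo check of the coverage probability at mu = 1.
mu, reps = 1.0, 50_000
cover = sum(lo <= mu <= hi
            for lo, hi in (ci([random.gauss(mu, sigma) for _ in range(n)])
                           for _ in range(reps))) / reps
assert abs(cover - 0.95) < 0.01
```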
Method of inversion (contd.)
In general, inverting the acceptance region of a two-sided test gives a two-sided interval, and inverting the acceptance region of a one-sided test gives an interval that is open at one end.
Theorem
Let the acceptance region of a two-sided test be of the form A(θ) = {x : c1(θ) ≤ T(x) ≤ c2(θ)}, and let the cutoffs be symmetric, that is, Pθ(T(X) > c2(θ)) = α/2 and Pθ(T(X) < c1(θ)) = α/2.
If T has the MLR property, then both c1(θ) and c2(θ) are increasing in θ.
Examples
X1, . . . , Xn ∼ N(μ, σ²), both unknown.
Upper confidence bound for μ.
Lower confidence bound for μ.
X1, . . . , Xn ∼ Exp(θ). Invert the LRT.
Discrete case: X1, . . . , Xn ∼ Bin(1, θ). Obtain a lower confidence bound.
Pivot
Definition
A random quantity Q(X, θ) is called a pivotal quantity (or a pivot) if the distribution of Q(X, θ) does not depend on θ.
Note that this is different from an ancillary statistic, since Q(X, θ) also depends on θ and hence is not a statistic.
Examples
Location family
Scale family
Location-scale family
iid exponential. Gamma pivot.
Suppose a statistic T has density f(t, θ) = g(Q(t, θ)) |(∂/∂t)Q(t, θ)|. Then Q(T, θ) is a pivot.
Method of pivot
How to construct a confidence set using a pivotal quantity?
Find a, b such that Pθ(a ≤ Q(X, θ) ≤ b) = 1 − α.
Define C(x) = {θ : a ≤ Q(x, θ) ≤ b}.
Then Pθ(θ ∈ C(X)) = Pθ(a ≤ Q(X, θ) ≤ b) = 1 − α.
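A sketch of the three steps for iid exponential data, using our own choice of pivot: Q = nX(1)/θ has a standard Exp(1) distribution, so a and b have closed forms; the Gamma pivot 2∑Xi/θ from the examples works the same way but would need chi-square quantiles.

```python
import math
import random

# Sketch: pivot-based CI for the mean theta of iid Exp data. The minimum
# satisfies n*X_(1)/theta ~ Exp(1), so the Exp(1) quantiles give a, b:
#   a = -log(1 - alpha/2),  b = -log(alpha/2),
# and inverting a <= n*X_(1)/theta <= b yields [n*X_(1)/b, n*X_(1)/a].

random.seed(0)
n, alpha, theta = 5, 0.05, 2.0
a = -math.log(1 - alpha / 2)
b = -math.log(alpha / 2)

def ci(sample):
    q = len(sample) * min(sample)   # numerator of the pivot
    return q / b, q / a             # invert a <= n*X_(1)/theta <= b

# Monte Carlo check of the (exact) 95% coverage.
reps = 50_000
cover = sum(lo <= theta <= hi
            for lo, hi in (ci([random.expovariate(1 / theta) for _ in range(n)])
                           for _ in range(reps))) / reps
assert abs(cover - 0.95) < 0.01
```

This interval is exact for every n because the pivot's distribution is known exactly, not just asymptotically.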
Method of pivot (contd.)
When will C(x) be an interval?
If Q(x, θ) is monotone in θ, then C(x) is an interval.
Examples:
iid exponential.
iid N(μ, σ²), σ known. Interval for μ.
iid N(μ, σ²), σ unknown. Interval for μ.
iid N(μ, σ²), μ known. Interval for σ.
iid N(μ, σ²), μ unknown. Interval for σ.
Method of pivot (contd.)
If F(t, θ) is decreasing in θ for all t, define θL, θU by F(t, θL) = 1 − α2, F(t, θU) = α1, where α1 + α2 = α. Then [θL(T), θU(T)] is a (1 − α) CI for θ.
Similarly, if F(t, θ) is increasing in θ for all t, define θL, θU by F(t, θL) = α2, F(t, θU) = 1 − α1, where α1 + α2 = α. Then [θL(T), θU(T)] is a (1 − α) CI for θ.
Examples:
iid from f(x, θ) = e^{−(x−θ)} I(x > θ); X(1) is sufficient.
A (1 − α) CI is not unique. Among the many choices, we want to minimize the expected length.
iid N(μ, σ²), σ known.
iid N(μ, σ²), σ unknown.
iid exponential.
Asymptotic Evaluation
X1, . . . , Xn i.i.d. f(x, θ), n large. Mathematically, n → ∞.
The assumption n → ∞ makes life easier. Dependence of optimality on models or loss functions becomes less pronounced.
Because limit theorems become available, distributions can be found approximately. Limiting distributions are much simpler than the actual distributions.
Convergence in probability
Definition
We say that Yn →p c (Yn converges in probability to the constant c) if P(|Yn − c| > ε) → 0 as n → ∞ for all ε > 0.
Usual calculus rules apply to convergence in probability.
A possible method of showing convergence in probability is Chebyshev's inequality: P(|Yn − c| > ε) ≤ ε⁻² E(Yn − c)² = ε⁻² [var(Yn) + (E(Yn) − c)²], so it is enough to show that the right-hand side goes to 0.
If Yn = X̄n, then X̄n →p E(X) by the law of large numbers.
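A small simulation (ours, not from the slides) illustrating the last point: for Exp(1) data with E(X) = 1, the probability P(|X̄n − 1| > ε) shrinks as n grows.

```python
import random
from statistics import fmean

# Sketch: Xbar_n ->p E(X) for Exp(1) data, where E(X) = 1. The Monte Carlo
# estimate of P(|Xbar_n - 1| > eps) should drop as n increases.

random.seed(0)
eps, reps = 0.2, 5_000

def tail_prob(n):
    return sum(abs(fmean([random.expovariate(1.0) for _ in range(n)]) - 1) > eps
               for _ in range(reps)) / reps

p10, p1000 = tail_prob(10), tail_prob(1000)
assert p1000 < p10    # the tail probability decreases with n
assert p1000 < 0.01   # and is essentially zero for large n
```

Chebyshev's inequality predicts the same: here var(X̄n) = 1/n, so the tail probability is at most 1/(nε²).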
Convergence in distribution
Definition
If Yn is a sequence of random variables and F is a continuous cdf, we say that Yn converges in distribution to F if P(Yn ≤ x) → F(x) for all x. We also say that Yn →d Y, where Y is a random variable having cdf F.
The central limit theorem states that √n(X̄n − E(X)) converges in distribution to N(0, var(X)), i.e.,
P(√n(X̄n − E(X))/√var(X) ≤ x) → Φ(x)
for all x, where Φ stands for the standard normal cdf.
Another important result is Slutsky's theorem: If Yn →d Y and Zn →p c, then Yn + Zn →d Y + c, YnZn →d cY, and Yn/Zn →d Y/c if c ≠ 0.
Consistency
Definition
Let Wn = Wn(X1, . . . , Xn) be a sequence of estimators for τ(θ). We say that Wn is consistent for estimating τ(θ) if Wn →p τ(θ) under Pθ for all θ.
Theorem
If Eθ(Wn) → τ(θ) (in which case Wn is called asymptotically unbiased for τ(θ)) and varθ(Wn) → 0 for all θ, then Wn is consistent for τ(θ).
Examples
If X1, . . . , Xn are i.i.d. f with E(X) = μ and var(X) = σ², then X̄n is consistent for μ and S²n = ∑_{i=1}^n (Xi − X̄n)²/(n − 1) is consistent for σ².
∑_{i=1}^n (Xi − X̄n)²/n is consistent for σ² too.
(Invariance principle of consistency): If Wn is consistent for θ and g is a continuous function, then g(Wn) is consistent for g(θ).
The method of moments estimator is generally consistent.
The UMVUE is consistent: Let X1, . . . , Xn be i.i.d. f(x, θ) and let Wn be the UMVUE of τ(θ). Then Wn is consistent for τ(θ).
Consistency of the MLE: Let X1, . . . , Xn be i.i.d. f(x, θ), a parametric family satisfying some regularity conditions. Then the MLE θ̂n is consistent for θ.
Delta method
Theorem
If Tn is AN(θ, σ²(θ)/n), then g(Tn) is AN(g(θ), (g′(θ))² σ²(θ)/n).
A multivariate version is also true.
Combining the CLT and the delta method gives the asymptotic normality of many statistics of interest.
Efficiency
How do we distinguish between consistent estimators?
Suppose the estimators are asymptotically normal with the same asymptotic mean. Then we can compare their asymptotic variances.
Often one variance is smaller than another throughout.
If there is a lower bound, and that lower bound is attained, then the estimator attaining it is called asymptotically efficient. Clearly such an estimator is impossible to beat asymptotically: it is the best.
Efficiency bound
Cramer-Rao bound for the MSE of Tn in estimating τ(θ):
(τ′(θ) + b′n(θ))² / (nI(θ)),
where I(θ) is the Fisher information and bn(θ) the bias.
So if b′n(θ) → 0, then the bound for the asymptotic variance should be (τ′(θ))²/I(θ).
In particular, if τ(θ) = θ, the bound for the asymptotic variance is 1/I(θ).
Strictly speaking, this bound is not always valid, although it is nearly correct.
We can then define an estimator to be asymptotically efficient if its asymptotic variance is 1/I(θ).
Attaining efficiency bound
Theorem
The MLE θ̂n is AN(θ, 1/(nI(θ))).
More generally, τ(θ̂n) is AN(τ(θ), (τ′(θ))²/(nI(θ))).
The MLE is not the only possible asymptotically efficient estimator.
Any Bayes estimator is asymptotically efficient.
Method of moments estimators are asymptotically normal, but need not be asymptotically efficient.
Define the asymptotic efficiency of θ̂n ∼ AN(θ, v(θ)/n) by I⁻¹(θ)/v(θ).
Examples
Cauchy
Logistic
Mean versus median
Asymptotic distribution of likelihood ratio statistic
Theorem (Point null case)
Let X1, . . . , Xn be i.i.d. f(x|θ) and let λn(X) be the likelihood ratio for testing H0 : θ = θ0 vs H1 : θ ≠ θ0, where θ is d-dimensional. Then
−2 log λn(X) →d χ²_d.
Example: Poisson
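A sketch of the Poisson example (the numbers are ours): for H0 : λ = λ0 the MLE is λ̂ = X̄, so −2 log λn = 2n[X̄ log(X̄/λ0) − X̄ + λ0], which should behave like χ²1 under H0; in particular P(−2 log λn > 3.841) should be close to 0.05.

```python
import math
import random

# Sketch: simulate -2 log(lambda_n) for the Poisson point null and check
# that its upper tail matches the chi-squared(1) 95th percentile (3.841).

random.seed(0)
lam0, n, reps = 5.0, 100, 10_000

def poisson(lam):
    # Knuth's multiplicative method; fine for moderate lam.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def lrt_stat():
    xbar = sum(poisson(lam0) for _ in range(n)) / n
    return 2 * n * (xbar * math.log(xbar / lam0) - xbar + lam0)

rate = sum(lrt_stat() > 3.841 for _ in range(reps)) / reps
assert abs(rate - 0.05) < 0.02
```

The simulated rejection rate under H0 is close to 0.05, matching the χ²1 approximation of the theorem.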
Asymptotic distribution of likelihood ratio statistic
Theorem (General case)
Let X