Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from...


Inference for the mean vector

Univariate Inference

Let $x_1, x_2, \dots, x_n$ denote a sample of n from the normal distribution with mean $\mu$ and variance $\sigma^2$.

Suppose we want to test

$H_0: \mu = \mu_0$ vs $H_A: \mu \ne \mu_0$

The appropriate test is the t test.

The test statistic:

$$t = \sqrt{n}\,\frac{\bar{x} - \mu_0}{s}$$

Reject $H_0$ if $|t| > t_{\alpha/2}$

The multivariate Test

Let $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$ denote a sample of n from the p-variate normal distribution with mean vector $\vec{\mu}$ and covariance matrix $\Sigma$.

Suppose we want to test

$H_0: \vec{\mu} = \vec{\mu}_0$ vs $H_A: \vec{\mu} \ne \vec{\mu}_0$

Example

For n = 10 students we measure scores on

– Math proficiency test ($x_1$),

– Science proficiency test ($x_2$),

– English proficiency test ($x_3$) and

– French proficiency test ($x_4$)

The average score for each of the tests in previous years was 60. Has this changed?

The data

Student   Math   Science   Eng   French
   1       81      89       73     74
   2       73      79       73     74
   3       61      86       81     81
   4       55      70       76     73
   5       61      71       61     66
   6       52      70       56     58
   7       56      74       56     56
   8       65      87       73     69
   9       54      76       69     72
  10       48      71       62     63

Summary Statistics

the mean vector

$$\bar{\vec{x}} = \begin{bmatrix} 60.6 \\ 77.3 \\ 68.0 \\ 68.6 \end{bmatrix}$$

the sample covariance matrix

$$S = \begin{bmatrix} 102.044 & 56.689 & 41.222 & 39.489 \\ 56.689 & 56.456 & 42.000 & 35.356 \\ 41.222 & 42.000 & 75.778 & 65.111 \\ 39.489 & 35.356 & 65.111 & 61.378 \end{bmatrix}$$

the hypothesized mean vector

$$\vec{\mu}_0 = \begin{bmatrix} 60 \\ 60 \\ 60 \\ 60 \end{bmatrix}$$
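The summary statistics can be reproduced from the data table. The following is a minimal sketch in plain Python (standard library only); the names `scores`, `mean_vector` and `sample_covariance` are illustrative choices, not part of the original slides.

```python
# Scores of the n = 10 students: (Math, Science, English, French).
scores = [
    (81, 89, 73, 74), (73, 79, 73, 74), (61, 86, 81, 81), (55, 70, 76, 73),
    (61, 71, 61, 66), (52, 70, 56, 58), (56, 74, 56, 56), (65, 87, 73, 69),
    (54, 76, 69, 72), (48, 71, 62, 63),
]


def mean_vector(rows):
    """Column means of a list of observation tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sample_covariance(rows):
    """Sample covariance matrix S with divisor n - 1."""
    n = len(rows)
    xbar = mean_vector(rows)
    p = len(xbar)
    return [[sum((r[j] - xbar[j]) * (r[k] - xbar[k]) for r in rows) / (n - 1)
             for k in range(p)] for j in range(p)]


xbar = mean_vector(scores)
S = sample_covariance(scores)

print([round(v, 1) for v in xbar])
for row in S:
    print([round(v, 3) for v in row])
```

The divisor n − 1 is used so that the result matches the sample covariance matrix S of the slides rather than the maximum likelihood estimate.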

Roy’s Union-Intersection Principle

This is a general procedure for developing a multivariate test from the corresponding univariate test.

1. Convert the multivariate problem to a univariate problem by considering an arbitrary linear combination of the observation vector $\vec{X} = (X_1, \dots, X_p)'$:

$U = \vec{a}'\vec{X} = a_1 X_1 + \dots + a_p X_p$ (arbitrary linear combination of the observations)

2. Perform the test for the arbitrary linear combination of the observation vector.

3. Repeat this for all possible choices of $\vec{a} = (a_1, \dots, a_p)'$.

4. Reject the multivariate hypothesis if $H_0$ is rejected for any one of the choices for $\vec{a}$.

5. Accept the multivariate hypothesis if $H_0$ is accepted for all of the choices for $\vec{a}$.

6. Set the type I error rate for the individual tests so that the type I error rate for the multivariate test is $\alpha$.

Application of Roy’s principle to the following situation

Let $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$ denote a sample of n from the p-variate normal distribution with mean vector $\vec{\mu}$ and covariance matrix $\Sigma$.

Suppose we want to test

$H_0: \vec{\mu} = \vec{\mu}_0$ vs $H_A: \vec{\mu} \ne \vec{\mu}_0$

Let $u_i = \vec{a}'\vec{x}_i = a_1 x_{1i} + \dots + a_p x_{pi}$

Then $u_1, \dots, u_n$ is a sample of n from the normal distribution with mean $\vec{a}'\vec{\mu}$ and variance $\vec{a}'\Sigma\vec{a}$.

to test

$H_0^{a}: \vec{a}'\vec{\mu} = \vec{a}'\vec{\mu}_0$ vs $H_A^{a}: \vec{a}'\vec{\mu} \ne \vec{a}'\vec{\mu}_0$

we would use the test statistic:

$$t_a = \sqrt{n}\,\frac{\bar{u} - \vec{a}'\vec{\mu}_0}{s_u}$$

Now

$$\bar{u} = \frac{1}{n}\sum_{i=1}^{n} u_i = \frac{1}{n}\sum_{i=1}^{n} \vec{a}'\vec{x}_i = \vec{a}'\left(\frac{1}{n}\sum_{i=1}^{n}\vec{x}_i\right) = \vec{a}'\bar{\vec{x}}$$

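and

$$s_u^2 = \frac{1}{n-1}\sum_{i=1}^{n}(u_i - \bar{u})^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\vec{a}'\vec{x}_i - \vec{a}'\bar{\vec{x}}\right)^2$$

$$= \frac{1}{n-1}\sum_{i=1}^{n}\left[\vec{a}'(\vec{x}_i - \bar{\vec{x}})\right]^2$$

$$= \frac{1}{n-1}\sum_{i=1}^{n}\vec{a}'(\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})'\vec{a}$$

$$= \vec{a}'\left[\frac{1}{n-1}\sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})'\right]\vec{a} = \vec{a}'S\vec{a}$$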

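Thus

$$t_a = \sqrt{n}\,\frac{\vec{a}'\bar{\vec{x}} - \vec{a}'\vec{\mu}_0}{\sqrt{\vec{a}'S\vec{a}}} = \frac{\sqrt{n}\,\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)}{\sqrt{\vec{a}'S\vec{a}}}$$

We will reject $H_0^{a}: \vec{a}'\vec{\mu} = \vec{a}'\vec{\mu}_0$

if $$|t_a| = \frac{\sqrt{n}\,\left|\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right|}{\sqrt{\vec{a}'S\vec{a}}} \ge t_{\alpha/2}$$

or $$t_a^2 = \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}} \ge t_{\alpha/2}^2$$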

Using Roy’s Union-Intersection principle:

We will reject $H_0: \vec{\mu} = \vec{\mu}_0$ in favour of $H_A: \vec{\mu} \ne \vec{\mu}_0$

if $$t_a^2 = \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}} \ge t_{\alpha/2}^2 \text{ for at least one } \vec{a}$$

We accept $H_0: \vec{\mu} = \vec{\mu}_0$

if $$t_a^2 = \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}} < t_{\alpha/2}^2 \text{ for all } \vec{a}$$

i.e.

We reject $H_0: \vec{\mu} = \vec{\mu}_0$

if $$\max_{\vec{a}} \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}} \ge t_{\alpha/2}^2$$

We accept $H_0: \vec{\mu} = \vec{\mu}_0$

if $$\max_{\vec{a}} \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}} < t_{\alpha/2}^2$$

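Consider the problem of finding:

$$\max_{\vec{a}} h(\vec{a}) = \max_{\vec{a}} \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}}$$

where

$$h(\vec{a}) = \frac{n\left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}'S\vec{a}} = n\,\frac{\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'\vec{a}}{\vec{a}'S\vec{a}}$$

$$\frac{\partial h(\vec{a})}{\partial \vec{a}} = n\,\frac{(\vec{a}'S\vec{a})\,2(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'\vec{a} - \left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'\vec{a}\right]2S\vec{a}}{(\vec{a}'S\vec{a})^2} = \vec{0}$$

or $$(\vec{a}'S\vec{a})(\bar{\vec{x}} - \vec{\mu}_0) = \left[\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)\right]S\vec{a}$$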

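or $$\vec{a}_{opt} = \frac{\vec{a}'S\vec{a}}{\vec{a}'(\bar{\vec{x}} - \vec{\mu}_0)}\,S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) = k\,S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)$$

thus

$$\max_{\vec{a}} h(\vec{a}) = h(\vec{a}_{opt}) = \frac{n\left[\vec{a}_{opt}'(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{\vec{a}_{opt}'S\vec{a}_{opt}}$$

$$= \frac{n k^2\left[(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)\right]^2}{k^2(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}SS^{-1}(\bar{\vec{x}} - \vec{\mu}_0)}$$

$$= n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)$$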

Thus Roy’s Union-Intersection principle states:

We reject $H_0: \vec{\mu} = \vec{\mu}_0$

if $$n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) \ge t_{\alpha/2}^2$$

We accept $H_0: \vec{\mu} = \vec{\mu}_0$

if $$n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) < t_{\alpha/2}^2$$

The statistic $T^2 = n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)$

is called Hotelling’s $T^2$ statistic
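The maximisation result can be checked numerically. The sketch below uses the student scores from the example, and confirms that the squared t statistic of a linear combination never exceeds $T^2$ and attains it at $\vec{a} = S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)$. It is plain Python under stated assumptions: the helper names (`solve`, `dot`, `t_squared`) and the random search over directions are illustrative and not part of the slides.

```python
import random

scores = [
    (81, 89, 73, 74), (73, 79, 73, 74), (61, 86, 81, 81), (55, 70, 76, 73),
    (61, 71, 61, 66), (52, 70, 56, 58), (56, 74, 56, 56), (65, 87, 73, 69),
    (54, 76, 69, 72), (48, 71, 62, 63),
]
mu0 = [60.0, 60.0, 60.0, 60.0]
n, p = len(scores), len(mu0)

xbar = [sum(col) / n for col in zip(*scores)]
S = [[sum((r[j] - xbar[j]) * (r[k] - xbar[k]) for r in scores) / (n - 1)
      for k in range(p)] for j in range(p)]
d = [xbar[j] - mu0[j] for j in range(p)]          # xbar - mu0


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def t_squared(a):
    """Squared univariate t statistic for the linear combination a'x."""
    Sa = [dot(row, a) for row in S]
    return n * dot(a, d) ** 2 / dot(a, Sa)


a_opt = solve(S, d)           # S^{-1}(xbar - mu0), the maximising direction
T2 = n * dot(d, a_opt)        # Hotelling's T^2

random.seed(0)
trials = [t_squared([random.uniform(-1, 1) for _ in range(p)])
          for _ in range(1000)]
print(round(T2, 3), round(t_squared(a_opt), 3), round(max(trials), 3))
```

Any nonzero multiple of `a_opt` gives the same value, which is why the constant k drops out of the derivation.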

Choosing the critical value for Hotelling’s $T^2$ statistic

We reject $H_0: \vec{\mu} = \vec{\mu}_0$

if $$T^2 = n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) \ge t_{\alpha/2}^2$$

To determine $t_{\alpha/2}^2$, we need to find the sampling distribution of $T^2$ when $H_0$ is true.

It turns out that if $H_0$ is true then

$$F = \frac{n-p}{p(n-1)}\,T^2 = \frac{n(n-p)}{p(n-1)}(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)$$

has an F distribution with $\nu_1 = p$ and $\nu_2 = n - p$

Thus

Hotelling’s $T^2$ test

We reject $H_0: \vec{\mu} = \vec{\mu}_0$

if $$F = \frac{n-p}{p(n-1)}\,T^2 \ge F_{\alpha}(p, n-p)$$

or if

$$T^2 = n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) \ge \frac{p(n-1)}{n-p}F_{\alpha}(p, n-p) = T_{\alpha}^2$$

Another derivation of Hotelling’s $T^2$ statistic

Another method of developing statistical tests is the Likelihood ratio method.

Suppose that the data vector, $\vec{x}$, has joint density $f(\vec{x}\,|\,\vec{\theta})$

Suppose that the parameter vector, $\vec{\theta}$, belongs to the set $\Omega$. Let $\omega$ denote a subset of $\Omega$.

Finally we want to test

$H_0: \vec{\theta} \in \omega$ vs $H_A: \vec{\theta} \notin \omega$

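The Likelihood ratio test rejects $H_0$ if

$$\lambda = \frac{\max_{\vec{\theta} \in \omega} f(\vec{x}\,|\,\vec{\theta})}{\max_{\vec{\theta} \in \Omega} f(\vec{x}\,|\,\vec{\theta})} = \frac{\max_{\vec{\theta} \in \omega} L(\vec{\theta})}{\max_{\vec{\theta} \in \Omega} L(\vec{\theta})} = \frac{L(\hat{\hat{\theta}})}{L(\hat{\theta})} < \lambda_{\alpha}$$

where $\hat{\theta}$ = the MLE of $\vec{\theta}$

and $\hat{\hat{\theta}}$ = the MLE of $\vec{\theta}$ when $H_0$ is true.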

The situation

Let $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$ denote a sample of n from the p-variate normal distribution with mean vector $\vec{\mu}$ and covariance matrix $\Sigma$.

Suppose we want to test

$H_0: \vec{\mu} = \vec{\mu}_0$ vs $H_A: \vec{\mu} \ne \vec{\mu}_0$

The Likelihood function is:

$$L(\vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\,e^{-\frac{1}{2}\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu})'\Sigma^{-1}(\vec{x}_i - \vec{\mu})}$$

and the Log-likelihood function is:

$$l(\vec{\mu}, \Sigma) = \ln L(\vec{\mu}, \Sigma) = -\frac{np}{2}\ln(2\pi) - \frac{n}{2}\ln|\Sigma| - \frac{1}{2}\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu})'\Sigma^{-1}(\vec{x}_i - \vec{\mu})$$

and the Maximum Likelihood estimators of $\vec{\mu}$ and $\Sigma$ are

$$\hat{\mu} = \bar{\vec{x}} = \frac{1}{n}\sum_{i=1}^{n}\vec{x}_i$$

and

$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})' = \frac{n-1}{n}\,S$$

and the Maximum Likelihood estimators of $\vec{\mu}$ and $\Sigma$ when $H_0$ is true are:

$$\hat{\hat{\mu}} = \vec{\mu}_0$$

and

$$\hat{\hat{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)'$$

The Likelihood function is:

$$L(\vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}}\,e^{-\frac{1}{2}\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu})'\Sigma^{-1}(\vec{x}_i - \vec{\mu})}$$

now

$$\sum_{i=1}^{n}(\vec{x}_i - \hat{\mu})'\hat{\Sigma}^{-1}(\vec{x}_i - \hat{\mu}) = \sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})'\left[\tfrac{n-1}{n}S\right]^{-1}(\vec{x}_i - \bar{\vec{x}})$$

$$= tr\left(\sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})'\left[\tfrac{n-1}{n}S\right]^{-1}(\vec{x}_i - \bar{\vec{x}})\right)$$

$$= tr\left(\left[\tfrac{n-1}{n}S\right]^{-1}\sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})'\right)$$

$$= tr\left(\tfrac{n}{n-1}S^{-1}(n-1)S\right) = n\,tr(I_p) = np$$

Thus

$$L(\hat{\mu}, \hat{\Sigma}) = \frac{1}{(2\pi)^{np/2}\left|\tfrac{n-1}{n}S\right|^{n/2}}\,e^{-np/2}$$

similarly

$$L(\hat{\hat{\mu}}, \hat{\hat{\Sigma}}) = \frac{1}{(2\pi)^{np/2}\left|\hat{\hat{\Sigma}}\right|^{n/2}}\,e^{-np/2}$$

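and

$$\lambda = \frac{L(\hat{\hat{\mu}}, \hat{\hat{\Sigma}})}{L(\hat{\mu}, \hat{\Sigma})} = \frac{\left|\hat{\Sigma}\right|^{n/2}}{\left|\hat{\hat{\Sigma}}\right|^{n/2}} = \frac{\left|\tfrac{n-1}{n}S\right|^{n/2}}{\left|\tfrac{1}{n}\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)'\right|^{n/2}}$$

$$= \frac{\left|(n-1)S\right|^{n/2}}{\left|\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)'\right|^{n/2}}$$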

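Note:

Let $$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} u & \vec{w}' \\ \vec{w} & V \end{bmatrix}$$

$$|A| = |A_{11}|\,\left|A_{22} - A_{21}A_{11}^{-1}A_{12}\right| = |A_{22}|\,\left|A_{11} - A_{12}A_{22}^{-1}A_{21}\right|$$

$$= u\left|V - \tfrac{1}{u}\vec{w}\vec{w}'\right| = |V|\left(u - \vec{w}'V^{-1}\vec{w}\right)$$

Thus $$u\left|V - \tfrac{1}{u}\vec{w}\vec{w}'\right| = |V|\left(u - \vec{w}'V^{-1}\vec{w}\right)$$

and $$\frac{\left|V - \tfrac{1}{u}\vec{w}\vec{w}'\right|}{|V|} = \frac{u - \vec{w}'V^{-1}\vec{w}}{u} = 1 - \frac{1}{u}\vec{w}'V^{-1}\vec{w}$$

Now

$$\lambda = \frac{\left|(n-1)S\right|^{n/2}}{\left|\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)'\right|^{n/2}}$$

and

$$\lambda^{2/n} = \frac{\left|(n-1)S\right|}{\left|\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)'\right|}$$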

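Also

$$\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)' = \sum_{i=1}^{n}\left[(\vec{x}_i - \bar{\vec{x}}) + (\bar{\vec{x}} - \vec{\mu}_0)\right]\left[(\vec{x}_i - \bar{\vec{x}}) + (\bar{\vec{x}} - \vec{\mu}_0)\right]'$$

$$= \sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})' + \left[\sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})\right](\bar{\vec{x}} - \vec{\mu}_0)' + (\bar{\vec{x}} - \vec{\mu}_0)\left[\sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})\right]' + n(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'$$

$$= \sum_{i=1}^{n}(\vec{x}_i - \bar{\vec{x}})(\vec{x}_i - \bar{\vec{x}})' + n(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'$$

$$= (n-1)S + n(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'$$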

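Thus

$$\lambda^{2/n} = \frac{\left|(n-1)S\right|}{\left|\sum_{i=1}^{n}(\vec{x}_i - \vec{\mu}_0)(\vec{x}_i - \vec{\mu}_0)'\right|} = \frac{\left|(n-1)S\right|}{\left|(n-1)S + n(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'\right|}$$

$$= \frac{|S|}{\left|S + \tfrac{n}{n-1}(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'\right|}$$

Thus

$$\lambda^{-2/n} = \frac{\left|S + \tfrac{n}{n-1}(\bar{\vec{x}} - \vec{\mu}_0)(\bar{\vec{x}} - \vec{\mu}_0)'\right|}{|S|}$$

using $$\frac{\left|V - \tfrac{1}{u}\vec{w}\vec{w}'\right|}{|V|} = 1 - \frac{1}{u}\vec{w}'V^{-1}\vec{w}$$

with $u = -(n-1)$, $V = S$ and $\vec{w} = \sqrt{n}\,(\bar{\vec{x}} - \vec{\mu}_0)$

Then

$$\lambda^{-2/n} = 1 + \frac{n}{n-1}(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0)$$

Thus to reject $H_0$ if $\lambda < \lambda_{\alpha}$

i.e. $\lambda^{2/n} < \lambda_{\alpha}^{2/n}$

or $\lambda^{-2/n} > \lambda_{\alpha}^{-2/n}$

and $$1 + \frac{n}{n-1}(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) > \lambda_{\alpha}^{-2/n}$$

or $$n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) > (n-1)\left(\lambda_{\alpha}^{-2/n} - 1\right)$$

This is the same as Hotelling’s $T^2$ test if

$$(n-1)\left(\lambda_{\alpha}^{-2/n} - 1\right) = T_{\alpha}^2 = \frac{p(n-1)}{n-p}F_{\alpha}(p, n-p)$$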
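The link between the likelihood ratio and $T^2$ can be verified numerically on the student scores. The sketch below computes $\lambda^{2/n}$ once as a ratio of determinants and once as $1/(1 + T^2/(n-1))$. It is plain Python under stated assumptions: the helper names (`scatter`, `eliminate`) are illustrative and not part of the slides.

```python
scores = [
    (81, 89, 73, 74), (73, 79, 73, 74), (61, 86, 81, 81), (55, 70, 76, 73),
    (61, 71, 61, 66), (52, 70, 56, 58), (56, 74, 56, 56), (65, 87, 73, 69),
    (54, 76, 69, 72), (48, 71, 62, 63),
]
mu0 = [60.0, 60.0, 60.0, 60.0]
n, p = len(scores), len(mu0)
xbar = [sum(col) / n for col in zip(*scores)]
d = [xbar[j] - mu0[j] for j in range(p)]


def scatter(rows, centre):
    """Sum over i of (x_i - centre)(x_i - centre)'."""
    return [[sum((r[j] - centre[j]) * (r[k] - centre[k]) for r in rows)
             for k in range(p)] for j in range(p)]


def eliminate(A, b):
    """Gaussian elimination with partial pivoting.

    Returns (det(A), x) where x solves A x = b.
    """
    size = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det                      # a row swap flips the sign
        det *= M[col][col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return det, x


W = scatter(scores, xbar)      # equals (n - 1) S
W0 = scatter(scores, mu0)      # scatter about mu0
det_W, y = eliminate(W, d)     # y = W^{-1}(xbar - mu0)
det_W0, _ = eliminate(W0, d)

# S^{-1} = (n - 1) W^{-1}, so T^2 = n (n - 1) d' W^{-1} d.
T2 = n * (n - 1) * sum(di * yi for di, yi in zip(d, y))
lhs = det_W / det_W0                  # lambda^(2/n) from the determinants
rhs = 1.0 / (1.0 + T2 / (n - 1))      # lambda^(2/n) written through T^2
print(lhs, rhs)
```

The two printed values agree up to floating-point error, which illustrates that $\lambda$ is a decreasing function of $T^2$, so the two tests reject for the same samples.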


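Example (continued): Summary Statistics and test

$$\bar{\vec{x}} = \begin{bmatrix} 60.6 \\ 77.3 \\ 68.0 \\ 68.6 \end{bmatrix}, \qquad \vec{\mu}_0 = \begin{bmatrix} 60 \\ 60 \\ 60 \\ 60 \end{bmatrix}$$

$$S = \begin{bmatrix} 102.044 & 56.689 & 41.222 & 39.489 \\ 56.689 & 56.456 & 42.000 & 35.356 \\ 41.222 & 42.000 & 75.778 & 65.111 \\ 39.489 & 35.356 & 65.111 & 61.378 \end{bmatrix}$$

Note:

$$S^{-1} = \begin{bmatrix} 0.0245 & -0.0255 & 0.0195 & -0.0218 \\ -0.0255 & 0.0567 & -0.0405 & 0.0267 \\ 0.0195 & -0.0405 & 0.1782 & -0.1783 \\ -0.0218 & 0.0267 & -0.1783 & 0.2040 \end{bmatrix}$$

$$T^2 = n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) = 151.135$$

$$T_{0.05}^2 = \frac{p(n-1)}{n-p}F_{0.05}(p, n-p) = \frac{4(9)}{6}F_{0.05}(4, 6) = \frac{4(9)}{6}(4.53) = 27.18$$

Since $T^2 = 151.135 > 27.18$, we reject $H_0$ at the 5% level: the mean scores have changed from 60.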
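The worked example can be reproduced end to end. The following is a minimal sketch in plain Python; the function name `hotelling_one_sample` and the helper `solve` are illustrative, and the critical value uses the tabled $F_{0.05}(4, 6) = 4.53$ quoted in the slides rather than computing an F quantile.

```python
scores = [
    (81, 89, 73, 74), (73, 79, 73, 74), (61, 86, 81, 81), (55, 70, 76, 73),
    (61, 71, 61, 66), (52, 70, 56, 58), (56, 74, 56, 56), (65, 87, 73, 69),
    (54, 76, 69, 72), (48, 71, 62, 63),
]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return x


def hotelling_one_sample(rows, mu0):
    """Return (T2, F) for testing H0: mu = mu0."""
    n, p = len(rows), len(mu0)
    xbar = [sum(col) / n for col in zip(*rows)]
    S = [[sum((r[j] - xbar[j]) * (r[k] - xbar[k]) for r in rows) / (n - 1)
          for k in range(p)] for j in range(p)]
    d = [xbar[j] - mu0[j] for j in range(p)]
    T2 = n * sum(di * yi for di, yi in zip(d, solve(S, d)))
    F = (n - p) / (p * (n - 1)) * T2
    return T2, F


n, p = 10, 4
T2, F = hotelling_one_sample(scores, [60.0] * p)

F_CRIT = 4.53                                # tabled F_0.05(4, 6) from the slides
T2_crit = p * (n - 1) / (n - p) * F_CRIT     # critical value on the T^2 scale
reject = T2 >= T2_crit
print(round(T2, 3), round(F, 3), round(T2_crit, 2), reject)
```

Comparing `F` with 4.53 or `T2` with `T2_crit` gives the same decision, since one is a fixed multiple of the other.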


Hotelling’s $T^2$ statistic and test

We reject $H_0: \vec{\mu} = \vec{\mu}_0$ in favour of $H_A: \vec{\mu} \ne \vec{\mu}_0$

if $$T^2 = n(\bar{\vec{x}} - \vec{\mu}_0)'S^{-1}(\bar{\vec{x}} - \vec{\mu}_0) \ge T_{\alpha}^2$$

where $$T_{\alpha}^2 = \frac{p(n-1)}{n-p}F_{\alpha}(p, n-p)$$


The two sample problem

Univariate Inference

Let $x_1, x_2, \dots, x_n$ denote a sample of n from the normal distribution with mean $\mu_x$ and variance $\sigma^2$.

Let $y_1, y_2, \dots, y_m$ denote a sample of m from the normal distribution with mean $\mu_y$ and variance $\sigma^2$.

Suppose we want to test

$H_0: \mu_x = \mu_y$ vs $H_A: \mu_x \ne \mu_y$

The appropriate test is the t test.

The test statistic:

$$t = \frac{\bar{x} - \bar{y}}{s_{pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$$

where $$s_{pooled} = \sqrt{\frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}}$$

Reject $H_0$ if $|t| > t_{\alpha/2}$, d.f. = n + m − 2
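The pooled t statistic takes only a few lines to compute. The sketch below uses Python's standard `statistics` module; the function name `pooled_t` and the two small samples are made up purely for illustration and do not come from the slides.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Two-sample t statistic with pooled variance, and its degrees of freedom."""
    n, m = len(x), len(y)
    # variance() is the sample variance (divisor n - 1), as in the formula above.
    s2_pooled = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = (mean(x) - mean(y)) / (math.sqrt(s2_pooled) * math.sqrt(1 / n + 1 / m))
    return t, n + m - 2


# Illustrative (made-up) samples.
x = [5.1, 4.9, 6.2, 5.8, 6.0]
y = [4.2, 4.8, 5.0, 4.5]
t, df = pooled_t(x, y)
print(round(t, 3), df)
```

The statistic is then compared with $t_{\alpha/2}$ on n + m − 2 degrees of freedom.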

The multivariate Test

Let $\vec{x}_1, \vec{x}_2, \dots, \vec{x}_n$ denote a sample of n from the p-variate normal distribution with mean vector $\vec{\mu}_x$ and covariance matrix $\Sigma$.

Let $\vec{y}_1, \vec{y}_2, \dots, \vec{y}_m$ denote a sample of m from the p-variate normal distribution with mean vector $\vec{\mu}_y$ and covariance matrix $\Sigma$.

Suppose we want to test

$H_0: \vec{\mu}_x = \vec{\mu}_y$ vs $H_A: \vec{\mu}_x \ne \vec{\mu}_y$

Hotelling’s $T^2$ statistic for the two sample problem

$$T^2 = (\bar{\vec{x}} - \bar{\vec{y}})'\left[\left(\frac{1}{n} + \frac{1}{m}\right)S_{pooled}\right]^{-1}(\bar{\vec{x}} - \bar{\vec{y}})$$

where $$S_{pooled} = \frac{n-1}{n+m-2}\,S_x + \frac{m-1}{n+m-2}\,S_y$$

if $H_0$ is true then

$$F = \frac{n+m-p-1}{p(n+m-2)}\,T^2$$

has an F distribution with $\nu_1 = p$ and $\nu_2 = n + m - p - 1$

Thus

Hotelling’s $T^2$ test

We reject $H_0: \vec{\mu}_x = \vec{\mu}_y$

if $$F = \frac{n+m-p-1}{p(n+m-2)}\,T^2 \ge F_{\alpha}(p, n+m-p-1)$$

with $$T^2 = (\bar{\vec{x}} - \bar{\vec{y}})'\left[\left(\frac{1}{n} + \frac{1}{m}\right)S_{pooled}\right]^{-1}(\bar{\vec{x}} - \bar{\vec{y}})$$

and $$S_{pooled} = \frac{n-1}{n+m-2}\,S_x + \frac{m-1}{n+m-2}\,S_y$$
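The two-sample statistic can be sketched in plain Python as follows. The function names (`two_sample_T2`, `scatter`, `solve`) and the small bivariate samples are made up for illustration only. The final lines check the special case p = 1, where $T^2$ reduces to the square of the pooled t statistic.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter(rows):
    """Sum over i of (x_i - xbar)(x_i - xbar)', i.e. (n - 1) times S."""
    xbar = mean_vec(rows)
    p = len(xbar)
    return [[sum((r[j] - xbar[j]) * (r[k] - xbar[k]) for r in rows)
             for k in range(p)] for j in range(p)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for c in range(col, size + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * size
    for i in reversed(range(size)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (M[i][size] - tail) / M[i][i]
    return x


def two_sample_T2(xs, ys):
    """Return (T2, F, (df1, df2)) for testing H0: mu_x = mu_y."""
    n, m, p = len(xs), len(ys), len(xs[0])
    Wx, Wy = scatter(xs), scatter(ys)
    # S_pooled = [(n-1) S_x + (m-1) S_y] / (n + m - 2)
    S_pooled = [[(Wx[j][k] + Wy[j][k]) / (n + m - 2) for k in range(p)]
                for j in range(p)]
    d = [a - b for a, b in zip(mean_vec(xs), mean_vec(ys))]
    # [(1/n + 1/m) S_pooled]^{-1} = (nm / (n + m)) S_pooled^{-1}
    T2 = (n * m / (n + m)) * sum(di * yi for di, yi in zip(d, solve(S_pooled, d)))
    F = (n + m - p - 1) / (p * (n + m - 2)) * T2
    return T2, F, (p, n + m - p - 1)


# Illustrative (made-up) bivariate samples.
xs = [(5.1, 2.0), (4.9, 2.4), (6.2, 3.1), (5.8, 2.9), (6.0, 2.7)]
ys = [(4.2, 2.5), (4.8, 2.9), (5.0, 3.3), (4.5, 2.6)]
T2, F, df = two_sample_T2(xs, ys)

# With only the first variable (p = 1) the statistic is the squared pooled t.
T2_first, F_first, _ = two_sample_T2([(x[0],) for x in xs], [(y[0],) for y in ys])
print(round(T2, 3), round(F, 3), df, round(T2_first, 3))
```

Because $T^2$ is the maximum of the squared t statistics over all linear combinations, the bivariate value is never smaller than the value for a single variable.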

Example 2

Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data on the four variables

• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales)

are given in the following table.

The data are given in the following table:

             Bankrupt Firms                            Nonbankrupt Firms
         x1       x2       x3      x4              x1       x2       x3      x4
Firm    CF/TD    NI/TA    CA/CL   CA/NS    Firm   CF/TD    NI/TA    CA/CL   CA/NS
  1    -0.4485  -0.4106  1.0865  0.4526      1    0.5135   0.1001  2.4871  0.5368
  2    -0.5633  -0.3114  1.5314  0.1642      2    0.0769   0.0195  2.0069  0.5304
  3     0.0643   0.0156  1.0077  0.3978      3    0.3776   0.1075  3.2651  0.3548
  4    -0.0721  -0.0930  1.4544  0.2589      4    0.1933   0.0473  2.2506  0.3309
  5    -0.1002  -0.0917  1.5644  0.6683      5    0.3248   0.0718  4.2401  0.6279
  6    -0.1421  -0.0651  0.7066  0.2794      6    0.3132   0.0511  4.4500  0.6852
  7     0.0351   0.0147  1.5046  0.7080      7    0.1184   0.0499  2.5210  0.6925
  8    -0.6530  -0.0566  1.3737  0.4032      8   -0.0173   0.0233  2.0538  0.3484
  9     0.0724  -0.0076  1.3723  0.3361      9    0.2169   0.0779  2.3489  0.3970
 10    -0.1353  -0.1433  1.4196  0.4347     10    0.1703   0.0695  1.7973  0.5174
 11    -0.2298  -0.2961  0.3310  0.1824     11    0.1460   0.0518  2.1692  0.5500
 12     0.0713   0.0205  1.3124  0.2497     12   -0.0985  -0.0123  2.5029  0.5778
 13     0.0109   0.0011  2.1495  0.6969     13    0.1398  -0.0312  0.4611  0.2643
 14    -0.2777  -0.2316  1.1918  0.6601     14    0.1379   0.0728  2.6123  0.5151
 15     0.1454   0.0500  1.8762  0.2723     15    0.1486   0.0564  2.2347  0.5563
 16     0.3703   0.1098  1.9914  0.3828     16    0.1633   0.0486  2.3080  0.1978
 17    -0.0757  -0.0821  1.5077  0.4215     17    0.2907   0.0597  1.8381  0.3786
 18     0.0451   0.0263  1.6756  0.9494     18    0.5383   0.1064  2.3293  0.4835
 19     0.0115  -0.0032  1.2602  0.6038     19   -0.3330  -0.0854  3.0124  0.4730
 20     0.1227   0.1055  1.1434  0.1655     20    0.4875   0.0910  1.2444  0.1847
 21    -0.2843  -0.2703  1.2722  0.5128     21    0.5603   0.1112  4.2918  0.4443
                                            22    0.2029   0.0792  1.9936  0.3018
                                            23    0.4746   0.1380  2.9166  0.4487
                                            24    0.1661   0.0351  2.4527  0.1370
                                            25    0.5808   0.0371  5.0594  0.1268

Hotelling’s $T^2$ test

A graphical explanation

Hotelling’s $T^2$ statistic for the two sample problem

$$T^2 = (\bar{\vec{x}} - \bar{\vec{y}})'\left[\left(\frac{1}{n} + \frac{1}{m}\right)S_{pooled}\right]^{-1}(\bar{\vec{x}} - \bar{\vec{y}})$$

where $$S_{pooled} = \frac{n-1}{n+m-2}\,S_x + \frac{m-1}{n+m-2}\,S_y$$

$$T^2 = \max_{\vec{a}} t^2(\vec{a}) = \max_{\vec{a}} \frac{\left[\vec{a}'(\bar{\vec{x}} - \bar{\vec{y}})\right]^2}{\left(\frac{1}{n} + \frac{1}{m}\right)\vec{a}'S_{pooled}\,\vec{a}}$$

Note: $$t(\vec{a}) = \frac{\vec{a}'\bar{\vec{x}} - \vec{a}'\bar{\vec{y}}}{\sqrt{\left(\frac{1}{n} + \frac{1}{m}\right)\vec{a}'S_{pooled}\,\vec{a}}}$$

is the test statistic for testing:

$H_0(\vec{a}): \vec{a}'\vec{\mu}_x = \vec{a}'\vec{\mu}_y$ vs $H_A(\vec{a}): \vec{a}'\vec{\mu}_x \ne \vec{a}'\vec{\mu}_y$

[Figures: Popn A and Popn B plotted against axes X1 and X2, in four panels: Hotelling’s $T^2$ test; Univariate test for X1; Univariate test for X2; Univariate test for a1X1 + a2X2]

Mahalanobis distance

A graphical explanation

Euclidean distance

$$d^2(\vec{a}, \vec{b}) = \sum_{i=1}^{p}(a_i - b_i)^2 = (\vec{a} - \vec{b})'(\vec{a} - \vec{b})$$

[Figure: points equidistant from $\vec{a}$]

Mahalanobis distance: ($\Sigma$, a covariance matrix)

$$d_M^2(\vec{a}, \vec{b}) = (\vec{a} - \vec{b})'\Sigma^{-1}(\vec{a} - \vec{b})$$

[Figure: points equidistant from $\vec{a}$]

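Hotelling’s $T^2$ statistic for the two sample problem

$$T^2 = (\bar{\vec{x}} - \bar{\vec{y}})'\left[\left(\frac{1}{n} + \frac{1}{m}\right)S_{pooled}\right]^{-1}(\bar{\vec{x}} - \bar{\vec{y}}) = d_M^2\left(\bar{\vec{x}}, \bar{\vec{y}};\ \frac{n+m}{nm}\,S_{pooled}\right)$$

$$= \frac{nm}{n+m}\,(\bar{\vec{x}} - \bar{\vec{y}})'S_{pooled}^{-1}(\bar{\vec{x}} - \bar{\vec{y}}) = \frac{nm}{n+m}\,d_M^2\left(\bar{\vec{x}}, \bar{\vec{y}};\ S_{pooled}\right)$$

Here $d_M^2(\vec{a}, \vec{b};\ \Sigma)$ denotes the squared Mahalanobis distance computed with covariance matrix $\Sigma$.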

[Figures: Popn A and Popn B plotted against axes X1 and X2, shown for Case I and Case II]

In Case I the Mahalanobis distance between the mean vectors is larger than in Case II, even though the Euclidean distance is smaller. In Case I there is more separation between the two bivariate normal distributions.
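A small numerical illustration of this contrast is sketched below, assuming a hypothetical common covariance matrix with correlation 0.9 and two made-up pairs of mean vectors (labelled Case I and Case II here only by analogy with the figures). The function names are illustrative.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance in two dimensions for a 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    d1, d2 = a[0] - b[0], a[1] - b[1]
    # (a - b)' cov^{-1} (a - b), using the closed-form inverse of a 2x2 matrix.
    q = (s22 * d1 * d1 - (s12 + s21) * d1 * d2 + s11 * d2 * d2) / det
    return math.sqrt(q)


cov = [[1.0, 0.9], [0.9, 1.0]]      # hypothetical covariance, correlation 0.9

# Case I: means differ across the direction of correlation.
case1 = ((0.0, 0.0), (1.0, -1.0))
# Case II: means differ along the direction of correlation.
case2 = ((0.0, 0.0), (2.0, 2.0))

e1, e2 = euclidean(*case1), euclidean(*case2)
m1, m2 = mahalanobis_2d(*case1, cov), mahalanobis_2d(*case2, cov)
print(round(e1, 3), round(e2, 3))   # Euclidean distances
print(round(m1, 3), round(m2, 3))   # Mahalanobis distances
```

In this sketch the Case I means are closer in Euclidean terms but farther apart in Mahalanobis terms, because the difference lies in a direction in which the populations vary little.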

Discrimination and Classification

Discrimination

Situation:

We have two or more populations $\pi_1$, $\pi_2$, etc.

(possibly p-variate normal).

The populations are known (or we have data from each population).

We have data for a new case (population unknown) and we want to identify the population of which the new case is a member.

Examples

Populations $\pi_1$ and $\pi_2$, followed by the measured variables X1, X2, X3, ... , Xn

1. Solvent and distressed property-liability insurance companies: total assets, cost of stocks and bonds, market value of stocks and bonds, loss expenses, surplus, amount of premiums written.
2. Nonulcer dyspeptics (those with stomach problems) and controls ("normal"): measures of anxiety, dependence, guilt, perfectionism.
3. Federalist papers written by James Madison and those written by Alexander Hamilton: frequencies of different words and length of sentences.
4. Good and poor credit risks: income, age, number of credit cards, family size, education.
5. Successful and unsuccessful (fail to graduate) college students: entrance examination scores, high-school grade-point average, number of high-school activities.
6. Purchasers and non-purchasers of a home computer: income, education, family size, previous purchase of other home computers, occupation.
7. Two species of chickweed: sepal length, petal length, petal cleft depth, bract length, scarious tip length, pollen diameter.

The Basic Problem

Suppose that the data from a new case $x_1, \dots, x_p$ has joint density function either:

$\pi_1$: $f(x_1, \dots, x_p)$ or

$\pi_2$: $g(x_1, \dots, x_p)$

We want to make the decision to

D1: Classify the case in $\pi_1$ (f is the correct distribution) or

D2: Classify the case in $\pi_2$ (g is the correct distribution)

The Two Types of Errors

1. Misclassifying the case in $\pi_1$ when it actually lies in $\pi_2$.

Let P[1|2] = P[D1|$\pi_2$] = probability of this type of error

2. Misclassifying the case in $\pi_2$ when it actually lies in $\pi_1$.

Let P[2|1] = P[D2|$\pi_1$] = probability of this type of error

This is similar to Type I and Type II errors in hypothesis testing.

Note:

A discrimination scheme is defined by splitting p-dimensional space into two regions.

1. C1 = the region where we make the decision D1.

(the decision to classify the case in $\pi_1$)

2. C2 = the region where we make the decision D2.

(the decision to classify the case in $\pi_2$)

There can be several approaches to determining the regions C1 and C2, all concerned with taking into account the probabilities of misclassification P[2|1] and P[1|2].

1. Set up the regions C1 and C2 so that one of the probabilities of misclassification, P[2|1] say, is at some low acceptable value $\alpha$. Accept the level of the other probability of misclassification, P[1|2] = $\beta$.

2. Set up the regions C1 and C2 so that the total probability of misclassification:

P[Misclassification] = P[1] P[2|1] + P[2] P[1|2]

is minimized

P[1] = P[the case belongs to $\pi_1$]

P[2] = P[the case belongs to $\pi_2$]

3. Set up the regions C1 and C2 so that the total expected cost of misclassification:

E[Cost of Misclassification]

= c2|1 P[1] P[2|1] + c1|2 P[2] P[1|2]

is minimized

P[1] = P[the case belongs to $\pi_1$]

P[2] = P[the case belongs to $\pi_2$]

c2|1 = the cost of misclassifying the case in $\pi_2$ when the case belongs to $\pi_1$.

c1|2 = the cost of misclassifying the case in $\pi_1$ when the case belongs to $\pi_2$.
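To see how the priors and costs move the classification regions, consider a hypothetical one-dimensional case: $\pi_1$ is N(0, 1), $\pi_2$ is N(2, 1), and a case is classified into $\pi_1$ when x < c. The sketch below evaluates the expected cost of misclassification on a grid of cutoffs; all names and numerical values are assumptions chosen for illustration, not taken from the slides.

```python
import math


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_cost(c, p1, p2, c21, c12, mu1=0.0, mu2=2.0):
    """Expected cost when a case goes to pi_1 if x < c and to pi_2 otherwise.

    Both populations are assumed normal with unit variance.
    """
    p21 = 1.0 - phi(c - mu1)    # P[2|1]: a pi_1 case lands at or above c
    p12 = phi(c - mu2)          # P[1|2]: a pi_2 case lands below c
    return c21 * p1 * p21 + c12 * p2 * p12


def best_cutoff(p1, p2, c21, c12):
    """Grid search for the cutoff with the smallest expected cost."""
    grid = [i / 100.0 for i in range(-300, 501)]
    return min(grid, key=lambda c: expected_cost(c, p1, p2, c21, c12))


equal = best_cutoff(0.5, 0.5, 1.0, 1.0)     # equal priors, equal costs
costly = best_cutoff(0.5, 0.5, 1.0, 10.0)   # c1|2 ten times larger than c2|1
print(equal, costly)
```

With equal priors and costs the cutoff sits midway between the two means. Making c1|2 larger shrinks the region C1, so fewer cases from $\pi_2$ are classified into $\pi_1$.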

4. Set up the regions C1 and C2 so that the two types of error are equal:

P[2|1] = P[1|2]

Computer security:

$\pi_1$: Valid users

$\pi_2$: Imposters

P[1] = P[valid user]

P[2] = P[imposter]

P[2|1] = P[identifying a valid user as an imposter]

P[1|2] = P[identifying an imposter as a valid user]

c2|1 = the cost of identifying the user as an imposter when the user is a valid user.

c1|2 = the cost of identifying the user as a valid user when the user is an imposter.

This problem can be viewed as a hypothesis testing problem

$H_0$: $\pi_1$ is the correct population

$H_A$: $\pi_2$ is the correct population

P[2|1] = $\alpha$

P[1|2] = $\beta$

Power = 1 − $\beta$

The Neyman-Pearson Lemma

Suppose that the data $x_1, \dots, x_n$ has joint density function

$f(x_1, \dots, x_n; \theta)$

where $\theta$ is either $\theta_1$ or $\theta_2$. Let

$g(x_1, \dots, x_n) = f(x_1, \dots, x_n; \theta_1)$ and

$h(x_1, \dots, x_n) = f(x_1, \dots, x_n; \theta_2)$

We want to test

$H_0$: $\theta = \theta_1$ (g is the correct distribution) against

$H_A$: $\theta = \theta_2$ (h is the correct distribution)

The Neyman-Pearson Lemma states that the Uniformly Most Powerful (UMP) test of size $\alpha$ is to reject $H_0$ if:

$$\lambda = \frac{L(\theta_2)}{L(\theta_1)} = \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} \ge k$$

and accept $H_0$ if:

$$\lambda = \frac{L(\theta_2)}{L(\theta_1)} = \frac{h(x_1, \dots, x_n)}{g(x_1, \dots, x_n)} < k$$

where k is chosen so that the test is of size $\alpha$.
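As an illustration of the rule, take two hypothetical simple hypotheses for independent observations: g is the N(0, 1) density under $H_0$ and h is the N(1, 1) density under $H_A$. The sketch below uses Python's standard `statistics.NormalDist`; the function names and the choice of densities are assumptions made for this example only.

```python
import math
from statistics import NormalDist

G = NormalDist(0.0, 1.0)     # hypothetical density g under H0
H = NormalDist(1.0, 1.0)     # hypothetical density h under HA


def log_likelihood_ratio(sample):
    """ln[h(x_1..x_n) / g(x_1..x_n)] for independent observations."""
    return sum(math.log(H.pdf(x)) - math.log(G.pdf(x)) for x in sample)


def log_k_for_size(alpha, n):
    """ln k for a size-alpha test in this particular normal example.

    Here ln(h/g) = n * (xbar - 0.5), and under H0 xbar ~ N(0, 1/n), so
    rejecting when xbar >= z_alpha / sqrt(n) gives a test of size alpha.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return n * (z / math.sqrt(n) - 0.5)


def reject_h0(sample, alpha=0.05):
    return log_likelihood_ratio(sample) >= log_k_for_size(alpha, len(sample))


print(reject_h0([1.0, 1.2, 0.9, 1.1]), reject_h0([0.1, -0.3, 0.4, 0.2]))
```

Working with ln λ and ln k avoids numerical underflow for larger samples and leaves the decision unchanged, since the logarithm is increasing.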

Page 80: Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from the normal distribution with mean  and variance

Proof: Let C be the critical region of any test of size . Let

1*

11

, ,, ,

, ,n

nn

h x xC x x k

g x x

*

1 1, , n n

C

g x x dx dx

1 1, , n n

C

g x x dx dx

Note: * * *C C C C C

* *C C C C C

We want to show that

*

1 1, , n n

C

h x x dx dx

1 1, , n n

C

h x x dx dx

Page 81: Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from the normal distribution with mean  and variance

hence *

1 1, , n n

C

g x x dx dx

1 1, , n n

C

g x x dx dx and

*

1 1, , n n

C C

g x x dx dx

*

1 1, , n n

C C

g x x dx dx

*

1 1, , n n

C C

g x x dx dx

*

1 1, , n n

C C

g x x dx dx

Thus *

1 1, , n n

C C

g x x dx dx

*

1 1, , n n

C C

g x x dx dx

Page 82: Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from the normal distribution with mean  and variance

*C*C C*C C

C

*C C

*

1 1, , n n

C C

g x x dx dx

*

1 1, , n n

C C

g x x dx dx

Page 83: Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from the normal distribution with mean  and variance

and

*

1 1, , n n

C C

g x x dx dx

*

1 1, , n n

C C

g x x dx dx

*

1 1

1, , n n

C C

h x x dx dxk

*1 1

1since , , , , in .n ng x x h x x C

k

*

1 1

1, , n n

C C

h x x dx dxk

*1 1

1since , , , , in .n ng x x h x x C

k

Page 84: Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from the normal distribution with mean  and variance

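Thus

$$\int \cdots \int_{C^* \cap \bar{C}} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n \ge \int \cdots \int_{\bar{C}^* \cap C} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n$$

and

$$\int \cdots \int_{C^*} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n \ge \int \cdots \int_{C} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n$$

when we add the common quantity

$$\int \cdots \int_{C^* \cap C} h(x_1, \dots, x_n)\, dx_1 \cdots dx_n$$

to both sides. Q.E.D.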
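The lemma can be checked by brute force on a small discrete sample space, where the integrals become sums. The sketch below uses made-up probabilities (the lists `g` and `h` and the constant `k` are assumptions chosen only for illustration) and compares the power of the likelihood-ratio region $C^*$ with the power of every other region of the same size.

```python
from fractions import Fraction as F
from itertools import combinations

# Hypothetical distributions on the sample space {0, 1, 2, 3, 4}.
g = [F(1, 10), F(2, 10), F(3, 10), F(3, 10), F(1, 10)]  # density under H0
h = [F(4, 10), F(3, 10), F(1, 10), F(1, 10), F(1, 10)]  # density under HA
k = F(3, 2)

points = range(len(g))

# Likelihood-ratio critical region C* = {x : h(x)/g(x) >= k}.
c_star = {x for x in points if h[x] / g[x] >= k}
alpha = sum(g[x] for x in c_star)       # size of the test
power_star = sum(h[x] for x in c_star)  # power of the test


def all_regions(pts):
    """Yield every subset of the sample space as a candidate critical region."""
    pts = list(pts)
    for r in range(len(pts) + 1):
        for combo in combinations(pts, r):
            yield set(combo)


# Largest power among all critical regions C with exactly the same size alpha.
best_rival_power = max(
    sum(h[x] for x in c)
    for c in all_regions(points)
    if sum(g[x] for x in c) == alpha
)

print("C* =", sorted(c_star), "size =", alpha, "power =", power_star)
print("best power among regions of the same size =", best_rival_power)
```

Exact fractions are used so that the "same size" comparison is not affected by floating-point rounding.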


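Fisher's Linear Discriminant Function

Suppose that $\mathbf{x} = (x_1, \dots, x_p)'$ is either data from a p-variate Normal distribution with mean vector $\boldsymbol{\mu}_1$:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}$$

or from a p-variate Normal distribution with mean vector $\boldsymbol{\mu}_2$:

$$g(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}$$

The covariance matrix $\Sigma$ is the same for both populations $\pi_1$ and $\pi_2$.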


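The Neyman-Pearson Lemma states that we should classify into populations $\pi_1$ and $\pi_2$ using:

$$\lambda = \frac{f(\mathbf{x})}{g(\mathbf{x})} = \frac{\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}}{\frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2)}} = e^{\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1)}$$

That is, make the decision

D1 : population is $\pi_1$

if $\lambda \ge k$, or

$$\ln \lambda = \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \ge \ln k$$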


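or

$$(\mathbf{x} - \boldsymbol{\mu}_2)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) - (\mathbf{x} - \boldsymbol{\mu}_1)' \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \ge 2 \ln k$$

or

$$\mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_2' \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 - \left( \mathbf{x}' \Sigma^{-1} \mathbf{x} - 2 \boldsymbol{\mu}_1' \Sigma^{-1} \mathbf{x} + \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 \right) \ge 2 \ln k$$

and

$$(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x} \ge \ln k + \tfrac{1}{2} \left( \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right)$$

Finally we make the decision

D1 : population is $\pi_1$

if

$$\mathbf{a}' \mathbf{x} \ge K$$

where

$$\mathbf{a} = \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \quad \text{and} \quad K = \ln k + \tfrac{1}{2} \left( \boldsymbol{\mu}_1' \Sigma^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2' \Sigma^{-1} \boldsymbol{\mu}_2 \right)$$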
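The decision rule derived above can be sketched in a few lines for the bivariate case (p = 2). The helper names (`inv2`, `fisher_rule`, `classify`) and the numerical values of the mean vectors and covariance matrix are hypothetical, chosen only so that the result is easy to check by hand; the choice k = 1 is also an assumption.

```python
from math import log


def inv2(m):
    """Inverse of a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product u'v."""
    return sum(ui * vi for ui, vi in zip(u, v))


def fisher_rule(mu1, mu2, sigma, k=1.0):
    """Coefficients a and cutoff K of the rule 'population 1 if a'x >= K'."""
    sinv = inv2(sigma)
    a = matvec(sinv, [m1 - m2 for m1, m2 in zip(mu1, mu2)])
    K = log(k) + 0.5 * (dot(mu1, matvec(sinv, mu1)) - dot(mu2, matvec(sinv, mu2)))
    return a, K


def classify(x, a, K):
    """Return 1 for population 1 and 2 for population 2."""
    return 1 if dot(a, x) >= K else 2


# Hypothetical parameters: with these values a = (2, -1) and K = 3.
mu1, mu2 = [3.0, 0.0], [0.0, 0.0]
sigma = [[2.0, 1.0], [1.0, 2.0]]
a, K = fisher_rule(mu1, mu2, sigma, k=1.0)
print("a =", a, "K =", K)
print(classify([3.0, 0.0], a, K), classify([0.0, 0.0], a, K))
```

Larger values of k move the cutoff K upward and so make the rule more reluctant to assign an observation to population 1.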


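The function

$$\mathbf{a}' \mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x}$$

is called Fisher's linear discriminant function.

The boundary $\mathbf{a}' \mathbf{x} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \Sigma^{-1} \mathbf{x} = K$ separates the regions assigned to $\pi_1$ and $\pi_2$.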


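In the case where the population parameters are unknown but estimated from data, Fisher's linear discriminant function is

$$\hat{\mathbf{a}}' \mathbf{x} = (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)' S^{-1} \mathbf{x}$$

where $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$ and $S$ are the sample estimates of $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and $\Sigma$.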


[Figure: A pictorial representation of Fisher's procedure for two populations. Axes $x_1$ and $x_2$; the two regions are labelled "Classify as $\pi_1$" and "Classify as $\pi_2$".]


Example 1

$\pi_1$ : Riding-mower owners, $\pi_2$ : Nonowners

    Riding-mower owners                 Nonowners
x1 (Income      x2 (Lot size        x1 (Income      x2 (Lot size
in $1000s)      in 1000 sq ft)      in $1000s)      in 1000 sq ft)
   20.0             9.2                25.0             9.8
   28.5             8.4                17.6            10.4
   21.6            10.8                21.6             8.6
   20.5            10.4                14.4            10.2
   29.0            11.8                28.0             8.8
   36.7             9.6                16.4             8.8
   36.0             8.8                19.8             8.0
   27.6            11.2                22.0             9.2
   23.0            10.0                15.8             8.2
   31.0            10.4                11.0             9.4
   17.0            11.0                17.0             7.0
   27.0            10.0                21.0             7.4
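As a sketch of how the sample version of Fisher's linear discriminant function could be applied to the riding-mower data above, the code below estimates the two mean vectors and the covariance matrix and then classifies each household. Two choices here are assumptions of this example rather than statements from the slides: $S$ is taken to be the pooled sample covariance matrix, and the cutoff uses k = 1, which places the boundary midway between the two sample means.

```python
# Riding-mower data from Example 1: (income in $1000s, lot size in 1000 sq ft).
owners = [(20.0, 9.2), (28.5, 8.4), (21.6, 10.8), (20.5, 10.4),
          (29.0, 11.8), (36.7, 9.6), (36.0, 8.8), (27.6, 11.2),
          (23.0, 10.0), (31.0, 10.4), (17.0, 11.0), (27.0, 10.0)]
nonowners = [(25.0, 9.8), (17.6, 10.4), (21.6, 8.6), (14.4, 10.2),
             (28.0, 8.8), (16.4, 8.8), (19.8, 8.0), (22.0, 9.2),
             (15.8, 8.2), (11.0, 9.4), (17.0, 7.0), (21.0, 7.4)]


def mean_vec(rows):
    """Sample mean vector of bivariate observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def sscp(rows, m):
    """Sums of squares and cross-products about the mean m (2x2 matrix)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def dot(u, v):
    """Inner product u'v."""
    return sum(ui * vi for ui, vi in zip(u, v))


xbar1, xbar2 = mean_vec(owners), mean_vec(nonowners)
A1, A2 = sscp(owners, xbar1), sscp(nonowners, xbar2)

# Assumption: pooled covariance estimate with n1 + n2 - 2 degrees of freedom.
dof = len(owners) + len(nonowners) - 2
S = [[(A1[i][j] + A2[i][j]) / dof for j in range(2)] for i in range(2)]

det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
S_inv = [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

# a = S^{-1}(xbar1 - xbar2); with k = 1 the cutoff is K = a'(xbar1 + xbar2)/2.
diff = [xbar1[j] - xbar2[j] for j in range(2)]
a = [dot(row, diff) for row in S_inv]
K = 0.5 * dot(a, [xbar1[j] + xbar2[j] for j in range(2)])


def classify(x):
    """Classify a household from (income, lot size)."""
    return "owner" if dot(a, x) >= K else "nonowner"


misclassified = (sum(classify(x) != "owner" for x in owners)
                 + sum(classify(x) != "nonowner" for x in nonowners))
print("a =", a, "K =", K, "misclassified:", misclassified)
```

The count of misclassified households is computed on the same data used to estimate the rule, so it tends to understate the error rate on new households.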


[Figure: scatter plot of lot size (in thousands of square feet) against income (in thousands of dollars) for riding-mower owners and nonowners.]


Example 2

Annual financial data are collected for firms approximately 2 years prior to bankruptcy and for financially sound firms at about the same point in time. The data are on the four variables

• x1 = CF/TD = (cash flow)/(total debt),
• x2 = NI/TA = (net income)/(total assets),
• x3 = CA/CL = (current assets)/(current liabilities), and
• x4 = CA/NS = (current assets)/(net sales).


The data are given in the following table:

            Bankrupt Firms                           Nonbankrupt Firms
           x1       x2       x3       x4               x1       x2       x3       x4
Firm    CF/TD    NI/TA    CA/CL    CA/NS    Firm    CF/TD    NI/TA    CA/CL    CA/NS
   1  -0.4485  -0.4106   1.0865   0.4526       1   0.5135   0.1001   2.4871   0.5368
   2  -0.5633  -0.3114   1.5314   0.1642       2   0.0769   0.0195   2.0069   0.5304
   3   0.0643   0.0156   1.0077   0.3978       3   0.3776   0.1075   3.2651   0.3548
   4  -0.0721  -0.0930   1.4544   0.2589       4   0.1933   0.0473   2.2506   0.3309
   5  -0.1002  -0.0917   1.5644   0.6683       5   0.3248   0.0718   4.2401   0.6279
   6  -0.1421  -0.0651   0.7066   0.2794       6   0.3132   0.0511   4.4500   0.6852
   7   0.0351   0.0147   1.5046   0.7080       7   0.1184   0.0499   2.5210   0.6925
   8  -0.6530  -0.0566   1.3737   0.4032       8  -0.0173   0.0233   2.0538   0.3484
   9   0.0724  -0.0076   1.3723   0.3361       9   0.2169   0.0779   2.3489   0.3970
  10  -0.1353  -0.1433   1.4196   0.4347      10   0.1703   0.0695   1.7973   0.5174
  11  -0.2298  -0.2961   0.3310   0.1824      11   0.1460   0.0518   2.1692   0.5500
  12   0.0713   0.0205   1.3124   0.2497      12  -0.0985  -0.0123   2.5029   0.5778
  13   0.0109   0.0011   2.1495   0.6969      13   0.1398  -0.0312   0.4611   0.2643
  14  -0.2777  -0.2316   1.1918   0.6601      14   0.1379   0.0728   2.6123   0.5151
  15   0.1454   0.0500   1.8762   0.2723      15   0.1486   0.0564   2.2347   0.5563
  16   0.3703   0.1098   1.9914   0.3828      16   0.1633   0.0486   2.3080   0.1978
  17  -0.0757  -0.0821   1.5077   0.4215      17   0.2907   0.0597   1.8381   0.3786
  18   0.0451   0.0263   1.6756   0.9494      18   0.5383   0.1064   2.3293   0.4835
  19   0.0115  -0.0032   1.2602   0.6038      19  -0.3330  -0.0854   3.0124   0.4730
  20   0.1227   0.1055   1.1434   0.1655      20   0.4875   0.0910   1.2444   0.1847
  21  -0.2843  -0.2703   1.2722   0.5128      21   0.5603   0.1112   4.2918   0.4443
                                              22   0.2029   0.0792   1.9936   0.3018
                                              23   0.4746   0.1380   2.9166   0.4487
                                              24   0.1661   0.0351   2.4527   0.1370
                                              25   0.5808   0.0371   5.0594   0.1268