Discrete Multivariate Analysis


Page 1: Discrete Multivariate Analysis

Discrete Multivariate Analysis

Analysis of Multivariate Categorical Data

Page 2: Discrete Multivariate Analysis

References

1. Fienberg, S. (1980), The Analysis of Cross-Classified Categorical Data, MIT Press, Cambridge, Mass.

2. Fingleton, B. (1984), Models of Category Counts, Cambridge University Press, Cambridge.

3. Agresti, A. (1990), Categorical Data Analysis, Wiley, New York.

Page 3: Discrete Multivariate Analysis

Log Linear Model

Page 4: Discrete Multivariate Analysis

Two-way table

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(i,j)}$$

where

$$\sum_i u_{1(i)} = \sum_j u_{2(j)} = \sum_i u_{12(i,j)} = \sum_j u_{12(i,j)} = 0$$

Note: X and Y are independent if

$$u_{12(i,j)} = 0 \quad \text{for all } i, j$$

In this case the log-linear model becomes

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$$
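The sum-to-zero constraints determine the u-terms uniquely from the log expected frequencies, which can be sketched as follows. This is only an illustration: the function name `loglinear_terms` and the expected counts are hypothetical, and the table is built from row and column totals so that X and Y are independent by construction.

```python
import math

def loglinear_terms(mu):
    """Split ln(mu[i][j]) into u, u1[i], u2[j], u12[i][j] under sum-to-zero constraints."""
    r, c = len(mu), len(mu[0])
    logs = [[math.log(v) for v in row] for row in mu]
    u = sum(sum(row) for row in logs) / (r * c)                          # overall mean
    u1 = [sum(logs[i]) / c - u for i in range(r)]                        # row main effects
    u2 = [sum(logs[i][j] for i in range(r)) / r - u for j in range(c)]   # column main effects
    u12 = [[logs[i][j] - u - u1[i] - u2[j] for j in range(c)] for i in range(r)]
    return u, u1, u2, u12

# Hypothetical expected counts of the form (row total * column total) / N.
row_totals, col_totals = [30.0, 70.0], [20.0, 30.0, 50.0]
n = sum(row_totals)
mu_indep = [[a * b / n for b in col_totals] for a in row_totals]

u, u1, u2, u12 = loglinear_terms(mu_indep)
max_interaction = max(abs(v) for row in u12 for v in row)   # zero up to rounding error
```

For a table that is not of product form, the same function returns non-zero interaction terms.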

Page 5: Discrete Multivariate Analysis

Three-way Frequency Tables

Page 6: Discrete Multivariate Analysis

Log-Linear model for three-way tables

Let $\mu_{ijk}$ denote the expected frequency in cell (i,j,k) of the table; then in general

$$\ln \mu_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(i,j)} + u_{13(i,k)} + u_{23(j,k)} + u_{123(i,j,k)}$$

where

$$\sum_i u_{1(i)} = \sum_j u_{2(j)} = \sum_k u_{3(k)} = 0$$

$$\sum_i u_{12(i,j)} = \sum_j u_{12(i,j)} = 0, \quad \sum_i u_{13(i,k)} = \sum_k u_{13(i,k)} = 0, \quad \sum_j u_{23(j,k)} = \sum_k u_{23(j,k)} = 0$$

$$\sum_i u_{123(i,j,k)} = \sum_j u_{123(i,j,k)} = \sum_k u_{123(i,j,k)} = 0$$

Page 7: Discrete Multivariate Analysis

Hierarchical Log-linear models for categorical Data

For three way tables

The hierarchical principle:

If an interaction is in the model, also keep the lower-order interactions and main effects associated with that interaction. For example, a model containing $u_{12(i,j)}$ must also contain $u_{1(i)}$ and $u_{2(j)}$.

Page 8: Discrete Multivariate Analysis

Hierarchical Log-linear models for 3 way table

Model: Description

[1][2][3]: Mutual independence of all three variables.

[1][23]: Variable 1 is independent of variables 2 and 3 jointly.

[2][13]: Variable 2 is independent of variables 1 and 3 jointly.

[3][12]: Variable 3 is independent of variables 1 and 2 jointly.

[12][13]: Conditional independence between variables 2 and 3 given variable 1.

[12][23]: Conditional independence between variables 1 and 3 given variable 2.

[13][23]: Conditional independence between variables 1 and 2 given variable 3.

[12][13][23]: Pairwise relations among all three variables, with each two-variable interaction unaffected by the value of the third variable.

[123]: The saturated model, containing all main effects and all interactions, including the three-factor interaction.
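The bracket notation can be expanded mechanically with the hierarchical principle. The sketch below is illustrative only: the helper names `expand_generators` and `count_parameters` and the numbers of categories are hypothetical. Variables are written as integers and each bracket as a tuple, so [12][13] becomes `[(1, 2), (1, 3)]`.

```python
from itertools import combinations

def expand_generators(generators):
    """List every u-term implied by the brackets, e.g. (1, 2) implies (1,), (2,), (1, 2)."""
    terms = set()
    for gen in generators:
        for size in range(1, len(gen) + 1):
            for combo in combinations(gen, size):
                terms.add(combo)
    return sorted(terms, key=lambda t: (len(t), t))

def count_parameters(generators, levels):
    """Count independent parameters under the sum-to-zero constraints (including u)."""
    total = 1  # the overall mean u
    for term in expand_generators(generators):
        free = 1
        for v in term:
            free *= levels[v] - 1   # a variable with L categories contributes L - 1
        total += free
    return total

levels = {1: 2, 2: 4, 3: 4}          # hypothetical numbers of categories
model = [(1, 2), (1, 3)]             # the model [12][13]
terms = expand_generators(model)     # main effects 1, 2, 3 and interactions 12, 13
n_params = count_parameters(model, levels)
```

With these category counts the saturated model [123] uses as many parameters as there are cells (2 x 4 x 4 = 32), which is why it reproduces the table exactly.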

Page 9: Discrete Multivariate Analysis

Maximum Likelihood Estimation

Log-Linear Model

Page 10: Discrete Multivariate Analysis

For any model it is possible to determine the maximum likelihood estimators of the parameters.

Example

Two-way table – independence – multinomial model

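$$f(x_{11}, x_{12}, \ldots, x_{rc}) = \frac{N!}{x_{11}!\, x_{12}! \cdots x_{rc}!}\, \pi_{11}^{x_{11}} \pi_{12}^{x_{12}} \cdots \pi_{rc}^{x_{rc}}$$

$$= \frac{N!}{x_{11}!\, x_{12}! \cdots x_{rc}!} \left(\frac{\mu_{11}}{N}\right)^{x_{11}} \left(\frac{\mu_{12}}{N}\right)^{x_{12}} \cdots \left(\frac{\mu_{rc}}{N}\right)^{x_{rc}}$$

where $E[x_{ij}] = \mu_{ij} = N\pi_{ij}$, or $\pi_{ij} = \mu_{ij}/N$.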

Page 11: Discrete Multivariate Analysis

Log-likelihood

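$$l(\mu_{11}, \mu_{12}, \ldots, \mu_{rc}) = \ln N! - \sum_i \sum_j \ln x_{ij}! + \sum_i \sum_j x_{ij} \ln \mu_{ij} - \sum_i \sum_j x_{ij} \ln N$$

$$= K + \sum_i \sum_j x_{ij} \ln \mu_{ij}$$

where

$$K = \ln N! - \sum_i \sum_j \ln x_{ij}! - N \ln N$$

With the model of independence

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$$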

Page 12: Discrete Multivariate Analysis

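and

$$l(u, u_{1(1)}, \ldots, u_{1(r)}, u_{2(1)}, \ldots, u_{2(c)}) = K + \sum_i \sum_j x_{ij} \left(u + u_{1(i)} + u_{2(j)}\right)$$

$$= K + Nu + \sum_i x_{i\cdot}\, u_{1(i)} + \sum_j x_{\cdot j}\, u_{2(j)}$$

where $x_{i\cdot} = \sum_j x_{ij}$ and $x_{\cdot j} = \sum_i x_{ij}$ are the row and column totals, with

$$\sum_i u_{1(i)} = \sum_j u_{2(j)} = 0$$

also

$$\sum_i \sum_j \mu_{ij} = \sum_i \sum_j e^{u + u_{1(i)} + u_{2(j)}} = e^{u} \sum_i e^{u_{1(i)}} \sum_j e^{u_{2(j)}} = N$$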

Page 13: Discrete Multivariate Analysis

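Let

$$g(u, u_{1(1)}, \ldots, u_{1(r)}, u_{2(1)}, \ldots, u_{2(c)}, \lambda_1, \lambda_2, \lambda) = K + Nu + \sum_i x_{i\cdot}\, u_{1(i)} + \sum_j x_{\cdot j}\, u_{2(j)}$$

$$\qquad + \lambda_1 \sum_i u_{1(i)} + \lambda_2 \sum_j u_{2(j)} + \lambda \left( e^{u} \sum_i e^{u_{1(i)}} \sum_j e^{u_{2(j)}} - N \right)$$

Now

$$\frac{\partial g}{\partial u} = N + \lambda\, e^{u} \sum_i e^{u_{1(i)}} \sum_j e^{u_{2(j)}} = N(1 + \lambda) = 0$$

so $\lambda = -1$.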

Page 14: Discrete Multivariate Analysis

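Using $\lambda = -1$ and $e^{u} \sum_i e^{u_{1(i)}} \sum_j e^{u_{2(j)}} = N$:

$$\frac{\partial g}{\partial u_{1(i)}} = x_{i\cdot} + \lambda_1 + \lambda\, e^{u} e^{u_{1(i)}} \sum_j e^{u_{2(j)}} = x_{i\cdot} + \lambda_1 - N \frac{e^{u_{1(i)}}}{\sum_{i'} e^{u_{1(i')}}} = 0$$

so

$$\frac{e^{u_{1(i)}}}{\sum_{i'} e^{u_{1(i')}}} = \frac{x_{i\cdot}}{N} + \frac{\lambda_1}{N}$$

Since the left-hand side sums to 1 over $i$,

$$1 = \sum_i \frac{x_{i\cdot}}{N} + \frac{r \lambda_1}{N} = 1 + \frac{r \lambda_1}{N} \quad \text{and} \quad \lambda_1 = 0$$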

Page 15: Discrete Multivariate Analysis

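Now

$$e^{u_{1(i)}} = x_{i\cdot}\, K$$

(here $K$ denotes the constant $\frac{1}{N} \sum_i e^{u_{1(i)}}$, which does not depend on $i$; it is not the constant $K$ of the log-likelihood)

or

$$u_{1(i)} = \ln x_{i\cdot} + \ln K$$

$$\sum_i u_{1(i)} = \sum_i \ln x_{i\cdot} + r \ln K = 0$$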

Page 16: Discrete Multivariate Analysis

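Hence

$$\ln K = -\frac{1}{r} \sum_i \ln x_{i\cdot}$$

and

$$u_{1(i)} = \ln x_{i\cdot} - \frac{1}{r} \sum_i \ln x_{i\cdot}$$

Similarly

$$u_{2(j)} = \ln x_{\cdot j} - \frac{1}{c} \sum_j \ln x_{\cdot j}$$

Finally

$$\sum_i \sum_j \mu_{ij} = \sum_i \sum_j e^{u + u_{1(i)} + u_{2(j)}} = e^{u} \sum_i e^{u_{1(i)}} \sum_j e^{u_{2(j)}} = N$$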

Page 17: Discrete Multivariate Analysis

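Hence

$$e^{u} = \frac{N}{\sum_i e^{u_{1(i)}} \sum_j e^{u_{2(j)}}}$$

Now

$$e^{u_{1(i)}} = \frac{x_{i\cdot}}{\left( \prod_{i=1}^{r} x_{i\cdot} \right)^{1/r}} \quad \text{and} \quad e^{u_{2(j)}} = \frac{x_{\cdot j}}{\left( \prod_{j=1}^{c} x_{\cdot j} \right)^{1/c}}$$

so

$$e^{u} = \frac{N \left( \prod_{i=1}^{r} x_{i\cdot} \right)^{1/r} \left( \prod_{j=1}^{c} x_{\cdot j} \right)^{1/c}}{\sum_i x_{i\cdot} \sum_j x_{\cdot j}} = \frac{1}{N} \left( \prod_{i=1}^{r} x_{i\cdot} \right)^{1/r} \left( \prod_{j=1}^{c} x_{\cdot j} \right)^{1/c}$$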

Page 18: Discrete Multivariate Analysis

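Hence

$$u = \frac{1}{r} \sum_i \ln x_{i\cdot} + \frac{1}{c} \sum_j \ln x_{\cdot j} - \ln N$$

Note

$$\ln \mu_{ij} = u + u_{1(i)} + u_{2(j)}$$

$$= \frac{1}{r} \sum_i \ln x_{i\cdot} + \frac{1}{c} \sum_j \ln x_{\cdot j} - \ln N + \ln x_{i\cdot} - \frac{1}{r} \sum_i \ln x_{i\cdot} + \ln x_{\cdot j} - \frac{1}{c} \sum_j \ln x_{\cdot j}$$

$$= \ln x_{i\cdot} + \ln x_{\cdot j} - \ln N$$

or

$$\hat{\mu}_{ij} = \frac{x_{i\cdot}\, x_{\cdot j}}{N}$$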
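The end result of this derivation can be checked numerically: the estimated u-terms, computed from logs of the row and column totals, should reproduce fitted values equal to (row total x column total) / N. The sketch below assumes a small hypothetical table of counts; the function name `independence_mle` is likewise illustrative.

```python
import math

def independence_mle(table):
    """Closed-form ML estimates of u, u1(i), u2(j) for a two-way table under independence."""
    r, c = len(table), len(table[0])
    row = [sum(table[i]) for i in range(r)]                       # row totals
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]  # column totals
    n = sum(row)
    mean_log_row = sum(math.log(v) for v in row) / r
    mean_log_col = sum(math.log(v) for v in col) / c
    u = mean_log_row + mean_log_col - math.log(n)
    u1 = [math.log(v) - mean_log_row for v in row]
    u2 = [math.log(v) - mean_log_col for v in col]
    return u, u1, u2, row, col, n

counts = [[10, 20, 30], [15, 5, 20]]   # hypothetical observed counts
u, u1, u2, row, col, n = independence_mle(counts)

# Fitted values rebuilt from the u-terms; each should equal row[i] * col[j] / n.
fitted = [[math.exp(u + a + b) for b in u2] for a in u1]
```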

Page 19: Discrete Multivariate Analysis

Comments

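• Maximum likelihood estimates can be computed for any hierarchical log-linear model, including models for more than two variables.

• In certain situations the likelihood equations have no closed-form solution and need to be solved numerically.

• For the saturated model (all interactions and main effects), the fitted values equal the observed counts: $\hat{\mu}_{ijk} = x_{ijk}$.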
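One standard numerical method for such cases is iterative proportional fitting (IPF), which repeatedly rescales the fitted table to match each observed margin named in the model. The slides do not say which algorithm was used, so the sketch below is only an illustration of the idea for the model [12][13][23]. The function name, the 2 x 2 x 2 counts, and the fixed number of cycles are all assumptions, and the code assumes every observed two-way margin is positive.

```python
def ipf_no_three_factor(x, cycles=100):
    """Fit the model [12][13][23] to counts x[i][j][k] by iterative proportional fitting."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    m = [[[1.0 for _ in range(K)] for _ in range(J)] for _ in range(I)]
    for _ in range(cycles):
        # Rescale so the fitted (1,2) margin equals the observed (1,2) margin.
        for i in range(I):
            for j in range(J):
                scale = sum(x[i][j]) / sum(m[i][j])
                for k in range(K):
                    m[i][j][k] *= scale
        # Rescale to match the (1,3) margin.
        for i in range(I):
            for k in range(K):
                obs = sum(x[i][j][k] for j in range(J))
                fit = sum(m[i][j][k] for j in range(J))
                for j in range(J):
                    m[i][j][k] *= obs / fit
        # Rescale to match the (2,3) margin.
        for j in range(J):
            for k in range(K):
                obs = sum(x[i][j][k] for i in range(I))
                fit = sum(m[i][j][k] for i in range(I))
                for i in range(I):
                    m[i][j][k] *= obs / fit
    return m

counts = [[[10, 20], [30, 40]], [[15, 25], [35, 5]]]   # hypothetical 2 x 2 x 2 table
fitted = ipf_no_three_factor(counts)
```

A production implementation would stop on a convergence tolerance rather than a fixed cycle count.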

Page 20: Discrete Multivariate Analysis

Goodness of Fit Statistics

These statistics can be used to check whether a log-linear model fits the observed frequency table.

Page 21: Discrete Multivariate Analysis

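Goodness of Fit Statistics

The Chi-squared statistic:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = \sum_{i,j,k} \frac{\left( x_{ijk} - \hat{\mu}_{ijk} \right)^2}{\hat{\mu}_{ijk}}$$

The Likelihood Ratio statistic:

$$G^2 = 2 \sum \text{Observed} \, \ln \frac{\text{Observed}}{\text{Expected}} = 2 \sum_{i,j,k} x_{ijk} \ln \frac{x_{ijk}}{\hat{\mu}_{ijk}}$$

d.f. = # cells - # parameters fitted

We reject the model if $\chi^2$ or $G^2$ is greater than $\chi^2_{\alpha}$, the upper-$\alpha$ critical value of the chi-squared distribution with these degrees of freedom.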
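Both statistics are simple sums over cells once the fitted (expected) values are available. The following is a minimal sketch: the function names and the observed and expected values are hypothetical, and a cell with an observed count of zero is taken to contribute zero to the likelihood ratio statistic.

```python
import math

def pearson_chi2(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    """Likelihood ratio statistic: 2 * sum of observed * ln(observed / expected)."""
    # Cells with observed == 0 are skipped, using the limit x ln x -> 0 as x -> 0.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

observed = [12, 8, 30, 50]             # hypothetical cell counts, flattened
expected = [10.0, 10.0, 32.0, 48.0]    # hypothetical fitted values with the same total

chi2 = pearson_chi2(observed, expected)
g2 = likelihood_ratio_g2(observed, expected)
```

The two statistics are usually close when the model fits and the expected counts are not small.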

Page 22: Discrete Multivariate Analysis

Example: Variables

1. Systolic Blood Pressure (B)
2. Serum Cholesterol (C)
3. Coronary Heart Disease (H)

Coronary Heart   Serum Cholesterol   Systolic Blood Pressure (mm Hg)
Disease          (mg/100 cc)          <127   127-146   147-166   167+
Present          <200                    2         3         3      4
                 200-219                 3         2         0      3
                 220-259                 8        11         6      6
                 260+                    7        12        11     11
Absent           <200                  117       121        47     22
                 200-219                85        98        43     20
                 220-259               119       209        68     43
                 260+                   67        99        46     33

Page 23: Discrete Multivariate Analysis

Goodness of fit testing of Models

MODEL        DF   LIKELIHOOD-    PROB.     PEARSON   PROB.
                  RATIO CHISQ              CHISQ
B,C,H.       24      83.15       0.0000    102.00    0.0000
B,CH.        21      51.23       0.0002     56.89    0.0000
C,BH.        21      59.59       0.0000     60.43    0.0000
H,BC.        15      58.73       0.0000     64.78    0.0000
BC,BH.       12      35.16       0.0004     33.76    0.0007
BH,CH.       18      27.67       0.0673     26.58    0.0872   n.s.
CH,BC.       12      26.80       0.0082     33.18    0.0009
BC,BH,CH.     9       8.08       0.5265      6.56    0.6824   n.s.

Possible Models:

1. [BH][CH] – B and C independent given H.
2. [BC][BH][CH] – all two-factor interaction model.
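The entry for the model [BH][CH] can be reproduced by hand, because conditional independence of B and C given H has closed-form fitted values: (H by C total) x (H by B total) / (H total). The sketch below is an independent check, not the software that produced the slide output. The counts are transcribed from the example table, indexed as `x[h][c][b]`, and the zero cell is taken to contribute nothing to the likelihood ratio statistic.

```python
import math

# x[h][c][b]: heart disease (present, absent) x cholesterol (4 levels) x blood pressure (4 levels)
x = [
    [[2, 3, 3, 4], [3, 2, 0, 3], [8, 11, 6, 6], [7, 12, 11, 11]],                   # present
    [[117, 121, 47, 22], [85, 98, 43, 20], [119, 209, 68, 43], [67, 99, 46, 33]],   # absent
]
H, C, B = 2, 4, 4

g2 = 0.0
for h in range(H):
    n_h = sum(sum(row) for row in x[h])                  # total for heart-disease level h
    for c in range(C):
        n_hc = sum(x[h][c])                              # H by C margin
        for b in range(B):
            n_hb = sum(x[h][cc][b] for cc in range(C))   # H by B margin
            fitted = n_hc * n_hb / n_h                   # closed form under [BH][CH]
            if x[h][c][b] > 0:
                g2 += 2.0 * x[h][c][b] * math.log(x[h][c][b] / fitted)

# d.f. = # cells - # parameters fitted (u, H, B, C, HB, HC)
cells = H * C * B
params = 1 + (H - 1) + (B - 1) + (C - 1) + (H - 1) * (B - 1) + (H - 1) * (C - 1)
df = cells - params
```

With these counts the statistic comes out at about 27.67 on 18 degrees of freedom, in agreement with the BH,CH row of the table.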

Page 24: Discrete Multivariate Analysis

Model 1: [BH][CH] Log-linear parameters

Heart disease - Blood Pressure Interaction, $u_{HB(i,j)}$

Hd \ Bp     <127    127-146   147-166    167+
Pres      -0.256     -0.241     0.066    0.431
Abs        0.256      0.241    -0.066   -0.431

Standardized estimates, $z = \dfrac{u_{HB(i,j)}}{\sigma_{u_{HB(i,j)}}}$

Hd \ Bp     <127    127-146   147-166    167+
Pres      -2.607     -2.733     0.660    4.461
Abs        2.607      2.733    -0.660   -4.461

Page 25: Discrete Multivariate Analysis

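Multiplicative effect

$$\tau_{HB(i,j)} = \exp\left( u_{HB(i,j)} \right) = e^{u_{HB(i,j)}}$$

Hd \ Bp     <127    127-146   147-166    167+
Pres       0.774      0.786     1.068    1.538
Abs        1.291      1.272     0.936    0.650

Log-Linear Model

$$\ln \mu_{ijk} = u + u_{H(i)} + u_{B(j)} + u_{C(k)} + u_{HB(i,j)} + u_{HC(i,k)}$$

$$\mu_{ijk} = e^{u}\, e^{u_{H(i)}}\, e^{u_{B(j)}}\, e^{u_{C(k)}}\, e^{u_{HB(i,j)}}\, e^{u_{HC(i,k)}} = \tau\, \tau_{H(i)}\, \tau_{B(j)}\, \tau_{C(k)}\, \tau_{HB(i,j)}\, \tau_{HC(i,k)}$$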
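The multiplicative effects are just the exponentials of the log-linear parameters, which can be checked against the two tables for the heart disease by blood pressure interaction. In this sketch the variable names are illustrative. Because the published parameters are rounded to three decimals, the recomputed effects can differ from the tabulated ones in the last digit.

```python
import math

# Heart disease x blood pressure interaction parameters for "Pres" (from the table);
# the "Abs" row is the negative of this row because each column sums to zero.
u_hb_present = [-0.256, -0.241, 0.066, 0.431]

effect_present = [math.exp(v) for v in u_hb_present]
effect_absent = [math.exp(-v) for v in u_hb_present]

# Tabulated multiplicative effects, for comparison.
table_present = [0.774, 0.786, 1.068, 1.538]
table_absent = [1.291, 1.272, 0.936, 0.65]

max_gap = max(
    abs(a - b)
    for a, b in zip(effect_present + effect_absent, table_present + table_absent)
)
```

An effect above 1 means the cell is more frequent than the main effects alone would predict; here heart disease is over-represented in the two highest blood pressure groups.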

Page 26: Discrete Multivariate Analysis

Heart Disease - Cholesterol Interaction, $u_{HC(i,k)}$

Hd \ Chol   <200    200-219   220-259    260+
Pres      -0.233     -0.325     0.063    0.494
Abs        0.233      0.325    -0.063   -0.494

Standardized estimates, $z = \dfrac{u_{HC(i,k)}}{\sigma_{u_{HC(i,k)}}}$

Hd \ Chol   <200    200-219   220-259    260+
Pres      -1.889     -2.268     0.677    5.558
Abs        1.889      2.268    -0.677   -5.558

Page 27: Discrete Multivariate Analysis

Multiplicative effect

$$\tau_{HC(i,k)} = \exp\left( u_{HC(i,k)} \right) = e^{u_{HC(i,k)}}$$

Hd \ Chol   <200    200-219   220-259    260+
Pres       0.792      0.723     1.065    1.640
Abs        1.262      1.384     0.939    0.610

Page 28: Discrete Multivariate Analysis

Model 2: [BC][BH][CH] Log-linear parameters

Blood pressure - Cholesterol interaction, $u_{BC(j,k)}$

Chol \ Bp    <127    127-146   147-166    167+
<200        0.222     -0.019    -0.034   -0.169
200-219     0.114     -0.041     0.013   -0.086
220-259    -0.114      0.154    -0.058    0.018
260+       -0.221     -0.094     0.079    0.237

Page 29: Discrete Multivariate Analysis

Standardized estimates, $z = \dfrac{u_{BC(j,k)}}{\sigma_{u_{BC(j,k)}}}$

Chol \ Bp    <127    127-146   147-166    167+
<200         2.68     -0.236    -0.326   -1.291
200-219      1.27     -0.472     0.117   -0.626
220-259    -1.502      2.253    -0.636    0.167
260+       -2.487     -1.175     0.785    2.051

Multiplicative effect

$$\tau_{BC(j,k)} = \exp\left( u_{BC(j,k)} \right) = e^{u_{BC(j,k)}}$$

Chol \ Bp    <127    127-146   147-166    167+
<200        1.248      0.981     0.967    0.844
200-219     1.120      0.960     1.013    0.918
220-259     0.892      1.166     0.944    1.018
260+        0.802      0.910     1.082    1.267

Page 30: Discrete Multivariate Analysis

Heart disease - Blood Pressure Interaction, $u_{HB(i,j)}$

Hd \ Bp     <127    127-146   147-166    167+
Pres      -0.211     -0.232     0.055    0.389
Abs        0.211      0.232    -0.055   -0.389

Standardized estimates, $z = \dfrac{u_{HB(i,j)}}{\sigma_{u_{HB(i,j)}}}$

Hd \ Bp     <127    127-146   147-166    167+
Pres      -2.125     -2.604     0.542    3.938
Abs        2.125      2.604    -0.542   -3.938

Page 31: Discrete Multivariate Analysis

Multiplicative effect

$$\tau_{HB(i,j)} = \exp\left( u_{HB(i,j)} \right) = e^{u_{HB(i,j)}}$$

Hd \ Bp     <127    127-146   147-166    167+
Pres       0.809      0.793     1.056    1.475
Abs        1.235      1.261     0.947    0.678

Page 32: Discrete Multivariate Analysis

Heart Disease - Cholesterol Interaction, $u_{HC(i,k)}$

Hd \ Chol   <200    200-219   220-259    260+
Pres      -0.212     -0.316     0.069    0.460
Abs        0.212      0.316    -0.069   -0.460

Standardized estimates, $z = \dfrac{u_{HC(i,k)}}{\sigma_{u_{HC(i,k)}}}$

Hd \ Chol   <200    200-219   220-259    260+
Pres      -1.712     -2.199     0.732    5.095
Abs        1.712      2.199    -0.732   -5.095

Page 33: Discrete Multivariate Analysis

Multiplicative effect

$$\tau_{HC(i,k)} = \exp\left( u_{HC(i,k)} \right) = e^{u_{HC(i,k)}}$$

Hd \ Chol   <200    200-219   220-259    260+
Pres       0.809      0.729     1.071    1.584
Abs        1.237      1.372     0.933    0.631

Page 34: Discrete Multivariate Analysis

Another Example

In this study the following were determined for each of N = 4353 males:

1. Occupation category

2. Educational Level

3. Academic Aptitude

Page 35: Discrete Multivariate Analysis

1. Occupation categories

a. Self-employed Business

b. Teacher/Education

c. Self-employed Professional

d. Salaried Employed

2. Education levels

a. Low

b. Low/Med

c. High/Med

d. High

Page 36: Discrete Multivariate Analysis

3. Academic Aptitude

a. Low

b. Low/Med

c. Med

d. High/Med

e. High

Page 37: Discrete Multivariate Analysis

Table (rows: Aptitude, columns: Education)

Self-employed, Business
Aptitude    Low   LMed   HMed   High   Total
Low          42     55     22      3     122
LMed         72     82     60     12     226
Med          90    106     85     25     306
HMed         27     48     47      8     130
High          8     18     19      5      50
Total       239    309    233     53     834

Teacher
Aptitude    Low   LMed   HMed   High   Total
Low           0      0      1     19      20
LMed          0      3      3     60      66
Med           1      4      5     86      96
HMed          0      0      2     36      38
High          0      0      1     14      15
Total         1      7     12    215     235

Self-employed, Professional
Aptitude    Low   LMed   HMed   High   Total
Low           1      2      8     19      30
LMed          1      2     15     33      51
Med           2      5     25     83     115
HMed          2      2     10     45      59
High          0      0     12     19      31
Total         6     11     70    199     286

Salaried Employed
Aptitude    Low   LMed   HMed   High   Total
Low         172    151    107     42     472
LMed        208    198    206     92     704
Med         279    271    331    191    1072
HMed         99    126    179     97     501
High         36     35     99     79     249
Total       794    781    922    501    2998

Page 38: Discrete Multivariate Analysis

Two-way Tables (with $\chi^2$):

Education vs Aptitude ($\chi^2$ = 178.6); rows: Aptitude, columns: Education
            Low   LMed   HMed   High   Total
Low         215    208    138     83     644
LMed        281    285    284    197    1047
Med         372    386    446    385    1589
HMed        128    176    238    186     728
High         44     53    131    117     345
Total      1040   1108   1237    968    4353

Education vs Occupation ($\chi^2$ = 1254.1); rows: Occupation, columns: Education
            Low   LMed   HMed   High   Total
SEB         239    309    233     53     834
SEP           6     11     70    199     286
TCHR          1      7     12    215     235
SEM         794    781    922    501    2998
Total      1040   1108   1237    968    4353

Aptitude vs Occupation ($\chi^2$ = 35.8); rows: Aptitude, columns: Occupation
            SEB    SEP   TCHR    SEM   Total
Low         122     30     20    472     644
LMed        226     51     66    704    1047
Med         306    115     96   1072    1589
HMed        130     59     38    501     728
High         50     31     15    249     345
Total       834    286    235   2998    4353
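The Pearson statistic quoted for a two-way table can be checked directly from its counts, using expected values of (row total x column total) / N under independence. The sketch below does this for the Aptitude by Occupation table. The function name is illustrative, and the degrees of freedom use the usual (rows - 1) x (columns - 1) rule for an independence test.

```python
def pearson_independence_test(table):
    """Return the Pearson chi-squared statistic and its d.f. for a two-way table."""
    r, c = len(table), len(table[0])
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(c)]
    n = sum(row)
    chi2 = 0.0
    for i in range(r):
        for j in range(c):
            expected = row[i] * col[j] / n        # fitted value under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2, (r - 1) * (c - 1)

# Aptitude (rows: Low, LMed, Med, HMed, High) by Occupation (columns: SEB, SEP, TCHR, SEM).
aptitude_by_occupation = [
    [122, 30, 20, 472],
    [226, 51, 66, 704],
    [306, 115, 96, 1072],
    [130, 59, 38, 501],
    [50, 31, 15, 249],
]

chi2, df = pearson_independence_test(aptitude_by_occupation)
```

With these counts the statistic is about 35.8 on 12 degrees of freedom, matching the value quoted for this table.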

Page 39: Discrete Multivariate Analysis

• It is common to handle a multiway table by testing for independence in all two-way tables.

• This is similar to looking at all the bivariate correlations.

• In this example we learn that:

1. Education is related to Aptitude

2. Education is related to Occupational category

3. Aptitude is related to Occupational category

Can we do better than this?

Page 40: Discrete Multivariate Analysis

Fitting various log-linear models

Goodness of fit

Model                            Likelihood Ratio   DF   Sig.      Pearson      DF   Sig.
[Occ][Ed][Apt]                      1356.9702       69   0.0000    1519.802     69   0.0000
[Occ, Ed][Apt]                       228.2215       60   0.0000     226.6615    60   0.0000
[Apt, Ed][Occ]                      1179.6403       57   0.0000    1336.765     57   0.0000
[Apt, Occ][Ed]                      1319.561        57   0.0000    1424.1488    57   0.0000
[Occ, Ed][Occ, Apt]                  190.8123       48   0.0000     184.6386    48   0.0000
[Apt, Ed][Occ, Apt]                 1142.2311       45   0.0000    1301.1317    45   0.0000
[Apt, Ed][Occ, Ed]                    50.8915       48   0.3605      48.0105    48   0.4724
[Apt, Ed][Occ, Ed][Occ, Apt]          25.1048       36   0.9134      23.6465    36   0.9436

Simplest model that fits is: [Apt,Ed][Occ,Ed]

This model implies conditional independence between Aptitude and Occupation given Education.

Page 41: Discrete Multivariate Analysis

Log-linear Parameters

Aptitude – Education Interaction (rows: Aptitude, columns: Education)

Aptitude       Low    Low-Med   High-Med     High
Low         0.4602     0.3225    -0.2752  -0.5075
Low-Med     0.1857     0.0953    -0.0957  -0.1853
Med         0.0399    -0.0277    -0.0706   0.0584
High-Med   -0.2250    -0.0111     0.1032   0.1329
High       -0.4607    -0.3791     0.3383   0.5015

Page 42: Discrete Multivariate Analysis

Aptitude – Education Interaction (Multiplicative) (rows: Aptitude, columns: Education)

Aptitude       Low    Low-Med   High-Med     High
Low          1.584      1.381      0.759    0.602
Low-Med      1.204      1.100      0.909    0.831
Med          1.041      0.973      0.932    1.060
High-Med     0.799      0.989      1.109    1.142
High         0.631      0.684      1.403    1.651

Page 43: Discrete Multivariate Analysis

Occupation – Education Interaction (rows: Education, columns: Occupation)

Education      SEB        T      SEP      SAL
Low          1.241   -1.528   -0.718    1.005
LowMed       0.800   -0.280   -0.810    0.290
HighMed     -0.050   -0.309    0.472   -0.112
High        -1.991    2.117    1.057   -1.182

Page 44: Discrete Multivariate Analysis

Occupation – Education Interaction (Multiplicative) (rows: Education, columns: Occupation)

Education      SEB        T      SEP      SAL
Low          3.460    0.217    0.488    2.731
LowMed       2.226    0.756    0.445    1.336
HighMed      0.951    0.734    1.603    0.894
High         0.137    8.303    2.877    0.307