![Page 1: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/1.jpg)
The Analysis of Categorical Data
![Page 2: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/2.jpg)
Categorical variables
• When both predictor and response variables are categorical:
• Presence or absence
• Color, etc.
• The data in such a study represent counts (or frequencies) of observations in each category
![Page 3: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/3.jpg)
Analysis
Data Analysis
• A single categorical predictor variable: organized as a two-way contingency table and tested with a chi-square test or G-test
• Multiple predictor variables (or complex models): organized as multi-way contingency tables and analyzed using either log-linear models or classification trees
![Page 4: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/4.jpg)
Two way Contingency Tables
• Analysis of contingency tables is done correctly only on the raw counts, not on the percentages, proportions, or relative frequencies of the data
![Page 5: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/5.jpg)
Wildebeest carcasses from the Serengeti (Sinclair and Arcese 1995)
![Page 6: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/6.jpg)
Sex, cause of death, and bone marrow type
• Sex (males / females)
• Cause of death (predation / other)
• Bone marrow type:
1. Solid white fatty (healthy animal)
2. Opaque gelatinous
3. Translucent gelatinous
![Page 7: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/7.jpg)
Data
Sex Marrow Death by predation
Male SWF Yes
Male OG Yes
Male TG Yes
… … …
![Page 8: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/8.jpg)
Brief format
SEX MARROW DEATH COUNT
FEMALE SWF PRED 26
MALE SWF PRED 14
FEMALE OG PRED 32
MALE OG PRED 43
FEMALE TG PRED 8
MALE TG PRED 10
FEMALE SWF NPRED 6
MALE SWF NPRED 7
FEMALE OG NPRED 26
MALE OG NPRED 12
FEMALE TG NPRED 16
MALE TG NPRED 26
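The two-way tables on the following slides can be obtained by collapsing the brief-format counts over the third variable. The sketch below is a minimal illustration in plain Python; the names `COUNTS` and `two_way` are hypothetical and not part of the original slides.

```python
from collections import defaultdict

# Counts copied from the brief-format slide, keyed by (sex, marrow, death).
COUNTS = {
    ("FEMALE", "SWF", "PRED"): 26, ("MALE", "SWF", "PRED"): 14,
    ("FEMALE", "OG", "PRED"): 32, ("MALE", "OG", "PRED"): 43,
    ("FEMALE", "TG", "PRED"): 8, ("MALE", "TG", "PRED"): 10,
    ("FEMALE", "SWF", "NPRED"): 6, ("MALE", "SWF", "NPRED"): 7,
    ("FEMALE", "OG", "NPRED"): 26, ("MALE", "OG", "NPRED"): 12,
    ("FEMALE", "TG", "NPRED"): 16, ("MALE", "TG", "NPRED"): 26,
}

def two_way(counts, a, b):
    """Sum the three-way counts over the variable that is not at position a or b.

    Positions in the key tuple: 0 = sex, 1 = marrow, 2 = death.
    """
    table = defaultdict(int)
    for key, n in counts.items():
        table[(key[a], key[b])] += n
    return dict(table)

sex_death = two_way(COUNTS, 0, 2)
sex_marrow = two_way(COUNTS, 0, 1)
death_marrow = two_way(COUNTS, 2, 1)

for name, table in [("Sex * Death", sex_death),
                    ("Sex * Marrow", sex_marrow),
                    ("Death * Marrow", death_marrow)]:
    print(name, table)
```

Working from the raw counts, rather than from proportions, keeps the tables suitable for the tests that follow.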
![Page 9: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/9.jpg)
Contingency table
Sex * Death Crosstabulation
Dead
Sex NPRED PRED Total
FEMALE 48 66 114
MALE 45 67 112
Total 93 133 226
![Page 10: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/10.jpg)
Contingency table
Sex * Marrow Crosstabulation
Marrow
Sex OG SWF TG Total
FEMALE 58 32 24 114
MALE 55 21 36 112
Total 113 53 60 226
![Page 11: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/11.jpg)
Contingency table
Death * Marrow Crosstabulation
Marrow
Death OG SWF TG Total
NPRED 38 13 42 93
PRED 75 40 18 133
Total 113 53 60 226
![Page 12: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/12.jpg)
Are the variables independent?
We want to know, for example, whether males are more likely to die by predation than females
• Specifying the null hypothesis:
• The predictor and response variables are not associated with each other. The two variables are independent of each other, and the observed degree of association is no stronger than we would expect by chance or random sampling
![Page 13: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/13.jpg)
Calculating the expected values
• The expected value is the total number of observations (N) times the probability that an individual is both male and dead by predation
$$\hat{Y}_{\text{male, dead by predation}} = N \times P(\text{male} \cap \text{dead by predation})$$
![Page 14: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/14.jpg)
The probability of two independent events
$$P(\text{male} \cap \text{dead by predation}) = P(\text{male}) \times P(\text{dead by predation})$$
Because we have no information other than the data, we estimate each of the probabilities on the right-hand side of the equation from the marginal totals
![Page 15: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/15.jpg)
Contingency table
Sex * Death expected values
Dead
Sex NPRED PRED Total P
FEMALE 46.91 67.09 114 0.5044
MALE 46.09 65.91 112 0.4956
Total 93 133 N=226
P 0.4115 0.5885
![Page 16: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/16.jpg)
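$$\hat{Y}_{\text{female, not predated}} = N \times P(\text{female} \cap \text{not predated})$$
$$\hat{Y}_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{\text{sample size}}$$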
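As a check on the expected values in the Sex * Death table, the rule "row total times column total divided by sample size" can be applied directly to the observed counts. This is a minimal sketch; the function name `expected_counts` is a hypothetical helper.

```python
# Observed Sex * Death counts from the slides.
# Rows: FEMALE, MALE. Columns: NPRED, PRED.
observed = [[48, 66],
            [45, 67]]

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for label, row in zip(["FEMALE", "MALE"], expected):
    print(label, [round(value, 2) for value in row])
```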
![Page 17: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/17.jpg)
Testing the hypothesis: Pearson’s Chi-square test
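$$X^2_{\text{Pearson}} = \sum_{\text{all cells}} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} = 0.0866, \quad P = 0.7685$$
$$X^2_{\text{Yates}} = \sum_{\text{all cells}} \frac{(|\text{Observed} - \text{Expected}| - 0.5)^2}{\text{Expected}} = 0.0253, \quad P = 0.8736$$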
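Both statistics can be reproduced from the Sex * Death counts. The sketch below implements the two formulas as written on the slide; `pearson_x2` and its `yates` flag are hypothetical names chosen for illustration.

```python
# Observed Sex * Death counts. Rows: FEMALE, MALE. Columns: NPRED, PRED.
observed = [[48, 66],
            [45, 67]]

def expected_counts(table):
    """Expected cell counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_x2(table, yates=False):
    """Pearson chi-square; with yates=True, subtract 0.5 from each |O - E|."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            deviation = abs(o - e)
            if yates:
                deviation -= 0.5
            total += deviation ** 2 / e
    return total

x2_pearson = pearson_x2(observed)
x2_yates = pearson_x2(observed, yates=True)
print(round(x2_pearson, 4), round(x2_yates, 4))
```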
![Page 18: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/18.jpg)
The degrees of freedom
$$df = (\text{number of rows} - 1) \times (\text{number of columns} - 1) = (2 - 1) \times (2 - 1) = 1$$
![Page 19: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/19.jpg)
Calculating the P-value
• We find the probability of obtaining a value of $X^2$ as large as or larger than 0.0866 relative to a $\chi^2$ distribution with 1 degree of freedom
• P = 0.769
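For 1 degree of freedom the tail probability has a closed form, because a chi-square variable with 1 df is the square of a standard normal variable. The sketch below uses only the standard library and applies to df = 1 only; the function name `chi2_sf_df1` is hypothetical. In practice a statistics library would normally be used for other degrees of freedom.

```python
import math

def chi2_sf_df1(x):
    """P(chi-square with 1 df >= x), using P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Statistics taken from the Pearson and Yates slide.
p_pearson = chi2_sf_df1(0.0866)
p_yates = chi2_sf_df1(0.0253)
print(round(p_pearson, 4), round(p_yates, 4))
```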
![Page 20: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/20.jpg)
[Figure: plot titled "tcount" of sex (female, male) by cause of death (non predator, predator), shaded by standardized residuals; legend bins: <-4, -4:-2, -2:0, 0:2, 2:4, >4]
![Page 21: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/21.jpg)
An alternative
• The likelihood ratio test: It compares observed values with the distribution of expected values based on the multinomial probability distribution
$$G = 2 \sum_{\text{all cells}} \text{Observed} \times \ln\left(\frac{\text{Observed}}{\text{Expected}}\right) = 0.0866$$
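The G statistic for the Sex * Death table can be computed in the same way as the Pearson statistic. This is a minimal sketch; `g_statistic` is a hypothetical helper, and it assumes no observed cell is zero.

```python
import math

# Observed Sex * Death counts. Rows: FEMALE, MALE. Columns: NPRED, PRED.
observed = [[48, 66],
            [45, 67]]

def expected_counts(table):
    """Expected cell counts under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def g_statistic(table):
    """G = 2 * sum(O * ln(O / E)) over all cells (assumes every O > 0)."""
    total = 0.0
    for obs_row, exp_row in zip(table, expected_counts(table)):
        for o, e in zip(obs_row, exp_row):
            total += o * math.log(o / e)
    return 2.0 * total

g = g_statistic(observed)
print(round(g, 4))
```

With deviations this small, G and the Pearson statistic agree to about four decimal places, which is why both slides report 0.0866.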
![Page 22: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/22.jpg)
Two way contingency tables
• Sex * Death Crosstabulation: $X^2_{\text{Pearson}} = 0.087$, d.f. = 1, P = 0.769; G = 0.087, d.f. = 1, P = 0.769
• Sex * Marrow Crosstabulation: $X^2_{\text{Pearson}} = 4.745$, d.f. = 2, P = 0.093; G = 4.778, d.f. = 2, P = 0.092
• Marrow * Death Crosstabulation: $X^2_{\text{Pearson}} = 29.308$, d.f. = 2, P < 0.001; G = 29.520, d.f. = 2, P < 0.001
![Page 23: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/23.jpg)
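Which test to choose?
| Model | Rows/Columns | Sample size | Test |
| --- | --- | --- | --- |
| I | Not fixed | small | G-test, with corrections |
| II | Fixed/not fixed | small | G-test, with corrections |
| I | Not fixed | large | G-test, Chi-square test |
| II | Fixed/not fixed | large | G-test, Chi-square test |
| III | Fixed | | Fisher exact test |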
![Page 24: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/24.jpg)
Log-linear models
Multi-way Contingency Tables
![Page 25: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/25.jpg)
Multiple two-way tables
Females Marrow
Death OG SWF TG Total
PRED 32 26 8 66
NPRED 26 6 16 48
Total 58 32 24 114
Males Marrow
Death OG SWF TG Total
PRED 43 14 10 67
NPRED 12 7 26 45
Total 55 21 36 112
![Page 26: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/26.jpg)
Log-linear models
• They treat the cell frequencies as counts distributed as a Poisson random variable
• The expected cell frequencies are modeled against the variables using the log-link and Poisson error term
• They are fit and parameters estimated using maximum likelihood techniques
![Page 27: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/27.jpg)
Log-linear models
• Do not distinguish response and predictor variables: all the variables are considered equally as response variables
![Page 28: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/28.jpg)
However
• A logit model with categorical variables can be analyzed as a log-linear model
![Page 29: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/29.jpg)
Two way tables
• For a two-way table (I by J) we can fit two log-linear models
• The first is a saturated (full) model
• $\log f_{ij} = \text{constant} + \lambda_i^{X} + \lambda_j^{Y} + \lambda_{ij}^{XY}$
• $f_{ij}$ is the expected frequency in cell ij
• $\lambda_i^{X}$ is the effect of category i of variable X
• $\lambda_j^{Y}$ is the effect of category j of variable Y
• $\lambda_{ij}^{XY}$ is the effect of any interaction between X and Y
• This model fits the observed frequencies perfectly
![Page 30: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/30.jpg)
Note
• The effect does not imply any causality, just the influence of a variable or interaction between variables on the log of the expected number of observations in a cell
![Page 31: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/31.jpg)
Two way tables
• The second log-linear model represents independence of the two variables (X and Y) and is a reduced model:
• $\log f_{ij} = \text{constant} + \lambda_i^{X} + \lambda_j^{Y}$
• The interpretation of this model is that the log of the expected frequency in any cell is a function of the mean of the log of all the expected frequencies plus the effect of variable x and the effect of variable y. This is an additive linear model with no interactions between the two variables
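To make the additive structure concrete, the parameters of the independence model can be recovered from the expected Sex * Death frequencies. The sketch below assumes the usual sum-to-zero parameterization, in which the constant is the mean of the log expected frequencies and each λ is a deviation from it; the variable names are hypothetical.

```python
import math

# Observed Sex * Death counts. Rows: FEMALE, MALE. Columns: NPRED, PRED.
observed = [[48, 66],
            [45, 67]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Log of the expected frequencies under independence.
log_expected = [[math.log(r * c / n) for c in col_totals] for r in row_totals]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Constant: mean of the log expected frequencies over all cells.
constant = mean(v for row in log_expected for v in row)
# Effects: deviations of the row and column means from the constant.
lambda_sex = [mean(row) - constant for row in log_expected]
lambda_death = [mean(col) - constant for col in zip(*log_expected)]

# The additive model reproduces every log expected frequency.
max_error = max(
    abs(constant + lambda_sex[i] + lambda_death[j] - log_expected[i][j])
    for i in range(2) for j in range(2)
)
print(constant, lambda_sex, lambda_death, max_error)
```

Because the effects within each variable sum to zero, a positive λ marks a category with higher expected frequencies than average and a negative λ a category with lower ones.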
![Page 32: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/32.jpg)
Interpretation
• The parameters of the log-linear models are the effects of a particular category of each variable on the expected frequencies:
• i.e. a larger λ means that the expected frequencies will be larger for that category.
• These parameters are also deviations from the mean of the log of all expected frequencies
![Page 33: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/33.jpg)
Null hypothesis of independence
• The null hypothesis (H0) is that the sampling or experimental units come from a population of units in which the two variables (rows and columns) are independent of each other in terms of the cell frequencies
• It is also a test that $\lambda_{ij}^{XY} = 0$:
• There is NO interaction between two variables
![Page 34: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/34.jpg)
Test
• We can test this Ho by comparing the fit of the model without this term to the saturated model that includes this term
• We determine the fit of each model by calculating the expected frequencies under each model, comparing the observed and expected frequencies and calculating the log-likelihood of each model
![Page 35: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/35.jpg)
Test
• We then compare the fit of the two models with the likelihood ratio test statistic ∆
• However, the sampling distribution of this ratio (∆) is not well known, so instead we calculate the $G^2$ statistic
• $G^2 = -2 \log \Delta$
• $G^2$ follows a $\chi^2$ distribution for reasonable sample sizes and can be generalized to
• $G^2 = -2(\text{log-likelihood of reduced model} - \text{log-likelihood of full model})$
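The log-likelihood form can be checked against the Sex * Death example. The sketch below assumes Poisson-distributed cell counts, as stated on the log-linear models slide, and takes the saturated model's fitted values to be the observed counts; the function names are hypothetical.

```python
import math

# Observed Sex * Death counts. Rows: FEMALE, MALE. Columns: NPRED, PRED.
observed = [[48, 66],
            [45, 67]]

def expected_counts(table):
    """Fitted values of the reduced (independence) model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def poisson_loglik(table, fitted):
    """Sum over cells of the Poisson log-probability of the observed count."""
    total = 0.0
    for obs_row, fit_row in zip(table, fitted):
        for o, mu in zip(obs_row, fit_row):
            total += o * math.log(mu) - mu - math.lgamma(o + 1)
    return total

loglik_reduced = poisson_loglik(observed, expected_counts(observed))
loglik_full = poisson_loglik(observed, observed)  # saturated model: fitted = observed
g2 = -2.0 * (loglik_reduced - loglik_full)
print(round(g2, 4))
```

The result matches the G statistic reported for this table, since the observed and expected counts have the same total and the remaining terms reduce to 2 Σ O ln(O/E).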
![Page 36: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/36.jpg)
Degrees of freedom
• The calculated $G^2$ is compared to a $\chi^2$ distribution with (I-1)(J-1) df.
• This df, (I-1)(J-1), is the difference between the df for the full model (IJ-1) and the df for the reduced model [(I-1)+(J-1)]
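As a worked check of this bookkeeping, take the Sex * Marrow table, with I = 2 rows and J = 3 columns (the choice of table is only for illustration):

$$
\begin{aligned}
df_{\text{full}} &= IJ - 1 = 2 \times 3 - 1 = 5 \\
df_{\text{reduced}} &= (I - 1) + (J - 1) = 1 + 2 = 3 \\
df_{\text{full}} - df_{\text{reduced}} &= 5 - 3 = 2 = (I - 1)(J - 1)
\end{aligned}
$$

This agrees with the 2 d.f. reported for the Sex * Marrow test.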
![Page 37: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/37.jpg)
Akaike information criteria
$$AIC = -2 \log L(\hat{\theta} \mid \text{data}) + 2K$$
Hirotugu Akaike
![Page 38: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/38.jpg)
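The full model
$$\log f_{ijk} = C + \lambda_i^{\text{death}} + \lambda_j^{\text{sex}} + \lambda_k^{\text{marrow}} + \lambda_{ij}^{\text{death} \times \text{sex}} + \lambda_{ik}^{\text{death} \times \text{marrow}} + \lambda_{jk}^{\text{sex} \times \text{marrow}} + \lambda_{ijk}^{\text{death} \times \text{sex} \times \text{marrow}}$$
$$AIC_{\text{particular model}} = G^2_{\text{particular model}} - 2\,df$$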
![Page 39: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/39.jpg)
Complete table
Model G2 df P AIC
1 D+S+M 42.76 7 0.001 28.76
2 D*S 42.68 6 0.001 30.68
3 D*M 13.24 5 0.021 3.24
4 S*M 37.98 5 0.001 27.98
5 D*S+D*M 13.16 4 0.01 5.16
6 D*S+S*M 37.89 4 0.001 29.89
7 D*M+S*M 8.46 3 0.037 2.46
8 D*S+D*M+S*M 7.19 2 0.027 3.19
9 Saturated full model 0 0
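The G2, df, and AIC columns can be reproduced from the raw counts. The sketch below is one possible approach, not necessarily the software used for the slides: it fits each hierarchical model by iterative proportional fitting, which yields the maximum likelihood expected frequencies for such models. It assumes that a row labelled, for example, "D*S" also contains all three main effects, which is consistent with the df in the table. All names are hypothetical.

```python
import math
from itertools import combinations

# Cell keys are (death, sex, marrow).
# death: 0 = PRED, 1 = NPRED; sex: 0 = FEMALE, 1 = MALE; marrow: 0 = OG, 1 = SWF, 2 = TG
D, S, M = 0, 1, 2
LEVELS = (2, 2, 3)
OBS = {
    (0, 0, 0): 32, (0, 0, 1): 26, (0, 0, 2): 8,
    (0, 1, 0): 43, (0, 1, 1): 14, (0, 1, 2): 10,
    (1, 0, 0): 26, (1, 0, 1): 6, (1, 0, 2): 16,
    (1, 1, 0): 12, (1, 1, 1): 7, (1, 1, 2): 26,
}
# Each model is listed by the margins it fits (its highest-order terms).
MODELS = {
    "D+S+M": [(D,), (S,), (M,)],
    "D*S": [(D, S), (M,)],
    "D*M": [(D, M), (S,)],
    "S*M": [(S, M), (D,)],
    "D*S+D*M": [(D, S), (D, M)],
    "D*S+S*M": [(D, S), (S, M)],
    "D*M+S*M": [(D, M), (S, M)],
    "D*S+D*M+S*M": [(D, S), (D, M), (S, M)],
}

def margin(table, axes):
    """Sum a table over every axis not listed in axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out

def ipf(obs, margins, iterations=200):
    """Rescale fitted values until they match each observed margin in turn."""
    fitted = {cell: 1.0 for cell in obs}
    targets = [margin(obs, axes) for axes in margins]
    for _ in range(iterations):
        for axes, target in zip(margins, targets):
            current = margin(fitted, axes)
            for cell in fitted:
                key = tuple(cell[a] for a in axes)
                fitted[cell] *= target[key] / current[key]
    return fitted

def g2(obs, fitted):
    """Deviance of the fitted model relative to the saturated model."""
    return 2.0 * sum(o * math.log(o / fitted[c]) for c, o in obs.items() if o > 0)

def model_df(margins):
    """Cells minus 1 minus the free parameters of all terms implied by the margins."""
    terms = set()
    for axes in margins:
        for size in range(1, len(axes) + 1):
            terms.update(combinations(axes, size))
    n_params = sum(math.prod(LEVELS[a] - 1 for a in term) for term in terms)
    return math.prod(LEVELS) - 1 - n_params

results = {}
for name, margins in MODELS.items():
    g = g2(OBS, ipf(OBS, margins))
    df = model_df(margins)
    results[name] = (g, df, g - 2 * df)  # AIC as defined on the slide: G2 - 2 df
    print(f"{name:12s} G2={g:6.2f} df={df} AIC={g - 2 * df:6.2f}")
```

Models with lower AIC are preferred; in the table those are the models that keep the D*M term.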
![Page 40: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/40.jpg)
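Two way interactions (marginal independence)
Reference model: D+S+M, G2 = 42.76
| Term | Models compared | G2 | Difference in G2 | d.f. | P |
| --- | --- | --- | --- | --- | --- |
| D*S | 1 vs 2 | 42.6759 | 42.76 - 42.68 = 0.084 | 7 - 6 = 1 | 0.769 |
| D*M | 1 vs 3 | 13.24 | 42.76 - 13.24 = 29.520 | 7 - 5 = 2 | <0.001 |
| S*M | 1 vs 4 | 37.98 | 42.76 - 37.98 = 4.778 | 7 - 5 = 2 | 0.092 |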
![Page 41: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/41.jpg)
Three way interaction
• Death*Sex*Marrow
• Models compared 8 vs 9
• $G^2$ = 7.19
• df = 2
• P = 0.027
![Page 42: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/42.jpg)
Conditional independence
term Models compared G2 df P
D*S 7 vs 8 1.28 1 0.259
D*M 6 vs 8 30.71 2 0.001
S*M 5 vs 8 5.97 2 0.051
Death and marrow have a partial association
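Each row of this table is a difference in G2 and df between two nested models from the complete table. The sketch below recomputes those differences from the rounded values on the slides, so results may differ from the slide in the last digit (for example 1.27 rather than 1.28). The P-values use a standard recurrence for the chi-square upper tail with integer df; `chi2_sf`, `TABLE`, and `compare` are hypothetical names.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Starts from the closed forms for df = 1 and df = 2 and applies
    Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2.0)), 1
    else:
        q, k = math.exp(-x / 2.0), 2
    while k < df:
        q += (x / 2.0) ** (k / 2.0) * math.exp(-x / 2.0) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# (G2, df) for models 1 to 8, copied from the complete table.
TABLE = {
    1: (42.76, 7), 2: (42.68, 6), 3: (13.24, 5), 4: (37.98, 5),
    5: (13.16, 4), 6: (37.89, 4), 7: (8.46, 3), 8: (7.19, 2),
}

def compare(reduced, fuller):
    """Difference in G2 and df between a reduced model and a fuller model."""
    g_reduced, df_reduced = TABLE[reduced]
    g_fuller, df_fuller = TABLE[fuller]
    delta_g = g_reduced - g_fuller
    delta_df = df_reduced - df_fuller
    return delta_g, delta_df, chi2_sf(delta_g, delta_df)

for term, reduced in [("D*S", 7), ("D*M", 6), ("S*M", 5)]:
    delta_g, delta_df, p = compare(reduced, 8)
    print(f"{term}: {reduced} vs 8, G2={delta_g:.2f}, df={delta_df}, P={p:.3f}")
```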
![Page 43: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/43.jpg)
Females Marrow
Death OG SWF TG Total
PRED 32 26 8 66
NPRED 26 6 16 48
Total 58 32 24 114
Males Marrow
Death OG SWF TG Total
PRED 43 14 10 67
NPRED 12 7 26 45
Total 55 21 36 112
$$\hat{\theta}_{XY(k)} = \frac{n_{11k}\, n_{22k}}{n_{12k}\, n_{21k}}$$
Conditional independence
![Page 44: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/44.jpg)
Comparison Males 95% CI Females 95% CI
OG vs TG 0.107 0.041-0.283 0.406 0.150-1.097
SWF vs TG 0.192 0.060-0.616 0.115 0.034-0.395
SWF vs OG 0.558 0.184-1.693 3.521 1.261-9.836
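$$\hat{\theta}_{\text{SWF vs OG, male}} = \frac{14 \times 12}{43 \times 7} = 0.558$$
$$\hat{\theta}_{\text{SWF vs OG, female}} = \frac{26 \times 26}{32 \times 6} = 3.521$$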
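The SWF vs OG odds ratios and their intervals can be reproduced from the sex-specific tables. The slides do not state how the intervals were obtained; the sketch below assumes the common large-sample interval on the log scale with z = 1.96, which agrees with the tabulated limits to about two decimal places. The function name `odds_ratio_ci` is hypothetical.

```python
import math

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Odds ratio n11*n22 / (n12*n21) with an approximate 95% interval.

    Assumption: standard error of log(odds ratio) = sqrt(sum of 1/n over the four cells).
    """
    theta = (n11 * n22) / (n12 * n21)
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se)
    upper = math.exp(math.log(theta) + z * se)
    return theta, lower, upper

# Cell order: PRED SWF, PRED OG, NPRED SWF, NPRED OG.
male = odds_ratio_ci(14, 43, 7, 12)
female = odds_ratio_ci(26, 32, 6, 26)
print("male  ", [round(v, 3) for v in male])
print("female", [round(v, 3) for v in female])
```

The male interval includes 1 and the female interval does not, so the association between death and marrow type for this comparison differs between the sexes.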
![Page 45: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/45.jpg)
Complete independence
• Models compared 1 vs 8
• $G^2$ = 35.57
• df = 5
• P < 0.001
![Page 46: The Analysis of Categorical Data. Categorical variables When both predictor and response variables are categorical: Presence or absence Color, etc. The.](https://reader037.fdocuments.in/reader037/viewer/2022103004/56649ccc5503460f94995949/html5/thumbnails/46.jpg)
Warning
• Always fit a saturated model first, containing all the variables of interest and all the interactions involving the (potential) nuisance variables. Only delete from the model the interactions that involve the variables of interest.