I've Got a Categorical Variable?!
Now What?
Andrew Zieffler, Department of Educational Psychology
Research Methodology and Consulting Center (RMCC): Lunch & Learn
October 03, 2018
Scales of Measurement
Classification system describing the nature of information within the values assigned to variables. (Stevens, 1946)
Scale | Property | Operations | Examples
Nominal | Class membership | = and ≠ | College major, Sex, Political affiliation
Ordinal | Comparison | < and > | Likert data, Rankings, Scoville scale
Interval | Difference | + and – | GRE scores, Temperature (F)
Ratio | Magnitude | × and ÷ | Income, Class size, Years of experience
Categorical Data is Here
Categorical variables are at the nominal scale of measurement. Although, in practice, we assign numbers to represent the levels of a categorical variable, those numbers do not carry any more information than group membership.
Dichotomous and Polytomous Variables
When a categorical variable has two levels we refer to it as dichotomous or binary. If it has more than two levels, it is polychotomous or polytomous.
Major
Kinesiology
Special Education
Special Education
Child Psychology
Kinesiology
Kinesiology
Child Psychology
STEM Major?
STEM
Non-STEM
Non-STEM
STEM
STEM
STEM
Non-STEM
Major is polytomous (three levels); STEM Major? is dichotomous (two levels).
Contingency Tables
A contingency table is simply a table that lists each level of the categorical variable along with that level's count or percentage.
Major
Kinesiology
Special Education
Special Education
Child Psychology
Kinesiology
Kinesiology
Child Psychology
Major Count
Kinesiology 3
Special Education 2
Child Psychology 2
Data
Contingency Table
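As a minimal sketch of how such a table could be tallied, the following Python snippet counts the seven majors listed above using only the standard library. The variable names (majors, counts) are illustrative and not part of the original slides.

```python
from collections import Counter

# The seven majors from the example data above.
majors = [
    "Kinesiology", "Special Education", "Special Education",
    "Child Psychology", "Kinesiology", "Kinesiology", "Child Psychology",
]

# Count how many cases fall in each level of the variable.
counts = Counter(majors)
n = len(majors)

# Print each level with its count and its percentage of all cases.
for level, count in counts.items():
    print(f"{level:<20}{count:>3}{100 * count / n:>8.1f}%")
```

The counts reproduce the table above (3, 2, and 2); the percentage column is the same information expressed relative to the seven cases.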
Contingency tables can also be used to show cross-classifications of two (or more) variables.
Major Sex
Kinesiology Female
Special Education Female
Special Education Male
Child Psychology Female
Kinesiology Male
Kinesiology Male
Child Psychology Female
Sex
Major Female Male Total
Kinesiology 1 2 3
Special Education 1 1 2
Child Psychology 2 0 2
Total 4 3 7
Data
Contingency Table
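One way the cross-classification above might be computed is sketched below. It counts (major, sex) pairs and then lays them out with row and column totals. The names data, table, and totals are illustrative assumptions.

```python
from collections import Counter

# (major, sex) pairs from the example data above.
data = [
    ("Kinesiology", "Female"), ("Special Education", "Female"),
    ("Special Education", "Male"), ("Child Psychology", "Female"),
    ("Kinesiology", "Male"), ("Kinesiology", "Male"),
    ("Child Psychology", "Female"),
]

cell_counts = Counter(data)  # a Counter returns 0 for cells that never occur
majors = ["Kinesiology", "Special Education", "Child Psychology"]
sexes = ["Female", "Male"]

# Build each row as [Female, Male], then append the row total.
table = {m: [cell_counts[(m, s)] for s in sexes] for m in majors}
for m in majors:
    table[m].append(sum(table[m]))

# Column totals; the last entry is the grand total.
totals = [sum(table[m][j] for m in majors) for j in range(len(sexes) + 1)]

print(f"{'Major':<20}{'Female':>8}{'Male':>8}{'Total':>8}")
for m in majors:
    print(f"{m:<20}" + "".join(f"{v:>8}" for v in table[m]))
print(f"{'Total':<20}" + "".join(f"{v:>8}" for v in totals))
```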
Bar charts are just graphical summaries of the information in a contingency table.
Sex
Major Female Male Total
Kinesiology 1 2 3
Special Education 1 1 2
Child Psychology 2 0 2
Total 4 3 7
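To make the link between the table and a bar chart concrete, here is a rough text-only sketch in which each bar's length is simply the cell count. It is a stand-in for a real plotting library, and the helper name bar is hypothetical.

```python
# Counts by major and sex, taken from the contingency table above.
table = {
    "Kinesiology": {"Female": 1, "Male": 2},
    "Special Education": {"Female": 1, "Male": 1},
    "Child Psychology": {"Female": 2, "Male": 0},
}

def bar(count, symbol="#"):
    """Return a text bar whose length equals the count."""
    return symbol * count

# One line per cell of the table: label, bar, and the count itself.
lines = []
for major, by_sex in table.items():
    for sex, count in by_sex.items():
        lines.append(f"{major:<18} {sex:<7} {bar(count)} ({count})")

print("\n".join(lines))
```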
Many methods of analyzing categorical data are based on contingency tables:
• Methods of association
  ‣ Chi-squared (χ²) statistics and tests
  ‣ Phi coefficient
  ‣ Tetrachoric correlation
  ‣ Cramér's V
  ‣ Goodman & Kruskal's lambda
  ‣ Goodman & Kruskal's gamma
  ‣ Kendall's tau
• Log-linear modeling
• Correspondence analysis
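To show how a measure of association is built from a contingency table, the sketch below computes the Pearson chi-squared statistic and Cramér's V by hand for the Major by Sex table used earlier. It is an arithmetic illustration only: with just seven cases the expected counts are far too small for the chi-squared approximation to support an actual test.

```python
import math

# Observed counts: rows are Kinesiology, Special Education, Child Psychology;
# columns are Female, Male.
observed = [[1, 2], [1, 1], [2, 0]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-squared: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

# Cramér's V rescales chi-squared by n and the smaller table dimension minus 1.
k = min(len(observed), len(observed[0])) - 1
cramers_v = math.sqrt(chi2 / (n * k))

print(f"chi-squared = {chi2:.3f}, Cramér's V = {cramers_v:.3f}")
```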
Challenge #1: Many common statistical methods require quantitative variables
To alleviate this problem, we typically re-code (or treat) categorical variables so that they are quantitative. (Remember: The numbers only denote group membership.)
ID Major Recoded 1 Recoded 2 Recoded 3
1 Kinesiology 1 1 –1
2 Special Education 2 100 0
3 Special Education 2 100 0
4 Child Psychology 3 3000 1
5 Kinesiology 1 1 –1
6 Kinesiology 1 1 –1
7 Child Psychology 3 3000 1
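The point that the numbers only denote group membership can be checked with a short sketch. Each of the three recodings in the table is applied to the same majors: the group sizes are identical under every coding, while the arithmetic mean of the codes changes arbitrarily. The dictionary names are illustrative.

```python
from collections import Counter

majors = [
    "Kinesiology", "Special Education", "Special Education",
    "Child Psychology", "Kinesiology", "Kinesiology", "Child Psychology",
]

# The three recodings shown in the table above.
recodings = {
    "Recoded 1": {"Kinesiology": 1, "Special Education": 2, "Child Psychology": 3},
    "Recoded 2": {"Kinesiology": 1, "Special Education": 100, "Child Psychology": 3000},
    "Recoded 3": {"Kinesiology": -1, "Special Education": 0, "Child Psychology": 1},
}

results = {}
for name, codes in recodings.items():
    coded = [codes[m] for m in majors]
    results[name] = {
        # Group sizes are the same whatever numbers are used as labels.
        "sizes": sorted(Counter(coded).values()),
        # The mean of the codes depends entirely on the arbitrary labels.
        "mean": sum(coded) / len(coded),
    }

for name, summary in results.items():
    print(name, summary["sizes"], round(summary["mean"], 3))
```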
One common method for coding categorical variables is dummy coding (aka, reference coding).
• Dummy coding only uses the values 0 and 1
• Typically 1 denotes membership in a particular level and 0 indicates not a member of that level
Dummy/Reference Coding
ID Graduation Status Graduated
1 Graduated 1
2 Did not graduate 0
3 Graduated 1
4 Graduated 1
5 Graduated 1
6 Did not graduate 0
7 Graduated 1
8 Did not graduate 0
9 Graduated 1
10 Graduated 1
Dummy coding has several useful advantages. For example, the mean of a dummy coded variable is the proportion of cases coded as 1.
The proportion of students who graduated is 0.7.
$$\text{Mean} = \frac{1 + 0 + 1 + 1 + 1 + 0 + 1 + 0 + 1 + 1}{10} = \frac{7}{10} = 0.7$$
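The same calculation can be sketched in Python. The snippet dummy codes the ten graduation statuses from the table and confirms that the mean of the 0/1 variable equals the proportion of graduates. Variable names are illustrative.

```python
# Graduation status for the ten students in the table above.
status = [
    "Graduated", "Did not graduate", "Graduated", "Graduated", "Graduated",
    "Did not graduate", "Graduated", "Did not graduate", "Graduated", "Graduated",
]

# Dummy code: 1 = graduated, 0 = did not graduate.
graduated = [1 if s == "Graduated" else 0 for s in status]

# The mean of the dummy variable...
mean_graduated = sum(graduated) / len(graduated)

# ...equals the proportion of cases in the level coded 1.
proportion = status.count("Graduated") / len(status)

print(mean_graduated, proportion)
```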
With polytomous variables we need to use more than one dummy variable to code all of the categories.
To distinctly code all of the categories we need to create a dummy variable for all categories except one.
ID Major Kinesiology Spec_Ed
1 Kinesiology 1 0
2 Special Education 0 1
3 Special Education 0 1
4 Child Psychology 0 0
5 Kinesiology 1 0
6 Kinesiology 1 0
7 Child Psychology 0 0
• Kinesiology majors: Kinesiology = 1 and Spec_Ed = 0
• Special Education majors: Kinesiology = 0 and Spec_Ed = 1
• Child Psychology majors: Kinesiology = 0 and Spec_Ed = 0
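A minimal sketch of this coding scheme follows. The function dummy_code is a hypothetical helper, not a library routine: it creates one 0/1 variable for every level except the chosen reference level, so the reference level is identified by zeros on all of the dummy variables.

```python
def dummy_code(values, reference):
    """Return one 0/1 list per level of `values`, omitting `reference`."""
    # dict.fromkeys keeps the levels in order of first appearance.
    levels = [lvl for lvl in dict.fromkeys(values) if lvl != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

majors = [
    "Kinesiology", "Special Education", "Special Education",
    "Child Psychology", "Kinesiology", "Kinesiology", "Child Psychology",
]

# Child Psychology is the reference level, as in the table above.
dummies = dummy_code(majors, reference="Child Psychology")

for name, column in dummies.items():
    print(f"{name:<18}{column}")
```

Three levels yield two dummy variables. Choosing a different reference level changes which level is coded as all zeros but not the information carried.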
Because there are multiple dummy variables, we compute multiple means.
$$\text{Special Education} = \frac{0 + 1 + 1 + 0 + 0 + 0 + 0}{7} = \frac{2}{7} \approx 0.286$$
$$\text{Kinesiology} = \frac{1 + 0 + 0 + 0 + 1 + 1 + 0}{7} = \frac{3}{7} \approx 0.429$$
• The proportion of Kinesiology majors is 0.429.
• The proportion of Special Education majors is 0.286.
$$\text{Child Psychology} = 1 - 0.429 - 0.286 = 0.285$$
• The proportion of Child Psychology majors is 0.285 (computed from the rounded values; the exact proportion is 2/7 ≈ 0.286).
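The arithmetic above can be sketched with exact fractions. The snippet takes the mean of each dummy variable, recovers the reference category's proportion by subtraction, and shows how rounding the inputs first shifts the last digit. Variable names are illustrative.

```python
from fractions import Fraction

# Dummy variables from the table above (Child Psychology is the reference).
kinesiology = [1, 0, 0, 0, 1, 1, 0]
spec_ed = [0, 1, 1, 0, 0, 0, 0]

# The mean of each dummy variable is that level's proportion.
p_kin = Fraction(sum(kinesiology), len(kinesiology))
p_sped = Fraction(sum(spec_ed), len(spec_ed))

# The proportions sum to 1, so the reference level is what is left over.
p_child = 1 - p_kin - p_sped

# Subtracting already-rounded proportions gives a slightly different value.
from_rounded = round(1 - round(float(p_kin), 3) - round(float(p_sped), 3), 3)

print(p_kin, p_sped, p_child)
print(round(float(p_child), 3), from_rounded)
```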
Challenge #2: Categorical variables may have many levels
Imagine if we had a variable state that we wanted to analyze. If we were to code this into a set of dummy variables, we would need to create 49 dummy variables! (Or 51 if we include Puerto Rico and Washington, DC.)
ID State MN NY CA
1 Minnesota 1 0 0
2 New York 0 1 0
3 California 0 0 1
4 Iowa 0 0 0
5 North Dakota 0 0 0
6 Texas 0 0 0
7 Oregon 0 0 0
⋮ ⋮ ⋮ ⋮ ⋮
Need to add 46 more dummy variables…
If we use a variable with many levels in an analysis (say we want to see if there are differences in ACT scores across states), we will need to adjust our p-values to account for the high number of comparisons (e.g., Bonferroni adjustment).
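To give a sense of scale, the sketch below counts the dummy variables and the pairwise comparisons among 50 states and applies a Bonferroni adjustment. Two things here are assumptions for illustration and are not stated in the slides: the significance level of 0.05 and the choice to compare every pair of states.

```python
import math

n_states = 50
alpha = 0.05  # assumed overall significance level (not from the slides)

# Dummy coding needs one variable per level except the reference level.
n_dummies = n_states - 1

# Assumption: every state is compared with every other state.
n_pairwise = math.comb(n_states, 2)

# Bonferroni: divide the overall alpha by the number of comparisons.
bonferroni_alpha = alpha / n_pairwise

print(n_dummies, n_pairwise, f"{bonferroni_alpha:.2e}")
```

Under these assumptions, each individual comparison would have to meet a far stricter threshold than 0.05, which is one reason collapsing levels can be attractive.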
Potential Solution: Collapse the variable into fewer categories by combining several categories into a single category.
ID State Region
1 Minnesota Midwest
2 New York East
3 California West
4 Iowa Midwest
5 North Dakota Midwest
6 Texas South
7 Oregon West
⋮ ⋮ ⋮
For our state example we might collapse states into regions. This reduces the number of levels from 50 to 4 or 5 (depending on how many regions we envision).
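Collapsing can be sketched as a simple lookup. The mapping below covers only the seven states shown in the table and uses the region labels from the slide. A full analysis would need an entry for every state, and the name state_to_region is illustrative.

```python
# Region assignments for the seven states shown in the table above.
state_to_region = {
    "Minnesota": "Midwest",
    "New York": "East",
    "California": "West",
    "Iowa": "Midwest",
    "North Dakota": "Midwest",
    "Texas": "South",
    "Oregon": "West",
}

states = [
    "Minnesota", "New York", "California", "Iowa",
    "North Dakota", "Texas", "Oregon",
]

# Replace each state with its region; an unmapped state raises a KeyError,
# which flags values that still need a region assigned.
regions = [state_to_region[s] for s in states]

print(regions)
print(len(set(states)), "levels collapsed to", len(set(regions)))
```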
Challenge #3: One or more categories are very rare
If one or more categories have very few cases relative to others, they will offer little to no information in the analysis (too little variation). In some cases, models may fail to converge.
Potential Solution: Try to collapse these categories into other categories.
ID Self Identified Race/Ethnicity Collapsed Race
1 Hispanic Hispanic
2 African Cuban Other
3 White White
4 African American African American
5 Hispanic Hispanic
6 African American African American
7 Hispanic Hispanic
⋮ ⋮ ⋮
Survey responses might allow respondents to write in information. Below, Respondent #2 chose to write in her/his/their race/ethnicity. Depending on the research question, this could be collapsed into an "Other" category along with other write-in responses that cannot be categorized.
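A rough sketch of this kind of collapsing is shown below. It assumes a predefined set of categories to retain (here, the three that survive in the table); any response outside that set is recoded as "Other". The set name keep is hypothetical, and which categories to retain is a substantive decision for the researcher.

```python
# Self-identified responses from the table above.
responses = [
    "Hispanic", "African Cuban", "White", "African American",
    "Hispanic", "African American", "Hispanic",
]

# Assumed set of categories to retain, matching the collapsed column above.
keep = {"Hispanic", "White", "African American"}

# Any response not in the retained set is collapsed into "Other".
collapsed = [r if r in keep else "Other" for r in responses]

for original, new in zip(responses, collapsed):
    print(f"{original:<18}{new}")
```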
Challenge #4: One category almost always occurs
If almost all of the observations fall into a single category, the variable will offer little to no information in the analysis (too little variation). In some cases, models may fail to converge.
Challenge #5: Your outcome is categorical
When your outcome is categorical, linear models (e.g., regression, ANOVA, t-tests) are no longer appropriate for analyzing your data.
Imagine a researcher examining whether ACT score is predictive of whether or not students graduate college. In this analysis the outcome, graduation = Yes/No, is a dichotomous categorical variable.
A plot of the proportion of students who graduate by ACT score illustrates several problems with using methods that are meant for quantitative data:
• The curve that models the proportion of students who graduate is S-shaped, not linear.
• This is even more apparent if we extrapolate to really low or really high ACT scores; the proportion of students who graduate can never go below 0 or above 1 (these values are bounds/asymptotes for our curve).
• If we are interested in inference, one of the assumptions of the linear model is conditional normality. Proportions are not normally distributed; they are binomially distributed.
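The bounding problem can be illustrated with a small sketch that compares a straight line with a logistic (S-shaped) curve across the ACT score range of 1 to 36. The intercepts and slopes are made-up values chosen only to show the shapes; they are not estimated from any data.

```python
import math

def linear_probability(act, intercept=-0.5, slope=0.05):
    """Straight-line prediction; nothing keeps it between 0 and 1."""
    return intercept + slope * act

def logistic_probability(act, intercept=-5.0, slope=0.25):
    """Logistic curve; always strictly between 0 and 1."""
    return 1 / (1 + math.exp(-(intercept + slope * act)))

act_scores = [1, 10, 20, 30, 36]
linear = [linear_probability(a) for a in act_scores]
logistic = [logistic_probability(a) for a in act_scores]

for a, lin, logi in zip(act_scores, linear, logistic):
    print(f"ACT {a:>2}: linear = {lin:6.2f}, logistic = {logi:5.3f}")
```

With these hypothetical coefficients the line predicts a negative "proportion" at the lowest score and a value above 1 at the highest, while the logistic curve flattens toward its bounds.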
Potential Solution: Use methods that accommodate categorical outcomes.
• Bar charts
• Mosaic plots
• Biserial or Point-Biserial correlation coefficients
• Goodman and Kruskal's Lambda
• Chi-square tests of association/independence
• Tests of Proportion
• Generalized models (e.g., logistic regression)
• ROC analysis
• Survival models
• Classification trees
Research Methodology Consulting Center (RMCC)
Consulting for UMN faculty and researchers
• Grant proposal consulting
• Funded project consulting and services
• Unfunded projects consulting (CEHD only)
Consulting for CEHD graduate students
• General advice about methodology and statistical analysis for dissertation and thesis work
• Four 45-minute consultations are provided each academic year at no cost.
Find out more at
http://www.cehd.umn.edu/research/consulting/