Chapter 7 Analyzing categorical datastazjt/teaching/ST2137/lecture/lec 7.pdf · title \Using a...
Chapter 7
Analyzing categorical data
1
What is a categorical variable?
Examples:
• Gender (“Male”, “Female”)
• Sick or well
• Success or failure
• Age group (“Below 20”, “20 to below 40”, “40 to below 60”, “60 and above”)
2
Common techniques used to analyze categorical data
• Frequency tables
• Contingency tables
• Charts
• Test of proportion
• Chi-square test
3
Questionnaire design and analysis
• A questionnaire is the most common way to collect certain types of data.
• Data that are not collected by computer or online can be entered into the computer manually.
4
SAS: proc freq
data ex7_1;
  input @1 id $3. @4 age 2.0 @6 gender $1. @7 race $1.
        @8 marital $1. @9 education $1. @10 subsi 1.0;
  * Adding labels to the variables;
  label marital = "Marital Status"
        education = "Education Level"
        subsi = "Baby Subsidy";
datalines;
0012911113
0024522222
0033513244
0042711112
0056821323
0066512432
;
5
SAS: proc freq
proc freq data=ex7_1;
  title "Frequency Counts for Categorical Variables";
  tables gender race marital education subsi;
  /* Alternatively, a name range list (double dash) selects the same variables:
  tables gender--subsi;
  */
run;
6
SAS output: proc freq
7
SAS output: proc freq
8
SAS: Adding “Value Labels” (Format)
proc format;
  value $sexfmt "1" = "Male"
                "2" = "Female"
                other = "Miscoded";
  value $race "1" = "Chinese"
              "2" = "Malay"
              "3" = "Indian"
              "4" = "Others";
  value $mari "1" = "Single"
              "2" = "Married"
              "3" = "Widowed"
              "4" = "Divorced";
9
SAS: Adding “Value Labels” (Format)
  value $educ "1" = "O-level or Less"
              "2" = "A-Level or Poly"
              "3" = "Bachelor degree"
              "4" = "Postgraduate degree";
  value agree 1 = "Strongly Disagree"
              2 = "Disagree"
              3 = "No Opinion"
              4 = "Agree"
              5 = "Strongly Agree";
10
SAS: Adding “Value Labels” (Format)
data ex7_1label;
  input @1 id $3. @4 age 2.0 @6 gender $1. @7 race $1.
        @8 marital $1. @9 education $1. @10 subsi 1.0;
  label marital = "Marital Status"
        education = "Education Level"
        subsi = "Baby Subsidy";
  format gender $sexfmt. race $race. marital $mari.
         education $educ. subsi agree.;
11
SAS: Adding “Value Labels” (Format)
datalines;
0012911113
0024522222
0033513244
0042711112
0056821323
0066512432
;
proc freq data=ex7_1label;
  title "Frequency Counts for Categorical Variables";
  tables gender race marital education subsi;
run;
12
SAS output: proc freq
13
SAS output: proc freq
14
SAS: Using a format to recode a variable
proc format;
  value agegp low-20  = "0-20"
              21-40   = "21-40"
              41-60   = "41-60"
              60<-high = "Greater than 60"  /* 60 excluded here so the range does not overlap 41-60 */
              .       = "Did not Answer"
              other   = "Out of Range";
proc freq data=ex7_1label;
  title "Using a Format to Group a Numeric Variable";
  tables age;
  format age agegp.;
run;
15
SAS output: Using a format to recode a variable
16
R: Adding value labels
> ex7.1 = read.fwf("D:/ST2137/ex7_1.txt", header = F, widths = c(3, 2, 1, 1, 1, 1, 1))
> names(ex7.1) = c("id", "age", "gender", "race", "marital", "education", "subsi")
> attach(ex7.1)
> gendername = c("Male", "Female")
> gendergp = gendername[gender]
> gender
[1] 1 2 1 1 2 1
> gendergp
[1] "Male"   "Female" "Male"   "Male"   "Female" "Male"
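Indexing a vector of names by the code works as long as every code is a valid position. As a hedged alternative, the sketch below uses `factor()`, which keeps the category order and turns any unlisted code into `NA`. The `gender` vector is retyped here from the six records in the SAS data step so the snippet runs on its own; it is not read from the course file.

```r
# Gender codes retyped from the six example records (1 = Male, 2 = Female)
gender <- c(1, 2, 1, 1, 2, 1)

# factor() attaches labels to codes; codes not listed in `levels` become NA
gendergp <- factor(gender, levels = c(1, 2), labels = c("Male", "Female"))

print(gendergp)
# useNA = "ifany" adds an NA column only when miscoded values are present
print(table(gendergp, useNA = "ifany"))
```

Unlike the character-vector approach, the resulting table lists categories in the order given in `levels` rather than alphabetically.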
17
R: Recode a variable
> agegpname = c("low-20", "21-40", "41-60", "61-80", "over 80")
> agegp = agegpname[ceiling(age/20)]
> age
[1] 29 45 35 27 68 65
> agegp
[1] "21-40" "41-60" "21-40" "21-40" "61-80" "61-80"
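The `ceiling(age/20)` trick relies on all groups having the same width. A minimal sketch of the same recoding with `cut()`, which also handles unequal group widths, is given below. The `age` vector is retyped from the output above, and the break points are chosen to mimic `ceiling(age/20)`; they are an assumption of this sketch, not part of the original slides.

```r
age <- c(29, 45, 35, 27, 68, 65)

# right = TRUE (the default) gives intervals (0,20], (20,40], ..., (80,Inf],
# which matches ceiling(age/20) for positive ages
agegp <- cut(age,
             breaks = c(0, 20, 40, 60, 80, Inf),
             labels = c("low-20", "21-40", "41-60", "61-80", "over 80"))

print(agegp)
print(table(agegp))
```

Because `cut()` returns a factor, empty groups such as "low-20" still appear in the table with a count of zero.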
18
R: Table
> gendername = c("Male", "Female")
> gendergp = gendername[gender]
> table(gendergp)
gendergp
Female   Male
     2      4
19
R: Table
> agegpname = c("low-20", "21-40", "41-60", "61-80", "over 80")
> agegp = agegpname[ceiling(age/20)]
> table(agegp)
agegp
21-40 41-60 61-80
    3     1     2
20
R: Table
> racegpname = c("Chinese", "Malay", "Indian", "Others")
> racegp = racegpname[race]
> table(racegp)
racegp
Chinese  Indian   Malay
      3       1       2
21
R: Table
> marigpname = c("Single", "Married", "Widowed", "Divorced")
> marigp = marigpname[marital]
> table(marigp)
marigp
Divorced  Married   Single  Widowed
       1        2        2        1
22
R: Table
> educgpname = c("(1)High Sch or Less", "(2)A-Level or Poly",
+                "(3)Bachelor degree", "(4)Postgraduate degree")
> educgp = educgpname[education]
> table(educgp)
educgp
   (1)High Sch or Less     (2)A-Level or Poly
                     2                      2
    (3)Bachelor degree (4)Postgraduate degree
                     1                      1
23
R: Table
> likegpname = c("(1)Strongly Disagree", "(2)Disagree",
+                "(3)No Opinion", "(4)Agree", "(5)Strongly Agree")
> subsigp = likegpname[subsi]
> table(subsigp)
subsigp
  (2)Disagree (3)No Opinion      (4)Agree
            3             2             1
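`table()` returns counts only, whereas the SAS frequency procedure also reports percentages and cumulative figures. The following is a rough sketch of how a comparable summary could be assembled in R. The `race` codes are retyped from the example records, and the column names (`Frequency`, `Percent`, `CumFreq`, `CumPercent`) are illustrative choices, not output copied from SAS.

```r
# Race codes retyped from the six example records
race <- c(1, 2, 3, 1, 1, 2)
racegp <- factor(race, levels = 1:4,
                 labels = c("Chinese", "Malay", "Indian", "Others"))

counts <- table(racegp)
props  <- as.vector(prop.table(counts))

# One row per category: count, percent, and running totals
freq <- data.frame(
  Category   = names(counts),
  Frequency  = as.vector(counts),
  Percent    = round(100 * props, 2),
  CumFreq    = cumsum(as.vector(counts)),
  CumPercent = round(100 * cumsum(props), 2)
)
print(freq)
```

Using a factor with all four levels keeps "Others" in the summary even though no respondent in this small sample falls into it.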
24
SPSS: Frequency tables
• Suppose the data set on slide 5 has been imported into SPSS.
• “Analyze” → “Descriptive Statistics” → “Frequencies...”
• Move the variables to the “Variables” panel → “OK”
25
SPSS output: Frequency tables
26
SPSS output: Frequency tables
27
Two-way frequency tables
Count the occurrences of one variable at each level of another variable.
For example, we would like to know:
1. How many males and females were there in the sample?
2. How many respondents were for Candidate A and how many were for Candidate B?
3. How many males and females were for Candidate A and B, respectively?
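The three questions correspond to the row margin, the column margin, and the cell counts of a two-way table. A small sketch in R follows; the counts are typed in from the gender-by-candidate table reported in the R output later in this chapter, rather than read from the course data file.

```r
# Counts typed in from the gender-by-candidate table shown later in the chapter
tab <- as.table(matrix(c(70, 30,
                         40, 40),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(gender = c("F", "M"),
                                       candid = c("A", "B"))))

print(margin.table(tab, 1))  # question 1: totals by gender
print(margin.table(tab, 2))  # question 2: totals by candidate
print(addmargins(tab))       # question 3: joint counts, with totals appended
```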
28
Two-way frequency tables: SAS
proc format;
  value $genfmt "M" = "Male"
                "F" = "Female"
                other = "Miscoded";
  value $candfmt "A" = "Candidate A"
                 "B" = "Candidate B";
29
Two-way frequency tables: SAS
data ex7_2;
  infile "D:\ST2137\ex7_2.txt";
  input gender $ candid $;
  label gender = "Gender"
        candid = "Candidate";
  format gender $genfmt.
         candid $candfmt.;
run;
proc freq data=ex7_2;
  tables gender*candid / chisq;
run;
30
Two-way frequency tables: SAS output
31
Two-way frequency tables: SAS output
32
Computing Chi-square from frequency counts: SAS
/* Computing Chi-square from frequency counts */
data ex7_2c;
  input group $ outcome $ count;
datalines;
drug alive 90
drug dead 10
placebo alive 80
placebo dead 20
;
proc freq data=ex7_2c;
  tables group*outcome / chisq;
  weight count;
run;
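For comparison, here is a hedged sketch of the same idea in R when the counts are stored in long format (one row per cell), as in the SAS data step above. `xtabs()` sums the `count` column within each cell, which plays a role similar to the SAS `weight` statement. The data frame name `counts` is introduced only for this illustration.

```r
# Long-format counts, one row per cell, mirroring the SAS datalines
counts <- data.frame(
  group   = c("drug", "drug", "placebo", "placebo"),
  outcome = c("alive", "dead", "alive", "dead"),
  count   = c(90, 10, 80, 20)
)

# The left-hand side of the formula supplies the cell frequencies
tab <- xtabs(count ~ group + outcome, data = counts)
print(tab)

# correct = FALSE requests the Pearson statistic without continuity correction
res <- chisq.test(tab, correct = FALSE)
print(res)
```

With `correct = FALSE` the statistic is the uncorrected Pearson chi-square, so it will differ from the continuity-corrected value that `chisq.test()` reports by default for a 2 x 2 table.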
33
Two-way frequency tables: SAS output
34
Two-way frequency tables: SAS output
35
Two-way frequency tables: R
> ex7.2 = read.table("D:/ST2137/ex7_2.txt", header = F)
> names(ex7.2) = c("gender", "candid")
> table(ex7.2)
      candid
gender  A  B
     F 70 30
     M 40 40
36
Two-way frequency tables: R
> chisq.test(table(ex7.2))

        Pearson's Chi-squared test with Yates' continuity correction

data:  table(ex7.2)
X-squared = 6.6626, df = 1, p-value = 0.009846

Computing chi-square from the frequency counts: R
> v = matrix(c(90, 10, 80, 20), ncol = 2, byrow = T)   # byrow = T so each row is one group
> v = data.frame(v)
> names(v) = c("Alive", "Dead")
> row.names(v) = c("Drug", "Control")
> chisq.test(v)

        Pearson's Chi-squared test with Yates' continuity correction

data:  v
X-squared = 3.1765, df = 1, p-value = 0.0747
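To make the reported statistic less of a black box, the sketch below recomputes the continuity-corrected chi-square for the drug/control counts by hand: expected counts come from the row and column totals, and 0.5 is subtracted from each absolute deviation before squaring. Variable names such as `obs` and `x2_yates` are illustrative only.

```r
obs <- matrix(c(90, 10,
                80, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Drug", "Control"), c("Alive", "Dead")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Yates' correction: subtract 0.5 from each |observed - expected| before squaring
x2_yates <- sum((abs(obs - expected) - 0.5)^2 / expected)
p_value  <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

print(expected)
print(round(x2_yates, 4))
print(round(p_value, 4))
```

The two printed values should agree with the X-squared and p-value shown in the `chisq.test()` output for these counts.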
37
Two-way frequency tables: SPSS
• “Analyze” → “Descriptive Statistics” → “Crosstabs...”
• Move one of the variables to the “Row” window and the second variable to the “Column(s)” window.
38
Two-way frequency tables: SPSS
• Click on “Statistics”
• Choose “Chi-square” or some other statistics → “Continue” → “OK”
39
Computing Chi-square from frequency tables: SPSS
• Data file as shown below
• “Data” → “Weight Cases”
• Move the variable “Count” to the “Frequency Variable” panel under the “Weight cases by” option
• Proceed as on slides 38-39.
40
Computing Chi-square from frequency tables: SPSS
41
Paired Data
• Paired data arise when the subjects respond to a question under two different conditions (e.g. before and after treatment).
• Paired designs are also used when a specific person is matched on some criteria, such as age and gender, to another person for the purpose of analysis.
42
McNemar’s test for paired data: SAS
proc format;
  value $opin "p" = "Positive"
              "n" = "Negative";
run;
data ex7_3;
  length before after $1.;
  infile "D:\ST2137\ex7_3.txt";
  input subject before $ after $;
  format before after $opin.;
proc freq data=ex7_3;
  title "McNemar's Test for Paired Samples";
  tables before*after / agree;
run;
43
McNemar’s test for paired data: SAS output
44
McNemar’s test for paired data: SAS output
45
McNemar’s test for frequency counts: SAS
proc format;
  value $opin "p" = "Positive"
              "n" = "Negative";
run;
data ex7_3c;
  length before after $1.;
  input after $ before $ count;
  format before after $opin.;
datalines;
n n 32
n p 30
p n 15
p p 23
;
46
McNemar’s test for frequency counts: SAS
proc freq data=ex7_3c;  /* the count data set, which contains the weight variable */
  title "McNemar's Test for Paired Samples";
  tables before*after / agree;
  weight count;
run;
47
McNemar’s test: R
# Example 7.3
> ex7.3 = read.table("D:/ST2137/ex7_3.txt", header = F)
> names(ex7.3) = c("ID", "Before", "After")
> attach(ex7.3)
> mcnemar.test(table(ex7.3[, 2:3]))

        McNemar's Chi-squared test with continuity correction

data:  table(ex7.3[, 2:3])
McNemar's chi-squared = 4.3556, df = 1, p-value = 0.03689
48
McNemar’s test for Frequency Counts: R
# Example 7.3c: Handling frequency counts
> ex7.3c = matrix(c(32, 15, 30, 23), nr = 2, byrow = T,
+                 dimnames = list("Before" = c("No", "Yes"), "After" = c("No", "Yes")))
> ex7.3c
      After
Before No Yes
   No  32  15
   Yes 30  23
> mcnemar.test(ex7.3c)

        McNemar's Chi-squared test with continuity correction

data:  ex7.3c
McNemar's chi-squared = 4.3556, df = 1, p-value = 0.03689
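McNemar's test uses only the discordant pairs, that is, the subjects whose answer changed between "before" and "after". The sketch below recomputes the continuity-corrected statistic by hand from the counts in Example 7.3c. The names `n12` and `n21` are introduced here for the two off-diagonal cells and are not notation from the slides.

```r
tab <- matrix(c(32, 15,
                30, 23),
              nrow = 2, byrow = TRUE,
              dimnames = list(Before = c("No", "Yes"), After = c("No", "Yes")))

n12 <- tab["No", "Yes"]   # changed from No to Yes
n21 <- tab["Yes", "No"]   # changed from Yes to No

# Continuity-corrected McNemar statistic, referred to chi-square with 1 df
x2 <- (abs(n12 - n21) - 1)^2 / (n12 + n21)
p  <- pchisq(x2, df = 1, lower.tail = FALSE)

print(round(x2, 4))
print(round(p, 5))
```

The concordant cells (32 and 23) do not enter the formula at all, which is why the test is sensitive only to the imbalance between the two kinds of change.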
49
McNemar’s Test: SPSS
• “Analyze”→ “Descriptive Statistics” →“Crosstabs...”
• Move “Before” to the “Row” window and “After” to the “Column(s)” window.
• Click on “Statistics...” and choose “McNemar”
• “Continue”→“OK”
50
McNemar’s Test: SPSS
51
McNemar’s Test: SPSS
If frequency counts are available instead of the raw data, then we can weight the data in the following way.
“Data” → “Weight Cases...”
52