Dummy Var Regression Lecture 2008
8/12/2019 Dummy Var Regression Lecture 2008
SAS Regression Using Dummy Variables and Oneway ANOVA
/*****************************************************************
This example illustrates:
     How to create side-by-side boxplots
     How to create dummy variables
     How to use dummy variables in a linear regression model
     How to fit a oneway ANOVA

Procs used:
     Proc Means
     Proc Boxplot
     Proc Reg
     Proc Univariate
     Proc GLM

Filename: dummyvariables.sas
*******************************************************************/
The commands below allow us to utilize the user-defined formats along with our permanent SAS data set.

OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
libname b510 "e:\510\";
options fmtsearch=(WORK b510);
options nofmterr;
We get descriptive statistics for all variables within each level of ORIGIN.

proc means data=b510.cars;
  class origin;
run;

The MEANS Procedure

ORIGIN  N Obs   Variable     N           Mean        Std Dev       Minimum       Maximum
-----------------------------------------------------------------------------------------
USA       253   MPG        248     20.1282258      6.3768059    10.0000000    39.0000000
                ENGINE     253    247.7134387     98.7799678    85.0000000   455.0000000
                HORSE      249    119.6064257     39.7991647    52.0000000   230.0000000
                WEIGHT     253        3367.33    788.6117392       1800.00       5140.00
                ACCEL      253     14.9284585      2.8011159     8.0000000    22.2000000
                YEAR       253     75.5217391      3.7145843    70.0000000    82.0000000
                CYLINDER   253      6.2766798      1.6626528     4.0000000     8.0000000

Europe     73   MPG         70     27.8914286      6.7239296    16.2000000    44.3000000
                ENGINE      73    109.4657534     22.3719083    68.0000000   183.0000000
                HORSE       71     81.0000000     20.8134572    46.0000000   133.0000000
                WEIGHT      73        2431.49    490.8836172       1825.00       3820.00
                ACCEL       73     16.8219178      3.0109175    12.2000000    24.8000000
                YEAR        73     75.7397260      3.5630332    70.0000000    82.0000000
                CYLINDER    73      4.1506849      0.4907826     4.0000000     6.0000000

Japan      79   MPG         79     30.4506329      6.0900481    18.0000000    46.6000000
                ENGINE      79    102.7088608     23.1401260    70.0000000   168.0000000
                HORSE       79     79.8354430     17.8191991    52.0000000   132.0000000
                WEIGHT      79        2221.23    320.4972479       1613.00       2930.00
                ACCEL       79     16.1721519      1.9549370    11.4000000    21.0000000
                YEAR        79     77.4430380      3.6505947    70.0000000    82.0000000
                CYLINDER    79      4.1012658      0.5904135     3.0000000     6.0000000
-----------------------------------------------------------------------------------------
We can see that the mean of vehicle miles per gallon (MPG) is lowest for the American cars, intermediate for
the European cars, and highest for the Japanese cars.
We now look at a side-by-side boxplot of miles per gallon (MPG) for each level of origin.
/*Get side-by-side boxplots of MPG for each vehicle origin*/
proc sort data=b510.cars;
  by origin;
run;

goptions device=win target=winprtm;

proc boxplot data=b510.cars;
  plot mpg*origin / boxstyle=schematic;
run;
The boxplot shows the pattern of means that we noted in the descriptive statistics. The variance is similar for the
American, European and Japanese cars. The distribution of mpg is somewhat positively skewed for American
and European cars, and negatively skewed for Japanese cars. There are some high outliers in the American and
European cars. Because ORIGIN is a nominal variable, we will not be tempted to think of this as an ordinal
relationship. If we had a different coding for ORIGIN, this graph would have shown a different pattern.
Before we can fit a linear regression model with a categorical (in this case, nominal) predictor we need to create
dummy variables to be used in the model. We will create three dummy variables, even though only two of them will
be used in the regression model. Each dummy variable will be coded as 0 or 1. A value of 1 will indicate that a
case is in a given level of origin, and a value of 0 will indicate that the case is not in that level of origin.
The dummy variables for ORIGIN are created in the data step below. Note that the output shows the one car
with a missing origin does not have a value for any of the dummy variables. Also note that the frequency
tabulations for the three dummy variables show that there is one missing value for each dummy variable.
/*Data step to create dummy variables for each level of ORIGIN*/
data b510.cars2;
  set b510.cars;
  if origin not=. then do;
    American=(origin=1);
    European=(origin=2);
    Japanese=(origin=3);
  end;
run;

proc print data=b510.cars2;
  var origin American European Japanese weight;
run;
Obs    ORIGIN    American    European    Japanese    WEIGHT
  1    .                .           .           .       732
  2    USA              1           0           0      1800
  3    USA              1           0           0      1875
  4    USA              1           0           0      1915
  5    USA              1           0           0      1955
. . .
256    Europe           0           1           0      1825
257    Europe           0           1           0      1834
258    Europe           0           1           0      1835
259    Europe           0           1           0      1835
260    Europe           0           1           0      1845
. . .
329    Japan            0           0           1      1649
330    Japan            0           0           1      1755
331    Japan            0           0           1      1760
332    Japan            0           0           1      1773
proc freq data=b510.cars2;
  tables origin american european japanese;
run;

The FREQ Procedure

                                    Cumulative    Cumulative
ORIGIN      Frequency    Percent     Frequency       Percent
USA               253      62.47           253         62.47
Europe             73      18.02           326         80.49
Japan              79      19.51           405        100.00
                  Frequency Missing = 1

                                    Cumulative    Cumulative
American    Frequency    Percent     Frequency       Percent
0                 152      37.53           152         37.53
1                 253      62.47           405        100.00
                  Frequency Missing = 1

                                    Cumulative    Cumulative
European    Frequency    Percent     Frequency       Percent
0                 332      81.98           332         81.98
1                  73      18.02           405        100.00
                  Frequency Missing = 1

                                    Cumulative    Cumulative
Japanese    Frequency    Percent     Frequency       Percent
0                 326      80.49           326         80.49
1                  79      19.51           405        100.00
                  Frequency Missing = 1
We can now fit a regression model to predict MPG for each Origin. We will use American cars as the reference
category in this model. To do this we will include the dummy variables for European and Japanese cars in our
model. These two dummy variables represent a contrast between the average MPG for European vs. American
cars and Japanese vs. American cars, respectively. In general, if you have k categories in your categorical
variable, you will need to include k-1 dummy variables in the regression model.
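Written out, the model described above can be sketched as follows. The symbols \beta_0, \beta_1, \beta_2 and \varepsilon_i are notation introduced here for illustration; they do not appear in the original handout.

```latex
\begin{align*}
\text{MPG}_i &= \beta_0 + \beta_1\,\text{European}_i + \beta_2\,\text{Japanese}_i + \varepsilon_i \\
E[\text{MPG} \mid \text{American}] &= \beta_0 \\
E[\text{MPG} \mid \text{European}] &= \beta_0 + \beta_1 \\
E[\text{MPG} \mid \text{Japanese}] &= \beta_0 + \beta_2
\end{align*}
```

With k = 3 levels of ORIGIN, k-1 = 2 dummy variables enter the model. American cars have both dummy variables equal to 0, so their mean is the intercept alone.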
/*Fit a regression model with American cars as the reference category*/
proc reg data=b510.cars2;
  model mpg = european japanese;
  plot residual.*predicted.;
  output out=regdat p=predicted r=residual rstudent=rstudent;
run; quit;

The REG Procedure
Model: MODEL1
Dependent Variable: MPG

Number of Observations Read                       406
Number of Observations Used                       397
Number of Observations with Missing Values          9

                        Analysis of Variance
                                Sum of          Mean
Source               DF        Squares        Square    F Value    Pr > F
Model                 2     7984.95725    3992.47862      97.97    <.0001
Error               394          16056      40.75232
Corrected Total     396          24041

Root MSE            6.38375    R-Square    0.3321
Dependent Mean     23.55113    Adj R-Sq    0.3287
Coeff Var          27.10593

                        Parameter Estimates
                     Parameter      Standard
Variable     DF       Estimate         Error    t Value    Pr > |t|
Intercept     1       20.12823       0.40537      49.65      <.0001
European      1        7.76320       0.86400       8.99      <.0001
Japanese      1       10.32241       0.82473      12.52      <.0001
The parameter estimate for the intercept represents the estimated MPG for the reference category, American
cars. Compare this estimate (20.128) with the mean MPG for American cars in the descriptive statistics. The
parameter estimate for European (7.76) represents the contrast in mean MPG for European cars vs. American
cars (the reference). That is, European cars are estimated to have a mean value of MPG that is 7.76 units higher
than American cars on average. This difference is significant (t(394) = 8.99, p<0.0001). The parameter estimate for
Japanese (10.32) represents the contrast in mean MPG for Japanese cars vs. American cars. Japanese cars are
estimated to have a mean value of MPG that is 10.32 units higher than American cars. This difference is
significant (t(394) = 12.52, p<0.0001).
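As a check on the reported tests, each t value is the parameter estimate divided by its standard error. The sketch below uses the rounded values from the Parameter Estimates table; the subscripted t notation is introduced here for illustration.

```latex
\begin{align*}
t_{\text{European}} &= \frac{7.76320}{0.86400} \approx 8.99 \\
t_{\text{Japanese}} &= \frac{10.32241}{0.82473} \approx 12.52
\end{align*}
```

The 394 degrees of freedom are the error degrees of freedom: 397 observations used minus 3 estimated parameters.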
We can calculate the mean MPG for European cars by adding the intercept plus the parameter estimate for
European (20.12823 + 7.76320 = 27.89143). We calculate the mean MPG for Japanese cars by adding the
intercept plus the parameter estimate for Japanese (20.12823 + 10.32241 = 30.45064). We can compare these
values with the values in the output from Proc Means, and see that they agree within rounding error.
We now look at the residual vs. predicted values plot to see if there is approximate equality of variances across
levels of origin. Notice that there is only one predicted value of MPG for each ORIGIN. We can see that the
spread of residuals for each origin is approximately the same, indicating that we have reasonable
homoskedasticity for this model fit.
We now look at the distribution of the studentized-deleted residuals for this model, using Proc Univariate.
[...] (e.g., the log of Y) to get more normally distributed residuals. However, these
departures from normality do not appear to be severe, and transformations will not be explored here.
                     Tests for Normality
Test                  ---Statistic---     ------p Value------
Shapiro-Wilk          W       0.962915    Pr < W      <0.0001
Kolmogorov-Smirnov    D       0.083057    Pr > D      <0.0100
Cramer-von Mises      W-Sq    0.587669    Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq    3.889483    Pr > A-Sq   <0.0050
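The Proc Univariate step that produced these normality tests is not legible in this copy. A minimal sketch of what such a step could look like, assuming the residuals are in the regdat data set created by the output statement of Proc Reg and that the studentized-deleted residual is stored in rstudent. The histogram and qqplot statements are illustrative additions, not taken from the handout.

```sas
/* Sketch only: check normality of the studentized-deleted residuals. */
/* Assumes regdat and rstudent come from the earlier PROC REG output  */
/* statement; the two plot statements are illustrative assumptions.   */
proc univariate data=regdat normal;   /* normal requests the tests for normality */
  var rstudent;
  histogram rstudent / normal;        /* histogram with a fitted normal curve */
  qqplot rstudent / normal(mu=est sigma=est);  /* normal Q-Q plot */
run;
```

The normal option on the proc statement is what requests the Shapiro-Wilk, Kolmogorov-Smirnov, Cramer-von Mises and Anderson-Darling tests.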
We now examine the output data set (regdat) produced by Proc Reg (the name we choose for this data set is
arbitrary). This new data set will contain all the original variables, plus the new ones that we requested. Again,
notice that the predicted value is the same for all observations within the same origin, and is equal to the mean
MPG for that level of origin. Also, notice that some of the residuals are positive and some are negative.
/*Take a look at the output data set*/
proc print data=regdat;
  var mpg origin predicted residual rstudent;
run;
Obs    MPG    ORIGIN    predicted    residual    rstudent
  1      9    .                 .           .           .
  2     36    USA         20.1282     15.9718     2.52403
  3     39    USA         20.1282     18.8718     2.99194
  4     36    USA         20.1282     15.5718     2.45983
  5     26    USA         20.1282      5.8718     0.92148
  6     29    USA         20.1282      8.8718     1.39422
  7     34    USA         20.1282     14.2718     2.25170
  8     25    USA         20.1282      4.8718     0.76429
  9     31    USA         20.1282     10.3718     1.63143
 10     34    USA         20.1282     13.3718     2.10805
. . .
101     23    USA         20.1282     2.37177     0.37188
102     14    USA         20.1282    -6.12823    -0.96182
103     20    USA         20.1282    -0.12823    -0.02010
104     18    USA         20.1282    -2.12823    -0.33368
. . .
308     22    Europe      27.8914     -6.2914    -0.99263
309      .    Europe      27.8914           .           .
310     20    Europe      27.8914     -7.5914    -1.19843
311     19    Europe      27.8914     -8.8914    -1.40461
312     18    Europe      27.8914     -9.8914    -1.56351
313     22    Europe      27.8914     -5.8914    -0.92938
314     36    Europe      27.8914      8.5086     1.34384
315     23    Europe      27.8914     -4.8914    -0.77137
316     21    Europe      27.8914     -6.8914    -1.08757
317      .    Europe      27.8914           .           .
318     17    Europe      27.8914    -10.8914    -1.72272
319     20    Europe      27.8914     -7.8914    -1.24597
320     31    Europe      27.8914      2.8086     0.44268
321     27    Europe      27.8914     -0.6914    -0.10896
322     28    Europe      27.8914      0.2086     0.03287
323     30    Europe      27.8914      2.1086     0.33231
. . .
362     18    Japan       30.4506    -12.4506    -1.96999
363     27    Japan       30.4506     -3.4506    -0.54350
364     27    Japan       30.4506     -3.4506    -0.54350
365     30    Japan       30.4506     -0.9506    -0.14968
366     34    Japan       30.4506      3.3494     0.52754
367     28    Japan       30.4506     -2.4506    -0.38592
368     36    Japan       30.4506      5.5494     0.87459
369     29    Japan       30.4506     -1.4506    -0.22841
370     36    Japan       30.4506      5.5494     0.87459
371     34    Japan       30.4506      3.2494     0.51178
We now fit a new model, using Japan as the reference category of ORIGIN. This time we include the two
dummy variables, American and European, in our model. The SAS commands and output are shown below:
/*Refit the model using Japan as the reference category*/
proc reg data=b510.cars2;
  model mpg = american european;
  plot residual.*predicted.;
run; quit;
The REG Procedure
Model: MODEL1
Dependent Variable: MPG

Number of Observations Read                       406
Number of Observations Used                       397
Number of Observations with Missing Values          9

                        Analysis of Variance
                                Sum of          Mean
Source               DF        Squares        Square    F Value    Pr > F
Model                 2     7984.95725    3992.47862      97.97    <.0001
Error               394          16056      40.75232
Corrected Total     396          24041

Root MSE            6.38375    R-Square    0.3321
Dependent Mean     23.55113    Adj R-Sq    0.3287
Coeff Var          27.10593

                        Parameter Estimates
                     Parameter      Standard
Variable     DF       Estimate         Error    t Value    Pr > |t|
Intercept     1       30.45063       0.71823      42.40      <.0001
American      1      -10.32241       0.82473     -12.52      <.0001
European      1       -2.55920       1.04787      -2.44      0.0150
Notice that the Analysis of Variance table for this model is the same as for the previous model. However, the
parameter estimates differ, because they represent different quantities than they did in the first model. The
intercept is now the estimated mean MPG for Japanese cars (30.45). The parameter estimate for American
represents the contrast in the mean MPG for American cars vs. Japanese cars (American cars have, on average,
10.32 less MPG than do Japanese cars). The parameter estimate for European represents the contrast in the
mean MPG for European cars vs. Japanese cars (European cars have, on average, 2.56 MPG less than Japanese
cars).
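The two parameterizations describe the same three group means, so the second set of estimates can be recovered from the first. In this sketch, b_0, b_1 and b_2 denote the intercept, European and Japanese estimates from the first model (American reference); this notation is an assumption introduced here.

```latex
\begin{align*}
\text{Intercept (Japan reference)} &= b_0 + b_2 = 20.12823 + 10.32241 = 30.45064 \\
\text{American} &= -b_2 = -10.32241 \\
\text{European} &= b_1 - b_2 = 7.76320 - 10.32241 = -2.55921
\end{align*}
```

These agree with the reported estimates for the second model (30.45063, -10.32241 and -2.55920) up to rounding in the last digit.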
We can use another method to fit this same linear model. When we use Proc GLM, we do not have to create the
dummy variables as we did for Proc Reg. Here is sample SAS code for fitting a oneway ANOVA model using
Proc GLM. Note the class statement specifying ORIGIN as a class variable. This causes SAS to create dummy
variables for ORIGIN automatically. SAS will use the highest formatted level (USA) as the reference category.
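The Proc GLM code itself is not legible in this copy of the handout. A minimal sketch consistent with the output that follows, assuming the same b510.cars2 data set. The solution option and the means statement with the tukey option are inferred from the parameter estimates and pairwise comparisons in the output; they are not copied from the original.

```sas
/* Sketch only: oneway ANOVA with ORIGIN as a class variable.         */
/* The options below are assumptions inferred from the output shown.  */
proc glm data=b510.cars2;
  class origin;                    /* SAS builds the dummy variables    */
  model mpg = origin / solution;   /* solution prints parameter estimates */
  means origin / tukey;            /* Tukey-adjusted pairwise comparisons */
run; quit;
```

Because ORIGIN is listed in the class statement, no hand-built dummy variables are needed, and the last level in formatted sort order is the one whose estimate is set to zero.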
                                 Sum of
Source              DF          Squares     Mean Square    F Value    Pr > F
Model                2       7984.95725      3992.47862      97.97    <.0001
Error              394      16056.41474        40.75232
Corrected Total    396      24041.37199

R-Square     Coeff Var      Root MSE      MPG Mean
0.332134      27.10593      6.383755      23.55113

Source     DF       Type I SS     Mean Square    F Value    Pr > F
ORIGIN      2     7984.957245     3992.478623      97.97    <.0001

Source     DF     Type III SS     Mean Square    F Value    Pr > F
ORIGIN      2     7984.957245     3992.478623      97.97    <.0001

                                     Standard
Parameter            Estimate           Error    t Value    Pr > |t|
Intercept         20.12822581      0.40536882      49.65      <.0001
ORIGIN Europe      7.76320276      0.86400226       8.99      <.0001
ORIGIN Japan      10.32240710      0.82472786      12.52      <.0001
ORIGIN USA         0.00000000       .                .         .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse ...
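                     Difference
ORIGIN                  Between      Simultaneous 95%
Comparison                Means     Confidence Limits
. . .
Europe - Japan          -2.5592     -5.0244    -0.0940    ***
Europe - USA             7.7632      5.7305     9.7959    ***
USA    - Japan         -10.3224    -12.2627    -8.3821    ***
USA    - Europe         -7.7632     -9.7959    -5.7305    ***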
We can see from the above output that all pairwise comparisons of means are significant at the 0.05 level, after
applying the Tukey method for multiple comparisons. SAS shows each comparison of means twice, but that
really isn't necessary. There are many methods for multiple comparisons available in SAS Proc GLM and other
ANOVA procedures, such as Proc Mixed.