Dummy Var Regression Lecture 2008

download Dummy Var Regression Lecture 2008

of 9

Transcript of Dummy Var Regression Lecture 2008

  • 8/12/2019 Dummy Var Regression Lecture 2008

    1/9

    SAS Regression Using Dummy Variables and Oneway ANOVA

    /*****************************************************************This example illustrates: How to create side-by-side boxplots How to create dummy variables

    How to use dummy variables in a linear regression model How to fit a oneway AN!A

    "rocs used: "roc #eans "roc $oxplot "roc %eg "roc &nivariate "roc '(#)ilename: dummyvariables+sas*******************************************************************/

    The commands below allow us to utilize the user-defined formats along with our permanent SAS data set.

    "T,N )%#.HA%01----121---21-/345*06libname b789 0e:3789306options fmtsearch;%< b789=6options nofmterr6

    We get descriptive statistics for all variables within each level of ORII!.

    proc means datab789+cars6 class origin6run6 The MEANS Procedure NORIGIN Obs Variable N Mean Std Dev Minimum Maximum!SA "#$ MPG "%& "'()"&""#& *($+*&'#, )'(''''''' $,('''''''

    ENGINE "#$ "%+(+)$%$&+ ,&(++,,*+& (''''''' %##(''''''' -ORSE "%, )),(*'*%"#+ $,(+,,)*%+ #"(''''''' "$'(''''''' .EIG-T "#$ $$*+($$ +&&(*))+$," )&''('' #)%'('' A//E0 "#$ )%(,"&%# "(&')))#, &(''''''' ""("'''''' 1EAR "#$ +#(#")+$,) $(+)%#&%$ +'(''''''' &"(''''''' /10INDER "#$ *("+**+,& )(**"*#"& %(''''''' &('''''''

    Euro2e +$ MPG +' "+(&,)%"&* *(+"$,",* )*("'''''' %%($'''''' ENGINE +$ )',(%*#+#$% ""($+),'&$ *&(''''''' )&$(''''''' -ORSE +) &)(''''''' "'(&)$%#+" %*(''''''' )$$(''''''' .EIG-T +$ "%$)(%, %,'(&&$*)+" )&"#('' $&"'('' A//E0 +$ )*(&"),)+& $(')',)+# )"("'''''' "%(&'''''' 1EAR +$ +#(+$,+"*' $(#*$'$$" +'(''''''' &"(''''''' /10INDER +$ %()#'*&%, '(%,'+&"* %(''''''' *('''''''

    3a2an +, MPG +, $'(%#'*$", *(',''%&) )&(''''''' %*(*'''''' ENGINE +, )'"(+'&&*'& "$()%')"*' +'(''''''' )*&(''''''' -ORSE +, +,(&$#%%$' )+(&),),,) #"(''''''' )$"(''''''' .EIG-T +, """)("$ $"'(%,+"%+, )*)$('' ",$'('' A//E0 +, )*()+")#), )(,#%,$+' ))(%'''''' ")(''''''' 1EAR +, ++(%%$'$&' $(*#'#,%+ +'(''''''' &"(''''''' /10INDER +, %()')"*#& '(#,'%)$# $(''''''' *('''''''

    We can see that the mean of vehicle miles per gallon "#$% is lowest for the American cars& intermediate forthe 'uropean cars& and highest for the (apanese cars.

    )

  • 8/12/2019 Dummy Var Regression Lecture 2008

    2/9

    We now loo* at a side-b+-side bo,plot of miles per gallon "#$% for each level of origin.

    /*'et side-by-side boxplots of ;eight for each vehicle origin*/proc sort datab789+cars6 by origin6

    run6goptions devicewin targetwinprtm6proc boxplot datab789+cars6plot mpg * origin / boxstyleschematic6run6

    The bo,plot shows the pattern of means that we noted in the descriptive statistics. The variance is similar for the

    American& 'uropean and (apanese cars. The distribution of mpg is somewhat positivel+ s*ewed for American

    and 'uropean cars& and negativel+ s*ewed for (apanese cars. There are some high outliers in the American and'uropean cars. ecause ORII! is a nominal variable& we will not be tempted to thin* of this as an ordinal

    relationship. If we had a different coding for ORII!& this graph would have shown a different pattern.

    efore we can fit a linear regression model with a categorical "in this case& nominal% predictor we need to createdumm+ variables to be used in the model. We will create dumm+ variables& even though onl+ two of them will

    be used in the regression model. 'ach dumm+ variable will be coded as / or ). A value of ) will indicate that a

    case is in a given level of origin& and a value of / will indicate that the case is not in that level of origin.

    The dumm+ variables for ORII! are created in the data step below. !ote that the output shows the one car

    with a missing origin does not have a value for an+ of the dumm+ variables. Also note that the fre0uenc+tabulations for the three dumm+ variables show that there is one missing value for each dumm+ variable.

    1

  • 8/12/2019 Dummy Var Regression Lecture 2008

    3/9

    /*>ata step to create dummy variables for each level of %,',N*/data b789+cars?6 set b789+cars6 if origin not+ then do6 Americanorigin8=6

    @uropeanorigin?=6apaneseoriginB=6

    end6run6proc print datab789+cars? var origin American @uropean apanese weight6run6

    Obs ORIGIN American Euro2ean 3a2anese .EIG-T ) ( ( ( ( +$" " !SA ) ' ' )&'' $ !SA ) ' ' )&+# % !SA ) ' ' ),)# # !SA ) ' ' ),## ( ( ( "#* Euro2e ' ) ' )&"# "#+ Euro2e ' ) ' )&$%

    "#& Euro2e ' ) ' )&$# "#, Euro2e ' ) ' )&$# "*' Euro2e ' ) ' )&%# ( ( ( $", 3a2an ' ' ) )*%, $$' 3a2an ' ' ) )+## $$) 3a2an ' ' ) )+*' $$" 3a2an ' ' ) )++$

    proc freC datab789+cars?6 tables origin american european Dapanese6run6 The 4RE5 Procedure /umulative /umulative ORIGIN 4re6uenc7 Percent 4re6uenc7 Percent

    !SA "#$ *"(%+ "#$ *"(%+ Euro2e +$ )&('" $"* &'(%, 3a2an +, ),(#) %'# )''('' 4re6uenc7 Missin8 9 )

    /umulative /umulative American 4re6uenc7 Percent 4re6uenc7 Percent ' )#" $+(#$ )#" $+(#$ ) "#$ *"(%+ %'# )''('' 4re6uenc7 Missin8 9 )

    /umulative /umulative Euro2ean 4re6uenc7 Percent 4re6uenc7 Percent

    ' $$" &)(,& $$" &)(,& ) +$ )&('" %'# )''('' 4re6uenc7 Missin8 9 ) /umulative /umulative 3a2anese 4re6uenc7 Percent 4re6uenc7 Percent ' $"* &'(%, $"* &'(%, ) +, ),(#) %'# )''('' 4re6uenc7 Missin8 9 )

  • 8/12/2019 Dummy Var Regression Lecture 2008

    4/9

    We can now fit a regression model& to predict #$ for each Origin. We will use American cars as the reference

    categor+ in this model. To do this we will include the dumm+ variables for 'uropean and (apanese cars in our

    model. These two dumm+ variables represent a contrast between the average #$ for 'uropean vs. Americancars and (apanese vs. American cars& respectivel+. In general& if +ou have * categories in +our categorical

    variable& +ou will need to include *-) dumm+ variables in the regression model.

    /*)it a regression model with American cars as the reference category*/proc reg datab789+cars?6 model mpg european Dapanese6 plot residual+*predicted+6 output outregdat ppredicted rresidual rstudentrstudent6run6 Cuit6 The REG Procedure Model: MODE0) De2endent Variable: MPG

    Number o; Observations Read %'* Number o; Observations !sed $,+ Number o; Observations (''') Error $,% )*'#* %'(+#"$" /orrected Total $,* "%'%)

    Root MSE *($&$+# RS6uare '($$") De2endent Mean "$(##))$ Ad? RS6 '($"&+ /oe@ Var "+()'#,$

    Parameter Estimates Parameter Standard Variable D4 Estimate Error t Value Pr = t Interce2t ) "'()"&"$ '(%'#$+ %,(*# >(''')

    Euro2ean ) +(+*$"' '(&*%'' &(,, >(''') 3a2anese ) )'($""%) '(&"%+$ )"(#" >(''')

    The parameter estimate for the intercept represents the estimated #$ for the reference categor+& American

    cars. 2ompare this estimate "1/.)13% with the mean #$ for American cars in the descriptive statistics. The

    parameter estimate for 'uropean "4.45% represents the contrast in mean #$ for 'uropean cars vs. American

    cars "the reference%. That is& 'uropean cars are estimated to have a mean value of #$ that is 4.45 units higherthan American cars on average. This difference is significant "t$,%6 3.77& p8/.///)%. The parameter estimate for

    (apanese ")/.1% represents the contrast in mean #$ for (apanese cars vs. American cars. (apanese cars are

    estimated to have a mean value of #$ that is )/.1 units higher than American cars. This difference is

    significant "t$,%6 )1.91& p8/.///)%.

    We can calculate the mean #$ for 'uropean cars b+ adding the intercept plus the parameter estimate for

    'uropean "1/.)131:4.451/ 6 14.37);%. We calculate the mean #$ for (apanese cars b+ adding the

    intercept plus the parameter estimate for (apanese "1/.)131 : )/.11;) 6 /.;9/5;%. We can compare these

    values with the values in the output from $roc #eans& and see that the+ agree within rounding error.

    We now loo* at the residual vs. predicted values plot to see if there is appro,imate e0ualit+ of variances across

    levels of origin. !otice that there is onl+ one predicted value of #$ for each ORII!. We can see that the

    ;

  • 8/12/2019 Dummy Var Regression Lecture 2008

    5/9

    spread of residuals for each origin is appro,imatel+ the same& indicating that we have reasonable

    homos*edasticit+ for this model fit.

    We now loo* at the distribution of the studentized-deleted residuals for this model& using $roc "e.g.& the log of >% to get more normall+ distributed residuals. ?owever& these

    departures from normalit+ do not appear to be severe& and transformations will not be e,plored here.

    Tests ;or Normalit7 Test Statistic 2 Value Sha2iro.ilB . '(,*",)# Pr > . >'(''') Colmo8orovSmirnov D '('&$'#+ Pr = D >'(')'' /ramervon Mises .S6 '(#&+**, Pr = .S6 >'(''#' AndersonDarlin8 AS6 $(&&,%&$ Pr = AS6 >'(''#'

    9

  • 8/12/2019 Dummy Var Regression Lecture 2008

    6/9

    We now e,amine the output data set "regdat% produced b+ $roc Reg "the name we choose for this data set is

    arbitrar+%. This new data set will contain all the original variables& plus the new ones that we re0uested. Again&

    notice that the predicted value is the same for all observations within the same origin& and is e0ual to the mean#$ for that level of origin. Also& notice that some of the residuals are positive and some are negative.

    /*TaEe a looE at the output data set*/

    proc print dataregdat6 var mpg origin predicted residual rstudent6run6

    Obs MPG ORIGIN 2redicted residual rstudent ) , ( ( ( ( " $* !SA "'()"&" )#(,+)& "(#"%'$ $ $, !SA "'()"&" )&(&+)& "(,,),% % $* !SA "'()"&" )#(#+)& "(%#,&$ # "* !SA "'()"&" #(&+)& '(,")%& * ", !SA "'()"&" &(&+)& )($,%"" + $% !SA "'()"&" )%("+)& "("#)+' & "# !SA "'()"&" %(&+)& '(+*%", , $) !SA "'()"&" )'($+)& )(*$)%$ )' $% !SA "'()"&" )$($+)& "()'&'#

    ( ( ( )') "$ !SA "'()"&" "($+)++ '($+)&& )'" )% !SA "'()"&" *()"&"$ '(,*)&" )'$ "' !SA "'()"&" '()"&"$ '('"')' )'% )& !SA "'()"&" "()"&"$ '($$$*& ( ( ( $'& "" Euro2e "+(&,)% *(",)% '(,,"*$ $', ( Euro2e "+(&,)% ( ( $)' "' Euro2e "+(&,)% +(#,)% )(),&%$ $)) ), Euro2e "+(&,)% &(&,)% )(%'%*) $)" )& Euro2e "+(&,)% ,(&,)% )(#*$#) $)$ "" Euro2e "+(&,)% #(&,)% '(,",$& $)% $* Euro2e "+(&,)% &(#'&* )($%$&% $)# "$ Euro2e "+(&,)% %(&,)% '(++)$+ $)* ") Euro2e "+(&,)% *(&,)% )('&+#+ $)+ ( Euro2e "+(&,)% ( ( $)& )+ Euro2e "+(&,)% )'(&,)% )(+""+" $), "' Euro2e "+(&,)% +(&,)% )("%#,+ $"' $) Euro2e "+(&,)% "(&'&* '(%%"*& $") "+ Euro2e "+(&,)% '(*,)% '()'&,* $"" "& Euro2e "+(&,)% '("'&* '('$"&+ $"$ $' Euro2e "+(&,)% "()'&* '($$"$) ( ( ( $*" )& 3a2an $'(%#'* )"(%#'* )(,*,,, $*$ "+ 3a2an $'(%#'* $(%#'* '(#%$#' $*% "+ 3a2an $'(%#'* $(%#'* '(#%$#' $*# $' 3a2an $'(%#'* '(,#'* '()%,*& $** $% 3a2an $'(%#'* $($%,% '(#"+#% $*+ "& 3a2an $'(%#'* "(%#'* '($," $*& $* 3a2an $'(%#'* #(#%,% '(&+%#, $*, ", 3a2an $'(%#'* )(%#'* '(""&%) $+' $* 3a2an $'(%#'* #(#%,% '(&+%#, $+) $% 3a2an $'(%#'* $("%,% '(#))+&

    We now fit a new model& using (apan as the reference categor+ of ORII!. This time we include the twodumm+ variables& American and 'uropean in our model. The SAS commands and output are shown below@

    /*%efit the model using apan as the reference category*/proc reg datab789+cars?6 model mpg american european6 plot residual+*predicted+6run6 Cuit6

    5

  • 8/12/2019 Dummy Var Regression Lecture 2008

    7/9

    The REG Procedure Model: MODE0) De2endent Variable: MPG

    Number o; Observations Read %'* Number o; Observations !sed $,+ Number o; Observations (''') Error $,% )*'#* %'(+#"$" /orrected Total $,* "%'%)

    Root MSE *($&$+# RS6uare '($$") De2endent Mean "$(##))$ Ad? RS6 '($"&+ /oe@ Var "+()'#,$

    Parameter Estimates Parameter Standard Variable D4 Estimate Error t Value Pr = t Interce2t ) $'(%#'*$ '(+)&"$ %"(%' >(''')

    American ) )'($""%) '(&"%+$ )"(#" >(''') Euro2ean ) "(##,"' )('%+&+ "(%% '(')#'

    !otice that the Anal+sis of ariance table for this model is the same as for the previous model. ?owever& theparameter estimates differ& because the+ represent different 0uantities than the+ did in the first model. The

    intercept is now the estimated mean #$ for (apanese cars "/.;9%. The parameter estimate for American

    represents the contrast in the mean #$ for American cars vs. (apanese cars "American cars have on average&)/.1 less #$ than do (apanese cars%. The parameter estimate for 'uropean represents the contrast in the

    mean #$ for 'uropean cars vs. (apanese cars "'uropean cars have on average& 1.95 #$ less than (apanese

    cars%.

    We can use another method to fit this same linear model. When we use $roc B#& we do not have to create the

    dumm+ variables as we did for $roc Reg. ?ere is sample SAS code for fitting a onewa+ A!OA model using$roc B#. !ote the class statement specif+ing ORII! as a class variable. This causes SAS to create dumm+variables for ORII! automaticall+. SAS will use the highest formatted level "

  • 8/12/2019 Dummy Var Regression Lecture 2008

    8/9

    Source D4 S6uares Mean S6uare 4 Value Pr = 4 Model " +,&%(,#+"# $,,"(%+&*" ,+(,+ >(''') Error $,% )*'#*(%)%+% %'(+#"$" /orrected Total $,* "%'%)($+),, RS6uare /oe@ Var Root MSE MPG Mean '($$")$% "+()'#,$ *($&$+## "$(##))$

    Source D4 T72e I SS Mean S6uare 4 Value Pr = 4

    ORIGIN " +,&%(,#+"%# $,,"(%+&*"$ ,+(,+ >(''')

    Source D4 T72e III SS Mean S6uare 4 Value Pr = 4 ORIGIN " +,&%(,#+"%# $,,"(%+&*"$ ,+(,+ >(''')

    Standard Parameter Estimate Error t Value Pr = t Interce2t "'()"&""#&) '(%'#$*&&" %,(*# >(''') ORIGIN Euro2e +(+*$"'"+* '(&*%''""* &(,, >(''') ORIGIN 3a2an )'($""%'+)' '(&"%+"+&* )"(#" >(''') ORIGIN !SA '('''''''' ( ( (

    NOTE: The F matrix has been ;ound to be sin8ular and a 8eneraliHed inverse

  • 8/12/2019 Dummy Var Regression Lecture 2008

    9/9

    Euro2e 3a2an "(##," #('"%% '(',%' LLL Euro2e !SA +(+*$" #(+$'# ,(+,#, LLL !SA 3a2an )'($""% )"("*"+ &($&") LLL !SA Euro2e +(+*$" ,(+,#, #(+$'# LLL

    We can see from the above output that all pairwise comparisons of means are significant at the ./9 level& after

    appl+ing the Tu*e+ method for multiple comparisons. SAS shows each comparison of means twice& but that

    reall+ isnCt necessar+. There are man+ methods for multiple comparisons available in SAS $roc B# and otherA!OA procedures& such as $roc #i,ed.

    7