Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...

15
© Prof. Andy Field, 2016 www.discoveringstatistics.com Page 1 Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously but unlike in an experiment none of them are manipulated. When we use correlational designs we can’t look for cause-effect relationships because we haven’t manipulated any of the variables, and also because all of the variables have been measured at the same point in time (if you’re really bored, Field, 2013, Chapter 1 explains why experiments allow us to make causal inferences but correlational research does not). In psychology, the most common correlational research consists of the researcher administering several questionnaires that measure different aspects of behaviour to see which aspects of behaviour are related. Many of you will do this sort of research for your final year research project (so pay attention!). The linear model In the first few lectures we saw that the only equation we ever really need is this one: outcome ! = Model ! + error ! We also saw that we often fit a linear model, which in its simplest form can be written as: outcome ! = * + + ! + error ! y ! = * + + ! ! Eq. 1 The fundamental idea is that an outcome for an entity can be predicted from a model and some error associated with that prediction (e i ). We are predicting an outcome variable (y i ) from a predictor variable (X i ) and a parameter, b 1 , associated with the predictor variable that quantifies the relationship it has with the outcome variable. We also need a parameter that tells us the value of the outcome when the predictor is zero; this parameter is b 0 . You might recognize this model as ‘the equation of a straight line’. I have talked about fitting ‘linear models’, and linear simply means ‘straight line’. Any straight line can be defined by two things: (1) the slope (or gradient) of the line (usually denoted by b 1 ); and (2) the point at which the line crosses the vertical axis of the graph (known as the intercept of the line, b 0 ). These parameters b 1 and b 0 are known as the regression coefficients and will crop up time and time again, where you may see them referred to generally as b (without any subscript) or b n (meaning the b associated with variable n). A particular line (i.e., model) will have has a specific intercept and gradient. Figure 1: Shows lines with the same gradients but different intercepts, and lines that share the same intercept but have different gradients Same intercepts, different gradients Same gradients, different intercepts

Transcript of Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...

Page 1: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page1

Introducing the Linear Model What is Correlational Research? Correlationaldesignsarewhenmanyvariablesaremeasuredsimultaneouslybutunlikeinanexperimentnoneofthemaremanipulated.Whenweusecorrelationaldesignswecan’tlookforcause-effectrelationshipsbecausewehaven’tmanipulatedanyofthevariables,andalsobecauseallofthevariableshavebeenmeasuredatthesamepointintime(if you’re really bored, Field, 2013, Chapter 1 explains why experiments allow us to make causal inferences butcorrelational researchdoesnot). Inpsychology, themostcommoncorrelational researchconsistsof the researcheradministeringseveralquestionnairesthatmeasuredifferentaspectsofbehaviourtoseewhichaspectsofbehaviourarerelated.Manyofyouwilldothissortofresearchforyourfinalyearresearchproject(sopayattention!).

The linear model Inthefirstfewlectureswesawthattheonlyequationweeverreallyneedisthisone:

outcome! = Model! + error!

Wealsosawthatweoftenfitalinearmodel,whichinitssimplestformcanbewrittenas:

outcome! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + error!

y! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + ε!Eq.1

Thefundamentalideaisthatanoutcomeforanentitycanbepredictedfromamodelandsomeerrorassociatedwiththat prediction (ei).We are predicting an outcome variable (yi) from a predictor variable (Xi) and a parameter,b1,associatedwiththepredictorvariablethatquantifiestherelationshipithaswiththeoutcomevariable.Wealsoneedaparameterthattellsusthevalueoftheoutcomewhenthepredictoriszero;thisparameterisb0.

Youmightrecognizethismodelas‘theequationofastraightline’.Ihavetalkedaboutfitting‘linearmodels’,andlinearsimplymeans‘straightline’.Anystraightlinecanbedefinedbytwothings:(1)theslope(orgradient)oftheline(usuallydenotedbyb1);and(2)thepointatwhichthelinecrossestheverticalaxisofthegraph(knownastheinterceptoftheline,b0).Theseparametersb1andb0areknownastheregressioncoefficientsandwillcropuptimeandtimeagain,whereyoumayseethemreferredtogenerallyasb(withoutanysubscript)orbn(meaningthebassociatedwithvariablen).Aparticularline(i.e.,model)willhavehasaspecificinterceptandgradient.

Figure1:Showslineswiththesamegradientsbutdifferentintercepts,andlinesthatsharethesameinterceptbuthavedifferentgradients

Same intercepts, different gradients Same gradients, different intercepts

Page 2: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page2

Figure1showsasetoflinesthathavethesameinterceptbutdifferentgradients.Forthesethreemodels,b0willbethesameineachbutthevaluesofb1willdifferineach;thisfigurealsoshowsmodelsthathavethesamegradients(b1isthesame ineachmodel)butdifferent intercepts (theb0isdifferent ineachmodel). I’vementionedalready thatb1quantifiestherelationshipbetweenthepredictorvariableandtheoutcome,andFigure1illustratesthispoint.Amodelwithapositiveb1describesapositiverelationship,whereasalinewithanegativeb1describesanegativerelationship.Looking at Figure1 (left) the red linedescribes a positive relationshipwhereas the green linedescribes a negativerelationship. As such, we can use a linearmodel (i.e., a straight line) to summarize the relationship between twovariables:thegradient(b1)tellsuswhatthemodellookslike(itsshape)andtheintercept(b0)tellsuswherethemodelis(itslocationingeometricspace).

Thisisallquiteabstractsolet’slookatanexample.ImaginethatIwasinterestedinpredictingphysicalanddownloadedalbumsales(outcome)fromtheamountofmoneyspentadvertisingthatalbum(predictor).WecouldsummarizethisrelationshipusingalinearmodelbyreplacingthenamesofourvariablesintoEq.1.

y! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + ε!

albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget! + ε!Eq.2

Oncewehaveestimatedthevaluesofthebswewouldbeabletomakeapredictionaboutalbumsalesbyreplacing‘advertising’withanumberrepresentinghowmuchwewantedtospendadvertisinganalbum.Forexample,imaginethatb0turnedouttobe50andb1turnedouttobe100,ourmodelwouldbe:

albumsales! = 50 + 100×advertisingbudget! + ε! Eq.3

NotethatIhavereplacedthebetaswiththeirnumericvalues.Now,wecanmakeaprediction.Imaginewewantedtospend£5onadvertising,wecan replace thevariable ‘advertisingbudget’with thisvalueandsolve theequation todiscoverhowmanyalbumsaleswewillget:

albumsales! = 50 + 100×5 + ε!= 550 + ε!

So,basedonourmodelwecanpredictthatifwespend£5onadvertising,we’llsell550albums.I’velefttheerrortermintheretoremindyouthatthispredictionwillprobablynotbeperfectlyaccurate.Thisvalueof550albumsalesisknownasapredictedvalue.

The linear model with several predictors Wehaveseenthatwecanuseastraightlineto‘model’therelationshipbetweentwovariables.However,lifeisusuallymorecomplicatedthanthat:thereareoftennumerousvariablesthatmightberelatedtotheoutcomeofinterest.Totakeouralbumsalesexample,wemightexpectvariablesotherthansimplyadvertisingtohaveaneffect.Forexample,howmuchsomeonehearssongsfromthealbumontheradio,orthe‘look’ofthebandmighthaveaninfluence.Oneofthebeautifulthingsaboutthelinearmodelisthatitcanbeexpandedtoincludeasmanypredictorsasyoulike.Toaddapredictorallweneedtodoisplaceitintothemodelandgiveitabthatestimatestherelationshipinthepopulationbetweenthatpredictorandtheoutcome.Forexample,ifwewantedtoaddthenumberofplaysofthebandontheradioperweek(airplay),wecouldaddthissecondpredictoringeneralas:

𝑌𝑌! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋+! + 𝑏𝑏>𝑋𝑋>! + 𝜀𝜀! Eq.4

Notethatall thathaschanged istheadditionofasecondpredictor(X2)andanassociatedparameter(b2).Tomakethingsmoreconcrete,let’susethevariablenamesinstead:

albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget! + 𝑏𝑏>airplay! + 𝜀𝜀! Eq.5

Thenewmodelincludesab-valueforbothpredictors(and,ofcourse,theconstant,b0).Ifweestimatetheb-values,wecouldmakepredictionsaboutalbumsalesbasednotonlyontheamountspentonadvertisingbutalsointermsofradioplay.

Page 3: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page3

Multiple regression can be usedwith three, four or even ten ormore predictors. In general we can add asmanypredictorsaswelike,andthelinearmodelwillexpandaccordingly:

𝑌𝑌! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋+! + 𝑏𝑏>𝑋𝑋>! ⋯ 𝑏𝑏D𝑋𝑋D! + 𝜀𝜀! Eq.6

Inwhich,Y istheoutcomevariable,b1 isthecoefficientofthefirstpredictor(X1),b2 isthecoefficientofthesecondpredictor(X2),bnisthecoefficientofthenthpredictor(Xni),andeiistheerrorfortheithparticipant.(Thebracketsaren’tnecessary,they’rejusttomaketheconnectiontoEq.1).Thisequationillustratesthatwecanaddinasmanypredictorsaswelikeuntilwereachthefinalone(Xn),buteachtimewedo,weassignitaregressioncoefficient(b).

Estimating the model Linearmodelscanbedescribedentirelybyaconstant(b0)andbyparametersassociatedwitheachpredictor(bs).Theseparameters are estimated using themethod of least squares (described in your lecture). Thismethod is known asordinaryleastsquares(OLS)regression.Inotherwords,SPSSfindsthevaluesoftheparametersthathavetheleastamountoferror(relativetoanyothervalue)forthedatayouhave.

Assessing the goodness of fit, sums of squares, R and R2 OnceNephwickandClungglewadhavefoundthemodelofbestfititisimportantthatweassesshowwellthismodelfitstheactualdata(weassessthegoodnessoffitofthemodel).Wedothisbecauseeventhoughthemodelisthebestoneavailable, itcanstillbea lousyfit tothedata.Onemeasureoftheadequacyofamodel is thesumofsquareddifferences(thinkbacktolecture2,orField,2013,Chapter2).Thereareseveralsumsofsquaresthatcanbecalculatedtohelpusgaugethecontributionofourmodeltopredictingtheoutcome.Let’sgobacktoourexampleofpredictingalbumsales(Y)fromtheamountofmoneyspentadvertisingthatalbum(X).Onedaymybosscameintomyofficeandsaid‘Andy,Iknowyouwantedtobearockstarandyou’veendedupworkingasmystats-monkey,buthowmanyalbumswillwesellifwespend£100,000onadvertising?’IntheabsenceofanydataprobablythebestanswerIcouldgivewouldbe themeannumberofalbumsales (say,200,000)becauseonaveragethat’showmanyalbumsweexpect tosell.However,whatifhetheasks‘Howmanyalbumswillwesellifwespend£1onadvertising?’Again,intheabsenceofanyaccurateinformation,mybestguesswouldbethemean.Thereisaproblem:whateveramountofmoneyisspentonadvertisingIalwayspredictthesamelevelsofsales.Assuch,themeanisafairlyuselessmodelofarelationshipbetweentwovariables.—butitisthesimplestmodelavailable.

Usingthemeanasamodel,wecancalculatethedifferencebetweentheobservedvalues,andthevaluespredictedbythemean.Wesawinlecture1thatwesquareallofthesedifferencestogiveusthesumofsquareddifferences.Thissumofsquareddifferencesisknownasthetotalsumofsquares(denotedSST)becauseitisthetotalamountoferrorpresentwhenthemostbasicmodelisappliedtothedata(Figure2).Now,ifwefitthemoresophisticatedmodeltothedata,suchasalineofbestfit,wecanworkoutthedifferencesbetweenthisnewmodelandtheobserveddata.Evenifanoptimalmodelisfittedtothedatathereisstillsomeinaccuracy,whichisrepresentedbythedifferencesbetweeneachobserveddatapointandthevaluepredictedbytheregressionline.Thesedifferencesaresquaredbeforetheyareaddedupsothatthedirectionsofthedifferencesdonotcancelout.Theresultisknownasthesumofsquaredresiduals(SSR).Thisvaluerepresentsthedegreeofinaccuracywhenthebestmodelisfittedtothedata.Wecanusethesetwovaluestocalculatehowmuchbettertheregressionline(thelineofbestfit)isthanjustusingthemeanasamodel(i.e.howmuchbetteristhebestpossiblemodelthantheworstmodel?).ThisimprovementinpredictionisthedifferencebetweenSSTandSSR.Thisdifferenceshowsusthereductionintheinaccuracyofthemodelresultingfromfittingtheregressionmodeltothedata.Thisimprovementisthemodelsumofsquares(SSM).

If thevalueofSSM is largethentheregressionmodel isverydifferent fromusingthemeantopredict theoutcomevariable.Thisimpliesthattheregressionmodelhasmadeabigimprovementtohowwelltheoutcomevariablecanbepredicted. However, if SSM is small then using the regression model is little better than using the mean (i.e. theregressionmodelisnobetterthantakingour‘bestguess’).Ausefulmeasurearisingfromthesesumsofsquaresistheproportionofimprovementduetothemodel.Thisiseasilycalculatedbydividingthesumofsquaresforthemodelbythetotalsumofsquares.TheresultingvalueiscalledR2andtoexpressthisvalueasapercentageyoushouldmultiplyitby100.So,R2representstheamountofvarianceintheoutcomeexplainedbythemodel(SSM)relativetohowmuchvariationtherewastoexplaininthefirstplace(SST).

Page 4: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page4

𝑅𝑅> =𝑆𝑆𝑆𝑆G𝑆𝑆𝑆𝑆H

Eq.7

Figure2:Diagramshowingfromwheretheregressionsumsofsquaresderive

A second use of the sums of squares in assessing themodel is the F-test. This test is based upon the ratio of theimprovementduetothemodel(SSM)andthedifferencebetweenthemodelandtheobserveddata(SSR).Ratherthanusingthesumsofsquares,itusesthemeansumsofsquares(referredtoasthemeansquaresorMS).Theresultisthemeansquaresforthemodel(MSM)andtheresidualmeansquares(MSR)—seeField2013formoredetail.Atthisstageitisn’tessentialthatyouunderstandhowthemeansquaresarederived(itisexplainedinField,2013).However,itisimportantthatyouunderstandthattheF-ratio(Eq.8)isameasureofhowmuchthemodelhasimprovedthepredictionoftheoutcomecomparedtothelevelofinaccuracyofthemodel.Ifamodelisgood,thentheimprovementinpredictionduetothemodel(MSM)tobelargeandthedifferencebetweenthemodelandtheobserveddata(MSR)tobesmall.Inshort,agoodmodelshouldhavealargeF-ratio.

𝐹𝐹 =𝑀𝑀𝑆𝑆G𝑀𝑀𝑆𝑆K

Eq.8

SST uses the differences between the observed data

and the mean value of Y

SSR uses the differences between the observed data

and the regression line

SSM uses the differences between the mean value of Y

and the regression line

Page 5: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page5

Assessing individual predictors Thevalueofbrepresentsthechangeintheoutcomeresultingfromaunitchangeinthepredictor.Ifthemodelisverybadthenwewouldexpectthechangeintheoutcometobezero.Aregressioncoefficientof0means:(1)aunitchangeinthepredictorvariableresultsinnochangeinthepredictedvalueoftheoutcome(thepredictedvalueoftheoutcomedoesnotchangeatall).Logically ifavariablesignificantlypredictsanoutcome,thenitshouldhaveab-valuethatisdifferentfromzero.Thet-statisticteststhenullhypothesisthatthevalueofbis0:therefore,ifitissignificantwegainconfidenceinthehypothesisthattheb-valueissignificantlydifferentfrom0andthatthepredictorvariablecontributessignificantlytoourabilitytoestimatevaluesoftheoutcome.

LikeF, the t-statistic isbasedon the ratioofexplainedvarianceagainstunexplainedvarianceorerror.Whatwe’reinterestedinhereisnotsomuchvariancebutwhetherthebisbigcomparedtotheamountoferrorinthatestimate.Toestimatehowmucherrorwecouldexpect to find inbweuse the standarderrorbecause it tellsusabouthowdifferentb-valueswouldbeacrossdifferentsamples.Eq.9showshowthet-testiscalculated:Thebexpectedisthevalueofbthatwewouldexpecttoobtainifthenullhypothesisweretrue(i.e.,zero)andsothisvaluecanbereplacedby0.Theequationsimplifiestobecometheobservedvalueofbdividedbythestandarderrorwithwhichitisassociated:

𝑡𝑡 =𝑏𝑏MNOPQRPS − 𝑏𝑏PUVPWXPS

𝑆𝑆𝑆𝑆Z=𝑏𝑏MNOPQRPS𝑆𝑆𝑆𝑆Z Eq.9

Thevaluesofthaveaspecialdistributionthatdiffersaccordingtothedegreesoffreedomforthetest.Inthiscontext,the degrees of freedom areN−p−1,whereN is the total sample size andp is the number of predictors. In simpleregressionwhenwehaveonlyonepredictor,thisreducesdowntoN−2.SPSSprovidestheexactprobabilitythattheobservedvalue(ora largerone)oftwouldoccur ifthevalueofbwas, infact,0.Asageneralrule, if thisobservedsignificanceislessthan.05,thenscientistsassumethatbissignificantlydifferentfrom0;putanotherway,thepredictormakesasignificantcontributiontopredictingtheoutcome.

Generalization and Bootstrapping Rememberfromyourlectureonbiasthatlinearmodelsassume:

• Linearityandadditivity:therelationshipyou’retryingtomodelis,infact,linearandwithseveralpredictors,theycombineadditively.

• Normality:Forbestimatestobeoptimaltheresidualsshouldbenormallydistributed.ForCIsandconfidenceintervalstobeaccurate,thesamplingdistributionofbsshouldbenormal.

• Homoscedasticity:necessaryforbestimatestobeoptimalandsignificancetestsandCIsoftheparameterstobeaccurate.

Iftheseassumptionsaremetthenwecantrusttheestimatesofourbs,whichmeansthatwecangeneralizeourmodel(i.e.assumethatitworksinsamplesotherthantheonefromwhichwecollecteddata).Ifwehaveconcernsabouttheseassumptions we can use bootstrapping to compute robust estimates of bs and their confidence intervals. Lack ofnormalitypreventsusfromknowingtheshapeofthesamplingdistributionunlesswehavebigsamples.BootstrappingEfron&Tibshirani,1993)getsaroundthisproblembyestimatingthepropertiesofthesamplingdistributionfromthesample data. In effect, the sample data are treated as a population fromwhich smaller samples (called bootstrapsamples)aretaken(puttingeachscorebackbeforeanewoneisdrawnfromthesample).Theparameterofinterest(e.g.,theregressionparameter)iscalculatedineachbootstrapsample.Thisprocessisrepeatedperhaps2000times.Theendresultisthatwehave2000parameterestimates,onefromeachbootstrapsample.Therearetwothingswecandowiththeseestimates:thefirstistoorderthemandworkoutthelimitswithinwhich95%ofthemfall.Wecanusethesevaluesasanestimateofthelimitsofthe95%confidenceintervaloftheparameter.Theresultisknownasapercentilebootstrapconfidenceinterval(becauseit isbasedonthevaluesbetweenwhich95%ofbootstrapsampleestimatesfall).Thesecondthingwecandoistocalculatethestandarddeviationoftheparameterestimatesfromthebootstrapsamplesanduseitasthestandarderrorofparameterestimates.Animportantpointtorememberisthatbecausebootstrappingisbasedontakingrandomsamplesfromthedatayou’vecollectedtheestimatesyougetwillbeslightly different every time. This is nothing to worry about. For a fairly gentle introduction to the concept ofbootstrappingseeWright,LondonandField(2011).

Page 6: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page6

Fitting a linear model Figure3showsthegeneralprocessofconductingregressionanalysis.First,weshouldproducescatterplotstogetsomeideaofwhethertheassumptionoflinearityismet,andalsotolookforanyoutliersorobviousunusualcases.Atthisstagewemighttransformthedatatocorrectproblems.Havingdonethisinitialscreenforproblemswefitamodelandsavethevariousdiagnosticstatisticsthatwewilldiscussnextweek. Ifwewanttogeneralizeourmodelbeyondthesample,orweareinterestedininterpretingsignificancetestsandconfidenceintervalsthenweexaminetheseresidualstocheckforhomoscedasticity,normality,independenceandlinearity.Ifwefindproblemsthenwetakecorrectiveactionandre-estimatethemodel.Also,it’sprobablywisetousebootstrappedconfidenceintervalswhenwefirstestimatethemodelbecausethenwecanbasicallyforgetaboutthingslikenormality.

Figure3:Theprocessoffittingaregressionmodel

Regression using SPSS TherearesomedatafromField2013inthefileAlbumSales.sav.Thisdatafilehas200rows,eachonerepresentingadifferentalbum.Therearealsotwocolumns,onerepresentingthesalesofeachalbumintheweekafterreleaseandtheotherrepresentingtheamount(inpounds)spentpromotingthealbumbeforerelease.Thisistheformatforenteringregressiondata:theoutcomevariableandanypredictorsshouldbeenteredindifferentcolumns,andeachrowshouldrepresentindependentvaluesofthosevariables.

ThepatternofthedataisshowninFigure4anditshouldbeclearthatapositiverelationshipexists:so,themoremoneyspentadvertisingthealbum,themoreitislikelytosell.Ofcoursetherearesomealbumsthatsellwellregardlessofadvertising(topleftofscatterplot),buttherearenonethatsellbadlywhenadvertisinglevelsarehigh(bottomrightofscatterplot).Thescatterplotalsoshows the lineofbest fit for thesedata:bearing inmind that themeanwouldberepresentedbyaflatlineataroundthe200,000salesmark,theregressionlineisnoticeabledifferent.

Check residuals

Assumptions met and no bias

No normality

Graphs: zpred vs. zresid

Re-run analysis: Bootstrap CIs, transform

data

Model can be generalized

Linearity

Normality

Initial checks Graphs: scatterplotsLinearity and unusual cases

Run initial regression Save diagnostics

Homoscedasticity

Graphs: histogram

Independence

HeteroscedasticityRe-run analysis: weighted least squares regression

Lack of linearityTransform data

Lack of independence Use a multilevel model (Chapter 19)

Page 7: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page7

Figure4:Scatterplotshowingtherelationshipbetweenalbumsalesandtheamountspentpromotingthealbum.

Running a basic Analysis Tofindouttheparametersthatdescribetheregressionline,andtoseewhetherthislineisausefulmodel,weneedtorun a regression analysis. To do the analysis you need to access the main dialog box by selecting

Figure4showstheresultingdialogbox.ThereisaspacelabelledDependentinwhichyoushouldplacetheoutcomevariable(inthisexamplesales).So,selectsalesfromthelistontheleft-handside,andtransferitbydraggingitorclickingon .ThereisanotherspacelabelledIndependent(s)inwhichanypredictorvariableshouldbeplaced.Insimpleregressionweuseonlyonepredictor(inthisexample,adverts)andsoyoushouldselectadvertsfromthelistandclickon totransferittothelistofpredictors.Thereareavarietyofoptionsavailable,butthesewillbeexploredwithinthecontextofmultipleregression.

Figure5:Maindialogboxforregression

Ifweareworriedaboutassumptionsthenwecangetbootstrappedconfidenceintervalsfortheregressioncoefficientsbyclicking .Select toactivatebootstrapping,andtogeta95%confidenceintervalclick

.Clickon inthemaindialogboxtorunthebasicanalysis.

Page 8: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page8

Figure6:Bootstrapdialogbox

Output from SPSS

Overall Fit of the Model

Output1

ThefirsttableprovidedbySPSSisasummaryofthemodelthatgivesthevalueofRandR2forthemodel.Forthesedata,Ris0.578andbecausethereisonlyonepredictor,thisvaluerepresentsthesimplecorrelationbetweenadvertisingandalbumsales(youcanconfirmthisbyrunningacorrelation). The value of R2 is 0.335, which tells us that advertisingexpenditurecanaccountfor33.5%ofthevariationinalbumsales.Theremightbemany factors that canexplain this variation, butourmodel,which includesonly advertisingexpenditureexplains33%:66%of thevariationinalbumsalesisunexplained.Therefore,theremustbeothervariablesthathaveaninfluencealso

Thenextpartoftheoutputreportsananalysisofvariance(ANOVA—seeField,2013,Chapter11).ThemostimportantpartofthetableistheF-ratio,whichiscalculatedusingEq.8,andtheassociatedsignificancevalue.Forthesedata,Fis99.59,whichissignificantatp<.001(becausethevalueinthecolumnlabelledSig.islessthan.001).Thisresulttellsusthatthereislessthana0.1%chancethatanF-ratiothislargewouldhappeniftherewerenoeffect.Therefore,wecanconcludethatourregressionmodelresultsinsignificantlybetterpredictionofalbumsalesthanifweusedthemeanvalueofalbumsales.Inshort,theregressionmodeloverallpredictsalbumsalessignificantlywell.

Output2

SSR

SSM

SSTMSR

MSM

Page 9: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page9

Model Parameters

TheANOVAtellsuswhether themodel,overall, results ina significantlygooddegreeofpredictionof theoutcomevariable.However,theANOVAdoesn’ttellusabouttheindividualcontributionofvariablesinthemodel(althoughinthissimplecasethereisonlyonevariableinthemodelandsowecaninferthatthisvariableisagoodpredictor).ThetableinOutput3providesdetailsofthemodelparameters(thebetavalues)andthesignificanceofthesevalues.WesawinEq.1thatb0wastheYinterceptandthisvalueisthevalueBfortheconstant.So,fromthetable,wecansaythatb0is134.14,andthiscanbeinterpretedasmeaningthatwhennomoneyisspentonadvertising(whenX=0),themodelpredictsthat134,140albumswillbesold(rememberthatourunitofmeasurementwasthousandsofalbums).Wecanalsoreadoffthevalueofb1fromthetableandthisvaluerepresentsthegradientoftheregressionline.Itis0.096.1Althoughthisvalueistheslopeoftheregressionline,itismoreusefultothinkofthisvalueasrepresentingthechangeintheoutcomeassociatedwithaunitchangeinthepredictor.Therefore,ifourpredictorvariableisincreasedby1unit(iftheadvertisingbudgetisincreasedby1),thenourmodelpredictsthat0.096extraalbumswillbesold.Ourunitsofmeasurementwerethousandsofpoundsandthousandsofalbumssold,sowecansaythatforanincreaseinadvertisingof£1000themodelpredicts96(0.096´1000=96)extraalbumsales.Asyoumightimagine,thisinvestmentisprettybadfortherecordcompany:theyinvest£1000andgetonly96extrasales!Fortunately,aswealreadyknow,advertisingaccountsforonlyone-thirdofalbumsales.

Output3

Wesawearlierthat,ingeneral,valuesoftheregressioncoefficientbrepresentthechangeintheoutcomeresultingfromaunitchangeinthepredictorandthatifapredictorishavingasignificantimpactonourabilitytopredicttheoutcomethenthisbshouldbedifferentfrom0(andbigrelativetoitsstandarderror).Wealsosawthatthet-testtellsuswhethertheb-valueisdifferentfrom0.SPSSprovidestheexactprobabilitythattheobservedvalueoftwouldoccurif thevalueofb in thepopulationwerezero. If thisobservedsignificance is less than .05, thentheresult reflectsagenuineeffect.Forbothts,theprobabilitiesare.000(zeroto3decimalplaces)andsowecansaythattheprobabilityofthesetvalues(orlarger)occurringifthevaluesofbinthepopulationwerezeroislessthan.001.Therefore,thebsaresignificantlydifferentfrom0.Inthecaseofthebforadvertisingbudgetthisresultmeansthattheadvertisingbudgetmakesasignificantcontribution(p<.001)topredictingalbumsales.

Thebootstrapconfidenceintervaltellsusthatthepopulationvalueofbforadvertisingbudgetislikelytofallbetween.08 and .11 and because this interval doesn’t include zero we would conclude that there is a genuine positiverelationshipbetweenadvertisingbudgetandalbumsalesinthepopulation.Also,thesignificanceassociatedwiththisconfidenceintervalisp=.001,whichishighlysignificant.Also,notethatthebootstrapprocessinvolvesre-estimatingthestandarderror(itchangesfrom.01intheoriginaltabletoabootstrapestimateof.009).Thisisaverysmallchange.Fortheconstant, thestandarderror is7.537comparedtothebootstrapestimateof8.214,which isadifferenceof

1SometimessmallvaluesarereportedbySPSSasthingslike9.612E-02andmanystudentsfindthisnotationconfusing.Well,thinkofE-02asmeaning‘movethedecimalplace2stepstotheleft’,so9.612E-02becomes0.09612.

Page 10: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page10

0.677.Thebootstrapconfidenceintervalsandsignificancevaluesareusefultoreportandinterpretbecausetheydonotrelyonassumptionsofnormalityorhomoscedasticity.

Using the Model Sofar,wehavediscoveredthatwehaveausefulmodel,onethatsignificantlyimprovesourabilitytopredictalbumsales.However,thenextstage isoftentousethatmodeltomakesomepredictions.Thefirststage istodefinethemodelbyreplacingtheb-valuesinEq.2withthevaluesfromtheoutput.Inaddition,wecanreplacetheXandYwiththevariablenamessothatthemodelbecomes:

albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget!

= 134.14 + 0.096×advertisingbudget! Eq.10

Itisnowpossibletomakeapredictionaboutalbumsales,byreplacingtheadvertisingbudgetwithavalueofinterest.For example, imagine a recording company executive wanted to spend £100,000 on advertising a new album.Rememberingthatourunitsarealreadyinthousandsofpounds,wecansimplyreplacetheadvertisingbudgetwith100.Hewoulddiscoverthatalbumsalesshouldbearound144,000forthefirstweekofsales:

albumsales! = 134.14 + 0.096×advertisingbudget!

= 134.14 + 0.096×100

= 143.74

Eq.11

Regression with several predictors using SPSS

SELF-TEST: Produce a matrix scatterplot of Sales, Adverts, Airplay and Attract including theregressionline.

Main options Theexecutivehaspastresearchindicatingthatadvertisingbudgetisasignificantpredictorofalbumsales,andsoheshouldincludethisvariableinthemodelfirst.Hisnewvariables(airplayandattract)should,therefore,beenteredintothemodelafteradvertisingbudget.Thismethodishierarchical(theresearcherdecidesinwhichordertoentervariablesintothemodelbasedonpastresearch).TodoahierarchicalregressioninSPSSwehavetoenterthevariablesinblocks(eachblockrepresentingonestepinthehierarchy).Togettothemainregressiondialogboxselect .Tosetupthefirstblockwedoexactlywhatwedidbefore.Selecttheoutcomevariable(albumsales)anddragittotheboxlabelledDependent(orclickon ).Wealsoneedtospecifythepredictorvariableforthefirstblock.We’vedecidedthatadvertisingbudgetshouldbeenteredintothemodelfirst,soselectthisvariableinthelistanddragittotheboxlabelledIndependent(s)(orclickon ).UnderneaththeIndependent(s)box,thereisadrop-downmenuforspecifyingtheMethodofregression.Youcanselectadifferentmethodofvariableentryforeachblockbyclickingon ,nexttowhereitsaysMethod(seeFigure5).Thedefaultoptionisforcedentry,andthisistheoptionwewant,butnextweekwe’lllookatotherapproaches.

Havingspecifiedthefirstblockinthehierarchy,weneedtomoveontotothesecond.Totellthecomputerthatyouwanttospecifyanewblockofpredictorsyoumustclickon .ThisprocessclearstheIndependent(s)boxsothatyoucanenterthenewpredictors(youshouldalsonotethatabovethisboxitnowreadsBlock2of2indicatingthatyouareinthesecondblockofthetwothatyouhavesofarspecified).WedecidedthatthesecondblockwouldcontainbothofthenewpredictorsandsoyoushouldclickonAirplayandAttract(whileholdingdownCtrl,orCmdifyouuseaMac)inthevariableslistanddragthemtotheIndependent(s)boxorclickon .ThedialogboxshouldnowlooklikeFigure7.Tomovebetweenblocksusethe and buttons(soforexample,tomovebacktoblock1,clickon).Wecangetbootstrappedconfidenceintervalsfortheregressioncoefficientsbyclicking aswedidbefore.

Page 11: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page11

Figure7:Maindialogboxforblock2ofthemultipleregression

Output from SPSS

Overall Fit of the Model

Output4

Notethattherearetwomodels.Model1referstothefirst stage in the hierarchy when only advertisingbudgetisusedasapredictor.Model2referstowhenallthreepredictorsareused.UnderthistableSPSStellsus what the dependent variable (outcome) was andwhatthepredictorswere ineachof thetwomodels.The column labelled R contains the values of themultiplecorrelationcoefficientbetweenthepredictorsand the outcome. When only advertising budget isused as a predictor, this is the simple correlationbetweenadvertisingandalbumsales (0.578). In fact,all of the statistics formodel 1 are the same as theregressionmodelearlier(Output1).

Thenext columngivesusavalueofR2,whichwealreadyknow isameasureofhowmuchof thevariability in theoutcomeisaccountedforbythepredictors.Forthefirstmodelitsvalueis.335,whichmeansthatadvertisingbudgetaccountsfor33.5%ofthevariationinalbumsales.However,whentheothertwopredictorsareincludedaswell(model2),thisvalueincreasesto.665or66.5%ofthevarianceinalbumsales.Therefore,ifadvertisingaccountsfor33.5%,wecantellthatattractivenessandradioplayaccountforanadditional33%.2So,theinclusionofthetwonewpredictorshasexplainedquitealargeamountofthevariationinalbumsales

Output5containsanANOVAthattestswhetherthemodelissignificantlybetteratpredictingtheoutcomethanusingthemeanasa‘bestguess’.FortheinitialmodeltheF-ratiois99.59,p<.001.ForthesecondmodelthevalueofFis129.498, which is also highly significant (p < .001). We can interpret these results as meaning that both modelssignificantlyimprovedourabilitytopredicttheoutcomevariablecomparedtonotfittingthemodel.

2Thatis,33%=66.5%-33.5%.

Page 12: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page12

Output5

Model Parameters

RememberthatinmultipleregressionthemodeltakestheformofEq.6andinthatequationthereareseveralunknownparameters(theb-values).ThefirstpartofOutput6givesusestimatesfortheseb-valuesandthesevaluesindicatetheindividualcontributionofeachpredictortothemodel.Wecanusetheseestimatestodefineourmodel:

sales! = 𝑏𝑏* + 𝑏𝑏+advertising! + 𝑏𝑏>airplay! + 𝑏𝑏aattractiveness!

= −26.61 + 0.08advertising! + 3.37airplay! + 11.09attractiveness! Eq.12

Theb-valuestellusabouttherelationshipbetweenalbumsalesandeachpredictor.Allthreepredictorshavepositiveb-valuesindicatingpositiverelationships.So,asadvertisingbudgetincreases,albumsalesincrease;asplaysontheradioincrease,sodoalbumsales;andfinallymoreattractivebandswillsellmorealbums.Theb-valuestellusmorethanthis,though.Theytellustowhatdegreeeachpredictoraffectstheoutcome iftheeffectsofallotherpredictorsareheldconstant:

• Advertisingbudget (b=0.085):asadvertisingbudget increasesbyoneunit,albumsales increaseby0.085units.

• Airplay(b=3.367):asthenumberofplaysonradiointheweekbeforereleaseincreasesbyone,albumsalesincreaseby3.367units.

• Attractiveness (b=11.086):abandratedoneunithigherontheattractivenessscalecanexpectadditionalalbumsalesof11.086units.

Forthismodel,theadvertisingbudget,t(196)=12.26,p<.001,theamountofradioplaypriortorelease,t(196)=12.12,p<.001andattractivenessoftheband,t(196)=4.55,p<.001,areallsignificantpredictorsofalbumsales.3Rememberthatthesesignificancetestsareaccurateonlyiftheassumptionsdiscussedinyourlecturearemet.

If these assumptions aren’tmet, or youwant to ignore them, you could look at the table of bootstrap confidenceintervalsforeachpredictorandtheirsignificancevalue4.Thesetellusthatadvertising,b=0.09[0.07,0.10],p=.001,airplay,b=3.37[2.74,4.02],p=.001,andattractivenessoftheband,b=11.09[6.46,15.01],p=.001,allsignificantlypredictalbumsales.Theconfidenceintervalsareconstructedsuchthatin95%ofsamplestheboundariescontainthepopulationvalueofb.Therefore,it’slikelythattheconfidenceintervalwehaveconstructedforthissamplewillcontainthetruevalueofb inthepopulation.Therefore,wecanusetheconfidence intervalstotellusthe likelysizeoftheparameterinthepopulation(i.e.,thetruevalue).Iftheconfidenceintervalcontainszerothenthismeansthatthetruevaluemightbezero(i.e.,noeffectatall)oroppositetowhatweobservedinthesample(e.g.,anegativebinsteadof

3ForallofthesepredictorsIwrotet(196).Thenumberinbracketsisthedegreesoffreedom.InregressionthedegreesoffreedomareN-p-1,whereNisthetotalsamplesize(inthiscase200)andpisthenumberofpredictors(inthiscase3).Forthesedataweget200–3–1=196.4Rememberthatbecauseofhowbootstrappingworksthevaluesinyouroutputwillbeslightlydifferenttomine,anddifferentifyoure-runtheanalysis.

Page 13: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page13

thepositiveonethatweobserved).Therefore,becausezerodoesnotfallwithintheboundariesofanyofourbootstrapconfidenceintervals,wecanconcludeveryconfidentlythatthepopulationvaluesofbarepositive—inotherwords,allofthepredictorvariablesaregenuinepredictorsofalbumsales.

Output6

Tasks Task 1 Lacourseetal.(2001)conductedastudytoseewhethersuicideriskwasrelatedtolisteningtoheavymetalmusic.Theydevisedascaletomeasurepreferenceforbands falling intothecategoryofheavymetal.Thisscale includedheavymetalbands(BlackSabbath,IronMaiden),speedmetalbands(Slayer,Metallica),death/blackmetalbands(Obituary,Burzum)andgothicbands(MarilynManson,SistsersofMercy).Theythenusedthis(andothervariables)aspredictorsofsuicideriskbasedonascalemeasuringsuicidalideationetc.devisedbyTousignantetal.,(1988).

• Lacourse,E.,Claes,M.,&Villeneuve,M.(2001).HeavyMetalMusicandAdolescentSuicidalRisk.JournalofYouthandAdolescence,30(3),321-332.[AvailablethroughtheSussexElectronicLibrary].

Let’simaginewereplicatedthisstudy.ThedatafileHMSuicide.sav(onthecoursewebsite)containsthedatafromsuchareplication.Therearetwovariablesrepresentingscoresonthescalesdescribedabove:hm(theextenttowhichthepersonlistenstoheavymetalmusic)andsuicide(theextenttowhichsomeonehassuicidalideationandsoon).Usingthesedatacarryoutaregressionanalysistoseewhetherlisteningtoheavymetalpredictssuiciderisk.

Howmuchvariancedoesthefinalmodelexplain?

YourAnswers:

Page 14: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page14

Doeslisteningtoheavymetalsignificantlypredictsuiciderisk(quoterelevantstatistics)?

YourAnswers:

Whatisthenatureoftherelationshipbetweenlisteningtoheavymetalandsuiciderisk?(sketchascatterplotifithelpsyoutoexplain).

YourAnswers:

Writeouttheregressionequationforthemodel

YourAnswers:

Aslisteningtoheavymetalincreasesby1unit,howmuchdoessuicideriskincrease?

YourAnswers:

Task 2 Oneofmyfavouriteactivities,especiallywhentryingtodobrain-meltingthingslikewritingstatisticsbooks,istodrinktea. IamEnglishafterall.Fortunately, tea improvesyourcognitivefunction,well, inoldChinesepeopleatanyrate(Feng,Gwee,Kua,&Ng,2010).ImaynotbeChineseandI’mnotthatold,butIneverthelessenjoytheideathatteamighthelpmethink.Therearesomedata(TeaMakesYouBrainy716.sav)basedonFengetal.’sstudythatmeasuredthenumberofcupsofteadrunkandcognitivefunctioning.Useregressiontoconstructamodelthatpredictscognitivefunctioning from tea drinking, what would cognitive functioning be if someone drank 10 cups of tea? Is there asignificanteffect?

Page 15: Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is Correlational Research? Correlational designs are when many variables are measured simultaneously

©Prof.AndyField,2016 www.discoveringstatistics.com Page15

Task 3 Inthelecturewesawanexampleofoutliersvs.influentialcasesinwhichwepredictedmortalityratesformthenumberofpubsinthearea.Runaregressionanalysisforthepubs.savdatapredictingmortalityfromthenumberofpubs.Tryrepeatingtheanalysisbutbootstrappingtheconfidenceintervals.

Task 4 TheHonestyLab (www.honestylab.com) lookedathowpeopleevaluateddishonestacts.Participantsevaluated thedishonestyofactsbasedonwatchingvideosofpeopleconfessingtothoseacts.Themediawouldhaveusbelievethatthemore likeable theperpetratorwas, themorepositively theirdishonestactswereviewed. Imaginewe took100peopleandgavethemarandomdishonestact,describedbytheperpetrator.Weaskedthemtoevaluatethehonestyoftheact(0=appallingbehaviourto10=it’sOkreally)andhowmuchtheylikedtheperson(0=notatall,10=alot).I’vefabricatedsomedata(HonestyLab.sav)relatingtopeople’sratingsofdishonestactsandthelikeablenessoftheperpetrator. Run a regression using bootstrapping to predict ratings of dishonesty from the likeableness of theperpetrator.

References Feng,L.,Gwee,X.,Kua,E.H.,&Ng,T.P.(2010).Cognitivefunctionandteaconsumptionincommunitydwellingolder

ChineseinSingapore.JournalofNutritionHealth&Aging,14(6),433-438.

Wright, D. B., London, K., & Field, A. P. (2011). Using Bootstrap Estimation and the Plug-in Principle for ClinicalPsychologyData.JournalofExperimentalPsychopathology.,2(2),252–270.doi:doi:10.5127/jep.013611

Terms of Use Thishandoutcontainsmaterialfrom:

Field,A.P.(2013).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(4thEdition).London:Sage.

ThismaterialiscopyrightAndyField(2000-2016).

This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 InternationalLicense,basicallyyoucanuseitforteachingandnon-profitactivitiesbutnotmeddlewithitwithoutpermissionfromtheauthor.