Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...

©Prof.AndyField,2016 www.discoveringstatistics.com Page1

Introducing the Linear Model What is Correlational Research? Correlationaldesignsarewhenmanyvariablesaremeasuredsimultaneouslybutunlikeinanexperimentnoneofthemaremanipulated.Whenweusecorrelationaldesignswecan’tlookforcause-effectrelationshipsbecausewehaven’tmanipulatedanyofthevariables,andalsobecauseallofthevariableshavebeenmeasuredatthesamepointintime(if you’re really bored, Field, 2013, Chapter 1 explains why experiments allow us to make causal inferences butcorrelational researchdoesnot). Inpsychology, themostcommoncorrelational researchconsistsof the researcheradministeringseveralquestionnairesthatmeasuredifferentaspectsofbehaviourtoseewhichaspectsofbehaviourarerelated.Manyofyouwilldothissortofresearchforyourfinalyearresearchproject(sopayattention!).

The linear model Inthefirstfewlectureswesawthattheonlyequationweeverreallyneedisthisone:

outcome! = Model! + error!

Wealsosawthatweoftenfitalinearmodel,whichinitssimplestformcanbewrittenas:

outcome! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + error!

y! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + ε!Eq.1

Thefundamentalideaisthatanoutcomeforanentitycanbepredictedfromamodelandsomeerrorassociatedwiththat prediction (ei).We are predicting an outcome variable (yi) from a predictor variable (Xi) and a parameter,b1,associatedwiththepredictorvariablethatquantifiestherelationshipithaswiththeoutcomevariable.Wealsoneedaparameterthattellsusthevalueoftheoutcomewhenthepredictoriszero;thisparameterisb0.

Youmightrecognizethismodelas‘theequationofastraightline’.Ihavetalkedaboutfitting‘linearmodels’,andlinearsimplymeans‘straightline’.Anystraightlinecanbedefinedbytwothings:(1)theslope(orgradient)oftheline(usuallydenotedbyb1);and(2)thepointatwhichthelinecrossestheverticalaxisofthegraph(knownastheinterceptoftheline,b0).Theseparametersb1andb0areknownastheregressioncoefficientsandwillcropuptimeandtimeagain,whereyoumayseethemreferredtogenerallyasb(withoutanysubscript)orbn(meaningthebassociatedwithvariablen).Aparticularline(i.e.,model)willhavehasaspecificinterceptandgradient.

Figure1:Showslineswiththesamegradientsbutdifferentintercepts,andlinesthatsharethesameinterceptbuthavedifferentgradients

Same intercepts, different gradients Same gradients, different intercepts


Figure1showsasetoflinesthathavethesameinterceptbutdifferentgradients.Forthesethreemodels,b0willbethesameineachbutthevaluesofb1willdifferineach;thisfigurealsoshowsmodelsthathavethesamegradients(b1isthesame ineachmodel)butdifferent intercepts (theb0isdifferent ineachmodel). I’vementionedalready thatb1quantifiestherelationshipbetweenthepredictorvariableandtheoutcome,andFigure1illustratesthispoint.Amodelwithapositiveb1describesapositiverelationship,whereasalinewithanegativeb1describesanegativerelationship.Looking at Figure1 (left) the red linedescribes a positive relationshipwhereas the green linedescribes a negativerelationship. As such, we can use a linearmodel (i.e., a straight line) to summarize the relationship between twovariables:thegradient(b1)tellsuswhatthemodellookslike(itsshape)andtheintercept(b0)tellsuswherethemodelis(itslocationingeometricspace).

Thisisallquiteabstractsolet’slookatanexample.ImaginethatIwasinterestedinpredictingphysicalanddownloadedalbumsales(outcome)fromtheamountofmoneyspentadvertisingthatalbum(predictor).WecouldsummarizethisrelationshipusingalinearmodelbyreplacingthenamesofourvariablesintoEq.1.

y! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋! + ε!

albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget! + ε!Eq.2

Oncewehaveestimatedthevaluesofthebswewouldbeabletomakeapredictionaboutalbumsalesbyreplacing‘advertising’withanumberrepresentinghowmuchwewantedtospendadvertisinganalbum.Forexample,imaginethatb0turnedouttobe50andb1turnedouttobe100,ourmodelwouldbe:

albumsales! = 50 + 100×advertisingbudget! + ε! Eq.3

NotethatIhavereplacedthebetaswiththeirnumericvalues.Now,wecanmakeaprediction.Imaginewewantedtospend£5onadvertising,wecan replace thevariable ‘advertisingbudget’with thisvalueandsolve theequation todiscoverhowmanyalbumsaleswewillget:

albumsales! = 50 + 100×5 + ε!= 550 + ε!

So,basedonourmodelwecanpredictthatifwespend£5onadvertising,we’llsell550albums.I’velefttheerrortermintheretoremindyouthatthispredictionwillprobablynotbeperfectlyaccurate.Thisvalueof550albumsalesisknownasapredictedvalue.

The linear model with several predictors Wehaveseenthatwecanuseastraightlineto‘model’therelationshipbetweentwovariables.However,lifeisusuallymorecomplicatedthanthat:thereareoftennumerousvariablesthatmightberelatedtotheoutcomeofinterest.Totakeouralbumsalesexample,wemightexpectvariablesotherthansimplyadvertisingtohaveaneffect.Forexample,howmuchsomeonehearssongsfromthealbumontheradio,orthe‘look’ofthebandmighthaveaninfluence.Oneofthebeautifulthingsaboutthelinearmodelisthatitcanbeexpandedtoincludeasmanypredictorsasyoulike.Toaddapredictorallweneedtodoisplaceitintothemodelandgiveitabthatestimatestherelationshipinthepopulationbetweenthatpredictorandtheoutcome.Forexample,ifwewantedtoaddthenumberofplaysofthebandontheradioperweek(airplay),wecouldaddthissecondpredictoringeneralas:

𝑌𝑌! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋+! + 𝑏𝑏>𝑋𝑋>! + 𝜀𝜀! Eq.4

Notethatall thathaschanged istheadditionofasecondpredictor(X2)andanassociatedparameter(b2).Tomakethingsmoreconcrete,let’susethevariablenamesinstead:

albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget! + 𝑏𝑏>airplay! + 𝜀𝜀! Eq.5

Thenewmodelincludesab-valueforbothpredictors(and,ofcourse,theconstant,b0).Ifweestimatetheb-values,wecouldmakepredictionsaboutalbumsalesbasednotonlyontheamountspentonadvertisingbutalsointermsofradioplay.


Multiple regression can be usedwith three, four or even ten ormore predictors. In general we can add asmanypredictorsaswelike,andthelinearmodelwillexpandaccordingly:

𝑌𝑌! = 𝑏𝑏* + 𝑏𝑏+𝑋𝑋+! + 𝑏𝑏>𝑋𝑋>! ⋯ 𝑏𝑏D𝑋𝑋D! + 𝜀𝜀! Eq.6

Inwhich,Y istheoutcomevariable,b1 isthecoefficientofthefirstpredictor(X1),b2 isthecoefficientofthesecondpredictor(X2),bnisthecoefficientofthenthpredictor(Xni),andeiistheerrorfortheithparticipant.(Thebracketsaren’tnecessary,they’rejusttomaketheconnectiontoEq.1).Thisequationillustratesthatwecanaddinasmanypredictorsaswelikeuntilwereachthefinalone(Xn),buteachtimewedo,weassignitaregressioncoefficient(b).

Estimating the model Linearmodelscanbedescribedentirelybyaconstant(b0)andbyparametersassociatedwitheachpredictor(bs).Theseparameters are estimated using themethod of least squares (described in your lecture). Thismethod is known asordinaryleastsquares(OLS)regression.Inotherwords,SPSSfindsthevaluesoftheparametersthathavetheleastamountoferror(relativetoanyothervalue)forthedatayouhave.

Assessing the goodness of fit, sums of squares, R and R2 OnceNephwickandClungglewadhavefoundthemodelofbestfititisimportantthatweassesshowwellthismodelfitstheactualdata(weassessthegoodnessoffitofthemodel).Wedothisbecauseeventhoughthemodelisthebestoneavailable, itcanstillbea lousyfit tothedata.Onemeasureoftheadequacyofamodel is thesumofsquareddifferences(thinkbacktolecture2,orField,2013,Chapter2).Thereareseveralsumsofsquaresthatcanbecalculatedtohelpusgaugethecontributionofourmodeltopredictingtheoutcome.Let’sgobacktoourexampleofpredictingalbumsales(Y)fromtheamountofmoneyspentadvertisingthatalbum(X).Onedaymybosscameintomyofficeandsaid‘Andy,Iknowyouwantedtobearockstarandyou’veendedupworkingasmystats-monkey,buthowmanyalbumswillwesellifwespend£100,000onadvertising?’IntheabsenceofanydataprobablythebestanswerIcouldgivewouldbe themeannumberofalbumsales (say,200,000)becauseonaveragethat’showmanyalbumsweexpect tosell.However,whatifhetheasks‘Howmanyalbumswillwesellifwespend£1onadvertising?’Again,intheabsenceofanyaccurateinformation,mybestguesswouldbethemean.Thereisaproblem:whateveramountofmoneyisspentonadvertisingIalwayspredictthesamelevelsofsales.Assuch,themeanisafairlyuselessmodelofarelationshipbetweentwovariables.—butitisthesimplestmodelavailable.

Usingthemeanasamodel,wecancalculatethedifferencebetweentheobservedvalues,andthevaluespredictedbythemean.Wesawinlecture1thatwesquareallofthesedifferencestogiveusthesumofsquareddifferences.Thissumofsquareddifferencesisknownasthetotalsumofsquares(denotedSST)becauseitisthetotalamountoferrorpresentwhenthemostbasicmodelisappliedtothedata(Figure2).Now,ifwefitthemoresophisticatedmodeltothedata,suchasalineofbestfit,wecanworkoutthedifferencesbetweenthisnewmodelandtheobserveddata.Evenifanoptimalmodelisfittedtothedatathereisstillsomeinaccuracy,whichisrepresentedbythedifferencesbetweeneachobserveddatapointandthevaluepredictedbytheregressionline.Thesedifferencesaresquaredbeforetheyareaddedupsothatthedirectionsofthedifferencesdonotcancelout.Theresultisknownasthesumofsquaredresiduals(SSR).Thisvaluerepresentsthedegreeofinaccuracywhenthebestmodelisfittedtothedata.Wecanusethesetwovaluestocalculatehowmuchbettertheregressionline(thelineofbestfit)isthanjustusingthemeanasamodel(i.e.howmuchbetteristhebestpossiblemodelthantheworstmodel?).ThisimprovementinpredictionisthedifferencebetweenSSTandSSR.Thisdifferenceshowsusthereductionintheinaccuracyofthemodelresultingfromfittingtheregressionmodeltothedata.Thisimprovementisthemodelsumofsquares(SSM).

If thevalueofSSM is largethentheregressionmodel isverydifferent fromusingthemeantopredict theoutcomevariable.Thisimpliesthattheregressionmodelhasmadeabigimprovementtohowwelltheoutcomevariablecanbepredicted. However, if SSM is small then using the regression model is little better than using the mean (i.e. theregressionmodelisnobetterthantakingour‘bestguess’).Ausefulmeasurearisingfromthesesumsofsquaresistheproportionofimprovementduetothemodel.Thisiseasilycalculatedbydividingthesumofsquaresforthemodelbythetotalsumofsquares.TheresultingvalueiscalledR2andtoexpressthisvalueasapercentageyoushouldmultiplyitby100.So,R2representstheamountofvarianceintheoutcomeexplainedbythemodel(SSM)relativetohowmuchvariationtherewastoexplaininthefirstplace(SST).


𝑅𝑅> =𝑆𝑆𝑆𝑆G𝑆𝑆𝑆𝑆H

Eq.7

Figure2:Diagramshowingfromwheretheregressionsumsofsquaresderive

A second use of the sums of squares in assessing themodel is the F-test. This test is based upon the ratio of theimprovementduetothemodel(SSM)andthedifferencebetweenthemodelandtheobserveddata(SSR).Ratherthanusingthesumsofsquares,itusesthemeansumsofsquares(referredtoasthemeansquaresorMS).Theresultisthemeansquaresforthemodel(MSM)andtheresidualmeansquares(MSR)—seeField2013formoredetail.Atthisstageitisn’tessentialthatyouunderstandhowthemeansquaresarederived(itisexplainedinField,2013).However,itisimportantthatyouunderstandthattheF-ratio(Eq.8)isameasureofhowmuchthemodelhasimprovedthepredictionoftheoutcomecomparedtothelevelofinaccuracyofthemodel.Ifamodelisgood,thentheimprovementinpredictionduetothemodel(MSM)tobelargeandthedifferencebetweenthemodelandtheobserveddata(MSR)tobesmall.Inshort,agoodmodelshouldhavealargeF-ratio.

𝐹𝐹 =𝑀𝑀𝑆𝑆G𝑀𝑀𝑆𝑆K

Eq.8

SST uses the differences between the observed data

and the mean value of Y

SSR uses the differences between the observed data

and the regression line

SSM uses the differences between the mean value of Y

and the regression line


Assessing individual predictors Thevalueofbrepresentsthechangeintheoutcomeresultingfromaunitchangeinthepredictor.Ifthemodelisverybadthenwewouldexpectthechangeintheoutcometobezero.Aregressioncoefficientof0means:(1)aunitchangeinthepredictorvariableresultsinnochangeinthepredictedvalueoftheoutcome(thepredictedvalueoftheoutcomedoesnotchangeatall).Logically ifavariablesignificantlypredictsanoutcome,thenitshouldhaveab-valuethatisdifferentfromzero.Thet-statisticteststhenullhypothesisthatthevalueofbis0:therefore,ifitissignificantwegainconfidenceinthehypothesisthattheb-valueissignificantlydifferentfrom0andthatthepredictorvariablecontributessignificantlytoourabilitytoestimatevaluesoftheoutcome.

LikeF, the t-statistic isbasedon the ratioofexplainedvarianceagainstunexplainedvarianceorerror.Whatwe’reinterestedinhereisnotsomuchvariancebutwhetherthebisbigcomparedtotheamountoferrorinthatestimate.Toestimatehowmucherrorwecouldexpect to find inbweuse the standarderrorbecause it tellsusabouthowdifferentb-valueswouldbeacrossdifferentsamples.Eq.9showshowthet-testiscalculated:Thebexpectedisthevalueofbthatwewouldexpecttoobtainifthenullhypothesisweretrue(i.e.,zero)andsothisvaluecanbereplacedby0.Theequationsimplifiestobecometheobservedvalueofbdividedbythestandarderrorwithwhichitisassociated:

𝑡𝑡 =𝑏𝑏MNOPQRPS − 𝑏𝑏PUVPWXPS

𝑆𝑆𝑆𝑆Z=𝑏𝑏MNOPQRPS𝑆𝑆𝑆𝑆Z Eq.9

Thevaluesofthaveaspecialdistributionthatdiffersaccordingtothedegreesoffreedomforthetest.Inthiscontext,the degrees of freedom areN−p−1,whereN is the total sample size andp is the number of predictors. In simpleregressionwhenwehaveonlyonepredictor,thisreducesdowntoN−2.SPSSprovidestheexactprobabilitythattheobservedvalue(ora largerone)oftwouldoccur ifthevalueofbwas, infact,0.Asageneralrule, if thisobservedsignificanceislessthan.05,thenscientistsassumethatbissignificantlydifferentfrom0;putanotherway,thepredictormakesasignificantcontributiontopredictingtheoutcome.

Generalization and Bootstrapping Rememberfromyourlectureonbiasthatlinearmodelsassume:

• Linearityandadditivity:therelationshipyou’retryingtomodelis,infact,linearandwithseveralpredictors,theycombineadditively.

• Normality:Forbestimatestobeoptimaltheresidualsshouldbenormallydistributed.ForCIsandconfidenceintervalstobeaccurate,thesamplingdistributionofbsshouldbenormal.

• Homoscedasticity:necessaryforbestimatestobeoptimalandsignificancetestsandCIsoftheparameterstobeaccurate.

Iftheseassumptionsaremetthenwecantrusttheestimatesofourbs,whichmeansthatwecangeneralizeourmodel(i.e.assumethatitworksinsamplesotherthantheonefromwhichwecollecteddata).Ifwehaveconcernsabouttheseassumptions we can use bootstrapping to compute robust estimates of bs and their confidence intervals. Lack ofnormalitypreventsusfromknowingtheshapeofthesamplingdistributionunlesswehavebigsamples.BootstrappingEfron&Tibshirani,1993)getsaroundthisproblembyestimatingthepropertiesofthesamplingdistributionfromthesample data. In effect, the sample data are treated as a population fromwhich smaller samples (called bootstrapsamples)aretaken(puttingeachscorebackbeforeanewoneisdrawnfromthesample).Theparameterofinterest(e.g.,theregressionparameter)iscalculatedineachbootstrapsample.Thisprocessisrepeatedperhaps2000times.Theendresultisthatwehave2000parameterestimates,onefromeachbootstrapsample.Therearetwothingswecandowiththeseestimates:thefirstistoorderthemandworkoutthelimitswithinwhich95%ofthemfall.Wecanusethesevaluesasanestimateofthelimitsofthe95%confidenceintervaloftheparameter.Theresultisknownasapercentilebootstrapconfidenceinterval(becauseit isbasedonthevaluesbetweenwhich95%ofbootstrapsampleestimatesfall).Thesecondthingwecandoistocalculatethestandarddeviationoftheparameterestimatesfromthebootstrapsamplesanduseitasthestandarderrorofparameterestimates.Animportantpointtorememberisthatbecausebootstrappingisbasedontakingrandomsamplesfromthedatayou’vecollectedtheestimatesyougetwillbeslightly different every time. This is nothing to worry about. For a fairly gentle introduction to the concept ofbootstrappingseeWright,LondonandField(2011).


Fitting a linear model Figure3showsthegeneralprocessofconductingregressionanalysis.First,weshouldproducescatterplotstogetsomeideaofwhethertheassumptionoflinearityismet,andalsotolookforanyoutliersorobviousunusualcases.Atthisstagewemighttransformthedatatocorrectproblems.Havingdonethisinitialscreenforproblemswefitamodelandsavethevariousdiagnosticstatisticsthatwewilldiscussnextweek. Ifwewanttogeneralizeourmodelbeyondthesample,orweareinterestedininterpretingsignificancetestsandconfidenceintervalsthenweexaminetheseresidualstocheckforhomoscedasticity,normality,independenceandlinearity.Ifwefindproblemsthenwetakecorrectiveactionandre-estimatethemodel.Also,it’sprobablywisetousebootstrappedconfidenceintervalswhenwefirstestimatethemodelbecausethenwecanbasicallyforgetaboutthingslikenormality.

Figure3:Theprocessoffittingaregressionmodel

Regression using SPSS TherearesomedatafromField2013inthefileAlbumSales.sav.Thisdatafilehas200rows,eachonerepresentingadifferentalbum.Therearealsotwocolumns,onerepresentingthesalesofeachalbumintheweekafterreleaseandtheotherrepresentingtheamount(inpounds)spentpromotingthealbumbeforerelease.Thisistheformatforenteringregressiondata:theoutcomevariableandanypredictorsshouldbeenteredindifferentcolumns,andeachrowshouldrepresentindependentvaluesofthosevariables.

ThepatternofthedataisshowninFigure4anditshouldbeclearthatapositiverelationshipexists:so,themoremoneyspentadvertisingthealbum,themoreitislikelytosell.Ofcoursetherearesomealbumsthatsellwellregardlessofadvertising(topleftofscatterplot),buttherearenonethatsellbadlywhenadvertisinglevelsarehigh(bottomrightofscatterplot).Thescatterplotalsoshows the lineofbest fit for thesedata:bearing inmind that themeanwouldberepresentedbyaflatlineataroundthe200,000salesmark,theregressionlineisnoticeabledifferent.

Check residuals

Assumptions met and no bias

No normality

Graphs: zpred vs. zresid

Re-run analysis: Bootstrap CIs, transform

data

Model can be generalized

Linearity

Normality

Initial checks Graphs: scatterplotsLinearity and unusual cases

Run initial regression Save diagnostics

Homoscedasticity

Graphs: histogram

Independence

HeteroscedasticityRe-run analysis: weighted least squares regression

Lack of linearityTransform data

Lack of independence Use a multilevel model (Chapter 19)


Figure4:Scatterplotshowingtherelationshipbetweenalbumsalesandtheamountspentpromotingthealbum.

Running a basic Analysis Tofindouttheparametersthatdescribetheregressionline,andtoseewhetherthislineisausefulmodel,weneedtorun a regression analysis. To do the analysis you need to access the main dialog box by selecting

Figure4showstheresultingdialogbox.ThereisaspacelabelledDependentinwhichyoushouldplacetheoutcomevariable(inthisexamplesales).So,selectsalesfromthelistontheleft-handside,andtransferitbydraggingitorclickingon .ThereisanotherspacelabelledIndependent(s)inwhichanypredictorvariableshouldbeplaced.Insimpleregressionweuseonlyonepredictor(inthisexample,adverts)andsoyoushouldselectadvertsfromthelistandclickon totransferittothelistofpredictors.Thereareavarietyofoptionsavailable,butthesewillbeexploredwithinthecontextofmultipleregression.

Figure5:Maindialogboxforregression

Ifweareworriedaboutassumptionsthenwecangetbootstrappedconfidenceintervalsfortheregressioncoefficientsbyclicking .Select toactivatebootstrapping,andtogeta95%confidenceintervalclick

.Clickon inthemaindialogboxtorunthebasicanalysis.


Figure6:Bootstrapdialogbox

Output from SPSS

Overall Fit of the Model

Output1

ThefirsttableprovidedbySPSSisasummaryofthemodelthatgivesthevalueofRandR2forthemodel.Forthesedata,Ris0.578andbecausethereisonlyonepredictor,thisvaluerepresentsthesimplecorrelationbetweenadvertisingandalbumsales(youcanconfirmthisbyrunningacorrelation). The value of R2 is 0.335, which tells us that advertisingexpenditurecanaccountfor33.5%ofthevariationinalbumsales.Theremightbemany factors that canexplain this variation, butourmodel,which includesonly advertisingexpenditureexplains33%:66%of thevariationinalbumsalesisunexplained.Therefore,theremustbeothervariablesthathaveaninfluencealso

Thenextpartoftheoutputreportsananalysisofvariance(ANOVA—seeField,2013,Chapter11).ThemostimportantpartofthetableistheF-ratio,whichiscalculatedusingEq.8,andtheassociatedsignificancevalue.Forthesedata,Fis99.59,whichissignificantatp<.001(becausethevalueinthecolumnlabelledSig.islessthan.001).Thisresulttellsusthatthereislessthana0.1%chancethatanF-ratiothislargewouldhappeniftherewerenoeffect.Therefore,wecanconcludethatourregressionmodelresultsinsignificantlybetterpredictionofalbumsalesthanifweusedthemeanvalueofalbumsales.Inshort,theregressionmodeloverallpredictsalbumsalessignificantlywell.

Output2

SSR

SSM

SSTMSR

MSM


Model Parameters

TheANOVAtellsuswhether themodel,overall, results ina significantlygooddegreeofpredictionof theoutcomevariable.However,theANOVAdoesn’ttellusabouttheindividualcontributionofvariablesinthemodel(althoughinthissimplecasethereisonlyonevariableinthemodelandsowecaninferthatthisvariableisagoodpredictor).ThetableinOutput3providesdetailsofthemodelparameters(thebetavalues)andthesignificanceofthesevalues.WesawinEq.1thatb0wastheYinterceptandthisvalueisthevalueBfortheconstant.So,fromthetable,wecansaythatb0is134.14,andthiscanbeinterpretedasmeaningthatwhennomoneyisspentonadvertising(whenX=0),themodelpredictsthat134,140albumswillbesold(rememberthatourunitofmeasurementwasthousandsofalbums).Wecanalsoreadoffthevalueofb1fromthetableandthisvaluerepresentsthegradientoftheregressionline.Itis0.096.1Althoughthisvalueistheslopeoftheregressionline,itismoreusefultothinkofthisvalueasrepresentingthechangeintheoutcomeassociatedwithaunitchangeinthepredictor.Therefore,ifourpredictorvariableisincreasedby1unit(iftheadvertisingbudgetisincreasedby1),thenourmodelpredictsthat0.096extraalbumswillbesold.Ourunitsofmeasurementwerethousandsofpoundsandthousandsofalbumssold,sowecansaythatforanincreaseinadvertisingof£1000themodelpredicts96(0.096´1000=96)extraalbumsales.Asyoumightimagine,thisinvestmentisprettybadfortherecordcompany:theyinvest£1000andgetonly96extrasales!Fortunately,aswealreadyknow,advertisingaccountsforonlyone-thirdofalbumsales.

Output3

Wesawearlierthat,ingeneral,valuesoftheregressioncoefficientbrepresentthechangeintheoutcomeresultingfromaunitchangeinthepredictorandthatifapredictorishavingasignificantimpactonourabilitytopredicttheoutcomethenthisbshouldbedifferentfrom0(andbigrelativetoitsstandarderror).Wealsosawthatthet-testtellsuswhethertheb-valueisdifferentfrom0.SPSSprovidestheexactprobabilitythattheobservedvalueoftwouldoccurif thevalueofb in thepopulationwerezero. If thisobservedsignificance is less than .05, thentheresult reflectsagenuineeffect.Forbothts,theprobabilitiesare.000(zeroto3decimalplaces)andsowecansaythattheprobabilityofthesetvalues(orlarger)occurringifthevaluesofbinthepopulationwerezeroislessthan.001.Therefore,thebsaresignificantlydifferentfrom0.Inthecaseofthebforadvertisingbudgetthisresultmeansthattheadvertisingbudgetmakesasignificantcontribution(p<.001)topredictingalbumsales.

Thebootstrapconfidenceintervaltellsusthatthepopulationvalueofbforadvertisingbudgetislikelytofallbetween.08 and .11 and because this interval doesn’t include zero we would conclude that there is a genuine positiverelationshipbetweenadvertisingbudgetandalbumsalesinthepopulation.Also,thesignificanceassociatedwiththisconfidenceintervalisp=.001,whichishighlysignificant.Also,notethatthebootstrapprocessinvolvesre-estimatingthestandarderror(itchangesfrom.01intheoriginaltabletoabootstrapestimateof.009).Thisisaverysmallchange.Fortheconstant, thestandarderror is7.537comparedtothebootstrapestimateof8.214,which isadifferenceof

1SometimessmallvaluesarereportedbySPSSasthingslike9.612E-02andmanystudentsfindthisnotationconfusing.Well,thinkofE-02asmeaning‘movethedecimalplace2stepstotheleft’,so9.612E-02becomes0.09612.


0.677.Thebootstrapconfidenceintervalsandsignificancevaluesareusefultoreportandinterpretbecausetheydonotrelyonassumptionsofnormalityorhomoscedasticity.

Using the Model Sofar,wehavediscoveredthatwehaveausefulmodel,onethatsignificantlyimprovesourabilitytopredictalbumsales.However,thenextstage isoftentousethatmodeltomakesomepredictions.Thefirststage istodefinethemodelbyreplacingtheb-valuesinEq.2withthevaluesfromtheoutput.Inaddition,wecanreplacetheXandYwiththevariablenamessothatthemodelbecomes:

albumsales! = 𝑏𝑏* + 𝑏𝑏+advertisingbudget!

= 134.14 + 0.096×advertisingbudget! Eq.10

Itisnowpossibletomakeapredictionaboutalbumsales,byreplacingtheadvertisingbudgetwithavalueofinterest.For example, imagine a recording company executive wanted to spend £100,000 on advertising a new album.Rememberingthatourunitsarealreadyinthousandsofpounds,wecansimplyreplacetheadvertisingbudgetwith100.Hewoulddiscoverthatalbumsalesshouldbearound144,000forthefirstweekofsales:

albumsales! = 134.14 + 0.096×advertisingbudget!

= 134.14 + 0.096×100

= 143.74

Eq.11

Regression with several predictors using SPSS

SELF-TEST: Produce a matrix scatterplot of Sales, Adverts, Airplay and Attract including theregressionline.

Main options Theexecutivehaspastresearchindicatingthatadvertisingbudgetisasignificantpredictorofalbumsales,andsoheshouldincludethisvariableinthemodelfirst.Hisnewvariables(airplayandattract)should,therefore,beenteredintothemodelafteradvertisingbudget.Thismethodishierarchical(theresearcherdecidesinwhichordertoentervariablesintothemodelbasedonpastresearch).TodoahierarchicalregressioninSPSSwehavetoenterthevariablesinblocks(eachblockrepresentingonestepinthehierarchy).Togettothemainregressiondialogboxselect .Tosetupthefirstblockwedoexactlywhatwedidbefore.Selecttheoutcomevariable(albumsales)anddragittotheboxlabelledDependent(orclickon ).Wealsoneedtospecifythepredictorvariableforthefirstblock.We’vedecidedthatadvertisingbudgetshouldbeenteredintothemodelfirst,soselectthisvariableinthelistanddragittotheboxlabelledIndependent(s)(orclickon ).UnderneaththeIndependent(s)box,thereisadrop-downmenuforspecifyingtheMethodofregression.Youcanselectadifferentmethodofvariableentryforeachblockbyclickingon ,nexttowhereitsaysMethod(seeFigure5).Thedefaultoptionisforcedentry,andthisistheoptionwewant,butnextweekwe’lllookatotherapproaches.

Havingspecifiedthefirstblockinthehierarchy,weneedtomoveontotothesecond.Totellthecomputerthatyouwanttospecifyanewblockofpredictorsyoumustclickon .ThisprocessclearstheIndependent(s)boxsothatyoucanenterthenewpredictors(youshouldalsonotethatabovethisboxitnowreadsBlock2of2indicatingthatyouareinthesecondblockofthetwothatyouhavesofarspecified).WedecidedthatthesecondblockwouldcontainbothofthenewpredictorsandsoyoushouldclickonAirplayandAttract(whileholdingdownCtrl,orCmdifyouuseaMac)inthevariableslistanddragthemtotheIndependent(s)boxorclickon .ThedialogboxshouldnowlooklikeFigure7.Tomovebetweenblocksusethe and buttons(soforexample,tomovebacktoblock1,clickon).Wecangetbootstrappedconfidenceintervalsfortheregressioncoefficientsbyclicking aswedidbefore.


Figure7:Maindialogboxforblock2ofthemultipleregression

Output from SPSS

Overall Fit of the Model

Output4

Notethattherearetwomodels.Model1referstothefirst stage in the hierarchy when only advertisingbudgetisusedasapredictor.Model2referstowhenallthreepredictorsareused.UnderthistableSPSStellsus what the dependent variable (outcome) was andwhatthepredictorswere ineachof thetwomodels.The column labelled R contains the values of themultiplecorrelationcoefficientbetweenthepredictorsand the outcome. When only advertising budget isused as a predictor, this is the simple correlationbetweenadvertisingandalbumsales (0.578). In fact,all of the statistics formodel 1 are the same as theregressionmodelearlier(Output1).

Thenext columngivesusavalueofR2,whichwealreadyknow isameasureofhowmuchof thevariability in theoutcomeisaccountedforbythepredictors.Forthefirstmodelitsvalueis.335,whichmeansthatadvertisingbudgetaccountsfor33.5%ofthevariationinalbumsales.However,whentheothertwopredictorsareincludedaswell(model2),thisvalueincreasesto.665or66.5%ofthevarianceinalbumsales.Therefore,ifadvertisingaccountsfor33.5%,wecantellthatattractivenessandradioplayaccountforanadditional33%.2So,theinclusionofthetwonewpredictorshasexplainedquitealargeamountofthevariationinalbumsales

Output5containsanANOVAthattestswhetherthemodelissignificantlybetteratpredictingtheoutcomethanusingthemeanasa‘bestguess’.FortheinitialmodeltheF-ratiois99.59,p<.001.ForthesecondmodelthevalueofFis129.498, which is also highly significant (p < .001). We can interpret these results as meaning that both modelssignificantlyimprovedourabilitytopredicttheoutcomevariablecomparedtonotfittingthemodel.

2Thatis,33%=66.5%-33.5%.


Output5

Model Parameters

RememberthatinmultipleregressionthemodeltakestheformofEq.6andinthatequationthereareseveralunknownparameters(theb-values).ThefirstpartofOutput6givesusestimatesfortheseb-valuesandthesevaluesindicatetheindividualcontributionofeachpredictortothemodel.Wecanusetheseestimatestodefineourmodel:

sales! = 𝑏𝑏* + 𝑏𝑏+advertising! + 𝑏𝑏>airplay! + 𝑏𝑏aattractiveness!

= −26.61 + 0.08advertising! + 3.37airplay! + 11.09attractiveness! Eq.12

Theb-valuestellusabouttherelationshipbetweenalbumsalesandeachpredictor.Allthreepredictorshavepositiveb-valuesindicatingpositiverelationships.So,asadvertisingbudgetincreases,albumsalesincrease;asplaysontheradioincrease,sodoalbumsales;andfinallymoreattractivebandswillsellmorealbums.Theb-valuestellusmorethanthis,though.Theytellustowhatdegreeeachpredictoraffectstheoutcome iftheeffectsofallotherpredictorsareheldconstant:

• Advertisingbudget (b=0.085):asadvertisingbudget increasesbyoneunit,albumsales increaseby0.085units.

• Airplay(b=3.367):asthenumberofplaysonradiointheweekbeforereleaseincreasesbyone,albumsalesincreaseby3.367units.

• Attractiveness (b=11.086):abandratedoneunithigherontheattractivenessscalecanexpectadditionalalbumsalesof11.086units.

Forthismodel,theadvertisingbudget,t(196)=12.26,p<.001,theamountofradioplaypriortorelease,t(196)=12.12,p<.001andattractivenessoftheband,t(196)=4.55,p<.001,areallsignificantpredictorsofalbumsales.3Rememberthatthesesignificancetestsareaccurateonlyiftheassumptionsdiscussedinyourlecturearemet.

If these assumptions aren’tmet, or youwant to ignore them, you could look at the table of bootstrap confidenceintervalsforeachpredictorandtheirsignificancevalue4.Thesetellusthatadvertising,b=0.09[0.07,0.10],p=.001,airplay,b=3.37[2.74,4.02],p=.001,andattractivenessoftheband,b=11.09[6.46,15.01],p=.001,allsignificantlypredictalbumsales.Theconfidenceintervalsareconstructedsuchthatin95%ofsamplestheboundariescontainthepopulationvalueofb.Therefore,it’slikelythattheconfidenceintervalwehaveconstructedforthissamplewillcontainthetruevalueofb inthepopulation.Therefore,wecanusetheconfidence intervalstotellusthe likelysizeoftheparameterinthepopulation(i.e.,thetruevalue).Iftheconfidenceintervalcontainszerothenthismeansthatthetruevaluemightbezero(i.e.,noeffectatall)oroppositetowhatweobservedinthesample(e.g.,anegativebinsteadof

3ForallofthesepredictorsIwrotet(196).Thenumberinbracketsisthedegreesoffreedom.InregressionthedegreesoffreedomareN-p-1,whereNisthetotalsamplesize(inthiscase200)andpisthenumberofpredictors(inthiscase3).Forthesedataweget200–3–1=196.4Rememberthatbecauseofhowbootstrappingworksthevaluesinyouroutputwillbeslightlydifferenttomine,anddifferentifyoure-runtheanalysis.


thepositiveonethatweobserved).Therefore,becausezerodoesnotfallwithintheboundariesofanyofourbootstrapconfidenceintervals,wecanconcludeveryconfidentlythatthepopulationvaluesofbarepositive—inotherwords,allofthepredictorvariablesaregenuinepredictorsofalbumsales.

Output6

Tasks Task 1 Lacourseetal.(2001)conductedastudytoseewhethersuicideriskwasrelatedtolisteningtoheavymetalmusic.Theydevisedascaletomeasurepreferenceforbands falling intothecategoryofheavymetal.Thisscale includedheavymetalbands(BlackSabbath,IronMaiden),speedmetalbands(Slayer,Metallica),death/blackmetalbands(Obituary,Burzum)andgothicbands(MarilynManson,SistsersofMercy).Theythenusedthis(andothervariables)aspredictorsofsuicideriskbasedonascalemeasuringsuicidalideationetc.devisedbyTousignantetal.,(1988).

• Lacourse,E.,Claes,M.,&Villeneuve,M.(2001).HeavyMetalMusicandAdolescentSuicidalRisk.JournalofYouthandAdolescence,30(3),321-332.[AvailablethroughtheSussexElectronicLibrary].

Let’simaginewereplicatedthisstudy.ThedatafileHMSuicide.sav(onthecoursewebsite)containsthedatafromsuchareplication.Therearetwovariablesrepresentingscoresonthescalesdescribedabove:hm(theextenttowhichthepersonlistenstoheavymetalmusic)andsuicide(theextenttowhichsomeonehassuicidalideationandsoon).Usingthesedatacarryoutaregressionanalysistoseewhetherlisteningtoheavymetalpredictssuiciderisk.

Howmuchvariancedoesthefinalmodelexplain?

YourAnswers:


Doeslisteningtoheavymetalsignificantlypredictsuiciderisk(quoterelevantstatistics)?

YourAnswers:

Whatisthenatureoftherelationshipbetweenlisteningtoheavymetalandsuiciderisk?(sketchascatterplotifithelpsyoutoexplain).

YourAnswers:

Writeouttheregressionequationforthemodel

YourAnswers:

Aslisteningtoheavymetalincreasesby1unit,howmuchdoessuicideriskincrease?

YourAnswers:

Task 2 Oneofmyfavouriteactivities,especiallywhentryingtodobrain-meltingthingslikewritingstatisticsbooks,istodrinktea. IamEnglishafterall.Fortunately, tea improvesyourcognitivefunction,well, inoldChinesepeopleatanyrate(Feng,Gwee,Kua,&Ng,2010).ImaynotbeChineseandI’mnotthatold,butIneverthelessenjoytheideathatteamighthelpmethink.Therearesomedata(TeaMakesYouBrainy716.sav)basedonFengetal.’sstudythatmeasuredthenumberofcupsofteadrunkandcognitivefunctioning.Useregressiontoconstructamodelthatpredictscognitivefunctioning from tea drinking, what would cognitive functioning be if someone drank 10 cups of tea? Is there asignificanteffect?


Task 3 Inthelecturewesawanexampleofoutliersvs.influentialcasesinwhichwepredictedmortalityratesformthenumberofpubsinthearea.Runaregressionanalysisforthepubs.savdatapredictingmortalityfromthenumberofpubs.Tryrepeatingtheanalysisbutbootstrappingtheconfidenceintervals.

Task 4 TheHonestyLab (www.honestylab.com) lookedathowpeopleevaluateddishonestacts.Participantsevaluated thedishonestyofactsbasedonwatchingvideosofpeopleconfessingtothoseacts.Themediawouldhaveusbelievethatthemore likeable theperpetratorwas, themorepositively theirdishonestactswereviewed. Imaginewe took100peopleandgavethemarandomdishonestact,describedbytheperpetrator.Weaskedthemtoevaluatethehonestyoftheact(0=appallingbehaviourto10=it’sOkreally)andhowmuchtheylikedtheperson(0=notatall,10=alot).I’vefabricatedsomedata(HonestyLab.sav)relatingtopeople’sratingsofdishonestactsandthelikeablenessoftheperpetrator. Run a regression using bootstrapping to predict ratings of dishonesty from the likeableness of theperpetrator.

References Feng,L.,Gwee,X.,Kua,E.H.,&Ng,T.P.(2010).Cognitivefunctionandteaconsumptionincommunitydwellingolder

ChineseinSingapore.JournalofNutritionHealth&Aging,14(6),433-438.

Wright, D. B., London, K., & Field, A. P. (2011). Using Bootstrap Estimation and the Plug-in Principle for ClinicalPsychologyData.JournalofExperimentalPsychopathology.,2(2),252–270.doi:doi:10.5127/jep.013611

Terms of Use Thishandoutcontainsmaterialfrom:

Field,A.P.(2013).DiscoveringstatisticsusingSPSS:andsexanddrugsandrock‘n’roll(4thEdition).London:Sage.

ThismaterialiscopyrightAndyField(2000-2016).

This document is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 InternationalLicense,basicallyyoucanuseitforteachingandnon-profitactivitiesbutnotmeddlewithitwithoutpermissionfromtheauthor.

http://creativecommons.org/licenses/by-nc-nd/4.0/

Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...

Documents

Transcript of Introducing the Linear Model - Discovering Statistics · Introducing the Linear Model What is...