4 Correlation Week 3 - psychology.mcmaster.ca

HUMBEHV3ST3CorrelationWeek3

Prof.PatrickBennett

Sofar:Exploredonevariableatatime

Istherearearelationshipbetweenthese2variables?

2018161412

4

3

2

1

0

Physicians

Frequency

7.55.02.50.0-2.5-5.0

4

3

2

1

0

Infant Mortality

Freq

uenc

y

Xi Yi

Visualrelationshipwithascatterplot

Togenerateascatterplot,needtheactualmatchedpairsofXi&Yivalues.

Baseball:TeamWins&TeamOPS40

5060

7080

90100

Team

Win

s

0.60

0.65

0.70

0.75

0.80

0.85

0.90

Team

OP

S

OPS:OnbasePlusSluggingameasureofteam’sabilitytogetonbaseandhitforpower

MLBdatafromSeptember20,2019

PerceivedOrientationvs.StimulusAngle

Δ orientation

probabilityofresponding“counter-clockwiseofvertical”plottedasafunctionofanglefromvertical

-5 0 5

0.0

0.2

0.4

0.6

0.8

1.0

Δ orientation

p(counter-clockwise)

sO = 30.00mu = -1.32sigma = 1.40nhiV30C100_1_tilt.mat

prob

ability(“coun

ter-clockw

ise”)

VisualAcuityvsAgeinHumanInfants

https://www.sciencedirect.com/topics/nursing-and-health-professions/eye-chart

Correlation&LinearRegression

• correlation&linearregressionarestatisticalmethodsthatareusedto

assesstheassociationbetweenvariables

• Correlation:measurethelinearassociationbetween2variables

• Regression:estimatesthe“best-fitting”linethatrelatesapredictor

variable,X,toacriterionvariable,Y.

- usedtounderstandhowYchangeswhenXvaries

ExamplesofCorrelationalResearch

• effectivenessoffluoridatedwaterinreducingcavities• associationbetweensmokingandlungcancer

• falseclaimoflinkbetweenautism&measles,mumps,&rubella(MMR)inoculations–proposedthatMMRcausesautism

• drinkingcoffeeassociatedwithlongerlife• claimthatorchestraconductorslivelonger

Correlation&LinearRegression

Regressioncomputesbest-fittingline.Correlationisameasureofhowwelldataarefitbytheline.

Regressionline:the“lineofbestfit”;representsbestpredictionofYiforagivenvalueofXi

(Xi,Yi)=(16,2)

Correlationmeasuresthestrength&directionoflinearassociation

Strong

Weak

Negative Positive

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

x

y

r = -0.8

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

x

y

r = 0.8

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

x

y

r = -0.3

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

x

y

r = 0.3

r=-0.8

r=-0.3

r=0.8

r=0.3

Correlationdependsonfit,notslope

105 110 115 120

6080

100

120

140

160

x

y

r = 0.82

105 110 115 120

6080

100

120

140

160

x

y

r = 0.63

PerfectLinearRelation(r=1orr=-1)

100 105 110 115 120

1020

3040

5060

70

perfect positive correlation

x

y

100 105 110 115 120

1020

3040

5060

70

perfect negative correlation

x

y

r=1.0 r=-1.0

Commonmeasureofcorrelation—Pearsonr—variesbetween-1&+1

Nolinearassociation(r=0)

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = 0

x

y

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = 0

x

y

Vertical&horizontallinesindicatemeansofxandy.

Correlations&deviationsaroundthemeans

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = 0.7

x

y

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = -0.7

x

y

Vertical&horizontallinesindicatemeansofxandy.Notethatpointsdonotfallwithineachquadrantequally.

Linearvs.curvilineartrends/relations

• Monotonictrend:asXincreases,Yincreasesordecreaseswithoutreversal

– mightnotbeastraightline!

• Non-monotonictrend:asXincreases,Ychangesdirectionatleastonce

• Linearrelationship:bestfitlineisstraight• Curvilinear:bestfit“line”isnotstraight(non-linear)• Correlationisonlysensitivetolinearrelations!

Non-monotonic

Linear

Monotonic

CurvilinearCurvilinear Curvilinear

Monotonic Monotonic

●

●

●

●

●

●

●

●

●

●

●

4 6 8 10 12 14

45

67

89

1011

xy

r = 0.82●

●

●●

●

●

●

●

●

●

●

4 6 8 10 12 14

34

56

78

9

x

y

r = 0.82

●

●

●

●

●

●

●

●

●

●

●

4 6 8 10 12 14

68

1012

x

y

r = 0.82

●

●

●

●

●

●

●

●

●

●

●

8 10 12 14 16 18

68

1012

xy

r = 0.82

4(x,y)datasetsplottedwithareference(best-fitting)line

Question:whichsethasthehighestcorrelation?

A B

C D

●

●

●

●

●

●

●

●

●

●

●

4 6 8 10 12 14

45

67

89

1011

x

y

r = 0.82●

●

●●

●

●

●

●

●

●

●

4 6 8 10 12 14

34

56

78

9

x

y

r = 0.82

●

●

●

●

●

●

●

●

●

●

●

4 6 8 10 12 14

68

1012

x

y

r = 0.82

●

●

●

●

●

●

●

●

●

●

●

8 10 12 14 16 18

68

1012

x

y

r = 0.82

Thecorrelationisthesameinall4sets!r=0.82

Correlationisameasureoflinearassociationandisinsensitivetonon-linearassociation.

Thecorrelationbetweentemperatureandmonth/yearisverysmalleventhoughthereisastrong(curvilinear)associationbetweenthetwovariables.

Correlationisameasureoflinearassociationbetweenvariables

r=0.05

Covariance• Measuresthedegreetowhich2variablesvarytogether

- whereNisthenumberofobservations.

• Covariancedependsonsumofproductsofdeviationscores

- Covarianceispositivewhendeviationscoreshavesamesign

- Covarianceisnegativewhendeviationscoreshaveoppositesigns

1-N

)Y-)(YX-(Xcov 1

ii

XY

∑==

N

i

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = 0.7

x

y

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = -0.7

xy

GreenpointsdriveCOVXYinpositivedirection.BluepointsdriveCOVXYinnegativedirection.

NetCOVXYdiffersfromzero.

1-N

)Y-)(YX-(Xcov 1

ii

XY

∑==

N

i

X

Y

X

Y

positivecovariancepositivecorrelation

negativecovariancenegativecorrelation

GreenpointsdriveCOVXYinpositivedirection.BluepointsdriveCOVXYinnegativedirection.

NotethatnetCOVXYiszero.

1-N

)Y-)(YX-(Xcov 1

ii

XY

∑==

N

i

X

Y

X

Y

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = 0

x

y

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

r = 0

x

y

netzerocovariancezerocorrelation

netzerocovariancezerocorrelation

Covarianceisameasureofassociation(butisinfluencedbyspreadofX&Y)

90 95 100 105 110

9095

100

105

110

X

Y

Covariance = 5, r = 0.5

40 60 80 100 120 140 160

4060

80100

120

140

160

X

Y


Similar(X,Y)associationsbutverydifferentcovariancesduetodifferencesin(X,Y)spread.

Covarianceisameasureofassociation(butisinfluencedbyspreadofX&Y)

40 60 80 100 120 140 160

4060

80100

120

140

160

X

Y


Samedataasinpreviousslide,butreplottedtohighlightdifferencein(X,Y)spread.

40 60 80 100 120 140 160

4060

80100

120

140

160

X

YCovariance = 5, r = 0.5

Correlationvs.Covariance

COV(X,Y)isaffectedbyvariancesofXandY.Sochangingtheunitsofmeasuresgenerallychangescovariance.

covariance=92.03 covariance=79.88

var(weight)=178.0var(height)=80.1


CorrelationCoefficient,r

•

- sXandsYarethestandarddeviationsofXandYrespectively.• risthePearsonProduct-MomentCorrelationCoefficient

• Notethatrhasno“units”

Standardizesthecovariancesowecancomparecorrelationsobtainedwithdatasetswithverydifferent(X,Y)spreads.

YX

XY

ssr cov=


40 60 80 100 120 140 160

4060

80100

120

140

160

X

Y


Correlation,r,islessaffectedthanCOV(X,Y)byvariancesofXandY.

40 60 80 100 120 140 160

4060

80100

120

140

160

X

Y



Correlation,r,islessaffectedthanCOV(X,Y)byvariancesofXandY.Sochangingunitschangescovariancebutnotr.

covariance=92.03r=0.77

covariance=79.88r=0.77



Pearsonr

• rrangesfrom-1to1

- r=+1whenXandYareperfectlypositivelycorrelated- r=-1whenXandYareperfectlynegativelycorrelated- r=0whenthereisnolinearrelationshipbetweenXandY• inPsychology,- Weakr:0.1-0.3

- Moderater:0.3-0.5

- Strongr:>0.5

Whatdoesrmean?

IQR=83IQR=62

Y X

50100

150

variable

score

Data:20pairsofX,Yscores Y X

s1 71.02058 100.201137

s2 62.31119 52.953707

s3 179.31734 163.077298

s4 110.02782 83.582028

s5 172.73203 142.985011

s6 128.27301 138.997126

s7 81.07347 58.119052

s8 65.82096 53.981494

s9 60.2459 65.250576

s10 123.27456 108.558325

s11 135.13302 152.882385

s12 68.73133 65.628767

s13 22.89659 7.949241

s14 182.69514 173.079349

s15 14.85593 17.623536

s16 77.4901 55.465981

s17 72.90529 85.641115

s18 104.4358 95.790047

s19 174.46641 178.942469

s20 92.29351 99.291358

Whatdoesrmean?

PlotofYscores

050

100

150

200

score

PlotofY-vs-Xscores

EachYispairedwith1X

0 50 100 150

050

100

150

200

X

score

r=0.95

knowingvalueofXtellsussomethingaboutvalueofY

0 50 100 150

050

100

150

200

X

score

residuals(errors)

relationbetweenY&Xisnotperfect(r<1)

0 50 100 150

050

100

150

200

X

score

…Whatdoesrmean?

residuals(errors)

residualsrepresentthepartofYscorethatisNOT“accountedfor”or“explainedby”or“associatedwith”X

0 50 100 150

050

100

150

200

X

score

VarianceofresidualsismuchlessthanvarianceoforiginalYscores

Y Residuals

-50

050

100

150

200

WhenX&Yarecorrelated,knowingvalueofonevariablereducesuncertaintyaboutvalueofothervariable.

r2=proportionofvarianceinYthatisaccountedforbyX

VAR(Y)=2500VAR(residuals)=243.75

r2 = .952 = 0.9025

1 −243.752500

= 0.9025

proportionofvarianceNOTexplainedbyX

proportionofvariancethatISexplainedbyX

Decomposingweightvarianceinto“explained”and“unexplained”parts

50 60 70 80 90

100

150

200

250

height (inches)

wei

ght (

poun

ds)

cov = 79.88r = 0.77

VARIANCE(weight)=865.4VARIANCE(predictedweights)=514.1

VARIANCE(errors)=351.3

weight predicted errors

-50

050

100

150

200

250

IQR(weight)=40.8IQR(predictedweights)=34.2

IQR(errors)=22.7

ht.inches wt.pounds wt.predicted wt.errors

1 71.65358 169.7557 172.8675 -3.11176992 63.38586 127.8680 119.6619 8.2060186

3 63.38586 116.8449 119.6619 -2.8170814

4 69.68508 149.9142 160.1995 -10.2853574

5 61.81106 130.0726 109.5275 20.5450325

6 66.92917 167.5511 142.4643 25.0867921

7 65.74807 167.5511 134.8635 32.6875876

8 73.22839 152.1188 183.0019 -30.8831239

9 70.07878 156.5280 162.7331 -6.2050959

10 67.32287 143.3003 144.9979 -1.6976264

11 68.89768 154.3234 155.1323 -0.8089204

13 63.38586 112.4356 119.6619 -7.2263214

14 66.14177 141.0957 137.3971 3.6985491

15 64.17326 114.6402 124.7291 -10.0888984

16 65.35437 143.3003 132.3299 10.9703661

17 73.62209 202.8250 185.5355 17.2895376

18 66.14177 136.6864 137.3971 -0.7106909

19 77.55910 167.5511 210.8715 -43.3203674

20 68.89768 134.4818 155.1323 -20.6505004

21 70.86618 262.3498 167.8003 94.5494671

predictionsfromheight

r2 = 0.772 = 0.59

proportionofvariance(weight)thatisNOTaccountedforbyheight

VARIANCE(errors)VARIANCE(weight)

=351.3865.4

= 0.41 = 1 − r2

proportionofvariance(weight)thatisaccountedforbyheight

VARIANCE(predicted weights)VARIANCE(weight)

=514.1865.4

= 0.59 = r2

r2isproportionofvariationinweightthatis“accountedfor”byvariationinheight

Decomposingweightvarianceinto“explained”and“unexplained”parts

VARIANCE(weight)=865.4VARIANCE(predictedweights)=0.3

VARIANCE(errors)=865.1

IQR(weight)=40.8IQR(predictedweights)=0.8

IQR(errors)=40.3

r2 = 0.0182 ≈ 0.0003

proportionofvariance(weight)thatisNOTaccountedforbyheight

VARIANCE(errors)VARIANCE(weight)

=865.1865.4

= 0.9997 = 1 − r2

proportionofvariance(weight)thatisaccountedforbyheight

VARIANCE(predicted weights)VARIANCE(weight)

=0.3

865.4= 0.0003 = r2

r2isproportionofvariationinweightthatis“accountedfor”byvariationinheight

weight predicted errors

-50

050

100

150

200

250

ht.inches wt.pounds wt.predicted wt.errors

1 68.11027 169.7557 144.0959 25.6598435

2 68.50397 127.8680 144.1548 -16.28679303 61.81106 116.8449 143.1542 -26.3093324

4 63.77956 149.9142 143.4485 6.4656850

5 65.35437 130.0726 143.6839 -13.6113210

6 67.32287 167.5511 143.9782 23.5729365

7 65.74807 167.5511 143.7428 23.8083625

8 68.89768 152.1188 144.2136 7.9051705

9 63.77956 156.5280 143.4485 13.0795450

10 70.86618 143.3003 144.5079 -1.207592011 67.71657 154.3234 144.0370 10.2863600

13 63.77956 112.4356 143.4485 -31.0128550

14 68.11027 141.0957 144.0959 -3.0002165

15 68.89768 114.6402 144.2136 -29.5733695

16 66.53547 143.3003 143.8605 -0.5601705

17 72.44098 202.8250 144.7433 58.0817220

18 58.26775 136.6864 142.6245 -5.9380439

19 72.44098 167.5511 144.7433 22.8078020

20 64.96067 134.4818 143.6250 -9.143224521 65.74807 262.3498 143.7428 118.6070225

predictionsfromheight

KnowingXreducesuncertaintyaboutY

MeanY±2SD(Y)

PredictedY±2SD(residuals)

Whatdoesrmean?risrelatedtohowmuchuncertaintyaboutYisreducedbyknowingX

X

Y

X

Y

r=0

r=0

Thislookslikeaperfectcorrelation.Whyisr=0?

Y

Xr=-1

Y

Xr=+1

X

Y

r=-0.6

Y

Xr=+0.3

rvariesacrosssamples

Sample r (n = 20; true-r = 0.4)

Sample r

Frequency

-0.5 0.0 0.5 1.0

0100

200

300

400

500 populationr=0.4

mean(sampler)=0.39range(sampler)=[-0.45,0.88]

Eachsamplerisanestimateofthepopulationr.Someestimatesaregood,othersarenotsogood.Canwequantifytheuncertaintyofourestimate?

calculaterformanysamplesofdata(n=20persample)

ConfidenceInterval

Aconfidenceintervalisarangeofvalues,calculatedfromthesampleobservations,thatarebelieved,withaparticularprobability,tocontainthetruepopulationparameter.A95%confidenceinterval,forexample,impliesthatweretheestimationprocessrepeatedagainandagain,then95%ofthecalculatedintervalswouldbeexpectedtocontainthetrueparametervalue.

–B.S.Everitt,DictionaryofStatistics

ConfidenceInterval• 95%ConfidenceInterval- anintervalestimateofthevalueofapopulationparameter(e.g.,r)

- calculatedfromdatainyoursample

- theintervalvariesacrosssamples

- inthelongrun,theintervalcontainsthetruepopulationvalue95%ofthetime

ConfidenceInterval• 95%ConfidenceInterval- anintervalestimateofthevalueofapopulationparameter(e.g.,r)




• inoursimulationwithtruepopulationr=0.4:

- weobtainedone(X,Y)datasetwithacorrelationr=0.37- forthatsample,ourstatisticalsoftwarecalculatesCI95%=[-0.08,0.70]

‣weestimatethatthetruevalueofrisbetween-0.08&0.70

‣ notethatourCIdoescontainthetruepopulationr=0.4

ConfidenceInterval• 95%ConfidenceInterval

- anintervalestimateofthevalueofapopulationparameter(e.g.,r)




• inoursimulationwithtruepopulationr=0.4:

- weobtainedone(X,Y)datasetwithacorrelationr=0.37- forthatsample,ourstatisticalsoftwarecalculatesCI95%=[-0.08,0.70]

‣ weestimatethatthetruevalueofrisbetween-0.08&0.70

‣ notethatourCIdoescontainthetruepopulationr=0.4• Notethatwecanadjustthewidthofourinterval

- 99%CI:containstruepopulationvalue99%ofthetime

- 90%CI:containstruepopulationvalue90%ofthetime

‣ Question:Forourdataset,whichCIwouldbewidest:99%,95%,or90%?‣ Hint:Whatdoyouthinkisthe100%CI?

ConfidenceIntervalExamples“Marginoferror”inpolls

http://www.pewresearch.org/fact-tank/2016/09/08/understanding-the-margin-of-error-in-election-polls/

ConfidenceIntervalExamples

https://www.texasgateway.org/resource/807-confidence-intervals-real-world

(90%ConfidenceInterval)

OtherTypesofCorrelations

• Twocontinuousvariables- PearsonProduct-MomentCorrelationCoefficient(r)• Tworanked/ordinalvariables- Spearman’scorrelationcoefficientforrankeddata(rho(ρ)orrs)

• Onedichotomous&onecontinuousvariable- e.g.,correct/incorrectansweronmcquestionandexamtotalscore- Point-biserialCorrelation(rpb):CalculatePearson’srbutcallitrpb• Twodichotomousvariables- e.g.age(teens/seniors)andcoinpurseownership(yes/no)- CalculatePearson’srbutcalledinrφ

Correlationswithrankeddata• Spearman’scorrelationcoefficient,rs

- usefulwhenobservationshavebeenreplacedbytheirnumericalranks

‣ e.g.,changingapplicants’examscorestoranks

englishmathenglish.rankmath.rank56662.0775708.0945401.0171607.0461654.5664566.0258593.03807710.01076679.0861634.55

Spearman’srankordercorrelation

40 50 60 70 80

4050

6070

80

English

Math

r = 0.8

English Rank

Mat

h R

ank

rs = 0.67

1 3 5 7 9

13

57

9

SameasPearsonrcorrelationforranks,notactual(x,y)values.Whyusersifyouhavetheoriginalvalues?

rsismorerobusttosometypesofextremepoints

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

X

Y

r = -0.05

rs = -0.05

-6 -4 -2 0 2 4 6

-6-4

-20

24

6

X

Y

r = 0.38

rs = -0.01

Changingonly2outof100pointsmarkedlyaffectsrbutnotrs

rsismorerobusttosometypesofextremepoints

-3 -2 -1 0 1 2 3

-3-2

-10

12

3

X

Yr = -0.05

rs = -0.05

-6 -4 -2 0 2 4 6

-6-4

-20

24

6

X

Y

r = 0.38

rs = -0.01

Changingonly2outof100pointsmarkedlyaffectsrbutnotrs

…butnottoalltypesofextremescores

correlationswith&withoutNorthernIrelanddatapoint

2.5 3.0 3.5 4.0 4.5 5.0

3.5

4.0

4.5

5.0

5.5

6.0

6.5

High Leverage Point (Figure 9.3)

Tobacco Expenditure

Alc

ohol

Exp

endi

ture

r = 0.22

rs = 0.37

Northern Ireland

r = 0.78

rs = 0.83Inthiscase,includingextremepointaltersrs,too.Why?

Correlationswithrankeddata

• Spearman’scorrelationcoefficient,rs,calculatedwiththesameformulathatisusedtocalculatePearson’sr

- formulaappliedtoranks,notactual(X,Y)values

• rsismuchmoreresistant/robustthanrtoextreme/outlierdatapoints

• However,interpretationsofrandrsdiffer:

- risanindexofthestrength/directionoflinearrelation

- rsisanindexofthestrength/directionofmonotonicrelation

rsmeasuresmonotonicityofX,Yassociation

-3 -2 -1 0 1 2 3

-20

020

4060

80

X

Y

r = 0.58

rs = 1

rmeasureslinearassociationbetweenX,YranksrsmeasuresmonotonicassociationbetweenX,Yscores

non-linearbutmonotonic

-4 -2 0 2

0.0

0.5

1.0

1.5

2.0

2.5

X

Y

-5 -3 -1 1 3

r = -0.69

rs = -0.55

non-linear&non-monotonic

rsissensitivetoextremescoresaffectingmonotonicity

2.5 3.0 3.5 4.0 4.5 5.0

3.5

4.0

4.5

5.0

5.5

6.0

6.5

High Leverage Point (Figure 9.3)

Tobacco Expenditure

Alc

ohol

Exp

endi

ture

r = 0.22

rs = 0.37

Northern Ireland

r = 0.78

rs = 0.83

2.5 3.0 3.5 4.0 4.5 5.0

3.5

4.0

4.5

5.0

5.5

6.0

6.5

Reduction of Monotonicity

Tobacco Expenditure

Alc

ohol

Exp

endi

ture

Northern Ireland

IncludingNorthernIrelanddatapointsignificantlyaltersmonotonicityoftheX,Yassociation

IncludingNorthernIrelanddatapointlowersbothr&rs

OtherTypesofCorrelations

• Twocontinuousvariables- PearsonProduct-MomentCorrelationCoefficient(r)• Tworanked/ordinalvariables- Spearman’scorrelationcoefficientforrankeddata(rho(ρ)orrs)• Onedichotomous&onecontinuousvariable- e.g.,correct/incorrectansweronmcquestionandexamtotalscore

- Point-biserialCorrelation(rpb):CalculatePearson’srbutcallitrpb• Twodichotomousvariables- e.g.age(teens/seniors)andcoinpurseownership(yes/no)- CalculatePearson’srbutcalledinrφ(phi)

Point-biserialCorrelationrpbexample:analyzinganitemonamultiple-choicetest

4050

6070

8090

100

item response

exam

sco

re

0 (incorrect) 1 (correct)

rpb = 0.46• Xvariable:

- answerforonemultiple-choicequestion

- “incorrect”(0)or“correct”(1)

• Yvariable:totalscoreonremainingquestions

• CorrelateX&Y:rpb=0.46

• re-code“incorrect”&“correct”responseswithothernumbers(e.g.,-10&10):

- magnituderpbisnotchangedbycodingschemeforXvariable

- (signofrpbcanchange)

OtherTypesofCorrelations• Twocontinuousvariables- PearsonProduct-MomentCorrelationCoefficient(r)

• Tworanked/ordinalvariables- Spearman’scorrelationcoefficientforrankeddata(rho(ρ)orrs)

• Onedichotomous&onecontinuousvariable

- e.g.,correct/incorrectansweronmcquestionandexamtotalscore

- Point-biserialCorrelation(rpb):CalculatePearson’srbutcallitrpb• Twodichotomousvariables

- e.g.,gender(male/female)&pass/failonexam- e.g.,ADD(yes/no)&NeedsRemedialEnglish?(yes/no)- codeeachbinaryvaluewith2numbers(e.g.,0&1,-1&1,etc.)

- CalculatePearson’sronnumbers:resultiscalledrφ(phi)

rφ(phi)measuresassociationin2x2contingencytables

NeedsRemedialEnglishCourse

No Yes

ADDdiagnosis

No 187 22

Yes 74 19

• studyof302schoolchildren

- eachassessedforADD&needforremedialEnglishclasses

- eachchildprovides2binarymeasures

• rφmeasuresassociationbetweenbinaryvariables

- codeeachvariableas0(NO)or1(Yes)

- calculatePearsonronzerosandones

• rφ=0.133

Howdoweknowifacorrelationis“real”?

• restimatesthepopulationcorrelation

• rvariesacrosssamples

- atruepoprof0.4canyieldasampler<=0

- andatruepoprof0.0canyieldasamplerthatisnotzero

Sample r (n = 20; true-r = 0.4)

Sample r

Frequency

-0.5 0.0 0.5 1.0

0100

200

300

400

500 95%ofvalueslie

between-0.04&0.68

Howdoweknowifacorrelationis“real”?

• andatruepoprof0.0canyieldanon-zerosampler…

• howcanwedecideifourobservedsamplecorrelationoccurredjustbychance(andtruer=0)?

• onestrategy:assumetruer=0andcalculatetheprobabilityofgettingoursampler(orsomethingevenbigger)justbychance

95%ofvaluesliebetween-0.44&0.38

Sample r (n = 20; true-r = 0)

Sample r

Frequency

-1.0 -0.5 0.0 0.5 1.0

0200

400

600

800

2.5%ofvalues>0.38

2.5%ofvalues<-0.44

NullHypothesisTestingLogic• NullHypothesis:truecorrelationiszeroH0:r=0• AlternativeHypothesis:truecorrelationisnon-zeroH1:r≠0• AssumeH0istrue:

- whatistheprobabilityofobtaininganrthatisatleastasbigastheonewefoundinoursample?

- ifthatprobabilityisverylow(e.g.,lessthan5%),thenourrisunusual(assumingH0istrue)

‣ thereforewemayrejectH0infavourofH1(i.e.,thetruer≠0)

- ifthatprobabilityisnotlow,thenourrisnotunusual(assumingH0istrue)

‣ ourobservedcorrelationisNOTunlikelywhentruer=0

‣ wedonothavesufficientevidencetorejectH0

‣ donot“accept”H0;simplyfailtorejectit

NullHypothesisTesting(Example)

Question:arethesecorrelationslargerthanwewouldexpectwhentruepopulationr=0?

Trytoanswerquestionbyestimatingprobabilityofgettingrthislargewhentruecorrelationiszero

105 110 115 120

6080

100

120

140

160

Data Set 1

x

y

r = 0.82

105 110 115 120

6080

100

120

140

160

Data Set 1

x

y

r = 0.63

DataSet1 DataSet2

NullHypothesisTesting(Example)• RandomlyscrambleorderofX,Yvalues

- soeachXpairedRANDOMLYwithaY

- ScramblingmeansthereisnoX,Yassociation




• Calculaterforourscrambledsample

- expectittobezeroonaverage- butitwillvary…notalwaysexactlyzero




• Calculaterforourscrambledsample

- expectittobezeroonaverage- butitwillvary…notalwaysexactlyzero• Repeatthisprocessmanytimes,recordr’sforallscrambledsamples

• Becauseofrandompairing,largevaluesof±roccuronlybychance

NullHypothesisTesting(Example)• RandomlyscrambleorderofX,Yvalues- soeachXpairedRANDOMLYwithaY- ScramblingmeansthereisnoX,Yassociation• Calculaterforourscrambledsample

- expectittobezeroonaverage- butitwillvary…notalwaysexactlyzero• Repeatthisprocessmanytimes,recordr’sforallscrambledsamples

• Becauseofrandompairing,largevaluesof±roccuronlybychance

• Estimateprobabilityofgettingrthatisatleastasextremeasoursampler- observedrfordataset1:r=0.82- observedrfordataset2r=0.63

NullHypothesisTesting(Example)

Observedr=0.82

Data Set 1 Scrambled

Scrambled r

Frequency

-1.0 -0.5 0.0 0.5 1.0

050

100

150

Data Set 2 Scrambled

Scrambled r

Frequency

-1.0 -0.5 0.0 0.5 1.00

50100

150

Observedr=0.63

Reddashedlinesindicateboundarieswhichcontain95%ofscrambledr’s

Observedr’sfalloutside95%boundaries,sotheyareunusualifweassumethatthenullhypothesis(popr=0)istrue

valuesoutsideboundariesarerare/unusual

Histogramsshowdistributionsofr’sforscrambled(X,Y)data

NullHypothesisTesting(Example)• OurexampleusedaPERMUTATIONtest

• Otherkindsoftestscanbeusedtoevaluatenullhypothesis(truepopr=0)• Butthelogicissimilar:

- assumeNULLHYPOTHESISISTRUE(i.e.,truepopriszero)

- calculateprobabilityofgettingasamplerthatisatleastasextremeasyours

- iftheprobabilityissmall,thenobservedrisunusualWHENTHENULLHYPOTHESISISTRUE

- andthereforeyoumayrejectH0infavourofalternative(i.e.,truepopr≠0)

- wesaythecorrelationwasstatisticallysignificant(r=0.82,p<0.05)or(r=0.63,p<0.05)• N.B.Youcanmakemistakes!

- observedrmayjustbeunusual(i.e.,truepoprmayreallybezero)

- wecanestimatetheprobabilityofmakingthesemistakes

- inourexampletheprobabilityofthiserror,referredtoasalpha,was.05or5%.Why?

• Ouralpha=5%becausewedefinedan“unusual”rasbeingoutsidethe95%boundaries

Factorsthataffectcorrelation

• Nonlinearity• Extremeobservations

• Rangerestrictions• Heterogeneoussubsamples

RestrictedrangeofXvariable

RestrictedrangeofXvariablecanobscurealinearassociationbetweenX&Y.

80 90 100 110 120

8090

100

110

120

X

Yr = 0.28

80 90 100 110 120

8090

100

110

120

X

Y

r = 0.45

RestrictedRange(Example)

Restrictedrangecanobscureacurvilinearassociation.

Pediatrics in Review Vol. 19 No. 10 October 1998 357

at a normal rate following this typeof surgery, parallel to normal growthcurves. Children also develop a sig-nificantly improved appearance ofthe face and skull.

Children who have complex cran-iosynostosis should be treated by a craniofacial team that includesplastic surgery, neurosurgery, oph-thalmology, oral surgery, otorhino-laryngology, and pediatric anesthesiaand critical care. The team alsoshould include individuals familiarwith the developmental, nutritional,and airway-related aspects of theseconditions, such as a pediatrician,

as well as someone who can helpthe family with the psychologicalissues and provide support andaccess to resources, including financial counseling.

POTENTIAL COMPLICATIONSSurgical complications are illustratedin Table 2. Although complicationsfrom most of these procedures havebeen exceedingly rare, the potentialfor eye or brain injury always is present. A leak of cerebrospinal fluid may occur, as might meningo-cele in which the meninges protrudethrough one of the openings created

in the skull. Syndrome of inappro-priate antidiuretic hormone secretionand panhypopituitarism also mayoccur following this surgery.

ANCILLARY EVALUATIONSTable 3 lists activities crucial to the evaluation and treatment of achild who has severe craniofacialabnormalities, such as Apert orCrouzon syndrome. Because manyof these diagnoses now are beingmade in the prenatal period throughthe use of routine ultrasonography,facilities for prenatal genetic coun-seling and anticipatory guidanceshould be provided. Genetic analysisthrough amniocentesis or chorionicvilli sampling may be indicated.During the neonatal period, the childshould be evaluated thoroughly toidentify other conditions, with spe-cial attention to the areas indicatedin the table.

Between 2 and 9 months of age,the child’s skull and face should beevaluated by the craniofacial team.The cervical spine may need to beevaluated, as in children who haveApert syndrome, and may exhibitpartial fusion. Referral to an ortho-dontist or oral surgeon should bemade at this time. The primarysurgery usually is performed during

FIGURE 7. Illustration of lateral views of the skull before and after surgical remod-eling in a child who has Crouzon syndrome. The dotted line indicates the areas inwhich bone will be cut and reshaped. The bone is reassembled as shown in the sec-ond illustration to remodel the skull. The osteotomized segments of the cranium andupper orbit are held in their corrected position with stainless steel wires or smallplates and screws.

FIGURE 8. Increase in weight of brain by age. Reprinted with permission fromCopoletta JM, Wolbach JB. Body length and brain weight of infants and children.Am J Pathol. 1933;9:55–70.

TABLE 2. Complications of Surgery to Correct

Craniosynostosis

• Blood loss• Subdural hematoma• Subgaleal hemorrhage• Cerebrospinal fluid leak• Meningocele• Infection• Intravascular air• Pressure necrosis of skin• Retinal damage• Corneal drying• Strabismus from damage to

trochlea or canthal tendons• Syndrome of inappropriate

antidiuretic secretion• Panhypopituitarism• Risks from anesthesia

by guest on September 21, 2017http://pedsinreview.aappublications.org/Downloaded from

0 2 4 6Age(Years)

8 10 12

BrainWeight(gram

s)

WeightofBrainbyAge

r = 0.92

r = -0.95

Physiological Arousal

Res

pons

e Ti

me

(LowtoHigh)

(Low

toHigh)

HeterogeneousSubsamples

• Simpson(1951):

- astatisticalassociationobservedinapopulationcanbeattenuatedandevenreversedwithinsubgroupsthatmakeupthatpopulation

• Consequently,correlationscalculatedonpopulationsconsistingofheterogeneoussubgroupsmaybemisleading

Simpson'sparadoxinpsychologicalscience:apracticalguideKievit,Frakenhuis,Waldorp,&Borsboom,Front.Psychol.,12August2013

Simpson’sParadox

CorrelationforpopulationispositiveCorrelationswithinsub-groupsarenegative

Simpson'sparadoxinpsychologicalscience:apracticalguideKievit,Frakenhuis,Waldorp,&Borsboom,Front.Psychol.,12August2013

Simpson’sParadox

Overallcorrelationisapproximatelyzero Strong,oppositecorrelationsin2sub-groupsoftennisplayers

Correlation≠Causation

• readingskilliscorrelatedwithshoesize• overlast2centuries,priceofbreadinGreatBritainiscorrelatedwithsealevelinVenice

• rpb:BMWownershavehigherincomesthanFordowners

• numberofpiratesiscorrelatedwithglobaltemperature

• …evenstrongcorrelationsdonotmeanXcausesY

https://xkcd.com/552/

PiratesDoNotCauseGlobalWarming

Global Average Temperature vs. Number of Pirates

Glo

bal A

vera

ge T

empe

ratu

re, °

C

Number of Pirates (Approximate)35000 45000 20000 15000 5000 400 17

16.5

16.0

15.5

15.0

14.5

14.0

13.5

13.0

2000

1980

19401920

18801860

1820

Source:Wikipedia

0 10000 20000 30000 40000 50000

14.0

14.5

15.0

15.5

16.0

Number of Pirates (Approximate)

Avg

Glo

bal T

empe

ratu

re (C

)

1820 1860

1880

1920

1940

1980

2000 r = -0.93


500 550 600 650 700

9.0

9.5

10.0

10.5

11.0

Engineering PhDs & Mozzarella Consumption (2000-09)

Civil Eng Doctorates Awarded

Moz

zare

lla C

hees

e C

onsu

mpt

ion

(lbs

per c

apita

)

r = 0.96

Correlations(summary)• Correlation:isameasureoftheassociationbetween2variables

• Pearsonr:indexofstrength&directionoflinearassociation- r=COVXY/(sXsY);variesbetween-1&+1

- measureshowmuchouruncertaintyaboutthevalueofYiisreducedbyknowingthevalueofXi- risanestimateofthepopulationcorrelationthatvariesacrosssamples

- a95%ConfidenceIntervalofr,whichvariesfromsampletosample,containsthetruepopulationparameter95%ofthetime

• Spearman’srs:indexofmonotonicityofassociation- equivalenttoPearson’srforrankeddata- generallymorerobustthanrtoextreme(i.e.,“high-leverage”)datapoints

• Correlationsaffectedby:nonlinearity,extremescores,rangerestrictions,heterogeneoussamples• CORRELATION≠CAUSATION• NullHypothesisTestingisamethodformakingdecisionsaboutvalueofpopulationr- pvaluesareprobabilityofgettingourdatawhennullhypothesisistrue

4 Correlation Week 3 - psychology.mcmaster.ca

Documents

Transcript of 4 Correlation Week 3 - psychology.mcmaster.ca