4 Correlation Week 3 - psychology.mcmaster.ca
Transcript of 4 Correlation Week 3 - psychology.mcmaster.ca
HUMBEHV3ST3CorrelationWeek3
Prof.PatrickBennett
Sofar:Exploredonevariableatatime
Istherearearelationshipbetweenthese2variables?
2018161412
4
3
2
1
0
Physicians
Frequency
7.55.02.50.0-2.5-5.0
4
3
2
1
0
Infant Mortality
Freq
uenc
y
Xi Yi
Visualrelationshipwithascatterplot
Togenerateascatterplot,needtheactualmatchedpairsofXi&Yivalues.
Baseball:TeamWins&TeamOPS40
5060
7080
90100
Team
Win
s
0.60
0.65
0.70
0.75
0.80
0.85
0.90
Team
OP
S
OPS:OnbasePlusSluggingameasureofteam’sabilitytogetonbaseandhitforpower
MLBdatafromSeptember20,2019
PerceivedOrientationvs.StimulusAngle
Δ orientation
probabilityofresponding“counter-clockwiseofvertical”plottedasafunctionofanglefromvertical
-5 0 5
0.0
0.2
0.4
0.6
0.8
1.0
Δ orientation
p(counter-clockwise)
sO = 30.00mu = -1.32sigma = 1.40nhiV30C100_1_tilt.mat
prob
ability(“coun
ter-clockw
ise”)
VisualAcuityvsAgeinHumanInfants
https://www.sciencedirect.com/topics/nursing-and-health-professions/eye-chart
Correlation&LinearRegression
• correlation&linearregressionarestatisticalmethodsthatareusedto
assesstheassociationbetweenvariables
• Correlation:measurethelinearassociationbetween2variables
• Regression:estimatesthe“best-fitting”linethatrelatesapredictor
variable,X,toacriterionvariable,Y.
- usedtounderstandhowYchangeswhenXvaries
ExamplesofCorrelationalResearch
• effectivenessoffluoridatedwaterinreducingcavities• associationbetweensmokingandlungcancer
• falseclaimoflinkbetweenautism&measles,mumps,&rubella(MMR)inoculations–proposedthatMMRcausesautism
• drinkingcoffeeassociatedwithlongerlife• claimthatorchestraconductorslivelonger
Correlation&LinearRegression
Regressioncomputesbest-fittingline.Correlationisameasureofhowwelldataarefitbytheline.
Regressionline:the“lineofbestfit”;representsbestpredictionofYiforagivenvalueofXi
(Xi,Yi)=(16,2)
Correlationmeasuresthestrength&directionoflinearassociation
Strong
Weak
Negative Positive
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
x
y
r = -0.8
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
x
y
r = 0.8
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
x
y
r = -0.3
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
x
y
r = 0.3
r=-0.8
r=-0.3
r=0.8
r=0.3
Correlationdependsonfit,notslope
105 110 115 120
6080
100
120
140
160
x
y
r = 0.82
105 110 115 120
6080
100
120
140
160
x
y
r = 0.63
PerfectLinearRelation(r=1orr=-1)
100 105 110 115 120
1020
3040
5060
70
perfect positive correlation
x
y
100 105 110 115 120
1020
3040
5060
70
perfect negative correlation
x
y
r=1.0 r=-1.0
Commonmeasureofcorrelation—Pearsonr—variesbetween-1&+1
Nolinearassociation(r=0)
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = 0
x
y
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = 0
x
y
Vertical&horizontallinesindicatemeansofxandy.
Correlations&deviationsaroundthemeans
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = 0.7
x
y
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = -0.7
x
y
Vertical&horizontallinesindicatemeansofxandy.Notethatpointsdonotfallwithineachquadrantequally.
Linearvs.curvilineartrends/relations
• Monotonictrend:asXincreases,Yincreasesordecreaseswithoutreversal
– mightnotbeastraightline!
• Non-monotonictrend:asXincreases,Ychangesdirectionatleastonce
• Linearrelationship:bestfitlineisstraight• Curvilinear:bestfit“line”isnotstraight(non-linear)• Correlationisonlysensitivetolinearrelations!
Non-monotonic
Linear
Monotonic
CurvilinearCurvilinear Curvilinear
Monotonic Monotonic
●
●
●
●
●
●
●
●
●
●
●
4 6 8 10 12 14
45
67
89
1011
xy
r = 0.82●
●
●●
●
●
●
●
●
●
●
4 6 8 10 12 14
34
56
78
9
x
y
r = 0.82
●
●
●
●
●
●
●
●
●
●
●
4 6 8 10 12 14
68
1012
x
y
r = 0.82
●
●
●
●
●
●
●
●
●
●
●
8 10 12 14 16 18
68
1012
xy
r = 0.82
4(x,y)datasetsplottedwithareference(best-fitting)line
Question:whichsethasthehighestcorrelation?
A B
C D
●
●
●
●
●
●
●
●
●
●
●
4 6 8 10 12 14
45
67
89
1011
x
y
r = 0.82●
●
●●
●
●
●
●
●
●
●
4 6 8 10 12 14
34
56
78
9
x
y
r = 0.82
●
●
●
●
●
●
●
●
●
●
●
4 6 8 10 12 14
68
1012
x
y
r = 0.82
●
●
●
●
●
●
●
●
●
●
●
8 10 12 14 16 18
68
1012
x
y
r = 0.82
Thecorrelationisthesameinall4sets!r=0.82
Correlationisameasureoflinearassociationandisinsensitivetonon-linearassociation.
Thecorrelationbetweentemperatureandmonth/yearisverysmalleventhoughthereisastrong(curvilinear)associationbetweenthetwovariables.
Correlationisameasureoflinearassociationbetweenvariables
r=0.05
Covariance• Measuresthedegreetowhich2variablesvarytogether
- whereNisthenumberofobservations.
• Covariancedependsonsumofproductsofdeviationscores
- Covarianceispositivewhendeviationscoreshavesamesign
- Covarianceisnegativewhendeviationscoreshaveoppositesigns
1-N
)Y-)(YX-(Xcov 1
ii
XY
∑==
N
i
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = 0.7
x
y
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = -0.7
xy
GreenpointsdriveCOVXYinpositivedirection.BluepointsdriveCOVXYinnegativedirection.
NetCOVXYdiffersfromzero.
1-N
)Y-)(YX-(Xcov 1
ii
XY
∑==
N
i
X
Y
X
Y
positivecovariancepositivecorrelation
negativecovariancenegativecorrelation
GreenpointsdriveCOVXYinpositivedirection.BluepointsdriveCOVXYinnegativedirection.
NotethatnetCOVXYiszero.
1-N
)Y-)(YX-(Xcov 1
ii
XY
∑==
N
i
X
Y
X
Y
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = 0
x
y
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
r = 0
x
y
netzerocovariancezerocorrelation
netzerocovariancezerocorrelation
Covarianceisameasureofassociation(butisinfluencedbyspreadofX&Y)
90 95 100 105 110
9095
100
105
110
X
Y
Covariance = 5, r = 0.5
40 60 80 100 120 140 160
4060
80100
120
140
160
X
Y
Covariance = 250, r = 0.5
Similar(X,Y)associationsbutverydifferentcovariancesduetodifferencesin(X,Y)spread.
Covarianceisameasureofassociation(butisinfluencedbyspreadofX&Y)
40 60 80 100 120 140 160
4060
80100
120
140
160
X
Y
Covariance = 250, r = 0.5
Samedataasinpreviousslide,butreplottedtohighlightdifferencein(X,Y)spread.
40 60 80 100 120 140 160
4060
80100
120
140
160
X
YCovariance = 5, r = 0.5
Correlationvs.Covariance
COV(X,Y)isaffectedbyvariancesofXandY.Sochangingtheunitsofmeasuresgenerallychangescovariance.
covariance=92.03 covariance=79.88
var(weight)=178.0var(height)=80.1
var(weight)=865.4var(height)=12.4
CorrelationCoefficient,r
•
- sXandsYarethestandarddeviationsofXandYrespectively.• risthePearsonProduct-MomentCorrelationCoefficient
• Notethatrhasno“units”
Standardizesthecovariancesowecancomparecorrelationsobtainedwithdatasetswithverydifferent(X,Y)spreads.
YX
XY
ssr cov=
Correlationvs.Covariance
40 60 80 100 120 140 160
4060
80100
120
140
160
X
Y
Covariance = 250, r = 0.5
Correlation,r,islessaffectedthanCOV(X,Y)byvariancesofXandY.
40 60 80 100 120 140 160
4060
80100
120
140
160
X
Y
Covariance = 5, r = 0.5
Correlationvs.Covariance
Correlation,r,islessaffectedthanCOV(X,Y)byvariancesofXandY.Sochangingunitschangescovariancebutnotr.
covariance=92.03r=0.77
covariance=79.88r=0.77
var(weight)=178.0var(height)=80.1
var(weight)=865.4var(height)=12.4
Pearsonr
• rrangesfrom-1to1
- r=+1whenXandYareperfectlypositivelycorrelated- r=-1whenXandYareperfectlynegativelycorrelated- r=0whenthereisnolinearrelationshipbetweenXandY• inPsychology,- Weakr:0.1-0.3
- Moderater:0.3-0.5
- Strongr:>0.5
Whatdoesrmean?
IQR=83IQR=62
Y X
50100
150
variable
score
Data:20pairsofX,Yscores Y X
s1 71.02058 100.201137
s2 62.31119 52.953707
s3 179.31734 163.077298
s4 110.02782 83.582028
s5 172.73203 142.985011
s6 128.27301 138.997126
s7 81.07347 58.119052
s8 65.82096 53.981494
s9 60.2459 65.250576
s10 123.27456 108.558325
s11 135.13302 152.882385
s12 68.73133 65.628767
s13 22.89659 7.949241
s14 182.69514 173.079349
s15 14.85593 17.623536
s16 77.4901 55.465981
s17 72.90529 85.641115
s18 104.4358 95.790047
s19 174.46641 178.942469
s20 92.29351 99.291358
Whatdoesrmean?
PlotofYscores
050
100
150
200
score
PlotofY-vs-Xscores
EachYispairedwith1X
0 50 100 150
050
100
150
200
X
score
r=0.95
knowingvalueofXtellsussomethingaboutvalueofY
0 50 100 150
050
100
150
200
X
score
residuals(errors)
relationbetweenY&Xisnotperfect(r<1)
0 50 100 150
050
100
150
200
X
score
…Whatdoesrmean?
residuals(errors)
residualsrepresentthepartofYscorethatisNOT“accountedfor”or“explainedby”or“associatedwith”X
0 50 100 150
050
100
150
200
X
score
VarianceofresidualsismuchlessthanvarianceoforiginalYscores
Y Residuals
-50
050
100
150
200
WhenX&Yarecorrelated,knowingvalueofonevariablereducesuncertaintyaboutvalueofothervariable.
r2=proportionofvarianceinYthatisaccountedforbyX
VAR(Y)=2500VAR(residuals)=243.75
r2 = .952 = 0.9025
1 −243.752500
= 0.9025
proportionofvarianceNOTexplainedbyX
proportionofvariancethatISexplainedbyX
Decomposingweightvarianceinto“explained”and“unexplained”parts
50 60 70 80 90
100
150
200
250
height (inches)
wei
ght (
poun
ds)
cov = 79.88r = 0.77
VARIANCE(weight)=865.4VARIANCE(predictedweights)=514.1
VARIANCE(errors)=351.3
weight predicted errors
-50
050
100
150
200
250
IQR(weight)=40.8IQR(predictedweights)=34.2
IQR(errors)=22.7
ht.inches wt.pounds wt.predicted wt.errors
1 71.65358 169.7557 172.8675 -3.11176992 63.38586 127.8680 119.6619 8.2060186
3 63.38586 116.8449 119.6619 -2.8170814
4 69.68508 149.9142 160.1995 -10.2853574
5 61.81106 130.0726 109.5275 20.5450325
6 66.92917 167.5511 142.4643 25.0867921
7 65.74807 167.5511 134.8635 32.6875876
8 73.22839 152.1188 183.0019 -30.8831239
9 70.07878 156.5280 162.7331 -6.2050959
10 67.32287 143.3003 144.9979 -1.6976264
11 68.89768 154.3234 155.1323 -0.8089204
13 63.38586 112.4356 119.6619 -7.2263214
14 66.14177 141.0957 137.3971 3.6985491
15 64.17326 114.6402 124.7291 -10.0888984
16 65.35437 143.3003 132.3299 10.9703661
17 73.62209 202.8250 185.5355 17.2895376
18 66.14177 136.6864 137.3971 -0.7106909
19 77.55910 167.5511 210.8715 -43.3203674
20 68.89768 134.4818 155.1323 -20.6505004
21 70.86618 262.3498 167.8003 94.5494671
predictionsfromheight
r2 = 0.772 = 0.59
proportionofvariance(weight)thatisNOTaccountedforbyheight
VARIANCE(errors)VARIANCE(weight)
=351.3865.4
= 0.41 = 1 − r2
proportionofvariance(weight)thatisaccountedforbyheight
VARIANCE(predicted weights)VARIANCE(weight)
=514.1865.4
= 0.59 = r2
r2isproportionofvariationinweightthatis“accountedfor”byvariationinheight
Decomposingweightvarianceinto“explained”and“unexplained”parts
VARIANCE(weight)=865.4VARIANCE(predictedweights)=0.3
VARIANCE(errors)=865.1
IQR(weight)=40.8IQR(predictedweights)=0.8
IQR(errors)=40.3
r2 = 0.0182 ≈ 0.0003
proportionofvariance(weight)thatisNOTaccountedforbyheight
VARIANCE(errors)VARIANCE(weight)
=865.1865.4
= 0.9997 = 1 − r2
proportionofvariance(weight)thatisaccountedforbyheight
VARIANCE(predicted weights)VARIANCE(weight)
=0.3
865.4= 0.0003 = r2
r2isproportionofvariationinweightthatis“accountedfor”byvariationinheight
weight predicted errors
-50
050
100
150
200
250
ht.inches wt.pounds wt.predicted wt.errors
1 68.11027 169.7557 144.0959 25.6598435
2 68.50397 127.8680 144.1548 -16.28679303 61.81106 116.8449 143.1542 -26.3093324
4 63.77956 149.9142 143.4485 6.4656850
5 65.35437 130.0726 143.6839 -13.6113210
6 67.32287 167.5511 143.9782 23.5729365
7 65.74807 167.5511 143.7428 23.8083625
8 68.89768 152.1188 144.2136 7.9051705
9 63.77956 156.5280 143.4485 13.0795450
10 70.86618 143.3003 144.5079 -1.207592011 67.71657 154.3234 144.0370 10.2863600
13 63.77956 112.4356 143.4485 -31.0128550
14 68.11027 141.0957 144.0959 -3.0002165
15 68.89768 114.6402 144.2136 -29.5733695
16 66.53547 143.3003 143.8605 -0.5601705
17 72.44098 202.8250 144.7433 58.0817220
18 58.26775 136.6864 142.6245 -5.9380439
19 72.44098 167.5511 144.7433 22.8078020
20 64.96067 134.4818 143.6250 -9.143224521 65.74807 262.3498 143.7428 118.6070225
predictionsfromheight
KnowingXreducesuncertaintyaboutY
MeanY±2SD(Y)
PredictedY±2SD(residuals)
Whatdoesrmean?risrelatedtohowmuchuncertaintyaboutYisreducedbyknowingX
X
Y
X
Y
r=0
r=0
Thislookslikeaperfectcorrelation.Whyisr=0?
Y
Xr=-1
Y
Xr=+1
X
Y
r=-0.6
Y
Xr=+0.3
rvariesacrosssamples
Sample r (n = 20; true-r = 0.4)
Sample r
Frequency
-0.5 0.0 0.5 1.0
0100
200
300
400
500 populationr=0.4
mean(sampler)=0.39range(sampler)=[-0.45,0.88]
Eachsamplerisanestimateofthepopulationr.Someestimatesaregood,othersarenotsogood.Canwequantifytheuncertaintyofourestimate?
calculaterformanysamplesofdata(n=20persample)
ConfidenceInterval
Aconfidenceintervalisarangeofvalues,calculatedfromthesampleobservations,thatarebelieved,withaparticularprobability,tocontainthetruepopulationparameter.A95%confidenceinterval,forexample,impliesthatweretheestimationprocessrepeatedagainandagain,then95%ofthecalculatedintervalswouldbeexpectedtocontainthetrueparametervalue.
–B.S.Everitt,DictionaryofStatistics
ConfidenceInterval• 95%ConfidenceInterval- anintervalestimateofthevalueofapopulationparameter(e.g.,r)
- calculatedfromdatainyoursample
- theintervalvariesacrosssamples
- inthelongrun,theintervalcontainsthetruepopulationvalue95%ofthetime
ConfidenceInterval• 95%ConfidenceInterval- anintervalestimateofthevalueofapopulationparameter(e.g.,r)
- calculatedfromdatainyoursample
- theintervalvariesacrosssamples
- inthelongrun,theintervalcontainsthetruepopulationvalue95%ofthetime
• inoursimulationwithtruepopulationr=0.4:
- weobtainedone(X,Y)datasetwithacorrelationr=0.37- forthatsample,ourstatisticalsoftwarecalculatesCI95%=[-0.08,0.70]
‣weestimatethatthetruevalueofrisbetween-0.08&0.70
‣ notethatourCIdoescontainthetruepopulationr=0.4
ConfidenceInterval• 95%ConfidenceInterval
- anintervalestimateofthevalueofapopulationparameter(e.g.,r)
- calculatedfromdatainyoursample
- theintervalvariesacrosssamples
- inthelongrun,theintervalcontainsthetruepopulationvalue95%ofthetime
• inoursimulationwithtruepopulationr=0.4:
- weobtainedone(X,Y)datasetwithacorrelationr=0.37- forthatsample,ourstatisticalsoftwarecalculatesCI95%=[-0.08,0.70]
‣ weestimatethatthetruevalueofrisbetween-0.08&0.70
‣ notethatourCIdoescontainthetruepopulationr=0.4• Notethatwecanadjustthewidthofourinterval
- 99%CI:containstruepopulationvalue99%ofthetime
- 90%CI:containstruepopulationvalue90%ofthetime
‣ Question:Forourdataset,whichCIwouldbewidest:99%,95%,or90%?‣ Hint:Whatdoyouthinkisthe100%CI?
ConfidenceIntervalExamples“Marginoferror”inpolls
http://www.pewresearch.org/fact-tank/2016/09/08/understanding-the-margin-of-error-in-election-polls/
ConfidenceIntervalExamples
https://www.texasgateway.org/resource/807-confidence-intervals-real-world
(90%ConfidenceInterval)
OtherTypesofCorrelations
• Twocontinuousvariables- PearsonProduct-MomentCorrelationCoefficient(r)• Tworanked/ordinalvariables- Spearman’scorrelationcoefficientforrankeddata(rho(ρ)orrs)
• Onedichotomous&onecontinuousvariable- e.g.,correct/incorrectansweronmcquestionandexamtotalscore- Point-biserialCorrelation(rpb):CalculatePearson’srbutcallitrpb• Twodichotomousvariables- e.g.age(teens/seniors)andcoinpurseownership(yes/no)- CalculatePearson’srbutcalledinrφ
Correlationswithrankeddata• Spearman’scorrelationcoefficient,rs
- usefulwhenobservationshavebeenreplacedbytheirnumericalranks
‣ e.g.,changingapplicants’examscorestoranks
englishmathenglish.rankmath.rank56662.0775708.0945401.0171607.0461654.5664566.0258593.03807710.01076679.0861634.55
Spearman’srankordercorrelation
40 50 60 70 80
4050
6070
80
English
Math
r = 0.8
English Rank
Mat
h R
ank
rs = 0.67
1 3 5 7 9
13
57
9
SameasPearsonrcorrelationforranks,notactual(x,y)values.Whyusersifyouhavetheoriginalvalues?
rsismorerobusttosometypesofextremepoints
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
X
Y
r = -0.05
rs = -0.05
-6 -4 -2 0 2 4 6
-6-4
-20
24
6
X
Y
r = 0.38
rs = -0.01
Changingonly2outof100pointsmarkedlyaffectsrbutnotrs
rsismorerobusttosometypesofextremepoints
-3 -2 -1 0 1 2 3
-3-2
-10
12
3
X
Yr = -0.05
rs = -0.05
-6 -4 -2 0 2 4 6
-6-4
-20
24
6
X
Y
r = 0.38
rs = -0.01
Changingonly2outof100pointsmarkedlyaffectsrbutnotrs
…butnottoalltypesofextremescores
correlationswith&withoutNorthernIrelanddatapoint
2.5 3.0 3.5 4.0 4.5 5.0
3.5
4.0
4.5
5.0
5.5
6.0
6.5
High Leverage Point (Figure 9.3)
Tobacco Expenditure
Alc
ohol
Exp
endi
ture
r = 0.22
rs = 0.37
Northern Ireland
r = 0.78
rs = 0.83Inthiscase,includingextremepointaltersrs,too.Why?
Correlationswithrankeddata
• Spearman’scorrelationcoefficient,rs,calculatedwiththesameformulathatisusedtocalculatePearson’sr
- formulaappliedtoranks,notactual(X,Y)values
• rsismuchmoreresistant/robustthanrtoextreme/outlierdatapoints
• However,interpretationsofrandrsdiffer:
- risanindexofthestrength/directionoflinearrelation
- rsisanindexofthestrength/directionofmonotonicrelation
rsmeasuresmonotonicityofX,Yassociation
-3 -2 -1 0 1 2 3
-20
020
4060
80
X
Y
r = 0.58
rs = 1
rmeasureslinearassociationbetweenX,YranksrsmeasuresmonotonicassociationbetweenX,Yscores
non-linearbutmonotonic
-4 -2 0 2
0.0
0.5
1.0
1.5
2.0
2.5
X
Y
-5 -3 -1 1 3
r = -0.69
rs = -0.55
non-linear&non-monotonic
rsissensitivetoextremescoresaffectingmonotonicity
2.5 3.0 3.5 4.0 4.5 5.0
3.5
4.0
4.5
5.0
5.5
6.0
6.5
High Leverage Point (Figure 9.3)
Tobacco Expenditure
Alc
ohol
Exp
endi
ture
r = 0.22
rs = 0.37
Northern Ireland
r = 0.78
rs = 0.83
2.5 3.0 3.5 4.0 4.5 5.0
3.5
4.0
4.5
5.0
5.5
6.0
6.5
Reduction of Monotonicity
Tobacco Expenditure
Alc
ohol
Exp
endi
ture
Northern Ireland
IncludingNorthernIrelanddatapointsignificantlyaltersmonotonicityoftheX,Yassociation
IncludingNorthernIrelanddatapointlowersbothr&rs
OtherTypesofCorrelations
• Twocontinuousvariables- PearsonProduct-MomentCorrelationCoefficient(r)• Tworanked/ordinalvariables- Spearman’scorrelationcoefficientforrankeddata(rho(ρ)orrs)• Onedichotomous&onecontinuousvariable- e.g.,correct/incorrectansweronmcquestionandexamtotalscore
- Point-biserialCorrelation(rpb):CalculatePearson’srbutcallitrpb• Twodichotomousvariables- e.g.age(teens/seniors)andcoinpurseownership(yes/no)- CalculatePearson’srbutcalledinrφ(phi)
Point-biserialCorrelationrpbexample:analyzinganitemonamultiple-choicetest
4050
6070
8090
100
item response
exam
sco
re
0 (incorrect) 1 (correct)
rpb = 0.46• Xvariable:
- answerforonemultiple-choicequestion
- “incorrect”(0)or“correct”(1)
• Yvariable:totalscoreonremainingquestions
• CorrelateX&Y:rpb=0.46
• re-code“incorrect”&“correct”responseswithothernumbers(e.g.,-10&10):
- magnituderpbisnotchangedbycodingschemeforXvariable
- (signofrpbcanchange)
OtherTypesofCorrelations• Twocontinuousvariables- PearsonProduct-MomentCorrelationCoefficient(r)
• Tworanked/ordinalvariables- Spearman’scorrelationcoefficientforrankeddata(rho(ρ)orrs)
• Onedichotomous&onecontinuousvariable
- e.g.,correct/incorrectansweronmcquestionandexamtotalscore
- Point-biserialCorrelation(rpb):CalculatePearson’srbutcallitrpb• Twodichotomousvariables
- e.g.,gender(male/female)&pass/failonexam- e.g.,ADD(yes/no)&NeedsRemedialEnglish?(yes/no)- codeeachbinaryvaluewith2numbers(e.g.,0&1,-1&1,etc.)
- CalculatePearson’sronnumbers:resultiscalledrφ(phi)
rφ(phi)measuresassociationin2x2contingencytables
NeedsRemedialEnglishCourse
No Yes
ADDdiagnosis
No 187 22
Yes 74 19
• studyof302schoolchildren
- eachassessedforADD&needforremedialEnglishclasses
- eachchildprovides2binarymeasures
• rφmeasuresassociationbetweenbinaryvariables
- codeeachvariableas0(NO)or1(Yes)
- calculatePearsonronzerosandones
• rφ=0.133
Howdoweknowifacorrelationis“real”?
• restimatesthepopulationcorrelation
• rvariesacrosssamples
- atruepoprof0.4canyieldasampler<=0
- andatruepoprof0.0canyieldasamplerthatisnotzero
Sample r (n = 20; true-r = 0.4)
Sample r
Frequency
-0.5 0.0 0.5 1.0
0100
200
300
400
500 95%ofvalueslie
between-0.04&0.68
Howdoweknowifacorrelationis“real”?
• andatruepoprof0.0canyieldanon-zerosampler…
• howcanwedecideifourobservedsamplecorrelationoccurredjustbychance(andtruer=0)?
• onestrategy:assumetruer=0andcalculatetheprobabilityofgettingoursampler(orsomethingevenbigger)justbychance
95%ofvaluesliebetween-0.44&0.38
Sample r (n = 20; true-r = 0)
Sample r
Frequency
-1.0 -0.5 0.0 0.5 1.0
0200
400
600
800
2.5%ofvalues>0.38
2.5%ofvalues<-0.44
NullHypothesisTestingLogic• NullHypothesis:truecorrelationiszeroH0:r=0• AlternativeHypothesis:truecorrelationisnon-zeroH1:r≠0• AssumeH0istrue:
- whatistheprobabilityofobtaininganrthatisatleastasbigastheonewefoundinoursample?
- ifthatprobabilityisverylow(e.g.,lessthan5%),thenourrisunusual(assumingH0istrue)
‣ thereforewemayrejectH0infavourofH1(i.e.,thetruer≠0)
- ifthatprobabilityisnotlow,thenourrisnotunusual(assumingH0istrue)
‣ ourobservedcorrelationisNOTunlikelywhentruer=0
‣ wedonothavesufficientevidencetorejectH0
‣ donot“accept”H0;simplyfailtorejectit
NullHypothesisTesting(Example)
Question:arethesecorrelationslargerthanwewouldexpectwhentruepopulationr=0?
Trytoanswerquestionbyestimatingprobabilityofgettingrthislargewhentruecorrelationiszero
105 110 115 120
6080
100
120
140
160
Data Set 1
x
y
r = 0.82
105 110 115 120
6080
100
120
140
160
Data Set 1
x
y
r = 0.63
DataSet1 DataSet2
NullHypothesisTesting(Example)• RandomlyscrambleorderofX,Yvalues
- soeachXpairedRANDOMLYwithaY
- ScramblingmeansthereisnoX,Yassociation
NullHypothesisTesting(Example)• RandomlyscrambleorderofX,Yvalues
- soeachXpairedRANDOMLYwithaY
- ScramblingmeansthereisnoX,Yassociation
• Calculaterforourscrambledsample
- expectittobezeroonaverage- butitwillvary…notalwaysexactlyzero
NullHypothesisTesting(Example)• RandomlyscrambleorderofX,Yvalues
- soeachXpairedRANDOMLYwithaY
- ScramblingmeansthereisnoX,Yassociation
• Calculaterforourscrambledsample
- expectittobezeroonaverage- butitwillvary…notalwaysexactlyzero• Repeatthisprocessmanytimes,recordr’sforallscrambledsamples
• Becauseofrandompairing,largevaluesof±roccuronlybychance
NullHypothesisTesting(Example)• RandomlyscrambleorderofX,Yvalues- soeachXpairedRANDOMLYwithaY- ScramblingmeansthereisnoX,Yassociation• Calculaterforourscrambledsample
- expectittobezeroonaverage- butitwillvary…notalwaysexactlyzero• Repeatthisprocessmanytimes,recordr’sforallscrambledsamples
• Becauseofrandompairing,largevaluesof±roccuronlybychance
• Estimateprobabilityofgettingrthatisatleastasextremeasoursampler- observedrfordataset1:r=0.82- observedrfordataset2r=0.63
NullHypothesisTesting(Example)
Observedr=0.82
Data Set 1 Scrambled
Scrambled r
Frequency
-1.0 -0.5 0.0 0.5 1.0
050
100
150
Data Set 2 Scrambled
Scrambled r
Frequency
-1.0 -0.5 0.0 0.5 1.00
50100
150
Observedr=0.63
Reddashedlinesindicateboundarieswhichcontain95%ofscrambledr’s
Observedr’sfalloutside95%boundaries,sotheyareunusualifweassumethatthenullhypothesis(popr=0)istrue
valuesoutsideboundariesarerare/unusual
Histogramsshowdistributionsofr’sforscrambled(X,Y)data
NullHypothesisTesting(Example)• OurexampleusedaPERMUTATIONtest
• Otherkindsoftestscanbeusedtoevaluatenullhypothesis(truepopr=0)• Butthelogicissimilar:
- assumeNULLHYPOTHESISISTRUE(i.e.,truepopriszero)
- calculateprobabilityofgettingasamplerthatisatleastasextremeasyours
- iftheprobabilityissmall,thenobservedrisunusualWHENTHENULLHYPOTHESISISTRUE
- andthereforeyoumayrejectH0infavourofalternative(i.e.,truepopr≠0)
- wesaythecorrelationwasstatisticallysignificant(r=0.82,p<0.05)or(r=0.63,p<0.05)• N.B.Youcanmakemistakes!
- observedrmayjustbeunusual(i.e.,truepoprmayreallybezero)
- wecanestimatetheprobabilityofmakingthesemistakes
- inourexampletheprobabilityofthiserror,referredtoasalpha,was.05or5%.Why?
• Ouralpha=5%becausewedefinedan“unusual”rasbeingoutsidethe95%boundaries
Factorsthataffectcorrelation
• Nonlinearity• Extremeobservations
• Rangerestrictions• Heterogeneoussubsamples
RestrictedrangeofXvariable
RestrictedrangeofXvariablecanobscurealinearassociationbetweenX&Y.
80 90 100 110 120
8090
100
110
120
X
Yr = 0.28
80 90 100 110 120
8090
100
110
120
X
Y
r = 0.45
RestrictedRange(Example)
Restrictedrangecanobscureacurvilinearassociation.
Pediatrics in Review Vol. 19 No. 10 October 1998 357
at a normal rate following this typeof surgery, parallel to normal growthcurves. Children also develop a sig-nificantly improved appearance ofthe face and skull.
Children who have complex cran-iosynostosis should be treated by a craniofacial team that includesplastic surgery, neurosurgery, oph-thalmology, oral surgery, otorhino-laryngology, and pediatric anesthesiaand critical care. The team alsoshould include individuals familiarwith the developmental, nutritional,and airway-related aspects of theseconditions, such as a pediatrician,
as well as someone who can helpthe family with the psychologicalissues and provide support andaccess to resources, including financial counseling.
POTENTIAL COMPLICATIONSSurgical complications are illustratedin Table 2. Although complicationsfrom most of these procedures havebeen exceedingly rare, the potentialfor eye or brain injury always is present. A leak of cerebrospinal fluid may occur, as might meningo-cele in which the meninges protrudethrough one of the openings created
in the skull. Syndrome of inappro-priate antidiuretic hormone secretionand panhypopituitarism also mayoccur following this surgery.
ANCILLARY EVALUATIONSTable 3 lists activities crucial to the evaluation and treatment of achild who has severe craniofacialabnormalities, such as Apert orCrouzon syndrome. Because manyof these diagnoses now are beingmade in the prenatal period throughthe use of routine ultrasonography,facilities for prenatal genetic coun-seling and anticipatory guidanceshould be provided. Genetic analysisthrough amniocentesis or chorionicvilli sampling may be indicated.During the neonatal period, the childshould be evaluated thoroughly toidentify other conditions, with spe-cial attention to the areas indicatedin the table.
Between 2 and 9 months of age,the child’s skull and face should beevaluated by the craniofacial team.The cervical spine may need to beevaluated, as in children who haveApert syndrome, and may exhibitpartial fusion. Referral to an ortho-dontist or oral surgeon should bemade at this time. The primarysurgery usually is performed during
FIGURE 7. Illustration of lateral views of the skull before and after surgical remod-eling in a child who has Crouzon syndrome. The dotted line indicates the areas inwhich bone will be cut and reshaped. The bone is reassembled as shown in the sec-ond illustration to remodel the skull. The osteotomized segments of the cranium andupper orbit are held in their corrected position with stainless steel wires or smallplates and screws.
FIGURE 8. Increase in weight of brain by age. Reprinted with permission fromCopoletta JM, Wolbach JB. Body length and brain weight of infants and children.Am J Pathol. 1933;9:55–70.
TABLE 2. Complications of Surgery to Correct
Craniosynostosis
• Blood loss• Subdural hematoma• Subgaleal hemorrhage• Cerebrospinal fluid leak• Meningocele• Infection• Intravascular air• Pressure necrosis of skin• Retinal damage• Corneal drying• Strabismus from damage to
trochlea or canthal tendons• Syndrome of inappropriate
antidiuretic secretion• Panhypopituitarism• Risks from anesthesia
by guest on September 21, 2017http://pedsinreview.aappublications.org/Downloaded from
0 2 4 6Age(Years)
8 10 12
BrainWeight(gram
s)
WeightofBrainbyAge
r = 0.92
r = -0.95
Physiological Arousal
Res
pons
e Ti
me
(LowtoHigh)
(Low
toHigh)
HeterogeneousSubsamples
• Simpson(1951):
- astatisticalassociationobservedinapopulationcanbeattenuatedandevenreversedwithinsubgroupsthatmakeupthatpopulation
• Consequently,correlationscalculatedonpopulationsconsistingofheterogeneoussubgroupsmaybemisleading
Simpson'sparadoxinpsychologicalscience:apracticalguideKievit,Frakenhuis,Waldorp,&Borsboom,Front.Psychol.,12August2013
Simpson’sParadox
CorrelationforpopulationispositiveCorrelationswithinsub-groupsarenegative
Simpson'sparadoxinpsychologicalscience:apracticalguideKievit,Frakenhuis,Waldorp,&Borsboom,Front.Psychol.,12August2013
Simpson’sParadox
Overallcorrelationisapproximatelyzero Strong,oppositecorrelationsin2sub-groupsoftennisplayers
Correlation≠Causation
• readingskilliscorrelatedwithshoesize• overlast2centuries,priceofbreadinGreatBritainiscorrelatedwithsealevelinVenice
• rpb:BMWownershavehigherincomesthanFordowners
• numberofpiratesiscorrelatedwithglobaltemperature
• …evenstrongcorrelationsdonotmeanXcausesY
https://xkcd.com/552/
PiratesDoNotCauseGlobalWarming
Global Average Temperature vs. Number of Pirates
Glo
bal A
vera
ge T
empe
ratu
re, °
C
Number of Pirates (Approximate)35000 45000 20000 15000 5000 400 17
16.5
16.0
15.5
15.0
14.5
14.0
13.5
13.0
2000
1980
19401920
18801860
1820
Source:Wikipedia
0 10000 20000 30000 40000 50000
14.0
14.5
15.0
15.5
16.0
Number of Pirates (Approximate)
Avg
Glo
bal T
empe
ratu
re (C
)
1820 1860
1880
1920
1940
1980
2000 r = -0.93
Correlation≠Causation
500 550 600 650 700
9.0
9.5
10.0
10.5
11.0
Engineering PhDs & Mozzarella Consumption (2000-09)
Civil Eng Doctorates Awarded
Moz
zare
lla C
hees
e C
onsu
mpt
ion
(lbs
per c
apita
)
r = 0.96
Correlations(summary)• Correlation:isameasureoftheassociationbetween2variables
• Pearsonr:indexofstrength&directionoflinearassociation- r=COVXY/(sXsY);variesbetween-1&+1
- measureshowmuchouruncertaintyaboutthevalueofYiisreducedbyknowingthevalueofXi- risanestimateofthepopulationcorrelationthatvariesacrosssamples
- a95%ConfidenceIntervalofr,whichvariesfromsampletosample,containsthetruepopulationparameter95%ofthetime
• Spearman’srs:indexofmonotonicityofassociation- equivalenttoPearson’srforrankeddata- generallymorerobustthanrtoextreme(i.e.,“high-leverage”)datapoints
• Correlationsaffectedby:nonlinearity,extremescores,rangerestrictions,heterogeneoussamples• CORRELATION≠CAUSATION• NullHypothesisTestingisamethodformakingdecisionsaboutvalueofpopulationr- pvaluesareprobabilityofgettingourdatawhennullhypothesisistrue