Heads Up! Sept 22 – Oct 4 Probability Perceived by many as a difficult topic Get ready ahead of...
john-dixon -
Heads Up! Sept 22 – Oct 4
Probability
Perceived by many as a difficult topic
Get ready ahead of time
Last Time:
Least Squares Regression
(Simple Linear Regression)
Correlation
In Least-Squares Regression:
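$$a = \bar{Y} - b\,\bar{X}, \qquad b = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}$$
Computational Formula:
$$b = \frac{N \sum_{i=1}^{N} X_i Y_i - \sum_{i=1}^{N} X_i \sum_{i=1}^{N} Y_i}{N \sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2}$$
Can we do this?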
         X        X Squared    Y        Y Squared    XY
         86       7396         82.6     6822.76      7103.6
         109.3    11946.49     112.6    12678.76     12307.18
         73.3     5372.89      70       4900         5131
         80.6     6496.36      76.6     5867.56      6173.96
         86.6     7499.56      84       7056         7274.4
         85.3     7276.09      86       7396         7335.8
         83.3     6938.89      82.6     6822.76      6880.58
         78.6     6177.96      81.3     6609.69      6390.18
         92       8464         86.6     7499.56      7967.2
         76       5776         75.3     5670.09      5722.8
Totals:  851      73344.24     837.6    71323.18     72286.7
Calculating the Least Squares Regression Line contd.
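$$b = \frac{10(72{,}286.70) - (851)(837.6)}{10(73{,}344.24) - 851^2} = \frac{722{,}867 - 712{,}797.6}{733{,}442.4 - 724{,}201} = \frac{10{,}069.4}{9{,}241.4} = 1.09$$
$$a = \bar{Y} - b\,\bar{X} = \frac{837.6}{10} - 1.09 \cdot \frac{851}{10} \approx -9$$
Slope is 1.09
Intercept is -9 (You can’t see it in this graph)
Regression Equation: TRIAL = 1.09 PRACTICE - 9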
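As a cross-check on the hand calculation, here is a minimal Python sketch of the computational formula applied to the ten PRACTICE/TRIAL cases from the slides. The function name `least_squares` is illustrative and not part of the lecture.

```python
# PRACTICE (X) and TRIAL (Y) scores for the ten cases in the slides.
X = [86, 109.3, 73.3, 80.6, 86.6, 85.3, 83.3, 78.6, 92, 76]
Y = [82.6, 112.6, 70, 76.6, 84, 86, 82.6, 81.3, 86.6, 75.3]


def least_squares(x, y):
    """Return (a, b) for the line Y-hat = a + b*X via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n  # a = Y-bar - b * X-bar
    return a, b


a, b = least_squares(X, Y)
print(f"slope b = {b:.2f}, intercept a = {a:.1f}")
```

The unrounded slope is slightly below 1.09, so the intercept from the code differs a little from the hand value that plugs in the rounded slope.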
A view from further away….
[Figure: scatter plot of Y against X, with the X axis running from 0 to 150 and the Y axis from 0 to 120]
Look at the residuals:
[Figure: Residual Plot, Residuals (-10 to 10) against X (0 to 120)]
We want a shot-gun blast shape, i.e., a random blob
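The residuals behind such a plot can be listed with a few lines of Python. This is a sketch for the PRACTICE/TRIAL data only; it prints the residuals in order of X instead of drawing them, so the "random blob" has to be judged from the signs and sizes.

```python
# PRACTICE (X) and TRIAL (Y) scores for the ten cases in the slides.
X = [86, 109.3, 73.3, 80.6, 86.6, 85.3, 83.3, 78.6, 92, 76]
Y = [82.6, 112.6, 70, 76.6, 84, 86, 82.6, 81.3, 86.6, 75.3]

n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

# Least-squares slope and intercept (definitional formula).
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / sum(
    (x - mean_x) ** 2 for x in X
)
a = mean_y - b * mean_x

# Residual = observed Y minus predicted Y.
residuals = [y - (a + b * x) for x, y in zip(X, Y)]

for x, e in sorted(zip(X, residuals)):
    print(f"X = {x:6.1f}   residual = {e:+6.2f}")
```

Least-squares residuals always sum to zero (up to rounding), so what matters is whether their signs and spread show a pattern along X.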
Look at Residuals & Line Fit
[Figures: Residual Plot and Line Fit Plot]
Problem: Relationship is not linear
Look at Residuals & Line Fit
[Figure: Residual Plot]
Problem: Predictions are very precise for small predicted values, but very imprecise for large predicted values. (Not good)
Look at Residuals
[Figure: Residual Plot, horizontal axis running from 1 to 12]
Problem: Lurking (third) variables (?)
Here: Seasonal Trend?
Correlation
How strong is the linear relationship between two variables X and Y?
Slope in regression of standardized variables:
$$Z_X = \frac{X - \bar{X}}{S_X}, \qquad Z_Y = \frac{Y - \bar{Y}}{S_Y}$$
This slope tells me how much a given change (in standardized units) of X translates into a change (in standardized units) of Y
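Correlation
How strong is the linear relationship between two variables X and Y?
Correlation Coefficient:
$$r = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{X_i - \bar{X}}{S_X}\right) \left(\frac{Y_i - \bar{Y}}{S_Y}\right)$$
Computational Formula:
$$r = \frac{\sum_{i=1}^{N} X_i Y_i - \frac{1}{N} \sum_{i=1}^{N} X_i \sum_{i=1}^{N} Y_i}{(N-1)\, S_X S_Y}$$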
Properties of Correlation
• Symmetric Measure (You can exchange X and Y and get the same value)
• -1 ≤ r ≤ 1
• -1 is “perfect” negative correlation
• 1 is “perfect” positive correlation
• Not dependent on linear transformations of X and Y (a transformation with negative slope only flips the sign of r)
• Measures linear relationship only
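The symmetry and linear-transformation properties can be checked numerically. The sketch below uses the PRACTICE/TRIAL data; the helper `pearson_r` and the particular transformations (2X + 3, 0.5Y - 1, -X) are arbitrary choices made for illustration.

```python
import math

# PRACTICE (X) and TRIAL (Y) scores for the ten cases in the slides.
X = [86, 109.3, 73.3, 80.6, 86.6, 85.3, 83.3, 78.6, 92, 76]
Y = [82.6, 112.6, 70, 76.6, 84, 86, 82.6, 81.3, 86.6, 75.3]


def pearson_r(x, y):
    """Correlation coefficient from sums of squares and cross-products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(X, Y)
r_swapped = pearson_r(Y, X)                                   # exchange X and Y
r_rescaled = pearson_r([2 * x + 3 for x in X],                # positive-slope
                       [0.5 * y - 1 for y in Y])              # linear transforms
r_flipped = pearson_r([-x for x in X], Y)                     # negative slope

print(r, r_swapped, r_rescaled, r_flipped)
```

The first three values agree; the last has the same magnitude with the opposite sign.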
Let’s try it out on our X = PRACTICE, Y = TRIAL Data Set
          PRACTICE (X)   Z_X             TRIAL (Y)   Z_Y             Z_X Z_Y
CASE 1    86             0.088816751     82.6        -0.10192166     -0.009052
CASE 2    109.3          2.388183737     112.6       2.533983252     6.0516176
CASE 3    73.3           -1.164486285    70          -1.20900172     1.4078659
CASE 4    80.6           -0.444083753    76.6        -0.62910264     0.2793743
CASE 5    86.6           0.148027918     84          0.021087239     0.0031215
CASE 6    85.3           0.019737056     86          0.196814233     0.0038845
CASE 7    83.3           -0.177633501    82.6        -0.10192166     0.0181047
CASE 8    78.6           -0.641454309    81.3        -0.2161442      0.1386466
CASE 9    92             0.680928421     86.6        0.249532331     0.1699137
CASE 10   76             -0.898036033    75.3        -0.74332518     0.6675328
Totals:   851                            837.6                       8.7310092
Means:    85.1                           83.76
r = 8.7310092 / 9 = 0.9701121
Check this calculation at home!
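One way to check the calculation at home is a short Python sketch that standardizes both variables and averages the products of the z-scores with N - 1 in the denominator. It relies on `statistics.stdev`, which is the sample standard deviation (N - 1), matching the slides' formula.

```python
from statistics import mean, stdev

# PRACTICE (X) and TRIAL (Y) scores for the ten cases in the slides.
X = [86, 109.3, 73.3, 80.6, 86.6, 85.3, 83.3, 78.6, 92, 76]
Y = [82.6, 112.6, 70, 76.6, 84, 86, 82.6, 81.3, 86.6, 75.3]


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample SD (N - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx, zy = z_scores(X), z_scores(Y)

# r = sum of products of z-scores, divided by N - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(X) - 1)
print(f"r = {r:.4f}")
```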
Today
Finish Theory on Regression
Pathologies and Traps in Linear Regression and Correlation
Relationships between Categorical Variables
Regression on Standardized Variables
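$$\hat{Z}_{Y_i} = r\, Z_{X_i}$$
slope: $r = \frac{1}{N-1} \sum_{i=1}^{N} Z_{X_i} Z_{Y_i}$, intercept: 0
$$\frac{\hat{Y}_i - \bar{Y}}{S_Y} = r\, \frac{X_i - \bar{X}}{S_X}$$
$$\hat{Y}_i - \bar{Y} = r\, \frac{S_Y}{S_X}\, (X_i - \bar{X})$$
$$\hat{Y}_i = \bar{Y} - r\, \frac{S_Y}{S_X}\, \bar{X} + r\, \frac{S_Y}{S_X}\, X_i$$
$$\hat{Y}_i = a + b X_i \; ?$$
$$b = r\, \frac{S_Y}{S_X}, \qquad a = \bar{Y} - b\, \bar{X}$$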
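The identity b = r S_Y / S_X can be confirmed numerically on the PRACTICE/TRIAL data. This is only a sanity check of the derivation, written as a small Python sketch; the variable names are illustrative.

```python
from statistics import mean, stdev

# PRACTICE (X) and TRIAL (Y) scores for the ten cases in the slides.
X = [86, 109.3, 73.3, 80.6, 86.6, 85.3, 83.3, 78.6, 92, 76]
Y = [82.6, 112.6, 70, 76.6, 84, 86, 82.6, 81.3, 86.6, 75.3]

n = len(X)
mx, my = mean(X), mean(Y)
sx, sy = stdev(X), stdev(Y)  # sample standard deviations (N - 1)

# Correlation as the slope of the regression on standardized variables.
r = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(X, Y)) / (n - 1)

# Slope and intercept recovered from r and the standard deviations.
b_from_r = r * sy / sx
a_from_r = my - b_from_r * mx

# Slope computed directly by least squares on the raw scores.
b_direct = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum(
    (x - mx) ** 2 for x in X
)

print(b_from_r, b_direct, a_from_r)
```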
Given $\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_N$ from $\hat{Y}_i = a + b X_i$
What is the variance of $\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_N$?
We know that it must be $b^2 S_X^2$
We also know: $b = r\, \frac{S_Y}{S_X}$
Therefore: $b^2 S_X^2 = r^2 S_Y^2$
Thus: $$\frac{b^2 S_X^2}{S_Y^2} = r^2$$
Numerator: Variance of predicted Y’s
Denominator: Variance of observed Y’s
$r^2$: Proportion of Variance of observed Y’s that is accounted for by the regression
Proportion of Variance explained
Note: If you exchange X and Y in the regression, you find the same r and r squared
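A short Python sketch can confirm on the PRACTICE/TRIAL data that the variance of the predicted Y's divided by the variance of the observed Y's equals r squared. `statistics.variance` is the sample variance (N - 1); the ratio would be the same with N in both places.

```python
from statistics import mean, variance

# PRACTICE (X) and TRIAL (Y) scores for the ten cases in the slides.
X = [86, 109.3, 73.3, 80.6, 86.6, 85.3, 83.3, 78.6, 92, 76]
Y = [82.6, 112.6, 70, 76.6, 84, 86, 82.6, 81.3, 86.6, 75.3]

mx, my = mean(X), mean(Y)
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)

b = sxy / sxx
a = my - b * mx
predicted = [a + b * x for x in X]

# Proportion of variance explained, computed two ways.
explained = variance(predicted) / variance(Y)
r_squared = sxy ** 2 / (sxx * syy)

print(f"variance ratio = {explained:.4f}, r squared = {r_squared:.4f}")
```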
Correlation only checks magnitude of
Linear Relationships!
It can happen that r=0, even though X and Y are highly related to each other!
Need to look at Scatter Plot and Residual Plot to make sure that you don’t miss an obvious relationship overlooked by linear regression!
How does a Linear Regression Model $\hat{Y}_i = a + b X_i$ approximate $Y = X^2$ (for X = 1, 2, …, 15)?
[Figure: Y = X-squared Line Fit Plot, Y against X]
[Figure: Y = X-squared Residual Plot, Residuals against X]
For these particular data the regression model finds a = -45, b = 16
The residuals have a systematic trend!!
This Linear Regression is inappropriate!!
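The quoted coefficients and the systematic trend in the residuals can be reproduced with a small Python sketch. `fit_line` is an illustrative helper, not something from the lecture.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for Y-hat = a + b*X."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


X = list(range(1, 16))      # X = 1, 2, ..., 15
Y = [x ** 2 for x in X]     # Y = X squared

a, b = fit_line(X, Y)
residuals = [y - (a + b * x) for x, y in zip(X, Y)]

print(f"a = {a:.1f}, b = {b:.1f}")
for x, e in zip(X, residuals):
    print(f"X = {x:2d}   residual = {e:+7.2f}")
```

The residuals are positive at both ends of the X range and negative in the middle, which is the U-shaped pattern the residual plot is meant to reveal.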
How does a Linear Regression Model $\hat{Y}_i = a + b X_i$ approximate $Y = X^2$ (for X = -8, -7, …, 7, 8)?
[Figure: Y = X_squared Line Fit Plot, Y against X]
[Figure: Y = X_squared Residual Plot, Residuals against X]
For these particular data the regression model finds a = 24, b = 0
The residuals have a systematic trend!!
This Linear Regression is inappropriate!!
How does a Linear Regression Model $\hat{Y}_i = a + b X_i$ approximate $Y = X^2$ (for X = -8, -7, …, 7, 8)?
[Figure: Y = X_squared Line Fit Plot, Y against X]
For these particular data the regression model finds a = 24, b = 0, r = 0
Correlation is Zero: No LINEAR Relationship
Is there “no relationship” between X and Y?
There is an extremely strong (nonlinear) relationship here!
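The zero correlation for the symmetric range can be verified directly. The following Python sketch computes the slope, intercept and r for X = -8, …, 8 and Y = X squared, even though Y is completely determined by X.

```python
import math

X = list(range(-8, 9))      # X = -8, -7, ..., 7, 8
Y = [x ** 2 for x in X]     # Y is an exact function of X

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(X, Y))
sxx = sum((x - mx) ** 2 for x in X)
syy = sum((y - my) ** 2 for y in Y)

b = sxy / sxx                      # slope of the fitted line
a = my - b * mx                    # intercept equals the mean of Y here
r = sxy / math.sqrt(sxx * syy)     # correlation coefficient

print(f"a = {a:.1f}, b = {b:.1f}, r = {r:.1f}")
```

The positive and negative halves of the X range cancel exactly in the cross-product sum, which is why both b and r come out as zero.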
How does a Linear Regression Model $\hat{Y}_i = a + b X_i$ approximate $Y = \ln(X)$ (for X = 1, 2, …, 15)?
[Figure: Y = ln(X) Line Fit Plot, Y against X]
[Figure: Y = ln(X) Residual Plot, Residuals against X]
For these particular data the regression model finds a = .54, b = .16
The residuals have a systematic trend!!
This Linear Regression is inappropriate!!
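The logarithm example can be rerun the same way. This Python sketch fits a line to Y = ln(X) for X = 1, …, 15 and lists the residuals; the coefficients it finds are close to the values quoted on the slide (differences in the second decimal come from rounding).

```python
import math

X = list(range(1, 16))          # X = 1, 2, ..., 15
Y = [math.log(x) for x in X]    # Y = ln(X)

n = len(X)
mx, my = sum(X) / n, sum(Y) / n
b = sum((x - mx) * (y - my) for x, y in zip(X, Y)) / sum(
    (x - mx) ** 2 for x in X
)
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(X, Y)]

print(f"a = {a:.3f}, b = {b:.3f}")
for x, e in zip(X, residuals):
    print(f"X = {x:2d}   residual = {e:+6.3f}")
```

Because ln(X) is concave, the residuals are negative at both ends and positive in the middle, the mirror image of the pattern for Y = X squared.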
Correlation is not Causation!
Correlation between the size of your big toe and your performance on reading tasks is highly positive!
??
Lurking Third Variable: AGE
Correlation is not Causation!
Only experimentation allows us to attribute causation to the relationship between independent and dependent variables.
Ecological Correlation: Correlations between averages are higher than correlations between individuals
[Figures: scatter plot of Y against X for individuals; scatter plot of Y group averages against X group averages]
Problem of Restricted Range
[Figure: Success in Graduate School against GRE scores, with regions labeled “Strong Linear Relationship” and “No Linear Relationship”]
Extrapolations are Dangerous
[Figure: Number of Passengers against Year]
Regression toward the Mean
The term “Regression” is associated with Sir Francis Galton (1822 – 1911)
Picture taken from http://www.gene.ucl.ac.uk/
Galton (1885) “Regression towards Mediocrity in Hereditary Stature”, Journal of the Anthropological Institute
Regression toward the Mean
Suppose:
X: IQ of father
Y: IQ of son
Correlation between X and Y: r = .60
$$\hat{Z}_Y = .6\, Z_X$$
Regression toward Mediocrity??
X: IQ of father
Y: IQ of son
Correlation between X and Y: r = .60
$$\hat{Z}_Y = .6\, Z_X$$
Very intelligent father: $Z_X = 2.0$
We will predict a more mediocre son: $\hat{Z}_Y = .6(2.0) = 1.2$
Very dumb father: $Z_X = -2.0$
We will predict a less dumb son: $\hat{Z}_Y = .6(-2.0) = -1.2$
Predictions are closer to zero (the mean) than the observations!!
[Figures: scatter plots of Z_Y against Z_X with r = .60, marking Z_X = 2.0 and the predicted Z_Y = 1.2]
Among families where the father is approximately 2 standard deviations above the mean, the average son is only about 1.2 standard deviations above the mean.
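The prediction rule on standardized scores is a one-line computation. A minimal Python sketch follows; `predict_z_y` is an illustrative name, and r = .60 is the value supposed on the slides.

```python
def predict_z_y(r, z_x):
    """Predicted standardized Y for a given standardized X: Z_Y-hat = r * Z_X."""
    return r * z_x


r = 0.60  # correlation between father's and son's IQ assumed on the slides

for z_father in (2.0, -2.0):
    z_son = predict_z_y(r, z_father)
    print(f"father Z = {z_father:+.1f}  ->  predicted son Z = {z_son:+.1f}")
```

Whenever |r| < 1, the predicted z-score is closer to zero than the predictor's z-score, in both directions.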
Regression toward Mediocrity??
Do the sons just become more similar to each other than their fathers were?
Regression toward Mediocrity??
Variance of $Z_X$: $S^2_{Z_X} = 1$
Variance of $Z_Y$: $S^2_{Z_Y} = 1$
Variability of the Z scores is the same!
No slide into mediocrity!!
Regression toward the mean
When you have a lucky and exceptionally good performance in an exam, you expect to do worse next time, because there is no reason to believe that you will be so exceptionally lucky again.
When you have a mental block and exceptionally bad performance in an exam, you expect to do better next time, because there is no reason to believe that you will be so exceptionally unlucky again.
This does not mean that you are becoming more and more average as time progresses.
It means that your average performance, as a reasonable predictor for future performance, will lead to such a pattern of relationships between observed and predicted performance
Regression toward the mean
Your room mate makes a huge mess in your room. You complain. The next few days are cleaner.
Your room mate has cleaned up the room. You praise your room mate. The next few days the room gets dirtier.
Does this mean that punishment leads to better performance and reward leads to worse performance?
No….
Regression toward the mean
Your room mate makes a huge mess in your room. You do nothing. The next few days are cleaner.
Your room mate has cleaned up the room. You do nothing. The next few days the room gets dirtier.
Your room mate simply makes messes, cleans them, makes messes, cleans them …
Your best guess for the future is an “average” level of messiness
Implications for Research
It is very risky to study anything based on selection of extreme groups
Test, Retest: Extremes become less extreme
May look like a treatment effect!
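A small simulation illustrates why selecting an extreme group is risky. The sketch below rests on assumptions that are not in the slides: each person has a stable ability drawn from a standard normal distribution, and each test score is that ability plus independent normal noise. No treatment is applied between test and retest.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable
N = 10_000

# Assumed model: observed score = stable ability + independent luck.
ability = [random.gauss(0, 1) for _ in range(N)]
test = [a + random.gauss(0, 1) for a in ability]
retest = [a + random.gauss(0, 1) for a in ability]

# Select the extreme group: top 10% on the first test.
cutoff = sorted(test)[int(0.9 * N)]
extreme = [i for i in range(N) if test[i] >= cutoff]

test_mean = mean(test[i] for i in extreme)
retest_mean = mean(retest[i] for i in extreme)

print(f"extreme group, test mean:   {test_mean:.2f}")
print(f"extreme group, retest mean: {retest_mean:.2f}")
```

The selected group scores lower on the retest than on the test, yet still above the overall average. Had a "treatment" been given in between, this drop could easily be mistaken for its effect.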
Relationships between Categorical Variables
Baby Held   Right-Handed Mother   Left-Handed Mother   Total
Left        212                   25                   237
Right       43                    7                    50
Total       255                   32                   287
Marginal Distributions: row totals (237, 50) and column totals (255, 32)
Theory: “Mothers tend to hold their babies with the non-dominant hand, so that the dominant hand is available to do stuff.”
Relationships between Categorical Variables
Baby Held               Marginal Proportion
Left                    .826 (82.6%)
Right                   .174 (17.4%)
Mother                  Marginal Proportion
Right-Handed Mother     .889 (88.9%)
Left-Handed Mother      .111 (11.1%)
Marginal Proportions (Percentages)
Vast majority of babies held left
Vast majority of mothers right-handed
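The marginal proportions follow directly from the counts in the two-way table. A minimal Python sketch, using a nested dictionary as one possible way to hold the table:

```python
# Counts from the slides: side the baby is held, by handedness of the mother.
counts = {
    "Left": {"Right-Handed Mother": 212, "Left-Handed Mother": 25},
    "Right": {"Right-Handed Mother": 43, "Left-Handed Mother": 7},
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal proportions for the side on which the baby is held (row totals).
held_marginal = {side: sum(row.values()) / total for side, row in counts.items()}

# Marginal proportions for the mother's handedness (column totals).
mothers = ["Right-Handed Mother", "Left-Handed Mother"]
mother_marginal = {
    m: sum(counts[side][m] for side in counts) / total for m in mothers
}

print(total, held_marginal, mother_marginal)
```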
Relationships between Categorical Variables
Baby Held   Right-Handed Mother   Left-Handed Mother   Total
Left        .895                  .105                 1 (100%)
Right       .860                  .140                 1 (100%)
Conditional proportions, given side on which the baby is held
Absolute size not taken into account
Relationships between Categorical Variables
Baby Held   Right-Handed Mother   Left-Handed Mother
Left        .831                  .781
Right       .169                  .219
Total       1 (100%)              1 (100%)
Conditional proportions, given dexterity of mother
Absolute size not taken into account
Relationships between Categorical Variables
Baby Held   Right-Handed Mother   Left-Handed Mother
Left        .831                  .781
Right       .169                  .219
Total       1 (100%)              1 (100%)
For any given dexterity of the mother, there is an overwhelming tendency to hold the baby on the left-hand side.
Absolute size not taken into account
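Both sets of conditional proportions come from the same counts, divided by different totals. The Python sketch below shows the two directions side by side; the function names are illustrative.

```python
# Counts from the slides: side the baby is held, by handedness of the mother.
counts = {
    "Left": {"Right-Handed": 212, "Left-Handed": 25},
    "Right": {"Right-Handed": 43, "Left-Handed": 7},
}


def given_side(table):
    """Condition on the side held: divide each row by its row total."""
    return {
        side: {m: c / sum(row.values()) for m, c in row.items()}
        for side, row in table.items()
    }


def given_dexterity(table):
    """Condition on the mother's handedness: divide by the column total."""
    col_totals = {
        m: sum(table[side][m] for side in table) for m in table["Left"]
    }
    return {
        side: {m: c / col_totals[m] for m, c in row.items()}
        for side, row in table.items()
    }


by_side = given_side(counts)
by_dexterity = given_dexterity(counts)
print(by_side)
print(by_dexterity)
```

Conditioning on dexterity is the direction that speaks to the theory: within each group of mothers, it shows how often the baby is held on the left.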
Segmented Bargraphs
[Figure: Segmented Bargraph of Frequency by Side Baby is held (left holding, right holding), segmented into left-handed and right-handed mothers]
Segmented Bargraphs
[Figure: Segmented Bargraph of Frequency by Dexterity (right-handed, left-handed), segmented into right holding and left holding]
Conclusion??
Lurking Third Variable?
Heart beat helps baby calm down
Simpson’s Paradox
Business School
         Admit   Deny
Male     480     120
Female   180     20
Law School
         Admit   Deny
Male     10      90
Female   100     200
Simpson’s Paradox
Overall:
         Admit   Deny   Total
Male     490     210    700
Female   280     220    500
Overall conditional proportions per gender:
         Admit   Deny
Male     .70     .30
Female   .56     .44
Men Privileged!! Gender Discrimination!!
Simpson’s Paradox
Business School:
         Admit   Deny   Total
Male     480     120    600
Female   180     20     200
         Admit   Deny
Male     .80     .20
Female   .90     .10
Women Privileged!?!
Law School:
         Admit   Deny   Total
Male     10      90     100
Female   100     200    300
         Admit   Deny
Male     .10     .90
Female   .33     .67
Women Privileged!?!
However: The male-dominated discipline (Business School) has the higher admission rate
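The reversal can be reproduced from the counts alone. The Python sketch below computes admission rates per school and pooled over both schools. It assumes, from the order of the labels on the slides, that the first table is the Business School and the second the Law School.

```python
# (admit, deny) counts per school and gender, as given on the slides.
# Assumption: the first table is the Business School, the second the Law School.
admissions = {
    "Business School": {"Male": (480, 120), "Female": (180, 20)},
    "Law School": {"Male": (10, 90), "Female": (100, 200)},
}


def admit_rate(admit, deny):
    """Proportion admitted among all applicants."""
    return admit / (admit + deny)


# Conditional admission rates within each school.
per_school = {
    school: {g: admit_rate(*c) for g, c in by_gender.items()}
    for school, by_gender in admissions.items()
}

# Pooled admission rates, ignoring the school.
pooled = {}
for gender in ("Male", "Female"):
    admit = sum(admissions[s][gender][0] for s in admissions)
    deny = sum(admissions[s][gender][1] for s in admissions)
    pooled[gender] = admit_rate(admit, deny)

print(per_school)
print(pooled)
```

Women have the higher admission rate within each school, yet men have the higher rate overall, because most men applied to the school that admits most of its applicants.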