Investigating Linear Patterns in Data · 2019. 1. 19.
Investigating Linear Patterns in Data
Bill Gillam
Linear Regression
"The Adventure of Wisteria Lodge" Sherlock Holmes. Sir Arthur Conan Doyle, 20 April 1988. Granada Television. Access: https://www.youtube.com/watch?v=vY7jswaefbw 4:31/9:50
“If he was all on the same scale as his foot, he must certainly have been a giant.” - Sherlock Holmes
The Adventure of Wisteria Lodge.
Question: Is the length of a person’s foot a useful predictor of his/her height?
How can we gather and use data to tell?
What kind of evidence would convince us that the two variables footlength and height, to use Holmes’ words, are “on the same scale”?
One way to do this is to create a mathematical model of the supposed relationship from data, and then measure how closely the data actually matches our model.
We will use linear regression to find our model, and we will use the correlation coefficient to measure how strongly that model predicts our data.
Linear Regression: vocabulary in context
We are going to perform simple linear regression in this example. We are going to assume a linear relationship, and we are going to have only one explanatory (independent, manipulated, input) variable. We will let footlength be our explanatory variable, and height be our response (dependent, responding, output) variable.
Linear Regression
Here is our plan:
Step One: Gather data.
Step Two: See if the data trends linearly.
Step Three: Use linear regression to find the line of best fit.
Step Four: Find and interpret the correlation coefficient.
Step One: Gather some data.
Here are some I gathered from one of my classes:
(EXCEL)
footlength(cm) height(cm)
22 178
25 173
20 150
24 169
22 166
29 170
22 163
23 169
29 179
28 167
27 178
20 157
23 161
27 163
29 172
28 163
25 156
29 175
25 152
23 160
22 170
Studying numerical patterns (table)
[Scatter plot of Height (cm) against Footlength (cm)]
Our Model: (y = ax + b)
Height = 1.2712(foot) + 134
Slope: 1.2712/1 = rise/run
Our model says that for each 1 cm increase in footlength, the predicted height increases by 1.2712 cm.
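The line of best fit can be reproduced from the class data with the usual least-squares formulas. This is a minimal sketch, not the Excel procedure used for the slides: the function name `fit_line` and the variable names are assumptions made for illustration. Run on the table above, it should give a slope near 1.2712 and an intercept near 134.6, which the slide shows as 134.

```python
# Class data from the table above: (footlength in cm, height in cm)
data = [
    (22, 178), (25, 173), (20, 150), (24, 169), (22, 166), (29, 170),
    (22, 163), (23, 169), (29, 179), (28, 167), (27, 178), (20, 157),
    (23, 161), (27, 163), (29, 172), (28, 163), (25, 156), (29, 175),
    (25, 152), (23, 160), (22, 170),
]


def fit_line(points):
    """Return the least-squares slope a and intercept b for y = a*x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(data)
print(f"Height = {slope:.4f}(foot) + {intercept:.2f}")
```

The slope is the "rise over run" of the fitted line: the change in predicted height for each additional centimeter of footlength.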
Don’t extrapolate outside your data. Notice that our model has no problem predicting that a person with no foot is 134 cm tall.
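One way to make the extrapolation warning concrete is to have the prediction report whether the input lies inside the range of footlengths actually measured (20 cm to 29 cm in the class data). The sketch below uses the model coefficients as printed on the slide; the function name `predict_height` and the constants are hypothetical names chosen for this illustration.

```python
SLOPE, INTERCEPT = 1.2712, 134.0  # model as printed on the slide
FOOT_MIN, FOOT_MAX = 20, 29       # smallest and largest footlength in the class data


def predict_height(foot_cm):
    """Return (predicted height in cm, True if foot_cm is inside the observed range)."""
    inside = FOOT_MIN <= foot_cm <= FOOT_MAX
    return SLOPE * foot_cm + INTERCEPT, inside


for foot in (0, 25, 40):
    height, inside = predict_height(foot)
    note = "inside the data" if inside else "extrapolation, do not trust"
    print(f"foot = {foot} cm -> height = {height:.1f} cm ({note})")
```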
We are given R² = 0.2186.
From this we can take a square root to get Pearson’s correlation coefficient: R = 0.47. We take the positive root because the slope of our model is positive.
Interpreting Correlation
An R greater than 0.8 is generally described as strong, whereas a correlation less than 0.5 is generally described as weak.
R = 0.47 indicates a weak to moderate positive correlation.
R² (R-squared): 0.2186
R² is the percentage of the variation in the response variable that is explained by the model.
In our case, 21.86% of the variability in height is explained by the variation in footlength.
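The "explained variation" reading of R² can be checked directly: compare the variation left over around the fitted line with the total variation in height. This is a sketch under the assumption that the model is the ordinary least-squares line through the class data; the variable names (`ss_tot`, `ss_res`, and so on) are illustrative choices, not notation from the slides.

```python
import math

# Class data from the table above: (footlength in cm, height in cm)
data = [
    (22, 178), (25, 173), (20, 150), (24, 169), (22, 166), (29, 170),
    (22, 163), (23, 169), (29, 179), (28, 167), (27, 178), (20, 157),
    (23, 161), (27, 163), (29, 172), (28, 163), (25, 156), (29, 175),
    (25, 152), (23, 160), (22, 170),
]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Least-squares line y = slope * x + intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

# Total variation in height, and variation left unexplained by the line
ss_tot = sum((y - mean_y) ** 2 for _, y in data)
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in data)

r_squared = 1 - ss_res / ss_tot
# R carries the sign of the slope, so copy that sign onto the square root
r = math.copysign(math.sqrt(r_squared), slope)

print(f"R^2 = {r_squared:.4f}, R = {r:.2f}")
```

For a simple linear regression, this R² is the square of Pearson's correlation coefficient, which is why taking the square root recovers R.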
Pearson’s Correlation Coefficient
Pearson’s Correlation Coefficient - how it is calculated
Sum the products of the x and y z-scores for each ordered pair, then divide by n - 1.
x     y     x - xbar   y - ybar   zx*zy = ((x - xbar)/sx) * ((y - ybar)/sy)
6     5     -8         -2         0.7613
10    3     -4         -4         0.7613
14    7      0          0         0.0000
19    8      5          1         0.2379
21    12     7          5         1.6652
Sum of zx*zy = 3.4256
sx = 6.2, sy = 3.4, r = sum/(n - 1) = 0.8564
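The z-score calculation in the table can be sketched in a few lines. The function name `pearson_r` is an assumption for illustration; `statistics.stdev` is the sample standard deviation (dividing by n - 1), which matches the sx and sy in the table. Carrying the standard deviations at full precision gives r of about 0.855, slightly below the table's 0.8564; the small difference appears to come from rounding sx and sy before dividing.

```python
from statistics import mean, stdev

# The five ordered pairs from the worked table
xs = [6, 10, 14, 19, 21]
ys = [5, 3, 7, 8, 12]


def pearson_r(xs, ys):
    """Sum the products of the x and y z-scores, then divide by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


r = pearson_r(xs, ys)
print(f"r = {r:.4f}")
```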
Conditions for using the correlation coefficient:
• Quantitative variables
• Linear relationship
• Outliers are not distorting the correlation
Non-linear Data
This is the data for f-stops for a camera. Here we square the data to make it linear. Another transformation that statisticians sometimes use involves logs to straighten exponential data.
Sometimes we can make non-linear data linear
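The effect of straightening can be seen by comparing the correlation before and after a transformation. The data below are made up for illustration and are not the f-stop data from the slide: one series follows a square-root curve (straightened by squaring) and one follows an exponential curve (straightened by taking logs). The helper name `pearson_r` is also an assumption.

```python
import math
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson's r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (len(xs) - 1)


xs = list(range(1, 9))

# Hypothetical curved data, generated from formulas rather than measured
root_ys = [math.sqrt(3 * x) for x in xs]  # square-root shaped
exp_ys = [2 * 1.5 ** x for x in xs]       # exponential shaped

r_root_raw = pearson_r(xs, root_ys)
r_root_squared = pearson_r(xs, [y ** 2 for y in root_ys])  # square to straighten
r_exp_raw = pearson_r(xs, exp_ys)
r_exp_logged = pearson_r(xs, [math.log(y) for y in exp_ys])  # log to straighten

print(f"square-root data: r = {r_root_raw:.4f} raw, {r_root_squared:.4f} after squaring")
print(f"exponential data: r = {r_exp_raw:.4f} raw, {r_exp_logged:.4f} after taking logs")
```

In both cases the transformed data lie on a straight line, so the correlation moves to essentially 1, and a linear model becomes appropriate for the transformed values.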
Correlation vs. Causation
Correlation does not imply causation. Often the correlation is caused by lurking variables. A lurking variable is a variable that is not included as an explanatory or response variable but could strongly affect the correlation.
Another mistake we can make is to reverse the explanatory and response variables from their actual relationship. The stork story is an example of this.
Bad conclusions:
• The number of AP courses a student takes and SAT performance strongly correlate, so we should enroll all students in more AP courses.
• Studies show that the number of police in an area positively correlates with the amount of gang activity, so we should reduce the number of policemen to reduce gang activity.
• Completed homework and student performance have a positive correlation, so all teachers should assign more homework every night in every course.
• The amount of damage caused by a fire has a positive correlation with the number of firemen at the scene. To reduce the amount of damage due to fires, we should send fewer firemen.
Discuss possible lurking variables or reversed axes.
What have we learned?
• We can use linear regression to find a linear model for appropriate data.
• We can find the correlation coefficient, r, to see how well our model fits.
• We can “straighten” some data that is not linear by performing mathematical functions like squaring or taking the log of each data point.
• Correlation is not necessarily causation, and we should watch for lurking variables.
Munroe, R. (n.d.). Correlation. Retrieved January 18, 2019, from https://xkcd.com/552/