Regression: A skin-deep dive
Regression: A skin-deep dive (oxymoron intended), by @abulyomon
Agenda
Simple Linear Regression
Exercise
Multiple Linear Regression
Exercise
Logistic Regression
Exercise
Where am I?
By now, you should be able to:
● Obtain and clean data.
● Conduct Exploratory Data Analysis as well as some visualization.
This session is your first step toward explaining cross-sectional data.
Disclaimer
● This session does not teach Python.
● Although it is necessary to understand the math behind modeling, we will not go through the details. We will give recipes :(
Context & Terminology
Linear Regression
A regression model approximates the relation between the dependent and independent variables:
Y = f(X1, X2, ..., Xi) + ε
When Y is continuous (quantitative) and the relation is linearizable:
Y = β0 + β1X1 + β2X2 + ... + βiXi + ε
Linearity
● Linear (in the parameters)
Y = β0 + β1X + β2X² + ε
Y = β0 + β1X1 + β2X2 + ε
● Not linear
Y = β0 + e^(β1X1) + ε
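Why the quadratic model above still counts as linear: X² is just another column in the design matrix, so ordinary least squares fits it unchanged. A minimal sketch with made-up data (not from AlWaseet.csv):
import numpy as np
import statsmodels.api as sm

# toy data with a curved relationship that is still linear in the betas
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 1, 100)

# design matrix: intercept, X, and X^2 as ordinary columns
X = sm.add_constant(np.column_stack([x, x**2]))
print(sm.OLS(y, X).fit().params)   # estimates of beta0, beta1, beta2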
Ordinary Least Squares Estimation
Notes on OLS
OLS assumes:
1. Linearity
2. Random sampling
3. Error terms are normally distributed
4. Error terms have constant variance (homoskedasticity)
5. Error terms are independent (no autocorrelation)
There is another estimation method called Maximum Likelihood Estimation. In the case of linear regression, MLE yields the same estimates as OLS!
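What "least squares" actually computes, as a minimal sketch with toy numbers (not the course data): the coefficient vector that minimizes the sum of squared residuals, obtained here with NumPy's least-squares solver rather than by inverting X'X explicitly.
import numpy as np

# OLS closed form: beta_hat = argmin ||y - X beta||^2 = (X'X)^(-1) X'y
def ols_fit(x, y):
    X = np.column_stack([np.ones(len(x)), x])          # intercept column + predictor
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically stable least squares
    return beta_hat

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
print(ols_fit(x, y))   # [intercept, slope]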
Practice (file: AlWaseet.csv)
Problem definition: What is the relation between car mileage and car price?
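A hedged sketch of this exercise with statsmodels, assuming AlWaseet.csv has numeric PRICE and MILEAGE columns (the actual column names may differ):
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")

# simple linear regression: PRICE explained by MILEAGE
fit = smf.ols("PRICE ~ MILEAGE", data=cars).fit()
print(fit.summary())   # coefficients, R-squared, and p-values in one table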
Model Interpretation + Inference
● Coefficients
● R²: predictive power based on correlation = Explained Variation / Total Variation
● p-value: statistical significance
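These quantities can be read straight off a fitted model; a sketch assuming the same hypothetical PRICE/MILEAGE columns as above, with R² also computed by hand to show the explained/total variation identity:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")
fit = smf.ols("PRICE ~ MILEAGE", data=cars).fit()

print(fit.params)     # beta0 (intercept) and beta1 (slope on MILEAGE)
print(fit.pvalues)    # p-value per coefficient
print(fit.rsquared)   # R-squared reported by statsmodels

# R-squared by hand: explained / total variation = 1 - SSR/SST
sst = np.sum((cars["PRICE"] - cars["PRICE"].mean()) ** 2)
ssr = np.sum(fit.resid ** 2)
print(1 - ssr / sst)  # matches fit.rsquared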
10’ BREAK TIME
Multiple Linear Regression
Where is the nice graph?
Problem of collinearity
R²: curse of dimensionality (R² never decreases as predictors are added)
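One common screen for the collinearity problem is the variance inflation factor; a sketch assuming MILEAGE and AGE are numeric predictor columns in AlWaseet.csv:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

cars = pd.read_csv("AlWaseet.csv")
X = sm.add_constant(cars[["MILEAGE", "AGE"]])   # assumed predictor columns

# rule of thumb: a VIF above roughly 5-10 flags a collinear predictor
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))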
Practice (file: AlWaseet.csv)
Can we build a better model for price with more predictors?
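A sketch of the multiple-regression version, again assuming AGE exists alongside PRICE and MILEAGE:
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")

# add AGE as a second predictor (column names assumed)
multi = smf.ols("PRICE ~ MILEAGE + AGE", data=cars).fit()
print(multi.summary())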
More inference + Model Comparison
Ra²: adjusted R² (penalizes adding predictors)
F-test: tests the null hypothesis that all slope coefficients are zero!
AIC: information criterion
BIC: information criterion with a heavier penalty on the number of predictors (i)
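A sketch comparing the one-predictor and two-predictor fits on these criteria (same assumed columns as in the practice steps):
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")
simple = smf.ols("PRICE ~ MILEAGE", data=cars).fit()
multi = smf.ols("PRICE ~ MILEAGE + AGE", data=cars).fit()

for name, m in [("simple", simple), ("multiple", multi)]:
    print(name, m.rsquared_adj, m.aic, m.bic)   # lower AIC/BIC is better

# F-test on the joint null that all slope coefficients in the bigger model are zero
print(multi.fvalue, multi.f_pvalue)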
Goodness-of-fit: Visual Checks
● Normal probability plot of residuals (see the snippet below)
● Scatter plot of residuals vs predictor variables
● Scatter plot of residuals vs fitted values
import statsmodels.graphics.gofplots as gof
import scipy.stats as stats
import matplotlib.pyplot as plt

# Q-Q plot of the residuals against a fitted t distribution
gof.qqplot(fitted_model.resid, stats.t, fit=True, line='45')
plt.show()
###
# histogram of the standardized (Pearson) residuals
plt.hist(fitted_model.resid_pearson)
plt.show()
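The two scatter checks from the list above are not in that snippet; a minimal self-contained sketch, with the fitted model and the MILEAGE column name assumed as before:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")
fitted_model = smf.ols("PRICE ~ MILEAGE + AGE", data=cars).fit()

# residuals vs fitted values: look for funnels (non-constant variance) or curvature
plt.scatter(fitted_model.fittedvalues, fitted_model.resid)
plt.axhline(0, color="grey")
plt.xlabel("fitted values"); plt.ylabel("residuals")
plt.show()

# residuals vs a predictor, e.g. MILEAGE
plt.scatter(cars["MILEAGE"], fitted_model.resid)
plt.axhline(0, color="grey")
plt.show()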
Your HW, should you choose to accept it
Is the origin of the car significant in determining its price?
Are Korean made cars particularly different when it comes to price?
How do we handle categorical variables?
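A hint for the last question, not a full answer: formula interfaces can dummy-encode a categorical column automatically. ORIGIN is an assumed column name here:
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")

# C(...) marks ORIGIN as categorical: one level becomes the baseline,
# the other levels each get their own dummy coefficient
hw = smf.ols("PRICE ~ MILEAGE + C(ORIGIN)", data=cars).fit()
print(hw.summary())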
10’ BREAK TIME
Logistic Regression
We have expressed earlier that a regression model approximates the relation between the dependent and independent variables:
Y = f(X1, X2, ..., Xi) + ε
When Y is binary:
ln(π / (1 - π)) = β0 + β1X1 + β2X2 + ... + βiXi
where π = P(Y = 1 | X1 = x1, ..., Xi = xi) and π / (1 - π) is referred to as the “odds.”
Let’s demystify that cryptic slide!
Suppose Y, the dependent variable, was “CHASSIS”, which signifies whether a car was damaged: “1” if damaged and “0” if not. One way to model this may be:
ln(o) = β0 + β1PRICE + β2MILEAGE + β3AGE
The odds of a car of price p, mileage m, and age a being damaged would then be:
o = exp(β0 + β1p + β2m + β3a)
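Converting a log-odds value back to a probability takes two lines; the coefficient and car values below are made-up numbers purely for illustration:
import numpy as np

# made-up coefficients and a hypothetical car (price, mileage, age) -- illustration only
b0, b1, b2, b3 = -2.0, 0.00003, 0.00001, 0.05
log_odds = b0 + b1 * 15000 + b2 * 120000 + b3 * 10
odds = np.exp(log_odds)
print(odds, odds / (1 + odds))   # odds and P(CHASSIS = 1) for this hypothetical car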
Practice (file: AlWaseet.csv)
Can we guess the smashed car?
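A sketch of this exercise with statsmodels, assuming CHASSIS is the 0/1 damage indicator described two slides back and PRICE, MILEAGE, AGE are the predictors:
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("AlWaseet.csv")

# logistic regression: log-odds of damage as a linear function of the predictors
logit = smf.logit("CHASSIS ~ PRICE + MILEAGE + AGE", data=cars).fit()
print(logit.summary())
print(logit.predict(cars).head())   # fitted probabilities of damage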
Predictive Accuracy
● In-sample -> goodness-of-fit
● Guard against overfitting
● Out-of-sample -> cross-validation
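A sketch of the out-of-sample idea using scikit-learn's cross-validation (same assumed column names as before; classification accuracy is just one possible score):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

cars = pd.read_csv("AlWaseet.csv")
X = cars[["PRICE", "MILEAGE", "AGE"]]   # assumed predictor columns
y = cars["CHASSIS"]                     # assumed 0/1 damage indicator

# 5-fold cross-validation: train on 4/5 of the rows, score on the held-out 1/5
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())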
Next?
Further reading
● Regression Analysis by Example, Chatterjee & Hadi
● Logistic Regression using SAS, Allison
Contact me:
● @abulyomon