Evaluating a Model Graphically

SUPERVISED LEARNING IN R: REGRESSION

Nina Zumel and John Mount, Win-Vector LLC
Plotting Ground Truth vs. Predictions

A well-fitting model:
- the x = y line runs through the center of the points (the "line of perfect prediction")

A poorly fitting model:
- points are all on one side of the x = y line
- systematic errors
The Residual Plot

Residual: actual outcome − prediction

A well-fitting model: a good fit shows no systematic errors in the residuals.

A poorly fitting model: systematic errors.
The Gain Curve

Measures how well the model sorts the outcome:
- x-axis: houses in model-sorted order (decreasing)
- y-axis: fraction of total accumulated home sales
- Wizard curve: the curve a perfect model would trace
Reading the Gain Curve

# GainCurvePlot() is from the WVPlots package
library(WVPlots)
GainCurvePlot(houseprices, "prediction", "price", "Home price model")
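The underlying computation can be sketched in base R: sort rows by prediction (largest first), then accumulate the fraction of the total outcome. A minimal sketch on an invented toy frame (the `toy` data below is ours, not from the course):

```r
# Toy data: 4 houses; predictions sort them almost correctly
toy <- data.frame(
  prediction = c(300, 200, 250, 100),
  price      = c(320, 180, 260, 90)
)

# Sort by model prediction, largest first
ord <- order(toy$prediction, decreasing = TRUE)

# y-axis of the gain curve: cumulative fraction of total price accumulated
gain_y <- cumsum(toy$price[ord]) / sum(toy$price)

# x-axis: fraction of houses seen so far
gain_x <- seq_along(gain_y) / nrow(toy)
```

A model that sorts perfectly (the "wizard") would accumulate the largest prices first, so its curve rises fastest; a random sort traces the diagonal.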
Let's practice!
Root Mean Squared Error (RMSE)
What is Root Mean Squared Error (RMSE)?

RMSE = √(mean((pred − y)²))

where:
- pred − y: the error, or residuals vector
- mean((pred − y)²): the mean value of (pred − y)², i.e., the mean squared error
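The formula translates directly into R. A small helper (the name `rmse` is ours, not from the course code), checked on tiny invented vectors:

```r
# RMSE: square root of the mean squared residual
rmse <- function(pred, y) {
  err <- pred - y       # the residuals vector
  sqrt(mean(err^2))     # root mean squared error
}

# Residuals are -1, 0, 1, so the mean square is 2/3
rmse(c(1, 2, 3), c(2, 2, 2))   # sqrt(2/3), about 0.816
```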
RMSE of the Home Sales Price Model

price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)

# Calculate error
err <- houseprices$prediction - houseprices$price

# Square the error vector
err2 <- err^2

# Take the mean, and sqrt it
(rmse <- sqrt(mean(err2)))

58.33908

RMSE ≈ 58.3
Is the RMSE Large or Small?

# Take the mean, and sqrt it
(rmse <- sqrt(mean(err2)))

58.33908

# The standard deviation of the outcome
(sdtemp <- sd(houseprices$price))

135.2694

RMSE ≈ 58.3, sd(price) ≈ 135: the model's typical prediction error is well below the natural spread of the prices, so the model is informative.
Let's practice!
R-Squared (R²)
What is R²?

A measure of how well the model fits or explains the data. A value between 0 and 1:
- near 1: the model fits well
- near 0: no better than guessing the average value
Calculating R²

R² is the variance explained by the model:

R² = 1 − RSS / SSTot

where:
- RSS = Σ(y − prediction)²: residual sum of squares (variance from the model)
- SSTot = Σ(y − ȳ)²: total sum of squares (variance of the data)
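The definition can be checked end-to-end on toy vectors (values invented for illustration):

```r
y    <- c(10, 12, 15, 18, 20)   # actual outcomes
pred <- c(11, 11, 16, 17, 21)   # model predictions

rss   <- sum((y - pred)^2)      # residual sum of squares: 5
sstot <- sum((y - mean(y))^2)   # total sum of squares: 68

(r_squared <- 1 - rss / sstot)  # 1 - 5/68, about 0.926
```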
Calculate R² of the House Price Model: RSS

price: column of actual sale prices (in thousands)
prediction: column of predicted sale prices (in thousands)

Calculate the error:

err <- houseprices$prediction - houseprices$price

Square it and take the sum:

rss <- sum(err^2)

RSS ≈ 136138
Calculate R² of the House Price Model: SSTot

Take the difference of prices from the mean price:

toterr <- houseprices$price - mean(houseprices$price)

Square it and take the sum:

sstot <- sum(toterr^2)

RSS ≈ 136138
SSTot ≈ 713615
Calculate R² of the House Price Model

(r_squared <- 1 - (rss / sstot))

0.8092278

RSS ≈ 136138
SSTot ≈ 713615
R² ≈ 0.809
Reading R² from the lm() model

# From summary()
summary(hmodel)

...
Residual standard error: 60.66 on 37 degrees of freedom
Multiple R-squared: 0.8092, Adjusted R-squared: 0.7989
F-statistic: 78.47 on 2 and 37 DF, p-value: 4.893e-14

summary(hmodel)$r.squared

0.8092278

# From glance() (broom package)
glance(hmodel)$r.squared

0.8092278
Correlation and R²

(rho <- cor(houseprices$prediction, houseprices$price))

0.8995709

rho^2

0.8092278

ρ = cor(prediction, price) = 0.8995709
ρ² = 0.8092278 = R²
Correlation and R²

ρ² = R² holds for models that minimize squared error:
- linear regression
- GAM regression
- tree-based algorithms that minimize squared error

True for training data; NOT true for future application data.
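The identity is easy to verify on a small lm() fit over training data (the data frame below is invented for illustration):

```r
# Toy training data with a roughly linear relationship
df <- data.frame(x = 1:10,
                 y = c(2, 3, 5, 4, 6, 7, 9, 8, 10, 12))

model   <- lm(y ~ x, data = df)
df$pred <- predict(model, newdata = df)

rho <- cor(df$pred, df$y)

# On the training data, squared correlation equals the model's R-squared
c(rho^2, summary(model)$r.squared)
```

On held-out data this equality is not guaranteed, which is one reason to report both quantities when evaluating on a test set.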
Let's practice!
Properly Training a Model
Models can perform much better on training data than they do on future data.

Training R²: 0.9; Test R²: 0.15 (overfit)
Test/Train Split
Recommended method when data is plentiful
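A random split can be done in base R. A minimal sketch on invented data (the 75/25 fraction, the data, and the variable names are ours, not from the course):

```r
set.seed(42)                    # reproducible split
n  <- 100
df <- data.frame(x = runif(n))
df$y <- 3 * df$x + rnorm(n, sd = 0.2)

# Put roughly 75% of rows in train, the rest in test
in_train <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))
train <- df[in_train, ]
test  <- df[!in_train, ]

model <- lm(y ~ x, data = train)

# Evaluate on both sets; test RMSE is typically a bit worse than train RMSE
rmse_train <- sqrt(mean((predict(model, newdata = train) - train$y)^2))
rmse_test  <- sqrt(mean((predict(model, newdata = test)  - test$y)^2))
```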
Example: Model Female Unemployment
Train on 66 rows, test on 30 rows
Model Performance: Train vs. Test

Training: RMSE 0.71, R² 0.8
Test: RMSE 0.93, R² 0.75
Cross-Validation
Preferred when data is not large enough to split off a test set
Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(nRows, nSplits, NULL, NULL)

- nRows: number of rows in the training data
- nSplits: number of folds (partitions) in the cross-validation, e.g., nSplits = 3 for 3-fold cross-validation
- the remaining two arguments are not needed here
Create a cross-validation plan

library(vtreat)
splitPlan <- kWayCrossValidation(10, 3, NULL, NULL)

First fold (A and B to train, C to test):

splitPlan[[1]]

$train
1 2 4 5 7 9 10
$app
3 6 8

Train on A and B, test on C, etc.:

split <- splitPlan[[1]]
model <- lm(fmla, data = df[split$train, ])
df$pred.cv[split$app] <- predict(model, newdata = df[split$app, ])
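Looping over all folds fills in an out-of-sample prediction for every row. The same idea can be sketched in base R without vtreat; the data, formula, and random fold assignment below are invented for illustration:

```r
set.seed(1)
df <- data.frame(x = runif(30))
df$y <- 2 * df$x + rnorm(30, sd = 0.1)
fmla <- y ~ x

k <- 3
# Randomly assign each row to one of k folds
fold <- sample(rep(1:k, length.out = nrow(df)))

df$pred.cv <- NA_real_
for (i in 1:k) {
  app   <- which(fold == i)              # held-out rows for this fold
  model <- lm(fmla, data = df[-app, ])   # fit on the other folds
  df$pred.cv[app] <- predict(model, newdata = df[app, ])
}

# Every row now has a prediction from a model that never saw it
sqrt(mean((df$pred.cv - df$y)^2))        # cross-validated RMSE
```

This mirrors what the splitPlan loop does: each fold's `app` rows are predicted by a model trained only on the remaining rows.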
Final Model
Example: Unemployment Model

Measure type     | RMSE      | R²
train            | 0.7082675 | 0.8029275
test             | 0.9349416 | 0.7451896
cross-validation | 0.8175714 | 0.7635331
Let's practice!