Data-driven methods for predictive modelling
DGR

Sidebar outline: Modelling cultures (Explanation vs. prediction; Data-driven (algorithmic) methods) · Classification & Regression Trees (CART) (Regression trees; Sensitivity of Regression Trees; Classification trees) · Random forests (Bagging and bootstrapping; Building a random forest; Variable importance; Random forests for categorical variables; Predictor selection) · Cubist · Model tuning · Spatial random forests · Data-driven vs. model-driven methods
Data-driven methods for predictive modelling

D G Rossiter
Cornell University, Soil & Crop Sciences Section
Nanjing Normal University, Geographic Sciences Department

April 9, 2020
1 Modelling cultures
  Explanation vs. prediction
  Data-driven (algorithmic) methods
2 Classification & Regression Trees (CART)
  Regression trees
  Sensitivity of Regression Trees
  Classification trees
3 Random forests
  Bagging and bootstrapping
  Building a random forest
  Variable importance
  Random forests for categorical variables
  Predictor selection
4 Cubist
5 Model tuning
6 Spatial random forests
7 Data-driven vs. model-driven methods
Statistical modelling

• Statistics starts with data: something we have measured
• Data are generated by some (unknown) mechanism: input (stimulus) x, output (response) y
• Before analysis this is a black box to us; we only have the data itself
• Two goals of analysis:
  1 Prediction of future responses, given known inputs
  2 Explanation, understanding of what is in the "black box" (i.e., make it "white" or at least "some shade of grey")
Modelling cultures

Data modelling (also called "model-based")
• assume an empirical-statistical (stochastic) data model for the inside of the black box, e.g., a functional form such as multiple linear, exponential, hierarchical . . .
• parameterize the model from the data
• evaluate the model using model diagnostics

Algorithmic modelling (also called "data-driven")
• find an algorithm that produces y given x
• evaluate by predictive accuracy (note: not internal accuracy)

Reference: Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
Explanation vs. prediction

• Explanation
  • Testing a causal theory – why are things the way they are?
  • Emphasis is on correct model specification and coefficient estimation
  • Uses conceptual variables based on theory, which are represented by measurable variables
• Prediction
  • Predicting new (space, members of population) or future (time) observations
  • Uses measurable variables only, no need for concepts

Reference: Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
Bias/variance tradeoff

The expected prediction error (EPE) for a new observation with value x is:

  EPE = E{Y − f̂(x)}²
      = E{Y − f(x)}² + {E(f̂(x)) − f(x)}² + E{f̂(x) − E(f̂(x))}²
      = Var(Y) + Bias² + Var(f̂(x))

Model variance Var(Y): residual error with perfect model specification (i.e., noise in the relation)

Bias: mis-specification of the statistical model: f̂(x) ≠ f(x)

Estimation variance: the result of using a sample to estimate f as f̂(x)
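The decomposition above can be checked numerically. The following is a minimal sketch (not from the lecture): it repeatedly draws samples from an assumed truth f(x) = sin(2x), fits a deliberately mis-specified straight line, and estimates Bias² and Var(f̂) at a single point x0. All names (f_true, sigma, x0, n_sims) are illustrative choices.

```python
# Monte Carlo estimate of EPE = Var(Y) + Bias^2 + Var(f_hat) at one point x0.
# The straight-line fit is mis-specified for the sinusoidal truth, so Bias > 0.
import numpy as np

rng = np.random.default_rng(42)
f_true = lambda x: np.sin(2 * x)      # nature's mechanism (unknown in practice)
sigma = 0.3                           # noise sd, so Var(Y) = sigma**2
x0 = 1.0                              # prediction point
n_sims, n = 2000, 50                  # repeated samples, sample size

preds = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0, 2, n)
    y = f_true(x) + rng.normal(0, sigma, n)
    b1, b0 = np.polyfit(x, y, 1)      # slope, intercept of the linear fit
    preds[i] = b0 + b1 * x0           # f_hat(x0) for this sample

bias2 = (preds.mean() - f_true(x0)) ** 2   # {E(f_hat(x0)) - f(x0)}^2
est_var = preds.var()                      # Var(f_hat(x0))
epe = sigma**2 + bias2 + est_var           # Var(Y) + Bias^2 + Var(f_hat)
print(f"Bias^2 = {bias2:.4f}, Var(f_hat) = {est_var:.4f}, EPE = {epe:.4f}")
```

A more flexible model would shrink the bias term but inflate the estimation variance, which is the tradeoff the next slide exploits.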
Bias/variance tradeoff: explanation vs. prediction

Explanation Bias should be minimized
• correct model specification and correct coefficients → correct conclusions about the theory (e.g., causal relation)

Prediction Total EPE should be minimized
• accept some bias if that reduces the estimation variance
• a simpler model (omitting less important predictors) often predicts better than the full model
When does an underspecified model predict better than a full model?

• the data are very noisy (large σ);
• the true absolute values of the left-out parameters are small;
• the predictors are highly correlated; and
• the sample size is small or the range of left-out variables is narrow.
Problems with data modelling

• Mosteller and Tukey (1977): "The whole area of guided regression [an example of model-based inference] is fraught with intellectual, statistical, computational, and subject matter difficulties."
• It seems we understand nature if we fit a model form, but in fact our conclusions are about the model's mechanism, and not necessarily about nature's mechanism.
• So, if the model is a poor emulation of nature, the conclusions about nature may be wrong . . .
• . . . and of course the predictions may be wrong – we are incorrectly extrapolating.
The philosophy of data-driven methods

• Also called "statistical learning", "machine learning"
• Build structures to represent the "black box" without using a statistical model
• Model quality is evaluated by predictive accuracy on test sets covering the target population
• Cross-validation methods can use (part of) the original data set if an independent set is not available
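The cross-validation idea above can be sketched in a few lines. The lecture's own examples use R; this is an assumed scikit-learn equivalent on synthetic data with a threshold relation (the dataset, model and fold count are illustrative, not from the slides):

```python
# Evaluate a data-driven model by 10-fold cross-validation: each fold is held
# out in turn, the model is refit on the rest, and accuracy is measured on the
# held-out part -- predictive accuracy, not internal (training) accuracy.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, KFold

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 2))
y = np.where(X[:, 0] > 5, 3.0, 1.0) + rng.normal(0, 0.2, 200)  # step + noise

cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeRegressor(max_depth=3), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
rmse = float(-scores.mean())
print(f"10-fold CV RMSE: {rmse:.3f}")   # close to the noise sd 0.2
```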
Some data-driven methods

1 Covered in this lecture
  • Classification & Regression Trees (CART)
  • Random Forests (RF)
  • Cubist
2 Others
  • Artificial Neural Networks (ANN)
  • Support Vector Machines
  • Gradient Boosting
Key references – texts

• Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction (2nd ed.). New York: Springer. https://doi.org/10.1007%2F978-0-387-84858-7
• James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. New York: Springer. https://doi.org/10.1007%2F978-1-4614-7138-7
• Statistical Learning on-line course (based on the James et al. book): https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about
• Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. New York: Springer. https://doi.org/10.1007/978-1-4614-6849-3
Key references – papers

• Shmueli, G. (2010). To Explain or to Predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330
• Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statistical Science, 16(3), 199–231. https://doi.org/10.1214/ss/1009213726
• Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
• Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1–26.
Decision trees

• Typical uses in diagnostics (medical, automotive . . . )
• Begin with the full set of possible decisions
• Split into two (binary) subsets based on the values of some decision criterion
• Each branch has a more limited set of decisions, or at least has more information to help make a decision
• Continue recursively on both branches until there is enough information to make a decision
https://www.flickr.com/photos/dullhunk/7214525854
Classification & Regression Trees

• A type of decision tree; the decision is "what is the predicted response, given values of the predictors?"
• Aim is to predict the response (target) variable from one or more predictor variables
• If the response is categorical (class, factor) we build a classification tree
• If the response is continuous we build a regression tree
• Predictors can be any combination of categorical or continuous
Advantages of CART

• A simple model, with no statistical assumptions other than using between/within-class variance to decide on splits
• For example, no assumptions about the distribution of residuals
• So it can deal with non-linear and threshold relations
• No need to transform predictors or the response variable
• Predictive power is quantified by cross-validation; this also controls complexity to avoid over-fitting
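The "threshold relations" advantage is easy to demonstrate. Below is an assumed scikit-learn sketch (synthetic data, nothing from the lecture's datasets): a response that steps from 2 to 5 at x = 4, which a straight line cannot represent but a one-split tree recovers almost exactly.

```python
# A step (threshold) relation: linear regression vs. a single-split tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, (300, 1))
y = np.where(x[:, 0] < 4, 2.0, 5.0) + rng.normal(0, 0.1, 300)  # jump at x = 4

lin = LinearRegression().fit(x, y)
tree = DecisionTreeRegressor(max_depth=1).fit(x, y)   # one split (a "stump")

def rmse(model):
    return float(np.sqrt(np.mean((model.predict(x) - y) ** 2)))

cut = float(tree.tree_.threshold[0])                  # learned cutpoint
print(f"linear RMSE: {rmse(lin):.3f}, 1-split tree RMSE: {rmse(tree):.3f}")
print(f"learned cutpoint: {cut:.2f}")                 # close to the true 4
```

No transformation of x or y was needed; the tree finds the break directly from the between/within-group variance.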
Disadvantages of CART

• No model to interpret (although we can see variable importance)
• Predictive power over a population depends on a sample that is representative of that population
• Quite sensitive to the sample, even when pruned
• Pruning to a complexity parameter depends on 10-fold cross-validation, which is sensitive to the choice of observations in each fold
• Typically makes only a small number of different predictions ("boxes"), so maps made with it show discontinuities ("jumps")
Tree terminology

• splitting variable: variable to examine, to decide which branch of the tree to follow
• root node: variable used for the first split; overall mean and total number of observations
• interior node: splitting variable, value on which to split, mean and number of observations to be split
• leaf: predicted value, number of observations contributing to it
• cutpoint of the splitting variable: value used to decide which branch to follow
• growing the tree
• pruning the tree
Example regression tree

• Meuse River soil heavy metals dataset
• Response variable: log(Zn) concentration in topsoil
• Predictor variables:
  1 distance to Meuse river (continuous)
  2 elevation above sea level (continuous)
  3 flood frequency class (categorical, 3 classes)
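The lecture builds this tree in R (the Meuse data ship with R's sp package). As a rough stand-in, here is a scikit-learn sketch on synthetic data with the same three predictors; the coefficients and names (dist_m, elev, ffreq) are invented for illustration, not the real Meuse values.

```python
# Fit and print a shallow regression tree on Meuse-like synthetic data:
# log(Zn) decreases with distance to the river and with elevation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(7)
n = 155                                             # same n as the Meuse set
dist_m = rng.uniform(10, 1000, n)                   # distance to river (m)
elev = rng.uniform(5, 10, n)                        # elevation (m a.s.l.)
ffreq = rng.integers(1, 4, n)                       # flood frequency class 1..3
log_zn = 3.0 - 0.0008 * dist_m - 0.05 * elev + rng.normal(0, 0.1, n)

X = np.column_stack([dist_m, elev, ffreq])
tree = DecisionTreeRegressor(max_depth=2).fit(X, log_zn)
print(export_text(tree, feature_names=["dist.m", "elev", "ffreq"]))
```

As in the slides that follow, distance to the river dominates, so it is chosen for the first split.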
Example regression tree – first split

  root: 2.56, n=155
    dist.m >= 145: 2.39, n=101
    dist.m < 145: 2.87, n=54

Splitting variable: distance to river

Is the point 145 m or further from the river? 101 points yes, 54 points no.
Explanation of first split

• root: average log(Zn) of the whole dataset, 2.56 log(mg kg-1) fine soil; based on all 155 observations
• splitting variable at root: distance to river
• cutpoint at root: 145 m
• leaves:
  • distance < 145 m: 54 observations, their mean is 2.87 log(mg kg-1)
  • distance ≥ 145 m: 101 observations, their mean is 2.39 log(mg kg-1)
• the full dataset has been split into two more homogeneous subsets
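How was 145 m chosen? CART scans every candidate cutpoint of every predictor and keeps the one that most reduces the within-node variance. A minimal sketch of that search (one predictor, synthetic data with a true break at 145 m; nothing here is the real Meuse data):

```python
# Exhaustive cutpoint search: pick the split that minimizes the pooled
# within-node sum of squares, which is how a regression tree chooses 145 m.
import numpy as np

rng = np.random.default_rng(3)
dist = rng.uniform(0, 1000, 155)
log_zn = np.where(dist < 145, 2.87, 2.39) + rng.normal(0, 0.15, 155)

def within_ss(y):
    """Sum of squared deviations from the node mean."""
    return float(((y - y.mean()) ** 2).sum())

cuts = np.unique(dist)[1:]              # candidate cutpoints (data values)
sse = [within_ss(log_zn[dist < c]) + within_ss(log_zn[dist >= c]) for c in cuts]
best = float(cuts[int(np.argmin(sse))])
print(f"best cutpoint: {best:.0f} m")   # near the true break at 145 m
```

The same scan is then repeated recursively inside each leaf, which yields the second split shown next.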
Example regression tree – second split

  root: 2.56, n=155
    dist.m >= 145: 2.39, n=101
      elev >= 6.94: 2.35, n=93
      elev < 6.94: 2.84, n=8
    dist.m < 145: 2.87, n=54
      elev >= 8.15: 2.65, n=15
      elev < 8.15: 2.96, n=39

For both branches, what is the elevation of the point?

Note: that both branches split on the same variable is a coincidence in this case; different splitting variables can be used on different branches.
Explanation of second split

• interior nodes: were leaves after the first split, now 'roots' of subtrees
  • left: distance ≥ 145 m: 101 observations, their mean is 2.39 log(mg kg-1) – note the smaller mean on the left
  • right: distance < 145 m: 54 observations, their mean is 2.87 log(mg kg-1)
• splitting variable at the interior node for < 145 m: elevation
• cutpoint at the interior node for < 145 m: 8.15 m a.s.l.
• splitting variable at the interior node for ≥ 145 m: elevation
• cutpoint at the interior node for ≥ 145 m: 6.94 m a.s.l.
• leaves: 93, 8, 15, 39 observations; means 2.35, 2.84, 2.65, 2.96 log(mg kg-1)
• These leaves are now more homogeneous than the interior nodes.
Example regression tree – third split

  root: 2.56, n=155
    dist.m >= 145: 2.39, n=101
      elev >= 6.94: 2.35, n=93
        dist.m >= 230: 2.31, n=78
        dist.m < 230: 2.55, n=15
      elev < 6.94: 2.84, n=8
    dist.m < 145: 2.87, n=54
      elev >= 8.15: 2.65, n=15
      elev < 8.15: 2.96, n=39
        dist.m >= 75: 2.85, n=11
        dist.m < 75: 3.00, n=28
Example regression tree – fourth split

  root: 2.56, n=155
    dist.m >= 145: 2.39, n=101
      elev >= 6.94: 2.35, n=93
        dist.m >= 230: 2.31, n=78
          elev >= 9.03: 2.22, n=29
          elev < 9.03: 2.37, n=49
        dist.m < 230: 2.55, n=15
      elev < 6.94: 2.84, n=8
    dist.m < 145: 2.87, n=54
      elev >= 8.15: 2.65, n=15
      elev < 8.15: 2.96, n=39
        dist.m >= 75: 2.85, n=11
        dist.m < 75: 3.00, n=28
Example regression tree – fifth split
[Regression tree after the fifth split: additional splits on dist.m >= 670, dist.m >= 365, elev >= 6.99, and elev < 7.69; leaf means now range from 2.11 (n=8) to 3.08 (n=7).]
Example regression tree – maximum possible splits
[Regression tree grown to the maximum possible number of splits: splitting on dist.m, elev, and ffreq continues until most leaves hold only one or two of the 155 observations.]
How are splits decided?
1 Take all possible predictors and all possible cutpoints
2 Split the data(sub)set at all combinations
3 Compute some measure of discrimination for all these – i.e., a measure which determines which split is “best”
4 Select the predictor/split that most discriminates
Criteria for continuous and categorical response variables: see next slides
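The four steps above can be sketched in Python (a hypothetical sketch with invented names, not the rpart implementation; the discrimination measure used here is the drop in residual sum of squares, suitable for a continuous response):

```python
def best_split(X, y):
    """Try every predictor and every cutpoint; return the split that
    best discriminates, scored by the drop in residual sum of squares."""
    def rss(v):                                    # residual sum of squares
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v)

    best = None                                    # (predictor, cutpoint, score)
    for j in range(len(X[0])):                     # 1. all possible predictors
        for c in sorted(set(row[j] for row in X))[1:]:   # ...and all cutpoints
            left  = [y[i] for i, row in enumerate(X) if row[j] <  c]  # 2. split
            right = [y[i] for i, row in enumerate(X) if row[j] >= c]
            score = rss(y) - (rss(left) + rss(right))    # 3. discrimination
            if best is None or score > best[2]:          # 4. keep the best
                best = (j, c, score)
    return best

print(best_split([[1], [2], [10], [11]], [0, 0, 10, 10]))  # (0, 10, 100.0)
```

Real implementations avoid the brute-force rescan of the data at each candidate cutpoint, but the logic is the same.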
How are splits decided? – Continuous response
Select the predictor/split that most increases the between-class variance (this decreases the pooled within-class variance):

$$\sum_{\ell} \sum_{i} (y_{\ell,i} - \bar{y}_{\ell})^2$$

• $y_{\ell,i}$: value $i$ of the target in leaf $\ell$
• $\bar{y}_{\ell}$: the mean value of the target in leaf $\ell$

So the set of leaves is more homogeneous, on average, than the root.
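A minimal numeric check of this criterion, with invented values (`pooled_within_ss` is a hypothetical helper, not part of any tree library):

```python
def pooled_within_ss(leaves):
    """Sum over leaves l, then over observations i in leaf l, of
    (y_li - mean_l)^2 -- the pooled within-leaf sum of squares."""
    total = 0.0
    for leaf in leaves:
        m = sum(leaf) / len(leaf)
        total += sum((y - m) ** 2 for y in leaf)
    return total

root = [2.0, 2.2, 2.8, 3.0]           # all observations in the root node
split = [[2.0, 2.2], [2.8, 3.0]]      # the same data after one split
print(pooled_within_ss([root]))       # ≈ 0.68
print(pooled_within_ss(split))        # ≈ 0.04 – the leaves are more homogeneous
```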
How are splits decided? – Categorical response
Select the predictor/split that minimizes the impurity of the set of leaves:

• Misclassification rate: $\frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m))$
  • $N_m$: number of observations at node $m$
  • $R_m$: the set of observations at node $m$
  • $k(m)$: the majority class at node $m$; $I$ is the indicator (logical T/F) function
• Impurity is maximal when all classes have the same frequency, and minimal when only one class has any observations in the leaf

So the set of leaves is purer (less confusion), on average, than the root.
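The misclassification rate can be sketched directly (a hypothetical Python helper; `misclassification_rate` is an invented name):

```python
def misclassification_rate(labels):
    """Impurity of one node m: (1/N_m) * sum over i of I(y_i != k(m)),
    i.e. the fraction of observations not in the majority class k(m)."""
    majority = max(set(labels), key=labels.count)      # k(m)
    return sum(1 for y in labels if y != majority) / len(labels)

print(misclassification_rate(["a", "a", "a", "a"]))  # 0.0 – a pure leaf
print(misclassification_rate(["a", "a", "b", "b"]))  # 0.5 – maximal for 2 classes
```

Other impurity measures (Gini, entropy) follow the same pattern: zero for a pure node, maximal when classes are equally frequent.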
Example split (1)
> # all the possible cutpoints for distance to river
> (distances <- sort(unique(meuse$dist.m)))
  [1]   10   20   30   40   50   60   70   80  100  110  120  130  140  150
 [15]  160  170  190  200  210  220  240  260  270  280  290  300  310  320
 [29]  330  340  350  360  370  380  390  400  410  420  430  440  450  460
 [43]  470  480  490  500  520  530  540  550  560  570  630  650  660  680
 [57]  690  710  720  750  760  860 1000
> # setup (reconstructed; not shown on the original slide)
> nd <- length(distances)
> tss <- sum((meuse$zinc - mean(meuse$zinc))^2)
> results.df <- data.frame(distance = distances, rss.less = NA,
+                          rss.more = NA, rss = NA, r.squared = NA)
> for (i in 1:nd) {   # try them all
+   branch.less <- meuse$zinc[meuse$dist.m <  distances[i]]
+   branch.more <- meuse$zinc[meuse$dist.m >= distances[i]]
+   rss.less <- sum((branch.less - mean(branch.less))^2)
+   rss.more <- sum((branch.more - mean(branch.more))^2)
+   rss <- rss.less + rss.more
+   results.df[i, 2:5] <- c(rss.less, rss.more, rss, 1 - rss/tss)
+ }
> # find the best split
> ix.r.squared.max <- which.max(results.df$r.squared)
> print(results.df[ix.r.squared.max, ])
   distance rss.less rss.more      rss r.squared
13      140  7127795  3030296 10158091  0.510464
> # plot the results, highlighting the best cutpoint
> d.threshold <- results.df$distance[ix.r.squared.max]
> plot(r.squared ~ distance, data = results.df, type = "h",
+      col = ifelse(results.df$distance == d.threshold, "red", "gray"))
Example split (2): R2 vs. cutpoint – distance to river
Try to split the root node on this predictor:
Best cutpoint is 140 m; this explains 51% of the total variance
Example split (3): R2 vs. cutpoint – elevation
Try to split the root node on this predictor:
Best cutpoint is 7.48 m.a.s.l.; this only explains 35% of the total variance; so use the distance to river as the first split
Example split (4a): left first-level leaf
Try to split the left first-level leaf (101 observations):
Best cutpoint is 6.99 m.a.s.l.; this explains 93.0% of the variance in this group. Splitting at 290 m distance would explain 89.1%.
So split this leaf on elevation – it becomes an interior node
Example split (4b): right first-level leaf
Try to split the right first-level leaf (54 observations):
Best cutpoint is 8.23 m.a.s.l.; this explains 76.6% of the variance in this group. Splitting at 60 m distance would explain 72.6%.
So split on elevation – it becomes an interior node.
Controlling tree complexity
• Fitting a full tree, until there is only one observation per leaf, is always over-fitting to the sample set, and will not be a good predictor of the population.
• A full tree fits some noise as well as structure.
• Can be controlled by the analyst, or automatically by pruning (see below).
• Analyst can specify:
  • Minimum number of observations in a leaf (fewer: no split is attempted): minsplit
  • Maximum depth of tree: maxdepth
  • Minimum improvement in pooled within-class vs. between-class variance: cp (see below)
Predicting with the fitted tree
• A simple ‘model’ is applied to each leaf:
  • Response variable continuous numeric: mean of observed data in leaf
  • Categorical variable: most frequent category in leaf
• Value at a new location is predicted by running the covariate data down the tree
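Running covariate data down a fitted tree can be sketched as follows (a hypothetical Python representation of a tree as nested dicts; the cutpoint and leaf means echo the example's first split):

```python
# A fitted tree as nested dicts: interior nodes test one covariate
# against a cutpoint; leaves hold the prediction (the mean of the
# calibration observations in that leaf).
tree = {"var": "dist.m", "cut": 145,
        "ge": {"leaf": 2.39},    # dist.m >= 145: mean log10(Zn) of that branch
        "lt": {"leaf": 2.87}}    # dist.m <  145

def predict(node, covariates):
    """Run one observation's covariate data down the tree."""
    while "leaf" not in node:                     # descend until a leaf
        branch = "ge" if covariates[node["var"]] >= node["cut"] else "lt"
        node = node[branch]
    return node["leaf"]                           # leaf mean (or majority class)

print(predict(tree, {"dist.m": 300}))  # 2.39
print(predict(tree, {"dist.m": 100}))  # 2.87
```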
Fitted regression tree
[Fitted regression tree, 8 leaves: splits on dist.m >= 145, elev >= 6.94, dist.m >= 230, elev >= 9.03, elev >= 8.15, elev < 8.48, and dist.m >= 75; leaf means (log10 Zn) range from 2.22 (n=29) to 3.00 (n=28).]
Question: What is the predicted value for a point 100 m from the river and 9 m.a.s.l. elevation?
Predictions at known points
[Scatterplot: actual vs. fitted log10(Zn), Meuse topsoils, Regression Tree; both axes span 2.0–3.2.]
Note there is only one prediction per leaf; it applies to all points falling in that leaf.
Pruning – why?
• The splitting can continue until each calibration observation is in its own leaf
• This is almost always over-fitting to the current dataset
• What we want is a tree for the best prediction
• Solution: grow a full tree; then prune it back to a simpler tree with the best predictive power
• Similar to using the adjusted R2 to avoid over-fitting a multiple linear regression
Pruning – how?
• The cp “complexity parameter” value: any split that does not decrease the overall lack of fit by a factor of cp is not used.
  • Default value is 0.01 (1% increase in R2)
  • Can be set by the analyst during growing
  • Can also be used as a target for pruning
• Q: How to decide on the value of cp that gives the best predictive tree?
• A: Use the cross-validation error, also called the out-of-bag error.
  • Apply the model to the original data split K-fold (default 10), each time excluding some observations; compare predictions to actual values
• Note how this fits the philosophy of data-driven approaches: predictive accuracy is the criterion
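The K-fold procedure can be sketched generically (hypothetical Python; `fit`/`predict` stand in for growing a tree at a given cp and predicting from it, and the toy stand-ins below are invented):

```python
import random

def kfold_cv_error(xs, ys, fit, predict, k=10, seed=0):
    """K-fold cross-validation: hold out each fold in turn, fit on the
    remaining observations, and compare predictions with the held-out
    actual values.  Returns the mean squared prediction error."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # K roughly equal folds
    sq_err, n = 0.0, 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            sq_err += (predict(model, xs[i]) - ys[i]) ** 2
            n += 1
    return sq_err / n

# Toy stand-ins: 'fit' returns the training mean, 'predict' returns it:
fit_mean = lambda xs, ys: sum(ys) / len(ys)
predict_mean = lambda model, x: model
print(kfold_cv_error(list(range(20)), [0.0] * 20, fit_mean, predict_mean, k=5))  # 0.0
```

To choose cp, this error would be computed for a range of cp values and the simplest tree with near-minimal error selected.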
X-validation error vs. complexity parameter
[Plot of cross-validation relative error (y axis, 0.2–1.2) against the complexity parameter cp (lower x axis, Inf down to 0.0031) and the corresponding tree size (upper x axis, 1 to 28 leaves).]
Horizontal line is 1 standard error above the minimum error. Usually choose the largest cp below this; here cp=0.01299 (about 1.3% improvement in R2).
Full and pruned trees
[Figure: the full tree (27 leaves) on the left and the pruned tree (8 leaves) on the right.]
Full tree built with cp=0.003 = 0.3%; 27 leaves; pruned to 8 leaves (cp=0.013 = 1.3%)
Interpretation: a noisy dataset if using these two predictors
Variable importance
• Unlike with regression, we do not get any coefficient or its standard error for each predictor
• So to evaluate the importance of each predictor we see how much it is used in the tree
• Simple:
  • sum of the gain in R2 over all splits based on the predictor
• Complicated:
  • permute the predictor values;
  • use these to re-build the tree;
  • compute the cross-validation error;
  • the larger the difference, the more important the predictor
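The “complicated” permutation approach can be sketched as follows (hypothetical Python; `error_fn` stands in for re-building the tree on the permuted data and computing its cross-validation error, and the toy example below is invented):

```python
import random

def permutation_importance(X, y, error_fn, var_index, seed=0):
    """Permute one predictor's values and see how much the error grows;
    a large increase means the model relied on that predictor."""
    baseline = error_fn(X, y)
    col = [row[var_index] for row in X]
    random.Random(seed).shuffle(col)               # permute predictor values
    X_perm = [row[:var_index] + [v] + row[var_index + 1:]
              for row, v in zip(X, col)]
    return error_fn(X_perm, y) - baseline          # larger = more important

# Toy error function: a 'model' that reads only column 0,
# so column 1 is irrelevant to it:
error_fn = lambda X, y: sum((row[0] - t) ** 2 for row, t in zip(X, y))
X = [[1, 9], [2, 8], [3, 7], [4, 6], [5, 5], [6, 4]]
y = [1, 2, 3, 4, 5, 6]
print(permutation_importance(X, y, error_fn, var_index=1))  # 0 – unused predictor
```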
Variable importance – example
variable  importance
dist.m      55.5876
elev        38.9996
ffreq        5.4128
Normalized to sum to 100% of the gain in R2
Distance to river is most important.
Map predicted from Regression Tree
This tree: log(Zn) predicted from dist (45% importance); E (17%); soil (15%); N (11%); ffreq (11%).
Sensitivity of Regression Trees to sample
• Question: how sensitive are Regression Trees to the sample?
• Experiment: build trees from random samples of 140 of the 155 observations (only 10% not used!)
• How different are the optimized trees and the predictive maps?
• What is the distribution of the optimal complexity parameter and the out-of-bag (predictive) error?
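The resampling step of this experiment can be sketched as follows (hypothetical Python; `draw_subsamples` is an invented helper – each subsample would be used to re-fit and re-tune a tree):

```python
import random

def draw_subsamples(n_obs=155, n_sample=140, n_reps=50, seed=42):
    """Repeatedly draw 140 of the 155 observations without replacement;
    each subsample is then used to re-fit a tree, whose structure,
    optimal cp, and predictive map can be compared across draws."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_obs), n_sample))
            for _ in range(n_reps)]

samples = draw_subsamples()
print(len(samples), len(samples[0]))  # 50 140
```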
Sensitivity: complexity and out-of-bag error
Sensitivity: trees
Sensitivity: predictive maps
Regression trees are sensitive to the observations
• This is a problem!
• Solution: why have one tree when you can have a forest?
Classification trees
• Target variable is a categorical variable
• Example (Meuse river): flood frequency class (3 levels) predicted from distance to river and elevation
• Result (pruned): number of observations in each class (left); proportion (right) – note class 3 not predicted!
[Pruned classification tree, shown twice: splits on elev < 7.6, elev < 9.1, elev >= 8.8, elev < 8.4; the left version labels each node with the count of observations per class, the right version with the class proportions]
1 Modelling cultures
  Explanation vs. prediction
  Data-driven (algorithmic) methods
2 Classification & Regression Trees (CART)
  Regression trees
  Sensitivity of Regression Trees
  Classification trees
3 Random forests
  Bagging and bootstrapping
  Building a random forest
  Variable importance
  Random forests for categorical variables
  Predictor selection
4 Cubist
5 Model tuning
6 Spatial random forests
7 Data-driven vs. model-driven methods
Random forests – motivation
• Instead of relying on a single (hopefully best) tree, maybe it is better to fit many trees.
• But . . . how to obtain multiple regression trees if we have only one data set?
• Go into the field and collect new sample data? Too expensive and impractical.
• Split the dataset and fit trees to the separate parts? Too few observations to build a reliable tree.
• Solution: use the single sample to generate an ensemble (group) of trees; use these together to predict.
Bagging (1)
• “Bag” = a group of samples “in the bag”; the others are “out-of-bag”
• Suppose we have a large sample that is a good representation of the study area
• i.e., the sample frequency distribution is close to the population frequency distribution
• Generate a new sample by sampling from the sample!
Bagging (2) – Bootstrapping
The standard method for sampling in bagging is called bootstrapping1

• Select the same number of points as in the sample
• Sample with replacement (otherwise you get the same sample)
• So some observations are used more than once!
• But the sample is supposed to represent the population, so these could be values that would have been obtained in a new field sample.
1 for historical reasons
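The bootstrap resampling described above is language-independent; a minimal Python sketch (the function name `bootstrap_sample` is ours, not from any package — the deck's own examples use R):

```python
import random

def bootstrap_sample(n, rng):
    """Draw n indices with replacement from 0..n-1 (a bootstrap sample);
    indices never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(42)
in_bag, oob = bootstrap_sample(20, rng)
print(sorted(in_bag))  # some indices repeat
print(oob)             # the out-of-bag indices
```

Every observation is either drawn at least once (in-bag, possibly repeatedly) or never drawn (out-of-bag).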
Sampling with replacement
> # sample 20 times from (1, 2, ..., 20) with replacement
> (my.sample <- sample(1:20, 20, replace=TRUE))
 [1]  7 13  5  2  1  9 19  1  6  2  9  9 12  4 11  9  5 20 20 11
> sort(my.sample)
 [1]  1  1  2  2  4  5  5  6  7  9  9  9  9 11 11 12 13 19 20 20
> (1:20) %in% my.sample   # in bag
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[10] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[19]  TRUE  TRUE
> !((1:20) %in% my.sample)   # out-of-bag
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[10]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[19] FALSE FALSE
Example: 10 bootstrap samples from the integers 1 ... 20 – sorted
    b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
 1   1  2  1  1  2  4  2  1  1   3
 2   3  3  3  2  3  6  3  2  2   3
 3   5  3  3  2  4  6  3  4  3   5
 4   6  5  6  4  4  7  4  5  3  10
 5   7  5  6  5  7  8  6  6  5  10
 6   8  5  7  5  8 10  7  6  6  11
 7  11  7  8  7  8 10  7  6  6  13
 8  15  7  9  8  8 11  9  7  7  13
 9  15  8 13 10  9 12 10  7  8  13
10  16  8 15 10  9 13 10  8  8  14
11  16  9 15 10 11 13 13  8  9  14
12  17 12 16 10 13 14 13 10 12  14
13  17 14 16 14 13 15 14 14 12  15
14  18 14 17 16 14 16 15 17 13  16
15  18 15 17 16 16 18 15 17 13  16
16  19 15 18 17 18 18 15 18 14  16
17  19 16 19 17 19 18 16 19 14  17
18  19 17 19 19 19 19 17 20 17  19
19  19 18 20 19 19 20 17 20 19  20
20  19 18 20 19 19 20 19 20 20  20
Forests with bagging – method
• Fit a full regression tree to each bootstrap sample; do not prune
• Each bootstrap sample results in a tree and in a predicted value for any combination of values of the predictors
• Prediction is the average of the individual predictions from the “forest” of regression trees
• Jumps in predictions are smoothed; more precise predictions
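The bagging procedure can be sketched with a hand-rolled one-split regression “stump” standing in for a full CART tree (a Python illustration under made-up 1-D data; `fit_stump` and `bagged_predict` are our own names, not library functions):

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump: choose the split on x that
    minimises the within-leaf sum of squared errors."""
    best = None
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    for k in range(1, len(xs)):
        left = [ys[order[i]] for i in range(k)]
        right = [ys[order[i]] for i in range(k, len(xs))]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        split = (xs[order[k - 1]] + xs[order[k]]) / 2
        if best is None or sse < best[0]:
            best = (sse, split, ml, mr)
    _, split, ml, mr = best
    return lambda x: ml if x < split else mr

def bagged_predict(xs, ys, x_new, n_trees, rng):
    """Fit one stump per bootstrap sample; average their predictions."""
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in range(len(xs))]
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(stump(x_new))
    return sum(preds) / len(preds)

rng = random.Random(1)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]  # step near x = 4.5
pred = bagged_predict(xs, ys, 2.0, n_trees=50, rng=rng)
print(pred)
```

Because each bootstrap sample places the split slightly differently, the averaged prediction varies smoothly rather than jumping at a single threshold.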
Forest with bagging – limitations
• All predictors are tried at each split, so trees tend to be similar
• Some predictors may never enter into the trees → a missing source of diversity
• Solution: the random forest variation of bagging – two sources of randomness
• Random 1: sampling by bagging
• Random 2: choice of predictors at each split (see next)
Random forests
• Multiple samples obtained by bootstrapping, used to build trees (as in bagging)
• Average predictions over all trees (as in bagging)
• In addition, at each internal node a random subset of splitting variables (predictors) is used
• Extra source of diversity among trees
• Predictors that are “outcompeted” in bagging by stronger competitors may now enter the group of trees
Selecting predictors at each split
• randomForest, ranger parameter mtry: number of variables randomly sampled as candidates at each split.
• ranger default ⌊√p⌋, where p is the number of possible predictors
  • example: 60 predictors → ⌊√60⌋ = ⌊7.74⌋ = 7 tried at each split
• randomForest default (regression) ⌊p/3⌋
  • example: 60 predictors → ⌊60/3⌋ = 20 tried at each split
• Can be tuned, see below.
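The two default formulas above amount to a one-line computation each (a Python sketch; variable names are ours):

```python
import math

p = 60  # number of available predictors

mtry_sqrt = math.floor(math.sqrt(p))  # ranger-style default, floor(sqrt(p))
mtry_third = p // 3                   # randomForest regression default, floor(p/3)

print(mtry_sqrt, mtry_third)  # 7 20
```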
Other control parameters
• number of trees in the forest
  • ranger parameter num.trees
  • randomForest parameter ntree
  • default = 500
• minimal node size
  • ranger parameter min.node.size
  • randomForest parameter nodesize
  • default = 5
• (optional) names of variables to always try at each split; weights for sampling of training observations (to compensate for unbalanced samples)
Fitted by RF vs. observed
[Scatterplot: fitted (x-axis) vs. actual (y-axis) log10(Zn), Meuse topsoils, Random Forest; both axes roughly 2.0–3.2]

The average prediction of many trees comes close to the actual value.
Out-of-bag (“OOB”) evaluation
• In a bootstrap sample not all observations are present: sampling is with replacement.
• Sample data not in the bootstrap sample form the out-of-bag sample: these were not used to build the tree.
• These data can be used for evaluation (“validation”):
  • Use the tree fitted on the bootstrap sample to predict at the out-of-bag data, i.e., observations not used in that bootstrap sample.
  • Compute the squared prediction error for the out-of-bag data.
• This gives a very good estimate of the true prediction error if the sample was representative of the population.
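On average a bootstrap sample leaves out about 1/e ≈ 36.8% of the observations, so plenty of data remain for this evaluation. A quick Python simulation of that fraction (function and variable names are ours):

```python
import random

def oob_fraction(n, n_reps, rng):
    """Average fraction of observations left out of a bootstrap sample."""
    total = 0.0
    for _ in range(n_reps):
        in_bag = {rng.randrange(n) for _ in range(n)}
        total += (n - len(in_bag)) / n
    return total / n_reps

rng = random.Random(0)
frac = oob_fraction(n=155, n_reps=2000, rng=rng)  # n = 155, as in the Meuse data
print(round(frac, 3))  # close to 1/e ~ 0.368
```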
Out-of-bag RF predictions vs. observed
[Scatterplot: out-of-bag cross-validation estimates (x-axis) vs. actual (y-axis) log10(Zn), Meuse topsoils, Random Forest; both axes roughly 2.0–3.2]

The average prediction of the many trees not using an observation. Further from the actual value; a better estimate of predictive power.
How many trees are needed to make a forest?
• Plot mean squared out-of-bag error against the number of trees
• Check whether this is stable
• If not, increase the number of trees
[Plot: out-of-bag error (y-axis, ≈ 0.025–0.035) vs. number of trees (x-axis, 0–500), model m.lzn.rf]
Variable importance
Importance quantified by permutation accuracy:
• randomize (permute) the values of a predictor
  • so the predictor cannot have any relation with the target
• build a random forest with this randomized predictor and the other (non-randomized) ones
• compute the OOB error; compare with the OOB error without randomization
• the larger the difference, the more important the predictor
• Example – % increase in MSE under randomization:

  ffreq    9.4
  dist.m  67.5
  elev    54.0
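The permutation idea itself needs no forest; a Python sketch with a hypothetical model that uses only its first predictor (all names here are ours for illustration):

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions against targets."""
    return sum((model(row) - yi) ** 2 for row, yi in zip(X, y)) / len(y)

def permutation_importance(model, X, y, col, rng):
    """Increase in MSE after shuffling one predictor column."""
    base = mse(model, X, y)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return mse(model, X_perm, y) - base

model = lambda row: 2.0 * row[0]  # hypothetical model: ignores column 1
rng = random.Random(3)
X = [[float(i), float(i % 2)] for i in range(30)]
y = [2.0 * r[0] for r in X]

imp0 = permutation_importance(model, X, y, col=0, rng=rng)
imp1 = permutation_importance(model, X, y, col=1, rng=rng)
print(imp0, imp1)  # large increase for the used predictor, none for the unused one
```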
Variable importance plot
Partial dependence plots
The effect of each variable, with the others held constant at their means / most common class.
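Partial dependence is easy to compute for any prediction function: sweep one predictor over a grid, substitute each grid value into every observation, and average the predictions. A Python sketch with a hypothetical additive model (names are ours):

```python
def partial_dependence(model, X, col, grid):
    """For each grid value, set column `col` to that value in every
    observation, predict, and average: the partial-dependence curve."""
    curve = []
    for v in grid:
        preds = [model(row[:col] + [v] + row[col + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

model = lambda row: row[0] + 0.5 * row[1]       # hypothetical additive model
X = [[x, z] for x in range(5) for z in range(4)]
pd0 = partial_dependence(model, X, col=0, grid=[0, 1, 2])
print(pd0)  # [0.75, 1.75, 2.75] -- rises by 1 per grid step in predictor 0
```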
Two-way partial dependence
Examining the forest – at what depth in the trees are predictors used?

Earlier in the tree → most discriminating
Uncertainty of RF maps
• Recall: an RF is built from many trees; each tree makes a prediction at each location
• These are averaged to get a “best” predictive map
• However, the set of predictions can be considered a probability distribution of the true value
• From this we can make a map of any quantile, e.g., the 5% and 95% confidence limits, or the prediction interval width
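Given the set of per-tree predictions at one location, the mean and any quantile follow directly; a Python sketch with made-up per-tree values (the numbers are illustrative, not from the Meuse model):

```python
import statistics

# Hypothetical per-tree predictions of log10(Zn) at one map location
tree_preds = [2.41, 2.48, 2.52, 2.55, 2.58, 2.60, 2.63, 2.67, 2.74, 2.81]

mean_pred = statistics.fmean(tree_preds)    # the "best" (averaged) prediction
q = statistics.quantiles(tree_preds, n=20)  # 19 cut points: 5%, 10%, ..., 95%
lower, upper = q[0], q[-1]                  # a ~90% prediction interval
print(round(mean_pred, 3), round(lower, 3), round(upper, 3))
```

Mapping `lower` and `upper` at every location gives the quantile maps described above.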
RF uncertainty vs. RK uncertainty
95% prediction interval for topsoil pH
prediction from 2024 point observations and 18 covariates
Languedoc-Roussillon region (France)
References for quantile random forests
• Meinshausen, N. (2006). Quantile regression forests. Journal of Machine Learning Research, 7, 983–999.
• Meinshausen, N., & Schiesser, L. (2015). quantregForest: Quantile Regression Forests. R package. https://cran.r-project.org
• Vaysse, K., & Lagacherie, P. (2017). Using quantile regression forest to estimate uncertainty of digital soil mapping products. Geoderma, 291, 55–64. https://doi.org/10.1016/j.geoderma.2016.12.017
Random forests for categorical variables
• Target variable is categorical, i.e., a class
• Example: Meuse river flooding frequency classes (every year, every 2–5 years, rare or none)
• Final prediction is the class predicted by the majority of the classification trees in the forest
• Can also see the probability for each class, by predicting with the model with the type="prob" argument to predict.randomForest.
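Majority vote and per-class probabilities reduce to counting votes across trees; a Python sketch with hypothetical vote counts (not taken from the Meuse example):

```python
from collections import Counter

# Hypothetical class votes from a 500-tree forest at one location
votes = ["1"] * 310 + ["2"] * 150 + ["3"] * 40

counts = Counter(votes)
majority = counts.most_common(1)[0][0]                  # predicted class
probs = {c: n / len(votes) for c, n in counts.items()}  # analogue of type="prob"
print(majority, probs)
```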
Predicted class probability
Predicted most probable class
Accuracy measures
• naïve agreement: how often a class in the training set is correctly predicted – shown in a confusion matrix (“cross-classification”)
• out-of-bag (OOB) estimate of the error rate
• Gini impurity: how often a randomly chosen training observation would be incorrectly assigned . . .
  . . . if it were randomly labeled according to the frequency distribution of labels in the subset
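The Gini impurity named above can be computed directly from the class frequencies in a subset; a minimal sketch in Python (illustration only — the examples in these slides use R):

```python
from collections import Counter

def gini_impurity(labels):
    """Probability that a randomly chosen observation would be
    mislabelled if labels were assigned at random according to the
    class frequencies in the subset: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["a", "a", "a", "a"]))   # 0.0 -- a pure node
print(gini_impurity(["a", "a", "b", "b"]))   # 0.5 -- an even 2-class mix
```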
Cross-classification matrix
A confusion matrix (a.k.a. cross-classification matrix) of actual (columns) vs. predicted (rows) classes:
Confusion matrix:
      1   2   3  class.error
  1  77   7   0  0.08333333
  2   3  40   5  0.16666667
  3   1   9  13  0.43478261
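The class.error column is each row's off-diagonal count divided by its row total; a quick check of the numbers above (Python, illustration only):

```python
def class_errors(confusion):
    """Per-class error rate: off-diagonal count in each row divided by
    the row total."""
    return [(sum(row) - row[i]) / sum(row)
            for i, row in enumerate(confusion)]

cm = [[77, 7, 0],      # the matrix from the slide
      [3, 40, 5],
      [1, 9, 13]]
print([round(e, 8) for e in class_errors(cm)])
# [0.08333333, 0.16666667, 0.43478261]
```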
Predictor selection
• Problem: a large number of possible predictors can lead to . . .
  • computational inefficiency
  • difficult interpretation of variable importance
  • meaningless good fits, even when using cross-validation²
• Solution 1: expert selection from “known” relations
  • this is then not pure “data mining” for unsuspected relations
• Solution 2: (semi-)automatic feature selection, see next.

² Wadoux, A. M. J.-C., et al. (2019). A note on knowledge discovery and machine learning in digital soil mapping. European Journal of Soil Science, 71, 133–136. https://doi.org/10.1111/ejss.12909
Feature selection methods
Wrapper methods: “evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.”
  • risk of over-fitting
  • high computational load

Filter methods: “evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion”
  • does not account for correlation among predictors
  • does not directly assess model performance
Recursive feature elimination
• A “wrapper” method
• Implemented in the caret::rfe “Backwards Feature Selection” function
• Algorithm: “Recursive Feature Elimination (RFE) incorporating resampling”
  1 Partition the data into training/test sets via resampling
  2 Start with the full model; compute variable importance
  3 For each proposed subset size:
    1 re-compute the model with the reduced variable set
    2 calculate performance profiles using the test samples
  4 Determine the optimum number of predictors
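The wrapper loop can be sketched with a toy stand-in for model-based importance and performance — here plain |correlation|, an assumption made only so the example is self-contained; caret::rfe refits the actual model under resampling:

```python
import random

def corr(x, y):
    """Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def rfe(predictors, y, sizes, score):
    """Rank predictors once by |correlation with y| (a toy stand-in for
    model-based variable importance), score each subset size, keep the
    best-performing subset."""
    ranked = sorted(predictors,
                    key=lambda c: abs(corr(predictors[c], y)), reverse=True)
    results = {k: score([predictors[c] for c in ranked[:k]], y)
               for k in sizes}
    best = max(results, key=results.get)
    return ranked[:best], results

random.seed(42)
x1 = [random.gauss(0, 1) for _ in range(200)]
x2 = [random.gauss(0, 1) for _ in range(200)]        # pure noise
y = [a + random.gauss(0, 0.1) for a in x1]
# toy "performance" measure: mean |correlation| of the kept predictors
keep, _ = rfe({"x1": x1, "x2": x2}, y, sizes=[1, 2],
              score=lambda cols, yy: sum(abs(corr(c, yy))
                                         for c in cols) / len(cols))
print(keep)  # ['x1'] -- the noise predictor is eliminated
```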
Reference for feature selection
• From the documentation of the caret package (§5).
• Feature selection: https://topepo.github.io/caret/feature-selection-overview.html
• Recursive feature elimination: https://topepo.github.io/caret/recursive-feature-elimination.html
1 Modelling cultures
    Explanation vs. prediction
    Data-driven (algorithmic) methods
2 Classification & Regression Trees (CART)
    Regression trees
    Sensitivity of Regression Trees
    Classification trees
3 Random forests
    Bagging and bootstrapping
    Building a random forest
    Variable importance
    Random forests for categorical variables
    Predictor selection
4 Cubist
5 Model tuning
6 Spatial random forests
7 Data-driven vs. model-driven methods
Cubist
• Similar to CART, but instead of a single value at each leaf it fits a multivariate linear regression to the cases in the leaf
• Advantage vs. CART: predictions are continuous, not restricted to a discrete set of values, one per leaf of the regression tree.
  • can also be improved with nearest-neighbours, see below
• Advantage vs. RF: the model can be interpreted, to a certain extent.
• Disadvantage: its algorithm is not easy to understand; however, its results are generally quite good.
Refinements to Cubist
• “Committees” of models: a sequence of models, where each corrects the errors of the previous one
• Nearest-neighbours adjustment: modify the model result at a prediction point using some number K of neighbours in feature (predictor) space:

  y' = \frac{1}{K} \sum_{i=1}^{K} w_i \left[ t_i + (\hat{y} - \hat{t}_i) \right]   (1)

where \hat{y} is the model prediction at the target point, t_i is the actual value of neighbour i, \hat{t}_i is its value predicted by the model tree(s), and w_i is the weight given to this neighbour, based on its distance D_i from the target point. The weights are computed as w_i = 1/(D_i + 0.5) and normalized to sum to one.
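The adjustment can be sketched as follows, reading eq. (1) as a weighted average in which the normalized weights take the place of the 1/K averaging; the actual Cubist implementation may differ in detail:

```python
def nn_adjust(y_hat, neighbours):
    """Nearest-neighbour adjustment in the spirit of eq. (1).
    neighbours: list of (t, t_hat, d): the neighbour's actual value t,
    the model's prediction t_hat for it, and its distance d from the
    target point in predictor space."""
    raw = [1.0 / (d + 0.5) for (_, _, d) in neighbours]
    total = sum(raw)
    weights = [r / total for r in raw]        # normalized to sum to one
    # each neighbour votes with its actual value, shifted by how much the
    # model's prediction at the target differs from its prediction there
    return sum(w * (t + (y_hat - t_hat))
               for w, (t, t_hat, _) in zip(weights, neighbours))

# if the model predicted every neighbour perfectly, the adjustment
# returns the distance-weighted average of y_hat itself
print(nn_adjust(2.5, [(2.3, 2.3, 0.1), (2.7, 2.7, 0.4)]))  # ~2.5
```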
Example cubist model
Rule 1/1: [66 cases, mean 2.288309, range 2.053078 to 2.89098, err 0.103603]
  if x > 179095, dist > 0.211846
  then outcome = 2.406759 - 0.32 dist

Rule 1/2: [9 cases, mean 2.596965, range 2.330414 to 2.832509, err 0.116378]
  if x <= 179095, dist > 0.211846
  then outcome = -277.415278 + 0.000847 y + 0.56 dist

Rule 1/3: [80 cases, mean 2.772547, range 2.187521 to 3.264582, err 0.157513]
  if dist <= 0.211846
  then outcome = 2.632508 - 2.1 dist - 2.4e-05 x + 1.4e-05 y

Rule 2/1: [45 cases, mean 2.418724, range 2.10721 to 2.893762, err 0.182228]
  if x <= 179826, ffreq in {2, 3}
  then outcome = 128.701732 - 0.000705 x

Rule 2/2: [121 cases, mean 2.443053, range 2.053078 to 3.055378, err 0.181513]
  if dist > 0.0703468
  then outcome = 30.512065 - 0.87 dist - 0.000154 x

Rule 2/3: [55 cases, mean 2.543648, range 2.075547 to 3.055378, err 0.125950]
  if dist > 0.0703468, ffreq = 1
  then outcome = 37.730889 - 0.000314 x - 0.35 dist + 6.5e-05 y

Rule 2/4: [34 cases, mean 2.958686, range 2.574031 to 3.264582, err 0.139639]
  if dist <= 0.0703468
  then outcome = 2.982852 - 0.36 dist
Map predicted by Cubist
[Figure: map of the optimized Cubist prediction; colour scale 2.0–3.2 logZn]
Model tuning
• Data-driven models have parameters that control their behaviour and can significantly affect their predictive power.
  • CART: complexity parameter
  • randomForest: number of predictors to try at each split; minimum number of observations in a leaf; number of trees in the forest
    • too many predictors → trees too uniform, loss of diversity; too few → highly-variable trees, poor predictions
    • too few observations per leaf → imprecise prediction; too many → over-fitting
    • too few trees → sub-optimal model; too many trees → wasted computation
  • Cubist: number of committees; number of nearest neighbours
• The model can be tuned to optimize the selection of these.
Model tuning – flow chart
Source: Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. New York: Springer; figure 4.4
Model tuning – algorithm
1 For each combination of parameters to be optimized:
  1 Split the dataset into some disjoint subsets, for example 10, by random sampling.
  2 For each subset:
    1 Fit the model with the selected parameters on all but one of the subsets (training subsets).
    2 Predict at the remaining subset, i.e., the one not used for model building, with the fitted model.
    3 Compute the goodness-of-fit statistics of the fit to the test subset, e.g., root mean square error (RMSE) of prediction; squared correlation coefficient between the actual and fitted values, i.e., R² against a 1:1 line.
  3 Average the statistics over the disjoint test subsets.
2 Search the table of results for the best result, e.g., lowest RMSE, highest R².
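The algorithm above can be sketched end-to-end with a toy tunable model — a 1-D k-nearest-neighbour regression, chosen only so the example is self-contained (illustration only; in practice caret does this for real models):

```python
import random

def rmse(actual, predicted):
    """Root mean square error of prediction."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted))
            / len(actual)) ** 0.5

def knn_predict(train, x0, k):
    """1-D k-nearest-neighbour regression: mean y of the k nearest x."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def tune_k(data, candidates, folds=10, seed=1):
    """For each candidate k: fit on all but one fold, predict the
    held-out fold, average the RMSE over folds, then pick the best k."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::folds] for i in range(folds)]   # disjoint subsets
    results = {}
    for k in candidates:
        errs = []
        for test_ids in parts:
            held_out = set(test_ids)
            train = [data[i] for i in idx if i not in held_out]
            preds = [knn_predict(train, data[i][0], k) for i in test_ids]
            errs.append(rmse([data[i][1] for i in test_ids], preds))
        results[k] = sum(errs) / len(errs)
    return min(results, key=results.get), results

rng = random.Random(42)
xs = [rng.uniform(0, 1) for _ in range(120)]
data = [(x, 2 * x + rng.gauss(0, 0.3)) for x in xs]
best, table = tune_k(data, candidates=[1, 3, 5, 15, 50])
print(best, round(table[best], 3))
```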
Model tuning – R implementation
• caret: “Classification And REgression Training” package
  • Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26.
  • https://topepo.github.io/caret/index.html
  • can tune 200+ models; some built-in, some by calling the appropriate package
• Method:
  1 set up a vector or matrix with the parameter values to test, e.g., all combinations of 1 . . . 3 splitting variables and 1 . . . 10 observations per leaf
  2 run the model for all of these and collect the cross-validation statistics
  3 select the best one and build a final model
Model tuning example – random forest (1)
> ranger.tune <- train(x = preds, y = response, method = "ranger",
                       tuneGrid = expand.grid(.mtry = 1:3,
                                              .splitrule = "variance",
                                              .min.node.size = 1:10),
                       trControl = trainControl(method = "cv"))
> print(ranger.tune)

## Resampling: Cross-Validated (10 fold)
## Resampling results across tuning parameters:
##
##   mtry  min.node.size  RMSE      Rsquared   MAE
##   1     1              199.7651  0.8862826  156.1662
##   1     2              200.5215  0.8851154  156.3225
##   1     3              200.6421  0.8854146  156.2801
##   ...
##   3     8              201.9809  0.8793349  158.7097
##   3     9              202.9065  0.8781754  159.7739
##   3     10             202.5687  0.8788200  159.5980
##
## RMSE was used to select the optimal model
## Final values: mtry = 2, min.node.size = 6.
Model tuning example – random forest (2)
[Figure: cross-validated RMSE (≈ 198–204) vs. minimal node size (1–10), one curve per number of randomly selected predictors (1, 2, 3)]
Find the minimum RMSE; but favour simpler models (fewerpredictors, larger nodes) if not too much difference
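The “favour simpler models” rule can be made concrete by taking the simplest candidate whose RMSE is within a tolerance of the minimum (caret offers a similar tolerance-based selection; the 2 % threshold and the single complexity score here are assumptions for illustration):

```python
def simplest_within_tolerance(results, tol=0.02):
    """results: (complexity, rmse) pairs, lower complexity = simpler.
    Return the lowest complexity whose RMSE is within a relative
    tolerance of the minimum RMSE."""
    best = min(r for _, r in results)
    ok = [(c, r) for c, r in results if r <= best * (1 + tol)]
    return min(ok)[0]

# hypothetical tuning results, complexity collapsed to a single score
results = [(1, 199.8), (2, 198.9), (3, 199.1), (4, 201.2)]
print(simplest_within_tolerance(results))  # 1 -- within 2 % of the best
```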
Model tuning example – Cubist (1)
> cubist.tune <- train(x = all.preds, y = all.resp, method = "cubist",
                       tuneGrid = expand.grid(.committees = 1:12,
                                              .neighbors = 0:5),
                       trControl = trainControl(method = "cv"))

## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 139, 139, 140, 139, 139, 139, ...
## Resampling results across tuning parameters:
##
##   committees  neighbors  RMSE       Rsquared   MAE
##   1           0          0.1898596  0.6678588  0.1405553
##   1           1          0.1764705  0.6953460  0.1189364
##   1           2          0.1654910  0.7296723  0.1163660
##   1           3          0.1623381  0.7425831  0.1163285
##   1           4          0.1631900  0.7453506  0.1192963
##   ...
##   12          3          0.1599994  0.7533962  0.1139932
##   12          4          0.1584434  0.7617762  0.1153331
##   12          5          0.1589143  0.7622337  0.1165942
##
## RMSE was used to select the optimal model using the smallest value.
## The final values: committees = 10, neighbors = 4.
Model tuning example – Cubist (2)
[Figure: cross-validated RMSE (left) and R² (right) vs. number of committees (1–12), one curve per number of instances (neighbours) 0–5]
Adding one neighbour reduces predictive power; adding 2 . . . increases it; 3 is close to the optimum.
Committees improve predictive power; 3 is the optimum.
Spatial random forests
• Random forests can use coördinates and distances to geographic features as predictors
  • e.g., E, N, distance to river, distance to a single point . . .
• Can also use distances to multiple points as predictors
  • Distance buffers: distance to the closest point with some range of values
  • Common approach: compute quantiles of the response variable and one buffer for each
  • Each sample point has a distance to the closest point in each quantile
• This uses separation between point-pairs of differentvalues, but with no model.
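The buffer computation can be sketched directly (Python, illustration only; `quantile_buffers` is a hypothetical helper name):

```python
import math

def quantile_buffers(points, values, grid, n_quantiles=4):
    """For each grid location, the distance to the closest sample point
    whose value falls in each quantile of the response: one 'buffer'
    layer per quantile, usable as random forest predictors."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = math.ceil(len(order) / n_quantiles)
    groups = [order[i:i + size] for i in range(0, len(order), size)]
    return [[min(math.hypot(gx - points[i][0], gy - points[i][1])
                 for i in group)
             for gx, gy in grid]
            for group in groups]

pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
vals = [1, 2, 3, 4]          # the two low values sit on the y = 0 edge
layers = quantile_buffers(pts, vals, grid=[(0, 0)], n_quantiles=2)
print(layers)   # [[0.0], [1.0]]
```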
log(Zn) distribution – 16 quantiles
[Figure: histogram of logZn (counts 0–20 over the range 2.0–3.0), divided into 16 quantiles]
Distance to closest point in each quantile
[Figure: 16 maps (layer.1 . . . layer.16), one per quantile: the distance from each grid cell to the closest sample point in that quantile, roughly 0–2000 m]
Regression tree on 16 distance buffers
[Figure: regression tree on the 16 distance buffers; splits on layers 1, 3, 5, 9, 12–16; leaf means range from about 2.1 to 3.2 logZn]
Random forest prediction on 16 distance buffers

[Figure: actual vs. fitted logZn (both axes ≈ 2.0–3.0 log(mg kg⁻¹)); left: random forest fit, right: out-of-bag prediction]
OOB error vs. OK cross-validation error
[Figure: actual vs. predicted logZn; left: ordinary kriging (OK) cross-validation fit, right: random forest (RF) out-of-bag prediction]
Note that RF does not use any model of spatial autocorrelation!
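A comparison like the one on this slide can be sketched as follows. The slide compares RF out-of-bag error against ordinary kriging cross-validation; lacking a kriging fit here, this hedged sketch (scikit-learn, synthetic data) instead compares the forest's OOB predictions against 10-fold cross-validation of the same forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))            # e.g. 16 distance-buffer predictors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, 200)

# OOB: each sample predicted only by trees that did not see it in training
rf = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
oob_pred = rf.oob_prediction_             # per-sample out-of-bag predictions

# 10-fold cross-validation predictions for comparison
cv_pred = cross_val_predict(
    RandomForestRegressor(n_estimators=200, random_state=0), X, y, cv=10)

rmse = lambda p: float(np.sqrt(np.mean((p - y) ** 2)))
print(f"OOB RMSE: {rmse(oob_pred):.3f}  CV RMSE: {rmse(cv_pred):.3f}")
```

The two error estimates are usually close, which is why OOB error is often used as a cheap substitute for cross-validation.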
Random forest map on 16 distance buffers
[Map: random forest prediction of Zn, log(mg kg−1); legend 2.2–3.2]
Resembles the OK map, but no geostatistical model was used.
Compare with Ordinary Kriging
Difference: spatial RF − OK
Reference for spatial random forests
• Hengl, T., Nussbaum, M., Wright, M. N., Heuvelink, G. B. M., & Gräler, B. (2018). Random forest as a generic framework for predictive modeling of spatial and spatio-temporal variables. PeerJ, 6, e5518. https://doi.org/10.7717/peerj.5518
1 Modelling cultures
  Explanation vs. prediction
  Data-driven (algorithmic) methods
2 Classification & Regression Trees (CART)
  Regression trees
  Sensitivity of Regression Trees
  Classification trees
3 Random forests
  Bagging and bootstrapping
  Building a random forest
  Variable importance
  Random forests for categorical variables
  Predictor selection
4 Cubist
5 Model tuning
6 Spatial random forests
7 Data-driven vs. model-driven methods
Conclusion: Data-driven vs. model-based methods

• Data-driven: main aim is predictive power
  • Individual trees can be interpreted, but forests cannot (one can only see variable importance, not the choice of splitting variables or their cutpoints)
• Model-based: main aim is understanding processes
  • We hope the model is a simplified representation of the process that produced the observations
  • If the model is correct, predictions will be accurate
Conclusion: limitations
• Data-driven methods depend on their training observations
  • They have no way to extrapolate, or even interpolate, to unobserved areas in feature space
  • So the observations should cover the entire range of the population
• Model-based methods depend on a correct empirical-statistical model
  • The model is derived from training observations, but many models are possible
  • Various model-selection techniques exist
  • Wrong model → poor predictions and incorrect understanding of processes
End