xgboost - Universitetet i Oslo (folk.uio.no/geirs/STK9200/John_Xgboost.pdf)
Xgboost
John M. Aiken
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
Introduction
2http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
We want to predict a person's age from whether they play video games, enjoy gardening, and like wearing hats. Our objective is to minimize squared error.
Easy example
PersonID Age LikesGardening PlaysVideoGames LikesHats
1 13 FALSE TRUE TRUE
2 14 FALSE TRUE FALSE
3 15 FALSE TRUE FALSE
4 25 TRUE TRUE TRUE
5 35 FALSE TRUE TRUE
6 49 TRUE FALSE FALSE
7 68 TRUE TRUE TRUE
8 71 TRUE FALSE FALSE
9 73 TRUE FALSE TRUE
Feature FALSE TRUE
LikesGardening {13, 14, 15, 35} {25, 49, 68, 71, 73}
PlaysVideoGames {49, 71, 73} {13, 14, 15, 25, 35, 68}
LikesHats {14, 15, 49, 71} {13, 25, 35, 68, 73}
Decision tree analysis
This tree ignores video games and hats
Predictions from first tree
[Figure: the first tree splits on LikesGardening; its leaf predictions are {19.25, 19.25, 19.25, 19.25} for FALSE and {57.2, 57.2, 57.2, 57.2, 57.2} for TRUE, i.e. the mean age of each group.]
PersonID Age Tree1 Prediction Tree1 Residual
1 13 19.25 -6.25
2 14 19.25 -5.25
3 15 19.25 -4.25
4 25 57.2 -32.2
5 35 19.25 15.75
6 49 57.2 -8.2
7 68 57.2 10.8
8 71 57.2 13.8
9 73 57.2 15.8
But now we can fit a new model on the residuals of the first model!
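The two trees above can be reproduced in a few lines of plain Python (a sketch of the worked example, where each "tree" is just a single split whose leaf value is the group mean):

```python
# Toy gradient boosting for squared error: tree 1 predicts mean age per
# LikesGardening group; tree 2 predicts mean *residual* per PlaysVideoGames group.
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
likes_gardening   = [False, False, False, True, False, True, True, True, True]
plays_video_games = [True, True, True, True, True, False, True, False, False]

def mean(xs):
    return sum(xs) / len(xs)

# Tree 1: split on LikesGardening, predict the mean age in each leaf.
leaf = {
    flag: mean([a for a, g in zip(ages, likes_gardening) if g == flag])
    for flag in (True, False)
}
tree1 = [leaf[g] for g in likes_gardening]           # 19.25 or 57.2

# The residuals of tree 1 become the targets for tree 2.
residuals = [a - p for a, p in zip(ages, tree1)]

# Tree 2: split on PlaysVideoGames, predict the mean residual in each leaf.
leaf2 = {
    flag: mean([r for r, v in zip(residuals, plays_video_games) if v == flag])
    for flag in (True, False)
}
tree2 = [leaf2[v] for v in plays_video_games]        # -3.567 or 7.133

combined = [p1 + p2 for p1, p2 in zip(tree1, tree2)]
```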
Predictions from second tree
PersonID Age Tree2 Prediction
1 13 -3.567
2 14 -3.567
3 15 -3.567
4 25 -3.567
5 35 -3.567
6 49 7.133
7 68 -3.567
8 71 7.133
9 73 7.133
[Figure: the second tree splits on PlaysVideoGames; its leaf predictions are {7.13, 7.13, 7.13} for FALSE and {-3.57, -3.57, -3.57, -3.57, -3.57, -3.57} for TRUE, i.e. the mean residual of each group.]
F(x) = Tree 1 (predicting age) + Tree 2 (predicting residuals)
A first step in gradient boosting, but not there yet
PersonID Age Tree1 Prediction Tree1 Residual Tree2 Prediction Combined Prediction Final Residual
1 13 19.25 -6.25 -3.567 15.68 -2.683
2 14 19.25 -5.25 -3.567 15.68 -1.683
3 15 19.25 -4.25 -3.567 15.68 -0.683
4 25 57.2 -32.2 -3.567 53.63 -28.63
5 35 19.25 15.75 -3.567 15.68 19.32
6 49 57.2 -8.2 7.133 64.33 -15.33
7 68 57.2 10.8 -3.567 53.63 14.37
8 71 57.2 13.8 7.133 64.33 6.667
9 73 57.2 15.8 7.133 64.33 8.667
A more formalized algorithm
1. Fit a model to the data: F1(x) = y
2. Compute the residuals: r1 = y - F1(x)
3. Fit a model to the residuals: h1(x)
4. Create a new model: F2(x) = F1(x) + h1(x)
This process can be generalized as:
F(x) = F1(x) + h1(x) + … → FM(x) = FM-1(x) + hM-1(x)
And the model can be initialized with the average of y (the constant that minimizes the MSE)
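The generalized recursion above is short to code for squared error. Here is a minimal sketch (all names are mine, not from the slides) using single-split threshold "stumps" as the weak learners, initialized with the mean, and with a shrinkage factor eta already included:

```python
# Minimal gradient boosting for squared error with threshold stumps.
def fit_stump(x, r):
    """Find the threshold split on x that best fits the residuals r."""
    best = None
    for t in sorted(set(x)):
        left  = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (lv if xi <= t else rv)) ** 2 for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def boost(x, y, M=20, eta=0.5):
    f0 = sum(y) / len(y)                   # initialize with the mean (argmin of MSE)
    preds = [f0] * len(y)
    stumps = []
    for _ in range(M):
        r = [yi - pi for yi, pi in zip(y, preds)]   # residual = negative gradient
        h = fit_stump(x, r)
        stumps.append(h)
        preds = [pi + eta * h(xi) for pi, xi in zip(preds, x)]
    return lambda xi: f0 + sum(eta * h(xi) for h in stumps)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [13, 14, 15, 25, 35, 49, 68, 71, 73]
F = boost(x, y)
mse = sum((yi - F(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

After a handful of rounds the training MSE drops far below the variance of y, which is exactly the "each new model fixes the previous one's residuals" behavior described above.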
But wait, there's more
1. Fit a model to the data: F1(x) = y
2. Compute the residuals: r1 = y - F1(x)
3. Fit a model to the residuals: h1(x)
4. Create a new model: F2(x) = F1(x) + h1(x)
This part is much more complicated than just "fit h(x) to r". In gradient boosting we generalize this step: instead of fitting the residuals, we fit the gradient of an arbitrary loss function (for squared error the negative gradient is exactly the residual, which is why this example is a regression).
Friedman, Jerome H. "Greedy function approximation: A gradient boosting machine." Annals of Statistics (2001): 1189-1232.
At this point I found it helpful to actually write out how you get FM(x) for a small M. The gradient of the loss function plays the same role the residuals did before, but it is more general:

● F0 is the initialization
● m = 1, ..., M indexes the trees; each m is a single tree out of the M trees
● ỹ is the negative gradient of the loss function, evaluated at the previous iteration's predictions
● a_m are the model parameters; for a tree-based model this is how each decision node should make a split
● β is a weighting term for each node
● ρ_m is the weighting term for each newly fitted h
● F_m is the combined model at this iteration
● F_{m-1} is the model from the previous iteration
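Putting the symbols above together, the generic algorithm can be written out as (my transcription of Algorithm 1 in Friedman 2001):

```latex
\begin{align*}
& F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho) \\
& \text{For } m = 1, \dots, M: \\
& \quad \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, N \\
& \quad a_m = \arg\min_{a,\, \beta} \sum_{i=1}^{N} \left[ \tilde{y}_i - \beta\, h(x_i; a) \right]^2 \\
& \quad \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\!\left(y_i,\; F_{m-1}(x_i) + \rho\, h(x_i; a_m)\right) \\
& \quad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)
\end{align*}
```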
Note:
● We did not specify what kind of model fits the data
● We did not specify what loss function we minimize

Thus gradient boosting basically works with anything you want. Xgboost, however, uses tree-based models.
Xgboost adds a bunch of stuff
Xgboost objective functions
● reg:squarederror: regression with squared loss
● reg:squaredlogerror: regression with squared log loss
● reg:logistic: logistic regression
● binary:logistic: logistic regression for binary classification, output probability
● binary:logitraw: logistic regression for binary classification, output score before logistic transformation
● binary:hinge: hinge loss for binary classification; this makes predictions of 0 or 1, rather than producing probabilities
● count:poisson: Poisson regression for count data, output mean of Poisson distribution
● survival:cox: Cox regression for right-censored survival time data (negative values are considered right-censored). Note that predictions are returned on the hazard-ratio scale
● multi:softmax: set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes)
● multi:softprob: same as softmax, but outputs a vector of ndata * nclass, which can be reshaped to an ndata x nclass matrix containing the predicted probability of each data point belonging to each class
● rank:pairwise: use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
● rank:ndcg: use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
● rank:map: use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
● reg:gamma: gamma regression with log-link; output is the mean of a gamma distribution. Useful, e.g., for modeling insurance claim severity, or any outcome that might be gamma-distributed
● reg:tweedie: Tweedie regression with log-link. Useful, e.g., for modeling total loss in insurance, or any outcome that might be Tweedie-distributed
Xgboost: Regularized loss
Instead of only using the MSE, we can add a ridge-regression-style regularization penalty.
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." KDD 2016.
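In symbols, the paper's regularized objective is the training loss plus a per-tree penalty, where T is the number of leaves in a tree and w its vector of leaf weights:

```latex
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^2
```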
Xgboost shrinkage
Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.
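In update form this is just one multiplication (a sketch; the numbers are made up, and in the xgboost library the factor is the eta / learning_rate parameter):

```python
# Shrinkage: damp each new tree's contribution before adding it to the ensemble.
eta = 0.1                  # assumed shrinkage factor
ensemble_pred = 40.0       # F_{m-1}(x) for one sample
new_tree_pred = 6.0        # f_m(x), the newly fitted tree's output
ensemble_pred = ensemble_pred + eta * new_tree_pred   # F_m(x)
```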
Xgboost column subsampling
Features are randomly subsampled and provided to each tree for fitting. This is common in random forests.
Additionally, xgboost allows for row subsampling, so that each tree fits on a subsample of the training data instead of using all of it.
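A sketch of what column and row subsampling amount to per tree (the fractions are illustrative, mirroring xgboost's colsample_bytree and subsample parameters):

```python
import random

random.seed(0)

n_rows, n_cols = 100, 10
colsample_bytree = 0.5   # fraction of features each tree sees
subsample = 0.8          # fraction of training rows each tree sees

# Per tree: draw a random subset of columns and of rows, then fit on just those.
cols_for_tree = random.sample(range(n_cols), int(colsample_bytree * n_cols))
rows_for_tree = random.sample(range(n_rows), int(subsample * n_rows))
```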
Xgboost: finding the split
● The best way to find a split is to try every split and use the best one; this is called the "exact greedy algorithm".
● However, compute and data size do not always support this procedure.
● Thus there is an approximate algorithm for finding splits; I'm not going to talk about this...
Finding the optimum structure for a single tree
If you assume the structure of a tree is fixed then the optimum weight in each leaf and the resulting objective value for the tree is:
So the weight for leaf j in each tree is minus the ratio of the summed first derivatives of the loss function, evaluated at the previous iteration's model, to the summed second derivatives (plus the regularization term λ).
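In symbols (from the Chen & Guestrin paper, with G_j and H_j the sums of first and second derivatives of the loss over the instances in leaf j), the optimal weight and resulting objective for a fixed tree structure q are:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
```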
Calculating leaf weights
That is, if this leaf's prediction is much better than before, it should be weighted much more.
The G ends up squared because the optimal weight w* = -G/(H + λ) is substituted back into the quadratic objective.
Reading the objective term by term: the whole expression should be as small as possible; the bigger a leaf's G_j^2/(H_j + λ) term is, the more value that leaf has; and the bigger the γT term is, the more leaves there are.
Thus you want the smallest number of leaves with the largest ratio of residuals.
But this is only one leaf, and there are many leaf combinations
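The split-gain formula that follows from this objective (the criterion the exact greedy algorithm evaluates for every candidate split, from the paper; L and R denote the instance sets of the left and right children) is:

```latex
\mathcal{L}_{\text{split}} = \frac{1}{2}\left[
\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
- \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```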
Xgboost: sparse data
Missing data, frequent zeroes, and artifacts of feature engineering can all cause data to be sparse.
Xgboost learns a default direction for each node from the non-missing entries: it calculates the loss for each of the two possible default directions and chooses the better one (a node can only have two default directions, left or right).
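A sketch of the default-direction idea in plain Python (illustrative names and a squared-error stand-in for the real gain computation; the actual sparsity-aware algorithm scores both directions while scanning only non-missing entries):

```python
# Toy version: route all missing-value rows down whichever branch of a node
# gives the lower total squared error against that leaf's mean.
def mean(xs):
    return sum(xs) / len(xs)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def choose_default_direction(left_vals, right_vals, missing_vals):
    best = None
    for direction in ("left", "right"):
        l = left_vals + missing_vals if direction == "left" else left_vals
        r = right_vals + missing_vals if direction == "right" else right_vals
        total = sse(l) + sse(r)
        if best is None or total < best[0]:
            best = (total, direction)
    return best[1]
```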
Xgboost implementations
● Available in R, Python, JVM languages, Ruby, and Julia
● The Python version meets the scikit-learn API, so you can drop it in to replace a sklearn model
● Can be deployed on distributed systems via AWS, Kubernetes, Spark, etc.
In practice...
Predicting round trip times for packets in wifi networks
All sorts of things can impact networks; we want to predict from passive network statistics like:
● Overlap from other networks
● Network usage
● Noise per antenna
● Distance the device is from the router
● etc.
Xgboost completely outperforms other models
Linear Regression - typical OLS model no special effects
Random Forest - uses 1000 estimators, max depth of 3
xgboost - uses 1000 estimators, max depth of 3
German socio-economy panel
● Annual longitudinal survey since 1984 on topics like economics, sociology, psychology, political science
● Survey drop out has increased over time (25% in 1984, 45% in 2015) and understanding why is important to characterizing the quality of survey response
● In comparison to logistic regression and similar baselines, xgboost performs similarly to RF
Figure 3. Performance Curves in Test Set (y = Refusal in GSOEP Wave 2014)
Kern, Christoph, Thomas Klausch, and Frauke Kreuter. "Tree-based Machine Learning Methods for Survey Research." Survey Research Methods. Vol. 13. No. 1. 2019.
Predicting drop out in survey responses
Conclusions
● Xgboost provides a good top-line model for prediction in many, many situations where you have tabular data
● Xgboost can handle missing data and data larger than memory
● Xgboost is perfect and can do anything
  ○ Really, this is false, but it's pretty impressive what it can do
To put it in "plainer" terms:
1. Initialize the model by minimizing the loss function (the average if it's MSE); this gives you F0(x)
2. For each model m out of M models (defined as Fm(x) for each model):
   a. Fit a function to the derivative of the loss function of the previous model; this is your ỹ
   b. Calculate the parameters for the model; this is your a (if it's a tree, the split parameters)
   c. Calculate a controlling weight term ρ
   d. Finish with the model Fm(x) = Fm-1(x) + ρ h(x, a)
The ultimate model ensemble is the linear combination of all these updates, that is, FM(x).