xgboost - Universitetet i Oslo (folk.uio.no/geirs/STK9200/John_Xgboost.pdf)
Xgboost
John M. Aiken
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
Introduction
2http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
We want to predict a person's age from whether they play video games, enjoy gardening, and like wearing hats. Our objective is to minimize squared error.
Easy example
PersonID Age LikesGardening PlaysVideoGames LikesHats
1 13 FALSE TRUE TRUE
2 14 FALSE TRUE FALSE
3 15 FALSE TRUE FALSE
4 25 TRUE TRUE TRUE
5 35 FALSE TRUE TRUE
6 49 TRUE FALSE FALSE
7 68 TRUE TRUE TRUE
8 71 TRUE FALSE FALSE
9 73 TRUE FALSE TRUE
Feature FALSE TRUE
LikesGardening {13, 14, 15, 35} {25, 49, 68, 71, 73}
PlaysVideoGames {49, 71, 73} {13, 14, 15, 25, 35, 68}
LikesHats {14, 15, 49, 71} {13, 25, 35, 68, 73}
Decision tree analysis
This tree ignores video games and hats
Predictions from first tree
[Figure: the first tree splits on LikesGardening; its leaf predictions are {19.25, 19.25, 19.25, 19.25} for FALSE and {57.2, 57.2, 57.2, 57.2, 57.2} for TRUE, i.e. the mean age of each group.]
PersonID Age Tree1 Prediction Tree1 Residual
1 13 19.25 -6.25
2 14 19.25 -5.25
3 15 19.25 -4.25
4 25 57.2 -32.2
5 35 19.25 15.75
6 49 57.2 -8.2
7 68 57.2 10.8
8 71 57.2 13.8
9 73 57.2 15.8
But now we can fit a new model on the residuals of the first model!
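The two trees above can be reproduced in a few lines of plain Python (a sketch of the worked example, where each "tree" is just a single split whose leaf value is the group mean):

```python
# Toy gradient boosting for squared error: tree 1 predicts mean age per
# LikesGardening group; tree 2 predicts mean *residual* per PlaysVideoGames group.
ages = [13, 14, 15, 25, 35, 49, 68, 71, 73]
likes_gardening   = [False, False, False, True, False, True, True, True, True]
plays_video_games = [True, True, True, True, True, False, True, False, False]

def mean(xs):
    return sum(xs) / len(xs)

# Tree 1: split on LikesGardening, predict the mean age in each leaf.
leaf = {
    flag: mean([a for a, g in zip(ages, likes_gardening) if g == flag])
    for flag in (True, False)
}
tree1 = [leaf[g] for g in likes_gardening]           # 19.25 or 57.2

# The residuals of tree 1 become the targets for tree 2.
residuals = [a - p for a, p in zip(ages, tree1)]

# Tree 2: split on PlaysVideoGames, predict the mean residual in each leaf.
leaf2 = {
    flag: mean([r for r, v in zip(residuals, plays_video_games) if v == flag])
    for flag in (True, False)
}
tree2 = [leaf2[v] for v in plays_video_games]        # -3.567 or 7.133

combined = [p1 + p2 for p1, p2 in zip(tree1, tree2)]
```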
Predictions from second tree
PersonID Age Tree2 Prediction
1 13 -3.567
2 14 -3.567
3 15 -3.567
4 25 -3.567
5 35 -3.567
6 49 7.133
7 68 -3.567
8 71 7.133
9 73 7.133
[Figure: the second tree splits on PlaysVideoGames; its leaf predictions are {7.13, 7.13, 7.13} for FALSE and {-3.57, -3.57, -3.57, -3.57, -3.57, -3.57} for TRUE, i.e. the mean residual of each group.]
F(x) = Tree 1 (predicting age) + Tree 2 (predicting residuals)
A first step in gradient boosting, but not there yet
PersonID Age Tree1 Prediction Tree1 Residual Tree2 Prediction Combined Prediction Final Residual
1 13 19.25 -6.25 -3.567 15.68 -2.683
2 14 19.25 -5.25 -3.567 15.68 -1.683
3 15 19.25 -4.25 -3.567 15.68 -0.683
4 25 57.2 -32.2 -3.567 53.63 -28.63
5 35 19.25 15.75 -3.567 15.68 19.32
6 49 57.2 -8.2 7.133 64.33 -15.33
7 68 57.2 10.8 -3.567 53.63 14.37
8 71 57.2 13.8 7.133 64.33 6.667
9 73 57.2 15.8 7.133 64.33 8.667
A more formalized algorithm
1. Fit a model to the data: F1(x) = y
2. Compute the residuals: r1 = y - F1(x)
3. Fit a model to the residuals: h1(x)
4. Create a new model: F2(x) = F1(x) + h1(x)
This process can be generalized as:
F(x) = F1(x) + h1(x) + … → FM(x) = FM-1(x) + hM-1(x)
And the model can be initialized with the average of y (the constant that minimizes the MSE)
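The generalized recursion above is short to code for squared error. Here is a minimal sketch (all names are mine, not from the slides) using single-split threshold "stumps" as the weak learners, initialized with the mean, and with a shrinkage factor eta already included:

```python
# Minimal gradient boosting for squared error with threshold stumps.
def fit_stump(x, r):
    """Find the threshold split on x that best fits the residuals r."""
    best = None
    for t in sorted(set(x)):
        left  = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - (lv if xi <= t else rv)) ** 2 for xi, ri in zip(x, r))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def boost(x, y, M=20, eta=0.5):
    f0 = sum(y) / len(y)                   # initialize with the mean (argmin of MSE)
    preds = [f0] * len(y)
    stumps = []
    for _ in range(M):
        r = [yi - pi for yi, pi in zip(y, preds)]   # residual = negative gradient
        h = fit_stump(x, r)
        stumps.append(h)
        preds = [pi + eta * h(xi) for pi, xi in zip(preds, x)]
    return lambda xi: f0 + sum(eta * h(xi) for h in stumps)

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [13, 14, 15, 25, 35, 49, 68, 71, 73]
F = boost(x, y)
mse = sum((yi - F(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
```

After a handful of rounds the training MSE drops far below the variance of y, which is exactly the "each new model fixes the previous one's residuals" behavior described above.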
But wait, there's more
1. Fit a model to the data: F1(x) = y
2. Compute the residuals: r1 = y - F1(x)
3. Fit a model to the residuals: h1(x)
4. Create a new model: F2(x) = F1(x) + h1(x)
This part is much more complicated than just "fit h(x) to r". In gradient boosting we generalize this step: instead of fitting the residuals, we fit the gradient of an arbitrary loss function (for squared error the negative gradient is exactly the residual, which is why this example is a regression).
Friedman, Jerome H. "Greedy function approximation: A gradient boosting machine." Annals of Statistics (2001): 1189-1232.
At this point I found it helpful to actually write out how you get FM(x) for a small M. The gradient of the loss function plays the same role the residuals did before, but it is more general:

● F0 is the initialization
● m = 1, ..., M indexes the trees; each m is a single tree out of the M trees
● ỹ is the negative gradient of the loss function, evaluated at the previous iteration's predictions
● a_m are the model parameters; for a tree-based model this is how each decision node should make a split
● β is a weighting term for each node
● ρ_m is the weighting term for each newly fitted h
● F_m is the combined model at this iteration
● F_{m-1} is the model from the previous iteration
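Putting the symbols above together, the generic algorithm can be written out as (my transcription of Algorithm 1 in Friedman 2001):

```latex
\begin{align*}
& F_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho) \\
& \text{For } m = 1, \dots, M: \\
& \quad \tilde{y}_i = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, N \\
& \quad a_m = \arg\min_{a,\, \beta} \sum_{i=1}^{N} \left[ \tilde{y}_i - \beta\, h(x_i; a) \right]^2 \\
& \quad \rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\!\left(y_i,\; F_{m-1}(x_i) + \rho\, h(x_i; a_m)\right) \\
& \quad F_m(x) = F_{m-1}(x) + \rho_m\, h(x; a_m)
\end{align*}
```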
Note:
● We did not specify what kind of model fits the data
● We did not specify what loss function we minimize

Thus gradient boosting basically works with anything you want. Xgboost, however, uses tree-based models.
Xgboost adds a bunch of stuff
Xgboost objective functions
● reg:squarederror: regression with squared loss
● reg:squaredlogerror: regression with squared log loss
● reg:logistic: logistic regression
● binary:logistic: logistic regression for binary classification, output probability
● binary:logitraw: logistic regression for binary classification, output score before logistic transformation
● binary:hinge: hinge loss for binary classification; this makes predictions of 0 or 1, rather than producing probabilities
● count:poisson: Poisson regression for count data, output mean of Poisson distribution
● survival:cox: Cox regression for right-censored survival time data (negative values are considered right-censored). Note that predictions are returned on the hazard-ratio scale
● multi:softmax: set XGBoost to do multiclass classification using the softmax objective; you also need to set num_class (number of classes)
● multi:softprob: same as softmax, but outputs a vector of ndata * nclass, which can be reshaped to an ndata x nclass matrix containing the predicted probability of each data point belonging to each class
● rank:pairwise: use LambdaMART to perform pairwise ranking where the pairwise loss is minimized
● rank:ndcg: use LambdaMART to perform list-wise ranking where Normalized Discounted Cumulative Gain (NDCG) is maximized
● rank:map: use LambdaMART to perform list-wise ranking where Mean Average Precision (MAP) is maximized
● reg:gamma: gamma regression with log-link; output is the mean of a gamma distribution. Useful, e.g., for modeling insurance claim severity, or any outcome that might be gamma-distributed
● reg:tweedie: Tweedie regression with log-link. Useful, e.g., for modeling total loss in insurance, or any outcome that might be Tweedie-distributed
Xgboost: Regularized loss
Instead of only using the MSE, we can add a ridge-regression-style regularization penalty.
Chen, Tianqi, and Carlos Guestrin. "XGBoost: A scalable tree boosting system." KDD 2016.
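In symbols, the paper's regularized objective is the training loss plus a per-tree penalty, where T is the number of leaves in a tree and w its vector of leaf weights:

```latex
\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^2
```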
Xgboost shrinkage
Shrinkage scales newly added weights by a factor η after each step of tree boosting. Similar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.
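In update form this is just one multiplication (a sketch; the numbers are made up, and in the xgboost library the factor is the eta / learning_rate parameter):

```python
# Shrinkage: damp each new tree's contribution before adding it to the ensemble.
eta = 0.1                  # assumed shrinkage factor
ensemble_pred = 40.0       # F_{m-1}(x) for one sample
new_tree_pred = 6.0        # f_m(x), the newly fitted tree's output
ensemble_pred = ensemble_pred + eta * new_tree_pred   # F_m(x)
```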
Xgboost column subsampling
Features are randomly subsampled and provided to each tree for fitting. This is common in random forests.
Additionally, xgboost allows for row subsampling, so that each tree fits on a subsample of the training data instead of using all of it.
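A sketch of what column and row subsampling amount to per tree (the fractions are illustrative, mirroring xgboost's colsample_bytree and subsample parameters):

```python
import random

random.seed(0)

n_rows, n_cols = 100, 10
colsample_bytree = 0.5   # fraction of features each tree sees
subsample = 0.8          # fraction of training rows each tree sees

# Per tree: draw a random subset of columns and of rows, then fit on just those.
cols_for_tree = random.sample(range(n_cols), int(colsample_bytree * n_cols))
rows_for_tree = random.sample(range(n_rows), int(subsample * n_rows))
```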
Xgboost: finding the split
● The best way to find a split is to try every split and use the best one; this is called the "exact greedy algorithm".
● However, compute and data size do not always support this procedure.
● Thus there is an approximate algorithm for finding splits; I'm not going to talk about this...
Finding the optimum structure for a single tree
If you assume the structure of a tree is fixed then the optimum weight in each leaf and the resulting objective value for the tree is:
So the weight for leaf j in each tree is minus the ratio of the summed first derivatives of the loss function, evaluated at the previous iteration's model, to the summed second derivatives (plus the regularization term λ).
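In symbols (from the Chen & Guestrin paper, with G_j and H_j the sums of first and second derivatives of the loss over the instances in leaf j), the optimal weight and resulting objective for a fixed tree structure q are:

```latex
w_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
```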
Calculating leaf weights
That is, if this leaf's prediction is much better than before, it should be weighted much more.
The G ends up squared because the optimal weight w* = -G/(H + λ) is substituted back into the quadratic objective.
Reading the objective term by term: the whole expression should be as small as possible; the bigger a leaf's G_j^2/(H_j + λ) term is, the more value that leaf has; and the bigger the γT term is, the more leaves there are.
Thus you want the smallest number of leaves with the largest ratio of residuals.
But this is only one leaf, and there are many leaf combinations
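The split-gain formula that follows from this objective (the criterion the exact greedy algorithm evaluates for every candidate split, from the paper; L and R denote the instance sets of the left and right children) is:

```latex
\mathcal{L}_{\text{split}} = \frac{1}{2}\left[
\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda}
- \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}
\right] - \gamma
```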
Xgboost: sparse data
Missing data, frequent zeroes, and artifacts of feature engineering can all cause data to be sparse.
Xgboost learns a default direction for each node from the non-missing entries: it calculates the loss for each of the two possible default directions and chooses the better one (a node can only have two default directions, left or right).
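A sketch of the default-direction idea in plain Python (illustrative names and a squared-error stand-in for the real gain computation; the actual sparsity-aware algorithm scores both directions while scanning only non-missing entries):

```python
# Toy version: route all missing-value rows down whichever branch of a node
# gives the lower total squared error against that leaf's mean.
def mean(xs):
    return sum(xs) / len(xs)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def choose_default_direction(left_vals, right_vals, missing_vals):
    best = None
    for direction in ("left", "right"):
        l = left_vals + missing_vals if direction == "left" else left_vals
        r = right_vals + missing_vals if direction == "right" else right_vals
        total = sse(l) + sse(r)
        if best is None or total < best[0]:
            best = (total, direction)
    return best[1]
```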
Xgboost implementations
● Available in R, Python, JVM languages, Ruby, and Julia
● The Python version meets the scikit-learn API, so you can drop it in to replace a sklearn model
● Can be deployed on distributed systems via AWS, Kubernetes, Spark, etc.
In practice...
Predicting round trip times for packets in wifi networks
All sorts of things can impact networks; we want to predict from passive network statistics like:
● Overlap from other networks
● Network usage
● Noise per antenna
● Distance the device is from the router
● etc.
Xgboost completely outperforms other models
Linear Regression - typical OLS model no special effects
Random Forest - uses 1000 estimators, max depth of 3
xgboost - uses 1000 estimators, max depth of 3
German socio-economy panel
● Annual longitudinal survey since 1984 on topics like economics, sociology, psychology, political science
● Survey drop out has increased over time (25% in 1984, 45% in 2015) and understanding why is important to characterizing the quality of survey response
● In comparison to logistic regression and similar baselines, xgboost performs similarly to RF
Figure 3. Performance Curves in Test Set (y = Refusal in GSOEP Wave 2014)
Kern, Christoph, Thomas Klausch, and Frauke Kreuter. "Tree-based Machine Learning Methods for Survey Research." Survey Research Methods. Vol. 13. No. 1. 2019.
Predicting drop out in survey responses
Conclusions
● Xgboost provides a good top-line model for prediction in many, many situations where you have tabular data
● Xgboost can handle missing data and data larger than memory
● Xgboost is perfect and can do anything
  ○ Really, this is false, but it's pretty impressive what it can do
To put it in "plainer" terms:
1. Initialize the model by minimizing the loss function (the average if it's MSE); this gives you F0(x)
2. For each model m out of M models (defined as Fm(x) for each model):
   a. Fit a function to the derivative of the loss function of the previous model; this is your ỹ
   b. Calculate the parameters for the model; this is your a (if it's a tree, the split parameters)
   c. Calculate a controlling weight term ρ
   d. Finish with the model Fm(x) = Fm-1(x) + ρ h(x, a)
The ultimate model ensemble is the linear combination of all these updates, that is, FM(x).