Page 1

Statistisches Data Mining (StDM) Woche 12

Oliver Dürr, Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften, [email protected], Winterthur, 6 December 2016


Page 2

Multitasking lowers learning efficiency:
•  No laptops during the theory lessons: lids closed or almost closed (sleep mode)

Page 3

Overview of classification (until the end of the semester)

Classifiers

•  K-Nearest-Neighbors (KNN)
•  Logistic Regression
•  Linear Discriminant Analysis
•  Support Vector Machine (SVM)
•  Classification Trees
•  Neural Networks (NN)
•  Deep Neural Networks (e.g. CNN, RNN)
•  …

Evaluation

•  Cross validation
•  Performance measures
•  ROC analysis / lift charts

Combining classifiers

•  Bagging
•  Random Forest
•  Boosting

Theoretical Guidance / General Ideas

•  Bayes classifier
•  Bias-variance trade-off (overfitting)

Feature Engineering

•  Feature extraction
•  Feature selection

Page 4

Overview of Ensemble Methods

•  Many instances of the same classifier
–  Bagging (bootstrapping & aggregating)
•  Create “new” data sets using the bootstrap
•  Train a classifier on each new data set
•  Average over all predictions (e.g. take a majority vote)
–  Bagged Trees
•  Use bagging with decision trees
–  Random Forest
•  Bagged trees with a special trick
–  Boosting
•  An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
•  Combining different classifiers
–  Weighted averaging over predictions
–  Stacking classifiers (see the sketch below)
•  Use the output of several classifiers as input for a new classifier
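The stacking idea is only named on the slide; below is a hedged R sketch of my own (the data set and model choices are assumptions, not from the slides). Two base learners produce probability predictions, and a logistic regression on top uses them as inputs. For brevity the base predictions are in-sample; proper stacking would use out-of-fold (cross-validated) predictions.

library(rpart)
library(MASS)                       # Pima.tr: binary outcome `type` (my choice of example data)
d <- Pima.tr
# Level-1 (base) learners: a tree and LDA, each predicting P(type = "Yes")
p_tree <- predict(rpart(type ~ ., data = d), d)[, "Yes"]
p_lda  <- predict(lda(type ~ ., data = d), d)$posterior[, "Yes"]
# Level-2 model: logistic regression on the base-learner outputs
stack_df <- data.frame(type = d$type, p_tree = p_tree, p_lda = p_lda)
stacked  <- glm(type ~ p_tree + p_lda, data = stack_df, family = binomial)
summary(stacked)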

Page 5

Bagging / Random Forest (Chapter 8.2 in ISLR)

Page 6

Bagging: Bootstrap Aggregating

[Figure: Bagging schematic (source: Tan, Steinbach, Kumar). Step 1: create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D via the bootstrap. Step 2: build a classifier C1, …, Ct on each data set. Step 3: combine the classifiers into C* by aggregating (mean or majority vote).]

Page 7

Why does it work?

•  Suppose there are 25 base classifiers
•  Each classifier has error rate ε = 0.35
•  Assume the classifiers are independent (that is the hardest assumption to satisfy)
•  Take a majority vote
•  The majority vote is wrong if 13 or more classifiers are wrong
•  Number of wrong predictions: X ~ Bin(size = 25, p = 0.35)

> 1 - pbinom(12, size = 25, prob = 0.35)
[1] 0.06044491

$$P(X \ge 13) \;=\; \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} \;\approx\; 0.06$$

Page 8

Why does it work?

•  25 Base Classifiers

Ensembles are only better than a single classifier if each classifier is better than random guessing!
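Not on the slides: a small R sketch that repeats the binomial calculation for a whole range of base-classifier error rates ε. It reproduces the 0.06 from the previous page at ε = 0.35 and shows that for ε > 0.5 the majority vote is worse than a single classifier.

# Majority-vote error of 25 independent base classifiers with error rate eps
ens_err <- function(eps, n = 25) 1 - pbinom(floor(n / 2), size = n, prob = eps)
ens_err(0.35)                               # 0.06044491, as on the previous page
eps <- seq(0.05, 0.95, by = 0.05)
plot(eps, ens_err(eps), type = "b",
     xlab = "error rate of a single base classifier",
     ylab = "error rate of the majority vote")
abline(0, 1, lty = 2)                       # diagonal: a single classifier on its own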

Page 9

Reminder: Trees

library(rpart)
fit <- rpart(Kyphosis ~ ., data = kyphosis)
plot(fit)
text(fit, use.n = TRUE)
pred <- predict(fit, newdata = kyphosis)   # class probabilities
> head(pred, 4)
     absent   present
1 0.4210526 0.5789474
2 0.8571429 0.1428571
3 0.4210526 0.5789474
4 0.4210526 0.5789474

Page 10

Bagging trees

•  Decision trees suffer from high variance!
–  If we randomly split the training data into two parts and fit a decision tree on each part, the results could be quite different
–  Averaging reduces variance
•  For independent estimates: var ≈ σ²/n (not strictly true here; a quick numerical check follows below)
•  Bagging for trees
–  Take B bootstrap samples from the training set
–  Train a decision tree on each bootstrap sample
–  For the test set, take the majority vote of all trained classifiers (or average their probabilities)
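Not from the slide: a quick numerical check (with made-up numbers) of the var ≈ σ²/n claim for the average of n independent estimates.

set.seed(1)
sigma <- 2; n <- 10
single   <- rnorm(10000, sd = sigma)                        # one noisy estimate
averaged <- replicate(10000, mean(rnorm(n, sd = sigma)))    # average of n independent estimates
c(var(single), var(averaged), sigma^2 / n)                  # ~4, ~0.4, 0.4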

[Figure: the same bagging schematic as on Page 6 (Steps 1–3: create multiple data sets, build multiple classifiers, combine classifiers).]

Page 11

Reminder: Bootstrapping

•  Bootstrapping: generate new data sets by resampling the observed data set (each of equal size to the observed data set), obtained by random sampling with replacement from the original data set.

[FIGURE 5.11 (ISLR): A graphical illustration of the bootstrap approach on a small sample containing n = 3 observations (columns Obs, X, Y). Each bootstrap data set Z*1, Z*2, …, Z*B contains n observations, sampled with replacement from the original data set Z. Each bootstrap data set is used to obtain an estimate α̂*1, α̂*2, …, α̂*B of α.]

We compute the standard error of these bootstrap estimates using the formula

$$\mathrm{SE}_B(\hat{\alpha}) \;=\; \sqrt{\frac{1}{B-1} \sum_{r=1}^{B} \left( \hat{\alpha}^{*r} - \frac{1}{B} \sum_{r'=1}^{B} \hat{\alpha}^{*r'} \right)^{2}} \qquad (5.8)$$

This serves as an estimate of the standard error of α̂ estimated from the original data set. The bootstrap approach is illustrated in the center panel of Figure 5.10, which displays a histogram of 1,000 bootstrap estimates of α, each computed using a distinct bootstrap data set. This panel was constructed on the basis of a single data set, and hence could be created using real data. Note that the histogram looks very similar to the left-hand panel, which displays the idealized histogram of the estimates of α obtained by generating 1,000 simulated data sets from the true population. In particular the bootstrap estimate SE(α̂) from (5.8) is 0.087, very close to the estimate of 0.083 obtained using 1,000 simulated data sets. The right-hand panel displays the information in the center and left panels in a different way, via boxplots of the estimates for α obtained by generating 1,000 simulated data sets from the true population and using the bootstrap approach. Again, the boxplots … (Text excerpt from ISLR, Chapter 5.)

Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7
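Not on the slide: a hand-rolled bootstrap standard error in R that mirrors formula (5.8); the statistic playing the role of α̂ is, purely for illustration, the median of a small simulated sample.

set.seed(42)
x <- rnorm(30)                     # the "observed" data set
B <- 1000
alpha_star <- replicate(B, median(sample(x, replace = TRUE)))   # B bootstrap estimates
sd(alpha_star)                     # sd() uses the 1/(B-1) form, i.e. exactly SE_B from (5.8)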

Page 12

Solution to the exercise: bagging by hand

library(rpart)
library(MASS)    # for the Boston data set
Boston$crim = Boston$crim > median(Boston$crim)

# Split into training and test set
idt = sample(nrow(Boston), size = floor(nrow(Boston) * 3/4))
data.train = Boston[idt, ]
data.test  = Boston[-idt, ]

# Single tree
fit = rpart(as.factor(crim) ~ ., data = data.train)
sum(predict(fit, data.test, type = 'class') == Boston$crim[-idt]) / nrow(data.test)  # ~0.88

# Bagging by hand
n.trees = 100
preds = rep(0, nrow(data.test))
for (j in 1:n.trees) {
  # Draw the bootstrap sample
  index.train = sample(nrow(data.train), size = nrow(data.train), replace = TRUE)
  # Fit a tree to the bootstrap sample
  fit = rpart(as.factor(crim) ~ ., data = data.train[index.train, ])
  # Predict the test set
  tree.fit.test = predict(fit, data.test, type = 'class')
  preds = preds + ifelse(tree.fit.test == TRUE, 1, -1)   # majority vote via the +/-1 trick
}
sum(Boston$crim[-idt] == (preds > 0)) / nrow(data.test)  # ~0.96

Page 13

How to classify a new observation?

[Figure: three bagged trees, t = 1, t = 2, t = 3]

Two ways to derive the ensemble result:
1)  Each tree has a “winner class”: take the class which was most often the winner.
2)  Average the probabilities over all trees (see the sketch below).
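A hedged R sketch (not on the slide) contrasting the two rules. It assumes a list `trees` of fitted rpart classification trees, e.g. collected in the bagging loop on Page 12, and a test set `data.test`; these names are assumptions for illustration.

# One column per tree: predicted probability of the second class / predicted winner class
prob_per_tree  <- sapply(trees, function(tr) predict(tr, data.test)[, 2])
class_per_tree <- sapply(trees, function(tr)
  as.character(predict(tr, data.test, type = "class")))

# 1) Majority vote over the per-tree winner classes
vote_class <- apply(class_per_tree, 1, function(z) names(which.max(table(z))))

# 2) Average the probabilities, then pick the class with the larger average
avg_prob  <- rowMeans(prob_per_tree)
avg_class <- colnames(predict(trees[[1]], data.test))[1 + (avg_prob > 0.5)]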

Page 14

A Comparison of Error Rates

•  Here the green line represents a simple majority-vote approach.
•  The purple line corresponds to averaging the probability estimates.
•  Both do far better than a single tree (dashed red) and get close to the Bayes error rate (dashed grey).

Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7

Page 15

Example 2: Car Seat Data

[Figure: test error rate (roughly 0.15–0.40) as a function of the number of bootstrap data sets (0–100).]

•  The red line represents the test error rate using a single tree.
•  The black line corresponds to the bagging error rate using the majority vote, while the blue line averages the probabilities.

Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7

Page 16

Out-of-Bag Estimation: “Cross-validation on the fly”

•  Since bootstrapping involves random selection (with replacement) of observations to build each training data set, the observations that were not selected can serve as test data for that tree.
•  On average, each bagged tree makes use of around 2/3 of the observations, so about 1/3 of the observations are left over for testing (a short calculation follows below).

Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7
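Not on the slide: where the “about 1/3 out of bag” comes from. The probability that a particular observation is never drawn in a bootstrap sample of size n is (1 − 1/n)^n, which approaches e^(−1) ≈ 0.368.

n <- 506                 # e.g. the number of observations in the Boston data
(1 - 1/n)^n              # ~0.366: chance an observation is NOT in a given bootstrap sample
exp(-1)                  # large-n limit, ~0.368, i.e. roughly 1/3 of the data is out of bag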

Page 17

The importance of independence

•  It is a quite strong assumption that all classifiers are independent.
•  Since the same features are used in each bag, the resulting classifiers are quite dependent.
–  We do not really get new classifiers or trees
–  Because the bagged trees are highly correlated, averaging them brings only a limited reduction in variance
•  De-correlate by any means! ⇒ Random Forest

Page 18

Random Forest

Page 19

Random Forests

•  It is a very efficient statistical learning method.
•  It builds on the idea of bagging, but provides an improvement because it de-correlates the trees.
•  How does it work?
–  Build a number of decision trees on bootstrapped training samples, but when building these trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors (usually m ≈ √p).

Page 20

Random Forest: Learning algorithm

Sample m variables out of the p available; as usual for trees, choose the best of these m variables (and split point) for the split. (A simplified sketch follows below.)
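The algorithm figure itself is not reproduced in this transcript. Below is a heavily simplified R sketch of my own (function and argument names are made up): it draws the m candidate variables once per tree, whereas the real random forest redraws m variables at every split, which rpart cannot do directly.

library(rpart)

rf_by_hand <- function(response, data, B = 100,
                       m = floor(sqrt(ncol(data) - 1))) {
  predictors <- setdiff(names(data), response)
  lapply(1:B, function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample
    vars <- sample(predictors, m)                        # random subset of m of the p predictors
    rpart(reformulate(vars, response = response), data = boot)
  })
}
# e.g. forest <- rf_by_hand("chd", transform(SAheart, chd = factor(chd)))  # SAheart from ElemStatLearn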

Page 21

Why are we considering a random sample of m predictors instead of all p predictors for splitting?

•  Suppose that we have one very strong predictor in the data set along with a number of other moderately strong predictors; then in the collection of bagged trees, most or all of them will use the very strong predictor for the first split!
•  All bagged trees will look similar. Hence all the predictions from the bagged trees will be highly correlated.
•  Averaging many highly correlated quantities does not lead to a large variance reduction; random forests therefore “de-correlate” the bagged trees, leading to a greater reduction in variance.
•  This makes the individual trees weaker but enhances the overall performance.

Page 22

Random Forest with different values of “m”

•  Notice that when random forests are built using m = p, this simply amounts to bagging.

[FIGURE 8.10 (ISLR): Results from random forests for the fifteen-class gene expression data set with p = 500 predictors. The test classification error is displayed as a function of the number of trees (0–500). Each colored line corresponds to a different value of m (m = p, m = p/2, m = √p), the number of predictors available for splitting at each interior tree node. Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7%.]

Excerpt from ISLR (Chapter 8): Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

Consider first the regression setting. Like bagging, boosting involves combining a large number of decision trees, f̂1, …, f̂B. Boosting is described in Algorithm 8.2. What is the idea behind this procedure? Unlike fitting a single large decision tree to the data, which amounts to fitting the data hard and potentially overfitting, the boosting approach instead learns slowly. Given the current model, we fit a decision tree to the residuals from the model. That is, we fit a tree using the current residuals, rather than the outcome Y, as the response. We then add this new decision tree into the fitted function in order to update the residuals. Each of these trees can be rather small, with just a few terminal nodes, determined by the parameter d in the algorithm. By fitting small trees to the residuals, we slowly improve f̂ in areas where it does not perform well. The shrinkage parameter λ slows the process down …

Page 23

Random Forest in R (basic)

library(randomForest)
fit = randomForest(as.factor(crim) ~ ., data = data.train)
predict(fit, data.test)

####
# Parameters
randomForest(as.factor(crim) ~ ., data = data.train, ntree = 100, mtry = 42)
# ntree: number of trees grown
# mtry:  number of features considered at each split; if set to p ⇒ bagged trees

Page 24

Variable Importance Measure

•  Bagging typically improves the accuracy over prediction using a single tree, but it is now hard to interpret the model!

•  We have hundreds of trees, and it is no longer clear which variables are most important to the procedure

•  Thus bagging improves prediction accuracy at the expense of interpretability

•  But, we can still get an overall summary of the importance of each predictor using Relative Influence Plots

Slide taken from Al Sharif: http://www.alsharif.info/#!iom530/c21o7

Page 25

Variable Importance 1: performance loss by permutation

Example out-of-bag (oob) data with features v1 … v7:

v1 v2 v3 v4 v5 v6 v7
 0  1  1  2  0  1  0
 0  2  2  1  2  0  1
 1  0  0  1  1  2  0
 1  0  0  1  1  0  2
 0  2  1  0  2  0  1

Determine importance-1 for v4:
1)  Obtain the standard oob performance.
2)  Randomly permute the values of v4 in the oob samples and compute the oob performance again. If the variable is not important, nothing happens.
3)  Use the decrease in performance as the importance measure of v4.

Page 26

Variable Importance 2: score improvement at each split

[Figure: several trees of the forest with splits on variables v1, v3, v4, v5, v6, …]

At each split where the variable, e.g. v3, is used, the improvement of the score (decrease of the Gini index) is measured. The average over all these v3-involving splits in all trees is the importance-2 measure of v3.

Page 27

varImp

library(randomForest)
library(ElemStatLearn)
heart = SAheart
# importance = TRUE, otherwise no accuracy-based importance is computed
fit = randomForest(as.factor(chd) ~ ., data = heart, importance = TRUE)
predict(fit, newdata = heart, type = 'prob')
varImpPlot(fit)

Page 28

Interpreting Variable Importance

•  RF in the standard implementation is biased towards continuous variables.
•  Problems with correlations:
–  If two variables are highly correlated, removing one yields no degradation in performance
–  The same holds for the Gini-index measure
–  This is not specific to random forests: correlations often pose problems

Page 29

Proximity: similarity between two observations according to the supervised RF

Given a pair of observations in the test data, the random forest determines their proximity by counting in how many trees both observations end up in the same leaf. Since the RF was built in a supervised mode, the proximity is also influenced by the class labels of the observations. (A short sketch follows below.)
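Not on the slide: a short sketch of how the proximity matrix can be obtained with the randomForest package (the data set is the SAheart example used on the varImp slide).

library(randomForest)
library(ElemStatLearn)                   # SAheart
fit <- randomForest(as.factor(chd) ~ ., data = SAheart, proximity = TRUE)
dim(fit$proximity)                       # n x n matrix; entry [i, j] is the fraction of trees
                                         # in which observations i and j end up in the same leaf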

Page 30

Boosting

•  Consider a binary problem with classes coded as +1 and −1
•  A sequence of classifiers Cm is learnt
•  The training data is reweighted (depending on whether an example was misclassified)

Slides: Trevor Hastie

Page 31

An experiment

[Figure: decision boundary of a single tree]

Page 32

An experiment

[Figure: decision boundaries after 1, 3, 20, and 100 boosting iterations; previously misclassified points receive strong weights]

Page 33

Idea of the AdaBoost algorithm

Page 34

Details of the AdaBoost algorithm

Regarding step (c): a small error err_m leads to a large weight α_m; if err_m > 0.5, the opposite of the classifier's prediction is taken (quite uncommon, and only happens for small m). (The full algorithm, steps (a)–(d), is reconstructed below.)
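The algorithm figure on this slide did not survive the transcript. As a reconstruction (an assumption on my part), here is AdaBoost.M1 as given in Hastie et al., The Elements of Statistical Learning (Algorithm 10.1), which matches the err_m and α_m notation above:

\begin{enumerate}
\item Initialise the observation weights $w_i = 1/N$, $i = 1,\dots,N$.
\item For $m = 1,\dots,M$:
  \begin{enumerate}
  \item Fit a classifier $G_m(x)$ to the training data using the weights $w_i$.
  \item Compute $\mathrm{err}_m = \dfrac{\sum_{i=1}^{N} w_i\, I(y_i \neq G_m(x_i))}{\sum_{i=1}^{N} w_i}$.
  \item Compute $\alpha_m = \log\!\left(\dfrac{1-\mathrm{err}_m}{\mathrm{err}_m}\right)$.
  \item Set $w_i \leftarrow w_i \cdot \exp\!\left[\alpha_m\, I(y_i \neq G_m(x_i))\right]$, $i = 1,\dots,N$.
  \end{enumerate}
\item Output $G(x) = \operatorname{sign}\!\left[\sum_{m=1}^{M} \alpha_m G_m(x)\right]$.
\end{enumerate}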

Page 35

Performance of Boosting

Boosting is most frequently used with trees (though this is not necessary). The trees are typically only grown to a small depth (1 or 2). If trees of depth 2 work better than trees of depth 1 (stumps), interactions between the features are important (see the sketch below).
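Not from the slides: a short R sketch of boosting with shallow trees, using the gbm package (my choice; the slides do not name a package). Setting interaction.depth = 1 gives stumps, while interaction.depth = 2 allows two-way interactions, which is exactly the comparison described above.

library(gbm)
library(ElemStatLearn)                        # SAheart; chd is already coded 0/1
fit_stumps <- gbm(chd ~ ., data = SAheart, distribution = "adaboost",
                  n.trees = 1000, interaction.depth = 1, shrinkage = 0.01)
fit_depth2 <- gbm(chd ~ ., data = SAheart, distribution = "adaboost",
                  n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)
# If fit_depth2 clearly outperforms fit_stumps (e.g. under cross-validation),
# interactions between the features matter.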