Predictive Modeling Workshop

Open Data Science Conference

BOSTON 2015

@opendatasci


Max Kuhn, Ph.D

Pfizer Global R&D, Groton, CT

[email protected]


Outline

Modeling conventions

Modeling capabilities

Data splitting

Pre–processing

Measuring performance

Over–fitting and resampling

Classification trees, boosting

Support vector machines

Extra topics as time allows


Terminology

Data:

numeric data: numbers of any type (e.g. counts, sales price)

categorical or nominal data: non–numeric data (e.g. color, gender)

Variables:

outcomes: the data to be predicted

predictors (aka independent variables, descriptors, inputs): data used to predict the outcome

Models:

classification: models to predict categorical outcomes

regression: models to predict numeric outcomes

(these last two are imperfect definitions)


Modeling Conventions in R

The Formula Interface

There are two main conventions for specifying models in R: the formula interface and the non–formula (or “matrix”) interface.

For the former, the predictors are explicitly listed in an R formula that looks like: outcome ∼ var1 + var2 + ....

For example, the formula

    modelFunction(price ~ numBedrooms + numBaths + acres,
                  data = housingData)

would predict the closing price of a house using three quantitative characteristics.


The Formula Interface

The shortcut y ∼ . can be used to indicate that all of the columns in the data set (except y) should be used as predictors.

The formula interface has many conveniences. For example, transformations, such as log(acres) can be specified in–line.

Unfortunately, R does not efficiently store the information about the formula. Using this interface with data sets that contain a large number of predictors may unnecessarily slow the computations.
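As a small illustration of the formula interface, the sketch below uses lm as a stand-in for modelFunction and the built-in mtcars data in place of the housing data. Both substitutions are assumptions made only so that the example runs.

```r
# Formula interface: the predictors are named in the formula itself.
# lm() and mtcars are stand-ins for modelFunction() and housingData.
fit_named <- lm(mpg ~ wt + hp, data = mtcars)

# The shortcut `y ~ .` uses every other column as a predictor.
fit_all <- lm(mpg ~ ., data = mtcars)

# Transformations can be written in-line.
fit_log <- lm(mpg ~ log(wt) + hp, data = mtcars)

length(coef(fit_named))  # intercept plus 2 predictors
length(coef(fit_all))    # intercept plus 10 predictors
names(coef(fit_log))
```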


The Matrix or Non–Formula Interface

The non–formula interface specifies the predictors for the model using a matrix or data frame (all the predictors in the object are used in the model).

The outcome data are usually passed into the model as a vector object. For example:

    modelFunction(x = housePredictors, y = price)

In this case, transformations of data or dummy variables must be created prior to being passed to the function.

Note that not all R functions have both interfaces.
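A minimal sketch of the non–formula interface, assuming lm.fit as a stand-in for modelFunction and mtcars as a stand-in for the housing data. It shows that the transformation (and, for lm.fit, the intercept column) has to be created by hand before the call.

```r
# Non-formula interface: predictors go in as a matrix, the outcome as a vector.
x <- as.matrix(mtcars[, c("wt", "hp")])
y <- mtcars$mpg

# Transformations must be applied before the call ...
x_log <- cbind(log_wt = log(x[, "wt"]), hp = x[, "hp"])

# ... and lm.fit() does not add an intercept, so that column is added here.
fit_matrix <- lm.fit(x = cbind(Intercept = 1, x_log), y = y)

# The formula interface arrives at the same coefficients.
fit_formula <- lm(mpg ~ log(wt) + hp, data = mtcars)
all.equal(unname(fit_matrix$coefficients), unname(coef(fit_formula)))
```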


Building and Predicting Models

Almost all modeling functions in R follow the same workflow:

1. Create the model using the basic function: fit <- knn(trainingData, outcome, k = 5)

2. Assess the properties of the model using print, plot, summary or other methods

3. Predict outcomes for samples using the predict method: predict(fit, newSamples).

The model can be used for prediction without changing the original model object.
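The three steps can be sketched with any model that has print and predict methods. The example below assumes rpart and the built-in iris data, neither of which is the example used on the slide.

```r
library(rpart)  # ships with R as a recommended package

set.seed(1)
in_train <- sample(nrow(iris), 100)
training <- iris[in_train, ]
new_samples <- iris[-in_train, ]

# 1. Create the model
fit <- rpart(Species ~ ., data = training)

# 2. Assess the properties of the model
print(fit)

# 3. Predict new samples; the object `fit` is not modified by predict()
pred <- predict(fit, new_samples, type = "class")
head(pred)
```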


Modeling Capabilities

Predictive Modeling Methods in R

As previously mentioned, there is a machine learning Task View page on the R website that does a good job of describing the range of models available

parametric regression models: ordinary/generalized/robust regression models; neural networks; partial least squares; projection pursuit regression; multivariate adaptive regression splines; principal component regression

sparse/penalized models: ridge regression; the lasso; the elastic net; generalized linear models; partial least squares; nearest shrunken centroids; logistic regression

kernel methods: support vector machines; relevance vector machines; least squares support vector machine; Gaussian processes

(more)


http://cran.r-project.org/web/views/MachineLearning.html








Predictive Modeling Methods in R

trees/rule–based models: CART; C4.5; conditional inference trees; node harvest; Cubist; C5.0

ensembles: random forest; boosting (trees, linear models, generalized additive models, generalized linear models, others); bagging (trees, multivariate adaptive regression splines), rotation forests

prototype methods: k nearest neighbors; learned vector quantization

discriminant analysis: linear; quadratic; penalized; stabilized; sparse; mixture; regularized; stepwise; flexible

others: naive Bayes; Bayesian multinomial probit models


Model Function Consistency

Since there are many modeling packages written by different people, there are some inconsistencies in how models are specified and predictions are made.

For example, many models have only one method of specifying the model (e.g. formula method only)


Generating Class Probabilities Using Different Packages

obj Class    Package   predict Function Syntax
lda          MASS      predict(obj) (no options needed)
glm          stats     predict(obj, type = "response")
gbm          gbm       predict(obj, type = "response", n.trees)
mda          mda       predict(obj, type = "posterior")
rpart        rpart     predict(obj, type = "prob")
Weka         RWeka     predict(obj, type = "probability")
LogitBoost   caTools   predict(obj, type = "raw", nIter)
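Three rows of the table can be checked with packages that ship with R. The sketch below assumes a two-class subset of iris as illustration data; it shows that the calls differ and that the returned objects differ in shape as well.

```r
library(MASS)   # lda
library(rpart)  # rpart

# Two-class version of iris, used only for illustration
two_class <- droplevels(subset(iris, Species != "setosa"))

lda_fit   <- lda(Species ~ Sepal.Length + Sepal.Width, data = two_class)
glm_fit   <- glm(Species ~ Sepal.Length + Sepal.Width, data = two_class,
                 family = binomial)
rpart_fit <- rpart(Species ~ Sepal.Length + Sepal.Width, data = two_class)

# lda: no options needed; probabilities are in the `posterior` element
lda_prob <- predict(lda_fit, two_class)$posterior

# glm: type = "response" returns a vector for the second factor level only
glm_prob <- predict(glm_fit, two_class, type = "response")

# rpart: type = "prob" returns a matrix with one column per class
rpart_prob <- predict(rpart_fit, two_class, type = "prob")

str(lda_prob); str(glm_prob); str(rpart_prob)
```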


http://cran.r-project.org/web/packages/MASS/index.html

http://cran.r-project.org/web/packages/mda/index.html

http://cran.r-project.org/web/packages/rpart/index.html



http://cran.r-project.org/web/packages/RWeka/index.html





http://cran.r-project.org/web/packages/caTools/index.html




The caret Package

The caret package was developed to:

create a unified interface for modeling and prediction (interfaces to 183 models – up from 112 a year ago)

streamline model tuning using resampling

provide a variety of “helper” functions and classes for day–to–day model building tasks

increase computational efficiency using parallel processing

First commits within Pfizer: 6/2005, First version on CRAN: 10/2007

Website: http://topepo.github.io/caret/

JSS Paper: http://www.jstatsoft.org/v28/i05/paper

Model List: http://topepo.github.io/caret/bytag.html

Many computing sections in APM


http://cran.r-project.org/web/packages/caret/index.html














Example Data

Credit Score Data Set

These data can be found in Multivariate Statistical Modelling Based on Generalized Linear Models by Fahrmeir et al and are in the Fahrmeir package.

Data were collected by a South German bank to predict whether loan recipients would be good payers.

The data are moderate size: 1000 samples and 7 predictors.

The outcome was related to whether or not a recipient repaid their loan. There is a small class imbalance: 70% repaid.

Predictors are related to demographics, credit information and data related to the loan and bank account.



> ## translations, formatting, and dummy variables
> library(Fahrmeir)
> data(credit)
> credit$Male <- ifelse(credit$Sexo == "hombre", 1, 0)
> credit$Lives_Alone <- ifelse(credit$Estc == "vive solo", 1, 0)
> credit$Good_Payer <- ifelse(credit$Ppag == "pre buen pagador", 1, 0)
> credit$Private_Loan <- ifelse(credit$Uso == "privado", 1, 0)
> credit$Class <- ifelse(credit$Y == "buen", "Good", "Bad")
> credit$Class <- factor(credit$Class, levels = c("Good", "Bad"))
> credit$Y <- NULL
> credit$Sexo <- NULL
> credit$Uso <- NULL
> credit$Ppag <- NULL
> credit$Estc <- NULL
> names(credit)[names(credit) == "Mes"] <- "Loan_Duration"
> names(credit)[names(credit) == "DM"] <- "Loan_Amount"
> names(credit)[names(credit) == "Cuenta"] <- "Credit_Quality"



> library(plyr)
>
> ## to make valid R column names
> trans <- c("good running" = "good_running", "bad running" = "bad_running")
> credit$Credit_Quality <- revalue(credit$Credit_Quality, trans)
>
> str(credit)
'data.frame': 1000 obs. of 8 variables:
 $ Credit_Quality: Factor w/ 3 levels "no","good_running",..: 1 1 3 1 1 1 1 1 2 3 ...
 $ Loan_Duration : num 18 9 12 12 12 10 8 6 18 24 ...
 $ Loan_Amount   : num 1049 2799 841 2122 2171 ...
 $ Male          : num 0 1 0 1 1 1 1 1 0 0 ...
 $ Lives_Alone   : num 1 0 1 0 0 0 0 0 1 1 ...
 $ Good_Payer    : num 1 1 1 1 1 1 1 1 1 1 ...
 $ Private_Loan  : num 1 0 0 0 0 0 0 0 1 1 ...
 $ Class         : Factor w/ 2 levels "Good","Bad": 1 1 1 1 1 1 1 1 1 1 ...


General Strategies

APM Ch 1, 2 and 4.

Model Building Steps

Common steps during model building are:

estimating model parameters (i.e. training models)

determining the values of tuning parameters that cannot be directly calculated from the data

calculating the performance of the final model that will generalize to new data

How do we “spend” the data to find an optimal model? We typically split data into training and test data sets:

Training Set: these data are used to estimate model parameters and to pick the values of the complexity parameter(s) for the model.

Test Set (aka validation set): these data can be used to get an independent assessment of model efficacy. They should not be used during model training.


Spending Our Data

The more data we spend, the better estimates we’ll get (provided the data is accurate). Given a fixed amount of data,

too much spent in training won’t allow us to get a good assessment of predictive performance. We may find a model that fits the training data very well, but is not generalizable (over–fitting)

too much spent in testing won’t allow us to get a good assessment of model parameters

Statistically, the best course of action would be to use all the data for model building and use statistical methods to get good estimates of error.

From a non–statistical perspective, many consumers of these models emphasize the need for an untouched set of samples to evaluate performance.


Spending Our Data

There are a few different ways to do the split: simple random sampling, stratified sampling based on the outcome, by date and methods that focus on the distribution of the predictors.

The base R function sample can be used to create a completely random sample of the data. The caret package has a function createDataPartition that conducts data splits within groups of the data.

For classification, this would mean sampling within the classes so as to preserve the distribution of the outcome in the training and test sets

For regression, the function determines the quartiles of the data set and samples within those groups
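The idea of sampling within classes can be sketched in base R. This mimics the purpose of createDataPartition for a classification outcome but is not its implementation; the iris data and the 75% proportion are assumptions chosen for illustration.

```r
set.seed(8140)

# Split the row indices by class, then sample 75% within each class.
rows_by_class <- split(seq_len(nrow(iris)), iris$Species)
in_train <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# The class distribution is preserved in both pieces
table(training$Species)
table(testing$Species)
```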







Credit Data Set

For these data, let’s take a stratified random sample of 750 loans for training.

> library(caret)
> set.seed(8140)
> in_train <- createDataPartition(credit$Class, p = .75, list = FALSE)
> head(in_train)
     Resample1
[1,]         1
[2,]         2
[3,]         3
[4,]         4
[5,]         5
[6,]         8
> train_data <- credit[ in_train,]
> test_data <- credit[-in_train,]


Estimating Performance

APM Ch. 5 and 11

Estimating Performance

Later, once you have a set of predictions, various metrics can be used to evaluate performance.

For regression models:

R^2 is very popular. In many complex models, the notion of the model degrees of freedom is difficult. Unadjusted R^2 can be used, but does not penalize complexity

the root mean square error is a common metric for understanding the performance

Spearman’s correlation may be applicable for models that are used to rank samples

Of course, honest estimates of these statistics cannot be obtained by predicting the same samples that were used to train the model.

A test set and/or resampling can provide good estimates.
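The three regression metrics can be computed directly from held-out observed and predicted values. The sketch below assumes a linear model on mtcars with a random hold-out set, and uses the squared correlation as one common definition of unadjusted R^2.

```r
set.seed(2)
in_train <- sample(nrow(mtcars), 22)
fit <- lm(mpg ~ wt + hp, data = mtcars[in_train, ])

# Only samples that were not used for training are scored
obs  <- mtcars$mpg[-in_train]
pred <- predict(fit, mtcars[-in_train, ])

rmse     <- sqrt(mean((obs - pred)^2))
r2       <- cor(obs, pred)^2                     # unadjusted R^2
spearman <- cor(obs, pred, method = "spearman")  # rank-based agreement

c(RMSE = rmse, R2 = r2, Spearman = spearman)
```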


Estimating Performance For Classification

For classification models:

overall accuracy can be used, but this may be problematic when the classes are not balanced.

the Kappa statistic takes into account the expected error rate:

$$\kappa = \frac{O - E}{1 - E}$$

where O is the observed accuracy and E is the expected accuracy under chance agreement

For 2–class models, Receiver Operating Characteristic (ROC) curves can be used to characterize model performance (more later)
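A worked example of the Kappa calculation, using a hypothetical confusion matrix (the counts are invented for illustration). E is computed from the row and column totals, which is the usual chance-agreement estimate.

```r
# Hypothetical confusion matrix: rows = predicted, columns = observed
tab <- matrix(c(60, 10,
                15, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(predicted = c("Good", "Bad"),
                              observed  = c("Good", "Bad")))

n <- sum(tab)
O <- sum(diag(tab)) / n                       # observed accuracy
E <- sum(rowSums(tab) * colSums(tab)) / n^2   # accuracy expected by chance
kappa_stat <- (O - E) / (1 - E)

c(accuracy = O, expected = E, kappa = kappa_stat)
```

With these counts the accuracy is 0.75, the chance accuracy is 0.60 and Kappa is 0.375, which shows how a class imbalance can make raw accuracy look better than the model really is.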



A “confusion matrix” is a cross–tabulation of the observed and predicted classes

R functions for confusion matrices are in the e1071 package (the classAgreement function), the caret package (confusionMatrix), the mda package (confusion) and others.

ROC curve functions are found in the pROC package (roc), the ROCR package (performance), the verification package (roc.area) and others.

We’ll use the confusionMatrix function and the pROC package later in this class.


http://cran.r-project.org/web/packages/e1071/index.html











http://cran.r-project.org/web/packages/pROC/index.html



http://cran.r-project.org/web/packages/ROCR/index.html

http://cran.r-project.org/web/packages/verification/index.html






Estimating Performance For Classification

For 2–class classification models we might also be interested in:

Sensitivity: given that a result is truly an event, what is the probability that the model will predict an event?

Specificity: given that a result is truly not an event, what is the probability that the model will predict a non–event?

(an “event” is really the event of interest)

These conditional probabilities are directly related to the false positive and false negative rate of a method.

Unconditional probabilities (the positive–predictive values and negative–predictive values) can be computed, but require an estimate of what the overall event rate is in the population of interest (aka the prevalence)



For our example, let’s choose the event to be the loan being repaid:

$$\text{Sensitivity} = \frac{\#\,\text{truly repaid and predicted to be repaid}}{\#\,\text{truly repaid}}$$

$$\text{Specificity} = \frac{\#\,\text{truly not repaid and predicted to be not repaid}}{\#\,\text{truly not repaid}}$$

The caret package has functions called sensitivity and specificity
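The two ratios can also be computed by hand. The sketch below uses a small hypothetical set of observed and predicted classes (invented for illustration), with "Good" (repaid) as the event.

```r
# Hypothetical observed and predicted classes; "Good" (repaid) is the event
obs  <- factor(c("Good", "Good", "Good", "Good", "Bad", "Bad",  "Bad", "Good"),
               levels = c("Good", "Bad"))
pred <- factor(c("Good", "Good", "Bad",  "Good", "Bad", "Good", "Bad", "Good"),
               levels = c("Good", "Bad"))

# Sensitivity: of the truly repaid loans, the fraction predicted as repaid
sens <- sum(obs == "Good" & pred == "Good") / sum(obs == "Good")

# Specificity: of the truly unpaid loans, the fraction predicted as unpaid
spec <- sum(obs == "Bad" & pred == "Bad") / sum(obs == "Bad")

c(sensitivity = sens, specificity = spec)
```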







ROC Curve

With two classes the Receiver Operating Characteristic (ROC) curve can be used to estimate performance using a combination of sensitivity and specificity.

Given the probability of an event, many alternative cutoffs can be evaluated (instead of just a 50% cutoff). For each cutoff, we can calculate the sensitivity and specificity.

The ROC curve plots the sensitivity (i.e. the true positive rate) by one minus specificity (i.e. the false positive rate).

The area under the ROC curve is a common metric of performance.
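The construction of the curve can be sketched in base R: sweep the cutoff over the predicted probabilities, record sensitivity and specificity at each one, and integrate with the trapezoidal rule. The probabilities and classes below are hypothetical; in practice a package such as pROC would be used.

```r
# Hypothetical event probabilities and true classes (1 = event)
prob  <- c(0.95, 0.90, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10)
truth <- c(1,    1,    0,    1,    1,    0,    1,    0,    0,    0)

# Use every distinct probability as a cutoff, plus one above the maximum
cutoffs <- c(Inf, sort(unique(prob), decreasing = TRUE))
sens <- sapply(cutoffs, function(ct) mean(prob[truth == 1] >= ct))
spec <- sapply(cutoffs, function(ct) mean(prob[truth == 0] <  ct))

# Area under the curve by the trapezoidal rule (x-axis = 1 - specificity)
fpr <- 1 - spec
auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)

data.frame(cutoff = cutoffs, sensitivity = sens, specificity = spec)
auc
```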


ROC Curve

[Figure: ROC curve of sensitivity (y-axis) against specificity (x-axis, running from 1.0 to 0.0), with points labeled by probability cutoff: 0.25 (Sp = 0.6, Sn = 0.9); 0.50 (Sp = 0.7, Sn = 0.8); 0.75 (Sp = 0.9, Sn = 0.6); 1.00 (Sp = 1.0, Sn = 0.0)]

Over–Fitting and Model Tuning

APM Ch. 4

Over–Fitting

Over–fitting occurs when a model inappropriately picks up on trends in the training set that do not generalize to new samples.

When this occurs, assessments of the model based on the training set can show good performance that does not reproduce in future samples.

Some models have specific “knobs” to control over-fitting

neighborhood size in nearest neighbor models is an example

the number of splits in a tree model

Often, poor choices for these parameters can result in over-fitting

For example, the next slide shows a data set with two predictors. We want to be able to produce a line (i.e. decision boundary) that differentiates two classes of data.

Two new points are to be predicted. A 5–nearest neighbor model is illustrated.


K–Nearest Neighbors Classification

[Figure: Class 1 and Class 2 samples plotted by Predictor A (x-axis) and Predictor B (y-axis)]

Over–Fitting

On the next slide, two classification boundaries are shown for a different model type not yet discussed.

The difference in the two panels is solely due to different choices in tuning parameters.

One over–fits the training data.


Two Model Fits

[Figure: two panels, Model #1 and Model #2, each showing a classification boundary over Predictor A (x-axis) and Predictor B (y-axis)]

Characterizing Over–Fitting Using the Training Set

One obvious way to detect over–fitting is to use a test set. However, repeated “looks” at the test set can also lead to over–fitting

Resampling the training samples allows us to know when we are making poor choices for the values of these parameters (the test set is not used).

Resampling methods try to “inject variation” in the system to approximate the model’s performance on future samples.

We’ll walk through several types of resampling methods for training set samples.

See the two blog posts “Comparing Different Species of Cross-Validation” at http://bit.ly/1yE0Ss5 and http://bit.ly/1zfoFj2





K–Fold Cross–Validation

Here, we randomly split the data into K distinct blocks of roughly equal size.

1. We leave out the first block of data and fit a model.

2. This model is used to predict the held-out block

3. We continue this process until we’ve predicted all K held–out blocks

The final performance is based on the hold-out predictions

K is usually taken to be 5 or 10 and leave one out cross–validation has each sample as a block

Repeated K–fold CV creates multiple versions of the folds and aggregates the results (I prefer this method)
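The K–fold procedure can be written as a short loop in base R. This is a sketch under assumed choices (K = 5, a linear model on mtcars, RMSE as the metric), not the code caret uses internally.

```r
set.seed(3)
K <- 5

# Randomly assign each row to one of K blocks of roughly equal size
fold <- sample(rep(seq_len(K), length.out = nrow(mtcars)))

held_out_pred <- numeric(nrow(mtcars))
for (k in seq_len(K)) {
  holdout <- fold == k
  fit <- lm(mpg ~ wt + hp, data = mtcars[!holdout, ])        # fit without block k
  held_out_pred[holdout] <- predict(fit, mtcars[holdout, ])  # predict block k
}

# Performance is computed from the hold-out predictions only
cv_rmse <- sqrt(mean((mtcars$mpg - held_out_pred)^2))
cv_rmse
```

Repeating the loop with several different random fold assignments and averaging the results gives repeated K–fold CV.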




In R

Many packages have cross–validation functions, but they are usually limited to 10–fold CV.

caret has a general purpose function called train that has many resampling methods for many models (more later).

caret has functions to produce sample splits for K–fold CV (createFolds), multiple training/test splits (createDataPartition) and bootstrap sampling (createResample).

Also, the base R function sample can be used to create completely random splits or bootstrap samples












The Big Picture

We think that resampling will give us honest estimates of future performance, but there is still the issue of which model to select.

One algorithm to select models:

Define sets of model parameter values to evaluate;
for each parameter set do
    for each resampling iteration do
        Hold–out specific samples;
        Fit the model on the remainder;
        Predict the hold–out samples;
    end
    Calculate the average performance across hold–out predictions
end
Determine the optimal parameter set;


K–Nearest Neighbors Classification

[Figure: Class 1 and Class 2 samples plotted by Predictor A (x-axis) and Predictor B (y-axis)]

The Big Picture – KNN Example

Using k–nearest neighbors as an example:

Randomly put samples into 10 distinct groups;
for k = 1, 3, 5, . . . , 21 do
    for i = 1 . . . 10 do
        Hold–out block i;
        Fit the model on the other 90%;
        Predict the i-th block and save results;
    end
    Calculate the average accuracy across the 10 hold–out sets of predictions
end
Determine k based on the highest cross–validated accuracy;
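The pseudocode translates almost line for line into R. The sketch below assumes knn from the class package and the iris data; both are illustration choices rather than the data set behind the slides.

```r
library(class)  # knn(); ships with R as a recommended package

set.seed(4)
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Randomly put samples into 10 distinct groups
fold <- sample(rep(1:10, length.out = nrow(x)))
k_values <- seq(1, 21, by = 2)

cv_accuracy <- sapply(k_values, function(k) {
  fold_acc <- sapply(1:10, function(i) {
    holdout <- fold == i
    # "Fit" on the other 90% and predict block i
    pred <- knn(train = x[!holdout, ], test = x[holdout, ],
                cl = y[!holdout], k = k)
    mean(pred == y[holdout])
  })
  mean(fold_acc)  # average accuracy across the 10 hold-out sets
})

# Choose k with the highest cross-validated accuracy
best_k <- k_values[which.max(cv_accuracy)]
data.frame(k = k_values, accuracy = cv_accuracy)
```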


The Big Picture – KNN Example

[Figure: cross-validated accuracy (y-axis, 0.7 to 1.0) plotted against k (x-axis)]

With a Different Set of Resamples

[Figure: cross-validated accuracy (y-axis, 0.7 to 1.0) plotted against k (x-axis) for a different set of resamples]

Data Pre–Processing

APM Ch. 3

Pre–Processing the Data

There are a wide variety of models in R. Some models make different assumptions about the predictor data, which may then need to be pre–processed.

For example, methods that use the inverse of the predictor cross–product matrix (i.e. $(X^T X)^{-1}$) may require the elimination of collinear predictors.

Others may need the predictors to be centered and/or scaled, etc.

If any data processing is required, it is a good idea to base these calculations on the training set, then apply them to any data set used for model building or prediction.
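A minimal base R sketch of that last point, assuming mtcars and a random split as illustration data: the centering and scaling values are estimated once from the training set and then reused for the test set.

```r
set.seed(5)
in_train <- sample(nrow(mtcars), 24)
train_x <- mtcars[in_train,  c("wt", "hp", "disp")]
test_x  <- mtcars[-in_train, c("wt", "hp", "disp")]

# Estimate the centering and scaling values from the training set only
train_mean <- colMeans(train_x)
train_sd   <- apply(train_x, 2, sd)

# Apply the same values to both data sets
train_scaled <- scale(train_x, center = train_mean, scale = train_sd)
test_scaled  <- scale(test_x,  center = train_mean, scale = train_sd)

round(colMeans(train_scaled), 10)  # zero by construction
colMeans(test_scaled)              # generally close to, but not exactly, zero
```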


Pre–Processing the Data

Examples of pre-processing operations:

centering and scaling

imputation of missing data

transformations of individual predictors

transformations of groups of predictors, such as

    the "spatial-sign" transformation (i.e. $x' = x / \|x\|$)

    feature extraction via PCA


Dummy Variables

Before pre-processing the data, note that a few of the predictors are categorical in nature.

For these, we would convert the values to binary dummy variables prior to using them in numerical computations.

The core R function model.matrix can be used to do this.

If a categorical predictor has C levels, it would make C − 1 variables with values 0 or 1 (one level would be omitted).
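A small sketch of what model.matrix does with a factor. The vector below is hypothetical; its levels simply reuse the names of the credit quality levels.

```r
# Hypothetical predictor with C = 3 levels
quality <- factor(c("no", "good_running", "bad_running", "no"),
                  levels = c("no", "good_running", "bad_running"))

# Default encoding: an intercept plus C - 1 = 2 dummy variables;
# the first level, "no", is the omitted reference level
full_rank <- model.matrix(~ quality)
full_rank

# Dropping the intercept gives one column per level (C = 3 columns)
one_per_level <- model.matrix(~ quality - 1)
one_per_level
```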


Dummy Variables

In the credit data, the Current Credit Quality predictor has C = 3 distinct values: "no", "good_running" and "bad_running"

For these data, the possible dummy variables are:

                     Dummy Variable Columns
Data Value        no   good_running   bad_running
"no"               1              0             0
"good_running"     0              1             0
"bad_running"      0              0             1

For ordered categorical predictors, the default encoding is more complex. See "The Basics of Encoding Categorical Data for Predictive Models" at http://bit.ly/1CtXg0x


Dummy Variables

> alt_data <- model.matrix(Class ~ ., data = train_data)
> alt_data[1:4, 1:3]
  (Intercept) Credit_Qualitygood_running Credit_Qualitybad_running
1           1                          0                         0
2           1                          0                         0
3           1                          0                         1
4           1                          0                         0

We can get rid of the intercept column and use this data.

However, we would want to apply this same transformation to new data sets.

The caret function dummyVars has more options and can be applied to any data

Note: using the formula method with models automatically handles this.


Dummy Variables

> dummy_info <- dummyVars(Class ~ ., data = train_data)
> dummy_info
Dummy Variable Object

Formula: Class ~ .
8 variables, 2 factors
Variables and levels will be separated by '.'
A less than full rank encoding is used

> train_dummies <- predict(dummy_info, newdata = train_data)
> train_dummies[1:4, 1:3]
  Credit_Quality.no Credit_Quality.good_running Credit_Quality.bad_running
1                 1                           0                          0
2                 1                           0                          0
3                 0                           0                          1
4                 1                           0                          0
> test_dummies <- predict(dummy_info, newdata = test_data)


Dummy Variables and Model Functions

Most models are parameterized so that the predictors are required to be numeric. Linear regression, for example, doesn’t know what to do with a raw value of "green."

The primary convention in R is to convert factors to dummy variables when a model uses the formula interface.

However, this is not always the case. Many models using trees or rules (e.g. rpart, C5.0, randomForest, etc):

do not require numeric representations of the predictors

do not create dummy variables

Other notable exceptions are naive Bayes models and support vector machines using string kernel functions.


Centering and Scaling

There are a few different functions for data processing in R:

scale in base R

ScaleAdv in pcaPP

stdize in pls

preProcess in caret

normalize in sparseLDA

The first three functions do simple centering and scaling. preProcess can do a variety of techniques, so we’ll look at this in more detail.


http://cran.r-project.org/web/packages/pcaPP/index.html


http://cran.r-project.org/web/packages/pls/index.html




http://cran.r-project.org/web/packages/sparseLDA/index.html





Centering and Scaling

The input is a matrix or data frame of predictor data. Once the values are calculated, the predict method can be used to do the actual data transformations.

First, estimate the standardization parameters:

> pp_values <- preProcess(train_dummies, method = c("center", "scale"))
> pp_values

Call:
preProcess.default(x = train_dummies, method = c("center", "scale"))

Created from 750 samples and 9 variables
Pre-processing: centered, scaled

Apply them to the training and test sets:

> train_scaled <- predict(pp_values, newdata = train_dummies)
> test_scaled <- predict(pp_values, newdata = test_dummies)


Signal Extraction via PCA

Principal component analysis (PCA) can be used to create a (hopefully) small subset of new predictors that capture most of the information in the whole set.

The principal components are linear combinations of the individual predictors and usually have no meaningful interpretation.

The components are created sequentially

the first captures the largest component of variation in the predictors

the second does the same for the leftover information, and so on

We can track how much of the variance is explained by the components and select enough to have fidelity to the original data.


Signal Extraction via PCA

The advantages to this approach:

the components are all uncorrelated with one another

a small number of predictors in the model can sometimes help

However...

this is not feature selection; all the predictors are still required

the components may not be correlated to the outcome


Signal Extraction via PCA in R

The base R function prcomp can be used to create the components.

preProcess can make the transformation and automatically retain the number of components to account for some pre-specified amount of information.

The predictors should be centered and scaled prior to the PCA extraction. preProcess will automatically do this even if you forget.
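The same steps can be sketched with prcomp alone. The iris data, the random split and the 95% threshold are assumptions for illustration; the threshold logic below is a plain reading of "retain enough components", not preProcess's internal code.

```r
set.seed(6)
in_train <- sample(nrow(iris), 100)
train_x <- iris[in_train,  1:4]
test_x  <- iris[-in_train, 1:4]

# Center and scale inside prcomp, using the training set only
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Keep the smallest number of components that reaches the 95% threshold
n_comp <- which(cum_var >= 0.95)[1]

# predict() applies the training centering, scaling and rotation to new data
train_pc <- predict(pca, train_x)[, 1:n_comp, drop = FALSE]
test_pc  <- predict(pca, test_x)[, 1:n_comp, drop = FALSE]

cum_var
head(test_pc, 4)
```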


An Example

Another data set shows a nice example of PCA. There are two predictors and two classes:

> dim(example_train)
[1] 1009 3
> dim(example_test)
[1] 1010 3
> head(example_train)
   PredictorA PredictorB Class
2    3278.726  154.89876   One
3    1727.410   84.56460   Two
4    1194.932  101.09107   One
12   1027.222   68.71062   Two
15   1035.608   73.40559   One
16   1433.918   79.47569   One

Correlated Predictors

[Figure: scatter plot of PredictorB (y-axis) against PredictorA (x-axis), colored by class (One, Two)]

Mildly Predictive of the Classes

[Figure: density plots of log(value) for PredictorA and PredictorB, by class (One, Two)]

An Example

> pca_pp <- preProcess(example_train[, 1:2],
+                      method = "pca") # also added "center" and "scale"
> pca_pp

Call:
preProcess.default(x = example_train[, 1:2], method = "pca")

Created from 1009 samples and 2 variables
Pre-processing: principal component signal extraction, scaled, centered

PCA needed 2 components to capture 95 percent of the variance

> train_pc <- predict(pca_pp, example_train[, 1:2])
> test_pc <- predict(pca_pp, example_test[, 1:2])
> head(test_pc, 4)
        PC1         PC2
1 0.8420447  0.07284802
5 0.2189168  0.04568417
6 1.2074404 -0.21040558
7 1.1794578 -0.20980371


A simple Rotation in 2D

[Figure: scatter plot of PC2 (y-axis) against PC1 (x-axis), colored by class (One, Two)]

The First PC is Least Important Here

[Figure: density plots of the PC1 and PC2 values, by class (One, Two)]

The Spatial Sign

This is another group transformation that can be effective when there are significant outliers in the data.

For P predictors, the data are projected onto a unit sphere in P dimensions.

Recall that our principal component values had very long tails. Let’s apply the spatial sign after PCA extraction.

The predictors should be centered and scaled prior to this transformation too.
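The transformation itself is a one-liner: each sample is divided by its Euclidean norm. The sketch below defines a hypothetical helper, spatial_sign, and applies it to two centered and scaled mtcars columns chosen only for illustration.

```r
# Spatial sign: divide each row (sample) by its Euclidean norm.
# A row that is exactly zero would produce NaN and needs separate handling.
spatial_sign <- function(x) {
  x <- as.matrix(x)
  x / sqrt(rowSums(x^2))
}

# Center and scale first, then project onto the unit circle (P = 2)
x <- scale(mtcars[, c("wt", "hp")])
x_ss <- spatial_sign(x)

range(rowSums(x_ss^2))  # every sample now has squared norm 1
```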


Adding Another Step

> pca_ss_pp <- preProcess(example_train[, 1:2],
+                         method = c("pca", "spatialSign"))
> pca_ss_pp

Call:
preProcess.default(x = example_train[, 1:2], method = c("pca", "spatialSign"))

Created from 1009 samples and 2 variables
Pre-processing: principal component signal extraction, spatial sign transformation, scaled, centered

PCA needed 2 components to capture 95 percent of the variance

> train_pc_ss <- predict(pca_ss_pp, example_train[, 1:2])
> test_pc_ss <- predict(pca_ss_pp, example_test[, 1:2])
> head(test_pc_ss, 4)
        PC1         PC2
1 0.9962786  0.08619129
5 0.9789121  0.20428207
6 0.9851544 -0.17167058
7 0.9845449 -0.17513231


Projected onto a Unit Sphere

[Figure: scatter plot of PC2 against PC1 after the spatial sign transformation (both axes from −1.0 to 1.0), colored by class (One, Two)]

Pre–Processing and Resampling

To get honest estimates of performance, all data transformations should be included within the cross-validation loop.

This would be especially true for feature selection as well as pre-processing techniques (e.g. imputation, PCA, etc)

One function considered later, called train, can apply preProcess within resampling loops.


Filtering Problematic Variables

Some models have computational issues if predictors have degenerate distributions. For example, models that use $X^T X$ or its inverse might have issues with

outliers

predictors with a single value (aka zero-variance predictors)

highly unbalanced distributions

Other models are insensitive to the characteristics of the predictor distributions.

The caret functions findCorrelation and nearZeroVar can be used for unsupervised filtering of predictors.
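The basic idea behind both filters can be sketched in base R. The data below are simulated for illustration, and the correlation rule shown (drop the later column of any pair above 0.9) is a simplification; it is not the algorithm findCorrelation uses.

```r
# Simulated predictors: one constant column and one near-duplicate
set.seed(7)
x <- data.frame(a = rnorm(50), constant = 1)
x$b <- x$a + rnorm(50, sd = 0.01)  # almost perfectly correlated with a
x$c <- rnorm(50)

# Zero-variance filter: columns with a single distinct value
zero_var <- names(x)[sapply(x, function(col) length(unique(col)) < 2)]
x <- x[, setdiff(names(x), zero_var)]

# Simple correlation filter: flag the later column of any pair above 0.9
cor_mat <- abs(cor(x))
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- 0
too_correlated <- names(x)[apply(cor_mat, 2, max) > 0.9]

zero_var
too_correlated
```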







Feature Engineering

One of the most critical parts of the modeling process is feature engineering; how should the predictors enter the model?

For example, two predictors might be more informative if they enter the model as a ratio instead of as two main effects.

This requires an in-depth knowledge of the problem at hand and the nuances of the data.

Also, like pre-processing, this is highly model dependent.


Example: Encoding Time and Date Data

Some applications have date or time data as predictors. How should we encode this?

numerical day of the year along with the year?

categorical or ordinal factors for the day of the week, week, month, or season, etc?

number of days from some reference date?

The answer depends on the type of model and the nature of the data.



I have found the lubridate package to be invaluable in these cases.

Let’s start with some example dates stored as character strings:

> day_values <- c("2015-05-10", "1970-11-04", "2002-03-04", "2006-01-13")
> class(day_values)
[1] "character"
> library(lubridate)
> days <- ymd(day_values)
> str(days)
 POSIXct[1:4], format: "2015-05-10" "1970-11-04" "2002-03-04" "2006-01-13"



> day_of_week <- wday(days, label = TRUE)
> day_of_week
[1] Sun Wed Mon Fri
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
> year(days)
[1] 2015 1970 2002 2006
> week(days)
[1] 19 45 10 2
> month(days, label = TRUE)
[1] May Nov Mar Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
> yday(days)
[1] 130 308 63 13


Classification Models

APM Ch. 11-17

Classification Trees

A classification tree searches through each predictor to find a value of a single variable that best splits the data into two groups.

typically, the best split minimizes impurity of the outcome in the resulting data subsets.

For the two resulting groups, the process is repeated until a hierarchical structure (a tree) is created.

in effect, trees partition the X space into rectangular sections that assign a single value to samples within the rectangle.
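The search for a single split can be sketched by brute force: try every candidate cut point on one predictor and keep the one with the lowest weighted impurity. The Gini index, the iris data and the helper names gini and split_impurity are assumptions for illustration; real implementations such as rpart are more elaborate.

```r
# Gini impurity of a set of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted impurity of the two groups produced by splitting x at `cut`
split_impurity <- function(x, y, cut) {
  left <- x < cut
  (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
}

x <- iris$Petal.Length
y <- iris$Species

# Candidate cut points: midpoints between consecutive distinct values
vals <- sort(unique(x))
cuts <- (head(vals, -1) + tail(vals, -1)) / 2

impurity <- sapply(cuts, function(ct) split_impurity(x, y, ct))
best_cut <- cuts[which.min(impurity)]
best_cut
```

A full tree would repeat this search over every predictor, then recurse into each of the two resulting groups.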


An Example

There are many tree-based packages in R. The main packages for fitting single trees are rpart, RWeka, C50 and party. rpart fits the classical "CART" models of Breiman et al (1984).

To obtain a shallow tree with rpart:

> library(rpart)
> rpart1 <- rpart(Class ~ .,
+                 data = train_data,
+                 control = rpart.control(maxdepth = 2))
> rpart1
n= 750

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 750 225 Good (0.7000000 0.3000000)
  2) Credit_Quality=good_running 293 39 Good (0.8668942 0.1331058) *
  3) Credit_Quality=no,bad_running 457 186 Good (0.5929978 0.4070022)
    6) Loan_Duration< 31.5 366 129 Good (0.6475410 0.3524590) *
    7) Loan_Duration>=31.5 91 34 Bad (0.3736264 0.6263736) *












http://cran.r-project.org/web/packages/C50/index.html



http://cran.r-project.org/web/packages/party/index.html











Visualizing the Tree

The rpart package has functions plot.rpart and text.rpart to visualize the final tree.

The partykit package also has enhanced plotting functions for recursive partitioning. We can convert the rpart object to a new class called party and plot it to see more in the terminal nodes:

> library(partykit)
> rpart1_plot <- as.party(rpart1)
> ## plot(rpart1_plot)







http://cran.r-project.org/web/packages/partykit/index.html












A Shallow rpart Tree Using the party Package

[Figure: tree with a first split on Credit_Quality (good_running vs. no, bad_running) and a second split on Loan_Duration at 31.5; terminal nodes 2 (n = 293), 4 (n = 366) and 5 (n = 91) show bar charts of the Good/Bad class proportions]

Tree Fitting Process

Splitting would continue until some criterion for stopping is met, such as the minimum number of observations in a node

The largest possible tree may over-fit and "pruning" is the process of iteratively removing terminal nodes and watching the changes in resampling performance (usually 10-fold CV)

There are many possible pruning paths: how many possible trees are there with 6 terminal nodes?

Trees can be indexed by their maximum depth and the classical CART methodology uses a cost-complexity parameter (Cp) to determine best tree depth


The Final Tree

Previously, we told rpart to use a maximum of two splits.

By default, rpart will conduct as many splits as possible, then use 10-fold cross-validation to prune the tree.

Specifically, the "one SE" rule is used: estimate the standard error of performance for each tree size then choose the simplest tree within one standard error of the absolute best tree size.

> rpart_full <- rpart(Class ~ ., data = train_data)
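The "one SE" rule can be sketched by hand from the cross-validation table that rpart stores with a fit. This is only an illustration under stated assumptions: it uses the kyphosis data that ships with rpart as a stand-in for the workshop's credit data, grows a deliberately large tree with cp = 0, and the object names (fit, cp_tab, chosen, pruned) are made up for the example.

```r
library(rpart)

# 'kyphosis' ships with rpart; it stands in for the credit data here
set.seed(1735)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0, xval = 10))

# One row per candidate tree size: CP, nsplit, rel error, xerror, xstd
cp_tab <- fit$cptable

# Cross-validated error of the best tree, plus one standard error
best      <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]

# Rows run from the smallest tree to the largest, so the first row under
# the threshold is the simplest tree within one SE of the best
chosen <- which(cp_tab[, "xerror"] <= threshold)[1]
pruned <- prune(fit, cp = cp_tab[chosen, "CP"])
```

Printing cp_tab next to chosen shows which tree size the rule picks; the pruned tree never has more nodes than the unpruned fit.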


The Final Tree

> rpart_full

n= 750

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 750 225 Good (0.7000000 0.3000000)
    2) Credit_Quality=good_running 293 39 Good (0.8668942 0.1331058) *
    3) Credit_Quality=no,bad_running 457 186 Good (0.5929978 0.4070022)
      6) Loan_Duration< 31.5 366 129 Good (0.6475410 0.3524590)
       12) Loan_Amount< 10975.5 359 122 Good (0.6601671 0.3398329)
         24) Good_Payer>=0.5 323 100 Good (0.6904025 0.3095975)
           48) Loan_Duration< 11.5 66 9 Good (0.8636364 0.1363636) *
           49) Loan_Duration>=11.5 257 91 Good (0.6459144 0.3540856)
             98) Loan_Amount>=1381.5 187 54 Good (0.7112299 0.2887701) *
             99) Loan_Amount< 1381.5 70 33 Bad (0.4714286 0.5285714)
              198) Private_Loan>=0.5 38 15 Good (0.6052632 0.3947368) *
              199) Private_Loan< 0.5 32 10 Bad (0.3125000 0.6875000) *
         25) Good_Payer< 0.5 36 14 Bad (0.3888889 0.6111111)
           50) Loan_Duration>=16.5 23 11 Good (0.5217391 0.4782609)
            100) Private_Loan>=0.5 11 3 Good (0.7272727 0.2727273) *
            101) Private_Loan< 0.5 12 4 Bad (0.3333333 0.6666667) *
           51) Loan_Duration< 16.5 13 2 Bad (0.1538462 0.8461538) *
       13) Loan_Amount>=10975.5 7 0 Bad (0.0000000 1.0000000) *
      7) Loan_Duration>=31.5 91 34 Bad (0.3736264 0.6263736)
       14) Loan_Duration< 47.5 54 25 Bad (0.4629630 0.5370370)
         28) Credit_Quality=bad_running 27 11 Good (0.5925926 0.4074074)
           56) Loan_Amount< 7759 20 6 Good (0.7000000 0.3000000) *
           57) Loan_Amount>=7759 7 2 Bad (0.2857143 0.7142857) *
         29) Credit_Quality=no 27 9 Bad (0.3333333 0.6666667) *
       15) Loan_Duration>=47.5 37 9 Bad (0.2432432 0.7567568) *


The Final rpart Tree

[Figure: party plot of the full rpart tree. Splits on Credit_Quality, Loan_Duration (31.5, 47.5, 11.5, 16.5), Loan_Amount (10975.5, 1381.5, 7759), Good_Payer (0.5) and Private_Loan (0.5). Terminal nodes: Node 2 (n = 293), Node 7 (n = 66), Node 9 (n = 187), Node 11 (n = 38), Node 12 (n = 32), Node 15 (n = 11), Node 16 (n = 12), Node 17 (n = 13), Node 18 (n = 7), Node 22 (n = 20), Node 23 (n = 7), Node 24 (n = 27), Node 25 (n = 37), each with a bar chart of the Bad and Good proportions on a 0 to 1 scale.]


Test Set Results

> rpart_pred <- predict(rpart_full, newdata = test_data, type = "class")
> confusionMatrix(data = rpart_pred, reference = test_data$Class)   # requires 2 factor vectors

Confusion Matrix and Statistics

          Reference
Prediction Good Bad
      Good  163  49
      Bad    12  26

               Accuracy : 0.756
                 95% CI : (0.6979, 0.8079)
    No Information Rate : 0.7
    P-Value [Acc > NIR] : 0.02945

                  Kappa : 0.3237
 Mcnemar's Test P-Value : 4.04e-06

            Sensitivity : 0.9314
            Specificity : 0.3467
         Pos Pred Value : 0.7689
         Neg Pred Value : 0.6842
             Prevalence : 0.7000
         Detection Rate : 0.6520
   Detection Prevalence : 0.8480
      Balanced Accuracy : 0.6390

       'Positive' Class : Good
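Several of these statistics can be recomputed directly from the four cell counts, which helps when checking which class is treated as the event. The sketch below copies the counts from the confusion matrix above; the object names (cm, kappa_hat, and so on) are invented for the illustration and are not part of caret.

```r
# Counts copied from the confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(c(163, 12, 49, 26), nrow = 2,
             dimnames = list(Prediction = c("Good", "Bad"),
                             Reference  = c("Good", "Bad")))
n <- sum(cm)

accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["Good", "Good"] / sum(cm[, "Good"])  # "Good" is the positive class
specificity <- cm["Bad", "Bad"] / sum(cm[, "Bad"])
ppv         <- cm["Good", "Good"] / sum(cm["Good", ])
npv         <- cm["Bad", "Bad"] / sum(cm["Bad", ])

# Kappa compares the observed accuracy with the agreement expected by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa_hat,
        Sensitivity = sensitivity, Specificity = specificity,
        PosPredValue = ppv, NegPredValue = npv,
        BalancedAccuracy = (sensitivity + specificity) / 2), 4)
```

The low specificity is the point to notice: only 26 of the 75 truly Bad samples are predicted as Bad, even though overall accuracy beats the no-information rate.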


Creating the ROC Curve

The pROC package can be used to create ROC curves.

The function roc is used to capture the data and compute the ROC curve. The functions plot.roc and auc.roc generate plot and area under the curve, respectively.

> class_probs <- predict(rpart_full, newdata = test_data)
> head(class_probs, 3)

        Good       Bad
6  0.8636364 0.1363636
7  0.8636364 0.1363636
13 0.8636364 0.1363636

> library(pROC)
> ## The roc function assumes the *second* level is the one of
> ## interest, so we use the 'levels' argument to change the order.
> rpart_roc <- roc(response = test_data$Class, predictor = class_probs[, "Good"],
+                  levels = rev(levels(test_data$Class)))
> ## Get the area under the ROC curve
> auc(rpart_roc)

Area under the curve: 0.7975
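One way to read the area under the ROC curve: it is the probability that a randomly chosen event gets a higher predicted probability than a randomly chosen non-event, with ties counted as one half. The base R sketch below computes it that way on a small made-up example; prob, truth and auc_by_hand are hypothetical names, and this is not how pROC computes the value internally.

```r
# Hypothetical predicted probabilities of the "Good" class
prob  <- c(0.9, 0.8, 0.4, 0.7, 0.3)
truth <- factor(c("Good", "Good", "Good", "Bad", "Bad"),
                levels = c("Good", "Bad"))

auc_by_hand <- function(prob, truth, event) {
  pos <- prob[truth == event]
  neg <- prob[truth != event]
  # Compare every event score with every non-event score;
  # a tie contributes one half
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}

auc_value <- auc_by_hand(prob, truth, event = "Good")
auc_value
```

Here five of the six (Good, Bad) pairs are ordered correctly, so the value is 5/6.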


Tuning the Model

There are a few functions that can be used for this purpose in R.

the errorest function in the ipred package can be used to resample a single model (e.g. a gbm model with a specific number of iterations and tree depth)

the e1071 package has a function (tune) for five models that will conduct resampling over a grid of tuning values.

caret has a similar function called train for over 183 models. Different resampling methods are available as are custom performance metrics and facilities for parallel processing.















The train Function

The basic syntax for the function is:

> train(formula, data, method)

Looking at ?train, using method = "rpart" can be used to tune a tree over Cp, so we can use:

> train(Class ~ ., data = train_data, method = "rpart")

We’ll add a bit of customization too.


The train Function

By default, the function will tune over 3 values of the tuning parameter (Cp for this model).

For rpart, the train function determines the distinct number of values of Cp for the data.

The tuneLength argument can be used to evaluate a broader set of models:

> rpart_tune <- train(Class ~ ., data = train_data,
+                     method = "rpart",
+                     tuneLength = 9)


The train Function

The default resampling scheme is the bootstrap. Let’s use 10-fold cross-validation instead.

To do this, there is a control function that handles some of the optional arguments.

To use five repeats of 10-fold cross-validation, we would use

> cv_ctrl <- trainControl(method = "repeatedcv", repeats = 5)
> rpart_tune <- train(Class ~ ., data = train_data,
+                     method = "rpart",
+                     tuneLength = 9,
+                     trControl = cv_ctrl)
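To make "five repeats of 10-fold cross-validation" concrete, the base R sketch below builds the held-out row sets by hand. It is only an illustration: caret builds its folds internally and balances them on the outcome, so its resample sizes differ slightly from the constant 675 seen here, and the names n_samples, holdouts and training_sizes are made up for the example.

```r
set.seed(1735)
n_samples <- 750   # size of the training set in the example
k         <- 10
repeats   <- 5

# For each repeat, shuffle the fold labels and deal the rows into k folds;
# each element of 'holdouts' is the set of rows held out for one resample
holdouts <- unlist(
  lapply(seq_len(repeats), function(r) {
    fold_id <- sample(rep(seq_len(k), length.out = n_samples))
    unname(split(seq_len(n_samples), fold_id))
  }),
  recursive = FALSE
)

length(holdouts)   # models fit per tuning parameter value
training_sizes <- n_samples - lengths(holdouts)
table(training_sizes)
```

With 9 candidate Cp values, this scheme fits 9 x 50 = 450 trees before the final model is refit on the full training set.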


The train Function

Also, the default CART algorithm uses overall accuracy and the one standard-error rule to prune the tree.

We might want to choose the tree complexity based on the largest absolute area under the ROC curve.

A custom performance function can be passed to train. The package has one that calculates the ROC curve, sensitivity and specificity called twoClassSummary. For example:

> twoClassSummary(fakeData)

   ROC   Sens   Spec
0.5020 0.1145 0.8827


The train Function

We can pass the twoClassSummary function in through trainControl.

However, to calculate the ROC curve, we need the model to predict the class probabilities. The classProbs option will also do this:

Finally, we tell the function to optimize the area under the ROC curve using the metric argument:

> cv_ctrl <- trainControl(method = "repeatedcv", repeats = 5,
+                         summaryFunction = twoClassSummary,
+                         classProbs = TRUE)
> set.seed(1735)
> rpart_tune <- train(Class ~ ., data = train_data,
+                     method = "rpart",
+                     tuneLength = 9,
+                     metric = "ROC",
+                     trControl = cv_ctrl)


Digression – Parallel Processing

Since we are fitting a lot of independent models over different tuning parameters and sampled data sets, there is no reason to do these sequentially.

R has many facilities for splitting computations up onto multiple cores or machines.

See Tierney et al (2009, Journal of Statistical Software) for a recent review of these methods.


foreach and caret

To loop through the models and data sets, caret uses the foreach package, which parallelizes for loops.

foreach has a number of parallel backends which allow various technologies to be used in conjunction with the package.

On CRAN, these are the doSomething packages, such as doMC, doMPI, doSMP and others.

For example, doMC uses the multicore package, which forks processes to split computations (for unix and OS X). doParallel works well for Windows (I’m told).


foreach and caret

To use parallel processing in caret, no changes are needed when calling train.

The parallel technology must be registered with foreach prior to calling train.

For multicore

> library(doMC)
> registerDoMC(cores = 2)


Training Time (min)

50 bootstraps of a SVM model with 1000 samples and 400 predictors and the multicore package

[Figure: training time in minutes (axis from 100 to 500) versus number of cores (2 to 12).]

Speed–Up

[Figure: speed-up (axis from 1 to 5) versus number of cores (2 to 12).]


train Results

> rpart_tune

CART

750 samples
  7 predictor
  2 classes: 'Good', 'Bad'

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

Summary of sample sizes: 675, 675, 674, 674, 676, 675, ...

Resampling results across tuning parameters:

  cp       ROC    Sens   Spec    ROC SD  Sens SD  Spec SD
  0.00000  0.697  0.835  0.4447  0.0750  0.0488   0.115
  0.00222  0.696  0.841  0.4367  0.0744  0.0480   0.110
  0.00444  0.695  0.848  0.4163  0.0708  0.0467   0.102
  0.00778  0.695  0.864  0.3772  0.0877  0.0433   0.114
  0.00889  0.695  0.863  0.3817  0.0898  0.0455   0.119
  0.01111  0.693  0.861  0.3801  0.0959  0.0442   0.127
  0.01778  0.646  0.877  0.3529  0.1487  0.0526   0.101
  0.03333  0.554  0.912  0.2591  0.1876  0.0505   0.104
  0.05111  0.507  0.954  0.0996  0.1149  0.0510   0.106

ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.


Working With the train Object

There are a few methods of interest:

plot.train or ggplot.train can be used to plot the resampling profiles across the different models

print.train shows a textual description of the results

predict.train can be used to predict new samples

there are a few others that we’ll mention shortly.

Additionally, the final model fit (i.e. the model with the best resampling results) is in a sub-object called finalModel.

So in our example, rpart_tune is of class train and the object rpart_tune$finalModel is of class rpart.

Let’s look at what the plot method does.


Resampled ROC Profile

> ggplot(rpart_tune)

[Figure: resampled ROC (repeated cross-validation, axis from 0.50 to 0.70) versus the complexity parameter (0.00 to 0.05).]


Predicting New Samples

There are at least two ways to get predictions from a train object:

predict(rpart_tune$finalModel, newdata, type = "class")

predict(rpart_tune, newdata)

The first method uses predict.rpart, but is not preferred if using train.

predict.train does the same thing, but automatically uses the tuning parameter values (e.g. the number of trees for a boosted model) that were selected using the train function.


Test Set Results

> rpart_pred2 <- predict(rpart_tune, newdata = test_data)
> confusionMatrix(rpart_pred2, test_data$Class)

      Good  161  47
      Bad    14  28

               Accuracy : 0.756










Predicting Class Probabilities

predict.train has an argument type that can be used to get predicted class probabilities for different models:

> rpart_probs <- predict(rpart_tune, newdata = test_data, type = "prob")
> head(rpart_probs, n = 4)

        Good       Bad
6  0.8636364 0.1363636
7  0.8636364 0.1363636
13 0.8636364 0.1363636
16 0.8636364 0.1363636

> rpart_roc <- roc(response = test_data$Class, predictor = rpart_probs[, "Good"],
+                  levels = rev(levels(test_data$Class)))
> auc(rpart_roc)



Classification Tree ROC Curve

> plot(rpart_roc, legacy.axes = FALSE)

[Figure: ROC curve for the classification tree, sensitivity (0.0 to 1.0) versus specificity (1.0 to 0.0).]


Pros and Cons of Single Trees

Trees can be computed very quickly and have simple interpretations.

Also, they have built-in feature selection; if a predictor was not used in any split, the model is completely independent of that data.

Unfortunately, trees do not usually have optimal performance when compared to other methods.

Also, small changes in the data can drastically affect the structure of a tree.

This last point has been exploited to improve the performance of trees via ensemble methods where many trees are fit and predictions are aggregated across the trees. Examples are bagging, boosting and random forests.


Boosted Trees

APM Sect 14.5

Boosted Trees (Original “adaBoost” Algorithm)

A method to "boost" weak learning algorithms (e.g. single trees) into strong learning algorithms.

Boosted trees try to improve the model fit over different trees by considering past fits (not unlike iteratively reweighted least squares).

The basic adaBoost algorithm:

Initialize equal weights per sample;
for j = 1 ... M iterations do
    Fit a classification tree using sample weights (denote the model equation as f_j(x));
    forall the misclassified samples do
        increase sample weight
    end
    Save a "stage-weight" (β_j) based on the performance of the current model;
end


Boosted Trees (Original “adaBoost” Algorithm)

In this formulation, the categorical response y_i is coded as either {−1, 1} and the model f_j(x) produces values of {−1, 1}.

The final prediction is obtained by first predicting using all M trees, then weighting each prediction

$$ f(x) = \frac{1}{M} \sum_{j=1}^{M} \beta_j f_j(x) $$

where f_j is the jth tree fit and β_j is the stage weight for that tree.

The final class is determined by the sign of the model prediction.

In English: the final prediction is a weighted average of each tree’s prediction. The weights are based on quality of each tree.
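The algorithm can be sketched in a few lines of R using single-split rpart trees ("stumps") as the weak learner. This is a teaching sketch under stated assumptions, not the code behind any boosting package: the data are simulated, the stage weight uses the common choice log((1 - err) / err), misclassified weights are multiplied by exp(β_j), and all object names are invented for the example.

```r
library(rpart)

# Simulated two-class data; classes are coded as -1 / 1
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- ifelse(x1 + x2 > 1, 1, -1)
dat <- data.frame(x1 = x1, x2 = x2, cls = factor(y), w = 1 / n)  # equal weights

M     <- 25                  # number of boosting iterations
trees <- vector("list", M)
beta  <- numeric(M)          # stage weights

# Helper: class predictions converted back to -1 / 1
predict_pm1 <- function(fit, newdata) {
  as.numeric(as.character(predict(fit, newdata, type = "class")))
}

for (j in seq_len(M)) {
  # Weak learner: a one-split tree fit with the current sample weights
  trees[[j]] <- rpart(cls ~ x1 + x2, data = dat, weights = w,
                      control = rpart.control(maxdepth = 1, cp = 0,
                                              minsplit = 2, xval = 0))
  miss <- predict_pm1(trees[[j]], dat) != y
  err  <- sum(dat$w[miss]) / sum(dat$w)
  err  <- min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
  beta[j] <- log((1 - err) / err)             # better trees get larger weights
  dat$w[miss] <- dat$w[miss] * exp(beta[j])   # up-weight the misclassified samples
  dat$w <- dat$w / sum(dat$w)
}

# Final prediction: weighted average of the M tree predictions, then the sign
votes <- sapply(trees, predict_pm1, newdata = dat)   # n x M matrix of -1 / 1
f_hat <- as.vector(votes %*% beta) / M
boosted_class <- sign(f_hat)
mean(boosted_class == y)     # training accuracy of the ensemble
```

Comparing this accuracy with that of the first stump alone, mean(predict_pm1(trees[[1]], dat) == y), shows what the reweighting buys on this toy problem.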


Statistical Approaches to Boosting

The original algorithm was developed outside of statistical theory.

Statisticians discovered that the basic boosting algorithm was very similar to gradient-based steepest descent methods, such as Newton’s method.

Based on their observations, a new class of boosted tree models was proposed; these are generally referred to as "gradient boosting" methods.


Boosted Trees Parameters

Most implementations of boosting have several tuning parameters:

number of iterations (i.e. trees)

complexity of the tree (i.e. number of splits)

learning rate (aka "shrinkage"): how quickly the algorithm adapts

the minimum number of samples in a terminal node

Boosting functions for trees in R: gbm in the gbm package, ada in ada, blackboost in mboost and C5.0 in the C50 package.










Using the gbm Package

The gbm function in the gbm package can be used to fit the model, then predict.gbm and other functions are used to predict and evaluate the model.

> library(gbm)
> # The gbm function does not accept factor response values so
> # will make a copy and modify the result variable
> for_gbm <- train_data
> for_gbm$Class <- ifelse(for_gbm$Class == "Good", 1, 0)
>
> set.seed(10)
> gbm_fit <- gbm(formula = Class ~ .,          # Try all predictors
+                distribution = "bernoulli",   # For classification
+                data = for_gbm,
+                n.trees = 100,                # 100 boosting iterations
+                interaction.depth = 1,        # How many splits in each tree
+                shrinkage = 0.1,              # learning rate
+                verbose = FALSE)              # Do not print the details





gbm Predictions

We need to tell the predict method how many trees to use (let’s just pick 100, the number that were fit).

Also, it does not predict the actual class. We’ll get the class probability and do the conversion.

> gbm_pred <- predict(gbm_fit, newdata = test_data, n.trees = 100,
+                     ## This calculates the class probs
+                     type = "response")
> head(gbm_pred)

[1] 0.6803420 0.7628453 0.6882552 0.8262940 0.8861083 0.7280930

> gbm_pred <- factor(ifelse(gbm_pred > .5, "Good", "Bad"),
+                    levels = c("Good", "Bad"))
> head(gbm_pred)

[1] Good Good Good Good Good Good
Levels: Good Bad


Test Set Results

> confusionMatrix(gbm_pred, test_data$Class)

      Good  166  51
      Bad     9  24

               Accuracy : 0.76










Model Tuning using train

We can fit a series of boosted trees with different specifications and use resampling to understand which one is most appropriate.

For example, we can define a grid of 152 tuning parameter combinations:

number of trees: 100 to 1000 by 50

tree depth: 1 to 7 by 2

learning rate: 0.0001 and 0.01

minimum sample size of 10 (the default)

In R:

> gbm_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
+                         n.trees = seq(100, 1000, by = 50),
+                         shrinkage = c(0.0001, 0.01),
+                         n.minobsinnode = 10)

We can use this grid to define exactly what models are evaluated by train. A list of model parameter names can be found using ?train.
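The count of 152 is simply the product of the number of values per parameter, which is easy to confirm in base R. The grid below repeats the call from the slide; n_per_param is a name made up for this check.

```r
gbm_grid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                        n.trees = seq(100, 1000, by = 50),
                        shrinkage = c(0.0001, 0.01),
                        n.minobsinnode = 10)

# Number of distinct values tried for each tuning parameter
n_per_param <- sapply(gbm_grid, function(col) length(unique(col)))
n_per_param

# Every combination appears exactly once, so rows = product of the counts
nrow(gbm_grid)
prod(n_per_param)
```

With five repeats of 10-fold cross-validation, each row would naively need 50 model fits, so the size of the grid matters for run time.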


Another Digression – The Ellipses

R has a great feature known as the ellipses (aka "three dots") where an arbitrary number of function arguments can be passed through nested function calls. For example:

> average <- function(dat, ...) mean(dat, ...)
> names(formals(mean.default))

[1] "x"     "trim"  "na.rm" "..."

> average(dat = c(1:10, 100))

[1] 14.09091

> average(dat = c(1:10, 100), trim = .1)

[1] 6

train is structured to take full advantage of this so that arguments can be passed to the underlying model function (e.g. rpart, gbm) when calling the train function.


Model Tuning using train

> set.seed(1735)
> gbm_tune <- train(Class ~ ., data = train_data,
+                   method = "gbm",
+                   metric = "ROC",
+                   # Use a custom grid of tuning parameters
+                   tuneGrid = gbm_grid,
+                   trControl = cv_ctrl,
+                   # Remember the 'three dots' discussed previously?
+                   # This option is directly passed to the gbm function.
+                   verbose = FALSE)


Boosted Tree Results

> gbm_tune

Stochastic Gradient Boosting

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times)

  shrinkage  interaction.depth  n.trees  ROC    Sens   Spec
  1e-04      1                   100     0.608  1.000  0.00000
  1e-04      1                   150     0.661  1.000  0.00000
  1e-04      1                   200     0.661  1.000  0.00000
  1e-04      1                   250     0.681  1.000  0.00000
  :          :                   :       :      :      :
  1e-02      7                   900     0.733  0.880  0.42518
  1e-02      7                   950     0.731  0.876  0.42783
  1e-02      7                  1000     0.729  0.877  0.43047

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
ROC was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 450, interaction.depth = 3, shrinkage = 0.01 and n.minobsinnode = 10.


Boosted Tree ROC Profile

> ggplot(gbm_tune)

[Figure: ROC (repeated cross-validation, axis from 0.625 to 0.750) versus the number of boosting iterations (250 to 1000), with one line per max tree depth (1, 3, 5, 7) and one panel per shrinkage value.]


Test Set Results

> gbm_pred <- predict(gbm_tune, newdata = test_data) # Magic!
> confusionMatrix(gbm_pred, test_data$Class)

      Good  167  52
      Bad     8  23

               Accuracy : 0.76










GBM Probabilities

> gbm_probs <- predict(gbm_tune, newdata = test_data, type = "prob")
> head(gbm_probs)

       Good       Bad
1 0.7559353 0.2440647
2 0.8336368 0.1663632
3 0.7846890 0.2153110
4 0.8455947 0.1544053
5 0.8454949 0.1545051
6 0.6093760 0.3906240


Boosted Tree ROC Curve

> gbm_roc <- roc(response = test_data$Class, predictor = gbm_probs[, "Good"],
+                levels = rev(levels(test_data$Class)))
> auc(rpart_roc)



> auc(gbm_roc)


> plot(rpart_roc, col = "#9E0142", legacy.axes = FALSE)
> plot(gbm_roc, col = "#3288BD", legacy.axes = FALSE, add = TRUE)
> legend(.6, .5, legend = c("rpart", "gbm"),
+        lty = c(1, 1),
+        col = c("#9E0142", "#3288BD"))


Boosted Tree ROC Curve

[Figure: ROC curves for the rpart and gbm models, sensitivity (0.0 to 1.0) versus specificity (1.0 to 0.0).]


Support Vector Machines for Classification

APM Sect 13.4

Support Vector Machines (SVM)

This is a class of powerful and very flexible models.

SVMs for classification use a completely different objective function called the margin.

Suppose a hypothetical situation: a dataset of two predictors and we are trying to predict the correct class (two possible classes).

Let’s further suppose that these two predictors completely separate the classes.


Support Vector Machines

[Figure: scatter plot of the two classes, Predictor 1 versus Predictor 2, both axes from −4 to 4.]


There are an infinite number of straight lines that we can use to separate these two groups. Some must be better than others...

The margin is defined by equally spaced boundaries on each side of the line.

To maximize the margin, we try to make it as large as possible without capturing any samples.

As the margin increases, the solution becomes more robust.

SVMs determine the slope and intercept of the line which maximizes the margin.


Maximizing the Margin

[Figure: the same scatter plot of Predictor 1 versus Predictor 2 (both axes from −4 to 4), illustrating the maximum-margin boundary.]


SVM Prediction Function

Suppose our two classes are coded as -1 or 1.

The SVM model estimates n parameters (α_1 ... α_n) for the model.

Regularization is used to avoid saturated models (more on that in a minute).

For a new point u, the model predicts:

$$ f(u) = \beta_0 + \sum_{i=1}^{n} \alpha_i y_i x_i^T u $$

The decision rule would predict class #1 if f(u) > 0 and class #2 otherwise.

Data points that are support vectors have α_i ≠ 0, so the prediction equation is only affected by the support vectors.
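A small numeric sketch shows why only the support vectors matter. It assumes the prediction function has the form f(u) = β_0 + Σ α_i y_i x_i'u, and every number in it (x, y, alpha, beta0, u) is made up for illustration rather than taken from a fitted model.

```r
# Hypothetical quantities for four training points (not from a real fit)
x     <- rbind(c(1, 2), c(2, 3), c(-1, -1), c(-2, -3))
y     <- c(1, 1, -1, -1)           # classes coded as -1 / 1
alpha <- c(0.5, 0, 0.5, 0)         # non-zero only for the support vectors
beta0 <- 0.25

# f(u) = beta0 + sum_i alpha_i * y_i * <x_i, u>
svm_predict <- function(u, x, y, alpha, beta0) {
  beta0 + sum(alpha * y * as.vector(x %*% u))
}

u <- c(0.5, 1)

# Using every training point ...
f_all <- svm_predict(u, x, y, alpha, beta0)

# ... gives the same value as using the support vectors alone
sv   <- alpha != 0
f_sv <- svm_predict(u, x[sv, , drop = FALSE], y[sv], alpha[sv], beta0)

c(all_points = f_all, support_vectors_only = f_sv)
```

Points with α_i = 0 contribute nothing to the sum, so they can be dropped from the stored model without changing any prediction.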



When the classes overlap, points are allowed within the margin. The number of points is controlled by a cost parameter.

The points that are within the margin (or on its boundary) are the support vectors.

Consequences of the fact that the prediction function only uses the support vectors:

the prediction equation is more compact and efficient

the model may be more robust to outliers


The Kernel Trick

You may have noticed that the prediction function was a function of an inner product between two sample vectors (x_i^T u). It turns out that this opens up some new possibilities.

Nonlinear class boundaries can be computed using the "kernel trick".

The predictor space can be expanded by adding nonlinear functions in x. These functions, which must satisfy specific mathematical criteria, include common functions:

Polynomial: $K(x, u) = (1 + x^T u)^p$

Radial basis function: $K(x, u) = \exp\left[-\sigma (x - u)^2\right]$

We don’t need to store the extra dimensions; these functions can be computed quickly.
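The two kernels are short functions in base R, and the degree-2 polynomial kernel makes the "no extra dimensions" point concrete: it equals an ordinary inner product in an expanded six-dimensional space that never has to be built. The function names (poly_kernel, rbf_kernel, phi) are invented for this sketch, and (x - u)^2 is read as the squared Euclidean distance.

```r
poly_kernel <- function(x, u, p = 2) (1 + sum(x * u))^p
rbf_kernel  <- function(x, u, sigma = 1) exp(-sigma * sum((x - u)^2))

# Explicit feature expansion matching the degree-2 polynomial kernel
# for two predictors
phi <- function(x) {
  c(1, sqrt(2) * x[1], sqrt(2) * x[2], x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])
}

x <- c(1, 2)
u <- c(3, -1)

# Kernel evaluated on the original 2 predictors ...
k_direct <- poly_kernel(x, u, p = 2)
# ... equals the inner product of the 6 expanded features
k_expanded <- sum(phi(x) * phi(u))

c(direct = k_direct, expanded = k_expanded)

rbf_kernel(x, x)   # identical points give the maximum similarity
rbf_kernel(x, u)   # similarity decays with squared distance
```

Replacing the inner product x_i'u in the prediction function with one of these kernels is what produces a nonlinear class boundary.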


SVM Regularization

As previously discussed, SVMs also include a regularization parameter that controls how much the regression line can adapt to the data; smaller values result in more linear (i.e. flat) surfaces.

This parameter is generally referred to as "Cost"

If the cost parameter is large, there is a significant penalty for havingsamples within the margin ⇒ the boundary becomes very flexible.

Tuning the cost parameter, as well as any kernel parameters, becomes very important as these models have the ability to greatly over-fit the training data.

(animation)


SVM Models in R

There are several packages that include SVM models:

e1071 has the function svm for classification (2+ classes) and regression with 4 kernel functions

klaR has svmlight which is an interface to the C library of the same name. It can do classification and regression with 4 kernel functions (or user defined functions)

svmpath has an efficient function for computing 2-class models (including 2 kernel functions)

kernlab contains ksvm for classification and regression with 9 built-in kernel functions. Additional kernel function classes can be written. Also, ksvm can be used for text mining with the tm package.

Personally, I prefer kernlab because it is the most general and contains other kernel method functions (e1071 is probably the most popular).















Tuning SVM Models

We need to come up with reasonable choices for the cost parameter and any other parameters associated with the kernel, such as

polynomial degree for the polynomial kernel

σ, the scale parameter for radial basis functions (RBF)

We’ll focus on RBF kernel models here, so we have two tuning parameters.

However, there is a potential shortcut for RBF kernels. Reasonable values of σ can be derived from elements of the kernel matrix of the training set.

The manual for the sigest function in kernlab has "The estimation [for σ] is based upon the 0.1 and 0.9 quantile of ||x − x'||^2."

Anecdotally, we have found that the mid-point between these two numbers can provide a good estimate for this tuning parameter. This leaves only the cost parameter for tuning.
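The idea behind this shortcut can be imitated in base R. The sketch below is not kernlab's sigest: it runs on simulated predictors, uses every pair of rows rather than a sample, and assumes that candidate σ values are the reciprocals of the 0.1 and 0.9 quantiles of the squared distances, so that exp(−σ d²) is neither close to 0 nor close to 1 for typical pairs. All object names are made up.

```r
set.seed(1735)
# Hypothetical centered and scaled predictors (100 samples, 3 columns)
x <- scale(matrix(rnorm(100 * 3), ncol = 3))

# Squared Euclidean distance between every pair of rows
d2 <- as.vector(dist(x))^2
q  <- quantile(d2, probs = c(0.1, 0.9))

# Assumption: the candidate range for sigma is the reciprocal of these quantiles
sigma_range <- sort(unname(1 / q))

# Mid-point of the range, used as a single fixed value of sigma
sigma_mid <- mean(sigma_range)

c(lower = sigma_range[1], mid = sigma_mid, upper = sigma_range[2])
```

Fixing σ this way turns a two-dimensional search into a one-dimensional search over cost.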






SVM Example

We can tune the SVM model over the cost parameter.

> set.seed(1735)
> svm_tune <- train(Class ~ ., data = train_data,
+                   method = "svmRadial",
+                   # The default grid of cost parameters go from 2^-2,
+                   # 0.5, 1, ...
+                   # We'll fit 10 values in that sequence via the tuneLength
+                   # argument.
+                   tuneLength = 10,
+                   preProc = c("center", "scale"),
+                   metric = "ROC",
+                   trControl = cv_ctrl)
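The sequence described in the code comment can be written out directly, assuming it continues as consecutive powers of two. The same values could be supplied through an explicit grid instead of tuneLength; in the sketch below, svm_grid and the fixed sigma of 0.114 are illustrative choices rather than anything train requires.

```r
# Ten cost values starting at 2^-2 and doubling each time
cost_grid <- 2^seq(-2, 7)
cost_grid

# Hypothetical explicit grid with sigma held at a single value
svm_grid <- expand.grid(sigma = 0.114, C = cost_grid)
nrow(svm_grid)
```

Spacing the candidates on a log scale covers several orders of magnitude of cost with only ten model fits per resample.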


SVM Example

> svm_tune

Support Vector Machines with Radial Basis Function Kernel

Pre-processing: centered, scaled
Resampling: Cross-Validated (10 fold, repeated 5 times)

  C       ROC    Sens   Spec   ROC SD  Sens SD  Spec SD
    0.25  0.719  0.914  0.308  0.0726  0.0443   0.0915
    0.50  0.719  0.923  0.308  0.0719  0.0414   0.0934
    1.00  0.719  0.924  0.295  0.0732  0.0429   0.0926
    2.00  0.717  0.924  0.278  0.0726  0.0429   0.0982
    4.00  0.710  0.925  0.265  0.0757  0.0464   0.0967
    8.00  0.691  0.928  0.233  0.0768  0.0457   0.1028
   16.00  0.675  0.934  0.208  0.0780  0.0413   0.0883
   32.00  0.667  0.948  0.174  0.0771  0.0354   0.0934
   64.00  0.655  0.958  0.143  0.0783  0.0369   0.0823
  128.00  0.644  0.966  0.112  0.0757  0.0291   0.0681

Tuning parameter 'sigma' was held constant at a value of 0.11402
ROC was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.114 and C = 0.5.


SVM Example

> svm_tune$finalModel

Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
 parameter : cost C = 0.5

Gaussian Radial Basis kernel function.
 Hyperparameter : sigma = 0.114020038085962

Number of Support Vectors : 458

Objective Function Value : -202.0199
Training error : 0.230667
Probability model included.

458 training data points (out of 750 samples) were used as support vectors.


SVM ROC Profile

> ggplot(svm_tune) + scale_x_log10()

[Figure: ROC (repeated cross-validation, axis from 0.64 to 0.72) versus cost on a log scale (1 to 100).]


Test Set Results

> svm_pred <- predict(svm_tune, newdata = test_data)
> confusionMatrix(svm_pred, test_data$Class)

      Good  165  49
      Bad    10  26

               Accuracy : 0.764










SVM Probability Estimates

> svm_probs <- predict(svm_tune, newdata = test_data, type = "prob")
> head(svm_probs)

       Good       Bad
1 0.8008819 0.1991181
2 0.8064645 0.1935355
3 0.8103000 0.1897000
4 0.7944448 0.2055552
5 0.6722484 0.3277516
6 0.7202097 0.2797903


SVM ROC Curve

> svm_roc <- roc(response = test_data$Class, predictor = svm_probs[, "Good"],
+                levels = rev(levels(test_data$Class)))
> auc(rpart_roc)



> auc(gbm_roc)


> auc(svm_roc)


> plot(rpart_roc, col = "#9E0142", legacy.axes = FALSE)
> plot(gbm_roc, col = "#3288BD", legacy.axes = FALSE, add = TRUE)
> plot(svm_roc, col = "#F46D43", legacy.axes = FALSE, add = TRUE)
> legend(.6, .5, legend = c("rpart", "gbm", "svm"),
+        lty = c(1, 1, 1),
+        col = c("#9E0142", "#3288BD", "#F46D43"))


SVM ROC Curve

[Figure: ROC curves for the rpart, gbm and svm models, sensitivity (0.0 to 1.0) versus specificity (1.0 to 0.0).]


Other Functions and Classes

bag: a general bagging function

upSample, downSample: functions for class imbalances

predictors: class for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models)

varImp: classes for assessing the aggregate effect of a predictor on the model equations

lift: creating lift/gain charts


Other Functions and Classes

knnreg: nearest-neighbor regression

plsda, splsda: PLS discriminant analysis

icr: independent component regression

pcaNNet: nnet:::nnet with automatic PCA pre-processing step

bagEarth, bagFDA: bagging with MARS and FDA models

maxDissim: a function for maximum dissimilarity sampling

rfe, sbf, gafs, safs: classes/frameworks for recursive feature selection (RFE), univariate filters, genetic algorithms, simulated annealing feature selection methods

featurePlot: a wrapper for several lattice functions


Session Info

R version 3.1.3 (2015-03-09), x86_64-apple-darwin10.8.0

Base packages: base, datasets, graphics, grDevices, grid, methods, parallel, splines, stats, tcltk, utils

Other packages: C50 0.1.0-24, caret 6.0-47, class 7.3-12, ctv 0.8-1, doMC 1.3.3, Fahrmeir 2012.04-0, foreach 1.4.2, Formula 1.2-0, gbm 2.1.1, ggplot2 1.0.1, Hmisc 3.15-0, iterators 1.0.7, kernlab 0.9-20, knitr 1.9, lattice 0.20-30, lubridate 1.3.3, MASS 7.3-40, mlbench 2.1-1, odfWeave 0.8.4, partykit 1.0-0, plyr 1.8.1, pROC 1.8, reshape2 1.4.1, rpart 4.1-9, survival 2.38-1, XML 3.98-1.1

Loaded via a namespace (and not attached): acepack 1.3-3.3, BradleyTerry2 1.0-6, brglm 0.5-9, car 2.0-25, cluster 2.0.1, codetools 0.2-10, colorspace 1.2-6, compiler 3.1.3, digest 0.6.8, e1071 1.6-4, evaluate 0.5.5, foreign 0.8-63, formatR 1.0, gtable 0.1.2, gtools 3.4.2, highr 0.4, labeling 0.3, latticeExtra 0.6-26, lme4 1.1-7, Matrix 1.2-0, memoise 0.2.1, mgcv 1.8-6, minqa 1.2.4, munsell 0.4.2, nlme 3.1-120, nloptr 1.0.4, nnet 7.3-9, pbkrtest 0.4-2, proto 0.3-10, quantreg 5.11, RColorBrewer 1.1-2, Rcpp 0.11.5, scales 0.2.4, SparseM 1.6, stringr 0.6.2, tools 3.1.3

