Data mining with caret package

Kai Xiao and Vivian Zhang @ Supstat Inc.

Outline

· Introduction of data mining and caret
· before model training
  - visualization
  - pre-processing
  - data splitting
· building model
  - model training and tuning
  - model performance
  - variable importance
· advanced topics
  - feature selection
  - parallel processing
· exercise

Cross-industry standard process for data mining (CRISP-DM)

Introduction of caret

The caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process for creating predictive models. The package contains tools for:

· data splitting
· pre-processing
· feature selection
· model tuning using resampling
· variable importance estimation


A very simple example

library(caret)
str(iris)
set.seed(1)

# preprocess
process <- preProcess(iris[,-5], method = c('center','scale'))
dataScaled <- predict(process, iris[,-5])

# data splitting
inTrain <- createDataPartition(iris$Species, p = 0.75)[[1]]
length(inTrain)
trainData <- dataScaled[inTrain, ]
trainClass <- iris[inTrain, 5]
testData <- dataScaled[-inTrain, ]
testClass <- iris[-inTrain, 5]


A very simple example

# model tuning
set.seed(1)
fitControl <- trainControl(method = "cv", number = 10)
tunedf <- data.frame(.cp = c(0.01, 0.05, 0.1, 0.3, 0.5))
treemodel <- train(x = trainData,
                   y = trainClass,
                   method = 'rpart',
                   trControl = fitControl,
                   tuneGrid = tunedf)
print(treemodel)
plot(treemodel)

# prediction and performance assessment
treePred <- predict(treemodel, testData)
confusionMatrix(treePred, testClass)


visualizations

The featurePlot function is a wrapper for different lattice plots to visualize the data.

Scatterplot matrix:

featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))

Boxplot:

featurePlot(x = iris[, 1:4],
            y = iris$Species,
            plot = "box",
            ## Add a key at the top
            auto.key = list(columns = 3))


pre-processing: Creating Dummy Variables

when <- data.frame(time = c("afternoon", "night", "afternoon",
                            "morning", "morning", "morning",
                            "morning", "afternoon", "afternoon"))
when
# order the factor levels explicitly (factor() avoids accidentally renaming values)
when$time <- factor(when$time, levels = c("morning", "afternoon", "night"))
mainEffects <- dummyVars(~ time, data = when)
predict(mainEffects, when)


pre-processing: Zero- and Near Zero-Variance Predictors

data <- data.frame(x1 = rnorm(100),
                   x2 = runif(100),
                   x3 = rep(c(0, 1), times = c(2, 98)),
                   x4 = rep(3, length.out = 100))
nzv <- nearZeroVar(data, saveMetrics = TRUE)
nzv
nzv <- nearZeroVar(data)
dataFilted <- data[, -nzv]
head(dataFilted)


pre-processing: Identifying Correlated Predictors

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, 0.1, 0.1)
x3 <- x1 + rnorm(100, 1, 1)
data <- data.frame(x1, x2, x3)
corrmatrix <- cor(data)
highlyCor <- findCorrelation(corrmatrix, cutoff = 0.75)
dataFilted <- data[, -highlyCor]
head(dataFilted)


pre-processing: Identifying Linear Dependencies

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, 0.1, 0.1)
x3 <- x1 + rnorm(100, 1, 1)
x4 <- x2 + x3
data <- data.frame(x1, x2, x3, x4)
comboInfo <- findLinearCombos(data)
dataFilted <- data[, -comboInfo$remove]
head(dataFilted)


pre-processing: Centering and Scaling

set.seed(1)
x1 <- rnorm(100)
x2 <- 3 + 3 * x1 + rnorm(100)
x3 <- 2 + 2 * x1 + rnorm(100)
data <- data.frame(x1, x2, x3)
summary(data)
preProc <- preProcess(data, method = c("center", "scale"))
dataProced <- predict(preProc, data)
summary(dataProced)


pre-processing: Imputation: bagImpute/knnImpute

data <- iris[,-5]
data[1, 2] <- NA
data[2, 1] <- NA
impu <- preProcess(data, method = 'knnImpute')
dataProced <- predict(impu, data)
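
The slide title also lists bagImpute; a minimal sketch of the bagged-tree variant on the same data:

# bagged-tree imputation (assumes 'data' with NAs from above)
impuBag <- preProcess(data, method = 'bagImpute')
dataProcedBag <- predict(impuBag, data)
head(dataProcedBag)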


pre-processing: Transformation: BoxCox/PCA

data <- iris[,-5]
pcaProc <- preProcess(data, method = 'pca')
dataProced <- predict(pcaProc, data)
head(dataProced)
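
The title also lists BoxCox; a minimal sketch of that transformation on the same predictors (Box-Cox only applies to strictly positive columns, which holds here):

# Box-Cox transformation of the iris predictors
bcProc <- preProcess(data, method = 'BoxCox')
dataBC <- predict(bcProc, data)
head(dataBC)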


data splitting

create balanced splits of the data:

set.seed(1)
trainIndex <- createDataPartition(iris$Species, p = 0.8,
                                  list = FALSE,
                                  times = 1)
head(trainIndex)
irisTrain <- iris[trainIndex, ]
irisTest <- iris[-trainIndex, ]
summary(irisTest$Species)

· createResample can be used to make simple bootstrap samples.
· createFolds can be used to generate balanced cross-validation groupings from a set of data (both helpers are sketched below).
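
A minimal sketch of both helpers on the iris class labels (the resample and fold counts are illustrative):

set.seed(1)
# 10 bootstrap samples of row indices
bootIndex <- createResample(iris$Species, times = 10)
str(bootIndex[1:2])

# 5 balanced cross-validation folds; returnTrain = FALSE gives the held-out indices
cvFolds <- createFolds(iris$Species, k = 5, returnTrain = FALSE)
sapply(cvFolds, length)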


Model Training and Parameter Tuning

The train function can be used to

· evaluate, using resampling, the effect of model tuning parameters on performance
· choose the "optimal" model across these parameters
· estimate model performance from a training set


Model Training and Parameter Tuning: prepare data

data(PimaIndiansDiabetes2, package = 'mlbench')
data <- PimaIndiansDiabetes2
library(caret)

# scale and center
preProcValues <- preProcess(data[,-9], method = c("center", "scale"))
scaleddata <- predict(preProcValues, data[,-9])

# Yeo-Johnson transformation
preProcbox <- preProcess(scaleddata, method = c("YeoJohnson"))
boxdata <- predict(preProcbox, scaleddata)


Model Training and Parameter Tuning: prepare data

# bagImpute
preProcimp <- preProcess(boxdata, method = "bagImpute")
procdata <- predict(preProcimp, boxdata)
procdata$class <- data[, 9]

# data splitting
inTrain <- createDataPartition(procdata$class, p = 0.75)[[1]]
length(inTrain)
trainData <- procdata[inTrain, 1:8]
trainClass <- procdata[inTrain, 9]
testData <- procdata[-inTrain, 1:8]
testClass <- procdata[-inTrain, 9]


Model Training and Parameter Tuning: define sets of model parameter values to evaluate

tunedf <- data.frame(.cp=seq(0.001,0.2,length.out=10))


Model Training and Parameter Tuning: define the type of resampling method

· k-fold cross-validation (once or repeated)
· leave-one-out cross-validation
· bootstrap (simple estimation or the 632 rule)

fitControl <- trainControl(method = "repeatedcv",
                           # 10-fold cross-validation
                           number = 10,
                           # repeated 3 times
                           repeats = 3)
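
The other resampling schemes listed above translate into trainControl in the same way; a minimal sketch (the resample counts are illustrative):

# leave-one-out cross-validation
looControl <- trainControl(method = "LOOCV")

# simple bootstrap and the 632 rule, 25 resamples each
bootControl <- trainControl(method = "boot", number = 25)
boot632Control <- trainControl(method = "boot632", number = 25)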


Model Training and Parameter Tuning: start training

treemodel <- train(x = trainData,
                   y = trainClass,
                   method = 'rpart',
                   trControl = fitControl,
                   tuneGrid = tunedf)


Model Training and Parameter Tuning: look at the final result

treemodel
plot(treemodel)


The trainControl Function

· method: the resampling method.
· number and repeats: number controls the number of folds in k-fold cross-validation, or the number of resampling iterations for bootstrapping and leave-group-out cross-validation; repeats applies only to repeated k-fold cross-validation.
· verboseIter: a logical for printing a training log.
· returnData: a logical for saving the data into a slot called trainingData.
· classProbs: a logical value determining whether class probabilities should be computed for held-out samples during resampling.
· summaryFunction: a function to compute alternate performance summaries.
· selectionFunction: a function to choose the optimal tuning parameters.
· returnResamp: a character string containing one of the following values: "all", "final" or "none". This specifies how much of the resampled performance measures to save.
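
A minimal sketch combining several of these arguments (the values are illustrative, not recommendations):

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     verboseIter = TRUE,
                     returnData = FALSE,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     selectionFunction = "oneSE",
                     returnResamp = "final")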


Alternate Performance Metrics

Default performance metrics:

· regression: RMSE and R2
· classification: accuracy and Kappa

Another built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve.

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
treemodel <- train(x = trainData,
                   y = trainClass,
                   method = 'rpart',
                   trControl = fitControl,
                   tuneGrid = tunedf,
                   metric = "ROC")
treemodel


Extracting Predictions

Predictions can be made from these objects as usual.

pre <- predict(treemodel, testData)
pre <- predict(treemodel, testData, type = "prob")


Evaluating Test Sets

caret also contains several functions that can be used to describe the performance of classification models.

testPred <- predict(treemodel, testData)
testPred.prob <- predict(treemodel, testData, type = 'prob')
postResample(testPred, testClass)
confusionMatrix(testPred, testClass)


Exploring and Comparing Resampling Distributions

· Within-model comparison

densityplot(treemodel, pch = "|")


Exploring and Comparing Resampling Distributions

· Between-model comparison
· Let's build a nnet model and compare the performance of the two models.

tunedf <- expand.grid(.decay = 0.1, .size = 1:8, .bag = TRUE)
nnetmodel <- train(x = trainData,
                   y = trainClass,
                   method = 'avNNet',
                   trControl = fitControl,
                   trace = FALSE,
                   linout = FALSE,
                   metric = "ROC",
                   tuneGrid = tunedf)
nnetmodel


Exploring and Comparing Resampling Distributions

Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using resamples.

We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models.

resamps <- resamples(list(tree = treemodel, nnet = nnetmodel))
bwplot(resamps)
densityplot(resamps, metric = 'ROC')

difValues <- diff(resamps)
summary(difValues)


Variable importance evaluation

Variable importance evaluation functions can be separated into two groups:

· model-based approach
· model-independent approach
  - For classification, ROC curve analysis is conducted on each predictor.
  - For regression, the relationship between each predictor and the outcome is evaluated.

# model-based approach
treeimp <- varImp(treemodel)
plot(treeimp)

# model-independent approach
RocImp <- varImp(treemodel, useModel = FALSE)
plot(RocImp)

# or
RocImp <- filterVarImp(x = trainData, y = trainClass)
plot(RocImp)


feature selection

· Many models do not necessarily use all the predictors.
· Feature Selection Using Search Algorithms ("wrapper" approach)
· Feature Selection Using Univariate Filters ("filter" approach; sketched after the wrapper examples)


feature selection: wrapper approach


feature selection: wrapper approach

Feature selection based on a random forest model.

Pre-defined sets of functions: linear regression (lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs).

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   number = 10,
                   repeats = 3,
                   verbose = FALSE,
                   returnResamp = "final")
Profile <- rfe(x = trainData,
               y = trainClass,
               sizes = 1:8,
               rfeControl = ctrl)
Profile


feature selection: wrapper approach

Feature selection based on a custom model.

tunedf <- data.frame(.cp = seq(0.001, 0.2, length.out = 5))
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
customFuncs <- caretFuncs
customFuncs$summary <- twoClassSummary
ctrl <- rfeControl(functions = customFuncs,
                   method = "repeatedcv",
                   number = 10,
                   repeats = 3,
                   verbose = FALSE,
                   returnResamp = "final")
Profile <- rfe(x = trainData,
               y = trainClass,
               sizes = 1:8,
               method = 'rpart',
               rfeControl = ctrl,
               # the transcript truncates here; remaining train() arguments reconstructed
               metric = "ROC",
               trControl = fitControl,
               tuneGrid = tunedf)
Profile
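
feature selection: filter approach

The univariate "filter" approach from the overview slide is not shown in the transcript; a minimal sketch using caret's sbf (selection by filtering) with the random-forest helper set, assuming the same trainData and trainClass objects:

# univariate filtering followed by a random forest fit on the surviving predictors
filterCtrl <- sbfControl(functions = rfSBF,
                         method = "repeatedcv",
                         number = 10,
                         repeats = 3)
filterProfile <- sbf(x = trainData,
                     y = trainClass,
                     sbfControl = filterCtrl)
filterProfile
predictors(filterProfile)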

parallel processing

system.time({
  library(doParallel)
  registerDoParallel(cores = 2)
  nnetmodel.para <- train(x = trainData,
                          y = trainClass,
                          method = 'avNNet',
                          trControl = fitControl,
                          trace = FALSE,
                          linout = FALSE,
                          metric = "ROC",
                          tuneGrid = tunedf)
})
nnetmodel$times
nnetmodel.para$times


exercise 1: use the knn method to train a model

library(caret)
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 3)
tunedf <- data.frame(.k = seq(3, 20, by = 2))
knnmodel <- train(x = trainData,
                  y = trainClass,
                  method = 'knn',
                  trControl = fitControl,
                  tuneGrid = tunedf)
plot(knnmodel)
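
The tuned model can then be checked on the held-out data in the same way as the earlier tree model; a minimal sketch, assuming testData and testClass from the earlier split:

knnPred <- predict(knnmodel, testData)
confusionMatrix(knnPred, testClass)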
