An Introduction to Data Mining with R

Yanchang Zhao

http://www.RDataMining.com

6 September 2013

1 / 43

Questions

I Do you know data mining and techniques for it?

I Have you used R before?

I Have you used R in your data mining research or projects?

2 / 43

Questions

2 / 43

Questions

2 / 43

Outline

Introduction

Classification with R

Clustering with R

Association Rule Mining with R

Text Mining with R

Time Series Analysis with R

Social Network Analysis with R

R and Hadoop

Online Resources

3 / 43

What is R?

I R 1 is a free software environment for statistical computingand graphics.

I R can be easily extended with 4,728 packages available onCRAN2 (as of Sept 6, 2013).

I Many other packages provided on Bioconductor3, R-Forge4,GitHub5, etc.

I R manuals on CRAN6

I An Introduction to RI The R Language DefinitionI R Data Import/ExportI . . .

1http://www.r-project.org/2http://cran.r-project.org/3http://www.bioconductor.org/4http://r-forge.r-project.org/5https://github.com/6http://cran.r-project.org/manuals.html

4 / 43

Why R?

I R is widely used in both academia and industry.

I R is ranked no. 1 again in the KDnuggets 2013 poll on TopLanguages for analytics, data mining, data science7.

I The CRAN Task Views 8 provide collections of packages fordifferent tasks.

I Machine learning & atatistical learningI Cluster analysis & finite mixture modelsI Time series analysisI Multivariate statisticsI Analysis of spatial dataI . . .

7http://www.kdnuggets.com/2013/08/languages-for-analytics-data-mining-data-science.html

8http://cran.r-project.org/web/views/

5 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

6 / 43

I Decision trees: rpart, party

I Random forest: randomForest, party

I SVM: e1071, kernlab

I Neural networks: nnet, neuralnet, RSNNS

I Performance evaluation: ROCR

7 / 43

The Iris Dataset

# iris data

str(iris)

## 'data.frame': 150 obs. of 5 variables:

## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...

## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...

## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# split into training and test datasets

set.seed(1234)

ind <- sample(2, nrow(iris), replace=T, prob=c(0.7, 0.3))

iris.train <- iris[ind==1, ]

iris.test <- iris[ind==2, ]

8 / 43

Build a Decision Tree

# build a decision tree

library(party)

iris.formula <- Species ~ Sepal.Length + Sepal.Width +

Petal.Length + Petal.Width

iris.ctree <- ctree(iris.formula, data=iris.train)

9 / 43

plot(iris.ctree)

Petal.Lengthp < 0.001

≤ 1.9 > 1.9

Node 2 (n = 40)

setosa

Petal.Widthp < 0.001

≤ 1.7 > 1.7

Petal.Lengthp = 0.026

≤ 4.4 > 4.4

Node 5 (n = 21)

setosa

Node 6 (n = 19)

setosa

Node 7 (n = 32)

setosa

10 / 43

Prediction

# predict on test data

pred <- predict(iris.ctree, newdata = iris.test)

# check prediction result

table(pred, iris.test$Species)

## pred setosa versicolor virginica

## setosa 10 0 0

## versicolor 0 12 2

## virginica 0 0 14

11 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

12 / 43

Clustering with R

I k-means: kmeans(), kmeansruns()9

I k-medoids: pam(), pamk()

I Hierarchical clustering: hclust(), agnes(), diana()

I DBSCAN: fpc

I BIRCH: birch

9Functions are followed with “()”, and others are packages.13 / 43

k-means Clustering

set.seed(8953)

iris2 <- iris

# remove class IDs

iris2$Species <- NULL

# k-means clustering

iris.kmeans <- kmeans(iris2, 3)

# check result

table(iris$Species, iris.kmeans$cluster)

## 1 2 3

## setosa 0 50 0

## versicolor 2 0 48

## virginica 36 0 14

14 / 43

# plot clusters and their centers

plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)

points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")],

col=1:3, pch="*", cex=5)

4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0

Sepal.Length

15 / 43

Density-based Clustering

library(fpc)

iris2 <- iris[-5] # remove class IDs

# DBSCAN clustering

ds <- dbscan(iris2, eps = 0.42, MinPts = 5)

# compare clusters with original class IDs

table(ds$cluster, iris$Species)

## setosa versicolor virginica

## 0 2 10 17

## 1 48 0 0

## 2 0 37 0

## 3 0 3 33

16 / 43

# 1-3: clusters; 0: outliers or noise

plotcluster(iris2, ds$cluster)

−8 −6 −4 −2 0 2

17 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

18 / 43

I Association rules: apriori(), eclat() in package arules

I Sequential patterns: arulesSequence

I Visualisation of associations: arulesViz

19 / 43

The Titanic Dataset

load("./data/titanic.raw.rdata")

dim(titanic.raw)

## [1] 2201 4

idx <- sample(1:nrow(titanic.raw), 8)

titanic.raw[idx, ]

## Class Sex Age Survived

## 501 3rd Male Adult No

## 766 Crew Male Adult No

## 1485 3rd Female Adult No

## 1388 2nd Female Adult No

20 / 43

Association Rule Mining

# find association rules with the APRIORI algorithm

library(arules)

rules <- apriori(titanic.raw, control=list(verbose=F),

parameter=list(minlen=2, supp=0.005, conf=0.8),

appearance=list(rhs=c("Survived=No", "Survived=Yes"),

default="lhs"))

# sort rules

quality(rules) <- round(quality(rules), digits=3)

rules.sorted <- sort(rules, by="lift")

# have a look at rules

# inspect(rules.sorted)

21 / 43

# lhs rhs support confidence lift

# 1 {Class=2nd,# Age=Child} => {Survived=Yes} 0.011 1.000 3.096

# 2 {Class=2nd,# Sex=Female,

# Age=Child} => {Survived=Yes} 0.006 1.000 3.096

# 3 {Class=1st,# Sex=Female} => {Survived=Yes} 0.064 0.972 3.010

# 4 {Class=1st,# Sex=Female,

# Age=Adult} => {Survived=Yes} 0.064 0.972 3.010

# 5 {Class=2nd,# Sex=Male,

# Age=Adult} => {Survived=No} 0.070 0.917 1.354

# 6 {Class=2nd,# Sex=Female} => {Survived=Yes} 0.042 0.877 2.716

# 7 {Class=Crew,# Sex=Female} => {Survived=Yes} 0.009 0.870 2.692

# 8 {Class=Crew,# Sex=Female,

# 9 {Class=2nd,# Sex=Male} => {Survived=No} 0.070 0.860 1.271

# 10 {Class=2nd,# Sex=Female,

22 / 43

library(arulesViz)

plot(rules, method = "graph")

Graph for 12 rules

{Class=1st,Sex=Female,Age=Adult}

{Class=1st,Sex=Female}

{Class=2nd,Age=Child}

{Class=2nd,Sex=Female,Age=Adult}

{Class=2nd,Sex=Female,Age=Child}

{Class=2nd,Sex=Female}

{Class=2nd,Sex=Male,Age=Adult}

{Class=2nd,Sex=Male}

{Class=3rd,Sex=Male,Age=Adult}

{Class=3rd,Sex=Male}

{Class=Crew,Sex=Female,Age=Adult}

{Class=Crew,Sex=Female}

{Survived=No}

{Survived=Yes}

width: support (0.006 − 0.192)color: lift (1.222 − 3.096)

23 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

24 / 43

Text Mining with R

I Text mining: tm

I Topic modelling: topicmodels, lda

I Word cloud: wordcloud

I Twitter data access: twitteR

25 / 43

Fetch Twitter Data

## retrieve tweets from the user timeline of @rdatammining

library(twitteR)

# tweets <- userTimeline('rdatamining')

load(file = "./data/rdmTweets.RData")

(nDocs <- length(tweets))

## [1] 320

strwrap(tweets[[320]]$text, width = 50)

## [1] "An R Reference Card for Data Mining is now"

## [2] "available on CRAN. It lists many useful R"

## [3] "functions and packages for data mining"

## [4] "applications."

# convert tweets to a data frame

df <- do.call("rbind", lapply(tweets, as.data.frame))

26 / 43

Text Cleaning

library(tm)

# build a corpus

myCorpus <- Corpus(VectorSource(df$text))

# convert to lower case

myCorpus <- tm_map(myCorpus, tolower)

# remove punctuation & numbers

myCorpus <- tm_map(myCorpus, removePunctuation)

myCorpus <- tm_map(myCorpus, removeNumbers)

# remove URLs

removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)

myCorpus <- tm_map(myCorpus, removeURL)

# remove 'r' and 'big' from stopwords

myStopwords <- setdiff(stopwords("english"), c("r", "big"))

# remove stopwords

myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

27 / 43

Stemming

# keep a copy of corpus

myCorpusCopy <- myCorpus

# stem words

myCorpus <- tm_map(myCorpus, stemDocument)

# stem completion

myCorpus <- tm_map(myCorpus, stemCompletion,

dictionary = myCorpusCopy)

# replace "miners" with "mining", because "mining" was

# first stemmed to "mine" and then completed to "miners"

myCorpus <- tm_map(myCorpus, gsub, pattern="miners",

replacement="mining")

strwrap(myCorpus[320], width=50)

## [1] "r reference card data mining now available cran"

## [2] "list used r functions package data mining"

## [3] "applications"

28 / 43

Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus,

control=list(wordLengths=c(1,Inf)))

# inspect frequent words

(freq.terms <- findFreqTerms(myTdm, lowfreq=20))

## [1] "analysis" "big" "computing"

## [4] "data" "examples" "mining"

## [7] "network" "package" "position"

## [10] "postdoctoral" "r" "research"

## [13] "slides" "social" "tutorial"

## [16] "university" "used"

29 / 43

Associations

# which words are associated with 'r'?

findAssocs(myTdm, "r", 0.2)

## examples code package

## 0.32 0.29 0.20

# which words are associated with 'mining'?

findAssocs(myTdm, "mining", 0.25)

## data mahout recommendation sets

## 0.47 0.30 0.30 0.30

## supports frequent itemset

## 0.30 0.26 0.26

30 / 43

Network of Terms

library(graph)

library(Rgraphviz)

plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=T)

analysis big computing

examples

mining

network

packageposition

postdoctoral r

research

slides

socialtutorial

university

31 / 43

Word Cloud

library(wordcloud)

m <- as.matrix(myTdm)

freq <- sort(rowSums(m), decreasing=T)

wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=F)

mininganalysisexamplesresearch

package

position

slides usednetwork

university

postdoctoral

socialtutorial

computing

applications

codeintroduction

see group

analytics

australia

available

modellingscientist

association

online

parallel

clustering

coursele

series

programming

statistics

ausdmdetection

google

openoutlier

rdatamining

techniques

thanks

chapter

graphics

presentation science

software

starting

studies

via visualizing

business

cancase

center

followerfunctions

kdnuggets

spatial

submission

technology

twitter

california

classification

handling

information

interactive

machine

melbourne

processing

provided

recent

reference

videos

workshop

wwwrdataminingcom

access

canada

canberra

conference

datasets

distributed

dmapps

events

experience

fellow

forecasting

frequent

industrial

knowledge

management

performance

search

sentiment

snowfall

southern

sydney

tenuretrack

youtube

32 / 43

Topic Modelling

library(topicmodels)

set.seed(123)

myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)

terms(myLda, 5)

## Topic 1 Topic 2 Topic 3 Topic 4

## [1,] "data" "r" "r" "research"

## [2,] "mining" "package" "time" "position"

## [3,] "big" "examples" "series" "data"

## [4,] "association" "used" "users" "university"

## [5,] "rules" "code" "talk" "postdoctoral"

## Topic 5 Topic 6 Topic 7 Topic 8

## [1,] "mining" "group" "data" "analysis"

## [2,] "data" "data" "r" "network"

## [3,] "slides" "used" "mining" "social"

## [4,] "modelling" "software" "analysis" "text"

## [5,] "tools" "kdnuggets" "book" "slides"

33 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

34 / 43

I Time series decomposition: decomp(), decompose(), arima(),stl()

I Time series forecasting: forecast

I Time Series Clustering: TSclust

I Dynamic Time Warping (DTW): dtw

35 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

36 / 43

I Packages: igraph, sna

I Centrality measures: degree(), betweenness(), closeness(),transitivity()

I Clusters: clusters(), no.clusters()

I Cliques: cliques(), largest.cliques(), maximal.cliques(),clique.number()

I Community detection: fastgreedy.community(),spinglass.community()

37 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

38 / 43

R and Hadoop

I Packages: RHadoop, RHive

I RHadoop10 is a collection of 3 R packages:

I rmr2 - perform data analysis with R via MapReduce on aHadoop cluster

I rhdfs - connect to Hadoop Distributed File System (HDFS)I rhbase - connect to the NoSQL HBase database

I You can play with it on a single PC (in standalone orpseudo-distributed mode), and your code developed on thatwill be able to work on a cluster of PCs (in full-distributedmode)!

I Step by step to set up my first R Hadoop systemhttp://www.rdatamining.com/tutorials/rhadoop

10https://github.com/RevolutionAnalytics/RHadoop/wiki39 / 43

An Example of MapReducing with R

library(rmr2)

map <- function(k, lines) {words.list <- strsplit(lines, "\\s")words <- unlist(words.list)

return(keyval(words, 1))

}reduce <- function(word, counts) {

keyval(word, sum(counts))

}wordcount <- function(input, output = NULL) {

mapreduce(input = input, output = output, input.format = "text",

map = map, reduce = reduce)

}## Submit job

out <- wordcount(in.file.path, out.file.path)

11From Jeffrey Breen’s presentation on Using R with Hadoophttp://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/

40 / 43

Outline

Introduction

Clustering with R

Text Mining with R

R and Hadoop

Online Resources

41 / 43

Online Resources

I RDataMining websitehttp://www.rdatamining.com

I R Reference Card for Data MiningI R and Data Mining: Examples and Case Studies

I RDataMining Group on LinkedIn (3100+ members)http://group.rdatamining.com

I RDataMining on Twitter (1200+ followers)http://twitter.com/rdatamining

I Free online courseshttp://www.rdatamining.com/resources/courses

I Online documentshttp://www.rdatamining.com/resources/onlinedocs

42 / 43

The End

Thanks!

Email: yanchang(at)rdatamining.com

43 / 43

An Introduction to Data Mining with R

Technology

Transcript of An Introduction to Data Mining with R

Introduction to Mining

STAT2450 - Introduction to Data Mining with R Vector Machineskallada/stat2450/lectures/Lecture10_Slides.pdf · Supervised Data Mining: The Line Drawing Contest To evaluate whether

Deep Sea Mining - KIVI · 10/5/2011 Deep Sea Mining Introduction & (technical) challenges Ir. Rick Lotman Manager R&D Deep Sea Dredging & Mining Contents • Introduction to IHC Merwede

Visualizing Association Rules: Introduction to the R ...arulesViz/vignettes/arulesViz.pdf · Association rule mining is a popular data mining method available in R as the extension

Gastric Cancer Analysis In Data Mining Techniquesijsrcseit.com/paper/CSEIT1722369.pdf · Keywords: Data Mining, Integrative Analysis, Bootstrap Function, R Graphics, I. INTRODUCTION

Data Mining: Introduction

Data mining – introduction

Data Mining Introduction to Data Mining

© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Data Mining: Introduction Chapitre 1 Introduction au Data Mining.

Introduction to Data Mining - homepages.math.uic.eduhomepages.math.uic.edu/~jyang06/stat486/R/DataMining_JYang2014.pdfFundamentals of Data Mining Typical Data Mining Tasks Data Mining

Process Mining Introduction

A gentle introduction to text mining using R | Eight to Late

Introduction of JOGMEC's latest R&D of mining pollution control …mric.jogmec.go.jp/public/kouenkai/2014-02/briefing... · 2015-07-28 · Introduction of JOGMEC's latest R&D of mining

Data Mining: Introduction. Chapter 1. Introduction zMotivation: Why data mining? zWhat is data mining? zData Mining: On what kind of data? zData mining.

Data Mining: Introduction

An introduction to Web Scraping with R · An introduction to Web Scraping and Text Mining with R SimonMunzert UniversityofKonstanz October2014 Web Scraping with R Simon Munzert

Introduction to Data Mining and Knowledge Discoveryredmond/teach/624/Intro-to-Data-Mining-2n… · Introduction to Data Mining and Knowledge Discovery INTRODUCTION Data mining: What

IE 485 - Introduction to Data Mining - R Tutorial · Chapter 1 Introduction to Data Mining with R This document includes R codes and brief discussions that take place in IE 485. I

Introduction to R for Data Science :: Session 8 [Intro to Text Mining in R, ML Estimation + Binomial Logistic Regression]

Data Mining & Matrices Lecture 2: Introduction to R · DMM, summer 2015 1 Introduction to R Data Mining & Matrices Lecture 2: Saskia Metzler 28 April 2015