1
Statistics 202: Statistical Aspects of Data Mining
Professor Rajan Patel
Lecture 10: Finish Chapter 5
Agenda:
1) Assign Homework #4 (due Aug 13th)
2) Lecture over more of Chapter 5
2
Homework Assignment:
Homework #4 is due Monday 8/13 at 4:15 PM
Email it to [email protected] as a pdf file.
The assignment is posted at
http://sites.google.com/site/stats202/homework
*If emailing, please send a single file (pdf is preferred) and make sure your name is on the first page and in the body of the email. Also, the file name should say “homework4” and your name.
3
Introduction to Data Mining
by Tan, Steinbach, Kumar
Chapter 5: Classification: Alternative Techniques
4
Ensemble Methods (Section 5.6, page 276)
Ensemble methods aim at “improving classification accuracy by aggregating the predictions from multiple classifiers” (page 276)
One of the most obvious ways of doing this is simply by averaging classifiers which make errors somewhat independently of each other
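To see why this helps, here is a quick illustration in R (a hedged sketch, not a computation from the lecture): if 25 classifiers each err independently with probability 0.35, a majority vote errs only when 13 or more of them err at once.

# Error rate of a majority vote over 25 independent classifiers,
# each with individual error rate 0.35:
sum(dbinom(13:25, size = 25, prob = 0.35))  # about 0.06, far below 0.35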
5
Ensemble Methods (Section 5.6, page 276)
Ensemble methods include:
- Bagging (page 283)
- Random Forests (page 290)
- Boosting (page 285)
Bagging builds different classifiers by training on repeated samples (with replacement) from the data
Random Forests averages many trees which are constructed with some amount of randomness
Boosting combines simple base classifiers by upweighting data points which are classified incorrectly
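For concreteness, here is a minimal sketch of bagging in R (illustrative only: the function name bag_predict and the choice of rpart() trees are my own, not from the lecture).

library(rpart)

# Bagging: fit B trees on bootstrap samples and take a majority vote.
# x: data frame of predictors, y: factor response, x_new: points to classify.
bag_predict <- function(x, y, x_new, B = 25) {
  d <- data.frame(x, y = y)
  votes <- matrix(NA_character_, nrow(x_new), B)
  for (b in 1:B) {
    idx <- sample(nrow(d), replace = TRUE)  # resample rows with replacement
    fit <- rpart(y ~ ., data = d[idx, ], method = "class")
    votes[, b] <- as.character(predict(fit, x_new, type = "class"))
  }
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
}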
6
Random Forests (Section 5.6.6, page 290)
One way to create random forests is to grow decision trees top down, but at each node consider only a random subset of the attributes for splitting instead of all the attributes
Random Forests are a very effective technique
They are based on the paper
L. Breiman. Random forests. Machine Learning, 45:5-32, 2001
They can be fit in R using the function randomForest() in the library randomForest
7
In class exercise #39:
Use randomForest() in R to fit the default Random Forest to the last column of the sonar training data at
http://sites.google.com/site/stats202/data/sonar_train.csv
Compute the misclassification error for the test data at
http://sites.google.com/site/stats202/data/sonar_test.csv
8
Solution:
install.packages("randomForest")
library(randomForest)
train <- read.csv("sonar_train.csv", header = FALSE)
test <- read.csv("sonar_test.csv", header = FALSE)
y <- as.factor(train[, 61])
x <- train[, 1:60]
y_test <- as.factor(test[, 61])
x_test <- test[, 1:60]
fit <- randomForest(x, y)
1 - sum(y_test == predict(fit, x_test)) / length(y_test)
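Continuing with x and y from above (a hedged variation, not part of the exercise): the amount of randomness is controlled by the mtry argument of randomForest(), the number of attributes considered at each split; for classification the default is about the square root of the number of attributes.

fit2 <- randomForest(x, y, ntree = 500, mtry = floor(sqrt(ncol(x))))  # sqrt(60) is about 7
fit2$err.rate[500, "OOB"]  # out-of-bag error estimate after 500 trees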
9
Boosting (Section 5.6.5, page 285)
Boosting has been called the “best off-the-shelf classifier in the world”
There are a number of explanations for boosting, but it is not completely understood why it works so well
The most popular algorithm is AdaBoost, due to Freund and Schapire
Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data, and then taking a weighted majority vote of the sequence of classifiers thus produced.
10
Boosting (Section 5.6.5, page 285)
Boosting can use any classifier as its weak learner (base classifier) but decision trees are by far the most popular
Boosting usually drives the training error to zero, but it rarely overfits, which is very curious
11
Boosting (Section 5.6.5, page 285)
Boosting works by upweighting, at each iteration, the points which are misclassified
On paper, boosting looks like an optimization (similar to maximum likelihood estimation), but in practice it seems to benefit a lot from averaging like Random Forests does
There exist R libraries for boosting, but these are written by statisticians who have their own views of boosting, so I would not encourage you to use them
The best thing to do is to write code yourself since the algorithms are very basic
12
AdaBoost
Here is a version of the AdaBoost algorithm
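(The algorithm below is reconstructed in the notation of exercise #40, with class labels y_i in {-1, +1}; it matches the R code that follows.)

\begin{aligned}
&\text{1. Initialize } F_0(x) = 0.\\
&\text{2. For } m = 1, \dots, M:\\
&\qquad \text{(a) set } w_i = e^{-y_i F_{m-1}(x_i)} \Big/ \textstyle\sum_j e^{-y_j F_{m-1}(x_j)},\\
&\qquad \text{(b) fit a base classifier } g_m(x) \in \{-1, +1\} \text{ to the training data weighted by the } w_i,\\
&\qquad \text{(c) compute } e_m = \textstyle\sum_i w_i \, I\big(y_i g_m(x_i) < 0\big) \text{ and } \alpha_m = \tfrac{1}{2}\log\big((1 - e_m)/e_m\big),\\
&\qquad \text{(d) update } F_m(x) = F_{m-1}(x) + \alpha_m g_m(x).\\
&\text{3. Output the classifier } \operatorname{sign}\big(F_M(x)\big).
\end{aligned}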
The algorithm repeats until a chosen stopping time
The final classifier is based on the sign of F_M
13
In class exercise #40:
Use R to fit the AdaBoost classifier to the last column of the sonar training data at
http://sites.google.com/site/stats202/data/sonar_train.csv
Plot the misclassification error for the training data and the test data at
http://sites.google.com/site/stats202/data/sonar_test.csv
as a function of the iterations. Run the algorithm for 500 iterations. Use default rpart() as the base learner.
Solution:
train <- read.csv("sonar_train.csv", header = FALSE)
test <- read.csv("sonar_test.csv", header = FALSE)
y <- train[, 61]
x <- train[, 1:60]
y_test <- test[, 61]
x_test <- test[, 1:60]
14
Solution (continued):
train_error <- rep(0, 500)  # Keep track of errors
test_error <- rep(0, 500)
f <- rep(0, 130)      # 130 pts in training data
f_test <- rep(0, 78)  # 78 pts in test data
i <- 1
library(rpart)
15
Solution (continued):
while (i <= 500) {
  w <- exp(-y * f)  # This is a shortcut to compute w
  w <- w / sum(w)
  fit <- rpart(y ~ ., x, w, method = "class")
  g <- -1 + 2 * (predict(fit, x)[, 2] > .5)  # make -1 or 1
  g_test <- -1 + 2 * (predict(fit, x_test)[, 2] > .5)
  e <- sum(w * (y * g < 0))
16
Solution (continued):
  alpha <- .5 * log((1 - e) / e)
  f <- f + alpha * g
  f_test <- f_test + alpha * g_test
  train_error[i] <- sum(1 * f * y < 0) / 130
  test_error[i] <- sum(1 * f_test * y_test < 0) / 78
  i <- i + 1
}
17
Solution (continued):
plot(seq(1, 500), test_error, type = "l",
     ylim = c(0, .5), ylab = "Error Rate", xlab = "Iterations", lwd = 2)
lines(train_error, lwd = 2, col = "purple")
legend(4, .5, c("Training Error", "Test Error"),
       col = c("purple", "black"), lwd = 2)
18
Solution (continued):
[Plot: Training Error (purple) and Test Error (black) as Error Rate (0 to 0.5) versus Iterations (0 to 500)]
19
Naive Bayes Classifier (Section 5.3.3, page 231)
This is a simple method that is sometimes quite effective
To understand this, it is helpful to understand conditional probability, Bayes Theorem and (conditional) independence
These topics are discussed in your book in Section 5.3 and on the following slides
20
Conditional Probability
The conditional probability of event A given event B is written as P(A|B)
This is the probability of event A if/when/given event B happens
Example:
P(make) = 8/20 = .4     P(make|close) = 3/5 = .6

        far  close  total
make      5      3      8
miss     10      2     12
total    15      5     20
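As a quick check in R (table entries from the slide; the variable name shots is mine):

shots <- matrix(c(5, 10, 3, 2), nrow = 2,
                dimnames = list(c("make", "miss"), c("far", "close")))
sum(shots["make", ]) / sum(shots)               # P(make) = 8/20 = 0.4
shots["make", "close"] / sum(shots[, "close"])  # P(make|close) = 3/5 = 0.6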
21
Conditional Probability
You can define the conditional probability P(A|B) as P(A,B) / P(B)
Example: P(make|close) = P(make,close) / P(close) = (3/20) / (5/20) = .15/.25 = .6
Bayes Theorem turns this around to solve for P(B|A) if you have P(A|B)
P(B|A) = P(A,B) / P(A) = P(A|B) P(B) / P(A)
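When P(A) is not given directly, it can be expanded with the law of total probability; this is the form used in the snowboarding example below:

P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})}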
        far  close  total
make      5      3      8
miss     10      2     12
total    15      5     20
22
Example of Bayes Theorem
On Friday my friend said that there was a 50 percent chance he would go snowboarding over the weekend. He said that if he goes snowboarding there is about a 40% chance he will break his leg; whereas, if he does not go snowboarding there is only a 2% chance he will break his leg doing something else. If I see him on Monday and his leg is broken, what would I think is the probability he went snowboarding over the weekend?
23
Example of Bayes Theorem
P(S) = 0.5
P(B|S) = 0.4
P(B|not S) = 0.02

Expanding P(B) with the law of total probability:
P(S|B) = P(B|S)*P(S) / P(B)
       = 0.4*0.5 / (0.4*0.5 + 0.02*0.5)
       = 0.2 / 0.21 ≈ 0.95
24
Independence
A and B are independent iff P(A,B) = P(A)*P(B)
Here the events are not independent (first table below):
P(make,far) = 5/20 = .25 but P(make)*P(far) = 8/20 * 15/20 = .30
Here the events are independent (second table below):
P(make,far) = 9/20 = .45 which equals P(make)*P(far) = 12/20 * 15/20 = .45

        far  close  total
make      5      3      8
miss     10      2     12
total    15      5     20

        far  close  total
make      9      3     12
miss      6      2      8
total    15      5     20
25
Conditional Independence
A and B are conditionally independent given C iff P(A,B|C) = P(A|C)*P(B|C)
Example: Height and reading ability are not independent but they are conditionally independent given the age level
young:
              short  tall  total
reads poorly     90      9     99
reads well       10      1     11
total           100     10    110

old:
              short  tall  total
reads poorly      2     20     22
reads well        8     80     88
total            10    100    110

all (combined):
              short  tall  total
reads poorly     92     29    121
reads well       18     81     99
total           110    110    220
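A quick numerical check of these claims in R (numbers taken from the tables above):

# Given young: P(reads poorly, short | young) equals the product of the conditionals
c(90/110, (99/110) * (100/110))   # both 0.818..., so conditionally independent
# But in the combined "all" table the two attributes are not independent:
c(92/220, (121/220) * (110/220))  # 0.418 vs 0.275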
26
Naive Bayes Classifier (Section 5.3.3, page 231)
The naive Bayes classifier assumes that the x attributes are conditionally independent given the class attribute y
Thus, P(Y|X) = P(Y) * P(X|Y) / P(X) = P(Y) * P(X1|Y) * ... * P(Xd|Y) / P(X)
Then for any x you choose the class y that gives you the largest numerator
You estimate the P(Xi|Y) values based on the data (see next slide)
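In symbols, since P(X) is the same for every class, the rule is:

\hat{y} = \operatorname{argmax}_{y} \; P(y) \prod_{i=1}^{d} P(X_i \mid y)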
27
How to Estimate the P(Xi|Y)
For categorical x’s, just use counts (although some people modify this to fix problems with zero or small counts, see page 236)
For continuous x’s, fit some distribution function. The normal distribution using the observed sample mean and observed sample standard deviation is popular
The normal probability density function is given by
p(X) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^{2}}

where μ is the mean and σ is the standard deviation
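In R this density is available as dnorm(), which the solution on the following slides uses:

dnorm(120, mean = 90, sd = 5)  # 1.2e-09, the value of P(120|yes) used later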
28
Example of the Naive Bayes Classifier
For this data, use naive Bayes to classify an observation with
X = (Refund = No, Marital Status = Married, Taxable Income = 120K)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
29
Example of the Naive Bayes Classifier
Solution:
First estimate the conditional probabilities from the table:

P(refund=yes|yes) = 0/3      P(refund=yes|no) = 3/7
P(refund=no|yes) = 3/3       P(refund=no|no) = 4/7
P(ms=single|yes) = 2/3       P(ms=single|no) = 2/7
P(ms=divorced|yes) = 1/3     P(ms=divorced|no) = 1/7
P(ms=married|yes) = 0/3      P(ms=married|no) = 4/7

Given yes, income has mean = 90 and sd = 5.
Given no, income has mean = 110 and sd = 54.54.
P(120|yes) = dnorm(120, 90, 5) = 1.2*10^-9
P(120|no) = dnorm(120, 110, 54.54) = 0.0072

Comparing the two numerators for X = (Refund = No, Marital Status = Married, Taxable Income = 120K):

P(yes|X) = 3/10 * 1/P(X) * 3/3 * 0/3 * 1.2*10^-9
         < P(no|X) = 7/10 * 1/P(X) * 4/7 * 4/7 * 0.0072

So we classify this X as "NO"
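As a cross-check (not part of the lecture), the naiveBayes() function in the e1071 package fits the same model, using normal densities for the continuous attribute:

install.packages("e1071")
library(e1071)
d <- data.frame(
  Refund  = c("Yes","No","No","Yes","No","No","Yes","No","No","No"),
  Marital = c("Single","Married","Single","Married","Divorced",
              "Married","Divorced","Single","Married","Single"),
  Income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  Evade   = c("No","No","No","No","Yes","No","No","Yes","No","Yes"),
  stringsAsFactors = TRUE)
fit <- naiveBayes(Evade ~ ., data = d)
predict(fit, data.frame(Refund = "No", Marital = "Married", Income = 120))
# predicts "No", matching the hand computation above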