Classification with Naive Bayes
Copyright 2011 Cloudera Inc. All rights reserved
Classification with Naïve Bayes: A Deep Dive into Apache Mahout
Today’s speaker – Josh Patterson
• [email protected] / twitter: @jpatanooga
• Master’s Thesis: self-organizing mesh networks
– Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
• Conceived, built, and led Hadoop integration for the openPDC project at TVA (Smartgrid stuff)
– Led a small team that designed classification techniques for time series and MapReduce
– Open source work at http://openpdc.codeplex.com
• Now: Solutions Architect at Cloudera
What is Classification?
• Supervised Learning
• We give the system a set of instances to learn from
• System builds knowledge of some structure
– Learns “concepts”
• System can then classify new instances
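The idea above can be sketched with a toy "learner". This is a minimal, hypothetical example (the fruit features, labels, and counting scheme are invented purely for illustration, not Mahout code):

```python
from collections import Counter, defaultdict

# Toy training set: each instance is (feature tuple, label).
train = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("yellow", "long"), "banana"),
]

# "Learning a concept" here is just counting which label
# each feature value co-occurs with in the training data.
counts = defaultdict(Counter)
for features, label in train:
    for f in features:
        counts[f][label] += 1

def classify(features):
    # Score each label by how many of the instance's features
    # were seen with that label during training.
    votes = Counter()
    for f in features:
        votes.update(counts[f])
    return votes.most_common(1)[0][0]

print(classify(("red", "round")))  # apple
```

A real classifier replaces the raw co-occurrence votes with probabilities, which is exactly what Naïve Bayes does.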
Supervised vs Unsupervised Learning
• Supervised
– Give system examples/instances of multiple concepts
– System learns “concepts”
– More “hands on”
– Example: Naïve Bayes, Neural Nets
• Unsupervised
– Uses unlabeled data
– Builds joint density model
– Example: k-means clustering
Naïve Bayes
• Called “Naïve Bayes” because it is based on Bayes’ rule and “naively” assumes independence of the features given the label
– It is only valid to multiply probabilities when the events are independent
– This is a simplistic assumption in real life
– Despite the naive assumption, Naïve Bayes works well on actual datasets
Naïve Bayes Classifier
• A simple probabilistic classifier based on
– applying Bayes’ theorem (from Bayesian statistics)
– strong (naive) independence assumptions
– A more descriptive term for the underlying probability model would be “independent feature model”
Naïve Bayes Classifier (2)
• Assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature
– Example:
• A fruit may be considered to be an apple if it is red, round, and about 4" in diameter
• Even if these features depend on each other or upon the existence of the other features, a naive Bayes classifier considers all of these properties to independently contribute to the probability that this fruit is an apple
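The apple example works out numerically as follows. All of the probabilities below are invented for illustration only; the point is that the per-feature likelihoods are simply multiplied, as if independent:

```python
# Hypothetical probabilities, invented purely for illustration.
p_apple = 0.3                      # prior P(apple)
p_feat_given_apple = {"red": 0.7, "round": 0.9, "approx_4in": 0.8}

p_other = 0.7                      # prior P(not apple)
p_feat_given_other = {"red": 0.2, "round": 0.4, "approx_4in": 0.3}

def score(prior, likelihoods, observed):
    # The naive step: multiply per-feature likelihoods as if independent.
    s = prior
    for f in observed:
        s *= likelihoods[f]
    return s

observed = ["red", "round", "approx_4in"]
apple_score = score(p_apple, p_feat_given_apple, observed)
other_score = score(p_other, p_feat_given_other, observed)
print(apple_score > other_score)  # True: classify as apple
```

Whichever class has the larger product wins; no attempt is made to model how "red" and "round" co-vary.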
A Little Bit o’ Theory
Condensing Meaning
• To train our system we need:
– The total number of input training instances (count)
– Counts of tuples:
• {attribute_n, outcome_o, value_m}
– Total counts of each outcome_o:
• {outcome-count}
• To calculate each Pr[E_n|H]:
– count{attribute_n, outcome_o, value_m} / {outcome-count}
…From the Vapor of That Last Big Equation
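The counts-to-probabilities step can be written directly. The numbers below follow the well-known weather/"play" dataset from Witten et al.; the function and variable names are my own:

```python
from collections import Counter

# Counts gathered from labeled training data, in the form the slide
# describes: {attribute, outcome, value} tuples plus per-outcome totals.
# Numbers are from the classic weather/"play" dataset (Witten et al.):
# 2 of the 9 "yes" days and 3 of the 5 "no" days have outlook = sunny.
tuple_counts = Counter({
    ("outlook", "yes", "sunny"): 2,
    ("outlook", "no", "sunny"): 3,
})
outcome_counts = Counter({"yes": 9, "no": 5})

def pr_evidence_given_outcome(attribute, value, outcome):
    # Pr[E_n | H] = count{attribute_n, outcome_o, value_m} / {outcome-count}
    return tuple_counts[(attribute, outcome, value)] / outcome_counts[outcome]

print(pr_evidence_given_outcome("outlook", "sunny", "yes"))  # 2/9
print(pr_evidence_given_outcome("outlook", "sunny", "no"))   # 3/5
```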
A Real Example From Witten, et al
Enter Apache Mahout
• What is it?
– Apache Mahout is a scalable machine learning library that supports large data sets
• What are the major algorithm types?
– Classification
– Recommendation
– Clustering
• http://mahout.apache.org/
Mahout Algorithms
| Size of Dataset | Mahout Algorithm | Execution Model | Characteristics |
| --- | --- | --- | --- |
| Small | SGD | Sequential | Uses all types of predictor variables |
| Medium | Naïve Bayes / Complementary Naïve Bayes | Parallel | Prefers text; high training cost |
| Large | Random Forest | Parallel | Uses all types of predictor variables; high training cost |
Naïve Bayes and Text
• Naive Bayes does not model text well.
– “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
• http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
– Mahout does some modifications based around TF-IDF scoring (Next Slide)
• Includes two other pre-processing steps, common for information retrieval but not for Naive Bayes classification
High Level Algorithm
• For each feature (word) in each doc:
– Calculate the “weight-normalized TF-IDF”
• For a given feature in a label, this is the TF-IDF calculated using the standard IDF multiplied by the weight-normalized TF
– The sum of the weight-normalized TF-IDF over all features in a label is called Sigma_k, and alpha_i == 1.0
Weight = log [ ( W-N-TF-IDF + alpha_i ) / ( Sigma_k + N ) ]
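The formula is easy to evaluate directly. A one-function sketch (the parameter names are my own; alpha_i defaults to 1.0 as the slide states, and N is taken to be the number of distinct features):

```python
import math

def weight(wn_tfidf, sigma_k, n_features, alpha_i=1.0):
    # Weight = log [ ( W-N-TF-IDF + alpha_i ) / ( Sigma_k + N ) ]
    # sigma_k: sum of weight-normalized TF-IDF over all features in the label
    # n_features: N, which smooths the denominator the way alpha_i smooths
    # the numerator, so unseen features get a finite (very negative) weight
    return math.log((wn_tfidf + alpha_i) / (sigma_k + n_features))

print(weight(3.0, 10.0, 6))  # log(4/16) ≈ -1.386
```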
BayesDriver Training Workflow
1. BayesFeatureDriver
– Compute parts of TF-IDF via Term-Doc-Count, WordFreq, and FeatureCount
2. BayesTfIdfDriver
– Calculate the TF-IDF of each feature/word in each label
3. BayesWeightSummerDriver
– Take the TF-IDF and calculate the trainer weights
4. BayesThetaNormalizerDriver
– Calculate the normalization factor Sigma_Wij for each complement class
Naïve Bayes Training MapReduce Workflow in Mahout
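Conceptually the four drivers form a pipeline over the labeled corpus. The following is a rough in-memory sketch of that data flow, not Mahout's actual MapReduce implementation; the toy corpus, variable names, and simplifications (e.g. using raw term counts as the "weight-normalized TF") are all mine:

```python
import math
from collections import Counter

# Toy corpus: label -> list of tokenized docs.
corpus = {
    "sports": [["ball", "goal"], ["goal", "team"]],
    "tech":   [["code", "bug"], ["code", "ship"]],
}

# Stage 1 (BayesFeatureDriver): term counts per label and doc frequencies.
term_counts = {lbl: Counter(t for doc in docs for t in doc)
               for lbl, docs in corpus.items()}
doc_freq = Counter(t for docs in corpus.values()
                   for doc in docs for t in set(doc))
n_docs = sum(len(docs) for docs in corpus.values())

# Stage 2 (BayesTfIdfDriver): TF-IDF per feature per label.
tfidf = {lbl: {t: c * math.log(n_docs / doc_freq[t])
               for t, c in cnt.items()}
         for lbl, cnt in term_counts.items()}

# Stage 3 (BayesWeightSummerDriver): per-label sums, i.e. Sigma_k.
sigma_k = {lbl: sum(feats.values()) for lbl, feats in tfidf.items()}

# Stage 4 (BayesThetaNormalizerDriver, loosely): final normalized log
# weights via Weight = log((tfidf + alpha_i) / (Sigma_k + N)), alpha_i = 1.
vocab_size = len(doc_freq)
weights = {lbl: {t: math.log((w + 1.0) / (sigma_k[lbl] + vocab_size))
                 for t, w in feats.items()}
           for lbl, feats in tfidf.items()}

print(sorted(weights))  # ['sports', 'tech']
```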
Logical Classification Process
1. Gather, Clean, and Examine the Training Data
– Really get to know your data!
2. Train the Classifier, allowing the system to “Learn” the “Concepts”
– But not “overfit” to this specific training data set
3. Classify New Unseen Instances
– With Naïve Bayes we calculate the probability of each class with respect to this instance
How Is Classification Done?
• Sequentially or via MapReduce
• TestClassifier.java
– Creates a ClassifierContext
– For each file in the directory:
• For each line:
– Break the line into a map of tokens
– Feed the array of words to the classifier engine for a new classification/label
– Collect the classifications as output
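The loop above can be sketched as follows. The classifier here is a trivial stub standing in for Mahout's ClassifierContext, and the labels and keyword rule are invented for illustration:

```python
# Stub classifier standing in for Mahout's ClassifierContext: a real one
# would score every label's weights against the tokens and return the best.
def classify_tokens(tokens):
    return "comp.graphics" if "pixel" in tokens else "rec.autos"

def classify_file(lines):
    results = []
    for line in lines:
        tokens = line.lower().split()            # break line into tokens
        results.append(classify_tokens(tokens))  # feed tokens to the engine
    return results                               # collect classifications

docs = ["The pixel shader ran fast", "New engine and tires"]
print(classify_file(docs))  # ['comp.graphics', 'rec.autos']
```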
A Quick Note About Training Data…
• Your classifier can only be as good as the training data lets it be…
– If you don’t do good data prep, everything will perform poorly
– Data collection and pre-processing takes the bulk of the time
Enough Math, Run the Code
• Download and install Mahout
– http://www.apache.org
• Run 20Newsgroups Example
– https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups
– Uses Naïve Bayes Classification
– Download and extract 20news-bydate.tar.gz from the 20newsgroups dataset
Generate Test and Train Dataset
Training Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-train \
  -o examples/bin/work/20news-bydate/bayes-train-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8

Test Dataset:

mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
  -p examples/bin/work/20news-bydate/20news-bydate-test \
  -o examples/bin/work/20news-bydate/bayes-test-input \
  -a org.apache.mahout.vectorizer.DefaultAnalyzer \
  -c UTF-8
Train and Test Classifier
Train:

$MAHOUT_HOME/bin/mahout trainclassifier \
  -i 20news-input/bayes-train-input \
  -o newsmodel \
  -type bayes \
  -ng 3 \
  -source hdfs

Test:

$MAHOUT_HOME/bin/mahout testclassifier \
  -m newsmodel \
  -d 20news-input \
  -type bayes \
  -ng 3 \
  -source hdfs \
  -method mapreduce
Other Use Cases
• Predictive Analytics
– You’ll hear this term a lot in the field, especially in the context of SAS
• General Supervised Learning Classification
– We can recognize a lot of things with practice
• And lots of tuning!
• Document Classification
• Sentiment Analysis
Questions?
• We’re Hiring!
• Cloudera’s Distro of Apache Hadoop:
– http://www.cloudera.com
• Resources
– “Tackling the Poor Assumptions of Naive Bayes Text Classifiers”
• http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf