Automated Classification of Short Message Service (SMS)
ALOYSIUS OCHOLA
MAKERERE UNIVERSITY
ARTIFICIAL INTELLIGENCE GROUP
USING NAÏVE BAYES ALGORITHM
Artificial Intelligence Seminar . May 30 . 2013
Automated Classification of SMS using Naïve Bayes Algorithm
Classification
• A supervised learning technique that involves assigning a label to each of a set of unlabeled input objects.
• Based on the number of classes present, there are two types of
classification:
– Binary classification: classifies the members of a given set of objects into one of two classes.
– Multi-class classification: classifies instances into more than two classes.
• Compared with the better-understood binary classification, multi-class classification is more complex and less researched.
Text Classification/Categorization
• Text documents are one of several areas where classification can be applied.
• TC (text classification/categorization) is the application of classification algorithms to text documents in order to automatically group them into predefined categories.
• How to represent text documents
– Preprocessing and feature selection
• How to build the classifier: compute a classification function.
– Training the classifier and classifying
Short Text Documents
• Normal documents like email, journals, etc. are typically large and rich with content (natural language).
– It is easy to apply traditional classification approaches, which rely on word frequencies.
• This is unlike short text documents like SMS and Twitter messages, forum posts, etc., where word occurrence counts are too small.
– Dealing with short text therefore requires a little more than the traditional techniques.
• Especially during preprocessing and feature selection
Applications of TC
• Spam filtering, a process that tries to discern email spam messages from legitimate email.
• Email routing, sending an email sent to a general address to a specific address or
mailbox depending on topic.
• Language identification, automatically determining the language of a text
• Genre classification, automatically determining the genre of a text.
• Movie reviewing, automatically classifying reviews as good, bad, or neutral.
• Etc . . .
Data Preprocessing
• Data captured in the real world is noisy, inconsistent, and of low quality; some cleaning and transformation is required.
• For quality results from short text, most of the major text-preprocessing steps are skipped and some selected ones are modified.
• Tokenization and lowercasing: splitting text streams into tokens and forcing lowercase.
– Word boundary detection, using whitespace and punctuation
– Note: Prepared corpus was lowercased.
• Minor spell-correction: although there is a growing culture of using informal shorthand in SMS texts, some spell corrections can still be done.
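The tokenization-and-lowercasing step above can be sketched in Python (the language used for the talk's NLTK-based implementation); this minimal version detects word boundaries with a regular expression rather than NLTK's tokenizers, which is an assumption of this sketch:

```python
import re

def tokenize(sms: str) -> list[str]:
    """Split an SMS stream into lowercase word tokens.

    Word boundaries are detected using whitespace and punctuation:
    anything that is not a run of letters or apostrophes is a boundary.
    """
    return re.findall(r"[a-z']+", sms.lower())

print(tokenize("Helo, I LOOOVE this fone!"))  # ['helo', 'i', 'looove', 'this', 'fone']
```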
Data Preprocessing (cont)
– Regular expression replacer: replacing contracted (apostrophe) words with their expansions.
• a list of (apostrophe-word RE, correction) pairs, e.g. won’t : will not, didn’t : did not, . . .
– Repeat replacer: people are often not strictly grammatical, and may write "I looooove it" to emphasize the word "love".
• Before replacing any characters in the supplied word:
– The module first reduces any word with more than two repeating characters to just two, as no such words exist in the English vocabulary, for example "goooooooose" to "goose".
» RE: (\w*)(\w)\2(\w*)
– It then looks up whether WordNet (a lexical database for the English natural language) recognizes the supplied word.
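A minimal sketch of the repeat replacer, under one simplifying assumption: the `is_known` predicate stands in for the WordNet lookup (with NLTK this would be a check such as whether `wordnet.synsets(word)` is non-empty), so the dictionary is supplied by the caller here:

```python
import re

REPEAT_RE = re.compile(r"(\w*)(\w)\2(\w*)")

def remove_repeats(word: str, is_known=lambda w: False) -> str:
    """Strip one repeated character at a time until the dictionary
    recognizes the word or no repeated pair remains."""
    if is_known(word):
        return word
    shorter = REPEAT_RE.sub(r"\1\2\3", word)
    if shorter == word:  # no repeated pair left to collapse
        return word
    return remove_repeats(shorter, is_known)

print(remove_repeats("looove"))                             # love
print(remove_repeats("goooooose", lambda w: w == "goose"))  # goose
```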
Data Preprocessing (cont)
• Otherwise, the regular expression (RE) (\w*)(\w)\2(\w*) is used to remove the extra repeated characters from the word.
– matches 0 or more starting characters (\w*)
– a single character (\w), followed by another instance of that character \2
– then 0 or more ending characters (\w*)
• Stop-words filtering: the process of removing the most frequent words that occur in a document.
– Look up a file containing stop words and return only the words not in that file/dictionary.
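The stop-word look-up reduces to a set-membership test; the small stop-word list below is a hypothetical stand-in for the file the talk reads (NLTK also ships one as `stopwords.words('english')`):

```python
# Hypothetical stop-word dictionary; in practice this is loaded from a file.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "in", "and"}

def filter_stop_words(tokens):
    """Return only the tokens that are not in the stop-word dictionary."""
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_stop_words(["send", "the", "money", "to", "me"]))  # ['send', 'money', 'me']
```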
A Classifier
• A classifier is built on a function f that determines the category of an input feature vector x, given a fixed set of classes C = {c1, c2, …, cn} and a description of features x ∈ X,
– where X is the feature space.
• In simple terms, f(x) ∈ C,
– where f is the classification function whose domain is X and whose range is C. The class labels in C can be ordered or unordered (categorical).
• A classifier is expected to learn from a set of N input-output pairs (simply, a training data set) and predict the class of an unseen input; that is, it maps X to C: f : X → C.
Building the Text Classifier
• For this particular case, we deal with a probabilistic text classifier ft based on Naïve Bayes classification (NBC).
• Building the classifier therefore involves a recursive process of creating a functional classifier by training it with an example data set (NB learning) and running the trained classifier on unknown content to determine class membership for that content (Bayesian classification).
• To predict the class membership of a certain new document X, a probabilistic classifier calculates the probability of a class C given that document, that is: P(C | X).
Naïve Bayes Algorithm
• It is a simple probabilistic learning and classification method built upon Bayes’ probability theory.
• It assumes that the presence (or absence) of a particular feature of a class is not related to the presence (or absence) of any other feature (the naïve assumption).
• Uses the prior probability P(C) of each category, given no information about an item.
• Categorization produces a posterior probability distribution P(C | X) over the possible categories, given a description of an item.
Naïve Bayes (NB) Probability Theorem
• Derived from the definition of conditional probability
– the probability that an event will occur, when another event is known to occur or to have occurred.
• From the product rule, given events C and X:
P(C ∩ X) = P(C | X) P(X) = P(X | C) P(C)
• The conditional probability is given as:
P(C | X) = P(C ∩ X) / P(X),  P(X) ≠ 0
• Bayes Rule:
P(C | X) = P(X | C) P(C) / P(X),  P(X) ≠ 0    Equation (1)
• P(C): prior probability, the initial probability that C holds before seeing any evidence
• P(X): probability that X is observed
• P(X | C): likelihood, the probability of observing X given that C holds
• P(C | X): posterior probability, the probability that C holds given that X is observed
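Equation (1) can be illustrated with hypothetical numbers (the prior and likelihoods below are invented for illustration, not taken from the talk's corpus): suppose 40% of SMS are spam, and the word "win" appears in 50% of spam but only 5% of legitimate messages.

```python
# Hypothetical prior and likelihoods for a spam/ham toy example.
p_spam = 0.4            # P(C): prior
p_win_given_spam = 0.5  # P(X|C): likelihood
p_win_given_ham = 0.05

# Evidence P(X) by the law of total probability.
p_win = p_win_given_spam * p_spam + p_win_given_ham * (1 - p_spam)

# Bayes Rule (Equation (1)): P(C|X) = P(X|C) P(C) / P(X).
p_spam_given_win = p_win_given_spam * p_spam / p_win
print(round(p_spam_given_win, 3))  # 0.87
```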
Deriving NB Classification Algorithm
• Given a set of feature vectors for each possible class C, the task of the NBC (NB classification) algorithm is to approximate the probability that new input features X are present in C, that is, the class posterior, or simply the greatest P(C = c | X).
• Assume C is a boolean random variable and a vector space X containing n boolean attributes:
– If ci is the ith possible value of C and xk denotes the kth attribute of X
– Applying the NB probability theorem (Equation (1)):
P(C = ci | X = xk) = P(X = xk | C = ci) P(C = ci) / Σj P(X = xk | C = cj) P(C = cj)    Equation (2)
Deriving NBC Algorithm
• NB conditional independence assumption: features (term presence) are independent of each other, given the class. A new document of n features can therefore be classified into one of the C classes using Equation (2), with:
P(X | C) = ∏k=1..n P(xk | C)
• The aim of the classifier is to return the maximum posterior probability of c, thus:
c = argmax_ci [ P(C = ci) ∏k P(xk | C = ci) ] / [ Σj P(C = cj) ∏k P(xk | C = cj) ]
• Further, because the sample space (the denominator) is always constant for all the classes and does not depend on any class ci of C, the NBC theorem is given as:
c = argmax_ci P(C = ci) ∏k P(xk | C = ci)    Equation (3)
Training Naïve Bayes Text Classifier
• During the training process, the classification function ft extracts and selects the most useful features from the example corpus and labels them with their appropriate class.
– Construct and store a mapping of feature-set:label pairs (the training dataset), which ft will learn from.
• feature-set is a list of preprocessed and unique term
occurrences from the document samples
• label is the known class of that feature-set.
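The feature-set:label pairs described above might be built as follows; the two-message corpus and the whitespace-split "preprocessing" are hypothetical stand-ins for the prepared Ureport corpus and the full pipeline:

```python
# Hypothetical labelled SMS samples standing in for the prepared corpus.
corpus = [
    ("free prize waiting claim now", "spam"),
    ("are we still meeting for lunch", "ham"),
]

def to_feature_set(text):
    # Stand-in for the preprocessing pipeline: the feature-set lists the
    # unique term occurrences of the (already lowercased) message.
    return {token: True for token in set(text.split())}

# The feature-set:label pairs (training dataset) that f_t learns from.
training_data = [(to_feature_set(text), label) for text, label in corpus]
print(training_data[0][1])  # spam
```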
Feature Representation
• Features describe and represent texts in a format suitable for further machine processing.
• Final performance depends on how descriptive the features used for text description are.
• Supervised learning classifiers can use any sort of feature:
– URLs, email addresses, punctuation, capitalization, dictionaries, network features
• Word-based features (bag of words): a feature extraction process transforms the plain documents, which are merely strings of text, into a feature set containing the (frequency of) occurrence of each word, usable by a classifier.
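In its simplest boolean form, the bag-of-words transform just marks each term as present; a minimal sketch (a frequency-based variant would map tokens to counts instead):

```python
def bag_of_words(tokens):
    """Boolean bag-of-words: mark each occurring term as present.

    A frequency-based variant would map each token to its count instead.
    """
    return {token: True for token in tokens}

print(bag_of_words(["free", "prize", "free"]))  # {'free': True, 'prize': True}
```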
Feature Selection
• Text collections have a large number of features, yet some classifiers cannot deal with a very large number of features. Performing feature selection therefore ensures reduced training time and improves performance, as it eliminates noise from the features and avoids overfitting.
• Term Weighting: Each term in a document vector must be associated with a value
(weight) which measures the importance of this term and denotes how much this
term contributes to the categorization task of the document.
– depends on information theory; the frequency count of every word
– chi-squared statistical distribution; a score measure of the bigram of each word per label
Text Classification
• A one-step classifier testing process of taking the built text classifier ft and running it on unknown content to determine class membership for that content.
• New input (test) SMS stream is passed to the classifier.
• Preprocesses the stream and compares it with the set of pre-classified examples (training set).
Numerical underflow
• In Equation (3), many conditional probabilities are multiplied, one for each feature position of X.
• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
• Since log(xc) = log(x) + log(c), it is better to perform all computations by summing the natural logs of the probabilities rather than multiplying them. Therefore, during text classification, a log-normalized NBC equation (given below) is used:
c = argmax_ci [ log P(C = ci) + Σk=1..n log P(xk | C = ci) ]
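The underflow is easy to demonstrate: multiplying a few thousand small probabilities collapses to exactly 0.0 in double precision, while the log-sum keeps a usable score (the probability values below are arbitrary illustrative numbers):

```python
import math

# 2000 conditional probabilities of 0.01 each, as over a large feature set.
probs = [0.01] * 2000

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the direct product underflows

log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -9210.3, still comparable across classes
```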
Implementation Pseudo Algorithm
for a given unknown input document:
• break the input stream into word tokens
• preprocess the tokens
• for a given training set:
– count the number of documents in each class
– for every training document:
• for each class:
– if a preprocessed token appears in the document:
increment the count for that token
• for each class:
– for each preprocessed token
divide the token count by the total token count to get conditional probabilities
• return log conditional probabilities for each class
for all the individual class log conditional probabilities:
• compute a comparison of the probability values
• return the class with the greatest probability (the maximum likelihood hypothesis).
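The pseudo-algorithm above can be sketched as a compact pure-Python implementation; the add-one (Laplace) smoothing is an addition not stated on the slide, included here so that tokens unseen in a class do not produce log(0):

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (tokens, label) pairs, already preprocessed.

    Returns per-class log priors and token log likelihoods, following
    the count-then-divide steps of the pseudo-algorithm (with add-one
    smoothing, an assumption beyond the slide's description).
    """
    doc_counts = Counter(label for _, label in samples)
    token_counts = defaultdict(Counter)
    for tokens, label in samples:
        token_counts[label].update(tokens)
    vocab = {t for counts in token_counts.values() for t in counts}
    total_docs = sum(doc_counts.values())
    model = {}
    for label, n_docs in doc_counts.items():
        n_tokens = sum(token_counts[label].values())
        log_prior = math.log(n_docs / total_docs)
        log_like = {
            t: math.log((token_counts[label][t] + 1) / (n_tokens + len(vocab)))
            for t in vocab
        }
        model[label] = (log_prior, log_like)
    return model, vocab

def classify(model, vocab, tokens):
    """Return the class with the greatest summed log probability."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, log_like) in model.items():
        score = log_prior + sum(log_like[t] for t in tokens if t in vocab)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Tokens absent from the training vocabulary are simply skipped at classification time, one common way of handling unseen words.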
Evaluation and Implementation Approach
• Evaluation: test SMS text documents to assess classifier success at predicting the class:
Accuracy = Correct_Predictions / Total_number_of_tests
• Implementation: a complete text classification application with a user-interactive interface.
– Natural Language Processing approach
• Natural Language ToolKit (NLTK) used with Python programming
language.
– NLTK is entirely self-contained and provides convenient functions and
wrappers that can be used as building blocks for common NLP tasks.
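The evaluation metric reduces to a one-line function (NLTK offers the equivalent `nltk.classify.accuracy`; the pure-Python sketch below avoids that dependency):

```python
def accuracy(predictions, gold_labels):
    """Correct_Predictions / Total_number_of_tests."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

print(accuracy(["spam", "ham", "spam"], ["spam", "ham", "ham"]))  # 0.666...
```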
BIBLIOGRAPHY
Aloysius Ochola. "Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling". MSc. Computer Science Project, Makerere University, Kampala (2013).
DEMO . . .
• Training samples were collected from manually categorized SMS messages compiled by Ureport, an SMS-based opinion forum.
• Problem: they receive up to 10,000 SMS messages in a day and are supposed to reply to all the messages that are relevant and worthy.
smsTextClassificationApplication