Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling


Makerere University Kampala, Artificial Intelligence Group Seminar. Presentation by Aloysius Ochola, May 30, 2013.

Transcript of Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling

  • Automated Classification of Short Message Service (SMS)

    ALOYSIUS OCHOLA

    [email protected]

    MAKERERE UNIVERSITY

    ARTIFICIAL INTELLIGENCE GROUP

    USING NAÏVE BAYES ALGORITHM

    Artificial Intelligence Seminar, May 30, 2013

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Classification

    A supervised learning technique that involves assigning a label to a set of unlabeled input objects.

    Based on the number of classes present, there are two types of classification:

    Binary classification: classifies the members of a given set of objects into one of two classes.

    Multi-class classification: classifies instances into more than two classes. Unlike the better-understood binary case, multi-class classification is more complex and less researched.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Text Classification/Categorization

    Text documents are one of the several areas where classification can be applied.

    TC (text classification/categorization) is the application of classification algorithms to text documents in order to automatically group them into predefined categories.

    How to represent text documents

    Preprocessing and feature selection

    How to build the classifier; computing a classification function

    Training the classifier and classifying

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Short Text Documents

    Normal documents like emails, journal articles, etc. are typically large and rich in content (natural language).

    It is easy to apply traditional classification approaches, which rely on word frequencies, to them.

    Not so for short text documents like SMS and Twitter messages, forum posts, etc., where word occurrences are too sparse.

    Dealing with short text therefore requires a little more than the traditional techniques, especially during preprocessing and feature selection.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Applications of TC

    Spam filtering: discerning e-mail spam messages from legitimate emails.

    Email routing: sending an email addressed to a general address on to a specific address or mailbox depending on topic.

    Language identification: automatically determining the language of a text.

    Genre classification: automatically determining the genre of a text.

    Movie reviewing: automatically classifying reviews as good, bad, or neutral.

    Etc.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Data Preprocessing

    Data captured in the real world is noisy, inconsistent, and of poor quality, so some cleaning and transformation is required.

    To get quality results from short text, most of the major steps of text preprocessing are skipped and some selected ones are modified.

    Tokenization and lowercasing: splitting text streams into tokens and forcing lowercase; word boundaries are detected using whitespace and punctuation (see the sketch after this slide).

    Note: the prepared corpus was lowercased.

    Minor spell-correction: although there is a growing culture of using (informal) shorthand in SMS texts, some spell corrections can still be done.
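    For illustration, a minimal tokenization-and-lowercasing sketch in Python with NLTK (the toolkit adopted later in this presentation); the sample message is invented, and NLTK's 'punkt' tokenizer data is assumed to be downloaded:

    import nltk

    sms = "Pls SEND me the RESULTS 2day!!"
    # Split the stream into tokens on whitespace/punctuation boundaries,
    # then force every token to lowercase.
    tokens = [t.lower() for t in nltk.word_tokenize(sms)]
    print(tokens)  # ['pls', 'send', 'me', 'the', 'results', '2day', '!', '!']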

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Data Preprocessing (cont)

    Regular expression replacer: replacing apostrophe (contracted) words with their expansions through matching regular expressions.

    A list of pairs of apostrophe-word REs and corrections is kept, e.g. won't : will not, didn't : did not, . . .

    Repeat replacer: people are often not strictly grammatical, and may write "I looooove it" to emphasize the word "love".

    Before replacing any characters of the supplied word, the module first looks up whether WordNet (a lexical database for the English language) recognizes the supplied word. Any word with a run of repeating characters is reduced, as no English word repeats the same character more than twice, for example goooooooose to goose.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Data Preprocessing (cont)

    Then, if the word is not recognized, the regular expression (RE) (\w*)(\w)\2(\w*) is used to remove extra repeated characters from the word:

    (\w*) matches 0 or more starting characters;

    (\w) matches a single character, and \2 matches another instance of that same character;

    (\w*) matches 0 or more ending characters.

    A sketch of this repeat replacer is given below.
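    A minimal sketch of the repeat replacer, assuming NLTK's WordNet corpus has been downloaded; the recursive reduction mirrors the RE just described:

    import re
    from nltk.corpus import wordnet

    repeat_re = re.compile(r'(\w*)(\w)\2(\w*)')

    def replace_repeats(word):
        # Accept the word as soon as WordNet recognizes it.
        if wordnet.synsets(word):
            return word
        # Otherwise drop one repeated character and try again.
        shorter = repeat_re.sub(r'\1\2\3', word)
        return replace_repeats(shorter) if shorter != word else word

    print(replace_repeats('goooooooose'))  # 'goose'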

    Stop-words filtering: removing the most frequent words that exist in a document, by looking each word up in a file containing stop words and returning only the words not in that file/dictionary (see the sketch below).
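    A minimal stop-word filtering sketch, assuming NLTK's stopwords corpus has been downloaded; a plain file of stop words would work just as well:

    from nltk.corpus import stopwords

    stop_set = set(stopwords.words('english'))

    def filter_stopwords(tokens):
        # Return only the tokens that are not in the stop-word list.
        return [t for t in tokens if t not in stop_set]

    print(filter_stopwords(['please', 'send', 'me', 'the', 'results']))
    # ['please', 'send', 'results']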

  • Automated Classification of SMS using Naïve Bayes Algorithm

    A Classifier

    A classifier is built on a function f that determines the category of an input feature vector x, given a fixed set of classes C = {c1, c2, ..., cn} and a description of features x ∈ X, where X is the feature space.

    In simple terms: f(x) ∈ C,

    where f is the classification function whose domain is X and whose range is C. The class labels C can be ordered or unordered (categorical).

    A classifier is expected to learn from a set of N input-output pairs, or simply a training data set, and predict the class of unseen input. That is to say, it maps X to C: f : X → C.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Building the Text Classifier

    For this particular case, we deal with a probabilistic text classifier ft based on the Naïve Bayes classification (NBC) theorem.

    Building the classifier therefore involves a repeated process of creating a functional classifier by training it with an example data set (NB learning), and then running the trained classifier on unknown content to determine class membership for that content (Bayesian classification).

    A probabilistic classifier, to predict the class membership of a certain new document X, calculates the probability of a class C given that document, that is:

    P(C | X)

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Naïve Bayes Algorithm

    It is a simple probabilistic learning and classification method built upon Bayes' probability theory.

    It assumes that the presence (or absence) of a particular feature of a class is not related to the presence (or absence) of any other feature (the naïve assumption).

    It uses the prior probability P(C) of each category, given no information about an item.

    Categorization produces a posterior probability distribution P(C | X) over the possible categories, given a description X of an item.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Naïve Bayes (NB) Probability Theorem

    Derived from the definition of conditional probability: the probability that an event will occur, when another event is known to occur or to have occurred.

    From the product rule, given events C and X:

    P(C ∩ X) = P(C | X) P(X) = P(X | C) P(C)

    Bayes' Rule is then given as:

    P(C | X) = P(X | C) P(C) / P(X),  with P(X) ≠ 0        Equation (1)

    P(C): prior probability, the initial probability that C holds before seeing any evidence.
    P(X): the probability that X is observed.
    P(X | C): likelihood, the probability of observing X given that C holds.
    P(C | X): posterior probability, the probability that C holds given that X is observed.

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Deriving NB Classification Algorithm

    Given a set of feature vectors for each possible class C, the task of the NBC (NB classification) algorithm is to approximate the probability that new input features X belong to C, that is, the class posterior, or simply the greatest P(C = c | X).

    Assume C is a boolean random variable and X is a vector space containing n boolean attributes. If c_i is the i-th possible value of C and x_k denotes the k-th attribute of X, then applying the NB probability theorem (Equation (1)):

    P(C = c_i | X = x_k) = P(X = x_k | C = c_i) P(C = c_i) / Σ_j P(X = x_k | C = c_j) P(C = c_j)        Equation (2)

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Deriving NBC Algorithm (cont.)

    NB conditional independence assumption: features (term presence) are independent of each other given the class. A new document of n features can therefore be classified into one of the C classes using Equation (2) as:

    P(C | X) ∝ ∏_{k=1..n} P(x_k | C)

    The aim of the classifier is to return the maximum posterior probability of c, thus:

    c = argmax_{c_i} [ P(C = c_i) ∏_k P(x_k | C = c_i) / Σ_j P(C = c_j) ∏_k P(x_k | C = c_j) ]

    Further, because the sample space (the denominator) is always constant for all the classes and does not depend on any class c_i of C, the NBC theorem is given as:

    c = argmax_{c_i} P(C = c_i) ∏_k P(x_k | C = c_i)        Equation (3)

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Training the Naïve Bayes Text Classifier

    During the training process, the classification function ft extracts and selects the most useful features from the example corpus and labels them with their appropriate class.

    It constructs and stores a mapping of feature-set:label pair sets (the training data set), which ft will learn from:

    the feature-set is a list of preprocessed, unique term occurrences from the document samples;

    the label is the known class of that feature-set.

    A training sketch with NLTK follows below.
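    A minimal training sketch with NLTK's built-in Naive Bayes classifier; the two example messages and labels are invented stand-ins for the prepared SMS corpus:

    import nltk

    def feature_set(sms):
        # Map an SMS to a feature-set of its unique, lowercased tokens.
        return {token: True for token in nltk.word_tokenize(sms.lower())}

    train_data = [
        ("the borehole in our village is broken", "water"),
        ("when will the exam results be released", "education"),
    ]
    # Construct the feature-set:label pairs the classifier learns from.
    train_set = [(feature_set(text), label) for text, label in train_data]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(classifier.classify(feature_set("our exam results are missing")))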

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Feature Representation

    Features describe and represent texts in a format suitable for further machine processing.

    Final performance depends on how descriptive the features used for text description are.

    Supervised learning classifiers can use any sort of feature: URLs, email addresses, punctuation, capitalization, dictionaries, network features, etc.

    Word-based features (Bag of Words): a feature-extraction process that transforms the plain documents, which are merely strings of text, into a feature set containing the (frequency of) occurrence of each word, in a form usable by a classifier (sketched below).
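    A minimal bag-of-words sketch; the document string is invented:

    from collections import Counter
    import nltk

    def bag_of_words(document):
        # Transform a plain string into {word: frequency of occurrence}.
        tokens = nltk.word_tokenize(document.lower())
        return dict(Counter(tokens))

    print(bag_of_words("Water water everywhere"))
    # {'water': 2, 'everywhere': 1}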

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Feature Selection

    Text collections have a large number of features, yet some classifiers cannot deal with a very large number of features. Performing feature selection therefore reduces training time and improves performance, as it eliminates noise from the features and avoids overfitting.

    Term weighting: each term in a document vector must be associated with a value (weight) which measures the importance of this term and denotes how much it contributes to the categorization task of the document. Weights can:

    depend on information theory: a frequency count of every word;

    use the chi-squared statistical distribution: a score measure of each word (and bigram) per label, as sketched below.
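    A minimal sketch of chi-squared term scoring per label, assuming word and per-label frequency counts have been accumulated from the training corpus; NLTK's BigramAssocMeasures.chi_sq scores each (word, label) co-occurrence as if the pair were a bigram:

    from nltk.metrics import BigramAssocMeasures
    from nltk.probability import FreqDist, ConditionalFreqDist

    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()
    # Accumulate overall and per-label word frequencies (toy counts here).
    for word, label in [("borehole", "water"), ("exam", "education"),
                        ("results", "education")]:
        word_fd[word] += 1
        label_word_fd[label][word] += 1

    total_count = word_fd.N()
    scores = {}
    for label in label_word_fd.conditions():
        label_count = label_word_fd[label].N()
        for word, freq in label_word_fd[label].items():
            # chi_sq(word-in-label count, (word count, label count), grand total)
            scores[(word, label)] = BigramAssocMeasures.chi_sq(
                freq, (word_fd[word], label_count), total_count)

    # Keep the highest-scoring terms as the selected features.
    best_terms = sorted(scores, key=scores.get, reverse=True)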

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Text Classification

    A one-step classifier-testing process of taking the built text classifier ft and running it on unknown content to determine class membership for that content.

    A new input (test) SMS stream is passed to the classifier, which preprocesses the stream and compares it with the set of pre-classified examples (the training set).

    Numerical underflow

    In Equation (3), many conditional probabilities are multiplied, one for each position of X. Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

    Since log(x·c) = log(x) + log(c), it is better to perform all computations by summing natural logs of the probabilities rather than multiplying them. Therefore, during text classification, the normalized NBC equation (given below, and sketched in code after this slide) is used:

    c = argmax_{c_i} [ log P(C = c_i) + Σ_{k=1..n} log P(x_k | C = c_i) ]
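    A minimal log-space classification sketch; priors and cond_probs are assumed to have been estimated during training:

    import math

    def classify_log(priors, cond_probs, tokens):
        # argmax over classes of log P(c) + sum over tokens of log P(x_k | c).
        best_label, best_score = None, float("-inf")
        for label, prior in priors.items():
            score = math.log(prior)
            for token in tokens:
                # Unseen tokens fall back to a small smoothing probability.
                score += math.log(cond_probs[label].get(token, 1e-6))
            if score > best_score:
                best_label, best_score = label, score
        return best_label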

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Implementation Pseudo Algorithm

    for a given unknown input document:
        break the input stream into word tokens
        preprocess the tokens

    for a given training set:
        count the number of documents in each class
        for every training document:
            for each class:
                if a preprocessed token appears in the document:
                    increment the count for that token
        for each class:
            for each preprocessed token:
                divide the token count by the total token count to get conditional probabilities
            return the log conditional probabilities for each class

    for all the individual class log conditional probabilities:
        compare the probability values
        return the class with the greatest probability (the maximum a posteriori hypothesis)

    A runnable rendering of the training step is given below.
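    A minimal runnable rendering of the training step, assuming documents arrive as token lists; Laplace (add-one) smoothing, an addition not spelled out on the slide, keeps unseen tokens from zeroing out a class:

    from collections import Counter, defaultdict

    def train_naive_bayes(labeled_docs):
        # labeled_docs: list of (tokens, label) pairs.
        class_doc_counts = Counter(label for _, label in labeled_docs)
        class_token_counts = defaultdict(Counter)
        vocabulary = set()
        for tokens, label in labeled_docs:
            class_token_counts[label].update(tokens)  # increment token counts
            vocabulary.update(tokens)
        # Prior: fraction of training documents in each class.
        priors = {c: n / len(labeled_docs) for c, n in class_doc_counts.items()}
        # Conditional: smoothed token count divided by total token count.
        cond_probs = {}
        for c in class_doc_counts:
            total = sum(class_token_counts[c].values())
            cond_probs[c] = {t: (class_token_counts[c][t] + 1) / (total + len(vocabulary))
                             for t in vocabulary}
        return priors, cond_probs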

  • Automated Classification of SMS using Naïve Bayes Algorithm

    Evaluation and Implementation Approach

    Evaluation: test SMS text documents are used to assess the classifier's success at predicting the class:

    Accuracy = Correct_Predictions / Total_number_of_tests

    Implementation: a complete text classification application with a user-interactive interface.

    Natural Language Processing approach

    The Natural Language Toolkit (NLTK) is used with the Python programming language.

    NLTK is entirely self-contained and provides convenient functions and wrappers that can be used as building blocks for common NLP tasks. An evaluation sketch is given below.
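    A minimal evaluation sketch with NLTK; the tiny train/test messages are invented stand-ins for a real held-out split:

    import nltk

    def feature_set(sms):
        return {token: True for token in nltk.word_tokenize(sms.lower())}

    train_set = [(feature_set("the borehole is broken"), "water"),
                 (feature_set("exam results are out"), "education")]
    test_set = [(feature_set("our borehole broke down"), "water")]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # Accuracy = correct predictions / total number of tests.
    print("Accuracy:", nltk.classify.accuracy(classifier, test_set))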

  • Automated Classification of SMS using Naïve Bayes Algorithm

    BIBLIOGRAPHY

    Aloysius Ochola. Automated Classification of Short Messaging Services (SMS) Messages for Optimized Handling. MSc. Computer Science Project, Makerere University Kampala (2013). [email protected]

  • Automated Classification of SMS using Naïve Bayes Algorithm

    DEMO . . .

    Training samples were collected from manually categorized SMS messages compiled by Ureport, an SMS-based opinion forum.

    Problem: they receive up to 10,000 SMS messages in a day and are supposed to reply to all the messages that are relevant and worthy.

    smsTextClassificationApplication