CES 514 Lec 11, April 28, 2010: Neural Network, case study of naïve Bayes and decision tree, text classification

Page 1:

CES 514 Lec 11, April 28, 2010

Neural Network, case study of naïve Bayes and decision tree, text classification

Page 2:

Artificial Neural Networks (ANN)

X1  X2  X3  Y
 1   0   0  0
 1   0   1  1
 1   1   0  1
 1   1   1  1
 0   0   1  0
 0   1   0  0
 0   1   1  1
 0   0   0  0

[Figure: "black box" with input nodes X1, X2, X3 and output node Y]

Output Y is 1 if at least two of the three inputs are equal to 1.

Page 3:

Neural Network with one neuron

(Same truth table as on the previous slide.)

[Figure: perceptron with input nodes X1, X2, X3, each connected to the output node Y with weight 0.3, and threshold t = 0.4]

$Y = I(0.3\,X_1 + 0.3\,X_2 + 0.3\,X_3 - 0.4 > 0)$, where $I(z) = 1$ if $z$ is true and $0$ otherwise.

Rosenblatt 1958 (perceptron), also known as a threshold logic unit.
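A minimal sketch (in Python, not part of the slides) of this threshold logic unit; it reproduces the truth table above using the weights 0.3 and threshold t = 0.4 given on the slide.

```python
# Single-neuron perceptron from the slide: weights 0.3, 0.3, 0.3, threshold 0.4.

def perceptron(x1, x2, x3, w=(0.3, 0.3, 0.3), t=0.4):
    s = w[0] * x1 + w[1] * x2 + w[2] * x3   # weighted sum of the inputs
    return 1 if s - t > 0 else 0            # step (indicator) activation

# The eight rows of the truth table: Y = 1 iff at least two inputs are 1.
table = [(1, 0, 0, 0), (1, 0, 1, 1), (1, 1, 0, 1), (1, 1, 1, 1),
         (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 0)]

for x1, x2, x3, y in table:
    assert perceptron(x1, x2, x3) == y      # the unit reproduces column Y
```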

Page 4:

Artificial Neural Networks (ANN)

Model is an assembly of inter-connected nodes and weighted links.

The output node sums its inputs, each weighted by the corresponding link.

The weighted sum is compared against a threshold t.

[Figure: perceptron with input nodes X1, X2, X3, link weights w1, w2, w3, threshold t, and output node Y]

Perceptron model:

$Y = I\left(\sum_i w_i X_i - t\right)$ or $Y = \mathrm{sign}\left(\sum_i w_i X_i - t\right)$

Page 5:

Training a single neuron

Rosenblatt’s algorithm (the perceptron learning rule): cycle through the training examples and, whenever an example is misclassified, adjust the weights toward classifying it correctly, as in the sketch below.
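A minimal sketch (not from the slides) of the classic perceptron update rule; the function name, learning rate, and epoch cap are illustrative.

```python
# Perceptron learning rule: on each misclassified example, move the weights
# (and threshold) toward the correct side of the decision boundary.

def train_perceptron(examples, n_features, lr=1.0, epochs=100):
    """examples: list of (x, y) with x a list of features and y in {0, 1}."""
    w = [0.0] * n_features   # weights
    t = 0.0                  # threshold
    for _ in range(epochs):
        errors = 0
        for x, y in examples:
            s = sum(wi * xi for wi, xi in zip(w, x)) - t
            y_hat = 1 if s > 0 else 0
            if y_hat != y:                       # misclassified: update
                errors += 1
                for i in range(n_features):
                    w[i] += lr * (y - y_hat) * x[i]
                t -= lr * (y - y_hat)            # threshold acts like a negated bias
        if errors == 0:                          # no mistakes in a full pass
            break
    return w, t
```

On the linearly separable "at least two of three inputs" data from the earlier slide this converges after a few passes; on XOR it never does, which motivates the next two slides.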

Page 6:

Linearly separable instances

Rosenblatt’s algorithm converges and finds a separating plane when the data set is linearly separable.

Simplest example of a data set that is not linearly separable:

exclusive-OR (the parity function)

Page 7:

Classifying parity with more neurons

A neural network with a sufficient number of neurons can classify any data set correctly.
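A minimal sketch (not from the slides) illustrating this for the parity example: two hidden threshold units are enough for XOR, which a single neuron cannot represent.

```python
# XOR with a tiny two-layer network of threshold units.

def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # hidden unit 1: fires if x1 OR x2
    h2 = step(x1 + x2 - 1.5)      # hidden unit 2: fires if x1 AND x2
    return step(h1 - h2 - 0.5)    # output: OR but not AND  ->  XOR (parity)

for x1 in (0, 1):
    for x2 in (0, 1):
        assert xor_net(x1, x2) == (x1 ^ x2)
```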

Page 8:

General Structure of ANN

[Figure (single neuron i): inputs I1, I2, I3 with weights wi1, wi2, wi3 and threshold t are combined into S_i; the activation function g gives the output O_i = g(S_i)]

[Figure (multilayer network): input layer x1 … x5, hidden layer, output layer y]

Training an ANN means learning the weights of its neurons.

Page 9:

Algorithm for learning ANN

Initialize the weights (w0, w1, …, wk)

Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples.

– Objective function (sum of squared errors):

$E = \sum_i \left[ Y_i - f(w_i, X_i) \right]^2$

– Find the weights $w_i$ that minimize this objective function, e.g., with the backpropagation algorithm.

Details: Nilsson's ML (Chapter 4) PDF
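A minimal sketch (not from the slides) of minimizing this squared-error objective by gradient descent for a single sigmoid unit; full backpropagation applies the same chain rule layer by layer. Names and the learning rate are illustrative.

```python
# Gradient descent on E = sum_i (Y_i - f(w, X_i))^2 for one sigmoid unit.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_unit(examples, n_features, lr=0.5, epochs=1000):
    """examples: list of (x, y) with x a feature list and y in {0, 1}."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # dE/dz for E = (y - o)^2 and o = sigmoid(z), by the chain rule:
            delta = -2.0 * (y - o) * o * (1.0 - o)
            for j in range(n_features):
                w[j] -= lr * delta * x[j]   # dE/dw_j = delta * x_j
            b -= lr * delta
    return w, b
```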

Page 10:

WEKA

Page 11:

WEKA implementation

WEKA has implementations of all the major data mining algorithms, including:

• decision trees (CART, C4.5, etc.)
• naïve Bayes and its variants
• nearest neighbor classifier
• linear classifiers
• support vector machines
• clustering algorithms
• boosting algorithms, etc.
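The slides do not show how WEKA is invoked; one common route is its command-line interface. A hedged sketch (class names such as weka.classifiers.trees.J48 and the -t/-x options are standard in WEKA 3.x, but check the version you have installed):

```python
# Illustrative only: run WEKA's C4.5 implementation (J48) from Python by
# calling the Java command line; weka.jar and train.arff are placeholder paths.
import subprocess

subprocess.run([
    "java", "-cp", "weka.jar",
    "weka.classifiers.trees.J48",   # C4.5 decision tree
    "-t", "train.arff",             # training data in ARFF format
    "-x", "10",                     # 10-fold cross-validation
], check=True)
```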

Page 12:

Weka tutorials

http://sentimentmining.net/weka/

Contains videos showing how to use weka for various data mining applications.

Page 13:

A case study in classification

CES 514 course project from 2007 (Olson)

Consider a board game (e.g., checkers, backgammon). Given a position, we want to determine how strong the position of one player (say black) is.

Can we train a classifier to learn this from a training set?

As usual, the problems are:
• choice of attributes
• creating labeled samples

Page 14:

Peg Solitaire – a one-player version of checkers

• To win, the player should remove all except one peg.
• A position from which a win can be achieved is called a solvable position.

Page 15:

Square board and a solvable position

Winning move sequence: (3, 4, 5), (5, 13, 21), (25, 26, 27), (27, 28, 29), (21, 29, 37), (37, 45, 53), (83, 62, 61), (61, 53, 45)

Page 16:

How to choose attributes?

1. Number of pegs (pegs).
2. Number of first moves for any peg on the board (first_moves).
3. Number of rows having 4 pegs separated by single vacant positions (ideal_row).
4. Number of columns having 4 pegs separated by single vacant positions (ideal_col).
5. Number of the first two moves for any peg on the board (first_two).
6. Percentage of the total number of pegs in quadrant one (quad_one).
7. Percentage of the total number of pegs in quadrant two (quad_two).

Page 17:

List of attributes

• Percentage of the total number of pegs in quadrant three (quad_three).
• Percentage of the total number of pegs in quadrant four (quad_four).
• Number of pegs isolated by one vacant position (island_one).
• Number of pegs isolated by two vacant positions (island_two).
• Number of rows having 3 pegs separated by single vacant positions (ideal_row_three).
• Number of columns having 3 pegs separated by single vacant positions (ideal_col_three).
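An illustrative sketch only: the project's actual board encoding is not given in the slides, so an 8x8 grid of booleans (True = peg, False = vacant) is assumed here, and the reading of ideal_row is a guess at the intended pattern.

```python
# Two of the listed attributes, computed from an assumed boolean board.

def pegs(board):
    """Total number of pegs on the board."""
    return sum(cell for row in board for cell in row)

def ideal_row(board):
    """Rows containing 4 pegs separated by single vacant positions,
    i.e. the pattern peg, gap, peg, gap, peg, gap, peg somewhere in the row."""
    pattern = [True, False, True, False, True, False, True]
    count = 0
    for row in board:
        for start in range(len(row) - len(pattern) + 1):
            if row[start:start + len(pattern)] == pattern:
                count += 1
                break   # count each row at most once
    return count
```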

Pages 18–23: (no text extracted)

Page 24:

Summary of performance

Page 25:

Text Classification

• Text classification has many applications
  – Spam email detection
  – Automated tagging of streams of news articles, e.g., Google News
  – Online advertising: what is this Web page about?

• Data Representation
  – "Bag of words" most commonly used: either counts or binary
  – Can also use "phrases" (e.g., bigrams) for commonly occurring combinations of words

• Classification Methods
  – Naïve Bayes widely used (e.g., for spam email)
    • Fast and reasonably accurate
  – Support vector machines (SVMs)
    • Typically the most accurate method in research studies
    • But more complex computationally
  – Logistic Regression (regularized)
    • Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)

Page 26:

Types of Labels/Categories/Classes

• Assigning labels to documents or web pages
  – Labels are most often topics such as Yahoo categories
  – "finance", "sports", "news>world>asia>business"

• Labels may be genres
  – "editorials", "movie-reviews", "news"

• Labels may be opinion on a person/product
  – "like", "hate", "neutral"

• Labels may be domain-specific
  – "interesting-to-me" : "not-interesting-to-me"
  – "contains adult language" : "doesn't"
  – language identification: English, French, Chinese, …

Ch. 13

Page 27:

Common Data Sets used for Evaluation

• Reuters
  – 10700 labeled documents
  – 10% of documents with multiple class labels

• Yahoo! Science Hierarchy
  – 95 disjoint classes with 13,598 pages

• 20 Newsgroups data
  – 18800 labeled USENET postings
  – 20 leaf classes, 5 root-level classes

• WebKB
  – 8300 documents in 7 categories such as "faculty", "course", "student".

Page 28:

Practical Issues

• Tokenization
  – Convert document to word counts = "bag of words"
  – word token = "any nonempty sequence of characters"
  – for HTML (etc.) need to remove formatting

• Canonical forms, Stopwords, Stemming
  – Remove capitalization
  – Stopwords: remove very frequent words (a, the, and, …) – can use a standard list
  – Can also remove very rare words, e.g., words that only occur in k or fewer documents, e.g., k = 5

• Data representation
  – e.g., sparse 3-column format for bag of words: <docid termid count>
  – can use inverted indices, etc.
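A minimal sketch (not from the slides) of this pipeline: lowercase, split into word tokens, drop stopwords, and emit sparse <docid, term, count> triples.

```python
# Tokenization and bag-of-words representation.
import re
from collections import Counter

STOPWORDS = {"a", "the", "and", "of", "to", "in"}   # illustrative short list

def bag_of_words(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # canonical form: lowercase
    return Counter(t for t in tokens if t not in STOPWORDS)

def sparse_rows(docs):
    """docs: list of document strings; yields (docid, term, count) triples."""
    for docid, text in enumerate(docs):
        for term, count in bag_of_words(text).items():
            yield docid, term, count

# Example: list(sparse_rows(["The man likes the woman", "the woman said yes"]))
```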

Page 29:

Challenges of text classification

ML classification techniques are typically used for structured data. Text poses different problems:

• lots of features and a lot of noise
• no fixed number of columns
• no categorical attribute values
• data scarcity
• larger number of class labels
• hierarchical relationships between classes, less systematic than in structured data

Page 30:

Techniques

Nearest Neighbor Classifier
• Lazy learner: remember all training instances
• Decision on a test document: distribution of labels on the training documents most similar to it
• Assigns large weights to rare terms

Feature selection
• removes terms in the training documents which are statistically uncorrelated with the class labels

Bayesian classifier
• Fit a generative term distribution Pr(d|c) to each class c of documents.
• Testing: the distribution most likely to have generated a test document is used to label it.

Page 31:

Stochastic Language Models

Model probability of generating strings (each word in turn) in a language (commonly all strings over alphabet ∑). E.g., a unigram model

Unigram model M:

  the    0.2
  a      0.1
  man    0.01
  woman  0.01
  said   0.03
  likes  0.02

s = "the man likes the woman"

P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008   (multiply the word probabilities)

Sec.13.2.1
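A minimal sketch (not from the slides) that scores a string under a unigram model, using the probabilities from this slide.

```python
# Unigram language model scoring: multiply the per-word probabilities.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
           "said": 0.03, "likes": 0.02}

def p_string(words, model):
    p = 1.0
    for w in words:
        p *= model.get(w, 0.0)   # unseen words get probability 0 here
    return p

print(p_string("the man likes the woman".split(), model_m))  # approx. 8e-08
```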

Page 32:

Stochastic Language Models

Model probability of generating any string

Model M1:                    Model M2:
  the      0.2                 the      0.2
  class    0.01                class    0.0001
  sayst    0.0001              sayst    0.03
  pleaseth 0.0001              pleaseth 0.02
  yon      0.0001              yon      0.1
  maiden   0.0005              maiden   0.01
  woman    0.01                woman    0.0001

s = "the class pleaseth yon maiden"

P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01

P(s | M2) > P(s | M1)

Sec.13.2.1

Page 33:

Using Multinomial Naive Bayes Classifiers to Classify Text: Basic Method

Attributes are text positions, values are words:

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

Too many possibilities. Assume that classification is independent of the positions of the words:
• use the same parameters for each position
• result is the bag of words model (over tokens)

Sec.13.2

Page 34:

Naive Bayes: Learning

From the training corpus, extract the Vocabulary.
Calculate the required P(cj) and P(xk | cj) terms:

For each cj in C do
  docsj ← subset of documents for which the target class is cj

  $P(c_j) = \frac{|docs_j|}{|\text{total \# documents}|}$

  Textj ← single document containing all of docsj
  n ← total number of word positions in Textj
  for each word xk in Vocabulary
    nk ← number of occurrences of xk in Textj

    $P(x_k \mid c_j) = \frac{n_k + 1}{n + |\text{Vocabulary}|}$

Sec.13.2

Page 35:

Naive Bayes: Classifying

positions ← all word positions in the current document which contain tokens found in Vocabulary

Return $c_{NB}$, where

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$

Sec.13.2
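A minimal sketch (in Python, not from the slides) of the learning and classification procedure on the two preceding slides, with the add-one smoothing from the learning formula; the function names are illustrative.

```python
# Multinomial Naive Bayes: learning and classifying.
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, class_label). Returns priors, cond. probs, vocab."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)

    prior, cond = {}, {}
    for c, doc_list in class_docs.items():
        prior[c] = len(doc_list) / len(docs)                  # P(c_j)
        text_c = [w for tokens in doc_list for w in tokens]   # Text_j ("mega-document")
        counts = Counter(text_c)
        n = len(text_c)                                       # word positions in Text_j
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}  # P(x_k | c_j)
    return prior, cond, vocab

def classify_nb(tokens, prior, cond, vocab):
    best_c, best_p = None, -1.0
    for c in prior:
        p = prior[c]
        for w in tokens:
            if w in vocab:                                    # positions with known tokens
                p *= cond[c][w]
        if p > best_p:
            best_c, best_p = c, p
    return best_c
```

Probabilities are multiplied directly here, matching the slide; the underflow slide below explains why sums of logs are used in practice.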

Page 36: (no text extracted)

Page 37:

Naive Bayes: Time Complexity

Training time: O(|D| L_ave + |C||V|), where L_ave is the average length of a document in D. Assumes all counts are pre-computed in O(|D| L_ave) time during one pass through all of the data.

Generally just O(|D| L_ave), since usually |C||V| < |D| L_ave.

Test time: O(|C| L_t), where L_t is the average length of a test document.

Very efficient overall: linearly proportional to the time needed to just read in all the data.

Sec.13.2

Page 38:

Underflow Prevention: using logs

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

Class with highest final un-normalized log probability score is still the most probable.

Note that the model is now just a max of a sum of weights:

$c_{NB} = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$

Sec.13.2
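A companion sketch (illustrative, reusing the prior/cond structures from the earlier Naive Bayes sketch) of the same scoring done in log space.

```python
# Log-space Naive Bayes scoring: sum log-probabilities instead of multiplying.
import math

def log_score(tokens, prior_c, cond_c, vocab):
    score = math.log(prior_c)
    for w in tokens:
        if w in vocab:
            score += math.log(cond_c[w])
    return score   # pick the class with the highest (un-normalized) log score
```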

Page 39:

Naive Bayes Classifier

Simple interpretation: Each conditional parameter log P(xi|cj) is a weight that indicates how good an indicator xi is for cj.

The prior log P(cj) is a weight that indicates the relative frequency of cj.

The sum is then a measure of how much evidence there is for the document being in the class.

We select the class with the most evidence for it:

$c_{NB} = \arg\max_{c_j \in C} \left[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \right]$

Page 40:

Two Naive Bayes Models

Model 1: Multivariate Bernoulli
• One feature X_w for each word in the dictionary
• X_w = true in document d if w appears in d
• Naive Bayes assumption: given the document's topic, appearance of one word in the document tells us nothing about chances that another word appears

This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data.

Page 41:

Two Models

Model 2: Multinomial = Class conditional unigram
• One feature X_i for each word position in the document
  – the feature's values are all the words in the dictionary
  – the value of X_i is the word in position i
• Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions
• Second assumption: word appearance does not depend on position

$P(X_i = w \mid c) = P(X_j = w \mid c)$ for all positions $i, j$, word $w$, and class $c$

• Just have one multinomial feature predicting all words

Page 42:

Parameter estimation

Multivariate Bernoulli model:

$\hat{P}(X_w = t \mid c_j)$ = fraction of documents of topic $c_j$ in which word $w$ appears

Multinomial model:

$\hat{P}(X_i = w \mid c_j)$ = fraction of times word $w$ appears among all words in documents of topic $c_j$

• Can create a mega-document for topic $j$ by concatenating all documents in this topic
• Use the frequency of $w$ in the mega-document
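A minimal sketch (not from the slides) contrasting the two estimators above, without smoothing, for the documents of a single topic c_j given as token lists.

```python
# Bernoulli vs. multinomial parameter estimates for one word w and one topic.
from collections import Counter

def bernoulli_estimate(docs_cj, w):
    """Fraction of documents of topic c_j in which word w appears."""
    return sum(1 for tokens in docs_cj if w in tokens) / len(docs_cj)

def multinomial_estimate(docs_cj, w):
    """Fraction of all word occurrences in topic c_j that are the word w."""
    mega = [t for tokens in docs_cj for t in tokens]   # mega-document for the topic
    return Counter(mega)[w] / len(mega)
```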

Page 43:

Classification

Multinomial vs Multivariate Bernoulli?

Multinomial model is almost always more effective in text applications

Pages 44–47: (no text extracted)

Page 48:

Feature Selection: Why?

• Text collections have a large number of features: 10,000 – 1,000,000 unique words, and more
• May make using a particular classifier feasible: some classifiers can't deal with 100,000s of features
• Reduces training time: training time for some methods is quadratic or worse in the number of features
• Can improve generalization (performance): eliminates noise features, avoids overfitting

Sec.13.5

Page 49:

Feature selection: how?

Two ideas:

• Hypothesis testing statistics: are we confident that the value of one categorical variable is associated with the value of another?
  – Chi-square test (χ²)

• Information theory: how much information does the value of one categorical variable give you about the value of another?
  – Mutual information (MI)

They're similar, but χ² measures confidence in association (based on available statistics), while MI measures the extent of association (assuming perfect knowledge of probabilities).

Sec.13.5

Page 50:

χ² statistic (CHI)

χ² is interested in (f_o − f_e)²/f_e summed over all table entries: is the observed number what you'd expect given the marginals?

Example: term "jaguar" vs. class "auto". Observed counts f_o, with expected counts f_e in parentheses:

                    Term = jaguar    Term ≠ jaguar
  Class = auto         2  (0.25)       500  (502)
  Class ≠ auto         3  (4.75)      9500  (9498)

$\chi^2(j, a) = \sum \frac{(O - E)^2}{E} = \frac{(2 - 0.25)^2}{0.25} + \frac{(3 - 4.75)^2}{4.75} + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498} \approx 12.9 \quad (p < 0.001)$

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

Sec.13.5.2

Page 51:

χ² statistic

There is a simpler formula for the 2×2 χ²:

$\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)\,(B + D)\,(A + B)\,(C + D)}$

where A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D.
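A minimal sketch (not from the slides) of the 2×2 formula above, checked against the jaguar/auto example; the helper name is illustrative.

```python
# Chi-square for a 2x2 term/class contingency table.

def chi_square_2x2(a, b, c, d):
    """a = #(t,c), b = #(t,not c), c = #(not t,c), d = #(not t,not c)."""
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Jaguar/auto example from the previous slide:
print(round(chi_square_2x2(2, 3, 500, 9500), 1))   # 12.8; the slide's 12.9 uses rounded expected counts
```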

Page 52:

Feature selection via Mutual Information

In the training set, choose the k words which best discriminate (give the most information on) the categories.

The mutual information between a word w and a class c is:

$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \frac{p(e_w, e_c)}{p(e_w)\, p(e_c)}$

computed for each word w and each category c.

Sec.13.5.1
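A minimal sketch (not from the slides) of I(w, c) computed from the same 2×2 counts used for χ², with the probabilities estimated as count fractions.

```python
# Mutual information between a word and a class from a 2x2 contingency table.
import math

def mutual_information(a, b, c, d):
    """a = #(w present, class), b = #(w present, not class),
    c = #(w absent, class), d = #(w absent, not class)."""
    n = a + b + c + d
    mi = 0.0
    for n_cell, e_w, e_c in [(a, 1, 1), (b, 1, 0), (c, 0, 1), (d, 0, 0)]:
        if n_cell == 0:
            continue                              # treat 0 * log(...) as 0
        p_joint = n_cell / n
        p_w = (a + b) / n if e_w else (c + d) / n
        p_c = (a + c) / n if e_c else (b + d) / n
        mi += p_joint * math.log2(p_joint / (p_w * p_c))
    return mi
```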

Page 53:

Feature selection via MI

• For each category we build a list of the k most discriminating terms.
• For example (on 20 Newsgroups):
  – sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  – rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

Greedy: does not account for correlations between terms

Why?

Sec.13.5.1

Page 54:

Feature Selection

Mutual Information
• Clear information-theoretic interpretation
• May select rare uninformative terms

Chi-square
• Statistical foundation
• May select very slightly informative frequent terms that are not very useful for classification

Just use the commonest terms?
• No particular foundation
• In practice, this is often 90% as good

Sec.13.5

Page 55:

Greedy inclusion algorithm

Most commonly used in text. Algorithm:

• Compute, for each term, a measure of discrimination amongst classes.
• Arrange the terms in decreasing order of this measure.
• Retain a number of the best terms or features for use by the classifier.

• Greedy because the measure of discrimination of a term is computed independently of other terms.
• Over-inclusion: mild effects on accuracy.
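A minimal sketch (not from the slides) of this greedy procedure: score each term independently, sort, keep the top k. The scoring function could be the chi-square or mutual-information helpers sketched earlier.

```python
# Greedy inclusion: rank terms by an independently computed discrimination measure.

def select_features(term_scores, k):
    """term_scores: dict mapping term -> discrimination measure."""
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    return ranked[:k]   # retain the k best terms for use by the classifier
```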

Page 56:

Feature selection - performance

• A Bayesian classifier cannot overfit much.

[Figure: effect of feature selection on Bayesian classifiers]

Page 57:

Naive Bayes vs. other methods


Sec.13.6

Page 58: (no text extracted)

Page 59:

Benchmarks for accuracy

Reuters
• 10700 labeled documents
• 10% of documents with multiple class labels

OHSUMED
• 348566 abstracts from medical journals

20NG
• 18800 labeled USENET postings
• 20 leaf classes, 5 root-level classes

WebKB
• 8300 documents in 7 academic categories.