Text Classification and Naïve Bayes


Page 1: Text Classification and Naïve Bayes

Text Classification and Naïve Bayes

An example of text classification
Definition of a machine learning problem
A refresher on probability
The Naive Bayes classifier


Page 2: Text Classification and Naïve Bayes

Google News


Page 3: Text Classification and Naïve Bayes

Different ways for classification

Human labor (people assign categories to every incoming article)

Hand-crafted rules for automatic classification (see the sketch after this list):
If the article contains: stock, Dow, share, Nasdaq, etc. -> Business
If the article contains: set, breakpoint, player, Federer, etc. -> Tennis

Machine learning algorithms
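As a rough illustration of the rule-based approach, here is a minimal sketch in Python; the keyword sets and the function name classify_by_rules are illustrative choices, not rules taken from the slides:

```python
# A sketch of hand-crafted keyword rules; the keyword sets below are
# illustrative examples only.
BUSINESS_KEYWORDS = {"stock", "dow", "share", "nasdaq"}
TENNIS_KEYWORDS = {"set", "breakpoint", "player", "federer"}

def classify_by_rules(article: str) -> str:
    words = set(article.lower().split())
    if words & BUSINESS_KEYWORDS:   # any business keyword present?
        return "Business"
    if words & TENNIS_KEYWORDS:     # any tennis keyword present?
        return "Tennis"
    return "Unknown"

print(classify_by_rules("Federer breaks serve and takes the first set"))  # -> Tennis
```

Such rules can be precise, but writing and maintaining them by hand is exactly the effort that machine learning algorithms are meant to replace.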


Page 4: Text Classification and Naïve Bayes

What is Machine Learning?


Definition: A computer program is said to learn from experience E with respect to a task T and performance measure P if its performance at T, as measured by P, improves with experience E.

Tom Mitchell, Machine Learning, 1997

Examples:
- Learning to recognize spoken words
- Learning to drive a vehicle
- Learning to play backgammon

Page 5: Text Classification and Naïve Bayes

Components of a ML System (1)

Experience (a set of examples that combines input and output for a task)

Text categorization: document + category
Speech recognition: spoken text + written text

Experience is referred to as Training Data. When training data is available, we talk of Supervised Learning.

Performance metrics

Error or accuracy on the Test Data
The Test Data are not present in the Training Data
When there are few training data, methods like ‘leave-one-out’ or ‘ten-fold cross validation’ are used to measure error.
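A minimal sketch of ten-fold (or general k-fold) cross-validation, assuming hypothetical train(...) and accuracy(...) functions supplied by the caller; leave-one-out is the special case where k equals the number of examples:

```python
import random

def k_fold_error(examples, k, train, accuracy):
    """Estimate error with k-fold cross-validation.

    examples: list of (input, label) pairs; train(training_data) returns a
    model; accuracy(model, test_data) returns a value in [0, 1].
    """
    examples = list(examples)
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]           # k roughly equal folds
    errors = []
    for i in range(k):
        held_out = folds[i]                              # test data for this round
        training = [ex for j in range(k) if j != i for ex in folds[j]]
        model = train(training)
        errors.append(1.0 - accuracy(model, held_out))   # error on the unseen fold
    return sum(errors) / k                               # average over the k folds
```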


Page 6: Text Classification and Naïve Bayes

Components of a ML System (2)

Type of knowledge to be learned (known as the target function, which maps between input and output)

Representation of the target function:
Decision trees
Neural networks
Linear functions

The learning algorithm:
C4.5 (learns decision trees)
Gradient descent (learns a neural network)
Linear programming (learns linear functions)

Page 7: Text Classification and Naïve Bayes

Defining Text Classification


$d \in \mathbb{X}$ : the document in the multi-dimensional space $\mathbb{X}$

$\mathbb{C} = \{c_1, c_2, \ldots, c_J\}$ : a set of classes (categories, or labels)

$\mathbb{D}$, a set of pairs $\langle d, c \rangle \in \mathbb{X} \times \mathbb{C}$ : the training set of labeled documents

Target function: $\gamma : \mathbb{X} \rightarrow \mathbb{C}$

Learning algorithm: $\Gamma(\mathbb{D}) = \gamma$

Example: $\langle d, c \rangle$ = ⟨“Beijing joins the World Trade Organization”, China⟩, and $\gamma(d)$ = China
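A minimal sketch of these objects as Python type aliases, assuming a bag-of-words representation for the space $\mathbb{X}$ (the alias names are illustrative):

```python
from typing import Callable, Dict, List, Tuple

Document = Dict[str, int]            # bag of words: term -> count (a point in X)
Label = str                          # one of the classes c_1, ..., c_J
TrainingExample = Tuple[Document, Label]
TrainingSet = List[TrainingExample]  # the labeled training set D

# The target function gamma maps a document to a class;
# a learning algorithm Gamma maps a training set to such a function.
TargetFunction = Callable[[Document], Label]
LearningAlgorithm = Callable[[TrainingSet], TargetFunction]
```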

Page 8: Text Classification and Naïve Bayes

Naïve Bayes Learning


Target function: $\gamma(d) = c$

Learning algorithm: Naïve Bayes

$c_{MAP} = \arg\max_{c \in \mathbb{C}} P(c|d) = \arg\max_{c \in \mathbb{C}} P(c)\,P(d|c)$

$c_{MAP} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$

The generative process:

$P(c)$ : the a priori probability of choosing a category
$P(d|c)$ : the conditional probability of generating $d$, given the fixed $c$
$P(c|d)$ : the a posteriori probability that $c$ generated $d$
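The generative process can be made concrete with a toy sampler; this is only a sketch, with priors (class -> P(c)) and cond_probs (class -> {term: P(t|c)}) as assumed data structures:

```python
import random

def generate_document(priors, cond_probs, length):
    """Toy generative story: choose a class c with probability P(c), then
    draw `length` terms independently, each with probability P(t|c)."""
    classes = list(priors)
    c = random.choices(classes, weights=[priors[x] for x in classes], k=1)[0]
    terms = list(cond_probs[c])
    term_weights = [cond_probs[c][t] for t in terms]
    tokens = random.choices(terms, weights=term_weights, k=length)
    return c, tokens
```

Naïve Bayes inverts this story: given the generated $d$, it asks which $c$ most probably produced it.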

Page 9: Text Classification and Naïve Bayes

A Refresher on Probability


Page 10: Text Classification and Naïve Bayes

Visualizing probability

A is a random variable that denotes an uncertain event.
Example: A = “I’ll get an A+ in the final exam”

P(A) is “the fraction of possible worlds where A is true”


[Figure: the event space of all possible worlds, with area 1, split into worlds in which A is true and worlds in which A is false; P(A) = the area of the circle of worlds in which A is true. Slide: Andrew W. Moore]

Page 11: Text Classification and Naïve Bayes

Axioms and Theorems of Probability

Axioms:
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

Theorems:
P(not A) = P(~A) = 1 - P(A)
P(A) = P(A ^ B) + P(A ^ ~B)
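As a quick check (not spelled out on the slide), both theorems follow from the axioms, because $A$ and $\lnot A$ are disjoint and $A = (A \wedge B) \vee (A \wedge \lnot B)$ splits $A$ into disjoint parts:

```latex
\begin{align*}
1 &= P(A \lor \lnot A) = P(A) + P(\lnot A) - P(A \land \lnot A)
   = P(A) + P(\lnot A), \quad\text{so } P(\lnot A) = 1 - P(A),\\
P(A) &= P\big((A \land B) \lor (A \land \lnot B)\big) = P(A \land B) + P(A \land \lnot B).
\end{align*}
```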


Page 12: Text Classification and Naïve Bayes

Conditional Probability

P(A|B) = the probability of A being true, given that we know that B is true


[Figure: Venn diagram of the events H and F over the space of possible worlds]

H = “I have a headache”
F = “Coming down with flu”

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

Slide: Andrew W. Moore

Headaches are rare and flu is even rarer, but if you have the flu, there is a 50-50 chance you’ll have a headache.

Page 13: Text Classification and Naïve Bayes

Deriving the Bayes Rule


Conditional probability: $P(A|B) = \frac{P(A \wedge B)}{P(B)}$

Chain rule: $P(A \wedge B) = P(A|B)\,P(B)$

$P(A \wedge B) = P(B \wedge A) = P(B|A)\,P(A)$

Bayes Rule: $P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$
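Plugging in the headache/flu numbers from the previous slide gives a small worked example (this calculation is not on the slides):

```latex
\[
P(F \mid H) = \frac{P(H \mid F)\,P(F)}{P(H)}
            = \frac{(1/2)\,(1/40)}{1/10}
            = \frac{1/80}{1/10} = \frac{1}{8}
\]
```

So even with a headache, the probability of flu rises only to 1/8.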

Page 14: Text Classification and Naïve Bayes

Back to the Naïve Bayes Classifier


Page 15: Text Classification and Naïve Bayes

Deriving the Naïve Bayes


Bayes Rule: $P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$

Given two classes $c_1, c_2$ and the document $d'$:

$P(c_1|d') = \frac{P(c_1)\,P(d'|c_1)}{P(d')}$

$P(c_2|d') = \frac{P(c_2)\,P(d'|c_2)}{P(d')}$

We are looking for the $c_i$ that maximizes the a posteriori probability $P(c_i|d')$.

$P(d')$ (the denominator) is the same in both cases.

Thus:

$c_{MAP} = \arg\max_{c \in \mathbb{C}} P(c)\,P(d|c)$

Page 16: Text Classification and Naïve Bayes

Estimating parameters for the target function

We are looking for the estimates $\hat{P}(c)$ and $\hat{P}(d|c)$.

$P(c)$ is “the fraction of possible worlds where $c$ is true”, estimated from the training data as:

$\hat{P}(c) = \frac{N_c}{N}$

$N$ : the number of all documents
$N_c$ : the number of documents in class $c$

$d$ is a vector in the space $\mathbb{X}$, where each dimension is a term:

$P(d|c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c)$

By using the chain rule $P(A \wedge B) = P(A|B)\,P(B)$, we have:

$P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c) = P(t_1 \mid \langle t_2, \ldots, t_{n_d} \rangle, c) \cdot P(\langle t_2, \ldots, t_{n_d} \rangle \mid c) = \ldots$

Page 17: Text Classification and Naïve Bayes

Naïve assumptions of independence

1. All attribute values are independent of each other given the class. (conditional independence assumption)

2. The conditional probabilities for a term are the same independent of position in the document.

We assume the document is a “bag-of-words”.


$P(d|c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(t_k|c)$

Finally, we get the target function of Slide 8:

$c_{MAP} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$

Page 18: Text Classification and Naïve Bayes

Again about estimation


For each term $t$, we need to estimate $P(t|c)$:

$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$

Because an estimate will be 0 if a term does not appear with a class in the training data, we need smoothing:

$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + |V|}$ (Laplace Smoothing)

$|V|$ is the number of terms in the vocabulary.
$T_{ct}$ is the count of term $t$ in all documents of class $c$.
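A minimal sketch of this estimation step in Python, assuming each training document is a list of tokens (the function name train_naive_bayes and the data layout are illustrative):

```python
from collections import Counter, defaultdict

def train_naive_bayes(training_set):
    """training_set: list of (tokens, label) pairs.
    Returns priors P(c) and Laplace-smoothed term probabilities P(t|c)."""
    n_docs = len(training_set)
    class_counts = Counter(label for _, label in training_set)        # N_c
    term_counts = defaultdict(Counter)                                 # T_ct
    vocabulary = set()
    for tokens, label in training_set:
        term_counts[label].update(tokens)
        vocabulary.update(tokens)

    priors = {c: class_counts[c] / n_docs for c in class_counts}       # N_c / N
    cond_probs = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + len(vocabulary)         # sum of T_ct' plus |V|
        cond_probs[c] = {t: (term_counts[c][t] + 1) / denom for t in vocabulary}
    return priors, cond_probs, vocabulary
```

Applied to the training set of Example 13.1 on the next slides, this reproduces the estimates shown there.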

Page 19: Text Classification and Naïve Bayes

An Example of classification with Naïve Bayes


Page 20: Text Classification and Naïve Bayes

Example 13.1 (Part 1)


Two classes: “China” ($c$) and “not China” ($\bar{c}$)

Training set:

docID   words in document                      in c = China?
1       Chinese Beijing Chinese                Yes
2       Chinese Chinese Shanghai               Yes
3       Chinese Macao                          Yes
4       Tokyo Japan Chinese                    No

Test set:

5       Chinese Chinese Chinese Tokyo Japan    ?

$N = 4$, $\hat{P}(c) = 3/4$, $\hat{P}(\bar{c}) = 1/4$

$V$ = {Beijing, Chinese, Japan, Macao, Shanghai, Tokyo}, so $|V| = 6$

Page 21: Text Classification and Naïve Bayes

Example 13.1 (Part 2)



Estimation:

$\hat{P}(\text{Chinese} \mid c) = (5+1)/(8+6) = 6/14 = 3/7$

$\hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) = (0+1)/(8+6) = 1/14$

$\hat{P}(\text{Chinese} \mid \bar{c}) = (1+1)/(3+6) = 2/9$

$\hat{P}(\text{Tokyo} \mid \bar{c}) = \hat{P}(\text{Japan} \mid \bar{c}) = (1+1)/(3+6) = 2/9$

Classification:

$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$

$P(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$

$P(\bar{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$

Therefore, the classifier assigns the test document $d_5$ to $c$ = China.
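The arithmetic above can be checked with a few lines of Python (simply re-doing the multiplication; not part of the original slides):

```python
# Laplace-smoothed estimates from Example 13.1
p_c, p_not_c = 3/4, 1/4
p_chinese_c, p_tokyo_c, p_japan_c = 3/7, 1/14, 1/14
p_chinese_nc, p_tokyo_nc, p_japan_nc = 2/9, 2/9, 2/9

# Test document d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c = p_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_not_c = p_not_c * p_chinese_nc**3 * p_tokyo_nc * p_japan_nc

print(round(score_c, 4), round(score_not_c, 4))  # 0.0003 0.0001 -> choose "China"
```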

Page 22: Text Classification and Naïve Bayes

Summary: Miscellaneous

Naïve Bayes is linear in the time it takes to scan the data.

When we have many terms, the product of probabilities will cause a floating-point underflow; therefore, we sum logarithms of the probabilities instead:

$c_{MAP} = \arg\max_{c \in \mathbb{C}} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \right]$

For a large training set, the vocabulary is large. It is better to select only a subset of terms; for that, “feature selection” is used (Section 13.5).
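A minimal sketch of classification in log space, reusing the priors, cond_probs, and vocabulary produced by the training sketch earlier (the function name classify_naive_bayes is illustrative):

```python
import math

def classify_naive_bayes(tokens, priors, cond_probs, vocabulary):
    """Return the class c maximizing log P(c) + sum_k log P(t_k|c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            if t in vocabulary:                  # skip terms unseen in training
                score += math.log(cond_probs[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working with sums of logarithms leaves the arg max unchanged, since the logarithm is monotonic, while avoiding the underflow of the raw product.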