Text Classification and Naïve Bayes


Page 1: Text Classification and Naïve Bayes

Text Classification and Naïve Bayes

An example of text classification
Definition of a machine learning problem
A refresher on probability
The Naive Bayes classifier


Page 2: Text Classification and Naïve Bayes

Google News


Page 3: Text Classification and Naïve Bayes

Different ways for classification

Human labor (people assign categories to every incoming article)

Hand-crafted rules for automatic classification (see the sketch after this list):
If the article contains: stock, Dow, share, Nasdaq, etc. -> Business
If the article contains: set, breakpoint, player, Federer, etc. -> Tennis

Machine learning algorithms
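As a rough illustration of the rule-based approach, here is a minimal sketch in Python; the keyword sets and the function name classify_by_rules are illustrative choices, not rules taken from the slides:

```python
# A sketch of hand-crafted keyword rules; the keyword sets below are
# illustrative examples only.
BUSINESS_KEYWORDS = {"stock", "dow", "share", "nasdaq"}
TENNIS_KEYWORDS = {"set", "breakpoint", "player", "federer"}

def classify_by_rules(article: str) -> str:
    words = set(article.lower().split())
    if words & BUSINESS_KEYWORDS:   # any business keyword present?
        return "Business"
    if words & TENNIS_KEYWORDS:     # any tennis keyword present?
        return "Tennis"
    return "Unknown"

print(classify_by_rules("Federer breaks serve and takes the first set"))  # -> Tennis
```

Such rules can be precise, but writing and maintaining them by hand is exactly the effort that machine learning algorithms are meant to replace.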


Page 4: Text Classification and Naïve Bayes

What is Machine Learning?


Definition: A computer program is said to learn from experience E with respect to a task T and performance measure P if its performance at T, as measured by P, improves with experience E.

Tom Mitchell, Machine Learning, 1997

Examples:
- Learning to recognize spoken words
- Learning to drive a vehicle
- Learning to play backgammon

Page 5: Text Classification and Naïve Bayes

Components of a ML System (1)

Experience (a set of examples that combines input and output for a task)

Text categorization: document + category
Speech recognition: spoken text + written text

Experience is referred to as Training Data. When training data is available, we talk of Supervised Learning.

Performance metrics

Error or accuracy on the Test Data
The Test Data are not present in the Training Data
When there are few training data, methods like ‘leave-one-out’ or ‘ten-fold cross validation’ are used to measure error.
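A minimal sketch of ten-fold (or general k-fold) cross-validation, assuming hypothetical train(...) and accuracy(...) functions supplied by the caller; leave-one-out is the special case where k equals the number of examples:

```python
import random

def k_fold_error(examples, k, train, accuracy):
    """Estimate error with k-fold cross-validation.

    examples: list of (input, label) pairs; train(training_data) returns a
    model; accuracy(model, test_data) returns a value in [0, 1].
    """
    examples = list(examples)
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]           # k roughly equal folds
    errors = []
    for i in range(k):
        held_out = folds[i]                              # test data for this round
        training = [ex for j in range(k) if j != i for ex in folds[j]]
        model = train(training)
        errors.append(1.0 - accuracy(model, held_out))   # error on the unseen fold
    return sum(errors) / k                               # average over the k folds
```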


Page 6: Text Classification and Naïve Bayes

Components of a ML System (2)

Type of knowledge to be learned (known as the target function, which maps between input and output)

Representation of the target function:
Decision trees
Neural networks
Linear functions

The learning algorithm:
C4.5 (learns decision trees)
Gradient descent (learns a neural network)
Linear programming (learns linear functions)

Page 7: Text Classification and Naïve Bayes

Defining Text Classification


$d \in \mathbb{X}$ : the document in the multi-dimensional space $\mathbb{X}$

$\mathbb{C} = \{c_1, c_2, \ldots, c_J\}$ : a set of classes (categories, or labels)

$\mathbb{D}$, a set of pairs $\langle d, c \rangle \in \mathbb{X} \times \mathbb{C}$ : the training set of labeled documents

Target function: $\gamma : \mathbb{X} \rightarrow \mathbb{C}$

Learning algorithm: $\Gamma(\mathbb{D}) = \gamma$

Example: $\langle d, c \rangle$ = ⟨“Beijing joins the World Trade Organization”, China⟩, and $\gamma(d)$ = China
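A minimal sketch of these objects as Python type aliases, assuming a bag-of-words representation for the space $\mathbb{X}$ (the alias names are illustrative):

```python
from typing import Callable, Dict, List, Tuple

Document = Dict[str, int]            # bag of words: term -> count (a point in X)
Label = str                          # one of the classes c_1, ..., c_J
TrainingExample = Tuple[Document, Label]
TrainingSet = List[TrainingExample]  # the labeled training set D

# The target function gamma maps a document to a class;
# a learning algorithm Gamma maps a training set to such a function.
TargetFunction = Callable[[Document], Label]
LearningAlgorithm = Callable[[TrainingSet], TargetFunction]
```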

Page 8: Text Classification and Naïve Bayes

Naïve Bayes Learning


Target function: $\gamma(d) = c$

Learning algorithm: Naïve Bayes

$c_{MAP} = \arg\max_{c \in \mathbb{C}} P(c|d) = \arg\max_{c \in \mathbb{C}} P(c)\,P(d|c)$

$c_{MAP} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$

The generative process:

$P(c)$ : the a priori probability of choosing a category
$P(d|c)$ : the conditional probability of generating $d$, given the fixed $c$
$P(c|d)$ : the a posteriori probability that $c$ generated $d$
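The generative process can be made concrete with a toy sampler; this is only a sketch, with priors (class -> P(c)) and cond_probs (class -> {term: P(t|c)}) as assumed data structures:

```python
import random

def generate_document(priors, cond_probs, length):
    """Toy generative story: choose a class c with probability P(c), then
    draw `length` terms independently, each with probability P(t|c)."""
    classes = list(priors)
    c = random.choices(classes, weights=[priors[x] for x in classes], k=1)[0]
    terms = list(cond_probs[c])
    term_weights = [cond_probs[c][t] for t in terms]
    tokens = random.choices(terms, weights=term_weights, k=length)
    return c, tokens
```

Naïve Bayes inverts this story: given the generated $d$, it asks which $c$ most probably produced it.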

Page 9: Text Classification and Naïve Bayes

A Refresher on Probability


Page 10: Text Classification and Naïve Bayes

Visualizing probability

A is a random variable that denotes an uncertain event.
Example: A = “I’ll get an A+ in the final exam”

P(A) is “the fraction of possible worlds where A is true”


[Figure: the event space of all possible worlds, with area 1, split into worlds in which A is true and worlds in which A is false; P(A) = the area of the circle of worlds in which A is true. Slide: Andrew W. Moore]

Page 11: Text Classification and Naïve Bayes

Axioms and Theorems of Probability

Axioms:
0 <= P(A) <= 1
P(True) = 1
P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

Theorems:
P(not A) = P(~A) = 1 - P(A)
P(A) = P(A ^ B) + P(A ^ ~B)
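As a quick check (not spelled out on the slide), both theorems follow from the axioms, because $A$ and $\lnot A$ are disjoint and $A = (A \wedge B) \vee (A \wedge \lnot B)$ splits $A$ into disjoint parts:

```latex
\begin{align*}
1 &= P(A \lor \lnot A) = P(A) + P(\lnot A) - P(A \land \lnot A)
   = P(A) + P(\lnot A), \quad\text{so } P(\lnot A) = 1 - P(A),\\
P(A) &= P\big((A \land B) \lor (A \land \lnot B)\big) = P(A \land B) + P(A \land \lnot B).
\end{align*}
```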


Page 12: Text Classification and Naïve Bayes

Conditional Probability

P(A|B) = the probability of A being true, given that we know that B is true


[Figure: Venn diagram of the events H and F over the space of possible worlds]

H = “I have a headache”
F = “Coming down with flu”

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

Slide: Andrew W. Moore

Headaches are rare and flu is even rarer, but if you have the flu, there is a 50-50 chance you’ll have a headache.

Page 13: Text Classification and Naïve Bayes

Deriving the Bayes Rule


Conditional probability: $P(A|B) = \frac{P(A \wedge B)}{P(B)}$

Chain rule: $P(A \wedge B) = P(A|B)\,P(B)$

$P(A \wedge B) = P(B \wedge A) = P(B|A)\,P(A)$

Bayes Rule: $P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$
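Plugging in the headache/flu numbers from the previous slide gives a small worked example (this calculation is not on the slides):

```latex
\[
P(F \mid H) = \frac{P(H \mid F)\,P(F)}{P(H)}
            = \frac{(1/2)\,(1/40)}{1/10}
            = \frac{1/80}{1/10} = \frac{1}{8}
\]
```

So even with a headache, the probability of flu rises only to 1/8.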

Page 14: Text Classification and Naïve Bayes

Back to the Naïve Bayes Classifier


Page 15: Text Classification and Naïve Bayes

Deriving the Naïve Bayes


Bayes Rule: $P(B|A) = \frac{P(A|B)\,P(B)}{P(A)}$

Given two classes $c_1, c_2$ and the document $d'$:

$P(c_1|d') = \frac{P(c_1)\,P(d'|c_1)}{P(d')}$

$P(c_2|d') = \frac{P(c_2)\,P(d'|c_2)}{P(d')}$

We are looking for the $c_i$ that maximizes the a posteriori probability $P(c_i|d')$.

$P(d')$ (the denominator) is the same in both cases.

Thus:

$c_{MAP} = \arg\max_{c \in \mathbb{C}} P(c)\,P(d|c)$

Page 16: Text Classification and Naïve Bayes

Estimating parameters for the target function

We are looking for the estimates $\hat{P}(c)$ and $\hat{P}(d|c)$.

$P(c)$ is “the fraction of possible worlds where $c$ is true”, estimated from the training data as:

$\hat{P}(c) = \frac{N_c}{N}$

$N$ : the number of all documents
$N_c$ : the number of documents in class $c$

$d$ is a vector in the space $\mathbb{X}$, where each dimension is a term:

$P(d|c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c)$

By using the chain rule $P(A \wedge B) = P(A|B)\,P(B)$, we have:

$P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c) = P(t_1 \mid \langle t_2, \ldots, t_{n_d} \rangle, c) \cdot P(\langle t_2, \ldots, t_{n_d} \rangle \mid c) = \ldots$

Page 17: Text Classification and Naïve Bayes

Naïve assumptions of independence

1. All attribute values are independent of each other given the class. (conditional independence assumption)

2. The conditional probabilities for a term are the same independent of position in the document.

We assume the document is a “bag-of-words”.


$P(d|c) = P(\langle t_1, t_2, \ldots, t_{n_d} \rangle \mid c) = \prod_{1 \le k \le n_d} P(t_k|c)$

Finally, we get the target function of Slide 8:

$c_{MAP} = \arg\max_{c \in \mathbb{C}} \hat{P}(c|d) = \arg\max_{c \in \mathbb{C}} \hat{P}(c) \prod_{1 \le k \le n_d} \hat{P}(t_k|c)$

Page 18: Text Classification and Naïve Bayes

Again about estimation


For each term $t$, we need to estimate $P(t|c)$:

$\hat{P}(t|c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}$

Because an estimate will be 0 if a term does not appear with a class in the training data, we need smoothing:

$\hat{P}(t|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + |V|}$ (Laplace Smoothing)

$|V|$ is the number of terms in the vocabulary.
$T_{ct}$ is the count of term $t$ in all documents of class $c$.
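A minimal sketch of this estimation step in Python, assuming each training document is a list of tokens (the function name train_naive_bayes and the data layout are illustrative):

```python
from collections import Counter, defaultdict

def train_naive_bayes(training_set):
    """training_set: list of (tokens, label) pairs.
    Returns priors P(c) and Laplace-smoothed term probabilities P(t|c)."""
    n_docs = len(training_set)
    class_counts = Counter(label for _, label in training_set)        # N_c
    term_counts = defaultdict(Counter)                                 # T_ct
    vocabulary = set()
    for tokens, label in training_set:
        term_counts[label].update(tokens)
        vocabulary.update(tokens)

    priors = {c: class_counts[c] / n_docs for c in class_counts}       # N_c / N
    cond_probs = {}
    for c in class_counts:
        denom = sum(term_counts[c].values()) + len(vocabulary)         # sum of T_ct' plus |V|
        cond_probs[c] = {t: (term_counts[c][t] + 1) / denom for t in vocabulary}
    return priors, cond_probs, vocabulary
```

Applied to the training set of Example 13.1 on the next slides, this reproduces the estimates shown there.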

Page 19: Text Classification and Naïve Bayes

An Example of classification with Naïve Bayes


Page 20: Text Classification and Naïve Bayes

Example 13.1 (Part 1)


Two classes: “China” ($c$) and “not China” ($\bar{c}$)

Training set:

docID   words in document                      in c = China?
1       Chinese Beijing Chinese                Yes
2       Chinese Chinese Shanghai               Yes
3       Chinese Macao                          Yes
4       Tokyo Japan Chinese                    No

Test set:

5       Chinese Chinese Chinese Tokyo Japan    ?

$N = 4$, $\hat{P}(c) = 3/4$, $\hat{P}(\bar{c}) = 1/4$

$V$ = {Beijing, Chinese, Japan, Macao, Shanghai, Tokyo}, so $|V| = 6$

Page 21: Text Classification and Naïve Bayes

Example 13.1 (Part 2)



Estimation:

$\hat{P}(\text{Chinese} \mid c) = (5+1)/(8+6) = 6/14 = 3/7$

$\hat{P}(\text{Tokyo} \mid c) = \hat{P}(\text{Japan} \mid c) = (0+1)/(8+6) = 1/14$

$\hat{P}(\text{Chinese} \mid \bar{c}) = (1+1)/(3+6) = 2/9$

$\hat{P}(\text{Tokyo} \mid \bar{c}) = \hat{P}(\text{Japan} \mid \bar{c}) = (1+1)/(3+6) = 2/9$

Classification:

$P(c|d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k|c)$

$P(c|d_5) \propto 3/4 \cdot (3/7)^3 \cdot 1/14 \cdot 1/14 \approx 0.0003$

$P(\bar{c}|d_5) \propto 1/4 \cdot (2/9)^3 \cdot 2/9 \cdot 2/9 \approx 0.0001$

Therefore, the classifier assigns the test document $d_5$ to $c$ = China.
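The arithmetic above can be checked with a few lines of Python (simply re-doing the multiplication; not part of the original slides):

```python
# Laplace-smoothed estimates from Example 13.1
p_c, p_not_c = 3/4, 1/4
p_chinese_c, p_tokyo_c, p_japan_c = 3/7, 1/14, 1/14
p_chinese_nc, p_tokyo_nc, p_japan_nc = 2/9, 2/9, 2/9

# Test document d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c = p_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_not_c = p_not_c * p_chinese_nc**3 * p_tokyo_nc * p_japan_nc

print(round(score_c, 4), round(score_not_c, 4))  # 0.0003 0.0001 -> choose "China"
```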

Page 22: Text Classification and Naïve Bayes

Summary: Miscellaneous

Naïve Bayes is linear in the time it takes to scan the data.

When we have many terms, the product of probabilities will cause a floating-point underflow; therefore, we sum logarithms of the probabilities instead:

$c_{MAP} = \arg\max_{c \in \mathbb{C}} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k|c) \right]$

For a large training set, the vocabulary is large. It is better to select only a subset of terms; for that, “feature selection” is used (Section 13.5).
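A minimal sketch of classification in log space, reusing the priors, cond_probs, and vocabulary produced by the training sketch earlier (the function name classify_naive_bayes is illustrative):

```python
import math

def classify_naive_bayes(tokens, priors, cond_probs, vocabulary):
    """Return the class c maximizing log P(c) + sum_k log P(t_k|c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            if t in vocabulary:                  # skip terms unseen in training
                score += math.log(cond_probs[c][t])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Working with sums of logarithms leaves the arg max unchanged, since the logarithm is monotonic, while avoiding the underflow of the raw product.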