Text Categorization


Categorization

• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

Learning for Categorization

• A training example is an instance x ∈ X, paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
• Given a set of training examples, D.
• Find a hypothesized categorization function, h(x), such that:

  ∀ <x, c(x)> ∈ D : h(x) = c(x)   (consistency)

Sample Category Learning Problem

• Instance language: <size, color, shape>
  – size ∈ {small, medium, large}
  – color ∈ {red, blue, green}
  – shape ∈ {square, circle, triangle}
• C = {positive, negative}
• D:

  Example  Size   Color  Shape     Category
  1        small  red    circle    positive
  2        large  red    circle    positive
  3        small  red    triangle  negative
  4        large  blue   circle    negative

General Learning Issues

• Many hypotheses are usually consistent with the training data.

• Bias
  – Any criterion other than consistency with the training data that is used to select a hypothesis.
• Classification accuracy (% of instances classified correctly).
  – Measured on independent test data.
• Training time (efficiency of training algorithm).
• Testing time (efficiency of subsequent classification).

Generalization

• Hypotheses must generalize to correctly classify instances not in the training data.

• Simply memorizing training examples is a consistent hypothesis that does not generalize.

• Occam’s razor:
  – Finding a simple hypothesis helps ensure generalization.

Text Categorization

• Assigning documents to a fixed set of categories.

• Applications:
  – Web pages
    • Recommending
    • Yahoo-like classification

Text Categorization

• Applications:
  – Newsgroup messages
    • Recommending
    • Spam filtering
  – News articles
    • Personalized newspaper
  – Email messages
    • Routing
    • Prioritizing
    • Folderizing
    • Spam filtering

Learning for Text Categorization

• Manual development of text categorization functions is difficult.

• Learning algorithms:
  – Bayesian (naïve)
  – Neural network
  – Relevance feedback (Rocchio)
  – Rule based (Ripper)
  – Nearest neighbor (case based)
  – Support vector machines (SVM)

Using Relevance Feedback (Rocchio)

• Relevance feedback methods can be adapted for text categorization.

• Use standard TF/IDF weighted vectors to represent text documents (normalized by maximum term frequency).

• For each category, compute a prototype vector by summing the vectors of the training documents in the category.

• Assign test documents to the category with the closest prototype vector based on cosine similarity.

Rocchio Text Categorization Algorithm (Training)

Assume the set of categories is {c1, c2, …, cn}
For i from 1 to n let pi = <0, 0, …, 0> (init. prototype vectors)
For each training example <x, c(x)> ∈ D
  Let d be the frequency-normalized TF/IDF term vector for doc x
  Let i be the index j such that cj = c(x)
  Let pi = pi + d (sum all the document vectors in ci to get pi)

Rocchio Text Categorization Algorithm (Test)

Given test document x
Let d be the TF/IDF weighted term vector for x
Let m = –2 (init. maximum cosSim)
For i from 1 to n: (compute similarity to prototype vector)
  Let s = cosSim(d, pi)
  if s > m
    let m = s
    let r = ci (update most similar class prototype)
Return class r
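
A minimal Python sketch of the training and test procedures above, assuming documents are already represented as sparse TF/IDF dicts ({term: weight}); the helper names (cosine_sim, train_rocchio, classify_rocchio) are mine, not from the slides.

# Rocchio sketch: prototypes are sparse dicts {term: weight}.
from collections import defaultdict
import math

def cosine_sim(a, b):
    # Cosine similarity of two sparse vectors represented as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train_rocchio(examples):
    # examples: list of (tfidf_vector, category); sum the vectors per category.
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, cat in examples:
        for term, w in vec.items():
            prototypes[cat][term] += w
    return prototypes

def classify_rocchio(prototypes, vec):
    # Return the category whose prototype is closest by cosine similarity.
    return max(prototypes, key=lambda c: cosine_sim(vec, prototypes[c]))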

Illustration of Rocchio Text Categorization

Rocchio Properties

• Does not guarantee a consistent hypothesis.

• Forms a simple generalization of the examples in each class (a prototype).

• Prototype vector does not need to be averaged or otherwise normalized for length since cosine similarity is insensitive to vector length.

• Classification is based on similarity to class prototypes.

Rocchio Time Complexity

• Note: The time to add two sparse vectors is proportional to the minimum number of non-zero entries in the two vectors.
• Training Time: O(|D|(Ld + |Vd|)) = O(|D| Ld), where Ld is the average length of a document in D and |Vd| is the average vocabulary size for a document in D.
• Test Time: O(Lt + |C||Vt|), where Lt is the average length of a test document and |Vt| is the average vocabulary size for a test document.
  – Assumes lengths of the pi vectors are computed and stored during training, allowing cosSim(d, pi) to be computed in time proportional to the number of non-zero entries in d (i.e., |Vt|).

Nearest-Neighbor Learning Algorithm

• Learning is just storing the representations of the training examples in D.

• Testing instance x:
  – Compute similarity between x and all examples in D.
  – Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or category prototypes.
• Also called:
  – Case-based
  – Memory-based
  – Lazy learning

K Nearest-Neighbor

• Using only the closest example to determine categorization is subject to errors due to:
  – A single atypical example.
  – Noise (i.e., error) in the category label of a single training example.
• A more robust alternative is to find the k most-similar examples and return the majority category of these k examples.
• The value of k is typically odd to avoid ties; 3 and 5 are most common.

Similarity Metrics

• Nearest neighbor method depends on a similarity (or distance) metric.

• Simplest for a continuous m-dimensional instance space is Euclidean distance.

• Simplest for m-dimensional binary instance space is Hamming distance (number of feature values that differ).

• For text, cosine similarity of TF-IDF weighted vectors is typically most effective.

3 Nearest Neighbor Illustration (Euclidean Distance)


K Nearest Neighbor for Text

Training:
For each training example <x, c(x)> ∈ D
  Compute the corresponding TF-IDF vector, dx, for document x

Test instance y:
Compute TF-IDF vector d for document y
For each <x, c(x)> ∈ D
  Let sx = cosSim(d, dx)
Sort examples, x, in D by decreasing value of sx
Let N be the first k examples in D. (get most similar neighbors)
Return the majority class of examples in N
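
A short Python sketch of the k-NN test procedure above, reusing the cosine_sim helper from the Rocchio sketch; the function name and structure are illustrative only.

def knn_classify(train, query_vec, k=3):
    # train: list of (tfidf_vector, category) pairs; query_vec: sparse dict.
    scored = sorted(train, key=lambda ex: cosine_sim(query_vec, ex[0]),
                    reverse=True)
    neighbors = [cat for _, cat in scored[:k]]
    # Majority vote over the k most similar training documents.
    return max(set(neighbors), key=neighbors.count)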

Illustration of 3 Nearest Neighbor for Text

Rocchio Anomaly

• Prototype models have problems with polymorphic (disjunctive) categories.

3 Nearest Neighbor Comparison

• Nearest Neighbor tends to handle polymorphic categories better.

Nearest Neighbor Time Complexity

• Training Time: O(|D| Ld) to compose TF-IDF vectors.

• Testing Time: O(Lt + |D||Vt|) to compare to all training vectors.
  – Assumes lengths of the dx vectors are computed and stored during training, allowing cosSim(d, dx) to be computed in time proportional to the number of non-zero entries in d (i.e., |Vt|).

• Testing time can be high for large training sets.

Nearest Neighbor with Inverted Index

• Determining k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.

• Use standard VSR inverted index methods to find the k nearest neighbors.

• Testing Time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears.

• Therefore, overall classification is O(Lt + B|Vt|)
  – Typically B << |D|
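
A rough sketch of neighbor retrieval with an inverted index, assuming unit-normalized sparse TF-IDF dicts so accumulated dot products equal cosine similarities; this is an illustration of the idea, not the standard VSR implementation referenced above.

from collections import defaultdict

def build_inverted_index(train_vectors):
    # train_vectors: dict doc_id -> {term: weight}
    index = defaultdict(list)
    for doc_id, vec in train_vectors.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index

def retrieve_neighbors(index, query_vec, k=3):
    # Only documents sharing at least one term with the query are scored.
    scores = defaultdict(float)
    for term, qw in query_vec.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw   # accumulate dot products
    return sorted(scores, key=scores.get, reverse=True)[:k]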

Bayesian Methods

• Learning and classification methods based on probability theory.

• Bayes theorem plays a critical role in probabilistic learning and classification.

• Uses prior probability of each category given no information about an item.

• Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Axioms of Probability Theory

• All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
• A true proposition has probability 1, a false proposition has probability 0: P(true) = 1, P(false) = 0
• The probability of a disjunction is: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Conditional Probability

• P(A | B) is the probability of A given B.
• Assumes that B is all and only the information known.
• Defined by: P(A | B) = P(A ∧ B) / P(B)

Independence

• A and B are independent iff:

  P(A | B) = P(A)
  P(B | A) = P(B)

  (These two constraints are logically equivalent.)

• Therefore, if A and B are independent:

  P(A | B) = P(A ∧ B) / P(B) = P(A)

  P(A ∧ B) = P(A) P(B)

Bayes Theorem

Bayes theorem:

  P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:

  P(H | E) = P(H ∧ E) / P(E)      (Def. cond. prob.)
  P(E | H) = P(H ∧ E) / P(H)      (Def. cond. prob.)

  Therefore P(H ∧ E) = P(E | H) P(H), and so

  P(H | E) = P(E | H) P(H) / P(E)      QED

Bayesian Categorization

• Let the set of categories be {c1, c2, …, cn}
• Let E be a description of an instance.
• Determine the category of E by computing, for each ci:

  P(ci | E) = P(ci) P(E | ci) / P(E)

• P(E) can be determined since the categories are complete and disjoint:

  Σi P(ci | E) = Σi P(ci) P(E | ci) / P(E) = 1

  P(E) = Σi P(ci) P(E | ci)

Bayesian Categorization (cont.)

• Need to know:
  – Priors: P(ci)
  – Conditionals: P(E | ci)
• P(ci) are easily estimated from data.
  – If ni of the examples in D are in ci, then P(ci) = ni / |D|
• Assume an instance is a conjunction of binary features:

  E = e1 ∧ e2 ∧ … ∧ em

• Too many possible instances (exponential in m) to estimate all P(E | ci)

Naïve Bayesian Categorization

• If we assume the features of an instance are independent given the category ci (conditionally independent):

  P(E | ci) = P(e1 ∧ e2 ∧ … ∧ em | ci) = Π_{j=1..m} P(ej | ci)

• Therefore, we then only need to know P(ej | ci) for each feature and category.

Naïve Bayes Example

• C = {allergy, cold, well}
• e1 = sneeze; e2 = cough; e3 = fever
• E = {sneeze, cough, ¬fever}

  Prob            Well   Cold   Allergy
  P(ci)           0.9    0.05   0.05
  P(sneeze | ci)  0.1    0.9    0.9
  P(cough | ci)   0.1    0.8    0.7
  P(fever | ci)   0.01   0.7    0.4

Naïve Bayes Example (cont.)

P(well | E) = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
P(cold | E) = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.01 / P(E)
P(allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.019 / P(E)

(The last factor in each product is 1 − P(fever | ci), since fever is absent in E.)

Most probable category: allergy
P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | E) = 0.23
P(cold | E) = 0.26
P(allergy | E) = 0.50

  Probability     Well   Cold   Allergy
  P(ci)           0.9    0.05   0.05
  P(sneeze | ci)  0.1    0.9    0.9
  P(cough | ci)   0.1    0.8    0.7
  P(fever | ci)   0.01   0.7    0.4

E = {sneeze, cough, ¬fever}

Estimating Probabilities

• Normally, probabilities are estimated based on observed frequencies in the training data.

• If D contains ni examples in category ci, and nij of these ni examples contain feature ej, then:

  P(ej | ci) = nij / ni

• However, estimating such probabilities from small training sets is error-prone.
• If, due only to chance, a rare feature, ek, is always false in the training data, then ∀ci: P(ek | ci) = 0.
• If ek then occurs in a test example, E, the result is that ∀ci: P(E | ci) = 0 and ∀ci: P(ci | E) = 0.

Smoothing

• To account for estimation from small samples, probability estimates are adjusted or smoothed.

• Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m.

• For binary features, p is simply assumed to be 0.5.

  P(ej | ci) = (nij + m·p) / (ni + m)
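
A one-function sketch of the m-estimate above (the function and parameter names are mine, for illustration only):

def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    # Smoothed estimate of P(ej | ci) with a virtual sample of size m and prior p.
    return (n_ij + m * p) / (n_i + m)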

Naïve Bayes for Text

• Modeled as generating a bag of words for a document in a given category by repeatedly sampling with replacement from a vocabulary V = {w1, w2,…wm} based on the probabilities P(wj | ci).

• Smooth probability estimates with Laplace m-estimates, assuming a uniform distribution over all words (p = 1/|V|) and m = |V|.
  – Equivalent to a virtual sample of seeing each word in each category exactly once.

Text Naïve Bayes Algorithm (Train)

Let V be the vocabulary of all words in the documents in D
For each category ci ∈ C
  Let Di be the subset of documents in D in category ci
  P(ci) = |Di| / |D|
  Let Ti be the concatenation of all the documents in Di
  Let ni be the total number of word occurrences in Ti
  For each word wj ∈ V
    Let nij be the number of occurrences of wj in Ti
    Let P(wj | ci) = (nij + 1) / (ni + |V|)

Text Naïve Bayes Algorithm (Test)

Given a test document X
Let n be the number of word occurrences in X
Return the category:

  argmax_{ci ∈ C} P(ci) Π_{i=1..n} P(ai | ci)

where ai is the word occurring in the ith position in X
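
A minimal Python sketch of the Train/Test pseudocode above with add-one (Laplace) smoothing; it sums log probabilities (see the underflow slide below) rather than multiplying. Documents are assumed to be lists of tokens, and the helper names are illustrative, not from the slides.

import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, category) pairs.
    vocab = {w for tokens, _ in docs for w in tokens}
    doc_count = Counter(cat for _, cat in docs)
    word_counts = defaultdict(Counter)          # category -> Counter of words
    for tokens, cat in docs:
        word_counts[cat].update(tokens)
    priors = {c: doc_count[c] / len(docs) for c in doc_count}
    cond = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Laplace smoothing: (nij + 1) / (ni + |V|)
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify_nb(model, tokens):
    priors, cond, vocab = model
    best, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c])
        for w in tokens:
            if w in vocab:                      # ignore unseen words
                score += math.log(cond[c][w])
        if score > best_score:
            best, best_score = c, score
    return best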

Naïve Bayes Time Complexity

• Training Time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
  – Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
  – Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
• Test Time: O(|C| Lt), where Lt is the average length of a test document.

• Very efficient overall, linearly proportional to the time needed to just read in all the data.

• Similar to Rocchio time complexity.

Underflow Prevention

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

• Class with highest final un-normalized log probability score is still the most probable.
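
A tiny illustration of the idea, with made-up per-feature probabilities:

import math

probs = [1e-5, 3e-4, 2e-7, 5e-3]             # hypothetical conditional probabilities
log_score = sum(math.log(p) for p in probs)  # sum of logs instead of a product
# The product itself would drift toward floating-point underflow for long
# documents, but comparing log scores across classes still picks the most
# probable class.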

Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  – Output probabilities are generally very close to 0 or 1.

Evaluating Categorization

• Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).

• Classification accuracy: c/n where n is the total number of test instances and c is the number of test instances correctly classified by the system.

• Results can vary based on sampling error due to different training and test sets.

• Average results over multiple training and test sets (splits of the overall data) for the best results.

N-Fold Cross-Validation

• Ideally, test and training sets are independent on each trial.
  – But this would require too much labeled data.
• Partition the data into N equal-sized disjoint segments.
• Run N trials, each time using a different segment of the data for testing and training on the remaining N−1 segments.
• This way, at least the test sets are independent.
• Report average classification accuracy over the N trials.
• Typically, N = 10.
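
A compact sketch of N-fold cross-validation, assuming a train/classify pair such as the naive Bayes functions sketched earlier; the function names and structure are illustrative only.

import random

def cross_validate(examples, train, classify, n_folds=10, seed=0):
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]   # N disjoint segments
    accuracies = []
    for i in range(n_folds):
        test = folds[i]
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train(train_set)
        correct = sum(1 for x, label in test if classify(model, x) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds                     # average accuracy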

Learning Curves

• In practice, labeled data is usually rare and expensive.

• Would like to know how performance varies with the number of training instances.

• Learning curves plot classification accuracy on independent test data (Y axis) versus number of training examples (X axis).

N-Fold Learning Curves

• Want learning curves averaged over multiple trials.

• Use N-fold cross validation to generate N full training and test sets.

• For each trial, train on increasing fractions of the training set, measuring accuracy on the test data for each point on the desired learning curve.

Sample Learning Curve (Yahoo Science Data)

Text Clustering

Clustering

• Partition unlabeled examples into disjoint subsets, or clusters, such that:
  – Examples within a cluster are very similar
  – Examples in different clusters are very different
• Discover new categories in an unsupervised manner (no sample category labels provided).

Clustering Example

(scatter-plot figure of unlabeled points grouped into clusters omitted)

Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.

• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

  animal
    vertebrate: fish, reptile, amphib., mammal
    invertebrate: worm, insect, crustacean

Agglomerative vs. Divisive Clustering

• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Direct Clustering Method

• Direct clustering methods require a specification of the number of clusters, k, desired.

• A clustering evaluation function assigns a real-value quality measure to a clustering.

• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining the similarity of two instances.

• Starts with each instance in a separate cluster and then repeatedly joins the two clusters that are most similar until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

HAC Algorithm

Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj
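
A simple single-link HAC sketch in Python, kept O(n³) for clarity rather than using the O(n²) bookkeeping discussed on the later complexity slide; the similarity function sim(x, y) is assumed to be given, and all names are illustrative.

def hac(instances, sim):
    clusters = [[x] for x in instances]          # each instance starts alone
    merges = []                                  # record of the merge history
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: similarity of the two most similar members
                s = max(sim(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges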

Cluster Similarity

• Assume a similarity function that determines the similarity of two instances: sim(x, y).
  – Cosine similarity of document vectors.
• How to compute the similarity of two clusters, each possibly containing multiple instances?
  – Single link: similarity of the two most similar members.
  – Complete link: similarity of the two least similar members.
  – Group average: average similarity between members.

Single Link Agglomerative Clustering

• Use the maximum similarity of pairs:

  sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)

• Can result in “straggly” (long and thin) clusters due to a chaining effect.
  – Appropriate in some domains, such as clustering islands.

Single Link Example

Complete Link Agglomerative Clustering

• Use the minimum similarity of pairs:

  sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)

• Makes more “tight,” spherical clusters that are typically preferable.

Complete Link Example

Computational Complexity

• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
• In each of the subsequent n−2 merging iterations, it must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing similarity to each other cluster must be done in constant time.

Computing Cluster Similarity

• After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
  – Single link:

    sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))

  – Complete link:

    sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

Group Average Agglomerative Clustering

• Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:

  sim(ci, cj) = (1 / (|ci ∪ cj| (|ci ∪ cj| − 1))) Σ_{x ∈ ci ∪ cj} Σ_{y ∈ ci ∪ cj, y ≠ x} sim(x, y)

• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged cluster instead of unordered pairs between the two clusters.

Computing Group Average Similarity

• Assume cosine similarity and vectors normalized to unit length.
• Always maintain the sum of the vectors in each cluster:

  s(cj) = Σ_{x ∈ cj} x

• Compute the similarity of clusters in constant time:

  sim(ci, cj) = ( (s(ci) + s(cj)) · (s(ci) + s(cj)) − (|ci| + |cj|) ) / ( (|ci| + |cj|) (|ci| + |cj| − 1) )
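
A short sketch of the constant-time computation above, assuming each cluster caches the sum of its unit-length member vectors (as numpy arrays) and its size; the function and parameter names are mine.

import numpy as np

def group_avg_sim(sum_i, n_i, sum_j, n_j):
    s = sum_i + sum_j                     # cached sum vector of the merged cluster
    n = n_i + n_j
    return (s @ s - n) / (n * (n - 1))    # average over ordered pairs x != y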

Non-Hierarchical Clustering

• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.

K-Means

• Assumes instances are real-valued vectors.

• Clusters are based on centroids (the center of gravity, or mean, of the points in a cluster c):

  μ(c) = (1 / |c|) Σ_{x ∈ c} x

• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics

• Euclidean distance (L2 norm):

  L2(x, y) = sqrt( Σ_{i=1..m} (xi − yi)² )

• L1 norm:

  L1(x, y) = Σ_{i=1..m} |xi − yi|

• Cosine similarity (transform to a distance by subtracting from 1):

  1 − (x · y) / (|x| |y|)

K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj
    sj = μ(cj)
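
A minimal K-means sketch in Python following the algorithm above, reusing l2_distance from the distance-metric sketch; points are dense vectors (lists of floats), and the structure is illustrative only.

import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    seeds = rng.sample(points, k)                       # k random instances as seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                # assign to nearest seed
            j = min(range(k), key=lambda i: l2_distance(p, seeds[i]))
            clusters[j].append(p)
        new_seeds = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else seeds[i]   # centroid
            for i, c in enumerate(clusters)
        ]
        if new_seeds == seeds:                          # converged
            break
        seeds = new_seeds
    return clusters, seeds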

K-Means Example (K=2)

(figure: Pick seeds → Reassign clusters → Compute centroids → Reassign clusters → Compute centroids → Reassign clusters → Converged!)

Time Complexity

• Assume computing distance between two instances is O(m) where m is the dimensionality of the vectors.

• Reassigning clusters: O(kn) distance computations, or O(knm).

• Computing centroids: Each instance vector gets added once to some centroid: O(nm).

• Assume these two steps are each done once for I iterations: O(Iknm).

• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.

Seed Choice

• Results can vary based on random seed selection.

• Some seeds can result in poor convergence rate, or convergence to sub-optimal clusterings.

• Select good seeds using a heuristic or the results of another method.

Text Clustering

• HAC and K-Means have been applied to text in a straightforward way.

• Typically use normalized, TF/IDF-weighted vectors and cosine similarity.

• Optimize computations for sparse vectors.

Text Clustering

• Applications:
  – During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
  – Clustering the results of retrieval to present more organized results to the user.
  – Automated production of hierarchical taxonomies of documents for browsing purposes (à la Yahoo).