Term Filtering with Bounded Error

1

Zi Yang, Wei Li, Jie Tang, and Juanzi Li

Knowledge Engineering Group
Department of Computer Science and Technology
Tsinghua University, China
{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn

Presented at ICDM’2010, Sydney, Australia

2

Outline
• Introduction
• Problem Definition
• Lossless Term Filtering
• Lossy Term Filtering
• Experiments
• Conclusion

3

Introduction
• Term-document correlation matrix
  – A co-occurrence matrix: each entry is the number of times that a term occurs in a document
  – Applied in various text mining tasks
    • text classification [Dasgupta et al., 2007]
    • text clustering [Blei et al., 2003]
    • information retrieval [Wei and Croft, 2006]
  – Disadvantage
    • high computation cost and intensive memory access
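As a minimal sketch of the matrix described on this slide, the following builds term-by-document occurrence counts from whitespace-tokenized text. The function name `term_document_matrix` and the tokenization are assumptions made for illustration; the two toy documents are the ones used in the example later in the talk.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] is the number of times
    term vocab[i] occurs in docs[j]. Tokenization is plain whitespace
    splitting, which is an assumption of this sketch."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix

docs = ["A A B", "A A B"]
vocab, matrix = term_document_matrix(docs)
print(vocab)   # ['A', 'B']
print(matrix)  # [[2, 2], [1, 1]]
```

The matrix has one row per term and one column per document, so its size grows with both the vocabulary and the corpus, which is the cost the slide points to.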

How can we overcome the disadvantage?

5

Introduction (cont'd)
• Three general methods:
  – Improve the algorithm itself
  – Use high-performance platforms
    • multicore processors and distributed machines [Chu et al., 2006]
  – Feature selection approaches [Yi, 2003]
    • We follow this line

6

Introduction (cont'd)
• Basic assumption
  – Each term captures more or less information.
• Our goal
  – A subspace of features (terms)
  – Minimal information loss.
• Conventional solution:
  – Step 1: Measure each individual term or the dependency to a group of terms.
    • Information gain or mutual information.
  – Step 2: Remove features with low scores in importance or high scores in redundancy.

Limitations of the conventional solution
- Is there a generic definition of information loss, for a single term and for a set of terms, in terms of any metric?
- What is the information loss of each individual document, and of a set of documents?

Why should we consider the information loss on documents?

8

Introduction (cont'd)
• Consider a simple example:
    Doc 1: A A B
    Doc 2: A A B
    Term vectors over (Doc 1, Doc 2): A = (2, 2), B = (1, 1)
• Question: can A and B be substituted by a single term S without any loss of information?
  – Consider only the information loss for terms
    • YES! Consider that the loss is defined by some state-of-the-art pseudometric, e.g. cosine similarity: the vectors of A and B point in the same direction.
  – Consider both terms and documents
    • NO! Because both documents emphasize A more than B.

It’s information loss of documents!

What have we done?
- Define information loss for both terms and documents
- Formulate the Term Filtering with Bounded Error problem
- Develop efficient algorithms

11

Problem Definition - Superterm
• Want
  – less information loss
  – smaller term space
• Group terms into superterms!
• Superterm: described by the number of occurrences of the superterm in each document.

12

Problem Definition – Information loss
• Information loss during term merging?
  – User-specified distance measures
    • Euclidean metric and cosine similarity.
  – A superterm substitutes a set of terms
    • information loss of terms.
    • information loss of documents, measured against a transformed representation of each document after merging.
• Example: Doc 1 = "A A B", Doc 2 = "A A B"

  Before merging:
        Doc 1   Doc 2
    A   2       2
    B   1       1

  After merging A and B into superterm S:
        Doc 1   Doc 2
    S   1.5     1.5
    S   1.5     1.5

  The figure annotates the resulting errors for terms (W) and for documents (D).
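The slide's example can be checked numerically. This is a sketch under stated assumptions: the loss of a term is taken to be the distance between its row and the superterm's row, the loss of a document is taken to be the distance between its column before and after merging, and cosine similarity is turned into a distance as one minus the similarity. The helper names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

# Rows are terms, columns are documents (Doc 1, Doc 2), as on the slide.
terms = {"A": [2, 2], "B": [1, 1]}
superterm = [1.5, 1.5]  # average occurrence of A and B, as in the figure

# After merging, both A and B are represented by the superterm's row.
merged = {name: superterm for name in terms}

def columns(rows):
    """Turn a dict of term rows into a list of document columns."""
    return [list(col) for col in zip(*rows.values())]

for dist in (euclidean, cosine_distance):
    term_loss = {t: dist(terms[t], merged[t]) for t in terms}
    doc_loss = [dist(b, a) for b, a in zip(columns(terms), columns(merged))]
    print(dist.__name__, term_loss, doc_loss)
```

Under these assumptions the cosine distance of both terms to S is zero up to rounding, while each document moves from (2, 1) to (1.5, 1.5) and so has a cosine distance of about 0.05. With the Euclidean metric, both the terms and the documents have a loss of about 0.707.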

The occurrences of a superterm can be chosen with different methods: winner-take-all, average-occurrence, etc.

14

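Problem Definition – TFBE
• Term Filtering with Bounded Error (TFBE)
  – Minimize the number of superterms subject to the following constraints:
    (1) a mapping function assigns every term to a superterm,
    (2) for all terms, the information loss of the term is bounded by a user-specified error,
    (3) for all documents, the information loss of the document is bounded by a user-specified error.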
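One way to write the problem down compactly is sketched below. The slide's own notation did not survive extraction, so every symbol here is an assumption: $V$ is the set of terms, $D$ the set of documents, $S$ the set of superterms, $f$ the mapping function from terms to superterms, $\operatorname{dist}$ the user-specified distance measure, $\mathbf{d}'$ the transformed representation of document $\mathbf{d}$ after merging, and $\epsilon_V$, $\epsilon_D$ the user-specified error bounds for terms and documents.

```latex
\begin{aligned}
\min_{f}\quad & |S| \\
\text{s.t.}\quad & S = f(V), \\
& \operatorname{dist}\bigl(w, f(w)\bigr) \le \epsilon_V \quad \text{for all } w \in V, \\
& \operatorname{dist}\bigl(\mathbf{d}, \mathbf{d}'\bigr) \le \epsilon_D \quad \text{for all } \mathbf{d} \in D.
\end{aligned}
```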

16

Lossless Term Filtering
• Special case: both error bounds are 0
  – NO information loss of any term and document
• Theorem: The exact optimal solution of the problem is yielded by grouping terms of the same vector representation.
• Algorithm
  – Step 1: find “local” superterms for each document
    • Same occurrences within the document
  – Step 2: add, split, or remove superterms from the global superterm set
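The theorem suggests a direct way to obtain the lossless grouping: terms with identical occurrence vectors go into the same superterm. The sketch below does this in one pass with a dictionary keyed by the vector. It is not the two-step local/global procedure from the slide, and the function name and toy data are hypothetical.

```python
from collections import defaultdict

def lossless_superterms(vocab, matrix):
    """Group terms whose occurrence vectors are identical.

    vocab[i] is the term for row matrix[i]; each row holds that term's
    counts across all documents. Returns a list of term groups."""
    groups = defaultdict(list)
    for term, row in zip(vocab, matrix):
        groups[tuple(row)].append(term)
    return list(groups.values())

vocab = ["apple", "banana", "cherry", "date"]
matrix = [
    [1, 0, 2],  # apple
    [0, 3, 0],  # banana
    [1, 0, 2],  # cherry: same vector as apple
    [0, 3, 1],  # date: differs from banana in the last document
]
print(lossless_superterms(vocab, matrix))
# [['apple', 'cherry'], ['banana'], ['date']]
```

Because merged terms share exactly the same vector, replacing them with one superterm changes no term or document distance, which is why both error bounds can be zero.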

Is this case applicable?
YES! On the Baidu Baike dataset (containing 1,531,215 documents), the vocabulary size drops from 1,522,576 to 714,392, a reduction of more than 50%.

19

Lossy Term Filtering
• General case: both error bounds are greater than 0
  – NP-hard!
• A greedy algorithm
  – Step 1: Consider the constraint given by the error bound for terms.
    • Similar to the Greedy Cover Algorithm
  – Step 2: Verify the validity of the candidates with the error bound for documents.
• Computational complexity
  – #iterations << |V|
  – Parallelize the computation for distances
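The two steps can be illustrated with a simplified sketch. It is not the authors' algorithm: it assumes the Euclidean metric, represents each superterm by the vector of a chosen centre term, and only reports the largest document error instead of repairing invalid candidates. The names `greedy_superterms`, `max_document_error`, `eps_v`, and `eps_d` are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def greedy_superterms(matrix, eps_v):
    """Step 1, greedy cover: repeatedly pick the term whose eps_v-ball
    covers the most still-uncovered terms. Returns a dict mapping each
    term index to the index of the centre term that represents it."""
    n = len(matrix)
    balls = [{j for j in range(n) if euclidean(matrix[i], matrix[j]) <= eps_v}
             for i in range(n)]
    uncovered = set(range(n))
    mapping = {}
    while uncovered:
        # Every term lies in its own ball, so each round covers at least one.
        centre = max(range(n), key=lambda i: len(balls[i] & uncovered))
        for j in balls[centre] & uncovered:
            mapping[j] = centre
        uncovered -= balls[centre]
    return mapping

def max_document_error(matrix, mapping):
    """Step 2, check: largest distance between a document column before
    and after every term row is replaced by its centre term's row."""
    worst = 0.0
    for d in range(len(matrix[0])):
        before = [row[d] for row in matrix]
        after = [matrix[mapping[i]][d] for i in range(len(matrix))]
        worst = max(worst, euclidean(before, after))
    return worst

matrix = [[2, 2], [1, 1], [0, 5]]   # rows are terms, columns are documents
eps_v, eps_d = 1.5, 1.0
mapping = greedy_superterms(matrix, eps_v)
print(mapping)                                   # {0: 0, 1: 0, 2: 2}
print(len(set(mapping.values())))                # 2
print(max_document_error(matrix, mapping) <= eps_d)  # True
```

The pairwise distance computation dominates the cost here, and it is the part the slide proposes to parallelize.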

Further reduce the size of candidate superterms?
Locality-sensitive hashing (LSH)

21

Efficient Candidate Superterm Generation
• LSH
• Basic idea: terms with the same hash value (with several randomized hash projections) are likely to be close to each other (in the original space).
• Hash function for the Euclidean distance [Datar et al., 2004].

(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)

• Parameters of the hash function:
  – a projection vector (drawn from the Gaussian distribution)
  – the width of the buckets (assigned empirically)
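The hash function itself did not survive on the slide. As a sketch, the scheme of Datar et al. (2004) for the Euclidean distance projects a vector onto a Gaussian random direction, adds a random offset, and divides by the bucket width. The names `make_hash`, `bucket_key`, `width`, and the toy vectors are assumptions for illustration, and the bucket width of 4.0 is arbitrary.

```python
import math
import random

def make_hash(dim, width, rng):
    """One hash h(v) = floor((a . v + b) / width), with a drawn from a
    standard Gaussian and b drawn uniformly from [0, width]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        projection = sum(x * y for x, y in zip(a, v))
        return math.floor((projection + b) / width)

    return h

def bucket_key(v, hashes):
    """Concatenate several randomized hashes into one bucket key."""
    return tuple(h(v) for h in hashes)

rng = random.Random(0)
hashes = [make_hash(dim=2, width=4.0, rng=rng) for _ in range(3)]

# Rows of a toy term-document matrix (two documents).
terms = {"A": [2, 2], "B": [1, 1], "C": [40, -35]}
buckets = {}
for name, vec in terms.items():
    buckets.setdefault(bucket_key(vec, hashes), []).append(name)
print(buckets)
```

Only terms that land in the same bucket need to be compared when generating candidate superterms, which is how hashing reduces the number of distance computations. A wider bucket groups more terms together at the price of more comparisons inside each bucket.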

Now search for near neighbors, within the error bound for terms, in all the buckets!

25

Experiments
• Settings
  – Datasets
    • Academic: ArnetMiner (10,768 papers and 8,212 terms)
    • 20-Newsgroups (18,774 postings and 61,188 terms)
  – Baselines
    • Task-irrelevant feature selection: document frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
    • Supervised method (only on classification): Chi-statistic (CHIMAX)

26

Experiments (cont'd)
• Filtering results on Euclidean metric
  – Findings
    • Errors ↑, term ratio ↓
    • Same bound for terms, bound for documents ↑, term ratio ↓
    • Same bound for documents, bound for terms ↑, term ratio ↓

27

Experiments (cont'd)
• The information loss in terms of error
  (Figure panels: Avg. errors, Max errors)

28

Experiments (cont'd)
• Efficiency performance of the parallel algorithm and LSH*

* Optimal results from different choices of the tuned parameter

29

Experiments (cont'd)
• Applications
  – Clustering (Arnet)
  – Classification (20NG)
  – Document retrieval (Arnet)

31

Conclusion
• Formally define the problem and perform a theoretical investigation
• Develop efficient algorithms for the problems
• Validate the approach through an extensive set of experiments with multiple real-world text mining applications

32

Thank you!

33

References
• Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
• Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y. and Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06, pp. 281-288.
• Dasgupta, A., Drineas, P., Harb, B., Josifovski, V. and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD '07, pp. 230-239.
• Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG '04, pp. 253-262.
• Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06, pp. 178-185.
• Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03, pp. 43-50.
