Term Filtering with Bounded Error
description
Transcript of Term Filtering with Bounded Error
1
Zi Yang, Wei Li, Jie Tang, and Juanzi Li
Knowledge Engineering GroupDepartment of Computer Science and Technology
Tsinghua University, China{yangzi, tangjie, ljz}@keg.cs.tsinghua.edu.cn
Presented in ICDM’2010, Sydney, Australia
Term Filtering with Bounded Error
2
Outline• Introduction• Problem Definition• Lossless Term Filtering• Lossy Term Filtering• Experiments• Conclusion
3
Introduction• Term-document correlation matrix
– : a co-occurrence matrix,
– Applied in various text mining tasks• text classification [Dasgupta et al., 2007]• text clustering [Blei et al., 2003]• information retrieval [Wei and Croft, 2006].
– Disadvantage• high computation-cost and intensive memory access.
the number of times that term occurs in
document
4
Introduction• Term-document correlation matrix
– : a co-occurrence matrix,
– Applied in various text mining tasks• text classification [Dasgupta et al., 2007]• text clustering [Blei et al., 2003]• information retrieval [Wei and Croft, 2006].
– Disadvantage• high computation-cost and intensive memory access.
the number of times that term occurs in
document How can we overcome the disadvantage?
5
Introduction (cont'd)• Three general methods:
– To improve the algorithm itself– High performance platforms
• multicore processors and distributed machines [Chu et al., 2006]
– Feature selection approaches• [Yi, 2003]
We follow this line
6
Introduction (cont'd)• Basic assumption
– Each term captures more or less information.• Our goal
– A subspace of features (terms) – Minimal information loss.
• Conventional solution:– Step 1: Measure each individual term or the
dependency to a group of terms.• Information gain or mutual information.
– Step 2: Remove features with low scores in importance or high scores in redundancy.
7
Introduction (cont'd)• Basic assumption
– Each term captures more or less information.• Our goal
– A subspace of features (terms) – Minimal information loss.
• Conventional solution:– Step 1: Measure each individual term or the
dependency to a group of terms.• Information gain or mutual information.
– Step 2: Remove features with low scores in importance or high scores in redundancy.
Limitations- A generic definition of information
loss for a single term and a set of terms in terms of any metric?
- The information loss of each individual document, and a set of documents?
Why should we consider the information loss on documents?
8
Introduction (cont'd)• Consider a simple example:
• Question: any loss of information if A and B are substituted by a single term S?– Consider only the information loss for terms
• YES! Consider that is defined by some state-of-the-art pseudometrics, cosine similarity.
– Consider both• NO! Because both documents emphasize A more than B.
It’s information loss of documents!
A A B A A BDoc 1 Doc 2(2,2)(1,1)
A
B
ww
9
Introduction (cont'd)• Consider a simple example:
• Question: any loss of information if A and B are substituted by a single term S?– Consider only the information loss for terms
• YES! Consider that is defined by some state-of-the-art pseudometrics, cosine similarity.
– Consider both• NO! Because both documents emphasize A more than B.
It’s information loss of documents!
A A B A A BDoc 1 Doc 2(2,2)(1,1)
A
B
wwWhat we have done?
- Information loss for both terms and documents- Term Filtering with Bounded Error problem- Develop efficient algorithms
10
Outline• Introduction• Problem Definition• Lossless Term Filtering• Lossy Term Filtering• Experiments• Conclusion
11
Problem Definition - Superterm• Want
– less information loss– smaller term space
• Superterm: .
Group terms into superterms!
the number of occurrences of
superterm in document
12
Problem Definition – Information loss• Information loss during term merging?
– user-specified distance measures • Euclidean metric and cosine similarity.
– superterm substitutes a set of terms • information loss of terms .• information loss of documents .
A transformed representation for document after mergingA A B A A BDoc 1 Doc 2
2 21 1
Doc 1 Doc 2
AB
error 0error 0
W
D
1.5 1.51.5 1.5
Doc 1 Doc 2
SS
13
Problem Definition – Information loss• Information loss during term merging?
– user-specified distance measures • Euclidean metric and cosine similarity.
– superterm substitutes a set of terms • information loss of terms .• information loss of documents .
A transformed representation for document after mergingA A B A A BDoc 1 Doc 2
2 21 1
Doc 1 Doc 2
AB
error 0error 0
W
D
1.5 1.51.5 1.5
Doc 1 Doc 2
SS
Can be chosen with different methods: winner-take-all, average-occurrence, etc.
14
Problem Definition – TFBE• Term Filtering with Bounded Error
(TFBE) – To minimize the size of superterms subject to
the following constraints:(1) ,(2) for all , ,(3) for all , .
Mapping function from terms to superterms
Bounded by user-specified errors
15
Outline• Introduction• Problem Definition• Lossless Term Filtering• Lossy Term Filtering• Experiments• Conclusion
16
Lossless Term Filtering• Special case
– NO information loss of any term and document
• Theorem: The exact optimal solution of the problem is yielded by grouping terms of the same vector representation.
• Algorithm– Step 1: find “local” superterms for each document
• Same occurrences within the document– Step 2: add, split, or remove superterms from global
superterm set
= 0 and = 0
17
Lossless Term Filtering• Special case
– NO information loss of any term and document
• Theorem: The exact optimal solution of the problem is yielded by grouping terms of the same vector representation.
• Algorithm– Step 1: find “local” superterms for each document
• Same occurrences within the document– Step 2: add, split, or remove superterms from global
superterm set
= 0 and = 0
This case is applicable?YES! On Baidu Baike dataset (containing 1,531,215 documents)The vocabulary size 1,522,576 -> 714,392
> 50%
18
Outline• Introduction• Problem Definition• Lossless Term Filtering• Lossy Term Filtering• Experiments• Conclusion
19
Lossy Term Filtering• General case
• A greedy algorithm– Step 1: Consider the constraint given by .
• Similar to Greedy Cover Algorithm– Step 2: Verify the validity of the candidates with
.• Computational Complexity
– – Parallelize the computation for distances
> 0 and > 0 NP-hard!
#iterations << |V|
20
Lossy Term Filtering• General case
• A greedy algorithm– Step 1: Consider the constraint given by .
• Similar to Greedy Cover Algorithm– Step 2: Verify the validity of the candidates with
.• Computational Complexity
– – Parallelize the computation for distances
> 0 and > 0 NP-hard!
#iterations << |V|
Further reduce the size of candidate superterms?
Locality-sensitive hashing (LSH)
21
Same hash value(with several randomized
hash projections)
Close to each other (in the original space)
Efficient Candidate Superterm Generation
• LSH
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
• Basic idea:
• Hash function for the Euclidean distance
[Datar et al., 2004].
22
Same hash value(with several randomized
hash projections)
Close to each other (in the original space)
Efficient Candidate Superterm Generation
• LSH
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
• Basic idea:
• Hash function for the Euclidean distance
[Datar et al., 2004].
a projection vector (drawn from the
Gaussian distribution)
the width of the buckets
(assigned empirically)
23
Same hash value(with several randomized
hash projections)
Close to each other (in the original space)
Efficient Candidate Superterm Generation
• LSH
(Figure modified based on http://cybertron.cg.tu-berlin.de/pdci08/imageflight/nn_search.html)
• Basic idea:
• Hash function for the Euclidean distance
[Datar et al., 2004].
a projection vector (drawn from the
Gaussian distribution)
the width of the buckets
(assigned empirically)
Now search -near neighbors in all the buckets!
24
Outline• Introduction• Problem Definition• Lossless Term Filtering• Lossy Term Filtering• Experiments• Conclusion
25
Experiments• Settings
– Datasets• Academic: ArnetMiner (10,768 papers and 8,212
terms)• 20-Newsgroups (18,774 postings and 61,188 terms)
– Baselines• Task-irrelevant feature selection: document
frequency criterion (DF), term strength (TS), sparse principal component analysis (SPCA)
• Supervised method (only on classification): Chi-statistic (CHIMAX)
26
Experiments (cont'd)• Filtering results on Euclidean metric
– Findings• Errors ↑, term radio ↓• Same bound for terms,
bound for documents ↑, term ratio ↓
• Same bound for documents, bound for terms ↑, term ratio ↓
27
Experiments (cont'd)• The information loss in terms of error
Avg. errors Max errors
28
Experiments (cont'd)• Efficiency Performance of Parallel Algorithm
and LSH*
* Optimal resulting from different choice of tuned
29
Experiments (cont'd)• Applications
– Clustering (Arnet)
– Classification (20NG)
– Document retrieval (Arnet)
30
Outline• Introduction• Problem Definition• Lossless Term Filtering• Lossy Term Filtering• Experiments• Conclusion
31
Conclusion• Formally define the problem and perform a
theoretical investigation• Develop efficient algorithms for the
problems• Validate the approach through an extensive
set of experiments with multiple real-world text mining applications
32
Thank you!
33
References• Blei, D. M., Ng, A. Y. and Jordan, M. I. (2003). Latent Dirichlet
Allocation. Journal of Machine Learning Research 3, 993-1022.• Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Ng, A. Y. and
Olukotun, K. (2006). Map-Reduce for Machine Learning on Multicore. In NIPS '06 pp. 281-288,.
• Dasgupta, A., Drineas, P., Harb, B., Josifovski, V. and Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD'07 pp. 230-239,.
• Datar, M., Immorlica, N., Indyk, P. and Mirrokni, V. S. (2004). Locality-sensitive hashing scheme based on p-stable distributions. In SCG'04 pp. 253-262,.
• Wei, X. and Croft, W. B. (2006). LDA-Based Document Models for Ad-hoc Retrieval. In SIGIR '06 pp. 178-185,.
• Yi, L. (2003). Web page cleaning for web mining through feature weighting. In IJCAI '03 pp. 43-50,.