Smart RSS Aggregator A text classification problem Alban Scholer & Markus Kirsten 2005.

Smart RSS Aggregator

A text classification problem

Alban Scholer & Markus Kirsten, 2005

Introduction

● Smart RSS aggregator

● Predicts how interesting a user finds an unread article

● Presents news articles depending on the prediction

Issues

● Extremely high dimensional data

● Lots of unlabeled data

● Few training examples

● Only clickthrough information

● Multiuser environment

Support Vector Machine

● Support Vector Machine

● Max-margin for generalization

● Linear but easily extended to non-linear classification
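The step from a linear to a non-linear classifier can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the sparse dict representation, and the choice of an RBF kernel are all assumptions made for the example.

```python
import math

def linear_decision(w, b, x):
    """Linear SVM decision value f(x) = w . x + b.

    w and x are sparse vectors stored as {feature: value} dicts.
    """
    return sum(w.get(k, 0.0) * v for k, v in x.items()) + b

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2) on sparse dicts."""
    keys = set(x) | set(z)
    sq_dist = sum((x.get(k, 0.0) - z.get(k, 0.0)) ** 2 for k in keys)
    return math.exp(-gamma * sq_dist)

def kernel_decision(support, b, x, kernel):
    """Kernelized decision value f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    support is a list of (alpha_i, y_i, x_i) triples for the support vectors.
    """
    return sum(alpha * y * kernel(sv, x) for alpha, y, sv in support) + b
```

In both cases the predicted class is the sign of the decision value; only the way the value is computed changes.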

Max-margin separator

[Figure: data points and the max-margin separator]

SVM

● Finding the optimal weight vector w reduces to a quadratic program (QP).
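For reference, a standard textbook form of this QP is the soft-margin primal below. It is given as an assumption about what the slide showed, with n labeled examples (x_i, y_i), y_i in {-1, +1}, slack variables ξ_i, and a user-specified trade-off parameter C.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad & y_i\,(w\cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n
\end{aligned}
```

Forcing every ξ_i to zero gives the hard-margin case, which only has a solution when the data are linearly separable.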

Transductive SVM (TSVM)

● Semi-supervised learning VS supervised learning.

● TSVM is well suited for problems where:
– There are few labeled data available
– There are lots of unlabeled data.

● Information lying in the unlabeled data is captured and modifies the decision surface.

TSVM VS SVM

TSVM optimization problem

● New set of optimization variables: yi*, the labels of the unlabeled examples

● New set of slack variables

● New user-specified parameter: C*

● Very difficult optimization problem:
– Intractable when the number of unlabeled data is greater than 10
– Approximate solution proposed by Joachims.
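The transductive problem summarized above can be written as follows. The notation is a sketch based on the formulation in Joachims (1999), with n labeled examples (x_i, y_i) and k unlabeled examples x_j*; it is not copied from the slide.

```latex
\begin{aligned}
\min_{y_1^{*},\dots,y_k^{*},\,w,\,b,\,\xi,\,\xi^{*}}\quad
  & \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i + C^{*}\sum_{j=1}^{k}\xi_j^{*} \\
\text{subject to}\quad
  & y_i\,(w\cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n \\
  & y_j^{*}\,(w\cdot x_j^{*} + b) \ge 1 - \xi_j^{*}, \quad \xi_j^{*} \ge 0, \quad j = 1,\dots,k \\
  & y_j^{*} \in \{-1,+1\}, \quad j = 1,\dots,k
\end{aligned}
```

Because the labels y_j* are discrete, the search space grows as 2^k with the number of unlabeled examples, which is why an exact solution quickly becomes intractable.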

Text Classification

● Joachims T., “Transductive Inference for Text Classification using SVM”

● Characteristics of the Text Classification problem

● Why are SVM and TSVM well suited for this kind of problem?

● Feature selection for text classification using SVM

Characteristics of the Text Classification problem

● High dimensional input space
– One dimension for each word in the vocabulary (10 000 words)

● Sparse input vector
– In one text, a tiny proportion of the full vocabulary is used

Why (T)SVM?

● SVM has been shown to perform well in these conditions and can outperform other classifiers.

● Transductive SVM, exploiting information in test data, can outperform SVM when few training samples but lots of test data are available.

Feature selection for Text Classification using SVM

● Feature selection is the main problem in many machine learning applications.

● Poor feature selection leads to poor accuracy.

Feature selection (cont)

● For the text classification problem:

– The number of dimensions of the document vector is the number of words in the vocabulary. (Huge number of dimensions!)

– Each component of the document vector is the number of times the corresponding word occurs in the document.
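The word-count representation described here can be sketched as follows. The tokenizer (lowercase, alphabetic runs only) and the function names are assumptions made for the illustration.

```python
import re
from collections import Counter

def term_counts(text):
    """Count how often each word occurs in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

def to_vector(counts, vocabulary):
    """Dense document vector: one component per vocabulary word, in order."""
    return [counts.get(word, 0) for word in vocabulary]
```

With a realistic vocabulary, almost every component of the dense vector is zero, so in practice only the non-zero counts (the `Counter` itself) would be stored.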

Feature selection (cont)

● Refinement of the feature selection:

– Joachims scales the components of this document vector by the Inverse Document Frequency (IDF) of the corresponding words.

– The IDF can be computed using the Document Frequency DF(w), the number of documents in which the word w occurs:

● IDF(w) = log(n/DF(w))
● where n is the total number of documents
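A minimal sketch of the IDF formula follows. It assumes documents are already tokenized into lists of words and that the IDF is used as a multiplicative weight on the word counts (the usual TF-IDF reading). The function names are hypothetical.

```python
import math

def document_frequency(docs):
    """DF(w): number of documents in which word w occurs at least once."""
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    return df

def idf(docs):
    """IDF(w) = log(n / DF(w)), where n is the total number of documents."""
    n = len(docs)
    return {w: math.log(n / d) for w, d in document_frequency(docs).items()}

def tfidf(doc, idf_table):
    """Weight each word count in one document by the word's IDF."""
    counts = {}
    for word in doc:
        counts[word] = counts.get(word, 0) + 1
    # Words never seen when building idf_table are skipped (an assumption).
    return {w: c * idf_table[w] for w, c in counts.items() if w in idf_table}
```

A word that occurs in every document gets IDF log(1) = 0, so it contributes nothing to the weighted vector.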

Feature selection (cont)

● Other refinements:
– Stopword elimination
– Word stemming

Feature selection (cont)

● Ex : “the text classification task is characterized by a special set of characteristics. The text classification problem....”

● Transformation of the above text into a feature vector

Feature selection (example)

text            2
classification  2
task            1
charact         2

● The document vector is very sparse

● The words characteristics and characterized have the same stem, charact
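The transformation in this example can be sketched as below. The stopword list and the suffix table are toy assumptions, chosen only so that the example sentence gives the counts shown above. A real system would use a full stopword list and an established stemming algorithm.

```python
import re
from collections import Counter

# Hypothetical stopword list, just large enough for the example sentence.
STOPWORDS = {"the", "is", "by", "a", "of"}

# Toy suffix table (an assumption): maps both "characterized" and
# "characteristics" to the stem "charact", as in the example.
SUFFIXES = ("eristics", "erized")

def stem(word):
    """Strip the first matching suffix; leave other words unchanged."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def feature_vector(text):
    """Lowercase, tokenize, drop stopwords, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOPWORDS)

example = ("the text classification task is characterized by a special "
           "set of characteristics. The text classification problem")
vector = feature_vector(example)
```

Besides the four entries listed in the example, `vector` also holds a count of 1 for each of the remaining non-stopwords (special, set, problem).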

Smart stuff

● WordNet

● Combinations of words

● Putting users into clusters

● Using additional features (links, dates, author, source etc.)

● Active learning

Conclusion

● TSVM is well suited for text classification problems

● Feature selection is crucial

● To boost accuracy to a reasonable level, we have to combine techniques.

References

● Simon Haykin, Neural Networks, Second Edition, Pearson Education, 1999, chapter 6

● Thorsten Joachims, Transductive Inference for Text Classification using SVM, Proceedings of ICML-99, 16th International Conference on Machine Learning, 1999

References (cont)

● Tom M. Mitchell, Machine Learning, McGraw-Hill International Editions, 1997, chapter 6

● K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, Kluwer Academic Publishers, Boston, 1999