ACTIVE LEARNING FOR TEXT CLASSIFICATION

ACTIVE LEARNING FOR TEXT CLASSIFICATION
Ankit Bhutani Y9094


Transcript of ACTIVE LEARNING FOR TEXT CLASSIFICATION

Page 1: ACTIVE LEARNING FOR TEXT CLASSIFICATION

ACTIVE LEARNING FOR TEXT CLASSIFICATION
Ankit Bhutani Y9094

Page 2: ACTIVE LEARNING FOR TEXT CLASSIFICATION

AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY

MANUAL TEXT CLASSIFICATION: TAKES YEARS

Page 3: ACTIVE LEARNING FOR TEXT CLASSIFICATION

ORGANIZING LARGE VOLUMES OF TEXT

Massive volume of online text available.

Organisation into categories to enable efficient search.

Finds use in many applications such as Data Mining, Automatic Query Answering, Learning User Interests, Making Suggestions, etc.

Learning approaches: unsupervised, supervised and semi-supervised.

Page 4: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Multinomial Naïve Bayes:
◦ Documents in bag-of-words format
◦ Independence assumptions
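The two points above can be made concrete with a minimal sketch of a Multinomial Naïve Bayes classifier over word-count vectors. This is an illustration under assumptions, not the presenter's implementation: the function names `train_mnb` and `predict_proba`, the Laplace smoothing constant `alpha`, and the use of NumPy are all choices made here.

```python
import numpy as np

def train_mnb(X, y, n_classes, alpha=1.0):
    """Fit MNB on bag-of-words counts.

    X: (n_docs, vocab_size) array of word counts.
    y: (n_docs,) integer class labels.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    Returns log class priors and log word probabilities per class.
    """
    n_docs, vocab_size = X.shape
    log_prior = np.zeros(n_classes)
    log_word = np.zeros((n_classes, vocab_size))
    for c in range(n_classes):
        Xc = X[y == c]
        log_prior[c] = np.log((Xc.shape[0] + alpha) / (n_docs + alpha * n_classes))
        counts = Xc.sum(axis=0) + alpha
        log_word[c] = np.log(counts / counts.sum())
    return log_prior, log_word

def predict_proba(X, log_prior, log_word):
    """Class posteriors; the independence assumption makes the
    document log-likelihood a sum of per-word log probabilities."""
    joint = X @ log_word.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)
```

Each document is reduced to a row of counts, so word order is discarded, and the matrix product sums the per-word contributions independently for every class.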

Page 5: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Semi-Supervised Learning:
◦ Makes use of labeled as well as unlabeled data to learn the parameters of the model.

Expectation Maximization:
◦ Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data.
◦ Alternates between two quantities: the parameters of the model and the document labels.
◦ Provide soft labels to documents based on the estimated model parameters.
◦ Re-estimate the model parameters based on the soft labels.
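The alternation between soft labels and parameter re-estimation can be sketched as follows for a Multinomial Naïve Bayes model. This is a minimal illustration under assumptions: the function names, the smoothing constant `alpha`, the fixed iteration count `n_iter`, and keeping the labeled documents' labels fixed as one-hot rows are choices made here, not details taken from the slides.

```python
import numpy as np

def m_step(X, R, alpha=1.0):
    """Re-estimate parameters from soft labels R[i, c] = P(class c | doc i)."""
    class_mass = R.sum(axis=0)
    log_prior = np.log((class_mass + alpha) / (class_mass.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha  # expected word counts per class
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def e_step(X, log_prior, log_word):
    """Assign soft labels to documents under the current parameters."""
    joint = X @ log_word.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)
    p = np.exp(joint)
    return p / p.sum(axis=1, keepdims=True)

def em_mnb(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Semi-supervised MNB: start from labeled data, then iterate E and M steps."""
    R_lab = np.eye(n_classes)[y_lab]  # labeled docs keep hard (one-hot) labels
    params = m_step(X_lab, R_lab)     # initial estimate from labeled data only
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R_unl = e_step(X_unl, *params)                     # soft labels
        params = m_step(X_all, np.vstack([R_lab, R_unl]))  # re-estimate
    return params
```

A practical version would stop when the likelihood stops improving rather than after a fixed number of iterations.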

Page 6: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Active Learning:
◦ Form of supervised machine learning
◦ Learning algorithm is able to interactively query the user
◦ Each query has an associated cost.
◦ Algorithm requests the label for the document that maximizes the gain in information about the model parameters

But how to choose which document to request a label for?
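The querying procedure described above can be sketched as a generic pool-based loop. Everything named here is hypothetical: `oracle` stands in for the user who supplies labels, `fit` for any training routine, and `score` for whatever measure of expected information gain is chosen. The slides do not prescribe these interfaces.

```python
import numpy as np

def active_learning_loop(X_pool, oracle, fit, score, seed_idx, n_requests):
    """Pool-based active learning (illustrative sketch).

    X_pool: (n_docs, n_features) array of all documents.
    oracle(i): returns the true label of document i (the costly query).
    fit(X, y): trains and returns a model on the labeled subset.
    score(model, X): returns one informativeness score per document.
    seed_idx: indices of the initially labeled documents.
    n_requests: label budget, i.e. how many queries may be issued.
    """
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(n_requests):
        idx = np.array(sorted(labeled))
        model = fit(X_pool[idx], np.array([labeled[i] for i in idx]))
        scores = np.array(score(model, X_pool), dtype=float)
        scores[idx] = -np.inf  # never re-query a labeled document
        if not np.isfinite(scores.max()):
            break  # pool exhausted
        query = int(np.argmax(scores))
        labeled[query] = oracle(query)
    return labeled
```

The open question on the slide is the choice of `score`; the loop itself stays the same whichever selection rule is plugged in.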

Page 7: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Query by Committee:
◦ Divide the training set into 4–5 sets.
◦ Each set, as a committee member, gives probability estimates.
◦ Maximum disagreement is measured by the maximum average KL divergence between all pairs of members.
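The disagreement measure above can be sketched as follows, assuming each committee member has already produced class-probability estimates for every pool document. The function names, the clipping constant `eps`, and the choice to average over ordered pairs (KL divergence is asymmetric) are assumptions of this sketch, not details stated in the slides.

```python
import numpy as np
from itertools import permutations

def split_committee(n_labeled, n_members, rng):
    """Randomly partition labeled-document indices into committee subsets."""
    return np.array_split(rng.permutation(n_labeled), n_members)

def avg_pairwise_kl(P, eps=1e-12):
    """Average KL divergence between all ordered pairs of members.

    P: (n_members, n_docs, n_classes) class-probability estimates.
    Returns one disagreement value per document.
    """
    P = np.clip(P, eps, 1.0)  # avoid log(0)
    logP = np.log(P)
    pairs = list(permutations(range(P.shape[0]), 2))
    total = np.zeros(P.shape[1])
    for a, b in pairs:
        total += np.sum(P[a] * (logP[a] - logP[b]), axis=1)
    return total / len(pairs)

def select_query(P):
    """Index of the document on which the committee disagrees most."""
    return int(np.argmax(avg_pairwise_kl(P)))
```

Each subset returned by `split_committee` would be used to train one member, and the member predictions stacked into `P` before calling `select_query`.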

Page 8: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Terms Used

Semi-Supervised Frequency Estimate (SFE):
◦ Slight variation on basic EM: a different parameter re-estimation formula.

Page 9: ACTIVE LEARNING FOR TEXT CLASSIFICATION

NOTABLE WORK: Semi-Supervised Learning

Nigam et al., 1998-99:
◦ MNB + EM
◦ 100 labeled + 2500 unlabeled documents
◦ 80–85% accuracy

Nigam & McCallum, 2000:
◦ MNB + EM + Active Learning
◦ Total 1000 documents
◦ Label requests: 50, accuracy: ~90%

Page 10: ACTIVE LEARNING FOR TEXT CLASSIFICATION

NOTABLE WORK: Semi-Supervised Learning

LYRL, 2004:
◦ Compared various semi-supervised learning techniques
◦ Introduced the Reuters Corpus as a new benchmark

Su, Shirabad and Matwin, 2011:
◦ MNB + SFE

Page 11: ACTIVE LEARNING FOR TEXT CLASSIFICATION

My work

MNB + SFE + Active Learning
◦ Data set: Reuters Corpus from LYRL 2004, containing around 8 lakh (800,000) documents
◦ Experiments on 10,000 documents, starting with either:
   50 labeled documents + 100 label requests
   100 labeled documents + 50 label requests

Page 12: ACTIVE LEARNING FOR TEXT CLASSIFICATION

Results so far