ACTIVE LEARNING FOR TEXT CLASSIFICATION
Ankit Bhutani Y9094
AUTOMATIC TEXT CLASSIFICATION: A FEW HOURS ONLY
MANUAL TEXT CLASSIFICATION: TAKES YEARS
ORGANIZING LARGE VOLUMES OF TEXT
◦Massive volume of online text available.
◦Organisation into categories enables efficient search.
◦Used in many applications, such as Data Mining, Automatic Query Answering, Learning User Interest, Making Suggestions, etc.
◦Learning approaches: unsupervised, supervised and semi-supervised.
Terms Used
Multinomial Naïve Bayes:
◦Documents in bag-of-words format
◦Independence assumptions
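To make the two points above concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier on bag-of-words documents. It is illustrative only: the function names `train_mnb` and `predict_mnb`, the Laplace smoothing constant `alpha`, and the toy documents are assumptions, not part of the original slides.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on tokenised documents.

    docs: list of token lists (bag of words); labels: list of class names.
    alpha is an assumed Laplace smoothing constant.
    """
    vocab = sorted({w for doc in docs for w in doc})
    log_prior, log_like = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict_mnb(model, doc):
    """Return the most probable class for a tokenised document."""
    log_prior, log_like = model
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-probabilities simply add up.
        # Words never seen in training are skipped.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in doc if w in log_like[c]
        )
    return max(scores, key=scores.get)

# Toy data, invented for illustration.
docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"],
        ["cheap", "offer"], ["project", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_mnb(docs, labels)
```

Word order is ignored entirely; only the per-class word counts enter the model, which is what the bag-of-words format means in practice.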
Terms Used
Semi-Supervised Learning:
◦Makes use of labeled as well as unlabeled data to learn the parameters of the model.
Expectation Maximization:
◦Class of iterative algorithms for maximum likelihood estimation in problems with incomplete data
◦Parameters of the model → Document labels: provide soft labels to documents based on the estimated model parameters
◦Document labels → Parameters of the model: re-estimate the model parameters based on the soft labels
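The loop between soft labels and model parameters can be sketched as follows, using multinomial Naïve Bayes as the underlying model. This is a hedged illustration, not the exact procedure used in the cited work: the names `e_step`, `m_step` and `em_mnb`, the add-one smoothing of priors and word counts, the fixed iteration count, and the toy data are all assumptions.

```python
import math

def m_step(docs, resp, vocab, n_classes, alpha=1.0):
    """Re-estimate class priors and word probabilities from (soft) labels."""
    prior = [0.0] * n_classes
    word = [{w: alpha for w in vocab} for _ in range(n_classes)]
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            prior[c] += r[c]
            for w in doc:
                word[c][w] += r[c]          # fractional counts
    prior = [(p + 1.0) / (len(docs) + n_classes) for p in prior]
    for c in range(n_classes):
        total = sum(word[c].values())
        word[c] = {w: v / total for w, v in word[c].items()}
    return prior, word

def e_step(doc, prior, word):
    """Soft label P(class | doc) under the current parameters."""
    logs = [math.log(prior[c]) + sum(math.log(word[c][w]) for w in doc)
            for c in range(len(prior))]
    top = max(logs)                          # subtract max for stability
    exps = [math.exp(v - top) for v in logs]
    z = sum(exps)
    return [e / z for e in exps]

def em_mnb(labeled, labels, unlabeled, n_classes, iters=10):
    vocab = {w for d in labeled + unlabeled for w in d}
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    # Initialise the parameters from the labeled documents only.
    prior, word = m_step(labeled, hard, vocab, n_classes)
    for _ in range(iters):
        soft = [e_step(d, prior, word) for d in unlabeled]         # E-step
        prior, word = m_step(labeled + unlabeled, hard + soft,
                             vocab, n_classes)                     # M-step
    return prior, word

# Toy data, invented for illustration: classes are 0 and 1.
labeled = [["cheap", "pills"], ["meeting", "agenda"]]
unlabeled = [["cheap", "offer"], ["offer", "pills"],
             ["agenda", "project"], ["project", "meeting"]]
prior, word = em_mnb(labeled, [0, 1], unlabeled, n_classes=2)
```

In this toy run the words "offer" and "project" never appear in a labeled document, yet they end up associated with a class because they co-occur with labeled words in the unlabeled documents.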
Terms Used
Active Learning:
◦Form of supervised machine learning
◦Learning algorithm is able to interactively query the user
◦Each query has an associated cost.
◦Algorithm requests the label for the document that maximizes the gain in information about the model parameters
But how to choose which document to request a label for?
Terms Used
Query by Committee:
◦Divide the training set into 4 to 5 sets.
◦Each set acts as a committee member and gives probability estimates.
◦Request the label for the document with maximum disagreement, measured by the maximum average KL divergence between all pairs of members
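The selection rule can be sketched as below, assuming each committee member has already produced a class-probability estimate for every document in the unlabeled pool. The function names, the `eps` guard against zero probabilities, and the example numbers are hypothetical; the averaging over ordered pairs is one reading of "all pairs" and not necessarily the exact formula used in the cited work.

```python
import math
from itertools import permutations

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def disagreement(member_probs):
    """Average KL divergence over all ordered pairs of committee members."""
    pairs = list(permutations(member_probs, 2))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

def select_query(pool_probs):
    """Index of the pool document on which the committee disagrees most.

    pool_probs[i] holds one class distribution per member for document i.
    """
    scores = [disagreement(members) for members in pool_probs]
    return max(range(len(scores)), key=scores.__getitem__)

# Three documents, three committee members, two classes (made-up numbers).
pool = [
    [[0.90, 0.10], [0.88, 0.12], [0.91, 0.09]],  # members agree
    [[0.90, 0.10], [0.20, 0.80], [0.50, 0.50]],  # members disagree
    [[0.50, 0.50], [0.50, 0.50], [0.50, 0.50]],  # uncertain but unanimous
]
chosen = select_query(pool)
```

Note that the third document is maximally uncertain for every member but scores zero disagreement, so a committee-based criterion does not select it; this is the main difference from plain uncertainty sampling.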
Terms Used
Semi-Supervised Frequency Estimate (SFE):
◦Slight variation on basic EM: different re-estimation formula for the parameters.
NOTABLE WORK: Semi-Supervised Learning
Nigam et al., 1998-99:
◦MNB + EM
◦100 labeled + 2500 unlabeled documents
◦80 to 85% accuracy
Nigam & McCallum, 2000:
◦MNB + EM + Active Learning
◦Total 1000 documents
◦Label requests: 50, accuracy: ~90%
NOTABLE WORK: Semi-Supervised Learning
LYRL, 2004:
◦Compared various semi-supervised learning techniques
◦Introduced the Reuters Corpus as a new benchmark
Su, Shirabad and Matwin, 2011:
◦MNB + SFE
My work
MNB + SFE + Active Learning
◦Data set: Reuters Corpus from LYRL 2004, containing around 8 lakh (800,000) documents
◦Experiments on 10,000 documents, starting with either 50 labeled documents + 100 requests, or 100 labeled documents + 50 requests
Results so far