ML

Large Scale Text Classification Using Semi-supervised Multinomial Naïve Bayes. Presenter: QingZhi Chen


陳慶治 (QingZhi Chen): Large Scale Text Classification Using Semi-supervised Multinomial Naïve Bayes

Transcript of ML

Page 1: ML

Large scale text classification

Using Semi-supervised Multinomial Naïve Bayes

Presenter: QingZhi Chen

Page 2: ML

OUTLINE
1. Introduction
2. Text document representation
3. Multinomial Naïve Bayes
4. Semi-supervised Learning for MNB
5. Experiments

Page 3: ML

INTRODUCTION

Multinomial Naïve Bayes (MNB) with Frequency Estimate (FE)
* Labeled data is difficult to collect, and FE cannot use the large amount of unlabeled data, which therefore goes to waste

Expectation Maximization (EM): maximizes the marginal log likelihood

Semi-supervised Frequency Estimate (SFE): achieves a better conditional log likelihood

Type?

Page 4: ML

Bag-of-words: ignore the ordering of words in d

Naïve Bayes assumption: each word is independent of every other word, given the class label

[Figure: a description of an apple (red rind, sweet, rounded) reduced to the unordered bag of words: rind, apple, red, rounded, sweet]
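The apple example can be sketched in a few lines of Python. This is only an illustration of the bag-of-words idea; the function name `bag_of_words` and the whitespace tokenization are assumptions, not part of the slides.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Map a document to its word counts, discarding word order."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    return Counter(text.lower().split())

a = bag_of_words("apple red rind sweet rounded")
b = bag_of_words("rounded sweet rind red apple")

# Both orderings produce the same representation.
print(a == b)
```

Because only the counts survive, any two documents that use the same words the same number of times are indistinguishable to the classifier.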

Page 5: ML

Text Document representation

d = {w_1, w_2, w_3, …, c}
* w_i corresponds to a word in document d, and its value is the frequency f_i of w_i in d
* c is the class label of d
* V is the set of unique words w_i over all documents d
* T is the training set
* d^t is the t-th document in T
* A hat (as in P̂) indicates a parameter estimate
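As a rough sketch of this representation, the structure below stores each document as word frequencies plus an optional class label, and derives the vocabulary V from a training set T. The names `Document`, `make_document`, and `vocabulary` are hypothetical helpers introduced for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class Document:
    freqs: Counter          # frequency f_i of each word w_i that occurs in d
    label: Optional[str]    # class label c, or None for an unlabeled document

def make_document(text: str, label: Optional[str] = None) -> Document:
    # Assumption: whitespace tokenization, lowercased.
    return Document(Counter(text.lower().split()), label)

def vocabulary(docs: List[Document]) -> Set[str]:
    """V: the set of unique words over all documents."""
    return set().union(*(d.freqs.keys() for d in docs))

# T: a toy training set with two labeled documents.
T = [
    make_document("red sweet apple", "fruit"),
    make_document("red fast car", "vehicle"),
]
V = vocabulary(T)
print(sorted(V))
```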

Page 6: ML

Multinomial Naïve Bayes

Objective function

Parameters

* P(c) is the prior probability of class c in the whole training set T, estimated from the number of documents in T with the label c
* f_i^t is the number of occurrences of word w_i in document d^t
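The objective formula itself did not survive in this transcript. The sketch below assumes the standard MNB decision rule, which scores each class by log P(c) + Σ_i f_i · log P(w_i|c) and picks the maximum. The function names and the toy parameter values are illustrative assumptions.

```python
import math

def mnb_log_scores(freqs, log_prior, log_cond):
    """Return log P(c) + sum_i f_i * log P(w_i|c) for every class c.

    freqs:     dict word -> frequency f_i in the document
    log_prior: dict class -> log P(c)
    log_cond:  dict class -> dict word -> log P(w_i|c)
    Words unknown to the model are skipped (an assumption of this sketch).
    """
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for w, f in freqs.items():
            if w in log_cond[c]:
                s += f * log_cond[c][w]
        scores[c] = s
    return scores

def mnb_predict(freqs, log_prior, log_cond):
    scores = mnb_log_scores(freqs, log_prior, log_cond)
    return max(scores, key=scores.get)

# Toy parameters, chosen by hand for illustration.
log_prior = {"fruit": math.log(0.5), "vehicle": math.log(0.5)}
log_cond = {
    "fruit":   {"red": math.log(0.3), "sweet": math.log(0.5), "fast": math.log(0.2)},
    "vehicle": {"red": math.log(0.3), "sweet": math.log(0.1), "fast": math.log(0.6)},
}
print(mnb_predict({"red": 1, "sweet": 2}, log_prior, log_cond))
```

Working in log space avoids numerical underflow when documents contain many words.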

Page 7: ML

FE parameter learning

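The count used by FE is the number of occurrences of w_i in the documents with the class label c.

FE objective function

The log likelihood (LL) decomposes into CLL + MLL: the conditional log likelihood Σ log P(c|d) plus the marginal log likelihood Σ log P(d).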
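A minimal sketch of frequency-estimate learning follows: P(w_i|c) is taken as the share of word occurrences in class c that belong to w_i. Laplace smoothing (`alpha`) is an assumption added here to avoid zero probabilities; the slides' exact formula is not recoverable from this transcript. The sketch also assumes every word in the documents is contained in `vocab`.

```python
from collections import Counter, defaultdict

def frequency_estimate(docs, vocab, alpha=1.0):
    """docs: list of (Counter of word frequencies, class label).

    Returns (prior, cond) where prior[c] = P(c) and cond[c][w] = P(w|c).
    """
    doc_count = Counter()               # number of documents per class
    word_count = defaultdict(Counter)   # occurrences of each word per class
    for freqs, c in docs:
        doc_count[c] += 1
        word_count[c].update(freqs)

    n_docs = sum(doc_count.values())
    prior = {c: n / n_docs for c, n in doc_count.items()}

    cond = {}
    for c in doc_count:
        total = sum(word_count[c].values())
        denom = total + alpha * len(vocab)
        cond[c] = {w: (word_count[c][w] + alpha) / denom for w in vocab}
    return prior, cond

docs = [
    (Counter({"red": 1, "sweet": 2}), "fruit"),
    (Counter({"red": 1, "fast": 1}), "vehicle"),
]
prior, cond = frequency_estimate(docs, {"red", "sweet", "fast"})
print(prior["fruit"], cond["fruit"]["sweet"])
```

The whole estimate is a single counting pass over the labeled documents, which is why FE is cheap to train.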

Page 8: ML

Semi-supervised Learning for MNB

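Basic assumption: in settings where the number of unlabeled documents far exceeds the number of labeled ones (# unlabeled >> # labeled), the unlabeled data can provide more information about the model.

Expectation Maximization: the classical semi-supervised method for MNB; it assigns each document to a class c.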

Page 9: ML

* … will be the same as in FE and SFE
* Update (3) and (2) using (6), (7) and (1) until the parameters are stable
* This implementation still uses the …, and counts the labeled documents with weight 1 rather than …
* Deficits of EM: inferior CLL and an overly strong assumption
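The EM loop for MNB might look roughly like the sketch below. It is an illustration under several assumptions, not the presenter's implementation: Laplace smoothing is added, the prior is re-estimated in every M step as in classical EM, a fixed iteration count replaces the "until stable" check, and every word in the data is assumed to be in `vocab`. Labeled documents are counted with weight 1, as the slide describes.

```python
import math
from collections import Counter

def posterior(freqs, prior, cond):
    """E step for one document: P(c|d) proportional to P(c) * prod_i P(w_i|c)^f_i."""
    logs = {}
    for c in prior:
        logs[c] = math.log(prior[c]) + sum(
            f * math.log(cond[c][w]) for w, f in freqs.items())
    m = max(logs.values())                      # subtract max for stability
    exp = {c: math.exp(v - m) for c, v in logs.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def m_step(weighted_docs, classes, vocab, alpha=1.0):
    """weighted_docs: list of (freqs, {class: weight}). Returns (prior, cond)."""
    doc_w = {c: 0.0 for c in classes}
    word_w = {c: {w: 0.0 for w in vocab} for c in classes}
    for freqs, weights in weighted_docs:
        for c, p in weights.items():
            doc_w[c] += p
            for w, f in freqs.items():
                word_w[c][w] += p * f
    n = sum(doc_w.values())
    prior = {c: (doc_w[c] + alpha) / (n + alpha * len(classes)) for c in classes}
    cond = {}
    for c in classes:
        denom = sum(word_w[c].values()) + alpha * len(vocab)
        cond[c] = {w: (word_w[c][w] + alpha) / denom for w in vocab}
    return prior, cond

def em_mnb(labeled, unlabeled, vocab, n_iter=10):
    """labeled: list of (freqs, label); unlabeled: list of freqs."""
    classes = sorted({c for _, c in labeled})
    hard = [(f, {c: 1.0}) for f, c in labeled]   # labeled docs keep weight 1
    prior, cond = m_step(hard, classes, vocab)   # initialise from labeled data
    for _ in range(n_iter):
        soft = [(f, posterior(f, prior, cond)) for f in unlabeled]   # E step
        prior, cond = m_step(hard + soft, classes, vocab)            # M step
    return prior, cond

labeled = [(Counter({"sweet": 2, "red": 1}), "fruit"),
           (Counter({"fast": 2, "red": 1}), "vehicle")]
unlabeled = [Counter({"sweet": 1, "juicy": 2}), Counter({"fast": 1, "loud": 2})]
vocab = {"red", "sweet", "fast", "juicy", "loud"}
prior, cond = em_mnb(labeled, unlabeled, vocab)
print(cond["fruit"]["juicy"] > cond["vehicle"]["juicy"])
```

Each iteration touches every unlabeled document once per class, so the cost grows with both the number of iterations and the size of the unlabeled set.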

Semi-supervised Frequency Estimate

Soft-classifies each word to class c
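The SFE formula is missing from this transcript, so the following is only one plausible reading of "soft-classify each word": estimate P(c|w) from the labeled data, take word counts from the unlabeled data, and combine them so that P(w|c) is proportional to P(c|w) times the unlabeled count of w. The smoothing constants and the function name `sfe_word_probs` are assumptions.

```python
from collections import Counter

def sfe_word_probs(labeled, unlabeled, vocab, alpha=1.0):
    """labeled: list of (freqs, label); unlabeled: list of freqs.

    Returns cond with cond[c][w] approximating P(w|c).
    """
    classes = sorted({c for _, c in labeled})

    # Labeled word counts per class, used to soft-classify each word.
    n_wc = {c: Counter() for c in classes}
    for freqs, c in labeled:
        n_wc[c].update(freqs)

    # Unlabeled word counts, used for how often each word occurs overall.
    n_w_u = Counter()
    for freqs in unlabeled:
        n_w_u.update(freqs)

    cond = {}
    for c in classes:
        soft = {}
        for w in vocab:
            n_w = sum(n_wc[k][w] for k in classes)
            # Smoothed P(c|w) from labeled data (smoothing is an assumption).
            p_c_given_w = (n_wc[c][w] + alpha) / (n_w + alpha * len(classes))
            # Expected number of occurrences of w attributed to class c.
            soft[w] = n_w_u[w] * p_c_given_w
        denom = sum(soft.values()) + alpha * len(vocab)
        cond[c] = {w: (soft[w] + alpha) / denom for w in vocab}
    return cond

labeled = [(Counter({"sweet": 3}), "fruit"), (Counter({"fast": 3}), "vehicle")]
unlabeled = [Counter({"sweet": 5, "fast": 1}), Counter({"fast": 5, "sweet": 1})]
cond = sfe_word_probs(labeled, unlabeled, {"sweet", "fast"})
print(cond["fruit"]["sweet"] > cond["fruit"]["fast"])
```

Under this reading there is no iteration: one pass over the labeled data and one over the unlabeled data, in contrast to the repeated E and M steps of EM.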

Page 10: ML
Page 11: ML

EXPERIMENT

Source data sets
Performance indices: AUC and accuracy
Influence on conditional log likelihood
Impact of the size of the unlabeled data
Computational cost

*AUC refers to the area under the ROC curve
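For reference, both performance indices can be computed in a few lines for the binary case. This sketch uses the rank interpretation of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function names are illustrative, and the pairwise loop is quadratic, so it is meant for small examples only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Binary AUC: y_true holds 1 (positive) or 0 (negative)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))   # perfect ranking
print(accuracy([1, 0, 1], [1, 1, 1]))
```

Unlike accuracy, AUC depends only on how the classifier ranks documents, not on the decision threshold.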

Page 12: ML

Based on …’s MNB classifier

Page 13: ML
Page 14: ML
Page 15: ML
Page 16: ML