Semi-Supervised Learning over Text


Page 1: Semi-Supervised Learning over Text

Semi-Supervised Learning over Text

Tom M. Mitchell
Machine Learning Department

Carnegie Mellon University

September 2006

Page 2: Semi-Supervised Learning over Text

Statistical learning methods require LOTS of training data

Can we use all that unlabelled text?

Page 3: Semi-Supervised Learning over Text

Outline

• Maximizing likelihood in probabilistic models– EM for text classification

• Co-Training and redundantly predictive features– Document classification– Named entity recognition– Theoretical analysis

• Sample of additional tasks– Word sense disambiguation– Learning HTML-based extractors– Large-scale bootstrapping: extracting from the web

Page 4: Semi-Supervised Learning over Text

Many text learning tasks

• Document classification – f: Doc → Class

– Spam filtering, relevance rating, web page classification, ...

– and unsupervised document clustering

• Information extraction – f: Sentence → Fact, f: Doc → Facts

• Parsing – f: Sentence → ParseTree

– Related: part-of-speech tagging, co-reference resolution, prepositional phrase attachment

• Translation – f: EnglishDoc → FrenchDoc

Page 5: Semi-Supervised Learning over Text

1. Semi-supervised Document classification (probabilistic model and EM)

Page 6: Semi-Supervised Learning over Text

Document Classification: Bag of Words Approach

[Word-count vector for an example document: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, ..., Zaire 0]
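A minimal sketch of this representation; the vocabulary and document below are illustrative, not taken from the slide's corpus:

from collections import Counter

def bag_of_words(doc, vocabulary):
    """Map a document to a vector of word counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())     # crude whitespace tokenization
    return [counts[w.lower()] for w in vocabulary]

vocab = ["aardvark", "about", "africa", "apple", "gas", "oil", "zaire"]
doc = "More about oil and gas exports from Africa as oil prices rise"
print(dict(zip(vocab, bag_of_words(doc, vocab))))
# {'aardvark': 0, 'about': 1, 'africa': 1, 'apple': 0, 'gas': 1, 'oil': 2, 'zaire': 0}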

Page 7: Semi-Supervised Learning over Text

For code and data, see www.cs.cmu.edu/~tom/mlbook.html and click on “Software and Data”.

Accuracy vs. # training examples

Page 8: Semi-Supervised Learning over Text

What if we have labels for only some documents?

[Naïve Bayes model: class variable Y with word features X1, X2, X3, X4]

Y  X1  X2  X3  X4
1   0   0   1   1
0   0   1   0   0
0   0   0   1   0
?   0   1   1   0
?   0   1   0   1

Learn P(Y|X)

EM: Repeat until convergence

1. Use probabilistic labels to train classifier h

2. Apply h to assign probabilistic labels to unlabeled data
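A minimal sketch of this loop, assuming dense word-count matrices and scikit-learn's MultinomialNB as the classifier h; representing probabilistic labels by per-class sample weights is an implementation choice for the sketch, not something specified on the slides:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_semisupervised_nb(X_lab, y_lab, X_unlab, n_iter=10):
    """EM for semi-supervised naive Bayes: train on labeled data, then
    repeatedly (1) retrain using probabilistic labels and (2) relabel
    the unlabeled documents with the current classifier."""
    classes = np.unique(y_lab)                      # predict_proba columns follow this order
    clf = MultinomialNB().fit(X_lab, y_lab)         # initialize from labeled data only
    for _ in range(n_iter):
        # assign probabilistic labels to the unlabeled documents
        probs = clf.predict_proba(X_unlab)          # shape (n_unlabeled, n_classes)
        # retrain: each unlabeled doc appears once per class, weighted by P(class | doc)
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(len(y_lab))] + [probs[:, k] for k in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf

In Nigam et al. (2000) the M step is a closed-form re-estimation (next slide); the weighted refit above computes essentially the same smoothed counts.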

Page 9: Semi-Supervised Learning over Text

From [Nigam et al., 2000]

Page 10: Semi-Supervised Learning over Text

E step: use the current classifier to assign probabilistic class labels to every document.

M step: re-estimate the classifier parameters from those probabilistically labeled documents (w_t is the t-th word in the vocabulary).
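For reference, the corresponding updates in the naïve Bayes model of Nigam et al. (2000) take the following form, where $N(w_t, d_i)$ is the count of word $w_t$ in document $d_i$, $V$ the vocabulary, $C$ the set of classes, and $D$ the set of documents.

E step (probabilistic label for each document $d_i$):

$$P(c_j \mid d_i; \hat{\theta}) = \frac{\hat{P}(c_j) \prod_{t} \hat{P}(w_t \mid c_j)^{N(w_t, d_i)}}{\sum_{r} \hat{P}(c_r) \prod_{t} \hat{P}(w_t \mid c_r)^{N(w_t, d_i)}}$$

M step (re-estimate parameters from the probabilistically labeled documents, with Laplace smoothing):

$$\hat{P}(w_t \mid c_j) = \frac{1 + \sum_{i} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i} N(w_s, d_i)\, P(c_j \mid d_i)} \qquad \hat{P}(c_j) = \frac{1 + \sum_{i} P(c_j \mid d_i)}{|C| + |D|}$$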

Page 11: Semi-Supervised Learning over Text

Using one labeled example per class

Words sorted by P(w | course) / P(w | ¬course)

Page 12: Semi-Supervised Learning over Text

20 Newsgroups

Page 13: Semi-Supervised Learning over Text

20 Newsgroups

Page 14: Semi-Supervised Learning over Text

Elaboration 1: Downweight the influence of unlabeled examples by a factor λ

New M step: the counts contributed by unlabeled documents are weighted by λ, with λ chosen by cross-validation.
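One way to write this, assuming the EM-λ formulation of Nigam et al. (2000) in which word counts from unlabeled documents are scaled by λ ∈ [0, 1]:

$$\hat{P}(w_t \mid c_j) = \frac{1 + \sum_{i \in L} N(w_t, d_i) P(c_j \mid d_i) + \lambda \sum_{i \in U} N(w_t, d_i) P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \left( \sum_{i \in L} N(w_s, d_i) P(c_j \mid d_i) + \lambda \sum_{i \in U} N(w_s, d_i) P(c_j \mid d_i) \right)}$$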

Page 15: Semi-Supervised Learning over Text

Why/When will this work?

• What’s best case? Worst case? How can we test which we have?

Page 16: Semi-Supervised Learning over Text

EM for Semi-Supervised Doc Classification

• If all data is labeled, corresponds to supervised training of Naïve Bayes classifier

• If all data unlabeled, corresponds to mixture-of-multinomial clustering

• If both labeled and unlabeled data, it helps if and only if the mixture-of-multinomial modeling assumption is correct

• Of course we could extend this to Bayes net models other than Naïve Bayes (e.g., TAN tree)

• Other extensions: model negative class as mixture of N multinomials

Page 17: Semi-Supervised Learning over Text

2. Using Redundantly Predictive Features (Co-Training)

Page 18: Semi-Supervised Learning over Text

Redundantly Predictive Features

[Example: a faculty home page can be classified either from the words on the page itself ("Professor Faloutsos ...") or from the anchor text of hyperlinks pointing to it ("my advisor")]

Page 19: Semi-Supervised Learning over Text

Co-Training

[Diagram: Classifier1 produces Answer1 from one view of the example, Classifier2 produces Answer2 from the other view]

Key idea: Classifier1 and Classifier2 must:

1. Correctly classify labeled examples

2. Agree on their classification of unlabeled examples

Page 20: Semi-Supervised Learning over Text

CoTraining Algorithm #1 [Blum&Mitchell, 1998]

Given: labeled data L,

unlabeled data U

Loop:

Train g1 (hyperlink classifier) using L

Train g2 (page classifier) using L

Allow g1 to label p positive and n negative examples from U

Allow g2 to label p positive and n negative examples from U

Add these self-labeled examples to L
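A minimal sketch of this loop; the naïve Bayes base classifiers, dense feature matrices, and binary 0/1 labels are illustrative assumptions:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, p=1, n=3, rounds=30):
    """Co-training sketch: g1 sees only view 1 (hyperlink words), g2 only
    view 2 (page words); each round both add their most confident picks to L."""
    X1_L, X2_L, y_L = X1_lab.copy(), X2_lab.copy(), np.asarray(y_lab).copy()
    X1_U, X2_U = X1_unlab.copy(), X2_unlab.copy()
    g1 = g2 = None
    for _ in range(rounds):
        g1 = MultinomialNB().fit(X1_L, y_L)          # hyperlink-view classifier
        g2 = MultinomialNB().fit(X2_L, y_L)          # page-view classifier
        if X1_U.shape[0] == 0:
            break
        chosen = {}                                   # unlabeled index -> self-assigned label
        for g, X_view in ((g1, X1_U), (g2, X2_U)):
            probs = g.predict_proba(X_view)           # columns ordered [0, 1]
            for i in np.argsort(probs[:, 1])[-p:]:    # p most confident positives
                chosen[int(i)] = 1
            for i in np.argsort(probs[:, 0])[-n:]:    # n most confident negatives
                chosen[int(i)] = 0
        idx = np.array(sorted(chosen))
        labels = np.array([chosen[i] for i in idx])
        # move the self-labeled examples (both views) from U into L
        X1_L = np.vstack([X1_L, X1_U[idx]])
        X2_L = np.vstack([X2_L, X2_U[idx]])
        y_L = np.concatenate([y_L, labels])
        keep = np.setdiff1d(np.arange(X1_U.shape[0]), idx)
        X1_U, X2_U = X1_U[keep], X2_U[keep]
    return g1, g2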

Page 21: Semi-Supervised Learning over Text

CoTraining: Experimental Results

• begin with 12 labeled web pages (academic course)

• provide 1,000 additional unlabeled web pages

• average error: learning from labeled data 11.1%;

• average error: cotraining 5.0%

Typical run:

Page 22: Semi-Supervised Learning over Text

Co-Training for Named Entity Extraction (i.e., classifying which strings refer to people, places, dates, etc.)

[Diagram: one classifier labels the noun phrase itself ("New York"), the other labels its context ("I flew to ____ today"); both views come from the sentence "I flew to New York today."]

[Riloff & Jones 98; Collins et al. 98; Jones 05]

Page 23: Semi-Supervised Learning over Text

CoTraining setting:

• wish to learn f: X → Y, given L and U drawn from P(X)

• features describing X can be partitioned (X = X1 × X2) such that f can be computed from either X1 or X2

One result [Blum & Mitchell, 1998]:

• If
– X1 and X2 are conditionally independent given Y
– f is PAC learnable from noisy labeled data

• Then
– f is PAC learnable from a weak initial classifier plus unlabeled data

Page 24: Semi-Supervised Learning over Text

Co-Training Rote Learner

[Figure: bipartite graph over the two views; nodes on one side are hyperlink features (e.g., "my advisor"), nodes on the other side are pages, and + / - labels propagate through the connected components of the graph over L + U]

Page 25: Semi-Supervised Learning over Text

Co-Training

• What's the best-case graph? (most benefit from unlabeled data)

• What's the worst case?

• What does conditional independence imply about the graph?

[Figure: bipartite graph over views x1 and x2 with a few + and - labeled examples]

Page 26: Semi-Supervised Learning over Text

Expected Rote CoTraining error given m examples

$$E[\text{error}] = \sum_{j} P(x \in g_j)\,\bigl(1 - P(x \in g_j)\bigr)^{m}$$

where g_j is the j-th connected component of the graph over L + U, and m is the number of labeled examples.

CoTraining setting: learn f: X → Y, where X = X1 × X2, x is drawn from an unknown distribution, and there exist g1, g2 such that g1(x1) = g2(x2) = f(x).
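A small numeric illustration of this formula; the connected-component probabilities below are made up:

def expected_rote_cotraining_error(component_probs, m):
    # E[error] = sum_j P(g_j) * (1 - P(g_j))**m : the probability that a test
    # example falls in component g_j and none of the m labeled examples did
    return sum(p * (1 - p) ** m for p in component_probs)

# four connected components covering the distribution, five labeled examples
print(expected_rote_cotraining_error([0.4, 0.3, 0.2, 0.1], m=5))   # about 0.206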

Page 27: Semi-Supervised Learning over Text

How many unlabeled examples suffice?

Want to assure that connected components in the underlying distribution, GD, are also connected components in the observed sample, GS.

[Figure: GD vs. GS]

O(log(N)/α) examples assure that, with high probability, GS has the same connected components as GD [Karger, 94], where N is the size of GD and α is the min cut over all connected components of GD.

Page 28: Semi-Supervised Learning over Text

PAC Generalization Bounds on CoTraining

[Dasgupta et al., NIPS 2001]

This theorem assumes X1 and X2 are conditionally independent given Y

Page 29: Semi-Supervised Learning over Text

Co-Training Theory

[Diagram: final accuracy as a function of the number of labeled examples, the number of unlabeled examples, the number of redundantly predictive inputs, dependencies among the input features, and the correctness of confidence assessments]

How can we tune the learning environment to enhance the effectiveness of Co-Training?

best: inputs conditionally independent given the class, increased number of redundant inputs, …

Page 30: Semi-Supervised Learning over Text

What if the CoTraining Assumption Is Not Perfectly Satisfied?

• Idea: Want classifiers that produce a maximally consistent labeling of the data

• If learning is an optimization problem, what function should we optimize?

Page 31: Semi-Supervised Learning over Text

Example 2: Learning to extract named entities

I arrived in Beijing on Saturday.

location?

If: “I arrived in <X> on Saturday.”

Then: Location(X)

Page 32: Semi-Supervised Learning over Text

Co-Training for Named Entity Extraction (i.e., classifying which strings refer to people, places, dates, etc.)

[Diagram: one classifier labels the noun phrase itself ("Beijing"), the other labels its context ("I arrived in ____ Saturday"); both views come from the sentence "I arrived in Beijing Saturday."]

[Riloff & Jones 98; Collins et al. 98; Jones 05]

Page 33: Semi-Supervised Learning over Text

Bootstrap learning to extract named entities [Riloff and Jones, 1999], [Collins and Singer, 1999], ...

[Figure: bootstrapping iterations, starting from seed location names and alternating between extraction patterns and newly extracted names]

Initialization: Australia, Canada, China, England, France, Germany, Japan, Mexico, Switzerland, United_states

"locations in ?x": South Africa, United Kingdom, Warrenton, Far_East, Oregon, Lexington, Europe, U.S._A., Eastern Canada, Blair, Southwestern_states, Texas, States, Singapore, ...

"operations in ?x": Thailand, Maine, production_control, northern_Los, New_Zealand, eastern_Europe, Americas, Michigan, New_Hampshire, Hungary, south_america, district, Latin_America, Florida, ...

"republic of ?x": ...
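A minimal sketch of the mutual-bootstrapping idea behind this table: seed names suggest contextual patterns ("locations in <X>"), and patterns harvest new names. The corpus format, pattern scoring, and per-iteration limits are illustrative assumptions, not the Riloff & Jones system:

import re
from collections import Counter

def bootstrap_locations(corpus, seeds, iterations=3, patterns_per_iter=2, names_per_iter=5):
    """Mutual bootstrapping sketch: alternate between scoring contextual
    patterns that precede known location names and applying those patterns
    to harvest new candidate names from the corpus (a list of sentences)."""
    known, patterns = set(seeds), set()
    for _ in range(iterations):
        # 1. candidate patterns = the two words preceding a known name, e.g. "locations in <X>"
        pattern_counts = Counter()
        for sent in corpus:
            for name in known:
                for m in re.finditer(re.escape(name), sent):
                    left = sent[:m.start()].split()[-2:]
                    if len(left) == 2:
                        pattern_counts[" ".join(left) + " <X>"] += 1
        patterns.update(p for p, _ in pattern_counts.most_common(patterns_per_iter))
        # 2. apply each pattern: the capitalized token after the pattern is a new candidate
        candidate_counts = Counter()
        for sent in corpus:
            for pat in patterns:
                prefix = re.escape(pat.replace(" <X>", ""))
                for m in re.finditer(prefix + r"\s+([A-Z][\w_]+)", sent):
                    candidate_counts[m.group(1)] += 1
        known.update(c for c, _ in candidate_counts.most_common(names_per_iter))
    return known, patterns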

Page 34: Semi-Supervised Learning over Text

Co-EM [Nigam & Ghani, 2000; Jones 2005]

Idea:

• Like co-training, use one set of features to label the other

• Like EM, iterate:
– Assigning probabilistic values to unobserved class labels
– Updating model parameters (= labels of the other feature set)
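A minimal sketch of Co-EM built from these two ideas, assuming two dense feature views, binary 0/1 labels, and logistic regression as the per-view learner (an illustrative choice; the cited NER work uses different base learners):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_with_soft_labels(X_lab, y_lab, X_unlab, pos_prob):
    """Train on labeled data plus two weighted copies of each unlabeled
    example (weight = probability of the positive / negative class)."""
    X = np.vstack([X_lab, X_unlab, X_unlab])
    y = np.concatenate([y_lab, np.ones(len(pos_prob)), np.zeros(len(pos_prob))])
    w = np.concatenate([np.ones(len(y_lab)), pos_prob, 1.0 - pos_prob])
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

def co_em(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, n_iter=10):
    """Co-EM sketch: on every iteration each view's classifier assigns
    probabilistic labels to ALL unlabeled examples for the other view."""
    g1 = LogisticRegression(max_iter=1000).fit(X1_lab, y_lab)
    g2 = None
    for _ in range(n_iter):
        p1 = g1.predict_proba(X1_unlab)[:, 1]        # view 1 labels U for view 2 ...
        g2 = fit_with_soft_labels(X2_lab, y_lab, X2_unlab, p1)
        p2 = g2.predict_proba(X2_unlab)[:, 1]        # ... view 2 labels U back for view 1
        g1 = fit_with_soft_labels(X1_lab, y_lab, X1_unlab, p2)
    return g1, g2

Unlike plain co-training, no examples are moved between pools: every unlabeled example keeps a probabilistic label that is refreshed on each iteration.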

Page 35: Semi-Supervised Learning over Text

CoEM applied to Named Entity Recognition[Rosie Jones, 2005], [Ghani & Nigam, 2000]

Update rules:


Page 38: Semi-Supervised Learning over Text
Page 39: Semi-Supervised Learning over Text

[Jones, 2005]

Can use this for active learning...

Page 40: Semi-Supervised Learning over Text

[Jones, 2005]

Page 41: Semi-Supervised Learning over Text

What if the CoTraining Assumption Is Not Perfectly Satisfied?

• Idea: Want classifiers that produce a maximally consistent labeling of the data

• If learning is an optimization problem, what function should we optimize?

Page 42: Semi-Supervised Learning over Text

What Objective Function?

$$E = E_1 + E_2 + c_3 E_3 + c_4 E_4$$

$$E_1 = \sum_{\langle x, y \rangle \in L} \bigl(y - \hat{g}_1(x_1)\bigr)^2 \qquad E_2 = \sum_{\langle x, y \rangle \in L} \bigl(y - \hat{g}_2(x_2)\bigr)^2 \qquad \text{(error on labeled examples)}$$

$$E_3 = \sum_{x \in U} \bigl(\hat{g}_1(x_1) - \hat{g}_2(x_2)\bigr)^2 \qquad \text{(disagreement over unlabeled examples)}$$

$$E_4 = \left( \frac{1}{|L|} \sum_{\langle x, y \rangle \in L} y \;-\; \frac{1}{|L \cup U|} \sum_{x \in L \cup U} \frac{\hat{g}_1(x_1) + \hat{g}_2(x_2)}{2} \right)^{2} \qquad \text{(misfit to estimated class priors)}$$

Page 43: Semi-Supervised Learning over Text

What Function Approximators?

• Same functional form as logistic regression

• Use gradient descent to simultaneously learn g1 and g2, directly minimizing E = E1 + E2 + E3 + E4

• No word independence assumption, use both labeled and unlabeled data

$$\hat{g}_1(x) = \frac{1}{1 + e^{-\sum_j w_{1,j}\, x_{1,j}}} \qquad\qquad \hat{g}_2(x) = \frac{1}{1 + e^{-\sum_j w_{2,j}\, x_{2,j}}}$$
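A minimal sketch of this setup in the notation above: two logistic-form classifiers trained jointly by gradient descent on E1 + E2 + c3·E3. The class-prior term E4 and bias weights are omitted to keep the sketch short, and the learning rate and step count are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
                     c3=1.0, lr=0.01, steps=2000):
    """Jointly fit g1 (view 1) and g2 (view 2) by gradient descent on
    squared error over labeled data plus squared disagreement over unlabeled data."""
    w1 = np.zeros(X1_lab.shape[1], dtype=float)
    w2 = np.zeros(X2_lab.shape[1], dtype=float)
    for _ in range(steps):
        g1_l, g2_l = sigmoid(X1_lab @ w1), sigmoid(X2_lab @ w2)
        g1_u, g2_u = sigmoid(X1_unlab @ w1), sigmoid(X2_unlab @ w2)
        # gradients of E1 = sum (y - g1)^2 and E2 = sum (y - g2)^2
        grad1 = -2 * ((y_lab - g1_l) * g1_l * (1 - g1_l)) @ X1_lab
        grad2 = -2 * ((y_lab - g2_l) * g2_l * (1 - g2_l)) @ X2_lab
        # gradients of E3 = sum over unlabeled examples of (g1 - g2)^2
        diff = g1_u - g2_u
        grad1 += c3 * 2 * (diff * g1_u * (1 - g1_u)) @ X1_unlab
        grad2 -= c3 * 2 * (diff * g2_u * (1 - g2_u)) @ X2_unlab
        w1 -= lr * grad1
        w2 -= lr * grad2
    return w1, w2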

Page 44: Semi-Supervised Learning over Text

Gradient CoTraining: Classifying Capitalized Sequences as Person Names

Error rates, compared across two conditions (25 labeled + 5000 unlabeled examples; 2300 labeled + 5000 unlabeled examples) and three settings: using labeled data only, cotraining, and cotraining without fitting class priors (E4). Reported error rates: .27, .13, .24, .11*, .15* (* sensitive to the weights of error terms E3 and E4).

E.g., “Company president Mary Smith said today…” (x1: “Company president”, x2: “Mary Smith”, x1: “said today”)

Page 45: Semi-Supervised Learning over Text

Example 3: Word sense disambiguation [Yarowsky]

• “bank” = river bank, or financial bank??

• Assumes a single word sense per document
– X1: the document containing the word
– X2: the immediate context of the word (“swim near the __”)

Successfully learns “context → word sense” rules when a word occurs multiple times in a document.
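A minimal sketch of Yarowsky-style bootstrapping using the two assumptions above (one sense per document, context-word rules); the input format, thresholds, and rule-selection heuristic are illustrative assumptions, not the original algorithm's specifics:

from collections import Counter, defaultdict

def yarowsky_wsd(occurrences, seed_rules, iterations=5, min_count=2):
    """Bootstrapping sketch for word sense disambiguation.
    occurrences: list of (doc_id, context_words) for one ambiguous word.
    seed_rules: dict mapping a context word -> sense, e.g. {"river": "bank/river"}."""
    rules = dict(seed_rules)
    labels = {}
    for _ in range(iterations):
        # 1. label occurrences whose context matches a known rule
        doc_votes = defaultdict(Counter)
        for i, (doc, ctx) in enumerate(occurrences):
            senses = [rules[w] for w in ctx if w in rules]
            if senses:
                labels[i] = Counter(senses).most_common(1)[0][0]
                doc_votes[doc][labels[i]] += 1
        # 2. one sense per document: propagate each document's majority sense
        for i, (doc, ctx) in enumerate(occurrences):
            if i not in labels and doc_votes[doc]:
                labels[i] = doc_votes[doc].most_common(1)[0][0]
        # 3. learn new context-word -> sense rules from the current labeling
        word_sense = defaultdict(Counter)
        for i, (doc, ctx) in enumerate(occurrences):
            if i in labels:
                for w in ctx:
                    word_sense[w][labels[i]] += 1
        for w, counts in word_sense.items():
            sense, cnt = counts.most_common(1)[0]
            if cnt >= min_count and cnt == sum(counts.values()):   # unambiguous so far
                rules.setdefault(w, sense)
    return labels, rules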

Page 46: Semi-Supervised Learning over Text

Example 4: Bootstrap learning for IE from HTML structure

[Muslea, et al. 2001]

X1: HTML preceding the target

X2: HTML following the target

Page 47: Semi-Supervised Learning over Text

Example Bootstrap learning algorithms:

• Classifying web pages [Blum&Mitchell 98; Slattery 99]

• Classifying email [Kiritchenko&Matwin 01; Chan et al. 04]

• Named entity extraction [Collins&Singer 99; Jones&Riloff 99]

• Wrapper induction [Muslea et al., 01; Mohapatra et al. 04]

• Word sense disambiguation [Yarowsky 96]

• Discovering new word senses [Pantel&Lin 02]

• Synonym discovery [Lin et al., 03]

• Relation extraction [Brin et al.; Yangarber et al. 00]

• Statistical parsing [Sarkar 01]

Page 48: Semi-Supervised Learning over Text

What to Know

• Several approaches to semi-supervised learning
– EM with a probabilistic model
– Co-Training
– Graph similarity methods
– ...
– See the reading list below

• Redundancy is important

• Much more to be done:
– Better theoretical models of when/how unlabeled data can help
– Bootstrap learning from the web (e.g., Etzioni, 2005, 2006)
– Active learning (use the limited labeling time of humans wisely)
– Never-ending bootstrap learning?
– ...

Page 49: Semi-Supervised Learning over Text

Further Reading

• Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.), MIT Press, 2006.

• Semi-Supervised Learning Literature Survey, Xiaojin Zhu, 2006.

• Unsupervised word sense disambiguation rivaling supervised methods D. Yarowsky (1995)

• "Semi-Supervised Text Classification Using EM,"  K. Nigam, A. McCallum, and T. Mitchell, in Semi-Supervised Learning, Olivier Chapelle, Bernhard Sch¨olkopf, and Alexander Zien (eds.), MIT Press, 2006.

• " Text Classification from Labeled and Unlabeled Documents using EM," K. Nigam, Andrew McCallum, Sebastian Thrun and Tom Mitchell. Machine Learning, Kluwer Academic Press, 1999.

• " Combining Labeled and Unlabeled Data with Co-Training," A. Blum and T. Mitchell, Proceedings of the 1998 Conference on Computational Learning Theory, July 1998.

• Discovering Word Senses from Text, Pantel & Lin (2002)

• Creating Subjective and Objective Sentence Classifiers from Unannotated Texts, Janyce Wiebe and Ellen Riloff (2005)

• Graph Based Semi-Supervised Approach for Information Extraction, Hany Hassan, Ahmed Hassan, and Sara Noeman (2006)

• The use of unlabeled data to improve supervised learning for text summarization, M.R. Amini and P. Gallinari (2002)

Page 50: Semi-Supervised Learning over Text

Further Reading

• Yusuke Shinyama and Satoshi Sekine. Preemptive Information Extraction using Unrestricted Relation Discovery

• Alexandre Klementiev and Dan Roth. Named Entity Transliteration and Discovery from Multilingual Comparable Corpora.

• Rion L. Snow, Daniel Jurafsky, Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery

• Sarkar (1999). Applying Co-training Methods to Statistical Parsing.

• S. Brin, 1998. Extracting Patterns and Relations from the World Wide Web, EDBT'98.

• O. Etzioni et al., 2005. "Unsupervised Named-Entity Extraction from the Web: An Experimental Study," AI Journal, 2005.