Semi-supervised Information Extraction

40
Semi-supervised Information Extraction 11-2-2009

description

Semi-supervised Information Extraction. 11-2-2009. What is “Information Extraction”. As a family of techniques:. Information Extraction = segmentation + classification + association + clustering. October 14, 2002, 4:00 a.m. PT - PowerPoint PPT Presentation

Transcript of Semi-supervised Information Extraction

Page 1: Semi-supervised Information Extraction

Semi-supervised Information Extraction

11-2-2009

Page 2: Semi-supervised Information Extraction

What is “Information Extraction”

Information Extraction = segmentation + classification + association + clustering

As a familyof techniques:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Most papers focus on the process and evaluate 1-2 steps in isolation with supervised methods:

• classification of pre-segmented candidates• association of pre-identified NERs• clustering of entities and facts• combining segmentation and classification with CRFs, VP-HMMs, …• combining segmentation, classification, and clustering with MLNs, “imperative factor graphs”, …

A couple looked at using weaker training data:• Bunescu & Mooney’s paper on multi-instance learning (from a few “bags”)• Bellare & McCallum’s paper on learning to segment from from (DBLP record, citation) pairs

Page 3: Semi-supervised Information Extraction

What is “Information Extraction”

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Microsoft CorporationCEOBill GatesMicrosoftGatesMicrosoftBill VeghteMicrosoftVPRichard StallmanfounderFree Software Foundation N

AME

TITLE ORGANIZATION

Bill Gates

CEO

Microsoft

Bill Veghte

VP

Microsoft

Richard Stallman

founder

Free Soft..

*

*

*

*

An alternative – focus on the end result (how many facts you can extract, by any means). Some new questions:• Do you need to do any step well?

• Maybe not – if there is redundancy in the corpus, and you can process a large corpus quickly.

• Do you need test data for any step, or training data?• Maybe not – in which case maybe unsupervised or semi-supervised learning is better?

• How much can you do with weak training data?• think of Bellare & McCallum, Bunescu & Mooney

General differences:• More data, larger corpora, faster techniques• High-precision, maybe low recall extraction• Different learning methods• Harder to evaluate

Today: summarize the key ideas in some old papers and give a general overview of un- and semi- supervised learning

Page 4: Semi-supervised Information Extraction

A key idea: “macro-reading” vs “micro-reading”

• Often the same facts are stated many times in different documents.– “The global focus of Carnegie Mellon's business school is about to expand

boundaries with the fall 2004 opening of a branch campus in Qatar…”– “Carnegie Mellon University in Qatar, founded in 2004, is a branch

campus…”– “In Qatar, Carnegie Mellon expects to enroll a first class of 50

undergraduates (25 in each degree program) by fall 2004. ...”– “In August 2004 Carnegie Mellon Qatar began offering its highly regarded

undergraduate programs in…”– “Carnegie Mellon University in Qatar, founded in 2004, is a branch campus

of Carnegie Mellon University…”

• When this is true, a low-recall, high-precision system can extract many facts.

Page 5: Semi-supervised Information Extraction

Macro-reading c. 1992

Idea: write some specific patterns that indicate A is a kind of B:

1. … such NP as NP (“at such schools as CMU, students rarely need extensions”)

2. NP, NP, or other NP (“William, Carlos or other machine learning professors”)

3. NP including NP (“struggling teams including the Pirates”)

4. NP, especially NP (prestigious conferences, especially NIPS)

[Coling 1992]

Results: 8.6M words of Grolier’s encyclopedia 7067 pattern instances 152 relations

Many were not in WordNet.

Page 6: Semi-supervised Information Extraction

A key idea: training with known facts and text vs training on annotated text

• Basic idea: – Find where a known fact occurs in text, by matching/alignment/…

• Sometimes this is pretty easy

– Use this as (noisy) training data for a conventional learning system.

Page 7: Semi-supervised Information Extraction

A key idea: training with derived facts and text vs training on annotated text

• Basic idea: – Find where a known fact occurs in text, by matching/alignment/…

• Sometimes this is pretty easy

– Use this as (possibly noisy) training data for a conventional IE learning system.

• Once you’ve learned an extractor from that data– Run the extractor on some (maybe additional) text

– Take the (possibly noisy) new facts and start over

• “Self-training” or “bootstrapping”

Page 8: Semi-supervised Information Extraction

Macro-reading c. 1992

Idea: write some specific patterns that indicate A is a kind of B:

1. … such NP as NP (“at such schools as CMU, students rarely need extensions”)

2. NP, NP, or other NP (“William, Carlos or other machine learning professors”)

3. NP including NP (“struggling teams including the Pirates”)

4. NP, especially NP (prestigious conferences, especially NIPS)

[Coling 1992]

Results: 8.6M words of Grolier’s encyclopedia 7067 pattern instances 152 relations

Many were not in WordNet.

Marti’s system was iterative

Page 9: Semi-supervised Information Extraction

Another iterative, high-precision system

Idea: exploit “pattern/relation duality”:

1. Start with some seed instances of (author,title) pairs (“Isaac Asimov”, “The Robots of Dawn”)

2. Look for occurrences of these pairs on the web.

3. Generate patterns that match the seeds.

- URLprefix, prefix, middle, suffix

4. Extract new (author, title) pairs that match the patterns.

5. Go to 2.

[some workshop, 1998]

Unlike Hearst, Brin learned the patterns; and learned very high-precision, easy-to-match patterns.

Result: 24M web pages + 5 books 199 occurrences 3 patterns 4047 occurrences + 5M pages 3947 occurrences 105 patterns … 15,257 books *with some manual tweaks

Page 10: Semi-supervised Information Extraction

Key Ideas: So Far

• High-precision low-coverage extractors and large redundant corpora (macro-reading)

• Self-training/bootstrapping– Advantage: train on a small corpus, test on a

larger one• You can use more-or-less off-the-shelf learning methods• You can work with very large corpora

– But, data gets noisier and noisier as you iterate– Need either

• really high-precision extractors, or • some way to cope with the noise

Page 11: Semi-supervised Information Extraction

A variant of bootstrapping: co-training

Redundantly Sufficient Features:

• features x can be separated into two types x1,x2

• either x1 or x2 is sufficient for classification – i.e.

there exists functions f1 and f2 such that

f(x) = f1(x1)=f2(x2) has low error

spelling feature context feature

person

e.g. Capitalization=X+.X+Prefix=Mr.

e.g., based on words nearby in dependency

parse

Page 12: Semi-supervised Information Extraction

Another kind of self-training

[COLT 98]

Page 13: Semi-supervised Information Extraction

Unsupervised Models for Named Entity Classification

Michael Collins and Yoram Singer [EMNLP 99]

Redundantly Sufficient Features:

• features x can be separated into two types x1,x2

• either x1 or x2 is sufficient for classification – i.e.

there exists functions f1 and f2 such that

f(x) = f1(x1)=f2(x2) has low error

spelling feature context feature

person

e.g., Capitalization=X+.X+Prefix=Mr.

Based on dependency parse

Candidate entities x segmented using a POS pattern

Page 14: Semi-supervised Information Extraction

A co-training algorithm

Page 15: Semi-supervised Information Extraction

CoTraining

spelling context

Mr. Cooper

… president

Each node is a “half-example”, x1 or x2

Each unlabeled pair is represented as an edge

An edge indicates that the two half-examples must have the same label

Labeling unlabeled examples propogates information between columns, learning propogates information within columns

Page 16: Semi-supervised Information Extraction

Evaluation for Collins and Singer

88,962 examples (spelling,context) pairs

7 seed rules are used

1000 examples are chosen as test data (85 noise)

We label the examples as (location, person,

organization, noise)

Page 17: Semi-supervised Information Extraction

Key Ideas: So Far

• High-precision low-coverage extractors and large redundant corpora (macro-reading)

• Self-training/bootstrapping• Co-training• Clustering phrases by context

– Don’t propogate labels;– Instead do without them entirely

Pres.

CEO

VP

Mr. Cooper

Bob

job

internMSR

IBM

patent

Page 18: Semi-supervised Information Extraction

[KDD 2002]

Basic idea: parse a big corpus, then cluster NPs by their contexts

Page 19: Semi-supervised Information Extraction

Key Ideas: So Far

• High-precision low-coverage extractors and large redundant corpora (macro-reading)

• Self-training/bootstrapping or co-training• Other semi-supervised methods:

– Expectation-maximization: like self-training but you “soft-label” the unlabeled examples with the expectation over the labels in each iteration.

• Works for almost any generative model (e.g., HMMs)• Learns directly from all the data

– Maybe better; Maybe slower– Extreme cases:

» supervised learning …. clustering + cluster-labeling

Page 20: Semi-supervised Information Extraction

Key Ideas: So Far

• High-precision low-coverage extractors and large redundant corpora (macro-reading)

• Self-training/bootstrapping or co-training• Other semi-supervised methods:

– Expectation-maximization– Transductive margin-based methods (e.g.,

transductive SVM)– Graph-based methods

Page 21: Semi-supervised Information Extraction

History: Open-domain IE by pattern-matching (Hearst, 92)

• Start with seeds: “NIPS”, “ICML”• Look thru a corpus for certain patterns:

• … “at NIPS, AISTATS, KDD and other learning conferences…”

• Expand from seeds to new instances• Repeat….until ___

– “on PC of KDD, SIGIR, … and…”

Page 22: Semi-supervised Information Extraction

Bootstrapping as graph proximity

“…at NIPS, AISTATS, KDD and other learning conferences…”

… “on PC of KDD, SIGIR, … and…”

NIPS

AISTATS

KDD

“For skiiers, NIPS, SNOWBIRD,… and…”

SNOWBIRD

SIGIR

“… AISTATS,KDD,…”

shorter paths ~ earlier iterationsmany paths ~ additional evidence

Page 23: Semi-supervised Information Extraction

Similarity of Nodes in Graphs: Personal PageRank/RandomWalk with Restart

• Similarity defined by “damped” version of PageRank• Similarity between nodes x and y:

– “Random surfer model”: from a node z,• with probability α, stop and “output” z• pick an edge label r using Pr(r | z) ... e.g. uniform• pick a y given x,r: e.g. uniform from { y’ : z y with label r }• repeat from node y ....

– Similarity x~y = Pr( “output” y | start at x)

– Bootstrapping: propogate from labeled data to “similar” unlabeled data.

• Intuitively, x~y is summation of weight of all paths from x to y, where weight of path decreases exponentially with length.

Page 24: Semi-supervised Information Extraction

PPR/RWR is somewhat like TFIDF

“William W. Cohen, CMU”

“Dr. W. W. Cohen”

cohenwilliam w

drcmu

“George W. Bush”

“George H. W. Bush”

“Christos Faloutsos, CMU”

Page 25: Semi-supervised Information Extraction

A little math…

1

1

1

1322

32

32

)1(

)1(

1

1)1(

)(...)()()1()1(

)1)(...1()1(

...1

xy

x

xy

xxy

xxxxxxxxy

xxxxxxy

xxxxy

n

n

nn

n

n

Let x be less than 1. Then

Example: x=0.1, and 1+0.1+0.01+0.001+…. = 1.11111 = 10/9.

Page 26: Semi-supervised Information Extraction

Graph = Matrix

AB

C

FD

E

GI

HJ

A B C D E F G H I J

A 1 1 1

B 1 1

C 1

D 1 1

E 1

F 1 1 1

G 1

H 1 1 1

I 1 1 1

J 1 1

Page 27: Semi-supervised Information Extraction

Graph = MatrixTransitively Closed Components = “Blocks”

AB

C

FD

E

GI

HJ

A B C D E F G H I J

A _ 1 1 1

B 1 _ 1

C 1 1 _

D _ 1 1

E 1 _ 1

F 1 1 1 _

G _ 1 1

H _ 1 1

I 1 1 _ 1

J 1 1 1 _

Of course we can’t see the “blocks” unless the nodesare sorted by cluster…

Page 28: Semi-supervised Information Extraction

Graph = MatrixVector = Node Weight

H

A B C D E F G H I J

A _ 1 1 1

B 1 _ 1

C 1 1 _

D _ 1 1

E 1 _ 1

F 1 1 1 _

G _ 1 1

H _ 1 1

I 1 1 _ 1

J 1 1 1 _

AB

C

FD

E

GI

J

A

A 3

B 2

C 3

D

E

F

G

H

I

J

M

M v

Page 29: Semi-supervised Information Extraction

Graph = MatrixM*v1 = v2 “propogates weights from neighbors”

A B C D E F G H I J

A _ 1 1 1

B 1 _ 1

C 1 1 _

D _ 1 1

E 1 _ 1

F 1 1 _

G _ 1 1

H _ 1 1

I 1 1 _ 1

J 1 1 1 _

AB

C

FD

E

I

A 3

B 2

C 3

D

E

F

G

H

I

J

M

M v1

A 2*1+3*1+0*1

B 3*1+3*1

C 3*1+2*1

D

E

F

G

H

I

J

v2* =

Page 30: Semi-supervised Information Extraction

A little math…

1

1(

()

W)(IY

W)IW)Y(I

W)...)(IW)(WW(IW)Y(I

W)...)(IW)(W)(W(IW)Y(I

W)...(W)(W)(WIY

2

32

32

n

n

α

ααα

ααα

Let W[i,j] be Pr(walk to j from i)and let α be less than 1. Then:

The matrix (I- αW) is the Laplacian of αW.

Generally the Laplacian is (D- A) where D[i,i] is the degree of i in the adjacency matrix A.

)|Pr(1

],[ ijZ

ji Y

Page 31: Semi-supervised Information Extraction

A little math…

)|Pr(][ so

)1(

0,....,0,1,0,....,0,0,0

0

101

0

ijjnn

tt

vYvv

Wvvv

v

Let W[i,j] be Pr(walk to j from i)and let α be less than 1. Then:

The matrix (I- αW) is the Laplacian of αW.

Generally the Laplacian is (D- A) where D[i,i] is the degree of i in the adjacency matrix A.

component i

Page 32: Semi-supervised Information Extraction

Bootstrapping via PPR/RWR on graph of patterns and nodes

“…at NIPS, AISTATS, KDD and other learning conferences…”

… “on PC of KDD, SIGIR, … and…”

NIPS

AISTATS

KDD

“For skiiers, NIPS, SNOWBIRD,… and…”

SNOWBIRD

SIGIR

“… AISTATS,KDD,…”

Page 33: Semi-supervised Information Extraction

Key Ideas: So Far

• High-precision low-coverage extractors and large redundant corpora (macro-reading)

• Self-training/bootstrapping or co-training• Other semi-supervised methods:

– Expectation-maximization– Transductive margin-based methods (e.g.,

transductive SVM)– Graph-based methods

• Label propogation via random walk with reset

Page 34: Semi-supervised Information Extraction

Bootstrapping

BlumMitchell ’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Lin & Pantel ‘02Clustering by distributional similarity…

Page 35: Semi-supervised Information Extraction

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Collins & Singer ‘99

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Boosting-based co-train method using content & context features; context based on Collins’ parser; learn to classify three types of NE

Lin & Pantel ‘02Clustering by distributional similarity…

Page 36: Semi-supervised Information Extraction

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99 Hearst-like patterns, Brin-like bootstrapping (+ “meta-level” bootstrapping) on MUC data

Lin & Pantel ‘02Clustering by distributional similarity…

Page 37: Semi-supervised Information Extraction

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99

Cucerzan & Yarowsky ‘99

EM like co-train method with context & content both defined

by character-level tries

Lin & Pantel ‘02Clustering by distributional similarity…

Page 38: Semi-supervised Information Extraction

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99

Cucerzan & Yarowsky ‘99

Etzioni et al 2005

Lin & Pantel ‘02Clustering by distributional similarity…

Page 39: Semi-supervised Information Extraction

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99

Cucerzan & Yarowsky ‘99

Etzioni et al 2005

… TextRunner

Lin & Pantel ‘02Clustering by distributional similarity…

Page 40: Semi-supervised Information Extraction

Bootstrapping

BM’98

Brin’98

Hearst ‘92

Scalability, surface patterns, use of web crawlers…

Learning, semi-supervised learning, dual feature spaces…

Deeper linguistic features, free text…

Collins & Singer ‘99

Riloff & Jones ‘99

Cucerzan & Yarowsky ‘99

Etzioni et al 2005

… TextRunner

ReadTheWeb

Lin & Pantel ‘02Clustering by distributional similarity…