Distant Supervision and MultiR - IIT Delhi · 2017. 3. 7.
-
Distant Supervision and MultiR
Happy Mittal
-
We will discuss
• Distant Supervision [Mintz et al, 2009]
• MultiR [Hoffmann et al, 2011]
-
Relation Instance Extraction
• Fully Supervised Learning
  • Labeled corpora of sentences.
  • Suffers from small datasets, domain bias.
• Unsupervised Learning
  • Cluster patterns to identify relations.
  • Large corpora available.
  • Can't give names to the relations identified.
• Bootstrap Learning
  • Give initial seed patterns and facts.
  • Generate more facts and patterns.
  • Suffers from semantic drift.
• Distant Supervision
  • Combines advantages of the above approaches.
Hrithik Roshan's movie Kaabil features a love affair between two blind people.
Actor(Hrithik Roshan, Kaabil)
-
Distant Supervision [Mintz et al 2009]
Sentences (e.g. Wikipedia articles)

Knowledge base (e.g. Freebase):
Person          Birth Place
Edwin Hubble    Marshfield
….              ….

Generate training data. HOW?

Assumption : A fact r(e1, e2) => every sentence having entities e1 and e2 specifies relation r.
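The generation step above can be sketched as follows. This is a toy illustration, not the Mintz et al. pipeline: the KB facts and sentences are invented, and plain substring matching stands in for real entity recognition and linking.

```python
# Toy KB facts and sentences; real pipelines use Freebase facts and
# NER-tagged text instead of substring matching.
kb = {("Edwin Hubble", "Marshfield"): "Birthplace"}

sentences = [
    "Astronomer Edwin Hubble was born in Marshfield , Missouri .",
    "Edwin Hubble spent his childhood in Marshfield .",
]

def label_sentences(kb, sentences):
    """Label every sentence containing both entities of a KB fact
    with that fact's relation (the distant-supervision assumption)."""
    training = []
    for sent in sentences:
        for (e1, e2), rel in kb.items():
            if e1 in sent and e2 in sent:
                training.append((sent, e1, e2, rel))
    return training

for sent, e1, e2, rel in label_sentences(kb, sentences):
    print(rel, "<-", sent)
```

Note that the second sentence gets labeled Birthplace even though it does not express that relation: this is exactly the noise the strong assumption introduces.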
-
Distant Supervision (Generating training data)
• Sentence: Astronomer Edwin Hubble was born in Marshfield, Missouri.
• Features : Lexical Features
  o Entity types of both entities.
NE1 NE2 Label
PER LOC Birthplace
-
Distant Supervision (Generating training data)
• Sentence: Astronomer Edwin Hubble was born in Marshfield, Missouri.
• Features : Lexical Features
  o Words between entities and their POS tags.
NE1 Middle NE2 Label
PER [was/VERB born/VERB in/CLOSED] LOC Birthplace
-
Distant Supervision (Generating training data)
• Sentence: Astronomer Edwin Hubble was born in Marshfield, Missouri.
• Features : Lexical Features
  o Window of k words to the left and right, k ∈ {0,1,2}.
Left Window NE1 Middle NE2 Right window Label
[] PER [was/VERB born/VERB in/CLOSED] LOC [] Birthplace
[Astronomer] PER [was/VERB born/VERB in/CLOSED] LOC [,] Birthplace
[#,Astronomer] PER [was/VERB born/VERB in/CLOSED] LOC [,Missouri] Birthplace
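The feature tables above (middle words with POS tags plus k-word windows) can be sketched as one extraction routine. The function name and the (word, tag) token format are assumptions for illustration, not the paper's code; entity heads are treated as single tokens for brevity.

```python
# Illustrative sketch of Mintz-style lexical features: for each window
# size k in {0, ..., k_max}, emit (left window, NE1 type, middle
# words/POS, NE2 type, right window).
def lexical_features(tokens, i1, i2, t1, t2, k_max=2):
    """tokens: list of (word, pos); i1/i2: entity token indices;
    t1/t2: named-entity types of the two entities."""
    middle = ["%s/%s" % wp for wp in tokens[i1 + 1:i2]]
    feats = []
    for k in range(k_max + 1):
        left = [w for w, _ in tokens[max(0, i1 - k):i1]]
        right = [w for w, _ in tokens[i2 + 1:i2 + 1 + k]]
        feats.append((tuple(left), t1, tuple(middle), t2, tuple(right)))
    return feats

toks = [("Astronomer", "NOUN"), ("Edwin Hubble", "PER"), ("was", "VERB"),
        ("born", "VERB"), ("in", "CLOSED"), ("Marshfield", "LOC"),
        (",", "PUNCT"), ("Missouri", "NOUN")]
for f in lexical_features(toks, 1, 5, "PER", "LOC"):
    print(f)
```

With k = 0 this reproduces the first table row (empty windows); k = 1 and k = 2 add the surrounding words, as in the slide.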
-
Distant Supervision (Generating training data)
• Sentence: Astronomer Edwin Hubble was born in Marshfield, Missouri.
• Features : Syntactic Features
  o Dependency path between entities.
  o Window nodes in the dependency path.
-
Distant supervision
• Strong assumption : If a fact r(e1,e2) is seen in the KB, then every sentence having e1 and e2 specifies relation r.
• Relaxed assumption : At least one sentence having e1 and e2 specifies relation r [Riedel et al, 2010].
-
Relaxing the assumption [Riedel et al 2010]
Y ∈ R : relation variable; here Y = Founded
Z1, Z2 ∈ {0,1} : relation mention variables; here Z1 = 1, Z2 = 0
X1 : Steve Jobs founded Apple
X2 : Steve Jobs is the CEO of Apple
• Model the joint distribution P(Y = y, Z = z | x)
-
Relaxing the assumption [Riedel et al 2010]
• Model the joint distribution P(Y = y, Z = z | x)
• Problem : doesn't allow overlapping relations.
• MultiR solves that problem.
-
MultiR [Hoffmann et al 2011]
Y ∈ {0,1}^|R| : relation variables (capture aggregate-level prediction); here Founded = 1, CEO-of = 1
Zi ∈ R : relation mention variables (capture sentence-level prediction)
X1 : Steve Jobs founded Apple → Z1 = Founded
X2 : Steve Jobs is the CEO of Apple → Z2 = CEO-of
X3 : Steve Jobs left Apple → Z3 = None
…
-
MultiR [Hoffmann et al 2011]
• Probability Distribution

P(Y = y, Z = z | x) = (1/Z_x) ∏_r φ^join(y_r, z) ∏_i φ^extract(z_i, x_i)

where φ^join(y_r, z) = 1 if at least one z_i mentions relation y_r, and φ^extract(z_i, x_i) is defined over [Mintz et al] features.
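A minimal sketch of this factorized (unnormalized) score, assuming a placeholder feature function `phi` that returns a sparse feature dict; the deterministic-OR join factor zeroes out any assignment where y_r disagrees with the mentions.

```python
import math

# Sketch (not the paper's code) of the unnormalized MultiR score:
# deterministic-OR join factors times log-linear extraction factors
# over [Mintz et al]-style features.
def joint_score(y, z, x, theta, phi):
    # phi_join(y_r, z): 1 iff y_r agrees with "at least one z_i mentions r".
    for r, y_r in y.items():
        if bool(y_r) != any(z_i == r for z_i in z):
            return 0.0  # some join factor is zero
    # phi_extract(z_i, x_i) = exp(sum_j theta_j * phi_j(z_i, x_i))
    score = 1.0
    for z_i, x_i in zip(z, x):
        score *= math.exp(sum(theta.get(f, 0.0) * v
                              for f, v in phi(z_i, x_i).items()))
    return score
```

Normalizing by Z_x would require summing this over all (y, z), which is what makes exact learning expensive and motivates the argmax approximations below.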
-
MultiR [Hoffmann et al 2011]
• Parameter Learning

P(Y = y, Z = z | x; θ) = (1/Z_x) ∏_r φ^join(y_r, z) ∏_i φ^extract(z_i, x_i)
                       = (1/Z_x) ∏_r φ^join(y_r, z) ∏_i exp(Σ_j θ_j φ_j(z_i, x_i))

where φ^join(y_r, z) = 1 if at least one z_i mentions relation y_r, and the φ_j are [Mintz et al] features.

• Treat the Z variables as latent variables.
• Interested in maximizing the likelihood

L(θ) = ∏_i P(y_i | x_i; θ) = ∏_i Σ_z P(y_i, z | x_i; θ)

or equivalently the log-likelihood

ℓ(θ) = Σ_i log Σ_z P(y_i, z | x_i; θ)
-
MultiR [Hoffmann et al 2011]
• Parameter learning: uses online training (per-instance updates).
-
MultiR [Hoffmann et al 2011]
• Parameter learning
• The sum over latent z is difficult to compute; compute the argmax assignment instead.
-
MultiR [Hoffmann et al 2011]
• Learning Algorithm
• Needs two inference procedures.
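Under the online scheme, a perceptron-style sketch of the update that combines both inferences might look like this. The function names and feature interface are illustrative placeholders, not the paper's code: `argmax_joint` stands for inference 1 (argmax over y and z), `argmax_cond` for inference 2 (argmax over z given the gold y), and `phi` maps an assignment to a sparse feature dict.

```python
# Hypothetical sketch of MultiR's mistake-driven online update.
def multir_update(theta, x, y_gold, argmax_joint, argmax_cond, phi, lr=1.0):
    y_pred, z_pred = argmax_joint(theta, x)        # inference 1
    if y_pred != y_gold:                           # update only on mistakes
        z_star = argmax_cond(theta, x, y_gold)     # inference 2
        for f, v in phi(x, z_star).items():        # promote gold-consistent z*
            theta[f] = theta.get(f, 0.0) + lr * v
        for f, v in phi(x, z_pred).items():        # demote the predicted z
            theta[f] = theta.get(f, 0.0) - lr * v
    return theta
```

The two argmax calls are exactly the two inference problems worked through on the following slides.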
-
MultiR Inference 1 : argmax_{y,z} P(y, z | x; θ)
Y ∈ {0,1}^|R| : relation variables (capture aggregate-level prediction)
Zi ∈ R : relation mention variables (capture sentence-level prediction)

Extraction scores:
          X1     X2     X3
Founder   10.5   12.5   4.5
CEO-of    8.9    8.7    8.5
Capital   6.3    4.5    0.5

X1 : Steve Jobs founded Apple → Z1 = Founder
X2 : Apple was founded by Steve Jobs → Z2 = Founder
X3 : Steve Jobs is the CEO of Apple → Z3 = CEO-of
Y : Founder = 1, CEO-of = 1, Capital = 0

Each Zi takes its highest-scoring relation, and y_r = 1 iff some Zi = r. Complexity : O(|R||S|).
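Because the join factor is a deterministic OR, inference 1 reduces to picking each mention's best relation independently. A sketch using the slide's score table:

```python
# Inference 1 sketch: scores[r][i] is the extraction score for mention
# X_{i+1} under relation r (the slide's illustrative numbers).
scores = {
    "Founder": [10.5, 12.5, 4.5],
    "CEO-of":  [8.9,  8.7,  8.5],
    "Capital": [6.3,  4.5,  0.5],
}

def inference1(scores, n_mentions):
    z = []
    for i in range(n_mentions):
        z.append(max(scores, key=lambda r: scores[r][i]))  # best relation per mention
    y = {r: int(r in z) for r in scores}                   # deterministic OR over mentions
    return y, z

y, z = inference1(scores, 3)
print(z)  # ['Founder', 'Founder', 'CEO-of']
print(y)  # {'Founder': 1, 'CEO-of': 1, 'Capital': 0}
```

Each mention needs one scan of |R| relations, giving the O(|R||S|) bound on the slide.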
-
MultiR Inference 2 : argmax_z P(z | x, y; θ)
Given y (here Founder = 1, CEO-of = 1, Capital = 0), find the best mention assignment consistent with y.
• Variant of the weighted edge cover problem.
• Potentials serve as edge weights (ignore edges for relations with y_r = 0).
• Each y_r = 1 must be covered by at least one edge; each Zi takes exactly one edge.

Extraction scores:
          X1     X2     X3
Founder   10.5   12.5   4.5
CEO-of    8.9    8.7    8.5
Capital   6.3    4.5    0.5

X1 : Steve Jobs founded Apple → Z1 = Founder
X2 : Apple was founded by Steve Jobs → Z2 = Founder
X3 : Steve Jobs is the CEO of Apple → Z3 = CEO-of

• Exact solution : O(V(E + V log V))
• Approximate solution : O(|R||S|)
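A sketch of the O(|R||S|) approximation for inference 2, using the slide's score table. The greedy repair step is a simplification of the paper's procedure, shown only to convey the edge-cover constraints.

```python
# Inference 2 sketch: assign each mention its best *active* relation
# (edges to relations with y_r = 0 are ignored), then repair any active
# relation left uncovered by reassigning the cheapest mention.
scores = {
    "Founder": [10.5, 12.5, 4.5],
    "CEO-of":  [8.9,  8.7,  8.5],
    "Capital": [6.3,  4.5,  0.5],
}

def inference2(scores, active, n_mentions):
    # Each z takes exactly one edge: its best-scoring active relation.
    z = [max(active, key=lambda r: scores[r][i]) for i in range(n_mentions)]
    # Each active y must get at least one edge: repair uncovered relations.
    for r in active:
        if r not in z:
            # Move the mention that loses the least score to relation r.
            i = max(range(n_mentions), key=lambda j: scores[r][j] - scores[z[j]][j])
            z[i] = r
    return z

print(inference2(scores, ["Founder", "CEO-of"], 3))  # ['Founder', 'Founder', 'CEO-of']
```

On the slide's example no repair is needed; with active = {Founder, Capital} the repair step would move X3 to Capital, the cheapest way to cover it.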
-
Experiments
• Data
• NY Times sentences : NER tagged
• Used Freebase as KB.
• Evaluation Metric
  • Challenging:
    • Only 3% of sentences match facts in the KB.
    • Number of matches across relations is highly unbalanced.
  • Aggregate Extraction
    • Matched extracted relations against Freebase relations.
    • Underestimates accuracy because many true relations are not in Freebase.
  • Sentential Extraction
    • Sampled sentences from the union of two sets:
      • Sentences from which some relation is extracted.
      • Sentences whose arguments match entities in Freebase.
    • Manually labelled them correct or incorrect.
    • Overestimates the recall.
-
Experiments
• Systems compared
  • Original implementation of Riedel et al [2010]
• SoloR : Reimplementation of Riedel et al [2010]
• MultiR
• Metrics
  • Aggregate and sentential extraction results (PR curves)
• Relation specific results
• Running time
-
Experiments
• Results
  • Aggregate extraction
    • MultiR: high precision over the full recall range.
    • MultiR: recall improved from 20% to 25%.
    • Low precision in the 0-1% recall range.
    • To investigate, inspected the top 10 extractions marked wrong: they were correct but not present in Freebase.
-
Experiments
• Results
  • Sentential extraction
    • Riedel et al didn't report sentential results.
    • MultiR: high precision and recall.
    • MultiR: F1 score of 60.5%.
-
Experiments
• Results
  • Relation-specific results
    • Take the 10 most frequent relations.
    • S_r^M : sentences from which MultiR extracted relation r.
    • S_r^F : sentences whose arguments match Freebase for relation r.
    • Sample 100 sentences from each set.
    • Compute accuracy, precision, and recall.
-
Experiments
Effect of modeling overlapping relations
-
Discussion
• Relies only on Freebase for experimental evaluation [Nupur et al]
• Assumes that if a fact is present in text, then it must be present in KB [Dinesh Raghu]
• Only one relation in a sentence [Barun]
• Assume entities occur as NP only. [Gagan]
• Should use sampling instead of argmax as done in Riedel et al. [Happy, Barun]
• Evaluation problem : Only 3% sentences match in Freebase [Gagan]
• For sentential extraction evaluation, sampled only 1000 sentences.
• Separate graph for every entity pair : Scaling issue [Prachi]
-
Possible Extensions
• Evaluate on some other datasets as well, like Google knowledge graph [Anshul, Rishabh]
• Bootstrapping like NELL [Gagan et al]
• Iteratively correct the facts during learning for 0-1% recall range [Surag]
• Extract entity mentions spanning multiple sentences [Anshul]
• Relation to MLNs : Apply Lifting [Ankit]