Supervision in semantic parsing (ESSLLI 2016, class 3) - Stanford NLP Group, nlp.stanford.edu/joberant/esslli_2016/class3_index.pdf

Outline

• Learning

– Overview

– Details

– Example

– Lexicon learning

– Supervision signals


Supervision in syntactic parsing

Input:

[S [NP [NP ESSLLI 2016] [NP the known summer school]] [VP [V is] [VP [V located] [PP in Bolzano]]]]

Output:

They play football

[S [NP They] [VP [V play] [NP football]]]

Supervision in semantic parsing

Input:

Heavy supervision (utterance → logical form):

How tall is Lebron James? → HeightOf.LebronJames
What is Steph Curry’s daughter called? → ChildrenOf.StephCurry ⊓ Gender.Female
Youngest player of the Cavaliers → argmin(PlayerOf.Cavaliers, BirthDateOf)
...

Light supervision (utterance → denotation):

How tall is Lebron James? → 203cm
What is Steph Curry’s daughter called? → Riley Curry
Youngest player of the Cavaliers → Kyrie Irving
...

Output:

Clay Thompson’s weight

"Clay Thompson" → ClayThompson (lexicon), "weight" → WeightOf (lexicon)
join: WeightOf.ClayThompson, which executes to 205 lbs

[Zelle & Mooney, 1996; Zettlemoyer & Collins, 2005; Clarke et al. 2010; Liang et al., 2011]

Learning in a nutshell

utterance → Parsing → candidate derivations → Label → Update model

0. Define model for derivations

1. Generate candidate derivations (later)

2. Label as correct and incorrect

3. Update model to favor correct trees

Training intuition

Where did Mozart tupress?                     (answer: Vienna)

PlaceOfBirth.WolfgangMozart ⇒ Salzburg
PlaceOfDeath.WolfgangMozart ⇒ Vienna
PlaceOfMarriage.WolfgangMozart ⇒ Vienna

Where did Hogarth tupress?                    (answer: London)

PlaceOfBirth.WilliamHogarth ⇒ London
PlaceOfDeath.WilliamHogarth ⇒ London
PlaceOfMarriage.WilliamHogarth ⇒ Paddington

Only PlaceOfDeath is consistent with both answers, so learning from denotations pushes the model toward interpreting the unknown word "tupress" as PlaceOfDeath.


Constructing derivations

Utterance: people who lived in Chicago

"people" → Type.Person (lexicon)
"lived in" → PlaceLived (lexicon)
"Chicago" → Chicago (lexicon)
join: PlaceLived.Chicago
intersect: Type.Person ⊓ PlaceLived.Chicago

Many possible derivations!

x = people who have lived in Chicago

Set of candidate derivations D(x), e.g.:

Type.Person ⊓ PlaceLived.Chicago
  "people" → Type.Person, "lived in" → PlaceLived, "Chicago" → Chicago (lexicon); join; intersect

Type.Org ⊓ PresentIn.ChicagoMusical
  "people" → Type.Org, "lived in" → PresentIn, "Chicago" → ChicagoMusical (lexicon); join; intersect
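
To make the ambiguity concrete, here is a minimal sketch (not SEMPRE's implementation; the lexicon contents and helper names are invented for illustration) of enumerating candidate logical forms by combining all lexicon matches with join and intersect:

    from itertools import product

    # Toy ambiguous lexicon: each phrase may map to several predicates/entities.
    LEXICON = {
        "people":   ["Type.Person", "Type.Org"],
        "lived in": ["PlaceLived", "PresentIn"],
        "chicago":  ["Chicago", "ChicagoMusical"],
    }

    def candidate_derivations(unary_phrase, binary_phrase, entity_phrase):
        """Enumerate logical forms of the shape unary ⊓ binary.entity."""
        derivations = []
        for unary, binary, entity in product(LEXICON[unary_phrase],
                                             LEXICON[binary_phrase],
                                             LEXICON[entity_phrase]):
            join = f"{binary}.{entity}"               # join rule
            derivations.append(f"{unary} ⊓ {join}")   # intersect rule
        return derivations

    # "people who have lived in Chicago" yields 2*2*2 = 8 candidates, including
    # the correct Type.Person ⊓ PlaceLived.Chicago and the spurious
    # Type.Org ⊓ PresentIn.ChicagoMusical.
    for d in candidate_derivations("people", "lived in", "chicago"):
        print(d)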

x: utterance

d: derivation

(example: x = people who lived in Chicago, d = the derivation of Type.Person ⊓ PlaceLived.Chicago built above via lexicon, join, and intersect)

Feature vector φ(x, d) and parameters θ in R^F (θ is learned):

feature                        φ(x, d)    θ
apply join                     1          1.2
apply intersect                1          0.6
apply lexicon                  3          2.1
lived maps to PlacesLived      1          3.1
lived maps to PlaceOfBirth     0         -0.4
born maps to PlaceOfBirth      0          2.7
...                            ...        ...

Scoreθ(x, d) = φ(x, d)ᵀθ = 1.2·1 + 0.6·1 + 2.1·3 + 3.1·1 + (−0.4)·0 + 2.7·0 + ...
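
A minimal sketch of this linear score, treating φ(x, d) and θ as dictionaries keyed by feature name (the numbers are the ones from the table above; the representation itself is an assumption, not SEMPRE's):

    # Sparse feature vector for this derivation, and (learned) weights.
    phi = {"apply join": 1, "apply intersect": 1, "apply lexicon": 3,
           "lived maps to PlacesLived": 1}
    theta = {"apply join": 1.2, "apply intersect": 0.6, "apply lexicon": 2.1,
             "lived maps to PlacesLived": 3.1, "lived maps to PlaceOfBirth": -0.4,
             "born maps to PlaceOfBirth": 2.7}

    def score(phi, theta):
        """Score_theta(x, d) = phi(x, d)^T theta over the features that fire."""
        return sum(value * theta.get(name, 0.0) for name, value in phi.items())

    print(score(phi, theta))  # 1.2*1 + 0.6*1 + 2.1*3 + 3.1*1 = 11.2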

Deep learning alert!

The feature vector φ(x, d) is constructed by hand

Constructing good features is hard

Algorithms are likely to do it better

Perhaps we can train φ(x, d)

φ(x, d) = F_ψ(x, d), where ψ are the parameters

Log-linear model

Candidate derivations: D(x)

Model: distribution over derivations d given utterance x

pθ(d | x) = exp(Scoreθ(x, d)) / Σ_{d′ ∈ D(x)} exp(Scoreθ(x, d′))

Scoreθ(x, d) over D(x): [1, 2, 3, 4]
pθ(d | x): [e¹, e², e³, e⁴] / (e¹ + e² + e³ + e⁴)

Parsing: find the top-K derivation trees Dθ(x)
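
A quick sketch of the softmax above on the four candidate scores (plain Python, no framework assumed):

    import math

    def log_linear(scores):
        """p_theta(d | x): softmax over the scores of the candidate derivations D(x)."""
        exps = [math.exp(s) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    probs = log_linear([1, 2, 3, 4])
    print([round(p, 3) for p in probs])  # approximately [0.032, 0.087, 0.237, 0.644]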

Features

Dense features:

• intersection=0.67

• ent-popularity:HIGH

• denotation-size:1

Sparse features:

• bridge-binary:STUDY

• born:PlaceOfBirth

• city:Type.Location

Syntactic features:

• ent-pos:NNP NNP

• join-pos:V NN

• skip-pos:IN

Grammar features:

• Binary->Verb

Learning θ: maximum-likelihood

Training data:

Denotations:

What’s Bulgaria’s capital? → Sofia
What movies has Tom Cruise been in? → TopGun, VanillaSky, ...
...

Logical forms:

What’s Bulgaria’s capital? → CapitalOf.Bulgaria
What movies has Tom Cruise been in? → Type.Movie ⊓ HasPlayed.TomCruise
...

argmax_θ Σ_{i=1..n} log pθ(y^(i) | x^(i)) = argmax_θ Σ_{i=1..n} log Σ_{d^(i)} pθ(d^(i) | x^(i)) · R(d^(i))

R(d) = 1 if d.z = z^(i), 0 otherwise             (logical-form supervision)
R(d) = 1 if [d.z]_K = y^(i), 0 otherwise          (denotation supervision; [·]_K is execution against the KB)
R(d) = F1([d.z]_K, y^(i))                         (partial credit over denotation sets)
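
As a sketch, the three rewards might look as follows; `execute` (evaluating a logical form against the KB) is an assumed helper, and F1 is computed over denotation sets:

    def reward_logical_form(d_z, gold_z):
        """Heavy supervision: exact match against the annotated logical form."""
        return 1.0 if d_z == gold_z else 0.0

    def reward_denotation(d_z, gold_y, execute):
        """Light supervision: the logical form must execute to the gold denotation."""
        return 1.0 if execute(d_z) == gold_y else 0.0

    def reward_f1(d_z, gold_y, execute):
        """Partial credit: F1 between predicted and gold denotation sets."""
        pred, gold = set(execute(d_z)), set(gold_y)
        if not pred or not gold:
            return 0.0
        precision = len(pred & gold) / len(pred)
        recall = len(pred & gold) / len(gold)
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)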

Optimization: stochastic gradient descent

For every example:

O(θ) = log Σ_d pθ(d | x) · R(d)

∇O(θ) = E_{qθ(d|x)}[φ(x, d)] − E_{pθ(d|x)}[φ(x, d)]

pθ(d | x) ∝ exp(φ(x, d)ᵀθ)
qθ(d | x) ∝ exp(φ(x, d)ᵀθ) · R(d)

pθ(D(x)) = [0.2, 0.1, 0.1, 0.6]
R(D(x)) = [1, 0, 0, 1]
qθ(D(x)) = [0.25, 0, 0, 0.75]        (qθ = pθ ◦ R / pθᵀR)

Gradient:

0.05 · φ(x, d1)− 0.1 · φ(x, d2)− 0.1 · φ(x, d3) + 0.15 · φ(x, d4)
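
A sketch of this computation on the toy numbers above, with feature vectors as numpy arrays (the four feature vectors are placeholders):

    import numpy as np

    p = np.array([0.2, 0.1, 0.1, 0.6])   # p_theta over D(x)
    R = np.array([1.0, 0.0, 0.0, 1.0])   # rewards
    q = p * R / (p @ R)                   # renormalize mass onto correct derivations
    print(q)                              # [0.25 0.   0.   0.75]

    # Placeholder feature vectors phi(x, d_i), one row per candidate derivation.
    Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

    # Gradient E_q[phi] - E_p[phi]: each derivation contributes (q_i - p_i) * phi(x, d_i).
    grad = (q - p) @ Phi
    print(grad)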

Training

Input: {(x_i, y_i)}_{i=1..n}
Output: θ

θ ← 0
for iteration τ and example i:
    D(x_i) ← argmax-K pθ(d | x_i)      (K-best parsing)
    θ ← θ + η_{τ,i} · (E_{qθ(d|x_i)}[φ(x_i, d)] − E_{pθ(d|x_i)}[φ(x_i, d)])

η_{τ,i}: learning rate

Regularization often added (L2, L1, ...)
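
Putting the pieces together, a minimal sketch of this SGD loop; `parse_topk`, `features`, and `reward` are assumed helpers standing in for D(x_i), φ, and R:

    import numpy as np

    def train(examples, parse_topk, features, reward, dim, epochs=5, lr=0.1):
        """SGD on the log-likelihood of correct derivations (as on the slide)."""
        theta = np.zeros(dim)
        for _ in range(epochs):                      # iteration tau
            for x, y in examples:                    # example i
                derivs = parse_topk(x, theta)        # D(x_i): K-best derivations
                Phi = np.array([features(x, d) for d in derivs])
                scores = Phi @ theta
                p = np.exp(scores - scores.max())
                p /= p.sum()                         # p_theta(d | x_i)
                r = np.array([reward(d, y) for d in derivs])
                if r @ p == 0:                       # no correct derivation in the beam
                    continue
                q = p * r / (p @ r)                  # q_theta(d | x_i)
                theta += lr * (q - p) @ Phi          # E_q[phi] - E_p[phi]
        return theta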

Training (structured perceptron)

Input: {(x_i, y_i)}_{i=1..n}
Output: θ

θ ← 0
for iteration τ and example i:
    d̂ ← argmax_d pθ(d | x_i)
    d* ← argmax_d qθ(d | x_i)
    if [d*]_K ≠ [d̂]_K:
        θ ← θ + φ(x_i, d*) − φ(x_i, d̂)

Regularization often added with weight averaging
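
And a sketch of the structured-perceptron variant under the same assumptions (`parse_topk`, `features`, `reward`, and `denotation` are placeholders):

    import numpy as np

    def train_perceptron(examples, parse_topk, features, reward, denotation, dim, epochs=5):
        """Structured perceptron: update toward the best correct derivation."""
        theta = np.zeros(dim)
        for _ in range(epochs):
            for x, y in examples:
                derivs = parse_topk(x, theta)
                scores = [features(x, d) @ theta for d in derivs]
                d_hat = derivs[int(np.argmax(scores))]                      # model's best
                correct = [(s, d) for s, d in zip(scores, derivs) if reward(d, y) > 0]
                if not correct:
                    continue
                d_star = max(correct, key=lambda t: t[0])[1]                # best correct
                if denotation(d_star) != denotation(d_hat):                 # [d*]_K != [d_hat]_K
                    theta += features(x, d_star) - features(x, d_hat)
        return theta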


Training

Other simple variants exist:

• E.g., cost-sensitive max-margin training

• That is, find pairs of good and bad derivations that look different but have similar scores, and update on those


Example

./run @mode=simple-lambdadcs \
  -Grammar.inPaths esslli_2016/class3_demo.grammar \
  -SimpleLexicon.inPaths esslli_2016/class3_demo.lexicon

(loadgraph geo880/geo880.kg)

size of california

size capital california

size of capital of california

california size


Exercise

Find a pair of natural language utterances that cannot be distinguished using the current feature representation

• The utterances need not be fully grammatical English

• You can ignore the denotation feature if that helps

• Verify this in sempre (ask me how to disable features)

• Design a feature that will solve this problem


The lexicon problem

How is the lexicon generated?

• Annotation

• Exhaustive search

• String matching

• Supervised alignment

• Unsupervised alignment

• Learning


Training

Input: {(x_i, y_i)}_{i=1..n}
Output: θ

θ ← 0
for iteration τ and example i:
    Add lexicon entries
    D(x_i) ← argmax-K pθ(d | x_i)
    θ ← θ + η_{τ,i} · (E_{qθ(d|x_i)}[φ(x_i, d)] − E_{pθ(d|x_i)}[φ(x_i, d)])

η_{τ,i}: learning rate

Regularization often added (L2, L1, ...)

Adding lexicon entries

Input: training example (x_i, y_i), current lexicon Λ, model θ

Λ_temp ← Λ ∪ GENLEX(x_i, y_i)                       (create expanded temporary lexicon)
D(x_i) ← argmax-K p_{θ,Λ_temp}(d | x_i)              (parse with the temporary lexicon)
Λ ← Λ ∪ {l | l ∈ d̂, d̂ ∈ D(x_i), R(d̂) = 1}            (add entries from correct trees)

Overgenerate lexical entries and add promising ones

[Adapted from semantic parsing tutorial, Artzi et al.]
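
A sketch of this update on one example; GENLEX, the parser, and the reward are stand-ins, and the lexicon is assumed to be a set of entries:

    def add_lexicon_entries(x, y, lexicon, genlex, parse_topk_with, reward):
        """One GENLEX-style update of the lexicon on a single training example."""
        temp_lexicon = lexicon | genlex(x, y)           # over-generate candidate entries
        derivations = parse_topk_with(x, temp_lexicon)  # parse with the temporary lexicon
        for d in derivations:
            if reward(d, y) == 1:                       # correct derivation
                lexicon |= set(d.lexical_entries)       # keep the entries it used
        return lexicon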

Lexicon generation

Logical form supervision:

Largest state bordering California

argmax(Type(State) ⊓ Border(California), Area)

Enumerate spans

Largest

state

bordering

California

Largest state

. . .

Use rules to extract sub-formulas

California

Border(California)

λf.f(California)

Area

λx.argmax(x,Area)

. . .

Add cross-product to lexicon

[Zettlemoyer and Collins, 2005]
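
A toy sketch of GENLEX under logical-form supervision: pair every span of the utterance with every sub-formula; the span enumeration is real code, while `subformulas` is hard-coded here in place of the rule-based extraction:

    from itertools import product

    def spans(tokens, max_len=3):
        """All contiguous spans of the utterance up to max_len tokens."""
        return [" ".join(tokens[i:j])
                for i in range(len(tokens))
                for j in range(i + 1, min(len(tokens), i + max_len) + 1)]

    utterance = "Largest state bordering California".split()

    # Sub-formulas that rule-based extraction might pull out of
    # argmax(Type(State) ⊓ Border(California), Area) -- hard-coded here.
    subformulas = ["California", "Border(California)", "λf.f(California)",
                   "Area", "λx.argmax(x, Area)", "Type(State)"]

    # Candidate lexicon = cross product of spans and sub-formulas.
    candidate_lexicon = set(product(spans(utterance), subformulas))
    print(len(candidate_lexicon), "candidate lexical entries")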

Lexicon generation

Denotation supervision:

Largest state bordering California

Arizona

Enumerate spans

Largest

state

bordering

California

Largest state

. . .

Generate sub-formulas from KB

California

Border(California)

Traverse

Type.Mountain

λx.argmax(x,Elevation)

. . .

Restrict candidates with alignment, string matching, ...

Fancier methods exist (coarse-to-fine)

Unification

Logical form supervision:

Initialize lexicon with (x_i, z_i):

States bordering California

Type(State) ⊓ Border(California)

Split lexical entry in all possible ways:

Enumerate spans

(states, bordering california)

(states bordering, california)

Generate sub-formulas from KB

(Type(State), λx.x ⊓ Border(California))

(λx.Type(State) ⊓ x, Border(California))

(λf.Type(State) ⊓ f(California), California)

. . .

[Kwiatkowski et al, 2010]

Unification

For each example (x_i, z_i):

Find highest scoring correct parse d∗

Split all lexical entries in d∗ in all possible ways

Add to the lexicon the lexical entry that most improves the parse score

[Kwiatkowski et al, 2010]

Do we need a lexicon?

Utterance: California neighbors

Type(State) ⊓ Border(California)
  Type(State) built from Type and State
  Border(California) built from Border and California
  Only "California" is anchored to a word; Type, State, and Border float.

Floating parse tree: generalization of bridging

Perhaps with better learning and search a lexicon is not necessary?


Supervision signals

We discussed training from logical forms and denotations

Other forms of supervision have been proposed:

• Demonstrations

• Distant supervision

• Conversations

• Unsupervised

• Paraphrasing

Training from demonstrations

Input: (x_i, s_i, t_i)

x_i: utterance

s_i: start state

t_i: end state

move forward until you reach the intersection

λa.move(a) ∧ dir(a.forward) . . .

An instance of learning from denotations

[Artzi and Zettlemoyer, 2013]
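
Seen as learning from denotations, the reward only checks whether executing the derived program from the start state reaches the demonstrated end state; a sketch (the `execute` simulator is an assumption):

    def demonstration_reward(derivation, start_state, end_state, execute):
        """R(d) = 1 iff executing d's program from s_i reaches the end state t_i."""
        return 1.0 if execute(derivation, start_state) == end_state else 0.0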

Distant supervision

Data generation:

• Decompose declarative text into questions and answers

James Cameron is the director of Titanic

Q: X is the director of Titanic A: James Cameron

Declarative text is cheap!

[Reddy et al., 2014]

Distant supervision

Training:

• Use existing non-executable semantic parsers

X is the director of Titanic

λx.director(x) ∧ director.of.arg1(e, x) ∧ director.of.arg2(e,Titanic)

λx.Director(x) ∧ FilmDirectedBy(e, x) ∧ FilmDirected(e, Titanic)

λx.Producer(x) ∧ FilmProducedBy(e, x) ∧ FilmProduced(e, Titanic)

James Cameron → true

James Cameron, Jon Landau → false

[Reddy et al., 2014]

Training from conversations

z1: From(Atlanta) ⊓ To(London)

z2: From(Atlanta) ⊓ From(London)

z3: To(Atlanta) ⊓ To(London)

z4: To(Atlanta) ⊓ From(London)

Define loss:

• Does z align with the conversation?

• Does z obey domain constraints?

Unsupervised learning

Intuition: assume repeating patterns are correct

Input: {x_i}_{i=1..n}
Output: θ

θ initialized manually, S = ∅

Until stopping criterion met:
    for example x_i:
        S ← S ∪ {(x_i, argmax_d pθ(d | x_i))}
    Compute statistics of S
    S_conf ← find confident subset
    Train on S_conf

Performance is substantially lower than with direct supervision

[Goldwasser et al., 2011]
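
A sketch of this self-training loop; the confidence heuristic here (trust parses whose top-level predicate repeats across many examples) is a simplification of the statistics used by Goldwasser et al.:

    from collections import Counter

    def unsupervised_train(utterances, parse_best, train_supervised, theta, rounds=3):
        """Self-training: parse, keep 'confident' parses, retrain on them."""
        for _ in range(rounds):
            S = [(x, parse_best(x, theta)) for x in utterances]   # current best parses
            # Stand-in for "repeating patterns are correct": trust parses whose
            # top-level predicate occurs often across the corpus.
            counts = Counter(d.top_predicate for _, d in S)
            S_conf = [(x, d) for x, d in S if counts[d.top_predicate] >= 3]
            theta = train_supervised(S_conf, theta)                # train on confident subset
        return theta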

Paraphrasing

x: What languages do people in Brazil use?

Candidate canonical utterances and logical forms:
  What language is the language of Brazil? ... What city is the capital of Brazil?
  Type.HumanLanguage ⊓ LanguagesSpoken.Brazil ... CapitalOf.Brazil

The paraphrase model scores each (x, canonical utterance) pair; executing the winner gives: Portuguese, ...

Model: pθ(c, z | x) = p(z | x) × pθ(c | x)

Idea: train a large paraphrase model pθ(c | x)

More later

[Berant and Liang, 2014]
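
A sketch of the factored model: candidate (canonical utterance, logical form) pairs are generated, and the trainable part scores how well the canonical utterance paraphrases x; a toy token-overlap score stands in for the real paraphrase model here:

    def paraphrase_score(x, c):
        """Toy stand-in for p_theta(c | x): token overlap between x and the canonical utterance."""
        xs = set(x.lower().replace("?", "").split())
        cs = set(c.lower().replace("?", "").split())
        return len(xs & cs) / len(xs | cs)

    def best_logical_form(x, candidates):
        """candidates: list of (canonical utterance c, logical form z) generated from the KB."""
        return max(candidates, key=lambda cz: paraphrase_score(x, cz[0]))

    x = "What languages do people in Brazil use?"
    candidates = [
        ("What language is the language of Brazil?", "Type.HumanLanguage ⊓ LanguagesSpoken.Brazil"),
        ("What city is the capital of Brazil?", "CapitalOf.Brazil"),
    ]
    print(best_logical_form(x, candidates))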


Summary

We saw how to train from denotations and logical forms

We saw methods for inducing lexicons during training

We reviewed work on how to use even weaker forms of supervision

Still an open problem
