Supervision in semantic parsing (ESSLLI 2016, class 3) - Stanford NLP Group, nlp.stanford.edu/joberant/esslli_2016/class3_index.pdf

Outline

• Learning

– Overview

– Details

– Example

– Lexicon learning

– Supervision signals


Supervision in syntactic parsing

Input:

[S [NP [NP ESSLLI 2016] [NP the known summer school]] [VP [V is] [VP [V located] [PP in Bolzano]]]]

Output:

They play football

[S [NP They] [VP [V play] [NP football]]]

Supervision in semantic parsing

Input:

Heavy supervision (utterance → logical form):

How tall is Lebron James? → HeightOf.LebronJames
What is Steph Curry’s daughter called? → ChildrenOf.StephCurry ⊓ Gender.Female
Youngest player of the Cavaliers → argmin(PlayerOf.Cavaliers, BirthDateOf)
...

Light supervision (utterance → denotation):

How tall is Lebron James? → 203cm
What is Steph Curry’s daughter called? → Riley Curry
Youngest player of the Cavaliers → Kyrie Irving
...

Output:

Clay Thompson’s weight

"Clay Thompson" → ClayThompson (lexicon), "weight" → WeightOf (lexicon)
join: WeightOf.ClayThompson, which executes to 205 lbs

[Zelle & Mooney, 1996; Zettlemoyer & Collins, 2005; Clarke et al. 2010; Liang et al., 2011]

Learning in a nutshell

utterance → Parsing → candidate derivations → Label → Update model

0. Define model for derivations

1. Generate candidate derivations (later)

2. Label as correct and incorrect

3. Update model to favor correct trees

Training intuition

Where did Mozart tupress?                     (answer: Vienna)

PlaceOfBirth.WolfgangMozart ⇒ Salzburg
PlaceOfDeath.WolfgangMozart ⇒ Vienna
PlaceOfMarriage.WolfgangMozart ⇒ Vienna

Where did Hogarth tupress?                    (answer: London)

PlaceOfBirth.WilliamHogarth ⇒ London
PlaceOfDeath.WilliamHogarth ⇒ London
PlaceOfMarriage.WilliamHogarth ⇒ Paddington

Only PlaceOfDeath is consistent with both answers, so learning from denotations pushes the model toward interpreting the unknown word "tupress" as PlaceOfDeath.


Constructing derivations

Utterance: people who lived in Chicago

"people" → Type.Person (lexicon)
"lived in" → PlaceLived (lexicon)
"Chicago" → Chicago (lexicon)
join: PlaceLived.Chicago
intersect: Type.Person ⊓ PlaceLived.Chicago

Many possible derivations!

x = people who have lived in Chicago

Set of candidate derivations D(x), e.g.:

Type.Person ⊓ PlaceLived.Chicago
  "people" → Type.Person, "lived in" → PlaceLived, "Chicago" → Chicago (lexicon); join; intersect

Type.Org ⊓ PresentIn.ChicagoMusical
  "people" → Type.Org, "lived in" → PresentIn, "Chicago" → ChicagoMusical (lexicon); join; intersect
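
To make the ambiguity concrete, here is a minimal sketch (not SEMPRE's implementation; the lexicon contents and helper names are invented for illustration) of enumerating candidate logical forms by combining all lexicon matches with join and intersect:

    from itertools import product

    # Toy ambiguous lexicon: each phrase may map to several predicates/entities.
    LEXICON = {
        "people":   ["Type.Person", "Type.Org"],
        "lived in": ["PlaceLived", "PresentIn"],
        "chicago":  ["Chicago", "ChicagoMusical"],
    }

    def candidate_derivations(unary_phrase, binary_phrase, entity_phrase):
        """Enumerate logical forms of the shape unary ⊓ binary.entity."""
        derivations = []
        for unary, binary, entity in product(LEXICON[unary_phrase],
                                             LEXICON[binary_phrase],
                                             LEXICON[entity_phrase]):
            join = f"{binary}.{entity}"               # join rule
            derivations.append(f"{unary} ⊓ {join}")   # intersect rule
        return derivations

    # "people who have lived in Chicago" yields 2*2*2 = 8 candidates, including
    # the correct Type.Person ⊓ PlaceLived.Chicago and the spurious
    # Type.Org ⊓ PresentIn.ChicagoMusical.
    for d in candidate_derivations("people", "lived in", "chicago"):
        print(d)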

x: utterance

d: derivation

(example: x = people who lived in Chicago, d = the derivation of Type.Person ⊓ PlaceLived.Chicago built above via lexicon, join, and intersect)

Feature vector φ(x, d) and parameters θ in R^F (θ is learned):

feature                        φ(x, d)    θ
apply join                     1          1.2
apply intersect                1          0.6
apply lexicon                  3          2.1
lived maps to PlacesLived      1          3.1
lived maps to PlaceOfBirth     0         -0.4
born maps to PlaceOfBirth      0          2.7
...                            ...        ...

Scoreθ(x, d) = φ(x, d)ᵀθ = 1.2·1 + 0.6·1 + 2.1·3 + 3.1·1 + (−0.4)·0 + 2.7·0 + ...
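
A minimal sketch of this linear score, treating φ(x, d) and θ as dictionaries keyed by feature name (the numbers are the ones from the table above; the representation itself is an assumption, not SEMPRE's):

    # Sparse feature vector for this derivation, and (learned) weights.
    phi = {"apply join": 1, "apply intersect": 1, "apply lexicon": 3,
           "lived maps to PlacesLived": 1}
    theta = {"apply join": 1.2, "apply intersect": 0.6, "apply lexicon": 2.1,
             "lived maps to PlacesLived": 3.1, "lived maps to PlaceOfBirth": -0.4,
             "born maps to PlaceOfBirth": 2.7}

    def score(phi, theta):
        """Score_theta(x, d) = phi(x, d)^T theta over the features that fire."""
        return sum(value * theta.get(name, 0.0) for name, value in phi.items())

    print(score(phi, theta))  # 1.2*1 + 0.6*1 + 2.1*3 + 3.1*1 = 11.2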

Deep learning alert!

The feature vector φ(x, d) is constructed by hand

Constructing good features is hard

Algorithms are likely to do it better

Perhaps we can train φ(x, d)

φ(x, d) = F_ψ(x, d), where ψ are the parameters

Log-linear model

Candidate derivations: D(x)

Model: distribution over derivations d given utterance x

pθ(d | x) = exp(Scoreθ(x, d)) / Σ_{d′ ∈ D(x)} exp(Scoreθ(x, d′))

Scoreθ(x, d) over D(x): [1, 2, 3, 4]
pθ(d | x): [e¹, e², e³, e⁴] / (e¹ + e² + e³ + e⁴)

Parsing: find the top-K derivation trees Dθ(x)
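
A quick sketch of the softmax above on the four candidate scores (plain Python, no framework assumed):

    import math

    def log_linear(scores):
        """p_theta(d | x): softmax over the scores of the candidate derivations D(x)."""
        exps = [math.exp(s) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    probs = log_linear([1, 2, 3, 4])
    print([round(p, 3) for p in probs])  # approximately [0.032, 0.087, 0.237, 0.644]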

Features

Dense features:

• intersection=0.67

• ent-popularity:HIGH

• denotation-size:1

Sparse features:

• bridge-binary:STUDY

• born:PlaceOfBirth

• city:Type.Location

Syntactic features:

• ent-pos:NNP NNP

• join-pos:V NN

• skip-pos:IN

Grammar features:

• Binary->Verb

Learning θ: maximum-likelihood

Training data:

Denotations:

What’s Bulgaria’s capital? → Sofia
What movies has Tom Cruise been in? → TopGun, VanillaSky, ...
...

Logical forms:

What’s Bulgaria’s capital? → CapitalOf.Bulgaria
What movies has Tom Cruise been in? → Type.Movie ⊓ HasPlayed.TomCruise
...

argmax_θ Σ_{i=1..n} log pθ(y^(i) | x^(i)) = argmax_θ Σ_{i=1..n} log Σ_{d^(i)} pθ(d^(i) | x^(i)) · R(d^(i))

R(d) = 1 if d.z = z^(i), 0 otherwise             (logical-form supervision)
R(d) = 1 if [d.z]_K = y^(i), 0 otherwise          (denotation supervision; [·]_K is execution against the KB)
R(d) = F1([d.z]_K, y^(i))                         (partial credit over denotation sets)
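
As a sketch, the three rewards might look as follows; `execute` (evaluating a logical form against the KB) is an assumed helper, and F1 is computed over denotation sets:

    def reward_logical_form(d_z, gold_z):
        """Heavy supervision: exact match against the annotated logical form."""
        return 1.0 if d_z == gold_z else 0.0

    def reward_denotation(d_z, gold_y, execute):
        """Light supervision: the logical form must execute to the gold denotation."""
        return 1.0 if execute(d_z) == gold_y else 0.0

    def reward_f1(d_z, gold_y, execute):
        """Partial credit: F1 between predicted and gold denotation sets."""
        pred, gold = set(execute(d_z)), set(gold_y)
        if not pred or not gold:
            return 0.0
        precision = len(pred & gold) / len(pred)
        recall = len(pred & gold) / len(gold)
        return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)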

Optimization: stochastic gradient descent

For every example:

O(θ) = log Σ_d pθ(d | x) · R(d)

∇O(θ) = E_{qθ(d|x)}[φ(x, d)] − E_{pθ(d|x)}[φ(x, d)]

pθ(d | x) ∝ exp(φ(x, d)ᵀθ)
qθ(d | x) ∝ exp(φ(x, d)ᵀθ) · R(d)

pθ(D(x)) = [0.2, 0.1, 0.1, 0.6]
R(D(x)) = [1, 0, 0, 1]
qθ(D(x)) = [0.25, 0, 0, 0.75]        (qθ = pθ ◦ R / pθᵀR)

Gradient:

0.05 · φ(x, d1)− 0.1 · φ(x, d2)− 0.1 · φ(x, d3) + 0.15 · φ(x, d4)
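
A sketch of this computation on the toy numbers above, with feature vectors as numpy arrays (the four feature vectors are placeholders):

    import numpy as np

    p = np.array([0.2, 0.1, 0.1, 0.6])   # p_theta over D(x)
    R = np.array([1.0, 0.0, 0.0, 1.0])   # rewards
    q = p * R / (p @ R)                   # renormalize mass onto correct derivations
    print(q)                              # [0.25 0.   0.   0.75]

    # Placeholder feature vectors phi(x, d_i), one row per candidate derivation.
    Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])

    # Gradient E_q[phi] - E_p[phi]: each derivation contributes (q_i - p_i) * phi(x, d_i).
    grad = (q - p) @ Phi
    print(grad)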

Training

Input: {(x_i, y_i)}_{i=1..n}
Output: θ

θ ← 0
for iteration τ and example i:
    D(x_i) ← argmax-K pθ(d | x_i)      (K-best parsing)
    θ ← θ + η_{τ,i} · (E_{qθ(d|x_i)}[φ(x_i, d)] − E_{pθ(d|x_i)}[φ(x_i, d)])

η_{τ,i}: learning rate

Regularization often added (L2, L1, ...)
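
Putting the pieces together, a minimal sketch of this SGD loop; `parse_topk`, `features`, and `reward` are assumed helpers standing in for D(x_i), φ, and R:

    import numpy as np

    def train(examples, parse_topk, features, reward, dim, epochs=5, lr=0.1):
        """SGD on the log-likelihood of correct derivations (as on the slide)."""
        theta = np.zeros(dim)
        for _ in range(epochs):                      # iteration tau
            for x, y in examples:                    # example i
                derivs = parse_topk(x, theta)        # D(x_i): K-best derivations
                Phi = np.array([features(x, d) for d in derivs])
                scores = Phi @ theta
                p = np.exp(scores - scores.max())
                p /= p.sum()                         # p_theta(d | x_i)
                r = np.array([reward(d, y) for d in derivs])
                if r @ p == 0:                       # no correct derivation in the beam
                    continue
                q = p * r / (p @ r)                  # q_theta(d | x_i)
                theta += lr * (q - p) @ Phi          # E_q[phi] - E_p[phi]
        return theta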

Training (structured perceptron)

Input: {(x_i, y_i)}_{i=1..n}
Output: θ

θ ← 0
for iteration τ and example i:
    d̂ ← argmax_d pθ(d | x_i)
    d* ← argmax_d qθ(d | x_i)
    if [d*]_K ≠ [d̂]_K:
        θ ← θ + φ(x_i, d*) − φ(x_i, d̂)

Regularization often added with weight averaging
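
And a sketch of the structured-perceptron variant under the same assumptions (`parse_topk`, `features`, `reward`, and `denotation` are placeholders):

    import numpy as np

    def train_perceptron(examples, parse_topk, features, reward, denotation, dim, epochs=5):
        """Structured perceptron: update toward the best correct derivation."""
        theta = np.zeros(dim)
        for _ in range(epochs):
            for x, y in examples:
                derivs = parse_topk(x, theta)
                scores = [features(x, d) @ theta for d in derivs]
                d_hat = derivs[int(np.argmax(scores))]                      # model's best
                correct = [(s, d) for s, d in zip(scores, derivs) if reward(d, y) > 0]
                if not correct:
                    continue
                d_star = max(correct, key=lambda t: t[0])[1]                # best correct
                if denotation(d_star) != denotation(d_hat):                 # [d*]_K != [d_hat]_K
                    theta += features(x, d_star) - features(x, d_hat)
        return theta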


Training

Other simple variants exist:

• E.g., cost-sensitive max-margin training

• That is, find pairs of good and bad derivations that look different but have similar scores, and update on those


Example

./run @mode=simple-lambdadcs \
  -Grammar.inPaths esslli_2016/class3_demo.grammar \
  -SimpleLexicon.inPaths esslli_2016/class3_demo.lexicon

(loadgraph geo880/geo880.kg)

size of california

size capital california

size of capital of california

california size


Exercise

Find a pair of natural language utterances that cannot be distinguished using the current feature representation

• The utterances need not be fully grammatical English

• You can ignore the denotation feature if that helps

• Verify this in sempre (ask me how to disable features)

• Design a feature that will solve this problem


The lexicon problem

How is the lexicon generated?

• Annotation

• Exhaustive search

• String matching

• Supervised alignment

• Unsupervised alignment

• Learning


Training

Input: {(x_i, y_i)}_{i=1..n}
Output: θ

θ ← 0
for iteration τ and example i:
    Add lexicon entries
    D(x_i) ← argmax-K pθ(d | x_i)
    θ ← θ + η_{τ,i} · (E_{qθ(d|x_i)}[φ(x_i, d)] − E_{pθ(d|x_i)}[φ(x_i, d)])

η_{τ,i}: learning rate

Regularization often added (L2, L1, ...)

Adding lexicon entries

Input: training example (x_i, y_i), current lexicon Λ, model θ

Λ_temp ← Λ ∪ GENLEX(x_i, y_i)                       (create expanded temporary lexicon)
D(x_i) ← argmax-K p_{θ,Λ_temp}(d | x_i)              (parse with the temporary lexicon)
Λ ← Λ ∪ {l | l ∈ d̂, d̂ ∈ D(x_i), R(d̂) = 1}            (add entries from correct trees)

Overgenerate lexical entries and add promising ones

[Adapted from semantic parsing tutorial, Artzi et al.]
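
A sketch of this update on one example; GENLEX, the parser, and the reward are stand-ins, and the lexicon is assumed to be a set of entries:

    def add_lexicon_entries(x, y, lexicon, genlex, parse_topk_with, reward):
        """One GENLEX-style update of the lexicon on a single training example."""
        temp_lexicon = lexicon | genlex(x, y)           # over-generate candidate entries
        derivations = parse_topk_with(x, temp_lexicon)  # parse with the temporary lexicon
        for d in derivations:
            if reward(d, y) == 1:                       # correct derivation
                lexicon |= set(d.lexical_entries)       # keep the entries it used
        return lexicon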

Lexicon generation

Logical form supervision:

Largest state bordering California

argmax(Type(State) ⊓ Border(California), Area)

Enumerate spans

Largest

state

bordering

California

Largest state

. . .

Use rules to extract sub-formulas

California

Border(California)

λf.f(California)

Area

λx.argmax(x,Area)

. . .

Add cross-product to lexicon

[Zettlemoyer and Collins, 2005]
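
A toy sketch of GENLEX under logical-form supervision: pair every span of the utterance with every sub-formula; the span enumeration is real code, while `subformulas` is hard-coded here in place of the rule-based extraction:

    from itertools import product

    def spans(tokens, max_len=3):
        """All contiguous spans of the utterance up to max_len tokens."""
        return [" ".join(tokens[i:j])
                for i in range(len(tokens))
                for j in range(i + 1, min(len(tokens), i + max_len) + 1)]

    utterance = "Largest state bordering California".split()

    # Sub-formulas that rule-based extraction might pull out of
    # argmax(Type(State) ⊓ Border(California), Area) -- hard-coded here.
    subformulas = ["California", "Border(California)", "λf.f(California)",
                   "Area", "λx.argmax(x, Area)", "Type(State)"]

    # Candidate lexicon = cross product of spans and sub-formulas.
    candidate_lexicon = set(product(spans(utterance), subformulas))
    print(len(candidate_lexicon), "candidate lexical entries")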

Lexicon generation

Denotation supervision:

Largest state bordering California

Arizona

Enumerate spans

Largest

state

bordering

California

Largest state

. . .

Generate sub-formulas from KB

California

Border(California)

Traverse

Type.Mountain

λx.argmax(x,Elevation)

. . .

Restrict candidates with alignment, string matching, ...

Fancier methods exist (coarse-to-fine)

Unification

Logical form supervision:

Initialize lexicon with (x_i, z_i):

States bordering California

Type(State) ⊓ Border(California)

Split lexical entry in all possible ways:

Enumerate spans

(states, bordering california)

(states bordering, california)

Generate sub-formulas from KB

(Type(State), λx.x ⊓ Border(California))

(λx.Type(State) ⊓ x, Border(California))

(λf.Type(State) ⊓ f(California), California)

. . .

[Kwiatkowski et al, 2010]

Unification

For each example (x_i, z_i):

Find highest scoring correct parse d∗

Split all lexical entries in d∗ in all possible ways

Add to the lexicon the lexical entry that most improves the parse score

[Kwiatkowski et al, 2010]

Do we need a lexicon?

Utterance: California neighbors

Type(State) ⊓ Border(California)
  Type(State) built from Type and State
  Border(California) built from Border and California
  Only "California" is anchored to a word; Type, State, and Border float.

Floating parse tree: generalization of bridging

Perhaps with better learning and search a lexicon is not necessary?


Supervision signals

We discussed training from logical forms and denotations

Other forms of supervision have been proposed:

• Demonstrations

• Distant supervision

• Conversations

• Unsupervised

• Paraphrasing

Training from demonstrations

Input: (x_i, s_i, t_i)

x_i: utterance

s_i: start state

t_i: end state

move forward until you reach the intersection

λa.move(a) ∧ dir(a.forward) . . .

An instance of learning from denotations

[Artzi and Zettlemoyer, 2013]
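
Seen as learning from denotations, the reward only checks whether executing the derived program from the start state reaches the demonstrated end state; a sketch (the `execute` simulator is an assumption):

    def demonstration_reward(derivation, start_state, end_state, execute):
        """R(d) = 1 iff executing d's program from s_i reaches the end state t_i."""
        return 1.0 if execute(derivation, start_state) == end_state else 0.0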

Distant supervision

Data generation:

• Decompose declarative text into questions and answers

James Cameron is the director of Titanic

Q: X is the director of Titanic A: James Cameron

Declarative text is cheap!

[Reddy et al., 2014]

Distant supervision

Training:

• Use existing non-executable semantic parsers

X is the director of Titanic

λx.director(x) ∧ director.of.arg1(e, x) ∧ director.of.arg2(e,Titanic)

λx.Director(x) ∧ FilmDirectedBy(e, x) ∧ FilmDirected(e, Titanic)

λx.Producer(x) ∧ FilmProducedBy(e, x) ∧ FilmProduced(e, Titanic)

James Cameron → true

James Cameron, Jon Landau → false

[Reddy et al., 2014]

Training from conversations

z1: From(Atlanta) ⊓ To(London)

z2: From(Atlanta) ⊓ From(London)

z3: To(Atlanta) ⊓ To(London)

z4: To(Atlanta) ⊓ From(London)

Define loss:

• Does z align with the conversation?

• Does z obey domain constraints?

Unsupervised learning

Intuition: assume repeating patterns are correct

Input: {x_i}_{i=1..n}
Output: θ

θ initialized manually, S = ∅

Until stopping criterion met:
    for example x_i:
        S ← S ∪ {(x_i, argmax_d pθ(d | x_i))}
    Compute statistics of S
    S_conf ← find confident subset
    Train on S_conf

Performance is substantially lower than with direct supervision

[Goldwasser et al., 2011]
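
A sketch of this self-training loop; the confidence heuristic here (trust parses whose top-level predicate repeats across many examples) is a simplification of the statistics used by Goldwasser et al.:

    from collections import Counter

    def unsupervised_train(utterances, parse_best, train_supervised, theta, rounds=3):
        """Self-training: parse, keep 'confident' parses, retrain on them."""
        for _ in range(rounds):
            S = [(x, parse_best(x, theta)) for x in utterances]   # current best parses
            # Stand-in for "repeating patterns are correct": trust parses whose
            # top-level predicate occurs often across the corpus.
            counts = Counter(d.top_predicate for _, d in S)
            S_conf = [(x, d) for x, d in S if counts[d.top_predicate] >= 3]
            theta = train_supervised(S_conf, theta)                # train on confident subset
        return theta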

Paraphrasing

x: What languages do people in Brazil use?

Candidate canonical utterances and logical forms:
  What language is the language of Brazil? ... What city is the capital of Brazil?
  Type.HumanLanguage ⊓ LanguagesSpoken.Brazil ... CapitalOf.Brazil

The paraphrase model scores each (x, canonical utterance) pair; executing the winner gives: Portuguese, ...

Model: pθ(c, z | x) = p(z | x) × pθ(c | x)

Idea: train a large paraphrase model pθ(c | x)

More later

[Berant and Liang, 2014]
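
A sketch of the factored model: candidate (canonical utterance, logical form) pairs are generated, and the trainable part scores how well the canonical utterance paraphrases x; a toy token-overlap score stands in for the real paraphrase model here:

    def paraphrase_score(x, c):
        """Toy stand-in for p_theta(c | x): token overlap between x and the canonical utterance."""
        xs = set(x.lower().replace("?", "").split())
        cs = set(c.lower().replace("?", "").split())
        return len(xs & cs) / len(xs | cs)

    def best_logical_form(x, candidates):
        """candidates: list of (canonical utterance c, logical form z) generated from the KB."""
        return max(candidates, key=lambda cz: paraphrase_score(x, cz[0]))

    x = "What languages do people in Brazil use?"
    candidates = [
        ("What language is the language of Brazil?", "Type.HumanLanguage ⊓ LanguagesSpoken.Brazil"),
        ("What city is the capital of Brazil?", "CapitalOf.Brazil"),
    ]
    print(best_logical_form(x, candidates))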


Summary

We saw how to train from denotations and logical forms

We saw methods for inducing lexicons during training

We reviewed work on how to use even weaker forms of supervision

Still an open problem
