Transcript of Statistical NLP Winter 2009

Page 1: Statistical NLP Winter  2009

Statistical NLP, Winter 2009

Lecture 17: Unsupervised Learning IV
Tree-structured learning

Roger Levy

[thanks to Jason Eisner and Mike Frank for some slides]

Page 2: Statistical NLP Winter  2009

Inferring tree structure from raw text

• We’ve looked at a few types of unsupervised inference of linguistic structure:
  • Words from unsegmented utterances
  • Semantic word classes from word bags (= documents)
  • Syntactic word classes from word sequences

• But a hallmark of language is recursive structure
• To a first approximation, this is trees
• How might we learn tree structures from raw text?

Page 3: Statistical NLP Winter  2009

Is this the right problem?

• Before proceeding… are we tackling the right problem?
• We’ve already seen how the structures giving high likelihood to raw text may not be the structures we want
• Also, it’s fairly clear that kids don’t learn language structure from linguistic input alone…
• Real learning is cross-modal
• So why study unsupervised grammar induction from raw text?

  words: “blue rings”                               objects: rings, big bird
  words: “and green rings”                          objects: rings, big bird
  words: “and yellow rings”                         objects: rings, big bird
  words: “Bigbird! Do you want to hold the rings?”  objects: big bird

Page 4: Statistical NLP Winter  2009

Why induce grammars from raw text?

• I don’t think this is a trivial question to answer. But there are some answers:
  • Practical: if unsupervised grammar induction worked well, it’d save a whole lot of annotation effort
  • Practical: we need tree structure to get compositional semantics to work
  • Scientific: unsupervised grammar induction may help us place an upper bound on how much grammatical knowledge must be innate
  • Theoretical: the models and algorithms that make unsupervised grammar induction work well may help us with other, related tasks (e.g., machine translation)

Page 5: Statistical NLP Winter  2009

Plan of today

• We’ll start with a look at an early but influential clustering-based method (Clark, 2001)
• We’ll then talk about EM for PCFGs
• Then we’ll talk about the next influential work (setting the bar for all modern work), Klein & Manning (2004)

Page 6: Statistical NLP Winter  2009

Distributional clustering for syntax

• How do we evaluate unsupervised grammar induction?

• Basically like we evaluate supervised parsing: through bracketing accuracy (a small sketch of this metric follows below)

• But this means that we don’t have to focus on inducing a grammar

• We can focus on inducing constituents in sentences instead

• This contrast has been called structure search versus parameter search
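
As a concrete illustration, here is a minimal sketch of unlabeled bracketing precision/recall/F1; the function name and the representation of spans as (start, end) pairs are my own choices, not from the lecture or any particular evaluation script.

def bracket_prf(gold_spans, pred_spans):
    # Unlabeled bracketing evaluation: compare sets of (start, end) spans,
    # ignoring category labels.
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: bracket_prf({(0, 2), (3, 5), (0, 5)}, {(0, 2), (2, 5), (0, 5)})
# gives precision = recall = 2/3.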

Page 7: Statistical NLP Winter  2009

Distributional clustering for syntax (II)

• What makes a good constituent?
• Old insight from Zellig Harris (Chomsky’s teacher): something can appear in a consistent set of contexts
    Sarah saw a dog standing near a tree.
    Sarah saw a big dog standing near a tree.
    Sarah saw me standing near a tree.
    Sarah saw three posts standing near a tree.

• Therefore we could try to characterize the space of possible constituents by the contexts they appear in

• We could then do clustering in this space

Page 8: Statistical NLP Winter  2009

Distributional clustering for syntax (III)

• Clark, 2001: k-means clustering of common tag sequences, each represented by the distribution of contexts it occurs in (a rough sketch follows at the end of this slide)

• Applied to 12 million words from the British National Corpus

• The results were promising:

Example cluster (determiner-initial, noun-phrase-like sequences):
  “the big dog”
  “the very big dog”
  “the dog”
  “the dog near the tree”
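
A minimal sketch of this kind of distributional clustering, assuming a corpus represented as POS-tag sequences; the function names, thresholds, and the use of scikit-learn’s KMeans are illustrative choices, not Clark’s actual implementation.

from collections import Counter, defaultdict
import numpy as np
from sklearn.cluster import KMeans

def context_vectors(tagged_sents, max_len=4, min_count=50):
    # tagged_sents: list of POS-tag sequences, e.g. [['DT','JJ','NN','VBD'], ...]
    # For each frequent tag sequence, collect counts of (left tag, right tag) contexts.
    contexts = defaultdict(Counter)
    for tags in tagged_sents:
        padded = ['<s>'] + tags + ['</s>']
        for i in range(1, len(padded) - 1):
            for j in range(i + 1, min(i + 1 + max_len, len(padded))):
                seq = tuple(padded[i:j])
                contexts[seq][(padded[i - 1], padded[j])] += 1
    freq = {s: c for s, c in contexts.items() if sum(c.values()) >= min_count}
    all_ctx = sorted({ctx for c in freq.values() for ctx in c})
    idx = {ctx: k for k, ctx in enumerate(all_ctx)}
    seqs = sorted(freq)
    X = np.zeros((len(seqs), len(all_ctx)))
    for r, s in enumerate(seqs):
        total = sum(freq[s].values())
        for ctx, n in freq[s].items():
            X[r, idx[ctx]] = n / total          # relative context frequencies
    return seqs, X

def cluster_sequences(seqs, X, k=20):
    # Cluster the context-distribution vectors with k-means.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    return dict(zip(seqs, labels))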

Page 9: Statistical NLP Winter  2009

Distributional clustering for syntax (IV)

• But…there was a problem:

• Clustering picks up both constituents and things that can’t be constituents (distituents)

Example cluster of distituents:
  “big dog the”
  “dog big the”
  “dog the”
  “dog and big”

The method is good at distinguishing constituent types, but bad at weeding out distituents.

Page 10: Statistical NLP Winter  2009

Distributional clustering for syntax (V)

• Clark 2001’s solution: “with real constituents, there is high mutual information between the symbol occurring before the putative constituent and the symbol after…”
• Filter out distituents on the basis of low mutual information (a rough sketch of such a filter follows below)
• (controlling for distance between symbols)

[Figure: mutual information as a function of distance, with constituents showing consistently higher MI than distituents]
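
A minimal sketch of a mutual-information filter in this spirit; the estimator, the threshold, and the omission of the distance control are assumptions on my part rather than Clark’s published procedure.

from collections import Counter
import math

def mutual_information(pairs):
    # pairs: list of (left_tag, right_tag) contexts for one candidate sequence.
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    mi = 0.0
    for (l, r), c in joint.items():
        p_lr = c / n
        mi += p_lr * math.log2(p_lr / ((left[l] / n) * (right[r] / n)))
    return mi

def filter_candidates(candidate_contexts, threshold=0.5):
    # candidate_contexts: dict mapping sequence -> list of (left, right) tags.
    # Keep only candidates whose left/right contexts share enough mutual information.
    return {seq for seq, ctxs in candidate_contexts.items()
            if mutual_information(ctxs) >= threshold}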

Page 11: Statistical NLP Winter  2009

Distributional clustering for syntax (VI)

• The final step: train an SCFG using vanilla maximum-likelihood estimation

• (hand-label the clusters to get interpretability)

[Table: the ten most frequent rules expanding “NP” in the induced grammar]

Page 12: Statistical NLP Winter  2009

Structure vs. parameter search

• This did pretty well
• But the clustering-based structure-search approach is throwing out a hugely important type of information
• (did you see what it is?)
• …bracketing consistency!

    Sarah gave the big dog the most enthusiastic hug.
    (two crossing spans of this sentence can’t both be constituents at once)

• Parameter-estimation approaches automatically incorporate this type of information
• This motivated the work of Klein & Manning (2002)

Page 13: Statistical NLP Winter  2009

Interlude

• Before we discuss later work: how do we do parameter search for PCFGs?
• The EM algorithm is key
• But as with HMM learning, there are exponentially many structures to consider
• We need dynamic programming (but a different sort than for HMMs)

Page 14: Statistical NLP Winter  2009

Review: EM for HMMs

• For the M-step, the fundamental quantities we need are the expected counts of state-to-state transitions

The pieces (for an observation sequence o_1 … o_T):

  α_i(t)          probability of getting from the beginning to state s_i at time t
  β_j(t+1)        probability of getting from state s_j at time t+1 to the end
  a_ij b_ij(o_t)  probability of transitioning from state s_i at time t to state s_j at time t+1, emitting o_t
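
Putting these pieces together gives the standard forward-backward expected-count formula (a worked equation in the notation above, not shown explicitly on the slide):

\xi_t(i,j) \;=\; \frac{\alpha_i(t)\, a_{ij}\, b_{ij}(o_t)\, \beta_j(t+1)}{P(O)},
\qquad
\hat{c}(i \to j) \;=\; \sum_t \xi_t(i,j)

where P(O) is the total probability of the observation sequence under the current model.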

Page 15: Statistical NLP Winter  2009

EM for PCFGs

• In PCFGs, the fundamental quantities needed for the M-step are the expected counts of the various rule rewrites

• Computing them requires the familiar inside probability, plus something new: the outside probability

Page 16: Statistical NLP Winter  2009

Inside & outside probabilities

• In Manning & Schuetze (1999), here’s the picture:

[Figure: a parse tree with a highlighted node N^j; the outside probability covers the structure and words outside N^j, the inside probability covers the words N^j dominates]

Page 17: Statistical NLP Winter  2009

Compute Bottom-Up by CKY

Writing β_X(i,k) for the inside probability of category X over words i+1 … k:

  β_VP(1,5) = p(flies like an arrow | VP) = …
  β_S(0,5)  = p(time flies like an arrow | S)
            = β_NP(0,1) * β_VP(1,5) * p(S → NP VP | S) + …

[Example tree: [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]]]]

  p(this tree | S) = p(S → NP VP | S) * p(NP → time | NP)
                   * p(VP → V PP | VP)
                   * p(V → flies | V) * …

Page 18: Statistical NLP Winter  2009

Compute Bottom-Up by CKY

[CKY chart for “time flies like an arrow”: each cell (i,k) lists the nonterminals that can span words i+1 … k, with their accumulated weights.]

Grammar (rule weights):
  1  S  → NP VP      1  VP → V NP       1  NP → Det N      0  PP → P NP
  6  S  → Vst NP     2  VP → VP PP      2  NP → NP PP
  2  S  → S PP                          3  NP → NP NP

Page 19: Statistical NLP Winter  2009

Compute Bottom-Up by CKY

[Same CKY chart with the weights read as probabilities: a weight w corresponds to a probability 2^-w. The highlighted entry S 2^-22 is one analysis of the whole sentence as an S.]

Grammar (rule probabilities):
  2^-1  S  → NP VP      2^-1  VP → V NP       2^-1  NP → Det N      2^-0  PP → P NP
  2^-6  S  → Vst NP     2^-2  VP → VP PP      2^-2  NP → NP PP
  2^-2  S  → S PP                             2^-3  NP → NP NP

Page 20: Statistical NLP Winter  2009

Compute Bottom-Up by CKY

[Same chart; a second analysis of the whole sentence as an S, with probability 2^-27, is highlighted.]

Page 21: Statistical NLP Winter  2009

The Efficient Version: Add as we go

[Same chart; instead of listing each analysis of a cell separately, each cell keeps a single summed probability per nonterminal. Grammar probabilities as on the previous slides.]

Page 22: Statistical NLP Winter  2009

The Efficient Version: Add as we go

[Same chart, with new analyses being added into existing cell entries: e.g. the whole-sentence S analyses 2^-22 and 2^-27 are summed into a single entry β_S(0,5).]

  β_S(0,5) += β_S(0,2) * β_PP(2,5) * p(S → S PP | S)

Page 23: Statistical NLP Winter  2009

Compute probs bottom-up (CKY)

for width := 2 to n                  (* build smallest first *)
  for i := 0 to n-width              (* start *)
    let k := i + width               (* end *)
    for j := i+1 to k-1              (* middle *)
      for all grammar rules X → Y Z
        β_X(i,k) += p(X → Y Z | X) * β_Y(i,j) * β_Z(j,k)

[Diagram: X spans (i,k), built from Y over (i,j) and Z over (j,k)]

(Some initialization is needed for the width-1 case: set β_X(i,i+1) to p(X → w_{i+1} | X) for each word.)

(What if you changed + to max?)
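
A minimal runnable sketch of this inside pass in Python, assuming a CNF grammar stored in two dicts; the data layout and function name are illustrative, not from the lecture.

from collections import defaultdict

def inside(words, lexical, binary):
    # words:   list of terminals, e.g. ["time", "flies", "like", "an", "arrow"]
    # lexical: dict (X, word) -> p(X -> word)
    # binary:  dict (X, Y, Z)  -> p(X -> Y Z)
    # Returns beta[(X, i, k)] = p(words[i:k] | X).
    n = len(words)
    beta = defaultdict(float)
    # width-1 initialization
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                beta[(X, i, i + 1)] += p
    # build wider spans from smaller ones
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (X, Y, Z), p in binary.items():
                    if beta[(Y, i, j)] and beta[(Z, j, k)]:
                        beta[(X, i, k)] += p * beta[(Y, i, j)] * beta[(Z, j, k)]
    return beta

Replacing += with max (and keeping backpointers) turns this into a Viterbi parser, which is the point of the “+ to max” question above.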

Page 24: Statistical NLP Winter  2009

Inside & Outside Probabilities

  β_VP(1,5) = p(flies like an arrow | VP)     “inside” the VP
  α_VP(1,5) = p(time VP today | S)            “outside” the VP

[Example tree: [S [NP time] [VP [V flies] [PP [P like] [NP [Det an] [N arrow]]]] [NP today]]]

  α_VP(1,5) * β_VP(1,5) = p(time [VP flies like an arrow] today | S)

Page 25: Statistical NLP Winter  2009

Inside & Outside Probabilities

  β_VP(1,5) = p(flies like an arrow | VP)
  α_VP(1,5) = p(time VP today | S)

[Same example tree]

  α_VP(1,5) * β_VP(1,5) = p(time flies like an arrow today & VP(1,5) | S)

Dividing by β_S(0,6) = p(time flies like an arrow today | S) gives

  α_VP(1,5) * β_VP(1,5) / β_S(0,6) = p(VP(1,5) | time flies like an arrow today, S)

Page 26: Statistical NLP Winter  2009

Inside & Outside Probabilities

  β_VP(1,5) = p(flies like an arrow | VP)
  α_VP(1,5) = p(time VP today | S)

[Same example tree]

So α_VP(1,5) * β_VP(1,5) / β_S(0,6) is the probability that there is a VP here, given all of the observed data (words)

  strictly analogous to forward-backward in the finite-state case!

[Figure: an HMM trellis, as in the forward-backward algorithm]

Page 27: Statistical NLP Winter  2009

Inside & Outside Probabilities

  β_V(1,2)  = p(flies | V)
  β_PP(2,5) = p(like an arrow | PP)
  α_VP(1,5) = p(time VP today | S)

[Same example tree]

So α_VP(1,5) * β_V(1,2) * β_PP(2,5) / β_S(0,6) is the probability that there is a VP → V PP here, given all of the observed data (words)…

… or is it?

Page 28: Statistical NLP Winter  2009

Inside & Outside Probabilities

  β_V(1,2)  = p(flies | V)
  β_PP(2,5) = p(like an arrow | PP)
  α_VP(1,5) = p(time VP today | S)

[Same example tree]

So α_VP(1,5) * p(VP → V PP) * β_V(1,2) * β_PP(2,5) / β_S(0,6) is the probability that there is a VP → V PP here (at 1-2-5), given all of the observed data (words)

Sum this probability over all position triples like (1,2,5) to get the expected count c(VP → V PP); reestimate the PCFG!
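
A minimal sketch of that E-step accumulation and the M-step reestimation, assuming inside and outside tables beta and alpha as defaultdicts (the outside pass is sketched later, after the reverse-CKY slide) and the same hypothetical grammar layout; lexical-rule counts are omitted for brevity, so this is not a complete EM iteration.

from collections import defaultdict

def expected_binary_counts(words, binary, alpha, beta, start="S"):
    # E-step: expected count of each binary rule X -> Y Z in this sentence, using
    # alpha_X(i,k) * p(rule) * beta_Y(i,j) * beta_Z(j,k) / p(sentence).
    n = len(words)
    sent_prob = beta[(start, 0, n)]
    counts = defaultdict(float)
    if sent_prob == 0.0:
        return counts                 # sentence not parseable under current grammar
    for (X, Y, Z), p in binary.items():
        for i in range(0, n - 1):
            for k in range(i + 2, n + 1):
                for j in range(i + 1, k):
                    num = alpha[(X, i, k)] * p * beta[(Y, i, j)] * beta[(Z, j, k)]
                    if num:
                        counts[(X, Y, Z)] += num / sent_prob
    return counts

def reestimate_binary(counts):
    # M-step (binary rules only): relative-frequency estimates of p(X -> Y Z | X).
    totals = defaultdict(float)
    for (X, _, _), c in counts.items():
        totals[X] += c
    return {(X, Y, Z): c / totals[X] for (X, Y, Z), c in counts.items()}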

Page 29: Statistical NLP Winter  2009

Compute probs bottom-up

[Figure: VP spanning (1,5) built from V spanning (1,2) and PP spanning (2,5)]

  β_VP(1,5)                   += p(V PP | VP) * β_V(1,2)     * β_PP(2,5)
  p(flies like an arrow | VP) += p(V PP | VP) * p(flies | V) * p(like an arrow | PP)

Summary:  β_VP(1,5) += p(V PP | VP) * β_V(1,2) * β_PP(2,5)
          (inside 1,5)               (inside 1,2) (inside 2,5)

Page 30: Statistical NLP Winter  2009

Compute α probs top-down (uses β probs as well)

[Figure: V spanning (1,2), seen from outside: S generates “time VP today”, and the VP expands as V PP with PP spanning (2,5)]

  α_V(1,2)                          += α_VP(1,5)            * p(V PP | VP) * β_PP(2,5)
  p(time V like an arrow today | S) += p(time VP today | S) * p(V PP | VP) * p(like an arrow | PP)

Summary:  α_V(1,2) += p(V PP | VP) * α_VP(1,5) * β_PP(2,5)
          (outside 1,2)              (outside 1,5) (inside 2,5)

Page 31: Statistical NLP Winter  2009

Compute α probs top-down (uses β probs as well)

[Figure: PP spanning (2,5), seen from outside: S generates “time VP today”, and the VP expands as V PP with V spanning (1,2)]

  α_PP(2,5)                  += α_VP(1,5)            * p(V PP | VP) * β_V(1,2)
  p(time flies PP today | S) += p(time VP today | S) * p(V PP | VP) * p(flies | V)

Summary:  α_PP(2,5) += p(V PP | VP) * α_VP(1,5) * β_V(1,2)
          (outside 2,5)               (outside 1,5) (inside 1,2)

Page 32: Statistical NLP Winter  2009

Compute probs bottom-up

When you build β_VP(1,5) from β_VP(1,2) and β_PP(2,5) during CKY, increment β_VP(1,5) by

  p(VP → VP PP) * β_VP(1,2) * β_PP(2,5)

Why? β_VP(1,5) is the total probability p(flies like an arrow | VP) over all derivations, and we just found another one. (See the earlier slides of the CKY chart.)

[Figure: VP spanning (1,5) built from VP spanning (1,2) and PP spanning (2,5)]

Page 33: Statistical NLP Winter  2009

Compute probs bottom-up (CKY)

for width := 2 to n                  (* build smallest first *)
  for i := 0 to n-width              (* start *)
    let k := i + width               (* end *)
    for j := i+1 to k-1              (* middle *)
      for all grammar rules X → Y Z
        β_X(i,k) += p(X → Y Z) * β_Y(i,j) * β_Z(j,k)

[Diagram: X spans (i,k), built from Y over (i,j) and Z over (j,k)]

Page 34: Statistical NLP Winter  2009

Compute probs top-down (reverse CKY)

for width := n downto 2              (* "unbuild" biggest first *)
  for i := 0 to n-width              (* start *)
    let k := i + width               (* end *)
    for j := i+1 to k-1              (* middle *)
      for all grammar rules X → Y Z
        α_Y(i,j) += ???
        α_Z(j,k) += ???

[Diagram: X spans (i,k), built from Y over (i,j) and Z over (j,k)]

Page 35: Statistical NLP Winter  2009

Compute probs top-down

After computing β probs during CKY, revisit the constituents in reverse order (i.e., bigger constituents first).

When you “unbuild” β_VP(1,5) from β_VP(1,2) and β_PP(2,5):

  increment α_VP(1,2) by  α_VP(1,5) * p(VP → VP PP) * β_PP(2,5)
  increment α_PP(2,5) by  α_VP(1,5) * p(VP → VP PP) * β_VP(1,2)

(β_VP(1,5), β_VP(1,2), and β_PP(2,5) were already computed on the bottom-up pass.)

α_VP(1,2) is the total probability of all ways to generate the VP spanning (1,2) together with all of the words outside it.

[Figure: tree fragment S → [NP time] VP [NP today], with the VP over (1,5) expanding as VP(1,2) PP(2,5)]

Page 36: Statistical NLP Winter  2009

Compute probs top-down (reverse CKY)

for width := n downto 2              (* "unbuild" biggest first *)
  for i := 0 to n-width              (* start *)
    let k := i + width               (* end *)
    for j := i+1 to k-1              (* middle *)
      for all grammar rules X → Y Z
        α_Y(i,j) += α_X(i,k) * p(X → Y Z) * β_Z(j,k)
        α_Z(j,k) += α_X(i,k) * p(X → Y Z) * β_Y(i,j)

[Diagram: X spans (i,k), built from Y over (i,j) and Z over (j,k)]
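
A minimal runnable sketch of this outside pass, mirroring the inside() sketch given after the bottom-up pseudocode; it assumes the same hypothetical grammar layout and a beta table produced by that sketch.

from collections import defaultdict

def outside(words, binary, beta, start="S"):
    # Returns alpha[(X, i, k)], the outside probability of X over words[i:k].
    n = len(words)
    alpha = defaultdict(float)
    alpha[(start, 0, n)] = 1.0           # nothing lies outside the root span
    for width in range(n, 1, -1):        # "unbuild" biggest spans first
        for i in range(0, n - width + 1):
            k = i + width
            for j in range(i + 1, k):
                for (X, Y, Z), p in binary.items():
                    a = alpha[(X, i, k)]
                    if a:
                        alpha[(Y, i, j)] += a * p * beta[(Z, j, k)]
                        alpha[(Z, j, k)] += a * p * beta[(Y, i, j)]
    return alpha

With both tables in hand, alpha[(X, i, k)] * beta[(X, i, k)] / beta[(start, 0, n)] is the posterior probability of an X constituent over span (i, k), exactly the quantity used on the earlier inside/outside slides.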

Page 37: Statistical NLP Winter  2009

Back to the actual history

• Early approaches: Pereira & Schabes (1992), Carroll & Charniak (1994)
  • Initialize a simple PCFG using some “good guesses”
  • Use EM to re-estimate
  • Horrible failure

Page 38: Statistical NLP Winter  2009

EM on a constituent-context model

• Klein & Manning (2002) used a generative model that constructed sentences through bracketings:
  • (1) choose a bracketing of a sentence
  • (2) for each span in the bracketing, (independently) generate a yield and a context

• This is horribly deficient, but it worked well!

[Figure: a bracketed sentence with one span’s yield and its surrounding context highlighted]
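
In symbols, the constituent-context model scores a sentence S together with a bracketing B roughly as follows (my paraphrase of the Klein & Manning 2002 formulation, not an equation from the slides), where b_{ij} says whether span (i,j) is a constituent under B:

P(S, B) \;=\; P(B) \prod_{\langle i,j \rangle}
  P\bigl(\mathrm{yield}(i,j) \mid b_{ij}\bigr)\,
  P\bigl(\mathrm{context}(i,j) \mid b_{ij}\bigr)

The deficiency mentioned above comes from the fact that every span’s yield and context are generated independently, so the same words are generated many times over.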

Page 39: Statistical NLP Winter  2009

Adding a dependency component

• Klein & Manning (2004) added a dependency component, which wasn’t deficient – a nice generative model!

• The dependency component introduced valence – the idea that a governor’s dependent-seeking behavior changes once it has a dependent
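
Schematically, the dependency model with valence generates the dependents D(h) of a head h roughly as follows (my paraphrase of the Klein & Manning 2004 model, not an equation from the slides); adj records whether h has already taken a dependent in direction dir, which is exactly where valence enters:

P\bigl(D(h)\bigr) \;=\; \prod_{dir \in \{L, R\}}
  \Bigl[ \prod_{d \in \mathrm{deps}(h, dir)}
     P(\neg\mathrm{STOP} \mid h, dir, adj)\,
     P_{\mathrm{CHOOSE}}(d \mid h, dir)\,
     P\bigl(D(d)\bigr) \Bigr]
  \, P(\mathrm{STOP} \mid h, dir, adj)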

Page 40: Statistical NLP Winter  2009

Constituent-context + dependency

• Some really nifty factorized-inference techniques allowed these models to be combined

• The results are very good

Page 41: Statistical NLP Winter  2009

More recent developments

• Improved parameter search (Smith, 2006)
• Bayesian methods (Liang et al., 2007)
• Unsupervised all-subtrees parsing (Bod, 2007)