Parsing with Compositional Vector Grammars
Socher, Bauer, Manning, Ng, 2013
Problem
• How can we parse a sentence and create a dense representation of it?
– N-grams have obvious problems, the most important being sparsity
• Can we resolve syntactic ambiguity with context? “They ate udon with forks” vs “They ate udon with chicken”
Standard Recursive Neural Net
[Diagram: parsing "I like green eggs". Vector(I) and Vector(like) are combined through the single shared matrix W_Main to produce Vector(I-like), which receives a score (and optionally a classifier); Vector(I-like) is then combined with Vector(green), giving Vector((I-like)green), and so on up the tree.]
Standard Recursive Neural Net
p = f(W[a; b]), where f is usually tanh or the logistic function. In other words, stack the two word vectors, multiply through the matrix W, and apply the nonlinearity; you get a parent vector p of the same dimensionality as the children a and b.
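The composition-and-score step described above can be sketched as follows. All dimensions and parameter values here are toy assumptions; in the paper W and the scoring vector are learned.

```python
import numpy as np

d = 4  # toy embedding dimensionality (assumption)
rng = np.random.default_rng(0)

# Hypothetical random parameters standing in for learned ones:
W = rng.standard_normal((d, 2 * d)) * 0.1  # composition matrix (d x 2d)
v = rng.standard_normal(d)                 # scoring vector

def compose(a, b):
    """Parent vector p = tanh(W [a; b]) -- same dimensionality as a and b."""
    return np.tanh(W @ np.concatenate([a, b]))

def score(p):
    """Scalar plausibility score for a node, s = v . p."""
    return float(v @ p)

a = rng.standard_normal(d)  # e.g. Vector(I)
b = rng.standard_normal(d)  # e.g. Vector(like)
p = compose(a, b)           # e.g. Vector(I-like)
```

The same `compose` is applied recursively: the parent `p` can be stacked with the next word vector and run through `W` again.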
Syntactically Untied RNN
[Diagram: parsing "I like green eggs" with syntactic untying. Vector(I) and Vector(like) combine through W_{N,V} to give Vector(I-like); Vector(green) and Vector(eggs) combine through W_{Adj,N} to give Vector(green-eggs); each node gets a score and a classifier.]
First, parse the lower level with a PCFG to get the categories: N V Adj N
Syntactically Untied RNN
The weight matrix used at each node is determined by the PCFG categories of its children a and b. (You have one matrix per category pair.)
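A minimal sketch of the untied composition: one matrix per (left, right) category pair, looked up at each node. The category set, dimensionality, and initialization scale are toy assumptions (the near-identity initialization follows the paper's trick, discussed below).

```python
import numpy as np

d = 4  # toy dimensionality (assumption)
rng = np.random.default_rng(1)
CATEGORIES = ["N", "V", "Adj"]  # toy category set (assumption)

def init_matrix():
    # Two identity blocks side by side, plus small random noise.
    return np.hstack([np.eye(d), np.eye(d)]) + 0.01 * rng.standard_normal((d, 2 * d))

# One composition matrix per (left, right) category pair.
W = {(l, r): init_matrix() for l in CATEGORIES for r in CATEGORIES}

def compose(cat_a, a, cat_b, b):
    """Syntactically untied composition: pick W by the children's categories."""
    return np.tanh(W[(cat_a, cat_b)] @ np.concatenate([a, b]))

p = compose("N", rng.standard_normal(d), "V", rng.standard_normal(d))
```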
Examples: Composition Matrices
• Notice that he initializes them as two identity matrices side by side (in the absence of other information, we should roughly average the children)
Learning the Weights
• Errors are backpropagated through structure (Goller and Kuchler, 1996)
• Weight derivatives are additive across branches! (Not obvious; a good proof/explanation is in Socher, 2014)
𝛿_child = (Wᵀ 𝛿_parent) ⊙ f′(x), where for the logistic f′(x) = f(x)(1 − f(x)) = e⁻ˣ / (1 + e⁻ˣ)²
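Backprop through one composition node can be checked numerically. The sketch below differentiates a scalar score s = v · f(W[a; b]) with the logistic f, using the δ = v ⊙ f′(x) rule, and compares one entry of the analytic gradient against a finite difference. All parameter values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
W = rng.standard_normal((d, 2 * d)) * 0.1
v = rng.standard_normal(d)
a, b = rng.standard_normal(d), rng.standard_normal(d)
z = np.concatenate([a, b])  # stacked children [a; b]

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic

def forward(W):
    return float(v @ f(W @ z))

# Analytic gradient: ds/dW = delta outer [a; b], delta = v * f'(x)
x = W @ z
delta = v * f(x) * (1 - f(x))   # f'(x) = f(x)(1 - f(x))
grad = np.outer(delta, z)

# Numerical check on a single entry of W
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
num = (forward(Wp) - forward(W)) / eps
```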
Tricks
• Our good friend, AdaGrad (diagonal variant, applied elementwise): θ ← θ − η·g / √(Σₜ gₜ²)
• Initialize matrices with identity + small random noise
• Use Collobert and Weston (2008) word embeddings to start
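The diagonal AdaGrad trick fits in a few lines: keep a running sum of squared gradients per parameter and scale each update by its inverse square root. The class name, learning rate, and epsilon are illustrative choices, not the paper's values.

```python
import numpy as np

class AdaGrad:
    """Minimal sketch of elementwise (diagonal) AdaGrad."""
    def __init__(self, shape, lr=0.1, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.hist = np.zeros(shape)  # running sum of squared gradients

    def step(self, theta, grad):
        self.hist += grad ** 2
        return theta - self.lr * grad / (np.sqrt(self.hist) + self.eps)

opt = AdaGrad(shape=(2,))
theta = np.array([1.0, -1.0])
theta = opt.step(theta, np.array([0.5, -0.5]))
# First step: hist = g^2, so the update is lr * sign(g) (up to eps)
```

Parameters that receive large or frequent gradients thus get their effective learning rate shrunk automatically, which matters when some composition matrices fire far more often than others.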
Learning the Tree
• We want the score of the correct parse tree to beat the score of every incorrect tree by a structured margin:
s(xᵢ, yᵢ) ≥ s(xᵢ, y) + λ·Δ(y, yᵢ) for all alternative trees y, where Δ counts the incorrect spans in y
(correct parse trees are given in the training set)
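The margin objective for one sentence can be sketched as a hinge loss over candidate trees. The function name, the margin weight, and the scores below are all illustrative stand-ins.

```python
def hinge_loss(gold_score, candidates, margin_weight=0.1):
    """Structured hinge loss for one sentence.

    candidates: list of (score, num_wrong_spans) for incorrect trees.
    Loss is zero only when the gold tree beats every candidate by
    margin_weight * (number of wrong spans in that candidate).
    """
    worst = max(s + margin_weight * wrong for s, wrong in candidates)
    return max(0.0, worst - gold_score)

# Toy example: the second candidate is close in score but nearly correct.
loss = hinge_loss(4.5, [(4.0, 3), (4.8, 1)])
```

Scaling the margin by the number of wrong spans penalizes badly wrong trees more than almost-correct ones.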
Finding the Best Tree (inference)
• Want to find the parse tree with the max score (the sum of the scores of all subtrees)
• Too expensive to try every combination
• Trick: use a non-RNN method (a PCFG parsed with the CKY algorithm) to select the best 200 trees, then beam-search these trees with the RNN
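The two-stage inference reduces to propose-then-rescore. The sketch below uses stand-in trees and a toy scorer; the real system gets its k-best list from a PCFG via CKY and rescores with the RNN under a beam.

```python
def rerank(kbest_trees, rnn_score):
    """Pick the highest-scoring tree from a cheap parser's k-best list."""
    return max(kbest_trees, key=rnn_score)

# Toy k-best list (bracketed strings standing in for tree structures):
trees = ["(S (NP I) (VP like (NP green eggs)))",
         "(S (NP I) (VP (V like) green) (NP eggs))"]

# Toy scorer standing in for the RNN's summed subtree scores:
best = rerank(trees, rnn_score=len)
```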
Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on WSJ, comparing Socher's model to baselines.]
Analysis of Errors
Conclusions:
• Not the best model, but fast
• No hand-engineered features
• Huge number of parameters
• Notice that Socher can't make the standard RNN perform better than the PCFG; there is a pattern here. Most of the papers from this group involve very creative modifications to the standard RNN (SU-RNN, RNTN, RNN + max pooling)
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network; subsequent work showed the RNTN performed better (in different situations) than the SU-RNN