Parsing with Compositional Vector Grammars
Socher, Bauer, Manning, Ng, 2013
Problem
• How can we parse a sentence and create a dense representation of it?
– N-grams have obvious problems, the most important being sparsity
• Can we resolve syntactic ambiguity with context? “They ate udon with forks” vs “They ate udon with chicken”
Standard Recursive Neural Net
[Diagram: parsing "I like green eggs". Vector(I) and Vector(like) are combined through the single shared matrix W_Main to produce Vector(I-like), which receives a score (and optionally a classifier); Vector(I-like) is then combined with Vector(green), giving Vector((I-like)green), and so on up the tree.]
Standard Recursive Neural Net
p = f(W[a; b]), where f is usually tanh or the logistic function. In other words, stack the two word vectors, multiply through the matrix W, and apply the nonlinearity; you get a parent vector p of the same dimensionality as the children a and b.
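The composition-and-score step described above can be sketched as follows. All dimensions and parameter values here are toy assumptions; in the paper W and the scoring vector are learned.

```python
import numpy as np

d = 4  # toy embedding dimensionality (assumption)
rng = np.random.default_rng(0)

# Hypothetical random parameters standing in for learned ones:
W = rng.standard_normal((d, 2 * d)) * 0.1  # composition matrix (d x 2d)
v = rng.standard_normal(d)                 # scoring vector

def compose(a, b):
    """Parent vector p = tanh(W [a; b]) -- same dimensionality as a and b."""
    return np.tanh(W @ np.concatenate([a, b]))

def score(p):
    """Scalar plausibility score for a node, s = v . p."""
    return float(v @ p)

a = rng.standard_normal(d)  # e.g. Vector(I)
b = rng.standard_normal(d)  # e.g. Vector(like)
p = compose(a, b)           # e.g. Vector(I-like)
```

The same `compose` is applied recursively: the parent `p` can be stacked with the next word vector and run through `W` again.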
Syntactically Untied RNN
[Diagram: parsing "I like green eggs" with syntactic untying. Vector(I) and Vector(like) combine through W_{N,V} to give Vector(I-like); Vector(green) and Vector(eggs) combine through W_{Adj,N} to give Vector(green-eggs); each node gets a score and a classifier.]
First, parse the lower level with a PCFG to get the categories: N V Adj N
Syntactically Untied RNN
The weight matrix used at each node is determined by the PCFG categories of its children a and b. (You have one matrix per category pair.)
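A minimal sketch of the untied composition: one matrix per (left, right) category pair, looked up at each node. The category set, dimensionality, and initialization scale are toy assumptions (the near-identity initialization follows the paper's trick, discussed below).

```python
import numpy as np

d = 4  # toy dimensionality (assumption)
rng = np.random.default_rng(1)
CATEGORIES = ["N", "V", "Adj"]  # toy category set (assumption)

def init_matrix():
    # Two identity blocks side by side, plus small random noise.
    return np.hstack([np.eye(d), np.eye(d)]) + 0.01 * rng.standard_normal((d, 2 * d))

# One composition matrix per (left, right) category pair.
W = {(l, r): init_matrix() for l in CATEGORIES for r in CATEGORIES}

def compose(cat_a, a, cat_b, b):
    """Syntactically untied composition: pick W by the children's categories."""
    return np.tanh(W[(cat_a, cat_b)] @ np.concatenate([a, b]))

p = compose("N", rng.standard_normal(d), "V", rng.standard_normal(d))
```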
Examples: Composition Matrices
• Notice that he initializes them as two identity matrices side by side (in the absence of other information, we should roughly average the children)
Learning the Weights
• Errors are backpropagated through structure (Goller and Kuchler, 1996)
• Weight derivatives are additive across branches! (Not obvious; a good proof/explanation is in Socher, 2014)
𝛿_child = (Wᵀ 𝛿_parent) ⊙ f′(x), where for the logistic f′(x) = f(x)(1 − f(x)) = e⁻ˣ / (1 + e⁻ˣ)²
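Backprop through one composition node can be checked numerically. The sketch below differentiates a scalar score s = v · f(W[a; b]) with the logistic f, using the δ = v ⊙ f′(x) rule, and compares one entry of the analytic gradient against a finite difference. All parameter values are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
W = rng.standard_normal((d, 2 * d)) * 0.1
v = rng.standard_normal(d)
a, b = rng.standard_normal(d), rng.standard_normal(d)
z = np.concatenate([a, b])  # stacked children [a; b]

f = lambda x: 1.0 / (1.0 + np.exp(-x))  # logistic

def forward(W):
    return float(v @ f(W @ z))

# Analytic gradient: ds/dW = delta outer [a; b], delta = v * f'(x)
x = W @ z
delta = v * f(x) * (1 - f(x))   # f'(x) = f(x)(1 - f(x))
grad = np.outer(delta, z)

# Numerical check on a single entry of W
eps = 1e-6
Wp = W.copy()
Wp[0, 0] += eps
num = (forward(Wp) - forward(W)) / eps
```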
Tricks
• Our good friend, AdaGrad (diagonal variant, applied elementwise): θ ← θ − η·g / √(Σₜ gₜ²)
• Initialize matrices with identity + small random noise
• Use Collobert and Weston (2008) word embeddings to start
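The diagonal AdaGrad trick fits in a few lines: keep a running sum of squared gradients per parameter and scale each update by its inverse square root. The class name, learning rate, and epsilon are illustrative choices, not the paper's values.

```python
import numpy as np

class AdaGrad:
    """Minimal sketch of elementwise (diagonal) AdaGrad."""
    def __init__(self, shape, lr=0.1, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.hist = np.zeros(shape)  # running sum of squared gradients

    def step(self, theta, grad):
        self.hist += grad ** 2
        return theta - self.lr * grad / (np.sqrt(self.hist) + self.eps)

opt = AdaGrad(shape=(2,))
theta = np.array([1.0, -1.0])
theta = opt.step(theta, np.array([0.5, -0.5]))
# First step: hist = g^2, so the update is lr * sign(g) (up to eps)
```

Parameters that receive large or frequent gradients thus get their effective learning rate shrunk automatically, which matters when some composition matrices fire far more often than others.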
Learning the Tree
• We want the score of the correct parse tree to beat the score of every incorrect tree by a structured margin:
s(xᵢ, yᵢ) ≥ s(xᵢ, y) + λ·Δ(y, yᵢ) for all alternative trees y, where Δ counts the incorrect spans in y
(correct parse trees are given in the training set)
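The margin objective for one sentence can be sketched as a hinge loss over candidate trees. The function name, the margin weight, and the scores below are all illustrative stand-ins.

```python
def hinge_loss(gold_score, candidates, margin_weight=0.1):
    """Structured hinge loss for one sentence.

    candidates: list of (score, num_wrong_spans) for incorrect trees.
    Loss is zero only when the gold tree beats every candidate by
    margin_weight * (number of wrong spans in that candidate).
    """
    worst = max(s + margin_weight * wrong for s, wrong in candidates)
    return max(0.0, worst - gold_score)

# Toy example: the second candidate is close in score but nearly correct.
loss = hinge_loss(4.5, [(4.0, 3), (4.8, 1)])
```

Scaling the margin by the number of wrong spans penalizes badly wrong trees more than almost-correct ones.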
Finding the Best Tree (inference)
• Want to find the parse tree with the max score (the sum of the scores of all subtrees)
• Too expensive to try every combination
• Trick: use a non-RNN method (a PCFG parsed with the CKY algorithm) to select the best 200 trees, then beam-search these trees with the RNN
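The two-stage inference reduces to propose-then-rescore. The sketch below uses stand-in trees and a toy scorer; the real system gets its k-best list from a PCFG via CKY and rescores with the RNN under a beam.

```python
def rerank(kbest_trees, rnn_score):
    """Pick the highest-scoring tree from a cheap parser's k-best list."""
    return max(kbest_trees, key=rnn_score)

# Toy k-best list (bracketed strings standing in for tree structures):
trees = ["(S (NP I) (VP like (NP green eggs)))",
         "(S (NP I) (VP (V like) green) (NP eggs))"]

# Toy scorer standing in for the RNN's summed subtree scores:
best = rerank(trees, rnn_score=len)
```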
Model Comparisons (WSJ Dataset)
[Table: F1 for parse labels on WSJ, comparing Socher's model to baselines.]
Analysis of Errors
Conclusions:
• Not the best model, but fast
• No hand-engineered features
• Huge number of parameters
• Notice that Socher can't make the standard RNN perform better than the PCFG; there is a pattern here. Most of the papers from this group involve very creative modifications to the standard RNN (SU-RNN, RNTN, RNN + max pooling)
• The model in this paper has (probably) been eclipsed by the Recursive Neural Tensor Network; subsequent work showed the RNTN performed better (in different situations) than the SU-RNN