Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection1
Richard Socher, Eric Huang, Jeffrey Pennington, Andrew Ng, Christopher Manning
Feynman Liang
May 16, 2013
1Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. Advances in Neural Information Processing Systems (NIPS 2011).
F. Liang Unfolding RAE for Paraphrase Detection May 2013 1 / 26
Motivation
Consider the following phrases:
The judge also refused to postpone the trial date of Sept. 29.
Obus also denied a defense motion to postpone the September trial date.
Paraphrase Detection Problem
Given: A pair of sentences S1 = (w1, . . . , wm) and S2 = (w1, . . . , wn), w ∈ V
Task: Classify whether S1 and S2 are paraphrases or not
Overview
Background
Neural Language Models
Recursive Autoencoders
Contributions
Unfolding RAEs
Dynamic Pooling of Similarity Matrix
Experiments
Prior Work
Similarity Metrics
n-gram Overlap / Longest Common Subsequence
Ordered Tree Edit Distance
WordNet hypernyms
Language Models
n-gram HMMs

P(w_t | w_1^{t−1}) ≈ P(w_t | w_{t−n+1}^{t−1})

Log-Linear Models

P(y | w_1^t; θ) ≈ e^{θᵀ f(w_1^t, y)} / Σ_{y′∈Y} e^{θᵀ f(w_1^t, y′)}
Neural Language Models2
2R. Collobert and J. Weston. A unified architecture for natural language processing:deep neural networks with multitask learning. In ICML, 2008.
Neural Language Models
Vocabulary V
Embedding Matrix L ∈ R^{n×|V|}
L : V → R^n
Each column of L “embeds” a w ∈ V in an n-dimensional feature space
Captures semantic and syntactic information about a word
A sentence S = (w1, . . . , wm), wi ∈ V is represented as an ordered list (x1, . . . , xm), xi ∈ R^n
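The embedding lookup above can be sketched in a few lines of numpy; the vocabulary, dimensions, and random matrix here are purely illustrative, not from the paper.

```python
import numpy as np

# Toy vocabulary and embedding matrix L in R^{n x |V|} (columns are word vectors).
vocab = {"the": 0, "judge": 1, "trial": 2}
n = 4
rng = np.random.default_rng(0)
L = rng.standard_normal((n, len(vocab)))

def embed_sentence(words, L, vocab):
    """Map a sentence (w1, ..., wm) to its ordered list of vectors (x1, ..., xm)."""
    return [L[:, vocab[w]] for w in words]

xs = embed_sentence(["the", "judge"], L, vocab)  # two vectors, each in R^n
```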
Neural Language Models
Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3, March 2003.
Recursive Autoencoders (RAEs)
Assume we are given a binary parse tree T :
A binary parse tree is a list of triplets of parents with children: (p → (c1, c2))
c1, c2 are either a terminal word vector xi ∈ R^n or a non-terminal parent y1 ∈ R^n
Figure: Parse tree for ((y1 → x2 x3), (y2 → x1 y1)), ∀x, y ∈ R^n
Recursive Autoencoders (RAEs)
Non-terminal parent p computed as
p = f (We [c1; c2] + b)
f is an activation function (e.g. sigmoid, tanh)
We ∈ R^{n×2n} is the encoding matrix to be learned
[c1; c2] ∈ R^{2n} is the concatenation of the children
b is a bias term
Figure: y2 = f(We [x1; f(We [x2; x3] + b1)] + b2)
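A minimal numpy sketch of this composition step, p = f(We [c1; c2] + b), with tanh as the activation; We, b, and the word vectors are randomly initialized here for illustration only.

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
We = rng.standard_normal((n, 2 * n)) * 0.1  # encoding matrix
b = np.zeros(n)                             # bias term

def encode(c1, c2):
    # p = f(We [c1; c2] + b), then length-normalize to avoid degenerate solutions
    p = np.tanh(We @ np.concatenate([c1, c2]) + b)
    return p / np.linalg.norm(p)

x1, x2, x3 = (rng.standard_normal(n) for _ in range(3))
y1 = encode(x2, x3)  # y1 -> (x2, x3)
y2 = encode(x1, y1)  # y2 -> (x1, y1)
```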
Recursive Autoencoders (RAEs)
Wd inverts We s.t. [c′1; c′2] = f(Wd p + bd) is the decoding of p

Erec(p) = ‖[c1; c2] − [c′1; c′2]‖²_{ℓ2}

To train:
Minimize Erec(T) = Σ_{p∈T} Erec(p) = Erec(y1) + Erec(y2)
Add length normalization layer p = p / ‖p‖_{ℓ2} to avoid degenerate solution
Unfolding RAEs
Measure reconstruction error down to terminal xi s:
For a node y that spans words i to j :
Erec(y_(i,j)) = ‖[xi; . . . ; xj] − [x′i; . . . ; x′j]‖²_{ℓ2}
Hidden layer norms no longer shrink
Children with larger subtrees get more weight
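The unfolding idea can be sketched for the toy parse ((y1 → x2 x3), (y2 → x1 y1)): encode bottom-up, then recursively decode the top node all the way back to the terminals and score against the original word vectors. All matrices and vectors below are random stand-ins, not the authors' trained parameters.

```python
import numpy as np

n = 4
rng = np.random.default_rng(2)
We = rng.standard_normal((n, 2 * n)) * 0.1
Wd = rng.standard_normal((2 * n, n)) * 0.1

def encode(c1, c2):
    p = np.tanh(We @ np.concatenate([c1, c2]))
    return p / np.linalg.norm(p)

def decode(p):
    # one unfolding step: reconstruct both children from a parent
    c = np.tanh(Wd @ p)
    return c[:n], c[n:]

x1, x2, x3 = (rng.standard_normal(n) for _ in range(3))
y1 = encode(x2, x3)
y2 = encode(x1, y1)

# Unfold y2 down to the terminals: y2 -> (x1', y1'), then y1' -> (x2', x3')
x1p, y1p = decode(y2)
x2p, x3p = decode(y1p)
e_rec = sum(np.sum((a - b) ** 2) for a, b in [(x1, x1p), (x2, x2p), (x3, x3p)])
```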
Deep RAEs
h = f(W_e^{(1)} [c1; c2] + b_e^{(1)})
p = f(W_e^{(2)} h + b_e^{(2)})
Andrew Ng. Autoencoders (CS294A Lecture notes).
Training RAEs
Data: A set of parse trees
Objective: Minimize
J = (1/|T|) Σ_{n∈T} Erec(n; We) + (λ/2) ‖We‖²
Gradient descent (backpropagation, L-BFGS)
Objective is non-convex ⇒ convergence only to a local optimum
Sentence Similarity Matrix
For two sentences S1, S2 of lengths n and m, concatenate terminal xi's (in sentence order) with non-terminal yi's (depth-first, right-to-left)
Compute similarity matrix S ∈ R^{(2n−1)×(2m−1)}, where Si,j is the ℓ2 distance between the ith element of S1's feature vector and the jth element of S2's feature vector
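With broadcasting, the pairwise-distance matrix is one line of numpy; the node counts below (5 = 2·3−1 and 7 = 2·4−1) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n_dim = 4
feats1 = rng.standard_normal((5, n_dim))  # e.g. 3 words + 2 internal nodes of S1
feats2 = rng.standard_normal((7, n_dim))  # e.g. 4 words + 3 internal nodes of S2

# S[i, j] = Euclidean distance between node i of S1 and node j of S2
S = np.linalg.norm(feats1[:, None, :] - feats2[None, :, :], axis=-1)
```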
Dynamic Pooling
Sentence lengths may vary ⇒ S dimensionality may vary.
Want to map S ∈ R^{(2n−1)×(2m−1)} to Spooled ∈ R^{np×np} with np constant
Dynamically partition rows and columns of S into np roughly equal parts
Min-pool over each part
Normalize to μ = 0, σ = 1 and pass on to classifier (e.g. softmax)
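The pooling steps above can be sketched as follows; since S holds distances, the min over each cell keeps the strongest matches. The chunking via `array_split` is one plausible way to form the roughly equal partitions.

```python
import numpy as np

def dynamic_min_pool(S, n_p):
    # Partition rows and columns into n_p roughly equal chunks,
    # take the min of each cell -> fixed n_p x n_p output.
    rows = np.array_split(np.arange(S.shape[0]), n_p)
    cols = np.array_split(np.arange(S.shape[1]), n_p)
    pooled = np.empty((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()
    # normalize to mean 0, std 1 before the classifier
    return (pooled - pooled.mean()) / pooled.std()

S = np.abs(np.random.default_rng(4).standard_normal((9, 13)))
Sp = dynamic_min_pool(S, n_p=3)  # always 3 x 3, whatever the sentence lengths
```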
Qualitative Evaluation of Unsupervised Feature Learning
Dataset
150,000 sentences from NYT and AP sections of the Gigaword corpus for RAE training
Setup
Unsupervised feature vectors in R^100 provided by Turian et al.3 for initial word embeddings
Stanford parser4 to extract parse tree
Hidden layer h set to 200 units in both standard and unfolding RAE (0 in NN qualitative evaluation)
3J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semisupervised learning. In Proceedings of ACL, pages 384–394, 2010.
4D. Klein and C. D. Manning. Accurate unlexicalized parsing. In ACL, 2003.
Nearest Neighbor
Figure: Comparison of nearest ℓ2-norm neighbor
Recursive Decoding
Figure: Phrase reconstruction via recursive decoding
Paraphrase Detection Task
Dataset
Microsoft Research paraphrase corpus (MSRP)5
5,801 sentence pairs, 3,900 labeled as paraphrases
5B. Dolan, C. Quirk, and C. Brockett. Unsupervised construction of large paraphrasecorpora: exploiting massively parallel news sources. In COLING, 2004.
Paraphrase Detection Task
Setup
4,076 training pairs (67.5% positive), 1,725 test pairs (66.5% positive)
For all (S1, S2) in training data, (S2,S1) also added
Negative examples selected for high lexical overlap
Add features ∈ {0, 1} to Spooled related to the sets of numbers in S1 and S2:
Numbers in S1 = numbers in S2
(Numbers in S1 ∪ numbers in S2) ≠ ∅
Numbers in one sentence ⊂ numbers in the other
Softmax classifier on top of Spooled
Hyperparameter selection: 10-fold cross-validation
np = 15, λRAE = 10−5, λsoftmax = 0.05
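The three binary number features can be sketched with Python sets; the regex-based number extraction is an assumption here, not the paper's exact tokenization.

```python
import re

def number_features(s1, s2):
    # Extract the sets of numbers from each sentence (illustrative regex).
    n1 = set(re.findall(r"\d+(?:\.\d+)?", s1))
    n2 = set(re.findall(r"\d+(?:\.\d+)?", s2))
    return (
        int(n1 == n2),                  # same sets of numbers
        int(len(n1 | n2) > 0),          # at least one number present
        int((n1 < n2) or (n2 < n1)),    # one set a strict subset of the other
    )

number_features("trial date of Sept. 29", "the September 29 trial date")  # -> (1, 1, 0)
```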
Two annotators (83% agreement), third to resolve conflict
Example Results
State of the Art
“Paraphrase Identification (State of the Art).” ACLWiki. Web. 14 May 2013.
Comparison of Unsupervised Feature Learning Methods
Setup
Dynamic pooling layer
Hyperparameters optimized over C.V. set
Results
Recursive averaging: 75.9%
Standard RAE: 75.5%
Unfolding RAE without hidden layers: 76.8%
Unfolding RAE with hidden layers: 76.6%
Evaluating Contribution of Dynamic Pooling Layer
Setup
Unfolding RAE used to compute S
Hyperparameters optimized over C.V. set
Results
S-histogram: 73.0%
Only added number features: 73.2%
Only Spooled: 72.6%
Top URAE Node: 74.2%
Spooled + number features: 76.8%
Critique
Pros:
Novel unfolding reconstruction error metric, dynamic pooling layer
State of the art (2011) performance
Cons:
Vague training details / time to convergence
Unconvincing improvement over baselines (recursive averaging, top RAE node)
Training requires labeled parse trees (unsupervised performance depends on parser accuracy)
Representing phrases in the same feature space as words
Critique
Suggestions:
Add additional features to Spooled
Overlap pooling regions
Let We vary depending on the labels of children in the parse tree
Capture the operational meaning of a word to a sentence (MV-RNN6)
p = f(We [c1; c2] + b) → p = f(We [Ba + b0; Ab + a0] + p0)
6Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2012.