Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, Christopher D. Manning
Slides & Speech: Rui Zhang
Outline
• Motivation & Contribution
• Recursive Neural Network
• Scene Segmentation using RNN
• Learning and Optimization
• Language Parsing using RNN
• Experiments
Motivation
• Data naturally contains recursive structures
  • Image: scenes split into objects, and objects split into parts
  • Language: a noun phrase may contain a clause, which contains noun phrases of its own
Motivation
• The recursive structure helps to
  • Identify components of the data
  • Understand how the components interact to form the whole
Contribution
• First deep learning method to achieve state-of-the-art performance on scene segmentation and annotation
• Learned deep features outperform hand-crafted ones (e.g. Gist)
• Can be generalized to other tasks, e.g. language parsing
Recursive Neural Network
• Similar to a one-layer fully-connected network
• Models the transformation from child nodes to a parent node
• Recursively applied over a tree structure
  • The parent at one layer becomes a child at the layer above
• Parameters are shared across layers
[Figure: child vectors c1 and c2 are combined through the shared matrix W_recur into a parent vector h, which is in turn combined with c3 through the same W_recur at the next layer]
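The composition step above can be sketched as follows, assuming a tanh activation and hypothetical dimensions (children are d-dimensional, W_recur is d x 2d; randomly initialized here, whereas the real model learns it):

```python
import numpy as np

# Minimal sketch of the recursive composition step. W_recur and the
# dimensionality d = 4 are hypothetical choices for illustration.
d = 4
rng = np.random.default_rng(0)
W_recur = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

def compose(c1, c2):
    """Map two child vectors to their parent vector."""
    return np.tanh(W_recur @ np.concatenate([c1, c2]) + b)

c1, c2, c3 = rng.standard_normal((3, d))
h12 = compose(c1, c2)    # parent of c1 and c2
h123 = compose(h12, c3)  # the same W_recur is reused one layer up
```

Because the same W_recur is applied at every level, the network can be unrolled over trees of arbitrary shape and depth.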
Recursive vs. Recurrent NN
• There are two models called RNN: Recursive and Recurrent
• Similar
  • Both have shared parameters that are applied recursively
• Different
  • Recursive NNs operate on trees, while Recurrent NNs operate on sequences
  • A Recurrent NN can be seen as a Recursive NN restricted to one-sided, chain-shaped trees
Scene Segmentation Pipeline
1. Over-segment the image into superpixels
2. Extract features for each superpixel
3. Map the features into a semantic space
4. Enumerate the possible merges (pairs of adjacent nodes)
5. Compute a score for each merge with the RNN
6. Merge the pair of nodes with the highest score
7. Repeat until only one node is left
Input Data Representation
• Image
  • Over-segmented into superpixels
  • Hand-crafted features extracted per superpixel
  • Features mapped into a semantic space by one fully-connected layer to obtain a feature vector
  • Each superpixel has a class label
Tree Construction
• Scene parse trees are constructed bottom-up
  • Leaf nodes are the over-segmented superpixels
  • Hand-crafted features are extracted and mapped into a semantic space by one fully-connected layer
  • Each leaf has a feature vector
• An adjacency matrix records neighboring relations
Adjacency Matrix
Greedy Merging
• Nodes are merged greedily
• In each iteration
  • Enumerate all possible merges (pairs of adjacent nodes)
  • Compute a score for each possible merge
    • A fully-connected transformation on the candidate parent feature
  • Merge the pair with the highest score
    • The two children are replaced by the new parent node
    • The parent's feature is computed from the children's features
    • The union of the children's neighbors becomes the parent's neighbors
• Repeat until only one node is left
[Figure: children c1 and c2 are combined via W_recur into h12, and a merge score is computed from h12 via W_score]
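The greedy loop above can be sketched in a few lines, assuming the hypothetical helpers `compose` (the W_recur composition) and `score` (the W_score scoring layer) and a toy 4-superpixel chain; the real system starts from learned features and adjacency from the image:

```python
import numpy as np

# Minimal sketch of greedy tree construction. All weights, features,
# and the adjacency below are hypothetical toy values.
d = 4
rng = np.random.default_rng(0)
W_recur = rng.standard_normal((d, 2 * d)) * 0.1
w_score = rng.standard_normal(d) * 0.1

def compose(c1, c2):
    return np.tanh(W_recur @ np.concatenate([c1, c2]))

def score(h):
    return float(w_score @ h)

# Leaf features for 4 superpixels arranged in a chain 0-1-2-3.
feats = {i: rng.standard_normal(d) for i in range(4)}
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

next_id = 4
while len(feats) > 1:
    # Enumerate adjacent pairs and score each candidate parent.
    pairs = [(i, j) for i in feats for j in neighbors[i] if i < j]
    i, j = max(pairs, key=lambda p: score(compose(feats[p[0]], feats[p[1]])))
    # Replace the pair by a new parent node with the union of neighbors.
    feats[next_id] = compose(feats[i], feats[j])
    neighbors[next_id] = (neighbors[i] | neighbors[j]) - {i, j}
    for n in neighbors[next_id]:
        neighbors[n] = (neighbors[n] - {i, j}) | {next_id}
    for k in (i, j):
        del feats[k], neighbors[k]
    next_id += 1
```

Each iteration removes one node, so the loop terminates with a single root whose subtree covers the whole image.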
Training (1)
• Max-margin estimation
• Structured margin loss
  • Penalizes merging a segment with a segment of a different label before it has merged with all its neighbors of the same label
  • Proportional to the number of subtrees not appearing in any correct tree
• Tree score
  • Sum of the merge scores at all non-leaf nodes
• Class label
  • Softmax over the node feature vector
• Correct trees
  • Trees in which adjacent nodes with the same label are merged first
  • One image may have more than one correct tree
Training (2)
• Intuition: the score of the highest-scoring correct tree should exceed the score of any other tree by a margin
• Formulation: s(x_i, y*) ≥ s(x_i, y) + Δ(x_i, l_i, y) for all y ∈ A(x_i), where y* is a correct tree
• Margin: Δ(x_i, l_i, y) = κ · Σ_{d ∈ N(y)} 1[subtree(d) does not appear in any correct tree]
• Loss function:
  r_i(θ) = max_{y ∈ A(x_i)} [ s(x_i, y) + Δ(x_i, l_i, y) ] − max_{y ∈ Y(x_i, l_i)} s(x_i, y)
• J(θ) = Σ_i r_i(θ) is minimized
• Notation: θ: all model parameters; i: index of the training image; x_i: training image; l_i: labels of x_i; Y(x_i, l_i): set of correct trees for x_i; A(x_i): set of all possible trees for x_i; s: tree score function; d: a node in the parse tree; N(y): the set of nodes in tree y
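The structured margin Δ can be illustrated numerically, representing each tree by the collection of segment-index sets its subtrees cover; κ = 0.1 and all values below are hypothetical:

```python
# Minimal numeric sketch of the structured margin Delta. kappa, the
# trees, and the segment indices are hypothetical toy values.
kappa = 0.1

def margin(tree_subtrees, correct_subtrees):
    """kappa times the number of subtrees absent from every correct tree."""
    return kappa * sum(1 for st in tree_subtrees if st not in correct_subtrees)

# A correct tree merges segments {0, 1} first; the candidate merged {1, 2}.
correct = {frozenset([0, 1]), frozenset([0, 1, 2])}
candidate = [frozenset([1, 2]), frozenset([0, 1, 2])]
delta = margin(candidate, correct)  # one incorrect subtree -> kappa
```

A candidate tree with more subtrees missing from every correct tree incurs a larger margin, so the loss pushes its score further below that of the correct trees.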
Training (3)
• The label of a node is predicted by a softmax over its feature vector
• The margin is not differentiable, so only a sub-gradient is computed
• The gradient of the tree score is obtained by back-propagation
• The gradient of the label prediction is also obtained by back-propagation
[Figure: as before, h12 is computed from c1 and c2 via W_recur and scored via W_score; a class label is additionally predicted from h12 via W_label]
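The softmax label prediction on a node feature can be sketched as follows, with a hypothetical K x d matrix W_label (randomly initialized here, learned in the real model):

```python
import numpy as np

# Minimal sketch of per-node label prediction. d, K, and W_label are
# hypothetical values for illustration.
d, K = 4, 3
rng = np.random.default_rng(0)
W_label = rng.standard_normal((K, d)) * 0.1

def predict_label(h):
    """Return a softmax distribution over the K classes for node feature h."""
    z = W_label @ h
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

p = predict_label(rng.standard_normal(d))  # p sums to 1 over the K classes
```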
Language Parsing
• Language parsing is similar to scene parsing
• Differences
  • The input is a natural-language sentence
  • Adjacency is strictly left/right, since words form a sequence
  • Class labels are syntactic categories
    • Word level
    • Phrase level
    • Clause level
  • Each sentence has only one correct tree
Experiments Overview
• Image
  • Scene segmentation and annotation
  • Scene classification
  • Nearest-neighbor scene subtrees
• Language
  • Supervised language parsing
  • Nearest-neighbor phrases
Scene Segmentation and Annotation
• Dataset
  • Stanford Background Dataset
• Task
  • Pixel-wise segmentation and labeling of foreground and different types of background
• Result
  • 78.1% pixel-wise accuracy, 0.6% above the previous state of the art
Scene Classification
• Dataset
  • Stanford Background Dataset
• Task
  • Three classes: city, countryside, sea-side
• Method
  • Feature: average of all node features, or the top node feature only
  • Classifier: linear SVM
• Result
  • 88.1% accuracy with the averaged feature, 4.1% above Gist, the state-of-the-art hand-crafted feature
  • 71.0% accuracy with the top feature only
• Discussion
  • The learned RNN features better capture the semantic content of a scene
  • The top feature alone loses some lower-level information
Nearest Neighbor Scene Subtrees
• Dataset
  • Stanford Background Dataset
• Task
  • Retrieve similar segments from all images
  • A subtree whose nodes all have the same label corresponds to a segment
• Method
  • Feature: top node feature of the subtree
  • Metric: Euclidean distance
• Result
  • Similar segments are retrieved
• Discussion
  • The RNN feature captures segment-level characteristics
Supervised Language Parsing
• Dataset
  • Penn Treebank, Wall Street Journal section
• Task
  • Generate parse trees with labeled nodes
• Result
  • Unlabeled bracketing F-measure: 90.29%, comparable to the 91.63% of the Berkeley Parser
Nearest Neighbor Phrases
• Dataset
  • Penn Treebank, Wall Street Journal section
• Task
  • Retrieve the nearest neighbors of a given sentence
• Method
  • Feature: top node feature
  • Metric: Euclidean distance
• Result
  • Similar sentences are retrieved
Discussion
• Understanding the semantic structure of data is essential for applications like fine-grained search or captioning
• Recursive NNs predict tree structure along with node labels in an elegant way
• Recursive NNs can be combined with CNNs
• If we can jointly learn a Recursive NN with