Parsing Natural Scenes and Natural Language with Recursive Neural Networks
Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, Christopher D. Manning
Slides & Speech: Rui Zhang
Outline
• Motivation & Contribution
• Recursive Neural Network
• Scene Segmentation using RNN
• Learning and Optimization
• Language Parsing using RNN
• Experiments
Motivation
• Data naturally contains recursive structures
  • Image: scenes split into objects, and objects split into parts
  • Language: a noun phrase may contain a clause, which contains noun phrases of its own
Motivation
• The recursive structure helps to
  • Identify components of the data
  • Understand how the components interact to form the whole
Contribution
• First deep learning method to achieve state-of-the-art performance on scene segmentation and annotation
• Learned deep features outperform hand-crafted ones (e.g. Gist)
• Can be generalized to other tasks, e.g. language parsing
Recursive Neural Network
• Similar to a one-layer fully-connected network
• Models the transformation from child nodes to a parent node
• Recursively applied over a tree structure
  • The parent at one layer becomes a child at the layer above
• Parameters are shared across layers
[Figure: child vectors c1 and c2 are combined through the shared matrix W_recur into a parent vector h, which is in turn combined with c3 through the same W_recur at the next layer]
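The composition step above can be sketched as follows, assuming a tanh activation and hypothetical dimensions (children are d-dimensional, W_recur is d x 2d; randomly initialized here, whereas the real model learns it):

```python
import numpy as np

# Minimal sketch of the recursive composition step. W_recur and the
# dimensionality d = 4 are hypothetical choices for illustration.
d = 4
rng = np.random.default_rng(0)
W_recur = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

def compose(c1, c2):
    """Map two child vectors to their parent vector."""
    return np.tanh(W_recur @ np.concatenate([c1, c2]) + b)

c1, c2, c3 = rng.standard_normal((3, d))
h12 = compose(c1, c2)    # parent of c1 and c2
h123 = compose(h12, c3)  # the same W_recur is reused one layer up
```

Because the same W_recur is applied at every level, the network can be unrolled over trees of arbitrary shape and depth.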
Recursive vs. Recurrent NN
• There are two models called RNN: Recursive and Recurrent
• Similar
  • Both have shared parameters that are applied recursively
• Different
  • Recursive NNs operate on trees, while Recurrent NNs operate on sequences
  • A Recurrent NN can be seen as a Recursive NN restricted to one-sided, chain-shaped trees
Scene Segmentation Pipeline
1. Over-segment the image into superpixels
2. Extract features for each superpixel
3. Map the features into a semantic space
4. Enumerate the possible merges (pairs of adjacent nodes)
5. Compute a score for each merge with the RNN
6. Merge the pair of nodes with the highest score
7. Repeat until only one node is left
Input Data Representation
• Image
  • Over-segmented into superpixels
  • Hand-crafted features extracted per superpixel
  • Features mapped into a semantic space by one fully-connected layer to obtain a feature vector
  • Each superpixel has a class label
Tree Construction
• Scene parse trees are constructed bottom-up
  • Leaf nodes are the over-segmented superpixels
  • Hand-crafted features are extracted and mapped into a semantic space by one fully-connected layer
  • Each leaf has a feature vector
• An adjacency matrix records neighboring relations
Adjacency Matrix
Greedy Merging
• Nodes are merged greedily
• In each iteration
  • Enumerate all possible merges (pairs of adjacent nodes)
  • Compute a score for each possible merge
    • A fully-connected transformation on the candidate parent feature
  • Merge the pair with the highest score
    • The two children are replaced by the new parent node
    • The parent's feature is computed from the children's features
    • The union of the children's neighbors becomes the parent's neighbors
• Repeat until only one node is left
[Figure: children c1 and c2 are combined via W_recur into h12, and a merge score is computed from h12 via W_score]
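The greedy loop above can be sketched in a few lines, assuming the hypothetical helpers `compose` (the W_recur composition) and `score` (the W_score scoring layer) and a toy 4-superpixel chain; the real system starts from learned features and adjacency from the image:

```python
import numpy as np

# Minimal sketch of greedy tree construction. All weights, features,
# and the adjacency below are hypothetical toy values.
d = 4
rng = np.random.default_rng(0)
W_recur = rng.standard_normal((d, 2 * d)) * 0.1
w_score = rng.standard_normal(d) * 0.1

def compose(c1, c2):
    return np.tanh(W_recur @ np.concatenate([c1, c2]))

def score(h):
    return float(w_score @ h)

# Leaf features for 4 superpixels arranged in a chain 0-1-2-3.
feats = {i: rng.standard_normal(d) for i in range(4)}
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

next_id = 4
while len(feats) > 1:
    # Enumerate adjacent pairs and score each candidate parent.
    pairs = [(i, j) for i in feats for j in neighbors[i] if i < j]
    i, j = max(pairs, key=lambda p: score(compose(feats[p[0]], feats[p[1]])))
    # Replace the pair by a new parent node with the union of neighbors.
    feats[next_id] = compose(feats[i], feats[j])
    neighbors[next_id] = (neighbors[i] | neighbors[j]) - {i, j}
    for n in neighbors[next_id]:
        neighbors[n] = (neighbors[n] - {i, j}) | {next_id}
    for k in (i, j):
        del feats[k], neighbors[k]
    next_id += 1
```

Each iteration removes one node, so the loop terminates with a single root whose subtree covers the whole image.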
Training (1)
• Max-margin estimation
• Structured margin loss
  • Penalizes merging a segment with a segment of a different label before it has merged with all its neighbors of the same label
  • Proportional to the number of subtrees not appearing in any correct tree
• Tree score
  • Sum of the merge scores at all non-leaf nodes
• Class label
  • Softmax over the node feature vector
• Correct trees
  • Trees in which adjacent nodes with the same label are merged first
  • One image may have more than one correct tree
Training (2)
• Intuition: the score of the highest-scoring correct tree should exceed the score of any other tree by a margin
• Formulation: s(x_i, y*) ≥ s(x_i, y) + Δ(x_i, l_i, y) for all y ∈ A(x_i), where y* is a correct tree
• Margin: Δ(x_i, l_i, y) = κ · Σ_{d ∈ N(y)} 1[subtree(d) does not appear in any correct tree]
• Loss function:
  r_i(θ) = max_{y ∈ A(x_i)} [ s(x_i, y) + Δ(x_i, l_i, y) ] − max_{y ∈ Y(x_i, l_i)} s(x_i, y)
• J(θ) = Σ_i r_i(θ) is minimized
• Notation: θ: all model parameters; i: index of the training image; x_i: training image; l_i: labels of x_i; Y(x_i, l_i): set of correct trees for x_i; A(x_i): set of all possible trees for x_i; s: tree score function; d: a node in the parse tree; N(y): the set of nodes in tree y
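The structured margin Δ can be illustrated numerically, representing each tree by the collection of segment-index sets its subtrees cover; κ = 0.1 and all values below are hypothetical:

```python
# Minimal numeric sketch of the structured margin Delta. kappa, the
# trees, and the segment indices are hypothetical toy values.
kappa = 0.1

def margin(tree_subtrees, correct_subtrees):
    """kappa times the number of subtrees absent from every correct tree."""
    return kappa * sum(1 for st in tree_subtrees if st not in correct_subtrees)

# A correct tree merges segments {0, 1} first; the candidate merged {1, 2}.
correct = {frozenset([0, 1]), frozenset([0, 1, 2])}
candidate = [frozenset([1, 2]), frozenset([0, 1, 2])]
delta = margin(candidate, correct)  # one incorrect subtree -> kappa
```

A candidate tree with more subtrees missing from every correct tree incurs a larger margin, so the loss pushes its score further below that of the correct trees.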
Training (3)
• The label of a node is predicted by a softmax over its feature vector
• The margin is not differentiable, so only a sub-gradient is computed
• The gradient of the tree score is obtained by back-propagation
• The gradient of the label prediction is also obtained by back-propagation
[Figure: as before, h12 is computed from c1 and c2 via W_recur and scored via W_score; a class label is additionally predicted from h12 via W_label]
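The softmax label prediction on a node feature can be sketched as follows, with a hypothetical K x d matrix W_label (randomly initialized here, learned in the real model):

```python
import numpy as np

# Minimal sketch of per-node label prediction. d, K, and W_label are
# hypothetical values for illustration.
d, K = 4, 3
rng = np.random.default_rng(0)
W_label = rng.standard_normal((K, d)) * 0.1

def predict_label(h):
    """Return a softmax distribution over the K classes for node feature h."""
    z = W_label @ h
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

p = predict_label(rng.standard_normal(d))  # p sums to 1 over the K classes
```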
Language Parsing
• Language parsing is similar to scene parsing
• Differences
  • The input is a natural-language sentence
  • Adjacency is strictly left/right, since words form a sequence
  • Class labels are syntactic categories
    • Word level
    • Phrase level
    • Clause level
  • Each sentence has only one correct tree
Experiments Overview
• Image
  • Scene segmentation and annotation
  • Scene classification
  • Nearest-neighbor scene subtrees
• Language
  • Supervised language parsing
  • Nearest-neighbor phrases
Scene Segmentation and Annotation
• Dataset
  • Stanford Background Dataset
• Task
  • Pixel-wise segmentation and labeling of foreground and different types of background
• Result
  • 78.1% pixel-wise accuracy, 0.6% above the previous state of the art
Scene Classification
• Dataset
  • Stanford Background Dataset
• Task
  • Three classes: city, countryside, sea-side
• Method
  • Feature: average of all node features, or the top node feature only
  • Classifier: linear SVM
• Result
  • 88.1% accuracy with the averaged feature, 4.1% above Gist, the state-of-the-art hand-crafted feature
  • 71.0% accuracy with the top feature only
• Discussion
  • The learned RNN features better capture the semantic content of a scene
  • The top feature alone loses some lower-level information
Nearest Neighbor Scene Subtrees
• Dataset
  • Stanford Background Dataset
• Task
  • Retrieve similar segments from all images
  • A subtree whose nodes all have the same label corresponds to a segment
• Method
  • Feature: top node feature of the subtree
  • Metric: Euclidean distance
• Result
  • Similar segments are retrieved
• Discussion
  • The RNN feature captures segment-level characteristics
Supervised Language Parsing
• Dataset
  • Penn Treebank, Wall Street Journal section
• Task
  • Generate parse trees with labeled nodes
• Result
  • Unlabeled bracketing F-measure: 90.29%, comparable to the 91.63% of the Berkeley Parser
Nearest Neighbor Phrases
• Dataset
  • Penn Treebank, Wall Street Journal section
• Task
  • Retrieve the nearest neighbors of a given sentence
• Method
  • Feature: top node feature
  • Metric: Euclidean distance
• Result
  • Similar sentences are retrieved
Discussion
• Understanding the semantic structure of data is essential for applications like fine-grained search or captioning
• Recursive NNs predict tree structure along with node labels in an elegant way
• Recursive NNs can be combined with CNNs
• If we can jointly learn a Recursive NN with