

Deep Hierarchical Parsing for Semantic Segmentation

Abhishek Sharma¹, Oncel Tuzel², David W. Jacobs¹

¹Computer Science Department, University of Maryland at College Park. ²Mitsubishi Electric Research Lab, Cambridge.

A popular approach for semantic segmentation is labeling each super-pixel with one of the required semantic categories. The rich diversity in the appearance of even simple concepts (sky, water, grass) due to variation in lighting and view-point makes semantic segmentation very challenging. Contextual information from the entire image has been shown to be useful in resolving the ambiguity in super-pixel labeling [1, 2, 4]. The popular approaches for encoding context, MRF or CRF based image models, often make use of simple human-designed interaction potentials that limit the possible complexity of interactions between different parts of the image. This is done to avoid an intractable and computationally intensive inference step.

Recently, an elegant deep recursive neural network approach for semantic segmentation was proposed in [3], referred to as RCPN, Fig. 1. The main idea was to facilitate the propagation of contextual information from each super-pixel to every other super-pixel in a feed-forward manner through random binary parse trees T on super-pixels. The leaf nodes of T correspond to super-pixel features, and the internal nodes correspond to the features of contiguous merged regions that result from mergers, as per T, of multiple super-pixel regions. RCPN consists of an assembly of four neural networks: a semantic mapper (F_sem), a combiner (F_com), a decombiner (F_dec) and a categorizer (F_cat). First, F_sem maps the visual features v_i of the super-pixels into semantic-space features x_i. This is followed by a recursive combination of the semantic features of two adjacent image regions (x_i and x_j), using F_com, to yield a holistic feature vector of the entire image, termed the root feature. Next, the global information contained in the root feature is disseminated to every super-pixel in the image, using F_dec, followed by classification of the enhanced super-pixel features x̃_i by F_cat. RCPN can learn complex non-linear interactions between different parts of the image, which resulted in impressive real-time performance on standard semantic segmentation datasets.
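The feed-forward pass described above can be sketched as follows. This is a minimal illustration only: the module names F_sem, F_com, F_dec and F_cat and the overall data flow come from the text, but the single-layer tanh modules, random weights, dimensions, the fixed parse tree, and the choice of using the root feature as its own enhanced feature are assumptions made for the sketch, not the trained networks of [3].

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_s, n_cls, n_sp = 10, 8, 4, 5   # visual dim, semantic dim, classes, super-pixels

# Random single-layer tanh modules; the real F_sem, F_com, F_dec, F_cat are trained
# (and possibly deeper) networks.
W_sem = rng.normal(scale=0.3, size=(d_s, d_v))      # F_sem: v_i -> x_i
W_com = rng.normal(scale=0.3, size=(d_s, 2 * d_s))  # F_com: [x_i; x_j] -> parent feature
W_dec = rng.normal(scale=0.3, size=(d_s, 2 * d_s))  # F_dec: [x_child; x~_parent] -> x~_child
W_cat = rng.normal(scale=0.3, size=(n_cls, d_s))    # F_cat: x~_i -> class scores

v = rng.normal(size=(n_sp, d_v))   # visual features of the 5 super-pixels

# A fixed binary parse tree on super-pixels 0..4: internal node -> (left, right) children.
# Leaves are 0..4, internal nodes are 5..8, node 8 is the root.
tree = {5: (0, 1), 6: (5, 2), 7: (3, 4), 8: (6, 7)}

# 1) Semantic mapper on the leaves.
x = {i: np.tanh(W_sem @ v[i]) for i in range(n_sp)}
# 2) Combiner, bottom-up, producing the root feature.
for node in sorted(tree):
    l, r = tree[node]
    x[node] = np.tanh(W_com @ np.concatenate([x[l], x[r]]))
# 3) Decombiner, top-down, disseminating root context to every super-pixel
#    (the root's enhanced feature is taken to be the root feature itself).
x_tilde = {8: x[8]}
for node in sorted(tree, reverse=True):
    for child in tree[node]:
        x_tilde[child] = np.tanh(W_dec @ np.concatenate([x[child], x_tilde[node]]))
# 4) Categorizer on the enhanced super-pixel features.
scores = np.stack([W_cat @ x_tilde[i] for i in range(n_sp)])
labels = scores.argmax(axis=1)
print(labels)   # one predicted class per super-pixel
```

Note that every super-pixel's enhanced feature depends on the root feature, which is how each super-pixel receives context from every other super-pixel in a single feed-forward pass.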

Figure 1: Flow diagram of RCPN with bypass-error path.

This paper shows that the presence of bypass-error paths in RCPN can lead to sub-optimal parameter learning, and proposes a simple modification to improve the learning. Specifically, we propose to add the classification loss of pure nodes to the RCPN loss function, which originally consisted of the classification loss of the super-pixels only. Pure nodes are the internal nodes of T whose merged regions consist of pixels of a single semantic category only; they can therefore be used as classification targets for learning the RCPN parameters. The resulting model is termed Pure-node RCPN, or PN-RCPN. It brings three immediate benefits: a) more labels per image (around 65% of all internal nodes are pure nodes across three different datasets); b) deeper and stronger gradients; and c) the combiner is explicitly forced to learn meaningful combinations of two image regions.
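The modified objective can be sketched as below. The function and variable names (pn_rcpn_loss, node_pixel_labels) and the representation of node purity as a set of ground-truth pixel labels are illustrative assumptions; the substance, supervising pure internal nodes in addition to the super-pixels, follows the text.

```python
import numpy as np

def softmax_xent(scores, label):
    """Cross-entropy of one node's class scores against its target label."""
    z = scores - scores.max()                       # numerically stable softmax
    return -(z[label] - np.log(np.exp(z).sum()))

def pn_rcpn_loss(sp_scores, sp_labels, node_scores, node_pixel_labels):
    """sp_scores[i]: categorizer scores of super-pixel i (always supervised).
    node_scores[k]: categorizer scores of internal node k.
    node_pixel_labels[k]: set of ground-truth pixel labels inside node k's region;
    the node is pure (and hence supervised) iff that set contains one label."""
    loss = sum(softmax_xent(s, y) for s, y in zip(sp_scores, sp_labels))  # RCPN term
    for s, labs in zip(node_scores, node_pixel_labels):                   # pure-node term
        if len(labs) == 1:                  # pure node: a valid classification target
            loss += softmax_xent(s, next(iter(labs)))
    return loss

# Toy example: 2 super-pixels and 2 internal nodes, one pure ({1}) and one mixed ({0, 1}).
rng = np.random.default_rng(1)
sp_scores = rng.normal(size=(2, 3))
node_scores = rng.normal(size=(2, 3))
print(pn_rcpn_loss(sp_scores, [0, 1], node_scores, [{1}, {0, 1}]))
```

The mixed node contributes nothing, while the pure node adds a classification term whose gradient reaches the combiner directly, without passing through the decombiner.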

We use an example to understand the benefits of PN-RCPN over RCPN, depicted in Fig. 2(a) (RCPN) and Fig. 2(b) (PN-RCPN). The figures show the left half of a random parse tree for an image I with 5 super-pixels. We denote e^cat_i ∈ ℜ^{d_s} as the error at the enhanced super-pixel nodes;

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

[Figure 2: Back-propagated error tracking to visualize the effect of bypass error; (a) RCPN, (b) PN-RCPN. Please see text for the meaning of variables.]

e^dec_k ∈ ℜ^{2d_s} as the error at the decombiner; e^com_k ∈ ℜ^{2d_s} as the error at the combiner; and e^sem_i ∈ ℜ^{d_s} as the error at the semantic mapper. Here, d_s is the dimensionality of the semantic-space features and the subscript bp indicates the bypass-error at a node. We assume a non-zero categorizer error signal for the first super-pixel only, i.e., e^cat_{i≠1} = 0. These assumptions facilitate easier back-propagation tracking through the parse tree, but the conclusions drawn hold for general cases as well.

From Fig. 2(a) we can see that there are two possible paths for e^cat_1 to reach the combiner. One of them requires 2 layers (x̃_1 → x̃_6 → x_6) and the other requires 3 layers (x̃_1 → x̃_6 → x_9 → x_6). Similarly, e^cat_1 can reach x_1 through a 1-layer bypass path (x̃_1 → x_1) or through a path of several layers through the parse tree. Due to gradient attenuation, the smaller the number of layers, the stronger the back-propagated signal; therefore, bypass errors lead to g_sem ≥ g_com. This can potentially render the combiner network inoperative and guide the training towards a network that effectively consists of an N_sem + N_dec + N_cat layer network from the visual feature (v_i) to the super-pixel label (y_i). This results in little or no contextual information exchange between the super-pixels. A comparison of the gradient strengths of the different modules (g_sem, g_com, g_dec and g_cat) reveals that for RCPN g_cat > g_dec ≈ g_sem ≫ g_com,

which leads to bypassing the combiner, causing loss of contextual information. On the other hand, the PN-RCPN gradients follow the natural order g_cat > g_dec > g_com > g_sem, based on the distance from the initial label error, which leads to healthy context propagation via the combiner.
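The gradient-attenuation argument can be made concrete with a toy numerical sketch. The layer here is an arbitrary assumption: a tanh layer whose weight matrix is rescaled to spectral norm 0.8 so that back-propagation through it is contractive. Under that assumption, the norm of a back-propagated error signal shrinks with every extra layer it crosses, which is why a 1-layer bypass path delivers a much stronger signal than a multi-layer path through the parse tree.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
W = rng.normal(size=(d, d))
W *= 0.8 / np.linalg.norm(W, 2)        # rescale the layer to spectral norm 0.8 (< 1)

e_cat = rng.normal(size=d)             # categorizer error at the first super-pixel
g, norms = e_cat, []
for _ in range(5):                     # back-propagate through successive tanh layers
    h = np.tanh(rng.normal(size=d))    # a representative hidden activation
    g = W.T @ ((1 - h ** 2) * g)       # Jacobian-transpose of tanh(W x): W^T diag(1 - h^2)
    norms.append(np.linalg.norm(g))
print(norms)                           # error norm after 1, 2, ..., 5 layers
```

Each extra layer multiplies the error norm by a factor below 0.8, so the signal reaching a module through a long path is exponentially weaker than through a short bypass path.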

Furthermore, PN-RCPN also provides reliable estimates of the internal-node label distributions. We utilize the label distributions of the internal nodes to define a tree-style MRF on the parse tree, termed TM-RCPN, to model the hierarchical dependency between the nodes, which leads to spatially smooth segmentation masks. Both PN-RCPN and TM-RCPN lead to significant improvements over RCPN in terms of per-pixel accuracy, mean-class accuracy and intersection over union.

[1] Roozbeh Mottaghi, Sanja Fidler, Jian Yao, Raquel Urtasun, and Devi Parikh. Analyzing semantic segmentation using hybrid human-machine CRFs. IEEE CVPR, 2013.

[2] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. IEEE CVPR, 2014.

[3] Abhishek Sharma, Oncel Tuzel, and Ming-Yu Liu. Recursive context propagation network for semantic segmentation. NIPS, 2014.

[4] A. Torralba, K.P. Murphy, W.T. Freeman, and M.A. Rubin. Context-based vision system for place and object recognition. IEEE CVPR, 2003.