
An End-to-End Network for Panoptic Segmentation

Huanyu Liu1†, Chao Peng2, Changqian Yu3†, Jingbo Wang4†, Xu Liu5†, Gang Yu2, Wei Jiang1

1Zhejiang University, 2Megvii Inc. (Face++), 3Huazhong University of Science and Technology, 4Peking University, 5The University of Tokyo

{liuhy, jiangwei_zju}@zju.edu.cn, [email protected], changqian_yu [email protected]

[email protected], [email protected], [email protected]

Abstract

Panoptic segmentation, which needs to assign a category label to each pixel and segment each object instance simultaneously, is a challenging topic. Traditionally, the existing approaches utilize two independent models without sharing features, which makes the pipeline inefficient to implement. In addition, a heuristic method is usually employed to merge the results. However, the overlapping relationship between object instances is difficult to determine without sufficient context information during the merging process. To address these problems, we propose a novel end-to-end Occlusion Aware Network (OANet) for panoptic segmentation, which can efficiently and effectively predict both the instance and stuff segmentation in a single network. Moreover, we introduce a novel spatial ranking module to deal with the occlusion problem between the predicted instances. Extensive experiments have been done to validate the performance of our proposed method, and promising results have been achieved on the COCO Panoptic benchmark.

1. Introduction

Panoptic segmentation [18] is a new challenging topic for scene understanding. The goal is to assign each pixel a category label and segment each object instance in the image. In this task, stuff segmentation predicts the amorphous regions (denoted Stuff) while instance segmentation [14] handles the countable objects (denoted Thing). Therefore, this task provides more comprehensive scene information and can be widely used in autonomous driving and scene parsing.

Previous panoptic segmentation algorithms [18] usually contain three separate components: the instance segmentation block, the stuff segmentation block, and the merging block, as shown in Figure 1 (a). In these algorithms, the instance and stuff segmentation blocks are usually independent, without any feature sharing.

†This work was done during an internship at Megvii Inc.


Figure 1. An illustration of our end-to-end network in contrast to the traditional method. Traditional methods [18] train two subnetworks and merge the results heuristically. Our method trains a single network for the two subtasks and realizes a learnable fusion approach.

This results in apparent computational overhead. Furthermore, because of the separated models, these algorithms have to merge the corresponding separate predictions with post-processing. However, without the context information between stuff and things, the merging process faces the challenge of overlapping relationships between instances and stuff. With these three separate parts, it is hard to apply this complex pipeline in industrial applications.

In this paper, we propose a novel end-to-end algorithm shown in Figure 1 (b). As far as we know, this is the first algorithm that deals with the issues above in an end-to-end pipeline. More specifically, we incorporate instance segmentation and stuff segmentation into one network, which shares the backbone features but applies different head branches for the two tasks. During the training phase, the backbone features are optimized by the accumulated losses from both the stuff and thing supervision, while the head branches are only fine-tuned on their specific tasks.


To solve the problem of overlapping relationships between object instances, we also introduce a new algorithm called the Spatial Ranking Module. This module learns a ranking score and provides an ordering reference for instances.

In general, we summarize the contributions of our algorithm as follows:

• We are the first to propose an end-to-end occlusion-aware pipeline for the problem of panoptic segmentation.

• We introduce a novel spatial ranking module to address the ambiguities of the overlapping relationship, which commonly exist in the problem of panoptic segmentation.

• We obtain state-of-the-art performance on the COCO panoptic segmentation dataset.

2. Related Work

2.1. Instance Segmentation

There are currently two main frameworks for instance segmentation: proposal-based methods and segmentation-based methods. The proposal-based approaches [8, 14, 24, 25, 28, 29, 33] first generate object detection bounding boxes and then perform mask prediction on each box for instance segmentation. These methods are closely related to object detection algorithms such as Fast/Faster R-CNN and SPPNet [12, 15, 36]. Under this framework, the overlapping problem arises due to the independent prediction of distinct instances; that is, pixels may be allocated to wrong categories when covered by multiple masks. The segmentation-based methods use a semantic segmentation network to predict the pixel class, and obtain each instance mask by decoding the object boundary [19] or a custom field [2, 9, 27]; finally, they use a bottom-up grouping mechanism to generate the object instances. RNN methods were leveraged to predict a mask for one instance at a time in [35, 37, 46].

2.2. Semantic Segmentation

Semantic segmentation has been extensively studied, and many new methods have been proposed in recent years. Driven by powerful deep neural networks [16, 21, 39, 40], FCN [30] successfully applied deep learning to pixel-by-pixel image semantic segmentation by replacing the fully connected layers of an image classification network with convolutional layers.

Encoder-decoder structures such as U-Net [1, 38, 42, 43, 47] can gradually recover resolution and capture more object details. Global Convolutional Network [34] proposes a large-kernel method to relieve the contradiction between classification and localization. DFN [43] designs a channel attention block to select feature maps. DeepLab [4, 6] and PSPNet [48] use atrous spatial pyramid pooling or spatial pyramid pooling to obtain multi-scale context. The method of [44] uses dilated convolution to enlarge the field of view. Multi-scale features were used to obtain a sufficient receptive field in [5, 13, 41].

Related datasets are also constantly being enriched and expanded. Currently, there are public datasets such as VOC [11], Cityscapes [7], ADE20K [49], Mapillary Vistas [32], and COCO Stuff [3].

2.3. Panoptic Segmentation

The panoptic segmentation task was first proposed in [18], and research on this task is still limited. A weakly supervised model that jointly performs semantic and instance segmentation was proposed by [22]; it uses weak bounding-box annotations for "thing" classes and image-level tags for "stuff" classes. JSIS-Net [10] proposes a single network with an instance segmentation head [14] and a pyramid stuff segmentation head [48], followed by heuristics to merge the two kinds of outputs. Li et al. [23] propose AUNet, which can leverage proposal- and mask-level attention to obtain better background results.

2.4. Multi-task Learning

Panoptic segmentation can also be treated as a multi-task learning problem. Two different tasks can be trained together through strategies such as [17, 31]. UberNet [20] jointly handles low-, mid-, and high-level vision tasks in a single network, including boundary detection, semantic segmentation, and normal estimation. Zamir et al. [45] build a directed graph named taskonomy, which can effectively measure and leverage the correlation between different visual tasks; it avoids repetitive learning and enables learning with less data.

3. Proposed End-to-end Framework

The overview of our algorithm is illustrated in Figure 2. There are three major components in our algorithm: 1) the stuff branch predicts the stuff segmentation for the whole input; 2) the instance branch provides the instance segmentation predictions; 3) the spatial ranking module generates a ranking score for each instance.

3.1. End-to-end Network Architecture

We employ FPN [26] as the backbone architecture for the end-to-end network. For instance segmentation, we adopt the original Mask R-CNN [14] as our network framework. We apply a top-down pathway and lateral connections to get feature maps, and then a 3×3 convolution layer is appended to get the RPN feature maps.


Figure 2. The illustration of the overall framework. Given an input image, we use the FPN network to provide feature maps for the stuff branch and the instance branch. The two branches generate intermediate results, which are later passed to our spatial ranking module. The spatial ranking module learns a ranking score for each instance as the final merging evidence.

After that, we apply the ROIAlign [14] layer to extract object proposal features and get three predictions: the proposal classification score, the proposal bounding box coordinates, and the proposal instance mask.

For stuff segmentation, two 3×3 convolution layers are stacked on the RPN feature maps. For the sake of multi-scale feature extraction, we then concatenate these layers and apply one succeeding 3×3 convolution layer and a 1×1 convolution layer. Figure 3 presents the details of the stuff branch. During training, we supervise the stuff segmentation and thing segmentation simultaneously, as the auxiliary object information provides object context for stuff prediction. In inference, we only extract the stuff predictions and normalize them to probabilities.
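As a concrete illustration of this wiring, below is a minimal PyTorch sketch of the stuff branch under our reading of the text. The channel widths, the class count (80 thing + 53 stuff = 133), and the choice to share the per-level convolution weights across scales are our assumptions; names such as StuffBranch and fpn_feats are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StuffBranch(nn.Module):
    """Sketch of the stuff head: two 3x3 convs per FPN level, upsampling,
    concatenation across scales, then a 3x3 and a 1x1 conv for prediction."""

    def __init__(self, in_channels=256, mid_channels=128, num_classes=133):
        super().__init__()
        # Two stacked 3x3 convolutions applied to each RPN feature map.
        self.per_level = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # After concatenating 4 upsampled levels: one 3x3, then one 1x1 conv.
        self.fuse = nn.Conv2d(mid_channels * 4, mid_channels, 3, padding=1)
        self.classify = nn.Conv2d(mid_channels, num_classes, 1)

    def forward(self, fpn_feats):
        # fpn_feats: list of 4 maps, finest first (e.g. strides 4/8/16/32).
        target = fpn_feats[0].shape[-2:]
        ups = [F.interpolate(self.per_level(f), size=target, mode="bilinear",
                             align_corners=False) for f in fpn_feats]
        x = F.relu(self.fuse(torch.cat(ups, dim=1)))
        return self.classify(x)  # per-pixel logits over stuff+thing classes
```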

To break the information flow barrier during training and to make the whole pipeline more efficient, we share the backbone features between the two branches. The issue raised here can be divided into two parts: 1) the sharing granularity on feature maps and 2) the balance between the instance loss and the stuff loss. In practice, we find that the more feature maps are shared, the better the performance we obtain. Thus, we share the feature maps up to the skip-connection layers, that is, the 3×3 convolution layer before the RPN head shown in Figure 3.

$$L_{total} = \underbrace{L_{rpn\_cls} + L_{rpn\_bbox} + L_{cls} + L_{bbox} + L_{mask}}_{\text{instance branch}} + \lambda \cdot \underbrace{L_{seg(stuff+object)}}_{\text{stuff branch}} + L_{srm} \qquad (1)$$

Figure 3. A building block illustrating the stuff segmentation subnetwork. Here we share both the backbone and skip-connection feature maps between the stuff branch and the instance branch. Besides, we predict both object and stuff categories in the stuff branch.

As for the balance of the two supervisions, we first present the multiple losses in Equation 1. The instance branch contains 5 losses: $L_{rpn\_cls}$ is the RPN objectness loss, $L_{rpn\_bbox}$ is the RPN bounding-box loss, $L_{cls}$ is the classification loss, $L_{bbox}$ is the object bounding-box regression loss, and $L_{mask}$ is the average binary cross-entropy loss for mask prediction. As for the stuff branch, there is only one semantic segmentation loss, $L_{seg(stuff+object)}$.


Figure 4. An illustration of the spatial ranking score map prediction. Each pixel vector in the instance feature map represents the instance prediction result at that pixel. Red means that an object of the corresponding category includes this pixel, and multiple red channels indicate the occlusion problem between instances. We use the panoptic segmentation category label to supervise the spatial ranking score map.

The hyperparameter λ is employed for loss balance and will be discussed later. $L_{srm}$ represents the loss function of the spatial ranking module, which is described in the next section.
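To make the weighting explicit, here is a minimal sketch of how the terms of Equation 1 could be combined in a training step; the dictionary keys are illustrative, and λ = 0.25 anticipates the ablation of Section 4.3.

```python
def total_loss(inst_losses, stuff_seg_loss, srm_loss, lam=0.25):
    """Equation 1: five instance-branch losses, plus the lambda-weighted
    stuff segmentation loss, plus the spatial ranking loss."""
    instance_term = sum(inst_losses[k] for k in
                        ("rpn_cls", "rpn_bbox", "cls", "bbox", "mask"))
    return instance_term + lam * stuff_seg_loss + srm_loss
```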

3.2. Spatial Ranking Module

The modern instance segmentation framework is often based on an object detection network with an additional mask prediction branch, such as Mask R-CNN [14], which is usually based on FPN [26]. Generally speaking, current object detection frameworks do not consider the overlapping problem between different classes, since the popular metrics, e.g., AP and AR, are not affected by this issue. However, in the task of panoptic segmentation, since the number of pixels in one image is fixed, the overlapping problem, or specifically the multiple assignments of one pixel, must be resolved.

Commonly, the detection score is used to sort the instances in descending order, and they are then assigned to the stuff canvas by the rule that higher-scoring objects sit on top of lower-scoring ones. However, this heuristic algorithm can easily fail in practice. For example, consider a person wearing a tie, shown in Figure 7. As the person class is more frequent than the tie class in the COCO dataset, its detection score tends to be higher than that of the tie bounding box. Under the above simple rule, the tie instance is thus covered by the person instance, and the performance drops.
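For reference, the criticized heuristic can be sketched as follows: instances are pasted onto the canvas in ascending score order, so the highest-scoring mask ends up on top wherever masks overlap. The canvas encoding and field names are assumptions.

```python
import numpy as np

def heuristic_merge(instances, canvas):
    """Paste instance ids in ascending score order; on overlapping pixels
    the highest-scoring instance wins, reproducing the score-based rule."""
    for inst in sorted(instances, key=lambda d: d["score"]):
        canvas[inst["mask"]] = inst["id"]  # mask: HxW boolean array
    return canvas

# In the person-wearing-a-tie example, the person's higher score makes
# its mask overwrite the tie's pixels, which is exactly the failure mode.
```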

Could we alleviate this phenomenon through the panoptic annotation? That is, if we force the network to learn the person annotation with a hole in the place of the tie, could we avoid the above situation? As shown in Table 3, we conduct the experiment with the above-mentioned annotations, but only observe degraded performance. Therefore, this approach is not applicable currently.

To counter this problem, we resort to a semantic-like approach and propose a simple but very effective algorithm for dealing with occlusion problems, called the spatial ranking module. As shown in Figure 4, we first map the results of the instance segmentation to a tensor of the input size. The number of channels of this feature map is the number of object categories, and instances of different categories are mapped to the corresponding channels. The instance tensor is initialized to zero, and the mapped values are set to one. We then append a large-kernel convolution [34] to this tensor to obtain the ranking score map. In the end, we use the pixel-wise cross-entropy loss to optimize the ranking score map, as Equation 2 shows. $S_{map}$ represents the output ranking score map and $S_{label}$ represents the corresponding non-overlapping semantic label.

$$L_{srm} = \mathrm{CE}(S_{map}, S_{label}) \qquad (2)$$
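Putting the pieces together, the following is a minimal sketch of the module as we read it: instance masks are scattered into a per-class binary tensor, a large-kernel convolution (here the 1×7 + 7×1 factorization that performs best in Table 5) produces the ranking score map, and Equation 2 supervises it. The intermediate channel width and the module and field names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialRankingModule(nn.Module):
    def __init__(self, num_thing_classes=80, mid_channels=256):
        super().__init__()
        # Large-kernel convolution factorized as 1x7 followed by 7x1.
        self.conv1x7 = nn.Conv2d(num_thing_classes, mid_channels,
                                 kernel_size=(1, 7), padding=(0, 3))
        self.conv7x1 = nn.Conv2d(mid_channels, num_thing_classes,
                                 kernel_size=(7, 1), padding=(3, 0))

    def forward(self, instances, height, width):
        # Scatter each predicted instance mask into its class channel.
        x = torch.zeros(1, self.conv1x7.in_channels, height, width)
        for inst in instances:  # inst: {"cls": int, "mask": HxW bool tensor}
            x[0, inst["cls"]][inst["mask"]] = 1.0
        return self.conv7x1(F.relu(self.conv1x7(x)))  # ranking score map

# Training (Equation 2): pixel-wise cross entropy against the
# non-overlapping semantic label, ignoring non-conflicting pixels, e.g.
# loss = F.cross_entropy(score_map, sem_label, ignore_index=255)
```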

After getting the ranking score map, we calculate the ranking score of each instance object as in Equation 3. Here, $S_{i,j,cls}$ represents the ranking score value at pixel $(i, j)$ of class $cls$; note that $S_{i,j,cls}$ has been normalized to a probability distribution. $m_{i,j}$ is the mask indicator, representing whether pixel $(i, j)$ belongs to the instance. The ranking score of the whole instance, $P_{objs}$, is computed as the average of the pixel ranking scores within the mask.

$$P_{objs} = \frac{\sum_{(i,j)\in objs} S_{i,j,cls} \cdot m_{i,j}}{\sum_{(i,j)\in objs} m_{i,j}} \qquad (3)$$

$$m_{i,j} = \begin{cases} 1 & (i,j) \in \text{instance} \\ 0 & (i,j) \notin \text{instance} \end{cases} \qquad (4)$$
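A sketch of Equations 3 and 4: the score map is normalized to a per-pixel probability distribution over classes (softmax here, as one plausible choice), and the instance score is the mean over its mask pixels. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def instance_ranking_score(score_map, mask, cls):
    """score_map: (C, H, W) ranking logits; mask: (H, W) bool; cls: int.
    Returns P_objs, the mean normalized score over the mask (Eq. 3/4)."""
    probs = F.softmax(score_map, dim=0)   # per-pixel normalization
    m = mask.float()                      # m_{i,j} = 1 inside the mask
    return (probs[cls] * m).sum() / m.sum().clamp(min=1.0)
```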

Let us reconsider the example mentioned above with our proposed spatial ranking module. When we forward the person mask and the tie mask into this module, we obtain the spatial ranking scores for these two objects using Equation 3. With the ranking score, the sorting rule of the previous method becomes more reliable, and the performance is improved, as the experiments in the next section show.


4. Experiments

4.1. Dataset and Evaluation Metrics

Dataset: We conduct all experiments on the COCO panoptic segmentation dataset [18]. This dataset contains 118K images for training and 5K images for validation, with annotations for 80 thing categories and 53 stuff classes. We only employ the training images for model training and test on the validation set. Finally, we submit test-dev results to the COCO 2018 panoptic segmentation leaderboard.

Evaluation metrics: We use the standard evaluation metric defined in [18], called Panoptic Quality (PQ). It contains two factors: 1) the Segmentation Quality (SQ), which measures the quality of all categories, and 2) the Detection Quality (DQ), which measures only the instance classes. The mathematical formulations of PQ, SQ and DQ are presented in Equation 5, where p and g are predictions and ground truth, and TP, FP, FN represent true positives, false positives and false negatives. It is easy to see that SQ is the common mean IoU metric averaged over matched instances, and DQ can be regarded as a form of detection accuracy. The matching threshold is set to 0.5; that is, if the pixel IoU between a prediction and the ground truth is larger than 0.5, the prediction is regarded as matched, otherwise as unmatched. For stuff classes, each stuff class in an image is regarded as one instance, regardless of its shape.

$$PQ = \underbrace{\frac{\sum_{(p,g)\in TP} IOU(p,g)}{|TP|}}_{\text{Segmentation Quality (SQ)}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\text{Detection Quality (DQ)}} \qquad (5)$$
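A sketch of Equation 5 for a single category, assuming matching at IoU > 0.5 has already been performed; matches holds the IoU of each true-positive pair, and the function name is illustrative.

```python
def panoptic_quality(matches, num_fp, num_fn):
    """matches: IoU values of matched (prediction, ground truth) pairs.
    Returns (PQ, SQ, DQ) following Equation 5."""
    tp = len(matches)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matches) / tp                        # mean IoU over matches
    dq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # detection quality
    return sq * dq, sq, dq
```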

4.2. Implementation Details

We choose ResNet-50 [16] pretrained on ImageNet for the ablation studies. We use SGD as the optimization algorithm with momentum 0.9 and weight decay 0.0001. A multi-stage learning rate policy with a warmup strategy [33] is adopted: in the first 2,000 iterations, we use a linear gradual warmup policy, increasing the learning rate from 0.002 to 0.02. After 60,000 iterations, we decrease the learning rate to 0.002 for the next 20,000 iterations, and further set it to 0.0002 for the remaining 20,000 iterations. The batch size is set to 16, which means each GPU consumes two images in one iteration. For other details, we follow the practice of Mask R-CNN [14].
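Written out, the schedule above gives the learning rate as a function of the iteration count; this sketch assumes a 100k-iteration run, which matches the milestones in the text.

```python
def learning_rate(it):
    """Linear warmup 0.002 -> 0.02 over the first 2,000 iterations,
    then 0.02 until 60k, 0.002 until 80k, and 0.0002 afterwards."""
    if it < 2000:
        return 0.002 + (0.02 - 0.002) * it / 2000
    if it < 60000:
        return 0.02
    if it < 80000:
        return 0.002
    return 0.0002
```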

Besides the training of the two branches of our network, a little more attention should be paid to the spatial ranking module. During training, the supervision label is the corresponding non-overlapping semantic label, and we train the module as a semantic segmentation task. We ignore the non-conflicting pixels to force the network to focus on the conflicting regions.

During inference, we set the maximum number of boxes per image to 100 and the minimum area of a connected stuff region to 4,900 pixels. As for the spatial ranking module, since we do not have ground truth at this point, the outputs of the instance branch are passed through this module to resolve the overlapping issue.
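The inference-time filtering reduces to two thresholds; a minimal sketch, with field names assumed:

```python
def postprocess(instances, stuff_regions, max_boxes=100, min_area=4900):
    """Keep the 100 highest-scoring instances per image and drop connected
    stuff regions smaller than 4,900 pixels, as described above."""
    kept = sorted(instances, key=lambda d: d["score"], reverse=True)[:max_boxes]
    stuff = [r for r in stuff_regions if r["area"] >= min_area]
    return kept, stuff
```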

4.3. Ablation Study on Network Structure

In this subsection, we focus on the properties of our end-to-end network design. Three points should be discussed: the loss balance parameter, the object context for the stuff branch, and the sharing mode of the two branches. To avoid the Cartesian product of experiments, we only modify the specific parameter under study and keep the others at their optimal values.


Figure 5. The mean gradient value of a specified layer in the two branches, plotted over one epoch of training iterations. The learning rate of the two branches is the same. The horizontal axis is the number of iterations, and the vertical axis is the mean gradient value of the last backbone layer.

λ      PQ    PQTh  PQSt
0.20   36.9  45.0  24.6
0.25   37.2  45.4  24.9
0.33   36.9  44.4  25.4
0.50   36.5  43.5  25.9
0.75   35.3  41.9  25.4
1.00   -     -     -

Table 1. Loss balance between instance segmentation and stuff segmentation.

The loss balance issue comes from the fact that the gradients from the stuff branch and the instance branch are not close. We compute statistics of the mean gradients of the two branches with respect to the last feature map of the backbone, where the hyperparameter λ is set to 1 for fairness.



Figure 6. Feature sharing mode visualization. The first column is the original image, the second column is the result with shared backbone features, and the last column is the no-sharing result.

As shown in Figure 5, it is clear that the gradient from the stuff branch dominates the penalty signal. Therefore, we introduce the hyperparameter λ in Equation 1 to balance the gradients. We conduct experiments with λ ∈ {0.2, 0.25, 0.33, 0.5, 0.75, 1.0}; the interval is not uniform for the sake of search efficiency. As Table 1 summarizes, λ = 0.25 is the optimal choice. Note that if we set λ = 1, which means the instance and stuff branches are trained as in separate models, the network does not converge under our default learning rate policy.

Object context is a natural choice in stuff segmentation. Although we only want the stuff predictions from this branch, the lack of object supervision would introduce holes in the ground truth, resulting in discontinuous context around objects. Therefore, we conduct a pair of comparative experiments in which one model is supervised on all 133 categories and the other is trained on the 53 stuff classes only.

Stuff-SC  Object-SC  PQ    PQTh  PQSt
X         -          36.7  43.8  25.9
X         X          37.2  45.4  24.9

Table 2. Ablation study on the stuff segmentation network design. Stuff-SC means supervising the stuff classes only; Stuff-SC together with Object-SC means predicting all classes.

The results in Table 2 show a 0.5-point improvement in overall PQ with object context.

Sharing features is a key point of our network design. The benefits of sharing are twofold: 1) the two branches can absorb useful information from each other's supervision signals, and 2) computation resources are saved since the shared network is computed only once.


Methods                    backbone    PQ    PQTh  PQSt  SQ    SQTh  SQSt  DQ    DQTh  DQSt
baseline                   ResNet-50   37.2  45.4  24.9  77.1  81.5  70.6  45.7  54.4  32.5
w/ pano-instance GT        ResNet-50   36.1  43.5  24.9  76.1  80.0  70.3  44.5  52.4  32.7
w/ spatial ranking module  ResNet-50   39.0  48.3  24.9  77.1  81.4  70.6  47.8  58.0  32.5
baseline                   ResNet-101  38.8  46.9  26.6  78.2  82.0  72.5  47.4  55.9  34.5
w/ spatial ranking module  ResNet-101  40.7  50.0  26.6  78.2  82.0  72.5  49.6  59.7  34.5

Table 3. Results on the MS-COCO panoptic segmentation validation set with our spatial ranking module. "w/ pano-instance GT" means using the panoptic segmentation ground truth to generate the instance segmentation ground truth; this variant is trained as two separate networks.

Methods            backbone    PQ    PQTh  PQSt
no sharing         ResNet-50   36.5  44.4  24.6
res1-res5          ResNet-50   37.0  44.8  25.2
+ skip-connection  ResNet-50   37.2  45.4  24.9
no sharing         ResNet-101  38.2  46.3  26.0
+ skip-connection  ResNet-101  38.8  46.9  26.6

Table 4. Ablation study on whether to share stuff segmentation and instance segmentation features, and on the way features are shared. With the ResNet-50 backbone, sharing features gains 0.7 PQ; with ResNet-101 it gains 0.6 PQ. "res1-res5" means sharing only the backbone ResNet features; "+ skip-connection" means sharing both the backbone features and the FPN skip-connection branch.

Conv Settings  PQ    PQTh  PQSt
1×1            38.4  47.4  24.9
3×3            38.7  47.8  24.9
1×7 + 7×1      39.0  48.3  24.9

Table 5. Results for different convolution settings of the spatial ranking module. 1×1 means the convolution kernel size is 1. The results show that a large receptive field helps the spatial ranking module gather more context and achieve better results.

To investigate the granularity of feature sharing, we conduct two experiments: a shallow-share model that only shares the backbone features, and a deep-share model that further shares the feature maps before the RPN head, as Figure 3 presents. Table 4 compares the different settings; the deep-share model outperforms the separately trained baseline by 0.7% PQ. Figure 6 visualizes the effect of sharing features.

4.4. Ablation Study on Spatial Ranking Module

Supervision with non-overlapping annotations is a straightforward idea to resolve the object ranking issue. Here, we process the panoptic ground truth and extract non-overlapping annotations for instance segmentation. The 2nd row of Table 3 gives the results of this idea. Unfortunately, merely replacing the instance ground truth does not help improve the performance, and may even reduce the accuracy and recall for objects.

Methods        PQ    PQTh  PQSt
Artemis        16.9  16.8  17.0
JSIS-Net [10]  27.2  29.6  23.4
MMAP-seg       32.1  38.9  22.0
LeChen         37.0  44.8  25.2
Ours (OANet)   41.3  50.4  27.7

Table 6. Results on the COCO 2018 panoptic segmentation challenge test-dev set. These results verify the effectiveness of our feature sharing mode and the spatial ranking module. We use ResNet-101 as our base model.

This phenomenon may come from the fact that most objects in COCO do not exhibit the overlapping issue, and forcing the network to learn non-overlapping masks hurts the overall performance.

Figure 7. A visualization result of the spatial ranking module. The left two segments denote the instances detected by our model; the det bbox score is the object detection score predicted in detection, and the spatial ranking score is the value produced by our approach.

The spatial ranking module proposed in this paper aims to solve the overlapping issue in panoptic segmentation. Table 5 shows that a large receptive field helps the spatial ranking module gather more context and achieve better results.



Figure 8. Visualization results using our spatial ranking module. The first column is the input image, the second column is the panoptic segmentation result with the heuristic method, and the last column is the result using our approach.

As we can see in Table 3 (the 3rd row for ResNet-50 and the last row for ResNet-101), compared with the end-to-end baseline above, our spatial ranking module improves PQ by 1.8%. Specifically, PQTh increases by 2.9%, while the stuff metrics remain the same. These facts show that the purpose of our spatial ranking module is justified. We also test our OANet on COCO test-dev, as shown in Table 6; compared with the results of other methods, our method achieves the state-of-the-art result. For detailed results, please refer to the leaderboard‡.

Figure 7 explains the principle of our spatial ranking module. For the example input image, the network predicts a person plus a tie, and their bounding box scores are 0.997 and 0.662 respectively. If we used these scores to decide the result, the tie would be covered by the person.

‡http://cocodataset.org/#panoptic-leaderboard

However, in our method we obtain a spatial ranking score for each instance, 0.325 and 0.878 respectively. With the help of the new scores, we get the right predictions. Figure 8 shows more examples.

5. Conclusion

In this paper, we propose a novel end-to-end occlusion-aware algorithm, which incorporates the common semantic segmentation and instance segmentation into a single model. In order to better employ the different supervisions and reduce the consumption of computation resources, we investigate feature sharing between the different branches and find that we should share as many features as possible. Besides, we observe the particular ranking problem raised in panoptic segmentation and design a simple but effective spatial ranking module to deal with this issue. The experimental results show that our approach outperforms the previous state-of-the-art models.


References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 2017.
[2] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[3] H. Caesar, J. Uijlings, and V. Ferrari. COCO-Stuff: Thing and stuff classes in context. In CVPR, 2018.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI, 2018.
[5] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
[6] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[9] B. De Brabandere, D. Neven, and L. Van Gool. Semantic instance segmentation with a discriminative loss function. arXiv:1708.02551, 2017.
[10] D. de Geus, P. Meletis, and G. Dubbelman. Panoptic segmentation with a joint semantic and instance segmentation network. arXiv:1809.02110, 2018.
[11] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. In IJCV, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[14] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] A. Kendall, Y. Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018.
[18] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar. Panoptic segmentation. arXiv:1801.00868, 2018.
[19] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother. InstanceCut: From edges to instances with multicut. In CVPR, 2017.
[20] I. Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] Q. Li, A. Arnab, and P. H. Torr. Weakly- and semi-supervised panoptic segmentation. In ECCV, 2018.
[23] Y. Li, X. Chen, Z. Zhu, L. Xie, G. Huang, D. Du, and X. Wang. Attention-guided unified network for panoptic segmentation. arXiv:1812.03904, 2018.
[24] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
[25] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. DetNet: Design backbone for object detection. In ECCV, 2018.
[26] T.-Y. Lin, P. Dollar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[27] S. Liu, J. Jia, S. Fidler, and R. Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[28] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[29] S. Liu, X. Qi, J. Shi, H. Zhang, and J. Jia. Multi-scale patch aggregation (MPA) for simultaneous detection and segmentation. In CVPR, 2016.
[30] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[31] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In CVPR, 2016.
[32] G. Neuhold, T. Ollmann, S. R. Bulo, and P. Kontschieder. The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV, 2017.
[33] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun. MegDet: A large mini-batch object detector. In CVPR, 2018.
[34] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters: Improve semantic segmentation by global convolutional network. In CVPR, 2017.
[35] M. Ren and R. S. Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017.
[36] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[37] B. Romera-Paredes and P. H. S. Torr. Recurrent instance segmentation. In ECCV, 2016.
[38] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
[40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[41] F. Xia, P. Wang, L.-C. Chen, and A. L. Yuille. Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net. In ECCV, 2016.
[42] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
[43] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018.
[44] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv:1511.07122, 2015.
[45] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. In CVPR, 2018.
[46] Z. Zhang, A. G. Schwing, S. Fidler, and R. Urtasun. Monocular object instance segmentation and depth ordering with CNNs. In CVPR, 2015.
[47] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun. ExFuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.
[48] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
[49] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ADE20K dataset. In CVPR, 2017.
