  • DSSD : Deconvolutional Single Shot Detector

    Cheng-Yang Fu¹, Wei Liu¹, Ananth Ranga², Ambrish Tyagi², Alexander C. Berg¹

    ¹UNC Chapel Hill, ²Amazon Inc.

    {cyfu, wliu}@cs.unc.edu, {ananthr, ambrisht}@amazon.com, [email protected]

    Abstract

    The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research. Results are shown on both PASCAL VOC and COCO detection. Our DSSD with 513 × 513 input achieves 81.5% mAP on VOC2007 test, 80.0% mAP on VOC2012 test, and 33.2% mAP on COCO, outperforming a state-of-the-art method, R-FCN [3], on each dataset.

    1. Introduction

    The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. The end result achieves the current highest accuracy for detection with a single network on PASCAL VOC [6] while also maintaining comparable speed with a previous state-of-the-art detector [3]. To achieve this we first combine a state-of-the-art classifier (Residual-101 [14]) with a fast detection framework (SSD [18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-scale context in object detection and improve accuracy, especially for small objects, calling our resulting system DSSD for deconvolutional single shot detector. While these two contributions are easily described at a high level, a naive implementation does not succeed. Instead we show that carefully adding additional stages of learned transformations, specifically a module for feed-forward connections in deconvolution and a new output module, enables this new approach and forms a potential way forward for further detection research.

    Equal Contribution

    Putting this work in context, there has been a recent move in object detection back toward sliding-window techniques in the last two years. The idea is that instead of first proposing potential bounding boxes for objects in an image and then classifying them, as exemplified in selective search [27] and R-CNN [12] derived methods, a classifier is applied to a fixed set of possible bounding boxes in an image. While sliding-window approaches never completely disappeared, they had gone out of favor after the heyday of HOG [4] and DPM [7] due to the increasingly large number of box locations that had to be considered to keep up with the state of the art. They are coming back as more powerful machine learning frameworks integrating deep learning are developed. These allow fewer potential bounding boxes to be considered, but in addition to a classification score for each box, require predicting an offset to the actual location of the object, snapping to its spatial extent. Recently these approaches have been shown to be effective for bounding box proposals [5, 24] in place of bottom-up grouping of segmentation [27, 12]. Even more recently, these approaches were used to not only score bounding boxes as potential object locations, but to simultaneously predict scores for object categories, effectively combining the steps of region proposal and classification. This is the approach taken by You Only Look Once (YOLO) [23], which computes a global feature map and uses a fully-connected layer to predict detections in a fixed set of regions. Taking this single-shot approach further by adding layers of feature maps for each scale and using a convolutional filter for prediction, the Single Shot MultiBox Detector (SSD) [18] is significantly more accurate and is currently the best detector with respect to the speed-vs-accuracy trade-off.
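    The idea of scoring a fixed box and also predicting an offset that snaps it to the object can be illustrated with a minimal sketch. It assumes the center-size parameterization commonly used in the R-CNN family and SSD (center shifts scaled by the box size, width and height changes in log space); this text does not specify the encoding, and the name `decode_box` is hypothetical.

```python
import math

def decode_box(default_box, offsets):
    """Apply predicted offsets to a fixed default box.

    Both arguments use (cx, cy, w, h) order. The parameterization is an
    assumption: center shifts are fractions of the default box size, and
    size changes are predicted in log space.
    """
    cx, cy, w, h = default_box
    dx, dy, dw, dh = offsets
    return (
        cx + dx * w,       # shift center x by a fraction of the width
        cy + dy * h,       # shift center y by a fraction of the height
        w * math.exp(dw),  # rescale width
        h * math.exp(dh),  # rescale height
    )

# Zero offsets leave the default box unchanged.
unchanged = decode_box((0.5, 0.5, 0.2, 0.2), (0.0, 0.0, 0.0, 0.0))
```

    Because the offsets are relative to the default box, the same regressor output can be reused for boxes of any size or position.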

    When looking for ways to further improve the accuracy of detection, obvious targets are better feature networks and adding more context, especially for small objects, in addition to improving the spatial resolution of the bounding box

    arXiv:1701.06659v1 [cs.CV] 23 Jan 2017

  • [Figure 1 diagram. Labels in the diagram: conv1, pool1, conv2_x, conv3_x, conv4_x, conv5_x (backbone, shown for both networks), SSD Layers, DSSD Layers, Original Prediction layer, Prediction Module, Deconvolution Module.]

    Figure 1: Networks of SSD and DSSD on a residual network. The blue modules are the layers added in the SSD framework, and we call them SSD Layers. In the bottom figure, the red layers are DSSD layers.

    prediction process. Previous versions of SSD were based on the VGG [26] network, but many researchers have achieved better accuracy for tasks using Residual-101 [14]. Looking to concurrent research outside of detection, there has been work on integrating context using so-called encoder-decoder networks, where a bottleneck layer in the middle of a network is used to encode information about an input image and then progressively larger layers decode this into a map over the whole image. The resulting wide, narrow, wide structure of the network is often referred to as an hourglass. These approaches have been especially useful in recent work on semantic segmentation [21] and human pose estimation [20].
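    The wide, narrow, wide shape of an hourglass can be made concrete by tracking the spatial size of the feature map through the network. The sketch below assumes every encoder stage halves the resolution and every decoder stage doubles it; the function name and the example sizes are hypothetical and are not taken from the paper.

```python
def hourglass_sizes(input_size, depth):
    """Spatial sizes through an encoder-decoder with stride-2 stages.

    Assumes each encoder stage halves the resolution and each decoder
    stage doubles it, so input_size must be divisible by 2 ** depth.
    """
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2 ** depth")
    encoder = [input_size // (2 ** i) for i in range(depth + 1)]
    decoder = encoder[-2::-1]  # mirror the encoder, skipping the bottleneck
    return encoder + decoder

sizes = hourglass_sizes(64, 3)
# -> [64, 32, 16, 8, 16, 32, 64]
```

    The smallest entry is the bottleneck; the decoder half restores the resolution needed for dense, per-location output.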

    Unfortunately neither of these modifications, using the much deeper Residual-101 or adding deconvolution layers to the end of SSD feature layers, works out of the box. Instead it is necessary to carefully construct combination modules for integrating deconvolution, and output modules to insulate the Residual-101 layers during training and allow effective learning.

    The code will be open sourced with models upon publication.

    2. Related Work

    The majority of object detection methods, including SPPnet [13], Fast R-CNN [11], Faster R-CNN [24], R-FCN [3] and YOLO [23], use the top-most layer of a ConvNet to learn to detect objects at different scales. Although powerful, this places a great burden on a single layer to model all possible object scales and shapes.

    There are a variety of ways to improve detection accuracy by exploiting multiple layers within a ConvNet. The first set of approaches combines feature maps from different layers of a ConvNet and uses the combined feature map to do prediction. ION [1] uses L2 normalization [19] to combine multiple layers from VGGNet and pool features for object proposals from the combined layer. HyperNet [16] also follows a similar method and uses the combined layer to learn object proposals and to pool features. Because the

  • [Figure 2 diagram. Four panels, (a) to (d). Labels in the diagram: Feature Layer, Conv 1x1x256, Conv 1x1x1024, Eltw Sum, Cls, Loc Regress.]

    Figure 2: Variants of the prediction module

    combined feature map has features from different levels of abstraction of the input image, the pooled feature is more descriptive and is better suited for localization and classification. However, the combined feature map not only increases the memory footprint of a model significantly but also decreases the speed of the model.
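    Combining layers through L2 normalization can be sketched for a single spatial location as follows. This is an illustration of the general idea only, not the ION or HyperNet implementation: the function names are hypothetical, and `scale` is a fixed stand-in for whatever rescaling a real system applies after normalization.

```python
import math

def l2_normalize(vector, scale=1.0, eps=1e-12):
    """Rescale a feature vector to L2 norm `scale` (eps avoids division by zero)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [scale * v / (norm + eps) for v in vector]

def combine_layers(per_layer_features, scale=1.0):
    """Concatenate L2-normalized feature vectors taken at one spatial location."""
    combined = []
    for features in per_layer_features:
        combined.extend(l2_normalize(features, scale))
    return combined

shallow = [3.0, 4.0]       # 2 channels with small activations
deep = [30.0, 0.0, 40.0]   # 3 channels with much larger activations
combined = combine_layers([shallow, deep])
# Both layers now contribute vectors of unit norm, and the result has 2 + 3 channels.
```

    The channel count of the combined vector is the sum over all contributing layers, which is where the extra memory and compute cost noted above comes from.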

    Another set of methods uses different layers within a ConvNet to predict objects of different scales. Because the nodes in different layers have different receptive fields, it is natural to predict large objects from layers with large receptive fields (called higher or later layers within a ConvNet) and use layers with small receptive fields to predict small objects. SSD [18] spreads out default boxes of different scales to multiple layers within a ConvNet and forces each layer to focus on predicting objects of a certain scale. MS-CNN [2] applies deconvolution on multiple layers of a ConvNet to increase feature map resolution before using the layers to learn region proposals and pool features. However, in order to detect small objects well, these methods need to use some information from shallow layers with small receptive fields and dense feature maps, which may cause low performance on small objects because shallow layers have less semantic information about objects. By using deconvolution layers and skip connections, we can inject more semantic information into dense (deconvolution) feature maps, which in turn helps predict small objects.
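    One way to spread default-box scales across prediction layers is the linear rule given in the SSD paper [18], s_k = s_min + (s_max − s_min)(k − 1)/(m − 1) for k = 1, ..., m. The sketch below assumes that rule with the SSD paper's defaults s_min = 0.2 and s_max = 0.9; these values are not stated in this section, and the function name is hypothetical.

```python
def default_box_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced default-box scales, one per prediction layer.

    Scales are fractions of the input image size. The earliest (densest)
    layer gets s_min and the last (coarsest) layer gets s_max.
    """
    if num_layers < 2:
        raise ValueError("need at least two prediction layers")
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

scales = default_box_scales(6)
# Roughly 0.2, 0.34, 0.48, 0.62, 0.76, 0.9: small boxes on early layers, large on late ones.
```

    Tying each layer to one scale is what makes the early, shallow layers responsible for small objects, which is the weakness the paragraph above points out.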

    There is another line of work which tries to include context information for prediction. Multi-Region CNN [10] pools features not only from the region proposal but also from pre-defined regions such as half parts, center, border and the context area. Following many existing works on semantic segmentation [21] and pose estimation [20], we propose to use an encoder-decoder hourglass structure to pass context information before doing prediction. The deconvolution layers not only address the problem of shrinking resolution of feature maps in convolutional neural networks, but also bring in context information for prediction.
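    How a deconvolution (transposed convolution) layer increases resolution can be seen in a small 1-D sketch. It is a simplification under stated assumptions: real detection networks use learned 2-D kernels, whereas the kernel here is fixed, and the kernel size and stride are illustrative choices, not settings from this paper. The function name is hypothetical.

```python
def deconv1d(signal, kernel, stride=2):
    """Transposed 1-D convolution.

    Each input value scatters a scaled copy of the kernel into the output,
    and consecutive copies are placed `stride` positions apart.
    """
    out_len = (len(signal) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            output[i * stride + j] += value * weight
    return output

upsampled = deconv1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
# -> [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

    With a stride of 2 and a kernel of length 2 the output is twice as long as the input; each output position still depends on a coarse input value, which is how high-level context reaches the denser map.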

    3. Deconvolutional Single Shot Detection (DSSD) model

    We begin by reviewing the structure of SSD and then describe the new prediction module that produces s