
What’s the point: Semantic segmentation with point supervision

Olga Russakovsky∗, Stanford University

[email protected]

Amy L. Bearman∗, Stanford University

[email protected]

Vittorio Ferrari, University of Edinburgh

[email protected]

Li Fei-Fei, Stanford University

[email protected]

Abstract

The semantic image segmentation task presents a trade-off between test accuracy and the cost of obtaining training annotations. Detailed per-pixel annotations enable training accurate models but are very expensive to obtain; image-level class labels are an order of magnitude cheaper but result in less accurate models. We take a natural step from image-level annotation towards stronger supervision: we ask annotators to point to an object if one exists. We demonstrate that this adds negligible additional annotation cost. We incorporate this point supervision along with a novel objectness potential in the training loss function of a state-of-the-art CNN model. The combined effect of these two extensions is a 12.9% increase in mean intersection over union on the PASCAL VOC 2012 segmentation task compared to a CNN model trained with only image-level labels.

1 Introduction

At the forefront of visual recognition is the question of how to most effectively teach computers about new visual concepts. Algorithms trained from large-scale carefully annotated data enjoy better performance than their weakly supervised counterparts (e.g., [10] versus [37], [5] versus [24], [21] versus [25]); however, obtaining strongly supervised data is very expensive [28, 20].

It is particularly difficult to collect training data for semantic segmentation, i.e., the task of assigning a class label to every pixel in the image. Strongly supervised methods require a training set of images with per-pixel annotations [31, 5, 21, 14, 8] (Fig. 1b). Providing an accurate outline of a single object takes between 54 seconds [16] and 79 seconds [20]. A typical indoor scene contains 23 objects [11], raising the annotation cost to tens of minutes per image. Methods have been developed to reduce the annotation cost through effective interfaces [27, 19, 3, 20, 38], e.g., through requesting human feedback only as necessary [16]. Co-segmentation methods use manually annotated image-level labels to automatically infer the segmentations [4, 18, 12]. Nevertheless, accurate per-pixel image annotations remain costly and scarce.

To alleviate the need for large-scale detailed annotations, weakly supervised semantic segmentation techniques have been developed [35, 36, 33, 24, 25, 39, 40, 26]. The most common setting is where only image-level labels for the presence or absence of classes are provided during training [35, 36, 33, 24, 25, 39, 26] (Fig. 1c), but other forms of weak supervision have been explored as well, such as bounding box annotations [24], eye tracks [23], free-form squiggles [40] or noisy web tags [1].

∗ Authors contributed equally


Figure 1: Different levels of training supervision for semantic segmentation models. The training image is shown in (a). In (b), the class label of every pixel is provided (full supervision). In (c), the class labels are known but their locations are not (image-level supervision). Cases (b) and (c) are standard; we explore (d) and (e). We introduce point-level supervision (d), where each class label is only associated with one or a few pixels, corresponding to humans pointing to objects of that class. We include an objectness prior (e) in our training loss function alongside point-level supervision to accurately infer the object extent.

These methods require significantly less annotation effort during training, but are not able to segment new images nearly as accurately as fully supervised techniques.

In this work, we take a natural step towards stronger supervision for semantic segmentation at negligible additional cost compared to image-level labels. The most natural way for humans to refer to an object is by pointing: “That cat over there (point)” or “What is that over there? (point)”. Psychology research has indicated that humans point to objects in a consistent and predictable way [9, 6]. The fields of robotics [15, 30] and human-computer interaction [22] have long used pointing as an effective means of communication. However, to the best of our knowledge this insight has not been explored in semantic image segmentation.

We demonstrate that simply pointing once to target objects in training images can be a surprisingly effective means of supervision (Fig. 1d). We extend a state-of-the-art convolutional neural network (CNN) framework for semantic segmentation [21, 25] to incorporate point supervision in its training loss function. With just one annotated point per object class, we are able to considerably improve semantic segmentation accuracy.

One lingering concern with supervision at the point level is that it is difficult to infer the full extent of the object. To overcome this issue, we additionally modify the training loss function to incorporate a generic objectness prior [2], which enables learning the full object extent. Objectness helps separate objects (e.g., car, sheep, bird) from background (e.g., grass, sky, water) by providing a probability that a pixel belongs to an object (Fig. 1e). Such priors have been used in the segmentation literature for selecting image regions to segment [14], as unary potentials in a conditional random field model [35], or during inference [26]. However, to the best of our knowledge we are the first to employ this directly in the loss to guide the training of a CNN.

Contributions. Our primary contribution is introducing a novel supervision regime for semantic segmentation based on humans pointing to objects. This supervision is (1) cheap to obtain, and (2) significantly improves segmentation accuracy. Our secondary contribution is improving the accuracy of weakly supervised segmentation by using an objectness prior directly for training a segmentation CNN. The combined effect of our contributions is a substantial increase of 12.9% mean intersection over union on the PASCAL VOC 2012 dataset [7] compared to training with image-level labels.

2 Semantic segmentation method

We describe here our approach to using point-level supervision (Fig. 1d) for training semantic segmentation models. In Section 3 we will demonstrate that this level of supervision is cheap and efficient to obtain. In our setting (in contrast to [38]), supervised points are only provided on training images. The learned model is then used to segment test images with no additional human input.

Current state-of-the-art semantic segmentation methods [21, 5, 26, 25, 24], both supervised and unsupervised, employ a unified CNN framework. These networks take as input an image of size W × H and output a W × H × N score map, where N is the number of classes the CNN was trained to recognize (Fig. 2). At test time, the score map is converted to per-pixel predictions of size W × H by either simply taking the maximally scoring class at each pixel [25, 21] or employing more complicated post-processing [26, 5, 24].

Figure 2: Overview of our semantic segmentation training framework.

Training these models with different levels of supervision requires defining appropriate loss functions in each scenario. We begin by presenting two loss functions used in the literature. We then extend them to our setting of point supervision, and to incorporate an objectness prior.

Full supervision. When the class label is available for every pixel during training (Fig. 1b), the CNN is commonly trained by optimizing the sum of per-pixel cross-entropy terms [5, 21]. Let $I$ be the set of pixels in the image. Let $s_{ic}$ be the CNN score for pixel $i$ and class $c$, and let $S_{ic} = \exp(s_{ic}) / \sum_{k=1}^{N} \exp(s_{ik})$ be the softmax probability of class $c$ at pixel $i$. Given a ground truth map $G$ indicating that pixel $i$ belongs to class $G_i$, the loss on a single training image is:

$$L_{pix}(S, G) = -\sum_{i \in I} \log(S_{i G_i}) \qquad (1)$$

The loss is simply zero for pixels where the ground truth label is not defined (for example, pixels marked as “difficult” on the boundary of objects in PASCAL VOC [7]).
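To make Eqn. (1) concrete, here is a minimal NumPy sketch, assuming the CNN scores arrive as an (N, H, W) array and that undefined pixels carry an ignore label of 255 (the PASCAL VOC convention); the helper names are ours, not part of the authors' implementation.

```python
import numpy as np

def softmax_scores(scores):
    """scores: (N, H, W) array of CNN scores s_ic -> softmax probabilities S_ic."""
    e = np.exp(scores - scores.max(axis=0, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def loss_pix(scores, gt, ignore_label=255):
    """Eqn (1): sum of per-pixel cross-entropy terms; 'difficult' pixels contribute zero loss."""
    S = softmax_scores(scores)                    # (N, H, W)
    rows, cols = np.nonzero(gt != ignore_label)   # pixels with a defined ground truth label
    return -np.sum(np.log(S[gt[rows, cols], rows, cols] + 1e-12))
```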

Image-level supervision. In this case, the only information available during training is the set $L \subseteq \{1, \ldots, N\}$ of classes present in the image and the set $L' \subseteq \{1, \ldots, N\}$ of classes not present in the image (Fig. 1c). The CNN model can be trained with a different cross-entropy loss:

$$L_{img}(S, L, L') = -\frac{1}{|L|} \sum_{c \in L} \log(S_{t_c c}) - \frac{1}{|L'|} \sum_{c \in L'} \log(1 - S_{t_c c}), \quad \text{with } t_c = \arg\max_{i \in I} S_{ic} \qquad (2)$$

The first part of Eqn. (2), corresponding to $c \in L$, is used in [25]. It encourages each class in $L$ to have high probability on at least one pixel in the image. We find the simple extension to include $c \in L'$ to be very effective in practice. This corresponds to the fact that no pixels should have high probability for classes that are not present in the image.
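A matching sketch of Eqn. (2), reusing the softmax_scores helper from the Eqn. (1) sketch; present and absent are the class-index lists L and L′, and the epsilon guarding the logarithm is our implementation detail.

```python
import numpy as np

def loss_img(scores, present, absent, eps=1e-12):
    """Eqn (2): image-level loss, with present = L and absent = L' (lists of class indices)."""
    S = softmax_scores(scores)                          # (N, H, W), from the Eqn (1) sketch
    max_prob = S.reshape(S.shape[0], -1).max(axis=1)    # S_{t_c c}: highest probability of class c over all pixels
    loss = -np.mean(np.log(max_prob[present] + eps))         # each class in L should appear somewhere
    loss += -np.mean(np.log(1.0 - max_prob[absent] + eps))   # no class in L' should appear anywhere
    return loss
```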

Point-level supervision. We study the intermediate case where the object classes are known for a small set of supervised pixels $I_s$, whereas other pixels are only known to belong to some class in $L$. In this case, we generalize Eqn. (1) and Eqn. (2) to:

$$L_{point}(S, G, L, L') = L_{img}(S, L, L') - \sum_{i \in I_s} \alpha_i \log(S_{i G_i}) \qquad (3)$$

Here, $\alpha_i$ determines the relative importance of each supervised pixel. For example, for each class of interest the user can be asked to either determine that the class is not present in the image or to point to one instance of the object. In this case, $|I_s| = |L|$ and $\alpha_i$ is uniform for every point. With multiple annotators, $\alpha_i$ can correspond to the confidence in the accuracy of the annotator that provided the point. Alternatively, the annotators can be asked to point to every instance of the classes in the image, and $\alpha_i$ can correspond to the order of the points: the first point is more likely to correspond to the largest object instance and thus potentially deserves a higher weight $\alpha_i$.
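Continuing the same sketch, Eqn. (3) adds a weighted cross-entropy term on the clicked pixels; the (row, col, class) layout for points and the default uniform weights α_i = 1/n are our assumptions about how the clicks are stored.

```python
import numpy as np

def loss_point(scores, present, absent, points, alpha=None, eps=1e-12):
    """Eqn (3): image-level loss plus weighted supervision on the clicked pixels.
    points: list of (row, col, class_id) clicks; alpha: per-point weights (default uniform 1/n)."""
    if alpha is None:
        alpha = np.full(len(points), 1.0 / len(points))
    S = softmax_scores(scores)                      # from the Eqn (1) sketch
    loss = loss_img(scores, present, absent, eps)   # from the Eqn (2) sketch
    for (row, col, cls), a in zip(points, alpha):
        loss -= a * np.log(S[cls, row, col] + eps)  # -alpha_i * log(S_{i, G_i})
    return loss
```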

Objectness prior. One issue with training models with very few or no supervised pixels is correctly inferring the spatial extent of the objects. In general, weakly supervised methods are prone to local minima: focusing on only a small part of the target object, or predicting all pixels as belonging to the background class [25]. To alleviate this problem, we introduce an additional term in our training objective based on an objectness prior (Fig. 1e). Objectness provides a probability for whether each pixel belongs to any object class [2] (e.g., bird, car, sheep), as opposed to background (e.g., sky, water, grass). These probabilities have been used in the weakly supervised literature before as unary potentials in graphical models [35] or during inference following a CNN segmentation [26]. To the best of our knowledge we are the first to incorporate them directly into CNN training.¹

Let $P_i$ be the probability that pixel $i$ belongs to an object. Let $O$ be the classes corresponding to objects, with the other classes corresponding to backgrounds. In PASCAL VOC, $O$ are the 20 object classes, and there is a single generic background class. We define a new loss:

$$L_{obj}(S, P) = -\frac{1}{|I|} \sum_{i \in I} \left( P_i \log\Big(\sum_{c \in O} S_{ic}\Big) + (1 - P_i) \log\Big(1 - \sum_{c \in O} S_{ic}\Big) \right) \qquad (4)$$

At pixels with high $P_i$ values, this objective encourages placing probability mass on object classes. Alternatively, when $P_i$ is low, it prefers mass on the background class. Note that $L_{obj}$ requires no human supervision and thus can be combined with any loss above.
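Under the same conventions, a sketch of Eqn. (4): objectness is the (H, W) map of prior probabilities P_i, and object_classes indexes the 20 PASCAL object classes, everything else being treated as background.

```python
import numpy as np

def loss_obj(scores, objectness, object_classes, eps=1e-12):
    """Eqn (4): objectness prior loss; requires no human supervision."""
    S = softmax_scores(scores)                 # from the Eqn (1) sketch
    p_obj = S[object_classes].sum(axis=0)      # sum_{c in O} S_ic, shape (H, W)
    per_pixel = (objectness * np.log(p_obj + eps)
                 + (1.0 - objectness) * np.log(1.0 - p_obj + eps))
    return -per_pixel.mean()                   # average over all pixels i in I
```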

3 Crowdsourcing point supervision

Point supervision can be used in training semantic segmentation models as described in Section 2. We now present our findings from collecting these user clicks in practice.

Collecting point data. Fig. 3 shows our annotation user interface. Annotators are given a target object class and a set of images. We experimented with two versions of the task:

1. Annotators are asked to click on the first instance of the target class they see (1Point), or
2. Annotators are asked to click on every instance of the target class (AllPoints).

By running this task for all classes and all images, we obtain one annotated point per object class per image (1Point) or one annotated point per object instance per image (AllPoints).²

We used Amazon Mechanical Turk (AMT) to annotate 20 PASCAL VOC object classes on 12,031 images: all training and validation images of the PASCAL VOC 2012 segmentation task [7] plus the additional annotated images of [13]. Fig. 4 shows some collected data. We will release both the collected annotations and the annotation system upon acceptance.

Error rates. Across the 12,031 images, workers incorrectly labeled an object class as absent only 1.0% of the time. On the 1Point task, 7.2%–8.0% of the clicks were incorrect: 7.2% were on a pixel annotated with a different object class in the ground truth, and an additional 0.8% were on an unclassified “difficult” ground truth pixel. For comparison, [29] reports a higher average error rate of 25% when drawing bounding boxes. Our collected data is high-quality, confirming that pointing to objects comes naturally to humans [22, 6].

On the AllPoints task, 7.9% of ground truth instances were left unannotated and 14.8%–16.4% of the clicks were incorrect. This task caused some confusion among workers due to blurry or very small instances; for example, many of these instances are not annotated in the ground truth but were clicked by workers, accounting for the high false positive rate. For simplicity, throughout most of the paper we focus on the 1Point data.

Annotation time. On images where the target object was present, it took workers a median of 2.4 seconds to click on the first instance of the object, and 0.9 seconds on every instance thereafter. Below we compare per-image annotation times for various types of annotations to confirm that point-level supervision is very efficient.

Image-level supervision. For image-level labels, it takes approximately 1 second per image per label [23]. Labeling 20 object classes in PASCAL VOC takes 20 seconds per image.

¹ The objectness prior is trained on 50 images which do not appear in PASCAL VOC 2012 [2].
² Quality control is done by planting 10 evaluation images in a 50-image task and ensuring at least 8 are labeled correctly. We consider a click correct if it falls inside a tight bounding box around the object. For the AllPoints task, the number of annotated clicks must be at least the number of known object instances.


Figure 3: Point-level supervision annotation user interface.

Figure 4: Example points collected. Different colors correspond to different object classes. (Best viewed in color.)

Point-level supervision. For every class not present in the image, the labeling cost is 1 second as above. There are 1.5 classes on average per image in PASCAL VOC 2012, so it takes 18.5 seconds to annotate the classes not present in the image. For every class that is present, it takes 2.4 seconds to click on the first instance. The labeling cost of 1Point is thus 18.5 + 1.5 × 2.4 = 22.1 seconds per image. If a class is present, there are on average 0.8 additional instances beyond the first one. The labeling cost of AllPoints is thus 18.5 + 1.5 × (2.4 + 0.9 × 0.8) = 23.2 seconds per image. This is only 10–16% more expensive than obtaining image-level labels.

Box-level supervision. A common intermediate between image-level labels and pixel-wise segmentations is to get bounding box annotations around each object instance. It takes about 30 seconds to obtain an accurate bounding box, including quality control.³ On average, there are a total of 2.8 instances per image over all classes. Thus annotating them takes 18.5 + 2.8 × 30 = 102.5 seconds. This is about 4.5× more expensive than point-level supervision.

Full supervision. For segmentation annotation, the authors of the COCO dataset report 22 worker hours per 1,000 segmentations, so about 79 seconds per segmentation [20]. Thus to segment all instances it takes 18.5 + 2.8 × 79 = 239.7 seconds, more than 10× the cost of point supervision.
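The per-image cost estimates above reduce to a few lines of arithmetic; the constants (1.5 classes and 2.8 instances per image on average) are the ones reported in this section.

```python
# Per-image annotation cost in seconds, using the timings reported in Section 3.
classes_total   = 20     # PASCAL VOC object classes, 1 second each for image-level labels
classes_present = 1.5    # average number of classes present per image
instances       = 2.8    # average number of object instances per image
absent_cost = (classes_total - classes_present) * 1.0            # 18.5 s for the absent classes

image_level = classes_total * 1.0                                 # 20.0 s
one_point   = absent_cost + classes_present * 2.4                 # 22.1 s
all_points  = absent_cost + classes_present * (2.4 + 0.9 * 0.8)   # ~23.2 s
boxes       = absent_cost + instances * 30.0                      # 102.5 s
full_masks  = absent_cost + instances * 79.0                      # 239.7 s
```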

In Section 4.2 we compare the accuracy of the models trained with different levels of supervision.

4 Experiments

We empirically demonstrate the effectiveness of our point-level supervision and objectness prior.

4.1 Setup

CNN architecture. We use the state-of-the-art fully convolutional network model [21]. Briefly, the architecture is based on the VGG 16-layer net [32], with all fully connected layers converted to convolutional layers. The last classifier layer is discarded and replaced with a 1×1 convolution layer with channel dimension N = 21, equal to the number of object classes. The final modification is the addition of a deconvolution layer to bilinearly upsample the output to pixel-level dense predictions.⁴
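For readers who prefer code, the description above corresponds roughly to the following PyTorch sketch of a stride-32 fully convolutional network. The original work used Caffe, so the class name and the way fc6/fc7 are converted are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FCN32Coarse(nn.Module):
    """VGG-16 based FCN sketch: fully connected layers converted to convolutions,
    a 1x1 classifier with N = 21 channels, and bilinear upsampling to full resolution."""
    def __init__(self, num_classes=21):
        super().__init__()
        vgg = torchvision.models.vgg16()      # in practice, initialized with ILSVRC-pretrained weights
        self.features = vgg.features          # conv1_1 ... pool5
        self.fc6 = nn.Conv2d(512, 4096, kernel_size=7)    # fc6 reinterpreted as a 7x7 convolution
        self.fc7 = nn.Conv2d(4096, 4096, kernel_size=1)   # fc7 reinterpreted as a 1x1 convolution
        self.score = nn.Conv2d(4096, num_classes, kernel_size=1)  # replaces the 1000-way classifier

    def forward(self, x):
        h, w = x.shape[2:]
        x = self.features(x)
        x = F.relu(self.fc6(x))
        x = F.relu(self.fc7(x))
        x = self.score(x)
        # bilinearly upsample the coarse score map back to a W x H x N prediction
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
```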

Optimization. We train following a procedure similar to [21]. We use stochastic gradient descent with a fixed learning rate of 10⁻⁵, doubling the learning rate for biases, and with a minibatch of 20 images, momentum of 0.9 and weight decay of 0.0005. The network is initialized with weights pre-trained for the 1000-way classification task of the ILSVRC 2012 dataset [32, 28, 21].⁵ In the fully supervised case we zero-initialize the classifier weights [21], and for all the weakly supervised cases we follow [25] to initialize them with weights learned by the original VGG network for classes common to both PASCAL and ILSVRC. We backpropagate through all layers to fine-tune the network, and train for 50,000 iterations. We build directly upon the publicly available implementation of [21, 17].
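Under the same assumptions as the architecture sketch above, these hyperparameters map to a PyTorch optimizer configuration roughly as follows; the bias/weight split is our way of expressing "doubling the learning rate for biases".

```python
import torch

model = FCN32Coarse()   # from the architecture sketch above
biases  = [p for name, p in model.named_parameters() if name.endswith("bias")]
weights = [p for name, p in model.named_parameters() if not name.endswith("bias")]

optimizer = torch.optim.SGD(
    [{"params": weights, "lr": 1e-5},
     {"params": biases,  "lr": 2e-5}],   # doubled learning rate for biases
    momentum=0.9,
    weight_decay=0.0005,
)
# Fine-tune through all layers for 50,000 iterations with minibatches of 20 images.
```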

³ Timing greatly depends on the setup. [16] reports 7 seconds to draw a box. [29] reports 10.2 seconds with high AMT error rates. [34] reports 25.5 seconds for drawing and 42.4 seconds with quality verification.
⁴ [21] introduces additional refinement by decreasing the stride of the output layers from 32 pixels to 8 pixels, which improves their results from 59.7% to 62.7% mIOU on the PASCAL VOC 2011 validation set. We use the original model with stride of 32 for simplicity.

Table 1: Segmentation results on the PASCAL VOC 2012 validation set. Img corresponds to training with the Limg loss of Eqn. (2). Obj corresponds to Lobj of Eqn. (4). Point supervision is added using Eqn. (3).

(a) Both point-level supervision and the objectness prior significantly improve accuracy.

    Loss                                  mIOU (%)
    Img                                   29.8
    Img + 1Point                          35.1
    Img + Obj                             32.2
    Img + Obj + 1Point                    42.7

(b) Variations of our point-supervised model.

    Supervision                           mIOU (%)
    Img + Obj + 1Point                    42.7
    Img + Obj + AllPoints                 42.7
    Img + Obj + AllPoints (weighted)      43.4
    Img + Obj + 1Point (3 annotators)     43.8
    Img + Obj + 1Point (random)           46.1

Dataset. We train and evaluate on the PASCAL VOC 2012 segmentation dataset [7] augmented with extra annotations from [13]. There are 10,582 training images, 1,449 validation images and 1,456 test images. We report the mean intersection over union (mIOU), averaged over the 21 classes.
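For reference, mIOU can be computed as below. This per-image version is a simplification of the benchmark, which accumulates intersections and unions over the entire evaluation set before averaging over the 21 classes; the ignore label 255 marks the unlabeled "difficult" pixels.

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21, ignore_label=255):
    """pred, gt: integer label maps of the same shape."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```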

4.2 Results

Effect of extended image-level supervision. We begin by establishing a baseline segmentation model trained from image-level labels with no additional information. [25] uses a loss similar to Eqn. (2), except only including the terms corresponding to classes present in the image. We notice that the absence of a class label is also an important supervisory signal and include it in the loss. While [25] obtains 25.1% mIOU on the PASCAL VOC 2011 validation set, we achieve 30.5%.

Effect of point-level supervision. We now run a key experiment to investigate how having just one annotated point per class per image improves semantic segmentation accuracy. We use the loss Lpoint of Eqn. (3). On average there are only 1.5 pixels supervised per image (as many as classes per image). All other pixels are unsupervised. We set α = 1/n, where n is the number of supervised pixels on a particular training image. On the PASCAL VOC 2012 validation set, the accuracy of a model trained using Limg is 29.8% mIOU. Adding our point supervision improves accuracy by 5.3% to 35.1% mIOU (Img + 1Point in Table 1a).

Effect of objectness prior. One issue with training models with very few or no supervised pixels is the difficulty of inferring the full extent of the object. With image-level labels, the model tends to learn that objects occupy a much greater area than they do in practice (second column of Fig. 5). We introduce the objectness prior in the loss using Eqn. (4) to aid the model in correctly predicting the extent of objects (third column of Fig. 5). This improves segmentation accuracy: Img obtained 29.8% mIOU, and Img + Obj improves to 32.2% mIOU. The effect of the objectness prior is even more apparent with point-level supervision. The model with Img + 1Point achieves 35.1% mIOU and improves to 42.7% mIOU with Img + Obj + 1Point.

We conclude that: (1) The objectness prior is very effective for training these models with no or very few supervised pixels, and requires no additional human supervision. For the rest of the experiments we always use Img + Obj together. (2) With this improved Img + Obj image-level model, the effect of a single point of supervision is even more apparent. Adding just one point per class improves accuracy by 10.5%, from 32.2% to 42.7%.

Effect of point-level supervision variations. Having established that a model with 1Point supervision achieves 42.7% mIOU, we investigate the effects of other parameters (documented in Table 1b).

⁵ This is standard in the literature [5, 21, 25, 24, 26, 10]. We do not consider the cost of collecting those annotations; including them would not change our overall conclusions.


Figure 5: Qualitative results on the PASCAL VOC 2012 validation set (columns: original image, Img, Img + Obj, Img + Obj + 1Point, fully supervised, ground truth). The objectness prior (Img + Obj) improves the accuracy of the image-level model (Img) by helping infer the object extent. Point-level supervision (Img + Obj + 1Point) yields additional substantial improvements in accuracy at small additional annotation cost. Best viewed in color.

Using points on all instances (AllPoints) instead of just one point per class (1Point) remains at 42.7% mIOU: the benefit from extra supervision is offset by the confusion introduced by some difficult instances that are annotated. We introduce a weighting factor αi = 1/2^(r−1) in Eqn. (3), where r is the ranked order of the point (so the first instance of a class gets weight 1, the second instance gets weight 1/2, etc.). This improves results by a modest 0.7% to 43.4% mIOU. We also collected 1Point data from 3 different annotators and used all of the points during training. This achieved a modest improvement of 1.1%, from 42.7% to 43.8%, which does not seem worth the additional annotation cost (29.3 versus 22.1 seconds per image).

As mentioned above, the segmentation model is good at extrapolating the full extent of the object, so increasing the area of supervised pixels by a radius of 2, 5 or 25 pixels around each point also has little effect, with 43.0%–43.1% mIOU (not shown in Table 1b). This suggests that annotation via “squiggles” (e.g., as in [40]) is unlikely to significantly improve over point-level supervision.

An interesting experiment is supervising with one point per class but randomly sampled on the target object class using per-pixel supervised ground truth annotations. This improved results over the human points by 3.4%, from 42.7% to 46.1%. This is due to the fact that humans are predictable and consistent in pointing [9, 6], and this reduces the variety in point-level supervision across instances.


Figure 6: Results for models trained with varying levels of supervision. (a) Results on the PASCAL VOC 2012 validation set: point-level supervision trades off well between annotation time and accuracy. (b) Results on the PASCAL VOC 2012 test set: our model outperforms [26] on both accuracy and labeling cost, and [24, 25] on accuracy at a 10% increase in labeling cost.

Comparison to stronger supervision. We compare our point-supervised model with models trained with stronger supervision, in terms of labeling cost and accuracy. We use the annotation times computed in Section 3: on one image, it takes 20 seconds for image-level labels, 22.1 seconds for point-level supervision (1Point), 102.5 seconds for bounding boxes, and 239.7 seconds for full per-pixel annotation.

To train the bounding box-supervised model, we use the bounding box annotations available on PASCAL VOC. We convert them to per-pixel annotations by assuming that each pixel belongs to the object class of the smallest box covering it [24].
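A sketch of that conversion, following the rule that each pixel takes the class of the smallest bounding box covering it, with uncovered pixels keeping the background label; the box layout is our assumption.

```python
import numpy as np

def boxes_to_label_map(height, width, boxes, background=0):
    """boxes: list of (class_id, x1, y1, x2, y2) with inclusive pixel coordinates."""
    labels = np.full((height, width), background, dtype=np.int64)
    best_area = np.full((height, width), np.inf)
    for cls, x1, y1, x2, y2 in boxes:
        area = (x2 - x1 + 1) * (y2 - y1 + 1)
        sub_labels = labels[y1:y2 + 1, x1:x2 + 1]    # views into the full maps
        sub_area = best_area[y1:y2 + 1, x1:x2 + 1]
        smaller = area < sub_area                    # smallest covering box wins
        sub_labels[smaller] = cls
        sub_area[smaller] = area
    return labels
```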

Fig. 6a summarizes our results. The point-level supervised model provides a good trade-off between accuracy and annotation cost. It achieves a 10.5% mIOU improvement compared to image-level labels (Table 1a) at a mere 0.1× increase in labeling cost. The point-level supervised model is only 2.4% mIOU below the box-supervised model (42.7% mIOU with points versus 45.1% mIOU with boxes) at a 4.4× smaller labeling cost. The fully supervised model trumps all others in accuracy, with 58.3% mIOU, but at a 10.2× increase in cost compared to our point supervision.

Comparison to other techniques. We compare both the annotation time and the accuracy of our point-supervised 1Point model with published techniques. Here we train on all 12,031 train+val images and evaluate on the PASCAL VOC 2012 test set (as opposed to the validation set above).

Fig. 6b shows the results. Pathak et al. [25] achieve 25.7% mIOU and Papandreou et al. [24] achieve 39.6% mIOU with only image-level labels, requiring approximately 67 hours of annotation on the 12,031 images (Section 3). Pinheiro et al. [26] achieve 40.6% mIOU but with 400 hours of annotations.⁶ We improve in accuracy upon all these methods and achieve 43.6% with point-level supervision requiring about 79 annotation hours (Img + Obj + 1Point). Note that our baseline model is a significantly simplified version of [25, 24]. Incorporating additional features of their methods is likely to further increase our accuracy at no additional annotation cost.

Long et al. [21] and Chen et al. [5] report 62.7% mIOU and 71.6% mIOU respectively, but in the fully supervised setting that requires about 800 hours of annotation, an order of magnitude more expensive than point supervision.

5 Conclusion

We propose a new supervision approach for semantic image segmentation. By combining a small handful of manually annotated pixels with a novel objectness-based loss function, we are able to train CNN models with significant improvements in semantic segmentation accuracy at negligible additional labeling cost compared to image-level labels.

⁶ [26] train with only image-level annotations but add 700,000 additional positive ImageNet images and 60,000 background images. We choose not to count the 700,000 freely available images, but the additional 60,000 background images they annotated would take an additional 60,000 × 20 classes × 1 second = 333 hours. The total annotation time is thus 333 + 67 = 400 hours.


Acknowledgments

A huge thank you to Evan Shelhamer for going above and beyond in helping us set up the baseline model of [21]. Thanks to all the other Caffe developers as well for the beautiful and easily extensible framework. Thanks to Lamberto Ballan for useful discussions and preliminary data analysis. Thanks to Michelle Greene and Anca Dragan for suggesting references. Thanks to Jon Krause for paper comments.

V. Ferrari was supported by the ERC Starting Grant VisCul. GPUs were graciously donated by NVIDIA.

References

[1] E. Ahmed, S. Cohen, and B. Price. Semantic object selection. In CVPR, 2014.
[2] B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. PAMI, 2012.
[3] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. iCoseg: Interactive co-segmentation with intelligent scribble guidance. In CVPR, 2010.
[4] Y. Chai, V. Lempitsky, and A. Zisserman. BiCoS: A bi-level co-segmentation method for image classification. In CVPR, 2011.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[6] H. H. Clark. Coordinating with each other in a material world. Discourse Studies, 7(4-5):507–525, 2005.
[7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge. IJCV, 88(2):303–338, June 2010.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. TPAMI, August 2013.
[9] C. Firestone and B. J. Scholl. Please tap the shape, anywhere you like: Shape skeletons in human vision revealed by an exceedingly simple measure. Psychological Science, 25(2):377–386, 2014.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[11] M. R. Greene. Statistics of high-level scene context. Frontiers in Psychology, 4, 2013.
[12] M. Guillaumin, D. Kuettel, and V. Ferrari. ImageNet auto-annotation with segmentation propagation. IJCV, 2015.
[13] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[14] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[15] M. Hild, M. Hashimoto, and K. Yoshida. Object recognition via recognition of finger pointing actions. In Image Analysis and Processing, pages 88–93, 2003.
[16] S. D. Jain and K. Grauman. Predicting sufficient annotation strength for interactive foreground segmentation. In ICCV, December 2013.
[17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[18] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010.
[19] V. Lempitsky, P. Kohli, C. Rother, and T. Sharp. Image segmentation with a bounding box prior. In ICCV, 2009.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
[21] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[22] D. Merrill and P. Maes. Augmenting looking, pointing and reaching gestures to enhance the searching and browsing of physical objects. In Pervasive Computing, Lecture Notes in Computer Science, 2007.
[23] D. P. Papadopoulos, A. D. F. Clarke, F. Keller, and V. Ferrari. Training object class detectors from eye tracking data. In ECCV, 2014.
[24] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. arXiv:1502.02734, 2015.
[25] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. In ICLR, 2015.
[26] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
[27] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH, 2004.
[28] O. Russakovsky, J. Deng, et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[29] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both worlds: Human-machine collaboration for object annotation. In CVPR, 2015.
[30] A. Sauppe and B. Mutlu. Robot deictics: How gesture and context shape referential communication. In HRI ’14, pages 342–349, 2014.
[31] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2009.
[32] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[33] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.
[34] H. Su, J. Deng, and L. Fei-Fei. Crowdsourcing annotations for visual object detection. In AAAI Human Computation Workshop, 2012.
[35] A. Vezhnevets, V. Ferrari, and J. Buhmann. Weakly supervised semantic segmentation with a multi-image model. In ICCV, 2011.
[36] A. Vezhnevets, V. Ferrari, and J. Buhmann. Weakly supervised structured output learning for semantic segmentation. In CVPR, 2012.
[37] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In ECCV, 2014.
[38] T. Wang, B. Han, and J. Collomosse. TouchCut: Fast image and video segmentation using single-touch interaction. CVIU, 120, 2014.
[39] J. Xu, A. G. Schwing, and R. Urtasun. Tell me what you see and I will show you where it is. In CVPR, 2014.
[40] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
