
Unsupervised Learning of Supervoxel Embeddings for Video Segmentation

Mehran Khodabandeh¹,², Srikanth Muralidharan¹, Arash Vahdat¹, Nazanin Mehrasa¹, Eduardo M. Pereira³, Shin’ichi Satoh² and Greg Mori¹

¹Simon Fraser University  ²National Institute of Informatics  ³INESC TEC

{mkhodaba, smuralid, avahdat, nmehrasa, mori}@sfu.ca, [email protected], [email protected]

Abstract

We present an algorithm for learning a feature representation for video segmentation. Standard video segmentation algorithms utilize similarity measurements in order to group related pixels. The contribution of our paper is an unsupervised method for learning the feature representation used for this similarity. The feature representation is defined over video supervoxels. An embedding framework learns a feature mapping for supervoxels in an unsupervised fashion such that supervoxels with similar context have similar embeddings. Based on the learned representation, we can merge similar supervoxels into spatio-temporal segments. Experimental results demonstrate the effectiveness of this learned supervoxel embedding on standard benchmark data.

1 Introduction

Effective video segmentation can be achieved if a robust measurement of pixel similarity can be devised. However, this task is extremely challenging due to various complexities that arise from the variability in object appearance, occlusions in the scene, and camera motion, among other challenges. In this paper we present a method for unsupervised learning of feature representations for video segmentation.

There is a rich history of algorithms for video segmentation within the computer vision literature. Seminal work includes algorithms for analyzing layers of moving objects, probabilistic models, and graph-based segmentation algorithms [18, 15, 7]. A key facet of methods for video segmentation is deciding upon a representation for pixels such that pixels from similar objects are similar in feature representation.

Figure 1: In the embedding space, supervoxels of the same object are brought together while being separated from supervoxels of the other objects.

Broadly speaking, two veins of work exist for determining a feature representation for video segmentation. One involves supervised learning: a training set of segmentations with semantic object category labels is used. From this, a feature representation is learned that can distinguish between objects, and is then applied to segmentation (e.g. [10]).

The second vein uses unsupervised learning / hand-crafting of feature representations. For example, principles from Gestalt psychology such as common fate have been used to motivate the design of feature representations based on motion similarity in classic segmentation work [15]. More recent segmentation systems combine a set of such hand-crafted feature representations (e.g. [12]).

The main idea of our paper is to leverage these unsupervised techniques to learn better feature representations for segmentation.


We operate in a graph-based segmentation framework on supervoxels. For example, two supervoxels that have similar motion are likely to belong to the same object. This fact can be used to refine an appearance-based feature representation, placing these supervoxels closer together. Likewise, two supervoxels that have similar appearance are likely to belong to the same object. This could be used to refine a motion-based feature representation for a supervoxel.

We operationalize this by learning an embedding representation for supervoxels, in which supervoxels that belong to the same segment have similar representations. For example, consider the video shown in Fig. 1. Supervoxels that belong to the airplane should have a similar feature representation. However, their initial appearance features are different. Our unsupervised learning algorithm uses the fact that nearby, similarly moving supervoxels are likely to be from similar objects to learn an embedding that moves the airplane supervoxels closer together.

This idea is similar to the word2vec algorithm [13], which learns an embedding that is able to predict a word from the context of the neighboring words. In our problem the context of each supervoxel is determined by the spatio-temporally nearby supervoxels. Analogously, we aim to learn an embedding of supervoxels from videos such that contextually similar segments have similar representations in the embedded space. After learning the embeddings, we use a graph-based partitioning algorithm on the embedded features to produce the final segmentation.

We conduct our experiments on the video segmentation benchmark dataset [6]. Through our experiments, we show that the learned embeddings can improve the performance of video segmentation using appearance-based [11] or motion-based [3] features.

This paper is organized as follows. Section 2 gives an overview of the related work, Section 3 describes our method, Section 4 contains details of our experiments, and Section 5 concludes our work.

2 Related Work

Video Segmentation: Video segmentation is a challenging area of research in computer vision, and there exists an abundant literature pertaining to it [4, 5, 17, 20, 10].

Khoreva et al. use a graph-based supervised video segmentation algorithm [10], where different classifiers are used region-wise to learn a graph representation for the video frames, before eventually performing the segmentation using standard graph partitioning methods. Badrinarayanan et al. [1] develop a semi-supervised approach involving a mixture-of-trees based probabilistic graphical model, where superpixel labels are obtained by variational inference. Chang et al. build an unsupervised approach to video segmentation [2]. A conditional random field is employed to assign labels to all the pixels, using higher order potentials with softer label consistency constraints in order to identify complex spatio-temporal tubes that actually correspond to the same label. Zhang et al. [20] first identify a primary object using a layered directed acyclic graph based approach that assumes that objects are spatially cohesive and follow smooth trajectory patterns. Once the primary object models are extracted, segmentation is performed by invoking a scoring function that segments regions with high optical flow gradients from the background.

Deep Embedding Methods: Deep learning based embedding approaches have gained a lot of attention recently, and have been shown to be useful in various scenarios. Karpathy et al. [8] work on dense image descriptions. A weakly supervised bimodal text-image embedding is performed to generate text phrases for corresponding regions of the image, which are then used to generate dense image descriptions. Ramanathan et al. [14] learn temporal embeddings using a large unlabeled video database, and show that these embeddings improve the performance of several video-related tasks. Srivastava et al. [16] make use of a recurrent neural network architecture and a large video classification database [9] to learn unsupervised representations for contiguous video frames. They show that these embeddings significantly help in activity recognition.

3 Our Method

Our goal in this paper is to develop a principled framework for learning contextually-aware embeddings for video segments in an unsupervised setting. Video segments with the same “context” are those which are in the same vicinity. Therefore, the contextually-aware embeddings proposed in this paper can be used to group small video segments into larger groups which represent either semantically consistent objects or actions that intuitively share a common context.

The main focus of this paper is learning a similarity measure for supervoxels, which is used for video segmentation. We build our segmentation approach on top of two recently proposed algorithms. Xu et al.’s supervoxel extraction algorithm [19] is employed to generate a large set of small supervoxels for a video. The graph partitioning algorithm proposed in [5] is then used to group supervoxels based on embeddings trained by our proposed model.


This section presents our method as follows. Our embedding approach is presented in Sec. 3.1. Sec. 3.2 explains the neighbor/negative selection strategy, followed by Sec. 3.3, which contains details about supervoxel extraction and the features used. Finally, we conclude the section with the video segmentation algorithm that leverages the similarity measure to construct and partition the graph representation in Sec. 3.4.

3.1 Learning an Embedding Function

As illustrated in Figure 1, we aim at learning an embedding function φ that maps the supervoxel feature space to another m-dimensional space. In this space, contextually similar supervoxels should have similar vector representations when compared via the dot product similarity measurement.

We implement the embedding function φ using a neural network. The input to the network is a feature vector describing a supervoxel (e.g. motion features or appearance features from a deep network). The embedding function network is constructed on top of this input. The network we use has two fully connected layers, each followed by a nonlinear activation function. In this work we used rectified linear units (ReLUs) as the activation functions. Figure 2 illustrates the structure of our network.
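As a concrete illustration, the following is a minimal PyTorch sketch of such a two-layer embedding network (the paper's implementation uses Caffe); the hidden and output dimensions are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the embedding network phi: two fully connected layers,
# each followed by a ReLU. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SupervoxelEmbedding(nn.Module):
    def __init__(self, in_dim, hidden_dim=64, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim), nn.ReLU(),
        )

    def forward(self, x):        # x: (batch, in_dim) supervoxel features
        return self.net(x)       # (batch, embed_dim) embeddings

# The same network (shared weights) embeds target, neighbor, and negative
# supervoxels, as in Figure 2.
phi = SupervoxelEmbedding(in_dim=21)   # e.g. 21-d FCN class-score features
```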

Inspired by the word2vec idea, we wish to perform a warping of the original supervoxel space into a space in which a given target supervoxel is separated from “other” supervoxels using the neighboring supervoxels as much as possible. Towards this goal, we aim to formulate this problem as an embedding learning problem that is trained to separate the neighbouring supervoxels from negatives by a large margin, while bringing the target closer to the neighbors.

Assume S_V = {s_1, s_2, . . . , s_n} is the set of all the supervoxels extracted from video V. For each supervoxel s_i ∈ S_V, we define sets of positive and negative supervoxels. N_i^+ denotes the set of neighbouring supervoxels that are close to s_i in space and time (and are likely to be from the same object), and N_i^- is the set of negative supervoxels that are distant from s_i (in other words, a limited set of supervoxels that are not in the same object that s_i belongs to). The details of the neighbour/negative selection strategy are explained in Section 3.2.

We define the representation of the object that s_i belongs to:

$$c_i = \frac{1}{|N_i^+|} \sum_{s_{nei} \in N_i^+} \phi(s_{nei}) \tag{1}$$

Figure 2: Architecture of our embedding network. The target, neighbor, and negative supervoxel features pass through an input layer (features) and fully connected layers. Weights between layers with similar colors are shared, i.e. the identical embedding applies to the target, negatives, and neighbors.

where s_nei is a neighbour of s_i and a member of its neighbour set, N_i^+.

Here, we use the hinge loss as an objective function to learn the embedding function φ:

$$E(\phi) = \sum_{s_i \in S_V} \sum_{s_{neg} \in N_i^-} \max\left(0,\; 1 - \left(\phi(s_i) - \phi(s_{neg})\right) \cdot c_i\right) \tag{2}$$

Here, S_V is the set of supervoxels in video V, s_i is the i-th supervoxel, and s_neg is an instance of a negative of s_i, which is a member of N_i^-.

For a given video V, the goal is to find a φ(·) that minimizes the objective function defined in Equation 2. For each video we train an individual neural network with back-propagation and the standard stochastic gradient descent (SGD) method to perform this minimization.

The intuition behind Equation 2 is that by minimizing E(φ), and therefore maximizing φ(s_i) · c_i − φ(s_neg) · c_i, we force the embedding to separate a negative supervoxel from the supervoxels of the object that s_i belongs to, while bringing s_i closer to them. Note that here we use dot product similarity, so “separating” and “bringing close” correspond to increasing and decreasing the angle between two vectors, respectively.
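To make the objective concrete, the following is a minimal sketch of the per-target loss of Equations 1 and 2, assuming PyTorch and an embedding network phi such as the sketch above; the outer summation over all target supervoxels is omitted.

```python
import torch

def embedding_loss(phi, target_feat, neighbor_feats, negative_feats):
    """target_feat: (d,); neighbor_feats: (K, d); negative_feats: (F, d)."""
    c_i = phi(neighbor_feats).mean(dim=0)                          # Eq. 1: mean of neighbor embeddings
    diff = phi(target_feat.unsqueeze(0)) - phi(negative_feats)     # phi(s_i) - phi(s_neg), broadcast over negatives
    margins = 1.0 - diff @ c_i                                     # one hinge margin per negative
    return torch.clamp(margins, min=0.0).sum()                     # Eq. 2, summed over this target's negatives
```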

3.2 Negative and Neighbor Selection Strategy

The selection of neighbors and negatives is critical to the performance of our algorithm. For a given supervoxel s_i, we assume that the embedding of s_i is similar to the embeddings of the set of supervoxels that are close to s_i in space and time. In addition, based on our assumption, the embedding of s_i should be different from the embedding of a negative supervoxel, s_neg, which is not from the same object as s_i. Essentially, any supervoxel that is far from s_i and is not from the same object is a candidate negative.


However, in our experiments we found that not all the far supervoxels are proper negative candidates, and not all the neighboring supervoxels represent the object that s_i is extracted from. As for negatives, based on our observations we decided to choose “hard negatives”. Hard negatives of s_i are the closest supervoxels to s_i which are not in the set of neighbours (N_i^+).

Therefore, to choose the hard negatives and neighbours of supervoxel s_i, we first create a pool of candidate supervoxels by extracting a set of supervoxels that are close to s_i spatio-temporally. From this set we pick both “neighbors” and “negatives”. We use the center of mass to represent the spatial position of a supervoxel, and we use Euclidean distance to measure the distance between supervoxels. We then sort the set of candidates based on their distance to s_i in the feature space, and select the K closest ones as “neighbors” and the F farthest ones as “negatives” (Figure 4). In our experiments we tuned these parameters to obtain the best performance.
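The selection procedure can be sketched as follows, assuming NumPy; the candidate-pool size is an illustrative assumption, while K and F correspond to the neighbor/negative counts tuned in Section 4.

```python
# Minimal sketch of the neighbor/negative selection of Sec. 3.2.
import numpy as np

def select_neighbors_and_negatives(i, centers, feats, K=4, F=6, pool_size=20):
    """centers: (n, 3) space-time centers of mass; feats: (n, d) supervoxel features."""
    # Candidate pool: supervoxels spatio-temporally closest to s_i (excluding itself).
    spatial_dist = np.linalg.norm(centers - centers[i], axis=1)
    spatial_dist[i] = np.inf
    pool = np.argsort(spatial_dist)[:pool_size]
    # Sort the pool by distance to s_i in feature space.
    feat_dist = np.linalg.norm(feats[pool] - feats[i], axis=1)
    order = pool[np.argsort(feat_dist)]
    neighbors = order[:K]      # K closest candidates in feature space
    negatives = order[-F:]     # F farthest candidates in feature space ("hard negatives")
    return neighbors, negatives
```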

3.3 Supervoxels and Feature Extraction

Given an input video V, we use the algorithm of Xu et al. [19] to extract the initial supervoxel set S_V. This algorithm produces supervoxels in a hierarchical manner, where at each level of the hierarchy supervoxels are constructed by merging those of the previous level. Therefore, the lower the level in the hierarchy, the more supervoxels are available. We use the lowest level that is computationally affordable: level 8, with about 2000 supervoxels constructed from a 120-frame video. Other parameters are set to the defaults of the code. We use a simple averaging function as the aggregation method for computing supervoxel features:

$$s_i = \frac{1}{P_i} \sum_{j=1}^{P_i} p_j \tag{3}$$

where i = 1, 2, . . . , n. Here n is the number of supervoxels, P_i is the number of pixels belonging to supervoxel i, and p_j is the value of the feature map at the j-th pixel of the supervoxel.
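As an illustration, a minimal NumPy sketch of this aggregation, assuming a per-pixel feature map and a per-pixel supervoxel labeling of the video as inputs:

```python
# Sketch of the supervoxel feature aggregation of Eq. 3: average the
# pixel-level feature map over each supervoxel.
import numpy as np

def supervoxel_features(feature_map, labels, n_supervoxels):
    """feature_map: (T, H, W, d) pixel features (e.g. HOF or FCN scores);
    labels: (T, H, W) supervoxel id per pixel."""
    d = feature_map.shape[-1]
    feats = np.zeros((n_supervoxels, d))
    flat_feats = feature_map.reshape(-1, d)
    flat_labels = labels.reshape(-1)
    for i in range(n_supervoxels):
        mask = flat_labels == i
        if mask.any():
            feats[i] = flat_feats[mask].mean(axis=0)   # Eq. 3: mean over the supervoxel's pixels
    return feats
```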

We extract two different pixel-level feature types. The first representation is HOF [3], representing pixel motion from optical flow. The second descriptor is the FCN [11] segmentation mask, which represents each pixel with a 21-class score map obtained by passing a frame as input to the model¹ that was trained on the PASCAL VOC dataset. We used the default parameters provided by the code for both HOF and FCN.

¹ http://dl.caffe.berkeleyvision.org/fcn-8s-pascal.caffemodel

(a) Original (b) Groundtruth

(c) Hand-crafted similarity (d) Embedding similarity

Figure 3: Given a target supervoxel (the blue one on the front car in (c)), we color the other supervoxels based on their cosine similarity to the target. The intensity of pink and yellow indicates the magnitude of dissimilarity and similarity, respectively. After the embedding, supervoxels on the front car are more similar to the target (blue) supervoxel.

(a) Original (b) Negative and Neighbors

Figure 4: A sample of selected negative supervoxels (red region) and neighboring supervoxels (green region) for a given supervoxel (blue).

3.4 Video Segmentation

In our work, for each video we train a separate network. Given a trained network and a supervoxel s_i, we obtain the learned representation of s_i, which is φ(s_i), by feeding its feature vector to the network and collecting the values of the network output. We define the similarity matrix [W]_{n×n} that contains pairwise affinities as W = ΦΦ^T, where [Φ]_{n×m} is the embedding of all the supervoxels. Then we use Galasso’s algorithm [4] to re-weight the matrix W based on the supervoxels in the lower level of the hierarchy. The i-th row of W contains the similarity between supervoxel s_i and all other supervoxels. Figure 3 shows how the similarity between a particular supervoxel and the other supervoxels changes after the embedding. The resulting similarity matrix is then used in spectral clustering to obtain a final segmentation.
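A minimal sketch of this step, assuming NumPy, PyTorch, and scikit-learn's spectral clustering; the re-weighting of W from [4] is omitted and the number of clusters is an illustrative parameter.

```python
import numpy as np
import torch
from sklearn.cluster import SpectralClustering

def segment(phi, supervoxel_feats, n_clusters=10):
    """supervoxel_feats: (n, d) aggregated features; returns one segment label per supervoxel."""
    x = torch.as_tensor(supervoxel_feats, dtype=torch.float32)
    Phi = phi(x).detach().numpy()        # (n, m) learned embeddings
    W = Phi @ Phi.T                      # pairwise dot-product affinities, W = Phi Phi^T
    W = np.clip(W, 0.0, None)            # defensive; ReLU embeddings already give W >= 0
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels
```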


(a) BPR (b) VPR

(c) LPC (d) NCluster

Figure 5: Boundary Precision Recall (BPR), Volume Precision Recall (VPR), Length Precision Curve (LPC), and NCluster video segmentation statistics obtained using FCN and HOF embeddings.

4 Experiments

We evaluate our method on the VSB100 database [6]. It consists of a diverse set of video instances. The database specifies training and testing sets of videos. Our method is an unsupervised learning method in which we learn an embedding for supervoxels. In our experiments we perform this learning one video at a time, and hence we use only the test set for evaluating our method. The test set consists of 60 videos. We used HOF and FCN as features for testing the efficacy of the neural network based embedding. As a baseline, for each feature we evaluate the segmentation performance using the corresponding feature without embedding. We follow the evaluation protocol in VSB100: Volume Precision Recall (VPR), Boundary Precision Recall (BPR), Length Precision Recall (LPR), and number of clusters (NCluster) statistics are reported; the Volume Precision Recall curve’s best F-Measure is used as an aggregate measure of performance.

We use Caffe to implement our learning framework, with a separate network for each video. We tuned the parameters for each network by trying a range of values and choosing the ones with the highest F-measure (VPR). The values we chose for the parameters are as follows: number of neighbors ∈ {4, 8}, number of negatives ∈ {6, 12}, batch size ∈ {128, 256}. In all our experiments, we use an initial learning rate of 0.001. The learning rate is reduced by a factor of 0.1 every 400 iterations. We perform 20 epochs of stochastic gradient descent, which is sufficient for convergence on this dataset.
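For concreteness, the training schedule described above could look like the following sketch, assuming PyTorch (the paper's implementation uses Caffe) and reusing the embedding_loss sketch from Section 3.1; the network sizes and the batch iterator are placeholders.

```python
import torch
import torch.nn as nn

# Per-video training schedule sketch. The two-layer network mirrors the
# Sec. 3.1 sketch; `batches` is a placeholder for the per-video
# (target, neighbors, negatives) feature tensors.
phi = nn.Sequential(nn.Linear(21, 64), nn.ReLU(), nn.Linear(64, 32), nn.ReLU())
opt = torch.optim.SGD(phi.parameters(), lr=0.001)                       # initial learning rate 0.001
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=400, gamma=0.1)  # x0.1 every 400 iterations
batches = []                                                            # placeholder batch iterator

for epoch in range(20):                                                 # 20 epochs of SGD per video
    for target, neighbors, negatives in batches:
        loss = embedding_loss(phi, target, neighbors, negatives)        # hinge loss of Eq. 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```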

Method                               VPR
HOF without embedding                44.30
HOF with embedding                   46.50
FCN without embedding                51.39
FCN with embedding                   53.68
Concat HOF, FCN without embedding    49.70
Concat HOF, FCN with embedding       52.52

Table 1: Comparison of the F-Measure score of Volume Precision Recall (VPR) obtained using features after embedding versus features without embedding on the VSB100 test set.

From the numbers tabulated in Table 1, it is evident that embedding the features improves the segmentation performance compared to the non-embedded counterparts: by 2.2% on HOF, 2.3% on FCN, and 2.8% on the combination of HOF and FCN. Fig. 5 shows the different segmentation statistics obtained in these experiments.

Note that state-of-the-art supervised learning methods [10] do obtain higher performance on this task: when using pixel-level labels in the training data, 70% VPR can be achieved. However, acquiring pixel-level labels can be expensive. Our experiments demonstrate the effectiveness of learning embeddings without the need for additional supervision.

4.1 Discussion

We found that the performance of our model is high in videos where there are heterogeneous object(s) with different spatial and temporal features. This is because our model is built on the hypothesis that supervoxels that are close enough in the spatial dimension, and also close in the feature space used to decide on the neighbors/negatives (e.g. HOF in the case of FCN embeddings, and vice versa), are from the same object.

Conversely, it is difficult for our model to produce good segmentations in videos that have numerous small and visually similar objects. Firstly, this violates the hypothesis underlying our neighbor/negative construction. Secondly, the evaluation is originally intended for testing semantic segmentation algorithms and might require higher-level semantic information to make these distinctions. These properties of our model are evident from the sample visualizations shown in Figs. 3 and 6.

(a) Original (b) Groundtruth (c) FCN no embed (d) FCN embed (e) HOF no embed (f) HOF embed

Figure 6: Visualization of segmentations obtained by applying our method using different features. Each colour in a segmentation represents a distinct segment obtained by applying that method.

5 Conclusion

In this work, we tackled the problem of video segmentation using an unsupervised deep embedding approach. This embedding provides a discriminative mapping of supervoxel feature representations for a given video. We found that this learning has the potential to help in separating a supervoxel's neighbor feature representation from the feature representation of a distant supervoxel. We demonstrated this embedding on different spatial/temporal features.

References

[1] V. Badrinarayanan, I. Budvytis, and R. Cipolla. Mixture of trees probabilistic graphical model for video segmentation. IJCV, 2014.

[2] J. Chang, D. Wei, and J. Fisher. A video representation using temporal superpixels. CVPR, 2013.

[3] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. ECCV, 2006.

[4] F. Galasso, R. Cipolla, and B. Schiele. Video segmentation with superpixels. ACCV, 2012.

[5] F. Galasso, M. Keuper, T. Brox, and B. Schiele. Spectral graph reduction for efficient image and streaming video segmentation. CVPR, 2014.

[6] F. Galasso, N. Nagaraja, T. Cardenas, T. Brox, and B. Schiele. A unified video segmentation benchmark: Annotation, metrics and analysis. ICCV, 2013.

[7] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. PAMI, 1998.

[8] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. CVPR, 2015.

[9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. CVPR, 2014.

[10] A. Khoreva, F. Galasso, M. Hein, and B. Schiele. Classifier based graph construction for video segmentation. CVPR, 2015.

[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR, 2015.

[12] J. Lu, R. Xu, and J. J. Corso. Human action segmentation with hierarchical supervoxel consistency. CVPR, 2015.

[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[14] V. Ramanathan, K. Tang, G. Mori, and L. Fei-Fei. Learning temporal embeddings for complex video analysis. ICCV, 2015.

[15] J. Shi and J. Malik. Motion segmentation and tracking using normalized cuts. ICCV, 1998.

[16] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.

[17] N. Sundaram and K. Keutzer. Long term video segmentation through pixel level spectral clustering on GPUs. ICCV, 2011.

[18] Y. Weiss and E. H. Adelson. A unified mixture framework for motion segmentation: incorporating spatial coherence and estimating the number of models. CVPR, 1996.

[19] C. Xu, C. Xiong, and J. J. Corso. Streaming hierarchical video segmentation. ECCV, 2012.

[20] D. Zhang, O. Javed, and M. Shah. Video object segmentation through spatially accurate and temporally dense extraction of primary object regions. CVPR, 2013.