
STaRFlow: A SpatioTemporal Recurrent Cell for Lightweight Multi-Frame Optical Flow Estimation

Pierre Godet¹, Alexandre Boulch², Aurélien Plyer¹, and Guy Le Besnerais¹

¹DTIS, ONERA, Université Paris-Saclay, FR-91123 Palaiseau, France
Email: {pierre.godet, aurelien.plyer, guy.le_besnerais}@onera.fr

²valeo.ai, Paris, France

Email: [email protected]

ABSTRACT

We present a new lightweight CNN-based algorithm for multi-frame optical flow estimation. Our solution introduces a double recurrence, over spatial scale and time, through repeated use of a generic "STaR" (SpatioTemporal Recurrent) cell. It includes (i) a temporal recurrence based on conveying learned features rather than optical flow estimates; (ii) an occlusion detection process which is coupled with optical flow estimation and therefore uses a very limited number of extra parameters. The resulting STaRFlow algorithm gives state-of-the-art performance on MPI Sintel and KITTI 2015 and involves significantly fewer parameters than all other methods with comparable results.

1 Introduction

Optical Flow (OF) is the apparent displacement of objects between two frames of a video sequence. It expresses the direction and the magnitude of the motion of each object at pixel level. OF is a key component of several computer vision tasks, such as action recognition [29], autonomous navigation [14], tracking [5], or image registration for multi-view applications like video inpainting [35], super-resolution [24, 25, 37], or structure from motion [32]. OF estimation must be fast, accurate even at the subpixel level for applications like super-resolution, and reliable even at sharp motion boundaries despite occlusion effects. In particular, it must deal with challenging contexts such as fast motions, motion blur, illumination effects, uniformly colored objects, etc.

Starting from the seminal work of Horn and Schunck [7], OF estimation has been the subject of numerous works. Recently, a breakthrough came with deep neural networks. Convolutional neural network (CNN) based methods [6, 12, 21, 31] reached the state of the art on nearly all major OF estimation benchmarks, e.g., MPI Sintel [4] and KITTI [23], while running much faster than previous variational methods.

In order to increase the efficiency and the robustness of these methods, the focus has then been put on occlusion detection [11, 13, 26], temporal dependency [26], and memory efficiency [9, 11]. Building on these concerns, our work follows two main orientations. First, when processing a video sequence, most object motions are continuous across frame pairs. Thus, many of the ambiguities arising in two-frame OF estimation can be resolved by using more than two frames. This calls for a multi-frame estimation process able to exploit the temporal redundancy of the OF. Second, we believe that related operations can be performed by identical models with shared weights. We apply this principle to temporal recurrence, as in [26], and to scale recurrence, as in [11], but also to occlusion detection, which is strongly correlated with OF estimation. Based on these considerations, we propose a "doubly recurrent" network over spatial scales and time instants. It explicitly takes into account the information from previous frames and the redundancy of the estimation at each network scale within a unique processing cell, denoted the STaR cell, for SpatioTemporal Recurrent cell. Given information from the past and from a lower scale, the STaR cell outputs the OF and the occlusion map at the current image scale and time instant. This cell is repeatedly invoked over scales in a coarse-to-fine scheme and over sets of N successive frames, leading to the STaRFlow model. Thanks to this doubly recurrent structure, and by sharing the weights between the processes dedicated to flow estimation and to occlusion detection, we obtain a lightweight model: STaRFlow is indeed slightly lighter than LiteFlowNet [9], while jointly producing multi-frame OF estimation and occlusion detection.

Let us now outline the organization of the paper while listing our main contributions. We first discuss related work in Section 2; Section 3 is then devoted to the description of our main contribution, the STaRFlow model for multi-frame OF estimation. Experiments are presented in Section 4, with results on MPI Sintel [4] and KITTI [23]; examples of STaRFlow results on these two datasets are shown in Figure 1. In particular, we conduct an ablation study that addresses three important subjects: temporal recurrence, occlusions, and scale recurrence.


Figure 1: Qualitative results of the proposed STaRFlow model on the MPI Sintel final pass (top row) and KITTI 2015 (bottom row) test sets. STaRFlow allows accurate motion estimation on partially occluded objects (right knee of the character in the upper leftmost example) and on thin objects (fingers and posts in the rightmost examples).

First, as regards temporal recurrence, we show that passing learned features between time instants compares favourably to passing previously estimated OF as in ContinualFlow [26]. Our approach also benefits more from a larger number of frames than [26]. Secondly, our occlusion handling appears as efficient as previously published approaches, while being much simpler and involving significantly fewer extra parameters. Thirdly, the study of the scale recurrence highlights the compactness of our model. Finally, concluding remarks and perspectives are given in Section 5.

2 Related Work

2.1 Optical Flow (OF) Estimation With CNN

Dosovitskiy et al. [6] were the first to publish a deep learning approach for OF estimation. They proposed a synthetic training dataset, FlyingChairs, and two CNN architectures, FlowNetS and FlowNetC. These showed fairly good results, though not state of the art, on benchmark data very different from their simple 2D synthetic training dataset. By using a more complex training dataset, FlyingThings3D [21], and a bigger architecture involving several FlowNet blocks, Ilg et al. [12] proposed the first state-of-the-art CNN-based method for OF estimation. Moreover, their learning strategy (FlyingChairs → FlyingThings) was subsequently used by several supervised learning approaches.

Some of the works that followed [9, 27, 31] sought to leverage well-known classical practices in OF estimation, like warping-based multi-scale estimation, within a deep learning framework, leading to state-of-the-art algorithms [9, 31]. In particular, the PWC-Net of [31] has since been used as a baseline for several top-performing methods [2, 11, 18, 26, 28]. Very recently, Hur and Roth [11] got even closer to classical iterative OF estimation processes with an "iterative residual refinement" (IRR) version of PWC-Net. IRR mainly consists in using the same learned parameters for every stage of the decoder, so as to obtain a lighter and better-performing method. We exploit the same idea but extend it to scale and temporal iterations in a multi-frame setting.

2.2 Multi-Frame Optical Flow Estimation

Exploiting temporal coherence has been shown to improve estimation quality. Wang et al. [34] use multiple frames in a Lucas-Kanade [19] estimation process and show better results when increasing the number of frames, i.e., a less noisy estimation and a reduced number of ambiguous matching points. Volz et al. [33] also improve their estimate, in particular in untextured regions, by modeling temporal coherence with an adaptive trajectory regularization in a variational method. Kennedy and Taylor [16] showed improved results on the MPI Sintel benchmark [4] by using additional frames, most significantly in unmatched regions.

Additional frames are useful to cope with occlusions: for instance, pixels visible at time t but occluded at time t+1 may have been visible at time t−1. Hence, the OF is ill-defined from t to t+1 but can be filled in with the estimation at the previous time step. Ren et al. [28] propose a multi-frame fusion process to fuse the current OF estimate with the estimate at the previous time step. Maurer and Bruhn [20] propose to learn, with a CNN, how to infer the forward flow from the backward flow, and fuse it with the actual estimated forward flow. Note that in these references, the multi-frame estimation stems from the fusion of two OF estimates provided by classical two-frame processes launched between different frame pairs. In contrast, in the ContinualFlow model of [26], a temporal connection is introduced to pass the OF estimate at time t−1 to the estimation process at time t, making the estimation recurrent in time. Let us also mention that, in an unsupervised learning framework, [15], [18] and [17] also show improved results, most significantly in occluded areas, by using multiple frames. These methods use 3 frames and jointly estimate the OF from t to t−1 and from t to t+1.

Our work is closest to [26], as we also propose a recurrent temporal connection, but it is based on passing learned features from one instant to the next rather than OF estimates. According to our experiments, this approach is more efficient and makes it possible to exploit a larger time range than ContinualFlow [26].

2.3 Occlusion Handling

As OF is ill-defined at occluded pixels, occlusions have to be accounted for during estimation. Classical methods either treat occlusions as outliers within a robust estimation setting [3], or conduct explicit occlusion detection, often using a forward-backward consistency check [1].



Figure 2: Unrolled view of the proposed SpatioTemporal Recurrent architecture for multi-frame OF estimation (STaRFlow).

In a deep learning framework, several methods jointly estimate OF and occlusion maps. In doing so, most authors (e.g., [11, 26]) observe a significant improvement in OF estimation, an exception being [13]. Unsupervised methods also estimate occlusion maps, as they need to ignore occluded pixels in their photometric loss. [22] estimates occlusion maps by a forward-backward check, while [15, 18] learn occlusion detection in an unsupervised manner. Very recently, [36] proposed a self-supervised method to learn an occlusion map, which is used to filter the feature warping so as to avoid ambiguity due to occlusions.

Here, we propose a very simple and lightweight way of dealing with occlusions by processing occlusion maps in almost the same way as OF estimates, and we observe a significant gain in OF accuracy, in accordance with [11, 26].

3 Proposed Approach

We propose a doubly recurrent algorithm for optical flow (OF) estimation. It mainly consists in repeatedly applying the same SpatioTemporal Recurrent (STaR) cell, recursively over time and spatial scale, to features extracted from each image of the sequence. Fig. 2 presents an unrolled representation of this recurrent "STaRFlow" model. Feature extraction uses a shared encoder (green block in Fig. 2) whose architecture comes from [31]. The scale recurrence, represented as horizontal gray arrows in Fig. 2, consists in feeding the STaR cell at each scale with the features extracted from the current frame and with the OF and occlusion map coming from the previous scale. The data flow related to the temporal recurrence carries learned features from one time instant to the next; it is depicted as vertical pink arrows.
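For concreteness, here is a minimal Python sketch of this double recurrence. It is not the authors' code: encoder and star_cell are hypothetical stand-ins for the shared encoder and the STaR cell detailed below.

def starflow(frames, num_scales):
    # One feature pyramid per frame, from the shared-weight encoder.
    feats = [encoder(f) for f in frames]
    temporal_state = [None] * num_scales    # learned features conveyed through time
    outputs = []
    for t in range(len(frames) - 1):        # temporal recurrence over frame pairs
        flow = occ = None                   # restart the coarse-to-fine pass
        for l in range(num_scales):         # scale recurrence with shared weights
            flow, occ, temporal_state[l] = star_cell(
                feats[t][l], feats[t + 1][l],   # current / next frame features
                flow, occ,                      # estimates upsampled from scale l-1
                temporal_state[l])              # features conveyed from time t-1
        outputs.append((flow, occ))
    return outputs

Note that the same star_cell weights are used at every scale and at every time step, which is what makes the model lightweight.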

The rest of this section aims at a complete description of STaRFlow. The internal structure of the STaR cell is presented in Section 3.1. Then Section 3.2 focuses on the temporal recurrence, Section 3.3 is dedicated to occlusion handling, and Section 3.4 presents the spatial recurrence. Finally, in Section 3.5, we discuss the compound loss used for multi-frame optical flow estimation and the optimization process.

3.1 STaR Cell

Like several other recent OF estimation approaches, the proposed method builds upon PWC-Net [31], which was designed to use well-known good practices from energy minimization methods: multi-scale pyramid, warping, and cost-volume computation by correlation. These three elements are found in the architecture of the STaR cell presented in Fig. 3. It is fed with features from a siamese pyramid encoder applied to both frames. Similarly to PWC-Net, the core trainable block is a CNN dedicated to OF (blocks "CNN optical flow estimator" and "Context network" in Fig. 3). Finally, to avoid blurry results near motion discontinuities, we use the lightweight bilateral refinement of [11].

In addition to the inputs already appearing in PWC-Net (features from the reference image, cost volume from the correlation of features, and the upsampled flow from the previous scale), two supplementary input/output data flows are involved in the STaR cell. The first one implements the temporal recurrence leading to a multi-frame estimation: it conveys features from the highest layers of the CNN OF estimator, which are fed into the CNN OF estimator at the next time step (see Sec. 3.2). The second concerns the occlusion map, which undergoes essentially the same pipeline as the OF; further details on occlusion handling are given in Sec. 3.3.
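Under the same assumptions as above, the cell's internal data flow can be sketched as follows; warp, correlate, concat, flow_estimator, context_net and conv1x1 are hypothetical placeholders, and the bilateral refinement is omitted.

def star_cell(feat_ref, feat_tgt, flow_up, occ_up, temp_feat):
    if flow_up is not None:
        feat_tgt = warp(feat_tgt, flow_up)   # pre-align the target features
    cost = correlate(feat_ref, feat_tgt)     # local correlation cost volume
    x = concat(feat_ref, cost, flow_up, occ_up, temp_feat)
    feats, flow, occ = flow_estimator(x)     # last layer outputs flow and occlusion
    flow = flow + context_net(feats, flow)   # residual refinement by the context network
    temp_out = conv1x1(feats)                # compressed features passed to time t+1
    return flow, occ, temp_out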

3.2 Temporal Recurrence for Multi-Frame Estimation

The temporal connection passes features from time t−1 to time t (Figure 3).


Figure 3: Structure of the proposed SpatioTemporal Recurrent cell (STaR cell).

These features are the outputs of the penultimate layer of the CNN OF estimator at t−1, compressed by a 1 × 1 convolution to keep the number of input channels constant from one time step to the next. They are then warped into the geometry of the current first image, using the backward flow of the previous time step, i.e., the optical flow from t to t−1. This flow is not directly available at inference, as our network predicts the forward flow, i.e., from t−1 to t. We therefore apply our network to the two frames in reversed time order (from t to t−1) to estimate the backward flow, with the temporal connection set to zero.
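A sketch of this hand-off, under the assumptions above (predict_flow stands for a full forward pass of the network on a frame pair, with the temporal input zeroed):

def temporal_handoff(penultimate_feats, frame_prev, frame_cur):
    # The 1x1 convolution keeps the channel count constant across time steps.
    feats = conv1x1(penultimate_feats)
    # The backward flow (t -> t-1) is obtained by running the network on the
    # reversed frame pair, with the temporal connection set to zero.
    backward_flow = predict_flow(frame_cur, frame_prev, temporal_state=None)
    # Warp the features into the geometry of the current reference frame.
    return warp(feats, backward_flow)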

3.3 Joint Estimation of Occlusions

As already mentioned, previous works such as [11, 26] considered the idea of jointly estimating OF and occlusion maps, with the purpose of improving OF estimation. In [26], occlusion maps are estimated using an extra CNN module and used as an input to the OF estimator, while [11] processes the occlusion map and the OF in parallel by adding an occlusion CNN estimator with the same architecture as the OF CNN estimator, but ending with a one-channel sigmoid layer. These methods, especially [11], lead to a significant increase in the number of parameters of the model.

In the STaR cell, the joint estimation of OF and occlusions is done simply by adding a channel to the last convolutional layer of the CNN OF estimator (which hence becomes an "OF+occlusion" estimator). After a sigmoid layer, this supplementary channel gives an occlusion probability map with values between 0 (non-occluded) and 1 (occluded). Compared to [11, 26], this leads to a negligible number of extra parameters, while achieving competitive results, according to the experiments conducted in Sec. 4.3.
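In PyTorch-like terms, this amounts to widening a single convolution; the channel counts below are illustrative, not the paper's actual values.

import torch
import torch.nn as nn

# Last layer of the estimator: 2 flow channels plus 1 occlusion channel.
last_conv = nn.Conv2d(in_channels=32, out_channels=2 + 1, kernel_size=3, padding=1)

def read_outputs(features):
    out = last_conv(features)
    flow = out[:, :2]                 # optical flow (u, v)
    occ = torch.sigmoid(out[:, 2:3])  # occlusion probability in [0, 1]
    return flow, occ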

3.4 Spatial Recurrence over Scales

We iterate with the same weights over the scales, following the IRR approach of [11]; unlike them, however, we apply this coarse-to-fine process to a concatenation of the OF and the occlusion map. This allows a significant decrease in the number of parameters, while keeping the estimation results almost unchanged, as shown in Sec. 4.4.3.

3.5 Multi-Frame Training Loss

We use N-frame training sequences and train our network to estimate the OF for each pair of consecutive images. From the second image pair of the sequence onward, information from previous estimations is transmitted through the temporal connection. At the end of the sequence, we update the weights so as to decrease:

$$\mathcal{L} = \frac{1}{N} \sum_{t=1}^{N} \mathcal{L}_t \qquad (1)$$

where $\mathcal{L}_t$ is a multi-scale and multi-task loss for the image pair $(I_t, I_{t+1})$:

$$\mathcal{L}_t = \sum_{l=1}^{L} \alpha_l \left( \mathcal{L}_{\mathrm{flow}}^{t,l} + \lambda \, \mathcal{L}_{\mathrm{occ}}^{t,l} \right) \qquad (2)$$

the coefficients $\alpha_l$ being chosen as in [31]. The supervision of the OF $u_t^l(x)$ at each time step $t$ and each scale $l$ is done as in [31], using the L2 norm summed over all pixel positions:

$$\mathcal{L}_{\mathrm{flow}}^{t,l} = \sum \left\| u_t^l - u_{t,GT}^l \right\|_2 \qquad (3)$$

For the occlusion map $o_t^l$, the loss is a weighted binary cross-entropy:

$$\mathcal{L}_{\mathrm{occ}}^{t,l} = -\frac{1}{2} \sum \left( w_t^l \, o_{t,GT}^l \log o_t^l + \bar{w}_t^l \, (1 - o_{t,GT}^l) \log(1 - o_t^l) \right) \qquad (4)$$

where the summations run over all pixel positions and where

$$w_t^l = \frac{H^l \cdot W^l}{\sum o_t^l + \sum o_{t,GT}^l}, \qquad \bar{w}_t^l = \frac{H^l \cdot W^l}{\sum (1 - o_t^l) + \sum (1 - o_{t,GT}^l)},$$

$H^l$ and $W^l$ being the image size at scale $l$. As in [11], we update at each iteration the weight $\lambda$ that balances the flow loss and the occlusion loss.
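The following PyTorch sketch mirrors Eqs. (1)-(4), assuming predictions and ground truths are stored per time step and per scale; the tensor layout and the clamping epsilon are our own choices, not the authors' code.

import torch

def occlusion_loss(occ, occ_gt):
    # Weighted binary cross-entropy of Eq. (4); the weights rebalance the
    # occluded and non-occluded classes by their areas.
    hw = occ.shape[-2] * occ.shape[-1]
    w_pos = hw / (occ.sum() + occ_gt.sum())
    w_neg = hw / ((1 - occ).sum() + (1 - occ_gt).sum())
    eps = 1e-7  # numerical safety for the logarithms
    bce = (w_pos * occ_gt * torch.log(occ.clamp(min=eps))
           + w_neg * (1 - occ_gt) * torch.log((1 - occ).clamp(min=eps)))
    return -0.5 * bce.sum()

def sequence_loss(preds, gts, alphas, lam):
    # preds[t][l] = (flow, occ) at time step t and scale l; gts mirrors this layout.
    total = 0.0
    for preds_t, gts_t in zip(preds, gts):
        loss_t = 0.0
        for alpha, (flow, occ), (flow_gt, occ_gt) in zip(alphas, preds_t, gts_t):
            epe = ((flow - flow_gt) ** 2).sum(dim=1).sqrt()  # per-pixel L2 norm, Eq. (3)
            loss_t = loss_t + alpha * (epe.sum() + lam * occlusion_loss(occ, occ_gt))
        total = total + loss_t
    return total / len(preds)  # Eq. (1): average over time steps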

4 Experiments

4.1 Implementation Details

As proposed in [12], all models are first trained on FlyingChairs [6] and then on FlyingThings3D [21]. We then finetune on either KITTI or MPI Sintel. We use photometric and geometric data augmentations as in [11], except that we do not apply relative transformations in the geometric augmentations.


4.1.1 Pretraining on Image Pairs on FlyingChairs

Following [26], we first train our multi-frame architecture, except for the temporal connection, on 2D two-frame data. To supervise both OF and occlusion estimation, we use the FlyingChairsOcc dataset [11]. We train with a batch size of 8 for 600k iterations, with an initial learning rate of 10^-4, which is divided by 2 every 100k iterations after the first 300k iterations.
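One plausible reading of this schedule, written as a small helper (the exact placement of the first halving is our interpretation):

def pretrain_lr(iteration, base_lr=1e-4):
    # Constant for the first 300k iterations, then halved every 100k iterations.
    if iteration < 300_000:
        return base_lr
    return base_lr / 2 ** ((iteration - 300_000) // 100_000 + 1)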

4.1.2 Multi-Frame Training on FlyingThings3D

We then train the STaRFlow model on sequences of N = 4 images from FlyingThings3D, the temporal data stream being initialized to zero. Note that longer sequences could be exploited, at the cost of increased memory requirements for training. As this is the first training of the temporal connection, we start again from a learning rate of 10^-4, higher than at the end of the two-frame training (as suggested by [26]), and train for 400k iterations, dividing the learning rate by 2 every 100k iterations after the first 150k iterations. We use a batch size of 4. For the ablation study, this is the final step of our training.

4.1.3 Finetuning on MPI Sintel or Kitti

We use the same finetuning protocol as [11], but extended to our multi-frame (N = 4) estimation process. On Sintel, we can supervise every time step. On KITTI, only one time step is annotated, hence we only supervise the last time-step estimation. This finetuning step is only used for benchmark submissions.

4.1.4 Running Time

On Sintel images (1024 × 436), the inference time of STaRFlow is 0.22 seconds per image pair on a mid-range NVIDIA GTX 1070 GPU.

4.2 Optical Flow Results on Benchmarks

Results of STaRFlow on the MPI Sintel and KITTI 2015 benchmarks are given in Tab. 1 and compared to top-performing methods and/or methods closely related to our approach. STaRFlow reaches the best EPE score on the final pass of Sintel, is second on the clean pass, and is on par with IRR-PWC on KITTI 2015. KITTI 2015 is characterized by very large movements of foreground objects, which generally disadvantages multi-frame methods: among them, STaRFlow still ranks second behind MFF. Regarding the number of parameters, STaRFlow ranks second behind ARFlow but outperforms it (as well as other light methods such as LiteFlowNet and SelFlow) in terms of OF precision. It is also interesting to compare STaRFlow with the related methods [26] and [11]. STaRFlow significantly outperforms ContinualFlow [26] on all benchmarks while being three times lighter. Compared to IRR-PWC [11], the benefit of the multi-frame estimation of STaRFlow clearly appears on MPI Sintel.

Table 1: Results on the MPI Sintel and KITTI 2015 benchmarks (test sets). Endpoint error [px] on Sintel, percentage of outliers on KITTI.

Method                    Sintel clean   Sintel final   KITTI 2015 Fl-all   Parameters
ARFlow-mv* [17]           4.49           5.67           11.79 %             2.37M
LiteFlowNet [9]           4.54           5.38           9.38 %              5.37M
PWC-Net [31]              4.39           5.04           9.60 %              8.75M
LiteFlowNet2 [10]         3.48           4.69           7.62 %              6.42M
PWC-Net+ [30]             3.45           4.60           7.72 %              8.75M†
IRR-PWC [11]              3.84           4.58           7.65 %              6.36M
MFF* [28]                 3.42           4.57           7.17 %              9.95M
ContinualFlow_ROB* [26]   3.34           4.53           10.03 %             14.6M†
SelFlow* [18]             3.74           4.26           8.42 %              4.79M‡
MaskFlowNet [36]          2.52           4.17           6.11 %              N/A
ScopeFlow [2]             3.59           4.10           6.82 %              6.36M
STaRFlow-ft* (ours)       2.72           3.71           7.65 %              4.77M

Best results are in bold characters, second-best in italic. Multi-frame methods are marked with *. †: value given in [11]. ‡: value given in [17].

Table 2: Occlusion map estimation results (F1-score) on MPI Sintel.

Method                    Clean   Final   Parameters
ContinualFlow [26]        -       0.48    14.6M
SelFlow [18]              0.59    0.52    4.79M
IRR-PWC [11]              0.71    0.67    6.36M
ScopeFlow [2]             0.74    0.71    6.36M
Our occlusion estimator   0.70    0.66    4.09M

Best results are in bold characters.

4.3 Occlusion Estimation

Our main purpose here is to compare our solution for occlusion estimation, which shares almost all its weights with the OF estimator, to the dedicated decoder used in IRR-PWC. To make this comparison as fair as possible, we have trained a two-frame version of STaRFlow (by removing the red connections and operators in Fig. 3), which then essentially differs from IRR-PWC only by the occlusion detection process. In Tab. 2, we compare the F1-scores of our occlusion estimator and of various methods (including IRR-PWC) on occlusion maps estimated from MPI Sintel data. Our occlusion estimation is on par with IRR-PWC while being much lighter. We also report the scores of SelFlow and ScopeFlow for comparison with other state-of-the-art methods.

4.4 Ablation Study

In this section, we assess the contributions of the following components of the STaRFlow model to OF estimation: temporal recurrence and the number of frames used, joint occlusion estimation, and spatial recurrence. For all the experiments, our backbone is the two-frame PWC-Net architecture [31]¹, which we trained as described in [11]. As this backbone does not include a bilateral refinement module, we do not include this module in the following tests. The models are trained on FlyingChairs and FlyingThings3D, without any further finetuning, and tested on the training sets of MPI Sintel and KITTI 2015. All comparisons are made with the main performance metrics proposed on the benchmark websites.

¹Implementation from https://github.com/visinf/irr


Table 3: Influence of the temporal connection and occlusion modules on performance (MPI Sintel and KITTI 2015 training sets).

Method               Cat.   Sintel Clean [px]       Sintel Final [px]       KITTI 2015           Parameters
                            all    noc    occ       all    noc    occ       epe-all   Fl-all     number   relative

Without joint occlusion estimation:
Backbone (PWC-Net)   2F     2.74   1.46   16.48     4.18   2.56   21.70     11.75     33.20 %    8.64M    0 %
Backbone + TRFlow    MF     2.47   1.41   13.97     4.01   2.52   20.00     11.27     33.77 %    8.68M    +0.5 %
Backbone + TRFeat    MF     2.45   1.44   13.36     3.76   2.46   17.82     9.94      32.12 %    12.31M   +42.5 %

With joint occlusion estimation:
Backbone             2F     2.46   1.32   14.82     3.96   2.47   20.06     10.58     31.28 %    8.68M    +0.5 %
Backbone + TRFlow    MF     2.17   1.23   12.33     3.90   2.50   19.11     10.82     32.51 %    8.73M    +1.0 %
Backbone + TRFeat    MF     2.09   1.21   11.63     3.43   2.24   16.24     8.79      28.18 %    12.38M   +43.3 %

With joint occlusion estimation and spatial recurrence:
Backbone             2F     2.29   1.20   14.03     3.72   2.32   18.77     10.74     31.35 %    3.37M    −61.0 %
Backbone + TRFlow    MF     2.20   1.25   12.40     3.98   2.56   19.38     11.00     35.23 %    3.38M    −60.9 %
Backbone + TRFeat    MF     2.10   1.22   11.67     3.49   2.32   16.15     9.26      30.75 %    4.37M    −49.4 %

Best results are in bold characters. Fl-all, on KITTI, is the percentage of outliers (epe > 3 px). 2F (resp. MF) refers to two-frame (resp. multi-frame) methods. TR stands for temporal recurrence.

Note that we use the revised occlusion maps provided by [11] to compute the occ/noc scores on MPI Sintel.

4.4.1 Temporal Recurrence

Two different temporal recurrences are evaluated in Tab. 3, with and without occlusion handling, and compared to the two-frame backbone. The first one, termed "TRFlow", is inspired by [26] and passes the estimated OF at time t−1 to the CNN OF estimator at time t. In the second approach, denoted "TRFeat", the temporal connection conveys learned features. TRFeat is the method implemented in STaRFlow and described in Sec. 3.2.

According to Tab. 3, using learned features in the temporal connection yields better results than passing estimated OFs, with higher EPE gains on degraded images (Sintel Final vs. Sintel Clean) and especially on the real images of the KITTI 2015 training dataset. The results are consistent whether or not an occlusion module is used.

The qualitative results displayed in Figs. 4-6 help to better understand the gains brought by our temporal connection and occlusion handling. As could be expected, multi-frame estimation improves robustness to degraded image quality. This is shown in Fig. 4, which compares results on Sintel Clean and Final (blurry) images.

Multi-frame estimation also allows temporal inpainting: for a region occluded at time t+1 but visible at t and previous time steps, the previously estimated motion can be used to predict the motion between t and t+1. This can be observed in the Sintel example shown in the upper left part of Fig. 1: the right knee of the central character, although occluded in the next frame, is correctly estimated by STaRFlow. Fig. 5 displays an example from the KITTI 2015 training set where the temporal connection and occlusion estimation are both required to correctly estimate the motion of the roadsign in the lower right part of the image, which is occluded in the next frame. Finally, Fig. 6 shows that our temporal connection with learned features yields increased sensitivity to small object motion compared to the backbone and also to TRFlow.

4.4.2 Occlusion Handling

Comparing methods with and without occlusion estimation in Tab. 3 shows that adding the task of detecting occlusions consistently helps OF estimation. This is true for both two-frame and multi-frame methods.

4.4.3 Spatial Recurrence

The lower part of Tab. 3 is devoted to the spatial recurrence, i.e., the iterations with the same weights over scales in the coarse-to-fine multi-level estimation [11]. While OF precision is only marginally affected by this implementation, large gains in terms of number of parameters are obtained with respect to the PWC-Net backbone (see the last column).

4.4.4 Impact of the Number of Frames at Test Time

Recall that N = 4 frames are used for training the multi-frame models (TRFlow and TRFeat). This means that, at training time, the temporal connection is reinitialized to zero every 4 frames, essentially to keep the memory cost within the capacity of the hardware. However, at test time, the temporal connection can be exploited over a different time horizon. This is the subject of Tab. 4, which compares the temporal connections TRFlow and TRFeat when increasing the number of frames N′ used at test time. Each line of the table presents scores computed for the OF estimated between time instants N′−1 and N′.

According to Tab. 4, performance improves more for TRFeat than for TRFlow as N′ increases. This is particularly true for degraded (Sintel Final) or real (KITTI) images, and in occluded regions. Furthermore, we observe that TRFeat still improves when using N′ = 5 frames. TRFeat, by propagating learned features in the temporal connection instead of OF, exploits long-term memory more efficiently than TRFlow and appears even able to learn a temporal continuity beyond the number of frames used for training.

This can also be seen in the qualitative results presented in Fig. 7. Estimations using N′ = 3 and N′ = 4 (columns 2 and 3) are presented for TRFlow and TRFeat. The fact that the object is very close to the image border makes the problem difficult. For both methods, using 3 frames is not enough to correctly estimate the object's contour. TRFeat manages to resolve the contour with a 4th frame, while TRFlow still fails to do so.


Figure 4: Multi-frame estimation provides robustness to degraded image quality: results on Sintel Clean (upper row) and Sintel Final pass (lower row).

Figure 5: Both the occlusion and temporal coherence modules are needed here to resolve the motion of the lower right roadsign.

Table 4: Impact of the number of frames N′ used at test time.

Backbone + occ + TRFlow + SR
      Sintel Clean             Sintel Final             Kitti15
N′    all    noc    occ        all    noc    occ        epe-all   Fl-all
2     2.36   1.27   14.17      4.05   2.57   20.06      12.53     35.95 %
3     2.17   1.24   12.29      3.95   2.56   19.03      11.26     35.35 %
4     2.20   1.25   12.40      3.98   2.56   19.38      11.01     35.27 %
5     2.20   1.26   12.37      3.98   2.56   19.30      10.94     35.17 %
6     2.20   1.26   12.33      3.98   2.58   19.11      10.94     35.19 %

Backbone + occ + TRFeat + SR
      Sintel Clean             Sintel Final             Kitti15
N′    all    noc    occ        all    noc    occ        epe-all   Fl-all
2     2.40   1.30   14.34      4.04   2.55   20.12      12.01     34.22 %
3     2.10   1.23   11.60      3.58   2.35   16.90      9.95      31.49 %
4     2.10   1.22   11.67      3.49   2.32   16.15      9.26      30.78 %
5     2.08   1.22   11.36      3.43   2.27   15.99      9.17      30.66 %
6     2.09   1.22   11.52      3.50   2.32   16.25      9.14      30.69 %

Best results are in bold characters.


5 Conclusion

We have presented STaRFlow, a new lightweight CNN method for multi-frame OF estimation with occlusion handling. It involves a unique computing cell which recursively processes both a spatial data flow, in a coarse-to-fine multi-scale scheme, and a temporal data flow which conveys learned features. Using learned features in the temporal recurrence allows better exploitation of temporal information than propagating OF estimates as proposed in [26]. STaRFlow builds upon approaches such as [8, 11], based on the repeated use of the same weights over a scale recurrence, but extends this idea to a double recurrence over time and scale. Moreover, we have also shown that occlusion estimation can be done with a minimal number of extra parameters, simply by adding a dedicated channel to the output tensor of the CNN OF estimator. STaRFlow gives state-of-the-art results on the two benchmarks MPI Sintel and KITTI 2015, even outperforming, at the time of writing, all previously published methods on the Sintel final pass. Moreover, STaRFlow is lighter than all other two-frame or multi-frame methods with comparable performance.

Quantitative and qualitative evaluations on MPI Sintel and KITTI 2015 show that STaRFlow improves OF quality on degraded images and on small objects thanks to temporal redundancy, and is also able to achieve efficient temporal inpainting in occluded areas. Our experiments also confirm the conclusions of [11, 26] that learning to predict occlusions consistently improves OF estimation. Moreover, our implementation, based on sharing almost all weights between OF and occlusion estimation, further indicates that these two tasks are closely related to each other.

Acknowledgments

The authors are grateful to the French agency of defense (DGA)for financial support, and to the ONERA project DELTA.

Figure 6: Our temporal recurrent cell improves optical flow estimation of small objects.

Figure 7: The benefit of exploiting more frames in OF estimation, for the sequence Ambush7 of Sintel Final.

References

[1] Luis Alvarez, Rachid Deriche, Théo Papadopoulo, and Javier Sánchez. Symmetrical dense optical flow estimation with occlusions detection. International Journal of Computer Vision, 75(3):371–385, 2007.

[2] Aviram Bar-Haim and Lior Wolf. ScopeFlow: Dynamic scene scoping for optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7998–8007, 2020.

[3] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision, pages 25–36. Springer, 2004.

[4] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.

[5] Zhiwen Chen, Jianzhong Cao, Yao Tang, and Linao Tang. Tracking of moving object based on optical flow detection. In International Conference on Computer Science and Network Technology, volume 2, pages 1096–1099. IEEE, 2011.

[6] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision, pages 2758–2766, 2015.

[7] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.

[8] Ping Hu, Gang Wang, and Yap-Peng Tan. Recurrent spatial pyramid CNN for optical flow estimation. IEEE Transactions on Multimedia, 20(10):2814–2823, 2018.

[9] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, June 2018.

[10] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. A lightweight optical flow CNN – revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414, 2019.

[11] Junhwa Hur and Stefan Roth. Iterative residual refinement for joint optical flow and occlusion estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5754–5763, 2019.

[12] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 6, 2017.

[13] Eddy Ilg, Tonmoy Saikia, Margret Keuper, and Thomas Brox. Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In European Conference on Computer Vision, pages 626–643. Springer, 2018.

[14] Joel Janai, Fatma Güney, Aseem Behl, and Andreas Geiger. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. arXiv preprint arXiv:1704.05519, 2017.


[15] Joel Janai, Fatma Guney, Anurag Ranjan, Michael Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In European Conference on Computer Vision, pages 690–706, 2018.

[16] Ryan Kennedy and Camillo J Taylor. Optical flow with geometric occlusion estimation and fusion of multiple frames. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 364–377. Springer, 2015.

[17] Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6489–6498, 2020.

[18] Pengpeng Liu, Michael R. Lyu, Irwin King, and Jia Xu. SelFlow: Self-supervised learning of optical flow. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[19] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. 1981.

[20] Daniel Maurer and Andrés Bruhn. ProFlow: Learning to predict optical flow. In British Machine Vision Conference, 2018.

[21] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.

[22] Simon Meister, Junhwa Hur, and Stefan Roth. UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[23] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition, 2015.

[24] Dennis Mitzel, Thomas Pock, Thomas Schoenemann, and Daniel Cremers. Video super resolution using duality based TV-L1 optical flow. In Joint Pattern Recognition Symposium, pages 432–441. Springer, 2009.

[25] Kamal Nasrollahi and Thomas B Moeslund. Super-resolution: a comprehensive survey. Machine Vision and Applications, 25(6):1423–1468, 2014.

[26] Michal Neoral, Jan Šochman, and Jirí Matas. Continual occlusion and optical flow estimation. In Asian Conference on Computer Vision, pages 159–174. Springer, 2018.

[27] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2017.

[28] Zhile Ren, Orazio Gallo, Deqing Sun, Ming-Hsuan Yang, Erik Sudderth, and Jan Kautz. A fusion approach for multi-frame optical flow estimation. In IEEE Winter Conference on Applications of Computer Vision, pages 2077–2086. IEEE, 2019.

[29] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[30] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Models matter, so does training: An empirical study of CNNs for optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[31] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.

[32] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5038–5047, 2017.

[33] Sebastian Volz, Andres Bruhn, Levi Valgaerts, and Henning Zimmer. Modeling temporal coherence for optical flow. In IEEE International Conference on Computer Vision, pages 1116–1123. IEEE, 2011.

[34] Chia-Ming Wang, Kuo-Chin Fan, Cheng-Tzu Wang, et al. Estimating optical flow by integrating multi-frame information. Journal of Information Science and Engineering, 24(6):1719–1731, 2008.

[35] Rui Xu, Xiaoxiao Li, Bolei Zhou, and Chen Change Loy. Deep flow-guided video inpainting. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2019.

[36] Shengyu Zhao, Yilun Sheng, Yue Dong, Eric I Chang, Yan Xu, et al. MaskFlowNet: Asymmetric feature matching with learnable occlusion mask. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6278–6287, 2020.

[37] WenYi Zhao and Harpreet S Sawhney. Is super-resolution with optical flow feasible? In European Conference on Computer Vision, pages 599–613. Springer, 2002.
