Switchable Temporal Propagation Network





Sifei Liu 1, Guangyu Zhong 1,3, Shalini De Mello 1, Jinwei Gu 1, Varun Jampani 1, Ming-Hsuan Yang 1,2, and Jan Kautz 1

1 NVIDIA   2 UC Merced   3 Dalian University of Technology

Abstract. Videos contain highly redundant information between frames. Such redundancy has been studied extensively in video compression and encoding, but is less explored for more advanced video processing. In this paper, we propose a learnable unified framework for propagating a variety of visual properties of video images, including but not limited to color, high dynamic range (HDR), and segmentation mask, where the properties are available for only a few key-frames. Our approach is based on a temporal propagation network (TPN), which models the transition-related affinity between a pair of frames in a purely data-driven manner. We theoretically prove two essential properties of TPN: (a) by regularizing the global transformation matrix as orthogonal, the “style energy” of the property can be well preserved during propagation; and (b) such regularization can be achieved by the proposed switchable TPN with bi-directional training on pairs of frames. We apply the switchable TPN to three tasks: colorizing a gray-scale video based on a few colored key-frames, generating an HDR video from a low dynamic range (LDR) video and a few HDR frames, and propagating a segmentation mask from the first frame in videos. Experimental results show that our approach is significantly more accurate and efficient than the state-of-the-art methods.

1 Introduction

Videos contain highly redundant information between frames. Consider a pair of consecutive frames randomly sampled from a video: it is likely that they are similar in terms of appearance, structure and content in most regions. Such redundancy has been extensively studied in video compression to reduce storage and speed up the transmission of videos, but is less explored for more advanced video processing. A number of recent algorithms, such as optical-flow based warping [1], similarity-guided filtering [2, 3] and the bilateral CNN model [4], explore the local relationships between frames to propagate information. These methods model the similarity among pixels, regions or frames from either hand-crafted pixel-level features (e.g., pixel intensities and locations) or apparent motion (e.g., optical flow). They have several potential issues: (a) the designed similarity may not faithfully reflect the image structure, and (b) such similarity may not express the high-level pairwise relationships between frames, e.g., for propagating a segmentation mask in the semantic domain.



Fig. 1. We propose the TPN model that takes a known property (e.g., color, HDR, segmentation mask) from a key-frame (k) and transforms it to a nearby frame (k + τ), denoted by “propagated”. The transformation is guided by a learnable matrix G, which is learned from some known information (e.g., lightness, LDR, RGB image). We show three tasks on the right side, where k denotes the key-frame for which the ground-truth property is provided. The orange bounding boxes show the propagated results of our algorithm when guided by the information in the left columns. We highlight (red bounding boxes) the regions where the proposed method successfully deals with large transitions or preserves fine details. Zoom in to see details.

In this paper, we develop a temporal propagation network (TPN) to explicitly learn pixel-level similarity between a pair of frames (see Fig. 1). It contains a propagation module that transfers a property (e.g., color) of one frame to a nearby frame through a global, linear transformation matrix, which is learned with a CNN from any available guidance information (e.g., lightness).

We enforce two principles when learning propagation in the temporal domain: (a) bi-directionality, i.e., the propagation between a pair of frames should be invertible, and (b) consistency, i.e., the “style energy” (e.g., the global saturation of color) of the target property should be preserved during propagation. We theoretically prove that enforcing both principles in the TPN is equivalent to ensuring that the transformation matrix is orthogonal with respect to each propagation direction. This theoretical result allows us to implement the TPN as a novel, special network architecture, the switchable TPN (see Fig. 2), without explicitly solving for the transformation matrix. It uses bi-directional training for a pair of frames in the propagation module, which is guided by switched output maps from the guidance CNN. Experiments demonstrate that the proposed architecture is effective in preserving the style energy even between two widely separated frames.

We validate the proposed model on three propagation tasks: (a) video colorization from a few color key-frames and a grayscale video (Section 5.2). With such temporal propagation, the workload of black-and-white video colorization can be largely reduced to annotating only a small number of key-frames. (b) HDR video reconstruction from an LDR video with a few HDR key-frames (Section 5.3). This is a new way to capture HDR video, where the whole video can be reconstructed from a few provided HDR frames. (c) Video segmentation when only the segmentation mask of the target in the first frame is provided.



We show that even without any image-based segmentation model, the proposed method achieves performance comparable to the state-of-the-art algorithm. All of these tasks reveal that video properties between temporally close frames are highly redundant, and that the relationships between them can be learned from corresponding guidance information. Compared to existing methods, and aside from the novel architecture, our proposed method has the following advantages: (a) High accuracy. Compared to prior work [4, 5], the TPN significantly improves video quality. More importantly, the switchable TPN preserves the style energy significantly better than the network without the switchable structure. (b) High efficiency. Our method runs in real time on a single Titan XP GPU for all three tasks, which is about 30x to 50x faster than prior work [4, 5] (see Table 1). Moreover, our model does not require sequential processing of video frames, i.e., all video frames can be processed in parallel, which can further improve its efficiency.

2 Related Work and Problem Context

Modeling affinity for pixel propagation. Affinity is a generic measure of closeness between two pixels/entities and is widely used in vision tasks at all levels. A well-modeled affinity reveals how to propagate information from known pixels to unknown ones. Most prior methods design affinity measures based on simple, intuitive functions [2, 6, 3]. Recently, a deep CNN model has been proposed to learn a task-dependent affinity metric [7] by modeling the propagation of pixels as an image diffusion process.

While [7] is limited to the spatial propagation of pixels for image segmentation, its high-level idea inspires us to learn pixel affinity in other domains via CNNs, e.g., in video sequences as proposed in this work.

Considerably less attention has been paid to developing methods for propagating temporal information across video frames. Jampani et al. [4] propose to propagate video segmentation and color information by embedding pixels into a bilateral space [8] defined based on spatial, temporal and color information. While pixels of the same region from different frames can be closer in this space, the method requires several previous frames stacked together to generate a new frame, which results in a high computational cost. Our proposed algorithm is different in that it explicitly learns the pixel affinity that describes the task-specific temporal frame transitions, instead of manually defining a similarity measure.

Colorizing grayscale videos. Colorization in images and videos is achieved via an interactive procedure in [3], which propagates manually annotated strokes spatially within or across frames, based on a matting Laplacian matrix and manually defined similarities. Recently, several methods based on CNNs have been developed to colorize pixels in images with fully-automatic or sparsely annotated colors [9, 10]. Due to the multinomial nature of color pixels [10], the interactive procedure usually gives better results. While interactive methods can be employed for single images, it is not practical to annotate all frames of a monochrome video.



In this work, we suggest a more practical approach: using a few color key-frames to propagate visual information to all frames in between. To this end, colorizing a full video can be easily achieved by annotating sparse locations in only a few key-frames, as described in Section 5.2.

Video propagation for HDR imaging. Most consumer-grade digital cameras have limited dynamic range and often capture images with under/over-exposed regions, which not only degrades the quality of the captured photographs and videos, but also impairs the performance of computer vision tasks in numerous applications. A common way to achieve HDR imaging is to capture a stack of LDR images with different exposures and to fuse them together [11, 12]. Such an approach assumes static scenes and thus requires deghosting techniques [13–15] to remove artifacts. Capturing HDR videos of dynamic scenes poses a more challenging problem. Prior methods to create HDR videos are mainly based on hardware that either alternates exposures between frames [16, 17], uses multiple cameras [18], or uses specialized image sensors with pixel-wise exposure controls [19, 20]. A few recent methods based on deep models have been developed for HDR imaging. Kalantari et al. [21] use a deep neural network to align multiple LDR images into a single HDR image for dynamic scenes. Zhang et al. [22] develop an auto-encoder network to predict a single HDR panorama from a single exposed LDR image for image-based rendering. In addition, Eilertsen et al. [5] propose a similar network for HDR reconstruction from a single LDR input image, which primarily focuses on recovering details in the high-intensity saturated regions.

In this paper, we apply the TPN to HDR video reconstruction from an LDR video. Given a few HDR key-frames and an LDR video, the TPN propagates the scene radiance information from the key-frames to the remaining frames. Note that unlike the existing single-LDR-based methods [22, 5], which hallucinate the missing HDR details in images, we focus on propagating the HDR information from the few input HDR images to neighboring LDR frames, which provides an alternative solution for efficient, low-cost HDR video reconstruction.

3 Proposed Algorithm

We exploit the redundancy in videos and propose the TPN for learning affinity and propagating target properties between frames. Take video colorization as an example: given an old black-and-white movie with a few key-frames colorized by artists, can we automatically colorize the entire movie? This problem can be equivalently reformulated as propagating a target property (i.e., color) based on the affinity of some features (e.g., lightness) between two frames. Intuitively, this is feasible because (1) videos have redundancy over time, i.e., nearby frames tend to have similar appearance, and (2) the pixel correlation between two frames in the lightness domain is often consistent with that in the color domain.

In this work, we model the propagation of a target property (e.g., color) between two frames as a linear transformation,

$$U_t = G U_k, \qquad (1)$$



Fig. 2. The architecture of the switchable TPN, which contains two propagation modules for bi-directional training. We use a red dashed box to denote the switchable structure. In the reversed pair, the output channels {P} are switched (red) for horizontal and vertical propagation.

where $U_k \in \mathbb{R}^{n^2 \times 1}$ and $U_t \in \mathbb{R}^{n^2 \times 1}$ are the vectorized versions of the $n \times n$ property maps of a key-frame and a nearby frame, and $G \in \mathbb{R}^{n^2 \times n^2}$ is the transformation matrix to be estimated⁴. Suppose we observe some features of the two frames (e.g., lightness), $V_k$ and $V_t$; the transformation matrix $G$ is thus a function of $V_k$ and $V_t$,

$$G = g(\theta, V_k, V_t). \qquad (2)$$

The matrix G should be dense in order to model any type of pixel transition in a global scope, but G should also be concise for efficient estimation and propagation. In Section 3.1, we propose a solution, called the basic TPN, by formulating the linear transformation G as an image diffusion process similar to [7]. Following that, in Section 3.2, we introduce the key part of our work, the switchable TPN, which enforces the bi-directionality and the style consistency for temporal propagation. We prove that enforcing these two principles is equivalent to ensuring that the transformation matrix G is orthogonal, which in turn can be easily implemented by equipping an ordinary temporal propagation network with a switchable structure.

3.1 Learning Pixel Transitions via the Basic TPN

Directly learning the transformation matrix $G$ via a CNN is prohibitive, since $G$ has a huge dimension (e.g., $n^2 \times n^2$). Instead, inspired by the recent work [7], we formulate the transformation as a diffusion process, and implement it efficiently by propagating information linearly along each row and each column of an image. Suppose we keep only $k = 3$ nearest neighbors from a previous column (row) during the propagation, and we perform the propagation in $d = 4$ directions; the total number of parameters to be estimated is significantly reduced from $n^2 \times n^2$ to $n^2 \times k \times d$ (see the example in Fig. 2).

⁴ For property maps with multiple channels ($n \times n \times c$), we treat each channel separately.
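To give a concrete sense of scale, the snippet below (a sketch; the hidden resolution n = 64 is an illustrative value of ours, not a number stated above) compares the size of a dense transformation matrix with the number of guidance-network outputs the TPN actually needs:

```python
n, k, d = 64, 3, 4                  # illustrative hidden resolution; k neighbors, d directions
dense_entries = (n * n) ** 2        # a full G over n^2 pixels: 16,777,216 entries
tpn_outputs = n * n * k * d         # guidance-network outputs for the TPN: 49,152
print(dense_entries, tpn_outputs, dense_entries // tpn_outputs)  # roughly 341x fewer parameters
```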



Linear transformation as a diffusion process. The diffusion process from frame $k$ to frame $t$ can be expressed with a partial differential equation (PDE) in discrete form as:

$$\partial U = U_t - U_k = -L U_k = (A - D) U_k, \qquad (3)$$

where $L = D - A$ is the Laplacian matrix, $D$ is the diagonal degree matrix and $A$ is the affinity matrix. In our case, this represents the propagation of the property map $U$ over time. Equation (3) can be re-written as $U_t = (I - D + A) U_k = G U_k$, where $G$ is the transformation matrix between the two states, as defined in (1), and $I$ is an identity matrix.
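To make the algebra concrete, here is a minimal NumPy check on a toy affinity matrix (the sizes and random values are ours, purely for illustration) that one diffusion step equals applying the linear map $G = I - D + A$:

```python
import numpy as np

rng = np.random.default_rng(0)
n2 = 6                                   # toy vectorized property map with 6 "pixels"
A = rng.uniform(0.0, 0.2, size=(n2, n2)) # toy affinity matrix
np.fill_diagonal(A, 0.0)
D = np.diag(A.sum(axis=1))               # diagonal degree matrix
L = D - A                                # Laplacian
U_k = rng.standard_normal((n2, 1))       # property map of the key-frame

G = np.eye(n2) - D + A                   # transformation matrix, Eq. (3) rewritten
U_t = G @ U_k
assert np.allclose(U_t, U_k - L @ U_k)   # identical to one explicit diffusion step
```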

Linear propagation network. With a propagation structure, the diffusion between frames can be implemented as a linear propagation along the rows or columns of an image. Here we briefly show their equivalence. Following [7], we take the left-to-right spatial propagation operation as an example:

$$y_i = (I - d_i)\, x_i + w_i\, y_{i-1}, \quad i \in [2, n], \qquad (4)$$

where $x \in U_k$ and $y \in U_t$, and the $n \times 1$ vectors $\{x_i, y_i\}$ represent the $i$th columns before and after propagation, with the initial condition $y_1 = x_1$; $w_i$ is the spatially varying $n \times n$ sub-matrix. Here, $I$ is the identity matrix and $d_i$ is a diagonal matrix whose $t$th element is the sum of the elements of the $t$th row of $w_i$, i.e., $d_i(t, t) = \sum_{j=1, j \neq t}^{n} w_i(j, t)$. Similar to [7], through (a) expanding its recursive term, and (b) concatenating all the rows/columns as a vectorized map, it is easy to prove that (4) is equivalent to the global transformation $G$ between $U_k$ and $U_t$, where each entry is the multiplication of several spatially variant $w_i$ matrices [7]. Essentially, instead of predicting all the entries in $G$ as independent variables, the propagation structure transfers the problem into learning each sub-matrix $w_i$ in (4), which significantly reduces the output dimensions.

Learning the sub-matrices $\{w_i\}$. We adopt an independent deep CNN, namely the guidance network, to output all the sub-matrices $w_i$. Note that the propagation in (1) is carried out for $d = 4$ directions independently, as shown in Fig. 2. For each direction, the network takes a pair of images $\{V_k, V_t\}$ as its input, and outputs a feature map $P$ that has the same spatial size as $U$ (see Fig. 2). Each pixel $p_{i,j}$ in the feature map contains all the values of the $j$th row of $w_i$, which describes a local relation between adjacent columns, but results in a global connection in $G$ through the propagation structure. Similar to [7], we keep only $k = 3$ nearest neighbors from the previous column, which results in $w_i$ being a tridiagonal matrix. Thus, a total of $n \times n \times (k \times d)$ parameters are used to implement the transformation matrix $G$. Such a structure significantly compresses the guidance network while still ensuring that the corresponding $G$ is a dense matrix that can describe global and dense pairwise relationships between a pair of frames.
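The following sketch illustrates one left-to-right pass of Eq. (4) with $k = 3$ connections per pixel; the array layout of the guidance output p and the degree normalization are our own assumptions made for a self-contained example, not the authors' exact implementation:

```python
import numpy as np

def propagate_left_to_right(x, p):
    """One direction of the linear propagation in Eq. (4), a minimal sketch.

    x : (n, n) property map of the key-frame (propagated column by column).
    p : (n, n, 3) guidance-network output; p[j, i] holds the weights connecting
        pixel j of column i to pixels j-1, j, j+1 of the previous column i-1.
    """
    n = x.shape[0]
    y = x.astype(np.float64).copy()             # initial condition y_1 = x_1
    for i in range(1, n):
        w = np.zeros((n, n))                    # tridiagonal sub-matrix w_i
        for j in range(n):
            for o, dj in enumerate((-1, 0, 1)): # k = 3 nearest neighbors in the previous column
                if 0 <= j + dj < n:
                    w[j, j + dj] = p[j, i, o]
        d = np.diag(w.sum(axis=1))              # degree term (one common normalization choice)
        y[:, i] = (np.eye(n) - d) @ x[:, i] + w @ y[:, i - 1]
    return y

# usage: weights in [0, 1/3] keep the toy recurrence stable
rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal((n, n))
p = rng.uniform(0.0, 1.0 / 3.0, size=(n, n, 3))
y = propagate_left_to_right(x, p)
```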

3.2 Preserving Consistency via Switchable TPN

In this part, we show that there are two unique characteristics of propagation in the temporal domain, which do not exist for propagation in the spatial domain [7]. First, temporal propagation is bi-directional for two frames, i.e., a network capable of transforming a frame $U_1$ into a frame $U_2$ should also be able to transform $U_2$ into $U_1$, with a correspondingly reversed ordering of inputs to the guidance network. Second, during propagation, the overall “style” of the propagated property across the image should stay constant between frames, e.g., during color propagation, the color saturation of all frames within a short video clip should be similar. We call this the “consistency property”. As shown below, we prove that enforcing the bi-directionality and the consistency is equivalent to ensuring that the transformation matrix $G$ is orthogonal, which in turn can be easily implemented by equipping an ordinary temporal propagation network with a switchable structure.

Bi-directionality of TPN. We assume that properties in nearby video frames do not have a causal relationship. This assumption holds for most properties that naturally exist in the real world, e.g., color and HDR. Hence, temporal propagation of these properties can often be switched in direction without breaking the process. Given a diffusion model $G$ and a pair of frames $\{U_1, U_2\}$, we have a pair of equations:

$$U_2 = G_{1 \to 2}\, U_1, \qquad U_1 = G_{2 \to 1}\, U_2, \qquad (5)$$

where the arrow denotes the propagation direction. The bi-directionality property implies that reversing the roles of the two frames as inputs, i.e., $\{V_1, V_2\} \to \{V_2, V_1\}$, together with the corresponding supervision signals to the network, corresponds to applying the inverse transformation matrix $G_{2 \to 1} = G_{1 \to 2}^{-1}$.

Style preservation in sequences. Style consistency refers to whether the generated frames maintain similar chromatic properties or brightness when propagating color or HDR information, which is important for producing high-quality videos without the property vanishing over time. In this work, we ensure such global temporal consistency by minimizing the difference in the style, measured by a style loss, of the propagated property between the two frames. Style losses have been used intensively in style transfer [23], but have not yet been used for regularizing temporal propagation. In our work, we represent the style by the Gram matrix, which is proportional to the un-centered covariance of the property map. The style loss is the squared Frobenius norm of the difference between the Gram matrices of the key-frame and the succeeding frame:

Theorem 1. By regularizing the style loss we have the following optimization w.r.t. the guidance network:

$$\min \; \frac{1}{N} \left\| U_1^{\top} U_1 - U_2^{\top} U_2 \right\|_F^2 \qquad (6)$$

$$\text{s.t.} \quad U_2 = G U_1. \qquad (7)$$

The optimal solution is reached when G is orthogonal.

Proof. Since the function (6) is non-negative, the minimum is reached when $U_1^{\top} U_1 = U_2^{\top} U_2$. Combining this with (7), we have $G^{\top} G = I$.
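A quick numerical sanity check of Theorem 1 (our own toy example, not from the paper): an orthogonal $G$ leaves the Gram matrix, and hence the style energy, unchanged, and its transpose inverts the propagation.

```python
import numpy as np

rng = np.random.default_rng(1)
n2, c = 8, 2                                        # 8 "pixels", 2 property channels
U1 = rng.standard_normal((n2, c))                   # vectorized property map of frame 1
G, _ = np.linalg.qr(rng.standard_normal((n2, n2)))  # a random orthogonal matrix
U2 = G @ U1                                         # propagate: U2 = G U1

assert np.allclose(U1.T @ U1, U2.T @ U2)            # style energy (Gram matrix) preserved
assert np.allclose(G.T @ U2, U1)                    # G^T = G^{-1} propagates back
```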



Given that $G$ is orthogonal, $G_{2 \to 1}$ in (5) can be replaced by $G_{1 \to 2}^{\top}$, which equals $G_{1 \to 2}^{-1}$. Therefore, bi-directional propagation can be represented via a pair of transformation matrices that are transposes of each other. In the following part, we show how to enforce this property for the transformation matrix $G$ in the linear propagation network via a special network architecture. Note that in our implementation, even though we use the channel-wise propagation described in Section 3.1, where $U^{\top} U$ actually reduces to an un-centered variance, the conclusions of Theorem 1 still hold.

A switchable propagation network. The linear transformation matrix $G$ has an important property: since the propagation is directed, $G$ is a triangular matrix. Consider the two directions along the horizontal axis (i.e., $\rightarrow$, $\leftarrow$) in Fig. 2: the transformation matrix that is lower-triangular for propagation in the left-to-right direction becomes upper-triangular for the opposite direction of propagation. Suppose $P_{\rightarrow}$ and $P_{\leftarrow}$ are the output maps of the guidance network w.r.t. these two opposite directions. Since the upper-triangular matrix (a) corresponds to propagating in the right-to-left direction, and (b) contains the same set of weight sub-matrices, switching the CNN output channels w.r.t. the opposite directions, $P_{\rightarrow}$ and $P_{\leftarrow}$, is equivalent to transposing the transformation matrix $G$ in the TPN. This fact is exploited as a regularization structure (see the red dashed box in Fig. 2) during training.
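The triangular structure itself is easy to verify numerically. In the 1-D toy sketch below (our own simplification, using scalar per-column weights rather than the $n \times n$ sub-matrices $w_i$), stacking the impulse responses of the left-to-right recurrence yields a lower-triangular $G$; the right-to-left direction analogously yields an upper-triangular one.

```python
import numpy as np

n = 6
rng = np.random.default_rng(2)
w = rng.uniform(0.0, 1.0, size=n)            # toy scalar stand-ins for the weights w_i

def propagate_lr(x):
    """Left-to-right recurrence y_i = (1 - w_i) x_i + w_i y_{i-1}, with y_1 = x_1."""
    y = np.empty(n)
    y[0] = x[0]
    for i in range(1, n):
        y[i] = (1.0 - w[i]) * x[i] + w[i] * y[i - 1]
    return y

# columns of G are the responses to unit impulses, since the recurrence is linear
G = np.stack([propagate_lr(e) for e in np.eye(n)], axis=1)
assert np.allclose(np.triu(G, 1), 0.0)       # strictly upper part is zero: G is lower triangular
```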

To summarize, the switchable structure of the TPN is derived from the two principles for temporal propagation (i.e., bi-directionality and style consistency) and the fact that the matrix $G$ is triangular due to the specific form of propagation. Note that [7] does not address the triangular structure of the matrix and is thus limited to propagation in the spatial domain only. We show that the switchable TPN (STPN) largely improves performance over the basic TPN, with no computational overhead at inference time.

4 Network Implementation

We provide the network implementation details shared by color, HDR and segmentation mask propagation, which are proposed in this work. These settings can potentially be generalized to other properties of videos as well.

The basic TPN. The basic TPN contains two separate branches: (a) a deep CNN for the guidance network, which takes as input the provided information $\{V_1, V_2\}$ from a pair of frames, and outputs all the elements $P$ that constitute the state transformation matrix $G$; and (b) a linear propagation module that takes the property map of one frame, $U_1$, and outputs $U_2$. It also takes as input the propagation coefficients $\{P\}$ following the formulation of (4), where $\{P\}$ contains $k \times d$ channels ($k = 3$ connections for each pixel per direction, and $d = 4$ directions in total). $\{V, U, P\}$ have the same spatial size according to (4). We use node-wise max-pooling [24, 7] to integrate the hidden layers and to obtain the final propagation result.



Fig. 3. Two groups of color transitions output by the basic TPN. For each group, the left side shows key-frames with the ground-truth color images provided, and the right side shows new frames propagated from the left. $\{a_k, b_k\}$ and $\{a_{k+\tau}, b_{k+\tau}\}$ are the inputs and outputs of the TPN. All four examples show obvious appearance transitions caused by the movement of objects. Zoom in to see details.

Table 1. Run-times of different methods. We set K = 30 for VPN color propagation [4] to calculate its run-time. The last four columns are our methods.

Method       VPN [4] (color)   VPN [4] (seg)   HDRCNN [5]   color   HDR   SEG(t)   SEG(t+s)
Time (ms)         730               750            365        15     25     17        84

All submodules are differentiable and jointly trained using stochastic gradient descent (SGD) with a base learning rate of $10^{-5}$.

The switchable TPN. Fig. 2 shows how the switchable structure of the TPN is exploited as an additional regularization loss term during training. For each pair $(U_1, U_2)$ of the training data, the first term in (8) is the regular supervised loss between the network estimate $\hat{U}_2$ and the ground truth $U_2$. In addition, as shown in Fig. 2(b), since we want to enforce the bi-directionality and the style consistency in the switchable TPN, the same network should be able to propagate from $U_2$ back to $U_1$ by simply switching the channels of the output of the guidance networks, i.e., switching the channels of $\{P_{\rightarrow}, P_{\leftarrow}\}$ and $\{P_{\downarrow}, P_{\uparrow}\}$ for propagating information in the opposite direction. This forms the second loss term in (8), which serves as a regularization (weighted by $\lambda$) during training. We set $\lambda = 0.1$ for all the experiments in this paper.

$$\mathcal{L}(U_1, \hat{U}_1, U_2, \hat{U}_2) = \left\| \hat{U}_2(i) - U_2(i) \right\|^2 + \lambda \left\| \hat{U}_1(i) - U_1(i) \right\|^2. \qquad (8)$$

At inference time, the switchable TPN reduces to the basic TPN introduced in Section 3.1 and therefore does not have any extra computational expense.
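The training objective of Eq. (8) can be sketched as below; here `propagate` stands in for the linear propagation module and the direction keys of `P` are hypothetical names, since the paper does not specify an API:

```python
import torch

def switchable_tpn_loss(U1, U2, P, propagate, lam=0.1):
    """Bi-directional loss of Eq. (8): supervised term plus switched regularizer (a sketch)."""
    # forward pair: propagate the key-frame property U1 to the nearby frame
    U2_hat = propagate(U1, P)
    # reversed pair: switch the channels of opposite directions (left/right, up/down)
    P_switched = {'right': P['left'], 'left': P['right'],
                  'down': P['up'], 'up': P['down']}
    U1_hat = propagate(U2, P_switched)
    return torch.mean((U2_hat - U2) ** 2) + lam * torch.mean((U1_hat - U1) ** 2)
```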

5 Experimental Results

In this section, we present our experimental results for propagating color channels, HDR images, and segmentation masks across videos. We note that propagating information across relatively long temporal intervals may not satisfy the assumptions of a diffusion model, especially when new objects or scenes appear.



Fig. 4. An example of color propagation from a key-frame (a) to a new frame with considerably large appearance transitions, using either (b) the basic TPN or (c) the switchable TPN. The closeups show the detailed comparison. Zoom in to see details.

Hence, for color and HDR propagation, instead of considering such complex scenarios, we set “key-frames” at regular fixed intervals for both tasks. That is, the ground-truth color or HDR information is provided for every K frames and propagated to all frames in between them. This is a practical strategy for real-world applications. Note that for video segmentation mask propagation, we still follow the protocol of the DAVIS dataset [25] and only use the mask from the first frame.

5.1 General Network Settings and Run-times

We use a guidance network and a propagation module similar to [7], with two cascaded propagation units. For computational and memory efficiency, the propagation is carried out at a lower resolution: $U$ is downsampled from the original input space to a hidden layer before being fed into the propagation module. The hidden layer is then bi-linearly upsampled to the original size of the image. We adopt a symmetric, U-Net shaped, light-weight deep CNN with skip links for all tasks, but with slightly different numbers of layers to accommodate the different input resolutions (see Fig. 2 as an example for color propagation). We first pre-train the model on synthesized frame pairs generated from an image dataset (e.g., the MS-COCO dataset [26] for color and segmentation propagation, and a self-collected one for HDR propagation; see the supplementary material). Given an image, we augment it in two different ways via a similarity transform with parameters uniformly sampled from $s \in [0.9, 1.1]$, $\theta \in [-15°, 15°]$ and $dx \in [-0.1, 0.1] \times b$, where $b = \min(H, W)$. We also apply this data augmentation while training with patches from video sequences. We report the run-times of different methods on a $512 \times 512$ image using a single TITAN X (Pascal) NVIDIA GPU (without cuDNN) in Table 1.
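The augmentation above can be sampled as in the following sketch; the 2 × 3 affine form and the independent vertical shift dy are our additions for a runnable example (the text only states dx explicitly):

```python
import numpy as np

def sample_similarity_transform(H, W, rng=np.random.default_rng()):
    """Sample a similarity transform with the parameter ranges stated above (a sketch)."""
    b = min(H, W)
    s = rng.uniform(0.9, 1.1)                      # scale
    theta = np.deg2rad(rng.uniform(-15.0, 15.0))   # rotation
    dx = rng.uniform(-0.1, 0.1) * b                # translation, proportional to min(H, W)
    dy = rng.uniform(-0.1, 0.1) * b                # assumed vertical counterpart of dx
    # 2x3 matrix usable with common warping routines (e.g., cv2.warpAffine)
    return np.array([[s * np.cos(theta), -s * np.sin(theta), dx],
                     [s * np.sin(theta),  s * np.cos(theta), dy]])
```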



5.2 Color Propagation in Videos

We use the ACT dataset [27], which contains 7260 training sequences, about 600K frames in total, covering various categories of actions. All the sequences are short with small camera or scene transitions, and thus are more suitable for the proposed task. We re-train and evaluate the VPN network on the ACT dataset for a fair comparison. The original testing set contains 3974 sequences with more than 300K frames. For faster processing, we randomly select five videos from every action category in order to maintain the prior distribution of the original ACT dataset; we use one for testing and the remaining four for training.

We perform all computations in the CIE-Lab color space. After pre-training on the MS-COCO dataset, we fine-tune the models on the ACT dataset by randomly selecting two frames from a sequence and cropping both frames at the same spatial location to form a single training sample. Specifically, our TPN takes as input the concatenated ab channels, randomly cropped to 256 × 256, from a key-frame. The patches are then transformed to 64 × 64 × 32 via 2 convolutional layers with stride 2 before being input to the propagation module. After propagation, the output maps are upsampled to a transformed ab image map for the frames following the key-frame. The guidance CNN takes as input a pair of lightness images (L) for the two frames. We optimize the Euclidean loss (in the ab color space) between the ground truth and the propagated color channels generated by our network. Note that for the switchable TPN, we have two losses with different weights according to (8). During testing, we combine the estimated ab channels with the given L channel to generate a color RGB image. All our evaluation metrics are computed in the RGB color space.
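At test time, the final color frame is assembled from the given L channel and the propagated ab channels, e.g. along the lines of the sketch below (using scikit-image for the color-space conversion; the helper name is ours):

```python
import numpy as np
from skimage import color

def assemble_rgb(L, ab_propagated):
    """Combine the given lightness with propagated ab channels and convert to RGB (a sketch).

    L             : (H, W) lightness channel of the target frame (CIE-Lab L, roughly [0, 100]).
    ab_propagated : (H, W, 2) ab channels produced by the TPN for this frame.
    """
    lab = np.concatenate([L[..., None], ab_propagated], axis=-1)
    return color.lab2rgb(lab)    # metrics (RMSE, PSNR) are then computed in RGB
```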

Table 2. RMSE and PSNR for video color propagation on the ACT dataset for different key-frame intervals K. VPN is compared at K = 30.

                          RMSE                          PSNR
Method               K=10   K=20   K=30   K=40     K=10    K=20    K=30    K=40
BTPNim+BTPNvd        4.43   5.46   6.04   6.44     36.65   35.22   34.46   33.96
BTPNim+STPNvd        4.00   5.00   5.58   6.01     37.63   36.09   35.26   34.70
STPNim+STPNvd        3.98   4.97   5.55   5.99     37.64   36.12   35.29   34.73
VPN (stage-1) [4]     -      -     6.86    -         -       -     32.86    -

We compare three combinations of models. We refer to the basic and switchable TPN networks as BTPN and STPN; the suffixes “im” and “vd” denote pre-training on MS-COCO and fine-tuning on ACT, respectively. The compared methods are: (a) BTPNim+BTPNvd, (b) BTPNim+STPNvd, and (c) STPNim+STPNvd; and we evaluate different key-frame intervals, namely K = {10, 20, 30, 40}. The quantitative results for root mean square error (RMSE) and peak signal-to-noise ratio (PSNR) are presented in Table 2. Two trends can be inferred from the results. First, the switchable TPN consistently outperforms the basic TPN and the VPN [4], and using the switchable TPN structure for both the pre-training and fine-tuning stages generates the best results.



Fig. 5. Results of propagating color from a key-frame to two subsequent frames (the 18th and the 25th) at different time intervals with the basic/switchable TPN and the VPN [4] models. Zoom in to see details.

Second, while the errors decrease drastically on reducing the time interval between adjacent key-frames, the colorized video maintains overall high quality even when K is set close to a common frame rate (e.g., 25 to 30 fps). We also show in Fig. 4 (b) and (c) that the switchable structure significantly improves the qualitative results by preserving the saturation of color, especially when there are large transitions between the generated images and their corresponding key-frames. The TPN also maintains good colorization for fairly long video sequences, which is evident from a comparison of the colorized video frames with the ground truth in Fig. 5. Over longer time intervals, the quality of the switchable TPN degrades much more gracefully than that of the basic TPN and the VPN [4].

5.3 HDR Propagation in Videos

We compare our method against the work of [5], which directly reconstructs HDR frames given the corresponding LDR frames as inputs. While this is not an apples-to-apples comparison because we also use an HDR key-frame as input, [5] is the closest related state-of-the-art method to our approach for HDR reconstruction. To our knowledge, no prior work exists on propagating HDR information in videos using deep learning, and ours is the first work to address this problem. We use a network architecture similar to that for color propagation, except that $U$ is transformed to 128 × 128 × 16 via one convolution layer to preserve more image details, and a two-stage training procedure: we first pre-train the network with randomly augmented pairs of patches created from a dataset of HDR images, and then fine-tune on an HDR video dataset. We collect the majority of the publicly available HDR image and video datasets listed in the supplementary material, and utilize all the HDR images and every 10th frame of the HDR videos for training in the first stage [5]. Except for the four videos (the same as [5]) that we use for testing, we train our TPN with all the collected videos. We evaluate our method on the four videos that [5] used for testing and compare against their method.

To deal with the long-tail, skewed distribution of pixel values in HDR images, similarly to [5], we train in the logarithmic domain, $U = \log(H + \epsilon)$, where $H$ denotes an HDR image and $\epsilon$ is set to 0.01.



Fig. 6. Results of HDR video propagation. We show one HDR frame (τ = 19 frames away from the key-frame) reconstructed with our switchable TPN (middle column). The top row shows the ground-truth HDR, and the bottom row shows the output of HDRCNN [5]. The HDR images are displayed with two popular tone mapping algorithms, Drago03 [30] and Reinhard05 [31]. The insets show that the switchable TPN can effectively propagate the HDR information to new frames and preserve the dynamic range of scene details. Zoom in to see details.

Table 3. RMSE for video HDR propagation, for the TPN output with and without LDR blending, and for different key-frame intervals K. Reconstruction from a single LDR [5] is compared under the same experimental settings.

                      HDR with blending                 HDR without blending
Method             K=10    K=20    K=30    K=40      K=10    K=20    K=30    K=40
BTPNim+BTPNvd      0.031   0.034   0.038   0.042     0.119   0.160   0.216   0.244
BTPNim+STPNvd      0.028   0.031   0.034   0.038     0.096   0.115   0.146   0.156
STPNim+STPNvd      0.027   0.030   0.034   0.037     0.098   0.121   0.142   0.159
HDRCNN [5]         0.038                             0.480

Since the image irradiance values recorded in HDR images vary significantly across cameras, naively merging different datasets often introduces domain differences in the training samples. To resolve this issue, before merging the datasets acquired by different cameras, we subtract from each input image the mean value of its corresponding dataset. We use the same data augmentation as in [5], varying exposure values and camera curves [29], during training. During testing, we follow [5] and blend the inverse HDR image created from the input LDR image with the HDR image predicted by our TPN network to obtain the final output HDR image. More details are presented in the supplementary material.
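A sketch of the log-domain preprocessing described above (the helper names and the per-dataset mean argument are ours):

```python
import numpy as np

EPS = 0.01

def hdr_to_log(H, dataset_mean=0.0):
    """Map HDR radiance to the training domain: U = log(H + eps), minus the dataset mean."""
    return np.log(H + EPS) - dataset_mean

def log_to_hdr(U, dataset_mean=0.0):
    """Invert the mapping after propagation to recover radiance values."""
    return np.exp(U + dataset_mean) - EPS
```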

We compare the RMSE of the generated HDR frames for different key-frame intervals, with and without blending the LDR information with the HDR image generated by the TPN, in Table 3. Our results indicate that the switchable TPN also significantly improves the results for HDR propagation compared to the basic TPN. We also compare with the frame-wise reconstruction method [5], with and without the blending-based post-processing, in Fig. 6. As shown, our TPN recovers HDR images up to K = 30 frames away from each key-frame. The reconstructed HDR images preserve the same scene details as the ground truth under different tone mapping algorithms. More results are presented in the supplementary material. As noted earlier, since we have additional HDR key-frames as input, it is not an apples-to-apples comparison with single-image based HDR methods like [5].



Table 4. Comparisons for video segmentation on the DAVIS dataset.

                      J-mean                                      F-mean
VPN [4]   OSVOS [32]   SEG(t)   SEG(t+s)        VPN [4]   OSVOS [32]   SEG(t)   SEG(t+s)
 70.2        79.8       71.1     76.19            65.5       80.6       75.65     73.53

Nevertheless, the results in Fig. 6 show the feasibility of using sparsely-sampled HDR key-frames to reconstruct HDR videos from LDR videos with the proposed TPN approach.

5.4 Segmentation Mask Propagation in Videos

In addition, we conduct video segmentation on the DAVIS dataset [25] with the same settings as VPN [4], to show that the proposed method can also generalize to semantic-level propagation in videos. We note that maintaining style consistency does not apply to semantic segmentation. For each frame to be predicted, we use the segmentation mask of the first frame as the only key-frame, while using the corresponding RGB images as the input to the guidance network. We train two versions of the basic TPN network for this task: (a) a basic TPN with the input/output resolution reduced to 256 × 256 and $U$ transformed to 64 × 64 × 16, in the same manner as the color propagation model. We use the same guidance network architecture as [7], while removing the last convolution unit to fit the dimensions of the propagation module. This model, denoted as SEG(t) in Table 1, is much more efficient than the majority of recent video segmentation methods [4, 25, 32]. (b) A more accurate model with an SPN [7] refinement applied to the output of the basic TPN, denoted as SEG(t+s). This model utilizes the same architecture as [7], except that it replaces the loss with a sigmoid cross-entropy loss for the per-pixel classification task. Similar to color and HDR propagation, we pre-train (a) on the MS-COCO dataset and then fine-tune it on the DAVIS training set. For the SPN model in (b), we first train it on the VOC image segmentation task as described in [7]. We treat each class in an image as a binary mask (see the sketch below) in order to transfer the original model to a two-class classification model, while replacing the corresponding loss module. We then fine-tune the SPN on the coarse masks from the DAVIS training set, which are produced by an intermediate model, i.e., the pre-trained version of (a) from the MS-COCO dataset. More details are provided in the supplementary material.
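The two-class conversion mentioned above can be sketched as follows (the helper and the ignore label of 255 are our assumptions for PASCAL VOC-style annotations):

```python
import numpy as np

def to_binary_masks(label_map, ignore_label=255):
    """Split a multi-class segmentation map into per-class binary masks (a sketch).

    Each foreground class becomes an independent two-class (foreground/background)
    target, matching the two-class SPN refinement training described above.
    """
    classes = [c for c in np.unique(label_map) if c not in (0, ignore_label)]
    return {int(c): (label_map == c).astype(np.float32) for c in classes}
```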

We compare our method to VPN [4] and one recent state-of-the-art method [32]. Both VPN and our method rely purely on propagation from the first frame and do not utilize any pre-trained image segmentation modules (in contrast to [32]). Similar to the other two tasks, both of our models significantly outperform VPN [4] for video segmentation propagation (see Table 4), while running one order of magnitude faster (see Table 1). The SEG(t+s) model performs comparably to the OSVOS [32] method, which utilizes a pre-trained image segmentation model and requires a much longer inference time (7800 ms).



References

1. Gadde, R., Jampani, V., Gehler, P.V.: Semantic video CNNs through representation warping. In: Proceedings of IEEE International Conference on Computer Vision (ICCV). (2017)

2. He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 35(6) (2013) 1397–1409

3. Levin, A., Lischinski, D., Weiss, Y.: Colorization using optimization. In: ACM Transactions on Graphics (TOG). (2004)

4. Jampani, V., Gadde, R., Gehler, P.: Video propagation networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)

5. Eilertsen, G., Kronander, J., Denes, G., Mantiuk, R., Unger, J.: HDR image reconstruction from a single exposure using deep CNNs. In: ACM Transactions on Graphics (SIGGRAPH Asia). (2017)

6. Levin, A., Lischinski, D., Weiss, Y.: A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 30(2) (2008)

7. Liu, S., Mello, S.D., Gu, J., Zhong, G., Yang, M., Kautz, J.: Learning affinity via spatial propagation networks. In: Neural Information Processing Systems (NIPS). (2017)

8. Jampani, V., Kiefel, M., Gehler, P.V.: Learning sparse high dimensional filters: Image filtering, dense CRFs and bilateral neural networks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)

9. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Proceedings of European Conference on Computer Vision (ECCV). (2016)

10. Zhang, R., Zhu, J.Y., Isola, P., Geng, X., Lin, A.S., Yu, T., Efros, A.A.: Real-time user-guided image colorization with learned deep priors. In: ACM Transactions on Graphics (SIGGRAPH). (2017)

11. Debevec, P., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: ACM Transactions on Graphics (SIGGRAPH). (1997)

12. Reinhard, E., Heidrich, W., Debevec, P., Pattanaik, S., Ward, G., Myszkowski, K.: High Dynamic Range Imaging: Acquisition, Display, and Image-based Lighting. Morgan Kaufmann (2010)

13. Hu, J., Gallo, O., Pulli, K., Sun, X.: HDR deghosting: How to deal with saturation? In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2013)

14. Oh, T., Lee, J., Tai, Y., Kweon, I.: Robust high dynamic range imaging by rank minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 37(6) (2015) 1219–1232

15. Gallo, O., Troccoli, A., Hu, J., Pulli, K., Kautz, J.: Locally non-rigid registration for mobile HDR photography. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)

16. Kang, S., Uyttendaele, M., Winder, S., Szeliski, R.: High dynamic range video. In: ACM Transactions on Graphics (SIGGRAPH). (2003)

17. Kalantari, N., Shechtman, E., Barnes, C., Darabi, S., Goldman, D., Sen, P.: Patch-based high dynamic range video. In: ACM Transactions on Graphics (SIGGRAPH). (2013)

18. Tocci, M., Kiser, C., Tocci, N., Sen, P.: A versatile HDR video production system. In: ACM Transactions on Graphics (SIGGRAPH). (2011)

19. Nayar, S., Mitsunaga, T.: High dynamic range imaging: Spatially varying pixel exposure. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2000)

20. Gu, J., Hitomi, Y., Mitsunaga, T., Nayar, S.: Coded rolling shutter photography: Flexible space-time sampling. In: Proceedings of IEEE International Conference on Computational Photography (ICCP). (2010)

21. Kalantari, N., Ramamoorthi, R.: Deep high dynamic range imaging of dynamic scenes. In: ACM Transactions on Graphics (SIGGRAPH). (2017)

22. Zhang, J., Lalonde, J.: Learning high dynamic range from outdoor panoramas. In: Proceedings of IEEE International Conference on Computer Vision (ICCV). (2017)

23. Gatys, L.A., Ecker, A.S., Bethge, M.: A neural algorithm of artistic style. CoRR abs/1508.06576 (2015)

24. Liu, S., Pan, J., Yang, M.H.: Learning recursive filters for low-level vision via a hybrid neural network. In: Proceedings of European Conference on Computer Vision (ECCV). (2016)

25. Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)

26. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of European Conference on Computer Vision (ECCV). (2014) 740–755

27. Wang, X., Farhadi, A., Gupta, A.: Actions ∼ transformations. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)

28. Kalantari, N.K., Shechtman, E., Barnes, C., Darabi, S., Goldman, D.B., Sen, P.: Patch-based high dynamic range video. ACM Transactions on Graphics 32 (2013)

29. Grossberg, M.D., Nayar, S.K.: Modeling the space of camera response functions. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 26(10) (2004) 1272–1282

30. Drago, F., Myszkowski, K., Annen, T., Chiba, N.: Adaptive logarithmic mapping for displaying high contrast scenes. Computer Graphics Forum 22(3) (2003) 419–426

31. Reinhard, E., Devlin, K.: Dynamic range reduction inspired by photoreceptor physiology. IEEE Transactions on Visualization and Computer Graphics 11(1) (2005) 13–24

32. Caelles, S., Maninis, K.K., Pont-Tuset, J., Leal-Taixe, L., Cremers, D., Van Gool, L.: One-shot video object segmentation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)