faaich001@, agupt013@, rpand002@, rhyde001@, sasif@ece., … · 2020. 4. 21. · Non-Adversarial...

Non-Adversarial Video Synthesis with Learned Priors

Abhishek Aich†,∗, Akash Gupta†,*, Rameswar Panda‡, Rakib Hyder†,M. Salman Asif†, Amit K. Roy-Chowdhury†

University of California, Riverside†, IBM Research AI, Cambridge‡

{aaich001@, agupt013@, rpand002@, rhyde001@, sasif@ece., amitrc@ece.}ucr.edu

Abstract

Most of the existing works in video synthesis focus ongenerating videos using adversarial learning. Despite theirsuccess, these methods often require input reference frameor fail to generate diverse videos from the given data dis-tribution, with little to no uniformity in the quality of videosthat can be generated. Different from these methods, we fo-cus on the problem of generating videos from latent noisevectors, without any reference input frames. To this end,we develop a novel approach that jointly optimizes the in-put latent space, the weights of a recurrent neural networkand a generator through non-adversarial learning. Opti-mizing for the input latent space along with the networkweights allows us to generate videos in a controlled envi-ronment, i.e., we can faithfully generate all videos the modelhas seen during the learning process as well as new unseenvideos. Extensive experiments on three challenging and di-verse datasets well demonstrate that our proposed approachgenerates superior quality videos compared to the existingstate-of-the-art methods.

1. IntroductionVideo synthesis is an open and challenging problem in

computer vision. As literature suggests, a deeper under-standing of spatio-temporal behavior of video frame se-quences can directly provide insights in choosing priors,future prediction, and feature learning [34, 36]. Muchprogress has been made in developing variety of ways togenerate videos which can be classified into broadly twocategories: class of video generation methods which requirerandom latent vectors without any reference input pixel[26, 33, 34], and class of video generation methods whichdo depend on reference input pixels [10, 30, 36]. Currentliterature contains methods mostly from the second classwhich often requires some human intervention [10, 36].

In general, Generative Adversarial Networks (GANs) [8]

*Joint first authors1Project page: https://abhishekaich27.github.io/navsynth.html

MoC

oGAN

(Adversaria

l)Ours

(Non

-Adversaria

l)

Figure 1: Comparison of proposed non-adversarialapproach to one representative adversarial approach(MoCoGAN [33]) on the Chair-CAD [2] dataset. Top:MoCoGAN often generates blurry frames including similartype of chairs for different videos as the time step increases.Bottom: Our approach1, on the other hand, generates rela-tively sharper frames, maintaining consistency with the typeof chairs unique to each video in the dataset.

have shown remarkable success in various kinds of videomodality problems [14, 17, 19, 26]. Initially, video gener-ation frameworks predominantly used GANs to synthesizevideos from latent noise vectors. For example, VGAN [34]and TGAN [26] proposed generative models that synthe-size videos from random latent vectors with deep convolu-tional GAN. Recently, MoCoGAN [33] proposed to decom-pose a video into content and motion parts using a generatorguided by two discriminators. During testing, these frame-works generate videos that are captured in the range of thetrained generator, by taking random latent vectors. Whileall these methods have obtained reasonable performance oncommonly used benchmark datasets, they utilize adversariallearning to train their models and hence, inherit the short-comings of GANs. Specifically, GANs are often very sen-sitive to multiple factors such as random network initial-ization, and type of layers employed to build the network[15, 27]. Some infamous drawbacks of GANs are mode-collapse (i.e., able to generate only some parts of the data

1

arX

iv:2

003.

0956

5v3

[ee

ss.I

V]

17

Apr

202

0

distribution: see Fig. 1 for an example) and/or vanishinggenerator gradients due to discriminator being way better indistinguishing fake samples and real samples [1].

Non-adversarial approaches [4, 11, 16] have recentlybeen explored to tackle these challenges. For example, Gen-erative Latent Optimization (GLO) [4] and Generative La-tent Nearest Neighbor (GLANN) [11] investigate the impor-tance of inductive bias in convolutional networks by discon-necting the discriminator for a non-adversarial learning pro-tocol of GANs. These works show that without a discrimi-nator, a generator can be learned to map the training imagesin the given data distribution to a lower dimensional latentspace that is learned in conjunction with the weights of thegenerative network. Such procedure not only avoids themode-collapse problem of the generators, but also providesthe user an optimized low dimensional latent representation(embedding) of the data in contrast with the random latentspace as in GANs. Recently Video-VAE [10] proposed touse Variational Auto-Encoder (VAE) for conditional videosynthesis, either by randomly generating or providing thefirst frame to the model for synthesizing a video. However,the quality of generated videos using Video-VAE often de-pends on the provided input frame. Non-adversarial videosynthesis without any visual inputs still remains as a noveland rarely addressed problem.

In this paper, we propose a novel non-adversarial frame-work to generate videos in a controllable manner withoutany reference frame. Specifically, we propose to synthe-size videos from two optimized latent spaces, one provid-ing control over the static portion of the video (static la-tent space) and the other over the transient portion of thevideo (transient latent space). We propose to jointly opti-mize these two spaces while optimizing the network (a gen-erative and a recurrent network) weights with the help ofregression-based reconstruction loss and a triplet loss.

Our approach works as follows. During training, wejointly optimize over network weights and latent spaces(both static and transient) and obtain a common transientlatent space, and individual static latent space dictionary forall videos sharing the same class (see Fig. 2). During test-ing, we randomly choose a static vector from the dictionary,concatenate it with the transient latent vector and generatea video. This enables us to obtain a controlled environmentof diverse video generation from learned latent vectors foreach video in the given dataset, while maintaining almostuniform quality. In addition, the proposed approach also al-lows a concise video data representation in form of learnedvectors, frame interpolation (using a low rank constraint in-troduced in [12]), and generation of videos unseen duringthe learning paradigm.

The key contributions of our work are as follows.

• We propose a novel framework for generating a widerange of diverse videos from learned latent vectors with-

out any conditional input reference frame with almostuniform visual quality. Our framework obtains a latentspace dictionary on both static and transient portions forthe training video dataset, which enables us to generateeven unseen videos with almost equal quality by provid-ing combinations of static and transient latent vectors thatwere not part of training data.

• Our extensive experiments on multiple datasets welldemonstrate that the proposed method, without the adver-sarial training protocol, has better or at par, performancewith current state-of-the-art methods [26, 33, 34]. More-over, we do not need to optimize the (multiple) discrimi-nator networks as in previous methods [26, 33, 34] whichoffers a computational advantage.

2. Related Works

Our work relates to two major research directions: videosynthesis and non-adversarial learning. This section focuseson some representative methods closely related to our work.

2.1. Video Synthesis

Video synthesis has been studied from multiple perspec-tives [10, 26, 30, 33, 34] (see Tab. 1 for a categorization ofexisting methods). VGAN [34] demonstrates that a videocan be divided into foreground and background using deepneural networks. TGAN [26] proposes to use a generator tocapture temporal dynamics by generating correlated latentcodes for each video frame and then using an image genera-tor to map each of these latent codes to a single frame for thewhole video. MoCoGAN [33] presents a simple approachto separate content and motion latent codes of a video us-ing adversarial learning. The most relevant work for us isVideo-VAE [10] that extends the idea of image generationto video generation using VAE by proposing a structuredlatent space in conjunction with the VAE architecture forvideo synthesis. While this method doesn’t require a dis-criminator network, it depends on reference input frame togenerate a video. In contrast, our method proposes a ef-ficient framework for synthesizing videos from learnablelatent vectors without any input frame. This gives a con-trolled environment for video synthesis that even enablesus to generate visually good quality unseen videos throughcombining static and transient parts.

2.2. Non-Adversarial Learning

Generative adversarial networks, as powerful as they arein pixel space synthesis, are also difficult to train. Thisis owing to the saddle-point based optimization game be-tween the generator and the discriminator. On top of thechallenges discussed in the previous section, GANs re-quire careful user driven configuration tuning which may

2

�<latexit sha1_base64="UU9fvSbrjkVazf53FSAlbxCS9PY=">AAAChHicbZHbattAEIZXapum7slpLnOz1Jg6EIzktKRXJaQ3uSghhToJWK5YrUbOkj2I3VGIK/Qkfave9W26ckwOdgcWfr5/ZnZ3JiulcBhFf4PwydNnG883X3Revnr95m13692ZM5XlMOZGGnuRMQdSaBijQAkXpQWmMgnn2dXX1j+/BuuE0T9wXsJUsZkWheAMPUq7v/sJwg1mRf2rSTlNnFA0UQwvOZP1STOI9jyaKZby3U4/yWAmdJ1534qbpnNXa5uf9SDebVKVJCt0dEevc4Nuzf/W+p6Bzu8bJ6aUlUu7vWgYLYKui3gpemQZp2n3T5IbXinQyCVzbhJHJU5rZlFwCb5t5aBk/IrNYOKlZgrctF4MsaF9T3JaGOuPRrqgDytqppybq8xntvNxq14L/+dNKiw+T2uhywpB89uLikpSNLTdCM2FBY5y7gXjVvi3Un7JLOPo99bxQ4hXv7wuzkbDeH84+v6xd3i0HMcm2SHvyYDE5IAckmNySsaEB0HwIYiCONwI98L98NNtahgsa7bJowi//AMReMJ/</latexit>

...<latexit sha1_base64="GJdAJSlN295GCsGY0regmevVeew=">AAACI3icbVDLSgMxFM34rPVVdekmWARXZariAzdFN66kilWhM5RM5lZDM5MhuVMtQz/Ghd/iwo0VNy78F9MHYtUDgcM595F7gkQKg6774UxMTk3PzObm8vMLi0vLhZXVK6NSzaHGlVT6JmAGpIihhgIl3CQaWBRIuA5aJ33/ug3aCBVfYicBP2K3sWgKztBKjcLRpofwgNkF8FRriLFLPY8OtTNINZNjAt4r3ermvXao0DQKRbfkDkD/kvKIFMkI1Uah54WKp5FdwyUzpl52E/QzplFwCXZsaiBhvMVuoW5pzCIwfjY4sks3rRLSptL2xUgH6s+OjEXGdKLAVkYM78xvry/+59VTbB74mYiTFCHmw0XNVFJUtJ8YDYUGjrJjCeNa2L9Sfsc042hzHWRw2Mfe98V/ydV2qbxT2jnfLVaOR2nkyDrZIFukTPZJhZySKqkRTh7JM3klPefJeXHenPdh6YQz6lkjY3A+vwALZaYh</latexit>

network<latexit sha1_base64="lxtYGTT5xG0wRWVQK58VvT7ecPk=">AAACDHicbZA9SwNBEIb3/IzxK2ppsxgDVuFixI9OtNAyglEhOcLeZqJL9vaO3Tk1HPkDFv4WCxsVW3tL/4175yFqfGHh5ZkZZuf1IykMuu6HMzY+MTk1XZgpzs7NLyyWlpbPTBhrDk0eylBf+MyAFAqaKFDCRaSBBb6Ec79/mNbPr0EbEapTHETgBexSiZ7gDC3qlNYrbYRbTI5AgbbsGoa0mCHERAHehLo/7JTKbtXNREdNLTdlkqvRKb23uyGPA1DIJTOmVXMj9BKmUXAJw2I7NhAx3meX0LJWsQCMl2TXDGnFki7thdo+hTSjPycSFhgzCHzbGTC8Mn9rKfyv1oqxt+slQkUxguJfi3qxpBjSNBraFRo4yoE1jGth/0r5FdOMow0wy2Av1fb3xaPmbLNaq1frJ1vl/YM8jQJZJWtkg9TIDtknx6RBmoSTO/JAnsizc+88Oi/O61frmJPPrJBfct4+AS3SnLc=</latexit>

Generative<latexit sha1_base64="8JZ26r24EenHX0oIQ+TptEMukHU=">AAACDnicbZA9SwNBEIb3/IzxK2ppsxgUq3BR8aMLWmgZwSRCEsLeZhKX7O0du3NqOO4fWPhbLGxUbG0t/TduLlHU+MLCyzMzzM7rhVIYdN0PZ2JyanpmNjOXnV9YXFrOraxWTRBpDhUeyEBfesyAFAoqKFDCZaiB+Z6Emtc7GdRr16CNCNQF9kNo+qyrREdwhha1clsNhFtEjE9BgbbwGhKa3fyiCvAm0L2klcu7BTcVHTfFkcmTkcqt3HujHfDIB4VcMmPqRTfEZsw0Ci4hyTYiAyHjPdaFurWK+WCacXpPQjctadNOoO1TSFP6cyJmvjF937OdPsMr87c2gP/V6hF2DpuxUGGEoPhwUSeSFAM6CIe2hQaOsm8N41rYv1J+xTTjaCNMMzgaaP/74nFT3SkUdwu753v50vEojQxZJxtkmxTJASmRM1ImFcLJHXkgT+TZuXcenRfnddg64Yxm1sgvOW+f/Iedsw==</latexit>

Recurrent<latexit sha1_base64="gsYJDZiC5pXBMjZvtdnq+NuewPE=">AAACInicbZDLTgIxFIY7eEO8oS7dNBISV2QQ421FdKFLNYIkQEinHKCh05m0Z1Ay4V1c+Cwu3KhxZeLDWC4aFf+kyZ/vnJPT83uhFAZd991JzMzOzS8kF1NLyyura+n1jbIJIs2hxAMZ6IrHDEihoIQCJVRCDcz3JNx43dNh/aYH2ohAXWM/hLrP2kq0BGdoUSN9nK0h3CFifAYKtKU9GNDUN1WAt4HuDlJf4Ap4pDUoHDTSGTfnjkSnTX5iMmSii0b6tdYMeOTbYS6ZMdW8G2I9ZhoFl2BXRAZCxrusDVVrFfPB1OPRjQOataRJW4G2TyEd0Z8TMfON6fue7fQZdszf2hD+V6tG2Dqsx0KFEYLi40WtSFIM6DAw2hQaOMq+NYxrYf9KeYdpxtHGOsrgaKj974unTXk3ly/kCpd7meLJJI0k2SLbZIfkyQEpknNyQUqEk3vySJ7Ji/PgPDmvztu4NeFMZjbJLzkfn04TpmU=</latexit>

network<latexit sha1_base64="ki51rdA6BgYzJJ2aaVTnY165L2U=">AAACH3icbVDLSgMxFM3UV62vqks3wSK4KlMrVneiC11JBVsLbSmZ9FZDM5khuaOWYT7Fhd/iwo2KLvs3pg+LWg8EDuecy809XiiFQdftO6mZ2bn5hfRiZml5ZXUtu75RNUGkOVR4IANd85gBKRRUUKCEWqiB+Z6Ea697OvCv70AbEagr7IXQ9NmNEh3BGVqplS3tNBAeEDE+AwXaqneQ0My3qADvA91NMpPYBUSayaSVzbl5dwg6TQpjkiNjlFvZz0Y74JEPCrlkxtQLbojNmGkUXEKSaUQGQsa77Abqlirmg2nGwwMTumOVNu0E2j6FdKj+nIiZb0zP92zSZ3hr/noD8T+vHmHnsBkLFUYIio8WdSJJMaCDtmhbaOAoe5YwroX9K+W3TDOOttNhB0cDHEwunibVvXyhmC9e7ueOT8ZtpMkW2Sa7pEBK5JickzKpEE4eyTN5JW/Ok/PivDsfo2jKGc9skl9w+l+enKT0</latexit>neural

<latexit sha1_base64="RlIzHLn4mgAOgZbHsMxJfLe8X4M=">AAAB+HicbVBNS8NAEN3Ur1o/GvXoJVgETyW14set6MVjBfsBbSib7aRdutmE3YlYQ3+JFw+KePWnePPfmKRB1Ppg4PHeDDPz3FBwjbb9aRSWlldW14rrpY3Nre2yubPb1kGkGLRYIALVdakGwSW0kKOAbqiA+q6Ajju5Sv3OHSjNA3mL0xAcn44k9zijmEgDs9xHuEfEWEKkqJgNzIpdtTNYi6SWkwrJ0RyYH/1hwCIfJDJBte7V7BCdmCrkTMCs1I80hJRN6Ah6CZXUB+3E2eEz6zBRhpYXqKQkWpn6cyKmvtZT3006fYpj/ddLxf+8XoTeuRNzGUYIks0XeZGwMLDSFKwhV8BQTBNCmeLJrRYbU0UZJlmVshAuUpx+v7xI2sfVWr1avzmpNC7zOIpknxyQI1IjZ6RBrkmTtAgjEXkkz+TFeDCejFfjbd5aMPKZPfILxvsX2iaUAA==</latexit>

static

<latexit sha1_base64="LpnaRyw1wnUxY+PdGEKCf4dlZ1I=">AAACW3icbVHLTttAFJ2Y8qhLIRSx6mbUCIlFFTkh4rGjsGFJJQJISRSNx9dhlBnbmrmuGln+nn4NWyoW/AvXToqgcKUZHZ1zX3MmzLRyGAQPDW/pw/LK6tpH/9P6543N5taXK5fmVkJfpjq1N6FwoFUCfVSo4SazIEyo4TqcnlX69S+wTqXJJc4yGBkxSVSspECixs0fQ4TfWPcppIiMys3EAiRlUQuIhUNKlWXJ/d1/lBYICXKXCQnluNkK2kEd/C3oLECLLeJivNXwhlEqc0M9pBbODTpBhqNCWJqjofSHuQNqPRUTGBBMhAE3KuodS75LTMTj1NKhHWr2ZUUhjHMzE1KmEXjr/tcq8j1tkGN8NCpUkuX0ODkfFOeaY8or43ikLEjUMwJCWkW7cnkrrJBI9vrDCGL6gveMtJOwLMiQ7+RKr1vd3d7cteMqDp49eguuuu3Ofnv/Z691crrwb419Zd/YHuuwQ3bCztkF6zPJ/rA7ds/+Nh69Jc/31uepXmNRs81ehbfzBOQrtu0=</latexit>

late

nt<latexit sha1_base64="Hlv0j/5n5C2BDZDSj/ZFBgprYyk=">AAACZHicbVFdSxtBFJ2stepWa6z0qVCGBsEHCZsYrH0TfSk+KTQqJCHMzt6NgzO7y8zdYlj2P/lrBJ/0yd/h3U0qWr0ww+Gc+zVnwkwrh0Fw1/AWPix+XFpe8T+trn1eb258OXNpbiX0ZapTexEKB1ol0EeFGi4yC8KEGs7Dq6NKP/8L1qk0+YPTDEZGTBIVKymQqHHzeGuIcI11o0KKyKjcTCxAUha1gFg4pFxZltz/x2iBkGDpz2opIxMSynGzFbSDOvhb0JmDFpvHyXij4Q2jVOaGmkktnBt0ggxHhbA0T0PpD3MH1PpKTGBAMBEG3Kiody35FjERj1NLJ0Fesy8rCmGcm5qQMo3AS/e/VpHvaYMc4/1RoZIsp1fK2aA41xxTXjnII2VBop4SENIq2pXLS2GFRPLZH0YQ01+8Z6idhGVBhuyQK71udXd7M9d+VbH37NFbcNZtd3bbu6e91sHh3L9l9o39YNusw36yA/abnbA+k+yG3bJ79tB49Fa9Te/rLNVrzGs22avwvj8BUk664w==</latexit>

space

<latexit sha1_base64="utGRZkHdcL/f/4vrJ/OKP/BTykE=">AAACZXicbVHBThsxEHUW2tIAbWgrLhywiCJxQNEmRKXcEFyQuIDUAFISRV7vbLCwd1f2bEW02o/ia1BvcOI3mN0ERAsj2Xp6M/Nm/BykWjn0/b81b2Hxw8dPS5/ryyurX7421r6duySzEvoy0Ym9DIQDrWLoo0INl6kFYQINF8H1UZm/+APWqST+jdMURkZMYhUpKZCoceOkNUS4wUoolyI0KjMTCxAXeZVAzB1SrSwKXm89U1ogxFjUX0pSIaEYN5p+26+CvwWdOWiyeZyO12reMExkZkhMauHcoOOnOMqFpYEaSD9zQNLXYgIDgrEw4EZ5tWzBW8SEPEosnRh5xb7uyIVxbmoCqjQCr9z/uZJ8LzfIMPo1ylWcZvRKORsUZZpjwksLeagsSNRTAkJaRbtyeSWskEhG14chRPQZ7zlqJ0GRkyE75EqvW97d3sy1/TJ+vnj0Fpx3253d9u5Zr3lwOPdviW2wLbbNOmyPHbBjdsr6TLJbdsfu2UPt0Vv1fnjrs1KvNu/5zv4Jb/MJWFa7YQ==</latexit>

late

nt<latexit sha1_base64="Hlv0j/5n5C2BDZDSj/ZFBgprYyk=">AAACZHicbVFdSxtBFJ2stepWa6z0qVCGBsEHCZsYrH0TfSk+KTQqJCHMzt6NgzO7y8zdYlj2P/lrBJ/0yd/h3U0qWr0ww+Gc+zVnwkwrh0Fw1/AWPix+XFpe8T+trn1eb258OXNpbiX0ZapTexEKB1ol0EeFGi4yC8KEGs7Dq6NKP/8L1qk0+YPTDEZGTBIVKymQqHHzeGuIcI11o0KKyKjcTCxAUha1gFg4pFxZltz/x2iBkGDpz2opIxMSynGzFbSDOvhb0JmDFpvHyXij4Q2jVOaGmkktnBt0ggxHhbA0T0PpD3MH1PpKTGBAMBEG3Kiody35FjERj1NLJ0Fesy8rCmGcm5qQMo3AS/e/VpHvaYMc4/1RoZIsp1fK2aA41xxTXjnII2VBop4SENIq2pXLS2GFRPLZH0YQ01+8Z6idhGVBhuyQK71udXd7M9d+VbH37NFbcNZtd3bbu6e91sHh3L9l9o39YNusw36yA/abnbA+k+yG3bJ79tB49Fa9Te/rLNVrzGs22avwvj8BUk664w==</latexit>

space

<latexit sha1_base64="utGRZkHdcL/f/4vrJ/OKP/BTykE=">AAACZXicbVHBThsxEHUW2tIAbWgrLhywiCJxQNEmRKXcEFyQuIDUAFISRV7vbLCwd1f2bEW02o/ia1BvcOI3mN0ERAsj2Xp6M/Nm/BykWjn0/b81b2Hxw8dPS5/ryyurX7421r6duySzEvoy0Ym9DIQDrWLoo0INl6kFYQINF8H1UZm/+APWqST+jdMURkZMYhUpKZCoceOkNUS4wUoolyI0KjMTCxAXeZVAzB1SrSwKXm89U1ogxFjUX0pSIaEYN5p+26+CvwWdOWiyeZyO12reMExkZkhMauHcoOOnOMqFpYEaSD9zQNLXYgIDgrEw4EZ5tWzBW8SEPEosnRh5xb7uyIVxbmoCqjQCr9z/uZJ8LzfIMPo1ylWcZvRKORsUZZpjwksLeagsSNRTAkJaRbtyeSWskEhG14chRPQZ7zlqJ0GRkyE75EqvW97d3sy1/TJ+vnj0Fpx3253d9u5Zr3lwOPdviW2wLbbNOmyPHbBjdsr6TLJbdsfu2UPt0Vv1fnjrs1KvNu/5zv4Jb/MJWFa7YQ==</latexit>

transient

<latexit sha1_base64="4wHfmdYr4lLwocwvEXzaKryh8/I=">AAACaHicbVFNTxsxEHW2hULaQigHVPViESFxqKJNiPpxQ/TCqaISAaQkima9s8GK7V3Zs1Wj1f6r/hm4wqm/Au8mbfl6kq2n9zzj8XOUKekoDK8awYuXK6uv1tabr9+83dhsbb07c2luBQ5EqlJ7EYFDJQ0OSJLCi8wi6EjheTT7VvnnP9E6mZpTmmc41jA1MpECyEuT1vcR4S+q+xQCYi1zPbWIpixqg6ggC8ZJNFSWvLn3V1VAlfRfcBkILCetdtgJa/CnpLskbbbEyWSrEYziVOTadxMKnBt2w4zGBViSQmHZHOUOfesZTHHoqQGNblzUA5d8zysxT1LrlyFeq/crCtDOzXXkT2qgS/fYq8TnvGFOyZdxIU2W+2eKxUVJrjilvEqRx9KiIDX3BISVflYuLsGCIJ91cxRj4v/juVTtNCoLH8hHn0q/V+29/iK1rxU+/cvoKTnrdboHnYMf/fbh0TK/NfaB7bJ91mWf2SE7ZidswAT7za7ZDbtt/AlawU7wfnE0aCxrttkDBLt3ee+8zw==</latexit>

G<latexit sha1_base64="TEEKqEYuQd+TQ6h6Jitr41VTBWU=">AAAB8nicbVDLSsNAFL2pr1pfVZduBovgqiRWfOyKLnRZwT4gDWUynbRDJ5MwMxFK6Ge4caGIW7/GnX/jJA2i1gMDh3PuZc49fsyZ0rb9aZWWlldW18rrlY3Nre2d6u5eR0WJJLRNIh7Jno8V5UzQtmaa014sKQ59Trv+5Drzuw9UKhaJez2NqRfikWABI1gbye2HWI8J5unNbFCt2XU7B1okTkFqUKA1qH70hxFJQio04Vgp17Fj7aVYakY4nVX6iaIxJhM8oq6hAodUeWkeeYaOjDJEQSTNExrl6s+NFIdKTUPfTGYR1V8vE//z3EQHF17KRJxoKsj8oyDhSEcoux8NmaRE86khmEhmsiIyxhITbVqq5CVcZjj7PnmRdE7qTqPeuDutNa+KOspwAIdwDA6cQxNuoQVtIBDBIzzDi6WtJ+vVepuPlqxiZx9+wXr/ApI0kZQ=</latexit>

Updated with gradient of loss

26664

v1

v2

...vL

37775

<latexit sha1_base64="1Wu26tFhIELuEcx8b2p+HeRxzB8=">AAADP3icfVJNb9MwGHbCx0b56uDIxaKqNCRUJRvaxm0aFw47DIluk+quchyntebYkf2mWrHyz3bhL3DjyoUDCHHlhpt2U+lYXynS4+d5Xr8fcVJIYSGKvgbhnbv37q+tP2g8fPT4ydPmxrNjq0vDeJdpqc1pQi2XQvEuCJD8tDCc5onkJ8n5u6l+MubGCq0+wqTg/ZwOlcgEo+CpwUbQbZOED4VySU7BiIuq0SbALyDJnKnO3Cbx9MhmDqpX1SDGhOBVhi1v8DpLNdgZvNV66EWu0sW6tzXyaTHZ+uTXeJVUV19hWSp8RkAXjeXiZETBjWcjXx/q8cj4aror+rA+LN45aLaiTlQHvgniOWiheRwNml9IqlmZcwVMUmt7cVRA31EDgknu2yktLyg7p0Pe81DRnNu+q/9/hdueSXGmjf8U4JpdzHA0t3aSJ95Zr2FZm5L/03olZHt9J1RRAldsVigrJQaNp48Jp8JwBnLiAWVG+F4xG1FDGfgn16iX8HYaO9cj3wTHW514u7P94U1r/2C+jnX0Ar1EmyhGu2gfvUdHqItYcBl8C34EP8PP4ffwV/h7Zg2Dec5z9E+Ef/4CBdwQfQ==</latexit>


266664

r(t)1

r(t)2...

r(t)L

377775

<latexit sha1_base64="Acl9MFN689Yrmx0HhsMf5i3YwtQ=">AAADinicfVJdb9MwFHUTYCMM1sEjLxZVpaGxqe3QxsTLBDzwsIch0W1S00WO47RWHSeyb8pK5P/Cb+KNf4OTVF0/pl3J0rnnnOvre+UwE1xDp/Ov4bhPnj7b2n7uvdh5+Wq3uff6Sqe5oqxPU5Gqm5BoJrhkfeAg2E2mGElCwa7DyddSv54ypXkqf8IsY8OEjCSPOSVgqWCv8ccP2YjLIkwIKH5nPHVb7Ps2Geu4APPeBF3s+3iT7lna86dRCrpEG/qF5zMZ3d/bXm/U9oHdQRgXv81Srba1H/BjEi1bPmKxFy83vvUhzR7qPiZQTOvx7rNenS3GWggXdbY2EhMiKKqXFIpRYw58YXcfkaDmwKxYbFUmGBiDD7w2XrUKZZatIv11qIicGBM0W52jThV4E3TnoIXmcRk0//pRSvOESaCCaD3odjIYFkQBp4IZz881ywidkBEbWChJwvSwqL6SwW3LRDhOlT0ScMUuVxQk0XqWhNZZrXxdK8mHtEEO8adhwWWWA5O0bhTnAkOKy3+JI24XCGJmAaGK27diOiaKULC/16uWcFbGyWLkTXDVO+oeHx3/+Ng6/zJfxzZ6i96hfdRFp+gcfUeXqI+os+UcOifOqbvj9twz93NtdRrzmjdoJdxv/wHp7Cc9</latexit>

266664

z(t)1

z(t)2...

z(t)L

377775

<latexit sha1_base64="9+pgnqRMiN15IJzCuR+9dGzQ/FY=">AAADinicfVJdb9MwFHUTYCMM6OCRF4uq0tDY1HZoY+JlAh542MOQ6Dap6SLHcVqrjhPZN2Vd5P/Cb+KNf4OTVF0/pl3J0rnnnOvre+UwE1xDp/Ov4bhPnj7b2n7uvdh5+ep1c/fNpU5zRVmfpiJV1yHRTHDJ+sBBsOtMMZKEgl2Fk2+lfjVlSvNU/oJZxoYJGUkec0rAUsFu448fshGXRZgQUPzWeHc3xZ5vk7GOCzAfTNDFvo836Z6lPX8apaBLtKGfez6T0f297fVGbR/YLYRxcWeWarWt/Ygfk2jZ8hGLvXi58Y0PafZQ9zGBYlqPd5/16mwx1kI4r7O1kZgQQVG9pFCMGrPvC7v7iAQ1B2bFYqsywcAYvO+18apVKLNsFenvA0XkxJig2eocdqrAm6A7By00j4ug+dePUponTAIVROtBt5PBsCAKOBXMeH6uWUbohIzYwEJJEqaHRfWVDG5bJsJxquyRgCt2uaIgidazJLTOauXrWkk+pA1yiD8PCy6zHJikdaM4FxhSXP5LHHG7QBAzCwhV3L4V0zFRhIL9vV61hNMyjhcjb4LL3mH36PDo56fW2df5OrbRO/Qe7aEuOkFn6Ae6QH1EnS3nwDl2Ttwdt+eeul9qq9OY17xFK+F+/w8xUydV</latexit>

⇥z(s), z(s), · · · , z(s)

⇤><latexit sha1_base64="cBd0FoCeNzyEVUGsMyvP2TJN7C4=">AAADcnicbVJLbxMxEHY2PMrySkFcQAJDFKlVS5S0FY9bBRcOPRSJtJXidOX1ehMrXu/Knk0bVr7z+7jxK7jwA3A2C026HcnSN9/MeGY+TZhJYaDX+9Xwmrdu37m7cc+//+Dho8etzScnJs014wOWylSfhdRwKRQfgADJzzLNaRJKfhpOPy/ipzOujUjVN5hnfJTQsRKxYBQcFWw2fnRIyMdCFWFCQYtL63e+nxdbxHkTExdgt23Qx4TgG/g9x/sdMotSMCWsZRy5MFfR1efXm61WGFexi+sMYYsG9Yi/9vU5gTTz69uQCYVitlziyquN/i9wtPTWhnaulEFBgF9CoTmzdodIp3FEgyUHdi3FVWWSg7V4x+/g9VSp7WqqTC/eaqqm1gatdq/bKw3XQb8CbVTZcdD6SaKU5QlXwCQ1ZtjvZTAqqAbBJHfq5IZnlE3pmA8dVDThZlSUJ2NxxzERjlPtngJcsqsVBU2MmSehyywVvx5bkDfFhjnEH0aFUFkOXLFloziXGFK8uD8cCScgyLkDlGnhZsVsQjVl4K7UL0X4uLB3/1eug5O9bn+/u//1oH34qZJjA71Ab9AW6qP36BB9QcdogFjjt/fMe+m98v40nzdfNyvtvEZV8xStWXP3LxROHS8=</latexit>


`rec + �s`static + �t`triplet

<latexit sha1_base64="AcJcVKFtP62R8MiPdUBkQVCABfQ=">AAACTXicbVHLSgMxFM3UR2t9VV26CRZBEMqMVHRZdOOygn1Ap5RM5taGZh4kd8QyzA+6Edz5F25cKCKmD0HbXggczjn33uTEi6XQaNuvVm5ldW09X9gobm5t7+yW9vabOkoUhwaPZKTaHtMgRQgNFCihHStggSeh5Q2vx3rrAZQWUXiHoxi6AbsPRV9whobqlXwXpOylLsIjpgp4ltFT6kozwGe/tM6yvy6NpnepEf8bUYlYgiF7pbJdsSdFF4EzA2Uyq3qv9OL6EU8CCJFLpnXHsWPspkyZxRKyoptoiBkfsnvoGBiyAHQ3naSR0WPD+LQfKXNCpBP2b0fKAq1HgWecAcOBntfG5DKtk2D/spuKME4QQj5d1E8kxYiOo6W+MAGiHBnAuBLmrpQPmGIczQcUTQjO/JMXQfOs4lQr57fVcu1qFkeBHJIjckIcckFq5IbUSYNw8kTeyAf5tJ6td+vL+p5ac9as54D8q1z+BwPnuJQ=</latexit>

`rec + �s`static

<latexit sha1_base64="33uAVJkog2VZw8M6iBbvAvNAjhU=">AAACI3icbVDJSgNBEO2JW4xb1KOXxiAIQpiRiOIp6MVjBLNAJoSeTiVp0rPQXSOGYf7Fi7/ixYMSvHjwX+wsQkx80PB49aqq63mRFBpt+8vKrKyurW9kN3Nb2zu7e/n9g5oOY8WhykMZqobHNEgRQBUFSmhECpjvSah7g9txvf4ISosweMBhBC2f9QLRFZyhkdr5axekbCcuwhMmCnia0jPqSjOgw35lnabzLo2m1xjb+YJdtCegy8SZkQKZodLOj9xOyGMfAuSSad107AhbCVNmnIQ058YaIsYHrAdNQwPmg24lkxtTemKUDu2GyrwA6USd70iYr/XQ94zTZ9jXi7Wx+F+tGWP3qpWIIIoRAj5d1I0lxZCOA6MdYWJBOTSEcSXMXynvM8U4mlhzJgRn8eRlUjsvOqXixX2pUL6ZxZElR+SYnBKHXJIyuSMVUiWcPJNX8k4+rBfrzRpZn1Nrxpr1HJI/sL5/ALumptY=</latexit>

`rec + �s`static + �t`triplet

<latexit sha1_base64="AcJcVKFtP62R8MiPdUBkQVCABfQ=">AAACTXicbVHLSgMxFM3UR2t9VV26CRZBEMqMVHRZdOOygn1Ap5RM5taGZh4kd8QyzA+6Edz5F25cKCKmD0HbXggczjn33uTEi6XQaNuvVm5ldW09X9gobm5t7+yW9vabOkoUhwaPZKTaHtMgRQgNFCihHStggSeh5Q2vx3rrAZQWUXiHoxi6AbsPRV9whobqlXwXpOylLsIjpgp4ltFT6kozwGe/tM6yvy6NpnepEf8bUYlYgiF7pbJdsSdFF4EzA2Uyq3qv9OL6EU8CCJFLpnXHsWPspkyZxRKyoptoiBkfsnvoGBiyAHQ3naSR0WPD+LQfKXNCpBP2b0fKAq1HgWecAcOBntfG5DKtk2D/spuKME4QQj5d1E8kxYiOo6W+MAGiHBnAuBLmrpQPmGIczQcUTQjO/JMXQfOs4lQr57fVcu1qFkeBHJIjckIcckFq5IbUSYNw8kTeyAf5tJ6td+vL+p5ac9as54D8q1z+BwPnuJQ=</latexit>

low – rank condition+

<latexit sha1_base64="3DnS3vImeIO9jz1IyYjxDR65oEU=">AAAB6HicbVDLSgNBEOyNrxhfUY9eBoMgCGFXInoMevGYgHlAsoTZSW8yZnZ2mZkVQsgXePGgiFc/yZt/4yTZgyYWNBRV3XR3BYng2rjut5NbW9/Y3MpvF3Z29/YPiodHTR2nimGDxSJW7YBqFFxiw3AjsJ0opFEgsBWM7mZ+6wmV5rF8MOME/YgOJA85o8ZK9YteseSW3TnIKvEyUoIMtV7xq9uPWRqhNExQrTuemxh/QpXhTOC00E01JpSN6AA7lkoaofYn80On5MwqfRLGypY0ZK7+npjQSOtxFNjOiJqhXvZm4n9eJzXhjT/hMkkNSrZYFKaCmJjMviZ9rpAZMbaEMsXtrYQNqaLM2GwKNgRv+eVV0rwse5XyVb1Sqt5mceThBE7hHDy4hircQw0awADhGV7hzXl0Xpx352PRmnOymWP4A+fzB3QdjLc=</latexit>

Figure 2: Overview of the proposed method. Videos can be broken down into two main parts: static and transient compo-nents. To capture this, we map a video (with L frame sequence) into two learnable latent spaces. We jointly learn the staticlatent space and the transient latent space along with the network weights. We then use these learned latent spaces to generatevideos at the inference time. See Sec. 3 for more details.

MethodsSettings

Adversariallearning?

Inputframe?

Input latentvectors?

VGAN [34] X 7 X(random)TGAN [26] X 7 X(random)MoCoGAN

[33]X 7 X(random)

Video-VAE[10]

7 X X(random)

Ours 7 7 X(learned)

Table 1: Categorization of prior works in video synthe-sis. Different from existing methods, our model doesn’t re-quire a discriminator, or any reference input frame. How-ever, since we have learned latent vectors, we have controlof the kind of videos the model should generate.

not guarantee same performance for every run. Some tech-niques to make the generator agnostic to described prob-lems have been discussed in [27]. The other alternative tothe same has given rise to non-adversarial learning of gener-ative networks [4, 11]. Both [4, 11] showed that propertiesof convolutional GANs can be mimicked using simple re-construction losses while discarding the discriminator.

While there has been some work on image generationfrom learned latent vectors [4, 11], our work significantlydiffers from these methods as we do not map all the framespixel-wise in a given video to the same latent distribution.This is because doing so would require a separate latentspace (hence, a separate model for each video) for all thevideos in a given dataset, and performing any operation inthat space would naturally become video specific. Instead,

we divide the latent space of videos sharing the same classinto two parts - static and transient. This gives us a dic-tionary of static latent vectors for all videos and a commontransient latent subspace. Hence, any random video of thedataset can now be represented by the combination of onestatic vector (which remains same for all frames) and thecommon transient subspace.

3. FormulationDefine a video clip V represented by L frames as V =[

v1, v2, · · · , vL]. Corresponding to each frame vi, let there

be a point zi in latent space ZV ∈ RD×L such that

ZV =[z1, z2, · · · , zL

](1)

which forms a path of length L. We propose to disentanglea video into two parts: a static constituent, which capturesthe constant portion of the video common for all frames,and a transient constituent which represents the temporaldynamics between all the frames in the video. Hence, letZV

be decomposed as ZV = [Z>s ,Z>t ]> where Zs ∈ RDs×L

represents the static subspace and Zt ∈ RDt×L representsthe transient subspace with D = Ds + Dt. Thus {zi}Li=1

in (1) can be expressed as zi =[z(s)>i , z

(t)>i

]>∀ i =

1, 2, · · · , L. Next assuming that the video is of short length,we can fix z(s)i = z(s) for all frames after sampling onlyonce. Therefore, (1) can be expressed as

ZV =

[[z(s)

z(t)1

],

[z(s)

z(t)2

], · · · ,

[z(s)

z(t)L

]](2)

The transient portion will represent the motion of a givenvideo. Intuitively, the latent vectors corresponding to this

3

transient state should be correlated, or in other words, willform a path between z(t)1 and z(t)L . Specifically, the framesin a video are correlated in time and hence a frame vi attime i = T is a function of all previous frames {vi}T−1i=1 . Asa result, their corresponding transient representation shouldalso exhibit such a trajectory. This kind of representationof latent vectors can be obtained by employing a Recur-rent Neural Network (RNN) where output of each cell ofthe network is a function of its previous state or input. De-note the RNN as R with weights θ. Then, the RNN outputR(zi) = {r(t)i } ∀ i = 1, 2, · · · , L is a sequence of corre-lated variables representing the transient state of the video.

3.1. Learning Network Weights

Define a generative network G with weights repre-sented by γ. G takes latent vectors sampled from ZV

as input and predicts up to L frames of the video clip.For a set of N videos, initialize set of D-dimensionalvectors ZV to form the pair

{(ZV1 ,V1

),(ZV2 ,V2

),

· · · ,(ZVN

,VN) }

. More specifically from (2), defin-

ing z(s) =[z(s), z(s), · · · , z(s)

]∈ RDs×L, and z(t) =[

z(t)1 , z

(t)2 , · · · , z(t)L

]∈ RDt×L, we will have the pairs

{([z(s)z(t)

]

1

,V1

),

([z(s)z(t)

]

2

,V2

), · · · ,

([z(s)z(t)

]

N

,VN

)}.

With these pairs, we propose to optimize the weights γ, θ,and input latent vectors ZV (sampled once in the beginningof training) in the following manner. For each video Vj , wejointly optimize for θ, γ, and {ZVj}Nj=1 for every epoch intwo stages:

Stage 1 : minγ

`(Vj ,G(ZVj

)|(ZVj

, θ))

(3.1)

Stage 2 : minZV,θ

`(Vj ,G(ZVj

)|γ)

(3.2)

`(·) can be any regression-based loss. For rest of the paper,we will refer to both (3.1) and (3.2) together as min

ZV,θ,γ`rec.

Regularized loss function to capture static subspace.The transient subspace, along with the RNN, handles thetemporal dynamics of the video clip. To equally capturethe static portion of the video, we randomly choose a framefrom the video and ask the generator to compare its cor-responding generated frame during training. For this, weupdate the above loss as follows.

minZV,θ,γ

(`rec + λs`static) (4)

where `static = `(vk, vk) with k ∈ {1, 2, · · · , L}, vk is theground truth frame, vk = G(zk), and λs is the regularizationconstant. `static can also be understood to essentially handlethe role of image discriminator in [33, 35] that ensures thatthe generated frame is close to the ground truth frame.

3.2. Learning Latent Spaces

Non-adversarial learning involves joint optimization ofnetwork weights as well as the corresponding input latentspace. Apart from the gradients with respect to loss in (4),we propose to further optimize the latent space with gradi-ent of a loss based on the triplet condition as follows.

3.2.1 The Triplet Condition

z(t),ni · · · z

(t)i · · · · · · z

(t),ai z

(t),pi · · · · · ·

Negative range Negative rangePositive range

1

Figure 3: Triplet Condition in the transient latent space.Latent code representation of different frames of short videoclips may lie very near to each other in the transient sub-space. Using the proposed triplet condition, our modellearns to explain the dynamics of similar looking framesand simultaneously map them to distinct latent vectors.

Short video clips often have indistinguishable dynamicsin consecutive frames which can force the latent code rep-resentations to lie very near to each other in the transientsubspace. However, an ideal transient space should ensurethat the latent vector representation of a frame should onlybe close to a similar frame than a dissimilar one [28, 29]. Tothis end, we introduce a triplet loss to (4) that ensures thata pair of co-occurring frames vai (anchor) and vpi (positive)are closer but distinct to each other in embedding space thanany other frame vni (negative) (see Fig. 3). In this work, pos-itive frames are randomly sampled within a margin range αof the anchor and negatives are chosen outside of this mar-gin range. Defining a triplet set with transient latent codevectors {z(t),ai , z

(t),pi , z

(t),ni }, we aim to learn the transient

embedding space z(t) such that

‖z(t),ai − z(t),pi ‖22 + α < ‖z(t),ai − z(t),ni ‖22

∀ {z(t),ai , z(t),pi , z

(t),ni } ∈ Γ, where Γ is the set of all possi-

ble triplets in z(t). With the above regularization, the loss in(4) can be written as

minZV,θ,γ

(`rec + λs`static)

s.t. ‖z(t),ai − z(t),pi ‖22 + α < ‖z(t),ai − z(t),ni ‖22 (5)

where α is a hyperparameter that controls the margin whileselecting positives and negatives.

3.3. Full Objective Function

For any choice of differentiable generator G, the objec-tive (4) will be differentiable with respect to ZV, and (γ, θ)

4

[5]. We initialize ZV by sampling them from two differ-ent Gaussian distributions for both static and transient latentvectors. We also ensure that the latent vectors ZV lie on theunit `2 sphere, and hence, we project ZV after each updateby dividing its value by max(1, ‖ZV‖) [4], where max(·)returns maximum among the set of given elements. Finally,the complete objective function can be written as follows.

minZV,θ,γ

(`rec + λs`static + λt`triplet) (6)

where `static = `(vk, vk), λt is a regulariza-tion constant for the triplet loss, and `triplet =

max(‖z(t),ai − z(t),pi ‖22 + α− ‖z(t),ai − z(t),ni ‖22, 0

).

The weights of the generator γ and static latent vector z(s)

are updated by gradients of the losses `rec and `static. Theweights. θ, of the RNN, and transient latent vectors z(t) areupdated by gradients of the losses `rec, `static and `triplet.

3.3.1 Low Rank Representation for Interpolation

The objective of video frame interpolation is to synthesizenon-existent frames in-between the reference frames. Whilethe triplet condition ensures that similar frames have theirtransient latent vectors nearby, it doesn’t ensure that they lieon a manifold where simple linear interpolation will yieldlatent vectors that generate frames with plausible motioncompared to preceding and succeeding frames [4, 12]. Thismeans that the transient latent subspace can be representedin a much lower dimensional space compared to its largerambient space. So, to enforce such a property, we projectthe latent vectors into a low dimensional space while learn-ing them along with the network weights, first proposed in[12]. Mathematically, the loss in (6) can be written as

minZV,θ,γ

(`rec + λs`static + λt`triplet)

s.t. rank(

z(t))

= ρ (7)

where rank(·) indicates rank of the matrix and ρ is a hyper-parameter that decides what manifold z(t) is to be projectedon. We achieve this by reconstructing z(t) matrix from itstop ρ singular vectors in each iteration [7]. Note that, weonly employ this condition for optimizing the latent spacefor the frame interpolation experiments in Sec. 4.3.3.

4. ExperimentsIn this section, we present extensive experiments to

demonstrate the effectiveness of our proposed approach ingenerating videos through learned latent spaces.

4.1. Datasets

We evaluate the performance of our approach using threepublicly available datasets which have been used in manyprior works [10, 33, 34].

Chair-CAD [2]. This dataset consists of total 1393chair-CAD models, out of which we randomly choose 820chairs for our experiments with the first 16 frames, similarto [10]. The rendered frame in each video for all the modelsare center-cropped and then resized to 64 × 64 × 3 pixels.We obtain the transient latent vectors for all the chair mod-els with one static latent vectors for the training set.

Weizmann Human Action [9]. This dataset provides10 different actions performed by 9 people, amounting to90 videos. Similar to Chair-CAD, we center-cropped eachframe, and then resized to 64×64×3 pixels. For this dataset,we train our model to obtain nine static latent vectors (fornine different identities) and ten transient latent vectors (forten different actions) for videos with 16 frames each.

Golf scene dataset [34]. Golf scene dataset [34] con-tains 20,268 golf videos with 128 × 128 × 3 pixels whichfurther has 583,508 short video clips in total. We ran-domly chose 500 videos with 16 frames each and resizedthe frames to 64 × 64 × 3 pixels. Same as the Chair-CADdataset, we obtained the transient latent vectors for all thegolf scenes and one static latent vector for the training set.

4.2. Experimental Settings

We implement our framework in PyTorch [22]. Pleasesee supplementary material for details on implementationand values of different hyper-parameters (Ds, Dt, α, etc.).

Network Architecture. We choose DCGAN [23] as thegenerator architecture for the Chair-CAD and Golf scenedataset, and conditional generator architecture from [21] forthe Weizmann Human Action dataset for our experiments.For the RNN, we employ a one-layer gated recurrent unitnetwork with 500 hidden units [6].

Choice of Loss Function for `rec and `static. One straightforward loss function that can be used is the mean squaredloss, but it has been shown in literature that it leads to gen-eration of blurry pixels [37]. Moreover, it has been shownempirically that generative functions in adversarial learningfocus on edges [4]. Motivated by this, the loss function for`rec and `static is chosen to be the Laplacian pyramid lossLLaplacian [18] defined as

LLaplacian(v, v) =∑

l

22l |Ll(v)− Ll(v)|1

where Ll(·) is the l-th level of the Laplacian pyramid repre-sentation of the input.

Baselines. We compare our proposed method with twoadversarial methods. For Chair-CAD and Weizmann Hu-man Action, we use MoCoGAN [33] as the baseline, and forGolf scene dataset, we use VGAN [34] as the baseline. Weuse the publicly available code for MoCoGAN and VGAN,and set the hyper-parameters as recommended in the pub-lished work. We also compare two different versions of theproposed method by ablating the proposed loss functions.

5

Note that, we couldn’t compare our results with Video-VAE[10] using our performance measures (described below) asthe implementation has not been made available by the au-thors, and to the best of our efforts we couldn’t reproducethe results provided by them.

Performance measures. Past video generation workshave been evaluated quantitatively on Inception score (IS)[10]. But, it has been shown that IS is not a good evaluationmetric for pixel domain generation, as the maximal IS scorecan be obtained by synthesizing a video from every class ormode in the given data distribution [3, 20, 32]. Moreover,a high IS does not guarantee any confidence on the qualityof generation, but only on the diversity of generation. Sincea generative model trained using our proposed method cangenerate all videos using the learned latent dictionary2, andfor a fair comparison with baselines, we use the followingtwo measures, similar to measures provided in [33]. Wealso provide relevant bounds computed on real videos forreference. Note that arrows indicate whether higher (↑) orlower (↓) scores are better.

(1) Relative Motion Consistency Score (MCS ↓): Dif-ference between consecutive frames captures the movingcomponents, and hence motion in a video. So, firstly eachframe in the generated video, as well as the ground-truthdata, is represented as a feature vector computed using aVGG16 network [31] pre-trained on ImageNet [25] at therelu3 3 layer. Secondly, the averaged consecutive frame-feature difference vector for both set of videos is computed,denoted by f and f respectively. Finally, the relative MCSis then given by log10

(‖f − f‖22

).

(2) Frame Consistency Score (FCS ↑): This score mea-sures the consistency of the static portion of the generatedvideo frames. We keep the first frame of the generated videoas reference and compute the averaged structural similaritymeasure for all frames. The FCS is then given by the aver-age of this measure over all videos.

4.3. Qualitative Results

Fig. 5 shows some examples with randomly selectedframes of generated videos for the proposed method andthe adversarial approaches MoCoGAN [33] and VGAN[34]. For Chair-CAD [2] and Weizmann Human Action [9]dataset, it can be seen that the proposed method is able togenerate visually good quality videos with a non-adversarialtraining protocol, whereas MoCoGAN produces blurry andinconsistent frames. Since we use optimized latent vectorsunlike MoCoGAN (which uses random latent vectors forvideo generation), our method produces visually more ap-

2Direct video comparison seems straight forward for our approach asthe corresponding one-to-one ground truth is known. However, for [33,34], we do not know which video is being generated (action may be knowne.g. [33]) which makes such direct comparison infeasible and unfair.

pealing videos. Fig. 5 presents two particularly importantpoints. As visualized for the Chair-CAD videos, the adver-sarial approach of MoCoGAN produces not only blurredchair images in the generated video, but they are also non-uniform in quality. Further, it can be seen that as the timestep increases, MoCoGAN tends to generate the same chairfor different videos. This shows a major drawback of theadversarial approaches, where they fail to learn the diver-sity of the data distribution. Our approach overcomes thisby producing a optimized dictionary of latent vectors whichcan be used for generating any video in the data distributioneasily. To further validate our method for qualitative results,we present the following experiments.

4.3.1 Qualitative Ablation Study

Fig. 4 qualitatively shows the contribution of the specificparts of the proposed method on Chair-CAD [2]. First,we investigate the impact of input latent vector optimiza-tion. For a fair comparison, we optimize the model for samenumber of epochs. It can be observed that the model bene-fits from the joint optimization of input latent space to pro-duce better visual results. Next, we validate the contributionof `static and `triplet on a difficult video example whose chaircolor matches with the background. Our method, combinedwith `static and `triplet, is able to distinguish between thewhite background and the white body of the chair model.

ActionsIdentities

P1 P2 P3 P4 P5 P6 P7 P8 P9

run • • • •walk • • • •jump • • • • • • •skip • • • • • • •

Table 3: Generating videos by exchanging unseen ac-tions by identities. Each cell in this table indicates a videoin the dataset. Only cells containing the symbol • indicatethat the video was part of the training set. We randomly gen-erated videos corresponding to rest of the cells indicated bysymbols •, •, •, and •, visualized in Fig. 6.

4.3.2 Action Exchange

Our non-adversarial approach can effectively separate thestatic and transient portion of a video, and generate videosunseen during the training protocol. To validate thesepoints, we choose a simple matrix completion for the com-bination of identities and actions in the Weizmann Humanaction [9] dataset. For training our model, we created aset of videos (without any cropping to present the completescale of the frame) represented by the cells marked with •in Tab. 3. Hence, the unseen videos correspond to the cellsnot marked with •. During testing, we randomly generated

6

Without latent optimization

With latent optimization

+`static+`triplet<latexit sha1_base64="nM2iGDahG0LXX4/k9SPrcXJYc0k=">AAACJ3icbVDJSgNBEO1xjXGLevTSGARBCBMNLhcJevEYwSyQhNDTqSRNeha6a8QwzN948Ve8CCqiR//EzmQQNXnQ8Oq9KqrrOYEUGm3705qbX1hcWs6sZFfX1jc2c1vbNe2HikOV+9JXDYdpkMKDKgqU0AgUMNeRUHeGV2O/fgdKC9+7xVEAbZf1PdETnKGROrmLFsI9IkaHcQuk7ERJHWk0Po/j2S4qEUjAOKadXN4u2AnoNCmmJE9SVDq5l1bX56ELHnLJtG4W7QDbEVNmnYQ42wo1BIwPWR+ahnrMBd2Okjtjum+ULu35yjwPaaL+noiYq/XIdUyny3Cg/3tjcZbXDLF31o6EF4QIHp8s6oWSok/HodGuUMBRjgxhXAnzV8oHTDGOJtpsEsL5GCc/J0+T2lGheFwo3ZTy5cs0jgzZJXvkgBTJKSmTa1IhVcLJA3kir+TNerSerXfrY9I6Z6UzO+QPrK9vSPGpZQ==</latexit>

-`static-`triplet<latexit sha1_base64="vtrp+dclY5hcPx8SSuogobCtwy0=">AAACJ3icbVDJSgNBEO1xjXGLevTSGAQvhokGl4sEvXiMYBZIQujpVJImPQvdNWIY5m+8+CteBBXRo39iZzKImjxoePVeFdX1nEAKjbb9ac3NLywuLWdWsqtr6xubua3tmvZDxaHKfemrhsM0SOFBFQVKaAQKmOtIqDvDq7FfvwOlhe/d4iiAtsv6nugJztBIndxFC+EeEaPDuAVSdqKkjjQan8fxbBeVCCRgHNNOLm8X7AR0mhRTkicpKp3cS6vr89AFD7lkWjeLdoDtiCmzTkKcbYUaAsaHrA9NQz3mgm5HyZ0x3TdKl/Z8ZZ6HNFF/T0TM1XrkOqbTZTjQ/72xOMtrhtg7a0fCC0IEj08W9UJJ0afj0GhXKOAoR4YwroT5K+UDphhHE202CeF8jJOfk6dJ7ahQPC6Ubkr58mUaR4bskj1yQIrklJTJNamQKuHkgTyRV/JmPVrP1rv1MWmds9KZHfIH1tc3T5mpaQ==</latexit>

Figure 4: Qualitative ablation study on Chair-CAD [2]. Left: It can be seen that the model is not able to generategood quality frames properly resulting in poor videos when the input latent space is not optimized, whereas with latentoptimization, the generated frames are sharper. Right: The impact of `static and `triplet is indicated by the red bounding boxes.Our method with `static and `triplet captures the difference between the white background and white chair, whereas withoutthese two loss functions, the chair images are not distinguishable from their background. + and - indicate presence andabsence of the terms, respectively.

MoC

oGAN

Ours

MoC

oGAN

Ours

VGAN

Ours

(a) Chair-CAD [2]

MoC

oGAN

Ours

MoC

oGAN

Ours

VGAN

Ours

(b) Weizmann Human Action [9]

MoC

oGAN

Ours

MoC

oGAN

Ours

VGAN

Ours

(c) Golf [34]

Figure 5: Qualitative results comparison with state-of-the-art methods. We show two generated video sequences forMoCoGAN [33] (for (a) Chair-CAD [2], (b) Weizmann Human action [9]), VGAN [34] (for (c) Golf scene [34]) (top), andthe proposed method (Ours, bottom). The proposed method produces visually sharper, and consistently better using thenon-adversarial training protocol. More examples have been provided in the supplementary material.

jump

run

skip

walk

wave2

Figure 6: Examples of action exchange to generate un-seen videos. This figure shows the generated videos un-seen during the training of the model with colored boundingboxes indicating the colored dots (•, •, •, •) referred to inTab. 3. This demonstrates the effectiveness of our methodin disentangling static and transient portion of videos.

these unseen videos (marked with •, •, • and • in Tab. 3),and the visual results are shown in Fig. 6. This experimentclearly validates our claim of static (identities) and transient(action) portion disentanglement of a video and, generationof unseen videos by using combinations of action and iden-

tities not part of training set. Note that generated videosmay not exactly resemble ground truth videos of the saidcombinations as we learn z(t) over a class of many videos.

4.3.3 Frame Interpolation

To show our methodology can be employed for frame inter-polation, we trained our model using the loss (7) for ρ = 2and ρ = 10. During testing, we generated intermediateframes by interpolating learned latent variables of two dis-tinct frames. For this, we computed the difference ∆z(t)

between the learned latent vectors of second (z(t)2 ) and fifth(z(t)5 ) frame, and generated k = 3 unseen frames using{z(t)2 + n/k∆z(t)}kn=1, after concatenating with z(s). Fig. 7shows the results of interpolation between second and fifthframes for two randomly chosen videos. Thus, our methodis able to produce dynamically consistent frames with re-spect to the reference frames without any pixel clues.

4.4. Quantitative Results

Quantitative result comparisons with respect to baselineshave been provided in Tab. 2. Compared to videos gener-ated by the adversarial method MoCoGAN [33], we reporta relative decrease of 19.22% in terms of MCS, and 4.70%

7

MCS ↓ FCS ↑

Bound 0.0 0.91MoCoGAN [33] 4.11 0.85

Ours (-`triplet-`static) 3.83 0.77Ours (+`triplet+`static) 3.32 0.89

(a) Chair-CAD [2]

MCS ↓ FCS ↑

Bound 0.0 0.95MoCoGAN [33] 3.41 0.85



MCS ↓ FCS ↑

Bound 0.0 0.97VGAN [34] 3.61 0.88


(c) Golf [34]

Table 2: Quantitative results comparison with state-of-the-art methods. We obtained better scores on the proposedmethod on both Chair-CAD [2], Weizmann Human Action [9], and Golf [34] datasets, compared to the adversarial approaches(MoCoGAN, and VGAN). Best scores have been highlighted in bold.

⇢ = 2<latexit sha1_base64="QFz1eQYSV2x9VIV1s9NgjY0qdv8=">AAACMXicbVDJSgNBEO2JW4xb1KOXxhDwYphocDkIQS8eIxgNZELo6VRMY89Cd40YhvklL/6JePGgiFd/wp7JIC550PDqvSqq67mhFBpt+8UqzMzOzS8UF0tLyyura+X1jSsdRIpDmwcyUB2XaZDChzYKlNAJFTDPlXDt3p6l/vUdKC0C/xLHIfQ8duOLoeAMjdQvn1cdhHtEjHcTB6Tsx1kdazQNPEmmu6hEKAGThJYcNQroCd3rlyt2zc5A/5N6TiokR6tffnIGAY888JFLpnW3bofYi5kyeyUkJSfSEDJ+y26ga6jPPNC9OLs4oVWjDOgwUOb5SDP150TMPK3Hnms6PYYj/ddLxWleN8LhUS8Wfhgh+HyyaBhJigFN46MDoYCjHBvCuBLmr5SPmGIcTcilLITjFAffJ/8nV3u1+n6tcdGoNE/zOIpki2yTHVInh6RJzkmLtAknD+SZvJI369F6sd6tj0lrwcpnNskvWJ9flSasUA==</latexit>

⇢ = 10<latexit sha1_base64="Rc+FJRk7ev6+LOPhxMQmr0wFyU0=">AAACMnicbVDLSgNBEJz1bXxFPXoZDAEvhl0NPg5C0IveFIwJZEOYnXSSIbMPZnrFsOw3efFLBA96UMSrH+FkDaImBQPVVd30dHmRFBpt+9mamp6ZnZtfWMwtLa+sruXXN250GCsOVR7KUNU9pkGKAKooUEI9UsB8T0LN658N/dotKC3C4BoHETR91g1ER3CGRmrlL4ouwh0iJrupC1K2kqxONJoGnqaTXVQikoBpSnOu6oX0hDp2K1+wS3YGOk6cESmQES5b+Ue3HfLYhwC5ZFo3HDvCZsKUWSwhzbmxhojxPutCw9CA+aCbSXZySotGadNOqMwLkGbq74mE+VoPfM90+gx7+r83FCd5jRg7R81EBFGMEPDvRZ1YUgzpMD/aFgo4yoEhjCth/kp5jynG0aScy0I4HuLg5+RxcrNXcvZL5atyoXI6imOBbJFtskMcckgq5Jxckirh5J48kVfyZj1YL9a79fHdOmWNZjbJH1ifXx6erIk=</latexit>

Figure 7: Examples of frame interpolation. An importantadvantage of our method is translation of interpolation inthe learned latent space to video space using (7). It canbe observed that as ρ increases, the interpolation (boundedby color) is better. Note that the adjacent frames are alsogenerated frames, and not ground truth frames.

relative increase in terms of FCS for chair-CAD dataset [2].For the Weizmann Human Action [9] dataset, the proposedmethod is reported to have a relative decrease of 22.87% interms of in terms of MCS, and 4.61% relative increase interms of FCS. Similarly for Golf scene dataset [34], we per-form competitively with VGAN [34] with a observed rela-tive decrease of 24.90% in terms of in terms of MCS. A im-portant conclusion from these results is that our proposedmethod, being non-adversarial in nature, learns to synthe-size a diverse set of videos, and is able to perform at parwith adversarial approaches. It should be noted that a bet-ter loss function for `rec and `static would produce strongerresults. We leave this for future works.

4.4.1 Quantitative Ablation Study

In this section, we demonstrate the contribution of differ-ent components in our proposed methodology on the Chair-CAD [2] dataset. For all the experiments, we randomlygenerate 500 videos using our model by using the learnedlatent vector dictionary. We divide the ablation study intotwo parts. Firstly, we present the results for impact of thelearned latent vectors on the network modules. For this, wesimply generate videos once with the learned latent vectors(+Z), and once with randomly sampled latent vectors froma different distribution (-Z). The inter-dependency of ourmodel weights and the learned latent vectors can be inter-preted from Tab. 4a. We see that there is a relative decreaseof 16.16% in MCS from 3.96 to 3.32, and 18.66% of rel-

MCS ↓ FCS ↑

Bound 0 0.91-Z 3.96 0.75+Z 3.32 0.89

(a) With respect to latentspace optimization.

MCS ↓ FCS ↑

Bound 0 0.91-`triplet-`static 3.83 0.77-`triplet+`static 3.82 0.85+`triplet-`static 3.36 0.81+`triplet+`static 3.32 0.89

(b) With respect to loss functions

Table 4: Ablation study of proposed method on Chair-CAD [2]. In (a), we evaluate contributions of latent spaceoptimization (Z). In (b), we evaluate contributions of `tripletand `static in four combinations. + and - indicate presenceand absence of the terms, respectively.

ative increase in FCS. This shows that optimization of thelatent space in the proposed method is important for goodquality video generation.

Secondly, we investigate the impact of the proposedlosses on the proposed method. Specifically, we look intofour possible combinations of `triplet and `static. The resultsare presented in Tab. 4b. It can observed that the combi-nation of triplet loss `triplet and static loss `static provides thebest result when employed together, indicated by the rela-tive decrease of 14.26% in MCS from 3.83 to 3.32.

5. Conclusion

We present a non-adversarial approach for synthesiz-ing videos by jointly optimizing both network weights andinput latent space. Specifically, our model consists of aglobal static latent variable for content features, a framespecific transient latent variable, a deep convolutional gen-erator, and a recurrent neural network which are trainedusing a regression-based reconstruction loss, including atriplet based loss. Our approach allows us to generate adiverse set of almost uniform quality videos, perform frameinterpolation, and generate videos unseen during training.Experiments on three standard datasets show the efficacy ofour proposed approach over state-of-the-methods.

Acknowledgements. The work was partially supportedby NSF grant 1664172 and ONR grant N00014-19-1-2264.

8

References[1] Martin Arjovsky, Soumith Chintala, and Leon Bottou.

Wasserstein Generative Adversarial Networks. In Proceed-ings of the International Conference on Machine Learning,pages 214–223, 2017.

[2] Mathieu Aubry, Daniel Maturana, Alexei A Efros, Bryan CRussell, and Josef Sivic. Seeing 3D Chairs: Exemplar part-based 2D-3D Alignment using a Large Dataset of CADModels. In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition, pages 3762–3769,2014.

[3] Shane Barratt and Rishi Sharma. A Note on the InceptionScore. arXiv preprint arXiv:1801.01973, 2018.

[4] Piotr Bojanowski, Armand Joulin, David Lopez-Pas, andArthur Szlam. Optimizing the Latent Space of GenerativeNetworks. In Proceedings of the International Conferenceon Machine Learning, pages 599–608, 2018.

[5] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G Di-makis. Compressed Sensing using Generative Models. InProceedings of the International Conference on MachineLearning, pages 537–546, 2017.

[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, andYoshua Bengio. Empirical Evaluation of Gated RecurrentNeural Networks on Sequence Modeling. arXiv preprintarXiv:1412.3555, 2014.

[7] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. TheElements of Statistical Learning, volume 1. Springer Seriesin Statistics New York, 2001.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, BingXu, David Warde-Farley, Sherjil Ozair, Aaron Courville,and Yoshua Bengio. Generative Adversarial Networks. InAdvances in Neural Information Processing Systems, pages2672–2680, 2014.

[9] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani,and Ronen Basri. Actions as Space-Time Shapes. IEEETransactions on Pattern Analysis and Machine Intelligence,29(12):2247–2253, 2007.

[10] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori,and Leonid Sigal. Probabilistic Video Generation usingHolistic Attribute Control. In Proceedings of the EuropeanConference on Computer Vision, pages 452–467, 2018.

[11] Yedid Hoshen, Ke Li, and Jitendra Malik. Non-AdversarialImage Synthesis with Generative Latent Nearest Neighbors.In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, pages 5811–5819, 2019.

[12] Rakib Hyder and M. Salman Asif. Generative Modelsfor Low-Dimensional Video Representation and Reconstruc-tion. IEEE Transactions on Signal Processing, 68:1688–1701, 2020.

[13] Diederik P Kingma and Jimmy Ba. ADAM: A Method forStochastic Optimization. arXiv preprint arXiv:1412.6980,2014.

[14] Yong-Hoon Kwon and Min-Gyu Park. Predicting FutureFrames using Retrospective Cycle-GAN. In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition, pages 1811–1820, 2019.

[15] Jerry Li, Aleksander Madry, John Peebles, and Lud-wig Schmidt. Towards Understanding the Dynamicsof Generative Adversarial Networks. arXiv preprint

arXiv:1706.09884, 2017.[16] Ke Li and Jitendra Malik. Implicit Maximum Likelihood

Estimation. arXiv preprint arXiv:1809.09087, 2018.[17] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual-

Motion GAN for Future-Flow Embedded Video Prediction.In Proceedings of the IEEE International Conference onComputer Vision, pages 1744–1752, 2017.

[18] Haibin Ling and Kazunori Okada. Diffusion Distance forHistogram Comparison. In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition, volume 1,pages 246–253, 2006.

[19] William Lotter, Gabriel Kreiman, and David Cox. Deep Pre-dictive Coding Networks for Video Prediction and Unsuper-vised Learning. arXiv preprint arXiv:1605.08104, 2016.

[20] Mario Lucic, Karol Kurach, Marcin Michalski, SylvainGelly, and Olivier Bousquet. Are GANs created equal? ALarge-scale Study. In Advances in Neural Information Pro-cessing Systems, pages 700–709, 2018.

[21] Mehdi Mirza and Simon Osindero. Conditional Genera-tive Adversarial Networks. arXiv preprint arXiv:1411.1784,2014.

[22] Adam Paszke, Sam Gross, Soumith Chintala, GregoryChanan, Edward Yang, Zachary DeVito, Zeming Lin, AlbanDesmaison, Luca Antiga, and Adam Lerer. Automatic Dif-ferentiation in PyTorch. In NIPS AutoDiff Workshop, 2017.

[23] Alec Radford, Luke Metz, and Soumith Chintala. Un-supervised Representation Learning with Deep Convolu-tional Generative Adversarial Networks. arXiv preprintarXiv:1511.06434, 2015.

[24] Sebastian Ruder. An Overview of Gradient Descent Op-timization Algorithms. arXiv preprint arXiv:1609.04747,2016.

[25] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,Aditya Khosla, Michael Bernstein, et al. ImageNet: Large-Scale Visual Recognition Challenge. International Journalof Computer Vision, 115(3):211–252, 2015.

[26] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. Tem-poral Generative Adversarial Nets with Singular Value Clip-ping. In Proceedings of the IEEE International Conferenceon Computer Vision, pages 2830–2839, 2017.

[27] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, VickiCheung, Alec Radford, and Xi Chen. Improved Techniquesfor Training GANs. In Advances in Neural Information Pro-cessing Systems, pages 2234–2242, 2016.

[28] Florian Schroff, Dmitry Kalenichenko, and James Philbin.FACENET: A Unified Embedding for Face Recognition andClustering. In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition, pages 815–823, 2015.

[29] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, JasmineHsu, Eric Jang, Stefan Schaal, Sergey Levine, and GoogleBrain. Time-Contrastive Networks: Self-Supervised Learn-ing from Video. In Proceedings of the IEEE InternationalConference on Robotics and Automation, pages 1134–1141,2018.

[30] Aliaksandr Siarohin, Stephane Lathuiliere, Sergey Tulyakov,Elisa Ricci, and Nicu Sebe. Animating Arbitrary Objects viaDeep Motion Transfer. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition, pages

9

2377–2386, 2019.[31] Karen Simonyan and Andrew Zisserman. Very Deep Convo-

lutional Networks for Large-scale Image Recognition. arXivpreprint arXiv:1409.1556, 2014.

[32] L Theis, A van den Oord, and M Bethge. A Note on theEvaluation of Generative Models. In Proceedings of the In-ternational Conference on Learning Representations, pages1–10, 2016.

[33] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and JanKautz. MOCOGAN: Decomposing Motion and Content forVideo Generation. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 1526–1535, 2018.

[34] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba.Generating Videos with Scene Dynamics. In AdvancesIn Neural Information Processing Systems, pages 613–621,2016.

[35] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu,Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-Video Synthesis. In Advances in Neural Information Pro-cessing Systems, pages 1144–1156, 2018.

[36] Tsun-Hsuan Wang, Yen-Chi Cheng, Chieh Hubert Lin,Hwann-Tzong Chen, and Min Sun. Point-to-Point VideoGeneration. In Proceedings of the IEEE International Con-ference on Computer Vision, 2019.

[37] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. LossFunctions for Neural Networks for Image Processing. arXivpreprint arXiv:1511.08861, 2015.

10

Non-Adversarial Video Synthesis with Learned Priors(Supplementary Material)

Page # Content12 Dataset Descriptions12 Implementation Details

• Hyper-parameters• Other details

13 More Qualitative Examples• Qualitative examples on Chair-CAD [2]• Qualitative examples on Weizmann Human Action [9]• Qualitative examples on Golf scene [34]• Interpolation examples on Chair-CAD [2] and Weizmann Human Action [9]

Table 5: Supplementary Material Overview.

11

A. Dataset Descriptions

Chair-CAD

Weizmann Human Action

Golf Scene

Figure 8: Sample videos from datasets used in the paper. Two unprocessed video examples from Chair-CAD [2], Weiz-mann Human Action [21], and Golf Scene [34] datasets have been presented here. As seen from the examples, datasets arediverse in nature, different in categories and present unique challenges in learning the transient and static portions of thevideos. Best viewed in color.

Chair-CAD [2]. This dataset provides 1393 chair-CAD models. Each model frame sequence is produced using twoelevation angles in addition to thirty one azimuth angles. All the chair models have been designed to be at a fixed distancewith respect to the camera. The authors provide four video sequences per CAD model. We choose the first 16 frames of eachvideo for our paper, and consider the complete dataset as one class.

Weizmann Human Action [9]. This dataset is a collection of 90 video sequences showing nine different identities per-forming 10 different actions, namely, run, walk, skip, jumping-jack (or ‘jack’), jump-forward-on-two-legs(or ‘jump’), jump-in-place-on-two-legs (or ‘pjump’), gallopsideways (or ‘side’), wave-two-hands (or‘wave2’), waveone-hand (or ‘wave1’), and bend. We randomly choose 16 consecutive frames for every video in eachiteration during training.

Golf Scene [34]. [34] released a dataset containing 35 million clips (32 frames each) stabilized by SIFT+RANSAC. Itcontains several categories filtered by a pre-trained Place-CNN model, one of them being the Golf scenes. The Golf scenedataset contains 20,268 golf videos. Due to many non-golf videos being part of the golf category (due to inaccurate labels),this dataset presents a particularly challenging data distribution for our proposed method. Note that for a fair comparison,we further selected our training set videos from this provided dataset pertaining to golf action as close as possible. We thentrained the VGAN [34] model on this selected videos for a fair comparison.

B. Implementation DetailsWe used Pytorch [22] for our implementation. The Adam optimizer [13], with ε = 10−8, β1 = 0.9 and β2 = 0.999, was

used to update the model weights and SGD optimizer [24], with momentum = 0.9, was used to update the latent spaces. Thecorresponding learning rate for the generator τG , the RNN τR, and the latent spaces τZV

were set as values indicated in Tab. 6.Hyper-parameters. [26, 33, 34] that generate videos from latent priors have no dataset split as the task is to synthesize

high quality videos from the data distribution, and then evaluate the model performance. All hyperparameters (except Ds,Dt) are set as described in [28, 29, 33, 35] (e.g. α from [29]). For Ds and Dt, we follow the strategy used in Sec. 4.3 of [33]and observe that our model generates videos with good visual quality (FCS) and plausible motion (MCS) for Chair-CADwhen (Ds, Dt) = (206, 50). Same strategy is used for all datasets. The hyper-parameters employed with respect to each

12

dataset used in this paper is given in Tab. 6. G and R refer to the generator, with weights γ, and RNN, with weight θ,respectively. τ(·) represents the learning rate. µ(·) represents the number of epochs. Ds andDt refer to the static and transientlatent dimensions, respectively. λs and λt refer to the static loss, and triplet loss regularization constants, respectively. α isthe margin for triplet loss. l refers to the level of the Laplacian pyramid representation used in `rec and `static.

DatasetsHyper-parameters

Ds Dt λs λt α τG τR τZVµγ µZV,θ l

Chair-CAD [2] 206 50 0.01 0.01 2 6.25× 10−5 6.25× 10−5 12.5 5 300 4Weizmann Human Action [9] 56 200 0.01 0.1 2 6.25× 10−5 6.25× 10−3 12.5 5 700 3Golf Scene [34] 56 200 0.01 0.01 2 0.1 0.1 12.5 10 1000 4

Table 6: Hyper-parameters used in all experiments for all datasets.

Other details. We performed all our experiments on a system with 48 core Intel(R) Xeon(R) Gold 6126 processor with256GB RAM. We used NVIDIA GeForce RTX 2080 Ti for all GPU computations during training. Further, NVIDIA TeslaK40 GPUs were used for computation of all evaluation metrics in our experiments. All our implementations are based onnon-optimized PyTorch based codes. Our runtime analysis revealed that it took on average one to two days to train the modeland obtain learned latent vectors.

C. More Qualitative ExamplesIn this section, we provide more qualitative results of generated videos synthesized using our proposed approach on each

dataset (Fig. 9 for Chair-CAD [2] dataset, Fig. 10 for Weizmann Human Action [9] dataset, and Fig. 11 for Golf scene [34]dataset). We also provide more examples interpolation experiment in Fig. 12.

Figure 9: Qualitative results on Chair-CAD [2]. On this large scale dataset, our model is able to capture the intrinsicrotation and color of videos unique to each chair model. This shows the efficacy of our approach, compared to adversarialapproaches such as MoCoGAN [33] which produce the same chair for all videos, with blurry frames (See Fig. 1 of mainmanuscript).

13

Figure 10: Qualitative results on Weizmann Human Action [9]. The videos show that our model produces sharp visualresults with the combination of trained generator, RNN along with 9 identities, and 10 different action latent vectors.

Figure 11: Qualitative results on Golf Scene [34]. Our proposed approach produces visually good results on this particularlychallenging dataset. Due to incorrect labels on the videos, this dataset has many non-golf videos. Our model is still able tocapture the static and transient portion of the videos, although better filtering can still improve our results.

14

⇢ = 4<latexit sha1_base64="2wZxtR0E+1ZKI5/yY3uOKsYMhME=">AAAB73icbVBNSwMxEJ2tX7V+rXr0EiyCp7KrFb0IRS8eK9gPaJeSTbNtaDZZk6xQlv4JLx4U8erf8ea/MW33oK0PBh7vzTAzL0w408bzvp3Cyura+kZxs7S1vbO75+4fNLVMFaENIrlU7RBrypmgDcMMp+1EURyHnLbC0e3Ubz1RpZkUD2ac0CDGA8EiRrCxUrurhhJdo2rPLXsVbwa0TPyclCFHved+dfuSpDEVhnCsdcf3EhNkWBlGOJ2UuqmmCSYjPKAdSwWOqQ6y2b0TdGKVPoqksiUMmqm/JzIcaz2OQ9sZYzPUi95U/M/rpCa6CjImktRQQeaLopQjI9H0edRnihLDx5Zgopi9FZEhVpgYG1HJhuAvvrxMmmcV/7xycV8t127yOIpwBMdwCj5cQg3uoA4NIMDhGV7hzXl0Xpx352PeWnDymUP4A+fzB8mFjyc=</latexit>

⇢ = 8<latexit sha1_base64="CVQi7WDarqFkpserBYVbu1G8nk0=">AAAB73icbVDLSgNBEOyNrxhfqx69DAbBU9j1gbkIQS8eI5gHJEuYncwmQ2Zn1plZISz5CS8eFPHq73jzb5wke9DEgoaiqpvurjDhTBvP+3YKK6tr6xvFzdLW9s7unrt/0NQyVYQ2iORStUOsKWeCNgwznLYTRXEcctoKR7dTv/VElWZSPJhxQoMYDwSLGMHGSu2uGkp0jao9t+xVvBnQMvFzUoYc9Z771e1LksZUGMKx1h3fS0yQYWUY4XRS6qaaJpiM8IB2LBU4pjrIZvdO0IlV+iiSypYwaKb+nshwrPU4Dm1njM1QL3pT8T+vk5qoGmRMJKmhgswXRSlHRqLp86jPFCWGjy3BRDF7KyJDrDAxNqKSDcFffHmZNM8q/nnl8v6iXLvJ4yjCERzDKfhwBTW4gzo0gACHZ3iFN+fReXHenY95a8HJZw7hD5zPH8+Vjys=</latexit>

⇢ = 12<latexit sha1_base64="Q0jbbOCQDhAc7h37RHc/TWMx1CM=">AAAB8HicbVDLSgNBEOyNrxhfUY9eBoPgKexGRS9C0IvHCOYhyRJmJ5NkyDyWmVkhLPkKLx4U8ernePNvnCR70MSChqKqm+6uKObMWN//9nIrq2vrG/nNwtb2zu5ecf+gYVSiCa0TxZVuRdhQziStW2Y5bcWaYhFx2oxGt1O/+US1YUo+2HFMQ4EHkvUZwdZJjx09VOgaBZVuseSX/RnQMgkyUoIMtW7xq9NTJBFUWsKxMe3Aj22YYm0Z4XRS6CSGxpiM8IC2HZVYUBOms4Mn6MQpPdRX2pW0aKb+nkixMGYsItcpsB2aRW8q/ue1E9u/ClMm48RSSeaL+glHVqHp96jHNCWWjx3BRDN3KyJDrDGxLqOCCyFYfHmZNCrl4Kx8cX9eqt5kceThCI7hFAK4hCrcQQ3qQEDAM7zCm6e9F+/d+5i35rxs5hD+wPv8ATgVj2A=</latexit>

⇢ = 15<latexit sha1_base64="aFIiHCaQF/Irc9qfkJve0KmQbGU=">AAAB8HicbVDLSgNBEOyNrxhfUY9eBoPgKeyqQS9C0IvHCOYhyRJmJ5NkyDyWmVkhLPkKLx4U8ernePNvnCR70MSChqKqm+6uKObMWN//9nIrq2vrG/nNwtb2zu5ecf+gYVSiCa0TxZVuRdhQziStW2Y5bcWaYhFx2oxGt1O/+US1YUo+2HFMQ4EHkvUZwdZJjx09VOgaBZVuseSX/RnQMgkyUoIMtW7xq9NTJBFUWsKxMe3Aj22YYm0Z4XRS6CSGxpiM8IC2HZVYUBOms4Mn6MQpPdRX2pW0aKb+nkixMGYsItcpsB2aRW8q/ue1E9u/ClMm48RSSeaL+glHVqHp96jHNCWWjx3BRDN3KyJDrDGxLqOCCyFYfHmZNM7KwXm5cn9Rqt5kceThCI7hFAK4hCrcQQ3qQEDAM7zCm6e9F+/d+5i35rxs5hD+wPv8ATyhj2M=</latexit>

(a) Chair-CAD [2]










Figure 12: More interpolation results. In this figure, ρ represents the rank of transient latent vectors zt. We present theinterpolation results on (a) Chair-CAD dataset, and (b) Weizmann Human Action, for different values of ρ. It can be observedthat as ρ increases, the interpolation becomes clearer.

15

faaich001@, agupt013@, rpand002@, rhyde001@, sasif@ece., … · 2020. 4. 21. · Non-Adversarial...

Documents

Transcript of faaich001@, agupt013@, rpand002@, rhyde001@, sasif@ece., … · 2020. 4. 21. · Non-Adversarial...