
Novel View Synthesis from Single Images via Point Cloud Transformation

Hoang-An Le 1 ([email protected])

Thomas Mensink 2 ([email protected])

Partha Das 1 ([email protected])

Theo Gevers 1 ([email protected])

1 Computer Vision Lab, University of Amsterdam
2 Google Research, Amsterdam

    Abstract

In this paper the argument is made that for true novel view synthesis of objects, where the object can be synthesized from any viewpoint, an explicit 3D shape representation is desired. Our method estimates point clouds to capture the geometry of the object, which can be freely rotated into the desired view and then projected into a new image. This image, however, is sparse by nature and hence this coarse view is used as the input of an image completion network to obtain the dense target view. The point cloud is obtained using the predicted pixel-wise depth map, estimated from a single RGB input image, combined with the camera intrinsics. By using forward warping and backward warping between the input view and the target view, the network can be trained end-to-end without supervision on depth. The benefit of using point clouds as an explicit 3D shape for novel view synthesis is experimentally validated on the 3D ShapeNet benchmark. Source code and data are available at https://github.com/lhoangan/pc4novis

1 Introduction

Novel view synthesis aims to infer the appearance of an object from unobserved points of view. The synthesis of unseen views of objects could be important for image-based 3D object manipulation [18], robot traversability [12], or 3D object reconstruction [29]. Generating a coherent view of unseen parts of an object requires a non-trivial understanding of the object's inherent properties such as (3D) geometry, texture, shading, and illumination.

Different algorithms make use of the provided source images in different ways. Model-based approaches use similar-looking open-stock 3D models [18] or rely on user-interactive construction [2, 26, 31]. Image-based methods [23, 24, 28, 29, 32] assume an underlying parametric model of object appearance conditioned on viewpoints and try to learn it using statistical frameworks. Despite their differences, both approaches use 3D information in predicting new views of objects. The former imposes stronger assumptions on the full 3D structure and shifts the paradigm to obtaining the full models, while the latter captures the 3D information in latent space to cope with (self-)occlusion.

© 2020. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.



Figure 1: Overview of the proposed model for training and inference. From a single input image, the pixel-wise depth map is predicted. The depth map is subsequently used to compute a coarse novel view (forward warping), and trained by making use of backward warping (from the target view back to the source view). The model is trained end-to-end.

The principle is that the generation of a new view of an object is composed of (1) relocating pixels in the source image that will be visible to the corresponding positions in the target view, (2) removing the pixels that will be occluded, and (3) adding disoccluded pixels that are not seen in the source and will be visible in the target view [24]. With the advance of convolutional neural networks (CNNs) and generative adversarial networks (GANs), [24, 28, 32] show that (1) and (2) can be done by learning an appearance flow field that "flows" pixels from a source image to the corresponding positions in the target view, and (3) can be done by a completion network with an adversarial loss.

In this paper, we leverage the explicit use of geometry information in synthesizing novel views. We argue that (1) and (2) can be done in a straightforward manner by obtaining access to the geometry of the objects. The appearance flow [24, 28, 32], which associates pixels of the source view to their positions in the target view, is the projection of the 3D displacement of the objects' points before and after transformation. Occluded object parts can be identified based on the orientation of the object surface normals and the view directions. The argument can also be extended to multiple input images. In this paper, we show that the geometry of an object provides an explicit and natural basis for the problem of novel view synthesis.

In contrast to geometry-based methods, the proposed approach does not require 3D supervision. The method predicts a depth map in a self-supervised manner by formulating the depth estimation problem in the context of novel view synthesis. The predicted depth is used to partly construct the target views and to assist the completion network.

The main contributions of this paper are: (1) a novel methodology for novel view synthesis using explicit transformations of estimated point clouds; (2) an integrated model combining self-supervised monocular depth estimation and novel view synthesis, which can be trained end-to-end; (3) natural extensions to multi-view inputs and full point cloud reconstruction from a single image; and (4) experimental benchmarking to validate the proposed method, which outperforms the current state-of-the-art methods for novel view synthesis.

    2 Related Work

2.1 Geometry-based view synthesis

View synthesis via 3D models. Full models (textured meshes or colored point clouds) of objects or scenes are constructed from multiple images taken from various viewpoints


[4, 20, 27] or are given and aligned interactively by users [18, 26]. The use of 3D models allows for extreme pose estimation, re-texturing, and flexible (re-)lighting by applying rendering techniques [20, 22]. However, obtaining complete 3D models of objects or scenes is a challenging task in itself. Therefore, these approaches require additional user input to identify object boundaries [2, 31], select and align 3D models with image views [18, 26], or use simple texture-mapped 3-planar billboard models [13]. In contrast, the proposed method makes use of partial point clouds of objects constructed from a given source view and does not require a predefined (explicit) 3D model.

View synthesis via depth. Methods using 3D models assume a coherent structure between the desired objects and the obtained 3D models [2, 18]. Synthesis using depth obtains an intermediate representation from depth information. The intermediate representation captures hidden surfaces from one or multiple viewpoints. [37] proposes to use layered depth images, [5] creates 3D plane-sweep volumes by projecting images onto target viewpoints at different depths, [34] uses multi-plane images at fixed distances to the camera, and [3] estimates depth probability volumes to leverage depth uncertainty in occluded regions.

In contrast, the proposed method estimates depth directly from monocular views to partially construct the target views. Self-supervised depth estimation using deep neural networks with photometric re-projection consistency has been researched by several authors [7, 9, 10, 17, 33]. In this paper, we train a self-supervised depth prediction network with novel view synthesis in an end-to-end system.

2.2 Image-based view synthesis

Requiring explicit geometrical structures of objects or scenes as a precursor severely limits the applicability of a method. With the advance of convolutional neural networks (CNNs), generative adversarial networks (GANs) [11] achieve impressive results in image generation, allowing view synthesis without explicit geometrical structures of objects or scenes.

View synthesis via embedded geometry. Zhou et al. [32] propose learning a flow field that maps pixels in input images to their corresponding locations in target views to capture latent geometrical information. [23] learns a volumetric representation in a transformable bottleneck layer, which can generate corresponding views for arbitrary transformations. The former explicitly utilizes input (source) image pixels in constructing new views, either fully [32] or partly, with the rest being filled by a completion network [24, 28]. The latter explicitly applies transformations on the volumetric representation in latent space and generates new views by means of pixel generation networks.

The proposed method takes the best of both worlds. By directly using the object geometry, the source pixels are mapped to their target positions based on the given transformation parameters, hence making the best use of the given information when synthesizing new views. Our approach is fundamentally different from [24]: we estimate the object point cloud using self-supervised depth predictions and obtain coarse target views from purely geometrical transformations, while [24] learns mappings from input images and ground-truth occluded regions to generate coarse target views using one-hot encoded transformation vectors.

View synthesis directly from images. Since the introduction of image-to-image translation [14], there has been a paradigm shift towards pure image-based approaches [29]. [36] synthesizes bird-view images from a single frontal-view image, while [25] generates cross-views of aerial and street-view images. The networks can be trained to predict all the views in an orbit from a single-view object [17, 19], or to generate a view in an iterative manner [6]. Additional features can be embedded, such as view-independent intrinsic properties of objects [30]. In


this paper, we employ GANs to generate complete views, conditioned on the geometrical features and the relative poses between source and target views. Our approach can be interpreted as a reverse and end-to-end process of [17]: we estimate objects' arbitrary new views via point clouds constructed from self-supervised depth maps, while [17] predicts objects' fixed orbit views for 3D reconstruction.

    3 Method

3.1 Point-cloud based transformations

The core of the proposed novel view synthesis method is to use point clouds for geometrically aware transformations. Using the pinhole camera model and known intrinsics $K$, the point cloud can be reconstructed when the pixel-wise depth map ($D$) is available. The camera intrinsics can be obtained by camera calibration, yet for the synthetic data used in our experiments, $K$ is given. A pixel on the source image plane $p_s = [u, v, 1]$ (using homogeneous coordinates) corresponds to a point $P_s = [X, Y, Z]$ in the source camera space:

$$D_s\, p_s^\top = K\, P_s^\top, \qquad P_s^\top = K^{-1} D_s\, p_s^\top \tag{1}$$

Rigid transformations can be obtained by matrix multiplications. The relative transformation to the target viewpoint from the source camera is given by

$$\theta_{s \to t} = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \tag{2}$$

where $R$ denotes the desired rotation matrix and $t$ the translation vector. Points in the target camera view are given by $P_t = \theta_{s \to t} P_s$. This can also be regarded as an image-based flow field $\phi: \mathbb{R}^2 \to \mathbb{R}^2$ parameterized by $\theta_{s \to t}$ (c.f. [24, 28, 32]). The flow field $\phi(p_s; \theta_{s \to t})$ returns the homogeneous coordinates of pixels in the target image for each pixel in the source image:

$$\phi(p_s; \theta_{s \to t}) = K\, \theta_{s \to t}\, K^{-1} D_s\, p_s^\top \tag{3}$$

By observing that $\phi(p_s; \theta_{s \to t}) = D_t\, p_t^\top$, the Cartesian pixel coordinates in the target view can be extracted. The advantage of the flow field interpretation is that it provides a direct mapping between the image planes of the source view and the target view.
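To make the mapping concrete, the following NumPy sketch implements Eqs. (1)-(3) for a single source view. It is an illustrative sketch under the pinhole model described above; the function name flow_field and the array layout are our own assumptions, not the released implementation.

```python
# Illustrative NumPy sketch of Eqs. (1)-(3): back-project pixels with the
# predicted depth and intrinsics K, apply the rigid transform, and re-project
# onto the target image plane. Names and layouts are assumptions.
import numpy as np

def flow_field(depth, K, theta):
    """depth: (H, W) predicted D_s; K: (3, 3) intrinsics;
    theta: (4, 4) relative transform [[R, t], [0, 1]].
    Returns Cartesian target pixel coordinates, shape (H, W, 2)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p_s = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T     # (3, H*W)

    # Eq. (1): P_s = K^{-1} D_s p_s
    P_s = np.linalg.inv(K) @ (p_s * depth.reshape(1, -1))

    # Eq. (2): rigid transform in homogeneous coordinates
    P_t = (theta @ np.vstack([P_s, np.ones((1, P_s.shape[1]))]))[:3]

    # Eq. (3): re-project and divide by the target depth D_t
    p_t = K @ P_t
    return (p_t[:2] / p_t[2:3]).T.reshape(H, W, 2)
```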

    Forward warping The flow field is used to generate the target view from the source:

$$\tilde{I}_t(\phi(p_s; \theta_{s \to t})) = I_s(p_s) \tag{4}$$

The resulting image is sparse due to the discrete pixel coordinates and (dis)occluded regions, see Fig. 2 (top-right). It is used as input to the image completion network (Sec. 3.2).
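A minimal sketch of the forward warping in Eq. (4), assuming the flow field from the previous sketch: source pixels are scattered to their rounded target coordinates, which makes the sparsity explicit.

```python
# Minimal forward-warping sketch (Eq. 4): scatter source pixels to their
# rounded target coordinates; out-of-image pixels are dropped and collisions
# are resolved arbitrarily (occlusion is handled separately in Sec. 3.2).
import numpy as np

def forward_warp(image, flow):
    """image: (H, W, 3) source view I_s; flow: (H, W, 2) target coordinates."""
    H, W, _ = image.shape
    target = np.zeros_like(image)
    coords = np.round(flow).astype(int)                  # discrete target pixels
    u_t, v_t = coords[..., 0], coords[..., 1]
    valid = (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    target[v_t[valid], u_t[valid]] = image[valid]        # I_t(phi(p_s)) = I_s(p_s)
    return target                                        # sparse coarse view
```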

    Backward warping The flow field is used to generate the source view from the target:

$$\tilde{I}_s(p_s) = I_t(\phi(p_s; \theta_{s \to t})) \tag{5}$$

The process assigns a value to every pixel $(u, v)$ in $\tilde{I}_s$, resulting in a dense image, as illustrated in Fig. 2 (bottom-right). The generated source view may contain artifacts due to (dis)occlusion in the target view. To sample $\phi(p_s; \theta_{s \to t})$ from $I_t$, a differentiable bilinear sampling layer [15] is used. The generated source view is used for self-supervised monocular depth prediction (Sec. 3.3).
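The backward warping of Eq. (5) can be sketched with a differentiable bilinear sampler; here PyTorch's grid_sample is used as a stand-in for the spatial transformer layer of [15]. The tensor layout and names are assumptions for illustration.

```python
# Backward-warping sketch (Eq. 5) with a differentiable bilinear sampler.
import torch
import torch.nn.functional as F

def backward_warp(target_image, flow):
    """target_image: (N, 3, H, W) tensor I_t; flow: (N, H, W, 2) pixel coordinates."""
    N, _, H, W = target_image.shape
    # grid_sample expects x/y coordinates normalized to [-1, 1]
    grid = torch.empty_like(flow)
    grid[..., 0] = 2.0 * flow[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * flow[..., 1] / (H - 1) - 1.0
    # Every source pixel receives a bilinearly interpolated target value
    return F.grid_sample(target_image, grid, mode='bilinear', align_corners=True)
```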


Figure 2: Illustration of the forward and backward warping operation of point clouds. The forward warping is used to generate a coarse target view, while the backward warping is used to reconstruct the source view from a target view for self-supervised depth estimation.

    3.2 Novel view synthesis

The point-cloud-based forward warping relocates the visible pixels of the object in the source view to their corresponding positions in the target view. For novel view synthesis, however, two more steps are required: (1) obtaining the target coarse view by discarding occluded pixels, and (2) filling in the pixels that are not seen in the source view.

Coarse view construction. The goal is to remove the pixels which are seen in the source view yet should not be visible in the target view due to occlusion. To this end, pixels that have surface normals (after transformation) pointing away from the viewing direction are removed, similarly to [24]. Surface normals are obtained from normalized depth gradients.
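One simple way to realize this backface culling is sketched below; the normal estimate via cross products of local point-cloud gradients and the sign convention are our assumptions, not the authors' exact procedure.

```python
# Occlusion-removal sketch: approximate normals from the back-projected
# point cloud and discard pixels whose transformed normals face away from
# the target camera. Names and conventions are illustrative assumptions.
import numpy as np

def visibility_mask(depth, K, R, t):
    """depth: (H, W) source depth, K: (3, 3) intrinsics, (R, t): source-to-target pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T
    pts = (np.linalg.inv(K) @ (p * depth.reshape(1, -1))).T.reshape(H, W, 3)

    # Approximate surface normals via cross products of local gradients
    dx = np.gradient(pts, axis=1)
    dy = np.gradient(pts, axis=0)
    normals = np.cross(dx, dy)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8

    # Rotate normals into the target frame; viewing rays run from the target
    # camera centre to the transformed points
    n_t = normals.reshape(-1, 3) @ R.T
    P_t = pts.reshape(-1, 3) @ R.T + t
    view = P_t / (np.linalg.norm(P_t, axis=-1, keepdims=True) + 1e-8)

    # Keep pixels whose surface opposes the viewing ray (front-facing);
    # the sign may need flipping depending on depth/axis conventions
    return (np.sum(n_t * view, axis=-1) < 0).reshape(H, W)
```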

An illustration of the coarse view construction is shown in Fig. 3 for different target views. The first row depicts the target views, the second row indicates the parts visible from the input image (third column). The third and fourth rows show the coarse view with and without occlusion removal (or backface culling). Finally, the fifth row shows an enhanced version of the coarse view, where the object is assumed to be left-right symmetric [24]. The proposed method directly identifies and removes occluded pixels from the input view using the estimated depth, in contrast to [24], where ground-truth visibility masks are required for each target view to train a visibility prediction network.

View completion. The obtained coarse view is already in the target viewpoint, but it remains sparse. To synthesize the final dense image, an image completion network is used.

The completion network uses the hourglass architecture [21]. Following [24], we concatenate the depth bottleneck features and the embedded transformation to the completion network bottleneck. By conditioning the completion network on the input features and the desired transformation $\theta_{s \to t}$, the network can fix artifacts and errors due to the estimated depth and cope better with extreme pose transformations, i.e. when the coarse view image is nearly empty (e.g. columns 9-11 in Fig. 3).

The image completion network is trained in a GAN manner using a generator $G$, a discriminator $D$, an input image $I_s$, and a target image $I_t$. The combination of losses used is given by

$$\mathcal{L}_D = \big(D(I_s) - 1\big)^2 + D\big(G(I_s)\big)^2, \qquad \text{LS-GAN discriminator loss} \tag{6}$$
$$\mathcal{L}_G = \big[1 - \mathrm{SSIM}(I_t, G(I_s))\big] + \big\| I_t - G(I_s) \big\|_1, \qquad \text{generator loss} \tag{7}$$
$$\mathcal{L}_{\mathrm{Perc}} = \big\| F^{D}_{I_t} - F^{D}_{G(I_s)} \big\|_2 + \big\| F^{\mathrm{VGG}}_{I_t} - F^{\mathrm{VGG}}_{G(I_s)} \big\|_2, \qquad \text{perceptual loss} \tag{8}$$


Figure 3: Image coarse views for different target viewpoints. The input image is depicted in the third column (red box). From top to bottom: (1) target views, (2) source region visible in each target viewpoint, coarse view (3) naive, (4) with occlusion removal, and (5) with occlusion removal and symmetry.

where the perceptual loss uses $F^{D}$ ($F^{\mathrm{VGG}}$) to denote features extracted from the images $I_t$ and $G(I_s)$ by the discriminator network and a pre-trained VGG network, respectively, c.f. [16]. SSIM denotes the structural similarity index measure, see Sec. 4. The total loss is given by:

$$\mathcal{L}_c = w_1 \mathcal{L}_D + w_2 \mathcal{L}_G + w_3 \mathcal{L}_{\mathrm{Perc}}, \tag{9}$$

where $w_i$ denotes the weighting of the losses ($w_1 = 1$, $w_2 = 100$, and $w_3 = 100$, c.f. [24]).
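A hedged PyTorch sketch of Eqs. (6)-(9) is given below; ssim, feat_D, and feat_vgg are assumed helpers (an SSIM implementation and feature extractors from the discriminator and a pre-trained VGG), and the squared-error calls stand in for the L2 norms of Eq. (8).

```python
# Sketch of the completion-network losses, Eqs. (6)-(9); helpers are assumed.
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake):
    # Eq. (6): LS-GAN objective for the discriminator
    return (D(real) - 1).pow(2).mean() + D(fake).pow(2).mean()

def generator_loss(fake, target, ssim):
    # Eq. (7): structural dissimilarity plus L1 reconstruction
    return (1 - ssim(target, fake)) + F.l1_loss(fake, target)

def perceptual_loss(fake, target, feat_D, feat_vgg):
    # Eq. (8): feature distances w.r.t. discriminator and VGG features
    return (F.mse_loss(feat_D(target), feat_D(fake))
            + F.mse_loss(feat_vgg(target), feat_vgg(fake)))

def completion_loss(l_D, l_G, l_perc, w=(1.0, 100.0, 100.0)):
    # Eq. (9): weighted sum with w1 = 1, w2 = w3 = 100 (c.f. [24])
    return w[0] * l_D + w[1] * l_G + w[2] * l_perc
```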

3.3 Self-supervised monocular depth estimation

The discussion so far has assumed that pixel-wise depth maps are available. In this section, the method used to estimate depth from a single RGB image is detailed. In order to make minimal assumptions about the training data, self-supervised methods are considered, which do not require ground-truth depth [7, 9, 10, 17, 33].

For the depth prediction, an encoder-decoder network with a bottleneck architecture is used, similar to [10]. The network is optimised using a set of (reconstruction) losses between the source image $I_s$ and its synthesized version $\tilde{I}_s$, using the backward warping, Eq. (5), from a second (target) image $I_t$ and the predicted depth map. The underlying rationale is that a more realistic depth map will have a lower reconstruction loss.

    The losses are given by:

$$\mathcal{L}_p(I_s, \tilde{I}_s) = \frac{\alpha}{2}\big[1 - \mathrm{SSIM}(I_s, \tilde{I}_s)\big] + (1 - \alpha)\,\big\| I_s - \tilde{I}_s \big\|_1, \qquad \text{photometric loss} \tag{10}$$
$$\mathcal{L}_s(d) = |\partial_x d|\, e^{-|\partial_x I_s|} + |\partial_y d|\, e^{-|\partial_y I_s|}, \qquad \text{smoothness loss [9]} \tag{11}$$
$$\mathcal{L}_d(I_s, \tilde{I}_s, d) = \mu\, \mathcal{L}_p(I_s, \tilde{I}_s) + w_d\, \mathcal{L}_s, \qquad \text{total loss} \tag{12}$$

where $\alpha = 0.85$, $d = D/\overline{D}$ is the mean-normalized inverse depth, $w_d = 10^{-3}$, and $\mu$ is an indicator function which equals 1 iff the photometric loss $\mathcal{L}_p(I_s, \tilde{I}_s) < \mathcal{L}_p(I_s, I_t)$, see [10] for more details. The smoothness loss encourages nearby pixels to have similar depths, while the artifacts due to (dis)occlusion are excluded by the per-pixel minimum-projection mechanism.
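The losses of Eqs. (10)-(12) can be sketched as follows, in the style of [10]; ssim is again an assumed helper returning a mean SSIM score, and the scalar indicator follows Eq. (12) as written (in [10] the masking is applied per pixel).

```python
# Sketch of the self-supervised depth losses, Eqs. (10)-(12).
import torch

def photometric_loss(I_s, I_rec, ssim, alpha=0.85):
    # Eq. (10): SSIM term plus L1 term
    return (alpha / 2) * (1 - ssim(I_s, I_rec)) + (1 - alpha) * (I_s - I_rec).abs().mean()

def smoothness_loss(disp, I_s):
    # Eq. (11): edge-aware smoothness on mean-normalized inverse depth d
    d = disp / (disp.mean(dim=(2, 3), keepdim=True) + 1e-7)
    dx = (d[:, :, :, 1:] - d[:, :, :, :-1]).abs()
    dy = (d[:, :, 1:, :] - d[:, :, :-1, :]).abs()
    wx = torch.exp(-(I_s[:, :, :, 1:] - I_s[:, :, :, :-1]).abs().mean(1, keepdim=True))
    wy = torch.exp(-(I_s[:, :, 1:, :] - I_s[:, :, :-1, :]).abs().mean(1, keepdim=True))
    return (dx * wx).mean() + (dy * wy).mean()

def depth_loss(I_s, I_rec, I_t, disp, ssim, w_d=1e-3):
    # Eq. (12): mu = 1 iff warping reconstructs I_s better than I_t itself does
    mu = (photometric_loss(I_s, I_rec, ssim) < photometric_loss(I_s, I_t, ssim)).float()
    return mu * photometric_loss(I_s, I_rec, ssim) + w_d * smoothness_loss(disp, I_s)
```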


Table 1: Quality of the coarse and completed views (a) and ablation study (b). The proposed transformation of estimated point clouds generates better coarse images than image-based predicted flow fields. Subsequently, the quality of the completed view is improved. The ablation study shows that the best results are obtained by using LSGAN (LS), perceptual loss (PL), symmetry (Sym), bottleneck inter-connection (IC), and SSIM loss (SL).

(a) Coarse vs Completed

Method      | Coarse view L1 (↓) | Coarse view SSIM (↑) | Completed view L1 (↓) | Completed view SSIM (↑)
DOAFN [24]  | .220 | .876 | .121 | .910
M2NV [28]   | .226 | .879 | .154 | .906
Ours        | .203 | .882 | .118 | .924

(b) Ablation study

LS PL Sym IC SL | L1 (↓) | SSIM (↑)
X X             | .118   | .924
X X X           | .101   | .939
X X             | .103   | .939
X X             | .107   | .933
X X X X         | .097   | .939
X X X X X       | .097   | .942

4 Experiments

In this section, the proposed method is analysed on the 3D ShapeNet benchmark, including an ablation study of the effects of the different components and a state-of-the-art comparison.

Dataset. We use the object-centered car and chair images rendered from the 3D ShapeNet models [1] using the same render engine¹ and setup as in [23, 24, 28, 32]. Specifically, there are 7497 car and 698 chair models with high-quality textures, split 80%/20% for training and testing. The images are rendered at 18 azimuth angles (in [0°, 340°], 20° separation) and 3 elevation angles (0°, 10°, 20°). Input and output images are of size 256×256.

Metrics. We evaluate the generated images using the standard L1 pixel-wise error (normalized to the range [0, 1], lower is better) and the structural similarity index measure (SSIM) [35] (value range [−1, 1], higher is better). L1 indicates the proximity of pixel values between a completed image and the target, while SSIM measures the perceived quality and structural similarity between the images.
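For reference, a small sketch of the two metrics, using scikit-image's SSIM implementation as a stand-in (the benchmark's exact evaluation code may differ):

```python
# Metric sketch: mean L1 error on [0, 1] images and SSIM [35].
import numpy as np
from skimage.metrics import structural_similarity  # use multichannel=True on older scikit-image

def evaluate(pred, target):
    """pred, target: (H, W, 3) float arrays with values in [0, 1]."""
    l1 = np.abs(pred - target).mean()          # lower is better
    ssim = structural_similarity(pred, target, channel_axis=-1, data_range=1.0)
    return l1, ssim                            # higher SSIM is better
```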

Baselines. We compare the results of our method with the following state-of-the-art methods: AFN [32], TVSN [24], M2NV [28], and TBN [23].

    4.1 Initial experiments

Comparison to image-based completion. In this section, we compare the intermediate views generated by forward warping of estimated point clouds with those generated by the image-based flow field prediction of DOAFN [24] and M2NV [28]. For this experiment, the coarse views after occlusion removal and left-right symmetry enhancement are used. The image completion network is the basic variant, using DCGAN, without bottleneck inter-connections. The results are shown in Table 1(a). The transformation of estimated point clouds provides coarse views which are closer to the target view, and these help to obtain higher-quality completed views.

Ablation study. We analyze the effects of the different components of the proposed pipeline. The results are shown in Table 1(b). The use of the LSGAN loss shows a relatively large improvement over the traditional DCGAN. The drop in performance when removing the symmetry

¹ The specific render engine and setup are used to guarantee a fair comparison with the reported methods, as none of the author-provided weights perform at a similar level on images rendered with a different rendering setup.


Methods    | cars L1 (↓) | cars SSIM (↑) | chairs L1 (↓) | chairs SSIM (↑)

Fixed elevations
AFN [32]   | .148 | .877 | .229 | .871
TVSN [24]  | .119 | .913 | .202 | .889
M2NV [28]  | .098 | .923 | .181 | .895
TBN [23]   | .091 | .927 | .178 | .895
Ours       | .090 | .945 | .175 | .914

Random elevations
TBN        | .199 | .910 | .215 | .902
Ours       | .122 | .934 | .207 | .905

Table 2: Quantitative comparison with state-of-the-art methods on novel view synthesis: our method consistently performs (slightly) better than the other methods for both categories, whether the target views have the same or different elevation angles as the input views.

assumption shows the importance of prior knowledge about the target objects, which is intuitive. The inter-connection from the depth network and the embedded transformation to the completion network allows the model to not rely solely on the intermediate views. This is important for overcoming errors and artifacts which occur in the coarse images (due to inevitable uncertainties in depth prediction) and, in general, generates higher-quality images. The SSIM loss, first employed by [23], shows an improvement in the SSIM metric, which is intuitive as the training objective is closer to the evaluation metric.

    4.2 Comparison to State-of-the-Art

In this section, the proposed method is compared with state-of-the-art methods. The quantitative results are shown in Table 2. The proposed method consistently performs (slightly) better on both evaluation metrics for both types of objects. Qualitative results are shown in Fig. 4, where challenging cases are shown in the last 2 rows. Notice the better ability of methods that explicitly use input pixel values in generating new views, compared to TBN, in retaining objects' textures (such as color patterns and text on cars). The results on cars are consistently higher than those on chairs due to the intricate structures of chairs. However, by having access to the object geometry, geometrical assumptions such as symmetry and occlusion can be applied directly to the intermediate views (instead of having to be learned from annotated data, c.f. [24]), which creates better views for near-symmetric targets. High-quality qualitative results and more analyses can be found in the supplementary materials.

Table 2 also shows the evaluation when the target viewpoints are at different elevation angles. Methods such as AFN, TVSN, and M2NV encode the transformation as one-hot vectors and are thus limited to operating within a pre-defined set of transformations (18 azimuth angles, same elevation). This is not the case for our method and TBN, which apply direct transformations. We use the same azimuth angles as in the standard test set while randomly sampling new elevation angles for the input images from (0°, 10°, 20°). The results are shown for networks trained with the regular fixed-elevation settings. The new transformations produce different statistics from those the networks have been trained on, resulting in a performance drop for both methods. Nevertheless, the proposed method still maintains high-quality image synthesis.


(Figure 4, image grid; columns from left to right: Input, TVSN, M2NV, TBN, Ours, Target.)

Figure 4: Qualitative comparisons of synthesized cars (left) and chairs (right), given a single input image (first column) and a given target view (last column). The last two rows show more challenging examples. The proposed method better captures the geometry of the object and the fine (texture) details. More examples are provided in the supplementary materials.

    4.3 Multi-View Synthesis and Point Cloud Reconstruction

Multi-view inputs. The proposed method can be naturally extended to use multi-view inputs as follows: for each image, depth is predicted independently and the results are combined into a single point cloud. The resulting coarse target image will be denser when more images are used, and is passed through the image completion network.

In this experiment, the model trained for single-view prediction is used and evaluated using multiple (1 to 8) input images. The results in Table 3 show that the quality of the coarse view increases, as expected, when more input images are used and hence the point clouds are denser. Surprisingly, however, the image completion network only marginally improves, indicating that the coarse view contains enough information for the image completion network to synthesize a high-quality target image.
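A sketch of this extension with illustrative names: each view is back-projected with its own predicted depth and intrinsics, transformed into the common target frame, and the point clouds are concatenated before forward warping.

```python
# Illustrative multi-view sketch: combine per-view back-projections into one
# point cloud in the target frame. Names and shapes are assumptions.
import numpy as np

def merge_point_clouds(depths, images, Ks, thetas):
    """depths[i]: (H, W), images[i]: (H, W, 3), thetas[i]: (4, 4) view-i -> target."""
    points, colors = [], []
    for depth, image, K, theta in zip(depths, images, Ks, thetas):
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        p = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T
        P = np.linalg.inv(K) @ (p * depth.reshape(1, -1))            # camera-space points
        P = (theta @ np.vstack([P, np.ones((1, P.shape[1]))]))[:3]   # into the target frame
        points.append(P.T)
        colors.append(image.reshape(-1, 3))
    return np.concatenate(points), np.concatenate(colors)            # denser combined cloud
```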

No. views | Coarse L1 (↓) | Coarse SSIM (↑) | Final L1 (↓) | Final SSIM (↑)
1 | .203 | .882 | .090 | .945
2 | .188 | .888 | .089 | .945
4 | .152 | .906 | .085 | .946
8 | .111 | .907 | .084 | .947

Table 3: Performance when extending single-view-trained networks to multi-view inputs.

Point cloud reconstruction. In this final experiment, the aim is to reconstruct a full, dense point cloud from a single image, using the models trained for novel view synthesis. In order to do so, 360° views are generated from a single view of an object, see Fig. 5 (top). Each of these views is fed to the depth estimation network and the estimated depth is used to generate a partial point cloud. These point clouds are stitched together, using the corresponding transformations, resulting in a high-quality dense point cloud, as shown in Fig. 5 (bottom).

    4.4 Results on real-world imagery

We apply the trained car model to the car images of the real-imagery ALOI dataset [8], consisting of 100 objects captured at 72 viewing angles. We use 4 cars for fine-tuning only the depth network, which requires no ground truth, while the image completion network is left untouched. The qualitative results on the remaining 3 cars are shown in Fig. 6.


Figure 5: Top: 360° view generation from single input images (first column). Bottom: point cloud reconstruction using the estimated depths of each generated view. Depth estimation trained on real images can perform well on synthesized ones.

(Figure 6, image grid; columns repeat: Input, Target, Ours.)

Figure 6: Qualitative results on the real-imagery ALOI dataset [8]. The inputs and targets are shown with the provided object masks, while the synthesized images are shown with predicted masks. The completion network does not need to be fine-tuned, yet provides competent results.

    5 Conclusion

In this paper, partial point clouds are estimated from a single image by a self-supervised depth prediction network and used to obtain a coarse image in the target view. The final image is produced by an image completion network which uses the coarse image as input. Experimentally, the proposed method outperforms the current state-of-the-art methods on the ShapeNet benchmark for novel view synthesis. Qualitative results show high-quality, dense point clouds, obtained from a single image by synthesizing and combining 360° views. Based on these results, we conclude that point clouds are a suitable, geometry-aware representation for true novel view synthesis.

References

[1] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical report, arXiv:1512.03012, 2015.


[2] Tao Chen, Zhe Zhu, Ariel Shamir, Shi-Min Hu, and Daniel Cohen-Or. 3-Sweep: Extracting Editable Objects from a Single Photo. ACM Trans. Graph., 32(6), 2013.

[3] Inchang Choi, Orazio Gallo, Alejandro Troccoli, Min H. Kim, and Jan Kautz. Extreme View Synthesis. In ICCV, 2019.

[4] Paul E. Debevec, Camillo J. Taylor, and Jitendra Malik. Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach. In SIGGRAPH, 1996.

[5] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deep Stereo: Learning to Predict New Views from the World's Imagery. In CVPR, pages 5515–5524, 2016.

[6] Ysbrand Galama and Thomas Mensink. IterGANs: Iterative GANs to learn and control 3D object transformation. Computer Vision and Image Understanding, 2019.

[7] Ravi Garg, B. G. Vijay Kumar, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.

[8] Jan-Mark Geusebroek, Gertjan J. Burghouts, and Arnold W. M. Smeulders. The Amsterdam Library of Object Images. International Journal of Computer Vision, 61, 2005.

[9] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In CVPR, 2017.

[10] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into Self-Supervised Monocular Depth Prediction. In ICCV, 2019.

[11] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In NeurIPS, 2014.

[12] N. Hirose, A. Sadeghian, F. Xia, R. Martín-Martín, and S. Savarese. VUNet: Dynamic Scene View Synthesis for Traversability Estimation Using an RGB Camera. IEEE Robotics and Automation Letters, 4(2):2062–2069, 2019.

[13] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Automatic Photo Pop-Up. In SIGGRAPH, 2005.

[14] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. In CVPR, pages 5967–5976, 2017.

[15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial Transformer Networks. In NeurIPS, 2015.

[16] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

[17] A. Johnston and G. Carneiro. Single View 3D Point Cloud Reconstruction using Novel View Synthesis and Self-Supervised Depth Estimation. In Digital Image Computing: Techniques and Applications (DICTA), pages 1–8, 2019.


[18] Natasha Kholgade, Tomas Simon, Alexei Efros, and Yaser Sheikh. 3D Object Manipulation in a Single Photograph Using Stock 3D Models. ACM Trans. Graph., 33(4):127:1–127:12, 2014.

[19] B. Kicanaoglu, R. Tao, and A. W. M. Smeulders. Estimating small differences in car-pose from orbits. In British Machine Vision Conference, 2018.

[20] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R. Pandey, N. Snavely, and R. Martin-Brualla. Neural Rerendering in the Wild. In CVPR, pages 6871–6880, 2019.

[21] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

[22] Thu Nguyen-Phuoc, Chuan Li, Stephen Balaban, and Yong-Liang Yang. RenderNet: A deep convolutional network for differentiable rendering from 3D shapes. In NeurIPS, 2018.

[23] Kyle Olszewski, Sergey Tulyakov, Oliver Woodford, Hao Li, and Linjie Luo. Transformable Bottleneck Networks. In ICCV, 2019.

[24] Eunbyung Park, Jimei Yang, Ersin Yumer, Duygu Ceylan, and Alexander C. Berg. Transformation-grounded image generation network for novel 3D view synthesis. In CVPR, 2017.

[25] Krishna Regmi and Ali Borji. Cross-View Image Synthesis Using Conditional GANs. In CVPR, 2018.

[26] K. Rematas, C. H. Nguyen, T. Ritschel, M. Fritz, and T. Tuytelaars. Novel Views of Objects from a Single Image. IEEE Trans. on PAMI, 39(8):1576–1590, 2017.

[27] Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms. In CVPR, 2006.

[28] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In CVPR, 2018.

[29] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3D Models from Single Images with a Convolutional Network. In ECCV, 2016.

[30] Xiaogang Xu, Ying-Cong Chen, and Jiaya Jia. View Independent Generative Adversarial Network for Novel View Synthesis. In ICCV, 2019.

[31] Youyi Zheng, Xiang Chen, Ming-Ming Cheng, Kun Zhou, Shi-Min Hu, and Niloy J. Mitra. Interactive Images: Cuboid Proxies for Smart Image Manipulation. ACM Trans. Graph., 31(4), 2012.

[32] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A. Efros. View Synthesis by Appearance Flow. In ECCV, 2016.

[33] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised Learning of Depth and Ego-Motion from Video. In CVPR, 2017.


[34] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo Magnification: Learning View Synthesis using Multiplane Images. In SIGGRAPH, 2018.

[35] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Processing, 13(4):600–612, 2004.

[36] Xinge Zhu, Zhichao Yin, Jianping Shi, Hongsheng Li, and Dahua Lin. Generative Adversarial Frontal View to Bird View Synthesis. In International Conference on 3D Vision (3DV), pages 454–463. IEEE, 2018.

[37] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-Quality Video View Interpolation Using a Layered Representation. In SIGGRAPH, 2004.