Pixel2Mesh++: Multi-View 3D Mesh Generation via...

Pixel2Mesh++: Multi-View 3D Mesh Generation via Deformation

Chao Wen1∗ Yinda Zhang2∗ Zhuwen Li3∗ Yanwei Fu1†

1Fudan University 2Google LLC 3Nuro, Inc.

Abstract

We study the problem of shape generation in 3D mesh

representation from a few color images with known camera

poses. While many previous works learn to hallucinate the

shape directly from priors, we resort to further improving

the shape quality by leveraging cross-view information with

a graph convolutional network. Instead of building a di-

rect mapping function from images to 3D shape, our model

learns to predict series of deformations to improve a coarse

shape iteratively. Inspired by traditional multiple view ge-

ometry methods, our network samples nearby area around

the initial mesh’s vertex locations and reasons an optimal

deformation using perceptual feature statistics built from

multiple input images. Extensive experiments show that our

model produces accurate 3D shape that are not only vi-

sually plausible from the input perspectives, but also well

aligned to arbitrary viewpoints. With the help of physically

driven architecture, our model also exhibits generalization

capability across different semantic categories, number of

input images, and quality of mesh initialization.

1. Introduction

3D shape generation has become a popular research topic

recently. With the astonishing capability of deep learning,

lots of works have been demonstrated to successfully gen-

erate the 3D shape from merely a single color image. How-

ever, due to limited visual evidence from only one view-

point, single image based approaches usually produce rough

geometry in the occluded area and do not perform well

when generalized to test cases from domains other than

training, e.g. cross semantic categories.

Adding a few more images (e.g. 3-5) of the object is an

effective way to provide the shape generation system with

more information about the 3D shape. On one hand, multi-

view images provide more visual appearance information,

and thus grant the system with more chance to build the

connection between 3D shape and image priors. On the

other hand, it is well-known that traditional multi-view ge-

∗indicates equal contributions.†indicates corresponding author. This work is supported by the STCSM

project (19ZR1471800), and Eastern Scholar (TP2017006).

Ours

MVP2M

P2M

GT

MeshMesh Align to

Other View Images

(a) (b) (c) (d) (e)

Figure 1. Multi-View Shape Generation. From multiple input

images, we produce shapes aligning well to input (c and d) and ar-

bitrary random (e) camera viewpoint. Single view based approach,

e.g. Pixel2Mesh (P2M) [41], usually generates shape looking

good from the input viewpoint (c) but significantly worse from

others. Naive extension with multiple views (MVP2M, Sec. 4.2)

does not effectively improve the quality.

ometry methods [12] accurately infer 3D shape from corre-

spondences across views, which is analytically well defined

and less vulnerable to the generalization problem. However,

these methods typically suffer from other problems, like

large baselines and poorly textured regions. Though typi-

cal multi-view methods are likely to break down with very

limited input images (e.g. less than 5), the cross-view con-

nections might be implicitly encoded and learned by a deep

model. While well-motivated, there are very few works in

the literature exploiting in this direction, and a naive multi-

view extension of single image based model does not work

well as shown in Fig. 1.

In this work, we propose a deep learning model to gen-

erate the object shape from multiple color images. Espe-

cially, we focus on endowing the deep model with the ca-

pacity of improving shapes using cross-view information.

We resort to designing a new network architecture, named

Multi-View Deformation Network (MDN), which works in

1042

conjunction with the Graph Convolutional Network (GCN)

architecture proposed in Pixel2Mesh [41] to generate accu-

rate 3D geometry shape in the desirable mesh representa-

tion. In Pixel2Mesh, a GCN is trained to deform an initial

shape to the target using features from a single image, which

often produces plausible shapes but lack of accuracy (Fig.

1 P2M). We inherit this characteristic of “generation via

deformation” and further deform the mesh in MDN using

features carefully pooled from multiple images. Instead of

learning to hallucinate via shape priors like in Pixel2Mesh,

MDN reasons shapes according to correlations across dif-

ferent views through a physically driven architecture in-

spired by classic multi-view geometry methods. In partic-

ular, MDN proposes hypothesis deformations for each ver-

tex and move it to the optimal location that best explains

features pooled from multiple views. By imitating corre-

spondences search rather than learning priors, MDN gener-

alizes well in various aspects, such as cross semantic cate-

gory, number of input views, and the mesh initialization.

Besides the above-mentioned advantages, MDN is in ad-

dition featured with several good properties. First, it can be

trained end-to-end. Note that it is non-trivial since MDN

searches deformation from hypotheses, which requires a

non-differentiable argmax/min. Inspired by [20], we ap-

ply a differentiable 3D soft argmax, which takes a weighted

sum of the sampled hypotheses as the vertex deformation.

Second, it works with varying number of input views in a

single forward pass. This requires the feature dimension

to be invariant with the number of inputs, which is typi-

cally broken when aggregating features from multiple im-

ages (e.g. when using concatenation). We achieve the in-

put number invariance by concatenating the statistics (e.g.

mean, max, and standard deviation) of the pooled feature,

which further maintains input order invariance. We find

this statistics feature encoding explicitly provides the net-

work cross-view information, and encourages it to automat-

ically utilize image evidence when more are available. Last

but not least, the nature of “generation via deformation” al-

lows an iterative refinement. In particular, the model output

can be taken as the input, and quality of the 3D shape is

gradually improved throughout iterations. With these de-

siring features, our model achieves the state-of-the-art per-

formance on ShapeNet for shape generation from multiple

images under standard evaluation metrics.

To summarize, we propose a GCN framework that pro-

duces 3D shape in mesh representation from a few observa-

tions of the object in different viewpoints. The core compo-

nent is a physically driven architecture that searches optimal

deformation to improve a coarse mesh using perceptual fea-

ture statistics built from multiple images, which produces

accurate 3D shape and generalizes well across different se-

mantic categories, numbers of input images, and the quality

of coarse meshes.

2. Related Work

3D Shape Representations Since 3D CNN is readily ap-

plicable to 3D volumes, the volume representation has been

well-exploited for 3D shape analysis and generation [4, 42].

With the debut of PointNet [30], the point cloud representa-

tion has been adopted in many works [7, 29]. Most recently,

the mesh representation [19, 41] has become competitive

due to its compactness and nice surface properties. Some

other representations have been proposed, such as geome-

try images [33], depth images [36, 31], classification bound-

aries [26, 3], signed distance function [28], etc., and most

of them require post-processing to get the final 3D shape.

Consequently, the shape accuracy may vary and the infer-

ence take extra time.

Single view shape generation Classic single view shape

reasoning can be traced back to shape from shading [6, 45],

texture [25], and de-focus [8], which only reason the visible

parts of objects. With deep learning, many works leverage

the data prior to hallucinate the invisible parts, and directly

produce shape in 3D volume [4, 9, 43, 11, 32, 37, 16], point

cloud [7], mesh models [19], or as an assembling of shape

primitive [40, 27]. Alternatively, 3D shape can be also gen-

erated by deforming an initialization, which is more related

to our work. Tulsiani et al. [39] and Kanazawa et al. [17]

learn a category-specific 3D deformable model and reasons

the shape deformations in different images. Wang et al. [41]

learn to deform an initial ellipsoid to the desired shape in

a coarse to fine fashion. Combining deformation and as-

sembly, Huang et al. [14] and Su et al. [34] retrieve shape

components from a large dataset and deform the assembled

shape to fit the observed image. Kuryenkov et al. [22] learns

free-form deformations to refine shape. Even though with

impressive success, most deep models adopt an encoder-

decoder framework, and it is arguable if they perform shape

generation or shape retrieval [38].

Multi-view shape generation Recovering 3D geometry

from multiple views has been well studied. Traditional

multi-view stereo (MVS) [12] relies on correspondences

built via photo-consistency and thus it is vulnerable to large

baselines, occlusions, and texture-less regions. Most re-

cently, deep learning based MVS models have drawn atten-

tion, and most of these approaches [44, 13, 15, 46] rely on

a cost volume built from depth hypotheses or plane sweeps.

However, these approaches usually generate depth maps,

and it is non-trivial to fuse a full 3D shape from them.

On the other hand, direct multi-view shape generation uses

fewer input views with large baselines, which is more chal-

lenging and has been less addressed. Choy et al. [4] propose

a unified framework for single and multi-view object gen-

eration reading images sequentially. Kar et al. [18] learn

a multi-view stereo machine via recurrent feature fusion.

Gwak et al. [10] learns shapes from multi-view silhouettes

1043

Figure 2. System Pipeline. Our whole system consists of a 2D CNN extracting image features and a GCN deforming an ellipsoid to target

shape. A coarse shape is generated from Pixel2Mesh and refined iteratively in Multi-View Deformation Network. To leverage cross-view

information, our network pools perceptual features from multiple input images for hypothesis locations in the area around each vertex and

predicts the optimal deformation.

by ray-tracing pooling and further constrains the ill-posed

problem using GAN. Our approach belongs to this cate-

gory but is fundamentally different from the existing meth-

ods. Rather than sequentially feeding in images, our method

learns a GCN to deform the mesh using features pooled

from all input images at once.

3. Method

Our model receives multiple color images of an object

captured from different viewpoints (with known camera

poses) and produces a 3D mesh model in the world coordi-

nate. The whole framework adopts the strategy of coarse-to-

fine (Fig. 2), in which a plausible but rough shape is gener-

ated first, and details are added later. Realizing that existing

3D shape generators usually produce reasonable shape even

from a single image, we simply use Pixel2Mesh [41] trained

either from single or multiple views to produce the coarse

shape, which is taken as input to our Multi-View Deforma-

tion Network (MDN) for further improvement. In MDN,

each vertex first samples a set of deformation hypotheses

from its surrounding area (Fig. 3 (a)). Each hypothesis then

pools cross-view perceptual feature from early layers of a

perceptual network, where the feature resolution is high and

contains more low-level geometry information (Fig. 3 (b)).

These features are further leveraged by the network to rea-

son the best deformation to move the vertex. It is worth

noting that our MDN can be applied iteratively for multiple

times to gradually improve shapes.

3.1. MultiView Deformation Network

In this section, we introduce Multi-View Deformation

Network, which is the core of our system to enable the

network exploiting cross-view information for shape gen-

eration. It first generates deformation hypotheses for each

vertex and learns to reason an optimum using feature

pooled from inputs. Our model is essentially a GCN, and

can be jointly trained with other GCN based models like

Pixel2Mesh. We refer reader to [1, 21] for details about

GCN, and Pixel2Mesh [41] for graph residual block which

will be used in our model.

3.1.1 Deformation Hypothesis Sampling

The first step is to propose deformation hypotheses for each

vertex. This is equivalent as sampling a set of target loca-

tions in 3D space where the vertex can be possibly moved

to. To uniformly explore the nearby area, we sample from

a level-1 icosahedron centered on the vertex with a scale of

0.02, which results in 42 hypothesis positions (Fig. 3 (a),

left). We then build a local graph with edges on the icosa-

hedron surface and additional edges between the hypothe-

ses to the vertex in the center, which forms a graph with

43 nodes and 120 + 42 = 162 edges. Such local graph is

built for all the vertices, and then fed into a GCN to predict

vertex movements (Fig. 3 (a), right).

3.1.2 Cross-View Perceptual Feature Pooling

The second step is to assign each node (in the local GCN)

features from the multiple input color images. Inspired by

1044

(b) Cross-View Perceptual Feature Pooling

(a) Deformation Hypothesis Sampling

Deformation Hypothesis

Sampling

Current Mesh verticesHypothesis points for each vertex

Hypothesis score0 1

Level-1 icosahedron

conv1conv2

conv3conv4

conv5

conv1conv2

conv3conv4

conv5

Figure 3. Deformation Hypothesis and Perceptual Feature

Pooling. (a) Deformation Hypothesis Sampling. We sample 42

deformation hypotheses from a level-1 icosahedron and build a

GCN among hypotheses and the vertex. (b) Cross-View Percep-

tual Feature Pooling. The 3D vertex coordinates are projected

to multiple 2D image planes using camera intrinsics and extrin-

sics. Perceptual features are pooled using bilinear interpolation,

and feature statistics are kept on each hypothesis.

Pixel2Mesh, we use the prevalent VGG-16 architecture to

extract perceptual features. Since we assume known cam-

era poses, each vertex and hypothesis can find their projec-

tions in all input color image planes using known camera

intrinsics and extrinsics and pool features from four neigh-

boring feature blocks using bilinear interpolation (Fig. 3

(b)). Different from Pixel2Mesh where high level features

from later layers of the VGG (i.e. ‘conv3 3’, ‘conv4 3’,

and ‘conv5 3’) are pooled to better learn shape priors, MDN

pools features from early layers (i.e. ‘conv1 2’, ‘conv2 2’,

and ‘conv3 3’), which are in high spatial resolution and

considered maintaining more detailed information.

To combine multiple features, concatenation has been

widely used as a loss-less way, however ends up with total

dimension changing with respect to (w.r.t.) the number of

input images. Statistics feature has been proposed for multi-

view shape recognition [35] to handle this problem. In-

spired by this, we concatenate some statistics (mean, max,

and std) of the features pooled from all views for each ver-

tex, which makes our network naturally adaptive to variable

input views and behave invariant to different input orders.

This also encourages the network to learn from cross-view

feature correlations rather than each individual feature vec-

tor. In addition to image features, we also concatenate the

3-dimensional vertex coordinate into the feature vector. In

total, we compute for each vertex and hypothesis a 1347

dimension feature vector.

3.1.3 Deformation Reasoning

The next step is to reason an optimal deformation for each

vertex from the hypotheses using pooled cross-view percep-

tual features. Note that picking the best hypothesis of all

needs an argmax operation, which requires stochastic opti-

mization and usually is not optimal. Instead, we design a

differentiable network component to produce desirable de-

formation through soft-argmax of the 3D deformation hy-

potheses, which is illustrated in Fig. 4. Specifically, we

first feed the cross-view perceptual feature P into a scoring

network, consisting of 6 graph residual convolution layers

[41] plus ReLU, to predict a scalar weight ci for each hy-

pothesis. All the weights are then fed into a softmax layer

and normalized to scores si, with∑

43

i=1si = 1. The ver-

tex location is then updated as the weighted sum of all the

hypotheses, i.e. v =∑

43

i=1si ∗ hi, where hi is the location

of each deformation hypothesis including the vertex itself.

This deformation reasoning unit runs on all local GCN built

upon every vertex with shared weights, as we expect all the

vertices leveraging multi-view feature in a similar fashion.

3.2. Loss

We train our model fully supervised using ground truth

3D CAD models. Our loss function includes all terms from

Pixel2Mesh, but extends the Chamfer distance loss to a re-

sampled version. Chamfer distance measures “distance” be-

tween two point clouds, which can be problematic when

points are not uniformly distributed on the surface. We pro-

pose to randomly re-sample the predicted mesh when calcu-

lating Chamfer loss using the re-parameterization trick pro-

posed in Ladicky et al. [23]. Specifically, given a triangle

defined by 3 vertices {v1, v2, v3} ∈ R3, a uniform sampling

can be achieved by:

s = (1−√r1) v1 + (1− r2)

√r1v2 +

√r1r2v1,

where s is a point inside the triangle, and r1, r2 ∼ U [0, 1].Knowing this, when calculating the loss, we uniformly sam-

ple our generated mesh for 4000 points, with the number of

points per triangle proportional to its area. We find this is

empirically sufficient to produce a uniform sampling on our

output mesh with 2466 vertices, and calculating Chamfer

loss on the re-sampled point cloud, containing 6466 in to-

tal, helps to remove artifacts in the results.

3.3. Implementation Details

For initialization, we use Pixel2Mesh to generate a

coarse shape with 2466 vertices. To improve the quality of

1045

Cross-View

Perceptual

Feature

Pooling

Vt-1 Ht-1

P

Deformation

Hypothesis

Sampling

Softmax

G-ResNet

2466×3 2466×43×32466×43×1

+

Weighted Sum

Vt

2466×3

Current

Vertex

Coordinate

vi

New Vertex Coordinate

vi

Softmax

Weighted Sum

’

P: Perceptual feature

H: Hypothesis coordinate

V: Vertices coordinate

Detail for

each vertex

Deformation Reasoning

3D Soft Argmax

Hypothesis Coordinate

h0 h1 ...h2 h3 hk

c0 c1 ...c2 c3 ck

s0 s1 ...s2 s3 sk

Figure 4. Deformation Reasoning. The goal is to reason a good deformation from the hypotheses and pooled features. We first estimate

a weight (green circle) for each hypothesis using a GCN. The weights are normalized by a softmax layer (yellow circle), and the output

deformation is the weighted sum of all the deformation hypotheses.

initial mesh, we equip the Pixel2Mesh with our cross-view

perceptual feature pooling layer, which allows it to extract

features from multiple views.

The network is implemented in Tensorflow and opti-

mized using Adam with weight decay as 1e-5 and mini-

batch size as 1. The model is trained for 50 epochs in to-

tal. For the first 30 epochs, we only train the multi-view

Pixel2Mesh for initialization with learning rate 1e-5. Then,

we make the whole model trainable, including the VGG

for perceptual feature extraction, for another 20 epoch with

the learning rate as 1e-6. The whole model is trained on

NVIDIA Titan Xp for 96 hours. During training, we ran-

domly pick three images for a mesh as input. During test-

ing, it takes 0.32s to generate a mesh.

4. Experiments

In this section, we perform extensive evaluation of our

model for multi-view shape generation. We compare to

state-of-the-art methods, as well as conduct controlled ex-

periments w.r.t. various aspects, e.g. cross category general-

ization, quantity of inputs, etc.

4.1. Experimental setup

Dataset We adopt the dataset provided by Choy et al. [4]

as it is widely used by many existing 3D shape generation

works. The dataset is created using a subset of ShapeNet[2]

containing 50k 3D CAD models from 13 categories. Each

model is rendered from 24 randomly chosen camera view-

points, and the camera intrinsic and extrinsic parameters are

given. For fair comparison, we use the same training/testing

split as in Choy et al. [4] for all our experiments.

Evaluation Metric We use standard evaluation metrics

for 3D shape generation. Following Fan et al. [7], we cal-

culate Chamfer Distance(CD) between points clouds uni-

formly sampled from the ground truth and our prediction

to measure the surface accuracy. We also use F-score fol-

lowing Wang et al. [41] to measure the completeness and

precision of generated shapes. For CD, the smaller is better.

For F-score, the larger is better.

4.2. Comparison to Multiview Shape Generation

We compare to previous works for multi-view shape gen-

eration and show effectiveness of MDN in improving shape

quality. While most shape generation methods take only a

single image, we find Choy et al. [4] and Kar et al. [18]

work in the same setting with us. We also build two com-

petitive baselines using Pixel2Mesh. In the first baseline

(Tab.1, P2M-M), we directly run single-view Pixel2Mesh

on each of the input image and fuse multiple results [5, 24].

In the second baseline (Tab.1, MVP2M), we replace the

perceptual feature pooling to our cross-view version to en-

able Pixel2Mesh for the multi-view scenario (more details

in supplementary materials).

Tab. 1 shows quantitative comparison in F-score. As

can be seen, our baselines already outperform other meth-

ods, which shows the advantage of mesh representation

in finding surface details. Moreover, directly equipping

Pixel2Mesh with multi-view features does not improve too

much (even slightly worse than the average of multiple runs

of single-view Pixel2Mesh), which shows dedicate archi-

tecture is required to efficiently learn from multi-view fea-

tures. In contrast, our Multi-View Deformation Network

significantly further improves the results from the MVP2M

baseline (i.e. our coarse shape initialization).

More qualitative results are shown in Fig. 8. We show

results from different methods aligned with one input view

(left) and a random view (right). As can be seen, Choy

et al. [4] (3D-R2N2) and Kar et al. [18] (LSM) produce

3D volume, which lose thin structures and surface details.

Pixel2Mesh (P2M) produces mesh models but shows obvi-

1046

Category

F-score(τ ) ↑ F-score(2τ ) ↑3DR2N2† LSM MVP2M P2M-M Ours 3DR2N2† LSM MVP2M P2M-M Ours

Couch 45.47 43.02 53.17 53.70 57.56 59.97 55.49 73.24 72.04 75.33

Cabinet 54.08 50.80 56.85 63.55 65.72 64.42 60.72 76.58 79.93 81.57

Bench 44.56 49.33 60.37 61.14 66.24 62.47 65.92 75.69 75.66 79.67

Chair 37.62 48.55 54.19 55.89 62.05 54.26 64.95 72.36 72.36 77.68

Monitor 36.33 43.65 53.41 54.50 60.00 48.65 56.33 70.63 70.51 75.42

Firearm 55.72 56.14 79.67 74.85 80.74 76.79 73.89 89.08 84.82 89.29

Speaker 41.48 45.21 48.90 51.61 54.88 52.29 56.65 68.29 68.53 71.46

Lamp 32.25 45.58 50.82 51.00 62.56 49.38 64.76 65.72 64.72 74.00

Cellphone 58.09 60.11 66.07 70.88 74.36 69.66 71.39 82.31 84.09 86.16

Plane 47.81 55.60 75.16 72.36 76.79 70.49 76.39 86.38 82.74 86.62

Table 48.78 48.61 65.95 67.89 71.89 62.67 62.22 79.96 81.04 84.19

Car 59.86 51.91 67.27 67.29 68.45 78.31 68.20 84.64 84.39 85.19

Watercraft 40.72 47.96 61.85 57.72 62.99 63.59 66.95 77.49 72.96 77.32

Mean 46.37 49.73 61.05 61.72 66.48 62.53 64.91 77.10 76.45 80.30

Table 1. Comparison to Multi-view Shape Generation Methods. We show F-score on each semantic category. Our model significantly

outperforms previous methods, i.e. 3DR2N2 [4] and LSM [18], and competitive baselines derived from Pixel2Mesh [41]. Please see

supplementary materials for Chamfer Distance. The notation † indicates the methods which does not require camera extrinsics.

ous artifacts when visualized in viewpoint other than the in-

put. In comparison, our results contain better surface details

and more accurate geometry learned from multiple views.

4.3. Generalization Capability

Our MDN is inspired by multi-view geometry methods,

where 3D location is reasoned via cross-view information.

In this section, we investigate the generalization capability

of MDN in many aspects to improve the initialization mesh.

For all the experiments in this section, we fix the coarse

stage and train/test MDN under different settings.

4.3.1 Semantic Category

We first verify how our network generalizes across semantic

categories. We fix the initial MVP2M and train MDN with

12 out 13 categories and test on the one left out, and the im-

provements upon initialization are shown in Fig. 5 (a). As

can be seen, the performance is only slightly lower when

the testing category is removed from the training set com-

pared to the model trained using all categories. To make it

more challenging, we also train MDN on only one category

and test on all the others. Surprisingly, MDN still general-

izes well between most of the categories as shown in Fig. 5

(b). Strong generalizing categories (e.g. chair, table, lamp)

tend to have relatively complex geometry, thus the model

has better chance to learn from cross-view information. On

the other hand, categories with super simple geometry (e.g.

speaker, cellphone) do not help to improve other categories,

even not for themselves. On the whole, MDN shows good

generalization capability across semantic categories.

Category Except All

lamp 10.96 11.73

cabinet 8.88 8.99

cellphone 7.10 8.29

chair 6.49 7.86

monitor 6.06 6.60

speaker 5.75 5.98

table 5.44 5.94

bench 5.30 5.87

couch 3.76 4.39

plane 1.12 1.63

firarm 0.67 1.07

watercraft 0.21 1.14

car 0.14 1.18

(a) Train except one category (b)Train on one category

Figure 5. Cross-Category Generalization. (a) MDN trained on

12 out of 13 categories and tested on the one left. (b) MDN trained

on 1 category and tested on the other. Each block represents the

experiment with MDN trained on horizontal category and tested

on vertical category. Both (a) and (b) show improvements of F-

score(τ ) upon MVP2M through MDN.

4.3.2 Number of Views

We then test how MDN performs w.r.t. the number of input

views. In Tab. 2, we see that MDN consistently performs

better when more input views are available, even though

the number of view is fixed as 3 for efficiency during the

training. This indicates that features from multiple views

are well encoded in the statistics, and MDN is able to ex-

ploit additional information when seeing more images. For

reference, we train five MDNs with the input view number

1047

Image Inputmesh Our result

TranslationNoise

Image Inputmesh Our result

Voxel

Image Our resultInputmesh

Figure 6. Robustness to Initialization. Our model is robust to added noise, shift, and input mesh from other sources.

fixed at 2 to 5 respectively. As shown in Tab. 2 “Resp.”, the

3-view MDN performs very close to models trained with

more views (e.g. 4 and 5), which shows the model learns

efficiently from fewer number of views during the training.

The 3-view MDN also outperform models trained with less

views (e.g. 2), which indicates additional information pro-

vided during the training can be effectively activated during

the test even when observation is limited. Overall, MDN

generalizes well to different number of inputs.

#train #test 2 3 4 5

3

F-score(τ ) ↑ 64.48 66.44 67.66 68.29

F-score(2τ ) ↑ 78.74 80.33 81.36 81.97

CD ↓ 0.515 0.484 0.468 0.459

Resp.

F-score(τ ) ↑ 64.11 66.44 68.54 68.82

F-score(2τ ) ↑ 78.34 80.33 81.56 81.99

CD ↓ 0.527 0.484 0.467 0.452

Table 2. Performance w.r.t. Input View Numbers. Our MDN

performs consistently better when more view is given, even trained

using only 3 views.

4.3.3 Initialization

Lastly, we test if the model overfits to the input initial-

ization, i.e. the MVP2M. To this end, we add translation

and random noise to the rough shape from MVP2M. We

also take as input the mesh converted from 3DR2N2 using

marching cube [24]. As shown in Fig. 6, MDN successfully

removes the noise, aligns the input with ground truth, and

adds significant geometry details. This shows that MDN is

tolerant to input variance.

4.4. Ablation Study

In this section, we verify the qualitative and quantitative

improvements from statistic feature pooling, re-sampled

Chamfer distance, and iterative refinement.

4.4.1 Statistical Feature

We first check the importance of using feature statistics. We

train MDN with the ordinary concatenation. This maintains

Input image -Stat Feat Pooling -Re-sample Loss Full model

Figure 7. Qualitative Ablation Study. We show meshes from the

MDN with statistics feature or re-sampling loss disabled.

Metrics F-score(τ ) ↑ F-score(2τ ) ↑ CD ↓-Feat Stat 65.26 79.13 0.511

-Re-sample Loss 66.26 80.04 0.496

Full Model 66.48 80.30 0.486

Table 3. Quantitative Ablation Study. We show the metrics of

the MDN with statistics feature or re-sampling loss disabled.

all the features loss-less to potentially produce better ge-

ometry, but does not support variable number of inputs any

more. Surprisingly, our model with feature statistics (Tab.

3, “Full Model”) still outperforms the one with concatena-

tion (Tab. 3, “-Feat Stat”). This is probably because that

our feature statistics is invariant to the input order, such

that the network learns more efficiently during the train-

ing. It also explicitly encodes cross-view feature correla-

tions, which can be directly leveraged by the network.

4.4.2 Re-sampled Chamfer Distance

We then investigate the impact of the re-sampled Cham-

fer loss. We train our model using the traditional Chamfer

loss only on mesh vertices as defined in Pixel2Mesh, and

all metrics drop consistently (Tab. 3, “-Re-sample Loss”).

Intuitively, our re-sampling loss is especially helpful for

places with sparse vertices and irregular faces, such as the

elongated lamp neck as shown in Fig. 7, 3rd column. It also

1048

Figure 8. Qualitative Evaluation. From top to bottom, we show in each row: two camera views, results of 3DR2N2, LSM, multi-view

Pixel2Mesh, ours, and the ground truth. Our predicts maintain good details and align well with different camera views. Please see

supplementary materials for more results.

79.20

79.60

80.00

80.40

80.80

1 2 3 4

F-score

(2τ)

Number of iterations

0.470

0.480

0.490

0.500

0.510

1 2 3 4

ChamferDistance

Number of iterations

Figure 9. Performance with Different Iterations. The perfor-

mance keeps improving with more iterations and saturate at three.

prevents big mistakes from happening on a single vertex,

e.g. the spike on bench, where our loss penalizes a lot of

sampled points on wrong faces caused by the vertex but the

standard Chamfer loss only penalizes one point.

4.4.3 Number of Iteration

Figure 9 shows that the performance of our model keeps

improving with more iterations, and is roughly saturated at

three. Therefore we choose to run three iterations during

the inference even though marginal improvements can be

further obtained from more iterations.

5. Conclusion

We propose a graph convolutional framework to produce

3D mesh model from multiple images. Our model learns to

exploit cross-view information and generates vertex defor-

mation iteratively to improve the mesh produced from the

direct prediction methods, e.g. Pixel2Mesh and its multi-

view extension. Inspired by multi-view geometry methods,

our model searches in the nearby area around each vertex

for an optimal place to relocate it. Compared to previous

works, our model achieves the state-of-the-art performance,

produces shapes containing accurate surface details rather

than merely visually plausible from input views, and shows

good generalization capability in many aspects. For future

work, combining with efficient shape retrieval for initializa-

tion, integrating with multi-view stereo models for explicit

photometric consistency, and extending to scene scales are

some of the practical directions to explore. On a high level,

how to integrating the similar idea in emerging new repre-

sentations, such as part based model with shape basis and

learned function [28] are interesting for further study.

1049

References

[1] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur

Szlam, and Pierre Vandergheynst. Geometric deep learning:

Going beyond euclidean data. IEEE Signal Process. Mag.,

34(4):18–42, 2017. 3

[2] Angel X Chang, Thomas Funkhouser, Leonidas Guibas,

Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese,

Manolis Savva, Shuran Song, Hao Su, et al. Shapenet:

An information-rich 3d model repository. arXiv preprint

arXiv:1512.03012, 2015. 5

[3] Zhiqin Chen and Hao Zhang. Learning implicit fields for

generative shape modeling. In CVPR, pages 5939–5948,

2019. 2

[4] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin

Chen, and Silvio Savarese. 3d-r2n2: A unified approach for

single and multi-view 3d object reconstruction. In ECCV,

2016. 2, 5, 6

[5] Brian Curless and Marc Levoy. A volumetric method for

building complex models from range images. 1996. 5

[6] Jean-Denis Durou, Maurizio Falcone, and Manuela Sagona.

Numerical methods for shape-from-shading: A new survey

with benchmarks. Computer Vision and Image Understand-

ing, 109(1):22–43, 2008. 2

[7] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set

generation network for 3d object reconstruction from a single

image. In CVPR, 2017. 2, 5

[8] Paolo Favaro and Stefano Soatto. A geometric approach to

shape from defocus. IEEE Trans. Pattern Anal. Mach. Intell.,

27(3):406–417, 2005. 2

[9] Rohit Girdhar, David F. Fouhey, Mikel Rodriguez, and Ab-

hinav Gupta. Learning a predictable and generative vector

representation for objects. In ECCV, 2016. 2

[10] JunYoung Gwak, Christopher B. Choy, Manmohan Chan-

draker, Animesh Garg, and Silvio Savarese. Weakly super-

vised 3d reconstruction with adversarial constraint. In 3DV,

2017. 2

[11] Christian Hane, Shubham Tulsiani, and Jitendra Malik. Hi-

erarchical surface prediction for 3d object reconstruction. In

3DV, 2017. 2

[12] Andrew Harltey and Andrew Zisserman. Multiple view ge-

ometry in computer vision (2. ed.). Cambridge University

Press, 2006. 1, 2

[13] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra

Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view

stereopsis. In CVPR, 2018. 2

[14] Qixing Huang, Hai Wang, and Vladlen Koltun. Single-view

reconstruction via joint analysis of image and shape collec-

tions. ACM Trans. Graph., 34(4):87:1–87:10, 2015. 2

[15] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So

Kweon. Dpsnet: End-to-end deep plane sweep stereo. In

ICLR, 2018. 2

[16] Adrian Johnston, Ravi Garg, Gustavo Carneiro, and Ian D.

Reid. Scaling cnns for high resolution volumetric reconstruc-

tion from a single image. In ICCV, 2017. 2

[17] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and

Jitendra Malik. Learning category-specific mesh reconstruc-

tion from image collections. In ECCV, 2018. 2

[18] Abhishek Kar, Christian Hane, and Jitendra Malik. Learning

a multi-view stereo machine. In NeurIPS, pages 365–376,

2017. 2, 5, 6

[19] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neu-

ral 3d mesh renderer. In CVPR, 2018. 2

[20] Alex Kendall, Hayk Martirosyan, Saumitro Dasgupta, and

Peter Henry. End-to-end learning of geometry and context

for deep stereo regression. In ICCV, 2017. 2

[21] Thomas N. Kipf and Max Welling. Semi-supervised classi-

fication with graph convolutional networks. In ICLR, 2016.

3

[22] Andrey Kurenkov, Jingwei Ji, Animesh Garg, Viraj Mehta,

JunYoung Gwak, Christopher Choy, and Silvio Savarese.

Deformnet: Free-form deformation network for 3d shape re-

construction from a single image. In WACV, pages 858–866,

2018. 2

[23] Lubor Ladicky, Olivier Saurer, SoHyeon Jeong, Fabio Man-

inchedda, and Marc Pollefeys. From point clouds to mesh

using regression. In ICCV, pages 3893–3902, 2017. 4

[24] William E. Lorensen and Harvey E. Cline. Marching cubes:

A high resolution 3d surface construction algorithm. In SIG-

GRAPH, 1987. 5, 7

[25] Constantinos Marinos and Andrew Blake. Shape from tex-

ture: the homogeneity hypothesis. In ICCV, pages 350–353,

1990. 2

[26] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se-

bastian Nowozin, and Andreas Geiger. Occupancy networks:

Learning 3d reconstruction in function space. In CVPR,

pages 4460–4470, 2019. 2

[27] Chengjie Niu, Jun Li, and Kai Xu. Im2struct: Recovering 3d

shape structure from a single RGB image. In CVPR, 2018. 2

[28] Jeong Joon Park, Peter Florence, Julian Straub, Richard

Newcombe, and Steven Lovegrove. Deepsdf: Learning con-

tinuous signed distance functions for shape representation.

In CVPR, June 2019. 2, 8

[29] Charles R. Qi, Wei Liu, Chenxia Wu, Hao Su, and

Leonidas J. Guibas. Frustum pointnets for 3d object detec-

tion from RGB-D data. In CVPR, 2018. 2

[30] Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and

Leonidas J. Guibas. Pointnet: Deep learning on point sets

for 3d classification and segmentation. In CVPR, 2017. 2

[31] Stephan R. Richter and Stefan Roth. Matryoshka networks:

Predicting 3d geometry via nested shape layers. In CVPR,

2018. 2

[32] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger.

Octnet: Learning deep 3d representations at high resolutions.

In CVPR, 2017. 2

[33] Ayan Sinha, Asim Unmesh, Qixing Huang, and Karthik Ra-

mani. Surfnet: Generating 3d shape surfaces using deep

residual networks. In CVPR, 2017. 2

[34] Hao Su, Qixing Huang, Niloy J. Mitra, Yangyan Li, and

Leonidas J. Guibas. Estimating image depth using shape col-

lections. ACM Trans. Graph., 33(4):37:1–37:11, 2014. 2

[35] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and

Erik G. Learned-Miller. Multi-view convolutional neural

networks for 3d shape recognition. In ICCV, 2015. 4

1050

[36] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox.

Multi-view 3d models from single images with a convolu-

tional network. In ECCV, 2016. 2

[37] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox.

Octree generating networks: Efficient convolutional archi-

tectures for high-resolution 3d outputs. In ICCV, 2017. 2

[38] Maxim Tatarchenko, Stephan Richter, Rene Ranftl, Zhuwen

Li, Vladlen Koltun, and Thomas Brox. What do single-view

3d reconstruction networks learn? In CVPR, 2019. 2

[39] Shubham Tulsiani, Abhishek Kar, Joao Carreira, and Jiten-

dra Malik. Learning category-specific deformable 3d models

for object reconstruction. IEEE Trans. Pattern Anal. Mach.

Intell., 39(4):719–731, 2017. 2

[40] Shubham Tulsiani, Hao Su, Leonidas J. Guibas, Alexei A.

Efros, and Jitendra Malik. Learning shape abstractions by

assembling volumetric primitives. In CVPR, 2017. 2

[41] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei

Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh

models from single rgb images. In ECCV, 2018. 1, 2, 3, 4,

5, 6

[42] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun,

and Xin Tong. O-cnn: Octree-based convolutional neu-

ral networks for 3d shape analysis. ACM Transactions on

Graphics (TOG), 36(4):72, 2017. 2

[43] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and

Josh Tenenbaum. Learning a probabilistic latent space of

object shapes via 3d generative-adversarial modeling. In

Neurips, 2016. 2

[44] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan.

Mvsnet: Depth inference for unstructured multi-view stereo.

In ECCV, 2018. 2

[45] Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and

Mubarak Shah. Shape from shading: A survey. IEEE Trans.

Pattern Anal. Mach. Intell., 21(8):690–706, 1999. 2

[46] Yinda Zhang, Sameh Khamis, Christoph Rhemann, Julien

Valentin, Adarsh Kowdle, Vladimir Tankovich, Michael

Schoenberg, Shahram Izadi, Thomas Funkhouser, and Sean

Fanello. Activestereonet: End-to-end self-supervised learn-

ing for active stereo systems. In ECCV, pages 784–801,

2018. 2

1051

Pixel2Mesh++: Multi-View 3D Mesh Generation via...

Documents

Transcript of Pixel2Mesh++: Multi-View 3D Mesh Generation via...