
Neural 3D Morphable Models: Spiral Convolutional Networks for 3D Shape Representation Learning and Generation

Giorgos Bouritsas*1 Sergiy Bokhnyak*2 Stylianos Ploumpis1,3

Michael Bronstein1,2,4 Stefanos Zafeiriou1,3

Imperial College London, UK1 Università Svizzera Italiana, Switzerland2 FaceSoft.io3 Twitter4

1{g.bouritsas18, s.ploumpis, m.bronstein, s.zafeiriou}@imperial.ac.uk [email protected]

Abstract

Generative models for 3D geometric data arise in many important applications in 3D computer vision and graphics. In this paper, we focus on 3D deformable shapes that share a common topological structure, such as human faces and bodies. Morphable Models and their variants, despite their linear formulation, have been widely used for shape representation, while most of the recently proposed non-linear approaches resort to intermediate representations, such as 3D voxel grids or 2D views. In this work, we introduce a novel graph convolutional operator, acting directly on the 3D mesh, that explicitly models the inductive bias of the fixed underlying graph. This is achieved by enforcing consistent local orderings of the vertices of the graph, through the spiral operator, thus breaking the permutation invariance property that is adopted by all the prior work on Graph Neural Networks. Our operator comes by construction with desirable properties (anisotropic, topology-aware, lightweight, easy-to-optimise), and by using it as a building block for traditional deep generative architectures, we demonstrate state-of-the-art results on a variety of 3D shape datasets compared to the linear Morphable Model and other graph convolutional operators.

1. Introduction

The success of deep learning in computer vision and image analysis, speech recognition, and natural language processing has driven the recent interest in developing similar models for 3D geometric data. Generalisations of successful architectures such as convolutional neural networks (CNNs) to data with non-Euclidean structure (e.g. manifolds and graphs) are known under the umbrella term Geometric deep learning [10]. In applications dealing with 3D data, the key challenge of geometric deep learning is a meaningful definition of intrinsic operations analogous to convolution and pooling on meshes or point clouds. Among the numerous advantages of working directly on mesh or point cloud data is the fact that it is possible to build invariance to shape transformations (both rigid and non-rigid) into the architecture, as a result allowing the use of significantly simpler models and much less training data. So far, the main focus of research in the field of geometric deep learning has been on analysis tasks, encompassing shape classification and segmentation [36, 38], local descriptor learning, correspondence, and retrieval [32, 9, 28].

*Equal Contribution

On the other hand, there has been limited progress in representation learning and generation of geometric data (shape synthesis). Obtaining descriptive and compact representations of meshes and point clouds is essential for downstream tasks such as classification and 3D reconstruction, when dealing with limited labelled training data. Additionally, geometric data synthesis is pivotal in applications such as 3D printing, computer graphics and animation, virtual reality, and game design, and can heavily assist graphics designers and speed up production. Furthermore, given the high cost and time of acquiring quality 3D data, geometric generative models can be used as a cheap alternative for producing training data for geometric ML algorithms.

Most of the previous approaches in this direction rely on intermediate representations of 3D shapes, such as point clouds [1], voxels [45] or mappings to a flat domain [33, 4], instead of direct surface representations, such as meshes. Despite the success of such techniques, they either suffer from high computational complexity (e.g. voxels) or absence of smoothness of the data representation (e.g. point clouds), while usually pre- and post-processing steps are needed in order to obtain the output surface model. Learning directly on the mesh was only recently explored in [26, 39, 44, 23] for shape completion, non-linear facial morphable model construction and 3D reconstruction from single images, respectively.

In this paper, we propose a novel representation learning and generative framework for fixed topology meshes. For this purpose, we formulate an ordering-based graph convolutional operator, contrary to the permutation invariant operators in the literature of Graph Neural Networks. In particular, similarly to image convolutions, for each vertex on the mesh, we enforce an explicit ordering of its neighbours, allowing a “1-1” mapping between the neighbours and the parameters of a learnable local filter. The order is obtained via a spiral scan, as proposed in [25], hence the name of the operator, Spiral Convolution. This way we obtain anisotropic filters without sacrificing computational complexity, while simultaneously we explicitly encode the fixed graph connectivity. The operator can potentially be generalised to other domains that accept implicit local orderings, such as arbitrary mesh topologies and point clouds, while it is naturally equivalent to traditional grid convolutions. Via this equivalence, common CNN practices, such as dilated convolutions, can be easily formulated for meshes.

arXiv:1905.02876v3 [cs.CV] 3 Aug 2019

Figure 1: Illustration of our Neural3DMM architecture

We use spiral convolution as a basic building block for hierarchical intrinsic mesh autoencoders, which we coin Neural 3D Morphable Models. We quantitatively evaluate our methods on several popular datasets: human faces with different expressions (COMA [39]) and identities (MeIn3D [7]) and human bodies with shape and pose variation (DFAUST [6]). Our model achieves state-of-the-art reconstruction results, outperforming the widely used linear 3D Morphable Model [5] and the COMA autoencoder [39], as well as other graph convolutional operators, including the initial formulation of the spiral operator [25]. We also qualitatively assess our framework by showing ‘shape arithmetic’ in the latent space of the autoencoder and by synthesising facial identities via a spiral convolution Wasserstein GAN.

2. Related Work

Generative models for arbitrary shapes: Perhaps the most common approaches for generating arbitrary shapes are volumetric CNNs [46, 37, 29] acting on 3D voxels. For example, voxel regression from images [20], denoising autoencoders [41] and voxel-GANs [45] have been proposed. Among the key drawbacks of volumetric methods are their inherent high computational complexity and the fact that they yield coarse and redundant representations. Point clouds are a simple and lightweight alternative to volumetric representations that has recently been gaining popularity. Several methods have been proposed for representation learning of fixed-size point clouds [1] using the PointNet [36] architecture. In [47], point clouds of arbitrary size can be synthesised via a 2D grid deformation. Despite their compactness, point clouds are not popular for realistic and high-quality 3D geometry generation due to their lack of an underlying smooth structure. Image-based methods have also been proposed, such as multi-view [3] and flat domain mappings such as UV maps [33, 4]; however, they are computationally demanding, require pre- and post-processing steps and usually produce undesirable artefacts. It is also worth mentioning the recently introduced implicit-surface based approaches [30, 13, 34], which can yield accurate results, though with the disadvantage of slow inference (dense sampling of the 3D space followed by marching cubes).

Morphable models: In the case of deformable shapes, such as faces, bodies, hands etc., where a fixed topology can be obtained by establishing dense correspondences with a template, the most popular methods are still statistical models, given their simplicity. For faces, the baseline is the PCA-based 3D Morphable Model (3DMM) [5]. The Large Scale Face Model (LSFM) [7] was proposed for facial identity and made publicly available, [12, 24] were proposed for facial expression, while for the entire head a large scale model was proposed in [35]. For body and hand, the most well known models are the skinned vertex-based models SMPL [27] and MANO [40], respectively. SMPL and MANO are non-linear and require (a) joint localisation and (b) solving special optimisation problems in order to project a new shape to the space of the models. In this paper, we take a different approach, introducing a new family of differentiable Morphable Models, which can be applied to a variety of objects, with strong (e.g. body) and less strong (e.g. face) articulations. Our methods have better representational power and also do not require any additional supervision.

Geometric Deep Learning is a set of recent methods trying to generalise neural networks to non-Euclidean domains such as graphs and manifolds [10]. Such methods have achieved promising results in geometry processing and computer graphics [28, 9], computational chemistry [17, 19], and network science [22, 32]. Multiple approaches have been proposed to construct convolution-like operations, including spectral methods [11, 15, 22, 48], local charting based [28, 9, 32, 18, 25] and soft attention [42, 43]. Finally, graph or mesh coarsening techniques [15, 49] have been proposed, equivalent to image pooling.


3. Spiral Convolutional Networks

3.1. Spiral Convolution

For the following discussion, we assume to be given a manifold, discretised as a triangular mesh $\mathcal{M} = (\mathcal{V}, \mathcal{E}, \mathcal{F})$, where $\mathcal{V} = \{1, \ldots, n\}$, $\mathcal{E}$, and $\mathcal{F}$ denote the sets of vertices, edges, and faces respectively. Furthermore, let $f : \mathcal{V} \to \mathbb{R}$ be a function representing the vertex features.

One of the key challenges in developing convolution-like operators on graphs or manifolds is the lack of a global system of coordinates that can be associated with each point. The first intrinsic mesh convolutional architectures such as GCNN [28], ACNN [9] or MoNet [32] overcame this problem by constructing a local system of coordinates $u(x, y)$ around each vertex $x$ of the mesh, in which a set of local weighting functions $w_1, \ldots, w_L$ is applied to aggregate information from the vertices $y$ of the neighbourhood $\mathcal{N}(x)$. This allows the definition of ‘patch operators’ generalising the sliding window filtering in images:

$$(f \star g)_x = \sum_{\ell=1}^{L} g_\ell \sum_{y \in \mathcal{N}(x)} w_\ell(u(x, y)) f(y) \qquad (1)$$

where $\sum_{y \in \mathcal{N}(x)} w_\ell(u(x, y)) f(y)$ are ‘soft pixels’ ($L$ in total), $f$ is akin to the pixel intensity in images, and $g_\ell$ are the filter weights. The problem of the absence of a global coordinate system is equivalent to the absence of a canonical ordering of the vertices, and the patch-operator based approaches can also be interpreted as attention mechanisms, as in [42] and [43]. In particular, the absence of ordering does not allow the construction of a “1-1” mapping between neighbouring features $f(y)$ and filter weights $g_\ell$, thus an “all-to-all” mapping is performed via learnable soft-attention weights $w_\ell(u(x, y))$. In the Euclidean setting, such operators boil down to the classical convolution, since an ordering can be obtained via the global coordinate system.
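The ‘soft pixel’ construction of Eq. (1) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the Gaussian form of $w_\ell$ follows the MoNet [32] family, and all names (`patch_operator_conv`, the pseudo-coordinate dictionary `u`) are hypothetical.

```python
import numpy as np

def patch_operator_conv(f, neighbors, u, mu, sigma, g):
    """Sketch of the patch-operator convolution of Eq. (1).

    f         : (n, c) vertex features
    neighbors : list of neighbour index lists, one per vertex
    u         : dict (x, y) -> 2D pseudo-coordinates
    mu, sigma : (L, 2) centres / widths of L Gaussian weighting functions
    g         : (L,) filter weights (scalar filters for simplicity)
    """
    n, c = f.shape
    L = len(g)
    out = np.zeros((n, c))
    for x in range(n):
        for l in range(L):
            # 'soft pixel' l: weighted average over the neighbourhood
            soft_pixel = np.zeros(c)
            for y in neighbors[x]:
                w = np.exp(-np.sum((u[(x, y)] - mu[l]) ** 2 / (2 * sigma[l] ** 2)))
                soft_pixel += w * f[y]
            out[x] += g[l] * soft_pixel
    return out
```

Note the “all-to-all” structure: every neighbour contributes to every soft pixel, weighted by $w_\ell$, instead of a one-to-one assignment of neighbours to filter weights.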

Besides the lack of a global coordinate system, another motivation for patch-operator based approaches when working on meshes is the need for insensitivity to the meshing of the continuous surface, i.e. ideally, each patch operator should be independent of the underlying graph topology.

Figure 2: Spiral ordering on a mesh and an image patch

However, all the methods falling into this family come at the cost of high computational complexity and parameter count, and can be hard to optimise. Moreover, patch-operator based methods specifically designed for meshes require hand-crafting and pre-computing the local systems of coordinates. To this end, in this paper we make a crucial observation in order to overcome the disadvantages of the aforementioned approaches: the issues of the absence of a global ordering and of insensitivity to graph topology are irrelevant when dealing with fixed topology meshes. In particular, one can locally order the vertices and keep the order fixed. Then, graph convolution can be defined as follows:

$$(f \star g)_x = \sum_{\ell=1}^{L} g_\ell f(x_\ell), \qquad (2)$$

where $\{x_1, \ldots, x_L\}$ denote the neighbours of vertex $x$ ordered in a fixed way. Here, in analogy with the patch operators, each ‘soft pixel’ reduces to a single neighbouring vertex.

In the Euclidean setting, the order is simply a raster scan of pixels in a patch. On meshes, we opt for a simple and intuitive ordering using spiral trajectories inspired by [25]. Let $x \in \mathcal{V}$ be a mesh vertex, and let $R^d(x)$ be the $d$-ring, i.e. an ordered set of vertices whose shortest (graph) path to $x$ is exactly $d$ hops long; $R^d_j(x)$ denotes the $j$th element in the $d$-ring (trivially, $R^0_1(x) = x$). We define the spiral patch operator as the ordered sequence

$$S(x) = \{x, R^1_1(x), R^1_2(x), \ldots, R^h_{|R^h|}(x)\}, \qquad (3)$$

where $h$ denotes the patch radius, similar to the size of the kernel in classical CNNs. Then, spiral convolution is:

$$(f * g)_x = \sum_{\ell=1}^{L} g_\ell\, f(S_\ell(x)). \qquad (4)$$
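Because the ordering is fixed, Eq. (4) reduces to a gather followed by a dense contraction. The following is a minimal numpy sketch (hypothetical names, not the authors' code); `spirals` is assumed precomputed per vertex, with -1 marking zero-padded positions.

```python
import numpy as np

def spiral_conv(f, spirals, G):
    """Sketch of spiral convolution (Eq. (4)): a '1-1' mapping between
    the L spiral positions and the rows of a learnable weight tensor G.

    f       : (n, c_in) vertex features
    spirals : (n, L) int array; spirals[x, l] is the l-th vertex of S(x),
              with -1 marking zero-padded positions
    G       : (L, c_in, c_out) filter weights, one matrix per position
    """
    n, c_in = f.shape
    # append a zero row so that index -1 reads as zero padding
    f_pad = np.vstack([f, np.zeros((1, c_in))])
    gathered = f_pad[spirals]            # (n, L, c_in), fixed ordering
    # contract over spiral position and input channels
    return np.einsum('nlc,lcd->nd', gathered, G)
```

Since corresponding spiral positions always meet the same weights, the filter is anisotropic by construction, unlike the attention-weighted averaging of Eq. (1).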

The uniqueness of the ordering is given by fixing two degrees of freedom: the direction of the rings and the first vertex $R^1_1(x)$. The rest of the vertices of the spiral are ordered inductively. The direction is chosen by moving clockwise or counterclockwise, while the choice of the first vertex, the reference point, is based on the underlying geometry of the shape to ensure the robustness of the method. In particular, we fix a reference vertex $x_0$ on a template shape and choose the initial point for each spiral to be in the direction of the shortest geodesic path to $x_0$, i.e.

$$R^1_1(x) = \arg\min_{y \in R^1(x)} d_{\mathcal{M}}(x_0, y), \qquad (5)$$

where $d_{\mathcal{M}}$ is the geodesic distance between two vertices on the mesh $\mathcal{M}$. In order to allow for fixed-sized spirals, we choose a fixed length $L$ as a hyper-parameter and then either truncate or zero-pad each spiral depending on its size.

Comparison to Lim et al. [25]: The authors choose the starting point of each spiral at random, for every mesh sample, every vertex, and every epoch during training. This choice prevents us from explicitly encoding the fixed connectivity, since corresponding vertices in different meshes will not undergo the same transformation (as in image convolutions). Moreover, single vertices also undergo different transformations every time a new spiral is sampled. Thus, in order for the network to obtain robustness to different spiral samples, it inevitably has to become invariant to different rotations of the neighbourhoods, and thus it has reduced capacity. To this end, we emphasise the need for consistent orderings across different meshes.
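The spiral construction itself (Eqs. (3) and (5)) can be sketched with a breadth-first traversal. This is an illustrative approximation, not the authors' implementation: here each ring is simply sorted by a precomputed distance to the reference vertex, whereas the paper fixes a rotation direction and walks each ring from the chosen start; `build_spiral` and `ref_dist` are hypothetical names.

```python
from collections import deque

def build_spiral(adj, x, ref_dist, length):
    """Sketch of the spiral ordering S(x) of Eq. (3), truncated or
    zero-padded (with -1) to `length`. adj is an adjacency list;
    ref_dist[v] approximates the geodesic distance of v to the template
    reference vertex x0 used in Eq. (5)."""
    # BFS to split vertices into rings R^d(x)
    hops = {x: 0}
    rings = {0: [x]}
    q = deque([x])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in hops:
                hops[u] = hops[v] + 1
                rings.setdefault(hops[u], []).append(u)
                q.append(u)
    spiral = []
    for d in sorted(rings):
        # start each ring at the vertex closest to the reference
        spiral.extend(sorted(rings[d], key=lambda v: ref_dist[v]))
        if len(spiral) >= length:
            break
    spiral = spiral[:length]
    return spiral + [-1] * (length - len(spiral))
```

Since the mesh topology is fixed, these orderings are computed once on the template and reused for every sample, which is what gives corresponding vertices the same transformation.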

Moreover, in [25], the authors model the vertices on the spiral via a recurrent network, which has higher computational complexity, is harder to optimise and does not take advantage of the stationary properties of the 3D shape (local statistics are repeated across different patches), which are exploited by our spiral kernel with weight sharing.

Comparison to spectral filters: Spectral convolutional operators developed in [16, 22] for graphs and used in [39] for mesh autoencoders suffer from the fact that they are inherently isotropic. This is a side-effect when one, in the absence of a canonical ordering, needs to design a permutation-invariant operator with a small number of parameters. In particular, spectral filters rely on the Laplacian operator, which performs a weighted averaging of the neighbour vertices:

$$(\Delta f)_x = \sum_{y : (x, y) \in \mathcal{E}} w_{xy} \left( f(y) - f(x) \right), \qquad (6)$$

where $w_{xy}$ denotes an edge weight. A polynomial of degree $r$ with learnable coefficients $\theta_0, \ldots, \theta_r$ is then applied to $\Delta$. Then, the graph convolution amounts to filtering the Laplacian eigenvalues, $p(\Delta) = \Phi p(\Lambda) \Phi^\top$. Equivalently:

$$(f * g) = p(\Delta) f = \sum_{\ell=0}^{r} \theta_\ell \Delta^\ell f. \qquad (7)$$
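To see why such filters are isotropic, note that Eq. (7) only requires repeated application of $\Delta$: the same coefficient $\theta_\ell$ multiplies the $\ell$-hop aggregate at every vertex, regardless of direction. A minimal sketch (hypothetical names), using the sign convention of Eq. (6), i.e. $\Delta = W - D$:

```python
import numpy as np

def spectral_poly_filter(f, W, theta):
    """Sketch of the polynomial spectral filter of Eq. (7):
    (f * g) = sum_{l=0}^{r} theta_l Delta^l f.

    W : (n, n) symmetric edge weights. With the convention of Eq. (6),
    (Delta f)_x = sum_y w_xy (f(y) - f(x)), i.e. Delta = W - D.
    """
    D = np.diag(W.sum(axis=1))
    delta = W - D
    out = np.zeros_like(f, dtype=float)
    power = f.astype(float)            # Delta^0 f
    for th in theta:
        out += th * power
        power = delta @ power          # next power of the Laplacian
    return out
```

Only $r + 1$ scalars are learned per filter, which keeps the parameter count low but makes the response rotationally symmetric around each vertex.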

While a necessary evil in general graphs, spectral filters on meshes are rather weak, given that they are locally rotationally-invariant. On the other hand, spiral convolutional filters leverage the fact that on a mesh one can canonically order the neighbours. Thus, they are anisotropic by construction and, as will be shown in the experimental section 4, they are expressive using just one-hop neighbourhoods, contrary to the large receptive fields used in [39]. In Fig 3 we visualise the impulse response (centred on a vertex on the forehead) of a selected Laplacian polynomial filter from the architecture of [39] (left) and of a spiral convolutional filter with h = 1 (right).

Finally, the equivalence of spiral convolutions to image convolutions allows the use of long-studied practices of the computer vision community. For example, small patches can be used, leading to few parameters and fast computation. Furthermore, dilated convolutions [50] can also be adapted to the spiral operator by simply sub-sampling the spiral. We also argue that our operator could be applied to other domains, such as point clouds, where an ordering of the data points can be enforced.

Figure 3: Activations of ChebNet vs spiral convolutions

3.2. Neural 3D Morphable Models

Let $F = [f_0 | f_1 | \ldots | f_N]$, $f_i \in \mathbb{R}^{d \cdot m}$, be the matrix of all the signals defined on a set of meshes in dense correspondence that are sampled from a distribution $\mathcal{D}$, where $d$ is the dimensionality of the signal on the mesh (vertex position, texture etc.) and $m$ the number of vertices. A linear 3D Morphable Model [5] represents arbitrary instances $y \in \mathcal{D}$ as a linear combination of the $k$ largest eigenvectors of the covariance matrix of $F$ by making a gaussianity assumption:

$$y \approx \bar{f} + \sum_{i=1}^{k} \alpha_i \sqrt{d_i}\, v_i, \qquad (8)$$

where $\bar{f}$ is the mean shape, $v_i$ is the $i$th principal component, $d_i$ the respective eigenvalue, and $\alpha_i$ the linear weight coefficient. Given its linear formulation, the representational power of the 3DMM is constrained by the span of the eigenvectors, while its parameters scale linearly w.r.t. the number of the eigencomponents used, leading to large parametrisations for meshes of high resolution.
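Eq. (8) is ordinary PCA on the stacked mesh signals. A compact numpy sketch (illustrative, with hypothetical function names) of fitting the linear model and synthesising a new instance:

```python
import numpy as np

def fit_linear_3dmm(F, k):
    """Sketch of the linear 3DMM of Eq. (8): PCA on the signal matrix F
    (N samples x d*m features). Returns the mean shape, the k principal
    components v_i, and the covariance eigenvalues d_i."""
    mean = F.mean(axis=0)
    U, S, Vt = np.linalg.svd(F - mean, full_matrices=False)
    eigvals = S ** 2 / (F.shape[0] - 1)    # covariance eigenvalues d_i
    return mean, Vt[:k], eigvals[:k]

def synthesize(mean, V, eigvals, alpha):
    """y ~ mean + sum_i alpha_i * sqrt(d_i) * v_i"""
    return mean + (alpha * np.sqrt(eigvals)) @ V
```

The parameter count of the model is the $k \times d \cdot m$ basis plus the mean, which is the "scales linearly" cost referred to above.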

In contrast, in this paper, we use spiral convolutions as a building block to build a fully differentiable non-linear Morphable Model. In essence, a Neural 3D Morphable Model is a deep convolutional mesh autoencoder that learns hierarchical representations of a shape. An illustration of the architecture can be found in Fig 1. Leveraging the connectivity of the graph with spiral convolutional filters, we allow for local processing of each shape, while the hierarchical nature of the model allows learning at multiple scales. This way we manage to learn semantically meaningful representations and considerably reduce the number of parameters. Furthermore, we bypass the need to make assumptions about the distribution of the data.

Similar to traditional convolutional autoencoders, we make use of series of convolutional layers with small receptive fields followed by pooling and unpooling, for the encoder and the decoder respectively, where a decimated or upsampled version of the mesh is obtained each time and the features of the existing vertices are either aggregated or extrapolated. We follow [39] for the calculation of the features of the added vertices after upsampling, i.e. through interpolation by weighting the nearby vertices with barycentric coordinates. The network is trained by minimising the L1 norm between the input and the predicted output.

3.3. Spiral Convolutional GAN

In order to improve the synthesis of meshes of high resolution, and thus increased detail, we extend our framework with a distribution matching scheme. In particular, we propose a mesh Wasserstein GAN [2] with gradient penalty to enforce the Lipschitz constraint [21], which is trained to minimise the Wasserstein divergence between the real distribution of the meshes and the distribution of those produced by the generator network. The generator and discriminator architectures have the same structure as the decoder and the encoder of the Neural3DMM, respectively. Via this framework, we obtain two additional properties that are inherently absent from the autoencoder: high frequency detail and a straightforward way to sample from the latent space.

4. Evaluation

In this section, we showcase the effectiveness of our proposed method on a variety of shape datasets. We conduct a series of ablation studies in order to compare our operator to other Graph Neural Networks, using the same autoencoder architecture. First, we demonstrate the inherently higher capacity of spiral convolutions compared to ChebNet (spectral). Moreover, we discuss the advantages of our method compared to soft-attention based Graph Neural Networks, such as the patch-operator based ones. Finally, we show the importance of the consistency of the ordering by comparing our method to different variants of the method proposed in [25].

Furthermore, we quantitatively show that our method can yield better representations than the linear 3DMM and COMA, while maintaining a small parameter count and frequently allowing a more compact latent representation. Moreover, we proceed with a qualitative evaluation of our method by generating novel examples through vector space arithmetic. Finally, we assess our intrinsic GAN in terms of its ability to produce high resolution realistic examples.

For all the cases, we choose as the signal on the mesh the normalised deformations from the mean shape, i.e. for every vertex we subtract its mean position and divide by the standard deviation. In this way, we encourage signal stationarity, thus facilitating optimisation. The code is available at https://github.com/gbouritsas/neural3DMM.
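The normalisation step above can be written directly; a small sketch (hypothetical names), assuming the shapes are stacked as (num_meshes, num_vertices, 3):

```python
import numpy as np

def normalise_meshes(X):
    """Per-vertex deformations from the mean shape, divided by the
    per-vertex standard deviation, as described above.
    X: (num_meshes, num_vertices, 3)."""
    mean = X.mean(axis=0, keepdims=True)        # mean shape
    std = X.std(axis=0, keepdims=True) + 1e-8   # guard against zero variance
    return (X - mean) / std, mean, std

def denormalise(Xn, mean, std):
    """Invert the normalisation to recover vertex positions."""
    return Xn * std + mean
```

The statistics are computed on the training set only and reused at test time; the small epsilon is an added safeguard, not part of the paper.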

4.1. Datasets

COMA. The facial expression dataset from Ranjan et al. [39], consisting of 20K+ 3D scans (5023 vertices) of twelve unique identities performing twelve types of extreme facial expressions. We used the same data split as in [39].

DFAUST. The dynamic human body shape dataset from Bogo et al. [6], consisting of 40K+ 3D scans (6890 vertices) of ten unique identities performing actions such as leg and arm raises, jumps, etc. We randomly split the data into a test set of 5000, a validation set of 500, and a training set of 34.5K+.

MeIn3D. The 3D large scale facial identity dataset from Booth et al. [8], consisting of more than 10,000 distinct identity scans with 28K vertices, which cover a wide range of gender, ethnicity and age. For the subsequent experiments, the MeIn3D dataset was randomly split, within demographic constraints to ensure gender, ethnic and age diversity, into 9K train and 1K test meshes.

For the quantitative experiments of sections 4.3 and 4.4, the evaluation metric used is generalisation, which measures the ability of a model to represent novel shapes from the same distribution as it was trained on. More specifically, we evaluate the average, per sample and per vertex, Euclidean distance in the 3D space (in millimetres) between corresponding vertices in the input and its reconstruction.
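For concreteness, the generalisation metric described above amounts to the following (a sketch with hypothetical names):

```python
import numpy as np

def generalisation_error(X_true, X_pred):
    """Average per-sample, per-vertex Euclidean distance between
    corresponding vertices (in the units of the meshes, e.g. mm).
    X_true, X_pred: (num_meshes, num_vertices, 3)."""
    return np.linalg.norm(X_true - X_pred, axis=-1).mean()
```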

4.2. Implementation Details

We denote as SC(h, w) a spiral convolution of h hops and w filters, DS(p) and US(p) a downsampling and an upsampling by a factor of p, respectively, FC(d) a fully connected layer, and l the number of vertices after the last downsampling layer. The simple Neural3DMM for the COMA and DFAUST datasets is the following:

Enc: SC(1, 16) → DS(4) → SC(1, 16) → DS(4) → SC(1, 16) → DS(4) → SC(1, 32) → DS(4) → FC(d)

Dec: FC(l ∗ 32) → US(4) → SC(1, 32) → US(4) → SC(1, 16) → US(4) → SC(1, 16) → US(4) → SC(1, 3)
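As a sanity check of the bookkeeping above, one can trace how the vertex count shrinks through the four DS(4) layers. The exact counts depend on the decimation algorithm used, so the ceil division below is an assumption; it happens to reproduce the commonly reported COMA hierarchy (5023 → 1256 → 314 → 79 → 20):

```python
import math

def encoder_vertex_counts(n_vertices, num_downsamples, factor=4):
    """Vertices remaining after each DS(factor) layer, assuming each
    downsampling keeps ceil(n / factor) vertices (an approximation)."""
    counts = [n_vertices]
    for _ in range(num_downsamples):
        counts.append(math.ceil(counts[-1] / factor))
    return counts

# COMA meshes have 5023 vertices; the encoder applies four DS(4) layers
counts = encoder_vertex_counts(5023, 4)
l = counts[-1]          # vertices feeding the final FC(d)
fc_input_dim = l * 32   # the last encoder convolution has 32 filters
```

This is why the decoder starts with FC(l ∗ 32): it must produce one 32-dimensional feature per vertex of the coarsest mesh before the upsampling chain.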

For MeIn3D, due to the high vertex count, we modified the COMA architecture for our simple Neural3DMM by adding an extra convolution and an extra downsampling/upsampling layer in the encoder and the decoder respectively (encoder filter sizes: [8,16,16,32,32], decoder: mirror of the encoder). The larger Neural3DMM follows the above architecture, but with an increased parameter space. For COMA, the convolutional filters of the encoder had sizes [64,64,64,128] and for MeIn3D the sizes were [8,16,32,64,128], while the decoder is a mirror of the encoder. For DFAUST, the sizes were [16,32,64,128] and [128,64,32,32,16], and dilated convolutions with h = 2 hops and dilation ratio r = 2 were used for the first and the last two layers of the encoder and the decoder respectively. We observed that by adding an additional convolution at the very end (of size equal to the size of the input feature space), training was accelerated. All of our activation functions were ELUs [14]. Our learning rate was $10^{-3}$ with a decay of 0.99 after each epoch, and our weight decay was $5 \times 10^{-5}$. All models were trained for 300 epochs.

Figure 4: Quantitative evaluation of our Neural3DMM against the baselines, in terms of generalisation and # of parameters

Figure 5: Spiral vs ChebNet (spectral) filters

4.3. Ablation Studies

4.3.1 Isotropic vs Anisotropic Convolutions

For the purposes of this experiment we used the architecture deployed by the authors of [39]. The number of parameters in our case is slightly larger due to the fact that the immediate neighbours, which determine the size of the spiral, range from 7 to 10, while the polynomials used in [39] go up to the 6th power of the Laplacian. For both datasets, as clearly illustrated in Fig 5, spiral convolution-based autoencoders consistently outperform the spectral ones for every latent dimension, in accordance with the analysis made in section 3.1. Additionally, when increasing the latent dimension, our model's performance increases at a higher rate than its counterpart's. Notice that the number of parameters scales the same way as the latent size grows, but the spiral model makes better use of the added parameters, especially looking at dimensions 16, 32, 64, and 128. Especially on the COMA dataset, the spectral model seems to be flattening between 64 and 128, while the spiral one is still noticeably decreasing.

4.3.2 Spiral vs Attention based Convolutions

In this experiment we compare our method with certain state-of-the-art soft-attention based Graph Neural Networks: MoNet, the patch-operator based model of [32], where the attention weights are the learnable parameters of gaussian kernels defined on a pseudo-coordinate space¹; FeastNet [43] and Graph Attention [42], where the attention weights are learnable functions of the input features.

In Table 1, we provide results on the COMA dataset, using the simple Neural3DMM architecture with latent size 16. We choose the number of attention heads (Gaussian kernels in [32]) to be either 9 (equal to the size of the spiral in our method, for a fair comparison) or 25 (as in [32], to showcase the effect of over-parametrisation). For a similar number of parameters, our method manages to outperform its counterparts, while compared to over-parametrised soft attention networks it either outperforms them or achieves slightly worse performance. This shows that the spiral operator makes more efficient use of the available learnable parameters, making it a lightweight alternative to attention-based methods that does not sacrifice performance. Its formulation also allows for fast computation; in Table 1 we measure per-mesh inference time in ms (on a GeForce RTX 2080 Ti GPU).

4.3.3 Comparison to Lim et al. [25]

In order to showcase how the operator behaves when the ordering is not consistent, we perform experiments under four scenarios: the original formulation of [25], where

¹ Here we display the best obtained results when choosing the pseudo-coordinates to be local Cartesian.


            GAT           FeastNet      MoNet         Ours
kernels     9      25     9      25     9      25     -
error       0.762  0.732  0.750  0.623  0.708  0.583  0.635
params      50K    101K   49K    98K    48K    95K    48K
time (ms)   12.77  15.37  9.04   9.66   10.55  10.96  8.18

Table 1: Spirals vs soft-attention operators

each spiral is randomly oriented for every mesh and every epoch (rand mesh & epoch); choosing the same orientation across all the meshes randomly at every epoch (rand epoch); choosing different orientations for every mesh, but keeping them fixed across epochs (rand mesh); and fixed ordering (Ours). We compare the LSTM-based approach of [25] and our linear projection formulation (Eq (2)). The experimental setting and architecture are the same as in the previous section. The proposed approach achieves over 28% improved performance compared to [25], which substantiates the benefits of passing corresponding points through the same transformations.
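The linear projection formulation can be sketched as a gather-and-project step. The helper below is our own illustration, assuming a precomputed spiral index list; it also shows why a rotated spiral, as in the random-orientation scenarios, pairs each vertex with a different weight slice and therefore produces different activations.

```python
import numpy as np

def spiral_conv(features, spiral, W):
    """Sketch of spiral convolution as a single linear projection:
    gather the features of the vertices along the precomputed spiral
    ordering, concatenate them, and apply one weight matrix W.
    features: (num_vertices, fin); spiral: list of K vertex indices;
    W: (K * fin, fout)."""
    gathered = features[spiral].reshape(-1)  # concatenated (K * fin,) vector
    return gathered @ W                      # (fout,) output feature

# A rotated spiral (cf. the random-orientation ablation) changes which
# weight slice each vertex meets, so the output changes:
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 2))
W = rng.standard_normal((3 * 2, 4))
out_fixed = spiral_conv(feats, [0, 1, 2], W)
out_rot = spiral_conv(feats, [1, 2, 0], W)
```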

operation    rand mesh & epoch   rand mesh   rand epoch   fixed ordering
LSTM [25]    0.888               0.880       0.996        0.792
lin. proj.   0.829               0.825       0.951        0.635 (Ours)

Table 2: Importance of the ordering consistency

4.4. Neural 3D Morphable models

Figure 6: Colour coding of the per-vertex Euclidean error of the reconstructions produced by PCA (2nd row), COMA (3rd row), and our Neural3DMM (bottom row). Top row is ground truth.

4.4.1 Quantitative results

In this section, we compare the following methods for different dimensions of the latent space: PCA, the 3D Morphable Model [5]; COMA, the ChebNet-based mesh autoencoder; Neural3DMM (small), our spiral convolution autoencoder with the same architecture as in COMA; and Neural3DMM (ours), our proposed Neural3DMM framework, where we enhanced our model with a larger parameter space (see Sec. 4.2). The latent sizes were chosen based on the variance explained by PCA (explained variance of roughly 85%, 95% and 99% of the total variance).

As can be seen from the graphs in Fig 4, our Neural3DMM achieves smaller generalisation errors in every case it was tested on. For the COMA and DFAUST datasets, all hierarchical intrinsic architectures outperform PCA for small latent sizes. This should probably be attributed to the fact that the localised filters used allow for effective reconstruction of smaller patches of the shape, such as arms and legs (for the DFAUST case), whilst PCA attempts a more global reconstruction, so its error is distributed equally across the entire shape. This is well shown in Fig 6, where we compare exemplar reconstructions of samples from the test set (latent size 16). It is clearly visible that PCA prioritises body shape over pose, resulting in body parts in the wrong locations (for example, see the right leg of the woman in the leftmost column). On the contrary, COMA places the vertices in approximately correct locations, but struggles to recover the fine details of the shape, leading to various artefacts and deformities; our model, on the other hand, seemingly balances these two difficult tasks, resulting in quality reconstructions that preserve pose and shape.
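The colour coding in Fig 6 corresponds to the per-vertex Euclidean distance between each reconstruction and the ground truth; a minimal sketch (our own helper, not the authors' code):

```python
import numpy as np

def per_vertex_error(pred, gt):
    """Per-vertex Euclidean distance between a reconstructed mesh and
    the ground truth, as colour-coded in Fig 6.
    pred, gt: (num_vertices, 3) vertex coordinate arrays."""
    return np.linalg.norm(pred - gt, axis=1)
```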

Comparing to [39], it is again apparent here that our spiral-based autoencoder has increased capacity, which, together with the increased parameter space, makes our larger Neural3DMM outperform the other methods by a considerably large margin in terms of both generalisation and compression. Despite the fact that for higher dimensions PCA can explain more than 99% of the total variance, thus making it a tough-to-beat baseline, our larger model still manages to outperform it. The main advantage here is the substantially smaller number of parameters of which we make use. This is clearly seen in the comparison on the MeIn3D dataset, where the large vertex count makes non-local methods such as PCA impractical. It is necessary to mention here that larger latent space sizes are not necessarily desirable for an autoencoder, because they might lead to less semantically meaningful and discriminative representations for downstream tasks.

4.4.2 Qualitative results

Here, we assess the representational power of our models by the common practice of testing their ability to perform linear algebra in their latent spaces.

Interpolation (Fig 7): We choose two sufficiently different samples x1 and x2 from our test set, encode them in their


latent representations z1 and z2, and then produce intermediate encodings by sampling the line that connects them, i.e. z = a z1 + (1 − a) z2, where a ∈ (0, 1).
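The interpolation amounts to convex combinations of the two encodings; a small sketch (the decoder call is omitted, and the function name is ours):

```python
import numpy as np

def interpolate(z1, z2, steps=5):
    """Latent interpolation z = a*z1 + (1 - a)*z2 for a in (0, 1);
    decoding each intermediate z yields the meshes of Fig 7."""
    alphas = np.linspace(0.0, 1.0, steps + 2)[1:-1]  # exclude the endpoints
    return [a * z1 + (1 - a) * z2 for a in alphas]
```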

Figure 7: Interpolations between expressions and identities

Extrapolation (Fig 8): Similarly, we decode latent representations that reside on the line defined by z1 and z2, but outside the respective line segment, i.e. z = a z1 + (1 − a) z2, where a ∈ (−∞, 0) ∪ (1, +∞). We choose z1 to be the neutral expression for COMA and the neutral pose for DFAUST, in order to showcase the exaggeration of a specific characteristic of the shape.

Figure 8: Extrapolation. Left: neutral expression/pose

Shape Analogies (Fig 9): We choose three meshes A, B, C, and construct a D such that it satisfies A:B::C:D using linear algebra in the latent space, as in [31]: e(B) − e(A) = e(D) − e(C), where e(·) denotes the encoding; we then solve for e(D) and decode it. This way we transfer a specific characteristic using meshes from our dataset.
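The analogy construction reduces to one line of latent arithmetic: solving e(B) − e(A) = e(D) − e(C) for e(D). A sketch with hypothetical encodings:

```python
import numpy as np

def analogy(eA, eB, eC):
    """Solve A:B::C:D in the latent space: from e(B) - e(A) = e(D) - e(C),
    e(D) = e(B) - e(A) + e(C); decoding e(D) gives the transferred mesh."""
    return eB - eA + eC
```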

Figure 9: Analogies in MeIn3D and DFAUST

4.5. GAN evaluation

In Figure 10, we sampled several faces from the latent distribution of the trained generator. Notice that they look realistic and, following the statistics of the

dataset, span a large proportion of the real distribution of human faces in terms of ethnicity, gender and age. Compared to the most popular approach for synthesising faces, i.e. the 3DMM, our model learns to produce fine details of the facial structure, making its samples hard to distinguish from real 3D scans, whereas the 3DMM, although it produces smooth surfaces, frequently makes it easy to tell the difference between real and artificially produced samples. We direct the reader to the supplementary material to compare with samples drawn from the 3DMM's latent space.

Figure 10: Generated identities from our intrinsic 3D GAN

5. Conclusion

In this paper we introduced a representation learning and generative framework for fixed-topology 3D deformable shapes, using a mesh convolutional operator, spiral convolutions, that efficiently encodes the inductive bias of the fixed topology. We showcased the inherent representational power of the operator, as well as its reduced computational complexity compared to prior work on graph convolutional operators, and showed that our mesh autoencoder achieves state-of-the-art results in mesh reconstruction. Finally, we presented the generation capabilities of our models through vector space arithmetic, as well as by synthesising novel facial identities. Regarding future work, we plan to extend our framework to general graphs and 3D shapes of arbitrary topology, as well as to other domains that have capacity for an implicit ordering of their primitives, such as point clouds.

6. Acknowledgements

This research was partially supported by ERC Consolidator Grant No. 724228 (LEMAN), Google Research Faculty awards, and the Royal Society Wolfson Research Merit award. G. Bouritsas was funded by the Imperial College London, Department of Computing, PhD scholarship. Dr. Zafeiriou acknowledges support from a Google Faculty award and EPSRC fellowship Deform (EP/S010203/1). S. Ploumpis was supported by EPSRC Project (EP/N007743/1) FACER2VM.


References

[1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. International Conference on Machine Learning (ICML), 2018. 1, 2

[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017. 5

[3] A. Arsalan Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2

[4] H. Ben-Hamu, H. Maron, I. Kezurer, G. Avineri, and Y. Lipman. Multi-chart generative surface modeling. In SIGGRAPH Asia 2018 Technical Papers, page 215. ACM, 2018. 1, 2

[5] V. Blanz, T. Vetter, et al. A morphable model for the synthesis of 3d faces. In SIGGRAPH, 1999. 2, 4, 7

[6] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black. Dynamic FAUST: Registering human bodies in motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 5

[7] J. Booth, A. Roussos, A. Ponniah, D. Dunaway, and S. Zafeiriou. Large scale 3d morphable models. International Journal of Computer Vision (IJCV), 2018. 2

[8] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. A 3d morphable model learnt from 10,000 faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 5

[9] D. Boscaini, J. Masci, E. Rodola, and M. Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2016. 1, 2, 3

[10] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017. 1, 2

[11] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014. 2

[12] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 2014. 2

[13] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. CVPR, 2019. 2

[14] D. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). CoRR, 2015. 5

[15] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), 2016. 2

[16] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems (NIPS), 2016. 4

[17] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NIPS), 2015. 2

[18] M. Fey, J. E. Lenssen, F. Weichert, and H. Muller. Splinecnn: Fast geometric deep learning with continuous b-spline kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

[19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017. 2

[20] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference on Computer Vision (ECCV). Springer, 2016. 2

[21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems (NIPS), 2017. 5

[22] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. 2, 4

[23] N. Kolotouros, G. Pavlakos, and K. Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019. 1

[24] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4D scans. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2017. 2

[25] I. Lim, A. Dielen, M. Campen, and L. Kobbelt. A simple approach to intrinsic correspondence learning on unstructured 3d meshes. Proceedings of the European Conference on Computer Vision Workshops (ECCVW), 2018. 2, 3, 4, 5, 6, 7

[26] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1886–1895, 2018. 1

[27] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015. 2

[28] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), pages 37–45, 2015. 1, 2, 3

[29] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015. 2


[30] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3d reconstruction in function space. CVPR, 2019. 2

[31] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013. 8

[32] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2, 3, 6

[33] S. Moschoglou, S. Ploumpis, M. Nicolaou, and S. Zafeiriou. 3dfacegan: Adversarial nets for 3d face representation, generation, and translation. arXiv preprint arXiv:1905.00307, 2019. 1, 2

[34] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. CVPR, 2019. 2

[35] S. Ploumpis, H. Wang, N. Pears, W. A. Smith, and S. Zafeiriou. Combining 3d morphable models: A large scale face-and-head model. CVPR, 2019. 2

[36] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 1, 2

[37] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 2

[38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NIPS), pages 5099–5108, 2017. 1

[39] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. Proceedings of the European Conference on Computer Vision (ECCV), 2018. 1, 2, 4, 5, 6, 7

[40] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 2017. 2

[41] A. Sharma, O. Grau, and M. Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pages 236–250. Springer, 2016. 2

[42] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018. 2, 3, 6

[43] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2598–2606, 2018. 2, 3, 6

[44] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018. 1

[45] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems (NIPS), 2016. 1, 2

[46] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2

[47] Y. Yang, C. Feng, Y. Shen, and D. Tian. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

[48] L. Yi, H. Su, X. Guo, and L. J. Guibas. Syncspeccnn: Synchronized spectral cnn for 3d shape segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2

[49] Z. Ying, J. You, C. Morris, X. Ren, W. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In Advances in Neural Information Processing Systems (NIPS), 2018. 2

[50] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. International Conference on Learning Representations (ICLR), 2016. 4