
Deep Generative Vision as Approximate Bayesian Computation

Tejas D. Kulkarni¹, Ilker Yildirim¹,³, Pushmeet Kohli², Winrich A. Freiwald³, and Joshua B. Tenenbaum¹

¹ Computer Science and Artificial Intelligence Laboratory, BCS, MIT
² Microsoft Research Cambridge
³ Rockefeller University

Abstract

Probabilistic formulations of inverse graphics have recently been proposed for a variety of 2D and 3D vision problems [15, 12, 14, 9]. These approaches represent visual elements in the form of graphics simulators that produce approximate renderings of visual scenes. Existing approaches model either pixel data or hand-crafted intermediate representations such as edge maps, super-pixels, and silhouettes. However, the choice of features can drastically affect inference quality and run-time. Recently, deep learning techniques such as convolutional neural networks (CNNs) have demonstrated impressive performance on tasks such as object recognition and scene pixel labeling, suggesting the superiority of CNN-based features. Encouraged by these findings, we test the ability of CNNs, in combination with Approximate Bayesian Computation (ABC), to invert high-dimensional generative inverse graphics models from single images. We successfully applied a variant of the probabilistic approximate MCMC algorithm [21], which uses a CNN to compute summary statistics, to two real-world problems: inferring the 3D pose of humans and generative face analysis from single images. Computer graphics is advancing rapidly in designing solutions to hard image synthesis problems. Our experiments indicate that combining rich probabilistic inverse graphics models with deep learning approaches could exploit such simulators directly to solve the hard inversion problem.

1 Introduction

The idea of vision as inverse graphics has recently attracted renewed interest [15, 12, 14, 9]. Moreover, generative models for natural images have been widely explored by a variety of analysis-by-synthesis approaches [19, 5, 24, 22]. These approaches present an appealing framework for integrating top-down processing with bottom-up computation for learning and inference, and they inspire the approach we take in this paper. However, such approaches often require considerable problem-specific engineering, mainly to design custom inference strategies, MCMC proposals, and features.

The inverse graphics approach often implies that generative models should explain pixel data. This is hard to achieve, especially for natural images, as producing photo-realistic samples is both difficult and extremely expensive. Approaches such as [11] try to get around this issue by using hand-defined intermediate representations of images, such as contours and silhouettes, that might have similar statistics in natural images and graphics renderings. However, determining the best intermediate representation of an image is still an open problem.

We begin by considering convolutional neural networks (CNNs) [13] as an architecture for computing summary statistics from natural or synthetic images. CNNs have demonstrated superior performance on a variety of tasks such as object recognition [10] and scene parsing [7]. CNN models trained exclusively on the ImageNet dataset have been shown to generalize to a variety of other datasets, as well as to tasks beyond object identity recognition and label assignment [18]. Using a CNN pre-trained on ImageNet [6], we pose the following question: how well can we recover the latents of an inverse graphics model by modeling the learnt representation of the CNN?

Contact (left to right): [email protected], [email protected], [email protected], [email protected], [email protected]


(a) Figure 1a: Deep Generative Vision ABC overview.

(b) Figure 1b: Pseudo-code for the ABC-based approximate MCMC inverse graphics algorithm:

1: Si: scene variables
2: ID: observed image; IR: sampled image given latents S
3: ε: model error in ABC; g(·|·): graphics simulator
4: q(S′|S): transition kernel
5: Θ: graphics simulator variables
6: α: step size of the elliptical move; β: mixture-kernel probability
7: ρ(·,·): distance function; SΣ: covariance of the Gaussian priors in S (if present)
8: Set S ∼ π(S)
9: Set IR = g(Θ ∗ S)
10: Let ε ∼ N(0, σ0²)
11: for l = 1 ... convergence do
12:   θ ∼ N(0, SΣ)
13:   Set q(S′|S) = β π(S) + (1−β)(√(1−α²) S + αθ)
14:   S′ ∼ q(S′|S), I′R ∼ g(Θ ∗ S′)
15:   Accept S′ with probability min(1, [πε(ν(ID)−ν(I′R)) q(S|S′) π(S′)] / [πε(ν(ID)−ν(IR)) q(S′|S) π(S)])
16:   otherwise set S′ = S
17: end for
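To make the control flow of Figure 1b concrete, here is a minimal Python sketch of the sampler. The renderer, the CNN summary, the i.i.d. standard-normal prior, and all hyperparameter values (`render`, `cnn_features`, `dim`, `alpha`, `beta`, `sigma`) are illustrative placeholders we introduce here, not the paper's actual simulator or network.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(S):
    # hypothetical stand-in for the graphics simulator g(Theta * S)
    return np.tanh(np.outer(S, S))

def cnn_features(image):
    # hypothetical stand-in for the CNN summary statistic nu(.)
    return image.mean(axis=0)

def log_abc_kernel(diff, sigma=0.1):
    # Gaussian ABC kernel pi_eps on the summary discrepancy
    # (Wilkinson's model-error interpretation: eps ~ N(0, sigma^2))
    return -0.5 * float(np.sum(diff ** 2)) / sigma ** 2

def abc_mcmc(I_D, dim=16, iters=1000, alpha=0.1, beta=0.1, sigma=0.1):
    S = rng.standard_normal(dim)                     # line 8: S ~ pi(S), here N(0, I)
    nu_D = cnn_features(I_D)
    log_p = log_abc_kernel(nu_D - cnn_features(render(S)), sigma)
    for _ in range(iters):                           # line 11: until convergence
        if rng.random() < beta:                      # mixture kernel, prob. beta:
            S_new = rng.standard_normal(dim)         #   independent draw from the prior
        else:                                        # prob. 1 - beta:
            theta = rng.standard_normal(dim)         #   line 12: theta ~ N(0, S_Sigma)
            S_new = np.sqrt(1 - alpha**2) * S + alpha * theta   # elliptical move
        log_p_new = log_abc_kernel(nu_D - cnn_features(render(S_new)), sigma)
        # Both proposal components are reversible w.r.t. the N(0, I) prior, so the
        # prior and proposal terms cancel and line 15 reduces to the kernel ratio.
        if np.log(rng.random()) < log_p_new - log_p:
            S, log_p = S_new, log_p_new
    return S

# usage sketch: recover latents of a rendered "observation"
# S_true = rng.standard_normal(16); S_hat = abc_mcmc(render(S_true))
```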

Bayesian inference in our setting is difficult because the likelihood is unavailable in closed form due to the rendering pipeline and the subsequent CNN computations. However, the statistics community has made considerable progress on likelihood-free Bayesian inference, often termed Approximate Bayesian Computation (ABC). ABC therefore provides a natural formulation of inverse graphics in likelihood-free settings. We define a simple distance function on the summary statistics obtained from the CNN and apply a variant of the probabilistic approximate MCMC algorithm [21] for inference.

To the best of our knowledge, this paper is the first inverse graphics formulation to utilize a common deep learning architecture to solve a variety of 3D computer vision tasks. We demonstrate the efficacy of our approach on two real-world image interpretation problems: obtaining 3D shape, texture, and lighting parameters of faces from single images, and inferring the 3D pose of articulated human bodies from single images.

2 Model

The latent variables in our models can be described using the template first proposed in Generative Probabilistic Graphics Programming (GPGP) [15], with various modifications as highlighted in Figure 1a. The first component is the stochastic scene generator expressing the prior over 3D scenes, denoted by S = {Si}, where S is decomposed into parts Si with independent priors P(Si). The second component is the approximate renderer, which takes in the scene S and produces IR = g(Θ ∗ S), where IR is the 2D projection of the 3D generated scene and Θ denotes additional control variables for the rendering engine. The function g(·) is typically a graphics simulator which abstracts graphics routines, including fragment and vertex shaders. The third component is the deep learning module, which transforms IR into a lower-dimensional feature vector ν(IR). The final component in our model is the stochastic comparator ρ(ν(ID), ν(IR)), which is simply the L1 distance between the summary statistics of the rendered and observed images.
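Read as code, the four components compose into a single forward pass whose output is compared against the observed summary; a schematic sketch (the three callables are stand-ins, as in the sampler above):

```python
import numpy as np

def l1_comparator(nu_obs, nu_ren):
    """Component 4, the stochastic comparator rho: L1 distance between summaries."""
    return np.abs(nu_obs - nu_ren).sum()

def forward(sample_scene, render, cnn_features):
    """Compose components 1-3: scene prior -> approximate renderer -> CNN summary."""
    S = sample_scene()            # stochastic scene generator, S ~ P(S)
    I_R = render(S)               # approximate renderer, I_R = g(Theta * S)
    return S, cnn_features(I_R)   # deep learning module, nu(I_R)
```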

The basic idea behind ABC is to use a summary statistic ν(·) along with a small tolerance ε to produce a reasonable approximation to the posterior distribution. Let IB(·) denote the indicator function of the set B, and let Aε,ID = {IR ∈ D | ρ(ν(ID), ν(IR)) ≤ ε}. We can now formulate the image interpretation task as approximately sampling the posterior distribution of scene elements Si given observations ID:

π(S|ID) ≈ πε(S|ID), where

πε(S|ID) = ∫ πε(S, IR|ID) dIR ∝ [ π(S) g(IR|Θ ∗ S) I_{Aε,ID}(IR) ] / [ ∫_{Aε,ID × S} π(S) g(IR|Θ ∗ S) dS ]    (1)
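Equation 1 is the standard ABC construction: posterior mass is restricted to renderings whose summaries fall within ε of the observed ones. A minimal rejection-ABC sketch of this acceptance set, assuming the same hypothetical `render` and `cnn_features` stand-ins as above:

```python
import numpy as np

def rejection_abc(I_D, sample_prior, render, cnn_features, eps, n=10000):
    """Sample from pi_eps(S | I_D) by keeping draws whose renderings land in
    A_{eps,I_D} = {I_R : rho(nu(I_D), nu(I_R)) <= eps}, with rho the L1 distance."""
    nu_D = cnn_features(I_D)
    accepted = []
    for _ in range(n):
        S = sample_prior()                                 # S ~ pi(S)
        I_R = render(S)                                    # I_R ~ g(. | Theta * S)
        if np.abs(nu_D - cnn_features(I_R)).sum() <= eps:  # indicator I_{A_eps,I_D}
            accepted.append(S)
    return accepted
```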

3 Inference via ABC

Inference for inverting graphics simulators is intractable, and we therefore resort to Markov chain Monte Carlo (MCMC) for approximate inference.



Figure 2: Visualization of CNN top-level features using t-SNE. (a) CNNs appear to approximately represent face pose, identity (not visible due to clutter), and lighting in different semantic clusters. (b) The shared manifold between rendered images from our model and real images containing a person is semantically organized in a reasonable fashion. In the future, it will be interesting to explore embeddings of the lower layers, which will be less invariant to some of the important transformations we need to infer.

Inference in our model is especially hard for two reasons: high-dimensional, highly coupled latent variables, and the use of a distance function instead of a likelihood function. Moreover, since our model explains the CNN features of the data, the expected value of ε is problem-dependent in many real-world image interpretation settings. Wilkinson [21] made a key observation about the interpretation of ABC algorithms: the approximation in the posterior can be viewed in terms of the distribution implicitly assumed for the error term ε. For the task of generative vision, ε can be interpreted as model error, as discussed in [21].

The latent space for some models, such as faces, can have anywhere between 400 and 800 dimensions. Single-site Metropolis-Hastings (MH) updates therefore quickly become infeasible (see Figure 3d for speed trade-offs). Whenever our model has Gaussian priors, we use elliptical proposals on large sets of coupled continuous random variables. Neal [2] and Murray et al. [16] studied generative models with zero-mean Gaussian priors having arbitrary covariance structure. Assume X ∼ N(0, Σ). Given a state X, a new state X′ can be efficiently proposed as X′ = √(1−α²) X + αθ, where θ ∼ N(0, Σ) and α ∼ Uniform(−1, 1).
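As a quick check (ours; the derivation is not spelled out in the text), the move leaves the Gaussian prior invariant for any fixed α, since X and θ are independent and zero-mean:

```latex
\operatorname{Cov}(X') = (1-\alpha^{2})\operatorname{Cov}(X) + \alpha^{2}\operatorname{Cov}(\theta)
                       = (1-\alpha^{2})\Sigma + \alpha^{2}\Sigma = \Sigma,
\qquad \text{so } X' \sim \mathcal{N}(0,\Sigma).
```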

Empirically (Figure 3d), compared to single-site MH, moving along the ellipse converges rapidly and typically reaches a better local minimum. This is intuitive: as the dimensionality of the problem space grows, the cost of single-site updates scales linearly with the number of dimensions. We closely follow the probabilistic approximate MCMC algorithm from [21], with modifications as described in Figure 1b.

4 Experiments

We experimented with two 3D visual perception tasks: parsing 3D faces and bodies from single images. Other than changing the scene prior P(S), we ran the same inference code for both tasks, highlighting the generality of our approach. To assess the reliability of CNN features on these images (for details of the CNN architecture see [11]), we systematically tested the feature space of images generated from the model and of real images. As shown in Figure 2, we collected the top-level CNN features for every image and compressed the representations to 2-dimensional feature vectors using the t-SNE toolkit [20] for visualization. Surprisingly, the CNN features seem to approximately represent high-order semantic attributes relevant to the inverse graphics tasks, such as face pose, lighting, and body pose. This confirms our belief that our choice of summary statistics reasonably captures the semantic richness of our input stimuli. In ongoing work, we are testing the effect of modeling all layers of the CNN in order to better capture all transformations required by the inverse graphics model. Interestingly, such a model could offer novel insights into the reverse hierarchy theory of visual scene perception [1].
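The visualization described above is easy to reproduce with any pre-trained CNN; a sketch using scikit-learn's t-SNE in place of the toolkit of [20] (the `features` array of top-level CNN activations is assumed given):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_cnn_features(features, perplexity=30.0, seed=0):
    """Compress top-level CNN features (n_images x d) to 2-D for inspection."""
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(np.asarray(features))

# usage sketch: color points by known condition (pose, lighting, rendered vs. real)
# xy = embed_cnn_features(features)
# plt.scatter(xy[:, 0], xy[:, 1], c=condition_labels)
```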

4.1 Face Analysis

We used the Basel face model [3] in our graphics simulator for the 3D face experiments. The face model consists of two face-related latent variables (Sshape, Stexture), each with 200 dimensions sampled i.i.d. from N(0, 1). Additionally, there are scene variables (Slight, Selevation) ∼ Uniform(−90, 90) and {Sjcamera}j∈{x,y,z} ∼ Uniform(0, 1). Our evaluation dataset consists of 198 faces generated from the model, along with a couple of real faces for illustration. As Figure 3 shows, the L1 distance between the observed and generated summary statistics, averaged over all images, decreases as expected. From Figure 3(a,b), it can be seen that as inference progresses, the posterior samples approximately reconstruct the input stimuli.
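For concreteness, here is a sketch of sampling this face scene prior, using only the dimensionalities and ranges quoted above (the mapping of these coefficients into the Basel model itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_face_scene():
    """Draw one scene S from the face prior described in the text."""
    return {
        "shape":     rng.standard_normal(200),       # S_shape   ~ N(0, 1), 200-D
        "texture":   rng.standard_normal(200),       # S_texture ~ N(0, 1), 200-D
        "light":     rng.uniform(-90.0, 90.0),       # S_light     ~ U(-90, 90)
        "elevation": rng.uniform(-90.0, 90.0),       # S_elevation ~ U(-90, 90)
        "camera":    rng.uniform(0.0, 1.0, size=3),  # S_camera^{x,y,z} ~ U(0, 1)
    }
```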


[Figure 3: panels (a)–(f). Panel (d) plots time in seconds against iterations (0–450) for single-site MH versus elliptical moves; panel (f) plots per-part error (0–25) for Head, Arm1, Arm2, Fot1, and Fot2 under the baseline and our method.]

Figure 3: Experimental results on 3D face and pose estimation from single images. (a) The top row shows observed images; the bottom row shows corresponding posterior samples. (b) Inference on a few illustrative real images; only shape variables are inferred. (c) As expected, the L1 distance ρ(ν(IR), ν(ID)), averaged over all 198 faces, decreases with iterations. (d) We ran 30 independent chains for both proposal moves; elliptical moves give a significant speedup in reaching a given L1 distance (0.5). (e) Human body simulator with a prior over bones. The top row shows observations, the second row shows results from the DPM pose detector [23], and the last row shows our results. We expect that modeling lower layers of the CNN may further improve results. (f) We compared our approach on a subset of the KTH dataset with frontal pose gestures (20 randomly selected images).

4.2 Human Pose Estimation

We use a compositional mesh of the human body written in Blender [4], where parts (vertex groups) of the mesh approximate bones and joints. The model is represented as a tree whose root node is centered at the center of the mesh. The underlying tree is used as the medial axis to deform the mesh in a local, part-wise fashion. Each joint/bone on the armature has a 4 × 4 affine matrix SiL with scale SCi ∼ {Ut(µ0, 0.1)}t∈{x,y,z}, rotation Ri ∼ {Nt(µr, 0.1)}t∈{x,y,z}, and location T i ∼ {Ut(at, bt)}t∈{x,y,z} latent variables. The armature marked on the 3D mesh is depicted in Figure 3e. Whenever the random choices in SiL are resampled during inference, the change is propagated all the way to the root node or to a pre-defined stopping node SjL, so that the 3D mesh evolves smoothly. We quantitatively evaluated performance against the widely used deformable parts model pose detector [23] on a dataset of 20 images collected from a subset of the KTH dataset [17] (front-facing gestures). Our initial experiments (Figure 3f) suggest that our model fares well in comparison to the baseline. To scale up the model, an interesting research direction is to use better generative models such as BlendSCAPE [8], and to approximately simulate clothing on the body model to estimate more robust summary statistics.
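To illustrate the per-joint parameterization, here is a sketch of assembling one bone's 4 × 4 affine SiL from its scale, rotation, and translation latents. The reading of Ut(µ0, 0.1) as a uniform of half-width 0.1 around µ0, the Euler-angle composition, and the placeholder arguments are our assumptions, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint_affine(mu_scale, mu_rot, t_lo, t_hi):
    """One bone's latent 4x4 affine S^i_L from scale, rotation, and location."""
    scale = rng.uniform(mu_scale - 0.1, mu_scale + 0.1, size=3)  # assumed U_t(mu0, 0.1)
    angles = rng.normal(mu_rot, 0.1, size=3)                     # R^i ~ N_t(mu_r, 0.1) per axis
    trans = rng.uniform(t_lo, t_hi, size=3)                      # T^i ~ U_t(a_t, b_t) per axis

    cx, cy, cz = np.cos(angles)
    sx, sy, sz = np.sin(angles)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])

    A = np.eye(4)
    A[:3, :3] = Rz @ Ry @ Rx @ np.diag(scale)  # rotation composed with per-axis scale
    A[:3, 3] = trans                           # location
    return A
```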

5 Conclusion

The combination of inverse graphics models and feature learning techniques opens up interesting research directions in obtaining structured, symbolic descriptions of natural visual scenes. Traditionally, inverse graphics models have seemed intractable due to the focus on modeling pixels. We believe that integrating probabilistic inverse graphics models, ABC, and deep learning techniques addresses this fundamental issue by projecting the rendered and real images onto a common manifold for approximate inference. Additionally, sequentially modeling the hierarchy of deep networks from top to bottom may significantly increase inference quality. Based on preliminary experiments, we believe that learning data-driven MCMC proposals [12, 9] in the context of ABC could significantly improve run-time and accuracy. High-resolution, real-time computer graphics simulators seem to contain solutions to many hard image synthesis problems. We hope that designing and using more advanced graphics simulators can simultaneously advance visual scene analysis as well.


References

[1] M. Ahissar and S. Hochstein. The reverse hierarchy theory of visual perceptual learning. Trends in Cognitive Sciences, 8(10):457–464, 2004.

[2] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: Proceedings of the Sixth Valencia International Meeting, volume 6, page 475, 1998.

[3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194. ACM Press/Addison-Wesley, 1999.

[4] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam.

[5] L. Del Pero, J. Bowdish, D. Fried, B. Kermgard, E. Hartley, and K. Barnard. Bayesian geometric modeling of indoor scenes. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2719–2726. IEEE, 2012.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[7] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint arXiv:1202.2160, 2012.

[8] D. A. Hirshberg, M. Loper, E. Rachlin, and M. J. Black. Coregistration: Simultaneous alignment and modeling of articulated 3D shape. In Computer Vision–ECCV 2012, pages 242–255. Springer, 2012.

[9] V. Jampani, S. Nowozin, M. Loper, and P. V. Gehler. The informed sampler: A discriminative approach to Bayesian inference in generative computer vision models. arXiv preprint arXiv:1402.0859, 2014.

[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.

[12] T. D. Kulkarni, V. K. Mansinghka, P. Kohli, and J. B. Tenenbaum. Inverse graphics with probabilistic CAD models. arXiv preprint arXiv:1407.1339, 2014.

[13] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks, 3361, 1995.

[14] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In Computer Vision–ECCV 2014, pages 154–169. Springer, 2014.

[15] V. Mansinghka, T. D. Kulkarni, Y. N. Perov, and J. Tenenbaum. Approximate Bayesian image interpretation using generative probabilistic graphics programs. In Advances in Neural Information Processing Systems, pages 1520–1528, 2013.

[16] I. Murray, R. P. Adams, and D. J. MacKay. Elliptical slice sampling. arXiv preprint arXiv:1001.0175, 2009.

[17] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 3, pages 32–36. IEEE, 2004.

[18] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[19] Z. Tu, X. Chen, A. L. Yuille, and S.-C. Zhu. Image parsing: Unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2):113–140, 2005.

[20] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.

[21] R. D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141, 2013.

[22] D. Wingate, N. D. Goodman, A. Stuhlmueller, and J. Siskind. Nonstandard interpretations of probabilistic programs for efficient inference. In Advances in Neural Information Processing Systems, 2011.

[23] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1385–1392. IEEE, 2011.

[24] Y. Zhao and S.-C. Zhu. Image parsing via stochastic scene grammar. In Advances in Neural Information Processing Systems, 2011.
