  • On Joint Estimation of Pose, Geometry and svBRDF from a Handheld Scanner

    Carolin Schmitt 1,2,* Simon Donné 1,2,* Gernot Riegler 3 Vladlen Koltun 3 Andreas Geiger 1,2

    1 Max Planck Institute for Intelligent Systems, Tübingen  2 University of Tübingen  3 Intel Intelligent Systems Lab

    {firstname.lastname}@tue.mpg.de {firstname.lastname}@intel.com

    Abstract

    We propose a novel formulation for joint recovery of camera pose, object geometry and spatially-varying BRDF. The input to our approach is a sequence of RGB-D images captured by a mobile, hand-held scanner that actively illuminates the scene with point light sources. Compared to previous works that jointly estimate geometry and materials from a hand-held scanner, we formulate this problem using a single objective function that can be minimized using off-the-shelf gradient-based solvers. By integrating material clustering as a differentiable operation into the optimization process, we avoid pre-processing heuristics and demonstrate that our model is able to determine the correct number of specular materials independently. We provide a study on the importance of each component in our formulation and on the requirements of the initial geometry. We show that optimizing over the poses is crucial for accurately recovering fine details and that our approach naturally results in a semantically meaningful material segmentation.

    1. Introduction

    Reconstructing the shape and appearance of objects is a long-standing goal in computer vision and graphics, with numerous applications ranging from telepresence to training embodied agents in photo-realistic environments. While novel depth sensing technology (e.g., Kinect) enabled large-scale 3D reconstructions [12, 61, 87], the level of realism provided is limited since physical light transport is not taken into account. As a consequence, material properties are not recovered and illumination effects such as specular reflections or shadows are merged into the texture component.

    Material properties can be directly measured using dedicated light stages [26, 40, 49] or inferred from images by assuming known [15, 36, 65] or flat [3, 4, 29, 76] object geometry. However, most setups are either restricted to lab environments, planar geometries, or difficult to employ “in the wild” as they assume aligned 3D models or scans.

    ∗ Joint first author with equal contribution.

    Figure 1: Illustration. Based on images captured from a handheld scanner with point light illumination, we jointly optimize for the camera poses, the surface geometry and spatially varying materials using a single objective function.

    Ideally, object geometry and material properties are inferred jointly: a good model of light transport allows for recovering geometric detail using shading cues. An accurate shape model, in turn, facilitates the estimation of material properties. This is particularly relevant for shiny surfaces where small changes in the geometry greatly impact the appearance and location of specular reflections. Yet joint optimization of these quantities (shown in Fig. 1) is challenging.

    Several works have addressed this problem by assuming multiple images from a static camera [16, 21, 23, 28, 105], which is impractical for mobile scanning applications. Only a few works consider the challenging problem of joint geometry and material estimation from a handheld device [20, 24, 57]. However, existing approaches assume known camera poses and leverage sophisticated pipelines, decomposing the problem into smaller problems using multiple decoupled objectives and optimization algorithms that treat geometry and materials separately. Furthermore, the number of base materials must be provided and/or pre-processing is required to cluster the object surface accordingly.


  • In this work, we provide a novel formulation for this problem which does not rely on sophisticated pipelines or decoupled objective functions. However, we assume that the data was captured under known, non-static illumination with negligible ambient light. We make the following contributions: (1) We demonstrate that joint optimization of camera pose, object geometry and materials is possible using a single objective function and off-the-shelf gradient-based solvers. (2) We integrate material clustering as a differentiable operation into the optimization process by formulating non-local smoothness constraints. (3) Our approach automatically determines the number of specular base materials during the optimization process, leading to parsimonious and semantically meaningful material assignments. (4) We provide a study on the importance of each component in our formulation and a comparison to various baselines. (5) We make our source code, dataset and reconstructed models publicly available at https://github.com/autonomousvision/handheld_svbrdf_geometry.

    2. Related Work

    We now discuss the most related work on geometry estimation, material estimation, as well as joint geometry and material estimation.

    2.1. Geometry Estimation

    Multi-View Stereo (MVS) reconstruction techniques [18, 38, 39, 77, 80, 84, 85] recover the 3D geometry of an object from multiple input images by matching feature correspondences across views or optimizing photo-consistency. As they ignore physical light transport, they cannot recover material properties. Furthermore, they are only able to recover geometry for surfaces which are sufficiently textured.

    Shape from Shading (SfS) techniques exploit shading cues for reconstructing [27, 30, 72, 73, 99] or for refining 3D geometry [22, 48, 89, 104] from one or multiple images by relating surface normals to image intensities through Lambert’s law. While early SfS approaches were restricted to objects made of a single Lambertian material, modern reincarnations of these models [6, 45, 62] are also able to infer non-Lambertian materials and lighting. Unfortunately, reconstructing geometry from a single image is a highly ill-posed problem, requiring strong assumptions about the surface geometry. Moreover, textured objects often cause ambiguities as intensity changes can be caused by changes in either surface orientation or surface albedo.

    Photometric Stereo (PS) approaches [25, 63, 70, 71, 83, 88, 102] assume three or more images captured with a static camera while varying illumination or object pose [41, 82] to resolve the aforementioned ambiguities. In contrast to early PS approaches which often assumed orthographic cameras and distant light sources, newer works have considered the more practical setup of near light sources [42, 43, 74, 94] and perspective projection [53, 54, 68].

    To handle non-Lambertian surfaces, robust error functions have been suggested [69, 75] and the problem has been formulated using specularity-invariant image ratios [10, 50–52]. The advantages of PS (accurate normals) and MVS (global geometry) have also been combined by integrating normals from PS and geometry from MVS [17, 33, 44, 47, 58, 64, 81, 96] into a single consistent reconstruction. However, many classical PS approaches are not capable of estimating material properties other than albedo, and most PS approaches require a fixed camera, which restricts their applicability to lab environments. In contrast, here we are interested in recovering shape and surface materials using a handheld device.

    2.2. Material Estimation

    Intrinsic Image Decomposition [6, 7, 11, 19] is the problem of decomposing an image into its material-dependent and light-dependent properties. However, only a small portion of the 3D physical process is captured by these models and strong regularizers must be exploited to solve the task. A more accurate description of the reflective properties of materials is provided by the Bidirectional Reflectance Distribution Function (BRDF) [59].

    For known 3D geometry, the BRDF can be measured using specialized light stages or gantries [26, 40, 49, 60, 78]. While this setup leads to accurate reflectance estimates, it is typically expensive, stationary and only works for objects of limited size. In contrast, recent works have demonstrated that reflectance properties of flat surfaces can be acquired using an ordinary mobile phone [3, 4, 29, 76, 95]. While data collection is easy and practical, these techniques are designed for capturing flat texture surfaces and do not generalize to objects with more complex geometries.

    More closely aligned with our goals are approaches that estimate parametric BRDF models for objects with known geometry based on sparse measurements of the BRDF space [15, 36, 55, 56, 65, 90–92, 97, 101]. While we also estimate a parametric BRDF model and assume only sparse measurements of the BRDF domain, we jointly optimize for camera pose, object geometry and material parameters. As our experiments show, joint optimization allows us to recover fine geometric structures (not present in the initial reconstruction) while at the same time improving material estimates compared to a sequential treatment of both tasks.

    2.3. Joint Geometry and Material Estimation

    Several works have addressed the problem of jointly inferring geometry and material. By integrating shading cues with multi-view constraints and an accurate model of materials and light transport, this approach has the potential to deliver the most accurate results. However, joint optimization of all relevant quantities is a challenging task.

    Several works have considered extensions of the classic PS setting [1, 5, 8, 16, 21, 23, 28, 67, 93, 103, 105].


  • While some of these approaches consider multiple viewpoints and/or estimate spatially varying BRDFs, all of them require multiple images from the same viewpoint as input, i.e., they assume that the camera is on a tripod. While this would simplify matters, here we are interested in jointly estimating geometry and materials from a handheld device which can be used for scanning a wide range of objects outside the laboratory and which allows for obtaining more complete reconstructions by scanning objects from multiple viewpoints.

    There exist only a few works that consider the problem of joint geometry and material estimation from an active handheld device. Higo et al. [24] present a plane-sweeping approach combined with graph cuts for estimating albedo, normals and depth, followed by a post-processing step to integrate normal information into the depth and to remove outliers [58]. Georgoulis et al. [20] optimize an initial mesh computed via structure-from-motion and a data-driven BRDF model in an alternating fashion. They use k-means for clustering initial BRDF estimates into base materials and iteratively recompute the 3D geometry using the method of [58]. In a similar spirit, Nam et al. [57] split the optimization problem into separate parts. They first cluster the material estimates using k-means, followed by an alternating optimization procedure which interleaves material, normal and geometry updates. The geometry is updated using screened Poisson surface reconstruction [35] while materials are recovered using a separate objective.

    The main contribution of our work is a simple and clean formulation of this problem: we demonstrate that geometry and materials can be inferred jointly using a single objective function optimized using standard gradient-based techniques. Our approach naturally allows for optimizing additional relevant quantities such as camera poses and integrates material clustering as a differentiable operation into the optimization process. Moreover, we demonstrate automatic model selection by determining the number of distinct material components as illustrated in Fig. 2.

    3. Method

    Let us assume a set of color images $I_i : \mathbb{R}^2 \to \mathbb{R}^3$ captured from $N$ different views $i \in \{1, \dots, N\}$. Without loss of generality, let us select $i = 1$ as the reference view based on which we will parameterize the surface geometry and materials as detailed below. Note that in our visualizations all observations are represented in this reference view.

    Our goal is to jointly recover the camera poses, the geometry of the scene as well as the material properties in terms of a spatially varying Bidirectional Reflectance Distribution Function (svBRDF).

    More formally, we wish to estimate the locations $\mathbf{x}_p = (x_p, y_p, z_p)^T$, surface normals $\mathbf{n}_p = (n_p^x, n_p^y, n_p^z)^T$, and svBRDF $f_p(\cdot)$ of a set of $P$ surface points $p$, as well as the projective mappings $\pi_i : \mathbb{R}^3 \to \mathbb{R}^2$ that map a 3D point $\mathbf{x}_p \in \mathbb{R}^3$ into camera image $i$. We assume that each image is illuminated by exactly one point light source. Similar to prior works, we assume that global and ambient illumination effects are negligible.

    3.1. Preliminaries

    This section describes the parameterizations of our model.

    Camera Representation: We use a perspective pinhole camera model and assume constant intrinsic camera parameters that can be estimated using established calibration procedures [100]. We also assume that all images have been undistorted and the vignetting has been removed. We therefore only optimize for the extrinsic parameters (i.e., rotation and translation) of each projective mapping $\pi_i : \mathbb{R}^3 \to \mathbb{R}^2$.

    Geometry Representation: We define the surface points in terms of the depth map in the reference view $\mathcal{Z}_1 = \{z_p\}$, using $p$ as the pixel/point index. Assuming a pinhole projection, the 3D location of surface point $p$ is given by

    $$\mathbf{x}_p = \pi_1^{-1}(u_p, v_p, z_p) = \left(\frac{u_p - c_x}{f}, \; \frac{v_p - c_y}{f}, \; 1\right) z_p \qquad (1)$$

    where $[u_p, v_p]^T$ denotes the location of pixel $p$ in the reference image $I_1$, $z_p$ is the depth at pixel $p$, $\pi_1^{-1}$ is the inverse projection function and $f, c_x, c_y$ denote its parameters.

    Normal Representation: We represent normals $\mathbf{n}_p$ as unit vectors. In every iteration of the gradient-based optimization, we estimate an angular change for this vector, so that we avoid both the unit normal constraint and the gimbal lock problem.

    svBRDF Representation: The svBRDF $f_p(\mathbf{n}_p, \boldsymbol{\omega}_{in}, \boldsymbol{\omega}_{out})$ models the fraction of light that is reflected from incoming light direction $\boldsymbol{\omega}_{in}$ to outgoing light direction $\boldsymbol{\omega}_{out}$ given the surface normal $\mathbf{n}_p$ at point $p$. We use a modified version of the Cook-Torrance microfacet BRDF model [13]

    $$f_p(\mathbf{n}_p, \boldsymbol{\omega}_{in}, \boldsymbol{\omega}_{out}) = \mathbf{d}_p + \mathbf{s}_p \, \frac{D(r_p)\, G(\mathbf{n}_p, \boldsymbol{\omega}_{in}, \boldsymbol{\omega}_{out}, r_p)}{\pi \, (\mathbf{n}_p \cdot \boldsymbol{\omega}_{in})(\mathbf{n}_p \cdot \boldsymbol{\omega}_{out})} \qquad (2)$$

    where $D(\cdot)$ describes the microfacet slope distribution, $G(\cdot)$ is the geometric attenuation factor, and $\mathbf{d}_p \in \mathbb{R}^3$, $\mathbf{s}_p \in \mathbb{R}^3$ and $r_p \in \mathbb{R}$ denote diffuse albedo, specular albedo and surface roughness, respectively. We use Smith's function as implemented in Mitsuba [31] for $G(\cdot)$ and the GTR model of the Disney BRDF [9] for $D(\cdot)$. Following [57], we ignore the Fresnel effect, which cannot be observed using a handheld setup.
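    For illustration, a simplified evaluation of Eq. (2) for a single surface point is sketched below. The paper uses the GTR distribution [9] and Smith's term as implemented in Mitsuba [31]; here we substitute the common GTR2 (GGX) forms of D and G and a Disney-style roughness remapping, so the constants and remappings are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def brdf_cook_torrance(n, w_in, w_out, d, s, r, eps=1e-8):
    """Evaluate a modified Cook-Torrance svBRDF in the spirit of Eq. (2) at one point.

    n, w_in, w_out: unit vectors (normal, incoming and outgoing light directions).
    d, s: diffuse and specular albedo (RGB triplets); r: scalar roughness.
    """
    h = w_in + w_out
    h = h / (np.linalg.norm(h) + eps)                      # half-vector
    n_h, n_i, n_o = n @ h, n @ w_in, n @ w_out
    a2 = (r * r) ** 2                                      # Disney-style roughness remapping

    # GTR2 / GGX microfacet slope distribution D(r).
    D = a2 / (np.pi * ((n_h * n_h) * (a2 - 1.0) + 1.0) ** 2 + eps)

    # Smith geometric attenuation G = G1(w_in) * G1(w_out).
    def G1(n_v):
        return 2.0 * n_v / (n_v + np.sqrt(a2 + (1.0 - a2) * n_v * n_v) + eps)
    G = G1(n_i) * G1(n_o)

    # Eq. (2): diffuse albedo plus specular microfacet lobe (Fresnel ignored).
    return d + s * D * G / (np.pi * n_i * n_o + eps)
```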

    As illustrated in Fig. 2, many objects can be modeled well with few specular material components [46] while the object texture is more complex.

  • Figure 2: Reconstructions and Estimated Material Assignments for Real Objects. The teapot and the globe are very smooth and reflective objects, whereas the girl and rabbit are near-Lambertian objects with rich geometry and a rough surface. The duck, the gnome and the hydrant are composed of both shiny and rough parts. While the duck, the gnome, the green vase and the sheep comprise mostly homogeneous colours, both the globe and the teapot have very detailed textures. The teapot is painted, which creates surface irregularities, while the globe is printed and therefore smooth. We show renderings (top) and specular base material segmentations (bottom) from our method; the number of bases T is automatically determined by model selection within our framework. We visualize the more specular materials in red.

    We thus allow the diffuse albedo $\mathbf{d}_p$ to vary freely per pixel $p$, and model the specular reflectance as a combination of $T$ specular base materials

    $$\begin{pmatrix} \mathbf{s}_p \\ r_p \end{pmatrix} = \sum_{t=1}^{T} \alpha_p^t \begin{pmatrix} \mathbf{s}_t \\ r_t \end{pmatrix} \qquad (3)$$

    with per-pixel BRDF weights $\alpha_p^t \in [0, 1]$ and specular base materials $\{(\mathbf{s}_t, r_t)\}_{t=1}^{T}$. Note that this is in contrast to other representations [21, 40] which linearly combine also the diffuse part, hence requiring more base materials to reach the same fidelity. We found that $T \leq 3$ specular bases are sufficient for almost all objects. In summary, our svBRDF is fully determined by $\{(\mathbf{s}_t, r_t)\}_{t=1}^{T}$ and $\{\mathbf{d}_p, \boldsymbol{\alpha}_p\}_{p=1}^{P}$.
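    Eq. (3) amounts to a simple weighted sum over the base materials; a minimal sketch (shapes and names are our own, purely for illustration):

```python
import numpy as np

def mix_specular_bases(alpha, s_base, r_base):
    """Per-pixel specular parameters as a combination of T base materials, cf. Eq. (3).

    alpha: (P, T) per-pixel weights in [0, 1]; s_base: (T, 3) specular albedos;
    r_base: (T,) roughness values. Returns per-pixel (s_p, r_p).
    """
    s_p = alpha @ s_base          # (P, 3) specular albedo per pixel
    r_p = alpha @ r_base          # (P,)   roughness per pixel
    return s_p, r_p
```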

    3.2. Model

    This section describes our objective function. Let $\mathcal{X} = \{\{(z_p, \mathbf{n}_p, f_p)\}_{p=1}^{P}, \{\pi_i\}_{i=2}^{N}\}$ denote the depth, normal and material for every pixel $p$ in the reference view, as well as the projective mapping for each adjacent view. We formulate the following objective function

    $$\mathcal{X}^* = \operatorname*{argmin}_{\mathcal{X}} \; \psi_P + \psi_G + \psi_D + \psi_N + \psi_M \qquad (4)$$

    omitting the dependency on $\mathcal{X}$ and the relative weights between the individual terms. Our objective function is composed of five terms which encourage photoconsistency $\psi_P$, geometric consistency $\psi_G$, depth compatibility $\psi_D$, normal smoothness $\psi_N$ and material smoothness $\psi_M$.

    Photoconsistency: The photoconsistency term ensures that the prediction of our model matches the observation $I_i$ for every image $i$ and pixel $p$:

    $$\psi_P(\mathcal{X}) = \frac{1}{N} \sum_i \sum_p \left\| \varphi_{ip} \left[ I_i(\pi_i(\mathbf{x}_p)) - \mathcal{R}_i(\mathbf{x}_p, \mathbf{n}_p, f_p) \right] \right\|_1 \qquad (5)$$

    Here, $\mathcal{R}_i$ denotes the rendering operator for image $i$ which applies the rendering equation [34] to every pixel $p$. Assuming a single point light source, we obtain

    $$\mathcal{R}_i(\mathbf{x}_p, \mathbf{n}_p, f_p) = f_p\!\left(\mathbf{n}_p, \boldsymbol{\omega}_i^{in}(\mathbf{x}_p), \boldsymbol{\omega}_i^{out}(\mathbf{x}_p)\right) \, \frac{a_i(\mathbf{x}_p)\, \mathbf{n}_p^T \boldsymbol{\omega}_i^{in}(\mathbf{x}_p)}{d_i(\mathbf{x}_p)^2} \, L \qquad (6)$$

    where $\boldsymbol{\omega}_i^{in}(\mathbf{x}_p)$ denotes the direction of the ray from the surface point $\mathbf{x}_p$ to the light source and $\boldsymbol{\omega}_i^{out}(\mathbf{x}_p)$ denotes the direction from $\mathbf{x}_p$ to the camera center. $a_i(\mathbf{x}_p)$ is the angle-dependent light attenuation which is determined through photometric calibration, $d_i(\mathbf{x}_p)$ is the distance between $\mathbf{x}_p$ and the light source and $L$ denotes the radiant intensity of the light. Note that all terms depend on the image index $i$, as the location of the camera and the light source vary from frame to frame when recording with a handheld lightstage.
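    A per-point sketch of the rendering operator in Eq. (6), assuming the light position, camera center, calibrated attenuation $a_i(\mathbf{x}_p)$ and radiant intensity $L$ are given; the function signature is our own and not the authors' API.

```python
import numpy as np

def render_point_light(x, n, f_p, light_pos, cam_pos, attenuation, L):
    """Render one surface point under a single point light, following Eq. (6).

    x, n: 3D point and unit normal; f_p(n, w_in, w_out) evaluates the svBRDF;
    light_pos, cam_pos: point light and camera positions for view i;
    attenuation: angular falloff a_i(x) from photometric calibration; L: radiant intensity.
    """
    to_light = light_pos - x
    dist = np.linalg.norm(to_light)
    w_in = to_light / dist                               # direction x_p -> light source
    w_out = (cam_pos - x) / np.linalg.norm(cam_pos - x)  # direction x_p -> camera center
    cos_theta = max(n @ w_in, 0.0)                       # foreshortening n^T w_in
    # BRDF * attenuated irradiance: a_i(x) * (n . w_in) / d^2 * L
    return f_p(n, w_in, w_out) * attenuation * cos_theta / dist**2 * L
```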

    The visibility term $\varphi_{ip}$ in (5) disables occluded or shadowed observations, i.e., we do not optimize for these regions. We set $\varphi_{ip} = 1$ if surface point $\mathbf{x}_p$ is both visible in view $i$ (i.e., no occluder between $\mathbf{x}_p$ and the $i$'th camera) and illuminated (i.e., no occluder between $\mathbf{x}_p$ and the point light), and $\varphi_{ip} = 0$ otherwise. Note that for the reference view every pixel is visible, but not necessarily illuminated.
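    The paper does not detail how $\varphi_{ip}$ is computed; one plausible realization is a shadow-map style depth test against depth maps rendered from the camera and from the light. Everything below, including the projection helpers and the tolerance, is an assumption for illustration only.

```python
def visibility(x, depth_map_cam, depth_map_light, project_cam, project_light, tol=1e-3):
    """Binary visibility/illumination indicator phi_ip for surface point x in view i.

    depth_map_cam / depth_map_light: depths rendered from the i-th camera / point light.
    project_cam / project_light: map a 3D point to ((u, v), depth) in the respective frame.
    """
    (u, v), z = project_cam(x)
    visible = abs(depth_map_cam[v, u] - z) < tol * z        # no occluder towards the camera
    (ul, vl), zl = project_light(x)
    lit = abs(depth_map_light[vl, ul] - zl) < tol * zl      # no occluder towards the light
    return 1.0 if (visible and lit) else 0.0
```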

  • Geometric Consistency: We enforce consistency between depth $\{z_p\}$ and normals $\{\mathbf{n}_p\}$ by ensuring that the normal field integrates to the estimated depth map. We formulate this constraint by maximizing the inner product between the estimated normals $\{\mathbf{n}_p\}$ and the cross product of the surface tangents at $\{\mathbf{x}_p\}$:

    $$\psi_G(\mathcal{X}) = -\sum_p \mathbf{n}_p^T \left( \frac{\frac{\partial \mathbf{x}_p}{\partial x} \times \frac{\partial \mathbf{x}_p}{\partial y}}{\left\| \frac{\partial \mathbf{x}_p}{\partial x} \times \frac{\partial \mathbf{x}_p}{\partial y} \right\|_2} \right) \qquad (7)$$

    The surface tangent $\frac{\partial \mathbf{x}_p}{\partial x}$ is given by

    $$\frac{\partial \mathbf{x}_p}{\partial x} \propto \left[ 1, \; 0, \; \nabla \mathcal{Z}_1(\pi_1(\mathbf{x}_p))^T \left[ f / z_p, \; 0 \right]^T \right]^T \qquad (8)$$

    where $\nabla \mathcal{Z}_1(\pi_1(\mathbf{x}_p))$ denotes the gradient of the depth map, which we estimate using finite differences. We obtain a similar equation for $\frac{\partial \mathbf{x}_p}{\partial y}$. See the supplement for details.

    A valid question to raise is whether a separate treatment of depth and normals is necessary. An alternative formulation would consider consistency between depth and normals as a hard constraint, i.e., enforcing Equation (7) strictly, and optimizing only for depth. While reducing the number of parameters to be estimated, we found that such a representation is prone to local minima during optimization due to the complementary nature of the constraints (depth vs. normals/shading). Instead, using auxiliary normal variables and optimizing for both depth and normals using a soft coupling between them allows us to overcome these problems.
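    A simplified sketch of the coupling in Eq. (7), with the surface tangents approximated by finite differences of the backprojected points rather than the analytic form of Eq. (8) (an assumption made here for brevity):

```python
import torch

def geometric_consistency(points, normals):
    """Soft coupling between depth-derived geometry and normal estimates, cf. Eq. (7).

    points: (H, W, 3) surface points x_p backprojected from the depth map;
    normals: (H, W, 3) unit normal estimates n_p.
    """
    tx = points[:, 1:, :] - points[:, :-1, :]              # tangent along image x
    ty = points[1:, :, :] - points[:-1, :, :]              # tangent along image y
    tx, ty = tx[:-1], ty[:, :-1]                           # crop to a common size
    cross = torch.cross(tx, ty, dim=-1)
    cross = cross / (cross.norm(dim=-1, keepdim=True) + 1e-8)
    n = normals[:-1, :-1]
    # Maximize the inner product between the normals and the depth-induced surface normal.
    return -(n * cross).sum(-1).sum()
```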

    Depth Compatibility: The optional depth term allows for incorporating depth measurements $\mathcal{Z}_1$ in the reference view $i = 1$ by regularizing our estimates $z_p$ against it:

    $$\psi_D(\mathcal{X}) = \sum_p \left\| z_p - \mathcal{Z}_1(u_p, v_p) \right\|_2^2 \qquad (9)$$

    Note that our model is able to significantly improve upon the initial coarse geometry provided by the structured light sensor by exploiting shading cues. However, as these cues are related to depth variations (i.e., normals) rather than absolute depth, they do not fully constrain the 3D shape of the object. Our experiments demonstrate that combining complementary depth and shading cues yields reconstructions which are both locally detailed and globally consistent.

    Normal Smoothness: We apply a standard smoothness regularizer to the normals of adjacent pixels $p \sim q$

    $$\psi_N(\mathcal{X}) = \sum_{p \sim q} \left\| \mathbf{n}_p - \mathbf{n}_q \right\|_2^2 \qquad (10)$$

    in order to encourage smooth surfaces.

    Material Smoothness: We only observe specular BRDF components for a minority of pixels that actually observe a specular highlight in at least one of their measurements. We therefore introduce a non-local material regularizer which propagates specular behavior across image regions of similar appearance. Assuming that nearby pixels with similar diffuse behavior also exhibit similar specular behavior, we formulate this term by penalizing deviation of the material weights wrt. a bilaterally smoothed version of themselves

    $$\psi_M(\mathcal{X}) = \sum_p \left\| \boldsymbol{\alpha}_p - \frac{\sum_q \boldsymbol{\alpha}_q \, w_q \, k_{p,q}}{\sum_q w_q \, k_{p,q}} \right\|_1 - \sum_p \left\| \boldsymbol{\alpha}_p - \frac{1}{P} \sum_q \boldsymbol{\alpha}_q \right\|_1 \qquad (11)$$

    using a Gaussian kernel $k_{p,q}(\mathbf{d}_p, \mathbf{d}_q)$ with 3D location $\mathbf{x}$ and diffuse albedo $\mathbf{d}$ at pixels $p$ and $q$ as features:

    $$k_{p,q} = \exp\!\left( - \frac{\left\| \mathbf{x}_p - \mathbf{x}_q \right\|_2^2}{2\sigma_1^2} - \frac{\left\| \mathbf{d}_p - \mathbf{d}_q \right\|_2^2}{2\sigma_2^2} \right) \qquad (12)$$

    As the only informative regions are those that potentially observe a highlight, the weights $w_q = \max_i \left[\arccos(\mathbf{n}_q \cdot \mathbf{h}_q^i)\right]^{-1}$ indicate whether pixel $q$ was ever observed close to perfect mirror reflection. This is determined by the normal $\mathbf{n}_q$ and the half-vector $\mathbf{h}_q^i$ (i.e., the bisector between $\boldsymbol{\omega}_{in}$ and $\boldsymbol{\omega}_{out}$) for each view $i$. We use the permutohedral lattice [2] to efficiently evaluate the bilateral filter.

    The second term in (11) encourages material sparsity by maximizing the distance to the average BRDF weights, where $P$ denotes the total number of surface points/pixels.
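    For exposition, a dense O(P²) version of Eqs. (11)-(12) can be written as below; the actual method evaluates the bilateral filter with the permutohedral lattice [2], so this sketch only illustrates what the term computes.

```python
import numpy as np

def material_smoothness(alpha, x, d, w, sigma1, sigma2):
    """Naive dense evaluation of the non-local material term of Eqs. (11)-(12).

    alpha: (P, T) BRDF weights; x: (P, 3) 3D locations; d: (P, 3) diffuse albedo;
    w: (P,) per-pixel highlight weights w_q. Quadratic in P; for exposition only.
    """
    k = np.exp(-((x[:, None] - x[None]) ** 2).sum(-1) / (2 * sigma1 ** 2)
               - ((d[:, None] - d[None]) ** 2).sum(-1) / (2 * sigma2 ** 2))     # k_{p,q}
    wk = k * w[None]                                                            # w_q * k_{p,q}
    alpha_bilateral = (wk @ alpha) / (wk.sum(1, keepdims=True) + 1e-8)          # bilateral mean
    smooth = np.abs(alpha - alpha_bilateral).sum()               # pull towards bilateral mean
    sparse = np.abs(alpha - alpha.mean(0, keepdims=True)).sum()  # push away from global mean
    return smooth - sparse
```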

    3.3. Optimization

    We now discuss the parameter initialization and how we minimize our objective function (4) with respect to $\mathcal{X}$.

    Initial Poses: The camera poses can be either initialized using classical SfM pipelines such as COLMAP [77, 79] or using a set of fiducial markers. As SfM approaches fail in the presence of textureless surfaces, we use a small set of AprilTags [86] attached to the table supporting the object of interest. As evidenced by our experiments, the poses estimated using fiducial markers are not accurate enough to model pixel-accurate light transport. We demonstrate that geometry and materials can be significantly improved by jointly refining the initial camera poses.

    Initial Depth: The initial depth map $\mathcal{Z} = \{z_p\}$ can be obtained using active or passive stereo, or the visual hull of the object. As we do not assume textured objects and silhouettes can be difficult to extract in the presence of dark materials, we use active stereo with a Kinect-like dot pattern projector for estimating $\mathcal{Z}$. More specifically, we estimate a depth map for each of the $N$ views, integrate them using volumetric fusion [14] and project the resulting mesh back to the reference view.

  • Figure 3: Super-Resolution and Denoising. Blurred and noisy input images (Observation and Crop) get denoised and sharpened by our method (Reconstruction).

    Initial Normals and Albedo: Assuming a Lambertian scene, normals and albedo can be recovered in closed form. We follow the approach of Higo et al. [24] and use RANSAC to reject outliers due to specularities.
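    In the Lambertian case, each pixel yields a small linear system in the albedo-scaled normal; a per-pixel sketch with RANSAC-based outlier rejection in the spirit of [24] is shown below. The normalization of the observations, the threshold and the sample counts are assumptions, not values from the paper.

```python
import numpy as np

def init_normal_albedo(intensities, light_dirs, n_iters=100, thresh=0.05, rng=None):
    """Lambertian closed-form initialization of normal and albedo for one pixel.

    intensities: (M,) observations with light falloff and intensity factors divided out;
    light_dirs: (M, 3) unit directions towards the light. Returns (normal, albedo).
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iters):
        idx = rng.choice(len(intensities), size=3, replace=False)
        b, *_ = np.linalg.lstsq(light_dirs[idx], intensities[idx], rcond=None)  # b = albedo * n
        residuals = np.abs(light_dirs @ b - intensities)
        inliers = residuals < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers (specular highlights are rejected as outliers).
    b, *_ = np.linalg.lstsq(light_dirs[best_inliers], intensities[best_inliers], rcond=None)
    albedo = np.linalg.norm(b)
    return b / (albedo + 1e-8), albedo
```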

    Initial Specular BRDF Parameters and Weights: We initialize each pixel in the scene as a uniform mix of all base materials. To diversify the initial base materials, we initialize the specular base components $\mathbf{s}_t$ differently, and set each base roughness $r_t$ to 0.1.

    Model Selection: We perform model selection by optimizing for multiple numbers of specular base materials $T \in \{1, 2, 3\}$, choosing the one with the smallest photometric error while adding a small MDL penalty (linear in $T$).
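    The selection rule can be summarized as follows; the penalty weight and the `optimize` callback are placeholders, not values from the paper.

```python
def select_num_bases(optimize, candidates=(1, 2, 3), mdl_weight=1e-3):
    """Pick the number of specular base materials T by photometric error plus an MDL penalty.

    `optimize(T)` is assumed to run the joint optimization with T bases and return the
    resulting photometric error; `mdl_weight` is a hypothetical penalty weight (linear in T).
    """
    scores = {T: optimize(T) + mdl_weight * T for T in candidates}
    return min(scores, key=scores.get)
```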

    Implementation: We jointly optimize over $\mathcal{X}$ using ADAM [37] and PyTorch [66]; see the supplement for details.
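    A minimal sketch of such a joint optimization loop; the parameter grouping, learning rate and number of steps are illustrative only and not the settings reported in the supplement.

```python
import torch

def fit(params, loss_terms, num_steps=2000, lr=1e-3):
    """Joint gradient-based optimization over X = {depth, normals, materials, poses}.

    params: dict of torch tensors with requires_grad=True (z_p, n_p, d_p, alpha_p,
    base materials, pose parameters); loss_terms: callables returning the weighted
    objective terms of Eq. (4).
    """
    optimizer = torch.optim.Adam(params.values(), lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        loss = sum(term(params) for term in loss_terms)   # psi_P + psi_G + psi_D + psi_N + psi_M
        loss.backward()
        optimizer.step()
    return params
```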

    4. Experimental Evaluation

    In order to evaluate our method quantitatively and qualitatively, we capture several real objects using a custom-built rig with active illumination. Reconstructions of these objects are shown in Fig. 2. We scanned the objects with an Artec Spider¹ to obtain ground truth geometry.

    We first briefly describe our hardware and data capture procedure. After introducing the metrics, both geometric and photometric, we provide an ablation study in terms of the individual components of our model. Finally, we compare our approach with several competitive baselines.

    4.1. Evaluation Protocol

    Hardware: Our custom-built handheld sensor rig is shown in Fig. 4. While we use multiple light sources for a dense sampling of the BRDF, our framework and code is directly applicable to any number of lights. We calibrate the camera and depth sensor regarding their intrinsics and extrinsics as well as vignetting effects. We also calibrate the location of the light sources relative to the camera as well as their angular attenuation behavior and radiant intensities.

    ¹ https://www.artec3d.com/portable-3d-scanners/artec-spider

    Figure 4: Sensor Rig. Our setup comprises a Kinect-like active depth sensor, a global shutter RGB camera and 12 point light sources (high-power LEDs) surrounding the camera in two circles (with radii 10 cm and 25 cm).

    Due to space limitations, we provide more details about our system in the supplement.

    Data Capture: The objects are placed on a table with AprilTags [86] for tracking the sensor position. We assume that the ambient light is negligible and capture videos of each object by moving the handheld sensor around the object. Given each reference view, we select 45 views within a viewing cone of 30 degrees by maximizing the minimum pairwise distance, so that no two selected views are ever close together. These views are then split into 40 training and 5 held-out test views. While a handheld setup is challenging due to the trade-off between motion blur and image noise, our experiments demonstrate that our method is capable of super-resolving and denoising fine textures while simultaneously rejecting blurry observations, see Fig. 3.
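    One way to realize this view selection is greedy farthest-point sampling over the camera centers within the viewing cone (a sketch under our own assumptions; the authors' exact procedure may differ):

```python
import numpy as np

def select_views(cam_positions, k=45):
    """Greedy farthest-point selection of k views, maximizing the minimum pairwise distance.

    cam_positions: (N, 3) centers of the candidate cameras (assumed to already lie within
    the 30-degree viewing cone around the reference view).
    """
    selected = [0]
    while len(selected) < min(k, len(cam_positions)):
        # Distance of every candidate to its nearest already-selected view.
        d = np.linalg.norm(cam_positions[:, None] - cam_positions[selected][None], axis=-1)
        d_min = d.min(axis=1)
        d_min[selected] = -1.0          # never re-select a view
        selected.append(int(d_min.argmax()))
    return selected
```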

    Evaluation Metrics: We evaluate the estimated structure $\{\mathbf{x}_p\}$ wrt. the Artec Spider scan. The ground truth scan is first roughly aligned by hand and subsequently finetuned using dense image-based alignment wrt. depth and normal errors. We evaluate geometric accuracy by using the average point-to-mesh distance for all reconstructed points as in [32]. To evaluate surface normals, we calculate the average angular error (AAE) between the predicted normal $\mathbf{n}_p$ and the normal of the closest point in the ground truth scan. To quantify photometric reconstruction quality, we calculate the photoconsistency term in Eq. (5) for the test views.
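    As a concrete example, the AAE can be computed as follows once the closest-point correspondences to the ground-truth scan are established (the correspondence search itself is omitted here):

```python
import numpy as np

def average_angular_error(pred_normals, gt_normals):
    """Average angular error (AAE, in degrees) between predicted and ground-truth normals.

    pred_normals, gt_normals: (P, 3) unit normals, where gt_normals holds the normal of the
    closest ground-truth point for every reconstructed point (correspondences assumed given).
    """
    cos = np.clip((pred_normals * gt_normals).sum(-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()
```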

    4.2. Ablation Study

    In this section we demonstrate the need to optimize over the camera poses and discuss the effect of specifically the geometric consistency and the material smoothness terms. Finally, we investigate the impact of the number of views on the photometric and geometric error. Additional results are provided in the supplement.

    Pose Optimization: Disambiguating geometric properties from material is a major challenge. We found that optimizing the poses jointly with the other parameters is crucial for this, in particular when working with a handheld scanner. Fig. 5 shows that inaccurate poses cause a significant contamination of the geometry with texture information. This is even more crucial when estimating specularities: misalignment causes highlights to be inconsistent with the geometry and therefore difficult to recover.


  • Photometric test error:

        Method        Overall  Specular  Non-Specular
        Fixed Poses   1.210    3.349     1.151
        Full Model    1.138    3.243     1.081

    Figure 5: Pose Optimization (columns: Estimation, Crop, Error, Structure, Normals). Compared to using the input poses (top), optimizing the poses (bottom) improves reconstructions, both quantitatively and qualitatively. The photometric error is reported for regions with and without specular highlights. See the supplement for more details.

    Figure 6: Loss Regularizers. Panels: (a) Materials (segmentations and test reconstructions), (b) Geometry. Without the regularization (top), the appearance is inconsistent within homogeneous areas of the object. Using the regularization losses (bottom), we are able to propagate the information and successfully generalize to new illumination conditions on the test set.

    Material Segmentation: Decomposing the appearance of the object into its individual materials is an integral element of our approach. Our material smoothness term Eq. (11) propagates material information over large areas of the image. This is essential as we otherwise only obtain sparse measurements of the BRDF at each pixel. It leads to semantically meaningful segmentations, as illustrated in Fig. 2, as well as more successful generalization, as shown in Fig. 6a.

    Geometric Consistency: Splitting up the depth and normals into separate optimization variables yields a better behaved optimization problem, but coupling depth and normals proves crucial for consistent results. Even though the photometric term provides some constraints for the depth at each pixel, Fig. 6b shows that omitting the geometric consistency term results in high-frequency structure artifacts.

    Number of Input Views: Our goal is to estimate the spatially varying BRDF, but we only observe a very sparse set of samples for each surface point $p$.

    Figure 7: Number of Input Views. The photometric test error (blue) degrades gracefully with decreasing number of observations. As expected, the quality of the highlights is most affected by a small number of views.

    Figure 8: Geometry Refinement (panels: Init. from 5 depth maps, Init. from 40 depth maps). The final geometry estimate (right) is largely unaffected by the initialization (left). Details are recovered, even from a very coarse initialization.

    Reducing the number of images exacerbates this problem, as shown in Fig. 7. We see that our method degrades gracefully, with reasonable results even when using only 10 input images.

    We also evaluate the robustness of our method wrt. the initial geometry by reducing the number of depth maps fused for initialization. As Fig. 8 shows, our method is able to recover from inaccurate depth initialization and achieves similar quality reconstructions even when initializing from only 5 depth maps. By construction, our model does not recover geometry that is absent in the initial estimate.

    4.3. Comparison to Existing Approaches

    Similar to us, Higo et al. [24] use a handheld scanner for estimating depth, normals and material using a 2.5D representation. Unlike us, they treat specular highlights, shadows and occlusions as outliers using RANSAC. Georgoulis et al. [20] and Nam et al. [57] also estimate structure and normals, explicitly modeling non-Lambertian materials. But due to the nature of their pipelines, they are restricted to a disjoint optimization procedure and update geometry and materials in alternation. It is important to note that the baselines expect the camera positions to be known accurately, so for the baselines we first refine the poses using SfM [77]. Unfortunately, none of the existing works provide code. We have re-implemented the approach of Higo et al. [24] as a baseline. To investigate the benefits of joint optimization, we implemented a disjoint variant of our method that alternates between geometry and material updates.

  • Figure 9: Qualitative Geometry Comparison (columns: TSDF Fusion [98], Higo et al. [24], Proposed (disjoint), Proposed; error color scale from 0.1 mm to 10.0 mm). We show, for each object, the rendered depth map (shaded based on normals derived from the estimated depth rather than the estimated normals $\mathbf{n}_p$) and the color-coded depth error wrt. the Artec Spider ground truth. We observe that the photometric approaches recover far more details than the purely geometric TSDF fusion. The resulting structure for Higo et al. [24] is rather noisy, while the disjoint version of our proposed approach is not as successful at disambiguating texture and geometry for very fine details. We refer to the supplement for additional results.

                                  duck   pineapple  girl   gnome  sheep  hydrant  rabbit
    AEA   TSDF Fusion [98]        0.81   1.24       1.11   0.73   0.79   1.35     2.16
          Higo et al. [24]        2.65   1.05       1.59   1.60   2.09   1.81     2.85
          Proposed (disjoint)     0.81   1.00       1.06   0.65   0.64   1.18     2.25
          Proposed                0.80   1.00       1.00   0.64   0.57   1.16     2.16
    AAE   TSDF Fusion [98]        6.75   12.09      11.40  7.64   8.38   11.67    24.42
          Higo et al. [24]        7.77   10.62      11.30  9.24   8.65   15.27    27.67
          Proposed (disjoint)     6.01   9.13       9.81   6.41   6.66   9.59     23.01
          Proposed                5.17   8.98       8.73   5.74   5.60   8.50     23.50

    Table 1: Quantitative Geometry Comparison. We report both the average euclidean accuracy (AEA) and the average angular error (AAE), as discussed in Section 4.1. Please refer to the supplement for a more detailed table.

    We also evaluate the improvement over the initial TSDF fusion using the implementation of Zeng et al. [98].

    Our experimental evaluation² is shown in Fig. 9 and Table 1. We see that TSDF fusion, a purely geometric approach, reconstructs the general surface well but misses fine details. Their spatial regularizer helps Higo et al. [24] to achieve reasonable reconstructions, which are however strongly affected by the noisy, unregularized normal estimates. Additionally, the RANSAC approach to shadow handling results in artifacts around depth discontinuities.

    Both the joint and disjoint versions of our approach reconstruct the scene more accurately than the baselines, but the joint approach consistently obtains better reconstruction accuracy given a fixed computational budget.

    Glossy black materials, such as the eyes of the duck (Fig. 6b) or the rabbit (Fig. 2), remain a significant challenge. For such materials, the signal-to-noise ratio of the diffuse component is low and the signal from specular highlights is very sparse, so that neither the photoconsistency nor the depth compatibility term constrains the solution correctly.

    ² We do not include results of traditional MVS techniques here, as they fail for textureless objects. We include them in the supplement.

    Figure 10: A Strongly Non-Convex Scene. Due to our explicit shadow and occlusion model, our approach is also applicable to the reconstruction of more complex scenes.

    Finally, Fig. 10 illustrates that our approach is also able to handle strongly non-convex scenes, whose shadows and occlusions often cause issues for existing methods.

    5. Conclusion

    We have proposed a practical approach to estimating geometry and materials from a handheld sensor. Accurate camera poses are crucial to this task, but are not readily available. To tackle this problem, we propose a novel formulation which enables joint optimization of poses, geometry and materials using a single objective. Our approach recovers accurate geometry and material properties, and produces a semantically meaningful set of material weights.

    Acknowledgements: This work was supported by the Intel Network on Intelligent Systems (NIS). We thank Lars Mescheder and Michal Rolínek for their feedback and the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Carolin Schmitt.

  • References
    [1] Jens Ackermann, Martin Ritz, André Stork, and Michael Goesele. Removing the example from example-based photometric stereo. In ECCV Workshops, 2010.
    [2] Andrew Adams, Jongmin Baek, and Myers Abraham Davis. Fast high-dimensional filtering using the permutohedral lattice. Computer Graphics Forum, 29(2):753–762, 2010.
    [3] Miika Aittala, Tim Weyrich, and Jaakko Lehtinen. Two-shot SVBRDF capture for stationary materials. ACM Trans. on Graphics, 34(4):110:1–110:13, 2015.
    [4] Rachel A. Albert, Dorian Yao Chan, Dan B. Goldman, and James F. O'Brien. Approximate svbrdf estimation from mobile phone video. In EUROGRAPHICS, 2018.
    [5] Neil Gordon Alldrin, Todd E. Zickler, and David J. Kriegman. Photometric stereo with non-parametric and spatially-varying reflectance. In CVPR, 2008.
    [6] Jonathan T. Barron and Jitendra Malik. Shape, illumination, and reflectance from shading. PAMI, 37(8):1670–1687, 2015.
    [7] H. G. Barrow. Recovering intrinsic scene characteristics from images. CVS, pages 3–26, 1978.
    [8] Neil Birkbeck, Dana Cobzas, Peter F. Sturm, and Martin Jägersand. Variational shape and reflectance estimation under changing light and viewpoints. In ECCV, 2006.
    [9] Brent Burley. Physically-based shading at Disney. Technical report, Walt Disney Animation Studios, 2012.
    [10] Manmohan Krishna Chandraker, Jiamin Bai, and Ravi Ramamoorthi. A theory of differential photometric stereo for unknown isotropic brdfs. In CVPR, 2011.
    [11] Qifeng Chen and Vladlen Koltun. A simple model for intrinsic image decomposition with depth cues. In ICCV, 2013.
    [12] Sungjoon Choi, Qian-Yi Zhou, and Vladlen Koltun. Robust reconstruction of indoor scenes. In CVPR, 2015.
    [13] Robert L. Cook and Kenneth E. Torrance. A reflectance model for computer graphics. ACM Trans. on Graphics, 1(1):7–24, 1982.
    [14] Brian Curless and Marc Levoy. A volumetric method for building complex models from range images. In ACM Trans. on Graphics, 1996.
    [15] Yue Dong, Guojun Chen, Pieter Peers, Jiawan Zhang, and Xin Tong. Appearance-from-motion: recovering spatially varying surface reflectance under unknown lighting. ACM Trans. on Graphics, 33(6):193:1–193:12, 2014.
    [16] Carlos Hernández Esteban, George Vogiatzis, and Roberto Cipolla. Multiview photometric stereo. PAMI, 30(3):548–554, 2008.
    [17] Hao Fan, Lin Qi, Junyu Dong, Gongfa Li, and Hui Yu. Dynamic 3d surface reconstruction using a hand-held camera. In IECON, 2018.
    [18] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multi-view stereopsis. PAMI, 32(8):1362–1376, 2010.
    [19] Peter V. Gehler, Carsten Rother, Martin Kiefel, Lumin Zhang, and Bernhard Schölkopf. Recovering intrinsic images with a global sparsity prior on reflectance. In NIPS, pages 765–773, 2011.
    [20] Stamatios Georgoulis, Marc Proesmans, and Luc J. Van Gool. Tackling shapes and brdfs head-on. In 3DV, 2014.
    [21] Dan B. Goldman, Brian Curless, Aaron Hertzmann, and Steven M. Seitz. Shape and spatially-varying brdfs from photometric stereo. PAMI, 32(6):1060–1071, 2010.
    [22] Bjoern Haefner, Yvain Quéau, Thomas Möllenhoff, and Daniel Cremers. Fight ill-posedness with ill-posedness: Single-shot variational depth super-resolution from shading. In CVPR, 2018.
    [23] Aaron Hertzmann and Steven M. Seitz. Example-based photometric stereo: Shape reconstruction with general, varying brdfs. PAMI, 27(8):1254–1264, 2005.
    [24] Tomoaki Higo, Yasuyuki Matsushita, Neel Joshi, and Katsushi Ikeuchi. A hand-held photometric stereo camera for 3-d modeling. In ICCV, 2009.
    [25] Michael Holroyd, Jason Lawrence, Greg Humphreys, and Todd E. Zickler. A photometric approach for estimating normals and tangents. ACM Trans. on Graphics, 27(5):133:1–133:9, 2008.
    [26] Michael Holroyd, Jason Lawrence, and Todd E. Zickler. A coaxial optical scanner for synchronous acquisition of 3d geometry and surface reflectance. ACM Trans. on Graphics, 29(4):99:1–99:12, 2010.
    [27] B. K. P. Horn. Shape from shading: A method for obtaining the shape of a smooth opaque object from one view. Technical report, 1970.
    [28] Z. Hui and A. C. Sankaranarayanan. Shape and spatially-varying reflectance estimation from virtual exemplars. PAMI, 2017.
    [29] Z. Hui, K. Sunkavalli, J. Lee, S. Hadap, J. Wang, and A. C. Sankaranarayanan. Reflectance capture using univariate sampling of brdfs. In ICCV, 2017.
    [30] Katsushi Ikeuchi and Berthold K. P. Horn. Numerical shape from shading and occluding boundaries. AI, 17(1-3):141–184, 1981.
    [31] Wenzel Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.
    [32] Rasmus Ramsbøl Jensen, Anders Lindbjerg Dahl, George Vogiatzis, Engil Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014.
    [33] Neel Joshi and David J. Kriegman. Shape from varying illumination and viewpoint. In ICCV, 2007.
    [34] James T. Kajiya. The rendering equation. In ACM Trans. on Graphics, 1986.
    [35] Michael M. Kazhdan and Hugues Hoppe. Screened poisson surface reconstruction. ACM Trans. on Graphics, 32(3):29, 2013.
    [36] Kihwan Kim, Jinwei Gu, Stephen Tyree, Pavlo Molchanov, Matthias Nießner, and Jan Kautz. A lightweight approach for on-the-fly reflectance estimation. In ICCV, 2017.
    [37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
    [38] Vladimir Kolmogorov and Ramin Zabih. Multi-camera scene reconstruction via graph cuts. In ECCV, 2002.

  • [39] Florent Lafarge, Renaud Keriven, Mathieu Bredif, and Hoang-Hiep Vu. A hybrid multiview stereo algorithm for modeling urban scenes. PAMI, 35(1):5–17, 2013.
    [40] Hendrik P. A. Lensch, Jan Kautz, Michael Goesele, Wolfgang Heidrich, and Hans-Peter Seidel. Image-based reconstruction of spatial appearance and geometric detail. ACM Trans. on Graphics, 22(2):234–257, 2003.
    [41] Jongwoo Lim, Jeffrey Ho, Ming-Hsuan Yang, and David J. Kriegman. Passive photometric stereo from motion. In ICCV, 2005.
    [42] Chao Liu, Srinivasa G. Narasimhan, and Artur W. Dubrawski. Near-light photometric stereo using circularly placed point light sources. 2018.
    [43] Fotios Logothetis, Roberto Mecca, and Roberto Cipolla. Semi-calibrated near field photometric stereo. In CVPR, 2017.
    [44] Fotios Logothetis, Roberto Mecca, and Roberto Cipolla. A differential volumetric approach to multi-view photometric stereo. In ICCV, 2019.
    [45] Stephen Lombardi and Ko Nishino. Single image multimaterial estimation. In CVPR, 2012.
    [46] Stephen Lombardi and Ko Nishino. Radiometric scene decomposition: Scene reflectance, illumination, and geometry from RGB-D images. In 3DV, 2016.
    [47] Zheng Lu, Yu-Wing Tai, Moshe Ben-Ezra, and Michael S. Brown. A framework for ultra high resolution 3d imaging. In CVPR, 2010.
    [48] Robert Maier, Kihwan Kim, Daniel Cremers, Jan Kautz, and Matthias Nießner. Intrinsic3d: High-quality 3d reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In ICCV, pages 3133–3141, 2017.
    [49] Wojciech Matusik, Hanspeter Pfister, Matt Brand, and Leonard McMillan. A data-driven reflectance model. ACM Trans. on Graphics, 22(3):759–769, July 2003.
    [50] Roberto Mecca and Yvain Quéau. Unifying diffuse and specular reflections for the photometric stereo problem. In WACV, 2016.
    [51] Roberto Mecca, Yvain Quéau, Fotios Logothetis, and Roberto Cipolla. A single-lobe photometric stereo approach for heterogeneous material. SIAM, 9(4):1858–1888, 2016.
    [52] Roberto Mecca, Emanuele Rodolà, and Daniel Cremers. Realistic photometric stereo using partial differential irradiance equation ratios. Computers & Graphics, 51:8–16, 2015.
    [53] Roberto Mecca, Ariel Tankus, Aaron Wetzler, and Alfred M. Bruckstein. A direct differential approach to photometric stereo with perspective viewing. SIAM, 7(2):579–612, 2014.
    [54] Roberto Mecca, Aaron Wetzler, Alfred M. Bruckstein, and Ron Kimmel. Near field photometric stereo with point light sources. SIAM, 7(4):2732–2770, 2014.
    [55] Jean Mélou, Yvain Quéau, Jean-Denis Durou, Fabien Castan, and Daniel Cremers. Beyond multi-view stereo: Shading-reflectance decomposition. In SSVM, pages 694–705, 2017.
    [56] Jean Mélou, Yvain Quéau, Jean-Denis Durou, Fabien Castan, and Daniel Cremers. Variational reflectance estimation from multi-view images. JMIV, 60(9):1527–1546, 2018.
    [57] Giljoo Nam, Joo Ho Lee, Diego Gutierrez, and Min H. Kim. Practical SVBRDF acquisition of 3d objects with unstructured flash photography. ACM Trans. on Graphics, 37(6):267:1–267:12, 2018.
    [58] Diego Nehab, Szymon Rusinkiewicz, James Davis, and Ravi Ramamoorthi. Efficiently combining positions and normals for precise 3d geometry. ACM Trans. on Graphics, 24(3):536–543, 2005.
    [59] F. E. Nicodemus, J. C. Richmond, J. J. Hsia, I. W. Ginsberg, and T. Limperis. Geometrical considerations and nomenclature for reflectance. In Radiometry, chapter Geometrical Considerations and Nomenclature for Reflectance, pages 94–145. Jones and Bartlett Publishers, Inc., 1992.
    [60] Jannik Boll Nielsen, Henrik Wann Jensen, and Ravi Ramamoorthi. On optimal, minimal BRDF sampling for reflectance acquisition. ACM Trans. on Graphics, 34(6):186:1–186:11, 2015.
    [61] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. In ACM Trans. on Graphics, 2013.
    [62] Geoffrey Oxholm and Ko Nishino. Multiview shape and reflectance from natural illumination. In CVPR, 2014.
    [63] Thoma Papadhimitri and Paolo Favaro. Uncalibrated near-light photometric stereo. In BMVC, 2014.
    [64] Jaesik Park, Sudipta N. Sinha, Yasuyuki Matsushita, Yu-Wing Tai, and In So Kweon. Robust multiview photometric stereo using planar mesh parameterization. PAMI, 39(8):1591–1604, 2017.
    [65] Jeong Joon Park, Richard A. Newcombe, and Steven M. Seitz. Surface light field fusion. In 3DV, 2018.
    [66] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Workshops, 2017.
    [67] Songyou Peng, Bjoern Haefner, Yvain Quéau, and Daniel Cremers. Depth super-resolution meets uncalibrated photometric stereo. In ICCV Workshops, 2017.
    [68] Yvain Quéau, Bastien Durix, Tao Wu, Daniel Cremers, François Lauze, and Jean-Denis Durou. Led-based photometric stereo: Modeling, calibration and numerical solution. JMIV, 60(3):313–340, 2018.
    [69] Yvain Quéau, François Lauze, and Jean-Denis Durou. A l^1-tv algorithm for robust perspective photometric stereo with spatially-varying lightings. In SSVM, 2015.
    [70] Yvain Quéau, François Lauze, and Jean-Denis Durou. Solving uncalibrated photometric stereo using total variation. JMIV, 52(1):87–107, 2015.
    [71] Yvain Quéau, Roberto Mecca, and Jean-Denis Durou. Unbiased photometric stereo for colored surfaces: A variational approach. In CVPR, 2016.
    [72] Yvain Quéau, Jean Mélou, Fabien Castan, Daniel Cremers, and Jean-Denis Durou. A variational approach to shape-from-shading under natural illumination. In EMMCVPR, 2017.

  • [73] Yvain Quéau, Jean Mélou, Jean-Denis Durou, and Daniel Cremers. Dense multi-view 3d-reconstruction without dense correspondences. arXiv.org, 1704.00337, 2017.
    [74] Yvain Quéau, Tao Wu, and Daniel Cremers. Semi-calibrated near-light photometric stereo. In SSVM, 2017.
    [75] Yvain Quéau, Tao Wu, François Lauze, Jean-Denis Durou, and Daniel Cremers. A non-convex variational approach to photometric stereo under inaccurate lighting. In CVPR, 2017.
    [76] Jérémy Rivière, Pieter Peers, and Abhijeet Ghosh. Mobile surface reflectometry. Computer Graphics Forum, 35(1):191–202, 2016.
    [77] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
    [78] Christopher Schwartz, Ralf Sarlette, Michael Weinmann, and Reinhard Klein. DOME II: A parallelized BTF acquisition system. In EUROGRAPHICS, 2013.
    [79] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
    [80] S. M. Seitz and C. R. Dyer. Photorealistic scene reconstruction by voxel coloring. In CVPR, 1997.
    [81] Boxin Shi, Kenji Inose, Yasuyuki Matsushita, Ping Tan, Sai-Kit Yeung, and Katsushi Ikeuchi. Photometric stereo using internet images. In 3DV, 2014.
    [82] Denis Simakov, Darya Frolova, and Ronen Basri. Dense shape reconstruction of a moving object under arbitrary, unknown lighting. In ICCV, 2003.
    [83] Borom Tunwattanapong, Graham Fyffe, Paul Graham, Jay Busch, Xueming Yu, Abhijeet Ghosh, and Paul E. Debevec. Acquiring reflectance and shape from continuous spherical harmonic illumination. ACM Trans. on Graphics, 32(4):109:1–109:12, 2013.
    [84] Ali Osman Ulusoy, Andreas Geiger, and Michael J. Black. Towards probabilistic volumetric reconstruction using ray potentials. In 3DV, 2015.
    [85] George Vogiatzis, Philip H. S. Torr, and Roberto Cipolla. Multi-view stereo via volumetric graph-cuts. In CVPR, 2005.
    [86] John Wang and Edwin Olson. AprilTag 2: Efficient and robust fiducial detection. In IROS, October 2016.
    [87] Thomas Whelan, Stefan Leutenegger, Renato F. Salas-Moreno, Ben Glocker, and Andrew J. Davison. ElasticFusion: Dense SLAM without a pose graph. In RSS, 2015.
    [88] Robert J. Woodham. Photometric method for determining surface orientation from multiple images. OE, 19(1):191139, 1980.
    [89] Chenglei Wu, Michael Zollhöfer, Matthias Nießner, Marc Stamminger, Shahram Izadi, and Christian Theobalt. Real-time shading-based refinement for consumer depth cameras. In ACM Trans. on Graphics, 2014.
    [90] Hongzhi Wu, Zhaotian Wang, and Kun Zhou. Simultaneous localization and appearance estimation with a consumer RGB-D camera. VCG, 22(8):2012–2023, 2016.
    [91] Hongzhi Wu and Kun Zhou. Appfusion: Interactive appearance acquisition using a kinect sensor. Computer Graphics Forum, 34(6):289–298, 2015.
    [92] Zhe Wu, Sai-Kit Yeung, and Ping Tan. Towards building an RGBD-M scanner. arXiv.org, 1603.03875, 2016.
    [93] Rui Xia, Yue Dong, Pieter Peers, and Xin Tong. Recovering shape and spatially-varying surface reflectance under unknown illumination. ACM Trans. on Graphics, 35(6):187:1–187:12, 2016.
    [94] Wuyuan Xie, Chengkai Dai, and Charlie C. L. Wang. Photometric stereo with near point lighting: A solution by mesh deformation. In CVPR, 2015.
    [95] Zexiang Xu, Jannik Boll Nielsen, Jiyang Yu, Henrik Wann Jensen, and Ravi Ramamoorthi. Minimal BRDF sampling for two-shot near-field reflectance acquisition. ACM Trans. on Graphics, 35(6):188:1–188:12, 2016.
    [96] Yusuke Yoshiyasu and Nobutoshi Yamazaki. Topology-adaptive multi-view photometric stereo. In CVPR, 2011.
    [97] Yizhou Yu, Paul E. Debevec, Jitendra Malik, and Tim Hawkins. Inverse global illumination: Recovering reflectance models of real scenes from photographs. In ACM Trans. on Graphics, pages 215–224, 1999.
    [98] Andy Zeng, Shuran Song, Matthias Nießner, Matthew Fisher, Jianxiong Xiao, and Thomas Funkhouser. 3DMatch: Learning local geometric descriptors from rgb-d reconstructions. In CVPR, 2017.
    [99] Ruo Zhang, Ping-Sing Tsai, James Edwin Cryer, and Mubarak Shah. Shape from shading: A survey. PAMI, 21(8):690–706, 1999.
    [100] Zhengyou Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In ICCV, 1999.
    [101] Zhiming Zhou, Guojun Chen, Yue Dong, David Wipf, Yong Yu, John Snyder, and Xin Tong. Sparse-as-possible svbrdf acquisition. ACM Trans. on Graphics, 35(6):189:1–189:12, 2016.
    [102] Zhenglong Zhou and Ping Tan. Ring-light photometric stereo. In ECCV, 2010.
    [103] Zhenglong Zhou, Zhe Wu, and Ping Tan. Multi-view photometric stereo with spatially varying isotropic materials. In CVPR, 2013.
    [104] Michael Zollhöfer, Angela Dai, Matthias Innmann, Chenglei Wu, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Shading-based refinement on volumetric signed distance functions. ACM Trans. on Graphics, 34(4):96:1–96:14, 2015.
    [105] Xinxin Zuo, Sen Wang, Jiangbin Zheng, and Ruigang Yang. Detailed surface geometry and albedo recovery from RGB-D video under natural illumination. In ICCV, 2017.