
A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos

Thomas Schöps 1, Johannes L. Schönberger 1, Silvano Galliani 2, Torsten Sattler 1, Konrad Schindler 2, Marc Pollefeys 1,4, Andreas Geiger 1,3

1 Department of Computer Science, ETH Zürich; 2 Institute of Geodesy and Photogrammetry, ETH Zürich; 3 Autonomous Vision Group, MPI for Intelligent Systems, Tübingen; 4 Microsoft, Redmond

Abstract

Motivated by the limitations of existing multi-view stereo benchmarks, we present a novel dataset for this task. Towards this goal, we recorded a variety of indoor and outdoor scenes using a high-precision laser scanner and captured both high-resolution DSLR imagery as well as synchronized low-resolution stereo videos with varying fields-of-view. To align the images with the laser scans, we propose a robust technique which minimizes photometric errors conditioned on the geometry. In contrast to previous datasets, our benchmark provides novel challenges and covers a diverse set of viewpoints and scene types, ranging from natural scenes to man-made indoor and outdoor environments. Furthermore, we provide data at significantly higher temporal and spatial resolution. Our benchmark is the first to cover the important use case of hand-held mobile devices while also providing high-resolution DSLR camera images. We make our datasets and an online evaluation server available at http://www.eth3d.net.

1. Introduction

The problem of reconstructing 3D geometry from two or more views has received tremendous attention in computer vision for several decades. Applications range from 3D reconstruction of objects [4] and larger scenes [3, 5, 35] over dense sensing for autonomous vehicles [6–8, 11, 30] or obstacle detection [10] to 3D reconstruction from mobile devices [14, 20, 28, 36, 41]. Despite its long history, many problems in 3D reconstruction remain unsolved to date. To identify these problems and analyze the strengths and weaknesses of the state of the art, access to a large-scale dataset with 3D ground truth is indispensable.

Indeed, the advent of excellent datasets and benchmarks, such as [6, 12, 17, 29, 32–34, 38, 40], has greatly advanced the state of the art of stereo and multi-view stereo (MVS) techniques. However, constructing good benchmark datasets is a tedious and challenging task. It requires the acquisition of images and a 3D scene model, e.g., through a laser scanner or structured light sensor, as well as careful registration between the different modalities. Often, manual work is required [6] to mask occluded regions, sensor inaccuracies, or image areas with invalid depth estimates, e.g., due to moving objects. Therefore, existing benchmarks are limited in their variability and are often also domain specific.

This paper presents a novel benchmark for two- and multi-view stereo algorithms designed to complement existing benchmarks across several dimensions (c.f. Fig. 1): (i) Compared to previous MVS benchmarks, our dataset offers images acquired at a very high resolution. Using a professional DSLR camera, we capture images at 24 Megapixel resolution compared to 6 Megapixels in Strecha et al. [40], 0.5 Megapixels in KITTI [6], and 0.3 Megapixels in Middlebury [38]. This enables the evaluation of algorithms designed for detailed 3D reconstruction. At the same time, it encourages the development of memory- and computationally efficient methods which can handle very large datasets. (ii) By now, mobile devices have become powerful enough for real-time stereo [20, 28, 30, 36, 41], creating the need for benchmark datasets that model the acquisition process typical for such hand-held devices. In addition to the DSLR images, we also capture a set of image sequences with four synchronized cameras forming two stereo pairs that move freely through the scene. These videos enable algorithms to exploit the redundancy provided by high frame rates to improve the reconstruction quality. Again, this scenario rewards efficient algorithms which can handle large amounts of data. To study the effect of field-of-view (FOV) and distortion, we recorded stereo imagery using different lenses. (iii) In contrast to the Middlebury benchmarks [33, 38], our scenes are not carefully staged in a controlled laboratory environment. Instead, they provide the full range of challenges of real-world photogrammetric measurements. Rather than moving along a constrained trajectory, e.g., in a circle around an object, our cameras undergo unconstrained 6-DoF motion.


(a) Scene type  (b) View point  (c) Camera type  (d) Field of view
Figure 1. Examples demonstrating the variety of our dataset in terms of appearance and depth. (a) Colored 3D point cloud renderings of different natural and man-made scenes. (b) DSLR images taken from different view points. (c) DSLR image (top) and image from our multi-camera rig (bottom) of the same scene. (d) Camera rig images with different fields-of-view.

As a result, MVS algorithms need to be able to account for stronger variations in viewpoint. In contrast to Strecha's dataset [40], our benchmark covers a wider spectrum of scenes, ranging from office scenes over man-made outdoor environments to natural scenes depicting mostly vegetation. The latter type is especially interesting as there exist fewer priors which are applicable to this scenario. In addition, our scenes comprise fine details (e.g., trees, wires) which are challenging for existing techniques.

The contributions of this paper include (i) a benchmark, which is made publicly available together with a website for evaluating novel algorithms on a held-out test set, (ii) a highly accurate alignment strategy that we use to register images and video sequences against 3D laser scan point clouds, and (iii) an analysis of existing state-of-the-art algorithms on this benchmark. Our benchmark provides novel challenges and we believe that it will become an invaluable resource for future research in dense 3D reconstruction with a focus on big data and mobile devices.

2. Related Work

In this section, we review existing two- and multi-view stereo datasets. An overview which compares key aspects of our dataset to existing benchmarks is provided in Tab. 1.

Two-View Stereo Datasets. One of the first datasets for two-view stereo evaluation was the Tsukuba image pair [27], for which 16 levels of disparity have been manually annotated. Unfortunately, manual annotation does not scale to large realistic datasets due to their complexity [21].

Towards more realism, Scharstein et al. [33] proposed the Middlebury stereo evaluation, comprising 38 indoor scenes at VGA resolution with ground truth correspondences obtained via a structured light scanner. A new version of the Middlebury dataset [32] has recently been released, featuring ground truth disparities for 33 novel scenes at a resolution of 6 Megapixels. Unfortunately, the amount of human labor involved in staging the scenes and recording the ground truth is considerable. Thus, these datasets are relatively small in size. Besides, their variability is limited as the setup requires controlled structured lighting conditions. In contrast, we are interested in general scenes and introduce a dataset that features both indoor and outdoor environments.

Geiger et al. [6, 24] recorded the KITTI datasets using a mobile platform with a laser scanner mounted on a car. This automated the recording process. However, the benchmark images are of low resolution (0.5 Megapixels) and the ground truth annotations are sparse (< 50% of all image pixels). Besides, a fixed sensor setup on a car limits the diversity of the recorded scenes to road-like scenarios.

Rendered images, as used in the MPI Sintel stereo benchmark [1], provide an alternative to real recordings and have been used for learning complex models [23]. Yet, creating realistic 3D models is difficult and the degree of realism required is still not well understood [43].

Multi-View Stereo Datasets. The Middlebury dataset of Seitz et al. [38] was the first common benchmark for evaluating multi-view stereo on equal grounds. They captured hundreds of images per object using a robot that uniformly sampled the hemisphere enclosing the scene. Reference data has been created by stitching several line laser scans together.


Benchmark             Setting        Resolution    Online Eval.  6DoF Motion  MVS  Stereo  Video  Varying FOV
Middlebury MVS [38]   Laboratory     0.3 Mpx       ✓                          ✓
Middlebury [32, 33]   Laboratory     6 Mpx         ✓                               ✓
DTU [17]              Laboratory     2 Mpx                                    ✓
MPI Sintel [1]        Synthetic      0.4 Mpx       ✓             ✓                 ✓       ✓
KITTI [6, 24]         Street scenes  0.5 Mpx       ✓             ✓                 ✓       ✓
Strecha [40]          Buildings      6 Mpx                       ✓            ✓
ETH3D (Proposed)      Varied         0.4 / 24 Mpx  ✓             ✓            ✓    ✓       ✓     ✓

Table 1. Comparison of existing state-of-the-art benchmarks with our new dataset. Among other factors, we differentiate between different scene types (e.g., staged scenes captured in a laboratory vs. synthetic scenes), whether the camera undergoes a restricted or a full 6 degrees-of-freedom (DoF) motion, or whether cameras with different fields-of-view (FOV) are used.

Unfortunately, this benchmark provides a limited image resolution (VGA), and its data, captured in a controlled laboratory environment, does not reflect many of the challenges in real-world scenes. Besides, only two toy scenes with Lambertian surface properties are provided, resulting in overfitting and performance saturation.

As a consequence, Strecha et al. [40] proposed a new MVS benchmark comprising 6 outdoor datasets which include ∼30 images at 6 Megapixel resolution, as well as ground truth 3D models captured by a laser scanner. While this dataset fostered the development of efficient methods, it provides relatively easy (i.e., well-textured) scenes and the benchmark's online service is not available anymore.

To compensate for the lack of diversity in [38, 40] and the well-textured, diffuse surfaces of [40], Jensen et al. [17] captured a multitude of real-world objects using a robotic arm. Yet, their controlled environment shares several limitations with the original Middlebury benchmark and reduces the variety of scenes and viewpoints.

Simultaneously to our work, Knapitsch et al. proposed a new benchmark for challenging indoor and outdoor scenes [19]. Their benchmark provides high-resolution video data and uses ground truth measurements obtained with a laser scanner. While our benchmark focuses on evaluating both binocular stereo and MVS, theirs jointly evaluates Structure-from-Motion (SfM) and MVS. Knapitsch et al. captured their video sequences with a high-end camera and carefully selected camera settings to maximize video quality. In contrast, our videos were captured with cameras commonly used for mobile robotics and always use auto-exposure. Thus, both benchmarks complement each other.

3. Data Acquisition and Registration

We follow [6, 24, 25, 38, 40] and capture the ground truth for our dataset using a highly accurate 3D laser scanner. This section describes the data acquisition and our approach to robustly and accurately register images and laser scans.

3.1. Data Acquisition

We recorded the ground truth scene geometry with a Faro Focus X 330 laser scanner. For each scene, depending on the occlusions within it, we recorded one or multiple 360° scans with up to ∼28 million points each. In addition to the depth measurements, we recorded the color of each 3D point provided by the laser scanner's integrated RGB camera. Recording a single scan took ∼9 minutes.

For the high-resolution image data, we used a professional Nikon D3X DSLR camera on a tripod. We kept the focal length and the aperture fixed, such that the intrinsic parameters can be shared between all images with the same settings. The camera captures photos at a resolution of 6048×4032 pixels with an 85° FOV.

For the mobile scenario, we additionally recorded videos using the multi-camera setup described in [9]: We use four global-shutter cameras, forming two stereo pairs, which are hardware-synchronized via an FPGA and record images at ∼13.6 Hz. The cameras of the first stereo pair have a FOV of 54° each, while the other two cameras have a FOV of 83°. All cameras capture images at a resolution of 752×480 pixels. As common and necessary for mobile devices, we set the exposure settings to automatic, allowing the device to adapt to illumination changes.

3.2. Registration

To use the recorded data for our benchmark, we first remove errors from the laser scans and mask problematic areas in the images. Next, we align the scans taken from different positions with each other and register the camera images against the laser scan point clouds. We employ a fully automatic three-stage alignment procedure for this task. The first stage estimates a rough initial alignment between the laser scans and the camera images. We then refine the registration of the laser scans, followed by a refinement of the intrinsic and extrinsic calibration of the cameras. In the following, we describe each of these steps in detail.

Preprocessing. The raw laser scans contain artifacts caused by beams reflected from both foreground and background objects, resulting in the interpolation of foreground and background depths at occlusion boundaries. Furthermore, reflecting objects and glass frequently cause systematic outliers. Therefore, we filter the scans with the statistical outlier removal procedure from [31]. This removes all points from the cloud whose average distance to their k nearest neighbors is larger than a threshold.


Figure 2. Left: Illustration of a cube map. One of the 6 virtual cameras is highlighted in red, coordinate axes are shown in blue. Right: Sparse color image and depth map (left) and the inpainted image (right) for one virtual camera.

The point density of our scans differs depending on the distance from the scanner to the surface. Thus, we compute the threshold used for outlier removal for each point from its local neighborhood rather than using one single global value. In a final step, we manually remove undetected systematic errors. We also inspect each image and annotate regions which should not be used. These regions include moving objects, e.g., moving branches in the wind, objects not represented correctly in the laser scan such as transparent surfaces, or regions for which occlusion reasoning (as described later) fails due to the sparsity of the measured 3D point cloud.
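The adaptive outlier filter can be illustrated with a small sketch. This is a minimal Python/NumPy example and not the actual implementation; the neighborhood sizes k and k_local and the factor lam are illustrative choices rather than values from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree

def adaptive_outlier_removal(points, k=16, k_local=200, lam=2.0):
    """Statistical outlier removal with a per-point threshold.
    A point is kept if its mean distance to its k nearest neighbors is not
    much larger than is typical within its local neighborhood."""
    tree = cKDTree(points)
    # Mean distance of every point to its k nearest neighbors (index 0 is the point itself).
    d_knn, _ = tree.query(points, k=k + 1)
    mean_d = d_knn[:, 1:].mean(axis=1)
    # Local statistics of these mean distances define a per-point threshold.
    _, idx_local = tree.query(points, k=k_local)
    local_mean = mean_d[idx_local].mean(axis=1)
    local_std = mean_d[idx_local].std(axis=1)
    keep = mean_d <= local_mean + lam * local_std
    return points[keep]
```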

Initial Laser Scan and Image Alignment. We use the COLMAP SfM pipeline [35, 37] to obtain an initial estimate of the laser scan poses as well as the extrinsics and intrinsics of the cameras. It is well known [15, 39] that rendered views can be robustly matched against real imagery using classical descriptors like SIFT [22]. To register the laser scans with the images, we thus include rendered cube map images for each scan position into the SfM reconstruction. Cube maps use the six faces of a cube to create an omnidirectional view of the environment. The six projection centers of the virtual cube map cameras coincide with the origin of the laser scanner. We render the colored point cloud of a laser scan into these cameras, resulting in six sparse color and depth images per laser scan (c.f. Fig. 2). We fill in missing pixels not covered by a laser scan point by using the color of the nearest neighbor. While more complex rendering methods [39] could be used, we found that this strategy already suffices for feature matching in SfM. To obtain an initial estimate of the scale of the SfM model, we compare the depth of SfM points projected into the cube maps to the rendered depth maps.
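The nearest-neighbor fill-in of the sparse cube-map renderings can be sketched as follows; a minimal example using SciPy's Euclidean distance transform, assuming the point cloud has already been rendered into the six virtual cameras:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def inpaint_nearest(sparse_color, valid_mask):
    """Fills pixels not covered by any laser scan point with the color of the
    nearest covered pixel. sparse_color: (H, W, 3) rendered colors,
    valid_mask: (H, W) boolean, True where a point was rendered."""
    # For every pixel, indices of the nearest valid pixel.
    _, (iy, ix) = distance_transform_edt(~valid_mask, return_indices=True)
    return sparse_color[iy, ix]
```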

Refinement of Laser Scan Alignment. The projection centers of the cube maps correspond to the origin of each laser scan. Thus, the SfM reconstruction from the previous step provides an initial relative alignment of the scans. We refine this alignment by jointly optimizing the rigid body poses of all laser scans via point-to-plane ICP [2] on the point clouds. Visual inspection verified that the resulting alignments are virtually perfect, which can be expected given the high accuracy of the laser scanner and the massive amount of information per scan. Thus, we fix the scan poses from here on.
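A pairwise point-to-plane ICP refinement step can be sketched with Open3D; note that the paper jointly optimizes all scan poses, while this simplified sketch only aligns one scan to another and assumes Open3D is available:

```python
import numpy as np
import open3d as o3d

def refine_scan_pose(source, target, init_T=np.eye(4), max_corr_dist=0.05):
    """Point-to-plane ICP [2] between two laser scans.
    source, target: open3d.geometry.PointCloud; init_T: initial 4x4 pose of
    source in the target frame (e.g., from the SfM-based initialization)."""
    # Point-to-plane ICP needs normals on the target cloud.
    target.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=0.1, max_nn=30))
    result = o3d.pipelines.registration.registration_icp(
        source, target, max_corr_dist, init_T,
        o3d.pipelines.registration.TransformationEstimationPointToPlane())
    return result.transformation  # refined 4x4 rigid-body transform
```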

Refinement of Image Alignment. In the last step of the pipeline, we refine the extrinsic and intrinsic parameters of the cameras while keeping the laser scan point cloud fixed. For this step, we use an extended version of the dense image alignment approach proposed by Zhou & Koltun [47].

Zhou & Koltun first sample a set of points P from a mesh surface and then optimize the camera parameters and the intensity c(p) of each 3D point to minimize the cost function

$$\sum_{p \in \mathcal{P}} \; \sum_{i \in I(p)} \big( I_i(\pi_i(p)) - c(p) \big)^2 . \quad (1)$$

Here, I(p) denotes the set of images in which point p ∈ P is visible, and Ii(πi(p)) is the intensity of image i at the pixel coordinate πi(p) corresponding to p's projection. c(p) denotes the intensity of p, which belongs to the variables that are optimized. In our case, we use the joint point cloud from all scans for P. To determine the visibility I(p), we compute a screened Poisson surface reconstruction [18] for P. Since thin objects such as wires are often not captured by the reconstruction, we augment the mesh-based representation with splats, i.e., oriented discs, generated for all scan points far away from the Poisson surface. I(p) is then determined from depth map renderings of the mesh and the splats. All points whose depth is smaller than that of the depth map rendering at their projected positions, plus a small tolerance of 1 cm, are assumed to be visible in the image.
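The visibility test can be sketched as follows; a simplified pinhole version without lens distortion, where depth_render is assumed to be the depth map rendered from the Poisson mesh and the splats:

```python
import numpy as np

def visible_point_mask(points_cam, depth_render, K, tol=0.01):
    """Marks laser-scan points as visible in one image: a point counts as
    visible if its depth does not exceed the rendered depth at its projection
    by more than a 1 cm tolerance. points_cam: (N, 3) points in the camera
    frame, depth_render: (H, W), K: 3x3 pinhole intrinsics."""
    z = points_cam[:, 2]
    u = K[0, 0] * points_cam[:, 0] / z + K[0, 2]
    v = K[1, 1] * points_cam[:, 1] / z + K[1, 2]
    H, W = depth_render.shape
    ui, vi = np.round(u).astype(int), np.round(v).astype(int)
    in_img = (z > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)
    vis = np.zeros(len(points_cam), dtype=bool)
    vis[in_img] = z[in_img] <= depth_render[vi[in_img], ui[in_img]] + tol
    return vis
```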

Eq. 1 directly compares pixel intensities and thus assumes brightness constancy. However, this assumption is strongly violated in our setting, since we (1) use multiple cameras, some of which employ an auto-exposure setting, and (2) record outdoor scenes where strong lighting changes require manipulation of the shutter time. Rather than directly comparing pixel intensities, we thus compare intensity gradients g, making our objective robust to brightness changes. Similar to computing finite difference gradients in an image, we compute the intensity gradients in the point cloud using local neighborhoods.

However, due to the different image resolutions and the high laser scan point density in our dataset, the nearest neighbors of a point p might project to the same pixel in one image and to relatively far away pixels in another image. In the former case, the points' intensity differences are nearly constant and do not provide enough information for the optimization. In the latter case, under-sampling of the image leads to a meaningless intensity gradient. Consequently, we sample the neighborhoods at appropriate point cloud resolutions. We only add the projection of a point p to an image as a constraint if all neighbors of p project roughly one pixel away from it. This avoids the discussed cases of over- and under-sampling.


To create enough constraints between all images and for all points, we efficiently sample the neighborhoods from a pre-calculated multi-resolution point cloud and we use a multi-resolution scheme over image pyramids. For each point projection to an image, the image pyramid level with the most suitable resolution is considered. This increases the chance that a gradient g is compared against at least two images and influences the respective camera parameters. Further, the multi-resolution scheme over images enlarges the convergence basin of the optimization. We process the image pyramid coarse-to-fine while coarser resolutions are kept in the objective function.
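The check that a point's neighbors project roughly one pixel away from the point's own projection can be sketched as follows; the tolerance interval is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def neighbors_sample_one_pixel(p_px, neighbors_px, lo=0.7, hi=1.5):
    """Returns True if all neighbors of a point project roughly one pixel away
    from the point's own projection, i.e., this neighborhood neither over- nor
    under-samples the image. p_px: (2,), neighbors_px: (M, 2)."""
    d = np.linalg.norm(neighbors_px - p_px, axis=1)
    return bool(np.all((d >= lo) & (d <= hi)))
```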

More concretely, we associate each point cloud level l with a point radius r_l in 3D. Since the changes made by the refinement of the image alignment will be small, we use the initial image alignment to determine the relevant point cloud levels. For each laser scan point p and each pyramid level h of each image i ∈ I(p), we determine the radius r(i, h, p) of a 3D sphere around the 3D point p such that the projection of the sphere into image i at this pyramid level has a diameter of ∼1 pixel. To define the radius r_0 at the highest point cloud resolution, we use the minimum radius among all points and images. The radius of level l + 1 is defined as 2 r_l. The minimum and maximum radii r(i, h, p) of a point p define an interval, and a point is associated with level l if r_l falls into this range. At each level l, we greedily merge points within 2 r_l using the mean position as the position of the resulting 3D point. For each resulting point, we find its 25 nearest neighbors and randomly select 5 of them to define the point's neighborhood. If the average intensity difference between a point and its neighbors is less than 5, we drop the point as it lies in a homogeneous region and thus would not contribute to the optimization.
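The level radii and the level assignment can be sketched like this; a pinhole approximation in which focal_px is the focal length in pixels of the full-resolution image and halves with every pyramid level:

```python
import numpy as np

def one_pixel_radius(depth, focal_px, level):
    """Radius of a 3D sphere around a point at the given depth whose projection
    has a diameter of ~1 pixel at image pyramid level `level`."""
    focal_at_level = focal_px / (2.0 ** level)
    return depth / (2.0 * focal_at_level)

def assign_levels(r_min, r_max, r0, num_levels):
    """Associates points with point-cloud levels: a point belongs to level l if
    r_l = r0 * 2**l lies within its per-point interval [r_min, r_max] of
    projection radii. Returns one boolean mask per level."""
    return [(r_min <= r0 * 2.0 ** l) & (r0 * 2.0 ** l <= r_max)
            for l in range(num_levels)]
```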

Let p_j denote the j-th neighbor of point p. The variables g(p, p_j) are now associated to pairs of points and represent their gradient. We modify the cost function in Eq. 1 to take the following form:

$$\sum_{p \in \mathcal{P}} \; \sum_{i \in I(p)} \rho\!\left( \sqrt{ \sum_{j=1}^{5} \big( I_i(\pi_i(p_j)) - I_i(\pi_i(p)) - g(p, p_j) \big)^2 } \right) \quad (2)$$

where P contains all points of the multi-resolution point cloud and ρ[·] is the robust Huber loss function. Note that, in contrast to Eq. 1, we now represent and optimize for the gradients g of each 3D point rather than the point intensity c. Details on implementing this cost function can be found in the supplementary material, which also provides an illustration of the multi-resolution scheme.
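A direct transcription of the per-point, per-image term of Eq. 2 looks as follows; the Huber threshold delta is an illustrative assumption, and the image intensities would be obtained by sub-pixel interpolation at the projected positions:

```python
import numpy as np

def huber(x, delta=1.0):
    """Robust Huber loss rho[.] applied to a residual magnitude."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * a * a, delta * (a - 0.5 * delta))

def gradient_term(I_at_p, I_at_neighbors, g):
    """One term of Eq. 2: compares image intensity differences between a point
    and its 5 neighbors with the optimized point-cloud gradients g(p, p_j).
    I_at_p: scalar, I_at_neighbors: (5,), g: (5,)."""
    residual = np.sqrt(np.sum((I_at_neighbors - I_at_p - g) ** 2))
    return huber(residual)
```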

For the sequences recorded with the multi-camera rig, we ensure that the relative poses between the cameras in the rig remain consistent for all images during optimization by a rigid parametrization of the relative camera poses. To speed up the optimization, we optimize g(p, p_j) and the camera parameters alternatingly, as proposed in [47].

Figure 3. Sketch of the accuracy evaluation given a single scan point S, measured from the laser scanner position L, with evaluation threshold t. Reconstruction points within the green region are accurate, points in the red region are inaccurate, and points in the blue region are unobserved.

In practice, we found this to result in a good relative alignment between images, but not necessarily in a good absolute alignment to the laser scans. We thus add an additional cost term in analogy to Eq. 2 which minimizes the intensity differences in the images w.r.t. the intensity differences measured by the laser scanner. This term lowers drift by using the laser scan colors as a global reference. As a limitation, it creates a dependency on the laser scan colors, which themselves may not be perfectly aligned to the scan geometry. However, we empirically found the resulting alignments to be of satisfying quality for our needs.

4. Tasks and Evaluation Protocols

Our benchmark consists of three scenarios corresponding to different tasks for (multi-view) stereo algorithms:

• High-resolution multi-view stereo with relatively few images recorded by a DSLR camera.

• Low-resolution multi-view stereo on video data ("many-view") recorded with a multi-camera rig.

• Low-resolution two-view stereo on camera pairs of a multi-camera rig.

Each frame of the two-view stereo evaluation consists of all 4 images taken at the same time by the multi-camera rig. These 4 cameras form 2 stereo pairs such that both cameras in each pair have the same FOV. Both multi-view stereo scenarios are evaluated in 3D with the same evaluation protocol, while the two-view stereo scenario is evaluated in 2D with a separate protocol, as detailed in the following.

Multi-View Stereo Evaluation Protocol. We compare the MVS reconstruction, given as a point cloud, against the laser scan ground truth of the scene. Only laser scan points visible in at least two images are used for evaluation.

We evaluate the reconstruction in terms of accuracy and completeness. Both measures are evaluated over a range of distance thresholds from 1 cm to 50 cm. To determine completeness, we measure the distance of each ground truth 3D point to its closest reconstructed point. Completeness is defined as the fraction of ground truth points for which this distance is below the evaluation threshold.

Accuracy is defined as the fraction of reconstruction points which are within a distance threshold of the ground truth points.


Figure 4. Two examples of laser scan renderings colored by differently aligned images. Top row, from left to right: original laser scan colors, initial alignment, 7-DoF ICP alignment, our alignment. Bottom row: difference of each image to the laser scan image. Note that the different lighting causes a significant color difference.

Since our ground truth is incomplete, care has to be taken to prevent potentially missing ground truth points from distorting the results. Consequently, we segment the reconstruction into occupied, free, and unobserved space using an approximation of the laser scanner beams (c.f. Fig. 3). We model the shape of the laser beam of each ground truth point as a truncated cone. We make the assumption that the beam volume from the laser scanner origin to a scan point contains only free space.

We extend the beam volume beyond each ground truth point by the intersection of the extended beam cone with a sphere centered at the observed point. The sphere's radius is equal to the evaluation tolerance t. Reconstructed points outside all extended beam volumes are in unobserved space and are therefore discarded in the evaluation. Among the remaining reconstruction points, points are classified as accurate if they are within the extended beam volume and within radius t of a ground truth point. Accuracy is then defined as the ratio of accurate points out of all points while ignoring unobserved points.

The definitions of accuracy and completeness provided above are susceptible to the densities of both the reconstructed and the ground truth point clouds. For instance, an adversary could uniformly fill the 3D space with points to achieve high completeness while creating comparatively many more copies of a single reconstructed point known to be accurate to also achieve high accuracy. We thus discretize the space into voxels with a small side length. Both measures are first evaluated for each voxel individually. We then report the averages over all voxels. To measure a voxel's completeness, we use the ground truth points in it and all reconstructed points, even those outside the voxel. These roles are reversed to measure accuracy.
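The per-voxel averaging can be sketched for completeness as follows; the voxel side length is an illustrative choice, and accuracy would be computed analogously with the roles of the two point clouds reversed:

```python
import numpy as np
from scipy.spatial import cKDTree

def voxel_averaged_completeness(gt_points, rec_points, threshold, voxel=0.2):
    """Completeness evaluated per voxel and averaged over voxels: each voxel
    uses the ground-truth points inside it and all reconstructed points."""
    dist, _ = cKDTree(rec_points).query(gt_points)
    covered = (dist <= threshold).astype(float)
    # Integer voxel coordinates of each ground-truth point.
    voxel_ids = np.floor(gt_points / voxel).astype(np.int64)
    _, inv = np.unique(voxel_ids, axis=0, return_inverse=True)
    per_voxel = np.bincount(inv, weights=covered) / np.bincount(inv)
    return per_voxel.mean()
```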

Since both accuracy and completeness are important for measuring the quality of a reconstruction, we use the F1 score as a single measure to rank the results. Given accuracy (precision) p and completeness (recall) r, the F1 score is defined as the harmonic mean 2 · (p · r) / (p + r).
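Putting the pieces together, a simplified global version of the evaluation, without the beam-volume occlusion reasoning and without the per-voxel averaging shown above, reads:

```python
import numpy as np
from scipy.spatial import cKDTree

def evaluate_reconstruction(rec_points, gt_points, threshold=0.02):
    """Simplified accuracy, completeness, and F1 score at a single threshold."""
    d_rec, _ = cKDTree(gt_points).query(rec_points)   # reconstruction -> ground truth
    d_gt, _ = cKDTree(rec_points).query(gt_points)    # ground truth -> reconstruction
    accuracy = float(np.mean(d_rec <= threshold))      # precision p
    completeness = float(np.mean(d_gt <= threshold))   # recall r
    f1 = 2.0 * accuracy * completeness / max(accuracy + completeness, 1e-12)
    return accuracy, completeness, f1
```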

Two-View Stereo Evaluation Protocol. The two-view stereo evaluation is performed on rectified stereo pairs generated from the images of the multi-camera rig.

Figure 5. Top row: Overlays of ground truth depth (colored) onto images recorded by the multi-camera rig, showing the accuracy of our alignments. Middle & bottom row: Detailed views for rig and DSLR images, respectively. The ground truth depth is sparse at full DSLR resolution and not all objects are scanned completely.

The ground truth is given by the laser scan points projected into the rectified image. Occluded points are dropped using the same occlusion reasoning as in our image alignment procedure. The left disparity image is used for evaluation.

For this scenario, we evaluate the same metrics used in the Middlebury benchmark [32]: We measure the percentage of pixels having a disparity error larger than a threshold, for thresholds of 0.5, 1, 2, and 4 disparities (bad 0.5 - bad 4), the average absolute error in pixels (avgerr), the root-mean-square disparity error in pixels (rms), and the error quantiles in pixels for 50%, 90%, 95%, and 99% (A50 - A99).
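These metrics can be computed directly from the per-pixel disparity error; a minimal sketch, assuming disp_gt holds the projected laser scan disparities and valid marks the pixels with ground truth:

```python
import numpy as np

def stereo_metrics(disp_est, disp_gt, valid):
    """Two-view metrics in the Middlebury style: bad-X percentages, average
    absolute error, RMS error, and error quantiles, computed over the pixels
    with ground-truth disparity."""
    err = np.abs(disp_est[valid] - disp_gt[valid])
    metrics = {f'bad {t}': 100.0 * np.mean(err > t) for t in (0.5, 1, 2, 4)}
    metrics['avgerr'] = err.mean()
    metrics['rms'] = np.sqrt(np.mean(err ** 2))
    for q in (50, 90, 95, 99):
        metrics[f'A{q}'] = np.percentile(err, q)
    return metrics
```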

5. Results

First, we evaluate the accuracy of our image registration pipeline. Due to the lack of more precise measurements, this evaluation is performed qualitatively. In Sec. 5.2, we then evaluate state-of-the-art algorithms on our benchmark and discuss the gained insights.

5.1. Image Registration

We compare our alignment strategy to the initial alignment obtained after the laser scan refinement step and to a baseline method. The latter refines the initial camera poses through a 7-DoF ICP alignment (optimizing for position, rotation, and scale) between the reconstructed SfM points and the laser scans.


For each alignment result, we project all images onto the laser scans and compute the average color for each scan point to create the qualitative evaluations in Fig. 4. As can be seen, the baseline significantly improves the initial alignment. In turn, our alignment strategy clearly improves upon the baseline.

Fig. 5 shows overlays of images with their ground truth depth map computed from the laser scans. Depth edges are not used in our alignment procedure, thus they serve as a good indicator for the quality of the alignment. We observe that for both the DSLR images as well as the camera rig images, the alignment is generally pixel accurate.

5.2. Evaluation of Stereo Methods

High-Resolution Multi-View Scenario. For this scenario, we evaluate the popular patch-based PMVS [4], Gipuma [5], which is a well-performing PatchMatch-based variant [26], the multi-view stereo method based on pixel-wise view selection in COLMAP [35], and CMPMVS [16], which aims at reconstructing weakly supported surfaces. The results are shown in Fig. 6. We observe that for most scenes, COLMAP and PMVS outperform Gipuma and CMPMVS in terms of accuracy. In terms of completeness, Gipuma achieves a low score for, e.g., courtyard, electro, and delivery area, since its view selection scheme is tailored to object-centric scenes. For most datasets, CMPMVS and COLMAP clearly achieve the best completeness. COLMAP still struggles for weakly textured surfaces and thin structures, such as electro (c.f. Fig. 8c), kicker, office, and pipes. As shown in Fig. 8d, CMPMVS is able to correctly interpolate some of the weakly textured surfaces, but also hallucinates structures in other parts.

Fig. 8b shows the cumulative completeness score over all methods for the office scene, illustrating that all existing techniques struggle to achieve high completeness for poorly textured surfaces. We believe that solving this hard but important problem requires higher-level scene understanding. Our benchmark provides a variety of scenes that can be used to evaluate such approaches. In general, we observe that there is significant room for improvement on our datasets.

Tab. 2a compares the relative performance of the different methods on Strecha [40] and our new benchmark, ranked based on [35] and using the F1 score, respectively, with a 2 cm evaluation threshold used in both cases. Evidently, good performance on the two datasets is not necessarily correlated. We thus conclude that our benchmark contains challenges different from the Strecha dataset.

Low-Resolution Multi-View Scenario. For this scenario, we evaluate the same methods as for the previous one. For Gipuma, we downsampled the videos to one fifth of the frame rate, since it ran out of memory while using all images.

(a) MVS, rank (score)

Method    Strecha    Ours
PMVS      3 (68.9)   3 (41.2)
Gipuma    4 (48.8)   4 (33.2)
COLMAP    2 (75.9)   1 (64.7)
CMPMVS    1 (78.2)   2 (48.9)

(b) Stereo (% bad pixels), rank (score)

Method            Middlebury  KITTI    Ours
SPS-Stereo [44]   5 (29.3)    2 (5.3)  1 (3.4)
MeshStereo [46]   2 (14.9)    4 (8.4)  3 (7.1)
SGM+D. [13, 42]   4 (29.2)    3 (6.3)  2 (5.5)
MC-CNN [45]       1 (10.1)    1 (3.9)  4 (8.9)
ELAS [7]          3 (25.7)    5 (9.7)  5 (10.5)

Table 2. Relative rankings on different benchmarks, demonstrating the difference between ours and existing datasets.

Method   Indoor              Outdoor             Mobile              DSLR
CMPMVS   67.2 / 47.3 / 55.5  44.2 / 40.0 / 42.0  14.4 /  7.4 /  9.8  71.6 / 57.6 / 63.8
COLMAP   90.2 / 51.1 / 65.2  80.9 / 53.1 / 64.1  69.5 / 41.2 / 51.8  91.7 / 56.2 / 69.7
Gipuma   74.9 / 24.0 / 36.3  52.8 / 20.8 / 29.9  31.1 / 13.4 / 18.7  76.5 / 25.9 / 38.7
PMVS     85.1 / 28.0 / 42.1  72.2 / 27.8 / 40.1  48.7 / 18.8 / 27.2  90.1 / 31.3 / 46.5

Table 3. Category-based MVS evaluation showing accuracy / completeness / F1 score (in %) at a 2 cm evaluation threshold.

All regions:
Method            bad 0.5  bad 1  bad 2  bad 4  avgerr  rms    A50    A90    A95    A99
SPS-Stereo [44]   57.42    22.25   4.28   2.00    0.95   2.11   1.08   2.40   3.60   8.77
SGM+D. [13, 42]   58.56    24.02   7.48   4.51    1.34   3.23   2.47   3.61   8.54  12.95
MeshStereo [46]   29.91    13.98   7.57   4.67    0.90   2.21   1.37   2.44   3.65   9.73
MC-CNN [45]       33.79    14.74   9.26   8.46    8.52  17.14  49.81  30.01  30.49  52.19
ELAS [7]          42.26    22.77  11.54   5.05    1.12   3.11   4.87   5.26   4.49  11.31

Non-occluded regions:
Method            bad 0.5  bad 1  bad 2  bad 4  avgerr  rms    A50    A90    A95    A99
SPS-Stereo [44]   56.91    21.29   3.43   1.43    0.83   1.61   2.22   1.36   2.11   6.52
SGM+D. [13, 42]   57.79    22.43   5.48   2.65    1.03   2.43   1.14   7.05   6.46  10.69
MeshStereo [46]   28.99    13.23   7.09   4.38    0.87   2.18   1.61   2.55   4.46  10.65
MC-CNN [45]       32.51    13.85   8.92   8.59    8.48  17.03  49.01  27.63  28.79  45.79
ELAS [7]          41.20    21.56  10.50   4.56    1.06   3.03   4.54   3.98   5.24   9.68

Table 4. Results of two-view stereo methods on our dataset. We show the metric averages over all stereo pairs for all regions (upper part) and non-occluded regions (lower part). The tables are ordered by the bad 2 criterion.

The results are shown in Fig. 7. As can be seen, these datasets challenge all algorithms, resulting in lower accuracy scores compared to the high-quality datasets. PMVS and Gipuma produce very incomplete and noisy results while CMPMVS fails completely. This demonstrates that they do not properly exploit the high view redundancy in the videos. COLMAP achieves relatively better results, but there is still significant room for improvement in terms of absolute numbers. Furthermore, all methods take on the order of several minutes to an hour to compute the sequences, underlining the need for more efficient algorithms that can operate in real-time on a mobile device.

Dataset Diversity. Tab. 3 provides an analysis of the different MVS algorithms for different scenario categorizations. COLMAP performs best on average as well as best for most individual categories. We also observe that the performance of the algorithms can vary significantly between the different scenarios, indicating the need for benchmarks such as ours that cover a wide variety of scenes.

Two-View Scenario. For this scenario, we evaluated five methods, [7, 44–46] and a version of SGM stereo [13] on Daisy descriptors [42]. These include methods belonging to the state of the art on KITTI and Middlebury Stereo. We did not tune their parameters except for setting the maximum number of disparities.


[Figure 6 plots: accuracy and completeness (in %) as a function of the evaluation threshold (in m) for the scenes courtyard, delivery area, electro, facade, kicker, meadow, office, pipes, playground, relief, relief 2, terrace, and terrains.]

Figure 6. Evaluation of the high-resolution multi-view scenario (indoor and outdoor datasets). Results for CMPMVS, COLMAP, Gipuma, and PMVS are shown as a solid line for accuracy and as a dashed line for completeness.

[Figure 7 plots: accuracy and completeness (in %) as a function of the evaluation threshold (in m) for the low-resolution scenes delivery area, electro, forest, playground, and terrains.]

Figure 7. Multi-view evaluation results in the low-resolution scenario. The interpretation of the plots is the same as in Fig. 6.

Figure 8. Qualitative results: (a) Electro: laser scan, (b) Office: completeness, (c) Electro: COLMAP, (d) Electro: CMPMVS. See the text for details.

The evaluation results are presented in Table 4. Table 2b compares the relative rankings of the different approaches on KITTI, Middlebury, and the bad 2 non-occluded results of our benchmark. As can be seen, the rankings differ significantly between ours and previous datasets, indicating that our benchmark complements existing ones. In particular, our data requires algorithms to perform well over a wide variety of scenes. It thus encourages general solutions and prevents overfitting. The latter is especially important given the popularity of learning-based methods: As evident from Tabs. 4 and 2b, [45] performs below average on our benchmark while outperforming all other methods significantly on both Middlebury and KITTI.

6. Conclusion

In this paper, we proposed an accurate and robust registration procedure to align images and laser scans. Using this algorithm, we created a new and diverse dataset for the evaluation of two-view and multi-view stereo methods. Our benchmark differs from existing datasets in several key aspects: We cover a wide variety of scene types and thus require general solutions which prevent overfitting. In addition, we provide the first benchmark for hand-held (multi-view) stereo with consumer-grade cameras. Experimental results for state-of-the-art algorithms show that our dataset poses various challenges not yet covered by existing benchmarks. One of these challenges is efficient processing of large amounts of data, in the form of both high spatial and high temporal sampling. These challenges are far from being solved and there is significant room for improvement. As a service to the community, we provide the website http://www.eth3d.net for online evaluation and comparison of algorithms.

Acknowledgements. We thank Fatma Güney for running several stereo baselines on KITTI and Middlebury, Lukas Meier for the multi-camera rig mount, and the Geosensors and Engineering Geodesy group at ETH for providing the laser scanner. Thomas Schöps was supported by a Google PhD Fellowship. This project received funding from the European Union's Horizon 2020 research and innovation programme under grant No. 688007 (Trimbot2020). This project was partially funded by Google Tango.


References

[1] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
[2] Y. Chen and G. Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155, 1992.
[3] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In CVPR, 2010.
[4] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. PAMI, 32(8):1362–1376, 2010.
[5] S. Galliani, K. Lasinger, and K. Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In ICCV, 2015.
[6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[7] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In ACCV, 2010.
[8] A. Geiger, J. Ziegler, and C. Stiller. StereoScan: Dense 3D reconstruction in real-time. In IV, 2011.
[9] P. Gohl, D. Honegger, S. Omari, M. Achtelik, M. Pollefeys, and R. Siegwart. Omnidirectional visual obstacle detection using embedded FPGA. In IROS, 2015.
[10] C. Häne, T. Sattler, and M. Pollefeys. Obstacle detection for self-driving cars using only monocular cameras and wheel odometry. In IROS, 2015.
[11] C. Häne, C. Zach, J. Lim, A. Ranganathan, and M. Pollefeys. Stereo depth map fusion for robot navigation. In IROS, 2011.
[12] H. Hirschmüller and D. Scharstein. Evaluation of cost functions for stereo matching. In CVPR, 2007.
[13] H. Hirschmüller. Stereo processing by semiglobal matching and mutual information. PAMI, 30(2):328–341, 2008.
[14] D. Honegger, H. Oleynikova, and M. Pollefeys. Real-time and low latency embedded computer vision hardware based on a combination of FPGA and mobile CPU. In IROS, 2014.
[15] Q. Huang, H. Wang, and V. Koltun. Single-view reconstruction via joint analysis of image and shape collections. In SIGGRAPH, 2015.
[16] M. Jancosek and T. Pajdla. Multi-view reconstruction preserving weakly-supported surfaces. In CVPR, 2011.
[17] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014.
[18] M. M. Kazhdan and H. Hoppe. Screened Poisson surface reconstruction. SIGGRAPH, 32(3):29, 2013.
[19] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. SIGGRAPH, 36(4), 2017.
[20] K. Kolev, P. Tanskanen, P. Speciale, and M. Pollefeys. Turning mobile phones into 3D scanners. In CVPR, 2014.
[21] L. Ladicky, P. Sturgess, C. Russell, S. Sengupta, Y. Bastanlar, W. Clocksin, and P. Torr. Joint optimisation for object class segmentation and dense stereo reconstruction. In BMVC, 2010.
[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[23] N. Mayer, E. Ilg, P. Haeusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[24] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
[25] P. Merrell, P. Mordohai, J.-M. Frahm, and M. Pollefeys. Evaluation of large scale scene reconstruction. In ICCV, 2007.
[26] M. Bleyer, C. Rhemann, and C. Rother. PatchMatch stereo - stereo matching with slanted support windows. In BMVC, 2011.
[27] Y. Nakamura, T. Matsuura, K. Satoh, and Y. Ohta. Occlusion detectable stereo - occlusion patterns in camera matrix. In CVPR, 1996.
[28] P. Ondruska, P. Kohli, and S. Izadi. MobileFusion: Real-time volumetric surface reconstruction and dense tracking on mobile phones. In ISMAR, 2015.
[29] C. J. Pal, J. J. Weinman, L. C. Tran, and D. Scharstein. On learning conditional random fields for stereo. IJCV, 99(3):319–337, 2012.
[30] S. Pillai, S. Ramalingam, and J. J. Leonard. High-performance and tunable stereo reconstruction. In ICRA, 2016.
[31] R. B. Rusu, Z. C. Marton, N. Blodow, M. E. Dolha, and M. Beetz. Towards 3D point cloud based object maps for household environments. RAS, 56(11):927–941, 2008.
[32] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In GCPR, 2014.
[33] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47:7–42, 2002.
[34] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In CVPR, 2003.
[35] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016.
[36] T. Schöps, T. Sattler, C. Häne, and M. Pollefeys. 3D modeling on the go: Interactive 3D reconstruction of large-scale scenes on mobile devices. In 3DV, 2015.
[37] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016.
[38] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR, 2006.
[39] D. Sibbing, T. Sattler, B. Leibe, and L. Kobbelt. SIFT-realistic rendering. In 3DV, 2013.
[40] C. Strecha, W. von Hansen, L. J. V. Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR, 2008.
[41] P. Tanskanen, K. Kolev, L. Meier, F. Camposeco, O. Saurer, and M. Pollefeys. Live metric 3D reconstruction on mobile phones. In ICCV, 2013.


[42] E. Tola, V. Lepetit, and P. Fua. DAISY: An efficient dense descriptor applied to wide baseline stereo. PAMI, 32(5):815–830, May 2010.
[43] T. Vaudrey, C. Rabe, R. Klette, and J. Milburn. Differences between stereo and motion behaviour on synthetic and real-world stereo sequences. In IVCNZ, 2008.
[44] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In ECCV, 2014.
[45] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17:1–32, 2016.
[46] C. Zhang, Z. Li, Y. Cheng, R. Cai, H. Chao, and Y. Rui. MeshStereo: A global stereo model with mesh alignment regularization for view interpolation. In ICCV, 2015.
[47] Q. Zhou and V. Koltun. Color map optimization for 3D reconstruction with consumer depth cameras. In SIGGRAPH, 2014.