
Transcript of Blocks-World Cameras


Blocks-World Cameras

Jongho Lee, University of Wisconsin-Madison

[email protected]

Mohit Gupta, University of Wisconsin-Madison

[email protected]

Abstract

For several vision and robotics applications, 3D geometry of man-made environments such as indoor scenes can be represented with a small number of dominant planes. However, conventional 3D vision techniques typically first acquire dense 3D point clouds before estimating the compact piece-wise planar representations (e.g., by plane-fitting). This approach is costly, both in terms of acquisition and computational requirements, and potentially unreliable due to noisy point clouds. We propose Blocks-World Cameras, a class of imaging systems which directly recover dominant planes of piece-wise planar scenes (Blocks-World), without requiring point clouds. The Blocks-World Cameras are based on a structured-light system projecting a single pattern with a sparse set of cross-shaped features. We develop a novel geometric algorithm for recovering scene planes without explicit correspondence matching, thereby avoiding computationally intensive search or optimization routines. The proposed approach has low device and computational complexity, and requires capturing only one or two images. We demonstrate highly efficient and precise planar-scene sensing with simulations and real experiments, across various imaging conditions, including defocus blur, large lighting variations, ambient illumination, and scene clutter.

1. The 3D Revolution

We are in the midst of a 3D revolution. Robots enabled by 3D cameras are beginning to drive cars, explore space, and manage our factories. While some of these applications require high-resolution 3D scans of the surroundings, several tasks do not explicitly need dense 3D point clouds. Imagine a robot navigating an indoor space, or an augmented reality (AR) system finding surfaces in a living room for placing virtual objects. For such applications, particularly in devices with limited computational budgets, it is often desirable to create compact, memory- and compute-efficient 3D scene representations. For example, in piece-wise planar indoor scenes, a popular approach is to first capture 3D point clouds with a depth or an RGBD camera, and then estimate a piece-wise planar representation (Fig. 1).

Historically, point clouds have been the canonical representation for 3D scenes in the computer vision and robotics communities. This is not surprising because almost all depth imaging modalities capture 3D point clouds as the raw data. Indeed, there are several applications which do require dense 3D representations (e.g., CAD modeling, facial motion retargeting), for which point clouds are a good fit. However, point clouds also have limitations: First, dense point clouds are memory-, compute- and bandwidth-intensive. Second, acquisition of point clouds by depth cameras is prone to errors in non-ideal imaging conditions including defocus, multi-path [23, 46, 43] and multi-camera interference [10, 63, 39], and ambient illumination [24, 3]. Finally, extracting a piece-wise planar representation by fitting planes to a point cloud requires global reasoning, which may result in inaccurate plane segmentation, especially if the underlying point clouds are noisy to begin with (Fig. 1).

This raises a natural question: Why capture high-resolution but noisy 3D point clouds at a large acquisition cost, only to compress them later into planar representations at a large computational cost? If we are going to perform downstream reasoning in terms of planes, can we design imaging modalities that directly capture compact and accurate plane-centric geometric representations of the world?

We propose Blocks-World Cameras, a class of imaging systems which directly recover dominant plane parameters for Blocks-World [57] (piece-wise planar) scenes without creating 3D point clouds, enabling fast, low-cost and accurate reconstructions (Fig. 1). The Blocks-World Cameras are based on a structured-light system consisting of a projector which projects a single pattern on the scene, and a camera to capture the images. The pattern consists of a sparse set of cross-shaped features (each with two line segments) which get mapped to cross-shaped features in the camera image via homographies induced by scene planes. If correspondences between image and pattern features can be established, the plane parameters can be estimated simply by measuring the deformation (change of angles of the two segments) between these features [28].

For scenes with high geometric complexity (e.g., a large number of distinct dominant planes), the projected pattern must have a sufficiently high feature density, requiring multiple features on each epipolar line, leading to ambiguities. Resolving these ambiguities would require correspondence matching via computationally intensive global reasoning,

[Figure 1 graphic: top row shows applications requiring compact 3D representations (3D floor plans, navigation, AR & VR, scene geometry understanding); bottom row contrasts conventional 3D cameras (memory-heavy, error-prone point clouds; computationally costly, inaccurate plane detection) with Blocks-World Cameras (compute-efficient, accurate plane detection) on Blocks-World (piece-wise planar) scenes.]

Figure 1. Blocks-World Cameras. (top) Several applications require compact 3D representations of piece-wise planar scenes. (bottom) Even for such Blocks-World scenes, conventional approaches first recover dense 3D point clouds, followed by estimating planar scenes via plane-fitting. This process has large acquisition and computational costs, and is often error-prone. We propose Blocks-World Cameras for recovering dominant planes directly without creating 3D point clouds, enabling fast, low-cost and accurate Blocks-World reconstructions.

thus defeating the purpose of Blocks-World Cameras. Is it possible to perform reconstruction while maintaining both high feature density and low computational complexity?

Scene representation with plane parameter space: We develop a novel geometric method which enables plane estimation even with unknown correspondences. For a given image feature, the set of all candidate pattern feature correspondences votes for a set of plane hypotheses (in the 3D plane parameter space), called the plane parameter locus. Our key observation is that if the pattern features are spaced non-uniformly on the epipolar line, then the plane parameter loci for multiple image features lying on the same world plane will intersect at a unique location in the parameter space. The intersection point corresponds to the parameters of the world plane, and can be determined by simple peak finding, without determining correspondences.

Implications: Based on this observation, we design a pattern and a fast algorithm that simultaneously recover depths and normals of Blocks-World scenes. We demonstrate, via simulations and experiments, capture of clean and clutter-free 3D models for a wide range of challenging scenarios, including texture-rich and texture-poor scenes, strong defocus, and large lighting variations. The computational complexity of the proposed approach is low, and remains largely the same regardless of the geometric complexity of the scene, enabling real-time performance on high-resolution images. The method requires capturing only one or two images, and can be implemented with simple and low-cost single-pattern projectors with a static mask. Furthermore, the sparsity of the projected pattern makes it robust to interreflections, a challenging problem which is difficult to solve with dense patterns.

Scope: Blocks-World Cameras are specifically tailored to piece-wise planar scenes, for applications requiring compact 3D representations consisting of a small set of planes. They are not meant to be a general-purpose technique that can replace conventional approaches. Indeed, for scenarios requiring dense geometry information for complex scenes, existing 3D imaging approaches will achieve better performance. However, the proposed technique can facilitate fast and robust dominant plane extraction, with applications in robotic navigation [66, 56], indoor scene modeling and AR.

2. Related Work

Piece-wise planar scene constraint: There is a long tradition of piece-wise planar 3D scene reconstruction, starting from the Blocks-World [57] and Origami-World [37] works nearly five decades ago. Since then, piece-wise planarity has been widely used as a prior for accurate 3D modeling [55, 49, 15, 31, 18, 7, 69] and scene understanding [29, 54, 76, 20]. In multi-view stereo, the planar scene constraint has been used to overcome lack of texture, repetitive structures, and occlusions [18, 64, 44, 7, 69]. Planes are popular scene primitives in SLAM [59, 66, 12, 74, 38, 36] as well, having been used for efficient and accurate 3D registration between frames [56, 66]. The planar scene constraint

[Figure 2 graphic: (a) imaging system with camera and projector centers c_c and c_p, baseline b, focal length f, and plane Π parameterized by (D, θ, ϕ); (b) a pair of cross-shaped pattern and image features P = {u_p, v_p, p_p} and I = {u_c, v_c, p_c}.]

Figure 2. Imaging principle. (a) The Blocks-World Cameras are based on a structured-light system consisting of a projector to project a single pattern on the scene and a camera to capture the images. (b) The pattern consists of a sparse set of cross-shaped features, which get mapped to cross-shaped features in the image via homographies induced by scene planes.

has been used for detecting junctions of indoor scenes or wireframes of urban scenes to recover scene layouts from a single RGB image [54, 76]. The Manhattan-world constraint [13], which assumes the scenes to be made of axis-aligned planes, has been exploited to reconstruct indoor environments such as floor-plans and room layouts [11, 40].

Plane-fitting to point clouds: A piece-wise planar scene representation can be created from the dense, and often noisy, 3D point clouds captured by conventional depth cameras, by fitting planes. For example, the Hough transform [32] is a method for detecting parameterized objects such as lines and circles in images, and is easily extended to 3D planes [8, 33]. RANdom SAmple Consensus (RANSAC) [16] has also been widely used for plane detection due to its robustness to outliers [19, 7, 67]. Other approaches for plane-fitting include region growing [52, 30, 48], as well as energy-based multi-model fitting [35, 55, 69]. These approaches can be computationally intensive, especially for cluttered scenes, often requiring complex global reasoning. In contrast, Blocks-World Cameras infer the parameters of piece-wise planar scenes directly using lightweight computational algorithms, without capturing 3D point clouds.

Scene planarity in learning-based approaches: Recently, scene planarity has been used in learning-based approaches for recovering scene geometry from a single RGB image [42, 73, 41, 71]. While these learning-based approaches have started producing promising results, their generalization abilities are not well understood. Our work leverages geometric multi-view cues from a structured-light setup, and can be used in a complementary manner to improve the generalization abilities of learning-based approaches.

3. Mathematical Preliminaries

Two-view geometry of structured light: The Blocks-World Camera is based on a structured-light system, which typically consists of a projector and a camera [45], as shown in Fig. 2 (a). We assume a pinhole projection model for both the camera and the projector, and define the camera and projector coordinate systems (CCS and PCS) centered at c_c and c_p, the optical centers of the camera and the projector, respectively. c_c and c_p are separated by the projector-camera baseline b along the x axis. The world coordinate system (WCS) is assumed to be the same as the CCS centered at c_c, i.e., c_c = [0, 0, 0]^T and c_p = [b, 0, 0]^T in the WCS. Without loss of generality, both the camera and the projector are assumed to have the same focal length f. We further assume a rectified system such that the epipolar lines are along the rows of the camera image and projector pattern. These assumptions (same focal length, rectified setup) are made only for ease of exposition, and are relaxed in practice by calibrating the projector-camera setup and rectifying the captured images to this canonical configuration [45].

Plane parameterization: A 3D plane can be characterized by three parameters: Π = {D, θ, ϕ}, where D ∈ [0, ∞) is the shortest distance from c_c to Π, θ ∈ [0, π] is the polar angle between the plane normal and the −z axis, and ϕ ∈ [0, 2π) is the azimuthal angle from the x axis to the plane normal (clockwise), as shown in Fig. 2 (a). The plane normal is given by n = [sin θ cos ϕ, sin θ sin ϕ, −cos θ]^T.
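For concreteness, the sketch below (Python/NumPy; the function names are ours, not the paper's) converts the (D, θ, ϕ) parameterization into the unit normal and intersects a camera ray with a plane to obtain depth, the plane-ray intersection step used later in Section 5.1. It assumes the plane normal faces the camera, so points X on a visible plane satisfy n·X = −D.

```python
import numpy as np

def plane_normal(theta, phi):
    """Unit normal of plane Pi = {D, theta, phi}:
    n = [sin(theta)cos(phi), sin(theta)sin(phi), -cos(theta)]."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     -np.cos(theta)])

def ray_plane_depth(pixel, f, D, theta, phi):
    """Depth (z) at camera pixel (x, y) for plane (D, theta, phi).

    Assumption: the plane normal faces the camera, so a point X on the plane
    satisfies n . X = -D (D is the distance from the camera center c_c).
    The camera ray through the pixel is r(s) = s * [x, y, f]."""
    n = plane_normal(theta, phi)
    d = np.array([pixel[0], pixel[1], f], dtype=float)
    denom = n @ d
    if abs(denom) < 1e-12:
        return np.inf                      # ray (nearly) parallel to the plane
    s = -D / denom
    return s * f                           # z-coordinate of the intersection
```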

4. Single-Shot Blocks-World Camera

Structured-light (SL) systems can be broadly classified into two categories. Multi-shot methods such as line striping [62, 4, 14], binary Gray coding [34, 61] or sinusoid phase-shifting [65] require projecting multiple patterns on the scene. These techniques can achieve high depth precision, but are not suitable for dynamic scenes. In contrast, single-shot methods [75, 72, 58, 60] require projecting only a single pattern, enabling them to handle scene/camera motion. Furthermore, these methods can be implemented with low-cost single-pattern projectors using a static mask or a diffractive optical element, instead of a full projector that can dynamically change the projected patterns.

In this section, we present single-shot Blocks-World Cameras that can estimate both depths and surface normals of piece-wise planar scenes with a single projected pattern. These cameras have low complexity, both computational (low-cost algorithms) and hardware (single-shot capture).

4.1. What Pattern should be Projected?

The performance of a single-shot SL system is determined by the projected pattern. There are several single-shot SL patterns, such as 1D color De Bruijn codes [75, 72], multiple sets of 1D stripes for all-round 3D scanning [17], sparse 2D grids of lines [60, 53], 2D color-encoded grids [9, 58], grid patterns with spacings that follow a De Bruijn sequence [68], 2D pseudo-random binary codes [70], and 2D random dots (e.g., MS Kinect V1). While these patterns have been designed for explicitly recovering scene depths, our goal is different: directly estimate the plane parameters

[Figure 3 graphic: (a) 3D lines l_u and l_v formed by the pairs of corresponding line segments (u_p, u_c) and (v_p, v_c); (b) the plane Π defined by l_u and l_v.]

Figure 3. Plane estimation from a known feature correspondence. (a) Line segments u_p and u_c from image and pattern features create a pair of planes which meet at a 3D line l_u. Similarly, v_p and v_c create l_v. (b) l_u and l_v define a 3D plane which can be estimated from a known image-pattern feature correspondence.

without recovering dense depth maps. Next, we describe the design of a new pattern optimized for achieving this goal.

Pattern design principles: There are two key considerations when designing the pattern. First, for piece-wise planar scenes, a pair of corresponding patches in the projected pattern and the captured image is related via a homography (assuming the patches lie on a single plane). The homography contains sufficient information to uniquely recover the parameters of the 3D scene plane [27], and it preserves straight lines and their intersections. Second, a pattern with a sparse set of features (a small fraction of the projector pixels are on) enables robust and fast correspondence matching, potentially reduced source power with diffractive optical elements, and robustness to multi-path interference, a critical issue in SL imaging with dense patterns [22, 21]. On the other hand, sparse single-shot patterns have a trade-off: for general scenes, they can achieve only sparse 3D reconstructions. However, for piece-wise planar scenes with a relatively small set of dominant planes, scene geometry can be recovered even with sparse patterns.

Based on these two considerations, we design a pattern consisting of a sparse set of identical features distributed spatially. Each feature is cross-shaped, consisting of two intersecting line segments. For optimal performance, the segments make angles of 45° and 135° with the epipolar line (Fig. 2 (b)). See the supplementary report for a detailed discussion. For sufficiently small line segments, the image features in the camera image also have cross shapes (Fig. 2 (b)). These cross-shaped features facilitate robust localization and efficient plane parameter estimation with computationally light-weight algorithms, as discussed next.
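The following minimal sketch rasterizes such cross-shaped features into a binary pattern image; the feature size and column positions below are illustrative, not the layout used in the paper.

```python
import numpy as np

def stamp_cross(pattern, cx, cy, half_len=6):
    """Draw one cross-shaped feature: two short segments at 45 and 135 degrees
    to the epipolar (row) direction, intersecting at pixel (cx, cy)."""
    h, w = pattern.shape
    for t in range(-half_len, half_len + 1):
        for dx, dy in ((t, t), (t, -t)):   # the two diagonal segments
            x, y = cx + dx, cy + dy
            if 0 <= x < w and 0 <= y < h:
                pattern[y, x] = 1.0
    return pattern

# A toy pattern: a few features at hand-picked columns on one epipolar line.
pattern = np.zeros((480, 640))
for col in (60, 150, 290, 430, 600):
    stamp_cross(pattern, col, 240)
```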

4.2. Plane from a Known Correspondence

Consider a pattern feature P = {u_p, v_p, p_p}, where v_p and u_p are two line vectors and p_p is the intersection of v_p and u_p, as shown in Fig. 2 (b). Let the corresponding image feature I be described by I = {u_c, v_c, p_c}, where v_c and u_c are line vectors corresponding to v_p and u_p, and p_c is the intersection of v_c and u_c. We assume that P lies within a single scene plane, and is completely visible to the camera.

The elements in P and I are described in their own coordinate systems (PCS and CCS, respectively), i.e., for the pattern feature P = {u_p, v_p, p_p},

u_p = [u_px, u_y, 0]^T,  v_p = [v_px, v_y, 0]^T,  p_p = [p_px, p_y, f]^T.  (1)

For the corresponding image feature I = {u_c, v_c, p_c},

u_c = [u_cx, u_y, 0]^T,  v_c = [v_cx, v_y, 0]^T,  p_c = [p_cx, p_y, f]^T.  (2)

Then, if the correspondence is known, i.e., if pairs of corresponding P and I can be identified, the plane parameters can be recovered analytically by basic geometry, as illustrated in Fig. 3. Specifically, each cross-shaped feature correspondence provides two line correspondences {u_c, u_p} and {v_c, v_p}, which can be triangulated to estimate two 3D line vectors l_u and l_v, respectively. The plane Π can be estimated from the estimates of l_u and l_v. In particular, the surface normal n of Π is given as:

n = ((p_p × v_p) × (p_c × v_c)) × ((p_p × u_p) × (p_c × u_c)) / ‖((p_p × v_p) × (p_c × v_c)) × ((p_p × u_p) × (p_c × u_c))‖.  (3)

The shortest distance D from c_c to Π is:

D = b n^T p_p / (p_px − p_cx) − n^T c_p.  (4)

Given n and D, the depth at p_c can be computed. See the supplementary report for details and the measurable plane space.
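The sketch below transcribes Eqs. (3) and (4) into NumPy under the rectified setup of Section 3; the function name and data layout are ours, not the paper's.

```python
import numpy as np

def plane_from_correspondence(P, I, b):
    """Plane (n, D) from one matched cross-shaped feature, following Eqs. (3)-(4).

    P = (u_p, v_p, p_p) and I = (u_c, v_c, p_c) are 3-vectors in the form of
    Eqs. (1)-(2): line directions [ux, uy, 0] and feature centers [px, py, f],
    expressed in the PCS and CCS of the rectified setup. b is the
    projector-camera baseline (the focal length f is already embedded in the
    feature centers)."""
    u_p, v_p, p_p = (np.asarray(x, dtype=float) for x in P)
    u_c, v_c, p_c = (np.asarray(x, dtype=float) for x in I)
    c_p = np.array([b, 0.0, 0.0])

    # Eq. (3): intersect the two pairs of back-projected planes to get the
    # directions of the 3D lines l_v and l_u, then take their cross product.
    l_v = np.cross(np.cross(p_p, v_p), np.cross(p_c, v_c))
    l_u = np.cross(np.cross(p_p, u_p), np.cross(p_c, u_c))
    n = np.cross(l_v, l_u)
    n /= np.linalg.norm(n)

    # Eq. (4): shortest distance from the camera center c_c to the plane.
    D = b * (n @ p_p) / (p_p[0] - p_c[0]) - n @ c_p
    return n, D
```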

Avoiding degenerate solutions: If the line correspondences {u_c, u_p} or {v_c, v_p} are collinear with the epipolar lines, the solution is degenerate. To avoid this, the line segments of the features should not be aligned with the epipolar lines.

5. Plane from Unknown Correspondences

As described above, if the feature correspondences are known, the plane parameters can be estimated using Eqs. 3 and 4. One way to achieve this is to place a single feature on each epipolar line of the pattern. In this case, for each image feature, the correspondence can be computed trivially. However, this limits the maximum number of pattern features to the number of rows of the pattern. In order to maximize the likelihood of each scene plane being illuminated by a feature, we need a sufficiently large density of pattern features, which requires placing multiple pattern features on each epipolar line. While this approach increases the feature density, the pattern now consists of multiple identical features on each epipolar line, leading to ambiguities. Without additional information or complex global reasoning, it is challenging to find the correct feature correspondences. This presents a tradeoff: Is it possible to perform reconstruction while maintaining both high feature density and low computational complexity?


5.1. Geometric Approach to Correspondence-Free Plane Estimation

In order to address this tradeoff, we develop a novel, light-weight computational approach for estimating plane parameters without explicitly computing correspondences between image and pattern features. Let the set of pattern features on one epipolar line of the projected pattern be {P_1, . . . , P_N}. A subset of these features is mapped to the camera image, resulting in the set of image features {I_1, . . . , I_M} (M ≤ N) (upper row of Figs. 4 (a) and (b)).

Consider one image feature, say I_1. All N pattern features are candidate matching features. Each candidate pattern feature results in a plane hypothesis Π = {D, θ, ϕ} by triangulating with the image feature I_1. Accordingly, the set of all candidate pattern features {P_1, . . . , P_N} creates a set of plane hypotheses Λ_1 = {Π_11, . . . , Π_1N}, where Π_1n (n ∈ {1, . . . , N}) is the plane computed from I_1 and P_n. Each plane hypothesis can be represented as a point in the 3D plane parameter space (we call this the Π-space), as shown in the upper row of Fig. 4 (c). Therefore, the set of plane hypotheses Λ_1 = {Π_11, . . . , Π_1N} creates a plane parameter locus in the Π-space. Similarly, we can create another plane parameter locus Λ_2 = {Π_21, . . . , Π_2N} by pairing I_2 with {P_1, . . . , P_N}.

Observation 1. The key observation is that if I_1 and I_2 correspond to scene points on the same scene plane, then the two loci Λ_1 and Λ_2 must intersect. If they intersect at a unique location Π in the Π-space, then Π gives the true plane parameters.

Voting in the plane parameter space: This is a simple, yet powerful observation, which motivates a computationally light-weight voting-based approach for plane estimation that does not require correspondence estimation. For each detected image feature, we compute its plane parameter locus as described above. The locus is the set of candidate planes that the feature votes for. We then collect votes from all the detected image features; the Π-space with loci from all the image features can be considered a likelihood distribution over scene planes. Fig. 5 (b) shows an example Π-space. Finally, we estimate plane parameters of the dominant scene planes by identifying dominant local peaks in the Π-space. For a given local peak, all the image features that voted for the peak belong to the corresponding plane. For those image features, depth and surface normal values can be computed by plane-ray intersection (Fig. 5 (d)).
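A minimal sketch of this voting scheme is given below. It takes the plane-hypothesis routine as a parameter (e.g., the plane_from_correspondence() sketch above), accumulates votes in a discretized Π-space histogram, and leaves peak finding to the caller; the data layout and binning are illustrative, not the paper's implementation.

```python
import numpy as np

def vote_planes(image_feats, pattern_feats_by_row, plane_fn,
                D_edges, theta_edges, phi_edges):
    """Accumulate plane hypotheses in a discretized Pi-space (D, theta, phi).

    image_feats: list of (row, I) tuples, one per detected image feature.
    pattern_feats_by_row[row]: candidate pattern features P on that epipolar line.
    plane_fn(P, I) -> (n, D): plane hypothesis routine for one pairing."""
    H = np.zeros((len(D_edges) - 1, len(theta_edges) - 1, len(phi_edges) - 1))
    voters = {}                                   # bin index -> image features
    for row, I in image_feats:
        for P in pattern_feats_by_row[row]:
            n, D = plane_fn(P, I)
            theta = np.arccos(-n[2])              # polar angle from the -z axis
            phi = np.arctan2(n[1], n[0]) % (2 * np.pi)
            idx = (np.searchsorted(D_edges, D) - 1,
                   np.searchsorted(theta_edges, theta) - 1,
                   np.searchsorted(phi_edges, phi) - 1)
            if all(0 <= i < s for i, s in zip(idx, H.shape)):
                H[idx] += 1
                voters.setdefault(idx, []).append(I)
    # Dominant planes correspond to local maxima of H; the image features in
    # voters[peak] belong to (and segment) that plane.
    return H, voters
```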

This approach is reminiscent of conventional Hough transform-based plane estimation, with two key differences: First, in the conventional Hough transform, planes are estimated from 3D points (each 3D point votes for the candidate planes that pass through it), which requires a 3D point cloud to be computed first. In contrast, in our approach, 2D image features directly vote for candidate planes, thus avoiding the potentially expensive point cloud generation. Second, in

[Figure 4 graphic: (a) pattern features P_1 . . . P_N with uniform (upper row) and non-uniform (lower row) placement on an epipolar line; (b) image features I_1 . . . I_M; (c) the plane parameter locus created by I_1 in the (D, θ, ϕ) Π-space; (d) loci from I_1 and I_2: large overlaps for uniform placement, a single overlap at the true Π for non-uniform placement.]

Figure 4. Plane parameter space with uniformly and non-uniformly spaced pattern features. (a), (b) N features are placed with (upper row) uniform and (lower row) non-uniform spacing on an epipolar line of the pattern. M of these are imaged as image features. (c) A plane parameter locus is created in the Π-space by pairing an image feature I_1 with all the pattern features on the corresponding epipolar line. The locus lies on a plane parallel to the (D − θ) plane. (d, upper row) Loci corresponding to two different image features lying on the same scene plane have a large overlap with a uniform pattern feature distribution, making it impossible to determine the true scene plane containing the features. (d, lower row) However, for a pattern with a non-uniform feature distribution, it is possible to uniquely determine the true scene plane.

the conventional approach, each 3D point votes for a dense set of potential planes. Coupled with a large number of 3D points, this can result in large computational and memory costs [47]. On the other hand, in the proposed approach, we use a sparse set of features, and each feature votes for a small, discrete set of candidate planes (e.g., fewer than 10 in our experiments). This results in considerably lower computational costs, up to two orders of magnitude, especially in scenes with a small number of dominant planes.

5.2. Do Parameter Loci have Unique Intersections?

The voting-based algorithm described above relies on an important assumption: plane parameter loci for different image features corresponding to the same world plane intersect at a unique location. If, for example, the loci for all the features on a camera epipolar line overlap at several locations, we will not be able to identify unique plane parameters. This raises the following important questions: Does this assumption hold for general scenes? What is the effect, if any, of the pattern design (e.g., the spatial layout of the features)? In order to address these questions, we describe two key geometric properties of the plane parameter locus.

Property 1. The parameter locus Λ_m = {Π_m1, . . . , Π_mN}, created by pairing an image feature I_m and a set of pattern features {P_1, . . . , P_N} on the same epipolar line, always lies on a plane parallel to the ϕ = 0 plane in the Π-space.

Property 2. Let Λ_m = {Π_m1, . . . , Π_mN} be the parameter locus created in the same way as in Property 1. Let P_µ (µ ∈ {1, . . . , N}) be the true corresponding pattern feature of I_m. Let d_µn be the distance between pattern features P_µ and P_n on the epipolar line. Then, the locations of the elements of Λ_m are a function only of the set D_µ = {d_µn | n ∈ {1, . . . , N}} of relative distances between the true and candidate pattern features.

See the supplementary report for proofs. The first property implies that it is possible to recover the azimuth angle of the plane normal from a single parameter locus, without computing correspondences. An example is illustrated in the upper row of Figs. 4 (a-c). Since ϕ is constant across the locus, for the rest of the paper we visualize parameter loci in the 2D D − θ space, as shown in the upper row of Fig. 4 (d). Note that the full 3D Π-space is necessary when differentiating between planes with the same D and θ, but different ϕ.

The second, perhaps more important, property implies that if the pattern features are uniformly spaced on the epipolar line, the resulting loci will overlap significantly. This is because, for a uniformly spaced pattern, the sets of relative distances (as defined in Property 2) for two distinct pattern features share several common values. Since the elements of the parameter loci (of the corresponding image features) are determined solely by the set of relative distances, the loci will also share common locations. An example is shown in the upper row of Fig. 4 (d). This is not a degenerate case; for uniformly spaced patterns, regardless of the scene, the loci will always have large overlaps, making it impossible to find unique intersections. How can we ensure that different loci have unique intersections?
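The small numerical sketch below makes the consequence of Property 2 concrete with made-up feature positions: two hypothetical true features share many relative distances under uniform spacing, but only the unavoidable ones (zero and their mutual gap) under a non-uniform, Sidon-like spacing.

```python
import numpy as np
from collections import Counter

def relative_distances(positions, mu):
    """Multiset D_mu of distances from feature mu to every feature on the
    epipolar line (cf. Property 2); these determine the locus of its image."""
    return Counter(int(abs(positions[mu] - x)) for x in positions)

uniform = np.arange(0, 700, 100)                         # 7 evenly spaced features
sidon_like = np.array([0, 23, 69, 161, 276, 460, 690])   # all pairwise gaps distinct

for name, pos in (("uniform", uniform), ("non-uniform", sidon_like)):
    shared = relative_distances(pos, 1) & relative_distances(pos, 4)
    # Only the zero distance and the mutual gap |x_1 - x_4| are unavoidable.
    print(name, "-> shared relative distances:", sum(shared.values()))
# Prints 6 shared values for the uniform layout but only 2 for the non-uniform
# one, which is why uniform spacing produces heavily overlapping loci.
```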

Patterns with non-uniform feature distribution: The key idea is to design patterns with features that are non-uniformly spaced along epipolar lines. The lower row of Figs. 4 (a) and (b) shows an example, where N pattern features {P_1, . . . , P_N} are non-uniformly distributed on an epipolar line, and M of them are imaged as image features {I_1, . . . , I_M}. With non-uniform spacing (so that the sets of relative distances of Property 2 are distinct), the parameter loci do not overlap except at the true plane parameters, as shown in the lower row of Fig. 4 (d). This enables estimation of the plane parameters even with unknown correspondences.

In our experiments, we placed 7 pattern features non-uniformly on each epipolar line. To ensure robustness against errors in epipolar line estimation, we place features on every k-th epipolar line of the pattern. See the supplementary report for details and the resulting patterns.
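As a rough illustration of such a layout (the spacing rule, feature count per row, and row stride below are ours, not the paper's), one can jitter an even grid of columns on every k-th row:

```python
import numpy as np

def nonuniform_columns(width=1024, feats_per_row=7, margin=20, rng=None):
    """Feature columns for one epipolar line: an even grid with per-feature
    jitter, so spacings (and relative-distance sets) are unlikely to repeat."""
    rng = rng if rng is not None else np.random.default_rng(0)
    base = np.linspace(margin, width - 1 - margin, feats_per_row)
    step = base[1] - base[0]
    cols = base + rng.uniform(-0.35 * step, 0.35 * step, feats_per_row)
    return np.sort(cols).astype(int)

def pattern_layout(height=768, width=1024, k=16, seed=0):
    """Columns of the cross features for every k-th epipolar line (rows in
    between stay empty, for robustness to epipolar-line estimation errors)."""
    rng = np.random.default_rng(seed)
    return {row: nonuniform_columns(width, rng=rng)
            for row in range(k // 2, height, k)}

# The returned layout can be rendered with the stamp_cross() sketch above, and
# its ambiguity checked with the relative-distance sketch from Section 5.2.
```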

5.3. Image Feature Localization and Measurement

We localize cross-shaped image features by applying the Harris corner detector [26] to the captured image, after a thinning morphological operation. Although a single image is sufficient, for scenes with strong texture and lighting variations we capture two camera frames in rapid succession, with and without the projected pattern, and take their difference. For each candidate feature location, the two line segments of the image feature (u_c and v_c in Fig. 2) are extracted. For robustness against projector/camera defocus blur, we extract two edges (positive and negative gradients) from each (possibly blurred) line segment, and compute their average. The line-fitting routine is fast since it has a closed-form solution. The image feature I = {u_c, v_c, p_c} is then estimated from the two line segments and their intersection point p_c.
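A simplified sketch of the localization step is shown below, using OpenCV's Harris detector as a stand-in; the thinning operation, line-segment fitting, and positive/negative edge averaging described above are omitted, and the thresholds are illustrative.

```python
import cv2
import numpy as np

def detect_feature_centers(img_on, img_off, rel_thresh=0.05):
    """Candidate cross-feature centers: difference the pattern-on/off frames,
    then keep strong local maxima of a Harris corner response."""
    diff = cv2.absdiff(img_on, img_off).astype(np.float32)
    diff = cv2.GaussianBlur(diff, (5, 5), 0)
    harris = cv2.cornerHarris(diff, 5, 3, 0.04)   # blockSize=5, ksize=3, k=0.04
    # Non-maximum suppression via dilation, then a threshold relative to the peak.
    local_max = harris == cv2.dilate(harris, np.ones((11, 11), np.uint8))
    strong = harris > rel_thresh * harris.max()
    ys, xs = np.nonzero(local_max & strong)
    return list(zip(xs.tolist(), ys.tolist()))
```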

5.4. Toward Higher Memory Efficiency

Blocks-World Cameras are memory-efficient since they do not require capturing and processing dense 3D point clouds. However, the plane parameter Π-space can occupy a considerable amount of memory if very small bin sizes are used. We develop a memory-efficient version of the Blocks-World Camera algorithm which does not explicitly create a plane parameter voting array. The key observation is that since the Blocks-World Cameras provide a pool of plane candidates with different confidences (e.g., a larger number of plane candidates for dominant planes), it is possible to estimate scene planes by finding inliers via a RANSAC-like procedure, instead of voting in the Π-space. See the supplementary report for details of the algorithm and the results.
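The sketch below conveys this idea only (it is not the exact procedure from the supplementary report): treat the pooled (D, θ, ϕ) hypotheses as samples and greedily keep hypotheses supported by many near-identical samples, with illustrative tolerances.

```python
import numpy as np

def dominant_planes(hypotheses, n_iters=200, tol=(0.05, 0.05, 0.05),
                    min_votes=8, seed=0):
    """Pick dominant planes from a pool of (D, theta, phi) hypotheses without a
    voting array: repeatedly sample a hypothesis, count hypotheses inside a
    small tolerance box around it (its inliers), and greedily keep the
    best-supported ones. Wrap-around of phi at 2*pi is ignored for brevity."""
    rng = np.random.default_rng(seed)
    H = np.asarray(hypotheses, dtype=float)       # shape: (num_hypotheses, 3)
    tol = np.asarray(tol)
    remaining = np.ones(len(H), dtype=bool)
    planes = []
    for _ in range(n_iters):
        if remaining.sum() < min_votes:
            break
        i = rng.choice(np.nonzero(remaining)[0])
        inliers = remaining & np.all(np.abs(H - H[i]) < tol, axis=1)
        if inliers.sum() >= min_votes:
            planes.append(H[inliers].mean(axis=0))   # refine by averaging inliers
            remaining &= ~inliers
    return planes
```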

6. Experiments and Results

6.1. Validation by Simulations

We simulate the Blocks-World Camera imaging process with a ray-tracing tool [1], using 3D models from an indoor dataset [2]. This allows us to compare the Blocks-World Camera reconstructions with the ground truth, as well as with alternate approaches such as plane-fitting to point clouds.

Ground truth comparison: Fig. 5 (a) shows a pattern-projected scene with five dominant planes labeled Π1 to Π5. Plane parameters for these planes are estimated from the Π-space (Fig. 5 (b)). The image features that voted for each dominant plane are identified and segmented to form the plane boundary by their convex hull (Fig. 5 (c)). The proposed approach accurately recovers 3D scene geometry in terms of both depths and surface normals (Fig. 5 (d)).

Comparison with plane-fitting: For evaluating conventional plane-fitting approaches, we simulate a structured-light system that captures a 3D point cloud of the scene using sinusoid phase-shifting [65]. Fig. 6 (a) shows an example scene with six dominant planes. Fig. 6 (b) and the bottom center of Fig. 1 show the captured depth map and point cloud. We use the 3D Hough transform [8] and RANSAC, two approaches which have been widely used to extract planes from point clouds. We use the randomized version of the 3D Hough transform (RHT) [8] due to its computational efficiency. Figs. 6 (c), (d), and (e) show plane segmentation results by RHT, RANSAC, and Blocks-World Cameras, respectively. To ensure fair comparisons, for the plane-fitting approaches, we down-sample the point cloud such that the

[Figure 5 graphic: (a) pattern-projected scene with planes Π1-Π5; (b) 2D Π-space (D vs. θ) with vote counts; (c) plane segmentations; (d) recovered plane depths and normals; (e) ground-truth scene geometry (depth range roughly 1.1 m to 3.6 m).]

Figure 5. Ground truth comparison. (a) A 3D scene with a projected pattern. (b) 2D Π-space with votes; dominant planes are illustrated at the detected peak locations. (c) Plane boundaries formed by identifying the image features that voted for the peaks. (d) Recovered plane depths and normals. (e) Ground-truth depths and normals.

number of 3D points is the same as the number of image features captured by the Blocks-World Cameras.

For RHT (Fig. 6 (c)), it is challenging to extract small, distant or noisy planes because the votes for these planes are not reliably accumulated by random selection of points. Although RANSAC achieves better plane extraction, both RHT and RANSAC produce erroneous plane segmentations (e.g., the orange and blue points on the walls in Fig. 6 (c) and (d), respectively). This is a common issue with point cloud-based approaches since each 3D point carries no local plane information. In comparison, Blocks-World Cameras achieve accurate plane segmentation since each cross-shaped image feature contains partial information about the plane it belongs to, and does not need global reasoning. See the supplementary report for implementation details for RHT, RANSAC, and the Blocks-World Cameras.
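For reference, a minimal NumPy sketch of the kind of RANSAC plane-fitting baseline used in this comparison is shown below; it is not the optimized MSAC implementation used for the reported numbers, and the inlier threshold is illustrative.

```python
import numpy as np

def ransac_plane(points, n_iters=500, tol=0.01, seed=0):
    """Fit one dominant plane n.x + d = 0 to an (N, 3) point cloud with RANSAC.
    Extracting several planes (as in Fig. 6 (d)) repeats this after removing
    the inliers of each detected plane."""
    rng = np.random.default_rng(seed)
    best_inliers, best_plane = np.zeros(len(points), dtype=bool), None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                               # degenerate (collinear) sample
        n = n / norm
        d = -n @ p0
        inliers = np.abs(points @ n + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane, best_inliers
```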

Fig. 7 shows a quantitative comparison between the Blocks-World Cameras and the conventional plane-fitting approaches in terms of (a) the accuracy of the extracted plane parameters, and (b) run-time of MATLAB implementations. We used a well-optimized implementation of MSAC (M-estimator SAmple Consensus) for RANSAC plane-fitting. In the run-time comparison, we did not include the time to create the point clouds for the conventional approaches. RHT estimates the plane parameters accurately, but it fails to find all dominant planes and is slow. RANSAC is fast and finds all dominant planes robustly, but is less accurate in plane parameter estimation. The Blocks-World Cameras extract the plane parameters well in terms of both accuracy and run-time, even without creating a point cloud. See the supplementary report for additional discussion of the trade-off between run-time and plane estimation accuracy while varying the sampling rate of the 3D point clouds. Comparisons with other structured-light schemes as well as alternate 3D modalities are also discussed in the supplementary report.

6.2. Blocks-World Cameras in-the-Wild

We prototype a Blocks-World Camera using a structured-light system consisting of an Epson 3LCD projector and a digital SLR camera (Canon EOS 700D). The projector-camera baseline is 353 mm. The system is rectified such that the epipolar lines are aligned along the rows of the pattern and the captured image. Using this setup, we validate the performance of the Blocks-World Camera on various challenging scenes in the real world.

Scene with large defocus blur: The ability to handle defocus blur is critical for the Blocks-World Cameras when imaging scenes with large depth variations. Our image feature detection algorithm averages the detected line segments for both positive and negative edges as mentioned in Section 5.3, thereby achieving robustness to defocus blur. Fig. 8 (a) shows a scene consisting of planar objects at different distances from the camera. The camera and the projector are focused on the corner between two walls to create a large blur on the rightmost wall, in order to demonstrate the performance over a wide range of blurs (Fig. 8 (b)). The Blocks-World Cameras can reliably estimate the planes even with blurred features, up to a certain blur size (Fig. 8 (c, d)). For scenes with very large depth variation, the blur size can be reduced by reducing the aperture, using extended depth-of-field approaches, or using diffractive optical elements.

Performance under ambient light: Fig. 9 demonstrates the performance of the Blocks-World Cameras under different ambient lighting conditions. Since our approach is based on shape features instead of intensity features, it is robust to photometric variations (photometric calibration is not required), leading to stable plane estimation under different lighting. When ambient light completely overwhelms the projected pattern, the features may not be detected. This issue can be mitigated by narrow-band illumination, and by spatio-temporal illumination and image coding [25, 51, 50].

Scene with specular interreflections and strong textures: Fig. 10 (a) shows a scene with a metallic elevator door under strong, directional ambient light (upper), and a picture with complicated textures (lower). The Blocks-World Cameras use geometric features which encode the scene geometry through the deformation of the feature shape, and are thus robust to challenging illumination conditions, resulting in accurate geometry estimation (Fig. 10 (b, c)).

Non-planar scenes: Although Blocks-World Cameras are designed for piece-wise planar scenes, their performance degrades gracefully for non-planar scenes. Fig. 11 (a) shows a cylindrical object, and the piece-wise planar approximation extracted by the proposed approach.

[Figure 6 graphic: (a) scene with planes Π1-Π6; (b) simulated depth map (1.0 m to 4.2 m); (c) randomized 3D Hough; (d) RANSAC; (e) Blocks-World Cameras.]

Figure 6. Comparison with plane-fitting. (a) A 3D scene. (b) Depth map captured by a simulated structured-light system. (c, d, e) Plane segmentation results by randomized 3D Hough transform, RANSAC, and Blocks-World Cameras. The Blocks-World Cameras achieve more accurate plane segmentation than conventional approaches since each cross-shaped image feature contains local plane information.

[Figure 7 graphic: per-plane (Π1-Π6) bar charts of plane normal error (degrees) and D error (m), and run-time (s), for randomized 3D Hough, RANSAC, and Blocks-World Cameras.]

Figure 7. Quantitative performance comparison. (a) Plane parameter error comparison. (b) Run-time comparison. Blocks-World Cameras extract the plane parameters well in terms of both accuracy and run-time, even without creating a point cloud.

[Figure 8 graphic: (a) scene; (b) pattern-projected scene; (c) measured plane depths (1.0 m to 1.6 m); (d) measured plane normals.]

Figure 8. Robustness to defocus blur. (a, b) A scene with varying amounts of defocus blur. (c, d) Measured plane depths and normals. Our approach is robust to defocus blur.

[Figure 9 graphic: (a) pattern-projected scene with room light ON and OFF; (b) plane depths (1.0 m to 1.4 m); (c) plane normals.]

Figure 9. Robustness to ambient light. (a) A scene under different indoor lighting conditions. (b, c) Recovered plane depths and normals. Our shape features are robust to photometric variations.

Although only perfectly or nearly planar scene geometry is extracted with relatively small bin sizes of the Π-space (Fig. 11 (b)), non-planar portions of the scene are approximated with several planes when relatively larger bin sizes are used (Fig. 11 (c)).

[Figure 10 graphic: (a) scenes; (b) plane depths (0.8 m to 1.0 m, and 0.84 m to 0.90 m); (c) plane normals.]

Figure 10. Robustness to specular reflections and strong textures. (a) Scenes under challenging illumination conditions with specular reflections and strong textures. (b, c) Reconstructed plane depths and surface normals by the Blocks-World Camera.

[Figure 11 graphic: (a) cylinder scene; (b) plane estimation with small and large bin sizes.]

Figure 11. Approximating a non-planar scene with a piece-wise planar scene. (a) Cylinder scene. (b) Plane estimation with relatively small and large bin sizes of the Π-space, respectively.

7. Limitations and Future Work

Holes in reconstructions: Due to the sparse set of features in the pattern, the reconstructions have holes in regions where features are absent. An important next step is to develop sensor-fusion systems based on the proposed approach, leveraging learning-based methods [42, 41] (which produce potentially inaccurate, but dense reconstructions) to generate dense, high-accuracy, hole-free reconstructions.

Non-planar geometric primitives: The proposed approach is designed for reconstructing planar surfaces. A promising line of future work is to design patterns and reconstruction algorithms for non-planar geometric primitives such as spheres, generalized cylinders [6] and geons [5]. Such a generalized Blocks-World Camera would find applications in a considerably broader set of scenarios.

Acknowledgement. This research was supported in part by ONR grant N00014-16-1-2995, NSF CAREER Award 1943149, and the Wisconsin Alumni Research Foundation. We also thank Tomislav Pribanic for helpful feedback and discussions.


References

[1] http://www.povray.org, 2021 (accessed March 27, 2021). 6

[2] http://www.ignorancia.org/index.php/technical/lightsys/, 2021 (accessed March 27, 2021). 6

[3] Supreeth Achar, Joseph R Bartels, William L. 'Red' Whittaker, Kiriakos N Kutulakos, and Srinivasa G Narasimhan. Epipolar time-of-flight imaging. ACM Transactions on Graphics (ToG), 36(4):1–8, 2017. 1

[4] G. J. Agin and T. O. Binford. Computer description of curvedobjects. IEEE Trans. Comput., 25(4):439–449, 1976. 3

[5] Irving Biederman. Recognition-by-components: a the-ory of human image understanding. Psychological review,94(2):115, 1987. 8

[6] T Binford. Visual perception by computer. In IEEE Confer-ence of Systems and Control, 1971. 8

[7] Andras Bodis-Szomoru, Hayko Riemenschneider, and LucVan Gool. Fast, approximate piecewise-planar modelingbased on sparse structure-from-motion and superpixels. InProceedings of the IEEE Conference on Computer Visionand Pattern Recognition, pages 469–476, 2014. 2, 3

[8] Dorit Borrmann, Jan Elseberg, Kai Lingemann, and AndreasNuchter. The 3d hough transform for plane detection in pointclouds: A review and a new accumulator design. 3D Re-search, 2(2):3, 2011. 3, 6

[9] K.L. Boyer and AC. Kak. Color-encoded structured light forrapid active ranging. IEEE Transactions on Pattern Analysisand Machine Intelligence, 9(1):14–28, 1987. 3

[10] D Alex Butler, Shahram Izadi, Otmar Hilliges, DavidMolyneaux, Steve Hodges, and David Kim. Shake’n’sense:reducing interference for overlapping structured light depthcameras. In Proceedings of the SIGCHI Conference on Hu-man Factors in Computing Systems, pages 1933–1936, 2012.1

[11] Ricardo Cabral and Yasutaka Furukawa. Piecewise planarand compact floorplan reconstruction from images. In 2014IEEE Conference on Computer Vision and Pattern Recogni-tion, pages 628–635. IEEE, 2014. 3

[12] Alejo Concha and Javier Civera. Dpptam: Dense piecewiseplanar tracking and mapping from a monocular sequence.In 2015 IEEE/RSJ International Conference on IntelligentRobots and Systems (IROS), pages 5686–5693. IEEE, 2015.2

[13] James M Coughlan and Alan L Yuille. Manhattan world:Compass direction from a single image by bayesian infer-ence. In Proceedings of the Seventh IEEE InternationalConference on Computer Vision, volume 2, pages 941–947.IEEE, 1999. 3

[14] B. Curless and M. Levoy. Better optical triangulationthrough spacetime analysis. In Proceedings of IEEE Inter-national Conference on Computer Vision, 1995. 3

[15] Maksym Dzitsiuk, Jurgen Sturm, Robert Maier, Lingni Ma,and Daniel Cremers. De-noising, stabilizing and complet-ing 3d reconstructions on-the-go using plane priors. In 2017IEEE International Conference on Robotics and Automation(ICRA), pages 3976–3983. IEEE, 2017. 2

[16] Martin A Fischler and Robert C Bolles. Random sampleconsensus: a paradigm for model fitting with applications toimage analysis and automated cartography. Communicationsof the ACM, 24(6):381–395, 1981. 3

[17] R. Furukawa, R. Sagawa, H. Kawasaki, K. Sakashita, Y.Yagi, and N. Asada. One-shot entire shape acquisitionmethod using multiple projectors and cameras. In Pacific-Rim Symposium on Image and Video Technology (PSIVT),pages 107–114, 2010. 3

[18] Yasutaka Furukawa, Brian Curless, Steven M Seitz, andRichard Szeliski. Manhattan-world stereo. In 2009 IEEEConference on Computer Vision and Pattern Recognition,pages 1422–1429. IEEE, 2009. 2

[19] David Gallup, Jan-Michael Frahm, and Marc Pollefeys.Piecewise planar and non-planar stereo for urban scene re-construction. In 2010 IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition, pages 1418–1425. IEEE, 2010. 3

[20] Abhinav Gupta, Alexei A Efros, and Martial Hebert. Blocksworld revisited: Image understanding using qualitative ge-ometry and mechanics. In European Conference on Com-puter Vision, pages 482–496. Springer, 2010. 2

[21] Mohit Gupta, Amit Agrawal, Ashok Veeraraghavan, andSrinivasa G. Narasimhan. A practical approach to 3d scan-ning in the presence of interreflections, subsurface scatter-ing and defocus. International Journal of Computer Vision,102(1-3):33–55, 2013. 4

[22] Mohit Gupta and Shree K. Nayar. Micro phase shifting. InProc. IEEE CVPR, 2012. 4

[23] Mohit Gupta, Shree K. Nayar, Matthias Hullin, and JaimeMartin. Phasor Imaging: A Generalization of Correla-tion Based Time-of-Flight Imaging. ACM Transactions onGraphics, 2015. 1

[24] Mohit Gupta, Qi Yin, and Shree K Nayar. Structured light insunlight. In Proceedings of the IEEE International Confer-ence on Computer Vision, pages 545–552, 2013. 1

[25] Mohit Gupta, Qi Yin, and Shree K. Nayar. Structured Lightin Sunlight. In IEEE International Conference on Com-puter Vision (ICCV), pages 545–552, Sydney, Australia, Dec.2013. IEEE. 7

[26] Christopher G Harris, Mike Stephens, et al. A combined cor-ner and edge detector. In Alvey vision conference, volume 15,pages 10–5244. Citeseer, 1988. 6

[27] R. Hartley and R. Gupta. Computing matched-epipolar pro-jections. In IEEE CVPR, pages 549–555, 1993. 4

[28] R. I. Hartley and A. Zisserman. Multiple View Geometryin Computer Vision. Cambridge University Press, ISBN:0521540518, second edition, 2004. 1

[29] D. Hoiem, A. A. Efros, and M. Hebert. Automatic photo pop-up. ACM Trans. Graph., 24(3):577–584, 2005. 2

[30] Dirk Holz and Sven Behnke. Fast range image segmentationand smoothing using approximate surface reconstruction andregion growing. In Intelligent autonomous systems 12, pages61–73. Springer, 2013. 3

[31] Thomas Holzmann, Michael Maurer, Friedrich Fraundorfer, and Horst Bischof. Semantically aware urban 3d reconstruction with plane-based regularization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 468–483, 2018. 2

[32] Paul VC Hough. Method and means for recognizing complexpatterns, Dec. 18 1962. US Patent 3,069,654. 3

[33] Rostislav Hulik, Michal Spanel, Pavel Smrz, and ZdenekMaterna. Continuous plane detection in point-cloud databased on 3d hough transform. Journal of visual communi-cation and image representation, 25(1):86–97, 2014. 3

[34] S. Inokuchi, K. Sato, and F. Matsuda. Range imaging sys-tem for 3-d object recognition. In International ConferencePattern Recognition, pages 806–808, 1984. 3

[35] Hossam Isack and Yuri Boykov. Energy-based geometricmulti-model fitting. International journal of computer vi-sion, 97(2):123–147, 2012. 3

[36] M. Kaess. Simultaneous localization and mapping with in-finite planes. In Proceedings of IEEE International Confer-ence on Robotics and Automation (ICRA), page 4605–4611.IEEE, 2015. 2

[37] Takeo Kanade. A theory of origami world. Artificial Intelli-gence, 13:279–311, June 1980. 2

[38] Pyojin Kim, Brian Coltin, and H Jin Kim. Linear rgb-d slamfor planar environments. In Proceedings of the EuropeanConference on Computer Vision (ECCV), pages 333–348,2018. 2

[39] Jongho Lee and Mohit Gupta. Stochastic exposure codingfor handling multi-tof-camera interference. In Proceedingsof the IEEE International Conference on Computer Vision,pages 7880–7888, 2019. 1

[40] Cheng Lin, Changjian Li, and Wenping Wang. Floorplan-jigsaw: Jointly estimating scene layout and aligning partialscans. In Proceedings of the IEEE International Conferenceon Computer Vision, pages 5674–5683, 2019. 3

[41] Chen Liu, Kihwan Kim, Jinwei Gu, Yasutaka Furukawa, andJan Kautz. Planercnn: 3d plane detection and reconstructionfrom a single image. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 4450–4459, 2019. 3, 8

[42] Chen Liu, Jimei Yang, Duygu Ceylan, Ersin Yumer, and Ya-sutaka Furukawa. Planenet: Piece-wise planar reconstruc-tion from a single rgb image. In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition,pages 2579–2588, 2018. 3, 8

[43] Julio Marco, Quercus Hernandez, Adolfo Munoz, Yue Dong,Adrian Jarabo, Min H Kim, Xin Tong, and Diego Gutierrez.Deeptof: off-the-shelf real-time correction of multipath in-terference in time-of-flight imaging. ACM Transactions onGraphics (ToG), 36(6):1–12, 2017. 1

[44] Branislav Micusik and Jana Kosecka. Piecewise planar city3d modeling from street view panoramic sequences. In 2009IEEE Conference on Computer Vision and Pattern Recogni-tion, pages 2906–2912. IEEE, 2009. 2

[45] Daniel Moreno and Gabriel Taubin. Simple, accurate, androbust projector-camera calibration. In 2012 Second Inter-national Conference on 3D Imaging, Modeling, Processing,Visualization & Transmission, pages 464–471. IEEE, 2012.3

[46] Nikhil Naik, Achuta Kadambi, Christoph Rhemann,Shahram Izadi, Ramesh Raskar, and Sing Bing Kang. Alight transport model for mitigating multipath interferencein time-of-flight sensors. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition, pages73–81, 2015. 1

[47] Anh Nguyen and Bac Le. 3d point cloud segmentation: Asurvey. In 2013 6th IEEE conference on robotics, automationand mechatronics (RAM), pages 225–230. IEEE, 2013. 5

[48] Sven Oesau, Florent Lafarge, and Pierre Alliez. Planar shapedetection and regularization in tandem. In Computer Graph-ics Forum, volume 35, pages 203–215. Wiley Online Library,2016. 3

[49] Ali Osman Ulusoy, Michael J Black, and Andreas Geiger.Patches, planes and probabilities: A non-local prior for vol-umetric 3d reconstruction. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition, pages3280–3289, 2016. 2

[50] Matthew O’Toole, Supreeth Achar, Srinivasa G.Narasimhan, and Kiriakos N. Kutulakos. Homoge-neous codes for energy-efficient illumination and imaging.ACM Transactions on Graphics (TOG), 34(4):1–13, July2015. 7

[51] Matthew O’Toole, John Mather, and Kiriakos N Kutulakos.3D Shape and Indirect Appearance by Structured LightTransport. In IEEE Conference on Computer Vision and Pat-tern Recognition (CVPR), pages 3246–3253, 2014. 7

[52] Charalambos Poullis. A framework for automatic modelingfrom point cloud data. IEEE transactions on pattern analysisand machine intelligence, 35(11):2563–2575, 2013. 3

[53] M. Proesmans, L. J. Van Gool, and A J. Oosterlinck. Activeacquisition of 3d shape for moving objects. In Proceedingsof the International Conference on Image Processing, vol-ume 3, pages 647–650 vol.3, 1996. 3

[54] Srikumar Ramalingam, Jaishanker K Pillai, Arpit Jain, andYuichi Taguchi. Manhattan junction catalogue for spatialreasoning of indoor scenes. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition, pages3065–3072, 2013. 2, 3

[55] Carolina Raposo, Michel Antunes, and Joao P Barreto.Piecewise-planar stereoscan: structure and motion fromplane primitives. In European Conference on Computer Vi-sion, pages 48–63. Springer, 2014. 2, 3

[56] Carolina Raposo, Miguel Lourenco, Michel Antunes, andJoao Pedro Barreto. Plane-based odometry using an rgb-dcamera. In BMVC, 2013. 2

[57] Lawrence G Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute ofTechnology, 1963. 1, 2

[58] R. Sagawa, Yuichi Ota, Y. Yagi, R. Furukawa, N. Asada, andH. Kawasaki. Dense 3d reconstruction method using a singlepattern for fast moving object. In Proc. IEEE ICCV, pages1779–1786, 2009. 3

[59] Renato F Salas-Moreno, Ben Glocker, Paul HJ Kelly, and Andrew J Davison. Dense planar slam. In 2014 IEEE international symposium on mixed and augmented reality (ISMAR), pages 157–164. IEEE, 2014. 2


[60] J. Salvi, J. Batlle, and E. Mouaddib. A robust-coded pattern projection for dynamic 3d scene measurement. Pattern Recognition Letters, 19(11):1055–1065, 1998. 3

[61] K. Sato and S. Inokuchi. 3d surface measurement by space encoding range imaging. Journal of Robotic Systems, 2(1):27–39, 1985. 3

[62] Yoshiaki Shirai and Motoi Suwa. Recognition of polyhedrons with a range finder. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 80–87, 1971. 3

[63] Shikhar Shrestha, Felix Heide, Wolfgang Heidrich, and Gordon Wetzstein. Computational imaging with multi-camera time-of-flight systems. ACM Transactions on Graphics (ToG), 35(4):1–11, 2016. 1

[64] Sudipta Sinha, Drew Steedly, and Rick Szeliski. Piecewise planar stereo for image-based rendering. 2009. 2

[65] V Srinivasan, Hsin-Chu Liu, and Maurice Halioua. Automated phase-measuring profilometry: a phase mapping approach. Applied Optics, 24(2):185–188, 1985. 3, 6

[66] Yuichi Taguchi, Yong-Dian Jian, Srikumar Ramalingam, and Chen Feng. Point-plane slam for hand-held 3d sensors. In 2013 IEEE International Conference on Robotics and Automation, pages 5182–5189. IEEE, 2013. 2

[67] Alexander JB Trevor, John G Rogers, and Henrik I Christensen. Planar surface slam with 3d and 2d sensors. In 2012 IEEE International Conference on Robotics and Automation, pages 3041–3048. IEEE, 2012. 3

[68] A. O. Ulusoy, F. Calakli, and Gabriel Taubin. One-shot scanning using de bruijn spaced grids. In IEEE ICCV Workshops, pages 1786–1792, 2009. 3

[69] Cedric Verleysen and Christophe De Vleeschouwer. Piecewise-planar 3d approximation from wide-baseline stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3327–3336, 2016. 2, 3

[70] P. Vuylsteke and A Oosterlinck. Range image acquisition with a single binary-encoded light pattern. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):148–164, 1990. 3

[71] Peng Wang, Xiaohui Shen, Bryan Russell, Scott Cohen, Brian Price, and Alan L Yuille. Surge: Surface regularized geometry estimation from a single image. In Advances in Neural Information Processing Systems, pages 172–180, 2016. 3

[72] Shuntaro Yamazaki, Akira Nukada, and Masaaki Mochimaru. Hamming color code for dense and robust one-shot 3d scanning. In Proceedings of the British Machine Vision Conference, pages 96.1–96.9, 2011. 3

[73] Fengting Yang and Zihan Zhou. Recovering 3d planes from a single image via convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 85–100, 2018. 3

[74] Shichao Yang, Yu Song, Michael Kaess, and Sebastian Scherer. Pop-up slam: Semantic monocular plane slam for low-texture environments. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1222–1229. IEEE, 2016. 2

[75] Li Zhang, Brian Curless, and Steven M. Seitz. Rapid shape acquisition using color structured light and multi-pass dynamic programming. In IEEE International Symposium on 3D Data Processing, Visualization, and Transmission, pages 24–36, 2002. 3

[76] Yichao Zhou, Haozhi Qi, Yuexiang Zhai, Qi Sun, Zhili Chen, Li-Yi Wei, and Yi Ma. Learning to reconstruct 3d manhattan wireframes from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pages 7698–7707, 2019. 2, 3


Supplementary Technical Report for the Article: Blocks-World Cameras

Jongho Lee    Mohit Gupta
University of Wisconsin-Madison
{jongho, mohitg}@cs.wisc.edu

1. Overview

This document provides derivations, explanations, and more results supporting the content of the paper submission titled "Blocks-World Cameras".

2. Geometric Relationship between Pattern Feature and Image Feature

In this section, we algebraically derive the geometric relationship between a pattern feature and the corresponding image feature for given scene plane parameters. Based on this relationship, we can estimate the plane parameters given a pair of an image feature and a pattern feature. Toward that end, we first review the mathematical preliminaries from Section 3 of the main manuscript for completeness.

Two-view geometry of structured light: The Blocks-World Camera consists of a projector and a camera, as shown in Fig. 1 (a). We define the camera and projector coordinate systems (CCS and PCS) centered at c_c and c_p, the optical centers of the camera and the projector, respectively. c_c and c_p are separated by the projector-camera baseline b along the x axis. The world coordinate system (WCS) is assumed to be the same as the CCS centered at c_c, i.e., c_c = [0, 0, 0]^T and c_p = [b, 0, 0]^T in the WCS. Without loss of generality, both the camera and the projector are assumed to have the same focal length f, i.e., the image planes of both are located at a distance f from their optical centers along the z-axis. For simplicity, we further assume a rectified system such that the epipolar lines are along the rows of the camera image and the projector pattern.

Plane parameterization: A 3D scene plane is characterized by Π = {D, θ, ϕ}, where D ∈ [0, ∞) is the perpendicular (shortest) distance from the origin (c_c) to Π, θ ∈ [0, π] is the polar angle between the plane normal and the −z axis, and ϕ ∈ [0, 2π) is the azimuthal angle between the plane normal and the x axis (measured clockwise), as shown in Fig. 1 (a). The pattern consists of a sparse set of cross-shaped features, which get mapped to cross-shaped features in the camera image via homographies induced by scene planes, as shown in Fig. 1 (b).

Pattern feature and image feature: Consider a pattern feature P described by P = {u_p, v_p, p_p}, where v_p and u_p are two line vectors and p_p is the intersection point of v_p and u_p, as shown in Fig. 1 (b). Let the corresponding image feature I be described by I = {u_c, v_c, p_c}, where v_c and u_c are line vectors corresponding to v_p and u_p, respectively, and p_c is the intersection point of v_c and u_c. The elements of P and I are described in their own coordinate systems (PCS and CCS, respectively), i.e., for the pattern feature P = {u_p, v_p, p_p},

\[
\mathbf{u}_p = [u_{px},\, u_y,\, 0]^T, \quad \mathbf{v}_p = [v_{px},\, v_y,\, 0]^T, \quad \mathbf{p}_p = [p_{px},\, p_y,\, f]^T. \tag{1}
\]

Similarly, for the corresponding image feature I = {u_c, v_c, p_c},

\[
\mathbf{u}_c = [u_{cx},\, u_y,\, 0]^T, \quad \mathbf{v}_c = [v_{cx},\, v_y,\, 0]^T, \quad \mathbf{p}_c = [p_{cx},\, p_y,\, f]^T. \tag{2}
\]

Depth estimation: To derive the relationship between the pattern feature and the image feature in terms of the plane parameters, we first derive the scene depth at the intersection point p_c of the image feature in terms of the plane parameters. Let the ray passing through c_p and p_p intersect Π at p = [x, y, z]^T, and let p be imaged at p_c, as shown in Fig. 1 (c). Let the equation of the line passing through c_c and p_c be t p_c = t [p_cx, p_y, f]^T (−∞ < t < ∞), as shown in Fig. 1 (c). The depth z of p_c can be obtained by intersecting this line with the plane Π = {D, θ, ϕ} and taking the third element of t p_c. By substituting t p_c into the plane equation, we get n^T (t p_c) + D = 0 and t = −D / (n^T p_c). Thus,

\[
z = tf = -\frac{fD}{\mathbf{n}^T \mathbf{p}_c}. \tag{3}
\]

Page 13: Blocks-World Cameras

[Figure 1 graphic: (a) imaging system, (b) a pair of pattern and image features, (c) point correspondence, (d) line correspondence.]

Figure 1. Feature correspondences in the Blocks-World Cameras. (a) The Blocks-World Cameras are based on a structured-light system consisting of a projector which projects a single pattern on the scene and a camera which captures the images. (b) The pattern consists of a sparse set of cross-shaped features, which get mapped to cross-shaped features in the camera image via homographies induced by scene planes. The plane parameters can be estimated by measuring the deformation between these features. (c) To derive the relationship between the pattern feature and the image feature, a point correspondence can be established from a triangle defined by c_p, p, and c_c. (d) Similarly, a line vector correspondence can be defined from two pairs of point correspondences.

The depth z of p_c can also be represented in terms of the corresponding pattern intersection point p_p. By intersecting the line passing through c_p and p_p with the plane Π (Fig. 1 (c)):

\[
z = -f\,\frac{\mathbf{n}^T \mathbf{c}_p + D}{\mathbf{n}^T \mathbf{p}_p}. \tag{4}
\]
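For concreteness, the two depth expressions can be evaluated numerically. The following is a minimal Python/NumPy sketch, assuming the rectified geometry above; the helper names (plane_normal, depth_from_image_point, depth_from_pattern_point) are illustrative and not part of the released implementation.

import numpy as np

def plane_normal(theta, phi):
    # n = [sin(theta)cos(phi), sin(theta)sin(phi), -cos(theta)]^T, as used throughout this report.
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     -np.cos(theta)])

def depth_from_image_point(p_c, n, D, f):
    # Eq. (3): z = -f * D / (n^T p_c), with p_c = [p_cx, p_y, f]^T in the CCS.
    return -f * D / (n @ p_c)

def depth_from_pattern_point(p_p, n, D, f, b):
    # Eq. (4): z = -f * (n^T c_p + D) / (n^T p_p), with c_p = [b, 0, 0]^T.
    c_p = np.array([b, 0.0, 0.0])
    return -f * (n @ c_p + D) / (n @ p_p)

# Sanity check: a fronto-parallel plane (theta = 0) at distance D should give z = D, e.g.,
# depth_from_image_point(np.array([10.0, 5.0, 500.0]), plane_normal(0.0, 0.0), 2.0, 500.0) -> 2.0.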

In the following, we describe the image feature I = {u_c, v_c, p_c} in terms of the pattern feature P = {u_p, v_p, p_p} given the plane parameters Π = {D, θ, ϕ}.

Point correspondence: From the triangle defined by c_p, p, and c_c (Fig. 1 (c)), the imaged point p_c corresponding to p_p is:

\[
\mathbf{p}_c = \begin{bmatrix} 1 & 0 & \frac{b}{z} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \mathbf{p}_p, \tag{5}
\]

where z is the depth of p_c. By substituting Eq. 3 or Eq. 4 into Eq. 5, we get the point correspondence in terms of the plane parameters:

\[
\mathbf{p}_c = \begin{bmatrix} \frac{D}{D + b n_x} & -\frac{b n_y}{D + b n_x} & -\frac{b n_z}{D + b n_x} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \mathbf{p}_p, \tag{6}
\]

where n = [n_x, n_y, n_z]^T.

Line vector correspondence: Let p_p1 and p_p2 be two points defining u_p, as shown in Fig. 1 (d). Using Eq. 5, we can define two imaged points p_c1 and p_c2 corresponding to p_p1 and p_p2, respectively, as shown in Fig. 1 (d):

\[
\mathbf{p}_{c1} = \begin{bmatrix} 1 & 0 & \frac{b}{z_1} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \mathbf{p}_{p1}
\quad \text{and} \quad
\mathbf{p}_{c2} = \begin{bmatrix} 1 & 0 & \frac{b}{z_2} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \mathbf{p}_{p2}, \tag{7}
\]

where z_1 and z_2 are the depths of p_c1 and p_c2, respectively. By defining u_c = p_c2 − p_c1 and u_p = p_p2 − p_p1,

\[
\mathbf{u}_c = \mathbf{u}_p + \begin{bmatrix} fb\left(\frac{1}{z_2} - \frac{1}{z_1}\right) \\ 0 \\ 0 \end{bmatrix}. \tag{8}
\]

From Eq. 3, we get:

\[
\frac{1}{z_2} - \frac{1}{z_1} = -\frac{\mathbf{n}^T \mathbf{u}_c}{fD}. \tag{9}
\]


[Figure 2 graphic: (a) l_u formed by u_c and u_p, (b) l_v formed by v_c and v_p, (c) Π defined by l_u and l_v, (d) degenerate case.]

Figure 2. Plane estimation from a known feature correspondence. (a) Line segments u_p and u_c from the pattern and image features create a pair of planes which meet at a 3D line vector l_u. (b) Another pair of pattern and image feature segments create a pair of planes meeting at another 3D line. (c) The two 3D lines define a 3D plane. (d) Pattern features aligned with epipolar lines create a degenerate case, which is avoided by designing pattern features such that line segments do not lie along epipolar lines.

By substituting Eq. 9 into Eq. 8, we get:

\[
\mathbf{u}_c = \begin{bmatrix} \frac{D}{D + b n_x} & -\frac{b n_y}{D + b n_x} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \mathbf{u}_p. \tag{10}
\]

Similarly,

\[
\mathbf{v}_c = \begin{bmatrix} \frac{D}{D + b n_x} & -\frac{b n_y}{D + b n_x} & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \mathbf{v}_p. \tag{11}
\]

From these relationships, we can derive the equations for the plane parameters given the pattern feature and the corresponding image feature, as explained in the next section.
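The mappings in Eqs. 6, 10, and 11 can be summarized as a short forward model. The following Python/NumPy sketch is illustrative only; the function name map_pattern_feature is ours, and the code assumes a unit normal n, the rectified geometry above, and D + b·n_x ≠ 0 (i.e., a measurable plane).

import numpy as np

def map_pattern_feature(u_p, v_p, p_p, n, D, b):
    # Map a pattern feature {u_p, v_p, p_p} to the image feature {u_c, v_c, p_c}
    # induced by a plane with unit normal n and distance D.
    nx, ny, nz = n
    s = D + b * nx                     # assumed nonzero (plane is measurable)
    # Eq. (6): mapping of the intersection point p_p -> p_c.
    H_point = np.array([[D / s, -b * ny / s, -b * nz / s],
                        [0.0,    1.0,         0.0],
                        [0.0,    0.0,         1.0]])
    # Eqs. (10), (11): mapping of the line vectors; the last row and column are
    # zero because line vectors have no z-component.
    H_line = np.array([[D / s, -b * ny / s, 0.0],
                       [0.0,    1.0,        0.0],
                       [0.0,    0.0,        0.0]])
    return H_line @ u_p, H_line @ v_p, H_point @ p_p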

3. Plane Parameter Estimation from a Known Correspondence

In this section, we derive the expressions for the scene plane parameters when the correspondence between a pattern feature and its image feature is known. Consider the plane including the projector center c_p and the line vector u_p of the pattern feature, as shown in Fig. 2 (a). Similarly, consider another plane including the camera center c_c and the line vector u_c of the corresponding image feature. These two planes meet at a 3D line vector l_u, as shown in Fig. 2 (a), if u_p and u_c are not on the same epipolar line as shown in Fig. 2 (d). l_u can be computed by the cross product of the surface normals of these two planes. The surface normals of the plane including c_p and u_p and the plane including c_c and u_c are:

\[
\mathbf{n}_{up} = \frac{\mathbf{p}_p \times \mathbf{u}_p}{\|\mathbf{p}_p \times \mathbf{u}_p\|}
\quad \text{and} \quad
\mathbf{n}_{uc} = \frac{\mathbf{p}_c \times \mathbf{u}_c}{\|\mathbf{p}_c \times \mathbf{u}_c\|}, \tag{12}
\]

respectively, where × denotes the cross product and ‖ · ‖ denotes the vector norm. Thus, l_u can be obtained by:

\[
\mathbf{l}_u = \mathbf{n}_{up} \times \mathbf{n}_{uc} = \frac{(\mathbf{p}_p \times \mathbf{u}_p) \times (\mathbf{p}_c \times \mathbf{u}_c)}{\|(\mathbf{p}_p \times \mathbf{u}_p) \times (\mathbf{p}_c \times \mathbf{u}_c)\|}. \tag{13}
\]

Similarly, another pair of planes created by v_p and v_c meet at a 3D line vector l_v, as shown in Fig. 2 (b), and l_v can be obtained by:

\[
\mathbf{l}_v = \frac{(\mathbf{p}_p \times \mathbf{v}_p) \times (\mathbf{p}_c \times \mathbf{v}_c)}{\|(\mathbf{p}_p \times \mathbf{v}_p) \times (\mathbf{p}_c \times \mathbf{v}_c)\|}. \tag{14}
\]

The surface normal of the plane can be obtained by the cross product of l_v and l_u, as shown in Fig. 2 (c):

\[
\mathbf{n} = \mathbf{l}_v \times \mathbf{l}_u = \underbrace{\frac{\left((\mathbf{p}_p \times \mathbf{v}_p) \times (\mathbf{p}_c \times \mathbf{v}_c)\right) \times \left((\mathbf{p}_p \times \mathbf{u}_p) \times (\mathbf{p}_c \times \mathbf{u}_c)\right)}{\left\|\left((\mathbf{p}_p \times \mathbf{v}_p) \times (\mathbf{p}_c \times \mathbf{v}_c)\right) \times \left((\mathbf{p}_p \times \mathbf{u}_p) \times (\mathbf{p}_c \times \mathbf{u}_c)\right)\right\|}}_{\text{Eq. 3 of the main manuscript}}. \tag{15}
\]


The polar angle θ and the azimuthal angle ϕ of the plane normal can be obtained from n = [sin θ cos ϕ, sin θ sin ϕ, − cos θ]^T.

From Eq. 4 and Eq. 5, the shortest distance D from the origin (c_c) to Π is:

\[
D = \underbrace{\frac{b\,\mathbf{n}^T \mathbf{p}_p}{p_{px} - p_{cx}} - \mathbf{n}^T \mathbf{c}_p}_{\text{Eq. 4 of the main manuscript}}. \tag{16}
\]
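For illustration, the computation in Eqs. 13-16 can be written as a short Python/NumPy sketch. The function name and the sign-handling convention (flipping n and D so that D ≥ 0) are our assumptions, not part of the released implementation.

import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def plane_from_correspondence(u_p, v_p, p_p, u_c, v_c, p_c, b):
    # Eqs. (13), (14): 3D line directions shared by the projector-side and
    # camera-side planes through each line segment.
    l_u = normalize(np.cross(np.cross(p_p, u_p), np.cross(p_c, u_c)))
    l_v = normalize(np.cross(np.cross(p_p, v_p), np.cross(p_c, v_c)))
    # Eq. (15): plane surface normal.
    n = normalize(np.cross(l_v, l_u))
    # Eq. (16): perpendicular distance from the camera center, c_p = [b, 0, 0]^T.
    c_p = np.array([b, 0.0, 0.0])
    D = b * (n @ p_p) / (p_p[0] - p_c[0]) - n @ c_p
    if D < 0:                      # assumption: enforce D >= 0 by flipping (n, D)
        n, D = -n, -D
    theta = np.arccos(-n[2])       # polar angle w.r.t. the -z axis
    phi = np.arctan2(n[1], n[0]) % (2 * np.pi)  # azimuth from n = [sin(t)cos(p), sin(t)sin(p), -cos(t)]
    return D, theta, phi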

4. Plane Parameter Locus

In this section, we derive the algebraic equations describing the plane parameter locus explained in Section 5 of the main manuscript. Let I = {u_c, v_c, p_c} be the image feature corresponding to a pattern feature P = {u_p, v_p, p_p}. By pairing I and P, we can obtain the true plane parameter set Π = [D, θ, ϕ]^T. For a given I, consider a set of pattern features P′ = {u_p, v_p, p′_p} on the same epipolar line, where p′_p = [p′_px, p_y, f]^T = [p_px + ω, p_y, f]^T (ω ∈ ℝ). By pairing I and P′, we can derive the equations of the plane parameter locus from Eq. 15 and Eq. 16.

Let n′ be the surface normal of the plane created by pairing I and P′. Using Eq. 15,

\[
\mathbf{n}' = \left(\left(\mathbf{p}'_p \times \mathbf{v}_p\right) \times (\mathbf{p}_c \times \mathbf{v}_c)\right) \times \left(\left(\mathbf{p}'_p \times \mathbf{u}_p\right) \times (\mathbf{p}_c \times \mathbf{u}_c)\right). \tag{17}
\]

We drop the normalization factor without loss of generality. By applying p′_p = [p_px + ω, p_y, f]^T, Eq. 1, and Eq. 2 to Eq. 17, we get:

\[
\mathbf{n}' = \begin{bmatrix}
f\,(-u_{cx} v_y + u_{px} v_y + u_y v_{cx} - u_y v_{px}) \\
f\,(-u_{px} v_{cx} + u_{cx} v_{px}) \\
p_{cx} u_y v_{px} - p_y u_{cx} v_{px} - p'_{px} u_y v_{cx} - p_{cx} u_{px} v_y + p_y u_{px} v_{cx} + p'_{px} u_{cx} v_y
\end{bmatrix}. \tag{18}
\]

Note that we dropped the common factors in the x-, y-, and z-components of n′ without loss of generality. By applying Eq. 6, Eq. 10, and Eq. 11 to Eq. 18 and rearranging terms, we get:

\[
\mathbf{n}' = \begin{bmatrix} n'_x \\ n'_y \\ n'_z \end{bmatrix}
= \frac{1}{a} \begin{bmatrix} n_x \\ n_y \\ n_z + \frac{\omega D}{bf} \end{bmatrix}, \tag{19}
\]

where \( a = \sqrt{n_x^2 + n_y^2 + \left(n_z + \frac{\omega D}{bf}\right)^2} \) is the normalization factor.

Let D′ be the perpendicular distance from the camera origin to the plane created by pairing I and P′. Using Eq. 16,

\[
D' = \frac{b\,\mathbf{n}'^T \mathbf{p}'_p}{p'_{px} - p_{cx}} - \mathbf{n}'^T \mathbf{c}_p
= \frac{b\,\mathbf{n}'^T \mathbf{p}_c}{p'_{px} - p_{cx}}
= \frac{D\,b\,\mathbf{n}'^T \mathbf{p}_c}{b\,\mathbf{n}^T \mathbf{p}_c + \omega D}
= \frac{D}{a}. \tag{20}
\]

Thus, the plane parameter set Π′ of the plane created by pairing I and P′ is:

\[
\Pi' = \begin{bmatrix} D' \\ \theta' \\ \phi' \end{bmatrix}
= \begin{bmatrix} D' \\ \tan^{-1}\!\left(\dfrac{\sqrt{n_x'^2 + n_y'^2}}{-n'_z}\right) \\ \tan^{-1}\!\left(\dfrac{n'_y}{n'_x}\right) \end{bmatrix}
= \begin{bmatrix} \dfrac{D}{a} \\ \tan^{-1}\!\left(\dfrac{\sqrt{n_x^2 + n_y^2}}{-\left(n_z + \frac{\omega D}{bf}\right)}\right) \\ \phi \end{bmatrix}. \tag{21}
\]

Property 1. The parameter locus Λ_m = {Π_m1, . . . , Π_mN} created by pairing an image feature I_m and a set of pattern features {P_1, . . . , P_N} on the same epipolar line always lies on a plane parallel to the ϕ = 0 plane in the Π-space.

Proof: From Eq. 21, the azimuthal angle ϕ′ of Π′ is always the constant ϕ.

Property 2. Let Λ_m = {Π_m1, . . . , Π_mN} be the parameter locus created in the same way as in Property 1. Let P_µ (µ ∈ {1, . . . , N}) be the true corresponding pattern feature of I_m. Let d_µn be the distance between pattern features P_µ and P_n on the epipolar line. Then, the locations of the elements of Λ_m are a function only of the set D_µ = {d_µn | n ∈ {1, . . . , N}} of relative distances between the true and candidate pattern features.

Proof: From Eq. 21, the locations of the elements of Λ_m are a function only of ω = p′_px − p_px (not a function of p_p or p_c), which is the relative distance between the true and candidate pattern features.
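The locus of Eqs. 19-21 is easy to trace numerically. The following Python/NumPy sketch (an illustration under the assumptions above; the function name is ours) computes the locus point Π′ for a candidate pattern feature shifted by ω along the epipolar line, and makes Property 1 explicit by returning the unchanged azimuth ϕ.

import numpy as np

def locus_point(D, theta, phi, omega, b, f):
    # True plane normal n from (theta, phi).
    n = np.array([np.sin(theta) * np.cos(phi),
                  np.sin(theta) * np.sin(phi),
                  -np.cos(theta)])
    nz_shift = n[2] + omega * D / (b * f)
    a = np.sqrt(n[0] ** 2 + n[1] ** 2 + nz_shift ** 2)          # Eq. (19)
    D_prime = D / a                                              # Eq. (20)
    theta_prime = np.arctan2(np.hypot(n[0], n[1]), -nz_shift)    # Eq. (21)
    return D_prime, theta_prime, phi                             # phi is unchanged (Property 1)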


[Figure 3 graphic: polar plots of the measurable plane normal space (θ, ϕ) for D = 0.1 m, 0.2 m, 0.3 m, and D ≥ b = 0.4 m.]

Figure 3. Measurable plane space. The measurable plane normal space increases with D, and if D ≥ b (baseline between the camera center and the projector center, 0.4 m assumed here), planes with any surface normal can be measured theoretically.

[Figure 4 graphic: image feature angle variation Δφ_c plotted against D (0-10 m, b fixed) and against b (0.2-1.0 m, D fixed), for pattern feature angles φ_p = 30°, 45°, 60°.]

Figure 4. Image feature angle variation over D (perpendicular distance to plane) and b (baseline). The image feature angle variation decreases with D and increases with b. Therefore, in practice it becomes difficult to measure precise plane parameters if D is very large or b is very small.

5. Measurable Plane Space

One of the important practical questions regarding the performance of the Blocks-World Cameras is "what is the plane parameter space measurable with the Blocks-World Cameras?" The measurable plane space is determined by (a) fundamental limitations of two-view imaging systems, and (b) the measurement accuracy of image features.

Scene planes passing between the projector's and camera's optical centers, or any segments on these planes, are not measurable. This is because, for such planes, the projected pattern cannot be observed by the camera. These non-measurable planes can be described as:

\[
n_x = 0,\; D = 0 \quad \text{(planes including the $x$-axis)} \tag{22}
\]

and

\[
n_x \neq 0,\; 0 \leq -\frac{D}{n_x} \leq b \quad \text{(planes with the $x$-intercept between $0$ and $b$)}. \tag{23}
\]

The measurable plane space is the complement of the set of planes described by Eq. 22 and Eq. 23. Fig. 3 shows examples of the measurable plane space when the baseline b = 0.4 m. The measurable plane normals (represented by θ and ϕ) at different D values are shown in blue in the polar plots, where the radial direction and the clockwise direction represent the θ and ϕ directions, respectively. The measurable plane normal space increases with D. If D ≥ b, planes with any surface normal can be measured theoretically.
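The measurability test of Eqs. 22 and 23 reduces to a few lines of code. The sketch below is illustrative (the function name is ours) and assumes the plane parameterization n = [sin θ cos ϕ, sin θ sin ϕ, − cos θ]^T used in this report.

import numpy as np

def is_measurable(D, theta, phi, b):
    nx = np.sin(theta) * np.cos(phi)
    if np.isclose(nx, 0.0):
        # Eq. (22): planes containing the x-axis (nx = 0 and D = 0) are not measurable.
        return not np.isclose(D, 0.0)
    # Eq. (23): planes whose x-intercept lies between the camera (x = 0) and the
    # projector (x = b) are not measurable.
    x_intercept = -D / nx
    return not (0.0 <= x_intercept <= b)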

In practice, however, it is challenging to estimate plane parameters precisely when D is very large or b is very small. Let the pattern feature angle (i.e., the angle between one of the line segments of the pattern feature and the epipolar line) be φ_p, and the corresponding image feature angle (i.e., the angle between the corresponding image feature's line segment and the epipolar line) be φ_c. For a given φ_p, φ_c is a function of the plane parameters. As shown in Fig. 4, when the plane normal changes, the corresponding image feature angle variation ∆φ_c gets smaller as D increases (b fixed at 0.4 m) or as b decreases (D fixed at 2 m). If ∆φ_c is too small, it is difficult to distinguish between different plane parameters.


[Figure 5 graphic: polar plots of the image feature angle φ_c as a function of plane normal direction for pattern feature angles φ_p = 0°, 45°, 90°, 135° (D = b = 0.4 m).]

Figure 5. Image feature angles according to plane normal direction for different pattern feature angles. The image feature angle φ_c does not change when φ_p = 0°, so plane parameters cannot be estimated from feature deformation. On the other hand, φ_c changes sensitively with the plane normal direction when φ_p = 90°, leading to more accurate plane parameter estimation. D = b = 0.4 m was assumed.

[Figure 6 graphic: maximum and minimum image feature angle φ_c (0°-180°) as a function of pattern feature angle φ_p (0°-180°), with the variations Δφ_c1 and Δφ_c2 marked.]

Figure 6. Bound of the image feature angle when the plane normal changes, over the pattern feature angle. We use the two pattern feature angles φ_p1 = 45° and φ_p2 = 135° since they give the maximum image feature angle variation without overlap.

6. Pattern Design

In this section, we discuss various parameters for pattern design and their conditions for optimal performance of the Blocks-World Cameras.

Angles of pattern feature (why are pattern feature angles of 45° and 135° used?): The pattern feature angle (i.e., the angle between a line segment of the pattern feature and the epipolar line) is an important parameter for pattern design since it influences the accuracy of plane parameter estimation. For a given pattern feature angle φ_p, the corresponding image feature angle φ_c (i.e., the angle between the line segment of the corresponding image feature and the epipolar line) varies as the plane parameters Π change. The range of image feature angles ∆φ_c (over all possible plane parameters) is determined by the pattern feature angle φ_p. For precise plane parameter estimation, ∆φ_c should be sufficiently large. Fig. 5 shows φ_c as a function of the scene plane normal direction (with D fixed at D = b = 0.4 m) for different φ_p values. If φ_p = 0°, φ_c is always 0° regardless of the plane normal direction (this corresponds to the degenerate case in Fig. 2 (d)), which makes it impossible to estimate the plane parameters. On the other hand, if φ_p = 90°, φ_c changes from 0° to 180°, which makes it much easier to estimate the plane parameters accurately. Then, what is the optimal φ_p for precise plane parameter estimation?

Fig. 6 shows the maximum and minimum φ_c values over φ_p when the plane normal changes. We achieve the maximum ∆φ_c of 180° when φ_p = 90°, and the minimum ∆φ_c of 0° when φ_p = 0° or φ_p = 180°. Therefore, φ_p = 90° should be selected for precise plane estimation. However, we need two φ_p values for plane estimation, and the ranges of the two φ_c values corresponding to the two φ_p values should not overlap, so that the two image line segments can be distinguished when a single pattern is used. For this purpose, we chose φ_p1 = 45° and φ_p2 = 135°, since ∆φ_c1 and ∆φ_c2 are maximized while achieving no overlap in the corresponding φ_c values, as shown in Fig. 6.
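The dependence of φ_c on φ_p and on the plane normal, underlying Figs. 5 and 6, can be reproduced numerically from Eq. 10/11. The following sketch is an illustration under the assumptions of this report (rectified geometry, unit normal n, epipolar lines along the x-axis); the function name is ours.

import numpy as np

def image_feature_angle(phi_p, n, D, b):
    # Unit pattern line vector rotated by phi_p from the epipolar line (x-axis).
    u_p = np.array([np.cos(phi_p), np.sin(phi_p), 0.0])
    # Map it to the image with the line-vector homography of Eq. (10)/(11).
    s = D + b * n[0]
    u_c = np.array([(D * u_p[0] - b * n[1] * u_p[1]) / s, u_p[1], 0.0])
    # Orientation of the imaged segment w.r.t. the epipolar line, in [0, pi).
    return np.arctan2(u_c[1], u_c[0]) % np.pi

For phi_p = 0 the imaged segment stays on the epipolar line (angle 0 for every plane), while for phi_p near 90° the returned angle varies strongly with the plane normal, consistent with Figs. 5 and 6.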


[Figure 7 graphic: the projected pattern used for simulations (left) and for experiments (right).]

Figure 7. Example patterns for simulations and experiments.

[Figure 8 graphic: (a) pattern with regular feature placement, (b) Π-space with (a), (c) pattern with random feature placement, (d) Π-space with (c); true peaks are marked.]

Figure 8. Pattern feature displacement for different epipolar lines and the corresponding Π-spaces. (a, b) Same pattern feature displacement for different epipolar lines and the resulting Π-space, respectively. (c, d) Random pattern feature displacement for different epipolar lines and the resulting Π-space, respectively. If the pattern feature displacement for different epipolar lines is different, false local peaks in the Π-space are spread over the locus, leading to more robust true peak estimation.

Other pattern parameters: In addition to the pattern feature angle, there are more parameters for pattern design: the radius of the line segments r, the number of pattern features on a single epipolar line n, the distance between adjacent epipolar lines with pattern features k, and the decrement or increment of the distance between adjacent pattern features on each epipolar line h. These parameters can be chosen appropriately according to the scene or imaging conditions. Fig. 7 shows example patterns for our simulations and experiments. We use r = 15, n = 7, k = 7, h = 5 (in pixels) for simulations and r = 18, n = 7, k = 10, h = 9 (in pixels) for experiments. The total number of pattern features is 1050 for simulations and 735 for experiments. The larger size and sparser distribution of pattern features for the experiments make the system robust to various imperfections. The resolution of the pattern is 1920 × 1080.

Pattern feature displacement for different epipolar lines: If pattern features are non-uniformly distributed on each epipolar line (more specifically, if all distances between pattern features, multi-hop as well as single-hop, are different), the true plane parameters can be estimated. Then what about the pattern feature displacement for different epipolar lines? Should it be the same or different across epipolar lines? As long as the non-uniform feature distribution on each epipolar line is satisfied, we can estimate the true plane parameters. However, if the pattern feature distribution is different for different epipolar lines, we can find the local peaks for the true plane parameters more robustly.


[Figure 9 graphic: results of (a) Blocks-World Cameras with voting in Π-space and (b) Blocks-World Cameras with a RANSAC-like procedure (no voting in Π-space).]

Figure 9. Blocks-World Cameras in two ways. (a) Blocks-World Cameras with voting in the Π-space. (b) Blocks-World Cameras with a RANSAC-like procedure, which are more memory-efficient because they do not require voting in the Π-space.

Fig. 8 shows the Π-spaces for the scene of Fig. 5 in the main manuscript when the pattern feature displacement for different epipolar lines is the same (upper row) and random (lower row). The true peaks representing the true plane parameters are not affected by the pattern feature displacement. However, the false peaks are spread over the locus when the pattern feature displacement for different epipolar lines is random, which enables more robust true peak finding, as shown in Fig. 8.

7. More Memory-Efficient Blocks-World Cameras

In this section, we discuss more memory-efficient Blocks-World Cameras which do not require a plane parameter space (Π-space) for voting. Because the Blocks-World Cameras provide a pool of plane candidates with different confidences (e.g., a larger number of plane candidates for real dominant scene planes), the parameters of dominant planes can be estimated by finding inliers via a RANSAC-like procedure, instead of voting in the Π-space. The algorithm is as follows.

Algorithm 1: More Memory-Efficient Blocks-World Cameras

Input: n_tol: error tolerance for the plane normal; D_tol: error tolerance for D; T: number of iterations;
       I_m (m ∈ {1, . . . , M}): M image features;
       P_n (n ∈ {1, . . . , N}): N pattern features on the same epipolar line;
       Π_mn: all plane candidates created by pairing I_m and P_n
Output: Π_q (q ∈ {1, . . . , Q}): Q dominant scene planes

for q = 1 to Q do
    for t = 1 to T do
        Randomly choose Π from {Π_mn};
        Inliers ← Π_mn within n_tol and D_tol of Π;
    end
    Π_q ← model with the largest number of inliers (the inliers can be averaged for a more accurate model);
    Remove the current inliers from {Π_mn};
    Find the I_m which participated in creating the current inliers;
    Remove the plane candidates created by these I_m from {Π_mn};
end
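A compact Python sketch of Algorithm 1 is given below, assuming each candidate is stored as a pair (Π, m), where Π = (D, θ, ϕ) and m is the index of the image feature that generated it. The data layout and the per-parameter tolerance test (applying n_tol separately to θ and ϕ, without wrap-around handling for ϕ) are our simplifications of the normal tolerance in Algorithm 1, not the released implementation.

import random
import numpy as np

def extract_dominant_planes(candidates, n_tol, D_tol, T, Q):
    # candidates: list of (Pi, m), Pi = np.array([D, theta, phi]), m = image-feature index.
    pool = list(range(len(candidates)))
    planes = []
    for _ in range(Q):
        if not pool:
            break
        best = []
        for _ in range(T):
            Pi, _ = candidates[random.choice(pool)]
            # Inliers: candidates within D_tol of D and within n_tol of (theta, phi).
            inliers = [i for i in pool
                       if abs(candidates[i][0][0] - Pi[0]) < D_tol
                       and abs(candidates[i][0][1] - Pi[1]) < n_tol
                       and abs(candidates[i][0][2] - Pi[2]) < n_tol]
            if len(inliers) > len(best):
                best = inliers
        # Average the inliers for a more accurate model, as noted in Algorithm 1.
        planes.append(np.mean([candidates[i][0] for i in best], axis=0))
        # Remove the inliers and every other candidate created by the same image features.
        used_features = {candidates[i][1] for i in best}
        pool = [i for i in pool if candidates[i][1] not in used_features]
    return planes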

Fig. 9 (a) and (b) show the Blocks-World Camera results with voting in the Π-space and with the RANSAC-like procedure, respectively. All dominant planes are extracted well with the RANSAC-based Blocks-World Cameras as well.


[Figure 10 graphic: (a) scene with planes Π1-Π6, (b) simulated depth maps (phase-shifting, Gray coding, time-of-flight; 1.0-4.2 m), (c) randomized 3D Hough, (d) RANSAC, (e) Blocks-World Cameras.]

Figure 10. Comparison with plane-fitting. (a) Scene. (b) Depth maps simulated by structured-light systems (sinusoid phase-shifting, binary Gray coding) and continuous-wave time-of-flight (C-ToF) imaging. (c, d, e) Plane segmentation results by the randomized 3D Hough transform (RHT), RANSAC, and the Blocks-World Cameras.

n_tol = 4°, D_tol = 0.1, and T = 100 are used to obtain these results.

8. Comparisons with Conventional Plane-Fitting Approaches

We compare the plane estimation performance of the Blocks-World Cameras to conventional approaches that fit planes to 3D point clouds. For the conventional approaches, we use the 3D Hough transform and RANSAC, which are two popular plane-fitting methods. We use the randomized version of the 3D Hough transform (RHT) due to its computational efficiency. To create 3D point clouds for the conventional approaches, we simulate structured-light (SL) systems and continuous-wave time-of-flight (C-ToF) imaging systems. We use sinusoid phase-shifting [3] and binary Gray coding [1, 2] to test two different sets of patterns for the SL systems. To simulate the C-ToF systems, multiple variables are required. Assuming sinusoid coding for amplitude modulation, the average numbers of signal photons (at the minimum depth) and ambient photons are assumed to be 1 × 10^6 and 5 × 10^5, respectively. A modulation frequency of 10 MHz and an integration time of 10 ms are used. Fig. 10 (b) shows the depth maps created by phase-shifting, Gray coding, and C-ToF, respectively. In the case of the SL systems, Gray coding, where 24 patterns are used, shows a higher-quality depth map than phase-shifting, where 12 patterns are used. Although the performance of C-ToF depends on various parameter values, in general, C-ToF enables faster depth acquisition and a longer depth range, but lower depth resolution compared to the SL systems. After creating 3D point clouds from the depth maps, we down-sample the point clouds such that the number of 3D points is the same as the number of image features captured by the Blocks-World Cameras, to ensure a fair comparison especially in terms of run-time.

Several parameter values are required for the MATLAB implementations of the conventional plane-fitting approaches. For RANSAC, we set the maximum number of iterations to 10^3 and the maximum distance from an inlier 3D point to the plane to 0.1 m. For RHT, the bin sizes for the θ and φ ranges are both 3°, and the bin size for the D range is 0.04 m. These relatively large bin sizes are used to handle noise in the 3D point clouds. The number of iterations is 10^6 for RHT. All these values are determined empirically to generate the most reasonable results. Fig. 10 (c), (d), and (e) show the plane segmentation results by RHT, RANSAC, and the Blocks-World Cameras, respectively. For RHT (Fig. 10 (c)), it is challenging to extract small, distant, or noisy planes because the votes for these planes are not reliably accumulated by random selection of points. Although RANSAC (Fig. 10 (d)) achieves better plane extraction, both RHT and RANSAC produce erroneous plane segmentation results (e.g., erroneous points on the imaginary planes created when the round table is expanded). This is a common issue with point-cloud-based approaches since each 3D point does not carry local plane information. In comparison, the Blocks-World Cameras achieve accurate plane segmentation since each cross-shaped image feature contains partial information about the plane it belongs to, and does not need global reasoning.



Plane                | Π1            | Π2            | Π3            | Π4           | Π5            | Π6
RHT                  | 66, 270, 1.08 | 66, 270, 1.68 | 48, 153, 2.97 | NA           | NA            | NA
RANSAC               | 64, 273, 1.15 | 59, 270, 1.84 | 46, 154, 3.00 | 48, 22, 2.31 | 37, 144, 2.41 | 55, 18, 3.00
Blocks-World Cameras | 65, 270, 1.10 | 64, 269, 1.70 | 46, 155, 3.00 | 53, 20, 2.06 | 46, 152, 2.02 | 52, 18, 3.04
Ground truth         | 65, 270, 1.10 | 65, 270, 1.70 | 46, 153, 3.00 | 54, 20, 2.00 | 46, 153, 2.00 | 54, 20, 3.00

Table 1. Plane estimation results when the point cloud is created by a structured-light system with phase-shifting for the conventional approaches.

Plane                | Π1            | Π2            | Π3            | Π4           | Π5            | Π6
RHT                  | 66, 270, 1.08 | 66, 270, 1.68 | 51, 153, 2.96 | NA           | 45, 153, 2.00 | 54, 21, 3.00
RANSAC               | 65, 271, 1.10 | 65, 270, 1.69 | 46, 153, 2.99 | 54, 19, 2.02 | 44, 151, 2.12 | 54, 20, 2.99
Blocks-World Cameras | 65, 270, 1.10 | 64, 269, 1.70 | 46, 155, 3.00 | 53, 20, 2.06 | 46, 152, 2.02 | 52, 18, 3.04
Ground truth         | 65, 270, 1.10 | 65, 270, 1.70 | 46, 153, 3.00 | 54, 20, 2.00 | 46, 153, 2.00 | 54, 20, 3.00

Table 2. Plane estimation results when the point cloud is created by a structured-light system with Gray coding for the conventional approaches.

Plane                | Π1            | Π2            | Π3            | Π4           | Π5            | Π6
RHT                  | 63, 270, 1.12 | NA            | 48, 159, 3.00 | NA           | NA            | NA
RANSAC               | 67, 278, 1.13 | 63, 271, 1.76 | 47, 147, 3.00 | NA           | 45, 154, 2.03 | 53, 21, 3.05
Blocks-World Cameras | 65, 270, 1.10 | 64, 269, 1.70 | 46, 155, 3.00 | 53, 20, 2.06 | 46, 152, 2.02 | 52, 18, 3.04
Ground truth         | 65, 270, 1.10 | 65, 270, 1.70 | 46, 153, 3.00 | 54, 20, 2.00 | 46, 153, 2.00 | 54, 20, 3.00

Table 3. Plane estimation results when the point cloud is created by continuous-wave time-of-flight imaging for the conventional approaches.

Plane                | Π1            | Π2            | Π3            | Π4           | Π5            | Π6
RHT                  | 66, 270, 1.08 | 66, 270, 1.68 | 45, 159, 3.00 | NA           | NA            | NA
RANSAC               | 63, 270, 1.14 | 62, 269, 1.75 | 46, 151, 2.99 | 56, 19, 1.93 | 43, 152, 2.13 | 56, 21, 2.97
Blocks-World Cameras | 65, 270, 1.10 | 64, 269, 1.70 | 46, 155, 3.00 | 53, 20, 2.06 | 46, 152, 2.02 | 52, 18, 3.04
Ground truth         | 65, 270, 1.10 | 65, 270, 1.70 | 46, 153, 3.00 | 54, 20, 2.00 | 46, 153, 2.00 | 54, 20, 3.00

Table 4. Plane estimation results when the point cloud is down-sampled with a 0.5 sampling rate for the conventional approaches.

Plane estimation results by the conventional approaches and the Blocks-World Cameras are compared to the ground truth in Tables 1, 2, and 3. Phase-shifting, Gray coding, and C-ToF are used to create the 3D point clouds for the conventional approaches in Tables 1, 2, and 3, respectively. Each cell in the tables represents [θ(°), φ(°), D(m)]. Planes which cannot be segmented by the conventional approaches are marked NA. The estimation errors and the run-times are shown in Fig. 11 (a) and (b), respectively. For the SL systems, the conventional approaches show better plane parameter accuracy with Gray coding than with phase-shifting, since Gray coding uses more patterns, leading to more accurate correspondence matching. The Blocks-World Cameras show comparable performance to Gray coding even with a single pattern. For the C-ToF systems, the conventional approaches fail to find all dominant planes while the Blocks-World Cameras can. The conventional approaches are slower than the Blocks-World Cameras in run-time regardless of the imaging modality.

We also discuss the trade-off between run-time and plane estimation accuracy while varying the sampling rate of the 3D point clouds. Fig. 12 (a) and (b) show the plane segmentation results by RHT and RANSAC, respectively, after down-sampling the point clouds with different rates. The sampling rate of 0.02 ensures that the number of 3D points is the same as the number of image features of the Blocks-World Cameras. The sampling rate of 1.0 means no down-sampling of the 3D point clouds. The plane segmentation error of the conventional approaches is not reduced by increasing the sampling rate. Tables 4 and 5 summarize the ground-truth plane parameters and the plane estimation results by the Blocks-World Cameras and the conventional approaches with different sampling rates. The plane estimation errors and the run-time comparisons with different sampling rates are shown in Fig. 13 (a) and (b), respectively. When the sampling rate increases, RANSAC becomes more accurate in plane parameter estimation, but slower in run-time. The Blocks-World Cameras show better performance than the conventional approaches in both plane parameter error and run-time regardless of the sampling rate.


[Figure 11 graphic: bar plots of plane normal error (°), D error (m), and run-time (s) for planes Π1-Π6, comparing the randomized 3D Hough transform, RANSAC, and Blocks-World Cameras under phase-shifting (12 patterns), Gray coding (24 patterns), and time-of-flight.]

Figure 11. Quantitative performance comparison with different imaging modalities. The quantitative performance of the conventional approaches with different imaging modalities and of the Blocks-World Cameras is compared in terms of (a) plane parameter error and (b) run-time (point cloud acquisition time is not included for the conventional approaches). Structured-light systems (upper and middle rows) and a continuous-wave time-of-flight imaging system (lower row) are simulated. For the structured-light systems, sinusoid phase-shifting with 12 patterns (upper row) and binary Gray coding with 24 patterns (middle row) are used. The Blocks-World Cameras with a single pattern show performance comparable to Gray coding with 24 patterns in plane parameter error while achieving very low computational complexity.

Plane                | Π1            | Π2            | Π3            | Π4           | Π5            | Π6
RHT                  | 66, 270, 1.08 | 66, 270, 1.68 | 51, 153, 2.92 | NA           | NA            | NA
RANSAC               | 65, 274, 1.12 | 66, 270, 1.69 | 43, 149, 3.03 | 55, 19, 1.99 | 48, 150, 1.96 | 54, 17, 3.01
Blocks-World Cameras | 65, 270, 1.10 | 64, 269, 1.70 | 46, 155, 3.00 | 53, 20, 2.06 | 46, 152, 2.02 | 52, 18, 3.04
Ground truth         | 65, 270, 1.10 | 65, 270, 1.70 | 46, 153, 3.00 | 54, 20, 2.00 | 46, 153, 2.00 | 54, 20, 3.00

Table 5. Plane estimation results when the point cloud is down-sampled with a 1.0 sampling rate (no down-sampling) for the conventional approaches.

9. Mechanics of Plane Estimation in Plane Parameter Space

Bin sizes of the plane parameter space: The optimal bin sizes of the plane parameter space (Π-space) depend on various factors such as scene conditions and imaging conditions. Roughly speaking, relatively larger bin sizes are used for low-SNR conditions (e.g., noisy imaging conditions) and for non-planar scenes (e.g., Fig. 11 of the main manuscript). We use 1° for the θ and ϕ ranges and 0.02 m for the D range. To approximate the non-planar scene with a piece-wise planar scene in the third result of Fig. 11 of the main manuscript, we use 7° for the θ range, 10° for the ϕ range, and 0.05 m for the D range.

Finding local peaks for the true plane parameters: Multiple loci created by multiple image features on the same scene plane build several local peaks in the Π-space. Only one peak represents the true scene plane parameters, and the others are false peaks created by voting for possible candidates. Since a true local peak for a small scene plane can be lower than a false local peak for a huge scene plane, the true local peaks should be selected carefully. We describe how to find the local peaks for the true plane parameters with the Π-space of the scene in Fig. 5 of the main manuscript. 1) Find the maximum peak (e.g., the peak labeled Π3 in the first sub-figure of Fig. 14) in the Π-space. This peak represents the true plane parameters of the plane Π3. 2) Identify all image features which voted for this peak and remove all votes by these image features from the Π-space.


[Figure 12 graphic: plane segmentation results at sampling rates 0.02, 0.5, and 1.0 for (a) randomized 3D Hough, (b) RANSAC, and (c) Blocks-World Cameras.]

Figure 12. Plane segmentation comparisons with different sampling rates of the 3D point clouds. After the 3D point clouds are down-sampled with different sampling rates, planes are segmented by (a) the randomized 3D Hough transform and (b) RANSAC. The segmentation results are compared to the result of (c) the Blocks-World Cameras. The randomized 3D Hough transform fails to find all dominant planes even with the increased sampling rates. The plane segmentation error of the conventional approaches is not reduced by increasing the sampling rate.

Then all the false peaks created by these image features will disappear, as shown in the second sub-figure of Fig. 14. Repeat 1) and 2) to find all true local peaks (true plane parameters) (Fig. 14).
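The peak-finding loop above can be sketched in a few lines of Python, assuming the votes are accumulated in a 3D histogram over discretized (θ, ϕ, D) bins and that each vote remembers the image feature that cast it. The data layout and function name are illustrative assumptions, not the released implementation.

import numpy as np
from collections import defaultdict

def find_true_peaks(votes, shape, num_planes):
    # votes: list of (bin_index, feature_id), with bin_index = (i_theta, i_phi, i_D).
    H = np.zeros(shape)
    by_bin = defaultdict(set)        # bin -> image features that voted there
    by_feature = defaultdict(list)   # image feature -> bins it voted for
    for b, m in votes:
        H[b] += 1
        by_bin[b].add(m)
        by_feature[m].append(b)
    peaks, removed = [], set()
    for _ in range(num_planes):
        peak = tuple(int(i) for i in np.unravel_index(np.argmax(H), H.shape))
        if H[peak] <= 0:
            break
        peaks.append(peak)
        # Remove every vote cast by the image features that voted for this peak;
        # their false peaks elsewhere in the Pi-space disappear as well.
        for m in by_bin[peak] - removed:
            for b in by_feature[m]:
                H[b] -= 1
            removed.add(m)
    return peaks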

References

[1] S. Inokuchi, K. Sato, and F. Matsuda. Range imaging system for 3-d object recognition. In International Conference on Pattern Recognition, pages 806–808, 1984. 9

[2] K. Sato and S. Inokuchi. 3d surface measurement by space encoding range imaging. Journal of Robotic Systems, 2(1):27–39, 1985. 9

[3] V Srinivasan, Hsin-Chu Liu, and Maurice Halioua. Automated phase-measuring profilometry: a phase mapping approach. Applied Optics, 24(2):185–188, 1985. 9


[Figure 13 graphic: bar plots of plane normal error (°), D error (m), and run-time (s) for planes Π1-Π6, comparing the randomized 3D Hough transform, RANSAC, and Blocks-World Cameras at sampling rates 0.02, 0.5, and 1.0.]

Figure 13. Quantitative performance comparison with different sampling rates of the 3D point clouds. The quantitative performance of the conventional approaches with different sampling rates of the 3D point clouds and of the Blocks-World Cameras is compared in terms of (a) plane parameter error and (b) run-time (point cloud acquisition time is not included for the conventional approaches). When the sampling rate increases, RANSAC becomes more accurate in plane parameter estimation, but slower in run-time. The Blocks-World Cameras show better performance than the conventional approaches in both plane parameter error and run-time regardless of the sampling rate.

[Figure 14 graphic: a sequence of (θ, D) slices of the Π-space showing peaks Π1-Π5 being extracted and their votes removed one by one.]

Figure 14. Finding true plane parameters in the Π-space. Find the maximum peak and identify the image features that voted for the maximum peak. Remove all votes by these image features from the Π-space. Repeat this to find all true plane parameters.