
3-D Visual Reconstruction: A System Perspective

Guillermo Medina-Zegarra∗ and Edgar Lobaton†

Abstract.- This paper presents a comprehensive study of the challenges associated with the design of a platform for the reconstruction of 3-D objects from stereo images. The overview covers: design and calibration of a physical setup, rectification and disparity computation for stereo pairs, and construction and rendering of 3-D meshes for visualization. We outline some challenges and research in each of these areas, and present some results using basic algorithms and a system built with off-the-shelf components.

Index Terms.- Epipolar geometry, Rectification, Disparity calculation, Triangulation, 3-D reconstruction.

1 Introduction

To get the impression of depth in a painting, some artists of the Renaissance such as Filippo Brunelleschi, Piero della Francesca, Leonardo Da Vinci [1] and Albrecht Dürer researched the way the human eye sees the world. They bridged the space between the artist and the object. Figure 1 shows the perspective machine, which was used to determine the projection of 3-D points onto a plane known as the image plane. The projections were found by drawing lines between the 3-D points and the eye of the artist (the center of projection).

Figure 1: Perspective machine [2]

It is now well known that the same mathematical principles used to understand how humans see the world can be used to build 3-D models from two images. There are many applications for 3-D reconstruction such as: Photo Tourism [3], visualization in teleimmersive environments [4] [5] [6], video games (e.g., Microsoft Kinect), facial animation [7], etc.

The stereo computation paradigm [8] is broadly defined as the recovery of 3-D characteristics of a scene from multiple images taken from different points of view. The paradigm

∗Universidad Católica San Pablo, Arequipa, Peru (e-mail: [email protected]).
†Assistant Professor at North Carolina State University, North Carolina, USA (e-mail: [email protected]).

involves the following steps: image acquisition, camera modeling, feature extraction, image matching, depth determination and interpolation. There is a large body of literature in the area of computer vision which deals with this problem [9] [10] [11].

In this paper we describe the common steps to build a 3-D visual reconstruction from two images, and we demonstrate some results using a custom-built stereo testbed.

The rest of the paper is organized as follows: section 2 describes single view geometry; section 3 describes the epipolar geometry for two views; in section 4, the correspondence problem is presented; in section 5, the physical and algorithmic architecture used for our experiments is described; in section 6, some results using our platform are shown; finally, section 7 presents our conclusions.

2 Single View Geometry

Cameras are usually modeled using a pinhole imaging model [10]. As illustrated in Figure 2, images are formed by projecting the observations from 3-D objects onto the image plane.

The image of a point \(P = [X, Y, Z]^\top\) on a physical object is the point \(p = [x, y, 1]^\top\) corresponding to the intersection between the image plane and a ray from the camera focal point C to P. Given that the world coordinate system and the coordinate system associated with the camera coincide (as shown in the figure), we have

\[ x = f\,\frac{X}{Z} \quad \text{and} \quad y = f\,\frac{Y}{Z}, \tag{1} \]

where f is the focal length of the camera. However, in the case that the world coordinate system does not agree with the camera frame of reference and the point P is expressed in world

Figure 2: Pinhole camera model [10].


coordinate system as \(P_0 = [X_0, Y_0, Z_0]^\top\), then we have that

\[ P = R P_0 + t, \tag{2} \]

where R and t are the rotation and translation defining the rigid body transformation between the camera and the world coordinate frames. These are called the extrinsic parameters of the camera. Given this model, the coordinates of a point in the image domain satisfy:

\[
\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\Pi_0
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix}
\tag{3}
\]

where \(\lambda := Z\) and

\[
\Pi_0 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}.
\]

The previous equation is obtained under the assumption of an ideal camera model. In general, the coordinates of the image plane are not perfectly aligned and may not have the same scale (i.e., uneven spacing between sensors in a camera sensor array). This leads to the more general equation:

\[
\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = K\,\Pi_0
\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} X_0 \\ Y_0 \\ Z_0 \\ 1 \end{bmatrix}
\tag{4}
\]

where

\[
K = \begin{bmatrix} f s_x & f s_\theta & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{bmatrix},
\]

\([o_x, o_y]^\top\) is called the principal point, \(s_x\) and \(s_y\) specify the spacing between sensors in the x and y directions of the camera sensor array, and \(s_\theta\) is a skewing factor accounting for misalignment between sensors. The entries of K are known as the intrinsic parameters of the camera. All of these parameters can be found from measurements of image points corresponding to 3-D points with known coordinates.
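As a sketch, Equation 4 can be evaluated directly with a few lines of linear algebra. The intrinsic values and pose below are made-up numbers for illustration only:

```python
import numpy as np

def project_point(P0, K, R, t):
    """Project a 3-D world point P0 to pixel coordinates following
    Equation 4: lambda * p = K * Pi0 * [R t; 0 1] * [P0; 1]."""
    P_cam = R @ P0 + t        # rigid body transform of Equation 2
    p_h = K @ P_cam           # apply K; Pi0 simply drops the homogeneous 1
    return p_h[:2] / p_h[2]   # divide out lambda = Z

# Hypothetical intrinsics: f*sx = f*sy = 500 pixels, principal point (320, 240)
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R, t = np.eye(3), np.zeros(3)   # camera frame coincides with world frame

# A point on the optical axis projects to the principal point:
print(project_point(np.array([0.0, 0.0, 1.0]), K, R, t))  # [320. 240.]
```

Note that scaling the point along the ray through the focal point leaves the projection unchanged, which is exactly why depth is lost in a single view.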

In addition to the linear distortions that are modeled by K, cameras with a wide field of view can exhibit radial distortions, which are modeled using additional polynomial terms [10]. Cameras can be calibrated to obtain undistorted images, which we will assume is the case from now on.
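A common form of such a polynomial model scales each point by even powers of its distance from the center. The following is a minimal sketch, with hypothetical coefficients k1 and k2 (not values from our calibration):

```python
def radial_distort(x, y, k1=0.0, k2=0.0):
    """Polynomial radial distortion applied to normalized image
    coordinates: each point is scaled by (1 + k1*r^2 + k2*r^4)."""
    r2 = x ** 2 + y ** 2
    scale = 1.0 + k1 * r2 + k2 * r2 ** 2
    return x * scale, y * scale

# With zero coefficients the model reduces to the ideal pinhole image:
print(radial_distort(0.3, -0.2))             # (0.3, -0.2)
# A negative k1 pulls points toward the center (barrel distortion):
xd, yd = radial_distort(0.3, -0.2, k1=-0.2)
```

Calibration toolboxes estimate the coefficients together with K, and undistortion inverts this mapping numerically.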

3 Epipolar Geometry

Epipolar geometry [9] studies the geometric relationship between points in 3-D and their projections onto the image planes of the stereo cameras of a vision system.

In Figure 3, one can observe that the 3-D point P has perspective projections p1 and p2 [12] in the left image plane I1 and the right image plane I2, respectively. The points p1

Figure 3: Epipolar geometry for stereo views [10].

and p2, known as corresponding image points, can be found by intersecting the image planes with the lines from the 3-D point P to the optical centers C1 and C2. The so-called epipolar plane is defined by the point P and the optical centers C1 and C2.

Due to the geometry of the setup, it is possible to verify that the image points p1 and p2 must satisfy the so-called epipolar equation:

\[ p_2^\top F\, p_1 = 0 \tag{5} \]

where F is the fundamental matrix [9] [13], defined as

\[ F = K_2^{-\top} E\, K_1^{-1}. \tag{6} \]

The matrices K1 and K2 are the matrices of intrinsic parameters and E is the so-called essential matrix. The matrix E is given by

\[ E = [T_{21}]_\times R_{21}, \tag{7} \]

where \(R_{21}\) and \(T_{21}\) are the rotation matrix and translation vector between cameras 1 and 2, and \([T]_\times\) for \(T = [T_1, T_2, T_3]^\top\) is defined as

\[
[T]_\times = \begin{bmatrix} 0 & -T_3 & T_2 \\ T_3 & 0 & -T_1 \\ -T_2 & T_1 & 0 \end{bmatrix}. \tag{8}
\]

Equation 5 can be used to solve for the fundamental matrix F, given that enough corresponding image points are provided. One such algorithm is known as the eight-point algorithm [14]. Since the matrices K1 and K2 can be learned from individual calibration of the cameras, the essential matrix E can then be found. One can solve for the rotation and translation (up to scale) between the cameras by using E.
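These definitions admit a quick numerical sanity check: build E and F for a synthetic rig (the pose and intrinsics below are made-up values, not our testbed's calibration) and verify that a pair of projections satisfies Equation 5:

```python
import numpy as np

def skew(T):
    """[T]_x from Equation 8, i.e. the matrix with skew(T) @ v = T x v."""
    return np.array([[  0.0, -T[2],  T[1]],
                     [ T[2],   0.0, -T[0]],
                     [-T[1],  T[0],   0.0]])

# Hypothetical rig: camera 2 translated 10 cm along x, no rotation
R21, T21 = np.eye(3), np.array([0.1, 0.0, 0.0])
E = skew(T21) @ R21                                 # Equation 7
K1 = K2 = np.array([[500.0,   0.0, 320.0],
                    [  0.0, 500.0, 240.0],
                    [  0.0,   0.0,   1.0]])
F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)     # Equation 6

# Project a 3-D point (expressed in the frame of camera 1) into both images
P = np.array([0.3, -0.2, 2.0])
p1 = K1 @ P;               p1 = p1 / p1[2]
p2 = K2 @ (R21 @ P + T21); p2 = p2 / p2[2]
print(abs(p2 @ F @ p1))    # ~0: the epipolar constraint of Equation 5 holds
```

The residual is zero up to floating-point error for any 3-D point, which is what the eight-point algorithm exploits in reverse: from enough point pairs it solves for the entries of F.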

Figure 3 also illustrates one more fact that can be used to find corresponding points between images once the intrinsic and extrinsic parameters are known. Essentially, given a point p1, we know that the 3-D point P has to lie in the plane defined by T21 and the line from C1 to p1 (i.e., the epipolar plane). This plane intersects the image plane I1 in the line l1 and the image plane I2 in the line l2. That is, we know that


any point on the line l1 must correspond to some point on the line l2. This turns the identification of corresponding points into a 1-D search problem. This process is known as disparity computation [15] [11]. Finally, it is common to warp the original image planes such that the lines l1 and l2 are both horizontal and aligned. This process, known as rectification [16], is done to simplify computations during the search.

Finally, given that the intrinsic parameters, extrinsic parameters, and the corresponding points p1 and p2 are all known, it is possible to recover the 3-D position of the point P via a process called triangulation [17]. Essentially, P is found as the intersection between the line passing through p1 and C1, and the line passing through p2 and C2.
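A minimal sketch of this idea follows. With noisy correspondences the two back-projected rays rarely intersect exactly, so a common variant returns the midpoint of the shortest segment between them; the centers and point below are synthetic values:

```python
import numpy as np

def triangulate_midpoint(C1, d1, C2, d2):
    """Given camera centers C1, C2 and ray directions d1, d2, find the
    scalars s, u minimizing |(C1 + s*d1) - (C2 + u*d2)| and return the
    midpoint of the connecting segment."""
    A = np.stack([d1, -d2], axis=1)   # 3x2 system [d1 -d2] [s u]^T = C2 - C1
    (s, u), *_ = np.linalg.lstsq(A, C2 - C1, rcond=None)
    return 0.5 * ((C1 + s * d1) + (C2 + u * d2))

# Rays cast from two centers toward a known 3-D point intersect exactly there:
P_true = np.array([0.3, -0.2, 2.0])
C1, C2 = np.zeros(3), np.array([0.1, 0.0, 0.0])
P_est = triangulate_midpoint(C1, P_true - C1, C2, P_true - C2)
print(P_est)
```

In practice the ray directions come from the pixel coordinates via the inverse of K and the camera pose, and more accurate formulations are discussed in [17].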

4 The Correspondence Problem

In order to estimate the essential matrix E and to perform the disparity computations, we require the identification of corresponding points (i.e., points in the image planes that are the projection of the same physical point in 3-D). This is a very challenging and ambiguous problem [11] due to possibly large changes in perspective between cameras, the occurrence of repeated patterns in images, and the existence of occlusions.

4.1 Generic Correspondence Problem

In order to find corresponding image points for the estimation of the essential matrix E, it is possible to use techniques that identify unique points. The correspondence between points can then be determined by a metric between local descriptors. SIFT and GLOH [18] are two commonly used descriptors.

For the case of calibration [19] [20] in controlled environments, it is commonly assumed that the observed object has a regular pattern that is easy to identify (e.g., a checkerboard pattern). In these cases, it is possible to use more specialized region detection algorithms such as the Harris corner detector [11].
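As an illustration of the idea (not the toolbox's actual implementation), the Harris response can be computed from windowed sums of gradient products; the corner of a synthetic bright square scores higher than an edge or a flat region:

```python
import numpy as np

def box_sum(a, r):
    """Sum of array a over a (2r+1)x(2r+1) window via zero-padded shifts."""
    p = np.pad(a, r)
    H, W = a.shape
    out = np.zeros_like(a)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + H, dx:dx + W]
    return out

def harris_response(img, k=0.04, r=1):
    """Harris response R = det(M) - k*trace(M)^2, where M is the windowed
    structure tensor of the image gradients."""
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = box_sum(Ix * Ix, r)
    Syy = box_sum(Iy * Iy, r)
    Sxy = box_sum(Ix * Iy, r)
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

img = np.zeros((20, 20))
img[10:, 10:] = 1.0               # bright square; (10, 10) is its corner
R = harris_response(img)
print(R[10, 10] > 0 > R[10, 15])  # True: positive at the corner, negative on the edge
```

A detector would then keep local maxima of R above a threshold; checkerboard calibration targets make these maxima especially dense and regular.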

4.2 Disparity Computations

If the intrinsic and extrinsic calibration parameters between two cameras are known, it is possible to rectify the images such that corresponding points can be found along horizontal lines in the images. As mentioned before, finding correspondences then turns into a 1-D search problem known as disparity computation. In order to perform this search, it is possible to specify a metric that characterizes the difference between two regions (e.g., normalized cross correlation). Then, we look for the location along the horizontal line that minimizes the difference between these regions [11]. This is the simple methodology used in this paper to perform disparity computations. Of course, there are many more sophisticated algorithms that give more accurate results [15].
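A sketch of this window-based search, assuming rectified inputs, is given below; the synthetic pair is just a randomly textured image shifted by a known amount, so the recovered disparity can be checked:

```python
import numpy as np

def ncc(a, b):
    """Normalized cross correlation between two equal-size windows."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def disparity_at(left, right, y, x, half=2, max_d=6):
    """Disparity at pixel (y, x) of the left image: scan the same row of the
    right image up to max_d pixels leftward and keep the best NCC window."""
    ref = left[y - half:y + half + 1, x - half:x + half + 1]
    scores = [ncc(ref, right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1])
              for d in range(max_d + 1)]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
left = rng.random((16, 32))
right = np.roll(left, -3, axis=1)            # every point shifts 3 px leftward
print(disparity_at(left, right, y=8, x=12))  # 3
```

Repeating this search for every pixel yields the disparity map; the window half-size and the maximum disparity max_d are exactly the parameters whose choice is discussed in section 6.2.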

5 Testing Platform

In this section, we introduce the design of the simple stereo platform used to build 3-D models using the framework presented so far.

5.1 Physical Architecture

Figure 4(a) shows a frontal view of the two-camera configuration used for our experiments, and Figure 4(b) shows the back view. The cameras were mounted using screws to prevent any motion during calibration and image acquisition.

Figure 4: Frontal (a) and back (b) views of the stereo camera testbed.

The two digital cameras used were a Sony DSC-S750 and a Canon SD1200 IS. Both cameras were configured with the same settings.

5.2 Algorithmic Architecture

The Camera Calibration Toolbox [21] was used to calibrate the cameras. Figure 5 illustrates the steps of our implementation of the 3-D visual reconstruction. Image acquisition was done using the camera setup described before. We pre-process the two images with Gaussian filters to reduce noise, and Adobe Photoshop CS3 is used to remove the background. The metric used to calculate the disparity map is the normalized cross correlation [10]. As a post-processing step, we use a median filter to correct the problem of outliers. After that, we generate a Delaunay triangulation of the disparity image, which is then used to obtain a 3-D mesh by projecting key points from the image plane to 3-D as described in section 3. Finally, the color texture is mapped from the right image onto the 3-D mesh.
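The median post-processing step can be sketched in a few lines; a 3x3 neighborhood is shown here for brevity, and the toy disparity map below is not data from our experiments:

```python
import numpy as np

def median_filter3(img):
    """3x3 median filter with edge padding, used here to suppress
    isolated outliers ("speckles") in a disparity map."""
    p = np.pad(img, 1, mode='edge')
    H, W = img.shape
    shifts = [p[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)]
    return np.median(np.stack(shifts), axis=0)

disp = np.full((5, 5), 10.0)
disp[2, 2] = 90.0                    # a single bad disparity value
print(median_filter3(disp)[2, 2])    # 10.0
```

Unlike a Gaussian filter, the median leaves the surrounding correct disparities untouched while discarding the outlier entirely, which is why it is the standard choice for this kind of speckle noise.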

Figure 5: Outline of the reconstruction algorithm.


Figure 6: (a) Detection of corners of the checkerboard pattern, and (b) detection of internal points of the checkerboard pattern.

Figure 7: (a - b) Original images of the cube, (c - d) rectified images, and (e - f) images without background.

6 Results

In this section we present some 3-D reconstruction results using the methodology described in this paper. We also discuss limitations and drawbacks.

6.1 3-D Reconstruction Results

We used a checkerboard pattern of 10×7 squares for the calibration. Figure 6 shows two images of this process, in which corner points are automatically selected using the Camera Calibration Toolbox [21].

Most of the implementation is done in MATLAB. However, ParaView [22] is also used for the visualization of the point clouds, which are processed using the decimate and smooth filters from the application.

Figure 7 shows some images illustrating the steps needed for the preparation of a pair of images of a cube. Sub-figures (a) and (b) show the two original images of the object. Sub-figures (c) and (d) show the rectified images. The black regions indicate points in the rectified images that are outside the range of the original ones. Sub-figures (e) and (f) show the images with their background removed.

Figure 8 illustrates the 3-D reconstruction obtained from the cube images. Sub-figure (a) is the processed disparity map obtained from the pre-processed rectified images. Sub-figures (b) and (c) illustrate a point cloud and the resulting 3-D mesh, respectively. Sub-figures (d) and (e) show the surface of the cube without and with smoothing. Finally, sub-figure (f) corresponds to the 3-D mesh with the color texture from the rectified right image.

Figures 9 and 10 show similar results using a pair of images of a teddy bear instead of a cube.

6.2 Limitations and Observed Problems

One limitation was the lack of lighting control during the image acquisition process, which causes artifacts such as shadows and color variations between images, as observed in Figure 11 (a) and (b). These artifacts cause errors in the disparity calculation, as observed in Figure 11 (c), which translate into an inconsistent reconstruction of the object, as observed in Figure 11 (d).

Figure 8: (a) Disparity map for the cube images, (b) matching points, (c) 3-D mesh, (d) reconstructed surface, (e) smoothing of the surface, and (f) surface with texture.


Figure 9: (a - b) Original images of the teddy bear, (c - d) rectified images, and (e - f) images without background.

Figure 10: (a) Disparity map for the teddy bear images, (b) matching points, (c) 3-D mesh, (d) reconstructed surface, (e) smoothing of the surface, and (f) surface with texture.

Figure 11: (a - b) Original images with artifacts, (c) disparity map, and (d) 3-D reconstruction.

As described earlier, we used a normalized cross correlation approach to compute the disparity map. This process requires the specification of a neighborhood size, which is a challenging problem on its own [10]. The disparity process also requires some knowledge of the maximum offset between corresponding points [11], which may not be known in advance. Figure 12 illustrates a comparison between a failed and a good reconstruction using the approach described in this paper.

Figure 12: (a) Failed reconstruction and (b) good reconstruction of a cube.

Tests were performed on a Sony VAIO VGN-FE880E with an Intel(R) Core(TM) 2 processor at 1.66 GHz and 2 GB of RAM, running Windows 7 Ultimate 64-bit. The total processing time of our algorithm is about twenty-five minutes.

7 Conclusion

In this paper, we outline the steps needed to generate a 3-D reconstruction of an object from a stereo image pair. These 3-D models can be used in a variety of applications ranging from photo tourism [3] to virtual conferencing [6]. The method used for disparity computation uses a simple normalized cross correlation to find correspondences, which is not a very robust approach but is simple enough for demonstration purposes. There are more robust methods that use techniques such as dynamic programming and genetic algorithms [15].

One of the main challenges observed during the reconstruction process was the variability of lighting conditions, which can lead to artifacts in the reconstruction. There is also a large dependency between steps in the reconstruction. For example,


small errors in calibration [24] affect the rectification process, causing problems in the disparity computation and leading to erroneous 3-D reconstructions. For this reason, special care must be taken in the design of a stable physical architecture for image acquisition.

Acknowledgments

The authors would like to thank Alex Cuadros, Juan Carlos Gutierrez and Nestor Calvo.

References

[1] Leonardo Da Vinci. El Tratado de la Pintura. Edimat Libros, S.A., 2005.
[2] Olivier Faugeras. Three-dimensional Computer Vision: A Geometric Viewpoint. The MIT Press, 1993.
[3] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In SIGGRAPH Conference Proceedings, pages 835–846, New York, NY, USA, 2006. ACM Press.
[4] Jason Leigh, Andrew E. Johnson, Maxine Brown, Daniel J. Sandin, and Thomas A. DeFanti. Visualization in teleimmersive environments. Computer, 32:66–73, December 1999.
[5] Sang-Hack Jung and Ruzena Bajcsy. A framework for constructing real-time immersive environments for training physical activities. Journal of Multimedia.
[6] Ramanarayan Vasudevan, Zhong Zhou, Gregorij Kurillo, Edgar J. Lobaton, Ruzena Bajcsy, and Klara Nahrstedt. Real-time stereo-vision system for 3D teleimmersive collaboration. In ICME'10, pages 1208–1213, 2010.
[7] Thibaut Weise, Sofien Bouaziz, Hao Li, and Mark Pauly. Realtime performance-based facial animation. ACM Transactions on Graphics (Proceedings SIGGRAPH 2011), 30(4), July 2011.
[8] Stephen T. Barnard and Martin A. Fischler. Computational stereo. ACM Computing Surveys (CSUR), pages 553–572, December 1982.
[9] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Second Edition. Cambridge University Press, 2004.
[10] Yi Ma, Stefano Soatto, Jana Kosecka, and S. Shankar Sastry. An Invitation to 3-D Vision: From Images to Geometric Models. Springer, 2004.
[11] D.A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003.
[12] Donald Hearn and M. Pauline Baker. Graficas por Computadora. Second Edition. Prentice Hall, 1995.
[13] Emanuele Trucco and Alessandro Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[14] Richard Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 383–395, 1997.
[15] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47:7–42, 2002.
[16] Andrea Fusiello, Emanuele Trucco, and Alessandro Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, pages 16–22, 2000.
[17] Richard Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, pages 146–157, 1997.
[18] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630, 2005.
[19] X. Armangue, J. Salvi, and J. Batlle. A comparative review of camera calibrating methods with accuracy evaluation. Pattern Recognition, 35:1617–1635, 2000.
[20] Zhengyou Zhang. Flexible camera calibration by viewing a plane from unknown orientations. Seventh International Conference on Computer Vision, pages 666–673, 1999.
[21] Camera Calibration Toolbox for MATLAB. http://www.vision.caltech.edu/bouguetj/calib_doc/. Accessed October 18, 2011.
[22] ParaView. http://www.paraview.org. Accessed October 16, 2011.
[23] Gregorij Kurillo, Ramanarayan Vasudevan, Edgar Lobaton, and Ruzena Bajcsy. A framework for collaborative real-time 3D teleimmersion in a geographically distributed environment. In Proceedings of the 2008 Tenth IEEE International Symposium on Multimedia, ISM '08, pages 111–118, Washington, DC, USA, 2008. IEEE Computer Society.
