Transcript of [IEEE 2011 Annual IEEE India Conference (INDICON) - Hyderabad, India (2011.12.16-2011.12.18)] 2011...

Optimality in Homography in 3D Reconstruction from Canonical Stereo Setup

Avik Chatterjee, CSIR-CMERI, Durgapur, India ([email protected])
Hothur Satheesh, NIT Durgapur, Durgapur, India ([email protected])
Prof. Indrajit Basak, NIT Durgapur, Durgapur, India ([email protected])
Dr. S. Majumdar, CSIR-CMERI, Durgapur, India ([email protected])

Abstract— This paper presents an experimental study and investigation of errors in 3D reconstruction from views in a canonical stereo camera setup, with point correspondences of a grid known a priori. The objective of the investigation is to find the estimate of P by the Direct Linear Transform (DLT) solution, i.e. by minimizing the algebraic error ||AP|| subject to the normalizing constraint ||P|| = 1, and then further minimizing a total cost function (C) involving the reprojection error and the 3D geometric error through the iterative Levenberg-Marquardt technique, to see whether this second level of minimization of C brings any remarkable improvement in the solution. Here A is the measured-value matrix, with measurement error in the image and space point coordinates, and P is the estimate of the camera projection matrix. The problem of reconstruction from a stereo pair is well documented and well researched for both precisely calibrated and uncalibrated cameras, but little literature is available on investigating the optimality of the homography by minimizing a total cost function that considers both the geometric error in the image pair and a 3D geometric error term. We focus our investigation on the results of optimizing the total cost function including the 3D geometric error term, and on its effects on the homography and the reprojection.

Keywords- DLT, GSA, 3D reconstruction, camera calibration, algebraic error, geometric error.

I. INTRODUCTION

3D reconstruction, or space reconstruction, refers to techniques for recovering information about the structure of a 3D scene based on direct measurements, depth computation from stereo image matching, or multiple-view processing. This gives the positions and dimensions of the sensed object surfaces, and the information can, for instance, be used for robot navigation, guided surgery procedures, reconstruction of terrain from mapped data, etc.

Estimation of the spatial coordinates of object points from stereo images of an environment is a well documented and much studied problem, and there are many established approaches to reach the goal. Pioneering work in this field has been done by Hartley, Zisserman [1][2][3] and Beardsley [2]. However, current approaches to 3D reconstruction of a scene from stereo images require accurate and precise intrinsic and extrinsic camera parameters, which is not always feasible. Reconstruction using uncalibrated cameras is also documented in detail by Hartley [3][4] and others [5][6][7]. The availability of precise extrinsic and intrinsic camera parameters results in a full 3D Euclidean reconstruction. If only the intrinsic parameters are available, then reconstruction can be done up to a scaling factor. Non-availability of both intrinsic and extrinsic parameters results in reconstruction up to a projective transformation, although much work has been done by Hartley on Euclidean reconstruction from uncalibrated cameras in both stereo and multiple views [3][4].

Another established and well documented general framework for 3D reconstruction from a stereo view is to compute the optimal fundamental matrix $F$ from the image correspondences $x_i \leftrightarrow x'_i$ and then recover the camera projection matrices $P$ and $P'$ from $F$. Thereafter, linear triangulation is used to reconstruct the space points $X_i$ from the correspondences $x_i \leftrightarrow x'_i$, subject to minimization of a cost function. This results in a projective reconstruction of the $X_i$. There are many variations of this approach; for example, if the camera calibration matrices $K$ and $K'$ are known beforehand, the essential matrix $E$ can be computed and a metric reconstruction obtained [1].

The objective of this investigation is to study the effect of minimizing the total cost function consisting of both the geometric error in the image pair and the 3D geometric error, as suggested by Hartley and Zisserman [1], considering errors in measurement of both the image and the object points. The measurement errors have been introduced by not measuring the object coordinates very precisely with respect to the world coordinates (1~2 mm deviation). The errors in the image correspondences are in the range of 2~4 pixels, as estimated by line-fitting and corner-detection techniques. Little literature is available on checking the optimality condition with the above-mentioned total cost function along with measurement errors in the image and the object, although various other cost functions have been proposed and worked out. In this study the pinhole camera model has been chosen.

II. CAMERA CALIBRATION

Given a set of 3D space points $X_i$ in $\mathbb{R}^3$ and a set of corresponding points $x_i$ in the $\mathbb{R}^2$ image plane, the general finite projective camera maps $X_i \mapsto x_i$ according to $x_i = P X_i$, where $P$ is the $3 \times 4$ homogeneous finite camera projection matrix. In homogeneous coordinates, $P$ can be written as $P = \mathrm{diag}(f, f, 1)\,[\,I \mid 0\,]$, where $\mathrm{diag}(f, f, 1)$ is a diagonal matrix and $[\,I \mid 0\,]$ represents a matrix divided into a $3 \times 3$ block (the identity matrix) plus a column vector, here the zero vector. If $(p_x, p_y)^T$ are the coordinates of the principal point, then the above relation can be expressed as $x = K\,[\,I \mid 0\,]\,X_{cam}$, where $K$ is the camera calibration matrix. Here $X_{cam}$ emphasizes that the camera is assumed to be located at the origin of a Euclidean coordinate system with the principal axis of the camera pointing straight down the Z-axis, and the point is expressed in the Camera Coordinate Frame (CCF). Considering the camera rotation $R$ and translation $T$ with respect to the World Coordinate Frame (WCF), the equation in homogeneous coordinates can be written as $x = K R\,[\,I \mid -C\,]\,X$, where $X$ is in the WCF; this can be written more compactly as $x = K\,[\,R \mid T\,]\,X$, where $T = -RC$. The parameters contained in $K$ are called the camera internal or intrinsic parameters, and the parameters contained in $[\,R \mid T\,]$, which give the camera position and orientation with respect to the WCF, are called the external or extrinsic parameters. Hence, for the simple finite pinhole camera model, $P$ has 9 degrees of freedom: 3 for $K$ ($f$, $p_x$ and $p_y$), 3 for $R$ and 3 for $T$. For a CCD camera, the camera matrix has to be modified by introducing the parameters $m_x$ and $m_y$ (the number of pixels per unit distance in image coordinates along the x and y directions) for non-square pixels, and a skew parameter $s$ for added generality (refer to Appendix-A.1). A general finite projective camera has 11 degrees of freedom, the same number of degrees of freedom as a $3 \times 4$ matrix defined up to an arbitrary scale.
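To make the mapping concrete, the following minimal Python/NumPy sketch builds $K$ and $P = K[R \mid T]$ and projects a homogeneous world point; the numeric values are arbitrary placeholders, not the parameters of the cameras used in this work.

```python
import numpy as np

# Hypothetical intrinsic parameters (alpha_x, alpha_y, x0, y0, skew s);
# these are illustrative, not the calibrated values of the cameras used here.
alpha_x, alpha_y, x0, y0, s = 900.0, 900.0, 640.0, 360.0, 0.0
K = np.array([[alpha_x, s,       x0],
              [0.0,     alpha_y, y0],
              [0.0,     0.0,     1.0]])

R = np.eye(3)                      # camera aligned with the WCF axes
C = np.array([0.0, 0.0, -500.0])   # camera centre in the WCF (mm)
T = -R @ C                         # translation, T = -RC
P = K @ np.hstack([R, T.reshape(3, 1)])   # 3x4 projection matrix P = K[R | T]

X = np.array([100.0, 50.0, 250.0, 1.0])   # homogeneous world point
x = P @ X
x = x[:2] / x[2]                   # inhomogeneous pixel coordinates
print(x)
```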

The finite camera projection matrix $P$ can also be expressed as $P = K\,[\,R \mid -RC\,] = [\,M \mid -MC\,] = M_i M_e$, where $M_i$ is the intrinsic $3 \times 3$ non-singular camera matrix and $M_e = [\,R \mid -RC\,]$ is the $3 \times 4$ extrinsic camera matrix. Camera calibration, in this context, is the process of determining the internal camera geometric and optical characteristics (intrinsic parameters, $M_i$) and/or the 3D position and orientation of the camera frame relative to the WCF (extrinsic parameters, $M_e$) [1][5].

Several methods for geometric camera calibration have been presented in the literature. The classic approach solves the problem by minimizing a nonlinear error function. Due to the slowness and computational burden of this technique, closed-form solutions have also been suggested [6].

In this experiment, the pixel skew factor (s) and the effects of lens distortion (barrel, pincushion, etc.) are neglected for simplicity, as we primarily focus our investigation on the contribution of the 3D geometric error term to the minimization of the cost function through the iterative Levenberg-Marquardt (LM) technique [8][9], to see whether this second level of minimization brings any remarkable improvement in the solution.

The LM method is a blend of the Gradient Descent (GD) method and the Gauss-Newton (GN) method. In the GD method the update is governed by $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$, and the determination of $\alpha_k$ (the line-search parameter) is a one-dimensional minimization problem, $\min_{\alpha_k} f(x_k - \alpha_k \nabla f(x_k))$. GD will always make progress provided the gradient is non-zero, but it has a linear convergence rate and suffers from several convergence issues: it may not take a large step where the gradient is small (gentle slope) and a small step where the gradient is large (steep slope). Another issue is that the curvature of the error surface may not be the same in all directions, resulting in long narrow valleys in which the method has difficulty converging. The situation is improved by using curvature as well as gradient information, namely the second derivative $\nabla^2 f$, in the GN method, where the function $f(x)$ is assumed to be quadratic, giving rise to the update rule $x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$. GN has rapid convergence but is sensitive to the degree of linearity at the starting location. The two methods are complementary in the advantages they provide, which is exploited in the LM method by setting the update rule as $x_{k+1} = x_k - (H + \lambda\,\mathrm{diag}[H])^{-1} \nabla f(x_k)$, where $H$ is the Hessian matrix $\nabla^2 f(x_k)$ and $\lambda$ is a parameter. If the error goes down following an update, it implies that the quadratic assumption on $f(x)$ is working, and $\lambda$ is reduced (usually by a factor of 10) to reduce the influence of gradient descent. On the other hand, if the error goes up, the update has to follow the gradient more, so $\lambda$ is increased by the same factor. Since the Hessian is proportional to the curvature of $f(x)$, the update rule implies a large step in directions of low curvature and a small step in directions of high curvature, which is desired.

III. EXPERIMENTAL SETUP

The cameras considered in this study are Logitech high-definition (HD, 1280x720) webcams, model C910, in fixed-focus mode. No information is available about the focal length or the size or type of the sensing chip. The canonical stereo setup (Figure-1) is made to reduce the position and angular orientation measurements of the camera frames. The space points $X_i$ on the Tsai grid are intentionally measured to within an error of 2-3 mm with respect to the WCF. The image points $x_i$ are measured in pixels and obtained through Canny edge detection, straight-line fitting and intersection calculation to get the grid corners with respect to the image origin.

There are errors in the measurements of both the object and the image points, and hence the measured quantities are represented as $\bar{X}_i$ and $\bar{x}_i$. We require an estimate of the $3 \times 4$ camera matrix $\hat{P}$ such that $\bar{x}_i = \hat{P}\bar{X}_i$ for all $i$. Note that this is an equation involving homogeneous vectors, so the 3-vectors $\bar{x}_i$ and $\hat{P}\bar{X}_i$ need not be equal: they have the same direction but may differ in magnitude by a non-zero scale factor. The equation may therefore be expressed in terms of the vector cross product as $\bar{x}_i \times \hat{P}\bar{X}_i = 0$.

Figure-1: Experimental setup for canonical stereo. (a) Included angle of the Tsai grid θ = 120°; (b) included angle of the same grid θ = 225°.

On simplification this reduces to (1), where $x_i = (x_i, y_i, z_i)^T$, $P^{jT} X_i = X_i^T p^j$ denotes the product of the $j$-th row of $P$ with $X_i$, and $\hat{P}\bar{X}_i = (P^{1T}\bar{X}_i,\; P^{2T}\bar{X}_i,\; P^{3T}\bar{X}_i)^T$. In (1) the first two rows are linearly independent and the third is linearly dependent on them. Hence, from a set of $n$ point correspondences we obtain a $2n \times 12$ matrix by stacking the equations for each point correspondence.

$$
\begin{bmatrix}
 z_i X_i^T & 0^T & -x_i X_i^T \\
 0^T & -z_i X_i^T & y_i X_i^T \\
 -y_i X_i^T & x_i X_i^T & 0^T
\end{bmatrix}
\begin{bmatrix} P^1 \\ P^2 \\ P^3 \end{bmatrix} = 0 \qquad (1)
$$

The projection matrix P is computed by solving the set of equations AP = 0, where P here denotes the 12-vector containing the entries of the matrix P. Rewriting and stacking the equations gives the expression in Appendix-A.2.
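As an illustration, the following sketch assembles the stacked $2n \times 12$ measurement matrix A of (1) (and Appendix-A.2) from the noisy correspondences, using two independent rows per point; the function and variable names are ours, not the authors'.

```python
import numpy as np

def build_dlt_matrix(X_world, x_img):
    """Stack two rows per correspondence x_i <-> X_i for A P = 0.
    X_world: (n, 3) measured space points; x_img: (n, 2) measured image points."""
    rows = []
    for Xw, xi in zip(X_world, x_img):
        X = np.append(Xw, 1.0)       # homogeneous 4-vector (X, Y, Z, 1)
        x, y = xi
        zeros = np.zeros(4)
        # two independent equations per point, in the layout of Appendix-A.2
        # (the homogeneous image coordinate z_i is taken as 1)
        rows.append(np.hstack([X, zeros, -x * X]))
        rows.append(np.hstack([zeros, X, -y * X]))
    return np.asarray(rows)          # shape (2n, 12)
```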

For the minimal solution: the matrix P has 12 entries and (ignoring scale) 11 degrees of freedom, so 11 equations are necessary to solve for P exactly. Since each point correspondence gives two equations, a minimum of 5½ such correspondences is required; the ½ indicates that only one of the equations is used from the sixth point, so one needs only to know the x-coordinate (or alternatively the y-coordinate) of the sixth image point. In general A will have rank 11, and the solution vector P is the one-dimensional right null-space of A. If the data are not exact because of noise in the point coordinates, and $n \geq 6$ point correspondences are given, then there will be no exact solution to AP = 0. An estimate of P may then be obtained by minimizing an algebraic or a geometric error. In the case of the algebraic error (the residual AP), the approach is to minimize $\|AP\|$ subject to a normalization constraint. Possible constraints are (i) $\|P\| = 1$ and (ii) $\|\hat{p}^3\| = 1$, where $\hat{p}^3$ is the vector $(p_{31}, p_{32}, p_{33})^T$, namely the first three entries of the last row of P. The constraint $\|P\| = 1$ has been used in this case. This is equivalent to the solution scheme of taking the SVD of A ($A = UDV^T$): after arranging the positive diagonal entries of D in descending order, P is the last column of V. Normalization has been applied before implementing the algorithm, as suggested in [1], through isotropic scaling: the centroid of the points is translated to the origin and the coordinates are scaled so that the RMS distance from the origin is $\sqrt{3}$. This approach is suitable for a compact distribution of points. This solution is widely known as the minimal solution or the Direct Linear Transform (DLT) solution.
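A minimal sketch of the normalized DLT estimate described above (isotropic normalization, SVD of A, denormalization), assuming the build_dlt_matrix helper sketched earlier; it follows the algorithm as stated, not the authors' exact implementation.

```python
import numpy as np

def normalize(pts):
    """Isotropic normalization: centroid to origin, RMS distance sqrt(dim)."""
    pts = np.asarray(pts, dtype=float)
    centroid = pts.mean(axis=0)
    rms = np.sqrt(((pts - centroid) ** 2).sum(axis=1).mean())
    scale = np.sqrt(pts.shape[1]) / rms
    dim = pts.shape[1]
    T = np.eye(dim + 1)              # similarity transform in homogeneous form
    T[:dim, :dim] *= scale
    T[:dim, dim] = -scale * centroid
    return (pts - centroid) * scale, T

def dlt_projection_matrix(X_world, x_img):
    Xn, U = normalize(X_world)       # 3D similarity transform U
    xn, T = normalize(x_img)         # 2D similarity transform T
    A = build_dlt_matrix(Xn, xn)     # 2n x 12 (sketched earlier)
    _, _, Vt = np.linalg.svd(A)
    P_hat = Vt[-1].reshape(3, 4)     # singular vector of the smallest singular value
    P = np.linalg.inv(T) @ P_hat @ U # denormalize: P = T^-1 P_hat U
    return P / np.linalg.norm(P)     # enforce ||P|| = 1
```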

The estimate of P has been further refined by reducing the geometric error through an iterative method such as Levenberg-Marquardt. Assuming $X_i$ is accurately known and $x_i$ carries Gaussian measurement noise, the Maximum Likelihood estimate of P is

$$\min_P \sum_i^n d(x_i, \hat{x}_i)^2 = \min_P \sum_i^n d(x_i, P X_i)^2,$$

where the geometric error is $\sum_i^n d(x_i, \hat{x}_i)^2$. The DLT (minimal) solution may be used as the starting point for the iterative minimization. This procedure is also called the Gold Standard Algorithm (GSA) [1]. If Gaussian measurement error is considered in both the image points and the 3D points, then the Maximum Likelihood estimate of P is the solution of (2). This estimate is then denormalized to the original coordinates as $P = T^{-1}\hat{P}U$, where T and U are the similarity transforms applied to $x_i$ and $X_i$ respectively.

$$\min_P \sum_i^n \left( d(x_i, \hat{x}_i)^2 + d(X_i, \hat{X}_i)^2 \right) \qquad (2)$$
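One possible way to carry out this iterative refinement is with an off-the-shelf Levenberg-Marquardt solver such as scipy.optimize.least_squares, as sketched below for the image-error-only form; the full cost (2) would additionally include the 3D residuals $d(X_i, \hat{X}_i)$ and the estimated space points in the parameter vector. Variable names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(p, X_world, x_img):
    """Residuals d(x_i, P X_i) for the geometric-error refinement of P."""
    P = p.reshape(3, 4)
    Xh = np.hstack([X_world, np.ones((len(X_world), 1))])  # homogeneous points
    proj = (P @ Xh.T).T
    proj = proj[:, :2] / proj[:, 2:3]                       # inhomogeneous pixels
    return (proj - x_img).ravel()

# P_dlt (e.g. from the DLT sketch) is used as the starting point of the iteration:
# result = least_squares(reprojection_residuals, P_dlt.ravel(), method='lm',
#                        args=(X_world, x_img))
# P_gsa = result.x.reshape(3, 4)
```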

The general finite projective camera matrix P can be decomposed to extract the internal camera parameters ($M_i$) and the camera position and orientation parameters ($M_e$). The camera centre C is the point for which PC = 0, and numerically this right null-vector may be obtained from the SVD of P. Since $P = K\,[\,R \mid -RC\,] = [\,M \mid -MC\,]$, K and R can be found by RQ decomposition of M, i.e. KR = M. If the resulting rotation matrix is not orthogonal, orthogonality can be enforced by taking the SVD of R and replacing the diagonal matrix by the identity matrix, since the singular values of an orthogonal matrix are all one.
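A sketch of this decomposition step using SciPy's RQ factorization; fixing the signs so that the diagonal of K is positive is a common convention assumed here, not something stated in the text.

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(P):
    """Split P = K[R | -RC] into K, R and the camera centre C."""
    M = P[:, :3]
    K, R = rq(M)                        # RQ decomposition: M = K R
    S = np.diag(np.sign(np.diag(K)))    # assumed convention: positive diagonal of K
    K, R = K @ S, S @ R                 # S*S = I, so K R is unchanged
    K = K / K[2, 2]                     # normalise so that K[2, 2] = 1
    # camera centre: right null-vector of P (P C = 0), from the SVD of P
    _, _, Vt = np.linalg.svd(P)
    C = Vt[-1]
    C = C[:3] / C[3]
    # if R is not exactly orthogonal, re-orthogonalise via SVD (singular values -> 1)
    U, _, Vt2 = np.linalg.svd(R)
    R = U @ Vt2
    return K, R, C
```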

IV. RECONSTRUCTION

The reconstruction problem is to compute the position of a point in 3D space given the two views ($x$ and $x'$) and the maximum-likelihood camera projection matrices $P$ and $P'$. Back-projection of two corresponding points $x \leftrightarrow x'$ does not work directly, because with image measurement error the back-projected rays will be skew. If we consider a triangulation method $X = \tau(x, x', P, P')$, we would like $\tau$ to be invariant under a projective transformation $H$, i.e. $\tau(x, x', P, P') = H^{-1}\tau(x, x', PH^{-1}, P'H^{-1})$. If we adopt this goal, minimizing error in the projective space $\mathbb{P}^3$ will not work, because distance and perpendicularity relationships are not invariant in $\mathbb{P}^3$. Instead of minimizing error in $\mathbb{P}^3$, we estimate a 3D point $\hat{X}$ exactly satisfying $\hat{x} = P\hat{X}$ and $\hat{x}' = P'\hat{X}$ while maximizing the likelihood of the measurements under Gaussian error. As usual, the Maximum Likelihood (ML) estimate under Gaussian errors minimizes the reprojection error. Since the reprojection error only measures distances in the images, the ML estimate is invariant under projective transformations of 3D space.

First, we consider a simple linear estimate, minimizing an algebraic error similar to the DLT, which is not optimal. We use the cross product to eliminate the homogeneous scale factor. For each image we have $x \times (PX) = 0$, giving the expression in Appendix-A.3. Taking two linearly independent equations from each camera, we obtain the system $AX = 0$, where the expression for A is given in Appendix-A.4. Hence we have four equations in four homogeneous unknowns, which can be solved linearly up to an indeterminate scale factor. This method is also called the Linear Triangulation Method; it is the direct analogue of the DLT method and is suboptimal. As Gaussian measurement noise has been assumed, we seek a solution by projecting the estimated world point $\hat{X}_i$ into the two images at $\hat{x}_i$ and $\hat{x}'_i$, for the right and left cameras respectively, so that the reprojection errors $d$ and $d'$ are minimized, where $d(x, \hat{x})$ and $d'(x', \hat{x}')$ are the Euclidean distances and hence the geometric errors for the left and right images, $e$ and $e'$ are the epipoles, $\hat{x}$ and $\hat{x}'$ are the Maximum Likelihood Estimates (MLE) of the true image point correspondences, and $C$ and $C'$ are the camera centres of the right and left cameras respectively (Figure-2).
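A sketch of the Linear Triangulation Method just described, building the A matrix of Appendix-A.4 from two independent equations per camera and solving $AX = 0$ by SVD; P1, P2, x1, x2 are assumed inputs (the two camera matrices and the matched pixel coordinates).

```python
import numpy as np

def triangulate_linear(P1, P2, x1, x2):
    """Linear triangulation: solve A X = 0 for the homogeneous space point X,
    where A stacks x × (P X) = 0 for both views (Appendix-A.3/A.4)."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],    # x  p^3T - p^1T
        x1[1] * P1[2] - P1[1],    # y  p^3T - p^2T
        x2[0] * P2[2] - P2[0],    # x' p'^3T - p'^1T
        x2[1] * P2[2] - P2[1],    # y' p'^3T - p'^2T
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]           # inhomogeneous 3D point
```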

The geometric error in the images is the cost function $C_1$ such that

$$C_1(x, x') = \sum \big( d(x, \hat{x})^2 + d'(x', \hat{x}')^2 \big) \qquad (3)$$

Once $\hat{x}$ and $\hat{x}'$ are found, the world point estimate $\hat{X}$ can be derived by the triangulation method, since the corresponding rays then meet precisely in space. This cost function is modified by introducing a 3D geometric error term in order to study its contribution. If $\hat{X}_i$ is the estimate such that $\hat{x}_i = P\hat{X}_i$ and $\bar{X}_i$ is the noisy measured data, then the 3D geometric error term is

$$C_2(\bar{X}_i, \hat{X}_i) = \sum_i d(\bar{X}_i, \hat{X}_i)^2 \qquad (4)$$

Therefore, the total cost function to be minimized is

$$C(x, \hat{x}, \bar{X}_i, \hat{X}_i) = \alpha\,C_1(x, \hat{x}) + \beta\,C_2(\bar{X}_i, \hat{X}_i) \qquad (5)$$

where $\alpha$ and $\beta$ are introduced as weights, since the units of measurement of image points and world points are different. The total cost function C is minimized using the numerical Levenberg-Marquardt method for different values of $\alpha$ and $\beta$. This method is used because it has the merits of both the Gradient Descent and Gauss-Newton methods: the update can take a large step in directions of low curvature and a small step in directions of high curvature. The minimum can also be obtained non-iteratively as the solution of a sixth-degree polynomial, as suggested by Hartley [4]. The minimization of the total cost function has been iterated over the range [1000, 0.001] for $\alpha$ and [1000, 0.001] for $\beta$, in multiples of 10. The Frobenius norm of the general projection matrix is used as the evaluation criterion. The maximum change of the norm occurs at $\alpha = 0.01$ and $\beta = 1.0$, i.e. $\beta/\alpha = 100.0$.
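For illustration, a sketch of how the weighted total cost (5) can be evaluated for a trial reconstruction and a given $(\alpha, \beta)$ pair; the helper and variable names are ours, not the authors'.

```python
import numpy as np

def total_cost(P1, P2, X_hat, x1_meas, x2_meas, X_meas, alpha, beta):
    """C = alpha*C1 + beta*C2, eq. (5): reprojection error in both views
    plus the 3D geometric error between X_hat and the measured space points."""
    Xh = np.hstack([X_hat, np.ones((len(X_hat), 1))])

    def reproject(P):
        p = (P @ Xh.T).T
        return p[:, :2] / p[:, 2:3]

    C1 = (np.sum((reproject(P1) - x1_meas) ** 2)
          + np.sum((reproject(P2) - x2_meas) ** 2))   # image geometric error
    C2 = np.sum((X_hat - X_meas) ** 2)                # 3D geometric error
    return alpha * C1 + beta * C2

# A grid over beta/alpha in multiples of 10 (as in the experiments) can then be
# scanned, minimizing C with Levenberg-Marquardt at each setting.
```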

V. RESULTS

The number of image points is n = 56 for both θ = 120° (case-1) and θ = 225° (case-2). The reconstructed space points generated using the camera projection matrix obtained through the linear DLT ($P_{DLT}$), named $\hat{X}_{DLT}$, and through the iterative GSA ($P_{GSA}$), named $\hat{X}_{GSA}$, are reprojected into the image for both cases (Figure-3).

The error between the measured space points $\bar{X}$ and $\hat{X}_{DLT}$ is named the DLT error, $e_{DLT} = (\bar{X} - \hat{X}_{DLT})$, and the error between the measured space points $\bar{X}$ and the optimal reconstructed space points $\hat{X}_{GSA}$, obtained by minimizing the total cost function $C(x, \hat{x}, \bar{X}_i, \hat{X}_i)$, is named the GSA error, $e_{GSA} = (\bar{X} - \hat{X}_{GSA})$, as shown in Figure-4 and Figure-6 for case-1 and case-2.

The reprojection error, i.e. the difference in pixel coordinates between the measured image points $x$ and the reprojection of the reconstructed space points through the DLT (linear triangulation, i.e. $P_{DLT}X$) and the optimal GSA (i.e. $P_{GSA}X$), has been computed and is shown in Figure-4c for the case-1 left camera and Figure-6c for the case-2 left camera. It must be mentioned that in this case the minimization is not guaranteed to converge to the global minimum; there is a possibility of convergence to a local minimum within the bounds. Finally, triangular-mesh reconstructed surfaces are generated and shown in Figure-5 and Figure-7 for case-1 and case-2.

Figure-2: Geometric error d and d' for left and right images.

Figure-3. Reprojection of reconstructed points for (a) case-1 left camera and (b) case-2 left camera.

Figure-4. For case-1 left camera: (a) DLT error (b) GSA error (c) Reprojection error.

Figure-5. For case-1 left camera, triangular mesh of reconstructed space: (a) $\bar{X}_i$ (b) $\hat{X}_{DLT}$ (c) $\hat{X}_{GSA}$.

Figure-6. For case-2 left camera: (a) DLT error (b) GSA error (c) Reprojection error.

Figure-7. For case-2 left camera, triangular mesh of reconstructed space: (a) $\bar{X}_i$ (b) $\hat{X}_{DLT}$ (c) $\hat{X}_{GSA}$.

VI. DISCUSSION

The effect of minimizing the total cost function, which consists of both the geometric error in the image pair and the 3D geometric error as suggested by Hartley and Zisserman [1], considering errors in measurement of both image and object points, has been presented for a Tsai grid in two positions. It has been observed that in reconstruction through the linear DLT, the error in the Z direction (the direction of the camera axis) is significant in comparison with the X and Y directions (Figure-4 and Figure-6). The inclusion of the 3D geometric error term in the total cost function (Eq-5) does not improve the camera projection matrix in such a way that the reconstructed points move closer to their measured values. On the contrary, it increases the average Z-coordinate error relative to the DLT, from 4.8 mm to 5.0 mm in case-1, and even more in case-2 (Figure-6).

When the measured space points are reprojected onto the image by the projection matrix $P_{GSA}$, obtained by minimizing the total cost function, the reprojected points show a variation of ±2 pixels in the image y-axis direction, and the variation is oscillatory with no definite trend. This holds for $\beta/\alpha = 100.0$, where the maximum change in the Frobenius norm of the general projection matrix between $P_{DLT}$ and $P_{GSA}$ occurs (≈ 4193.3). Iterating with $\beta/\alpha > 1000$ results in an insignificant change in the norm between $P_{DLT}$ and $P_{GSA}$ (≈ 0.00003), suggesting no significant change in the matrix elements. This is also corroborated by the fact that the error levels $e_{DLT}$ and $e_{GSA}$ remain unaltered, apart from suffering from ill-conditioning and very slow convergence. For $\beta/\alpha < 10$, the change in the norm is lower, indicating no significant change in the matrix elements and no significant change in the error levels.

The most sensitive region for this problem is $\beta/\alpha = 100.0$, but it does not significantly improve the solution offered by the linear DLT method. In least-squares problems, given the Jacobian matrix J, we can essentially obtain the Hessian $\nabla^2 f(x_k)$ if the residuals $r(x_k)$ themselves are small; in that case the Hessian becomes $\nabla^2 f(x_k) = J(x_k)^T J(x_k)$, which is what is implemented in the Levenberg-Marquardt method. In this particular nonlinear least-squares minimization problem the residuals are large, and hence quadratic approximation methods may not be suitable. This area needs further investigation, particularly the minimization strategy for the total cost function C with the added 3D geometric error term.

APPENDIX-A

A.1 Finite Camera Projective Matrix

$$
K = \begin{bmatrix} \alpha_x & s & x_0 \\ & \alpha_y & y_0 \\ & & 1 \end{bmatrix}
$$

where $\alpha_x = f\,m_x$, $\alpha_y = f\,m_y$, $x_0 = p_x m_x$ and $y_0 = p_y m_y$.

A.2 Stacked expression for the P matrix.

$$
\begin{bmatrix}
X_{W1} & Y_{W1} & Z_{W1} & 1 & 0 & 0 & 0 & 0 & -x_1 X_{W1} & -x_1 Y_{W1} & -x_1 Z_{W1} & -x_1 \\
0 & 0 & 0 & 0 & X_{W1} & Y_{W1} & Z_{W1} & 1 & -y_1 X_{W1} & -y_1 Y_{W1} & -y_1 Z_{W1} & -y_1 \\
\vdots & & & & & & & & & & & \vdots \\
X_{Wn} & Y_{Wn} & Z_{Wn} & 1 & 0 & 0 & 0 & 0 & -x_n X_{Wn} & -x_n Y_{Wn} & -x_n Z_{Wn} & -x_n \\
0 & 0 & 0 & 0 & X_{Wn} & Y_{Wn} & Z_{Wn} & 1 & -y_n X_{Wn} & -y_n Y_{Wn} & -y_n Z_{Wn} & -y_n
\end{bmatrix}_{2n \times 12}
\begin{bmatrix}
P_{11} \\ P_{12} \\ P_{13} \\ P_{14} \\ P_{21} \\ P_{22} \\ P_{23} \\ P_{24} \\ P_{31} \\ P_{32} \\ P_{33} \\ P_{34}
\end{bmatrix}_{12 \times 1}
= 0
$$

A.3 Expression for $x \times (PX) = 0$

$$
\begin{bmatrix}
x\,(p^{3T}X) - (p^{1T}X) \\
y\,(p^{3T}X) - (p^{2T}X) \\
x\,(p^{2T}X) - y\,(p^{1T}X)
\end{bmatrix} = 0
$$

A.4 Expression for the A matrix.

$$
A = \begin{bmatrix}
x\,p^{3T} - p^{1T} \\
y\,p^{3T} - p^{2T} \\
x'\,p'^{3T} - p'^{1T} \\
y'\,p'^{3T} - p'^{2T}
\end{bmatrix}
$$

ACKNOWLEDGMENT

The authors would like to thank Prof. Gautam Biswas, Director of the Central Mechanical Engineering Research Institute (CMERI), a constituent establishment of the Council of Scientific and Industrial Research (CSIR), New Delhi, for extending the facilities and infrastructure for carrying out the experiments.

REFERENCES

[1] Richard Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Second Edition, Cambridge University Press, 2003, Chapter 7.

[2] Paul Beardsley, Phil Torr and Andrew Zisserman, “3D model acquisition from extended image sequences,” Lecture Notes in Computer Science, vol. 1065, pp. 683-695, 1996.

[3] R. I. Hartley, R. Gupta, and T. Chang, “Stereo from uncalibrated cameras,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 761-764, 1992.

[4] R.I. Hartley, “Euclidean Reconstruction from Uncalibrated Views,” Proc. Conf. Computer Vision and Pattern Recognition, pp. 908-912, 1994.

[5] R. Y. Tsai, “A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses,” IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-344, August 1987.

[6] Zhengyou Zhang, “A Flexible New Technique for Camera Calibration,” http://research.microsoft.com/en-us/um/people/zhang/Calib/

[7] Janne Heikkilä , Olli Silvén, “A Four-step Camera Calibration Procedure with Implicit Image Correction,” in Proc. IEEE conference on Computer Vision and Pattern Recognition, 1997.

[8] K. Levenberg, “A method for the solution of certain problems in least squares.”, Quart. Appl. Math., 1944, Vol. 2, pp. 164–168.

[9] D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters.”, SIAM J. Appl. Math., 1963, Vol. 11, pp. 431–441.