Efficient Vision-based Loop Closing Techniques for Delayed State Robot Mapping

Viorela Ila, Juan Andrade-Cetto and Alberto Sanfeliu

The authors are with the Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Llorens Artigas 4-6, Barcelona, 08028 Spain. [vila,cetto,sanfeliu]@iri.upc.edu.

Abstract— This paper shows results on outdoor vision-based loop closing for Simultaneous Localization and Mapping. Our experiments show that for loops of over 50 m, the pose estimates maintained with a Delayed-State Extended Information Filter are consistent enough to guarantee assertion of vision-based pose constraints for loop closure, provided no unnecessary information links are added to the estimator. The technique computes relative pose constraints via a robust least squares minimisation of 3D point correspondences, which are in turn obtained from the matching of SIFT features over candidate image pairs. We propose a loop closure test that checks both for closeness of means and for highly informative updates at the same time.

I. INTRODUCTION

Closing large loops during Simultaneous Localisation and Mapping (SLAM) is quite challenging; but, when accomplished, it reduces the accumulated estimation error enormously. A straightforward solution to the loop closing problem is to rely on the pose estimates from the filter of choice (be it a Kalman filter, an information filter, or a particle filter) and perform data association tests as much and as often as possible. By testing for data association based on the likelihood of estimates, one can avoid some of the problems associated with appearance-based SLAM, such as aliasing in homogeneous or repetitive scenes. But closing all loops consistent with the likelihood of estimates might add information links that are either unnecessary or unreliable.

Unnecessary links are those that contribute little information to reduce the estimation error. That is the case with repetitive measurements to the same landmarks when little or no motion occurs, or when pose constraints from small steps are used in the case of a delayed-state representation. Unreliable links, on the other hand, are those that are not consistent with the sensor uncertainties for which the system was trained. This happens, for example, when distance estimates come from computer vision sensors: distance errors, being inversely proportional to disparity in images, are larger for measurements to far away landmarks, and might introduce errors that are inconsistent.

Thus it is important to close loops sparsely on the one hand, and reliably on the other. In [1] the authors suggest that testing for loop closure in SLAM should be performed independently of the vehicle pose estimates. This is certainly a sensible thing to do once the filter has become inconsistent. In this work, we adopt instead the straightforward approach of relying on the estimator for the generation of pose constraint hypotheses, but pay special attention not to let the system become overconfident too soon.

Since adding information links for all possible matches produces overconfident estimates that in the long term lead to filter inconsistency, we propose instead a two-step loop closure test. First, we check whether two poses are candidates for loop closure with respect to their mean estimates. This is achieved by testing the Mahalanobis distance, in the same way data association gating is commonly performed during a SLAM update. But instead of adding all these information links, we subject the candidates to a second test that checks for large values of the second term of the Bhattacharyya distance. The purpose of this part of the test is to allow loop closing only in highly informative situations, that is, when the proposed matching covariances are sufficiently different and a large amount of information is expected to enter the filter.

The remainder of this paper is as follows. In Section II we present our strategy for computing pose constraints from two vantage points using computer vision. Section III is devoted to explaining our chosen SLAM representation and the way in which our vision-based pose constraints are used to update the map. Section IV describes the proposed loop closure test. Some experimental results in an outdoor scenario are shown in Section V, and concluding remarks are added in Section VI.

II. RELATIVE POSE CONSTRAINTS

The technique iterates as follows: SIFT image features are extracted and matched from candidate stereo image pairs. Their image point correspondences are then triangulated to obtain a set of 3D feature matches, which are in turn used to compute a least squares best-fit pose transformation. Robust feature outlier rejection is obtained via RANSAC during the computation of the best camera pose constraint. These camera pose constraints are used as relative pose measurements in delayed-state information-form SLAM. A substantial computational complexity advantage of delayed-state information-form SLAM is that, given its exact sparseness, predictions and updates take constant time prior to loop closure [2]. Thanks to the features used, the proposed technique is robust enough not only to relate consecutive image pairs during robot motion, but also to assert loop closure hypotheses.

A. Feature Extraction


Simple correlation-based features, such as Harris corners [3] or Shi and Tomasi features [4], are in common use in vision-based SFM and SLAM, from the early uses of Harris himself to the popular work of Davison [5]. These features can be robustly tracked when camera displacement is small and are tailored to real-time applications. However, given their sensitivity to scale, their matching is prone to fail under larger camera motions, let alone for loop-closing hypothesis testing. Given their scale and local affine invariance properties, we opt to use SIFT features instead [6], [7], as they constitute a better option for matching visual features from varying poses. To deal with scale and affine distortions, SIFT keypoint patches are selected from difference-of-Gaussian images at various scales, and the dominant gradient orientation and scale of each are stored.

In our system, image pairs are acquired from a calibrated stereo rig¹. Features are extracted and matched with previous image pairs. The surviving features are then stereo-triangulated, enforcing epipolar and disparity constraints. The epipolar constraint is enforced by allowing feature matches only within ±1 pixel rows on rectified images. The disparity constraint is set to allow matches within a 1-10 meter range, where camera resolution is best. The result is a set of two clouds of matching 3D points: $\mathbf{p}_t$ from the current pose, and $\mathbf{p}_i$ from a previous pose, $0 < i < t$, both referenced to the coordinate frame of the left camera.
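The filtering and triangulation step can be summarised with a short sketch, shown below. It assumes NumPy, a rectified rig, and hypothetical inputs: keypoint coordinate arrays, an index-pair match list, and calibration values (focal length fx, baseline, principal point cx, cy) standing in for the actual camera parameters.

```python
import numpy as np

def filter_and_triangulate(kp_left, kp_right, matches, fx, baseline, cx, cy):
    """Keep SIFT matches satisfying the epipolar (+/-1 pixel row) and
    disparity (1-10 m depth) constraints, then triangulate them into a
    cloud of 3D points in the left-camera frame."""
    points = []
    for i_l, i_r in matches:                 # indices of matched keypoints
        u_l, v_l = kp_left[i_l]
        u_r, v_r = kp_right[i_r]
        if abs(v_l - v_r) > 1.0:             # epipolar constraint
            continue
        disparity = u_l - u_r
        if disparity <= 0.0:
            continue
        z = fx * baseline / disparity        # depth from disparity
        if not 1.0 <= z <= 10.0:             # range where resolution is best
            continue
        x = (u_l - cx) * z / fx
        y = (v_l - cy) * z / fx              # assumes square pixels
        points.append((x, y, z))
    return np.asarray(points)
```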

B. Pose Estimation

The homogeneous transformation relating the two aforementioned clouds of points can be computed by solving a set of equations of the form

$$\mathbf{p}_i = R\,\mathbf{p}_t + \mathbf{t}\,. \qquad (1)$$

A solution for the rotation matrix $R$ is computed by minimising the sum of the squared errors between the rotated directional vectors² of feature matches for the two robot poses. The solution to this minimisation problem gives an estimate of the orientation of one cloud of points with respect to the other, and can be expressed in quaternion form as

$$\frac{\partial}{\partial R}\left(\mathbf{q}^\top A\,\mathbf{q}\right) = 0\,, \qquad (2)$$

where $A$ is given by

$$A = \sum_{k=1}^{N} B_k B_k^\top\,, \qquad (3)$$

$$B_k = \begin{pmatrix} 0 & -c^k_x & -c^k_y & -c^k_z \\ c^k_x & 0 & b^k_z & -b^k_y \\ c^k_y & -b^k_z & 0 & b^k_x \\ c^k_z & b^k_y & -b^k_x & 0 \end{pmatrix}\,, \qquad (4)$$

and

$$\mathbf{b}^k = \mathbf{v}^k_t + \mathbf{v}^k_i\,, \qquad \mathbf{c}^k = \mathbf{v}^k_t - \mathbf{v}^k_i\,. \qquad (5)$$

The quaternion $\mathbf{q}$ that minimises the argument of the derivative operator in Equation 2 is the eigenvector of the matrix $A$ associated with its smallest eigenvalue.

¹Point Grey's Bumblebee FireWire stereo 640×480 camera, with a 6 mm lens.

²A directional vector $\mathbf{v}$ can be computed as the unit-norm direction along $\mathbf{p}$, and indicates the orientation of that point.

Fig. 1. SIFT correspondences in two consecutive stereo image pairs after outlier removal using RANSAC.

If we denote this smallest eigenvector by the 4-tuple $(\alpha_1,\alpha_2,\alpha_3,\alpha_4)^\top$, it follows that the rotation angle $\theta$ associated with the rotational transform is given by

$$\theta = 2\cos^{-1}(\alpha_1)\,, \qquad (6)$$

and the axis of rotation is given by

$$\mathbf{a} = \frac{(\alpha_2,\alpha_3,\alpha_4)^\top}{\sin(\theta/2)}\,. \qquad (7)$$

Then, it can be shown that the elements of the rotation submatrix $R$ are related to the orientation parameters $\mathbf{a}$ and $\theta$ by

$$R = \begin{pmatrix} a_x^2+(1-a_x^2)c_\theta & a_x a_y c'_\theta - a_z s_\theta & a_x a_z c'_\theta + a_y s_\theta \\ a_x a_y c'_\theta + a_z s_\theta & a_y^2+(1-a_y^2)c_\theta & a_y a_z c'_\theta - a_x s_\theta \\ a_x a_z c'_\theta - a_y s_\theta & a_y a_z c'_\theta + a_x s_\theta & a_z^2+(1-a_z^2)c_\theta \end{pmatrix}\,, \qquad (8)$$

where $s_\theta = \sin\theta$, $c_\theta = \cos\theta$, and $c'_\theta = 1-\cos\theta$ [8].

Once the rotation matrix $R$ is computed, we can use the matched set of points again to compute the translation vector $\mathbf{t}$ as the difference between the two cloud centroids,

$$\mathbf{t} = \frac{1}{N}\sum_{k=1}^{N} \mathbf{p}^k_i - R\,\frac{1}{N}\sum_{k=1}^{N} \mathbf{p}^k_t\,. \qquad (9)$$
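The rotation and translation recovery of Equations 2-9 can be condensed into a short sketch. This is a minimal NumPy transcription under illustrative names, assuming a nonzero rotation (the axis in Equation 7 is undefined for θ = 0):

```python
import numpy as np

def relative_pose(p_t, p_i):
    """Least squares rigid transform p_i ~ R p_t + t between two matched
    (N x 3) point clouds, following Eqs. (2)-(9)."""
    # Directional (unit-norm) vectors of each 3D point (footnote 2).
    v_t = p_t / np.linalg.norm(p_t, axis=1, keepdims=True)
    v_i = p_i / np.linalg.norm(p_i, axis=1, keepdims=True)
    A = np.zeros((4, 4))
    for vt, vi in zip(v_t, v_i):
        b = vt + vi                                   # Eq. (5)
        c = vt - vi
        B = np.array([[0.0,  -c[0], -c[1], -c[2]],    # Eq. (4)
                      [c[0],  0.0,   b[2], -b[1]],
                      [c[1], -b[2],  0.0,   b[0]],
                      [c[2],  b[1], -b[0],  0.0]])
        A += B @ B.T                                  # Eq. (3)
    # The minimiser of q^T A q is the eigenvector of the smallest eigenvalue.
    _, V = np.linalg.eigh(A)                          # ascending eigenvalues
    a1, a2, a3, a4 = V[:, 0]
    theta = 2.0 * np.arccos(np.clip(a1, -1.0, 1.0))   # Eq. (6)
    ax, ay, az = np.array([a2, a3, a4]) / np.sin(theta / 2.0)  # Eq. (7)
    s, co, cp = np.sin(theta), np.cos(theta), 1.0 - np.cos(theta)
    R = np.array([  # Eq. (8)
        [ax*ax + (1 - ax*ax)*co, ax*ay*cp - az*s,        ax*az*cp + ay*s],
        [ax*ay*cp + az*s,        ay*ay + (1 - ay*ay)*co, ay*az*cp - ax*s],
        [ax*az*cp - ay*s,        ay*az*cp + ax*s,        az*az + (1 - az*az)*co]])
    t = p_i.mean(axis=0) - R @ p_t.mean(axis=0)       # Eq. (9)
    return R, t
```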

It might be the case that SIFT matches occur on areas of the scene that experienced motion during the acquisition of the two stereo image pairs. For example, an interest point might appear at an acute angle of a tree leaf shadow, or on a person walking in front of the robot. The corresponding matched 3D points will not represent good fits to the camera motion model, and might introduce a large bias into our least squares pose error minimisation. To eliminate such outliers, we resort to the use of RANSAC [9]. The use of such a robust model fitting technique allows us to preserve the largest number of point matches that at the same time minimise the sum of squared residuals $\|R\mathbf{p}_t + \mathbf{t} - \mathbf{p}_i\|$, as shown in Figure 1.
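A sketch of the RANSAC loop around the pose fit might look as follows; it reuses the relative_pose() helper from the previous sketch, and the iteration count and 5 cm inlier gate are illustrative tuning values, not figures from the paper:

```python
import numpy as np

def ransac_pose(p_t, p_i, iters=200, tol=0.05, seed=0):
    """Fit R, t to minimal 3-point subsets, keep the largest inlier set,
    then refit on all inliers for the final pose constraint."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(p_t), dtype=bool)
    for _ in range(iters):
        sample = rng.choice(len(p_t), size=3, replace=False)
        R, t = relative_pose(p_t[sample], p_i[sample])
        residuals = np.linalg.norm(p_t @ R.T + t - p_i, axis=1)
        inliers = residuals < tol
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:            # degenerate run: fall back to all matches
        best[:] = True
    R, t = relative_pose(p_t[best], p_i[best])
    return R, t, best
```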



III. VISUALLY AUGMENTED ODOMETRY IN INFORMATION FORM

A. Exactly Sparse Delayed-State SLAM

Compared to the Extended Kalman Filter (EKF) SLAM, which has quadratic time complexity in the number of states, the delayed-state information-form SLAM has been shown to produce exactly sparse information matrices [2]. If consecutive robot poses are added to the state, the result is a tri-block diagonal information matrix, linking consecutive measurements. This structure allows constant-time predictions and updates, with a considerable advantage in terms of computational cost, making it suitable for relatively large environments. The tri-block diagonal structure is only modified sparsely during loop closure. Thus, to keep the computational complexity low, we would like to close loops only when they are really needed. Our loop closure test is meant to do exactly that.

The delayed-state information-form SLAM representation consists in estimating a state vector $\mathbf{x}$ containing the history of poses, parameterised as an inverse normal distribution

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x};\boldsymbol\mu,\Sigma) = \mathcal{N}^{-1}(\mathbf{x};\boldsymbol\eta,\Lambda)\,, \qquad (10)$$

where

$$\Lambda = \Sigma^{-1} \quad\text{and}\quad \boldsymbol\eta = \Lambda\boldsymbol\mu\,. \qquad (11)$$

In a delayed state representation, the map state is simply the history of pose mean estimates

$$\boldsymbol\mu = \begin{pmatrix} \boldsymbol\mu_t \\ \vdots \\ \boldsymbol\mu_1 \end{pmatrix}\,.$$

As in most SLAM formulations, white noise $\mathbf{w}_t$ with covariance $Q$ is added to the vehicle motion prediction model, and its linearised version is used in the computation of the covariance prediction (information prediction in our case),

$$\mathbf{x}_{t+1} = f(\mathbf{x}_t,\mathbf{u}_t) + \mathbf{w}_t \approx f(\boldsymbol\mu_t,\mathbf{u}_t) + F(\mathbf{x}_t - \boldsymbol\mu_t) + \mathbf{w}_t\,. \qquad (12)$$

Given the absolute dead-reckoning readings of position $x^d, y^d$ and orientation $\theta$, the delayed state can be written as

$$\mathbf{x} = [x_t, y_t, \theta_t, \dots, x_1, y_1, \theta_1]^\top\,. \qquad (13)$$

The vehicle motion model can be expressed in terms of the relative travelled distance and the relative orientation change:

$$x_{t+1} = x_t + d\cos(\theta_t + \psi) \qquad (14)$$
$$y_{t+1} = y_t + d\sin(\theta_t + \psi) \qquad (15)$$
$$\theta_{t+1} = \theta_t + \Delta\theta\,. \qquad (16)$$

At two consecutive time steps $t$ and $t+1$, the relative travelled distance and relative orientation change are

$$d = \sqrt{(x^d_{t+1} - x^d_t)^2 + (y^d_{t+1} - y^d_t)^2} \qquad (17)$$
$$\psi = \tan^{-1}\!\left(\frac{y^d_{t+1} - y^d_t}{x^d_{t+1} - x^d_t}\right) - \theta_t \qquad (18)$$
$$\Delta\theta = \theta_{t+1} - \theta_t\,. \qquad (19)$$
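In code, Equations 14-19 amount to a few lines; the sketch below is an illustrative NumPy transcription, with pose triplets (x, y, θ) and the dead-reckoning readings passed as plain tuples:

```python
import numpy as np

def odometry_increment(dr_t, dr_t1):
    """Relative travelled distance and orientation change, Eqs. (17)-(19),
    from two consecutive dead-reckoning readings (x_d, y_d, theta)."""
    xd, yd, th = dr_t
    xd1, yd1, th1 = dr_t1
    d = np.hypot(xd1 - xd, yd1 - yd)            # Eq. (17)
    psi = np.arctan2(yd1 - yd, xd1 - xd) - th   # Eq. (18)
    dtheta = th1 - th                           # Eq. (19)
    return d, psi, dtheta

def motion_model(pose, d, psi, dtheta):
    """Pose prediction f(x_t, u_t), Eqs. (14)-(16)."""
    x, y, th = pose
    return np.array([x + d * np.cos(th + psi),
                     y + d * np.sin(th + psi),
                     th + dtheta])
```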

Considering the linearisation of the nonlinear motion prediction model of Equation 12, we can calculate the Jacobian $F$ as

$$F = \begin{pmatrix} 1 & 0 & -d\sin(\theta_t+\psi) \\ 0 & 1 & d\cos(\theta_t+\psi) \\ 0 & 0 & 1 \end{pmatrix}\,. \qquad (21)$$

The revision of the entire history of poses, as a result of adding the new information that links the current and predicted poses, can be computed in information form [2] with

$$\boldsymbol\eta = \begin{pmatrix} Q^{-1}\left(f(\boldsymbol\mu_t,\mathbf{u}_t) - F\boldsymbol\mu_t\right) \\ \boldsymbol\eta_t - F^\top Q^{-1}\left(f(\boldsymbol\mu_t,\mathbf{u}_t) - F\boldsymbol\mu_t\right) \\ \boldsymbol\eta_{t-1:1} \end{pmatrix}\,, \qquad (22)$$

and the associated information matrix is

$$\Lambda = \begin{pmatrix} Q^{-1} & -Q^{-1}F & 0 \\ -F^\top Q^{-1} & \Lambda_{t,t} + F^\top Q^{-1}F & \Lambda_{t,t-1:1} \\ 0 & \Lambda_{t-1:1,t} & \Lambda_{t-1:1,t-1:1} \end{pmatrix}\,. \qquad (23)$$

Augmenting the information vector in this form introduces shared information only between the new robot pose $\mathbf{x}_{t+1}$ and the previous one $\mathbf{x}_t$. Moreover, the shared information between $\mathbf{x}_{t+1}$ and the delayed states ($t-1$ to $1$) is always zero, resulting in a naturally sparse information matrix with a tridiagonal block structure.
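A dense-matrix sketch of this augmentation step is given below for concreteness; a real implementation would exploit the sparsity, and the function and argument names are illustrative:

```python
import numpy as np

def augment_state(eta, Lam, mu_t, u_t, Q, f, F):
    """State augmentation in information form, Eqs. (22)-(23). The state
    is ordered newest-first as in Eq. (13); eta and Lam hold the current
    history, mu_t is the current pose mean, and f/F are the motion model
    and its Eq. (21) Jacobian evaluated at mu_t."""
    Qi = np.linalg.inv(Q)
    g = f(mu_t, u_t) - F @ mu_t            # linearisation offset term
    n = len(eta)
    eta_new = np.empty(n + 3)
    eta_new[:3] = Qi @ g                   # new pose x_{t+1}
    eta_new[3:6] = eta[:3] - F.T @ Qi @ g  # revised x_t block
    eta_new[6:] = eta[3:]                  # older poses untouched
    Lam_new = np.zeros((n + 3, n + 3))
    Lam_new[3:, 3:] = Lam                  # previous information matrix
    Lam_new[:3, :3] = Qi
    Lam_new[:3, 3:6] = -Qi @ F
    Lam_new[3:6, :3] = -F.T @ Qi
    Lam_new[3:6, 3:6] += F.T @ Qi @ F
    return eta_new, Lam_new
```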

Now, in information form, measurement updates are additive and can be computed in constant time. The pose constraints linking the predicted pose and any previous nearby pose, as measured by our vision system, have the form

$$\mathbf{z}_{t+1} = h(\mathbf{x}_{t+1,i}) + \mathbf{v}_{t+1} \approx h(\boldsymbol\mu_{t+1,i}) + H(\mathbf{x}_{t+1,i} - \boldsymbol\mu_{t+1,i}) + \mathbf{v}_{t+1}\,, \qquad (24)$$

with $\mathbf{v}_{t+1}$ the zero-mean white measurement noise with covariance $R$, and $H$ the measurement Jacobian.

The nonlinear measurement model is

$$z_x = d\cos\psi \qquad (25)$$
$$z_y = d\sin\psi \qquad (26)$$
$$z_\theta = \theta_{t+1} - \theta_i\,, \qquad (27)$$


Fig. 2. Links coming from Mahalanobis (blue) and Mahalanobis-Bhattacharyya distance (red) tests.

where

$$d = \sqrt{(x_{t+1} - x_i)^2 + (y_{t+1} - y_i)^2} \qquad (28)$$
$$\psi = \tan^{-1}\!\left(\frac{y_{t+1} - y_i}{x_{t+1} - x_i}\right) - \theta_i\,, \qquad (29)$$

and the Jacobian $H$ is

$$H = \begin{pmatrix} \frac{\Delta x\, c_\psi + \Delta y\, s_\psi}{d} & \frac{-\Delta x\, s_\psi + \Delta y\, c_\psi}{d} & 0 & \frac{-\Delta x\, c_\psi - \Delta y\, s_\psi}{d} & \frac{\Delta x\, s_\psi - \Delta y\, c_\psi}{d} & d\, s_\psi \\ \frac{\Delta x\, s_\psi - \Delta y\, c_\psi}{d} & \frac{\Delta x\, c_\psi + \Delta y\, s_\psi}{d} & 0 & \frac{-\Delta x\, s_\psi + \Delta y\, c_\psi}{d} & \frac{-\Delta x\, c_\psi - \Delta y\, s_\psi}{d} & -d\, c_\psi \\ 0 & 0 & 1 & 0 & 0 & -1 \end{pmatrix}\,, \qquad (20)$$

with $s_\psi = \sin\psi$, $c_\psi = \cos\psi$, and

$$\Delta x = x_{t+1} - x_i \qquad (30)$$
$$\Delta y = y_{t+1} - y_i\,. \qquad (31)$$

The updates to the entries linking poses $t+1$ and $i$ in the information vector and information matrix become

$$\boldsymbol\eta_{t+1,i} = \boldsymbol\eta_{t+1,i} + H^\top R^{-1}\left(\mathbf{z}_{t+1} - h(\boldsymbol\mu_{t+1,i}) + H\boldsymbol\mu_{t+1,i}\right) \qquad (32)$$
$$\Lambda_{t+1,i,t+1,i} = \Lambda_{t+1,i,t+1,i} + H^\top R^{-1}H\,. \qquad (33)$$

$R$ now represents a covariance term for the sensor uncertainty, usually fixed depending on the sensor used. In this visually augmented SLAM representation, the Jacobian $H$ is always sparse [10], and as a consequence only the four blocks relating poses $t+1$ and $i$ in the information matrix will be updated:

$$H = [H_{t+1}, 0, \dots, 0, H_i, 0, \dots, 0]\,. \qquad (34)$$

For consecutive poses, this produces tridiagonal blocks in $\Lambda$; for larger loop closures, the update modifies only the diagonal $t+1$ and $i$ terms and introduces sparse blocks at the locations $(t+1, i)$ and $(i, t+1)$.
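The sketch below illustrates this one-step update under the same illustrative conventions as the earlier sketches; idx_t1 and idx_i are the offsets of the two pose blocks in the stacked state, and h, H_t1, H_i stand for the measurement model of Equations 25-27 and the two non-zero blocks of Equation 20:

```python
import numpy as np

def measurement_update(eta, Lam, mu, z, h, H_t1, H_i, R_inv, idx_t1, idx_i):
    """Constant-time information-form update, Eqs. (32)-(34): only the
    eta entries and the four Lam blocks of poses t+1 and i are touched."""
    H = np.hstack([H_t1, H_i])                        # Eq. (34), 3 x 6
    mu_pair = np.concatenate([mu[idx_t1:idx_t1 + 3], mu[idx_i:idx_i + 3]])
    # Eq. (32): additive update of the information vector.
    d_eta = H.T @ R_inv @ (z - h(mu_pair) + H @ mu_pair)
    eta[idx_t1:idx_t1 + 3] += d_eta[:3]
    eta[idx_i:idx_i + 3] += d_eta[3:]
    # Eq. (33): additive update of the four information-matrix blocks.
    dL = H.T @ R_inv @ H                              # 6 x 6
    Lam[idx_t1:idx_t1 + 3, idx_t1:idx_t1 + 3] += dL[:3, :3]
    Lam[idx_t1:idx_t1 + 3, idx_i:idx_i + 3] += dL[:3, 3:]
    Lam[idx_i:idx_i + 3, idx_t1:idx_t1 + 3] += dL[3:, :3]
    Lam[idx_i:idx_i + 3, idx_i:idx_i + 3] += dL[3:, 3:]
    return eta, Lam
```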

IV. LOOP CLOSURE DETECTION

Two phases can be distinguished during the loop-closing process. First, we need to detect the possibility of a loop closure event; then, we must certify the presence of such a loop closure from visual data. The likelihoods of pose estimates are valuable in detecting possible loop closures. A comparison of the current pose estimate with the history of poses can tell whether the robot is in the vicinity of a previously visited place, in terms of both the global position and the orientation.

Fig. 3. Sparse information matrix in delayed-state SLAM. A loop-closure event adds non-zero off-diagonal elements (see zoomed region).

This is achieved by measuring the Mahalanobis distance from the prior estimate to all previously visited locations, i.e., for all $0 < i < t$,

$$d^2_M = (\boldsymbol\mu_{t+1} - \boldsymbol\mu_i)^\top \left(\frac{\Sigma_{t+1} + \Sigma_i}{2}\right)^{-1} (\boldsymbol\mu_{t+1} - \boldsymbol\mu_i)\,. \qquad (35)$$

The average of the two covariances is used to accommodate the varying levels of estimation uncertainty, both in the pose prior being evaluated and in the past pose being compared. In the case of a normal distribution, the Mahalanobis distance follows the $\chi^2$ distribution with $n-1$ degrees of freedom (where $n$ is the number of variables; in our case, 2 dof corresponding to a 95% confidence bound suffice). We are searching only for nearby locations sufficiently close as to guarantee that a link from vision data is possible.
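As a sketch, the gate of Equation 35 over the position part of each pose might be implemented as follows; restricting the test to the 2×2 position blocks matches the 2 dof mentioned above, and all names are illustrative:

```python
import numpy as np
from scipy.stats import chi2

GATE = chi2.ppf(0.95, df=2)        # 95% confidence bound, 2 dof

def mahalanobis_candidates(mu, Sigma, t1):
    """Indices i of past poses passing the Eq. (35) gate against the
    prior estimate of pose t+1 (index t1); mu[i] is a pose mean and
    Sigma[i] its 3x3 marginal covariance."""
    candidates = []
    for i in range(t1):
        diff = mu[t1][:2] - mu[i][:2]                  # position part
        S = 0.5 * (Sigma[t1][:2, :2] + Sigma[i][:2, :2])
        d2 = diff @ np.linalg.solve(S, diff)           # Eq. (35)
        if d2 < GATE:
            candidates.append(i)
    return candidates
```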

Many nearby poses will satisfy this condition, as shown in Figure 2. At the start of a SLAM run, when covariances are small, only links connecting very close poses will satisfy the test. But, as error accumulates, pose covariances grow, covering larger and larger areas of matching candidates. For long straight trajectories with corresponding visual features, information links could even be established over large distances. Some aid in reducing the effect of such large pose constraints comes from the fact that orientation variance is usually larger than position variance, and pose covariance ellipsoids are usually aligned orthogonal to the direction of motion.

Nevertheless, adding all these information links would drive the filter to inconsistency, especially for pose constraints over large distances. One reason is that the rigid transformations computed from the matching of visually salient features that are distant from the camera are too noisy, and inconsistent with the sensor uncertainty parameter $R$ used during the filter update step.

Instead of adding all the information links that pass the Mahalanobis distance test, our update procedure requires candidates to first pass a second test. The aim of this second test is to allow updates using only links with a high informative load. In terms of covariances, this happens when a pose with a large covariance can be linked with a pose with a small uncertainty.


Fig. 4. Odometry-only (black), vision-only (green), and combined vehicle trajectory (red).

$$d_B = \frac{1}{2}\ln\frac{\left|\frac{\Sigma_{t+1}+\Sigma_i}{2}\right|}{\sqrt{|\Sigma_{t+1}|\,|\Sigma_i|}} \qquad (36)$$

The above expression is the second term of the Bhattacharyya distance, and gives a measure of separability in terms of covariance difference [11]. This test is typically used to discern between two distinct classes with close means but differing covariances. We can see, however, that it can also be used to fuse two observations of the same event with differing covariance estimates. The value of $d_B$ increases as the two covariances $\Sigma_{t+1}$ and $\Sigma_i$ become more different. The Bhattacharyya covariance separability measure is symmetric, so we also need to test whether the current pose covariance is larger than that of the $i$-th pose it is being compared with. This is done by analysing the area of uncertainty of each estimate, comparing the determinants $|\Sigma_{t+1}|$ and $|\Sigma_i|$. The reason is that we only want to update the overall estimate with information links to states that had smaller uncertainty than the current state. Figure 2 shows in red the remaining links after this second test.
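A compact sketch of this second test, combining Equation 36 with the determinant comparison, is shown below; the acceptance threshold db_min is an illustrative tuning parameter, not a value from the paper:

```python
import numpy as np

def informative_link(Sig_t1, Sig_i, db_min=1.0):
    """Accept a loop-closure candidate only if the second term of the
    Bhattacharyya distance (Eq. 36) is large and the past pose i was
    better localised than the current pose t+1."""
    avg = 0.5 * (Sig_t1 + Sig_i)
    d_b = 0.5 * np.log(np.linalg.det(avg) /
                       np.sqrt(np.linalg.det(Sig_t1) * np.linalg.det(Sig_i)))
    return d_b > db_min and np.linalg.det(Sig_i) < np.linalg.det(Sig_t1)
```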

In a second phase we still must certify the presence of a loop closure event in order to update the entire pose estimate and reduce the overall uncertainty. When an image correspondence can be established, the computed pose constraint is used in a one-step update of the information filter, as shown in Equations 32 and 33. A one-step update in information form changes the entire history of poses, adding a linear number of non-zero off-diagonal elements to the information matrix, as shown in Figure 3. The degree of sparsity can be controlled by reducing the confidence on image registration when testing for loop-closure events.

Fig. 5. The information added during the loop closure event enormously reduces the accumulated errors: prediction (red), update (blue).

V. EXPERIMENTS

To test our strategy for vision-based augmented odometry, we performed a series of experiments in small unstructured urban environments. In one of our experiments we chose a cyclic path of around 300 square meters in order to test a loop-closure situation. A snapshot of these tests is shown in Figure 4. The robot was manually driven through a series of predefined via points previously marked on the floor. The results of estimating the vehicle motion purely from accumulated raw odometry and purely from concatenating vision-based pose constraints are shown in the figure as black and green plots, respectively. The delayed-state information-based revised trajectory resulting from the fusion of the two is shown in red. No motion was accumulated in those cases when not enough SIFT points were obtained during the computation of vision-based pose constraints. The effect of noninformative vision-based poses at some iterations can be efficiently modelled in our approach simply by computing motion predictions from odometry without performing map updates.

Because the raw odometry is quite poor, especially when the vehicle turns, and the vision-based pose constraints can fail in translation estimation but provide quite accurate rotation estimates, our SLAM gives more weight to the translation measurements provided by the odometry and to the rotation estimated using SIFT points. Whereas odometry error accumulates monotonically, vision-based pose constraints vary in accuracy, from very accurate (about 3 to 4 cm) to as large as 1 meter in estimation error. Nonetheless, the fusion of both provides a consistently revised pose estimate.


At each step, both distances (Mahalanobis and Bhattacharyya) are used to test for the presence of loop closure events. When a loop-closing event is announced, visual matching is performed among the corresponding indexed images. If sufficient point matches are found, the homogeneous transformation encoding the pose constraint is used to update the history of poses and their associated information terms. Figure 5 shows in blue the correction of the entire history of poses after a link of about 50 meters is established. Also shown are the corresponding covariance estimates as maintained by the filter in information form.

VI. CONCLUSIONS AND FUTURE WORK

This paper proposed an efficient approach to the vision-based loop closing problem in delayed-state robot mapping, based on analysing the similarity and separability of pose estimates with mean and covariance difference measures. The mapping process is based on an exactly sparse delayed-state filter that uses SIFT features to compute pose constraints. Another type of feature that we seek to explore in the future is Speeded Up Robust Features (SURF) [12]. These features have response properties similar to SIFT, replacing Gaussian convolutions with Haar convolutions at a significant reduction in computational cost.

We can conclude by saying that, concerning the accuracy of the pose estimation, our approach performs well when comparing the estimated trajectory with ground truth points, and considerably reduces memory use and execution time by using a sparse information matrix to link consecutive measurements (time and memory increase linearly, compared to the quadratic cost of the traditional EKF).

On the other hand, concerning the loop closure process, the method introduced here is a straightforward approach consistent with the likelihood of estimates. Instead of searching for visual correspondences within the whole history of images, we restrict our search by applying conservative tests for loop-closure event detection in terms of mean closeness and covariance separability. The experimental results show that these tests avoid unnecessary and unreliable links, substantially reducing the search time. Moreover, the established links enormously reduce the accumulated errors over the history of poses, as shown in Figure 5, by adding only highly informative elements to the filter.

The experimental results presented here constitute a preliminary study on the use of 6D vision-based pose transforms for outdoor mapping. We are currently testing this vision-based technique for mapping of wider areas, on the order of 10,000 square meters, where robust loop closing methods such as the one presented in this paper are necessary to reduce the accumulated errors in the mapping process. Our focus now is on pursuing strategies for intelligent pruning of the number of states in the filter, so as to keep the problem tractable for large areas.

ACKNOWLEDGEMENTS

This research is supported by the Spanish Ministry of Education and Science under a Juan de la Cierva Postdoctoral Fellowship to V. Ila and a Ramon y Cajal Postdoctoral Fellowship to J. Andrade-Cetto, the NAVROB Project DPI-2004-5414, and the EU URUS Project FP6-IST-045062.

REFERENCES

[1] P. Newman and K. Ho, "SLAM loop-closing with visually salient features," in Proc. IEEE Int. Conf. Robot. Automat., Barcelona, Apr. 2005, pp. 635–642.

[2] R. Eustice, H. Singh, and J. Leonard, "Exactly sparse delayed-state filters for view-based SLAM," IEEE Trans. Robot., vol. 22, no. 6, pp. 1100–1114, Dec. 2006.

[3] C. G. Harris and M. Stephens, "A combined corner and edge detector," in Proc. Alvey Vision Conf., Manchester, Aug. 1988, pp. 189–192.

[4] J. Shi and C. Tomasi, "Good features to track," in Proc. 9th IEEE Conf. Comput. Vision Pattern Recog., Seattle, Jun. 1994, pp. 593–600.

[5] A. J. Davison, I. Reid, N. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Trans. Pattern Anal. Machine Intell., 2007, to appear.

[6] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, 2004.

[7] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Trans. Pattern Anal. Machine Intell., vol. 27, no. 10, pp. 1615–1630, 2005.

[8] W. Y. Kim and A. C. Kak, "3D object recognition using bipartite matching embedded in discrete relaxation," IEEE Trans. Pattern Anal. Machine Intell., vol. 13, no. 3, pp. 224–251, Mar. 1991.

[9] M. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Comm. ACM, vol. 24, pp. 381–385, 1981.

[10] S. Thrun, Y. Liu, D. Koller, A. Y. Ng, Z. Ghahramani, and H. Durrant-Whyte, "Simultaneous localization and mapping with sparse extended information filters," Int. J. Robot. Res., vol. 23, no. 7-8, pp. 693–716, Jul. 2004.

[11] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. San Diego: Academic Press, 1990.

[12] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. 9th European Conf. Comput. Vision, ser. Lect. Notes Comput. Sci., vol. 3951. Graz: Springer-Verlag, 2006, pp. 404–417.