
Obstacle detection using sparse stereovision and clustering techniques

Sebastien Kramm and Abdelaziz Bensrhair

Abstract— We present a novel technique for localisation of scene elements through sparse stereovision, targeted at obstacle detection. Target applications include autonomous driving and robotics. Given a sparse 3D map computed from low-cost features and containing many matching errors, we present a technique that can achieve real-time localisation of all potential obstacles in front of the camera pair. We use v-disparity histograms to identify relevant depth values, and extract from the 3D map successive subsets of points that correspond to these depth values. We then apply a clustering step that provides the localisation of the corresponding elements. These clusters are finally used to build a set of potential obstacles, considered as high-level primitives. Experimental results on real images are provided.

I. INTRODUCTION

The work presented here targets real-time scene perception. This is a crucial step for many practical applications, such as ADAS1 providing obstacle detection, or autonomous robotics. In the past years, much progress has been made on low-level perception. Many sensors have been tried, each of them requiring specific processing. Multisensor approaches have been shown to achieve the highest performance [1], but vision and stereovision usually play a central role in the overall sensor architecture. However, while the low-level part of this task has been widely covered, the practical and low-cost implementation of robust stereovision systems is still a challenging task. While many papers have proposed different vision-based techniques, they usually have drawbacks or consider only specific situations, making them unsuitable for the general case. Some techniques are based on a single camera, but we consider stereovision a better solution, as camera costs have dropped considerably in the past years. Stereovision has the advantage of directly providing depth information. We consider here sparse stereovision, and present an approach that we believe to be robust in real-world situations, where the original 3D map is corrupted with noise.

A. Stereovision techniques

Stereovision can be divided into two approaches. The most researched and advanced are the dense approaches, where depth information is computed for every image pixel, usually through correlation or graph-cut techniques. These techniques are costly, and researchers now focus on optimisations to improve them [2]. This approach gives visually pleasant depth maps, but these do not directly provide information about the content of the scene. Usually,

Sebastien Kramm is with University of Rouen, 76130 Mt St Aignan, France [email protected]

Abdelaziz Bensrhair is with INSA Rouen, 76800 St Etienne du Rouvray, France [email protected]

1 Advanced Driver Assistance Systems

a segmentation or connected-component labeling step is required to achieve computer recognition, thus increasing the computing cost. Moreover, the amount of data that needs to be matched and processed is directly related to image resolution, and this can become a problem when targeting real-time applications. While computer processing power has indeed made a great leap, camera resolutions have increased in a similar way, so the computation cost still needs to be considered.

On the other side, sparse approaches first extract relevant information (contours or other features) using fast detectors. Only these features are matched, using their attributes, which greatly reduces the amount of processed data. The matching step produces a set of 3D points, i.e. a set of points defined by their position in the image and by the disparity or depth of the point. This enables the implementation of real-time systems. Once we have computed the sparse 3D map, we can apply pattern recognition techniques to identify scene elements from a much smaller dataset. These techniques are only suitable for fields where the contours of elements alone are useful. This is the case in automobile applications, where scene elements appear as distinctive contours in images.

However, while fast, these techniques are very sensitive to noise. They can produce reliable information when source images are of good quality, but fail in more difficult situations. Numerous matching errors can occur, producing outlier 3D points in the final dataset. This paper focuses on this issue: how can we build a high-level representation of the scene when the sparse 3D map is noisy? We do not consider here the computation of this map, but only the issue of making the most out of it, under a real-time constraint.

We have experimented with this approach in an embedded automotive obstacle detection context, but other application fields (e.g., autonomous robots) could use this technique.

B. Previous and related works

Stereovision as a scene analysis tool has become attractive recently, due to the increase in computing power. One of the first practical implementations in the automobile field was presented in [3]. Most of the recent works on automobile environment perception use the "dense" approach, such as [4], [5], [6], [7]. A review of techniques for obstacle detection can be found in [8].

For the sparse techniques, an approach has been proposed in [9], which uses [10] for the matching step, and the v-disparity approach proposed in [11] for the scene analysis step.


V-disparity representation: V-disparity [11] is an image transformation that maps each 3D point x = (u, v, d) to a point x_vd = (d, v). In such a space, the ground plane appears as a diagonal line, and all vertical scene elements located at a given distance appear as vertical line segments, as shown in Fig. 1. It then becomes theoretically easy to locate these using a two-step technique: first, build a horizontal histogram of the number of points at a given disparity, then locate the segment precisely using the Hough transformation.

Fig. 1. Illustration of the v-disparity representation for a typical road scene.

Once the depths of the obstacles are located, the authors propose other simple heuristics for locating potential obstacles. However, we argue that these techniques only succeed on simple scenes producing "clean" 3D maps, and that they fail in more complex scenes such as urban situations. In these situations the 3D map can get very noisy: many matching errors can occur, producing 3D points with erroneous disparities. With a noisy map such as the one shown in Fig. 2, localisation of line segments with a Hough transformation becomes difficult, hence the need for the more advanced technique presented here.

The v-disparity technique has since been widely used for obstacle detection [12], [13], [14], and has also been extended. Some authors [15] proposed applying this principle to the vertical axis of the image as well, to achieve similar goals.

Obstacle detection: On the specific task of collision avoidance and obstacle detection through stereovision, a review can be found in [8], which introduces the concepts of Hypothesis Generation (HG) and Hypothesis Validation (HV). A full stereo system is proposed in [16], based on both IPM2 and lane marking detection, while dense approaches are proposed in [17], [18]. A full system is described in [19]; however, it focuses only on the detection of the leading vehicle, while we consider here the whole scene as viewed by the pair of cameras. Recently, an approach adding temporal consistency between frames has been proposed to add robustness using low-level image features [20].

C. Method summary

The idea is to consider the problem as a data-mining situation and build a high-level representation of the scene. Instead of searching for low-level features (line segments or others) in the sparse 3D map or its related v-disparity map, we extract from the 3D map a subset of points whose disparity corresponds to a given mode in the disparity histogram. This

2 Inverse Perspective Mapping

Fig. 2. Example of a real source image and the corresponding 3D map used as input data of our method. Note the high amount of noise in the latter.

is the first step. In order to achieve localisation in the image, the second step uses a clustering algorithm to identify points that are related to the same scene elements. The task of finding modes in a histogram can be highly unreliable, so we use a multiscale approach that produces several histograms. On each of these, we search for the corresponding modes, and keep only those that were not already found on previous scales. We end up with a set of modes that we use to feed the extraction and clustering steps above. This produces a set of clusters for each of these modes.

The multiscale mode extraction has a drawback: some modes are close and/or redundant, thus producing different clusters that are in fact related to the same scene element. The final step consists in merging close clusters into scene elements (SE), which fully define a relevant element in the scene, including its depth, computed from the disparity value. The whole process is summarized in Fig. 3. At each of these steps, we apply a noise filtering strategy adapted to the application constraints. All these steps are presented in the following section.

Fig. 3. Overview of the proposed method.


II. THE PROPOSED METHOD

In this section we provide the details of the proposed method. We assume that a sparse 3D map has already been computed for a given scene using a pair of aligned cameras, for example with the technique described in [20]. We consider here the classical "aligned" camera situation, where the depth z of a point (its distance to the camera) is directly related to its disparity d (the horizontal pixel distance between matched elements in the two images) by the expression z = bf/d, with b the baseline between the two cameras and f the focal length.
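For illustration, a minimal sketch of this conversion, assuming rectified cameras, a baseline in meters, and a focal length in pixels (the function and parameter names below are ours, not the paper's):

```python
import numpy as np

def disparity_to_depth(d, baseline_m, focal_px):
    """Convert disparity (pixels) to depth (meters) for rectified cameras.

    Uses the classical aligned-camera relation z = b*f/d; a zero
    disparity maps to infinite depth, so we guard against it.
    """
    d = np.asarray(d, dtype=float)
    return np.where(d > 0, baseline_m * focal_px / np.maximum(d, 1e-9), np.inf)

# Example (illustrative numbers): a 0.3 m baseline, 700 px focal length
# depths = disparity_to_depth([80, 40, 10], baseline_m=0.3, focal_px=700.0)
```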

A. Reducing the amount of data

The sparse 3D map is defined by a set of n points S = {x1, ..., xn}, where each point corresponds to a tuple xi = (u, v, d), with u, v the coordinates of the point in the source image and d its disparity. The disparity can be used afterwards to retrieve depth, directly if the cameras are perfectly aligned, or through a 3D reconstruction algorithm if not.

From S, we build a v-disparity representation of the set as proposed by [11]. As the authors explain, all the 3D points lying on the ground plane appear as a diagonal line in the v-disparity map. This line can be easily detected using the Hough transformation. We can consequently discard the points beneath this ground plane. However, we extend this principle further by considering the fact that relevant scene elements have a limited height h0. These elements appear in the image with a pixel height hp = k h0/z, with z the metric distance from the camera center (depth) and k some scaling factor. As the depth z is linearly dependent on the inverse of the disparity d, we can write hp = k h0 d/(bf). This means that any object of height h0 will appear in the v-disparity representation as a line going from the vanishing point (null height) at infinite depth (z = ∞ ⇒ d = 0) to a maximum value given by the applicative field. This is illustrated in Fig. 4. Therefore, similarly to the ground plane, we define a top plane that we use in conjunction with the v-disparity representation to remove all the points that appear to be above this plane. This process is illustrated in Fig. 5.
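A possible sketch of this out-of-field filtering, assuming the ground plane and top plane have already been estimated as lines in the (d, v) space; the line parameterization and all names below are illustrative assumptions, not the paper's notation:

```python
import numpy as np

def filter_out_of_field(points, ground_a, ground_v0, top_slope, v_vanish):
    """Keep only the 3D points lying between the ground plane and the top plane.

    points : (n, 3) array of (u, v, d) tuples.
    Assumed ground line (found via Hough): v = ground_a * d + ground_v0.
    Assumed top line: v = v_vanish + top_slope * d, derived from the
    maximum element height h0 and the camera parameters.
    Image v grows downward, so valid points satisfy top(d) <= v <= ground(d).
    """
    v, d = points[:, 1], points[:, 2]
    v_ground = ground_a * d + ground_v0
    v_top = v_vanish + top_slope * d
    mask = (v <= v_ground) & (v >= v_top)
    return points[mask]
```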

Removing isolated points: Once we have removed the points that are out of the considered field, we apply a simple filtering strategy: we discard isolated points, i.e. points that have fewer than 2 neighbors within a distance dr (typically, 15 pixels). This is based on the assumption that such points are probably noise, as a relevant scene element would generate a cloud of close points. Typically this removes 3 to 5% of the points.
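This neighbor test can be sketched with a k-d tree; a sketch under the same (u, v, d) array assumption as above, not the authors' implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_isolated_points(points, radius=15.0, min_neighbors=2):
    """Discard points having fewer than `min_neighbors` neighbors
    within `radius` pixels in the (u, v) image plane."""
    tree = cKDTree(points[:, :2])  # neighborhood search in image space
    counts = np.array([len(tree.query_ball_point(p, radius)) - 1  # minus self
                       for p in points[:, :2]])
    return points[counts >= min_neighbors]
```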

B. Extracting modes from v-disparity histogram

Given a reduced set of 3D points, we can begin extracting information. We build histograms of the disparity of the points to achieve localisation of relevant disparities. However, in noisy situations such as the one considered here, this step may discard some relevant scene elements, for example if they are at very close distances along the optical axis of the camera set, but clearly separated in the image. If we use a single histogram, we might only fetch one of the two, as shown by Fig. 6.

Fig. 4. An example of v-disparity representation of a sparse 3D map. The ground plane (blue) is automatically determined using a Hough transform, while the top plane (green) is computed using both a given maximum height of scene elements and the parameters of the camera set.

Fig. 5. Automatic removal of out-of-field points using the ground plane and top plane, see section II-A.

In order to correctly handle this situation, we use a multiscale approach.

For a given v-disparity representation whose disparity lies in the range [0, dmax], we build n histograms hi of m bins, with m = si dmax and si a scale factor. We used n = 6 integer scale coefficients (si ∈ {1, . . . , 6}).

For each histogram hi, we fill bin j with the number of points whose scaled disparity si d falls into that bin. This produces histograms with a high standard deviation, making them unsuitable for directly extracting modes. So we apply a mean filtering to the bins, after which we can clearly identify the modes. We show in Fig. 7 an example of histograms with and without this filtering, which clearly shows the more stable modes.
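A minimal sketch of this multiscale histogram construction, using a simple moving-average kernel for the mean filtering (the window size is an assumption, as the paper does not specify it):

```python
import numpy as np

def build_multiscale_histograms(disparities, d_max=80.0, n_scales=6, win=3):
    """Build n scaled disparity histograms and smooth each with a
    moving-average (mean) filter so that modes stand out.

    For scale s_i, the histogram has m = s_i * d_max unit-width bins,
    so the bin index of a point is floor(s_i * d)."""
    histograms = []
    for s in range(1, n_scales + 1):
        m = int(s * d_max)
        h, _ = np.histogram(np.asarray(disparities) * s,
                            bins=m, range=(0, s * d_max))
        kernel = np.ones(win) / win  # mean filter
        histograms.append(np.convolve(h, kernel, mode='same'))
    return histograms
```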

We detect modes in the histograms by retrieving all the successive bins that have a value higher than a threshold. A mode is defined by its position (the corresponding disparity value) and by its width. The position of the mode is defined as the weighted mean value of the disparities of the corresponding bins. In order to avoid a parameter setting, we choose an auto-adaptive threshold defined as the sum of the filtered histogram's mean value µ and standard deviation σ.
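The mode detection step could then look as follows; a sketch assuming the unit-width scaled bins from the previous snippet, with the width convention being our assumption:

```python
import numpy as np

def extract_modes(hist, scale):
    """Extract modes from a filtered histogram: runs of consecutive bins
    above the auto-adaptive threshold mu + sigma. Each mode is returned
    as (position, width), the position being the weighted mean of its
    bins, converted back to the original disparity units."""
    threshold = hist.mean() + hist.std()
    above = hist > threshold
    modes = []
    start = None
    for i, flag in enumerate(np.append(above, False)):  # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            bins = np.arange(start, i)
            position = np.average(bins, weights=hist[start:i]) / scale
            modes.append((position, (i - start) / scale))
            start = None
    return modes
```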


Fig. 6. Toy example of how a multiscale disparity histogram can detect two scene elements that have close disparities: with s = 1, we only fetch a mode at d = 3.5. With s = 2, we are able to detect the two elements, at d = 3 and d = 4.

Fig. 7. Example of two disparity histograms, with and without mean filtering.

C. Merging the different modes

For each of the n histograms hi, we produce a set of modes Si = {mi1, mi2, . . . }, as shown in Fig. 8. At each iteration, we add a mode mij to the final set of modes Sm if and only if it is not already found in Sm, to avoid overloading this list with several copies of the same mode. As these modes are computed from different scales, we will actually never find exactly the same one in Sm, so we define a margin value q, related to the considered depth, for adding a mode to the set. Formally, given a mode mi with a disparity value di, we iterate over Sm and extract all the modes mj with disparity values dj. We add the mode mi to Sm only if |di − dj| > q for all of them, with q = d0 + kq di, so that the margin is a linear function of the considered disparity. Typical values are d0 = 1.5 and kq = 0.1, which produce a range q ∈ [1.5; 9.5] for a maximum disparity value dmax = 80 pixels. In practice, we end up with 6 to 12 modes that need to be explored using the clustering step.
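A sketch of this merging rule, assuming the per-scale mode lists produced by the previous step:

```python
def merge_modes(mode_lists, d0=1.5, kq=0.1):
    """Merge the modes found at the different scales into a single list,
    keeping a new mode only if no already-kept mode lies within the
    disparity-dependent margin q = d0 + kq * d."""
    kept = []
    for modes in mode_lists:          # one list per scale, in scale order
        for d_i, width in modes:
            q = d0 + kq * d_i
            if all(abs(d_i - d_j) > q for d_j, _ in kept):
                kept.append((d_i, width))
    return kept
```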

D. Extracting 3D points

For each of the modes of Sm, we extract the corresponding subset of points from the sparse 3D map. Formally, for a mode mi defining a disparity value di, we add to the current subset all the points whose disparity satisfies di − w < d < di + w, with w = ks di, so that the corresponding depth tolerance remains roughly constant over the considered range. This produces a set of points of limited size in which we search for significant clusters.
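The extraction step can be sketched as follows; the paper does not give the value of ks, so the default below is purely illustrative:

```python
import numpy as np

def points_for_mode(points, d_i, ks=0.1):
    """Extract from the sparse 3D map (an (n, 3) array of (u, v, d))
    the points whose disparity lies within d_i +/- w, with w = ks * d_i
    so that the corresponding depth tolerance stays roughly constant."""
    w = ks * d_i
    d = points[:, 2]
    return points[np.abs(d - d_i) < w]
```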

Fig. 8. Multiscale extraction of histogram modes from the v-disparity representation. The blue bins are those higher than the threshold, computed as the sum of the mean value and the standard deviation.

E. Clustering

The choice of a clustering algorithm needs to be considered. First, we must assume a high level of outlier points, which must not generate clusters. Second, we have no idea of how many elements we need to identify. Usually, in such a case, hierarchical algorithms are preferred. However, the high number of outliers makes this approach difficult. On the other hand, the K-means algorithm gives good results, but with the drawback that one needs to set the number of clusters. This cannot be done here, as we do not know how many scene elements are present at a given depth.

The best solution is to use an algorithm that can cope with noise while correctly identifying patterns in the dataset. We use one of the most popular: the DBSCAN algorithm [21]. It has a low computational cost and its implementation is simple. It is based on the concept of density reachability: in short, to be included in a cluster, a point must have a minimum number of neighbors located at a distance less than a threshold. Its drawback is that it needs a distance threshold value, which can be difficult to set in some unsupervised situations. Some authors have provided a solution to this requirement [22], but we do not need it here. Instead we use the application constraints to compute an adaptive distance threshold. Indeed, each clustering operation occurs on a set of points that all have approximately the same disparity value, thus they all correspond to a given depth. Moreover, a given scene element appears smaller in the image at low disparity values, so its edge points in the 3D map are closer to one another than those of a nearby object. Conversely, a closer object appears larger and its edge points are more spread out, hence the need for a larger threshold to correctly identify it.

So again we compute an adaptive threshold td that is a linear function of the current disparity di, given by the histogram mode: td = t0 + kt di, with t0 and kt two empirically defined coefficients. In our experiments, we used a disparity range equal to [0, 80] pixels, and we chose these coefficients so that they produce, over that range, a threshold in the range [5, 30] pixels. This gives t0 = 5 and kt = 0.31.
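A sketch of this clustering step using the DBSCAN implementation from scikit-learn; the paper does not name an implementation, and the min_samples value below is an assumption:

```python
from sklearn.cluster import DBSCAN

def cluster_mode_points(points, d_i, t0=5.0, kt=0.31, min_pts=5):
    """Cluster the (u, v) positions of the points extracted for one
    histogram mode, using DBSCAN with the adaptive distance threshold
    t_d = t0 + kt * d_i (about 5 px for far elements, up to 30 px close up)."""
    t_d = t0 + kt * d_i
    labels = DBSCAN(eps=t_d, min_samples=min_pts).fit_predict(points[:, :2])
    # label -1 marks outliers; group the remaining points by cluster id
    return [points[labels == k] for k in set(labels) if k != -1]
```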


Once we have computed the cluster set for a given histogram mode and the corresponding point set, we use some application constraints to discard the clusters that have a high probability of being just noise. We compute several attributes for each cluster and discard those that meet any of these three criteria:

• the number of points n is less than a value nmin (typically, 5 points);

• the area a of the convex hull of the points of the cluster is less than a value amin (typically, 20 square pixels);

• the ratio of n to a (the density of the cluster) is less than a value densmin (typically, 10−2).

This allows us to remove non-significant clusters before proceeding to the next step, but this filtering can be tuned or removed if insufficient sensitivity is observed.
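A sketch of this filtering, using scipy's ConvexHull for the area computation (for 2D points, the hull's `volume` attribute is its area); degenerate clusters are simply discarded here, which is our assumption:

```python
from scipy.spatial import ConvexHull, QhullError

def filter_clusters(clusters, n_min=5, a_min=20.0, dens_min=1e-2):
    """Discard clusters that are probably noise, following the three
    criteria above (point count, hull area, density)."""
    kept = []
    for pts in clusters:
        n = len(pts)
        if n < n_min:
            continue
        try:
            area = ConvexHull(pts[:, :2]).volume  # 2D: .volume is the area
        except QhullError:
            continue  # degenerate (e.g. collinear) cluster: treat as noise
        if area < a_min or n / area < dens_min:
            continue
        kept.append(pts)
    return kept
```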

F. Merging clusters into "scene elements"

Once we have our set of clusters, we need to generate the final set of scene elements (SE). Recall that we have a set of sets of clusters: each cluster set corresponds to a given histogram mode, and inside a set, the clusters are clearly separated. But a different mode might also produce a cluster that corresponds to the same area in the image. Moreover, as the source 3D map clearly contains outlier points, we can produce erroneous clusters that either overlap correct ones, or even just group together noise points. Hence the need for a merging and filtering step. From a practical point of view, we first copy all the valid clusters into a set of SE, that is, we copy the corresponding points and all the computed attributes, and then we filter and merge this set.

First, we check the spatial consistency of all the SE based on their convex hulls: when comparing two SE e1 and e2, if the hull of e2 is completely inside the hull of e1 and the disparity of e2 (defined as the mean disparity of all its points) is less than the disparity of e1, then this corresponds to an impossible situation, as e2 should not be visible in the image. In this case, we discard e2. This is illustrated by Fig. 9.
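One way to sketch this consistency check is to test hull containment with a Delaunay triangulation of the outer element's points; representing each SE as a dict with 'points' and 'disparity' keys is our illustrative assumption:

```python
import numpy as np
from scipy.spatial import Delaunay

def is_hull_inside(inner_pts, outer_pts):
    """True if all of `inner_pts` lie inside the convex hull of
    `outer_pts` (2D image coordinates)."""
    tri = Delaunay(outer_pts[:, :2])
    return np.all(tri.find_simplex(inner_pts[:, :2]) >= 0)  # -1 means outside

def check_spatial_consistency(elements):
    """Discard a scene element whose hull is fully contained in another
    element's hull while having a smaller mean disparity, i.e. it would
    lie behind an object that should occlude it."""
    keep = []
    for i, e in enumerate(elements):
        occluded = any(i != j
                       and is_hull_inside(e['points'], other['points'])
                       and e['disparity'] < other['disparity']
                       for j, other in enumerate(elements))
        if not occluded:
            keep.append(e)
    return keep
```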

Fig. 9. Example of overlapping SE: if the disparity of e2 is lower than the disparity of e1, then we discard e2.

Second, we try to merge SE that are close, based on the following constraints. As the point extraction step described in section II-D does not remove any points, it is perfectly possible to have different clusters, computed at different histogram modes, that share a common set of points. This does not make sense considering the physical reality behind the data, so we merge together SE that match one of these two conditions (see the sketch below):

• they share more than a certain proportion of points (each point in the 3D map gets assigned an index, so we can easily identify them afterwards); typically, we set this to 10 to 20%;

• the difference between their mean disparity values is less than a threshold (whose value is computed similarly as in section II-C), and the mean distance between the underlying corresponding 2D points is less than a threshold value (so we do not merge SE that are too far apart).

After every merging operation, we need to recompute all the attributes of all the remaining SE before proceeding to another merge operation. We end up with a set of SE for each frame, each of these corresponding to a potential obstacle that needs to be considered.
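A sketch of the merge decision itself; the element fields, the use of centroid distance to approximate the mean 2D distance, and the dist_max bound are all illustrative assumptions:

```python
import numpy as np

def should_merge(e1, e2, share_ratio=0.15, d0=1.5, kq=0.1, dist_max=40.0):
    """Decide whether two scene elements describe the same physical
    object, following the two conditions above. Each element is assumed
    to carry a set of point indices ('ids'), a mean disparity, and a 2D
    centroid."""
    # Condition 1: shared points (points carry a unique index).
    shared = len(e1['ids'] & e2['ids'])
    if shared > share_ratio * min(len(e1['ids']), len(e2['ids'])):
        return True
    # Condition 2: close mean disparities AND close image positions.
    q = d0 + kq * min(e1['disparity'], e2['disparity'])
    centroid_dist = np.linalg.norm(e1['centroid'] - e2['centroid'])
    return abs(e1['disparity'] - e2['disparity']) < q and centroid_dist < dist_max
```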

III. EXPERIMENTAL RESULTS

The complete algorithm has been tested on a full urban sequence of 3000 frames. Recall that it makes no assumptions about what the detected elements are, as long as they have enough 3D points and the area of the corresponding cluster is above a minimal threshold. Due to space constraints, only four successive frames of a typical urban scene are shown in Fig. 10. They show the final detected elements, represented by their convex hulls. About 20 to 30 SE were detected on each image, but the filtering techniques presented in the previous section are able to correctly remove the non-relevant ones, and merge the correct ones together.

One can see that the main potential obstacles are correctly handled (the car and motorcycle in the foreground), as well as the vehicle in front of them that is partially occluded. Most of the secondary elements (sidewalk) are also detected, but with less stability. This is due to the fact that the hull is an unstable feature, as a single point can greatly modify its shape.

IV. CONCLUSION

We have presented a novel technique for identifying relevant scene elements that can potentially turn into obstacles. It can mostly run without parameter adjustment, due to an auto-adaptive threshold approach. It can be adapted to most stereovision situations where only the distance to the contours of elements is needed. Experiments on real-world images have been shown and the results are promising, as previous similar techniques would fail on this type of images. Further work will be aimed at temporal consistency: a matching step between scene elements detected at successive frames can be implemented, thus adding robustness to the system. However, preliminary results show that this task is far from easy: the image location of the obstacles can change between two frames, due to camera vibrations, and the attributes of scene elements are heterogeneous.

Acknowledgments

We would like to thank the Valeo group for allowing us to use their video data, and the anonymous reviewers for their comments, which helped improve this paper.


Fig. 10. Example of results produced by the proposed method on 4 successive frames, with disparity color-coding.

REFERENCES

[1] Y.-L. Chen, V. Sundareswaran, C. Anderson, A. Broggi, P. Grisleri, P. Porta, P. Zani, and J. Beck, "TerraMax: Team Oshkosh urban robot," in The DARPA Urban Challenge, ser. Springer Tracts in Advanced Robotics. Springer Berlin / Heidelberg, 2009, vol. 56, pp. 595–622.

[2] S. K. Gehrig, F. Eberli, and T. Meyer, "A real-time low-power stereo vision engine using semi-global matching," in International Conference on Computer Vision Systems (ICVS), 2009, pp. 134–143.

[3] M. Bertozzi, A. Broggi, A. Fascioli, and S. Nichele, "Stereo vision-based vehicle detection," in IEEE Intelligent Vehicles Symposium (IV), 2000.

[4] A. Broggi, C. Caraffi, P. Porta, and P. Zani, "The single frame stereo vision system for reliable obstacle detection used during the 2005 DARPA grand challenge on TerraMax," in Intelligent Transportation Systems Conference (ITSC), 2006, pp. 745–752.

[5] W. Miled, J. Pesquet, and M. Parent, "Robust obstacle detection based on dense disparity maps," in Computer Aided Systems Theory – EUROCAST. Springer Berlin / Heidelberg, 2007, vol. 4739, pp. 1142–1150.

[6] C. Caraffi, S. Cattani, and P. Grisleri, "Off-road path and obstacle detection using decision networks and stereo vision," IEEE Trans. on Intelligent Transportation Systems, vol. 8, no. 4, pp. 607–618, 2007.

[7] M. Bai, Y. Zhuang, and W. Wang, "Stereovision based obstacle detection approach for mobile robot navigation," in International Conference on Intelligent Control and Information Processing, 2010, pp. 328–333.

[8] Z. Sun, G. Bebis, and R. Miller, "On-road vehicle detection: a review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 5, pp. 694–711, 2006.

[9] J. Rebut, G. Toulminet, and A. Bensrhair, "Road obstacles detection using a self-adaptive stereo vision sensor: a contribution to the ARCOS French project," in IEEE Intelligent Vehicles Symposium (IV), 2004.

[10] A. Bensrhair, P. Miche, and R. Debrie, "Fast and automatic stereo vision matching algorithm based on dynamic programming method," Pattern Recognition Letters, vol. 17, no. 5, pp. 457–466, 1996.

[11] R. Labayrade, D. Aubert, and J.-P. Tarel, "Real time obstacle detection on non flat road geometry through 'v-disparity' representation," in IEEE Intelligent Vehicles Symposium (IV), vol. 2, 2002, pp. 646–651.

[12] N. Soquet, D. Aubert, and N. Hautiere, "Road segmentation supervised by an extended V-disparity algorithm for autonomous navigation," in IEEE Intelligent Vehicles Symposium (IV), 2007.

[13] A. D. Sappa, R. Herrero, F. Dornaika, D. Geronimo, and A. Lopez, "Road approximation in euclidean and v-disparity space: A comparative study," in EUROCAST, 2007, pp. 1105–1112.

[14] J. Zhao, M. Whitty, and J. Katupitiya, "Detection of non-flat ground surfaces using v-disparity images," in IEEE International Conference on Intelligent Robots and Systems, 2009.

[15] Z. Hu and K. Uchimura, "U-V-disparity: an efficient algorithm for stereovision based scene analysis," in IEEE Intelligent Vehicles Symposium (IV), 2005, pp. 48–54.

[16] H. Jung, Y. Lee, B. J. Kim, P. J. Yoon, and J. H. Kim, "Stereo vision-based forward obstacle detection," International Journal of Automotive Technology, vol. 8, no. 4, pp. 493–504, 2007.

[17] S. Kwon, J.-H. Lee, I. tae Na, and H. Jung, "On-road obstacle detection and tracking system using robust global stereo vision method," in SPIE Signal and Data Processing, vol. 7698, 2010.

[18] Y. Xie and Y. Zheng, "Stereo vision-based obstacle detection using dense disparity map," in International Conference on Graphic and Image Processing (ICGIP), vol. 8285, 2011.

[19] J. Hwang and K. Huh, "Vehicle detection system design based on stereo vision sensors," International Journal of Automotive Technology, vol. 10, pp. 373–379, 2009.

[20] M. El-Ansari, S. Mousset, and A. Bensrhair, "Temporal consistent real-time stereo for intelligent vehicles," Pattern Recognition Letters, vol. 31, no. 11, pp. 1226–1238, 2010.

[21] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in International Conference on Knowledge Discovery and Data Mining (KDD), 1996, pp. 226–231.

[22] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, "Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications," Data Mining and Knowledge Discovery, vol. 2, pp. 169–194, 1998.
