
International Master’s Thesis

Model-based object tracking with an infrared stereo camera

Juan Manuel Rivas Diaz

Technology

Studies from the Department of Technology at Örebro University, Örebro 2015


Model-based object tracking with an infrared stereo camera


Studies from the Department of Technology at Örebro University

Juan Manuel Rivas Diaz

Model-based object tracking with an infrared stereo camera

Supervisor: Todor Stoyanov

Examiners: Henrik Andreasson

Daniel Canelhas


© Juan Manuel Rivas Diaz, 2015

Title: Model-based object tracking with an infrared stereo camera

ISSN


Abstract

Object tracking has become very important in the field of robotics in recent years. Frequently, the goal is to obtain the trajectory of the tracked target over time and space by acquiring and processing information from the sensors.

In this thesis we are interested in tracking objects at a very short range. The primary application of our approach targets the domain of object tracking during grasp execution with a hand-in-eye sensor setup. To this end, a promising approach investigated in this work is based on the Leap Motion sensor, which is designed for tracking human hands. However, we are interested in tracking grasped objects, thus we need to extend its functionality.

The main goal of the thesis is to track the 3D position and orientation of an object from a set of simple primitives (cubes, cylinders, triangles) over a video sequence. That is the reason we have designed and developed two different approaches for tracking objects with the Leap Motion device as a stereo vision system.

Keywords: Stereo Vision, Tracking, Leap Motion, Robotics, Particle filter.



Acknowledgements

This thesis is the end of an amazing year in Sweden, where I had the opportunity to grow as an adult and in my academic career. First and foremost, this would not have been possible without the endless love, encouragement and support of my family.

Second, to my friends Javi, Zurita, Nelson, Belen, Carra... even with the distance, I know that you were there. Third, to all the friends of different nationalities that I made here, and especially to those who shared the good and bad moments with me in the lab T002 while making this thesis.

Last but not least, to all the great Master's professors, and especially to my supervisor Todor, who always helped me when I needed it.

To all of them, THANKS.



Contents

1 Introduction
  1.1 Motivation
  1.2 Problem statement
  1.3 Contributions
  1.4 Thesis Outline

2 Background
  2.1 Previous Work
  2.2 Computer Vision
    2.2.1 Pinhole Camera
    2.2.2 Distortions
    2.2.3 Stereo cameras and epipolar geometry
    2.2.4 Canonical Stereo Vision
    2.2.5 Stereo Camera Parameters
  2.3 Feature Detection
    2.3.1 Canny edge detector
    2.3.2 Distance Fields
    2.3.3 FAST detector
    2.3.4 SURF descriptor
    2.3.5 Matching Keypoints
  2.4 Filters
    2.4.1 Bayesian Filter
    2.4.2 Particle Filters
    2.4.3 Low variance sampling

3 Methodology
  3.1 Image acquisition
  3.2 Feature Based Particle Filter
    3.2.1 Image processing
    3.2.2 Implementation Problems: Lack of visual features
  3.3 Contour Based Particle Filter
    3.3.1 Image Processing
    3.3.2 Particle Filter
    3.3.3 3D Model Generation and Projection

4 Experimental Setup
  4.1 Hardware
  4.2 Software
  4.3 Targets
  4.4 Experimental Scenario

5 Experimental Results
  5.1 Feature Based Test
  5.2 Contour Based Test

6 Conclusions and Future work
  6.1 Conclusions
  6.2 Future work

References


List of Figures

2.1 Pinhole camera example
2.2 Pinhole camera geometry
2.3 Radial distortion
2.4 Before and after radial distortion correction
2.5 Epipolar geometry
2.6 Occlusion representation
2.7 Canonical camera configuration
2.8 Stereo geometry in canonical configuration
2.9 Example of the Canny edge detector
2.10 Distance transform for a simple rectangular shape
2.11 FAST feature detection in an image patch
2.12 Haar filters used in the SURF descriptor
2.13 Haar responses in the subregions around a keypoint
2.14 Matching representation of the descriptor

3.1 Leap Motion architecture
3.2 Raw image from the Leap Motion
3.3 Example of the feature extraction steps
3.4 Flowchart of the image processing of the first approach
3.5 Image before and after applying the Canny edge detector
3.6 Distance Image
3.7 Flowchart of the Particle Filter
3.8 Flowchart of the measurement model
3.9 Representation of the different frames of the system

4.1 Leap Motion schematic
4.2 Leap Motion Controller interaction area
4.3 Cubes and rectangle objects to track used in the experiments
4.4 Triangle and cylinder to track used in the experiments
4.5 Experimental setup overview

5.1 Test 1: Plot of features detected over time at 5 cm
5.2 Test 1: Plot of good features over time at 5 cm
5.3 Test 2: Plot of features detected over time at 10 cm
5.4 Test 2: Plot of good features over time at 10 cm
5.5 Test 3: Plot of features detected over time at 15 cm
5.6 Test 3: Plot of good features over time at 15 cm
5.7 Evolution of particles in the particle filter
5.8 Evolution of the projection in the particle filter
5.9 Histogram of the weights of the particles in the initial position
5.10 Histogram of the weights of the particles in the final position

List of Tables

4.1 Summary of the experiments
5.1 Means of the features detected with the SURF and FAST algorithms
5.2 Results of the particle filter

List of Algorithms

1 Bayes Filter
2 MCL Algorithm
3 Low variance sampler

Chapter 1

Introduction

1.1 Motivation

Object tracking has become very important in the field of robotics, with multiple applications, for example object manipulation, human-machine interaction, human detection, road traffic control, and security and surveillance systems [44]. Frequently, the goal is to obtain the trajectory of the tracked target over time and space by acquiring and processing information from the sensors. Object tracking in real time is a difficult task, since it requires online processing of a large amount of data, which can be computationally expensive.

Depending on the sensor type, different approaches can be proposed for the tracking problem. For example, Bray and Koller-Meier [5] have used a particle filter for 3D hand tracking, and Muñoz-Salinas [30] has used a Kalman filter to track people based on color features.

1.2 Problem statement

In this thesis we are interested in tracking objects at a very short range. The primary application of our approach targets the domain of object tracking during grasp execution with a hand-in-eye sensor setup. To this end, a promising approach investigated in this work is based on the Leap Motion [8] sensor. It is a USB device with two IR cameras, packaged with a set of processing algorithms that can track hands in a range of 0.1 to 0.6 m above the controller. The Leap Motion is designed for tracking human hands; however, we are interested in tracking grasped objects, thus we need to extend its functionality.

Due to the company's IP protection, the firmware is closed source; thus it would be difficult to replicate its performance. The main goal of the thesis is to track the 3D position and orientation of an object from a set of simple primitives (cubes, cylinders, triangles) over a video sequence.

1.3 Contributions

The Leap Motion device is able to track hands but not objects; thus adding object tracking can extend the range of robotics applications that can use it. That is the reason we have designed and developed two different approaches for tracking objects with the Leap Motion device as a stereo vision system. In the first approach, different computer vision algorithms for extracting features from a scene were covered; however, the implementation was not possible due to the lack of features. For the second approach, a contour based particle filter is implemented, with a simple edge detector as input to the algorithm. Although the behaviour is not as expected, with future modifications and improvements it can be used as a tracking system for a robotic gripper.

Additionally, different experiments were carried out with the Leap Motion device, testing the different algorithms from the first approach and the efficiency of the contour based particle filter. The results are analysed quantitatively and qualitatively, and the reasons for the observed behaviour are given.

1.4 Thesis Outline

This section describes the organization of the content of this thesis. The document is structured in six chapters. Chapter 1 relates the reasons that motivate the realization of this project and states the goals to be achieved. In Chapter 2, all the theory needed for the thesis is summarized, including information about computer vision, geometry and filter algorithms. Two different ways to address the problem are described in Chapter 3; different flowcharts are included in order to make them easy to understand. Chapter 4 describes the experimental setup. Then, in Chapter 5 the results are presented and analysed. Finally, in Chapter 6 the conclusions obtained from the results and the future work are presented.


Chapter 2

Background

2.1 Previous Work

In the last decade, tracking objects with stereo cameras has attracted attention due to its applications in fields like robotics, road traffic control and security, some of which were described in Chapter 1. Tracking objects is a difficult problem because there is no sensor that gives perfect measurements. Laser range finders, sonars, IR sensors and cameras are all affected by uncertainty and noise. Bayesian filters can easily handle this level of uncertainty, and that is why most of the research on tracking objects with cameras consists of extensions of Bayesian filtering techniques.

Most of the proposed solutions were based on the Kalman filter [19], a variant of the Bayesian filter. These filters are optimal under the assumption that the uncertainty can be modelled with a Gaussian and that the observation and dynamic models are linear. Dellaert and Thorpe [11] have used a Kalman filter for robust car tracking with simple image processing techniques. Gennery [13] uses a modified Kalman filter for tracking known three-dimensional objects with six degrees of freedom.

As stated before, Kalman filters assume that the system is linear; however, most real world systems are not. For these situations the Extended Kalman Filter (EKF) [18] is often used. In this special case of the Kalman filter, the idea is to linearise the estimation of the state with a first-order Taylor series. Medeiros [24] has used the EKF to model and track objects in video sequences.

Due to the difficulties in the implementation and tuning of the EKF, the Unscented Kalman Filter (UKF) was presented [40]. The main difference is that the UKF state distribution is represented using a minimal set of carefully chosen sample points. Dambreville [9] uses Unscented Kalman filters for tracking deformable objects by active contours.

A different approach to Bayesian filtering is the Particle Filter [10], which can be easily implemented and does not have the drawbacks of the Kalman filter methods. Particle filters became popular with Isard and Blake [17], where contour based observations were used to track objects in clutter. Contour based tracking systems were also described by MacCormick [23] and Sullivan [38]. Colour based particle filter methods were also presented by Barrera [2], where the estimated errors were under 5 cm.

2.2 Computer Vision

Computer vision includes the methods to convert images of the real world into numerical information in order to make them tractable for a computer. It also covers the algorithms to analyse and manipulate images [27].

Different types of image data exist: single view images, stereo vision, stereo video with more than two cameras, video sequences, etc. There are also different sub-disciplines like object recognition, object tracking, scene reconstruction, pattern recognition, etc.

This chapter starts with a brief explanation of the simplest geometric camera model and continues with the stereo vision theory. Afterwards, computer vision algorithms for detecting, extracting and matching features are presented. In the end, the basics of general particle filters are explained.

2.2.1 Pinhole Camera

Different models exist to define the geometry of the projection of an object onto the camera plane. The simplest one is the Pinhole camera model.

As can be seen in Figure 2.1, a Pinhole camera is a box that is completely dark in its interior and allows light rays to pass through a single small aperture, a pinhole. The light that passes through the pinhole projects an inverted image of the scene.

Some considerations must be taken into account for this model (Figure 2.2):

• Every light ray that traverses the optical center C does not suffer any deviation. The optical center is the origin of the coordinate system.

• The optical axis is the imaginary line that starts at the optical center and crosses the image plane perpendicularly.

• The focal distance b is the distance from the optical center to the focal plane.

• The focal plane or image plane is situated at Z = b. It is the virtual plane where the image is projected without any inversion.

• The principal point p is the intersection of the optical axis with the image plane.

Figure 2.1: Pinhole Camera example [12].

This camera model provides an easy understanding of the relation between the position of a point in the real world P(X, Y, Z) and the optical system of the camera (u, v). From Figure 2.2, with direct geometric relations, the following equations can be extracted:

u = −b·x / z (2.1)

v = −b·y / z (2.2)

For this model, every point M from the real world can be converted into a point m of an image with the following relation:

m ≈ PM (2.3)

Where P is the projection matrix, which will be described in detail in the next sections.


Figure 2.2: Pinhole camera geometry [42].

2.2.2 Distortions

In order to maintain the relation from Equation 2.3, the radial distortions generated during image creation need to be considered.

The use of a lens makes the entrance of light easier, providing good focus and versatility. However, it also introduces deformations in the images taken by the sensor. One of these effects is called radial distortion, and it can be appreciated in Figure 2.3.

Figure 2.3: Radial distortion [20].

When a light ray enters the camera, the lens bends the light ray so that it hits the sensor, which records it as a greyscale brightness value at a specific pixel location. Of course, no lens is perfect, so a ray of light does not land on the sensor in the optically perfect spot. This distortion is more pronounced near the image borders, as can be seen in Figure 2.4. It also grows when the focal distance decreases or with low quality lenses [28].

The radial distortion can be modelled with a Taylor series around r = 0 [16]. Therefore, the distorted image coordinates are transformed as:

x̂ = xc + L(r)(x − xc) (2.4)

ŷ = yc + L(r)(y − yc) (2.5)

where

• (x̂, ŷ) are the corrected coordinates.

• (x, y) are the measured coordinates.

• (xc, yc) is the centre of radial distortion.

• r = √(x² + y²) is the radial distance from the centre of radial distortion.

• L(r) = 1 + k1·r² + k2·r⁴ + k3·r⁶ + ... is a distortion factor, which is a function only of the radius r.

• k1, k2 and k3 are the distortion coefficients. Normally k1 and k2 are enough; however, for lenses with high distortion (e.g. cheap or fisheye lenses) the use of k3 is necessary for a proper correction.

Figure 2.4: Before and after radial distortion correction [15].
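As an illustration, once the distortion coefficients are known from calibration, this correction can be applied with OpenCV. The following is a minimal sketch; the intrinsic and distortion values below are placeholders, not a real calibration of the device:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    // Placeholder intrinsics and distortion coefficients (k1, k2, t1, t2, k3).
    cv::Mat K = (cv::Mat_<double>(3, 3) << 280.0,   0.0, 140.0,
                                             0.0, 280.0, 110.0,
                                             0.0,   0.0,   1.0);
    cv::Mat dist = (cv::Mat_<double>(1, 5) << -0.25, 0.08, 0.0, 0.0, 0.0);

    cv::Mat raw = cv::imread("raw.png", CV_LOAD_IMAGE_GRAYSCALE);
    cv::Mat corrected;
    cv::undistort(raw, corrected, K, dist); // remaps pixels to undo the lens distortion
    cv::imwrite("corrected.png", corrected);
    return 0;
}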


2.2.3 Stereo cameras and epipolar geometry

With only one camera it is possible to determine the transformation of a 3D point in the real world to a 2D point in an image. It is feasible to establish the relation between the coordinates of a real world point and its projection in pixels in the image. Nevertheless, the inverse operation cannot be computed with only one image, due to the ambiguity of the relation between real points and pixels.

Figure 2.5: Epipolar geometry [31].

Hence, it is necessary to have two or more cameras (or views) simultaneously acquiring images of the same scene, as in Figure 2.6. Then, the disparity map can be computed and the distances can be measured.

This is possible thanks to the epipolar geometry, which establishes the relations between the cameras and the real world.

Based on Figure 2.5:

• P is a point in space that is projected into each camera, generating the points pr and pl.

• Ol and Or are the optical centres of the cameras.

• The lines Pl and Pr are called epipolar lines.

• The indicated points are called epipoles, which are the projections of the optical centre of one camera into the projection plane of the other camera. Every epipolar line of an image traverses the epipole of that image.


It can be seen that when a point falls on the line Olpl or on Orpr, it always has the same projection pr or pl. That is why it is not possible, with just one view, to relate a projection and a real world point. This ambiguity is called occlusion.

Looking at Figure 2.6, it can be appreciated that with only one image, the three points from the real world will appear as the same point in the projected image. The depth information is recovered by the second camera, where the three points are represented separately.

However, in order to use the epipolar geometry for triangulation, the following facts need to be considered:

• Every point X in the cameras' working area is placed in an epipolar plane; thus it will have two projections (xl, xr).

• A point xi placed in the projective plane of one of the cameras has an associated projection in the other image, situated along the corresponding epipolar line. This fact is called the epipolar restriction, and once the epipolar geometry is known, it makes it possible to reduce the search for projection pairs to a one-dimensional search along the epipolar lines. This saves computation time in the correspondence search and also helps to reject false positives.

• Every epipolar plane associated with a point in space will always intersect the line O1O2.

All these facts help in the calculation of the physical positions of the points in 3D space.

Figure 2.6: Occlusion representation [34].


2.2.4 Canonical Stereo Vision

The reconstruction of a three-dimensional scene based on two images acquired from different positions and viewing directions is defined as stereo vision [42]. The properties of these cameras are determined by their epipolar geometry, which describes the relationship between the observed real-world points in their fields of view and the images on their respective sensing planes.

The Leap Motion sensor takes advantage of the canonical configuration for building the stereo geometry. In this configuration, the baseline of the two Pinhole cameras is aligned with the horizontal coordinate axis, the optical axes of the cameras are parallel, the epipoles move to infinity, and the epipolar lines in the image planes are parallel, as shown in Figure 2.7.

Figure 2.7: Canonical camera configuration [32].


Figure 2.8: Stereo geometry [32].

If a point P = (x, y, z) in the real world is examined in Figure 2.8, the following relations can be extracted from the geometry:

Pl / f = −(h + x) / z (2.6)

Pr / f = (h − x) / z (2.7)

Where Pl and Pr are the projections of P onto the left and right images and h is half of the baseline distance.

Combining these equations together and defining the disparity as d = Pr − Pl, the depth can be obtained as:

Z = 2hf / d = bf / d (2.8)

As can be seen, the depth Z is inversely proportional to the disparity; thus zero disparity means that the point is infinitely far away.
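As a small worked example of Equation 2.8 (the numeric values are assumptions for illustration, not measured parameters of the device):

#include <cstdio>

// Depth from disparity in the canonical configuration (Equation 2.8):
// Z = b*f/d, where b = 2h is the baseline, f the focal length in pixels
// and d the disparity in pixels.
double depthFromDisparity(double f_px, double baseline_m, double d_px) {
    if (d_px <= 0.0) return -1.0; // zero disparity: the point is infinitely far
    return baseline_m * f_px / d_px;
}

int main() {
    // Assumed values: f = 280 px, baseline = 0.04 m, disparity = 20 px.
    std::printf("Z = %.3f m\n", depthFromDisparity(280.0, 0.04, 20.0)); // prints Z = 0.560 m
    return 0;
}

Note how halving the disparity doubles the estimated depth, which is why the depth resolution of a stereo pair degrades with distance.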


2.2.5 Stereo Camera Parameters

The projection matrix P from Equation 2.3 can be used to transform the coordinates of a 3D point in the real world to the pixel coordinates of an image. For stereo pair configurations, it is constructed from the matrix K and a vector with information about the position of the optical center.

P = [K | T ] (2.9)

Where K is the intrinsic calibration matrix, constructed from the intrinsic parameters of the camera as follows:

K = [ fx  0  cx ]
    [  0 fy  cy ]
    [  0  0   1 ]

cx and cy represent the distance from the center of the image coordinate plane to the principal point. fx and fy are the focal lengths in pixels. They are proportional to the focal length, as stated in Equations 2.10 and 2.11:

fx = f·Sx (2.10)

fy = f·Sy (2.11)

Where f is the physical focal length of the lens in metric units, and Sx and Sy are the number of pixels per metric unit of the sensor along the X and Y axes. If the sensor has the same number of pixels per metric unit in all dimensions, fx and fy will have the same value.

The vector T contains information about the position of the optical center of the second camera in the first camera's frame. In the canonical configuration, the first camera always has Tx = Ty = 0. For the right (second) camera of a horizontal stereo pair, Ty = 0 and Tx = −fx′·B, where B is the baseline between the cameras. Tz = 0, since both cameras are in the same stereo image plane [7].

T = [ Tx ]
    [ Ty ]
    [  0 ]
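As an illustrative sketch with the Eigen library (which is used in the implementation), the projection matrix of Equation 2.9 can be assembled and applied to a 3D point as follows; the parameter values are placeholders:

#include <Eigen/Dense>
#include <iostream>

int main() {
    // Placeholder intrinsics and baseline, not the Leap Motion calibration.
    double fx = 280.0, fy = 280.0, cx = 140.0, cy = 110.0, B = 0.04;

    Eigen::Matrix3d K;
    K << fx, 0.0, cx,
         0.0, fy, cy,
         0.0, 0.0, 1.0;

    // P = [K | T]; for the right camera of a horizontal pair T = (-fx*B, 0, 0).
    Eigen::Matrix<double, 3, 4> P;
    P.leftCols<3>() = K;
    P.col(3) = Eigen::Vector3d(-fx * B, 0.0, 0.0);

    // Project a 3D point M (homogeneous coordinates): m ~ P*M (Equation 2.3).
    Eigen::Vector4d M(0.01, 0.02, 0.15, 1.0);
    Eigen::Vector3d m = P * M;
    std::cout << "u = " << m(0) / m(2) << ", v = " << m(1) / m(2) << std::endl;
    return 0;
}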

2.3 Feature Detection

Once the camera parameters are determined and the obtained images are rectified and transformed to correct the distortions and errors induced by the camera, the next step is the search for correspondences. In this thesis, finding correspondences within image pairs will help to detect objects and extract good features that can be used as input to the tracking algorithm.

The first step of this process is to find the features in the images. These points have a perfectly defined position, and their neighbouring pixels carry a large amount of local information. One of their most important characteristics is stability under local and global perturbations. These perturbations can be translations, rotations, scaling, or changes in luminosity and perspective.

To achieve this first step, feature detectors based on different operators can be used. Some operators reveal information about corners, others about edges, etc.

Once the features are detected, the next step is to describe the area around each feature. The features detected in the first step give information about the localization; however, they are not sufficient for a subsequent image correspondence search. The description of the features is extracted from a defined area or kernel around the point of interest. Bigger kernels imply more computational cost but a larger description region. Nevertheless, big descriptors are not of interest, since they lose the locality property.

There are different ways to describe the neighbourhood of a feature point. One way is to evaluate the grey level: if the image is normalized to one, darker pixels will have a value closer to zero and brighter ones closer to one. Another is the color level, where the three color matrices R, G and B are evaluated. The color of a region can be evaluated by its three color histograms or by the mean of the region. It is also possible to describe the features by the orientation of the gradients around them. All these descriptors are independent of the detection method used in the first step.

The feature description will allow the algorithms to relate the feature points of a trained image to a test image. That is why it is important for these feature descriptors to be scale and position invariant. An item in the trained image can appear rotated, translated or scaled in the test image, thus the algorithm must be invariant under these changes.

There are different methods for computing these point features. In general, the descriptors should have these characteristics:

• Simplicity: The descriptor should capture the characteristics of the image in a simple and clear way in order to be easily interpreted.

• Repeatability: The descriptor generated from an image must be independent of the moment when it was generated.

• Uniqueness: The descriptor should have a high degree of differentiability with respect to other images, and at the same time it should have enough information for establishing relations with similar images.

• Invariance: The descriptor needs a high degree of robustness to face different transformations between images.

• Efficiency: The generation of the descriptors needs to be in accordance with the time and computation constraints of the application.

There are two types of image descriptors that have to be considered in the choice of the correspondence method:

• Global descriptors: They condense a lot of information about the image into little data. Despite their simplicity, they are commonly used due to their low computational cost and good results. One example of these descriptors is the Color Histogram.

• Local descriptors: These descriptors are applied when the local information of the image is more relevant. They act on predefined regions of interest, identifying key information about a point and its neighbours. These regions of interest or points are called keypoints, and the descriptor is formed by the totality of the keypoint feature vectors. Some examples of local descriptors are SURF, SIFT and ORB.

After testing different methods, the choice for this thesis is the following: feature detection will be computed with FAST and description with SURF, due to their robustness and the high performance that allows their use in real-time applications. The matching problem will be solved with a brute-force matcher; even though it is a naive implementation, it systematically checks all possible feature pairs.

In the following sections, the Canny edge detector and distance images are also explained, because they are applied in the thesis to obtain the information that is used as input to the particle filter.

2.3.1 Canny edge detector

The Canny edge detector is an operator developed by John F. Canny in 1986 [6]. It is a multi-step algorithm for detecting edges in images that aims to satisfy these criteria:

• Low error rate: Just detect good edges.

• Good localization: The distance between detected edge pixels and real edge pixels has to be minimized.


• Minimal response: Only one detector response may exist per edge.

The steps of the algorithm are the following [33]:

1. A Gaussian filter is applied to remove noise. An example of a Gaussian kernel of size 5 that might be used is:

K = (1/159) · [ 2  4  5  4  2 ]
              [ 4  9 12  9  4 ]
              [ 5 12 15 12  5 ]
              [ 4  9 12  9  4 ]
              [ 2  4  5  4  2 ]

2. Obtain the intensity gradient of the image:

(a) Apply a pair of convolution masks in the x and y directions:

Gx = [ −1  0  +1 ]        Gy = [ −1  −2  −1 ]
     [ −2  0  +2 ]             [  0   0   0 ]
     [ −1  0  +1 ]             [ +1  +2  +1 ]

(b) Find the strength and direction of the gradient:

G = √(Gx² + Gy²)

θ = arctan(Gy / Gx)

The direction is rounded to one of four possible angles: 0, 45, 90 or 135 degrees.

3. Non-maximum suppression is applied to remove pixels that are not considered to be part of an edge.

4. In the last step, Canny uses two thresholds to perform hysteresis:

(a) If the pixel gradient is higher than the upper threshold, the pixel is accepted as an edge.

(b) If the pixel gradient is below the lower threshold, it is rejected.

(c) If the pixel gradient is between the two thresholds, it is accepted only if it is connected to a pixel that is above the upper threshold.


An example of the algorithm can be seen in Figure 2.9:

Figure 2.9: Example of edge detection with the Canny algorithm [33].

2.3.2 Distance Fields

Distance fields or distance images convert a binary digital image into an image where each pixel has a value corresponding to the distance to the nearest black pixel [4]. The result of the transformation is a grey level image that looks similar to the input image, except that the grey scale intensities of points inside foreground regions are changed to show the distance to the closest black pixel. Figure 2.10 shows a distance transform for a simple rectangular shape:

Figure 2.10: Distance transform for a simple rectangular shape [35].
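A minimal sketch of this transformation with OpenCV; note that cv::distanceTransform measures the distance to the nearest zero pixel, so a binary edge image (edges at 255) must be inverted first:

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    // Binary edge image, e.g. the output of the Canny detector.
    cv::Mat edges = cv::imread("edges.png", CV_LOAD_IMAGE_GRAYSCALE);

    cv::Mat inverted, dist;
    cv::bitwise_not(edges, inverted);                      // edges become the zero pixels
    cv::distanceTransform(inverted, dist, CV_DIST_L2, 3);  // Euclidean distance, 3x3 mask

    cv::normalize(dist, dist, 0.0, 1.0, cv::NORM_MINMAX);  // scale for display
    cv::imshow("distance image", dist);
    cv::waitKey(0);
    return 0;
}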

2.3.3 FAST detector

Features from Accelerated Segment Test (FAST) is an algorithm to detect local features in images. In this thesis it is used due to its high efficiency, which allows the online operation of a tracking system [37].

An examination of a circle of 16 pixels (radius 3) centred at the candidate feature pixel p is performed. A feature is detected at p if the intensities of at least 12 contiguous pixels are all above or below the intensity of p by some threshold t. This condition can be optimized by, for example, examining pixels 1, 9, 5 and 13 first, for faster rejection of candidate pixels, as can be appreciated in Figure 2.11. This works because a feature can only exist if three of these tested points are all above or below the intensity of p by the threshold [36].

Figure 2.11: FAST feature detection in an image patch: C is the pixel position of the feature and the numbered pixels are the ones to be evaluated [37].
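A minimal sketch of FAST detection with OpenCV (the threshold value is an assumption; in practice it has to be tuned for the images at hand):

#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/highgui/highgui.hpp>

int main() {
    cv::Mat img = cv::imread("left.png", CV_LOAD_IMAGE_GRAYSCALE);

    // FAST with an assumed intensity threshold t = 20 and
    // non-maximum suppression enabled.
    std::vector<cv::KeyPoint> keypoints;
    cv::FAST(img, keypoints, 20, true);

    cv::Mat out;
    cv::drawKeypoints(img, keypoints, out);
    cv::imshow("FAST keypoints", out);
    cv::waitKey(0);
    return 0;
}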

2.3.4 SURF descriptor

Speeded Up Robust Features (SURF) [3] is a local feature detector and descriptor inspired by SIFT [21]. In this thesis it is used as the descriptor. The first step is to determine the orientation of each of the keypoints detected in the previous step. For this, the Haar responses are computed in both the x and y directions with the functions of Figure 2.12. The region of interest for this calculation is a circular area centred at the keypoint with radius 6s, where s is the scale at which the keypoint was detected.

The sampling step also depends on the scale, taking the value s. The Haar wavelet functions also depend on s, taking a size of 4s; higher values of s imply larger Haar wavelet functions.


Figure 2.12: Haar functions for the calculation of the responses in the x direction (left) and the y direction (right). Black has a value of −1 and white of +1 [3].

After these calculations, the computed responses are weighted by a Gaussian distribution centred at the keypoint with σ = 2.5s. Then, the responses are represented by vectors in the space, setting the horizontal and vertical responses on the x and y axes. Finally, the predominant orientation for each sector is obtained by the sum of all the responses under a mobile window that covers an angle of π/3, as the author states [1].

In the second step of the process, a square region of size 20s is generated. It is created around the keypoint, aligned with the orientation computed in the last step. This region is divided into 4 x 4 subregions, where the Haar responses are calculated. The Haar responses in the horizontal and vertical directions relative to the keypoint orientation are defined as dx and dy. In Figure 2.13, the Haar responses in each sub-region and the dx, dy of each vector are represented.

For higher robustness under geometric transformations and position errors, dx and dy are weighted by a Gaussian with σ = 3.3s centred at the keypoint.

In order to have a representative value for each of these sub-regions, the responses dx and dy are added up. At the same time, the sum of all the absolute values of the responses, |dx| and |dy|, is computed; thus the polarity information about the intensity changes is captured.

Summarizing, each of the sub-regions is represented by a four-component vector v:

v = (Σdx, Σdy, Σ|dx|, Σ|dy|) (2.12)

Taking all the 4 x 4 subregions, the SURF descriptor is formed by 64 values for each of the detected keypoints.


Figure 2.13: Haar responses in the subregions around a keypoint [43].

2.3.5 Matching Keypoints

The main goal of matching keypoints is to obtain a representative value of the similarity between two images. The calculation of this value (represented as a distance) is done by applying a distance formula between the images [25]. However, before this calculation, it is necessary to establish the keypoint correspondences.

The keypoint correspondences are obtained by calculating the Euclidean distance between the feature vectors. Afterwards, this value is used in a brute-force matcher to obtain the keypoint matching. An example of the matching can be appreciated in Figure 2.14.

Figure 2.14: Matching representation of the descriptor: the red dots are the keypoints, the green lines are the matches between the two images [26].
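A minimal sketch of the detect-describe-match pipeline described in this chapter, assuming OpenCV 2.x with the non-free module that provides SURF:

#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>
#include <opencv2/nonfree/features2d.hpp> // SURF lives in the non-free module
#include <opencv2/highgui/highgui.hpp>

int main() {
    cv::Mat left  = cv::imread("left.png",  CV_LOAD_IMAGE_GRAYSCALE);
    cv::Mat right = cv::imread("right.png", CV_LOAD_IMAGE_GRAYSCALE);

    // Detect keypoints with FAST (assumed threshold).
    std::vector<cv::KeyPoint> kpL, kpR;
    cv::FAST(left,  kpL, 20, true);
    cv::FAST(right, kpR, 20, true);

    // Describe their neighbourhoods with SURF (64 values per keypoint).
    cv::SurfDescriptorExtractor surf;
    cv::Mat descL, descR;
    surf.compute(left,  kpL, descL);
    surf.compute(right, kpR, descR);

    // Brute-force matching with the Euclidean (L2) distance.
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<cv::DMatch> matches;
    matcher.match(descL, descR, matches);
    return 0;
}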


2.4 Filters

Many physical and scientific problems require the estimation of the state of a system that changes over time. This problem can be visualized as an estimation process, since it has to deal with noisy measurements. Normally, the noise is modelled statistically; thus the estimation will be stochastic.

2.4.1 Bayesian Filter

The most general algorithm for state-space estimation is the Bayesian filter. It is able to estimate the state of a system from noisy measurements. In order to make estimations, it is necessary to make inferences about the system. These inferences are done with the system model (the motion model in tracking) and the measurement model.

On the one hand, the system model describes the evolution of the state of the system over time. In the tracking problem, it infers where the target is based on the motion data.

On the other hand, the measurement model describes the formation process by which sensor measurements are generated in the physical world.

The estimate or belief of the state is obtained from the posterior probability density function, which is constructed from all the available data. The first step is to describe the state-space model. In the particular case of tracking objects in images, the state of the system at time t (xt) will be a multi-dimensional vector with the 3D position and orientation of the target.

The set of measurements at time t is labelled zt. Thus the set of all measurements from t1 to t2 will be:

zt1:t2 = { zt1, zt1+1, zt1+2, ..., zt2 }

Then, the goal is to compute the posterior probability conditioned on all observations z1:t using the Bayes theorem:

p(xt | z1:t) = [ p(zt | xt, zt−1) · p(xt | z1:t−1) ] / p(zt | z1:t−1) (2.13)

The motion model is defined by the probability distribution p(xt | xt−1); as a result, Equation 2.13 becomes:

p(xt | z1:t) = [ p(zt | xt, zt−1) · ∫ p(xt | xt−1, z1:t−1) p(xt−1 | z1:t−1) dxt−1 ] / p(zt | z1:t−1) (2.14)


Computing this density becomes an intractable problem, since the measurements accumulate over time. Hence, it is necessary to apply the Markov assumptions:

• The states xt are complete, because they are the best predictors of the future. Completeness entails that knowledge of past states and measurements carries no additional information that would help the filter to predict the future more accurately.

• The past and future observations are independent if the current state xt is known.

With the observation independence assumption, Equation 2.13 becomes:

p(xt | z1:t) = [ p(zt | xt, zt−1) · p(xt | z1:t−1) ] / p(zt) (2.15)

With the Markov assumption, it can be rewritten as:

p(xt | z1:t) = [ p(zt | xt) · p(xt | z1:t−1) ] / p(zt) (2.16)

and finally Equation 2.14 becomes:

p(xt | z1:t) = [ p(zt | xt) · ∫ p(xt | xt−1) p(xt−1 | z1:t−1) dxt−1 ] / p(zt) (2.17)

The initialization of the filter p(x0) depends on the available information:

• If the exact position is known, particles will be placed there.

• If there is knowledge about the approximate position, the particles can be initialized with a uniform distribution around that position.

• If there is no knowledge, they can be initialized with a uniform distribution over the whole state space.

In the recursive Bayesian filter, Equation 2.17 is evaluated in two steps. The first step is named the prediction step, where the prediction is computed from the previous state posterior, before incorporating the measurement at time t. The second step is called the measurement update step, and it corrects the prediction from the first step.

Particle filters, and particularly the Monte Carlo approximation, are one solution to the tracking problem based on the Bayesian filter; they are discussed in the next section.


Algorithm 1 Bayes Filter

1: Algorithm BayesFilter(bel(xt−1), zt):
2: for all xt do
3:     bel̄(xt) = p(xt | z1:t−1) = ∫ p(xt | xt−1) · bel(xt−1) dxt−1
4:     bel(xt) = p(xt | z1:t) = η · p(zt | xt) · bel̄(xt)
5: end for
6: return bel(xt)

2.4.2 Particle Filters

Particle filters are Sequential Monte Carlo (SMC) methods for the computation of posterior distributions based on the Bayes filter of Equation 2.17.

The Monte Carlo particle filter was originally designed for robot localization. However, that problem has many similarities with tracking objects in images; hence, it is the approach used in this thesis. It has four steps.

Algorithm 2 MCL Algorithm

1: Algorithm MCL(Xt−1, zt):
2: X̄t = Xt = ∅
3: for m = 1 to M do
4:     xt[m] = motion_model(xt−1[m])
5:     wt[m] = measurement_model(zt, xt[m])
6:     X̄t = X̄t + ⟨xt[m], wt[m]⟩
7: end for
8: for m = 1 to M do
9:     draw i with probability ∝ wt[i]
10:    add xt[i] to Xt
11: end for
12: return Xt

In the first step, the filter is initialized in one of the ways stated in the last section. The selection of the initialization option depends on the information available at the instant t = 0.

The second step consists of the application of the motion model. Whenever the target moves, the filter moves the particles to predict where the target is after the movement. There are different ways to model this, such as the odometry model or the velocity model. In this thesis a different one is implemented, in order to adapt the filter to the application.


In the third step, the measurement model assigns a weight to each of the particles based on the sensed information.

Lastly, the particles are resampled based on their importance weights. The importance of this step and a deeper explanation are given in the next section.

2.4.3 Low variance sampling

The resampling procedure keeps the particles with the best probabilities and discards the rest. The algorithm used is called low variance sampling, and it selects the particles through a sequential stochastic process.

Algorithm 3 Low variance sampler

1: X̄t = ∅
2: r = rand(0; M⁻¹)
3: c = wt[1]
4: i = 1
5: for m = 1 to M do
6:     U = r + (m − 1) · M⁻¹
7:     while U > c do
8:         i = i + 1
9:         c = c + wt[i]
10:    end while
11:    add xt[i] to X̄t
12: end for
13: return X̄t

Instead of randomly choosing M particles, this algorithm computes a single random number and selects the samples according to it, still with a probability proportional to the particle weights. This is done by calculating a random number r in the interval [0; M⁻¹], where M is the number of samples to be drawn at time t. Afterwards, the particles are selected by repeatedly adding the value M⁻¹ to r and selecting the particle that corresponds to the resulting number.

As the authors state in [39], the low-variance sampler has three advantages. First, it covers the space of samples in a more systematic fashion than the independent random sampler. Second, if all samples have the same importance factor, the generated sample set is equivalent to the one from the last iteration. Lastly, the algorithm has a complexity of O(M), which for particle filters is an important factor for performance.
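A direct C++ implementation of Algorithm 3 could look as follows (the particle structure is a hypothetical placeholder; the weights are assumed to be normalized so that they sum to one):

#include <cstdlib>
#include <vector>

struct Particle {
    double state[6]; // 3D position and orientation
    double weight;
};

std::vector<Particle> lowVarianceSample(const std::vector<Particle>& Xt) {
    const int M = static_cast<int>(Xt.size());
    std::vector<Particle> resampled;
    resampled.reserve(M);

    double r = (std::rand() / (double)RAND_MAX) / M; // single random number in [0, 1/M]
    double c = Xt[0].weight;                         // running cumulative weight
    int i = 0;
    for (int m = 0; m < M; ++m) {
        double U = r + m / (double)M;                // U = r + (m - 1) * M^-1 in Algorithm 3
        while (U > c) {
            ++i;
            c += Xt[i].weight;
        }
        resampled.push_back(Xt[i]);                  // particle i survives
    }
    return resampled;
}

Since U advances by a fixed step of M⁻¹, a particle with weight w is selected approximately w·M times, which realizes importance resampling with a single random draw.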


Chapter 3

Methodology

Two different approaches were implemented in order to achieve the goals. Both of them have similarities in the program structure, but they differ in the methods and algorithms used. The programs were developed under the Robot Operating System (ROS) using the C++, OpenCV and Eigen libraries.

3.1 Image acquisition

As mentioned before, the Leap Motion runs over a USB port on the Linux platform. A service receives image data from the device; a DLL connects to the service and provides the data through a variety of languages (C++, Python, JavaScript, Objective-C, C#, Java).

The architecture shown in Figure 3.1 consists of:

1. Leap Service: receives image data from the device.

2. Leap Control Panel: configures the device tracking settings, calibration and troubleshooting. As the application runs independently from the service, it has direct control over the panel.

3. Leap-ROS driver node: accesses and modifies the distorted images (Figure 3.2) from the service.

4. stereo_img_proc node: rectifies and publishes the undistorted images and the camera parameters.

5. Tracking node: subscribes to the camera parameters and images. This is where the image processing and the particle filter are performed.


Figure 3.1: Leap Motion architecture.

The "stereo_img_proc" nodelet encapsulates the left and right images as aROS message Image that has the following fields:

• Header: Containing the timestamp and frame identification.

• Width and height of the image: They will be 280 x 220.

• Encoding type: In this case it will be mono8; each pixel is represented by an eight-bit unsigned value with one channel, since the images are acquired in grayscale.

• Matrix containing the data of the image.

It also publishes another message containing important information about each of the cameras. This message contains the information from the calibration files, which is:

• Header: Containing the timestamp, and frame identification.

• Width and height of the image: 280x220.

• Distortion parameters: k1, k2, t1, t2 and k3.

• Intrinsic camera 3x3 matrix of the distorted images, defined as K.

• Rectification 3x3 matrix to obtain parallel epipolar lines, named as R.

• Projection 3x4 matrix for the projection of 3D points in the camera coordinate frame to 2D pixel coordinates.
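For illustration, a minimal sketch of such a tracking node in ROS C++; the topic names are assumptions, not necessarily the ones used in the implementation:

#include <ros/ros.h>
#include <message_filters/subscriber.h>
#include <message_filters/time_synchronizer.h>
#include <sensor_msgs/Image.h>
#include <cv_bridge/cv_bridge.h>
#include <boost/bind.hpp>

// Called with a time-synchronized pair of rectified stereo images.
void stereoCallback(const sensor_msgs::ImageConstPtr& left,
                    const sensor_msgs::ImageConstPtr& right) {
    cv::Mat imgL = cv_bridge::toCvShare(left,  "mono8")->image;
    cv::Mat imgR = cv_bridge::toCvShare(right, "mono8")->image;
    // ... image processing and particle filter update ...
}

int main(int argc, char** argv) {
    ros::init(argc, argv, "tracking_node");
    ros::NodeHandle nh;
    message_filters::Subscriber<sensor_msgs::Image> subL(nh, "/stereo/left/image_rect", 1);
    message_filters::Subscriber<sensor_msgs::Image> subR(nh, "/stereo/right/image_rect", 1);
    message_filters::TimeSynchronizer<sensor_msgs::Image, sensor_msgs::Image> sync(subL, subR, 10);
    sync.registerCallback(boost::bind(&stereoCallback, _1, _2));
    ros::spin();
    return 0;
}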


Figure 3.2: Raw image from one of the cameras of the Leap Motion. A grid highlighting the significant, complex distortion is superimposed on the image [28].

3.2 Feature Based Particle Filter

The first proposal was a feature based particle filter. The general steps are: first, acquiring the images from the Leap Motion controller; then, processing them with OpenCV functions; and afterwards, estimating the state with a feature based particle filter.

3.2.1 Image processing

The 3D points of the object to track need to be extracted from the acquired images. Hence the first step is to obtain these points: a FAST detector detects the points of the object, a SURF descriptor describes their neighbouring pixels, and a brute-force matcher extracts the points that look similar in both images and discards the false detections. Afterwards, the matches are filtered with the ratio test proposed in [22] in order to keep only high-quality feature matches. This test rejects poor matches by computing the ratio between the distances of the best and second-best matches; if the best match is not clearly better than the second-best one, the match is discarded as being low-quality. An example of the feature extraction procedure can be seen in Figure 3.3.


Figure 3.3: Example of the feature extraction steps. From top to bottom: the left and right images with the features detected by FAST, the features matched using the information from the SURF descriptor, and finally the good matches obtained after filtering.
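A minimal sketch of the ratio-test filtering described above, using the two nearest neighbours per feature; the 0.8 threshold is an assumed value:

#include <vector>
#include <opencv2/core/core.hpp>
#include <opencv2/features2d/features2d.hpp>

std::vector<cv::DMatch> ratioTestFilter(const cv::Mat& descL, const cv::Mat& descR) {
    cv::BFMatcher matcher(cv::NORM_L2);
    std::vector<std::vector<cv::DMatch> > knn;
    matcher.knnMatch(descL, descR, knn, 2);  // best and second-best match per feature

    std::vector<cv::DMatch> good;
    for (size_t i = 0; i < knn.size(); ++i) {
        // Keep the match only if the best candidate is clearly better
        // than the second-best one.
        if (knn[i].size() == 2 &&
            knn[i][0].distance < 0.8f * knn[i][1].distance)
            good.push_back(knn[i][0]);
    }
    return good;
}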


The process is summarized in the flowchart of Figure 3.4: Feature Detection → Feature Description → Feature Matching → 3D Point Generation → Particle Filter.

Figure 3.4: Flowchart of the image processing of the first approach.

The estimated 3D point locations can then be used to compute a fitness score, based on the average point-to-model distance between a 3D object model and the points. This can in turn be used to weight the particles in the filter. As discussed in the next section, some issues in the feature extraction component of the system prevented us from fully testing this approach.

3.2.2 Implementation Problems: Lack of visual features

The main problem encountered was the lack of detected 3D points, which would have been the input data of the particle filter. They could not be extracted because the firmware of the device automatically adjusts the amount of IR light that the cameras capture; thus the image was sometimes overexposed, which led to a loss of information.

Depending on the distance, the mean of features extracted was around 5.Afterwards, texture was added to the object in order to improve the featureextraction. The number of features extracted increased a 50% per frame.


However, this was still not enough for the particle filter to work. These problems are shown and discussed in Chapter 5.

3.3 Contour Based Particle Filter

The feature-based approach from the previous section depends on the presence of sufficient texture on the objects. In this section we discuss an alternative approach based on image contours, which should be less dependent on texture.

3.3.1 Image Processing

In this step, the information that the particle filter will use as input for the measurement model is extracted.

Since this is a contour-based approach, it is necessary to extract the edges of the object from the scene; for this, the Canny edge detector is used. The parameters of the filter change depending on the distance, so for every distance they have to be tuned by trial and error in order to detect the edges of the object and discriminate all others. An example of the filter with the following parameters can be appreciated in Figure 3.5 (a call sketch follows the figure):

• Minimum threshold for the hysteresis procedure: 100.

• Ratio = HighThreshold / LowThreshold = 2.

• Kernel that specifies the aperture size of the Sobel() operator: 3.

Figure 3.5: Image before and after applying the Canny edge detector.
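For reference, the corresponding OpenCV call reduces to a single function; a minimal sketch with the parameters listed above (the wrapper name is illustrative):

```cpp
// Canny edge detection with the parameters discussed above:
// high threshold = ratio * low threshold (ratio = 2), Sobel aperture = 3.
#include <opencv2/imgproc.hpp>

cv::Mat detectEdges(const cv::Mat& gray, double lowThreshold = 100.0)
{
    cv::Mat edges;
    cv::Canny(gray, edges, lowThreshold, 2.0 * lowThreshold, 3);
    return edges;
}
```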


In this application, for the distances of 5, 10 and 15 cm, the thresholds used were 50, 100 and 130 respectively.

After that, the distance image is computed (Figure 3.6). The distance image contains, for each pixel, the distance between that pixel and the closest black pixel. In this case, the black pixels are the edges of the object, so a low distance means that the pixel is close to the object.

Figure 3.6: Distance Image (left) of the edge image (right).
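With OpenCV, the distance image can be obtained from the edge image via a distance transform. A minimal sketch, assuming the usual Canny output with white edges on a black background (the wrapper name is illustrative):

```cpp
// Distance image: for every pixel, the Euclidean distance to the closest
// edge pixel. cv::distanceTransform measures the distance to the nearest
// *zero* pixel, so the edge image is inverted first (edges become 0).
#include <opencv2/imgproc.hpp>

cv::Mat distanceImage(const cv::Mat& edges)
{
    cv::Mat inverted, dist;
    cv::bitwise_not(edges, inverted);
    cv::distanceTransform(inverted, dist, cv::DIST_L2, 3);
    return dist;  // CV_32F image of distances in pixels
}
```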

3.3.2 Particle Filter

Once the input data to the particle filter has been processed, the tracking estimation can be done. The particle filter has 5 steps (Figure 3.7).

As stated in Section 2.4.1, there are different ways to initialize the filter. In this particular case, the initial position of the object is unknown, so the filter uniformly distributes M particles over a fixed region, as in the sketch below.
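A minimal sketch of this initialization; the particle representation and the region bounds are illustrative assumptions, not values fixed by the thesis:

```cpp
// One particle is a 6-DOF pose hypothesis [x, y, z, alpha, beta, gamma]
// plus an importance weight.
#include <array>
#include <random>
#include <vector>

struct Particle {
    std::array<double, 6> pose;
    double weight;
};

std::vector<Particle> initializeUniform(int M, std::mt19937& rng)
{
    const double kPi = 3.14159265358979;
    // Hypothetical workspace bounds (metres / radians).
    std::uniform_real_distribution<double> x(-0.05, 0.05), y(0.0, 0.20),
                                           z(-0.05, 0.05), ang(-kPi, kPi);
    std::vector<Particle> particles(M);
    for (Particle& p : particles) {
        p.pose = {x(rng), y(rng), z(rng), ang(rng), ang(rng), ang(rng)};
        p.weight = 1.0 / M;  // uniform initial belief
    }
    return particles;
}
```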

Then, the motion model makes a first prediction of where the object is after a motion. There are different motion models, such as the velocity model or the odometry model, but since there is no information about the object's velocity or how far it moved, they cannot be used. The implemented model assumes that in every frame each particle can undergo a certain motion in each of its 6 degrees of freedom. In order to model this behaviour, the motion model takes each of the M_i particles and, using a Gaussian centred on the M_i position with a fixed covariance, generates j new particles M_{i,j}.
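A sketch of this diffusion step, reusing the Particle struct from the previous sketch; the per-axis standard deviations are illustrative assumptions:

```cpp
// Motion model: every particle spawns j copies perturbed by Gaussian
// noise in each of its 6 degrees of freedom.
std::vector<Particle> diffuse(const std::vector<Particle>& particles,
                              int j, std::mt19937& rng)
{
    // Hypothetical fixed standard deviations (metres / radians).
    const std::array<double, 6> sigma = {0.005, 0.005, 0.005,
                                         0.05, 0.05, 0.05};
    std::vector<Particle> out;
    out.reserve(particles.size() * j);
    for (const Particle& p : particles)
        for (int k = 0; k < j; ++k) {
            Particle q = p;
            for (int d = 0; d < 6; ++d) {
                std::normal_distribution<double> noise(p.pose[d], sigma[d]);
                q.pose[d] = noise(rng);
            }
            out.push_back(q);
        }
    return out;
}
```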

The third step is the measurement model, which weights each M_{i,j} particle following the flow described in Figure 3.8.


Figure 3.7: Flowchart of the particle filter: Initialize M particles → Motion Model → Measurement Model → Resample N particles → Random M−N particles.

The measurement model asks for a 3D model of the object to track at the M_{i,j} position, and this model is then projected onto both camera image planes. The measurement model weights the particle based on the distance (D) from each of the projected points of the particle's 3D model to the edges of the object viewed by the cameras. The probability density function is modelled as an inverse exponential function, since a lower distance means a higher probability of being in the same position as the object. The probability is computed as:

\[
p(z_t \mid x_t)_{i,j,k} =
\begin{cases}
e^{-\lambda D_t} & \text{if } D_{\min} \leqslant D_t \leqslant D_{\max} \\
0 & \text{otherwise}
\end{cases}
\]

The maximum possible value of D_max is 279 (the image limit); however, it was set to D_max = 5 and D_min = 0 in order to disregard particles that are too far away from the edges. The λ factor makes the slope of the function more or less pronounced.
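A direct transcription of this density into code (a sketch; the default λ is illustrative, since the thesis only states that it controls the slope):

```cpp
// Inverse-exponential weight of one projected model point, given its
// distance (in pixels) to the nearest edge in the distance image.
#include <cmath>

double pointWeight(double distance, double lambda = 1.0,
                   double dMin = 0.0, double dMax = 5.0)
{
    if (distance < dMin || distance > dMax)
        return 0.0;                     // too far from any edge
    return std::exp(-lambda * distance);
}
```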

For each particle there is a left and a right camera probability, so the final probability of the particle is:

\[ p_{i,j}(z_t \mid x_t) = 0.5 \cdot p_{\mathrm{Left}:i,j}(z_t \mid x_t) + 0.5 \cdot p_{\mathrm{Right}:i,j}(z_t \mid x_t) \tag{3.1} \]


Figure 3.8: Flowchart of the measurement model: Generate 3D model of the particle → Project 3D model (using the projection matrix and homography) → Weight particle (using the distance image and the PDF).

where p_{Left:i,j}(z_t|x_t) and p_{Right:i,j}(z_t|x_t) are each the mean over all the 3D points of the projected particle model. For one camera (left or right) this is:

\[ p_{i,j:l/r}(z_t \mid x_t) = \frac{1}{n} \sum_{k=1}^{n} p(z_t \mid x_t)_{i,j,k} \tag{3.2} \]
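Equations 3.1 and 3.2 combine into a few lines of code, reusing pointWeight from the previous sketch (the function names are illustrative):

```cpp
// Per-camera likelihood: mean point weight over the n projected model
// points (Eq. 3.2). distances holds the point-to-edge distances read
// from that camera's distance image.
#include <vector>

double cameraLikelihood(const std::vector<double>& distances, double lambda)
{
    if (distances.empty()) return 0.0;
    double sum = 0.0;
    for (double d : distances) sum += pointWeight(d, lambda);
    return sum / distances.size();
}

// Final particle likelihood: equal-weight average of both cameras (Eq. 3.1).
double particleLikelihood(const std::vector<double>& distLeft,
                          const std::vector<double>& distRight, double lambda)
{
    return 0.5 * cameraLikelihood(distLeft, lambda)
         + 0.5 * cameraLikelihood(distRight, lambda);
}
```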

The fourth step keeps the M particles with the highest probabilities; the low-variance sampling algorithm explained in Section 2.4.3 performs this.
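For completeness, a sketch of low-variance resampling as described by Thrun et al. [39], again reusing the Particle struct; it assumes the weights have been normalized to sum to one:

```cpp
// Low-variance resampling: one random offset and M equally spaced
// pointers sweep the cumulative weight distribution, so particles are
// kept in proportion to their weights with minimal sampling noise.
std::vector<Particle> lowVarianceResample(const std::vector<Particle>& ps,
                                          std::size_t M, std::mt19937& rng)
{
    std::vector<Particle> out;
    out.reserve(M);
    std::uniform_real_distribution<double> offset(0.0, 1.0 / M);
    const double r = offset(rng);
    double cumulative = ps[0].weight;
    std::size_t i = 0;
    for (std::size_t m = 0; m < M; ++m) {
        const double target = r + static_cast<double>(m) / M;
        while (target > cumulative && i + 1 < ps.size())
            cumulative += ps[++i].weight;
        out.push_back(ps[i]);
    }
    return out;
}
```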

3.3.3 3D Model Generation and Projection

In order to evaluate a particle, the measurement model needs the projection onto the camera planes of a 3D model of the object at the particle's position and orientation.

First and foremost, it is necessary to analyse the different frames that need to be considered. The world frame OXYZ is a right-handed coordinate system with X forward, Y left and Z up, as fixed by ROS. The left O_L X_L Y_L Z_L and right O_R X_R Y_R Z_R camera frames are left-handed coordinate systems with X forward, Y down and Z right, with the origin of the left frame at the world origin. The right frame is horizontally displaced by 4 cm (the baseline between the cameras). The camera image plane frames are right-handed systems aligned with the X and Y world axes and fixed at the top left corner of the image. All these frames can be seen in Figure 3.9.


Figure 3.9: Representation of the different frames of the system: OXYZ is the world frame, O_L X_L Y_L Z_L and O_R X_R Y_R Z_R are the camera frames, and cxy is the frame of the image plane.

Once the frame positions are known, the first step is to construct the model. Since the targets are simple shapes (cube, triangle, rectangle and cylinder), they are represented by a point cloud of the edges of the primitive objects. The model frame is fixed in the middle of the bottom side, coinciding with the world frame OXYZ. The world and camera frames are not the same, so the points of the model need to be transformed from world coordinates to the left and right camera frames. For this, all the points of the model are transformed to the new position with the rotation and translation matrices as follows:

X′ = X · [Rx · Rz · Ry] + T (3.3)

In more detail:

\[
\begin{bmatrix} x_R & y_R & z_R \end{bmatrix}
=
\begin{bmatrix} x & y & z \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}
\begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}
+
\begin{bmatrix} t_x & t_y & t_z \end{bmatrix}
\]


where α = −90°, β = 0° and γ = 0°. Note that the camera frames are left-handed, so positive X positions become negative and vice versa.

The second step is to move the 3D model to the particle position and rotation [x, y, z, α, β, γ]. Again, the points are transformed to the new position with the rotation and translation matrices, as in the sketch below.
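A sketch of this pose transform with Eigen (which the implementation uses for matrix operations). It is written in the column-vector convention, i.e. the transpose of the row-vector form of Equation 3.3, but composes the rotations in the same R_x · R_z · R_y order; the function name is illustrative:

```cpp
// Move the model point cloud to a particle pose
// [x, y, z, alpha, beta, gamma].
#include <Eigen/Geometry>
#include <array>
#include <vector>

std::vector<Eigen::Vector3d> transformModel(
        const std::vector<Eigen::Vector3d>& model,
        const std::array<double, 6>& pose)
{
    const Eigen::Matrix3d R =
        (Eigen::AngleAxisd(pose[3], Eigen::Vector3d::UnitX()) *   // Rx(alpha)
         Eigen::AngleAxisd(pose[5], Eigen::Vector3d::UnitZ()) *   // Rz(gamma)
         Eigen::AngleAxisd(pose[4], Eigen::Vector3d::UnitY()))    // Ry(beta)
            .toRotationMatrix();
    const Eigen::Vector3d t(pose[0], pose[1], pose[2]);

    std::vector<Eigen::Vector3d> out;
    out.reserve(model.size());
    for (const Eigen::Vector3d& p : model)
        out.push_back(R * p + t);  // rotate, then translate
    return out;
}
```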

Now that the model is placed at the particle position, the last step is to project it onto the left and right camera image planes with the projection matrix (Equation 2.9) of each camera, using Equation 2.3:

\[
\begin{bmatrix} u \\ v \\ w \end{bmatrix}
=
\begin{bmatrix} f_x & 0 & c_x & T_x \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\cdot
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
\]

where the coordinate positions in the image plane (u, v) are obtained by dividing by the homogeneous coordinate:

\[ u = \frac{u}{w}, \qquad v = \frac{v}{w} \]
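In code, this projection is a single matrix product followed by the dehomogenisation; a sketch with Eigen (the function name is illustrative, and w is assumed non-zero for points in front of the camera):

```cpp
// Project a 3D point in the camera frame to pixel coordinates using the
// 3x4 projection matrix P of that camera (Eq. 2.9).
#include <Eigen/Dense>

Eigen::Vector2d projectPoint(const Eigen::Matrix<double, 3, 4>& P,
                             const Eigen::Vector3d& X)
{
    const Eigen::Vector4d Xh(X.x(), X.y(), X.z(), 1.0);  // homogeneous point
    const Eigen::Vector3d uvw = P * Xh;
    return {uvw.x() / uvw.z(), uvw.y() / uvw.z()};       // divide by w
}
```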


Chapter 4

Experimental Setup

This chapter describes the setup of the experiments. All of them were carried out in an indoor environment with natural illumination. The following sections explain in detail the hardware and software used and the assumptions made about them.

4.1 Hardware

The Leap Motion controller is a small USB device designed for human-machine interaction. With it, a user can perform tasks like navigating websites, augmented reality in video games, high-precision drawing or 3D object manipulation. It acquires 2D images with its two cameras, and different filters and mathematical algorithms are then used to build the models and interpret the interaction with the computer. However, all this mathematical background is hidden by the company, so the tracking procedures cannot be analysed. Guna et al. [14] state that the device cannot be used as a professional tracking system due to its limited sensory space and inconsistent sampling frequency.

Nevertheless, the device can be used as a stereo vision system, since it is able to acquire images from its two CCD cameras. It also has three infrared LEDs that emit IR light with a wavelength of 850 nanometers. A schematic view of the device is shown in Figure 4.1 [41].

The company states that the device has an interaction space of eight cubic feet in the shape of an inverted pyramid, as can be seen in Figure 4.2 [29].



Figure 4.1: Leap Motion visualization [41]: (a) real and (b) schematic.

Figure 4.2: Leap Motion interaction area [29]: 2 feet above the controller, by 2 feet wide on each side (150° angle), by 2 feet deep on each side (120° angle).

The device does not have a fixed frame rate; it is unstable and changes from one measurement to another. This is analysed in [14], where the minimum logging period between two samples was observed to be 14 ms (71.43 Hz), with a mean frequency of 39 Hz and a standard deviation of 12.8 Hz. The cameras can also auto-adjust the quantity of IR light that they capture. The acquired images are grayscale due to the properties of the infrared light.

The data from the Leap Motion is transferred to the host PC through a Universal Serial Bus (USB) cable, which transfers data at a maximum of 480 Mb/s. A Dell Inspiron 14 laptop with a 4x2 GHz Intel i7-4510U and 16 GB of RAM is used as the host PC.


4.2 Software

All the experiments were carried out in the Robot Operating System (ROS) on a system running Ubuntu 14.04. The OpenCV libraries are used for image processing, and the Eigen library for matrix and transformation operations.

4.3 Targets

The targets to be tracked are basic geometric shapes, in particular the following:

• Cube with side = 3 cm (Figure 4.3a).

• Cube with texture and side = 3 cm (Figure 4.3a).

• Cylinder with texture, radius = 1.5 cm and height = 6.5 cm (Figure 4.4b).

• Equilateral triangle with side = 4 cm (Figure 4.4a).

• Rectangle with side = 5 cm and height = 11 cm (Figure 4.3b).

(a) Textured and untextured cubes. (b) Rectangle.

Figure 4.3: Cube and rectangle targets used in the experiments.


(a) Triangle. (b) Textured Cylinder.

Figure 4.4: Triangle and cylinder targets used in the experiments.

4.4 Experimental Scenario

The experiment recordings were carried out in lab T002 of Örebro University. For a realistic experiment, a robotic arm holding the object with a gripper should have been used; however, this was not possible, so a less accurate test was done.

Two 40 cm rulers with millimetre precision were placed firmly in parallel on a static platform, so that the depth position could be measured. Perpendicular to them, another, mobile ruler was placed to measure the horizontal displacement. The camera was fixed in the middle of the two rulers and the object was fixed to the mobile ruler. An overview can be seen in Figure 4.5.

Figure 4.5: Experimental setup overview.


The mobile ruler was manually moved to obtain motions along the X and Z axes, which were recorded as rosbag files. The recorded experiments are summarized in Table 4.1.

Model | x (cm) | y (cm) | z (cm) | ΔX (cm) | ΔY (cm) | ΔZ (cm)
------|--------|--------|--------|---------|---------|--------
      |   0    |   5    |   0    |   ±5    |    0    |    0
      |   0    |  10    |   0    |   ±5    |    0    |    0
      |   0    |  15    |   0    |   ±5    |    0    |    0
      |   0    |   5    |   0    |    0    |    0    |   ±5

Table 4.1: Summary of the experiments performed. x, y, z are the coordinates of the initial position of the target; ΔX, ΔY and ΔZ are the distances moved from the initial position.


Chapter 5

Experimental Results

In this chapter we describe and discuss the experimental results obtained, as well as the findings and qualitative observations gathered through the evaluation.

5.1 Feature Based Test

Before the particle filter implementation, different tests were done in order to evaluate the number of features and their quality. The objects selected for the experiments were the cube and the textured cube. These objects perform a motion along the X axis at three different depth distances. The parameters evaluated are the number of features detected by the FAST and SURF detectors and the number of good features that survive the matching and filtering. These good features would have been the input of the particle filter.

Table 5.1 summarizes the means of the tests:

Test                    | FAST Detections | SURF Detections | FAST GM | SURF GM
------------------------|-----------------|-----------------|---------|--------
Cube 5 cm               |      17.75      |      13.38      |   6.22  |   5.42
Cube 10 cm              |       8.91      |       7.74      |   4.08  |   5.26
Cube 15 cm              |       7.33      |       3.97      |   5.64  |   3.15
Cube with texture 5 cm  |      73.44      |      29.78      |  14.94  |  10.67
Cube with texture 10 cm |      33.47      |       7.59      |  12.41  |   5.49
Cube with texture 15 cm |      18.10      |       6.05      |   8.56  |   5.22

Table 5.1: Means of the number of detected features and good matches (GM) for the different objects and distances with the FAST and SURF detectors.



The results can be visualized in the following plots:

Figure 5.1: Test 1: Number of features detected over time with the FAST and SURF detectors at a depth distance of 5 cm.

Figure 5.2: Test 1: Number of good features over time with the FAST and SURF detectors at a depth distance of 5 cm.


Figure 5.3: Test 2: Number of features detected over time with the FAST and SURF detectors at a depth distance of 10 cm.

Figure 5.4: Test 2: Number of good features over time with the FAST and SURF detectors at a depth distance of 10 cm.


Figure 5.5: Test 3: Number of features detected over time with the FAST and SURF detectors at a depth distance of 15 cm.

Figure 5.6: Test 3: Number of good features over time with the FAST and SURF detectors at a depth distance of 15 cm.


In the beginning the shapes did not have any texture, and the mean number of good features extracted over time was around 5 for both algorithms. After that, some paint was added to create texture on the objects. The number of extracted features increased by 50%, but it was still not sufficient to perform an estimation with a particle filter.

The FAST algorithm detects three times more features than SURF, but after the matching and filtering the number of good features is quite similar; the SURF algorithm is thus more robust in its feature detection.

5.2 Contour Based Test

After the full implementation of the approach, a tracking test was done with the textured cube placed in front of the two cameras at a depth distance of 5 cm. The shape performs a slow horizontal motion of ±5 cm. The graphical results obtained were good: all the particles were able to track the object along its movement, as can be seen in Figure 5.7. However, the numerical results were completely wrong, as can be seen in Table 5.2.

      | X (m)  | Y (m) | Z (m) |   α   |   β   |   γ
------|--------|-------|-------|-------|-------|------
S0    |  0     | 0.05  |  0    |  0    |  0    |  0
SF′1  |  0.05  | 0.05  |  0    |  0    |  0    |  0
SF1   |  0.083 | 1.46  | 1.14  | 4.093 | -0.56 | -0.44
SF′2  | -0.05  | 0.05  |  0    |  0    |  0    |  0
SF2   | -0.123 | 1.67  | 0.89  | 7.764 | 4.912 | -0.71

Table 5.2: Results of the particle filter. S0 is the initial position, SF′1 and SF′2 are the expected positions, and SF1 and SF2 are the estimations.

This bad estimation is due to a poor weight evaluation function and to projection characteristics that confuse the particle filter. Figure 5.8 shows how the projection of the best particle initially matches the model, and how in the following frames the projection tends to go to infinity, making the filter diverge.


Figure 5.7: Evolution of the particles in the particle filter: from top to bottom, four different frames, showing the initial projection of the particles around the cube and then three frames at different execution times in which the particles track the cube.


Figure 5.8: Evolution of the projection in the particle filter: from top to bottom, three different frames, showing the initial projection of the best particle around the cube and then a second and a third frame in which the projection goes to infinity, making the filter diverge.

The following plots show the histograms of the weights of a set of 100,000 particles in the initial frame and in the final frame.


Figure 5.9: Histogram of the weights of the particles in the initial position.

Figure 5.10: Histogram of the weights of the particles in the final position.

In a sense, the particle filter works, since the higher weights in the final frame are more concentrated than the initial ones, which are spread out in space. The problem arises when a particle is too far away, because its projection then condenses into a small cluster of points. If these points fall on an edge, the probability of such far particles becomes higher than that of particles close to the model, which actually have a better estimate of the state. As a result, in the following steps of the filter all the particles end up really far away and the particle filter gets confused, leading to a wrong state estimation.


Chapter 6

Conclusions and Future work

6.1 Conclusions

In the current work we have presented the design of two implementations for tracking basic shapes with the Leap Motion device used as a stereo vision system. Firstly, we introduced the basic concepts of computer vision and of particle filters for tracking objects with stereo vision systems. Then a first approach was proposed and implemented to solve the problem. After testing the number and quality of the features that could be extracted from the scene, the conclusion was that there was not enough information for the state estimation algorithm to track the object. As a consequence, a contour-based approach was implemented. In the beginning it was really promising, since it does not depend on extracting features from the object. However, the test results were not as expected: due to the projection particularities and an inaccurate particle weight evaluation function, the particle filter got confused.

The Leap Motion's technology is very promising in the sense that it has a huge range of applications in hand tracking and gesture recognition for human-interaction applications. However, the tracking algorithms running in the background take advantage of hardware capabilities such as control over the acquisition frequency, the intensity of the IR LEDs, and the amount of IR light that the cameras capture. Since these parameters are not accessible to developers, it becomes hard to replicate this behaviour in an efficient way.



6.2 Future work

Based on the results obtained, future work includes several improvements to the current contour-based approach. First of all, a weight evaluation function is needed that disregards occluded points and the far projections that concentrate the points in space and currently lead to wrongly high probabilities.

In addition, the particle filter and the way the projections are obtained can be optimized in order to achieve a real-time application able to track the objects that a gripper mounted on a robot can grasp.


References

[1] A Ahmadi, MR Daliri, Ali Nodehi, and Amin Qorbani. Objects recognition using the histogram based on descriptors of SIFT and SURF. Journal of Basic and Applied Scientific Research, 2(9):8612–8616, 2012. (Cited on page 18.)

[2] Pablo Barrera, José M Cañas, Vicente Matellán, and Francisco Martín. Multicamera 3D tracking using particle filter. In Int. Conf. on Multimedia, Image Processing and Computer Vision, 30 March – 1 April, 2005. (Cited on page 4.)

[3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded up robust features. In Computer Vision – ECCV 2006, pages 404–417. Springer, 2006. (Cited on pages 17 and 18.)

[4] Gunilla Borgefors. Distance transformations in digital images. Computer Vision, Graphics, and Image Processing, 34(3):344–371, 1986. (Cited on page 16.)

[5] Matthieu Bray, Esther Koller-Meier, and Luc Van Gool. Smart particle filtering for 3D hand tracking. In Automatic Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on, pages 675–680. IEEE, 2004. (Cited on page 1.)

[6] John Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):679–698, 1986. (Cited on page 14.)

[7] ROS Community. CameraInfo documentation. http://docs.ros.org/indigo/api/sensor_msgs/html/msg/CameraInfo.html. (Cited on page 12.)

[8] Leap Motion Company. Leap Motion device. https://www.leapmotion.com/. (Cited on page 1.)


[9] Samuel Dambreville, Yogesh Rathi, and Allen Tannenbaum. Tracking deformable objects with unscented Kalman filtering and geometric active contours. In American Control Conference, 2006, pages 6–pp. IEEE, 2006. (Cited on page 3.)

[10] Pierre Del Moral. Non-linear filtering: interacting particle resolution. Markov Processes and Related Fields, 2(4):555–581, 1996. (Cited on page 4.)

[11] Frank Dellaert and Chuck E Thorpe. Robust car tracking using Kalman filtering and Bayesian templates. In Intelligent Systems & Advanced Manufacturing, pages 72–83. International Society for Optics and Photonics, 1998. (Cited on page 3.)

[12] Diane K Fisher and Alexander Novati. Make a pinhole camera: In this activity you will make your own pinhole camera and discover its creative possibilities. The Technology Teacher, 69(3):15, 2009. (Cited on page 5.)

[13] Donald B Gennery. Visual tracking of known three-dimensional objects. International Journal of Computer Vision, 7(3):243–270, 1992. (Cited on page 3.)

[14] Jože Guna, Grega Jakus, Matevž Pogačnik, Sašo Tomažič, and Jaka Sodnik. An analysis of the precision and reliability of the Leap Motion sensor and its suitability for static and dynamic tracking. Sensors, 14(2):3702–3720, 2014. (Cited on pages 37 and 38.)

[15] Richard Hartley and Sing Bing Kang. Parameter-free radial distortion correction with center of distortion estimation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(8):1309–1321, 2007. (Cited on page 7.)

[16] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003. (Cited on page 7.)

[17] Michael Isard and Andrew Blake. Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998. (Cited on page 4.)

[18] Simon J Julier and Jeffrey K Uhlmann. New extension of the Kalman filter to nonlinear systems. In AeroSense’97, pages 182–193. International Society for Optics and Photonics, 1997. (Cited on page 3.)

[19] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Fluids Engineering, 82(1):35–45, 1960. (Cited on page 3.)


[20] Augmented Reality Lab. How to solve the image distortion problem. http://www.arlab.com. (Cited on page 6.)

[21] David G Lowe. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pages 1150–1157. IEEE, 1999. (Cited on page 17.)

[22] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004. (Cited on page 27.)

[23] John MacCormick. Stochastic Algorithms for Visual Tracking: Probabilistic Modelling and Stochastic Algorithms for Visual Localisation and Tracking. Springer Science & Business Media, 2012. (Cited on page 4.)

[24] Henry Medeiros, Johnny Park, and Avinash Kak. A parallel color-based particle filter for object tracking. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pages 1–8. IEEE, 2008. (Cited on page 3.)

[25] Ajmal S Mian, Mohammed Bennamoun, and Robyn Owens. Keypoint detection and local feature matching for textured 3D face recognition. International Journal of Computer Vision, 79(1):1–12, 2008. (Cited on page 19.)

[26] Alexander Mordvintsev. Feature matching. http://opencv-python-tutroals.readthedocs.org. (Cited on page 19.)

[27] Tim Morris. Computer Vision and Image Processing. Palgrave Macmillan, 2004. (Cited on page 4.)

[28] Leap Motion. Camera images. https://developer.leapmotion.com. (Cited on pages 7 and 27.)

[29] Leap Motion. Hardware. http://blog.leapmotion.com. (Cited on pages 37 and 38.)

[30] Rafael Muñoz-Salinas, Eugenio Aguirre, and Miguel García-Silvente. People detection and tracking using stereo vision and color. Image and Vision Computing, 25(6):995–1007, 2007. (Cited on page 1.)

[31] University of Auckland. Epipolar geometry. https://www.cs.auckland.ac.nz. (Cited on page 8.)

[32] The University of Iowa. Geometry for 3D vision. http://user.engineering.uiowa.edu/~dip/lecture/3dvisionp1_2.html. (Cited on pages 10 and 11.)


[33] OpenCV. Canny edge detector. http://docs.opencv.org. (Cited on pages 15 and 16.)

[34] OpenCV. Epipolar geometry. http://docs.opencv.org. (Cited on page 9.)

[35] Robert Fisher, Simon Perkins, Ashley Walker, and Erik Wolfart. Distance images. http://homepages.inf.ed.ac.uk/rbf/HIPR2/distance.htm. (Cited on page 16.)

[36] Edward Rosten and Tom Drummond. Fusing points and lines for high performance tracking. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pages 1508–1515. IEEE, 2005. (Cited on page 17.)

[37] Edward Rosten, Reid Porter, and Tom Drummond. Faster and better: A machine learning approach to corner detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(1):105–119, 2010. (Cited on pages 16 and 17.)

[38] Josephine Sullivan, Andrew Blake, Michael Isard, and John MacCormick. Bayesian object localisation in images. International Journal of Computer Vision, 44(2):111–135, 2001. (Cited on page 4.)

[39] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press, 2005. (Cited on page 23.)

[40] Eric Wan, Rudolph Van Der Merwe, et al. The unscented Kalman filter for nonlinear estimation. In Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000. AS-SPCC. The IEEE 2000, pages 153–158. IEEE, 2000. (Cited on page 3.)

[41] Frank Weichert, Daniel Bachmann, Bartholomäus Rudak, and Denis Fisseler. Analysis of the accuracy and robustness of the Leap Motion controller. Sensors, 13(5):6380–6393, 2013. (Cited on pages 37 and 38.)

[42] Christian Wöhler. 3D Computer Vision: Efficient Methods and Applications. Springer Science & Business Media, 2012. (Cited on pages 6 and 10.)

[43] Caleb Woodruff. Feature detection and matching. https://courses.cs.washington.edu. (Cited on page 19.)

[44] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys (CSUR), 38(4):13, 2006. (Cited on page 1.)