
Mobile Robot Navigation Using Visual Servoing

T. TEPE DC 2010.018


DYNAMICS & CONTROL TECHNOLOGY GROUP

MOBILE ROBOT NAVIGATION USING VISUAL SERVOING

M.Sc. INTERNSHIP

Supervisor : Prof. Dr. Henk NIJMEIJER

Coach : Dr. Dragan KOSTIĆ

Student : Tufan TEPE

Student ID: 0666323


ABSTRACT

Equipping robots with vision systems increases their versatility but also the complexity of their control. Despite this increased complexity, vision remains an attractive sensory modality for mobile robot navigation since it provides rich information about the robot's environment.

In this work, a visual servoing problem based on a fixed monocular camera mounted on a mobile robot is investigated. A homography-based control method is used for autonomous navigation of a mobile robot with nonholonomic motion constraints. The visual control task uses the idea of homing: an image is first taken at the desired position, and the robot is then driven from an initial position towards the desired position using information extracted from the target image and the images taken during the robot's motion.


Table of Contents

1 INTRODUCTION
2 DESIGN ISSUES
   2.1 Camera configuration
   2.2 Servoing architectures
3 AN INSIGHT INTO VISUAL SERVOING METHODS
   3.1 The geometry of image formation
   3.2 Analysis of visual servoing methods
4 PROJECT DESCRIPTION
5 HOMOGRAPHY BASED VISUAL SERVOING OF A NONHOLONOMIC MOBILE ROBOT
   5.1 Homography and its estimation
      5.1.1 Geometric transformations
      5.1.2 Situations in which solving a homography arises
      5.1.3 How to find the homography?
   5.2 Motion model of the mobile robot
   5.3 Input-output linearization and control law
      5.3.1 Input-output linearization
      5.3.2 Control law
      5.3.3 Desired trajectories of the homography elements
   5.4 Stability analysis
6 SIMULATIONS
7 EXPERIMENTAL ARRANGEMENTS
8 CONCLUSIONS
APPENDIX A
APPENDIX B


1-INTRODUCTION

Robots are electro-mechanical machines designed to interact with their environment. In order to realize that interaction in a desired manner, they must be equipped with appropriate sensory modalities. To date, most robotic applications take place in known environments or in environments arranged to be suitable for robots. Until recently, robots have rarely been used in work environments that cannot be fully controlled or about which little information is available. The main reason for this limitation is the insufficient sensory capabilities of the robots. To compensate for the lack of information obtained from the surroundings, the integration of different sensors has become one of the crucial steps in robot design, and vision is recognized as particularly important for increasing the versatility of robots. In the last couple of decades, a great deal of work has been carried out successfully in the area of robotic vision [1], [2], [3]. Increased computing power and improved pixel-processing hardware enable images to be analyzed at a rate sufficient to guide robotic manipulators without touching the objects [1]. With the use of vision devices and the information obtained from them in robotic applications, the terms "visual servoing" and "visual servo control" have come into use. Visual servo control refers to closed-loop control of the pose of a robot using information extracted from vision sensors, and it relies on techniques from several fundamental areas such as image processing, computer vision, kinematics, dynamics and control theory.

2-DESIGN ISSUES

While designing a vision-based control system, many questions arise, ranging from the type of camera and lens to be used, to the number of cameras and where to place them, to which kind of image features to utilize, and to whether to derive a three-dimensional description of the scene, to use two-dimensional image data, or to combine both. Since vision has a broad application area and new techniques and solutions are being developed continuously, this list of questions could easily be extended. To stay within the scope of this project, only two crucial issues in the design of vision-based control systems are explained here; detailed information on other aspects can be found in numerous academic sources.

2.1. Camera Configuration

One main issue when constructing a vision-based control system is the placement of the camera. There are two main options: the camera can either be placed at a fixed location, where it does not move, or it can be mounted on the robot. These configurations are called "fixed camera" and "eye-in-hand" configurations, respectively.

If a fixed camera configuration is used, the camera is placed at a location from which it can observe the task space and the robot/manipulator. Since the camera does not move, the geometric link between the task space and the camera does not change. However, the camera's clear view of the task space can be obstructed by the manipulator's motion, and such occlusions can cause severe performance degradation or even instability.

With an eye-in-hand system, the camera is mounted on the robot/manipulator. This configuration enables the camera to see the task space without occlusions while the robot travels around the workspace. As opposed to the fixed camera configuration, the geometric relationship between the task space and the camera changes as the robot moves. On the other hand, the scene observed by the camera can change very drastically when the camera attachment point undergoes large and fast movements. This drawback may be encountered especially with multi-link robotic manipulators and can have undesirable performance consequences.

2.2. Servoing Architectures

Different servoing architecture classifications have been proposed in the literature, but the most widely used one is based on the question: "Is the error signal (or task function) defined in three-dimensional workspace coordinates or directly in terms of image features?" The answer leads to a taxonomy in which the error signal is defined in 3D workspace coordinates, directly in terms of image features, or in a combination of both.

2.2.1. Image Based Visual Servoing

This approach uses image data directly to control the robot motion; the task function is defined in the image, so there is no need to explicitly estimate the pose error in Cartesian space. The image measurements used to determine the task/error function are the pixel coordinates of a set of image features, such as interest points, and the task function is isomorphic to the camera pose. A control law is constructed to map the image error directly to robot motion. The system can use either a fixed camera or an eye-in-hand configuration; in either case, the motion of the robot results in changes in the image provided by the vision system. Hence, defining an image-based visual servoing task requires an appropriate definition of an error e such that the error becomes zero when the task is accomplished.

2.2.2. Position Based Visual Servoing

With this approach, the vision data are used to build a 3D representation of the scene; that is, the task/error function is expressed in Cartesian space. Features extracted from the image and/or a 3D model of the object are used to determine the position and orientation of the target with respect to the camera. Using this information, an error between the current and desired poses of the robot is defined in the workspace, and suitable coordinates can be provided as set points to the controller.


2.2.3. 2D ½ Visual Servoing (Hybrid Visual Servoing)

The task function is expressed both in Cartesian space and in the image: the rotation error is estimated explicitly in Cartesian space, while the translation error is expressed in the image. 2D ½ visual servoing is based on estimating, at each iteration of the control law, the partial camera displacement from the current to the desired camera pose. Contrary to position-based approaches, it does not need a 3D model of the object, and contrary to image-based methods, it can avoid certain stability problems over the whole task space [4].

3-AN INSIGHT INTO VISUAL SERVOING METHODS

In order for the reader to gain insight into visual servoing methods, an analysis of these methods is carried out and related references are given in this section. Before proceeding with this analysis, the geometry of image formation is explained as a preliminary subject.

3.1. The Geometry of Image Formation

A digital image is a data structure representing a generally rectangular grid of pixels. The word pixel is a contraction of pix ("pictures") and el ("element"). Pixels are normally arranged in a two-dimensional grid and are often represented using dots or squares. The image is formed by directing light onto a two-dimensional array of sensing elements. Each pixel has a value corresponding to the intensity of the light focused on a particular sensing element [5]. The medium used to focus the light onto the sensing elements is the lens, and the sensing elements are charge-coupled device sensors. A charge-coupled device (CCD) is a device for the movement of electrical charge, usually from within the device to an area where the charge can be manipulated, for example by conversion into a digital value [6].

3.1.1. The Camera Coordinate Frame

The image plane is the plane that contains the sensing elements, and the camera coordinate frame is assigned as follows:

i) the z axis is chosen to be perpendicular to the image plane and along the optical axis of the lens,

ii) the origin of the camera coordinate frame lies a distance λ (the focal length of the camera) behind the image plane,

iii) the x and y axes are assigned according to the right-hand rule and are taken to be parallel to the horizontal and vertical axes of the image plane, respectively.

The origin of the camera coordinate frame is called the center of projection, and the point where the optical axis crosses the image plane is the principal point. An illustration of the coordinate frame is given in Figure 3.1. Any point on the image plane can be represented by coordinates (u, v, λ) with respect to the camera coordinate frame.


Figure 3.1. The Camera Coordinate Frame

The point P, whose coordinates with respect to the camera coordinate frame are (x, y, z), is projected onto the image plane with coordinates (u, v, λ). The relation between these coordinates, with an unknown positive constant k, is

$$ k \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} u \\ v \\ \lambda \end{bmatrix}. $$

From this equality, the following equations are obtained easily:

$$ k = \frac{\lambda}{z}, \qquad u = \lambda\frac{x}{z}, \qquad v = \lambda\frac{y}{z}. \qquad (3.1) $$

This relation defines the perspective projection model, which is the most widely used camera projection model. Other camera projection models, such as scaled orthographic projection and affine projection, are discussed by S. Hutchinson [2]. The analysis of visual servo control methods in this report is based on the perspective projection model.

3.1.2. The Image Plane and the Sensor Array

The row and column indices of a pixel are denoted by the pixel coordinates (r, c). In order to establish a relation between the coordinates of image points and their corresponding 3D world coordinates, the image plane coordinates (u, v) and the pixel coordinates (r, c) must be related.

Let the pixel coordinates of the principal point be denoted by $(o_r, o_c)$ and let the origin of the pixel array be attached to the corner of the image. The horizontal and vertical dimensions of a pixel are given by $s_x$ and $s_y$, respectively; $s_x$ and $s_y$ are the scale factors relating pixels to distance. Moreover, the vertical and horizontal axes of the pixel coordinate system usually point in directions opposite to the horizontal and vertical axes of the camera frame [5]. Combining all of the above yields equation (3.2), which relates the image plane coordinates to the pixel coordinates:

$$ -\frac{u}{s_x} = r - o_r, \qquad -\frac{v}{s_y} = c - o_c. \qquad (3.2) $$
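As a small worked example of equations (3.1) and (3.2), the sketch below projects one 3D point expressed in the camera frame onto the image plane and then converts the result to pixel coordinates. The numerical values of the focal length, pixel dimensions and principal point are assumptions chosen only for illustration, not parameters of the camera used in this project.

```matlab
% Minimal sketch of the pinhole projection chain of equations (3.1) and (3.2).
% All numeric values below are illustrative assumptions.
lambda = 0.006;          % focal length [m]
sx = 1e-5; sy = 1e-5;    % pixel dimensions [m/pixel]
o_r = 240; o_c = 320;    % principal point in pixel coordinates

P = [0.2; -0.1; 1.5];    % 3D point (x, y, z) in the camera frame [m]

% Perspective projection, equation (3.1)
u = lambda * P(1) / P(3);
v = lambda * P(2) / P(3);

% Image plane to pixel coordinates, equation (3.2)
r = o_r - u / sx;
c = o_c - v / sy;

fprintf('image plane: (u, v) = (%.6f, %.6f) m\n', u, v);
fprintf('pixel coordinates: (r, c) = (%.1f, %.1f)\n', r, c);
```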

3.2. Analysis of Visual Servoing Methods

As stated before, there are mainly three classes of visual servoing methods, and the explanation of each method is generally specific to the application. A great deal of work continues to be added to the visual servoing literature, and each method deserves its own tutorial, so it is not possible to cover all available methods here. Thus, only the classical image-based visual servoing method is considered in order to provide some basic insight, and appropriate references are pointed out for the other methods.

The aim of all vision-based control schemes is to minimize an error, usually defined by

$$ \mathbf{e}(t) = \mathbf{s}(t) - \mathbf{s}^*. $$

Here $\mathbf{s}(t)$ denotes a vector of image feature values that are tracked during motion, and $\mathbf{s}^*$ contains the desired values of those features. If a single point is used as an image feature, then $\mathbf{s}(t)$ can be defined in terms of the image plane coordinates of that point:

$$ \mathbf{s}(t) = \begin{bmatrix} u(t) \\ v(t) \end{bmatrix}. $$

The time derivative of $\mathbf{s}(t)$ is called the image feature velocity, and it is linearly related to the camera velocity. If the camera velocity is represented by $\boldsymbol{\xi} = \begin{bmatrix} \mathbf{v} \\ \boldsymbol{\omega} \end{bmatrix}$, in which $\mathbf{v}$ is the linear velocity of the origin of the camera frame and $\boldsymbol{\omega}$ is the angular velocity of the camera expressed in the camera coordinate frame, then the relationship between the image feature velocity and the camera velocity becomes

$$ \dot{\mathbf{s}} = \mathbf{L}(\mathbf{s}, \mathbf{q})\,\boldsymbol{\xi}. \qquad (3.3) $$

The matrix $\mathbf{L}$ is called the image Jacobian or interaction matrix; it is a function of the image features and of the position of the robot. In order to derive the interaction matrix, which relates the velocity of the camera $\boldsymbol{\xi}$ to the time derivatives of the coordinates of the projection of a fixed 3D point $\mathbf{P}$ in the image ($\dot{\mathbf{s}}$), it is necessary to find an expression for the velocity of the point $\mathbf{P}$ with respect to the moving camera. Using homogeneous transformations, the relation between the coordinates of $\mathbf{P}$ with respect to the world frame and with respect to the moving camera frame can be written as

$$ \mathbf{P}_o = \mathbf{R}(t)\,\mathbf{P}_c(t) + \mathbf{o}(t), $$

where $\mathbf{P}_o$ denotes the coordinates of P with respect to the world coordinate frame, $\mathbf{P}_c$ the coordinates of P relative to the moving camera frame, and $\mathbf{R}(t)$ and $\mathbf{o}(t)$ the rotation matrix and translation vector, respectively, between the world frame and the camera coordinate frame. Thus, the coordinates of P relative to the camera frame can be obtained as in the following equation:


$$ \mathbf{P}_c(t) = \mathbf{R}^T(t)\,\big(\mathbf{P}_o - \mathbf{o}(t)\big) \qquad (3.4) $$

since $\mathbf{R}^T(t) = \mathbf{R}^{-1}(t)$. Taking the time derivative of equation (3.4), and noting that $\mathbf{P}_o$ is constant in time, gives

$$ \dot{\mathbf{P}}_c(t) = \dot{\mathbf{R}}^T(t)\,\big(\mathbf{P}_o - \mathbf{o}(t)\big) - \mathbf{R}^T(t)\,\dot{\mathbf{o}}(t). \qquad (3.5) $$

Plugging $\dot{\mathbf{R}} = \mathbf{S}(\boldsymbol{\omega})\mathbf{R}$ and $\dot{\mathbf{R}}^T = \mathbf{R}^T\mathbf{S}(\boldsymbol{\omega})^T = \mathbf{R}^T\mathbf{S}(-\boldsymbol{\omega})$ into equation (3.5) and performing some manipulations, the following equation is obtained [5]:

$$ \dot{\mathbf{P}}_c(t) = -\boldsymbol{\omega}_c(t) \times \mathbf{P}_c(t) - \dot{\mathbf{o}}_c(t). \qquad (3.6) $$

Here $\boldsymbol{\omega}_c$ and $\dot{\mathbf{o}}_c$ are the angular and linear velocities of the camera, respectively, expressed in the camera coordinate frame. If the arguments in equation (3.6) are written out explicitly and the cross product and subtraction are carried out, a system of three independent equations is obtained, with

$$ \mathbf{P}_c(t) = \begin{bmatrix} x(t) \\ y(t) \\ z(t) \end{bmatrix}, \quad \dot{\mathbf{P}}_c(t) = \begin{bmatrix} \dot{x}(t) \\ \dot{y}(t) \\ \dot{z}(t) \end{bmatrix}, \quad \boldsymbol{\omega}_c(t) = \begin{bmatrix} \omega_x(t) \\ \omega_y(t) \\ \omega_z(t) \end{bmatrix}, \quad \dot{\mathbf{o}}_c(t) = \begin{bmatrix} v_x(t) \\ v_y(t) \\ v_z(t) \end{bmatrix}. $$

The coordinates of the point P relative to the moving camera, as well as the angular and linear velocities of the camera with respect to the camera coordinate frame, are time dependent. However, the explicit time dependence is not shown in the following equations for the sake of notational simplicity.

$$ \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{z} \end{bmatrix} = -\begin{bmatrix} \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} \times \begin{bmatrix} x \\ y \\ z \end{bmatrix} - \begin{bmatrix} v_x \\ v_y \\ v_z \end{bmatrix} \qquad (3.7) $$

Equating the right-hand side and the left-hand side of equation (3.7) results in the system of three equations (3.8)-(3.10):

$$ \dot{x} = y\omega_z - z\omega_y - v_x \qquad (3.8) $$
$$ \dot{y} = z\omega_x - x\omega_z - v_y \qquad (3.9) $$
$$ \dot{z} = x\omega_y - y\omega_x - v_z \qquad (3.10) $$

Combining these equations with equation (3.1) gives equations (3.11)-(3.13):

$$ \dot{x} = \frac{vz}{\lambda}\omega_z - z\omega_y - v_x \qquad (3.11) $$
$$ \dot{y} = z\omega_x - \frac{uz}{\lambda}\omega_z - v_y \qquad (3.12) $$
$$ \dot{z} = \frac{uz}{\lambda}\omega_y - \frac{vz}{\lambda}\omega_x - v_z \qquad (3.13) $$


It is also necessary to find the time derivatives of the image plane coordinates. While taking these time derivatives, equations (3.11)-(3.13) are used wherever necessary:

$$ \dot{u} = \lambda\,\frac{\dot{x}z - x\dot{z}}{z^2} = -\frac{\lambda}{z}v_x + \frac{u}{z}v_z + \frac{uv}{\lambda}\omega_x - \frac{\lambda^2 + u^2}{\lambda}\omega_y + v\,\omega_z \qquad (3.14) $$

$$ \dot{v} = \lambda\,\frac{\dot{y}z - y\dot{z}}{z^2} = -\frac{\lambda}{z}v_y + \frac{v}{z}v_z + \frac{\lambda^2 + v^2}{\lambda}\omega_x - \frac{uv}{\lambda}\omega_y - u\,\omega_z \qquad (3.15) $$

Equations (3.14) and (3.15) can be represented in matrix form [5]:

$$ \begin{bmatrix} \dot{u} \\ \dot{v} \end{bmatrix} = \begin{bmatrix} -\dfrac{\lambda}{z} & 0 & \dfrac{u}{z} & \dfrac{uv}{\lambda} & -\dfrac{\lambda^2 + u^2}{\lambda} & v \\ 0 & -\dfrac{\lambda}{z} & \dfrac{v}{z} & \dfrac{\lambda^2 + v^2}{\lambda} & -\dfrac{uv}{\lambda} & -u \end{bmatrix} \begin{bmatrix} v_x \\ v_y \\ v_z \\ \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} \qquad (3.16) $$

The first three columns depend on the image plane coordinates (u, v) and on the depth z of the 3D point relative to the camera frame. Therefore, any control scheme using this form must estimate or approximate the value of z. This depth information can come from stereo cameras, multiple cameras, a single camera with multiple views, or suitable range sensors; z can be estimated, for instance, by triangulation from at least two views of the scene. As can be seen, the part of the interaction matrix that involves the depth is the translational part, while the rotational part depends only on the image plane coordinates.

When more than one point is tracked in the image, the interaction matrices of the individual points can be stacked into one overall interaction matrix in order to find the camera motion:

$$ \begin{bmatrix} \dot{u}_1 \\ \dot{v}_1 \\ \vdots \\ \dot{u}_n \\ \dot{v}_n \end{bmatrix} = \begin{bmatrix} -\dfrac{\lambda}{z_1} & 0 & \dfrac{u_1}{z_1} & \dfrac{u_1 v_1}{\lambda} & -\dfrac{\lambda^2 + u_1^2}{\lambda} & v_1 \\ 0 & -\dfrac{\lambda}{z_1} & \dfrac{v_1}{z_1} & \dfrac{\lambda^2 + v_1^2}{\lambda} & -\dfrac{u_1 v_1}{\lambda} & -u_1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ -\dfrac{\lambda}{z_n} & 0 & \dfrac{u_n}{z_n} & \dfrac{u_n v_n}{\lambda} & -\dfrac{\lambda^2 + u_n^2}{\lambda} & v_n \\ 0 & -\dfrac{\lambda}{z_n} & \dfrac{v_n}{z_n} & \dfrac{\lambda^2 + v_n^2}{\lambda} & -\dfrac{u_n v_n}{\lambda} & -u_n \end{bmatrix} \begin{bmatrix} v_x \\ v_y \\ v_z \\ \omega_x \\ \omega_y \\ \omega_z \end{bmatrix} $$

Thus $\mathbf{L} \in \mathbb{R}^{2n \times 6}$, so three points are sufficient to solve for $\boldsymbol{\xi}$ given the image measurements $\dot{\mathbf{s}}$, and the desired camera velocity $\boldsymbol{\xi}$ can be used as the control input. In order to find $\boldsymbol{\xi}$, the interaction matrix must be inverted directly if possible; otherwise, its pseudoinverse (Moore-Penrose inverse) must be used. If k features are tracked in the image, the camera velocity consists of m components, and rank(L) = min(k, m), i.e. L is of full rank, then there are three possibilities for the inversion of the interaction matrix.


i) If k = m: $\boldsymbol{\xi} = \mathbf{L}^{-1}\dot{\mathbf{s}}$ (exactly enough features are observed).

ii) If k < m: $\boldsymbol{\xi} = \mathbf{L}^{+}\dot{\mathbf{s}}$ with $\mathbf{L}^{+} = \mathbf{L}^{T}(\mathbf{L}\mathbf{L}^{T})^{-1}$ (not enough features are observed).

iii) If k > m: $\boldsymbol{\xi} = \mathbf{L}^{+}\dot{\mathbf{s}}$ with $\mathbf{L}^{+} = (\mathbf{L}^{T}\mathbf{L})^{-1}\mathbf{L}^{T}$ (more than enough features are observed).

Stability can be proven with the help of a suitable Lyapunov function for the error system,

$$ V(t) = \tfrac{1}{2}\,\|\mathbf{e}(t)\|^2. $$

The Lyapunov candidate must be positive definite everywhere except at the origin of the error system, and its time derivative $\dot{V}(t) = \mathbf{e}^T\dot{\mathbf{e}}$ must be negative definite away from the origin. Stability of the system is proven if $\dot{\mathbf{e}}$ is chosen as $\dot{\mathbf{e}} = -\kappa\mathbf{e}$, with $\kappa$ a positive constant. For the vision system, the time derivative of the error follows from

$$ \mathbf{e}(t) = \mathbf{s}(t) - \mathbf{s}^*, \qquad \dot{\mathbf{e}}(t) = \dot{\mathbf{s}}(t) = \mathbf{L}\boldsymbol{\xi}. $$

Since $\dot{\mathbf{e}} = -\kappa\mathbf{e}$ must be satisfied, $\mathbf{L}\boldsymbol{\xi} = -\kappa\mathbf{e}$ must also be satisfied.

If k = m and rank(L) = min(k, m), then the exact inverse of the interaction matrix exists, so $\boldsymbol{\xi} = -\kappa\mathbf{L}^{-1}\mathbf{e}(t)$ can be used as the control signal. The time derivative of the Lyapunov function stated above then becomes

$$ \dot{V} = \mathbf{e}^T\dot{\mathbf{e}} = \mathbf{e}^T\mathbf{L}\boldsymbol{\xi} = -\kappa\,\mathbf{e}^T\mathbf{L}\mathbf{L}^{-1}\mathbf{e} = -\kappa\,\mathbf{e}^T\mathbf{e} < 0, $$

which proves asymptotic stability.

If k > m or k < m and L is of full rank, the exact inverse of the interaction matrix cannot be obtained, so its pseudoinverse should be used. Then $\boldsymbol{\xi} = -\kappa\mathbf{L}^{+}\mathbf{e}(t)$ is used as the control signal (the definition of $\mathbf{L}^{+}$ differs for the cases k > m and k < m), and the time derivative of the Lyapunov function becomes

$$ \dot{V} = \mathbf{e}^T\dot{\mathbf{e}} = \mathbf{e}^T\mathbf{L}\boldsymbol{\xi} = -\kappa\,\mathbf{e}^T\mathbf{L}\mathbf{L}^{+}\mathbf{e} \le 0 $$

since $\mathbf{L}\mathbf{L}^{+}$ is positive semidefinite. Therefore, stability of the system can be proven, but asymptotic stability is not guaranteed.
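As a concrete illustration of the control law discussed above, the sketch below stacks the interaction matrix of equation (3.16) for three point features and computes the camera velocity $\boldsymbol{\xi} = -\kappa\mathbf{L}^{+}\mathbf{e}$. It is only a minimal numerical sketch; the focal length, depths, feature coordinates and gain are arbitrary assumed values, not parameters of the set-up used in this project.

```matlab
% Minimal sketch of the classical IBVS law xi = -kappa * pinv(L) * e,
% using the point-feature interaction matrix of equation (3.16).
% All numerical values are illustrative assumptions.
lambda = 0.006;                       % focal length
kappa  = 0.5;                         % control gain

% Current and desired image-plane coordinates of three point features,
% plus the (assumed known) depths of the points.
s      = [ 0.0010 -0.0020  0.0030  0.0010 -0.0010  0.0025]';
s_star = [ 0.0000  0.0000  0.0020  0.0000 -0.0020  0.0010]';
z      = [ 1.5     1.8     2.0 ]';

L = zeros(6, 6);                      % stacked interaction matrix (2n x 6)
for i = 1:3
    u = s(2*i-1);  v = s(2*i);
    L(2*i-1, :) = [-lambda/z(i), 0, u/z(i),  u*v/lambda, -(lambda^2 + u^2)/lambda,  v];
    L(2*i,   :) = [0, -lambda/z(i), v/z(i), (lambda^2 + v^2)/lambda, -u*v/lambda, -u];
end

e  = s - s_star;                      % image-space error
xi = -kappa * pinv(L) * e;            % camera velocity [vx vy vz wx wy wz]'
disp(xi');
```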

The analysis of position-based visual servoing and of hybrid approaches varies from one application to another; several important basic works are [4], [7], [8]. For these kinds of visual servoing methods the main aim is also to minimize the error $\mathbf{e}(t) = \mathbf{s}(t) - \mathbf{s}^*$, as in the classical image-based control method, but the contents of $\mathbf{s}$ change depending on the available information, the set-up and the aim of the application.


4-PROJECT DESCRIPTION

Having provided an introduction to visual servoing and familiarity with the classical methods and the way the control law is constructed, the description of the project and of the rest of the work can now build on these basics. In this assignment, a visual servoing problem based on a fixed monocular camera mounted on a mobile robot is investigated. The objective is to design a control law for autonomous navigation of the robot under non-holonomic motion constraints. The visual control task uses the idea of homing: an image is first taken at the desired position, and the control law then drives the mobile robot from an initial pose towards the desired pose by processing the information extracted from the target image and the current images taken during the movement of the robot. Unlike the classical methods, a homography-based visual servoing method is adopted in order to achieve this task without the need for depth estimation or any measurements of the scene. With this approach, the controller is obtained by an exact input-output linearization of the geometric model, in which homography elements are chosen as the outputs of the system [9].

5-HOMOGRAPHY BASED VISUAL SERVOING OF A NONHOLONOMIC MOBILE ROBOT

In this chapter, a detailed analysis of homography-based visual servoing is carried out. Section 5.1 describes the homography and develops an understanding of it, and section 5.2 derives the motion model of the mobile robot. In section 5.3, input-output linearization of the system is performed through the homography and the control law is constructed based on that linearization scheme. Then, in section 5.4, the stability analysis of the system is conducted.

5.1. Homography and Its Estimation

A two-dimensional point $\mathbf{X}_{2D} = (x, y)$ lying on a plane can also be represented by a three-dimensional vector $\mathbf{X}_{3D} = (x_1, x_2, x_3)$. Here $\mathbf{X}_{2D}$ is the version of $\mathbf{X}_{3D}$ scaled by its third element, i.e. $x = x_1/x_3$ and $y = x_2/x_3$. When points on a projective plane are represented with respect to a coordinate frame whose x and y axes lie in that same projective plane, all points have the same depth, so the "z" coordinate carries no information. Therefore all points are scaled by the third element and "z" becomes 1 for all points. This representation ($\mathbf{X}_{3D}$) is used in homography analysis and is called the homogeneous representation of a point lying on a projective plane $P^2$. A homography can then be defined as a mapping of these points from one projective plane to another, and it has the property of invertibility. Synonyms of homography are projectivity, planar projective transformation and collineation. According to [10], a homography is an invertible mapping from $P^2$ to itself such that three points lie on the same line if and only if their mapped points are also collinear, and its algebraic definition is: a mapping from $P^2 \to P^2$ is a projectivity if and only if there exists a nonsingular 3x3 matrix H such that for any point in $P^2$ represented by a vector x, its mapped point is equal to Hx.


5.1.1. Geometric Transformations

There are several types of geometric transformations, each with its own particular properties, and homographies are one of them. Homographies are better understood when explained in a context that includes the other types of geometric transformations. A detailed description of all geometric transformations can be found in [10].

i) Isometries

Isometries (iso = same, metric = measure) are transformations of the plane $P^2$ that preserve Euclidean distance. An isometry can be described by equation (5.1):

$$ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} \epsilon\cos\theta & -\sin\theta & t_x \\ \epsilon\sin\theta & \cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (5.1) $$

where $\epsilon = \pm 1$. If $\epsilon = 1$, the isometry preserves orientation and is a Euclidean transformation; if $\epsilon = -1$, it reverses orientation. Euclidean transformations represent rigid body motion and consist of planar rotations and translations. If the rotation matrix is the identity, the points are simply translated in 2D; if the translation vector is zero, the points undergo a pure 2D rotation. A planar Euclidean transformation has three degrees of freedom: one for the rotation ($\theta$) and two for the translation ($t_x$ and $t_y$). The distance between two points is preserved when they are mapped by an isometry, as are the angle between two lines and the area.

ii) Similarity Transformations

A similarity transformation (or similarity) is an isometry combined with an isotropic scaling; its representation is given in equation (5.2):

$$ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} s\mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (5.2) $$

where the isotropic scaling s is direction invariant. The scale s adds one more degree of freedom to an isometry, so a similarity has four degrees of freedom. A similarity no longer preserves the distance between points when $s \neq \pm 1$. However, it keeps the ratio of distances and the angles between lines invariant, so it preserves shape. An example is shown in Figure 5.1 [11].


Figure 5.1 Similarity Transformation

iii) Affine Transformations

An affine transformation (or affinity) is a non-singular linear transformation followed by a translation [10]. It is like a similarity, but it involves two rotations and two non-isotropic scalings. It is represented by

$$ \begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{A} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (5.3) $$

It has six degrees of freedom, corresponding to $a_{11}, a_{12}, a_{21}, a_{22}, t_x, t_y$. The affine matrix $\mathbf{A}$ can be decomposed as

$$ \mathbf{A} = R(\theta)\,R(-\phi)\,D\,R(\phi), \qquad D = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix}. $$

Therefore, the affine matrix $\mathbf{A}$ performs a rotation by $\phi$, a scaling by $\lambda_1$ in the x direction and by $\lambda_2$ in the y direction, a rotation by $-\phi$, and another rotation by $\theta$. An affinity has two more degrees of freedom than a similarity, corresponding to the angle $\phi$, which defines the direction of scaling, and the ratio of the scaling parameters $\lambda_1/\lambda_2$. Figure 5.2 shows the interpretation of the action of the affine matrix $\mathbf{A}$.

Figure 5.2 Effect of Affine Transformation

If the affine matrix is considered in two parts, $\mathbf{A} = R(\theta)\,\big[R(-\phi) D R(\phi)\big]$, then $R(\theta)$ corresponds to a shape-preserving rotation, while the $R(-\phi) D R(\phi)$ part corresponds to a deformation of the shape along the axis defined by $\phi$ and along the axis perpendicular to it, with the amount of distortion depending on the scaling factors $\lambda_1$ and $\lambda_2$. Figure 5.3 [12] shows some examples of affine transformations.

Figure 5.3 Visual examples of affinity transformations

The distances between points and the angles between lines are not preserved by affine transformations. However, some invariants remain: parallel lines in one image remain parallel in the mapped image, and the ratios of lengths of parallel line segments and the ratios of areas are unchanged.
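As a small numerical check of the decomposition $\mathbf{A} = R(\theta)R(-\phi)DR(\phi)$ described above, the sketch below builds such an affine matrix and recovers a factorization of the same form from its singular value decomposition. The angles and scale factors are arbitrary assumed values used only for illustration.

```matlab
% Minimal sketch of the affine decomposition A = R(theta)*R(-phi)*D*R(phi).
% All angles and scale factors are assumed values.
R2 = @(a) [cos(a), -sin(a); sin(a), cos(a)];    % planar rotation matrix

theta = deg2rad(30);
phi   = deg2rad(15);
D     = diag([2, 0.5]);                         % non-isotropic scalings lambda1, lambda2

A = R2(theta) * R2(-phi) * D * R2(phi);         % 2x2 linear part of the affinity

% The SVD provides a factorization of the same form:
% A = U*S*V' = (U*V') * (V*S*V'), with U*V' a rotation and V*S*V' a
% rotated non-isotropic scaling of the type R(-phi')*D'*R(phi').
[U, S, V] = svd(A);
rotationPart    = U * V';
deformationPart = V * S * V';
disp(norm(A - rotationPart * deformationPart)); % ~0: the product reproduces A
```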

iv) Perspective Projection

Perspective projection is the projection of three-dimensional points in Cartesian space onto two-dimensional points. It is an important and widely used projection method, and it describes the mapping of points in space onto the image plane when images are taken by a camera. A perspective projection can be described by

$$ \mathbf{x} = \mathbf{P}\mathbf{X}, $$

where P is the 3x4 projection matrix, $\mathbf{x}$ is an image point represented by a homogeneous 3-vector and $\mathbf{X}$ is a point in space represented by a homogeneous 4-vector [13]. The projection matrix has 12 elements, but they are defined only up to a scale factor, i.e. only the ratios of the elements are significant, so it has 11 degrees of freedom. These 11 degrees of freedom come from the internal and external camera matrices: the internal (intrinsic) camera matrix, or camera calibration matrix, provides 5 degrees of freedom, and the external (extrinsic) camera matrix provides 6 degrees of freedom. Perspective projection can be split into two phases in terms of its actions. First, it finds the coordinates of a point in 3D space with respect to the camera frame by means of a homogeneous transformation matrix. Then it projects those coordinates, which are relative to the camera frame, onto the image plane using the intrinsic camera matrix.

The extrinsic camera matrix can be written as [R|t], which accounts for the rotation matrix and the translation vector between the camera and world frames, so six external parameters relate the camera orientation to the world coordinate system: three rotations expressed by the 3x3 rotation matrix R and three translations denoted by the 3x1 vector t.

The intrinsic camera matrix K can be defined as

$$ \mathbf{K} = \begin{bmatrix} \alpha_x & s & x_o \\ 0 & \alpha_y & y_o \\ 0 & 0 & 1 \end{bmatrix}. $$

In this matrix, $\alpha_x$ and $\alpha_y$ are the focal lengths of the camera in terms of pixel dimensions in the x and y directions, respectively; $x_o$ and $y_o$ are the pixel coordinates of the principal point in the image; and $s$ is the skew parameter, which describes the deviation of the pixels from orthogonality (the perpendicularity of the sides of the pixels). Here $s = \cot(\varsigma)$, where $\varsigma$ is the angle between the sides of a pixel. Generally pixels are rectangular, so $\varsigma = 90°$ and $s = \cot(\varsigma) = 0$. Hence the intrinsic camera matrix K accounts for 5 internal parameters $(\alpha_x, \alpha_y, x_o, y_o, s)$.

Thus, the projection matrix can be represented by the combination of the extrinsic and intrinsic camera matrices, $\mathbf{P} = \mathbf{K}[\mathbf{R}|\mathbf{t}]$.

An example of perspective projection is given in Figure 5.4 [14].

Figure 5.4 Perspective Projection


All 3D world points are mapped onto 2D image points as illustrated in Figure 5.4. The perspective projection gives the most realistic impression of depth, although the exact depth cannot be recovered from a single image. A perspective projection produces a view similar to the way the human eye perceives its environment. Remark: if you fully close one eye and try to touch something around you, you will notice that you are not as accurate as when both eyes are open; in other words, you cannot touch the object in a relaxed and comfortable manner with one eye closed. This shows that perspective projection gives a realistic but not perfect impression of depth. When both eyes are open, however, the exact depth of the point is known; in human depth perception this is called stereopsis.
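To make the composition $\mathbf{P} = \mathbf{K}[\mathbf{R}|\mathbf{t}]$ described above concrete, the following minimal sketch builds an intrinsic matrix, an extrinsic matrix and the resulting projection matrix, and projects one homogeneous world point. All numerical values are assumptions chosen only for illustration and do not correspond to the camera used in this project.

```matlab
% Minimal sketch of the projection matrix P = K*[R | t] and its action
% x = P*X on a homogeneous world point. All numbers are assumed values.
alpha_x = 800; alpha_y = 800;        % focal lengths in pixels
x_o = 320; y_o = 240; s = 0;         % principal point and zero skew

K = [alpha_x, s,       x_o;
     0,       alpha_y, y_o;
     0,       0,       1];

theta = deg2rad(10);                 % camera rotated 10 deg about the z axis
R = [cos(theta), -sin(theta), 0;
     sin(theta),  cos(theta), 0;
     0,           0,          1];
t = [0.1; -0.05; 0.3];               % camera translation [m]

P = K * [R, t];                      % 3x4 projection matrix

X = [1; 0.5; 4; 1];                  % homogeneous world point
x = P * X;                           % homogeneous image point
pixel = x(1:2) / x(3);               % inhomogeneous pixel coordinates
disp(pixel');
```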

v) Projective Transformation

A planar projective transformation, or homography, is a transformation on homogeneous 3-vectors represented by a nonsingular 3x3 matrix H such that $\mathbf{x}' = \mathbf{H}\mathbf{x}$. The matrix H can be multiplied by any nonzero scale factor without altering the projective transformation; hence H is called a homogeneous matrix, since only the ratios of its elements matter. There are 8 independent ratios, so homographies have 8 degrees of freedom. None of the invariants of an affine transformation holds for homographies. However, as mentioned at the beginning of this chapter, if three points lie on the same line in one image, they also lie on the same line when mapped to the other image. A projective transformation can be written as

$$ \mathbf{H} = \begin{bmatrix} \mathbf{A} & \mathbf{t} \\ \mathbf{V} & v \end{bmatrix}, \qquad \mathbf{V} = (V_1, V_2). $$

An important difference between projective transformations and affinities is the vector V, which is the source of the nonlinearities of projective transformations. Moreover, as opposed to affinities, the scalings contained in A vary depending on the position in the image; similarly, the orientation of a transformed line depends on the position and orientation of the source line.
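The following minimal sketch applies a homography to a homogeneous point as $\mathbf{x}' = \mathbf{H}\mathbf{x}$ and normalizes by the third coordinate; it also illustrates that multiplying H by a nonzero scale factor leaves the mapped point unchanged. The matrix and the point are arbitrary assumed values.

```matlab
% Minimal sketch: applying a homography x' = H*x to a homogeneous point and
% normalizing by the third coordinate. H is an arbitrary assumed example;
% the snippet also shows that H is only defined up to scale.
H = [1.1,  0.2,   5;
    -0.1,  0.9,  -3;
     0.001, 0.002, 1];

x  = [120; 80; 1];                 % homogeneous image point
xp = H * x;                        % mapped homogeneous point
p  = xp(1:2) / xp(3);              % inhomogeneous coordinates

xp_scaled = (3 * H) * x;           % same homography up to scale
p_scaled  = xp_scaled(1:2) / xp_scaled(3);

disp([p, p_scaled]);               % identical columns
```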

5.1.2. Situations in which solving a homography arises

There are many situations in which the use of homographies is required. In this part, applications that make use of homographies are discussed [13].

i) Camera Calibration

Camera calibration is a key step in many vision applications, as it lets a system determine the relation between what appears in the image and where it is located in the 3D world. In order to compensate for undesired properties of the lens, such as radial distortion, the camera calibration matrix must be known. Two important works on finding the camera calibration matrix using homography estimation are [15] and [16]. In these works, images of the same planar pattern, such as a checkerboard, are taken from different perspectives and a homography is estimated between those images to find the calibration matrix.


ii) 3D Reconstruction and Visual Metrology

3D reconstruction is a computer vision problem in which the goal is to obtain the scene configuration and camera positions from images of the scene. In medical imaging, multiple images of a body part are taken and a 3D model of that part is analyzed. Additionally, in visual metrology, the distances between objects and the sizes of objects are estimated with the help of homographies.

iii) Stereo Vision

Two cameras separated by a distance take pictures of the same scene. The images are shifted over each other to find the parts that match; the amount of shift is called the disparity. A key step is to find point correspondences between the images, and these points are searched for along a line called the epipolar line. Applying rectifying homographies to the images makes the epipolar lines axis-aligned and parallel, which makes the search for corresponding points very efficient [13].

Some more applications can be added to the ones mentioned above. The homography between two views plays an important role in multiple-view geometry. Homographies are also used in tracking applications with multiple cameras and/or one camera with multiple views of the scene, and in building projector-camera systems. The homography relation between two views can be used to obtain the transformation between planes; even when the target is partially or fully occluded by an unknown object, the tracker can follow the target as long as it is visible from another view [13]. No complicated inference scheme is used and no 3D information is recovered explicitly [17]. Additionally, homographies are used in military applications, for instance to obtain the altitude map of an unknown environment from photographs taken by airplanes, so that risks to soldiers can be eliminated in advance.

5.1.3. How to find the homography?

Finding the homography between two images is necessary in order to construct the control law in this project. The ways of finding the homography are analyzed in two subsections. The first subsection answers the question "How can the homography be found in a simulation environment?"; the second explains in detail how the homography is estimated from two real images for the real experiments.

5.1.3.1. Theory of Homography and Homography in Simulation Environments

In this project, the aim is to bring the current camera frame ℱ to the target (reference) camera frame ℱ*. It is assumed that only the images 𝔗* and 𝔗 of the scene, taken at the target position and at the current position respectively, are available. This is illustrated in Figure 5.5.


Figure 5.5 Illustration of the configuration and Homography between two images of a plane

Let P be a point in 3D space whose coordinates are represented by $\boldsymbol{\chi}^* = [X^*, Y^*, Z^*]^T$ in the reference frame ℱ*. $\boldsymbol{\chi}^*$ is mapped onto a virtual plane that is perpendicular to the optical axis and located a distance λ (the focal length) away from the center of projection 𝒪*. Its mapped coordinates are denoted by $\mathbf{m}^* = [u^*, v^*, \lambda]^T$ with respect to the reference camera frame, so the relationship between $\boldsymbol{\chi}^*$ and $\mathbf{m}^*$ is $\mathbf{m}^* = \frac{\lambda}{Z^*}\boldsymbol{\chi}^*$. Then $\mathbf{m}^*$ is projected onto the reference image plane 𝔗* as $\mathbf{p}^* = [r^*, c^*, 1]^T$, with pixel coordinates $r^*$ and $c^*$, by means of the intrinsic camera matrix:

$$ \mathbf{p}^* = \mathbf{K}\mathbf{m}^*, \qquad \mathbf{K} = \begin{bmatrix} \alpha_x & s & x_o \\ 0 & \alpha_y & y_o \\ 0 & 0 & 1 \end{bmatrix}. $$

In the intrinsic camera matrix, $\alpha_x$ and $\alpha_y$ are the focal lengths of the camera in terms of pixel dimensions in the x and y directions, respectively, $x_o$ and $y_o$ are the pixel coordinates of the principal point, and $s$ is the skew parameter explained in the perspective projection section.

The 3D point P is represented by $\boldsymbol{\chi} = [X, Y, Z]^T$ relative to the current camera coordinate frame ℱ. Applying the same procedure to the point P, but this time with respect to the current camera frame, gives

$$ \mathbf{m} = \frac{\lambda}{Z}\boldsymbol{\chi}, \qquad \mathbf{m} = [u, v, \lambda]^T, $$

and $\mathbf{m}$ is then projected onto the current image plane as the point $\mathbf{p} = [r, c, 1]^T$ via $\mathbf{p} = \mathbf{K}\mathbf{m}$.

The rotation matrix and the translation vector between the frames ℱ* and ℱ are $\mathbf{R} \in SO(3)$ and $\mathbf{c} \in \mathbb{R}^3$, respectively. Furthermore, if the point P is assumed to belong to a plane π, $\mathbf{n}^* = [n_x, n_y, n_z]^T$ is the normal of the plane π expressed in the reference camera frame, and $d^*$ is the distance between the plane π and the origin of the reference frame, then the relation between $\mathbf{p}^*$ and $\mathbf{p}$ is defined by a projective transformation H such that $\mathbf{p} = \mathbf{H}\mathbf{p}^*$. The homography H is related to the camera motion as shown in equation (5.4):

$$ \mathbf{H} = \mathbf{K}\,\mathbf{R}\left(\mathbf{I} + \frac{\mathbf{c}\,\mathbf{n}^{*T}}{d^*}\right)\mathbf{K}^{-1}. \qquad (5.4) $$

In the simulations there is no real robot travelling through the workspace and taking images of the scene, so no real images are available. Therefore, the initial and target positions and orientations of the robot must be known in order to emulate the real motion of the robot in the simulation environment. With that knowledge, the rotation matrix and the translation vector between the current frame and the target frame can be found. With the knowledge of the intrinsic camera matrix [18], the homography can then be computed from equation (5.4). For $\mathbf{n}^*$ and $d^*$, arbitrary but reasonable values can be chosen by inspection; although they affect the performance, they do not affect the convergence of the system. Thus, plugging all of these into equation (5.4), a 3x3 homography is obtained and its elements can be used in the determination of the control signal.
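A minimal sketch of this simulation procedure is given below: it evaluates equation (5.4) for an assumed relative pose, plane and intrinsic matrix. None of the numerical values corresponds to the actual experimental set-up; they only illustrate how the homography is generated in simulation.

```matlab
% Minimal sketch of equation (5.4), H = K*R*(I + c*n'/d)*inv(K).
% The pose, plane and intrinsic parameters are arbitrary assumed values.
K = [800, 0, 320;
     0, 800, 240;
     0,   0,   1];                       % assumed intrinsic matrix

phi = deg2rad(5);                        % rotation of the current frame w.r.t.
R   = [cos(phi), 0, sin(phi);            % the reference frame (about the y axis)
       0,        1, 0;
      -sin(phi), 0, cos(phi)];
c   = [0.05; 0; -0.2];                   % translation between the frames [m]

n_star = [0; 0; 1];                      % normal of the scene plane (reference frame)
d_star = 2.0;                            % distance of the plane to the reference origin [m]

H = K * R * (eye(3) + c * n_star' / d_star) / K;   % A/K is A*inv(K)
H = H / H(3,3);                          % homographies are defined up to scale
disp(H);
```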

5.1.3.2. Homography estimation from two real images

In the real experiments, we have the image of the scene taken at the desired position as a reference image and the current images taken during the robot's motion. This means that nothing is available other than the two images (the current one and the reference one), so the rotation matrix and the translation vector are not known a priori. Therefore, all required information must be extracted from the images in order to determine the control signal. To do so, two steps must be completed.

STEP 1: First, features that can be used to find reliable matches between the views of the scene must be extracted from the images. There are several methods in the literature for finding features in images, such as the Harris corner detector, Canny edges, the entropy operator, SIFT, etc. If the detected features are highly distinctive and invariant to image scaling and rotation, a more robust estimation of the homography is possible. Among these, the Scale Invariant Feature Transform (SIFT), an algorithm in computer vision for detecting and describing local features in images, is employed in this project. SIFT and the most common algorithms search for points as image features, although lines and conics may also be used as image features by other algorithms. There are four main cascaded steps for determining the set of image features in the SIFT algorithm; detailed information about the algorithm can be found in [19] and [20].

1. Scale-space extrema detection: The first stage of computation searches over all scales and image locations, i.e. the first stage of keypoint detection is to identify locations and scales that can be recognized under different views of the same scene. It is implemented efficiently by using a difference-of-Gaussian function to identify potential interest points that are invariant to scale and orientation. The image is convolved with Gaussian filters at different scales, and the difference of successive Gaussian-blurred images is then taken. Specifically, a difference-of-Gaussian image is given by

$$ D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_j\sigma), $$

where $L(x, y, k\sigma)$ is the convolution of the original image $I(x, y)$ with the Gaussian blur $G(x, y, k\sigma)$ at scale $k\sigma$, i.e.

$$ L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y). $$

Thus, a difference-of-Gaussian image between scales $k_i\sigma$ and $k_j\sigma$ is simply the difference of the Gaussian-blurred images at scales $k_i\sigma$ and $k_j\sigma$. The image is first convolved with Gaussian blurs at different scales. The convolved images are grouped by octave (an octave corresponds to a doubling of the value of σ), and the value of k is selected so that a fixed number of convolved images per octave is obtained. Then the difference-of-Gaussian images are computed from adjacent Gaussian-blurred images within each octave. An illustration of the difference of Gaussians is given in Figure 5.6 [19].

Figure 5.6 Illustration of difference of Gaussian
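As an illustration of the difference-of-Gaussian computation described above, the sketch below blurs a stand-in image at two adjacent scales and subtracts the results. The image and the scale values are assumptions made only for this example; they are not taken from the SIFT implementation used in the project.

```matlab
% Minimal sketch of one difference-of-Gaussian computation, D = L_i - L_j,
% using plain 2-D convolution (base MATLAB only).
I = double(rand(128, 128));                 % stand-in for a grayscale image

sigma1 = 1.6;  sigma2 = 1.6 * sqrt(2);      % two adjacent scales in an octave
radius = ceil(3 * sigma2);                  % common kernel support
[xk, yk] = meshgrid(-radius:radius, -radius:radius);

G1 = exp(-(xk.^2 + yk.^2) / (2*sigma1^2));  G1 = G1 / sum(G1(:));
G2 = exp(-(xk.^2 + yk.^2) / (2*sigma2^2));  G2 = G2 / sum(G2(:));

L1 = conv2(I, G1, 'same');                  % Gaussian-blurred images L(x,y,sigma)
L2 = conv2(I, G2, 'same');

D = L2 - L1;                                % difference-of-Gaussian image
fprintf('DoG value range: [%.4f, %.4f]\n', min(D(:)), max(D(:)));
```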

2. Keypoint localization: Once the difference-of-Gaussian images have been obtained, keypoints are taken as the maxima/minima of the difference-of-Gaussian (DoG) images across scales [19], [21]. This is done by comparing each pixel in the difference-of-Gaussian images to its eight neighbors at the same scale and to the nine corresponding neighboring pixels in each of the two neighboring scales. Figure 5.7 [19] shows the search region.


Figure 5.7 Search region to find a keypoint candidate

The pixel marked with an X is examined to determine whether it could be a keypoint candidate. There are 8 neighbors around it, all at the same scale as the X pixel, plus 9 pixels above at a higher scale and 9 pixels below at a lower scale, shown by green circles. If the X pixel has the minimum or maximum intensity value among these 26 pixels, it is included in the list of keypoint candidates. This procedure produces many keypoint candidates. However, some of them are not stable enough: they may be located on an edge of the image or in a low-contrast region, so in the presence of image noise it could be hard to distinguish such a pixel from its neighbors and it may no longer be recognized as a keypoint. Algorithms have been developed for discarding low-contrast candidate keypoints and eliminating edge responses [21]; these are not explained here in order to keep the main subject within bounds. After elimination of the inappropriate keypoint candidates, one more step remains: the determination of the keypoint location. For each candidate keypoint, interpolation of nearby data is used to accurately determine its position. Calculating the interpolated location of the extremum improves matching and stability compared with locating each keypoint at the location and scale of the candidate keypoint. This can be explained by a simple example: assume there are two nearby pixels, one totally white and the other of another color, and that the white pixel is considered a keypoint. Normally the coordinates of the center of the white pixel would be provided as the keypoint coordinates. However, the point in the middle of the line connecting the centers of the two pixels has a higher contrast because it lies in the transition region, and it is therefore easier to detect this point in other images of the same scene taken from different perspectives. The interpolations are thus carried out to find more suitable coordinates. They are done using the quadratic Taylor expansion of the difference-of-Gaussian scale-space function with the candidate keypoint as the origin. Additionally, the software used to find the keypoints generally returns floating-point rather than integer values for the keypoint coordinates, which would otherwise indicate a pixel location in a matrix; this is simply a consequence of this kind of interpolation.

3. Orientation assignment: Orientations are assigned to the pixels around each keypoint location based on local image gradient directions. First, the Gaussian-smoothed image $L(x, y, \sigma)$ at the keypoint's scale σ is taken, so that all computations are performed in a scale-invariant manner. For an image sample $L(x, y)$ at scale σ, the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ are computed using pixel differences [21]:

$$ m(x, y) = \sqrt{\big(L(x+1, y) - L(x-1, y)\big)^2 + \big(L(x, y+1) - L(x, y-1)\big)^2} $$

$$ \theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right) $$

The magnitude and direction of the gradient are computed for every pixel in a neighboring region around the keypoint in the Gaussian-blurred image $L(x, y, \sigma)$. The result of this procedure is illustrated in Figure 5.8 for an 8x8 array of pixels in the neighborhood of the keypoint location.

Figure 5.8 Gradients of pixels around the keypoint location
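The gradient magnitude and orientation formulas above can be evaluated directly with pixel differences, as in the minimal sketch below. The image is a random stand-in for a Gaussian-smoothed image and the chosen pixel is arbitrary; atan2 is used instead of tan⁻¹ so that the orientation is resolved over the full circle.

```matlab
% Minimal sketch of the gradient magnitude m(x,y) and orientation theta(x,y)
% used in SIFT's orientation assignment, computed with pixel differences.
L = double(rand(64, 64));          % assumed stand-in for L(x, y, sigma)
x = 30; y = 40;                    % arbitrary interior pixel of interest

dx = L(x+1, y) - L(x-1, y);        % horizontal pixel difference
dy = L(x, y+1) - L(x, y-1);        % vertical pixel difference

m     = sqrt(dx^2 + dy^2);         % gradient magnitude
theta = atan2(dy, dx);             % gradient orientation in radians

fprintf('m = %.4f, theta = %.1f deg\n', m, rad2deg(theta));
```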

An orientation histogram with 36 bins covering the 360-degree range of orientations is formed, each bin covering 10 degrees. Each sample in the neighboring window added to a histogram bin is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ that is 1.5 times the scale of the keypoint [21]. The peaks in this histogram correspond to dominant orientations. Once the histogram is filled, the orientations corresponding to the highest peak and to local peaks within 80% of the highest peak are assigned to the keypoint. In the case of multiple orientations being assigned, an additional keypoint is created with the same location and scale as the original keypoint for each additional orientation [19].

4. Keypoint descriptor: The previous steps found keypoint locations at particular scales and assigned orientations to them, which ensures invariance to image location, scale and rotation. In this step, a descriptor vector is computed for each keypoint such that the descriptor is highly distinctive and partially invariant to the remaining variations, such as illumination. Generally, the magnitude and orientation of the samples in a 16x16 region around the keypoint are calculated. Then, for each 4x4 subregion of this neighborhood, the samples are accumulated into an orientation histogram with 8 bins corresponding to 8 directions, so in total (16x16)/(4x4) = 16 histograms are created. The magnitudes of the gradients are further weighted by a Gaussian function with σ equal to 1.5 times the scale of the keypoint. The descriptor is then the vector of all the values of these histograms [21]: 16 histograms with 8 bins each give 16x8 = 128 entries in the keypoint descriptor vector.

After applying these four steps to both images, keypoint descriptors of both images are obtained. The keypoints of the two images must then be checked to see whether they match. To find the matches, one keypoint is taken from the first image and compared with all keypoints of the other image, one by one; then the second keypoint is picked from the first image and again compared with all keypoints of the other image. This loop continues until all keypoints have been compared with each other. The criterion for accepting two keypoints as a matched pair is the following: each keypoint has its feature (descriptor) vector, and when two keypoints are compared, the angle between their feature vectors is found with the help of the dot product,

$$ \mathbf{F}_1 \cdot \mathbf{F}_2 = |\mathbf{F}_1|\,|\mathbf{F}_2|\cos(\alpha), $$

where $\mathbf{F}_1$ and $\mathbf{F}_2$ are the feature vectors and α is the angle between them. The smaller α is, the more similar the feature vectors are. When α falls below a certain threshold, the keypoints are assumed to match. An example of points matched by the SIFT program is illustrated in Figure 5.9.

Figure 5.9 An example of point matches

There are 1021 and 579 keypoints found in the left and right images, respectively, and 19 of them are matched.
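A minimal sketch of this matching criterion is given below: every descriptor of the first image is compared with every descriptor of the second, and a pair is accepted when the angle between the feature vectors falls below a threshold. The descriptors are random stand-ins for real 128-element SIFT vectors, and the threshold value is an assumption chosen only for illustration.

```matlab
% Minimal sketch of descriptor matching by the angle between feature vectors.
% Descriptors and the threshold are assumed values for illustration only.
desc1 = rand(128, 20);                 % descriptors of image 1 (one per column)
desc2 = rand(128, 30);                 % descriptors of image 2
angleThreshold = deg2rad(25);          % assumed acceptance threshold

matches = [];                          % rows: [index in image 1, index in image 2]
for i = 1:size(desc1, 2)
    f1 = desc1(:, i);
    for j = 1:size(desc2, 2)
        f2 = desc2(:, j);
        % angle between the two feature vectors via the dot product
        cosAlpha = (f1' * f2) / (norm(f1) * norm(f2));
        alpha = acos(min(1, max(-1, cosAlpha)));
        if alpha < angleThreshold
            matches = [matches; i, j];
        end
    end
end
fprintf('%d candidate matches found\n', size(matches, 1));
```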


STEP 2: After the matched points are found, the homography between the two images can be determined. One of the most widely used methods for homography estimation is the Direct Linear Transformation (DLT) algorithm. In order to find a homography between two images, there must be at least 4 matched point pairs. As stated in the projective transformation section, a homography has 9 elements, of which only the ratios matter, so a homography has 8 degrees of freedom. One matched pair of keypoints constrains 2 degrees of freedom, so 4 matched pairs are necessary to define the homography fully. The homography relates a point $\mathbf{x}_i$ in one image to the corresponding point $\mathbf{x}_i'$ in the other image such that $\mathbf{x}_i' = \mathbf{H}\mathbf{x}_i$. In this representation, the homogeneous 3-vectors $\mathbf{x}_i'$ and $\mathbf{H}\mathbf{x}_i$ may not be equal in magnitude, since $\mathbf{H}$ is defined up to a scale, but they have the same direction. To ease the analysis, it is therefore more convenient to write $\mathbf{x}_i' \times \mathbf{H}\mathbf{x}_i = \mathbf{0}$.

If the $j$-th row of the matrix $\mathbf{H}$ is denoted by $\mathbf{h}^{jT}$, then $\mathbf{H}\mathbf{x}_i$ can be written as

$$\mathbf{H}\mathbf{x}_i = \begin{pmatrix} \mathbf{h}^{1T}\mathbf{x}_i \\ \mathbf{h}^{2T}\mathbf{x}_i \\ \mathbf{h}^{3T}\mathbf{x}_i \end{pmatrix}.$$

If $\mathbf{x}_i' = (x_i', y_i', w_i')^T$, then the cross product becomes

$$\mathbf{x}_i' \times \mathbf{H}\mathbf{x}_i = \begin{pmatrix} y_i'\,\mathbf{h}^{3T}\mathbf{x}_i - w_i'\,\mathbf{h}^{2T}\mathbf{x}_i \\ w_i'\,\mathbf{h}^{1T}\mathbf{x}_i - x_i'\,\mathbf{h}^{3T}\mathbf{x}_i \\ x_i'\,\mathbf{h}^{2T}\mathbf{x}_i - y_i'\,\mathbf{h}^{1T}\mathbf{x}_i \end{pmatrix} = \mathbf{0}.$$

Since $\mathbf{h}^{jT}\mathbf{x}_i$ is a scalar (1x1), it equals its transpose, so $\mathbf{h}^{jT}\mathbf{x}_i = \mathbf{x}_i^T\mathbf{h}^j$ for $j = 1, 2, 3$, and a set of three equations is obtained, represented by equation (5.5).

$$\begin{bmatrix} \mathbf{0}^T & -w_i'\mathbf{x}_i^T & y_i'\mathbf{x}_i^T \\ w_i'\mathbf{x}_i^T & \mathbf{0}^T & -x_i'\mathbf{x}_i^T \\ -y_i'\mathbf{x}_i^T & x_i'\mathbf{x}_i^T & \mathbf{0}^T \end{bmatrix} \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_9 \end{pmatrix} = \mathbf{0} \qquad (5.5)$$

Now the equations are in the form $\mathbf{A}_i\mathbf{h} = \mathbf{0}$, where $\mathbf{A}_i$ is a 3x9 matrix and $\mathbf{h}$ is a 9-vector consisting of the elements of the homography. Therefore, if $\mathbf{h}$ is found, then $\mathbf{H}$ is also determined.


$$\mathbf{h} = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_9 \end{pmatrix}, \qquad \mathbf{H} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{bmatrix} \qquad (5.6)$$

Even though there are three equations in (5.5), the third row is linearly dependent on the other two: up to a factor, it is the sum of $x_i'$ times the first row and $y_i'$ times the second row.

$x_i'$ times the first row: $\begin{bmatrix} \mathbf{0}^T & -x_i'w_i'\mathbf{x}_i^T & x_i'y_i'\mathbf{x}_i^T \end{bmatrix}$

$y_i'$ times the second row: $\begin{bmatrix} y_i'w_i'\mathbf{x}_i^T & \mathbf{0}^T & -y_i'x_i'\mathbf{x}_i^T \end{bmatrix}$

Sum: $\begin{bmatrix} y_i'w_i'\mathbf{x}_i^T & -x_i'w_i'\mathbf{x}_i^T & \mathbf{0}^T \end{bmatrix}$

If $-w_i'$ is factored out of the sum, the third row of equation (5.5) is obtained. Therefore, equation (5.5) can be reduced to equation (5.7).

$$\mathbf{A}_i\mathbf{h} = \begin{bmatrix} \mathbf{0}^T & -w_i'\mathbf{x}_i^T & y_i'\mathbf{x}_i^T \\ w_i'\mathbf{x}_i^T & \mathbf{0}^T & -x_i'\mathbf{x}_i^T \end{bmatrix} \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_9 \end{pmatrix} = \mathbf{0} \qquad (5.7)$$

The solution of equation (5.7) gives the homography. The Direct Linear Transformation algorithm [10] can be summarized as follows.

i) For each matched pair of points $x_i \leftrightarrow x_i'$, compute the 2x9 matrix $\mathbf{A}_i$.

ii) Stack the n matrices $\mathbf{A}_i$ for the n correspondences into a 2n x 9 matrix $\mathbf{A}$.

iii) Obtain the singular value decomposition of $\mathbf{A}$. The unit singular vector corresponding to the smallest singular value is the solution $\mathbf{h}$; if $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, then $\mathbf{h}$ is the last column of $\mathbf{V}$.

iv) Using equation (5.6), $\mathbf{H}$ is constructed from $\mathbf{h}$.
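A compact Python/NumPy sketch of these four steps is given below for illustration (the report's implementation is in Matlab; the function name is a placeholder, and the matched points are assumed to be supplied as homogeneous 3-vectors):

```python
import numpy as np

def estimate_homography_dlt(pts1, pts2):
    """Direct Linear Transformation for x' = H x.

    pts1, pts2: (n, 3) arrays of matched homogeneous points x_i and x_i',
    with n >= 4. Returns the 3x3 homography H, defined up to scale.
    """
    rows = []
    for x, x_p in zip(pts1, pts2):
        xp, yp, wp = x_p
        zero = np.zeros(3)
        # The two independent rows of A_i, as in equation (5.7).
        rows.append(np.concatenate([zero, -wp * x, yp * x]))
        rows.append(np.concatenate([wp * x, zero, -xp * x]))
    A = np.asarray(rows)                      # 2n x 9 stacked matrix

    # h is the right singular vector of A with the smallest singular value,
    # i.e. the last row of V^T (last column of V).
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```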

This algorithm is implemented in Matlab and gives the following 3x3 homography for the images in figure 5.9:

$$\mathbf{H} = \begin{bmatrix} -0.0009 & -0.0021 & 0.6174 \\ 0.0030 & -0.0013 & -0.7867 \\ 0.0000 & -0.0000 & -0.0020 \end{bmatrix}.$$


The correctness of this homography matrix can be verified in the following way:

* Pick a specific point in the left image and find its coordinates ($\mathbf{x}_i$),
* Find the same point in the right image and find its coordinates ($\mathbf{x}_i'$),
* If they are related by the obtained homography, the software is working correctly.

Let's examine the correctness of the homography this way. Find the coordinates of the upper right corner of the letter "I" in the word "BASMATI" in the left image, and then find the upper right corner of the same letter "I" in the right image. This is illustrated in figures 5.10 and 5.11.

Figure 5.10 A specific point in the left image

Figure 5.11 Same specific point in the right image



The coordinates of this specific point are $\mathbf{x}_i = [336, 213, 1]^T$ in the left image and $\mathbf{x}_i' = [97, 34, 1]^T$ in the right image, and

$$\mathbf{x}_i' = \begin{pmatrix} 97 \\ 34 \\ 1 \end{pmatrix} \cong \mathbf{H}\mathbf{x}_i = \begin{bmatrix} -0.0009 & -0.0021 & 0.6174 \\ 0.0030 & -0.0013 & -0.7867 \\ 0.0000 & -0.0000 & -0.0020 \end{bmatrix} \begin{pmatrix} 336 \\ 213 \\ 1 \end{pmatrix} \doteq \begin{pmatrix} 97.0847 \\ 34.4144 \\ 1 \end{pmatrix}$$

(after normalization by the third homogeneous coordinate), which indicates that the homography is estimated correctly. Please note that the elements of $\mathbf{H}$ are rounded off here, so a hand calculation of $\mathbf{H}\mathbf{x}_i$ is not exactly the same as the result of Matlab.
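In code, this check amounts to multiplying $\mathbf{x}_i$ by $\mathbf{H}$ and normalizing by the third homogeneous coordinate; a minimal sketch, reusing the full-precision homography returned by the DLT sketch above:

```python
import numpy as np

# H: full-precision homography from estimate_homography_dlt (see the sketch above)
x_i = np.array([336.0, 213.0, 1.0])   # upper right corner of "I" in the left image
x_proj = H @ x_i
x_proj /= x_proj[2]                   # normalize the homogeneous coordinate
# x_proj should be close to the measured right-image point [97, 34, 1];
# the heavily rounded H printed in the text will not reproduce this exactly.
print(x_proj)
```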

5.2. Motion Model of a Mobile Robot

The system to be controlled is a mobile robot with nonholonomic motion constraints. Nonholonomic constraints occur due to the presence of the wheels: the mobile robot cannot move sideways, as shown in figure 5.12.

Figure 5.12 Nonholonomic constraint for a mobile robot

The nonholonomic constraints allow rolling but not slipping. In general, a nonholonomic mechanical system cannot move arbitrarily in its configuration space. Holonomic constraints can be written as equations independent of $\dot{q}$, i.e., $f(q, t) = 0$, where $q$ stands for the generalized coordinates. Nonholonomic constraints, however, cannot be written only in terms of the generalized coordinates, as they also depend on the time derivatives of the generalized coordinates; in other words, nonholonomic constraints are not integrable. A nonholonomic mobile robot model can be represented by the following state and output equations:

$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}, \mathbf{u}) \qquad (5.8)$$

$$\mathbf{y} = \mathbf{h}(\mathbf{x}) \qquad (5.9)$$

where $\mathbf{x}$ denotes the state vector, $\mathbf{u}$ the input vector and $\mathbf{y}$ the output vector. The inputs consist of the forward velocity ($v$) and the angular velocity ($w$).

The coordinate system used is shown in figure 5.13.

Figure 5.13 Coordinate System

Two coordinate frames should be specified in order to avoid ambiguity. One is the coordinate frame attached to the mobile robot, and the other is the world coordinate frame. When the robot reaches its target pose, the coordinate frame attached to the robot may differ from the world coordinate frame. However, without loss of generality, the world coordinate frame can be chosen to coincide with the robot coordinate frame at its target pose.

The state vector can be defined as $\mathbf{x} = [x\ z\ \phi]^T$, since the robot moves in the x-z plane. $x$ and $z$ give the position of the mobile robot with respect to the world coordinate frame, and $\phi$ represents the orientation of the robot: it is the angle between the z axis of the coordinate frame attached to the mobile robot and the z axis of the world frame. According to the information provided above, it can be said without loss of generality that when the mobile robot reaches the target position, all state variables become zero, since the world coordinate frame coincides with the robot coordinate frame at the target pose. The state equations can now be written explicitly as

$$\begin{pmatrix} \dot{x} \\ \dot{z} \\ \dot{\phi} \end{pmatrix} = \begin{pmatrix} -\sin(\phi) \\ \cos(\phi) \\ 0 \end{pmatrix} v + \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} w \qquad (5.10)$$
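For the simulations discussed later, the state equations (5.10) can be integrated numerically; a minimal forward-Euler sketch is shown below (the integration scheme and step size are choices of this sketch, not something prescribed in the report):

```python
import numpy as np

def step_unicycle(state, v, w, dt=0.01):
    """One Euler step of the nonholonomic model (5.10).

    state = (x, z, phi), with phi in radians; v is the forward velocity
    and w the angular velocity.
    """
    x, z, phi = state
    x_dot   = -np.sin(phi) * v
    z_dot   =  np.cos(phi) * v
    phi_dot =  w
    return np.array([x + dt * x_dot,
                     z + dt * z_dot,
                     phi + dt * phi_dot])
```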


In order to define the output vector, a homography between the two images (the current and target images) must be found, since the outputs of the system are chosen among the homography elements. A homography is related to the camera motion as

$$\mathbf{H} = \mathbf{K}\mathbf{R}\left(\mathbf{I} + \frac{\mathbf{c}\,\mathbf{n}^{*T}}{d^{*}}\right)\mathbf{K}^{-1}. \qquad (5.11)$$

$\mathbf{R}$ and $\mathbf{c}$ are the rotation matrix and the translation vector between the current and target poses, and $\mathbf{K}$ is the internal camera calibration matrix. In practice, some assumptions are made: the robot moves on a planar surface without irregularities, the principal point coordinates are (0, 0), and there is no pixel skew.

The rotation matrix can be derived by translating the origin of the target frame to the origin of the current frame and examining the relationship between the two coordinate frames.

Figure 5.14 Target frame (x, y, z) and Current Frame (x′, y′, z′)

The target frame and the current frame are shown in figure 5.14. The y and y′ axes are not shown in the figure because they are orthogonal to the page plane according to the right-hand rule. In order for the current frame to coincide with the target frame, it must be rotated by −φ, i.e., in the clockwise direction (according to the convention used, counterclockwise rotations are positive, as shown in figure 5.13). Thus, the following equations define the relationship between the current and target coordinate frames when their origins coincide:

$$x = -z'\sin(-\phi) + x'\cos(-\phi) = x'\cos(\phi) + z'\sin(\phi)$$
$$y = y'$$
$$z = z'\cos(-\phi) + x'\sin(-\phi) = -x'\sin(\phi) + z'\cos(\phi)$$

These equations can be put into matrix representation and the rotation matrix, 𝑹, can be

obtained as equation (5.12).


$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{bmatrix} \cos(\phi) & 0 & \sin(\phi) \\ 0 & 1 & 0 \\ -\sin(\phi) & 0 & \cos(\phi) \end{bmatrix} \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = \mathbf{R} \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} \qquad (5.12)$$

The translation vector between the target and current frames is represented by

$$\mathbf{c} = \begin{pmatrix} x \\ 0 \\ z \end{pmatrix}. \qquad (5.13)$$

The y coordinate is always zero because the robot moves in the x-z plane. Using equation (5.11), the homography between the target and current images can be obtained as

$$\mathbf{H} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$$

where

$h_{11} = \cos(\phi) + [x\cos(\phi) + z\sin(\phi)]\,\dfrac{n_x}{d_\pi}$

$h_{12} = \dfrac{\alpha_x}{\alpha_y}[x\cos(\phi) + z\sin(\phi)]\,\dfrac{n_y}{d_\pi}$

$h_{13} = \alpha_x\left[\sin(\phi) + (x\cos(\phi) + z\sin(\phi))\,\dfrac{n_z}{d_\pi}\right]$

$h_{21} = 0$

$h_{22} = 1$

$h_{23} = 0$

$h_{31} = \left[-\sin(\phi) + (-x\sin(\phi) + z\cos(\phi))\,\dfrac{n_x}{d_\pi}\right]\dfrac{1}{\alpha_x}$

$h_{32} = (-x\sin(\phi) + z\cos(\phi))\,\dfrac{n_y}{d_\pi}\,\dfrac{1}{\alpha_y}$

$h_{33} = \cos(\phi) + (-x\sin(\phi) + z\cos(\phi))\,\dfrac{n_z}{d_\pi}.$

$h_{21}$, $h_{22}$ and $h_{23}$ do not provide any information, since they are constant due to the planar motion constraint. The elements $h_{31}$ and $h_{32}$ are discarded because their magnitudes are small ($\alpha_x$ and $\alpha_y$ appear in the denominator), which makes them more sensitive to noise than the other homography elements. In monocular systems, planes in front of the camera with dominant $n_z$ are detected more easily [9], so $h_{13}$ and $h_{33}$ are chosen among the remaining elements, since they depend on $n_z$. Therefore, the output vector is defined as

$$\mathbf{y} = \begin{pmatrix} h_{13} \\ h_{33} \end{pmatrix}.$$
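For reference, the two outputs can be evaluated directly from the pose with the expressions above; in the sketch below, $\alpha_x$ and the ratio $n_z/d_\pi$ must be supplied, since they depend on the camera calibration and on the scene plane:

```python
import numpy as np

def h13_h33_from_pose(x, z, phi, alpha_x, nz_over_d):
    """Outputs y = (h13, h33) of the homography model as functions of the pose.

    alpha_x   : focal length in pixels along x (camera intrinsic).
    nz_over_d : n_z / d_pi, the z component of the plane normal divided by
                the plane distance, expressed in the target frame.
    """
    h13 = alpha_x * (np.sin(phi) + (x * np.cos(phi) + z * np.sin(phi)) * nz_over_d)
    h33 = np.cos(phi) + (-x * np.sin(phi) + z * np.cos(phi)) * nz_over_d
    return h13, h33
```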


5.3. Input-Output Linearization and Control Law

The approach employed here navigates the mobile robot by controlling the elements of the

homography. This means that the problem of visual servo control is converted into a tracking

problem, i.e., actual elements of the homography should follow the desired trajectories of the

homography elements during the motion. The geometric model relating the inputs and the outputs of this system is nonlinear. A linearization is carried out by differentiating the homography elements until the control inputs appear explicitly. Before proceeding with the input-output linearization and the derivation of the control law, we can show that the system is controllable.

The state dynamics of the mobile robot allows the system to be written in an affine format as

𝒙 = 𝑓 𝒙 + 𝑔𝑖(𝑥)𝑢𝑖

𝑚

𝑖=1

where ui′s are the inputs. (5.14)

The state dynamics of the mobile robot is given by equation (5.10). Equating (5.14) and (5.10) gives

$$f(\mathbf{x}) = \mathbf{0}, \qquad m = 2,$$

$$u_1 = v, \quad \mathbf{g_1} = \begin{pmatrix} -\sin(\phi) \\ \cos(\phi) \\ 0 \end{pmatrix}, \qquad u_2 = w, \quad \mathbf{g_2} = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}.$$

Since $m = 2$, the accessibility distribution $\mathbf{C}$ becomes

$$\mathbf{C} = \{\mathbf{g_1}, \mathbf{g_2}, [\mathbf{g_1}, \mathbf{g_2}]\}.$$

$[\mathbf{g_1}, \mathbf{g_2}]$ is the Lie bracket, defined as

$$[\mathbf{g_1}, \mathbf{g_2}] \equiv \frac{\partial \mathbf{g_1}}{\partial \mathbf{x}}\mathbf{g_2} - \frac{\partial \mathbf{g_2}}{\partial \mathbf{x}}\mathbf{g_1}, \qquad \text{where } \mathbf{x} = \begin{pmatrix} x \\ z \\ \phi \end{pmatrix} \text{ is the state vector.}$$

The accessibility distribution is obtained as:

$$\frac{\partial \mathbf{g_1}}{\partial \mathbf{x}}\mathbf{g_2} = \begin{bmatrix} 0 & 0 & -\cos(\phi) \\ 0 & 0 & -\sin(\phi) \\ 0 & 0 & 0 \end{bmatrix} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} -\cos(\phi) \\ -\sin(\phi) \\ 0 \end{pmatrix}$$

$$\frac{\partial \mathbf{g_2}}{\partial \mathbf{x}}\mathbf{g_1} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{pmatrix} -\sin(\phi) \\ \cos(\phi) \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

$$[\mathbf{g_1}, \mathbf{g_2}] = \begin{pmatrix} -\cos(\phi) \\ -\sin(\phi) \\ 0 \end{pmatrix}$$

$$\mathbf{C} = \begin{bmatrix} -\sin(\phi) & 0 & -\cos(\phi) \\ \cos(\phi) & 0 & -\sin(\phi) \\ 0 & 1 & 0 \end{bmatrix}.$$

Since rank(C) is equal to 3, which is the number of states, the system is controllable [22].
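The rank condition can also be checked numerically; a small sketch (any value of $\phi$ gives rank 3):

```python
import numpy as np

def accessibility_rank(phi):
    """Rank of C = {g1, g2, [g1, g2]} for the unicycle model."""
    g1      = np.array([-np.sin(phi),  np.cos(phi), 0.0])
    g2      = np.array([0.0, 0.0, 1.0])
    lie_g12 = np.array([-np.cos(phi), -np.sin(phi), 0.0])   # [g1, g2] as derived above
    C = np.column_stack([g1, g2, lie_g12])
    return np.linalg.matrix_rank(C)

assert accessibility_rank(0.3) == 3   # full rank, hence controllable
```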

5.3.1. Input-Output Linearization

Linearization is a common way of designing nonlinear control systems. In this section, the outputs will be differentiated until they depend linearly on the inputs. Please note that the normal vector ($\mathbf{n}$) of the plane which creates the homography and the distance ($d_\pi$) between that plane and the origin of the target frame are invariant, so their time derivatives are zero.

Time derivative of $h_{13}$:

$$h_{13} = \alpha_x\left[\sin(\phi) + (x\cos(\phi) + z\sin(\phi))\frac{n_z}{d_\pi}\right]$$

$$\dot{h}_{13} = \alpha_x\left[\cos(\phi)\dot{\phi} + \left(\dot{x}\cos(\phi) + \dot{z}\sin(\phi) - x\sin(\phi)\dot{\phi} + z\cos(\phi)\dot{\phi}\right)\frac{n_z}{d_\pi}\right]$$

With the help of the state equations, this expression can be simplified:

$$\dot{x} = -\sin(\phi)\,v \;\Rightarrow\; \dot{x}\cos(\phi) = -\sin(\phi)\cos(\phi)\,v$$
$$\dot{z} = \cos(\phi)\,v \;\Rightarrow\; \dot{z}\sin(\phi) = \sin(\phi)\cos(\phi)\,v$$
$$\dot{x}\cos(\phi) + \dot{z}\sin(\phi) = -\sin(\phi)\cos(\phi)\,v + \sin(\phi)\cos(\phi)\,v = 0$$

Therefore,

$$\dot{h}_{13} = \alpha_x\left[\cos(\phi)\dot{\phi} + (-x\sin(\phi) + z\cos(\phi))\dot{\phi}\,\frac{n_z}{d_\pi}\right] = \alpha_x\dot{\phi}\left[\cos(\phi) + (-x\sin(\phi) + z\cos(\phi))\frac{n_z}{d_\pi}\right] = \alpha_x h_{33}\, w, \quad \text{since } w = \dot{\phi}.$$

The first time derivative of $h_{13}$ is linearly dependent on the inputs, so the relative degree of this output is 1, and there is no need for further differentiation of $h_{13}$.

Time derivative of $h_{33}$:

$$h_{33} = \cos(\phi) + (-x\sin(\phi) + z\cos(\phi))\frac{n_z}{d_\pi}$$

$$\dot{h}_{33} = -\sin(\phi)\dot{\phi} + \left(-\dot{x}\sin(\phi) + \dot{z}\cos(\phi) - x\cos(\phi)\dot{\phi} - z\sin(\phi)\dot{\phi}\right)\frac{n_z}{d_\pi}$$

With the help of the state equations, this expression can be simplified:

$$\dot{x} = -\sin(\phi)\,v \;\Rightarrow\; -\dot{x}\sin(\phi) = \sin^2(\phi)\,v$$
$$\dot{z} = \cos(\phi)\,v \;\Rightarrow\; \dot{z}\cos(\phi) = \cos^2(\phi)\,v$$
$$-\dot{x}\sin(\phi) + \dot{z}\cos(\phi) = \sin^2(\phi)\,v + \cos^2(\phi)\,v = v$$

Therefore,

$$\dot{h}_{33} = \frac{n_z}{d_\pi}v - w\left[\sin(\phi) + (x\cos(\phi) + z\sin(\phi))\frac{n_z}{d_\pi}\right] = \frac{n_z}{d_\pi}v - \frac{h_{13}}{\alpha_x}w.$$

The first time derivative of $h_{33}$ is also linearly dependent on the inputs, so the relative degree of this output is 1 as well, and there is no need for further differentiation of $h_{33}$.

5.3.2. Control Law

After taking the first time derivatives of the outputs, a linear relationship between the outputs and the inputs is obtained. In matrix form, the decoupling matrix $\mathbf{L}$ is

$$\begin{pmatrix} \dot{h}_{13} \\ \dot{h}_{33} \end{pmatrix} = \begin{bmatrix} 0 & \alpha_x h_{33} \\ \dfrac{n_z}{d_\pi} & -\dfrac{h_{13}}{\alpha_x} \end{bmatrix} \begin{pmatrix} v \\ w \end{pmatrix} = \mathbf{L}\begin{pmatrix} v \\ w \end{pmatrix}. \qquad (5.15)$$

The error system should be of such a form that both the tracking error and its derivative converge to zero. To illustrate, the error differential equation of a tracking problem should be $\dot{e} + ke = 0$, which has a left-half-plane pole for positive $k$, so that the error and its time derivative decay to zero exponentially. To achieve this, the following definitions are made (the superscript $d$ stands for 'desired').

$$\mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} = \begin{pmatrix} h_{13}^d - h_{13} \\ h_{33}^d - h_{33} \end{pmatrix}, \qquad \dot{\mathbf{e}} = \begin{pmatrix} \dot{e}_1 \\ \dot{e}_2 \end{pmatrix} = \begin{pmatrix} \dot{h}_{13}^d - \dot{h}_{13} \\ \dot{h}_{33}^d - \dot{h}_{33} \end{pmatrix}, \qquad \mathbf{k} = \begin{bmatrix} k_{13} & 0 \\ 0 & k_{33} \end{bmatrix} \qquad (5.16)$$

$$\begin{pmatrix} \dot{h}_{13}^d - \dot{h}_{13} \\ \dot{h}_{33}^d - \dot{h}_{33} \end{pmatrix} + \begin{bmatrix} k_{13} & 0 \\ 0 & k_{33} \end{bmatrix} \begin{pmatrix} h_{13}^d - h_{13} \\ h_{33}^d - h_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \qquad (5.17)$$

After some manipulation of equation (5.17), equation (5.18) is obtained:

$$\begin{pmatrix} \dot{h}_{13} \\ \dot{h}_{33} \end{pmatrix} = \begin{pmatrix} \dot{h}_{13}^d + k_{13}(h_{13}^d - h_{13}) \\ \dot{h}_{33}^d + k_{33}(h_{33}^d - h_{33}) \end{pmatrix} \qquad (5.18)$$

$k_{13}$ and $k_{33}$ are positive control gains. Equating the right-hand sides of equations (5.15) and (5.18) allows the control signal to be solved for:

$$\mathbf{L}\begin{pmatrix} v \\ w \end{pmatrix} = \begin{pmatrix} \dot{h}_{13}^d + k_{13}(h_{13}^d - h_{13}) \\ \dot{h}_{33}^d + k_{33}(h_{33}^d - h_{33}) \end{pmatrix} \qquad (5.19)$$


Multiplying both sides of equation (5.19) by $\mathbf{L}^{-1}$ gives the control signal:

$$\begin{pmatrix} v \\ w \end{pmatrix} = \mathbf{L}^{-1}\begin{pmatrix} \dot{h}_{13}^d + k_{13}(h_{13}^d - h_{13}) \\ \dot{h}_{33}^d + k_{33}(h_{33}^d - h_{33}) \end{pmatrix} = \begin{bmatrix} \dfrac{h_{13}\,d_\pi}{\alpha_x^2\,h_{33}\,n_z} & \dfrac{d_\pi}{n_z} \\ \dfrac{1}{\alpha_x h_{33}} & 0 \end{bmatrix} \begin{pmatrix} \dot{h}_{13}^d + k_{13}(h_{13}^d - h_{13}) \\ \dot{h}_{33}^d + k_{33}(h_{33}^d - h_{33}) \end{pmatrix} \qquad (5.20)$$
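A sketch of this computation: build the decoupling matrix, form the right-hand side of equation (5.19), and solve for $(v, w)$. The gains and the estimate of $n_z/d_\pi$ are inputs that must be supplied, and the function name is illustrative:

```python
import numpy as np

def control_signal(h13, h33, h13_d, h33_d, dh13_d, dh33_d,
                   alpha_x, nz_over_d, k13=1.0, k33=1.0):
    """Compute (v, w) from equation (5.20).

    h13, h33       : current homography outputs
    h13_d, h33_d   : desired trajectory values
    dh13_d, dh33_d : time derivatives of the desired trajectories
    """
    L = np.array([[0.0,        alpha_x * h33],
                  [nz_over_d, -h13 / alpha_x]])      # decoupling matrix (5.15)
    rhs = np.array([dh13_d + k13 * (h13_d - h13),
                    dh33_d + k33 * (h33_d - h33)])
    v, w = np.linalg.solve(L, rhs)                   # invertible as long as h33 != 0
    return v, w
```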

In order to have a nonsingular control signal, the decoupling matrix must be invertible, i.e., $\det(\mathbf{L}) \neq 0$. To investigate the situations that could create a singularity, the determinant of the decoupling matrix is analyzed:

$$\det(\mathbf{L}) = -\alpha_x h_{33}\,\frac{n_z}{d_\pi} \qquad (5.21)$$

Here, $\alpha_x$ denotes the focal length in pixel dimensions in the x direction, so it is not zero, and since the plane that generates the homography is at a finite distance from the target position, $d_\pi \neq \infty$. Also, the plane must be visible to the camera, which makes $n_z \neq 0$. Then there is one possibility left that can make the determinant of the decoupling matrix zero, namely $h_{33} = 0$. $h_{33}$ is given by

$$h_{33} = \cos(\phi) + (-x\sin(\phi) + z\cos(\phi))\frac{n_z}{d_\pi}.$$

It should be shown that $h_{33}$ never becomes zero, in order to avoid a singularity in the control law. The target is in front of the mobile robot, so $z < 0$ until the robot reaches the target pose, according to the assigned target coordinate frame. At the moment the robot is at the desired pose, $x$, $z$ and $\phi$ become zero and $h_{33}$ becomes one. There are also constraints on the orientation of the robot. In order for the robot to see the target scene fully or partially, $-\frac{\pi}{2} < \phi < \frac{\pi}{2}$ must be satisfied; otherwise, the robot would see a scene that is not related to the target scene at all, and it would not be possible to construct a meaningful control signal. The constraint $-\frac{\pi}{2} < \phi < \frac{\pi}{2}$ ensures that $\cos(\phi) > 0$. Besides, $n_z$ must be negative with respect to the target coordinate frame, since the plane that produces the homography is visible to the camera. Therefore, $z\cos(\phi)\frac{n_z}{d_\pi} > 0$, and it follows that if

$$\cos(\phi) + z\cos(\phi)\frac{n_z}{d_\pi} > \left| -x\sin(\phi)\frac{n_z}{d_\pi} \right|,$$

then $h_{33} > 0$.

This inequality requires the lateral distance to be compensated to be smaller than the depth error. In other words, the inequality holds if the depth error is larger than the lateral error, which is the case under the camera field-of-view constraint, i.e., $|z\cos(\phi)| > |x\sin(\phi)|$. As a result, it is concluded that the determinant of the decoupling matrix is never zero in the workspace, and the control signal can be constructed without facing any singularity.

5.3.3. Desired Trajectories of the Homography Elements

The control law needs the desired trajectories of the homography elements, as can be seen from equation (5.20). The motion performed by the robot obviously depends on the selection of the desired trajectories of the homography elements ($h_{13}^d$, $h_{33}^d$). When the robot reaches the target pose, $(x, z, \phi) = (0, 0, 0)$ and the homography becomes an identity matrix. This dictates that the final values of $h_{13}$ and $h_{33}$ must be 0 and 1, respectively. Several desired trajectories for the homography elements have been proposed in the literature; two of the most important ones are offered by [9] and [23]. The suggestion of [23] is adopted in this project.

*Desired Trajectories:

The desired trajectory of $h_{13}$ is selected such that it corrects the lateral and orientation errors simultaneously, while the desired trajectory chosen for $h_{33}$ is a sinusoid, a smooth function converging to 1 that ensures the depth error is removed.

Desired trajectory of $h_{13}$: A condition regarding the initial configuration of the robot should be checked before deciding on the desired trajectory of $h_{13}$. That condition is related to the current and target epipoles at the start of the motion: the sign of the product of the x coordinates of the current epipole ($e_{cx}$) and the target epipole ($e_{tx}$) at the beginning of the motion must be examined. Please refer to Appendix A for information about epipolar geometry and its relationship with mobile robot navigation. The desired trajectory of $h_{13}$ is analyzed in two cases.

Case 1: If $e_{cx}(0)\,e_{tx}(0) \leq 0$, the desired trajectory is defined in two steps:

$$h_{13}^d(0 \leq t \leq T_2) = h_{13}(0)\,\frac{\psi(t)}{\psi(0)}$$
$$h_{13}^d(T_2 < t < \infty) = 0.$$

Case 2: If $e_{cx}(0)\,e_{tx}(0) > 0$, the desired trajectory is defined in three steps. The first step drives the robot to a proper orientation, after which a smooth motion towards the target can be realized; the second and third steps are defined like the first and second steps of Case 1:

$$h_{13}^d(0 \leq t \leq T_1) = \frac{h_{13}(0) + h_{13}^d(T_1)}{2} + \frac{h_{13}(0) - h_{13}^d(T_1)}{2}\cos\!\left(\frac{\pi t}{T_1}\right)$$
$$h_{13}^d(T_1 < t \leq T_2) = h_{13}(T_1)\,\frac{\psi(t)}{\psi(T_1)}$$
$$h_{13}^d(T_2 < t < \infty) = 0, \qquad \text{where } h_{13}^d(T_1) = -\tfrac{2}{3}h_{13}(0) \text{ and } T_1 < T_2.$$

The first step is an intermediate step that should be completed within $T_1$. $\psi$ is the angle of the straight line connecting the current position of the robot with the target position, defined in the target frame as seen in figure 5.13. $h_{13}^d$ is defined in terms of $\psi$ because it is desired to correct the lateral and orientation errors together.


Desired trajectory of $h_{33}$: The desired trajectory of $h_{33}$ is realized in two steps:

$$h_{33}^d(0 \leq t \leq T_2) = \frac{h_{33}(0) + 1}{2} + \frac{h_{33}(0) - 1}{2}\cos\!\left(\frac{\pi t}{T_2}\right)$$
$$h_{33}^d(T_2 < t < \infty) = 1.$$

The desired homography values should be reached within $T_2$. The desired trajectories depend on the homography and on the initial position. As the robot moves, the control law makes the realized homography elements track the desired trajectories defined above, guaranteeing convergence to the target.
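For illustration, the Case 1 trajectory of $h_{13}$ and the trajectory of $h_{33}$ can be sketched as follows (Case 2 would prepend the cosine blend on $[0, T_1]$; the value of $\psi(t)$ must be supplied at run time, as discussed in chapter 6, and $\psi(0)$ is assumed nonzero):

```python
import numpy as np

def h33_desired(t, h33_0, T2):
    """Sinusoidal desired trajectory for h33, converging smoothly to 1 at t = T2."""
    if t <= T2:
        return (h33_0 + 1.0) / 2.0 + (h33_0 - 1.0) / 2.0 * np.cos(np.pi * t / T2)
    return 1.0

def h13_desired_case1(t, h13_0, psi_t, psi_0, T2):
    """Case 1 desired trajectory for h13 (used when e_cx(0) * e_tx(0) <= 0)."""
    if t <= T2:
        return h13_0 * psi_t / psi_0
    return 0.0
```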

5.4. Stability Analysis

A candidate Lyapunov function for the error system is chosen as

$$V(\mathbf{x}, t) = \frac{1}{2}\|\mathbf{e}\|^2, \qquad \mathbf{e} = \begin{pmatrix} e_1 \\ e_2 \end{pmatrix} = \begin{pmatrix} h_{13}^d - h_{13} \\ h_{33}^d - h_{33} \end{pmatrix}. \qquad (5.22)$$

The Lyapunov candidate is positive definite everywhere except at the origin of the error space. It must now be shown that the time derivative of the Lyapunov function is zero at the origin of the error space and negative definite elsewhere. The following definitions are made for this analysis:

$$\mathbf{D} = \begin{pmatrix} h_{13}^d \\ h_{33}^d \end{pmatrix}, \qquad \dot{\mathbf{D}} = \begin{pmatrix} \dot{h}_{13}^d \\ \dot{h}_{33}^d \end{pmatrix}, \qquad \mathbf{k} = \begin{bmatrix} k_{13} & 0 \\ 0 & k_{33} \end{bmatrix}$$

$$\dot{V}(\mathbf{x}, t) = \mathbf{e}^T\dot{\mathbf{e}} = \mathbf{e}^T\begin{pmatrix} \dot{h}_{13}^d - \dot{h}_{13} \\ \dot{h}_{33}^d - \dot{h}_{33} \end{pmatrix} = \mathbf{e}^T\left[\dot{\mathbf{D}} - \mathbf{L}\begin{pmatrix} v \\ w \end{pmatrix}\right] = \mathbf{e}^T\left[\dot{\mathbf{D}} - \mathbf{L}\mathbf{L}^{-1}(\dot{\mathbf{D}} + \mathbf{k}\mathbf{e})\right] = \mathbf{e}^T(\mathbf{I} - \mathbf{L}\mathbf{L}^{-1})\dot{\mathbf{D}} - \mathbf{e}^T\mathbf{L}\mathbf{L}^{-1}\mathbf{k}\mathbf{e} \qquad (5.23)$$

Equation (5.23) shows that the time derivative of the Lyapunov candidate is negative definite in the error space except at the origin, so asymptotic stability is guaranteed. Since $\mathbf{L}\mathbf{L}^{-1}$ equals the 2x2 identity matrix in theory, the first term of equation (5.23) drops out, and $\dot{V}(\mathbf{x}, t) = -\mathbf{e}^T\mathbf{k}\mathbf{e}$ with a positive definite, diagonal gain matrix $\mathbf{k}$ satisfies the asymptotic stability conditions. In practice, the estimation of $\mathbf{L}^{-1}$ may not be exact, so $\mathbf{L}\mathbf{L}^{-1}$ may not be exactly the identity matrix. However, if the estimation of $\mathbf{L}^{-1}$ is not too coarse, asymptotic stability of the system is still achieved [3]. The region of stability is the workspace of the mobile robot, subject to the camera field-of-view limitations [23].

It has now been proven that $h_{13}$ converges to $h_{13}^d$ and $h_{33}$ converges to $h_{33}^d$, since $\mathbf{e}$ goes to zero (the system is asymptotically stable). After time $T_2$, $h_{13}^d$ becomes 0 and $h_{33}^d$ becomes 1, as seen from the proposed desired trajectories. If figure 5.13 is examined, it is seen that $\psi = -\arctan\!\left(\frac{x}{z}\right)$, since $x = -\rho\sin(\psi)$ and $z = \rho\cos(\psi)$ for all quadrants. In order for $h_{13}^d$ (and thus $h_{13}$) to converge to zero, $\psi$ must eventually go to zero, which happens when $x$ becomes equal to zero. Therefore, $x = 0$ is reached at the end of the motion. Now the final values of the other state variables ($z$ and $\phi$) must be found. They are obtained with the help of the homography expressions for $h_{13}$ and $h_{33}$:

$$h_{13} = \alpha_x\left[\sin(\phi) + (x\cos(\phi) + z\sin(\phi))\frac{n_z}{d_\pi}\right] \qquad (5.24)$$

$$h_{33} = \cos(\phi) + (-x\sin(\phi) + z\cos(\phi))\frac{n_z}{d_\pi} \qquad (5.25)$$

The variable $z$ is eliminated from equations (5.24) and (5.25) by the following procedure:

i) Multiply equation (5.24) by $\cos(\phi)$,

ii) Multiply equation (5.25) by $-\alpha_x\sin(\phi)$,

iii) Add the results of (i) and (ii) side by side.

$$\cos(\phi)\,h_{13} = \alpha_x\sin(\phi)\cos(\phi) + \alpha_x x\cos^2(\phi)\frac{n_z}{d_\pi} + \alpha_x z\sin(\phi)\cos(\phi)\frac{n_z}{d_\pi}$$

$$-\alpha_x\sin(\phi)\,h_{33} = -\alpha_x\sin(\phi)\cos(\phi) + \alpha_x x\sin^2(\phi)\frac{n_z}{d_\pi} - \alpha_x z\sin(\phi)\cos(\phi)\frac{n_z}{d_\pi}$$

Adding these two equations and plugging in the final values of $h_{13}$ and $h_{33}$ (0 and 1, respectively) results in equation (5.26):

$$\sin(\phi) = -x\,\frac{n_z}{d_\pi} \qquad (5.26)$$

Since $x$ becomes zero at the end of the motion, $\phi$ must also become zero, as seen from equation (5.26). Plugging $x = 0$ and $\phi = 0$ into equation (5.25) then shows that $z = 0$. This analysis proves that the only equilibrium state of the system is $(x, z, \phi) = (0, 0, 0)$, and when this equilibrium is reached, the homography becomes the 3x3 identity matrix ($\mathbf{H} = \mathbf{I}$), which indicates that the camera sees the target scene and the goal is accomplished.

6-SIMULATIONS

Simulations are carried out in order to show the validity of the proposed approach. The performance of the system with and without noise and calibration errors is investigated. In the simulations, knowledge of the initial and target configurations is the only a priori information required. The control algorithm tries to drive the robot from the initial configuration to the target configuration. The control loop is illustrated in figure 6.1.


Figure 6.1 Diagram of the control loop

The rotation matrix and the translation vector between the current and target configurations can be found, since the current and target positions and orientations are known as inputs. With the intrinsic camera matrix known, the theoretical homography formula (equation (5.11)) can be used to compute the 3x3 homography matrix between the current and target virtual scenes. The intrinsic camera matrix is formed using the information in [10] and [18]. The virtual image is assumed to have a 640x480 pixel resolution. A focal length of $f = 6\ \mathrm{mm}$ is used, and its real value is varied to see the effect on the final errors; the effect of the principal point coordinates on the final errors is also analyzed. The control gains are $k_{13} = 1$ and $k_{33} = 1$, with $T_1 = 40\ \mathrm{s}$ and $T_2 = 80\ \mathrm{s}$, and the total simulation time is chosen as $T_{total} = 100\ \mathrm{s}$.

The simulations are carried out for several initial configurations and the target configuration $(x = 0, z = 0, \phi = 0°)$. Since the mobile robot moves on a horizontal plane (the x-z plane), the y coordinate with respect to both the robot-attached coordinate frame and the world reference frame is zero. Furthermore, the roll and pitch angles do not change in time; only the yaw angle ($\phi$) varies. The outcomes of the simulations are shown in figures 6.2-6.22.
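Tying the previous sketches together, the simulation loop of figure 6.1 can be outlined as below. This reuses the illustrative helper functions defined in earlier sketches (it is not the report's Matlab code), uses the Case 1 trajectories, and assumes $x_0 \neq 0$ so that $\psi(0) \neq 0$:

```python
import numpy as np

def simulate(x0, z0, phi0, alpha_x, nz_over_d,
             T2=80.0, T_total=100.0, dt=0.05, k13=1.0, k33=1.0):
    """Drive the unicycle from (x0, z0, phi0) towards the target pose at the origin."""
    state = np.array([x0, z0, phi0], dtype=float)
    h13_0, h33_0 = h13_h33_from_pose(x0, z0, phi0, alpha_x, nz_over_d)
    psi_0 = -np.arctan(x0 / z0)             # report's relation psi = -arctan(x/z)
    psi_prev = psi_0

    t = 0.0
    while t < T_total:
        x, z, phi = state
        h13, h33 = h13_h33_from_pose(x, z, phi, alpha_x, nz_over_d)

        psi = -np.arctan(x / z) if abs(z) > 1e-9 else 0.0
        dpsi = (psi - psi_prev) / dt        # numerical differentiation of psi
        psi_prev = psi

        # Desired trajectories and their time derivatives (Case 1).
        h13_d = h13_desired_case1(t, h13_0, psi, psi_0, T2)
        dh13_d = h13_0 * dpsi / psi_0 if t <= T2 else 0.0
        h33_d = h33_desired(t, h33_0, T2)
        dh33_d = (h33_desired(t + dt, h33_0, T2) - h33_d) / dt

        v, w = control_signal(h13, h33, h13_d, h33_d, dh13_d, dh33_d,
                              alpha_x, nz_over_d, k13, k33)
        state = step_unicycle(state, v, w, dt)
        t += dt
    return state                            # should approach (0, 0, 0)
```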


i) Results for initial configuration of (x = −5, z = −15, ϕ = 5°):

Figure 6.2 Evolution of position and orientation parameters and control signals

Figure 6.3 Evolution of homography elements


Figure 6.4 Evolution of error in position and orientation parameters

ii) Results for initial configuration of (x = −8, z = −20,ϕ = −45°):

Figure 6.5 Evolution of position and orientation parameters and control signals


Figure 6.6 Evolution of homography elements

Figure 6.7 Evolution of error in position and orientation parameters


iii) Results for initial configuration of (x = 10, z = −35, ϕ = −25°):

Figure 6.8 Evolution of position and orientation parameters and control signals

Figure 6.9 Evolution of homography elements


Figure 6.10 Evolution of error in position and orientation parameters

iv) Results for initial configuration of (x = 10, z = −25, ϕ = −35°):

Figure 6.11 Evolution of position and orientation parameters and control signals


Figure 6.12 Evolution of homography elements

Figure 6.13 Evolution of error in position and orientation parameters


v) Results for initial configuration of (x = −3, z = −20, ϕ = 30°):

Figure 6.14 Evolution of position and orientation parameters and control signals

Figure 6.15 Evolution of homography elements


Figure 6.16 Evolution of error in position and orientation parameters

vi) Results for initial configuration of (x = −0.25, z = −1.2, ϕ = −20°):

Figure 6.17 Evolution of position and orientation parameters and control signals


Figure 6.18 Evolution of homography elements

Figure 6.19 Evolution of error in position and orientation parameters


vii) Results for the initial configuration (x = 12, z = −40, ϕ = 45°), this time with a target configuration of (x = −8, z = −5, ϕ = −20°):

Figure 6.20 Evolution of position and orientation parameters and control signals

Figure 6.21 Evolution of homography elements


Figure 6.22 Evolution of error in position and orientation parameters

Since homography decomposition is not necessary and is not performed in this control approach, the normal vector ($\mathbf{n}$) of the plane that generates the homography and the distance ($d_\pi$) between that plane and the origin of the target frame are not known, so the term $\frac{n_z}{d_\pi}$ used in the control and homography calculations is not known exactly either. Therefore, the value of $\frac{n_z}{d_\pi}$ must be estimated. The effect of the uncertainty in $n_z$ and $d_\pi$ on the performance is checked by using fixed values in the computation of the homography and varying those values in the control law. Figures 6.23 and 6.24 show the effect of this uncertainty on the final pose error.


Figure 6.23 Final pose error for different 𝑑𝜋 values

Figure 6.24 Final pose error for different 𝑛𝑧 values

As can be understood from the results shown in figures 6.23 and 6.24, the convergence of the approach is not affected by the uncertainty, and small final pose errors are obtained. Another important issue in most visual servoing systems is the calibration of the camera. Since


the elements of the intrinsic camera matrix appear in the control law and in the computation of the homography, it is necessary to investigate their impact on the performance. The simulation results presented above were obtained with a camera focal length of 6 mm, as mentioned before, and with the principal point assumed to be at the centre of the image ($x_0 = 0$, $y_0 = 0$). The final pose errors of the robot are shown in figures 6.25-6.27 for a range of focal lengths and principal point coordinates.

Figure 6.25 Final pose error varying the focal length

Figure 6.26 Final pose error varying the location of the x coordinate of the principal point


Figure 6.27 Final pose error varying the location of the y coordinate of the principal point

The results indicate that the method is able to compensate for calibration errors; in other words, a rough calibration is sufficient to ensure the convergence of the system.

The performance of the system is also analyzed when noise is applied directly to the homography elements. The results of driving the robot from $(x, z, \phi) = (-5, -15, 5°)$ to $(x, z, \phi) = (0, 0, 0°)$ with white noise of standard deviation $\sigma = 0.3$ are presented in figures 6.28 and 6.29.

Figure 6.28 Evolution of pose parameters with noise


Figure 6.29 Evolution of homography elements with noise

In addition, the final pose error under the effect of white noise with increasing standard deviation ($\sigma$) is given in figure 6.30.

Figure 6.30 Final pose error varying noise on homography

It can be inferred from the graphs above that the convergence of the system is achieved in spite of the noise. Unsurprisingly, the higher the standard deviation of the noise, the larger the deviation from the target configuration. Lateral and depth errors are


compensated better than the orientation error when noise with a high standard deviation affects the system.

As a final remark in this chapter, the ways of finding $\psi$ are discussed. Since the definition of the desired trajectories is absolutely necessary to carry out the simulations, and since the desired trajectory of $h_{13}$ includes $\psi$, its value must be known during the simulations. There are two methods that can be employed to find $\psi$ in simulation.

1) The initial pose of the mobile robot is provided as an input to the simulation algorithm, so the initial values of $x$, $z$ and $\phi$ are known at the start. If the target pose is $\mathbf{x}_t = [x_t\ z_t\ \phi_t]^T$, then $\psi$ can be computed using the relation $\psi = -\arctan\!\left(\frac{x - x_t}{z - z_t}\right) - \phi_t$, which can be inferred from figure 5.13. If the target pose is $\mathbf{x}_t = [0\ 0\ 0]^T$, then simply $\psi = -\arctan\!\left(\frac{x}{z}\right)$. After computing $\psi$, the construction of the control law can be completed and the robot can be driven by the control signal to its next position ($x$, $z$) and orientation ($\phi$), which are then also known. Applying $\psi = -\arctan\!\left(\frac{x}{z}\right)$ again, the control signal can be recalculated and the robot driven further. This loop continues until the robot reaches the target pose.

2) The second way of finding $\psi$ is related to the target epipole. Please refer to Appendix A for information about epipolar geometry. The relation between the target epipole and $\psi$ is explained with the help of figures 6.31 and 6.32.

Figure 6.31 Epipoles in the current and target poses


If the target epipole is zoomed in, figure 6.32 is obtained.

Figure 6.32 Target epipole

The triangle in figure 6.32 reveals equation (6.1), which gives the value of $\psi$:

$$\tan(180° - \psi) = -\tan(\psi) = \frac{e_{tx}}{\alpha_x} \;\Longrightarrow\; \psi = -\arctan\!\left(\frac{e_{tx}}{\alpha_x}\right) \qquad (6.1)$$

In equation (6.1), 𝛼𝑥 is the focal length of the camera in pixel dimensions, so 𝑒𝑡𝑥 must also be

in pixel dimensions in order to make the argument of the arctangent function unitless. In

order to find out 𝜓 from equation (6.1), the value of x coordinate of the target epipole in pixel

dimensions must be known. This is done by projecting the focal center of the camera, which

is at the current pose, 𝐶𝑐 , onto the image plane of the camera which is at the target pose.

When figure 6.31 is analyzed, it is seen that the ray emanating from 𝐶𝑐 and going towards 𝐶𝑡

creates the target epipole. The relationship between a 3D homogeneous point 𝐗 = [X Y Z 1]T

expressed in the fixed world frame and its projection 𝐱 = [x y 1]T in the image plane of the

camera is:

$$\mathbf{x} = \mathbf{P}\mathbf{X} = \mathbf{K}\,[\mathbf{R}\ |\ \mathbf{t}]\,\mathbf{X}$$

where ($\mathbf{R}$, $\mathbf{t}$) are the extrinsic parameters (the rotation and translation between the fixed world frame and the camera frame) and $\mathbf{K}$ is the intrinsic camera matrix, as explained in the perspective projection section. There are therefore two steps to calculate the target epipole in pixel dimensions:

i) Compute 3x4 projection matrix 𝐏 for the target pose,

ii) Project the focal center of the camera, which is at the current pose, onto the image

plane of the camera, which is at the target pose by 𝐞𝐭 = 𝐏𝐗𝐂𝐜. Here, 𝐞𝐭 is the 3x1 vector

standing for the target epipole, 𝐏 is the 3x4 projection matrix of the target pose found in step

(i), and 𝐗𝐂𝐜 is the 4x1 vector showing the homogeneous coordinates of the focal center of the

camera which is at the current pose with respect to the world coordinate frame.

After the calculation of 𝐞𝐭, 𝑒𝑡𝑥 , which is the x coordinate of the target epipole, can easily be

found and used in equation (6.1) to ascertain 𝜓. Then, the construction of the control law can

be finished.
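A sketch of these two steps combined with equation (6.1) is given below (the target-pose extrinsics and the current camera centre are assumed known, as they are in simulation; in an experiment the epipole is instead obtained from the fundamental matrix, as described in chapter 7):

```python
import numpy as np

def psi_from_target_epipole(K, R_t, t_t, C_current, alpha_x):
    """Compute psi from the x coordinate of the target epipole, equation (6.1).

    K         : 3x3 intrinsic camera matrix
    R_t, t_t  : extrinsics of the camera at the target pose (world -> camera)
    C_current : 3-vector, focal centre of the camera at the current pose,
                expressed in the world frame
    alpha_x   : focal length in pixel dimensions along x
    """
    P = K @ np.hstack([R_t, t_t.reshape(3, 1)])   # step (i): 3x4 projection matrix
    X = np.append(C_current, 1.0)                 # homogeneous world coordinates of C_c
    e_t = P @ X                                   # step (ii): target epipole
    e_t /= e_t[2]                                 # normalize to pixel coordinates
    return -np.arctan(e_t[0] / alpha_x)           # psi = -arctan(e_tx / alpha_x)
```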

Also, please note that the time derivative of the desired trajectories is necessary to find the control signal. Since numerical values of $\psi$ are available, the time derivative of $\psi$ is found by numerical differentiation:

$$\dot{\psi}(t) = \lim_{\Delta t \to 0}\frac{\psi(t + \Delta t) - \psi(t)}{\Delta t} \approx \frac{\psi(t + \Delta t) - \psi(t)}{\Delta t}.$$


7-EXPERIMENTAL ARRANGEMENTS

In an experiment, the only inputs are real images from the camera. The control algorithm needs two images: the image taken at the desired pose and the current image. It tries to drive the robot from the initial configuration towards the target pose by comparing the image taken at the desired pose with the current images captured during the motion. The control loop for an experiment is shown in figure 7.1.

Figure 7.1 Diagram of the control loop for an experiment

Feature extraction from the images and matching of image points are carried out by SIFT. SIFT (Scale Invariant Feature Transform) is an interest point detector and descriptor that is invariant to scale and rotation, as explained in section 5.1.3.2. The information obtained from SIFT is used for the estimation of the homography and the extraction of $\psi$. The homography is estimated with the direct linear transformation method, as elucidated in section 5.1.3.2, and $\psi$ is extracted via the relation $\psi = -\arctan\!\left(\frac{e_{tx}}{\alpha_x}\right)$, so the x coordinate of the target epipole must be found from the real images. An algorithm proposed by [10] is used to find the fundamental matrix and then the epipoles; please refer to Appendix B for the derivation of the fundamental matrix and the epipoles. After the extraction of $\psi$ and the computation of the 3x3 homography matrix, the construction of the control law is complete. Then the control signal, consisting of the linear and angular velocities compatible with the goal, can be applied to the robot. Thus, all algorithms required to carry out an experiment, and an understanding of them, are explained in this report. Although all the necessary Matlab scripts for an experiment were prepared on top of the Matlab code required for the simulations, there was not enough time in this three-month internship project to perform an experiment.


It takes about 1.25 seconds (0.8 Hz) to calculate the control signal from two real images, and the completion time of one cycle of the control loop depends on the communication speed between the computer and the robot. It has been verified experimentally in [23] that even if the control loop runs at a frequency of 0.75 Hz, the stability of the system is achieved. Thus, if the communication between the robot and the computer is sufficiently fast, the proposed algorithm is expected to perform well with a guarantee of stability.

8-CONCLUSIONS

In this project, research on mobile robot navigation using visual servo control methods is carried out. A homography-based visual servoing method is applied to a nonholonomic mobile robot. A control law is constructed based on the input-output linearization of the system. The outputs of the system are chosen among the homography elements, and a set of desired trajectories for those outputs is defined; the visual servo control problem is thereby transformed into a tracking problem. The visual control method needs neither homography decomposition, nor depth estimation, nor any 3D measure of the scene. Simulations show that the control algorithm is robust and that convergence of the system is achieved in the presence of noise, calibration errors and uncertainty in the control parameters.

The performance of the system obviously depends on the desired trajectories of the homography elements, since the problem is a tracking problem. Several sets of desired trajectories for the homography elements have been proposed in the literature, and one of them is used in this project. The chosen set of desired trajectories makes the robot converge towards the target in a smooth manner, avoiding discontinuous motions. However, the mobile robot cannot always converge to the target with zero pose error within the specified duration. This is mainly because the desired homography trajectories may dictate a path that cannot be achieved with the present robot capabilities. Therefore, the mapping from homography trajectories to the Cartesian path should be investigated further as future work, taking the abilities of the robot into account, so that more appropriate and realizable desired homography trajectories can be derived.

There is also a drawback common to all homography-based control methods used in applications and proposed in the literature: they may fail or give insufficient results if no plane is detected in the scene, or if the detected plane has $n_z = 0$, i.e., the plane is horizontal. To overcome this disadvantage, switching control methods have been proposed, in which another control method takes over when no appropriate plane is detected for homography-based visual control; if the other control method faces a singularity, the homography-based method takes charge again. As future work, adding another control method, such as an epipole-based control method, to the present work would increase the versatility and robustness of the robot on which the switching control algorithm is used.


APPENDIX A

When two cameras view a 3D scene from two distinct positions, or when a single camera takes pictures of the same 3D scene from different positions, there are a number of geometric relations between the 3D points and their projections onto the 2D images that lead to constraints between the image points. Figure A.1 shows two cameras looking at a point X, which is the point of interest to both cameras. OL and OR are the centers of projection (focal points) of the cameras, and XL and XR are the projections of the 3D point X onto the image planes. Each camera captures a 2D image of the 3D world, and the transformation from 3D to 2D is carried out by perspective projection.

Figure A.1 Epipolar Geometry

The centers of projection of the cameras are distinct, so each center of projection is projected onto a distinct point in the other camera's image plane (projection manifold) [24]. These two points on the image planes are denoted by eL and eR and are called epipoles. The centers of projection and the epipoles lie on the same 3D line. The line OL − X is viewed by the left camera as a point, because that line is the projection ray passing through the left camera's center of projection. The very same line, however, is seen as a line by the right camera, and its projection onto the image plane of the right camera is called an epipolar line (eR − XR). In the same manner, the line OR − X, which is seen as a point by the right camera, is viewed as an epipolar line (eL − XL) by the left camera. The plane formed by OL, OR and X is called the epipolar plane. This plane intersects each camera's image plane, and the intersection is exactly the epipolar line. All epipolar planes and epipolar lines pass through the epipoles, regardless of the location of X. Additionally, the vector w originating from OL and pointing towards OR is called the positive epipolar ray, while the vector −w originating from OL and pointing in the opposite direction is called the negative epipolar ray.


The signs of the epipoles at the beginning of the motion are needed to determine the desired trajectory of h13. If the robot does not have a suitable orientation at the beginning of the motion, an extra step is required to drive it into a proper orientation for a smooth motion towards the target. The decision about this extra step depends on the signs of the x coordinates of the epipoles with respect to the robot-attached coordinate frame. Within the framework of this project, the mobile robot performs planar motion, so only the x coordinates of the epipoles change in time and are therefore the decisive factors. This is explained with the help of figure A.2.

Figure A.2 Geometric relations of the epipoles in the current image and target image, for cases (a) and (b)

The x coordinate of the target epipole is always positive when the initial position of the robot is in the third quadrant (x < 0 and z < 0) of the target frame, as in the cases illustrated in figure A.2. The ray emanating from Cc crosses the projection manifold of the target scene in the first quadrant of the target frame, so the epipole lies in the first quadrant and has a positive x coordinate. By the same reasoning, if the robot is initially in the fourth quadrant, the target epipole will always lie in the second quadrant and will have a negative x coordinate. Analyzing the current epipoles shows that, with respect to the robot-attached coordinate frames, the x coordinate of the current epipole is positive in case (a) and negative in case (b). Therefore, the desired trajectory of h13 is defined in three phases for case (a), whereas it is defined in two phases for case (b), skipping the extra step. A sketch of this decision is given below.
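As a rough illustration of this decision logic only, the following hypothetical Python sketch maps the epipole signs to the number of trajectory phases. The function name is invented here, the quadrant conventions follow figure A.2, and the fourth-quadrant branch is an assumption obtained by mirroring the third-quadrant case rather than something spelled out above.

```python
def number_of_h13_phases(e_cur_x, e_tar_x):
    """Hypothetical helper: choose between a two-phase and a three-phase
    desired trajectory for h13 from the signs of the x coordinates of the
    current and target epipoles (robot-attached frames, planar motion)."""
    if e_tar_x > 0:
        # Initial position in the third quadrant of the target frame
        # (figure A.2): a positive current-epipole x coordinate (case a)
        # needs the extra reorientation phase; a negative one (case b)
        # does not.
        return 3 if e_cur_x > 0 else 2
    else:
        # Initial position in the fourth quadrant: mirrored by symmetry
        # (assumption, not stated explicitly in the discussion above).
        return 3 if e_cur_x < 0 else 2
```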


APPENDIX B

The epipolar geometry explained in Appendix A is the intrinsic projective geometry between two views and is independent of the scene structure; it depends only on the cameras' internal parameters and their relative pose. The fundamental matrix F encapsulates this intrinsic geometry [10]. In other words, it is the algebraic representation of epipolar geometry. A point X in three-dimensional space is projected onto the two images as 𝒙 in the first image and 𝒙′ in the second image, and the fundamental matrix relates these two image points. The image points 𝒙 and 𝒙′, the space point X, and the camera centers are coplanar, as shown in figure B.1; this plane is called the epipolar plane and is denoted by 𝜋.

Figure B.1 3D Point X and its image points x and x′

The image point 𝒙 back-projects to a ray in 3D space defined by the camera center C and 𝒙, which are collinear. This ray is seen as the line 𝒍′ in the second image.

Figure B.2 The ray emanating from C and passing through x is seen as the line l′ (the epipolar line for x) in the second image


As can be seen in figure B.2, for each point 𝐱 in one image there is a corresponding epipolar line 𝒍′ in the other image, and the matched point 𝐱′ of 𝐱 must lie on 𝒍′. The fundamental matrix defines the mapping from a point in one image to its corresponding epipolar line in the other image (𝐱 → 𝒍′), and it satisfies the condition that for any pair of corresponding points 𝒙 ↔ 𝒙′ in the two images

𝒙′𝑻𝑭𝒙 = 0 (B.1)
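In practice, condition (B.1) can be checked directly. A minimal NumPy sketch, assuming the homogeneous image points and 𝑭 are stored as NumPy arrays, is:

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F x in the second image for a homogeneous
    point x = [x, y, 1]^T of the first image."""
    return F @ x

def epipolar_residual(F, x, xp):
    """Algebraic residual x'^T F x of equation (B.1); it vanishes
    (up to noise) when x and x' are a genuine match."""
    return float(xp @ F @ x)
```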

If 𝒙 and 𝒙′ are matching points, then 𝒙′ must lie on the epipolar line 𝒍′ = 𝑭𝒙. Since 𝒙′ is on 𝒍′, the equation 𝒙′𝑻𝒍′ = 0 must be satisfied; substituting 𝒍′ = 𝑭𝒙 gives 𝒙′𝑻𝑭𝒙 = 0. If the fundamental matrix

$$\boldsymbol{F} = \begin{bmatrix} f_{11} & f_{12} & f_{13} \\ f_{21} & f_{22} & f_{23} \\ f_{31} & f_{32} & f_{33} \end{bmatrix}$$

is written in vector form as

$$\boldsymbol{f} = \begin{bmatrix} f_{11} & f_{12} & f_{13} & f_{21} & f_{22} & f_{23} & f_{31} & f_{32} & f_{33} \end{bmatrix}^{T},$$

and the image points are 𝒙 = [𝑥 𝑦 1]𝑇 and 𝒙′ = [𝑥′ 𝑦′ 1]𝑇, then each point match yields one linear equation in the unknown entries of the fundamental matrix, as shown in equation (B.2):

$$x'x\,f_{11} + x'y\,f_{12} + x'\,f_{13} + y'x\,f_{21} + y'y\,f_{22} + y'\,f_{23} + x\,f_{31} + y\,f_{32} + f_{33} = 0,$$

or equivalently

$$\begin{bmatrix} x'x & x'y & x' & y'x & y'y & y' & x & y & 1 \end{bmatrix}\boldsymbol{f} = 0. \tag{B.2}$$

For a set of n point matches, a set of linear equations is obtained:

$$\begin{bmatrix} x_1'x_1 & x_1'y_1 & x_1' & y_1'x_1 & y_1'y_1 & y_1' & x_1 & y_1 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n'x_n & x_n'y_n & x_n' & y_n'x_n & y_n'y_n & y_n' & x_n & y_n & 1 \end{bmatrix}\boldsymbol{f} = \boldsymbol{A}\boldsymbol{f} = \boldsymbol{0}. \tag{B.3}$$

Equation (B.3) is a homogeneous set of equations, so 𝒇 can only be determined up to scale [10]. In order to obtain a nontrivial solution for 𝒇, the matrix 𝑨 must have rank at most 8. If its rank is exactly 8, the solution is unique up to scale. However, if the data are noisy, the rank of 𝑨 may be 9; in this case a least-squares solution is used to find 𝒇. The least-squares solution for 𝒇 is the singular vector corresponding to the smallest singular value of 𝑨, that is, the last column of 𝑽 in the singular value decomposition 𝑨 = 𝑼𝑫𝑽𝑻.
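As an illustration, a minimal NumPy sketch of this linear least-squares estimation is given below. The point matches are assumed to be supplied as n×2 arrays of pixel coordinates; the coordinate normalization recommended in [10] is omitted here for brevity.

```python
import numpy as np

def estimate_F_linear(x, xp):
    """Linear estimation of the fundamental matrix from n >= 8 matches.

    x, xp : (n, 2) arrays of corresponding points, x in the first image
            and x' in the second image.
    Returns a 3x3 matrix F, defined up to scale (rank not yet enforced).
    """
    n = x.shape[0]
    # One row of A per correspondence, exactly as in equation (B.2).
    A = np.column_stack([
        xp[:, 0] * x[:, 0], xp[:, 0] * x[:, 1], xp[:, 0],
        xp[:, 1] * x[:, 0], xp[:, 1] * x[:, 1], xp[:, 1],
        x[:, 0],            x[:, 1],            np.ones(n),
    ])
    # Least-squares solution of A f = 0: the right singular vector of A
    # belonging to the smallest singular value (last row of Vt).
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1, :].reshape(3, 3)
```

In practice, the image coordinates should be translated and scaled before building 𝑨 and the result denormalized afterwards, as advocated in [10], to keep the estimation numerically well conditioned.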

An important property of the fundamental matrix is that it is not of full rank, that is, it does not represent an invertible mapping. An image point 𝒙 in one image defines a line 𝒍′ in the other image, which is the epipolar line of 𝒙. In the same manner, the image point 𝒙′ in the second image defines a line 𝒍 in the first image, which is the epipolar line of 𝒙′. Any point 𝒙 on 𝒍 is then mapped to the same line 𝒍′. Therefore, there is no inverse mapping, since the location of the


point mapped back from the line 𝒍′ cannot be known exactly, i.e., it can be anywhere on the epipolar line 𝒍. This makes the fundamental matrix rank deficient; it has rank 2. Another consequence of the singularity of the fundamental matrix is that the epipole location does not vary for different points. A physical interpretation of the singularity of the fundamental matrix is given with the help of figure B.3 [10].

Figure B.3 (a) Full-rank fundamental matrix (b) Rank-deficient fundamental matrix

The lines in figure B.3 are the epipolar lines computed as 𝒍′ = 𝑭𝒙 for different points 𝒙. In (a) there is no common epipole, whereas in (b) all epipolar lines intersect at the same point, which is the epipole.

The fundamental matrix obtained by solving the linear equations in (B.3) may not be of rank 2, because the data are contaminated by noise. In such a case, an additional step is applied to force the fundamental matrix to be singular. This is done by singular value decomposition. Applying the singular value decomposition to the 𝑭 found from equation (B.3) gives

$$\boldsymbol{F} = \boldsymbol{U}\boldsymbol{D}\boldsymbol{V}^{T}, \qquad \boldsymbol{D} = \begin{bmatrix} a & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & c \end{bmatrix}, \qquad a \geq b \geq c.$$

The fundamental matrix is then reconstructed by setting the smallest singular value to zero, such that a ≥ b ≥ c = 0. Hence 𝑭 = 𝑼 diag(𝑎, 𝑏, 0) 𝑽𝑻, which has rank 2. Moreover, the epipoles in the two images are the left and right null vectors of 𝑭, i.e., the last columns of 𝑼 and 𝑽 respectively.
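A corresponding NumPy sketch of the rank-2 enforcement and of reading the epipoles off the SVD might look as follows; the final homogeneous normalization assumes the epipoles are finite.

```python
import numpy as np

def enforce_rank_two(F):
    """Replace F by the closest rank-2 matrix: F = U diag(a, b, 0) V^T."""
    U, S, Vt = np.linalg.svd(F)
    S[2] = 0.0                      # zero out the smallest singular value
    return U @ np.diag(S) @ Vt

def epipoles(F):
    """Epipoles of a rank-2 F: F e = 0 and F^T e' = 0, i.e. the last
    columns of V and U in F = U D V^T."""
    U, _, Vt = np.linalg.svd(F)
    e = Vt[-1, :]                   # epipole in the first image
    ep = U[:, -1]                   # epipole in the second image
    return e / e[2], ep / ep[2]     # scale so the last component is 1
```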


REFERENCES

[1] G. N. DeSouza and A. C. Kak, “Vision for mobile robot navigation: A survey,” IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 237–267,

2002.

[2] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control”, IEEE

Tran. on Robotics and Automation, vol. 12, no. 5, pp. 651–670, 1996.

[3] F. Chaumette and S. Hutchinson, "Visual Servo Control, Part I: Basic Approaches" and "Part II: Advanced Approaches", IEEE Robotics & Automation Magazine, December 2006 and March 2007.

[4] E. Malis, F. Chaumette, and S. Boudet, "2½ D Visual Servoing", IEEE Transactions on Robotics and Automation, 1999.

[5] M. W. Spong, S. Hutchinson, and M. Vidyasagar, Robot Modeling and Control, John Wiley & Sons, Inc., USA, 2006.

[6] http://en.wikipedia.org, Charge Coupled Devices. Obtained on 10th of December, 2009.

[7] B. Thuilot, P. Martinet, L. Cordesses, and J. Gallice, "Position based visual servoing: Keeping the object in the field of vision", in Proc. IEEE Int. Conf. on Robotics and Automation, pp. 1624–1629, May 2002.

[8] W. Wilson, C. Hulls, and G. Bell, "Relative end-effector control using Cartesian position based visual servoing", IEEE Trans. on Robotics and Automation, vol. 12, pp. 684–696, Oct. 1996.

[9] C. Sagues, G. Lopez-Nicolas, and J. J. Guerrero, "Homography based visual control of nonholonomic vehicles", in Proc. IEEE Int. Conf. on Robotics and Automation, pp. 1703–1708, Rome, Italy, April 2007.

[10] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge

University Press: Cambridge, UK.

[11] http://mathworld.wolfram.com, Dilation. Obtained on 10th of December, 2009.

[12] https://www.e-education.psu.edu/natureofgeoinfo/c2_p18.html, The Nature of Geographic Information, Plane Coordinate Transformations. Obtained on 15th of October, 2009.

[13] E. Dubrofsky, "Homography Estimation", M.Sc. essay, Faculty of Graduate Studies, University of British Columbia, March 2009.

[14] http://www.svgopen.org/2008/papers/86-Achieving_3D_Effects_with_SVG, Achieving 3D Effects with SVG, SVG Open 2008 conference. Obtained on 5th of December.


[15] Z. Chuan, T. D. Long, Z. Feng, and D. Z. Li, "A planar homography estimation method for camera calibration", in Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation, vol. 1, pp. 424–429, 2003.

[16] Z. Zhang., “A flexible new technique for camera calibration”, IEEE Transactions on

Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.

[17] A. Agarwal, C. V. Jawahar, and P. J. Narayanan, "A Survey of Planar Homography Estimation Techniques", Tech. Rep. IIIT/TR/2005/12, 2005.

[18] http://en.wikipedia.org, Camera resectioning. Obtained on 10th of October, 2009.

[19] David G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”,

International Journal of Computer Vision, January 2004.

[20] David G. Lowe, “Object Recognition from Local Scale-Invariant Features”, Proc. of

International Conference on Computer Vision, Corfu, September 1999.

[21] http://en.wikipedia.org, Scale-invariant feature transform. Obtained on 1st of December, 2009.

[22] J. J. E. Slotine and W. Li, Applied Nonlinear Control, Prentice-Hall, 1991.

[23] C. Sagues, G. Lopez-Nicolas, and J. J. Guerrero, "Visual Control of Vehicles Using Two View Geometry", submitted to Mechatronics, 2009.

[24] http://en.wikipedia.org, Epipolar geometry. Obtained on 15th of November, 2009.