
IMPROVEMENT OF THE CAMERA CALIBRATION THROUGH THE USE OF MACHINE LEARNING TECHNIQUES

BY

SCOTT A. NICHOLS

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2001


ACKNOWLEDGMENTS

I wish to thank Dr. Antonio Arroyo for taking a long shot on an average undergrad. You have my gratitude, and have helped me make much more of myself. I also wish to thank Dr. Michael Nechyba for more things than I can enumerate here, but I will make a half-hearted effort to. Thank you for your patience; your service as an idea blackboard that corrects mistakes; your recommendation of Danela's Ristorante; oh yeah, and your patience. I also wish to thank the members of the Machine Intelligence Lab that I have shared workbench space with over the years for ideas and motivation.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Computer Vision
  1.2 Camera Calibration
  1.3 This Work

2 THE CAMERA MODEL
  2.1 Introduction
  2.2 Definition
  2.3 Training the Model

3 A SINGLE CAMERA
  3.1 Introduction
  3.2 Training Edges
  3.3 Optimization Criterion
  3.4 Initial Models
  3.5 Gradient Descent
  3.6 Model Perturbation
  3.7 Gradient-Perturbation Hybrid
  3.8 Performance Comparisons

4 STEREO CAMERAS
  4.1 Introduction
  4.2 Related Work
  4.3 Our Approach
  4.4 Training Data
  4.5 Model Improvement

5 FURTHER RESULTS AND DISCUSSION
  5.1 Single Camera
  5.2 Stereo Cameras

APPENDICES

A VISUAL CALIBRATION EVALUATION

B GRAPHICAL CALIBRATION TOOL

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF FIGURES

2-1 The Pinhole Model of Perspective Projection
3-1 Example Training Edges
3-2 Error Types of Initial Models
3-3 Gradient Descent vs. Different Types of Error
3-4 Stochastic Perturbation vs. Different Types of Error
3-5 Stochastic Perturbation with Adaptive Delta vs. Different Types of Error
3-6 Gradient-Perturbation Hybrid vs. Different Types of Error
3-7 Performance Comparison for the Translational and Scale Initial Models
3-8 Performance Comparison for the Close and Rotational Initial Models
3-9 Final Models Using the Gradient-Perturbation Hybrid Technique
4-1 Results for a Non-Weighted Model Improvement
4-2 Error Over Time for Various Gains on the Training Model
4-3 The Long Term Performance Using Various Gains
5-1 Example Single Camera Calibrations (1 & 2)
5-2 Example Single Camera Calibrations (3 & 4)
5-3 Example Single Camera Calibrations (5 & 6)
5-4 Example Single Camera Calibrations (7 & 8)
5-5 Poor Initial Calibrations (1 & 2)
5-6 Poor Initial Calibrations (3 & 4)
5-7 Stereo Calibration Examples (1 & 2)
5-8 Stereo Calibration Examples (3 & 4)
A-1 The Calibration Grid
A-2 The Experimental Area
B-1 The Graphical Calibration Tool

LIST OF TABLES

3-1 Model Perturbation: Average Error Per Pixel
3-2 Model Perturbation with Adaptive Delta: Average Error Per Pixel
3-3 Gradient-Perturbation Hybrid: Average Error Per Pixel

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

IMPROVEMENT OF THE CAMERA CALIBRATION THROUGH THE USE OF MACHINE LEARNING TECHNIQUES

By

Scott A. Nichols

August 2001

Chairman: Dr. Michael C. Nechyba
Major Department: Electrical and Computer Engineering

In computer vision, we are frequently interested not only in automatically recognizing what is present in a particular image or video, but also where it is located in the real world. That is, we want to relate two-dimensional image coordinates and three-dimensional world coordinates. Camera calibration refers to the process through which we can derive this mapping from real-world coordinates to image pixels. We propose to reduce the amount of effort required to compute accurate camera calibration models by automating much of the calibration process through machine learning techniques. The techniques developed are intended to simplify the calibration process such that a minimally trained person can perform single, stereo, or multiple fixed-camera calibrations with precision and ease. In this thesis, we first develop a learning algorithm for improving a single calibration model through a combination of gradient descent and stochastic model perturbation. We then develop a second algorithm that applies specifically to simultaneous calibration of multiple fixed cameras. Finally, we illustrate our algorithms through a series of examples and discuss avenues for further research.

CHAPTER 1

INTRODUCTION

1.1 Computer Vision

For decades, researchers have been attempting to duplicate in machine-centered systems what we as humans do on a daily basis with our eyes — to recognize and understand the world around us through visual input. To date, however, our imagination has outpaced the reality of state-of-the-art computer vision systems. While science fiction has often depicted robots and machines with human-like capabilities, current computer vision systems do not come close to matching those imagined capabilities. In fact, we are still many years away from computers that can rival humans and other animals in image processing and recognition capabilities. Why? Computer vision, it appears, is a much more difficult problem than was first believed by early researchers. In the late sixties, with the spread of general purpose computers, researchers felt that a solution to the general vision problem — the near instantaneous recognition of any visual input — was achievable within a short number of years. Since then, we have come to understand the enormous computational resources our own brains devote to the vision processing task, and the consequent challenges that the general computer vision problem poses.

Therefore, rather than develop one computer vision system with very general capabilities, researchers have begun to focus on implementing practical computer-vision applications that are limited in scope but can successfully carry out specific tasks. Some examples of recent work include face and car recognition [14], people detection and tracking [6, 13], computer interaction through gestures [15], handwriting recognition [9], traffic monitoring and accident reporting [8], detection of lesions in retinal images [17], and even automated aircraft landing [5]. In many of these computer vision projects, researchers are not only interested in recognizing what is present in an image or video; they would also like to infer 3D geometric information about the world from the image itself. If a system is to relate with and/or draw conclusions about the 3D position of objects in the image, it needs to be calibrated; that is, we need to derive a relationship between the 3D geometry of the real world and corresponding image pixels.

1.2 Camera Calibration

Developing accurate calibration or camera models in computer vision has been the focus of much research over the years. Many researchers have opted for the following simple approach: first, the 3D location of certain known image pixels is identified; then, these correspondence pairs are exploited to estimate the parameters of the calibration model. From this model, the intrinsic camera parameters, which define the properties of the camera sensor and lens, and the extrinsic parameters, which define the pose of the camera with respect to the world, can be extracted. This process can be burdensome, as it requires many precise position measurements. Often, someone doing computer vision research spends as much time fiddling with and worrying about camera calibration as they do on their actual application.

1.3 This Work

In this thesis, we propose to reduce the amount of effort required to compute accurate camera calibration models by automating much of the calibration process through machine learning techniques. The techniques developed are intended to simplify the calibration process such that a minimally trained person can perform single, stereo, or multiple fixed-camera calibrations with precision and ease. In Chapter 2, we review the basics of camera calibration. Then, in Chapter 3, we develop a training algorithm for improving a single-camera calibration model from constrained features in the image. Next, in Chapter 4, we build on the previous chapter by developing an algorithm for improving multiple fixed-camera calibration models. Finally, in Chapter 5, we present further results and discuss possible extensions of this work.

CHAPTER 2

THE CAMERA MODEL

2.1 Introduction

Camera calibration in computer vision refers to the process of determining a camera's internal geometric and optical characteristics (intrinsic parameters) and/or the 3D position and orientation of the camera relative to a specified world coordinate system (extrinsic parameters) [16]. We do this in order to extract 3D information from image pixel coordinates and to project 3D information onto 2D image coordinates. Cameras in computer vision can be mounted as fixed view, a panning and tilting view, or can be integrated onto a mobile system. Panning, tilting and mobile cameras do not have a fixed relationship to any world coordinate system (extrinsic parameters). For such systems, information is often available only in the camera coordinate system, which is defined by the direction of the camera at the moment the image was captured. External sensors and encoders can alleviate this problem, but typically only at a significantly higher cost. In fixed camera systems, on the other hand, both the intrinsic and extrinsic parameters combine to provide a transformation between 3D world and 2D image coordinates.

In recent years, many different techniques have been developed for camera calibration. The differences in these techniques primarily reflect the wide array of applications that researchers have pursued. One of the most popular, and the one that forms the basis of this work, is the direct linear transformation (DLT) introduced by Abdel-Aziz and Kahara [1]. The DLT method does not consider radial lens distortion, but is conceptually and computationally simple. In 1987, Tsai [16] proposed a method that is likely one of the most referenced works on the topic of camera calibration. It outlines a two-stage technique using the "radial alignment constraint" to model lens distortion. Tsai's method involves a direct solution for most of the camera parameters and some iterative solutions for the remaining parameters. The cameras used in this work appear to have little or no lens distortion. Since it has been shown by Weng, Cohen and Herniou [18] that Tsai's method can be worse than DLT if lens distortion is relatively small, we chose the DLT method.

2.2 Definition

In this thesis, we apply the pinhole lens model of perspective projection, whose basic geometry is shown in Figure 2-1. This model constructs a transformation from 3D world coordinates to pixels in an image in three steps. First, 3D world coordinates are converted to 3D camera coordinates through a homogeneous transformation. Let us denote $P_w = (x_w, y_w, z_w)$ as a coordinate in the world, and $P_c = (x_c, y_c, z_c)$ as the corresponding 3D camera coordinate. Then, the homogeneous transform $T$ is defined by,

$$T = \begin{bmatrix} R & t \\ 0\;0\;0 & 1 \end{bmatrix}$$   (2-1)

such that,

$$\begin{bmatrix} P_c \\ 1 \end{bmatrix} = T \begin{bmatrix} P_w \\ 1 \end{bmatrix}$$   (2-2)

where $R$ denotes a $3 \times 3$ rotation matrix,

$$R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$$   (2-3)

and $t$ denotes a $3 \times 1$ translation vector,

$$t = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}.$$   (2-4)

Figure 2-1: The Pinhole Model of Perspective Projection (world axes $X_w$, $Y_w$, $Z_w$ at origin $O_w$; camera axes $X_c$, $Y_c$, $Z_c$ at origin $O_c$; rotation $R$, translation $t$, focal length $f$)

Second, the 3D camera coordinate $P_c$ is transformed to a 3D sensor coordinate $P_u = (u, v, w)$:

$$\begin{bmatrix} P_u \\ 1 \end{bmatrix} = K \begin{bmatrix} P_c \\ 1 \end{bmatrix}$$   (2-5)

where the camera's intrinsic matrix $K$ is defined as,

$$K = \begin{bmatrix} -fa & -fb & -u_o & 0 \\ 0 & -fc & -v_o & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$   (2-6)

In equation (2-6), $f$ is the effective focal length of the camera; $a$, $b$, and $c$ describe the scaling in $x$ and $y$ and the angle between the optical axis and the image sensor plane, respectively; and $u_o$ and $v_o$ are the offset from the image origin to the imaging sensor origin. Finally, the 3D sensor coordinate $P_u$ is converted to the 2D image coordinate $(x_i, y_i)$:

$$x_i = \frac{u}{w}, \qquad y_i = \frac{v}{w}.$$   (2-7)

The complete projection equation is therefore given by,

$$\begin{bmatrix} P_u \\ 1 \end{bmatrix} = KT \begin{bmatrix} P_w \\ 1 \end{bmatrix}$$   (2-8)

The transformation $KT$ on the world coordinate in equation (2-8) can be combined into a single matrix $S$. Since we are only interested in the overall transform between world coordinates and image coordinates, and not an explicit solution of the extrinsic and/or intrinsic camera parameters, we therefore write,

$$P_u = S \begin{bmatrix} P_w \\ 1 \end{bmatrix}$$   (2-9)

or alternatively,

$$P_u = \begin{bmatrix} s_{11} & s_{12} & s_{13} & s_{14} \\ s_{21} & s_{22} & s_{23} & s_{24} \\ s_{31} & s_{32} & s_{33} & s_{34} \end{bmatrix} \begin{bmatrix} P_w \\ 1 \end{bmatrix}.$$   (2-10)

2.3 Training the Model

From equation (2-10), we now have a linear transformation with 12 unknowns that relates world coordinates to image pixels. Each correspondence between a 3D world coordinate and a 2D image point corresponds to a set of two equations,

$$u(s_{31}x + s_{32}y + s_{33}z + s_{34}) = w(s_{11}x + s_{12}y + s_{13}z + s_{14})$$
$$v(s_{31}x + s_{32}y + s_{33}z + s_{34}) = w(s_{21}x + s_{22}y + s_{23}z + s_{24})$$   (2-11)

or, in terms of the actual image coordinates,

$$x_i(s_{31}x + s_{32}y + s_{33}z + s_{34}) = s_{11}x + s_{12}y + s_{13}z + s_{14}$$
$$y_i(s_{31}x + s_{32}y + s_{33}z + s_{34}) = s_{21}x + s_{22}y + s_{23}z + s_{24}$$   (2-12)

Given a set of $n$ pairs of world and image coordinates, equation (2-11) can be written in matrix form as,

$$\begin{bmatrix} x & y & z & 1 & 0 & 0 & 0 & 0 & -x_i x & -x_i y & -x_i z & -x_i \\ 0 & 0 & 0 & 0 & x & y & z & 1 & -y_i x & -y_i y & -y_i z & -y_i \\ & & & & & \vdots & & & & & & \end{bmatrix} \begin{bmatrix} s_{11} \\ s_{12} \\ \vdots \\ s_{34} \end{bmatrix} = 0$$   (2-13)

Arbitrarily setting $s_{34} = 1$ leaves 11 unknown parameters, which can be solved for using linear regression. In general, the more correspondence pairs that are defined, the less susceptible the model is to noise. To a large extent, this thesis aims to reduce or eliminate the need to collect many such precise correspondence pairs, as that can be labor intensive and/or prone to human operator error.
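As a concrete illustration of the DLT procedure just described, the following sketch (illustrative Python/NumPy, not the implementation used in this work; the function names are chosen here purely for exposition) fixes $s_{34} = 1$, stacks the two equations of (2-12) for each correspondence pair, solves the resulting linear system for the remaining 11 parameters, and projects a world point back to pixel coordinates via equations (2-10) and (2-7).

import numpy as np

def estimate_projection_matrix(world_pts, image_pts):
    # world_pts: (n, 3) 3D world coordinates; image_pts: (n, 2) pixel coordinates.
    # With s34 fixed at 1, each pair contributes the two rows of equation (2-13),
    # leaving 11 unknowns that are solved by linear least squares.
    A, b = [], []
    for (x, y, z), (xi, yi) in zip(world_pts, image_pts):
        A.append([x, y, z, 1, 0, 0, 0, 0, -xi * x, -xi * y, -xi * z])
        b.append(xi)
        A.append([0, 0, 0, 0, x, y, z, 1, -yi * x, -yi * y, -yi * z])
        b.append(yi)
    s, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    return np.append(s, 1.0).reshape(3, 4)        # re-attach s34 = 1

def project(S, world_pt):
    # Equations (2-10) and (2-7): homogeneous projection followed by division by w.
    u, v, w = S @ np.append(world_pt, 1.0)
    return np.array([u / w, v / w])

Since each pair contributes two equations and there are 11 unknowns, at least six well-spread correspondence pairs are needed; as noted above, additional pairs make the fit less susceptible to noise.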

CHAPTER 3

A SINGLE CAMERA

3.1 Introduction

A common method of calibration is to place a special object in the field of view of the camera [10, 11, 16, 21]. Since the 3D shape of the object is known a priori, the 3D coordinates of specified reference points on the object can be defined in an object-relative coordinate system. One of the more popular calibration objects described in the literature consists of a flat surface with a regular pattern marked on it. This pattern is typically chosen such that the image coordinates of the projected reference points (corners, for example) can be measured with great accuracy. Given a number of these points, each one yielding an equation of the form (2-9), the perspective transformation matrix S can be estimated with good results. One drawback of such techniques is that sometimes a calibration object might not be available. Even when a calibration object is available, the world coordinate system is defined by the placement of that object with respect to the camera, and is not necessarily optimized to take advantage of the geometry of the scene.

In another popular method, called structure from motion, the camera is moved relative to the world, and points of correspondence between consecutive images define the camera model [7, 21, 22]. In this approach, however, only the intrinsic parameters of the camera can be estimated (e.g. the K matrix); as such, this method is used primarily for stereo vision and will be addressed in the subsequent chapter.

In our work, we propose to divide the calibration process into two stages. First, we propose to generate an initial "close" camera model that then gets optimized to run-time quality through standard machine learning techniques. Since our system will improve its model over time, all we require initially is a method for generating reasonably close calibrations without undue effort or complexity. This type of problem was addressed by Worrall, et al. [19] through a graphical calibration tool, where the user can rotate and translate a known grid to the desired position on the image. The system then calculates a perspective projection matrix S that will place the grid at that location in the image. Our group has implemented a similar, intuitive GUI interface, which allows us to generate fast and easy initial calibration estimates; this interface is described in further detail in Appendix B.

3.2 Training Edges

Now, in order to improve the calibration model from the GUI interface through machine learning, we need something for the machine learning algorithm to train on. Since we do not want to require human operators to meticulously and precisely select numerous known correspondence pairs in an image, we should select features in the image that can be easily isolated or extracted through simple image processing techniques. In man-made environments, constrained edges with known dimensions are frequent and stand out visually; some examples of these might be the intersection of the floor and a wall, the corner of a room, window sills, etc. These features can provide a wealth of training data, without requiring explicit and precise image-to-world correspondence. As such, we choose to rely on such constrained data for improving our initial camera calibration model.

Exploiting the geometry of a given scene, a set of lines is chosen in 3D space, where each of the lines is constrained along two dimensions; for example, the vertical intersection of two perpendicular walls is constrained by $x = C_0$ and $y = C_1$, where $C_0$ and $C_1$ are known constants. The pixels corresponding to each of the lines and the constraints that define each line are the basis for our model improvement algorithm. In order not to bias the training algorithm, the lines (edges) should be chosen so as to provide training data that is balanced throughout the region of interest both in area and scale, as shown, for example, in Figure 3-1.

Figure 3-1: Example Training Edges (panels (a) and (b))

3.3 Optimization Criterion

Given our constrained-edges training data, we must define a reasonably well-behaved optimization criterion that lets us know how the training algorithm is progressing. Care should be used when choosing the optimization criterion, as it is the only mechanism a system has to evaluate a potential model. After some experimentation, our final optimization criterion was designed to reflect the error between the actual and projected pixel locations of the constrained edges and is defined by,

$$E = \frac{1}{m} \sum_{e=0}^{m} \frac{\sum_{i=0}^{n} E_{e_i}}{n}$$   (3-1)

where $e$ denotes a training edge, $i$ denotes a pixel along that edge, $n$ denotes the number of pixels along a particular edge, and $m$ denotes the total number of edges. In equation (3-1), $E_{e_i}$ is defined by,

$$E_{e_i} = \sqrt{(x_s - x_p)^2 + (y_s - y_p)^2}$$   (3-2)

where $(x_s, y_s)$ is an actual pixel location in the training set and $(x_p, y_p)$ is the projected pixel location, which is computed as follows. First, given the two constraints of a single edge (e.g. $x = C_0$, $y = C_1$), a pixel $(x_s, y_s)$ from the corresponding image line and the current perspective projection model S, apply equation (2-11) to generate two equations and one unknown. For example, for the constraints specified above, the two equations would become,

$$(s_{31}x_s - s_{11})C_0 + (s_{32}x_s - s_{12})C_1 + (s_{33}x_s - s_{13})z = s_{14} - s_{34}x_s$$
$$(s_{31}y_s - s_{21})C_0 + (s_{32}y_s - s_{22})C_1 + (s_{33}y_s - s_{23})z = s_{24} - s_{34}y_s$$   (3-3)

Equations (3-3) are easily generalized to arbitrary lines in space, and can be solved for the unknown coordinate (or parameter, in the general case) through linear regression. Then, we can project the resulting 3D coordinate onto the image, using equations (2-8) and (2-7), to get $(x_p, y_p)$.

The camera perspective will have some effect on the training data. Given two equal-length edges, the one with a larger image cross-section will have more sample points and therefore will contribute more to the error. To compensate for this, the error generated by each edge is averaged to an error per pixel along that edge. The average error of each edge is then averaged to obtain the final error measure E.
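For clarity, a minimal sketch of this criterion is given below (illustrative Python/NumPy, not the thesis code), for the vertical-edge case $x = C_0$, $y = C_1$ of equation (3-3): for each observed edge pixel, the single unknown z is fit by least squares, the recovered 3D point is re-projected through S, and the distances of equation (3-2) are averaged per edge and then over edges as in equation (3-1).

import numpy as np

def edge_error(S, edge_pixels, C0, C1):
    # Average pixel error along one vertical training edge (x = C0, y = C1).
    per_pixel = []
    for xs, ys in edge_pixels:
        # Coefficient of z and right-hand side of each line of equation (3-3).
        a = np.array([S[2, 2] * xs - S[0, 2],
                      S[2, 2] * ys - S[1, 2]])
        b = np.array([S[0, 3] - S[2, 3] * xs
                      - (S[2, 0] * xs - S[0, 0]) * C0 - (S[2, 1] * xs - S[0, 1]) * C1,
                      S[1, 3] - S[2, 3] * ys
                      - (S[2, 0] * ys - S[1, 0]) * C0 - (S[2, 1] * ys - S[1, 1]) * C1])
        z = a.dot(b) / a.dot(a)                    # least-squares fit of the single unknown
        u, v, w = S @ np.array([C0, C1, z, 1.0])   # re-project the recovered 3D point
        per_pixel.append(np.hypot(xs - u / w, ys - v / w))   # equation (3-2)
    return float(np.mean(per_pixel))

def total_error(S, edges):
    # Equation (3-1): edges is a list of (edge_pixels, C0, C1) triples.
    return float(np.mean([edge_error(S, px, C0, C1) for px, C0, C1 in edges]))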

3.4 Initial Models

Error in a calibration model can be decomposed into three basic types: rotation, translation, and scale. Any training algorithm may handle these different sources of error with varying degrees of success. Therefore, we investigate how well our algorithm (defined below) performs on improving initial models in four general classes, as defined and illustrated in Figure 3-2. Each of the first three initial models in Figure 3-2 is labeled by the dominant type of error displayed. The fourth model exhibits a combination of errors, but was generated to be fairly close to an acceptable final model. Given a graphical calibration tool, as described in Appendix B, the close model represents an easily achievable and therefore most likely starting configuration. The other three cases are presented to establish the limits of our training approach and to determine the types of error that present the most difficulty.

Figure 3-2: Error Types of Initial Models ((a) Close, (b) Rotation, (c) Scale, (d) Translation)

3.5 Gradient Descent

The error measure defined in equation (3-2) can be expanded to,

$$E_{e_i} = \sqrt{\left(x_s - \frac{s_{11}x + s_{12}y + s_{13}z + s_{14}}{s_{31}x + s_{32}y + s_{33}z + s_{34}}\right)^2 + \left(y_s - \frac{s_{21}x + s_{22}y + s_{23}z + s_{24}}{s_{31}x + s_{32}y + s_{33}z + s_{34}}\right)^2}$$   (3-4)

This is a differentiable function in S for which we can compute the gradient $\nabla E_{e_i}$ with respect to the parameters in S such that,

$$\nabla E_{e_i} = \begin{bmatrix} \dfrac{\partial E_{e_i}}{\partial s_{11}} & \cdots & \dfrac{\partial E_{e_i}}{\partial s_{33}} \end{bmatrix}^T.$$   (3-5)

Note that since $s_{34}$ is assigned to be equal to one, it is not part of the gradient in equation (3-5). From equation (3-1), the overall gradient $\nabla E$ is then given by,

$$\nabla E = \frac{1}{m} \sum_{e=0}^{m} \frac{\sum_{i=0}^{n} \nabla E_{e_i}}{n}.$$   (3-6)

Given this gradient, the current model S can now be modified by a small positive constant $\delta_g$ in the direction of the negative gradient,

$$S_{new} = (1 - \delta_g \nabla E)S.$$   (3-7)

For error surfaces that can be roughly approximated as quadratic, we would expect the model recursion in equation (3-7) to converge to a good near-optimal solution. Figure 3-3 below illustrates the performance of pure gradient descent for the four types of initial models. From these plots, it is apparent that the gradient descent recursion very quickly gets stuck in a local minimum that is far from optimal; all four types of initial models caused gradient descent to fail within 2.75 seconds (all experiments in this thesis were run on a 700 MHz Pentium III running Linux). In other words, the error surface is decidedly non-quadratic in a global sense, and gradient descent represents at best only a partial training algorithm for this problem.

Figure 3-3: Gradient Descent vs. Different Types of Error (average pixel error vs. time, in seconds, for the close initial model and for rotation, scale and translation error)
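A sketch of this recursion follows (illustrative Python/NumPy, not the thesis code). Two simplifications are assumed and worth flagging: the analytic gradient of equation (3-5) is replaced here by a finite-difference estimate, and the update of equation (3-7) is read elementwise, so each parameter $s_{kl}$ is scaled by $(1 - \delta_g \, \partial E / \partial s_{kl})$.

import numpy as np

def numerical_gradient(error_fn, S, h=1e-6):
    # Central-difference approximation of dE/dS; s34 is pinned to 1 and excluded,
    # as in equation (3-5).  error_fn(S) returns the scalar criterion E of (3-1).
    grad = np.zeros_like(S)
    for idx in np.ndindex(*S.shape):
        if idx == (2, 3):
            continue
        dS = np.zeros_like(S)
        dS[idx] = h
        grad[idx] = (error_fn(S + dS) - error_fn(S - dS)) / (2.0 * h)
    return grad

def gradient_step(error_fn, S, delta_g=1e-3):
    # Equation (3-7), read elementwise: S_new = (1 - delta_g * grad E) * S.
    return (1.0 - delta_g * numerical_gradient(error_fn, S)) * S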

3.6 Model Perturbation

Another method for improving the model is stochastic model perturbation. In this approach, the current model S is first perturbed by a small delta $\delta_p$ along a random direction $\nabla D$,

$$S_{new} = (1 + \delta_p \nabla D)S.$$   (3-8)

The error for the new model $S_{new}$ is computed; if it represents an improvement, the current model becomes the new model; otherwise, the new model is discarded, and we simply try another random perturbation of the current model. This approach is much less sensitive to the local minima problem, since the random perturbations can effectively "jump out" of local minima; that is, the direction $\nabla D$ is not constrained to be in the negative gradient direction. Figure 3-4 illustrates training results for the four initial model types. Note that there is an immediate and significant improvement for all of the initial models, particularly for the cases of translation error, where training results in an order-of-magnitude improvement. In each of the cases, the bulk of the model improvement occurs in less than one minute. The models improve further over time, although at some point there is a trade off between model quality and computing time.

Figure 3-4: Stochastic Perturbation vs. Different Types of Error (average pixel error vs. time, in seconds, for (a) the close initial model, (b) rotation error, (c) scale error and (d) translation error)

One possible approach to speeding up improvement over time is to introduce an adaptive perturbation delta that grows when the current model is stuck for some period of time in a relatively isolated local minimum. Examples of convergence with an adaptive delta are shown in Figure 3-5. Not surprisingly, the results are very similar, but the adaptive delta does improve performance slightly, and, we expect, would increasingly improve performance if used over a longer period of time.
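The perturbation loop, including such an adaptive delta, can be sketched as follows (illustrative Python/NumPy; the growth factor, patience and step count are placeholder values, not the settings used in the experiments).

import numpy as np

def perturb(error_fn, S, steps=1000, delta_p=1e-3, grow=2.0, patience=50):
    # Stochastic model perturbation (equation (3-8)) with an adaptive delta:
    # a random direction D is drawn each step and the candidate model is kept
    # only if it lowers the error; after `patience` rejected steps in a row,
    # delta_p is grown so the search can jump out of an isolated local minimum.
    best_err, stuck = error_fn(S), 0
    for _ in range(steps):
        D = np.random.randn(*S.shape)
        D[2, 3] = 0.0                      # keep s34 fixed at 1
        S_new = (1.0 + delta_p * D) * S
        err = error_fn(S_new)
        if err < best_err:
            S, best_err, stuck = S_new, err, 0
        else:
            stuck += 1
            if stuck >= patience:
                delta_p *= grow
                stuck = 0
    return S, best_err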

Figure 3-5: Stochastic Perturbation with Adaptive Delta vs. Different Types of Error (average pixel error vs. time, in seconds, for (a) the close initial model, (b) rotation error, (c) scale error and (d) translation error)

3.7 Gradient-Perturbation Hybrid

Given the relative speed of convergence of gradient descent in localized near-quadratic neighborhoods, and the insensitivity of stochastic perturbation to local minima, a combination of gradient descent and model perturbation may well train faster than either approach by itself. In this hybrid approach, gradient descent quickly reaches a local minimum, at which point adaptive-delta model perturbation takes over to search for better regions in model parameter space. In other words, gradient descent and stochastic perturbation alternate in optimizing the camera model over time. Sample results for this approach are shown in Figure 3-6.
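A sketch of the alternation is shown below; it simply strings together the gradient_step and perturb sketches of the two previous sections (again illustrative Python, with placeholder round and step counts rather than the settings used in the experiments).

def hybrid_train(error_fn, S, rounds=20):
    # Alternate gradient descent and adaptive-delta perturbation (Section 3.7).
    for _ in range(rounds):
        err = error_fn(S)
        while True:                               # descend until a local minimum
            S_try = gradient_step(error_fn, S)
            err_try = error_fn(S_try)
            if err_try >= err:
                break
            S, err = S_try, err_try
        S, err = perturb(error_fn, S, steps=200)  # then search for a better region
    return S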

Figure 3-6: Gradient-Perturbation Hybrid vs. Different Types of Error (average pixel error vs. time, in seconds, for (a) the close initial model, (b) rotation error, (c) scale error and (d) translation error)

3.8 Performance Comparisons

Which approach is best for our purposes depends on a number of different criteria. Ideally, we would like a method that responds quickly, improves in the face of difficult models, and is able to continue to improve if given an open-ended amount of time. As we have already seen, and as we further detail in Tables 3-1, 3-2 and 3-3, each method improves the camera model the most within the first minute of training. For each approach the error then continues to decline, but the two techniques that make use of an adaptive delta show a continuing ability to improve over time, with the hybrid approach outperforming the others. A direct comparison is shown in Figures 3-7 and 3-8, which plot the difference in error between the different techniques over time. The final camera models trained by the hybrid method are depicted in Figure 3-9.

Table 3-1: Model Perturbation: Average Error Per Pixel

              Initial   20 Sec.   1 Min.   5 Min.   10 Min.   20 Min.
Close          3.9336    2.0053   1.9618   1.6327    1.5182    1.4246
Rotation      21.2163    8.8545   6.5318   6.2082    6.1391    5.8464
Scale         28.9091    6.9955   6.9518   6.8318    5.4664    5.2064
Translation   23.0482    1.7082   1.6143   1.3182    1.1909    1.1618

Table 3-2: Model Perturbation with Adaptive Delta: Average Error Per Pixel

              Initial   20 Sec.   1 Min.   5 Min.   10 Min.   20 Min.
Close          3.9336    2.0053   1.9691   1.64      1.5972    1.4127
Rotation      21.2163    9.138    6.8472   6.22      5.9091    5.2491
Scale         28.9091    7.0035   6.9515   5.8091    5.2945    4.5236
Translation   23.0482    1.7082   1.6145   1.3255    1.2227    1.19

Table 3-3: Gradient-Perturbation Hybrid: Average Error Per Pixel

              Initial   20 Sec.   1 Min.   5 Min.   10 Min.   20 Min.
Close          3.9336    2.0364   1.9173   1.5664    1.4736    1.3355
Rotation      21.2164   12.4909   8.0218   5.3855    5.1991    5.0564
Scale         28.9091   14.13     6.7918   5.7545    5.5655    4.6891
Translation   23.0482    1.6909   1.5627   1.4       1.3118    1.1964

Figure 3-7: Performance Comparison for the Translational and Scale Initial Models (difference in error, in pixels, vs. time in seconds for Perturbation - Perturbation with Delta, Perturbation - Hybrid, and Perturbation with Delta - Hybrid)

Figure 3-8: Performance Comparison for the Close and Rotational Initial Models (difference in error, in pixels, vs. time in seconds for Perturbation - Perturbation with Delta, Perturbation - Hybrid, and Perturbation with Delta - Hybrid)

Why does training virtually eliminate translation error while rotation and scale errors prove more difficult to correct? The answer lies in the camera model itself [see equation (2-8)]. Translation is parameterized by 3 of the 12 parameters in S, while rotation is parameterized by 9 parameters, which are additionally constrained by the orthonormality requirement for rotation matrices. Finally, scale affects all 12 parameters in S. Thus, to correct rotational errors, 9 model parameters must be changed simultaneously, while for scale errors, all 12 parameters must be changed. This adds complexity and size to the search space, resulting in a slower improvement and a greater likelihood of getting stuck in a remote local minimum.

Figure 3-9: Final Models Using the Gradient-Perturbation Hybrid Technique ((a) Close, (b) Rotation, (c) Scale, (d) Translation)

CHAPTER 4

STEREO CAMERAS

4.1 Introduction

We are interested in abstracting 3D information about the world through computer vision techniques. Therefore, we will, in general, require more than one non-coincident camera view for an area of interest, since pixels from a single-camera image do not correspond to exact 3D world coordinates, but rather to rays in space. If we can establish a feature correspondence between at least two camera views calibrated to the same world coordinate system, then we can extract the underlying 3D information for that feature by intersecting the rays corresponding to its pixel location in each image. Hence, in this chapter, we consider the problem of simultaneous calibration of multiple (stereo) fixed cameras to a unique world coordinate system. More specifically, we build on the results of the previous chapter by assuming that one camera in a multi-camera setup has already been trained towards a good model. The task that remains is to calibrate the additional camera(s). In the remainder of this chapter, we focus on the two-camera case, although our results generalize in a straightforward manner to more than two cameras.

Given a set of two cameras where one has already been trained, and the second has a rough initial calibration, we propose to improve the second calibration through an iterative process that involves using a "virtual calibration object." This calibration object is created by moving a physical object throughout the area of interest and tracking it from each camera view. The path that the object follows in the image (in pixels) and the corresponding 3D estimates constitute the training data at each step of the algorithm.

4.2 Related Work

Previous work in stereo camera calibration is relatively extensive, but varies based on the particular application of interest. Many of the previous methods use a known calibration pattern whose features are extracted from each image. Rander and Kanade [12] have a system of approximately 50 cameras arranged in a dome to observe a room-sized space. In order to calibrate these cameras, a large calibration object is built and then moved to several precise locations, in effect building a virtual calibration object that covers most of the room. Do [4] applied a neural network to acquire stereo calibration models and determined the approach to be adequate; however, he went on to show that high accuracy did not appear possible in a reasonable amount of training time. Horaud et al. [7] use rigid and constrained motions of a stereo rig to allow camera calibration. This approach, like most in the field, addresses small-baseline stereo rigs, which usually have a fixed, rigid geometry. Azarbayejani and Pentland [2] propose an approach that tracks an item of interest through images to obtain usable training data. This approach is similar to ours except that they use object tracking to completely establish a calibration rather than using it to modify an already existing calibration. The drawback of this technique is that scale is not recoverable, so absolute distance has no meaning. Chen et al. [3] offer a similar technique that uses structure-from-motion to obtain initial camera calibration models, and then tracks a virtual calibration object to iterate to better models. The initial calibrations are obtained sequentially, where each new calibration model depends on those previously derived. As such, error accumulates for each successive calibration, placing newer calibrations further away from an optimal solution. Our work has shown (e.g. Figures 3-4, 3-5 and 3-6) that this can have a significant impact on both the final calibration obtained and the amount of time required to obtain it.

4.3 Our Approach

In this research, our primary motivation is to make the calibration process as quick and easy as possible without sacrificing precision. We wish for a non-expert to be able to completely calibrate a system for operation by making part of the calibration process visual and automating the rest. Each camera obtains its initial calibration via a graphical calibration tool (as described in Appendix B). The calibration is then improved by having the cameras track an object of interest throughout the viewing area. Ideally, image capture and processing should be synchronized between cameras; however, in practice, such synchronization is difficult and expensive to achieve in hardware. Even for an asynchronous system, however, we can approximate synchronous operation through Kalman filtering and interpolation of time-stamped images, as long as the system clocks are synchronized between the image processing computers. This Kalman-filtered and interpolated trajectory then becomes the training data for improving our stereo camera models.

4.4 Training Data

In order to collect training data, we apply a modified version of Zapata's color-model-based tracking algorithm [20] to track an object of interest from multiple camera views in real time. The time-stamped pixel data of the object's centroid are passed through a first-order Kalman filter and then sent to a multi-camera data fusor. The data fusor accepts the time-stamped data streams, interpolating and synchronizing the multiple data streams at 30 Hz. Prior to training, the synchronized tracking data is balanced so that no single region of the image is dominated by a disproportionately large amount of data. Although training examples reported in this thesis are over fixed data sets, there is no algorithmic obstacle to training off streaming data in real time.
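The interpolation step can be sketched as follows (illustrative Python/NumPy; it uses plain linear interpolation and omits the first-order Kalman filter and the data balancing described above).

import numpy as np

def resample_track(timestamps, pixels, rate_hz=30.0):
    # timestamps: (n,) capture times in seconds (system clocks assumed synchronized
    # across the image processing computers); pixels: (n, 2) tracked centroids.
    # Returns the track resampled onto a common clock at rate_hz.
    t_common = np.arange(timestamps[0], timestamps[-1], 1.0 / rate_hz)
    x = np.interp(t_common, timestamps, pixels[:, 0])
    y = np.interp(t_common, timestamps, pixels[:, 1])
    return t_common, np.stack([x, y], axis=1)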

4.5 Model Improvement

The training data now consists of $n$ synchronized sets of $m$ pixel values from the multiple cameras. Let us consider the set of pixel values for the object at time $t$:

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}_t$$   (4-1)

where $(x_j, y_j)$ denotes the pixel location of the object for camera $j$. Given our current estimate of each camera model $S_j$, we can estimate the 3D world coordinate of the object at time $t$ by regressing the equations,

$$(s_{31,j}x_j - s_{11,j})x_t + (s_{32,j}x_j - s_{12,j})y_t + (s_{33,j}x_j - s_{13,j})z_t = s_{14,j} - s_{34,j}x_j$$
$$(s_{31,j}y_j - s_{21,j})x_t + (s_{32,j}y_j - s_{22,j})y_t + (s_{33,j}y_j - s_{23,j})z_t = s_{24,j} - s_{34,j}y_j$$   (4-2)

for $j \in \{1, 2, \ldots, m\}$, where $(x_t, y_t, z_t)$ denotes the estimated 3D world coordinate at time $t$.

For each training camera, we now have $n$ estimated correspondence pairs between the synchronized pixel values for that camera and their corresponding estimated 3D world coordinates. Given this data, we can now apply equation (2-13) to generate a new perspective projection matrix $S_{j,new}$ based on the estimated 3D tracking data. This process is repeated until the calibration reaches acceptable precision. Figure 4-1 shows the error over time for the above approach. There is initial improvement followed by rapid and consistent model degradation. This is not what was expected, but can be explained after looking a little closer at the procedure and how it operates.

The resulting calibration in Figure 4-1 gives an indication of the source of the problem. Recall from equation (2-8) that part of the projection matrix is the camera matrix K, which contains intrinsic parameters for the camera such as scale, skew and offset. These parameters are fixed in reality but are obviously being changed by this training process. This is happening due to the unconstrained nature of the training process: known incorrect (x, y, z) estimates are being used to train a model which is simply a linear least squares solution, and K, as a component of this model, is also being changed. We are seeking to use a good model to train a bad model by modifying its rotation and translation, not its intrinsic camera parameters. A property of the linear least squares estimate is that it distributes the error as evenly as possible between the two models. The even distribution of the error is not appropriate for us, since we know that the source of the error is the bad model. Weighting the parameters of the regression towards the training model, as an indication that it is the source of the error, might help. Figure 4-2 shows the error vs. time as a function of different weights applied to the training model. It shows that we can mitigate the effect on the K matrix of the training model with a weighted regression. A gain of ten shows better performance but is still unstable. A gain of 100 or 1000 shows much better stability and performance. Looking at the error over time for the different gains, it appears that a higher gain exhibits a similar response when viewed on a larger scale.
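One iteration of this weighted scheme can be sketched as follows (illustrative Python/NumPy, not the thesis code): each synchronized pixel pair is triangulated by regressing equation (4-2), with a gain multiplying the rows contributed by the trained camera, and the second camera's model is then re-fit from the estimated 3D track using the DLT sketch of Chapter 2 (estimate_projection_matrix). The gain value below is a placeholder, not a recommended setting.

import numpy as np

def triangulate(pixels, models, gains):
    # pixels: list of (xj, yj); models: list of 3x4 matrices S_j; gains: per-camera
    # weights.  Each camera contributes the two rows of equation (4-2); scaling a
    # camera's rows by a large gain weights the regression towards that model.
    A, b = [], []
    for (xj, yj), S, g in zip(pixels, models, gains):
        A.append(g * np.array([S[2, 0]*xj - S[0, 0], S[2, 1]*xj - S[0, 1], S[2, 2]*xj - S[0, 2]]))
        b.append(g * (S[0, 3] - S[2, 3]*xj))
        A.append(g * np.array([S[2, 0]*yj - S[1, 0], S[2, 1]*yj - S[1, 1], S[2, 2]*yj - S[1, 2]]))
        b.append(g * (S[1, 3] - S[2, 3]*yj))
    pt, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return pt

def improve_second_camera(track_trained, track_new, S_trained, S_new, gain=100.0):
    # Build the virtual calibration object from the synchronized tracks, then
    # re-fit the untrained camera's model from the estimated 3D points.
    world = [triangulate([p1, p2], [S_trained, S_new], [gain, 1.0])
             for p1, p2 in zip(track_trained, track_new)]
    return estimate_projection_matrix(np.asarray(world), np.asarray(track_new))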

Figure 4-1: Results for a Non-Weighted Model Improvement ((a) error in mm over time in seconds, (b) initial calibration, (c) best calibration, (d) final calibration)

Figure 4-2: Error Over Time for Various Gains on the Training Model (error in mm vs. time in seconds for gains of (a) 10, (b) 100, (c) 1000 and (d) 10000 on the training model)

Figure 4-3: The Long Term Performance Using Various Gains (error in mm vs. time in seconds for gains of (a) 1, (b) 10, (c) 100 and (d) 1000)

CHAPTER 5

FURTHER RESULTS AND DISCUSSION

5.1 Single Camera

Here, we present some further results for the algorithms developed in the previous two chapters. Below, we solve eight sample calibration problems, where the initial calibrations are obtained using our previously referenced graphical calibration tool. These are shown in Figures 5-1, 5-2, 5-3 and 5-4. Given these decent initial calibrations, our algorithms develop run-time quality models within two minutes from the start of training. Figures 5-5 and 5-6 show the system's performance in the face of poor initial calibrations. These figures show that even a poor initial calibration can be improved significantly, but not always to run-time quality. Overall, these results show that our single-camera calibration approach meets our goal — namely, making camera calibration a simpler, faster and easier process.

5.2 Stereo Cameras

Figure 5-7 and Figure 5-8 show four example calibrations obtained from two different camera angles. The initial calibrations are obtained using our graphical calibration tool. These calibrations and the associated graphs reveal the capabilities and shortcomings of our proposed stereo technique. While, in its present form, it may not yet be sufficiently robust for real-world application, it can certainly be made to be so with minor changes. As we discussed in Chapter 4, if we constrained modifications of the camera model to extrinsic parameters only, keeping the intrinsic parameters of the model fixed, we expect that the observed model drift would no longer occur.

Figure 5-1: Example Single Camera Calibrations (1 & 2) (initial and final calibrations for Examples 1 and 2, with error vs. time plots)

Figure 5-2: Example Single Camera Calibrations (3 & 4) (initial and final calibrations for Examples 3 and 4, with error vs. time plots)

Figure 5-3: Example Single Camera Calibrations (5 & 6) (initial and final calibrations for Examples 5 and 6, with error vs. time plots)

Figure 5-4: Example Single Camera Calibrations (7 & 8) (initial and final calibrations for Examples 7 and 8, with error vs. time plots)

Figure 5-5: Poor Initial Calibrations (1 & 2) (initial and final calibrations for Poor Calibrations 1 and 2, with error vs. time plots)

Figure 5-6: Poor Initial Calibrations (3 & 4) (initial and final calibrations for Poor Calibrations 3 and 4, with error vs. time plots)

Figure 5-7: Stereo Calibration Examples (1 & 2) (initial and final stereo calibrations 1 and 2, with error vs. time plots)

Figure 5-8: Stereo Calibration Examples (3 & 4) (initial and final stereo calibrations 3 and 4, with error vs. time plots)

APPENDIX A

VISUAL CALIBRATION EVALUATION

When applying machine learning to improve camera models, a well-behaved error measure is critical. With exact correspondence data (2D image coordinates paired with 3D world coordinates), such an error measure is easily specified. It is precisely this type of data that we want to avoid having to collect, however. Yet, without such data, it is difficult, if not impossible, to define a globally well-behaved error measure. As such, we choose the human eye as an appropriate means of evaluating different calibration models. To make this intuitive, we draw a grid defined in space onto the image using the current camera model, and decide visually if the system's error measure is accurately measuring the quality of a model. This grid is chosen to match some feature(s) in the scene to allow a person to visually determine the quality of a calibration. Figure A-2 shows the experimental area. In Figure A-2, the red arrows indicate the corner points of a one meter cube placed in the far corner of the room, while the green arrows show the world coordinate system, defined after considering the experimental area and how to keep initial data collection as simple as possible. Figure A-1 shows a sample grid drawn on an image. The grid uses $20\,\mathrm{cm}^2$ squares and should ideally be aligned with the floor and walls. Points marked with a red arrow in Figure A-2 should have the outside corner of the fifth box out or up lying on them. The windowsill is a straight edge that a human can use as a guide for the top edge of the grid.

Figure A-1: The Calibration Grid

Figure A-2: The Experimental Area (axes marked for increasing X, Y and Z)

APPENDIX B

GRAPHICAL CALIBRATION TOOL

In order to generate initial rough calibration models quickly and easily, we developed an intuitive graphical interface through which a user can manipulate and align a grid model of a scene for different rotations and translations of the camera with respect to the world. This interface, as shown in Figure B-1, was written in C for the X/Motif window system.

Given approximate intrinsic parameters for a camera, we can project a three-dimensional model of the world onto an image for any given rotation, translation and scale (effective focal length). Therefore, as the user adjusts parameters that control scale, rotation and translation, the grid model of the world is continuously redrawn to reflect the new effective pose of the camera. Once the user is happy with the alignment of the grid model to the world, he/she can then save the effective perspective projection matrix and use that as the initial camera calibration model.
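A sketch of the underlying projection-matrix composition is given below (illustrative Python/NumPy; the Euler-angle parameterization of the rotation and the function name are assumptions made for exposition, not necessarily how the X/Motif tool is organized internally). The grid is then redrawn by pushing its 3D vertices through the resulting S, as in equations (2-8) and (2-7).

import numpy as np

def compose_model(f, a, b, c, uo, vo, rx, ry, rz, t):
    # Build the effective 3x4 projection matrix S from the intrinsic matrix K of
    # equation (2-6) and a homogeneous transform T assembled from three rotation
    # angles (radians) and a translation vector t.
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx
    T[:3, 3] = t
    K = np.array([[-f * a, -f * b, -uo, 0.0],
                  [0.0, -f * c, -vo, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    return (K @ T)[:3, :]      # the effective perspective projection matrix S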


Figure B-1: The Graphical Calibration Tool

REFERENCES

[1] Y. I. Abdel-Aziz and H. M. Kahara, "Direct Linear Transformation from Comparator Coordinates into Object-Space Coordinates," Proc. American Society of Photogrammetry Symposium on Close Range Photogrammetry, pp. 1-18, 1971.

[2] A. Azarbayejani and A. Pentland, "Camera Self Calibration from One Point Correspondence," Perceptual Computing Technical Report #341, Massachusetts Institute of Technology Media Laboratory, 1995.

[3] X. Chen, J. Davis and P. Slusallek, "Wide Area Calibration Using Virtual Calibration Objects," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 520-7, 2000.

[4] Y. Do, "Application of Neural Networks for Stereo-Camera Calibration," Proc. Int. Joint Conf. on Neural Networks, vol. 4, pp. 2719-22, 1999.

[5] R. Frezza and C. Altafini, "Autonomous Landing by Computer Vision: An Application of Path Following in SE(3)," Proc. IEEE Conf. on Decision and Control, vol. 3, pp. 2527-32, 2000.

[6] I. Haritaoglu, D. Harwood and L. Davis, "W4: Who? When? Where? What? A Real Time System for Detecting and Tracking People," Proc. IEEE Int. Conf. on Face and Gesture Recognition, pp. 222-7, 1998.

[7] R. Horaud, G. Csurka and D. Demirdijian, "Stereo Calibration from Rigid Motions," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1446-52, 2000.

[8] S. Kamijo, Y. Matsushita, K. Ikeuchi and M. Sakauchi, "Traffic Monitoring and Accident Detection at Intersections," IEEE Trans. on Intelligent Transportation Systems, vol. 1, no. 2, pp. 108-18, 2000.

[9] F. Karbou and F. Karbou, "An Interval Approach to Handwriting Recognition," Proc. Conf. of the North American Fuzzy Information Processing Society, pp. 153-7, 2000.

[10] J.M. Lee, B.H. Kim, M.H. Lee, K. Son, M.C. Lee, J.W. Choi and S.H. Han, "Fine Active Calibration of Camera Position/Orientation Through Pattern Recognition," IEEE Int. Symposium on Industrial Electronics, vol. 2, pp. 657-62, 1998.

[11] P. Mendonca and R. Cipolla, "A Simple Technique for Self-Calibration," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 500-5, 1999.

[12] P. Rander, "A Multi-Camera Method for 3D Digitization of Dynamic, Real-World Events," CMU-RI-TR-98-12, Technical Report, The Robotics Institute, Carnegie Mellon University, 1998.

[13] G. Rigoll, S. Eickeler and S. Muller, "Person Tracking in Real-World Scenarios Using Statistical Methods," Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp. 342-7, 2000.

[14] H. Schneiderman, "A Statistical Approach to 3D Object Detection Applied to Faces and Cars," CMU-RI-TR-00-06, Ph.D. Thesis, The Robotics Institute, Carnegie Mellon University, 2000.

[15] R. Sharma, M. Zeller, V.I. Pavlovic, T.S. Huang, Z. Lo, S. Chu, Y. Zhao, J.C. Phillips and K. Schulten, "Speech/Gesture Interface to a Visual-Computing Environment," IEEE Computer Graphics and Applications, vol. 20, no. 2, pp. 29-37, 2000.

[16] R. Tsai, "A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses," IEEE Journal of Robotics and Automation, vol. RA-3, no. 4, pp. 323-44, 1987.

[17] H. Wang, W. Hsu, K. Guan and M. Lee, "An Effective Approach to Detect Lesions in Color Retinal Images," Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 181-6, 2000.

[18] J. Weng, P. Cohen and M. Herniou, "Camera Calibration with Distortion Models and Accuracy Evaluation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 14, no. 10, pp. 965-80, 1992.

[19] A.D. Worrall, G.D. Sullivan and K.D. Baker, "A Simple Intuitive Camera Calibration Tool for Natural Images," Proc. of the 5th British Machine Vision Conf., vol. 2, pp. 781-90, 1994.

[20] I. Zapata, "Detecting Humans in Video Sequences Using Color and Shape Information," M.S. Thesis, Dept. of Electrical and Computer Engineering, University of Florida, 2001.

[21] Z. Zhang, "A Flexible Technique for Camera Calibration," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1330-4, 2000.

[22] Z. Zhang, "Motion and Structure of Four Points from One Motion of a Stereo Rig with Unknown Extrinsic Parameters," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 12, pp. 1222-5, 1995.


BIOGRAPHICAL SKETCH

Scott Nichols was born in Miami, Florida, in 1969. A high school dropout, Scott decided to pursue an education and started community college full-time in 1994. He transferred as a junior to the University of Florida in 1996 and received both a Bachelor of Science degree in electrical engineering and a Bachelor of Science degree in computer engineering in August 1999. He has since worked as a research assistant at the Machine Intelligence Laboratory, working toward a Master of Science degree in electrical engineering.