An Introduction to the Kalman Filter

Greg Welch¹ and Gary Bishop²

TR 95-041
Department of Computer Science
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3175

Updated: Monday, July 24, 2006
Abstract

In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete-data linear filtering problem. Since that time, due in large part to advances in digital computing, the Kalman filter has been the subject of extensive research and application, particularly in the area of autonomous or assisted navigation.

The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown.

The purpose of this paper is to provide a practical introduction to the discrete Kalman filter. This introduction includes a description and some discussion of the basic discrete Kalman filter, a derivation, description and some discussion of the extended Kalman filter, and a relatively simple (tangible) example with real numbers & results.
1. [email protected], http://www.cs.unc.edu/~welch
2. [email protected], http://www.cs.unc.edu/~gb
Welch & Bishop, An Introduction to the Kalman Filter 2
UNC-Chapel Hill, TR 95-041, July 24, 2006
1 The Discrete Kalman Filter
In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete-data linear filtering problem [Kalman60]. Since that time, due in large part to advances in digital computing, the Kalman filter has been the subject of extensive research and application, particularly in the area of autonomous or assisted navigation. A very "friendly" introduction to the general idea of the Kalman filter can be found in Chapter 1 of [Maybeck79], while a more complete introductory discussion can be found in [Sorenson70], which also contains some interesting historical narrative. More extensive references include [Gelb74; Grewal93; Maybeck79; Lewis86; Brown92; Jacobs93].
The Process to be Estimated
The Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by the linear stochastic difference equation

$$x_k = A x_{k-1} + B u_{k-1} + w_{k-1},$$  (1.1)

with a measurement $z \in \mathbb{R}^m$ that is

$$z_k = H x_k + v_k.$$  (1.2)

The random variables $w_k$ and $v_k$ represent the process and measurement noise (respectively). They are assumed to be independent (of each other), white, and with normal probability distributions

$$p(w) \sim N(0, Q),$$  (1.3)

$$p(v) \sim N(0, R).$$  (1.4)

In practice, the process noise covariance $Q$ and measurement noise covariance $R$ matrices might change with each time step or measurement; however, here we assume they are constant.

The $n \times n$ matrix $A$ in the difference equation (1.1) relates the state at the previous time step $k-1$ to the state at the current step $k$, in the absence of either a driving function or process noise. Note that in practice $A$ might change with each time step, but here we assume it is constant. The $n \times l$ matrix $B$ relates the optional control input $u \in \mathbb{R}^l$ to the state $x$. The $m \times n$ matrix $H$ in the measurement equation (1.2) relates the state to the measurement $z_k$. In practice $H$ might change with each time step or measurement, but here we assume it is constant.

The Computational Origins of the Filter

We define $\hat{x}_k^- \in \mathbb{R}^n$ (note the "super minus") to be our a priori state estimate at step $k$ given knowledge of the process prior to step $k$, and $\hat{x}_k \in \mathbb{R}^n$ to be our a posteriori state estimate at step $k$ given measurement $z_k$. We can then define a priori and a posteriori estimate errors as

$$e_k^- \equiv x_k - \hat{x}_k^-, \quad \text{and} \quad e_k \equiv x_k - \hat{x}_k.$$
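The linear model in (1.1) and (1.2) can be simulated directly. The sketch below assumes an illustrative two-state system (position and velocity) with hypothetical values for $A$, $B$, $H$, $Q$, and $R$; none of these particular matrices come from the text:

```python
import numpy as np

# Hypothetical 2-state system (position, velocity); values are illustrative.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # n x n state transition
B = np.array([[0.0], [dt]])             # n x l control input model
H = np.array([[1.0, 0.0]])              # m x n measurement model
Q = 1e-4 * np.eye(2)                    # process noise covariance
R = np.array([[1e-2]])                  # measurement noise covariance

rng = np.random.default_rng(0)
x = np.zeros((2, 1))                    # true state
u = np.array([[1.0]])                   # constant control input

states, measurements = [], []
for _ in range(50):
    w = rng.multivariate_normal(np.zeros(2), Q).reshape(2, 1)
    x = A @ x + B @ u + w               # equation (1.1)
    v = rng.multivariate_normal(np.zeros(1), R).reshape(1, 1)
    z = H @ x + v                       # equation (1.2)
    states.append(x.copy())
    measurements.append(z.copy())
```

This is only the process being *estimated*; the filter itself, which sees only `measurements`, is developed below.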
The a priori estimate error covariance is then

$$P_k^- = E[e_k^- e_k^{-T}],$$  (1.5)

and the a posteriori estimate error covariance is

$$P_k = E[e_k e_k^T].$$  (1.6)

In deriving the equations for the Kalman filter, we begin with the goal of finding an equation that computes an a posteriori state estimate $\hat{x}_k$ as a linear combination of an a priori estimate $\hat{x}_k^-$ and a weighted difference between an actual measurement $z_k$ and a measurement prediction $H\hat{x}_k^-$, as shown below in (1.7). Some justification for (1.7) is given in "The Probabilistic Origins of the Filter" found below.

$$\hat{x}_k = \hat{x}_k^- + K(z_k - H\hat{x}_k^-)$$  (1.7)

The difference $(z_k - H\hat{x}_k^-)$ in (1.7) is called the measurement innovation, or the residual. The residual reflects the discrepancy between the predicted measurement $H\hat{x}_k^-$ and the actual measurement $z_k$. A residual of zero means that the two are in complete agreement.

The $n \times m$ matrix $K$ in (1.7) is chosen to be the gain or blending factor that minimizes the a posteriori error covariance (1.6). This minimization can be accomplished by first substituting (1.7) into the above definition for $e_k$, substituting that into (1.6), performing the indicated expectations, taking the derivative of the trace of the result with respect to $K$, setting that result equal to zero, and then solving for $K$. For more details see [Maybeck79; Brown92; Jacobs93]. One form of the resulting $K$ that minimizes (1.6) is given by¹

$$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1} = \frac{P_k^- H^T}{H P_k^- H^T + R}.$$  (1.8)

Looking at (1.8) we see that as the measurement error covariance $R$ approaches zero, the gain $K$ weights the residual more heavily. Specifically,

$$\lim_{R_k \to 0} K_k = H^{-1}.$$

On the other hand, as the a priori estimate error covariance $P_k^-$ approaches zero, the gain $K$ weights the residual less heavily. Specifically,

$$\lim_{P_k^- \to 0} K_k = 0.$$

1. All of the Kalman filter equations can be algebraically manipulated into several forms. Equation (1.8) represents the Kalman gain in one popular form.
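These two limits are easy to verify numerically. A small sketch, assuming a scalar state with $H = 1$ (chosen only for illustration):

```python
import numpy as np

H = np.array([[1.0]])

# As R -> 0, K_k -> H^{-1} = 1: the measurement is trusted completely.
P_minus = np.array([[1.0]])
K_meas = P_minus @ H.T @ np.linalg.inv(H @ P_minus @ H.T + np.array([[1e-12]]))

# As P_k^- -> 0, K_k -> 0: the prediction is trusted completely.
R = np.array([[1.0]])
P_tiny = np.array([[1e-12]])
K_pred = P_tiny @ H.T @ np.linalg.inv(H @ P_tiny @ H.T + R)
```

Here `K_meas` is essentially 1 and `K_pred` essentially 0, matching the two limits above.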
Another way of thinking about the weighting by $K$ is that as the measurement error covariance $R$ approaches zero, the actual measurement $z_k$ is "trusted" more and more, while the predicted measurement $H\hat{x}_k^-$ is trusted less and less. On the other hand, as the a priori estimate error covariance $P_k^-$ approaches zero the actual measurement $z_k$ is trusted less and less, while the predicted measurement $H\hat{x}_k^-$ is trusted more and more.

The Probabilistic Origins of the Filter

The justification for (1.7) is rooted in the probability of the a priori estimate $\hat{x}_k^-$ conditioned on all prior measurements $z_k$ (Bayes' rule). For now let it suffice to point out that the Kalman filter maintains the first two moments of the state distribution,

$$E[x_k] = \hat{x}_k,$$

$$E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T] = P_k.$$

The a posteriori state estimate (1.7) reflects the mean (the first moment) of the state distribution; it is normally distributed if the conditions of (1.3) and (1.4) are met. The a posteriori estimate error covariance (1.6) reflects the variance of the state distribution (the second non-central moment). In other words,

$$p(x_k \mid z_k) \sim N(E[x_k], E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]) = N(\hat{x}_k, P_k).$$

For more details on the probabilistic origins of the Kalman filter, see [Maybeck79; Brown92; Jacobs93].

The Discrete Kalman Filter Algorithm

We will begin this section with a broad overview, covering the "high-level" operation of one form of the discrete Kalman filter (see the previous footnote). After presenting this high-level view, we will narrow the focus to the specific equations and their use in this version of the filter.

The Kalman filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements. As such, the equations for the Kalman filter fall into two groups: time update equations and measurement update equations. The time update equations are responsible for projecting forward (in time) the current state and error covariance estimates to obtain the a priori estimates for the next time step. The measurement update equations are responsible for the feedback, i.e., for incorporating a new measurement into the a priori estimate to obtain an improved a posteriori estimate.

The time update equations can also be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. Indeed the final estimation algorithm resembles that of a predictor-corrector algorithm for solving numerical problems, as shown below in Figure 1-1.
Figure 1-1. The ongoing discrete Kalman filter cycle. The time update projects the current state estimate ahead in time. The measurement update adjusts the projected estimate by an actual measurement at that time.

The specific equations for the time and measurement updates are presented below in Table 1-1 and Table 1-2.

Table 1-1: Discrete Kalman filter time update equations.

$$\hat{x}_k^- = A\hat{x}_{k-1} + Bu_{k-1}$$  (1.9)

$$P_k^- = AP_{k-1}A^T + Q$$  (1.10)

Table 1-2: Discrete Kalman filter measurement update equations.

$$K_k = P_k^- H^T (HP_k^- H^T + R)^{-1}$$  (1.11)

$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$$  (1.12)

$$P_k = (I - K_k H)P_k^-$$  (1.13)

Again notice how the time update equations in Table 1-1 project the state and covariance estimates forward from time step $k-1$ to step $k$. $A$ and $B$ are from (1.1), while $Q$ is from (1.3). Initial conditions for the filter are discussed in the earlier references.

The first task during the measurement update is to compute the Kalman gain, $K_k$. Notice that the equation given here as (1.11) is the same as (1.8). The next step is to actually measure the process to obtain $z_k$, and then to generate an a posteriori state estimate by incorporating the measurement as in (1.12). Again (1.12) is simply (1.7) repeated here for completeness. The final step is to obtain an a posteriori error covariance estimate via (1.13).

After each time and measurement update pair, the process is repeated with the previous a posteriori estimates used to project or predict the new a priori estimates. This recursive nature is one of the very appealing features of the Kalman filter: it makes practical implementations much more feasible than (for example) an implementation of a Wiener filter [Brown92], which is designed to operate on all of the data directly for each estimate. The Kalman filter instead recursively conditions the current estimate on all of the past measurements. Figure 1-2 below offers a complete picture of the operation of the filter, combining the high-level diagram of Figure 1-1 with the equations from Table 1-1 and Table 1-2.
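Equations (1.9) through (1.13) translate almost line for line into code. A sketch in NumPy with generic matrices; the helper names are ours, not the paper's:

```python
import numpy as np

def time_update(x_hat, P, A, B, u, Q):
    """Project the state and covariance ahead: equations (1.9) and (1.10)."""
    x_minus = A @ x_hat + B @ u                                # (1.9)
    P_minus = A @ P @ A.T + Q                                  # (1.10)
    return x_minus, P_minus

def measurement_update(x_minus, P_minus, z, H, R):
    """Correct the projection with a measurement: equations (1.11)-(1.13)."""
    K = P_minus @ H.T @ np.linalg.inv(H @ P_minus @ H.T + R)   # (1.11)
    x_hat = x_minus + K @ (z - H @ x_minus)                    # (1.12)
    P = (np.eye(len(P_minus)) - K @ H) @ P_minus               # (1.13)
    return x_hat, P

# One scalar cycle as a sanity check: static state, unit prior covariance.
A = B = H = np.eye(1); u = np.zeros((1, 1))
Q = np.zeros((1, 1)); R = np.eye(1)
x_minus, P_minus = time_update(np.zeros((1, 1)), np.eye(1), A, B, u, Q)
x_hat, P = measurement_update(x_minus, P_minus, np.array([[1.0]]), H, R)
```

With equal prior and measurement uncertainty the gain is 0.5, so the posterior estimate lands halfway between the prediction (0) and the measurement (1), and the covariance halves.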
Filter Parameters and Tuning
In the actual implementation of the filter, the measurement noise covariance $R$ is usually measured prior to operation of the filter. Measuring the measurement error covariance $R$ is generally practical (possible) because we need to be able to measure the process anyway (while operating the filter), so we should generally be able to take some off-line sample measurements in order to determine the variance of the measurement noise.

The determination of the process noise covariance $Q$ is generally more difficult as we typically do not have the ability to directly observe the process we are estimating. Sometimes a relatively simple (poor) process model can produce acceptable results if one "injects" enough uncertainty into the process via the selection of $Q$. Certainly in this case one would hope that the process measurements are reliable.

In either case, whether or not we have a rational basis for choosing the parameters, oftentimes superior filter performance (statistically speaking) can be obtained by tuning the filter parameters $Q$ and $R$. The tuning is usually performed off-line, frequently with the help of another (distinct) Kalman filter in a process generally referred to as system identification.

Figure 1-2. A complete picture of the operation of the Kalman filter, combining the high-level diagram of Figure 1-1 with the equations from Table 1-1 and Table 1-2.

Time Update ("Predict"):
(1) Project the state ahead: $\hat{x}_k^- = A\hat{x}_{k-1} + Bu_{k-1}$
(2) Project the error covariance ahead: $P_k^- = AP_{k-1}A^T + Q$

Measurement Update ("Correct"):
(1) Compute the Kalman gain: $K_k = P_k^- H^T (HP_k^- H^T + R)^{-1}$
(2) Update estimate with measurement $z_k$: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$
(3) Update the error covariance: $P_k = (I - K_k H)P_k^-$

Initial estimates for $\hat{x}_{k-1}$ and $P_{k-1}$ feed the first time update.

In closing we note that under conditions where $Q$ and $R$ are in fact constant, both the estimation error covariance $P_k$ and the Kalman gain $K_k$ will stabilize quickly and then remain constant (see the filter update equations in Figure 1-2). If this is the case, these parameters can be pre-computed by either running the filter off-line, or for example by determining the steady-state value of $P_k$ as described in [Grewal93].
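That pre-computation can be sketched by simply iterating the covariance recursion off-line until it converges. The scalar values below are illustrative, not from the text:

```python
import numpy as np

A = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[1e-5]]); R = np.array([[1e-2]])

# Iterate the time/measurement covariance recursion until P_k stabilizes;
# the state estimate itself is not needed for this computation.
P = np.array([[1.0]])
for _ in range(1000):
    P_minus = A @ P @ A.T + Q                                  # (1.10)
    K = P_minus @ H.T @ np.linalg.inv(H @ P_minus @ H.T + R)   # (1.11)
    P = (np.eye(1) - K @ H) @ P_minus                          # (1.13)

# K and P now hold the steady-state gain and error covariance.
```

The converged `K` and `P` could then be hard-coded into a run-time filter, trading a little optimality during the transient for a much cheaper update loop.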
It is frequently the case however that the measurement error (in particular) does not remain constant. For example, when sighting beacons in our optoelectronic tracker ceiling panels, the noise in measurements of nearby beacons will be smaller than that in far-away beacons. Also, the process noise $Q$ is sometimes changed dynamically during filter operation (becoming $Q_k$) in order to adjust to different dynamics. For example, in the case of tracking the head of a user of a 3D virtual environment we might reduce the magnitude of $Q_k$ if the user seems to be moving slowly, and increase the magnitude if the dynamics start changing rapidly. In such cases $Q_k$ might be chosen to account for both uncertainty about the user's intentions and uncertainty in the model.
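One hypothetical way to realize such a schedule (this particular heuristic is our illustration, not from the text) is to pick $Q_k$ from the magnitude of recent residuals:

```python
import numpy as np

def adaptive_q(recent_residuals, q_slow=1e-5, q_fast=1e-2, threshold=0.5):
    """Illustrative Q_k schedule: inflate the process noise when the RMS of
    recent innovations suggests the dynamics are changing rapidly."""
    rms = float(np.sqrt(np.mean(np.square(recent_residuals))))
    return q_fast if rms > threshold else q_slow

q_calm = adaptive_q([0.01, -0.02, 0.015])   # small residuals -> small Q_k
q_busy = adaptive_q([1.2, -0.8, 0.9])       # large residuals -> large Q_k
```

The thresholds and noise levels here are arbitrary placeholders; in practice they would themselves be tuned, as discussed above.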
2 The Extended Kalman Filter (EKF)
The Process to be Estimated
As described above in section 1, the Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by a linear stochastic difference equation. But what happens if the process to be estimated and (or) the measurement relationship to the process is non-linear? Some of the most interesting and successful applications of Kalman filtering have been such situations. A Kalman filter that linearizes about the current mean and covariance is referred to as an extended Kalman filter or EKF.

In something akin to a Taylor series, we can linearize the estimation around the current estimate using the partial derivatives of the process and measurement functions to compute estimates even in the face of non-linear relationships. To do so, we must begin by modifying some of the material presented in section 1. Let us assume that our process again has a state vector $x \in \mathbb{R}^n$, but that the process is now governed by the non-linear stochastic difference equation

$$x_k = f(x_{k-1}, u_{k-1}, w_{k-1}),$$  (2.1)

with a measurement $z \in \mathbb{R}^m$ that is

$$z_k = h(x_k, v_k),$$  (2.2)

where the random variables $w_k$ and $v_k$ again represent the process and measurement noise as in (1.3) and (1.4). In this case the non-linear function $f$ in the difference equation (2.1) relates the state at the previous time step $k-1$ to the state at the current time step $k$. It includes as parameters any driving function $u_{k-1}$ and the zero-mean process noise $w_k$. The non-linear function $h$ in the measurement equation (2.2) relates the state $x_k$ to the measurement $z_k$.

In practice of course one does not know the individual values of the noise $w_k$ and $v_k$ at each time step. However, one can approximate the state and measurement vector without them as

$$\tilde{x}_k = f(\hat{x}_{k-1}, u_{k-1}, 0)$$  (2.3)

and

$$\tilde{z}_k = h(\tilde{x}_k, 0),$$  (2.4)

where $\hat{x}_k$ is some a posteriori estimate of the state (from a previous time step $k$).
It is important to note that a fundamental flaw of the EKF is that the distributions (or densities in the continuous case) of the various random variables are no longer normal after undergoing their respective nonlinear transformations. The EKF is simply an ad hoc state estimator that only approximates the optimality of Bayes' rule by linearization. Some interesting work has been done by Julier et al. in developing a variation to the EKF, using methods that preserve the normal distributions throughout the non-linear transformations [Julier96].
The Computational Origins of the Filter
To estimate a process with non-linear difference and measurement relationships, we begin by writing new governing equations that linearize an estimate about (2.3) and (2.4),

$$x_k \approx \tilde{x}_k + A(x_{k-1} - \hat{x}_{k-1}) + W w_{k-1},$$  (2.5)

$$z_k \approx \tilde{z}_k + H(x_k - \tilde{x}_k) + V v_k,$$  (2.6)

where

- $x_k$ and $z_k$ are the actual state and measurement vectors,
- $\tilde{x}_k$ and $\tilde{z}_k$ are the approximate state and measurement vectors from (2.3) and (2.4),
- $\hat{x}_k$ is an a posteriori estimate of the state at step $k$,
- the random variables $w_k$ and $v_k$ represent the process and measurement noise as in (1.3) and (1.4),
- $A$ is the Jacobian matrix of partial derivatives of $f$ with respect to $x$, that is

$$A_{[i,j]} = \frac{\partial f_{[i]}}{\partial x_{[j]}}(\hat{x}_{k-1}, u_{k-1}, 0),$$

- $W$ is the Jacobian matrix of partial derivatives of $f$ with respect to $w$,

$$W_{[i,j]} = \frac{\partial f_{[i]}}{\partial w_{[j]}}(\hat{x}_{k-1}, u_{k-1}, 0),$$

- $H$ is the Jacobian matrix of partial derivatives of $h$ with respect to $x$,

$$H_{[i,j]} = \frac{\partial h_{[i]}}{\partial x_{[j]}}(\tilde{x}_k, 0),$$

- $V$ is the Jacobian matrix of partial derivatives of $h$ with respect to $v$,

$$V_{[i,j]} = \frac{\partial h_{[i]}}{\partial v_{[j]}}(\tilde{x}_k, 0).$$

Note that for simplicity in the notation we do not use the time step subscript $k$ with the Jacobians $A$, $W$, $H$, and $V$, even though they are in fact different at each time step.
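When $f$ and $h$ are awkward to differentiate by hand, these Jacobians can also be approximated numerically. A central-difference sketch (this helper is our own, not part of the paper):

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # d func / d x_j via a symmetric difference quotient
        J[:, j] = (np.asarray(func(x + step)) - np.asarray(func(x - step))) / (2 * eps)
    return J

# Example: a hypothetical range-only measurement h(x) = sqrt(x0^2 + x1^2).
h = lambda x: np.array([np.hypot(x[0], x[1])])
H = numerical_jacobian(h, np.array([3.0, 4.0]))  # analytic Jacobian: [0.6, 0.8]
```

Analytic Jacobians are cheaper and exact where available; the finite-difference fallback trades a few extra evaluations of $f$ or $h$ per step for convenience.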
The complete set of EKF equations is shown below in Table 2-1 and Table 2-2. Note that we have substituted $\hat{x}_k^-$ for $\tilde{x}_k$ to remain consistent with the earlier "super minus" a priori notation, and that we now attach the subscript $k$ to the Jacobians $A$, $W$, $H$, and $V$, to reinforce the notion that they are different at (and therefore must be recomputed at) each time step.

Table 2-1: EKF time update equations.

$$\hat{x}_k^- = f(\hat{x}_{k-1}, u_{k-1}, 0)$$  (2.14)

$$P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T$$  (2.15)

Table 2-2: EKF measurement update equations.

$$K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}$$  (2.16)

$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - h(\hat{x}_k^-, 0))$$  (2.17)

$$P_k = (I - K_k H_k) P_k^-$$  (2.18)

As with the basic discrete Kalman filter, the time update equations in Table 2-1 project the state and covariance estimates from the previous time step $k-1$ to the current time step $k$. Again $f$ in (2.14) comes from (2.3), $A_k$ and $W_k$ are the process Jacobians at step $k$, and $Q_k$ is the process noise covariance (1.3) at step $k$.

As with the basic discrete Kalman filter, the measurement update equations in Table 2-2 correct the state and covariance estimates with the measurement $z_k$. Again $h$ in (2.17) comes from (2.4), $H_k$ and $V_k$ are the measurement Jacobians at step $k$, and $R_k$ is the measurement noise covariance (1.4) at step $k$. (Note we now subscript $R$, allowing it to change with each measurement.)

The basic operation of the EKF is the same as the linear discrete Kalman filter as shown in Figure 1-1. Figure 2-1 below offers a complete picture of the operation of the EKF, combining the high-level diagram of Figure 1-1 with the equations from Table 2-1 and Table 2-2.
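One EKF cycle following (2.14) through (2.18) can be sketched as below. The callables for $f$, $h$, and the four Jacobians are supplied by the user; the trivial linear functions at the end, which reduce the EKF to the linear filter, are only a sanity check:

```python
import numpy as np

def ekf_step(x_hat, P, u, z, f, h, A_fn, W_fn, H_fn, V_fn, Q, R):
    """One EKF cycle: time update (2.14)-(2.15), measurement update (2.16)-(2.18)."""
    # Time update
    x_minus = f(x_hat, u, 0.0)                                 # (2.14)
    Ak, Wk = A_fn(x_hat, u), W_fn(x_hat, u)
    P_minus = Ak @ P @ Ak.T + Wk @ Q @ Wk.T                    # (2.15)
    # Measurement update
    Hk, Vk = H_fn(x_minus), V_fn(x_minus)
    S = Hk @ P_minus @ Hk.T + Vk @ R @ Vk.T
    K = P_minus @ Hk.T @ np.linalg.inv(S)                      # (2.16)
    x_hat = x_minus + K @ (z - h(x_minus, 0.0))                # (2.17)
    P = (np.eye(len(P)) - K @ Hk) @ P_minus                    # (2.18)
    return x_hat, P

# Sanity check with linear f and h (the EKF reduces to the linear filter).
f = lambda x, u, w: x
h = lambda x, v: x
eye = lambda *args: np.eye(1)
x1, P1 = ekf_step(np.zeros((1, 1)), np.eye(1), None, np.array([[1.0]]),
                  f, h, eye, eye, eye, eye, np.zeros((1, 1)), np.eye(1))
```

With identity Jacobians and equal prior and measurement covariance, the result matches the linear scalar case: estimate 0.5, covariance 0.5.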
Figure 2-1. A complete picture of the operation of the extended Kalman filter, combining the high-level diagram of Figure 1-1 with the equations from Table 2-1 and Table 2-2.

Time Update ("Predict"):
(1) Project the state ahead: $\hat{x}_k^- = f(\hat{x}_{k-1}, u_{k-1}, 0)$
(2) Project the error covariance ahead: $P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T$

Measurement Update ("Correct"):
(1) Compute the Kalman gain: $K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}$
(2) Update estimate with measurement $z_k$: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - h(\hat{x}_k^-, 0))$
(3) Update the error covariance: $P_k = (I - K_k H_k) P_k^-$

Initial estimates for $\hat{x}_{k-1}$ and $P_{k-1}$ feed the first time update.

An important feature of the EKF is that the Jacobian $H_k$ in the equation for the Kalman gain $K_k$ serves to correctly propagate or "magnify" only the relevant component of the measurement information. For example, if there is not a one-to-one mapping between the measurement $z_k$ and the state via $h$, the Jacobian $H_k$ affects the Kalman gain so that it only magnifies the portion of the residual $z_k - h(\hat{x}_k^-, 0)$ that does affect the state. Of course if over all measurements there is not a one-to-one mapping between the measurement $z_k$ and the state via $h$, then as you might expect the filter will quickly diverge. In this case the process is unobservable.
3 A Kalman Filter in Action: Estimating a Random Constant
In the previous two sections we presented the basic form for the discrete Kalman filter, and the extended Kalman filter. To help in developing a better feel for the operation and capability of the filter, we present a very simple example here. Andrew Straw has made available a Python/SciPy implementation of this example at http://www.scipy.org/Cookbook/KalmanFiltering (valid link as of July 24, 2006).
The Process Model
In this simple example let us attempt to estimate a scalar random constant, a voltage for example. Let's assume that we have the ability to take measurements of the constant, but that the measurements are corrupted by a 0.1 volt RMS white measurement noise (e.g. our analog to digital converter is not very accurate). In this example, our process is governed by the linear difference equation

$$x_k = Ax_{k-1} + Bu_{k-1} + w_k = x_{k-1} + w_k,$$
with a measurement $z \in \mathbb{R}^1$ that is

$$z_k = Hx_k + v_k = x_k + v_k.$$

The state does not change from step to step, so $A = 1$. There is no control input, so $u = 0$. Our noisy measurement is of the state directly, so $H = 1$. (Notice that we dropped the subscript $k$ in several places because the respective parameters remain constant in our simple model.)

The Filter Equations and Parameters

Our time update equations are

$$\hat{x}_k^- = \hat{x}_{k-1},$$

$$P_k^- = P_{k-1} + Q,$$

and our measurement update equations are

$$K_k = P_k^-(P_k^- + R)^{-1} = \frac{P_k^-}{P_k^- + R},$$  (3.1)

$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - \hat{x}_k^-),$$

$$P_k = (1 - K_k)P_k^-.$$

Presuming a very small process variance, we let $Q = 10^{-5}$. (We could certainly let $Q = 0$, but assuming a small but non-zero value gives us more flexibility in "tuning" the filter as we will demonstrate below.) Let's assume that from experience we know that the true value of the random constant has a standard normal probability distribution, so we will "seed" our filter with the guess that the constant is 0. In other words, before starting we let $\hat{x}_{k-1} = 0$.

Similarly we need to choose an initial value for $P_{k-1}$, call it $P_0$. If we were absolutely certain that our initial state estimate $\hat{x}_0 = 0$ was correct, we would let $P_0 = 0$. However given the uncertainty in our initial estimate $\hat{x}_0$, choosing $P_0 = 0$ would cause the filter to initially and always believe $\hat{x}_k = 0$. As it turns out, the alternative choice is not critical. We could choose almost any $P_0 \neq 0$ and the filter would eventually converge. We'll start our filter with $P_0 = 1$.
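The scalar filter above fits in a few lines of code. This sketch regenerates its own measurements (so the noise realization differs from the figures in the text), using the stated parameters $Q = 10^{-5}$, $R = 0.01$, $\hat{x}_0 = 0$, $P_0 = 1$:

```python
import numpy as np

x_true = -0.37727
rng = np.random.default_rng(1)
z = x_true + 0.1 * rng.standard_normal(50)   # 50 noisy measurements

Q, R = 1e-5, 0.01
x_hat, P = 0.0, 1.0                          # initial estimates
estimates = []
for zk in z:
    # Time update
    x_minus = x_hat
    P_minus = P + Q
    # Measurement update
    K = P_minus / (P_minus + R)              # (3.1)
    x_hat = x_minus + K * (zk - x_minus)
    P = (1 - K) * P_minus
    estimates.append(x_hat)
```

After 50 iterations the estimate sits close to the true constant, and `P` has shrunk from 1 to a few times $10^{-4}$, consistent with the simulations described below.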
The Simulations
To begin with, we randomly chose a scalar constant $x = -0.37727$ (there is no "hat" on the $x$ because it represents the "truth"). We then simulated 50 distinct measurements $z_k$ that had error normally distributed around zero with a standard deviation of 0.1 (remember we presumed that the measurements are corrupted by a 0.1 volt RMS white measurement noise). We could have generated the individual measurements within the filter loop, but pre-generating the set of 50 measurements allowed me to run several simulations with the same exact measurements (i.e. same measurement noise) so that comparisons between simulations with different parameters would be more meaningful.

In the first simulation we fixed the measurement variance at $R = (0.1)^2 = 0.01$. Because this is the "true" measurement error variance, we would expect the "best" performance in terms of balancing responsiveness and estimate variance. This will become more evident in the second and third simulation. Figure 3-1 depicts the results of this first simulation. The true value of the random constant $x = -0.37727$ is given by the solid line, the noisy measurements by the cross marks, and the filter estimate by the remaining curve.

Figure 3-1. The first simulation: $R = (0.1)^2 = 0.01$. The true value of the random constant $x = -0.37727$ is given by the solid line, the noisy measurements by the cross marks, and the filter estimate by the remaining curve. (Axes: Voltage versus Iteration, 50 iterations.)

When considering the choice for $P_0$ above, we mentioned that the choice was not critical as long as $P_0 \neq 0$ because the filter would eventually converge. Below in Figure 3-2 we have plotted the value of $P_k^-$ versus the iteration. By the 50th iteration, it has settled from the initial (rough) choice of 1 to approximately 0.0002 (Volts²).
Figure 3-2. After 50 iterations, our initial (rough) error covariance choice of 1 has settled to about 0.0002 (Volts²). (Axes: (Voltage)² versus Iteration, 50 iterations.)

In section 1 under the topic "Filter Parameters and Tuning" we briefly discussed changing or "tuning" the parameters $Q$ and $R$ to obtain different filter performance. In Figure 3-3 and Figure 3-4 below we can see what happens when $R$ is increased or decreased by a factor of 100 respectively. In Figure 3-3 the filter was told that the measurement variance was 100 times greater (i.e. $R = 1$), so it was "slower" to believe the measurements.

Figure 3-3. Second simulation: $R = 1$. The filter is slower to respond to the measurements, resulting in reduced estimate variance. (Axes: Voltage versus Iteration, 50 iterations.)

In Figure 3-4 the filter was told that the measurement variance was 100 times smaller (i.e. $R = 0.0001$), so it was very "quick" to believe the noisy measurements.
Figure 3-4. Third simulation: $R = 0.0001$. The filter responds to measurements quickly, increasing the estimate variance. (Axes: Voltage versus Iteration, 50 iterations.)

While the estimation of a constant is relatively straight-forward, it clearly demonstrates the workings of the Kalman filter. In Figure 3-3 in particular the Kalman "filtering" is evident, as the estimate appears considerably smoother than the noisy measurements.
References
Brown92    Brown, R. G. and P. Y. C. Hwang. 1992. Introduction to Random Signals and Applied Kalman Filtering, Second Edition, John Wiley & Sons, Inc.

Gelb74    Gelb, A. 1974. Applied Optimal Estimation, MIT Press, Cambridge, MA.

Grewal93    Grewal, Mohinder S., and Angus P. Andrews. 1993. Kalman Filtering: Theory and Practice. Upper Saddle River, NJ USA, Prentice Hall.

Jacobs93    Jacobs, O. L. R. 1993. Introduction to Control Theory, 2nd Edition. Oxford University Press.

Julier96    Julier, Simon and Jeffrey Uhlmann. "A General Method of Approximating Nonlinear Transformations of Probability Distributions," Robotics Research Group, Department of Engineering Science, University of Oxford [cited 14 November 1995]. Available from http://www.robots.ox.ac.uk/~siju/work/publications/Unscented.zip.
    Also see: "A New Approach for Filtering Nonlinear Systems" by S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte, Proceedings of the 1995 American Control Conference, Seattle, Washington, pp. 1628-1632. Available from http://www.robots.ox.ac.uk/~siju/work/publications/ACC95_pr.zip.
    Also see Simon Julier's home page at http://www.robots.ox.ac.uk/~siju/.

Kalman60    Kalman, R. E. 1960. "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME, Journal of Basic Engineering, pp. 35-45 (March 1960).

Lewis86    Lewis, Richard. 1986. Optimal Estimation with an Introduction to Stochastic Control Theory, John Wiley & Sons, Inc.

Maybeck79    Maybeck, Peter S. 1979. Stochastic Models, Estimation, and Control, Volume 1, Academic Press, Inc.

Sorenson70    Sorenson, H. W. 1970. "Least-Squares estimation: from Gauss to Kalman," IEEE Spectrum, vol. 7, pp. 63-68, July 1970.
Figure 2. Target tracking by background differencing. The central person is tracked using all pixels whereas the two other persons are tracked using every second pixel.
3. The tracking system
In this section, we describe the theoretical aspects and
the details on the actual implementation of the core tracking
system.
3.1 Energy detection
Currently, targets can be detected by energy measurements based on background subtraction or intensity normalized color histograms. The background subtraction module computes a difference image $I_d$ from the current frame $I = (I_{red}, I_{green}, I_{blue})$ and the background image $B = (B_{red}, B_{green}, B_{blue})$:

$$I_d = \frac{1}{3}\left(|I_{red} - B_{red}| + |I_{green} - B_{green}| + |I_{blue} - B_{blue}|\right)$$
The background image $B$ is updated with each frame using a weighted averaging technique, with a strong weight applied to the previous background, and a small weight applied to the current image. This procedure constitutes a simple first order recursive filter along the time axis for each pixel. The background image is only updated for those pixels that do not belong to one of the target ROIs.

$$B_t(i, j) = \begin{cases} \alpha I_t(i, j) + (1 - \alpha)B_{t-1}(i, j), & (i, j) \in bg \\ B_{t-1}(i, j), & \text{else} \end{cases}$$  (1)
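Equation (1) is a per-pixel exponential moving average restricted to background pixels. A vectorized sketch in NumPy, where the alpha value, image size, and ROI mask are illustrative assumptions:

```python
import numpy as np

alpha = 0.05                                       # small weight on the new frame
rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(4, 4, 3))        # current image I_t
background = rng.uniform(0, 255, size=(4, 4, 3))   # previous background B_{t-1}

bg_mask = np.ones((4, 4), dtype=bool)              # pixels outside target ROIs
bg_mask[1:3, 1:3] = False                          # a target ROI is left untouched

# Equation (1): blend only where (i, j) is background; keep B_{t-1} elsewhere.
updated = np.where(bg_mask[..., None],
                   alpha * frame + (1 - alpha) * background,
                   background)
```

Excluding target ROIs from the update, as the text describes, keeps a slowly moving person from being absorbed into the background model.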
Figure 2 shows an example of target tracking by background subtraction. The right image represents the background difference image $I_d$ after processing of three ROIs. Three targets can be clearly identified. Notice that the center target appears as solid white, while the adjacent targets appear to be "hashed". This is the result of an optimization that allows the processing to be applied to every $N$th pixel. In this example, the two adjacent regions were processed with $N = 2$, while the center target was processed with $N = 1$. $N$ is determined dynamically during each cycle by the process supervisor.

The position and extent of a target are determined by the moments of the detected pixels in the difference image $I_d$ within the ROI. The center of gravity (or first moment) gives the position of a target. The covariance (or second moment) determines the spatial extent, and can be used to determine width, height, and slant of a target. These parameters also provide the target's search region in the next image.
Chrominance information can be used to provide probabilistic detection of targets. The intensity of each RGB color pixel within a ROI is normalized to separate chrominance from luminance:

r = R / (R + G + B),   g = G / (R + G + B)   (2)

These color components have the property of being robust to intensity variations [6].
The probability that a pixel takes on a particular color can be represented as a histogram of (r, g) values. The histogram h_T of chrominance values for a target T provides an estimate of the probability of a chrominance vector (r, g) given the target, p(r, g | T). The histogram of chrominance for all pixels, h_total, gives the global probability p(r, g) of encountering a chrominance among the pixels. The probability of a target is the number of pixels of the target divided by the total number of pixels. Putting these values into Bayes' rule shows that an estimate of the probability of the target for each pixel can be obtained by evaluating the ratio of the target histogram to the global histogram:

p(T | r, g) = p(r, g | T) p(T) / p(r, g)  ~  h_T(r, g) / h_total(r, g)   (3)
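A minimal sketch of the histogram-ratio estimate of Eq. (3), assuming chrominance values already normalized to [0, 1] and precomputed 2-D histograms; function and parameter names are hypothetical.

```python
import numpy as np

def ratio_probability_map(r, g, h_target, h_total, bins=32):
    """Approximate p(T | r, g) by the histogram ratio of Eq. (3).

    r, g: chrominance images with values in [0, 1].
    h_target, h_total: 2-D histograms over (r, g), `bins` bins per axis.
    Returns the probability map I_p, same shape as r.
    """
    ri = np.clip((r * bins).astype(int), 0, bins - 1)
    gi = np.clip((g * bins).astype(int), 0, bins - 1)
    denom = np.maximum(h_total[ri, gi], 1e-9)  # guard against empty global bins
    return np.clip(h_target[ri, gi] / denom, 0.0, 1.0)
```

The clipping to [0, 1] keeps the ratio interpretable as a probability even in sparsely populated histogram bins.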
For each image, a probability map I_p can be created by evaluating the ratio of histograms for each pixel in the image. Figure 3 shows an example of face detection using a ratio of chrominance histograms. The bottom image displays the probability map I_p. The probability map is only evaluated within the search region provided by the Kalman filter in order to increase processing speed.

A common problem in both background subtraction and histogram detection is spatial outliers. In order to increase the stability of target localization, we suppress the contribution of outliers using a method proposed by Schwerdt in [5]. With this method, the probability image I_p is multiplied by
Figure 3. Target detection by normalized color histogram.
a Gaussian weighting function centered at the predicted target position. This corresponds to filtering by a strong positional prior. The effect is that spatial outliers lose their influence on position and extent as a function of distance from the predicted Gaussian. In order to save computation time, this operation is performed only within the region of interest R of each target. Even for small regions of interest this operation stabilizes the estimated position and extent of targets.

I'_p(i, j) = { I_p(i, j) * G(mu, Sigma),   if (i, j) in R
             { 0,                          otherwise        (4)

where

G(x; mu, Sigma) = exp( -(1/2) (x - mu)^T Sigma^{-1} (x - mu) )   (5)

The center of gravity mu = [x-_t, y-_t]^T is the Kalman prediction of the target location. The spatial covariance Sigma reflects the size of the target as well as the growing uncertainty about the current target size and location. The same principle can be applied to the background difference I_d.
3.2 Tracking process
The tracking system is a form of Kalman filter [7]. The state vector for each target is composed of position and velocity. The current target state vector x_{t-1} is used to make a new prediction according to:

x-_t = Phi_t x_{t-1},   with   Phi_t = [ 1  dt ; 0  1 ]   (6)

where dt is the time difference between two iterations.

From the new position measurement z_t, the estimation update is carried out:

x_t = x-_t + K_t (z_t - H_t x-_t)   (7)
This relation is important for balancing the estimation between measurement and prediction with the Kalman gain K_t. The estimated precision is a diagonal covariance matrix

P-_t = diag( sigma^2_xx, sigma^2_yy, sigma^2_vxvx, sigma^2_vyvy )   (8)

and is predicted by:

P-_t = Phi_{t-1} P_{t-1} Phi^T_{t-1} + Q_{t-1}   (9)

where Q_{t-1} is the covariance matrix of the prediction error, which represents the growth of the uncertainty in the current target parameters.
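For concreteness, the predict/update cycle of Eqs. (6), (7), and (9) can be sketched as a textbook constant-velocity Kalman filter for one axis (the paper's full state stacks both axes, giving the 4x4 covariance of Eq. (8)). This is a generic sketch, not the authors' implementation; H reduces to [1 0] because only position is measured.

```python
import numpy as np

def kalman_predict(x, P, dt, Q):
    """Eqs. (6) and (9): constant-velocity prediction for one axis.

    x = [position, velocity]; P is its 2x2 covariance.
    """
    Phi = np.array([[1.0, dt], [0.0, 1.0]])
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, R):
    """Eq. (7): correct the prediction with a position measurement z."""
    H = np.array([[1.0, 0.0]])           # only position is observed
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain K_t
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(2) - K @ H) @ P_pred
    return x, P
```

With a very precise measurement (small R), the gain drives the corrected position toward z; with a noisy measurement, the prediction dominates, which is exactly the balancing role of K_t described above.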
3.3 The core modules
The tracking process has been implemented in the ImaLab environment [4]. This environment allows real-time processing of frames extracted from the video stream. The basic tracking system is composed of two modules:

• TargetObservation predicts for each target the position in the current frame by a Kalman filter and then computes its real position by background subtraction or color histogram detection.

• DetectionRegion detects new targets by analysing the energy (background differencing or color histogram) within several manually defined detection regions.

Figure 1 shows the system architecture. Both core modules can be instantiated to use either background differencing or color histograms. For the PETS 04 experiments, we use tracking based on background subtraction.
3.4 Target initialization module
Detection regions are image regions where new targets can appear. Restricting detection of new targets to such regions allows the system to reduce the overall computing time. As a side effect, the use of detection regions also provides a reduction in the number of spurious false detections
Figure 4. Initialisation of a new target. (The figure shows the background difference of a detection region, its one-dimensional energy detection histogram with the noise threshold and the analysis interval R = [Rmin, Rmax], and the analysis and moment computation yielding the initialised target.)
by avoiding detection in unlikely regions, but targets might be missed when the detection regions are not chosen appropriately.
For each scenario, a different set of detection regions is determined. Currently, these regions are selected by hand; an automatic selection algorithm appears feasible. New targets are initialized automatically by analysing the detection regions in each tracking cycle. This analysis is done in two steps. In the first step, the subregion occupied by the new target is determined by creating a one-dimensional histogram along the long axis of the detection region. The limits of the target subregion are characterized by an interval [Rmin, Rmax] over which the values of the one-dimensional histogram are above a noise threshold (see Figure 4). In the second step, the energy density within the specified subregion R is computed as

e_R = (1/|R|) Sum_{(i,j) in R} I_d(i, j)   (10)

where |R| is the number of pixels in R. A new target with mean mu_R and covariance Sigma_R is initialised when the measured energy density e_R exceeds a threshold. This approach has the advantage that targets can be detected independently of the size of the detection region.
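The two initialization steps can be sketched as follows; the helper names, the choice of projection axis, and the thresholds are hypothetical placeholders, not the paper's exact implementation.

```python
import numpy as np

def find_subregion(I_d, noise_threshold):
    """Step 1 of Section 3.4: project I_d onto the long axis of the
    detection region and keep the interval above the noise threshold."""
    profile = I_d.sum(axis=0)                 # 1-D histogram along the long axis
    above = np.where(profile > noise_threshold)[0]
    if above.size == 0:
        return None                           # no candidate target
    return int(above[0]), int(above[-1])      # (Rmin, Rmax)

def energy_density(I_d, rmin, rmax):
    """Eq. (10): mean energy of the difference image inside subregion R."""
    region = I_d[:, rmin:rmax + 1]
    return region.sum() / region.size
```

Because Eq. (10) normalizes by |R|, the decision threshold on e_R does not depend on how large the detection region was drawn, which is the size-independence property noted above.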
3.5 Tracking module
The module TargetObservation implements the target tracking. The supervisor maintains a list of current targets. Targets in this list are sequentially updated by the supervisor, depending on the feedback of the modules. For each target, a new position is predicted by a first-order Kalman filter. This prediction determines a search region within which the target is expected to be found. A target is found by applying the specified detection operation to the search region. If the average target detection energy is above a threshold, the target observation vector is updated. This module depends on the following parameters:
• Detection energy threshold: the average energy threshold that validates the existence of a target.

• Sensitivity threshold: this parameter thresholds the energy image (I_d in the case of background differencing, or I_p in the case of chrominance detection). If the value is 0, the raw data of the energy image is used.

• Target area threshold: a step-size parameter N enables faster processing for large targets by processing only 1 out of N pixels. When the target surface is larger than a threshold, N is increased. This temporary measure will be replaced by more sophisticated control logic based on computing time. Figure 2 illustrates the use of this parameter.
3.6 Split and merge of targets
In real-world video sequences, especially in the domain of video surveillance, it often happens that targets come together, move in the same direction for a while, and then separate. It can also occur that close targets occlude each other. In that case only one target is visible at a time, but both targets are still present in the scene. To solve such problems, we use a method that allows merging and splitting of targets. This method makes it possible to keep track of occluded targets and also to model the common behavior of a target group. The PETS 04 sequences contain many examples of such group behavior.

A straightforward approach is applied for the detection of target splits and merges. Merging of two targets that are within a certain distance from each other is detected by evaluating the following inequality:

c / (a + b) < threshold   (11)

where c is the distance between the centers of gravity of the two targets, and a and b are the distances between each center of gravity and the boundary of the ellipse defined by the covariance of the respective target (see Figure 5, left). In our implementation we use a threshold of 0.8.
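The merge test of Eq. (11) can be sketched as below. Here a and b are taken along the line joining the two centers of gravity, at one standard deviation of each covariance ellipse; the paper does not state the ellipse scale it uses, so treat that scale (and the function names) as assumptions.

```python
import numpy as np

def ellipse_radius_toward(cov, direction):
    """Distance from the centre to the 1-sigma covariance ellipse
    along the given direction (points x with x^T cov^{-1} x = 1)."""
    d = direction / np.linalg.norm(direction)
    return 1.0 / np.sqrt(d @ np.linalg.inv(cov) @ d)

def should_merge(mu1, cov1, mu2, cov2, threshold=0.8):
    """Eq. (11): merge when the centre distance c is small relative to
    the ellipse extents a and b along the joining line."""
    diff = np.asarray(mu2, dtype=float) - np.asarray(mu1, dtype=float)
    c = np.linalg.norm(diff)
    a = ellipse_radius_toward(cov1, diff)
    b = ellipse_radius_toward(cov2, -diff)
    return bool(c / (a + b) < threshold)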
Figure 5. (Left) Merging of targets as a function of the targets' relative position and size. (Right) Splitting detectors are defined proportionally to the target size.
Splitting of targets is implemented by placing detection regions around the target as shown in Figure 5 (right). The size and location of the split detection regions are proportional to the target size. Within each split detection region, the average energy is evaluated in the same way as in the target initialisation module. A new target is created if this average energy is greater than the threshold u = energy density * split coefficient. The parameter split coefficient controls the constraints for target splitting.
4. Automatic parameter adaptation

Target initialization and tracking by background differencing or histogram detection requires a certain number of parameters, as mentioned in the previous sections (detection energy threshold, sensitivity, density energy threshold, alpha, split coefficient, area threshold).
In order to preserve the re-usability of the tracking module and guarantee good performance in a wide range of different tracking scenarios, it is crucial to have a good parameter setting at hand. Up to now, parameter adaptation has been done manually. This is a very tedious job which may need frequent repetition when the scene setup changes.

In this section we propose a first version of a module that automatically finds a good parameter setting. As a first step, we consider the tracker as a classical system with control parameters and noise perturbations (see Figure 6). The system produces an output y(t) that depends on the input r(t), some noise d(t), and a set of parameters that affect the control module K [1].
4.1 Algorithm
First we need to explore the effect of particular parameters on the system. The goal of this step is to identify the important parameters and their relations, and possibly to discard
Figure 6. A controlled system (control module K with parameters P; input r(t), feedback f(y(t)), noise d(t), and output y(t)).
parameters with little effect. For a sequence for which the ground truth r(t) is available, we vary the parameters systematically and measure the output of the system, y_{Pk}(t), for a particular parameter setting P_k in the parameter space P. y_{Pk}(t) and r(t) are split into m sections according to m intervals s_i = [t_{i-1}, t_i], i = 1, ..., m.

For each parameter setting P_k and each interval, r(s_i) and y_{Pk}(s_i) are known. From these input/output correspondences we can compute the transfer function f(y_{Pk}(s_i)) = r(s_i) by a least-squares approximation. The overall error of the transfer function on the sequence is computed as follows:

eps = ||r(t) - f(y_{Pk}(t))|| = Sum_{s_i} ||r(s_i) - f(y_{Pk}(s_i))||   (12)

For each P_k, we determine the transfer function that minimizes this error. The average error (eps_avg = eps/n, with n the number of frames) is used to characterize the performance of the system with the current parameter setting. This is a very coarse approximation but, as we will see, the average error evolves smoothly over the parameter space.

We consider polynomial transfer functions of first and second order (linear and quadratic) of the following form:

r(t_k) = A_0 y(t_k) + b   (13)
r(t_k) = A_2 (y(t_k))^2 + A_1 y(t_k) + b   (14)

with transfer matrices A_i and offset b.
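A sketch of fitting the linear transfer function of Eq. (13) by least squares and evaluating the average error of Eq. (12); for simplicity the sections s_i are taken to be single frames, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_transfer(y, r):
    """Least-squares fit of r(t) ~ A0 y(t) + b (Eq. 13).

    y, r: arrays of shape (n_frames, dim). Returns (A0, b).
    """
    n = y.shape[0]
    Y = np.hstack([y, np.ones((n, 1))])       # extra column estimates the offset b
    theta, *_ = np.linalg.lstsq(Y, r, rcond=None)
    return theta[:-1].T, theta[-1]

def average_error(y, r, A0, b):
    """Eq. (12) with one-frame sections, divided by the number of frames."""
    pred = y @ A0.T + b
    return np.linalg.norm(r - pred, axis=1).sum() / y.shape[0]
```

A transfer matrix A0 close to identity with a small average error, as reported below for Walk1.mpeg and Walk3.mpeg, indicates that the tracker output already follows the ground truth up to a small affine correction.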
The measurements have either two or four dimensions. In the two-dimensional case, the measurements contain the coordinates of the center of gravity of the target. The four-dimensional case also contains the height and width of the target bounding box. We could have considered an additional dimension for the target slant, but we discarded this possibility due to the discontinuity of the slant measurement at 180 degrees.

The linear transfer function estimated from the data of the sequences Walk1.mpeg and Walk3.mpeg produces good results. We observe a transfer matrix A_0 that is close to identity. The quadratic transfer function has a smaller eps, but the transfer matrix A_2 has very low values and is therefore
not significant. This means that the linear transfer function
is a good model for our system.
4.2 Exploration of the parameter space
The average error of the best transfer function evaluated on the entire test sequence is used to characterize the performance of the controlled system. The parameter space can be very high-dimensional, so exploring the entire space can be time consuming. To cope with this problem we assume that some parameters evolve independently of each other. This allows us to restrict the search for an optimal parameter value to a low-dimensional hyperspace. In the experiments we use the following default values for the constant parameters of the hyperspace: detection energy = 10, density = 15, sensitivity = 20, split coefficient = 2.0, alpha = 0.001, area threshold = 1500. We experiment on the sequence Walk1.mpeg, except for Figure 7.
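Under this independence assumption, the exploration reduces to varying one parameter (or one pair) at a time around the default setting. A minimal sketch, with the hypothetical `evaluate` callback standing in for a full tracking run plus transfer-function fit over the sequence:

```python
def explore_axes(evaluate, defaults, grids):
    """Coordinate-wise exploration of the parameter space (Section 4.2).

    evaluate: callable mapping a parameter dict to the average error.
    defaults: dict of default parameter values.
    grids: dict mapping a parameter name to its candidate values.
    Returns, per axis, the (value, error) pair with the lowest error.
    """
    results = {}
    for name, values in grids.items():
        errors = []
        for v in values:
            params = dict(defaults)   # all other parameters keep defaults
            params[name] = v
            errors.append((v, evaluate(params)))
        results[name] = min(errors, key=lambda t: t[1])
    return results
```

This coordinate-wise search is cheap (the cost grows linearly, not exponentially, in the number of parameters) but only finds the joint optimum if the independence assumption actually holds.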
Figure 7 shows the surface produced by varying the detection energy threshold and the sensitivity threshold simultaneously. Figure 8 shows the error evolution when varying the split coefficient and the sensitivity. The optimal parameter value is different for each sequence; the parameters are therefore sequence dependent. In all cases the error evolves smoothly. This means that we are dealing with a controlled system and not with a system following chaotic or arbitrary rules.

Figure 9 (left) provides evidence for setting alpha = 0.1. Figure 9 (right) shows that the density threshold has no effect on the average error. This parameter is therefore a candidate that need not be considered in further exploration of the parameter space.

Figure 10 shows the effect of the parameter area threshold. This parameter causes only one pixel out of two to be processed for targets that are larger than area threshold pixels. This explains the increase of the error for small thresholds and the speed-up in processing time. It is interesting to see that the error increase is very small: less than 4% for a 25% gain in processing time. Our method allows us to identify this kind of relation between parameters.
4.3 Summary
We have shown a method to evaluate the performance of a system controlled by a set of parameters. The average error is used to understand the effect of single parameters and parameter pairs. This method allows us to verify that our tracking system has a controlled behavior. We identified that the density parameter has no effect on the error performance and can be removed from the parameter space. The area threshold parameter influences the overall processing time and the average error. With our method, we found that the increase in error is small with respect to the gain in
Figure 11. Modules for face and hand observation are plugged into the tracking system.
processing time. This is an interesting result which a dynamic control system should take into account. The experiments show that the optimal parameter setting estimated from one sequence scenario need not be optimal for another sequence. This needs to be explored by evaluating more data sequences. Another important point is that the approach requires ground-truth labelling. This means that our method cannot find the optimal parameters when the ground truth is unknown. The likelihood may be appropriate in some cases as a replacement for the ground truth, but the results will be inferior since the likelihood increases the noise perturbations.
5. Tracking : optional higher level modules
In this section we demonstrate the flexibility of our tracking system. The proposed architecture enables easy plug-in of higher-level modules, which allows the system to solve quite different tasks.

5.1. Face and hand tracking for human computer interaction

Modules for face and hand tracking use color histogram detection. Faces and hands are initialised automatically with respect to a body detected by background differencing. This means that the same tracking principle is applied to faces and hands at a higher level. An example is shown in Figure 11.
5.2. Eye detection for head pose estimation
This module detects facial features by evaluating the response to receptive field clusters [2]. The method detects facial features robustly with respect to scale, lighting variation, person, and
Figure 7. Evolution of the average error over the detection energy threshold and the sensitivity threshold (sequences Walk1.mpeg (left) and Walk3.mpeg (right), with default values for the free parameters).
Figure 8. Evolution of the average error over the split coefficient and the sensitivity threshold.
Figure 9. Evolution with varying alpha (left) and varying density (right). We can identify an optimal value for alpha (alpha = 0.1), but the error is constant for all density values.
Figure 10. Evolution with varying area threshold (left). The error increases slightly with decreasing area threshold. The area threshold has a significant impact on the processing time (right).
Figure 12. Real-time head pose estimation.
head pose. The tracking system provides the precise face location, which allows the combined system to run in real time. Figure 12 shows an example of the eye tracking module.
5.3. Agent identification
The agent identification module provides an association between individual features and targets tracked by background subtraction. Identification of each tracked blob is carried out by elastic matching of labelled graphs, where the labels are receptive field responses [2]. The degree of correspondence between the model and the observations extracted from the ROI provided by the tracking system is computed by evaluating a cost function. The cost function is a weighted sum of the spatial similarity and the appearance similarity [3, 8]. Figure 13 shows a successful identity recovery after a target occlusion. The system currently processes 10 frames/s.
Figure 13. Example of a split and merge event with successful identity recovery. (Panel matching costs: before merge, pers1 165 / pers2 186; merge, pers1 337 / pers2 492; occlusion, pers1 488 / pers2 1470; split, pers1 2073 / pers2 735.)
Figure 14. True versus false detections for individuals.
6. Tracking performance of the core modules
In order to evaluate the performance of our tracking system, we have tested the core modules on 16 of the PETS 04 sequences (17182 frames containing 50404 targets marked by bounding boxes)¹. In this section we give a brief summary of the tracking results.

Figure 14 shows the receiver operating characteristic for all 16 sequences. Our system has a low false detection probability of 9.8% and a true detection probability of 53.6%. This translates to a recall of 53.6% (27030 correct positives out of 50404 total positives) and a precision of 90.2% (27030 correct positives out of 29974 detections). The reason for the relatively low recall is that the ground-truth labeling takes into account targets that are already present in the scene and targets that pass on the gallery at the first floor. Our tracking system relies on the method of detection regions for target initialization. Both types of targets are not detected by our tracking system, because they are never initialized.
The tracking results are evaluated with respect to other parameters such as errors in detected position, size, and orientation, and the time lag of entry and exit. The performance of our system with respect to these parameters is summarized in Table 1. Our system performs very well in position detection, orientation estimation, and exit time lag. The bounding box produced by the tracking system is significantly smaller than the bounding box of the ground truth. This is due to the fact that the tracking system estimates the bounding box from the covariance of the pixels with high energy, whereas
¹ The sequences as well as the statistics are available at the CAVIAR home page http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm
Average error in    average value       maximum value
Position            6 - 7 pixels        13 - 15 pixels
Size                -160% to -240%      -240%
Orientation         +/-0.5%             +/-30%
Entry time lag      50 to 80 frames     100 to 160 frames
Exit time lag       1 frame             1 frame

Table 1. Evaluation of the tracking results with respect to measurement precision.
a human draws a bounding box that includes all pixels that belong to the target. The tracking system can produce a similar output by computing the connected components of the energy image, but this is a costly operation. In the case where the connected-components bounding box is used for position computation, the position becomes more unstable. For this reason we decided to use the first and second moments of the energy pixels for target specification. The entry time lag is a problem related to the detection regions: a human observer marks a new target as soon as it appears, whereas the detection region requires that the observed energy be above the energy density threshold.
7. Conclusion
We have presented an architecture for a tracking system that consists of a central supervisor, a tracking module based on background subtraction or color histogram detection combined with Kalman filtering, and an automatic target initialization module restricted to detection regions. These three modules form the core system. The central supervisor architecture has the advantage that additional modules can be plugged in very easily. New tracking systems that solve different tasks can be created in this way.

The tracking system depends on a number of parameters that influence the performance of the system. Therefore, finding a good parameter setting for a particular scenario is essential. We have proposed to treat the tracking system as a classical controlled system, and we identified a method to evaluate the quality of a particular parameter setting. The preliminary experiments show that small variations of the parameters produce smooth changes of the average error function. Using this behavior, we can improve the performance of our tracking system by finding a good parameter setting using gradient descent in the parameter space. Unfortunately, the experiments on automatic parameter adaptation are preliminary and could not yet be integrated into the performance evaluation of the system.
References
[1] P. de Larminat. Automatique : commande des systèmes linéaires. Hermès Science Publications, 2nd edition, 1996.

[2] D. Hall and J.L. Crowley. Détection du visage par caractéristiques génériques calculées à partir des images de luminance. In Congrès Francophone de Reconnaissance des Formes et Intelligence Artificielle, pages 1365-1373, Toulouse, France, 2004.

[3] M. Lades, J.C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300-311, March 1993.

[4] A. Lux. The ImaLab method for vision systems. In International Conference on Vision Systems, pages 319-327, Graz, Austria, April 2003.

[5] K. Schwerdt and J.L. Crowley. Robust face tracking using color. In International Conference on Automatic Face and Gesture Recognition, pages 90-95, Grenoble, France, March 2000.

[6] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.

[7] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical Report TR 95-041, University of North Carolina at Chapel Hill, 2004.

[8] L. Wiskott, J.M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. In Intelligent Biometric Techniques in Fingerprint and Face Recognition, chapter 11, pages 355-396. CRC Press, 1999.
Automatic parameter regulation for a tracking
system with an auto-critical function
Daniela Hall
INRIA Rhône-Alpes, St. Ismier, France
Email: [email protected]
Abstract— In this article we propose an architecture for a tracking system that can judge its own performance through an auto-critical function. Performance drops can be detected, which triggers an automatic parameter regulation module. This regulation module is an expert system that searches for a parameter setting with better performance and returns it to the tracking system. With such an architecture, a robust tracking system can be implemented which automatically adapts its parameters in case of changes in the environmental conditions. This article opens a way toward self-adaptive systems in detection and recognition.
I. INTRODUCTION
Parameter tuning of complex systems is often performed manually. A tracking system requires different parameter settings as a function of the environmental conditions and the type of the tracked targets. Each change in conditions requires a parameter update. There is a great need for an expert system that performs the parameter regulation automatically. This article proposes such an approach and applies it to a real-time tracking system. The proposed architecture for auto-regulation is valid for any complex system whose performance depends on a set of parameters.
Automatic regulation of parameters can significantly enhance the performance of systems for detection and recognition. Surprisingly little previous work has been published in this domain [5]. A first step towards performance optimization is the ability of the system to be auto-critical: the system must be able to judge its own performance. A performance drop, detected with this kind of auto-critical function, can trigger an independent module for auto-regulation. The task of the regulation module is to propose a set of parameters that improves system performance.

The auto-critical function detects a performance drop when the measurements diverge from a scene reference model. In this case the automatic regulation module is triggered to provide a parameter setting with better performance. Section II explains the architecture of the tracking system and of the regulation cycle. Section III explains the details of the auto-critical function, the generation of the scene reference model, and the measure used for performance evaluation. In Section IV we explain the use of the regulation module. We then show experiments that demonstrate the utility of our approach. We finish with conclusions and a critical evaluation.
Fig. 1. Architecture of the tracking and detection system controlled by a supervisor (detection-region list, background detector, target initialisation, and the prediction and estimation loop over the target list).
II. SYSTEM ARCHITECTURE
In order to demonstrate the utility of our approach for auto-regulation of parameters, we choose a detection and tracking system as previously described in [2]. Figure 1 shows the architecture of the system. The tracking system is composed of a central supervisor, a target initialisation module, and a tracking module. This modular architecture is flexible, such that competing algorithms for detection can be integrated. For our experiments we use a detection module based on adaptive background differencing with manually defined detection regions. Robust tracking is achieved by a first-order Kalman filter that propagates the target positions in time and updates them with measurements from the detection module.
The tracking system depends on a number of parameters such as the detection energy threshold, the sensitivity for detection, the energy density threshold to avoid false detections due to noise, a temporal parameter for background adaptation, and a split coefficient to enable merging and splitting of targets (i.e., when two people meet they merge into a single group target; a split event is observed when a person separates from the group).

Figure 2 shows the integration of the parameter regulation module and the auto-critical function. The auto-critical function evaluates the current system performance and decides whether parameter regulation is necessary. If this is the case, the tracker supervisor sends a request to the regulation module, providing its current parameter setting and current performance as well as other data needed by the regulation module. When the regulation module has found a better parameter setting (or after a maximum number of iterations), it stops processing and sends the result to the system supervisor, which
Fig. 2. Integration of the regulation module in a complex system.
updates the parameters and reinitialises the modules.
It is difficult to predict the performance gain of the auto-regulation. Since the module can test only a discrete number of parameter settings, there is no guarantee that the globally optimal parameter setting is found. For this reason, the goal of the regulation system is to find a parameter setting that increases system performance. Subsequent calls of the regulation module then allow a constantly increasing system performance to be obtained. The modular architecture enables the use of different methods and the application of the regulation to different kinds of systems.
III. THE AUTO-CRITICAL FUNCTION
The task of the auto-critical function is to provide a fast
estimation of the current tracking performance. A performance
evaluation function requires a reliable measure to estimate the
current system performance. The measure used (described in
Section III-B) is based on a probabilistic model of the scene,
which allows the likelihood of measurements to be estimated.
The probabilistic scene model is generated by a learning
approach. Section III-C explains how the quality of a model
can be measured. Section III-D discusses different clustering
schemes.
A. Learning a probabilistic model of a scene
A model of a scene describes what usually happens in the
scene. It describes a set of target positions and sizes, but also
a set of paths of the targets within the scene. The model
is computed from previously observed data. A valid model
describes everything that is going to be observed. For
this reason we require that the training data is representative
of what usually happens in the scene.

The ideal model of a scene allows us to decide in a probabilistic
manner which measurements are typical and which
measurements are unusual. With such a model we can compute
the probability of single measurements and of temporal
trajectories. Furthermore, we can detect outliers that occur
due to measurement errors. The model represents the typical
behaviour of the scene, and it enables the system
to alert a user when unusual behaviour takes place, a feature
which is useful for the task of a video surveillance
operator.
In this section we describe the generation of a scene
reference model which gives rise to a goodness measure that
can compute the likelihood of measurements y(ti) with respect
to the scene reference model. We know that a single mode
is insufficient to provide a valid scene description. We need
a model with several modes that associate spatially close
measurements and provide a locally valid model. The model
is computed from data recorded by a static camera.
An important question is which training data should be used
to create an initial model. The CAVIAR test case scenarios [4]
contain 26 image sequences and hand labelled ground truth.
We can use the ground truth to generate an initial model. If
the initial model is not sufficient, it can be refined
by adding tracking observations, from which measurements with
low probability (which are likely to contain errors) are removed.
For the computation of the scene reference model, we
use the hand labelled data of the CAVIAR data set (42000
bounding boxes). We divide the model into a training and
a test set of equal size. The observations consist of spatial
measurements y_spatial(t_i) = (µ_x, µ_y, σ_x², σ_y²) (first and second
moments of the target observation in frame I(t_i)). We can
extend these observations to spatio-temporal measurements
y_spatiotemp(t_i) = (µ_x, µ_y, σ_x², σ_y², ∆µ_x, ∆µ_y, ∆σ_x², ∆σ_y²) by
considering observations at subsequent time instances t_i and
t_{i−1}. Such measurements have the advantage that we take into
account the local motion direction and speed. A trajectory
y(t) is a sequence of spatial or spatio-temporal measurements
y(t_i). Single measurements are noted as vectors y(t_i), whereas
trajectories y(t) are coded as vector lists. The following
approach is valid for both types of observed trajectories y(t).
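The construction of spatio-temporal measurements from consecutive spatial measurements can be sketched in Python as follows (a minimal illustration; the function names and the array layout are our own assumptions, not part of the paper):

```python
import numpy as np

def spatiotemporal_measurements(spatial):
    """Extend a trajectory of spatial measurements
    (mu_x, mu_y, var_x, var_y) with the differences to the previous
    frame, giving 8-dimensional vectors
    (mu_x, mu_y, var_x, var_y, d_mu_x, d_mu_y, d_var_x, d_var_y)."""
    spatial = np.asarray(spatial, dtype=float)
    deltas = spatial[1:] - spatial[:-1]   # differences between t_i and t_{i-1}
    return np.hstack([spatial[1:], deltas])
```

The first frame of a trajectory has no predecessor, so the spatio-temporal trajectory is one measurement shorter than the spatial one.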
To obtain a multi-modal model we have experimented with
two types of clustering methods: k-means and k-means with
pruning. K-means requires a fixed number of clusters that
must be specified by the user a priori. K-means converges to
a local minimum that depends on the initial clusters. These
are determined randomly, which means that the algorithm
produces different sub-optimal solutions in different runs. To
overcome this problem, k-means is run several times with the
same parameters. In Section III-C we propose a measure to
judge the quality of the clustering result. With this measure
we select an optimal clustering solution as our scene reference
model.
The method k-means with pruning is a variation of the traditional
k-means that produces more stable results due to subsequent
fusion of close clusters. In this variation, k-means is called
with a large number of clusters, k ∈ [500, 2000]. Clusters
that are close within this solution are subsequently merged,
and clusters with few elements are considered as noise and
removed. This method is less sensitive to outliers, has the
characteristics of a hierarchical clustering scheme, and at the
same time can be computed quickly due to the initial fast
k-means clustering. Figure 3 illustrates this algorithm.
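The algorithm above can be sketched in Python. This is an illustrative reconstruction under our own assumptions (plain Lloyd iterations, greedy center merging, the specific thresholds are the defaults from Figure 3), not the authors' implementation:

```python
import numpy as np

def kmeans_with_pruning(data, k=500, merge_dist=1.0, min_elements=4,
                        n_iter=20, rng=None):
    """K-means with pruning: run plain k-means with a large k, merge
    clusters whose centers are closer than merge_dist, and discard
    clusters with fewer than min_elements points (treated as noise)."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    # initialise centers on randomly chosen data points
    centers = data[rng.choice(len(data), size=min(k, len(data)), replace=False)]
    for _ in range(n_iter):                       # plain Lloyd iterations
        d = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            pts = data[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    # greedily merge centers that are closer than merge_dist
    merged = []
    for c in centers:
        for m in merged:
            if np.linalg.norm(c - m["center"]) < merge_dist:
                m["members"].append(c.copy())
                m["center"] = np.mean(m["members"], axis=0)
                break
        else:
            merged.append({"center": c.copy(), "members": [c.copy()]})
    centers = np.array([m["center"] for m in merged])
    # prune clusters with too few assigned points
    d = np.linalg.norm(data[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    keep = [j for j in range(len(centers))
            if (labels == j).sum() >= min_elements]
    return centers[keep]
```

Because the expensive step is the initial k-means pass, the subsequent merge and prune steps add little cost, which matches the fast hierarchical behaviour described above.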
B. Evaluating the goodness of a trajectory
A set of Gaussian clusters modeled by mean and covariance
is an appropriate representation for statistical evaluation of
measurements. The probability P (y(ti)|C ) can be computed
according to equation 2.
Fig. 3. K-means with pruning. After initial k-means clustering, clusters whose centers are closer than 1.0 are merged, and clusters with fewer than 4 elements are assigned to noise.
The auto-regulation and auto-critical modules need a measure
to judge the goodness of a particular trajectory. A simple
goodness score consists of the average probability of the
most likely cluster for the single measurements. The goodness
G(y(t)) of the trajectory y(t) = (y(t_n), . . . , y(t_0)) with length
n + 1 is computed as follows:
G(y(t)) = (1/(n+1)) Σ_{i=0}^{n} max_k P(y(t_i)|C_k)    (1)

with

P(y(t_i)|C) = P(y(t_i)|µ, U)    (2)
            = (1 / ((2π)^{dim/2} |U|^{1/2})) exp(−0.5 (y(t_i) − µ)^T U^{−1} (y(t_i) − µ))
where µ is the mean and U the covariance of cluster C. Trajectories have
variable length and may consist of several hundred measurements.
The proposed goodness score is high for trajectories
composed of likely measurements and small for trajectories
that contain many unlikely measurements (errors). This measure
allows good and bad trajectories to be classified reliably,
independent of their particular length.
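Equations (1) and (2) can be written out directly. The sketch below is an illustration under our own naming conventions (a cluster is a (mean, covariance) pair); it is not taken from the paper:

```python
import numpy as np

def gaussian_pdf(y, mu, U):
    """Multivariate Gaussian density of equation (2)."""
    d = len(mu)
    diff = np.asarray(y, dtype=float) - np.asarray(mu, dtype=float)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(U))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(U) @ diff) / norm)

def goodness(trajectory, clusters):
    """Simple goodness score of equation (1): the average, over the
    measurements of a trajectory, of the probability of the most
    likely cluster."""
    return float(np.mean([
        max(gaussian_pdf(y, mu, U) for mu, U in clusters)
        for y in trajectory]))
```

Averaging over the trajectory length is what makes the score comparable between short and long trajectories.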
On the other hand, the goodness score does not take into
account the sequential structure of the measurements. The
sequential structure is an important indicator for the detection
of local measurement errors and errors due to badly adapted
parameters. To study the potential of a goodness score that
is sensitive to the sequential structure, we propose the following
measure (see equation 3):
G_seq(v)(y(t)) = (1/m) Σ_{i=0}^{m−1} log(P̂(y(s_i)))    (3)

which is the average log likelihood of the dominant term
P̂(y(s_i)) of the probability of a sub-trajectory y(s_i) of length
v. We use the log likelihood because P(y(s_i)) is typically very
small.
A trajectory y(t) = (y(s_0), y(s_1), . . . , y(s_{m−1})) is composed
of m sub-trajectories y(s_i) of length v. We develop
the measure for v = 3; the measure for any other value of v is
developed accordingly. The probability of the sub-trajectories
is defined as:
P(y(s_i)) = P(y(t_2), y(t_1), y(t_0))
          = P̂(y(s_i)) + r
P̂(y(s_i)) = P(C_{k2}|y(t_2)) P(C_{k1}|y(t_1)) P(C_{k0}|y(t_0)) P(C_{k2} C_{k1} C_{k0})    (4)

P(y(s_i)) is composed of the probability of the most likely
path through the modes of the model, P̂(y(s_i)), plus a term r
which contains the probability of all other path permutations.
Naturally, P(y(s_i)) is dominated by P̂(y(s_i)), and
r tends to be very small. This is the reason why we use
only the dominant term P̂(y(s_i)) in the final goodness score.
P(C_{ki}|y(t_i)) is computed using Bayes' rule. The prior P(C_k)
is set to the ratio |C_k| / Σ_u |C_u|. The normalisation factor
P(y(t_i)) is constant. Since we are interested in the maximum
likelihood, we compute:

P(C_{ki}|y(t_i)) = P(y(t_i)|C_{ki}) P(C_{ki}) / P(y(t_i))
                ∝ P(y(t_i)|C_{ki}) |C_{ki}| / Σ_u |C_u|    (5)

where |C_{ki}| denotes the number of elements in C_{ki}.
P(y(t_i)|C_{ki}) is computed according to equation (2).
The joint probability P(C_{k2} C_{k1} C_{k0}) is developed according
to

P(C_{k2} C_{k1} C_{k0}) = P(C_{k2}|C_{k1} C_{k0}) P(C_{k1}|C_{k0}) P(C_{k0})    (6)

We simplify this equation by assuming a first-order Markov
constraint:

P(C_{k2} C_{k1} C_{k0}) = P(C_{k2}|C_{k1}) P(C_{k1}|C_{k0}) P(C_{k0})    (7)
To compute the conditional probabilities P(C_i|C_j), we need
to construct a transfer matrix from the training set. This is
obtained by counting, for each cluster C_i, the number of
state changes and then normalising such that each row of the
state matrix sums to 1. The probabilistically inspired sequential
goodness score of equation (3) is computed using equations (4)
to (7).
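The transfer matrix and the sequential score can be sketched as follows. This is our own illustrative reconstruction (the data layout, i.e. precomputed per-measurement cluster posteriors and most-likely cluster labels, is an assumption, not the authors' code):

```python
import numpy as np

def transition_matrix(label_sequences, n_clusters):
    """Transfer matrix for the first-order Markov model of equation (7):
    count cluster-to-cluster changes in the training trajectories and
    normalise each row to sum to 1."""
    T = np.zeros((n_clusters, n_clusters))
    for seq in label_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            T[a, b] += 1
    rows = T.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0        # leave rows of unseen clusters at zero
    return T / rows

def sequential_goodness(posteriors, labels, priors, T, v=3):
    """Sequential goodness score of equation (3): average log of the
    dominant term, i.e. the product of the per-measurement cluster
    posteriors times the Markov-factorised joint prior of the visited
    clusters (equations 4 and 7)."""
    logs = []
    for s in range(0, len(labels) - v + 1, v):
        ks = labels[s:s + v]
        p = priors[ks[0]]                     # P(C_k0), equation (7)
        for a, b in zip(ks[:-1], ks[1:]):
            p *= T[a, b]                      # Markov transition factors
        for i, k in zip(range(s, s + v), ks):
            p *= posteriors[i][k]             # posteriors of equation (5)
        logs.append(np.log(p))
    return float(np.mean(logs))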
C. Measuring the quality of the model
K-means clustering is a popular tool for learning and model
generation because the user needs to provide only the number
of desired clusters [3], [7], [8]. K-means converges quickly to
a (locally) optimal solution. K-means clustering starts from a
number of randomly initialised cluster centers; therefore, each
run produces a different sub-optimal solution. In cases where
the number of clusters is unknown, k-means can be run several
times with a varying number of clusters. A difficult problem is
to rank the different k-means solutions and select the one that
is the most appropriate for the task. This section provides a
solution to this problem, which is often neglected.
For a particular model (clustering solution) we can compute
the probability of a measurement belonging to the model.
To ensure that the computed probability is meaningful, the
model must be representative. A good model assigns a high
probability to a typical trajectory and a low probability to
an unusual trajectory. Based on these notions we define an
evaluation criterion for measuring the quality of the model.
We need a model that is neither too simple nor
too complex. The complexity is related to the number of
clusters [1]. A high number of clusters leads to over-fitting, and
a low number of clusters provides an imprecise description.
Model quality evaluation requires a positive and a negative
example set. Typical target trajectories (positive examples)
are provided within the training data. It is more difficult
to create negative examples. A negative example trajectory
is constructed as follows. First, we measure the mean and
variance of all training data; this represents the distribution
of the data. We can now generate random measurements
by drawing from this distribution with a random number
generator. The result is a set of random measurements. From
the training set, we generate a k-means clustering with a large
number of clusters (K = 100). For each random measurement
we compute p(y(t_i)|model_100). From the original 5000 random
measurements we keep the 1200 measurements with the
lowest probability. This gives the set of negative examples.
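The negative-example construction can be sketched as below. The `score` callback stands in for p(y|model_100) and is our own abstraction; the default counts (5000 drawn, 1200 kept) come from the text:

```python
import numpy as np

def negative_examples(train, score, n_draw=5000, n_keep=1200, rng=None):
    """Fit the mean and standard deviation of the training data, draw
    n_draw random measurements from that distribution, and keep the
    n_keep samples to which the model (the `score` function) assigns
    the lowest probability."""
    rng = np.random.default_rng(rng)
    train = np.asarray(train, dtype=float)
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    samples = rng.normal(mu, sigma, size=(n_draw, train.shape[1]))
    probs = np.array([score(s) for s in samples])
    keep = np.argsort(probs)[:n_keep]     # lowest-probability samples
    return samples[keep]
```

Keeping only the lowest-probability draws makes the negative set genuinely atypical with respect to the clustered model, rather than merely random.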
Figure 4 shows an example of the positive and negative
trajectory as well as the hand labelled ground truth and a
multi-modal model obtained by k-means with pruning.
For all positive and negative measurements we compute
the probability P(y(t_i)). Classification of the measurements
into positive and negative can be obtained by thresholding
this value. For a threshold d, the classification error can be
computed according to equation (8). The optimal threshold
d separates positive from negative measurements with a
minimum classification error [1].
P_d(error) = P(x ∈ R_bad, C_good) + P(x ∈ R_good, C_bad)    (8)
           = ∫_0^d p(x|C_good) P(C_good) dx + ∫_d^1 p(x|C_bad) P(C_bad) dx
with R_bad = [0, d] and R_good = [d, 1]. We search for the optimal
threshold d such that P_d(error) is minimised. We operate on a
histogram using a logarithmic scale. This has the advantage that
the distribution of lower values is sampled more densely. The
optimal threshold d with minimum classification error can be
estimated precisely with this method.
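The threshold search can be sketched as follows (an illustration under our assumptions: probabilities are thresholded directly, candidate thresholds are spaced logarithmically as the text describes, and the error of equation (8) is estimated by counting misclassified samples):

```python
import numpy as np

def optimal_threshold(pos_probs, neg_probs, n_bins=100):
    """Find the threshold d minimising the classification error of
    equation (8). Candidates are log-spaced so that the small
    probability values are sampled more densely."""
    pos_probs = np.asarray(pos_probs, dtype=float)
    neg_probs = np.asarray(neg_probs, dtype=float)
    all_p = np.concatenate([pos_probs, neg_probs])
    candidates = np.exp(np.linspace(np.log(all_p.min()),
                                    np.log(all_p.max()), n_bins))
    best_d, best_err = candidates[0], np.inf
    n = len(all_p)
    for d in candidates:
        # positives below d and negatives at or above d are misclassified
        err = ((pos_probs < d).sum() + (neg_probs >= d).sum()) / n
        if err < best_err:
            best_d, best_err = d, err
    return best_d, best_err
```

When the two populations are well separated, the minimum error is zero and any threshold between them is optimal; the search returns the first such candidate.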
This classification error P(error) is a measure of the
quality of the cluster model. Furthermore, less complex models
should be preferred. For this reason we formulate the quality
constraint on clustering solutions as follows: the best clustering
has the lowest number of clusters and an error probability
P(error) < q, with q = 1%. The value of q is chosen
depending on the task requirements. This measure is a fair
evaluation criterion which enables the best model to be chosen
among a set of k-means solutions.
D. Clustering results
We test two clustering methods: k-means and k-means with
pruning. The positive trajectory is a person walking across the
hall; the negative trajectory consists of 1200 measurements
Fig. 5. A process for automatic parameter regulation.
constructed as described above. The training set consists
of 21000 hand labelled bounding boxes from 15 CAVIAR
sequences (see Figure 4).
Table I shows characteristics of the winning models with
highest quality, defined by minimum classification error and
minimum number of clusters. The superiority of k-means
with pruning is demonstrated by the results. For the constraint
P(error) < 1%, k-means with pruning requires only 20 or
19 clusters respectively, whereas the classical k-means needs
a model of 35 clusters to obtain the same error rate. The best
overall model is obtained for spatio-temporal measurements
using k-means with pruning.
IV. THE MODULE FOR AUTOMATIC PARAMETER
REGULATION
The task of the module for automatic regulation is to
determine a parameter setting that improves the performance
of the system. In the case of a detection and recognition
system, this corresponds to increasing the number of true
positives and reducing the number of false positives. For
this task, the module requires an evaluation function of the
current output, a strategy to choose a new parameter setting,
and a subsequence which can be replayed to optimize the
performance.
A. Integration
When the parameter regulation module is switched on, the
system tries to find a parameter setting that performs better
than the current parameter setting on a subsequence that is
provided by the tracking system. The system uses one of the
goodness scores of Section III-B.

In the experiments we use a subsequence of 200 frames for
auto-regulation. The tracker is run several times with changing
parameter settings on this subsequence, and the goodness score
of the trajectory is measured for each parameter setting. The
parameter setting that produces the highest goodness score
is kept. Parameter settings are obtained from a parameter
space exploration tool whose strategies are explained in
Sections IV-B and IV-C.
The automatic regulation can only operate on sequences that
produce a trajectory (something observable must happen in
the scene). To allow a fair comparison, the regulation module
must process the same subsequence several times. For this
Fig. 4. Ground truth labelling for the entrance hall scenario (21000 hand labelled bounding boxes), examples of a typical and an unusual (random) trajectory, and the clustering result using k-means with pruning.
Measurement type   Clustering method      # clusters   Optimal threshold d   P(error)
Spatial            K-means                35           0.0067380             0.0007
Spatial            K-means with pruning   20           0.0067380             0.0061
Spatio-temporal    K-means                35           0.00012341            0.0013
Spatio-temporal    K-means with pruning   19           0.00012341            0.0034

TABLE I. Best model representations and their characteristics (final number of clusters, optimal threshold, and classification error).
reason the regulation process requires a significant amount of
computing power. As a consequence, the regulation module
should be run on a different host such that the regulation does
not slow down the real time tracking.
B. Parameter space exploration tool
To solve the problem of parameter space exploration, we
propose a parameter exploration tool that provides the next
parameter setting to the regulation module. The dimensions
of the parameter space as well as a reasonable range of the
parameter values are given by the user. In our tracking example
the parameter space is spanned by detection energy, density,
sensitivity, split coefficient, α, and area threshold.
In the experiments we tested two strategies for parameter
setting selection. The first is an enumerative method that defines
a small number of discrete values for each parameter; at each
call the parameter space exploration tool provides the next
parameter setting in the list. The disadvantage of this method
is that only a small number of settings can be tested, and the
best setting may not be in the predefined list. The second
strategy for parameter space exploration is based on a genetic
algorithm. We found genetic algorithms well adapted to our
problem: they enable feedback from the performance of previous
settings. We have a high dimensional feature space, which
makes hill climbing methods costly, whereas genetic algorithms
explore the space without the need for a high dimensional
surface analysis.
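The enumerative strategy amounts to walking a fixed grid. A minimal sketch, using the discrete values reported in the experiments of Section V (the dictionary keys are our own naming, not the authors' API):

```python
from itertools import product

def enumerative_settings():
    """Cartesian product of the discrete values tested in the
    experiments, plus the fixed parameters: 3 * 3 * 4 = 36 settings."""
    grid = []
    for energy, density, sensitivity in product([20, 30, 40],
                                                [5, 15, 25],
                                                [0, 20, 30, 40]):
        grid.append({"detection_energy": energy,
                     "density": density,
                     "sensitivity": sensitivity,
                     "split_coefficient": 2.0,
                     "alpha": 0.001,
                     "area_threshold": 1500})
    return grid
```

The coarseness of such a grid is the disadvantage noted above: the best setting may lie between the enumerated values.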
C. Genetic algorithm for parameter space exploration
Among the different optimization schemes that exist, we are
looking for a particular method that fulfills several constraints.
We do not require reaching a global maximum of our function,
but we would like to reach a good level of performance
quickly. Furthermore, we are not particularly interested in the
shape of the surface in parameter space. We are only interested
in obtaining a good payoff with a small number of tests.
According to Goldberg [6], these are exactly the constraints of
an application for which genetic algorithms are appropriate.
Hill climbing methods are not feasible because the estimation
of the gradient at a single point in a 6-dimensional space
requires 2^6 tests. Testing several points would therefore require
a higher number of tests than we would like.
Genetic algorithms are inspired by the mechanics of natural
selection. Genetic algorithms require an objective function to
evaluate the performance of an individual and a coding of the
input variables. Typically the coding is a binary string. In our
example, each parameter is represented by 5 bits, which gives
an input string of length 30.
Genetic algorithms have three major operators: reproduction,
crossover and mutation. Reproduction is a process in
which individuals are copied according to their objective
function values: individuals with high performance are
copied more often than those with low performance. After
reproduction, crossover is performed as follows. First, pairs
of individuals are selected at random. Then, a position k
within the string of length l is selected at random. Two new
individuals are created by swapping all characters from position
k + 1 to l. The mutation operator selects a position
within the string at random and flips its value.
The power of genetic algorithms comes from the fact that
individuals with good performance are selected for reproduction,
and crossing high performance individuals speculates
on generating new ideas from high performance elements of
past trials.
For the initialisation of the genetic algorithm, the user
needs to specify the boundaries of the input variable space,
coding of the input variables, the size of the initial population
and the probability of crossover and mutation. Goldberg [6]
proposes to use a moderate population size, a high crossover
probability and a low mutation probability. The coding of the
input variables should use the smallest alphabet that allows the
problem to be expressed. In the experiments we use a population of
size 16, we estimate 7 generations, the crossover probability
is set to 0.6 and the mutation probability to 0.03.
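The operators described above can be sketched in Python. This is our own simplified illustration (per-pair crossover, at most one bit flipped per individual, fitness-proportional selection via weighted sampling), not the authors' implementation:

```python
import random

BITS = 5  # 5 bits per parameter, 6 parameters -> strings of length 30

def decode(bits, lo, hi):
    """Map a 5-bit substring to a value in the range [lo, hi]
    (a hypothetical linear coding for illustration)."""
    v = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * v / (2 ** BITS - 1)

def next_generation(population, fitness, p_cross=0.6, p_mut=0.03, rng=None):
    """One generation: fitness-proportional reproduction, single-point
    crossover of random pairs, and bit-flip mutation."""
    rng = rng or random.Random()
    scores = [fitness(ind) for ind in population]
    # reproduction: copy individuals in proportion to their fitness
    chosen = rng.choices(population, weights=scores, k=len(population))
    offspring = []
    for a, b in zip(chosen[::2], chosen[1::2]):
        a, b = a[:], b[:]
        if rng.random() < p_cross:            # single-point crossover
            k = rng.randrange(1, len(a))
            a[k:], b[k:] = b[k:], a[k:]
        offspring += [a, b]
    for ind in offspring:                      # mutation: flip a random bit
        if rng.random() < p_mut:
            k = rng.randrange(len(ind))
            ind[k] ^= 1
    return offspring
```

With the settings from the text (population 16, 7 generations), this evaluates at most 112 parameter settings, far fewer than a gradient-based exploration of the 6-dimensional space would need.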
V. EXPERIMENTAL EVALUATION
In this section we evaluate the proposed approach on the
CAVIAR entry hall sequences¹. The system is evaluated by
recall and precision of the targets compared to the hand-labelled
ground truth.
recall = true positives / (total # targets)    (9)

precision = true positives / (true positives + false positives)    (10)
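Equations (9) and (10) translate directly into code; checking them against the manual-adaptation row of Table II (11520 true positives, 1136 false positives, 23180 targets) reproduces the reported 49.7% recall and 91.0% precision:

```python
def recall_precision(true_positives, false_positives, total_targets):
    """Recall and precision as defined in equations (9) and (10)."""
    recall = true_positives / total_targets
    precision = true_positives / (true_positives + false_positives)
    return recall, precision

# Manual adaptation row of Table II
r, p = recall_precision(11520, 1136, 23180)  # -> about 0.497 and 0.910
```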
We use the results of the manual adaptation as an upper
benchmark. These results were obtained by a human expert
who processed the sequences several times and hand tuned the
parameters. The expert quickly gained experience about which
kinds of tracking errors depend on which parameter. The automatic
regulation module does not use this kind of knowledge. For
this reason, the recall and precision of the manual adaptation
are the best we can hope to reach with an automatic method.
We do not have manually adapted parameters for all sequences,
due to the repetitive and time consuming nature of the manual task.
A lower benchmark is provided by the tracking results that
use no adaptation. This means all 5 sequences are evaluated
using the same parameter setting. Choosing parameters with
high values² produces low recall and bad precision. Choosing
parameters with low values³ increases the recall, but the very
large number of false positives is not acceptable.
Table II shows the tracking results using a spatial and a
spatio-temporal model and two parameter space exploration
schemes. The first uses a brute force search (enumerative
method) of the discrete parameter space composed of the
discrete values for detection energy ∈ [20, 30, 40], density
∈ [5, 15, 25], sensitivity ∈ [0, 20, 30, 40], split coefficient =
2.0, α = 0.001, area threshold = 1500. The method tests
36 parameter settings. The second exploration scheme uses
a genetic algorithm as described in section IV-C.
The enumerative method has several disadvantages that
are reflected by the rather low performance measurements
of the experiments. The sampling of the parameter space is
coarse, and therefore it happens frequently that none of the
parameter settings provides an acceptable improvement. The
same arguments hold for random sampling of the parameter
space.
The spatial model using the brute force method and the simple
score has a small recall, but a better precision than the lower
benchmark. The spatio-temporal measurements using the same
parameter selection and evaluation measure produce superior
results (higher recall and higher precision). This seems to
be related to the spatio-temporal model. The precision can
be further improved using the genetic approach and the
more complex evaluation function (recall 39.7% and precision
78.8%).

¹ Available at http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
² detection energy=30, density=15, sensitivity=30, split coefficient=1.0, α = 0.01, and area threshold=1500
³ detection energy=10, density=15, sensitivity=20, split coefficient=2.0, α = 0.01, and area threshold=1500
VI. CONCLUSIONS AND OUTLOOK
We presented an architecture for a tracking system that uses
an auto-critical function to judge its own performance and an
automatic parameter regulation module for parameter adaptation.
This system opens the way to self-adaptive systems which
can operate under difficult lighting conditions. We applied our
approach to tracking systems, but the same approach can be
used to increase the performance of other systems that depend
on a set of parameters.

An auto-critical function and a parameter regulation module
require a reliable performance evaluation measure. In our case,
this measure is computed as a divergence of the observed
measurements with respect to a scene reference model. We
proposed an approach for the generation of such a scene
reference model and developed a measure that is based on
the measurement likelihood.
With this measure, we can compute a best parameter setting
for pre-stored sequences. The experiments show that the auto-regulation
greatly enhances the performance of the tracking
output compared to tracking without auto-regulation. The
system cannot quite reach the performance of a human expert,
who uses knowledge based on the type of tracking errors for
parameter tuning; this kind of knowledge is not available to
our system.

The implementation of the auto-critical function can trigger
the automatic parameter regulation. First successful tests have
been made to host the system on a distributed system. The
advantage of the distributed system architecture is that the
tracking system can continue the real time tracking. There
remains the problem of re-initialisation of the tracker: currently,
existing targets are destroyed when the tracker is reinitialised.
The current model relies entirely on ground truth labelling.
The success of the method strongly depends on the quality of
the model. In many cases, a small number of hand labelled
trajectories can be gathered, but often their number is not
sufficient for the creation of a valid model. For such cases
we envision an incremental modeling approach: an initial
model is generated from a few hand-labelled sequences, the initial
model is then used to filter the tracking results such that they
are error free, and these error-free trajectories are then used to
refine the model. This corresponds to a feedback loop in
model generation. After a small number of iterations a valid
model should be obtained. The option of such an incremental
model is essential for non-static scenes.
ACKNOWLEDGMENT
This research is funded by the European Commission's IST
project CAVIAR (IST 2001 37540). Thanks to Thor List for
providing the recognition evaluation tool.
Auto-regulation method                                     Recall   Precision   Total # targets   True positives   False positives
Manual adaptation (benchmark)                              49.7     91.0        23180             11520            1136
Spatio-temporal model (genetic approach, Gseq(10))         39.7     78.8        21564             8556             2304
Spatio-temporal model (genetic approach, simple score G)   39.4     73.2        21564             8492             3108
Spatio-temporal model (brute force, simple score G)        38.1     72.2        21564             8224             3160
Spatial model (brute force, simple score G)                29.2     68.7        21564             6302             2872
No adaptation (low thresholds)                             68.0     24.5        21564             14672            45131
No adaptation (high thresholds)                            28.3     47.5        21564             6109             6746

TABLE II. Precision and recall of the different methods evaluated for 5 CAVIAR sequences (overlap requirement 50%).
REFERENCES

[1] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] A. Caporossi, D. Hall, P. Reignier, and J.L. Crowley. Robust visual tracking from dynamic control of processing. In International Workshop on Performance Evaluation of Tracking and Surveillance, pages 23-31, Prague, Czech Republic, May 2004.
[3] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In European Conference on Computer Vision, Prague, Czech Republic, May 2004.
[4] R.B. Fisher. The PETS04 surveillance ground-truth data sets. In International Workshop on Performance Evaluation of Tracking and Surveillance, Prague, Czech Republic, May 2004.
[5] B. Georis, F. Bremond, M. Thonnat, and B. Macq. Use of an evaluation and diagnosis method to improve tracking performances. In International Conference on Visualization, Imaging and Image Processing, September 2003.
[6] D.E. Goldberg. Genetic Algorithms in Search and Optimization. Addison-Wesley, 1989.
[7] T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In International Conference on Computer Vision, Corfu, Greece, September 1999.
[8] C. Schmid. Constructing models for content-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 39-45, Kauai, USA, December 2001.
Performance evaluation of object detection
algorithms for video surveillance
Jacinto Nascimento⋆, Member, IEEE, and Jorge Marques
[email protected] [email protected]
IST/ISR, Torre Norte, Av. Rovisco Pais, 1049-001, Lisboa Portugal
EDICS: 4-SEGM

Abstract
In this paper we propose novel methods to evaluate the performance of object detection algorithms in video
sequences. This procedure allows us to highlight characteristics (e.g., region splitting or merging) which are specific
to the method being used. The proposed framework compares the output of the algorithm with the ground truth
and measures the differences according to objective metrics. In this way it is possible to perform a fair comparison
among different methods, evaluating their strengths and weaknesses and allowing the user to make a reliable choice of the best method for a specific application. We apply this methodology to recently proposed segmentation algorithms and describe their performance. These methods were evaluated in order to assess how well they can detect
moving regions in an outdoor scene in fixed-camera situations.
Index Terms
Surveillance Systems, Performance Evaluation, Metrics, Ground Truth, Segmentation, Multiple Interpretations.
I. I NT ROD UC TI ON
VIDEO surveillance systems rely on the ability to detect moving objects in the video stream which is a relevant
information extraction step in a wide range of computer vision applications. Each image is segmented byautomatic image analysis techniques. This should be done in a reliable and effective way in order to cope with
unconstrained environments, non stationary background and different object motion patterns. Furthermore, different
types of objects are manually considered e.g., persons, vehicles or groups of people.
Many algorithms have been proposed for object detection in video surveillance applications. They rely on different
assumptions e.g., statistical models of the background [1]–[3], minimization of Gaussian differences [4], minimum
and maximum values [5], adaptivity [6,7] or a combination of frame differences and statistical background models
[8]. However, little information is available on the performance of these algorithms under different operating conditions.
Three approaches have recently been considered to characterize the performance of video segmentation algorithms:
pixel-based methods, template-based methods and object-based methods. Pixel-based methods assume that we wish
to detect all the active pixels in a given image. Object detection is therefore formulated as a set of independent
pixel detection problems. This is a classic binary detection problem, provided that we know the ground truth (ideal
segmented image). The algorithms can therefore be evaluated by standard measures used in communication theory,
e.g., misdetection rate, false alarm rate and receiver operating characteristic (ROC) [9].
This work was supported by FCT under the project LTT and by EU project CAVIAR (IST-2001-37540).
Corresponding Author: Jacinto Nascimento, (email:[email protected]), Complete Address: Instituto Superior Tecnico-Instituto
de Sistemas e Robotica (IST/ISR), Av. Rovisco Pais, Torre Norte, 6o piso, 1049-001, Lisboa, PORTUGAL Phone: +351-21-8418270, Fax:
+351-21-8418291
Several proposals have been made to improve the computation of the ROC in video segmentation problems e.g.,
using a perturbation detection rate analysis [10] or an equilibrium analysis [11]. The usefulness of pixel-based
methods for surveillance applications is questionable since we are not interested in the detection of point targets
but object regions instead. The computation of the ROC can also be performed using rectangular regions selected
by the user, with and without moving objects [12]. This improves the evaluation strategy since the statistics are
based on templates instead of isolated pixels.
A third class of methods is based on an object evaluation. Most of the works aim to characterize color, shape and
path fidelity by proposing figures of merit for each of these issues [13]–[15] or area based performance evaluation
as in [16]. This approach is instrumental to measure the performance of image segmentation methods for video
coding and synthesis but it is not usually used in surveillance applications.
These approaches have three major drawbacks. First, object detection is not a classic binary detection problem:
several types of errors should be considered (not just misdetections and false alarms). For example, what should we
do if a moving object is split into several active regions, or if two objects are merged into a single region? Second,
some methods are based on the selection of isolated pixels or rectangular regions with and without persons. This
is an unrealistic assumption, since practical algorithms have to segment the image into background and foreground
and do not have to classify rectangular regions selected by the user. Third, it is not possible to define a unique
ground truth. Many images admit several valid segmentations. If the image analysis algorithm produces a valid
segmentation, its output should be considered correct.
In this paper we propose objective metrics to evaluate the performance of object detection methods, by comparing the output of the video detector with a ground truth obtained by manual editing. Several types of errors are considered: splits of foreground regions; merges of foreground regions; simultaneous splits and merges of foreground regions; false alarms; and detection failures. False alarms occur when false objects are detected. Detection failures are caused by missing regions which have not been detected.
In this paper five segmentation algorithms are considered as examples and evaluated. We also consider multiple interpretations in ambiguous situations, e.g., when it is not clear whether two objects overlap and should be considered as a group, or whether they are separate.
The first algorithm is denoted as the basic background subtraction (BBS) algorithm. It computes the absolute difference between the current image and a static background image and compares each pixel to a threshold. All the connected components are computed, and they are considered as active regions if their area exceeds a given threshold. This is perhaps the simplest object detection algorithm one can imagine. The second method is the
detection algorithm used in the W4 system [17]. Three features are used to characterize each pixel of the background image: minimum intensity, maximum intensity, and maximum absolute difference in consecutive frames. The third method assumes that each pixel of the background is a realization of a random variable with a Gaussian distribution (SGM - Single Gaussian Model) [1]. The mean and covariance of the Gaussian distribution are independently estimated for each pixel. The fourth algorithm represents the distribution of the background pixels with a mixture of Gaussians [2]. Some modes correspond to the background and some are associated with active regions (MGM - Multiple Gaussian Model). The last method is the one proposed in [18], denoted as the Lehigh Omnidirectional
Tracking System (LOTS). It is tailored to detect small non-cooperative targets such as snipers. Some of these algorithms are described in a special issue of the IEEE Transactions on PAMI (August 2001), which presents state-of-the-art methods for automatic surveillance systems.
In this work we provide segmentation results of these algorithms on the PETS2001 sequences, using the proposed
framework. The main features of the proposed method are the following. Given the correct segmentation of the video sequence, we detect several types of errors: i) splits of foreground regions, ii) merges of foreground regions, iii) simultaneous splits and merges of foreground regions, iv) false alarms (detection of false objects), and v) detection failures (missing active regions). We then compute statistics for each type of error.
The structure of the paper is as follows. Section 2 briefly reviews previous work. Section 3 describes the
segmentation algorithms used in this paper. Section 4 describes the proposed framework. Experimental tests are
discussed in Section 5 and Section 6 presents the conclusions.
II. RELATED WORK
Surveillance and monitoring systems often require online segmentation of all moving objects in a video
sequence. Segmentation is a key step since it influences the performance of the other modules, e.g., object tracking,
classification or recognition. For instance, if object classification is required, an accurate detection is needed to
obtain a correct classification of the object.
Background subtraction is a simple approach to detect moving objects in video sequences. The basic idea is
to subtract the current frame from a background image and to classify each pixel as foreground or background
by comparing the difference with a threshold [19]. Morphological operations followed by a connected component
analysis are used to compute all active regions in the image. In practice, several difficulties arise: the background
image is corrupted by noise due to camera movements and fluttering objects (e.g., trees waving), illumination
changes, clouds, shadows. To deal with these difficulties several methods have been proposed (see [20]).
Some works use a deterministic background model, e.g., by characterizing the admissible interval for each pixel of the background image, as well as the maximum rate of change in consecutive images or the median of the largest inter-frame absolute differences [5,17]. Most works, however, rely on statistical models of the background, assuming that each pixel is a random variable with a probability distribution estimated from the video stream. For example, the Pfinder system ("Person Finder") uses a Gaussian model to describe each pixel of the background image [1]. A more general approach consists of using a mixture of Gaussians to represent each pixel. This allows the representation of multi-modal distributions, which occur in natural scenes (e.g., in the case of fluttering trees) [2].
Another set of algorithms is based on spatio-temporal segmentation of the video signal. These methods try to
detect moving regions taking into account not only the temporal evolution of the pixel intensities and color but also
their spatial properties. Segmentation is performed in a 3D region of image-time space, considering the temporal evolution of neighboring pixels. This can be done in several ways, e.g., by using spatio-temporal entropy combined with morphological operations [21]. This approach leads to an improvement of the system's performance compared with traditional frame-difference methods. Other approaches are based on the 3D structure tensor defined from the pixels' spatial and temporal derivatives in a given time interval [22]. In this case, detection is based on the
Mahalanobis distance, assuming a Gaussian distribution for the derivatives. This approach has been implemented
in real time and tested with the PETS 2005 data set. Other alternatives have also been considered, e.g., the use of a region growing method in 3D space-time [23].
A significant research effort has been made to cope with shadows and with nonstationary backgrounds. Two types of changes have to be considered: slow changes (e.g., due to the sun's motion) and rapid changes (e.g., due to clouds, rain or abrupt changes in static objects). Adaptive models and thresholds have been used to deal with slow background changes [18]. These techniques recursively update the background parameters and thresholds in order to track the evolution of the parameters in nonstationary operating conditions. To cope with abrupt changes, multiple-model techniques have been proposed [18], as well as predictive stochastic models (e.g., AR, ARMA [24,25]).
Another difficulty is the presence of ghosts [26], i.e., false active regions due to static objects belonging to the background image (e.g., cars) which suddenly start to move. This problem has been addressed by combining background subtraction with frame differencing, or by high-level operations [27],[28].
III. SEGMENTATION ALGORITHMS
This section describes the object detection algorithms used in this work: BBS, W4, SGM, MGM and LOTS. The BBS, SGM and MGM algorithms use color, while W4 and LOTS use grayscale images. In the BBS algorithm, the moving objects are detected by computing the difference between the current frame and the background image. A thresholding operation is performed, classifying a pixel as foreground if

$$|I_t(x, y) - \mu_t(x, y)| > T, \qquad (1)$$

where $I_t(x, y)$ is a $3 \times 1$ vector with the intensity of the pixel in the current frame, $\mu_t(x, y)$ is the mean intensity (background) of the pixel, and $T$ is a constant threshold.
Ideally, pixels associated with the same object should have the same label. This can be accomplished by performing a connected component analysis (e.g., using an 8-connectivity criterion). This step is usually performed after morphological filtering (dilation and erosion) to eliminate isolated pixels and small regions.
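As a concrete illustration, the two BBS stages (per-pixel thresholding as in Eq. (1), followed by connected component analysis with an area filter) can be sketched as follows. The function and parameter names are ours, and a pure-Python flood fill stands in for the morphological filtering and fast labelling a real system would use:

```python
import numpy as np

def bbs_detect(frame, background, t_pixel, min_area):
    """Basic background subtraction sketch: threshold the per-pixel
    color difference, then keep 8-connected components with at least
    min_area pixels. Names and thresholds are illustrative."""
    # One common reading of Eq. (1): a pixel is foreground if any color
    # channel differs from the background mean by more than T.
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff.max(axis=2) > t_pixel

    # 8-connectivity labelling via flood fill.
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    regions, next_label = [], 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                next_label += 1
                labels[i, j] = next_label
                stack, pixels = [(i, j)], []
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < h and 0 <= nx < w and \
                               mask[ny, nx] and labels[ny, nx] == 0:
                                labels[ny, nx] = next_label
                                stack.append((ny, nx))
                # keep only components with at least min_area pixels
                if len(pixels) >= min_area:
                    regions.append(pixels)
    return regions
```

Small isolated blobs are discarded by the area test, playing the role of the erosion step.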
The second algorithm is denoted here as W4, since it is used in the W4 system to compute moving objects [17]. This algorithm is designed for grayscale images. The background model is built using a training sequence without persons or vehicles. Three values are estimated for each pixel using the training sequence: minimum intensity (Min), maximum intensity (Max), and the maximum intensity difference between consecutive frames (D). Foreground objects are computed in four steps: i) thresholding, ii) noise cleaning by erosion, iii) fast binary component analysis, and iv) elimination of small regions.

We have modified the thresholding step of this algorithm, since it often leads to a significant level of misclassifications. We classify a pixel $I_t(x, y)$ as a foreground pixel iff

$$\big( I_t(x, y) < Min(x, y) \ \lor\ I_t(x, y) > Max(x, y) \big) \ \land\ |I_t(x, y) - I_{t-1}(x, y)| > D(x, y). \qquad (2)$$
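In code, the modified rule of Eq. (2) amounts to a few array operations. A sketch with an assumed array-based signature, where Min, Max and D are the per-pixel training statistics described above:

```python
import numpy as np

def w4_foreground(curr, prev, min_i, max_i, d):
    """Modified W4 thresholding, Eq. (2): a pixel is foreground iff it
    falls outside the [Min, Max] envelope AND its inter-frame change
    exceeds the per-pixel maximum difference D. All arguments are
    grayscale arrays of the same shape (illustrative sketch)."""
    outside = (curr < min_i) | (curr > max_i)
    moving = np.abs(curr.astype(int) - prev.astype(int)) > d
    return outside & moving
```

The conjunction with the inter-frame term suppresses pixels that merely lie outside the training envelope without actually changing between frames.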
Figs. 1 and 2 show an example comparing both approaches. Fig. 1 shows the original image with two active regions. Figs. 2(a),(b) display the output of the thresholding step performed as in [17] and using (2), respectively.
Fig. 1. Two regions (in bounding boxes) of an image.
(a) (b)
Fig. 2. Thresholding results: (a) using the approach as in [17] and (b) using (2).
The third algorithm considered in this study is the SGM (Single Gaussian Model) algorithm. In this method, the information is collected in a vector $[Y, U, V]^T$, which defines the intensity and color of each pixel. We assume that the scene changes slowly. The mean $\mu_t(x, y)$ and covariance $\Sigma_t(x, y)$ of each pixel can be recursively updated as follows:

$$\mu_t(x, y) = (1 - \alpha)\mu_{t-1}(x, y) + \alpha I_t(x, y), \qquad (3)$$

$$\Sigma_t(x, y) = (1 - \alpha)\Sigma_{t-1}(x, y) + \alpha (I_t(x, y) - \mu_t(x, y))(I_t(x, y) - \mu_t(x, y))^T, \qquad (4)$$

where $I_t(x, y)$ is the pixel of the current frame in YUV color space and $\alpha$ is a constant.

After updating the background, the SGM performs a binary classification of the pixels into foreground or background and tries to cluster foreground pixels into blobs. Pixels in the current frame are compared with the background by measuring the log-likelihood in color space. Thus, individual pixels are assigned either to the background region or to a foreground region:

$$l(x, y) = -\frac{1}{2}(I_t(x, y) - \mu_t(x, y))^T \Sigma_t^{-1} (I_t(x, y) - \mu_t(x, y)) - \frac{1}{2}\ln |\Sigma_t| - \frac{m}{2}\ln(2\pi), \qquad (5)$$

where $I_t(x, y)$ is the $(Y, U, V)^T$ vector defined for each pixel in the current image and $\mu_t(x, y)$ is the corresponding pixel vector in the background image $B$.
If a small likelihood is computed using (5), the pixel is classified as active. Otherwise, it is classified as
background.
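Taken together, the update recursions (3)-(4) and the log-likelihood test (5) can be sketched per pixel as follows (function names are ours; a real implementation vectorizes this over the whole image):

```python
import numpy as np

def sgm_update(mu, sigma, pixel, alpha):
    """Recursive update of the per-pixel Gaussian, Eqs. (3)-(4).
    Note Eq. (4) uses the already-updated mean."""
    mu_new = (1 - alpha) * mu + alpha * pixel
    d = (pixel - mu_new).reshape(-1, 1)
    sigma_new = (1 - alpha) * sigma + alpha * (d @ d.T)
    return mu_new, sigma_new

def sgm_loglik(pixel, mu, sigma):
    """Log-likelihood of a YUV pixel under the background model, Eq. (5).
    A pixel with a small value is classified as active."""
    m = len(pixel)
    d = (pixel - mu).reshape(-1, 1)
    return float(-0.5 * d.T @ np.linalg.inv(sigma) @ d
                 - 0.5 * np.log(np.linalg.det(sigma))
                 - 0.5 * m * np.log(2 * np.pi))
```

Classification then thresholds `sgm_loglik` per pixel, with low likelihood meaning foreground.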
The fourth algorithm (MGM) models each pixel $I(x) = I(x, y)$ as a mixture of $N$ ($N = 3$) Gaussian distributions, i.e.,

$$p(I(x)) = \sum_{k=1}^{N} \omega_k \, \mathcal{N}(I(x), \mu_k(x), \Sigma_k(x)), \qquad (6)$$

where $\mathcal{N}(I(x), \mu_k(x), \Sigma_k(x))$ is a multivariate normal distribution and $\omega_k$ is the weight of the $k$th normal,

$$\mathcal{N}(I(x), \mu_k(x), \Sigma_k(x)) = c \, \exp\left( -\frac{1}{2} (I(x) - \mu_k(x))^T \Sigma_k^{-1}(x) (I(x) - \mu_k(x)) \right), \qquad (7)$$

with $c = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}}$. Note that each pixel $I(x)$ is a $3 \times 1$ vector with three color components (red, green and blue), i.e., $I(x) = [I(x)_R\ I(x)_G\ I(x)_B]^T$. To avoid an excessive computational cost, the covariance matrix is assumed to be diagonal [2].

The mixture model is dynamically updated. Each pixel is updated as follows. i) The algorithm checks whether each incoming pixel value can be ascribed to a given mode of the mixture; this is the match operation. ii) If the pixel value lies inside the confidence interval of a mode (within 2.5 standard deviations), a match event is verified. The parameters of the corresponding (matched) distributions for that pixel are updated according to

$$\mu_k^t(x) = (1 - \lambda_k^t)\mu_k^{t-1}(x) + \lambda_k^t I_t(x), \qquad (8)$$

$$\Sigma_k^t(x) = (1 - \lambda_k^t)\Sigma_k^{t-1}(x) + \lambda_k^t (I_t(x) - \mu_k^t(x))(I_t(x) - \mu_k^t(x))^T, \qquad (9)$$

where

$$\lambda_k^t = \alpha \, \mathcal{N}(I_t(x), \mu_k^{t-1}(x), \Sigma_k^{t-1}(x)). \qquad (10)$$

The weights are updated by

$$\omega_k^t = (1 - \alpha)\omega_k^{t-1} + \alpha M_k^t, \quad \text{with} \quad M_k^t = \begin{cases} 1 & \text{matched models} \\ 0 & \text{remaining models,} \end{cases} \qquad (11)$$

where $\alpha$ is the learning rate. The non-matched components of the mixture are not modified. If none of the existing components match the pixel value, the least probable distribution is replaced by a normal distribution with mean equal to the current value, a large covariance and a small weight. iii) The next step is to order the distributions in descending order of $\omega/\sigma$. This criterion favours distributions which have more weight (most supporting evidence) and less variance (less uncertainty). iv) Finally, the algorithm models each pixel as the sum of the corresponding updated distributions. The first $B$ Gaussian modes are used to represent the background, while the remaining modes are considered as foreground distributions. $B$ is chosen as the smallest integer such that

$$\sum_{k=1}^{B} \omega_k > T, \qquad (12)$$

where $T$ is a threshold that accounts for the quantity of data that should belong to the background.
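A per-pixel sketch of steps i)-iv) is given below. For simplicity the match weight $\lambda_k^t$ of Eq. (10) is replaced by the constant learning rate $\alpha$, and only the first matched mode is updated; both are common simplifications, not the paper's exact scheme. Class and parameter names are ours:

```python
import numpy as np

class MGMPixel:
    """Per-pixel mixture of N Gaussians with diagonal covariance,
    loosely following Eqs. (6)-(12). Illustrative sketch only."""
    def __init__(self, n=3, alpha=0.01, init_var=900.0):
        self.alpha = alpha
        self.mu = np.zeros((n, 3))
        self.var = np.full((n, 3), init_var)   # diagonal covariances
        self.w = np.full(n, 1.0 / n)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # match: pixel within 2.5 standard deviations of a mode
        dist = np.abs(x - self.mu) / np.sqrt(self.var)
        matched = np.all(dist < 2.5, axis=1)
        if matched.any():
            k = int(np.argmax(matched))        # first matched mode
            rho = self.alpha                   # stand-in for Eq. (10)
            self.mu[k] = (1 - rho) * self.mu[k] + rho * x
            self.var[k] = (1 - rho) * self.var[k] + rho * (x - self.mu[k]) ** 2
        else:
            # replace the least probable mode by a wide, low-weight Gaussian
            k = int(np.argmin(self.w))
            self.mu[k], self.var[k], self.w[k] = x, np.full(3, 900.0), 0.05
        # Eq. (11): reinforce the matched mode, decay the others
        m = np.zeros_like(self.w)
        m[k] = 1.0
        self.w = (1 - self.alpha) * self.w + self.alpha * m
        self.w /= self.w.sum()

    def background_modes(self, t):
        """Eq. (12): smallest set of highest-ranked (by w/sigma) modes
        whose cumulative weight exceeds T."""
        order = np.argsort(-self.w / np.sqrt(self.var.mean(axis=1)))
        cum = np.cumsum(self.w[order])
        b = int(np.searchsorted(cum, t)) + 1
        return order[:b]
```

Feeding a stable pixel value repeatedly drives one mode onto that value with a dominant weight, so it ends up among the background modes.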
The fifth algorithm [18] is tailored for the detection of non-cooperative targets (e.g., snipers) under non-stationary environments. The algorithm uses two gray-level background images, $B_1$ and $B_2$. This allows the algorithm to cope with intensity variations due to noise or fluttering objects moving in the scene. The background images are initialized using a set of $T$ consecutive frames without active objects:

$$B_1(x, y) = \min\{I_t(x, y),\ t = 1, \ldots, T\}, \qquad (13)$$

$$B_2(x, y) = \max\{I_t(x, y),\ t = 1, \ldots, T\}, \qquad (14)$$

where $t \in \{1, 2, \ldots, T\}$ denotes the time instant.

In this method, targets are detected by using two thresholds ($T_L$, $T_H$) followed by a quasi-connected components (QCC) analysis. These thresholds are initialized using the difference between the background images:

$$T_L(x, y) = |B_1(x, y) - B_2(x, y)| + c_U, \qquad (15)$$

$$T_H(x, y) = T_L(x, y) + c_S, \qquad (16)$$

where $c_U, c_S \in [0, 255]$ are constants specified by the user.

We compute the difference between each pixel and the closest background image. If the difference exceeds the low threshold $T_L$, i.e.,

$$\min_i |I_t(x, y) - B_i^t(x, y)| > T_L(x, y), \qquad (17)$$

the pixel is considered active. A target is a set of connected active pixels such that a subset of them verifies

$$\min_i |I_t(x, y) - B_i^t(x, y)| > T_H(x, y), \qquad (18)$$

where $T_H(x, y)$ is the high threshold. The low and high thresholds $T_L^t(x, y)$, $T_H^t(x, y)$, as well as the background images $B_i^t(x, y)$, $i = 1, 2$, are recursively updated in a fully automatic way (see [18] for details).
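The dual-threshold rule of Eqs. (17)-(18) behaves like hysteresis thresholding: weak (above-$T_L$) pixels survive only when connected to at least one strong (above-$T_H$) seed. A sketch with plain 4-connectivity instead of the paper's quasi-connected components, and with our own function names:

```python
import numpy as np

def lots_detect(frame, b1, b2, t_low, t_high):
    """Dual-threshold detection against two background images,
    Eqs. (17)-(18): keep connected sets of above-T_L pixels that
    contain at least one above-T_H pixel. Illustrative sketch."""
    diff = np.minimum(np.abs(frame - b1), np.abs(frame - b2))
    low, high = diff > t_low, diff > t_high
    # grow the strong seeds through the weak mask (dilation loop)
    target = high.copy()
    changed = True
    while changed:
        grown = target.copy()
        grown[1:, :] |= target[:-1, :]
        grown[:-1, :] |= target[1:, :]
        grown[:, 1:] |= target[:, :-1]
        grown[:, :-1] |= target[:, 1:]
        grown &= low
        changed = not np.array_equal(grown, target)
        target = grown
    return target
```

Weak pixels with no strong neighbor anywhere in their component are discarded, which suppresses isolated noise while preserving full target silhouettes.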
IV. PROPOSED FRAMEWORK
In order to evaluate the performance of object detection algorithms we propose a framework which is based on
the following principles:
• A set of sequences is selected for testing, and all the moving objects are detected using an automatic procedure and manually corrected if necessary to obtain the ground truth. This is performed at one frame per second.
• The output of the automatic detector is compared with the ground truth.
• Each detected region is classified into one of the following classes: correct detection, detection failure, split, merge, split/merge or false alarm.
• A set of statistics (mean, standard deviation) is computed for each type of error.
To perform the first step, we built a user-friendly interface which allows the user to define the foreground regions in the test sequence in a semi-automatic way. Fig. 3 shows the interface used to generate the ground truth. A set of frames is extracted from the test sequence (one per second). An automatic object detection algorithm is then used to provide a tentative segmentation of the test images. Finally, the automatic segmentation is corrected by the
user, by merging, splitting, removing or creating active regions. Typically, the boundary of the object is detected with a two-pixel accuracy. Multiple segmentations of the video data are generated every time there is an ambiguous situation, i.e., two close regions which are almost overlapping. This problem is discussed in Section IV-D.

In the case depicted in Fig. 3 there are four active regions: a car, a lorry and two groups of persons. The segmentation algorithm also detects regions due to lighting changes, leading to a number of false alarms (four). The user can easily edit the image by adding, removing or checking active regions, thus providing a correct segmentation. In Fig. 3 we can see an example where the user progressively removes the regions which do not belong to the objects of interest. The final segmentation is shown in the bottom images.
Fig. 3. User interface used to create the ground truth from the automatic segmentation of the video images.
The test images are used to evaluate the performance of the object detection algorithms. In order to compare the output of an algorithm with the ground truth segmentation, a region matching procedure is adopted, which establishes a correspondence between the detected objects and the ground truth. Several cases are considered:
1) Correct Detection (CD) or 1-1 match: the detected region matches one and only one ground truth region.
2) False Alarm (FA): the detected region has no correspondence.
3) Detection Failure (DF): the ground truth region has no correspondence.
4) Merge Region (M): the detected region is associated with several ground truth regions.
5) Split Region (S): the ground truth region is associated with several detected regions.
6) Split-Merge Region (SM): the conditions in 4) and 5) are simultaneously satisfied.
A. Region Matching
Object matching is performed by computing a binary correspondence matrix $C^t$ which defines the correspondence between the active regions in a pair of images. Let us assume that we have $N$ ground truth regions $R_i$ and $M$ detected regions $R_j$. Under these conditions, $C^t$ is an $N \times M$ matrix, defined as follows:
$$C^t(i, j) = \begin{cases} 1 & \text{if } \dfrac{\sharp(R_i \cap R_j)}{\sharp(R_i \cup R_j)} > T \\[4pt] 0 & \text{otherwise,} \end{cases} \qquad \forall i \in \{1, \ldots, N\},\ j \in \{1, \ldots, M\}, \qquad (19)$$

where $T$ is the threshold which accounts for the overlap requirement. It is also useful to add the number of ones in each line or column, defining two auxiliary vectors:

$$L(i) = \sum_{j=1}^{M} C(i, j), \quad i \in \{1, \ldots, N\}, \qquad (20)$$

$$C(j) = \sum_{i=1}^{N} C(i, j), \quad j \in \{1, \ldots, M\}. \qquad (21)$$
When we associate ground truth regions with detected regions, six cases can occur: zero-to-one, one-to-zero, one-to-one, many-to-one, one-to-many and many-to-many associations. These correspond to false alarm, misdetection, correct detection, merge, split and split-merge, respectively.
Detected regions $R_j$ are classified according to the following rules:

$$\begin{aligned} \text{CD} &: \ \exists i : L(i) = C(j) = 1 \ \land\ C(i, j) = 1 \\ \text{M} &: \ \exists i : C(j) > 1 \ \land\ C(i, j) = 1 \\ \text{S} &: \ \exists i : L(i) > 1 \ \land\ C(i, j) = 1 \\ \text{SM} &: \ \exists i : L(i) > 1 \ \land\ C(j) > 1 \ \land\ C(i, j) = 1 \\ \text{FA} &: \ C(j) = 0 \end{aligned} \qquad (22)$$

A detection failure (DF) associated with the ground truth region $R_i$ occurs if $L(i) = 0$. The last two situations (FA, DF) in (22) occur whenever there are empty columns or lines in the matrix $C$.
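The matrix of Eq. (19), the auxiliary vectors (20)-(21) and the rules of (22) translate almost directly into code. A sketch with assumed names, where regions are represented as sets of pixel coordinates and each detected region is classified by the first ground truth region it matches:

```python
def match_regions(gt_regions, det_regions, t=0.2):
    """Build the correspondence matrix of Eq. (19) and classify each
    detected region per Eq. (22). Returns index lists per class."""
    n, m = len(gt_regions), len(det_regions)
    c = [[0] * m for _ in range(n)]
    for i, r in enumerate(gt_regions):
        for j, s in enumerate(det_regions):
            # Eq. (19): normalized overlap #(Ri ∩ Rj) / #(Ri ∪ Rj)
            c[i][j] = 1 if len(r & s) / len(r | s) > t else 0
    line = [sum(c[i][j] for j in range(m)) for i in range(n)]  # L(i)
    col = [sum(c[i][j] for i in range(n)) for j in range(m)]   # C(j)

    out = {'CD': [], 'M': [], 'S': [], 'SM': [], 'FA': [], 'DF': []}
    for j in range(m):
        if col[j] == 0:                       # empty column: false alarm
            out['FA'].append(j)
            continue
        for i in range(n):
            if c[i][j] == 1:
                if line[i] == 1 and col[j] == 1:
                    out['CD'].append(j)
                elif line[i] > 1 and col[j] > 1:
                    out['SM'].append(j)
                elif col[j] > 1:
                    out['M'].append(j)
                else:
                    out['S'].append(j)
                break
    out['DF'] = [i for i in range(n) if line[i] == 0]  # empty lines
    return out
```

For example, a ground truth region covered by two detected halves yields two split classifications and no detection failure.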
Fig. 4 illustrates the six situations considered in this analysis with synthetic examples. Two images are shown for each case, corresponding to the ground truth (left) and the detected regions (right), together with the corresponding matrix C. For each case, the left image contains the regions defined by the user (ground truth) and the right image contains the regions detected by the segmentation algorithm. Each region is represented by a white area containing a visual label. Fig. 4 (a) shows an ideal situation, in which each ground truth region matches one and only one detected region (correct detection). In Fig. 4 (b) the "square" region has no correspondence among the detected regions, so it corresponds to a detection failure. In Fig. 4 (c) the algorithm detects regions which have no correspondence in the ground truth image, indicating a false alarm. Fig. 4 (d) shows a merge of two regions, since two different regions ("square" and "dot" regions in the ground truth) correspond to the same "square" region in the detected image. The remaining examples in this figure are self-explanatory, illustrating the split (e) and split-merge (f) situations.
B. Region Overlap
The region-based measures described herein depend on an overlap requirement T (see (19)) between the region of the ground truth and the detected region. Without this requirement, a single-pixel overlap would be enough to establish a match between a detected region and a region in the ground truth segmentation, which does not make sense.
[Fig. 4: each case shows the ground truth (left), the detector output (right), and the corresponding matrix:

(a) C = [1 0 0; 0 1 0; 0 0 1]    (b) C = [0 0; 1 0; 0 1]

(c) C = [1 0 0 0; 0 1 0 0; 0 0 1 0]    (d) C = [1 0; 1 0; 0 1]

(e) C = [1 1 0; 0 0 1]    (f) C = [1 1 0; 0 0 1; 0 1 0]]

Fig. 4. Different matching cases: (a) Correct detection; (b) Detection Failure; (c) False alarm; (d) Merge; (e) Split; (f) Split-Merge.
[Fig. 5: each case shows the ground truth (left), the detector output (right), and the corresponding matrix:

(a) C = [1 0 0; 0 0 0]    (b) C = [1 0 0; 0 1 0]

(c) C = [1 0 0; 0 0 0]    (d) C = [1 0 0; 0 1 1]]

Fig. 5. Matching cases with an overlap requirement of T = 20%: detection failure (overlap < T) (a); correct detection (overlap > T) (b); two detection failures (overlap < T) (c); and split (overlap > T) (d).
D. Multiple Interpretations
Sometimes the segmentation procedure is subjective, since each active region may contain several objects and
it is not always easy to determine if it is a single connected region or several disjoint regions. For instance, Fig.
6 (a) shows an input image and a manual segmentation. Three active regions were considered: person, lorry and
group of people. Fig. 6 (b) shows the segmentation results provided by the SGM algorithm. This algorithm splits
the group into three individuals which can also be considered as a valid solution since there is very little overlap.
This segmentation should be considered as an alternative ground truth. All these situations should not penalize the
performance of the algorithm. On the contrary, situations such as the ones depicted in Fig. 7 should be considered
as errors. Fig. 7 (a) shows the ground truth and in Fig. 7 (b) the segmentation provided by the W 4 algorithm. In
this situation the algorithm makes a wrong split of the vehicle.
(a) (b)
Fig. 6. Correct split example: (a) supervised segmentation, (b) SGM segmentation.
(a) (b)
Fig. 7. Wrong split example: (a) supervised segmentation, (b) W4 segmentation.
Since we do not know how the algorithm behaves in terms of merging or splitting, every possible combination of elements belonging to a group must be taken into account. For instance, another ambiguous situation is depicted in Fig. 8, which shows the segmentation results of the SGM method. Here we see that the same algorithm provides different segmentations (both of which can be considered correct) of the same group at different
instants. This suggests the use of multiple interpretations for the segmentation. To accomplish this, the evaluation setup takes into account all possible merges of single regions belonging to the same group whenever multiple interpretations should be considered, i.e., when there is a small overlap among the group members. The number of merges depends on the relative positions of the single regions. Fig. 9 shows two examples of different merged-region groups with three objects A, B, C (each one representing a person in the group). In the first example (Fig. 9 (a)) four interpretations are considered: all the objects are separated, all are merged into a single active region, or AB (respectively BC) are linked and the remaining object is isolated. In the second example an additional interpretation is added, since A can be linked with C.
Instead of asking the user to identify all the possible merges in an ambiguous situation, an algorithm is used to generate all the valid interpretations in two steps. First, we assign all the possible label sequences to the group regions. If the same label is assigned to two different regions, these regions are considered as merged. Equation (23)(a) shows the labelling matrix M for the example of Fig. 9 (a). Each row corresponds to a different labelling assignment, and the element M_ij denotes the label of the jth region in the ith labelling configuration. The second step checks whether the merged regions are close to each other and whether there is another region in the middle. The invalid labelling configurations are removed from the matrix M. The output of this step for the example of Fig. 9 (a) is shown in equation (23)(b): the labelling sequence 121 is discarded, since region 2 lies between regions 1 and 3 and therefore regions 1 and 3 cannot be merged. In the case of Fig. 9 (b) all the configurations are possible (M = M_FINAL). A detailed description of the labelling method is included in Appendix VII-A.
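For the collinear case of Fig. 9 (a), the two steps can be sketched as follows. This is a 1-D version with assumed names: the validity test simply forbids a merge across a differently-labelled region in the left-to-right order, which stands in for the geometric adjacency check described above:

```python
def labelings(n):
    """Step 1: all restricted-growth label sequences for n ordered
    regions (every possible way of merging them)."""
    seqs = [[1]]
    for _ in range(n - 1):
        seqs = [s + [k] for s in seqs for k in range(1, max(s) + 2)]
    return seqs

def valid(seq):
    """Step 2 (1-D sketch): a labelling is invalid when two regions
    with the same label are separated by a differently-labelled
    region, since the region in between would block the merge."""
    for a in range(len(seq)):
        for b in range(a + 2, len(seq)):
            if seq[a] == seq[b] and any(seq[c] != seq[a]
                                        for c in range(a + 1, b)):
                return False
    return True

interpretations = [s for s in labelings(3) if valid(s)]
```

For three collinear regions this reproduces the matrices M and M_FINAL of Eq. (23), with the sequence 121 filtered out.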
Figs. 10 and 11 illustrate the generation of the valid interpretations. Fig. 10 (a) shows the input frame, Fig. 10 (b) shows the hand-segmented image, where the user specifies all the objects (the three objects in the group of persons must be provided separately), and Fig. 10 (c) illustrates the output of the SGM. Fig. 11 shows all possible merges of the individual regions. All of them are considered correct. It remains to decide which segmentation should be selected to appraise the performance. In this paper we choose the best segmentation, i.e., the one that provides the highest number of correct detections. In the present example the segmentation illustrated in Fig. 11 (g) is selected. In this way we overcome the segmentation ambiguities that may appear, without penalizing the algorithm. This is the most complex situation which occurs in the video sequences used in this paper.
Fig. 8. Two different segmentations provided by the SGM method on the same group, taken at different time instants.
(a) (b)
Fig. 9. Region linking procedure with three objects A, B, C (from left to right). The same number of foreground regions may have different interpretations: three possible configurations (a), or four configurations (b). Each color represents a different region.
$$M = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 1 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix} \ \text{(a)} \qquad M_{FINAL} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix} \ \text{(b)} \qquad (23)$$
(a) (b) (c)
Fig. 10. Input frame (a), segmented image by the user (b), output of SGM (c).
V. TESTS ON PETS2001 DATASET
This section presents the evaluation of several object detection algorithms using the PETS2001 dataset. The training and test sequences of PETS2001 were used for this study. The training sequence has 3064 frames and the test sequence has 2688 frames. In both sequences, the first 100 images were used to build the background model for each algorithm. The resolution is half-resolution PAL standard (288 × 384 pixels, 25 frames per second). The algorithms were evaluated using one frame per second. The ground truth was generated by an automatic segmentation of the video signal followed by a manual correction using the graphical editor described in Section IV. The outputs of the algorithms were then compared with the ground truth. Most algorithms require the specification of the smallest area of an object; an area of 25 pixels was chosen, since it allows the detection of all objects of interest in the sequences.
(a) (b) (c) (d)
(e) (f) (g) (h)
Fig. 11. Multiple interpretations given by the application. The segmentation illustrated in (g) is selected for the current frame.
A. Choice of the Model Parameters
The segmentation algorithms described herein depend on a set of parameters, mainly the thresholds and the learning rate α. In this scenario, we must determine the best values of the most significant parameters for each algorithm. This was done using ROC curves, which display the performance of each algorithm as a function of the parameters. The receiver operating characteristic (ROC) has been extensively used in communications [9]. It is assumed that all the parameters are constant but one. In this case we kept the learning rate α constant and varied the threshold in an attempt to obtain the best threshold value T. We repeated this procedure for several values of α. This requires a considerable number of tests, but in this way it is possible to achieve a proper configuration of the algorithm parameters. These tests were made on a training sequence of the PETS2001 data set. Once the parameters are set, we use these values on a different sequence.
The ROC curves describe the evolution of the false alarms (FA) and detection failures (DF) as T varies. An ideal curve would be close to the origin, with an area under the curve close to zero. To obtain these two values, we compute these measures (for each value of T) by applying the region matching through the sequence. The final values are computed as the mean values of FA and DF.
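Each ROC point is therefore just the per-sequence mean of the two error counts for one threshold value. A sketch with an assumed data layout (per-frame (FA, DF) count pairs grouped by threshold):

```python
def roc_points(per_frame_errors):
    """Mean FA and DF over a sequence for each threshold value,
    yielding one ROC point per T. per_frame_errors maps each T to a
    list of (fa_count, df_count) pairs, one per evaluated frame."""
    points = {}
    for t, frames in per_frame_errors.items():
        fa = sum(f for f, _ in frames) / len(frames)
        df = sum(d for _, d in frames) / len(frames)
        points[t] = (fa, df)
    return points
```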
Fig. 12 shows the receiver operating curves (ROC) for all the algorithms. It is observed that the performance of the BBS algorithm is independent of α. We can also see that this algorithm is sensitive with respect to the threshold, since there is a large variation of FA and DF for small changes of T; this can be viewed as a lack of smoothness of the ROC curve (T = 0.2 is the best value). There is a large number of false alarms in the training sequence due to the presence of a static object (a car) which suddenly starts to move. The background image should be modified when the car starts to move. However, the image analysis algorithms are not able to cope with this situation, since they only consider slow adaptations of the background. A ghost region is therefore detected in the place where the car was (a false alarm).
The second row of the Fig. 12 shows the ROC curves of the SGM method, for three values of α (0.01, 0.05, 0.15).
This method is more robust than the BBS algorithm with respect to the threshold. We see that for −400 < T <
−150, and α = 0.01, α = 0.05 we get similar FA rates and a small variation of DF. We chose α = 0.05, T = −400.
The third row show the results of the M GM method. The best performances are obtained for α < 0.05 (first
and second column). The best value of the α parameter is α = 0.008. In fact, we observe the best performances
for α ≤ 0.01. We notice that the algorithm strongly depends on the value of T , since for small variations of T
there are significant changes of FA and DF. The ROC curve suggest that it is acceptable to choose T > 0.9.
The fourth row shows the results of the LOTS algorithm for a sensitivity varying from 10% to 110%.
As discussed in [29], we use a small α parameter. To reduce the computational burden, LOTS does not update
the background image in every single frame; instead, the background update takes place once every N frames.
For instance, an effective integration factor α = 0.0003 is achieved by adding approximately 1/13 of the
current frame to the background every 256th frame, or 1/6.5 every 512th frame.
Note that the update rule is Bt = Bt−1 + αDt, with Dt = It − Bt. In our case we have used intervals of 1024 (Fig. 12 (j)),
256 (Fig. 12 (k)) and 128 frames (Fig. 12 (l)), the best results being achieved in the first case. The latter two cases,
Fig. 12 (k) and (l), are shifted to the right with respect to (j), meaning that they produce a larger number of false alarms.
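As a quick sanity check on this schedule (not code from the paper), adding a fraction f of the difference image every N-th frame corresponds to an effective per-frame integration factor of roughly f/N:

```python
def effective_alpha(fraction, period):
    # B_t = B_{t-1} + fraction * (I_t - B_t), applied once every
    # `period` frames, behaves on average like a per-frame update
    # with alpha = fraction / period.
    return fraction / period

# Adding ~1/13 of the current frame every 256th frame, or ~1/6.5
# every 512th frame, both give an effective alpha of about 0.0003.
print(effective_alpha(1 / 13.0, 256))
print(effective_alpha(1 / 6.5, 512))
```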
From this study we conclude that the best ROC curves are those associated with LOTS and SGM, since
they have the smallest area under the curve.
[Fig. 12, panels (a)–(l): ROC curves plotting Detection Failures against False Alarms, sampled at several thresholds T (BBS, SGM, MGM) or sensitivities S (LOTS).]
Fig. 12. Receiver Operating Characteristic for different values of α: BBS (first row: (a) α = 0.05, (b) α = 0.1, (c) α = 0.15), SGM
(second row: (d) α = 0.01, (e) α = 0.05, (f) α = 0.15), MGM (third row: (g) α = 0.008, (h) α = 0.01, (i) α = 0.05), LOTS (fourth row,
with background update every: (j) 1024th frame, (k) 256th frame, (l) 128th frame).
B. Performance Evaluation
Table I (a),(b) shows the results obtained in the test sequence using the parameters selected in the previous
study. The percentages of correct detections, detection failures, splits, merges and split-merges were obtained by
normalizing the number of each type of event by the total number of moving objects in the image; their sum is
100%. The percentage of false alarms is defined by normalizing the number of false alarms by the total number of
detected objects, and is therefore a number in the range 0–100%.
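The two normalizations can be summarized in a small helper (a sketch, not the authors' code; the event counts are assumed to come from the region-matching step):

```python
def detection_statistics(correct, failures, splits, merges, split_merges,
                         false_alarms, n_detected):
    # CD, DF, splits, merges and split/merges are normalized by the total
    # number of moving objects, so these five percentages sum to 100%.
    n_objects = correct + failures + splits + merges + split_merges
    pct = lambda n: 100.0 * n / n_objects
    return {
        "CD": pct(correct), "DF": pct(failures), "Splits": pct(splits),
        "Merges": pct(merges), "Split/Merges": pct(split_merges),
        # false alarms are normalized by the number of *detected* objects
        "FA": 100.0 * false_alarms / n_detected,
    }
```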
Each algorithm is characterized in terms of correct detections, detection failures, splits, merges, split/merges,
false alarms, and matching area.
Two types of ground truth were used, corresponding to different interpretations of static objects. If a moving
object stops and remains still, it is considered an active region in the first case (Table I (a)); in the second case
(Table I (b)), it is integrated in the background after one minute. For example, a car that stops in front of the camera
remains an active region in the first case, but is ignored after one minute in the second.
Let us consider the first case. The results are shown in Table I (a). In terms of correct detections, the best results are achieved by the LOTS (91.2%) algorithm followed by SGM (86.8%).
Concerning detection failures, LOTS (8.5%) followed by W4 (9.6%) outperform all the others. The
worst results are obtained by MGM (13.1%). This is somewhat surprising, since the MGM method, based on
multiple Gaussians per pixel, performs worse than the SGM method based on a single Gaussian. We
discuss this issue below. W4 has the highest percentage of splits, and the BBS and MGM methods tend to split
the regions as well. The performance of the methods in terms of region merging is excellent: very few merges are
observed in the segmented data. However, some methods tend to produce split/merge errors (e.g., W4, SGM and
BBS); the LOTS and MGM algorithms have the best scores in terms of split/merge errors.
Let us now consider the false alarms (false positives). LOTS (0.6%) is the best, and MGM and BBS
are the worst; the LOTS, W4 and SGM methods are much better than the others in terms of false alarms.
LOTS has the best tradeoff between CD and FA. Although W4 produces many splits, these can often be
overcome in tracking applications, since the region matching algorithms are able to track the active regions even when
they are split. The LOTS algorithm has the best performance if all the errors are equally important.
In terms of matching area, LOTS exhibits the best value in both situations.
In this study, the performance of the MGM method, based on mixtures of Gaussians, is unexpectedly low.
During the experiments we observed the following: i) when an object undergoes a slow motion and stops, the
algorithm ceases to detect the object after a short period of time; ii) when an object enters the scene, it is not
well detected during the first few frames, since the Gaussian modes have to adapt to this case.
This explains the percentage of splits in both tables. In fact, when a moving object stops, the MGM
starts to split the region until it disappears, becoming part of the background. Objects entering the scene will
cause some detection failures (during the first frames) and splits, while the MGM method separates the
foreground region from the background.
Comparing the results in Table I (a) and (b), we can see that the performance of the MGM is improved. The
detection failures are reduced, meaning that the stopped car is correctly integrated in the background. This produces
an increase of correct detections by the same amount. However, we stress that the percentage of false alarms also
increases. This means that the removal of the false positives is not stable: some frames still contain, as small
active regions, the object which stops in the scene. For the other methods, an increase of the false alarm
percentage is expected, since these algorithms retain false positives throughout the sequence.
The computational complexity of all methods was studied to judge the performance of the five algorithms. Details
about the number of operations in each method are provided in Appendix VII-B.
%                    BBS    W4     SGM    MGM    LOTS
Correct Detections   84.3   81.6   86.8   85.0   91.2
Detection Failures   12.2   9.6    11.5   13.1   8.5
Splits               2.9    5.4    0.2    1.9    0.3
Merges               0      1.0    0      0      0
Split/Merges         0.6    1.8    1.5    0      0
False Alarms         22.5   8.5    11.3   24.3   0.6
Matching Area        64.7   50.4   61.9   61.3   78.8
(a)

%                    BBS    W4     SGM    MGM    LOTS
Correct Detections   83.5   84.0   86.4   85.4   91.0
Detection Failures   12.4   8.5    11.7   12.0   8.8
Splits               3.3    4.3    0.2    2.6    0.3
Merges               0      0.8    0      0      0
Split/Merges         0.8    1.8    1.7    0      0
False Alarms         27.0   15.2   17.0   28.2   7.2
Matching Area        61.3   53.6   61.8   65.6   78.1
(b)

TABLE I
PERFORMANCE OF FIVE OBJECT DETECTION ALGORITHMS.
VI. CONCLUSIONS
This paper proposes a framework for the evaluation of object detection algorithms in surveillance applications.
The proposed method is based on the comparison of the detector output with a ground truth segmented sequence
sampled at 1 frame per second. The difference between both segmentations is evaluated, and the segmentation
errors are classified into detection failures, false alarms, splits, merges and split/merges. To cope with ambiguous
situations, in which we do not know whether two or more objects belong to a single active region or to several regions, we
consider multiple interpretations of the ambiguous frames. These interpretations are controlled by the user through
a graphical interface.
The proposed method provides a statistical characterization of the object detection algorithm by measuring the
percentage of each type of error. The user can thus select the best algorithm for a specific application, taking into
account the influence of each type of error on the performance of the overall system. For example, in object tracking,
detection failures are worse than splits; we should therefore select a method with fewer detection failures, even if it
has more splits than another method.
Five algorithms were considered in this paper to illustrate the proposed evaluation method:
Basic Background Subtraction (BBS), W4, Single Gaussian Model (SGM), Multiple Gaussian Model (MGM), and the
Lehigh Omnidirectional Tracking System (LOTS). The best results were achieved by the LOTS and SGM
algorithms.
Acknowledgement: We are very grateful to the three anonymous reviewers for their useful comments and
suggestions. We also thank R. Oliveira and P. Ribeiro for kindly providing the code of the LOTS detector.
VII. APPENDIX
A. Merge Regions Algorithm
The pseudo code of the region labelling algorithm is given in Algorithms 1 and 2.
Algorithm 1 describes the first step, i.e., the generation of the label configurations. When the same
label is assigned to two different regions, these regions are considered as merged. Algorithm 2
describes the second step, which checks and eliminates label sequences containing invalid
merges: every time the same label is assigned to a pair of regions, we define a strip connecting the mass centers of
the two regions and check whether the strip is intersected by any other region. If so, the labelling sequence is considered
invalid.
In these algorithms, N denotes the number of objects, label is a labelling sequence, M is the matrix of all label
configurations, and MFINAL is a matrix which contains the information (the final label configurations) needed to create
the merges.
Algorithm 1 Main
1: N ← Num;
2: M(1) ← 1;
3: for t = 2 to N do
4:   AUX ← [ ];
5:   for i = 1 to size(M, 1) do
6:     label ← max(M(i, :)) + 1;
7:     AUX ← [AUX; [repmat(M(i, :), label, 1) (1 : label)T]];
8:   end for
9:   M ← AUX;
10: end for
11: MFINAL ← FinalConfiguration(M);
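Algorithm 1 can be transcribed into Python as follows (a sketch; the paper uses MATLAB-style pseudo code). Each generated row assigns a label to each of the N regions, and two regions sharing a label are a merge candidate:

```python
def label_configurations(n):
    """All label sequences for n regions, as produced by Algorithm 1."""
    M = [[1]]
    for _ in range(2, n + 1):
        # each existing row may reuse any label seen so far, or open a new one
        M = [row + [lab] for row in M for lab in range(1, max(row) + 2)]
    return M
```

For four regions this yields the 15 rows of the matrix M in Eq. (24)(a).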
Fig. 13. Generation of the label sequences for the example in the Fig. 14.
To illustrate the purpose of Algorithms 1 and 2, consider the example in Fig. 14, where
each rectangle in the image represents an active region.
Algorithm 1 computes the leaves of the graph shown in Fig. 13, i.e., all label sequences.
Algorithm 2 MFINAL = FinalConfiguration(M)
1: MFINAL ← [ ];
2: for i = 1 to length(M) do
3:   Compute the centroids of the objects to be linked in M(i, :);
4:   Link the centroids with strip lines;
5:   if the strip lines do not intersect another object region then
6:     MFINAL ← [MFINALT M(i, :)T]T;
7:   end if
8: end for
Fig. 14. Four rectangles A,B,C,D representing active regions in the image.
Algorithm 2 checks each sequence taking into account the relative positions of the objects in the image. For
example, configurations 1212 and 1213 are considered invalid, since object A cannot be merged with C (see Fig. 14).
Equations (24)(a) and (b) show the output of the first and second steps, respectively. The valid labelling sequences
(the contents of the matrix MFINAL) produce the resulting images shown in Fig. 15.
M =
1 1 1 1
1 1 1 2
1 1 2 1
1 1 2 2
1 1 2 3
1 2 1 1
1 2 1 2
1 2 1 3
1 2 2 1
1 2 2 2
1 2 2 3
1 2 3 1
1 2 3 2
1 2 3 3
1 2 3 4
(a)

MFINAL =
1 1 1 1
1 1 1 2
1 1 2 2
1 1 2 3
1 2 2 2
1 2 2 3
1 2 3 3
1 2 3 4
(b)

(24)
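For the collinear layout A-B-C-D of Fig. 14, the strip-intersection test of Algorithm 2 reduces to a contiguity check: a label sequence is valid only if equal labels occupy consecutive regions (1212 is invalid because the strip from A to C crosses B). A sketch under that assumption, with the Algorithm 1 generator repeated to keep it self-contained:

```python
def label_configurations(n):
    # Algorithm 1: all label sequences for n regions
    M = [[1]]
    for _ in range(2, n + 1):
        M = [row + [lab] for row in M for lab in range(1, max(row) + 2)]
    return M

def is_valid(row):
    # equal labels must form one consecutive run of regions
    for lab in set(row):
        idx = [i for i, l in enumerate(row) if l == lab]
        if idx[-1] - idx[0] + 1 != len(idx):
            return False
    return True

valid = [row for row in label_configurations(4) if is_valid(row)]
```

This keeps exactly the 8 rows of MFINAL in Eq. (24)(b).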
B. Computational Complexity
Computational complexity was also studied to judge the performance of the five algorithms. Next, we provide
comparative data on computational complexity using the “Big-O” analysis.
Let us define the following variables:
7/31/2019 De Printat Articole
http://slidepdf.com/reader/full/de-printat-articole 57/101
24
Fig. 15. Valid merges generated from the example in the Fig. 14.
• N, number of images in the sequence,
• L, C, number of lines and columns of the image,
• R, number of regions detected in the image,
• Ng, number of Gaussians.
The BBS, W4, SGM, MGM and LOTS methods share several common operations, namely: i) morphological
operations for noise cleaning, ii) computation of the areas of the regions, and iii) label assignment.
The complexity of these three operations is

K = (2 × (ℓ × c) − 1) × (L × C)   (morphological op.)
    + (L × C) + R                 (region areas op.)
    + R × (L × C)                 (labelling op.)     (25)
where ℓ, c are the kernel dimensions (ℓ × c = 9; 8-connectivity is used), L, C are the image dimensions and R is
the number of detected regions. The first term, 2 × (ℓ × c) − 1, is the number of products and summations required
for the convolution at each pixel of the image. The second term, (L × C) + R, is the number of differences taken
to compute the areas of the regions in the image. Finally, the term R × (L × C) is the number of operations needed to
label all the regions in the image.
BBS Algorithm
The complexity of the BBS is

O([11 × (L × C)   (threshold op.)
   + K] × N)     (26)

where 11 × (L × C) is the number of operations required to perform the thresholding step (see (1)), which involves
3 × (L × C) differences and 8 × (L × C) logical operations.
W4 Algorithm
The complexity of this method is

O([2 × [2p³ + (L × C) × (p + (p − 1))]   (rgb2gray op.)
   + 9 × (L × C)                         (threshold op.)
   + K + KW4] × N)     (27)
7/31/2019 De Printat Articole
http://slidepdf.com/reader/full/de-printat-articole 58/101
25
where the first term is related to the conversion of the images to grayscale, with p = 3 (RGB space). The second
term concerns the threshold operation (see (2)), which requires 9 × (L × C) operations (8 logical operations and 1
difference). The term KW4 corresponds to the background subtraction and morphological operations
inside the bounding boxes of the foreground regions:

KW4 = R × 9 × (Lr × Cr)                  (threshold op.)
      + (2 × (ℓ × c) − 1) × (Lr × Cr)    (morphological op.)
      + (L × C) + R                      (region areas op.)
      + R × (L × C)                      (labelling op.)     (28)

where Lr, Cr are the dimensions of the bounding boxes, assuming that the bounding boxes of the active regions
all have the same length and width.
SGM Algorithm
The complexity of the SGM method is

O([p × [2p × (L × C)]   (rgb2yuv op.)
   + 28 × (L × C)       (likelihood op.)
   + (L × C)            (threshold op.)
   + K] × N)     (29)
The first term is related to the conversion of the images to the YUV color space (in (29), p = 3). The second term
is the number of operations required to compute the likelihood measure (see (5)). The third term is related to the
threshold operation, which classifies a pixel as foreground if the likelihood is greater than a threshold, and as
background otherwise.
MGM Algorithm
The number of operations of the MGM method is

O([Ng × (136 × (L × C))          (mixture modelling)
   + 2 × (2Ng − 1) × (L × C)     (norm. and mixture op.)
   + K] × N)     (30)
The first term depends on the number of Gaussians Ng and is related to the following operations: i) matching
operation, 70 × (L × C); ii) weight update, 3 × (L × C) (see (11)); iii) background update, 3 × 8 × (L × C)
(see (8)); iv) covariance update for all color components, 3 × 13 × (L × C) (see (9)). The second term accounts
for: i) weight normalization, (2Ng − 1) × (L × C), and ii) computation of the Gaussian mixture for all pixels,
(2Ng − 1) × (L × C).
LOTS Algorithm
The complexity of the LOTS method is

O([[2p³ + (L × C) × (p + (p − 1))]             (rgb2gray op.)
   + 11 × (L × C) + (2 × (Lb × Cb) − 1) × nb
   + (2 × (ℓ × c) − 1) × (Lrsize × Crsize)
   + (Lrsize × Crsize)                         (QCC op.)
   + K] × N)     (31)
The first term is related to the conversion of the images and is similar to the first term in (27). The second
term is related to the QCC algorithm: 11 × (L × C) operations are needed to compute (17), (18).
Method   Simplified expression              Total operations
BBS      1 + 30 × (L × C)                   3.3 × 10^6
LOTS     55 + (35 + 145/64) × (L × C)       4.1 × 10^6
W4       760 + 40 × (L × C)                 4.4 × 10^6
SGM      1 + 66 × (L × C)                   7.2 × 10^6
MGM      1 + 437 × (L × C)                  48.3 × 10^6

TABLE II
THE SECOND COLUMN GIVES THE SIMPLIFIED EXPRESSION FOR EQUATIONS (26, 27, 29, 30, 31). THE THIRD COLUMN GIVES THE
TOTAL NUMBER OF OPERATIONS.
The QCC analysis is computed on low-resolution images PH, PL. This is accomplished by converting each block
of Lb × Cb pixels (in the high-resolution images) into one element of the new matrices (PH, PL). Each element of
PH, PL contains the active pixels of the corresponding block in the respective image. This task requires
(2 × (Lb × Cb) − 1) × nb operations (second term of QCC in (31)), where (Lb × Cb) is the size of each block and
nb is the number of blocks in the image. A morphological operation (4-connectivity is used) over PH is performed,
taking (2 × (ℓ × c) − 1) × (Lrsize × Crsize) operations, where (Lrsize × Crsize) is the dimension of the resized
images. The target candidates are obtained by comparing PH and PL; this task takes (Lrsize × Crsize) operations
(fourth term in QCC).
For example, the complexity of the five algorithms is shown in Table II, assuming the following conditions for
each frame:
• the kernel dimensions, ℓ × c = 9,
• the block dimensions, Lb × Cb = 8 × 8, i.e., (Lrsize × Crsize) = (L × C)/64 (for the LOTS method),
• the number of Gaussians, Ng = 3 (for the MGM method),
• a single region is detected, with an area of 25 pixels (R = 1, Lr × Cr = 25),
• the image dimension is (L × C) = 288 × 384.
From the table, we conclude that four of the algorithms (BBS, LOTS, W4, SGM) have a similar computational
complexity, whilst MGM is more complex, requiring a higher computational cost.
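The totals in Table II follow directly from the simplified expressions with (L × C) = 288 × 384; a quick numeric check:

```python
LC = 288 * 384  # image dimension L x C

ops = {  # simplified per-frame operation counts from Table II
    "BBS":  1 + 30 * LC,
    "LOTS": 55 + (35 + 145 / 64) * LC,
    "W4":   760 + 40 * LC,
    "SGM":  1 + 66 * LC,
    "MGM":  1 + 437 * LC,
}
for name, total in ops.items():
    print(name, total)
```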
R EFERENCES
[1] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern
Anal. Machine Intell., vol. 19, no. 7, pp. 780–785, July 1997.
[2] C. Stauffer, W. Eric, and L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Machine
Intell., vol. 22, no. 8, pp. 747–757, August 2000.
[3] S. J. McKenna and S. Gong, “Tracking colour objects using adaptive mixture models,” Image Vision Computing , vol. 17, pp. 225–231,
1999.
[4] N. Ohta, “A statistical approach to background suppression for surveillance systems,” in Proceedings of IEEE Int. Conference on Computer Vision, 2001, pp. 481–486.
[5] I. Haritaoglu, D. Harwood, and L. S. Davis, “W 4: Who? when? where? what? a real time system for detecting and tracking people,”
in IEEE International Conference on Automatic Face and Gesture Recognition, April 1998, pp. 222–227.
[6] M. Seki, H. Fujiwara, and K. Sumi, “A robust background subtraction method for changing background,” in Proceedings of IEEE
Workshop on Applications of Computer Vision, 2000, pp. 207–213.
[7] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russel, “Towards robust automatic traffic scene analysis in
real-time,” in Proceedings of Int. Conference on Pattern Recognition, 1994, pp. 126–131.
[8] R. Collins, A. Lipton, and T. Kanade, “A system for video surveillance and monitoring,” in Proc. American Nuclear Society (ANS)
Eighth Int. Topical Meeting on Robotic and Remote Systems, Pittsburgh, PA, April 1999, pp. 25–29.
[9] H. V. Trees, Detection, Estimation, and Modulation Theory. John Wiley and Sons, 2001.
[10] T. H. Chalidabhongse, K. Kim, D. Harwood, and L. Davis, “A perturbation method for evaluating background subtraction algorithms,”
in Proc. Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS
2003), Nice, France, October 2003.
[11] X. Gao, T.E.Boult, F. Coetzee, and V. Ramesh, “Error analysis of background adaption,” in IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2000, pp. 503–510.
[12] F. Oberti, A. Teschioni, and C. S. Regazzoni, “Roc curves for performance evaluation of video sequences processing systems for
surveillance applications,” in IEEE Int. Conf. on Image Processing , vol. 2, 1999, pp. 949–953.
[13] J. Black, T. Ellis, and P. Rosin, “A novel method for video tracking performance evaluation,” in Joint IEEE Int. Workshop on Visual
Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, 2003, pp. 125–132.
[14] P. Correia and F. Pereira, “Objective evaluation of relative segmentation quality,” in Int. Conference on Image Processing , 2000, pp.
308–311.
[15] C. E. Erdem, B. Sankur, and A. M.Tekalp, “Performance measures for video object segmentation and tracking,” IEEE Trans. Image
Processing , vol. 13, no. 7, pp. 937–951, 2004.
[16] V. Y. Mariano, J. Min, J.-H. Park, R. Kasturi, D. Mihalcik, H. Li, D. Doermann, and T. Drayer, “Performance evaluation of object
detection algorithms,” in Proceedings of 16th Int. Conf. on Pattern Recognition (ICPR02), vol. 3, 2002, pp. 965–969.
[17] I. Haritaoglu, D. Harwood, and L. S. Davis, “W 4: real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal.
Machine Intell., vol. 22, no. 8, pp. 809–830, August 2000.
[18] T. Boult, R. Micheals, X. Gao, and M. Eckmann, “Into the woods: Visual surveillance of non-cooperative camouflaged targets in
complex outdoor settings,” in Proceedings of the IEEE, October 2001, pp. 1382–1402.
[19] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Prentice Hall, 2002.
[20] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts and shadows in video streams,” IEEE Trans.
Pattern Anal. Machine Intell., vol. 25, no. 10, pp. 1337–1342, 2003.
[21] Y.-F. Ma and H.-J. Zhang, “Detecting motion object by spatio-temporal entropy,” in IEEE Int. Conf. on Multimedia and Expo, Tokyo,
Japan, August 2001.
[22] R. Souvenir, J. Wright, and R. Pless, “Spatio-temporal detection and isolation: Results on the PETS2005 datasets,” in Proceedings of
the IEEE Workshop on Performance Evaluation in Tracking and Surveillance, 2005.
[23] H. Sun, T. Feng, and T. Tan, “Spatio-temporal segmentation for video surveillance,” in IEEE Int. Conf. on Pattern Recognition, vol. 1,
Barcelona, Spain, September, pp. 843–846.
[24] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh, “Background modeling and subtraction of dynamic scenes,” in Proceedings of the
ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 1305–1312.
[25] J. Zhong and S. Sclaroff, “Segmenting foreground objects from a dynamic, textured background via a robust Kalman filter,” in
Proceedings of the ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 44–50.
[26] N. T. Siebel and S. J. Maybank, “Real-time tracking of pedestrians and vehicles,” in Proc. of IEEE Workshop on Performance Evaluation
of Tracking and Surveillance, 2001.
[27] R. Cucchiara, C. Grana, and A. Prati, “Detecting moving objects and their shadows: an evaluation with the PETS2002 dataset,” in
Proceedings of Third IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2002) in conj. with
ECCV 2002, Pittsburgh, PA, May 2002, pp. 18–25.
[28] Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, and Hasegawa, “A system for video surveillance and monitoring:
Vsam final report,” Robotics Institute, Carnegie Mellon University, Tech. Rep. Technical report CMU-RI-TR-00-12, May 2000.
[29] T. Boult, R. Micheals, X. Gao, W. Y. P. Lewis, C. Power, and A. Erkan, “Frame-rate omnidirectional surveillance and tracking of
camouflaged and occluded targets,” in Second IEEE International Workshop on Visual Surveillance, 1999, pp. 48–55.
Segmentation and Classification of Human Activities∗
J.C. Nascimento1   M. A. T. Figueiredo2   J. S. Marques3
[email protected]   [email protected]   [email protected]
1,3Instituto de Sistemas e Robotica   2Instituto de Telecomunicacoes
Instituto Superior Tecnico, 1049-001 Lisboa
PORTUGAL
Abstract
This paper describes an algorithm for segmenting and classifying human activities from video sequences
of a shopping center. These activities comprise entering or exiting a shop, passing, or browsing in front
of shop windows. The proposed approach recognizes these activities by using a priori knowledge of the
layout of the observed shopping area. Human actions are represented by a bank of switched dynamical models,
each tailored to describe a specific motion regime. Experimental tests illustrate the effectiveness of the
proposed approach with synthetic and real data.
Keywords: Surveillance, Segmentation, Classification, Human Activities, Minimum Description Length.
1 Introduction
The analysis of human activities is an important computer vision research topic with applications in surveillance, e.g.
in developing automated security applications. In this paper, we focus on recognizing human activities in a shopping
center.
In commercial spaces, it is common to have many surveillance cameras. The monitoring room is usually equipped
with a large set of monitors which a human operator uses to watch over the areas observed by the cameras.
This requires a considerable effort from the operator, who has to somehow multiplex his/her attention. In recent
years a considerable effort has been devoted to developing automatic surveillance systems that provide information about
the activities taking place in a given space. With such a system, it would be possible to monitor the actions of individuals,
determining their nature and discerning common activities from inappropriate behavior (for example, standing for a long
period of time at the entrance of a shop, or fighting).
In this paper, we aim at labelling common activities taking place in the shopping space.1 Activities are recognized
from motion patterns associated with each person tracked by the system. Motion is described by a sequence of
displacements of the 2D centroid (mean position) of each person’s blob. The trajectory is modelled by multiple
dynamical models with a switching mechanism. Since the trajectory is described by its appearance, we compute the
statistics needed to identify the dynamical models involved in a trajectory.
The rest of the paper is organized as follows. Section 2 deals with related work. Section 3 describes the statistical
activity model. Section 4 derives the segmentation algorithm. Section 5 reports experimental results with synthetic
data and real video sequences. Section 6 concludes the paper.
2 Related Work
The analysis of human activities has been extensively addressed in several ways using different types of features and
inference methods. Typically, a set of motion features is extracted from the video signal and an inference model is
used to classify it into one of c possible classes.
For example, in [16] the human body is approximated by a set of segments, and atomic activities are then defined as
vectors of temporal measurements which capture the evolution of the five body parts. In other works the human body
is simply represented by the mass center of its active region (blob) in the image plane [12], or by the body blob as in [4].
The activity is then represented by the trajectory obtained from the blob center, or from the correspondence of body
blob regions, respectively.
Other works try to characterize human activity directly from the video signal, without segmenting the active
regions. In [2] human activities are characterized by temporal templates, which try to convey information
about “where” and “how” motion is performed. Two templates are created: a binary motion-energy image, which
represents where the motion has occurred in the whole sequence, and a scalar motion-history image, which represents
∗This work was partially supported by FCT under project CAVIAR (IST-2001-37540).
1This work is integrated in project CAVIAR, which has the general goal of representing and recognizing contexts and situations. An introduction
and the main goals of the project can be found in http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm
HAREM 2005 - International Workshop on Human Activity Recognition and Modelling,
Oxford, UK, September 2005
Figure 2: Examples of three different activities (entering, exiting, passing).
From xt we can obtain ∆xit, where ∆xit contains the displacements of xt known to have been generated by the ith
model. Defining ∆Xi = {∆xi1, ∆xi2, ..., ∆xiN} as the vector containing all the displacements of the ith model in the
training set, we have, for the ith model:

µi = (1/♯∆Xi) ∑ ∆Xit,    Qi = (1/♯∆Xi) ∑ (∆Xi − µi)(∆Xi − µi)T,    (2)

where µi and Qi are standard estimates of the mean and the covariance matrix, respectively.
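Equation (2) is just the sample mean and the (biased) sample covariance of the displacement vectors assigned to one model; with numpy:

```python
import numpy as np

def model_parameters(dX):
    """dX: (n, d) array of the n displacement vectors generated by one
    motion model.  Returns the estimates mu_i, Q_i of Eq. (2)."""
    mu = dX.mean(axis=0)
    centered = dX - mu
    Q = centered.T @ centered / len(dX)  # normalized by #dX, as in Eq. (2)
    return mu, Q
```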
4.2 Segmentation and Classification
Having defined the set of models and the corresponding parameters, one can now classify a test trajectory xt. One
way to attain this goal is to compute the likelihood of xt under the models. In this paper, the activity depends on
the number of model switchings. In Fig. 2, we see that “passing” can be described using just one model, while the
activities “entering” and “exiting” can be described using two dynamical models. The fourth activity considered,
“browsing”, requires three models: the person walks, stops to look at the shop window, and restarts walking.
This behavior was observed in all the other samples of the activities occurring in this context. This means that
we have to estimate the time instants at which the model switchings happen.
Assuming that the sequence xt has n samples and is described by T segments (with T known), the log-likelihood is

L(m1, ..., mT, t1, ..., tT−1) = log p(∆x1, ..., ∆xn | m1, m2, ..., mT, t1, t2, ..., tT−1)    (3)

where m1, ..., mT is the sequence of model labels describing the trajectory and ti, for i = 1, ..., T−1, is the time instant
when switching from model mi to mi+1 occurs. If T = 1, there is no switching.
Due to the conditional independence assumption underlying (1), the log-likelihood can be written as

L(∆x1, ..., ∆xn | m1, ..., mT, t1, ..., tT−1) = ∑_{j=1..T} ∑_{i=t_{j−1}..t_j} log p(∆xi | mj)
                                             = ∑_{j=1..T} ∑_{i=t_{j−1}..t_j} log N(∆xi | µ_{m_j}, Q_{m_j})    (4)
where we define t_0 = 1, T is the number of segments, and t_j are the switch times. Assuming that T is known, we can “segment” the sequence (i.e., estimate m_1, …, m_T and t_1, …, t_{T−1}) using the maximum-likelihood approach:

m_1, …, m_T, t_1, …, t_{T−1} = argmax L(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1})   (5)
This maximization can be performed in a nested way,

t_1, …, t_{T−1} = argmax_{t_1,…,t_{T−1}} [ max_{m_1,…,m_T} L(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1}) ]   (6)
In fact, the inner maximization can be decoupled as

max_{m_1,…,m_T} L(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1}) = ∑_{j=1}^{T} max_{m_j} ∑_{i=t_{j−1}}^{t_j} log p(∆x_i | m_j)   (7)
where the maximization with respect to each m_j is a simple maximum-likelihood classification of the sub-sequence of samples (∆x_{t_{j−1}}, …, ∆x_{t_j}) into one of a set of Gaussian classes. Finally, the maximization with respect to t_1, …, t_{T−1} is done by exhaustive search (this is never too expensive, since we consider a maximum of three segments).
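The nested maximization of Eqs. (6)-(7) can be sketched directly: precompute the per-sample log-likelihood under each Gaussian model, then exhaustively search the switch times, while the inner maximization over model labels decouples per segment. A minimal NumPy sketch, with hypothetical helper names (`gauss_loglik`, `segment`) and models given as (µ, Q) pairs:

```python
import itertools
import numpy as np

def gauss_loglik(dx, mu, Q):
    """Log-density of N(dx | mu, Q) for a single displacement vector."""
    d = dx - mu
    _, logdet = np.linalg.slogdet(Q)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(Q, d))

def segment(displacements, models, T):
    """Exhaustive search over switch times (Eq. (6)); the inner
    maximisation over model labels decouples per segment (Eq. (7)).
    `models` is a list of (mu, Q) pairs. Returns (labels, switches, L)."""
    n = len(displacements)
    # Per-sample log-likelihood under each model, computed once.
    ll = np.array([[gauss_loglik(dx, mu, Q) for (mu, Q) in models]
                   for dx in displacements])
    best = (None, None, -np.inf)
    for switches in itertools.combinations(range(1, n), T - 1):
        bounds = [0, *switches, n]
        labels, total = [], 0.0
        for a, b in zip(bounds[:-1], bounds[1:]):
            seg = ll[a:b].sum(axis=0)         # sum over samples, per model
            labels.append(int(seg.argmax()))  # ML classifier per segment
            total += seg.max()
        if total > best[2]:
            best = (labels, list(switches), total)
    return best
```

With T at most three, the number of candidate switch-time combinations stays small, which is why the exhaustive search in the paper remains cheap.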
4.3 Estimating the number of models of the activity
4.3.1 MDL Criterion
In the previous section, we derived the segmentation criterion assuming that the number of segments T is known. As is well known, the same criterion cannot be used to select T, as it would always return the largest possible number of segments. We are thus in the presence of a model selection problem, which we address using the minimum description length (MDL) criterion [14]. The MDL criterion for selecting T is
T = argmin_T { − log p(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1}) + M(m_1, …, m_T, t_1, …, t_{T−1}) }   (8)
where M(m_1, …, m_T, t_1, …, t_{T−1}) is the number of bits required to encode the selected model indices and the estimated switching times. Notice that we do not have the usual (1/2) log n term, because the real-valued model parameters (means and covariances) are assumed fixed (previously estimated). Finally, it is easy to conclude that
M(m_1, …, m_T, t_1, …, t_{T−1}) ≈ T log c + (T−1) log n   (9)

where T log c is the code length for the model indices m_1, …, m_T, since each belongs to {1, …, c}, and (T−1) log n is the code length for t_1, …, t_{T−1}, because each belongs to {1, …, n}; we have ignored the fact that two switchings cannot occur at the same time, because T ≪ n.
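Given the best negative log-likelihood found for each candidate T (e.g. by the exhaustive segmentation above), the MDL criterion (8)-(9) amounts to adding the code length T log c + (T − 1) log n and picking the minimum. A minimal sketch; the function name and input format are our own:

```python
import numpy as np

def mdl_select(neg_loglik_by_T, c, n):
    """Pick the number of segments T via the MDL criterion (8)-(9).
    `neg_loglik_by_T[T]` is the best negative log-likelihood achieved
    with T segments; c is the number of models, n the number of samples."""
    best_T, best_score = None, np.inf
    for T, nll in neg_loglik_by_T.items():
        # Penalised score: -log p(...) + T log c + (T-1) log n
        score = nll + T * np.log(c) + (T - 1) * np.log(n)
        if score < best_score:
            best_T, best_score = T, score
    return best_T
```

The penalty grows linearly in T, so a larger T is only selected when the extra segment buys a sufficiently large likelihood improvement.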
5 Experimental results
This section presents results with synthetic and real data. In the synthetic case, we performed Monte Carlo tests. We considered five models (c = 5), shown in Fig. 3. The synthetic models shown in Fig. 3(a) were obtained by simulating four activities of a person, using the generation model in (1). Fig. 4 shows examples of activities (the trajectory shape of “Leaving” is the same as “Entering”, but with the opposite direction). Here, the thin (green) rectangles correspond to areas where the trajectory begins. The first sample of x_t in these areas is random, because the agent may appear at random places in the scene. The wide (yellow) rectangle is the area in which a model switching occurs. In this figure the trajectories are generated with two segments (“Entering”, “Leaving”, “Passing”) and with three segments (“Browsing”).

For each activity we generate 100 test samples using (1) and classify each of them into one of the four classes. Fig. 5 shows the displacements ∆x_t (black dots) of the test sequences (“Entering” and “Passing”) overlapped with the five models. We can see that the displacements lie on the right-up clusters (“Entering”) and the right cluster (“Passing”). In this experiment, all the test sequences were correctly classified (100% accuracy).
Figure 3: Five models considered to describe the trajectories. Each color corresponds to a different model. Synthetic case (a), real case (b).
We also generated different test trajectories, because exiting and entering may occur in directions different from the ones in Fig. 4. These examples are illustrated in Fig. 6. In this new experiment, the same 100% accuracy was obtained.
Figure 4: Examples of synthetic activities (performed in left-right direction): (a) entering, (b) passing, (c) browsing.
Figure 5: Five models with the displacements (black dots) of the test activities: (a) entering, (b) passing.
The proposed algorithm was also tested with real data. The video sequences were acquired in the context of the EC-funded project CAVIAR. The video sequences comprise human activities observed in an indoor plaza and a shopping center, featuring individuals and small groups of people. Ground truth was hand-labelled for all sequences.2 Fig. 7 shows the bounding boxes as well as the centroids, which is the information used for the segmentation.
As in the synthetic case, we also estimated the statistics of the considered models. The procedure is the same as before, using training sequences. Fig. 3(b) shows the clusters of the models.
Fig. 8 shows several activities performed at the shopping center, with the time instants of the model switching marked with small red circles. From this experiment, it can be seen that the proposed approach correctly determines the switching times between models.
We have tested the proposed approach on more than 40 trajectories from 25 movies of about 5 minutes each. We present the results of some of those activities in Tables 1 and 2. These tables show the penalized log-likelihood values (8) of each test sequence. The first table refers to all activities performed in the left-right direction, whilst the second table reports all activities performed in the opposite direction. In the first table the classes referring to entering, exiting, passing and browsing are right-upwards, downwards-right, right, and right-stop-right, respectively, whereas in the second table the classes are left-upwards, downwards-left, left and left-stop-left. It can be observed that the classifier correctly assigns the activities to the corresponding classes, exhibiting results as good as in the previous synthetic examples.
6 Conclusions

In this paper we have proposed and tested an algorithm for the modelling, segmentation, and classification of human activities in a constrained environment. The proposed approach uses switched dynamical models to represent human trajectories. It was illustrated that the switching time instants are effectively determined, despite the significant random perturbations that the trajectory may contain. It is demonstrated that the proposed approach provides good

2The ground-truth-labelled video sequences are provided at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
Figure 6: Synthetic activities with different dynamic models (entering, exiting, passing).
Figure 7: Bounding boxes and centroids of the pedestrians performing activities.
results with synthetic and real data obtained in a shopping center. The proposed method is able to effectively recognize instances of the learned activities. The activities studied herein can be interpreted as atomic, in the sense that they are simple events. Compound actions or complex events can be represented as concatenations of the activities studied in this paper. This is one of the issues to be addressed in the future.
Acknowledgement: We would like to thank Prof. Jose Santos Victor of ISR and the members of the CAVIAR project for providing video data of human activities with the ground truth information.
Figure 8: Samples of different activities. The large circles mark the computed time instants where the model switches: entering (first column); exiting (second column); browsing (third column).
Test trajectories
Classes      E1     E2     Ex1    Ex2    P1     P2     B
Entering    187.2  157.3  212.7  217.0  100.3  107.4  169.1
Exiting     401.0  340.0  116.1  102.4  104.6   93.8  178.7
Passing     359.7  311.0  232.5  183.3   88.8   90.2  147.7
Browsing    299.1  265.6  196.5  180.0  160.7  156.0   98.1

Table 1: Penalized log-likelihood of several real activities performed in the left-right direction: E - entering, Ex - exiting, P - passing, B - browsing.
Test trajectories
Classes      E1     E2     Ex1    Ex2    P1     P2     B
Entering    116.2  115.0  337.7  358.2   89.3   90.9  211.7
Exiting     277.6  284.6  151.0  127.4   98.6   96.6  297.4
Passing     210.0  224.4  350.1  362.0   63.4   64.7  358.4
Browsing    207.4  197.3  343.2  286.7  188.9  179.0  170.1

Table 2: Penalized log-likelihood of several real activities performed in the right-left direction: E - entering, Ex - exiting, P - passing, B - browsing.
References
[1] D. Ayers and M. Shah, “Monitoring Human Behavior from Video Taken in an Office Environment”, Image and Vision Computing, vol. 19, no. 12, pp. 833-846, Oct. 2001.
[2] A. Bobick and J. Davis, “The Recognition of Human Movement using Temporal Templates”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, March 2001.
[3] J. Davis and M. Shah, “Visual Gesture Recognition”, IEE Proc. Vision, Image and Signal Processing, vol. 141, no. 2, pp. 101-106, April 1994.
[4] S. Hongeng and R. Nevatia, “Multi-Agent Event Recognition”, in Proc. of the 8th IEEE Int. Conf. on Computer Vision (ICCV'01), vol. 2, pp. 84-91, 2001.
[5] M. Isard and A. Blake, “A Mixed-state Condensation Tracker with Automatic Model-switching”, in Proc. of the Int. Conf. on Computer Vision, pp. 107-112, 1998.
[6] J. S. Marques and J. M. Lemos, “Optimal and Suboptimal Shape Tracking Based on Switched Dynamic Models”, Image and Vision Computing, pp. 539-550, June 2001.
[7] N. Johnson and D. Hogg, “Representation and Synthesis of Behaviour using Gaussian Mixtures”, Image and Vision Computing, vol. 20, no. 12, pp. 889-894, 2002.
[8] A. J. Abrantes, J. S. Marques, and J. M. Lemos, “Long Term Tracking Using Bayesian Networks”, in Proc. of IEEE Int. Conf. on Image Processing, Rochester, vol. III, pp. 609-612, Sept. 2002.
[9] O. Masoud and N. P. Papanikolopoulos, “A Method for Human Action Recognition”, Image and Vision Computing, vol. 21, no. 8, pp. 729-743, August 2003.
[10] A. Nagai, Y. Kuno and Y. Suirai, “Surveillance Systems based on Spatio-temporal Information”, in Proc. IEEE Int. Conf. Image Processing, pp. 593-596, 1996.
[11] J. C. Nascimento, M. A. T. Figueiredo and J. S. Marques, “Recognition of Human Activities with Space Dependent Switched Dynamical Models”, in Proc. IEEE Int. Conf. Image Processing, September 2005.
[12] N. M. Oliver, B. Rosario and A. P. Pentland, “A Bayesian Computer Vision System for Modeling Human Interactions”, IEEE Trans. on Pattern Anal. and Machine Intell., vol. 22, no. 8, pp. 831-843, August 2000.
[13] T. J. Olson and F. Z. Brill, “Moving Object Detection and Event Recognition for Smart Cameras”, in Proc. Image Understanding Workshop, pp. 159-175, 1997.
[14] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.
[15] M. Rosenblum, Y. Yacoob and L. S. Davis, “Human expression recognition from motion using a radial basis function network architecture”, IEEE Trans. Neural Networks, no. 7, pp. 1121-1138, 1996.
[16] Y. Yacoob and M. J. Black, “Parameterized Modeling and Recognition of Activities”, Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232-247, February 1999.
Chapter 4

The Kalman Filter Approach
Imagine you are sitting in a car waiting at a crossroad to pass it. The visibility is poor due to parked cars at the roadside. But there are some gaps between them, so you can observe these openings to decide whether you can cross the street without causing an accident. You have to guess the number, position and velocity of potential vehicles moving on the road from just a little information derived by watching these gaps over time.

Let us integrate the mentioned attributes of the street into the concept of a state of the street. The observations can also be seen as measurements, and they are noisy because of the poor visibility. An estimation of the state of the street is only possible if you know how vehicles move on a road and how the measurements are related to this motion. Due to the noise in the measurements and to aspects that are not directly observable, like acceleration, there will not be absolute certainty in your estimation.
This task is one instance of the problem known as the observer design problem. In general, you have to estimate the unknown internal state of a dynamical system given its output in the presence of uncertainty. The output depends somehow on the system's state. To be able to infer this state from the output, you need to know the corresponding relation and the system's “behaviour”. In such situations, we have to construct a model. In practice it is not possible to represent the considered system with absolute precision. Instead, the model will stop at some level of detail. The gap between it and reality is filled with a probabilistic assumption referred to as noise. The noise model introduced in this chapter will be applied throughout this work.
An optimal solution for this sort of problem in the case of linear models can be derived using the Kalman Filter, which is explained in the first section of this chapter, based on [12]. Most of the interesting instances of the observer design problem, e.g. the SLAM problem, do not fulfil the condition of linearity. To be able to apply the Kalman Filter approach to these non-linear tasks, we have to linearise the models. The corresponding algorithm is referred to as the Extended Kalman Filter. We will introduce it in the second section.
4.1 The Discrete Kalman Filter
In this section we introduce the Kalman Filter, chiefly based on its original formulation in [17], where the state is estimated at discrete points in time. The algorithm is slightly simplified by ignoring the so-called control input, which is not used in this specific application of purely vision-based SLAM. Nevertheless, in a robotic application it might be useful to involve, e.g., odometry data as control input. A complete description of the Kalman Filter can be found in [17] and [12].
In the following, we will first introduce the model for the system's state and the process model, which describes the already mentioned system's “behaviour”. Here, the noise model is also presented. After that, we introduce the model for the relation between the state and its output. The section closes with a description of the whole Kalman Filter algorithm.
4.1.1 Model for the Dynamical System to Be Estimated
The Kalman filter is based on the assumption that the dynamical system to be estimated can be modelled as a normally distributed random process X(k) with mean xk and covariance matrix Pk, where the index k represents time. The mean xk is referred to as the estimate of the unknown real state of the system at the point k in time. This state is modelled by an n-dimensional vector:

x = (x1, …, xi, …, xn)⊤
For simplicity of notation we did not use the subscript k here. Throughout this work, we will continue omitting k when the components of a vector or matrix are presented, even if they are different at each point in time.

Our main objective is to derive a preferably accurate estimate xk for the state of the observed system at time k.
The covariance matrix Pk describes the possible error between the state estimate xk and the unknown real state, in other words the uncertainty in the state estimation after time step k. It can be modelled as an n × n matrix

P = | x1x1 … x1xi … x1xn |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | xix1 … xixi … xixn |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | xnx1 … xnxi … xnxn |

where the main diagonal contains the variances of each variable in the state vector and the other entries contain the covariances of pairs of these variables. Covariance matrices are always symmetric due to the symmetric property of
covariances.1
If we want to derive an accurate estimate of the system's state, the corresponding uncertainty should obviously be small. The Kalman filter is optimal in the sense that it minimises the error covariance matrix Pk.
4.1.2 Process Model
Examined over time, the dynamical system is subject to a transformation. Some aspects of this transformation are known and can be modelled. Others, e.g. acceleration as in the example above (which also influences the state of the system), are unknown, not measurable, or too complex to be modelled. The transformation therefore has to be approximated by a process model A involving the known factors. The “classic” Kalman filter expects the model to be linear. Under this condition, the normal distribution of the state model is maintained after it has undergone the linear transformation A. The new mean xk and covariance matrix Pk for the next point in time are derived by
xk = Axk−1 (4.1)
Pk = APk−1A⊤. (4.2)
Due to the approximative character of A, the state estimate xk is also just an approximation of the real state. The difference is represented by a random variable w:
xk = Axk−1 + wk−1. (4.3)
The individual values of w are not known for each point k in time, but they need to be involved to improve the estimation. We assume these values to be realisations of a normally distributed white noise vector with zero mean. In the following, this vector w is referred to as process noise. It is denoted by

p(w) ∼ N(0, Q)   (4.4)

where zero is the mean and Q the process noise covariance. The individual values of w at each point in time can now be assumed to be equal to the mean, i.e. to zero. Thus, we stick to Equation (4.1) to estimate xk.
The process noise does not influence the current state estimate, but the uncertainty about it. Intuitively we can say: the higher the discrepancy between the real process and the corresponding model, the higher the uncertainty about the quality of the state estimate. This can be expressed by extending the computation of the error covariance Pk in Equation (4.2) with the process noise covariance matrix Q.
Pk = APk−1A⊤ + Q (4.5)
The choice of the values for the process noise covariance matrix reflects the quality we expect from the process model. If we set them to small values, we are quite sure that our assumptions about the considered system are mostly right, and the uncertainty regarding our estimates will be low. But then we will be unable, or hardly able, to cope with large variations between the model and
1The covariance value x1xn is the same as xnx1. In practice this means that x1 is correlated to xn in the same way as xn to x1.
the system. Setting the variances to large values instead means accepting that there might be large differences between the state estimate and the real state of the system. We will be able to cope with large variations, but the uncertainty about the state estimate will increase more strongly than with a small process noise. A lot of good measurements are needed to constrain the estimate.
4.1.3 Output of the System
As already mentioned, the output of the system is related to the state of the system. If we know this relation and the estimated state after the current time step, we are able to predict the corresponding measurement of the system's output. In this section, we will introduce the model for the measurement of the output. In the next section, the relation between state and output is examined.
Like the state of the considered dynamical system, its output is also modelled as a normally distributed random process Z(k) with mean zk and covariance matrix Sk, where the index k indicates time. The mean zk represents the estimated and predicted measurement of the output, depending on the state estimate xk at the point k in time. The real measurement of the output is obtained by explicitly measuring the system's output. It is modelled as an m-dimensional vector

z = (z1, …, zi, …, zm)⊤
The so-called innovation covariance matrix Sk describes the possible error between the estimate zk and the real measurement, in other words the uncertainty in the measurement estimation after time step k. It can be modelled as an m × m matrix

S = | z1z1 … z1zi … z1zm |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | ziz1 … zizi … zizm |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | zmz1 … zmzi … zmzm |
where the main diagonal contains the variances of each variable in the measurement vector and the other entries contain the covariances of pairs of these variables.

Note that, in contrast to the system's real state, the real measurement can be obtained, and we are therefore able to compare the predicted and the real measurement. The precisely known difference between estimation and reality constitutes the basis for correcting the state estimate used to predict the measurement. This will be explained in detail in Section 4.1.5.
4.1.4 Measurement Model
In the previous sections we mentioned that the system's output is somehow related to the system's state. In this section this relation is modelled.
We have the same situation as for the process model. The connection between the output and the state can only be modelled up to a certain degree. Known factors are summarised in the measurement model H. After we have obtained a new state estimate for the current point in time, we can apply H to predict the corresponding measurement zk and covariance matrix Sk. If this measurement model is linear, the normal distribution of the state model is maintained after applying this linear transformation.
zk = Hxk (4.6)
Sk = HPkH⊤. (4.7)
Because measurements of the system's output are usually noisy due to inaccurate sensors, the difference between the estimate zk and the real measurement is not just caused by the dependency on the state estimate, but also by a random variable v:
zk = Hxk + vk. (4.8)
As with the process noise, the individual values of v are not known for each point k in time. We apply the same noise model and approximate these unknown values as realisations of a normally distributed white noise vector with zero mean. In the following, v is referred to as measurement noise. It is denoted by

p(v) ∼ N(0, R)   (4.9)

As v is now assumed to be equal to the mean of its distribution at each point in time, it does not influence the measurement estimate, but the uncertainty about it. This is modelled by extending the computation of the innovation covariance matrix Sk in Equation (4.7) with the measurement noise covariance matrix R.
Sk = HPkH⊤ + R (4.10)
Again, the values chosen for the measurement noise covariance matrix indicate how sure we are about the assumptions made in our measurement model. More information about the influence of the measurement noise is given below, in connection with the Kalman Gain.
4.1.5 Predict and Correct Steps
In the last sections we introduced the model for the process the system is subject to and the model for the relation between the system's state and its output. These models are used in the Kalman Filter algorithm to determine an optimal estimate of the unknown state of the system.

As already mentioned in Section 4.1.3, we use the known difference between the predicted measurement zk and the real measurement as a basis to correct the state estimate derived by the application of the process model A. The filter can be divided into two parts. In the predict step, the process model and the current state and error covariance matrix estimates are used to derive an a priori state estimate for the next time step. Next, in the correct step, a (noisy) measurement is obtained to enhance the a priori state estimate and derive an improved a posteriori estimate.
Figure 4.1: The Predict-Correct Cycle of the Kalman Filter Algorithm.
Before this predict-correct cycle, as depicted in Figure 4.1, can be started, the state and its error covariance matrix have to be initialised. In the following we will assume that this is already the case.
Predict Step
We are situated at the point k in time, and the state and error covariance matrix estimates at time k−1 are given. By using Equations (4.1) and (4.5) we predict the state and error covariance matrix for k:

x−k = Axk−1
P−k = APk−1A⊤ + Q.

The minus superscript labels the predicted state and error covariance matrix as a priori, in contrast to the a posteriori estimates.
Correct Step
Assume that we have already obtained an actual measurement zk of the system's output. With the help of this, we first want to calculate the a posteriori state estimate xk. This is a linear combination of the a priori estimate x−k and a weighted difference between zk and the predicted measurement. According to Equation (4.6), the predicted measurement is calculated as Hx−k. Summarised, we have:

xk = x−k + Kk(zk − Hx−k).

The difference zk − Hx−k is called the measurement innovation or residual. If its value is zero, the prediction and the actual measurement are in complete agreement and the a priori state estimate won't be corrected. If it is unequal to zero, xk will be unequal to x−k.
The weight Kk, the so-called Kalman Gain, is represented by an n × m matrix and minimises the a posteriori error covariance estimate Pk. It can be calculated by

Kk = P−k H⊤ (HP−k H⊤ + R)⁻¹   (4.11)
Note that the inverted factor equals Equation (4.10), representing the uncertainty in the predicted measurement. If we look closely at Equation (4.11), we can
see that if the measurement noise covariance R approaches zero, the measurement innovation is weighted more heavily:

lim_{R→0} Kk = H⁻¹

In other words, the smaller the measurement error, the more reliable the actual measurement zk is. On the other hand, if the predicted error covariance matrix P−k approaches zero, the residual is weighted less:

lim_{P−k→0} Kk = 0

This means: the smaller the uncertainty in the a priori state estimate x−k, the more reliable the predicted measurement is.
Secondly, we have to correct the a priori error covariance matrix estimate to derive the a posteriori estimate.
Pk = (I− KkH)P−k
For details of the derivation of the filter algorithm see [26].
In Figure 4.2 the whole algorithm is given again step by step.
4.1.6 A Simple Example
To clarify the effectiveness of the Kalman Filter we will examine a simple example. To stick to the central theme of this work right from the beginning, this example will be an instance of the SLAM problem. The section is structured as follows: first, we give a short description of the problem. After that, the process and measurement model are formulated. The section closes with some experiments on simulated data.
Problem Description
In Chapter 5, we will analyse how to apply the Kalman Filter approach to the problem of SLAM using a vision sensor mounted on a robot. This means, firstly, tracking the position and orientation of the camera within the 3D environment (localisation) and, secondly, estimating the positions of some landmarks situated in the world (mapping).

In the following, we simplify this task to SLAM in one dimension. The camera is represented by a point moving randomly in 1D. There is also a static landmark whose position is known up to a certain degree. The process model of this example should describe the motion of the camera. We will assume that it moves smoothly, so that fast changes in its velocity are unlikely. We are able to measure the distance between the landmark and the moving point at discrete points in time. The measurement model should relate this distance to the state of the considered system.
The situation is depicted in Figure 4.3.
1. Predict Step

(a) Predict the state:
    x−k = Axk−1

(b) Predict the error covariance matrix:
    P−k = APk−1A⊤ + Q

2. Correct Step

(a) Calculate the Kalman Gain:
    Kk = P−k H⊤ (HP−k H⊤ + R)⁻¹

(b) Correct the a priori state estimate:
    xk = x−k + Kk(zk − Hx−k)

(c) Correct the a priori error covariance matrix estimate:
    Pk = (I − KkH)P−k
Figure 4.2: Equations of one Kalman Filter Cycle. We assume that the state,
its covariance and the noise values are already initialised.
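The cycle of Figure 4.2 can be written as a single function. The sketch below is a straightforward NumPy transcription of the listed equations (no control input); the function name `kalman_step` and the 1D test values used with it are our own:

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict-correct cycle of the discrete Kalman Filter,
    following Figure 4.2 (no control input)."""
    # Predict step
    x_prior = A @ x
    P_prior = A @ P @ A.T + Q
    # Correct step
    S = H @ P_prior @ H.T + R             # innovation covariance (4.10)
    K = P_prior @ H.T @ np.linalg.inv(S)  # Kalman gain (4.11)
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post
```

In a scalar example with A = H = 1, P = R = 1 and Q = 0, a measurement of 2 against a prior estimate of 0 gives a gain of 0.5, so the corrected estimate lands halfway between prediction and measurement, and the uncertainty halves.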
Figure 4.3: An Example for a Point Moving Randomly in 1D. A static landmark
is situated at x = 3. The distance between the current point position and this
landmark is measurable at each time step.
Process and Measurement Model
First we have to model the state x which has to be estimated. Three important entities have to be taken into account. Firstly, there is the position of the point at a point in time. It is fully described by a one-dimensional coordinate in x-direction. Secondly, we choose a constant velocity to describe the motion of the point.2 This does not mean that we assume the point moves constantly over all time, but that this value is the average velocity between two points in time and that changes occur with a Gaussian profile. These changes are modelled below as process noise. Finally, the position of the landmark has to be augmented into the state.

x = (x_p, v_p, x_f)⊤ = (position of the point, velocity of the point, position of the landmark)⊤
The error covariance matrix is then a 3 × 3 matrix of the following form

P = | x_px_p  x_pv_p  x_px_f |
    | v_px_p  v_pv_p  v_px_f |
    | x_fx_p  x_fv_p  x_fx_f |.
The task of the process model A is to approximate the transformation of the considered system over time. Here, this is the motion of the point between time k − 1 and k. This constant time period is denoted as ∆k. A is used to predict the state of the system for the current point k in time from the old state estimate at time k − 1 by calculating x(k) = Ax(k − 1).

x_p(k) = old point position + old velocity per ∆k = x_p(k − 1) + v_p(k − 1)∆k
v_p(k) = constant velocity due to assumed smooth motion = v_p(k − 1)
x_f(k) = static landmark = x_f(k − 1)
(4.12)
As already mentioned, the constant velocity value just describes the average velocity in the time period ∆k. Therefore, it is just an approximation. Variations are caused by random, unmeasurable accelerations a.3 We involve them in the process noise vector w. If we knew the individual values of w at each k, we could derive the real state:

x(k) = Ax(k − 1) + w(k − 1)

Because the process noise is an additive constant, w is modelled as a three-dimensional vector w = (w_0, w_1, w_2)⊤. Noise is only added to the velocity component of the state. Thus, the first and third components, w_0 and w_2, referring to the position of the moving point and to the position of the landmark, are set to zero. Only the second value carries a different random value after each time step: w = (0, a∆k, 0)⊤. Adding the noise term to the process model, we
2. A velocity v_p describes the distance x covered in a certain time interval Δk.
3. An acceleration a is a change in velocity v_p in a certain time interval Δk. Thus, w1 = aΔk = Δv_p, the change in velocity.
have:
    x_p(k) = x_p(k − 1) + (v_p(k − 1) + a(k − 1)Δk)Δk
    v_p(k) = v_p(k − 1) + a(k − 1)Δk
    x_f(k) = x_f(k − 1)
We do not know the individual values of a at each point in time. Therefore, we model the process noise as a realization of a normally distributed white noise random vector with zero mean and covariance matrix Q:
p(w) ∼ N (0, Q)
Now, we can assume w to be equal to the mean of its distribution, which is zero. We derive the process model already formulated in Equation (4.12). Expressed as a linear transformation, with Δk assumed to be 1, this is

        | 1 1 0 |
    A = | 0 1 0 |
        | 0 0 1 | .
Q is of the following form:

        | 0   0     0 |
    Q = | 0   σ_p²  0 |
        | 0   0     0 | .
The constant value σ_p, the standard deviation of the noise in the velocity value, indicates the amount of smoothness in the motion we expect. If we choose it to be small, we expect the point to move with a nearly constant velocity; then we will not be able to cope with sudden accelerations. If we choose large values instead, we will be able to track the point well even if it behaves differently than the process model expects. On the other hand, the uncertainty about a state estimate is then higher than with small values for σ_p.
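To make the predict step concrete, here is a minimal sketch in Python with NumPy. The value of σ_p, the initial state and Δk = 1 are assumptions for illustration, not values prescribed by the text:

```python
import numpy as np

# Process model A for the state (x_p, v_p, x_f), with Delta-k = 1,
# and process noise covariance Q with noise only in the velocity component.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
sigma_p = 0.2                       # assumed velocity noise standard deviation
Q = np.diag([0.0, sigma_p**2, 0.0])

def predict(x, P):
    """One predict step: x_minus = A x, P_minus = A P A^T + Q."""
    return A @ x, A @ P @ A.T + Q

x0 = np.array([0.0, 1.0, 3.0])      # assumed: point at 0, velocity 1, landmark at 3
P0 = np.eye(3)
x1, P1 = predict(x0, P0)            # x1 = [1, 1, 3]: the point advanced by one velocity unit
```

Note how Q only inflates the velocity variance: the landmark row and column of Q are zero because the landmark is static.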
The measurement model approximates the relation between the actual measurement z_k and the current state x_k. In our example, the measurement consists of just one value representing the distance d_k between the moving point and the static landmark at the current point k in time. Expressed as a linear equation, we have
    z(k) = d_k = x_p(k) − x_f(k)    (4.13)
The sensor used to measure the distance is assumed to provide only noisy measurements. If we knew the value of this measurement noise exactly, we could determine the real measurement and not just an estimate. If we denote the measurement noise by the random variable v, the real measurement can be computed by:
    z(k) = d_k = x_p(k) − x_f(k) + v(k).
But we do not know the individual values of the random variable v. Therefore, we apply our noise model such that the values of v are a realization of normally distributed white noise with zero mean and variance σ_m²:

    p(v) ∼ N(0, σ_m²).
The measurement noise has the same dimension as the measurement, and its distribution is therefore modelled by specifying a variance instead of a covariance matrix. We can now assume the value of v to be equal to the mean of its distribution, i.e. zero. Then we derive the measurement model already formulated in Equation (4.13). Note that the difference between the estimate ẑ_k of the measurement and the real measurement is not just caused by the unknown noise, but also by the fact that in reality we only have an estimate of the state with which to predict the measurement. The final measurement model for this problem is:
    ẑ(k) = d_k = x_p(k) − x_f(k).
Expressed as a linear transformation, we have

    H = ( 1   0   −1 ).
The constant value σ_m, the standard deviation of the measurement noise distribution, indicates how sure we are about the correctness of the real measurements. Large values show that we do not trust them much, and the measurement innovation will be weighted less. Small values indicate that the measured values are accurate; the residual will then be weighted more heavily.
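The corresponding correct step can be sketched as follows. H = (1 0 −1) is from the measurement model above; σ_m and the numeric inputs are assumptions for illustration:

```python
import numpy as np

H = np.array([[1.0, 0.0, -1.0]])   # z = x_p - x_f, Eq. (4.13)
sigma_m = 0.2                      # assumed measurement noise standard deviation
R = np.array([[sigma_m**2]])

def correct(x_minus, P_minus, z):
    """One correct step with the scalar distance measurement z."""
    S = P_minus @ H.T                      # intermediate: P^- H^T
    S = H @ S + R                          # innovation covariance (1x1)
    K = P_minus @ H.T @ np.linalg.inv(S)   # Kalman gain (3x1)
    x = x_minus + (K @ (z - H @ x_minus)).ravel()
    P = (np.eye(3) - K @ H) @ P_minus
    return x, P

# A measured distance of -1.8 pulls the predicted distance of -2 toward it.
x, P = correct(np.array([1.0, 1.0, 3.0]), np.eye(3), np.array([-1.8]))
```

With a small σ_m the gain weights the residual heavily, exactly as described in the paragraph above.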
Experiments on Simulated Data
In the previous section, we derived the basis for the application of the Kalman Filter to our problem: the appropriate process and measurement models. In this section, we will test these models on simulated data. The simulation was initialized with the state:
    x0 = (0, 1, 3)⊤
The subsequent real positions of the point moving in 1D were generated by applying exactly the process described in the corresponding model and adding some random values. The standard deviation of the random values is set to 0.2. The real measurements were also generated as described in the measurement model. Measurement noise is simulated by adding random values with a standard deviation of 0.2.
To start the predict-correct cycle of the Kalman Filter, we have to initialize the state and its error covariance matrix as well as the process and measurement noise values. Let us set the state to the real initial values. We assume an uncertainty about the initial position of the moving point as well as about the position of the landmark and the velocity at time 0. Let the error covariance be
         | 1 0 0 |
    P0 = | 0 1 0 |
         | 0 0 1 |
The real noise in the measurements can usually be determined prior to the application of the filter. Determining the process noise covariance is more complicated, because we generally do not have the ability to measure the process we want to estimate directly. Nevertheless, we set the standard deviation of the
[Plot omitted: Position [Units] vs. Time [Filter Cycles]; curves: Point Positions, Position Landmark, Estimated Point Position, Estimated Landmark Position.]
Figure 4.4: The Simulation of the Problem of Estimating a Moving Point's Position by Orienting at a Single Landmark. The deviation between the estimated and real position of the point is very small, as is the deviation between the estimated and real position of the landmark.
noise in the velocity, σ_v, and in the measurement, σ_m, to the real value used in the simulation: 0.2.
We will run the filter on ten simulated measurements. The results are depicted in Figure 4.4. In Figure 4.5, the behaviour of the error covariance P during the ten filter cycles is visualized.
4.2 The Extended Kalman Filter
As we saw in Section 4.1.6, the Kalman Filter algorithm works quite well for the estimation of a linear system with linearly related measurements, depending on the quality of the appropriate models for the process and the measurement of the output. Moreover, the Kalman Filter is optimal in the sense that it minimizes the error covariance representing the uncertainty in the estimate of the state.
To come back to the main theme of this work, estimating the position of a moving robot and of static landmarks using a camera sensor, we need to be able to cope with nonlinear motion and a nonlinear relationship between measurements and the system's state. The nonlinear motion is caused by the possible rotational movements the robot is able to perform. Measurements of landmarks in the surroundings of the robot are projections of them onto the image plane of the camera sensor. The process of projection is nonlinear.
In Section 4.1.2, it was stated that a Gaussian distribution is maintained under a linear transformation. This is not the case if we use a nonlinear transformation instead. Thus, we cannot apply the Kalman Filter equations in their original formulation to estimate a nonlinear system. A solution to this problem is to linearize the transformation via Taylor expansion. A Kalman Filter that uses Taylor expansion to linearize the process and measurement models is called an Extended Kalman Filter, in the following abbreviated as EKF.
[Plot omitted: Error Covariance P [Units] vs. Time [Filter Cycles]; curves: Variance in Point Position, Variance in Velocity, Variance in Landmark Position.]
Figure 4.5: The Error Covariance Matrix P. After two iterations, the initial value of 1 for the variances has settled at approximately 0.5 for the estimation of the point's position and of the landmark's position, and at approximately 0.04 for the estimation of the velocity.
As in Section 4.1.1, we assume that the considered system can be modelled as a normally distributed random process X(k) with mean x̂_k, the estimate of the real system state x_k, and covariance matrix P_k. Its output can likewise be modelled as a normally distributed random process Z(k) with mean ẑ_k, the prediction of the real measurement z_k, and covariance matrix S_k. In the following sections, the EKF is derived for nonlinear process and measurement models.
Right from the beginning, we will stick to the "super minus" notation labelling a priori estimates.
4.2.1 Process Model
Let us assume that our system to be estimated, represented by a state vector x_k at time k, is now governed by the nonlinear function
xk = f (xk−1, wk−1) (4.14)
relating the previous state x_k−1 at point k − 1 in time to the state x_k at the current point k in time. The random value w_k−1 represents the process noise as in Equation (4.4):
p(w) ∼ N (0, Q)
We assume w to be equal to the mean of its distribution, which is zero. The result of the function f will then be an approximation x̂⁻_k of the real state x_k:

    x̂⁻_k = f(x̂_k−1, 0)    (4.15)
Let the difference between the real state and its estimate, namely the error in the prediction, be a random variable e:

    e_xk = x_k − x̂⁻_k.
To be able to estimate the result of the process represented by the nonlinear Equation (4.14) via the Kalman Filter algorithm, we linearize it about the current state estimate given in Equation (4.15) by setting up a first-order Taylor polynomial ([16], p. 411):

    x_k ≈ x̂⁻_k + A(x_k−1 − x̂_k−1) + Ww_k−1 = x̃_k    (4.16)
The matrix A is the Jacobian matrix containing the partial derivatives of f in Equation (4.15) with respect to x, whereas the Jacobian matrix W is filled with the partial derivatives of f with respect to w. Note that we omitted the time subscript k for the Jacobians to simplify the notation. Nevertheless, they may be different at each point in time. In the following, we will stick to omitting k for the Jacobian matrices.
The a priori estimate x̂⁻_k in Equation (4.16) can be calculated as f(x̂_k−1, 0). The remainder term approximates the error e_xk as ẽ_xk:

    e_xk ≈ A(x_k−1 − x̂_k−1) + Ww_k−1 = ẽ_xk    (4.17)

With this definition of ẽ_xk, we can rewrite Equation (4.16) as

    x_k = x̂⁻_k + e_xk    (4.18)

According to Equation (4.18), we need to estimate the random value e_xk as ê_xk at each point in time to achieve our actual goal: estimating x_k as x̂_k.
Note that (4.17) is a linear equation. Thus, we can apply a second, hypothetical "classic" Kalman Filter to estimate e_xk. We will model this dynamic linear error system as a normally distributed random process with mean ê_xk and covariance matrix P_k representing the uncertainty about the estimated e_xk. Since e_xk denotes the error in the state estimate, it is clear that it should always be approximately zero. Therefore, the mean ê_xk of the distribution is chosen to be zero.
Let us consider Equation (4.17) again. The second term Ww_k−1 denotes the noise in the estimation of e_xk. It is the product of the process noise w and the Jacobian matrix W containing the partial derivatives of f with respect to w. Remember that the process noise is assumed to be always equal to zero; thus, the term Ww_k−1 is also assumed to be equal to zero. If w is transformed by applying W, the corresponding covariance matrix Q of the process noise is transformed into WQW⊤. The noise in the estimation of e_xk is then modelled as

    p(Ww_k−1) ∼ N(0, WQW⊤).

To involve this noise in the prediction of the error e_xk between real and estimated state, the corresponding error covariance WQW⊤ is added to the prediction AP_k−1A⊤ of its error covariance P. To summarize the last statements, we have:
    ê⁻_xk = A(x̂_k−1 − x̂_k−1) = 0    (4.19)
    P⁻_k = AP_k−1A⊤ + WQW⊤.    (4.20)
Equations (4.19) and (4.20) represent the process model for the linear errorsystem.
If we substitute Equation (4.19) for e_xk in Equation (4.18), the process model for the nonlinear system to predict a state estimate x̂⁻_k is then

    x̂⁻_k = f(x̂_k−1, 0)    (4.21)
    P⁻_k = AP_k−1A⊤ + WQW⊤.    (4.22)
The process noise covariance matrix WQW⊤ plays the same role in the nonlinear process model as the covariance matrix Q does in the linear process model: it represents the amount of trust in the process model. High values indicate that large deviations between the state estimate and the real state are expected. Low values show a lot of confidence in the process model.
4.2.2 Measurement Model
Let us assume that the relation between the system and its output is described by the nonlinear function
zk = h(xk, vk) (4.23)
where vk represents the measurement noise as in (4.9).
p(v) ∼ N (0, R)
As usual, we assume v_k to be zero, which is the mean of its distribution:

    ẑ_k = h(x̂⁻_k, 0).    (4.24)
The result ẑ_k is just an approximation of the real measurement. Let the difference between the actual and the predicted measurement be the random value

    e_zk = z_k − ẑ_k.
In contrast to the error e_xk between the real state and its estimate, e_zk is accessible.
To estimate the measurement of the system's output, we linearize Equation (4.23) about the current state estimate given in Equation (4.24) by setting up a first-order Taylor polynomial:

    z_k ≈ ẑ_k + H(x_k − x̂⁻_k) + Vv_k    (4.25)
The matrix H is the Jacobian matrix containing the partial derivatives of h in Equation (4.24) with respect to x, while the Jacobian matrix V contains the derivatives of the same function with respect to the measurement noise v. The predicted measurement ẑ_k in Equation (4.25) can be calculated by Equation (4.24). The error e_zk is approximated as ẽ_zk by the remainder term

    e_zk ≈ H(x_k − x̂⁻_k) + Vv_k = ẽ_zk.    (4.26)
With this definition of ẽ_zk, we can rewrite Equation (4.25):

    z_k ≈ ẑ_k + e_zk    (4.27)
Note that Equation (4.26) is a linear equation. Therefore, we also model the error in the estimation of the output as a normally distributed random process
with mean ê_zk and innovation covariance matrix S_k, which captures the error between the predicted and the actual measurement. From the notion that ẽ_zk specifies the estimated error in the estimation of the state x_k of the system, it is clear that it should preferably be approximately equal to zero. Thus, the mean ê_zk of its distribution is assumed to be always equal to zero.
If we reconsider Equation (4.26), we can state that Vv_k is the noise term in the prediction of e_zk. Remember that the measurement noise v is assumed to be zero at every point in time; thus, the product of v and the Jacobian matrix V containing the partial derivatives of h with respect to the noise is zero. If v is transformed by applying V, the corresponding covariance matrix R is transformed into VRV⊤. The noise involved in the estimation of the error e_zk is then modelled as follows:

    p(Vv_k) ∼ N(0, VRV⊤)

The covariance matrix of the noise Vv_k is added to the prediction HP⁻_kH⊤ of the innovation covariance matrix. Summarized, we have:

    ê⁻_zk = H(x̂⁻_k − x̂⁻_k) = 0    (4.28)
    S_k = HP⁻_kH⊤ + VRV⊤.    (4.29)
Equations (4.28) and (4.29) represent the measurement model for the linear error system and are used to correct the a priori error estimate ê⁻_xk between the state and its approximation.

If we substitute Equation (4.28) for e_zk in Equation (4.27), the measurement model for the nonlinear system is:

    ẑ_k = h(x̂⁻_k, 0)    (4.30)
    S_k = HP⁻_kH⊤ + VRV⊤.    (4.31)
4.2.3 Predict and Correct Steps
Using the Kalman Filter for the estimation of the state of a linear system means that we know exactly how uncertain we are about this estimate. Using the EKF for the estimation of the state of a nonlinear system, in contrast, means additionally estimating the uncertainty in this state estimate. This can be done by a second, hypothetical Kalman Filter, presented in the previous sections, which estimates the error between the real state and its estimate.
Let us assume that we have already used the process model for the nonlinear system given in Equations (4.21) and (4.22) to derive an a priori estimate x̂⁻_k for the state and P⁻_k for its error covariance. Then, we can predict the measurement by using Equation (4.30). After we have obtained the real measurement z_k, we can calculate the error e_zk between z_k and the predicted measurement ẑ_k.

According to Equation (4.19), the predicted error estimate ê⁻_xk between the real state and its estimate is assumed to be zero in every time step.

The Kalman Filter equation to correct the a priori error estimate ê⁻_xk and derive an a posteriori ê_xk is then

    ê_xk = ê⁻_xk + K_k e_zk
         = K_k e_zk.
1. Predict Step

   (a) Predict the state:

       x̂⁻_k = f(x̂_k−1, 0)

   (b) Predict the error covariance matrix:

       P⁻_k = AP_k−1A⊤ + WQW⊤

2. Correct Step

   (a) Calculate the Kalman Gain:

       K_k = P⁻_kH⊤ (HP⁻_kH⊤ + VRV⊤)⁻¹

   (b) Correct the a priori state estimate:

       x̂_k = x̂⁻_k + K_k(z_k − h(x̂⁻_k, 0))

   (c) Correct the a posteriori error covariance matrix estimate:

       P_k = (I − K_kH)P⁻_k

Figure 4.6: Equations of one Extended Kalman Filter Cycle. We assume that the state, its covariance and the noise values are already initialized. Note that for simplicity the subscript k is not used here for the Jacobians, although they have to be re-calculated in each predict-correct cycle.
If we substitute this into Equation (4.18), we get

    x̂_k = x̂⁻_k + K_k e_zk.

Because e_zk is the measurement residual, we can also write

    x̂_k = x̂⁻_k + K_k(z_k − ẑ_k)    (4.32)
        = x̂⁻_k + K_k(z_k − h(x̂⁻_k, 0)).    (4.33)
Equation (4.33) can be used in the correct step of the Extended Kalman Filter algorithm to derive the a posteriori estimate of the state of the nonlinear system. The Kalman Gain K_k itself is calculated as in Equation (4.11), with the appropriate substitution of the measurement error covariance matrix given in (4.31):

    K_k = P⁻_kH⊤ (HP⁻_kH⊤ + VRV⊤)⁻¹
In Figure 4.6, the Extended Kalman Filter algorithm is given step by step.
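The cycle of Figure 4.6 can be written as one small generic routine. This is a sketch: `f` and `h` stand for the nonlinear models, and `jac_f`/`jac_h` are assumed callbacks returning the Jacobians (A, W) at the previous estimate and (H, V) at the predicted state:

```python
import numpy as np

def ekf_cycle(x, P, z, f, h, jac_f, jac_h, Q, R):
    """One EKF predict-correct cycle as in Figure 4.6."""
    # predict step
    A, W = jac_f(x)
    x_minus = f(x)
    P_minus = A @ P @ A.T + W @ Q @ W.T
    # correct step
    H, V = jac_h(x_minus)
    S = H @ P_minus @ H.T + V @ R @ V.T
    K = P_minus @ H.T @ np.linalg.inv(S)
    x_new = x_minus + (K @ (np.atleast_1d(z) - np.atleast_1d(h(x_minus)))).ravel()
    P_new = (np.eye(len(x)) - K @ H) @ P_minus
    return x_new, P_new
```

For linear f and h with constant Jacobians, this reduces to the "classic" Kalman Filter cycle, which is exactly what the simple example in the next section exploits.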
[Diagram omitted: a lighthouse above a horizontal line through x0, xi, xj (positions of the ship), with the observed angles αi and αj.]
Figure 4.7: A ship is sailing on the straight line perpendicular to the axis
between x0, the initial position of the ship, and the position of the lighthouse.
xi and xj are sample positions of the ship which need to be estimated from the
corresponding observable angles αi and αj .
4.2.4 A Simple Example
The derivation of the Extended Kalman Filter presented in the previous section is a bit more complicated than the explanation of the "classic" filter. In this section, a simple example is examined to provide a better understanding of the EKF algorithm. Again, we will consider an instance of the general SLAM problem.

The section is structured as follows. Firstly, we describe the specific problem in general. After that, the models for the system's state and process and the relation between the state and the measurement are presented. The section closes with some experiments on simulated data.
Problem Description
Imagine you are the skipper of a ship and your task is to sail a straight route of a certain length on the ocean. As you might infer from this sentence, the example deals more or less with the routing aspect of navigation, but we will focus on the localization and mapping problem. To be more concrete, as the skipper you need to localize your ship on that straight route. We assume that there is a lighthouse with an uncertainly known position to orient by.

Your initial position is located at some distance from that lighthouse. You will sail in a direction perpendicular to the axis between the lighthouse and the initial ship position. The motion of a ship is smooth, so changes in the velocity are unlikely.

You will be able to measure the angle between the current position of your ship and the lighthouse. Of course, these values will be rough guesses rather than precise measurements. We assume that you are not able to measure your velocity, which is normally the case.
This situation is depicted in Figure 4.7.
Process and Measurement Model
In this example we have two tasks. Firstly, we need to localize the position x of the moving ship on the straight route at every time step. Secondly, we have to refine our knowledge about the position y of the lighthouse.
Thus, the state x of the considered system contains three entities. The position x and velocity v_x of the ship are the first two. Again, we choose a constant value for the velocity, which represents an average value over the constant time period Δk. The third component of the state denotes the distance between the lighthouse and the initial position x0 of the ship.
x = (x, v_x, y)⊤ = (Position of the ship, Velocity of the ship, Distance of the lighthouse from x0)⊤
With this definition of the state, we have the following error covariance matrix P representing the uncertainty in the estimation of the state:

        | xx      xv_x     xy   |
    P = | v_xx    v_xv_x   v_xy |
        | yx      yv_x     yy   |
The process the system is subject to is simply the motion of the ship on that route. The process model f we set up here relates the state at time k − 1 to that at time k by calculating:
    x(k) = x(k − 1) + v_x(k − 1)Δk    (old position plus old velocity times the time interval)
    v_x(k) = v_x(k − 1)               (constant velocity due to assumed smooth motion)
    y(k) = y(k − 1)                   (static landmark)
                                                                            (4.34)
These equations are linear. Nevertheless, we will treat them as nonlinear and apply the EKF approach. We will see that the EKF equations then reduce to the equations of the "classic" Kalman Filter.

As already mentioned, v_x just describes the average velocity between two time steps. Thus, it is just an approximation of the real velocity. The random difference between estimated and real velocity is modelled as process noise w = (w0, w1, w2)⊤ = (0, aΔk, 0)⊤. Like the state, w is a three-dimensional vector. Only the velocity is corrupted by noise; therefore, only w1 carries a value unequal to zero, involving the unmeasurable acceleration a:
p(w) ∼ N (0, Q)
Q is of the following form:

        | 0   0     0 |
    Q = | 0   σ_v²  0 |
        | 0   0     0 |

The variable σ_v denotes the standard deviation of the noise in the velocity. If we knew the individual values of w, we could derive the real state of the considered system by calculating f(x_k−1, w_k−1):
    x(k) = x(k − 1) + (v_x(k − 1) + w1)Δk + w0 = x(k − 1) + (v_x(k − 1) + a(k − 1)Δk)Δk
    v_x(k) = v_x(k − 1) + w1 = v_x(k − 1) + a(k − 1)Δk
    y(k) = y(k − 1) + w2 = y(k − 1)
                                                                            (4.35)
Again, we assume w to be always equal to the mean of its distribution, which is zero. Then we obtain the process model f(x_k−1, 0) as it is already formulated in Equation (4.34). To be able to predict the error covariance matrix P at each point in time, we need to derive the Jacobian matrix A containing the partial derivatives of Equation (4.34) with respect to the state x, and the Jacobian matrix W containing the partial derivatives of Equation (4.34) with respect to the noise w. Assuming that Δk is equal to 1, for A we have:

        | ∂x/∂x      ∂x/∂v_x      ∂x/∂y   |   | 1 1 0 |
    A = | ∂v_x/∂x    ∂v_x/∂v_x    ∂v_x/∂y | = | 0 1 0 |
        | ∂y/∂x      ∂y/∂v_x      ∂y/∂y   |   | 0 0 1 | .
Note that this is the same matrix as Equation (4.34) expressed as a linear transformation.
For W, we have:

        | ∂x/∂w0      ∂x/∂w1      ∂x/∂w2   |   | 1 0 0 |
    W = | ∂v_x/∂w0    ∂v_x/∂w1    ∂v_x/∂w2 | = | 0 1 0 |
        | ∂y/∂w0      ∂y/∂w1      ∂y/∂w2   |   | 0 0 1 | .
Hence, WQW⊤ = Q. The equation to predict the error covariance then equals the one for the standard Kalman Filter: P⁻(k) = AP(k − 1)A⊤ + Q.
Now let us consider the measurement model for our system. It provides the relation between the state x of the system and the measurement z of its output. Remember that as measurement we obtain the value of the angle α at each time step. If we look again at Figure 4.7, we can state that the situation can be represented by a right triangle. Then two definitions hold:

    a² + b² = c²
    a = c · sin α

We define the axis between the lighthouse and x0 as a, the distance the ship has covered up to a certain point in time as b, and the connection between the lighthouse and the current position of the ship as the hypotenuse c. b is then equal to x in the state, and a is the same as y. Thus, the measurement model to obtain the measurement ẑ is

    ẑ(k) = α = arcsin( y(k) / √(x(k)² + y(k)²) ).    (4.36)
Thus, we have a nonlinear measurement model h. The value provided for α may be more a guess than a precise measurement. Therefore, we have to introduce measurement noise v to model the difference between the real measurement and the predicted one. If we knew the noise value for each time step, we would obtain z instead of ẑ by calculating h(x_k, v_k):

    z(k) = α = arcsin( y(k) / √(x(k)² + y(k)²) ) + v(k).
But this is not the case. Therefore, we model v as normally distributed measurement noise with zero mean and standard deviation σ_r:

    p(v) ∼ N(0, σ_r²)

Now we can assume v to be zero at each point in time, which is the mean of its distribution. Then we obtain h(x, 0) as it is already formulated in Equation (4.36). The noise enters the calculation of the innovation covariance S(k) = HP⁻(k)H⊤, which is also one-dimensional: because we have a nonlinear model, the variance is first transformed into Vσ_r²V⊤ and then added.

As usual, the value we choose for σ_r indicates how we rate the quality of the measurement model.
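The measurement function of Equation (4.36) is easy to sanity-check numerically. A small sketch (`math.hypot` computes √(x² + y²); the operating points are assumptions):

```python
import math

def h(x, y):
    """Predicted bearing angle, Eq. (4.36): alpha = arcsin(y / sqrt(x^2 + y^2))."""
    return math.asin(y / math.hypot(x, y))

# At the start (x = 0) the lighthouse lies perpendicular to the route: alpha = pi/2.
# As the ship moves away, the angle shrinks; at x = y it equals pi/4.
```

This matches the geometry of Figure 4.7: the observed angle decreases monotonically as the ship sails away from x0.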
Because we have a nonlinear measurement model, we need to derive the Jacobian matrices H and V at each point in time. H contains the partial derivatives of the measurement model h(x_k, 0) with respect to the state. It is of the following form:

    H = ( ∂h/∂x   ∂h/∂v_x   ∂h/∂y )
For ∂h/∂x we have

    ∂h/∂x = −x y / ( √(1 − y²/(x² + y²)) · √((x² + y²)³) )
∂h/∂v_x is equal to zero, because the velocity of the ship is irrelevant in the measurement model. For ∂h/∂y we have

    ∂h/∂y = ( 1/√(x² + y²) − y²/√((x² + y²)³) ) / √(1 − y²/(x² + y²))
The Jacobian matrix V contains the partial derivative of h(x_k, 0) with respect to the noise v. Thus, it is of the following form:

    V = ( ∂h/∂v )

Because the measurement noise v is additive, ∂h/∂v is equal to 1.
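A quick way to check the two derived Jacobian entries is to compare them against central finite differences at an assumed operating point:

```python
import math

def h(x, y):
    return math.asin(y / math.hypot(x, y))

def dh_dx(x, y):
    # analytic entry derived above
    r2 = x * x + y * y
    return -x * y / (math.sqrt(1.0 - y * y / r2) * math.sqrt(r2**3))

def dh_dy(x, y):
    # analytic entry derived above
    r2 = x * x + y * y
    return (1.0 / math.sqrt(r2) - y * y / math.sqrt(r2**3)) / math.sqrt(1.0 - y * y / r2)

# central finite differences at an assumed operating point (x, y) = (5, 20)
x, y, eps = 5.0, 20.0, 1e-6
num_dx = (h(x + eps, y) - h(x - eps, y)) / (2 * eps)
num_dy = (h(x, y + eps) - h(x, y - eps)) / (2 * eps)
```

For x > 0 the two expressions simplify algebraically to −y/(x² + y²) and x/(x² + y²), which makes the check above easy to verify by hand.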
Experiments on Simulated Data
In the previous section, we derived the basis for applying the EKF approach to our problem: the process and measurement models. In this section, we will test these models on simulated data. We repeat the procedure from the simple example for the standard Kalman Filter. The initial values for the filter reflect reality but are only approximately known. This is represented by an error covariance matrix P whose values on the main diagonal are unequal to zero. The values for the process and measurement noise are also chosen to represent the real values.
[Plot omitted: Position [Units] vs. Time [Filter Cycles]; curves: Ship Position, Estimated Ship Position.]
Figure 4.8: The Simulation of the Problem of Estimating the Position of a Ship by Orienting at a Lighthouse.
To start the predict-correct cycle, we initialize the state x and the error covariance matrix P. For x we choose:

    x = (0, 1, 20)⊤
These initial values are only uncertainly known. For P we choose:

        | 1 0 0 |
    P = | 0 1 0 |
        | 0 0 1 |
In reality, the standard deviations of the process and measurement noise need to be determined prior to the application of the filter. Here, the values σ_v and σ_m reflect the real noise values:

    σ_v = 0.02
    σ_m = 0.02
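The ship/lighthouse experiment can be sketched end to end as follows. The initialization matches the values above; the random seed and noise draws are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_v = sigma_m = 0.02
A = np.array([[1., 1., 0.], [0., 1., 0.], [0., 0., 1.]])   # Jacobian of f (Delta-k = 1)
Q = np.diag([0., sigma_v**2, 0.])

def f(x):                     # process model, Eq. (4.34)
    return np.array([x[0] + x[1], x[1], x[2]])

def h(x):                     # bearing measurement, Eq. (4.36)
    return np.arcsin(x[2] / np.hypot(x[0], x[2]))

def H_jac(x):                 # Jacobian of h, evaluated at the predicted state
    r2 = x[0]**2 + x[2]**2
    s = np.sqrt(1. - x[2]**2 / r2)
    return np.array([[-x[0] * x[2] / (s * r2**1.5), 0.,
                      (1. / np.sqrt(r2) - x[2]**2 / r2**1.5) / s]])

x_true = np.array([0., 1., 20.])
x_est, P = x_true.copy(), np.eye(3)

for _ in range(10):
    x_true = f(x_true) + np.array([0., rng.normal(0., sigma_v), 0.])
    z = h(x_true) + rng.normal(0., sigma_m)
    x_est, P = f(x_est), A @ P @ A.T + Q          # predict (W Q W^T = Q here)
    H = H_jac(x_est)
    S = H @ P @ H.T + sigma_m**2                  # V = 1, so V R V^T = sigma_m^2
    K = P @ H.T / S
    x_est = x_est + (K * (z - h(x_est))).ravel()
    P = (np.eye(3) - K @ H) @ P
```

Because the innovation covariance is a scalar here, the matrix inverse degenerates to a division, which keeps the sketch short.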
We will run the filter on 10 simulated measurements. The results for the estimation of the ship's position are depicted in Figure 4.8. In Figure 4.9, the estimated lighthouse position is compared with the real one. In Figure 4.10, the behaviour of the error covariance P during the ten filter cycles is depicted. We can note that the uncertainty about the position of the ship decreases at first and then starts to increase slowly. This is due to the increasingly influential measurement noise. The farther the ship gets from its starting point, the less the measured angle changes its value. The measurement noise stays at a constant level and will therefore increase its influence on the uncertainty
[Plot omitted: Position [Units] vs. Time [Filter Cycles], y-range roughly 19.6 to 20.4; curves: Lighthouse Position, Estimated Lighthouse Position.]
Figure 4.9: The Results for the Mapping of the Lighthouse.
about the correctness of the inferred position of the ship. Small changes in the value of the angle will cause larger deviations in the estimation of the ship's position and therefore a large uncertainty about the state estimate.
[Plot omitted: Error Covariance P [Units] vs. Time [Filter Cycles]; curves: Variance in Ship Position, Variance in Velocity, Variance in Landmark Position.]
Figure 4.10: The Error Covariance Matrix P. After just one iteration, the uncertainty about the ship's position has decreased massively. Then it increases slightly. In contrast, the uncertainty about the velocity has nearly fallen to zero.
Chapter 6
An Observation Strategy
In the previous chapter, we applied the Extended Kalman Filter approach to the SLAM problem. A problem with using the EKF is that it does not scale very well: the complexity is cubic in the number of features in the map. In this chapter, we will examine strategies to reduce the complexity to O(n²), where n is the number of features.
One of these strategies is to measure just a single feature instead of all visible ones. In [35] it is shown that this is sufficient for tracking. If we do so, we need to select the best feature based on a heuristic. In the following, we will refer to this heuristic as an observation strategy. It is adapted from Davison in [9] and [8].
In this chapter, we will first concentrate on ways to reduce the time complexity of one EKF cycle. This examination is chiefly based on [23]. Secondly, an appropriate heuristic is introduced to realise the selection of the best landmark. The two SLAM scenarios, the first with a single camera, the second with a stereo camera, are handled separately.
6.1 Complexity of the Kalman Filter
We will first examine the general time complexity of the Extended Kalman Filter algorithm in detail. Considering each step during one EKF cycle, we will introduce methods to reduce the cubic time complexity to O(n²). As a reminder, the appropriate equations are listed in Figure 6.1.
If we look at these equations, we can state that there are two major time-consuming operations: matrix multiplication and matrix inversion. If the matrix multiplication is carried out in a straightforward manner, its time complexity is O(n³) when multiplying n × n matrices. Matrix inversion also grows cubically with the number of visible and measured features.
In the case of the EKF, the maximal size of a matrix, here P, is (13 + 3n) × (13 + 3n), where n is the number of features. The matrix which will be inverted is the innovation covariance. It is of dimension (2l × 2l) or (3l × 3l),¹ where l denotes the number of visible and measurable features. Because the number of

1. The dimension of the measurements using a monocular camera is 2. If a stereo camera is used as a vision sensor, the measurement is three-dimensional.
1. Predict Step

   (a) Predict the state ahead.

       x⁻ₖ = f(xₖ₋₁, 0)

   (b) Predict the error covariance matrix ahead.

       P⁻ₖ = A Pₖ₋₁ A⊤ + W Q W⊤

2. Correct Step

   (a) Calculate the Kalman Gain.

       Kₖ = P⁻ₖ H⊤ (H P⁻ₖ H⊤ + V R V⊤)⁻¹

   (b) Correct the a priori state estimate.

       xₖ = x⁻ₖ + Kₖ (zₖ − h(x⁻ₖ, 0))

   (c) Correct the a priori error covariance matrix estimate.

       Pₖ = (I − Kₖ H) P⁻ₖ

Figure 6.1: Equations of one Extended Kalman Filter Cycle.
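The cycle in Figure 6.1 can be sketched numerically. The following is a minimal dense implementation (no SLAM-specific sparsity exploited yet); the function and variable names are illustrative, not part of the original text:

```python
import numpy as np

def ekf_cycle(x, P, f, h, A, W, Q, H, V, R, z):
    """One EKF cycle following Figure 6.1 (dense, cubic-cost version)."""
    # Predict step
    x_pred = f(x)                                 # x-_k = f(x_{k-1}, 0)
    P_pred = A @ P @ A.T + W @ Q @ W.T            # P-_k = A P A^T + W Q W^T
    # Correct step
    S = H @ P_pred @ H.T + V @ R @ V.T            # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))          # a priori state correction
    P_new = (np.eye(len(x)) - K @ H) @ P_pred     # covariance correction
    return x_new, P_new

# Hypothetical 2D linear system with a 1D position measurement
A = np.eye(2); W = np.eye(2); Q = 0.1 * np.eye(2)
H = np.array([[1.0, 0.0]]); V = np.eye(1); R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
x_new, P_new = ekf_cycle(x, P, lambda s: A @ s, lambda s: H @ s,
                         A, W, Q, H, V, R, z=np.array([1.0]))
```

For a linear model this reduces to the ordinary Kalman filter; the EKF simply evaluates A, W, H and V as Jacobians of f and h at the current estimate.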
measurable features cannot be larger than the number of known features, the overall complexity of the EKF is O((13 + 3n)³) = O(n³).

We can reduce this complexity to O(n²) by considering aspects specific to the SLAM problem. First of all, the process model affects just the state of the camera and the velocities, summarised in xv. The known features are not involved, and thus not the whole state of the system.
Secondly, usually just a small subset of the feature points can be measured at each point in time, due to the constraints of the viewing direction. In the following, we will explain this in detail, first for the predict step and then for the correct step.
6.1.1 Complexity of the Predict Step
In the predict step of the Kalman Filter, we predict the state x of the system as x⁻ and the related error covariance P as P⁻. The process model f relates the state at one point in time to the next. But, as already mentioned above, just the state of the camera and its velocities are affected. Thus, the Jacobian matrix A, containing the partial derivatives of the process model with respect to the state, is of the following form:

A = [ ∂fv/∂xv  0 ]
    [    0     I ]

where fv is the first part of the process model,

xv,new = fv(xv, w = 0) = (tWnew, qCWnew, vWnew, ωCWnew)⊤ = (rW + vW Δt, q(ωCW Δt) × qCW, vW, ωCW)⊤.
The detailed Jacobian matrix A can be found in Appendix A. The overall dimension of the state is m = 13 + 3n, where n is the number of 3D landmarks and 13 is the dimension of xv. Thus, A is an m×m Jacobian matrix, as is the error covariance matrix P. The block ∂fv/∂xv is of dimension 13×13.

Let us consider the first summand A Pₖ₋₁ A⊤ of the prediction of the error covariance matrix P⁻ₖ, and let the old Pₖ₋₁ be partitioned as

Pₖ₋₁ = [ P11  P12 ]
       [ P21  P22 ].
P11 is a covariance matrix of dimension 13×13 related to xv. P12 and P21 are of dimension 13×3n and 3n×13, respectively.² P22 is then a 3n×3n covariance matrix.
If we perform the matrix operation for A Pₖ₋₁ A⊤ explicitly, we obtain:

A Pₖ₋₁ A⊤ = [ ∂fv/∂xv  0 ] [ P11  P12 ] [ (∂fv/∂xv)⊤  0 ]
            [    0     I ] [ P21  P22 ] [     0       I ]

          = [ (∂fv/∂xv) P11 (∂fv/∂xv)⊤   (∂fv/∂xv) P12 ]
            [ ((∂fv/∂xv) P12)⊤           P22           ]
²Note that P12 is the transpose of P21 because of the symmetry of covariances.
Regarding the dimensions of the matrices, the term (∂fv/∂xv) P11 (∂fv/∂xv)⊤ can be evaluated with 2(13·13·13) multiplications. To solve (∂fv/∂xv) P12 we need 13·13·3n multiplications. ((∂fv/∂xv) P12)⊤ is just the transpose of the previous term and does not need to be evaluated again. Altogether, the number of multiplications to evaluate A Pₖ₋₁ A⊤ is 2(13·13·13) + 13·13·3n.
The second summand W Q W⊤ of the prediction function can be treated equivalently. The Jacobian matrix W contains the partial derivatives of the process model with respect to the process noise. It is of the following form:

W = [ ∂fv/∂VW   ∂fv/∂ΩCW ]
    [    0          0    ]

For the detailed matrix, see Appendix A. Since the process noise vector w is of dimension 6, W is an m×6 matrix. The blocks ∂fv/∂VW and ∂fv/∂ΩCW each carry 13×3 elements. The process noise does not affect the coordinates of the known features; thus, the corresponding elements of W are equal to zero.
The process noise covariance Q can be denoted by:

Q = [ Q11   0  ]
    [  0   Q22 ]

It is a 6×6 matrix and the blocks Q11 and Q22 are each of dimension 3×3. If we perform the matrix multiplication W Q W⊤ explicitly, we derive:

W Q W⊤ = [ ∂fv/∂VW   ∂fv/∂ΩCW ] [ Q11   0  ] [ (∂fv/∂VW)⊤    0 ]
         [    0          0    ] [  0   Q22 ] [ (∂fv/∂ΩCW)⊤   0 ]

       = [ (∂fv/∂VW) Q11 (∂fv/∂VW)⊤ + (∂fv/∂ΩCW) Q22 (∂fv/∂ΩCW)⊤   0 ]
         [                            0                             0 ]

Because no block whose size is related to the n known features is involved in the single non-zero block, the number of multiplications is independent of n: we need exactly 2(13·3·3) + 2(13·3·13) multiplications.
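The block-wise evaluation of the covariance prediction can be checked numerically. The sketch below, with stand-in random matrices for the Jacobian blocks (the dimensions 13, 6 and 3n follow the text; the block contents are hypothetical), confirms that the quadratic-cost block computation agrees with the full cubic-cost product:

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 13, 5                       # camera-state dimension and number of landmarks
m = c + 3 * n

# Random stand-ins for the Jacobian and covariance blocks
Fv = rng.standard_normal((c, c))               # dfv/dxv
Wv = rng.standard_normal((c, 6))               # [dfv/dV^W  dfv/dOmega^CW]
Q = np.diag(rng.random(6))                     # 6x6 process noise covariance
P = rng.standard_normal((m, m)); P = P @ P.T   # symmetric P_{k-1}

# Full (cubic-cost) evaluation with the zero-padded A and W
A = np.block([[Fv, np.zeros((c, m - c))],
              [np.zeros((m - c, c)), np.eye(m - c)]])
W = np.vstack([Wv, np.zeros((m - c, 6))])
P_full = A @ P @ A.T + W @ Q @ W.T

# Block-wise evaluation: only the blocks touching xv change
P11, P12, P22 = P[:c, :c], P[:c, c:], P[c:, c:]
top_left = Fv @ P11 @ Fv.T + Wv @ Q @ Wv.T     # 13x13 block incl. noise term
top_right = Fv @ P12                           # 13 x 3n block
P_block = np.block([[top_left, top_right],
                    [top_right.T, P22]])       # P21 Fv^T is just the transpose
```

The bottom-right 3n×3n block P22 is copied unchanged, which is exactly why the predict step stays linear in m.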
Thus, the overall cost of the predict step is linear in m.
6.1.2 Complexity of the Correct Step
Since just a few of all known features are visible to the camera sensor at each point in time, the Jacobian matrix H, containing the partial derivatives of the measurement model h with respect to the state, carries a large number of zeros. Let us assume that we measure just one feature yWi per time step. Then H is of the following form:

H = [ ∂h/∂xv   0   ∂h/∂yWi   0 ]

The detailed Jacobian matrix can be found in Appendix A. We know that the dimension of the state vector x is m = 13 + 3n. The dimension p of the measurement vector is either 2 or 3, depending on whether we use a single or a stereo camera. Thus, the whole matrix H is of dimension p×m. The block ∂h/∂xv carries p×13 elements, whereas ∂h/∂yWi is of dimension p×3.
To evaluate the Kalman Gain K, we need to perform the multiplication P⁻ₖ H⊤. For this purpose, P⁻ₖ is partitioned column-wise as

P⁻ₖ = [ P1   P01   P2   P02 ]   (6.1)

The block P1 contains m×13 and the block P2 m×3 elements. If we perform this multiplication explicitly, we obtain

P⁻ₖ H⊤ = [ P1  P01  P2  P02 ] [ (∂h/∂xv)⊤  ]
                              [     0      ]
                              [ (∂h/∂yWi)⊤ ]
                              [     0      ]

        = P1 (∂h/∂xv)⊤ + P2 (∂h/∂yWi)⊤.

The number of multiplications adds up to 16pm.
After evaluating P⁻ₖ H⊤, we need to derive the innovation covariance S. It is obtained by the equation S = H P⁻ₖ H⊤ + V R V⊤. We will first consider the first summand.

The result for P⁻ₖ H⊤ is an m×p matrix and is represented by

P⁻ₖ H⊤ = [ P′1  ]
         [ P′01 ]
         [ P′2  ]
         [ P′02 ]

where the block P′1 is a 13×p and P′2 a 3×p matrix. For the product H P⁻ₖ H⊤ we obtain

H P⁻ₖ H⊤ = [ ∂h/∂xv  0  ∂h/∂yWi  0 ] [ P′1  ]
                                     [ P′01 ]
                                     [ P′2  ]
                                     [ P′02 ]

          = (∂h/∂xv) P′1 + (∂h/∂yWi) P′2

The amount of multiplications is 16p², where p is either 2 or 3.

The second summand V R V⊤ can be simplified equivalently. R is the measurement error covariance of dimension p×p. The Jacobian matrix V contains the partial derivatives of the measurement model with respect to the measurement noise. Because the measurement noise vector is an additive constant in both SLAM scenarios, whether with a single or a stereo camera, V is an identity matrix regardless of the value of p. We have

V R V⊤ = R.

The overall amount of multiplications to calculate the innovation covariance is 16p².
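The sparsity of H makes the gain computation cheap. A numerical sketch with hypothetical small dimensions (random stand-ins for the Jacobian blocks), showing that the two-block product matches the dense P⁻ₖ H⊤ and that only a p×p matrix ever needs inverting:

```python
import numpy as np

rng = np.random.default_rng(1)
c, n, p = 13, 5, 2                 # camera dim, landmarks, measurement dim (mono)
m = c + 3 * n
i = 2                              # index of the single measured landmark

Hx = rng.standard_normal((p, c))   # dh/dxv
Hy = rng.standard_normal((p, 3))   # dh/dy^W_i
P = rng.standard_normal((m, m)); P = P @ P.T   # predicted covariance P-_k
R = np.diag(rng.random(p) + 0.1)   # measurement noise covariance

# Dense H: zero everywhere except the camera block and landmark i's block
H = np.zeros((p, m))
H[:, :c] = Hx
H[:, c + 3 * i : c + 3 * (i + 1)] = Hy

# Block-wise evaluation: P-_k H^T = P1 Hx^T + P2 Hy^T  (16*p*m multiplications)
P1 = P[:, :c]
P2 = P[:, c + 3 * i : c + 3 * (i + 1)]
PHt = P1 @ Hx.T + P2 @ Hy.T

S = H @ PHt + R                    # innovation covariance; V = I, so VRV^T = R
K = PHt @ np.linalg.inv(S)         # Kalman gain, only a p x p inverse needed
```

The inversion cost is p³ with p ∈ {2, 3}, negligible next to the 16pm block products.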
To evaluate the Kalman Gain K, we need to invert S. As already mentioned above, the complexity of matrix inversion grows cubically with the number of rows or columns of the considered square matrix. Here, we have a p×p matrix to invert; thus, we need p³ multiplications.

The whole amount of multiplications to calculate the Kalman Gain is therefore 16pm + 16p² + p³, which is linear in m.

Until now, the complexity of all equations, whether in the predict or the correct step, has been linear in m. The second equation of the correct step, updating the error covariance P, is responsible for the quadratic complexity.

Predict Step
  P⁻ₖ = A Pₖ₋₁ A⊤ + W Q W⊤               O(m) = O(13 + 3n) = O(n)

Correct Step
  Kₖ = P⁻ₖ H⊤ (H P⁻ₖ H⊤ + V R V⊤)⁻¹      O(m) = O(13 + 3n) = O(n)
  Pₖ = (I − Kₖ H) P⁻ₖ                    O(m²) = O((13 + 3n)²) = O(n²)

Table 6.1: Complexities for the Equations of one Extended Kalman Filter Cycle.

We have to evaluate the summand Kₖ H P⁻ₖ. We will first consider the product H P⁻ₖ. H is, as already stated above, represented by

H = [ ∂h/∂xv   0   ∂h/∂yWi   0 ]
The predicted error covariance matrix P⁻ₖ is partitioned row-wise as

P⁻ₖ = [ P1  ]
      [ P01 ]
      [ P2  ]
      [ P02 ].

Note that these blocks are not the same as in Equation (6.1), although they split up the same matrix P⁻ₖ. Here, P1 is of dimension 13×m and P2 carries 3×m elements. If we evaluate the product, we obtain

H P⁻ₖ = [ ∂h/∂xv  0  ∂h/∂yWi  0 ] [ P1  ]
                                  [ P01 ]
                                  [ P2  ]
                                  [ P02 ]

       = (∂h/∂xv) P1 + (∂h/∂yWi) P2
where the result is a p×m matrix; 16pm multiplications are needed. The last step is to multiply the Kalman Gain K with this result. Neither K, which is an m×p matrix, nor H P⁻ₖ contains a zero or identity block. Therefore, we derive an m×m matrix by performing pm² multiplications.

Thus, the time complexity of the correct step is O(m²), or, if we just consider the number of known features, O((13 + 3n)²) = O(n²). At the same time, this is the time complexity of one EKF cycle. The results presented in Section 6.1 are summarised in Table 6.1.
6.2 A Heuristic to Decide which Feature to Track
In the last section we presented methods to reduce the complexity of one EKFcycle by taking the particular structure of the SLAM problem into account.
For one of these methods it is assumed that we measure just one of the visible feature points per point in time. But if we do so, two questions arise:

Is it sufficient for the estimation of the state to measure just one feature?

Which of the several visible features is best to measure?

Considering the first question, Welch and Bishop [35] presented the SCAAT³ method, where it is shown that measuring a single landmark after each time step is sufficient to observe the 3D structure and motion of a scene over time.

In the case of 3D-SLAM, a single measurement of a 2D projection of a 3D landmark provides only partial or incomplete information about the whole state of the system, e.g., nothing about the (linear or angular) velocity of the camera and nothing about the depth of the 3D feature position. Systems operating just on such incomplete measurements are referred to as unobservable, because the whole system state cannot be inferred from them. Such systems must incorporate a sufficient set of these measurements to obtain observability. This can be achieved over space or over time; the latter is adopted by the SCAAT technique. It is based on the Extended Kalman Filter, where individual measurements providing incomplete information about the system's state are blended into a complete state estimate. The means for this blending is provided by the filter itself, and the resulting state estimate describes the blended information. Based on several experiments, SCAAT was shown to be accurate, stable, fast and flexible.
To answer the second question, we first need a criterion to rate the features. An intuitive idea is stated by Davison in [9]: the more uncertain we are about the 3D position of a feature, the more profitable it is to measure it. In other words, measurements of features that are difficult to predict provide more information about the position of this feature and of the camera than measurements of features which can be reliably predicted.
The innovation covariance S describes the uncertainty about each predicted measurement. Thus, it contains the basic information to decide which visible feature should be measured at each point in time. It is calculated as follows:

S = H P H⊤ + V R V⊤   (6.2)

where H and V are the Jacobian matrices of the measurement model h(x, 0) with respect to the state x and the measurement noise v, respectively. P is the error covariance matrix linked to the state and R is the measurement noise covariance.

S is the covariance of a multivariate Gaussian. Therefore, covariance matrices Si for each predicted measurement zi corresponding to a visible feature point yWi can be extracted from it. These smaller covariances refer to a Gaussian with the measurement zi as its mean. According to Whaite and Ferrie [36], depending on the measurement space, each Si can be represented either by an ellipse or an ellipsoid centred around the mean of the distribution. These are also referred to as ellipses or ellipsoids of confidence and represent the amount of uncertainty about the predicted measurement. In other words, we can be confident that
³Single Constraint At A Time
the real measurement is situated within the ellipse or ellipsoid. By calculating the surface area or volume of these objects, we can decide which predicted measurement is most uncertain.

Besides its role as a measure of the information content expected of a measurement, Si also defines a search region in which the according measurement zi should be located with high probability. Thus, once we have decided to measure a specific feature, we can send the parameters of the search region to the feature tracker. The advantages of this method are obvious: the feature tracker just needs to search a small region of interest instead of the whole picture, and the chances of a mismatch are reduced.
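Extracting the per-feature blocks Si from S is a simple slicing operation, since each visible feature contributes one p×p block on the diagonal of S. A sketch with a hypothetical diagonal S (the helper name is illustrative):

```python
import numpy as np

def innovation_blocks(S, p):
    """Split the innovation covariance S of l visible features into the
    l diagonal p x p blocks S_i (p = 2 for mono, p = 3 for stereo)."""
    l = S.shape[0] // p
    return [S[p * i : p * (i + 1), p * i : p * (i + 1)] for i in range(l)]

# Hypothetical innovation covariance for l = 3 features seen by a mono camera
S = np.diag([1.0, 2.0, 0.5, 0.5, 4.0, 3.0])
blocks = innovation_blocks(S, p=2)   # blocks[i] belongs to feature i
```

Each block, together with the corresponding predicted measurement, parameterises the search region handed to the feature tracker.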
In the previous chapter, we considered two SLAM cases: SLAM with a singlecamera and SLAM with a stereo camera. In the following sections, the heuristicis discussed in detail with respect to the different vision sensors.
6.2.1 Deriving the Innovation Covariance Matrix for SLAM with a Single Camera
In case we use a single camera, we predict two-dimensional measurements yIi for each visible three-dimensional feature yWi, referring to its 2D projection onto the image plane. Thus, if l features are visible, S is a 2l×2l matrix, and l 2×2 covariance matrices Si regarding the visible features can be extracted from it. These covariance matrices represent a two-dimensional normal distribution over image coordinates, whose mean is the predicted measurement yIi. The distribution can be visualised by an ellipse of confidence in the image: its centre corresponds to the mean, the directions of its axes are given by the eigenvectors of the covariance matrix, and the square roots of the according eigenvalues specify the deviation of the distribution along the axes.

According to [36], the surface area of the ellipse can be used as a measure of uncertainty. If a and b denote the lengths of the principal axes of the ellipse, the surface area A is calculated by

A = πab.
The standard deviation of a distribution describes the average deviation of the related Gaussian. The values of the whole distribution vary much more: possible realisations of the predicted measurement situated beyond the average deviation are just less probable, but should also be involved in the calculation of the amount of uncertainty and in the size of the search region.

Thus, we introduce the factor nσ and multiply the lengths of the principal axes of the ellipse by it. Consider the estimated measurement yIi with eigenvalues e1,i and e2,i of the according covariance matrix Si. Since both axes are scaled by nσ, the surface area of the demanded ellipse is

Ai = π nσ² √(e1,i e2,i).   (6.3)

The value for nσ should extend the standard deviation such that the probability for a measurement to be found within the considered region is approximately 100%. In [9], Davison chose nσ = 3. The probability that the possible realisations of a normally distributed random variable lie within the 3σ-region around the mean of the distribution is approximately 99% ([16], p. 1119).
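The ranking by ellipse area can be sketched directly from the eigenvalues of each Si. The helper below is illustrative (the covariances are hypothetical); scaling both axes by nσ multiplies the area by nσ², as in Equation (6.3):

```python
import numpy as np

def ellipse_area(S_i, n_sigma=3.0):
    """Area of the n_sigma confidence ellipse of a 2x2 innovation covariance:
    axes n_sigma*sqrt(e1) and n_sigma*sqrt(e2) give A = pi*n_sigma^2*sqrt(e1*e2)."""
    e1, e2 = np.linalg.eigvalsh(S_i)   # eigenvalues in ascending order
    return np.pi * n_sigma**2 * np.sqrt(e1 * e2)

# Rank hypothetical per-feature innovation covariances: the largest area marks
# the most uncertain prediction, i.e. the feature worth measuring
covs = [np.diag([1.0, 1.0]), np.diag([4.0, 0.25]), np.diag([9.0, 4.0])]
best = max(range(len(covs)), key=lambda i: ellipse_area(covs[i]))
```

Since nσ² and π are common monotone factors, the ranking itself only depends on the product of the eigenvalues, i.e. on det(Si).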
After calculating the amount of uncertainty about the predicted measurement of each visible 3D feature, we can rank them and send the parameters (predicted measurement and corresponding covariance matrix) of the landmark whose measurement is most difficult to predict to the feature tracker. The corresponding covariance matrix specifies the search region for the demanded feature measurement within the image, centred around the estimated measurement.
6.2.2 Deriving the Innovation Covariance Matrix for SLAM with a Stereo Camera
In the second SLAM scenario, we use a stereo camera to measure the visible features. For each, we derive a three-dimensional measurement vector. Thus, if l features are visible, l smaller 3×3 covariances Si, each referring to one of the predicted measurements of the visible features, can be extracted from the innovation covariance matrix S. As already mentioned for the two-dimensional case, these covariances are related to a normal distribution, and their means are the predicted measurements.

Considering one visible feature point yWi, the measurement vector for the SLAM scenario with a stereo camera consists of the image coordinates of the projection of this feature onto the left image plane, yIi = (xIl, yIl)⊤, and the disparity dI. The according innovation covariance matrix Si is therefore not defined over one of the image coordinate frames, as was the case when using a monocular vision sensor. It can be represented as an ellipsoid in the space spanned by xIl, yIl and dI. Analogous to the surface area of the ellipses, the volume of the ellipsoids can be seen as a measure of uncertainty. The volume of an ellipsoid with principal axes of lengths a, b and c is

V = (4/3) π abc.
where a, b and c are the lengths of its principal axes. If we substitute the squareroot of the eigenvalues for a, b and c and introduce the factor nσ again, we derivethe equation to calculate the volume of each Si:
V i =4
3πnσ
√e1,ie2,ie3,i
After calculating this volume for each ellipsoid, we are able to rank the visible3D feature points. The corresponding predicted measurement and innovationcovariance of the landmark whose measurement is most difficult to predict is sentto the feature tracker Centred around this prediction the covariance matrix
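The stereo case is the direct three-dimensional analogue of the ellipse ranking; a sketch with hypothetical covariances (all three axes are scaled by nσ, hence the nσ³ factor):

```python
import numpy as np

def ellipsoid_volume(S_i, n_sigma=3.0):
    """Volume of the n_sigma confidence ellipsoid of a 3x3 innovation
    covariance: V = (4/3) * pi * n_sigma^3 * sqrt(e1 * e2 * e3)."""
    e = np.linalg.eigvalsh(S_i)
    return (4.0 / 3.0) * np.pi * n_sigma**3 * np.sqrt(np.prod(e))

# Hypothetical covariances over (x_l, y_l, disparity); the feature with the
# largest confidence ellipsoid is handed to the feature tracker
covs = [np.diag([1.0, 1.0, 1.0]), np.diag([2.0, 2.0, 8.0])]
best = max(range(len(covs)), key=lambda i: ellipsoid_volume(covs[i]))
```

As in the monocular case, the ranking depends only on det(Si), so the constant factors may be dropped in an implementation.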