An Introduction to the Kalman Filter

Greg Welch¹ and Gary Bishop²

TR 95-041
Department of Computer Science
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599-3175

Updated: Monday, July 24, 2006
Abstract

In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete-data linear filtering problem. Since that time, due in large part to advances in digital computing, the Kalman filter has been the subject of extensive research and application, particularly in the area of autonomous or assisted navigation.

The Kalman filter is a set of mathematical equations that provides an efficient computational (recursive) means to estimate the state of a process, in a way that minimizes the mean of the squared error. The filter is very powerful in several aspects: it supports estimations of past, present, and even future states, and it can do so even when the precise nature of the modeled system is unknown.

The purpose of this paper is to provide a practical introduction to the discrete Kalman filter. This introduction includes a description and some discussion of the basic discrete Kalman filter, a derivation, description and some discussion of the extended Kalman filter, and a relatively simple (tangible) example with real numbers & results.
1. [email protected], http://www.cs.unc.edu/~welch
2. [email protected], http://www.cs.unc.edu/~gb
Welch & Bishop, An Introduction to the Kalman Filter 2
UNC-Chapel Hill, TR 95-041, July 24, 2006
1 The Discrete Kalman Filter
In 1960, R.E. Kalman published his famous paper describing a recursive solution to the discrete-data linear filtering problem [Kalman60]. Since that time, due in large part to advances in digital computing, the Kalman filter has been the subject of extensive research and application, particularly in the area of autonomous or assisted navigation. A very "friendly" introduction to the general idea of the Kalman filter can be found in Chapter 1 of [Maybeck79], while a more complete introductory discussion can be found in [Sorenson70], which also contains some interesting historical narrative. More extensive references include [Gelb74; Grewal93; Maybeck79; Lewis86; Brown92; Jacobs93].
The Process to be Estimated
The Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by the linear stochastic difference equation

$$x_k = A x_{k-1} + B u_{k-1} + w_{k-1},$$  (1.1)

with a measurement $z \in \mathbb{R}^m$ that is

$$z_k = H x_k + v_k.$$  (1.2)

The random variables $w_k$ and $v_k$ represent the process and measurement noise (respectively). They are assumed to be independent (of each other), white, and with normal probability distributions

$$p(w) \sim N(0, Q),$$  (1.3)

$$p(v) \sim N(0, R).$$  (1.4)

In practice, the process noise covariance $Q$ and measurement noise covariance $R$ matrices might change with each time step or measurement; however, here we assume they are constant.

The $n \times n$ matrix $A$ in the difference equation (1.1) relates the state at the previous time step $k-1$ to the state at the current step $k$, in the absence of either a driving function or process noise. Note that in practice $A$ might change with each time step, but here we assume it is constant. The $n \times l$ matrix $B$ relates the optional control input $u \in \mathbb{R}^l$ to the state $x$. The $m \times n$ matrix $H$ in the measurement equation (1.2) relates the state to the measurement $z_k$. In practice $H$ might change with each time step or measurement, but here we assume it is constant.

The Computational Origins of the Filter

We define $\hat{x}_k^- \in \mathbb{R}^n$ (note the "super minus") to be our a priori state estimate at step $k$ given knowledge of the process prior to step $k$, and $\hat{x}_k \in \mathbb{R}^n$ to be our a posteriori state estimate at step $k$ given measurement $z_k$. We can then define a priori and a posteriori estimate errors as

$$e_k^- \equiv x_k - \hat{x}_k^-, \quad \text{and} \quad e_k \equiv x_k - \hat{x}_k.$$
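The linear model in (1.1) and (1.2) can be simulated directly. The sketch below assumes an illustrative two-state system (position and velocity) with hypothetical values for $A$, $B$, $H$, $Q$, and $R$; none of these particular matrices come from the text:

```python
import numpy as np

# Hypothetical 2-state system (position, velocity); values are illustrative.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])   # n x n state transition
B = np.array([[0.0], [dt]])             # n x l control input model
H = np.array([[1.0, 0.0]])              # m x n measurement model
Q = 1e-4 * np.eye(2)                    # process noise covariance
R = np.array([[1e-2]])                  # measurement noise covariance

rng = np.random.default_rng(0)
x = np.zeros((2, 1))                    # true state
u = np.array([[1.0]])                   # constant control input

states, measurements = [], []
for _ in range(50):
    w = rng.multivariate_normal(np.zeros(2), Q).reshape(2, 1)
    x = A @ x + B @ u + w               # equation (1.1)
    v = rng.multivariate_normal(np.zeros(1), R).reshape(1, 1)
    z = H @ x + v                       # equation (1.2)
    states.append(x.copy())
    measurements.append(z.copy())
```

This is only the process being *estimated*; the filter itself, which sees only `measurements`, is developed below.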
The a priori estimate error covariance is then

$$P_k^- = E[e_k^- e_k^{-T}],$$  (1.5)

and the a posteriori estimate error covariance is

$$P_k = E[e_k e_k^T].$$  (1.6)

In deriving the equations for the Kalman filter, we begin with the goal of finding an equation that computes an a posteriori state estimate $\hat{x}_k$ as a linear combination of an a priori estimate $\hat{x}_k^-$ and a weighted difference between an actual measurement $z_k$ and a measurement prediction $H\hat{x}_k^-$, as shown below in (1.7). Some justification for (1.7) is given in "The Probabilistic Origins of the Filter" found below.

$$\hat{x}_k = \hat{x}_k^- + K(z_k - H\hat{x}_k^-)$$  (1.7)

The difference $(z_k - H\hat{x}_k^-)$ in (1.7) is called the measurement innovation, or the residual. The residual reflects the discrepancy between the predicted measurement $H\hat{x}_k^-$ and the actual measurement $z_k$. A residual of zero means that the two are in complete agreement.

The $n \times m$ matrix $K$ in (1.7) is chosen to be the gain or blending factor that minimizes the a posteriori error covariance (1.6). This minimization can be accomplished by first substituting (1.7) into the above definition for $e_k$, substituting that into (1.6), performing the indicated expectations, taking the derivative of the trace of the result with respect to $K$, setting that result equal to zero, and then solving for $K$. For more details see [Maybeck79; Brown92; Jacobs93]. One form of the resulting $K$ that minimizes (1.6) is given by¹

$$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1} = \frac{P_k^- H^T}{H P_k^- H^T + R}.$$  (1.8)

Looking at (1.8) we see that as the measurement error covariance $R$ approaches zero, the gain $K$ weights the residual more heavily. Specifically,

$$\lim_{R_k \to 0} K_k = H^{-1}.$$

On the other hand, as the a priori estimate error covariance $P_k^-$ approaches zero, the gain $K$ weights the residual less heavily. Specifically,

$$\lim_{P_k^- \to 0} K_k = 0.$$

1. All of the Kalman filter equations can be algebraically manipulated into several forms. Equation (1.8) represents the Kalman gain in one popular form.
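These two limits are easy to verify numerically. A small sketch, assuming a scalar state with $H = 1$ (chosen only for illustration):

```python
import numpy as np

H = np.array([[1.0]])

# As R -> 0, K_k -> H^{-1} = 1: the measurement is trusted completely.
P_minus = np.array([[1.0]])
K_meas = P_minus @ H.T @ np.linalg.inv(H @ P_minus @ H.T + np.array([[1e-12]]))

# As P_k^- -> 0, K_k -> 0: the prediction is trusted completely.
R = np.array([[1.0]])
P_tiny = np.array([[1e-12]])
K_pred = P_tiny @ H.T @ np.linalg.inv(H @ P_tiny @ H.T + R)
```

Here `K_meas` is essentially 1 and `K_pred` essentially 0, matching the two limits above.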
Another way of thinking about the weighting by $K$ is that as the measurement error covariance $R$ approaches zero, the actual measurement $z_k$ is "trusted" more and more, while the predicted measurement $H\hat{x}_k^-$ is trusted less and less. On the other hand, as the a priori estimate error covariance $P_k^-$ approaches zero the actual measurement $z_k$ is trusted less and less, while the predicted measurement $H\hat{x}_k^-$ is trusted more and more.

The Probabilistic Origins of the Filter

The justification for (1.7) is rooted in the probability of the a priori estimate $\hat{x}_k^-$ conditioned on all prior measurements $z_k$ (Bayes' rule). For now let it suffice to point out that the Kalman filter maintains the first two moments of the state distribution,

$$E[x_k] = \hat{x}_k,$$

$$E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T] = P_k.$$

The a posteriori state estimate (1.7) reflects the mean (the first moment) of the state distribution; it is normally distributed if the conditions of (1.3) and (1.4) are met. The a posteriori estimate error covariance (1.6) reflects the variance of the state distribution (the second non-central moment). In other words,

$$p(x_k \mid z_k) \sim N(E[x_k], E[(x_k - \hat{x}_k)(x_k - \hat{x}_k)^T]) = N(\hat{x}_k, P_k).$$

For more details on the probabilistic origins of the Kalman filter, see [Maybeck79; Brown92; Jacobs93].

The Discrete Kalman Filter Algorithm

We will begin this section with a broad overview, covering the "high-level" operation of one form of the discrete Kalman filter (see the previous footnote). After presenting this high-level view, we will narrow the focus to the specific equations and their use in this version of the filter.

The Kalman filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of (noisy) measurements. As such, the equations for the Kalman filter fall into two groups: time update equations and measurement update equations. The time update equations are responsible for projecting forward (in time) the current state and error covariance estimates to obtain the a priori estimates for the next time step. The measurement update equations are responsible for the feedback, i.e., for incorporating a new measurement into the a priori estimate to obtain an improved a posteriori estimate.

The time update equations can also be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. Indeed the final estimation algorithm resembles that of a predictor-corrector algorithm for solving numerical problems, as shown below in Figure 1-1.
Figure 1-1. The ongoing discrete Kalman filter cycle. The time update projects the current state estimate ahead in time. The measurement update adjusts the projected estimate by an actual measurement at that time.

The specific equations for the time and measurement updates are presented below in Table 1-1 and Table 1-2.

Table 1-1: Discrete Kalman filter time update equations.

$$\hat{x}_k^- = A\hat{x}_{k-1} + Bu_{k-1}$$  (1.9)

$$P_k^- = AP_{k-1}A^T + Q$$  (1.10)

Table 1-2: Discrete Kalman filter measurement update equations.

$$K_k = P_k^- H^T (HP_k^- H^T + R)^{-1}$$  (1.11)

$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$$  (1.12)

$$P_k = (I - K_k H)P_k^-$$  (1.13)

Again notice how the time update equations in Table 1-1 project the state and covariance estimates forward from time step $k-1$ to step $k$. $A$ and $B$ are from (1.1), while $Q$ is from (1.3). Initial conditions for the filter are discussed in the earlier references.

The first task during the measurement update is to compute the Kalman gain, $K_k$. Notice that the equation given here as (1.11) is the same as (1.8). The next step is to actually measure the process to obtain $z_k$, and then to generate an a posteriori state estimate by incorporating the measurement as in (1.12). Again (1.12) is simply (1.7) repeated here for completeness. The final step is to obtain an a posteriori error covariance estimate via (1.13).

After each time and measurement update pair, the process is repeated with the previous a posteriori estimates used to project or predict the new a priori estimates. This recursive nature is one of the very appealing features of the Kalman filter: it makes practical implementations much more feasible than (for example) an implementation of a Wiener filter [Brown92], which is designed to operate on all of the data directly for each estimate. The Kalman filter instead recursively conditions the current estimate on all of the past measurements. Figure 1-2 below offers a complete picture of the operation of the filter, combining the high-level diagram of Figure 1-1 with the equations from Table 1-1 and Table 1-2.
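Equations (1.9) through (1.13) translate almost line for line into code. A sketch in NumPy with generic matrices; the helper names are ours, not the paper's:

```python
import numpy as np

def time_update(x_hat, P, A, B, u, Q):
    """Project the state and covariance ahead: equations (1.9) and (1.10)."""
    x_minus = A @ x_hat + B @ u                                # (1.9)
    P_minus = A @ P @ A.T + Q                                  # (1.10)
    return x_minus, P_minus

def measurement_update(x_minus, P_minus, z, H, R):
    """Correct the projection with a measurement: equations (1.11)-(1.13)."""
    K = P_minus @ H.T @ np.linalg.inv(H @ P_minus @ H.T + R)   # (1.11)
    x_hat = x_minus + K @ (z - H @ x_minus)                    # (1.12)
    P = (np.eye(len(P_minus)) - K @ H) @ P_minus               # (1.13)
    return x_hat, P

# One scalar cycle as a sanity check: static state, unit prior covariance.
A = B = H = np.eye(1); u = np.zeros((1, 1))
Q = np.zeros((1, 1)); R = np.eye(1)
x_minus, P_minus = time_update(np.zeros((1, 1)), np.eye(1), A, B, u, Q)
x_hat, P = measurement_update(x_minus, P_minus, np.array([[1.0]]), H, R)
```

With equal prior and measurement uncertainty the gain is 0.5, so the posterior estimate lands halfway between the prediction (0) and the measurement (1), and the covariance halves.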
Filter Parameters and Tuning
In the actual implementation of the filter, the measurement noise covariance $R$ is usually measured prior to operation of the filter. Measuring the measurement error covariance $R$ is generally practical (possible) because we need to be able to measure the process anyway (while operating the filter), so we should generally be able to take some off-line sample measurements in order to determine the variance of the measurement noise.

The determination of the process noise covariance $Q$ is generally more difficult as we typically do not have the ability to directly observe the process we are estimating. Sometimes a relatively simple (poor) process model can produce acceptable results if one "injects" enough uncertainty into the process via the selection of $Q$. Certainly in this case one would hope that the process measurements are reliable.

In either case, whether or not we have a rational basis for choosing the parameters, oftentimes superior filter performance (statistically speaking) can be obtained by tuning the filter parameters $Q$ and $R$. The tuning is usually performed off-line, frequently with the help of another (distinct) Kalman filter in a process generally referred to as system identification.

Figure 1-2. A complete picture of the operation of the Kalman filter, combining the high-level diagram of Figure 1-1 with the equations from Table 1-1 and Table 1-2.

Time Update ("Predict"):
(1) Project the state ahead: $\hat{x}_k^- = A\hat{x}_{k-1} + Bu_{k-1}$
(2) Project the error covariance ahead: $P_k^- = AP_{k-1}A^T + Q$

Measurement Update ("Correct"):
(1) Compute the Kalman gain: $K_k = P_k^- H^T (HP_k^- H^T + R)^{-1}$
(2) Update estimate with measurement $z_k$: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$
(3) Update the error covariance: $P_k = (I - K_k H)P_k^-$

Initial estimates for $\hat{x}_{k-1}$ and $P_{k-1}$ feed the first time update.

In closing we note that under conditions where $Q$ and $R$ are in fact constant, both the estimation error covariance $P_k$ and the Kalman gain $K_k$ will stabilize quickly and then remain constant (see the filter update equations in Figure 1-2). If this is the case, these parameters can be pre-computed by either running the filter off-line, or for example by determining the steady-state value of $P_k$ as described in [Grewal93].
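That pre-computation can be sketched by simply iterating the covariance recursion off-line until it converges. The scalar values below are illustrative, not from the text:

```python
import numpy as np

A = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[1e-5]]); R = np.array([[1e-2]])

# Iterate the time/measurement covariance recursion until P_k stabilizes;
# the state estimate itself is not needed for this computation.
P = np.array([[1.0]])
for _ in range(1000):
    P_minus = A @ P @ A.T + Q                                  # (1.10)
    K = P_minus @ H.T @ np.linalg.inv(H @ P_minus @ H.T + R)   # (1.11)
    P = (np.eye(1) - K @ H) @ P_minus                          # (1.13)

# K and P now hold the steady-state gain and error covariance.
```

The converged `K` and `P` could then be hard-coded into a run-time filter, trading a little optimality during the transient for a much cheaper update loop.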
It is frequently the case however that the measurement error (in particular) does not remain constant. For example, when sighting beacons in our optoelectronic tracker ceiling panels, the noise in measurements of nearby beacons will be smaller than that in far-away beacons. Also, the process noise $Q$ is sometimes changed dynamically during filter operation (becoming $Q_k$) in order to adjust to different dynamics. For example, in the case of tracking the head of a user of a 3D virtual environment we might reduce the magnitude of $Q_k$ if the user seems to be moving slowly, and increase the magnitude if the dynamics start changing rapidly. In such cases $Q_k$ might be chosen to account for both uncertainty about the user's intentions and uncertainty in the model.
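One hypothetical way to realize such a schedule (this particular heuristic is our illustration, not from the text) is to pick $Q_k$ from the magnitude of recent residuals:

```python
import numpy as np

def adaptive_q(recent_residuals, q_slow=1e-5, q_fast=1e-2, threshold=0.5):
    """Illustrative Q_k schedule: inflate the process noise when the RMS of
    recent innovations suggests the dynamics are changing rapidly."""
    rms = float(np.sqrt(np.mean(np.square(recent_residuals))))
    return q_fast if rms > threshold else q_slow

q_calm = adaptive_q([0.01, -0.02, 0.015])   # small residuals -> small Q_k
q_busy = adaptive_q([1.2, -0.8, 0.9])       # large residuals -> large Q_k
```

The thresholds and noise levels here are arbitrary placeholders; in practice they would themselves be tuned, as discussed above.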
2 The Extended Kalman Filter (EKF)
The Process to be Estimated
As described above in section 1, the Kalman filter addresses the general problem of trying to estimate the state $x \in \mathbb{R}^n$ of a discrete-time controlled process that is governed by a linear stochastic difference equation. But what happens if the process to be estimated and (or) the measurement relationship to the process is non-linear? Some of the most interesting and successful applications of Kalman filtering have been such situations. A Kalman filter that linearizes about the current mean and covariance is referred to as an extended Kalman filter or EKF.

In something akin to a Taylor series, we can linearize the estimation around the current estimate using the partial derivatives of the process and measurement functions to compute estimates even in the face of non-linear relationships. To do so, we must begin by modifying some of the material presented in section 1. Let us assume that our process again has a state vector $x \in \mathbb{R}^n$, but that the process is now governed by the non-linear stochastic difference equation

$$x_k = f(x_{k-1}, u_{k-1}, w_{k-1}),$$  (2.1)

with a measurement $z \in \mathbb{R}^m$ that is

$$z_k = h(x_k, v_k),$$  (2.2)

where the random variables $w_k$ and $v_k$ again represent the process and measurement noise as in (1.3) and (1.4). In this case the non-linear function $f$ in the difference equation (2.1) relates the state at the previous time step $k-1$ to the state at the current time step $k$. It includes as parameters any driving function $u_{k-1}$ and the zero-mean process noise $w_k$. The non-linear function $h$ in the measurement equation (2.2) relates the state $x_k$ to the measurement $z_k$.

In practice of course one does not know the individual values of the noise $w_k$ and $v_k$ at each time step. However, one can approximate the state and measurement vector without them as

$$\tilde{x}_k = f(\hat{x}_{k-1}, u_{k-1}, 0)$$  (2.3)

and

$$\tilde{z}_k = h(\tilde{x}_k, 0),$$  (2.4)

where $\hat{x}_k$ is some a posteriori estimate of the state (from a previous time step $k$).
It is important to note that a fundamental flaw of the EKF is that the distributions (or densities in the continuous case) of the various random variables are no longer normal after undergoing their respective nonlinear transformations. The EKF is simply an ad hoc state estimator that only approximates the optimality of Bayes' rule by linearization. Some interesting work has been done by Julier et al. in developing a variation to the EKF, using methods that preserve the normal distributions throughout the non-linear transformations [Julier96].
The Computational Origins of the Filter
To estimate a process with non-linear difference and measurement relationships, we begin by writing new governing equations that linearize an estimate about (2.3) and (2.4),

$$x_k \approx \tilde{x}_k + A(x_{k-1} - \hat{x}_{k-1}) + W w_{k-1},$$  (2.5)

$$z_k \approx \tilde{z}_k + H(x_k - \tilde{x}_k) + V v_k,$$  (2.6)

where

- $x_k$ and $z_k$ are the actual state and measurement vectors,
- $\tilde{x}_k$ and $\tilde{z}_k$ are the approximate state and measurement vectors from (2.3) and (2.4),
- $\hat{x}_k$ is an a posteriori estimate of the state at step $k$,
- the random variables $w_k$ and $v_k$ represent the process and measurement noise as in (1.3) and (1.4),
- $A$ is the Jacobian matrix of partial derivatives of $f$ with respect to $x$, that is

$$A_{[i,j]} = \frac{\partial f_{[i]}}{\partial x_{[j]}}(\hat{x}_{k-1}, u_{k-1}, 0),$$

- $W$ is the Jacobian matrix of partial derivatives of $f$ with respect to $w$,

$$W_{[i,j]} = \frac{\partial f_{[i]}}{\partial w_{[j]}}(\hat{x}_{k-1}, u_{k-1}, 0),$$

- $H$ is the Jacobian matrix of partial derivatives of $h$ with respect to $x$,

$$H_{[i,j]} = \frac{\partial h_{[i]}}{\partial x_{[j]}}(\tilde{x}_k, 0),$$

- $V$ is the Jacobian matrix of partial derivatives of $h$ with respect to $v$,

$$V_{[i,j]} = \frac{\partial h_{[i]}}{\partial v_{[j]}}(\tilde{x}_k, 0).$$

Note that for simplicity in the notation we do not use the time step subscript $k$ with the Jacobians $A$, $W$, $H$, and $V$, even though they are in fact different at each time step.
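When $f$ and $h$ are awkward to differentiate by hand, these Jacobians can also be approximated numerically. A central-difference sketch (this helper is our own, not part of the paper):

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference approximation of the Jacobian of func at x."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(func(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        # d func / d x_j via a symmetric difference quotient
        J[:, j] = (np.asarray(func(x + step)) - np.asarray(func(x - step))) / (2 * eps)
    return J

# Example: a hypothetical range-only measurement h(x) = sqrt(x0^2 + x1^2).
h = lambda x: np.array([np.hypot(x[0], x[1])])
H = numerical_jacobian(h, np.array([3.0, 4.0]))  # analytic Jacobian: [0.6, 0.8]
```

Analytic Jacobians are cheaper and exact where available; the finite-difference fallback trades a few extra evaluations of $f$ or $h$ per step for convenience.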
The complete set of EKF equations is shown below in Table 2-1 and Table 2-2. Note that we have substituted $\hat{x}_k^-$ for $\tilde{x}_k$ to remain consistent with the earlier "super minus" a priori notation, and that we now attach the subscript $k$ to the Jacobians $A$, $W$, $H$, and $V$, to reinforce the notion that they are different at (and therefore must be recomputed at) each time step.

Table 2-1: EKF time update equations.

$$\hat{x}_k^- = f(\hat{x}_{k-1}, u_{k-1}, 0)$$  (2.14)

$$P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T$$  (2.15)

Table 2-2: EKF measurement update equations.

$$K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}$$  (2.16)

$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - h(\hat{x}_k^-, 0))$$  (2.17)

$$P_k = (I - K_k H_k) P_k^-$$  (2.18)

As with the basic discrete Kalman filter, the time update equations in Table 2-1 project the state and covariance estimates from the previous time step $k-1$ to the current time step $k$. Again $f$ in (2.14) comes from (2.3), $A_k$ and $W_k$ are the process Jacobians at step $k$, and $Q_k$ is the process noise covariance (1.3) at step $k$.

As with the basic discrete Kalman filter, the measurement update equations in Table 2-2 correct the state and covariance estimates with the measurement $z_k$. Again $h$ in (2.17) comes from (2.4), $H_k$ and $V_k$ are the measurement Jacobians at step $k$, and $R_k$ is the measurement noise covariance (1.4) at step $k$. (Note we now subscript $R$, allowing it to change with each measurement.)

The basic operation of the EKF is the same as the linear discrete Kalman filter as shown in Figure 1-1. Figure 2-1 below offers a complete picture of the operation of the EKF, combining the high-level diagram of Figure 1-1 with the equations from Table 2-1 and Table 2-2.
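One EKF cycle following (2.14) through (2.18) can be sketched as below. The callables for $f$, $h$, and the four Jacobians are supplied by the user; the trivial linear functions at the end, which reduce the EKF to the linear filter, are only a sanity check:

```python
import numpy as np

def ekf_step(x_hat, P, u, z, f, h, A_fn, W_fn, H_fn, V_fn, Q, R):
    """One EKF cycle: time update (2.14)-(2.15), measurement update (2.16)-(2.18)."""
    # Time update
    x_minus = f(x_hat, u, 0.0)                                 # (2.14)
    Ak, Wk = A_fn(x_hat, u), W_fn(x_hat, u)
    P_minus = Ak @ P @ Ak.T + Wk @ Q @ Wk.T                    # (2.15)
    # Measurement update
    Hk, Vk = H_fn(x_minus), V_fn(x_minus)
    S = Hk @ P_minus @ Hk.T + Vk @ R @ Vk.T
    K = P_minus @ Hk.T @ np.linalg.inv(S)                      # (2.16)
    x_hat = x_minus + K @ (z - h(x_minus, 0.0))                # (2.17)
    P = (np.eye(len(P)) - K @ Hk) @ P_minus                    # (2.18)
    return x_hat, P

# Sanity check with linear f and h (the EKF reduces to the linear filter).
f = lambda x, u, w: x
h = lambda x, v: x
eye = lambda *args: np.eye(1)
x1, P1 = ekf_step(np.zeros((1, 1)), np.eye(1), None, np.array([[1.0]]),
                  f, h, eye, eye, eye, eye, np.zeros((1, 1)), np.eye(1))
```

With identity Jacobians and equal prior and measurement covariance, the result matches the linear scalar case: estimate 0.5, covariance 0.5.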
Figure 2-1. A complete picture of the operation of the extended Kalman filter, combining the high-level diagram of Figure 1-1 with the equations from Table 2-1 and Table 2-2.

Time Update ("Predict"):
(1) Project the state ahead: $\hat{x}_k^- = f(\hat{x}_{k-1}, u_{k-1}, 0)$
(2) Project the error covariance ahead: $P_k^- = A_k P_{k-1} A_k^T + W_k Q_{k-1} W_k^T$

Measurement Update ("Correct"):
(1) Compute the Kalman gain: $K_k = P_k^- H_k^T (H_k P_k^- H_k^T + V_k R_k V_k^T)^{-1}$
(2) Update estimate with measurement $z_k$: $\hat{x}_k = \hat{x}_k^- + K_k(z_k - h(\hat{x}_k^-, 0))$
(3) Update the error covariance: $P_k = (I - K_k H_k) P_k^-$

Initial estimates for $\hat{x}_{k-1}$ and $P_{k-1}$ feed the first time update.

An important feature of the EKF is that the Jacobian $H_k$ in the equation for the Kalman gain $K_k$ serves to correctly propagate or "magnify" only the relevant component of the measurement information. For example, if there is not a one-to-one mapping between the measurement $z_k$ and the state via $h$, the Jacobian $H_k$ affects the Kalman gain so that it only magnifies the portion of the residual $z_k - h(\hat{x}_k^-, 0)$ that does affect the state. Of course if over all measurements there is not a one-to-one mapping between the measurement $z_k$ and the state via $h$, then as you might expect the filter will quickly diverge. In this case the process is unobservable.
3 A Kalman Filter in Action: Estimating a Random Constant
In the previous two sections we presented the basic form for the discrete Kalman filter, and the extended Kalman filter. To help in developing a better feel for the operation and capability of the filter, we present a very simple example here. Andrew Straw has made available a Python/SciPy implementation of this example at http://www.scipy.org/Cookbook/KalmanFiltering (valid link as of July 24, 2006).
The Process Model
In this simple example let us attempt to estimate a scalar random constant, a voltage for example. Let's assume that we have the ability to take measurements of the constant, but that the measurements are corrupted by a 0.1 volt RMS white measurement noise (e.g. our analog to digital converter is not very accurate). In this example, our process is governed by the linear difference equation

$$x_k = Ax_{k-1} + Bu_{k-1} + w_k = x_{k-1} + w_k,$$
with a measurement $z \in \mathbb{R}^1$ that is

$$z_k = Hx_k + v_k = x_k + v_k.$$

The state does not change from step to step, so $A = 1$. There is no control input, so $u = 0$. Our noisy measurement is of the state directly, so $H = 1$. (Notice that we dropped the subscript $k$ in several places because the respective parameters remain constant in our simple model.)

The Filter Equations and Parameters

Our time update equations are

$$\hat{x}_k^- = \hat{x}_{k-1},$$

$$P_k^- = P_{k-1} + Q,$$

and our measurement update equations are

$$K_k = P_k^-(P_k^- + R)^{-1} = \frac{P_k^-}{P_k^- + R},$$  (3.1)

$$\hat{x}_k = \hat{x}_k^- + K_k(z_k - \hat{x}_k^-),$$

$$P_k = (1 - K_k)P_k^-.$$

Presuming a very small process variance, we let $Q = 10^{-5}$. (We could certainly let $Q = 0$, but assuming a small but non-zero value gives us more flexibility in "tuning" the filter as we will demonstrate below.) Let's assume that from experience we know that the true value of the random constant has a standard normal probability distribution, so we will "seed" our filter with the guess that the constant is 0. In other words, before starting we let $\hat{x}_{k-1} = 0$.

Similarly we need to choose an initial value for $P_{k-1}$, call it $P_0$. If we were absolutely certain that our initial state estimate $\hat{x}_0 = 0$ was correct, we would let $P_0 = 0$. However given the uncertainty in our initial estimate $\hat{x}_0$, choosing $P_0 = 0$ would cause the filter to initially and always believe $\hat{x}_k = 0$. As it turns out, the alternative choice is not critical. We could choose almost any $P_0 \neq 0$ and the filter would eventually converge. We'll start our filter with $P_0 = 1$.
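The scalar filter above fits in a few lines of code. This sketch regenerates its own measurements (so the noise realization differs from the figures in the text), using the stated parameters $Q = 10^{-5}$, $R = 0.01$, $\hat{x}_0 = 0$, $P_0 = 1$:

```python
import numpy as np

x_true = -0.37727
rng = np.random.default_rng(1)
z = x_true + 0.1 * rng.standard_normal(50)   # 50 noisy measurements

Q, R = 1e-5, 0.01
x_hat, P = 0.0, 1.0                          # initial estimates
estimates = []
for zk in z:
    # Time update
    x_minus = x_hat
    P_minus = P + Q
    # Measurement update
    K = P_minus / (P_minus + R)              # (3.1)
    x_hat = x_minus + K * (zk - x_minus)
    P = (1 - K) * P_minus
    estimates.append(x_hat)
```

After 50 iterations the estimate sits close to the true constant, and `P` has shrunk from 1 to a few times $10^{-4}$, consistent with the simulations described below.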
The Simulations
To begin with, we randomly chose a scalar constant $x = -0.37727$ (there is no "hat" on the $x$ because it represents the "truth"). We then simulated 50 distinct measurements $z_k$ that had error normally distributed around zero with a standard deviation of 0.1 (remember we presumed that the measurements are corrupted by a 0.1 volt RMS white measurement noise). We could have generated the individual measurements within the filter loop, but pre-generating the set of 50 measurements allowed me to run several simulations with the same exact measurements (i.e. same measurement noise) so that comparisons between simulations with different parameters would be more meaningful.

In the first simulation we fixed the measurement variance at $R = (0.1)^2 = 0.01$. Because this is the "true" measurement error variance, we would expect the "best" performance in terms of balancing responsiveness and estimate variance. This will become more evident in the second and third simulation. Figure 3-1 depicts the results of this first simulation. The true value of the random constant $x = -0.37727$ is given by the solid line, the noisy measurements by the cross marks, and the filter estimate by the remaining curve.

Figure 3-1. The first simulation: $R = (0.1)^2 = 0.01$. The true value of the random constant $x = -0.37727$ is given by the solid line, the noisy measurements by the cross marks, and the filter estimate by the remaining curve. (Axes: Voltage versus Iteration, 50 iterations.)

When considering the choice for $P_0$ above, we mentioned that the choice was not critical as long as $P_0 \neq 0$ because the filter would eventually converge. Below in Figure 3-2 we have plotted the value of $P_k^-$ versus the iteration. By the 50th iteration, it has settled from the initial (rough) choice of 1 to approximately 0.0002 (Volts²).
Figure 3-2. After 50 iterations, our initial (rough) error covariance choice of 1 has settled to about 0.0002 (Volts²). (Axes: (Voltage)² versus Iteration, 50 iterations.)

In section 1 under the topic "Filter Parameters and Tuning" we briefly discussed changing or "tuning" the parameters $Q$ and $R$ to obtain different filter performance. In Figure 3-3 and Figure 3-4 below we can see what happens when $R$ is increased or decreased by a factor of 100 respectively. In Figure 3-3 the filter was told that the measurement variance was 100 times greater (i.e. $R = 1$), so it was "slower" to believe the measurements.

Figure 3-3. Second simulation: $R = 1$. The filter is slower to respond to the measurements, resulting in reduced estimate variance. (Axes: Voltage versus Iteration, 50 iterations.)

In Figure 3-4 the filter was told that the measurement variance was 100 times smaller (i.e. $R = 0.0001$), so it was very "quick" to believe the noisy measurements.
Figure 3-4. Third simulation: $R = 0.0001$. The filter responds to measurements quickly, increasing the estimate variance. (Axes: Voltage versus Iteration, 50 iterations.)

While the estimation of a constant is relatively straight-forward, it clearly demonstrates the workings of the Kalman filter. In Figure 3-3 in particular the Kalman "filtering" is evident, as the estimate appears considerably smoother than the noisy measurements.
References
Brown92    Brown, R. G. and P. Y. C. Hwang. 1992. Introduction to Random Signals and Applied Kalman Filtering, Second Edition, John Wiley & Sons, Inc.

Gelb74    Gelb, A. 1974. Applied Optimal Estimation, MIT Press, Cambridge, MA.

Grewal93    Grewal, Mohinder S., and Angus P. Andrews. 1993. Kalman Filtering: Theory and Practice. Upper Saddle River, NJ USA, Prentice Hall.

Jacobs93    Jacobs, O. L. R. 1993. Introduction to Control Theory, 2nd Edition. Oxford University Press.

Julier96    Julier, Simon and Jeffrey Uhlmann. "A General Method of Approximating Nonlinear Transformations of Probability Distributions," Robotics Research Group, Department of Engineering Science, University of Oxford [cited 14 November 1995]. Available from http://www.robots.ox.ac.uk/~siju/work/publications/Unscented.zip.
    Also see: "A New Approach for Filtering Nonlinear Systems" by S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte, Proceedings of the 1995 American Control Conference, Seattle, Washington, pp. 1628-1632. Available from http://www.robots.ox.ac.uk/~siju/work/publications/ACC95_pr.zip.
    Also see Simon Julier's home page at http://www.robots.ox.ac.uk/~siju/.

Kalman60    Kalman, R. E. 1960. "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME, Journal of Basic Engineering, pp. 35-45 (March 1960).

Lewis86    Lewis, Richard. 1986. Optimal Estimation with an Introduction to Stochastic Control Theory, John Wiley & Sons, Inc.

Maybeck79    Maybeck, Peter S. 1979. Stochastic Models, Estimation, and Control, Volume 1, Academic Press, Inc.

Sorenson70    Sorenson, H. W. 1970. "Least-Squares estimation: from Gauss to Kalman," IEEE Spectrum, vol. 7, pp. 63-68, July 1970.
Figure 2. Target tracking by background differencing. The central person is tracked using all pixels whereas the two other persons are tracked using every second pixel.
3. The tracking system
In this section, we describe the theoretical aspects and
the details on the actual implementation of the core tracking
system.
3.1 Energy detection
Currently, targets can be detected by energy measurements based on background subtraction or intensity normalized color histograms. The background subtraction module computes a difference image $I_d$ from the current frame $I = (I_{red}, I_{green}, I_{blue})$ and the background image $B = (B_{red}, B_{green}, B_{blue})$:

$$I_d = \frac{1}{3}\left(|I_{red} - B_{red}| + |I_{green} - B_{green}| + |I_{blue} - B_{blue}|\right)$$
The background image $B$ is updated with each frame using a weighted averaging technique, with a strong weight applied to the previous background, and a small weight applied to the current image. This procedure constitutes a simple first order recursive filter along the time axis for each pixel. The background image is only updated for those pixels that do not belong to one of the target ROIs.

$$B_t(i, j) = \begin{cases} \alpha I_t(i, j) + (1 - \alpha)B_{t-1}(i, j), & (i, j) \in bg \\ B_{t-1}(i, j), & \text{else} \end{cases}$$  (1)
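Equation (1) is a per-pixel exponential moving average restricted to background pixels. A vectorized sketch in NumPy, where the alpha value, image size, and ROI mask are illustrative assumptions:

```python
import numpy as np

alpha = 0.05                                       # small weight on the new frame
rng = np.random.default_rng(0)
frame = rng.uniform(0, 255, size=(4, 4, 3))        # current image I_t
background = rng.uniform(0, 255, size=(4, 4, 3))   # previous background B_{t-1}

bg_mask = np.ones((4, 4), dtype=bool)              # pixels outside target ROIs
bg_mask[1:3, 1:3] = False                          # a target ROI is left untouched

# Equation (1): blend only where (i, j) is background; keep B_{t-1} elsewhere.
updated = np.where(bg_mask[..., None],
                   alpha * frame + (1 - alpha) * background,
                   background)
```

Excluding target ROIs from the update, as the text describes, keeps a slowly moving person from being absorbed into the background model.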
Figure 2 shows an example of target tracking by background subtraction. The right image represents the background difference image $I_d$ after processing of three ROIs. Three targets can be clearly identified. Notice that the center target appears as solid white, while the adjacent targets appear to be "hashed". This is the result of an optimization that allows the processing to be applied to every $N$th pixel. In this example, the two adjacent regions were processed with $N = 2$, while the center target was processed with $N = 1$. $N$ is determined dynamically during each cycle by the process supervisor.

The position and extent of a target are determined by the moments of the detected pixels in the difference image $I_d$ within the ROI. The center of gravity (or first moment) gives the position of a target. The covariance (or second moment) determines the spatial extent, and can be used to determine width, height, and slant of a target. These parameters also provide the target's search region in the next image.
Chrominance information can be used to provide probabilistic detection of targets. The intensity of each RGB color pixel within a ROI is normalized to separate chrominance from luminance:

r = R / (R + G + B),   g = G / (R + G + B)   (2)

These color components have the property of being robust to intensity variations [6].
The probability that a pixel takes on a particular color can be represented as a histogram of (r, g) values. The histogram h_T of chrominance values for a target T provides an estimate of the probability of a chrominance vector (r, g) given the target, p(r, g | T). The histogram of chrominance for all pixels, h_total, gives the global probability p(r, g) of encountering a chrominance among the pixels. The probability of a target is the number of pixels of the target divided by the total number of pixels. Putting these values into Bayes' rule shows that an estimate of the probability of the target for each pixel can be obtained by evaluating the ratio of the target histogram to the global histogram:

p(T | r, g) = p(r, g | T) p(T) / p(r, g)  ~  h_T(r, g) / h_total(r, g)   (3)
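A minimal sketch of the histogram-ratio estimate of Eq. (3), assuming chrominance values already normalized to [0, 1] and precomputed 2-D histograms; function and parameter names are hypothetical.

```python
import numpy as np

def ratio_probability_map(r, g, h_target, h_total, bins=32):
    """Approximate p(T | r, g) by the histogram ratio of Eq. (3).

    r, g: chrominance images with values in [0, 1].
    h_target, h_total: 2-D histograms over (r, g), `bins` bins per axis.
    Returns the probability map I_p, same shape as r.
    """
    ri = np.clip((r * bins).astype(int), 0, bins - 1)
    gi = np.clip((g * bins).astype(int), 0, bins - 1)
    denom = np.maximum(h_total[ri, gi], 1e-9)  # guard against empty global bins
    return np.clip(h_target[ri, gi] / denom, 0.0, 1.0)
```

The clipping to [0, 1] keeps the ratio interpretable as a probability even in sparsely populated histogram bins.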
For each image, a probability map I_p can be created by evaluating the ratio of histograms for each pixel in the image. Figure 3 shows an example of face detection using a ratio of chrominance histograms. The bottom image displays the probability map I_p. The probability map is only evaluated within the search region provided by the Kalman filter in order to increase processing speed.

A common problem in both background subtraction and histogram detection is spatial outliers. In order to increase the stability of target localization, we suppress the contribution of outliers using a method proposed by Schwerdt in [5]. With this method, the probability image I_p is multiplied by
Figure 3. Target detection by normalized color histogram.
a Gaussian weighting function centered at the predicted target position. This corresponds to filtering by a strong positional prior. The effect is that spatial outliers lose their influence on position and extent as a function of distance from the predicted Gaussian. In order to save computation time, this operation is performed only within the region of interest R of each target. Even for small regions of interest this operation stabilizes the estimated position and extent of targets.

I'_p(i, j) = { I_p(i, j) * G(mu, Sigma),   if (i, j) in R
             { 0,                          otherwise        (4)

where

G(x; mu, Sigma) = exp( -(1/2) (x - mu)^T Sigma^{-1} (x - mu) )   (5)

The center of gravity mu = [x-_t, y-_t]^T is the Kalman prediction of the target location. The spatial covariance Sigma reflects the size of the target as well as the growing uncertainty about the current target size and location. The same principle can be applied to the background difference I_d.
3.2 Tracking process
The tracking system is a form of Kalman filter [7]. The state vector for each target is composed of position and velocity. The current target state vector x_{t-1} is used to make a new prediction according to:

x-_t = Phi_t x_{t-1},   with   Phi_t = [ 1  dt ; 0  1 ]   (6)

where dt is the time difference between two iterations.

From the new position measurement z_t, the estimation update is carried out:

x_t = x-_t + K_t (z_t - H_t x-_t)   (7)
This relation is important for balancing the estimation between measurement and prediction with the Kalman gain K_t. The estimated precision is a diagonal covariance matrix

P-_t = diag( sigma^2_xx, sigma^2_yy, sigma^2_vxvx, sigma^2_vyvy )   (8)

and is predicted by:

P-_t = Phi_{t-1} P_{t-1} Phi^T_{t-1} + Q_{t-1}   (9)

where Q_{t-1} is the covariance matrix of the prediction error, which represents the growth of the uncertainty in the current target parameters.
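For concreteness, the predict/update cycle of Eqs. (6), (7), and (9) can be sketched as a textbook constant-velocity Kalman filter for one axis (the paper's full state stacks both axes, giving the 4x4 covariance of Eq. (8)). This is a generic sketch, not the authors' implementation; H reduces to [1 0] because only position is measured.

```python
import numpy as np

def kalman_predict(x, P, dt, Q):
    """Eqs. (6) and (9): constant-velocity prediction for one axis.

    x = [position, velocity]; P is its 2x2 covariance.
    """
    Phi = np.array([[1.0, dt], [0.0, 1.0]])
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q
    return x_pred, P_pred

def kalman_update(x_pred, P_pred, z, R):
    """Eq. (7): correct the prediction with a position measurement z."""
    H = np.array([[1.0, 0.0]])           # only position is observed
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain K_t
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(2) - K @ H) @ P_pred
    return x, P
```

With a very precise measurement (small R), the gain drives the corrected position toward z; with a noisy measurement, the prediction dominates, which is exactly the balancing role of K_t described above.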
3.3 The core modules
The tracking process has been implemented in the ImaLab environment [4]. This environment allows real-time processing of frames extracted from the video stream. The basic tracking system is composed of two modules:

• TargetObservation predicts for each target the position in the current frame by a Kalman filter and then computes its real position by background subtraction or color histogram detection.

• DetectionRegion detects new targets by analysing the energy (background differencing or color histogram) within several manually defined detection regions.

Figure 1 shows the system architecture. Both core modules can be instantiated to use either background differencing or color histograms. For the PETS 04 experiments, we use tracking based on background subtraction.
3.4 Target initialization module
Detection regions are image regions where new targets can appear. Restricting detection of new targets to such regions allows the system to reduce the overall computing time. As a side effect, the use of detection regions also provides a reduction in the number of spurious false detections
Figure 4. Initialisation of a new target. (The figure shows the background difference of a detection region, its one-dimensional energy detection histogram with the noise threshold and the analysis interval R = [Rmin, Rmax], and the analysis and moment computation yielding the initialised target.)
by avoiding detection in unlikely regions, but targets might be missed when the detection regions are not chosen appropriately.
For each scenario, a different set of detection regions is determined. Currently, these regions are selected by hand; an automatic selection algorithm appears feasible. New targets are initialized automatically by analysing the detection regions in each tracking cycle. This analysis is done in two steps. In the first step, the subregion occupied by the new target is determined by creating a one-dimensional histogram along the long axis of the detection region. The limits of the target subregion are characterized by an interval [Rmin, Rmax] over which the values of the one-dimensional histogram are above a noise threshold (see Figure 4). In the second step, the energy density within the specified subregion R is computed as

e_R = (1/|R|) Sum_{(i,j) in R} I_d(i, j)   (10)

where |R| is the number of pixels in R. A new target with mean mu_R and covariance Sigma_R is initialised when the measured energy density e_R exceeds a threshold. This approach has the advantage that targets can be detected independently of the size of the detection region.
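The two initialization steps can be sketched as follows; the helper names, the choice of projection axis, and the thresholds are hypothetical placeholders, not the paper's exact implementation.

```python
import numpy as np

def find_subregion(I_d, noise_threshold):
    """Step 1 of Section 3.4: project I_d onto the long axis of the
    detection region and keep the interval above the noise threshold."""
    profile = I_d.sum(axis=0)                 # 1-D histogram along the long axis
    above = np.where(profile > noise_threshold)[0]
    if above.size == 0:
        return None                           # no candidate target
    return int(above[0]), int(above[-1])      # (Rmin, Rmax)

def energy_density(I_d, rmin, rmax):
    """Eq. (10): mean energy of the difference image inside subregion R."""
    region = I_d[:, rmin:rmax + 1]
    return region.sum() / region.size
```

Because Eq. (10) normalizes by |R|, the decision threshold on e_R does not depend on how large the detection region was drawn, which is the size-independence property noted above.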
3.5 Tracking module
The module TargetObservation implements the target tracking. The supervisor maintains a list of current targets. Targets in this list are sequentially updated by the supervisor, depending on the feedback of the modules. For each target, a new position is predicted by a first-order Kalman filter. This prediction determines a search region within which the target is expected to be found. A target is found by applying the specified detection operation to the search region. If the average target detection energy is above a threshold, the target observation vector is updated. This module depends on the following parameters:
• Detection energy threshold: the average energy threshold that validates the existence of a target.

• Sensitivity threshold: this parameter thresholds the energy image (I_d in the case of background differencing, or I_p in the case of chrominance detection). If the value is 0, the raw data of the energy image is used.

• Target area threshold: a step-size parameter N enables faster processing for large targets by processing only 1 out of N pixels. When the target surface is larger than a threshold, N is increased. This temporary measure will be replaced by more sophisticated control logic based on computing time. Figure 2 illustrates the use of this parameter.
3.6 Split and merge of targets
In real-world video sequences, especially in the domain of video surveillance, it often happens that targets come together, move in the same direction for a while, and then separate. It can also occur that close targets occlude each other. In that case only one target is visible at a time, but both targets are still present in the scene. To solve such problems, we use a method that allows merging and splitting of targets. This method makes it possible to keep track of occluded targets and also to model the common behavior of a target group. The PETS 04 sequences contain many examples of such group behavior.

A straightforward approach is applied for the detection of target splits and merges. Merging of two targets that are within a certain distance from each other is detected by evaluating the following inequality:

c / (a + b) < threshold   (11)

where c is the distance between the centers of gravity of the two targets, and a and b are the distances between each center of gravity and the boundary of the ellipse defined by the covariance of the respective target (see Figure 5, left). In our implementation we use a threshold of 0.8.
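The merge test of Eq. (11) can be sketched as below. Here a and b are taken along the line joining the two centers of gravity, at one standard deviation of each covariance ellipse; the paper does not state the ellipse scale it uses, so treat that scale (and the function names) as assumptions.

```python
import numpy as np

def ellipse_radius_toward(cov, direction):
    """Distance from the centre to the 1-sigma covariance ellipse
    along the given direction (points x with x^T cov^{-1} x = 1)."""
    d = direction / np.linalg.norm(direction)
    return 1.0 / np.sqrt(d @ np.linalg.inv(cov) @ d)

def should_merge(mu1, cov1, mu2, cov2, threshold=0.8):
    """Eq. (11): merge when the centre distance c is small relative to
    the ellipse extents a and b along the joining line."""
    diff = np.asarray(mu2, dtype=float) - np.asarray(mu1, dtype=float)
    c = np.linalg.norm(diff)
    a = ellipse_radius_toward(cov1, diff)
    b = ellipse_radius_toward(cov2, -diff)
    return bool(c / (a + b) < threshold)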
Figure 5. (Left) Merging of targets as a function of the targets' relative position and size. (Right) Splitting detectors are defined proportionally to the target size.
Splitting of targets is implemented by placing detection regions around the target as shown in Figure 5 (right). The size and location of the split detection regions are proportional to the target size. Within each split detection region, the average energy is evaluated in the same way as in the target initialisation module. A new target is created if this average energy is greater than the threshold u = energy density * split coefficient. The parameter split coefficient controls the constraints for target splitting.
4. Automatic parameter adaptation

Target initialization and tracking by background differencing or histogram detection requires a certain number of parameters, as mentioned in the previous sections (detection energy threshold, sensitivity, density energy threshold, alpha, split coefficient, area threshold).
In order to preserve the re-usability of the tracking module and guarantee good performance in a wide range of different tracking scenarios, it is crucial to have a good parameter setting at hand. Up to now, parameter adaptation has been done manually. This is a very tedious job which may need frequent repetition when the scene setup changes.

In this section we propose a first version of a module that automatically finds a good parameter setting. As a first step, we consider the tracker as a classical system with control parameters and noise perturbations (see Figure 6). The system produces an output y(t) that depends on the input r(t), some noise d(t), and a set of parameters that affect the control module K [1].
4.1 Algorithm
First we need to explore the effect of particular parameters on the system. The goal of this step is to identify the important parameters and their relations, and possibly to discard
Figure 6. A controlled system (control module K with parameters P; input r(t), feedback f(y(t)), noise d(t), and output y(t)).
parameters with little effect. For a sequence for which the ground truth r(t) is available, we vary the parameters systematically and measure the output of the system, y_{Pk}(t), for a particular parameter setting P_k in the parameter space P. y_{Pk}(t) and r(t) are split into m sections according to m intervals s_i = [t_{i-1}, t_i], i = 1, ..., m.

For each parameter setting P_k and each interval, r(s_i) and y_{Pk}(s_i) are known. From these input/output correspondences we can compute the transfer function f(y_{Pk}(s_i)) = r(s_i) by a least-squares approximation. The overall error of the transfer function on the sequence is computed as follows:

eps = ||r(t) - f(y_{Pk}(t))|| = Sum_{s_i} ||r(s_i) - f(y_{Pk}(s_i))||   (12)

For each P_k, we determine the transfer function that minimizes this error. The average error (eps_avg = eps/n, with n the number of frames) is used to characterize the performance of the system with the current parameter setting. This is a very coarse approximation but, as we will see, the average error evolves smoothly over the parameter space.

We consider polynomial transfer functions of first and second order (linear and quadratic) of the following form:

r(t_k) = A_0 y(t_k) + b   (13)
r(t_k) = A_2 (y(t_k))^2 + A_1 y(t_k) + b   (14)

with transfer matrices A_i and offset b.
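A sketch of fitting the linear transfer function of Eq. (13) by least squares and evaluating the average error of Eq. (12); for simplicity the sections s_i are taken to be single frames, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_transfer(y, r):
    """Least-squares fit of r(t) ~ A0 y(t) + b (Eq. 13).

    y, r: arrays of shape (n_frames, dim). Returns (A0, b).
    """
    n = y.shape[0]
    Y = np.hstack([y, np.ones((n, 1))])       # extra column estimates the offset b
    theta, *_ = np.linalg.lstsq(Y, r, rcond=None)
    return theta[:-1].T, theta[-1]

def average_error(y, r, A0, b):
    """Eq. (12) with one-frame sections, divided by the number of frames."""
    pred = y @ A0.T + b
    return np.linalg.norm(r - pred, axis=1).sum() / y.shape[0]
```

A transfer matrix A0 close to identity with a small average error, as reported below for Walk1.mpeg and Walk3.mpeg, indicates that the tracker output already follows the ground truth up to a small affine correction.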
The measurements have either two or four dimensions. In the two-dimensional case, the measurements contain the coordinates of the center of gravity of the target. The four-dimensional case also contains the height and width of the target bounding box. We could have considered an additional dimension for the target slant, but we discarded this possibility due to the discontinuity of the slant measurement at 180 degrees.

The linear transfer function estimated from the data of the sequences Walk1.mpeg and Walk3.mpeg produces good results. We observe a transfer matrix A_0 that is close to identity. The quadratic transfer function has a smaller eps, but the transfer matrix A_2 has very low values and is therefore
not significant. This means that the linear transfer function
is a good model for our system.
4.2 Exploration of the parameter space
The average error of the best transfer function evaluated on the entire test sequence is used to characterize the performance of the controlled system. The parameter space can be very high-dimensional, so exploring the entire space can be time consuming. To cope with this problem we assume that some parameters evolve independently of each other. This allows us to restrict the search for an optimal parameter value to a low-dimensional hyperspace. In the experiments we use the following default values for the constant parameters of the hyperspace: detection energy = 10, density = 15, sensitivity = 20, split coefficient = 2.0, alpha = 0.001, area threshold = 1500. We experiment on the sequence Walk1.mpeg, except for Figure 7.
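Under this independence assumption, the exploration reduces to varying one parameter (or one pair) at a time around the default setting. A minimal sketch, with the hypothetical `evaluate` callback standing in for a full tracking run plus transfer-function fit over the sequence:

```python
def explore_axes(evaluate, defaults, grids):
    """Coordinate-wise exploration of the parameter space (Section 4.2).

    evaluate: callable mapping a parameter dict to the average error.
    defaults: dict of default parameter values.
    grids: dict mapping a parameter name to its candidate values.
    Returns, per axis, the (value, error) pair with the lowest error.
    """
    results = {}
    for name, values in grids.items():
        errors = []
        for v in values:
            params = dict(defaults)   # all other parameters keep defaults
            params[name] = v
            errors.append((v, evaluate(params)))
        results[name] = min(errors, key=lambda t: t[1])
    return results
```

This coordinate-wise search is cheap (the cost grows linearly, not exponentially, in the number of parameters) but only finds the joint optimum if the independence assumption actually holds.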
Figure 7 shows the surface produced by varying the detection energy threshold and the sensitivity threshold simultaneously. Figure 8 shows the error evolution when varying the split coefficient and the sensitivity. The optimal parameter value is different for each sequence; the parameters are therefore sequence dependent. In all cases the error evolves smoothly. This means that we are dealing with a controlled system and not with a system following chaotic or arbitrary rules.

Figure 9 (left) provides evidence for setting alpha = 0.1. Figure 9 (right) shows that the density threshold has no effect on the average error. This parameter is therefore a candidate that need not be considered in further exploration of the parameter space.

Figure 10 shows the effect of the parameter area threshold. This parameter causes only one pixel out of two to be processed for targets that are larger than area threshold pixels. This explains the increase of the error for small thresholds and the speed-up in processing time. It is interesting to see that the error increase is very small: less than 4% for a 25% gain in processing time. Our method allows us to identify this kind of relation between parameters.
4.3 Summary
We have shown a method to evaluate the performance of a system controlled by a set of parameters. The average error is used to understand the effect of single parameters and parameter pairs. This method allows us to verify that our tracking system has a controlled behavior. We identified that the density parameter has no effect on the error performance and can be removed from the parameter space. The area threshold parameter influences the overall processing time and the average error. With our method, we found that the increase in error is small with respect to the gain in
Figure 11. Modules for face and hand observation are plugged into the tracking system.
processing time. This is an interesting result which a dynamic control system should take into account. The experiments show that the optimal parameter setting estimated from one sequence scenario need not be optimal for another sequence. This needs to be explored by evaluating more data sequences. Another important point is that the approach requires ground-truth labelling. This means that our method cannot find the optimal parameters when the ground truth is unknown. The likelihood may be appropriate in some cases as a replacement for the ground truth, but the results will be inferior since the likelihood increases the noise perturbations.
5. Tracking : optional higher level modules
In this section we demonstrate the flexibility of our tracking system. The proposed architecture enables easy plug-in of higher-level modules, which allows the system to solve quite different tasks.

5.1. Face and hand tracking for human computer interaction

Modules for face and hand tracking use color histogram detection. Faces and hands are initialised automatically with respect to a body detected by background differencing. This means that the same tracking principle is applied to faces and hands at a higher level. An example is shown in Figure 11.
5.2. Eye detection for head pose estimation
This module detects facial features by evaluating the response to receptive field clusters [2]. The method detects facial features robustly with respect to scale, lighting variation, person, and
Figure 7. Evolution of the average error over the detection energy threshold and the sensitivity threshold (sequences Walk1.mpeg (left) and Walk3.mpeg (right), with default values for the free parameters).
Figure 8. Evolution of the average error over the split coefficient and the sensitivity threshold.
Figure 9. Evolution with varying alpha (left) and varying density (right). We can identify an optimal value for alpha (alpha = 0.1), but the error is constant for all density values.
Figure 10. Evolution with varying area threshold (left). The error increases slightly with decreasing area threshold. The area threshold has a significant impact on the processing time (right).
Figure 12. Real-time head pose estimation.
head pose. The tracking system provides the precise face location, which allows the combined system to run in real time. Figure 12 shows an example of the eye tracking module.
5.3. Agent identification
The agent identification module provides an association between individual features and targets tracked by background subtraction. Identification of each tracked blob is carried out by elastic matching of labelled graphs, where the labels are receptive field responses [2]. The degree of correspondence between the model and the observations extracted from the ROI provided by the tracking system is computed by evaluating a cost function. The cost function is a weighted sum of the spatial similarity and the appearance similarity [3, 8]. Figure 13 shows a successful identity recovery after a target occlusion. The system currently processes 10 frames/s.
Figure 13. Example of a split and merge event with successful identity recovery. (Panel matching costs: before merge, pers1 165 / pers2 186; merge, pers1 337 / pers2 492; occlusion, pers1 488 / pers2 1470; split, pers1 2073 / pers2 735.)
Figure 14. True versus false detections for individuals.
6. Tracking performance of the core modules
In order to evaluate the performance of our tracking system, we have tested the core modules on 16 of the PETS 04 sequences (17182 frames containing 50404 targets marked by bounding boxes)¹. In this section we give a brief summary of the tracking results.

Figure 14 shows the receiver operating characteristic for all 16 sequences. Our system has a low false detection probability of 9.8% and a true detection probability of 53.6%. This translates to a recall of 53.6% (27030 correct positives out of 50404 total positives) and a precision of 90.2% (27030 correct positives out of 29974 detections). The reason for the relatively low recall is that the ground-truth labeling takes into account targets that are already present in the scene and targets that pass on the gallery at the first floor. Our tracking system relies on the method of detection regions for target initialization. Both types of targets are not detected by our tracking system, because they are never initialized.
The tracking results are evaluated with respect to other parameters such as errors in detected position, size, and orientation, and the time lag of entry and exit. The performance of our system with respect to these parameters is summarized in Table 1. Our system performs very well in position detection, orientation estimation, and exit time lag. The bounding box produced by the tracking system is significantly smaller than the bounding box of the ground truth. This is due to the fact that the tracking system estimates the bounding box from the covariance of the pixels with high energy, whereas
¹ The sequences as well as the statistics are available at the CAVIAR home page http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm
Average error in    average value       maximum value
Position            6 - 7 pixels        13 - 15 pixels
Size                -160% to -240%      -240%
Orientation         +/-0.5%             +/-30%
Entry time lag      50 to 80 frames     100 to 160 frames
Exit time lag       1 frame             1 frame

Table 1. Evaluation of the tracking results with respect to measurement precision.
a human draws a bounding box that includes all pixels that belong to the target. The tracking system can produce a similar output by computing the connected components of the energy image, but this is a costly operation. In the case where the connected-components bounding box is used for position computation, the position becomes more unstable. For this reason we decided to use the first and second moments of the energy pixels for target specification. The entry time lag is a problem related to the detection regions: a human observer marks a new target as soon as it appears, whereas the detection region requires that the observed energy be above the energy density threshold.
7. Conclusion
We have presented an architecture for a tracking system that consists of a central supervisor, a tracking module based on background subtraction or color histogram detection combined with Kalman filtering, and an automatic target initialization module restricted to detection regions. These three modules form the core system. The central supervisor architecture has the advantage that additional modules can be plugged in very easily. New tracking systems that solve different tasks can be created in this way.

The tracking system depends on a number of parameters that influence the performance of the system. Therefore, finding a good parameter setting for a particular scenario is essential. We have proposed to treat the tracking system as a classical controlled system, and we identified a method to evaluate the quality of a particular parameter setting. The preliminary experiments show that small variations of the parameters produce smooth changes of the average error function. Using this behavior, we can improve the performance of our tracking system by finding a good parameter setting using gradient descent in the parameter space. Unfortunately, the experiments on automatic parameter adaptation are preliminary and could not yet be integrated into the performance evaluation of the system.
References
[1] P. de Larminat. Automatique : commande des systèmes linéaires. Hermès Science Publications, 2nd edition, 1996.

[2] D. Hall and J.L. Crowley. Détection du visage par caractéristiques génériques calculées à partir des images de luminance. In Congrès Francophone de Reconnaissance des Formes et Intelligence Artificielle, pages 1365-1373, Toulouse, France, 2004.

[3] M. Lades, J.C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3):300-311, March 1993.

[4] A. Lux. The ImaLab method for vision systems. In International Conference on Vision Systems, pages 319-327, Graz, Austria, April 2003.

[5] K. Schwerdt and J.L. Crowley. Robust face tracking using color. In International Conference on Automatic Face and Gesture Recognition, pages 90-95, Grenoble, France, March 2000.

[6] M.J. Swain and D.H. Ballard. Color indexing. International Journal of Computer Vision, 7(1):11-32, 1991.

[7] G. Welch and G. Bishop. An introduction to the Kalman filter. Technical Report TR 95-041, University of North Carolina at Chapel Hill, 2004.

[8] L. Wiskott, J.M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. In Intelligent Biometric Techniques in Fingerprint and Face Recognition, chapter 11, pages 355-396. CRC Press, 1999.
Automatic parameter regulation for a tracking
system with an auto-critical function
Daniela Hall
INRIA Rhône-Alpes, St. Ismier, France
Email: [email protected]
Abstract— In this article we propose an architecture for a tracking system that can judge its own performance through an auto-critical function. Performance drops can be detected, which triggers an automatic parameter regulation module. This regulation module is an expert system that searches for a parameter setting with better performance and returns it to the tracking system. With such an architecture, a robust tracking system can be implemented which automatically adapts its parameters in case of changes in the environmental conditions. This article opens a way toward self-adaptive systems in detection and recognition.
I. INTRODUCTION
Parameter tuning of complex systems is often performed manually. A tracking system requires different parameter settings as a function of the environmental conditions and the type of the tracked targets. Each change in conditions requires a parameter update. There is a great need for an expert system that performs the parameter regulation automatically. This article proposes such an approach and applies it to a real-time tracking system. The proposed architecture for auto-regulation is valid for any complex system whose performance depends on a set of parameters.
Automatic regulation of parameters can significantly enhance the performance of systems for detection and recognition. Surprisingly little previous work has been published in this domain [5]. A first step towards performance optimization is the ability of the system to be auto-critical: the system must be able to judge its own performance. A performance drop, detected with this kind of auto-critical function, can trigger an independent module for auto-regulation. The task of the regulation module is to propose a set of parameters that improves system performance.

The auto-critical function detects a performance drop when the measurements diverge from a scene reference model. In this case the automatic regulation module is triggered to provide a parameter setting with better performance. Section II explains the architecture of the tracking system and of the regulation cycle. Section III explains the details of the auto-critical function, the generation of the scene reference model, and the measure used for performance evaluation. In Section IV we explain the use of the regulation module. We then show experiments that demonstrate the utility of our approach. We finish with conclusions and a critical evaluation.
Fig. 1. Architecture of the tracking and detection system controlled by a supervisor (detection-region list, background detector, target initialisation, and the prediction and estimation loop over the target list).
II. SYSTEM ARCHITECTURE
In order to demonstrate the utility of our approach for auto-regulation of parameters, we choose a detection and tracking system as previously described in [2]. Figure 1 shows the architecture of the system. The tracking system is composed of a central supervisor, a target initialisation module, and a tracking module. This modular architecture is flexible, such that competing algorithms for detection can be integrated. For our experiments we use a detection module based on adaptive background differencing with manually defined detection regions. Robust tracking is achieved by a first-order Kalman filter that propagates the target positions in time and updates them with measurements from the detection module.
The tracking system depends on a number of parameters such as the detection energy threshold, the sensitivity for detection, the energy density threshold to avoid false detections due to noise, a temporal parameter for background adaptation, and a split coefficient to enable merging and splitting of targets (i.e., when two people meet they merge into a single group target; a split event is observed when a person separates from the group).

Figure 2 shows the integration of the parameter regulation module and the auto-critical function. The auto-critical function evaluates the current system performance and decides whether parameter regulation is necessary. If this is the case, the tracker supervisor sends a request to the regulation module, providing its current parameter setting and current performance as well as other data needed by the regulation module. When the regulation module has found a better parameter setting (or after a maximum number of iterations), it stops processing and sends the result to the system supervisor, which
Fig. 2. Integration of the regulation module in a complex system.
updates the parameters and reinitialises the modules.
It is difficult to predict the performance gain of the auto-regulation. Since the module can test only a discrete number of parameter settings, there is no guarantee that the globally optimal parameter setting is found. For this reason, the goal of the regulation system is to find a parameter setting that increases system performance. Subsequent calls of the regulation module then allow a constantly increasing system performance to be obtained. The modular architecture enables the use of different methods and the application of the regulation to different kinds of systems.
III. THE AUTO-CRITICAL FUNCTION
The task of the auto-critical function is to provide a fast
estimation of the current tracking performance. A performance
evaluation function requires a reliable measure to estimate the
current system performance. The measure used (described in
Section III-B) is based on a probabilistic model of the scene,
which allows the likelihood of measurements to be estimated.
The probabilistic scene model is generated by a learning
approach. Section III-C explains how the quality of a model
can be measured. Section III-D discusses different clustering
schemes.
A. Learning a probabilistic model of a scene
A model of a scene describes what usually happens in the
scene. It describes a set of target positions and sizes, but also
a set of paths of the targets within the scene. The model
is computed from previously observed data. A valid model
describes everything that is going to be observed. For
this reason we require that the training data is representative
of what usually happens in the scene.

The ideal model of a scene allows us to decide in a probabilistic
manner which measurements are typical and which
measurements are unusual. With such a model we can compute
the probability of single measurements and of temporal
trajectories. Furthermore, we can detect outliers that occur
due to measurement errors. The model represents the typical
behaviour of the scene, and it enables the system
to alert a user when unusual behaviour takes place, a feature
which is useful for the task of a video surveillance
operator.
In this section we describe the generation of a scene
reference model which gives rise to a goodness measure that
can compute the likelihood of measurements y(ti) with respect
to the scene reference model. We know that a single mode
is insufficient to provide a valid scene description. We need
a model with several modes that associate spatially close
measurements and provide a locally valid model. The model
is computed from data recorded by a static camera.
An important question is which training data should be used
to create an initial model. The CAVIAR test case scenarios [4]
contain 26 image sequences and hand labelled ground truth.
We can use the ground truth to generate an initial model. If
the initial model is not sufficient, it can be refined
by adding tracking observations, from which measurements with
low probability (which are likely to contain errors) are removed.
For the computation of the scene reference model, we
use the hand labelled data of the CAVIAR data set (42000
bounding boxes). We divide the model into a training and
a test set of equal size. The observations consist of spatial
measurements y_spatial(t_i) = (µ_x, µ_y, σ_x², σ_y²) (first and second
moments of the target observation in frame I(t_i)). We can
extend these observations to spatio-temporal measurements
y_spatiotemp(t_i) = (µ_x, µ_y, σ_x², σ_y², ∆µ_x, ∆µ_y, ∆σ_x², ∆σ_y²) by
considering observations at subsequent time instances t_i and
t_{i−1}. Such measurements have the advantage that we take into
account the local motion direction and speed. A trajectory
y(t) is a sequence of spatial or spatio-temporal measurements
y(t_i). Single measurements are noted as vectors y(t_i), whereas
trajectories y(t) are coded as vector lists. The following
approach is valid for both types of observed trajectories y(t).
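The construction of spatio-temporal measurements from consecutive spatial measurements can be sketched in Python as follows (a minimal illustration; the function names and the array layout are our own assumptions, not part of the paper):

```python
import numpy as np

def spatiotemporal_measurements(spatial):
    """Extend a trajectory of spatial measurements
    (mu_x, mu_y, var_x, var_y) with the differences to the previous
    frame, giving 8-dimensional vectors
    (mu_x, mu_y, var_x, var_y, d_mu_x, d_mu_y, d_var_x, d_var_y)."""
    spatial = np.asarray(spatial, dtype=float)
    deltas = spatial[1:] - spatial[:-1]   # differences between t_i and t_{i-1}
    return np.hstack([spatial[1:], deltas])
```

The first frame of a trajectory has no predecessor, so the spatio-temporal trajectory is one measurement shorter than the spatial one.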
To obtain a multi-modal model we have experimented with
two types of clustering methods: k-means and k-means with
pruning. K-means requires a fixed number of clusters that
must be specified by the user a priori. K-means converges to
a local minimum that depends on the initial clusters. These
are determined randomly, which means that the algorithm
produces different sub-optimal solutions in different runs. To
overcome this problem, k-means is run several times with the
same parameters. In Section III-C we propose a measure to
judge the quality of the clustering result. With this measure
we select an optimal clustering solution as our scene reference
model.
The method k-means with pruning is a variation of the traditional
k-means that produces more stable results due to subsequent
fusion of close clusters. In this variation, k-means is called
with a large number of clusters, k ∈ [500, 2000]. Clusters
that are close within this solution are subsequently merged,
and clusters with few elements are considered as noise and
removed. This method is less sensitive to outliers, has the
characteristics of a hierarchical clustering scheme, and at the
same time can be computed quickly due to the initial fast
k-means clustering. Figure 3 illustrates this algorithm.
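The algorithm above can be sketched in Python. This is an illustrative reconstruction under our own assumptions (plain Lloyd iterations, greedy center merging, the specific thresholds are the defaults from Figure 3), not the authors' implementation:

```python
import numpy as np

def kmeans_with_pruning(data, k=500, merge_dist=1.0, min_elements=4,
                        n_iter=20, rng=None):
    """K-means with pruning: run plain k-means with a large k, merge
    clusters whose centers are closer than merge_dist, and discard
    clusters with fewer than min_elements points (treated as noise)."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    # initialise centers on randomly chosen data points
    centers = data[rng.choice(len(data), size=min(k, len(data)), replace=False)]
    for _ in range(n_iter):                       # plain Lloyd iterations
        d = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            pts = data[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    # greedily merge centers that are closer than merge_dist
    merged = []
    for c in centers:
        for m in merged:
            if np.linalg.norm(c - m["center"]) < merge_dist:
                m["members"].append(c.copy())
                m["center"] = np.mean(m["members"], axis=0)
                break
        else:
            merged.append({"center": c.copy(), "members": [c.copy()]})
    centers = np.array([m["center"] for m in merged])
    # prune clusters with too few assigned points
    d = np.linalg.norm(data[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    keep = [j for j in range(len(centers))
            if (labels == j).sum() >= min_elements]
    return centers[keep]
```

Because the expensive step is the initial k-means pass, the subsequent merge and prune steps add little cost, which matches the fast hierarchical behaviour described above.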
B. Evaluating the goodness of a trajectory
A set of Gaussian clusters modeled by mean and covariance
is an appropriate representation for statistical evaluation of
measurements. The probability P (y(ti)|C ) can be computed
according to equation 2.
Fig. 3. K-means with pruning. After initial k-means clustering, clusters whose centers are closer than 1.0 are merged, and clusters with fewer than 4 elements are assigned to noise.
The auto-regulation and auto-critical modules need a measure
to judge the goodness of a particular trajectory. A simple
goodness score consists of the average probability of the
most likely cluster for the single measurements. The goodness
G(y(t)) of the trajectory y(t) = (y(t_n), . . . , y(t_0)) with length
n + 1 is computed as follows:
G(y(t)) = (1/(n+1)) Σ_{i=0}^{n} max_k P(y(t_i)|C_k)    (1)

with

P(y(t_i)|C) = P(y(t_i)|µ, U)    (2)
            = (1 / ((2π)^{dim/2} |U|^{1/2})) exp(−0.5 (y(t_i) − µ)^T U^{−1} (y(t_i) − µ))
where µ is the mean and U the covariance of cluster C. Trajectories have
variable length and may consist of several hundred measurements.
The proposed goodness score is high for trajectories
composed of likely measurements and small for trajectories
that contain many unlikely measurements (errors). This measure
allows good and bad trajectories to be classified reliably,
independent of their particular length.
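Equations (1) and (2) can be written out directly. The sketch below is an illustration under our own naming conventions (a cluster is a (mean, covariance) pair); it is not taken from the paper:

```python
import numpy as np

def gaussian_pdf(y, mu, U):
    """Multivariate Gaussian density of equation (2)."""
    d = len(mu)
    diff = np.asarray(y, dtype=float) - np.asarray(mu, dtype=float)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(U))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(U) @ diff) / norm)

def goodness(trajectory, clusters):
    """Simple goodness score of equation (1): the average, over the
    measurements of a trajectory, of the probability of the most
    likely cluster."""
    return float(np.mean([
        max(gaussian_pdf(y, mu, U) for mu, U in clusters)
        for y in trajectory]))
```

Averaging over the trajectory length is what makes the score comparable between short and long trajectories.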
On the other hand, the goodness score does not take into
account the sequential structure of the measurements. The
sequential structure is an important indicator for the detection
of local measurement errors and errors due to badly adapted
parameters. To study the potential of a goodness score that
is sensitive to the sequential structure, we propose the following
measure (see equation 3):
G_seq(v)(y(t)) = (1/m) Σ_{i=0}^{m−1} log(P̂(y(s_i)))    (3)

which is the average log likelihood of the dominant term
P̂(y(s_i)) of the probability of a sub-trajectory y(s_i) of length
v. We use the log likelihood because P(y(s_i)) is typically very
small.
A trajectory y(t) = (y(s_0), y(s_1), . . . , y(s_{m−1})) is composed
of m sub-trajectories y(s_i) of length v. We develop
the measure for v = 3; the measure for any other value of v is
developed accordingly. The probability of the sub-trajectories
is defined as:
P(y(s_i)) = P(y(t_2), y(t_1), y(t_0))
          = P̂(y(s_i)) + r
P̂(y(s_i)) = P(C_{k2}|y(t_2)) P(C_{k1}|y(t_1)) P(C_{k0}|y(t_0)) P(C_{k2} C_{k1} C_{k0})    (4)

P(y(s_i)) is composed of the probability of the most likely
path through the modes of the model, P̂(y(s_i)), plus a term r
which contains the probability of all other path permutations.
Naturally, P(y(s_i)) is dominated by P̂(y(s_i)), and
r tends to be very small. This is the reason why we use
only the dominant term P̂(y(s_i)) in the final goodness score.
P(C_{ki}|y(t_i)) is computed using Bayes' rule. The prior P(C_k)
is set to the ratio |C_k| / Σ_u |C_u|. The normalisation factor
P(y(t_i)) is constant. Since we are interested in the maximum
likelihood, we compute:

P(C_{ki}|y(t_i)) = P(y(t_i)|C_{ki}) P(C_{ki}) / P(y(t_i))
                ∝ P(y(t_i)|C_{ki}) |C_{ki}| / Σ_u |C_u|    (5)

where |C_{ki}| denotes the number of elements in C_{ki}.
P(y(t_i)|C_{ki}) is computed according to equation (2).
The joint probability P(C_{k2} C_{k1} C_{k0}) is developed according
to

P(C_{k2} C_{k1} C_{k0}) = P(C_{k2}|C_{k1} C_{k0}) P(C_{k1}|C_{k0}) P(C_{k0})    (6)

We simplify this equation by assuming a first-order Markov
constraint:

P(C_{k2} C_{k1} C_{k0}) = P(C_{k2}|C_{k1}) P(C_{k1}|C_{k0}) P(C_{k0})    (7)
To compute the conditional probabilities P(C_i|C_j), we need
to construct a transfer matrix from the training set. This is
obtained by counting, for each cluster C_i, the number of
state changes and then normalising such that each row of the
state matrix sums to 1. The probabilistically inspired sequential
goodness score of equation (3) is computed using equations (4)
to (7).
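The transfer matrix and the sequential score can be sketched as follows. This is our own illustrative reconstruction (the data layout, i.e. precomputed per-measurement cluster posteriors and most-likely cluster labels, is an assumption, not the authors' code):

```python
import numpy as np

def transition_matrix(label_sequences, n_clusters):
    """Transfer matrix for the first-order Markov model of equation (7):
    count cluster-to-cluster changes in the training trajectories and
    normalise each row to sum to 1."""
    T = np.zeros((n_clusters, n_clusters))
    for seq in label_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            T[a, b] += 1
    rows = T.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1.0        # leave rows of unseen clusters at zero
    return T / rows

def sequential_goodness(posteriors, labels, priors, T, v=3):
    """Sequential goodness score of equation (3): average log of the
    dominant term, i.e. the product of the per-measurement cluster
    posteriors times the Markov-factorised joint prior of the visited
    clusters (equations 4 and 7)."""
    logs = []
    for s in range(0, len(labels) - v + 1, v):
        ks = labels[s:s + v]
        p = priors[ks[0]]                     # P(C_k0), equation (7)
        for a, b in zip(ks[:-1], ks[1:]):
            p *= T[a, b]                      # Markov transition factors
        for i, k in zip(range(s, s + v), ks):
            p *= posteriors[i][k]             # posteriors of equation (5)
        logs.append(np.log(p))
    return float(np.mean(logs))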
C. Measuring the quality of the model
K-means clustering is a popular tool for learning and model
generation because the user needs to provide only the number
of desired clusters [3], [7], [8]. K-means converges quickly to
a (locally) optimal solution. K-means clustering starts from a
number of randomly initialised cluster centers; therefore, each
run produces a different sub-optimal solution. In cases where
the number of clusters is unknown, k-means can be run several
times with a varying number of clusters. A difficult problem is
to rank the different k-means solutions and select the one that
is the most appropriate for the task. This section provides a
solution to this problem, which is often neglected.
For a particular model (clustering solution) we can compute
the probability of a measurement belonging to the model.
To ensure that the computed probability is meaningful, the
model must be representative. A good model assigns a high
probability to a typical trajectory and a low probability to
an unusual trajectory. Based on these notions we define an
evaluation criterion for measuring the quality of the model.
We need a model that is neither too simple nor
too complex. The complexity is related to the number of
clusters [1]. A high number of clusters leads to over-fitting, and
a low number of clusters provides an imprecise description.
Model quality evaluation requires a positive and a negative
example set. Typical target trajectories (positive examples)
are provided within the training data. It is more difficult
to create negative examples. A negative example trajectory
is constructed as follows. First, we measure the mean and
variance of all training data; this represents the distribution
of the data. We can now generate random measurements
by drawing from this distribution with a random number
generator. The result is a set of random measurements. From
the training set, we generate a k-means clustering with a large
number of clusters (K = 100). For each random measurement
we compute p(y(t_i)|model_100). From the original 5000 random
measurements we keep the 1200 measurements with the
lowest probability. This gives the set of negative examples.
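The negative-example construction can be sketched as below. The `score` callback stands in for p(y|model_100) and is our own abstraction; the default counts (5000 drawn, 1200 kept) come from the text:

```python
import numpy as np

def negative_examples(train, score, n_draw=5000, n_keep=1200, rng=None):
    """Fit the mean and standard deviation of the training data, draw
    n_draw random measurements from that distribution, and keep the
    n_keep samples to which the model (the `score` function) assigns
    the lowest probability."""
    rng = np.random.default_rng(rng)
    train = np.asarray(train, dtype=float)
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    samples = rng.normal(mu, sigma, size=(n_draw, train.shape[1]))
    probs = np.array([score(s) for s in samples])
    keep = np.argsort(probs)[:n_keep]     # lowest-probability samples
    return samples[keep]
```

Keeping only the lowest-probability draws makes the negative set genuinely atypical with respect to the clustered model, rather than merely random.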
Figure 4 shows an example of the positive and negative
trajectory as well as the hand labelled ground truth and a
multi-modal model obtained by k-means with pruning.
For all positive and negative measurements we compute
the probability P(y(t_i)). Classification of the measurements
into positive and negative can be obtained by thresholding
this value. For a threshold d, the classification error can be
computed according to equation (8). The optimal threshold
d separates positive from negative measurements with a
minimum classification error [1].
P_d(error) = P(x ∈ R_bad, C_good) + P(x ∈ R_good, C_bad)    (8)
           = ∫_0^d p(x|C_good) P(C_good) dx + ∫_d^1 p(x|C_bad) P(C_bad) dx
with R_bad = [0, d] and R_good = [d, 1]. We search for the optimal
threshold d such that P_d(error) is minimised. We operate on a
histogram using a logarithmic scale. This has the advantage that
the distribution of lower values is sampled more densely. The
optimal threshold d with minimum classification error can be
estimated precisely with this method.
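The threshold search can be sketched as follows (an illustration under our assumptions: probabilities are thresholded directly, candidate thresholds are spaced logarithmically as the text describes, and the error of equation (8) is estimated by counting misclassified samples):

```python
import numpy as np

def optimal_threshold(pos_probs, neg_probs, n_bins=100):
    """Find the threshold d minimising the classification error of
    equation (8). Candidates are log-spaced so that the small
    probability values are sampled more densely."""
    pos_probs = np.asarray(pos_probs, dtype=float)
    neg_probs = np.asarray(neg_probs, dtype=float)
    all_p = np.concatenate([pos_probs, neg_probs])
    candidates = np.exp(np.linspace(np.log(all_p.min()),
                                    np.log(all_p.max()), n_bins))
    best_d, best_err = candidates[0], np.inf
    n = len(all_p)
    for d in candidates:
        # positives below d and negatives at or above d are misclassified
        err = ((pos_probs < d).sum() + (neg_probs >= d).sum()) / n
        if err < best_err:
            best_d, best_err = d, err
    return best_d, best_err
```

When the two populations are well separated, the minimum error is zero and any threshold between them is optimal; the search returns the first such candidate.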
This classification error P(error) is a measure of the
quality of the cluster model. Furthermore, less complex models
should be preferred. For this reason we formulate the quality
constraint on clustering solutions as follows: the best clustering
has the lowest number of clusters and an error probability
P(error) < q, with q = 1%. The value of q is chosen
depending on the task requirements. This measure is a fair
evaluation criterion which enables the best model to be chosen
among a set of k-means solutions.
D. Clustering results
We test two clustering methods: k-means and k-means with
pruning. The positive trajectory is a person walking across the
hall; the negative trajectory consists of 1200 measurements
Fig. 5. A process for automatic parameter regulation.
constructed as described above. The training set consists
of 21000 hand labelled bounding boxes from 15 CAVIAR
sequences (see Figure 4).
Table I shows characteristics of the winning models with
highest quality, defined by minimum classification error and
minimum number of clusters. The superiority of k-means
with pruning is demonstrated by the results. For the constraint
P(error) < 1%, k-means with pruning requires only 20 or
19 clusters respectively, whereas the classical k-means needs
a model of 35 clusters to obtain the same error rate. The best
overall model is obtained for spatio-temporal measurements
using k-means with pruning.
IV. THE MODULE FOR AUTOMATIC PARAMETER
REGULATION
The task of the module for automatic regulation is to
determine a parameter setting that improves the performance
of the system. In the case of a detection and recognition
system, this corresponds to increasing the number of true
positives and reducing the number of false positives. For
this task, the module requires an evaluation function of the
current output, a strategy to choose a new parameter setting,
and a subsequence which can be replayed to optimize the
performance.
A. Integration
When the parameter regulation module is switched on, the
system tries to find a parameter setting that performs better
than the current parameter setting on a subsequence that is
provided by the tracking system. The system uses one of the
goodness scores of Section III-B.

In the experiments we use a subsequence of 200 frames for
auto-regulation. The tracker is run several times with changing
parameter settings on this subsequence, and the goodness score
of the trajectory is measured for each parameter setting. The
parameter setting that produces the highest goodness score
is kept. Parameter settings are obtained from a parameter
space exploration tool whose strategies are explained in
Sections IV-B and IV-C.
The automatic regulation can only operate on sequences that
produce a trajectory (something observable must happen in
the scene). To allow a fair comparison, the regulation module
must process the same subsequence several times. For this
Fig. 4. Ground truth labelling for the entrance hall scenario (21000 hand labelled bounding boxes), examples of a typical and an unusual (random) trajectory, and the clustering result using k-means with pruning.
Measurement type   Clustering method      # clusters   Optimal threshold d   P(error)
Spatial            K-means                35           0.0067380             0.0007
Spatial            K-means with pruning   20           0.0067380             0.0061
Spatio-temporal    K-means                35           0.00012341            0.0013
Spatio-temporal    K-means with pruning   19           0.00012341            0.0034

TABLE I. Best model representations and their characteristics (final number of clusters, optimal threshold, and classification error).
reason the regulation process requires a significant amount of
computing power. As a consequence, the regulation module
should be run on a different host such that the regulation does
not slow down the real time tracking.
B. Parameter space exploration tool
To solve the problem of parameter space exploration, we
propose a parameter exploration tool that provides the next
parameter setting to the regulation module. The dimensions
of the parameter space as well as a reasonable range of the
parameter values are given by the user. In our tracking example
the parameter space is spanned by detection energy, density,
sensitivity, split coefficient, α, and area threshold.
In the experiments we tested two strategies for parameter
setting selection. The first is an enumerative method that defines
a small number of discrete values for each parameter; at each
call the parameter space exploration tool provides the next
parameter setting in the list. The disadvantage of this method
is that only a small number of settings can be tested, and the
best setting may not be in the predefined list. The second
strategy for parameter space exploration is based on a genetic
algorithm. We found genetic algorithms well adapted to our
problem: they enable feedback from the performance of previous
settings. We have a high dimensional feature space, which
makes hill climbing methods costly, whereas genetic algorithms
explore the space without the need for a high dimensional
surface analysis.
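The enumerative strategy amounts to walking a fixed grid. A minimal sketch, using the discrete values reported in the experiments of Section V (the dictionary keys are our own naming, not the authors' API):

```python
from itertools import product

def enumerative_settings():
    """Cartesian product of the discrete values tested in the
    experiments, plus the fixed parameters: 3 * 3 * 4 = 36 settings."""
    grid = []
    for energy, density, sensitivity in product([20, 30, 40],
                                                [5, 15, 25],
                                                [0, 20, 30, 40]):
        grid.append({"detection_energy": energy,
                     "density": density,
                     "sensitivity": sensitivity,
                     "split_coefficient": 2.0,
                     "alpha": 0.001,
                     "area_threshold": 1500})
    return grid
```

The coarseness of such a grid is the disadvantage noted above: the best setting may lie between the enumerated values.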
C. Genetic algorithm for parameter space exploration
Among the different optimization schemes that exist, we are
looking for a particular method that fulfills several constraints.
We do not require reaching a global maximum of our function,
but we would like to reach a good level of performance
quickly. Furthermore, we are not particularly interested in the
shape of the surface in parameter space. We are only interested
in obtaining a good payoff with a small number of tests.
According to Goldberg [6], these are exactly the constraints of
an application for which genetic algorithms are appropriate.
Hill climbing methods are not feasible because the estimation
of the gradient at a single point in a 6-dimensional space
requires 2^6 tests. Testing several points would therefore require
a higher number of tests than we would like.
Genetic algorithms are inspired by the mechanics of natural
selection. Genetic algorithms require an objective function to
evaluate the performance of an individual and a coding of the
input variables. Typically the coding is a binary string. In our
example, each parameter is represented by 5 bits, which gives
an input string of length 30.
Genetic algorithms have three major operators: reproduction,
crossover and mutation. Reproduction is a process in
which individuals are copied according to their objective
function values: individuals with high performance are
copied more often than those with low performance. After
reproduction, crossover is performed as follows. First, pairs
of individuals are selected at random. Then, a position k
within the string of length l is selected at random. Two new
individuals are created by swapping all characters from position
k + 1 to l. The mutation operator selects a position
within the string at random and flips its value.
The power of genetic algorithms comes from the fact that
individuals with good performance are selected for reproduction,
and crossing high performance individuals speculates
on generating new ideas from high performance elements of
past trials.
For the initialisation of the genetic algorithm, the user
needs to specify the boundaries of the input variable space,
coding of the input variables, the size of the initial population
and the probability of crossover and mutation. Goldberg [6]
proposes to use a moderate population size, a high crossover
probability and a low mutation probability. The coding of the
input variables should use the smallest alphabet that allows the
problem to be expressed. In the experiments we use a population of
size 16, we estimate 7 generations, the crossover probability
is set to 0.6 and the mutation probability to 0.03.
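The operators described above can be sketched in Python. This is our own simplified illustration (per-pair crossover, at most one bit flipped per individual, fitness-proportional selection via weighted sampling), not the authors' implementation:

```python
import random

BITS = 5  # 5 bits per parameter, 6 parameters -> strings of length 30

def decode(bits, lo, hi):
    """Map a 5-bit substring to a value in the range [lo, hi]
    (a hypothetical linear coding for illustration)."""
    v = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * v / (2 ** BITS - 1)

def next_generation(population, fitness, p_cross=0.6, p_mut=0.03, rng=None):
    """One generation: fitness-proportional reproduction, single-point
    crossover of random pairs, and bit-flip mutation."""
    rng = rng or random.Random()
    scores = [fitness(ind) for ind in population]
    # reproduction: copy individuals in proportion to their fitness
    chosen = rng.choices(population, weights=scores, k=len(population))
    offspring = []
    for a, b in zip(chosen[::2], chosen[1::2]):
        a, b = a[:], b[:]
        if rng.random() < p_cross:            # single-point crossover
            k = rng.randrange(1, len(a))
            a[k:], b[k:] = b[k:], a[k:]
        offspring += [a, b]
    for ind in offspring:                      # mutation: flip a random bit
        if rng.random() < p_mut:
            k = rng.randrange(len(ind))
            ind[k] ^= 1
    return offspring
```

With the settings from the text (population 16, 7 generations), this evaluates at most 112 parameter settings, far fewer than a gradient-based exploration of the 6-dimensional space would need.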
V. EXPERIMENTAL EVALUATION
In this section we evaluate the proposed approach on the
CAVIAR entry hall sequences¹. The system is evaluated by
recall and precision of the targets compared to the hand-labelled
ground truth.
recall = true positives / (total # targets)    (9)

precision = true positives / (true positives + false positives)    (10)
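Equations (9) and (10) translate directly into code; checking them against the manual-adaptation row of Table II (11520 true positives, 1136 false positives, 23180 targets) reproduces the reported 49.7% recall and 91.0% precision:

```python
def recall_precision(true_positives, false_positives, total_targets):
    """Recall and precision as defined in equations (9) and (10)."""
    recall = true_positives / total_targets
    precision = true_positives / (true_positives + false_positives)
    return recall, precision

# Manual adaptation row of Table II
r, p = recall_precision(11520, 1136, 23180)  # -> about 0.497 and 0.910
```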
We use the results of the manual adaptation as an upper
benchmark. These results were obtained by a human expert
who processed the sequences several times and hand tuned the
parameters. The expert quickly gained experience about which
kinds of tracking errors depend on which parameter. The automatic
regulation module does not use this kind of knowledge. For
this reason, the recall and precision of the manual adaptation
are the best we can hope to reach with an automatic method.
We do not have manually adapted parameters for all sequences,
due to the repetitive and time consuming nature of the manual task.
A lower benchmark is provided by the tracking results that
use no adaptation. This means all 5 sequences are evaluated
using the same parameter setting. Choosing parameters with
high values² produces low recall and bad precision. Choosing
parameters with low values³ increases the recall, but the very
large number of false positives is not acceptable.
Table II shows the tracking results using a spatial and a
spatio-temporal model and two parameter space exploration
schemes. The first uses a brute force search (enumerative
method) of the discrete parameter space composed of the
discrete values for detection energy ∈ [20, 30, 40], density
∈ [5, 15, 25], sensitivity ∈ [0, 20, 30, 40], split coefficient =
2.0, α = 0.001, area threshold = 1500. The method tests
36 parameter settings. The second exploration scheme uses
a genetic algorithm as described in section IV-C.
The enumerative method has several disadvantages that
are reflected by the rather low performance measurements
of the experiments. The sampling of the parameter space is
coarse, and therefore it happens frequently that none of the
parameter settings provides an acceptable improvement. The
same arguments hold for random sampling of the parameter
space.
The spatial model using the brute force method and the simple
score has a small recall, but a better precision than the lower
benchmark. The spatio-temporal measurements using the same
parameter selection and evaluation measure produce superior
results (higher recall and higher precision). This seems to
be related to the spatio-temporal model. The precision can
be further improved using the genetic approach and the
more complex evaluation function (recall 39.7% and precision
78.8%).

¹ Available at http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/
² detection energy=30, density=15, sensitivity=30, split coefficient=1.0, α = 0.01, and area threshold=1500
³ detection energy=10, density=15, sensitivity=20, split coefficient=2.0, α = 0.01, and area threshold=1500
VI. CONCLUSIONS AND OUTLOOK
We presented an architecture for a tracking system that uses
an auto-critical function to judge its own performance and an
automatic parameter regulation module for parameter adaptation.
This system opens the way to self-adaptive systems which
can operate under difficult lighting conditions. We applied our
approach to tracking systems, but the same approach can be
used to increase the performance of other systems that depend
on a set of parameters.

An auto-critical function and a parameter regulation module
require a reliable performance evaluation measure. In our case,
this measure is computed as a divergence of the observed
measurements with respect to a scene reference model. We
proposed an approach for the generation of such a scene
reference model and developed a measure that is based on
the measurement likelihood.
With this measure, we can compute a best parameter setting
for pre-stored sequences. The experiments show that the auto-regulation
greatly enhances the performance of the tracking
output compared to tracking without auto-regulation. The
system cannot quite reach the performance of a human expert,
who uses knowledge based on the type of tracking errors for
parameter tuning; this kind of knowledge is not available to
our system.

The implementation of the auto-critical function can trigger
the automatic parameter regulation. First successful tests have
been made to host the system on a distributed system. The
advantage of the distributed system architecture is that the
tracking system can continue the real time tracking. There
remains the problem of re-initialisation of the tracker: currently,
existing targets are destroyed when the tracker is reinitialised.
The current model relies entirely on ground truth labelling.
The success of the method strongly depends on the quality of
the model. In many cases, a small number of hand labelled
trajectories can be gathered, but often their number is not
sufficient for the creation of a valid model. For such cases
we envision an incremental modeling approach: an initial
model is generated from a few hand-labelled sequences, the initial
model is then used to filter the tracking results such that they
are error free, and these error-free trajectories are then used to
refine the model. This corresponds to a feedback loop in
model generation. After a small number of iterations a valid
model should be obtained. The option of such an incremental
model is essential for non-static scenes.
ACKNOWLEDGMENT
This research is funded by the European Commission's IST
project CAVIAR (IST 2001 37540). Thanks to Thor List for
providing the recognition evaluation tool.
Auto-regulation method                                     Recall   Precision   Total # targets   True positives   False positives
Manual adaptation (benchmark)                              49.7     91.0        23180             11520            1136
Spatio-temporal model (genetic approach, Gseq(10))         39.7     78.8        21564             8556             2304
Spatio-temporal model (genetic approach, simple score G)   39.4     73.2        21564             8492             3108
Spatio-temporal model (brute force, simple score G)        38.1     72.2        21564             8224             3160
Spatial model (brute force, simple score G)                29.2     68.7        21564             6302             2872
No adaptation (low thresholds)                             68.0     24.5        21564             14672            45131
No adaptation (high thresholds)                            28.3     47.5        21564             6109             6746

TABLE II. Precision and recall of the different methods evaluated for 5 CAVIAR sequences (overlap requirement 50%).
REFERENCES

[1] C.M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[2] A. Caporossi, D. Hall, P. Reignier, and J.L. Crowley. Robust visual tracking from dynamic control of processing. In International Workshop on Performance Evaluation of Tracking and Surveillance, pages 23-31, Prague, Czech Republic, May 2004.
[3] G. Csurka, C. Bray, C. Dance, and L. Fan. Visual categorization with bags of keypoints. In European Conference on Computer Vision, Prague, Czech Republic, May 2004.
[4] R.B. Fisher. The PETS04 surveillance ground-truth data sets. In International Workshop on Performance Evaluation of Tracking and Surveillance, Prague, Czech Republic, May 2004.
[5] B. Georis, F. Bremond, M. Thonnat, and B. Macq. Use of an evaluation and diagnosis method to improve tracking performances. In International Conference on Visualization, Imaging and Image Processing, September 2003.
[6] D.E. Goldberg. Genetic Algorithms in Search and Optimization. Addison-Wesley, 1989.
[7] T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In International Conference on Computer Vision, Corfu, Greece, September 1999.
[8] C. Schmid. Constructing models for content-based image retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 39-45, Kauai, USA, December 2001.
Performance evaluation of object detection
algorithms for video surveillance
Jacinto Nascimento⋆, Member, IEEE, and Jorge Marques
[email protected] [email protected]
IST/ISR, Torre Norte, Av. Rovisco Pais, 1049-001, Lisboa Portugal
EDICS: 4-SEGM

Abstract
In this paper we propose novel methods to evaluate the performance of object detection algorithms in video
sequences. This procedure allows us to highlight characteristics (e.g., region splitting or merging) which are specific
to the method being used. The proposed framework compares the output of the algorithm with the ground truth
and measures the differences according to objective metrics. In this way it is possible to perform a fair comparison
among different methods, evaluating their strengths and weaknesses and allowing the user to make a reliable choice of the best method for a specific application. We apply this methodology to recently proposed segmentation algorithms and describe their performance. These methods were evaluated in order to assess how well they can detect
moving regions in an outdoor scene in fixed-camera situations.
Index Terms
Surveillance Systems, Performance Evaluation, Metrics, Ground Truth, Segmentation, Multiple Interpretations.
I. I NT ROD UC TI ON
VIDEO surveillance systems rely on the ability to detect moving objects in the video stream which is a relevant
information extraction step in a wide range of computer vision applications. Each image is segmented byautomatic image analysis techniques. This should be done in a reliable and effective way in order to cope with
unconstrained environments, non stationary background and different object motion patterns. Furthermore, different
types of objects are manually considered e.g., persons, vehicles or groups of people.
Many algorithms have been proposed for object detection in video surveillance applications. They rely on different
assumptions e.g., statistical models of the background [1]–[3], minimization of Gaussian differences [4], minimum
and maximum values [5], adaptivity [6,7] or a combination of frame differences and statistical background models
[8]. However, little information is available on the performance of these algorithms under different operating conditions.
Three approaches have recently been considered to characterize the performance of video segmentation algorithms:
pixel-based methods, template-based methods and object-based methods. Pixel-based methods assume that we wish
to detect all the active pixels in a given image. Object detection is therefore formulated as a set of independent
pixel detection problems. This is a classic binary detection problem, provided that we know the ground truth (ideal
segmented image). The algorithms can therefore be evaluated by standard measures used in communication theory,
e.g., misdetection rate, false alarm rate and receiver operating characteristic (ROC) [9].
This work was supported by FCT under the project LTT and by EU project CAVIAR (IST-2001-37540).
Corresponding Author: Jacinto Nascimento, (email:[email protected]), Complete Address: Instituto Superior Tecnico-Instituto
de Sistemas e Robotica (IST/ISR), Av. Rovisco Pais, Torre Norte, 6o piso, 1049-001, Lisboa, PORTUGAL Phone: +351-21-8418270, Fax:
+351-21-8418291
Several proposals have been made to improve the computation of the ROC in video segmentation problems e.g.,
using a perturbation detection rate analysis [10] or an equilibrium analysis [11]. The usefulness of pixel-based
methods for surveillance applications is questionable since we are not interested in the detection of point targets
but object regions instead. The computation of the ROC can also be performed using rectangular regions selected
by the user, with and without moving objects [12]. This improves the evaluation strategy since the statistics are
based on templates instead of isolated pixels.
A third class of methods is based on an object evaluation. Most of the works aim to characterize color, shape and
path fidelity by proposing figures of merit for each of these issues [13]–[15] or area based performance evaluation
as in [16]. This approach is instrumental to measure the performance of image segmentation methods for video
coding and synthesis but it is not usually used in surveillance applications.
These approaches have three major drawbacks. First, object detection is not a classic binary detection problem:
several types of errors should be considered (not just misdetections and false alarms). For example, what should we
do if a moving object is split into several active regions, or if two objects are merged into a single region? Second,
some methods are based on the selection of isolated pixels or rectangular regions with and without persons. This
is an unrealistic assumption, since practical algorithms have to segment the image into background and foreground
and do not have to classify rectangular regions selected by the user. Third, it is not possible to define a unique
ground truth. Many images admit several valid segmentations. If the image analysis algorithm produces a valid
segmentation, its output should be considered correct.
In this paper we propose objective metrics to evaluate the performance of object detection methods, by comparing the output of the video detector with a ground truth obtained by manual editing. Several types of errors are considered: splits of foreground regions; merges of foreground regions; simultaneous splits and merges of foreground regions; false alarms; and detection failures. False alarms occur when false objects are detected. Detection failures are caused by missing regions which have not been detected.
In this paper five segmentation algorithms are considered as examples and evaluated. We also consider multiple interpretations in ambiguous situations, e.g., when it is not clear whether two objects overlap and should be considered as a group, or whether they are separate.
The first algorithm is denoted as the basic background subtraction (BBS) algorithm. It computes the absolute difference between the current image and a static background image and compares each pixel to a threshold. All the connected components are computed, and they are considered as active regions if their area exceeds a given threshold. This is perhaps the simplest object detection algorithm one can imagine. The second method is the
detection algorithm used in the W4 system [17]. Three features are used to characterize each pixel of the background image: minimum intensity, maximum intensity, and maximum absolute difference in consecutive frames. The third method assumes that each pixel of the background is a realization of a random variable with a Gaussian distribution (SGM - Single Gaussian Model) [1]. The mean and covariance of the Gaussian distribution are independently estimated for each pixel. The fourth algorithm represents the distribution of the background pixels with a mixture of Gaussians [2]. Some modes correspond to the background and some are associated with active regions (MGM - Multiple Gaussian Model). The last method is the one proposed in [18], denoted as the Lehigh Omnidirectional
Tracking System (LOTS). It is tailored to detect small non-cooperative targets such as snipers. Some of these algorithms are described in a special issue of the IEEE Transactions on PAMI (August 2001), which presents state-of-the-art methods for automatic surveillance systems.
In this work we provide segmentation results of these algorithms on the PETS2001 sequences, using the proposed
framework. The main features of the proposed method are the following. Given the correct segmentation of the video sequence, we detect several types of errors: i) splits of foreground regions, ii) merges of foreground regions, iii) simultaneous splits and merges of foreground regions, iv) false alarms (detection of false objects), and v) detection failures (missing active regions). We then compute statistics for each type of error.
The structure of the paper is as follows. Section 2 briefly reviews previous work. Section 3 describes the
segmentation algorithms used in this paper. Section 4 describes the proposed framework. Experimental tests are
discussed in Section 5 and Section 6 presents the conclusions.
II. RELATED WORK
Surveillance and monitoring systems often require online segmentation of all moving objects in a video
sequence. Segmentation is a key step since it influences the performance of the other modules, e.g., object tracking,
classification or recognition. For instance, if object classification is required, an accurate detection is needed to
obtain a correct classification of the object.
Background subtraction is a simple approach to detect moving objects in video sequences. The basic idea is
to subtract the current frame from a background image and to classify each pixel as foreground or background
by comparing the difference with a threshold [19]. Morphological operations followed by a connected component
analysis are used to compute all active regions in the image. In practice, several difficulties arise: the background
image is corrupted by noise due to camera movements and fluttering objects (e.g., trees waving), illumination
changes, clouds, shadows. To deal with these difficulties several methods have been proposed (see [20]).
Some works use a deterministic background model, e.g., by characterizing the admissible interval for each pixel of the background image, as well as the maximum rate of change in consecutive images or the median of the largest inter-frame absolute differences [5,17]. Most works, however, rely on statistical models of the background, assuming that each pixel is a random variable with a probability distribution estimated from the video stream. For example, the Pfinder system ("Person Finder") uses a Gaussian model to describe each pixel of the background image [1]. A more general approach consists of using a mixture of Gaussians to represent each pixel. This allows the representation of multi-modal distributions, which occur in natural scenes (e.g., in the case of fluttering trees) [2].
Another set of algorithms is based on spatio-temporal segmentation of the video signal. These methods try to
detect moving regions taking into account not only the temporal evolution of the pixel intensities and color but also
their spatial properties. Segmentation is performed in a 3D region of image-time space, considering the temporal evolution of neighboring pixels. This can be done in several ways, e.g., by using spatio-temporal entropy combined with morphological operations [21]. This approach leads to an improvement of the system's performance compared with traditional frame-difference methods. Other approaches are based on the 3D structure tensor defined from the pixels' spatial and temporal derivatives in a given time interval [22]. In this case, detection is based on the
Mahalanobis distance, assuming a Gaussian distribution for the derivatives. This approach has been implemented
in real time and tested with the PETS 2005 data set. Other alternatives have also been considered, e.g., the use of a region growing method in 3D space-time [23].
A significant research effort has been made to cope with shadows and with nonstationary backgrounds. Two types of changes have to be considered: slow changes (e.g., due to the sun's motion) and rapid changes (e.g., due to clouds, rain or abrupt changes in static objects). Adaptive models and thresholds have been used to deal with slow background changes [18]. These techniques recursively update the background parameters and thresholds in order to track the evolution of the parameters in nonstationary operating conditions. To cope with abrupt changes, multiple-model techniques have been proposed [18], as well as predictive stochastic models (e.g., AR, ARMA [24,25]).
Another difficulty is the presence of ghosts [26], i.e., false active regions due to static objects belonging to the background image (e.g., cars) which suddenly start to move. This problem has been addressed by combining background subtraction with frame differencing, or by high-level operations [27],[28].
III. SEGMENTATION ALGORITHMS
This section describes the object detection algorithms used in this work: BBS, W4, SGM, MGM and LOTS. The BBS, SGM and MGM algorithms use color, while W4 and LOTS use grayscale images. In the BBS algorithm, the moving objects are detected by computing the difference between the current frame and the background image. A thresholding operation is performed, classifying a pixel as foreground if

$$|I_t(x, y) - \mu_t(x, y)| > T, \qquad (1)$$

where $I_t(x, y)$ is a $3 \times 1$ vector with the intensity of the pixel in the current frame, $\mu_t(x, y)$ is the mean intensity (background) of the pixel, and $T$ is a constant threshold.
Ideally, pixels associated with the same object should have the same label. This can be accomplished by performing a connected component analysis (e.g., using an 8-connectivity criterion). This step is usually performed after morphological filtering (dilation and erosion) to eliminate isolated pixels and small regions.
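As a concrete illustration, the two BBS stages (per-pixel thresholding as in Eq. (1), followed by connected component analysis with an area filter) can be sketched as follows. The function and parameter names are ours, and a pure-Python flood fill stands in for the morphological filtering and fast labelling a real system would use:

```python
import numpy as np

def bbs_detect(frame, background, t_pixel, min_area):
    """Basic background subtraction sketch: threshold the per-pixel
    color difference, then keep 8-connected components with at least
    min_area pixels. Names and thresholds are illustrative."""
    # One common reading of Eq. (1): a pixel is foreground if any color
    # channel differs from the background mean by more than T.
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff.max(axis=2) > t_pixel

    # 8-connectivity labelling via flood fill.
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=int)
    regions, next_label = [], 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] == 0:
                next_label += 1
                labels[i, j] = next_label
                stack, pixels = [(i, j)], []
                while stack:
                    y, x = stack.pop()
                    pixels.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < h and 0 <= nx < w and \
                               mask[ny, nx] and labels[ny, nx] == 0:
                                labels[ny, nx] = next_label
                                stack.append((ny, nx))
                # keep only components with at least min_area pixels
                if len(pixels) >= min_area:
                    regions.append(pixels)
    return regions
```

Small isolated blobs are discarded by the area test, playing the role of the erosion step.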
The second algorithm is denoted here as W4, since it is used in the W4 system to compute moving objects [17]. This algorithm is designed for grayscale images. The background model is built using a training sequence without persons or vehicles. Three values are estimated for each pixel using the training sequence: minimum intensity (Min), maximum intensity (Max), and the maximum intensity difference between consecutive frames (D). Foreground objects are computed in four steps: i) thresholding, ii) noise cleaning by erosion, iii) fast binary component analysis, and iv) elimination of small regions.

We have modified the thresholding step of this algorithm, since it often leads to a significant level of misclassifications. We classify a pixel $I_t(x, y)$ as a foreground pixel iff

$$\big( I_t(x, y) < Min(x, y) \ \lor\ I_t(x, y) > Max(x, y) \big) \ \land\ |I_t(x, y) - I_{t-1}(x, y)| > D(x, y). \qquad (2)$$
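In code, the modified rule of Eq. (2) amounts to a few array operations. A sketch with an assumed array-based signature, where Min, Max and D are the per-pixel training statistics described above:

```python
import numpy as np

def w4_foreground(curr, prev, min_i, max_i, d):
    """Modified W4 thresholding, Eq. (2): a pixel is foreground iff it
    falls outside the [Min, Max] envelope AND its inter-frame change
    exceeds the per-pixel maximum difference D. All arguments are
    grayscale arrays of the same shape (illustrative sketch)."""
    outside = (curr < min_i) | (curr > max_i)
    moving = np.abs(curr.astype(int) - prev.astype(int)) > d
    return outside & moving
```

The conjunction with the inter-frame term suppresses pixels that merely lie outside the training envelope without actually changing between frames.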
Figs. 1 and 2 show an example comparing both approaches. Fig. 1 shows the original image with two active regions. Figs. 2(a),(b) display the output of the thresholding step performed as in [17] and using (2), respectively.
Fig. 1. Two regions (in bounding boxes) of an image.
(a) (b)
Fig. 2. Thresholding results: (a) using the approach as in [17] and (b) using (2).
The third algorithm considered in this study is the SGM (Single Gaussian Model) algorithm. In this method, the information is collected in a vector $[Y, U, V]^T$, which defines the intensity and color of each pixel. We assume that the scene changes slowly. The mean $\mu_t(x, y)$ and covariance $\Sigma_t(x, y)$ of each pixel can be recursively updated as follows:

$$\mu_t(x, y) = (1 - \alpha)\mu_{t-1}(x, y) + \alpha I_t(x, y), \qquad (3)$$

$$\Sigma_t(x, y) = (1 - \alpha)\Sigma_{t-1}(x, y) + \alpha (I_t(x, y) - \mu_t(x, y))(I_t(x, y) - \mu_t(x, y))^T, \qquad (4)$$

where $I_t(x, y)$ is the pixel of the current frame in YUV color space and $\alpha$ is a constant.

After updating the background, the SGM performs a binary classification of the pixels into foreground or background and tries to cluster foreground pixels into blobs. Pixels in the current frame are compared with the background by measuring the log-likelihood in color space. Thus, individual pixels are assigned either to the background region or to a foreground region:

$$l(x, y) = -\frac{1}{2}(I_t(x, y) - \mu_t(x, y))^T \Sigma_t^{-1} (I_t(x, y) - \mu_t(x, y)) - \frac{1}{2}\ln |\Sigma_t| - \frac{m}{2}\ln(2\pi), \qquad (5)$$

where $I_t(x, y)$ is the $(Y, U, V)^T$ vector defined for each pixel in the current image and $\mu_t(x, y)$ is the corresponding pixel vector in the background image $B$.
If a small likelihood is computed using (5), the pixel is classified as active. Otherwise, it is classified as
background.
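Taken together, the update recursions (3)-(4) and the log-likelihood test (5) can be sketched per pixel as follows (function names are ours; a real implementation vectorizes this over the whole image):

```python
import numpy as np

def sgm_update(mu, sigma, pixel, alpha):
    """Recursive update of the per-pixel Gaussian, Eqs. (3)-(4).
    Note Eq. (4) uses the already-updated mean."""
    mu_new = (1 - alpha) * mu + alpha * pixel
    d = (pixel - mu_new).reshape(-1, 1)
    sigma_new = (1 - alpha) * sigma + alpha * (d @ d.T)
    return mu_new, sigma_new

def sgm_loglik(pixel, mu, sigma):
    """Log-likelihood of a YUV pixel under the background model, Eq. (5).
    A pixel with a small value is classified as active."""
    m = len(pixel)
    d = (pixel - mu).reshape(-1, 1)
    return float(-0.5 * d.T @ np.linalg.inv(sigma) @ d
                 - 0.5 * np.log(np.linalg.det(sigma))
                 - 0.5 * m * np.log(2 * np.pi))
```

Classification then thresholds `sgm_loglik` per pixel, with low likelihood meaning foreground.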
The fourth algorithm (MGM) models each pixel $I(x) = I(x, y)$ as a mixture of $N$ ($N = 3$) Gaussian distributions, i.e.,

$$p(I(x)) = \sum_{k=1}^{N} \omega_k \, \mathcal{N}(I(x), \mu_k(x), \Sigma_k(x)), \qquad (6)$$

where $\mathcal{N}(I(x), \mu_k(x), \Sigma_k(x))$ is a multivariate normal distribution and $\omega_k$ is the weight of the $k$th normal,

$$\mathcal{N}(I(x), \mu_k(x), \Sigma_k(x)) = c \, \exp\left( -\frac{1}{2} (I(x) - \mu_k(x))^T \Sigma_k^{-1}(x) (I(x) - \mu_k(x)) \right), \qquad (7)$$

with $c = \frac{1}{(2\pi)^{n/2}|\Sigma_k|^{1/2}}$. Note that each pixel $I(x)$ is a $3 \times 1$ vector with three color components (red, green and blue), i.e., $I(x) = [I(x)_R\ I(x)_G\ I(x)_B]^T$. To avoid an excessive computational cost, the covariance matrix is assumed to be diagonal [2].

The mixture model is dynamically updated. Each pixel is updated as follows. i) The algorithm checks whether each incoming pixel value can be ascribed to a given mode of the mixture; this is the match operation. ii) If the pixel value lies inside the confidence interval of a mode (within 2.5 standard deviations), a match event is verified. The parameters of the corresponding (matched) distributions for that pixel are updated according to

$$\mu_k^t(x) = (1 - \lambda_k^t)\mu_k^{t-1}(x) + \lambda_k^t I_t(x), \qquad (8)$$

$$\Sigma_k^t(x) = (1 - \lambda_k^t)\Sigma_k^{t-1}(x) + \lambda_k^t (I_t(x) - \mu_k^t(x))(I_t(x) - \mu_k^t(x))^T, \qquad (9)$$

where

$$\lambda_k^t = \alpha \, \mathcal{N}(I_t(x), \mu_k^{t-1}(x), \Sigma_k^{t-1}(x)). \qquad (10)$$

The weights are updated by

$$\omega_k^t = (1 - \alpha)\omega_k^{t-1} + \alpha M_k^t, \quad \text{with} \quad M_k^t = \begin{cases} 1 & \text{matched models} \\ 0 & \text{remaining models,} \end{cases} \qquad (11)$$

where $\alpha$ is the learning rate. The non-matched components of the mixture are not modified. If none of the existing components match the pixel value, the least probable distribution is replaced by a normal distribution with mean equal to the current value, a large covariance and a small weight. iii) The next step is to order the distributions in descending order of $\omega/\sigma$. This criterion favours distributions which have more weight (most supporting evidence) and less variance (less uncertainty). iv) Finally, the algorithm models each pixel as the sum of the corresponding updated distributions. The first $B$ Gaussian modes are used to represent the background, while the remaining modes are considered as foreground distributions. $B$ is chosen as the smallest integer such that

$$\sum_{k=1}^{B} \omega_k > T, \qquad (12)$$

where $T$ is a threshold that accounts for the quantity of data that should belong to the background.
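A per-pixel sketch of steps i)-iv) is given below. For simplicity the match weight $\lambda_k^t$ of Eq. (10) is replaced by the constant learning rate $\alpha$, and only the first matched mode is updated; both are common simplifications, not the paper's exact scheme. Class and parameter names are ours:

```python
import numpy as np

class MGMPixel:
    """Per-pixel mixture of N Gaussians with diagonal covariance,
    loosely following Eqs. (6)-(12). Illustrative sketch only."""
    def __init__(self, n=3, alpha=0.01, init_var=900.0):
        self.alpha = alpha
        self.mu = np.zeros((n, 3))
        self.var = np.full((n, 3), init_var)   # diagonal covariances
        self.w = np.full(n, 1.0 / n)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # match: pixel within 2.5 standard deviations of a mode
        dist = np.abs(x - self.mu) / np.sqrt(self.var)
        matched = np.all(dist < 2.5, axis=1)
        if matched.any():
            k = int(np.argmax(matched))        # first matched mode
            rho = self.alpha                   # stand-in for Eq. (10)
            self.mu[k] = (1 - rho) * self.mu[k] + rho * x
            self.var[k] = (1 - rho) * self.var[k] + rho * (x - self.mu[k]) ** 2
        else:
            # replace the least probable mode by a wide, low-weight Gaussian
            k = int(np.argmin(self.w))
            self.mu[k], self.var[k], self.w[k] = x, np.full(3, 900.0), 0.05
        # Eq. (11): reinforce the matched mode, decay the others
        m = np.zeros_like(self.w)
        m[k] = 1.0
        self.w = (1 - self.alpha) * self.w + self.alpha * m
        self.w /= self.w.sum()

    def background_modes(self, t):
        """Eq. (12): smallest set of highest-ranked (by w/sigma) modes
        whose cumulative weight exceeds T."""
        order = np.argsort(-self.w / np.sqrt(self.var.mean(axis=1)))
        cum = np.cumsum(self.w[order])
        b = int(np.searchsorted(cum, t)) + 1
        return order[:b]
```

Feeding a stable pixel value repeatedly drives one mode onto that value with a dominant weight, so it ends up among the background modes.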
The fifth algorithm [18] is tailored for the detection of non-cooperative targets (e.g., snipers) under non-stationary environments. The algorithm uses two gray-level background images, $B_1$ and $B_2$. This allows the algorithm to cope with intensity variations due to noise or fluttering objects moving in the scene. The background images are initialized using a set of $T$ consecutive frames without active objects:

$$B_1(x, y) = \min\{I_t(x, y),\ t = 1, \ldots, T\}, \qquad (13)$$

$$B_2(x, y) = \max\{I_t(x, y),\ t = 1, \ldots, T\}, \qquad (14)$$

where $t \in \{1, 2, \ldots, T\}$ denotes the time instant.

In this method, targets are detected by using two thresholds ($T_L$, $T_H$) followed by a quasi-connected components (QCC) analysis. These thresholds are initialized using the difference between the background images:

$$T_L(x, y) = |B_1(x, y) - B_2(x, y)| + c_U, \qquad (15)$$

$$T_H(x, y) = T_L(x, y) + c_S, \qquad (16)$$

where $c_U, c_S \in [0, 255]$ are constants specified by the user.

We compute the difference between each pixel and the closest background image. If the difference exceeds the low threshold $T_L$, i.e.,

$$\min_i |I_t(x, y) - B_i^t(x, y)| > T_L(x, y), \qquad (17)$$

the pixel is considered active. A target is a set of connected active pixels such that a subset of them verifies

$$\min_i |I_t(x, y) - B_i^t(x, y)| > T_H(x, y), \qquad (18)$$

where $T_H(x, y)$ is the high threshold. The low and high thresholds $T_L^t(x, y)$, $T_H^t(x, y)$, as well as the background images $B_i^t(x, y)$, $i = 1, 2$, are recursively updated in a fully automatic way (see [18] for details).
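The dual-threshold rule of Eqs. (17)-(18) behaves like hysteresis thresholding: weak (above-$T_L$) pixels survive only when connected to at least one strong (above-$T_H$) seed. A sketch with plain 4-connectivity instead of the paper's quasi-connected components, and with our own function names:

```python
import numpy as np

def lots_detect(frame, b1, b2, t_low, t_high):
    """Dual-threshold detection against two background images,
    Eqs. (17)-(18): keep connected sets of above-T_L pixels that
    contain at least one above-T_H pixel. Illustrative sketch."""
    diff = np.minimum(np.abs(frame - b1), np.abs(frame - b2))
    low, high = diff > t_low, diff > t_high
    # grow the strong seeds through the weak mask (dilation loop)
    target = high.copy()
    changed = True
    while changed:
        grown = target.copy()
        grown[1:, :] |= target[:-1, :]
        grown[:-1, :] |= target[1:, :]
        grown[:, 1:] |= target[:, :-1]
        grown[:, :-1] |= target[:, 1:]
        grown &= low
        changed = not np.array_equal(grown, target)
        target = grown
    return target
```

Weak pixels with no strong neighbor anywhere in their component are discarded, which suppresses isolated noise while preserving full target silhouettes.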
IV. PROPOSED FRAMEWORK
In order to evaluate the performance of object detection algorithms we propose a framework which is based on
the following principles:
• A set of sequences is selected for testing, and all the moving objects are detected using an automatic procedure and manually corrected if necessary to obtain the ground truth. This is performed at one frame per second.
• The output of the automatic detector is compared with the ground truth.
• Each detected region is classified into one of the following classes: correct detection, detection failure, split, merge, split/merge or false alarm.
• A set of statistics (mean, standard deviation) is computed for each type of error.
To perform the first step, we built a user-friendly interface which allows the user to define the foreground regions in the test sequence in a semi-automatic way. Fig. 3 shows the interface used to generate the ground truth. A set of frames is extracted from the test sequence (one per second). An automatic object detection algorithm is then used to provide a tentative segmentation of the test images. Finally, the automatic segmentation is corrected by the
user, by merging, splitting, removing or creating active regions. Typically, the boundary of the object is detected with a two-pixel accuracy. Multiple segmentations of the video data are generated every time there is an ambiguous situation, i.e., two close regions which are almost overlapping. This problem is discussed in Section IV-D.

In the case depicted in Fig. 3 there are four active regions: a car, a lorry and two groups of persons. The segmentation algorithm also detects regions due to lighting changes, leading to a number of false alarms (four). The user can easily edit the image by adding, removing or checking active regions, thus providing a correct segmentation. In Fig. 3 we can see an example where the user progressively removes the regions which do not belong to the objects of interest. The final segmentation is shown in the bottom images.
Fig. 3. User interface used to create the ground truth from the automatic segmentation of the video images.
The test images are used to evaluate the performance of the object detection algorithms. In order to compare the output of an algorithm with the ground truth segmentation, a region matching procedure is adopted, which establishes a correspondence between the detected objects and the ground truth. Several cases are considered:
1) Correct Detection (CD) or 1-1 match: the detected region matches one and only one ground truth region.
2) False Alarm (FA): the detected region has no correspondence.
3) Detection Failure (DF): the ground truth region has no correspondence.
4) Merge Region (M): the detected region is associated with several ground truth regions.
5) Split Region (S): the ground truth region is associated with several detected regions.
6) Split-Merge Region (SM): the conditions in 4) and 5) are simultaneously satisfied.
A. Region Matching
Object matching is performed by computing a binary correspondence matrix $C^t$ which defines the correspondence between the active regions in a pair of images. Let us assume that we have $N$ ground truth regions $R_i$ and $M$ detected regions $R_j$. Under these conditions, $C^t$ is an $N \times M$ matrix, defined as follows:
$$C^t(i, j) = \begin{cases} 1 & \text{if } \dfrac{\sharp(R_i \cap R_j)}{\sharp(R_i \cup R_j)} > T \\[4pt] 0 & \text{otherwise,} \end{cases} \qquad \forall i \in \{1, \ldots, N\},\ j \in \{1, \ldots, M\}, \qquad (19)$$

where $T$ is the threshold which accounts for the overlap requirement. It is also useful to add the number of ones in each line or column, defining two auxiliary vectors:

$$L(i) = \sum_{j=1}^{M} C(i, j), \quad i \in \{1, \ldots, N\}, \qquad (20)$$

$$C(j) = \sum_{i=1}^{N} C(i, j), \quad j \in \{1, \ldots, M\}. \qquad (21)$$
When we associate ground truth regions with detected regions, six cases can occur: zero-to-one, one-to-zero, one-to-one, many-to-one, one-to-many and many-to-many associations. These correspond to false alarm, misdetection, correct detection, merge, split and split-merge, respectively.
Detected regions $R_j$ are classified according to the following rules:

$$\begin{aligned} \text{CD} &: \ \exists i : L(i) = C(j) = 1 \ \land\ C(i, j) = 1 \\ \text{M} &: \ \exists i : C(j) > 1 \ \land\ C(i, j) = 1 \\ \text{S} &: \ \exists i : L(i) > 1 \ \land\ C(i, j) = 1 \\ \text{SM} &: \ \exists i : L(i) > 1 \ \land\ C(j) > 1 \ \land\ C(i, j) = 1 \\ \text{FA} &: \ C(j) = 0 \end{aligned} \qquad (22)$$

A detection failure (DF) associated with the ground truth region $R_i$ occurs if $L(i) = 0$. The last two situations (FA, DF) in (22) occur whenever there are empty columns or lines in the matrix $C$.
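The matrix of Eq. (19), the auxiliary vectors (20)-(21) and the rules of (22) translate almost directly into code. A sketch with assumed names, where regions are represented as sets of pixel coordinates and each detected region is classified by the first ground truth region it matches:

```python
def match_regions(gt_regions, det_regions, t=0.2):
    """Build the correspondence matrix of Eq. (19) and classify each
    detected region per Eq. (22). Returns index lists per class."""
    n, m = len(gt_regions), len(det_regions)
    c = [[0] * m for _ in range(n)]
    for i, r in enumerate(gt_regions):
        for j, s in enumerate(det_regions):
            # Eq. (19): normalized overlap #(Ri ∩ Rj) / #(Ri ∪ Rj)
            c[i][j] = 1 if len(r & s) / len(r | s) > t else 0
    line = [sum(c[i][j] for j in range(m)) for i in range(n)]  # L(i)
    col = [sum(c[i][j] for i in range(n)) for j in range(m)]   # C(j)

    out = {'CD': [], 'M': [], 'S': [], 'SM': [], 'FA': [], 'DF': []}
    for j in range(m):
        if col[j] == 0:                       # empty column: false alarm
            out['FA'].append(j)
            continue
        for i in range(n):
            if c[i][j] == 1:
                if line[i] == 1 and col[j] == 1:
                    out['CD'].append(j)
                elif line[i] > 1 and col[j] > 1:
                    out['SM'].append(j)
                elif col[j] > 1:
                    out['M'].append(j)
                else:
                    out['S'].append(j)
                break
    out['DF'] = [i for i in range(n) if line[i] == 0]  # empty lines
    return out
```

For example, a ground truth region covered by two detected halves yields two split classifications and no detection failure.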
Fig. 4 illustrates the six situations considered in this analysis with synthetic examples. Two images are shown for each case, corresponding to the ground truth (left) and the detected regions (right), together with the corresponding matrix C. For each case, the left image contains the regions defined by the user (ground truth) and the right image contains the regions detected by the segmentation algorithm. Each region is represented by a white area containing a visual label. Fig. 4 (a) shows an ideal situation, in which each ground truth region matches one and only one detected region (correct detection). In Fig. 4 (b) the "square" region has no correspondence among the detected regions, so it corresponds to a detection failure. In Fig. 4 (c) the algorithm detects regions which have no correspondence in the ground truth image, indicating a false alarm. Fig. 4 (d) shows a merge of two regions, since two different regions ("square" and "dot" regions in the ground truth) correspond to the same "square" region in the detected image. The remaining examples in this figure are self-explanatory, illustrating the split (e) and split-merge (f) situations.
B. Region Overlap
The region-based measures described herein depend on an overlap requirement T (see (19)) between the region of the ground truth and the detected region. Without this requirement, a single-pixel overlap would be enough to establish a match between a detected region and a region in the ground truth segmentation, which does not make sense.
[Fig. 4: each case shows the ground truth (left), the detector output (right), and the corresponding matrix:

(a) C = [1 0 0; 0 1 0; 0 0 1]    (b) C = [0 0; 1 0; 0 1]

(c) C = [1 0 0 0; 0 1 0 0; 0 0 1 0]    (d) C = [1 0; 1 0; 0 1]

(e) C = [1 1 0; 0 0 1]    (f) C = [1 1 0; 0 0 1; 0 1 0]]

Fig. 4. Different matching cases: (a) Correct detection; (b) Detection Failure; (c) False alarm; (d) Merge; (e) Split; (f) Split-Merge.
[Fig. 5: each case shows the ground truth (left), the detector output (right), and the corresponding matrix:

(a) C = [1 0 0; 0 0 0]    (b) C = [1 0 0; 0 1 0]

(c) C = [1 0 0; 0 0 0]    (d) C = [1 0 0; 0 1 1]]

Fig. 5. Matching cases with an overlap requirement of T = 20%: detection failure (overlap < T) (a); correct detection (overlap > T) (b); two detection failures (overlap < T) (c); and split (overlap > T) (d).
D. Multiple Interpretations
Sometimes the segmentation procedure is subjective, since each active region may contain several objects and
it is not always easy to determine if it is a single connected region or several disjoint regions. For instance, Fig.
6 (a) shows an input image and a manual segmentation. Three active regions were considered: person, lorry and
group of people. Fig. 6 (b) shows the segmentation results provided by the SGM algorithm. This algorithm splits
the group into three individuals which can also be considered as a valid solution since there is very little overlap.
This segmentation should be considered as an alternative ground truth. All these situations should not penalize the
performance of the algorithm. On the contrary, situations such as the ones depicted in Fig. 7 should be considered
as errors. Fig. 7 (a) shows the ground truth and in Fig. 7 (b) the segmentation provided by the W 4 algorithm. In
this situation the algorithm makes a wrong split of the vehicle.
(a) (b)
Fig. 6. Correct split example: (a) supervised segmentation, (b) SGM segmentation.
(a) (b)
Fig. 7. Wrong split example: (a) supervised segmentation, (b) W4 segmentation.
Since we do not know how the algorithm behaves in terms of merging or splitting, every possible combination of elements belonging to a group must be taken into account. For instance, another ambiguous situation is depicted in Fig. 8, which shows the segmentation results of the SGM method. Here we see that the same algorithm provides different segmentations (both of which can be considered correct) of the same group at different
instants. This suggests the use of multiple interpretations for the segmentation. To accomplish this, the evaluation setup takes into account all possible merges of single regions belonging to the same group whenever multiple interpretations should be considered, i.e., when there is a small overlap among the group members. The number of merges depends on the relative positions of the single regions. Fig. 9 shows two examples of different merged-region groups with three objects A, B, C (each one representing a person in the group). In the first example (Fig. 9 (a)) four interpretations are considered: all the objects are separated, all are merged into a single active region, or AB (respectively BC) are linked and the remaining object is isolated. In the second example an additional interpretation is added, since A can be linked with C.
Instead of asking the user to identify all the possible merges in an ambiguous situation, an algorithm is used to generate all the valid interpretations in two steps. First, we assign all the possible label sequences to the group regions. If the same label is assigned to two different regions, these regions are considered as merged. Equation (23)(a) shows the labelling matrix M for the example of Fig. 9 (a). Each row corresponds to a different labelling assignment, and the element M_ij denotes the label of the jth region in the ith labelling configuration. The second step checks whether the merged regions are close to each other and whether there is another region in the middle. The invalid labelling configurations are removed from the matrix M. The output of this step for the example of Fig. 9 (a) is shown in equation (23)(b): the labelling sequence 121 is discarded, since region 2 lies between regions 1 and 3 and therefore regions 1 and 3 cannot be merged. In the case of Fig. 9 (b) all the configurations are possible (M = M_FINAL). A detailed description of the labelling method is included in Appendix VII-A.
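For the collinear case of Fig. 9 (a), the two steps can be sketched as follows. This is a 1-D version with assumed names: the validity test simply forbids a merge across a differently-labelled region in the left-to-right order, which stands in for the geometric adjacency check described above:

```python
def labelings(n):
    """Step 1: all restricted-growth label sequences for n ordered
    regions (every possible way of merging them)."""
    seqs = [[1]]
    for _ in range(n - 1):
        seqs = [s + [k] for s in seqs for k in range(1, max(s) + 2)]
    return seqs

def valid(seq):
    """Step 2 (1-D sketch): a labelling is invalid when two regions
    with the same label are separated by a differently-labelled
    region, since the region in between would block the merge."""
    for a in range(len(seq)):
        for b in range(a + 2, len(seq)):
            if seq[a] == seq[b] and any(seq[c] != seq[a]
                                        for c in range(a + 1, b)):
                return False
    return True

interpretations = [s for s in labelings(3) if valid(s)]
```

For three collinear regions this reproduces the matrices M and M_FINAL of Eq. (23), with the sequence 121 filtered out.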
Figs. 10 and 11 illustrate the generation of the valid interpretations. Fig. 10 (a) shows the input frame, Fig. 10 (b) shows the hand-segmented image, where the user specifies all the objects (the three objects in the group of persons must be provided separately), and Fig. 10 (c) illustrates the output of the SGM. Fig. 11 shows all possible merges of the individual regions. All of them are considered correct. It remains to decide which segmentation should be selected to appraise the performance. In this paper we choose the best segmentation, i.e., the one that provides the highest number of correct detections. In the present example the segmentation illustrated in Fig. 11 (g) is selected. In this way we overcome the segmentation ambiguities that may appear, without penalizing the algorithm. This is the most complex situation which occurs in the video sequences used in this paper.
Fig. 8. Two different segmentations provided by the SGM method on the same group, taken at different time instants.
(a) (b)
Fig. 9. Region linking procedure with three objects A, B, C (from left to right). The same number of foreground regions may have different interpretations: three possible configurations (a), or four configurations (b). Each color represents a different region.
$$M = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 1 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix} \ \text{(a)} \qquad M_{FINAL} = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 2 \\ 1 & 2 & 2 \\ 1 & 2 & 3 \end{bmatrix} \ \text{(b)} \qquad (23)$$
(a) (b) (c)
Fig. 10. Input frame (a), segmented image by the user (b), output of SGM (c).
V. TESTS ON PETS2001 DATASET
This section presents the evaluation of several object detection algorithms using the PETS2001 dataset. The training and test sequences of PETS2001 were used for this study. The training sequence has 3064 frames and the test sequence has 2688 frames. In both sequences, the first 100 images were used to build the background model for each algorithm. The resolution is half-resolution PAL standard (288 × 384 pixels, 25 frames per second). The algorithms were evaluated using one frame per second. The ground truth was generated by an automatic segmentation of the video signal followed by a manual correction using the graphical editor described in Section IV. The outputs of the algorithms were then compared with the ground truth. Most algorithms require the specification of the smallest area of an object; an area of 25 pixels was chosen, since it allows the detection of all objects of interest in the sequences.
(a) (b) (c) (d)
(e) (f) (g) (h)
Fig. 11. Multiple interpretations given by the application. The segmentation illustrated in (g) is selected for the current frame.
A. Choice of the Model Parameters
The segmentation algorithms described herein depend on a set of parameters, mainly the thresholds and the learning rate α. In this scenario, we must determine the best values of the most significant parameters for each algorithm. This was done using ROC curves, which display the performance of each algorithm as a function of the parameters. The receiver operating characteristic (ROC) has been extensively used in communications [9]. It is assumed that all the parameters are constant but one. In this case we kept the learning rate α constant and varied the threshold in an attempt to obtain the best threshold value T. We repeated this procedure for several values of α. This requires a considerable number of tests, but in this way it is possible to achieve a proper configuration of the algorithm parameters. These tests were made on a training sequence of the PETS2001 data set. Once the parameters are set, we use these values on a different sequence.
The ROC curves describe the evolution of the false alarms (FA) and detection failures (DF) as T varies. An ideal curve would be close to the origin, with an area under the curve close to zero. To obtain these two values, we compute these measures (for each value of T) by applying the region matching through the sequence. The final values are computed as the mean values of FA and DF.
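Each ROC point is therefore just the per-sequence mean of the two error counts for one threshold value. A sketch with an assumed data layout (per-frame (FA, DF) count pairs grouped by threshold):

```python
def roc_points(per_frame_errors):
    """Mean FA and DF over a sequence for each threshold value,
    yielding one ROC point per T. per_frame_errors maps each T to a
    list of (fa_count, df_count) pairs, one per evaluated frame."""
    points = {}
    for t, frames in per_frame_errors.items():
        fa = sum(f for f, _ in frames) / len(frames)
        df = sum(d for _, d in frames) / len(frames)
        points[t] = (fa, df)
    return points
```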
Fig. 12 shows the receiver operating curves (ROC) for all the algorithms. It is observed that the performance of the BBS algorithm is independent of α. We can also see that this algorithm is sensitive with respect to the threshold, since there is a large variation of FA and DF for small changes of T; this can be viewed as a lack of smoothness of the ROC curve (T = 0.2 is the best value). There is a large number of false alarms in the training sequence due to the presence of a static object (a car) which suddenly starts to move. The background image should be modified when the car starts to move. However, the image analysis algorithms are not able to cope with this situation, since they only consider slow adaptations of the background. A ghost region is therefore detected in the place where the car was (a false alarm).
The second row of the Fig. 12 shows the ROC curves of the SGM method, for three values of α (0.01, 0.05, 0.15).
This method is more robust than the BBS algorithm with respect to the threshold. We see that for −400 < T <
−150, and α = 0.01, α = 0.05 we get similar FA rates and a small variation of DF. We chose α = 0.05, T = −400.
The third row show the results of the M GM method. The best performances are obtained for α < 0.05 (first
and second column). The best value of the α parameter is α = 0.008. In fact, we observe the best performances
for α ≤ 0.01. We notice that the algorithm strongly depends on the value of T , since for small variations of T
there are significant changes of FA and DF. The ROC curve suggest that it is acceptable to choose T > 0.9.
The fourth row shows the results of the LOTS algorithm for a sensitivity varying from 10% to 110%.
As discussed in [29], we use a small α parameter. To reduce the computational burden, LOTS does not update
the background image in every single frame; instead, the background update takes place once every N frames.
For instance, an effective integration factor α = 0.0003 is achieved by adding approximately 1/13 of the
current frame to the background every 256th frame, or 1/6.5 every 512th frame.
Note that the update rule is Bt = Bt−1 + αDt, with Dt = It − Bt. In our case we have used intervals of 1024 (Fig. 12 (j)),
256 (Fig. 12 (k)) and 128 frames (Fig. 12 (l)), the best results being achieved in the first case. The latter two cases,
Fig. 12 (k) and (l), are shifted to the right with respect to (j), meaning that they produce a larger number of false alarms.
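As a quick sanity check on this schedule (not code from the paper), adding a fraction f of the difference image every N-th frame corresponds to an effective per-frame integration factor of roughly f/N:

```python
def effective_alpha(fraction, period):
    # B_t = B_{t-1} + fraction * (I_t - B_t), applied once every
    # `period` frames, behaves on average like a per-frame update
    # with alpha = fraction / period.
    return fraction / period

# Adding ~1/13 of the current frame every 256th frame, or ~1/6.5
# every 512th frame, both give an effective alpha of about 0.0003.
print(effective_alpha(1 / 13.0, 256))
print(effective_alpha(1 / 6.5, 512))
```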
From this study we conclude that the best ROC curves are those associated with LOTS and SGM, since
they have the smallest area under the curve.
[Fig. 12, panels (a)–(l): ROC curves plotting Detection Failures against False Alarms, sampled at several thresholds T (BBS, SGM, MGM) or sensitivities S (LOTS).]
Fig. 12. Receiver Operating Characteristic for different values of α: BBS (first row: (a) α = 0.05, (b) α = 0.1, (c) α = 0.15), SGM
(second row: (d) α = 0.01, (e) α = 0.05, (f) α = 0.15), MGM (third row: (g) α = 0.008, (h) α = 0.01, (i) α = 0.05), LOTS (fourth row,
with background update every: (j) 1024th frame, (k) 256th frame, (l) 128th frame).
B. Performance Evaluation
Table I (a),(b) shows the results obtained in the test sequence using the parameters selected in the previous
study. The percentages of correct detections, detection failures, splits, merges and split-merges were obtained by
normalizing the number of each type of event by the total number of moving objects in the image; their sum is
100%. The percentage of false alarms is defined by normalizing the number of false alarms by the total number of
detected objects, and is therefore a number in the range 0–100%.
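The two normalizations can be summarized in a small helper (a sketch, not the authors' code; the event counts are assumed to come from the region-matching step):

```python
def detection_statistics(correct, failures, splits, merges, split_merges,
                         false_alarms, n_detected):
    # CD, DF, splits, merges and split/merges are normalized by the total
    # number of moving objects, so these five percentages sum to 100%.
    n_objects = correct + failures + splits + merges + split_merges
    pct = lambda n: 100.0 * n / n_objects
    return {
        "CD": pct(correct), "DF": pct(failures), "Splits": pct(splits),
        "Merges": pct(merges), "Split/Merges": pct(split_merges),
        # false alarms are normalized by the number of *detected* objects
        "FA": 100.0 * false_alarms / n_detected,
    }
```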
Each algorithm is characterized in terms of correct detections, detection failures, splits, merges, split/merges,
false alarms, and matching area.
Two types of ground truth were used, corresponding to different interpretations of static objects. If a moving
object stops and remains still, it is considered an active region in the first case (Table I (a)); in the second case
(Table I (b)), it is integrated in the background after one minute. For example, a car that stops in front of the camera
remains an active region in the first case, but is ignored after one minute in the second.
Let us consider the first case. The results are shown in Table I (a). In terms of correct detections, the best results are achieved by the LOTS (91.2%) algorithm followed by SGM (86.8%).
Concerning detection failures, LOTS (8.5%) followed by W4 (9.6%) outperform all the others. The
worst results are obtained by MGM (13.1%). This is somewhat surprising, since the MGM method, based on
multiple Gaussians per pixel, performs worse than the SGM method based on a single Gaussian. We
discuss this issue below. W4 has the highest percentage of splits, and the BBS and MGM methods tend to split
the regions as well. The performance of the methods in terms of region merging is excellent: very few merges are
observed in the segmented data. However, some methods tend to produce split/merge errors (e.g., W4, SGM and
BBS); the LOTS and MGM algorithms have the best scores in terms of split/merge errors.
Let us now consider the false alarms (false positives). LOTS (0.6%) is the best, and MGM and BBS
are the worst; the LOTS, W4 and SGM methods are much better than the others in terms of false alarms.
LOTS has the best tradeoff between CD and FA. Although W4 produces many splits, these can often be
overcome in tracking applications, since the region matching algorithms are able to track the active regions even when
they are split. The LOTS algorithm has the best performance if all the errors are equally important.
In terms of matching area, LOTS exhibits the best value in both situations.
In this study, the performance of the MGM method, based on mixtures of Gaussians, is unexpectedly low.
During the experiments we observed the following: i) when an object undergoes a slow motion and stops, the
algorithm ceases to detect the object after a short period of time; ii) when an object enters the scene, it is not
well detected during the first few frames, since the Gaussian modes have to adapt to this case.
This explains the percentage of splits in both tables. In fact, when a moving object stops, the MGM
starts to split the region until it disappears, becoming part of the background. Objects entering the scene will
cause some detection failures (during the first frames) and splits, while the MGM method separates the
foreground region from the background.
Comparing the results in Table I (a) and (b), we can see that the performance of the MGM is improved. The
detection failures are reduced, meaning that the stopped car is correctly integrated in the background. This produces
an increase of correct detections by the same amount. However, we stress that the percentage of false alarms also
increases. This means that the removal of the false positives is not stable: some frames still contain, as small
active regions, the object which stops in the scene. For the other methods, an increase of the false alarm
percentage is expected, since these algorithms retain false positives throughout the sequence.
The computational complexity of all methods was studied to judge the performance of the five algorithms. Details
about the number of operations in each method are provided in Appendix VII-B.
%                    BBS    W4     SGM    MGM    LOTS
Correct Detections   84.3   81.6   86.8   85.0   91.2
Detection Failures   12.2   9.6    11.5   13.1   8.5
Splits               2.9    5.4    0.2    1.9    0.3
Merges               0      1.0    0      0      0
Split/Merges         0.6    1.8    1.5    0      0
False Alarms         22.5   8.5    11.3   24.3   0.6
Matching Area        64.7   50.4   61.9   61.3   78.8
(a)

%                    BBS    W4     SGM    MGM    LOTS
Correct Detections   83.5   84.0   86.4   85.4   91.0
Detection Failures   12.4   8.5    11.7   12.0   8.8
Splits               3.3    4.3    0.2    2.6    0.3
Merges               0      0.8    0      0      0
Split/Merges         0.8    1.8    1.7    0      0
False Alarms         27.0   15.2   17.0   28.2   7.2
Matching Area        61.3   53.6   61.8   65.6   78.1
(b)

TABLE I
PERFORMANCE OF FIVE OBJECT DETECTION ALGORITHMS.
VI. CONCLUSIONS
This paper proposes a framework for the evaluation of object detection algorithms in surveillance applications.
The proposed method is based on the comparison of the detector output with a ground truth segmented sequence
sampled at 1 frame per second. The difference between both segmentations is evaluated, and the segmentation
errors are classified into detection failures, false alarms, splits, merges and split/merges. To cope with ambiguous
situations, in which we do not know whether two or more objects belong to a single active region or to several regions, we
consider multiple interpretations of the ambiguous frames. These interpretations are controlled by the user through
a graphical interface.
The proposed method provides a statistical characterization of the object detection algorithm by measuring the
percentage of each type of error. The user can thus select the best algorithm for a specific application, taking into
account the influence of each type of error on the performance of the overall system. For example, in object tracking,
detection failures are worse than splits; we should therefore select a method with fewer detection failures, even if it
has more splits than another method.
Five algorithms were considered in this paper to illustrate the proposed evaluation method:
Basic Background Subtraction (BBS), W4, Single Gaussian Model (SGM), Multiple Gaussian Model (MGM), and the
Lehigh Omnidirectional Tracking System (LOTS). The best results were achieved by the LOTS and SGM
algorithms.
Acknowledgement: We are very grateful to the three anonymous reviewers for their useful comments and
suggestions. We also thank R. Oliveira and P. Ribeiro for kindly providing the code of the LOTS detector.
VII. APPENDIX
A. Merge Regions Algorithm
The pseudo code of the region labelling algorithm is given in Algorithms 1 and 2.
Algorithm 1 describes the first step, i.e., the generation of the label configurations. When the same
label is assigned to two different regions, these regions are considered as merged. Algorithm 2
describes the second step, which checks and eliminates label sequences containing invalid
merges: every time the same label is assigned to a pair of regions, we define a strip connecting the mass centers of
the two regions and check whether the strip is intersected by any other region. If so, the labelling sequence is considered
invalid.
In these algorithms, N denotes the number of objects, label is a labelling sequence, M is the matrix of all label
configurations, and MFINAL is a matrix which contains the information (the final label configurations) needed to create
the merges.
Algorithm 1 Main
1: N ← Num;
2: M(1) ← 1;
3: for t = 2 to N do
4:   AUX ← [ ];
5:   for i = 1 to size(M, 1) do
6:     label ← max(M(i, :)) + 1;
7:     AUX ← [AUX; [repmat(M(i, :), label, 1) (1 : label)T]];
8:   end for
9:   M ← AUX;
10: end for
11: MFINAL ← FinalConfiguration(M);
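Algorithm 1 can be transcribed into Python as follows (a sketch; the paper uses MATLAB-style pseudo code). Each generated row assigns a label to each of the N regions, and two regions sharing a label are a merge candidate:

```python
def label_configurations(n):
    """All label sequences for n regions, as produced by Algorithm 1."""
    M = [[1]]
    for _ in range(2, n + 1):
        # each existing row may reuse any label seen so far, or open a new one
        M = [row + [lab] for row in M for lab in range(1, max(row) + 2)]
    return M
```

For four regions this yields the 15 rows of the matrix M in Eq. (24)(a).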
Fig. 13. Generation of the label sequences for the example in the Fig. 14.
To illustrate the purpose of Algorithms 1 and 2, consider the example in Fig. 14, where
each rectangle in the image represents an active region.
Algorithm 1 computes the leaves of the graph shown in Fig. 13, i.e., all label sequences.
Algorithm 2 MFINAL = FinalConfiguration(M)
1: MFINAL ← [ ];
2: for i = 1 to length(M) do
3:   Compute the centroids of the objects to be linked in M(i, :);
4:   Link the centroids with strip lines;
5:   if the strip lines do not intersect another object region then
6:     MFINAL ← [MFINALT M(i, :)T]T;
7:   end if
8: end for
Fig. 14. Four rectangles A,B,C,D representing active regions in the image.
Algorithm 2 checks each sequence taking into account the relative positions of the objects in the image. For
example, configurations 1212 and 1213 are considered invalid, since object A cannot be merged with C (see Fig. 14).
Equations (24)(a) and (b) show the output of the first and second steps, respectively. The valid labelling sequences
(the contents of the matrix MFINAL) produce the resulting images shown in Fig. 15.
M =
1 1 1 1
1 1 1 2
1 1 2 1
1 1 2 2
1 1 2 3
1 2 1 1
1 2 1 2
1 2 1 3
1 2 2 1
1 2 2 2
1 2 2 3
1 2 3 1
1 2 3 2
1 2 3 3
1 2 3 4
(a)

MFINAL =
1 1 1 1
1 1 1 2
1 1 2 2
1 1 2 3
1 2 2 2
1 2 2 3
1 2 3 3
1 2 3 4
(b)

(24)
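For the collinear layout A-B-C-D of Fig. 14, the strip-intersection test of Algorithm 2 reduces to a contiguity check: a label sequence is valid only if equal labels occupy consecutive regions (1212 is invalid because the strip from A to C crosses B). A sketch under that assumption, with the Algorithm 1 generator repeated to keep it self-contained:

```python
def label_configurations(n):
    # Algorithm 1: all label sequences for n regions
    M = [[1]]
    for _ in range(2, n + 1):
        M = [row + [lab] for row in M for lab in range(1, max(row) + 2)]
    return M

def is_valid(row):
    # equal labels must form one consecutive run of regions
    for lab in set(row):
        idx = [i for i, l in enumerate(row) if l == lab]
        if idx[-1] - idx[0] + 1 != len(idx):
            return False
    return True

valid = [row for row in label_configurations(4) if is_valid(row)]
```

This keeps exactly the 8 rows of MFINAL in Eq. (24)(b).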
B. Computational Complexity
Computational complexity was also studied to judge the performance of the five algorithms. Next, we provide
comparative data on computational complexity using the “Big-O” analysis.
Let us define the following variables:
7/31/2019 De Printat Articole
http://slidepdf.com/reader/full/de-printat-articole 57/101
24
Fig. 15. Valid merges generated from the example in the Fig. 14.
• N, number of images in the sequence,
• L, C, number of lines and columns of the image,
• R, number of regions detected in the image,
• Ng, number of Gaussians.
The BBS, W4, SGM, MGM and LOTS methods share several common operations, namely: i) morphological
operations for noise cleaning, ii) computation of the areas of the regions, and iii) label assignment.
The complexity of these three operations is

K = (2 × (ℓ × c) − 1) × (L × C)   (morphological op.)
    + (L × C) + R                 (region areas op.)
    + R × (L × C)                 (labelling op.)     (25)
where ℓ, c are the kernel dimensions (ℓ × c = 9; 8-connectivity is used), L, C are the image dimensions and R is
the number of detected regions. The first term, 2 × (ℓ × c) − 1, is the number of products and summations required
for the convolution at each pixel of the image. The second term, (L × C) + R, is the number of differences taken
to compute the areas of the regions in the image. Finally, the term R × (L × C) is the number of operations needed to
label all the regions in the image.
BBS Algorithm
The complexity of the BBS is

O([11 × (L × C)   (threshold op.)
   + K] × N)     (26)

where 11 × (L × C) is the number of operations required to perform the thresholding step (see (1)), which involves
3 × (L × C) differences and 8 × (L × C) logical operations.
W4 Algorithm
The complexity of this method is

O([2 × [2p³ + (L × C) × (p + (p − 1))]   (rgb2gray op.)
   + 9 × (L × C)                         (threshold op.)
   + K + KW4] × N)     (27)
7/31/2019 De Printat Articole
http://slidepdf.com/reader/full/de-printat-articole 58/101
25
where the first term is related to the conversion of the images to grayscale, with p = 3 (RGB space). The second
term concerns the threshold operation (see (2)), which requires 9 × (L × C) operations (8 logical operations and 1
difference). The term KW4 corresponds to the background subtraction and morphological operations
inside the bounding boxes of the foreground regions:

KW4 = R × 9 × (Lr × Cr)                  (threshold op.)
      + (2 × (ℓ × c) − 1) × (Lr × Cr)    (morphological op.)
      + (L × C) + R                      (region areas op.)
      + R × (L × C)                      (labelling op.)     (28)

where Lr, Cr are the dimensions of the bounding boxes, assuming that the bounding boxes of the active regions
all have the same length and width.
SGM Algorithm
The complexity of the SGM method is

O([p × [2p × (L × C)]   (rgb2yuv op.)
   + 28 × (L × C)       (likelihood op.)
   + (L × C)            (threshold op.)
   + K] × N)     (29)
The first term is related to the conversion of the images to the YUV color space (in (29), p = 3). The second term
is the number of operations required to compute the likelihood measure (see (5)). The third term is related to the
threshold operation, which classifies a pixel as foreground if the likelihood is greater than a threshold, and as
background otherwise.
MGM Algorithm
The number of operations of the MGM method is

O([Ng × (136 × (L × C))          (mixture modelling)
   + 2 × (2Ng − 1) × (L × C)     (norm. and mixture op.)
   + K] × N)     (30)
The first term depends on the number of Gaussians Ng and is related to the following operations: i) matching
operation, 70 × (L × C); ii) weight update, 3 × (L × C) (see (11)); iii) background update, 3 × 8 × (L × C)
(see (8)); iv) covariance update for all color components, 3 × 13 × (L × C) (see (9)). The second term accounts
for: i) weight normalization, (2Ng − 1) × (L × C), and ii) computation of the Gaussian mixture for all pixels,
(2Ng − 1) × (L × C).
LOTS Algorithm
The complexity of the LOTS method is

O([[2p³ + (L × C) × (p + (p − 1))]             (rgb2gray op.)
   + 11 × (L × C) + (2 × (Lb × Cb) − 1) × nb
   + (2 × (ℓ × c) − 1) × (Lrsize × Crsize)
   + (Lrsize × Crsize)                         (QCC op.)
   + K] × N)     (31)
The first term is related to the conversion of the images and is similar to the first term in (27). The second
term is related to the QCC algorithm: 11 × (L × C) operations are needed to compute (17), (18).
Method   Simplified expression              Total operations
BBS      1 + 30 × (L × C)                   3.3 × 10^6
LOTS     55 + (35 + 145/64) × (L × C)       4.1 × 10^6
W4       760 + 40 × (L × C)                 4.4 × 10^6
SGM      1 + 66 × (L × C)                   7.2 × 10^6
MGM      1 + 437 × (L × C)                  48.3 × 10^6

TABLE II
THE SECOND COLUMN GIVES THE SIMPLIFIED EXPRESSION FOR EQUATIONS (26, 27, 29, 30, 31). THE THIRD COLUMN GIVES THE
TOTAL NUMBER OF OPERATIONS.
The QCC analysis is computed on low-resolution images PH, PL. This is accomplished by converting each block
of Lb × Cb pixels (in the high-resolution images) into one element of the new matrices (PH, PL). Each element of
PH, PL contains the active pixels of the corresponding block in the respective image. This task requires
(2 × (Lb × Cb) − 1) × nb operations (second term of QCC in (31)), where (Lb × Cb) is the size of each block and
nb is the number of blocks in the image. A morphological operation (4-connectivity is used) over PH is performed,
taking (2 × (ℓ × c) − 1) × (Lrsize × Crsize) operations, where (Lrsize × Crsize) is the dimension of the resized
images. The target candidates are obtained by comparing PH and PL; this task takes (Lrsize × Crsize) operations
(fourth term in QCC).
For example, the complexity of the five algorithms is shown in Table II, assuming the following conditions for
each frame:
• the kernel dimensions, ℓ × c = 9,
• the block dimensions, Lb × Cb = 8 × 8, i.e., (Lrsize × Crsize) = (L × C)/64 (for the LOTS method),
• the number of Gaussians, Ng = 3 (for the MGM method),
• a single region is detected, with an area of 25 pixels (R = 1, Lr × Cr = 25),
• the image dimension is (L × C) = 288 × 384.
From the table, we conclude that four of the algorithms (BBS, LOTS, W4, SGM) have a similar computational
complexity, whilst MGM is more complex, requiring a higher computational cost.
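The totals in Table II follow directly from the simplified expressions with (L × C) = 288 × 384; a quick numeric check:

```python
LC = 288 * 384  # image dimension L x C

ops = {  # simplified per-frame operation counts from Table II
    "BBS":  1 + 30 * LC,
    "LOTS": 55 + (35 + 145 / 64) * LC,
    "W4":   760 + 40 * LC,
    "SGM":  1 + 66 * LC,
    "MGM":  1 + 437 * LC,
}
for name, total in ops.items():
    print(name, total)
```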
R EFERENCES
[1] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern
Anal. Machine Intell., vol. 19, no. 7, pp. 780–785, July 1997.
[2] C. Stauffer, W. Eric, and L. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Machine
Intell., vol. 22, no. 8, pp. 747–757, August 2000.
[3] S. J. McKenna and S. Gong, “Tracking colour objects using adaptive mixture models,” Image Vision Computing , vol. 17, pp. 225–231,
1999.
[4] N. Ohta, “A statistical approach to background suppression for surveillance systems,” in Proceedings of IEEE Int. Conference on Computer Vision, 2001, pp. 481–486.
[5] I. Haritaoglu, D. Harwood, and L. S. Davis, “W 4: Who? when? where? what? a real time system for detecting and tracking people,”
in IEEE International Conference on Automatic Face and Gesture Recognition, April 1998, pp. 222–227.
[6] M. Seki, H. Fujiwara, and K. Sumi, “A robust background subtraction method for changing background,” in Proceedings of IEEE
Workshop on Applications of Computer Vision, 2000, pp. 207–213.
[7] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russel, “Towards robust automatic traffic scene analysis in
real-time,” in Proceedings of Int. Conference on Pattern Recognition, 1994, pp. 126–131.
[8] R. Collins, A. Lipton, and T. Kanade, “A system for video surveillance and monitoring,” in Proc. American Nuclear Society (ANS)
Eighth Int. Topical Meeting on Robotic and Remote Systems, Pittsburgh, PA, April 1999, pp. 25–29.
[9] H. V. Trees, Detection, Estimation, and Modulation Theory. John Wiley and Sons, 2001.
[10] T. H. Chalidabhongse, K. Kim, D. Harwood, and L. Davis, “A perturbation method for evaluating background subtraction algorithms,”
in Proc. Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS
2003), Nice, France, October 2003.
[11] X. Gao, T.E.Boult, F. Coetzee, and V. Ramesh, “Error analysis of background adaption,” in IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2000, pp. 503–510.
[12] F. Oberti, A. Teschioni, and C. S. Regazzoni, “Roc curves for performance evaluation of video sequences processing systems for
surveillance applications,” in IEEE Int. Conf. on Image Processing , vol. 2, 1999, pp. 949–953.
[13] J. Black, T. Ellis, and P. Rosin, “A novel method for video tracking performance evaluation,” in Joint IEEE Int. Workshop on Visual
Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), Nice, France, 2003, pp. 125–132.
[14] P. Correia and F. Pereira, “Objective evaluation of relative segmentation quality,” in Int. Conference on Image Processing , 2000, pp.
308–311.
[15] C. E. Erdem, B. Sankur, and A. M.Tekalp, “Performance measures for video object segmentation and tracking,” IEEE Trans. Image
Processing , vol. 13, no. 7, pp. 937–951, 2004.
[16] V. Y. Mariano, J. Min, J.-H. Park, R. Kasturi, D. Mihalcik, H. Li, D. Doermann, and T. Drayer, “Performance evaluation of object
detection algorithms,” in Proceedings of 16th Int. Conf. on Pattern Recognition (ICPR02), vol. 3, 2002, pp. 965–969.
[17] I. Haritaoglu, D. Harwood, and L. S. Davis, “W 4: real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal.
Machine Intell., vol. 22, no. 8, pp. 809–830, August 2000.
[18] T. Boult, R. Micheals, X. Gao, and M. Eckmann, “Into the woods: Visual surveillance of non-cooperative camouflaged targets in
complex outdoor settings,” in Proceedings of the IEEE, October 2001, pp. 1382–1402.
[19] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Prentice Hall, 2002.
[20] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, “Detecting moving objects, ghosts and shadows in video streams,” IEEE Trans.
Pattern Anal. Machine Intell., vol. 25, no. 10, pp. 1337–1342, 2003.
[21] Y.-F. Ma and H.-J. Zhang, “Detecting motion object by spatio-temporal entropy,” in IEEE Int. Conf. on Multimedia and Expo, Tokyo,
Japan, August 2001.
[22] R. Souvenir, J. Wright, and R. Pless, “Spatio-temporal detection and isolation: Results on the PETS2005 datasets,” in Proceedings of
the IEEE Workshop on Performance Evaluation in Tracking and Surveillance, 2005.
[23] H. Sun, T. Feng, and T. Tan, “Spatio-temporal segmentation for video surveillance,” in IEEE Int. Conf. on Pattern Recognition, vol. 1,
Barcelona, Spain, September, pp. 843–846.
[24] A. Monnet, A. Mittal, N. Paragios, and V. Ramesh, “Background modeling and subtraction of dynamic scenes,” in Proceedings of the
ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 1305–1312.
[25] J. Zhong and S. Sclaroff, “Segmenting foreground objects from a dynamic, textured background via a robust Kalman filter,” in
Proceedings of the ninth IEEE Int. Conf. on Computer Vision, 2003, pp. 44–50.
[26] N. T. Siebel and S. J. Maybank, “Real-time tracking of pedestrians and vehicles,” in Proc. of IEEE Workshop on Performance Evaluation
of Tracking and Surveillance, 2001.
[27] R. Cucchiara, C. Grana, and A. Prati, “Detecting moving objects and their shadows: an evaluation with the PETS2002 dataset,” in
Proceedings of Third IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS 2002) in conj. with
ECCV 2002, Pittsburgh, PA, May 2002, pp. 18–25.
[28] Collins, Lipton, Kanade, Fujiyoshi, Duggins, Tsin, Tolliver, Enomoto, and Hasegawa, “A system for video surveillance and monitoring:
Vsam final report,” Robotics Institute, Carnegie Mellon University, Tech. Rep. Technical report CMU-RI-TR-00-12, May 2000.
[29] T. Boult, R. Micheals, X. Gao, W. Y. P. Lewis, C. Power, and A. Erkan, “Frame-rate omnidirectional surveillance and tracking of
camouflaged and occluded targets,” in Second IEEE International Workshop on Visual Surveillance, 1999, pp. 48–55.
Segmentation and Classification of Human Activities∗
J.C. Nascimento1   M. A. T. Figueiredo2   J. S. Marques3
[email protected]   [email protected]   [email protected]
1,3Instituto de Sistemas e Robotica   2Instituto de Telecomunicacoes
Instituto Superior Tecnico, 1049-001 Lisboa
PORTUGAL
Abstract
This paper describes an algorithm for segmenting and classifying human activities from video sequences
of a shopping center. These activities comprise entering or exiting a shop, passing, or browsing in front
of shop windows. The proposed approach recognizes these activities by using a priori knowledge of the
layout of the observed shopping area. Human actions are represented by a bank of switched dynamical models,
each tailored to describe a specific motion regime. Experimental tests illustrate the effectiveness of the
proposed approach with synthetic and real data.
Keywords: Surveillance, Segmentation, Classification, Human Activities, Minimum Description Length.
1 Introduction
The analysis of human activities is an important computer vision research topic with applications in surveillance, e.g.
in developing automated security applications. In this paper, we focus on recognizing human activities in a shopping
center.
In commercial spaces, it is common to have many surveillance cameras. The monitoring room is usually equipped
with a large set of monitors which a human operator uses to watch over the areas observed by the cameras.
This requires a considerable effort from the operator, who has to somehow multiplex his/her attention. In recent
years a considerable effort has been devoted to developing automatic surveillance systems that provide information about
the activities taking place in a given space. With such a system, it would be possible to monitor the actions of individuals,
determining their nature and discerning common activities from inappropriate behavior (for example, standing for a long
period of time at the entrance of a shop, or fighting).
In this paper, we aim at labelling common activities taking place in the shopping space.1 Activities are recognized
from motion patterns associated with each person tracked by the system. Motion is described by a sequence of
displacements of the 2D centroid (mean position) of each person’s blob. The trajectory is modelled by multiple
dynamical models with a switching mechanism. Since the trajectory is described by its appearance, we compute the
statistics needed to identify the dynamical models involved in a trajectory.
The rest of the paper is organized as follows. Section 2 deals with related work. Section 3 describes the statistical
activity model. Section 4 derives the segmentation algorithm. Section 5 reports experimental results with synthetic
data and real video sequences. Section 6 concludes the paper.
2 Related Work
The analysis of human activities has been extensively addressed in several ways using different types of features and
inference methods. Typically, a set of motion features is extracted from the video signal and an inference model is
used to classify it into one of c possible classes.
For example, in [16] the human body is approximated by a set of segments, and atomic activities are then defined as
vectors of temporal measurements which capture the evolution of the five body parts. In other works the human body
is simply represented by the mass center of its active region (blob) in the image plane [12], or by the body blob as in [4].
The activity is then represented by the trajectory obtained from the blob center, or from the correspondence of body
blob regions, respectively.
Other works try to characterize human activity directly from the video signal, without segmenting the active
regions. In [2] human activities are characterized by temporal templates, which try to convey information
about “where” and “how” motion is performed. Two templates are created: a binary motion-energy image, which
represents where the motion has occurred in the whole sequence, and a scalar motion-history image, which represents
∗This work was partially supported by FCT under project CAVIAR (IST-2001-37540).
1This work is integrated in project CAVIAR, which has the general goal of representing and recognizing contexts and situations. An introduction
and the main goals of the project can be found in http://homepages.inf.ed.ac.uk/rbf/CAVIAR/caviar.htm
HAREM 2005 - International Workshop on Human Activity Recognition and Modelling,
Oxford, UK, September 2005
Figure 2: Examples of three different activities (entering, exiting, passing).
From xt we can obtain ∆xit, where ∆xit contains the displacements of xt known to have been generated by the ith
model. Defining ∆Xi = {∆xi1, ∆xi2, ..., ∆xiN} as the vector containing all the displacements of the ith model in the
training set, we have, for the ith model:

µi = (1/♯∆Xi) ∑ ∆Xit,    Qi = (1/♯∆Xi) ∑ (∆Xi − µi)(∆Xi − µi)T,    (2)

where µi and Qi are standard estimates of the mean and the covariance matrix, respectively.
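Equation (2) is just the sample mean and the (biased) sample covariance of the displacement vectors assigned to one model; with numpy:

```python
import numpy as np

def model_parameters(dX):
    """dX: (n, d) array of the n displacement vectors generated by one
    motion model.  Returns the estimates mu_i, Q_i of Eq. (2)."""
    mu = dX.mean(axis=0)
    centered = dX - mu
    Q = centered.T @ centered / len(dX)  # normalized by #dX, as in Eq. (2)
    return mu, Q
```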
4.2 Segmentation and Classification
Having defined the set of models and the corresponding parameters, one can now classify a test trajectory xt. One
way to attain this goal is to compute the likelihood of xt under the models. In this paper, the activity depends on
the number of model switchings. In Fig. 2, we see that “passing” can be described using just one model, while the
activities “entering” and “exiting” can be described using two dynamical models. The fourth activity considered,
“browsing”, requires three models: the person walks, stops to look at the shop window, and restarts walking.
This behavior was observed in all the other samples of the activities occurring in this context. This means that
we have to estimate the time instants at which the model switchings happen.
Assuming that the sequence xt has n samples and is described by T segments (with T known), the log-likelihood is

L(m1, ..., mT, t1, ..., tT−1) = log p(∆x1, ..., ∆xn | m1, m2, ..., mT, t1, t2, ..., tT−1)    (3)

where m1, ..., mT is the sequence of model labels describing the trajectory and ti, for i = 1, ..., T−1, is the time instant
when switching from model mi to mi+1 occurs. If T = 1, there is no switching.
Due to the conditional independence assumption underlying (1), the log-likelihood can be written as

L(∆x1, ..., ∆xn | m1, ..., mT, t1, ..., tT−1) = ∑_{j=1..T} ∑_{i=t_{j−1}..t_j} log p(∆xi | mj)
                                             = ∑_{j=1..T} ∑_{i=t_{j−1}..t_j} log N(∆xi | µ_{m_j}, Q_{m_j})    (4)
where we define t_0 = 1, T is the number of segments, and t_j are the switch times. Assuming that T is known, we can “segment” the sequence (i.e., estimate m_1, …, m_T and t_1, …, t_{T−1}) using the maximum-likelihood approach:

m_1, …, m_T, t_1, …, t_{T−1} = argmax L(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1})   (5)
This maximization can be performed in a nested way,

t_1, …, t_{T−1} = argmax_{t_1,…,t_{T−1}} [ max_{m_1,…,m_T} L(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1}) ]   (6)
In fact, the inner maximization can be decoupled as

max_{m_1,…,m_T} L(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1}) = ∑_{j=1}^{T} max_{m_j} ∑_{i=t_{j−1}}^{t_j} log p(∆x_i | m_j)   (7)
where the maximization with respect to each m_j is a simple maximum-likelihood classification of the sub-sequence of samples (∆x_{t_{j−1}}, …, ∆x_{t_j}) into one of a set of Gaussian classes. Finally, the maximization with respect to t_1, …, t_{T−1} is done by exhaustive search (this is never too expensive, since we consider a maximum of three segments).
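The nested maximization of Eqs. (6)-(7) can be sketched directly: precompute the per-sample log-likelihood under each Gaussian model, then exhaustively search the switch times, while the inner maximization over model labels decouples per segment. A minimal NumPy sketch, with hypothetical helper names (`gauss_loglik`, `segment`) and models given as (µ, Q) pairs:

```python
import itertools
import numpy as np

def gauss_loglik(dx, mu, Q):
    """Log-density of N(dx | mu, Q) for a single displacement vector."""
    d = dx - mu
    _, logdet = np.linalg.slogdet(Q)
    return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                   + d @ np.linalg.solve(Q, d))

def segment(displacements, models, T):
    """Exhaustive search over switch times (Eq. (6)); the inner
    maximisation over model labels decouples per segment (Eq. (7)).
    `models` is a list of (mu, Q) pairs. Returns (labels, switches, L)."""
    n = len(displacements)
    # Per-sample log-likelihood under each model, computed once.
    ll = np.array([[gauss_loglik(dx, mu, Q) for (mu, Q) in models]
                   for dx in displacements])
    best = (None, None, -np.inf)
    for switches in itertools.combinations(range(1, n), T - 1):
        bounds = [0, *switches, n]
        labels, total = [], 0.0
        for a, b in zip(bounds[:-1], bounds[1:]):
            seg = ll[a:b].sum(axis=0)         # sum over samples, per model
            labels.append(int(seg.argmax()))  # ML classifier per segment
            total += seg.max()
        if total > best[2]:
            best = (labels, list(switches), total)
    return best
```

With T at most three, the number of candidate switch-time combinations stays small, which is why the exhaustive search in the paper remains cheap.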
4.3 Estimating the number of models of the activity
4.3.1 MDL Criterion
In the previous section, we derived the segmentation criterion assuming that the number of segments T is known. As is well known, the same criterion cannot be used to select T, as it would always return the largest possible number of segments. We are thus in the presence of a model selection problem, which we address using the minimum description length (MDL) criterion [14]. The MDL criterion for selecting T is
T = argmin_T { − log p(∆x_1, …, ∆x_n | m_1, …, m_T, t_1, …, t_{T−1}) + M(m_1, …, m_T, t_1, …, t_{T−1}) }   (8)
where M(m_1, …, m_T, t_1, …, t_{T−1}) is the number of bits required to encode the selected model indices and the estimated switching times. Notice that we do not have the usual (1/2) log n term, because the real-valued model parameters (means and covariances) are assumed fixed (previously estimated). Finally, it is easy to conclude that
M(m_1, …, m_T, t_1, …, t_{T−1}) ≈ T log c + (T−1) log n   (9)

where T log c is the code length for the model indices m_1, …, m_T, since each belongs to {1, …, c}, and (T−1) log n is the code length for t_1, …, t_{T−1}, because each belongs to {1, …, n}; we have ignored the fact that two switchings cannot occur at the same time, because T ≪ n.
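Given the best negative log-likelihood found for each candidate T (e.g. by the exhaustive segmentation above), the MDL criterion (8)-(9) amounts to adding the code length T log c + (T − 1) log n and picking the minimum. A minimal sketch; the function name and input format are our own:

```python
import numpy as np

def mdl_select(neg_loglik_by_T, c, n):
    """Pick the number of segments T via the MDL criterion (8)-(9).
    `neg_loglik_by_T[T]` is the best negative log-likelihood achieved
    with T segments; c is the number of models, n the number of samples."""
    best_T, best_score = None, np.inf
    for T, nll in neg_loglik_by_T.items():
        # Penalised score: -log p(...) + T log c + (T-1) log n
        score = nll + T * np.log(c) + (T - 1) * np.log(n)
        if score < best_score:
            best_T, best_score = T, score
    return best_T
```

The penalty grows linearly in T, so a larger T is only selected when the extra segment buys a sufficiently large likelihood improvement.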
5 Experimental results
This section presents results with synthetic and real data. In the synthetic case, we performed Monte Carlo tests. We considered five models (c = 5), shown in Fig. 3. The synthetic models shown in Fig. 3(a) were obtained by simulating four activities of a person, using the generation model in (1). Fig. 4 shows examples of activities (the trajectory shape of “Leaving” is the same as “Entering”, but with the opposite direction). Here, the thin (green) rectangles correspond to areas where the trajectory begins. The first sample of x_t in these areas is random, because the agent may appear at random places in the scene. The wide (yellow) rectangle is the area in which a model switching occurs. In this figure the trajectories are generated with two segments (“Entering”, “Leaving”, “Passing”) and with three segments (“Browsing”).

For each activity we generate 100 test samples using (1) and classify each of them into one of the four classes. Fig. 5 shows the displacements ∆x_t (black dots) of the test sequences (“Entering” and “Passing”) overlapped with the five models. We can see that the displacements lie on the right-up clusters (“Entering”) and the right cluster (“Passing”). In this experiment, all the test sequences were correctly classified (100% accuracy).
Figure 3: Five models considered to describe the trajectories. Each color corresponds to a different model. Synthetic case (a), real case (b).
We also generated different test trajectories, because exiting and entering may occur in directions different from the ones in Fig. 4. These examples are illustrated in Fig. 6. In this new experiment, the same 100% accuracy was obtained.
Figure 4: Examples of synthetic activities (performed in left-right direction): (a) entering, (b) passing, (c) browsing.
Figure 5: Five models with the displacements (black dots) of the test activities: (a) entering, (b) passing.
The proposed algorithm was also tested with real data. The video sequences were acquired in the context of the EC-funded project CAVIAR. The video sequences comprise human activities observed in an indoor plaza and a shopping center, featuring individuals and small groups of people. Ground truth was hand-labelled for all sequences.2 Fig. 7 shows the bounding boxes as well as the centroids, which is the information used for the segmentation.
As in the synthetic case, we also estimated the statistics of the considered models. The procedure is the same as before, using training sequences. Fig. 3(b) shows the clusters of the models.
Fig. 8 shows several activities performed at the shopping center, with the time instants of the model switching marked with small red circles. From this experiment, it can be seen that the proposed approach correctly determines the switching times between models.
We have tested the proposed approach on more than 40 trajectories from 25 movies of about 5 minutes each. We present the results of some of those activities in Tables 1 and 2. These tables show the penalized log-likelihood values (8) of each test sequence. The first table refers to all activities performed in the left-right direction, whilst the second table reports all activities performed in the opposite direction. In the first table the classes referring to entering, exiting, passing and browsing are right-upwards, downwards-right, right, and right-stop-right, respectively, whereas in the second table the classes are left-upwards, downwards-left, left and left-stop-left. It can be observed that the classifier correctly assigns the activities to the corresponding classes, exhibiting results as good as in the previous synthetic examples.
6 Conclusions

In this paper we have proposed and tested an algorithm for the modelling, segmentation, and classification of human activities in a constrained environment. The proposed approach uses switched dynamical models to represent human trajectories. It was illustrated that the switching time instants are effectively determined, despite the significant random perturbations that the trajectory may contain. It is demonstrated that the proposed approach provides good

2The ground-truth-labelled video sequences are provided at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
Figure 6: Synthetic activities with different dynamic models (entering, exiting, passing).
Figure 7: Bounding boxes and centroids of the pedestrians performing activities.
results with synthetic and real data obtained in a shopping center. The proposed method is able to effectively recognize instances of the learned activities. The activities studied herein can be interpreted as atomic, in the sense that they are simple events. Compound actions or complex events can be represented as concatenations of the activities studied in this paper. This is one of the issues to be addressed in the future.
Acknowledgement: We would like to thank Prof. Jose Santos Victor of ISR and the members of the CAVIAR project for providing video data of human activities with the ground truth information.
Figure 8: Samples of different activities. The large circles mark the computed time instants where the model switches: entering (first column); exiting (second column); browsing (third column).
Test trajectories
Classes      E1     E2     Ex1    Ex2    P1     P2     B
Entering    187.2  157.3  212.7  217.0  100.3  107.4  169.1
Exiting     401.0  340.0  116.1  102.4  104.6   93.8  178.7
Passing     359.7  311.0  232.5  183.3   88.8   90.2  147.7
Browsing    299.1  265.6  196.5  180.0  160.7  156.0   98.1

Table 1: Penalized log-likelihood of several real activities performed in the left-right direction: E - entering, Ex - exiting, P - passing, B - browsing.
Test trajectories
Classes      E1     E2     Ex1    Ex2    P1     P2     B
Entering    116.2  115.0  337.7  358.2   89.3   90.9  211.7
Exiting     277.6  284.6  151.0  127.4   98.6   96.6  297.4
Passing     210.0  224.4  350.1  362.0   63.4   64.7  358.4
Browsing    207.4  197.3  343.2  286.7  188.9  179.0  170.1

Table 2: Penalized log-likelihood of several real activities performed in the right-left direction: E - entering, Ex - exiting, P - passing, B - browsing.
References
[1] D. Ayers and M. Shah, “Monitoring Human Behavior from Video Taken in an Office Environment”, Image and Vision Computing, vol. 19, no. 12, pp. 833-846, Oct. 2001.
[2] A. Bobick and J. Davis, “The Recognition of Human Movement using Temporal Templates”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, March 2001.
[3] J. Davis and M. Shah, “Visual Gesture Recognition”, IEE Proc. Vision, Image and Signal Processing, vol. 141, no. 2, pp. 101-106, April 1994.
[4] S. Hongeng and R. Nevatia, “Multi-Agent Event Recognition”, in Proc. of the 8th IEEE Int. Conf. on Computer Vision (ICCV'01), vol. 2, pp. 84-91, 2001.
[5] M. Isard and A. Blake, “A Mixed-state Condensation Tracker with Automatic Model-switching”, in Proc. of the Int. Conf. on Computer Vision, pp. 107-112, 1998.
[6] J. S. Marques and J. M. Lemos, “Optimal and Suboptimal Shape Tracking Based on Switched Dynamic Models”, Image and Vision Computing, pp. 539-550, June 2001.
[7] N. Johnson and D. Hogg, “Representation and Synthesis of Behaviour using Gaussian Mixtures”, Image and Vision Computing, vol. 20, no. 12, pp. 889-894, 2002.
[8] A. J. Abrantes, J. S. Marques, and J. M. Lemos, “Long Term Tracking Using Bayesian Networks”, in Proc. of IEEE Int. Conf. on Image Processing, Rochester, vol. III, pp. 609-612, Sept. 2002.
[9] O. Masoud and N. P. Papanikolopoulos, “A Method for Human Action Recognition”, Image and Vision Computing, vol. 21, no. 8, pp. 729-743, August 2003.
[10] A. Nagai, Y. Kuno and Y. Suirai, “Surveillance Systems based on Spatio-temporal Information”, in Proc. IEEE Int. Conf. Image Processing, pp. 593-596, 1996.
[11] J. C. Nascimento, M. A. T. Figueiredo and J. S. Marques, “Recognition of Human Activities with Space Dependent Switched Dynamical Models”, in Proc. IEEE Int. Conf. Image Processing, September 2005.
[12] N. M. Oliver, B. Rosario and A. P. Pentland, “A Bayesian Computer Vision System for Modeling Human Interactions”, IEEE Trans. on Pattern Anal. and Machine Intell., vol. 22, no. 8, pp. 831-843, August 2000.
[13] T. J. Olson and F. Z. Brill, “Moving Object Detection and Event Recognition for Smart Cameras”, in Proc. Image Understanding Workshop, pp. 159-175, 1997.
[14] J. Rissanen, Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific, 1989.
[15] M. Rosenblum, Y. Yacoob and L. S. Davis, “Human expression recognition from motion using a radial basis function network architecture”, IEEE Trans. Neural Networks, no. 7, pp. 1121-1138, 1996.
[16] Y. Yacoob and M. J. Black, “Parameterized Modeling and Recognition of Activities”, Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232-247, February 1999.
Chapter 4

The Kalman Filter Approach
Imagine you are sitting in a car waiting at a crossroad to pass it. The visibility is poor due to parked cars at the roadside. But there are some gaps between them, so you can observe these openings to decide whether you can cross the street without causing an accident. You have to guess the number, position and velocity of potential vehicles moving on the road from just a little information derived by watching these gaps over time.

Let us integrate the mentioned attributes of the street into the concept of a state of the street. The observations can also be seen as measurements, and they are noisy because of the poor visibility. An estimation of the state of the street is only possible if you know how vehicles move on a road and how the measurements are related to this motion. Due to the noise in the measurements and to aspects that are not directly observable, like acceleration, there will not be absolute certainty in your estimation.
This task is one instance of the problem known as the observer design problem. In general, you have to estimate the unknown internal state of a dynamical system given its output in the presence of uncertainty. The output depends somehow on the system's state. To be able to infer this state from the output, you need to know the corresponding relation and the system's “behaviour”. In such situations, we have to construct a model. In practice it is not possible to represent the considered system with absolute precision. Instead, the model will stop at some level of detail. The gap between it and reality is filled with a probabilistic assumption referred to as noise. The noise model introduced in this chapter will be applied throughout this work.
An optimal solution for this sort of problem in the case of linear models can be derived using the Kalman Filter, which is explained in the first section of this chapter, based on [12]. Most of the interesting instances of the observer design problem, e.g. the SLAM problem, do not fulfil the condition of linearity. To be able to apply the Kalman Filter approach to these non-linear tasks, we have to linearise the models. The corresponding algorithm is referred to as the Extended Kalman Filter. We will introduce it in the second section.
4.1 The Discrete Kalman Filter
In this section we introduce the Kalman Filter, chiefly based on its original formulation in [17], where the state is estimated at discrete points in time. The algorithm is slightly simplified by ignoring the so-called control input, which is not used in this specific application of purely vision-based SLAM. Nevertheless, in a robotic application it might be useful to involve, e.g., odometry data as control input. A complete description of the Kalman Filter can be found in [17] and [12].
In the following, we will first introduce the model for the system's state and the process model, which describes the already mentioned system's “behaviour”. Here, the noise model is also presented. After that, we introduce the model for the relation between the state and its output. The section closes with a description of the whole Kalman Filter algorithm.
4.1.1 Model for the Dynamical System to Be Estimated
The Kalman filter is based on the assumption that the dynamical system to be estimated can be modelled as a normally distributed random process X(k) with mean xk and covariance matrix Pk, where the index k represents time. The mean xk is referred to as the estimate of the unknown real state of the system at the point k in time. This state is modelled by an n-dimensional vector:

x = (x1, …, xi, …, xn)⊤
For simplicity of notation we did not use the subscript k here. Throughout this work, we will continue omitting k when the components of a vector or matrix are presented, even if they are different at each point in time.

Our main objective is to derive a preferably accurate estimate xk for the state of the observed system at time k.
The covariance matrix Pk describes the possible error between the state estimate xk and the unknown real state, in other words the uncertainty in the state estimation after time step k. It can be modelled as an n × n matrix

P = | x1x1 … x1xi … x1xn |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | xix1 … xixi … xixn |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | xnx1 … xnxi … xnxn |

where the main diagonal contains the variances of each variable in the state vector and the other entries contain the covariances of pairs of these variables. Covariance matrices are always symmetric due to the symmetric property of
covariances.1
If we want to derive an accurate estimate of the system's state, the corresponding uncertainty should obviously be small. The Kalman filter is optimal in the sense that it minimises the error covariance matrix Pk.
4.1.2 Process Model
Examined over time, the dynamical system is subject to a transformation. Some aspects of this transformation are known and can be modelled. Others, e.g. acceleration as in the example above (which also influences the state of the system), are unknown, not measurable, or too complex to be modelled. The transformation therefore has to be approximated by a process model A involving the known factors. The “classic” Kalman filter expects the model to be linear. Under this condition, the normal distribution of the state model is maintained after it has undergone the linear transformation A. The new mean xk and covariance matrix Pk for the next point in time are derived by
xk = Axk−1 (4.1)
Pk = APk−1A⊤. (4.2)
Due to the approximative character of A, the state estimate xk is also just an approximation of the real state. The difference is represented by a random variable w:
xk = Axk−1 + wk−1. (4.3)
The individual values of w are not known for each point k in time, but they need to be involved to improve the estimation. We assume these values to be realisations of a normally distributed white noise vector with zero mean. In the following, this vector w is referred to as process noise. It is denoted by

p(w) ∼ N(0, Q)   (4.4)

where zero is the mean and Q the process noise covariance. The individual values of w at each point in time can now be assumed to be equal to the mean, i.e. to zero. Thus, we stick to Equation (4.1) to estimate xk.
The process noise does not influence the current state estimate, but the uncertainty about it. Intuitively we can say: the higher the discrepancy between the real process and the corresponding model, the higher the uncertainty about the quality of the state estimate. This can be expressed by extending the computation of the error covariance Pk in Equation (4.2) with the process noise covariance matrix Q.
Pk = APk−1A⊤ + Q (4.5)
The choice of the values for the process noise covariance matrix reflects the quality we expect from the process model. If we set them to small values, we are quite sure that our assumptions about the considered system are mostly right, and the uncertainty regarding our estimates will be low. But then we will be unable, or hardly able, to cope with large variations between the model and
1The covariance value x1xn is the same as xnx1. In practice this means that x1 is correlated to xn in the same way as xn to x1.
the system. Setting the variances to large values instead means accepting that there might be large differences between the state estimate and the real state of the system. We will be able to cope with large variations, but the uncertainty about the state estimate will increase more strongly than with a small process noise. A lot of good measurements are needed to constrain the estimate.
4.1.3 Output of the System
As already mentioned, the output of the system is related to the state of the system. If we know this relation and the estimated state after the current time step, we are able to predict the corresponding measurement of the system's output. In this section, we will introduce the model for the measurement of the output. In the next section, the relation between state and output is examined.
Like the state of the considered dynamical system, its output is also modelled as a normally distributed random process Z(k) with mean zk and covariance matrix Sk, where the index k indicates time. The mean zk represents the estimated and predicted measurement of the output, depending on the state estimate xk at the point k in time. The real measurement of the output is obtained by explicitly measuring the system's output. It is modelled as an m-dimensional vector

z = (z1, …, zi, …, zm)⊤
The so-called innovation covariance matrix Sk describes the possible error between the estimate zk and the real measurement, in other words the uncertainty in the measurement estimation after time step k. It can be modelled as an m × m matrix

S = | z1z1 … z1zi … z1zm |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | ziz1 … zizi … zizm |
    |  ⋮   ⋱   ⋮   ⋱   ⋮  |
    | zmz1 … zmzi … zmzm |
where the main diagonal contains the variances of each variable in the measurement vector and the other entries contain the covariances of pairs of these variables.

Note that, in contrast to the system's real state, the real measurement can be obtained, and we are therefore able to compare the predicted and the real measurement. The precisely known difference between estimation and reality constitutes the basis for correcting the state estimate used to predict the measurement. This will be explained in detail in Section 4.1.5.
4.1.4 Measurement Model
In the previous sections we mentioned that the system's output is somehow related to the system's state. In this section this relation is modelled.
We have the same situation as for the process model. The connection between the output and the state can only be modelled up to a certain degree. Known factors are summarised in the measurement model H. After we have obtained a new state estimate for the current point in time, we can apply H to predict the corresponding measurement zk and covariance matrix Sk. If this measurement model is linear, the normal distribution of the state model is maintained after applying this linear transformation.
zk = Hxk (4.6)
Sk = HPkH⊤. (4.7)
Because measurements of the system's output are usually noisy due to inaccurate sensors, the difference between the estimate zk and the real measurement is not just caused by the dependency on the state estimate, but also by a random variable v:
zk = Hxk + vk. (4.8)
As with the process noise, the individual values of v are not known for each point k in time. We apply the same noise model and approximate these unknown values as realisations of a normally distributed white noise vector with zero mean. In the following, v is referred to as measurement noise. It is denoted by

p(v) ∼ N(0, R)   (4.9)

As v is now assumed to be equal to the mean of its distribution at each point in time, it does not influence the measurement estimate, but the uncertainty about it. This is modelled by extending the computation of the innovation covariance matrix Sk in Equation (4.7) with the measurement noise covariance matrix R.
Sk = HPkH⊤ + R (4.10)
Again, the values chosen for the measurement noise covariance matrix indicate how sure we are about the assumptions made in our measurement model. More information about the influence of the measurement noise is given below, in connection with the Kalman Gain.
4.1.5 Predict and Correct Steps
In the last sections we introduced the model for the process the system is subject to and the model for the relation between the system's state and its output. These models are used in the Kalman Filter algorithm to determine an optimal estimate of the unknown state of the system.

As already mentioned in Section 4.1.3, we use the known difference between the predicted measurement zk and the real measurement as a basis to correct the state estimate derived by the application of the process model A. The filter can be divided into two parts. In the predict step, the process model and the current state and error covariance matrix estimates are used to derive an a priori state estimate for the next time step. Next, in the correct step, a (noisy) measurement is obtained to enhance the a priori state estimate and derive an improved a posteriori estimate.
Figure 4.1: The Predict-Correct Cycle of the Kalman Filter Algorithm.
Before this predict-correct cycle, as depicted in Figure 4.1, can be started, the state and its error covariance matrix have to be initialised. In the following we will assume that this is already the case.
Predict Step
We are situated at the point k in time, and the state and error covariance matrix estimates at time k−1 are given. By using Equations (4.1) and (4.5) we predict the state and error covariance matrix for k:

x−k = Axk−1
P−k = APk−1A⊤ + Q.

The minus superscript labels the predicted state and error covariance matrix as a priori, in contrast to the a posteriori estimates.
Correct Step
Assume that we have already obtained an actual measurement zk of the system's output. With the help of this, we first want to calculate the a posteriori state estimate xk. This is a linear combination of the a priori estimate x−k and a weighted difference between zk and the predicted measurement. According to Equation (4.6), the predicted measurement is calculated as Hx−k. Summarised, we have:

xk = x−k + Kk(zk − Hx−k).

The difference zk − Hx−k is called the measurement innovation or residual. If its value is zero, the prediction and the actual measurement are in complete agreement and the a priori state estimate won't be corrected. If it is unequal to zero, xk will be unequal to x−k.
The weight Kk, the so-called Kalman Gain, is represented by an n × m matrix and minimises the a posteriori error covariance estimate Pk. It can be calculated by

Kk = P−k H⊤ (HP−k H⊤ + R)⁻¹   (4.11)
Note that the inverted factor equals Equation (4.10), representing the uncertainty in the predicted measurement. If we look closely at Equation (4.11), we can
see that if the measurement noise covariance R approaches zero, the measurement innovation is weighted more heavily:

lim_{R→0} Kk = H⁻¹

In other words, the smaller the measurement error, the more reliable the actual measurement zk is. On the other hand, if the predicted error covariance matrix P−k approaches zero, the residual is weighted less:

lim_{P−k→0} Kk = 0

This means: the smaller the uncertainty in the a priori state estimate x−k, the more reliable the predicted measurement is.
Secondly, we have to correct the a priori error covariance matrix estimate to derive the a posteriori estimate.
Pk = (I− KkH)P−k
For details of the derivation of the filter algorithm see [26].
In Figure 4.2 the whole algorithm is given again step by step.
4.1.6 A Simple Example
To clarify the effectiveness of the Kalman Filter we will examine a simple example. To stick to the central theme of this work right from the beginning, this example will be an instance of the SLAM problem. The section is structured as follows: first, we give a short description of the problem. After that, the process and measurement model are formulated. The section closes with some experiments on simulated data.
Problem Description
In Chapter 5, we will analyse how to apply the Kalman Filter approach to the problem of SLAM using a vision sensor mounted on a robot. This means, firstly, tracking the position and orientation of the camera within the 3D environment (localisation) and, secondly, estimating the positions of some landmarks situated in the world (mapping).

In the following, we simplify this task to SLAM in one dimension. The camera is represented by a point moving randomly in 1D. There is also a static landmark whose position is known up to a certain degree. The process model of this example should describe the motion of the camera. We will assume that it moves smoothly, so that fast changes in its velocity are unlikely. We are able to measure the distance between the landmark and the moving point at discrete points in time. The measurement model should relate this distance to the state of the considered system.
The situation is depicted in Figure 4.3.
1. Predict Step

(a) Predict the state:
    x−k = Axk−1

(b) Predict the error covariance matrix:
    P−k = APk−1A⊤ + Q

2. Correct Step

(a) Calculate the Kalman Gain:
    Kk = P−k H⊤ (HP−k H⊤ + R)⁻¹

(b) Correct the a priori state estimate:
    xk = x−k + Kk(zk − Hx−k)

(c) Correct the a priori error covariance matrix estimate:
    Pk = (I − KkH)P−k
Figure 4.2: Equations of one Kalman Filter Cycle. We assume that the state,
its covariance and the noise values are already initialised.
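The cycle of Figure 4.2 can be written as a single function. The sketch below is a straightforward NumPy transcription of the listed equations (no control input); the function name `kalman_step` and the 1D test values used with it are our own:

```python
import numpy as np

def kalman_step(x, P, z, A, H, Q, R):
    """One predict-correct cycle of the discrete Kalman Filter,
    following Figure 4.2 (no control input)."""
    # Predict step
    x_prior = A @ x
    P_prior = A @ P @ A.T + Q
    # Correct step
    S = H @ P_prior @ H.T + R             # innovation covariance (4.10)
    K = P_prior @ H.T @ np.linalg.inv(S)  # Kalman gain (4.11)
    x_post = x_prior + K @ (z - H @ x_prior)
    P_post = (np.eye(len(x)) - K @ H) @ P_prior
    return x_post, P_post
```

In a scalar example with A = H = 1, P = R = 1 and Q = 0, a measurement of 2 against a prior estimate of 0 gives a gain of 0.5, so the corrected estimate lands halfway between prediction and measurement, and the uncertainty halves.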
Figure 4.3: An Example for a Point Moving Randomly in 1D. A static landmark
is situated at x = 3. The distance between the current point position and this
landmark is measurable at each time step.
Process and Measurement Model
First we have to model the state x which has to be estimated. Three important entities have to be taken into account. Firstly, there is the position of the point at a point in time. It is fully described by a one-dimensional coordinate in x-direction. Secondly, we choose a constant velocity to describe the motion of the point.2 This does not mean that we assume the point moves constantly over all time, but that this value is the average velocity between two points in time and that changes occur with a Gaussian profile. These changes are modelled below as process noise. Finally, the position of the landmark has to be augmented into the state.

x = (x_p, v_p, x_f)⊤ = (position of the point, velocity of the point, position of the landmark)⊤
The error covariance matrix is then a 3 × 3 matrix of the following form

P = | x_px_p  x_pv_p  x_px_f |
    | v_px_p  v_pv_p  v_px_f |
    | x_fx_p  x_fv_p  x_fx_f |.
The task of the process model A is to approximate the transformation of the considered system over time. Here, this is the motion of the point between time k − 1 and k. This constant time period is denoted as ∆k. A is used to predict the state of the system for the current point k in time from the old state estimate at time k − 1 by calculating x(k) = Ax(k − 1).

x_p(k) = old point position + old velocity per ∆k = x_p(k − 1) + v_p(k − 1)∆k
v_p(k) = constant velocity due to assumed smooth motion = v_p(k − 1)
x_f(k) = static landmark = x_f(k − 1)
(4.12)
As already mentioned, the constant velocity value just describes the average velocity in the time period ∆k. Therefore, it is just an approximation. Variations are caused by random, unmeasurable accelerations a.3 We involve them in the process noise vector w. If we knew the individual values of w at each k, we could derive the real state:

x(k) = Ax(k − 1) + w(k − 1)

Because the process noise is an additive constant, w is modelled as a three-dimensional vector w = (w_0, w_1, w_2)⊤. Noise is only added to the velocity component of the state. Thus, the first and third components, w_0 and w_2, referring to the position of the moving point and to the position of the landmark, are set to zero. Only the second value carries a different random value after each time step: w = (0, a∆k, 0)⊤. Adding the noise term to the process model, we
2. A velocity v_p describes the distance x covered in a certain time interval Δk.
3. An acceleration a is a change in velocity v_p in a certain time interval Δk. Thus, w1 = aΔk = Δv_p, the change in velocity.
have:
    x_p(k) = x_p(k − 1) + (v_p(k − 1) + a(k − 1)Δk)Δk
    v_p(k) = v_p(k − 1) + a(k − 1)Δk
    x_f(k) = x_f(k − 1)
We do not know the individual values of a at each point in time. Therefore, we model the process noise as a realization of a normally distributed white noise random vector with zero mean and covariance matrix Q:
p(w) ∼ N (0, Q)
Now, we can assume w to be equal to the mean of its distribution, which is zero. We derive the process model already formulated in Equation (4.12). Expressed as a linear transformation, with Δk assumed to be 1, this is

        | 1 1 0 |
    A = | 0 1 0 |
        | 0 0 1 | .
Q is of the following form:

        | 0   0     0 |
    Q = | 0   σ_p²  0 |
        | 0   0     0 | .
The constant value σ_p, the standard deviation of the noise in the velocity value, indicates the amount of smoothness in the motion we expect. If we choose it to be small, we expect the point to move with a nearly constant velocity; then we will not be able to cope with sudden accelerations. If we choose large values instead, we will be able to track the point well even if it behaves differently than the process model expects. On the other hand, the uncertainty about a state estimate is then higher than with small values for σ_p.
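To make the predict step concrete, here is a minimal sketch in Python with NumPy. The value of σ_p, the initial state and Δk = 1 are assumptions for illustration, not values prescribed by the text:

```python
import numpy as np

# Process model A for the state (x_p, v_p, x_f), with Delta-k = 1,
# and process noise covariance Q with noise only in the velocity component.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
sigma_p = 0.2                       # assumed velocity noise standard deviation
Q = np.diag([0.0, sigma_p**2, 0.0])

def predict(x, P):
    """One predict step: x_minus = A x, P_minus = A P A^T + Q."""
    return A @ x, A @ P @ A.T + Q

x0 = np.array([0.0, 1.0, 3.0])      # assumed: point at 0, velocity 1, landmark at 3
P0 = np.eye(3)
x1, P1 = predict(x0, P0)            # x1 = [1, 1, 3]: the point advanced by one velocity unit
```

Note how Q only inflates the velocity variance: the landmark row and column of Q are zero because the landmark is static.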
The measurement model approximates the relation between the actual measurement z_k and the current state x_k. In our example, the measurement consists of just one value representing the distance d_k between the moving point and the static landmark at the current point k in time. Expressed as a linear equation, we have
    z(k) = d_k = x_p(k) − x_f(k)    (4.13)
The sensor used to measure the distance is assumed to provide only noisy measurements. If we knew the value of this measurement noise exactly, we could determine the real measurement and not just an estimate. If we denote the measurement noise by the random variable v, the real measurement can be computed by:
    z(k) = d_k = x_p(k) − x_f(k) + v(k).
But we do not know the individual values of the random variable v. Therefore, we apply our noise model such that the values of v are a realization of normally distributed white noise with zero mean and variance σ_m²:

    p(v) ∼ N(0, σ_m²).
The measurement noise has the same dimension as the measurement, and its distribution is therefore modelled by specifying a variance instead of a covariance matrix. We can now assume the value of v to be equal to the mean of its distribution, i.e. zero. Then we derive the measurement model already formulated in Equation (4.13). Note that the difference between the estimate ẑ_k of the measurement and the real measurement is not just caused by the unknown noise, but also by the fact that in reality we only have an estimate of the state with which to predict the measurement. The final measurement model for this problem is:
    ẑ(k) = d_k = x_p(k) − x_f(k).
Expressed as a linear transformation, we have

    H = ( 1   0   −1 ).
The constant value σ_m, the standard deviation of the measurement noise distribution, indicates how sure we are about the correctness of the real measurements. Large values show that we do not trust them much, and the measurement innovation will be weighted less. Small values indicate that the measured values are accurate; the residual will then be weighted more heavily.
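The corresponding correct step can be sketched as follows. H = (1 0 −1) is from the measurement model above; σ_m and the numeric inputs are assumptions for illustration:

```python
import numpy as np

H = np.array([[1.0, 0.0, -1.0]])   # z = x_p - x_f, Eq. (4.13)
sigma_m = 0.2                      # assumed measurement noise standard deviation
R = np.array([[sigma_m**2]])

def correct(x_minus, P_minus, z):
    """One correct step with the scalar distance measurement z."""
    S = P_minus @ H.T                      # intermediate: P^- H^T
    S = H @ S + R                          # innovation covariance (1x1)
    K = P_minus @ H.T @ np.linalg.inv(S)   # Kalman gain (3x1)
    x = x_minus + (K @ (z - H @ x_minus)).ravel()
    P = (np.eye(3) - K @ H) @ P_minus
    return x, P

# A measured distance of -1.8 pulls the predicted distance of -2 toward it.
x, P = correct(np.array([1.0, 1.0, 3.0]), np.eye(3), np.array([-1.8]))
```

With a small σ_m the gain weights the residual heavily, exactly as described in the paragraph above.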
Experiments on Simulated Data
In the previous section, we derived the basis for the application of the Kalman Filter to our problem: the appropriate process and measurement models. In this section, we will test these models on simulated data. The simulation was initialized with the state:
    x0 = (0, 1, 3)⊤
The subsequent real positions of the point moving in 1D were generated by applying exactly the process described in the corresponding model and adding some random values. The standard deviation of the random values is set to 0.2. The real measurements were also generated as described in the measurement model. Measurement noise is simulated by adding random values with a standard deviation of 0.2.
To start the predict-correct cycle of the Kalman Filter, we have to initialize the state and its error covariance matrix as well as the process and measurement noise values. Let us set the state to the real initial values. We assume an uncertainty about the initial position of the moving point as well as about the position of the landmark and the velocity at time 0. Let the error covariance be
         | 1 0 0 |
    P0 = | 0 1 0 |
         | 0 0 1 |
The real noise in the measurements can usually be determined prior to the application of the filter. Determining the process noise covariance is more complicated, because we generally do not have the ability to measure the process we want to estimate directly. Nevertheless, we set the standard deviation of the
[Plot omitted: Position [Units] vs. Time [Filter Cycles]; curves: Point Positions, Position Landmark, Estimated Point Position, Estimated Landmark Position.]
Figure 4.4: The Simulation of the Problem of Estimating a Moving Point's Position by Orienting at a Single Landmark. The deviation between the estimated and real position of the point is very small, as is the deviation between the estimated and real position of the landmark.
noise in the velocity, σ_v, and in the measurement, σ_m, to the real value used in the simulation: 0.2.
We will run the filter on ten simulated measurements. The results are depicted in Figure 4.4. In Figure 4.5, the behaviour of the error covariance P during the ten filter cycles is visualized.
4.2 The Extended Kalman Filter
As we saw in Section 4.1.6, the Kalman Filter algorithm works quite well for the estimation of a linear system with linearly related measurements, depending on the quality of the appropriate models for the process and the measurement of the output. Moreover, the Kalman Filter is optimal in the sense that it minimizes the error covariance representing the uncertainty in the estimate of the state.
To come back to the main theme of this work, estimating the position of a moving robot and of static landmarks using a camera sensor, we need to be able to cope with nonlinear motion and a nonlinear relationship between measurements and the system's state. The nonlinear motion is caused by the possible rotational movements the robot is able to perform. Measurements of landmarks in the surroundings of the robot are projections of them onto the image plane of the camera sensor. The process of projection is nonlinear.
In Section 4.1.2, it was stated that a Gaussian distribution is maintained under a linear transformation. This is not the case if we use a nonlinear transformation instead. Thus, we cannot apply the Kalman Filter equations in their original formulation to estimate a nonlinear system. A solution to this problem is to linearize the transformation via Taylor expansion. A Kalman Filter that uses Taylor expansion to linearize the process and measurement models is called an Extended Kalman Filter, in the following abbreviated as EKF.
[Plot omitted: Error Covariance P [Units] vs. Time [Filter Cycles]; curves: Variance in Point Position, Variance in Velocity, Variance in Landmark Position.]
Figure 4.5: The Error Covariance Matrix P. After two iterations, the initial value of 1 for the variances has settled at approximately 0.5 for the estimation of the point's position and of the landmark's position, and at approximately 0.04 for the estimation of the velocity.
As in Section 4.1.1, we assume that the considered system can be modelled as a normally distributed random process X(k) with mean x̂_k, the estimate of the real system state x_k, and covariance matrix P_k. Its output can likewise be modelled as a normally distributed random process Z(k) with mean ẑ_k, the prediction of the real measurement z_k, and covariance matrix S_k. In the following sections, the EKF is derived for nonlinear process and measurement models.
Right from the beginning, we will stick to the "super minus" notation labelling a priori estimates.
4.2.1 Process Model
Let us assume that our system to be estimated, represented by a state vector x_k at time k, is now governed by the nonlinear function
xk = f (xk−1, wk−1) (4.14)
relating the previous state x_k−1 at point k − 1 in time to the state x_k at the current point k in time. The random value w_k−1 represents the process noise as in Equation (4.4):
p(w) ∼ N (0, Q)
We assume w to be equal to the mean of its distribution, which is zero. The result of the function f will then be an approximation x̂⁻_k of the real state x_k:

    x̂⁻_k = f(x̂_k−1, 0)    (4.15)
Let the difference between the real state and its estimate, namely the error in the prediction, be a random variable e:

    e_xk = x_k − x̂⁻_k.
To be able to estimate the result of the process represented by the nonlinear Equation (4.14) via the Kalman Filter algorithm, we linearize it about the current state estimate given in Equation (4.15) by setting up a first-order Taylor polynomial ([16], p. 411):

    x_k ≈ x̂⁻_k + A(x_k−1 − x̂_k−1) + Ww_k−1 = x̃_k    (4.16)
The matrix A is the Jacobian matrix containing the partial derivatives of f in Equation (4.15) with respect to x, whereas the Jacobian matrix W is filled with the partial derivatives of f with respect to w. Note that we omitted the time subscript k for the Jacobians to simplify the notation. Nevertheless, they may be different at each point in time. In the following, we will stick to omitting k for the Jacobian matrices.
The a priori estimate x̂⁻_k in Equation (4.16) can be calculated as f(x̂_k−1, 0). The remainder term approximates the error e_xk as ẽ_xk:

    e_xk ≈ A(x_k−1 − x̂_k−1) + Ww_k−1 = ẽ_xk    (4.17)

With this definition of ẽ_xk, we can rewrite Equation (4.16) as

    x_k = x̂⁻_k + e_xk    (4.18)

According to Equation (4.18), we need to estimate the random value e_xk as ê_xk at each point in time to achieve our actual goal: estimating x_k as x̂_k.
Note that (4.17) is a linear equation. Thus, we can apply a second, hypothetical "classic" Kalman Filter to estimate e_xk. We will model this dynamic linear error system as a normally distributed random process with mean ê_xk and covariance matrix P_k representing the uncertainty about the estimated e_xk. Since e_xk denotes the error in the state estimate, it is clear that it should always be approximately zero. Therefore, the mean ê_xk of the distribution is chosen to be zero.
Let us consider Equation (4.17) again. The second term Ww_k−1 denotes the noise in the estimation of e_xk. It is the product of the process noise w and the Jacobian matrix W containing the partial derivatives of f with respect to w. Remember that the process noise is assumed to be always equal to zero; thus, the term Ww_k−1 is also assumed to be equal to zero. If w is transformed by applying W, the corresponding covariance matrix Q of the process noise is transformed into WQW⊤. The noise in the estimation of e_xk is then modelled as

    p(Ww_k−1) ∼ N(0, WQW⊤).

To involve this noise in the prediction of the error e_xk between real and estimated state, the corresponding error covariance WQW⊤ is added to the prediction AP_k−1A⊤ of its error covariance P. To summarize the last statements, we have:
    ê⁻_xk = A(x̂_k−1 − x̂_k−1) = 0    (4.19)
    P⁻_k = AP_k−1A⊤ + WQW⊤.    (4.20)
Equations (4.19) and (4.20) represent the process model for the linear errorsystem.
If we substitute Equation (4.19) for e_xk in Equation (4.18), the process model for the nonlinear system to predict a state estimate x̂⁻_k is then

    x̂⁻_k = f(x̂_k−1, 0)    (4.21)
    P⁻_k = AP_k−1A⊤ + WQW⊤.    (4.22)
The process noise covariance matrix WQW⊤ plays the same role in the nonlinear process model as the covariance matrix Q does in the linear process model: it represents the amount of trust in the process model. High values indicate that large deviations between the state estimate and the real state are expected. Low values show a lot of confidence in the process model.
4.2.2 Measurement Model
Let us assume that the relation between the system and its output is described by the nonlinear function
zk = h(xk, vk) (4.23)
where vk represents the measurement noise as in (4.9).
p(v) ∼ N (0, R)
As usual, we assume v_k to be zero, which is the mean of its distribution:

    ẑ_k = h(x̂⁻_k, 0).    (4.24)
The result ẑ_k is just an approximation of the real measurement. Let the difference between the actual and the predicted measurement be the random value

    e_zk = z_k − ẑ_k.
In contrast to the error e_xk between the real state and its estimate, e_zk is accessible.
To estimate the measurement of the system's output, we linearize Equation (4.23) about the current state estimate given in Equation (4.24) by setting up a first-order Taylor polynomial:

    z_k ≈ ẑ_k + H(x_k − x̂⁻_k) + Vv_k    (4.25)
The matrix H is the Jacobian matrix containing the partial derivatives of h in Equation (4.24) with respect to x, while the Jacobian matrix V contains the derivatives of the same function with respect to the measurement noise v. The predicted measurement ẑ_k in Equation (4.25) can be calculated by Equation (4.24). The error e_zk is approximated as ẽ_zk by the remainder term

    e_zk ≈ H(x_k − x̂⁻_k) + Vv_k = ẽ_zk.    (4.26)
With this definition of ẽ_zk, we can rewrite Equation (4.25):

    z_k ≈ ẑ_k + e_zk    (4.27)
Note that Equation (4.26) is a linear equation. Therefore, we also model the error in the estimation of the output as a normally distributed random process
with mean ê_zk and innovation covariance matrix S_k, which captures the error between the predicted and the actual measurement. From the notion that ẽ_zk specifies the estimated error in the estimation of the state x_k of the system, it is clear that it should preferably be approximately equal to zero. Thus, the mean ê_zk of its distribution is assumed to be always equal to zero.
If we reconsider Equation (4.26), we can state that Vv_k is the noise term in the prediction of e_zk. Remember that the measurement noise v is assumed to be zero at every point in time; thus, the product of v and the Jacobian matrix V containing the partial derivatives of h with respect to the noise is zero. If v is transformed by applying V, the corresponding covariance matrix R is transformed into VRV⊤. The noise involved in the estimation of the error e_zk is then modelled as follows:

    p(Vv_k) ∼ N(0, VRV⊤)

The covariance matrix of the noise Vv_k is added to the prediction HP⁻_kH⊤ of the innovation covariance matrix. Summarized, we have:

    ê⁻_zk = H(x̂⁻_k − x̂⁻_k) = 0    (4.28)
    S_k = HP⁻_kH⊤ + VRV⊤.    (4.29)
Equations (4.28) and (4.29) represent the measurement model for the linear error system and are used to correct the a priori error estimate ê⁻_xk between the state and its approximation.

If we substitute Equation (4.28) for e_zk in Equation (4.27), the measurement model for the nonlinear system is:

    ẑ_k = h(x̂⁻_k, 0)    (4.30)
    S_k = HP⁻_kH⊤ + VRV⊤.    (4.31)
4.2.3 Predict and Correct Steps
Using the Kalman Filter for the estimation of the state of a linear system means that we know exactly how uncertain we are about this estimate. Using the EKF for the estimation of the state of a nonlinear system, in contrast, means additionally estimating the uncertainty in this state estimate. This can be done by a second, hypothetical Kalman Filter, presented in the previous sections, which estimates the error between the real state and its estimate.
Let us assume that we have already used the process model for the nonlinear system given in Equations (4.21) and (4.22) to derive an a priori estimate x̂⁻_k for the state and P⁻_k for its error covariance. Then, we can predict the measurement by using Equation (4.30). After we have obtained the real measurement z_k, we can calculate the error e_zk between z_k and the predicted measurement ẑ_k.

According to Equation (4.19), the predicted error estimate ê⁻_xk between the real state and its estimate is assumed to be zero in every time step.

The Kalman Filter equation to correct the a priori error estimate ê⁻_xk and derive an a posteriori ê_xk is then

    ê_xk = ê⁻_xk + K_k e_zk
         = K_k e_zk.
1. Predict Step

   (a) Predict the state:

       x̂⁻_k = f(x̂_k−1, 0)

   (b) Predict the error covariance matrix:

       P⁻_k = AP_k−1A⊤ + WQW⊤

2. Correct Step

   (a) Calculate the Kalman Gain:

       K_k = P⁻_kH⊤ (HP⁻_kH⊤ + VRV⊤)⁻¹

   (b) Correct the a priori state estimate:

       x̂_k = x̂⁻_k + K_k(z_k − h(x̂⁻_k, 0))

   (c) Correct the a posteriori error covariance matrix estimate:

       P_k = (I − K_kH)P⁻_k

Figure 4.6: Equations of one Extended Kalman Filter Cycle. We assume that the state, its covariance and the noise values are already initialized. Note that for simplicity the subscript k is not used here for the Jacobians, although they have to be re-calculated in each predict-correct cycle.
If we substitute this into Equation (4.18), we get

    x̂_k = x̂⁻_k + K_k e_zk.

Because e_zk is the measurement residual, we can also write

    x̂_k = x̂⁻_k + K_k(z_k − ẑ_k)    (4.32)
        = x̂⁻_k + K_k(z_k − h(x̂⁻_k, 0)).    (4.33)
Equation (4.33) can be used in the correct step of the Extended Kalman Filter algorithm to derive the a posteriori estimate of the state of the nonlinear system. The Kalman Gain K_k itself is calculated as in Equation (4.11), with the appropriate substitution of the measurement error covariance matrix given in (4.31):

    K_k = P⁻_kH⊤ (HP⁻_kH⊤ + VRV⊤)⁻¹
In Figure 4.6, the Extended Kalman Filter algorithm is given step by step.
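The cycle of Figure 4.6 can be written as one small generic routine. This is a sketch: `f` and `h` stand for the nonlinear models, and `jac_f`/`jac_h` are assumed callbacks returning the Jacobians (A, W) at the previous estimate and (H, V) at the predicted state:

```python
import numpy as np

def ekf_cycle(x, P, z, f, h, jac_f, jac_h, Q, R):
    """One EKF predict-correct cycle as in Figure 4.6."""
    # predict step
    A, W = jac_f(x)
    x_minus = f(x)
    P_minus = A @ P @ A.T + W @ Q @ W.T
    # correct step
    H, V = jac_h(x_minus)
    S = H @ P_minus @ H.T + V @ R @ V.T
    K = P_minus @ H.T @ np.linalg.inv(S)
    x_new = x_minus + (K @ (np.atleast_1d(z) - np.atleast_1d(h(x_minus)))).ravel()
    P_new = (np.eye(len(x)) - K @ H) @ P_minus
    return x_new, P_new
```

For linear f and h with constant Jacobians, this reduces to the "classic" Kalman Filter cycle, which is exactly what the simple example in the next section exploits.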
[Diagram omitted: a lighthouse above a horizontal line through x0, xi, xj (positions of the ship), with the observed angles αi and αj.]
Figure 4.7: A ship is sailing on the straight line perpendicular to the axis
between x0, the initial position of the ship, and the position of the lighthouse.
xi and xj are sample positions of the ship which need to be estimated from the
corresponding observable angles αi and αj .
4.2.4 A Simple Example
The derivation of the Extended Kalman Filter presented in the previous section is a bit more complicated than the explanation of the "classic" filter. In this section, a simple example is examined to provide a better understanding of the EKF algorithm. Again, we will consider an instance of the general SLAM problem.

The section is structured as follows. Firstly, we describe the specific problem in general. After that, the models for the system's state and process and the relation between the state and the measurement are presented. The section closes with some experiments on simulated data.
Problem Description
Imagine you are the skipper of a ship and your task is to sail a straight route of a certain length on the ocean. As you might infer from this sentence, the example deals more or less with the routing aspect of navigation, but we will focus on the localization and mapping problem. To be more concrete, as the skipper you need to localize your ship on that straight route. We assume that there is a lighthouse with an uncertainly known position to orient by.

Your initial position is located at some distance from that lighthouse. You will sail in a direction perpendicular to the axis between the lighthouse and the initial ship position. The motion of a ship is smooth, so changes in the velocity are unlikely.

You will be able to measure the angle between the current position of your ship and the lighthouse. Of course, these values will be rough guesses rather than precise measurements. We assume that you are not able to measure your velocity, which is normally the case.
This situation is depicted in Figure 4.7.
Process and Measurement Model
In this example we have two tasks. Firstly, we need to localize the position x of the moving ship on the straight route at every time step. Secondly, we have to refine our knowledge about the position y of the lighthouse.
Thus, the state x of the considered system contains three entities. The position x and velocity v_x of the ship are the first two. Again, we choose a constant value for the velocity, which represents an average value over the constant time period Δk. The third component of the state denotes the distance between the lighthouse and the initial position x0 of the ship.
x = (x, v_x, y)⊤ = (Position of the ship, Velocity of the ship, Distance of the lighthouse from x0)⊤
With this definition of the state, we have the following error covariance matrix P representing the uncertainty in the estimation of the state:

        | xx      xv_x     xy   |
    P = | v_xx    v_xv_x   v_xy |
        | yx      yv_x     yy   |
The process the system is subject to is simply the motion of the ship on that route. The process model f we set up here relates the state at time k − 1 to that at time k by calculating:
    x(k) = x(k − 1) + v_x(k − 1)Δk    (old position plus old velocity times the time interval)
    v_x(k) = v_x(k − 1)               (constant velocity due to assumed smooth motion)
    y(k) = y(k − 1)                   (static landmark)
                                                                            (4.34)
These equations are linear. Nevertheless, we will treat them as nonlinear and apply the EKF approach. We will see that the EKF equations then reduce to the equations of the "classic" Kalman Filter.

As already mentioned, v_x just describes the average velocity between two time steps. Thus, it is just an approximation of the real velocity. The random difference between estimated and real velocity is modelled as process noise w = (w0, w1, w2)⊤ = (0, aΔk, 0)⊤. Like the state, w is a three-dimensional vector. Only the velocity is corrupted by noise; therefore, only w1 carries a value unequal to zero, involving the unmeasurable acceleration a:
p(w) ∼ N (0, Q)
Q is of the following form:

        | 0   0     0 |
    Q = | 0   σ_v²  0 |
        | 0   0     0 |

The variable σ_v denotes the standard deviation of the noise in the velocity. If we knew the individual values of w, we could derive the real state of the considered system by calculating f(x_k−1, w_k−1):
    x(k) = x(k − 1) + (v_x(k − 1) + w1)Δk + w0 = x(k − 1) + (v_x(k − 1) + a(k − 1)Δk)Δk
    v_x(k) = v_x(k − 1) + w1 = v_x(k − 1) + a(k − 1)Δk
    y(k) = y(k − 1) + w2 = y(k − 1)
                                                                            (4.35)
Again, we assume w to be always equal to the mean of its distribution, which is zero. Then we obtain the process model f(x_k−1, 0) as it is already formulated in Equation (4.34). To be able to predict the error covariance matrix P at each point in time, we need to derive the Jacobian matrix A containing the partial derivatives of Equation (4.34) with respect to the state x, and the Jacobian matrix W containing the partial derivatives of Equation (4.34) with respect to the noise w. Assuming that Δk is equal to 1, for A we have:

        | ∂x/∂x      ∂x/∂v_x      ∂x/∂y   |   | 1 1 0 |
    A = | ∂v_x/∂x    ∂v_x/∂v_x    ∂v_x/∂y | = | 0 1 0 |
        | ∂y/∂x      ∂y/∂v_x      ∂y/∂y   |   | 0 0 1 | .
Note that this is the same matrix as Equation (4.34) expressed as a linear transformation.
For W, we have:

        | ∂x/∂w0      ∂x/∂w1      ∂x/∂w2   |   | 1 0 0 |
    W = | ∂v_x/∂w0    ∂v_x/∂w1    ∂v_x/∂w2 | = | 0 1 0 |
        | ∂y/∂w0      ∂y/∂w1      ∂y/∂w2   |   | 0 0 1 | .
Hence, WQW⊤ = Q. The equation to predict the error covariance then equals the one for the standard Kalman Filter: P⁻(k) = AP(k − 1)A⊤ + Q.
Now let us consider the measurement model for our system. It provides the relation between the state x of the system and the measurement z of its output. Remember that as measurement we obtain the value of the angle α at each time step. If we look again at Figure 4.7, we can state that the situation can be represented by a right triangle. Then two definitions hold:

    a² + b² = c²
    a = c · sin α

We define the axis between the lighthouse and x0 as a, the distance the ship has covered up to a certain point in time as b, and the connection between the lighthouse and the current position of the ship as the hypotenuse c. b is then equal to x in the state, and a is the same as y. Thus, the measurement model to obtain the measurement ẑ is

    ẑ(k) = α = arcsin( y(k) / √(x(k)² + y(k)²) ).    (4.36)
Thus, we have a nonlinear measurement model h. The value provided for α may be more a guess than a precise measurement. Therefore, we have to introduce measurement noise v to model the difference between the real measurement and the predicted one. If we knew the noise value for each time step, we would obtain z instead of ẑ by calculating h(x_k, v_k):

    z(k) = α = arcsin( y(k) / √(x(k)² + y(k)²) ) + v(k).
But this is not the case. Therefore, we model v as normally distributed measurement noise with zero mean and standard deviation σ_r:

    p(v) ∼ N(0, σ_r²)

Now we can assume v to be zero at each point in time, which is the mean of its distribution. Then we obtain h(x, 0) as it is already formulated in Equation (4.36). The noise enters the calculation of the innovation covariance S(k) = HP⁻(k)H⊤, which is also one-dimensional: because we have a nonlinear model, the variance is first transformed into Vσ_r²V⊤ and then added.

As usual, the value we choose for σ_r indicates how we rate the quality of the measurement model.
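The measurement function of Equation (4.36) is easy to sanity-check numerically. A small sketch (`math.hypot` computes √(x² + y²); the operating points are assumptions):

```python
import math

def h(x, y):
    """Predicted bearing angle, Eq. (4.36): alpha = arcsin(y / sqrt(x^2 + y^2))."""
    return math.asin(y / math.hypot(x, y))

# At the start (x = 0) the lighthouse lies perpendicular to the route: alpha = pi/2.
# As the ship moves away, the angle shrinks; at x = y it equals pi/4.
```

This matches the geometry of Figure 4.7: the observed angle decreases monotonically as the ship sails away from x0.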
Because we have a nonlinear measurement model, we need to derive the Jacobian matrices H and V at each point in time. H contains the partial derivatives of the measurement model h(x_k, 0) with respect to the state. It is of the following form:

    H = ( ∂h/∂x   ∂h/∂v_x   ∂h/∂y )
For ∂h/∂x we have

    ∂h/∂x = −x y / ( √(1 − y²/(x² + y²)) · √((x² + y²)³) )
∂h/∂v_x is equal to zero, because the velocity of the ship is irrelevant in the measurement model. For ∂h/∂y we have

    ∂h/∂y = ( 1/√(x² + y²) − y²/√((x² + y²)³) ) / √(1 − y²/(x² + y²))
The Jacobian matrix V contains the partial derivative of h(x_k, 0) with respect to the noise v. Thus, it is of the following form:

    V = ( ∂h/∂v )

Because the measurement noise v is additive, ∂h/∂v is equal to 1.
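A quick way to check the two derived Jacobian entries is to compare them against central finite differences at an assumed operating point:

```python
import math

def h(x, y):
    return math.asin(y / math.hypot(x, y))

def dh_dx(x, y):
    # analytic entry derived above
    r2 = x * x + y * y
    return -x * y / (math.sqrt(1.0 - y * y / r2) * math.sqrt(r2**3))

def dh_dy(x, y):
    # analytic entry derived above
    r2 = x * x + y * y
    return (1.0 / math.sqrt(r2) - y * y / math.sqrt(r2**3)) / math.sqrt(1.0 - y * y / r2)

# central finite differences at an assumed operating point (x, y) = (5, 20)
x, y, eps = 5.0, 20.0, 1e-6
num_dx = (h(x + eps, y) - h(x - eps, y)) / (2 * eps)
num_dy = (h(x, y + eps) - h(x, y - eps)) / (2 * eps)
```

For x > 0 the two expressions simplify algebraically to −y/(x² + y²) and x/(x² + y²), which makes the check above easy to verify by hand.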
Experiments on Simulated Data
In the previous section, we derived the basis for applying the EKF approach to our problem: the process and measurement models. In this section, we will test these models on simulated data. We repeat the procedure from the simple example for the standard Kalman Filter. The initial values for the filter reflect reality but are only approximately known. This is represented by an error covariance matrix P whose values on the main diagonal are unequal to zero. The values for the process and measurement noise are also chosen to represent the real values.
[Plot omitted: Position [Units] vs. Time [Filter Cycles]; curves: Ship Position, Estimated Ship Position.]
Figure 4.8: The Simulation of the Problem of Estimating the Position of a Ship by Orienting at a Lighthouse.
To start the predict-correct cycle, we initialize the state x and the error covariance matrix P. For x we choose:

    x = (0, 1, 20)⊤
These initial values are only uncertainly known. For P we choose:

        | 1 0 0 |
    P = | 0 1 0 |
        | 0 0 1 |
In reality, the standard deviations of the process and measurement noise need to be determined prior to the application of the filter. Here, the values σ_v and σ_m reflect the real noise values:

    σ_v = 0.02
    σ_m = 0.02
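The ship/lighthouse experiment can be sketched end to end as follows. The initialization matches the values above; the random seed and noise draws are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_v = sigma_m = 0.02
A = np.array([[1., 1., 0.], [0., 1., 0.], [0., 0., 1.]])   # Jacobian of f (Delta-k = 1)
Q = np.diag([0., sigma_v**2, 0.])

def f(x):                     # process model, Eq. (4.34)
    return np.array([x[0] + x[1], x[1], x[2]])

def h(x):                     # bearing measurement, Eq. (4.36)
    return np.arcsin(x[2] / np.hypot(x[0], x[2]))

def H_jac(x):                 # Jacobian of h, evaluated at the predicted state
    r2 = x[0]**2 + x[2]**2
    s = np.sqrt(1. - x[2]**2 / r2)
    return np.array([[-x[0] * x[2] / (s * r2**1.5), 0.,
                      (1. / np.sqrt(r2) - x[2]**2 / r2**1.5) / s]])

x_true = np.array([0., 1., 20.])
x_est, P = x_true.copy(), np.eye(3)

for _ in range(10):
    x_true = f(x_true) + np.array([0., rng.normal(0., sigma_v), 0.])
    z = h(x_true) + rng.normal(0., sigma_m)
    x_est, P = f(x_est), A @ P @ A.T + Q          # predict (W Q W^T = Q here)
    H = H_jac(x_est)
    S = H @ P @ H.T + sigma_m**2                  # V = 1, so V R V^T = sigma_m^2
    K = P @ H.T / S
    x_est = x_est + (K * (z - h(x_est))).ravel()
    P = (np.eye(3) - K @ H) @ P
```

Because the innovation covariance is a scalar here, the matrix inverse degenerates to a division, which keeps the sketch short.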
We will run the filter on 10 simulated measurements. The results for the estimation of the ship's position are depicted in Figure 4.8. In Figure 4.9, the estimated lighthouse position is compared with the real one. In Figure 4.10, the behaviour of the error covariance P during the ten filter cycles is depicted. We can note that the uncertainty about the position of the ship decreases at first and then starts to increase slowly. This is due to the increasingly influential measurement noise. The farther the ship gets from its starting point, the less the measured angle changes its value. The measurement noise stays at a constant level and will therefore increase its influence on the uncertainty
[Plot omitted: Position [Units] vs. Time [Filter Cycles], y-range roughly 19.6 to 20.4; curves: Lighthouse Position, Estimated Lighthouse Position.]
Figure 4.9: The Results for the Mapping of the Lighthouse.
about the correctness of the inferred position of the ship. Small changes in the value of the angle will cause larger deviations in the estimation of the ship's position and therefore a large uncertainty about the state estimate.
[Plot omitted: Error Covariance P [Units] vs. Time [Filter Cycles]; curves: Variance in Ship Position, Variance in Velocity, Variance in Landmark Position.]
Figure 4.10: The Error Covariance Matrix P. After just one iteration, the uncertainty about the ship's position has decreased massively. Then it increases slightly. In contrast, the uncertainty about the velocity has nearly fallen to zero.
Chapter 6
An Observation Strategy
In the previous chapter, we applied the Extended Kalman Filter approach to the SLAM problem. A problem with using the EKF is that it does not scale very well: the complexity is cubic in the number of features in the map. In this chapter, we will examine strategies to reduce the complexity to O(n²), where n is the number of features.
One of these strategies is to measure just a single feature instead of all visible ones. In [35] it is shown that this is sufficient for tracking. If we do so, we need to select the best feature based on a heuristic. In the following, we will refer to this heuristic as an observation strategy. It is adapted from Davison in [9] and [8].
In this chapter, we will first concentrate on ways to reduce the time complexity of one EKF cycle. This examination is chiefly based on [23]. Secondly, an appropriate heuristic is introduced to realise the selection of the best landmark. The two SLAM scenarios, the first with a single camera, the second with a stereo camera, are handled separately.
6.1 Complexity of the Kalman Filter
We will first examine the general time complexity of the Extended Kalman Filter algorithm in detail. Considering each step during one EKF cycle, we will introduce methods to reduce the cubic time complexity to O(n²). As a reminder, the appropriate equations are listed in Figure 6.1.
If we look at these equations, we can state that there are two major time-consuming operations: matrix multiplication and matrix inversion. If the matrix multiplication is carried out in a straightforward manner, its time complexity is O(n³) when multiplying n × n matrices. Matrix inversion also grows cubically with the number of visible and measured features.
In the case of the EKF, the maximal size of a matrix, here P, is (13 + 3n) × (13 + 3n), where n is the number of features. The matrix which will be inverted is the innovation covariance. It is of dimension (2l × 2l) or (3l × 3l),¹ where l denotes the number of visible and measurable features. Because the number of

1. The dimension of the measurements using a monocular camera is 2. If a stereo camera is used as a vision sensor, the measurement is three-dimensional.
1. Predict Step

   (a) Predict the state ahead.

       x⁻ₖ = f(xₖ₋₁, 0)

   (b) Predict the error covariance matrix ahead.

       P⁻ₖ = A Pₖ₋₁ A⊤ + W Q W⊤

2. Correct Step

   (a) Calculate the Kalman Gain.

       Kₖ = P⁻ₖ H⊤ (H P⁻ₖ H⊤ + V R V⊤)⁻¹

   (b) Correct the a priori state estimate.

       xₖ = x⁻ₖ + Kₖ (zₖ − h(x⁻ₖ, 0))

   (c) Correct the a priori error covariance matrix estimate.

       Pₖ = (I − Kₖ H) P⁻ₖ

Figure 6.1: Equations of one Extended Kalman Filter Cycle.
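The cycle in Figure 6.1 can be sketched numerically. The following is a minimal dense implementation (no SLAM-specific sparsity exploited yet); the function and variable names are illustrative, not part of the original text:

```python
import numpy as np

def ekf_cycle(x, P, f, h, A, W, Q, H, V, R, z):
    """One EKF cycle following Figure 6.1 (dense, cubic-cost version)."""
    # Predict step
    x_pred = f(x)                                 # x-_k = f(x_{k-1}, 0)
    P_pred = A @ P @ A.T + W @ Q @ W.T            # P-_k = A P A^T + W Q W^T
    # Correct step
    S = H @ P_pred @ H.T + V @ R @ V.T            # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))          # a priori state correction
    P_new = (np.eye(len(x)) - K @ H) @ P_pred     # covariance correction
    return x_new, P_new

# Hypothetical 2D linear system with a 1D position measurement
A = np.eye(2); W = np.eye(2); Q = 0.1 * np.eye(2)
H = np.array([[1.0, 0.0]]); V = np.eye(1); R = np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
x_new, P_new = ekf_cycle(x, P, lambda s: A @ s, lambda s: H @ s,
                         A, W, Q, H, V, R, z=np.array([1.0]))
```

For a linear model this reduces to the ordinary Kalman filter; the EKF simply evaluates A, W, H and V as Jacobians of f and h at the current estimate.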
measurable features cannot be larger than the number of known features, the overall complexity of the EKF is O((13 + 3n)³) = O(n³).

We can reduce this complexity to O(n²) by considering aspects specific to the SLAM problem. First of all, the process model affects just the state of the camera and the velocities, summarised in xv. The known features are not involved, and thus not the whole state of the system.
Secondly, usually just a small subset of the feature points can be measured at each point in time, due to the constraints of the viewing direction. In the following, we will explain this in detail, first for the predict step and then for the correct step.
6.1.1 Complexity of the Predict Step
In the predict step of the Kalman Filter, we predict the state x of the system as x⁻ and the related error covariance P as P⁻. The process model f relates the state at one point in time to the next. But, as already mentioned above, just the state of the camera and its velocities are affected. Thus, the Jacobian matrix A, containing the partial derivatives of the process model with respect to the state, is of the following form:

A = [ ∂fv/∂xv  0 ]
    [    0     I ]

where fv is the first part of the process model,

xv,new = fv(xv, w = 0) = (tWnew, qCWnew, vWnew, ωCWnew)⊤ = (rW + vW Δt, q(ωCW Δt) × qCW, vW, ωCW)⊤.
The detailed Jacobian matrix A can be found in Appendix A. The overall dimension of the state is m = 13 + 3n, where n is the number of 3D landmarks and 13 is the dimension of xv. Thus, A is an m×m Jacobian matrix, as is the error covariance matrix P. The block ∂fv/∂xv is of dimension 13×13.

Let us consider the first summand A Pₖ₋₁ A⊤ of the prediction of the error covariance matrix P⁻ₖ, and let the old Pₖ₋₁ be partitioned as

Pₖ₋₁ = [ P11  P12 ]
       [ P21  P22 ].
P11 is a covariance matrix of dimension 13×13 related to xv. P12 and P21 are of dimension 13×3n and 3n×13, respectively.² P22 is then a 3n×3n covariance matrix.
If we perform the matrix operation for A Pₖ₋₁ A⊤ explicitly, we obtain:

A Pₖ₋₁ A⊤ = [ ∂fv/∂xv  0 ] [ P11  P12 ] [ (∂fv/∂xv)⊤  0 ]
            [    0     I ] [ P21  P22 ] [     0       I ]

          = [ (∂fv/∂xv) P11 (∂fv/∂xv)⊤   (∂fv/∂xv) P12 ]
            [ ((∂fv/∂xv) P12)⊤           P22           ]
²Note that P12 is the transpose of P21 because of the symmetry of covariances.
Regarding the dimensions of the matrices, the term (∂fv/∂xv) P11 (∂fv/∂xv)⊤ can be evaluated with 2(13·13·13) multiplications. To solve (∂fv/∂xv) P12 we need 13·13·3n multiplications. ((∂fv/∂xv) P12)⊤ is just the transpose of the previous term and does not need to be evaluated again. Altogether, the number of multiplications to evaluate A Pₖ₋₁ A⊤ is 2(13·13·13) + 13·13·3n.
The second summand W Q W⊤ of the prediction function can be treated equivalently. The Jacobian matrix W contains the partial derivatives of the process model with respect to the process noise. It is of the following form:

W = [ ∂fv/∂VW   ∂fv/∂ΩCW ]
    [    0          0    ]

For the detailed matrix, see Appendix A. Since the process noise vector w is of dimension 6, W is an m×6 matrix. The blocks ∂fv/∂VW and ∂fv/∂ΩCW each carry 13×3 elements. The process noise does not affect the coordinates of the known features; thus, the corresponding elements of W are equal to zero.
The process noise covariance Q can be denoted by:

Q = [ Q11   0  ]
    [  0   Q22 ]

It is a 6×6 matrix and the blocks Q11 and Q22 are each of dimension 3×3. If we perform the matrix multiplication W Q W⊤ explicitly, we derive:

W Q W⊤ = [ ∂fv/∂VW   ∂fv/∂ΩCW ] [ Q11   0  ] [ (∂fv/∂VW)⊤    0 ]
         [    0          0    ] [  0   Q22 ] [ (∂fv/∂ΩCW)⊤   0 ]

       = [ (∂fv/∂VW) Q11 (∂fv/∂VW)⊤ + (∂fv/∂ΩCW) Q22 (∂fv/∂ΩCW)⊤   0 ]
         [                            0                             0 ]

Because no block whose size is related to the n known features is involved in the single non-zero block, the number of multiplications is independent of n: we need exactly 2(13·3·3) + 2(13·3·13) multiplications.
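The block-wise evaluation of the covariance prediction can be checked numerically. The sketch below, with stand-in random matrices for the Jacobian blocks (the dimensions 13, 6 and 3n follow the text; the block contents are hypothetical), confirms that the quadratic-cost block computation agrees with the full cubic-cost product:

```python
import numpy as np

rng = np.random.default_rng(0)
c, n = 13, 5                       # camera-state dimension and number of landmarks
m = c + 3 * n

# Random stand-ins for the Jacobian and covariance blocks
Fv = rng.standard_normal((c, c))               # dfv/dxv
Wv = rng.standard_normal((c, 6))               # [dfv/dV^W  dfv/dOmega^CW]
Q = np.diag(rng.random(6))                     # 6x6 process noise covariance
P = rng.standard_normal((m, m)); P = P @ P.T   # symmetric P_{k-1}

# Full (cubic-cost) evaluation with the zero-padded A and W
A = np.block([[Fv, np.zeros((c, m - c))],
              [np.zeros((m - c, c)), np.eye(m - c)]])
W = np.vstack([Wv, np.zeros((m - c, 6))])
P_full = A @ P @ A.T + W @ Q @ W.T

# Block-wise evaluation: only the blocks touching xv change
P11, P12, P22 = P[:c, :c], P[:c, c:], P[c:, c:]
top_left = Fv @ P11 @ Fv.T + Wv @ Q @ Wv.T     # 13x13 block incl. noise term
top_right = Fv @ P12                           # 13 x 3n block
P_block = np.block([[top_left, top_right],
                    [top_right.T, P22]])       # P21 Fv^T is just the transpose
```

The bottom-right 3n×3n block P22 is copied unchanged, which is exactly why the predict step stays linear in m.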
Thus, the overall cost of the predict step is linear in m.
6.1.2 Complexity of the Correct Step
Since just a few of all known features are visible to the camera sensor at each point in time, the Jacobian matrix H, containing the partial derivatives of the measurement model h with respect to the state, carries a large number of zeros. Let us assume that we measure just one feature yWi per time step. Then H is of the following form:

H = [ ∂h/∂xv   0   ∂h/∂yWi   0 ]

The detailed Jacobian matrix can be found in Appendix A. We know that the dimension of the state vector x is m = 13 + 3n. The dimension p of the measurement vector is either 2 or 3, depending on whether we use a single or a stereo camera. Thus, the whole matrix H is of dimension p×m. The block ∂h/∂xv carries p×13 elements, whereas ∂h/∂yWi is of dimension p×3.
To evaluate the Kalman Gain K, we need to perform the multiplication P⁻ₖ H⊤. For this purpose, P⁻ₖ is partitioned column-wise as

P⁻ₖ = [ P1   P01   P2   P02 ]   (6.1)

The block P1 contains m×13 and the block P2 m×3 elements. If we perform this multiplication explicitly, we obtain

P⁻ₖ H⊤ = [ P1  P01  P2  P02 ] [ (∂h/∂xv)⊤  ]
                              [     0      ]
                              [ (∂h/∂yWi)⊤ ]
                              [     0      ]

        = P1 (∂h/∂xv)⊤ + P2 (∂h/∂yWi)⊤.

The number of multiplications adds up to 16pm.
After evaluating P⁻ₖ H⊤, we need to derive the innovation covariance S. It is obtained by the equation S = H P⁻ₖ H⊤ + V R V⊤. We will first consider the first summand.

The result for P⁻ₖ H⊤ is an m×p matrix and is represented by

P⁻ₖ H⊤ = [ P′1  ]
         [ P′01 ]
         [ P′2  ]
         [ P′02 ]

where the block P′1 is a 13×p and P′2 a 3×p matrix. For the product H P⁻ₖ H⊤ we obtain

H P⁻ₖ H⊤ = [ ∂h/∂xv  0  ∂h/∂yWi  0 ] [ P′1  ]
                                     [ P′01 ]
                                     [ P′2  ]
                                     [ P′02 ]

          = (∂h/∂xv) P′1 + (∂h/∂yWi) P′2

The amount of multiplications is 16p², where p is either 2 or 3.

The second summand V R V⊤ can be simplified equivalently. R is the measurement error covariance of dimension p×p. The Jacobian matrix V contains the partial derivatives of the measurement model with respect to the measurement noise. Because the measurement noise vector is an additive constant in both SLAM scenarios, whether with a single or a stereo camera, V is an identity matrix regardless of the value of p. We have

V R V⊤ = R.

The overall amount of multiplications to calculate the innovation covariance is 16p².
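The sparsity of H makes the gain computation cheap. A numerical sketch with hypothetical small dimensions (random stand-ins for the Jacobian blocks), showing that the two-block product matches the dense P⁻ₖ H⊤ and that only a p×p matrix ever needs inverting:

```python
import numpy as np

rng = np.random.default_rng(1)
c, n, p = 13, 5, 2                 # camera dim, landmarks, measurement dim (mono)
m = c + 3 * n
i = 2                              # index of the single measured landmark

Hx = rng.standard_normal((p, c))   # dh/dxv
Hy = rng.standard_normal((p, 3))   # dh/dy^W_i
P = rng.standard_normal((m, m)); P = P @ P.T   # predicted covariance P-_k
R = np.diag(rng.random(p) + 0.1)   # measurement noise covariance

# Dense H: zero everywhere except the camera block and landmark i's block
H = np.zeros((p, m))
H[:, :c] = Hx
H[:, c + 3 * i : c + 3 * (i + 1)] = Hy

# Block-wise evaluation: P-_k H^T = P1 Hx^T + P2 Hy^T  (16*p*m multiplications)
P1 = P[:, :c]
P2 = P[:, c + 3 * i : c + 3 * (i + 1)]
PHt = P1 @ Hx.T + P2 @ Hy.T

S = H @ PHt + R                    # innovation covariance; V = I, so VRV^T = R
K = PHt @ np.linalg.inv(S)         # Kalman gain, only a p x p inverse needed
```

The inversion cost is p³ with p ∈ {2, 3}, negligible next to the 16pm block products.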
To evaluate the Kalman Gain K, we need to invert S. As already mentioned above, the complexity of matrix inversion grows cubically with the number of rows or columns of the considered square matrix. Here, we have a p×p matrix to invert; thus, we need p³ multiplications.

The whole amount of multiplications to calculate the Kalman Gain is therefore 16pm + 16p² + p³, which is linear in m.

Until now, the complexity of all equations, whether in the predict or the correct step, has been linear in m. The second equation of the correct step, updating the error covariance P, is responsible for the quadratic complexity.

Predict Step
  P⁻ₖ = A Pₖ₋₁ A⊤ + W Q W⊤               O(m) = O(13 + 3n) = O(n)

Correct Step
  Kₖ = P⁻ₖ H⊤ (H P⁻ₖ H⊤ + V R V⊤)⁻¹      O(m) = O(13 + 3n) = O(n)
  Pₖ = (I − Kₖ H) P⁻ₖ                    O(m²) = O((13 + 3n)²) = O(n²)

Table 6.1: Complexities for the Equations of one Extended Kalman Filter Cycle.

We have to evaluate the summand Kₖ H P⁻ₖ. We will first consider the product H P⁻ₖ. H is, as already stated above, represented by

H = [ ∂h/∂xv   0   ∂h/∂yWi   0 ]
The predicted error covariance matrix P⁻ₖ is partitioned row-wise as

P⁻ₖ = [ P1  ]
      [ P01 ]
      [ P2  ]
      [ P02 ].

Note that these blocks are not the same as in Equation (6.1), although they split up the same matrix P⁻ₖ. Here, P1 is of dimension 13×m and P2 carries 3×m elements. If we evaluate the product, we obtain

H P⁻ₖ = [ ∂h/∂xv  0  ∂h/∂yWi  0 ] [ P1  ]
                                  [ P01 ]
                                  [ P2  ]
                                  [ P02 ]

       = (∂h/∂xv) P1 + (∂h/∂yWi) P2
where the result is a p×m matrix; 16pm multiplications are needed. The last step is to multiply the Kalman Gain K with this result. Neither K, which is an m×p matrix, nor H P⁻ₖ contains a zero or identity block. Therefore, we derive an m×m matrix by performing pm² multiplications.

Thus, the time complexity of the correct step is O(m²), or, if we just consider the number of known features, O((13 + 3n)²) = O(n²). At the same time, this is the time complexity of one EKF cycle. The results presented in Section 6.1 are summarised in Table 6.1.
6.2 A Heuristic to Decide which Feature to Track
In the last section we presented methods to reduce the complexity of one EKFcycle by taking the particular structure of the SLAM problem into account.
For one of these methods it is assumed that we measure just one of the visible feature points per point in time. But if we do so, two questions arise:

Is it sufficient for the estimation of the state to measure just one feature?

Which of the several visible features is best to measure?

Considering the first question, Welch and Bishop [35] presented the SCAAT³ method, where it is shown that measuring a single landmark after each time step is sufficient to observe the 3D structure and motion of a scene over time.

In the case of 3D-SLAM, a single measurement of a 2D projection of a 3D landmark provides only partial or incomplete information about the whole state of the system, e.g., nothing about the (linear or angular) velocity of the camera and nothing about the depth of the 3D feature position. Systems operating just on such incomplete measurements are referred to as unobservable, because the whole system state cannot be inferred from them. Such systems must incorporate a sufficient set of these measurements to obtain observability. This can be achieved over space or over time; the latter is adopted by the SCAAT technique. It is based on the Extended Kalman Filter, where individual measurements providing incomplete information about the system's state are blended into a complete state estimate. The means for this blending is provided by the filter itself, and the resulting state estimate describes the blended information. Based on several experiments, SCAAT was shown to be accurate, stable, fast and flexible.
To answer the second question, we first need a criterion to rate the features. An intuitive idea is stated by Davison in [9]: the more uncertain we are about the 3D position of a feature, the more profitable it is to measure it. In other words, measurements of features that are difficult to predict provide more information about the position of this feature and of the camera than measurements of features which can be reliably predicted.
The innovation covariance S describes the uncertainty about each predicted measurement. Thus, it contains the basic information to decide which visible feature should be measured at each point in time. It is calculated as follows:

S = H P H⊤ + V R V⊤   (6.2)

where H and V are the Jacobian matrices of the measurement model h(x, 0) with respect to the state x and the measurement noise v, respectively. P is the error covariance matrix linked to the state and R is the measurement noise covariance.

S is the covariance of a multivariate Gaussian. Therefore, covariance matrices Si for each predicted measurement zi corresponding to a visible feature point yWi can be extracted from it. These smaller covariances refer to a Gaussian with the measurement zi as its mean. According to Whaite and Ferrie [36], depending on the measurement space, each Si can be represented either by an ellipse or an ellipsoid centred around the mean of the distribution. These are also referred to as ellipses or ellipsoids of confidence and represent the amount of uncertainty about the predicted measurement. In other words, we can be confident that
³Single Constraint At A Time
the real measurement is situated within the ellipse or ellipsoid. By calculating the surface area or volume of these objects, we can decide which predicted measurement is most uncertain.

Besides its role as a measure of the information content expected of a measurement, Si also defines a search region in which the according measurement zi should be located with high probability. Thus, once we have decided to measure a specific feature, we can send the parameters of the search region to the feature tracker. The advantages of this method are obvious: the feature tracker just needs to search a small region of interest instead of the whole picture, and the chances of a mismatch are reduced.
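Extracting the per-feature blocks Si from S is a simple slicing operation, since each visible feature contributes one p×p block on the diagonal of S. A sketch with a hypothetical diagonal S (the helper name is illustrative):

```python
import numpy as np

def innovation_blocks(S, p):
    """Split the innovation covariance S of l visible features into the
    l diagonal p x p blocks S_i (p = 2 for mono, p = 3 for stereo)."""
    l = S.shape[0] // p
    return [S[p * i : p * (i + 1), p * i : p * (i + 1)] for i in range(l)]

# Hypothetical innovation covariance for l = 3 features seen by a mono camera
S = np.diag([1.0, 2.0, 0.5, 0.5, 4.0, 3.0])
blocks = innovation_blocks(S, p=2)   # blocks[i] belongs to feature i
```

Each block, together with the corresponding predicted measurement, parameterises the search region handed to the feature tracker.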
In the previous chapter, we considered two SLAM cases: SLAM with a singlecamera and SLAM with a stereo camera. In the following sections, the heuristicis discussed in detail with respect to the different vision sensors.
6.2.1 Deriving the Innovation Covariance Matrix for SLAM with a Single Camera
In case we use a single camera, we predict two-dimensional measurements yIi for each visible three-dimensional feature yWi, referring to its 2D projection onto the image plane. Thus, if l features are visible, S is a 2l×2l matrix, and l 2×2 covariance matrices Si regarding the visible features can be extracted from it. These covariance matrices represent a two-dimensional normal distribution over image coordinates, whose mean is the predicted measurement yIi. The distribution can be visualised by an ellipse of confidence in the image: its centre corresponds to the mean, the directions of its axes are given by the eigenvectors of the covariance matrix, and the square roots of the according eigenvalues specify the deviation of the distribution along the axes.

According to [36], the surface area of the ellipse can be used as a measure of uncertainty. If a and b denote the lengths of the principal axes of the ellipse, the surface area A is calculated by

A = πab.
The standard deviation of a distribution describes the average deviation of the related Gaussian. The values of the whole distribution vary much more: possible realisations of the predicted measurement situated beyond the average deviation are just less probable, but should also be involved in the calculation of the amount of uncertainty and in the size of the search region.

Thus, we introduce the factor nσ and multiply the lengths of the principal axes of the ellipse by it. Consider the estimated measurement yIi with eigenvalues e1,i and e2,i of the according covariance matrix Si. Since both axes are scaled by nσ, the surface area of the demanded ellipse is

Ai = π nσ² √(e1,i e2,i).   (6.3)

The value for nσ should extend the standard deviation such that the probability for a measurement to be found within the considered region is approximately 100%. In [9], Davison chose nσ = 3. The probability that the possible realisations of a normally distributed random variable lie within the 3σ-region around the mean of the distribution is approximately 99% ([16], p. 1119).
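The ranking by ellipse area can be sketched directly from the eigenvalues of each Si. The helper below is illustrative (the covariances are hypothetical); scaling both axes by nσ multiplies the area by nσ², as in Equation (6.3):

```python
import numpy as np

def ellipse_area(S_i, n_sigma=3.0):
    """Area of the n_sigma confidence ellipse of a 2x2 innovation covariance:
    axes n_sigma*sqrt(e1) and n_sigma*sqrt(e2) give A = pi*n_sigma^2*sqrt(e1*e2)."""
    e1, e2 = np.linalg.eigvalsh(S_i)   # eigenvalues in ascending order
    return np.pi * n_sigma**2 * np.sqrt(e1 * e2)

# Rank hypothetical per-feature innovation covariances: the largest area marks
# the most uncertain prediction, i.e. the feature worth measuring
covs = [np.diag([1.0, 1.0]), np.diag([4.0, 0.25]), np.diag([9.0, 4.0])]
best = max(range(len(covs)), key=lambda i: ellipse_area(covs[i]))
```

Since nσ² and π are common monotone factors, the ranking itself only depends on the product of the eigenvalues, i.e. on det(Si).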
After calculating the amount of uncertainty about the predicted measurement of each visible 3D feature, we can rank them and send the parameters (predicted measurement and corresponding covariance matrix) of the landmark whose measurement is most difficult to predict to the feature tracker. The corresponding covariance matrix specifies the search region for the demanded feature measurement within the image, centred around the estimated measurement.
6.2.2 Deriving the Innovation Covariance Matrix for SLAM with a Stereo Camera
In the second SLAM scenario, we use a stereo camera to measure the visible features. For each, we derive a three-dimensional measurement vector. Thus, if l features are visible, l smaller 3×3 covariances Si, each referring to one of the predicted measurements of the visible features, can be extracted from the innovation covariance matrix S. As already mentioned for the two-dimensional case, these covariances are related to a normal distribution, and their means are the predicted measurements.

Considering one visible feature point yWi, the measurement vector for the SLAM scenario with a stereo camera consists of the image coordinates of the projection of this feature onto the left image plane, yIi = (xIl, yIl)⊤, and the disparity dI. The according innovation covariance matrix Si is therefore not defined over one of the image coordinate frames, as was the case when using a monocular vision sensor. It can be represented as an ellipsoid in the space spanned by xIl, yIl and dI. Analogous to the surface area of the ellipses, the volume of the ellipsoids can be seen as a measure of uncertainty. The volume of an ellipsoid with principal axes of lengths a, b and c is

V = (4/3) π abc.
where a, b and c are the lengths of its principal axes. If we substitute the squareroot of the eigenvalues for a, b and c and introduce the factor nσ again, we derivethe equation to calculate the volume of each Si:
V i =4
3πnσ
√e1,ie2,ie3,i
After calculating this volume for each ellipsoid, we are able to rank the visible3D feature points. The corresponding predicted measurement and innovationcovariance of the landmark whose measurement is most difficult to predict is sentto the feature tracker Centred around this prediction the covariance matrix
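The stereo case is the direct three-dimensional analogue of the ellipse ranking; a sketch with hypothetical covariances (all three axes are scaled by nσ, hence the nσ³ factor):

```python
import numpy as np

def ellipsoid_volume(S_i, n_sigma=3.0):
    """Volume of the n_sigma confidence ellipsoid of a 3x3 innovation
    covariance: V = (4/3) * pi * n_sigma^3 * sqrt(e1 * e2 * e3)."""
    e = np.linalg.eigvalsh(S_i)
    return (4.0 / 3.0) * np.pi * n_sigma**3 * np.sqrt(np.prod(e))

# Hypothetical covariances over (x_l, y_l, disparity); the feature with the
# largest confidence ellipsoid is handed to the feature tracker
covs = [np.diag([1.0, 1.0, 1.0]), np.diag([2.0, 2.0, 8.0])]
best = max(range(len(covs)), key=lambda i: ellipsoid_volume(covs[i]))
```

As in the monocular case, the ranking depends only on det(Si), so the constant factors may be dropped in an implementation.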