Transcript of: Using Multi-Modality to Guide Visual Tracking

Page 1:

Using Multi-Modality to Guide Visual Tracking

Jaco Vermaak

Cambridge University Engineering Department

Patrick Pérez, Michel Gangnet, Andrew Blake

Microsoft Research Cambridge

Paris, December 2002

Page 2:

Introduction

Visual tracking is difficult: changes in pose and illumination, occlusion, clutter, inaccurate models, high-dimensional state spaces, etc.

Tracking can be aided by combining information from multiple measurement modalities.

Illustrated here on head tracking using:
- Sound and contour measurements
- Colour and motion measurements

Page 3:

General Tracking

Page 4:

Tracking Equations

Objective: recursive estimation of the filtering distribution $p(\mathbf{x}_t \mid \mathbf{y}_{1:t})$, where $\mathbf{y}_{1:t} = (\mathbf{y}_1, \dots, \mathbf{y}_t)$.

General solution:

Prediction step:
$$p(\mathbf{x}_t \mid \mathbf{y}_{1:t-1}) = \int \underbrace{p(\mathbf{x}_t \mid \mathbf{x}_{t-1})}_{\text{dynamical prior}} \; \underbrace{p(\mathbf{x}_{t-1} \mid \mathbf{y}_{1:t-1})}_{\text{previous filtering}} \, d\mathbf{x}_{t-1}$$

Filtering/update step:
$$p(\mathbf{x}_t \mid \mathbf{y}_{1:t}) \propto \underbrace{L(\mathbf{y}_t \mid \mathbf{x}_t)}_{\text{likelihood}} \; \underbrace{p(\mathbf{x}_t \mid \mathbf{y}_{1:t-1})}_{\text{prediction}}$$

Problem: generally no analytic solutions available.

Page 5:

Particle Filter Tracking

Monte Carlo implementation of the general recursions. The filtering distribution is represented by samples/particles with associated importance weights:
$$p_N(d\mathbf{x}_t \mid \mathbf{y}_{1:t}) = \sum_{i=1}^{N} w_t^i \, \delta_{\mathbf{x}_t^i}(d\mathbf{x}_t)$$

Proposal step: new particles are proposed from a suitable proposal distribution:
$$\mathbf{x}_t^i \sim q(\mathbf{x}_t \mid \mathbf{x}_{t-1}^i, \mathbf{y}_t)$$

Reweighting step: particles are reweighted with importance weights:
$$w_t^i \propto w_{t-1}^i \, \frac{L(\mathbf{y}_t \mid \mathbf{x}_t^i) \, p(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i)}{q(\mathbf{x}_t^i \mid \mathbf{x}_{t-1}^i, \mathbf{y}_t)}$$

Resampling step: multiply particles with high importance weights and eliminate those with low importance weights.
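A minimal sketch of this propose/reweight/resample recursion in Python/NumPy. The dynamics, proposal and likelihood functions are placeholders to be supplied by the application; the names are illustrative, not from the slides.

```python
import numpy as np

def particle_filter_step(particles, weights, y, propose, trans_pdf,
                         proposal_pdf, likelihood, rng):
    """One propose / reweight / resample step of a generic particle filter.

    particles: (N, d) array of states at time t-1
    weights:   (N,) normalised importance weights at time t-1
    y:         measurement at time t
    """
    N = len(particles)
    # Proposal step: draw x_t^i ~ q(x_t | x_{t-1}^i, y_t)
    new_particles = np.array([propose(x, y, rng) for x in particles])
    # Reweighting step: w_t^i ∝ w_{t-1}^i * L(y|x) * p(x|x_prev) / q(x|x_prev, y)
    new_weights = weights * np.array([
        likelihood(y, x) * trans_pdf(x, x_prev) / proposal_pdf(x, x_prev, y)
        for x, x_prev in zip(new_particles, particles)
    ])
    new_weights /= new_weights.sum()
    # Resampling step: multiply high-weight particles, eliminate low-weight ones
    idx = rng.choice(N, size=N, p=new_weights)
    return new_particles[idx], np.full(N, 1.0 / N)
```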

Page 6:

Particle Filter Building Blocks

Sampling from a conditional density $q(\mathbf{x}' \mid \mathbf{x})$:
$$\{\mathbf{x}^i, w^i\} \;\longrightarrow\; \{\mathbf{x}'^i, w^i\}, \qquad \mathbf{x}'^i \sim q(\cdot \mid \mathbf{x}^i),$$
targeting $p'(\mathbf{x}') = \int q(\mathbf{x}' \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$.

Resampling:
$$\{\mathbf{x}^i, w^i\} \;\longrightarrow\; \{\mathbf{x}^{j(i)}, 1/N\}, \qquad j(i) \sim \mathcal{M}(N;\, w^1, \dots, w^N)$$

Reweighting with a positive function $h(\mathbf{x}) \geq 0$:
$$\{\mathbf{x}^i, w^i\} \;\longrightarrow\; \{\mathbf{x}^i, w^i\, h(\mathbf{x}^i)\},$$
targeting $h(\mathbf{x})\, p(\mathbf{x}) \big/ \int h(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$ (weights renormalised).

Page 7:

Particle Filter Implementation

Requires specification of:
- System configuration and state space
- Likelihood model
- Dynamical model for state evolution
- State proposal distribution
- Particle filter architecture

Page 8:

Head Tracking using Sound and Contour Measurements

Page 9:

Problem Formulation

Objective: track the head of a person in a video sequence using audio and image cues.
- Audio: time delay of arrival (TDOA) measurements at a microphone pair orthogonal to the optical axis of the camera
- Image: edge events along normal lines to a hypothesised contour

Complementary modalities: audio is good for (re)initialisation; image is good for fine localisation.

Page 10:

System Configuration

[Diagram: camera and image plane, with the microphone pair orthogonal to the optical axis]

Page 11:

Model Ingredients

Low-dimensional state space: a similarity transform (translation $(x, y)$, scale and rotation) applied to a reference template $\mathbf{r}(\mathbf{x})$.

Dynamical prior: integrated Langevin equation, i.e. a second-order Markov kernel:
$$p(\mathbf{x}_t \mid \mathbf{x}_{0:t-1}) = p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{x}_{t-2})$$

Multi-modal data likelihood:
$$p(\mathbf{y} \mid \mathbf{x}) \propto L_{\mathrm{TDOA}}(x) \, L_{\mathrm{EDGE}}(\mathbf{x})$$
- Sound-based likelihood $L_{\mathrm{TDOA}}$: TDOA at the microphone pair
- Contour-based likelihood $L_{\mathrm{EDGE}}$: edge events along normals to the template contour $\mathbf{r}(\mathbf{x})$

Page 12:

Contour Likelihood

Input: maxima of the projected luminance gradient along normals to the hypothesised contour $\mathbf{r}(\mathbf{x})$ ($N_j$ such edge events on normal $j$, at distances $d_{1,j}, d_{2,j}, \dots$ from the contour):
$$L_{\mathrm{EDGE}}(\mathbf{y} \mid \mathbf{x}) \propto \prod_j \left[ q_0 + \frac{1 - q_0}{N_j} \sum_{i=1}^{N_j} \mathcal{N}(d_{i,j};\, 0,\, \sigma_c^2) \right]$$

where $q_0$ accounts for the true edge not being detected on a normal and $\sigma_c$ is the scale of the edge localisation noise.
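A small numerical sketch of this mixture-plus-clutter likelihood, assuming the edge-event distances have already been measured along each normal; the values of q0 and sigma_c are illustrative, not from the slides.

```python
import numpy as np

def edge_likelihood(distances_per_normal, q0=0.2, sigma_c=3.0):
    """Contour likelihood from edge events along normals.

    distances_per_normal: list of 1-D arrays; entry j holds the signed
    distances d_{i,j} of the edge events detected on normal j.
    """
    L = 1.0
    for d in distances_per_normal:
        d = np.asarray(d, dtype=float)
        if d.size == 0:
            L *= q0                      # no event detected on this normal
            continue
        gauss = np.exp(-0.5 * (d / sigma_c) ** 2) / (np.sqrt(2 * np.pi) * sigma_c)
        L *= q0 + (1.0 - q0) * gauss.mean()
    return L
```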

Page 13:

Contour Likelihood

Advantages:
- Low computational cost
- Robust to illumination changes

Drawbacks:
- Fragile because of narrow support (especially with only a similarity transform on a fixed shape space)
- Sensitive to background clutter

Extension:
- Multiply the luminance gradient by the inter-frame difference (roughly $\|\nabla I_t\| \cdot |I_t - I_{t-1}|$) before extracting edge maxima, to reduce the influence of background clutter

Page 14:

Inter-Frame Difference

Without frame difference With frame difference

Page 15:

Audio Likelihood

Input: positions of the peaks in the generalised cross-correlation function (GCCF). Reverberation leads to multiple peaks.

[Figure: GCCF with candidate TDOA peaks $d_1, \dots, d_N$ and the resulting likelihood $L_{\mathrm{TDOA}}(x)$ over the image X-coordinate]

Page 16:

Audio Likelihood

Deterministic mapping $G: d \mapsto x$ from the time delay of arrival (TDOA) to a bearing angle (microphone calibration) and then to the X-coordinate in the image plane (camera calibration).

The audio likelihood follows in a similar manner to the contour likelihood, again assuming a uniform clutter model:
$$L_{\mathrm{TDOA}}(x) \propto q_0 + \frac{1 - q_0}{N} \sum_{i=1}^{N} \mathcal{N}(G(d_i);\, x,\, \sigma_s^2)$$

where $d_1, \dots, d_N$ are the TDOA candidates given by the GCCF peaks.
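A sketch of this likelihood evaluated in image X-coordinates, assuming a calibration function G that maps a TDOA value to an image X-position; G, q0 and sigma_s are placeholders.

```python
import numpy as np

def tdoa_likelihood(x, tdoa_peaks, G, q0=0.2, sigma_s=10.0):
    """Audio likelihood of head X-position x given GCCF peak delays.

    G: callable mapping a TDOA (seconds) to an image X-coordinate,
    obtained from microphone and camera calibration.
    """
    if len(tdoa_peaks) == 0:
        return q0                              # no peaks: clutter only
    xs = np.array([G(d) for d in tdoa_peaks])  # map candidate delays to image plane
    gauss = np.exp(-0.5 * ((xs - x) / sigma_s) ** 2) / (np.sqrt(2 * np.pi) * sigma_s)
    return q0 + (1.0 - q0) * gauss.mean()
```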

Page 17:

Particle Filter Architecture

Layered sampling: first sample the X-position and reweight with the sound likelihood, then sample the remaining components and reweight with the contour likelihood:
$$\text{sample from } q_X \;\rightarrow\; \text{reweight with } L_{\mathrm{TDOA}} \;\rightarrow\; \text{sample from } p_Y \;\rightarrow\; \text{reweight with } L_{\mathrm{EDGE}}$$

X-position proposal: a mixture of diffusion dynamics and a sound-based proposal:
$$q_X(x_t \mid x_{t-1}, \mathbf{y}_t) = \alpha \, q_{\mathrm{TDOA}}(x_t \mid \mathbf{y}_t) + (1 - \alpha) \, p_{\mathrm{LANG}}(x_t \mid x_{t-1}),$$
$$q_{\mathrm{TDOA}}(x \mid \mathbf{y}) \propto \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(x;\, G(d_i),\, \sigma^2)$$

To admit "jumps" from the proposal, the X-dynamics have to be augmented with a uniform component:
$$p_X(x_t \mid x_{t-1}) = (1 - \beta) \, p_{\mathrm{LANG}}(x_t \mid x_{t-1}) + \beta \, U_X(x_t)$$
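A sketch of the X-position proposal as a two-component mixture; the mixture weight and noise scales are illustrative, and a first-order diffusion stands in for the integrated Langevin dynamics.

```python
import numpy as np

def propose_x(x_prev, tdoa_peaks, G, rng, alpha=0.5, sigma_tdoa=10.0, sigma_lang=5.0):
    """Sample a new X-position from a mixture of sound proposal and diffusion dynamics."""
    if len(tdoa_peaks) > 0 and rng.random() < alpha:
        # Sound-based component: Gaussian around a randomly chosen mapped TDOA peak
        d = tdoa_peaks[rng.integers(len(tdoa_peaks))]
        return rng.normal(G(d), sigma_tdoa)
    # Dynamics component: simple diffusion stand-in for the Langevin step
    return rng.normal(x_prev, sigma_lang)
```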

Page 18:

Examples Effect of inter-frame difference:

Conversational ping-pong:

Page 19:

Examples Conversational ping-pong and sound based reinitialisation:

Page 20:

Head Tracking using Colour and Motion Measurements

Page 21:

Problem Formulation

Objective: detect and track the head of a single person in a video sequence taken from a stationary camera.

Modality fusion:
- Motion and colour measurements are complementary
- Motion: when the object is moving, colour is unreliable
- Colour: when the object is stationary, motion information disappears

Automatic object detection and tracker initialisation using motion measurements.

Individualisation of the colour model to the object:
- Initialised with a generic skin colour model
- Adapted to the object colour during periods of motion: the motion model acts as "anchor"

Page 22:

Object Description and Motion

The head is modelled as an ellipse that is free to translate and scale in the image. A binary indicator variable $r \in \{0, 1\}$ signals whether the object is present in the image or not, so the object state becomes:
$$\mathbf{x} = (x, y, s, r)$$

State components are assumed to have independent motion models:
- Indicator: a discrete Markov chain
- Position and scale: Langevin motion, with uniform initialisation when the object first appears:

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, r_t, r_{t-1}) = \begin{cases} \text{undefined} & \text{if } r_t = 0 \\ p_L(\mathbf{x}_t \mid \mathbf{x}_{t-1}) & \text{if } r_t = 1 \text{ and } r_{t-1} = 1 \\ U_R(\mathbf{x}_t) & \text{if } r_t = 1 \text{ and } r_{t-1} = 0 \end{cases}$$
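A sketch of sampling these dynamics for one particle, with an illustrative birth/death transition for the indicator and a first-order random walk standing in for the Langevin motion; all constants are assumptions.

```python
import numpy as np

def sample_dynamics(state, r_prev, rng, p_birth=0.05, p_death=0.05,
                    sigma_xy=4.0, sigma_s=0.02, image_size=(320, 240)):
    """Propagate (x, y, s, r) for one particle: Markov-chain indicator,
    Langevin-style motion when alive, uniform (re)initialisation at birth."""
    # Indicator: discrete Markov chain
    r = (rng.random() < p_birth) if r_prev == 0 else (rng.random() >= p_death)
    if not r:
        return None, 0          # object absent: continuous state undefined
    if r_prev == 1:
        x, y, s = state
        # Random-walk stand-in for the integrated Langevin motion
        return (x + rng.normal(0, sigma_xy),
                y + rng.normal(0, sigma_xy),
                s * np.exp(rng.normal(0, sigma_s))), 1
    # Birth: uniform initialisation over the image region
    w, h = image_size
    return (rng.uniform(0, w), rng.uniform(0, h), rng.uniform(0.5, 1.5)), 1
```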

Page 23:

Image Measurements

Measurements are taken on a regular grid of isotropic Gaussian filters applied to the hue image, the saturation image and the frame-difference image.

Measurement vector at gridpoint $i$:
$$\mathbf{y}^i = (H^i, S^i, D^i), \qquad \mathbf{y} = (\mathbf{y}^1, \dots, \mathbf{y}^G)$$

Page 24:

Observation Likelihood Model

- Measurements at gridpoints are assumed to be independent
- A unique background (object absent) likelihood model for each gridpoint
- All gridpoints covered by the object share the same foreground likelihood model:

$$L(\mathbf{y} \mid \mathbf{x}) = \prod_{i \in \mathcal{G}(\mathbf{x})} L_F(\mathbf{y}^i) \prod_{i \notin \mathcal{G}(\mathbf{x})} L_B^i(\mathbf{y}^i) \;\propto\; \prod_{i \in \mathcal{G}(\mathbf{x})} \frac{L_F(\mathbf{y}^i)}{L_B^i(\mathbf{y}^i)}$$

where $\mathcal{G}(\mathbf{x})$ is the set of gridpoints covered by the object.

At each gridpoint the measurements are also assumed to be independent:
$$L_F(\mathbf{y}^i) = L_{FH}(H^i)\, L_{FS}(S^i)\, L_{FM}(D^i), \qquad L_B^i(\mathbf{y}^i) = L_{BH}^i(H^i)\, L_{BS}^i(S^i)\, L_{BM}(D^i)$$

Note that the background motion model $L_{BM}$ is shared by all the gridpoints.
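A sketch of evaluating this likelihood as a product of foreground/background ratios over the gridpoints covered by the hypothesised ellipse. The per-cue likelihood functions and the coverage test are placeholders.

```python
import numpy as np

def observation_likelihood(state, grid_xy, measurements, covered, L_F, L_B):
    """Likelihood of measurements y = {(H_i, S_i, D_i)} given the object state.

    covered(state, gx, gy) -> bool : is gridpoint inside the hypothesised ellipse?
    L_F(y_i), L_B(i, y_i)          : foreground / per-gridpoint background models.
    Only the ratio over covered gridpoints matters (the rest is constant in x).
    """
    log_ratio = 0.0
    for i, (gx, gy) in enumerate(grid_xy):
        if covered(state, gx, gy):
            y_i = measurements[i]
            log_ratio += np.log(L_F(y_i)) - np.log(L_B(i, y_i))
    return np.exp(log_ratio)
```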

Page 25:

Colour Likelihood Model

Normalised histograms are used for both the foreground and background colour likelihood models:
$$L(c) = \gamma_{b(c)}$$
- $c$: colour measurement
- $b(c)$: histogram bin index corresponding to the measurement
- $\gamma_b$: normalised count for bin $b$

- Background models trained on a sequence without objects
- Foreground models trained on a set of labelled face images
- Histogram models are supplied with a small uniform component to prevent numerical problems associated with empty bins
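A sketch of a normalised-histogram colour likelihood with a small uniform floor against empty bins; the bin count and floor weight are illustrative.

```python
import numpy as np

class ColourHistogramModel:
    """Normalised 1-D histogram likelihood with a small uniform component."""

    def __init__(self, samples, n_bins=32, value_range=(0.0, 1.0), eps=0.01):
        counts, self.edges = np.histogram(samples, bins=n_bins, range=value_range)
        hist = counts / max(counts.sum(), 1)
        # Mix with a uniform component so empty bins never give zero likelihood
        self.gamma = (1 - eps) * hist + eps / n_bins

    def likelihood(self, c):
        b = np.clip(np.searchsorted(self.edges, c) - 1, 0, len(self.gamma) - 1)
        return self.gamma[b]
```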

Page 26:

Motion Likelihood Model

Background frame-difference measurements were empirically found to be gamma distributed:
$$L_{BM}(D^i) \propto (D^i)^{a-1} \exp(-b\, D^i)$$

The foreground frame-difference depends on the magnitude of the motion, the number and orientation of foreground edges, etc. Modelling these effects accurately is difficult. In general, if the object is moving, foreground frame-difference measurements are substantially larger than those for the background. Thus a two-component uniform distribution (outlier model) is adopted for the foreground frame-difference measurements.
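A sketch of these two motion models: a gamma background density and a two-component uniform foreground ("outlier") density. The parameter values are illustrative placeholders, not fitted values from the slides.

```python
import numpy as np
from scipy.stats import gamma

def background_motion_likelihood(D, a=1.5, b=0.5):
    """Gamma density for background frame-difference measurements."""
    return gamma.pdf(D, a, scale=1.0 / b)

def foreground_motion_likelihood(D, d_max=255.0, p_large=0.8, threshold=20.0):
    """Two-component uniform outlier model: most foreground frame differences
    are large, but a small mass is spread over the whole measurement range."""
    small = (1 - p_large) / d_max
    large = p_large / (d_max - threshold) if D >= threshold else 0.0
    return small + large
```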

Page 27:

Particle Proposal

Three stages of operation:
- Birth: the object first enters the scene; the proposal should detect the object and spawn particles in the object region
- Alive: the object persists in the scene; the proposal should allow the object to be tracked, whether it is stationary or moves around
- Death: the object leaves the scene; the proposal should kill particles associated with the object

Form of the particle proposal:
$$q(r', \mathbf{z}' \mid r, \mathbf{z}, \mathbf{y}) = q(r' \mid r, P_a)\; q(\mathbf{z}' \mid \mathbf{z}, r', r, \mathbf{y}), \qquad \mathbf{z} = (x, y, s), \quad P_a = \frac{1}{N}\sum_{i=1}^{N} r^{(i)}$$

where $P_a$ is the empirical probability of the object being alive.

Page 28:

Particle Proposal

Indicator proposal:
- Birth is only allowed if there is no object currently in the scene
- All particles that are alive are subjected to a fixed death probability
$$q(r' = 1 \mid r = 0) = P_{\text{birth}}\,(1 - P_a), \qquad q(r' = 0 \mid r = 1) = P_{\text{death}}$$

State proposal:
- Langevin dynamics if the object is already alive
- Gaussian birth proposal, with parameters from the detection module:
$$q(\mathbf{z}' \mid \mathbf{z}, r', r, \mathbf{y}) = \begin{cases} \text{undefined} & \text{if } r' = 0 \\ p_L(\mathbf{z}' \mid \mathbf{z}) & \text{if } r = 1 \text{ and } r' = 1 \\ \mathcal{N}(\mathbf{z}';\, \hat{\boldsymbol{\mu}},\, \hat{\boldsymbol{\Sigma}}) & \text{if } r = 0 \text{ and } r' = 1 \end{cases}$$
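A sketch of this indicator-plus-state proposal for one particle, using placeholder birth/death probabilities, a random-walk stand-in for the Langevin move, and a detection module that returns a Gaussian birth region (mu_hat, Sigma_hat).

```python
import numpy as np

def propose_particle(z, r, p_alive, detect, rng, p_birth=0.1, p_death=0.05,
                     sigma_lang=(4.0, 4.0, 0.02)):
    """Propose (z', r') for one particle of the colour/motion tracker.

    z: (x, y, s) or None, r: 0/1 indicator, p_alive: empirical alive probability,
    detect(): returns (mu_hat, Sigma_hat) for the Gaussian birth proposal.
    """
    if r == 0:
        # Birth only if the particle set does not already believe an object is present
        if rng.random() < p_birth * (1.0 - p_alive):
            mu_hat, Sigma_hat = detect()
            return rng.multivariate_normal(mu_hat, Sigma_hat), 1
        return None, 0
    # Alive particle: fixed death probability, otherwise Langevin-style move
    if rng.random() < p_death:
        return None, 0
    return np.asarray(z) + rng.normal(0, sigma_lang), 1
```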

Page 29:

Object Detection

The object region is detected by probabilistic segmentation of the horizontal and vertical projections of the frame-difference measurements. The region location and size determine the parameters of the birth proposal distribution.

Page 30:

Colour Model Adaptation

Why:
- A generic skin colour model may be too broad for accurate localisation
- The model is sensitive to colour changes due to changes in pose and illumination

When:
- Object present and moving: the largest variations in colour are expected
- The motion likelihood "anchors" the particles around the moving object

How:
- Gradual, to avoid fitting to the background: enforced with a prior
- Stochastic EM: the contribution of each particle is proportional to its likelihood

Page 31:

Colour Model Adaptation

Unknown parameters $\boldsymbol{\theta}_t$: the normalised bin values of the object hue and saturation histograms.

EM Q-function for MAP estimation:
$$Q(\boldsymbol{\theta}_t, \hat{\boldsymbol{\theta}}_{t-1}) = E_{p(\mathbf{x}_t \mid \mathbf{y}_{1:t}, \hat{\boldsymbol{\theta}}_{t-1})}\!\left[ \log L(\mathbf{y}_t \mid \mathbf{x}_t, \boldsymbol{\theta}_t) \right] + \underbrace{\log p(\boldsymbol{\theta}_t \mid \hat{\boldsymbol{\theta}}_{t-1})}_{\text{dynamical prior}}$$

No analytic solution, but the particle approximation yields:
$$Q_N(\boldsymbol{\theta}_t, \hat{\boldsymbol{\theta}}_{t-1}) = \sum_{i=1}^{N} w_t^i \log L(\mathbf{y}_t \mid \mathbf{x}_t^i, \boldsymbol{\theta}_t) + \log p(\boldsymbol{\theta}_t \mid \hat{\boldsymbol{\theta}}_{t-1})$$

The Monte Carlo approximation is only performed over particles that are currently alive.

Page 32:

Colour Model Adaptation

A Dirichlet prior is used for the parameter updates:
$$p(\boldsymbol{\theta}_t \mid \boldsymbol{\theta}_{t-1}) = \mathrm{Di}(\boldsymbol{\theta}_t \mid C\, \boldsymbol{\theta}_{t-1})$$

- The prior is centred on the old parameter values
- The variance is controlled by the multiplicative constant $C$

The update rule for the normalised bin counts becomes:
$$\theta_{t,j} \propto \beta_j - 1 + \sum_{i=1}^{N} w^i\, n^i_j$$

where $\beta_j$ is the Dirichlet prior parameter for bin $j$ (centred on $C\,\hat{\theta}_{t-1,j}$) and $n^i_j$ is the $j$-th bin count for the $i$-th particle.
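A sketch of one such MAP-style update of the normalised histogram, assuming the Dirichlet prior is centred on the previous histogram with strength C and that the per-particle bin counts are weighted by likelihood-proportional particle weights. The exact normalisation is a reconstruction, so treat it as illustrative.

```python
import numpy as np

def adapt_histogram(theta_prev, bin_counts, weights, C=200.0):
    """One colour-model adaptation step.

    theta_prev: (J,) previous normalised bin values (sums to 1)
    bin_counts: (N, J) per-particle bin counts n_j^i inside each particle's region
    weights:    (N,) particle weights, proportional to their likelihoods
    C:          Dirichlet strength; larger C means slower adaptation
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    prior = C * np.asarray(theta_prev)                 # Dirichlet prior parameters
    counts = weights @ np.asarray(bin_counts, float)   # expected bin counts under particles
    unnorm = np.maximum(prior - 1.0 + counts, 1e-12)   # MAP-style numerator, kept positive
    return unnorm / unnorm.sum()
```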

Page 33:

What Happens?

[Figure: (1) per-particle histograms; (2) the weighted average histogram obtained after adaptation]

Page 34:

Implementation

- Colour model adaptation iterations occur between particle prediction and particle reweighting in the standard particle filter
- The stochastic EM algorithm is initialised with the parameters from the previous time step
- A single stochastic EM iteration is sufficient at each time step
- The number of particles is fixed to 100
- The non-optimised algorithm runs at 15 fps on a standard desktop PC

Page 35:

Examples

No adaptation: tracker gets stuck on skin-coloured carpet in the background

Adaptation: tracker successfully adapts to changes in pose and illumination and lock is maintained

No motion likelihood: tracker fails, illustrating need for “anchor” likelihood

Page 36:

Examples

Tracking is successful despite substantial variations in pose and illumination and the subject temporarily leaving the scene

Particles are killed when the subject leaves the scene; upon re-entering the individualised colour model allows lock to be re-established within a few frames

Page 37:

The End