Motion and Tracking
Eng-Jon Ong, University of Surrey
Introduction
What objects have been tracked? Many kinds of objects have been tracked in the past:
– Whole objects: cars, bicycles, human bodies. (Source: YouTube, “Intelligent Traffic Surveillance”)
– Medium-level features: heads, hands, small objects, etc.
– Fine-level features: facial feature points, finger positions, etc.
Overview
The task of visual tracking involves locating the position of a tracked target by a combination of features and motion models.
There is a strong relationship between the tasks of object detection and tracking.
[Figure: tracking combines a visual model + detector with a motion model.]
Overview
One can think of tracking as detection constrained by a motion model. Running the detector over the whole image tends to be expensive.
Overview
– Introduction
– Object models
– Simple search strategies
– Using linear dynamics
– Optimisation search strategies
– Summary
Object Models and Evaluation
Representation of Tracked Objects
The first question: how do we computationally represent an object we want to track? Options include:
– An image template
– A combination of low-level information (e.g. lines)
– Contour information
Evaluation of different models’ “fitness”
We need a measure of model fitness on an image given a set of parameters (e.g. position + scale). For images, we have template matching using different scores: the most basic is the sum of squared pixel differences (SSD); normalised cross-correlation (NCC) additionally normalises each patch for its mean brightness and contrast.
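A minimal sketch of the two scores, assuming grayscale NumPy arrays of equal shape for the image patch and the template. SSD is 0 for a perfect match and grows as the match worsens; NCC is 1 for a perfect match even when the patch is a brighter, higher-contrast copy of the template.

```python
import numpy as np

def ssd(patch: np.ndarray, template: np.ndarray) -> float:
    """Sum of squared pixel differences: 0 for a perfect match, larger is worse."""
    diff = patch.astype(float) - template.astype(float)
    return float(np.sum(diff ** 2))

def ncc(patch: np.ndarray, template: np.ndarray) -> float:
    """Normalised cross-correlation in [-1, 1]: 1 means the patch matches the
    template up to a change of brightness (offset) and contrast (gain)."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    denom = np.sqrt(np.sum(p ** 2) * np.sum(t ** 2))
    return float(np.sum(p * t) / denom) if denom > 0 else 0.0

template = np.array([[10, 20], [30, 40]])
brighter = 2 * template + 5               # same pattern, different gain and offset
print(ssd(template, template))            # 0.0
print(ssd(brighter, template))            # 4100.0: SSD penalises the lighting change
print(round(ncc(brighter, template), 6))  # 1.0: NCC ignores it
```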
Evaluation of different models’ “fitness”
There are more sophisticated methods for matching a template to an image: boosted detectors are a popular choice. Boosting is a method that combines a set of very simple (weak) object detectors to yield a strong detector.
Boosted Cascade
[Figure: a chain of cascade layers 1, 2, 3, ..., n. At each layer about 90% of candidate windows are rejected and 10% pass to the next layer; windows that pass layer n are reported as “Face detected”.]
Boosted Cascade
Weak classifiers per layer: Layer 1: 2; Layer 2: 5; Layer 3: 5; Layer 4: 20; Layer 5: 50; Layer 6: 50; Layer 7: 128; Layer 8: 132; Layer 9: 100.
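A minimal sketch of why the cascade is cheap: each layer is a weighted vote of weak classifiers compared against a layer threshold, and a window rejected by an early, small layer never reaches the later, larger ones. The decision-stump weak classifiers, weights and thresholds below are hypothetical placeholders for what boosting would learn.

```python
from typing import Callable, List, Sequence, Tuple

Window = Sequence[float]                    # features of one candidate window
WeakClassifier = Callable[[Window], float]  # returns +1 (object) or -1 (background)
Layer = Tuple[List[Tuple[WeakClassifier, float]], float]  # ([(h, alpha), ...], threshold)

def cascade_classify(window: Window, layers: List[Layer]) -> Tuple[bool, int]:
    """Return (accepted, number of layers evaluated)."""
    for i, (weak_learners, threshold) in enumerate(layers, start=1):
        score = sum(alpha * h(window) for h, alpha in weak_learners)  # boosted vote
        if score < threshold:
            return False, i                 # rejected: later layers never run
    return True, len(layers)

def stump(index: int, thresh: float) -> WeakClassifier:
    """Hypothetical decision-stump weak classifier on one feature."""
    return lambda w: 1.0 if w[index] > thresh else -1.0

layers = [
    ([(stump(0, 0.5), 1.0), (stump(1, 0.5), 1.0)], 0.0),  # small, cheap first layer
    ([(stump(2, 0.5), 1.0)], 0.0),
]
print(cascade_classify([0.9, 0.8, 0.7], layers))  # (True, 2)
print(cascade_classify([0.1, 0.2, 0.9], layers))  # (False, 1)
```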
Detecting and Tracking Humans in Images
Constrained Detection: Simple Search Strategies
Simple Tracking Strategies
Detection/global search. Goal: where to place the contour on the image?
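Simple Tracking Strategies
[Figure: a contour with control points (x1,y1) to (x4,y4) and unit normals n̂1 to n̂4; plots of the intensity I and its derivative dI/dn along a normal n.]
Contours and costs:
– Search along the contour normals for edges
– Move the contour in x, y, scale and rotation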
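A minimal sketch of the search along one normal, assuming a grayscale NumPy image indexed as image[row, col], a control point given as (x, y) and nearest-neighbour sampling. The strongest edge is taken where |dI/dn| peaks, and the returned signed offset says how far to move that control point along its normal.

```python
import numpy as np

def best_edge_along_normal(image, point, normal, search_range=5):
    """Return the signed offset t (pixels along the unit normal) of the
    strongest edge near `point`, i.e. where |dI/dn| is largest."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)                    # unit normal
    p = np.asarray(point, dtype=float)           # (x, y) = (col, row)
    offsets = np.arange(-search_range, search_range + 1)
    profile = []
    for t in offsets:
        x, y = p + t * n
        col = int(np.clip(round(x), 0, image.shape[1] - 1))  # nearest-neighbour sample
        row = int(np.clip(round(y), 0, image.shape[0] - 1))
        profile.append(float(image[row, col]))
    dI_dn = np.diff(profile)                     # finite-difference derivative along n
    k = int(np.argmax(np.abs(dI_dn)))
    # The edge lies between samples k and k + 1; report the midpoint.
    return float(offsets[k] + offsets[k + 1]) / 2.0

# Usage: a vertical step edge at column 12, control point (x=10, y=10), normal along +x.
img = np.zeros((20, 20))
img[:, 12:] = 100.0
print(best_edge_along_normal(img, (10, 10), (1, 0)))  # 1.5
```

Moving every control point by its own offset, and then fitting x, y, scale and rotation to those moves, gives one update of the contour.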
Evaluation of different models’ “fitness”
For lines and contours, we can use distances to the nearest edges. But different configurations of contour searches can give different results. Run demos: 3tracescanline.exe, 4tracescanlinelong.exe
Simple Tracking Strategies
Global search:
– If the parameter space of the search is low-dimensional, a simple global search of the image is sufficient
– Not practical for most applications
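A minimal sketch of a global search over translation only (two parameters), scoring every placement of a template with SSD. It needs (H - h + 1)(W - w + 1) evaluations, and adding scale and rotation multiplies that count further, which is why it is impractical for most applications.

```python
import numpy as np

def global_search(image, template):
    """Exhaustive search over all translations: score every placement of the
    template with SSD and return the best (row, col) and its cost."""
    H, W = image.shape
    h, w = template.shape
    t = template.astype(float)
    best_cost, best_pos = np.inf, None
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            cost = np.sum((image[r:r + h, c:c + w].astype(float) - t) ** 2)
            if cost < best_cost:
                best_cost, best_pos = cost, (r, c)
    return best_pos, float(best_cost)

# Usage: hide a 3x3 template in an empty 30x40 image and find it again.
template = np.arange(1, 10, dtype=float).reshape(3, 3)
image = np.zeros((30, 40))
image[7:10, 22:25] = template
print(global_search(image, template))  # ((7, 22), 0.0)
```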
Detecting and Tracking Humans in Images
We can track using global search alone if the detectors are fast enough.
Iterative Tracking
Most tracking schemes work on the assumption that an object will make small incremental movements between frames.
Under this assumption, only a local search is required to update the model parameters.
Tracking is typically posed as a 2-step process:
– Initialisation (global/detection)
– Iteration (local)
Iterative Tracking Example 1
Assume the initial position is known
Assume the object won’t move far
Search locally to find the movement that maximises some fitness function
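A minimal sketch of the local step, assuming the fitness is expressed as a cost to minimise (maximising a fitness is the same as minimising its negative) and that parameters move in whole steps. Starting from the previous frame's parameters, it tries a small move in each parameter and keeps the best improving one until no neighbour is better. The quadratic cost in the usage line is a hypothetical stand-in for an image-based score.

```python
def local_search(cost, start, step=1, max_iters=100):
    """Greedy local search around `start` (e.g. last frame's parameters)."""
    current = tuple(start)
    current_cost = cost(current)
    for _ in range(max_iters):
        neighbours = []
        for i in range(len(current)):        # nudge each parameter by +/- step
            for d in (-step, step):
                cand = list(current)
                cand[i] += d
                neighbours.append(tuple(cand))
        best = min(neighbours, key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:
            break                            # no better neighbour: local optimum
        current, current_cost = best, best_cost
    return current, current_cost

# Usage: hypothetical cost with its minimum at (x, y) = (5, -3), starting from (0, 0).
print(local_search(lambda p: (p[0] - 5) ** 2 + (p[1] + 3) ** 2, (0, 0)))  # ((5, -3), 0)
```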
Iterative Tracking Example 2
Again, this:
– requires good initialisation
– relies on small inter-frame movements
Example of contour tracking failing due to indistinct edges
A better example of tracking, but highly susceptible to initialisation
Increasing the range of the local search makes the tracker more tolerant of poor initialisation but decreases tracking performance
Demos: 1BadContour.exe, 2BetterContour.exe, 4TraceScanLineLong.exe
Constrained Detection: Optimisation Search Strategies
Tracking as an Optimisation Problem
Tracking can be thought of as an optimisation where some cost function represents how well a model fits an image.
Model fitting is done by attempting to find the model parameters that minimise/maximise this cost function
This can be done at each frame to track objects through a video sequence
Using Gradient Descent
The previous approaches of iteratively refining a model by a local search are effectively a gradient descent optimisation
This will only work if the initial pose of the model is very close to the ideal position, as energy surfaces typically have many local minima
[Figure: cost plotted against a single parameter.]
Using Gradient Descent
Energy surfaces are typically very complex and impossible to visualise due to their high dimensionality
In the figure there is one global minimum but many local minima that are almost as good
Unless our model is very close to the ideal location, a gradient descent approach will converge on a local minimum and get trapped
We’ve already seen this in action on the contour tracker
[Figure: cost plotted against a single parameter.]
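A minimal sketch of that trap on a hypothetical 1-D cost, 0.05v² - cos(v), which has its global minimum at v = 0 and shallower local minima near v ≈ ±5.68. The same gradient descent reaches the global minimum from v0 = 1 but stays in the local minimum from v0 = 5; the learning rate and iteration count are illustrative choices.

```python
import math

def cost(v):
    """Hypothetical 1-D cost: global minimum at v = 0, local minima near v = +/-5.68."""
    return 0.05 * v ** 2 - math.cos(v)

def grad(v):
    return 0.1 * v + math.sin(v)

def gradient_descent(v0, rate=0.1, iters=1000):
    v = v0
    for _ in range(iters):
        v -= rate * grad(v)   # step downhill using only the local gradient
    return v

for v0 in (1.0, 5.0):
    v = gradient_descent(v0)
    print(f"start {v0}: v = {v:.3f}, cost = {cost(v):.3f}")
```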
Choosing a cost function
Returning to the contour example, let’s formulate a cost function as the Euclidean distance between a model and the strongest features in the image
We can visualise the cost surface across a single parameter
Notice the surface has a global minimum but it is not distinct
3TraceScanLine.exe
Choosing a cost function
We can do the same after increasing the local search (by extending our search along the normals) to see how this affects the cost surface
Note that this makes the minimum more distinct, but this image has no background clutter. Additional clutter would complicate the surface further
4TraceScanLineLong.exe
Choosing a cost function
Let’s choose a different cost function
This time we will take the edge strength supporting the model pose
Notice the surface has inverted and we now seek to find the maximum
It has a very clear maximum which corresponds to the global solution which SHOULD be easy to find!!!
5cost2TraceScanLine.exe
Lucas-Kanade Tracking
Remember gradient descent:
[Figure: cost plotted against a single parameter.]
If we know more about the surface, we can speed things up:
– If we assume the cost surface is a parabola, then given a position together with the gradient and curvature there, we can move to the minimum in one step
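Lucas-Kanade Tracking
Newton-Raphson convergence:
$$ v_{n+1} = v_n - \frac{f'(v_n)}{f''(v_n)} $$
where f' is the Jacobian and f'' is the Hessian of the cost f(v).
[Figure: the cost f(v) plotted against v.]
• Two differences:
• LK uses the sum of squared differences across the entire image.
• v is a multi-dimensional warp parameter.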
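A minimal check of the one-step claim, assuming a 1-D parabolic cost f(v) = 3(v - 2)² + 1 chosen for illustration. With f'(v) = 6(v - 2) and f''(v) = 6, a single Newton-Raphson update from any starting point lands exactly on the minimum v = 2; on a non-parabolic cost several iterations would be needed.

```python
def newton_raphson(df, d2f, v0, iters=1):
    """Newton-Raphson minimisation: v <- v - f'(v) / f''(v)."""
    v = v0
    for _ in range(iters):
        v = v - df(v) / d2f(v)
    return v

# f(v) = 3 (v - 2)^2 + 1, so f'(v) = 6 (v - 2) and f''(v) = 6.
df = lambda v: 6.0 * (v - 2.0)
d2f = lambda v: 6.0
print(newton_raphson(df, d2f, v0=-10.0))  # 2.0 after a single step
```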
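Lucas-Kanade Tracking
The SSD between the template T and the image I warped by w(x, v):
$$ d_{ssd} = \sum_{\mathbf{x}} \left[ I(w(\mathbf{x}, \mathbf{v})) - T(\mathbf{x}) \right]^2 $$
Its Jacobian, where $\nabla I = (\partial I / \partial x,\ \partial I / \partial y)$ is the image gradient evaluated at $w(\mathbf{x}, \mathbf{v})$:
$$ \frac{\partial d_{ssd}}{\partial \mathbf{v}} = 2 \sum_{\mathbf{x}} \left[ \nabla I \, \frac{\partial w}{\partial \mathbf{v}} \right]^{\top} \left[ I(w(\mathbf{x}, \mathbf{v})) - T(\mathbf{x}) \right] $$
Lucas-Kanade Tracking
Its Hessian, dropping the second-order terms that are multiplied by the residual $I(w(\mathbf{x}, \mathbf{v})) - T(\mathbf{x})$:
$$ \frac{\partial^2 d_{ssd}}{\partial \mathbf{v}^2} \approx 2 \sum_{\mathbf{x}} \left[ \nabla I \, \frac{\partial w}{\partial \mathbf{v}} \right]^{\top} \left[ \nabla I \, \frac{\partial w}{\partial \mathbf{v}} \right] $$
Plugging both into the Newton-Raphson update gives the Lucas-Kanade step:
$$ \Delta\mathbf{v} = -\left( \frac{\partial^2 d_{ssd}}{\partial \mathbf{v}^2} \right)^{-1} \frac{\partial d_{ssd}}{\partial \mathbf{v}} $$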
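A minimal sketch of this update for the simplest warp, a pure translation w(x, v) = x + v. Here ∂w/∂v is the 2×2 identity, so the terms ∇I ∂w/∂v are just the image gradients sampled at the warped positions. The bilinear sampler, the use of np.gradient for ∇I and the stopping tolerance are assumptions of this sketch; the factor 2 in the Jacobian and Hessian cancels in the update.

```python
import numpy as np

def bilinear(img, xs, ys):
    """Sample img (indexed [row, col]) at real-valued (x=col, y=row) positions."""
    x0 = np.clip(np.floor(xs).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, img.shape[0] - 2)
    ax, ay = xs - x0, ys - y0
    top = (1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1]
    bottom = (1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]
    return (1 - ay) * top + ay * bottom

def lucas_kanade_translation(image, template, v0, iters=50, tol=1e-4):
    """Estimate v = (vx, vy) minimising d_ssd(v) = sum_x [I(x + v) - T(x)]^2."""
    image = image.astype(float)
    template = template.astype(float)
    grad_y, grad_x = np.gradient(image)              # dI/drow, dI/dcol
    ys, xs = np.mgrid[0:template.shape[0], 0:template.shape[1]].astype(float)
    v = np.array(v0, dtype=float)
    for _ in range(iters):
        wx, wy = xs + v[0], ys + v[1]                # w(x, v) = x + v
        error = bilinear(image, wx, wy) - template   # I(w(x, v)) - T(x)
        # Steepest-descent images: grad(I) * dw/dv, with dw/dv = identity.
        sd = np.stack([bilinear(grad_x, wx, wy).ravel(),
                       bilinear(grad_y, wx, wy).ravel()], axis=1)
        hessian = sd.T @ sd                          # Gauss-Newton Hessian (factor 2 dropped)
        jacobian = sd.T @ error.ravel()              # gradient of d_ssd (factor 2 dropped)
        dv = -np.linalg.solve(hessian, jacobian)     # Newton-Raphson step
        v += dv
        if np.linalg.norm(dv) < tol:
            break
    return v

# Usage: a smooth synthetic blob; the template is the blob seen from a known offset.
blob = lambda x, y: np.exp(-((x - 20) ** 2 + (y - 20) ** 2) / (2 * 5.0 ** 2))
rows, cols = np.mgrid[0:40, 0:40].astype(float)
image = blob(cols, rows)
ty, tx = np.mgrid[0:15, 0:15].astype(float)
template = blob(tx + 13.4, ty + 12.6)                # true v = (13.4, 12.6)
print(lucas_kanade_translation(image, template, v0=(11.0, 11.0)))
```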
Lucas-Kanade Tracking
Video: YouTube, “vision: optical flow detection”
Mean-shift
We can look for local maxima in object detector outputs using mean-shift
Mean shift
Example of simple mean-shift tracking: the object “detector” is the distance to an RGB histogram
Video: YouTube, “Mean shift tracking of red ball, normalised RGB and 64 bin histogram”
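A minimal sketch of mean shift with a flat circular kernel on a 2-D map of detector outputs (higher means more object-like). The window repeatedly moves to the weighted centroid of the responses inside it, climbing to a nearby local maximum without an exhaustive search. The synthetic Gaussian response map and the radius are hypothetical.

```python
import numpy as np

def mean_shift(weights, start, radius=7.5, iters=100, tol=1e-3):
    """Climb to a local maximum of a 2-D detector-response map using mean shift
    with a flat circular kernel. `start` and the result are (row, col)."""
    rows, cols = np.indices(weights.shape)
    centre = np.array(start, dtype=float)
    for _ in range(iters):
        inside = (rows - centre[0]) ** 2 + (cols - centre[1]) ** 2 <= radius ** 2
        w = weights * inside                     # responses inside the window
        total = w.sum()
        if total == 0:
            break                                # no support: stay put
        new_centre = np.array([(w * rows).sum(), (w * cols).sum()]) / total
        shift = np.linalg.norm(new_centre - centre)
        centre = new_centre                      # move to the weighted centroid
        if shift < tol:
            break
    return centre

# Usage: a synthetic detector-response map peaking at (30, 42).
r, c = np.indices((60, 80))
response = np.exp(-((r - 30) ** 2 + (c - 42) ** 2) / (2 * 4.0 ** 2))
print(mean_shift(response, start=(26, 38)))
```

In tracking, the window found in one frame becomes the start of the search in the next.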
Regression-based Tracking
Up till now, tracking has been seen as a constrained detection problem: essentially template matching, searching a parameter space to optimise a matching fitness function.
Another approach is to pose the problem as a regression problem: Given template difference, predict the translational offset to the correct position. (no explicit search needed!)
Linear Predictors (Robust Facial Feature Tracking using Shape Constrained Multi Resolution Selected Linear Predictors, Ong et al)
[Figure: a reference point Y with support pixels a, b and c around it.]
$$ P = [\, I_a - I'_a,\; I_b - I'_b,\; I_c - I'_c \,] $$
$$ X = HP $$
A reference point plus support pixels (a, b, c); H is a linear mapping from the support-pixel intensity differences P to the translation vector X.
Linear Predictor “Bunches”
– Single LPs are not stable enough for tracking image features
– Use a set (“bunch”) of LPs instead
– Final prediction = consensus of the most common predicted translation
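One hypothetical way to form that consensus, assuming each LP in the bunch outputs a (dx, dy) translation: round each prediction to whole pixels and take the most frequent one, so a few unstable LPs are outvoted.

```python
from collections import Counter

def bunch_consensus(predictions):
    """Vote over the (dx, dy) predictions of a bunch of LPs, rounded to whole
    pixels; return the most common translation and how many LPs agreed."""
    rounded = [(round(dx), round(dy)) for dx, dy in predictions]
    best, votes = Counter(rounded).most_common(1)[0]
    return best, votes

preds = [(2.1, -0.9), (1.8, -1.2), (2.0, -1.0), (-5.0, 7.0)]  # last LP is unstable
print(bunch_consensus(preds))  # ((2, -1), 3)
```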
Linear Predictors
“Tracking context” is very important.
We only want to use surrounding visual information if it helps the tracking
Linear Predictors
We want to track this point
But we should use the visual information around here to track it! Other regions have too much variation.
We can find the tracking context by evaluating the accuracy of trackers using local patches, and gradually removing the bad ones
Linear Predictors
Cascaded linear predictors:
– Linear predictors trained to overcome large offsets are robust but not accurate
– LPs trained to overcome small offsets are accurate but not robust
– Solution: cascade them. Use big-offset LPs, then pass the results to smaller ones for refinement.
Linear Predictors
Errors of a “large” LP predicting from an offset position (blue is medium prediction error)
Errors of a “small” LP predicting from an offset position (white is small prediction error)
Linear Predictors
Non-Linear Predictors (Non-linear Predictors for Facial feature Tracking, FG2013, Sheerman-Chase et al.)
[Figure: a reference point Y with support pixels a, b and c around it.]
$$ P = [\, I_a - I'_a,\; I_b - I'_b,\; I_c - I'_c \,] $$
$$ X = H(P) $$
Replace the linear mapping with the non-linear mapping of regression trees
The input is still the support-pixel differences; the output is still the offset
Non-Linear Predictors
[Figure: an example regression tree whose internal nodes test support-pixel differences (e.g. S1 < 0.4, S50 < 0.1) and whose leaves hold offsets (e.g. dy = 23, dy = 32, dy = -10).]
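A minimal sketch of how such a tree maps P to an offset: each internal node compares one support-pixel difference with a threshold and each leaf stores an offset. The arrangement of branches below is a hypothetical reading of the figure (a true test goes left), not the trained tree, and only a single tree is shown.

```python
def tree_predict(node, P):
    """Walk a regression tree: internal nodes test P[feature] < threshold
    (true -> left); leaves return an offset."""
    while "leaf" not in node:
        node = node["left"] if P[node["feature"]] < node["threshold"] else node["right"]
    return node["leaf"]

# Hypothetical tree loosely following the figure: S1 < 0.4 at the root, S50 < 0.1 below.
tree = {
    "feature": 1, "threshold": 0.4,
    "left": {"leaf": 23.0},
    "right": {"feature": 50, "threshold": 0.1,
              "left": {"leaf": 32.0}, "right": {"leaf": -10.0}},
}
P = [0.0] * 60                 # support-pixel differences (hypothetical values)
P[1], P[50] = 0.9, 0.05
print(tree_predict(tree, P))   # 32.0
```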
Non-Linear Predictors
Results: More robust tracking able to handle larger amounts of pose and expression variations.
Non-Linear Predictors
Allows us to do freaky things like this:
Background to the template update problem
– No update: misrepresentation error, which is catastrophic
– Naïve update: drift error, which accumulates slowly
[Figure: frames 1 to 5 showing the true feature with its old appearance, the true feature with its new appearance and a false feature; plots of error against time for each update strategy.]
Background template update (Mutual information for Lucas Kanade tracking (MILK): An inverse compositional formulation, Dowson et al, PAMI 08)
Building a Model of Templates
[Figure: templates in appearance space, comparing LP and SMAT.]
Incorporating Motion Models for Tracking
Temporal Consistency
This sequence shows a surveillance application tracking subjects as they move.
The technique uses a per-pixel mixture of Gaussians to model background colour distributions and perform dynamic background subtraction.
Tracking with Motion Models
The task of visual tracking involves locating the position of a tracked target by a combination of features and motion models.
There is a strong relationship between the tasks of object detection and tracking.
[Figure: tracking combines a visual model + detector with a motion model.]
Using Motion
Objects often exhibit consistent motion
Kalman Filter
To exploit this motion consistency, many authors model it with simple dynamics in what is called the Kalman filter
A Kalman filter is simply an optimal recursive data processing algorithm. It combines predictions based on previous estimates with current observations
Kalman Filter
Suppose we have some hidden information to recover (i.e. not directly observable) that takes the form of a state vector, e.g. X = [x, y, v], the position and velocity of a tracked object
This object has a true state at time t, $X_t$, which we do not know. But suppose we think this object’s dynamics work in a linear fashion: $X_t = F X_{t-1}$
BUT this may not be exactly the case; it might be slightly off, so we have $X_t = F X_{t-1} + w_t$, where $w_t \sim N(0, Q)$
Kalman Filter
Suppose we have some sensors that provide measurements of the tracked object in the form of a measurement vector: Z = [a, b]
These sensor measurements originate from the hidden state vector X in the form: $Z_t = H X_t$
BUT, in reality the sensor can be imperfect, noisy, etc. We deal with this by saying $Z_t = H X_t + v_t$, where $v_t \sim N(0, R)$; R is called the sensor’s error covariance
Kalman Filter
We want to recover some hidden information about a tracked object: X = [x, y, v]
We can predict its movements “blindly” using $X'_{t|t-1} = F X'_{t-1|t-1}$
But this model is inaccurate in a Gaussian sense, since the true dynamics include $w_t \sim N(0, Q)$. We have some sensors that provide observations to indirectly tell us how accurate our predictions are: $Z_t - H X'_{t|t-1}$
BUT, we need to take this with a pinch of salt, since our sensors are inaccurate as well ($Z_t$ has Gaussian noise with covariance R)
Kalman Filter
So, the task at hand: how do we best combine our prediction of a tracked object’s state with the sensor observations, given that both have Gaussian noise?
That is what a Kalman filter does, in an optimal sense (provided your noise IS Gaussian and your dynamics ARE linear):
$X_{t|t} = X'_{t|t-1} + K(Z_t - H X'_{t|t-1})$
K is called the “Kalman gain”. Essentially, if the sensor noise is small and the prediction noise large, K tends to $H^{-1}$ (for an invertible H), meaning trust the observations. Conversely, if the sensor noise is large, K tends to 0: trust the prediction.
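A minimal sketch of the full predict/update cycle, using the standard gain $K = P H^\top (H P H^\top + R)^{-1}$, where P is the covariance of the state estimate (P is not named on the slides). The constant-velocity model in the usage lines (state = [position, velocity], one time step per frame) is an illustrative assumption; calling predict() without update() is how the filter carries a track through an occlusion.

```python
import numpy as np

class KalmanFilter:
    """Linear Kalman filter for X_t = F X_{t-1} + w_t and Z_t = H X_t + v_t,
    with w_t ~ N(0, Q) and v_t ~ N(0, R)."""

    def __init__(self, F, H, Q, R, x0, P0):
        self.F, self.H, self.Q, self.R = (np.asarray(m, dtype=float) for m in (F, H, Q, R))
        self.x = np.asarray(x0, dtype=float)   # state estimate
        self.P = np.asarray(P0, dtype=float)   # covariance of the state estimate

    def predict(self):
        self.x = self.F @ self.x                        # X'_{t|t-1} = F X'_{t-1|t-1}
        self.P = self.F @ self.P @ self.F.T + self.Q    # uncertainty grows by Q
        return self.x

    def update(self, z):
        innovation = np.asarray(z, dtype=float) - self.H @ self.x  # Z_t - H X'_{t|t-1}
        S = self.H @ self.P @ self.H.T + self.R                     # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)                   # Kalman gain
        self.x = self.x + K @ innovation                           # X_{t|t}
        self.P = (np.eye(self.x.size) - K @ self.H) @ self.P
        return self.x

# Usage: constant-velocity model, state = [position, velocity], position observed.
kf = KalmanFilter(F=[[1, 1], [0, 1]], H=[[1, 0]], Q=0.01 * np.eye(2), R=[[1.0]],
                  x0=[0.0, 0.0], P0=10.0 * np.eye(2))
for t in range(1, 51):          # the object moves 2 units per frame
    kf.predict()
    kf.update([2.0 * t])
for _ in range(5):              # occlusion: no observations, so only predict
    kf.predict()
print(kf.x)                     # roughly [110, 2]
```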
Kalman Filter Operation
From: Kalman filter for dummies
Using a Kalman Filter to Track
How prediction overcomes occlusion issues
Videos: YouTube, “kalman Filter result on real aircraft” and “Result of Kalman Filter on a Moving Aircraft”
Extended Kalman Filter-EKF
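The Kalman filter addresses the problem of dynamics estimation with linear equations
Most problems are non-linear. The EKF attempts to address this by making the state prediction $X_t = F(X_{t-1}) + w_t$, where F can be any (differentiable) non-linear function; the filter linearises F around the current estimate using its Jacobian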
See www.cs.unc.edu/~welch for introductory tutorials and sample code
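A minimal sketch of one EKF cycle, assuming non-linear dynamics f but a linear measurement H (as in the linear filter) and a forward-difference numerical Jacobian; an analytic Jacobian is more common in practice. The usage lines track a hypothetical target moving in a straight line with unknown heading and speed, a state the linear KF cannot propagate because the position update depends on the cosine and sine of the heading.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Forward-difference Jacobian of f at x."""
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.asarray(f(x + dx), dtype=float) - fx) / eps
    return J

def ekf_step(f, H, Q, R, x, P, z):
    """One EKF cycle for X_t = f(X_{t-1}) + w_t, Z_t = H X_t + v_t."""
    Fj = numerical_jacobian(f, x)            # linearise f around the current estimate
    x_pred = np.asarray(f(x), dtype=float)   # non-linear state prediction
    P_pred = Fj @ P @ Fj.T + Q               # covariance through the linearisation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain, as in the linear filter
    x_new = x_pred + K @ (np.asarray(z, dtype=float) - H @ x_pred)
    P_new = (np.eye(x_new.size) - K @ H) @ P_pred
    return x_new, P_new

# Usage: state [px, py, heading, speed]; only the position is observed.
def move(s):
    px, py, heading, speed = s
    return np.array([px + speed * np.cos(heading), py + speed * np.sin(heading),
                     heading, speed])

H = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
x, P = np.array([0.0, 0.0, 0.0, 1.0]), np.eye(4)   # initial guess: heading 0
for t in range(1, 101):                            # true heading 0.5 rad, speed 1
    z = t * np.array([np.cos(0.5), np.sin(0.5)])
    x, P = ekf_step(move, H, 1e-4 * np.eye(4), 0.01 * np.eye(2), x, P, z)
print(x[2], x[3])                                  # heading and speed estimates
```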
Exploring a parameter space for the global solution
We could try every single model configuration to find the lowest-cost solution, but this can be infeasible (e.g. 640 × 480 positions, 100 scales and 360 rotations: 640×480×100×360 = 11,059,200,000 configurations)
We could just randomly pick model configurations in the hope of finding a low-cost solution, but this does not guarantee that we will find it, and as the dimensionality and complexity increase, so must the number of random samples
These are common problems, and hence standard optimisation techniques can be employed, e.g. simulated annealing, genetic algorithms
7RandomSample.exe
Tracking as an Optimisation Problem
In simulated annealing we use a simple heuristic (accepting worse solutions occasionally, and less often as a “temperature” is lowered) to escape local minima and reduce the number of samples we need to test
In genetic algorithms we try to guide our random search using the fitness observed for previous samples, again to reduce the complexity of the search
However, these are blind optimisations, and we often know much more about the problem we are trying to solve, such as the nature of the observations or the dynamics we are expecting (remember the Kalman filter)
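A minimal sketch of simulated annealing on a hypothetical multi-modal 1-D cost, 0.05v² - cos(v), whose global minimum is at v = 0, started in its local minimum near v = 5.68. The Gaussian proposal, the geometric cooling schedule and all parameter values are illustrative assumptions, not a tuned tracker.

```python
import math
import random

def simulated_annealing(cost, v0, t_start=2.0, t_end=1e-3, iters=5000, step=1.0, seed=0):
    """Minimise a 1-D cost. Worse moves are accepted with probability
    exp(-increase / T), and the temperature T is lowered geometrically."""
    rng = random.Random(seed)
    v, c = v0, cost(v0)
    best_v, best_c = v, c
    for k in range(iters):
        temperature = t_start * (t_end / t_start) ** (k / (iters - 1))
        cand = v + rng.gauss(0.0, step)          # random perturbation
        cand_c = cost(cand)
        if cand_c < c or rng.random() < math.exp(-(cand_c - c) / temperature):
            v, c = cand, cand_c
            if c < best_c:
                best_v, best_c = v, c            # remember the best solution seen
    return best_v, best_c

def cost(v):
    return 0.05 * v ** 2 - math.cos(v)

print(simulated_annealing(cost, 5.68))
```

Plain gradient descent started at 5.68 would stay where it is; the occasional uphill moves at high temperature are what let the search cross into the global basin.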
Tracking as an Optimisation Problem
Example of using simulated annealing for tracking the body pose
N. Lehment, M. Kaiser, D. Arsic, and G. Rigoll. Cue-Independent Extending Inverse Kinematics For Robust Pose Estimation in 3D Point Clouds. Proc. IEEE Intern. Conf. on Image Processing (ICIP 2010)
Factored Sampling
We have seen how the KF uses a simple Gaussian to model observations, but what happens if observations are non-Gaussian?
Factored sampling can be used to search a static image in these cases
We want to calculate the posterior probability that an object X exists in an image given the observed data obj: $P(X \mid obj)$
Factored Sampling
This is difficult to achieve for continuous, complex, non-Gaussian distributions
Luckily, Bayes’ formula says that the posterior density is proportional to the product of a prior density $P_0(X)$ and an observation density $P(obj \mid X)$: $P(X \mid obj) \propto P(obj \mid X)\, P_0(X)$
Factored sampling estimates the posterior by generating samples from the prior and weighting them according to the observation density
Factored Sampling
A set of N points $s^{(n)}$, the centres of the blobs in the figure, is sampled randomly from the prior density $P_0(X)$
Each sample is then assigned a weight (depicted by blob area) based upon the observation density $P(obj \mid X = s^{(n)})$
If N is sufficiently large, then the weighted set represents the posterior density $P(X \mid obj)$
[Figure: weighted samples along the state axis X, with probability on the vertical axis, approximating the posterior density.]
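A minimal sketch of factored sampling for a 1-D state, assuming a Gaussian prior $P_0(X) = N(0, 2^2)$ and a hypothetical bimodal observation density (two equally good image matches) that no single Gaussian could represent. The weighted samples stand in for the posterior, so posterior quantities such as the mean become weighted sums.

```python
import numpy as np

def factored_sampling(sample_prior, observation_density, n, rng):
    """Represent P(X | obj) by samples s^(n) drawn from the prior P0(X),
    weighted by P(obj | X = s^(n)) and normalised to sum to 1."""
    samples = sample_prior(n, rng)
    weights = observation_density(samples)
    return samples, weights / weights.sum()

rng = np.random.default_rng(0)
prior = lambda n, rng: rng.normal(0.0, 2.0, size=n)      # P0(X) = N(0, 2^2)
# Hypothetical non-Gaussian observation density: two equally good image matches.
bimodal = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) + np.exp(-0.5 * (x + 3.0) ** 2)
samples, weights = factored_sampling(prior, bimodal, 10_000, rng)
print(np.sum(weights * samples))   # posterior mean estimated from the weighted set
```

A quick sanity check: with a Gaussian observation density N(1, 1) instead, the weighted mean approaches the analytic posterior mean 4·1/(4 + 1) = 0.8.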
CONDENSATION and Particle Filtering
CONDitional DENsity propagATION, also known as particle filtering, is the natural extension of the KF’s predict/update cycle to factored sampling
Basically:
– Randomly draw samples from the prior pdf (the previous posterior) and apply a model of dynamics (i.e. predict)
– Fit each sample to the image (i.e. measure)
– Weight the samples accordingly to generate a new posterior pdf that will serve as the prior for the next iteration
CONDENSATION and Particle Filtering
[Figure: one cycle of the algorithm: predict, then measure.]
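A minimal sketch of the CONDENSATION cycle for a 1-D state: resample from the previous weighted set, predict with a random-walk drift, then re-weight by the observation density. The random-walk dynamics, the Gaussian observation density around a known target path and the population size are illustrative assumptions standing in for a real image measurement.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, rng, drift=1.0):
    """One CONDENSATION cycle for a 1-D state: resample, predict, measure."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)                       # draw from previous posterior
    particles = particles[idx] + rng.normal(0.0, drift, size=n)  # predict: random-walk drift
    weights = likelihood(particles)                              # measure: observation density
    return particles, weights / weights.sum()

# Usage: the target moves 0.5 units per frame; each frame's observation density
# is a Gaussian around the observed position (hypothetical observation model).
rng = np.random.default_rng(1)
particles = rng.normal(0.0, 1.0, size=2000)
weights = np.full(2000, 1.0 / 2000)
for t in range(1, 31):
    z = 0.5 * t
    particles, weights = particle_filter_step(
        particles, weights, lambda x: np.exp(-0.5 * (x - z) ** 2), rng)
print(np.sum(weights * particles))   # close to the true position 15, slightly lagging
```

Because the whole weighted set is carried forward, a multi-modal observation density would keep several clusters of particles alive, which is the multiple-hypothesis behaviour described later.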
CONDENSATION and Particle Filtering
The animation shows a few cycles of the algorithm applied to a one-dimensional system. The green spheres correspond to the members of the sample set, where the size of the sphere is an indication of the sample weight. The red line is the measurement density function.
This animation shows a short sequence of the CONDENSATION filter tracking a leaf exhibiting non-linear motion with occlusion and clutter.
Movie sequences taken from http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/ISARD1/condensation.html
CONDENSATION and Particle Filtering
We can extend our random sampler to a simple PF using Gaussian noise as our dynamics/drift term
Notice how the population quickly homes in on the area of highest probability, as we saw in the random sampling
It quickly converges on incorrect local solutions. Increasing the noise term helps explore the space further, but the global maximum is at the bottom of the image
8ParticleFilter.exe
CONDENSATION and Particle Filtering
We can further try to change the model to better fit the head and ensure the global maximum is at the correct position
Tracking is better but easily lost to other maxima
As the population size is increased, we start to see multiple-hypothesis tracking
By combining both the PF and a gradient descent method, we can get the best results for the lowest population, but our cost function is still flawed
9Particle filter.exe
10ParticleFilter.exe
CONDENSATION and Particle Filtering
Advantages:
– Allows complex non-Gaussian systems
– Easy to add non-linear dynamics
– Provides support for multiple hypotheses (!!!)
Disadvantages:
– Large numbers of samples make the techniques extremely slow for high-dimensional parameter spaces
– Not a global optimisation, so it tends to converge upon good observations at the cost of other observations
There are many schemes for overcoming these problems, but they are beyond the scope of this lecture
Interesting Applications of Motion Tracking
Lip-Reading
Facial features of a subject are tracked, specifically the mouth regions.
Mouth texture and shape are extracted and used to build discriminative patterns called sequential patterns
Lip-Reading
Results:
Sign Language Recognition
Tracking required for extracting the motions of the hands and head.
Movement features of the hands and hand shapes are extracted
Again, discriminative movement patterns uniquely identifying a sign are extracted
These patterns will be used to detect whether a sign is present in a video sequence or not
Sign Language Recognition
Results:
Group Behaviour Profiling
Even when tracking is not very accurate or robust, it can still be used to do useful things!
Example: use simple trackers (e.g. Lucas-Kanade trackers) to “track” people in a crowd
These will only last a short while, but can form short trajectories.
The analysis of these trajectories can be used to profile crowd behaviours.
Group Behaviour Profiling
Results:
Summary
We have looked at a variety of tracking strategies from very simple schemes to those which can learn and predict complex non-linear motion in cluttered environments. This talk is not exhaustive but should give you a basic understanding of the types of techniques used in modern computer vision systems.
For more details on many of the examples see my website http://www.surrey.ac.uk/personal/e.ong
For a good introduction to the temporal mechanics of tracking, I would recommend reading “Active Contours” by Blake and Isard
Things to remember!!!
When tracking:
– Tracking is only as good as your model and data: a bad metric will give bad results, and the larger the parameter space, the more difficult things become
– Make things as simple as possible: constrain your environment, and use appropriate techniques and dynamics
– e.g. if you’re tracking someone jumping up and down, don’t use a Kalman filter
– Don’t try to reinvent the wheel, but if you’re going to use black-box techniques, ensure you know what they will and won’t do for you