IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 30, NO. 7, JULY 2008
Segmentation and Tracking of Multiple Humans in Crowded Environments
Tao Zhao, Member, IEEE, Ram Nevatia, Fellow, IEEE, and Bo Wu, Student Member, IEEE
Abstract—Segmentation and tracking of multiple humans in crowded situations is made difficult by interobject occlusion. We propose
a model-based approach to interpret the image observations by multiple partially occluded human hypotheses in a Bayesian
framework. We define a joint image likelihood for multiple humans based on the appearance of the humans, the visibility of the body
obtained by occlusion reasoning, and foreground/background separation. The optimal solution is obtained by using an efficient
sampling method, data-driven Markov chain Monte Carlo (DDMCMC), which uses image observations for proposal probabilities.
framework. Knowledge of various aspects, including human shape, camera model, and image cues, is integrated in one theoretically sound
framework. We present experimental results and quantitative evaluation, demonstrating that the resulting approach is effective for very
challenging data.
Index Terms—Multiple human segmentation, multiple human tracking, Markov chain Monte Carlo.
1 INTRODUCTION AND MOTIVATION
SEGMENTATION and tracking of humans in video sequences is important for a number of applications, such as visual surveillance and human-computer interaction. This has been a topic of considerable research in the recent past, and robust methods exist for tracking isolated humans, or a small number of humans for whom only transient occlusion exists. However, tracking in a more crowded situation, where several people are present and occlusion is persistent, remains challenging. The goal of this work is to develop a method to detect and track humans in the presence of persistent and temporarily heavy occlusion. We do not require that humans be isolated, that is, unoccluded, when they first enter the scene. However, in order to "see" a person, we require that at least the head-shoulder region be visible. We assume a stationary camera so that motion can be detected by comparison with a background model. We do not require the foreground detection to be perfect, e.g., the foreground blobs may be fragmented, but we assume that there are no significant false alarms due to shadows, reflections, or other causes. We also assume that the camera model is known and that people walk on a known ground plane.
Fig. 1a shows a sample frame of a crowded environment, and Fig. 1b shows the motion blobs detected by comparison with the learned background. It is apparent that segmenting humans from such blobs is not straightforward. One blob may include multiple objects, while one object may split
into multiple blobs. Blob tracking over extended periods,
e.g., [20], may resolve some of these ambiguities, but such
approaches are likely to fail when occlusion is persistent.
Some approaches have been developed to handle occlusion,
for example, [9], but require the objects to be initialized
before occlusion happens. This is usually infeasible for a
crowded scene. We believe that the use of a shape model is
necessary to achieve individual human segmentation and
tracking in crowded scenes.

In earlier related work [54], Zhao and Nevatia modeled
the human body as a 3D ellipsoid and human hypotheses
were proposed based on head top detection from fore-
ground boundary peaks. This method works reasonably
well in the presence of partial occlusions if the number of
people in the field of view is small. As the complexity of the
scene grows, head tops cannot be obtained by simple
foreground boundary analysis and more complex shape
models are needed to fit more accurately with the observed
shapes. Also, joint reasoning about the collection of objects
is needed, rather than the simpler one-by-one verification
method in [54]. The consequence of this joint consideration
is that the optimal solution has to be computed in the joint
parameter space of all of the objects. To track the objects in
multiple frames, temporal coherence is another desired
property besides the accuracy of the spatial segmentation.
We adapt a data-driven Markov chain Monte Carlo
(MCMC) approach to explore this complex solution space.
To improve the computational efficiency, we use direct
image features from a bottom-up image analysis as
importance proposal probabilities to guide the moves of
the Markov chain. The main features of this work include
1. a three-dimensional part-based human body model which enables the segmentation and tracking of humans in 3D and the natural inference of interobject occlusion,
. T. Zhao is with Intuitive Surgical Inc., 950 Kifer Road, Sunnyvale, CA 94086. E-mail: [email protected].
. R. Nevatia and B. Wu are with the Institute for Robotics and Intelligent Systems, USC Viterbi School of Engineering, University of Southern California, 3737 Watt Way, Los Angeles, CA 90089. E-mail: {nevatia, bowu}@usc.edu.
Manuscript received 18 Sept. 2006; revised 24 Apr. 2007; accepted 13 Aug. 2007; published online 31 Aug. 2007. Recommended for acceptance by C. Kambhamettu. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-0668-0906. Digital Object Identifier no. 10.1109/TPAMI.2007.70770.
0162-8828/08/$25.00 © 2008 IEEE. Published by the IEEE Computer Society.
2. a Bayesian framework that integrates segmentation and tracking based on a joint likelihood for the appearance of multiple objects,
3. the design of efficient Markov chain dynamics, directed by proposal probabilities based on image cues, and
4. the incorporation of a color-based background model in a mean-shift tracking step.
Our method is able to successfully detect and track humans in scenes of the complexity shown in Fig. 1 with high detection and low false-alarm rates; the tracking results for the frame in Fig. 1a are shown in Fig. 1c (the result includes the integration of multiple frames during tracking). In Section 6, we give graphical and quantitative results on a number of sequences. Parts of our system have been described in [53] and [55]; this paper provides a unified presentation of the methodology, additional results, and discussions. This approach has been built on by other researchers, for example, [41]. The same framework has also been successfully applied to vehicle segmentation and tracking in challenging cases [43].
The rest of the paper is organized as follows: Section 2
gives a brief review of the related works. Section 3 presents
an overview of our method. Section 4 describes the
probabilistic modeling of the problem. Section 5 describes our MCMC-based solution. Section 6 shows experimental
results and evaluation. Conclusions and discussions are
given in the last section.
2 RELATED WORK
We summarize related work in this section; some of these works
are referred to in more detail in the following sections. Due
to the amount of literature in this field, it is not possible for us to provide a comprehensive survey, but we attempt to include the major trends.

The observations for human hypotheses may come from multiple cues. Many previous approaches [20], [9], [54], [37], [44], [15], [18], [40], [24], [3], [45] use motion blobs detected by comparing pixel colors in a frame to learned models of the stationary background. When the scene is not highly crowded, most parts of the humans in the scene are detected in the foreground motion blobs; multiple humans may be merged into a single blob, but they can be separated by rather simple processing. For example, Haritaoglu et al. [15] use the vertical projection of a blob to help segment a big blob into multiple humans. Siebel and Maybank [40] and Zhao and Nevatia [54] detect head candidates by analyzing the foreground boundaries. Since different humans have small overlapping foreground regions, they could be segmented in a greedy way. However, the utility of these methods in crowded environments such as in Fig. 1 is likely to be limited.
Some methods, for example, [50], [31], [7], [13], detect appearance or shape-based patterns of humans directly. Those in [50] and [31] learn human detectors from local shape features; those in [7] and [13] build contour templates for pedestrians. These learning-based methods need a large number of training samples and may be sensitive to imaging viewpoint variations as they learn 2D patterns. Besides motion and shape, face and skin color are also useful cues for human detection, but environments where these cues could be utilized are limited, usually to indoor scenes where illumination is controlled and the objects are imaged at high resolution, for example, [42] and [12].
Without a specific model of objects, tracking methods are limited to blob tracking, for example, [3]. The main advantage of model-based tracking is that it can solve the blob merge and split problems by enforcing a global shape constraint. The shape models can be either parametric, for example, an ellipsoid as in [54], or nonparametric, for example, the edge template as in [13], and either in 2D, for example, [46], or in 3D, for example, [54]. Parametric models are usually generative and of high dimensionality, while nonparametric models are usually learned from real samples. Two-dimensional models make the matching of hypotheses and image observations straightforward, while 3D models are more natural for occlusion reasoning. The choice of the model complexity depends on both the application and the video resolution. For human tracking from a middistant camera, we do not need to capture the detailed body articulation; a rough body model such as the generic cylinder in [19], the ellipsoid in [54], or the multiple rectangles in [46] suffices. When the body pose of humans is desired and the video resolution is high enough, more complex models can be used, such as the articulated models in [54] and [34].
Tracking of multiple objects requires the matching of hypotheses with the observations both spatially and temporally. When objects are highly interoccluded, their image observations are far from independent; hence, a joint likelihood for multiple objects is necessary [46], [27], [19], [35], [30], [51]. Smith et al. [41] use a pairwise Markov Random Field (MRF) to model the interaction between humans and define the joint likelihood. Rittscher et al. [36] include in the state vector a hidden variable which indicates a global mapping from the observed features to human hypotheses.
Fig. 1. A sample frame, the corresponding motion blobs, and our segmentation and tracking result for a crowded situation. (a) Sample frame. (b) Motion blobs. (c) Our result.
As the solution space is of high dimension, searching for the best interpretation by brute force is not feasible. Particle filter-based methods, for example, [19], [46], [30], [51], [27], become unsuitable when the dimensionality of the search space is high, as the number of samples needed usually grows exponentially with the dimension. The methods in [41], [21] use variations of the MCMC algorithm to sample the solution space, while those in [45], [36] use an EM-style method. For efficiency, the candidate solutions can be generated from image cues rather than purely randomly; for example, the work in [36] proposes hypotheses from local silhouette features.

Information from multiple cameras with overlapping views can reduce the ambiguity of a single camera. Such methods usually assume that the object can be detected successfully from at least one viewpoint (for example, [11]) or that many cameras are available for 3D reconstruction (for example, [28]). The difficulty in segmenting multiple humans that overlap in images from a stereo camera is alleviated by analyzing where in 3D space they are separable [52]. In a multicamera context, an object can be tracked even when it is fully occluded from some of the views; however, many real environments do not permit the use of multiple cameras with overlapping views. In this paper, we consider situations where video from only one camera is available. However, our approach can utilize multiple cameras with little modification.

MCMC-based methods are gaining popularity for computer vision problems due to their flexibility in optimizing an arbitrary energy function, as opposed to energy functions of a specific type as in graph cut [2] or belief propagation [49]. They have been used for various applications, including segmenting multiple cells [38], image parsing [48], multiobject tracking [21], and estimating articulated structures [23]. Data-driven MCMC was proposed in [48] to utilize bottom-up image cues to speed up the sampling process.

We want to point out the differences between our approach and another independently developed work [21] that also used MCMC for multiobject tracking. The work in [21] assumes that the objects do not overlap, applying a penalty term for overlap, while our approach explicitly uses a likelihood of appearance under occlusion. Our approach focuses on the domain of tracking humans, the most important subjects for visual surveillance. We consider the three-dimensional perspective effect of a typical camera setting, while the ant tracking problem described in [21] is almost a 2D problem. We utilize the acquired appearance, where each object has a different appearance, while the ants in [21] are assumed to have the same appearance. We also developed a full set of effective bottom-up cues for human segmentation and hypothesis generation.
3 OVERVIEW
Our approach to segmenting and tracking multiple humans emphasizes the use of shape models. An overview diagram is given in Fig. 2. Based on a background model, the foreground blobs are extracted as the basic observation. By using the camera model and the assumption that objects move on a known ground plane, multiple 3D human hypotheses are projected onto the image plane and matched with the foreground blobs. Since the hypotheses are in 3D, occlusion reasoning is straightforward. In one frame, we segment the foreground blobs into multiple humans and associate the segmented humans with the existing trajectories. Then, the tracks are used to propose human hypotheses in the next frame. The segmentation and tracking are integrated in a unified framework and interoperate along time.

We formulate the problem of segmentation and tracking as one of Bayesian inference to find the best interpretation given the image observations, the prior models, and the estimates from the previous frame analysis (that is, the maximum a posteriori (MAP) estimation). The state to be estimated at each frame includes the number of objects, their correspondences to the objects in the previous frame (if any), their parameters (for example, positions), and the uncertainty of the parameters. We define a color-based joint likelihood model that considers all of the objects and the background together and encodes both the constraint that an object should be different from the background and the constraint that an object should be similar to its correspondence. Using this likelihood model gracefully integrates segmentation and tracking and avoids a separate, sometimes ad hoc, initialization step. Given multiple human hypotheses, interobject occlusion reasoning is done before calculating the joint image likelihood. The occluded parts of a human should not have corresponding image observations.

The solution space contains subspaces of varying dimensions, each corresponding to a different number of objects. The state vector consists of both discrete and continuous variables. This disqualifies many optimization techniques. Therefore, we use a highly general reversible jump/diffusion MCMC-based method to compute the MAP estimate. We design dynamics for the multiobject tracking problem. We also use various direct image features to make the Markov chain more efficient. Direct image features alone do not guarantee optimality because they are usually computed locally or using partial cues. Using them as proposal probabilities of the Markov chain results in an integrated top-down/bottom-up approach that has both the computational efficiency of image features and the optimality of a Bayesian formulation. A mean-shift technique [5] is used as an efficient diffusion for the Markov chain. The data-driven dynamics and the in-depth exploration of the solution space make the approach less sensitive to dimensionality than particle filters. Our experiments show that the described approach works robustly in very challenging situations with affordable computation; some results are shown in Section 6.
Fig. 2. Overview diagram of our approach.
4 PROBABILISTIC MODELING
Let $\theta$ represent the state of the objects in the scene at time $t$; it consists of the number of objects in the scene, their 3D positions, and other parameters describing their size, shape, and pose. Our goal is to estimate the state at time $t$, $\theta^{(t)}$, given the image observations $I^{(1)}, \ldots, I^{(t)}$, abbreviated as $I^{(1,\ldots,t)}$. We formulate the tracking problem as computing the MAP estimate, $\theta^{(t)\ast}$:

$$\theta^{(t)\ast} = \arg\max_{\theta^{(t)} \in \Theta} P\big(\theta^{(t)} \mid I^{(1,\ldots,t)}\big) = \arg\max_{\theta^{(t)} \in \Theta} \big\{ P\big(I^{(t)} \mid \theta^{(t)}\big) \, P\big(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\big) \big\}, \quad (1)$$

where $\Theta$ is the solution space. Denote by $m$ the state vector of one individual object. A state containing $n$ objects can be written as $\theta = \{(k_1, m_1), \ldots, (k_n, m_n)\} \in \Theta_n$, where $k_i$ is the unique identity of the $i$th object whose parameters are $m_i$ and $\Theta_n$ is the solution space of exactly $n$ objects. The entire solution space is $\Theta = \cup_{n=0}^{N_{max}} \Theta_n$, where $N_{max}$ is the upper bound on the number of objects. In practice, we compute an approximation of $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$ (details are given later in Section 4.4).
4.1 3D Human Shape Model
The parameters of an individual human, $m$, are defined based on a 3D human shape model. The human body is highly articulated; however, in our case, the human motion is mostly limited to standing or walking, and we do not attempt to capture the detailed shape and articulation parameters of the human body. Thus, we use a number of low-dimensional models to capture the gross shape of human bodies (Fig. 3).

Ellipsoids fit human body parts well and have the property that their projection is an ellipse with a convenient form [16]. Therefore, we model human shape by a composition of multiple ellipsoids corresponding to the head, the torso, and the legs, with a fixed spatial relationship. A few such models in characteristic poses are sufficient to capture the gross shape variations of most humans in the scene for midresolution images. We use the multi-ellipsoid model to control the model complexity while maintaining a reasonable level of fidelity. We used three such models (one for legs close to each other and two for legs well split) in our previous work on multihuman segmentation [53]. However, in this work, we use only a single model with three ellipsoids, which we found sufficient for tracking.
The model is controlled by two parameters called size and thickness. The size parameter is the 3D height of the model; it also controls the overall scaling of the object in the three directions. The thickness parameter captures extra scaling in the horizontal directions. Besides size and thickness, the parameters also include the image position of the head,^1 the 3D orientation of the body, and the 2D inclination of the body. The orientations of the models are quantized into a few levels for computational efficiency. The origin of the rotation is chosen so that 0 degrees corresponds to a human facing the camera. We use 0 and 90 degrees to represent the front/back and side views in this work. The 3D models assume that humans are perfectly upright, but the body may be inclined slightly. We use one parameter to capture the inclination in 2D (as opposed to two parameters in 3D). Therefore, the parameters of the $i$th human are $m_i = \{o_i, x_i, y_i, h_i, f_i, i_i\}$, which are the orientation, position, size, thickness, and inclination, respectively. We also write $(x_i, y_i)$ as $u_i$.

With a given camera model and a known ground plane, the 3D shape models automatically incorporate the perspective effect of camera projection (change in object image size and shape due to the change in object position and/or camera viewpoint). Compared to 2D shape models (for example, [13]) or prelearned 2D appearance models (for example, [50]), the 3D models are more easily applicable to a novel viewpoint.
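The head-position parameterization from footnote 1 can be sketched as follows: for a person of height $h$ standing at ground-plane point $(x_w, y_w)$, the image head position is obtained from a $3 \times 3$ homography built out of the columns of the camera projection matrix. The projection matrix below is a made-up example for illustration, not a calibrated camera.

```python
import numpy as np

def head_image_position(P, ground_xy, height):
    """Project the head of a person of `height` standing at `ground_xy`.

    Implements [x, y, 1]^T ~ [p1, p2, p3*h + p4] [xw, yw, 1]^T, where pi is
    the i-th column of the 3x4 projection matrix P.
    """
    p1, p2, p3, p4 = P[:, 0], P[:, 1], P[:, 2], P[:, 3]
    H = np.column_stack([p1, p2, p3 * height + p4])  # 3x3 homography
    xw, yw = ground_xy
    x, y, w = H @ np.array([xw, yw, 1.0])
    return x / w, y / w  # dehomogenize to pixel coordinates

# Synthetic camera (illustrative values only): focal length 800, principal
# point (320, 240), looking along the world y-axis from 10 units back.
P = np.array([[800.0, 0.0,    0.0, 320.0],
              [0.0,   0.0, -800.0, 240.0],
              [0.0,   1.0,    0.0,  10.0]])
u = head_image_position(P, (0.0, 5.0), 1.7)
```

The same homography, evaluated with `height = 0`, gives the image position of the feet, so the model's projected image height falls out of the difference of the two.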
4.2 Object Appearance Model
Besides the shape model, we also use a color histogram of the object, $\tilde{p} = \{\tilde{p}_1, \ldots, \tilde{p}_m\}$ ($m$ is the number of bins of the color histogram), defined within the object shape, as a representation of its appearance, which helps establish correspondence in tracking. We use a color histogram because it is insensitive to the nonrigidity of human motion. Furthermore, there exist efficient algorithms, for example, the mean-shift technique [5], to optimize a histogram-based objective function. When calculating the color histogram, a kernel function $K_E(\cdot)$ with an Epanechnikov profile [5] is applied to weight pixel locations so that the center has a higher weight than the boundary. Such a representation has been used in [6]. Our implementation uses a single red, green, blue (RGB) histogram with 512 bins (eight for each dimension) of all of the samples within the three elliptic regions of our object model.
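A minimal sketch of this appearance model: a 512-bin RGB histogram (eight bins per channel) over a region's pixels, with an Epanechnikov kernel weighting locations near the region center more heavily than the boundary. The region layout and the choice of a single circular support are illustrative simplifications.

```python
import numpy as np

def epanechnikov_weights(coords, center, radius):
    """Epanechnikov profile: w = 1 - r^2 inside the support, 0 outside."""
    r2 = np.sum(((coords - center) / radius) ** 2, axis=1)
    return np.maximum(1.0 - r2, 0.0)

def color_histogram(pixels, coords, center, radius, bins_per_channel=8):
    """Kernel-weighted, normalized RGB histogram with 8^3 = 512 bins."""
    w = epanechnikov_weights(coords, center, radius)
    # Map each (r, g, b) in [0, 256) to a single bin index in [0, 512).
    idx = (pixels // (256 // bins_per_channel)).astype(int)
    flat = (idx[:, 0] * bins_per_channel + idx[:, 1]) * bins_per_channel + idx[:, 2]
    hist = np.bincount(flat, weights=w, minlength=bins_per_channel ** 3)
    return hist / max(hist.sum(), 1e-12)

# Toy example: two pixels, one at the kernel center, one off-center.
pixels = np.array([[10, 200, 30], [250, 0, 128]])
coords = np.array([[0.0, 0.0], [3.0, 4.0]])
h = color_histogram(pixels, coords, center=np.array([0.0, 0.0]), radius=10.0)
```

Normalizing the histogram makes it directly comparable with another histogram via the Bhattacharyya coefficient used later in the likelihood.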
4.3 Background Appearance Model
The background appearance model is a modified version of a Gaussian distribution. Denote by $(\mu_{r_j}, \mu_{g_j}, \mu_{b_j})$ and $\Sigma_j = \mathrm{diag}\{\sigma_{r_j}^2, \sigma_{g_j}^2, \sigma_{b_j}^2\}$ the mean and the covariance of the color at pixel $j$. The probability of pixel $j$ being from the background is
1. The image head location is an equivalent parameterization of the world location on the ground plane $(x_w, y_w)$ given the human height. The two are related by $[x, y, 1]^T \sim [\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3 h + \mathbf{p}_4][x_w, y_w, 1]^T$, where $\mathbf{p}_i$ is the $i$th column of the camera projection matrix and $h$ is the height of the human. For clarity of presentation, we chose the ground plane to be $z = 0$.
Fig. 3. A number of 3D human models to capture the gross shape of human bodies.
$$P_b(I_j) = P_b(r_j, g_j, b_j) \propto \max\left\{ \exp\left[ -\left(\frac{r_j - \mu_{r_j}}{\sigma_{r_j}}\right)^2 - \left(\frac{g_j - \mu_{g_j}}{\sigma_{g_j}}\right)^2 - \left(\frac{b_j - \mu_{b_j}}{\sigma_{b_j}}\right)^2 \right], \; \epsilon \right\}, \quad (2)$$
where $\epsilon$ is a small constant. This is a composition of a Gaussian distribution and a uniform distribution; the uniform distribution captures the outliers that are not modeled by the Gaussian distribution, making the model more robust. The Gaussian parameters (mean and covariance) are updated continuously from the video stream, using only the nonmoving regions. A more sophisticated background model (for example, a mixture of Gaussians [44] or a nonparametric model [10]) could be used to account for more variations, but this is not the focus of this work; we assume that comparison with a background model yields adequate foreground blobs.
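The per-pixel rule of (2) can be sketched in a few lines: a Gaussian score on each color channel, floored by the uniform constant $\epsilon$ that absorbs outliers. The numeric values for the mean, standard deviation, and $\epsilon$ below are illustrative placeholders.

```python
import numpy as np

def background_prob(pixel, mean, std, eps=1e-3):
    """P_b(I_j) ∝ max(exp(-sum(((c - mu_c) / sigma_c)^2)), eps), per (2)."""
    z = (np.asarray(pixel, dtype=float) - mean) / std
    return max(np.exp(-np.sum(z ** 2)), eps)

# Illustrative background statistics for one pixel (gray wall, low variance).
mean = np.array([120.0, 120.0, 120.0])
std = np.array([5.0, 5.0, 5.0])
p_bg = background_prob([120, 120, 120], mean, std)   # matches the model
p_fg = background_prob([255, 0, 0], mean, std)       # clear outlier -> eps
```

Without the $\epsilon$ floor, a single saturated or shadowed pixel would drive the product over a region to zero; the floor keeps one outlier from vetoing an otherwise good background explanation.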
4.4 The Prior Distribution
The prior distribution $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$ is decomposed into two parts given by

$$P\big(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\big) \propto P\big(\theta^{(t)}\big) \, P\big(\theta^{(t)} \mid I^{(1,\ldots,t-1)}\big). \quad (3)$$

$P(\theta^{(t)})$ is independent of time and is defined by $\prod_{i=1}^{n} P(|S_i|) P(m_i)$, where $S_i$ is the projected image of the $i$th object and $|S_i|$ is its area. The prior on the image area, $P(|S_i|)$, is modeled as being proportional to $\exp(-\lambda_1 |S_i|)\,[1 - \exp(-\lambda_2 |S_i|)]$.^2 The first term here penalizes a large total object size to avoid situations where two hypotheses overlap a large portion of an image blob, while the second term penalizes objects with small image sizes, as they are more likely to be due to image noise. Although the prior on 2D image size could be converted to the 3D space, defining this prior in 2D is more natural because these properties model the reliability of image evidence independent of the camera models. The priors on the human body parameters are considered independent. Thus, we have $P(m_i) = P(o_i) P(x_i, y_i) P(h_i) P(f_i) P(i_i)$. We set $P(o_{frontal}) = P(o_{profile}) = 1/2$. $P(x_i, y_i)$ is a uniform distribution over the image region where a human head is plausible. $P(h_i)$ is a Gaussian distribution $N(\mu_h, \sigma_h^2)$ truncated to the range $[h_{min}, h_{max}]$, and $P(f_i)$ is a Gaussian distribution $N(\mu_f, \sigma_f^2)$ truncated to the range $[f_{min}, f_{max}]$. $P(i_i)$ is a Gaussian distribution $N(\mu_i, \sigma_i^2)$. In our experiments, we use $\mu_h = 1.7$ m, $\sigma_h = 0.2$ m, $h_{min} = 1.5$ m, $h_{max} = 1.9$ m, $\mu_f = 1$, $\sigma_f = 0.2$, $f_{min} = 0.8$, $f_{max} = 1.2$, $\mu_i = 0$, and $\sigma_i = 3$ degrees. These parameters correspond to common adult body sizes.
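The shape of the area prior is easy to see numerically: it vanishes at zero area, rises to a peak, then decays for very large areas. The rate constants below are illustrative, not the values used in the paper.

```python
import math

def area_prior(area, lam1=1e-4, lam2=5e-3):
    """Unnormalized P(|S_i|) ∝ exp(-lam1*|S|) * (1 - exp(-lam2*|S|)).

    The first factor penalizes very large projections (overlapping
    hypotheses); the second penalizes tiny regions likely due to noise.
    lam1 and lam2 are illustrative placeholders.
    """
    return math.exp(-lam1 * area) * (1.0 - math.exp(-lam2 * area))
```

With these rates a mid-sized region (a few hundred pixels) scores higher than either a 10-pixel speckle or a region covering most of the image, which is exactly the bias the text describes.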
We approximate the second term on the right side of (3), $P(\theta^{(t)} \mid I^{(1,\ldots,t-1)})$, by $P(\theta^{(t)} \mid \theta^{(t-1)})$, assuming that $\theta^{(t-1)}$ encodes the necessary information from the past observations. For convenience of expression, we rearrange $\theta^{(t)}$ and $\theta^{(t-1)}$ as $\tilde{\theta}^{(t)} = \{(\tilde{k}_i^{(t)}, \tilde{m}_i^{(t)})\}_{i=1}^{N}$ and $\tilde{\theta}^{(t-1)} = \{(\tilde{k}_i^{(t-1)}, \tilde{m}_i^{(t-1)})\}_{i=1}^{N}$, where $N$ is the overall number of objects present in the two frames, so that one of $\{\tilde{k}_i^{(t)} = \tilde{k}_i^{(t-1)},\; \tilde{m}_i^{(t)} = \varnothing,\; \tilde{m}_i^{(t-1)} = \varnothing\}$ is true for each $i$. $\tilde{k}_i^{(t)} = \tilde{k}_i^{(t-1)}$ means that object $\tilde{k}_i^{(t)}$ is a tracked object, $\tilde{m}_i^{(t)} = \varnothing$ means that object $\tilde{k}_i^{(t-1)}$ is a dead object (that is, its trajectory is terminated), and $\tilde{m}_i^{(t-1)} = \varnothing$ means that object $\tilde{k}_i^{(t)}$ is a new object. With the rearranged state vector, we have

$$P\big(\theta^{(t)} \mid \theta^{(t-1)}\big) = P\big(\tilde{\theta}^{(t)} \mid \tilde{\theta}^{(t-1)}\big) = \prod_{i=1}^{N} P\big(\tilde{m}_i^{(t)} \mid \tilde{m}_i^{(t-1)}\big).$$

The temporal prior of each object follows the definition

$$P\big(\tilde{m}_i^{(t)} \mid \tilde{m}_i^{(t-1)}\big) \propto \begin{cases} P_{assoc}\big(\tilde{m}_i^{(t)} \mid \tilde{m}_i^{(t-1)}\big), & \tilde{k}_i^{(t)} = \tilde{k}_i^{(t-1)}, \\ P_{new}\big(\tilde{m}_i^{(t)}\big), & \tilde{m}_i^{(t-1)} = \varnothing, \\ P_{dead}\big(\tilde{m}_i^{(t-1)}\big), & \tilde{m}_i^{(t)} = \varnothing. \end{cases} \quad (4)$$

We assume that the position and the inclination of an object follow constant-velocity models with Gaussian noise and that the height and thickness follow a Gaussian distribution (for simplicity of presentation, we omit the velocity terms in the state). We use Kalman filters for temporal estimation; $P_{assoc}$ is therefore a Gaussian distribution. $P_{new}(\tilde{m}_i^{(t)}) = P_{new}(\tilde{u}_i^{(t)})$ and $P_{dead}(\tilde{m}_i^{(t-1)}) = P_{dead}(\tilde{u}_i^{(t-1)})$ are the likelihoods of the initialization of a new track at position $\tilde{u}_i^{(t)}$ and the termination of an existing track at position $\tilde{u}_i^{(t-1)}$, respectively. They are set empirically according to the distance of the object to the entrances/exits (the boundaries of the image and other areas that people move in/out of). $P_{new}(u) \approx N(\eta(u); \Sigma_e)$, where $\eta(u)$ is the location of the closest entrance point to $u$ and $\Sigma_e$ is its associated covariance matrix, which is set manually or through a learning phase. $P_{dead}(\cdot)$ follows a similar definition.
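The constant-velocity model behind $P_{assoc}$ can be sketched as a small Kalman filter whose state stacks position and velocity; one predict/update cycle is shown below for a single coordinate. The noise magnitudes are illustrative placeholders, not the paper's settings.

```python
import numpy as np

F = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity transition (dt = 1)
H = np.array([[1.0, 0.0]])               # we observe position only
Q = 0.01 * np.eye(2)                     # process noise (illustrative)
R = np.array([[0.25]])                   # measurement noise (illustrative)

def kalman_step(x, P, z):
    """One predict/update cycle; returns the posterior mean and covariance."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

# One step: the object started at 0 with velocity 1, and we observe it at 1.
x, P = np.array([0.0, 1.0]), np.eye(2)
x, P = kalman_step(x, P, np.array([1.0]))
```

The predicted mean and innovation covariance from such a filter are exactly the ingredients of a Gaussian $P_{assoc}$: the association likelihood of a candidate is the Gaussian density of its position under the prediction.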
4.5 Joint Image Likelihood for Multiple Objects and the Background
The image likelihood $P(I \mid \theta)$ reflects the probability that we observe image $I$ (or some features extracted from $I$) given state $\theta$. Here, we develop a likelihood model based on the color information of the background and objects. Given a state vector $\theta$, we partition the image into different regions corresponding to the different objects and the background. Denote by $\tilde{S}_i$ the visible part of the $i$th object defined by $m_i$. The visible part of an object is determined by the depth order of all of the objects, which can be inferred from their 3D positions and the camera model. The entire object region is $S = \cup_{i=1}^{n} S_i = \sum_{i=1}^{n} \tilde{S}_i$ since the $\tilde{S}_i$ are disjoint regions. We use $\bar{S}$ to denote the complementary region of $S$, that is, the nonobject region. The relationship of the regions is illustrated in Fig. 4.
In the case of multiple objects, which can possibly overlap in the image, the likelihood of the image given the state cannot simply be decomposed into the likelihoods of the individual objects. Instead, a joint likelihood of the whole image, given all objects and the background model, needs to be considered. The joint likelihood $P(I \mid \theta)$ consists of two terms corresponding to the object region and the nonobject region:

$$P(I \mid \theta) = P(I_S \mid \theta) \, P(I_{\bar{S}} \mid \theta). \quad (5)$$
2. We used a prior on the number of objects in [53] to constrain oversegmentation. However, we found that the prior on the area is more effective due to the large variation in the image sizes of the objects (due to the camera perspective effect) and, therefore, their different contributions to the likelihood.
After obtaining $\tilde{S}_i$ by occlusion reasoning, the object region likelihood can be calculated by

$$P(I_S \mid \theta) = \prod_{i=1}^{n} P\big(I_{\tilde{S}_i} \mid m_i\big) \propto \exp\bigg\{ \lambda_S \sum_{i=1}^{n} |\tilde{S}_i| \Big[ \underbrace{-\lambda_b B(p_i, d_i)}_{(1)} + \underbrace{\lambda_f B(p_i, \tilde{p}_i)}_{(2)} \Big] \bigg\}, \quad (6)$$
where $d_i$ is the color histogram of the background image within the visibility mask of object $i$ and $\tilde{p}_i$ is the color histogram of the object; both are weighted by the kernel function $K_E(\cdot)$. $B(p, d) = \sum_{j=1}^{m} \sqrt{p_j d_j}$ is the Bhattacharyya coefficient, which reflects the similarity of the two histograms.
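The Bhattacharyya coefficient itself is a one-liner over normalized histograms: it equals 1 for identical distributions and 0 for distributions with disjoint support.

```python
import numpy as np

def bhattacharyya(p, d):
    """B(p, d) = sum_j sqrt(p_j * d_j) for two normalized histograms."""
    return float(np.sum(np.sqrt(np.asarray(p) * np.asarray(d))))

same = bhattacharyya([0.5, 0.5], [0.5, 0.5])      # identical -> 1.0
disjoint = bhattacharyya([1.0, 0.0], [0.0, 1.0])  # no overlap -> 0.0
```

In (6) it appears with opposite signs: negated against the background histogram (the hypothesis should not look like the background) and positive against the tracked object's histogram (it should look like its predecessor).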
This likelihood favors both the difference of an object hypothesis from the background and its similarity to its corresponding object in a previous frame (Fig. 4). This enables simultaneous segmentation and tracking with the same objective function. We call the two terms background exclusion and object attraction, respectively. The background exclusion concept was also proposed in [33]. $\lambda_b$ and $\lambda_f$ weight the relative contributions of the two terms (we constrain $\lambda_b + \lambda_f = 1$). The object attraction term is the same as the likelihood function used in [6]. For an object without a correspondence, that is, a new object, only the background exclusion part is used.
The nonobject likelihood is calculated by

$$P(I_{\bar{S}} \mid \theta) = \prod_{j \in \bar{S}} P_b(I_j)^{\lambda_{\bar{S}}} \propto \exp\bigg( \lambda_{\bar{S}} \sum_{j \in \bar{S}} e_j \bigg), \quad (7)$$

where $e_j = \log(P_b(I_j))$ is the log-probability of pixel $j$ belonging to the background model, as defined in (2). $\lambda_S$ in (6) and $\lambda_{\bar{S}}$ in (7) weight the balance of the foreground and the background, considering the different probabilistic models being used. The posterior probability is obtained by combining the prior, (3), and the likelihood, (5).
5 COMPUTING MAP BY EFFICIENT MCMC
Computing the MAP is an optimization problem. Due to the joint consideration of an unknown number of objects, the solution space contains subspaces of varying dimensions. It also includes both discrete and continuous variables. These properties make the optimization challenging. We use an MCMC method with jump/diffusion dynamics to sample the posterior probability. Jumps cause the Markov chain to move between subspaces with different dimensions and to traverse the discrete variables; diffusions make the Markov chain sample the continuous variables. In the process of sampling, the best solution is recorded, and the uncertainty associated with the solution is also obtained.

Fig. 5 gives a block diagram of the computation process. The MCMC-based algorithm is an iterative process, starting from an initial state. In each iteration, a candidate is proposed from the state in the previous iteration, assisted by image features. The candidate is accepted probabilistically according to the Metropolis-Hastings rule [17]. The state corresponding to the maximum posterior value is recorded and becomes the solution.
Suppose we want to design a Markov chain with stationary distribution $P(\theta) = P(\theta^{(t)} \mid I^{(t)}, \theta^{(t-1)})$. At the $g$th iteration, we sample a candidate state $\theta'$ conditioned on $\theta_{g-1}$ from a proposal distribution $q(\theta' \mid \theta_{g-1})$. The candidate state $\theta'$ is accepted with probability

$$p = \min\left\{ 1, \; \frac{P(\theta') \, q(\theta_{g-1} \mid \theta')}{P(\theta_{g-1}) \, q(\theta' \mid \theta_{g-1})} \right\}.^3$$

If the candidate state $\theta'$ is accepted, $\theta_g = \theta'$; otherwise, $\theta_g = \theta_{g-1}$. It can be proven that the Markov chain constructed in this way has a stationary distribution equal to $P(\cdot)$, independent of the choice of the proposal probability $q(\cdot)$ and the initial state $\theta_0$ [47]. However, the choice of the proposal probability $q(\cdot)$ can affect the efficiency of MCMC significantly. Random proposal probabilities lead to a very slow mixing rate. Using more informed proposal probabilities, for example, as in data-driven MCMC [48], makes the Markov chain traverse the solution space more efficiently. Therefore, the proposal distribution is written as $q(\theta' \mid \theta_{g-1}, I)$. If the proposal probability is informative enough that each sample can be thought of as a hypothesis, then the MCMC approach becomes a stochastic version of the hypothesize-and-test approach. In general, the
Fig. 4. First pane: the relationship of the visible object regions and the nonobject region. Remaining panes: the color likelihood model. In $\tilde{S}_i$, the likelihood favors both the difference of an object hypothesis from the background and its similarity to its corresponding object in a previous frame. In $\bar{S}$, the likelihood penalizes the difference from the background model. Note that the elliptic models are used for illustration.
3. Based on our experiments, we find that approximating the ratio in the second term with just the posterior probability ratio, $P(\theta')/P(\theta_{g-1})$, gives almost the same results as the complete computation; hence, we use this approximation in our implementation.
Fig. 5. The block diagram of the MCMC tracking algorithm.
original version of MCMC has a dimension matching problem
for a solution space with varying dimensionality. A variation
of MCMC, called trans-dimensional MCMC [14], is proposed
to solve this problem. However, with appropriate assumptions and simplifications, the trans-dimensional MCMC can be reduced to the standard MCMC. We address
this issue later in this section.
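The accept/reject loop described above can be sketched generically. The following Python sketch is not the authors' implementation; `log_post` and `propose` are hypothetical callables supplied by the application, and the best state visited is recorded, which is how the MAP estimate is read off the chain.

```python
import math
import random

def metropolis_hastings(log_post, propose, theta0, n_iter=500):
    """Generic Metropolis-Hastings loop (a sketch of the sampler in the text).

    log_post(theta) -> log P(theta)
    propose(theta)  -> (theta', log q(theta'|theta), log q(theta|theta'))
    Returns the best state visited and its log posterior.
    """
    theta = theta0
    lp = log_post(theta)
    best, best_lp = theta, lp
    for _ in range(n_iter):
        cand, log_q_fwd, log_q_bwd = propose(theta)
        lp_cand = log_post(cand)
        # Metropolis-Hastings acceptance ratio, in the log domain
        log_alpha = (lp_cand + log_q_bwd) - (lp + log_q_fwd)
        if math.log(random.random() + 1e-300) < log_alpha:
            theta, lp = cand, lp_cand
        if lp > best_lp:              # record the MAP sample
            best, best_lp = theta, lp
    return best, best_lp
```

With a symmetric (random-walk) proposal, the two `log q` terms cancel, recovering the plain Metropolis rule.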
5.1 Markov Chain Dynamics
We design the following reversible dynamics for the Markov chain to traverse the solution space. The dynamics correspond to a proposal distribution with the mixture density $q(\theta' | \theta_{g-1}, I) = \sum_{a \in A} p_a\, q_a(\theta' | \theta_{g-1}, I)$, where $A$ is the set of all dynamics:

$$A = \{\mathrm{add}, \mathrm{remove}, \mathrm{establish}, \mathrm{break}, \mathrm{exchange}, \mathrm{diff}\}.$$

The mixing probabilities $p_a$ are the chances of selecting the different dynamics, with $\sum_{a \in A} p_a = 1$.
We assume that we have the sample from the $(g-1)$th iteration, $\theta^{(t)}_{g-1} = \{(k_1, m_1), \ldots, (k_n, m_n)\}$, and now propose a candidate $\theta'$ for the $g$th iteration ($t$ is omitted where there is no ambiguity).
Object hypothesis addition. Sample the parameters of a new human hypothesis $(k_{n+1}, m_{n+1})$ and add it to $\theta_{g-1}$. $q_{\mathrm{add}}(\theta_{g-1} \cup \{(k_{n+1}, m_{n+1})\} | \theta_{g-1}, I)$ is defined in a data-driven way whose details are given later.
Object hypothesis removal. Randomly select an existing human hypothesis $r \in [1, n]$ with a uniform distribution and remove it: $q_{\mathrm{remove}}(\theta_{g-1} \setminus \{(k_r, m_r)\} | \theta_{g-1}) = 1/n$. If $k_r$ has a correspondence in $\theta^{(t-1)}$, then that object becomes dead.

Establish correspondence. Randomly select a new object $r$ in $\theta^{(t)}_{g-1}$ and a dead object $r'$ in $\theta^{(t-1)}$ and establish their temporal correspondence: $q_{\mathrm{establish}}(\theta' | \theta_{g-1}) \propto \|u_r - u_{r'}\|^{-2}$ for all of the qualified pairs.
Break correspondence. Randomly select an object $r$ with $k_r \in \theta^{(t-1)}$ with a uniform distribution and change $k_r$ to a new object (the same object in $\theta^{(t-1)}$ becomes dead): $q_{\mathrm{break}}(\theta' | \theta_{g-1}) = 1/n'$, where $n'$ is the number of objects in $\theta^{(t)}_{g-1}$ that have correspondences in the previous frame.
Exchange identity. Exchange the IDs of two nearby objects. Randomly select two objects $r_1, r_2 \in [1, n]$ and exchange their IDs: $q_{\mathrm{exchange}}(r_1, r_2) \propto \|u_{r_1} - u_{r_2}\|^{-2}$. Identity exchange could also be composed from breaking and establishing correspondences; it is included to ease traversal, since breaking and then establishing a correspondence may cause a large intermediate decrease in the probability and is therefore less likely to be accepted.
Parameter update. Update the continuous parameters of an object. Randomly select an existing human hypothesis $r \in [1, n]$ with a uniform distribution and update its continuous parameters: $q_{\mathrm{diff}}(\theta' | \theta_{g-1}) = (1/n)\, q_d(m'_r | m_r)$.
Among the above, addition and removal are a pair of reverse moves, as are establishing and breaking correspondences; exchanging identity and parameter updating are their own reverse moves.
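The move-selection step can be made concrete with a small sketch. This is illustrative only; the mixing probabilities shown are the values reported later in Section 6, and the `REVERSE` table records the reverse-move pairing just described.

```python
import random

# Mixing probabilities p_a for the six dynamics (values from Section 6).
MIX = {"add": 0.1, "remove": 0.1, "establish": 0.1,
       "break": 0.1, "exchange": 0.1, "diff": 0.5}

# Each move paired with its reverse; exchange and diff are self-reverse.
REVERSE = {"add": "remove", "remove": "add",
           "establish": "break", "break": "establish",
           "exchange": "exchange", "diff": "diff"}

def sample_dynamic(rng=random):
    """Sample a move type a with probability p_a (inverse-CDF sampling)."""
    r, acc = rng.random(), 0.0
    for name, p in MIX.items():
        acc += p
        if r < acc:
            return name
    return "diff"  # guard against floating-point round-off
```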
5.2 Informed Proposal Probability
In theory, the proposal probability $q(\cdot)$ does not affect the stationary distribution. However, different choices of $q(\cdot)$ lead to different performance: the number of samples needed to reach a good solution depends strongly on the proposal probabilities. In this application, the proposal probability for adding a new object and that for updating the object parameters are the two most important ones. We use the following informed proposal probabilities to make the Markov chain more intelligent and thus achieve a higher acceptance rate.
Object addition. We add human hypotheses from three cues: foreground boundaries, intensity edges, and foreground residue (foreground with the existing objects carved out). In [54], a method to detect the heads that are on the boundary of the foreground is described. The basic idea is to find the local vertical peaks of the boundary. The peaks are further verified by checking whether there are enough foreground pixels below them, according to a human height range and the camera model. This detector has a high detection rate and is also effective when the human is small and image edges are unreliable; however, it cannot detect heads in the interior of the foreground blobs. Fig. 6a shows an example of head detection from foreground boundaries.

The second head detection method is based on an "Ω"-shape head-shoulder model (this term was first introduced in [53]). This detector matches the Ω-shape edge template with the image intensity edges to find head candidates. First,
the Canny edge detector is applied to the foreground regionof the input image. A distance transformation [1] is computed
on the edge map. Fig. 6b shows the exponential edge map, where $E(x, y) = \exp(-\lambda D(x, y))$ ($D(x, y)$ is the distance to the closest edge point and $\lambda$ is a factor controlling the response field depending on the object scale in the image; we use $\lambda = 0.25$).
In addition, the coordinates of the closest edge point are also recorded as $\tilde{C}(x, y)$. The unit image gradient vector $\tilde{O}(x, y)$ is computed only at edge pixels. The Ω-shape model (see Fig. 6c) is derived by projecting a generic 3D human model to
the image and taking the contour of the whole head and theupper quarter torso as the shoulder. The normals of the
contour points are also computed. The size of the human
Fig. 6. Head detection. (a) Head detection from foreground blob
boundaries. (b) Distance transformation on the Canny edge detection
result. (c) The Ω-shape head-shoulder model (black: head-shoulder shape, white: normals). (d) Head detection from intensity edges.
model is determined by the camera calibration assuming anaverage human height.
Denote by $\{\tilde{u}_1, \ldots, \tilde{u}_k\}$ and $\{\tilde{v}_1, \ldots, \tilde{v}_k\}$ the positions and the unit normals of the model points, respectively, when the head top is at $(x, y)$. The model is matched with the image as

$$S(x, y) = \frac{1}{k} \sum_{i=1}^{k} e^{-\lambda D(\tilde{u}_i)} \left(\tilde{v}_i \cdot \tilde{O}(\tilde{C}(\tilde{u}_i))\right).$$

A head candidate map is constructed by evaluating $S(x, y)$ at every pixel in the dilated foreground region. After smoothing it, we find all of the peaks above a threshold chosen to give a very high detection rate, at the cost of a high false alarm rate. An example is shown in Fig. 6d. The false alarms tend to occur in areas of rich texture, where there are abundant edges of various orientations.
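Evaluating the matching score at one candidate head-top location can be sketched as follows. This is a sketch only: the array layouts for `D`, `C`, and `O` are assumptions, and taking the absolute value of the dot product (to ignore edge polarity) is our choice, not stated in the text.

```python
import numpy as np

def omega_match_score(top, model_pts, model_normals, D, C, O, lam=0.25):
    """Score S(x, y) for the Omega-shape template with its head top at `top`.

    Assumed layouts:
    D[y, x]    - distance transform of the Canny edge map
    C[y, x]    - (y, x) coordinates of the closest edge pixel
    O[y, x]    - unit gradient vector at edge pixels, shape (H, W, 2)
    model_pts / model_normals - k template offsets u_i and unit normals v_i
    """
    ty, tx = top
    k = len(model_pts)
    score = 0.0
    for (dy, dx), v in zip(model_pts, model_normals):
        y, x = ty + dy, tx + dx
        if not (0 <= y < D.shape[0] and 0 <= x < D.shape[1]):
            continue
        cy, cx = C[y, x]                        # closest edge pixel
        e = np.exp(-lam * D[y, x])              # exponential edge map E(x, y)
        score += e * abs(np.dot(v, O[cy, cx]))  # orientation agreement
    return score / k
```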
Finally, after the human hypotheses obtained from the first two methods are removed from the foreground, the foreground residue map $R = F - S$ is computed. A morphological "open" operation with a vertically elongated structural element is applied to remove thin bridges and small/thin residues. From each connected component $c$, human candidates are generated, assuming that 1) the centroid of $c$ is aligned with the center of the human body, 2) the top center point of $c$ is aligned with the human head, or 3) the bottom center point of $c$ is aligned with the human feet.
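The residue step can be sketched with plain NumPy; the structuring-element height and the 4-connectivity used for components are illustrative choices, not values from the paper.

```python
import numpy as np
from collections import deque

def vertical_open(R, h=5):
    """Morphological opening with an h x 1 (vertically elongated) element."""
    H, W = R.shape
    er = np.zeros_like(R)
    for y in range(H - h + 1):
        er[y] = R[y:y + h].all(axis=0)      # erosion
    di = np.zeros_like(R)
    for y in range(H - h + 1):
        di[y:y + h] |= er[y]                # dilation
    return di

def residue_candidates(F, S, h=5):
    """Human-position candidates from the residue map R = F - S (a sketch)."""
    R = vertical_open(F & ~S, h)
    seen = np.zeros_like(R)
    cands = []
    H, W = R.shape
    for sy, sx in zip(*np.nonzero(R)):
        if seen[sy, sx]:
            continue
        comp, q = [], deque([(sy, sx)])     # flood-fill one component
        seen[sy, sx] = True
        while q:
            y, x = q.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < H and 0 <= nx < W and R[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    q.append((ny, nx))
        ys = [p[0] for p in comp]
        xs = [p[1] for p in comp]
        cx = sum(xs) / len(xs)
        cands += [("center", cx, sum(ys) / len(ys)),  # centroid = body center
                  ("head",   cx, min(ys)),            # top center = head
                  ("feet",   cx, max(ys))]            # bottom center = feet
    return cands
```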
The proposal probability for addition combines these three head detection methods: $q_a(k, m) = \sum_{i=1}^{3} \alpha_{ai}\, q_{ai}(k, m)$, where $\alpha_{ai}$, $i = 1, 2, 3$, are the mixing probabilities of the three methods; we use $\alpha_{ai} = 1/3$. $q_{ai}(\cdot)$ samples $m$ first and then $k$: $q_{ai}(k, m) = q_{ai}(m)\, q_{ai}(k|m)$, and

$$q_{ai}(m) = q_o(o)\, q_{ai}(u)\, q_h(h)\, q_f(f)\, q_i(i).$$

$q_{ai}(u)$ answers the question "where do we add a new human hypothesis." In practice, $q_o(o)$, $q_h(h)$, $q_f(f)$, and $q_i(i)$ use their respective prior distributions, and $q_{ai}(u)$ is a mixture of Gaussians based on the bottom-up detection results. For example, denote by $HC_1 = \{(x_i, y_i)\}_{i=1}^{N_1}$ the head candidates obtained by the first method; then $q_{a1}(u) = q_{a1}(x, y) \propto \sum_{i=1}^{N_1} N((x_i, y_i), \mathrm{diag}\{\sigma_x^2, \sigma_y^2\})$. The definitions of $q_{a2}(u)$ and $q_{a3}(u)$ are similar. After $u'$ is sampled, $q(k|m) \propto q(k|u')$ samples $k$ from $\{k^{(t-1)}_{d_1}, \ldots, k^{(t-1)}_{d_{n_d}}, \mathrm{new}\}$ according to $P(u'|u^{(t-1)}_{d_i})$, see (4), $i = 1, \ldots, n_d$, and $P_{\mathrm{new}}(u')$, where $n_d$ is the number of dead objects.

The addition and removal actions change the dimension
of the state vector. When calculating the acceptance probability, we need to compute the ratio of probabilities from spaces with different dimensions. Smith et al. [41] use an explicit strategy of trans-dimensional MCMC [14] to deal with the dimension-matching problem. We do not need an explicit strategy to match the dimension: since the trans-dimensional actions only add or remove one object per iteration, leaving the other objects unchanged, the Jacobian in [14] is unity, as in [41]. Therefore, our formulation is just a special case of the more general theory.
Parameter update. We use two ways to update the model parameters:

$$q_{\mathrm{diff}}(m'_r | m_r) = \alpha_{d1}\, q_{d1}(m'_r | m_r) + \alpha_{d2}\, q_{d2}(m'_r | m_r),$$

with $\alpha_{di} = 1/2$. $q_{d1}(\cdot)$ uses stochastic gradient descent to update the object parameters: $q_{d1}(m'_r | m_r) \propto N(m_r - k \frac{dE}{dm}, w)$, where $E = -\log P(\theta^{(t)} | I^{(t)}, \theta^{(t-1)})$ is the energy function, $k$ is a scalar controlling the step size, and $w$ is random noise added to avoid local maxima.

A mean-shift vector computed in the visible region provides an approximation of the gradient of the object likelihood with respect to the position: $q_{d2}(m'_r | m_r) \propto N(m^{ms}_r, w)$, where $m^{ms}_r$ is the new location computed by the mean-shift procedure (details are given in the Appendix). We assume that the change in the posterior probability due to the other components and due to occlusion can be absorbed in the noise term. The mean shift has an adaptive step size and better convergence behavior than numerically computed gradients; the rest of the parameters follow their numerically computed gradients. Compared to the original color-based mean-shift tracking, the background exclusion term in (6) can exploit a known background model, which is available for a stationary camera. As we observe in our experiments, tracking using the above likelihood is more robust to changes in the appearance of the object, for example, when it moves into shadow, than using the object attraction term alone.
Theoretically, the Markov chain designed should be irreducible and reversible; however, the use of the above data-driven proposal probabilities makes the approach not conform to the theory exactly. First, irreducibility requires the Markov chain to be able to reach any point in the solution space. In practice, however, the proposal probability of some points is very small, close to zero. For example, the proposal probability of adding a hypothesis at a position where no head candidate is detected nearby is extremely low; with a finite number of iterations, a state including such a hypothesis will never be sampled. Although this breaks the completeness of the Markov chain, we argue that skipping the parts of the solution space where no sign of an object is observed does no harm to the quality of the final solution and makes the search more efficient. Second, the use of the mean shift, which is a nonparametric method, makes the chain irreversible. Mean shift can be seen as an approximation of the gradient, while stochastic gradient descent is essentially a Gibbs sampler [39], which is a special case of the Metropolis-Hastings sampler with an acceptance ratio always equal to one [25]. However, mean shift is much faster than a random walk at estimating the parameters of the object. We choose these techniques, at the loss of some theoretical elegance, because they make our method much more efficient in practice and the results are good.
5.3 Incremental Computation
As the MCMC process may need hundreds of samples or more to approximate the distribution, we need an efficient method to compute the likelihood for each proposed state. In one iteration of the algorithm, at most two objects may change. This affects the likelihood only locally; therefore, the new likelihood can be computed more efficiently by updating it incrementally, only within the neighborhood of the change (the area associated with the changed objects and those overlapping with them).
Take the addition action as an example. When a new human hypothesis is added to the state vector, for the likelihood of the nonobject region, $P(I_{\bar{S}} | \theta)$, we only need to remove those background pixels taken by the new hypothesis. For the likelihood of the object region, $P(I_S | \theta)$, as the new hypothesis may overlap with some existing hypotheses, we need to recompute the visibility of the object regions connected to the new hypothesis and then update the likelihood of these neighboring objects. The incremental computations of the likelihood for the other actions are similar. Although a joint state and a joint likelihood are used, the computation per iteration is greatly reduced through the incremental computation. This is in contrast to the particle filter, where the evaluation of each particle (a joint state) requires computing the full joint likelihood.
The appearance models of the tracked objects are updated after processing each frame to adapt to changes in object appearance. We update the object color histogram using an Infinite Impulse Response (IIR) filter, $\tilde{p}^{(t)} = \alpha_p p^{(t)} + (1 - \alpha_p) \tilde{p}^{(t-1)}$. We choose to update the appearance conservatively: we use a small $\alpha_p = 0.01$ and stop updating if the object is occluded by more than 25 percent or its position covariance is too big.
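The conservative update rule can be written directly. Note that the covariance threshold `max_var` below is a hypothetical knob; the paper states only that updating stops when the position covariance is "too big."

```python
import numpy as np

def update_appearance(p_model, p_obs, visibility, pos_var,
                      alpha_p=0.01, max_occluded=0.25, max_var=4.0):
    """Conservative IIR update of an object's color histogram.

    p_model, p_obs: current model and newly observed m-bin histograms.
    Skips the update when occlusion exceeds 25 percent or the position
    variance exceeds max_var (an illustrative threshold).
    """
    if (1.0 - visibility) > max_occluded or pos_var > max_var:
        return p_model                       # keep the old model
    return alpha_p * p_obs + (1.0 - alpha_p) * p_model
```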
6 EXPERIMENTAL RESULTS
We have experimented with the system on many types of data and show only some representative results. We first show results on an outdoor scene video and then on a standard evaluation data set of indoor scene videos.
Among all of the parameters of our approach, many are "natural," meaning that they correspond to measurable physical quantities (for example, 3D human height); therefore, setting their values is straightforward. We use the same set of parameters for all of the sequences, which indicates that our approach is not sensitive to the choice of parameter values. We list here the values of the parameters not mentioned in the previous sections. For the size prior (in Section 4.4), $\sigma_1 = 0.04$ and $\sigma_2 = 0.002$. For the likelihood, $\lambda_f = 0.5$ and $\lambda_b = 0.5$ in (6), $\lambda_S = 25$ in (6), and $\lambda_{\bar{S}} = 0.005$ in (7). For the mixing probabilities of the different types of dynamics, we use $p_{\mathrm{add}} = 0.1$, $p_{\mathrm{remove}} = 0.1$, $p_{\mathrm{establish}} = 0.1$, $p_{\mathrm{break}} = 0.1$, $p_{\mathrm{exchange}} = 0.1$, and $p_{\mathrm{diff}} = 0.5$. We also apply a hard constraint of 25 pixels on the minimum image height of a human.
We also want to comment here on the choice of parameters related to the peakedness of a distribution in sampling algorithms. The image likelihood is usually a combination of a number of components (sites, e.g., pixels). Inevitable simplifications in probabilistic modeling (for example, independence assumptions) may result in excessive peakedness of the distribution, which hurts sampling algorithms such as MCMC and the particle filter by concentrating the samples in a single location (the highest peak) of the state space, making them degenerate into greedy algorithms. Eliminating the dependencies between components can be extremely difficult or infeasible. From an engineering point of view, one should set the values of the parameters (for example, $\lambda_S$ and $\lambda_{\bar{S}}$, while keeping their ratio constant) so that the likelihood ratio between different hypotheses is reasonable, so that Markov chains can traverse efficiently and particle filters can maintain multiple hypotheses. In a similar fashion, simulated annealing has been used in the sampling process to reduce the effect of peakedness and force convergence [48], [8]; however, the varying temperature means the samples are not drawn from a single posterior distribution.
6.1 Evaluation on an Outdoor Scene
We show results on an outdoor video sequence, which we call the "Campus Plaza" sequence and which contains 900 frames. This sequence is captured from a camera above a building gate with a 40 degree camera tilt angle. The frame size is 360 × 240 pixels and the sampling rate is 30 fps. In this sequence, 33 humans pass through the scene, with 23 going out of the field of view and 10 going inside a building. The interhuman occlusions in this sequence are large. There are 20 occlusion events overall, nine of which are heavy occlusions (over 50 percent of the object is occluded). For MCMC sampling, we use 500 iterations per frame. We show in Fig. 7 some sample frames from the result on this sequence. The identities of the objects are shown by their ID numbers displayed on the head.
We evaluate the results by the trajectory-based errors.
Trajectories whose lengths are less than 10 frames are
discarded. Among the 33 human objects, trajectories of three
objects are broken once (ID 28 → ID 35, ID 31 → ID 32, and ID 30 → ID 41, all between frames 387 and 447, as marked
with arrows in Fig. 7); the rest of the trajectories are correct.
Usually, the trajectories are initialized once the humans are
fully in the scene; some start when the objects are only
partially inside. Only the initializations of three objects
(objects 31, 50, 52) are noticeably delayed (by 50, 55, and
60 frames, respectively, after they are fully in the scene).
Partial occlusion and/or the lack of contrast with the
background are the causes of the delays. To justify our
approach for integrated segmentation and tracking, we
compare the tracking result with the result using frame-by-
frame segmentation as in [53], where we use frame-based
evaluation metrics. The detection rate and the false-alarm
rate are 98.13 and 0.27 percent, respectively. The detection
rate and the false-alarm rate of the same sequence by using
segmentation alone are 92.82 and 0.18 percent. With
tracking, not only are the temporal correspondences
obtained but the detection rate is also increased by a large
margin, while the false-alarm rate is kept low.
6.2 Evaluation on Indoor Scene Sequences
Next, we describe the results of our method on an indoor
video set, Context-Aware Vision using Image-based Active
Recognition (CAVIAR) video corpus⁴ [56]. We test our
system on the 26 “shopping center corridor view”
sequences, 36,292 frames overall, captured by a camera
looking down toward a corridor. The frame size is 384 × 288
pixels and the sampling rate is 25 fps. Some 2D-3D point
correspondences are given from which the camera can be
4. In the provided ground truth, there are 232 trajectories overall. However, five of these are mostly out of sight, for example, only one arm or the head top is visible; we set these as "do not care."
calibrated. However, we compute the camera parameters by
an interactive method [26].

The interobject occlusion in this set is also intensive.
Overall, there are 96 occlusion events in this set, 68 out of 96
are heavy occlusions, and 19 out of the 96 are almost fully
occluded (more than 90 percent of the object is occluded).
Many interactions between humans, such as talking and
handshaking, make this set very difficult for tracking. For
MCMC sampling, we use 500 iterations per frame again. For
such a big data set, it is infeasible to enumerate the errors as
Fig. 7. Selected frames of the tracking results from “Campus Plaza.” The numbers on the heads show identities. (Please note that the two people
who are sitting on two sides are in the background model and, therefore, not detected.)
we did for the "Campus Plaza" sequence. Instead, we defined five statistical criteria:

1. the number of mostly tracked trajectories,
2. the number of mostly lost trajectories,
3. the number of trajectory fragments,
4. the number of false trajectories (a result trajectory corresponding to no object), and
5. the frequency of identity switches (identity exchanging between a pair of result trajectories).
Fig. 8 illustrates their definitions. These five categories are by no means a complete classification; however, they cover most of the typical errors observed on this set. Other performance measures have been proposed in recent evaluations, such as the Multiple Object Tracking Precision and Accuracy in the CLEAR 2006 evaluation [57]. We do not use these measures because they are less intuitive, as they try to integrate multiple factors into one scalar-valued measure.
Table 1 gives the performance of our method. We developed evaluation software to count the numbers of mostly tracked trajectories, mostly lost trajectories, false alarms, and fragments automatically. Denote a ground-truth trajectory by $\{G^{(i)}, \ldots, G^{(i+n)}\}$, where $G^{(t)}$ is the object state at the $t$th frame; denote a hypothesized trajectory by $\{H^{(j)}, \ldots, H^{(j+m)}\}$. The overlap ratio of the ground-truth object and the hypothesized object at the $t$th frame is defined by

$$\mathrm{Overlap}(G^{(t)}, H^{(t)}) = \frac{\mathrm{Reg}(G^{(t)}) \cap \mathrm{Reg}(H^{(t)})}{\mathrm{Reg}(G^{(t)}) \cup \mathrm{Reg}(H^{(t)})}, \qquad (8)$$

where $\mathrm{Reg}(\cdot)$ is the image region of the object. If $\mathrm{Overlap}(G^{(t)}, H^{(t)}) > 0.5$, we say $\{G^{(t)}, H^{(t)}\}$ is a potential match. The overlap ratio of the ground-truth trajectory and the hypothesized trajectory is defined by

$$\mathrm{Overlap}(G^{(i:i+n)}, H^{(j:j+m)}) = \frac{\sum_{t=\max(i,j)}^{\min(i+n,\, j+m)} \delta\!\left(\mathrm{Overlap}(G^{(t)}, H^{(t)}) > 0.5\right)}{\max(i+n,\, j+m) - \min(i, j) + 1}, \qquad (9)$$

where $\delta(\cdot)$ is an indicator function. Given that one sequence has $N_G$ ground-truth trajectories, $\{G_k\}_{k=1}^{N_G}$, and $N_H$ hypothesized trajectories, $\{H_k\}_{k=1}^{N_H}$, we compute the overlap ratios for all ground-truth/hypothesis pairs $\{G_k, H_l\}$; the pairs whose overlap ratios are larger than 0.8 are considered potential matches. Then, the Hungarian matching algorithm [22] is used to find the best matches, which are considered mostly tracked. To count the mostly lost trajectories, we define a recall ratio by replacing the denominator of (9) with $n + 1$. If, for $G_k$, there is no $H_l$ such that the recall ratio between them is larger than 0.2, we consider $G_k$ mostly lost. To count the false alarms and fragments, we define a precision ratio by replacing the denominator of (9) with $m + 1$. If, for $H_l$, there is no $G_k$ such that the precision ratio between them is larger than 0.2, we consider $H_l$ a false alarm; if there is a $G_k$ such that the precision ratio between them is larger than 0.8 but the overlap ratio is smaller than 0.8, we consider $H_l$ a fragment of $G_k$. We first count the mostly tracked trajectories and remove the matched parts of the ground-truth tracks. Second, we count the trajectory fragments with a greedy iterative algorithm: at each round, the fragment with the highest overlap ratio is found, and the matched part of the ground-truth track is removed; this procedure is repeated until there are no more valid fragments. Last, we count the mostly lost trajectories and the false alarms. This algorithm cannot classify all ground-truth and hypothesized tracks; the unlabeled ones are mainly due to identity switches. We count the frequency of identity switches visually.
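Equations (8) and (9), and their recall/precision variants, can be sketched for axis-aligned boxes. Using boxes to stand in for Reg(·) is our simplification; the `denom` override reproduces the recall ratio (denominator n+1) and precision ratio (denominator m+1) described above.

```python
def overlap_frame(regG, regH):
    """Eq. (8): intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix = max(0, min(regG[2], regH[2]) - max(regG[0], regH[0]))
    iy = max(0, min(regG[3], regH[3]) - max(regG[1], regH[1]))
    inter = ix * iy
    areaG = (regG[2] - regG[0]) * (regG[3] - regG[1])
    areaH = (regH[2] - regH[0]) * (regH[3] - regH[1])
    union = areaG + areaH - inter
    return inter / union if union > 0 else 0.0

def overlap_traj(G, H, i, j, denom=None):
    """Eq. (9): fraction of frames with per-frame overlap > 0.5.

    G covers frames i..i+n, H covers j..j+m (lists of boxes).
    `denom` overrides the denominator: n+1 gives the recall ratio,
    m+1 the precision ratio.
    """
    lo = max(i, j)
    hi = min(i + len(G) - 1, j + len(H) - 1)
    hits = sum(overlap_frame(G[t - i], H[t - j]) > 0.5
               for t in range(lo, hi + 1))
    if denom is None:
        denom = max(i + len(G) - 1, j + len(H) - 1) - min(i, j) + 1
    return hits / denom
```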
Some sample frames and results are shown in Fig. 9. Most of the missed detections occur when a human wears clothing whose color is very similar to the background, so that part of the object is misclassified as background; see frame 1,413 in Fig. 9b for an example. Trajectory fragmentation and ID switches are mainly due to full occlusions; see frame 496 in Fig. 9a and frame 316 in Fig. 9b for examples. Our method deals with partial occlusion well. For full occlusion, classifying an object as being in an "occluded" state and reassociating it when it reappears could potentially improve performance. The false alarms are mainly due to shadows, reflections, and sudden brightness changes that are misclassified as foreground; see frame 563 in Fig. 9a. A more sophisticated background model and shadow model (for example, [32]) could be used to improve the result. In general, our method performs reasonably well on the CAVIAR set, though not as well as on the "Campus Plaza" sequence, mainly due to the abovementioned difficulties. The running speed of the system is about 2 fps on a 2.8 GHz Pentium IV CPU. The implementation is in C++ without any special optimization.
7 CONCLUSION AND FUTURE WORK
We have presented a principled approach to simultaneously detect and track humans in a crowded scene acquired from a single stationary camera. We take a model-based approach and formulate the problem as a Bayesian MAP estimation problem: we compute the best interpretation of the image observations collectively from the 3D human shape model, the acquired human appearance model, the background appearance model, the camera model, the assumption that humans move on a known ground plane, and the object priors. The image is modeled as a composition of an unknown number of possibly overlapping objects
Fig. 8. Tracking evaluation criteria.
TABLE 1Results of Performance Evaluations on the CAVIAR Set
(277 Trajectories)
and a background. The inference is performed by an
MCMC-based approach to explore the joint solution space.
Data-driven proposal probabilities are used to direct the
Markov chain dynamics. Experiments and evaluations on
challenging real-life data show promising results.

The success of our approach mainly lies in the integration of the top-down Bayesian formulation following the
image formation process and the bottom-up features that
are directly extracted from images. The integration has the
benefit of both the computational efficiency of image
features and the optimality of a Bayesian formulation.

This work could be improved/extended in several ways:
1) Extension to track multiple classes of objects (for
example, humans and cars) can be done by adding model
switching in the MCMC dynamics. 2) Tracking, operating in
a two-frame interval, has a very local view; therefore,
ambiguities inevitably exist, especially in the case of
tracking fully occluded objects. Analysis at the level of trajectories may resolve the local ambiguities (for example, [29]) and may take into account prior knowledge of the valid object trajectories, including their starting and ending points.
APPENDIX
SINGLE OBJECT TRACKING WITH BACKGROUND
KNOWLEDGE USING MEAN SHIFT
Fig. 9. Selected frames of the tracking results from the CAVIAR set. (a) Sequence "ThreePastShop2cor." (b) Sequence "TwoEnterShop2cor."

Denote by $\tilde{p}$, $p(u)$, and $b(u)$ the color histogram of the object learned online, the color histogram of the object at location $u$, and the color histogram of the background at the corresponding region, respectively. Let $\{x_i\}_{i=1,\ldots,n}$ be the pixel locations in the region with the object center at $u$. A kernel with profile $k(\cdot)$ is used to assign smaller weights to pixels farther from the center. An $m$-bin color histogram $p(u) = \{p_j(u)\}_{j=1,\ldots,m}$ is constructed as $p_j(u) = \sum_{i=1}^{n} k(\|x_i\|^2)\, \delta[b_f(x_i) - j]$, where the function $b_f(\cdot)$ maps a pixel location to the corresponding histogram bin and $\delta$ is the delta function. The same goes for $\tilde{p}$ and $b$. We would like to optimize

$$L(u) = -\lambda_b \underbrace{B(p(u), b(u))}_{L_1(u)} + \lambda_f \underbrace{B(p(u), \tilde{p})}_{L_2(u)}, \qquad (10)$$
where $B(\cdot)$ is the Bhattacharyya coefficient. By applying a Taylor expansion at $p(u_0)$ and $b(u_0)$ ($u_0$ is a predicted position of the object), we have
$$L_1(u) = B(p(u), b(u)) \approx B(u_0) + B'_p(u_0)\big(p(u) - p(u_0)\big) + B'_b(u_0)\big(b(u) - b(u_0)\big)$$
$$= c_1 + \frac{1}{2}\sum_{u=1}^{m} \sqrt{\frac{b_u(u_0)}{p_u(u_0)}}\, p_u(u) + \frac{1}{2}\sum_{u=1}^{m} \sqrt{\frac{p_u(u_0)}{b_u(u_0)}}\, b_u(u) = c_1 + \sum_{i=1}^{n} k\!\left(\left\|\frac{u - x_i}{h}\right\|^2\right) w^b_i, \qquad (11)$$

where

$$w^b_i = \frac{1}{2}\sum_{u=1}^{m} \left\{ \sqrt{\frac{b_u(u_0)}{p_u(u_0)}}\, \delta[b_f(x_i) - u] + \sqrt{\frac{p_u(u_0)}{b_u(u_0)}}\, \delta[b_b(x_i) - u] \right\}.$$
Similarly, as in [6],

$$L_2(u) = B(p(u), \tilde{p}) \approx \frac{1}{2}\sum_{u=1}^{m} \sqrt{p_u(u_0)\,\tilde{p}_u} + \frac{1}{2}\sum_{u=1}^{m} p_u(u) \sqrt{\frac{\tilde{p}_u}{p_u(u_0)}} = c_2 + \sum_{i=1}^{n} w^f_i\, k\!\left(\left\|\frac{u - x_i}{h}\right\|^2\right), \qquad (12)$$

where $w^f_i = \frac{1}{2}\sum_{u=1}^{m} \sqrt{\frac{\tilde{p}_u}{p_u(u_0)}}\, \delta[b_f(x_i) - u]$; therefore,
$$L(u) = c + \sum_{i=1}^{n} \underbrace{\left(\lambda_f w^f_i - \lambda_b w^b_i\right)}_{w_i} k\!\left(\left\|\frac{u - x_i}{h}\right\|^2\right), \qquad (13)$$

where $c$ absorbs the constant terms.
The last term of $L(u)$ is a density estimate computed with kernel profile $k(\cdot)$ at $u$, so the mean-shift algorithm with negative weights [4] applies. Using the Epanechnikov profile [6], $L(u)$ is increased by moving to the new location

$$u' = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} |w_i|}. \qquad (14)$$
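Putting (11)-(14) together, the per-pixel weights and the negative-weight mean-shift step can be sketched as follows. This is a sketch only: the `eps` guard against empty histogram bins is our addition, and the flat per-pixel arrays are an assumed layout.

```python
import numpy as np

def bg_aware_weights(bins_f, bins_b, p0, b0, p_tilde,
                     lam_f=0.5, lam_b=0.5, eps=1e-12):
    """Per-pixel weights w_i = lam_f * w_i^f - lam_b * w_i^b from (11)-(13).

    bins_f[i]: histogram bin of pixel i in the current frame, b_f(x_i);
    bins_b[i]: bin of the background pixel at the same location, b_b(x_i);
    p0, b0:    object/background histograms at the predicted position u0;
    p_tilde:   the online-learned object histogram.
    """
    wf = 0.5 * np.sqrt(p_tilde[bins_f] / (p0[bins_f] + eps))
    wb = 0.5 * (np.sqrt(b0[bins_f] / (p0[bins_f] + eps))
                + np.sqrt(p0[bins_b] / (b0[bins_b] + eps)))
    return lam_f * wf - lam_b * wb

def mean_shift_step(xs, w):
    """Eq. (14): new location u' = sum_i x_i w_i / sum_i |w_i|."""
    xs = np.asarray(xs, float)
    w = np.asarray(w, float)
    return (xs * w[:, None]).sum(axis=0) / np.abs(w).sum()
```

With all-positive weights this reduces to the usual weighted centroid; negative weights push the window away from background-like pixels, as in [4].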
ACKNOWLEDGMENTS
This research was funded in part by the US Government’s
Video Analysis and Content Extraction (VACE) program.
REFERENCES
[1] G. Borgefors, “Distance Transformations in Digital Images,”Computer Vision, Graphics, and Image Processing, vol. 34, no. 3,pp. 344-371, 1986.
[2] Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate EnergyMinimization via Graph Cuts,” IEEE Trans. Pattern Analysis andMachine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001.
[3] I. Cohen and G. Medioni, “Detecting and Tracking MovingObjects for Video Surveillance,” Proc. IEEE Conf. Computer Visionand Pattern Recognition, vol. 2, pp. 2319-2326, 1999.
[4] R.T. Collins, “Mean-Shift Blob Tracking through Scale Space,”Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2,pp. 234-240, 2003.
[5] D. Comaniciu and P. Meer, “Mean Shift: A Robust Approachtoward Feature Space Analysis,” IEEE Trans. Pattern Analysis andMachine Intelligence, vol. 24, no. 5, pp. 603-619, May 2002.
[6] D. Comaniciu and P. Meer, “Kernel-Based Object Tracking,” IEEETrans. Pattern Analysis and Machine Intelligence, vol. 25, no. 5,pp. 564-577, May 2003.
[7] L. Davis, V. Philomin, and R. Duraiswami, “Tracking Humansfrom a Moving Platform,” Proc. Int’l Conf. Pattern Recognition,vol. 4, pp. 171-178, 2000.
[8] J. Deutscher, A. Blake, and I. Reid, “Articulated Body MotionCapture by Annealed Particle Filtering,” Proc. IEEE Conf. ComputerVision and Pattern Recognition, vol. 2, pp. 126-133, 2000.
[9] A. Elgammal and L. Davis, “Probabilistic Framework forSegmenting People under Occlusion,” Proc. Eighth Int’l Conf.Computer Vision, vol. 2, pp. 145-152, 2001.
[10] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis,“Background and Foreground Modeling Using Non-ParametricKernel Density Estimation for Visual Surveillance,” Proc. IEEE,vol. 90, no. 7, pp. 1151-1163, 2002.
[11] F. Fleuret, R. Lengagne, and P. Fua, “Fixed Point Probability Fieldfor Complex Occlusion Handling,” Proc. 10th Int’l Conf. ComputerVision, vol. 1, pp. 694-700, 2005.
[12] D. Gatica-Perez, J.-M. Odobez, S. Ba, K. Smith, and G. Lathoud, "Tracking People in Meetings with Particles," Proc. Int'l Workshop Image Analysis for Multimedia Interactive Services, 2005.
[13] D. Gavrila and V. Philomin, "Real-Time Object Detection for 'Smart' Vehicles," Proc. Seventh Int'l Conf. Computer Vision, vol. 1, pp. 87-93, 1999.
[14] P. Green, Trans-Dimensional Markov Chain Monte Carlo. Oxford Univ. Press, 2003.
[15] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-Time Surveillance of People and Their Activities," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, Aug. 2000.
[16] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge Univ. Press, 2000.
[17] W.K. Hastings, "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, vol. 57, no. 1, pp. 97-109, 1970.
[18] S. Hongeng and R. Nevatia, "Multi-Agent Event Recognition," Proc. Eighth Int'l Conf. Computer Vision, vol. 2, pp. 84-91, 2001.
[19] M. Isard and J. MacCormick, "BraMBLe: A Bayesian Multiple-Blob Tracker," Proc. Eighth Int'l Conf. Computer Vision, vol. 2, pp. 34-41, 2001.
[20] J. Kang, I. Cohen, and G. Medioni, "Continuous Tracking within and across Camera Streams," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 267-272, 2003.
[21] Z. Khan, T. Balch, and F. Dellaert, "MCMC-Based Particle Filtering for Tracking a Variable Number of Interacting Targets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1805-1819, Nov. 2005.
[22] H.W. Kuhn, "The Hungarian Method for the Assignment Problem," Naval Research Logistics Quarterly, vol. 2, pp. 83-87, 1955.
[23] M.-W. Lee and I. Cohen, "A Model-Based Approach for Estimating Human 3D Poses in Static Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 6, pp. 905-916, June 2006.
[24] A. Lipton, H. Fujiyoshi, and R. Patil, "Moving Target Classification and Tracking from Real-Time Video," Proc. DARPA Image Understanding Workshop, pp. 129-136, 1998.
[25] J. Liu, "Metropolized Gibbs Sampler," Monte Carlo Strategies in Scientific Computing, Springer, 2001.
[26] F. Lv, T. Zhao, and R. Nevatia, "Self-Calibration of a Camera from Video of a Walking Human," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1513-1518, Sept. 2006.
[27] J. MacCormick and A. Blake, "A Probabilistic Exclusion Principle for Tracking Multiple Objects," Proc. Seventh Int'l Conf. Computer Vision, vol. 1, pp. 572-578, 1999.
[28] A. Mittal and L. Davis, "M2Tracker: A Multi-View Approach to Segmenting and Tracking People in a Cluttered Scene Using Region-Based Stereo," Proc. Seventh European Conf. Computer Vision, vol. 2, pp. 18-33, 2002.
[29] P. Nillius, J. Sullivan, and S. Carlsson, "Multi-Target Tracking - Linking Identities Using Bayesian Network Inference," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2187-2194, 2006.
[30] K. Okuma, A. Taleghani, N. de Freitas, J. Little, and D. Lowe, "A Boosted Particle Filter: Multitarget Detection and Tracking," Proc. Eighth European Conf. Computer Vision, vol. 1, pp. 28-39, 2004.
[31] C. Papageorgiou, T. Evgeniou, and T. Poggio, "A Trainable Pedestrian Detection System," Proc. IEEE Intelligent Vehicles Symp., pp. 241-246, 1998.
[32] A. Prati, I. Mikic, M. Trivedi, and R. Cucchiara, "Detecting Moving Shadows: Algorithms and Evaluation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 918-923, July 2003.
[33] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet, "Color-Based Probabilistic Tracking," Proc. Seventh European Conf. Computer Vision, vol. 1, pp. 661-675, 2002.
[34] D. Ramanan, D. Forsyth, and A. Zisserman, "Strike a Pose: Tracking People by Finding Stylized Poses," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 271-278, 2005.
[35] C. Rasmussen and G.D. Hager, "Probabilistic Data Association Methods for Tracking Complex Visual Objects," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 560-576, June 2001.
[36] J. Rittscher, P. Tu, and N. Krahnstoever, "Simultaneous Estimation of Segmentation and Shape," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 487-493, 2005.
[37] R. Rosales and S. Sclaroff, "3D Trajectory Recovery for Tracking Multiple Objects and Trajectory Guided Recognition of Actions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2117-2123, 1999.
[38] H. Rue and M.A. Hurn, "Bayesian Object Identification," Biometrika, vol. 86, no. 3, pp. 649-660, 1999.
[39] S. Geman and C.-R. Hwang, "Diffusion for Global Optimization," SIAM J. Control and Optimization, vol. 24, no. 5, pp. 1031-1043, 1986.
[40] N. Siebel and S. Maybank, "Fusion of Multiple Tracking Algorithms for Robust People Tracking," Proc. Seventh European Conf. Computer Vision, vol. 4, pp. 373-387, 2002.
[41] K. Smith, D. Gatica-Perez, and J.-M. Odobez, "Using Particles to Track Varying Numbers of Interacting People," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 962-969, 2005.
[42] X. Song and R. Nevatia, "Combined Face-Body Tracking in Indoor Environment," Proc. 17th Int'l Conf. Pattern Recognition, vol. 4, pp. 159-162, 2004.
[43] X. Song and R. Nevatia, "A Model-Based Vehicle Segmentation Method for Tracking," Proc. 10th Int'l Conf. Computer Vision, vol. 2, pp. 1124-1131, 2005.
[44] C. Stauffer and E. Grimson, "Learning Patterns of Activity Using Real-Time Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 747-757, Aug. 2000.
[45] H. Tao, H. Sawhney, and R. Kumar, "Object Tracking with Bayesian Estimation of Dynamic Layer Representations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 75-89, Jan. 2002.
[46] H. Tao, H. Sawhney, and R. Kumar, "A Sampling Algorithm for Tracking Multiple Objects," Proc. Workshop Vision Algorithms, 1999.
[47] L. Tierney, "Markov Chain Concepts Related to Sampling Algorithms," Markov Chain Monte Carlo in Practice, pp. 59-74, 1996.
[48] Z.W. Tu and S.C. Zhu, "Image Segmentation by Data-Driven Markov Chain Monte Carlo," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 651-673, May 2002.
[49] Y. Weiss, "Correctness of Local Probability Propagation in Graphical Models with Loops," Neural Computation, vol. 12, no. 1, pp. 1-41, 2000.
[50] B. Wu and R. Nevatia, "Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors," Proc. 10th Int'l Conf. Computer Vision, vol. 1, pp. 90-97, 2005.
[51] T. Yu and Y. Wu, "Collaborative Tracking of Multiple Targets," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 834-841, 2004.
[52] T. Zhao, M. Aggarwal, R. Kumar, and H. Sawhney, "Real-Time Wide Area Multi-Camera Stereo Tracking," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 976-983, 2005.
[53] T. Zhao and R. Nevatia, "Bayesian Human Segmentation in Crowded Situations," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 459-466, 2003.
[54] T. Zhao and R. Nevatia, "Tracking Multiple Humans in Complex Situations," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1208-1221, Sept. 2004.
[55] T. Zhao and R. Nevatia, "Tracking Multiple Humans in Crowded Environment," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 406-413, 2004.
[56] The CAVIAR Data Set, http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, 2008.
[57] CLEAR06 Evaluation Campaign and Workshop, http://isl.ira.uka.de/clear06/, 2008.
Tao Zhao received the BEng degree from the Department of Computer Science and Technology at Tsinghua University, China, in 1998 and the MSc and PhD degrees from the Department of Computer Science at the University of Southern California in 2001 and 2003, respectively. He was with Sarnoff Corp., Princeton, New Jersey, from 2003 to 2006. He is currently with Intuitive Surgical Inc., Sunnyvale, California, working on computer vision applications for medicine and surgery. His research interests include computer vision, machine learning, and pattern recognition. His experience has been in visual surveillance, human motion analysis, aerial image analysis, and medical image analysis. He is a member of the IEEE and the IEEE Computer Society.
Ram Nevatia received the PhD degree from Stanford University with a specialty in the area of computer vision. He has been with the University of Southern California since 1975, where he is currently a professor of computer science and electrical engineering. He is also the director of the Institute for Robotics and Intelligent Systems. He has been a principal investigator of major government-funded computer vision research programs for more than 25 years. He has made important contributions to several areas of computer vision, including the topics of shape description, object recognition, stereo analysis, aerial image analysis, tracking of humans, and event recognition. He is an associate editor of the Pattern Recognition and the Computer Vision and Image Understanding journals. He is the author of two books, several book chapters, and more than 100 refereed technical papers. He is a fellow of the IEEE and of the American Association for Artificial Intelligence (AAAI).
Bo Wu received the BEng and MEng degrees from the Department of Computer Science and Technology at Tsinghua University, Beijing, in 2002 and 2004, respectively. He is currently a PhD candidate in the Computer Science Department at the University of Southern California, Los Angeles. His research interests include computer vision, machine learning, and pattern recognition. He is a student member of the IEEE and the IEEE Computer Society.