
Multiple Camera Tracking of Interacting and Occluded Human Motion

    Shiloh L. Dockstader and A. Murat Tekalp

    Department of Electrical and Computer Engineering

    University of Rochester, Rochester, NY 14627, USA

    E-mail: [email protected], [email protected]

    URL: http://www.ece.rochester.edu/~dockstad/research

    Abstract

    We propose a distributed, real-time computing platform for tracking multiple interacting persons

    in motion. To combat the negative effects of occlusion and articulated motion we use a multi-view implementation, where each view is first independently processed on a dedicated processor.

    This monocular processing uses a predictor-corrector filter to weigh re-projections of 3-D

    position estimates, obtained by the central processor, against observations of measurable image

    motion. The corrected state vectors from each view provide input observations to a Bayesian

    belief network, in the central processor, with a dynamic, multidimensional topology that varies

    as a function of scene content and feature confidence. The Bayesian net fuses independent

    observations from multiple cameras by iteratively resolving independency relationships and

    confidence levels within the graph, thereby producing the most likely vector of 3-D state

    estimates given the available data. To maintain temporal continuity we follow the network with a

    layer of Kalman filtering that updates the 3-D state estimates. We demonstrate the efficacy of the

    proposed system using a multi-view sequence of several people in motion. Our experiments

    suggest that, when compared with data fusion based on averaging, the proposed technique yields

    a noticeable improvement in tracking accuracy.

    Key Words: Kalman filtering, Bayesian network, human motion analysis, human interaction,

    real-time tracking, multiple camera fusion, articulated motion, occlusion



    1. Introduction

    1.1. Motivation

    Motion analysis and tracking from monocular or stereo video has long been proposed for

    applications such as visual security and surveillance [1][2][3][4]. More recently, it has also been

    employed for performing gesture and event understanding [5][6]; developing generalized

    ubiquitous and wearable computing for the man-machine interface [7]; and monitoring patients

    to capture the essence of epileptic seizures, assist with the analysis of rehabilitative gait, and

    perform monitoring and examination of basic sleep disorders [8][9]. However, due to a

    combination of several factors, reliable motion tracking still remains a challenging domain of

    research. The underlying difficulties behind human motion analysis are founded mostly upon the

    complex, articulated, and self-occluding nature of the human body [10]. The interaction between

    this inherent motion complexity and an equally complex environment greatly compounds the

    situation. Environmental conditions such as dynamic ambient lighting, object occlusions,

    insufficient or incomplete scene knowledge, and other moving objects and people are just some

    of the naturally interfering factors. These issues make tasks as fundamental as tracking and

    semantic correspondence recognition exceedingly difficult without first outlining numerous, and

    sometimes prohibitive, assumptions. Computational complexity also plays an important role, as

    many of the popular applications of human motion analysis and monitoring demand real-time (or

near real-time) solutions.

1.2. Previous Work

    A fundamental step in human motion analysis and tracking consists of the modeling of

    moving people in image sequences. Classical 3-D modeling of moving people has taken the form

    of volumetric bodies based on elliptical cylinders [11], tapered super-quadrics [12], or other

    more highly parameterized primitives or configurations [13]. Rehg and Kanade [14] extend 3-D

    model-based tracking to account for the significant self-occlusions inherent in the tracking of

    articulated motion. Alternatives to using exact 3-D models include the use of 2-D models that

    represent the projection of 3-D data onto some imaging plane [15], point distribution models

    (PDM) [16], strict 2-D/3-D contour modeling [17], stick figure and medial axis modeling [18], or

    even 2-D generalized shape and region modeling [19][20]. For human motion analysis, choosing

    an object or motion model depends greatly on the application. Real-time scenarios, for example,

    must often sacrifice the accuracy of more highly parameterized models in order to achieve rapid,


    automatic model initialization. In line with this motivation, Wren et al. [21] introduce a multi-

    class statistical model of color and shape to obtain a 2-D representation of the head and hands.

    They use simple blobs, or structurally significant regions, to represent various body parts for the

purpose of tracking and recognition. The approach lends itself to a real-time implementation and is more robust to certain types of occlusion than other techniques.

    At the heart of human motion analysis is motion tracking. The dependence between

tracking and recognition is significant: even a slight increase in the quality or precision of tracking can yield considerable improvements in recognition accuracy. Various

    approaches to object tracking employ features such as edges, shape, color, and optical flow [21].

    To handle occlusion, articulated motion, and noisy observations, numerous researchers have

    looked to stochastic modeling such as Kalman filtering [22][23][24][25] and probabilistic

    conditional density propagation (Condensation algorithm) [17]. Dockstader and Tekalp [3]

    suggest a modified Kalman filtering approach to the tracking of several moving people in video

    surveillance sequences. They take advantage of the fact that as multiple moving people interact,

    the state predictions and observations for the corresponding Kalman filters no longer remain

    independent. Lerasle et al. [25] employ Kalman filters to perform tracking of human limbs from

    multiple vantage points. Jang and Choi [26] suggest the use of active models based on regional

    and structural characteristics such as color, shape, texture, and the like to track non-rigid moving

    objects. Like the tracking theory set forth by Peterfreund [27], the active models employ Kalman

    filtering to predict basic motion information and snake-like energy minimization terms to

perform dynamic adaptations using the moving object's structure. MacCormick and Blake [28]

present the notion of partitioned sampling to perform robust tracking of multiple moving objects.

    The underlying probabilistic exclusion principle prevents a single observation from supporting

    the presence of multiple targets by employing a specialized observational model. The

    Condensation [17] algorithm is used in conjunction with the aforementioned observational

model to perform the contour tracking of multiple, interacting objects. Without the explicit use of extensive temporal or stochastic modeling, McKenna et al. [29] describe a computer vision

    system for tracking multiple moving persons in relatively unconstrained environments.

    Haritaoglu et al. [30] present a real-time system, W4, for detecting and tracking multiple people


    when they appear in a group. All of the techniques discussed so far have considered motion

    tracking from monocular video.

The use of multi-view monitoring [12][16][31] and data fusion [32][33] provides an elegant mechanism for handling occlusion, articulated motion, and multiple moving objects in video sequences. With distributed monitoring systems, however, come additional issues such as

    stereo matching, pose estimation, automated camera switching, and system bandwidth and

    processing capabilities [34]. Stillman et al. [35] propose a robust system for tracking and

    recognizing multiple people with two cameras capable of panning, tilting, and zooming (PTZ) as

    well as two static cameras for general motion monitoring. The static cameras perform person

    detection and histogram registration in parallel while the PTZ cameras acquire more highly

    focused data to perform basic face recognition. Utsumi et al. [36] suggest a system for detecting

    and tracking multiple persons using multiple cameras to address complex occlusion and

    articulated motion. The system is composed of multiple tasks including position detection,

    rotation angle detection, and body-side detection. For each of the tasks, the camera that provides

    the most relevant information is automatically chosen using the distance transformations of

    multiple segmented object maps. Cai and Aggarwal [34] develop an approach for tracking

    human motion using a distributed-camera model. The system starts with tracking from a single

    camera view and switches when the active camera no longer has a sufficient view of the moving

    object. Selecting the optimal camera is based on a matching evaluation, which uses a simple

    threshold on some tracking confidence parameter, and a frame number calculation, which

    attempts to minimize the total amount of camera switching.

    It stands to reason that many who focus on the modeling and tracking of articulated

    human motion also initiate efforts in the recognition, interpretation, and understanding of such

    motion. Toward this end, popular approaches typically employ parameterized motion trajectories

    or vector fields and use Dynamic Time Warping, hidden Markov models [6][37], combinations

thereof [38], or hybrid graph-theoretic methods [39]. Other researchers have addressed activity recognition using methods based on principal component analysis [40], vector basis fields [41],

    or generalized dynamic modeling. Wren and Pentland [42] describe an approach to gesture

recognition based on the use of 3-D dynamic modeling of the human body. Complementary

    research to activity recognition investigates the effects of various features on recognition


    accuracy. Nagaya et al. [43] propose an appearance-based feature for real-time gesture

    recognition from motion images. Campbell et al. [5] experiment with a number of potential

    feature vectors for human activity recognition, including 3-D position, measures of local

curvature, and numerous velocity terms. Their experimental results lend favor to the use of redundant velocity-based feature vectors.

    For a more thorough discussion of modeling, tracking, and recognition for human motion

    analysis, we refer the reader to Gavrila [44], Aggarwal and Cai [45], and the special issue of the

IEEE Transactions on Pattern Analysis and Machine Intelligence on video surveillance [46].

    1.3. Proposed Contribution

    In this paper, we introduce a distributed, real-time computing platform for improving

    feature-based tracking in the presence of articulation and occlusion for the goal of recognition.

    The main contribution of this work is to perform both spatial (between multiple cameras) and

    temporal data integration within a unified framework of 3-D position tracking to provide

    increased robustness to temporary feature point occlusion. In particular, the proposed system

    employs a probabilistic weighting scheme for spatial data integration as a simple Bayesian belief

    network (BBN) [47] with a dynamic, multidimensional topology. As a directed acyclic graph

    (DAG), a Bayesian network demonstrates a natural framework for representing causal,

    inferential relationships between multiple variables (e.g., a 3-D trajectory is always inferred from

    two or more 2-D sources). Moreover, unlike traditional probabilistic weighting schemes [48], a

    BBN is easily extended to multiple modalities due to its inherent semantic structure. Perhaps the

    most exploited aspect, however, is the possibility of topological simplification resulting from

    conditional dependency and independency relationships between multiple network variables. For

    the proposed system, this corresponds to the selective use of multiple views of particular features

    based on measures of spatio-temporal tracking confidence. Our approach is particularly well

    suited as a precursor to generalized motion monitoring, activity recognition, and interaction

analysis systems [49]. To this end, we use multi-view data acquired from an unconstrained home environment. Video captured within a home environment has the advantage of broad

    applicability, where human motion is often significant and unrestricted. Applications range from

    home security to understanding of social and interactive phenomena to monitoring for the

    purposes of home health care and rehabilitation.


    2. Theory

    2.1. System Overview

    The proposed system, as shown in Figure 1, consists of three major components: (i) state-

    based 2-D predictor-corrector filtering for monocular tracking (in the dotted box), (ii) multi-view

    (spatial) data fusion, and (iii) Kalman filtering for 3-D trajectory tracking. The first stage consists

    of video preprocessing including background subtraction, sparse 2-D motion estimation, and

    foreground region clustering. From the estimated 2-D motion, we extract a set of measurements

(observations), $\mathbf{x}[k]$, where $k \geq 0$ is the frame number, for the estimation of the state vector, $\mathbf{s}[k] = \big[\,\mathbf{s}_1[k]\ \ \mathbf{s}_2[k]\ \cdots\ \mathbf{s}_N[k]\,\big]^T$. Here, $\mathbf{s}_m[k]$, $1 \leq m \leq N$, denotes the image coordinates of the $m$th feature point that we wish to track in time, and $N$ is the number of features being tracked on one or more independently moving regions. We also define $\boldsymbol{\gamma}[k]$ as a 3-D state vector that captures both the velocity and position of features in 3-D Cartesian space, where the unknown 3-D feature position is denoted by $\mathbf{y}[k]$. At each frame, $k$, we use the observations, $\mathbf{x}[k]$, in conjunction with the 3-D state estimates, $\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]$, as input to a predictor-corrector filter. The output of the predictor-corrector filter is a state estimate, $\hat{\mathbf{s}}[k]$, with some confidence, $\mathbf{M}[k]$.

[Flow diagram: for the jth view, input video is processed by a predictor-corrector filter whose extracted features (observations) and projections of 3-D features onto the jth imaging plane (predictions) produce corrected 2-D states; the states from all views feed a central Bayesian network that outputs 3-D observations, which a standard Kalman filter turns into corrected 3-D states (feature locations) and the output trajectories of the multiple moving objects.]

Figure 1. System flow diagram.

    The stage of multi-view fusion (indicated as the Bayesian network in Figure 1) performs

    spatial data integration using triangulation, perspective projections, and Bayesian inference. The

input to the Bayesian network is a set of random variables, $\big\{\boldsymbol{\theta}_1[k], \boldsymbol{\theta}_2[k], \ldots, \boldsymbol{\theta}_J[k]\big\}$, where the $j$th element is identical to the output, $\hat{\mathbf{s}}_j[k]$, taken from the predictor-corrector filter corresponding to the $j$th view of the scene. The network defines an unknown subset of $\big\{\boldsymbol{\theta}_1[k], \boldsymbol{\theta}_2[k], \ldots, \boldsymbol{\theta}_J[k]\big\}$,


denoted by $\boldsymbol{\Theta}[k]$, that combine to form yet another random variable, $\boldsymbol{\phi}[k]$, indicative of a vector of 3-D positions. At the output of the BBN are the estimates, $\hat{\boldsymbol{\Theta}}[k]$ and $\hat{\mathbf{y}}[k]$, that maximize the joint density, $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$. We use the notation $\hat{\mathbf{y}}[k]$ instead of $\hat{\boldsymbol{\phi}}[k]$ since the output of the network is actually an estimate of the unknown 3-D position vector, $\mathbf{y}[k]$. Accompanying the 3-D estimate is a noise covariance matrix, $\mathbf{R}[k]$, minimized by the maximization of $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$.

    The final stage of the system uses a Kalman filter to maintain a level of temporal

    smoothness on the vector of 3-D trajectories. The observations for the Kalman filter are the 3-D

estimates, $\hat{\mathbf{y}}[k]$, produced by the Bayesian network. As mentioned previously, the states of the filter are indicated by $\boldsymbol{\gamma}[k]$, which represent both the true velocity and position of the $N$ features in 3-D Cartesian space. The corrected states, $\hat{\boldsymbol{\gamma}}[k\,|\,k]$, at the output of the Kalman filter provide updated estimates of the unknown 3-D position vector. For tracking in the next frame, the algorithm develops a prediction, $\hat{\mathbf{s}}[k+1\,|\,k] = \mathcal{F}\big(\boldsymbol{\Psi}[k+1]\,\hat{\boldsymbol{\gamma}}[k\,|\,k]\big)$, based on a perspective projection, $\mathcal{F}(\cdot)$,

    of the corresponding 3-D prediction. This process introduces a temporally iterative estimation

    algorithm capable of improving the accuracy of both 2-D and 3-D object tracking of features

    proven useful in human motion analysis [5].

    The proposed architecture considers both the fundamental bandwidth constraints as well

    as the computational complexity requirements for various components. Feature tracking and

processing occurs for a single camera ($1 \leq j \leq J$) and, due to the complexity of the required

    motion estimation and spatial clustering routines, is performed using a dedicated processor.

    Rather than burdening a network with the real-time transfer and analysis of video data from J

    views, we perform all tracking and correspondence temporally at each view and then use the

    Bayesian network to perform spatial integration of only the relevant data. Both the Bayesian

    network and the subsequent stage of 3-D Kalman filtering are computationally simple enough (in

comparison to $J$ occurrences of motion-based estimation and segmentation) to coexist on a single

    central processor.

    2.2. Two-Dimensional Feature Tracking

    Throughout this section it is tacitly understood that each equation is applied to data at the

    plane of the jth camera, although we drop the explicit dependence on j with the understanding

    that the steps in question are equally applicable to multiple views of the scene. As in [3], the

    proposed technique performs moving object detection and segmentation using change detection


    and localized, sparse motion estimation over a grid of points, including the semantic features,

    within the foreground of the video sequence. While in [3] we were able to effectively use motion

    estimation and frame differencing to produce filter predictions and observations, respectively,

such an approach is not feasible for the tracking of specific occluded features of moving people. Rather, we must now rely on temporal correspondence to generate observations while using

    feedback from the 3-D feature vector trajectories to develop next state predictions.

The first step involves computation of the state prediction, $\hat{\mathbf{s}}[k\,|\,k-1]$, in the current frame, $k$, according to

$\hat{\mathbf{s}}[k\,|\,k-1] = \mathcal{F}\big(\boldsymbol{\Psi}[k]\,\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]\big)$, (1)

where $\boldsymbol{\Psi}[k]$ is the 3-D state transition matrix and $\mathcal{F}(\cdot)$ is a vector-valued function that maps points in 3-D Cartesian space to a particular imaging plane using a perspective projection. The precise definition of $\boldsymbol{\Psi}[k]$ is given in 2.4. We then compute the 2-D error covariance matrix, $\mathbf{M}[k\,|\,k-1]$, by projecting the corresponding 3-D error covariance matrix, $\boldsymbol{\Sigma}[k\,|\,k-1]$, associated with $\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]$, to each imaging plane as per

$\mathbf{M}[k\,|\,k-1] = E\Big\{\big(\mathbf{s}[k] - \hat{\mathbf{s}}[k\,|\,k-1]\big)\big(\mathbf{s}[k] - \hat{\mathbf{s}}[k\,|\,k-1]\big)^T\Big\} \approx \mathcal{G}\big(\boldsymbol{\Sigma}[k\,|\,k-1]\big)$, (2)

where $\mathcal{G}(\cdot)$ is a transformation of the noise covariance from 3-D to 2-D. The mapping of 3-D data to the imaging plane of each camera, as in (1) and (2), results in a one-step predictor-corrector filter at each frame, $k$. Collectively, the implementation of these functions, $\mathcal{F}(\cdot)$ and $\mathcal{G}(\cdot)$, parallels that of a transformation of random variables, $\mathcal{H}_j(\cdot)$, which we fully define and illustrate in 2.3.
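To make the two mappings concrete, the following sketch (not part of the original system) implements a perspective projection and a first-order covariance projection for a single feature, assuming a calibrated pinhole camera described by a 3x4 projection matrix; the Jacobian-based linearization is only one possible realization of $\mathcal{G}(\cdot)$, not necessarily the authors' choice.

```python
import numpy as np

def project_point(P, X):
    """F(.): map a 3-D point X (3,) to image coordinates (2,) via a 3x4 matrix P."""
    Xh = np.append(X, 1.0)                   # homogeneous 3-D point
    u = P @ Xh
    return u[:2] / u[2]

def project_covariance(P, X, Sigma3d):
    """G(.): first-order propagation of a 3-D covariance to the image plane,
    using the Jacobian of the perspective projection evaluated at X."""
    Xh = np.append(X, 1.0)
    u = P @ Xh
    J = np.zeros((2, 3))
    for i in range(2):                       # d(u_i/u_2)/dX
        J[i] = (P[i, :3] * u[2] - P[2, :3] * u[i]) / u[2] ** 2
    return J @ Sigma3d @ J.T

if __name__ == "__main__":
    # toy camera: focal length 500, principal point (320, 240), identity pose
    P = np.array([[500., 0., 320., 0.],
                  [0., 500., 240., 0.],
                  [0., 0., 1., 0.]])
    X = np.array([0.2, -0.1, 3.0])           # a 3-D feature position
    Sigma3d = np.diag([0.01, 0.01, 0.04])    # its 3-D error covariance
    print(project_point(P, X))               # predicted 2-D location, cf. (1)
    print(project_covariance(P, X, Sigma3d)) # projected 2-D covariance, cf. (2)
```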

The next step involves the computation of the gain matrix, $\mathbf{K}[k]$, which determines the magnitude of the correction at the $k$th frame. The operation relies on $\mathbf{M}[k\,|\,k-1]$ and a noise covariance matrix, $\mathbf{C}[k]$, which describes the distribution of the assumed Gaussian observation noise. Using a probabilistic weighting scheme, similar to that proposed in [3], we introduce a matrix that captures the confidence associated with the temporal correspondence measurements (i.e., observations, $\mathbf{x}[k]$) for each feature. For this task, we classify the correspondence of various features and their neighboring pixels into three basic classes:

Class A: The element is visible in the previous frame and presumably in the current frame, as there exists a strong temporal correspondence;


Class B: The element is visible in the previous frame but presumably not in the current frame, as there exists only a weak temporal correspondence; and

Class C: The element is not visible in the previous frame.

The qualification of a feature as having a strong or weak temporal correspondence is based on well-established criteria in the motion estimation and optical flow literature [50]. Since, in the

    interest of computational complexity, the proposed system estimates only the forward motion,

    we ignore the case where a previously occluded feature becomes revealed in the next frame; this

is considered a Class C correspondence.

Let us refer to the $i$th motion vector in the neighborhood of some feature point, $\mathbf{s}_m[k]$, as $\mathbf{v}_i[k]$ and the origin of this vector in the previous frame (in image coordinates) as $\tilde{\mathbf{v}}_i[k]$. For convenience, we introduce the notation $A$, $B$, and $C$ to represent the sets of all points, $\tilde{\mathbf{v}}_i[k]$ and $\mathbf{s}_m[k-1]$, that are classified as Class A, Class B, and Class C, respectively. It is assumed that

    the union of these three sets describes a single, arbitrary region. For the proposed contribution,

    this corresponds to an entire moving person, although the technique is equally applicable to

    single body parts, groups of moving people, or even generic moving regions, depending on the

    segmentation of the foreground.

We construct $\mathbf{C}[k]$ in the traditional manner by allowing the matrix to be representative

    of the time-varying differences between observations and corrected predictions. We then

    compose a gain matrix according to

$\mathbf{K}[k] = \mathbf{M}[k\,|\,k-1]\,\mathbf{H}^T[k]\Big(\mathbf{W}[k]\,\mathbf{C}[k]\,\mathbf{W}^T[k] + \mathbf{H}[k]\,\mathbf{M}[k\,|\,k-1]\,\mathbf{H}^T[k]\Big)^{-1}$, (3)

where $\mathbf{H}[k] = \mathbf{I}$ represents the linear observation matrix and

$\mathbf{W}[k] \triangleq \Big(\mathrm{diag}\big\{\omega_1(k)\ \ \omega_2(k)\ \cdots\ \omega_N(k)\big\}\Big)^{-1/2}$, (4)

$\omega_m(k) = \begin{cases} 1, & \mathbf{s}_m[k-1] \in A, \\ \big(|A|\,D_m(k)\big)^{-1}\displaystyle\sum_{i \in A}\big(D_m(k) - d_m(k,i)\big), & \mathbf{s}_m[k-1] \in B, \\ 0, & \mathbf{s}_m[k-1] \in C, \end{cases}$ (5)

and $d_m(k,i) \triangleq \big\|\tilde{\mathbf{v}}_i[k] - \mathbf{s}_m[k-1]\big\|$ is the distance between the $m$th feature and the $i$th observable motion vector originating from a particular moving region. Here, $D_m(k) \triangleq \max_i\{d_m(k,i)\}$ indicates the maximum distance between a particular feature and the observable motion vectors used to produce its temporal correspondence. The observations are indicated by


$\mathbf{x}_m[k] = \mathbf{s}_m[k-1] + \Big(\displaystyle\sum_{i \in A}\big(D_m(k) - d_m(k,i)\big)\Big)^{-1}\displaystyle\sum_{i \in A}\big(D_m(k) - d_m(k,i)\big)\,\mathbf{v}_i[k]$, (6)

where $d_m(k,i)$, $\mathbf{v}_i[k]$, and $\tilde{\mathbf{v}}_i[k]$ are illustrated more clearly in Figure 2.
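The construction of the occlusion-weighted observation can be sketched as follows; the code mirrors the reconstructed forms of (5) and (6) above, so the exact normalization of the Class B weight should be read as an assumption rather than the authors' definition.

```python
import numpy as np

def observation_for_feature(s_prev, own_vector, origins, vectors, labels):
    """Occlusion-weighted observation x_m[k] and confidence omega_m(k).

    s_prev     : (2,) feature position s_m[k-1]
    own_vector : (2,) the feature's own motion estimate, or None if occluded
    origins    : (n,2) origins of the neighbouring motion vectors in frame k-1
    vectors    : (n,2) the neighbouring motion vectors v_i[k]
    labels     : per-vector class, each 'A', 'B', or 'C'
    """
    if own_vector is not None:                       # Class A: trivial correspondence
        return s_prev + own_vector, 1.0
    A = [i for i, c in enumerate(labels) if c == 'A']
    if not A:                                        # Class C: nothing observable nearby
        return s_prev.copy(), 0.0
    d = np.linalg.norm(origins[A] - s_prev, axis=1)  # d_m(k, i)
    D = d.max()                                      # D_m(k)
    if D == 0.0:                                     # degenerate: neighbours coincide
        return s_prev + vectors[A].mean(axis=0), 1.0
    omega = float((D - d).sum() / (len(A) * D))      # Class B confidence, cf. (5)
    w = D - d                                        # proximity weights
    if w.sum() == 0.0:                               # single neighbour at distance D
        w = np.ones_like(w)
    x = s_prev + (w @ vectors[A]) / w.sum()          # weighted observation, cf. (6)
    return x, omega

if __name__ == "__main__":
    s_prev = np.array([100.0, 50.0])
    origins = np.array([[102.0, 51.0], [95.0, 48.0], [120.0, 70.0]])
    vectors = np.array([[1.5, 0.5], [1.0, 0.0], [4.0, -1.0]])
    print(observation_for_feature(s_prev, None, origins, vectors, ['A', 'A', 'A']))
```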

Upon inspecting (3) through (6), one may make a number of observations. First, a completely visible feature that has a seemingly accurate motion estimate is tracked in the usual

    way; this represents the trivial correspondence problem with no apparent occlusion. An occluded

    feature, on the other hand, develops a weighted least-squares estimate of its trajectory using

neighboring, observable motion vectors. The confidence of this estimate is encoded in $\mathbf{W}[k]$, where both the visibility and the proximity of the neighboring trajectories are taken into account. We draw the reader's attention to the fact that nothing can be said about the optimality of the

    weighted least-squares estimate unless something more is known regarding the distribution of the

    bordering motion vector estimates. In the next section, we address this issue by introducing a

    weighting scheme that takes the form of a Bayesian network with a dynamic topology.

[Illustration: two interacting foreground objects in frames k-1 and k; the ith motion vector, its origin in the previous frame, and the distance d_m(k,i) from an occluded feature to a neighboring visible vector are labeled.]

Figure 2. Occlusion modeling. Here we show various feature tracking scenarios in the presence of interacting foreground objects (i.e., two moving people). When a feature is occluded, the temporal trajectory is estimated using neighboring, but visible, motion vectors.

    For the sake of completeness, we provide the remaining steps of the filtering procedure.

These equations follow those of the standard Kalman filter and are indicated by

$\hat{\mathbf{s}}[k\,|\,k] = \hat{\mathbf{s}}[k\,|\,k-1] + \mathbf{K}[k]\big(\mathbf{x}[k] - \mathbf{H}[k]\,\hat{\mathbf{s}}[k\,|\,k-1]\big)$ (7)

and

$\mathbf{M}[k\,|\,k] = \big(\mathbf{I} - \mathbf{K}[k]\,\mathbf{H}[k]\big)\mathbf{M}[k\,|\,k-1]$. (8)


    The two equations represent the state correction based on the gain matrix and the update of the

    error covariance matrix, respectively.
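A minimal sketch of the resulting one-step update for a single view is given below, assuming $\mathbf{H}[k] = \mathbf{I}$ and the diag(omega)^(-1/2) interpretation of $\mathbf{W}[k]$ carried over from the reconstruction of (4); it is an illustration of the filtering equations rather than the authors' implementation.

```python
import numpy as np

def correct_2d(s_pred, M_pred, x, C, omega, eps=1e-6):
    """One predictor-corrector update per view, following (3), (7), and (8).

    s_pred : (2N,) predicted state s[k|k-1] (image coordinates of N features)
    M_pred : (2N,2N) predicted error covariance M[k|k-1]
    x      : (2N,) observations x[k]
    C      : (2N,2N) observation noise covariance C[k]
    omega  : (N,) per-feature confidences omega_m(k) in [0, 1]
    """
    w = np.repeat(np.asarray(omega, dtype=float), 2)   # one weight per coordinate
    W = np.diag(1.0 / np.sqrt(np.maximum(w, eps)))     # low confidence inflates noise
    S = W @ C @ W.T + M_pred                           # innovation covariance, H = I
    K = M_pred @ np.linalg.inv(S)                      # gain, (3)
    s_corr = s_pred + K @ (x - s_pred)                 # state correction, (7)
    M_corr = (np.eye(len(s_pred)) - K) @ M_pred        # covariance update, (8)
    return s_corr, M_corr

if __name__ == "__main__":
    s_pred = np.array([100.0, 50.0, 200.0, 80.0])      # two features
    M_pred = np.eye(4) * 4.0
    x = np.array([103.0, 51.0, 201.0, 79.0])
    C = np.eye(4) * 2.0
    omega = [1.0, 0.2]                                 # second feature mostly occluded
    s_corr, _ = correct_2d(s_pred, M_pred, x, C, omega)
    print(s_corr)                                      # occluded feature moves less toward x
```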

    2.3. Spatial Integration

    The spatial integration is performed for each feature point independently. The followingtreatment addresses spatial integration for the mth feature point, although the index m has been

omitted for ease of notation. We introduce random variables, $\boldsymbol{\theta}_j[k] \triangleq \hat{\mathbf{s}}_j[k] = \mathbf{s}_j[k] + \mathbf{n}_{s_j}[k]$ and $\boldsymbol{\phi}[k] \triangleq \mathbf{y}[k] + \mathbf{n}_y[k]$, where $\mathbf{n}_{s_j}[k]$ represents some zero-mean distribution of estimation error on the imaging plane corresponding to the $j$th view. Similarly, $\mathbf{n}_y[k]$ characterizes the zero-mean 3-D reconstruction error inherent in $\boldsymbol{\phi}[k]$. Let $\boldsymbol{\theta}_j[k]$, $j \in [1, J]$, indicate a family of random variables defined by the vector of probability density functions (PDF),

$P_{\theta_j}\big[\boldsymbol{\theta}_j[k]\big] = (2\pi)^{-\frac{N}{2}}\,\big|\mathbf{M}_j[k]\big|^{-\frac{1}{2}}\exp\Big\{-\tfrac{1}{2}\big(\boldsymbol{\theta}_j[k] - \hat{\mathbf{s}}_j[k]\big)^T\mathbf{M}_j^{-1}[k]\big(\boldsymbol{\theta}_j[k] - \hat{\mathbf{s}}_j[k]\big)\Big\}$, (9)

where $\mathbf{M}_j[k]$ contains the confidence of the $m$th component of the estimate on the $j$th view. Under the assumption that the $m$th feature may be occluded in some views, the algorithm uses $1 < K \leq J$ views of the scene to reconstruct an estimate of $\mathbf{y}[k]$. We indicate the set of random variables used in reconstruction by

$\boldsymbol{\Theta}[k; K] \triangleq \big\{\boldsymbol{\theta}_{j_1}[k],\ \boldsymbol{\theta}_{j_2}[k],\ \ldots,\ \boldsymbol{\theta}_{j_K}[k]\big\} \subseteq \big\{\boldsymbol{\theta}_1[k],\ \ldots,\ \boldsymbol{\theta}_J[k]\big\}$, (10)

where $1 \leq j_n \leq J$ indicates one of $K$ views upon which $\mathbf{y}[k]$ might be dependent. For the sake of notational simplicity, we drop the explicit dependence of these variables on the frame number, $k$.

The proposed Bayesian network obtains an estimate of $\mathbf{y}[k]$ and the proper subset $\boldsymbol{\Theta}[k; K]$ by calculating

$\big(\hat{\mathbf{y}},\ \hat{\boldsymbol{\Theta}}\big) = \arg\max_{\{\boldsymbol{\phi},\,\boldsymbol{\Theta}\}} P_{\phi,\Theta}\big[\boldsymbol{\phi},\ \boldsymbol{\Theta}\big]$, (11)

where, according to Bayes' rule,

$P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J\big]\,P\big[\boldsymbol{\theta}_1\,\big|\,\boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J\big]\cdots P\big[\boldsymbol{\theta}_{J-1}\,\big|\,\boldsymbol{\theta}_J\big]\,P\big[\boldsymbol{\theta}_J\big]$. (12)

Representing (12) as a causal network, our task is reduced to finding a topology, $\{T, \boldsymbol{\Theta}\}$, such that $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ is maximized. In (12), $P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_J\big]$ models 3-D reconstruction noise, $P\big[\boldsymbol{\theta}_j\big]$ models 2-D observation noise, and the remaining conditional densities model the effects of occlusion and correlation between various views of the scene. We can contrast the model of (12) to that of a more simplistic approach, assuming $J$ independent sources, where


$\{T, \boldsymbol{\Theta}\}:\ \ P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J\big]\prod_{j=1}^{J} P\big[\boldsymbol{\theta}_j\big]$. (13)

The latter model works well for a small number of variables, but (if the variables are correlated) becomes less accurate as $J$ increases. The generalized topologies of (12) and (13) are illustrated

    in Figure 3. The remaining modifications to the network consist of defining a topological

    ordering of the nodes, removing nonexistent dependencies (i.e., edges), and estimating the

    conditional probabilities for the remaining I-map structure.

[Diagram: two candidate topologies over the view nodes 1 through J and the 3-D node; in (B) the jth node contains (J-j+1) divergent paths and (j-1) convergent paths.]

Figure 3. Bayesian network topologies. The network shown in (A) represents a basic BBN with a convergent topology at a child node, indicative of independent sources, while that in (B) shows the more general case in which the correlation between views is considered.

    To determine a topological ordering of the nodes, we must take into consideration the

    relative significance of various data sources. With two cameras there is no ordering, as both

    views are equally necessary for 3-D reconstruction. However, the case of two cameras does not

    provide a means to deal with the possibility of occlusion; hence it is not of interest to us. For

    more views, we need relative measures of importance, most likely determined by redundant

    scene content. If we assume uniformly distributed scene content, then by moving the cameras

    further apart we increase the total probability of having multiple unobstructed views of one or

    more features. Herewith, there is also an implicit upper bound on camera separation in order to

    maintain a well-defined stereo matching problem for feature initialization.

Using the $\binom{J}{2}$ fundamental matrices for the system, based on the static content associated with each view, we define a probabilistic measure, $0 \leq l_{ij} \leq 1$. This probability portrays the ratio of points seen from the $i$th view that have some correspondence in the $j$th view to the total number of pixels in the $i$th view. Within a graph-theoretic framework, this quantity states that an


inferential dependency exists between these nodes such that one might conjecture that $\boldsymbol{\theta}_i \rightarrow \boldsymbol{\theta}_j$ with a confidence of $l_{ij}$. Two identical views of a scene are indicated by $l_{ij} = 1$, while $l_{ij} = 0$ would suggest two views containing no points in common. Note that the converse of the former statement does not necessarily follow. With fixed cameras, such a matrix, $\mathbf{L} = \{l_{ij}\}$, is easily populated with arbitrary levels of precision using automatic, semi-automatic, or even manual assessment. This dependence on camera position is illustrated in Figure 4. For a system consisting of $J$ cameras, the nodal ordering is specified by

$\{T, \boldsymbol{\Theta}\}:\ \ T = \big\{\eta_1,\ \eta_2,\ \ldots,\ \eta_{J+1}\big\}$, (14)

where the nodes are placed in descending order of confidence from $\eta_1$, which indicates the root of the BBN, to $\eta_{J+1}$, which has no dependencies. This topological ordering (14) varies for each

feature being tracked and, therefore, must consider both the confidence of temporal tracking and the correlation between views of the scene. For the $m$th feature in the $j_n$th view of the scene we have a confidence metric

$\varsigma_{mj_n}[k] = \omega_m^{j_n}(k)\,\frac{1}{J}\sum_{i=1,\,i \neq j_n}^{J}\big(1 - l_{j_n i}\big)$, (15)

where $\omega_m(k)$ for an arbitrary view is given in (5). The random variable for the $n$th node, $\eta_n$, in the network is equal to $\boldsymbol{\theta}_{j_n}$, where the views of the scene are ranked according to

$\varsigma_{mj_1}[k] \geq \varsigma_{mj_2}[k] \geq \cdots \geq \varsigma_{mj_K}[k]$. (16)

Figure 4. Three-dimensional view of a scene. The correlation between views j1 and j3 is higher than that between j1 and j2. It is expected, then, that occluded content in j1 has a higher probability of remaining occluded in j3 than in j2.

As an example for a single, arbitrary feature, consider a four-camera system in which we have $\boldsymbol{\omega}_m(k) = [\,0.72\ \ 0.51\ \ 1.00\ \ 0.39\,]$ and a matrix of camera correlation data


$\mathbf{L} = \begin{bmatrix} 1.00 & 0.86 & 0.90 & 0.52 \\ 0.42 & 1.00 & 0.73 & 0.36 \\ 0.92 & 0.71 & 1.00 & 0.57 \\ 0.94 & 0.56 & 0.78 & 1.00 \end{bmatrix}$. (17)

The corresponding confidence vector is given by $\boldsymbol{\varsigma}_m[k] = [\,0.13\ \ 0.19\ \ 0.20\ \ 0.07\,]$, where the resulting topological ordering, $T = \{\boldsymbol{\phi}, \boldsymbol{\theta}_3, \boldsymbol{\theta}_2, \boldsymbol{\theta}_1, \boldsymbol{\theta}_4\}$, is demonstrated graphically in Figure 5. For an arbitrary feature, this arrangement suggests that one should expect that data obtained from $\boldsymbol{\theta}_2$ is likely, on average, to contain more unique scene content than that from $\boldsymbol{\theta}_1$ or $\boldsymbol{\theta}_4$. Alternatively, this might be read as: given that we have an estimated feature vector in the second view, it is 58% or 64% likely ($1 - l_{2j}$) that the first or fourth views, respectively, provide additional information. Of course, information in this context is taken for the case of occlusion. The goal of this implementation is to lessen the adverse effects of observation and reconstruction error by jointly considering camera position and temporal tracking confidence.
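The following short sketch recomputes this example with the confidence metric as reconstructed in (15); it reproduces the quoted confidence vector and the view ordering shown in Figure 5.

```python
import numpy as np

def view_confidences(omega_m, L):
    """Confidence of each view for one feature, cf. (15), and the ranking of (16).

    omega_m : (J,) temporal tracking confidences of the m-th feature, one per view
    L       : (J,J) camera-correlation matrix with entries l_ij
    """
    J = len(omega_m)
    conf = np.empty(J)
    for j in range(J):
        others = [i for i in range(J) if i != j]
        conf[j] = omega_m[j] * np.sum(1.0 - L[j, others]) / J
    order = np.argsort(-conf)              # descending confidence
    return conf, order

if __name__ == "__main__":
    omega_m = np.array([0.72, 0.51, 1.00, 0.39])
    L = np.array([[1.00, 0.86, 0.90, 0.52],
                  [0.42, 1.00, 0.73, 0.36],
                  [0.92, 0.71, 1.00, 0.57],
                  [0.94, 0.56, 0.78, 1.00]])
    conf, order = view_confidences(omega_m, L)
    print(np.round(conf, 2))               # -> [0.13 0.19 0.2  0.07]
    print(order + 1)                       # -> [3 2 1 4], the ordering of Figure 5
```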

Figure 5. Four-camera BBN. The illustrated network demonstrates a dynamic topology of $T = \{\boldsymbol{\phi}, \boldsymbol{\theta}_3, \boldsymbol{\theta}_2, \boldsymbol{\theta}_1, \boldsymbol{\theta}_4\}$. This configuration is valid for only the $m$th feature at the $k$th frame.

    To calculate (11), we must define the three fundamental densities provided in (12). In

    addition to the a priori distribution at each imaging plane, as indicated by (9), we require density

functions for $\boldsymbol{\phi}$ and $\boldsymbol{\theta}_j$, each conditioned upon a collection of other views. Let us refer to this collection of random variables as $\boldsymbol{\Omega} \triangleq \{\boldsymbol{\theta}_{j_1}, \boldsymbol{\theta}_{j_2}, \ldots, \boldsymbol{\theta}_{j_B}\}$ such that $[\,\boldsymbol{\theta}_{j_1}, \ldots, \boldsymbol{\theta}_{j_B}\,] \subseteq [\,\boldsymbol{\theta}_{j_1}, \ldots, \boldsymbol{\theta}_{j_K}\,]$. We introduce two transformation functions, $\mathcal{B}(\cdot)$ and $\mathcal{H}_j(\cdot)$, where the first function maps a collection of random variables, $\boldsymbol{\Omega}$, from 2-D to 3-D, while the latter transformation projects a random variable, $\boldsymbol{\phi}$, from 3-D to the imaging plane of the $j$th camera. These functions are

    analogous to those used in error propagation within the 3-D reconstruction literature [51]. Due to

    the inherent complexity of these transformations, however, it is not always possible to provide a

    simple, analytical representation of the noise propagation.


    To estimate the transformed distributions, we first assume a noise model for the

reconstructed and projected density functions. For density functions transformed by $\mathcal{B}(\cdot)$, we assume a 3-D Gaussian distribution, $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, while for those functions transformed by $\mathcal{H}_j(\cdot)$, we assume a 2-D distribution of $\mathcal{N}(\boldsymbol{\mu}_j, \mathbf{U}_j)$. These distributions are calculated using the notion of random sampling. For $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, we choose one point at random (distributed according to $P_{\theta_{j_1}}$, $P_{\theta_{j_2}}$, etc.) on each of the imaging planes corresponding to $\boldsymbol{\Omega}$. These $B$ points are then triangulated in the standard way [52] to form a random observation in 3-D Cartesian space. We continue this sampling process until enough observations exist to estimate $\boldsymbol{\mu}$ and $\mathbf{R}$ with sufficient confidence. A similar sampling procedure is used to estimate $\boldsymbol{\mu}_j$ and $\mathbf{U}_j$, where points are selected at random in 3-D according to $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, which also uses random samples from the $j$th view, and then projected onto the $j$th imaging plane. The random 3-D positions will project to a random distribution on the imaging plane that provides a set of observations for estimating $\boldsymbol{\mu}_j$ and $\mathbf{U}_j$. We illustrate the random sampling process for constructing $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$ and $\mathcal{N}(\boldsymbol{\mu}_j, \mathbf{U}_j)$ in Figure 6.

[Diagram: random samples drawn on the imaging planes j1, j2, and j3 according to the per-view densities are triangulated into 3-D observations for estimating (mu, R) (left); random 3-D samples drawn from N(mu, R) are projected onto the jth imaging plane to obtain 2-D observations for estimating (mu_j, U_j) (right).]

Figure 6. Random sampling process. The diagram on the left shows the generation of 3-D observations given random samples of $P_{\theta_{j_1}}[k]$, $P_{\theta_{j_2}}[k]$, and $P_{\theta_{j_3}}[k]$, while that on the right demonstrates the creation of 2-D observations given random samples with a $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$ distribution.
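The sampling procedure of Figure 6 can be sketched as follows, assuming calibrated pinhole cameras with known 3x4 projection matrices and a standard linear (DLT) triangulation; the particular triangulation routine and camera model are assumptions, since the paper only refers to the standard method [52].

```python
import numpy as np

rng = np.random.default_rng(0)

def triangulate(Ps, pts2d):
    """Linear (DLT) triangulation of one point seen in several views."""
    rows = []
    for P, (u, v) in zip(Ps, pts2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    u = P @ np.append(X, 1.0)
    return u[:2] / u[2]

def sample_3d_density(Ps, s_hats, Ms, n=500):
    """Estimate (mu, R) by triangulating random per-view samples."""
    samples = np.array([
        triangulate(Ps, [rng.multivariate_normal(s, M) for s, M in zip(s_hats, Ms)])
        for _ in range(n)])
    return samples.mean(axis=0), np.cov(samples.T)

def sample_2d_density(P_j, mu, R, n=500):
    """Estimate (mu_j, U_j) by projecting random 3-D samples onto view j."""
    samples = np.array([project(P_j, rng.multivariate_normal(mu, R)) for _ in range(n)])
    return samples.mean(axis=0), np.cov(samples.T)

if __name__ == "__main__":
    # two toy cameras observing a point close to the origin
    K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    P1 = K @ np.hstack([np.eye(3), [[0.], [0.], [4.]]])
    P2 = K @ np.hstack([np.eye(3), [[-1.], [0.], [4.]]])
    X_true = np.array([0.1, -0.2, 0.3])
    s_hats = [project(P1, X_true), project(P2, X_true)]
    Ms = [np.eye(2) * 2.0, np.eye(2) * 2.0]
    mu, R = sample_3d_density([P1, P2], s_hats, Ms)
    mu_1, U_1 = sample_2d_density(P1, mu, R)
    print(np.round(mu, 3), np.round(mu_1, 1))
```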

Starting with the two most confident views of the scene, corresponding to $\eta_1$ and $\eta_2$ in the network, we construct $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ under the assumption that a conditional independence exists,

$P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\Theta}\big]\,P\big[\boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\Theta}\big]\,P\big[\boldsymbol{\theta}_{j_1}\big]\,P\big[\boldsymbol{\theta}_{j_2}\big]$, (18)

where $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_{j_1}, \boldsymbol{\theta}_{j_2}\}$. To calculate (18), we introduce


$P\big[\boldsymbol{\phi}[k]\,\big|\,\boldsymbol{\Theta}[k]\big] = (2\pi)^{-\frac{3N}{2}}\,\big|\mathbf{R}[k]\big|^{-\frac{1}{2}}\exp\Big\{-\tfrac{1}{2}\big(\boldsymbol{\phi}[k] - \boldsymbol{\mu}[k]\big)^T\mathbf{R}^{-1}[k]\big(\boldsymbol{\phi}[k] - \boldsymbol{\mu}[k]\big)\Big\}$ (19)

and

$P\big[\boldsymbol{\theta}_j[k]\,\big|\,\boldsymbol{\Omega}[k]\big] = (2\pi)^{-\frac{N}{2}}\,\big|\mathbf{U}_j[k]\big|^{-\frac{1}{2}}\exp\Big\{-\tfrac{1}{2}\big(\boldsymbol{\theta}_j[k] - \boldsymbol{\mu}_j[k]\big)^T\mathbf{U}_j^{-1}[k]\big(\boldsymbol{\theta}_j[k] - \boldsymbol{\mu}_j[k]\big)\Big\}$, (20)

    where we show the dependence of these equations on time, k, for completeness. The

    maximization of (18) is trivial due to the normal distribution of each variable. The solution

parallels that of a weighted least-squares problem, where $\boldsymbol{\Theta}$ are the observations and the a priori

    density in (9) is analogous to the weighting factor for a particular observation. We iteratively

    modify the topology of the graph, adding nodes of successively lower confidence, until the

    dynamic topological ordering satisfies (11). By considering nodes in a highest confidence first

    (HCF) manner, the iterations are encouraged to converge quickly, and typically to a global

maximum. Since $\boldsymbol{\Theta}$ is finite, however, guaranteed absolute convergence to the maximum of $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ is possible by simply testing all possible input combinations. This may become computationally prohibitive, however, depending on the number of views and features. The resulting estimate of 3-D feature position, $\hat{\mathbf{y}}[k]$, and corresponding noise covariance, $\mathbf{R}[k]$, are the input to a Kalman filter that encourages a 3-D trajectory with temporal continuity.
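The highest-confidence-first search can be sketched as a simple loop over candidate subsets of views; the joint-density evaluation is abstracted behind a callable (a toy Gaussian model below), since in the actual system it would be built from the sampled densities of (9), (19), and (20).

```python
import numpy as np

def hcf_fusion(view_ids, confidences, joint_density):
    """Add views in descending order of confidence and keep the subset whose
    maximized joint density P_phi,Theta is largest (at least two views needed)."""
    order = [v for _, v in sorted(zip(confidences, view_ids), reverse=True)]
    best_p, best_subset, best_y = -np.inf, None, None
    for k in range(2, len(order) + 1):
        subset = order[:k]
        p, y = joint_density(subset)
        if p > best_p:
            best_p, best_subset, best_y = p, subset, y
    return best_subset, best_y, best_p

if __name__ == "__main__":
    # toy model: each view contributes a noisy 3-D estimate with a known variance;
    # fusing an extra low-confidence view eventually stops improving the density
    estimates = {1: np.array([0.0, 0.0, 2.0]), 2: np.array([0.02, -0.01, 2.05]),
                 3: np.array([0.4, 0.3, 2.6])}
    variances = {1: 0.01, 2: 0.02, 3: 0.5}

    def toy_density(subset):
        w = np.array([1.0 / variances[v] for v in subset])
        pts = np.array([estimates[v] for v in subset])
        y = (w[:, None] * pts).sum(axis=0) / w.sum()       # weighted least squares
        resid = ((pts - y) ** 2).sum(axis=1)
        log_p = -0.5 * (w * resid).sum() - 0.5 * np.log(1.0 / w).sum()
        return log_p, y

    print(hcf_fusion([1, 2, 3], [0.20, 0.19, 0.07], toy_density))
```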

    2.4. Temporal Integration

    The state-based feature vector represents both the location and velocity of points in 3-D

    and is indicated by

$\boldsymbol{\gamma}[k] = \big[\,\boldsymbol{\gamma}_1[k]\ \ \boldsymbol{\gamma}_2[k]\ \ \cdots\ \ \boldsymbol{\gamma}_{2N}[k]\,\big]^T$, (21)

where $\boldsymbol{\gamma}_m[k]$, $m \leq N$, indicates the ideal 3-D position of the $m$th feature within the $k$th frame and

$\boldsymbol{\gamma}_m[k] = \dot{\boldsymbol{\gamma}}_b[k], \qquad b = m - N,\ \ m > N$. (22)

An estimate of the 3-D feature vector in (21) is denoted by $\hat{\boldsymbol{\gamma}}[k]$, where we use a dynamic model for $\boldsymbol{\Psi}[k]$ with constant velocity and linear 3-D displacement such that

$\hat{\boldsymbol{\gamma}}[k\,|\,k-1] = \boldsymbol{\Psi}[k]\,\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1] = \big[\,\hat{\boldsymbol{\gamma}}_1[k]\ \ \hat{\boldsymbol{\gamma}}_2[k]\ \ \cdots\ \ \hat{\boldsymbol{\gamma}}_{2N}[k]\,\big]^T$, (23)

where

$\hat{\boldsymbol{\gamma}}_m[k] = 2\,\hat{\boldsymbol{\gamma}}_m[k-1\,|\,k-1] - \hat{\boldsymbol{\gamma}}_m[k-2\,|\,k-2]$. (24)

We develop an error covariance matrix, $\boldsymbol{\Sigma}[k\,|\,k-1]$, that depicts our confidence in the predictions of the state estimates. The update equation is indicated by


$\boldsymbol{\Sigma}[k\,|\,k-1] = \boldsymbol{\Psi}[k]\,\boldsymbol{\Sigma}[k-1\,|\,k-1]\,\boldsymbol{\Psi}^T[k] + \mathbf{Q}[k]$, (25)

where $\mathbf{Q}[k]$ represents a Gaussian noise covariance matrix which is iteratively modified over time to account for the deviations between the predictions and corrections of the state estimates. The system then constructs a Kalman gain matrix, $\mathbf{D}[k]$, as follows:

$\mathbf{D}[k] = \boldsymbol{\Sigma}[k\,|\,k-1]\,\boldsymbol{\Lambda}^T[k]\Big(\tfrac{1}{2}\big\{\mathbf{R}[k] + \boldsymbol{\Xi}[k]\big\} + \boldsymbol{\Lambda}[k]\,\boldsymbol{\Sigma}[k\,|\,k-1]\,\boldsymbol{\Lambda}^T[k]\Big)^{-1}$, (26)

where $\boldsymbol{\Lambda}[k] = [\,\mathbf{I}_N\ \ \mathbf{0}_N\,]$ indicates the linear observation matrix and $\boldsymbol{\Xi}[k]$ is a recursively updated observation noise covariance matrix. The remaining steps of the three-dimensional trajectory filtering include

$\hat{\boldsymbol{\gamma}}[k\,|\,k] = \hat{\boldsymbol{\gamma}}[k\,|\,k-1] + \mathbf{D}[k]\big(\hat{\mathbf{y}}[k] - \boldsymbol{\Lambda}[k]\,\hat{\boldsymbol{\gamma}}[k\,|\,k-1]\big)$ (27)

and

$\boldsymbol{\Sigma}[k\,|\,k] = \big(\mathbf{I} - \mathbf{D}[k]\,\boldsymbol{\Lambda}[k]\big)\boldsymbol{\Sigma}[k\,|\,k-1]$. (28)

These equations represent the state and noise covariance update equations for $\boldsymbol{\gamma}[k]$ and $\boldsymbol{\Sigma}[k]$, respectively.
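A compact sketch of one step of this 3-D trajectory filter is given below, assuming a unit frame interval and the constant-velocity transition implied by (23)-(24); averaging R[k] and Xi[k] in the gain follows the reconstruction of (26) and should be treated as an assumption rather than the authors' exact formulation.

```python
import numpy as np

def kalman_3d_step(gamma, Sigma, y_obs, R, Xi, Q):
    """One 3-D Kalman iteration over (25)-(28) for N features.

    gamma : (6N,) corrected state from frame k-1 (3N positions, then 3N velocities)
    Sigma : (6N,6N) its error covariance
    y_obs : (3N,) fused 3-D observation y_hat[k] from the Bayesian network
    R, Xi : (3N,3N) observation noise covariances
    Q     : (6N,6N) process noise covariance
    """
    n = len(y_obs)
    I, Z = np.eye(n), np.zeros((n, n))
    Psi = np.block([[I, I], [Z, I]])                 # constant-velocity transition
    Lam = np.hstack([I, Z])                          # observation matrix Lambda[k]
    g_pred = Psi @ gamma                             # state prediction
    S_pred = Psi @ Sigma @ Psi.T + Q                 # covariance prediction, (25)
    S_obs = 0.5 * (R + Xi) + Lam @ S_pred @ Lam.T
    D = S_pred @ Lam.T @ np.linalg.inv(S_obs)        # gain, (26)
    g_corr = g_pred + D @ (y_obs - Lam @ g_pred)     # state correction, (27)
    S_corr = (np.eye(2 * n) - D @ Lam) @ S_pred      # covariance update, (28)
    return g_corr, S_corr

if __name__ == "__main__":
    gamma = np.array([0.0, 0.0, 2.0, 0.1, 0.0, 0.0])  # one feature: position + velocity
    Sigma = np.eye(6) * 0.1
    y_obs = np.array([0.12, 0.01, 2.02])
    R = Xi = np.eye(3) * 0.05
    Q = np.eye(6) * 0.01
    print(kalman_3d_step(gamma, Sigma, y_obs, R, Xi, Q))
```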

    2.5. Algorithm Procedure

    For convenience, we summarize the overall flow of the algorithm. Following the initial

    process of camera calibration and the estimation of the redundancy matrix, L, semantic features

are chosen manually between $J$ views of the scene to initialize the predictor-corrector filters used for tracking. Then, for an arbitrary frame, k, we have the following procedure:

At each imaging plane:

1. Estimate the sparse motion between frames k and k-1 and segment the foreground into

    moving regions, as in [3].

2. Update $\mathbf{H}[k]$ and $\mathbf{C}[k]$; using (1) through (8) and the projections of $\boldsymbol{\Psi}[k]\,\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]$ and $\boldsymbol{\Sigma}[k-1\,|\,k-1]$, produce an estimate, $\hat{\mathbf{s}}[k]$, with some confidence, $\mathbf{M}[k]$, for the feature vector, $\mathbf{s}[k]$.

    For each feature over all imaging planes:

1. Define a set of random variables, $\{\boldsymbol{\theta}_1[k], \ldots, \boldsymbol{\theta}_J[k]\}$, and a subset, $\boldsymbol{\Theta}[k]$, that characterize some unknown combinations of vector estimates, $\hat{\mathbf{s}}_j[k]$, from different views.

    2. Following (15) and (16), construct a Bayesian topology (12) that orders the views in

    descending order of confidence.


    3. Start with the two most confident views of the feature and estimate the density functions

    in (9), (19), and (20) using random sampling, triangulation, and geometric projections, as

    illustrated in Figure 6.

4. Define and maximize $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ using the previously estimated density functions.

5. Iterate over steps 2-4, adding an additional view of the feature at each step, until a global maximum of $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ is reached, yielding the estimates $\hat{\boldsymbol{\Theta}}[k]$ and $\hat{\mathbf{y}}[k]$.

    In 3-D Cartesian space:

1. Update $\boldsymbol{\Psi}[k]$, $\boldsymbol{\Xi}[k]$, and $\mathbf{Q}[k]$; using (21) through (28), produce a corrected estimate, $\hat{\boldsymbol{\gamma}}[k\,|\,k]$, with some confidence, $\boldsymbol{\Sigma}[k\,|\,k]$, for the actual 3-D position of features, $\mathbf{y}[k]$.

    3. Experimental Results

To test the proposed contribution, we use 600 frames of synchronized video data taken from three (i.e., $J = 3$) distinct views of a home environment. The underlying scene captures an

    informal social gathering of four people, each of whom is characterized by five semantic features

    that are selected at first sight and tracked throughout the remainder of the sequence. For features,

    we use the top of each shoe, the transition between the sleeve and the arm, and the top of the

    torso underneath the neck as seen from the front of the body. These points correlate well with

    various body parts (e.g., wrists, elbows, feet, head/neck) while providing enough color and

    content to perform accurate tracking. If a desired feature is initially not visible due to self-

occlusion, but its position in one or more views can be estimated with fair accuracy, it is labeled

    and tracked as an occluded point until it becomes visible. This method of feature selection and

    tracking is illustrated more clearly in Figure 7, where we show the tracking results for the system

    at various stages.

    Using the metric in (5), the algorithm associates a confidence level for each feature. We

    quantize the number of levels to three, where a confidence level of 0 (LV) suggests that the

    feature is being tracked with high accuracy, a level of 1 (LO) indicates that the feature is being

tracked with relatively low accuracy, and a level of 2 (LM) identifies a feature that is no longer in the field of view. Referring to the notation of 2.2, LV-features typically include those of Class A and some of those in Class B and are assumed to be visible, while LO-features include those in Class C and some of those in Class B and are assumed to be occluded. While tracking, visible

    and occluded features are marked using solid and dashed circles, respectively.
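A trivial sketch of this quantization is given below; the numeric threshold separating LV from LO is purely hypothetical, since the paper does not state one.

```python
def confidence_level(omega, in_field_of_view, tau=0.5):
    """Return 0 (LV), 1 (LO), or 2 (LM) for a feature with confidence omega;
    the threshold tau is an assumed value, not taken from the paper."""
    if not in_field_of_view:
        return 2                      # LM: the feature has left the field of view
    return 0 if omega >= tau else 1   # LV if tracked with high confidence, else LO

if __name__ == "__main__":
    print([confidence_level(w, True) for w in (1.0, 0.72, 0.2)], confidence_level(0.9, False))
```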


    Figure 7. Tracking results. Each column provides a distinct view of the scene and each row represents an individual

    frame extracted from the sequence. Features that appear to be occluded or tracked with questionable accuracy are

    automatically labeled with a dashed circle, while those of higher confidence are given a solid mark.

    In addition to capturing the precise location and state of the features at various frames, we

    also show feature trajectories for a continuous set of frames. In particular, Figure 8 shows the

    path of every feature followed over a period of five seconds. For the sake of visual clarity, each

    row of the figure is dedicated to a different individual, while the features on each person are

    denoted using a variety of colors. To quantify the accuracy of the proposed tracking system, we

    calculate the average absolute error between the automatically generated feature locations and

    the corresponding ground-truth data. Since the underlying features have some well-defined

    semantic interpretation, it stands to reason that the ground-truth should most appropriately be


    generated by the same intervention that initially selected and defined each feature. When 3-D

data is available, we characterize the tracking error by using the absolute difference between the ground-truth and the 3-D feature projection at each imaging plane. The reported error is taken on the imaging plane carrying the maximum absolute difference over all $J$ views. If a feature can only be tracked from one view, however, we simply calculate the absolute error between the

    ground-truth and the tracking results of monocular image sequence processing.

    Figure 8. Trajectories of various features. Each row shows three views of a five second interval of the scene for a

    single person. The numbers in the lower corner of each image indicate the view followed by a range of frames.

    For exhaustively quantifying the data, then, we must develop a baseline for three views of

    four people, each with five features, over 600 frames. As this would clearly be no less than an


    overwhelming task, we choose to faithfully represent the entire sequence using a random sample

    of only 60 frames. The fundamental hypothesis of the proposed contribution is based upon an

    assumed correlation between tracking accuracy and various configurations of visible and

    occluded features. As is such, we group features into any ofB

    bins, where for a total ofJviewsand Fconfidence levels, B is the number of unique solutions to

$\sum_{i=1}^{F} O_i = J, \qquad 0 \leq O_i \leq J,\ \ O_i \in \mathbb{N}_0$, (29)

and $O_i$ is the number of views with the $i$th confidence level. Using the three levels defined earlier (LV, LO, LM), it can be shown that we have $\tfrac{1}{2}J^2 + \tfrac{3}{2}J + 1$ bins for any combination of $J$ cameras.

    Each bin is represented using a triplet, where for any feature the first number, OV, indicates the

    number of cameras that presumably have an unobstructed view, the second number, OO,

    represents the number of cameras that are thought to have an occluded view, and the third

    number, OM, specifies the number of cameras for which the feature is outside the field of view.

    Figure 9 summarizes the tracking error for the proposed Bayesian technique (B) while

    comparing to the performance of a more simplistic averaging approach (A). For the latter

    method, multiple observations are combined in the least-squares sense in order to combat the

effects of observation noise. We draw the reader's attention to a number of important

    characteristics. As indicated by the data, using the proposed BBN for data fusion produces lower

    tracking errors for any class of features for which more than two observations exist (i.e., bins

300, 210, 120, and 030). Assuming uniformly distributed features over $B$ bins, it is easily shown

that the probability of having an arbitrary feature with more than two observations is

$\Pr\big[O_M < J - 2\big] = \frac{\tfrac{1}{2}\big(J^2 + 3J - 10\big)}{B} = \frac{J^2 + 3J - 10}{J^2 + 3J + 2}$. (30)

    With three views, one would expect nearly 40% of the features (under a uniform distribution) to

    benefit from the proposed data fusion approach. Equation (30) states that as the number of views

    increases, so does the probability that the tracking of a feature occurs with a lower error. This is

    seen when taking the limit as the number of cameras monitoring the scene approaches infinity:

$\lim_{J\to\infty}\Pr\big[O_M < J - 2\big] = \lim_{J\to\infty}\left\{\frac{J^2 + 3J - 10}{J^2 + 3J + 2}\right\} = \lim_{J\to\infty}\left\{1 - \frac{12}{J^2 + 3J + 2}\right\} = 1$. (31)

    Due to the nature of occlusion, however, one cannot draw a similar inference for the case of

    simple averaging. That is, without a priori knowledge of the feature distribution within the


    scene, adding more cameras is likely to increase the number of occluded and visible points by

    comparable amounts. This clearly increases the probability that more than two cameras have an

    unobstructed view of a particular feature, as indicated by (31), but typically at the expense of

additional observations in occlusion. As demonstrated by the transitions from bins 111 to 120/210 and 201 to 210/300 in Figure 9, we only witness a significant decrease in error, on

    average, when the third observation has a relatively high confidence. In contrast, the Bayesian

    belief network effectively weights only the most likely observations based on the a priori

    confidence distributions presented in (9).
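The counting argument behind (29)-(31) is easy to verify numerically; the following sketch enumerates the bins for J = 3 and confirms both the number of bins and the roughly 40% figure quoted above.

```python
from itertools import product
from fractions import Fraction

def bins(J, F=3):
    """All non-negative integer solutions of O_1 + ... + O_F = J, cf. (29)."""
    return [c for c in product(range(J + 1), repeat=F) if sum(c) == J]

if __name__ == "__main__":
    J = 3
    B = bins(J)                                   # triplets (O_V, O_O, O_M)
    print(len(B))                                 # 10 bins = J^2/2 + 3J/2 + 1
    favourable = [b for b in B if b[2] < J - 2]   # more than two observations
    print(Fraction(len(favourable), len(B)))      # 2/5, i.e. about 40%
```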

[Histogram: maximum absolute error (pixels) versus view/confidence configuration (bins 300, 210, 201, 120, 111, 030, 021, 102, 012, 003) for the averaging approach (A) and the Bayesian approach (B); the error for bin 003 is undefined.]

Figure 9. Summary of tracking results. Each bin in the histogram represents a different class of features, denoted by a triplet, where the first, second, and third digits indicate the number of cameras tracking a given feature with high, low, and no accuracy, respectively. The three right-most bins indicate features for which 3-D data is unavailable, where the last bin (003) shows all points that have left the scene completely. The labeled percentages indicate the percent increase in error of one approach over another for each class of features.

    If we consider our specific distribution of features as a function of configuration over the

    entire sequence, as indicated in Figure 10, the difference in error between A and B is as low as

    4%, as high as 34%, and 14.5% on average. As further indicated by Figure 10, the total number

    of features for which the proposed algorithm improves the tracking accuracy accounts for 42% of

    all features. This value is in agreement with the prediction of 40% for a uniform distribution of

    observations. In support of Figures 9 and 10, we show corresponding numerical data in Table 1.


[Histogram: number of samples versus view/confidence configuration; the percentage above each bin gives the fraction of all sampled features in that configuration.]

Figure 10. Distribution of features. We show the distribution of sampled features as a function of configuration. The total number sampled is 1200, 18% of which did not have enough observations to generate 3-D data.

    All major sources of error in the tracking results can be traced back to occlusion; this

    point is exemplified by the trends in Figure 9. A more careful examination of the sequence even

    indicates that self-occlusion is often a stronger culprit than other forms, such as occlusion due to

    multiple object interactions and scene clutter. With the proposed technique, however, occlusion

    due to multiple moving objects still yields erroneous tracking results when an occluded feature

    undergoes some form of unexpected acceleration. If the visible motion estimates in the vicinity

    of the occluded region do not positively correlate with the motion of that region, then neither the

    predictions nor the observations will produce an accurate feature trajectory. For a more detailed

    discussion of tracking multiple objects in the presence of occlusion, we refer the reader to [3].

    Table 1. Tracking statistics. The first column shows those features for which 3-D position was estimated with high

    accuracy, the middle column indicates those features for which one or more necessary views were apparently

    occluded, while features in the third column had (at most) an estimated position in only one view of the scene.

            O_V >= 2                      O_V < 2 <= (O_V + O_O)         (O_V + O_O) < 2
Config   (A, B)         Samples     Config   (A, B)         Samples     Config   (A, B)         Samples
300      (1.80, 1.73)   58          120      (3.21, 2.82)   121         102      (2.26, 2.26)   29
210      (2.41, 1.79)   93          111      (3.03, 3.03)   140         012      (5.38, 5.38)   48
201      (1.81, 1.81)   85          030      (3.78, 3.48)   229         003      (N/A, N/A)     145
                                    021      (4.09, 4.09)   252
Percentage              19.6%       Percentage              61.8%       Percentage              18.6%


    Our technique for tracking occluded features is based on an analysis of visible data.

    Because we use the less computational, yet more simplistic, approach of model-free tracking, the

    motion of features in self-occlusion is often predicted using erroneous vector estimates within

the foreground of the moving person in question. The result is a trajectory that is correct at a global scale, but corrupt at finer resolutions. This is easily seen, for example, while attempting to

    track a feature on an occluded arm, where the motion of the arm is not as positively correlated

    with that of the torso (or the visible arm) as the algorithm would like to imply. We illustrate the

    difficulty of self-occlusion in Figure 11.

[Illustration: View A and View B of the same person, labeling a visible feature, the visible observations used, and an occluded feature.]

Figure 11. Observations in self-occlusion. This figure illustrates the adverse effects of self-occlusion and articulated motion on estimating observations for feature positions.

    One solution to the problem of self-occlusion might be the introduction of an a priori

    motion model, as suggested by [53]. In conjunction with some type of human model [20], one

    might consider segmenting the foreground video data into contiguous regions, each defined by

    some parametric motion model. These regions could then be used to intelligently guide the

    inclusion of sparse motion estimates in the definition of the state observations given in (6). This

    approach is theoretically sound, but is difficult to implement in practice due to the dependence of

    the segmentation on an accurate initialization, the uncertainty associated with region boundaries

    (especially under occlusion), and the need for real-time performance. Given the proposed

framework for the fusion of video data from multiple cameras, however, a simpler (and provably more effective) solution might be to extend the number of views, thus migrating towards the increasingly popular notion of next-generation ubiquitous computing. To develop a full appreciation of the proposed technique, we encourage the reader to visit our website to inspect the multi-view tracking results in their entirety.
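As a rough sketch of the parametric alternative mentioned above (purely illustrative: the affine model, the function names, and the least-squares fit are assumptions, not components of the proposed system), one might fit a six-parameter affine motion model to the reliable motion estimates of a segmented region and use the fitted model, rather than raw neighboring vectors, to predict the motion of occluded features inside that region:

    import numpy as np

    def fit_affine_motion(points, motions):
        # Least-squares fit of an affine motion model [u, v]^T = A [x, y]^T + b
        # to sparse motion estimates within one segmented region.
        #   points  : (N, 2) image coordinates with reliable motion estimates
        #   motions : (N, 2) measured motion vectors at those coordinates
        # Returns A (2 x 2) and b (2,).
        X = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 3) design matrix
        params, *_ = np.linalg.lstsq(X, motions, rcond=None)     # (3, 2) solution
        return params[:2, :].T, params[2, :]

    def predict_region_motion(A, b, point):
        # Motion predicted for any point (visible or occluded) in the region.
        return A @ point + b

The practical difficulties noted above lie precisely in obtaining the region segmentation that such a fit presupposes, especially under occlusion and in real time.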


    4. Conclusions and Future Work

    We introduce a novel technique for tracking interacting human motion using multiple

    layers of temporal filtering coupled by a simple Bayesian belief network for multiple camera

fusion. The system uses a distributed computing platform to maintain real-time performance while processing multiple sources of video data, each capturing a distinct view of the scene. To maximize the efficiency of distributed computation, each view of the scene is processed independently on a dedicated processor. The processing for each view is based on a predictor-corrector filter with

    Kalman-like state propagation that uses measures of state visibility and sparse estimates of image

    motion to produce observations. The corresponding gain matrix probabilistically weights the

    calculated observations against projected 3-D data from the previous frame. This mixing of

    predictions and observations produces a stochastic coupling between interacting states that

    effectively models the dependencies between spatially neighboring features.
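For readers who prefer an equation-level view, the per-view correction step summarized above can be sketched as a standard linear predict-correct update (a simplified stand-in with hypothetical names, not a transcription of our exact filter or notation): the prediction is the re-projected 3-D estimate, the observation comes from sparse image motion, and the gain trades one against the other according to their covariances.

    import numpy as np

    def correct_feature_state(x_pred, P_pred, z_obs, R_obs):
        # One simplified correction for a single 2-D feature position.
        #   x_pred : (2,) predicted position (re-projection of the 3-D estimate)
        #   P_pred : (2, 2) covariance of that prediction
        #   z_obs  : (2,) observed position from sparse image-motion estimates
        #   R_obs  : (2, 2) observation noise; inflated when the feature is
        #            poorly visible, small when it is well observed
        H = np.eye(2)                                # identity observation model (assumed)
        S = H @ P_pred @ H.T + R_obs                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)          # gain weighting observation vs. prediction
        x_corr = x_pred + K @ (z_obs - H @ x_pred)   # corrected state
        P_corr = (np.eye(2) - K @ H) @ P_pred        # corrected covariance
        return x_corr, P_corr

When visibility is poor, R_obs grows, the gain shrinks, and the corrected state follows the re-projected prediction; when the feature is clearly visible, the observation dominates.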

    The corrected output of each predictor-corrector filter provides a vector observation for a

    Bayesian belief network. The network is characterized by a dynamic, multidimensional topology

    that varies as a function of scene content and feature confidence. The algorithm calculates an

    appropriate input configuration by iteratively resolving independency relationships and a priori

    confidence levels within the graph. The output of the network is a vector of 3-D positional data

    with a corresponding vector of noise covariance matrices; this information is provided as input to

a standard Kalman filtering mechanism. The proposed method of data fusion is compared to the more basic approach of data averaging. Our results indicate that, for any input configuration consisting of more than two observations per feature, the method of Bayesian fusion is superior. In addition to producing lower errors than observational averaging, the Bayesian network lends itself well, both qualitatively and computationally, to applications involving a large number of input observations and, in general, is better suited to handling multi-sensor data sources.
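The practical difference between the two fusion strategies can be seen in a minimal sketch (hypothetical names and toy data; the actual network additionally resolves independency relationships within the graph, which this omits): weighting per-view 3-D estimates by their inverse covariances discounts unreliable views, whereas simple averaging treats every view as equally trustworthy.

    import numpy as np

    def fuse_by_covariance(estimates, covariances):
        # Covariance-weighted fusion of independent 3-D position estimates.
        #   estimates   : list of (3,) position estimates
        #   covariances : list of (3, 3) noise covariance matrices
        info = sum(np.linalg.inv(C) for C in covariances)                  # total information
        weighted = sum(np.linalg.inv(C) @ x for x, C in zip(estimates, covariances))
        P_fused = np.linalg.inv(info)
        return P_fused @ weighted, P_fused

    def fuse_by_averaging(estimates):
        # Baseline used for comparison: unweighted mean of the same estimates.
        return np.mean(np.stack(estimates), axis=0)

    # One confident estimate and one noisy (e.g., partially occluded) estimate:
    x1, C1 = np.array([1.0, 2.0, 3.0]), 0.01 * np.eye(3)
    x2, C2 = np.array([1.5, 2.6, 2.4]), 1.00 * np.eye(3)
    x_fused, _ = fuse_by_covariance([x1, x2], [C1, C2])
    print(x_fused)                      # stays close to the confident estimate
    print(fuse_by_averaging([x1, x2]))  # pulled halfway toward the noisy one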

Future work in this area will continue to investigate methods for handling complex occlusion, such as those employing a priori motion and object models or those based on additional observational sources of data. Outside the scope of tracking, but well within the field of human motion analysis, lie other topics of potential contribution. These investigations might be in areas including, but not limited to, simultaneous object tracking and recognition, multi-modal recognition of gestures and activities, and higher-level fusion of activities and modalities for event and interaction understanding applications.


    Acknowledgments

    This research was made possible by generous grants from the Center for Future Health,

    the Keck Foundation, and Eastman Kodak Company. We would also like to acknowledge

    Terrance Jones for his assistance and organizational efforts in providing synchronized multi-

    view sequences of multiple people in action.

    References

[1] Y. Ivanov, C. Stauffer, A. Bobick, and W. E. L. Grimson, "Video surveillance of interactions," Proc. of the Workshop on Visual Surveillance, Fort Collins, CO, 26 June 1999, pp. 82-89.
[2] T. Boult, R. Micheals, A. Erkan, P. Lewis, C. Powers, C. Qian, and W. Yin, "Frame-rate multi-body tracking for surveillance," Proc. of the DARPA Image Understanding Workshop, Monterey, CA, 20-23 November 1998, pp. 305-313.
[3] S. L. Dockstader and A. M. Tekalp, "On the Tracking of Articulated and Occluded Video Object Motion," Real-Time Imaging, to appear in 2001.
[4] I. Haritaoglu, D. Harwood, and L. S. Davis, "Hydra: multiple people detection and tracking using silhouettes," Proc. of the Workshop on Visual Surveillance, Fort Collins, CO, 26 June 1999, pp. 6-13.
[5] L. W. Campbell, D. A. Becker, A. Azarbayejani, A. F. Bobick, and A. Pentland, "Invariant features for 3-D gesture recognition," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 157-162.
[6] A. D. Wilson and A. F. Bobick, "Parametric Hidden Markov Models for Gesture Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 884-890, September 1999.
[7] A. Pentland, "Looking at People: Sensing for Ubiquitous and Wearable Computing," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 107-119, January 2000.
[8] J. M. Nash, J. N. Carter, and M. S. Nixon, "Extraction of moving articulated-objects by evidence gathering," Proc. of the British Machine Vision Conference, Southampton, United Kingdom, 14-17 September 1998, pp. 609-618.


[9] C. Bregler, "Learning and recognizing human dynamics in video sequences," Proc. of the Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17-19 June 1997, pp. 568-574.
[10] H. A. Rowley and J. M. Rehg, "Analyzing articulated motion using expectation-maximization," Proc. of the Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17-19 June 1997, pp. 935-941.
[11] D. Hogg, "Model-Based Vision: A Program to See a Walking Person," Image and Vision Computing, vol. 1, no. 1, pp. 5-20, January 1983.
[12] D. M. Gavrila and L. S. Davis, "3-D model-based tracking of humans in action: a multi-view approach," Proc. of the Conf. on Computer Vision and Pattern Recognition, San Francisco, CA, 18-20 June 1996, pp. 73-80.
[13] J. O'Rourke and N. I. Badler, "Model-Based Image Analysis of Human Motion Using Constraint Propagation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 2, no. 6, pp. 522-536, November 1980.
[14] J. M. Rehg and T. Kanade, "Model-based tracking of self-occluding articulated objects," Proc. of the Int. Conf. on Computer Vision, Cambridge, MA, 20-23 June 1995, pp. 618-623.
[15] S. Wachter and H.-H. Nagel, "Tracking Persons in Monocular Image Sequences," Computer Vision and Image Understanding, vol. 74, no. 3, June 1999.
[16] E.-J. Ong and S. Gong, "Tracking hybrid 2D-3D human models from multiple views," Proc. of the Int. Workshop on Modelling People, Kerkyra, Greece, 20 September 1999, pp. 11-18.
[17] M. Isard and A. Blake, "Condensation - Conditional Density Propagation for Visual Tracking," Int. J. of Computer Vision, vol. 29, no. 1, pp. 5-28, August 1998.
[18] K. Akita, "Image Sequence Analysis of Real World Human Motion," Pattern Recognition, vol. 17, no. 1, pp. 73-83, January 1984.
[19] M. K. Leung and Y.-H. Yang, "Human Body Motion Segmentation in a Complex Scene," Pattern Recognition, vol. 20, no. 1, pp. 55-64, January 1987.
[20] S. X. Ju, M. J. Black, and Y. Yacoob, "Cardboard people: A parameterized model of articulated image motion," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 38-44.


[21] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-Time Tracking of the Human Body," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780-785, July 1997.
[22] Y.-S. Yao and R. Chellappa, "Tracking a Dynamic Set of Feature Points," IEEE Trans. on Image Processing, vol. 4, no. 10, pp. 1382-1395, October 1995.
[23] A. Azarbayejani and A. P. Pentland, "Recursive Estimation of Motion, Structure, and Focal Length," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 562-575, June 1995.
[24] T. J. Broida and R. Chellappa, "Estimating the Kinematics and Structure of a Rigid Object from a Sequence of Monocular Images," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, pp. 497-513, June 1991.
[25] F. Lerasle, G. Rives, and M. Dhome, "Tracking of Human Limbs by Multiocular Vision," Computer Vision and Image Understanding, vol. 75, no. 3, pp. 229-246, September 1999.
[26] D.-S. Jang and H.-I. Choi, "Active Models for Tracking Moving Objects," Pattern Recognition, vol. 33, no. 7, pp. 1135-1146, July 2000.
[27] N. Peterfreund, "Robust Tracking of Position and Velocity with Kalman Snakes," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, pp. 564-569, June 1999.
[28] J. MacCormick and A. Blake, "A probabilistic exclusion principle for tracking multiple objects," Proc. of the Int. Conf. on Computer Vision, Kerkyra, Greece, 20-27 September 1999, pp. 572-578.
[29] S. J. McKenna, S. Jabri, Z. Duric, and H. Wechsler, "Tracking interacting people," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Grenoble, France, 28-30 March 2000, pp. 348-353.
[30] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-Time Surveillance of People and Their Activities," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, August 2000.
[31] V. Kettnaker and R. Zabih, "Counting people from multiple cameras," Proc. of the Int. Conf. on Multimedia Computing and Systems, Florence, Italy, 7-11 June 1999, pp. 267-271.


[32] G.-W. Chu and M. J. Chung, "An optimal image selection from multiple cameras under the limitation of communication capacity," Proc. of the Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems, Taipei, Taiwan, 15-18 August 1999, pp. 261-266.
[33] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, "Integrated Person Tracking Using Stereo, Color, and Pattern Detection," Int. J. of Computer Vision, vol. 37, no. 2, pp. 175-185, June 2000.
[34] Q. Cai and J. K. Aggarwal, "Tracking Human Motion in Structured Environments Using a Distributed-Camera System," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1241-1247, November 1999.
[35] S. Stillman, R. Tanawongsuwan, and I. Essa, "A system for tracking and recognizing multiple people with multiple cameras," Proc. of the Int. Conf. on Audio and Video-Based Biometric Person Authentication, Washington, DC, 22-23 March 1999, pp. 96-101.
[36] A. Utsumi, H. Mori, J. Ohya, and M. Yachida, "Multiple-human tracking using multiple cameras," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 498-503.
[37] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," Proc. of the Conf. on Computer Vision and Pattern Recognition, Champaign, IL, 15-18 June 1992, pp. 379-385.
[38] M. J. Black and A. J. Jepson, "Recognizing temporal trajectories using the Condensation algorithm," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 16-21.
[39] T. J. Olson and F. Z. Brill, "Moving object detection and event recognition algorithms for smart cameras," Proc. of the DARPA Image Understanding Workshop, New Orleans, LA, 11-14 May 1997, pp. 159-175.
[40] Y. Yacoob and M. J. Black, "Parameterized Modeling and Recognition of Activities," Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232-247, February 1999.
[41] Y. Yacoob and L. S. Davis, "Learned Models for Estimation of Rigid and Articulated Human Motion from Stationary or Moving Camera," Int. J. of Computer Vision, vol. 12, no. 1, pp. 5-30, January 2000.


[42] C. R. Wren and A. P. Pentland, "Dynamic models of human motion," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 22-27.
[43] S. Nagaya, S. Seki, and R. Oka, "A theoretical consideration of pattern space trajectory for gesture spotting recognition," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 72-77.
[44] D. M. Gavrila, "The Visual Analysis of Human Movement: A Survey," Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82-98, January 1999.
[45] J. K. Aggarwal and Q. Cai, "Human Motion Analysis: A Review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, March 1999.
[46] R. T. Collins, A. J. Lipton, and T. Kanade, "Introduction to the Special Section on Video Surveillance," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 745-746, August 2000.
[47] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Francisco, CA: Morgan Kaufmann, 1988.
[48] Y. Altunbasak, A. M. Tekalp, and G. Bozdagi, "Simultaneous stereo-motion fusion and 3-D motion tracking," Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, Detroit, MI, 9-12 May 1995, vol. 4, pp. 2270-2280.
[49] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831-843, August 2000.
[50] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Systems and Experiment: Performance of Optical Flow Techniques," Int. J. of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[51] Z. Sun, Object-Based Video Processing with Depth, Ph.D. Thesis, University of Rochester, 2000.
[52] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision, Upper Saddle River, NJ: Prentice-Hall, 1998.
[53] H. Sidenbladh, M. J. Black, and D. J. Fleet, "Stochastic tracking of 3D human figures using 2D image motion," Proc. of the European Conf. on Computer Vision, Dublin, Ireland, 26 June - 1 July 2000.