
Multiple Camera Tracking of Interacting and Occluded Human Motion

    Shiloh L. Dockstader and A. Murat Tekalp

    Department of Electrical and Computer Engineering

    University of Rochester, Rochester, NY 14627, USA

    E-mail: [email protected], [email protected]

    URL: http://www.ece.rochester.edu/~dockstad/research

    Abstract

    We propose a distributed, real-time computing platform for tracking multiple interacting persons

    in motion. To combat the negative effects of occlusion and articulated motion we use a multi-view implementation, where each view is first independently processed on a dedicated processor.

    This monocular processing uses a predictor-corrector filter to weigh re-projections of 3-D

    position estimates, obtained by the central processor, against observations of measurable image

    motion. The corrected state vectors from each view provide input observations to a Bayesian

    belief network, in the central processor, with a dynamic, multidimensional topology that varies

    as a function of scene content and feature confidence. The Bayesian net fuses independent

    observations from multiple cameras by iteratively resolving independency relationships and

    confidence levels within the graph, thereby producing the most likely vector of 3-D state

    estimates given the available data. To maintain temporal continuity we follow the network with a

    layer of Kalman filtering that updates the 3-D state estimates. We demonstrate the efficacy of the

    proposed system using a multi-view sequence of several people in motion. Our experiments

    suggest that, when compared with data fusion based on averaging, the proposed technique yields

    a noticeable improvement in tracking accuracy.

    Key Words: Kalman filtering, Bayesian network, human motion analysis, human interaction,

    real-time tracking, multiple camera fusion, articulated motion, occlusion



    1. Introduction

    1.1. Motivation

    Motion analysis and tracking from monocular or stereo video has long been proposed for

    applications such as visual security and surveillance [1][2][3][4]. More recently, it has also been

    employed for performing gesture and event understanding [5][6]; developing generalized

    ubiquitous and wearable computing for the man-machine interface [7]; and monitoring patients

    to capture the essence of epileptic seizures, assist with the analysis of rehabilitative gait, and

    perform monitoring and examination of basic sleep disorders [8][9]. However, due to a

    combination of several factors, reliable motion tracking still remains a challenging domain of

    research. The underlying difficulties behind human motion analysis are founded mostly upon the

    complex, articulated, and self-occluding nature of the human body [10]. The interaction between

    this inherent motion complexity and an equally complex environment greatly compounds the

    situation. Environmental conditions such as dynamic ambient lighting, object occlusions,

    insufficient or incomplete scene knowledge, and other moving objects and people are just some

    of the naturally interfering factors. These issues make tasks as fundamental as tracking and

    semantic correspondence recognition exceedingly difficult without first outlining numerous, and

    sometimes prohibitive, assumptions. Computational complexity also plays an important role, as

    many of the popular applications of human motion analysis and monitoring demand real-time (or

near real-time) solutions.

1.2. Previous Work

    A fundamental step in human motion analysis and tracking consists of the modeling of

    moving people in image sequences. Classical 3-D modeling of moving people has taken the form

    of volumetric bodies based on elliptical cylinders [11], tapered super-quadrics [12], or other

    more highly parameterized primitives or configurations [13]. Rehg and Kanade [14] extend 3-D

    model-based tracking to account for the significant self-occlusions inherent in the tracking of

    articulated motion. Alternatives to using exact 3-D models include the use of 2-D models that

    represent the projection of 3-D data onto some imaging plane [15], point distribution models

    (PDM) [16], strict 2-D/3-D contour modeling [17], stick figure and medial axis modeling [18], or

    even 2-D generalized shape and region modeling [19][20]. For human motion analysis, choosing

    an object or motion model depends greatly on the application. Real-time scenarios, for example,

    must often sacrifice the accuracy of more highly parameterized models in order to achieve rapid,


    automatic model initialization. In line with this motivation, Wren et al. [21] introduce a multi-

    class statistical model of color and shape to obtain a 2-D representation of the head and hands.

    They use simple blobs, or structurally significant regions, to represent various body parts for the

purpose of tracking and recognition. The approach lends itself to a real-time implementation and is more robust to certain types of occlusion than other techniques.

    At the heart of human motion analysis is motion tracking. The dependence between

tracking and recognition is significant: even a slight increase in the quality or precision of tracking can yield considerable improvements in recognition accuracy. Various

    approaches to object tracking employ features such as edges, shape, color, and optical flow [21].

    To handle occlusion, articulated motion, and noisy observations, numerous researchers have

    looked to stochastic modeling such as Kalman filtering [22][23][24][25] and probabilistic

    conditional density propagation (Condensation algorithm) [17]. Dockstader and Tekalp [3]

    suggest a modified Kalman filtering approach to the tracking of several moving people in video

    surveillance sequences. They take advantage of the fact that as multiple moving people interact,

    the state predictions and observations for the corresponding Kalman filters no longer remain

    independent. Lerasle et al. [25] employ Kalman filters to perform tracking of human limbs from

    multiple vantage points. Jang and Choi [26] suggest the use of active models based on regional

    and structural characteristics such as color, shape, texture, and the like to track non-rigid moving

    objects. Like the tracking theory set forth by Peterfreund [27], the active models employ Kalman

    filtering to predict basic motion information and snake-like energy minimization terms to

perform dynamic adaptations using the moving object's structure. MacCormick and Blake [28]

present the notion of partitioned sampling to perform robust tracking of multiple moving objects.

    The underlying probabilistic exclusion principle prevents a single observation from supporting

    the presence of multiple targets by employing a specialized observational model. The

    Condensation [17] algorithm is used in conjunction with the aforementioned observational

model to perform the contour tracking of multiple, interacting objects. Without the explicit use of extensive temporal or stochastic modeling, McKenna et al. [29] describe a computer vision

    system for tracking multiple moving persons in relatively unconstrained environments.

    Haritaoglu et al. [30] present a real-time system, W4, for detecting and tracking multiple people


    when they appear in a group. All of the techniques discussed so far have considered motion

    tracking from monocular video.

The use of multi-view monitoring [12][16][31] and data fusion [32][33] provides an elegant mechanism for handling occlusion, articulated motion, and multiple moving objects in video sequences. With distributed monitoring systems, however, come additional issues such as

    stereo matching, pose estimation, automated camera switching, and system bandwidth and

    processing capabilities [34]. Stillman et al. [35] propose a robust system for tracking and

    recognizing multiple people with two cameras capable of panning, tilting, and zooming (PTZ) as

    well as two static cameras for general motion monitoring. The static cameras perform person

    detection and histogram registration in parallel while the PTZ cameras acquire more highly

    focused data to perform basic face recognition. Utsumi et al. [36] suggest a system for detecting

    and tracking multiple persons using multiple cameras to address complex occlusion and

    articulated motion. The system is composed of multiple tasks including position detection,

    rotation angle detection, and body-side detection. For each of the tasks, the camera that provides

    the most relevant information is automatically chosen using the distance transformations of

    multiple segmented object maps. Cai and Aggarwal [34] develop an approach for tracking

    human motion using a distributed-camera model. The system starts with tracking from a single

    camera view and switches when the active camera no longer has a sufficient view of the moving

    object. Selecting the optimal camera is based on a matching evaluation, which uses a simple

    threshold on some tracking confidence parameter, and a frame number calculation, which

    attempts to minimize the total amount of camera switching.

    It stands to reason that many who focus on the modeling and tracking of articulated

    human motion also initiate efforts in the recognition, interpretation, and understanding of such

    motion. Toward this end, popular approaches typically employ parameterized motion trajectories

    or vector fields and use Dynamic Time Warping, hidden Markov models [6][37], combinations

thereof [38], or hybrid graph-theoretic methods [39]. Other researchers have addressed activity recognition using methods based on principal component analysis [40], vector basis fields [41],

    or generalized dynamic modeling. Wren and Pentland [42] describe an approach to gesture

recognition based on the use of 3-D dynamic modeling of the human body. Complementary

    research to activity recognition investigates the effects of various features on recognition


    accuracy. Nagaya et al. [43] propose an appearance-based feature for real-time gesture

    recognition from motion images. Campbell et al. [5] experiment with a number of potential

    feature vectors for human activity recognition, including 3-D position, measures of local

curvature, and numerous velocity terms. Their experimental results lend favor to the use of redundant velocity-based feature vectors.

    For a more thorough discussion of modeling, tracking, and recognition for human motion

    analysis, we refer the reader to Gavrila [44], Aggarwal and Cai [45], and the special issue of the

IEEE Transactions on Pattern Analysis and Machine Intelligence on video surveillance [46].

    1.3. Proposed Contribution

    In this paper, we introduce a distributed, real-time computing platform for improving

    feature-based tracking in the presence of articulation and occlusion for the goal of recognition.

    The main contribution of this work is to perform both spatial (between multiple cameras) and

    temporal data integration within a unified framework of 3-D position tracking to provide

    increased robustness to temporary feature point occlusion. In particular, the proposed system

    employs a probabilistic weighting scheme for spatial data integration as a simple Bayesian belief

    network (BBN) [47] with a dynamic, multidimensional topology. As a directed acyclic graph

    (DAG), a Bayesian network demonstrates a natural framework for representing causal,

    inferential relationships between multiple variables (e.g., a 3-D trajectory is always inferred from

    two or more 2-D sources). Moreover, unlike traditional probabilistic weighting schemes [48], a

    BBN is easily extended to multiple modalities due to its inherent semantic structure. Perhaps the

    most exploited aspect, however, is the possibility of topological simplification resulting from

    conditional dependency and independency relationships between multiple network variables. For

    the proposed system, this corresponds to the selective use of multiple views of particular features

    based on measures of spatio-temporal tracking confidence. Our approach is particularly well

    suited as a precursor to generalized motion monitoring, activity recognition, and interaction

analysis systems [49]. To this end, we use multi-view data acquired from an unconstrained home environment. Video captured within a home environment has the advantage of broad

    applicability, where human motion is often significant and unrestricted. Applications range from

    home security to understanding of social and interactive phenomena to monitoring for the

    purposes of home health care and rehabilitation.


    2. Theory

    2.1. System Overview

    The proposed system, as shown in Figure 1, consists of three major components: (i) state-

    based 2-D predictor-corrector filtering for monocular tracking (in the dotted box), (ii) multi-view

    (spatial) data fusion, and (iii) Kalman filtering for 3-D trajectory tracking. The first stage consists

    of video preprocessing including background subtraction, sparse 2-D motion estimation, and

    foreground region clustering. From the estimated 2-D motion, we extract a set of measurements

(observations), $\mathbf{x}[k]$, where $k \geq 0$ is the frame number, for the estimation of the state vector, $\mathbf{s}[k] = \big[\,\mathbf{s}_1[k]\ \ \mathbf{s}_2[k]\ \cdots\ \mathbf{s}_N[k]\,\big]^T$. Here, $\mathbf{s}_m[k]$, $1 \leq m \leq N$, denotes the image coordinates of the $m$th feature point that we wish to track in time, and $N$ is the number of features being tracked on one or more independently moving regions. We also define $\boldsymbol{\gamma}[k]$ as a 3-D state vector that captures both the velocity and position of features in 3-D Cartesian space, where the unknown 3-D feature position is denoted by $\mathbf{y}[k]$. At each frame, $k$, we use the observations, $\mathbf{x}[k]$, in conjunction with the 3-D state estimates, $\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]$, as input to a predictor-corrector filter. The output of the predictor-corrector filter is a state estimate, $\hat{\mathbf{s}}[k]$, with some confidence, $\mathbf{M}[k]$.

[Flow diagram: for the jth view, input video is processed by a predictor-corrector filter whose extracted features (observations) and projections of 3-D features onto the jth imaging plane (predictions) produce corrected 2-D states; the states from all views feed a central Bayesian network that outputs 3-D observations, which a standard Kalman filter turns into corrected 3-D states (feature locations) and the output trajectories of the multiple moving objects.]

Figure 1. System flow diagram.

    The stage of multi-view fusion (indicated as the Bayesian network in Figure 1) performs

    spatial data integration using triangulation, perspective projections, and Bayesian inference. The

input to the Bayesian network is a set of random variables, $\big\{\boldsymbol{\theta}_1[k], \boldsymbol{\theta}_2[k], \ldots, \boldsymbol{\theta}_J[k]\big\}$, where the $j$th element is identical to the output, $\hat{\mathbf{s}}_j[k]$, taken from the predictor-corrector filter corresponding to the $j$th view of the scene. The network defines an unknown subset of $\big\{\boldsymbol{\theta}_1[k], \boldsymbol{\theta}_2[k], \ldots, \boldsymbol{\theta}_J[k]\big\}$,


denoted by $\boldsymbol{\Theta}[k]$, that combine to form yet another random variable, $\boldsymbol{\phi}[k]$, indicative of a vector of 3-D positions. At the output of the BBN are the estimates, $\hat{\boldsymbol{\Theta}}[k]$ and $\hat{\mathbf{y}}[k]$, that maximize the joint density, $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$. We use the notation $\hat{\mathbf{y}}[k]$ instead of $\hat{\boldsymbol{\phi}}[k]$ since the output of the network is actually an estimate of the unknown 3-D position vector, $\mathbf{y}[k]$. Accompanying the 3-D estimate is a noise covariance matrix, $\mathbf{R}[k]$, minimized by the maximization of $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$.

    The final stage of the system uses a Kalman filter to maintain a level of temporal

    smoothness on the vector of 3-D trajectories. The observations for the Kalman filter are the 3-D

estimates, $\hat{\mathbf{y}}[k]$, produced by the Bayesian network. As mentioned previously, the states of the filter are indicated by $\boldsymbol{\gamma}[k]$, which represent both the true velocity and position of the $N$ features in 3-D Cartesian space. The corrected states, $\hat{\boldsymbol{\gamma}}[k\,|\,k]$, at the output of the Kalman filter provide updated estimates of the unknown 3-D position vector. For tracking in the next frame, the algorithm develops a prediction, $\hat{\mathbf{s}}[k+1\,|\,k] = \mathcal{F}\big(\boldsymbol{\Psi}[k+1]\,\hat{\boldsymbol{\gamma}}[k\,|\,k]\big)$, based on a perspective projection, $\mathcal{F}(\cdot)$,

    of the corresponding 3-D prediction. This process introduces a temporally iterative estimation

    algorithm capable of improving the accuracy of both 2-D and 3-D object tracking of features

    proven useful in human motion analysis [5].

    The proposed architecture considers both the fundamental bandwidth constraints as well

    as the computational complexity requirements for various components. Feature tracking and

processing occurs for a single camera ($1 \leq j \leq J$) and, due to the complexity of the required

    motion estimation and spatial clustering routines, is performed using a dedicated processor.

    Rather than burdening a network with the real-time transfer and analysis of video data from J

    views, we perform all tracking and correspondence temporally at each view and then use the

    Bayesian network to perform spatial integration of only the relevant data. Both the Bayesian

    network and the subsequent stage of 3-D Kalman filtering are computationally simple enough (in

comparison to $J$ occurrences of motion-based estimation and segmentation) to coexist on a single

    central processor.

    2.2. Two-Dimensional Feature Tracking

    Throughout this section it is tacitly understood that each equation is applied to data at the

    plane of the jth camera, although we drop the explicit dependence on j with the understanding

    that the steps in question are equally applicable to multiple views of the scene. As in [3], the

    proposed technique performs moving object detection and segmentation using change detection


    and localized, sparse motion estimation over a grid of points, including the semantic features,

    within the foreground of the video sequence. While in [3] we were able to effectively use motion

    estimation and frame differencing to produce filter predictions and observations, respectively,

such an approach is not feasible for the tracking of specific occluded features of moving people. Rather, we must now rely on temporal correspondence to generate observations while using

    feedback from the 3-D feature vector trajectories to develop next state predictions.

The first step involves computation of the state prediction, $\hat{\mathbf{s}}[k\,|\,k-1]$, in the current frame, $k$, according to

$\hat{\mathbf{s}}[k\,|\,k-1] = \mathcal{F}\big(\boldsymbol{\Psi}[k]\,\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]\big)$, (1)

where $\boldsymbol{\Psi}[k]$ is the 3-D state transition matrix and $\mathcal{F}(\cdot)$ is a vector-valued function that maps points in 3-D Cartesian space to a particular imaging plane using a perspective projection. The precise definition of $\boldsymbol{\Psi}[k]$ is given in 2.4. We then compute the 2-D error covariance matrix, $\mathbf{M}[k\,|\,k-1]$, by projecting the corresponding 3-D error covariance matrix, $\boldsymbol{\Sigma}[k\,|\,k-1]$, associated with $\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]$, to each imaging plane as per

$\mathbf{M}[k\,|\,k-1] = E\Big\{\big(\mathbf{s}[k] - \hat{\mathbf{s}}[k\,|\,k-1]\big)\big(\mathbf{s}[k] - \hat{\mathbf{s}}[k\,|\,k-1]\big)^T\Big\} \approx \mathcal{G}\big(\boldsymbol{\Sigma}[k\,|\,k-1]\big)$, (2)

where $\mathcal{G}(\cdot)$ is a transformation of the noise covariance from 3-D to 2-D. The mapping of 3-D data to the imaging plane of each camera, as in (1) and (2), results in a one-step predictor-corrector filter at each frame, $k$. Collectively, the implementation of these functions, $\mathcal{F}(\cdot)$ and $\mathcal{G}(\cdot)$, parallels that of a transformation of random variables, $\mathcal{H}_j(\cdot)$, which we fully define and illustrate in 2.3.
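To make the two mappings concrete, the following sketch (not part of the original system) implements a perspective projection and a first-order covariance projection for a single feature, assuming a calibrated pinhole camera described by a 3x4 projection matrix; the Jacobian-based linearization is only one possible realization of $\mathcal{G}(\cdot)$, not necessarily the authors' choice.

```python
import numpy as np

def project_point(P, X):
    """F(.): map a 3-D point X (3,) to image coordinates (2,) via a 3x4 matrix P."""
    Xh = np.append(X, 1.0)                   # homogeneous 3-D point
    u = P @ Xh
    return u[:2] / u[2]

def project_covariance(P, X, Sigma3d):
    """G(.): first-order propagation of a 3-D covariance to the image plane,
    using the Jacobian of the perspective projection evaluated at X."""
    Xh = np.append(X, 1.0)
    u = P @ Xh
    J = np.zeros((2, 3))
    for i in range(2):                       # d(u_i/u_2)/dX
        J[i] = (P[i, :3] * u[2] - P[2, :3] * u[i]) / u[2] ** 2
    return J @ Sigma3d @ J.T

if __name__ == "__main__":
    # toy camera: focal length 500, principal point (320, 240), identity pose
    P = np.array([[500., 0., 320., 0.],
                  [0., 500., 240., 0.],
                  [0., 0., 1., 0.]])
    X = np.array([0.2, -0.1, 3.0])           # a 3-D feature position
    Sigma3d = np.diag([0.01, 0.01, 0.04])    # its 3-D error covariance
    print(project_point(P, X))               # predicted 2-D location, cf. (1)
    print(project_covariance(P, X, Sigma3d)) # projected 2-D covariance, cf. (2)
```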

The next step involves the computation of the gain matrix, $\mathbf{K}[k]$, which determines the magnitude of the correction at the $k$th frame. The operation relies on $\mathbf{M}[k\,|\,k-1]$ and a noise covariance matrix, $\mathbf{C}[k]$, which describes the distribution of the assumed Gaussian observation noise. Using a probabilistic weighting scheme, similar to that proposed in [3], we introduce a matrix that captures the confidence associated with the temporal correspondence measurements (i.e., observations, $\mathbf{x}[k]$) for each feature. For this task, we classify the correspondence of various features and their neighboring pixels into three basic classes:

Class A: The element is visible in the previous frame and presumably in the current frame, as there exists a strong temporal correspondence;


Class B: The element is visible in the previous frame but presumably not in the current frame, as there exists only a weak temporal correspondence; and

Class C: The element is not visible in the previous frame.

The qualification of a feature as having a strong or weak temporal correspondence is based on well-established criteria in the motion estimation and optical flow literature [50]. Since, in the

    interest of computational complexity, the proposed system estimates only the forward motion,

    we ignore the case where a previously occluded feature becomes revealed in the next frame; this

is considered a Class C correspondence.

Let us refer to the $i$th motion vector in the neighborhood of some feature point, $\mathbf{s}_m[k]$, as $\mathbf{v}_i[k]$ and the origin of this vector in the previous frame (in image coordinates) as $\tilde{\mathbf{v}}_i[k]$. For convenience, we introduce the notation $A$, $B$, and $C$ to represent the sets of all points, $\tilde{\mathbf{v}}_i[k]$ and $\mathbf{s}_m[k-1]$, that are classified as Class A, Class B, and Class C, respectively. It is assumed that

    the union of these three sets describes a single, arbitrary region. For the proposed contribution,

    this corresponds to an entire moving person, although the technique is equally applicable to

    single body parts, groups of moving people, or even generic moving regions, depending on the

    segmentation of the foreground.

We construct $\mathbf{C}[k]$ in the traditional manner by allowing the matrix to be representative

    of the time-varying differences between observations and corrected predictions. We then

    compose a gain matrix according to

$\mathbf{K}[k] = \mathbf{M}[k\,|\,k-1]\,\mathbf{H}^T[k]\Big(\mathbf{W}[k]\,\mathbf{C}[k]\,\mathbf{W}^T[k] + \mathbf{H}[k]\,\mathbf{M}[k\,|\,k-1]\,\mathbf{H}^T[k]\Big)^{-1}$, (3)

where $\mathbf{H}[k] = \mathbf{I}$ represents the linear observation matrix and

$\mathbf{W}[k] \triangleq \Big(\mathrm{diag}\big\{\omega_1(k)\ \ \omega_2(k)\ \cdots\ \omega_N(k)\big\}\Big)^{-1/2}$, (4)

$\omega_m(k) = \begin{cases} 1, & \mathbf{s}_m[k-1] \in A, \\ \big(|A|\,D_m(k)\big)^{-1}\displaystyle\sum_{i \in A}\big(D_m(k) - d_m(k,i)\big), & \mathbf{s}_m[k-1] \in B, \\ 0, & \mathbf{s}_m[k-1] \in C, \end{cases}$ (5)

and $d_m(k,i) \triangleq \big\|\tilde{\mathbf{v}}_i[k] - \mathbf{s}_m[k-1]\big\|$ is the distance between the $m$th feature and the $i$th observable motion vector originating from a particular moving region. Here, $D_m(k) \triangleq \max_i\{d_m(k,i)\}$ indicates the maximum distance between a particular feature and the observable motion vectors used to produce its temporal correspondence. The observations are indicated by


$\mathbf{x}_m[k] = \mathbf{s}_m[k-1] + \Big(\displaystyle\sum_{i \in A}\big(D_m(k) - d_m(k,i)\big)\Big)^{-1}\displaystyle\sum_{i \in A}\big(D_m(k) - d_m(k,i)\big)\,\mathbf{v}_i[k]$, (6)

where $d_m(k,i)$, $\mathbf{v}_i[k]$, and $\tilde{\mathbf{v}}_i[k]$ are illustrated more clearly in Figure 2.
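The construction of the occlusion-weighted observation can be sketched as follows; the code mirrors the reconstructed forms of (5) and (6) above, so the exact normalization of the Class B weight should be read as an assumption rather than the authors' definition.

```python
import numpy as np

def observation_for_feature(s_prev, own_vector, origins, vectors, labels):
    """Occlusion-weighted observation x_m[k] and confidence omega_m(k).

    s_prev     : (2,) feature position s_m[k-1]
    own_vector : (2,) the feature's own motion estimate, or None if occluded
    origins    : (n,2) origins of the neighbouring motion vectors in frame k-1
    vectors    : (n,2) the neighbouring motion vectors v_i[k]
    labels     : per-vector class, each 'A', 'B', or 'C'
    """
    if own_vector is not None:                       # Class A: trivial correspondence
        return s_prev + own_vector, 1.0
    A = [i for i, c in enumerate(labels) if c == 'A']
    if not A:                                        # Class C: nothing observable nearby
        return s_prev.copy(), 0.0
    d = np.linalg.norm(origins[A] - s_prev, axis=1)  # d_m(k, i)
    D = d.max()                                      # D_m(k)
    if D == 0.0:                                     # degenerate: neighbours coincide
        return s_prev + vectors[A].mean(axis=0), 1.0
    omega = float((D - d).sum() / (len(A) * D))      # Class B confidence, cf. (5)
    w = D - d                                        # proximity weights
    if w.sum() == 0.0:                               # single neighbour at distance D
        w = np.ones_like(w)
    x = s_prev + (w @ vectors[A]) / w.sum()          # weighted observation, cf. (6)
    return x, omega

if __name__ == "__main__":
    s_prev = np.array([100.0, 50.0])
    origins = np.array([[102.0, 51.0], [95.0, 48.0], [120.0, 70.0]])
    vectors = np.array([[1.5, 0.5], [1.0, 0.0], [4.0, -1.0]])
    print(observation_for_feature(s_prev, None, origins, vectors, ['A', 'A', 'A']))
```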

Upon inspecting (3) through (6), one may make a number of observations. First, a completely visible feature that has a seemingly accurate motion estimate is tracked in the usual

    way; this represents the trivial correspondence problem with no apparent occlusion. An occluded

    feature, on the other hand, develops a weighted least-squares estimate of its trajectory using

neighboring, observable motion vectors. The confidence of this estimate is encoded in $\mathbf{W}[k]$, where both the visibility and the proximity of the neighboring trajectories are taken into account. We draw the reader's attention to the fact that nothing can be said about the optimality of the

    weighted least-squares estimate unless something more is known regarding the distribution of the

    bordering motion vector estimates. In the next section, we address this issue by introducing a

    weighting scheme that takes the form of a Bayesian network with a dynamic topology.

[Illustration: two interacting foreground objects in frames k-1 and k; the ith motion vector, its origin in the previous frame, and the distance d_m(k,i) from an occluded feature to a neighboring visible vector are labeled.]

Figure 2. Occlusion modeling. Here we show various feature tracking scenarios in the presence of interacting foreground objects (i.e., two moving people). When a feature is occluded, the temporal trajectory is estimated using neighboring, but visible, motion vectors.

    For the sake of completeness, we provide the remaining steps of the filtering procedure.

These equations follow those of the standard Kalman filter and are indicated by

$\hat{\mathbf{s}}[k\,|\,k] = \hat{\mathbf{s}}[k\,|\,k-1] + \mathbf{K}[k]\big(\mathbf{x}[k] - \mathbf{H}[k]\,\hat{\mathbf{s}}[k\,|\,k-1]\big)$ (7)

and

$\mathbf{M}[k\,|\,k] = \big(\mathbf{I} - \mathbf{K}[k]\,\mathbf{H}[k]\big)\mathbf{M}[k\,|\,k-1]$. (8)


    The two equations represent the state correction based on the gain matrix and the update of the

    error covariance matrix, respectively.
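A minimal sketch of the resulting one-step update for a single view is given below, assuming $\mathbf{H}[k] = \mathbf{I}$ and the diag(omega)^(-1/2) interpretation of $\mathbf{W}[k]$ carried over from the reconstruction of (4); it is an illustration of the filtering equations rather than the authors' implementation.

```python
import numpy as np

def correct_2d(s_pred, M_pred, x, C, omega, eps=1e-6):
    """One predictor-corrector update per view, following (3), (7), and (8).

    s_pred : (2N,) predicted state s[k|k-1] (image coordinates of N features)
    M_pred : (2N,2N) predicted error covariance M[k|k-1]
    x      : (2N,) observations x[k]
    C      : (2N,2N) observation noise covariance C[k]
    omega  : (N,) per-feature confidences omega_m(k) in [0, 1]
    """
    w = np.repeat(np.asarray(omega, dtype=float), 2)   # one weight per coordinate
    W = np.diag(1.0 / np.sqrt(np.maximum(w, eps)))     # low confidence inflates noise
    S = W @ C @ W.T + M_pred                           # innovation covariance, H = I
    K = M_pred @ np.linalg.inv(S)                      # gain, (3)
    s_corr = s_pred + K @ (x - s_pred)                 # state correction, (7)
    M_corr = (np.eye(len(s_pred)) - K) @ M_pred        # covariance update, (8)
    return s_corr, M_corr

if __name__ == "__main__":
    s_pred = np.array([100.0, 50.0, 200.0, 80.0])      # two features
    M_pred = np.eye(4) * 4.0
    x = np.array([103.0, 51.0, 201.0, 79.0])
    C = np.eye(4) * 2.0
    omega = [1.0, 0.2]                                 # second feature mostly occluded
    s_corr, _ = correct_2d(s_pred, M_pred, x, C, omega)
    print(s_corr)                                      # occluded feature moves less toward x
```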

    2.3. Spatial Integration

    The spatial integration is performed for each feature point independently. The followingtreatment addresses spatial integration for the mth feature point, although the index m has been

omitted for ease of notation. We introduce random variables, $\boldsymbol{\theta}_j[k] \triangleq \hat{\mathbf{s}}_j[k] = \mathbf{s}_j[k] + \mathbf{n}_{s_j}[k]$ and $\boldsymbol{\phi}[k] \triangleq \mathbf{y}[k] + \mathbf{n}_y[k]$, where $\mathbf{n}_{s_j}[k]$ represents some zero-mean distribution of estimation error on the imaging plane corresponding to the $j$th view. Similarly, $\mathbf{n}_y[k]$ characterizes the zero-mean 3-D reconstruction error inherent in $\boldsymbol{\phi}[k]$. Let $\boldsymbol{\theta}_j[k]$, $j \in [1, J]$, indicate a family of random variables defined by the vector of probability density functions (PDF),

$P_{\theta_j}\big[\boldsymbol{\theta}_j[k]\big] = (2\pi)^{-\frac{N}{2}}\,\big|\mathbf{M}_j[k]\big|^{-\frac{1}{2}}\exp\Big\{-\tfrac{1}{2}\big(\boldsymbol{\theta}_j[k] - \hat{\mathbf{s}}_j[k]\big)^T\mathbf{M}_j^{-1}[k]\big(\boldsymbol{\theta}_j[k] - \hat{\mathbf{s}}_j[k]\big)\Big\}$, (9)

where $\mathbf{M}_j[k]$ contains the confidence of the $m$th component of the estimate on the $j$th view. Under the assumption that the $m$th feature may be occluded in some views, the algorithm uses $1 < K \leq J$ views of the scene to reconstruct an estimate of $\mathbf{y}[k]$. We indicate the set of random variables used in reconstruction by

$\boldsymbol{\Theta}[k; K] \triangleq \big\{\boldsymbol{\theta}_{j_1}[k],\ \boldsymbol{\theta}_{j_2}[k],\ \ldots,\ \boldsymbol{\theta}_{j_K}[k]\big\} \subseteq \big\{\boldsymbol{\theta}_1[k],\ \ldots,\ \boldsymbol{\theta}_J[k]\big\}$, (10)

where $1 \leq j_n \leq J$ indicates one of $K$ views upon which $\mathbf{y}[k]$ might be dependent. For the sake of notational simplicity, we drop the explicit dependence of these variables on the frame number, $k$.

The proposed Bayesian network obtains an estimate of $\mathbf{y}[k]$ and the proper subset $\boldsymbol{\Theta}[k; K]$ by calculating

$\big(\hat{\mathbf{y}},\ \hat{\boldsymbol{\Theta}}\big) = \arg\max_{\{\boldsymbol{\phi},\,\boldsymbol{\Theta}\}} P_{\phi,\Theta}\big[\boldsymbol{\phi},\ \boldsymbol{\Theta}\big]$, (11)

where, according to Bayes' rule,

$P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J\big]\,P\big[\boldsymbol{\theta}_1\,\big|\,\boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J\big]\cdots P\big[\boldsymbol{\theta}_{J-1}\,\big|\,\boldsymbol{\theta}_J\big]\,P\big[\boldsymbol{\theta}_J\big]$. (12)

Representing (12) as a causal network, our task is reduced to finding a topology, $\{T, \boldsymbol{\Theta}\}$, such that $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ is maximized. In (12), $P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_J\big]$ models 3-D reconstruction noise, $P\big[\boldsymbol{\theta}_j\big]$ models 2-D observation noise, and the remaining conditional densities model the effects of occlusion and correlation between various views of the scene. We can contrast the model of (12) to that of a more simplistic approach, assuming $J$ independent sources, where


$\{T, \boldsymbol{\Theta}\}:\ \ P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_J\big]\prod_{j=1}^{J} P\big[\boldsymbol{\theta}_j\big]$. (13)

The latter model works well for a small number of variables, but (if the variables are correlated) becomes less accurate as $J$ increases. The generalized topologies of (12) and (13) are illustrated

    in Figure 3. The remaining modifications to the network consist of defining a topological

    ordering of the nodes, removing nonexistent dependencies (i.e., edges), and estimating the

    conditional probabilities for the remaining I-map structure.

[Diagram: two candidate topologies over the view nodes 1 through J and the 3-D node; in (B) the jth node contains (J-j+1) divergent paths and (j-1) convergent paths.]

Figure 3. Bayesian network topologies. The network shown in (A) represents a basic BBN with a convergent topology at a child node, indicative of independent sources, while that in (B) shows the more general case in which the correlation between views is considered.

    To determine a topological ordering of the nodes, we must take into consideration the

    relative significance of various data sources. With two cameras there is no ordering, as both

    views are equally necessary for 3-D reconstruction. However, the case of two cameras does not

    provide a means to deal with the possibility of occlusion; hence it is not of interest to us. For

    more views, we need relative measures of importance, most likely determined by redundant

    scene content. If we assume uniformly distributed scene content, then by moving the cameras

    further apart we increase the total probability of having multiple unobstructed views of one or

    more features. Herewith, there is also an implicit upper bound on camera separation in order to

    maintain a well-defined stereo matching problem for feature initialization.

Using the $\binom{J}{2}$ fundamental matrices for the system, based on the static content associated with each view, we define a probabilistic measure, $0 \leq l_{ij} \leq 1$. This probability portrays the ratio of points seen from the $i$th view that have some correspondence in the $j$th view to the total number of pixels in the $i$th view. Within a graph-theoretic framework, this quantity states that an


inferential dependency exists between these nodes such that one might conjecture that $\boldsymbol{\theta}_i \rightarrow \boldsymbol{\theta}_j$ with a confidence of $l_{ij}$. Two identical views of a scene are indicated by $l_{ij} = 1$, while $l_{ij} = 0$ would suggest two views containing no points in common. Note that the converse of the former statement does not necessarily follow. With fixed cameras, such a matrix, $\mathbf{L} = \{l_{ij}\}$, is easily populated with arbitrary levels of precision using automatic, semi-automatic, or even manual assessment. This dependence on camera position is illustrated in Figure 4. For a system consisting of $J$ cameras, the nodal ordering is specified by

$\{T, \boldsymbol{\Theta}\}:\ \ T = \big\{\eta_1,\ \eta_2,\ \ldots,\ \eta_{J+1}\big\}$, (14)

where the nodes are placed in descending order of confidence from $\eta_1$, which indicates the root of the BBN, to $\eta_{J+1}$, which has no dependencies. This topological ordering (14) varies for each

feature being tracked and, therefore, must consider both the confidence of temporal tracking and the correlation between views of the scene. For the $m$th feature in the $j_n$th view of the scene we have a confidence metric

$\varsigma_{mj_n}[k] = \omega_m^{j_n}(k)\,\frac{1}{J}\sum_{i=1,\,i \neq j_n}^{J}\big(1 - l_{j_n i}\big)$, (15)

where $\omega_m(k)$ for an arbitrary view is given in (5). The random variable for the $n$th node, $\eta_n$, in the network is equal to $\boldsymbol{\theta}_{j_n}$, where the views of the scene are ranked according to

$\varsigma_{mj_1}[k] \geq \varsigma_{mj_2}[k] \geq \cdots \geq \varsigma_{mj_K}[k]$. (16)

Figure 4. Three-dimensional view of a scene. The correlation between views j1 and j3 is higher than that between j1 and j2. It is expected, then, that occluded content in j1 has a higher probability of remaining occluded in j3 than in j2.

As an example for a single, arbitrary feature, consider a four-camera system in which we have $\boldsymbol{\omega}_m(k) = [\,0.72\ \ 0.51\ \ 1.00\ \ 0.39\,]$ and a matrix of camera correlation data


$\mathbf{L} = \begin{bmatrix} 1.00 & 0.86 & 0.90 & 0.52 \\ 0.42 & 1.00 & 0.73 & 0.36 \\ 0.92 & 0.71 & 1.00 & 0.57 \\ 0.94 & 0.56 & 0.78 & 1.00 \end{bmatrix}$. (17)

The corresponding confidence vector is given by $\boldsymbol{\varsigma}_m[k] = [\,0.13\ \ 0.19\ \ 0.20\ \ 0.07\,]$, where the resulting topological ordering, $T = \{\boldsymbol{\phi}, \boldsymbol{\theta}_3, \boldsymbol{\theta}_2, \boldsymbol{\theta}_1, \boldsymbol{\theta}_4\}$, is demonstrated graphically in Figure 5. For an arbitrary feature, this arrangement suggests that one should expect that data obtained from $\boldsymbol{\theta}_2$ is likely, on average, to contain more unique scene content than that from $\boldsymbol{\theta}_1$ or $\boldsymbol{\theta}_4$. Alternatively, this might be read as: given that we have an estimated feature vector in the second view, it is 58% or 64% likely ($1 - l_{2j}$) that the first or fourth views, respectively, provide additional information. Of course, information in this context is taken for the case of occlusion. The goal of this implementation is to lessen the adverse effects of observation and reconstruction error by jointly considering camera position and temporal tracking confidence.
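The following short sketch recomputes this example with the confidence metric as reconstructed in (15); it reproduces the quoted confidence vector and the view ordering shown in Figure 5.

```python
import numpy as np

def view_confidences(omega_m, L):
    """Confidence of each view for one feature, cf. (15), and the ranking of (16).

    omega_m : (J,) temporal tracking confidences of the m-th feature, one per view
    L       : (J,J) camera-correlation matrix with entries l_ij
    """
    J = len(omega_m)
    conf = np.empty(J)
    for j in range(J):
        others = [i for i in range(J) if i != j]
        conf[j] = omega_m[j] * np.sum(1.0 - L[j, others]) / J
    order = np.argsort(-conf)              # descending confidence
    return conf, order

if __name__ == "__main__":
    omega_m = np.array([0.72, 0.51, 1.00, 0.39])
    L = np.array([[1.00, 0.86, 0.90, 0.52],
                  [0.42, 1.00, 0.73, 0.36],
                  [0.92, 0.71, 1.00, 0.57],
                  [0.94, 0.56, 0.78, 1.00]])
    conf, order = view_confidences(omega_m, L)
    print(np.round(conf, 2))               # -> [0.13 0.19 0.2  0.07]
    print(order + 1)                       # -> [3 2 1 4], the ordering of Figure 5
```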

Figure 5. Four-camera BBN. The illustrated network demonstrates a dynamic topology of $T = \{\boldsymbol{\phi}, \boldsymbol{\theta}_3, \boldsymbol{\theta}_2, \boldsymbol{\theta}_1, \boldsymbol{\theta}_4\}$. This configuration is valid for only the $m$th feature at the $k$th frame.

    To calculate (11), we must define the three fundamental densities provided in (12). In

    addition to the a priori distribution at each imaging plane, as indicated by (9), we require density

functions for $\boldsymbol{\phi}$ and $\boldsymbol{\theta}_j$, each conditioned upon a collection of other views. Let us refer to this collection of random variables as $\boldsymbol{\Omega} \triangleq \{\boldsymbol{\theta}_{j_1}, \boldsymbol{\theta}_{j_2}, \ldots, \boldsymbol{\theta}_{j_B}\}$ such that $[\,\boldsymbol{\theta}_{j_1}, \ldots, \boldsymbol{\theta}_{j_B}\,] \subseteq [\,\boldsymbol{\theta}_{j_1}, \ldots, \boldsymbol{\theta}_{j_K}\,]$. We introduce two transformation functions, $\mathcal{B}(\cdot)$ and $\mathcal{H}_j(\cdot)$, where the first function maps a collection of random variables, $\boldsymbol{\Omega}$, from 2-D to 3-D, while the latter transformation projects a random variable, $\boldsymbol{\phi}$, from 3-D to the imaging plane of the $j$th camera. These functions are

    analogous to those used in error propagation within the 3-D reconstruction literature [51]. Due to

    the inherent complexity of these transformations, however, it is not always possible to provide a

    simple, analytical representation of the noise propagation.


    To estimate the transformed distributions, we first assume a noise model for the

reconstructed and projected density functions. For density functions transformed by $\mathcal{B}(\cdot)$, we assume a 3-D Gaussian distribution, $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, while for those functions transformed by $\mathcal{H}_j(\cdot)$, we assume a 2-D distribution of $\mathcal{N}(\boldsymbol{\mu}_j, \mathbf{U}_j)$. These distributions are calculated using the notion of random sampling. For $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, we choose one point at random (distributed according to $P_{\theta_{j_1}}$, $P_{\theta_{j_2}}$, etc.) on each of the imaging planes corresponding to $\boldsymbol{\Omega}$. These $B$ points are then triangulated in the standard way [52] to form a random observation in 3-D Cartesian space. We continue this sampling process until enough observations exist to estimate $\boldsymbol{\mu}$ and $\mathbf{R}$ with sufficient confidence. A similar sampling procedure is used to estimate $\boldsymbol{\mu}_j$ and $\mathbf{U}_j$, where points are selected at random in 3-D according to $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$, which also uses random samples from the $j$th view, and then projected onto the $j$th imaging plane. The random 3-D positions will project to a random distribution on the imaging plane that provides a set of observations for estimating $\boldsymbol{\mu}_j$ and $\mathbf{U}_j$. We illustrate the random sampling process for constructing $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$ and $\mathcal{N}(\boldsymbol{\mu}_j, \mathbf{U}_j)$ in Figure 6.

[Diagram: random samples drawn on the imaging planes j1, j2, and j3 according to the per-view densities are triangulated into 3-D observations for estimating (mu, R) (left); random 3-D samples drawn from N(mu, R) are projected onto the jth imaging plane to obtain 2-D observations for estimating (mu_j, U_j) (right).]

Figure 6. Random sampling process. The diagram on the left shows the generation of 3-D observations given random samples of $P_{\theta_{j_1}}[k]$, $P_{\theta_{j_2}}[k]$, and $P_{\theta_{j_3}}[k]$, while that on the right demonstrates the creation of 2-D observations given random samples with a $\mathcal{N}(\boldsymbol{\mu}, \mathbf{R})$ distribution.
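The sampling procedure of Figure 6 can be sketched as follows, assuming calibrated pinhole cameras with known 3x4 projection matrices and a standard linear (DLT) triangulation; the particular triangulation routine and camera model are assumptions, since the paper only refers to the standard method [52].

```python
import numpy as np

rng = np.random.default_rng(0)

def triangulate(Ps, pts2d):
    """Linear (DLT) triangulation of one point seen in several views."""
    rows = []
    for P, (u, v) in zip(Ps, pts2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    u = P @ np.append(X, 1.0)
    return u[:2] / u[2]

def sample_3d_density(Ps, s_hats, Ms, n=500):
    """Estimate (mu, R) by triangulating random per-view samples."""
    samples = np.array([
        triangulate(Ps, [rng.multivariate_normal(s, M) for s, M in zip(s_hats, Ms)])
        for _ in range(n)])
    return samples.mean(axis=0), np.cov(samples.T)

def sample_2d_density(P_j, mu, R, n=500):
    """Estimate (mu_j, U_j) by projecting random 3-D samples onto view j."""
    samples = np.array([project(P_j, rng.multivariate_normal(mu, R)) for _ in range(n)])
    return samples.mean(axis=0), np.cov(samples.T)

if __name__ == "__main__":
    # two toy cameras observing a point close to the origin
    K = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
    P1 = K @ np.hstack([np.eye(3), [[0.], [0.], [4.]]])
    P2 = K @ np.hstack([np.eye(3), [[-1.], [0.], [4.]]])
    X_true = np.array([0.1, -0.2, 0.3])
    s_hats = [project(P1, X_true), project(P2, X_true)]
    Ms = [np.eye(2) * 2.0, np.eye(2) * 2.0]
    mu, R = sample_3d_density([P1, P2], s_hats, Ms)
    mu_1, U_1 = sample_2d_density(P1, mu, R)
    print(np.round(mu, 3), np.round(mu_1, 1))
```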

Starting with the two most confident views of the scene, corresponding to $\eta_1$ and $\eta_2$ in the network, we construct $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ under the assumption that a conditional independence exists,

$P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\Theta}\big]\,P\big[\boldsymbol{\Theta}\big] = P\big[\boldsymbol{\phi}\,\big|\,\boldsymbol{\Theta}\big]\,P\big[\boldsymbol{\theta}_{j_1}\big]\,P\big[\boldsymbol{\theta}_{j_2}\big]$, (18)

where $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_{j_1}, \boldsymbol{\theta}_{j_2}\}$. To calculate (18), we introduce


$P\big[\boldsymbol{\phi}[k]\,\big|\,\boldsymbol{\Theta}[k]\big] = (2\pi)^{-\frac{3N}{2}}\,\big|\mathbf{R}[k]\big|^{-\frac{1}{2}}\exp\Big\{-\tfrac{1}{2}\big(\boldsymbol{\phi}[k] - \boldsymbol{\mu}[k]\big)^T\mathbf{R}^{-1}[k]\big(\boldsymbol{\phi}[k] - \boldsymbol{\mu}[k]\big)\Big\}$ (19)

and

$P\big[\boldsymbol{\theta}_j[k]\,\big|\,\boldsymbol{\Omega}[k]\big] = (2\pi)^{-\frac{N}{2}}\,\big|\mathbf{U}_j[k]\big|^{-\frac{1}{2}}\exp\Big\{-\tfrac{1}{2}\big(\boldsymbol{\theta}_j[k] - \boldsymbol{\mu}_j[k]\big)^T\mathbf{U}_j^{-1}[k]\big(\boldsymbol{\theta}_j[k] - \boldsymbol{\mu}_j[k]\big)\Big\}$, (20)

    where we show the dependence of these equations on time, k, for completeness. The

    maximization of (18) is trivial due to the normal distribution of each variable. The solution

parallels that of a weighted least-squares problem, where $\boldsymbol{\Theta}$ are the observations and the a priori

    density in (9) is analogous to the weighting factor for a particular observation. We iteratively

    modify the topology of the graph, adding nodes of successively lower confidence, until the

    dynamic topological ordering satisfies (11). By considering nodes in a highest confidence first

    (HCF) manner, the iterations are encouraged to converge quickly, and typically to a global

maximum. Since $\boldsymbol{\Theta}$ is finite, however, guaranteed absolute convergence to the maximum of $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ is possible by simply testing all possible input combinations. This may become computationally prohibitive, however, depending on the number of views and features. The resulting estimate of 3-D feature position, $\hat{\mathbf{y}}[k]$, and corresponding noise covariance, $\mathbf{R}[k]$, are the input to a Kalman filter that encourages a 3-D trajectory with temporal continuity.
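The highest-confidence-first search can be sketched as a simple loop over candidate subsets of views; the joint-density evaluation is abstracted behind a callable (a toy Gaussian model below), since in the actual system it would be built from the sampled densities of (9), (19), and (20).

```python
import numpy as np

def hcf_fusion(view_ids, confidences, joint_density):
    """Add views in descending order of confidence and keep the subset whose
    maximized joint density P_phi,Theta is largest (at least two views needed)."""
    order = [v for _, v in sorted(zip(confidences, view_ids), reverse=True)]
    best_p, best_subset, best_y = -np.inf, None, None
    for k in range(2, len(order) + 1):
        subset = order[:k]
        p, y = joint_density(subset)
        if p > best_p:
            best_p, best_subset, best_y = p, subset, y
    return best_subset, best_y, best_p

if __name__ == "__main__":
    # toy model: each view contributes a noisy 3-D estimate with a known variance;
    # fusing an extra low-confidence view eventually stops improving the density
    estimates = {1: np.array([0.0, 0.0, 2.0]), 2: np.array([0.02, -0.01, 2.05]),
                 3: np.array([0.4, 0.3, 2.6])}
    variances = {1: 0.01, 2: 0.02, 3: 0.5}

    def toy_density(subset):
        w = np.array([1.0 / variances[v] for v in subset])
        pts = np.array([estimates[v] for v in subset])
        y = (w[:, None] * pts).sum(axis=0) / w.sum()       # weighted least squares
        resid = ((pts - y) ** 2).sum(axis=1)
        log_p = -0.5 * (w * resid).sum() - 0.5 * np.log(1.0 / w).sum()
        return log_p, y

    print(hcf_fusion([1, 2, 3], [0.20, 0.19, 0.07], toy_density))
```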

    2.4. Temporal Integration

    The state-based feature vector represents both the location and velocity of points in 3-D

    and is indicated by

$\boldsymbol{\gamma}[k] = \big[\,\boldsymbol{\gamma}_1[k]\ \ \boldsymbol{\gamma}_2[k]\ \ \cdots\ \ \boldsymbol{\gamma}_{2N}[k]\,\big]^T$, (21)

where $\boldsymbol{\gamma}_m[k]$, $m \leq N$, indicates the ideal 3-D position of the $m$th feature within the $k$th frame and

$\boldsymbol{\gamma}_m[k] = \dot{\boldsymbol{\gamma}}_b[k], \qquad b = m - N,\ \ m > N$. (22)

An estimate of the 3-D feature vector in (21) is denoted by $\hat{\boldsymbol{\gamma}}[k]$, where we use a dynamic model for $\boldsymbol{\Psi}[k]$ with constant velocity and linear 3-D displacement such that

$\hat{\boldsymbol{\gamma}}[k\,|\,k-1] = \boldsymbol{\Psi}[k]\,\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1] = \big[\,\hat{\boldsymbol{\gamma}}_1[k]\ \ \hat{\boldsymbol{\gamma}}_2[k]\ \ \cdots\ \ \hat{\boldsymbol{\gamma}}_{2N}[k]\,\big]^T$, (23)

where

$\hat{\boldsymbol{\gamma}}_m[k] = 2\,\hat{\boldsymbol{\gamma}}_m[k-1\,|\,k-1] - \hat{\boldsymbol{\gamma}}_m[k-2\,|\,k-2]$. (24)

We develop an error covariance matrix, $\boldsymbol{\Sigma}[k\,|\,k-1]$, that depicts our confidence in the predictions of the state estimates. The update equation is indicated by


$\boldsymbol{\Sigma}[k\,|\,k-1] = \boldsymbol{\Psi}[k]\,\boldsymbol{\Sigma}[k-1\,|\,k-1]\,\boldsymbol{\Psi}^T[k] + \mathbf{Q}[k]$, (25)

where $\mathbf{Q}[k]$ represents a Gaussian noise covariance matrix which is iteratively modified over time to account for the deviations between the predictions and corrections of the state estimates. The system then constructs a Kalman gain matrix, $\mathbf{D}[k]$, as follows:

$\mathbf{D}[k] = \boldsymbol{\Sigma}[k\,|\,k-1]\,\boldsymbol{\Lambda}^T[k]\Big(\tfrac{1}{2}\big\{\mathbf{R}[k] + \boldsymbol{\Xi}[k]\big\} + \boldsymbol{\Lambda}[k]\,\boldsymbol{\Sigma}[k\,|\,k-1]\,\boldsymbol{\Lambda}^T[k]\Big)^{-1}$, (26)

where $\boldsymbol{\Lambda}[k] = [\,\mathbf{I}_N\ \ \mathbf{0}_N\,]$ indicates the linear observation matrix and $\boldsymbol{\Xi}[k]$ is a recursively updated observation noise covariance matrix. The remaining steps of the three-dimensional trajectory filtering include

$\hat{\boldsymbol{\gamma}}[k\,|\,k] = \hat{\boldsymbol{\gamma}}[k\,|\,k-1] + \mathbf{D}[k]\big(\hat{\mathbf{y}}[k] - \boldsymbol{\Lambda}[k]\,\hat{\boldsymbol{\gamma}}[k\,|\,k-1]\big)$ (27)

and

$\boldsymbol{\Sigma}[k\,|\,k] = \big(\mathbf{I} - \mathbf{D}[k]\,\boldsymbol{\Lambda}[k]\big)\boldsymbol{\Sigma}[k\,|\,k-1]$. (28)

These equations represent the state and noise covariance update equations for $\boldsymbol{\gamma}[k]$ and $\boldsymbol{\Sigma}[k]$, respectively.
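A compact sketch of one step of this 3-D trajectory filter is given below, assuming a unit frame interval and the constant-velocity transition implied by (23)-(24); averaging R[k] and Xi[k] in the gain follows the reconstruction of (26) and should be treated as an assumption rather than the authors' exact formulation.

```python
import numpy as np

def kalman_3d_step(gamma, Sigma, y_obs, R, Xi, Q):
    """One 3-D Kalman iteration over (25)-(28) for N features.

    gamma : (6N,) corrected state from frame k-1 (3N positions, then 3N velocities)
    Sigma : (6N,6N) its error covariance
    y_obs : (3N,) fused 3-D observation y_hat[k] from the Bayesian network
    R, Xi : (3N,3N) observation noise covariances
    Q     : (6N,6N) process noise covariance
    """
    n = len(y_obs)
    I, Z = np.eye(n), np.zeros((n, n))
    Psi = np.block([[I, I], [Z, I]])                 # constant-velocity transition
    Lam = np.hstack([I, Z])                          # observation matrix Lambda[k]
    g_pred = Psi @ gamma                             # state prediction
    S_pred = Psi @ Sigma @ Psi.T + Q                 # covariance prediction, (25)
    S_obs = 0.5 * (R + Xi) + Lam @ S_pred @ Lam.T
    D = S_pred @ Lam.T @ np.linalg.inv(S_obs)        # gain, (26)
    g_corr = g_pred + D @ (y_obs - Lam @ g_pred)     # state correction, (27)
    S_corr = (np.eye(2 * n) - D @ Lam) @ S_pred      # covariance update, (28)
    return g_corr, S_corr

if __name__ == "__main__":
    gamma = np.array([0.0, 0.0, 2.0, 0.1, 0.0, 0.0])  # one feature: position + velocity
    Sigma = np.eye(6) * 0.1
    y_obs = np.array([0.12, 0.01, 2.02])
    R = Xi = np.eye(3) * 0.05
    Q = np.eye(6) * 0.01
    print(kalman_3d_step(gamma, Sigma, y_obs, R, Xi, Q))
```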

    2.5. Algorithm Procedure

    For convenience, we summarize the overall flow of the algorithm. Following the initial

    process of camera calibration and the estimation of the redundancy matrix, L, semantic features

are chosen manually between $J$ views of the scene to initialize the predictor-corrector filters used for tracking. Then, for an arbitrary frame, k, we have the following procedure:

At each imaging plane:

1. Estimate the sparse motion between frames k and k-1 and segment the foreground into

    moving regions, as in [3].

2. Update $\mathbf{H}[k]$ and $\mathbf{C}[k]$; using (1) through (8) and the projections of $\boldsymbol{\Psi}[k]\,\hat{\boldsymbol{\gamma}}[k-1\,|\,k-1]$ and $\boldsymbol{\Sigma}[k-1\,|\,k-1]$, produce an estimate, $\hat{\mathbf{s}}[k]$, with some confidence, $\mathbf{M}[k]$, for the feature vector, $\mathbf{s}[k]$.

    For each feature over all imaging planes:

1. Define a set of random variables, $\{\boldsymbol{\theta}_1[k], \ldots, \boldsymbol{\theta}_J[k]\}$, and a subset, $\boldsymbol{\Theta}[k]$, that characterize some unknown combinations of vector estimates, $\hat{\mathbf{s}}_j[k]$, from different views.

    2. Following (15) and (16), construct a Bayesian topology (12) that orders the views in

    descending order of confidence.


    3. Start with the two most confident views of the feature and estimate the density functions

    in (9), (19), and (20) using random sampling, triangulation, and geometric projections, as

    illustrated in Figure 6.

4. Define and maximize $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ using the previously estimated density functions.

5. Iterate over steps 2-4, adding an additional view of the feature at each step, until a global maximum of $P_{\phi,\Theta}\big[\boldsymbol{\phi}, \boldsymbol{\Theta}\big]$ is reached, yielding the estimates $\hat{\boldsymbol{\Theta}}[k]$ and $\hat{\mathbf{y}}[k]$.

    In 3-D Cartesian space:

1. Update $\boldsymbol{\Psi}[k]$, $\boldsymbol{\Xi}[k]$, and $\mathbf{Q}[k]$; using (21) through (28), produce a corrected estimate, $\hat{\boldsymbol{\gamma}}[k\,|\,k]$, with some confidence, $\boldsymbol{\Sigma}[k\,|\,k]$, for the actual 3-D position of features, $\mathbf{y}[k]$.

    3. Experimental Results

To test the proposed contribution, we use 600 frames of synchronized video data taken from three (i.e., $J = 3$) distinct views of a home environment. The underlying scene captures an

    informal social gathering of four people, each of whom is characterized by five semantic features

    that are selected at first sight and tracked throughout the remainder of the sequence. For features,

    we use the top of each shoe, the transition between the sleeve and the arm, and the top of the

    torso underneath the neck as seen from the front of the body. These points correlate well with

    various body parts (e.g., wrists, elbows, feet, head/neck) while providing enough color and

    content to perform accurate tracking. If a desired feature is initially not visible due to self-

occlusion, but its position in one or more views can be estimated with fair accuracy, it is labeled

    and tracked as an occluded point until it becomes visible. This method of feature selection and

    tracking is illustrated more clearly in Figure 7, where we show the tracking results for the system

    at various stages.

    Using the metric in (5), the algorithm associates a confidence level for each feature. We

    quantize the number of levels to three, where a confidence level of 0 (LV) suggests that the

    feature is being tracked with high accuracy, a level of 1 (LO) indicates that the feature is being

tracked with relatively low accuracy, and a level of 2 (LM) identifies a feature that is no longer in the field of view. Referring to the notation of 2.2, LV-features typically include those of Class A and some of those in Class B and are assumed to be visible, while LO-features include those in Class C and some of those in Class B and are assumed to be occluded. While tracking, visible

    and occluded features are marked using solid and dashed circles, respectively.
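A trivial sketch of this quantization is given below; the numeric threshold separating LV from LO is purely hypothetical, since the paper does not state one.

```python
def confidence_level(omega, in_field_of_view, tau=0.5):
    """Return 0 (LV), 1 (LO), or 2 (LM) for a feature with confidence omega;
    the threshold tau is an assumed value, not taken from the paper."""
    if not in_field_of_view:
        return 2                      # LM: the feature has left the field of view
    return 0 if omega >= tau else 1   # LV if tracked with high confidence, else LO

if __name__ == "__main__":
    print([confidence_level(w, True) for w in (1.0, 0.72, 0.2)], confidence_level(0.9, False))
```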


    Figure 7. Tracking results. Each column provides a distinct view of the scene and each row represents an individual

    frame extracted from the sequence. Features that appear to be occluded or tracked with questionable accuracy are

    automatically labeled with a dashed circle, while those of higher confidence are given a solid mark.

    In addition to capturing the precise location and state of the features at various frames, we

    also show feature trajectories for a continuous set of frames. In particular, Figure 8 shows the

    path of every feature followed over a period of five seconds. For the sake of visual clarity, each

    row of the figure is dedicated to a different individual, while the features on each person are

    denoted using a variety of colors. To quantify the accuracy of the proposed tracking system, we

    calculate the average absolute error between the automatically generated feature locations and

    the corresponding ground-truth data. Since the underlying features have some well-defined

    semantic interpretation, it stands to reason that the ground-truth should most appropriately be


    generated by the same intervention that initially selected and defined each feature. When 3-D

data is available, we characterize the tracking error by using the absolute difference between the ground-truth and the 3-D feature projection at each imaging plane. The reported error is taken on the imaging plane carrying the maximum absolute difference over all $J$ views. If a feature can only be tracked from one view, however, we simply calculate the absolute error between the

    ground-truth and the tracking results of monocular image sequence processing.

    Figure 8. Trajectories of various features. Each row shows three views of a five second interval of the scene for a

    single person. The numbers in the lower corner of each image indicate the view followed by a range of frames.

    For exhaustively quantifying the data, then, we must develop a baseline for three views of

    four people, each with five features, over 600 frames. As this would clearly be no less than an


    overwhelming task, we choose to faithfully represent the entire sequence using a random sample

    of only 60 frames. The fundamental hypothesis of the proposed contribution is based upon an

    assumed correlation between tracking accuracy and various configurations of visible and

    occluded features. As is such, we group features into any ofB

    bins, where for a total ofJviewsand Fconfidence levels, B is the number of unique solutions to

$\sum_{i=1}^{F} O_i = J, \qquad 0 \leq O_i \leq J,\ \ O_i \in \mathbb{N}_0$, (29)

and $O_i$ is the number of views with the $i$th confidence level. Using the three levels defined earlier (LV, LO, LM), it can be shown that we have $\tfrac{1}{2}J^2 + \tfrac{3}{2}J + 1$ bins for any combination of $J$ cameras.

    Each bin is represented using a triplet, where for any feature the first number, OV, indicates the

    number of cameras that presumably have an unobstructed view, the second number, OO,

    represents the number of cameras that are thought to have an occluded view, and the third

    number, OM, specifies the number of cameras for which the feature is outside the field of view.

    Figure 9 summarizes the tracking error for the proposed Bayesian technique (B) while

    comparing to the performance of a more simplistic averaging approach (A). For the latter

    method, multiple observations are combined in the least-squares sense in order to combat the

effects of observation noise. We draw the reader's attention to a number of important

    characteristics. As indicated by the data, using the proposed BBN for data fusion produces lower

    tracking errors for any class of features for which more than two observations exist (i.e., bins

300, 210, 120, and 030). Assuming uniformly distributed features over $B$ bins, it is easily shown

that the probability of having an arbitrary feature with more than two observations is

$\Pr\big[O_M < J - 2\big] = \frac{\tfrac{1}{2}\big(J^2 + 3J - 10\big)}{B} = \frac{J^2 + 3J - 10}{J^2 + 3J + 2}$. (30)

    With three views, one would expect nearly 40% of the features (under a uniform distribution) to

    benefit from the proposed data fusion approach. Equation (30) states that as the number of views

    increases, so does the probability that the tracking of a feature occurs with a lower error. This is

    seen when taking the limit as the number of cameras monitoring the scene approaches infinity:

$\lim_{J\to\infty}\Pr\big[O_M < J - 2\big] = \lim_{J\to\infty}\left\{\frac{J^2 + 3J - 10}{J^2 + 3J + 2}\right\} = \lim_{J\to\infty}\left\{1 - \frac{12}{J^2 + 3J + 2}\right\} = 1$. (31)

    Due to the nature of occlusion, however, one cannot draw a similar inference for the case of

    simple averaging. That is, without a priori knowledge of the feature distribution within the


    scene, adding more cameras is likely to increase the number of occluded and visible points by

    comparable amounts. This clearly increases the probability that more than two cameras have an

    unobstructed view of a particular feature, as indicated by (31), but typically at the expense of

additional observations in occlusion. As demonstrated by the transitions from bins 111 to 120/210 and 201 to 210/300 in Figure 9, we only witness a significant decrease in error, on

    average, when the third observation has a relatively high confidence. In contrast, the Bayesian

    belief network effectively weights only the most likely observations based on the a priori

    confidence distributions presented in (9).
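The counting argument behind (29)-(31) is easy to verify numerically; the following sketch enumerates the bins for J = 3 and confirms both the number of bins and the roughly 40% figure quoted above.

```python
from itertools import product
from fractions import Fraction

def bins(J, F=3):
    """All non-negative integer solutions of O_1 + ... + O_F = J, cf. (29)."""
    return [c for c in product(range(J + 1), repeat=F) if sum(c) == J]

if __name__ == "__main__":
    J = 3
    B = bins(J)                                   # triplets (O_V, O_O, O_M)
    print(len(B))                                 # 10 bins = J^2/2 + 3J/2 + 1
    favourable = [b for b in B if b[2] < J - 2]   # more than two observations
    print(Fraction(len(favourable), len(B)))      # 2/5, i.e. about 40%
```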

[Histogram: maximum absolute error (pixels) versus view/confidence configuration (bins 300, 210, 201, 120, 111, 030, 021, 102, 012, 003) for the averaging approach (A) and the Bayesian approach (B); the error for bin 003 is undefined.]

Figure 9. Summary of tracking results. Each bin in the histogram represents a different class of features, denoted by a triplet, where the first, second, and third digits indicate the number of cameras tracking a given feature with high, low, and no accuracy, respectively. The three right-most bins indicate features for which 3-D data is unavailable, where the last bin (003) shows all points that have left the scene completely. The labeled percentages indicate the percent increase in error of one approach over another for each class of features.

    If we consider our specific distribution of features as a function of configuration over the

    entire sequence, as indicated in Figure 10, the difference in error between A and B is as low as

    4%, as high as 34%, and 14.5% on average. As further indicated by Figure 10, the total number

    of features for which the proposed algorithm improves the tracking accuracy accounts for 42% of

    all features. This value is in agreement with the prediction of 40% for a uniform distribution of

    observations. In support of Figures 9 and 10, we show corresponding numerical data in Table 1.


[Histogram: number of samples versus view/confidence configuration; the percentage above each bin gives the fraction of all sampled features in that configuration.]

Figure 10. Distribution of features. We show the distribution of sampled features as a function of configuration. The total number sampled is 1200, 18% of which did not have enough observations to generate 3-D data.

    All major sources of error in the tracking results can be traced back to occlusion; this

    point is exemplified by the trends in Figure 9. A more careful examination of the sequence even

    indicates that self-occlusion is often a stronger culprit than other forms, such as occlusion due to

    multiple object interactions and scene clutter. With the proposed technique, however, occlusion

    due to multiple moving objects still yields erroneous tracking results when an occluded feature

    undergoes some form of unexpected acceleration. If the visible motion estimates in the vicinity

    of the occluded region do not positively correlate with the motion of that region, then neither the

    predictions nor the observations will produce an accurate feature trajectory. For a more detailed

    discussion of tracking multiple objects in the presence of occlusion, we refer the reader to [3].

    Table 1. Tracking statistics. The first column shows those features for which 3-D position was estimated with high

    accuracy, the middle column indicates those features for which one or more necessary views were apparently

    occluded, while features in the third column had (at most) an estimated position in only one view of the scene.

            O_V >= 2                      O_V < 2 <= (O_V + O_O)         (O_V + O_O) < 2
Config   (A, B)         Samples     Config   (A, B)         Samples     Config   (A, B)         Samples
300      (1.80, 1.73)   58          120      (3.21, 2.82)   121         102      (2.26, 2.26)   29
210      (2.41, 1.79)   93          111      (3.03, 3.03)   140         012      (5.38, 5.38)   48
201      (1.81, 1.81)   85          030      (3.78, 3.48)   229         003      (N/A, N/A)     145
                                    021      (4.09, 4.09)   252
Percentage              19.6%       Percentage              61.8%       Percentage              18.6%


    Our technique for tracking occluded features is based on an analysis of visible data.

    Because we use the less computational, yet more simplistic, approach of model-free tracking, the

    motion of features in self-occlusion is often predicted using erroneous vector estimates within

the foreground of the moving person in question. The result is a trajectory that is correct at a global scale, but corrupt at finer resolutions. This is easily seen, for example, while attempting to

    track a feature on an occluded arm, where the motion of the arm is not as positively correlated

    with that of the torso (or the visible arm) as the algorithm would like to imply. We illustrate the

    difficulty of self-occlusion in Figure 11.

[Illustration: View A and View B of the same person, labeling a visible feature, the visible observations used, and an occluded feature.]

Figure 11. Observations in self-occlusion. This figure illustrates the adverse effects of self-occlusion and articulated motion on estimating observations for feature positions.

    One solution to the problem of self-occlusion might be the introduction of an a priori

    motion model, as suggested by [53]. In conjunction with some type of human model [20], one

    might consider segmenting the foreground video data into contiguous regions, each defined by

    some parametric motion model. These regions could then be used to intelligently guide the

    inclusion of sparse motion estimates in the definition of the state observations given in (6). This

    approach is theoretically sound, but is difficult to implement in practice due to the dependence of

    the segmentation on an accurate initialization, the uncertainty associated with region boundaries

    (especially under occlusion), and the need for real-time performance. Given the proposed

framework for the fusion of video data from multiple cameras, however, a simpler (and provably more effective) solution might be to extend the number of views, thus migrating towards the increasingly popular notion of next-generation ubiquitous computing. To develop a full appreciation of the proposed technique, we encourage the reader to visit our website to inspect the multi-view tracking results in their entirety.
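As a rough sketch of the parametric alternative mentioned above (purely illustrative: the affine model, the function names, and the least-squares fit are assumptions, not components of the proposed system), one might fit a six-parameter affine motion model to the reliable motion estimates of a segmented region and use the fitted model, rather than raw neighboring vectors, to predict the motion of occluded features inside that region:

    import numpy as np

    def fit_affine_motion(points, motions):
        # Least-squares fit of an affine motion model [u, v]^T = A [x, y]^T + b
        # to sparse motion estimates within one segmented region.
        #   points  : (N, 2) image coordinates with reliable motion estimates
        #   motions : (N, 2) measured motion vectors at those coordinates
        # Returns A (2 x 2) and b (2,).
        X = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 3) design matrix
        params, *_ = np.linalg.lstsq(X, motions, rcond=None)     # (3, 2) solution
        return params[:2, :].T, params[2, :]

    def predict_region_motion(A, b, point):
        # Motion predicted for any point (visible or occluded) in the region.
        return A @ point + b

The practical difficulties noted above lie precisely in obtaining the region segmentation that such a fit presupposes, especially under occlusion and in real time.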


    4. Conclusions and Future Work

    We introduce a novel technique for tracking interacting human motion using multiple

    layers of temporal filtering coupled by a simple Bayesian belief network for multiple camera

fusion. The system uses a distributed computing platform to maintain real-time performance while processing multiple sources of video data, each capturing a distinct view of the scene. To maximize the efficiency of distributed computation, each view of the scene is processed independently on a dedicated processor. The processing for each view is based on a predictor-corrector filter with

    Kalman-like state propagation that uses measures of state visibility and sparse estimates of image

    motion to produce observations. The corresponding gain matrix probabilistically weights the

    calculated observations against projected 3-D data from the previous frame. This mixing of

    predictions and observations produces a stochastic coupling between interacting states that

    effectively models the dependencies between spatially neighboring features.
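For readers who prefer an equation-level view, the per-view correction step summarized above can be sketched as a standard linear predict-correct update (a simplified stand-in with hypothetical names, not a transcription of our exact filter or notation): the prediction is the re-projected 3-D estimate, the observation comes from sparse image motion, and the gain trades one against the other according to their covariances.

    import numpy as np

    def correct_feature_state(x_pred, P_pred, z_obs, R_obs):
        # One simplified correction for a single 2-D feature position.
        #   x_pred : (2,) predicted position (re-projection of the 3-D estimate)
        #   P_pred : (2, 2) covariance of that prediction
        #   z_obs  : (2,) observed position from sparse image-motion estimates
        #   R_obs  : (2, 2) observation noise; inflated when the feature is
        #            poorly visible, small when it is well observed
        H = np.eye(2)                                # identity observation model (assumed)
        S = H @ P_pred @ H.T + R_obs                 # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)          # gain weighting observation vs. prediction
        x_corr = x_pred + K @ (z_obs - H @ x_pred)   # corrected state
        P_corr = (np.eye(2) - K @ H) @ P_pred        # corrected covariance
        return x_corr, P_corr

When visibility is poor, R_obs grows, the gain shrinks, and the corrected state follows the re-projected prediction; when the feature is clearly visible, the observation dominates.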

    The corrected output of each predictor-corrector filter provides a vector observation for a

    Bayesian belief network. The network is characterized by a dynamic, multidimensional topology

    that varies as a function of scene content and feature confidence. The algorithm calculates an

    appropriate input configuration by iteratively resolving independency relationships and a priori

    confidence levels within the graph. The output of the network is a vector of 3-D positional data

    with a corresponding vector of noise covariance matrices; this information is provided as input to

a standard Kalman filtering mechanism. The proposed method of data fusion is compared to the more basic approach of data averaging. Our results indicate that, for any input configuration consisting of more than two observations per feature, the method of Bayesian fusion is superior. In addition to producing lower errors than observational averaging, the Bayesian network lends itself well, both qualitatively and computationally, to applications involving a large number of input observations and, in general, is better suited to handling multi-sensor data sources.
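The practical difference between the two fusion strategies can be seen in a minimal sketch (hypothetical names and toy data; the actual network additionally resolves independency relationships within the graph, which this omits): weighting per-view 3-D estimates by their inverse covariances discounts unreliable views, whereas simple averaging treats every view as equally trustworthy.

    import numpy as np

    def fuse_by_covariance(estimates, covariances):
        # Covariance-weighted fusion of independent 3-D position estimates.
        #   estimates   : list of (3,) position estimates
        #   covariances : list of (3, 3) noise covariance matrices
        info = sum(np.linalg.inv(C) for C in covariances)                  # total information
        weighted = sum(np.linalg.inv(C) @ x for x, C in zip(estimates, covariances))
        P_fused = np.linalg.inv(info)
        return P_fused @ weighted, P_fused

    def fuse_by_averaging(estimates):
        # Baseline used for comparison: unweighted mean of the same estimates.
        return np.mean(np.stack(estimates), axis=0)

    # One confident estimate and one noisy (e.g., partially occluded) estimate:
    x1, C1 = np.array([1.0, 2.0, 3.0]), 0.01 * np.eye(3)
    x2, C2 = np.array([1.5, 2.6, 2.4]), 1.00 * np.eye(3)
    x_fused, _ = fuse_by_covariance([x1, x2], [C1, C2])
    print(x_fused)                      # stays close to the confident estimate
    print(fuse_by_averaging([x1, x2]))  # pulled halfway toward the noisy one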

Future work in this area will continue to investigate methods for handling complex occlusion, such as those employing a priori motion and object models or those based on additional observational sources of data. Outside the scope of tracking, but well within the field of human motion analysis, lie other topics of potential contribution. These investigations might be in areas including, but not limited to, simultaneous object tracking and recognition, multi-modal recognition of gestures and activities, and higher-level fusion of activities and modalities for event and interaction understanding applications.


    Acknowledgments

    This research was made possible by generous grants from the Center for Future Health,

    the Keck Foundation, and Eastman Kodak Company. We would also like to acknowledge

    Terrance Jones for his assistance and organizational efforts in providing synchronized multi-

    view sequences of multiple people in action.

    References

[1] Y. Ivanov, C. Stauffer, A. Bobick, and W. E. L. Grimson, "Video surveillance of interactions," Proc. of the Workshop on Visual Surveillance, Fort Collins, CO, 26 June 1999, pp. 82-89.
[2] T. Boult, R. Micheals, A. Erkan, P. Lewis, C. Powers, C. Qian, and W. Yin, "Frame-rate multi-body tracking for surveillance," Proc. of the DARPA Image Understanding Workshop, Monterey, CA, 20-23 November 1998, pp. 305-313.
[3] S. L. Dockstader and A. M. Tekalp, "On the Tracking of Articulated and Occluded Video Object Motion," Real-Time Imaging, to appear in 2001.
[4] I. Haritaoglu, D. Harwood, and L. S. Davis, "Hydra: multiple people detection and tracking using silhouettes," Proc. of the Workshop on Visual Surveillance, Fort Collins, CO, 26 June 1999, pp. 6-13.
[5] L. W. Campbell, D. A. Becker, A. Azarbayejani, A. F. Bobick, and A. Pentland, "Invariant features for 3-D gesture recognition," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 157-162.
[6] A. D. Wilson and A. F. Bobick, "Parametric Hidden Markov Models for Gesture Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, pp. 884-890, September 1999.
[7] A. Pentland, "Looking at People: Sensing for Ubiquitous and Wearable Computing," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 107-119, January 2000.
[8] J. M. Nash, J. N. Carter, and M. S. Nixon, "Extraction of moving articulated-objects by evidence gathering," Proc. of the British Machine Vision Conference, Southampton, United Kingdom, 14-17 September 1998, pp. 609-618.


[9] C. Bregler, "Learning and recognizing human dynamics in video sequences," Proc. of the Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17-19 June 1997, pp. 568-574.
[10] H. A. Rowley and J. M. Rehg, "Analyzing articulated motion using expectation-maximization," Proc. of the Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17-19 June 1997, pp. 935-941.
[11] D. Hogg, "Model-Based Vision: A Program to See a Walking Person," Image and Vision Computing, vol. 1, no. 1, pp. 5-20, January 1983.
[12] D. M. Gavrila and L. S. Davis, "3-D model-based tracking of humans in action: a multi-view approach," Proc. of the Conf. on Computer Vision and Pattern Recognition, San Francisco, CA, 18-20 June 1996, pp. 73-80.
[13] J. O'Rourke and N. I. Badler, "Model-Based Image Analysis of Human Motion Using Constraint Propagation," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 2, no. 6, pp. 522-536, November 1980.
[14] J. M. Rehg and T. Kanade, "Model-based tracking of self-occluding articulated objects," Proc. of the Int. Conf. on Computer Vision, Cambridge, MA, 20-23 June 1995, pp. 618-623.
[15] S. Wachter and H.-H. Nagel, "Tracking Persons in Monocular Image Sequences," Computer Vision and Image Understanding, vol. 74, no. 3, June 1999.
[16] E.-J. Ong and S. Gong, "Tracking hybrid 2D-3D human models from multiple views," Proc. of the Int. Workshop on Modelling People, Kerkyra, Greece, 20 September 1999, pp. 11-18.
[17] M. Isard and A. Blake, "Condensation - Conditional Density Propagation for Visual Tracking," Int. J. of Computer Vision, vol. 29, no. 1, pp. 5-28, August 1998.
[18] K. Akita, "Image Sequence Analysis of Real World Human Motion," Pattern Recognition, vol. 17, no. 1, pp. 73-83, January 1984.
[19] M. K. Leung and Y.-H. Yang, "Human Body Motion Segmentation in a Complex Scene," Pattern Recognition, vol. 20, no. 1, pp. 55-64, January 1987.
[20] S. X. Ju, M. J. Black, and Y. Yacoob, "Cardboard people: A parameterized model of articulated image motion," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 38-44.


[21] C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, "Pfinder: Real-Time Tracking of the Human Body," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 780-785, July 1997.
[22] Y.-S. Yao and R. Chellappa, "Tracking a Dynamic Set of Feature Points," IEEE Trans. on Image Processing, vol. 4, no. 10, pp. 1382-1395, October 1995.
[23] A. Azarbayejani and A. P. Pentland, "Recursive Estimation of Motion, Structure, and Focal Length," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 17, no. 6, pp. 562-575, June 1995.
[24] T. J. Broida and R. Chellappa, "Estimating the Kinematics and Structure of a Rigid Object from a Sequence of Monocular Images," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, pp. 497-513, June 1991.
[25] F. Lerasle, G. Rives, and M. Dhome, "Tracking of Human Limbs by Multiocular Vision," Computer Vision and Image Understanding, vol. 75, no. 3, pp. 229-246, September 1999.
[26] D.-S. Jang and H.-I. Choi, "Active Models for Tracking Moving Objects," Pattern Recognition, vol. 33, no. 7, pp. 1135-1146, July 2000.
[27] N. Peterfreund, "Robust Tracking of Position and Velocity with Kalman Snakes," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 6, pp. 564-569, June 1999.
[28] J. MacCormick and A. Blake, "A probabilistic exclusion principle for tracking multiple objects," Proc. of the Int. Conf. on Computer Vision, Kerkyra, Greece, 20-27 September 1999, pp. 572-578.
[29] S. J. McKenna, S. Jabri, Z. Duric, and H. Wechsler, "Tracking interacting people," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Grenoble, France, 28-30 March 2000, pp. 348-353.
[30] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-Time Surveillance of People and Their Activities," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 809-830, August 2000.
[31] V. Kettnaker and R. Zabih, "Counting people from multiple cameras," Proc. of the Int. Conf. on Multimedia Computing and Systems, Florence, Italy, 7-11 June 1999, pp. 267-271.


[32] G.-W. Chu and M. J. Chung, "An optimal image selection from multiple cameras under the limitation of communication capacity," Proc. of the Int. Conf. on Multisensor Fusion and Integration for Intelligent Systems, Taipei, Taiwan, 15-18 August 1999, pp. 261-266.
[33] T. Darrell, G. Gordon, M. Harville, and J. Woodfill, "Integrated Person Tracking Using Stereo, Color, and Pattern Detection," Int. J. of Computer Vision, vol. 37, no. 2, pp. 175-185, June 2000.
[34] Q. Cai and J. K. Aggarwal, "Tracking Human Motion in Structured Environments Using a Distributed-Camera System," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 11, pp. 1241-1247, November 1999.
[35] S. Stillman, R. Tanawongsuwan, and I. Essa, "A system for tracking and recognizing multiple people with multiple cameras," Proc. of the Int. Conf. on Audio and Video-Based Biometric Person Authentication, Washington, DC, 22-23 March 1999, pp. 96-101.
[36] A. Utsumi, H. Mori, J. Ohya, and M. Yachida, "Multiple-human tracking using multiple cameras," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 498-503.
[37] J. Yamato, J. Ohya, and K. Ishii, "Recognizing human action in time-sequential images using hidden Markov model," Proc. of the Conf. on Computer Vision and Pattern Recognition, Champaign, IL, 15-18 June 1992, pp. 379-385.
[38] M. J. Black and A. J. Jepson, "Recognizing temporal trajectories using the Condensation algorithm," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 16-21.
[39] T. J. Olson and F. Z. Brill, "Moving object detection and event recognition algorithms for smart cameras," Proc. of the DARPA Image Understanding Workshop, New Orleans, LA, 11-14 May 1997, pp. 159-175.
[40] Y. Yacoob and M. J. Black, "Parameterized Modeling and Recognition of Activities," Computer Vision and Image Understanding, vol. 73, no. 2, pp. 232-247, February 1999.
[41] Y. Yacoob and L. S. Davis, "Learned Models for Estimation of Rigid and Articulated Human Motion from Stationary or Moving Camera," Int. J. of Computer Vision, vol. 12, no. 1, pp. 5-30, January 2000.


[42] C. R. Wren and A. P. Pentland, "Dynamic models of human motion," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Nara, Japan, 14-16 April 1998, pp. 22-27.
[43] S. Nagaya, S. Seki, and R. Oka, "A theoretical consideration of pattern space trajectory for gesture spotting recognition," Proc. of the Int. Conf. on Automatic Face and Gesture Recognition, Killington, VT, 14-16 October 1996, pp. 72-77.
[44] D. M. Gavrila, "The Visual Analysis of Human Movement: A Survey," Computer Vision and Image Understanding, vol. 73, no. 1, pp. 82-98, January 1999.
[45] J. K. Aggarwal and Q. Cai, "Human Motion Analysis: A Review," Computer Vision and Image Understanding, vol. 73, no. 3, pp. 428-440, March 1999.
[46] R. T. Collins, A. J. Lipton, and T. Kanade, "Introduction to the Special Section on Video Surveillance," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 745-746, August 2000.
[47] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Francisco, CA: Morgan Kaufmann, 1988.
[48] Y. Altunbasak, A. M. Tekalp, and G. Bozdagi, "Simultaneous stereo-motion fusion and 3-D motion tracking," Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing, Detroit, MI, 9-12 May 1995, vol. 4, pp. 2270-2280.
[49] N. M. Oliver, B. Rosario, and A. P. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831-843, August 2000.
[50] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, "Systems and Experiment: Performance of Optical Flow Techniques," Int. J. of Computer Vision, vol. 12, no. 1, pp. 43-77, 1994.
[51] Z. Sun, Object-Based Video Processing with Depth, Ph.D. Thesis, University of Rochester, 2000.
[52] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision, Upper Saddle River, NJ: Prentice-Hall, 1998.
[53] H. Sidenbladh, M. J. Black, and D. J. Fleet, "Stochastic tracking of 3D human figures using 2D image motion," Proc. of the European Conf. on Computer Vision, Dublin, Ireland, 26 June - 1 July 2000.