
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B: CYBERNETICS

Tracking Human Position and Lower Body Parts Using Kalman and Particle Filters Constrained by Human Biomechanics

Jesús Martínez del Rincón, Dimitrios Makris, Member, IEEE, Carlos Orrite Uruñuela, Member, IEEE, and Jean-Christophe Nebel, Senior Member, IEEE

Abstract: In this paper, a novel framework for visual tracking of human body parts is introduced. The presented approach demonstrates the feasibility of recovering human poses with data from a single uncalibrated camera by using a limb-tracking system based on a 2-D articulated model and a double-tracking strategy. Its key contribution is that the 2-D model is constrained only by biomechanical knowledge about human bipedal motion, instead of relying on constraints that are linked to a specific activity or camera view. These characteristics make our approach suitable for real visual surveillance applications. Experiments on a set of indoor and outdoor sequences demonstrate the effectiveness of our method on tracking human lower body parts. Moreover, a detailed comparison with current tracking methods is presented.

Index Terms: Biomechanics, bipedal motion, human pose, particle filter, video surveillance, 2-D articulated model.

    I. INTRODUCTION

HUMAN motion modeling is one of the most active research areas in computer vision. It can be defined as the ability to estimate, at each frame of a video sequence, the position of each joint of a human figure, which is represented by an articulated model. Because of the 3-D nature of human motion, tracking methods based on 3-D anthropomorphic articulated models have proved to be the most effective [14]-[18]. Their applications include analysis of human activity [53], entertainment, ambient intelligence, and medical diagnosis. However, their main drawback is that, in general, they rely on data captured synchronously by several cameras that have been accurately calibrated.

Manuscript received July 2, 2009; revised October 19, 2009 and January 13, 2010. This work was supported in part by the Engineering and Physical Sciences Research Council (EPSRC) through the Multienvironment Deployable Universal Software Application (MEDUSA) project under Grant EP/E001025/1 and the Pose Recovery in Context Specific Scenarios (PRoCeSS) project under Grant EP/E033288, by the Spanish Ministry of Education under Grant TIN2006-11044 and through the Fondo Europeo de Desarrollo Regional (FEDER), and by the Spanish Ministry of Education and Science (MEyC) through the Fellowship for the Training of Research Personnel (FPI) under Grant BES-2004-3741. This paper was recommended by Associate Editor X. Li.

J. Martínez del Rincón, D. Makris, and J.-C. Nebel are with the Digital Imaging Research Centre, Kingston University, KT1 2EE Surrey, U.K., and also with the Faculty of Computing, Information Systems and Mathematics, Kingston University, London, KT1 1LQ Surrey, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

C. Orrite Uruñuela is with the Department of Electronics and Communications Engineering, University of Zaragoza, 50018 Zaragoza, Spain, and also with the Aragon Institute of Engineering Research, University of Zaragoza, 50018 Zaragoza, Spain (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TSMCB.2010.2044041

Therefore, these techniques are impractical for applications that target unconstrained environments, such as video surveillance [6], [7]. The alternative is the use of tracking methods based on 2-D models, which cannot deal by themselves with the intrinsic ambiguity of projected 3-D postures, self-occlusions, and distortions introduced by the camera perspective. Therefore, they are usually restricted to well-defined motions and specific camera views; however, these constraints reduce their value in many real applications.

We propose a double-tracking (T2) strategy to accurately and simultaneously track both the position of the body and its articulated motion. Position is tracked by a Kalman filter, whereas human body parts are tracked using a set of particle filters [14], [19], [55], which iteratively refine their solution. The key contribution of this method is that it relies on a generative approach based on a 2-D model that is constrained only by human biomechanics. The inclusion of biomechanical knowledge about bipedal motion significantly reduces the complexity of the problem. This result is achieved by the detection of the pivot foot (i.e., the foot that is static during a step) and its

trajectory during a whole step.

In this paper, we concentrate our effort on tracking the legs of a subject, because the other body parts do not benefit from biomechanics constraints. Our results are evaluated against the HumanEva (HE) data set, which is becoming the standard for assessing human body tracking algorithms [5], and outdoor data from Sidenbladh [44]. After a brief description of the state of the art in tracking human body parts, we present an overview of our methodology. Then, we detail the key algorithms and the biomechanics constraints that we use. Finally, after the presentation and evaluation of our results, conclusions are drawn.

    II. RELATED WORK

Tracking complexity increases exponentially with the number of targets when their motions are mutually dependent, as is the case when dealing with articulated objects. Articulated models have been shown to be essential tools for handling tracking and detection tasks by reinforcing motion constraints in either 2-D [23] or 3-D space [13] so that the motions of subparts are interrelated. Several approaches have been investigated to alleviate this challenge, such as dynamic programming [1], annealed sampling [20], partitioned sampling [17], eigenspace tracking [12], hybrid Monte Carlo filtering [21], and bottom-up [8] approaches.

Approaches to vision-based human motion analysis can broadly be divided into two categories: 1) generative and 2) discriminative.


The first category explicitly uses a human body model [20], [24]-[30], [32] that describes both the visual and kinematic properties of the human body. Discriminative approaches [12], [31], [33]-[41] learn the mapping from image space directly to pose space from carefully selected training data. Because discriminative approaches work in a learned

pose space where the dimensionality has been reduced, they are computationally much less expensive, can potentially be applied in real time, and are more robust to noise or occlusions. Furthermore, discriminative approaches allow the recovery of poses with less information, which makes them more suitable for monocular applications. However, they have one serious drawback: their accuracy relies on the similarity between the posture to be recovered and the data used in the training data set. In addition, their performance tends to decrease when the variety of activities used in a training set increases [54].

Independent of the chosen modeling strategy, another key decision has to be taken regarding the dimension of the body model, i.e., 2-D or 3-D [29]. The first option involves working

directly with the 2-D features that are derived from the images. This option has successfully been applied for constrained types of movements, such as walking parallel to the image plane and periodic motions. Nevertheless, their performance significantly decreases for unconstrained and complex human actions, including movements that are out of the camera plane (e.g., wandering, making gestures, and turning), which produce frequent self-occlusions. In general, 2-D discriminative approaches are more robust when dealing with self-occlusions. However, prior knowledge about either the movement or the viewpoint is required to correctly drive their 2-D models.

Many techniques that are based on 2-D models have been proposed. In [1], an approach that locally analyzes subparts

is proposed for visual tracking of articulated models while reinforcing the structural constraints between different subparts. It combines a dynamic Markov network, which characterizes the dynamics and the image observations of each individual subpart, and motion constraints based on a mean-field Monte Carlo (MFMC) algorithm, in which a set of low-dimensional particle filters interact with each other and collaboratively solve the high-dimensional problem. Ju et al. [23] propose a cardboard model in which the human limbs are modeled by a set of connected planar patches. By constraining the parameterized motion of the patches in the image, the articulated motion is reinforced. Optical flow is used as a feature to track the limbs and to estimate the viewpoint. Results confirm that 2-D

patches can track a limb that is not subject to occlusions if the viewpoint has been determined. Rehg et al. [42] describe a 2-D scaled prismatic model (SPM) for figure registration, which deals with variations in rotation and depth. The SPM significantly reduces the number of singularities that appear due to the bidimensional projection of the 3-D pose and does not require detailed knowledge of the 3-D kinematics. Although the authors demonstrate the application of the model for motion capture from movies, only certain types of movements can be tracked, and the system fails for fast movements. In [2], the random sample consensus (RANSAC) and maximum likelihood estimation sample consensus (MLESAC) algorithms are incorporated into a planar patch tracker as feature weights to perform robust

tracking. In [3], Noriega and Bernier propose a planar patch articulated model, which is a loose-limbed model that includes

attraction potentials between adjacent limbs and constraints to reject poses that result in collisions. Compatibility between model and image is estimated using one particle filter per limb, whereas compatibility between limbs is represented by interaction potentials. The joint probability is obtained by belief propagation on a factor graph. The main drawback of all these

2-D models is that their usage is restricted to specific types of motions, which are usually linear and seen from a specific viewpoint.

On the other hand, 3-D methods [14]-[18] can be considered as more general-purpose approaches, because they provide a well-posed solution to tracking a 3-D object. In particular, this solution enables the user to take advantage of a large amount of available prior knowledge about the kinematics, shape properties, and biomechanics of the human body and gait. This information makes the problem more tractable and permits the user to predict events such as self-occlusions. However, the fact that 3-D models must be projected into the image plane has two consequences. First, in addition to the larger dimensionality

of the model, projections make 3-D tracking a computationally expensive methodology. Second, a constrained environment is required: cameras have to be calibrated, and the transformation between the image plane and the 3-D world has to be known. Consequently, they are not suitable for applications like video surveillance, where real-time tracking is expected and camera calibration is not practical.

Kakadiaris and Metaxas [30] consider a multicamera system to cope with 3-D model-based body part tracking. A Kalman filter is applied to predict the location of each limb. The correspondence between the contour in the image and the projection of the 3-D shape is used as a likelihood function. Gavrila and Davis [29] extended this methodology to a

22-degree-of-freedom model. Hunter et al. [43] build a model that is composed of five ellipsoids with 14 degrees of freedom, where a particle filter was successfully combined with a 3-D articulated model. The paper presented by Deutscher et al. is probably one of the most important papers in this field [20]. It is not only the most important generative approach but also the reference method that is used to benchmark new algorithms. Deutscher et al. propose a modified version of particle filters to efficiently estimate the multimodal distribution of the human body articulated model in a high-dimensional space. The main drawback is the prohibitive computational cost that is associated with the processing of each frame. Sidenbladh et al. [44] present another relevant probabilistic method for tracking

3-D articulated human figures in monocular sequences. It is based on a generative model of appearance, a robust likelihood function that evaluates gray-level differences, and a prior probability distribution that introduces knowledge about human gait and joint angles. Moreover, valid 3-D human motions are constrained by a prior probability distribution over the dynamics of the human body.

Recently, discriminative approaches that are based on latent spaces and manifolds have achieved high popularity [41], [45], [51], [52]. This is mainly because they reduce the computational cost by constraining the space of possible poses with prior information. Elgammal [31] proposes a manifold to relate silhouettes with 3-D poses. A different 1-D manifold is

learned per view and activity. In [40], two different regression algorithms are used for the forward mapping (dimensionality reduction) and the inverse mapping.


    Fig. 1. Principle of the double-tracking strategy.

The representatives that are used in the regression are chosen in a heuristic manner. In [39], a Gaussian process latent variable model (GPLVM) and a second-order Markov model are used in tracking applications. The learned GPLVM is used to provide a model prior. Tracking is then done by minimizing a cost of 2-D image matching, with the negative log-likelihood of the model prior being the regularization term.

Both [39] and [40] advocate the use of gradient descent optimization techniques; hence, the learned low-dimensional space has to be smooth, and accurate initialization is required for the success of such techniques. One alternative approach [45] employs the GPLVM in a modified particle filtering algorithm where samples are drawn from the low-dimensional latent space. In these three papers, the smoothness enforced by the learning algorithms in the low-dimensional space works well for tracking small limb movements but may fail when large movements occur over time.

This overview of human pose recovery methodology informs the design of a solution based on the requirements and characteristics of the particular problem that we want to solve. Because our objective consists of recovering the human pose in unconstrained environments, where the subject can perform any kind of movement and where initialization should eventually be automated, we are constrained to choose a generative approach based on a 2-D model. However, unlike the previous work within this framework, which was limited to specific types of motions, we propose an approach that can deal with variations in rotation and depth so that it can be applied to real-life data. This result is achieved by constraining the 2-D model, which is designed to tackle 3-D motion patterns such as changes in the pose of the object with respect to the camera, by using specific knowledge about human biomechanics and gait analysis.

    III. T2 STRATEGY

    A. General Principle

In our previous work [46], we proposed a methodology where the global location of the person and the relative pose of the limbs were simultaneously tracked. Although this integrated strategy was elegant, it showed some inefficiency, because an error in the global location directly affects limb pose recovery.

To deal with this problem, we propose the T2 strategy (see Fig. 1). The estimate of the pose of the limbs $X_{\text{leg}}$ is calculated using the combination of two trackers: the first tracker traces the global location of the person $X_{\text{ext}}$, and the second tracker recovers the relative pose of the limbs $\tilde{X}_{\text{leg}}$, i.e.,

$$X_{\text{leg}} = X_{\text{ext}} + \tilde{X}_{\text{leg}}. \tag{1}$$

For the first tracker, we use a Kalman filter, which has been shown to be a very efficient paradigm for tracking pedestrians in visual surveillance applications [50]. For body part tracking, we use a set of particle filters.

Human articulated motion is highly multimodal, and this non-Gaussian characteristic is amplified in the image plane by the camera perspective. Therefore, a tracking framework that can work with nonlinear distributions is required. Particle filters have successfully been applied for this purpose [19]; therefore, this algorithm is at the core of our tracking framework. A detailed explanation of particle filtering is given in [19].

Once the first tracker has obtained an estimate of the location of the person by using a motion blob, this information is introduced as prior knowledge in the proposal distribution of the particle filter. Thus, the particle distribution in the next prediction step is guided by the global location in the (x, y)-coordinates $[x_{\text{ext}}, y_{\text{ext}}]$. Moreover, limb sizes of the new hypotheses are estimated by taking into account blob height changes between two frames ($\dot{l}_{\text{ext}}$). In this manner, tracking

can recover from incorrect estimations of the particle filters without being limited to the result of the first tracker. Indeed, the dynamic model of the new hypotheses enables us to correct the first tracker's estimation, which is only used as a guide or soft constraint that helps put the hypotheses near the global optimum.

The limb model that is employed is identical to the model in [46], but here, the spatial coordinates have been normalized with respect to the central point of the line that links both hip points, and the size parameters have been normalized with respect to the human height of the blob.

Limb tracking is based on a set of particle filters that fit a 2-D articulated model on each frame of a video sequence.

In addition, we take advantage of a biomechanics constraint that is inherent in human bipedal motion: during a step, one leg pivots around a single point. This approach allows us to deal with more motions than other techniques that rely on training on a specific activity. Because we can detect the position of this point, this constraint is integrated into an asymmetrical 2-D model where the two legs are treated differently. Finally, model fitting is performed after different trackers have successively been applied.

Initially, a standard particle filter process operates to track lower limb locations until the end of the step. Due to the high dimensionality of the problem and the ill-conditioned model, it may not produce satisfactory tracking. To refine the tracking

of the articulated model, two assistant particle filters are then launched in parallel by using information that is intrinsic to the step of interest. The main reason for using two trackers, instead of one, is to handle the degradation and potential divergence of tracking over time.

To take advantage of the pivot point constraint and trajectory information, we propose to rely on data that are captured during a full step before completing the tracking task. Although a short delay, typically around ten frames (i.e., 0.5 s), is introduced in a real-time system, this approach allows us to process a wide range of human activities without loss of accuracy. Moreover, because this delay does not increase, if suitable processing power is available, the whole system can

operate online. Although some actions (such as running or jumping) break the pivot constraint during short periods of


time and the pivot point can momentarily be occluded, this case can be detected and handled without significantly affecting the proposed tracking framework, because the standard particle filter can still estimate the poses without those constraints.

    B. Biomechanics Constraints for Human Motion Tracking

Most human motion tracking methods rely on constraints such as a specific activity, constant velocity, and linear or periodic motion, which critically impact their accuracy and/or their genericity. The study of human biomechanics, however, reveals that human motion itself provides some explicit constraints. In this section, we show that they can be utilized to simplify the task of tracking human body parts. Walking is a very common human activity of which many other motions, such as loitering, balancing, and dancing, can be seen as derivatives and to which the underlying mechanics of walking can be applied. All these bipedal motions are based on a series of steps, defined as one leg swinging around a support leg whose foot, or pivot, stays in contact with the ground at every instant [22]. Therefore,

the detection of this pivot point from a video sequence provides a very important biomechanics cue that is present in most motions that the tracker processes.

Knowledge of the precise position of the pivot foot also allows us to use different strategies for tracking either the support or the swinging leg, which enhances the power of our 2-D model. Moreover, positions of consecutive striking feet provide some information about the subject's trajectory in the image plane, which supplies clues regarding the relative camera-subject position. Consequently, detection of this position permits a significant reduction of the complexity of the tracking task.

    In addition to the pivot foot constraint, the support leg

has another property: the upper and lower legs are supposed to be aligned during the pivot motion around the static foot. Therefore, an estimate of the locations of the associated knee and hip can be refined if they do not form a straight line with the pivot foot.

In our framework, the static foot is detected using the algorithm that was proposed in [4]. It is based on the biomechanics of gait motion. During the strike phase, the foot of the striking leg stays at the same position for half a gait cycle, whereas the rest of the human body moves. The pivot foot is detected using a low-level feature: corners produced by the Harris corner detector. Outliers due to cluttered backgrounds are filtered out by using a background subtraction algorithm. Corners that are

associated with the pedestrian of interest are accumulated across several frames (i.e., 20 in our implementation). The region where the leg strikes the ground must have a high density of corners. Although this approach is usually efficient (when an individual's motion is parallel to the camera plane, the static foot is easily detected), motions toward or away from the camera produce many points that are seen as static on the body due to the influence of the perspective. We deal with this case by removing outliers and false positives by maintaining both the temporal and spatial coherence of the pivot point.

Corners $C$ are accumulated across several frames by using the following equation:

$$C = \sum_{t=1}^{N} \left( H(I_t) \wedge L_t \right) \tag{2}$$

where $H$ is the output of the Harris corner detector, $I_t$ is the original image at frame $t$, $L_t$ is the pedestrian blob at frame $t$, and $\wedge$ is the logical conjunction operator. Although we only consider one pedestrian, as mentioned in the introduction of this section, the pivot point detection algorithm could be extended to deal with several people by selecting an appropriate association algorithm.
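As an illustration, the accumulation in (2) maps directly onto a few lines of Python with OpenCV and NumPy; the Harris parameters and the relative corner threshold below are our assumptions, not values reported in the paper.

```python
import cv2
import numpy as np

def accumulate_corners(frames, blob_masks, rel_thresh=0.01):
    """Accumulate Harris corners masked by the pedestrian blob, as in (2).

    frames: list of grayscale uint8 images I_t.
    blob_masks: list of boolean masks L_t from background subtraction.
    Returns an integer map C counting, per pixel, in how many frames a
    corner was detected inside the blob.
    """
    C = np.zeros(frames[0].shape, dtype=np.int32)
    for img, blob in zip(frames, blob_masks):
        response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)
        corners = response > rel_thresh * response.max()      # H(I_t)
        C += np.logical_and(corners, blob).astype(np.int32)   # H(I_t) ∧ L_t
    return C
```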

Dense areas of corners are located using a measure of corner proximity density $d_p$. The value of proximity at point $p$ depends on the number of corners within the region $R_p$ and their corresponding distances from $p$. $R_p$ is assumed to be a circular area centered at $p$, whose radius $r$ is determined as the ratio of the total number of image points to the total number of corners in $C$. Corner proximity values $d_p$ are computed for all regions $R_p$ in $C$ by using the following equation:

$$d_p^r = \frac{N_r}{r}, \qquad d_p^i = d_p^{i+1} + \frac{N_i}{i} \tag{3}$$

where $d_p^i$ is the proximity value for rings at radius $i$ from the center $p$, and $N_i$ is the number of corners at distance $i$ from the center, with rings being one pixel wide.

Starting from a radius $r$, the process then iterates to accumulate all the densities for the subregions $R_p$ for all points $p$ into a matrix to produce the corner proximity matrix of the frame. The highest values in the matrix generally correspond to the heel strike areas.
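A direct, unoptimized reading of (3) is sketched below: corners are counted on one-pixel-wide rings around a candidate point and accumulated, weighted by the inverse ring radius, from the outer radius $r$ inward. The corner map C is assumed to come from an accumulation step such as the one sketched above.

```python
import numpy as np

def corner_proximity(C, p, r):
    """Corner proximity d_p at point p = (row, col), following (3).

    C: 2-D integer corner accumulation map.
    r: outer radius (in pixels) of the circular region R_p.
    """
    rows, cols = np.indices(C.shape)
    dist = np.sqrt((rows - p[0]) ** 2 + (cols - p[1]) ** 2)
    d = 0.0
    for i in range(int(r), 0, -1):                    # from radius r inward
        ring = (dist >= i - 0.5) & (dist < i + 0.5)   # one-pixel-wide ring
        N_i = C[ring].sum()                           # corners at distance i
        d += N_i / i                                  # d_p^i = d_p^{i+1} + N_i/i
    return d
```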

    C. Position Tracking Based on a Kalman Filter

Using a Kalman filter, we track the bounding box of the person under observation. The state vector is $x_t = [x_{\text{ext}}, y_{\text{ext}}, \dot{x}_{\text{ext}}, \dot{y}_{\text{ext}}, l_{\text{ext}}, \dot{l}_{\text{ext}}]$, where $[x_{\text{ext}}, y_{\text{ext}}]$ is the global location in the (x, y)-coordinates, $l_{\text{ext}}$ is the blob height, and $\dot{x}$, $\dot{y}$, and $\dot{l}$ are their derivatives. The likelihood function is based on a motion detector that extracts the blob that corresponds to the subject.
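A minimal sketch of such a position tracker follows, using the textbook Kalman predict/update equations over the six-dimensional state above; the noise covariances are placeholders, since the paper does not report its tuning.

```python
import numpy as np

dt = 1.0  # time lapse between frames

# State: [x_ext, y_ext, dx, dy, l_ext, dl]
A = np.eye(6)
A[0, 2] = A[1, 3] = A[4, 5] = dt  # first-order (constant-velocity) transitions

H = np.zeros((3, 6))              # we observe x_ext, y_ext, and l_ext
H[0, 0] = H[1, 1] = H[2, 4] = 1.0

Q = np.eye(6) * 1e-2              # process noise (placeholder value)
R = np.eye(3) * 4.0               # measurement noise (placeholder value)

def kalman_step(x, P, z):
    """One predict/update cycle; z = [x, y, height] of the detected blob."""
    x = A @ x                       # predict
    P = A @ P @ A.T + Q
    S = H @ P @ H.T + R             # update
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(6) - K @ H) @ P
    return x, P
```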

D. Multiple-Particle-Filter Tracking Based on a Two-Dimensional Articulated Model

1) Two-Dimensional Asymmetrical Articulated Model Informed by Trajectory Information: Our model aims at simultaneously tracking the relative positions of the different parts of the limbs. Thus, the tracker state vector is composed of the image coordinates of the hip points and the parameters that model the relative motions and positions, such as angles and lengths in the image plane. To introduce the biomechanics constraints, which rely on a relative independence between both legs, both hip points are employed as references, and the angles of both legs with respect to the hips are included in the state vector. The state vector of each leg is described by the following equation:

$$\tilde{X}_{\text{leg}} = [x_{\text{hip}}, y_{\text{hip}}, \dot{x}_{\text{hip}}, \dot{y}_{\text{hip}}, \theta_{\text{hip-thigh}}, \theta_{\text{knee}}, \dot{\theta}_{\text{hip-thigh}}, \dot{\theta}_{\text{knee}}, l_{\text{femur}}, l_{\text{shin}}, \dot{l}_{\text{femur}}, \dot{l}_{\text{shin}}] \tag{4}$$

where $x$ and $y$ are the coordinates (in pixels), $\theta$ is the angle between a limb and the x-axis, and $l$ is the length of the limb (see Fig. 2).


    Fig. 2. Two-dimensional articulated model.

By including the person's global location that was provided by the Kalman filter, we obtain our articulated model state in absolute image coordinates by substituting (4) in (1), i.e.,

$$X_{\text{leg}} = [x_{\text{ext}} + x_{\text{hip}},\ y_{\text{ext}} + y_{\text{hip}},\ \dot{x}_{\text{hip}},\ \dot{y}_{\text{hip}},\ \theta_{\text{hip-thigh}},\ \theta_{\text{knee}},\ \dot{\theta}_{\text{hip-thigh}},\ \dot{\theta}_{\text{knee}},\ l_{\text{femur}} l_{\text{ext}},\ l_{\text{shin}} l_{\text{ext}},\ \dot{l}_{\text{femur}},\ \dot{l}_{\text{shin}}]. \tag{5}$$
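In code, the substitution in (5) amounts to offsetting the hip coordinates by the Kalman position and rescaling the normalized limb lengths by the blob height. A minimal sketch, with the array layout of (4) and variable names of our own choosing:

```python
import numpy as np

def to_absolute(rel_leg, x_ext, y_ext, l_ext):
    """Compose the absolute leg state from the relative one, as in (1)/(5).

    rel_leg: 12-vector [x_hip, y_hip, dx, dy, th_hip, th_knee, dth_hip,
                        dth_knee, l_femur, l_shin, dl_femur, dl_shin],
             normalized as described in Section III-A.
    """
    leg = np.asarray(rel_leg, dtype=float).copy()
    leg[0] += x_ext   # hip position offset by the global location
    leg[1] += y_ext
    leg[8] *= l_ext   # normalized limb lengths rescaled by the blob height
    leg[9] *= l_ext
    return leg
```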

Using the pivot point as a constraint, the support leg is estimated first. Then, the swinging leg is calculated. To perform a robust estimation, the hip point position of the support leg is used to constrain the other hip point. The distance between the two hip points is set at a fixed anthropometric value $D_0$ during initialization as a proportion of the width of the legs. Moreover, we assume that the two hip points share the same y-coordinate, where the y-axis is defined as the axis that goes along the dorsal spine of the subject. This assumption is reasonable if the camera is sufficiently far from the subject and does not provide a zenithal view, which is usually the case in visual surveillance.

Although, in general, this y-axis corresponds to the vertical axis of the image, its direction can be determined more precisely by calculating the moments of the human figure, where the y-axis is the larger axis of the ellipse that surrounds the subject.

Due to its nature, 2-D tracking allows higher flexibility and more simplicity of use and initialization than 3-D tracking. However, in 2-D, it is not possible to introduce traditional constraints such as motion dynamics or kinematics. Instead, we transfer 3-D properties to the 2-D world. In the 3-D world, the distance between the hips remains constant over time. However, when this fixed distance is projected in the camera plane, its value is changed by two different parameters: 1) the location and 2) the orientation. Although the location introduces a factor

of scale that is estimated with the global size of the legs, the orientation distorts this distance in a nonlinear way, which depends on the viewpoint.

Because of the stochastic nature of our tracking algorithm, the exact value of this distance is not required. Given the poses of the hips at the beginning and the end of a step, values of the hips between these two frames are estimated. In fact, the distance is correlated with the angle of the step trajectory in a nonlinear manner, as shown in Fig. 3(a). We approximate this correlation with a function that models an S-curve, i.e.,

$$D(\theta) = D_0 \, \frac{1 - e^{-\gamma\theta}}{1 + e^{-\gamma\theta}} \tag{6}$$

where $D_0$ is the maximum size of the hip distance with respect to the size of the leg (in our implementation, it is half the value of the sum of the thigh widths), $\theta$ is the angle between the trajectory and the x-axis in the image plane, and $\gamma$ is an empirical factor that controls the speed of the curve descent.

Fig. 3. (a) Interpolated trajectory (blue dots) of the pivot points (red and green dots). (b) Correspondences during a turn between the hip distances in a zenithal view and in the image plane.
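As a small illustration, (6) can be written directly as a function of the trajectory angle; the value of $\gamma$ is a tuning constant that is not specified in this excerpt.

```python
import numpy as np

def hip_distance(theta, D0, gamma=1.0):
    """Apparent hip distance D(theta), following the S-curve of (6).

    theta: angle between the step trajectory and the image x-axis (radians).
    D0: maximum hip distance relative to the leg size (half the summed
        thigh widths in the paper's implementation).
    gamma: empirical factor controlling the speed of the curve descent.
    """
    return D0 * (1.0 - np.exp(-gamma * theta)) / (1.0 + np.exp(-gamma * theta))
```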

Therefore, the hip distance is estimated at each frame based on the trajectory angle. This estimation is performed by fitting cubic splines to all pivot points [see Fig. 3(a) and (b)].

At the end of the step, the new pose of the swinging leg is known, i.e., the positions of the hip, knee, and ankle. Therefore, the sizes of both limbs at the beginning and end of the step are available. Their values are used to constrain the limb size parameters during the whole step.

Consequently, our model allows us to introduce two gait constraints that help both the forward-tracking and backtracking processes to improve the results for the swing leg: 1) the size constraints and 2) the hip distance constraints. Because this information is known a posteriori, it can only be applied to the auxiliary tracking process.

2) Multiple-Particle-Filter Tracking: One of the most challenging problems of 2-D tracking is to deal with the perspective effect, which amplifies changes in trajectories and, therefore, can create major variations in the target's size. Therefore, the use of a simple first-order model does not allow adequately representing size dynamics. Because our tracking framework is based on a full step where heel strike positions are known, the final position of a step is partially reinitialized. Consequently, information is available to define the trajectory of the target during each step. Moreover, new tracking constraints with regard to maximum and minimum apparent limb sizes and distances between the hip points during the step are derived.

This last constraint provides a reference point for the swing leg, similar to the way a pivot point restricts the location of the support leg. These new constraints, which were not initially available when the standard tracker operated, significantly reduce the complexity of the tracking problem. Furthermore, when using a particle-filter-based tracker, the probability of divergence increases after each prediction: the closer a frame is to the initialization frame, the more accurate the estimation is likely to be.

To take advantage of these new constraints and tackle this inherent tracker weakness, we propose that once the standard tracker has processed a full step, two new trackers are launched in parallel. These trackers have the same configuration and dynamical models, enhanced by the constraints extracted from the output of the standard tracker. The forward tracker starts from the first frame of the step, whereas the backward tracker begins at the last frame and tracks targets backward.


Fig. 6. Configurations of the pixel map sampling points for the color and edge measurements. The sampling points for the color measurements are defined by a grid that samples these regions, whereas the edge measurements are located along the contours of the regions that compose the articulated model.

function that takes into account the reduction of reliability over time, i.e.,

$$f_{\text{rel}}(t) = e^{\lambda(1-t)} \tag{8}$$

where $\lambda$ is an empirical factor that models accuracy degradation. It is set to 1 for walking sequences.
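The page that introduces this weighting is missing from this excerpt, so the exact fusion rule is not shown. The sketch below therefore illustrates one plausible use of (8), a reliability-weighted average of the forward and backward estimates over a step, and should be read as our assumption rather than the authors' formula.

```python
import numpy as np

def f_rel(t, lam=1.0):
    """Reliability of an estimate t frames after its initialization, per (8)."""
    return np.exp(lam * (1.0 - t))

def merge_estimates(fwd, bwd, lam=1.0):
    """Assumed fusion of forward/backward tracks over an L-frame step.

    fwd, bwd: arrays of shape (L, state_dim). fwd[t] is initialized at the
    first frame of the step and bwd[t] at the last one, so their
    reliabilities decay in opposite directions along the step.
    """
    L = len(fwd)
    merged = np.empty_like(fwd)
    for t in range(L):
        wf = f_rel(t + 1, lam)   # forward tracker ages as t grows
        wb = f_rel(L - t, lam)   # backward tracker ages toward the step start
        merged[t] = (wf * fwd[t] + wb * bwd[t]) / (wf + wb)
    return merged
```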

3) Predictive Motion Model and Likelihood Function of Particle Filters: We use simple first-order dynamic models to track location, size, and angular parameters, because they are sufficiently accurate for modeling motion between successive frames during a single step, and motion nonlinearities are taken care of by the aforementioned biomechanics constraints, i.e., the hip distance $D(\theta)$ and the size constraint in the auxiliary tracker.

We use a simple constant-velocity dynamic model $X_{\text{leg}}^{t} = F X_{\text{leg}}^{t-1}$. We can express $F$ as the following dynamic matrix:

$$F = \begin{bmatrix}
1 & 0 & dt & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & dt & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & dt & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & dt & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & dt & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & dt \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{9}$$

where $dt$ is the time lapse between two frames.
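The block structure of (9), three independent pairs of values and their derivatives, can be generated programmatically instead of being typed out; a sketch:

```python
import numpy as np

def dynamic_matrix(dt):
    """Build the 12x12 first-order dynamic matrix F of (9).

    State layout as in (4): [x, y, dx, dy, th1, th2, dth1, dth2,
                             l1, l2, dl1, dl2].
    Each non-derivative entry at index i is advanced by its derivative,
    stored at index i + 2 within its block of four.
    """
    F = np.eye(12)
    for i in (0, 1, 4, 5, 8, 9):  # indices of the non-derivative entries
        F[i, i + 2] = dt
    return F
```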

An adequate likelihood function must be applied to track

the targets. To weigh each hypothesis, several visual features are combined, i.e., color and edges (see Fig. 6). Color is a discriminative feature that differentiates not only between object and background but also between objects. Moreover, it is pose invariant. Edges also provide a good visual feature due to the continuity of the human limbs. Because of their invariance

to color, lighting, and pose, they are particularly useful to deal with self-occlusions between limbs [44].

We assume that these features are independent of each other; therefore, we can combine them to obtain the following observation probability:

$$p(z_t | x_t) = p\left(z_t^1 \middle| x_t\right) p\left(z_t^2 \middle| x_t\right) \tag{10}$$

where $z_t^1$ and $z_t^2$ are the color and edge observations, respectively, and $x_t = [X_{\text{leg}}^{\text{left}}, X_{\text{leg}}^{\text{right}}]$ is the state vector.

Color features are obtained by sampling each region with a grid and expressing the color information by red-green-blue (RGB) values that are subsampled to 4 b per channel to filter out noise and small variations. The color density is measured by comparing the color feature of each region of the articulated model with its corresponding color model. It is evaluated by estimating the Bhattacharyya coefficient between their histograms, i.e.,

$$p\left(z_t^1 \middle| x_t\right) = \prod_{r \in R(x_t)} \left( \sum_{h=1}^{H} \sqrt{s_r(h)\, q_r(h)} \right)^{\lambda_c} \tag{11}$$

where $r$ is each body part that belongs to the set $R$ of regions from the articulated model $x_t$, $H$ is the number of histogram bins, $s_r(h)$ is the current histogram, $q_r(h)$ is the reference model, and $\lambda_c > 0$ is an empirical factor to strengthen the discriminative power of the feature.
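A sketch of the color term: grid-sampled RGB pixels are quantized to 4 bits per channel as described above, histogrammed, and compared with the reference model through the Bhattacharyya coefficient. Combining regions by a product follows our reconstruction of (11); the binning details are our choice.

```python
import numpy as np

def bhattacharyya(s, q):
    """Bhattacharyya coefficient between two normalized histograms."""
    return np.sum(np.sqrt(s * q))

def color_likelihood(region_pixels, reference_hists, lambda_c=1.0):
    """Color observation probability p(z1 | x) following (11).

    region_pixels: dict mapping region name -> (n, 3) uint8 array of RGB
                   samples taken on a grid inside that region.
    reference_hists: dict mapping region name -> normalized 4096-bin
                     reference histogram q.
    """
    p = 1.0
    for r, pixels in region_pixels.items():
        q4 = pixels.astype(np.int32) >> 4                 # 4 b per channel
        idx = (q4[:, 0] << 8) | (q4[:, 1] << 4) | q4[:, 2]
        s = np.bincount(idx, minlength=16 ** 3).astype(float)
        s /= s.sum()                                      # current histogram s
        p *= bhattacharyya(s, reference_hists[r]) ** lambda_c
    return p
```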

A gradient detector is used to detect edges, and the result is thresholded to eliminate spurious edges. The Canny algorithm is applied for this purpose. The result is smoothed with a Gaussian filter and is normalized between 0 and 1. The resulting density image $P_e$ assigns a value to each pixel according to its proximity to an edge by using the Euclidean distance transform as follows:

$$p\left(z_t^2 \middle| x_t\right) = \prod_{r \in R(x_t)} \left( \frac{1}{N} \sum_{i=1}^{N} P_e(I_t, i) \right)^{\lambda_e} \tag{12}$$

where $I_t$ is the original image in RGB, $r$ represents each of the regions that compose the articulated model, $N$ is the number of pixels that compose the region, and $\lambda_e > 0$ is an empirical factor similar to $\lambda_c$. By default, both factors are assigned the same value. However, their weights can be adjusted to bias the probability density function toward the feature that is believed to be the most informative in a given scene.
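The edge term can be sketched as follows; the Canny thresholds are assumptions, and the Gaussian falloff applied to the distance transform stands in for the smoothing and normalization step described in the text.

```python
import cv2
import numpy as np

def edge_density_map(image_bgr, sigma=2.0):
    """Edge proximity map P_e: Canny edges, then a Euclidean distance
    transform to the nearest edge, mapped smoothly to [0, 1]."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                     # thresholded edges
    dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
    return np.exp(-dist ** 2 / (2.0 * sigma ** 2))       # 1 on edges, -> 0 far

def edge_likelihood(Pe, region_masks, lambda_e=1.0):
    """Edge observation probability p(z2 | x) following (12).

    region_masks: dict mapping region name -> boolean mask of the N pixels
                  sampled along that region's contour (cf. Fig. 6).
    """
    p = 1.0
    for r, mask in region_masks.items():
        p *= Pe[mask].mean() ** lambda_e   # (1/N) * sum of P_e over the region
    return p
```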

    IV. RESULTS

Our approach was evaluated over data sets that have been produced as benchmarks for the scientific community to evaluate and compare different tracking and pose recovery methodologies. First, we have used the HE data sets I and II, where motion capture and video data were synchronously collected [5]. Because the cameras are calibrated, motion capture data provide the ground truth not only for 3-D pose recovery but also for 2-D pose recovery by projection on the 2-D sequences. A standard set of error metrics is also defined for the evaluation of both pose estimation and tracking algorithms. Second, we have tested our solution with a well-known outdoor sequence

[44], where the ground truth was obtained by carefully annotating the location of the limbs by hand.


Fig. 7. Tracking error for each frame of part 1 of the S2_Combo_1_(C1) (HE II) sequence. Magenta and dark blue vertical lines are, respectively, the manual and automatic detection at the beginning and end of a step. The red dashed-dotted line is the error when using a single particle filter, the green dashed line shows the error when using the multiple-particle-filter framework with the TI strategy, and the blue solid line shows the error when using the multiple-particle-filter framework with the double-tracking strategy (T2).

    Fig. 8. Numerical results for the Sidenbladh sequence (two-trackers strategy).

    A. Test Sets and Evaluation Metrics

Our algorithms were tested with three indoor sequences from the HE data sets (S2_Walking_1_C1 and S2_Combo_2_C1 from HE I and S2_Combo_1_C1 from HE II) and with the outdoor Sidenbladh sequence. Because the S2_Combo_1_C1 sequence from HE II is particularly long, we divided it into two parts. Part 1 is at the beginning of the sequence and corresponds to

walking, and part 2 is at the end and shows some balancing (see Table I for details). These sequences were chosen to include a variety of movements (e.g., walking a complete circle and balancing) that are seen in indoor and outdoor environments from different points of view and mainly happen outside the camera plane (see Figs. 9-13).

The pose of a human body can be represented using $M$ virtual markers; therefore, the state of the body can be written as $X = \{x_1, x_2, \ldots, x_M\}$, where $x_m \in \mathbb{R}^2$ (a 2-D body model is used) is the position of marker $m$ in the image. The error between the estimated pose $\hat{X}$ and the ground-truth pose $X$ can then be expressed as the average absolute distance between individual markers. To ensure a fair comparison between algorithms that use different numbers of parts, a binary selection variable per marker $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_M\}$ was added [5]. Therefore, the final proposed error metric is given as follows:

$$D(X, \hat{X}, \lambda) = \frac{\sum_{m=1}^{M} \lambda_m \left\| x_m - \hat{x}_m \right\|}{\sum_{i=1}^{M} \lambda_i} \tag{13}$$

where $\lambda_m = 1$ if the evaluated algorithm can recover marker $m$; otherwise, it is 0.

For a sequence of $T$ frames, we can compute the average performance $\mu_{\text{seq}}$ and the standard deviation of the performance $\sigma_{\text{seq}}$ by using the following equations:

$$\mu_{\text{seq}} = \frac{1}{T} \sum_{t=1}^{T} D(X_t, \hat{X}_t, \lambda_t) \tag{14}$$

$$\sigma_{\text{seq}} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( D(X_t, \hat{X}_t, \lambda_t) - \mu_{\text{seq}} \right)^2}. \tag{15}$$
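Equations (13)-(15) translate directly into code; a short sketch:

```python
import numpy as np

def marker_error(X, X_hat, lam):
    """Average marker distance D(X, X_hat, lambda), per (13).

    X, X_hat: (M, 2) arrays of ground-truth and estimated marker positions.
    lam: (M,) binary vector; 1 where the algorithm recovers the marker.
    """
    d = np.linalg.norm(X - X_hat, axis=1)
    return np.sum(lam * d) / np.sum(lam)

def sequence_stats(X_seq, X_hat_seq, lam_seq):
    """Mean and standard deviation over a sequence, per (14) and (15)."""
    errs = np.array([marker_error(X, Xh, l)
                     for X, Xh, l in zip(X_seq, X_hat_seq, lam_seq)])
    return errs.mean(), errs.std()
```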

    B. Experimental Results and Discussion

We report experiments that were conducted first with the HE sequences and then with the outdoor data. Although experiments were performed with numbers of particles ranging from 200 to 500 in the particle filters, the number did not affect tracking accuracy. Because the pivot point detector can produce erroneous locations (an average error of 20 pixels was measured for the HE sequences), this result negatively affects the tracking module. To independently analyze the tracking algorithm, results where manual annotation was used to define pivot points are also provided (see Table I). The mean error increases from 13.5 pixels to 15.1 pixels when tracking is combined with automatic pivot point detection.

Fig. 7 shows a frame-by-frame comparison of pose reconstruction errors between a single particle filter (without backtracking or feedback) and two trackers built on our multiple-particle-filter framework, with and without the addition of a Kalman filter, respectively called the double (T2) and integrated (TI) trackers. Not only does our system perform significantly better than a single particle filter, but this chart also highlights one of the strengths of our proposition: tracking can recover from serious divergence because of the partial reinitialization provided by the detection of pivot points and trajectory constraints. For example, although the integrated tracker starts diverging around frame 200, where limbs reach their apparent maximum size and are self-occluded, legs are accurately

labeled on frames 219 and 242 (see Figs. 7 and 9). The figure also shows that T2 is more accurate than TI. However, because T2 relies on the blob position, its incorrect estimation, e.g., around frame 250, may temporarily cause poor pose reconstruction until the tracker's recovery. Analysis of the data of the column "Automatic pivot point detection" in Table I, which corresponds to the practical usage of our system, reveals that the T2 strategy not only generally improves the mean accuracy of recovered poses but is also much more stable than TI; T2 is, on average, 14% more accurate, with a standard deviation that is 35% smaller in the case of automatic pose recovery.

Table II shows how our results compare with other techniques that are used to recover either 2-D or 3-D poses from the HE data sets. When authors only provided mean errors for 3-D poses, these were converted into pixels by using approximate


TABLE I. COMPARISON OF PERFORMANCES OF THE T2 AND TI STRATEGIES BY USING EITHER MANUAL OR AUTOMATIC PIVOT POINT DETECTION

    Fig. 9. Results for the S2_Combo_1_(C1) (HE II) sequence. Frames: 1, 26, 51, 76, 101, 126, 151, 176, 201, 226, 251, 276, 291.

    Fig. 10. Results for the S2_Combo_2_(C1) (HE I) sequence. Frames: 1661, 1731, 1801, 1871, 1941, 2011, 2041, 2071.

relationships between pixel and object lengths for each of the HE data sets. Thus, for a subject height between 250 and 410 pixels and an assumed human height of 1.80 m [11], a one-pixel error is equivalent to an error of 4.4-7.2 mm, depending on the position of the person in the image and the perspective.

Most methods perform similarly to our method on the HE data sets, i.e., a pixel error in the 12-15 and 17-20 ranges for,

respectively, HE I and HE II. Howe [9], [10], Poppe [11], and Okada and Soatto [48] present example-based approaches to pose recovery but use very different image descriptors, i.e., silhouettes (first paper) and histograms of oriented gradients (last two papers). Rogez et al. [47] have recently proposed

    a spatiotemporal 2-D-model that allows a monocular poserecovery, where the 2-D limitations are tackled by the use of a


    Fig. 11. Results for the S2_Walking_1_(C1) (HE I) sequence. Frames: 6, 46, 86, 126, 166, 206, 246, 286, 326, 366, 406, 423.

    Fig. 12. Results for the S2_Combo_1_(C1) (HE II) sequence. Frames: 748, 818, 888, 958, 1028, 1098, 1168, 1224.

probabilistic transition matrix. Finally, the hierarchical particle filter that was proposed by Husz et al. [13] relies on a motion

model based on action primitives, which predicts the next pose in a stochastic manner. Although their tracker performs similarly to the other methods when two or more camera sequences are available, its performance significantly degrades when processing a single sequence. The main drawback of all these methods is that they are action specific, and therefore, they cannot track individuals that display either unexpected motions or a combination of motions. The only approach that presents much more accurate results is proposed by Lee and Elgammal [12]. Their work is based on a manifold whose topology is learned using a training set. Although they can claim a joint mean accuracy of 31 mm, i.e., five to seven pixels, their approach relies on an even more constrained scenario: walking sequences or cyclic

activities that have to be explicitly learnt. The outcome of this comparison is that, first, because our framework is based on

a generative approach, our approach is the only one that does not require any training phase and can therefore recover human

poses of unusual movements, as shown in Figs. 10 and 12. Second, although our scheme does not rely on a constrained environment, it can produce results whose accuracy is similar to most state-of-the-art techniques.

Finally, our T2 strategy was tested on outdoor data (see Fig. 13). Quantitative results for the Sidenbladh sequence confirm the accuracy and robustness of our method (see Fig. 8). The resolution of these data is about half the resolution of the HE data set; therefore, pixel accuracy cannot directly be compared with that obtained with HE. However, we could estimate that, at equal resolution, an accuracy of about 17 pixels would be achieved, which is in line with the values shown in Table I. This experiment demonstrates the generality of our method to an

environment with different image resolutions, perspectives, and illumination conditions, i.e., indoor and outdoor scenes.


    Fig. 13. Results for the Sidenbladh sequence. Frames: 15, 30, 45, 60, 75, 90, 105, 120, 135, 150.

TABLE II. COMPARISON WITH STATE OF THE ART

    V. CONCLUSION

This paper has introduced a novel framework based on a set of Kalman and particle filters to track human body parts from a single camera. Its main contribution is the use of a 2-D articulated model that is constrained by human biomechanics. We have shown that such a 2-D model is as accurate at tracking 3-D motions as 3-D models. The use of a 2-D model not only reduces the computational complexity of tracking human body parts but also simplifies tracker initialization. Moreover, risks of divergence are reduced by our framework's capacity for partial reinitialization at each step.

As demonstrated in experiments with walking and balancing sequences, the main advantage of our system is that it can handle any bipedal motion and is not constrained to specific activities, unlike most other methods. The only limitation of the system is that the pivot point should not be occluded for an extended period of time. To deal with this situation, more advanced reinitialization methods should be integrated into the system [49].

REFERENCES

[1] Y. Wu, G. Hua, and T. Yu, "Tracking articulated body by dynamic Markov network," in Proc. ICCV, 2003, pp. 1094-1101.
[2] G. McAllister, S. J. McKenna, and I. W. Ricketts, "MLESAC-based tracking with 2-D revolute-prismatic articulated models," in Proc. ICPR, Quebec City, QC, Canada, 2002, vol. 2, pp. 725-728.
[3] P. Noriega and O. Bernier, "Multicues 2-D articulated pose tracking using particle filtering and belief propagation on factor graphs," in Proc. ICIP, 2007, vol. 5, no. 2, pp. 725-728.
[4] I. Bouchrika and M. S. Nixon, "People detection and recognition using gait for automated visual surveillance," in Proc. IET Conf. Crime Security, 2006, pp. 576-581.
[5] L. Sigal and M. J. Black, "HumanEva: Synchronized video and motion capture data set for evaluation of articulated human motion," Brown Univ., Providence, RI, Tech. Rep. CS-06-08, 2006.
[6] J. Xue, N. Zheng, J. Geng, and X. Zhong, "Tracking multiple visual targets via particle-based belief propagation," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 38, no. 1, pp. 196-209, Feb. 2008.
[7] L. Li, W. Huang, I. Y.-H. Gu, R. Luo, and Q. Tian, "An efficient sequential approach to tracking multiple objects through crowds for real-time intelligent CCTV systems," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 38, no. 5, pp. 1254-1269, Oct. 2008.
[8] L. Sigal and M. J. Black, "Predicting 3-D people from 2-D pictures," in Proc. AMDO, 2006, pp. 185-195.
[9] N. R. Howe, "Evaluating lookup-based monocular human pose tracking on the HumanEva test data," in Proc. Workshop Eval. Articulated Human Motion Pose Estimation (EHuM2), 2007.
[10] N. R. Howe, "Recognition-based motion capture and the HumanEva II test data," in Proc. Workshop Eval. Articulated Human Motion Pose Estimation (EHuM2), 2007.
[11] R. Poppe, "Evaluating example-based pose estimation: Experiments on the HumanEva sets," in Proc. Workshop Eval. Articulated Human Motion Pose Estimation (EHuM2), 2007.
[12] C. S. Lee and A. Elgammal, "Body pose tracking from uncalibrated camera using supervised manifold learning," in Proc. NIPS Workshop Eval. Articulated Human Motion Pose Estimation (EHuM), Whistler, BC, Canada, 2006.
[13] Z. L. Husz, A. M. Wallace, and P. R. Green, "Evaluation of a hierarchical partitioned particle filter with action primitives," in Proc. Workshop Eval. Articulated Human Motion Pose Estimation (EHuM2), 2007.


[14] J. Deutscher and I. D. Reid, "Articulated body motion capture by stochastic search," Int. J. Comput. Vis., vol. 61, no. 2, pp. 185-205, Feb. 2005.
[15] P. F. Felzenszwalb and D. P. Huttenlocher, "Pictorial structures for object recognition," Int. J. Comput. Vis., vol. 61, no. 1, pp. 55-79, Jan. 2005.
[16] M. W. Lee and I. Cohen, "Human body tracking with auxiliary measurements," in Proc. IEEE Int. Workshop AMFG, 2003, pp. 112-119.
[17] J. MacCormick and M. Isard, "Partitioned sampling, articulated objects, and interface-quality hand tracking," in Proc. ECCV, 2000, vol. 2, pp. 3-19.
[18] C. Sminchisescu and B. Triggs, "Covariance scaled sampling for monocular 3-D body tracking," in Proc. CVPR, 2001, vol. 1, pp. 447-454.
[19] M. Isard and A. Blake, "Condensation: Conditional density propagation for visual tracking," Int. J. Comput. Vis., vol. 29, no. 1, pp. 5-28, Aug. 1998.
[20] J. Deutscher, A. Blake, and I. Reid, "Articulated body motion capture by annealed particle filtering," in Proc. CVPR, 2000, vol. 2, pp. 126-133.
[21] K. Choo and D. Fleet, "People tracking using hybrid Monte Carlo filtering," in Proc. ICCV, 2001, pp. 321-328.
[22] C. M. Fryer, "Biomechanics of the lower extremity," Instruct. Course Lect., vol. 20, pp. 124-130, 1971.
[23] S. X. Ju, M. J. Black, and Y. Yacoob, "Cardboard people: A parameterized model of articulated image motion," in Proc. 2nd Int. Conf. Autom. Face Gesture Recog. (FG), 1996, pp. 38-44.
[24] J. O'Rourke and N. I. Badler, "Model-based image analysis of human motion using constraint propagation," IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-2, no. 6, pp. 522-536, Nov. 1980.
[25] D. Hogg, "Model-based vision: A program to see a walking person," Image Vis. Comput., vol. 1, no. 1, pp. 5-20, Feb. 1983.
[26] Z. Chen and H. Lee, "Knowledge-guided visual perception of 3-D human gait from a single image sequence," IEEE Trans. Syst., Man, Cybern., vol. 22, no. 2, pp. 336-342, Mar./Apr. 1992.
[27] K. Rohr, "Towards model-based recognition of human movements in image sequences," CVGIP: Image Underst., vol. 59, no. 1, pp. 94-115, Jan. 1994.
[28] J. M. Rehg and T. Kanade, "Model-based tracking of self-occluding articulated objects," in Proc. 5th ICCV, 1995, pp. 612-617.
[29] D. M. Gavrila and L. S. Davis, "3-D model-based tracking of humans in action: A multiview approach," in Proc. Conf. CVPR, 1996, pp. 73-80.
[30] I. A. Kakadiaris and D. Metaxas, "Model-based estimation of 3-D human motion with occlusion based on active multiviewpoint selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1996, pp. 81-87.
[31] A. Elgammal and C. Lee, "Inferring 3-D body pose from silhouettes using activity manifold learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2004, pp. 681-688.
[32] B. Li, Q. Meng, and H. Holstein, "Articulated pose identification with sparse point features," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 34, no. 3, pp. 1412-1422, Jun. 2004.
[33] R. Li, M. H. Yang, S. Sclaroff, and T. P. Tian, "Monocular tracking of 3-D human motion with a coordinated mixture of factor analyzers," in Proc. ECCV, 2006, vol. 2, pp. 137-150.
[34] Q. Wang, G. Xu, and H. Ai, "Learning object intrinsic structure for robust visual tracking," in Proc. CVPR, 2003, vol. 2, pp. 227-233.
[35] A. Safonova, J. K. Hodgins, and N. S. Pollard, "Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces," ACM Trans. Graph., vol. 23, no. 3, pp. 514-521, Aug. 2004.
[36] A. Rahimi, B. Recht, and T. Darrell, "Learning appearance manifolds from video," in Proc. IEEE Comput. Soc. Conf. CVPR, 2005, vol. 1, pp. 868-875.
[37] R. Urtasun, D. J. Fleet, and P. Fua, "Monocular 3-D tracking of the golf swing," in Proc. IEEE Comput. Soc. Conf. CVPR, 2005, vol. 2, pp. 932-938.
[38] R. Urtasun, D. J. Fleet, and P. Fua, "3-D people tracking with Gaussian process dynamical models," in Proc. IEEE Comput. Soc. Conf. CVPR, 2006, pp. 238-245.
[39] R. Urtasun, D. J. Fleet, A. Hertzmann, and P. Fua, "Priors for people tracking from small training sets," in Proc. 10th IEEE ICCV, 2005, vol. 1, pp. 403-410.
[40] C. Sminchisescu and A. Jepson, "Generative modeling for continuous nonlinearly embedded visual inference," in Proc. 21st ICML, 2004, p. 96.
[41] S. B. Hou, A. Galata, F. Caillette, N. Thacker, and P. Bromiley, "Real-time body tracking using a Gaussian process latent variable model," in Proc. ICCV, 2007, pp. 1-8.
[42] J. M. Rehg, D. D. Morris, and T. Kanade, "Ambiguities in visual tracking of articulated objects using two- and three-dimensional models," Int. J. Robot. Res., vol. 22, no. 6, pp. 393-418, Jun. 2003.
[43] E. A. Hunter, P. H. Kelly, and R. C. Jain, "Estimation of articulated motion using kinematically constrained mixture densities," in Proc. IEEE Workshop Motion Nonrigid Articulated Objects (NAM), 1997, pp. 10-17.
[44] H. Sidenbladh, M. J. Black, and D. J. Fleet, "Stochastic tracking of 3-D human figures using 2-D image motion," in Proc. 6th ECCV, Part II, 2000, pp. 702-718.
[45] T. Tian, R. Li, and S. Sclaroff, "Articulated pose estimation in a learned smooth space of feasible solutions," in Proc. IEEE Comput. Soc. Conf. CVPR, 2005, p. 50.
[46] J. Martínez, J. C. Nebel, D. Makris, and C. Orrite, "Tracking human body parts using particle filters constrained by human biomechanics," in Proc. BMVC, Leeds, U.K., 2008.
[47] G. Rogez, C. Orrite, and J. Martínez-del-Rincón, "A spatiotemporal 2-D-models framework for human pose recovery in monocular sequences," Pattern Recognit., vol. 41, no. 9, pp. 2926-2944, Sep. 2008.
[48] R. Okada and S. Soatto, "Relevant feature selection for human pose estimation and localization in cluttered images," in Proc. ECCV, 2008, pp. 434-445.
[49] P. Kuo, T. Ammar, M. Lewandowski, D. Makris, and J.-C. Nebel, "Exploiting human bipedal motion constraints for 3-D pose recovery from a single uncalibrated camera," in Proc. Int. Conf. Comput. Vis. Theory Appl. (VISAPP), 2009, pp. 557-564.
[50] A. Utsumi, H. Mori, J. Ohya, and M. Yachida, "Multiple-view-based tracking of multiple humans," in Proc. 14th ICPR, 1998, vol. 1, pp. 597-601.
[51] L. Raskin, E. Rivlin, and M. Rudzsky, "Using Gaussian process annealing particle filter for 3-D human tracking," EURASIP J. Adv. Signal Process., vol. 2008, pp. 1-13, 2008.
[52] Z. Lu, M. Carreira-Perpiñán, and C. Sminchisescu, "People tracking with the Laplacian eigenmaps latent variable model," in Proc. NIPS, 2007, pp. 1705-1712.
[53] X. Li, S. J. Maybank, S. Yan, D. Tao, and D. Xu, "Gait components and their application to gender recognition," IEEE Trans. Syst., Man, Cybern. C: Appl. Rev., vol. 38, no. 2, pp. 145-155, Mar. 2008.
[54] J. Darby, B. Li, N. Costen, D. Fleet, and N. Lawrence, "Backing off: Hierarchical decomposition of activity for 3-D novel pose recovery," in Proc. BMVC, London, U.K., 2009.
[55] J. Wang and Y. Yagi, "Adaptive mean-shift tracking with auxiliary particles," IEEE Trans. Syst., Man, Cybern. B: Cybern., vol. 39, no. 6, pp. 1578-1589, Dec. 2009.

Jesús Martínez del Rincón received the M.Eng. degree in telecommunications and the Ph.D. degree in biomedical engineering from the University of Zaragoza, Zaragoza, Spain, in 2003 and 2008, respectively.

He is currently a Research Fellow with the Faculty of Computing, Information Systems and Mathematics, Kingston University, London, U.K. His research interests include aspects of computer vision such as human motion analysis, activity recognition, and multitarget tracking in real time.

Dimitrios Makris (M'03) received the Diploma degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1999 and the Ph.D. degree in computer vision from City University, London, U.K., in 2004.

He is currently a Senior Lecturer with the Faculty of Computing, Information Systems and Mathematics, Kingston University, London, U.K. His research interests include motion analysis and human pose recovery.

Dr. Makris is an Elected Member of the Executive Committee of the British Machine Vision Association (BMVA).

Carlos Orrite Uruñuela (M'06) received the M.Eng. degree in industrial engineering, the M.Sc. degree in biomedical engineering, specializing in medical instrumentation, and the Ph.D. degree in computer vision from the University of Zaragoza, Zaragoza, Spain, in 1989, 1994, and 1997, respectively.

He is currently an Associate Professor with the Department of Electronics and Communications Engineering, University of Zaragoza, Zaragoza, Spain, where he also carries out his research activities in the Aragon Institute of Engineering Research (I3A). His research interests include computer vision and human-machine interfaces.

Jean-Christophe Nebel (M'08, SM'09) received the M.Sc. (Eng.) degree in electronics and signal processing from the Institute of Chemistry and Industrial Physics, Lyon, France, in 1992 and the Ph.D. degree in parallel programming from the University of St. Etienne, St. Etienne, France, in 1997.

He is currently an Associate Professor with the Faculty of Computing, Information Systems and Mathematics, Kingston University, London, U.K. His research interests include computer vision and bioinformatics.

Dr. Nebel is a corecipient of the 2004 A. H. Reeve Premium from the Council of the Institute of Electrical and Electronics Engineers for being a coauthor of a journal paper that describes pioneering work in developing a 3-D dynamic whole-body measurement system.