
A piece-wise Kalman-filter based body joint tracking scheme for low resolution interlaced imagery

Binu M. Nair,a Kimberly D. Kendricks,b Vijayan K. Asari,a Ronald F. Tuttlec

aUniversity of Dayton Vision Lab, 300 College Park, Dayton, OH 45469

bCentral State University, 1400 Brush Row Road, Wilberforce, OH 45384

cAir Force Institute of Technology, 2950 Hobson Way, OH 45433

Abstract. We propose an efficient scheme to track joint locations in low resolution interlaced imagery based on primitive low-level image features. The proposed scheme consists of piece-wise linear trackers in which we model the sinusoidal joint trajectory as a series of linear sub-trajectories, with an LBP-based region matching applied at the boundaries. Each sub-trajectory is modeled by a Kalman filter which tracks the optical flow at the joint region in successive frames. When an optical flow mismatch occurs, signaling the end of a sub-trajectory, the tracking scheme switches to an LBP-based region matching. This region-based match, along with a coarse joint location estimate from either a human pose estimation or a point light algorithm, provides a search region on which a second pass of region matching is performed. The corresponding Kalman filter is reinitialized with this finer joint location estimate and switches to tracking the optical flow along the next sub-trajectory. The tracking scheme is tested on a dataset containing 24 sequences of individuals walking across the face of a building. Three statistical measures are computed to describe its efficiency in terms of spatio-temporal closeness and multiple object tracking accuracy: trajectory covariance, MOTA/MOTP, and precision/recall scores.

Keywords: Joint Tracking, Local Binary Patterns, Kalman Filter, Lucas-Kanade Optical Flow, Region-based Matching, Multiple Object Tracking, Sinusoidal Trajectory Modeling.

    Address all correspondence to: Binu M Nair, University of Dayton Vision Lab, E-mail: [email protected]

    1 Introduction

Detection, tracking, and localization of objects of interest in a scene are important aspects of computer vision, with applications in surveillance scenarios ranging from target identification and tracking in aerial imagery to object detection and location estimation in videos from CCTV cameras. In recent years, research in object tracking from surveillance videos has largely been restricted to detecting and tracking large objects in the scene, such as people in shopping malls, players on a soccer/basketball court,1, 2 and cars. Some research on tracking points in a scene has also been developed, where interest points detected on the human body can be tracked to obtain trajectories that differentiate between different actions.3-5 Research on joint tracking, which requires consistent estimation of a certain body


part has been tackled in scenarios such as indoor gaming applications, where depth sensors give accurate depth information and higher resolution video data captured at higher frame rates are available. But the problem of body joint detection and tracking from surveillance videos captured by low resolution CCTV cameras, without any depth information and at low frame rates, has not received much emphasis in today's research.

In this manuscript, we propose a novel tracking scheme which tracks some of the relevant body joints, such as the shoulder, elbow, wrist, waist, knee and ankle, across the scene given a coarse estimate of the joint location obtained from a human pose estimation algorithm or from a point light software. An illustration of the specific body joints used for tracking is shown in Figure 1a. The proposed scheme approximates the non-linear trajectory of a joint as a combination of successive linear sub-trajectories, where each one is modeled by a Kalman filter designed to track small non-linear variations between frames. The novel aspect of this scheme is that the boundaries of these sub-trajectories are not pre-defined and can in fact be determined on-line by detecting apparent mismatches of joint regions between frames along the linear sub-trajectory. In low resolution imagery, optical flow provides the best coarse measure to track regions or blobs moving in a linear (or slightly non-linear) fashion. By designing a Kalman filter to model the state transitions of the optical flow matches, we can detect mismatches when the optical flow match in the current frame does not fall into the Kalman filter predicted search region. By using this scheme of the Kalman filter to track optical flow matches on a linear sub-trajectory, its boundaries can be determined. This brings us to the next issue: finding a suitable joint location to reinitialize the Kalman tracker. This finer estimate of the joint location can be determined by using a 2-level region matching scheme, where at each level a coarse measure of the joint location is used to define a search region for the next level and, at the finer level, a minimum distance measure between LBP6


joint region descriptors is used. Figure 1b illustrates the conceptual modeling of the non-linear trajectory by piece-wise Kalman filters.

Fig 1: Joint illustration and its theoretical trajectory. (a) Illustration of specific joints on the human body to be tracked. (b) Illustration of the piecewise Kalman filter concept for joint tracking.

This paper is organized as follows: Section 2 reviews work related to various kinds of tracking research and briefly details the problems and issues being tackled. Section 3 explains the theoretical background which forms the foundation of the proposed tracking framework described in Section 4. Finally, we provide results of tracking the specific body joints in Section 5 and conclude the paper in Section 6 with some future proposals and ideas for further improvement of the algorithm.

    2 Related Work

Joint tracking research, or in other words the human body pose estimation problem, has been tackled by the research community in two different scenarios: one which uses depth information and the other which uses only the images. The former uses depth information generated either by a depth sensor such as the Kinect or by a 3D reconstruction algorithm from multiple video sources. This is suitable for applications in indoor scenarios such as gaming consoles or for


human interactive systems where high resolution imagery is available. The latter is used in surveillance scenarios which do not have any source of depth information, such as the video feed from CCTV cameras monitoring a parking lot or a shopping mall. One of the most recent and popular works was done by Shotton et al.7 for locating 3D joint positions in the human body from a depth image acquired by a Kinect sensor. They used a part-based recognition paradigm in which a difficult pose estimation problem is converted to an easier per-pixel classification problem, from which the 3D joint locations are subsequently estimated irrespective of pose, body shape or clothing. In a more recent approach by Huang et al.,8 human body pose is estimated and tracked across the scene using information acquired by a multi-camera system. Here, both the skeletal joint positions as well as the surface deformations (body shape changes) are estimated by fitting a reference surface model to the 3D point reconstructions from the multi-camera system. This also makes use of a learning scheme which divides the point reconstructions into rigid body parts for accurate estimation. But the above cases require the use of high resolution imagery under controlled lighting or environmental conditions to work with high accuracy. Another restriction of these methods is the dependence on depth information, either for direct usage or for point cloud reconstruction.

One of the earlier and popular works which does not use depth information and which uses only a single video camera to track human motion was done by Markus Kohler.9 He designed a Kalman filter to track human motion in such a way that non-linearity in motion can be considered as motion with constant velocity and changing acceleration, where the changing acceleration is modeled as white noise. The process noise co-variance of the Kalman filter is designed so as to incorporate this changing acceleration. In our proposed algorithm, we use a modification of this Kalman filter and the design of the process noise co-variance to track the body joints along a sub-trajectory across the video sequence. Kaaniche et al.4 used the extended version of the Kalman filter to track specific points or corners detected at every frame of the video sequence for the purpose of gesture recognition. Each point is described by a region descriptor such as the Histogram of Oriented Gradients (HOG), and the Kalman filter tracks the position of the corner by using a HOG-based region matching. For tracking specific joints, however, this methodology does not suffice, as any corner point which does not get matched with the previous frame gets discarded. Bilinski et al.5 extended this methodology by incorporating the object speed and orientation to track multiple objects under occlusion.

In recent years, the problem of human body pose estimation has not just been limited to tracking points or corners or using depth information. One of the state-of-the-art methods for human pose estimation on static images is the flexible mixture of parts model proposed by Yang and Ramanan.10 Instead of explicitly using a variety of oriented body part model templates (parameterized by pixel location and orientation) in a search-based template matching scheme, a family of affinely-warped templates is modeled, each template containing a mixture of non-oriented pictorial structures. This eliminates the need to estimate multiple degrees of freedom of a limb. Ramakrishna et al.11 proposed an occlusion-aware algorithm which tracks human body pose in a sequence, where the human body is modeled as a combination of single parts such as the head and neck and symmetric part pairs such as the shoulders, knees and feet. An important aspect of this algorithm is that it can differentiate between similar looking parts such as the left and right leg/arm, thereby giving a suitable estimate of the human pose. Although these methods show an increased accuracy on datasets such as the Buffy dataset12 and the Image Parse dataset,13 their performance on very low-resolution imagery with interlacing has not yet been evaluated. But one of the main advantages of these kinds of human pose estimation algorithms is that, in the post-processing stage, the various body part detections can provide coarse estimates of a joint location which can


then be used to initialize tracking schemes and track joint locations in subsequent frames of a video sequence. One such work was done by Burgos-Artizzu et al.,14 who propose a generalization of non-maximum suppression post-processing schemes to merge multiple pose estimates, either in a single frame or in multiple consecutive frames of a video sequence. This merging of estimates is done by a robust and constrained K-means clustering15 along the spatial domain for a single frame and along the spatio-temporal domain for a video sequence. Again, one of the main concerns is its dependence on multiple pose estimates, which relies on the ability of state-of-the-art pose estimation algorithms to operate on low-resolution interlaced imagery.

In our proposed joint tracking framework, we follow a track-by-detection scheme where we use optical flow matches and a Kalman filter to track joint locations lying on an approximately linear sub-trajectory, with suitable re-initialization of the joint tracker using an LBP-based region matching criterion. The coarse joint location estimates are used to re-initialize the tracker, and this happens only in certain frames or intervals which indicate the beginning/end of a sub-trajectory of a joint. In the next section, we describe the theory involved in the various modules of the tracking framework.

    3 Theory

This section describes the necessary theoretical background required for a deeper understanding of the proposed model for joint estimation and tracking. The main topics which will be covered are: a) Lucas-Kanade optical flow estimation, b) LBP-based region matching, and c) the Kalman filter. Our proposed methodology is a combination of these techniques designed to estimate and track joints in a low-resolution video, given coarse estimates of the joint locations.


Fig 2: Framework for computing optical flow and illustration. (a) Block schematic of optical flow computation to compute global velocity. (b) Optical flow illustration.

3.1 Lucas-Kanade Optical Flow

Optical flow between two frames of a video sequence estimates the velocity of a point in the real world scene by finding a relationship between the projections of that point in the corresponding frames. In other words, optical flow measures the velocity or movement of a pixel or region between two time instances. In our case, the point of interest is the corresponding joint of a human body, and we need to estimate the velocity of that joint in the current frame given its location in the previous frame. There exist two main methods for computing this velocity: one is the Horn-Schunck method, which imposes a global constraint (i.e., the entire image is used in the determination of the velocity of a single pixel), while the other is the Lucas-Kanade16 method, which is more localized (i.e., it considers only a neighborhood region around the point of interest, thereby setting a local constraint). Both methods are based on a single equation given by $I(x, y, t) = I(x + \Delta x, y + \Delta y, t + \Delta t)$. Here, let us consider that a pixel $p = (x, y)$ at time $t$ has moved to a position $p' = (x + \Delta x, y + \Delta y)$ at time $t + \Delta t$. The equation then assumes that the brightness of the pixel remains constant through its movement. This is one of the major assumptions of optical flow. The other assumptions are spatial coherence, where the point describing an object region does not change shape with time, and temporal persistence, where the motion in a pixel or region is purely due to the motion of the object and not


due to the camera movement. For tracking joint regions of the human body, the localized regions remain rigid or do not change shape, and thus the spatial coherence assumption is not violated. Since in our testing scenarios we use video sequences captured from a stationary camera with a constant background, the temporal persistence assumption is not violated either. So, for our purposes, we employ the Lucas-Kanade (L-K) optical flow estimation technique, which uses a local constraint. The optical flow constraint equation can be derived by a Taylor series expansion of the basic equation and is given by

$$\frac{\partial I}{\partial x} v_x + \frac{\partial I}{\partial y} v_y + \frac{\partial I}{\partial t} = 0 \quad \text{or} \quad \nabla I \cdot \mathbf{v} + I_t = 0 \tag{1}$$

where $(v_x, v_y)$ is the optical flow velocity of a pixel $p = (x, y)$. As mentioned earlier, the L-K method uses a local constraint. A small window region (local neighborhood) around the point $p = (x, y)$ is considered, and within this neighborhood a weighted least squares error

$$\sum_{x, y} W^2(x, y) \left[ \nabla I(x, y, t) \cdot \mathbf{v} + I_t(x, y, t) \right]^2 \tag{2}$$

is minimized. Using the above equation and the optical flow constraint equation (Equation 1), we can uniquely compute the solution $\mathbf{v}$. The assumption here is that the optical flow within that local region is constant. But there are some issues when computing the Lucas-Kanade optical flow. One issue is that the motion in the scene may not be small enough, in which case we would need the higher order terms in the optical flow constraint equation. The alternative approach is to use a pyramidal iterative Lucas-Kanade approach, where the image at a particular instant is downsampled to form a Gaussian pyramid and optical flow is computed at each level. The other issue is that a point in a local region


may not move like its neighbors. This brings us back to our earlier assumption of spatial coherence, where the objects or points to be tracked should be rigid. So, one of the important design criteria is to determine the ideal window size (local region size) for computing the optical flow at a certain point. For the joint tracking problem, this window size depends on the resolution of the video, and thus for poor resolutions we use a window size of 7×7. An illustration of optical flow estimation on the points of a human body silhouette is shown in Figure 2. For the proposed algorithm, we use optical flow estimation in two scenarios: one to compute the global velocity of the motion of the human body, and the other to find an estimate of the location of a particular body joint in the next instant. For the latter purpose, we compute the optical flow for every point surrounding the joint region using the L-K method and then compute the median flow.
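As a concrete illustration, the following minimal Python sketch (using OpenCV and NumPy, both assumed available) computes the median L-K flow of the pixels surrounding a joint; the region radius and pyramid depth are illustrative choices, not the tuned values used in our experiments.

import cv2
import numpy as np

def median_joint_flow(prev_gray, curr_gray, joint_xy, radius=3, win=(7, 7)):
    # Track every pixel in a small region around the joint with pyramidal L-K
    # and return the median displacement as the joint velocity estimate.
    x, y = joint_xy
    xs, ys = np.meshgrid(np.arange(x - radius, x + radius + 1),
                         np.arange(y - radius, y + radius + 1))
    pts = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(np.float32)
    pts = pts.reshape(-1, 1, 2)
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, pts, None, winSize=win, maxLevel=3)
    good = status.ravel() == 1
    if not good.any():
        return None  # flow failed everywhere: treat as a mismatch
    flow = (nxt[good] - pts[good]).reshape(-1, 2)
    return np.median(flow, axis=0)  # (vx, vy) for the joint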

    3.2 Region Descriptors

Region descriptors such as the local binary pattern (LBP)6 are used to describe the edge information and the textural content in a local region, and can be very effective descriptors for region-based image matching. Many efficient image descriptors exist, such as SIFT, ORB and HOG, but one of their assumptions is that the images are of high resolution. The LBP is very effective in describing an image region in spite of low resolution and interlacing. The local binary pattern is an image coding scheme which brings out the textural features in a region. For representing a joint region and associating a joint in successive frames, the texture of the region plays a vital part in addition to the edge information. The LBP considers an 8×8 local neighborhood in a joint region, and labels the neighborhood pixels by either a 1 or 0 based on the center pixel value. The coded value representing this local region is then the decimal representation of the neighborhood labels taken in clockwise manner. Thus, for every pixel within the joint region,


a coded value is generated which represents the underlying texture. The LBP operator is defined as

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c) 2^p, \qquad s(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases} \tag{3}$$

where $g_c$ is the gray value of the center pixel and $g_p$ are the gray values of its $P$ neighbors at radius $R$.
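A minimal sketch of the LBP-based region description and matching used later in the framework, assuming Python with scikit-image; the uniform-LBP histogram follows the framework's region descriptor, and the chi-square distance is the histogram measure assumed in the matching steps.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(patch, P=8, R=1):
    # Uniform LBP codes of a grayscale patch, pooled into a normalized histogram.
    codes = local_binary_pattern(patch, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=P + 2, range=(0, P + 2), density=True)
    return hist

def chi_square(f_j, f_p, eps=1e-10):
    # Chi-square distance between two LBP histograms; smaller means a better match.
    return 0.5 * np.sum((f_j - f_p) ** 2 / (f_j + f_p + eps))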

3.3 Kalman Filter

The Kalman filter17, 18 estimates the state $x_k \in \mathbb{R}^n$ of a discrete-time process governed by the linear stochastic difference equation

$$x_k = A x_{k-1} + q_{k-1} \tag{4}$$

with a measurement $z_k \in \mathbb{R}^m$ given by

$$z_k = H x_k + r_k \tag{5}$$

where $q_k$ and $r_k$ denote the process and measurement noise respectively,

with co-variances given by $Q = E[q_k q_k^T]$ and $R = E[r_k r_k^T]$. Here, we can define $P_k = E[e_k e_k^T]$ as the error co-variance matrix at time instant $k$, where we consider a prior estimate of the state $\hat{x}_k^-$ from the knowledge of the system and a posterior estimate of the state $\hat{x}_k$ after knowing the current measurement $z_k$. The error $e_k$ is then defined as the difference between the true state and the posterior state, $(x_k - \hat{x}_k)$. For obtaining a true value of a response (or state) generated by a process or system, an iterative procedure is used: first obtain a prior estimate $\hat{x}_k^-$ at instant $k$ from the posterior estimate $\hat{x}_{k-1}$ at instant $k-1$. Then, using the measured value of the response $z_k$, we compute the innovation or measurement residual $z_k - H\hat{x}_k^-$ and use this to obtain a posterior estimate $\hat{x}_k = \hat{x}_k^- + K_k(z_k - H\hat{x}_k^-)$. The Kalman gain $K_k$ at instant $k$ is given by

$$K_k = P_k^- H^T (H P_k^- H^T + R)^{-1} \tag{6}$$

where $P_k^- = E[e_k^- (e_k^-)^T]$, $e_k^- = (x_k - \hat{x}_k^-)$, and $P_k = (I - K_k H) P_k^-$. This iterative procedure can be divided into two stages: the time update (prediction stage) and the measurement update (correction stage). Thus, the Kalman filter can be thought of as a process which estimates the state at one instant and then obtains feedback in the form of noisy measurements of the response of the system.
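A minimal NumPy sketch of this two-stage recursion follows; A, H, Q and R stand for the transition, measurement and noise co-variance matrices defined in this section, and the function illustrates the standard equations rather than the exact implementation used in our experiments.

import numpy as np

def kalman_step(x_post, P_post, z, A, H, Q, R):
    # Time update (prediction stage)
    x_prior = A @ x_post                       # prior state estimate
    P_prior = A @ P_post @ A.T + Q             # prior error co-variance
    # Measurement update (correction stage)
    K = P_prior @ H.T @ np.linalg.inv(H @ P_prior @ H.T + R)  # gain, Eq. (6)
    x_post = x_prior + K @ (z - H @ x_prior)   # correct with the innovation
    P_post = (np.eye(len(x_post)) - K @ H) @ P_prior
    return x_post, P_post, x_prior, P_prior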

The recursive version of the Kalman filter can also be used for tracking purposes, and in the literature it has been widely applied for tracking points in video sequences. In the proposed algorithm, we use the Kalman filter to track a specific body joint across the scene. This is done by setting the state of the process (which in this case is the human body movement) as the $(x, y)$ coordinates of the joint along with its velocity $(v_x, v_y)$, to get a state vector $x_k \in \mathbb{R}^4$. The measurement vector $z_k = [x_o, y_o] \in \mathbb{R}^2$ is provided by the coarse estimates obtained by using a human body pose


    Fig 3: Joint tracking algorithm using Kalman filter.

estimation algorithm or point light software. By approximating the motion of a joint in a small time interval by a linear function, we can design the transition matrix $A$ so that the next state is a linear function of the previous states. As done by Kohler,9 to account for the non-constant velocity often associated with accelerating image structures, we use the process noise co-variance matrix $Q$ defined as

$$Q = \frac{a^2 \Delta t}{6} \begin{bmatrix} 2(\Delta t)^2 & 0 & 3\Delta t & 0 \\ 0 & 2(\Delta t)^2 & 0 & 3\Delta t \\ 3\Delta t & 0 & 6 & 0 \\ 0 & 3\Delta t & 0 & 6 \end{bmatrix} \tag{7}$$

where $a$ is the acceleration and $\Delta t$ is the time step determined by the frame rate of the camera.
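For reference, the following sketch builds this matrix for the state $[x, y, v_x, v_y]$, with the acceleration and time step set to illustrative values consistent with Section 5 (a = 0.1 pixels/frame², Δt = 1 frame).

import numpy as np

def kohler_process_noise(a=0.1, dt=1.0):
    # Process noise co-variance of Eq. (7) for the state [x, y, vx, vy].
    return (a ** 2 * dt / 6.0) * np.array([
        [2 * dt ** 2, 0.0,         3 * dt, 0.0   ],
        [0.0,         2 * dt ** 2, 0.0,    3 * dt],
        [3 * dt,      0.0,         6.0,    0.0   ],
        [0.0,         3 * dt,      0.0,    6.0   ],
    ])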

This design of the Kalman filter suits our scheme well, as any small non-linearity in the sub-trajectory can be accounted for as a non-constant velocity of the joint region. The modification of the Kalman filter recursive algorithm used for joint tracking is shown in Figure 3. As the figure shows, the measurement is obtained from the optical flow estimate. There are a couple of


    Fig 4: Block schematic of tracking.

scenarios which need to be tackled in order to use the optical flow as a reliable measurement vector. In the first, the optical flow estimate falls in the elliptical search region computed during the prediction phase; this confirms the correctness of the optical flow, making the optical flow estimate a suitable measurement vector. The elliptical search region is computed by using the posterior state and the predicted state as the two foci of an ellipse and computing the major and minor axes using the possible error values from the prior error co-variance matrix.4 The second scenario is when the optical flow estimate does not fall in the search region, thereby confirming that the optical flow estimate is noisy and is not suitable as a measurement. This signals the end of the current linear sub-trajectory and the beginning of the next linear sub-trajectory, where the associated Kalman filter must be re-initialized to track the optical flow matches in the next sub-trajectory.
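A sketch of this elliptical gating test is given below; the posterior and predicted positions act as the two foci, and the mapping from the prior error co-variance to the semi-major axis (a fixed number of standard deviations here) is an illustrative assumption rather than the exact design of the filter.

import numpy as np

def in_search_region(p, x_post, x_prior, P_prior, n_std=3.0):
    # A point lies inside an ellipse iff the sum of its distances to the two
    # foci does not exceed the major axis length 2a.
    f1, f2 = x_post[:2], x_prior[:2]            # positional parts of the states
    c = 0.5 * np.linalg.norm(f2 - f1)           # half the focal distance
    sigma = np.sqrt(np.trace(P_prior[:2, :2]))  # positional uncertainty
    a = c + n_std * sigma                       # semi-major axis (exceeds c)
    return np.linalg.norm(p - f1) + np.linalg.norm(p - f2) <= 2 * a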

    4 Proposed Framework

The proposed tracking scheme consists of two main stages: a) Kalman tracking of the optical flow matches on a sub-trajectory, and b) reinitialization of the Kalman tracker using a region-based match. In the overall schematic shown in Figure 4, the first step is to compute the foreground/background model. The foreground mask provides us with an estimate of the global velocity to initialize/re-


Fig 5: Illustration of elliptical search regions before tracking and joint location estimates after tracking. (a) Coarse joint location estimated in frame 1. (b) Elliptical search region in frame 2. (c) Fine estimates of joint location in frame 2 after tracking. (d) Elliptical search region in frame 4 (wrist joint tracker is reinitialized). (e) Finer estimates of joint locations in frame 4 after tracking. The coarse pose estimates are represented by purple in each frame. The search regions and the finer joint estimates are given as shoulder (blue), elbow (green), wrist (red), waist (cyan), knee (yellow) and ankle (pink). In frame 4, region-based matching is initiated and the corresponding tracker is re-initialized.


Fig 6: Illustration of elliptical search regions and fine joint estimates in certain frames when the tracker is only corrected. (a) Elliptical search region for frame 9; here, the ankle joint undergoes region matching, and since the constraint is satisfied, the corresponding tracker is re-initialized. (b) Finer joint location estimates after the tracking scheme. (c) Elliptical search regions $S_{op}(t)$ and $S_{reg}(t)$ in frame 11; here, for the knee joint, the constraint is not satisfied and the tracker is only corrected by the coarse joint location estimate given by the purple point. (d) Finer joint location estimates after the tracking scheme. (e) Elliptical search regions $S_{op}(t)$ and $S_{reg}(t)$ for both the knee and ankle joints in frame 13. (f) Finer joint location estimates where the knee and ankle trackers are corrected by coarse joint location estimates.


initialize the Kalman tracker associated with a joint. As we traverse each time step along the sub-trajectory, each joint region is described by a uniform LBP histogram. The coarse estimate of the joint location is provided by the estimates given by the point light software. To demonstrate the tracking ability of the framework, we use the coarse estimated points at sub-trajectory boundaries to get a finer region-based estimate of the joint location. The algorithm is given below (a condensed code sketch follows the steps):

1. Extract the first frame (time instant t = 1) of the sub-trajectory. Compute dense optical flow within the foreground region to get the global velocity estimate (median flow). Initialize/re-initialize the Kalman filter with the coarse joint location $(x_{cos}, y_{cos})$ or the finer region-based estimate $(x_{reg2}, y_{reg2})$, and the global velocity. The state of the tracker for each body joint is then $x_t = [x, y, v_x, v_y]$, where $(v_x, v_y)$ is the joint velocity, which is set to the global flow velocity estimate. This is considered as the corrected state $\hat{x}_{t-1}$ at time t = 1. Update $t \leftarrow t + 1$ and predict the state (get the prior state $\hat{x}_t^-$) of the Kalman filter. Using the predicted state $\hat{x}_t^-$, the posterior state $\hat{x}_{t-1}$ and the a priori error co-variance $P_t^-$, estimate the elliptical region $S_{op}(t)$ where the joint location is likely to fall.

2. Extract the next frame of the sub-trajectory. Find the optical flow match $(x_{op}, y_{op})$ of each joint between instants $t$ and $t-1$. Also compute the dense optical flow and the global velocity of the foreground region. Check whether the optical flow joint location estimate falls in the predicted elliptical search region. If yes, go to step 3; else go to step 4.

3. Using the joint location estimate provided by the optical flow as the measurement vector $z = [z_x, z_y]$, perform the correction phase of the filter to get the posterior state $\hat{x}_t$. Update $t \leftarrow t + 1$. Set the joint velocity to the global velocity and predict the state (get the prior state $\hat{x}_t^-$)


and the elliptical search region. Repeat steps 2 and 3 until the end boundary of the sub-trajectory, denoted by an optical flow mismatch.

4. Compute the joint location estimate $(x_{reg}, y_{reg})$ within the Kalman filter predicted search region using LBP-based region matching. This estimate is given by

$$\arg\min_{p \in S_{op}(t)} \chi^2(f_j, f_p)$$

where $f_j$ is the joint descriptor updated in the previous time instant and $f_p$ is the region descriptor computed at the pixel $p = (x_{reg}, y_{reg})$ within the elliptical search region $S_{op}(t)$. Using this estimate and the coarse joint location estimate, predict the new elliptical search region $S_{reg}(t)$. If the new elliptical search region is very large, a constraint $S_{reg}(t) \subseteq S_{op}(t)$ is used. Re-initialization occurs only if this constraint is satisfied. If it is satisfied, go to step 5; else go to step 6.

5. Compute the LBP-based region matching estimate given by $\arg\min_{p \in S_{reg}(t)} \chi^2(f_j, f_p)$, where $p = (x_{reg2}, y_{reg2})$. Use this finer estimate of the joint location to re-initialize the Kalman tracker associated with that particular joint. Update the joint velocity to the global velocity and predict the state (get the prior state $\hat{x}_t^-$) and the elliptical search region $S_{op}(t)$. Go to step 2.

6. Use the coarse joint location estimate $(x_{cos}, y_{cos})$ as the measurement vector $z = [z_x, z_y]$ to correct the corresponding tracker.

7. Continue until all the frames of the sequence have been processed.
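The following condensed sketch ties the steps together for a single joint. It reuses the helpers sketched in Section 3; frames, coarse, global_velocity, region_match, search_region_from and the kf object with its methods are hypothetical interfaces introduced only for illustration, not the exact implementation.

def track_joint(frames, coarse, kf):
    kf.initialize(coarse[0], global_velocity(frames[0]))               # step 1
    track = [coarse[0]]
    for t in range(1, len(frames)):
        kf.predict()                                       # prior state, S_op(t)
        flow = median_joint_flow(frames[t - 1], frames[t], track[-1])  # step 2
        if flow is not None and kf.in_search_region(track[-1] + flow):
            kf.correct(track[-1] + flow)                               # step 3
        else:                                         # sub-trajectory boundary
            p_reg = region_match(frames[t], kf.search_region())        # step 4
            S_reg = search_region_from(p_reg, coarse[t])
            if S_reg.inside(kf.search_region()):           # constraint holds
                p_reg2 = region_match(frames[t], S_reg)                # step 5
                kf.reinitialize(p_reg2, global_velocity(frames[t]))
            else:
                kf.correct(coarse[t])                                  # step 6
        track.append(kf.position())
    return track                                                       # step 7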

    We provide sample illustrations of the tracking scheme in Figures 5 and 6. In Figure 5, for frame

    2, the optical flow matches of all the joints fall in their respective predicted elliptical search region


$S_{op}(t)$, and these matches correct the corresponding joint tracker. In frame 4, all of the joints except the wrist are still in the sub-trajectory, as there are no optical flow mismatches. For the wrist, however, the optical flow match does not fall into its respective predicted elliptical search region $S_{op}(t)$. Thus, within $S_{op}(t)$, an LBP-based region match is found. Using this match and the coarse estimated joint location, another elliptical region $S_{reg}(t)$ is obtained. Again, on $S_{reg}(t)$, the region-based match is obtained. This match re-initializes the Kalman tracker, as the constraint is satisfied, and signals the beginning of another sub-trajectory. This is not the case with the knee and ankle joints in frames 11 and 13, where in fact the elliptical region $S_{reg}(t)$ is much larger than $S_{op}(t)$. The constraint is violated either when the coarse joint location estimates are noisy (sometimes not on the body but on the background) or when the region-based LBP match fails and catches onto an edge in the background. In the proposed technique, we tackle this issue by using the coarse joint location estimate to correct the existing tracker and not re-initialize it. This makes sure that the tracker does not get caught on background edges and only keeps track of the corresponding body joint.

    5 Results and Experiments

The proposed tracking scheme has been tested on a private dataset provided by the Air Force Institute of Technology, Dayton, OH. It consists of 12 subjects walking along an outdoor track across the face of a building with a staircase in front; the walk is performed twice by each subject to give a total of 24 video sequences. Each subject not only wears different colored clothing but also wears a coat vest on their second try during data capture. These video sequences are captured simultaneously using two cameras focused on the same area. So, when each sequence is divided into 5 phases A-E, a sequence of each phase is selected from either the left camera or the right


Fig 7: Illustration of the scene and division of the video sequence into five phases. (a) Background image. (b) Phase A. (c) Phase B. (d) Phase C. (e) Phase D. (f) Phase E.

camera depending on what area is being focused on for analysis. Thus, we do not consider which camera a sequence has been shot from. The description of each phase, along with the illustration in Figure 7, is as follows.

Phase A: The subject is walking clockwise around the track. The frames of interest are of the subject walking on the cross-over platform.

Phase B: The subject is walking clockwise around the track. The frames of interest are of the subject walking on the grass after the ramp.

Phase C: The subject is walking clockwise around the track. The frames of interest are of the subject walking on the grass after the ramp, on the side of the track away from the building.

Phase D: The subject is walking counter-clockwise around the track. The frames of interest are of the subject walking on the grass before the ramp.


Phase E: The subject is walking counter-clockwise around the track. The frames of interest are of the subject walking on the grass along the ramp.

    5.1 Challenges of the dataset and evaluation strategies

In this manuscript, we provide test results obtained by testing the proposed tracking scheme on all sequences in all phases. Although the dataset was captured to analyze differences in the gait of an individual when wearing or not wearing a coat vest, it provides a good number of challenges for testing the precision of the proposed tracking scheme. The dataset comes with human pose estimates at every frame, obtained by the point light software. These pose estimates give us coarse joint locations which are noisy yet adequate with regard to the gait tracking application at hand. The proposed tracking scheme makes use of these coarse joint location estimates to give finer estimates of the joint location. The effect of the tracking scheme on gait analysis algorithms is beyond the scope of this paper, and we focus mainly on the joint tracking aspect and the smoothness of the trajectory. One main challenge in this dataset is the very low resolution imagery, where a 17×17 neighborhood around a single joint, say a shoulder joint, captures the entire upper body of the individual. This is illustrated in Figure 1a. The other challenge is the interlacing effect present in the video, which can render edge-based region descriptors ineffective and affect the matching process. Apart from these global challenges, there are certain characteristics associated with each phase which sometimes introduce a challenging scenario for tracking. Some of these characteristics are:

Phase A: There can be partial/complete occlusion of the lower-body joints, such as the knee and ankle, by the platform railings and staircase. The lowest resolution of the person is captured in this phase, as the person is at the farthest distance from the camera.


Phase B: There is complete occlusion of the ankle by the tall grass, and joint region descriptors cannot be computed. Moreover, the coarse estimates provided by the point light software are also very noisy and do not give robust estimates of the joint.

Phase C: The image region containing the person is of slightly higher resolution, as the subject is closer to the camera. No occlusion of the ankle joint by the grass was noticed, which gives much cleaner data for the tracking scheme.

Phase D: There is complete occlusion of the ankle due to the tall grass, and the same problems as in phase B exist; the only difference is that the person is walking in the opposite direction.

Phase E: Same challenges as phase A, with the difference that the person walks in the opposite direction.

We set equal neighborhood sizes of 17×17 for each joint region and set a constant acceleration a = 0.1 pixels/frame² in the process noise co-variance design of the corresponding Kalman filter. To illustrate the effectiveness of the tracking scheme, we provide three different types of measures and graphs which explain the different aspects of tracking efficiency.

1. Co-variance-based trajectory measure: A statistical measure which gives how close the tracked joint locations are to the coarse estimates of the joint location for each sequence associated with a particular subject (see the sketch after this list). This statistical metric19 is given by

$$d(K, K_m) = \sqrt{\sum_{i=1}^{n} \left( \log \lambda_i(K, K_m) \right)^2} \tag{8}$$


where $K \in \mathbb{R}^{3 \times 3}$ is the co-variance matrix of the tracked points, $K_m \in \mathbb{R}^{3 \times 3}$ is the co-variance matrix of the coarse joint locations, $\lambda_i$ is the $i$th generalized eigenvalue satisfying $|\lambda K - K_m| = 0$, and $n$ is the number of eigenvalues. The lower the value, the closer the tracked points are to the coarse joint location estimates. Although this measure does not provide us with the precision of the tracking scheme, it gives an indication of whether the tracked joint trajectory is located within the spatio-temporal neighborhood of the coarse joint trajectory.

2. Multiple object tracking precision/accuracy (MOTP/MOTA): The MOTP/MOTA20 metric is a widely used efficiency measure for multiple-object tracking mechanisms, giving the precision and accuracy of the tracker by considering all the detected and tracked objects. We use an implementation of CLEAR-MOT provided by the authors21 to obtain statistics such as the false positive rate, false negative rate, and MOTA and MOTP scores. These statistics are computed as follows.

(a) Multiple object tracking precision (MOTP): This refers to the closeness of a tracked point location to its true location (given as ground truth). Here, we measure the closeness by the overlap between the neighborhood region occupied by the tracked point location and that of the ground truth; the higher the overlap, the more precise the estimated location of the point. This is given by

$$MOTP = \frac{\sum_{i,t} o_t^i}{\sum_t c_t} \tag{9}$$

where $o_t^i$ is the amount of overlap for joint $i$ at frame $t$ of a sequence and $c_t$ is the number of correct correspondences at frame $t$. Only those joints which satisfy the


criterion $o_t^i > T$ are included in the above equation.

(b) Multiple object tracking accuracy (MOTA): This gives the accumulated accuracy in terms of the fraction of tracked joints matched correctly without any misses or mismatches. It is given by

$$MOTA = 1 - \frac{\sum_t (m_t + fp_t + mme_t)}{\sum_t g_t} \tag{10}$$

where $m_t$, $fp_t$ and $mme_t$ are the number of misses, false positives and mismatches respectively, and $g_t$ is the number of points present at frame $t$. Thus the false negative rate, false positive rate and rate of mismatches can be computed as $\sum_t m_t / \sum_t g_t$, $\sum_t fp_t / \sum_t g_t$ and $\sum_t mme_t / \sum_t g_t$.

These statistics evaluate the tracking algorithm in terms of the overall accuracy and precision achieved by accumulating the measures of all the joints of interest per video sequence.

3. Precision/recall: The precision and recall for multiple object tracking are computed as overall MOTP and MOTA scores. The precision, in contrast to the theoretical definition, is computed by accumulating the overlaps $o_t^i$ and correct correspondences $c_t$ over all frames of all sequences and taking the ratio between them. The recall is computed by accumulating the total number of misses, false positives and mismatches from all frames of all sequences and using the formula for MOTA. The precision and recall are computed for every possible parameter set of the tracking scheme so that the best combination can be found for each phase.
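A brief sketch of how these measures can be computed is given below, assuming NumPy and SciPy are available; K and Km are the 3×3 trajectory co-variance matrices of item 1, and the per-frame counts are assumed to come from the CLEAR-MOT correspondence step.

import numpy as np
from scipy.linalg import eigvals

def covariance_trajectory_measure(K, Km):
    # Eq. (8): the generalized eigenvalues solve |lambda*K - Km| = 0.
    lam = np.real(eigvals(Km, K))
    return np.sqrt(np.sum(np.log(lam) ** 2))

def motp(overlaps, num_correct):
    # Eq. (9): accumulated overlap over the number of correct correspondences.
    return np.sum(overlaps) / np.sum(num_correct)

def mota(misses, false_pos, mismatches, num_gt):
    # Eq. (10): 1 - (misses + false positives + mismatches) / ground-truth points.
    return 1.0 - (np.sum(misses) + np.sum(false_pos)
                  + np.sum(mismatches)) / np.sum(num_gt)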


Fig 8: Statistical measures (left) and percentage of sequences in different ranges (right) obtained for phase A. (a) Kalman filtered coarse joint location estimates. (b) Corresponding range percentages for the Kalman tracking scheme. (c) Proposed tracking scheme. (d) Corresponding range percentages for the proposed tracking scheme.

    5.2 Covariance-Based Trajectory Analysis

The co-variance based trajectory measure is computed between the tracked points and the coarse estimated points for each phase. We provide two variations of the tracking scheme: a) one which simply uses the Kalman filter on the coarse estimates directly; and b) the proposed tracking


Fig 9: Statistical measures (left) and percentage of sequences in different ranges (right) obtained for phase B. (a) Kalman filtered manual point annotation. (b) Corresponding range percentages for the Kalman tracking scheme. (c) Proposed tracking scheme. (d) Corresponding range percentages for the proposed tracking scheme.


Fig 10: Statistical measures (left) and percentage of sequences in different ranges (right) obtained for phase C. (a) Kalman filtered coarse joint location estimates. (b) Corresponding range percentages for the Kalman tracking scheme. (c) Proposed tracking scheme. (d) Corresponding range percentages for the proposed tracking scheme.


Fig 11: Statistical measures (left) and percentage of sequences in different ranges (right) obtained for phase D. (a) Kalman filtered coarse joint location estimates. (b) Corresponding range percentages for the Kalman tracking scheme. (c) Proposed tracking scheme. (d) Corresponding range percentages for the proposed tracking scheme.


Fig 12: Statistical measures (left) and percentage of sequences in different ranges (right) obtained for phase E. (a) Kalman filtered coarse joint location estimates. (b) Corresponding range percentages for the Kalman tracking scheme. (c) Proposed tracking scheme. (d) Corresponding range percentages for the proposed tracking scheme.


scheme, which uses the image information in determining the fine joint location estimates. We compute the trajectory measure of the two tracking schemes for each video sequence and for each joint. Figures 8, 9, 10, 11 and 12 provide the tables containing the trajectory measures for each phase and for each tracking scheme.

We can empirically determine ranges of the trajectory discrepancy measure over which the finer estimates obtained by the tracking scheme can be judged acceptable or not; this is possible by a visual inspection of the trajectory plots for each joint of each sequence. A sample of the trajectory plots computed for subject 11 in phase A is shown in Figure 13. The ranges and the possible acceptance levels, with explanations, are given below.

Trajectory measure $d \in [0, 1)$: This denotes that the finer estimates obtained by a tracking scheme are much closer to the coarse joint location estimates than required. In this scenario, the finer estimates lean more towards the noisy, discrete coarse estimates. Although the tracking scheme gives better estimates than the coarse pose estimates, the finer estimates are slightly noisy in nature and are not very smooth.

Trajectory measure $d \in [1, 3)$: This range of values is considered highly acceptable, even though these estimates seem farther from the noisy coarse pose estimates. By observation, we see that the finer estimates of the joint trajectory are smoother than the coarse joint estimates and in fact resemble the actual sinusoidal trajectory of the joint.

Trajectory measure $d \in [3, 5)$: This range of values can be considered semi-acceptable, where the finer joint trajectory estimates obtained from the tracking are smooth but slightly far apart from the noisy coarse estimates. This happens either because the coarse pose estimates are noisy or because the tracker locks onto a different point on the same body joint region and


maintains the wrong track. Sometimes the estimated fine trajectory might miss or track some other point in a sub-trajectory, and the corresponding trajectory measure falls in this range as well.

Trajectory measure $d \in [5, \infty)$: This corresponds to wayward tracking by the tracking scheme. It happens mainly because the coarse joint location estimates contain a large error due to failure of the human pose estimation algorithm. In this case, the finer joint location estimates and the coarse estimates are drastically different.

Using these pre-defined ranges, we compute the percentage of sequences whose trajectory discrepancy measure falls within the specified ranges for the two schemes mentioned earlier. For phase A, we see a large percentage of sequences, around 65-75%, falling within the first measure range $d \in [0, 1)$ for the Kalman filtered tracking scheme. As mentioned before, although this measure is small, this tracking scheme gives more precedence to the coarse points and is under the assumption that these coarse points are noise free. Thus, it is an acceptable estimated trajectory but not as smooth as required for gait analysis. For the proposed scheme, however, around 65-85% of the sequences lie in the most acceptable region $d \in [1, 3)$, with the exception of the ankle joint, where only 20% fall in it while the rest fall in the region $d \in [0, 1)$. This latter region is still acceptable as far as tracking is concerned. 5-10% of sequences for the shoulder, elbow, wrist and hip joints fall in the semi-acceptable region $d \in [3, 5)$.

For phase B, the Kalman filtering scheme performs better, where most of the sequences fall in the acceptable region with equal division between the ranges $d \in [0, 1)$ and $d \in [1, 3)$. Using the proposed scheme, we get improved performance for the shoulder, knee and ankle joints and comparable performance for the hip joint. The elbow, however, has many sequences


in the semi-acceptable range $d \in [3, 5)$, with a couple of sequences falling in the bad range $d \in [5, \infty)$ for the elbow, wrist and hip joints. This is mainly because the LBP descriptor of the wrist joint was unable to capture the information, as there were not enough pixels for representation in this low-resolution imagery. The bad region matching is also due to similar appearances of the clothing and the background in this phase. Interestingly, although the ankle was occluded by the grass for some sequences, the tracking scheme was able to pick up the ankle joint from one of the coarse pose estimates and track it to a certain degree; thus, the corresponding sequences fall in the acceptable regions.

All of the sequences in phase C fall in the acceptable regions, where around 65-100% fall in the region $d \in [0, 1)$ for the Kalman filtered scheme. For the proposed scheme, these sequences are distributed between the two acceptable regions, with the majority falling in the highly acceptable region $d \in [1, 3)$. A similar distribution of the sequences is seen for phases D and E, with the proposed scheme showing a larger number of sequences falling in the highly acceptable region for all the joints.

Thus, we see that for all the phases, most of the sequences are distributed in the highly acceptable region $d \in [1, 3)$, where gait analysis can be useful. This is also the region where the estimated trajectories follow a smooth sinusoidal path. Some sequences, however, are distributed in the region $d \in [0, 1)$ even with the proposed scheme, and these will require post-processing of the fine joint location estimates for gait analysis. This is because of the constraint of having very low resolution with interlacing effects, which makes region-based descriptor matching ineffective. When the region matching fails, the proposed scheme becomes equivalent to the Kalman filtered tracking scheme, thereby at least maintaining the joint track. This is useful because, in a region where the region matching does become effective, a portion of the track can be used to analyze the gait of an


    individual.

    5.3 MOTP/MOTA Analysis

We computed the MOTP, MOTA, false positive rate and false negative rate for each individual sequence in each phase by setting the threshold T = 0.5, with the same acceleration parameter a = 0.1 and a neighborhood size of 17×17 for each body joint. The corresponding distributions in the MOTP-MOTA space are shown in Figure 14, where the red stars are the sequences, labeled appropriately. The Gaussian contours approximate the distribution of the sequences in the MOTP-MOTA space. The more concentrated the distribution is towards the upper right corner, the better the precision and accuracy of the tracking scheme. In Figure 14a, we see that all of the sequences in phase A have moderately high precision and accuracy, with some achieving a high accuracy of 90% with the corresponding precision above 80%. However, two sequences belonging to subject 26 show a low accuracy of 60% or less with a precision of around 75%. This is mainly because the hip and ankle joint tracks follow a different path compared to the ground truth data. Another important factor contributing to the drop in accuracy for some sequences is the noise in the ground truth data annotation provided by the point light software.

For phase B, as shown in Figure 14b, most of the sequences have only a moderate precision of around 70-75% and moderate accuracy ranging from 50-75%. Some sequences belonging to subjects 3, 22 and 26 exhibit a low accuracy of 50% or less. However, some sequences belonging to subjects 18 and 27 exhibit a very high accuracy of 90% or more with a good precision of 80-85%. The wide distribution of the sequences in the MOTP-MOTA space is mainly due to noisy ground truth annotations by the point light software and the many occlusion-based challenges present in this phase. Even in such a challenging scenario (with occlusions and


Fig 13: Estimated fine joint trajectories by different schemes for subject 11 wearing a coat in phase A. Panels shown: (c) wrist joint, (d) hip joint.


Fig 13 (continued): Estimated fine joint trajectories by different schemes for subject 11 wearing a coat in phase A. Panels shown: (e) knee joint, (f) ankle joint.


Fig 14: Scatter plots showing where the sequences of each phase are distributed in the MOTP/MOTA space. (a) Phase A. (b) Phase B.


Fig 14 (continued): Scatter plots showing where the sequences of each phase are distributed in the MOTP/MOTA space. (c) Phase C. (d) Phase D.


Fig 14 (continued): Scatter plots showing where the sequences of each phase are distributed in the MOTP/MOTA space. (e) Phase E.

background similarities with interlacing), the tracking scheme performs moderately well.

For phase C, although there are a few sequences which show low accuracy, the majority of the sequences have a moderate accuracy of around 60% or more. Although this phase has a slightly better resolution of the person, some challenging scenarios similar to phase B exist in this phase as well; due to the better resolution, however, the tracking scheme performs much better in phase C than in phase B.

Phases D and E, however, show many sequences with good accuracies of 75% or more. Similar challenging scenarios exist, with the difference that the person moves in an anti-clockwise manner around the track. Overall, for each phase, we notice that a considerable number of sequences show good accuracies of 75% or more, with a minor portion exhibiting low


accuracies of 50% or less. Again, this is due to the large amount of noise in the coarse joint location estimates provided by the point light software, which drops the accuracy for some sequences. This noise in fact contributes to the number of false positives, which may be incorrectly interpreted, thereby reducing some portion of the accuracy during evaluation. However, for all the phases, a good precision of 75% or more is achieved, and the tracking scheme is precise in locating or providing finer estimates of the joint location.

    5.4 Precision/Recall for each body joint

The precision and recall are computed for each phase for a particular value of the acceleration parameter a in the Kalman filter and are illustrated in Figure 15. For phases A, C and D, we see that the precision and recall achieve their highest values of around 80% and 85% for an acceleration value a = 0.1. However, for phases B and E, an acceleration of a = 0.2 gives higher values of precision and recall. This is due to the difference in the speed of the joints from individual to individual, and an optimal value of the acceleration for each person is required.

    6 Conclusions and Future Work

We have proposed a body joint tracking algorithm for use in low-resolution imagery of outdoor sequences. The algorithm is a combination of primitive but effective point tracking techniques using optical flow and region-based matching using LBP, coupled with the learning ability of the Kalman filter. Some joints, such as the shoulder, elbow and hip, are successfully tracked in most of the sequences, along with the wrist joint. However, the knee and ankle joints have multiple occurrences of re-initialization due to mismatching of the optical flow caused by low-resolution artifacts and interlacing effects. An important addition which we plan to make in future work is


Fig 15: Variation of precision and recall of the tracking scheme with change in the acceleration parameter. (a) Phase A. (b) Phase B. (c) Phase C. (d) Phase D. (e) Phase E.


to use the contextual relationship between the body joints. This crucial aspect is missing in the proposed algorithm, as it assumes that each joint's movement is independent of the other joints, which corresponds to the use of an individual piece-wise tracking scheme for each joint.

    References

1. H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, "Tracking multiple people under global appearance constraints," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 137-144, 2011.

2. J. Shao, S. Zhou, and R. Chellappa, "Tracking algorithm using background-foreground motion models and multiple cues," in Acoustics, Speech, and Signal Processing, 2005. Proceedings (ICASSP '05). IEEE International Conference on, 2, pp. 233-236, 2005.

3. W.-L. Lu and J. Little, "Simultaneous tracking and action recognition using the pca-hog descriptor," in Computer and Robot Vision, 2006. The 3rd Canadian Conference on, pp. 6-6, 2006.

4. M. Kaaniche and F. Bremond, "Tracking hog descriptors for gesture recognition," in Advanced Video and Signal Based Surveillance, 2009. AVSS '09. Sixth IEEE International Conference on, pp. 140-145, 2009.

5. P. Bilinski, F. Bremond, and M. B. Kaaniche, "Multiple object tracking with occlusions using hog descriptors and multi resolution images," in Crime Detection and Prevention (ICDP 2009), 3rd International Conference on, pp. 1-6, 2009.

6. T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," Pattern Analysis and Machine Intelligence, IEEE Transactions on 24(7), pp. 971-987, 2002.


7. J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1297-1304, 2011.

8. C.-H. Huang, E. Boyer, and S. Ilic, "Robust human body shape and pose tracking," in 3D Vision (3DV), 2013 International Conference on, pp. 287-294, 2013.

9. M. Kohler, Using the Kalman Filter to Track Human Interactive Motion: Modelling and Initialization of the Kalman Filter for Translational Motion, Forschungsberichte des Fachbereichs Informatik der Universität Dortmund, Dekanat Informatik, Univ., 1997.

10. Y. Yang and D. Ramanan, "Articulated pose estimation with flexible mixtures-of-parts," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1385-1392, June 2011.

11. V. Ramakrishna, T. Kanade, and Y. Sheikh, "Tracking human pose by tracking symmetric parts," in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 3728-3735, 2013.

12. V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive search space reduction for human pose estimation," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1-8, June 2008.

13. D. Ramanan, "Learning to parse images of articulated bodies," in Advances in Neural Information Processing Systems 19, B. Scholkopf, J. Platt, and T. Hoffman, eds., pp. 1129-1136, MIT Press, 2007.


14. X. Burgos-Artizzu, D. Hall, P. Perona, and P. Dollar, "Merging pose estimates across space and time," in Proceedings of the British Machine Vision Conference, BMVA Press, 2013.

15. C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

16. B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI'81, pp. 674-679, 1981.

17. T. Lacey, "Tutorial: The Kalman filter," Georgia Institute of Technology.

18. G. Welch and G. Bishop, "An introduction to the Kalman filter," 1995.

19. W. Forstner and B. Moonen, "A metric for covariance matrices," 1999.

20. K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," J. Image Video Process. 2008, pp. 1:1-1:10, Jan. 2008.

21. A. D. Bagdanov, A. Del Bimbo, F. Dini, G. Lisanti, and I. Masi, "Compact and efficient posterity logging of face imagery for video surveillance," 2012.