
Trajectory Learning for Activity Understanding: Unsupervised, Multilevel, and Long-Term Adaptive Approach

Brendan Tran Morris, Member, IEEE, and Mohan Manubhai Trivedi, Fellow, IEEE

Abstract—Society is rapidly accepting the use of video cameras in many new and varied locations, but effective methods to utilize and manage the massive resulting amounts of visual data are only slowly developing. This paper presents a framework for live video analysis in which the behaviors of surveillance subjects are described using a vocabulary learned from recurrent motion patterns, for real-time characterization and prediction of future activities, as well as the detection of abnormalities. The repetitive nature of object trajectories is utilized to automatically build activity models in a 3-stage hierarchical learning process. Interesting nodes are learned through Gaussian mixture modeling, connecting routes are formed through trajectory clustering, and the spatio-temporal dynamics of activities are probabilistically encoded using hidden Markov models. Activity models are adapted to small temporal variations in an online fashion using maximum likelihood regression, and new behaviors are discovered through periodic re-training for long-term monitoring. Extensive evaluation on various datasets, typically missing from other work, demonstrates the efficacy and generality of the proposed framework for surveillance-based activity analysis.

Index Terms—Trajectory clustering, real-time activity analysis, abnormality detection, trajectory learning, activity prediction

1 INTRODUCTION

THE dramatic decrease in cost for quality video equipment, coupled with the ease of transmitting and storing video data, has led to widespread use of vision-based analysis systems. Cameras are in continuous use all around: along highways to monitor traffic, for security of airports and other public places, and even in our homes. Methods to manage these huge volumes of video data are necessary, as it has become an almost impossible task to continually monitor these video sources manually. Vision researchers have been forced to develop robust real-time methods to recognize events and activities of interest as a means to compress video data into a more manageable form and to provide annotations that can be used for search indexing.

While researchers have long been interested in understanding human behavior, much of the work has relied on high-resolution descriptors such as state-space approaches [1], 3D reconstruction of body configurations [2], volumetric space-time shapes [3], or body part decomposition [4]. In the surveillance and monitoring setting, such rich descriptors may not be available or may not be necessary. Rather, coarse center-of-body motion can be reliably extracted from either rigid or deformable objects (e.g. vehicles or humans) in these far-field situations. This motion has cued human activity understanding through selective focus of attention in PTZ image

• B. Morris and M. Trivedi are with the Computer Vision and Robotics Research Laboratory in the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA, 92093-0434. E-mail: {b1morris, mtrivedi}@ucsd.edu

capture [5] or detection of loitering individuals [6]. In the transportation setting, vehicle motion has been utilized for traffic congestion measurements [7] and detailed origin-destination information at intersections [8]. But these techniques require domain knowledge, and as more video sites have been erected, the need for automatic methods to characterize activity from data, rather than by rough manual specification [9], has greatly increased because such methods provide a framework for unsupervised detection of interesting or abnormal events.

This work seeks to understand surveillance scene activity with minimal constraints and limited domain knowledge. The multilevel probabilistic analysis framework that enables the understanding of activity from observation of trajectory patterns is presented in Fig. 1. A 3-stage hierarchical modeling process, which characterizes an activity at multiple levels of resolution, is developed to classify and predict future activity and detect abnormal behavior. The unsupervised learning scheme is able to learn the points of interest in a scene, cluster trajectories into spatial routes, capture dynamics and temporal variations with a hidden Markov model, and provide an activity adaptation methodology to characterize a novel scene over long time periods without a priori knowledge. The framework effectively handles imperfect tracking and can automatically estimate the number of typical activities in a scene. Finally, extensive evaluation on a number of datasets, which is lacking in the literature, is provided to quantify analysis performance.

2 RELATED RESEARCH

As shown in Fig. 2, a surveillance activity is composed of a sequence of ordered actions, and a trajectory is the set of observable motion measurements.


Fig. 1: The general framework for trajectory analysis. During the observation (training) phase, trajectories are clustered and modeled. The learned set of typical patterns is used for live activity analysis in the online evaluation phase.

Often the observed motion patterns in visual surveillance systems are not completely random but have some underlying structure, which has been exploited to build models that allow for accurate inference.

Pioneering work by Johnson and Hogg [10] described outdoor motions with a flow vector and learned implicit temporal relationships using a leaky neural network. The leaky network was modified with feedback between output nodes and the leaky neurons to enable prediction by Sumpter and Bulpitt [11]. Owens and Hunter [12] extended this idea using a self-organizing feature map to further detect abnormal behavior. Unfortunately, the neural network implementations required difficult fine-tuning of parameters and large amounts of data, which resulted in slow convergence.

Hu et al. drastically improved the learning process by utilizing a complete trajectory as the learning input [13]. These trajectory cluster methods have been used to make predictions which utilize previously observed motion and to detect abnormalities [7], [14]–[16]. This explicitly enforced the sequential nature of trajectory observation points. Multi-layered learning approaches have been developed to first model the spatial extent and then the dynamics of an activity [15]. Junejo et al. characterized an activity by its extent, the traversal speed, as well as the amount of curvature. Makris and Ellis [17] developed an online method for building up spatial envelopes for each activity. Others have learned the smaller action groups and connected them within a Markov model [18] or tree-like structure [19]. By learning actions, new activities branch out from existing models (natural for online learning) and enable efficient learning through the use of shared data.

In stark contrast, others have completely ignored the inherent ordering in a trajectory.

Fig. 2: A surveillance activity is defined by a set of atomic actions and composed by the action sequence. Trajectory points are samples drawn from each action and relate the duration and ordering of actions.

Rather than require full trajectories, which are difficult to obtain due to scene clutter and occlusion, only inter-frame motion is considered by bag-of-words type learning methods. Stauffer and Grimson created a codebook of motion flows and learned the co-occurrence of motions that tended to occur together. Similarly, drawing inspiration from document clustering research, Wang et al. [20]–[22] have developed hierarchical Bayesian models to group co-occurring motions into actions and activities, jointly and without supervision. Xiang and Gong [23] were able to segment videos into clips of different activities based on the spectral similarity of behaviors.

A more detailed examination of techniques is presented in the review by Morris and Trivedi [24], which highlights a wide range of applications, challenges, and common approaches to trajectory learning and modeling. They emphasize not only the learning framework but also trajectory representation [25]–[27] and similarity metrics for comparison [28], [29].

This work presents a new multilevel trajectory learning framework which is capable of automatically estimating the number of activities in a scene and adapting to changes over time. Thorough evaluation, lacking in related studies, is presented to quantify analysis performance.

3 TRAJECTORY LEARNING FRAMEWORK

This paper adopts the general probabilistic trajectory analysis framework, shown in Fig. 1, to automatically learn and describe activity. At the front end, a vision-based background subtraction module detects moving objects, and a tracking process creates trajectories which are the main input for activity inference [7]. Coarse body motion is utilized as a reliable low-level feature which can describe how an agent moves through a visual scene. A trajectory is a sequence of T flow vectors

F = \{f_1, \ldots, f_t, \ldots, f_T\}    (1)

f_t = [x_t, y_t, u_t, v_t]^T    (2)

where flow vector f_t compactly represents an object's motion at time t by the (x, y) position and corresponding component velocities (u, v).
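For concreteness, the following is a minimal sketch of this representation in Python with NumPy. The helper name and the finite-difference velocity estimate are illustrative assumptions; in the paper, velocities come from the tracking module itself.

```python
import numpy as np

def make_trajectory(positions, dt=1.0):
    """Build a trajectory F as a (T, 4) array of flow vectors
    f_t = [x_t, y_t, u_t, v_t] from tracked (x, y) positions.

    Velocities are approximated here by finite differences; a real
    tracker (e.g. a Kalman filter) would supply smoothed estimates.
    """
    positions = np.asarray(positions, dtype=float)   # (T, 2)
    velocities = np.gradient(positions, dt, axis=0)  # (u, v) per frame
    return np.hstack([positions, velocities])        # (T, 4) trajectory F
```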


Fig. 3: In order to model activities at various resolutions, a 3-stage hierarchical learning procedure is adopted. The first level learns points of interest (nodes), the second locates spatial routes between nodes through clustering, and the final level probabilistically encodes spatio-temporal dynamics.

During the initial observation and learning phase, trajectories are collected to create a measurement database. The trajectory database is clustered to find similar motion patterns without specifying those of interest. Finally, the groups of similar trajectories are probabilistically modeled to populate an activity database for inference.

A key to designing an effective learning framework is to examine and limit the types of behaviors to be analyzed. This work is concerned with simple activities composed of smaller actions which are performed by a single agent. More complex behaviors, e.g. interactions between agents, are left for future studies. As shown in Fig. 2, an activity is defined by a sequence of atomic actions, and a trajectory is the observable measurement that summarizes activity history.

An activity can be decomposed at different resolutions to answer specific analysis questions. An activity can be characterized by four main components (left side of Fig. 3) that form an explanation hierarchy, with each level providing a more refined activity description. Each of the levels is briefly described to highlight its analysis importance.

Level 1 (Node): A simple compact representation of the interesting endpoints (origin and destination) of an activity, used by the transportation community to understand where people travel to and from.

Level 2 (Spatial): Different routes between nodes spatially localize differing activities and separate the straight route from a more roundabout approach.

Level 3 (Dynamic): Further disambiguation is made by considering the dynamics of how routes are traversed, to make a distinction between slowly and quickly moving activities (e.g. free-flow versus congestion on the highway).

Level 4 (Temporal): The finest level of activity characterization accounts for temporal variations. This most closely matches the activity model by accounting for the duration and sequencing of actions.

The 3-stage hierarchical learning process presented on the right of Fig. 3 learns each of the levels in turn. Nodes, or points of interest (POI), are discovered using Gaussian mixture modeling (GMM) and expectation maximization. The spatial routes are extracted by comparing and clustering similar trajectories while simultaneously and automatically determining the number of activities in a scene. In the last stage, the spatio-temporal dynamics that characterize an activity are encoded in a probabilistic hidden Markov model (HMM).

The evaluation phase uses the learned activity HMMs as a descriptive vocabulary for online analysis. As a new object is tracked, its activity is characterized, future behavior is predicted, and alarms are signaled if there is an abnormal or unusual action. In addition, recent observations are used to update and modify the activity models to better reflect the surveillance scene.

The analysis framework must cope with a number of real-world implementation issues in order to facilitate deployment on new scenes. In a new scene, the number of typical activities is not known a priori and must be estimated automatically. In addition, the learning and evaluation algorithms must gracefully handle incomplete trajectories due to faulty tracking. Finally, when monitoring a site over long periods, the initial training period may not reflect the current scene configuration, necessitating approaches to modify the activity models. In the following sections, more details of the learning and analysis framework are provided, as well as an evaluation methodology.

4 NODE LEVEL LEARNING

The first activity level locates the points of interest (POI) in the image plane.


Fig. 4: Graph representation of a visual scene where nodes indicate points of interest and edges signify the spatial connections (routes). Green nodes are the locations where objects enter the scene, red where they exit, and yellow are regions where an object stops and remains stationary.

These POI are represented as graph nodes in a topographical map of the scene (Fig. 4). The POI are used to filter noisy trajectories which arise from tracking failures, to provide robustness and improve the automatic activity definition.

4.1 Points of Interest

There are three different types of POI in a scene: entry, exit, and stop. The entry and exit POI are the locations where objects either appear in or disappear from the scene (e.g. the edge of the camera field of view) and correspond to the first, f_1, and last, f_T, tracking points, respectively.

The stop POI indicate scene landmarks where objects tend to idle or remain stationary (e.g. a person sitting at a desk). Potential stop points must have very low speed for at least τ seconds. But, unlike the inactivity regions of Dickinson and Hunter [30], stop regions are also localized. The stop POI shown in Fig. 4 is defined by τ/Δt consecutive points with speed less than the threshold V_stop, all within a small radius R. This more complex inactivity definition ensures a stationary object remains in a particular location. Just relying on a low speed threshold can contaminate the set of stop points; e.g. samples of a vehicle in congestion could be consistent with the low speed check even though it travels across the full camera field of view.
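A sketch of this stop test follows; the function name and the sliding-window scan are assumptions, while the speed, duration, and radius conditions follow the definition above.

```python
import numpy as np

def find_stop_windows(F, v_stop, tau, dt, R):
    """Flag windows of tau/dt consecutive samples whose speed stays
    below v_stop and whose positions all lie within radius R of the
    window centroid (the localized stop-POI condition)."""
    n = int(tau / dt)                        # consecutive samples required
    speeds = np.hypot(F[:, 2], F[:, 3])      # |(u, v)| at each sample
    stops = []
    for s in range(len(F) - n + 1):
        win = F[s:s + n, :2]
        if (speeds[s:s + n] < v_stop).all():
            center = win.mean(axis=0)
            if (np.linalg.norm(win - center, axis=1) < R).all():
                stops.append(s)              # window start index
    return stops
```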

The POI are learned through a 2D mixture-of-Gaussians zone modeling procedure [17]. For each POI type {enter, exit, stop}, a zone consisting of Z Gaussian components

Z = \sum_{m=1}^{Z} z_m \, G([x, y]^T; \mu_m, \Sigma_m)    (3)

is learned using expectation maximization (EM) [31], with [x, y]^T the image coordinates of a point in the POI set. Each mixture component m denotes the location of an activity node on the image plane by its mean μ_m, and its size is specified by Σ_m. The density of a node,

d_m = \frac{z_m}{\pi \sqrt{|\Sigma_m|}},    (4)

Fig. 5: Entry/exit interest zones learned by Gaussian mixture modeling (no stop zones in this dataset).

defines its importance.

Since it is not known a priori how many POI belong in a zone, the number must be estimated. The unimportant zone components can be automatically discarded using a density criterion based on the average mixture threshold

L_d = \frac{\alpha_Z}{\pi \sqrt{|\Sigma_Z|}}    (5)

where 0 < α_Z < 1 is a user-defined weight and Σ_Z is the covariance matrix of the entire zone dataset [17]. Zone components with low density (low importance), d_m < L_d, do not have much support in the training data and most likely model noise points in the tracking process (broken trajectories) since they are not well localized, while tight mixtures with high density indicate valid POI.

Fig. 5 shows the entry/exit zones learned for an intersection, with green denoting entry and red exit POI. Noise mixtures from broken tracks, which are removed, are drawn in black.
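A minimal sketch of the zone learning and density-based pruning, assuming scikit-learn's GaussianMixture as the EM implementation; the Z and α_Z values shown are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_poi_zone(points, Z=8, alpha_z=0.3):
    """Fit a Z-component GMM to POI samples (e.g. all first points f_1
    for the entry zone) and keep components whose density (4) exceeds
    the average-mixture threshold L_d of (5)."""
    points = np.asarray(points, dtype=float)         # (N, 2) image points
    gmm = GaussianMixture(n_components=Z, covariance_type='full').fit(points)
    # Component densities d_m = z_m / (pi * sqrt(|Sigma_m|)), eq. (4).
    d = gmm.weights_ / (np.pi * np.sqrt(np.linalg.det(gmm.covariances_)))
    # Threshold from the covariance of the entire zone dataset, eq. (5).
    L_d = alpha_z / (np.pi * np.sqrt(np.linalg.det(np.cov(points.T))))
    keep = d >= L_d        # low-density components model tracking noise
    return gmm.means_[keep], gmm.covariances_[keep]
```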

4.2 POI-Based Filtering of Broken Tracks

The POI modeling procedure indicates important nodes in the surveillance scene, which can be used to help the activity training process. Including noise tracks, generated during tracking failure from things like occlusion, would make activity learning more difficult because they are not adequately explained and result in wasted modeling effort.

The POI-based filtering process considers any trajectory that does not begin and end in a POI as a tracking error and removes it from the training database. Trajectories that travel through a stop zone are split into separate tracks leading into and out of the zone. The enter/exit and stop filtering provides the consistent trajectories needed for robust activity modeling.
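A sketch of the filter under one plausible membership test (a Mahalanobis gate, which the paper does not specify): a trajectory is kept only if its first point falls in a valid entry component and its last point in a valid exit component.

```python
import numpy as np

def in_poi(point, means, covs, gate=3.0):
    """True if `point` lies within `gate` Mahalanobis units of any POI."""
    for mu, cov in zip(means, covs):
        diff = point - mu
        if diff @ np.linalg.inv(cov) @ diff < gate ** 2:
            return True
    return False

def filter_broken_tracks(tracks, entry_poi, exit_poi):
    """Keep trajectories that begin in an entry POI and end in an exit POI;
    everything else is treated as a tracking error."""
    return [F for F in tracks
            if in_poi(F[0, :2], *entry_poi) and in_poi(F[-1, :2], *exit_poi)]
```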

4.3 Discussion

The Node Level learning is able to distinguish the image locations where objects should appear, disappear, and remain stationary. While it is evident that the mixture-of-Gaussians process is not able to perfectly place POI for each lane (the top-left enter POI actually incorporates three different lanes in Fig. 5), it is effective at removing broken trajectories from the training database. The choice of Z for a zone is not critical because, when underspecified, EM will create a larger mixture component.

Better POI localization might be possible by automatically selecting the number of mixture components in conjunction with parameter estimation by modifying the classical EM algorithm [32].


LCSS(F_i, F_j) =
\begin{cases}
0 & T_i = 0 \;\text{or}\; T_j = 0 \\
1 + LCSS(F_i^{T_i - 1}, F_j^{T_j - 1}) & d_E(f_{i,T_i}, f_{j,T_j}) < \varepsilon \;\text{and}\; |T_i - T_j| < \delta \\
\max\big(LCSS(F_i^{T_i - 1}, F_j^{T_j}),\; LCSS(F_i^{T_i}, F_j^{T_j - 1})\big) & \text{otherwise}
\end{cases}    (6)

Spectral Clustering of Trajectories
1) Construct the similarity graph S = {s_ij}
2) Compute the normalized Laplacian L = I - D^{-1/2} S D^{-1/2}
3) Compute the first K eigenvectors of L
4) Let U ∈ R^{N×K} be the normalized matrix constructed with the eigenvectors as columns
5) Cluster the rows of U using k-means

Fig. 6: Basic steps for spectral clustering of trajectories as presented by Ng et al. [33].


5 SPATIAL LEVEL LEARNING

The second behavior level focuses on spatial support, differentiating tracks based on where they occur. Once the scene goals are explained by POI, the connections between nodes are described by spatial routes.

5.1 Trajectory Clustering

After POI-based filtering, a training database of clean trajectories is available for grouping into routes, which separate the different ways to get between nodes. These routes can be learned in unsupervised fashion through clustering. Clustering groups typical trajectory patterns and relies only on the definition of similarity between tracks. The main difficulties when trying to learn routes are the time-varying nature of activities, which leads to unequal-length trajectories, and automatically determining how many routes exist.

5.1.1 LCSS Trajectory Distance

Since trajectories are realizations of an activity process, trajectories can be of unequal length based on the properties of differing activities. Additionally, even trajectories sampled from the same activity can be of unequal length due to sampling rate and speed of action. In order to compare trajectories, a distance measure must be able to handle variable-sized inputs.

The longest common subsequence (LCSS) is a sequence alignment tool for unequal-length data that is robust to noise and outliers because not all points need to be matched. Instead of a one-to-one mapping between points, a point with no good match can be ignored to prevent unfair biasing. When modified for trajectory comparison, LCSS was found to provide the best clustering performance when compared with a number of other popular distance measures [28], [34]. The LCSS distance for trajectories suggested by Vlachos et al. [35] is defined as

D_{LCSS}(F_i, F_j) = 1 - \frac{LCSS(F_i, F_j)}{\min(T_i, T_j)},    (7)

where the LCSS(F_i, F_j) value specifies the number of matching points between two trajectories of different lengths T_i and T_j. The recursive LCSS definition (6), which can be efficiently computed with dynamic programming, matches points that are within a small Euclidean distance ε and an acceptable time window δ. The term F^t = \{f_1, \ldots, f_t\} denotes all the sample points in F up to time t.
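A direct dynamic-programming sketch of (6) and (7); the function name is mine, positions are matched on (x, y) as in the text, and ε and δ are the match thresholds.

```python
import numpy as np

def lcss_distance(Fi, Fj, eps, delta):
    """LCSS distance (7) between trajectories of shape (Ti, .) and (Tj, .),
    using the recursion (6) on the (x, y) components."""
    Ti, Tj = len(Fi), len(Fj)
    L = np.zeros((Ti + 1, Tj + 1))           # L[a, b] = LCSS(Fi^a, Fj^b)
    for a in range(1, Ti + 1):
        for b in range(1, Tj + 1):
            close = np.linalg.norm(Fi[a - 1, :2] - Fj[b - 1, :2]) < eps
            if close and abs(a - b) < delta:
                L[a, b] = 1 + L[a - 1, b - 1]
            else:
                L[a, b] = max(L[a - 1, b], L[a, b - 1])
    return 1.0 - L[Ti, Tj] / min(Ti, Tj)
```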

5.1.2 Spectral Clustering

Spectral clustering has recently become a popular technique because it can be efficiently computed and has improved performance over more traditional clustering algorithms [34]. Spectral methods do not make any assumptions on the distribution of data points and instead rely on the eigen-decomposition of a similarity matrix, which approximates an optimal graph partition. The basic steps for normalized spectral clustering presented by Ng et al. [33] are outlined in Fig. 6.

The similarity matrix S = {s_ij}, which represents the adjacency matrix of a fully connected graph, is constructed from the LCSS trajectory distances using a Gaussian kernel function

s_{ij} = e^{-D_{LCSS}^2(F_i, F_j)/2\sigma^2} \in [0, 1]    (8)

where the parameter σ describes the trajectory neighborhood. Large values of σ cause trajectories to have a higher similarity score, while small values lead to a sparser similarity matrix (more entries will be very small). The Laplacian matrix is formed from the adjacency matrix as

L = I - D^{-1/2} S D^{-1/2}    (9)

with D the diagonal degree matrix whose elements are the sums of the corresponding rows of S. A new N × K matrix, U, is built using the first K eigenvectors of L as columns. Finally, the rows of U, each viewed as a new feature vector representation of a training trajectory F, are clustered using fuzzy C-means (FCM) [36] to group similar trajectories. The initial cluster centers are chosen based on the orthogonal initialization method proposed by Hu et al. [37].
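The steps of Fig. 6 admit a compact sketch; note that scikit-learn's KMeans stands in here for the FCM with orthogonal initialization that the paper actually uses, so this simplification does not produce the soft memberships u_ik.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(trajectories, K, sigma, dist):
    """Steps 1-5 of Fig. 6: similarity graph (8), normalized Laplacian (9),
    K-eigenvector embedding, then clustering of the embedded rows."""
    N = len(trajectories)
    S = np.ones((N, N))
    for i in range(N):
        for j in range(i + 1, N):
            s = np.exp(-dist(trajectories[i], trajectories[j]) ** 2
                       / (2 * sigma ** 2))            # eq. (8)
            S[i, j] = S[j, i] = s
    D_inv_sqrt = np.diag(1.0 / np.sqrt(S.sum(axis=1)))
    L = np.eye(N) - D_inv_sqrt @ S @ D_inv_sqrt       # eq. (9)
    _, vecs = np.linalg.eigh(L)                       # ascending eigenvalues
    U = vecs[:, :K]                                   # first K eigenvectors
    U /= np.linalg.norm(U, axis=1, keepdims=True)     # row-normalize
    return KMeans(n_clusters=K, n_init=10).fit_predict(U)

# Example usage with the LCSS distance sketched earlier:
# labels = spectral_cluster(tracks, K=30, sigma=0.5,
#                           dist=lambda a, b: lcss_distance(a, b, 10, 5))
```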


Fig. 7: (a) A large set of clusters, representing intersection maneuvers, is learned by spectral clustering with the LCSS trajectory distance. (b) The correct Nr = 19 routes remain after merging. (c) The spatio-temporal dynamics of each route are encoded by an HMM.

Rather than k-means, FCM is used in the last spectral clustering step to minimize the effects of outliers and obtain soft cluster membership values u_ik ∈ [0, 1] which indicate the quality of sample i. High membership means little route ambiguity, or a typical realization of an activity process. The membership variable u_ik tells how confidently trajectory i is placed in cluster k.

5.1.3 Route Creation

The spectral clustering process partitions the trajectory training database into groups of similar patterns. A route is defined as a prototype from each group. The route prototype is chosen as an average of the trajectories in a cluster, which utilizes the membership values u_ik returned from FCM.

In order to average trajectories of different length, each trajectory F has its velocity information ignored and is spatially resampled to a fixed size L. Rather than simple temporal subsampling, the resampled track evenly distributes points along the trajectory. A trajectory is viewed as a parameterized curve F = {x(t), y(t)} and is reparameterized in terms of arc length, F(s), to consider only the trace of the curve. The sampling ensures the Euclidean distances between consecutive points are equal, to completely remove dynamic information (hidden in the temporal sampling rate). This prevents regions of higher sample density from contributing bunches of points in a single area, as occurs when an object is moving slowly during a section of its trajectory. Finally, a vector F = [x_1, y_1, \ldots, x_L, y_L], representing a point in the R^{2L} route space, is constructed for each of the N training trajectories by stacking the consecutive trace points.

A route prototype is constructed as the weighted average of the training database,

r_k = \frac{\sum_{i=1}^{N} u_{ik}^2 F_i}{\sum_{i=1}^{N} u_{ik}^2}    (10)

where u_{ik} is the FCM membership of example i in cluster k. The resulting set of K prototypes {r_k} encodes the typical scene routes.

5.2 Cluster Validation

Spectral clustering is performed with a large K to partition the training data, but the true number of routes must still be estimated because it is not known a priori. After clustering, the routes are refined to a smaller number (Nr < K) by merging similar prototypes.

The merge procedure compares routes by finding an alignment between pairs of routes using dynamic time warping (DTW) [38]. DTW is used to equally value all matches and provide a one-to-one matching between points on routes. After alignment, two routes are considered similar if the number of matching points is high or if the total distance is small. The match count is defined as

M(r_m, r_n) = \sum_{l=1}^{L} I\big(d_l(r_m, r_n) < \varepsilon_d\big)    (11)

d_l(r_m, r_n) = \sqrt{(x_m(l) - x_n(l))^2 + (y_m(l) - y_n(l))^2},    (12)

where I(·) is the indicator function, and the total distance between routes is

D(r_m, r_n) = \sum_{l=1}^{L} d_l(r_m, r_n).    (13)

The threshold value ε_d (in pixels) was chosen experimentally for good results and is based on how closely routes are allowed to exist in the image plane. When either M(r_m, r_n) > T_M or D(r_m, r_n) < Lε_d, routes r_m and r_n are considered similar.
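A sketch of the merge test on two prototypes; for brevity, the resampled points are compared index-to-index, whereas the paper first aligns the prototypes with DTW.

```python
import numpy as np

def routes_similar(rm, rn, eps_d, T_M):
    """Merge test using the match count (11) and total distance (13)
    on two aligned route vectors of the form [x1, y1, ..., xL, yL]."""
    pm, pn = rm.reshape(-1, 2), rn.reshape(-1, 2)
    L = len(pm)
    d = np.linalg.norm(pm - pn, axis=1)      # d_l of eq. (12)
    M = int((d < eps_d).sum())               # match count, eq. (11)
    D = float(d.sum())                       # total distance, eq. (13)
    return M > T_M or D < L * eps_d
```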

A cluster correspondence list is created from these pairwise similarities, forming similarity groups. Each correspondence group is reduced to a single route by removing similar routes and transferring the membership weight of every training trajectory onto the remaining route, e.g.

\tilde{u}_{mi} = u_{mi} + u_{ni} \quad \forall n \sim m,    (14)

where \tilde{u} represents the membership after merging.

In practice, we have found that FCM tends not to overfit the data but instead finds several very similar


clusters, making the merge algorithm effective. The route prototypes are plotted in Fig. 7a for K = 30. The turns in the upper corners have multiple similar clusters, which are removed by the match-merge procedure to retain only the true Nr = 19 routes shown in Fig. 7b.

5.3 Discussion

There are many options when clustering trajectories, ranging from the choice of clustering algorithm to the measure of similarity between tracks [24]. The choice of LCSS distance along with spectral clustering has proven very effective for surveillance settings [28], [34] but still requires a model order choice and a validation scheme such as the match-merge process presented.

Additionally, automatic route definition relies on consistent and repetitive activities; rare occurrences will be lost during clustering. A hierarchical clustering procedure can avoid active selection of model order and provides a set of solutions at varying resolutions, which could prove effective for handling infrequently recurring activities.

6 DYNAMIC-TEMPORAL LEVEL LEARNING

The clustering procedure locates routes spatially, but this is insufficient for detailed analysis. It is necessary to know not only where objects are but also the manner in which they travel. The higher-order dynamics and temporal structure of the third and fourth activity hierarchy levels are needed to completely characterize an activity path.

6.1 Activity HMM

A path provides a compact representation of an activity which incorporates location, dynamics, and temporal properties. Each activity is represented using a hidden Markov model (HMM) because it naturally handles time normalization. In addition, the simplicity of training and evaluation makes it ideal for real-time implementation.

An HMM, in compact notation λ = (A, B, π_0), is characterized by the following parameters:

• The number of states (actions) in the model, Q. In this work Q is fixed for simplicity, but it is possible to estimate an optimal number [39].

• The Q × Q state transition probability matrix A = {a_{ij}}, where

a_{ij} = p(q_{t+1} = j \,|\, q_t = i)    (15)

• The observation probability distribution B = b_j(f), where

b_j(f) = G(f; \mu_j, \Sigma_j)    (16)

represents the Gaussian distribution of flow f for each of the j = 1, \ldots, Q states, with unknown mean μ_j and covariance Σ_j.

• The initial state distribution π_0 = {π_j}, with

\pi_j = p(q_1 = j)    (17)

The temporal level of the learning hierarchy in Fig. 3 gives a graphical representation of an HMM, where circles represent the states and arrows indicate the state transitions.

The activity path HMM is a left-right hidden Markov model (a_{ij} = 0 for j < i) which encodes the sequential start-to-finish structure of trajectories by prohibiting backtracking. Each of the k = 1, \ldots, N_r scene activities is represented as λ_k = (A_k, B_k, π_0) with a fixed π_0 for each path of

\pi_0(j) = \frac{1}{C} e^{-\alpha_p j}, \quad j = 1, \ldots, Q.    (18)

C is a normalization constant to ensure valid probabilities. This definition allows tracks to begin in any state, which is important when trajectories are incomplete due to occlusion or during online analysis. The exponential weighting is used to prefer action sequences that begin decoding in earlier states, though the choice of α_p is not crucial as long as there is non-zero probability for each state.

The path model is finalized when both the transition probabilities A_k and the action states which define B_k are learned from the trajectory training database.

6.2 Path Training

An HMM is trained for each activity by dividing the training set D into N_r disjoint sets, D = \bigcup_{k=1}^{N_r} D_k. The set

D_k = \{F_i \,|\, u_{ik} > 0.8\}    (19)

contains only the most representative examples of cluster k. In this stage of learning, the full trajectory, including position and velocity, F_i = \{f_t\} with f_t = [x_t, y_t, u_t, v_t]^T, is utilized for a precise spatio-temporal model. Using the path training sets D_k, the N_r HMMs can be efficiently learned using standard methods such as the Baum-Welch method [38] to find the transition probabilities A_k and the Gaussian state observation distributions μ_{jk} and Σ_{jk}.

The set of learned activity paths for the traffic intersection, which specify where lanes are located as well as how vehicles are expected to move in each lane, is shown in Fig. 7c. Typically, the path models have a highly structured transition matrix which is block diagonal. For long activities with few model states (T ≫ Q), the highest transition probability is along the main diagonal, which means it is typical to have multiple observations from a single action.

Notice that while data could be shared and states tied in regions of overlap between HMMs, doing so would discard subtle cues which distinguish activities. For example, the slowing necessary to make a right turn through an intersection is not present when passing straight through.

The resultant activity models after the third stage of the hierarchical activity learning process provide a simple probabilistic interpretation of an activity. In addition, new camera setups can be easily deployed because there


is no need to manually select "good" trajectories for the modeling, since they are automatically discovered.

6.3 Updating Path Models

The activity HMMs learned above accurately depict the scene at the time of training, but in a surveillance setting there is no guarantee that the activity processes are stationary, meaning the models must reflect changes over time. Two complementary adaptation methods are used to update the HMM database. The first is an online scheme to refine the existing activity HMMs based on newly observed trajectories, while the second introduces new models through periodic re-learning rounds.

6.3.1 Online Incremental Update

After training, the activity models are optimal for the training data but after time might not reflect the current configuration of the scene. Small variations and perturbations could arise from various causes such as camera movement or path reconfiguration, e.g. people walking around a puddle. An activity HMM can be updated with new trajectories in an online fashion using maximum likelihood linear regression (MLLR) [40]. MLLR computes a set of linear transformations that reduce the mismatch between the initial model and new (adaptation) data. The adapted HMM state mean is given by

\mu_j = W_k \xi_j,    (20)

where W_k is the 4 × (4 + 1) transformation matrix for activity k and ξ_j is the extended mean vector

\xi_j = [1, \mu_j^T]^T = [1, \mu_x, \mu_y, \mu_u, \mu_v]^T    (21)

of state j. The state j represents an individual action that makes up an activity, and μ_j = [\mu_x, \mu_y, \mu_u, \mu_v]^T summarizes the location and dynamics of the action. W_k = [b\; H] produces an affine transformation for each Gaussian HMM state, with H a transformation and b a bias term. The transformation matrix W_k can be found using EM by solving the auxiliary equation

\sum_{t=1}^{T} \sum_{j=1}^{Q} L_j(t)\, \Sigma_j^{-1} f_t \xi_j^T = \sum_{t=1}^{T} \sum_{j=1}^{Q} L_j(t)\, \Sigma_j^{-1} W_k \xi_j \xi_j^T    (22)

where L_j(t) = p(q_j(t) \,|\, F). The optimization is performed over all Q states of an HMM to provide an activity-level regression.

Each time a new trajectory is classified as path λ_k (Section 7.1.1), a transformation is learned and applied to the mean of each of the HMM states for a sequential online update. The update

\mu_j(t+1) = (1 - \alpha_{MLLR})\,\mu_j(t) + \alpha_{MLLR}\, W_k(t)\, \xi_j \quad \forall j    (23)

modifies the means of existing path λ_k to better fit new observations at time t. The learning rate α_{MLLR} ∈ [0, 1] is a user-defined parameter set to control the importance of a new trajectory.

Fig. 8: A threshold is automatically set between the within-class (green) and out-of-class (red) training examples, with a tunable parameter β to control the sensitivity of abnormality detection.

6.3.2 Incorporating New Activities

The MLLR update allows modification of existing activities, but in order to introduce new and unseen activities into the activity set, a periodic batch re-training procedure is adopted [15]. Abnormal trajectories that are not well explained by any of the activity models (defined in Section 7.1.2) are collected to form an auxiliary training database. Once the database has grown sufficiently large, it can be passed through the 3-stage learning machinery to periodically augment the activity set. Since the abnormality database is populated by trajectories which did not fit any of the existing models, any consistent patterns extracted indicate new activities, e.g. a newly opened lane on a highway. In this way, repetitive motions initially considered abnormal can be assimilated into the scene definition.

6.4 Discussion

The HMM-based formulation does not impose global temporal alignment, unlike other probabilistic spatio-temporal activity representations [15]. Instead, a local model of temporal structure is used, which encodes the probability of time-scale distortion at any particular instance in time in the transition probability. The local model provides the flexibility to accurately compare similar, though different-length, trajectories. But because there is an underlying dynamic process, the exponential state duration model implied by the transition probabilities may not accurately match the perceived dynamics.

The hidden semi-Markov model (HSMM) has been proposed to explicitly model the state duration density [41]. Yet this is often ignored due to increased computational and memory requirements, along with a need for significantly more data to train a larger number of parameters [38]. By modeling state durations, the path transitions can more closely match observed dynamics, and more complex activities can be included (states with multi-modal stay durations, e.g. going straight through or stopping at an intersection).


Fig. 9: Abnormal trajectories. (a) Wide U-turn. (b) Illegal loop. (c) Off-road driving. (d) Travel in the opposite direction.


7 ACTIVITY ANALYSIS

After the offline learning process, the underlying activities present in a visual scene are compactly represented by the set of HMM paths. Using these learned models, the activity of a scene agent can be accurately characterized from live video. Activity analysis includes describing actions, predicting future behavior, and detecting abnormal and unusual events.

7.1 Trajectory Summarization

When a trajectory is completed (e.g. the object leaves the scene), a summary of the object activity is generated, and the trajectory is determined to be either typical or abnormal.

7.1.1 Trajectory Classification

Using Bayesian inference, the activity that best describes the trajectory observation sequence is determined by maximum likelihood estimation,

\lambda^* = \arg\max_k \; p(F \,|\, \lambda_k).    (24)

The likelihood can be solved efficiently for the HMM using the forward-backward procedure [38] and indicates how likely it is that activity λ_k generated trajectory F. Partially observed activities (broken tracks from occlusion) can also be evaluated using (24) because the non-zero π_0 allows any initial start state.

7.1.2 Abnormal Trajectories

Since only typical trajectories are used to learn the activity paths, abnormalities (outliers) are not well modeled, and the quality of class assignment will be low. Trajectories with low log-likelihood can be recognized as abnormalities. The decision threshold is learned during training by comparing the average likelihood of samples in the training set to those outside it:

\bar{\ell}_k^{+} = \frac{1}{|D_k|} \sum_{F_i \in D_k} \log p(F_i \,|\, \lambda_k)    (25)

\bar{\ell}_k^{-} = \frac{1}{|D \setminus D_k|} \sum_{F_i \notin D_k} \log p(F_i \,|\, \lambda_k)    (26)

\ell_k^{th} = (1 - \beta)\, \bar{\ell}_k^{-} + \beta\, \bar{\ell}_k^{+}    (27)

The sensitivity factor β controls the abnormality rate, with larger β causing more trajectories to be considered anomalous. The automatic threshold selection process is visualized in Fig. 8, where the green line signifies \bar{\ell}_k^{+}, red signifies \bar{\ell}_k^{-}, and black indicates the threshold \ell_k^{th}.

A few examples of abnormalities are displayed in Fig. 9. In (a), although U-turns were allowed, the wide arc from the middle lane was not. The (b) illegal loop and (c) off-road driving are clearly unusual, while the trajectory in (d) looks acceptable but actually came from travel in the wrong direction.

7.2 Online Tracking Analysis

Although it is interesting to provide a summary of completed tracks, it is often more important in surveillance to recognize activity as it occurs, in order to assess and react to the current situation. A description of activity must be generated based on the most current information and refined with each new video frame. The online analysis engine is given the difficult tasks of making predictions with incomplete data (a partial trajectory) and detecting unusual actions.

It is important to note that within this framework each individual scene agent is considered separately. Rather than just indicating that a noteworthy event is happening, the location and individual of interest are highlighted as well.

7.2.1 Activity Prediction

Real-time analysis provides the alerts necessary for timely reaction. This response time could be improved if the monitoring system were able to infer intentions and determine what will happen before it actually occurs. Accurate prediction can help reduce reaction time or even avoid undesirable situations by providing a buffer in which to take countermeasures and corrective actions.

Future actions can be inferred from the current tracking information. Instead of using all the tracking points accumulated up to time t, only a small window of data is utilized,

F_t^w = \{f_{t-w}, \ldots, f_t, \ldots, f_{t+d}\}.    (28)

The windowed track consists of the w past measurements, the current point f_t, as well as d future points.


Fig. 10: The top three most likely maneuvers (1 green, 2 yellow, 3 red) and prediction confidence shown during a U-turn. (a) The motion model (light blue X) fits during linear motion but (b) does not handle turns well. (c) Halfway through the U-turn, the prediction can accurately gauge the maneuver while the motion model poorly approximates it. (d) New maneuvers explain the current situation, but activity memory still indicates the U-turn in green.

The future points are estimated by applying the tracking motion model d time steps ahead. By utilizing only the windowed trajectory, only the recent history is considered during online evaluation, because old samples may not correlate well with the current activity. This allows an agent to be monitored over very long time periods by discarding stale data.

The activity prediction is made at the current time by evaluating (24) with F replaced by its windowed version,

\lambda^*(t) = \arg\max_k \; p(F_t^w \,|\, \lambda_k).    (29)

This prediction has a longer time horizon than standard one-step prediction (e.g. Kalman prediction) because it exploits previously observed activities rather than relying on a generic motion model. During complex maneuvers, motion models will rarely be effective much more than a few time steps into the future.
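A sketch of the windowed evaluation of (28) and (29); a constant-velocity extrapolation stands in here for the tracker's motion model, and the w and d values are illustrative.

```python
import numpy as np

def predict_activity(F, t, path_models, w=10, d=5):
    """Eq. (29): evaluate the windowed track F_t^w against each path HMM."""
    window = F[max(0, t - w):t + 1]               # past points and f_t
    x, v = window[-1, :2], window[-1, 2:]
    future = np.array([np.hstack([x + v * (i + 1), v])
                       for i in range(d)])        # motion-model rollout
    Fw = np.vstack([window, future])              # eq. (28)
    scores = [m.score(Fw) for m in path_models]
    return int(np.argmax(scores)), scores
```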

An example of the superiority of the trajectory pattern prediction is shown in Fig. 10. The top three activity predictions are color coded, green for the best match, yellow for top-2, and red for top-3, and their associated confidence is presented in the colored box. During the straight sections (a), the motion model, shown as light blue X's, estimates behavior reasonably well. As the turn progresses in (b), the motion model very poorly approximates the maneuver. It is quickly realized that the maneuver is a U-turn (c), since it was seen before, while the motion model seriously falls behind. Finally, in (d), the estimates re-align in the straight section, but the track memory helps distinguish the U-turn in green.

The prediction sequence, which encodes an object's action history, can indicate behavioral changes. A consistent activity manifests as consecutive labels which are equal, while a transition between labels indicates an activity switch (such as during a lane change).

7.2.2 Unusual Action Detection

Similar to abnormal trajectories, unusual actions can be detected during tracking. These anomalies indicate deviations the instant they occur. Since only a windowed version of a track is used during tracking, the log-likelihood threshold (27) needs to be adjusted. The new threshold is

\ell_{k,w}^{th} = \gamma_k \, \ell_k^{th}.    (30)

The abnormality threshold is adjusted with γ_k to account for the reduced probability mass associated with a partial trajectory. The term γ_k corresponds to the fraction of states visited in the evaluation window. This assumes uniform state duration for a full trajectory and averages the log-likelihood threshold equally into each model state, while the numerator is the expected number of states that will be visited in an observation window. The adjustment term is estimated as

\gamma_k = \frac{(w + d + 1)/(\bar{T}_k/Q)}{Q} = \frac{w + d + 1}{\bar{T}_k},    (31)

with \bar{T}_k the average length of training trajectories in D_k corresponding to path λ_k, and \bar{T}_k/Q the number of samples per state, which results in (w + d + 1)/(\bar{T}_k/Q) as the number of states visited in a window. Notice that the window only considers observed samples and sets d = 0 to make assessments only on observed data. Here β is again chosen to ensure detection of the most suspicious tracking points. In this work, β was automatically selected to return a target unusual action detection rate on the training set.

As soon as an object strays from an activity model, an unusual action alarm is triggered for timely detection. Examples of unusual actions are given in Fig. 11. The illegal loop and off-road driving from Fig. 9 are seen in (a) and (b), respectively.


Fig. 11: First occurrence of an unusual event during online processing. The red box indicates an unusual detection while green is deemed acceptable. (a) Illegal loop. (b) Off-road driving. (c) The illegal left turn from the middle lane is detected before re-aligning into the exit lane.

Fig. 12: The other experimental datasets tested in addition to the intersection (CROSS). (a) The I5 highway traffic scene. (b) POI learning results with sparse mixtures due to broken trajectories. (c) OMNI camera inside a laboratory. (d) Analysis results show the predicted path, and a red bounding box highlights an unusual action from backtracking.

TABLE 1: Experimental Parameters

CROSS   1900   1764   15   15   0.3   0.1   0.9    0.9
I5       606    261   15   15   0.2   0.1   0.9    0.95
OMNI1    117     55   15   25   0.3   0.1   0.98   0.75
OMNI2    131     98   15   25   0.3   0.1   0.9    0.75

The time evolution of a detection is shown in (c), where initially a typical behavior is denoted with a green box, but in the middle of an illegal left turn from the middle lane the box turns red to indicate an unusual action. Finally, a short time later, the action stabilizes back to green.

8 EXPERIMENTAL STUDIES AND ANALYSIS

The following section assesses the performance of the activity learning and analysis framework. First, the 3-stage learning process is evaluated by examining the quality of the automatically extracted activities. Afterward, the accuracy of the trajectory summarization and online analysis modules is considered.

The results are compiled from a set of varying scenes. Three different experimental locations are analyzed: a simulated traffic intersection (CROSS), a highway road segment (I5), and the interior of a laboratory under observation by an omni-directional camera (OMNI). Table 1 details the size of the datasets and the experimental parameters. Further details about the experimental datasets can be found in [34]. The results of trajectory summarization and online analysis are presented in Tables 2 and 3, respectively.

8.1 Cluster Validation

Perhaps the most important, and most difficult, task in trajectory learning is to robustly and accurately extract the dominant scene activities without a priori specification. This section compares different methods to estimate the number of routes in a scene; this process is generally known as cluster validation.

In this work, the number of activities is automatically determined by first over-clustering trajectories and then merging similar routes. Two types of merge criteria are compared: membership and distance. The membership criteria utilize the distribution of membership scores across the training set to find similar clusters. The three membership criteria are the Bhattacharyya coefficient (32), the cosine angle (33), and the sum-of-squared difference (34):

B(u_m, u_n) = \sum_{i=1}^{N} \sqrt{u_{mi}\, u_{ni}}    (32)

\cos(u_m, u_n) = \frac{\sum_{i=1}^{N} u_{mi}\, u_{ni}}{\|u_m\| \, \|u_n\|}    (33)

SSD(u_m, u_n) = \sum_{i=1}^{N} (u_{mi} - u_{ni})^2    (34)

The route distance criteria are the match count (11) and the total distance (13).


Fig. 13: (a) The point match criterion outperforms the distance and membership criteria for merging similar clusters. (b) Spectral clustering using the LCSS distance has better merge performance than simply resampling tracks and FCM clustering. (c) The true number of clusters is consistently selected after the merge process. The route count Nr is automatically estimated as the mode across the range of match thresholds.

TABLE 2: Trajectory Summarization Results

                 classification            abnormalities
dataset   Nr   count        Acc %    count     TPR %   FPR %
CROSS     19   9191/9500    96.8     163/200   81.5    18.5
I5        8    879/923      95       -         -       -
OMNI1     7    25/25        100      10/16     68.8    12.0
OMNI2     15   12/16        75.0     15/18     83.3    25.0

Using the distance-based merge criterion, all 19 of the traffic intersection activities were correctly identified in the CROSS experiment. This set contained the most activities, but they were distinct due to a favorable view. All 8 of the highway lanes were correctly identified in the I5 experiment as well, but there were also 2 false lanes. The extra lanes appeared in the southbound direction closest to the camera. Perspective distortion causes the northbound lanes to appear significantly closer together than the southbound lanes. The distance threshold εd was chosen to resolve the north lanes at the expense of multiple hits in the closest lanes. This highlights the need for camera calibration to unify the measurement space. In the OMNI experiments, there are no physical lanes but virtual paths that people travel between doors and desks. The 7 door-to-door paths were correctly extracted in the OMNI1 experiment. The second omni experiment was more complex, and only 13 of 15 paths were discovered. The two missing paths had very little support after trajectory filtering but could be learned given more data. The OMNI experiments were much more difficult than the vehicle scenes because they are less constrained, paths do not have clear separation, and paths contain significant overlap.

The cluster validity evaluation shows the learning framework is able to accurately extract the typical scene activities with few parameters. The two parameters to specify for route clustering are the initial number of clusters K and εd. The merge process is robust to the choice of K, and εd is related to the desired spatial resolution between trajectories.

TABLE 3: Online Analysis Results

                 prediction                unusual actions
dataset   Nr   count          Acc %    count      TPR     FPR
CROSS     19   31713/41871    75.7     364/443    0.822   0.324
I5        8    13859/14876    93.2     -          -       -
OMNI1     7    2241/3048      73.5     544/945    0.576   0.266
OMNI2     15   1719/2693      63.8     -          -       -


8.2 Trajectory Summarization

After an object track finishes, trajectory summarization determines which activity likely generated the trajectory and also whether it is typical or abnormal. In the following section, the quality of activity assignment and anomaly detection is evaluated, with results summarized in Table 2.

8.2.1 Trajectory Classification

Using (24), each test example is categorized by the activity that most likely generated the observation. The intersection experiment had 9191 of 9500 trajectories correctly labeled. The lane number for 879 of 923 manually labeled tracks from I5 video was correctly determined, for 95% accuracy. Not surprisingly, most classification errors occur in the northbound lanes (93% vs. 98% southbound), where the lanes appear very close in the image plane. Since a trajectory was constructed from the centroid of a detection bounding box, large vehicles which cover multiple lanes had trajectories that would "float" above the true lane. In this situation, homography-based ground-plane tracking would help keep trajectories consistent across detection sizes and preserve relative distances.
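As a sketch of this maximum-likelihood classification rule, the snippet below scores a trajectory against each activity HMM with a log-space forward pass and keeps the argmax. The Gaussian-emission parameterization and the dictionary-based model container are illustrative assumptions; the paper's exact state and duration modeling is not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(traj, hmm):
    """Log p(traj | hmm) via the forward algorithm in log space.
    traj: (T, d) observation sequence; hmm: dict with 'pi' (S,),
    'A' (S, S), 'means' (S, d), 'covs' (S, d, d) -- an assumed container."""
    S = len(hmm['pi'])
    # Per-state log emission probabilities for every frame.
    log_b = np.stack([multivariate_normal.logpdf(traj, hmm['means'][s], hmm['covs'][s])
                      for s in range(S)], axis=1)             # (T, S)
    log_alpha = np.log(hmm['pi']) + log_b[0]
    for t in range(1, len(traj)):
        # log-sum-exp over previous states for numerical stability.
        m = log_alpha.max()
        log_alpha = m + np.log(np.exp(log_alpha - m) @ hmm['A']) + log_b[t]
    m = log_alpha.max()
    return m + np.log(np.sum(np.exp(log_alpha - m)))

def classify(traj, hmms):
    # Assign the trajectory to the activity model with the highest likelihood.
    scores = [log_likelihood(traj, h) for h in hmms]
    return int(np.argmax(scores)), max(scores)
```

Classification then reduces to calling `classify(traj, models)` over the learned route HMMs.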

The test set for the OMNI1 database was collected over 24 hours on a single Saturday, without participant awareness, to obtain natural tracks; all 25 of the test trajectories were correctly classified. The OMNI2 test set, in contrast, was choreographed during a 30-minute collection period. In this set, only 12 of the 16 trajectories (75%) were correctly assigned to a path.



Fig. 14: OMNI1 online analysis results. (a) The true activity is often within the top 3 predictions. (b) Prediction accuracy increases with more data but saturates between 40−50% of track length. (c) Propagating the motion model forward a short amount of time improves prediction, with the most benefit when utilizing fewer historical samples.


8.2.2 Abnormal Trajectories

Abnormal trajectories were determined using (27), with β = 0.9 for the CROSS test and a more lenient β = 0.75 for the OMNI1 test, reflecting the relative difficulty of the two datasets. 84% of the CROSS abnormalities were identified with a 10% false positive rate. The OMNI1 and OMNI2 experiments did not perform quite as well: they had 68.8% and 83.3% detection rates with 12.0% and 25.0% false positive rates (FPR), respectively. Fig. 9 shows abnormal trajectories in the CROSS set, while Fig. 16a shows an example from the OMNI scene.
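The decision itself is a simple threshold test. One plausible instantiation, under the assumption that (27) compares the best per-trajectory model likelihood against a β-controlled quantile of the training scores, is sketched below; the quantile form is our assumption, and only the role of β as a sensitivity knob is taken from the text.

```python
import numpy as np

def fit_threshold(train_logliks, beta):
    # Threshold chosen so that roughly a (1 - beta) fraction of typical
    # training trajectories would be flagged (assumed reading of (27)).
    return np.quantile(train_logliks, 1.0 - beta)

def is_abnormal(loglik, threshold):
    # A completed trajectory is abnormal when even its best-fit
    # activity model explains it poorly.
    return loglik < threshold
```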

8.3 Online Analysis

While trajectory summarization is useful for data compression and semantic query [37], it is an offline process that operates after the completion of an activity. In contrast, surveillance requires real-time, up-to-date information to understand the monitoring situation. Online analysis must interpret agent behavior while it is under observation and express findings immediately. The following section gives the online performance. In particular, it examines how well an activity can be predicted using limited data and how often unusual actions are detected. The online analysis results are summarized in Table 3, which uses wc = 0.3T and wp = 0.1T in (28).
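The windowing of (28) can be sketched as follows: keep only the most recent wc·T samples of the live track and append wp·T extrapolated points before rescoring against each HMM. The constant-velocity extrapolation and the helper names are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def online_window(track, T_est, wc=0.3, wp=0.1):
    """Build the evidence window for online matching.
    track: (t, d) points observed so far; T_est: estimated full track length.
    Keeps the last wc*T_est samples and appends wp*T_est predicted points
    from constant-velocity extrapolation (assumed motion model)."""
    n_hist = max(1, int(wc * T_est))
    recent = track[-n_hist:]
    v = recent[-1] - recent[-2] if len(recent) > 1 else np.zeros(track.shape[1])
    n_pred = int(wp * T_est)
    predicted = recent[-1] + np.outer(np.arange(1, n_pred + 1), v)
    return np.vstack([recent, predicted])
```

The window output can be fed directly to the `classify` sketch above to obtain the current best match Λ∗(t).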

8.3.1 Activity Prediction

An emerging theme in intelligent monitoring is intent prediction. Rather than just explaining what has occurred, intent prediction tries to infer what an agent will do before it happens, providing time for preemptive action.

This section evaluates how well activities can be predicted using the HMMs. The accuracy of prediction is measured at each time instant for every moving object in a scene. A correct prediction must have the same label at time t as the true label of the full trajectory.
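This protocol amounts to the loop below, where `score_models` is a stand-in for the per-model windowed likelihood just described (a hypothetical helper), and k = 3 reproduces the top-3 curves of Fig. 14a.

```python
import numpy as np

def prediction_accuracy(tracks, labels, score_models, k=1):
    """Fraction of per-frame predictions whose top-k models contain
    the true (full-trajectory) label. score_models(partial) returns
    one score per activity model."""
    correct = total = 0
    for track, true_label in zip(tracks, labels):
        for t in range(2, len(track)):      # need at least 2 points to score
            scores = score_models(track[:t])
            topk = np.argsort(scores)[::-1][:k]
            correct += int(true_label in topk)
            total += 1
    return correct / total
```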

Looking at the prediction column in Table 3, we see that the highest prediction performance is for the I5 dataset. This is not surprising because the lanes of the highway are straight and do not have any overlap. In contrast, the other datasets have 20−30% lower performance, attributed to the overlap of activities. Using only a small window of data, it is almost impossible to distinguish activities when segments are shared. At the start of the U-turn in Fig. 10, none of the top-3 predictions match, so they are considered incorrect even though it is not possible to make an accurate prediction with so little data.

When considering not just the best match but the top 3 matches, as shown in Fig. 14a, the performance is significantly better. This shows that it is possible to have a good idea of the final activity early on, as demonstrated in the U-turn example (Fig. 10).

The effects of wc and wp when evaluating (29) are examined for the OMNI1 experiment in Fig. 14. Although the best prediction might be unreliable, the correct activity is often within the top 3 (Fig. 14a). As expected, the prediction accuracy improves given more data, since larger wc values come closer to having the full trajectory, but performance saturates when wc is between 40−50% (Fig. 14b). Fig. 14c explores the effects of wp and indicates that, despite imperfect motion models, there is some benefit to a small wp > 0, especially at lower wc values. However, wp should not be too large, since points predicted far into the future will poorly represent the coming activity.

The plots in Fig. 14 indicate it is best to propagate the motion model forward for only a few time steps and to use as much historical data as possible for accuracy. However, a balance must be struck between accuracy and response delay when retaining more past samples.



Fig. 15: Unusual action ROC for the CROSS and OMNI1 datasets. The OMNI1 experiment demonstrates significant room for improvement. Evaluation is difficult because of the large overlap between activities, which causes confusion when using only a small window of data.

8.3.2 Unusual Action Detection

With unusual action detection, the exact time and location of an abnormality is indicated. These events tend to occur in groups of successive points, where the number of points is directly proportional to the duration of an unusual action.
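Converting per-point flags into discrete events is then a run-length grouping, as in the small sketch below (an assumed post-processing step; the text only notes that abnormal points arrive in consecutive runs, whose lengths approximate event duration).

```python
def group_events(flags):
    """Group consecutive abnormal frames into (start, end) events.
    flags: list of booleans, one per tracking point."""
    events, start = [], None
    for t, f in enumerate(flags):
        if f and start is None:
            start = t
        elif not f and start is not None:
            events.append((start, t - 1))
            start = None
    if start is not None:
        events.append((start, len(flags) - 1))
    return events
```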

The unusual action detection was performed for the CROSS and OMNI1 experiments only, since each individual tracking point had to be manually labeled as abnormal or not. The unusual action column of Table 3 summarizes the detection performance on these sets. The simulated intersection had an 82.2% detection rate with a 32.4% FPR, while the OMNI1 experiment had only a 57.6% TPR but at a lower 26.6% FPR. Unfortunately, the false positive rates were quite high.

Performance on the simulated dataset was higher because the trajectories were more controlled, making it easy to indicate which trajectory points were abnormal. In contrast, the natural tracks from the OMNI1 dataset were given a best-guess label. It was often very difficult to tell what exactly was unusual in this scene, as there was little inter-frame motion and high overlap between routes. In addition, what might be considered abnormal given the final activity label might actually be quite typical given the best online match Λ∗(t) based on just a window of data. For example, when an object moved in one direction, backtracked, and then resumed forward motion, the entire backtrack was manually labeled as abnormal (since it opposes the final maneuver), even though during the backtrack it is a typical action for another maneuver. An example of this behavior is shown in Fig. 16b, where the red X's correspond to the point of backtracking.

The online abnormality detection performance is better characterized in Fig. 15 by the receiver operating characteristic (ROC) curves for the two datasets. The CROSS curve looks promising, with decent detection (40−50%) even at very low FPR. In contrast, the real data collected in the OMNI1 set shows ample room for improvement. Work is needed to move the OMNI1 curve toward the CROSS ROC curve, which may require modification of the γ_k^t term. Notice that the curve associated with a velocity (dynamics) encoded path (OMNI1 - uv) is slightly better than relying on just position (OMNI1 - xy). The added velocity detail helps discriminate higher-order effects.


Fig. 16: Examples of automatically detected anomalies. (a) Offline: walking along the walls of the OMNI1 room. (b) Online: backtracking and reversing travel direction (red X's indicate the point of unusual action).


Fig. 17: The TSC cluster validity criterion used by Hu was not able to consistently elbow at Nr (denoted by the red X).


8.4 Comparative Analysis

In order to validate our work, we compare our system against the state of the art. We use the system presented by Hu et al. in 2006 [15] with their improved spectral clustering technique developed in 2007 [37]. The comparison examines the quality of route extraction from clustering and the activity recognition results on the CROSS dataset.

In order to learn the routes, Hu et al. [15] defined trajectory similarity as the average distance between trajectory points (after resampling to a fixed number of points) and used spectral clustering. The optimal number of scene routes was automatically determined using the tightness and separation criterion (TSC). The plot in Fig. 17 shows the TSC values for different numbers of clusters K for the test sets.



Fig. 18: The single-tracking-point abnormality detection technique developed by Hu resulted in a high false positive rate (even at the highest threshold).

TABLE 4: Comparison with Hu [15] for CROSS dataset

           classify   abnormal (FPR)    predict
Hu [15]    98.01%     91.0% (23.3%)     69.3%
CVRR       96.8%      81.5% (18.5%)     75.7%

A smaller TSC value indicates better clusters (tighter, more compact clusters with greater separation between different clusters). Although spectral clustering produces better separated routes than FCM, it is not readily apparent what the optimal choice for K should be. The red X indicates the true number of paths; unfortunately, choosing the K value that minimizes the TSC does not always produce the correct number because of local minima. This contrasts with Fig. 13c, which demonstrates the stability of the match-based merge techniques.
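For reference, the spectral clustering recipe being compared is essentially the Ng-Jordan-Weiss algorithm [33] applied to a pairwise trajectory-distance matrix; a compact sketch follows. The Gaussian affinity bandwidth σ and the use of scikit-learn's KMeans are implementation choices on our part, not details from [15].

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(D, k, sigma=1.0):
    """Cluster items given a pairwise distance matrix D of shape (n, n),
    following the standard Ng-Jordan-Weiss recipe [33]."""
    A = np.exp(-D ** 2 / (2 * sigma ** 2))                # Gaussian affinity
    np.fill_diagonal(A, 0.0)
    d = A.sum(axis=1) + 1e-12                             # guard isolated rows
    L = (A / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]   # D^-1/2 A D^-1/2
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]                                      # top-k eigenvectors
    X /= np.linalg.norm(X, axis=1, keepdims=True)         # row-normalize
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```

Sweeping k over a range and evaluating the TSC at each value reproduces the model-selection curves of Fig. 17.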

In [15], the motion patterns are modeled as a chain of Gaussians, very similar to an HMM. The main difference is that time normalization is done probabilistically with the HMM, while the chain does this based on track length (dividing a trajectory into Q parts, with each part corresponding to a small set of points). Using the average distance between a trajectory and a chain, the likelihood of a path was found to behave as an exponential random variable. Their abnormal trajectory threshold was chosen as the minimal value in each set Dk. Finally, they estimate the probability of an unusual point during live analysis by seeing how well the current tracking point is modeled by the best-fit Gaussian state. Table 4 gives a quantitative comparison between the two systems on the intersection CROSS dataset.
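A schematic of this length-based chain scoring, as just described, might look like the sketch below. The Mahalanobis-style distance and all names are our assumptions; the exact distance used in [15] may differ.

```python
import numpy as np

def chain_score(traj, means, covs):
    """Average distance between a trajectory and a chain of Q Gaussians.
    traj: (T, d) resampled track; means: (Q, d); covs: (Q, d, d)."""
    Q = len(means)
    parts = np.array_split(traj, Q)     # length-based time normalization
    dists = []
    for q, part in enumerate(parts):
        inv = np.linalg.inv(covs[q])
        diff = part - means[q]
        # Mean Mahalanobis distance of this segment to its Gaussian.
        dists.append(np.mean(np.sqrt(np.einsum('nd,dk,nk->n', diff, inv, diff))))
    return np.mean(dists)
```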

Since a Gaussian chain and an HMM are very similar, the results for full trajectories are quite similar. Using the same abnormality threshold, Hu is able to detect more abnormal trajectories but with almost 5% higher FPR. In contrast to the summarization results, the HMMs are much more effective during live analysis, both for prediction and for unusual action detection. The HMM time normalization procedure is more robust than the length-based normalization used by Hu. The ROC curve in Fig. 18 shows significant improvement when using a small window of data over just the current sample point: when accepting only 20% FPR, there is 20% better performance. In fact, the lowest FPR possible with the single point is approximately 30%, and this requires βt values specified to five significant digits.

The comparative experiments suggest the 3-stage learning framework accurately specifies scene activities. Automatic selection of the number of routes is robustly handled through the match-based merge process, and the HMM path model captures activity subtleties for robust online analysis.

9 CONCLUDING REMARKS

This paper has introduced a general framework for unsupervised visual scene description and live analysis based on learning the repetitive structure inherent in motion trajectories. An activity vocabulary at differing resolutions is automatically generated through a 3-stage hierarchical learning process, which indicates important image points, connects them through spatial routes, and probabilistically models spatio-temporal dynamics with an HMM. The activity models adapt for long-term monitoring of changing environments. Comprehensive analysis, not present in previous literature, of three different experimental sites demonstrates the framework's ability to accurately predict future activity and detect abnormal actions in real time.

Trajectory learning and analysis can be integrated into larger monitoring systems [42] to help focus and direct attention in multi-camera setups. By utilizing only trajectory information, rather than higher-resolution visual descriptors, the learning framework is general enough to be used for data exploration in alternate sensor modalities, such as radar [43], without modification.

ACKNOWLEDGMENTS

We would like to thank the reviewers for their insightful comments, NSF-ITR and UC Discovery for support, and our CVRR colleagues for their continued assistance.

REFERENCES

[1] J. K. Aggarwal and Q. Cai, "Human motion analysis: A review," Computer Vision and Image Understanding, vol. 73, pp. 428–440, 1999.

[2] K. Huang and M. Trivedi, "3D shape context based gesture analysis integrated with tracking using omni video array," in Proc. IEEE Workshop on Vision for Human Computer Interaction in conjunction with CVPR, Jun. 2005.

[3] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.

[4] S. Mitra and T. Acharya, "Gesture recognition: A survey," IEEE Trans. Syst., Man, Cybern. C, vol. 37, no. 3, pp. 311–324, May 2007.

[5] M. M. Trivedi, T. L. Gandhi, and K. S. Huang, "Distributed interactive video arrays for event capture and enhanced situational awareness," IEEE Intell. Syst., vol. 20, no. 5, pp. 58–66, Sep. 2005.

[6] W. Yan and D. A. Forsyth, "Learning the behavior of users in a public space through video tracking," in IEEE Workshop on Applications of Computer Vision, Breckenridge, CO, Jan. 2005, pp. 370–377.

[7] B. T. Morris and M. M. Trivedi, "Learning, modeling, and classification of vehicle track patterns from live video," IEEE Trans. Intell. Transp. Syst., vol. 9, no. 3, pp. 425–437, Sep. 2008.


[8] S. Messelodi, C. M. Modena, and M. Zanin, "A computer vision system for the detection and classification of vehicles at urban road intersections," Pattern Anal. and Applications, vol. 8, no. 1-2, pp. 17–31, Sep. 2005.

[9] D. Demirdjian, K. Tollmar, K. Koile, N. Checka, and T. Darrell, "Activity maps for location-aware computing," in IEEE Workshop on Applications of Computer Vision, Dec. 2002, pp. 70–75.

[10] N. Johnson and D. Hogg, "Learning the distribution of object trajectories for event recognition," in Proc. British Conf. Machine Vision, vol. 2, Sep. 1995, pp. 583–592.

[11] N. Sumpter and A. J. Bulpitt, "Learning spatio-temporal patterns for predicting object behavior," Image and Vision Computing, vol. 18, pp. 697–704, Jun. 2000.

[12] J. Owens and A. Hunter, "Application of the self-organising map to trajectory classification," in Proc. IEEE Visual Surveillance, Jul. 2000, pp. 77–83.

[13] W. Hu, D. Xie, T. Tan, and S. Maybank, "Learning activity patterns using fuzzy self-organizing neural network," IEEE Trans. Syst., Man, Cybern. B, vol. 34, no. 3, pp. 1618–1626, Jun. 2004.

[14] I. N. Junejo, O. Javed, and M. Shah, "Multi feature path modeling for video surveillance," in Inter. Conf. Patt. Recog., Aug. 2004, pp. 716–719.

[15] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, "A system for learning statistical motion patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 9, pp. 1450–1464, Sep. 2006.

[16] B. Morris and M. Trivedi, "Learning and classification of trajectories in dynamic scenes: A general framework for live video analysis," in Proc. IEEE International Conference on Advanced Video and Signal based Surveillance, Santa Fe, New Mexico, Sep. 2008, pp. 154–161.

[17] D. Makris and T. Ellis, "Learning semantic scene models from observing activity in visual surveillance," IEEE Trans. Syst., Man, Cybern. B, vol. 35, no. 3, pp. 397–408, Jun. 2005.

[18] F. I. Bashir, A. A. Khokhar, and D. Schonfeld, "Object trajectory-based activity classification and recognition using hidden Markov models," IEEE Trans. Image Process., vol. 16, no. 7, pp. 1912–1919, Jul. 2007.

[19] C. Piciarelli and G. L. Foresti, "On-line trajectory clustering for anomalous events detection," Pattern Recognition Letters, vol. 27, no. 15, pp. 1835–1842, Nov. 2006.

[20] X. Wang, X. Ma, and E. Grimson, "Unsupervised activity perception by hierarchical Bayesian models," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2007, pp. 1–8.

[21] X. Wang, K. T. Ma, G.-W. Ng, and W. Grimson, "Trajectory analysis and semantic region modeling using a nonparametric Bayesian model," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 2008, pp. 1–8.

[22] X. Wang, X. Ma, and W. Grimson, "Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 3, pp. 539–555, Mar. 2009.

[23] T. Xiang and S. Gong, "Video behaviour profiling for anomaly detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 893–908, May 2008.

[24] B. T. Morris and M. M. Trivedi, "A survey of vision-based trajectory learning and analysis for surveillance," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 8, pp. 1114–1127, Aug. 2008, Special Issue on Video Surveillance.

[25] F. Porikli, "Learning object trajectory patterns by spectral clustering," in Proc. IEEE Inter. Conf. Multimedia and Expo, vol. 2, Jun. 2004, pp. 1171–1174.

[26] F. Bashir, W. Qu, A. Khokhar, and D. Schonfeld, "HMM-based motion recognition system using segmented PCA," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 3, Sep. 2005, pp. 1288–1291.

[27] A. Naftel and S. Khalid, "Classifying spatiotemporal object trajectories using unsupervised learning in the coefficient feature space," Multimedia Systems, vol. 12, no. 3, pp. 227–238, Dec. 2006.

[28] Z. Zhang, K. Huang, and T. Tan, "Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes," in Proc. IEEE Inter. Conf. on Pattern Recog., 2006, pp. 1135–1138.

[29] S. Atev, O. Masoud, and N. Papanikolopoulos, "Learning traffic patterns at intersections by spectral clustering of motion trajectories," in IEEE Conf. Intell. Robots and Systems, Beijing, China, Oct. 2006, pp. 4851–4856.

[30] P. Dickinson and A. Hunter, "Using inactivity to detect unusual behavior," in IEEE Workshop on Motion and Video Computing, Jan. 2008.

[31] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Royal Stat. Soc., vol. 39, pp. 1–38, 1977.

[32] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 381–396, Mar. 2002.

[33] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Advances in Neural Information Processing Systems, vol. 14, 2002, pp. 849–856.

[34] B. Morris and M. Trivedi, "Learning trajectory patterns by clustering: Experimental studies and comparative evaluation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Miami, Florida, Jun. 2009, pp. 312–319.

[35] M. Vlachos, G. Kollios, and D. Gunopulos, "Discovering similar multidimensional trajectories," in Proc. IEEE Conf. on Data Engineering, Feb. 2002, pp. 673–684.

[36] W. Pedrycz, Knowledge-Based Clustering: From Data to Information Granules. Hoboken, New Jersey: John Wiley, 2005.

[37] W. Hu, D. Xie, Z. Fu, W. Zeng, and S. Maybank, "Semantic-based surveillance video retrieval," IEEE Trans. Image Process., vol. 16, no. 4, pp. 1168–1181, Apr. 2007.

[38] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, New Jersey: Prentice-Hall, 1993.

[39] E. Swears, A. Hoogs, and A. Perera, "Learning motion patterns in surveillance video using HMM clustering," in IEEE Workshop on Motion and Video Computing, Jan. 2008, pp. 1–8.

[40] M. Gales, D. Pye, and P. Woodland, "Variance compensation within the MLLR framework for robust speech recognition and speaker adaptation," in Proc. IEEE Inter. Conf. Spoken Language, Oct. 1996, pp. 1832–1835.

[41] M. Russell and R. Moore, "Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition," in Proc. IEEE Inter. Conf. on Acoustics, Speech, and Signal Processing, Apr. 1985, pp. 5–8.

[42] B. T. Morris and M. M. Trivedi, "Contextual activity visualization from long-term video observations," IEEE Intell. Syst., vol. 25, no. 3, pp. 50–62, 2010, Special Issue on Intelligent Monitoring of Complex Environments.

[43] B. Morris and M. Trivedi, "Unsupervised learning of motion patterns of rear surrounding vehicles," in Proc. IEEE Conf. Veh. Elec. and Safety, Pune, India, Nov. 2009, pp. 80–85.

Brendan Tran Morris received his B.S. degree from the University of California, Berkeley, and his Ph.D. degree from the University of California, San Diego. He was awarded the IEEE ITSS Best Dissertation Award in 2010. He currently works as a postdoctoral researcher in the Computer Vision and Robotics Research Laboratory, where he has worked to develop VECTOR, a real-time highway transportation measurement system for congestion and environmental studies, and on-road behavior prediction for intelligent driver assistance systems. Morris' research interests include unsupervised machine learning for recognizing and understanding activities; real-time measurement, monitoring, and analysis; and driver assistance and safety systems.


Mohan Manubhai Trivedi (IEEE Fellow) received the B.E. (with honors) degree from the Birla Institute of Technology and Science, Pilani, India, and the Ph.D. degree from Utah State University, Logan. He is Professor of electrical and computer engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory, University of California, San Diego (UCSD), La Jolla. He and his team are currently pursuing research in machine and human perception, machine learning, distributed video systems, multimodal affect and gesture analysis, human-centered interfaces, and intelligent driver assistance systems. He regularly serves as a consultant to industry and government agencies in the U.S. and abroad. He has given over 65 keynote/plenary talks. Prof. Trivedi served as the Editor-in-Chief of the Machine Vision and Applications journal. He is currently an editor for the IEEE Transactions on Intelligent Transportation Systems and Image and Vision Computing. He served as the General Chair for the IEEE Intelligent Vehicles Symposium IV 2010. Trivedi's team also designed and deployed the "Eagle Eyes" system on the US-Mexico border in 2006. He served as a charter member and Vice Chair of the Executive Committee of the University of California systemwide UC Discovery Program. Trivedi served as an Expert Panelist for the Strategic Highway Research Program of the Transportation Research Board of the National Academies. He has received the Distinguished Alumnus Award from Utah State University, the Pioneer Award and Meritorious Service Award from the IEEE Computer Society, and several Best Paper Awards. Two of his students were awarded Best Dissertation Awards by the IEEE ITS Society (Dr. Shinko Cheng 2008 and Dr. Brendan Morris 2010). Trivedi is a Fellow of the IEEE ("for contributions to Intelligent Transportation Systems field"), IAPR ("for contributions to vision systems for situational awareness and human-centered vehicle safety"), and the SPIE ("for contributions to the field of optical engineering").
