Video Understanding Framework For Automatic

Behavior Recognition

François Brémond, Monique Thonnat, Marcos Zúñiga

INRIA Sophia Antipolis, ORION group
2004, route des Lucioles, BP93

06902 Sophia Antipolis Cedex, France
[email protected]

ABSTRACT

We propose an activity monitoring framework based on a platform called VSIP, enabling behavior recognition in different environments. To allow end-users to actively participate in the development of a new application, VSIP separates algorithms from a priori knowledge. To describe how VSIP works, we present a full description of a system developed with this platform for recognizing behaviors, involving either isolated individuals, groups of people or crowds, in the context of visual monitoring of metro scenes using multiple cameras. In this work, we also illustrate the capability of the framework to easily combine and tune various recognition methods dedicated to the visual analysis of specific situations (e.g. mono/multi-actor activities, numerical/symbolic actions or temporal scenarios). We also present other applications using this framework in the context of behavior recognition. VSIP has shown good performance on human behavior recognition for different problems and configurations, and is suitable to fulfill a large variety of requirements.

1 INTRODUCTION

One of the most challenging problems in the domain of computer vision and artificial intelligence is video understanding. Research in this area concentrates mainly on the development of methods for analyzing visual data in order to extract and process information about the behavior of physical objects in a scene.

Most approaches in the field of video understanding incorporate methods for detecting domain-specific events. Examples of such systems use Dynamic Time Warping for gesture recognition [1] or self-organizing networks for trajectory classification [18]. The main drawback of these approaches is the use of techniques specific to a certain domain, which makes it difficult to apply them to other areas. Therefore some researchers have

adopted a two-step approach to the problem of video understanding:

1. A visual module is used in order to extract visual cues and primitive events.

2. This information is used in a second stage for the detection of more complex and abstract behavior patterns [15].

By dividing the problem into two sub-problems we can use simpler and more domain-independent techniques in each step. The first step usually makes extensive use of stochastic methods for data analysis, while the second step conducts a structural analysis of the symbolic data gathered at the preceding step (see figure 1). Examples of this two-level architecture can be found in the works of [16] and [23].
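As a purely illustrative sketch (not taken from any of the cited systems; all names are hypothetical), the division of labour between the two steps can be pictured as follows: a numerical first stage turns raw detections into primitive events, and a symbolic second stage matches the resulting event stream against a behavior pattern.

# Illustrative two-step pipeline (hypothetical names, toy rules).
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:            # output of low-level vision, input of step 1
    frame: int
    label: str              # e.g. "person"
    x: float
    y: float

@dataclass
class PrimitiveEvent:       # symbolic output of step 1, input of step 2
    frame: int
    name: str               # e.g. "enters_zone"
    actor: str

def extract_primitive_events(detections: List[Detection]) -> List[PrimitiveEvent]:
    """Step 1: numerical analysis of visual data producing primitive events."""
    events = []
    for d in detections:
        if d.x > 10.0:      # toy numeric rule standing in for real vision processing
            events.append(PrimitiveEvent(d.frame, "enters_zone", d.label))
    return events

def recognize_behavior(events: List[PrimitiveEvent], pattern: List[str]) -> bool:
    """Step 2: structural analysis of the symbolic stream (here, subsequence matching)."""
    names = [e.name for e in events]
    i = 0
    for n in names:
        if i < len(pattern) and n == pattern[i]:
            i += 1
    return i == len(pattern)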

This approach is available as a platform for image sequence understanding called VSIP (Video Surveillance Interpretation Platform), which was developed by the ORION research group at INRIA (Institut National de Recherche en Informatique et en Automatique), Sophia Antipolis. VSIP is a generic environment for combining algorithms for the processing and analysis of videos, which allows various techniques to be flexibly combined and exchanged at the different stages of the video understanding process. Moreover, VSIP is oriented toward helping developers describe their own scenarios and build systems capable of monitoring behaviors, dedicated to specific applications.

Fig. 1: A general architecture of a video understanding system. The steps depicted in the figure describe the data flow during a video understanding process.

At the first level, VSIP extracts primitive geometric features like areas of motion. Based on them, objects are recognized and tracked. At the second level, the events in which the detected objects participate are recognized. To perform this task, a special representation of events is used, called the event description language [23]. This formalism is based on an ontology for video events presented in [4], which defines concepts and relations between these concepts in the domain of human activity monitoring. The major concepts encompass different object types and the understanding of their behavior from the point of view of the domain expert.

In this article, we illustrate this framework by presenting a system developed using the VSIP platform for recognizing people's behaviors, such as fighting or vandalism, occurring in a metro scene viewed by one or several cameras. This work has been performed in the framework of the European project ADVISOR (http://www-sop.inria.fr/orion/ADVISOR). Our goal is to recognize in real time behaviors involving either isolated individuals, groups of people or crowds from real-world video streams coming from metro stations. To reach this goal, we have developed a system which takes as input video streams coming from cameras and generates annotations about the behaviors recognized in the video streams.

This paper is organized as follows. Section two presents the state of the art in behavior recognition. In section three, we briefly present the global system and its vision module; we then detail the behavior recognition process, illustrated through five behavior recognition examples, the "Fighting", "Blocking", "Vandalism", "Overcrowding" and "Fraud" behaviors, and analyze the obtained results. In section four, we discuss VSIP's capability of dealing with other applications under various configurations.

2 RELATED WORK

Since the 1990s, a problem of focus in cognitive vision has been Automatic Video Interpretation. There are now several research units and companies defining new approaches to design systems that can understand specific scenarios in dynamic scenes. We define a scenario as a combination of states, events or sub-scenarios. Behaviors are specific scenarios, defined by the users.

Three main categories of approaches are used to recognize scenarios:

1. Probabilistic/neural networks combining potentially recognized scenarios.

2. Symbolic networks that store totally recognized scenarios.

3. Symbolic networks that store partially recognized scenarios.

For the computer vision community, a natural approach consists in using a probabilistic/neural network. The nodes of this network usually correspond to scenarios that are recognized at a given instant with a computed likelihood. For example, [14] proposed an approach to recognize a scenario based on a neural network (time delay Radial Basis Function). [12] proposed a scenario recognition method that uses concurrent Bayesian threads to estimate the probability of potential scenarios.

For the artificial intelligence community, a natural way to recognize a scenario is to use a symbolic network whose nodes usually correspond to the boolean recognition of scenarios. For example, [21] used a declarative representation of scenarios defined as a set of spatio-temporal and logical constraints, and a traditional constraint resolution technique to recognize them. To reduce the processing time of the recognition step, they proposed to check the consistency of the constraint network using the AC4 algorithm. [10] defined a method to recognize a scenario based on a fuzzy temporal logic. The common characteristic of these approaches is to store all totally recognized behaviors (recognized in the past) [23].

Another approach consists in using a symbolic network and storing partially recognized scenarios (to be recognized in the future). For example, [11] used the terminology chronicle to express a temporal scenario. A chronicle is represented as a set of temporal constraints on time-stamped events. The recognition algorithm keeps and updates partial recognitions of scenarios using the propagation of temporal constraints based on the RETE algorithm. Their applications are dedicated to the control of turbines and telephonic networks. [6] made an adaptation of temporal constraint propagation for video surveillance. In the same period, [19] used Allen's interval algebra to represent scenarios and presented a specific algorithm to reduce its complexity.


All these techniques allow an efficient recognition of scenarios, but there are still some temporal constraints which cannot be processed. For example, most of these approaches require that the scenarios be bounded in time [11], or process temporal constraints and atemporal constraints in the same way [21].

Another problem that has recently captured the attention of researchers is unsupervised behavior learning and recognition, that is, the capability of a vision interpretation system to learn and detect the frequent scenarios of a scene without requiring the prior definition of behaviors by the user. The unsupervised behavior learning and recognition problem in the field of computer vision is addressed in only a few works. Most of the approaches are confined to a specific domain and take advantage of domain knowledge in order, for example, to choose a proper model or to select features. One of the most widely used techniques for learning scenarios in an unsupervised manner is learning the topology of a Markov model. [3] use an entropy-based function instead of the Maximum-Likelihood estimator in the E-step of the EM algorithm for learning the parameters of Hidden Markov Models (HMM). This leads to a concentration of the transition probabilities on just a few states, which in most cases correspond to meaningful events. Another approach is based on variable length Markov models, which can express the dependence of a Markov state on more than one previous state [8]. While this method learns good stochastic models of the data, it cannot handle temporal relations. A further similar technique is based on hierarchical HMMs whose topology is learned by merging and splitting states [24]. The advantage of the above techniques for topology learning of Markov models is that they work in a completely unsupervised way. Additionally, they can be used after the learning phase to recognize efficiently the discovered events. On the other hand, these methods deal with simple events, are not capable of creating concept hierarchies, and there is no guarantee that the states of these models correspond to meaningful events.

A different approach to this problem was proposed by [22]. In this work, the Apriori algorithm from the field of data mining is used to propose a method for unsupervised learning of behaviors from videos. The developed algorithm processes a set of generic primitive events and outputs the frequent patterns of these primitive events, also interpreted as frequent composite events. In a second step, models of these composite events are automatically generated (i.e. learned) in the event description language defined by [23], which can then be used to recognize the detected composite events in new videos. This application was used for detecting frequent behaviors in a parking lot monitoring system.
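As a rough illustration of the idea in [22] (our own sketch rather than the authors' algorithm, and without the candidate-pruning step of the real Apriori algorithm; names and the support threshold are invented), frequent combinations of primitive events observed across videos could be mined as follows:

# Toy frequent-pattern mining over per-video sets of primitive events (illustrative only).
from itertools import combinations

def frequent_event_patterns(videos, min_support=0.5, max_size=3):
    """videos: list of sets of primitive event names observed in each video."""
    n = len(videos)
    frequent = {}
    for size in range(1, max_size + 1):
        counts = {}
        for events in videos:
            for pattern in combinations(sorted(events), size):
                counts[pattern] = counts.get(pattern, 0) + 1
        kept = {p: c / n for p, c in counts.items() if c / n >= min_support}
        if not kept:
            break
        frequent.update(kept)
    return frequent   # e.g. {("enters_zone", "stops"): 0.8, ...}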

The state of the art shows the large diversity of video understanding techniques in automatic behavior recognition. The challenge is to combine these techniques efficiently to

address the large diversity of the real world.

3 VIDEO UNDERSTANDING PLATFORM

The video interpretation platform is based on the cooperation of a vision module and a behavior recognition module, as shown in Figure 2.

The vision module is composed of three tasks. First, a motion detector and a frame-to-frame tracker generate a graph of mobile objects for each calibrated camera. Second, a combination mechanism is performed to combine the graphs computed for each camera into a global one. Third, this global graph is used for the long-term tracking of individuals, vehicles, groups of people and crowds evolving in the scene.
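A minimal sketch of the data structures involved might look as follows (purely illustrative; it is not VSIP's actual representation, and all identifiers and the fusion criterion are assumptions):

# Hypothetical per-camera and global graphs of mobile objects.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class MobileObject:
    object_id: int
    camera: str
    frame: int
    position: Tuple[float, float, float]   # 3D position obtained through camera calibration

@dataclass
class MobileObjectGraph:
    nodes: List[MobileObject] = field(default_factory=list)
    edges: List[Tuple[int, int]] = field(default_factory=list)  # links between object ids

def combine_graphs(per_camera: Dict[str, MobileObjectGraph],
                   merge_distance: float = 0.5) -> MobileObjectGraph:
    """Merge the per-camera graphs into one global graph, linking objects seen at the
    same frame by different cameras when their 3D positions are close (toy criterion)."""
    global_graph = MobileObjectGraph()
    for graph in per_camera.values():
        global_graph.nodes.extend(graph.nodes)
        global_graph.edges.extend(graph.edges)
    for i, a in enumerate(global_graph.nodes):
        for b in global_graph.nodes[i + 1:]:
            if a.frame == b.frame and a.camera != b.camera:
                d = sum((p - q) ** 2 for p, q in zip(a.position, b.position)) ** 0.5
                if d < merge_distance:
                    global_graph.edges.append((a.object_id, b.object_id))
    return global_graph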

For each tracked actor, the behavior recognition module performs three levels of reasoning: states, events and scenarios.

On top of that, we use 3D scene models (i.e. geometric models of the empty scene, including the furniture), one for each camera, as a priori contextual knowledge of the observed scene. We define in a scene model the 3D positions and dimensions of the static scene objects (e.g. a bench, a ticket vending machine) and the zones of interest (e.g. an entrance zone). Semantic attributes (e.g. fragile) can be associated with the objects or zones of interest to be used in the behavior recognition process.
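The following sketch is purely illustrative (class names, attributes and the threshold are our own assumptions, not VSIP's); it shows how such a scene model, together with the simple distance-based test behind a state such as "an individual is close to the ticket vending machine", could be represented:

# Hypothetical 3D scene model with zones of interest and semantic attributes.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]

@dataclass
class SceneObject:                  # static equipment, e.g. a ticket vending machine
    name: str
    position: Point3D
    dimensions: Point3D             # width, depth, height
    attributes: List[str] = field(default_factory=list)   # e.g. ["fragile"]

@dataclass
class Zone:                         # zone of interest, e.g. an entrance zone
    name: str
    polygon: List[Tuple[float, float]]   # ground-plane contour

@dataclass
class SceneModel:
    objects: Dict[str, SceneObject] = field(default_factory=dict)
    zones: Dict[str, Zone] = field(default_factory=dict)

def close_to(actor_position: Point3D, equipment: SceneObject, threshold: float = 1.0) -> bool:
    """Numerical test behind the state 'an individual is close to the equipment'."""
    dx = actor_position[0] - equipment.position[0]
    dy = actor_position[1] - equipment.position[1]
    dz = actor_position[2] - equipment.position[2]
    return (dx * dx + dy * dy + dz * dz) ** 0.5 <= threshold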

Fig. 2: Video interpretation system.

In this paper we focus on the behavior recognition process, as it is our current focus of interest¹.

¹ For more details on the vision processing module, see [7].

The goal of this process is to recognize specific behaviors occurring in an observed scene. A main problem in behavior recognition is the ability to define and reuse methods to recognize specific behaviors, knowing that the perception of behaviors is strongly dependent on the site, the camera view point and the individuals involved in the behaviors. Our approach consists in defining a formalism allowing us to write and easily reuse all the methods needed for the recognition of behaviors. This formalism is based on three main ideas:

1. The formalism should be flexible enough to allow various types of operators to be defined (e.g. a temporal filter or an automaton). We use operator as an abstract term for programs; this term is defined in the following section.

2. All the knowledge needed by an operator should be explained within the operator so that it can be easily reused.

3. The description of the operators should be declarative in order to build an extensible library of operators.

3.1 Behavior representation

We call an actor of a behavior any scene object involved in the behavior, including static objects (equipment, zones of interest), individuals, groups of people or crowds. The entities needed to recognize behaviors correspond to different types of concepts, which are:

1. The basic properties: a characteristic of an actor, such as its trajectory or its speed.

2. The states: a state describes a situation characterizing one or several actors defined at time t (e.g. a group is agitated) or a stable situation defined over a time interval. For the state "an individual stays close to the ticket vending machine", two actors are involved: an individual and a piece of equipment.

3. The events: an event is a change of states at two consecutive times (e.g. a group enters a zone of interest).

4. The scenarios: a scenario is a combination of states, events or sub-scenarios. Behaviors are specific scenarios (dependent on the application) defined by the users. For example, to monitor metro stations, end-users have defined five targeted behaviors: "Fraud", "Fighting", "Blocking", "Vandalism" and "Overcrowding".

To compute all the entities needed for the recognition of behaviors, we use a generic framework based on the definition of operators, which are program descriptions containing four elements:

• Name: indicates the entity to be computed, such as the state "an individual is walking" or "the trajectory is straight".

• Input: gives a description of the input data. There are two types of input data: basic properties characterizing an actor and sub-entities computed by other operators.

• Body: contains a set of competing methods to compute the entity. All these methods are able to compute this entity but they are specialized for different configurations. For example, to compute the scenario "fighting", there are four methods (as shown in Figure 3): one method computes the evolution of the lateral distance between people inside a group, a second one detects if someone, surrounded by people, has fallen on the floor.

• Output: contains the result of the entity computation, accessible by all the other operators. This result corresponds to the value of the entity at the current time.

This generic framework, based on the definition of operators, gives two advantages. First, it enables us to test a set of methods to compute an entity, independently of other entities, so we can locally modify the system (the methods to compute an entity) while keeping it globally consistent (without modifying the meaning of the entity). Second, the network of operators to recognize one scenario is organized as a hierarchy: the bottom of the hierarchy is composed of states and the top corresponds to the scenario to be recognized. Several intermediate levels, composed of state(s) or event(s), can be defined.
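A minimal sketch of this operator abstraction is given below (illustrative only; the real VSIP operators are not Python objects and all identifiers are hypothetical):

# Illustrative operator: name, declared inputs, a body of competing methods,
# and an output shared with the other operators.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Operator:
    name: str                                  # e.g. "an individual is walking"
    inputs: List[str]                          # basic properties or sub-entity names
    methods: List[Callable[[Dict], Optional[bool]]] = field(default_factory=list)
    output: Optional[bool] = None              # value of the entity at the current time

    def evaluate(self, data: Dict) -> Optional[bool]:
        """Try the competing methods in turn; each is specialized for a configuration
        and may return None when it does not apply."""
        for method in self.methods:
            result = method(data)
            if result is not None:
                self.output = result
                break
        return self.output

# Toy usage: a 'walking' state with two competing methods.
walking = Operator(
    name="an individual is walking",
    inputs=["speed"],
    methods=[
        lambda d: (0.5 < d["speed"] < 2.0) if "speed" in d else None,
        lambda d: None,                        # placeholder for another configuration
    ],
)
print(walking.evaluate({"speed": 1.2}))        # True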

3.2 Behavior recognition

We have defined four types of methods depending on the type of entities:

• Basic properties methods: We use dedicated routines to compute properties characterizing actors, such as trajectory, speed and direction. For example, we use a polygonal approximation to compute the trajectory of an individual or a group of people.

• State methods: We use numerical methods which include the computation of:

– 3D distance, for states dealing with spatial relations (e.g. "an individual is close to the ticket vending machine").

– the evolution of temporal features, for states dealing with temporal relations (e.g. "the size of a group of people is constant").

– speed, for states dealing with spatio-temporal relations (e.g. "an individual is walking").

– the combination of sub-states computed by other operators.

The output of these numerical methods is then classified to obtain a symbolic value.

• Event methods: We compare the status of states at two consecutive instants. The output of an event method is boolean: the event is either detected or not detected. For example, the event "a group of people enters a zone of interest" is detected when the state "a group of people is inside a zone of interest" changes from false to true.

Fig. 3: Four methods combined by an AND/OR tree to recognize the behavior "Fighting". Each image illustrates a configuration where one method is more appropriate to recognize the behavior: (a) lying person on the floor surrounded by people, (b) significant variation of the group width, (c) quick separation of people inside a group and (d) significant variation of the group trajectory.

• Scenario methods: For simple scenarios (composed of only one state), we verify that a state has been detected during a predefined time period using a temporal filter. For sequential scenarios (composed of a sequence of states), we use finite state automatons: an automaton state corresponds to a state and a transition to an event, and an automaton state also corresponds to an intermediate stage before the complete recognition of the scenario. We have used automatons to recognize the scenarios "Blocking" and "Fraud", as described in Figures 4 and 5 (a minimal sketch of such an automaton is given at the end of this list of methods).

For composed scenarios defining a single unit of movement composed of sub-scenarios, we use Bayesian networks as proposed by [13] or AND/OR trees of sub-scenarios, as illustrated in Figure 6. The structure of AND/OR trees, even if not continuous, is a good compromise between the usability of knowledge representation by experts and correspondence with observations on videos. A description of Bayesian networks for scenario recognition can be found in [17]. We have defined one Bayesian network to recognize the "violence" behavior, composed of two sub-scenarios: "internal violence" (e.g. erratic motion of people inside a group) and "external violence" (e.g. quick evolution of the trajectory of the group). The structures of Bayesian networks are statistically learned by an off-line process, allowing adaptability to different kinds of behaviors, but lacking in the usage of knowledge from an expert. In contrast, AND/OR trees can represent more precise knowledge from experts, but not necessarily in correspondence with the observed videos. Both methods are time demanding, either to collect representative videos or to tune the parameters corresponding to the expert knowledge. Also, both need a learning stage (statistical or manual) to adjust the parameters of the network using ground truth (videos annotated by operators). Bayesian networks are optimal given ground truth, but AND/OR trees are easier to tune and to adapt to new scenes.

Fig. 4: To check whether a group of people is blocking a zone of interest (ZOI), we have defined an automaton with three states: (a) a group is tracked, (b) the group is inside the ZOI and (c) the group has stopped inside the ZOI for at least 30 seconds.

Fig. 5: To check whether an individual is jumping over the barrier without validating his ticket, we have defined an automaton with five states: (a) an individual is tracked, (b) the individual is at the beginning of the validation zone, (c) the individual has a high speed, (d) the individual is over the barrier with legs up and (e) the individual is at the end of the validation zone.

Fig. 6: To recognize whether a group of people is fighting, we have defined an AND/OR tree composed of four basic scenarios: (L) lying person on the floor surrounded by people, (W) significant variation of the group width, (S) quick separation of people inside the group and (T) significant variation of the group trajectory. Given these four basic scenarios, we were able to build an OR node over all combinations (corresponding to 15 sub-scenarios) of the basic scenarios. These combinations correspond to AND nodes with one up to four basic scenarios. The more basic scenarios there are in an AND node, the less strict is the recognition threshold of each basic scenario. For example, when there is only one basic scenario (e.g. L(90)), the threshold is 90, and when there are four basic scenarios, the threshold is 60. To parameterize these thresholds, we have performed a learning stage consisting in a statistical analysis of the recognition of each basic scenario.

For scenarios with multiple actors involved in complex temporal relationships, we use a network of temporal variables representing sub-scenarios and we backtrack temporal constraints among the already recognized sub-scenarios, as proposed by [23].
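As announced above, the following minimal sketch illustrates the automaton-based scenario methods on the "Blocking" scenario of Figure 4 (illustrative only; the state names follow the figure, while the code itself and the observation interface are our own assumptions):

# Toy finite-state automaton for the "Blocking" scenario of Figure 4.
BLOCK_DURATION = 30.0   # seconds the group must stay stopped inside the ZOI

class BlockingAutomaton:
    def __init__(self):
        self.state = "group_tracked"
        self.stop_start = None

    def update(self, inside_zoi: bool, stopped: bool, t: float) -> bool:
        """Feed one observation per frame; returns True when 'Blocking' is recognized."""
        if self.state == "group_tracked" and inside_zoi:
            self.state = "group_inside_zoi"
        elif self.state == "group_inside_zoi":
            if not inside_zoi:
                self.state = "group_tracked"
            elif stopped:
                self.state = "group_stopped_inside_zoi"
                self.stop_start = t
        elif self.state == "group_stopped_inside_zoi":
            if not (inside_zoi and stopped):
                self.state = "group_inside_zoi" if inside_zoi else "group_tracked"
                self.stop_start = None
            elif t - self.stop_start >= BLOCK_DURATION:
                return True
        return False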

To enable users to define the behaviors they want to recognize, an event description language is used as a formalism for describing the events characterizing these behaviors [23]. The purpose of this language is to provide a formal, but also intuitive, easily understandable and simple tool for describing events. All these features can be achieved by defining events in a hierarchical way and reusing definitions of simple events in more complex ones. A definition of an event consists of:

• Event name.

• List of physical objects involved in the event: A physical object can be a mobile object or a static one. Typical examples are humans, vehicles, zones or equipment.

• List of components representing sub-events which describe simpler activities.


• List of constraints expressing additional conditions: The constraints can be spatial or temporal, depending on their meaning. In both cases they can have a symbolic or numeric form. For example, a spatial symbolic constraint is "object inside zone", while a spatial numeric constraint can be defined as follows:

distance(object1, object2) ≤ threshold

In the case of a temporal constraint, we can also have a numeric form like:

duration(event) ≤ 20 [secs]

or a symbolic form:

event1 before event2
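The sketch below is purely illustrative (it is not the formalism of [23]; the interval representation and function names are our own assumptions) and shows how such numeric and symbolic temporal constraints could be checked over time-stamped events:

# Hypothetical evaluation of the constraint forms shown above.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float   # seconds
    end: float

def duration_at_most(event: Interval, limit: float) -> bool:
    """Numeric temporal constraint, e.g. duration(event) <= 20 [secs]."""
    return (event.end - event.start) <= limit

def before(event1: Interval, event2: Interval) -> bool:
    """Symbolic temporal constraint 'event1 before event2' (Allen's BEFORE relation)."""
    return event1.end < event2.start

# Example: e1 before e2, and e2 lasts at most 20 seconds.
e1, e2 = Interval(0.0, 5.0), Interval(6.0, 18.0)
print(before(e1, e2) and duration_at_most(e2, 20.0))   # True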

Figure 7 depicts an example of a complex scenario, "vandalism", in which a person p tries to break a piece of equipment eq, expressed using the formalism of [23]. This scenario is recognized if the sequence of five events described in figure 8 has been detected.

event(vandalism,
    physical_objects( p : Person, eq : Equipment ),
    components( (state e1 : p far from eq),
                (state e2 : p stays at eq),
                (event e3 : p moves away from eq),
                (event e4 : p moves close to eq),
                (state e5 : p stays at eq) ),
    constraints( (e1 before e2), (e2 before e3),
                 (e3 before e4), (e4 before e5) ))

Fig. 7: Definition of the behavior vandalism using the formalism introduced by [23]. A sequence of five events, which represent relative positions between a person p and an equipment eq, must be detected in order to recognize this behavior.

Fig. 8: Temporal constraints for the states and events constituting the scenario vandalism.

3.3 Behavior recognition results

The platform has been tested in different situations and validated in the metro monitoring application. The behavior recognition module runs on a PC under Linux and processes four tracking outputs, corresponding to four cameras, at a rate of five images per second. We have tested the whole video interpretation system (including motion detection, tracking and behavior recognition) on videos coming from ten cameras of the Barcelona and Brussels metros. We correctly recognized the scenario "Fraud" 6/6 (six times out of six) (Figure 9(a)), the scenario "Vandalism" 4/4 (Figure 9(b)), the scenario "Fighting" 20/24 (Figure 3), the scenario "Blocking" 13/13 (Figure 9(c)) and the scenario "Overcrowding" 2/2 (Figure 9(d)). We also tested the system over long sequences (10 hours) to check its robustness against false alarms. For each behavior, the number of false alarms is: two for "Fraud", zero for "Vandalism", four for "Fighting", one for "Blocking" and zero for "Overcrowding".

Moreover, in the framework of the European project ADVISOR, the video interpretation system has been ported to Windows and installed in the Barcelona metro in March 2003 to be evaluated and validated. This evaluation was done by Barcelona and Brussels video-surveillance metro operators during one week at the Sagrada Familia metro station. Together with this evaluation, a demonstration was performed for various guests, including the European Commission, project reviewers and representatives of the Brussels and Barcelona metros, to validate the system. The evaluation and the demonstration were conducted using both live and recorded videos: four channels in parallel, composed of three recorded sequences and one live input stream from the main hall of the station. The recorded sequences enabled testing the system with rare scenarios of interest (e.g. fighting, jumping over the barrier, vandalism), whereas the live camera allowed evaluating the system against scenarios which often happen (e.g. overcrowding) and which occurred during the demonstration, and also against false alarms. In total, out of 21 fighting incidents in all the recorded sequences, 20 alarms were correctly generated, giving a very good detection rate of 95%. Out of nine blocking incidents, seven alarms were generated, giving a detection rate of 78%. Out of 42 instances of jumping over the barrier, including repeated incidents, the behavior was detected 37 times, giving a success rate of 88%. The two vandalism sequences were always detected, over the six instances of vandalism, giving a detection rate of 100%. Finally, the overcrowding incidents were also consistently detected on the live camera, with some 28 separate events being well detected. Results are summarized in Table 1.

In conclusion, the ADVISOR demonstration was evaluated positively by end-users and by the European Commission. The algorithm responded successfully to the input data, with high detection rates, less than 5% of false alarms, and with all the reports being above 70% accurate.

Fig. 9: Four behaviors selected by end-users and recognized by the video interpretation system: (a) "Fraud" recognized by an automaton, (b) "Vandalism" recognized by a temporal constraint network, (c) "Blocking" recognized by an automaton and (d) "Overcrowding" recognized by an AND/OR tree.

4 DISCUSSION

The evaluation of the metro application has validated the ability of VSIP to recognize human behaviors. End-users of the application evaluated it positively, because of its high detection rate on different scenarios.

We have also been working closely with end-users on other application domains. For example, we have built with VSIP six systems which have been validated by end-users:

1. The activity monitoring system in metro stations has been validated by metro operators from Barcelona and Brussels.

2. A system for detecting abnormal behaviors inside moving trains (figure 10(a)), able to handle situations in which people are partially occluded by the train equipment (such as seats), has been validated over three scenarios (graffiti, theft, begging).

3. A bank agency monitoring system has been installed and validated in four bank agencies around Paris [9] (figures 10(b) and 10(c)).

4. A lock chamber access control system for building security has been validated on more than 140 recorded videos and on a live prototype (figure 10(d)).

5. An apron monitoring system has been developed for an airport², where vehicles of various types are evolving in a cluttered scene (figures 10(e) and 10(f)). The dedicated system has been validated on five servicing scenarios (GPU vehicle arrival, parking and refueling aircraft, loading/unloading container, towing aircraft).

6. A metro access control system has been tested by end-users on more than 200 videos and on a live prototype [5] (figure 10(g)).

² In the framework of the AVITRACK European Project, see [20] and [2].


Scenario Name   | Number of Behaviors | Number of Recognized Instances | % of Recognized Instances | Accuracy | Number of False Alerts
fighting        | 21 | 20 | 95 %  | 61 %  | 0
blocking        |  9 |  7 | 78 %  | 60 %  | 1
vandalism       |  2 |  2 | 100 % | 71 %  | 0
jumping o.t.b.  | 42 | 37 | 88 %  | 100 % | 0
overcrowding    |  7 |  7 | 100 % | 80 %  | 0
TOTAL           | 81 | 73 | 90 %  | 85 %  | 1

Table 1: Results of the technical validation of the metro monitoring system. For each scenario, we report in particular the percentage of recognized instances of this scenario (fourth column) and the accuracy in time of the recognition, that is, what percentage of the duration of the shown behavior is covered by the generation of the corresponding alert by the system, averaged over all the scenario instances (fifth column).

Some of these applications are illustrated in figure 10. They present several characteristics which make them interesting for research purposes:

• The observed scenes vary from large open spaces (like metro halls) to small and closed spaces (corridors and lock chambers).

• Cameras can have both non-overlapping (as in the metro station and lock chamber systems) and overlapping fields of view (metro stations and bank agencies).

• Humans can interact with the equipment (such as ticket vending machines, access control barriers, bank safes and lock chamber doors) either in simple ways (open/close) or in more complex ones (such as the interaction occurring in the vandalism against equipment or jumping over the barrier scenarios).

We are currently building various other applications with VSIP. For instance, a system for traffic monitoring on a highway was built in a few weeks to show the adaptability of the platform (see figure 10(h)). Other applications are envisaged, such as home care (monitoring of elderly people at home), multimedia (e.g. an intelligent camera for video conferencing) and animal behavior recognition (e.g. insect parasitoids attacking their hosts). The VSIP platform is currently being extended to learn behavior models using unsupervised learning techniques, to be applied to parking lot monitoring [22] (see figures 10(i) and 10(j)).

VSIP has shown its ability to automatically recognize and analyze human behaviors. However, some limitations remain. Video understanding systems are often difficult to configure and install, and to obtain an efficient system handling the variety of the real world, extended validation is needed. An automatic capability to adapt to dynamic environments should be added to the platform, which is a new topic of research. Nevertheless, the diversity of applications where VSIP has been used shows that this platform is suitable to fulfill many types of requirements.

5 CONCLUSION

We have presented a video understanding platform to automatically recognize human behaviors involving individuals, groups of people, crowds and vehicles, by detecting visual invariants. A visual invariant is a visual property or clue which characterizes a behavior.

Different methods have been defined to recognize specific types of behaviors under different scene configurations. To also experiment with various software solutions, all these methods have been integrated in a coherent framework enabling a given method to be modified locally and easily. The VSIP platform has been fully evaluated on several applications. For instance, the system has been tested off-line and has been evaluated, demonstrated and successfully validated in live conditions during one week at the Barcelona metro, in March 2003.

This approach can also be applied to the biological domain. For instance, in 2005 we are developing a system based on the VSIP platform which detects the behaviors of a trichogramma interacting with butterfly eggs. This application corresponds to a behavioral study for understanding how a trichogramma introduces its eggs into butterfly eggs, contributing to pest control in agriculture.

Hence, we believe that VSIP shows great potential as a tool for the recognition and analysis of human behaviors in very different configurations.

VSIP still presents some limitations when environmental conditions suddenly change or the complexity of the scene increases, which makes it necessary to improve the vision modules to ensure robustness. Moreover, VSIP requires the pre-definition of events for the detection of behaviors, which can be a very hard task. To cope with this limitation, an unsupervised frequent events detection algorithm [22] has shown encouraging preliminary results. This algorithm is capable of extracting the most significant events in a scene, without behavior pre-definition by the end-user.


Fig. 10: Six other systems derived from VSIP to handle different applications. Image 10(a) illustrates a system for detecting unruly behaviors inside trains. Images 10(b) and 10(c), taken with two synchronized cameras with overlapping fields of view working in a cooperative way, depict a bank agency monitoring system detecting an abnormal bank attack scenario. Image 10(d) illustrates a single-camera system for a lock chamber access control application for building entrances. Images 10(e) and 10(f) show a system for apron monitoring at airports; this system combines a total of eight surveillance cameras with overlapping fields of view. Image 10(g) depicts a multi-sensor system for metro access control; this system uses the information obtained from a static top camera and a set of lateral visual sensors to perform shape recognition of people. The right image corresponds to the understanding of the scene projected in a 3D virtual representation. Image 10(h) illustrates a highway traffic monitoring application. Finally, images 10(i) and 10(j) depict a parking lot monitoring system for learning behavior patterns using an unsupervised technique.


6 REFERENCES

[1] Aaron F. Bobick and Andrew D. Wilson. A state-based approach to the representation and recognition of gesture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(12):1325–1337, 1997.

[2] M. Borg, D. Thirde, J. Ferryman, F. Bremond, and M. Thonnat. Video event recognition for aircraft activity monitoring. In Proceedings of the 8th International IEEE Conference on Intelligent Transportation Systems (ITSC05), Vienna, Austria, September 2005.

[3] Matthew Brand and Vera Kettnaker. Discovery and segmentation of activities in video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):844–851, 2000.

[4] Francois Bremond, Nicolas Maillot, Monique Thonnat, and Thinh Van Vu. RR-5189 - Ontologies for video events. Technical report, Orion Team, Institut National de Recherche en Informatique et en Automatique (INRIA), May 2004.

[5] Bihn Bui, Francois Bremond, and Monique Thonnat. Shape recognition based on a video and multi-sensor system. In Proceedings of the IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS 2005), Como, Italy, September 2005.

[6] Nicolas Chleq and Monique Thonnat. Real-time image sequence interpretation for video-surveillance applications. In Proceedings of the IEEE International Conference on Image Processing (ICIP96), volume 2, pages 801–804, Lausanne, Switzerland, September 1996.

[7] F. Cupillard, F. Bremond, and M. Thonnat. Video understanding for metro surveillance. In Proceedings of the IEEE International Conference on Networking, Sensing and Control (ICNSC04), volume 1, pages 186–191, Taipei, Taiwan, 21-23 March 2004.

[8] Aphrodite Galata, Anthony Cohn, Derek Magee, and David Hogg. Modeling interaction using learnt qualitative spatio-temporal relations and variable length Markov models. In Proceedings of the 15th European Conference on Artificial Intelligence, pages 741–745, 2002.

[9] B. Georis, M. Maziere, F. Bremond, and M. Thonnat. A video interpretation platform applied to bank agency monitoring. In Proceedings of the International Conference on Intelligent Distributed Surveillance Systems (IDSS04), London, Great Britain, pages 46–50, February 2004.

[10] R. Gerber, H. Nagel, and H. Schreiber. Deriving textual descriptions of road traffic queues from video sequences. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI2002), Lyon, France, pages 736–740, 21-26 July 2002.

[11] Malik Ghallab. On chronicles: Representation, on-line recognition and learning. In Proceedings of the 5th International Conference on Principles of Knowledge Representation and Reasoning (KR96), Cambridge (USA), pages 597–606, 5-8 November 1996.

[12] S. Hongeng, F. Bremond, and R. Nevatia. Representation and optimal recognition of human activities. In IEEE Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR00), South Carolina, USA, 2000.

[13] S. Hongeng and R. Nevatia. Multi-agent event recognition. In Proceedings of the International Conference on Computer Vision (ICCV2001), Vancouver, B.C., Canada, May 2001.

[14] A.J. Howell and H. Buxton. Active vision techniques for visually mediated interaction. Image and Vision Computing, 20(12):861–871, October 2002.

[15] Weiming Hu, Tieniu Tan, Liang Wang, and Steve Maybank. A survey on visual surveillance of object motion and behaviors. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, 34(3):334–352, 2004.

[16] Yuri A. Ivanov and Aaron F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):852–872, 2000.

[17] N. Moenne-Locoz, F. Bremond, and M. Thonnat. Recurrent Bayesian network for the recognition of human behaviors from video. In Proceedings of the 3rd International Conference on Computer Vision Systems (ICVS03), Graz, Austria, April 2003.

[18] Jonathan Owens and Andrew Hunter. Application of the self-organizing map to trajectory classification. In Proceedings of the 3rd IEEE International Workshop on Visual Surveillance (VS 2000), Dublin, Ireland. IEEE Computer Society, July 2000.

[19] Claudio Pinhanez and Aaron Bobick. Human action detection using PNF propagation of temporal constraints. Technical Report 423, M.I.T. Media Laboratory, Perceptual Computing Section, April 1997.

[20] AVITRACK European Research Project. http://www.avitrack.net.

[21] N. Rota and M. Thonnat. Activity recognition from video sequences using declarative models. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI00), Berlin, Germany, August 2000.

[22] Alexander Toshev, Francois Bremond, and Monique Thonnat. Unsupervised learning of scenario models in the context of video surveillance. In Proceedings of the IEEE International Conference on Computer Vision Systems, January 2006.

[23] T-V. Vu, F. Bremond, and M. Thonnat. Automatic video interpretation: a novel algorithm for temporal scenario recognition. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI03), Acapulco, Mexico, August 2003.

[24] Lexing Xie, Shih-Fu Chang, Ajay Divakaran, and Huifang Sun. Unsupervised mining of statistical temporal structures in video. In A. Rosenfeld, D. Doermann, and D. DeMenthon, editors, Video Mining, pages 279–307. Kluwer Academic, 2003.
