Real-time human activity monitoring exploring multiple vision sensors


Robotics and Autonomous Systems 35 (2001) 221–228

Real-time human activity monitoring exploring multiple vision sensors

Paulo Peixoto∗, Jorge Batista1, Helder J. Araujo2

Department of Electrical Engineering, Institute of Systems and Robotics (ISR), University of Coimbra, 3030 Coimbra, Portugal

Abstract

In this paper, we describe the monitoring of human activity in an indoor environment through the use of multiple vision sensors. The system described in this paper is made up of three cameras. Two of these cameras are active and are part of a binocular system. They operate either as a set of three static cameras or as a set of one fixed camera and an active binocular vision system. The human activity is monitored by extracting several parameters that are useful for its classification. The system enables the creation of a record based on the type of activity. These logs can be selectively accessed and provide images of the humans in specific areas. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Surveillance; Active vision; Real-time tracking

1. Introduction

Automated monitoring of human activity is important for many applications. The problem of analysing human activity in video has been the focus of several researchers' efforts and several systems have been described in the literature [4,5,8–10,13,15,16]. Many of these systems consist of a computer vision system to detect and segment a moving object and a higher level interpretation module.

In very specialised applications other sensors are used besides vision. Automatic interpretation of the data is very difficult and most systems in use require human attention to interpret the data.

∗ Corresponding author. Tel.: +351-239796275; fax: +351-239406672. E-mail addresses: [email protected] (P. Peixoto), [email protected] (J. Batista), [email protected] (H.J. Araujo).

1 Tel.: +351-239796289; fax: +351-239406672.
2 Tel.: +351-239796216; fax: +351-239406672.

These systems are characterised by the storage of large amounts of data that require no specific action. A single, specific event that may require immediate intervention can be difficult to find among a large amount of redundant information.

Tracking human motion in an indoor environment is of interest in several applications. A considerable amount of work has been devoted to tracking humans with a single camera. Images of the environment are acquired either with static cameras with wide-angle lenses (to cover all the space), or with cameras mounted on pan and tilt devices (so that all the space is covered by using good resolution images) [6,7,11,12,14,17]. In some cases both types of images are acquired but the selection of the region to be imaged by the pan and tilt devices depends on the action of a human operator. The combination of several modalities of imaging devices enables the achievement of robust performance. In addition, the monitoring of events requires the use of multiple sensing agents. This is an important and essential step towards the full automation of high-security applications in man-made environments.



The system described in this paper explores the combination of several vision sensors in order to cope with the proposed goal of autonomously detecting and tracking human intruders in man-made environments. In the current setup the system is made up of three cameras that can operate in two modes: passive mode and active mode. In the passive mode the three cameras remain static and monitor the environment. When specific events occur the system starts operating as a combination of a static camera and a binocular active system (entering the active mode).

The system is also able to detect and log the presence of targets in some specific areas of interest. This log file can be consulted off-line in order to search for some particular event.

2. Global vision

During the active mode of operation the global vision system (the wide-angle static camera) is responsible for the detection and tracking of all targets visible in the scene. It is also responsible for the selection of the target that is going to be tracked by the binocular active system.

2.1. Ground plane correspondence

In order to redirect the attention to a new target, the active vision system should know where to look for it.

Fig. 1. Correspondence between image points and ground plane points.

Since the position of the target is known in the static camera image, we need to map that position into the rotation angles of the neck (pan and tilt) and the vergence angle (we assume a symmetric vergence configuration). The goal is to fixate the active vision system on the target's head.

If we assume that all target points considered in the static camera image lie on the ground plane, then any point on this plane can be mapped to a point in the image plane of the static camera using a homography [3].

For each detected target on the image plane p(x, y), we compute the corresponding point on the ground plane. Then the relationship between the point P(X, Y) on the ground plane and the joint angles can be derived directly from the geometry of the active vision system (see Fig. 1):

$$\theta_p = \arctan\frac{X}{Y}, \qquad \theta_v = \arctan\frac{B/2}{\sqrt{X^2 + Y^2} - D}, \qquad \theta_t = \arctan\frac{H - h}{\sqrt{X^2 + Y^2}},$$

where B is the baseline distance. To compute the tilt angle, we must know the target height, which can easily be computed since we can obtain the projection of the target's head and feet points (detected in the image plane) on the ground plane.
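This mapping is straightforward to prototype. The sketch below assumes a precomputed 3 × 3 image-to-ground homography (as in [3]) and evaluates the three joint angles exactly as in the formulas above; reading D as the horizontal offset between the neck axis and the camera baseline is our assumption, and the function and variable names are illustrative rather than the system's actual implementation.

```python
import numpy as np

def image_to_ground(p_img, H_img2ground):
    """Map an image point to the ground plane using a 3x3 homography [3]."""
    q = H_img2ground @ np.array([p_img[0], p_img[1], 1.0])
    return q[0] / q[2], q[1] / q[2]

def gaze_angles(X, Y, B, D, H, h):
    """Pan, vergence and tilt angles (radians) for a ground-plane point P(X, Y).

    B: baseline distance; D: offset between the neck axis and the baseline
    (assumed interpretation); H: height of the active vision system;
    h: estimated target height.  Coordinates are in the neck-centred frame
    of Fig. 1.
    """
    rho = np.hypot(X, Y)                    # horizontal distance to the target
    theta_p = np.arctan2(X, Y)              # pan
    theta_v = np.arctan2(B / 2.0, rho - D)  # symmetric vergence
    theta_t = np.arctan2(H - h, rho)        # tilt towards the target's head
    return theta_p, theta_v, theta_t
```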


Fig. 2. Target height computation.

Assuming that the static camera height and position, relative to a predefined reference frame on the ground plane, are known, we can compute the target height (see Fig. 2). In fact the camera height and position can be estimated if we know the position and height of two objects in the scene (we used a doorway and a closet).
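One way to realise the similar-triangles construction of Fig. 2 is sketched below. It assumes the static camera's ground-plane position and height are known (or have been estimated from the two reference objects) and reuses the hypothetical image_to_ground helper from the previous sketch.

```python
import numpy as np

def target_height(feet_img, head_img, cam_xy, cam_height, H_img2ground):
    """Estimate a target's height from its feet and head image points (Fig. 2).

    The feet point maps to the target's position on the ground; the head point
    maps to where the camera-to-head ray meets the ground.  The ratio of the
    two camera-to-point distances then gives the height by similar triangles.
    """
    feet_g = np.array(image_to_ground(feet_img, H_img2ground))
    head_g = np.array(image_to_ground(head_img, H_img2ground))
    cam_g = np.asarray(cam_xy, dtype=float)

    d_feet = np.linalg.norm(feet_g - cam_g)   # camera to the target's feet
    d_head = np.linalg.norm(head_g - cam_g)   # camera to the projected head point
    return cam_height * (1.0 - d_feet / d_head)
```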

2.2. Target tracking and initialisation

One of the most important steps in detecting objects in video is to localise where motion is occurring in a frame. The simplest technique is to use image differencing of consecutive frames of video to see where motion has occurred. Another, more sophisticated, approach is to use optical flow/image motion algorithms. The information provided by the optical flow algorithms is more detailed than simple change detection.
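As a concrete illustration of the simpler alternative, the sketch below detects candidate moving regions by differencing consecutive grey-level frames; the thresholds, the morphological clean-up and the use of OpenCV (4.x) are illustrative choices, not the segmentation actually used by the system, which is based on optical flow.

```python
import cv2
import numpy as np

def motion_blobs(prev_gray, curr_gray, diff_thresh=25, min_area=200):
    """Detect candidate moving regions by differencing consecutive frames."""
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    # Remove isolated pixels and fill small holes before extracting blobs.
    kernel = np.ones((3, 3), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Return bounding boxes of sufficiently large changed regions.
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= min_area]
```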

In each frame a segmentation procedure based on optical flow allows the detection of the available targets.

Fig. 3. Dealing with shadows: (a) real image captured by the static camera; (b) result of the segmentation process.

A Kalman filter is attached to each detected target and the information returned by the filter is used to predict the location of the target in the next frame. The prediction is used to estimate a bounding box around the expected new position of the target. This bounding box is then compared to the newly detected blobs. If a match occurs then the target position is updated. If the uncertainty in position becomes too large over a significant amount of time then the target is considered to be lost and the associated tracking process is terminated. This can occur when the target walks out of the image, is heavily occluded or stops.
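A minimal version of this predict/match/update cycle could look as follows, assuming a constant-velocity model for the target's image position; the noise levels, the miss count used to declare a target lost and the class interface are illustrative assumptions rather than the paper's actual parameters.

```python
import numpy as np

class TargetTrack:
    """Constant-velocity Kalman filter for one target's image position."""

    def __init__(self, x, y, dt=0.04, max_misses=25):
        self.state = np.array([x, y, 0.0, 0.0])          # [x, y, vx, vy]
        self.P = np.eye(4) * 100.0                        # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)   # motion model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)    # only position is observed
        self.Q = np.eye(4) * 1.0                          # process noise
        self.R = np.eye(2) * 4.0                          # measurement noise
        self.misses = 0
        self.max_misses = max_misses

    def predict(self):
        """Predicted position, used to place the search bounding box."""
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]

    def update(self, measurement=None):
        """Correct with a matched blob centre, or count a miss if none matched."""
        if measurement is None:
            self.misses += 1
            return self.misses <= self.max_misses         # False -> terminate track
        z = np.asarray(measurement, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.state = self.state + K @ (z - self.H @ self.state)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        self.misses = 0
        return True
```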

When two or more people pass by each other the segmented region can result in one big blob. In this particular case the system tries to recognise each of the individual targets using the predicted bounding boxes. Since we also compute the height of each target, we can use this measurement to verify that the correct match was made.

This problem of targets overlapping in the image also determines the maximum number of targets that the system is able to track. If the number of targets is such that the segmented region is large and the confidence in the trajectory of each previously detected target is low, the system is unable, without any kind of additional information, to detect the targets and their trajectories.

One of the problems that sometimes arise with this kind of method is the presence of shadows on the floor, which can lead to an incorrect segmentation of the target pixels (see Fig. 3). One way to overcome this problem is to rely on the top portion of the blob, where segmentation is much more robust to lighting variations.


Fig. 4. Example of the influence of the shadows on the 2D mapping of the targets on the ground plane. In these two experiments the subject moved along a circular path. On the right, the result obtained when shadows are included as part of the segmented blob. On the left, the result obtained by ignoring the shadows using the height computation.

If the height of the target is known with a level of confidence above a certain threshold, then the projection of the target on the ground plane can be established using the estimated height of the target. We are then able to compute the number of pixels that the blob should have in the image. Using this kind of approach, we can improve the quality of the target mapping on the ground plane. The example shown in Fig. 4 clearly shows that the presence of a shadow can induce a false mapping on the ground plane. In both examples the subject moved along a circular path. On the right we can see the result obtained by assuming that the shadows are part of the segmented blob. The resulting path has a clear bias near the spot where the shadow appears (see Fig. 3). The image on the left shows the result obtained by ignoring the shadows using the height computation.
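The sketch below illustrates how the shadow-robust top of the blob, combined with the estimated height, can replace the blob's bottom when mapping a target onto the ground plane. The geometry follows the same similar-triangles reasoning as before; the helper names and the exact procedure are our assumptions, consistent with but not identical to the description above.

```python
import numpy as np

def ground_position_from_head(head_img, est_height, cam_xy, cam_height,
                              H_img2ground):
    """Ground-plane position of a target from its head pixel and known height.

    The head pixel is mapped to the ground with the homography; the target
    then lies on the segment between the camera's ground position and that
    point, at a fraction (cam_height - est_height) / cam_height of its length.
    The blob's lower (shadow-prone) pixels are never used.
    """
    head_g = np.array(image_to_ground(head_img, H_img2ground))
    cam_g = np.asarray(cam_xy, dtype=float)
    frac = (cam_height - est_height) / cam_height
    return cam_g + frac * (head_g - cam_g)
```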

An important aspect of the system is its ability to keep track of the 2D trajectory of the targets on the ground plane. This feature is especially useful for the subsequent reconstruction of the trajectory of each target.

2.3. Evaluation

Some experiments were made to determine the performance of the system regarding both the 2D trajectory estimation and the height estimation.

The entire tracking system is based on a Pentium II 300 MHz computer equipped with an RGB Matrox Meteor frame grabber that simultaneously captures images from the three cameras, avoiding image synchronisation problems. The full tracking system (both the static camera and the active vision system) runs at approximately 25 Hz on 384 × 288 images, including the vision routines, control routines and logging routines.

In the first example three persons, with different heights, walked into the scene. Fig. 5 shows the computed height for each target in each frame. The targets' real heights are presented in Table 1.

The height is computed using the process described in Section 2.1. Since we are interested in having an incremental algorithm in which the height value is updated in every frame, we assume that the height estimations follow a normal distribution with parameters µ and σ².

Table 1. True heights of subjects represented in Fig. 5

Target            Subject A   Subject B   Subject C
True height (cm)  191         176         184


Fig. 5. Results of target height computation.

If h_t is the computed height at time t, we update the statistical parameters using

$$\mu_t = \alpha\,\mu_{t-1} + (1 - \alpha)\,h_t, \qquad \sigma_t^2 = \alpha\,\sigma_{t-1}^2 + (1 - \alpha)\,(h_t - \mu_t)^2.$$

The constant α controls the update rate of the statistical information: if α is low then the adaptation is quicker, but the learned value can deviate from the true statistics. If α is set high then a sufficient amount of data needs to be gathered for the solution to converge to the correct distribution. We use the statistics to refine future estimations of the height. We placed a gate on acceptable values for the new estimations (in practice one can use a 3σ gate). Any value outside µ ± 3σ is rejected in subsequent updates. The gate has the effect of excluding outlying height computations (caused, for instance, by an incorrect segmentation of the target), gradually driving the estimated height to its true value. As long as the target is visible for a sufficient amount of time this gated adaptation is guaranteed to converge to the correct height value. The system is able to determine the height of the targets after several frames with an average error of 2 cm.
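The gated update can be transcribed almost directly; in the sketch below the value of α and the initial variance are illustrative, and heights are assumed to be in centimetres.

```python
import numpy as np

class HeightEstimate:
    """Running estimate of a target's height with a 3-sigma acceptance gate."""

    def __init__(self, h0, alpha=0.95, var0=100.0):
        self.mu = h0          # current mean height estimate (cm)
        self.var = var0       # current variance estimate
        self.alpha = alpha    # update rate of the statistics

    def update(self, h):
        """Fold a new per-frame height measurement into the statistics."""
        # Reject outliers (e.g. bad segmentations) outside the 3-sigma gate.
        if abs(h - self.mu) > 3.0 * np.sqrt(self.var):
            return self.mu
        self.mu = self.alpha * self.mu + (1.0 - self.alpha) * h
        self.var = self.alpha * self.var + (1.0 - self.alpha) * (h - self.mu) ** 2
        return self.mu
```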

The trajectory of each target on the ground plane is computed and stored for subsequent off-line analysis. We performed some experiments in order to evaluate the performance of the system in terms of mapping errors. We show here two examples of those tests.

Fig. 6. Results of the 2D mapping of the targets onto the ground plane: subject A moves from left to right and subject B moves back and forth.

In both cases two persons walked in the scene, trying to maintain two straight-line trajectories marked with tape on the ground plane. In the first example, one of the persons moved from left to right while the other one moved back and forth (Fig. 6). In the second example both persons moved along a diagonal path (Fig. 7).

2.4. Active vision system visual routines

The active vision system is responsible for the pursuit of a specific target in the scene.

Fig. 7. Results of the 2D mapping of the targets onto the ground plane: both subjects move diagonally.


Some kind of priority scheme is needed in order to choose which target to pursue. Of course this priority scheme depends on the type of activity in which the target is involved and on the relevance of that activity in terms of the application. A priority level is dynamically assigned to each newly detected target. This allows the system to sort the several targets available in the scene according to their priority level. The highest priority target will be the one that gets the attention of the active vision system.

During the tracking process the motion of the active vision system must satisfy two basic requirements:

1. stabilise the images of the selected target on both retinas;

2. maintain fixation on the target.

The tracking task is achieved using two different steps: fixation and smooth pursuit. In the first step the attention of the active vision system is redirected to the target as fast as possible, and in the second the target is tracked [2]. For a more detailed description of the active vision system and for some performance characterisation please refer to [1].

3. Human activity logging

A fundamental problem to be addressed in any surveillance or human activity monitoring scenario is that of information filtering: how to decide whether a scene contains an activity or behaviour worth analysing. Our approach to detection and monitoring of such situations is based on the fact that actions are typically conditioned by the context in which they are produced. For instance, the action of opening a closet only makes sense in the vicinity of a closet.

We adopted the concept of "context cells" to discriminate portions of the scene where behaviour can be important in terms of the application. It is assumed that a set of rules is known a priori and that these rules have the adequate relevance for the purpose of the monitoring task. It is also assumed that these context cells have enough information to trigger a logging event in order to register the actions of the human subjects.

Since in this first approach the only inputs are the position of the target on the ground plane and its height, we defined a very simple set of context cells in our lab (see Fig. 8).

Fig. 8. Context cells definition.

Three different context cells were defined: closets, desks and in/out zones, which correspond to areas where targets can enter or exit the scene. The rule used to describe the possible actions in the desk context cell is shown in Fig. 9. Two actions are logged in this case: "near desk" and "seated at desk".

To establish whether a certain target is in a certain context cell, we take into account the time that the target spends in that particular cell.

Fig. 9. An example of a rule used to describe an action, in this case the use of a desk.


In each cell we define a threshold time τ_s that determines the minimum amount of time that the target should stay in the cell in order for the action to be logged. In some cells, such as the in/out cells, this time τ_s is equal to zero, since in this particular case we are only interested in the event of entering/leaving the scene.
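A minimal sketch of a context cell with a dwell-time threshold is given below; the rectangular cell geometry, the action names and the log-record format are illustrative assumptions, since the actual rules (Fig. 9) can be richer.

```python
import time

class ContextCell:
    """Axis-aligned ground-plane cell that logs an action after a dwell time."""

    def __init__(self, name, x_range, y_range, action, tau_s=0.0):
        self.name = name
        self.x_range = x_range        # (x_min, x_max) on the ground plane
        self.y_range = y_range        # (y_min, y_max) on the ground plane
        self.action = action          # e.g. "near desk", "entered scene"
        self.tau_s = tau_s            # minimum dwell time before logging (s)
        self.entered_at = {}          # target id -> time of entry into the cell

    def contains(self, X, Y):
        return (self.x_range[0] <= X <= self.x_range[1] and
                self.y_range[0] <= Y <= self.y_range[1])

    def observe(self, target_id, X, Y, log):
        """Update dwell bookkeeping for one target and log once tau_s elapses."""
        now = time.time()
        if self.contains(X, Y):
            start = self.entered_at.setdefault(target_id, now)
            if now - start >= self.tau_s:
                log.append((now, target_id, self.name, self.action))
                self.entered_at[target_id] = float('inf')   # log only once per visit
        else:
            self.entered_at.pop(target_id, None)             # reset when the target leaves
```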

An advantage of this context-based approach is that it can be used to predict the expected appearance and location of the target in the image, to predict occlusions and to initiate processes to deal with them.

The system creates log files that describe the actions that occurred during a certain period of time. A picture is taken by the active vision system for each recorded action. These images can then be used for subsequent analysis and processing, for instance for identification purposes. Fig. 10 shows an example of a portion of a log file recorded by the system.

Different actions require different views in order to understand what is going on. For instance, if the target is near a closet, then his hands, not his head, should have the attention of the system. The advantage of using the active vision system is that if the "best" view needed to understand a particular action has not been captured, then the active vision system can be instructed to redirect its attention to the right place. Once again the definition of "best" view is context dependent, and if this context is known then the search space for the best view can be substantially reduced.

Another aspect of this work (not yet running in real time) is the modelling of more complex behaviours using the same underlying principle.

Fig. 10. An example of a typical human activity log output.

Table 2. Classification results of context cells

Context cell        Closet A   Closet B   Closet C   Desk A   I/O Zone 1
Classification (%)  91.4       88.6       85.7       94.3     100

The logging ability can be extended to the detection of more elaborate actions, such as the opening or closing of closets and other typical indoor actions.

Some experiments were made in order to evaluate the performance of the logging module of the system. Several persons walked around in the scene and their activity in terms of the defined cells was registered by a human operator. At the same time the system was trying to track the same actions. At the end of the experiments the human report was compared with the one produced by the system. A correct match was considered to be one with an exact correspondence between the system and the human operator. Table 2 presents the results obtained in terms of the percentage of correct classifications.

The results show that, in general, the context cell visited by the targets is correctly classified. The worst results were obtained for Closets B and C, possibly because these two areas overlap slightly and any small error in the 2D mapping of the target can result in a misclassification of the cell.

4. Conclusions

In this paper, we described a real-time system aimed at detecting and tracking targets in man-made environments. This system is based on a global view of the scene and a binocular tracking system. The specific features of the active system enable the tracking of humans while handling some degree of occlusion. The degree of occlusion that can be tolerated depends upon the target distance. Behaviour modelling can advantageously use the 3D trajectories reconstructed both with the data from the global view camera and the data from the active system. Logging of human activity is performed in real time and, by analysing changes in that data (changes in height and position), some limited interpretation of action is performed.


In addition, the redundancy of the system enables cross-checking of some types of information, enabling greater robustness.

References

[1] J. Batista, P. Peixoto, H.J. Araujo, Visual behaviors for real-time control of a binocular active vision system, Control Engineering Practice 5 (10) (1997) 1451–1461.

[2] J. Batista, P. Peixoto, H.J. Araujo, Real-time visual surveillance by integrating peripheral motion detection with foveated tracking, in: Proceedings of the IEEE Workshop on Visual Surveillance, IEEE Computer Society, January 1998, pp. 18–25.

[3] K.J. Bradshaw, I. Reid, D. Murray, The active recovery of 3D motion trajectories and their use in prediction, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (3) (1997) 219–234.

[4] C. Bregler, Learning and recognizing human dynamics in video sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'97), 1997, pp. 568–574.

[5] H. Buxton, S.G. Gong, Visual surveillance in a dynamic and uncertain world, Artificial Intelligence 78 (1–2) (1995) 431–459.

[6] J.L. Crowley, Coordination of action and perception in a surveillance robot, IEEE Expert 2 (4) (1987) 32–43.

[7] Y. Cui, S. Samarasekera, Q. Huang, M. Greiffenhagen, Indoor monitoring via the collaboration between a peripheral sensor and a foveal sensor, in: Proceedings of the IEEE Workshop on Visual Surveillance, IEEE Computer Society, 1998, pp. 2–9.

[8] J.W. Davis, A.F. Bobick, The representation and recognition of action using temporal templates, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR'97), 1997, pp. 928–934.

[9] L. Davis, R. Chellapa, Y. Yacoob, Q. Zheng, Visual surveillance and monitoring of human and vehicular activity, in: Proceedings of DARPA'97, 1997, pp. 19–27.

[10] J. Fernyhough, A.G. Cohn, D.C. Hogg, Building qualitative event models automatically from visual input, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV'98), 1998, pp. 350–355.

[11] D.M. Gavrila, L.S. Davis, 3D model-based tracking of humans in action: A multi-view approach, in: Proceedings of CVPR'96, 1996, pp. 73–79.

[12] E. Grimson, P. Viola, O. Faugeras, T. Lozano-Perez, T. Poggio, S. Teller, A forest of sensors, in: Proceedings of the DARPA Image Understanding Workshop, Vol. 1, New Orleans, LA, 1997, pp. 45–50.

[13] Y. Guo, G. Xu, S. Tsuji, Understanding human motion patterns, in: Proceedings of the IEEE International Conference on Pattern Recognition (ICPR'94), 1994, pp. B325–B329.

[14] T. Kanade, R.T. Collins, A.J. Lipton, P. Anandan, P. Burt, L. Wixson, Cooperative multisensor video surveillance, in: Proceedings of the DARPA Image Understanding Workshop, 1997, pp. 3–10.

[15] D. Murray, K. Bradshaw, P. MacLauchlan, I. Reid, P. Sharkey, Driving saccade to pursuit using image motion, International Journal of Computer Vision 16 (3) (1995) 205–228.

[16] N. Oliver, B. Rosario, A. Pentland, A Bayesian computer vision system for modeling human interactions, in: Proceedings of the First International Conference on Computer Vision Systems (ICVS'99), 1999, pp. 255–272.

[17] M. Yedanapudi, Y. Bar-Shalom, K.R. Pattipati, IMM estimation for multitarget–multisensor air-traffic surveillance, Proceedings of the IEEE 1 (1997) 80–94.

Paulo Peixoto received his B.Sc. degree in Electrical Engineering and M.S. degree in Systems and Automation from the University of Coimbra in 1989 and 1995, respectively. He is currently a Ph.D. candidate in the Department of Electrical Engineering at the University of Coimbra. He is also a member of the Portuguese Institute for Systems and Robotics (ISR), where he is a researcher. His research interests include computer vision, active vision and visual surveillance.

Jorge Batista received the B.Sc. degree in Electrical Engineering from the University of Coimbra in 1986. In 1992 he received the M.S. degree in Systems and Automation and in 1999 he received the Ph.D. in Electrical Engineering, both from the University of Coimbra. Jorge Batista is currently an Assistant Professor in the Department of Electrical Engineering, University of Coimbra, and a researcher at the Institute of Systems and Robotics. His research interests include camera calibration, computer vision, active vision and visual surveillance.

Helder J. Araujo is currently an Associate Professor in the Department of Electrical Engineering, University of Coimbra. He is a co-founder of the Portuguese Institute for Systems and Robotics (ISR), where he is now a researcher. His primary research interests are in computer vision and mobile robotics.