
Detecting Human Behavior Models From Multimodal Observation in a Smart Home

Oliver Brdiczka, Matthieu Langet, Jérôme Maisonnasse, and James L. Crowley

Abstract—This paper addresses learning and recognition of human behavior models from multimodal observation in a smart home environment. The proposed approach is part of a framework for acquiring a high-level contextual model for human behavior in an augmented environment. A 3-D video tracking system creates and tracks entities (persons) in the scene. Further, a speech activity detector analyzes audio streams coming from headset microphones and determines for each entity whether the entity speaks or not. An ambient sound detector detects noises in the environment. An individual role detector derives basic activities like “walking” or “interacting with table” from the entity properties extracted by the 3-D tracker. From the derived multimodal observations, different situations like “aperitif” or “presentation” are learned and detected using statistical models (HMMs). The objective of the proposed general framework is twofold: the automatic offline analysis of human behavior recordings and the online detection of learned human behavior models. To evaluate the proposed approach, several multimodal recordings showing different situations have been conducted. The obtained results, in particular for offline analysis, are very good, showing that multimodality as well as multiperson observation generation are beneficial for situation recognition.

Note to Practitioners—This paper was motivated by the problem of automatically recognizing human behavior and interactions in a smart home environment. The smart home environment is equipped with cameras and microphones that permit the observation of human activity in the scene. The objective is first to visualize the perceived human activities (e.g., for videoconferencing or surveillance of elderly people), and then to provide appropriate services based on these activities. We adopt a layered approach for human activity recognition in the environment. The layered framework is motivated by the human perception of human behavior in the scene (white box). The system first recognizes basic activities of individuals, called roles, like “interacting with table” or “walking.” Then, based on the recognized individual roles, group situations like “aperitif,” “presentation,” or “siesta” are recognized. In this paper, we describe an implementation that is based on a 3-D video tracking system, as well as speech activity detection using headset microphones. We evaluated the system for offline (a posteriori) situation classification and online (in scenario) situation recognition. A prototype system has been realized and installed at France Télécom R&D, visualizing current human behavior in the smart home to a distant user using a web interface. An open issue is still the detection of group dynamics and group formation, which is necessary for group situation recognition in (informal) real settings.

Index Terms—Context-awareness, human-centered computing, individual role detection, multimodal human behavior modeling and detection, situation modeling.

Manuscript received June 14, 2007; revised June 16, 2008. This paper was recommended by Associate Editor P. Remagnino and Editor M. Wang upon evaluation of the reviewers’ comments. This work was supported in part by the France Télécom R&D Project HARP and by the European Commission Project CHIL under Grant IST-506909. This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. This includes one MPEG-2 movie file which shows the visualization of online human behavior recognition in the scene via a web interface. This material is 40.5 MB in size.

O. Brdiczka is with the Computer Science Laboratory, Palo Alto Research Center, Palo Alto, CA 94304-1314 USA (e-mail: [email protected]).

M. Langet, J. Maisonnasse, and J. L. Crowley are with Project PRIMA, Laboratoire LIG, INRIA Rhône-Alpes, France (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASE.2008.2004965

I. INTRODUCTION

PERVASIVE and ubiquitous computing [33] integrates computation into everyday environments. The technological progress of the last decade has enabled computerized spaces equipped with multiple sensor arrays, like microphones or cameras, and multiple human–computer interaction devices. An early example is the KidsRoom [4], a perceptually based, interactive playspace for children developed at MIT. Smart home environments [10] and even complete apartments equipped with multiple sensors [13] have been realized. The major goal of these augmented or “smart” environments is to enable devices to sense changes in the environment and to automatically adapt and act based on these changes. A main focus is laid on sensing and responding to human activity. Human actors need to be identified and their current activity needs to be recognized. Addressing the right user at the correct moment, while perceiving his correct activity, is essential for correct human–computer interaction in augmented environments.

Smart environments have enabled the computer observation of human (inter)action within the environment. The analysis of (inter)actions of two or more individuals is of particular interest here, as it provides information about social context and relations, and it further enables computer systems to follow and anticipate human (inter)action. The latter is a difficult task given the fact that human activity is situation dependent [31] and does not necessarily follow plans. Computerized spaces and their devices hence need to use this situational information, i.e., context [17], to respond correctly to human activity. Context is the key for interaction without distraction [15]. In order to become context-aware, computer systems must thus construct and maintain a model describing the environment, its occupants, and their activities.

The notion of context is not new and has been explored in different areas like linguistics, natural language processing, and knowledge representation. Dey defines context as “any information that can be used to characterize the situation of an entity” [17]. An entity can be a person, place, or object considered relevant to the user and application.


Fig. 1. Context of a “presentation with questions.” The concepts (roles and/or relations) characterizing the situations are not detailed.

The structure and representation of this information must be determined before being exploited by a specific application.

Context and activity are separable. Context describes features of the environment within which the activity takes place. Dourish claims further that context defines a relational property between objects and activities and is dynamically redefined as unexpected events occur or when activity evolves [18]. Loke states that situation and activity are not interchangeable, and activity can be considered as a type of contextual information which can be used to characterize a situation [21]. Dey defines situation as “description of the states of relevant entities” [17]. Situation is thus a temporal state within context, defined by specific relations and states of entities. Allen’s temporal operators [1] can be used to describe relationships between situations.

Crowley et al. then introduce the concepts of role and relation in order to characterize a situation [16]. Roles involve only one entity, describing its activity. An entity is observed to “play” a role. Relations are defined as predicate functions on several entities, describing the relationship or interaction between entities playing roles. Context is finally represented by a network of situations [16]. These situation networks have been used to implement context for different applications [6] (see Fig. 1 for a simple example). These situation models have so far been handcrafted by experts. However, little work has been done on the automatic acquisition, i.e., learning, of these high-level models from data.

Many approaches for learning and recognizing human (inter)actions and behavior models from sensor data have been proposed in recent years, with particular attention to applications in video surveillance [26], [28], [34], workplace tools [24], [30], and group entertainment [4]. Some projects focus on supplying appropriate system services to the users [4], [10], [13], [30], while others focus on the correct classification of activities [7], [24], [25], [28]. Most of the previous work is based on video [26], [28], [34], audio [7], or multimodal information [24], using statistical models for learning and recognition (in particular, hidden Markov models). Most of the reported work has been concerned with the recognition of the activities of individuals who have been identified a priori. To our knowledge, little work has been done on real-time multimodal activity recognition [24]. However, most work does not attempt to acquire a high-level contextual model of human behavior. The main focus is laid on the classification of basic human activities or scenarios without considering a richer contextual description or model.

Fig. 2. Overview of the different parts and methods of our approach.

This paper investigates learning and recognition of human behavior models from multimodal observation in a smart home environment. The proposed approach is based on audio and video information. A 3-D video tracking system creates and tracks entities (persons) in the scene. A speech activity detector further analyzes the audio streams coming from headset microphones and determines for each entity whether the entity speaks or not. An ambient sound detector detects noises in the environment. An individual role detector derives basic activities like “walking” or “interacting with table” from the entity properties extracted by the 3-D tracker. Roles here can be interpreted as referring to the basic activity of individuals in the scene. This activity is detected framewise. From the derived role and sound detections, different situations like “aperitif” or “presentation” are learned and detected using statistical models (HMMs). This paper proposes a general framework providing methods for learning and recognizing the different parts of a human behavior model. The learning and detection of different situations is of particular interest here because situations are temporal states involving audio and video detections of several individuals. The objective is twofold. First, we want to enable the automatic offline analysis of human behavior recordings in a smart home environment. The aim is to identify and recognize the situations, as well as the roles played, in multimodal recordings. Second, the online detection of learned human behavior models is to be enabled. The aim is to visualize human behavior (e.g., for videoconferencing) and to be capable of reacting correctly to human activity in the scene.

II. APPROACH

In the following, we present our approach for learning and recognizing a human behavior model in a smart home environment (Fig. 2). First, our smart home environment, the 3-D video tracking system, as well as the noise and speech detectors are briefly described. Then, our method for role recognition is presented. This method takes the entity properties provided by the 3-D tracking system as input and generates, based on posture, speed, and interaction distance, a role value for each entity. Based on role and sound detections, several situations are learned and detected using hidden Markov models.


Fig. 3. Map of our smart room (top left), map with wide-angle camera view (top right), and wide-angle camera image of the smart room (bottom).

A. Smart Home Environment

In this paper, experiments take place in a smart home environment, which is a room equipped like a living room. The furniture consists of a small table placed in the center of three armchairs and one couch [Fig. 3 (top left)]. Microphone arrays and video cameras are mounted against the walls of the environment. The set of sensors used for our approach comprises three cameras in different corners of the environment [one camera is wide-angle, Fig. 3 (top right)], a microphone array for noise detection, as well as headset microphones for individual speech detection.

The cameras record images of the environment [see Fig. 3 (bottom) for an example image] with a frame rate of approximately 25 images/s. A 3-D real-time robust tracking system detects and tracks targets in these video images.

B. 3-D Video Tracking System

A 3-D video tracking system [3] detects and tracks entities (people) in the scene in real time using multiple cameras (Fig. 4). The tracker itself is an instance of basic Bayesian reasoning [2]. The 3-D position of each target is calculated by combining tracking results from several 2-D trackers [11] running on the video images of each camera. Each camera-detector pair runs on a dedicated processor. All interprocess communication is managed with an object-oriented middleware for service connection [19].

The output of the 3-D tracker is the position of each detected target, as well as the corresponding covariance matrix (a 3 × 3 matrix describing the form of the bounding ellipsoid of the target). Additionally, a velocity vector can be calculated for each target.

The 3-D video tracking system provides high tracking stability. The generated 3-D target positions correspond to real positions in the environment that can be compared to the positions of objects. The extracted target properties (covariance matrix, velocity) provided by the 3-D tracker are independent of the camera positions (after calibration). Further, tracking is robust against occlusions and against splits and merges of targets.
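
For concreteness, the per-target output described above can be represented as in the following minimal sketch; the class and field names are illustrative choices, not the tracker's actual interface.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Target3D:
    """One tracked entity as output by the 3-D tracker (illustrative field names)."""
    target_id: int
    position: np.ndarray    # 3-vector in room coordinates
    covariance: np.ndarray  # 3 x 3 matrix describing the bounding ellipsoid
    velocity: np.ndarray    # 3-vector

    def ellipsoid_half_axes(self) -> np.ndarray:
        """Half-axis lengths of the bounding ellipsoid (square roots of the
        covariance eigenvalues); such shape features feed the posture SVM."""
        return np.sqrt(np.clip(np.linalg.eigvalsh(self.covariance), 0.0, None))

    def speed(self) -> float:
        """Scalar speed used by the role detector."""
        return float(np.linalg.norm(self.velocity))
```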

C. Ambient Sound and Speech Detection

A microphone array mounted against the wall of the smart environment is used for noise detection. Based on the energy of the audio streams, we determine whether there is noise in the environment or not (e.g., movement of objects on the table).

The people taking part in our experiments wear headset microphones. A real-time speech activity detector [9], [32] analyzes the audio stream of each headset microphone and determines whether the corresponding person speaks or not. The speech activity detector is composed of several subsystems: an energy detector, a basic classifier, and a neural net trained to recognize voiced segments such as vowels. At each time, i.e., for each frame, each subsystem gives an answer indicating whether the input signal is speech or not. A handcrafted rule-based automaton then determines the final result: speech activity or not. The complete system is shown in Fig. 5.
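
The handcrafted rules of the automaton are not reproduced in this paper; the sketch below shows one plausible fusion scheme (majority voting with onset/offset hysteresis), purely for illustration of how the three per-frame sub-detector answers could be combined.

```python
def fuse_speech_decisions(energy: bool, classifier: bool, neural: bool,
                          state: dict, onset_frames: int = 3,
                          offset_frames: int = 10) -> bool:
    """Hypothetical rule-based automaton fusing the three per-frame sub-detector
    answers (energy detector, basic classifier, neural net) into one
    speech/non-speech decision. The actual rules of [9], [32] are handcrafted;
    majority voting with hysteresis is only one plausible scheme."""
    state.setdefault("speaking", False)
    state.setdefault("speech_run", 0)
    state.setdefault("silence_run", 0)

    if int(energy) + int(classifier) + int(neural) >= 2:   # majority says "speech"
        state["speech_run"] += 1
        state["silence_run"] = 0
    else:
        state["silence_run"] += 1
        state["speech_run"] = 0

    if state["speaking"]:
        # leave the "speaking" state only after a sufficiently long silent run
        if state["silence_run"] >= offset_frames:
            state["speaking"] = False
    elif state["speech_run"] >= onset_frames:
        # enter the "speaking" state after a few consecutive speech frames
        state["speaking"] = True
    return state["speaking"]
```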

The association of the audio streams (microphone number) to the corresponding entity (target) generated by the 3-D tracker is done at the beginning of each recording by a supervisor.

Ambient sound, speech detection, and 3-D tracking are synchronized. As the audio events have a much higher frame rate (62.5 Hz) than the video (up to 25 Hz), we add sound events (no sound, speech, noise) to each video frame (of each entity).
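
A minimal sketch of this synchronization step, under the assumption that the audio events falling between two video frames are collapsed into a single label with the priority speech > noise > no sound (the priority order is our assumption):

```python
def sound_label_for_frame(frame_t, prev_frame_t, audio_events):
    """Attach one sound event to a video frame by collapsing the faster audio
    detections (62.5 Hz) that fall between two video frames (up to 25 Hz).
    audio_events is a list of (timestamp, label) with label in
    {"speech", "noise", "no_sound"}."""
    window = [label for (t, label) in audio_events if prev_frame_t < t <= frame_t]
    if "speech" in window:
        return "speech"
    if "noise" in window:
        return "noise"
    return "no_sound"
```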

D. Individual Role Detection

Individual roles refer to a combination of the posture, movement, and interaction with objects of an individual person in the environment. The detection is conducted for each observation frame. The input is the extracted properties of each target (position, covariance matrix, speed) provided by the 3-D tracking system. The output is one of the individual role labels (Fig. 8).

The detection process consists of three parts (Fig. 6). The first part detects the posture of the target using support vector machines (SVMs). The second part determines a movement speed value for the target, and the third part determines an interaction value with objects in the environment (normally a table). A pure SVM approach has already been successfully used to determine posture and activity values from the target properties of a 2-D video tracking system [8]. This first approach used SVMs as a black-box learning method, without considering specific target properties. From the obtained results, we concluded that, in order to optimize role recognition, we need to reduce the number of classes, as well as the target properties used for classification. Additional classes are determined by using specific target properties (speed, interaction distance) and expert knowledge (parts 2 and 3 of our approach).

The first part of the process [Fig. 6 (left)] takes the covariance matrix values of each target as input. Support vector machines (SVMs) [5], [14] are used to detect, based on these covariance values, the basic individual postures “standing,” “lying down,” and “sitting” (Fig. 7).

SVMs classify data through the determination of a set of support vectors, through minimization of the average error. The support vectors are members of the set of training inputs that outline a hyperplane in feature space. This n-dimensional hyperplane, where n is the number of features of the input vectors, defines


Fig. 4. Three-dimensional video tracking system fusing information of three 2-D trackers to a 3-D representation.

Fig. 5. Diagram of the speech activity detection system (picture from [9]).

the boundary between the different classes. The classification task is simply to determine on which side of the hyperplane the testing vectors reside.

The training vectors x_i can be mapped into a higher (maybe infinite) dimensional space by a function φ. The SVM finds a separating hyperplane with the maximal margin in this higher dimensional space. K(x_i, x_j) = φ(x_i)^T φ(x_j) is used as a kernel function. For multiclass classification, a “one-against-one” classification for each pair of the classes can be performed. The classification of the testing data is accomplished by a voting strategy, where the winner of each binary

Fig. 6. Individual role detection process: SVMs (left), target speed (middle),and distance to interaction object (right).

Fig. 7. Postures “standing,” “lying down,” and “sitting” detected by the SVMs.

comparison increments a counter. The class with the highest counter value after all classes have been compared is selected.


Fig. 8. Schema describing the combination of posture, speed, and distance values (blue arrows refer to “no interaction distance with table” and red arrows refer to “interaction distance with table”).

In our approach, we use a radial basis function kernel K(x_i, x_j) = exp(−γ‖x_i − x_j‖²) with parameters C and γ. The LIBSVM library [12] has been used for implementation and evaluation.
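
As an illustration, an equivalent posture classifier can be set up with scikit-learn's SVC, which is built on LIBSVM; the feature encoding of the covariance matrix and the values of C and gamma below are placeholders, not the settings used in the paper.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM

def covariance_features(cov: np.ndarray) -> np.ndarray:
    """Flatten the six independent entries of a target's 3 x 3 covariance matrix
    into a feature vector (one plausible encoding; the paper does not detail it)."""
    return cov[np.triu_indices(3)]

# RBF-kernel SVM with one-against-one multiclass voting, mirroring the setup
# described above. C and gamma are placeholder values.
posture_svm = SVC(kernel="rbf", C=10.0, gamma=0.1, decision_function_shape="ovo")

# posture_svm.fit(X_train, y_train)   # y_train in {"standing", "lying down", "sitting"}
# label = posture_svm.predict(covariance_features(cov).reshape(1, -1))
```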

The second part of the process [Fig. 6 (middle)] uses the speed value of each target. Based on empirical values in our smart home environment, we can then determine whether the speed of the target is zero, low, medium, or high.

The third part of the process [Fig. 6 (right)] uses the position of each target to calculate the distance to an interaction object. In our smart environment, we are interested in the interaction with a table at a known position [white table in Fig. 3 (right)]. So, we calculate the distance between the target and the table in the environment. If this distance is approaching zero (or below zero), the target is interacting with the table.

The results of the different parts of the detection process are combined following the schema in Fig. 8. Posture, speed value, and interaction value are combined into a role value.
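
The following sketch combines the three parts (posture, quantized speed, table interaction) into a role label; the thresholds and role names are illustrative, since the actual 12-role schema is defined in Fig. 8.

```python
def quantize_speed(speed: float) -> str:
    """Map a target's scalar speed (m/s) to {zero, low, medium, high}.
    Thresholds are placeholders; the paper uses empirical values for the room."""
    if speed < 0.05:
        return "zero"
    if speed < 0.3:
        return "low"
    if speed < 1.0:
        return "medium"
    return "high"

def detect_role(posture: str, speed: float, distance_to_table: float) -> str:
    """Combine posture (from the SVM), quantized speed, and table interaction
    into a role label. The mapping and role names below are purely illustrative."""
    interacting = distance_to_table <= 0.0   # zero/negative distance: target touches the table
    speed_class = quantize_speed(speed)
    if posture == "standing" and speed_class in ("medium", "high"):
        return "walking"
    if interacting:
        return "interacting with table (" + posture + ")"
    return posture + ", " + speed_class + " speed"
```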

E. Situation Learning and Recognition

Based on ambient sound, speech, and individual role detection, we generate observation codes for the multimodal behavior in the scene (see Section III). These observations are the input for situation learning and recognition. The objective is to detect six different situations (Fig. 9): siesta (one person), individual work (one person), introduction/address of welcome (multiperson), aperitif (multiperson), presentation (multiperson), and game (multiperson). For each situation, several multimodal recordings are conducted. Hidden Markov models are used to learn and detect the six different situations from the observation data.

For each situation label, we learn a left-right hidden Markov model with eight states (Fig. 10) using the Baum–Welch algorithm. A hidden Markov model (HMM) [27] is a stochastic process where the evolution is managed by states. The series of states constitutes a Markov chain which is not directly observable. This chain is “hidden.” Each state of the model generates an observation. Only the observations are visible. The objective is to derive the state sequence and its probability, given a particular sequence of observations. The Baum–Welch algorithm is used

Fig. 9. One-person situations “individual work” and “siesta” (left side) and multiperson situations “introduction,” “aperitif,” “presentation,” and “game.”

Fig. 10. Left-right HMM with eight states.

for learning an HMM from given observation sequences, while the Viterbi algorithm can be used for calculating the most probable state sequence (and its probability) for a given observation sequence and HMM. Hidden Markov models have been used with success in speech recognition [20], sign language recognition [29], handwriting gesture recognition [23], and many other domains. We adopt an HMM approach due to the temporal dynamics of human multimodal behavior, as well as the noisy character of our observation data (3-D tracking, speech activity detection, individual role detection). First experiments indicated that a hidden Markov model with eight states provides a good compromise between generality and specificity for our observation data.

In order to detect the situation for an observation sequence, we calculate the probability of this sequence for each learned situation HMM. The HMM with the highest probability (above a threshold) determines the situation label for the sequence.
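
A minimal sketch of this classification step: the log-likelihood of the observation-code sequence is computed with the forward algorithm for each situation HMM (whose parameters are assumed to have been estimated beforehand with Baum–Welch), and the best-scoring model above a threshold is returned.

```python
import numpy as np

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood log P(obs | lambda) of an integer observation-code sequence
    under one HMM lambda = (pi, A, B), computed with the forward algorithm in log
    space. log_B has shape (n_states, n_symbols); obs is a 1-D array of codes."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return float(np.logaddexp.reduce(alpha))

def recognize_situation(obs, models, threshold):
    """Return the label of the learned situation HMM scoring the sequence best,
    or None if no model exceeds the (log-likelihood) threshold. `models` maps a
    situation label to (log_pi, log_A, log_B) parameters, assumed to come from
    Baum-Welch training of eight-state left-right models, as in the paper."""
    scores = {label: log_forward(obs, *params) for label, params in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```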


Fig. 11. Multimodal entity detection codes.

Fig. 12. Enumeration of all possible combinations of two multimodal entity observation codes between 0 and 52 (left side). The number of combinations is summed up (right side), resulting in the formula to calculate the total number of combinations (right side bottom).


III. MULTIMODAL CODING AND DATA SETS

Based on individual role detection, as well as the ambient sound and speech detection, we derive multimodal observation codes for each entity created and tracked by the 3-D tracking system. The 12 individual role values (Fig. 8) are derived for each entity by the individual role detection process. Further, the ambient sound detector indicates whether there is noise in the environment or not. The speech activity detector determines whether the concerned entity is speaking or not. This multimodal information is fused into 53 observation codes for each entity (Fig. 11). Codes 1–13 (13 codes) are based on individual role detection. We further add ambient sound detection (codes 27–39 and 40–52) and speech detection per entity (codes 14–26 and 40–52).
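
Under the block layout implied by the code ranges above (role codes 1–13, +13 for speech, +26 for noise, +39 for both, and 0 for a nonexistent entity), the per-entity coding can be sketched as follows; the exact enumeration is given in Fig. 11.

```python
def entity_observation_code(role_code: int, speaking: bool, noise: bool) -> int:
    """Fuse one entity's role code (1..13), its speech activity, and the ambient
    noise flag into one of the 53 per-entity codes of Fig. 11; code 0 means the
    entity does not exist in this frame. The block layout is inferred from the
    code ranges quoted in the text."""
    if role_code == 0:
        return 0                                   # no entity present
    assert 1 <= role_code <= 13, "role codes occupy positions 1..13"
    return role_code + 13 * (int(speaking) + 2 * int(noise))
```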

Fig. 13. Fusion algorithm combining the multimodal entity observation codes of two entities. For codes between 0 and 52, the resulting codes are between 0 and 1430.

Fig. 14. Different multimodal data recordings. Recordings on the left (one person) and in the middle (two persons) contain only one specific situation, while the long recordings on the right contain several different situations. The multimodal data sets contain a total of 75 349 observation frames.

Fig. 15. Overall situation recognition rates for the one situation recordings.

As we can have several persons involved in a situation, we need to fuse the multimodal codes of several entities. We could simply combine the multimodal entity detection codes (for two entities: 53 × 53 codes, i.e., 2809 combinations). However, the result is a high number of possible observation values. Further, as we are interested in situation recognition, many of these values are redundant. For example, “person A is lying down and person B is sitting” is a different observation code than “person A is sitting and person B is lying down,” even though, from the perspective of activity recognition, the situation is identical.

So if the order of the entity observation codes is not important to us, we can express the fusion of both entity observation codes using 53 × 54 / 2 = 1431 combinations (see Fig. 12 for details). For two given entity observation codes, we can then calculate the fusion code using the algorithm in Fig. 13. First, we need to make sure that the first code holds the smaller of the two code values (step 1). In step 2, the smaller value is used to sum over combination blocks [Fig. 12 (right side)], while the larger value selects the correct entry within the combination block. The resulting observation code fuses the observation codes of two (or more) entities. In order to fuse the observation codes of more than two entities, the fusion algorithm can be applied several times, fusing successively all entities.
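
A sketch of this fusion, reconstructed from the description above and from the stated code range 0–1430 (53 · 54 / 2 = 1431 unordered combinations); the variable names are ours, not those of Fig. 13.

```python
N_CODES = 53   # per-entity observation codes 0..52

def fuse_pair(code_a: int, code_b: int) -> int:
    """Order-independent fusion of two per-entity observation codes, following
    the two steps described above. Codes 0..52 map onto fused codes 0..1430."""
    a, b = min(code_a, code_b), max(code_a, code_b)   # step 1: a holds the smaller code
    block_offset = a * N_CODES - a * (a - 1) // 2     # step 2: skip the blocks of 0..a-1
    return block_offset + (b - a)                     # entry inside block a

# Fusing with code 0 (second entity absent) leaves the single-entity code
# unchanged, matching the remark in the following paragraph.
assert fuse_pair(0, 37) == 37
assert fuse_pair(52, 52) == 1430
```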

We would like to mention that the fact that multimodal observation code 0 (Fig. 11) corresponds to the nonexistence of the entity implies that a generated multiperson code (Fig. 13) contains all generated lower codes. That is, if we generate, for example, a two-person code for only one entity, the resulting code and the observation code of the entity are identical.


Fig. 16. Situation recognition rate (long recordings) for different window sizes (in frames) used for recognition.


In order to evaluate our approach, we made several multimodal recordings in our smart home environment. The recordings involved up to two persons. First, three different recordings have been conducted for each situation label. These recordings only contain one specific situation. Further, three long recordings have been made, each containing the situations introduction, aperitif, game, and presentation. An overview of the conducted multimodal data recordings can be seen in Fig. 14. The indicated frame numbers refer to multimodal observations.

IV. EVALUATION AND RESULTS

A first series of evaluations concerned offline situation recognition for the one situation recordings [Fig. 14 (left) and (middle)]. The objective was to show the recognition performance of our approach for offline observation sequence classification. Therefore, we did a threefold cross-validation, taking two thirds of the sequences as input for learning and the remaining third of the sequences as the basis for recognition. We did an evaluation based on the multimodal observations of only one entity (codes 0–52, without the fusion algorithm). This entity-wise situation detection already obtains good performance. However, when applying the fusion algorithm (two-person codes 0–1430), we obtain the best results.
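
A sketch of this evaluation protocol, assuming one natural split in which fold k holds the k-th recording of every situation (the paper only states a threefold two-thirds/one-third scheme, so the exact split is our assumption).

```python
def threefold_cross_validation(recordings, evaluate):
    """Threefold cross-validation over the one-situation recordings.
    `recordings[label]` is assumed to hold the three observation sequences
    recorded for that situation; fold k uses the k-th recording of every
    situation for testing and the other two for learning. `evaluate(train, test)`
    is assumed to learn the situation HMMs on `train` and return the recognition
    rate on `test`, both lists of (sequence, label) pairs."""
    rates = []
    for k in range(3):
        test = [(seqs[k], label) for label, seqs in recordings.items()]
        train = [(seqs[j], label) for label, seqs in recordings.items()
                 for j in range(3) if j != k]
        rates.append(evaluate(train, test))
    return sum(rates) / len(rates)
```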

A second series of evaluations concerned the detection of the situations within the long recordings [Fig. 14 (right)]. The objective was to show the performance of our approach for online situation detection. Therefore, we used the one situation recordings [Fig. 14 (left) and (middle)] as input for learning the HMMs corresponding to the different situations. We applied the fusion algorithm to the learning sequences. In order to provide online situation recognition, the size of the observation sequence used for recognition needs to be limited. We slide an observation window of different sizes from the beginning to the end of the recordings, constantly recognizing the current situation.

Fig. 17. Confusion matrix and information retrieval statistics for each situation (observation window size = 250 frames). The overall situation recognition rate is 66.51%.

Fig. 18. Confusion matrix and information retrieval statistics for each situation (observation window size = 1500 frames). The overall situation recognition rate is 88.79%.

Fig. 16 depicts the obtained average situation recognition rates for different window sizes.
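
A sketch of this sliding-window procedure, reusing recognize_situation from the sketch in Section II-E; the step size and threshold below are illustrative choices.

```python
import numpy as np

def online_recognition(observations, models, window=1500, step=25,
                       threshold=float("-inf")):
    """Slide a fixed-size observation window over a long recording and label each
    window position with the most likely situation HMM. window=1500 frames
    corresponds to 60 s at 25 frames/s."""
    obs = np.asarray(observations)
    labels = []
    for end in range(window, len(obs) + 1, step):
        segment = obs[end - window:end]
        labels.append((end, recognize_situation(segment, models, threshold)))
    return labels
```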

If we limit the observation time provided for recognition to 10 s (i.e., 250 frames with a frame rate of 25 frames/s), we get an overall recognition rate of 66.51% (Fig. 17). By increasing the observation time to 60 s, the overall recognition rate rises to 88.79% (Fig. 18). However, wrong detections between “aperitif” and “game” persist, resulting in a poor precision for “aperitif” and a poor recall for “game.” This is partially due to the ambiguity of both situations from the point of view of the available observations. Interacting with the table, gesturing, and speaking/noise are characteristic of both situations.


Fig. 19. Web interface visualizing detected roles and situations (per entity) in real-time.

Online situation detection does not only involve the correct classification of observation sequences with regard to learned situations (as offline situation recognition does). The transitions between situations need to be correctly detected. Further, observation sequences used for recognition are of limited length (observation windows). Thus, online situation detection is a more difficult problem, and the obtained results are not as good as those for offline situation recognition.

V. CONCLUSION AND PERSPECTIVES

This paper proposed and evaluated an approach for learning and recognizing human behavior models from multimodal observation in a smart home environment. The approach is based on audio and video information. A 3-D video tracking system was used to create and track entities (persons) in the scene. Based on the entity properties extracted by the 3-D tracker, an individual role detector derived basic individual activities like “walking” or “interacting with table.” Speech activity detection for each entity and noise detection in the environment complete the multimodal observations. Different situations like “aperitif” or “presentation” are learned and detected using hidden Markov models. The proposed approach should achieve two objectives: the offline analysis of human behavior in multimodal data recordings with regard to learned situations, and the online detection of learned behavior models. The conducted evaluations showed good results, validating that the approach is applicable to both objectives.

The online detection of learned behavior models is of particular interest because it opens the way to a number of new applications (videoconferencing, surveillance of elderly people, etc.). Our approach has served as the basis for creating a web service communicating the detected behavior of occupants of our smart home environment. A distant user can visualize the current multimodal behavior in the scene by using a web interface (Fig. 19). The role values of the detected entities as well as the situations are indicated. Situation detection is at present conducted for the observations of each individual entity only (without multiperson fusion). The reason is that multiperson situation recognition requires (predetermined) group formation information in order to fuse the observations of the correct entities. People tend, however, to dynamically change group formation, in particular in (informal) real settings. This makes multiperson situation recognition a difficult task.

A solution to alleviate this problem is to analyze group formation. One issue is the determination of the focus of attention of each individual [22]. The correlation between attentional focus and/or speech contributions can be used to derive (interaction) groups [7]. Applied to our approach to activity recognition, the (probabilistic) correlation or distance between the situations of participants, as well as speech correlation, could be good indicators for group formation. Once the interaction groups are determined, multiperson situation recognition can be performed for each group. We would like to mention that interaction group detection and situation recognition are interdependent. On the one hand, the determination of group configurations is necessary for multiperson situation recognition. On the other hand, the situations detected for one or several individuals are strong indicators for possible group membership.

ACKNOWLEDGMENT

The authors thank P. Reignier, D. Vaufreydaz, G. Privat, andO. Bernier for their remarks and support.


REFERENCES

[1] J. Allen, “Maintaining knowledge about temporal intervals,” Commun. ACM, vol. 26, no. 11, pp. 832–843, 1983.

[2] P. Bessière and B. R. Group, “Survey: Probabilistic methodology and techniques for artifact conception and development,” INRIA Rhône-Alpes, Tech. Rep., Feb. 2003.

[3] A. Biosca-Ferrer and A. Lux, “A visual service for distributed environments: A Bayesian 3D person tracker,” INRIA, Tech. Rep., 2007.

[4] A. F. Bobick, S. S. Intille, J. W. Davis, F. Baird, C. S. Pinhanez, L. W. Campbell, Y. A. Ivanov, A. Schütte, and A. Wilson, “The KidsRoom: A perceptually-based interactive and immersive story environment,” Presence, vol. 8, no. 4, pp. 369–393, 1999.

[5] B. Boser, I. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers,” in Proc. 5th Annu. Workshop on Computational Learning Theory, 1992.

[6] O. Brdiczka, P. Reignier, J. L. Crowley, D. Vaufreydaz, and J. Maisonnasse, “Deterministic and probabilistic implementation of context,” in Proc. 4th IEEE Int. Conf. Pervasive Comput. Commun. Workshops, 2006, pp. 46–50.

[7] O. Brdiczka, J. Maisonnasse, and P. Reignier, “Automatic detection of interaction groups,” in Proc. Int. Conf. Multimodal Interfaces, 2005, pp. 32–36.

[8] O. Brdiczka, J. Maisonnasse, P. Reignier, and J. L. Crowley, “Learning individual roles from video in a smart home,” in Proc. 2nd IEE Int. Conf. Intell. Environ., Athens, Greece, Jul. 2006.

[9] O. Brdiczka, D. Vaufreydaz, J. Maisonnasse, and P. Reignier, “Unsupervised segmentation of meeting configurations and activities using speech activity detection,” in Proc. 3rd IFIP Conf. Artif. Intell. Applicat. Innov. (AIAI), Athens, Greece, Jun. 2006, pp. 195–203.

[10] B. Brumitt, B. Meyers, J. Krumm, A. Kern, and S. Shafer, “EasyLiving: Technologies for intelligent environments,” in Proc. 2nd Int. Symp. Handheld and Ubiquitous Computing, ser. Lecture Notes in Computer Science, vol. 1927, 2000, pp. 12–29.

[11] A. Caporossi, D. Hall, P. Reignier, and J. L. Crowley, “Robust visual tracking from dynamic control of processing,” in Proc. Int. Workshop on Performance Evaluation for Tracking and Surveillance, 2004, pp. 23–32.

[12] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, 2001. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[13] D. J. Cook, M. Youngblood, E. O. Heierman, K. Gopalratnam, S. Rao, A. Litvin, and F. Khawaja, “MavHome: An agent-based smart home,” in Proc. 1st IEEE Int. Conf. Pervasive Comput. Commun., 2003.

[14] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, pp. 273–297, 1995.

[15] J. Coutaz, J. L. Crowley, S. Dobson, and D. Garlan, “Context is key,” Commun. ACM, vol. 48, no. 3, pp. 49–53, Mar. 2005.

[16] J. L. Crowley, J. Coutaz, G. Rey, and P. Reignier, “Perceptual components for context aware computing,” in Proc. 4th Int. Conf. Ubiquitous Comput., 2002.

[17] A. K. Dey, “Understanding and using context,” Pers. Ubiquitous Comput., vol. 5, pp. 4–7, 2001.

[18] P. Dourish, “What we talk about when we talk about context,” Pers. Ubiquitous Comput., vol. 8, pp. 19–30, 2004.

[19] R. Emonet, D. Vaufreydaz, P. Reignier, and J. Letessier, “O3MiSCID: An object oriented opensource middleware for service connection, introspection and discovery,” in Proc. 1st IEEE Int. Workshop on Services Integration in Pervasive Environments, Jun. 2006.

[20] X. D. Huang, Y. Ariki, and M. A. Jack, Hidden Markov Models for Speech Recognition. Edinburgh, U.K.: Edinburgh Univ. Press, 1990.

[21] S. W. Loke, “Representing and reasoning with situations for context-aware pervasive computing: A logic programming perspective,” Knowledge Eng. Rev., vol. 19, no. 3, pp. 213–233, 2005.

[22] J. Maisonnasse, N. Gourier, O. Brdiczka, and P. Reignier, “Attentional model for perceiving social context in intelligent environments,” in Proc. 3rd IFIP Conf. Artif. Intell. Applicat. Innov. (AIAI), Jun. 2006, pp. 171–178.

[23] J. Martin and J.-B. Durand, “Automatic handwriting gestures recognition using hidden Markov models,” in Proc. 4th IEEE Int. Conf. Automatic Face and Gesture Recognition (FG 2000), 2000, pp. 403–409.

[24] I. McCowan, D. Gatica-Perez, S. Bengio, G. Lathoud, M. Barnard, and D. Zhang, “Automatic analysis of multimodal group actions in meetings,” IEEE Trans. Pattern Anal. Machine Intell., vol. 27, no. 3, pp. 305–317, Mar. 2005.

[25] M. Muehlenbrock, O. Brdiczka, D. Snowdon, and J.-L. Meunier, “Learning to detect user activity and availability from a variety of sensor data,” in Proc. 2nd IEEE Int. Conf. Pervasive Comput. Commun., 2004, pp. 13–23.

[26] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 831–843, 2000.

[27] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.

[28] P. Ribeiro and J. Santos-Victor, “Human activity recognition from video: Modeling, feature selection and classification architecture,” in Proc. Int. Workshop on Human Activity Recognition and Modelling, 2005.

[29] T. E. Starner, “Visual recognition of American Sign Language using hidden Markov models,” Ph.D. dissertation, MIT Media Laboratory, Perceptual Computing Section, Cambridge, MA, 1995.

[30] R. Stiefelhagen, H. Steusloff, and A. Waibel, “CHIL—Computers in the human interaction loop,” in Proc. Int. Workshop on Image Analysis for Multimedia Interactive Services, 2004.

[31] L. Suchman, Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge, U.K.: Cambridge Univ. Press, 1987.

[32] D. Vaufreydaz, “IST-2000-28323 FAME: Facilitating agent for multicultural exchange (WP4),” Eur. Commission Project IST-2000-28323, Oct. 2001.

[33] M. Weiser, Ubiquitous Computing: Definition 1, 1996. [Online]. Available: http://www.ubiq.com/hypertext/weiser/UbiHome.html

[34] S. Zaidenberg, O. Brdiczka, P. Reignier, and J. L. Crowley, “Learning context models for the recognition of scenarios,” in Proc. 3rd IFIP Conf. Artif. Intell. Applicat. Innov., 2006, pp. 86–97.

[35] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud, “Multimodal group action clustering in meetings,” in Proc. Int. Workshop on Video Surveillance and Sensor Networks, 2004.

Oliver Brdiczka received the Diploma degree in computer science from the University of Karlsruhe, Karlsruhe, Germany, the Engineer's degree from the Ecole Nationale Supérieure d'Informatique et de Mathématiques Appliquées (ENSIMAG), Grenoble, France, and the Ph.D. degree from the Institut National Polytechnique de Grenoble (INPG). His Ph.D. research was with the PRIMA Research Group, INRIA Rhône-Alpes Research Center.

After that, he directed the Ambient Collaborative Learning Group, Technical University of Darmstadt, Germany. He is currently a Scientific Researcher with the Palo Alto Research Center, Palo Alto, CA. His research interests include context modeling, activity recognition, machine learning, and e-learning.

Matthieu Langet received the Engineer's degree from the Formation d'Ingénieur en Informatique de la Faculté d'Orsay, France.

He was with the CNRS Laboratoire de Recherche en Informatique for two years as a Research Engineer. He joined the PRIMA Research Group, INRIA Rhône-Alpes Research Center, France, in February 2006 to work as a Research Engineer on the HARP Project.

Jérôme Maisonnasse received the M.S. degree in cognitive sciences from the Institut National Polytechnique de Grenoble (INPG), France. He is currently working towards the Ph.D. degree in cognitive sciences at the Université Joseph Fourier, Grenoble.

He joined the PRIMA Research Group, INRIA Rhône-Alpes Research Center, France, in January 2004. His main research interest is human activity recognition for human–computer interaction.


James L. Crowley leads the PRIMA Research Group, INRIA Rhône-Alpes Research Center, Montbonnot, France. He is a Professor with the Institut National Polytechnique de Grenoble (INPG), France. He teaches courses in computer vision, signal processing, pattern recognition, and artificial intelligence at the Ecole Nationale Supérieure d'Informatique et de Mathématiques Appliquées (ENSIMAG). He has edited two books and five special issues of journals, and has authored over 180 articles on computer vision and mobile robotics. He ranked number 1473 in the CiteSeer most cited authors in computer science (August 2006).
