Making action recognition work in the real world

Dr Fabio Cuzzolin and Professor Philip H.S. Torr
Department of Computing and Communication Technologies, Oxford Brookes University

Abstract

Action and activity recognition lies at the core of a panoply of scenarios in human-machine interaction, ranging from gaming, mobile computing and video retrieval to health monitoring, surveillance, robotics and biometrics. The problem, however, is made truly challenging by the inherent variability of motions carrying the same meaning, the unavoidable overfitting due to limited training sets, and the presence of numerous nuisance factors, such as locality, viewpoint, illumination and occlusions, that make real-world deployment a still distant prospect. The most successful recent approaches, which mainly classify bags of local features, have reached their limits: only understanding the spatial and temporal structure of human activities can help us locate and recognize them in a robust and reliable way. We propose here to develop novel frameworks for the integration of action structure in both generative and discriminative models, pushing for a breakthrough in activity recognition that would have enormous exploitation potential.

1 Previous research track record

Principal Investigator

Dr Fabio Cuzzolin graduated in 1997 from the University of Padua (Universitas Studii Paduani, founded in 1222, the seventh most ancient university in the world) with a laurea magna cum laude in Computer Engineering and a Master's thesis on "Automatic gesture recognition". He received a Ph.D. degree from the same institution in 2001, for a thesis entitled "Visions of a generalized probability theory". He was first a Visiting Scholar at Washington University in St. Louis (12th in the US university rankings), and was later appointed as a fixed-term Assistant Professor at Politecnico di Milano, Italy (consistently recognized as the best Italian university). Subsequently, he moved as a Postdoctoral Fellow to the University of California at Los Angeles, and received a Marie Curie Fellowship in partnership with INRIA Rhône-Alpes, France. He joined the internationally recognized Computer Vision group at Oxford Brookes University in September 2008, and has been a Reader there since September 2011.
Dr Cuzzolin was recently nominated Coordinator of the new MSc in Computer Vision which will be launched by the Department in September 2013. He took on the role of Head of the Machine Learning research group in September 2012, and is at the same time setting up his own group on Vision and Imprecise Probabilities, likely to include three Ph.D. students, a post-doc and two visiting students in 2013.

Publication record. Dr Cuzzolin's research interests span both machine learning applications to computer vision, including gesture and action recognition and identity recognition from gait, and uncertainty modeling via imprecise probabilities, to which he has contributed by developing an original geometric approach to belief functions and other uncertainty measures. His scientific productivity is extremely high, as the forty papers he has published in the last five years attest. Dr Cuzzolin is currently the author of 73 peer-reviewed scientific publications (published or under review), most of them as first or sole author, including a monograph, two book chapters, 19 journal papers, and 9 chapters in collective volumes (http://cms.brookes.ac.uk/staff/FabioCuzzolin/). Dr Cuzzolin's top papers have all been rated "four star" at the latest mock REF meetings by panels of both external and internal reviewers.
He won awards for Best Paper at the Pacific Rim Conference on AI (PRICAI'08), Best Poster at the ISIPTA'11 Symposium on Imprecise Probabilities, and Best Poster for a work on "Learning discriminative space-time actions from weakly labelled videos" [46] at the 2012 INRIA Summer School on Machine Learning and Visual Recognition, and was short-listed for prizes at the ECSQARU'11 and British Machine Vision (BMVC12) conferences, where he was given the Outstanding Reviewer Award.

Editorial boards and TPC memberships. Dr Cuzzolin is currently Associate Editor of the "IEEE Transactions on Systems, Man, and Cybernetics - Part C" and Guest Editor for "Information Fusion", and collaborates with several other top international journals in both computer vision and artificial intelligence. He has recently been confirmed as a member of the Board of Directors of the "Belief Functions and Applications Society" (http://www.bfasociety.org/), in recognition of his standing in the imprecise probabilities community. He has served on the Technical Program Committees of more than 30 international conferences in both imprecise probabilities (e.g. ISIPTA, ECSQARU, BELIEF) and computer vision (e.g. VISAPP), and is a reviewer for top vision conferences such as BMVC, ICCV and ECCV. Dr Cuzzolin will be the chair and local organizer of the upcoming 3rd International Conference on the Theory of Belief Functions (BELIEF 2014).

Proposer's Related Work. The proposer is very active in human motion analysis and recognition, where he has explored the use of bilinear and multilinear models for identity recognition from gait [10, 11]. As performance in this context is influenced by factors as diverse as viewpoint, emotional state, illumination and the presence of clothing/occlusions, gait is naturally modeled via tensor algebra, with each covariate factor associated with a mode of the tensor. He has published a number of papers on spectral motion capture [42], focusing in particular on the crucial issue of how to select and map eigenspaces generated by two different shapes, in order to track 3D points on their surfaces or consistently segment body parts along sequences of voxelsets [14, 13], as a preprocessing step to action recognition. In direct relation to the topic of this proposal, he is now exploring manifold learning techniques for dynamical models representing (human) motions [12], and together with his student Michael Sapienza and Philip Torr has achieved extremely promising preliminary results on the selection of discriminative action parts for recognition and localization [46], which won them a Best Poster Prize at the last INRIA Summer School and were shortlisted for a prize at BMVC12.

Grants and awards. Dr Cuzzolin has recently been awarded a 122K EPSRC First Grant on "Tensorial modeling of dynamical systems for gait and activity recognition", a project which has received exceptional reviews. The project started in August 2011, and proposes a generative approach to recognition quite related to parts of the current proposal. He has also applied for a Project Grant on "The total probability theorem for finite random sets" with the Leverhulme Trust, which invited the submission of a full proposal on the topic. He is submitting as well another outline Leverhulme proposal, on "Guessing plots for video googling", with Professor T. Lukasiewicz of Oxford University, with whom he is planning a Google Research Award application. At the European level, he has just submitted to the last FP7 Call 9 a 3 million euro STREP on "Dynamical Generative and Discriminative Models for Action and Activity Localization and Recognition" as Coordinator, with IDSIA (Switzerland), Universiteit Gent (Belgium), SUPELEC and Dynamixyz (France) as partners. He is also finalizing (again as Coordinator) a Short Proposal for a Future and Emerging Technologies (FET-Open) FP7 EU grant on "Large scale hybrid manifold learning", with INRIA, Pompeu Fabra and Technion. At the UK level, Dr Cuzzolin is setting up an EPSRC Network on Uncertainty Theory (NUTS) with Professors J. Hall and T. Lukasiewicz (Oxford University), J. Lawry (Bristol), F. Coolen (Durham), J-B. Yang (Manchester Metropolitan), W. Liu (U. Belfast), A. Hunter (UCL) and others, and is involved as a partner in an EPSRC network proposal on Compressed Sensing, led by Professor Mark Plumbley.

Co-Investigator

Academic profile. Professor Philip Torr graduated in 1990 from Southampton University with a B.Sc. in Mathematics with Computer Science (1st class). He then undertook research at Oxford University for seven years, as both a D.Phil. student and a postdoctoral researcher. In 1997 he went to work for two years for Microsoft Research in Redmond, USA; returning to England, he was one of the initial members of Microsoft Research Europe based at Cambridge, where he established the vision group and was involved in hiring and managing several research staff. In November 2003 he returned to academia to take up the post of Professor in Computing at Oxford Brookes University, in order to establish his own research group there (http://cms.brookes.ac.uk/research/visiongroup/). He has supervised seven PhD students to completion; two (Bjorn Stenger and Pushmeet Kohli) have been awarded the Sullivan Prize for best thesis, and another (Pushmeet Kohli) was runner-up for the British Computer Society doctoral thesis prize. His research interests are at the nexus of machine vision, machine learning, graphics and Bayesian methods. Much of his work on robust estimation of epipolar geometry and model selection now features in the standard computer vision textbooks by Hartley and Zisserman, and Forsyth and Ponce. He is on the editorial boards of the ACM magazine Computers in Entertainment, IEEE PAMI and the International Journal of Image and Vision Computing. In 1998 he was awarded the Marr Prize, the most prestigious prize in the field; he and his co-workers have received awards at 6 other conferences, including best paper at CVPR'08, BMVC'10 and ECCV'10, and an honourable mention at NIPS. Most recently he has been awarded a Royal Society Wolfson Research Merit Award for his research in computer vision. He has served on the program committee and as area chair for all the major computer vision conferences, and has been program chair of several conferences, including the European Conference on Computer Vision 2008. Prof. Torr's work is highly cited in the field (see e.g. http://citeseer.ist.psu.edu/), and a full list of publications can be found at http://cms.brookes.ac.uk/staff/PhilipTorr.

Contribution to UK Competitiveness. During this time his research and software were (together with that of P. Beardsley, A. Fitzgibbon and A. Zisserman) used to form a new startup company, 2d3 (http://www.2d3.com/), part of the Oxford Metrics Group (OMG). Their first product, "boujou", is used by special effects studios all over the world. Boujou tracks the motion of the camera to allow clean video insertion of objects, and has been used on the special effects of almost every major feature film of the last five years, including the "Harry Potter" and "Lord of the Rings" series. Prof. Torr has worked directly with the following UK-based companies: 2d3, Vicon Life, Yotta (http://www.yotta.tv/company/), Microsoft Research Europe, Sharp Laboratories Europe and Sony Entertainments Europe, with contributions to commercial products appearing (or about to appear) with four of them. His work is currently in use in the film and game industry. His work with the Oxford Metrics Group in a Knowledge Transfer Partnership (2005-9) won Best Knowledge Transfer Partnership of the year at the 2009 awards, sponsored by the Technology Strategy Board and selected out of several hundred projects (http://www.ktponline.org.uk/awards 2009/BestKTP.aspx).

Related Work. Prof. Torr has been the PI on several grants, several of which are related to the topic of this proposal. His EPSRC First Grant (cash limited to 120K) was Markerless Motion Capture for Humans in Video (GR/T21790/01(P)), Oct 2004-Oct 2007, which led to a large output of research, including four papers accepted as orals at the top vision conferences. The student on the grant (P. Kohli) won two best thesis awards. The grant was rated as internationally outstanding (the top mark). On the commercial front, this grant led to technology transfer to Vicon and to a UK patent (P106738GB). In addition, the first grant led to three companies investing in Oxford Brookes to work on various aspects of human motion analysis. Prof. Torr has a Knowledge Transfer Partnership (KTP) grant with Sony (to take some of the markerless motion capture technology and apply it to EyeToy games), and had one with Vicon (also a sister company of Yotta) employing two associates who added much increased video functionality to the existing marker-based system.
His second EPSRC grant, Automatic Generation of Content for 3D Displays (EP/C006631/1, Nov 2005-May 2009), has led to a SIGGRAPH paper [25] (and patent) as well as paper prizes at IEEE CVPR 2008 [57] and at NIPS (Neural Information Processing Systems) 2007, amongst others [32, 34, 2, 52]. The majority of these papers have been published in the main journals of the field: IJCV, JMLR, PAMI. The product arising from the SIGGRAPH paper (VideoTrace) led to a spin-off company.

The organization: Oxford Brookes University

The Department of Computing and Communication Technologies comprises around 30-35 academic staff, including world leaders such as Prof. David Duce (co-chair of Eurographics 03, 06) and Prof. Rachel Harrison, Editor in Chief of Software Engineering. Projections for the next REF indicate an overall score of at least 3.1. The School of Technology has recently established a doctoral training programme on Intelligent Transport Systems (http://tech.brookes.ac.uk/research/), whose infrastructure will be directly beneficial to this project, and which concerns a scenario in which a quad bike is able to understand pedestrians' behaviour and gesturally interact with them. Prof. Torr holds a research chair with minimal undergraduate teaching duties and hence has sufficient time to conduct the research listed in this proposal. Dr Cuzzolin is fully supported by the School in his goal of reaching a professorial position within a couple of years.


2 Proposed research and its context

2.1 Background

2.1.1 Action recognition: scenarios and challenges

Since Johansson's classic experiment [29], recognising human activities from videos has been an increasingly important topic in computer vision. The problem consists in telling, given one or more image sequences capturing one or more people performing various activities, which categories (among those previously learned) these activities belong to.

Figure 1: Action and activity recognition has manifold applications: virtual reality, human-computer interaction, surveillance and entertainment are just some examples.

Why is action recognition important. The potential societal impact of reliable automatic action recognition is enormous, and involves a variety of scenarios (Figure 1). Historically, ergonomic human-machine interfaces, allowing humans to gesturally interact with virtual characters for entertainment or educational purposes, or to avoid the use of keyboard and mouse, were the first to be envisaged. Smart rooms have been imagined, in which people are assisted in their everyday activities by distributed intelligence in their own homes (switching lights as they move through the rooms, interpreting their gestures to replace remote controls and switches, etcetera). In particular, given our rapidly ageing population, semi-automatic assistance to non-autonomous elderly people is rapidly gaining interest. Security guards can be assisted by semi-automatic event detection algorithms able to bring anomalous events to their attention for surveillance purposes; identity recognition from gait is also being studied as a novel "behavioral" biometric, based on people's distinctive gait patterns. A new generation of consoles (such as Microsoft's Kinect) has opened new directions in the gaming industry: yet, these only track the user's movements, without any real interpretation of their actions, which could "spice up" the gaming experience to the players' satisfaction. Last but not least, techniques able to efficiently data-mine the thousands of videos people post, say, on Facebook or YouTube are sorely needed: the potential of a "drag and drop" style application, similar to the one Google provides for images, able to retrieve videos with the same "semantic" content is easy to imagine.

Critical issues of activity recognition. Unfortunately, action recognition is a hard problem, for a number of reasons.

Challenge 1: inherent variability and limited training sets. Human motions possess an extremely high degree of inherent variability: quite distinct movements can carry the same meaning or represent the same gesture. Action models are normally learned from necessarily inadequate datasets (in terms of number of training videos and number of action classes), posing serious over-fitting issues: models describe the available training sets very well, but have limited generalization power.

Challenge 2: actions "in the wild" and influence of covariates. In addition, actions are subject to a large number of nuisance or "covariate" factors [40], such as illumination, moving backgrounds, viewpoint, and many others (Figure 2). For this reason, tests have often been run in small, controlled environments: few attempts have been made to progress towards recognition "in the wild" [40, 26].

Challenge 3: space/time localization. Detecting when and where a semantically meaningful action takes place within a video sequence is the first necessary step in any action recognition framework [37]: so far, however, the focus has largely been on the recognition of pre-segmented videos.

Challenge 4: multiple agents. The presence of multiple actors [45], e.g., different players sitting in front of a single console, greatly complicates both localization and recognition.

Challenge 5: complex activities versus elementary actions. A serious challenge arises when we try to move from simple, "atomic" actions to more complex, sophisticated "activities": series of elementary actions connected in a meaningful way [1], common, for instance, in the smart home scenario.

Challenge 6: beyond pure recognition. In video retrieval the purpose is not so much to recognize motions as to (automatically) compute a descriptor of the video at hand, and to search the internet for videos with a similar description/signature.

Figure 2: Some issues with action recognition: viewpoint, illumination, analysis of complex activities, multiple actors.

2.1.2 Pros and cons of current state-of-the-art approaches

Bag-of-features methods and their issues. Recognition methods that neglect action dynamics (typically by extracting local spatio-temporal features [8] from the 3D volume associated with a video, Figure 3) have been very successful in recent times [30, 60], at least on datasets of fairly limited size, with few action categories. However, when tested on the most recent human motion dataset (such as HMDB51 [33]), such Bag-of-Features (BoF) models [36] have been shown to achieve classification rates of just over 20%, suggesting that these approaches do not fully capture the complexity of human actions, and that it may be necessary to take action dynamics more into account. Psychophysical experiments indicate that both motion and shape are critical for the visual recognition of human actions [29]. Besides, entirely discarding the spatio-temporal structure produces clear paradoxes: by scrambling the frames of a spatio-temporal volume we can obtain videos with absolutely no meaning to us, but with (only roughly, as descriptors are extracted at multiple space and time scales) the same descriptors as the original one.

Figure 3: BoF methods build a vector of occurrence frequencies of a set of local features extracted from the video: any valuable spatio-temporal relationship is lost. As features are computed locally, although at different scales, meaningless videos with roughly the same histogram can be generated.

Dynamical generative modeling and its issues. On the other hand, dynamical generative models, i.e., systems of differential or discrete-time equations designed to model a given phenomenon (either deterministically or statistically), have shown a clear potential for action recognition in the past. Hidden Markov models [17], for instance, have been widely employed in action and gait recognition [50]; the use of linear, nonlinear [9], stochastic [54] or even chaotic [3] dynamical systems has also been proposed.
Generative dynamical models have a number of desirable features that can be exploited to address several of the challenges of action recognition. In primis, they provide ways of automatically segmenting in time an action embedded in a larger video sequence [50], for instance by assessing the likelihood of the sequence being generated by the model at each time instant. Models such as max-margin CRFs [56] have been shown to be able to address locality by recognizing actions as constellations of local motions. When a whole crowd is monitored, it is useful to consider the crowd itself as a sort of fluid: dynamical models are well equipped to deal with such a scenario. Moreover, sophisticated graphical models can be exploited to learn in a bottom-up fashion the "plot" of a footage [23], which can then be used to retrieve the video over the internet.
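To make the likelihood-based temporal segmentation idea concrete, here is a minimal sketch (ours, not a deliverable of the proposal), assuming the Python hmmlearn library, synthetic stand-in features, and an arbitrary window length and threshold:

import numpy as np
from hmmlearn import hmm  # assumed dependency

# Train an HMM on feature sequences of a known action class.
# 'train_seq' is a hypothetical (T, d) array of per-frame features.
train_seq = np.random.randn(200, 10)          # stand-in for real features
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
model.fit(train_seq)

# Slide a window over a long test video and record the per-frame
# average log-likelihood: peaks suggest where the action occurs.
test_seq = np.random.randn(1000, 10)          # stand-in for a long video
win = 30                                      # window length (arbitrary)
scores = np.array([model.score(test_seq[t:t + win]) / win
                   for t in range(len(test_seq) - win)])

# Frames whose window scores above a threshold are tentatively
# labelled as belonging to the action (the threshold is ad hoc here).
detected = np.where(scores > scores.mean() + scores.std())[0]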

Unfortunately, generative dynamical models have sometimes been found too rigid to cope with the many nuisance factors or the inherent variability of human actions. These factors, and the fact that the same models (or videos) can be endowed with different labels (e.g., action, ID, emotional state), doom any naive approach to the classification of generative models (typically via k-NN classification based on some fixed distance function [51]) to failure. More flexible classification methodologies are necessary to develop the full potential of the generative approach to classification, while retaining the desirable features of generative models.

2.1.3 Industrial and societal context

Target industries. The market potential of action, gesture and activity recognition, briefly outlined in the above scenarios, is just too big to be described extensively here. Microsoft's Kinect console, for instance, with its controller-free gaming experience, is already revolutionizing the market of interactive video games: the Oxford Brookes Vision Group enjoys continuing strong links with Microsoft through its founder Philip Torr. Intelligent action recognition can significantly improve user experience, making games which merely track the user's body posture out of fashion. In security and surveillance, most biometric companies focus at the moment on cooperative techniques such as face or iris recognition: investing in behavioral, non-cooperative biometrics before the rest of the market could give them a significant competitive advantage. Automatic event recognition can lead to more efficient security protocols in which humans do not have to waste long hours watching static scenes, but are only alerted when some unusual activity is actually taking place. Efficient tools for gesture recognition can greatly help support disabled or elderly people, hopefully improving their quality of life. On the business scene, companies such as Google are investing heavily in video retrieval as the next level in the browsing experience. In all these scenarios, and in many others, truly robust action recognition frameworks would contribute enormously, via significant gains in productivity, to boosting the economic sectors involved in times of economic uncertainty.

Government policy. With its implications for crime prevention and security, automatic and semi-automatic surveillance is of obvious interest to policy-makers and government agencies. These are likely to be attracted by the idea of novel surveillance and biometric systems able to improve the general level of security of the UK, and that of sensitive areas in particular, especially in uncertain times such as ours. Airport management authorities such as BAA, railway companies and public transport authorities (e.g., Transport for London) are some examples. The infrastructure is basically already there, as millions of CCTV cameras are active throughout the country.

2.2 Research hypotheses and objectives

2.2.1 - Research idea: pushing the boundaries of generative and discriminative modeling. In summary, while discriminative approaches have been successful in controlled environments, they need to include a description of the spatio-temporal structure of an action if they are to tackle issues such as action localization, multi-agent recognition, and complex activities. On the other hand, generative graphical models have attractive features in terms of automatic segmentation, localization and the extraction of plots from videos, but suffer from a tendency to overfit the available training data. We propose to break new ground in both these respects, with significant impact on the real-world deployment of action recognition tools, by designing novel modeling and classification techniques (both generative and discriminative) able to incorporate the spatio-temporal structure of the data, while dealing with insufficient amounts of training information and nuisance factors.

Pipeline # 1: metric learning for generative models. Generative dynamical models cannot be classified in naive, top-down ways. The machine learning literature supports instead the idea of learning, in a supervised fashion, the "best" distance function for a specific classification problem [7, 48], at least in the linear case. However, generative models live in complex, non-linear spaces [4, 12, 39]. Developing novel manifold learning techniques for their classification can push towards "recognition in the wild" scenarios and cope with the inherent variability of actions.

Pipeline # 2: introducing S/T structure in discriminative models. Not only is localizing an action within a larger video a hard problem (as the spatial and temporal extent of an action is ambiguous even to a human observer), but discarding the structure of human motions can be detrimental to recognition, especially when dealing with complex activities. We propose to represent actions and complex activities as spatio-temporal "objects", composed of distinct but coordinated "parts" (elementary motions, simple actions), building on the already significant results of discriminative models, and inspired by the successes achieved by similar approaches in 2D object detection [20]. Transfer learning techniques [44] can then be employed to transfer localization models learnt on the few labelled datasets to the many more unlabelled ones.

2.2.2 - Novelty and contributions. Via these new developments in both discriminative and generative modeling we aim at the following breakthroughs in action recognition.
Breakthrough #1: effective space-time localization, not just a first necessary step of the process but also crucial to discriminate actions sharing a great deal of context or common parts.
Breakthrough #2: recognition with multiple actors, whose individual behavior can be resolved by detecting the most discriminative regions of a video.
Breakthrough #3: taking a step towards recognition "in the wild" via more flexible classification techniques able to address nuisance factors and the inherent variability of human actions, to finally bring action recognition out of our research labs.
Breakthrough #4: exploiting unlabeled videos, by transferring models learnt from the few datasets for which manual labeling is available to the many more for which it is not.
Breakthrough #5: moving from actions to activities by incorporating their spatial and temporal structure.

Novel, learning-based classification techniques for generative graphical models (pipeline 1) can allow the latter to cope with variability, overfitting, and nuisance factors, while retaining their desirable features in terms of temporal segmentation and flexible description of complex activities.


Figure 4: The "generative" pipeline: pullback metric learning for dynamical models. Once each training video sequence is encoded as a graphical model (for instance a hidden Markov model, left), we obtain a training set of such models in the appropriate model space M (right). Any automorphism F of M induces a push-forward map of tangent vectors on M, which in turn induces a pullback metric there. By parameterizing F we obtain a whole family of pullback metrics to optimize over, from which we can select the most discriminative one.

Novel discriminative models which incorporate the actions' spatio-temporal structure (pipeline 2) can naturally deal with spatio-temporal localization and multiple actors, push towards recognition in the wild via the selection of the most discriminative action parts, exploit weakly supervised data, and cope with the analysis of complex activities and the reconstruction of their plots.

2.2.3 - Goals of the project. Consequently, we plan to achieve the following objectives:
1 - proving how explicitly modeling spatio-temporal structures allows us to cope with the major current challenges in action and activity recognition, resulting in a dramatic increase in the robustness of the overall system;
2 - to this purpose, developing a theory of classification of generative dynamical models based on the supervised learning of optimal metrics for most classes of such models;
3 - formulating at the same time discriminative models able to describe the structure of actions in both their spatial and their temporal aspects, in terms of collections of distinct classifiers for action parts at different levels of granularity, organized in a coherent, structured hierarchy;
4 - instrumental to both approaches, studying new ways of selecting and extracting robust features from video data coming in multiple modalities, both from spatio-temporal volumes and from single frames;
5 - gathering new benchmark testbeds targeting open challenges such as localization, multi-agent recognition and complex activities, to be used to launch grand challenges;
6 - testing transfer learning to map the learnt models to unlabeled datasets concerning entirely different action classes;
7 - overall, delivering a robust, all-purpose gesture/action/activity localization and recognition working prototype, outperforming current state-of-the-art algorithms on both existing public datasets and proprietary ones.

2.2.4 - Timeliness. Current market conditions and societal and technological developments are very favourable to the application of the novel methodologies we propose to develop here to the various scenarios discussed. The huge interest of companies, governments and ordinary people in natural human-computer interaction, improved security levels and more "fun" entertainment systems will ensure that a positive scientific outcome of the project enjoys rapid diffusion and is, possibly, commercialised via focussed industrial partnerships.

2.2.5 - Milestones. The project's objectives are mirrored by the following verifiable, measurable milestones:
M1 Supervised metric learning for generative models (month 18), measured in terms of a library of Matlab/C++ routines validated on benchmark data, and top vision publications.
M2 Classification of generative models (month 30), measured against a running library of Matlab/C++ routines, available to the public, and high-impact publications.
M3 Learning of discriminative action parts (month 18), measured by software libraries validated on all benchmarks with state-of-the-art results, and top publications.
M4 Learning, classification and transfer of structured discriminative models (month 30), verified as above.
M5 Proprietary databases in different modalities, such as monocular, synchronized and range cameras, made public via a dedicated web site (month 12).
M6 Feature selection and extraction in all modalities (month 18), implemented by a library of Matlab/C++ routines, tested in all scenarios with state-of-the-art results.
M7 All-purpose action and activity localization and recognition framework (month 36), achieving state-of-the-art results in all scenarios on all the public and proprietary databases, implemented as a set of toolboxes which will be made public to maximize impact.

2.3 Programme and methodology

2.3.1 - Methodology.

Feature selection and extraction. Prior to modeling actions, the video streams have to be processed to extract the most discriminative information: such "features" can be extracted frame by frame, or from the entire spatio-temporal volume which contains the action(s) of interest. Setting up an efficient feature selection/extraction framework is a crucial step of this project. A plethora of local video descriptors has been proposed for space-time volumes, mainly derived from their 2D counterparts: Cuboid, 3D-SIFT [49], HoG-HoF [36], HOG3D [31], extended SURF. More recently, Wang et al. [55] have proposed Dense Trajectory Features, a combination of HoG-HoF with optical flow vectors and motion boundary histograms. These dense features, combined with the standard BoF pipeline, have been shown to outperform the most recent state of the art [38] on challenging datasets. They are therefore strong initial candidate descriptors for space-time video blocks. An appealing alternative to conventional video data is provided by "range" or time-of-flight cameras (whose most popular application is the Kinect console), which greatly facilitate depth estimation. The study of feature extraction from range images, and the fusion of range and standard images, will be integral parts of this project.
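For concreteness, the minimal sketch below illustrates the standard BoF pipeline mentioned above (vocabulary learning plus histogram encoding) on placeholder local descriptors; it assumes scikit-learn, and the descriptor extraction step is stubbed out, since the actual Dense Trajectory Features of [55] require a dedicated implementation.

import numpy as np
from sklearn.cluster import KMeans

def extract_local_descriptors(video):
    """Stub: in the real pipeline this would return HoG/HoF or Dense
    Trajectory descriptors for one video, shape (n_points, descr_dim)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((500, 96))

videos = [None] * 20                      # placeholders for real videos
descs = [extract_local_descriptors(v) for v in videos]

# 1. Learn a visual vocabulary by clustering all training descriptors.
vocab = KMeans(n_clusters=64, n_init=5).fit(np.vstack(descs))

# 2. Encode each video as a normalized histogram of visual-word counts.
def bof_histogram(d, vocab):
    words = vocab.predict(d)
    h = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return h / h.sum()                    # L1-normalize

X = np.array([bof_histogram(d, vocab) for d in descs])
# X can now be fed to any classifier (e.g. an SVM with a chi2 kernel).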

First pipeline: metric learning for generative models. In order to flexibly classify generative models encoding human action videos, we propose a framework, derived from differential geometry, for the supervised learning of metrics for such models. Suppose a dataset of N videos is provided, and a feature vector is extracted from each image, so that each video is represented as a sequence of feature vectors. Choose a class of generative dynamical models, and suppose an identification algorithm is available which can identify the parameters of the model that best fits an input feature sequence, so that the training videos are mapped to a dataset D = {m_1, ..., m_N} of models.


Pullback formalism. Suppose our models live on a Riemannian manifold M [27], on which a Riemannian metric g is defined at every point m ∈ M. Any automorphism (invertible differentiable map) of M onto itself, F : M → M (Figure 4), induces a "pullback" metric¹ on M. A pullback distance between two dynamical models on M can be computed along the associated pullback geodesic (the shortest path). By defining a parameterized family of such automorphisms we obtain a family of pullback metrics on M, over which we can optimize to select the "best" metric.
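Making the footnote's definition explicit in formulas (the notation is ours; u, v are tangent vectors at m and dF_m is the differential of F):

\[ g^F_m(u, v) \;=\; g_{F(m)}\big( dF_m(u),\, dF_m(v) \big), \qquad u, v \in T_m\mathcal{M}, \]

so that the corresponding pullback distance between two models m_1, m_2 is the length of the shortest path under g^F:

\[ d^F(m_1, m_2) \;=\; \inf_{\gamma:\, \gamma(0)=m_1,\ \gamma(1)=m_2} \int_0^1 \sqrt{ g^F_{\gamma(t)}\big(\dot{\gamma}(t), \dot{\gamma}(t)\big) }\, dt . \]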

Learning pullback distances for dynamical models. The framework we propose for learning optimal distances for the classification of generative dynamical models, given a training set of videos, is articulated into the following steps:

1. each training video of variable length is mapped to a feature sequence, from which a dynamical model of a certain class C is estimated by parameter identification, yielding a training set of models D = {m_1, ..., m_N};

2. such models belong to a certain domain M: to measure distances on M we need either a distance function d_M or a Riemannian metric g_M;

3. a family {F_λ, λ ∈ Λ} of automorphisms of M (parameterized by a vector λ) is designed to provide a search space of metrics/distances;

4. F_λ induces a family of pullback metrics {g^λ_M, λ ∈ Λ} or distances {d^λ_M, λ ∈ Λ} on M, respectively;

5. optimizing over this family of pullback distances/metrics (according to some sensible objective function) yields an optimal pullback metric g or distance function d. The latter can finally be used to cluster or classify new "test" sequences, using any state-of-the-art classification tool.
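The sketch below is a toy instantiation of steps 1-5, under strong simplifying assumptions introduced purely for illustration: each model m_i is summarized by a parameter vector, the automorphism family is a coordinate-wise scaling F_λ(x) = λ ⊙ x (so the pullback distance reduces to a weighted Euclidean one), and λ is selected by cross-validated nearest-neighbour accuracy, as proposed for the labeled case.

import numpy as np
from itertools import product
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Step 1 (assumed done): each video has been mapped to a model,
# summarized here by a hypothetical parameter vector per video.
rng = np.random.default_rng(1)
models = rng.standard_normal((60, 4))        # D = {m_1, ..., m_N}
labels = rng.integers(0, 3, size=60)         # action class of each video

# Steps 3-4: F_lam(x) = lam * x is a (very simple) automorphism of R^d;
# the pullback of the Euclidean metric under it yields the weighted metric
# d_lam(m, m')^2 = sum_k lam_k^2 (m_k - m'_k)^2, realized by rescaling.
def pullback_embed(X, lam):
    return X * lam                           # Euclidean distances here
                                             # equal pullback distances

# Step 5: grid-search lam, scoring each candidate metric by
# cross-validated nearest-neighbour accuracy (the "classification
# performance" objective of the labeled case).
best_lam, best_acc = None, -np.inf
for lam in product([0.25, 1.0, 4.0], repeat=models.shape[1]):
    lam = np.array(lam)
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                          pullback_embed(models, lam), labels, cv=5).mean()
    if acc > best_acc:
        best_lam, best_acc = lam, acc
# best_lam now defines the most discriminative pullback distance found.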

Rationale of the pullback mechanism. Imagine M is a simplex, as is often the case for families of probability distributions [4]. Any automorphism which is a linear combination of the original coordinates maps linear decision boundaries (red line, Figure 5) to different linear boundaries (green line). Under more general mappings the new decision boundary will in general be curved (blue curve). Designing families of automorphisms therefore amounts to designing families of non-linear decision boundaries to apply to the data to classify.

Figure 5: Pullback metrics as learning decision boundaries.

Pushing the boundaries. One of us has recently investigated the use of pullback metrics to classify simple scalar autoregressive models and hidden Markov models [12]: more complex classes of models, such as hierarchical HMMs or MRFs (and their manifold structure), need to be addressed. When the dataset of models is labeled, we can determine the most discriminative metric/distance function by maximizing the classification performance of the metric; the issue of how to optimize this quantity in closed form needs to be addressed as well. When the training set is unlabeled, we need to resort to the minimization of a purely geometric quantity.

¹ Such that the scalar product of two tangent vectors u, v at m ∈ M according to the pullback metric g* is the scalar product, with respect to the original metric g, of their image vectors at F(m).

In addition, as it makes use of geodesic distances (see [35], paragraphs 3.1 and 3.2), the pullback mechanism can be plugged into the diffusion kernel framework [35] to generate entire families of geodesic distances, whose diffusion Mercer kernels can later be computed and used in an SVM.
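As a rough sketch of this last step (with one assumption worth flagging: we replace the actual diffusion kernel of [35] with the common Gaussian surrogate K = exp(-d²/4t), which is only guaranteed to be a valid Mercer kernel for certain distance functions), pairwise pullback distances could feed an SVM as follows:

import numpy as np
from sklearn.svm import SVC

# D_train: hypothetical (N, N) matrix of pullback geodesic distances
# between training models; y: their action labels. Here we fabricate
# Euclidean distances between random points as a stand-in.
rng = np.random.default_rng(2)
pts = rng.standard_normal((60, 4))
D_train = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
y = rng.integers(0, 3, size=60)

t = 0.5                                    # diffusion time (arbitrary)
K_train = np.exp(-D_train ** 2 / (4 * t))  # heat-kernel-style surrogate

clf = SVC(kernel="precomputed").fit(K_train, y)
# At test time, K_test[i, j] = k(test model i, training model j),
# computed from the same pullback distances, is passed to clf.predict.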

In the second pipeline we propose to learn spatio-temporal discriminative models in a weakly supervised setting, taking action localization to massive datasets in which manual annotation is basically unfeasible [53].

Figure 6: The "discriminative" pipeline. Top: in Multiple Instance Learning, positive bags (sequences) are those which contain at least one positive example (left, in blue). In our case, the examples of a bag (video sequence) are all its spatio-temporal subvolumes (right, in various colors). Bottom: a human action can be represented as a "star model" of elementary BoF action parts (learnt via MIL).

Multiple instance learning of discriminative action parts. The first step consists in learning discriminative models for the elementary action parts that compose the activity to describe/recognize [26]. Let an action part be defined as a BoF model bounded in a space-time cube (see Figure 6). The task can be cast in a Multiple Instance Learning (MIL)/Latent Support Vector Machine framework [20], in which the training set consists of a set of "bags" (the training sequences), each containing a number of BoF models (or "examples": in our case, SVM classifiers learnt for each subvolume of the spatio-temporal sequence), together with the corresponding ground-truth class labels. An initial "positive" model is learned by assuming that all examples in the positive bags (all the subvolumes of the sequence) do contain the action at hand; a "negative" model is learned from the examples in the negative bags (videos labeled with a different action category, Figure 6 again). The initial models are updated in an iterative process: eventually, only the most discriminative examples in each positive bag are retained as positive. MIL reduces to a semi-convex optimisation problem, for which efficient heuristic approaches exist [5].
The bottom line of this approach is being able to factor out the effect of common, shared context (similar objects or background, common action elements). Very promising preliminary results have recently been obtained by us [46], indicating that learning discriminative action parts can both enable effective localization (Figure 7) and significantly improve classification results w.r.t. standard BoF models (Table 1).
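The alternation at the heart of this MIL scheme can be sketched as follows; this is a simplified illustration of the idea, assuming linear SVMs over hypothetical subvolume feature vectors rather than the actual latent SVM formulation of [20].

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical data: each bag is an (n_subvolumes, d) array of BoF
# descriptors for the subvolumes of one video; bag_y is its class label
# (+1: contains the action, -1: does not).
rng = np.random.default_rng(3)
bags = [rng.standard_normal((40, 32)) for _ in range(30)]
bag_y = rng.choice([-1, 1], size=30)

# Initialization: every subvolume inherits its bag's label.
X = np.vstack(bags)
y = np.concatenate([np.full(len(b), lab) for b, lab in zip(bags, bag_y)])

for _ in range(5):                         # a few alternation rounds
    clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
    # Re-estimate latent labels: in each positive bag, keep only the
    # top-scoring subvolumes as positives; negatives stay negative.
    y_new = []
    for b, lab in zip(bags, bag_y):
        s = clf.decision_function(b)
        if lab > 0:
            keep = s >= np.quantile(s, 0.75)   # ad hoc retention rule
            y_new.append(np.where(keep, 1, -1))
        else:
            y_new.append(np.full(len(b), -1))
    y = np.concatenate(y_new)
# clf now scores subvolumes: high scores localize the action (cf. Fig. 7).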



Dataset       |        KTH          |      YouTube        |       HOHA2
Perf. measure | mAcc   mAP    mF1   | mAcc   mAP    mF1   | mAcc   mAP    mF1
BoF           | 95.37  96.48  93.99 | 76.03  79.33  57.54 | 39.04  48.73  32.04
MIL-BoF       | 96.76  97.02  96.04 | 80.39  86.10  77.35 | 39.63  44.18  39.42

Table 1: Results in terms of mean accuracy (mAcc), mean average precision (mAP) and mean F1 score (mF1) for the BoF and MIL-BoF methods on common testbeds.

Learning and classifying structured discriminative models of actions. Once the most discriminative action parts are learnt via MIL, we can construct tree-like ensembles of action parts (as in [22], Figure 6) to use for both localization and classification of actions. Felzenszwalb and Huttenlocher [19] have shown (in the object detection problem) that if the pictorial structure forms a star model, where each part is only connected to the root node, it is possible to compute the best match very efficiently by dynamic programming. A cost function can be defined as a function of both the (BoF) appearance models of the individual parts and of the relative positions between pairs of action parts, whose maximization yields the best action configuration [19]. Other approaches to building a constellation of discriminative parts have been proposed by Hoiem [18] and Ramanan [59]. Eventually, transfer learning techniques tested for object detection [44] can be adapted to the transfer of action categories to unlabeled datasets.

Figure 7: Preliminary localisation results on a challenging video from the Hollywood2 dataset [46]. The colour of each box indicates the positive rank score of the subvolume belonging to a particular class (red = high). In actioncliptest00058, a woman gets out of her car roughly around the middle of the video, as indicated by the detected subvolumes.
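To make the star-model cost concrete, in a notation we adopt here purely for illustration (in the spirit of [19], with l_i the space-time placement of part i):

\[ S(l_0, l_1, \dots, l_n) \;=\; \sum_{i=0}^{n} a_i(l_i) \;-\; \sum_{i=1}^{n} d_i(l_i, l_0), \]

where a_i scores the (BoF) appearance of part i at placement l_i and d_i penalizes its displacement relative to the root l_0. Because each part interacts only with the root, maximization over l_1, ..., l_n decomposes into independent per-part maximizations (computable, e.g., via distance transforms), which is what makes the dynamic-programming matching efficient.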

Validation on action, activity and gait datasets. A comprehensive validation stage, in which the entire prototype in all its components is tested on state-of-the-art action, gesture and activity recognition/localization databases, will be crucial. The currently available datasets (ViHasi, LSCOM, IRISA, CMU RADAR, Daily Living) can be classified in terms of number of action categories, size, nuisance factors considered, and level of complexity of the motions captured (Table 2). New testbeds have recently been acquired in order to tackle the problem of recognizing actions "in the wild", such as the YouTube [40] and Hollywood [36] datasets. Localization, multiple actions and actors, and the presence of complex activities, however, are still rather neglected.

Dataset     KTH     Weiz    Sports  HOHA1   YouTube  HOHA2   UCF50   HMDB
Year        04      05      08      08      09       09      10      11
# actions   6       10      10      8       11       12      50      51
Rec. rate   94.5%   100%    88.2%   53.5%   84.2%    58.3%   47.9%   23.18%

Table 2: State-of-the-art recognition rates go down as more challenging datasets involving more action classes are introduced.

We will therefore build new benchmarks focussed on those challenges: a new state-of-the-art activity dataset is likely to spur a host of subsequent research worldwide. We will consider both traditional and range cameras; we are not aware of any available action datasets based on range data.

2.3.2 - Programme of work. The work will be mainly divided between the two Research Assistants, each in charge of a research pipeline. The P.I. and the Co-Investigator will ensure the mentoring and supervision of the R.A.s, and actively contribute to the most critical parts of the workplan.

Pipeline 1. The first year of the project for the generative approach will be devoted to:
- the study of feature selection and extraction from single frames, to feed to identification algorithms for generative models;
- the study of dynamical models suitable for elementary action recognition, and of their manifolds;
- the implementation of the pullback framework for elementary action recognition in the labeled case, initially using cross-validation to optimize classification performance;
- an initial testing of the framework on public datasets.

In year 2 we will focus on:
- the implementation of the pullback framework for elementary action recognition in the unlabeled case, and the investigation of feasible objective functions to minimize;
- a closed-form formulation of classification performance as the objective function to optimize in the labeled case;
- collecting a number of proprietary datasets using monocular and stereo video, to address the failures of current testbeds.

The main activities of year 3 will be:
- the testing of the closed-form framework on the new datasets;
- the study of more sophisticated classes of dynamical models (such as hierarchical HMMs and MRFs) to use for complex activity recognition, and of the associated manifolds;
- the study of the integration of pullback learning with heat kernels for SVM classification of generative models.

Pipeline 2. The first year of the project's discriminative side will instead be devoted to:
- the implementation of all the current state-of-the-art feature selection and extraction approaches for S/T volumes;
- a first implementation of Multiple Instance Learning of discriminative action parts of fixed shape and scale;
- a testing of the part-based discriminative approach on action recognition testbeds.

During year 2 we will:
- investigate the use of Fisher vector-based features [28] to reduce the computational complexity of action part selection, in order to effectively tackle large-scale problems (a sketch of Fisher vector encoding follows below);
- learn via MIL multiple-scale and multi-shape action part models, thanks to Fisher vectors;
- develop structured ensembles of discriminative action parts, building on approaches developed for object detection;
- work on estimating the "correct" number of parts, and on comparing constellations composed of different numbers of parts;
- deploy the part-based discriminative approach for action localization with multiple actions/actors.
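The following is a minimal sketch of Fisher vector encoding restricted to the gradient with respect to the GMM means (one common variant; a full encoding, as in [28], also includes variance terms and normalization); the GMM size and the data are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

# Fit a small GMM "vocabulary" on local descriptors pooled over
# the training set (dimensions are illustrative).
rng = np.random.default_rng(4)
train_desc = rng.standard_normal((5000, 16))
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(train_desc)

def fisher_vector_means(desc, gmm):
    """Gradient of the log-likelihood w.r.t. the GMM means, the core of
    the Fisher vector: size K*d instead of a K-bin BoF histogram."""
    gamma = gmm.predict_proba(desc)                # (n, K) posteriors
    sigma = np.sqrt(gmm.covariances_)              # (K, d) diag std devs
    n = len(desc)
    fv = np.empty((gmm.n_components, desc.shape[1]))
    for k in range(gmm.n_components):
        diff = (desc - gmm.means_[k]) / sigma[k]
        fv[k] = gamma[:, k] @ diff / (n * np.sqrt(gmm.weights_[k]))
    return fv.ravel()

video_desc = rng.standard_normal((400, 16))        # one video's descriptors
fv = fisher_vector_means(video_desc, gmm)          # fixed-length encoding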

Finally, in year 3 our tasks will be:
- the collection of proprietary human action datasets based on range data for indoor scenarios (e.g. gaming);
- the design of feature extraction strategies for range data;
- the generalization of the model to action parts not constrained to fit within rectangular boxes (e.g. action "tubes");
- the testing of transfer learning techniques for application to unrelated, unlabeled datasets.
More details are available in the Workplan.

2.4 Relevance to academic beneficiaries

We foresee a significant impact on the ever-expanding community of human action and activity recognition, plus methodological consequences in a number of fields of applied science.

Impact on action recognition. The framework we propose aims at exploiting the full classification potential of complex dynamical graphical models. As such, this study could lead to much subsequent research in the field along these lines. The same holds for the novel, structured discriminative models we propose, of which we have great expectations: we believe they could really revolutionise the field. Our preliminary results have already earned a prize at the latest INRIA Machine Learning Summer School [46].

Impact on identity recognition/crowd analysis. As both dynamical generative and discriminative models are also suitable for identity recognition from gait and for crowd activity recognition, a successful outcome of this project could have significant impact on these related fields too. Indeed, the developed techniques can be applied to any classification problem involving complex, structured objects living in a metric space.

Impact on manifold learning. More importantly, though, both the proposed non-linear manifold learning scheme and the development of a principled way of dealing with structured collections of discriminative models constitute methodological advances in the wider field of machine learning, with consequences that are not easy to predict at this time.

References
[1] J. K. Aggarwal and M. S. Ryoo, Human activity analysis: a review, ACM Computing Surveys 43 (2011), no. 3.
[2] K. Alahari, P. Kohli, and P.H.S. Torr, Reduce, reuse and recycle: efficiently solving multi-label MRFs, CVPR'08.
[3] S. Ali, A. Basharat, and M. Shah, Chaotic invariants for human action recognition, ICCV'07, pp. 1-8.
[4] S.-I. Amari, Differential geometric methods in statistics, Springer, 1985.
[5] S. Andrews, I. Tsochantaridis and T. Hofmann, Support vector machines for multiple-instance learning, NIPS'03, pp. 561-568.
[6] G. BakIr, B. Taskar, T. Hofmann, B. Schölkopf, A. Smola, and S.V.N. Vishwanathan, Predicting structured data, MIT Press, 2007.
[7] M. Bilenko, S. Basu, and R.J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, ICML'04.
[8] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, Actions as space-time shapes, ICCV'05, pp. 1395-1402.
[9] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal, Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions, CVPR'09, pp. 1932-1939.
[10] F. Cuzzolin, Using bilinear models for view-invariant action and identity recognition, CVPR'06, vol. 2, pp. 1701-1708.
[11] F. Cuzzolin, Multilinear modeling for robust identity recognition from gait, Behavioral Biometrics for Human Identification, IGI, 2009.
[12] F. Cuzzolin, Learning pullback manifolds of generative dynamical models for action recognition, IEEE Trans. PAMI (2012, under review).
[13] F. Cuzzolin, D. Mateus and R. Horaud, Robust coherent Laplacian protrusion segmentation along 3D sequences, IJCV (2012, under review).
[14] F. Cuzzolin, D. Mateus, D. Knossow, E. Boyer, and R. Horaud, Coherent Laplacian protrusion segmentation, CVPR'08, pp. 1-8.
[15] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR'05, pp. 886-893.
[16] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, Automatic annotation of human actions in video, ICCV'09.
[17] R. Elliott, L. Aggoun, and J. Moore, Hidden Markov models: estimation and control, Springer-Verlag, 1995.
[18] I. Endres, V. Srikumar, M.-W. Chang and D. Hoiem, Learning shared body plans, CVPR'12.
[19] P. Felzenszwalb and D. Huttenlocher, Pictorial structures for object recognition, Int. Journal of Computer Vision 61 (2005).
[20] P. Felzenszwalb, R. Girshick, D. McAllester and D. Ramanan, Object detection with discriminatively trained part-based models, PAMI 32 (2010), 1627-1645.
[21] S. Fine, Y. Singer, and N. Tishby, The hierarchical hidden Markov model: analysis and applications, Mach. Learn. 32 (1998), 41-62.
[22] M. Fischler and R. Elschlager, The representation and matching of pictorial structures, IEEE Trans. Computers 22 (1973), 67-92.
[23] A. Gupta, P. Srinivasan, J. Shi, and L.S. Davis, Understanding videos, constructing plots: learning a visually grounded storyline model from annotated videos, CVPR'09, pp. 2012-2019.
[24] D. Han, L. Bo and C. Sminchisescu, Selection and context for action recognition, ICCV'09.
[25] A. van den Hengel, A. Dick, T. Thormählen, B. Ward, and P.H.S. Torr, VideoTrace: rapid interactive scene modelling from video, ACM Transactions on Graphics (SIGGRAPH special issue) 26 (2007), no. 3.
[26] Y. Hu, L. Cao, F. Lv, S. Yan, Y. Gong and T. S. Huang, Action detection in complex scenes with spatial and temporal ambiguities, ICCV'09.
[27] M. Itoh and Y. Shishido, Fisher information metric and Poisson kernels, Differential Geometry and its Applications 26 (2008), no. 4, 347-356.
[28] H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez and C. Schmid, Aggregating local image descriptors into compact codes, PAMI (2012).
[29] G. Johansson, Visual perception of biological motion and a model for its analysis, Perception & Psychophysics 14 (1973), 201-211.
[30] T.K. Kim and R. Cipolla, Canonical correlation analysis of video volume tensors for action categorization and detection, PAMI 31 (2009), no. 8.
[31] A. Kläser, M. Marszałek and C. Schmid, A spatio-temporal descriptor based on 3D-gradients, BMVC'08.
[32] P. Kohli, M.P. Kumar, and P.H.S. Torr, P³ and beyond: solving energies with higher order cliques, CVPR'07.
[33] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre, HMDB: a large video database for human motion recognition, ICCV'11.
[34] M.P. Kumar, V. Kolmogorov and P.H.S. Torr, An analysis of convex relaxations for MAP estimation, NIPS'07.
[35] J. D. Lafferty and G. Lebanon, Diffusion kernels on statistical manifolds, Journal of Machine Learning Research 6 (2005), 129-163.
[36] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, Learning realistic human actions from movies, CVPR'08.
[37] I. Laptev and P. Pérez, Retrieving actions in movies, ICCV'07.
[38] Q. V. Le, W. Y. Zou, S. Y. Yeung and A. Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis, CVPR'11, pp. 3361-3368.
[39] G. Lebanon, Metric learning for text documents, PAMI 28 (2006), no. 4.
[40] J.G. Liu, J.B. Luo, and M. Shah, Recognizing realistic actions from videos "in the wild", ICCV'09, pp. 1996-2003.
[41] M. Marszałek, I. Laptev, and C. Schmid, Actions in context, CVPR'09.
[42] D. Mateus, R. Horaud, D. Knossow, F. Cuzzolin, and E. Boyer, Articulated shape matching using Laplacian eigenfunctions and unsupervised point registration, CVPR'08.
[43] B. North, A. Blake, M. Isard, and J. Rittscher, Learning and classification of complex dynamics, PAMI 22 (2000), no. 9, 1016-1034.
[44] R. Raina, A. Battle, H. Lee, B. Packer and A.Y. Ng, Self-taught learning: transfer learning from unlabeled data, ICML'07.
[45] K. K. Reddy, J. Liu, and M. Shah, Incremental action recognition using feature-tree, ICCV'09.
[46] M. Sapienza, F. Cuzzolin and P.H.S. Torr, Learning discriminative space-time actions from weakly labelled videos, BMVC'12.
[47] K. Schindler and L. van Gool, Action snippets: how many frames does human action recognition require?, CVPR'08.
[48] M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, NIPS'04.
[49] P. Scovanner, S. Ali and M. Shah, A 3-dimensional SIFT descriptor and its application to action recognition, ACM Multimedia, 2007.
[50] Q.F. Shi, L. Wang, L. Cheng, and A. Smola, Discriminative human action segmentation and recognition using semi-Markov model, CVPR'08.
[51] A. J. Smola and S. V. N. Vishwanathan, Hilbert space embeddings in dynamical systems, IFAC'03, pp. 760-767.
[52] P.H.S. Torr, L. Ladický and P. Kohli, Robust higher order potentials for enforcing label consistency, CVPR'08.
[53] P. Viola, J. Platt and C. Zhang, Multiple instance boosting for object detection, NIPS'05.
[54] J. M. Wang, D. J. Fleet, and A. Hertzmann, Gaussian process dynamical models, NIPS'06, vol. 18, pp. 1441-1448.
[55] H. Wang, A. Kläser, C. Schmid and C.L. Liu, Action recognition by dense trajectories, CVPR'11, pp. 3169-3176.
[56] Y. Wang and G. Mori, Max-margin hidden conditional random fields for human action recognition, CVPR'09, pp. 872-879.
[57] O.J. Woodford, P.H.S. Torr, I.D. Reid and A.W. Fitzgibbon, Global stereo reconstruction under second order smoothness priors, CVPR'08.
[58] E.P. Xing, A.Y. Ng, M.I. Jordan, and S. Russell, Distance metric learning with applications to clustering with side information, NIPS'03.
[59] Y. Yang and D. Ramanan, Articulated pose estimation using flexible mixtures of parts, CVPR'11.
[60] J.S. Yuan, Z.C. Liu, and Y. Wu, Discriminative subvolume search for efficient action detection, CVPR'09, pp. 2442-2449.