A Multiple Kernel Learning Based Fusion Framework for Real-Time Multi-View Action Recognition


Feng Gu, Francisco Flórez-Revuelta, Dorothy Monekosso and Paolo Remagnino

Digital Imaging Research Centre, Kingston University, London, UK

December 3rd, 2014


Outline

1 Introduction

2 Framework Overview

3 Experimental Conditions

4 Results and Analysis

5 Conclusions and Future Work


Background and Motivations

Real-time multi-view action recognition:

Has gained increasing interest in video surveillance, human-computer interaction, multimedia retrieval, etc.

Provides complementary fields of view (FOVs) of a monitored scene via multiple cameras

Leads to more robust decision making based on multiple heterogeneous video streams

Real-time capability enables continuous long-term monitoring

Where possible, multiple cameras should be deployed to monitor human behaviour, so that data fusion techniques can be applied.


Illustration of the Monitored Scenario

[Figure: the monitored scene observed by four cameras, C1–C4.]


Motion-Based Person Detector

We use a state-of-the-art motion-based tracker [6]:

Each pixel modelled as a mixture of Gaussians in RGB space

Background model to find foreground pixels in a new frame

Foreground pixels grouped to form large regions associated with the person of interest

Kalman filters used to track foreground detections

Person detections generated for every frame
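As a rough illustration of this pipeline, the sketch below combines OpenCV's MOG2 background subtractor with a constant-velocity Kalman filter on the detection centroid (assuming OpenCV 4). It is a stand-in for the tracker of [6]: the input filename, thresholds and single-target logic are all hypothetical, not the authors' implementation.

```python
import cv2
import numpy as np

# Per-pixel mixture-of-Gaussians background model
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=True)

# Constant-velocity Kalman filter on the person's centroid: state (x, y, dx, dy)
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
kf.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)

cap = cv2.VideoCapture("view_cam0.avi")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg = bg_model.apply(frame)                                   # foreground mask
    fg = cv2.morphologyEx(fg, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(fg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        # keep the largest foreground region as the person of interest
        x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
        kf.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))
    cx, cy = kf.predict()[:2].ravel()                            # tracked centroid per frame
```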


Feature Representation of Videos

STIP and improved dense trajectories (IDT) [7] used as local descriptors to extract visual features from a video

Person detections and frame spans used to define an XYT cuboid associated with an action performed by the monitored person

Bag of words (BOWs) applied to compute the feature vector of a cuboid, with K-means clustering used to generate the codebook (a minimal sketch follows)
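A minimal bag-of-words sketch, assuming scikit-learn and that the local descriptors (STIP or IDT) have already been extracted by external tools; function names and the default codebook size are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(train_descriptors, k=4000):
    """Quantise a sample of local descriptors (e.g. STIP/IDT) into k visual words."""
    return KMeans(n_clusters=k, n_init=1, random_state=0).fit(train_descriptors)

def bow_histogram(codebook, cuboid_descriptors):
    """L1-normalised visual-word histogram for one XYT cuboid."""
    words = codebook.predict(cuboid_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```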


Discriminative Models for Classification

Let $x_i^k \in \mathbb{R}^D$, where $i \in \{1, 2, \ldots, N\}$ is the index of a feature vector corresponding to an XYT cuboid and $k \in \{1, 2, \ldots, K\}$ is the index of a camera view. We learn an SVM classifier as

$$f(x) = \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x) + b \qquad (1)$$

We then compute a classification score via a sigmoid function as

$$p(y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))} \qquad (2)$$
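A compact sketch of this per-view classifier, assuming scikit-learn; the kernel here is the library default purely for brevity (the experiments use a chi-square kernel, as noted under the implementation details).

```python
import numpy as np
from sklearn.svm import SVC

def view_scores(X_train, y_train, X_test):
    """Train a per-view SVM (eq. 1) and return sigmoid scores p(y=1|x) as in eq. (2)."""
    clf = SVC().fit(X_train, y_train)   # default RBF kernel; the paper uses chi-square
    return 1.0 / (1.0 + np.exp(-clf.decision_function(X_test)))
```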


Simple Fusion Strategies

Concatenation of Features: concatenate the feature vectors of multiple views into a single feature vector such that $x_i = [x_i^1, \ldots, x_i^K]$

Sum of Classification Scores: compute a classification score $p(y = 1 \mid x^k)$ for each camera view as in (2), then average them as $\frac{1}{K} \sum_{k=1}^{K} p(y = 1 \mid x^k)$

Product of Classification Scores: apply the product rule to the classification scores of all the camera views as $\prod_{k=1}^{K} p(y = 1 \mid x^k)$

All three strategies are sketched below.
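The three strategies take only a few lines of NumPy; this is a sketch under the assumption that per-view BoW vectors and per-view scores are already available.

```python
import numpy as np

def fuse_features(view_features):
    """Feature-level fusion: concatenate the K per-view BoW vectors."""
    return np.concatenate(view_features, axis=-1)

def fuse_scores(view_scores, rule="sum"):
    """Score-level fusion over per-view scores p(y=1|x^k), shaped (K views, N samples)."""
    view_scores = np.asarray(view_scores)
    if rule == "sum":        # average of the per-view scores
        return view_scores.mean(axis=0)
    if rule == "product":    # product rule over the views
        return view_scores.prod(axis=0)
    raise ValueError(rule)
```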


Multiple Kernel Learning

Multiple kernels corresponding to different data sources (e.g. camera views) are combined via a convex combination such as

$$\mathcal{K}(x_i, x_j) = \sum_{k=1}^{K} \beta_k \, k_k(x_i, x_j) \qquad (3)$$

where $\beta_k \geq 0$, $\sum_{k=1}^{K} \beta_k = 1$, and each kernel $k_k$ only uses a distinct set of features from one data source.
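A small sketch of eq. (3) with chi-square base kernels, assuming scikit-learn; the weights $\beta_k$ are taken as given here (learning them is the subject of the next slide).

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel

def combined_kernel(view_feature_sets, betas, gamma=1.0):
    """Convex combination of per-view chi-square kernels (eq. 3)."""
    betas = np.asarray(betas, dtype=float)
    assert np.all(betas >= 0) and np.isclose(betas.sum(), 1.0)
    return sum(b * chi2_kernel(X, gamma=gamma)
               for b, X in zip(betas, view_feature_sets))
```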


Two-Step Optimisation

We need to learn the SVM parameters, i.e. the weights $\alpha$ and bias $b$, together with the combination parameters $\beta_k$ in (3). This is solved by alternating optimisation:

Step 1: optimise over the SVM parameters $\alpha$ and $b$ while fixing the combination parameters $\beta_k$ (quadratic programming)

Step 2: optimise over the combination parameters $\beta_k$ while fixing the SVM parameters $\alpha$ and $b$ (gradient descent)

The two steps alternate iteratively until the system converges to an optimal solution (a minimal sketch follows).
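The sketch below illustrates the alternation in the spirit of SimpleMKL-style solvers, for the binary case only, assuming scikit-learn's SVC for the QP step; the fixed learning rate and the crude simplex projection are simplifications, not the authors' exact solver.

```python
import numpy as np
from sklearn.svm import SVC

def two_step_mkl(kernels, y, n_iter=20, lr=0.5):
    """Alternate an SVM step (QP over alpha, b) with a gradient step on beta.

    kernels: list of K precomputed (n x n) per-view kernel matrices.
    Returns the fitted SVM and the kernel weights beta.
    """
    K_views = np.stack(kernels)                        # shape (K, n, n)
    beta = np.full(len(kernels), 1.0 / len(kernels))   # uniform initial weights
    for _ in range(n_iter):
        K = np.tensordot(beta, K_views, axes=1)        # combined kernel, eq. (3)
        clf = SVC(kernel="precomputed", C=1.0).fit(K, y)   # step 1: QP over alpha and b
        sv, ay = clf.support_, clf.dual_coef_[0]       # alpha_i * y_i on support vectors
        # step 2: gradient of the dual objective with respect to each beta_k
        grad = np.array([-0.5 * ay @ K_k[np.ix_(sv, sv)] @ ay for K_k in K_views])
        beta = np.clip(beta - lr * grad, 0.0, None)
        beta /= beta.sum()                             # crude projection onto the simplex
    return clf, beta
```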


IXMAS Multi-View Dataset

Created for view-invariant human action recognition [8]

Includes 13 daily actions, each performed 3 times by 12 actors

Video sequences collected from 5 cameras, at 23 frames per second and 390 × 291 resolution

We use all 12 actors and 5 cameras, and evaluate 11 actions as in [9]

Leave-one-subject-out cross-validation used in the experiments (folds sketched below)
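Leave-one-subject-out folds can be generated with scikit-learn's LeaveOneGroupOut, using actor identity as the group label; this is a small illustrative sketch, not part of the original pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

def loso_folds(n_samples, actor_ids):
    """Leave-one-subject-out folds: each fold holds out all clips of one actor."""
    logo = LeaveOneGroupOut()
    X_dummy = np.zeros((n_samples, 1))   # only the grouping matters here
    return list(logo.split(X_dummy, groups=actor_ids))
```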


Implementation Details

A codebook of size 4,000, quantised from 100,000 randomly selected descriptor features of the training set

The STIP descriptor uses the entire image plane and the frame span of an action given in the ground truth to define a cuboid

The IDT descriptor relies on the person detections in addition to the frame span

All the SVM models use $\ell_1$ normalisation and the $\chi^2$ kernel (sketched below)
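A short sketch of the kernel used throughout, assuming scikit-learn: $\ell_1$ normalisation of the BoW histograms followed by the chi-square kernel.

```python
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import chi2_kernel

def l1_chi2_gram(X, Y=None):
    """Chi-square Gram matrix computed on l1-normalised BoW histograms."""
    Xn = normalize(X, norm="l1")
    Yn = None if Y is None else normalize(Y, norm="l1")
    return chi2_kernel(Xn, Yn)
```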


Person Detection Results

Figure: Detection results of the motion-based tracker for the first run of the subject 'Alba', across all five camera views (cam0–cam4).


Results of STIP (Internal Comparison)

[Bar chart: class-wise recognition rates for the 11 actions (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, pick up), comparing SVM-COM, SVM-SUM, SVM-PRD and SVM-MKL.]

Figure: Class-wise mean recognition rates over all folds for the compared methods using the STIP descriptor; the mean rates are 0.819 (SVM-COM), 0.820 (SVM-SUM), 0.815 (SVM-PRD) and 0.842 (SVM-MKL).


Results of IDT (Internal Comparison)

[Bar chart: class-wise recognition rates for the 11 actions (check watch, cross arms, scratch head, sit down, get up, turn around, walk, wave, punch, kick, pick up), comparing SVM-COM, SVM-SUM, SVM-PRD and SVM-MKL.]

Figure: Class-wise mean recognition rates over all folds for the compared methods using the IDT descriptor; the mean rates are 0.915 (SVM-COM), 0.927 (SVM-SUM), 0.921 (SVM-PRD) and 0.950 (SVM-MKL).


Comparison with State-of-the-Art (External Comparison)

Method                  Actions  Actors  Views  Rate   FPS
Cilla et al. [3]        11       12      5      0.913  N/A
Weinland et al. [10]    11       10      5      0.933  N/A
Cilla et al. [4]        11       10      5      0.940  N/A
Holte et al. [5]        13       12      5      1.000  N/A
Weinland et al. [9]     11       10      5      0.835  500
Chaaraoui et al. [1]    11       12      5      0.859  26
Chaaraoui et al. [2]    11       12      5      0.914  207
SVM-MKL (IDT+BOWs)      11       12      5      0.950  25

Table: Comparison of the proposed MKL method using the IDT descriptor and BOWs; methods with 'N/A' in the FPS column are offline.


Conclusions and Future Work

The proposed MKL based framework outperforms both the simple fusion techniques and the state-of-the-art methods

The IDT descriptor is superior to the STIP descriptor for feature representation in action recognition

The proposed framework performs real-time action recognition at 25 frames per second

Future work: apply the framework to other, similar vision problems, and study alternative feature representations and fusion techniques.


Thank you very much! Any questions?


References

[1] A. A. Chaaraoui, P. Climent-Perez, and F. Flórez-Revuelta. Silhouette-based human action recognition using sequences of key poses. Pattern Recognition Letters, 34:1799–1807, 2013.

[2] A. A. Chaaraoui, J. R. Padilla-Lopez, F. J. Ferrandez-Pastor, M. Nieto-Hidalgo, and F. Flórez-Revuelta. A vision-based system for intelligent monitoring: human behaviour analysis and privacy by context. Sensors, 14:8895–8925, 2014.

[3] R. Cilla, M. A. Patricio, and A. Berlanga. A probabilistic, discriminative and distributed system for the recognition of human actions from multiple views. Neurocomputing, 75:78–87, 2012.

[4] R. Cilla, M. A. Patricio, A. Berlanga, and J. M. Molina. Human action recognition with sparse classification and multiple-view learning. Expert Systems, DOI: 10.1111/exsy.12040, 2013.

[5] M. Holte, B. Chakraborty, J. Gonzalez, and T. Moeslund. A local 3-D motion descriptor for multi-view human action recognition from 4-D spatio-temporal interest points. IEEE Journal of Selected Topics in Signal Processing, 6:553–565, 2012.

[6] C. Stauffer and W. Grimson. Learning patterns of activity using real time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8):747–767, 2000.

[7] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), 2009.

[8] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars. In IEEE International Conference on Computer Vision (ICCV), pages 1–7, 2007.

[9] D. Weinland, M. Ozuysal, and P. Fua. Making action recognition robust to occlusions and viewpoint changes. In European Conference on Computer Vision (ECCV), 2010.

[10] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2–3):249–257, 2006.
