
V-JAUNE: A Framework for Joint Action Recognition and Video Summarization

FAIROUZ HUSSEIN and MASSIMO PICCARDI, University of Technology, Sydney

Video summarization and action recognition are two important areas of multimedia video analysis. While these two areas have been tackled separately to date, in this article, we present a latent structural SVM framework to recognize the action and derive the summary of a video in a joint, simultaneous fashion. Efficient inference is provided by a submodular score function that accounts for the action and summary jointly. In this article, we also define a novel measure to evaluate the quality of a predicted video summary against the annotations of multiple annotators. Quantitative and qualitative results over two challenging action datasets—the ACE and MSR DailyActivity3D datasets—show that the proposed joint approach leads to higher action recognition accuracy and equivalent or better summary quality than comparable approaches that perform these tasks separately.

CCS Concepts: • Computing methodologies → Video summarization; Activity recognition and understanding; • Theory of computation → Structured prediction;

Additional Key Words and Phrases: Action recognition, video summarization, latent structural SVM, submodular inference

ACM Reference Format:
Fairouz Hussein and Massimo Piccardi. 2017. V-JAUNE: A framework for joint action recognition and video summarization. ACM Trans. Multimedia Comput. Commun. Appl. 13, 2, Article 20 (April 2017), 19 pages. DOI: http://dx.doi.org/10.1145/3063532

1. INTRODUCTION

The amount of publicly-available video footage is growing at unprecedented rates thanks to the commoditization of video acquisition and the role played by social media. However, video data are typically large in size whereas the events of interest may be concentrated only in small segments. Video summarization has, therefore, become imperative to concisely capture the contents of videos. The main applications of video summaries are indexing, search, and retrieval from video collections and the storyboarding of the videos to end users [Ma et al. 2002; Liu et al. 2010; Cong et al. 2012; Guan et al. 2014]. The basic requirements of an effective video summary are well understood, mainly being the appropriate coverage of the original footage and limited redundancy in the frames selected as the summary.

At the same time, the huge number of videos calls for the automated labeling of their main theme or activity. For instance, in social media it can be helpful to know whether a video depicts activities such as "food preparation" or "conversation in a living room" for categorization and content customization. In addition to social media, automated activity or action recognition is an important component of many other applications such as video surveillance, human-computer interaction, and home intelligence [Wang and Mori 2011; Wang et al. 2011; Wang and Schmid 2013; Yang et al. 2016]. Yet, it remains a challenging task due to the inherent challenges of activity videos that include subject dependence, occlusions of view and viewpoint, and illumination and scale variations.

Authors' addresses: F. Hussein and M. Piccardi, Global Big Data Technologies Centre, Faculty of Engineering and IT, University of Technology Sydney, PO Box 123, Broadway NSW 2007, Australia; emails: [email protected], [email protected].

Given the above, a question that spontaneously arises is: can video summarization and action recognition benefit from being performed jointly? This question can be rephrased as: can action recognition prove more accurate if it is performed based on a selection of the video's frames rather than the entire set? And, simultaneously, can the selected frames enjoy the properties required by an effective summary, i.e., good coverage and limited redundancy? Assuming that this question can be answered in the affirmative, in this article, we set out to investigate the performance of joint action recognition and video summarization.

Inferring an optimal summary is a combinatorially exponential problem and, as such, intractable. However, it has been proved that most functions used to evaluate the quality of a summary are monotonic submodular [Lin and Bilmes 2011; Sipos et al. 2012; Tschiatschek et al. 2014]. The main advantage of these functions is that inexpensive, greedy algorithms can be used to perform approximate inference with performance guarantees [Nemhauser et al. 1978]. In this article, we extend the existing submodular functions for summarization to functions for joint recognition and summarization that still enjoy submodularity.

As learning framework, we adopt the latent structural support vector machine (SVM) of Yu and Joachims [Yu and Joachims 2009]. This framework joins the benefits of structured prediction (i.e., the ability to predict sequences, trees, and other graphs) with maximum-margin learning that has gained a reputation for accurate prediction in a number of fields [Zhu et al. 2010; Wang and Mori 2011; Duan et al. 2012; Kim et al. 2015; Sachan et al. 2015]. In addition, it allows us to exploit different degrees of supervision for the summary variables, from completely unsupervised to fully supervised, which suit different scenarios of application.

An earlier version of this work was published in the work of Hussein et al. [2016]. The main, distinct contributions of this article are: (1) a submodular inference approach for the computation of latent structural SVM (Section 3.2); (2) a new measure for the quantitative evaluation of video summaries, nicknamed V-JAUNE (Section 4); and (3) an extensive experimental evaluation over action datasets with different extents of summary supervision (Section 5).

The rest of this article is organized as follows: in Section 2, we review the state of the art on relevant topics. In Section 3, we describe the model and the learning framework. In Section 4, we introduce the proposed summary evaluation metric. Experiments and results are presented in Section 5 and the main findings of this work are recapitulated in Section 6.

2. RELATED WORK

This article relates to structured prediction learning and its applications to action recognition and video summarization. Since the state of the art is vast, we restrict the review of related work to immediately relevant topics.

Automated video summarization is a long-standing research area in multimedia [Maybury et al. 1997]. Summarization methods can be mainly categorized into: (a) clustering approaches and (b) frame-differences approaches. The clustering approaches are aggregative methods that attempt grouping similar frames and select representatives from each group [De Avila et al. 2011; Ghosh et al. 2012; Jaffe et al. 2006; Mundur et al. 2006]. Frames can be clustered using low-level features (e.g., De Avila et al. [2011]) or even detected objects [Ghosh et al. 2012], and structure can also be usefully enforced during clustering [Chen et al. 2009; Gygli et al. 2015].


Turaga et al. [2009] have proposed an approach to make the clustering less sensitive to the viewpoint and rate of the activities depicted in the videos. Frame-differences approaches, on the other hand, scan the video's frames in sequential order to detect shot boundaries and key frames [Xiong et al. 2006; Cong et al. 2012; Yang et al. 2013; Lu et al. 2016].

Submodular functions have recently played a major role in machine learning thanks to their efficient maximization and minimization properties. Submodular functions have been identified in tasks as diverse as social network analysis, graph cutting, machine translation, and summarization [Bach 2013]. For instance, Lin and Bilmes [2011] and Sipos et al. [2012] have presented submodular approaches to document summarization. Tschiatschek et al. [2014] have proposed a similar approach for the summarization of a collection of images. The most attractive property of submodular functions that are also monotonic (a frequent case) is the guaranteed performance of greedy maximization algorithms. This is not only useful for inference on unseen examples, but also for inference during training.

Action recognition has been one of the most active areas of computer vision for over a decade [Negin and Bremond 2016]. An obvious categorization of the approaches is elusive, but in the context of this article, we can categorize them as (a) non-structural versus (b) structural. The approaches in the first category extract a single representation from the whole video and apply a classifier to predict its action label [Laptev et al. 2008; Wang et al. 2011; Wang and Schmid 2013; Karpathy et al. 2014]. The approaches in the structural category leverage the relationships between the video's frames, often in terms of graphical models, and infer the action class from the graph [Tang et al. 2012; Izadinia and Shah 2012; Donahue et al. 2015; Slama et al. 2015; Devanne et al. 2015]. These approaches more naturally lend themselves to extensions for the purposes of summarization, key frame detection, and pose detection (e.g., Brendel and Todorovic [2010]). Various works have argued that actions can be recognized more accurately using only a selection of the video's "key" frames [Schindler and Van Gool 2008; Hu and Zheng 2011; Raptis and Sigal 2013] and our work follows similar lines.

Structural SVM is an extension of the conventional support vector machine to the classification of structured outputs, i.e., sets of labels organized into sequences, trees, and graphs [Tsochantaridis et al. 2005]. It has been applied successfully in areas as diverse as handwritten digit recognition, object recognition, action recognition, and information retrieval [Altun et al. 2005; Wang and Mori 2011; Wu et al. 2013; Kim et al. 2015]. Yu and Joachims [2009] have extended structural SVM to training samples with latent variables and used a concave-convex procedure to ensure convergence to a local optimum. Latent structural SVM, too, has proven useful in many multimedia applications, especially those where ground-truth annotation is expensive or impossible such as complex event detection [Tang et al. 2012] or natural language comprehension [Sachan et al. 2015]. In this article, we adopt a latent and semi-latent structural SVM approach to jointly infer the action and the summary from a video sequence, dealing with the summary as a set of latent variables.

For the quantitative evaluation of video summarization, many works have adopted the F1 score for its ability to balance the precision and recall requirements [Li et al. 2011; Ejaz et al. 2013; Gygli et al. 2015]. Others have used a recall-based measure called Comparison of User Summaries (CUS) where the predicted summary is compared to ground-truth summaries from multiple annotators using a Manhattan distance [De Avila et al. 2011; Almeida et al. 2012; Ejaz et al. 2012]. Following the widespread use of summarization metrics such as ROUGE [Lin 2004] in natural language processing, Tschiatschek et al. [2014] have introduced a recall-based metric called V-ROUGE and applied it to the evaluation of summaries of image collections. While this metric could also be used to evaluate the quality of a video summary, it does not take into account the frames' sequentiality. In other words, a video summary ought to consider the order in which the frames appear and consist of a sequence, rather than a set, of frames. For this reason, in this article, we introduce a novel measure, V-JAUNE, that addresses the video summaries as frame sequences.

Fig. 1. The graphical model for joint action classification and summarization of a video: y: action class label; h: frames selected for the summary; x: measurements from the video frames.

3. LEARNING FRAMEWORK

The framework proposed for joint action recognition and summarization is based on graphical models and latent structural SVM. The model is described hereafter while latent structural SVM is presented in Section 3.2.

3.1. Model Formulation

The goal of our work is to provide joint classification and summarization for a video representing an action. To this aim, let us note a sequence of multivariate measurements, one per frame, as x = {x1, . . . xi, . . . xT}; a sequence of binary variables indicating whether a frame belongs to the summary or not as h = {h1, . . . hi, . . . hT}; and the action class as y ∈ {1 . . . M}. Figure 1 shows the variables organized in a graphical model. Formally, we aim to jointly infer class label y and summary h while keeping the summary to a given, maximum size, B (the "budget"):

$$y^*, h^* = \arg\max_{y,h} F(x, y, h) \quad \text{s.t.} \quad \sum_{i=1}^{T} h_i = B \qquad (1)$$

Lin and Bilmes [2011] have shown that desirable summaries (i.e., summaries with good coverage of the entire document and limited redundancy) enjoy the property of submodularity. Submodularity can be intuitively explained as a law of diminishing returns: let us assume to have a scalar function, F, which can measure the quality of a given summary, together with an arbitrary summary, S. We now add a new element, v, to S and compute the difference in value between F(S ∪ v) and F(S) (the "return" of v for S). Let us then consider a superset of S, T ⊃ S, and add v to it: submodularity holds if the return of v for T is less than or equal to the return of v for S. In simple terms, the larger the summary, the less is the benefit brought in by a new element. This property can be formally expressed as:

$$\forall\, S \subset T,\ v:\quad F(S \cup v) - F(S) \;\ge\; F(T \cup v) - F(T) \qquad (2)$$

Note that submodular functions are not required to be monotonically non-decreasing, i.e., returns can be negative; however, Inequality (2) must hold. For simplicity, in the following, we assume that F is monotonically non-decreasing for reasonably small sizes of the summary. The most remarkable property of monotonic submodular functions is that a value for F with a guaranteed lower bound can be found by simply selecting the elements for the summary one by one. The approximate maximum returned by such a greedy algorithm is guaranteed to be at least (1 − 1/e) ≈ 0.632 of the actual maximum [Nemhauser et al. 1978], and it is found to be often better in practice. De facto, greedy inference algorithms perform well with submodular functions [Lin and Bilmes 2011]. In addition, the search for the B highest-scoring elements of a set enjoys minimal, linear complexity, O(T), in the size of the set, which is the lowest possible computational complexity for the inference.
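To make the two properties above concrete, the following minimal sketch (illustrative Python, not the authors' code) checks the diminishing-returns inequality (2) for a toy coverage function and then selects a budget of elements greedily, the strategy whose (1 − 1/e) guarantee is cited above; the set system and function F are invented for the example.

from itertools import combinations

# Toy monotone submodular function: coverage of a ground set by selected subsets.
SETS = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {5, 6}}

def F(selection):
    """Coverage value of a selection of set indices (monotone submodular)."""
    covered = set()
    for i in selection:
        covered |= SETS[i]
    return len(covered)

def check_diminishing_returns():
    """Verify F(S + v) - F(S) >= F(T + v) - F(T) for all S within T and v outside T."""
    items = list(SETS)
    for r in range(len(items) + 1):
        for T_sel in combinations(items, r):
            for s in range(len(T_sel) + 1):
                for S_sel in combinations(T_sel, s):
                    for v in items:
                        if v in T_sel:
                            continue
                        gain_S = F(set(S_sel) | {v}) - F(set(S_sel))
                        gain_T = F(set(T_sel) | {v}) - F(set(T_sel))
                        assert gain_S >= gain_T
    return True

def greedy(budget):
    """Pick `budget` elements one by one, each time the largest marginal gain."""
    chosen = set()
    for _ in range(budget):
        remaining = [i for i in SETS if i not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda v: F(chosen | {v}) - F(chosen))
        chosen.add(best)
    return chosen

if __name__ == "__main__":
    print(check_diminishing_returns())   # True
    print(greedy(2), F(greedy(2)))       # e.g. {0, 2} with coverage 5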

We now restrict the choice of the scoring function to the case of linear models:

$$F(x, y, h) = w^\top \psi(x, y, h), \qquad (3)$$

with w, a parameter vector of non-negative elements, and ψ(x, y, h), a suitable feature function of equal size. Lin and Bilmes [2011] have proposed the following feature function for summarization:

$$\psi(x, y, h) = \sum_{\substack{i,j=1 \\ j \ne i}}^{T} \phi(x_i, x_j, y, h_i, h_j), \qquad (4)$$

where

$$\phi(x_i, x_j, y, h_i, h_j) = \lambda(h_i, h_j)\, s(x_i, x_j) \quad (\text{$y$-aligned})$$

$$\lambda(h_i, h_j) = \begin{cases} \lambda_1, & h_i = 1,\ h_j = 0 \ \ \text{(coverage)} \\ \lambda_2, & h_i = 1,\ h_j = 1 \ \ \text{(non-redundancy)} \\ 0, & h_i = 0,\ h_j = 0 \end{cases} \qquad \lambda_1 \ge 0,\ \lambda_2 \le 0, \qquad (5)$$

with s(xi, xj) a similarity function between frames xi and xj. If the similarity function is D-dimensional, function φ(xi, xj, y, hi, hj) is MD-dimensional and is obtained by aligning the similarity function at index (y − 1)D and padding all the remaining elements with zeros. Frame xi, i = 1 . . . T, is included in the summary if its corresponding binary indicator, hi, is set to one. Therefore, the λ1 terms in Function (4) are the coverage terms, while the λ2 terms promote non-redundancy in the summary by penalizing similar frames. Following the work of Lin and Bilmes [2011], it is easy to prove that Function (4) is submodular.

Functions based on between-frame similarities such as Function (4) are suitable for summarization, but do not properly describe the class of the action since their space is very sparse. Typical feature functions for action recognition are, instead, based on bagging or averages of the frame measurements. To provide joint summarization and recognition, we propose augmenting Function (4) as follows:

$$\psi(x, y, h) = \sum_{\substack{i,j=1 \\ j \ne i}}^{T} \phi(x_i, x_j, y, h_i, h_j) \; + \; \underbrace{\sum_{i=1}^{T} \lambda_3\, \mathbb{I}[y, h_i = 1]\, x_i}_{\text{action}} \qquad (6)$$

In this way, a new term is added containing the weighted sum of all measurements xi in the summary. Such a term is equivalent to a pooled descriptor and promises to be informative for action recognition. Its dimensionality is assumed to be equal to that of the similarity function, D, so that the terms can be added up and y-aligned. We now prove that Equation (6) is still submodular:

PROPOSITION 1. Function ψ(x, y, h) in Equation (6) is submodular.

PROOF (REPHRASED FROM HUSSEIN ET AL. [2016]). Given a current summary, h, adding any extra frame to it makes the term $\sum_{i=1}^{T} \lambda_3\, \mathbb{I}[y, h_i = 1]\, x_i$ vary by the same amount irrespectively of h. This term thus satisfies Inequality (2) with the equal sign and is, therefore, submodular. In turn, function ψ(x, y, h) is a positive combination of two submodular terms and is proven to be submodular thanks to well-known properties of submodularity [Bach 2013].
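As an illustration of Equations (3)-(6), the following minimal sketch (assumed, not the authors' implementation) computes the score w⊤ψ(x, y, h) for a given class and summary: a y-aligned block accumulates the λ1 coverage, λ2 non-redundancy, and λ3 pooled "action" terms, and is then dotted with the corresponding block of w. The element-wise absolute difference is the similarity reported in Section 5; pairs with (hi, hj) = (0, 1) are assumed to contribute zero, which Equation (5) leaves implicit, and all names are illustrative.

import numpy as np

def similarity(xi, xj):
    # s(x_i, x_j): element-wise absolute difference, as used in the experiments.
    return np.abs(xi - xj)

def joint_score(x, y, h, w, lambdas, num_classes):
    """w^T psi(x, y, h) for frames x (T x D), class y in 1..M, binary summary h (T,)."""
    lam1, lam2, lam3 = lambdas
    T, D = x.shape
    assert w.size == num_classes * D     # w has one D-dimensional block per class
    block = np.zeros(D)                  # the y-aligned block of psi
    for i in range(T):
        if h[i] != 1:
            continue
        block += lam3 * x[i]             # pooled "action" term of Eq. (6)
        for j in range(T):
            if j == i:
                continue
            if h[j] == 0:
                block += lam1 * similarity(x[i], x[j])   # coverage
            else:
                block += lam2 * similarity(x[i], x[j])   # non-redundancy (lam2 <= 0)
    w_y = w[(y - 1) * D: y * D]          # block of w aligned with class y
    return float(w_y @ block)

# Toy usage: 6 frames, 4-D measurements, 3 classes, summary of 2 frames.
rng = np.random.default_rng(0)
x = rng.random((6, 4))
w = rng.random(3 * 4)                    # non-negative parameters
h = np.array([1, 0, 0, 1, 0, 0])
print(joint_score(x, y=2, h=h, w=w, lambdas=(0.5, -0.5, 1.0), num_classes=3))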

Algorithm 1 shows the greedy algorithm that we use to jointly infer the best action class and the best summary, choosing one frame for the summary at a time.

ALGORITHM 1: Greedy algorithm for inferring class y* and summary h* given scoring function F(x, y, h).

max = −∞, argmax = 0
for y = 1 . . . M do
    h* ← ∅
    X ← x
    while X ≠ ∅ and |h*| ≤ B do
        k ← argmax_{v ∈ X} F(x, y, h* ∪ v) − F(x, y, h*)
        h* ← h* ∪ {k}
        X ← X \ {k}
    end while
    if F(x, y, h*) > max then
        max = F(x, y, h*)
        argmax = y
    end if
end for
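A minimal runnable rendition of Algorithm 1 is sketched below (illustrative, not the released code): for every candidate class, frames are added one at a time by largest marginal gain of a caller-supplied score function standing in for F(x, y, h), and the best-scoring class/summary pair is returned. The dummy score in the usage lines is only there to make the sketch executable.

import numpy as np

def greedy_joint_inference(x, score_fn, num_classes, budget):
    """Return (y*, h*) maximizing score_fn(x, y, h) greedily under a summary budget."""
    T = x.shape[0]
    best_score, best_y, best_h = -np.inf, None, None
    for y in range(1, num_classes + 1):
        h = np.zeros(T, dtype=int)
        candidates = set(range(T))
        while candidates and h.sum() < budget:
            def gain(i):
                # marginal gain of adding frame i to the current summary
                h_new = h.copy()
                h_new[i] = 1
                return score_fn(x, y, h_new) - score_fn(x, y, h)
            k = max(candidates, key=gain)
            h[k] = 1
            candidates.remove(k)
        s = score_fn(x, y, h)
        if s > best_score:
            best_score, best_y, best_h = s, y, h.copy()
    return best_y, best_h

# Usage with a dummy (modular, hence submodular) score: sum of one feature
# column over the selected frames, shifted by the class index.
x = np.arange(12, dtype=float).reshape(6, 2)
dummy = lambda x_, y_, h_: float((h_ * x_[:, 0]).sum() + y_)
print(greedy_joint_inference(x, dummy, num_classes=3, budget=2))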

Given the recent ascent of deep neural networks in classification performance (from the works of Karpathy et al. [2014] and Donahue et al. [2015], and many others), it is important to highlight the advantages of using a graphical model such as Equations (3)-(6) for the joint prediction of an action and its summary:

—conventional deep neural networks such as convolutional or recurrent networks [Goodfellow et al. 2016] could straightforwardly be used to infer either the action or the summary, but their joint inference would require substantial modifications;

—the nature of the score function in a graphical model allows enforcing meaningful constraints for the score (e.g., coverage and non-redundancy of the summaries) and enjoys the properties of submodular inference;

—the variables for the summary can be trained in an unsupervised way alongside the supervised actions. This is a major advantage in terms of annotation and is discussed in detail in the following section.

3.2. Latent Structural SVM for Unsupervised and Semi-Supervised Learning

Latent structural SVM is an extension of the support vector machine suitable for the prediction of complex outputs such as trees and graphs in the presence of latent nodes [Yu and Joachims 2009]. It has been applied successfully in a number of fields such as computer vision, natural language processing, and bioinformatics [Zhu et al. 2010; Wang and Mori 2011; Duan et al. 2012; Kim et al. 2015; Sachan et al. 2015]. Its main strength is its ability to combine the typical accuracy of large-margin training with the flexibility of arbitrary output structures and unobserved variables. It is therefore a natural training framework for the score function in Equation (3).

Let us assume that we are given a training set with N videos, (xn, yn), n = 1 . . . N, where the action classes are supervised, but the summaries are completely unsupervised. Please note that in this section we use a superscript index to indicate the video and, where needed, a subscript to indicate the frame. The learning objective of latent structural SVM can be expressed as:

$$w^* = \arg\min_{w \ge 0,\ \xi^{1:N} \ge 0} \; \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} \xi^n$$

$$\text{s.t.} \quad w^\top \psi(x^n, y^n, h^{n*}) - w^\top \psi(x^n, y, h) \;\ge\; \Delta(y^n, y) - \xi^n \quad \forall\, \{y, h\} \ne \{y^n, h^{n*}\} \qquad (7)$$

$$h^{n*} = \arg\max_{h}\; w^{*\top} \psi(x^n, h, y^n) \qquad (8)$$

Like in a conventional SVM, the objective function in Equation (7) is a tradeoff between two terms: an upper bound over the classification error on the training set, $\sum_{n=1}^{N} \xi^n$ (also known as the hinge loss), and a regularizer, ‖w‖2, that encourages a large margin between the classes. The constraints in the minimization impose that, for every sample, the score assigned to the correct labeling, yn, hn*, is higher than the score given to any other labelings, y, h ≠ yn, hn*, by a margin equal to the loss function, Δ(yn, y) (margin-rescaled SVM). However, given that the h variables are unsupervised/unknown, an estimate has to be inferred in Equation (8) using the current model. Latent structural SVM is therefore an iterative objective that alternates between the constrained optimization in Equation (7), performed using the current values for the latent variables, hn*, and a new assignment for the hn* in Equation (8), performed using the current model, w*. This algorithm is guaranteed to converge to a local minimum of the objective function [Yu and Joachims 2009]. Note that the loss function that we minimize, Δ(yn, y), only accounts for the loss from action misclassifications. As such, the selection of the frames for the summary, h, is driven by the requirement of maximizing the action recognition accuracy.

The initialization of the training algorithm requires an arbitrary starting assignment for the h variables. A uniformly-spaced selection of the frames is a reasonable starting summary and we thus use it for initialization (i.e., hi = 1 for frames spaced ⌊T/B⌋ apart, where T is the video's length and B is the budget). In case some of the summaries can be ground-truth annotated (semi-supervised training), Algorithm (7-8) can be used, substantially unchanged, by just skipping Assignment (8) for the supervised sequences.
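As a small illustration, the sketch below builds the uniformly-spaced starting summary assumed above (frames ⌊T/B⌋ apart, an interpretation of the initialization rule rather than the authors' exact code); supervised sequences would keep their annotated summaries instead.

import numpy as np

def uniform_init(T, B):
    """Binary h of length T with B (approximately) evenly spaced ones."""
    h = np.zeros(T, dtype=int)
    step = max(T // B, 1)
    ones = [min(k * step, T - 1) for k in range(1, B + 1)]
    h[ones] = 1
    return h

print(uniform_init(T=23, B=5))   # ones at frames 4, 8, 12, 16, 20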

The optimization in Equation (7) is a standard quadratic program that can be addressed by any common solver. However, since the number of constraints in Equation (7) is exponential, we adopt the relaxation of Tsochantaridis et al. [2005], which can find ε-correct solutions using only a polynomial-size working set of constraints. The working set is built by searching the sample's most violated constraint at each iteration of the solver:

$$\xi^n = \max_{y,h} \left( -w^\top \psi(x^n, h^{n*}, y^n) + w^\top \psi(x^n, h, y) + \Delta(y^n, y) \right), \qquad (9)$$

which equates to finding the labeling with the highest sum of score and loss:

$$y^{n*}, h^{n*} = \arg\max_{y,h} \left( w^\top \psi(x^n, h, y) + \Delta(y^n, y) \right) \qquad (10)$$

This equation is commonly referred to as "loss-augmented inference" due to its similarity to the standard inference and can again be solved by Algorithm 1 simply by the addition of loss Δ(yn, y) to the score. In the following, we prove that the argument of Equation (10) is submodular:


PROPOSITION 2. Function w⊤ψ(xn, h, y) + Δ(yn, y) is submodular.

PROOF. In Proposition 1, we had already proved that function ψ(x, y, h) is submodular. Score w⊤ψ(x, y, h), w ≥ 0, is a positive combination of the dimensions of ψ(x, y, h) and is therefore submodular for well-known properties of submodularity [Bach 2013]. Given that Δ(yn, y) is independent of h, its contribution to the return is null and the function is therefore submodular overall.
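Because Δ(yn, y) does not depend on h, loss-augmented inference can reuse the same greedy routine: the sketch below (assuming a 0/1 misclassification loss, which the article does not specify) simply wraps a score function so that the loss is added before the search.

def loss_augmented(score_fn, y_true, delta=lambda yt, y: 0.0 if y == yt else 1.0):
    """Wrap score_fn(x, y, h) so that it returns score + Delta(y_true, y)."""
    def augmented(x, y, h):
        return score_fn(x, y, h) + delta(y_true, y)
    return augmented

# Usage: greedy_joint_inference(x, loss_augmented(score_fn, y_true=yn), M, B)
# returns the most violated labeling (y, h) for training sample (x^n, y^n).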

4. V-JAUNE: VIDEO SUMMARY EVALUATION

Video summarization still lacks a generally-agreed measure for the quantitative assessment of a summary's quality. Unlike metrics for text summaries like the popular ROUGE [Lin 2004], a measure for video summaries should reflect not only the summary's content, but also its order since summaries could otherwise prove ambiguous. For example, actions "sitting down" and "standing up" could generate similar sets of summary frames, but their order must be different to correctly convey the action.

For this reason, in this article, we propose a novel performance measure, nicknamed V-JAUNE following the conventional use of color names and referring by "V" to visual data as in Tschiatschek et al. [2014]. To present the measure, in this section, we utilize a compact notation for a summary, h = {h1, . . . hi, . . . hB}, consisting of the frame indices of its B frames. Given a ground-truth summary, h, and a predicted summary, ĥ, the measure is phrased as a loss function and defined as follows:

$$\Delta(h, \hat{h}) = \sum_{i=1}^{B} \delta(h_i, \hat{h}_i), \qquad \delta(h_i, \hat{h}_i) = \min_j \left\{ \|x_{\hat{h}_j} - x_{h_i}\|_2 \right\} \;\; \text{s.t.}\;\; i - \epsilon \le j \le i + \epsilon \qquad (11)$$

With this definition, loss function Δ(h, ĥ) reflects the sequential order of the frames in their respective summaries, while allowing for a ±ε tolerance in the matching of the corresponding positions.
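The following minimal sketch (illustrative, not the authors' code) computes the single-annotator loss of Equation (11): each ground-truth summary frame is matched, within a window of ±ε positions clamped at the summary boundaries (the boundary handling is an assumption), against the predicted frames, and the smallest L2 distance between the frame measurements is accumulated.

import numpy as np

def vjaune_loss(x, gt, pred, eps=1):
    """Order-aware summary loss: x holds one measurement per video frame,
    gt and pred hold the frame indices of the two summaries in order."""
    B = len(gt)
    total = 0.0
    for i in range(B):
        lo, hi = max(0, i - eps), min(B - 1, i + eps)
        dists = [np.linalg.norm(x[pred[j]] - x[gt[i]]) for j in range(lo, hi + 1)]
        total += min(dists)
    return total

# Toy usage: 10 frames of 3-D measurements, two summaries of B = 4 frames.
rng = np.random.default_rng(1)
x = rng.random((10, 3))
print(vjaune_loss(x, gt=[0, 3, 6, 9], pred=[1, 3, 5, 8], eps=1))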

In the field of summarization, the annotation of the ground truth is highly subjective and it is therefore desirable to extend the loss to the multi-annotator case. By calling M the number of annotators, the multi-annotator loss is defined as:

$$\Delta(h^{1:M}, \hat{h}) = \sum_{m=1}^{M} \Delta(h^m, \hat{h}). \qquad (12)$$

A loss function such as in Equations (11) and (12) visibly depends on the scale of the x measurements and it is therefore "denormalized." A possible way to normalize it would be to estimate the scale of the measurements and divide the loss by it. However, a preferable approach is to normalize it by the disagreement between the annotators' summaries. In this way, the loss simultaneously becomes normalized by both the measurements' scale and the extent of disagreement between the ground-truth annotators. Therefore, we quantify the disagreement as:

$$D = \frac{2}{M(M-1)} \sum_{p,q} \Delta(h^p, h^q), \qquad p = 1 \ldots M,\; q = p + 1 \ldots M \qquad (13)$$

and normalize the loss as:

$$\Delta'(h^{1:M}, \hat{h}) = \Delta(h^{1:M}, \hat{h}) \,/\, D. \qquad (14)$$
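A corresponding sketch of the multi-annotator loss of Equations (12)-(14) is given below; it takes the single-summary loss as a parameter (for instance, the vjaune_loss sketch above) and normalizes the total by the average disagreement D between the annotators.

from itertools import combinations

def normalized_vjaune(pairwise_loss, annotator_summaries, predicted):
    """Return Delta'(h^{1:M}, h_hat) = sum_m Delta(h^m, h_hat) / D."""
    M = len(annotator_summaries)
    total = sum(pairwise_loss(hm, predicted) for hm in annotator_summaries)
    # D: average loss over all unordered annotator pairs, Equation (13).
    disagreement = sum(pairwise_loss(hp, hq)
                       for hp, hq in combinations(annotator_summaries, 2))
    disagreement *= 2.0 / (M * (M - 1))
    return total / disagreement

# Usage (with the vjaune_loss sketch above and frame measurements x):
#   loss = normalized_vjaune(lambda a, b: vjaune_loss(x, a, b), [h1, h2, h3], h_pred)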

Figure 2 compares the values of the denormalized and normalized loss functions for three summary annotations of 95 videos from the Actions for Cooking Eggs (ACE) action dataset (Section 5.1). It is evident that the normalized loss values are much more uniform. For more detail, Figure 3 plots the disagreements between pairs of annotators.

Fig. 2. V-JAUNE values for the ACE test set (95 videos) with multiple annotators: blue bars, denormalized values; red bars, normalized values.

Fig. 3. V-JAUNE loss for different annotators over the ACE test set (95 videos), using the first annotator as ground truth and the second as prediction. Please note that the changes in value are mainly due to the changes in magnitude of the VLAD descriptors. However, the agreement also varies with the video.

5. EXPERIMENTAL RESULTS

To evaluate the effectiveness of the proposed method, we have performed experiments on two challenging action datasets of depth videos: the ACE dataset [Shimada et al. 2013] and the Microsoft Research (MSR) DailyActivity3D dataset [Wang et al. 2012]. The datasets and experimental results are presented in detail in Sections 5.1 and 5.2. The evaluation addresses both the accuracy of the action recognition and the qualitative and quantitative quality of the produced summaries. For both datasets, we have used comparable implementation settings: for each video, we have extracted dense local descriptors (histogram of oriented gradients (HOG)/histogram of optical flow (HOF)) over a regular spatio-temporal grid using the code from Wang et al. [2009]. As time scale, we have used τ = 2, which has resulted in 162-D individual descriptors. We have chosen the HOG/HOF features as well-proven, general-purpose features for action recognition, suitable for the scope of this article. However, it is likely that the experimental accuracy could easily be improved by using alternative features. As feature encoding, we have used VLAD [Jegou et al. 2010], which embeds the distance between the pooled local features and the clusters' centers. For the encoding, we have first run a k-means clustering over all the descriptors in the training set, empirically choosing k = 64 for the ACE dataset (more complex and varied) and k = 32 for MSR DailyActivity3D. Then, for each frame, we have used the found clusters to encode the frame's descriptors in an encoding of 162 × k dimensions to be used as the measurement vector for the frame. As software for the latent structural SVM model, we have used Joachims' solver [Tsochantaridis et al. 2005] with Vedaldi's MATLAB wrapper [Vedaldi 2011]. For the similarity function, s(xi, xj), in Equation (5), we have used the element-wise absolute difference of vectors xi and xj. As parameters, we have used summary size B = 10, regularization coefficient C = 100, and performed a grid search over the training set for weights λ1, λ2, λ3 in range [−1, 1] in 0.5 steps. The summary size was chosen arbitrarily as a reasonable number of frames to display at once to a user, while the values for the number of clusters k and regularization coefficient C were chosen over an initial evaluation phase using a small subset of the training set as validation set.

Table I. Details of the ACE Dataset

                    Training               Test
Action         Instances    Frames    Instances    Frames
Breaking           24         4,845       10         2,078
Mixing             36        18,671       22         5,886
Baking             44        43,306       27        13,960
Turning             9         7,595       10         4,024
Cutting            20        17,669       11         2,528
Boiling             8        11,174        7         5,396
Seasoning          15         5,449        6         1,583
Peeling             5        11,008        2         2,382
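For concreteness, the sketch below shows a standard VLAD-style per-frame encoding consistent with the setup described above (residuals to the nearest of the k cluster centres, summed per centre and concatenated into a 162 × k vector); the normalization steps of the original VLAD formulation are omitted and the code is illustrative rather than the authors' implementation.

import numpy as np

def vlad_encode(descriptors, centres):
    """descriptors: (n, d) local descriptors of one frame; centres: (k, d) k-means centres."""
    k, d = centres.shape
    encoding = np.zeros((k, d))
    # nearest centre for every descriptor
    assign = np.argmin(((descriptors[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        members = descriptors[assign == c]
        if len(members):
            encoding[c] = (members - centres[c]).sum(axis=0)
    return encoding.ravel()              # (k * d,) measurement vector for the frame

# Toy usage with the ACE setting of 162-D descriptors and k = 64 centres.
rng = np.random.default_rng(2)
desc = rng.random((300, 162))            # e.g., a subset of a frame's local descriptors
centres = rng.random((64, 162))
print(vlad_encode(desc, centres).shape)  # (10368,) = 162 * 64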

5.1. ACE

The ACE dataset was released as part of an ICPR 2012 official contest (the Kitchen Scene Context Based Gesture Recognition contest; the dataset also being known as Kitchen Scene Context-based Gesture Recognition (KSCGR)). It was collected in a simulated kitchen using a Kinect camera at 30fps. The resolution of the depth frames is 640×480. Using our grid on these frames resulted in 1,786 descriptors per frame. In the dataset, five actors cook eggs according to different recipes (omelet, scrambled eggs, etc.). The cooking entails eight different classes of actions: cutting, seasoning, peeling, boiling, turning, baking, mixing, and breaking, annotated in the videos at frame level. Classification is challenging since most of the actions share similar body postures and span limited movements. Most of the previous work on this dataset had used it for joint action segmentation and classification [Niebles et al. 2010; Yuan et al. 2011; Wang and Mori 2011; Wang et al. 2011; Wang and Schmid 2013; Ni et al. 2015], while we have used it for joint action classification and summarization. To prepare the dataset for this task, we have clipped the individual action instances, maintaining the same training and test split mentioned in Shimada et al. [2013]. In this way, we have obtained 161 action instances for training and 95 for testing. Table I shows the number of instances and frames per action. Each instance ranges between 20 and 4,469 frames. In addition, we have asked three annotators to independently select B = 10 frames from each action instance as their preferred summary for that instance. The annotators were instructed to select the frames based on how well they seemed to cover the overall instance and its various phases. This left room for significant subjectivity and variance in the resulting summaries (Figure 3).

Results. To evaluate the action recognition component, we have compared the test-set recognition accuracy of the proposed system with: (1) a baseline system using the pooled descriptors from all frames as measurement and libsvm [Chang and Lin 2011] as the classifier; (2) the proposed system without the summarization component in the score function (i.e., λ1 = λ2 = 0), still with B = 10; and (3) with all the frames. In addition, we have compared it with the highest result reported in the literature for the joint action segmentation and classification task [Ni et al. 2015].

Table II shows that the proposed method (with λ1 = 0.5 and λ2 = −0.5) has achieved a much higher accuracy (77.9%) than the same method without the summarization component in the score function, both when using only 10 frames (66.3%) and all frames (54.7%). The recognition accuracy obtained by the baseline system (62.1%) has also been remarkably lower than that of the proposed method. In addition, the proposed method has outperformed the highest result for the action segmentation and classification task (75.2%) [Ni et al. 2015], although these accuracies cannot be directly compared since we have not undertaken action segmentation. With the proposed method, various action classes ("breaking," "cutting," and "boiling") have been recognized with 100% accuracy. Conversely, class "turning" has never been recognized correctly, most likely because of its very short duration. Overall, these results give evidence that: (a) higher action classification accuracy can be achieved by leveraging a selection of the frames; (b) the summarization component in the score function increases the accuracy of action recognition; and (c) the proposed method is in line with the state of the art on this dataset.

Table II. Comparison of the Action Recognition Accuracy on the ACE Dataset

Method                                        Accuracy
libsvm [Chang and Lin 2011]                     62.1%
PA-Pooling [Ni et al. 2015]                     75.2%
Proposed method (all frames & no summary)       54.7%
Proposed method (10 frames & no summary)        66.3%
Proposed method                                 77.9%

Table III. The Evaluation Results on the ACE Dataset Using Various Amounts of Supervision

                        Ground1               Ground2               Ground3
Learning            Accuracy  V-JAUNE     Accuracy  V-JAUNE     Accuracy  V-JAUNE
Unsupervised          77.9%    0.947        77.9%    0.947        77.9%    0.947
10% supervised        81.1%    0.926        73.7%    0.971        73.9%    0.909
20% supervised        81.1%    0.936        77.9%    0.932        80.0%    0.927
Fully supervised      66.3%    0.944        72.6%    0.917        71.6%    0.938

For the evaluation of the summarization component, we resort to both a qualitative comparison and a quantitative comparison by means of the proposed V-JAUNE loss. The proposed system, as described in the previous paragraph, has achieved a normalized loss value of 0.947. To put it in perspective, we compare it with the loss value obtained by a popular summarization approach, the Sum of Absolute Differences (SAD) which has been widely used in object recognition and video compression [Xiong et al. 2006]. The loss value achieved by SAD is 0.927, showing that our summaries are only slightly worse than those obtained with this method.

However, all the experiments conducted so far have been carried out in a completely unsupervised way as far as the summaries are concerned. This means that the initialization of the summary variables in latent structural SVM has been performed in an arbitrary way (i.e., uniform spacing). Conversely, the proposed method has the potential to take advantage of a more informed initialization. To this aim, we have created a set of experiments where an increasing percentage of the training sequences were initialized using the summaries from one of the annotators in turn (the remaining sequences were still initialized uniformly). Table III shows the quantitative evaluation of the produced summaries alongside the recognition accuracy with different percentages of supervision. A remarkable result has been obtained with 10% supervision from the first annotator (Ground1), with an action recognition accuracy of 81.1% and a normalized V-JAUNE value of 0.926. This result seems very valuable since both the action recognition accuracy and the summary quality have improved compared to the unsupervised case, also outperforming the SAD baseline. As expected, the performance and the optimal amount of supervision vary with the annotator (Table III), and they are therefore selected by cross-validation. However, the fact that the performance does not improve beyond a given amount of supervision seems desirable: while it is easy to collect large datasets of action videos, it is very time-consuming to manually annotate their summaries. Therefore, such a weakly-supervised scenario is also the most feasible in practice.

Table IV. Influence of the Budget on the Action Recognition Accuracy for the ACE Dataset

Budget size       5        10       13       16
Accuracy        54.7%    77.9%    75.8%    72.6%

Table V. Sensitivity Analysis of the Action Recognition Accuracy at the Variation of the λ Parameters for the ACE Dataset (Unsupervised Case)

  λ1        λ2        λ3     Accuracy
  0.5      −0.5       10      77.9%
  0         0          1      66.3%
  0.005    −0.005     10      67.4%
  0.5      −0.5        0      57.9%

For a qualitative comparison, Figure 4 shows examples of summaries predicted by the proposed approach (10% supervision) and SAD for actions breaking, baking (egg), baking (ham), and turning. In general, the summaries obtained with the proposed approach appear mildly more informative and diversified in terms of stages of the action (compare, for instance, the first and second rows of Figure 4(b)). Given that the loss value is also slightly lower for the proposed method, this qualitative comparison confirms the usefulness of V-JAUNE as a quantitative indicator of a summary's quality.

Fig. 4. Examples of predicted summaries from the ACE dataset (displayed as RGB frames for the sake of visualization). The subfigures display the following actions: (a) breaking; (b) baking (omelet); (c) baking (ham); and (d) turning. In each subfigure, the first row is from the proposed method, the second from SAD.

As a last experiment with this dataset, we have aimed to explore the sensitivity of the action recognition accuracy to the budget, B, and the λ parameters in the score function. Table IV shows that if the budget is reduced to five frames per video, the action recognition accuracy drops significantly (54.7%). This is likely due to an insufficient description of the action. However, the action recognition accuracy also starts to decrease if the budget is increased beyond a certain value. This confirms that adding frames in excess ends up introducing "noise" in the score function. Table V shows that the best accuracy is achieved with a tuned balance of the coverage, non-redundancy, and recognition coefficients (first row). Renouncing the summarization components (second row) decreases the accuracy, while the accuracy increases back as they are progressively reintroduced (third row). Renouncing the recognition component, too, leads to a marked decrease in accuracy (fourth row).

5.2. MSR DailyActivity3D

The MSR DailyActivity3D dataset is a popular activity dataset captured using a Kinect sensor. It consists of 16 classes of typical indoor activities, namely drinking, eating, reading, using cell phones, writing, using a laptop, vacuum cleaning, cheering, sitting still, tossing crumbled paper, playing games, lying on the sofa, walking, playing the guitar, standing up, and sitting down. The total number of videos is 320, each representing an activity instance from one of 10 actors and either of two different poses (standing close to the couch and sitting on it). The resolution of the depth frames is 320 × 240 which, with our parameterization, has led to a total of 419 descriptors per frame. For evaluation, we have used the most common training/test split for this dataset, with the first five subjects used for training and the remaining for testing. For annotation of the summaries, given the easier interpretability of these videos, we have used only a single annotator.

Results. Table VI reports the test-set results from the proposed method with different extents of summary supervision. The highest action recognition accuracy (60.6%) has been achieved with completely unsupervised summaries. This can be justified by the fact that, during training, the summary variables are free to take the values that maximize the training objective and this seems to generalize well on the test set. While the accuracy is not high in absolute terms, it is the highest reported to date for this dataset without the use of the skeletons' information. Conversely, the best value of the denormalized V-JAUNE measure on the test set (4.99) is achieved when training with full supervision of the summaries. However, the increase in value with decreasing supervision is modest, and unsupervised training may be regarded as the preferable tradeoff between action recognition accuracy and summary quality in this case. Please note that the values for V-JAUNE are generally higher than those reported in Table III, since, in this experiment, the measure does not include multi-annotator normalization.

Table VI. The Evaluation Results on the MSR DailyActivity3D Using Various Flavors of Learning

Learning style           Action accuracy    V-JAUNE (unnorm.)
Fully supervised              58.8%               4.99
Semi-supervised 10%           56.3%               5.02
Semi-supervised 20%           56.3%               5.10
Unsupervised                  60.6%               5.22

Table VII. Comparison of the Action Recognition Accuracy on the MSR DailyActivity3D Dataset (Depth Frames Only)

Method                                                                 Accuracy
libsvm [Chang and Lin 2011]                                              34.4%
Proposed method (all frames)                                             48.8%
Proposed method                                                          60.6%
Dynamic temporal warping [Wang et al. 2012; Muller and Roder 2006]       54.0%
Proposed method (RGB videos)                                             46.3%

For a comparative evaluation of the action recognition accuracy, we report the test-set accuracy of: (1) a reference system that uses the pooled descriptors from all frames as measurement and libsvm as the classifier; (2) the proposed system using all the frames and without the summarization component in the score function (i.e., λ1 = λ2 = 0); and (3) the proposed system with full functionalities. In addition, we have compared the action recognition accuracy with a system from the literature that uses dynamic time warping; to the best of our knowledge, this is the best reported accuracy without making use of the actors' skeletal joints in any form (locations or angles). Table VII shows that the accuracy achieved with the proposed method (60.6%) is much higher than that of the reference system (34.4%) and also remarkably higher than that of the proposed method using all the frames (48.8%). This proves that action recognition based on a selected summary can prove more accurate than recognition from the entire video, and validates the intuition of providing action recognition and summarization jointly. With the proposed method, various action classes ("vacuum cleaning," "cheering," "lying on the sofa," "walking," "standing up," and "sitting down") have been recognized with 100% accuracy, likely because they involve more macroscopic movements. On the other hand, classes "using cell phones" and "writing" have never been recognized correctly, most likely because of the small size and limited visual evidence of the objects involved. Since the proposed method can be applied to frame measurements of any nature, we have also carried out an experiment using the RGB videos of this dataset (Table VII). The results show that the accuracy using depth videos (60.6%) is remarkably higher than that using RGB videos (46.3%), suggesting that depth frames can provide more informative clues for recognizing actions than RGB frames alone.

For a quantitative evaluation of the summarization component, we have compared the V-JAUNE measure for the summaries obtained with the proposed method and with SAD: the loss with SAD (5.65) is significantly higher than with the proposed method with any extent of supervision (4.99–5.22). Also, from a qualitative perspective, the predicted summaries seem more informative, as displayed in Figure 5.

Fig. 5. Examples of summaries from the MSR DailyActivity3D dataset (displayed as RGB frames for ease of interpretation) for actions (a) Cheer and (d) Walk: in each subfigure, the first row is from the proposed method and the second from SAD. The results from the proposed method look more informative.

6. CONCLUSION

In this article, we have presented an approach for the joint inference of the action label and a frame summary of an action video. The approach leverages a generalized linear model together with submodularity to efficiently infer the action label and the summary. Learning of the model from unsupervised and semi-supervised training data has been performed by an original latent structural SVM approach still based on submodularity. Another contribution of this article has been the definition of a novel loss function, nicknamed V-JAUNE, for the quantitative evaluation of a summary's quality. The experimental results over two challenging action datasets of depth videos - ACE and MSR DailyActivity3D - have shown that:

—the joint inference of the action and the summary leads to higher action recognition accuracies than recognizing the actions from the full frame set (Tables II and VII). This confirms that using a selection of the video's frames has a positive effect akin to noise removal;

—the joint inference of the action and the summary delivers summaries of a quality comparable to or better than those of a traditional syntactic summarization approach (the sum of absolute differences, or SAD; Tables III and VI and accompanying text; Figures 4 and 5);

—the proposed submodular score function is capable of delivering remarkable accuracy, even in conjunction with a very simple and efficient greedy algorithm for the inference;


—the regularized minimum-risk framework of latent structural SVM is capable of delivering effective parameterization of the model.

As a further note, although the proposed method has not been explicitly designed to prove invariant to the duration of the actions, we expect the summaries to only be mildly variant with the actions' duration thanks to the non-redundancy component of the scoring function that discourages the inclusion of similar frames. As a consequence, the summaries of actions of different durations should be rather consistent, and action classification almost unaffected in principle. In practice, this would depend on a number of factors such as the actual content of the frames, the distribution of the classes, the relative weights of the components in the scoring function, and the extent of the variations in duration.

An interesting direction to explore in the future could be the extension of the training settings with various combinations of the summary and action recognition loss functions to allow tuning different tradeoffs between these two tasks in different applications.

REFERENCES

Jurandy Almeida, Neucimar J. Leite, and Ricardo da S. Torres. 2012. Vison: Video summarization for online applications. Pattern Recognition Letters 33, 4 (2012), 397–409.

Yasemin Altun, Mikhail Belkin, and David A. Mcallester. 2005. Maximum margin semi-supervised learning for structured variables. In Advances in Neural Information Processing Systems (NIPS). 33–40.

Francis R. Bach. 2013. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning 6, 2–3 (2013), 145–373.

William Brendel and Sinisa Todorovic. 2010. Activities as time series of human postures. In Proceedings of the European Conference on Computer Vision (ECCV). 721–734.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27.

Bo-Wei Chen, Jia-Ching Wang, and Jhing-Fa Wang. 2009. A novel video summarization based on mining the story-structure and semantic relations among concept entities. IEEE Transactions on Multimedia 11, 2 (2009), 295–312.

Yang Cong, Junsong Yuan, and Jiebo Luo. 2012. Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia 14, 1 (2012), 66–75.

Sandra Eliza Fontes De Avila, Ana Paula Brandao Lopes, Antonio da Luz, and Arnaldo de Albuquerque Araujo. 2011. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32, 1 (2011), 56–68.

Maxime Devanne, Hazem Wannous, Stefano Berretti, Pietro Pala, Mohamed Daoudi, and Alberto Del Bimbo. 2015. 3-D human action recognition by shape analysis of motion trajectories on Riemannian manifold. IEEE Trans. Cybernetics 45, 7 (2015), 1340–1352.

Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and Kate Saenko. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2625–2634.

Huizhong Duan, Yanen Li, ChengXiang Zhai, and Dan Roth. 2012. A discriminative model for query spelling correction with latent structural SVM. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'12). 1511–1521.

Naveed Ejaz, Irfan Mehmood, and Sung Wook Baik. 2013. Efficient visual attention based framework for extracting key frames from videos. Signal Processing: Image Communication 28, 1 (2013), 34–44.

Naveed Ejaz, Tayyab Bin Tariq, and Sung Wook Baik. 2012. Adaptive key frame extraction for video summarization using an aggregation mechanism. Journal of Visual Communication and Image Representation 23, 7 (2012), 1031–1040.

Joydeep Ghosh, Yong Jae Lee, and Kristen Grauman. 2012. Discovering important people and objects for egocentric video summarization. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1346–1353.


Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Retrieved from http://www.deeplearningbook.org.

Genliang Guan, Zhiyong Wang, Shaohui Mei, Max Ott, Mingyi He, and David Dagan Feng. 2014. A top-down approach for video summarization. ACM Trans. Multimedia Comput. Commun. Appl. 11, 1 (Sept. 2014), 4:1–4:21.

Michael Gygli, Helmut Grabner, and Luc Van Gool. 2015. Video summarization by learning submodular mixtures of objectives. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3090–3098.

Yong Hu and Wei Zheng. 2011. Human action recognition based on key frames. In Proceedings of the Advances in Computer Science and Education Applications (CSE). Springer, 535–542.

Fairouz Hussein, Sari Awwad, and Massimo Piccardi. 2016. Joint action recognition and summarization by sub-modular inference. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2697–2701.

Hamid Izadinia and Mubarak Shah. 2012. Recognizing complex events using large margin joint low-level event model. In Proceedings of the European Conference on Computer Vision (ECCV). 430–444.

Alexander Jaffe, Mor Naaman, Tamir Tassa, and Marc Davis. 2006. Generating summaries for large collections of geo-referenced photographs. In Proceedings of the 15th International Conference on World Wide Web (WWW). ACM, 853–854.

Herve Jegou, Matthijs Douze, Cordelia Schmid, and Patrick Perez. 2010. Aggregating local descriptors into a compact image representation. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3304–3311.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Fei-Fei Li. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1725–1732.

Gunhee Kim, Seungwhan Moon, and Leonid Sigal. 2015. Ranking and retrieval of image sequences from multiple paragraph queries. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1993–2001.

Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld. 2008. Learning realistic human actions from movies. In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 1–8.

Liangda Li, Ke Zhou, Gui-Rong Xue, Hongyuan Zha, and Yong Yu. 2011. Video summarization via transferrable structured learning. In Proceedings of the 20th International Conference on the World Wide Web (WWW). ACM, 287–296.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out, Vol. 8. 74–81.

Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics, 510–520.

Yang Liu, Feng Zhou, Wei Liu, Fernando De la Torre, and Yan Liu. 2010. Unsupervised summarization of rushes videos. In Proceedings of the 18th ACM International Conference on Multimedia (ACM). 751–754.

Guoliang Lu, Yiqi Zhou, Xueyong Li, and Peng Yan. 2016. Unsupervised, efficient and scalable key-frame selection for automatic summarization of surveillance videos. Multimedia Tools and Applications (2016), 1–23.

Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. 2002. A user attention model for video summarization. In Proceedings of the 10th ACM International Conference on Multimedia (ACM). 533–542.

Mark Maybury, Andrew Merlino, and James Rayson. 1997. Segmentation, content extraction and visualization of broadcast news video using multistream analysis. In Proceedings of the 5th ACM International Conference on Multimedia. 102–112.

Meinard Muller and Tido Roder. 2006. Motion templates for automatic classification and retrieval of motion capture data. In Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Eurographics Association, 137–146.

Padmavathi Mundur, Yong Rao, and Yelena Yesha. 2006. Keyframe-based video summarization using Delaunay clustering. International Journal on Digital Libraries 6, 2 (2006), 219–232.

Farhood Negin and Francois Bremond. 2016. Human Action Recognition in Videos: A Survey. INRIA Technical Report, Sophia Antipolis, France, 47 pages.

George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. 1978. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming 14, 1 (1978), 265–294.


Bingbing Ni, Pierre Moulin, and Shuicheng Yan. 2015. Pose adaptive motion feature pooling for human action analysis. International Journal of Computer Vision 111, 2 (2015), 229–248.

Juan Carlos Niebles, Chih-Wei Chen, and Li Fei-Fei. 2010. Modeling temporal structure of decomposable motion segments for activity classification. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 392–405.

Michalis Raptis and Leonid Sigal. 2013. Poselet key-framing: A model for human activity recognition. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2650–2657.

Mrinmaya Sachan, Kumar Dubey, Eric P. Xing, and Matthew Richardson. 2015. Learning answer-entailing structures for machine comprehension. In Proceedings of the 2015 Conference of the Association for Computational Linguistics (ACL). 239–249.

Konrad Schindler and Luc Van Gool. 2008. Action snippets: How many frames does human action recognition require? In Proceedings of the 2008 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1–8.

Atsushi Shimada, Kazuaki Kondo, Daisuke Deguchi, Geraldine Morin, and Helman Stern. 2013. Kitchen scene context based gesture recognition: A contest in ICPR2012. In Advances in Depth Image Analysis and Applications. Springer, 168–185.

Ruben Sipos, Pannaga Shivaswamy, and Thorsten Joachims. 2012. Large-margin learning of submodular summarization models. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). 224–233.

Rim Slama, Hazem Wannous, Mohamed Daoudi, and Anuj Srivastava. 2015. Accurate 3D action recognition using learning on the Grassmann manifold. Pattern Recognition 48, 2 (2015), 556–567.

Kevin D. Tang, Fei-Fei Li, and Daphne Koller. 2012. Learning latent temporal structure for complex event detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1250–1257.

Sebastian Tschiatschek, Rishabh K. Iyer, Haochen Wei, and Jeff A. Bilmes. 2014. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems (NIPS). 1413–1421.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6 (2005), 1453–1484.

Pavan K. Turaga, Ashok Veeraraghavan, and Rama Chellappa. 2009. Unsupervised view and rate invariant clustering of video sequences. Computer Vision and Image Understanding 113, 3 (2009), 353–371.

Andrea Vedaldi. 2011. A MATLAB wrapper of SVMstruct. Retrieved from http://www.vlfeat.org/vedaldi/code/1svm-struct-matlab.html.

Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu. 2011. Action recognition by dense trajectories. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3169–3176.

Heng Wang and Cordelia Schmid. 2013. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). 3551–3558.

Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. 2009. Evaluation of local spatio-temporal features for action recognition. In Proceedings of the 2009 British Machine Vision Conference (BMVC). 124–1.

Jiang Wang, Zicheng Liu, Ying Wu, and Junsong Yuan. 2012. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1290–1297.

Yang Wang and Greg Mori. 2011. Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 7 (2011), 1310–1323.

Xinxiao Wu, Dong Xu, Lixin Duan, Jiebo Luo, and Yunde Jia. 2013. Action recognition using multilevel features and latent structural SVM. IEEE Transactions on Circuits and Systems for Video Technology 23, 8 (2013), 1422–1431.

Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran, Yong Rui, and Thomas S. Huang. 2006. A Unified Framework for Video Summarization, Browsing, and Retrieval with Applications to Consumer and Surveillance Video. Elsevier/Academic Press.

Chunlei Yang, Jialie Shen, Jinye Peng, and Jianping Fan. 2013. Image collection summarization via dictionary learning for sparse representation. Pattern Recognition 46, 3 (2013), 948–961.

Xiaoshan Yang, Tianzhu Zhang, and Changsheng Xu. 2016. Semantic feature mining for video event understanding. ACM Trans. Multimedia Comput. Commun. Appl. 12, 4 (Aug. 2016), 55:1–55:22.


Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML). 1169–1176.

Junsong Yuan, Zicheng Liu, and Ying Wu. 2011. Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 9 (2011), 1728–1743.

Long Zhu, Yuanhao Chen, Alan Yuille, and William Freeman. 2010. Latent hierarchical structural learning for object detection. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1062–1069.

Received November 2016; revised January 2017; accepted February 2017
