
LEARNING A POSE LEXICON FOR SEMANTIC ACTION RECOGNITION

Lijuan Zhou, Wanqing Li, Philip Ogunbona

School of Computing and Information Technology, University of Wollongong, Keiraville, NSW 2522, Australia

[email protected], [email protected], [email protected]

ABSTRACT

This paper presents a novel method for learning a pose lexicon comprising semantic poses defined by textual instructions and their associated visual poses defined by visual features. The proposed method simultaneously takes two input streams, semantic poses and visual pose candidates, and statistically learns a mapping between them to construct the lexicon. With the learned lexicon, action recognition can be cast as the problem of finding the maximum translation probability of a sequence of semantic poses given a stream of visual pose candidates. Experiments evaluating pre-trained and zero-shot action recognition conducted on the MSRC-12 gesture and WorkoutSU-10 exercise datasets were used to verify the efficacy of the proposed method.

Index Terms— Lexicon, semantic pose, visual pose, action recognition.

1. INTRODUCTION

Human action recognition is currently one of the most active research topics in multimedia content analysis. Most recognition models are typically constructed from low- to middle-level visual spatio-temporal features and directly associated with class labels [1, 2, 3, 4, 5]. In particular, many methods [1, 6, 7, 8] have been developed based on the concept that an action can be well represented by a sequence of key or salient poses and that these salient poses can be identified through visual features alone. However, these salient poses often do not possess semantic significance, leading to the so-called semantic gap. We refer to the salient poses defined using visual features as “visual poses”.

The textual instruction of how a simple or elementary action should be performed is often expressed using Talmy’s typology of motion [9]. In this typology, a motion event is one in which an entity experiences a change in location with respect to another object, and such an event can be described by four semantic elements: Motion (M), Figure (F), Ground (G) and Path (P), where M refers to the movement of an entity F that changes its location relative to the reference object G along the trajectory P (including a start and an end). For a complex action, or an action that involves many body parts, a sequence of basic textual instructions will suffice. Each instruction, describing a part of the action, is made up of the four elements. A simplification of the trajectory P for human actions entails retaining the starting ($P_s$) and end ($P_e$) status or configuration of the body parts and ignoring the intermediate parts of the trajectory. We refer to both $P_s$ and $P_e$ as “semantic poses”. Hence, an action can be described by a sequence of semantic poses if F is defined as the whole human body. Alternatively, if F refers to a body part, multiple sequences of semantic poses can be used. Semantic poses can be obtained by parsing the textual instructions.

This paper proposes a method to construct a pose lexicon comprising a set of semantic poses and the corresponding visual poses, by learning a mapping between them. It is assumed that for each action there is a textual instruction from which a set of semantic poses can be extracted through natural language parsing [10], and that for each action sample a sequence of visual pose candidates can be extracted. The mapping task is formulated as a problem of machine translation. With the learned lexicon, action recognition can be considered as the problem of finding the maximum posterior probability of a given sequence of visual pose candidates being generated from a given sequence of semantic poses. This is equivalent to determining how likely the given sequence of visual pose candidates is to follow a sequence of semantic poses. Such a lexicon bridges the gap between semantics and visual features and offers a number of advantages, including text-based action retrieval and summarization, recognition of actions with small or even zero training samples (also known as zero-shot recognition), and easy growth of semantic poses for new action recognition, since poses in the lexicon are sharable by many actions.

The rest of this paper is organized as follows. Section 2 provides a review of previous work related to semantic action recognition. The proposed method for learning a pose lexicon and classifying actions is developed and formulated in Section 3. In Section 4, experiments are presented to demonstrate the effectiveness of the proposed method in recognition tasks using the MSRC-12 Kinect gesture [11] and WorkoutSU-10 exercise [12] datasets, and novel actions extracted from the two datasets. Finally, the paper is concluded with remarks in Section 5.

arXiv:1604.00147v1 [cs.CV] 1 Apr 2016

2. RELATED WORK

Despite the good progress made in action recognition over the past decade, few studies have reported methods based on semantic learning. Earlier methods bridged the semantic gap using mid-level features (e.g. visual keywords [13]) obtained by quantizing low-level spatio-temporal features into a visual vocabulary. However, mid-level features are not sufficiently robust to obtain good performance on relatively large action datasets. This problem has been addressed by proposing high-level latent semantic features to represent semantically similar mid-level features. Unsupervised methods [14, 15] were previously applied for learning latent semantics based on topic models; examples include probabilistic latent semantic analysis [16] and latent Dirichlet allocation (LDA) [17]. Recently, a multi-layer model [18] based on LDA was proposed for learning local and global action semantics. The intuitive basis for using mid- and high-level latent semantic features is that frequently co-occurring low-level features are correlated at some conceptual level. It is noteworthy that these two kinds of semantic features have no explicit semantic relationship to the problem, a situation different from the proposed semantic poses.

Apart from learning latent semantics of actions, other approaches have focused on defining semantic concepts to describe action- or activity-related properties. Actions have been described by a set of attributes that possess spatial characteristics. Unfortunately, such attributes are not specific enough to allow subjects to recreate the actions [19, 20]. They are also difficult to describe, as there is no common principle for describing different actions. In our work, textual instructions use a common principle (the four semantic elements) for action representation, and they also provide an unambiguous guideline for performing actions. Activities have also been represented by basic actions and the corresponding participants such as subjects, objects and tools [21, 22]; this representation ties the object and the action together. The work presented in this paper focuses on single actions that do not depend on other objects.

3. PROPOSED METHOD

Visual poses can be extracted from RGB images, depth maps or skeleton data. In this work, we take skeleton data as an example to illustrate the proposed method. Inspired by the translation model for image annotation [23], a translation model from visual poses to semantic poses, namely the “visual pose-to-semantic pose translation model” (VS-TM), is proposed for learning a pose lexicon from skeleton data with instructions. Figure 1 illustrates the action recognition framework based on the VS-TM. In the training phase, a pose lexicon is constructed in two main steps. First, a parallel corpus is constructed based on a stream of semantic poses and a stream of visual pose candidates. Second, a mapping between the two streams is learned from the parallel corpus to generate the lexicon for inferring optimal visual poses from the candidates. In the test phase, actions are classified according to the maximum translation probability (MTP) of a semantic pose sequence given a sequence of visual pose candidates.

[Figure 1 shows the pipeline: textual instructions of training actions (e.g. “Lift arms”, “Change weapon”, with T1: “Arms are beside legs”, T2: “Arms are overhead”) are parsed and clustered into a semantic pose set, while training videos are processed with the moving pose descriptor and k-means into a visual pose candidate set; quantized sequence pairs form the parallel corpus from which the VS-TM learns the lexicon, and test videos are classified by MTP.]

Fig. 1: Framework of the proposed method.

3.1. Parallel corpus construction

An action instance is represented by two streams: a semantic pose stream and a visual pose candidate stream. The parallel corpus consists of multiple such stream pairs, constructed from each action instance through vector quantization on the semantic pose and visual pose candidate sets. Sections 3.1.1 and 3.1.2 describe how the two sets are constructed.

3.1.1. Semantic pose generation

Semantic poses are constructed from the start and end poses $P_s$ and $P_e$. $P_s$ refers to the configuration of the human body in which the action starts, and $P_e$ indicates the point at which the body parts reach a salient configuration, e.g. maximum extension. $P_s$ and $P_e$ are normally encoded by prepositional phrases in the textual instruction, and a constituency-based parser (e.g. the Berkeley parser [10]) can be used to extract them. Note that parsing is not the focus of this paper. Suppose an action instance contains $G$ elementary or simple actions. The semantic pose sequence can be written as $\{P_{s_1}, P_{e_1}, \ldots, P_{s_i}, P_{e_i}, \ldots, P_{s_G}, P_{e_G}\}$, where $P_{s_i}$ and $P_{e_i}$ denote the semantic poses of the $i$-th elementary action. Once the semantic poses are extracted, similar semantic poses are replaced by a single symbol and the semantic pose set is easily constructed.
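As a minimal illustration of this set construction (the function name and phrase strings below are ours, not the paper's; real inputs would come from parsed instructions), identical start/end phrases can be merged into shared symbols:

```python
# Illustrative sketch only: map the start/end phrases of each elementary
# action to semantic-pose symbols, merging identical phrases so that similar
# poses share one symbol, as described above.
def build_semantic_pose_set(instructions):
    """instructions: list of (P_s phrase, P_e phrase), one per elementary action."""
    symbol_of = {}   # phrase -> symbol, e.g. "T1"
    sequence = []    # semantic pose sequence of the whole action
    for ps, pe in instructions:
        for phrase in (ps, pe):
            if phrase not in symbol_of:
                symbol_of[phrase] = f"T{len(symbol_of) + 1}"
            sequence.append(symbol_of[phrase])
    return symbol_of, sequence

# Hypothetical "Lift arms" instruction with a single elementary action:
symbols, seq = build_semantic_pose_set(
    [("arms are beside legs", "arms are overhead")])
# seq == ["T1", "T2"]
```

An action reusing a phrase (e.g. returning to the start pose) would then reuse the same symbol, which is what makes poses sharable across actions.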

3.1.2. Visual pose candidate generation

Key frames are first extracted from each action instance to visually represent the start and end semantic poses. Visual pose candidates are then generated by clustering the key frames of all action instances. In this paper, we consider key frames to be frames at which body parts reach maximum or minimum extension. If we consider the skeleton joints as observations, the covariance matrix of the joint positions at an instant of time captures the stretch of the body parts, i.e. the distribution of the joints in space at that instant. Hence, the covariance of joint positions is used for extracting key frames.

Given an action instance containing $F$ frames, $F$ feature vectors are generated to represent the instance using a moving pose descriptor [24]. A covariance is calculated from each feature vector, resulting in a $3 \times 3$ matrix. Let $\Sigma_f$ denote the covariance matrix at frame $f$ ($f \in \{1, \ldots, F\}$). The relationships amongst the joints of the pose can be analysed by performing an eigen-decomposition of $\Sigma_f$. For each frame we select the largest eigenvalue, denoted by $\lambda_f$, producing the sequence $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_f, \ldots, \lambda_F\}$.

To reduce the impact of noise in the skeletons, we smooth the sequence $\Lambda$ along the time dimension with a moving Gaussian filter of window size 5 frames. Frames whose smoothed largest eigenvalues are bigger or smaller than those of the neighbouring frames are then extracted as key frames. In particular, frame $f$ is a key frame if

$$(\lambda_f > \lambda_{f+1} \text{ and } \lambda_f > \lambda_{f-1}) \quad \text{or} \quad (\lambda_f < \lambda_{f+1} \text{ and } \lambda_f < \lambda_{f-1}). \qquad (1)$$
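A minimal numpy sketch of this key-frame selection, assuming `joints` is an $(F, J, 3)$ array of $J$ 3D joint positions over $F$ frames (the array layout, function name and `sigma` parameter are our assumptions, not from the paper):

```python
import numpy as np

# Sketch of key-frame extraction: per-frame covariance of joint positions,
# largest eigenvalue, Gaussian smoothing (window 5, as in the paper), then
# local extrema of the smoothed eigenvalue sequence, Eq. (1).
def key_frames(joints, sigma=1.0):
    F = joints.shape[0]
    lams = np.empty(F)
    for f in range(F):
        cov = np.cov(joints[f].T)               # 3x3 covariance of joint positions
        lams[f] = np.linalg.eigvalsh(cov)[-1]   # largest eigenvalue (ascending order)
    # moving Gaussian filter of window size 5
    x = np.arange(-2, 3)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    sm = np.convolve(lams, g, mode="same")
    # frames whose smoothed eigenvalue is a local maximum or minimum
    return [f for f in range(1, F - 1)
            if (sm[f] > sm[f - 1] and sm[f] > sm[f + 1])
            or (sm[f] < sm[f - 1] and sm[f] < sm[f + 1])]
```

On a synthetic sequence whose joint spread oscillates, the returned frames sit at the peaks and troughs of body extension, matching the intent of Eq. (1).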

The extracted key frames of all action instances are clustered using the k-means algorithm, and the cluster centres are taken as visual pose candidates. The selection of $k$ depends on the size of the semantic pose set, which in turn is determined by the number of elementary actions and is therefore available before learning the lexicon. One semantic pose can be mapped to multiple visual pose candidates when those candidates are similar; however, one visual pose candidate corresponds to at most one semantic pose. Therefore, $k$ is chosen to be equal to or larger than the number of semantic poses, so that every semantic pose can correspond to at least one visual pose candidate.
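For illustration, a minimal Lloyd's-style k-means over key-frame feature vectors (any standard implementation would serve; the function signature and `seed` parameter are ours):

```python
import numpy as np

# Minimal Lloyd's k-means sketch: cluster key-frame feature vectors X (rows)
# into k visual pose candidates; the paper chooses k >= the number of
# semantic poses.
def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        # assign each point to its nearest centre
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # move each centre to the mean of its assigned points
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return centers, labels
```

The returned `centers` play the role of the visual pose candidate set; `labels` quantize each key frame to a candidate, which is the quantization step in the framework.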

3.2. Lexicon Learning

3.2.1. Formulation

Given the parallel corpus, which has an underlying correspondence between sequences of semantic poses and visual pose candidates, the problem of learning an action lexicon entails determining precise correspondences among the elements of the two sequences. Let the set of visual pose candidates be denoted by $S = \{S_1, S_2, \ldots, S_p, \ldots\}$ and the semantic pose set by $T = \{T_1, T_2, \ldots, T_q, \ldots\}$. The task of lexicon construction is then converted to finding the most likely visual pose candidate given a semantic pose, based on the conditional probability $P(S_p|T_q)$.

The underlying correspondence between the two sequences provides an opportunity to model the problem with machine translation. Each sequence pair encodes the start and end positions of actions, which are discrete units analogous to words in a translation model. This observation makes the word-based machine translation framework applicable to our problem: the sequence of visual pose candidates is analogous to the source language, and the semantic pose sequence to the target language. Following the standard word-based translation framework [25], learning the lexicon is converted into a translation-model problem. Hence, we develop a translation model from visual poses to semantic poses (VS-TM) based on the parallel corpus to learn a pose lexicon.

We now describe the translation model (VS-TM). Let $M_n$ denote the number of visual pose candidates in the $n$-th action instance. The quantized sequence of visual pose candidates can be written as $s_n = \{s_{n1}, s_{n2}, \ldots, s_{nj}, \ldots, s_{nM_n}\}$ ($s_{nj} \in S$). Similarly, if $L_n$ represents the number of semantic poses in the $n$-th action instance, then its semantic pose sequence can be written as $t_n = \{t_{n1}, t_{n2}, \ldots, t_{ni}, \ldots, t_{nL_n}\}$ ($t_{ni} \in T$). The VS-TM finds the most likely sequence of visual pose candidates for each semantic pose sequence through the conditional probability $P(s_n|t_n)$.

In the word-based translation model, the conditional probability of two sequences is decomposed into conditional probabilities of the elements of the sequences. However, we do not know the correspondence between individual elements of a sequence pair. We therefore introduce a hidden variable $a_n$ that determines the alignment of $s_n$ and $t_n$; the alignments of the $N$ sequence pairs form a set $a = \{a_1, a_2, \ldots, a_n, \ldots, a_N\}$. Based on $a_n$, the translation of a sequence pair is accomplished by summing the conditional probabilities over all possible alignments. Hence, we learn the VS-TM through element-to-element alignment models, and $P(s_n|t_n)$ is calculated as

$$P(s_n|t_n) = \sum_{a_n} P(s_n, a_n|t_n). \qquad (2)$$

If each visual pose candidate in the sequence can be aligned to at most one semantic pose, we guarantee that a visual pose candidate corresponds to only one semantic pose. To ensure this constraint, we construct the alignment from the visual to the semantic pose sequence. The alignment of the $n$-th instance can be written as $a_n = \{a_{n1}, a_{n2}, \ldots, a_{nj}, \ldots, a_{nM_n}\}$ ($a_{nj} \in [0, L_n]$), where $a_{nj}$ represents the alignment position of the $j$-th visual pose candidate. If the $j$-th visual pose candidate is aligned to the $i$-th semantic pose, we write $a_{nj} = i$; $a_{nj} = 0$ refers to the situation in which no semantic pose corresponds to this visual pose candidate, which happens when the candidate is noisy. According to the alignment $a_n$, Equation (2) can be expanded, without loss of generality, by the chain rule as follows:

$$P(s_n|t_n) = \sum_{a_n} \prod_{j=1}^{M_n} P(a_{nj} \mid s_{n1}^{n(j-1)}, a_{n1}^{n(j-1)}, t_{n1}^{nL_n}) \times P(s_{nj} \mid s_{n1}^{n(j-1)}, a_{n1}^{nj}, t_{n1}^{nL_n}), \qquad (3)$$

where the first factor is the alignment probability, the second is the translation probability, and $x_{n1}^{nj} = \{x_{n1}, x_{n2}, \ldots, x_{nj}\}$ ($x \in \{s, t, a\}$). Since lexicon acquisition aims to find the conditional probabilities between visual and semantic poses, we further assume that the alignment probabilities are equal (i.e. $\frac{1}{L_n+1}$) and that $s_{nj}$ depends only on the sequence element at position $a_{nj}$, which is $t_{na_{nj}}$ (equal to $t_{ni}$). Hence, Equation (3) can be rewritten as

$$P(s_n|t_n) = \sum_{a_{nj}=0}^{L_n} \prod_{j=1}^{M_n} \frac{1}{L_n+1} P(s_{nj} \mid t_{na_{nj}}), \qquad (4)$$

where the translation probability is constrained by $\sum_{S_p} P(S_p|T_q) = 1$ for any $T_q$.

For the $N$ action instances in the training parallel corpus, the proposed VS-TM aims to maximize the translation probability $P(s|t)$ through

$$P(s|t) = \prod_{n=1}^{N} \prod_{j=1}^{M_n} \sum_{i=0}^{L_n} \frac{1}{L_n+1} P(s_{nj}|t_{ni}). \qquad (5)$$

Here, it is easy to verify that the sum and product in Equation (4) can be interchanged.
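The interchange can be checked numerically on a toy, hypothetical translation table: summing the product over every alignment vector equals the product of per-position sums (the table `P` and sequences below are made up for illustration).

```python
from itertools import product

# Hypothetical translation table P(s|t); t[0] plays the role of the
# empty-alignment ("null") pose, a_nj = 0.
P = {("A", "T0"): 0.2, ("A", "T1"): 0.7, ("A", "T2"): 0.1,
     ("B", "T0"): 0.1, ("B", "T1"): 0.2, ("B", "T2"): 0.7}
s = ["A", "B", "A"]      # visual pose candidate sequence
t = ["T0", "T1", "T2"]   # semantic poses, positions 0..L
L = len(t) - 1           # L_n in the paper

# left-hand side: sum over every alignment vector a in {0..L}^M, as in Eq. (4)
lhs = 0.0
for a in product(range(L + 1), repeat=len(s)):
    p = 1.0
    for j, i in enumerate(a):
        p *= P[(s[j], t[i])] / (L + 1)
    lhs += p

# right-hand side: product over positions of per-position sums, as in Eq. (5)
rhs = 1.0
for sj in s:
    rhs *= sum(P[(sj, ti)] for ti in t) / (L + 1)

assert abs(lhs - rhs) < 1e-12  # the two forms agree by distributivity
```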

3.2.2. Optimization

The expectation maximization (EM) algorithm is invoked for the optimization by mapping the translation probability $P(S_p|T_q)$ to the parameter $\theta$ and the alignment $a$ to the unobserved data. The likelihood function $\mathcal{L}$ is defined as

$$\mathcal{L}_\theta(s, t, a) = \prod_{n=1}^{N} \prod_{j=1}^{M_n} \sum_{i=0}^{L_n} \frac{1}{L_n+1} P(s_{nj}|t_{ni}). \qquad (6)$$

Maximizing the likelihood function $\mathcal{L}$ is carried out by seeking an unconstrained extremum of the auxiliary function

$$h(P, \beta) \equiv \prod_{n=1}^{N} \prod_{j=1}^{M_n} \sum_{i=0}^{L_n} \frac{1}{L_n+1} P(s_{nj}|t_{ni}) - \sum_{T_q} \beta \Big( \sum_{S_p} P(S_p|T_q) - 1 \Big). \qquad (7)$$

In the E-step, the posterior probability of an alignment is calculated by

$$P_\theta(a_{nj}|s_{nj}, t_{ni}) = \frac{P(s_{nj}|t_{ni})}{\sum_{i=0}^{L_n} P(s_{nj}|t_{ni})}. \qquad (8)$$

In the M-step, the parameter $\theta$ is updated through

$$P(S_p|T_q) = \beta^{-1} \sum_{n=1}^{N} \sum_{j=1}^{M_n} \sum_{i=0}^{L_n} P_\theta(a_{nj}|s_{nj}, t_{ni}) \, \delta(S_p, s_{nj}) \, \delta(T_q, t_{ni}). \qquad (9)$$

Here, $\delta(\cdot, \cdot)$ is 1 if its two arguments are equal and 0 otherwise, and $\beta$ normalizes the probabilities.
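This E/M pair is essentially the classic IBM Model 1 estimator; a compact Python sketch under that assumption follows (function and variable names are ours; the uniform $\frac{1}{L_n+1}$ alignment factor cancels in the posterior of Eq. (8), so it is omitted):

```python
from collections import defaultdict

# Illustrative EM sketch for the pose lexicon: `corpus` is a list of
# (visual pose candidate sequence, semantic pose sequence) pairs; a None
# prepended to each t plays the role of the empty alignment (a_nj = 0).
def learn_lexicon(corpus, iters=20):
    # uniform initialisation of P(S_p | T_q), as in Section 4.1
    S = {sp for s, _ in corpus for sp in s}
    prob = defaultdict(lambda: 1.0 / len(S))
    for _ in range(iters):
        count = defaultdict(float)   # expected counts, the delta sums of Eq. (9)
        total = defaultdict(float)   # per-T_q normaliser (the beta in Eq. 9)
        for s, t in corpus:
            t = [None] + list(t)     # position 0: null semantic pose
            for sj in s:
                denom = sum(prob[(sj, ti)] for ti in t)  # Eq. (8) denominator
                for ti in t:
                    c = prob[(sj, ti)] / denom           # posterior, Eq. (8)
                    count[(sj, ti)] += c
                    total[ti] += c
        # M-step, Eq. (9): renormalise expected counts per semantic pose
        prob = defaultdict(float,
                           {(sp, tq): c / total[tq] for (sp, tq), c in count.items()})
    return prob
```

On the toy corpus of Figure 1 (visual candidates A, B against semantic poses T1, T2), the learned table concentrates $P(A|T1)$ and $P(B|T2)$, which is exactly the lexicon entry one would read off the figure.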

3.3. Action classification

Once the translation model from visual pose candidates to semantic poses is learned, the task of action classification is converted to finding the most likely semantic pose sequence given a sequence of visual pose candidates. This is the decoding process of a machine translation system [25]. Since the textual instructions of all action classes are available, we reduce the search space to the solution space containing the instructions of all trained actions.

Let $s_{test} = \{s'_1, \ldots, s'_j, \ldots, s'_m\}$ ($s'_j \in S$) denote the sequence of visual pose candidates in a test action instance and $t_{test} = \{t'_1, \ldots, t'_i, \ldots, t'_l\}$ ($t'_i \in T$) its semantic pose sequence. The alignment is written as $a_{test} = \{a'_1, \ldots, a'_j, \ldots, a'_m\}$. Given the model parameter $\theta$, the action is classified based on $P(s_{test}|t_{test})$, which is calculated by finding the best alignment so as to avoid summing over all possible alignments. In particular, it is formulated as

$$\{t_{test}, a_{test}\} = \arg\max_{\{t_{test}, a_{test}\}} \prod_{j=1}^{m} P_\theta(a'_j | s'_j, t'_i). \qquad (10)$$
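Since only the trained instructions are searched, the decoding step can be sketched as scoring each class's semantic pose sequence and keeping the maximum (the lexicon, class names and function below are hypothetical illustrations, not the paper's code):

```python
# Decoding sketch: score each candidate class's semantic pose sequence
# against the test visual pose candidate sequence, taking the best alignment
# per position as in Eq. (10), and return the highest-scoring class.
def classify(s_test, class_sequences, prob):
    """class_sequences: {action label: semantic pose sequence};
    prob: learned lexicon {(S_p, T_q): P(S_p | T_q)}."""
    best_label, best_score = None, float("-inf")
    for label, t in class_sequences.items():
        t_full = [None] + list(t)   # include the null pose (a'_j = 0)
        score = 1.0
        for sj in s_test:
            # best alignment for position j, with the uniform 1/(l+1) factor
            score *= max(prob.get((sj, ti), 0.0) for ti in t_full) / len(t_full)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```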

4. EXPERIMENTS AND RESULTS

4.1. Datasets and experimental setup

Two action datasets, MSRC-12 Kinect gesture [11] and WorkoutSU-10 [12], were used to evaluate our method. The MSRC-12 Kinect gesture dataset was collected from 30 subjects performing 12 gestures and contains 594 sequences of skeletal body part movements; each sequence contains 9 to 11 instances, giving 6244 instances in total. The WorkoutSU-10 dataset was collected from 15 subjects performing 10 fitness exercises, each repeated 10 times by each subject. It contains 510550 frames, resulting in 1500 instances with a large time span per instance; the large time spans make our key-frame extraction particularly valuable.

The textual instructions of the actions in the two datasets, and the linguistic descriptions of the extracted semantic poses, were produced manually and are shown in the supplementary material¹. However, it is not difficult to obtain textual instructions automatically, as they are often used by sports trainers and may be available online. In total, 16 and 15 semantic poses are applied to the MSRC-12 and WorkoutSU-10 datasets, respectively. The ground truth lexicon, which shows the optimal visual pose corresponding to each semantic pose, is illustrated in Figure 2; symbols are used to represent the semantic poses.

Experiments were conducted to evaluate the proposed method on two classification problems: trained and untrained actions. To compare with state-of-the-art algorithms, a cross-subject evaluation scheme was applied at the instance level of actions. Moreover, the initial value of the translation probability $P(S_p|T_q)$ is assigned uniformly subject to $\sum_{S_p} P(S_p|T_q) = 1$.

¹ Supplementary material is attached at the end of the paper.

[Figure 2 shows the optimal visual pose for each semantic pose T1–T28.]

Fig. 2: Ground truth optimal visual poses corresponding to semantic poses.

4.2. Results

4.2.1. Trained action classification

We tested the two datasets separately and learned a lexicon for each (see Figures 3 and 4), which was then used for action recognition. Comparing the learned lexicons with the ground truth lexicon, one finds that the lexicons for MSRC-12 and WorkoutSU-10 are almost consistent with the ground truth, except for semantic pose T4 in the MSRC-12 dataset, whose corresponding pose is confused with that of T5. These two semantic poses are used by only one action, “Push right”, so the fewer co-occurrences of semantic poses and visual pose candidates reduce the performance of the proposed method. Variation among the subjects performing this action also causes confusion, as subjects may skip the elementary action from T1 to T4.

The learned lexicon was further verified through action recognition. The accuracies obtained on the MSRC-12 and WorkoutSU-10 datasets are 85.86% and 98.71%, respectively. Comparative results with discriminative models are shown in Table 1. Although the performance on MSRC-12 is slightly worse than that of the discriminative models [26, 12, 27], it is a reasonable result considering that we do not model the reference object (G) and simply take the whole body as the figure. Consequently, it is hard to distinguish the actions “Goggles” and “Had enough”, which have particular reference objects. The result on the WorkoutSU-10 dataset outperforms the discriminative model RDF [12], even though the number of instances in [12] was 300 fewer than in our experiment.

To demonstrate the performance of semantic action recognition with the aid of textual instructions, we compare the proposed method with state-of-the-art semantic learning methods. As attributes [19, 20] are hard to describe and not comparable, we opt to compare with state-of-the-art latent semantic learning. The comparisons are made with generative models, and the results are shown in Table 1; the proposed method can itself be categorised as a generative model. In particular, we compare it with the classical latent Dirichlet allocation (LDA) model [14] and a hierarchical generative model (HGM) [18], which is a two-layer LDA. In both models, a word is analogous to a visual pose candidate of the proposed method and a topic to a semantic pose. Hence, in LDA the number of topics is set equal to the number of semantic poses, and in HGM the number of global topics is the same as the number of semantic poses. The comparative results show that the proposed method outperforms HGM and LDA, clearly indicating its semantic representation power for action recognition. Moreover, experiments investigating the role of $k$ (the size of the visual pose candidate set) showed that the recognition rate of the proposed method increases with $k$, levelling off when $k$ is about 5 or 6 times the size of the semantic pose set².

Fig. 3: A lexicon based on the MSRC-12 dataset (semantic poses T1–T16).

Fig. 4: A lexicon based on the WorkoutSU-10 dataset.

Table 1: Comparative results: trained action classification.

Methods         MSRC-12    WorkoutSU-10
Cov3DJ [26]     91.70%     -
RDF [12]        94.03%     98%
ELC-KSVD [27]   90.22%     -
LDA [14]        74.81%     92.27%
HGM [18]        66.25%     82.37%
Ours            85.86%     98.71%

4.2.2. Zero-shot classification

Zero-shot classification recognizes actions that have not been seen in training. Notice that four semantic poses (T1, T2, T3 and T7) are shared by different actions across the two datasets; this motivated selecting all actions that use these semantic poses for the experiments. We also constructed a synthesized set of composite activities by concatenating single actions from WorkoutSU-10, in order to enlarge the number of test actions. Specifically, the training actions are: Change weapon, Beat both, Hip flexion (A1), Trunk rotation (A2), Hip adductor stretch (B2), Hip adductor stretch (B3), Curl-to-press (C1) and Squats (C2). Three single actions and three composite activities were used for testing; the recognition accuracies are shown in Table 2. The results demonstrate that the proposed method, based on semantic action poses, is effective for zero-shot action recognition.

² Graphical results are available in the supplementary material.

Table 2: Results of zero-shot classification.

Testing set          Action        Accuracy
Single action        Lift arms     92.72%
                     Duck          61.40%
                     Wind up       54.70%
Composite activity   A1 then A2    99.30%
                     B2 then B3    100%
                     C1 then C2    98.67%

5. CONCLUSION

This paper has presented a novel method for learning a pose lexicon that consists of semantic poses and visual poses. Experimental results showed that the proposed method can effectively learn the mapping between semantic and visual poses, as verified in both pre-trained and zero-shot action recognition.

The proposed method can be easily extended to semantic action recognition based on RGB or depth datasets and provides a foundation for building action-verb and activity-phrase hierarchies. A single large lexicon can also be learned for action recognition across different datasets. In addition, future work will explore richer representations of semantic poses, modelling reference objects and the middle part of trajectories, since only the start and end positions of trajectories were considered here.

6. REFERENCES

[1] W. Li, Z. Zhang, and Z. Liu, “Expandable data-driven graphical modeling of human actions based on salient postures,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1499–1510, 2008.

[2] D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Comput. Vision Image Underst., vol. 115, no. 2, pp. 224–241, 2011.

[3] P. Wang, W. Li, P. Ogunbona, Z. Gao, and H. Zhang, “Mining mid-level features for action recognition based on effective skeleton representation,” in Proc. Int. Conf. Digital Image Computing Techniques Appl., 2014.

[4] P. Wang, W. Li, Z. Gao, C. Tang, J. Zhang, and P. Ogunbona, “Convnets-based action recognition from depth maps through virtual cameras and pseudocoloring,” in Proc. Annual Conf. ACM Multimedia, 2015.

[5] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. O. Ogunbona, “Action recognition from depth maps using deep convolutional neural networks,” IEEE Trans. Human-Mach. Syst., accepted 2 November 2015.

[6] W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3D points,” in Proc. Comput. Vision Pattern Recognit. Workshops, 2010.

[7] C. Wang, Y. Wang, and A. L. Yuille, “An approach to pose-based action recognition,” in Proc. Comput. Vision Pattern Recognit., 2013.

[8] A. Eweiwi, M. S. Cheema, C. Bauckhage, and J. Gall, “Efficient pose-based action recognition,” in Proc. Asian Conf. Comput. Vision. Springer, 2014, pp. 428–443.

[9] L. Talmy, Toward a Cognitive Semantics. MIT Press, 2003, vol. 1.

[10] S. Petrov, L. Barrett, R. Thibaux, and D. Klein, “Learning accurate, compact, and interpretable tree annotation,” in Proc. Annual Meeting Assoc. Comput. Ling., 2006.

[11] S. Fothergill, H. Mentis, P. Kohli, and S. Nowozin, “Instructing people for training gestural interactive systems,” in Proc. SIGCHI Conf. Human Factors Comput. Syst., 2012.

[12] F. Negin, F. Ozdemir, C. B. Akgul, K. A. Yuksel, and A. Ercil, “A decision forest based feature selection framework for action recognition from RGB-Depth cameras,” in Image Anal. and Recognit. Springer, 2013, pp. 648–657.

[13] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in Proc. Comput. Vision Pattern Recognit., 2008.

[14] J. C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised learning of human action categories using spatial-temporal words,” Int. J. Comput. Vision, vol. 79, no. 3, pp. 299–318, 2008.

[15] Y. Wang and G. Mori, “Human action recognition by semi-latent topic models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 10, pp. 1762–1774, 2009.

[16] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Mach. Learning, vol. 42, no. 1-2, pp. 177–196, 2001.

[17] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, 2003.

[18] S. Yang, C. Yuan, W. Hu, and X. Ding, “A hierarchical model based on latent Dirichlet allocation for action recognition,” in Proc. Int. Conf. Pattern Recognit., 2014.

[19] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” in Proc. Comput. Vision Pattern Recognit., 2011.

[20] Z. Zhang, C. Wang, B. Xiao, W. Zhou, and S. Liu, “Attribute regularization based human action recognition,” IEEE Trans. Inf. Forensics Security, vol. 8, no. 10, pp. 1600–1609, 2013.

[21] M. Rohrbach, M. Regneri, M. Andriluka, S. Amin, M. Pinkal, and B. Schiele, “Script data for attribute-based recognition of composite activities,” in Proc. European Conf. Comput. Vision, 2012.

[22] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, “Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition,” in Proc. Int. Conf. Comput. Vision, 2013.

[23] P. Duygulu, K. Barnard, J. F. de Freitas, and D. A. Forsyth, “Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary,” in Proc. European Conf. Comput. Vision, 2002.

[24] M. Zanfir, M. Leordeanu, and C. Sminchisescu, “The moving pose: An efficient 3D kinematics descriptor for low-latency action recognition and detection,” in Proc. Int. Conf. Comput. Vision, 2013.

[25] P. Koehn, Statistical Machine Translation. Cambridge University Press, 2009.

[26] M. E. Hussein, M. Torki, M. A. Gowayyed, and M. El-Saban, “Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations,” in Proc. Int. Joint Conf. Artificial Intelligence, 2013.

[27] L. Zhou, W. Li, Y. Zhang, P. Ogunbona, D. T. Nguyen, and H. Zhang, “Discriminative key pose extraction using extended LC-KSVD for action recognition,” in Proc. Int. Conf. Digital Image Computing Techniques Appl., 2014.

Supplementary material:

In this supplementary material, the textual instructions of the actions in the MSRC-12 gesture [11] and WorkoutSu-10 exercise [12] datasets are described in the first section. The second section details the extracted semantic poses of the two datasets. In the third section, the experimental results on the recognition accuracy vs. the number of visual pose candidates are presented.

Textual instructions

Beginning with a starting configuration of the body, the instructions consist of a sequential description of the elementary motions that constitute an action. The instructions can be understood and followed by a subject of average literacy to perform the action.

MSRC-12

• Lift outstretched arms:

– Begin in standing position with arms beside body.

– Raise outstretched arms from sides to shoulder height until overhead.

– Return to starting position.

• Duck:

– Begin in standing position with arms beside body.

– Slightly bend knee.

– Return to starting position.

• Push right:

– Begin in standing position with arms beside body.

– Bring right hand in front of tummy.

– Slide right hand towards the right until same plane as torso.

– Return to starting position.

• Goggles:

– Begin in standing position with arms beside body.

– Raise hands through tummy, chest until eyes.

– Return to starting position.

• Wind up:

– Begin in standing position with arms beside body.

– Raise arms through tummy, chest until shoulder height.

– Bring arms back and return to starting position.

• Shoot:

– Begin in standing position with arms beside body.

– Stretch arms out in front of body with clasped hands.

– Slightly bring up hands.

– Return to starting position.

• Bow:

– Begin in standing position with arms beside body.

– Bend forwards with torso parallel with the floor.

– Return to starting position.

• Throw:

– Begin with standing position.

– Step left foot forward while raising right hand overhead and back to the point of maximum external shoulder rotation.

– Accelerate right arm forward until elbow is completely straight.

– Return to starting position.

• Had enough:

– Begin with standing position.

– Raise arms with hands on the head.

– Return to starting position.

• Change weapon:

– Begin with standing position.

– Reach over left shoulder with right hand.

– Bring both hands in front of tummy as if they are holding something.

– Return to starting position.

• Beat both:

– Begin with standing position.

– Raise hands to the head level.

– Beat hands sidewards for a few times.

– Return to starting position.

• Kick:

– Begin with standing position.

– Step left foot forward while raising right foot to the right knee.

– Straighten right leg forward.

– Return to starting position.

WorkoutSu-10

• Hip flexion (A1)

– Begin in standing position with arms beside body.

– Place hands on the waist.

– Flex left leg at the hip up to 90 degrees and bend the knee.

– Return to starting position.

• Torso rotation (A2)

– Begin in standing position with arms beside body.

– Outstretch arms at shoulder height with clasped hands and flex left leg at knee.

– Rotate torso towards left side at almost 45 degrees.

– Rotate torso to the right.

– Return to starting position.

• Lateral stepping (A3)

– Begin in standing position with arms beside body.

– Place hands on the waist.

– With right foot step laterally to the right.

– Bring left foot beside right foot.

– Repeat the last two elementary actions several times.

– With left foot step laterally to the left.

– Bring right foot beside left foot.

– Repeat the last two elementary actions several times.

– Return to starting position.

• Thoracic rotation (B1)

– Begin in standing position with arms flexed at the elbows and hands raised at shoulder height.

– Rotate torso to left side.

– Return to starting position.

• Hip Adductor stretch (B2)

– Begin in standing position with arms beside body.

– Spread legs beyond shoulder width.

– Shift body weight to right leg by bending the knee up to 90 degrees and straightening the left leg.

– Return to starting position.

• Hip adductor stretch (B3)

– Begin in standing position with arms beside body.

– Spread legs beyond shoulder width.

– Bend forwards with torso parallel with the floor while arms are hanging down.

– Return to starting position.

• Curl-to-press (C1)

– Begin in standing position with arms beside body.

– Flex arms at elbow and raise arms in front of body and overhead.

– Return to starting position.

• Freestanding squats (C2)

– Begin in standing position with arms beside body.

– Bend knees up to 90 degrees with arms outstretched at shoulder height in front of the body.

– Return to starting position.

• Transverse horizontal punch (C3)

– Begin in standing position with arms beside body.

– Rotate torso towards left side and punch with right hand in front of the body until arm is straight.

– Rotate torso towards right side and punch with left hand in front of the body until arm is straight.

– Return to starting position.

• Oblique stretch (C4)

– Begin in standing position with arms beside body.

– Spread legs beyond shoulder width.

– Raise left arm laterally overhead and lean torso towards the right.

– Return to starting position.

Semantic poses

Semantic poses extracted from the MSRC-12 gesture and WorkoutSu-10 exercise datasets are shown in Tables 3 and 4, and their linguistic descriptions are as follows.

T1 Arms are beside legs.

T2 Arms are overhead with elbows above shoulder level.

T3 Thighs at an angle to the shins.

T4 Right arm is in front of stomach.

T5 Horizontally outstretched right arm is to the right of the body.

T6 Hands are on the corresponding eyes.

T7 Two arms are in front of stomach.

T8 Horizontally outstretched arms are in front of the body.

T9 Torso is parallel to floor.

T10 Left foot is one step in front of the right foot. Right elbow is at shoulder level.

T11 Left foot is one step in front of the right foot. Horizontally outstretched right arm is in front of the body.

T12 Hands are on the head.

T13 Horizontally outstretched right arm is in front of the body.

T14 Two elbows are at shoulder level.

T15 Left foot is one step in front of the right foot.

T16 Raised and outstretched right leg is in front of the body.

T17 Left knee is raised forward, with the thigh at almost 90 degrees to the shin.

T18 The trunk faces the left side at almost 45 degrees. Horizontally outstretched arms are in front of the body. Left shin is at almost the same level as the right knee.

T19 The trunk faces the right side at almost 45 degrees. Horizontally outstretched arms are in front of the body. Right shin is at almost the same level as the left knee.

T20 Two hands on the hips. Thighs at a slight angle to the shins.

T21 Two raised arms are beside the body with hands at almost shoulder level.

T22 The trunk faces one side at almost 45 degrees. Two raised arms are beside the body with hands at almost shoulder level.

T23 One thigh at an angle to the corresponding shin. The other leg is outstretched.

T24 Separate legs are much wider than shoulder.

T25 Separate legs are much wider than shoulder. Torso is parallel to floor and arms are hanging down beside the legs.

T26 Trunk faces the left side at almost 90 degrees. Horizontally outstretched right arm is in front of the body.

T27 Trunk faces the right side at almost 90 degrees. Horizontally outstretched left arm is in front of the body.

T28 The trunk is bent obliquely to one side. The stretched opposite-side arm is over the head.
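Descriptions such as T1 ("arms are beside legs") can be thought of as predicates over skeleton joint positions. The sketch below is a hypothetical hand-coded check for illustration only: the joint names, coordinates and tolerance are assumptions, and the paper learns the semantic-to-visual mapping statistically rather than hand-coding such rules.

```python
def is_t1(joints, tol=0.15):
    """Rough check of semantic pose T1 ("arms are beside legs"):
    both hands hang close to the corresponding hip joints."""
    def near(a, b):
        ax, ay, az = joints[a]
        bx, by, bz = joints[b]
        return ((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2) ** 0.5 < tol

    return near("left_hand", "left_hip") and near("right_hand", "right_hip")

# Illustrative skeleton frames (x: lateral, y: height, z: depth, metres).
standing = {"left_hand": (0.20, 0.90, 0.0), "left_hip": (0.18, 1.00, 0.0),
            "right_hand": (-0.20, 0.90, 0.0), "right_hip": (-0.18, 1.00, 0.0)}
arms_up = dict(standing, left_hand=(0.20, 1.80, 0.0), right_hand=(-0.20, 1.80, 0.0))
```

Here `is_t1(standing)` holds while `is_t1(arms_up)` does not, separating T1 from raised-arm poses such as T2.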

Table 3: Semantic poses used in MSRC-12 dataset.

Gestures | Semantic poses
Lift outstretched arms | T1, T2
Duck | T1, T3
Push right | T1, T4, T5
Goggles | T1, T6
Wind up | T1, T7, T2, T7
Shoot | T1, T8
Bow | T1, T9
Throw | T1, T10, T11
Had enough | T1, T12
Change weapon | T1, T13, T7
Beat both | T1, T14, T2
Kick | T1, T15, T16

Table 4: Semantic poses used in WorkoutSu-10 dataset.

Exercises | Semantic poses
Hip flexion (A1) | T1, T17
Trunk rotation (A2) | T1, T18, T19
Lateral stepping (A3) | T1, T20
Thoracic rotation (B1) | T21, T22
Hip adductor stretch (B2) | T1, T23
Hip stretch (B3) | T1, T24, T25
Curl-to-press (C1) | T1, T2
Freestanding squats (C2) | T1, T3
Transverse horizontal punch (C3) | T1, T26, T27
Oblique stretch (C4) | T1, T28
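The composite activities used in the zero-shot experiment are formed by concatenating the semantic pose sequences of single actions from Table 4. A minimal sketch of this composition is given below; the boundary-pose merging rule is an assumption made for illustration, while the pose sequences themselves come from Table 4.

```python
# Semantic pose sequences for a few WorkoutSu-10 actions (Table 4).
pose_sequences = {
    "Hip flexion (A1)": ["T1", "T17"],
    "Curl-to-press (C1)": ["T1", "T2"],
    "Freestanding squats (C2)": ["T1", "T3"],
}

def compose(*action_names):
    """Concatenate single-action semantic pose sequences into a
    composite activity, merging a duplicated boundary pose."""
    seq = []
    for name in action_names:
        part = pose_sequences[name]
        if seq and seq[-1] == part[0]:
            part = part[1:]  # avoid repeating a shared boundary pose
        seq.extend(part)
    return seq
```

For example, `compose("Curl-to-press (C1)", "Freestanding squats (C2)")` yields the semantic pose sequence T1, T2, T1, T3, which can then be scored against a visual pose stream like any single action.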

[Fig. 5 plot: x-axis “Times of semantic pose set size” (1–7); y-axis “Accuracy” (0.55–1.0); curves VS-TM-12, LDA-12, HGM-12, VS-TM-10, LDA-10, HGM-10.]

Fig. 5: Recognition accuracy versus the number of visual pose candidates.

Impact of the number of visual pose candidates

The recognition accuracy versus the number of visual pose candidates is shown in Figure 5, where 12 and 10 in the legends represent the MSRC-12 and WorkoutSu-10 datasets respectively. Specifically, the number of visual pose candidates ranges from one to seven times the number of semantic poses. The results show that the recognition rate of the proposed method first increases with the number of visual pose candidates and then levels off when the number of visual pose candidates reaches five or six times the number of semantic poses.

We also compare the proposed method with the latent Dirichlet allocation (LDA) model [14] and a hierarchical generative model (HGM) [18] on the MSRC-12 and WorkoutSu-10 datasets, where the number of words was set to one to seven times the number of topics in LDA and the number of global topics in HGM respectively. Experimental results show that the proposed method is more robust than LDA and HGM with respect to the number of visual pose candidates.