Temporal Action Co-Segmentation in 3D Motion Capture Data...

Temporal Action Co-Segmentation in 3D Motion Capture Data and Videos

Konstantinos [email protected]

Constantinos [email protected]

Antonis [email protected]

Foundation for Research and Technology-Hellas (FORTH),Institute of Computer Science, Crete, Greece

Abstract

Given two action sequences, we are interested inspotting/co-segmenting all pairs of sub-sequences that rep-resent the same action. We propose a totally unsupervisedsolution to this problem. No a-priori model of the actionsis assumed to be available. The number of common sub-sequences may be unknown. The sub-sequences can be lo-cated anywhere in the original sequences, may differ in du-ration and the corresponding actions may be performed bya different person, in different style. We treat this type oftemporal action co-segmentation as a stochastic optimiza-tion problem that is solved by employing Particle SwarmOptimization (PSO). The objective function that is mini-mized by PSO capitalizes on Dynamic Time Warping (DTW)to compare two action sequences. Due to the generic prob-lem formulation and solution, the proposed method canbe applied to motion capture (i.e., 3D skeletal) data or toconventional RGB video data acquired in the wild. Wepresent extensive quantitative experiments on several stan-dard, ground truthed datasets. The obtained results demon-strate that the proposed method achieves a remarkable in-crease in co-segmentation quality compared to all tested ex-isting state of the art methods.

1. IntroductionThe unsupervised discovery of common patterns in im-

ages and videos is considered as two important problems incomputer vision. We are interested in the temporal aspectof the problem and focus on action sequences (sequences of3D motion capture data or video data) that contain multi-ple common actions. The problem was introduced in [10]as Temporal Commonality Discovery (TCD). Our motiva-tion and interest to the problem stems from the fact that thediscovery of common action patterns in two or more se-

Sequence A

Sequence B

Common actions

Throwing a ball Punching Jumping jacks Clapping Sit down Jumping in place

Non-common actions

Figure 1: Given two image sequences that share commonactions, our goal is to automatically co-segment them in atotally unsupervised manner. In this example, there are fourcommon actions and two non-common actions. Notice thatthere are two instances of the 1st action of sequence A in se-quence B. Each point of the grayscale background encodesthe pairwise distance of the corresponding sequence frames.

quences provides a way to segment them, to identify a setof elementary actions performed in the sequences and, at ahigher level, to build models of the performed actions in anunsupervised manner.

In this paper, we propose a novel method for solvingthis problem. The proposed method operates on multivari-ate time-series (i.e., a feature vector of fixed dimensionalityper frame), representing action-relevant information. Weassume that two sequences/time-series contain a number ofcommon action sub-sequences which we aim at identifyingin a unsupervised manner, in the sense that no prior mod-

1

els of the actions are available. As illustrated graphically inFig. 1, a shared (common) sub-sequence may appear any-where in both sequences and may differ in duration as wellas in the style of execution. Such a commonality is definedby four values, i.e., the start positions of the co-segmentedsub-sequences and their possibly different lengths, in bothsequences. We cast the search for a single such commonal-ity as a stochastic optimization problem that is solved basedon Particle Swarm Optimization (PSO) [20]. Essentially,PSO searches for a commonality that minimizes the dissim-ilarity between the co-segmented sequences. This is quan-tified by the cost of their non-linear alignment through Dy-namic Time Warping (DTW) [32]. Iterative invocations ofthis optimization process identify all commonalities. Theirnumber can either be known a-priori, or can be automat-ically identified, leading to a supervised or unsupervisedvariant of the solution, respectively.

Several experiments were carried out on several standardand custom datasets that contain sequences of human or an-imal actions. The datasets involve either motion capturedata or conventional RGB videos. The quantitative analy-sis of the obtained results reveals that the proposed strat-egy improves the co-segmentation overlap metric by a greatmargin over the best competing, state of the art methods.

2. Related workThe term co-segmentation was originally introduced in

computer vision by Rother et al. [35], as the task of jointsegmentation of “something similar” given a set of images.This idea aspires to eliminate the need for tedious super-vised training, enabling weakly supervised solutions to anumber of interesting problems, such as automatic annota-tion of human actions in videos [11, 4].Image/object co-segmentation: Methods that co-segmentsimilar image regions given two or more images have al-ready been proposed [35, 25, 28, 5, 37]. Object co-segmentation was introduced in [41] to segment a promi-nent object imaged in an image pair. State-of-the-art meth-ods can handle multiple objects in two images [18] or ina single image [12]. Finally, Rubio et.al [37] proposed amethod for unsupervised co-segmentation of multiple com-mon regions of objects that appear in multiple images.Video co-segmentation: Recently, the idea was extendedto video segmentation towards binary fore/back-groundsegmentation [36] or single object segmentation [6]. Chiu etal. [8] proposed a method to perform multi-class video ob-ject co-segmentation, in which the number of object classesand the number of instances are unknown in each frame andvideo. The assumption in those works is that all framesfrom all videos should contain the target object. This as-sumption is relaxed in Wang et al. [45] that jointly discov-ers and co-segments the target objects in multiple videos, inwhich the target object may not be present in an unknown

number of frames.

Unsupervised segmentation of time-series: Several meth-ods deal with the problem of finding one or multiple com-mon temporal patterns in a single or multiple sequences [27,7]. A solution to this problem is useful in various broad do-mains ranging from biology and bioinformatics to computerscience and engineering. Various methods deal with theproblem using temporal clustering [49], segmentation [21],temporal alignment [48], etc. In a recent work, Wang etal. [46] proposed a method for temporal segmentation ofhuman repetitive actions based on motion capture data, inan unsupervised manner. This is achieved by analyzing fre-quencies of kinematic parameters, detecting zero-velocitycrossings and clustering of sequential data, representing hu-man motion in terms of joint trajectories in order to identifythe optimal segmentation points. DTW [32] is a widely-used algorithm for temporal non-linear alignment of two se-quences of different lengths. Its variants are widely used fortemporal alignment of human motion [48] and for unsuper-vised speech processing [31]. In [31], a dynamic program-ming algorithm, which is a segmental variant of DTW, wasproposed to discover an inventory of lexical speech unitsin an unsupervised manner by exploiting the structure ofrepeating patterns within the speech signal. Moreover, wemention the use of the longest consecutive common sub-sequence (LCCS) algorithm [2] towards unsupervised dis-covery of commonality in sequential data and atomic actiondetection [16]. However, being able to find only identicalsub-sequences, makes this approach not appropriate in do-mains involving signals where variations in the representa-tions may occur due to various reasons such as noise, dif-ferent style of execution of a certain action, etc.

Action co-segmentation: The method in [14] performscommon action extraction in a pair of videos by segment-ing the frames of both videos that contain the common ac-tion. The method relies on measuring the co-saliency ofdense trajectories on spatio-temporal features. The methodproposed by Zhou et al. [49] can discover facial units invideo sequences of one or more persons in an unsupervisedmanner. The method relies on temporal segmentation andclustering of sequences containing facial features. Anotherrecent work proposed by Chu et al. [9] tackles the prob-lem of video co-summarization, by exploiting visual co-occurrence of the same topic across multiple videos. Todo so, they propose the Maximal Bi-clique Finding (MBF)algorithm. The method is able to discover one or multi-ple commonalities between two long action sequences andmatch similar action within each sequence. Our work ismost relevant to the state of the art Temporal CommonalityDiscovery (TCD) method by Chu et. al. [10] that discoverscommon semantic temporal patterns in a pair of videos ortime-series in an unsupervised manner. This is treated asan integer optimization problem that uses the branch-and-

bound (B& B) algorithm [24] for searching for an optimalsolution over all possible segments in each video sequence.The method is applicable to any histogram-based feature.Our contribution: We present a novel formulation of theproblem of temporal action co-segmentation. The proposedmethod is totally unsupervised and assumes a very generalrepresentation of its input. Moreover, it is shown to out-perform TCD and other state of the art methods by a verylarge margin, while maintaining a better computational per-formance.

3. Method descriptionOur approach consists of four components that address

(a) feature extraction and representation of the data (Sec-tion 3.1), (b) DTW-based comparison of sequences of ac-tions (Section 3.2) (c) an objective function that quantifiesa potential sub-sequence commonality (Section 3.3) and (d)the use of (c) in an evolutionary optimization frameworkfor spotting one common sub-sequence (Section 3.4) and allcommon sub-sequences (Section 3.5) in a pair of sequences.

3.1. Features and representation

The proposed framework treats sequences of actions asmultivariate time-series. More specifically, it is assumedthat the input consists of two action sequences SA and SB

of lengths lA and lB , respectively. Each “frame” of sucha sequence is a vector f ∈ Rd. The generality of thisformulation enables the consideration of a very wide setof features. For example, vectors f can stand for motion-capture data representing the joints of a human skeleton ora Bag-of-Features representation of appearance and motionfeatures based on dense trajectories [43, 40] extracted fromconventional RGB videos. Section 4 presents experimentsperformed on four datasets, each of which considers a dif-ferent representation of human motion capture or image se-quence features. This demonstrates the generality and thewide applicability of the proposed solution.

3.2. Comparing action sequences based on DTW

A key component of our approach is a method that as-sesses quantitatively the similarity between two action se-quences being treated as time-series, based on the notionof their temporal alignment. In a recent study, Wang etal. [47] performed an extended study on comparing 9 align-ment methods across 38 data sets from various scientific do-mains. It was shown that the DTW algorithm [3, 38, 32, 39],originally developed for the task of spoken word recog-nition [38], was consistently superior to the other studiedmethods. Thus, we also adopt DTW as a sequence align-ment and comparison tool.

More specifically, given two action sequences S1 and S2

of lengths l1 and l2, we first calculate the distance matrixW1,2 of the pair-wise distances of all frames of the two

(a) Distance matrix (b) O(p)

Figure 2: (a) The pairwise distance matrix for all frames oftwo sequences SA and SB . (b) Visualization of the scoresof the objective function of Eq. 1 for all possible startingpoints in each sequence and sub-sequence lengths (see textfor details).

sequences, as shown in Fig.2a. Depending on the natureof the sequences, different distance functions can be em-ployed (e.g., Euclidean norm for motion capture data, χ2

distance for histograms). The matrix W1,2 represents thereplacement costs, that is, the input to the DTW method [3].The algorithm calculates a cost matrix with a minimumdistance/cost warp path traced through it, providing pair-wise correspondences that establish a non-linear matchingamong all frames of the two sequences. The cumulativecost of all values across the path provides the alignmentcost between the two sequences, noted as D. The resultingminimum distance path contains a number np of diagonalelements that identify matches of frames between the twosequences. If the input sequences were identical time seriesof length l, the warp path through the cost matrix would beits diagonal, with D = 0 and np = l.

3.3. Evaluating a candidate commonality

Having a method for measuring the alignment cost oftwo sequences, our goal is now to define an effective objec-tive function to evaluate candidate commonalities. A can-didate commonality p is fully represented by the quadruplep = (sa, la, sb, lb), where sa, sb are the starting frames andla, lb are the lengths of two sub-sequences of SA and SB ,respectively. A commonality p can be viewed as a rectan-gle R(p) whose top-left corner is located at point (sa, sb)and whose side lengths are la and lb (e.g., see any coloredrectangle in Fig. 1). We are interested in promoting com-monalities p of low alignment cost (or equivalently, of highsimilarity) that also correspond to as many matched framesas possible. To this end, we define an objective functionO(p) that quantifies the quality of a possible commonalityp, as:

O(p) =D(p) + c

np(p) + 1. (1)

As a quadruple p represents two sub-sequences of the orig-inal sequences, the DTW-based comparison of those cal-culates the alignment cost D(p) and the number np(p) ofmatched frames of these sub-sequences. Essentially, theobjective function O(p) calculates the average alignmentcost across the alignment path calculated by DTW, by di-viding the temporal alignment cost D(p) with the numberof matched frames np(p). c is a small constant that guaran-tees that longer commonalities are favored over small ones,even in the unlikely case that D(p) = 0.

Figures 2a, 2b aspire to provide some intuition regardingthe objective function and its 4D domain, based on a givenpair of sequences containing a single common action. Weassume two sequences SA, SB that contain a common ac-tion across frames [200..350] and [400..600], respectively.Figure 2a illustrates the pairwise distance matrix W of alltheir frames, calculated based on the Euclidean distance ofthe feature vectors representing the frames. Each cell (i, j)in the map of Fig. 2b represents the minimum objectivefunction score O for two sub-sequences starting at frame iin SA and j in SB over all possible combinations of allowedlengths. Thus, the presented map visualizes the 4D para-metric space by a 2D map of the responses of the definedobjective function. It can be verified that low scores areconcentrated near the area of the common sub-sequences.

3.4. Spotting a single commonality

Spotting a single commonality amounts to optimizingEq. (1) over all possible commonalities p. In notation, theoptimal commonality p∗ is defined as

p∗ = arg minp

O(p). (2)

Searching exhaustively all candidate solutions in the 4D pa-rameter space spanned by possible commonalities is pro-hibitively expensive. Thus, we choose to consider this as astochastic optimization problem that is solved based on thecanonical PSO algorithm [20, 19, 15].

PSO is a powerful evolutionary optimization methodused to optimize an objective function through the evolutionof particles (candidate solutions) of a population (swarm).Particles lie in the parameter space of the objective func-tion to be optimized and evolve through a limited numberof generations (iterations) according to a policy which emu-lates “social interaction”. PSO is a derivative-free optimiza-tion method that handles multi-modal, discontinuous objec-tive functions with several local minima. Its main parame-ters are the number of particles and generations, the productof which determines its computational budget (i.e., num-ber of objective function evaluations). PSO has recentlybeen applied with success to several challenging, multidi-mensional optimization problems in computer vision suchas 3D pose estimation and tracking of hands [29, 30], handsand objects [23] and humans [17].

For single action co-segmentation, PSO operates on the4D space of all possible commonalities. Certain constraintsapply to the parameters sa, sb, la and lb. Specifically, itholds that la, lb ≥ lmin la, lb ≤ lmax, sa ≤ lA − lmin,sb ≤ lB − lmin, where lmin, lmax are user-defined mini-mum/maximum allowed commmonality lengths.

The decision on the actual number of particles and gen-erations needed to come up with an accurate solution to theproblem is taken based on experimental evidence and is dis-cussed in Section 4.

Finally, sequences do not necessarily contain a common-ality. In such situations, the optimization of Eq. 5 results ina large value.

3.5. Spotting multiple commonalities

A joint solution to the problem for identifying N com-monalities in a single run of PSO would require to explorea 4N -dimensional space. This can become intractable evenfor PSO when considering many common actions to be co-segmented. Thus, we resort to an iterative optimization pro-cedure that identifies a single commonality at a time.

It should be noted that the n-th commonality pn shouldoverlap as less as possible with the n − 1 previously iden-tified commonalities pi, 1 ≤ i ≤ n − 1. There is a fairlyintuitive way of identifying this overlap. Given two com-monalities pi and pj , their normalized intersection Ω(pi, pj)with respect to pi is defined as

Ω(pi, pj) =|R(pi) ∩R(pj)||R(pi)|

, (3)

where R(p) is the region of a commonality p (see Sec-tion 3.3) and | · | is an operator that measures the area ofa 2D region. Given this, we define a new objective func-tion for the optimization problem. In order to identify thei-th commonality pi, the new objective function considersits DTW-based score (as before), but also its normalized in-tersection with the already identified commonalities. Theoptimal i-th commonality is thus defined as:

p∗i = arg minpi

O(pi) + λ

i−1∑j=1

Ω(pj , pi)

, (4)

where λ > 0 is used to tune the contribution of the twoterms in the objective function. Large λ values excludecommonalities that have even a slight overlap with the al-ready identified ones. In our implementation, we set λ = 1.

The above definition of the objective function, penalizescommonalities whose regions overlap. However, it doesnot penalize non-overlaping commonalities whose regionsshare the same rows (or columns) of the distance matrix. Inother words, an action in one video can be matched withseveral instances of the same action in the second video.

Supervised vs unsupervised action co-segmentation:The iterative co-segmentation method described so far canbe applied for a known number N of iterations, giving riseto N retrieved commonalities. We denote this variant of thealgorithm as S-EVACO that stands for Supervised EVolu-tionary Action CO-segmentation. S-EVACO is useful whenthe number of the common action subsequences is knowna-priori. However, a totally unsupervised method that doesnot assume this knowledge is definitely preferable. We ap-proach the problem of developing such an unsupervisedmethod as a model selection task. To this end, we considerthe maximum possible number of common actions, notedas K. We run the iterative PSO-based optimization processfor K iterations, retrieving K commonalities pi as well astheir fitness values O(pi). We sort the commonalities in as-cending order of O(pi). Finally, we consider all possibleK − 1 break points between consecutive commonalities.We accept the break point j∗ that maximizes the absolutedifference of the mean values of fitnesses to the left and tothe right of it. By doing so, we guarantee that the introduc-tion of pj∗+1 into the commonalities solution set decreasessubstantially the quality of the solution. In notation,

j∗ = arg maxj∈1,...,K−1

∣∣∣∣∣∣1jj∑

i=1

O(pi)−1

K − j

K∑i=j+1

O(pi)

∣∣∣∣∣∣ .(5)

The commonalities p1 to pj∗ constitute the sought solution.We denote this variant of our method as U-EVACO.

4. Experimental evaluationWe assess the performance of the proposed action co-

segmentation method and compare it with state-of-the artmethods using various ground truthed datasets based on ei-ther 3D motion capture data or conventional (RGB) videos.

In a first series of experiments we investigate the compu-tational budget (number of particles and generations) thatis required by PSO to solve the co-segmentation problem.Typically, more particles and/or generations help PSO tobetter explore the search space of the optimization prob-lem. However, beyond a certain point, the accuracy gainsare disproportionally low compared to the increase in thecomputational requirements. These experiments lead to theselection of the computational budget that constitutes thebest compromise between computational requirements andco-segmentation accuracy.

Then, we compare the resulting performance of themethod based on the selected and fixed PSO budget withthat of the state-of-the-art TCD method [10] for single andmultiple temporal commonality discovery, the method pro-posed by Guo et al. [14] for video co-segmentation of actionextraction and our own implementation of the SegmentalDTW [31]. In Segmental DTW, a local alignment proce-

dure produces multiple warp paths having limited temporalvariation and low distortion. Each warping path is limitedto a diagonal region of a given width. The minimum lengthof each path is also given as a parameter. Since the Seg-mental DTW in an unsupervised method, we name our im-plementation of it as U-SDTW. We also consider a super-vised variant, namely S-SDTW, where the number of com-mon sub-sequences is known and identified by selecting thepaths with the lower length-constrained minimum averagedistortion fragment [31]. For all methods, we employed theparameters that optimize the performance of the method forall employed datasets.

4.1. Datasets and performance metrics

The experimental evaluation was conducted using a to-tal of 373 pairs of sequences, consisting of up to 2355 ac-tion sub-sequences and 1286 pairs of common actions. Allthe compiled datasets and the code for the proposed methodwill be made available online.

MHAD101-s dataset: The Berkeley Multimodal HumanAction Database (MHAD) [42] contains human motioncapture data acquired by an optical motion capture systemas well as conventional RGB video and depth data acquiredfrom multiple views and depth sensors, respectively. Allinformation streams are temporally synchronized and ge-ometrically calibrated. The original dataset (see Fig. 3a)contains 11 actions performed by 12 subjects (7 male, 5female). Each action is repeated 5 times by each subject.We considered all the available action categories except one(the action labeled as sit down/stand up), as it is a composi-tion of the actions No10-(sit down) and No11-(stand up).This alleviates potential problems and ambiguities in thedefinition of the ground truth. We selected only the first(out of the five) execution of an action by each subject, thuswe collected a set of 120 action sequences. We use the mo-tion capture (3D skeletal) data of the original dataset andwe downsampled it temporally by a factor of 16 to reachthe standard frame-rate of 30 fps. We then considered thesubset of human actions defined above, as building blocksfor synthesizing larger sequences of actions and definingpairs of such sequences for which ground truth regardingcommonalities is available, by construction.

The resulting MHAD101-s dataset contains 101 pairs ofaction sequences. In 50 pairs, each sequence consists of 3actions and the two sequences have exactly 1 in common. In17 pairs, each sequence consists of 3-7 actions and the twosequences have exactly 2 in common. In 17 pairs, each se-quence consists of 3-7 actions and the two sequences haveexactly 3 in common. Finally, in 17 pairs, each sequenceconsists of 4-7 actions and the two sequences have exactly4 in common. It is also guaranteed that (a) a sequence con-tains actions of the same subject (b) to promote style and du-

(a) MHAD dataset (b) CMU dataset

(c) 80-Pair dataset

Figure 3: (a) Snapshots from the Berkeley MHAD dataset.(b) Snapshots from the CMU-Mocap dataset. (c) Pairs ofsnapshots from the 80-Pair dataset.

ration variability, for every pair, the two sequences involvedifferent subjects and (c) the placement of the common ac-tions in the sequences is random. The length of each se-quence ranges between 300 and 2150 frames and the lengthof each common action ranges between 55 and 910 frames.Representing 3D motion capture data in MHAD101-s: Sev-eral representations of skeletal data have been proposed inthe literature [22, 26, 13, 34]. In this work, we employ avariant of the representation proposed in [34]. According tothis, a human pose is represented as a 30 + 30 + 4 = 64Dvector. The first 30 dimensions encode angles of selectedbody parts with respect to a body-centered coordinate sys-tem. The next 30 dimensions encode the same angles butin a camera-centered coordinate system. Finally, this repre-sentation is augmented with the four angles between the 3Dvectors of the fore- and the back-arms as well as the anglesbetween the upper- and lower legs.

CMU86-91 dataset: We also employed the CMU-Mocapdatabase1 [1] (Fig. 3b) of continuous action sequences. Weselected the set Subject 86 consisting of 14 labeled, long ac-tion sequences of skeletal-based human motion data, eachconsisting of up to 10 actions (between 4k-8k frames). Incontrast to MHAD101-s, the actions are not concatenatedbut rather executed in a continuous manner. We utilize avariation of the original dataset, presented in Chu et.al [9]that concerns, (a) grouping of action labels to 24 categoriesof similar action (the original dataset consists of 48 pre-defined actions), (b) feature representation of human motiondata based on the position and orientation of the skeletalroot and relative joint angles that results in a 30-D featurevector per frame, (c) temporally down-sampled sequencesby a factor of 4 to reach the standard frame-rate of 30 fps,

1http://mocap.cs.cmu.edu/

(d) a set of 91 pairs of action sequences, by combining allindividual sequences. The ground truth of each sequence isprovided in [1]. We consider the median value of the threeframe numbers provided as possible action boundaries foreach action in a long sequence. In the resulting dataset,the length of the sequences ranges between 330 and 1570frames, and the length of their common action ranges be-tween 70 and 1000 frames at 30 fps.

MHAD101-v dataset: The MHAD101-v dataset is iden-tical to MHAD101-s in terms of action composition andco-segmentation-related ground truth. However, instead ofemploying the motion capture data stream, we employ thecorresponding RGB video stream. Our motivation origi-nates in the idea of testing the applicability of the proposedmethodology also based on low-level video data represent-ing human actions. At the same time, the comparison ofthe performance of the method for the same sequences un-der completely different representations, provides interest-ing insights for both the representations and the method.Representing video data in MHAD101-v: The employedrepresentation is based on the Improved Dense Trajectories(IDT) features [43]. Based on the IDT, we compute fourtypes of descriptors, namely trajectory shape, HOG, HOF,and MBH [43], using publicly available code2, and the sameconfiguration and parameters, as presented in [44, 43], Toencode features, we use a Bag-of-Features representation,separately for each type of descriptor and for each pair ofvideos in the dataset. More specifically, we built a code-book for each type of descriptors using k-means over thefeatures extracted over the frames of the two videos of apair. Then, we calculate the Bag-of-Features representa-tion for each frame, which results in a per frame featurevector (histogram of frequencies of codewords) that cap-tures information regarding the descriptors of the trajecto-ries that were detected and tracked in a temporal window of15 frames preceding that frame. We found that a codebookof 25 codewords is sufficient for our purposes. Finally, weconcatenate all feature vectors calculated for each type ofdescriptors per frame in a single 100-D feature vector.

80-Pair dataset: We also employ the 80-Pair dataset,specifically designed for the problem of common action ex-traction in videos and presented in Guo et.al. [14]. Amongthe 80 pairs of the dataset, 50 are segmented clips of hu-man actions from the UCF50 dataset [33] and 30 pairs areselected from BBC animal documentaries depicting animalactions. In contrast to the MHAD101-v dataset that containsconcatenated videos of actions, the 80-Pair dataset containsvideos of continuous actions executed in the wild.

2http://lear.inrialpes.fr/people/wang/

Figure 4: The overlap score of the objective functionas a function of PSO particles and generations for theMHAD101-s (left) and MHAD101-v (right) datasets.

Representing video data in the 80-Pairs: Dense point tra-jectories [40] based on optical flow are employed and en-coded using MBH descriptors [43], following the sameexperimental setup as in [14] . A motion based figure-ground segmentation is applied to each video to removebackground trajectories. The features are computed usingpublicly available code3. We build a Bag-of-Features rep-resentation based on the MBH descriptors of all frames fora pair of videos. We employ k-means to define 25 code-words. Then, each frame is represented by a 25D featurevector that is the histogram of frequencies of the codewordsfor the trajectories ending up in that frame.

Performance metrics: In order to assess the perfor-mance of the evaluated methods we employed the stan-dard metrics of precision P , recall R, F1 score andoverlap O (intersection-over-union) [10]. In our context,precision quantifies how many of the frames of the co-segmented sequences belong to the set of commonalitiesin both sequences. Recall quantifies how many of the ac-tual commonalities (common frames) are indeed discov-ered/segmented by the method. For each dataset, we cal-culate the average for each metric for all pairs.

4.2. Choosing the PSO budget

The effectiveness of the PSO optimization as well as itsrunning time is determined by its number of particles s (can-didate solutions) and generations g (particle evolutions).The product s · g equals the number of objective functionevaluations. g and s are experimentally defined, so that thetrade off between quality of solutions and running time isappropriately set. In that direction, we applied our methodto the MHAD101-s and MHAD101-v datasets, runningover all combinations of s, g in 8, 16, 32, 64, 128, 256 i.e.,from a lowest of 8 × 8 = 64 to a highest of 256 × 256 =65536 objective function evaluations. For each combina-tion, we considered the average overlap score of 5 runs.

Figures 4 summarize the obtained results for the twodatasets. In both datasets, the overlap score increases asthe PSO budget increases. Additionally, in both datasets,

3http://lmb.informatik.uni-freiburg.de/resources/software.php,http://www.lizhuwen.com/

Table 1: Evaluation results on the MHAD-101s dataset.

MHAD101-s R(%) P(%) F1(%) O(%)

TCD 16.7 18.1 13.8 8.5S-SDTW 61.6 47.1 48.5 35.9U-SDTW 65.8 45.5 47.7 35.1S-EVACO 77.9 67.6 71.3 59.4U-EVACO 71.3 63.9 63.3 50.3

the overlap score increases faster when increasing the num-ber of particles than increasing the number of generations.Finally, while best performances do not differ considerably,optimization based on skeletal data appears easier that opti-mization based on video data. An overview of the 3D maps,provides evidence that combinations of at least 32-32 gener-ations and particles achieve good results. The (g, s) config-urations of (32, 128), (64, 128), (128, 32), (128, 64) achieveapproximately the 90%, 95%, 90%, 95% of the maximumscore achieved by the (256, 256) configuration, by usingonly the 6.25%, 12.5%, 6.25%, 12.5% of the maximumbudget, respectively. We finally set (g, s) = (64, 128) inall our experiments. The reason is that generations need tobe executed serially, while particles can be evaluated in par-allel. Thus, the (64, 128) configuration can eventually betwo times faster than the (128, 64) one.

4.3. Action co-segmentation on skeletal data

Results on MHAD101-s: We allow all methods to searchfor common sub-sequences whose lengths varies in therange [25..1370] (from half the length of the shortest ac-tion to 1,5 times the length of the largest one). Results arereported in Table 1. The scores of the methods are presentedas % average scores in the tables, computed over all individ-ual scores per sample (pairs of sequences) of a dataset. Thescores of the proposed S-EVACO and U-EVACO methodsare the average scores over all samples of a dataset com-puted after 10 repetitions of the experiment for each dataset.S-EVACO achieves an overlap score of 59.4% and outper-forms TCD by over 50% and both variants of SegmentalDTW (U-SDTW/S-SDTW) by over 20% for all the reportedmetrics. The overlap metric of U-EVACO is 9% lower com-pared to that of S-EVACO. Still, we stress that this unsuper-vised version of the proposed method outperforms the stateof the art supervised methods by a very wide margin.

Results on CMU86-91: We allow all methods to searchfor common sub-sequences whose lengths varies in therange [70..1135] (from half the length of the shortest ac-tion, to 1,5 times the length of the largest one). Resultsfor the CMU86-91 dataset are reported in Table 2. Theproposed approaches outperform TCD [10] in all reportedmetrics (27 − 33% higher overlap). We also note the con-

Table 2: Evaluation results on the CMU86-91 dataset.

CMU86-91 R(%) P(%) F1(%) O(%)


Table 3: Evaluation results on the MHAD-101v dataset.

MHAD101-v R(%) P(%) F1(%) O(%)


Table 4: Evaluation results on the 80-Pair dataset.

80-Pairs R(%) P(%) F1(%) O(%)

TCD 22.9 65.4 31.2 21.5S-SDTW 27.8 52.2 31.4 21.6U-SDTW 34.6 60.6 37.3 25.6S-EVACO 75.8 77.2 73.9 64.5U-EVACO 61.0 69.7 62.0 54.2Guo [14] 55.6 78.1 60.9 51.6

siderably higher performance of our method compared toS-SDTW and U-SDTW (36− 41% higher overlap).

4.4. Action co-segmentation on video data

Results on MHAD101-v: The results are summarized inTable 3. Given the number of common actions to be dis-covered in each pair of sequences, the proposed S-EVACOmethod outperforms S-SDTW method over 20% and 10%with respect to the overlap and the other metrics, respec-tively. TCD has a less than 20% overlap score, and its per-formance is lower by more than 30% with respect to re-call, precision and F1 metrics. The unsupervised variant ofthe proposed method (U-EVACO) results in 13% less over-lap compared to S-EVACO. In addition, S-SDTW achieveslower overlap, precision and F1 scores by 8%, 14% and10%, respectively, compared to S-EVACO.

Results on 80-Pair dataset: Table 4 summarizes the find-ings on this dataset. Besides TCD, S-SDTW and U-SDTW,we compare our approach to the method of Guo et al. [14].We employed the publicly available implementation of thatmethod and we run it with the parameters suggested in [14]for that dataset. The TCD, S-SDTW and U-SDTW methods

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8F1 vs overlap, skeletal & video

TCDS−SDTWU−SDTWS−EVACOU−EVACO

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8F1 vs overlap, skeletal


0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8F1 vs overlap, video


0 0.2 0.4 0.6 0.8 10

20

40

60

80

100% frames vs overlap, skeletal & video


0 0.2 0.4 0.6 0.8 10

20

40

60

80

100% frames vs overlap, skeletal


0 0.2 0.4 0.6 0.8 10

20

40

60

80

100% frames vs overlap, video


Figure 5: Summary of the obtained results in all datasets.

achieve comparable scores, with TCD performing the lowerscores in all metrics except the precision that is 5% betterthan that of U-SDTW. The proposed S-EVACO, U-EVACOhave similar scores, mainly due to the fact that all pairs ofvideos in that dataset contain a single common action to bediscovered. Both variants of the proposed method outper-form TCD, S-SDTW and U-SDTW methods with over 40%improvements for the overlap and between 17% and 53%for the other metrics. Both proposed variants also scorehigher than Guo’s method [14] in overlap (12% improve-ment), F1 score (12%) and recall (20%).

Figure 5 summarizes the findings in all datasets. The leftcolumn shows the % of pairs where the overlap is abovea certain threshold. The right column shows plots of themean F1 score for all sequence pairs, after zeroing the F1score of pairs below an overlap threshold. Plots are shownfor all pairs of the datasets (top), for those involving videodata only (middle) and skeletal data only (bottom). It canbe verified that the S-EVACO and U-EVACO outperformthe state of the art by a large margin. Regarding executiontime, U-EVACO requires, on average, 10sec for processinga pair of videos. This makes it slower but comparable toU-SDTW and more than two times faster than TCD.

5. Summary and conclusionsWe presented a novel method for the temporal co-

segmentation of all common actions in a pair of action se-quences. We treated this as a stochastic optimization prob-lem whose solution is the start positions and the lengthsof the sub-sequences of the input sequences that define ac-tion segments of maximum similarity. Optimization wasperformed based on iterative Particle Swarm Optimizationwith an objective function defined based on the non-linear

DTW alignment cost of two sub-sequences. The proposedapproach operates on multivariate time series. As such,it can assume a variety of image/video/motion representa-tions. Two variants were presented, one that assumes thatthe number of commonalities is known (S-EVACO) and onethat does not require that information (U-EVACO). Bothvariants were extensively tested on challenging datasets ofmotion capture and video data, with a variety of featuresand representations and in comparison with state of the artmethods. The results demonstrated that the proposed ap-proach outperforms all state of the art methods in all datasets by a large margin.

References[1] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J. K. Hod-

gins, and N. S. Pollard. Segmenting motion capture datainto distinct behaviors. In Proceedings of Graphics Interface2004, pages 185–194. Canadian Human-Computer Commu-nications Society, 2004. 6

[2] L. Bergroth, H. Hakonen, and T. Raita. A survey of longestcommon subsequence algorithms. In String Processing andInformation Retrieval, 2000. SPIRE 2000. Proceedings. Sev-enth International Symposium on, pages 39–48. IEEE, 2000.2

[3] D. J. Berndt and J. Clifford. Using dynamic time warping tofind patterns in time series. In KDD workshop, volume 10,pages 359–370. Seattle, WA, 1994. 3

[4] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce,C. Schmid, and J. Sivic. Weakly supervised action label-ing in videos under ordering constraints. In ECCV, pages628–643. Springer, 2014. 2

[5] Y. Chai, V. Lempitsky, and A. Zisserman. Bicos: A bi-levelco-segmentation method for image classification. In IEEEICCV, pages 2579–2586, Nov 2011. 2

[6] D.-J. Chen, H.-T. Chen, and L.-W. Chang. Video objectcosegmentation. In Proceedings of the 20th ACM Interna-tional Conference on Multimedia, MM ’12, pages 805–808,New York, NY, USA, 2012. ACM. 2

[7] B. Chiu, E. Keogh, and S. Lonardi. Probabilistic discov-ery of time series motifs. In Proceedings of the Ninth ACMSIGKDD International Conference on Knowledge Discoveryand Data Mining, KDD ’03, pages 493–498, New York, NY,USA, 2003. ACM. 2

[8] W.-C. Chiu and M. Fritz. Multi-class video co-segmentationwith a generative multi-video model. In The IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR),June 2013. 2

[9] W.-S. Chu, Y. Song, and A. Jaimes. Video co-summarization: Video summarization by visual co-occurrence. In IEEE CVPR, pages 3584–3592, 2015. 2,6

[10] W.-S. Chu, F. Zhou, and F. De la Torre. Unsupervised tem-poral commonality discovery. In A. Fitzgibbon, S. Lazebnik,P. Perona, Y. Sato, and C. Schmid, editors, ECCV, volume7575 of Lecture Notes in Computer Science, pages 373–387.Springer Berlin Heidelberg, 2012. 1, 2, 5, 7

[11] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce. Auto-matic annotation of human actions in video. In IEEE ICCV,pages 1491–1498, Sept 2009. 2

[12] A. Faktor and M. Irani. Co-segmentation by composition. InIEEE ICCV, pages 1297–1304, 2013. 2

[13] D. Gavrila. The visual analysis of human movement. Com-put. Vis. Image Underst., 73(1):82–98, Jan. 1999. 6

[14] J. Guo, Z. Li, L.-F. Cheong, and S. Z. Zhou. Video co-segmentation for meaningful action extraction. In IEEEICCV, pages 2232–2239. IEEE, 2013. 2, 5, 6, 7, 8

[15] S. Helwig and R. Wanka. Particle swarm optimization inhigh-dimensional bounded search spaces. In Swarm Intel-ligence Symposium, 2007. SIS 2007. IEEE, pages 198–205,April 2007. 4

[16] S.-Y. Jin and H.-J. Choi. Essential body-joint and atomicaction detection for human activity recognition using longestcommon subsequence algorithm. In ACCV, ACCV’12, pages148–159, Berlin, Heidelberg, 2013. Springer-Verlag. 2

[17] V. John, S. Ivekovic, and E. Trucco. Articulated human mo-tion tracking with hpso. In VISAPP (1), pages 531–538,2009. 4

[18] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation.In IEEE CVPR, pages 542–549. IEEE, 2012. 2

[19] J. Kennedy and R. Eberhart. Particle swarm optimization.In Neural Networks, 1995. Proceedings., IEEE InternationalConference on, volume 4, pages 1942–1948 vol.4, Nov 1995.4

[20] J. Kennedy, J. F. Kennedy, R. C. Eberhart, and Y. Shi. Swarmintelligence. Morgan Kaufmann, 2001. 2, 4

[21] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online al-gorithm for segmenting time series. In Data Mining, 2001.ICDM 2001, Proceedings IEEE International Conferenceon, pages 289–296, 2001. 2

[22] L. Kovar and M. Gleicher. Automated extraction and param-eterization of motions in large data sets. In ACM SIGGRAPH2004 Papers, SIGGRAPH ’04, pages 559–568, New York,NY, USA, 2004. ACM. 6

[23] N. Kyriazis and A. Argyros. Scalable 3d tracking of multipleinteracting objects. In IEEE CVPR, pages 3430–3437. IEEE,2014. 4

[24] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Efficientsubwindow search: A branch and bound framework for ob-ject localization. IEEE Trans. Pattern Anal. Mach. Intell.,31(12):2129–2142, Dec. 2009. 3

[25] H. Liu and S. Yan. Common visual pattern discovery viaspatially coherent correspondences. In Computer Visionand Pattern Recognition (CVPR), 2010 IEEE Conference on,pages 1609–1616. IEEE, 2010. 2

[26] T. B. Moeslund, A. Hilton, and V. Kruger. A survey of ad-vances in vision-based human motion capture and analysis.Comput. Vis. Image Underst., 104(2):90–126, Nov. 2006. 6

[27] A. Mueen. Enumeration of time series motifs of all lengths.In 2013 IEEE 13th International Conference on Data Min-ing, pages 547–556, Dec 2013. 2

[28] L. Mukherjee, V. Singh, and J. Peng. Scale invariant coseg-mentation for image groups. In IEEE CVPR, CVPR ’11,pages 1881–1888, Washington, DC, USA, 2011. IEEE Com-puter Society. 2

[29] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficientmodel-based 3d tracking of hand articulations using kinect.In BMVC, volume 1, page 3, 2011. 4

[30] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Trackingthe articulated motion of two strongly interacting hands. InIEEE CVPR, pages 1862–1869. IEEE, 2012. 4

[31] A. S. Park and J. R. Glass. Unsupervised pattern discoveryin speech. Audio, Speech, and Language Processing, IEEETransactions on, 16(1):186–197, 2008. 2, 5

[32] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recog-nition, volume Chapter 4. Prentice-Hall, Inc., Upper SaddleRiver, NJ, USA, 1993. 2, 3

[33] K. K. Reddy and M. Shah. Recognizing 50 human actioncategories of web videos. Machine Vision and Applications,24(5):971–981, 2013. 6

[34] I. Rius, J. Gonzlez, J. Varona, and F. X. Roca. Action-specific motion prior for efficient bayesian 3d human bodytracking. Pattern Recognition, 42(11):2907 – 2921, 2009. 6

[35] C. Rother, T. Minka, A. Blake, and V. Kolmogorov.Cosegmentation of image pairs by histogram matching-incorporating a global constraint into mrfs. In IEEE CVPR,volume 1, pages 993–1000. IEEE, 2006. 2

[36] J. C. Rubio, J. Serrat, and A. Lopez. Video co-segmentation.In ACCV, pages 13–24. Springer, 2012. 2

[37] J. C. Rubio, J. Serrat, A. Lopez, and N. Paragios. Unsu-pervised co-segmentation through region matching. In IEEECVPR, pages 749–756. IEEE, 2012. 2

[38] H. Sakoe and S. Chiba. Dynamic programming algorithmoptimization for spoken word recognition. Acoustics, Speechand Signal Processing, IEEE Transactions on, 26(1):43–49,1978. 3

[39] S. Salvador and P. Chan. Toward accurate dynamic timewarping in linear time and space. Intell. Data Anal.,11(5):561–580, Oct. 2007. 3

[40] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajec-tories by gpu-accelerated large displacement optical flow. InECCV, pages 438–451, Berlin, Heidelberg, 2010. SpringerBerlin Heidelberg. 3, 7

[41] S. Vicente, C. Rother, and V. Kolmogorov. Object coseg-mentation. In IEEE CVPR, pages 2217–2224, June 2011.2

[42] R. Vidal, R. Bajcsy, F. Ofli, R. Chaudhry, and G. Kurillo.Berkeley mhad: A comprehensive multimodal human actiondatabase. In WACV, WACV ’13, pages 53–60, Washington,DC, USA, 2013. IEEE Computer Society. 5

[43] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Dense tra-jectories and motion boundary descriptors for action recog-nition. IJCV, 103(1):60–79, mar 2013. 3, 6, 7

[44] H. Wang and C. Schmid. Action recognition with improvedtrajectories. In IEEE ICCV, December 2013. 6

[45] L. Wang, G. Hua, R. Sukthankar, J. Xue, and N. Zheng.Video object discovery and co-segmentation with extremelyweak supervision. In ECCV, 2014. 2

[46] Q. Wang, G. Kurillo, F. Ofli, and R. Bajcsy. Unsupervisedtemporal segmentation of repetitive human actions based onkinematic modeling and frequency analysis. In 3D Vision(3DV), pages 562–570. IEEE, 2015. 2

[47] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuer-mann, and E. Keogh. Experimental comparison of rep-resentation methods and distance measures for time seriesdata. Data Mining and Knowledge Discovery, 26(2):275–309, 2013. 3

[48] F. Zhou and F. De la Torre Frade. Generalized time warpingfor multi-modal alignment of human motion. In IEEE CVPR,June 2012. 2

[49] F. Zhou, F. D. la Torre, and J. F. Cohn. Unsupervised discov-ery of facial events. In IEEE CVPR, pages 2574–2581, June2010. 2

Temporal Action Co-Segmentation in 3D Motion Capture Data...

Documents

Transcript of Temporal Action Co-Segmentation in 3D Motion Capture Data...