
Extraction of Discriminative Patterns from Skeleton Sequences for Human Action Recognition

Tran Thang Thanh1, Fan Chen1, Kazunori Kotani1, Hoai-Bac Le2

1 Japan Advanced Institute of Science and Technology, School of Information Science, Japan
2 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam

Email: {s1020210, chen-fan, ikko}@jaist.ac.jp, [email protected]

Abstract—The emergence of novel techniques, such as the MS Kinect, enables reliable extraction of human skeletons from action videos. Taking skeleton data as input, we propose an approach in this paper to extract discriminative patterns for efficient human action recognition. Each action is considered to consist of a series of unit actions, each of which is represented by a pattern. Given a skeleton sequence, we first automatically extract the key-frames for unit actions, and then label them as different patterns. We further use a statistical metric to evaluate the discriminative capability of each pattern, and define the bag of reliable patterns as local features for action recognition. Experimental results show that the extracted local descriptors provide very high accuracy in action recognition, which demonstrates the effectiveness of our method in extracting discriminative patterns.

Keywords - discriminative patterns; 3D skeletons; 3D videos; local descriptor; action recognition; patterns retrieval; improved TF-IDF.

I. INTRODUCTION

Action recognition has been widely researched and applied in many domains, such as visual surveillance, human-computer interaction and video retrieval. The problem can be defined as follows: given a motion sequence, the computer should identify the actions being performed by the human subject. Action recognition is difficult due to the large dimensionality of the observed action data (e.g., noise in 2D videos; the number of skeleton joints in 3D), which not only increases computational complexity but also hides the key features of the action. Most state-of-the-art methods solve this problem in three steps: feature extraction, feature refinement (e.g., via dimension reduction) and pattern classification.

A key issue of feature extraction is to identify the significant spatial and temporal characteristics of an action. Previous feature extraction methods can be divided into two categories:

• The first group is based on local descriptors [1-4], which extract the neighborhood information of interest points and focus more on local motion.

• The second group is based on holistic features [5-6], which treat the figure's shape over a given time period as global information.

Some studies [7-8] combine both local descriptors and holistic features to improve performance. Our work belongs to the first category. Specifically, we extract local features from skeleton sequences, for the following two reasons:

• Compared to appearances, skeleton information is more suitable for describing human actions, in the sense that it is less affected by personal features;

• Thanks to the significant improvement of skeleton extraction algorithms and devices, e.g., the Kinect, reliable skeleton data have become available.

In this work, we further use the key-frames automatically extracted from 3D skeleton sequences, following a method proposed by P. Huang [9]. This not only reduces the dimensionality of the 3D skeleton sequence, but also reduces browsing time, bandwidth, and the computational cost of retrieval.

With the extracted features, a refinement process, via dimension reduction or feature selection, is usually applied to further improve the discriminative capability of the features. Some reduction methods that use spatial-temporal interest points or global statistics are reviewed in [8]. In this work, the patterns are defined based on the extracted key-frames. We use a statistical approach to weight patterns that are shared within the same action and differ across actions, so as to extract the discriminative patterns for each action category.

The refined patterns are fed into a classifier for action recognition. Pattern classifiers can be divided into two categories: one is based on stochastic models such as HMM [10] and pLSA [11]; the other is based on statistical models such as SVM [11] and ANN [6], [13]. Although HMM-based methods are widely used in video processing, they have the drawback that the action states must be pre-defined. Actions contain sequences of patterns; we treat actions like documents containing words, so the action recognition problem becomes one of classifying documents into the right categories. In text classification, documents are represented as vectors where each component is associated with a particular word in the codebook. Traditional weighting methods like Term Frequency-Inverse Document Frequency (TF-IDF) estimate the importance of each word in a document. However, under TF-IDF the importance of a word is independent of the categorization task. In this work, we use the improved TF-IDF weighting method [14] to extract discriminative patterns, which yields a set of characteristics that remain relatively constant and separate different categories.



The main contributions of this work are the extraction of discriminative patterns as local features and the use of a statistical approach from text classification to recognize actions. We use 3D shape histograms to convert skeletons into comparable histograms, and use these to summarize the 3D videos into sequences of key-frames. We define patterns as local features based on these key-frames. We use the improved TF-IDF [14] to extract discriminative patterns whose characteristics separate different categories. This bag of discriminative patterns is used to recognize actions.

II. RELATED WORK

This work belongs to the local descriptor category: we extract discriminative patterns from 3D videos of skeletons. We review representative papers in this field and on 3D object descriptors.

A. Local descriptors

Local descriptor approaches extract the neighborhood information of interest points and focus more on local features. These features are the crucial elements of the actions. Their distribution is then often used to represent the entire sequence, which in the end results in a global representation.

Much research on 2D video has used local descriptors. Willems et al. [2] presented spatial-temporal interest points that are scale-invariant, both spatially and temporally. They built a visual vocabulary from the feature descriptors using k-means clustering, and a bag of words is computed for each video using a weighted approach. Klaser et al. [1] presented a local descriptor based on histograms of oriented spatial-temporal gradients. They constructed a HoG-like descriptor and optimized its parameters for action recognition in videos. In general, methods using 2D videos are view-dependent, and are vulnerable to appearance changes of the target objects.

Only a few studies have considered 3D skeletons as local descriptors, mainly due to the difficulty of reliable skeleton extraction. Kilner et al. [3] use an appropriate feature descriptor and distance metric to extract action key-poses in 3D human videos. They used a key-pose detector to identify each key-pose in the target sequence, and a Markov model to match the library to the target sequence and estimate the matched action. Andreas Baak et al. [4] proposed a novel key-frame-based method for automatic classification and retrieval of motion capture data. They label individual frames using a set of relational features that describe geometric relations between specified points of poses, and then use them to extract key-frames. They proposed algorithms to measure the hits and the matching between two sequences of key-frames for retrieval. However, with this approach it is hard to minimize the set of relational features, because different actions should have different sets of relational features.

In this paper, we extract key-frames automatically based on a self-similarity matrix, then group and refine these key-frames to obtain discriminative patterns. We then regard each action sequence as a document of action "words", and use the text classification concept to classify different actions.

B. Three-dimensional (3D) object descriptors

Effective 3D object retrieval requires the specification of 3D object similarity measures. 3D video data has no hierarchical structure in each frame and no correspondences between successive frames, which makes motion analysis difficult and computationally expensive. On the other hand, 3D video data provides strong geometric information, which allows us to compare a pair of frames by measuring only their geometric similarity. Various shape descriptors have been proposed for 3D shape matching: Shape Distribution [15], Spin Image [16], Shape Histogram [17] and Spherical Harmonics [13]. A good similarity measure has to satisfy some conditions:

• Dissimilar shapes have different descriptors, and similar shapes have the same descriptors.

• The descriptor should be unchanged if we apply different rotations to the different frequency components of a spherical function.

P. Huang et al. [18] presented a performance evaluation of shape similarity metrics for 3D video sequences of people with unknown temporal correspondence. They used the optimal parameter setting for each approach and compared different similarity measures. By evaluating the self-similarity over a list of actions, they concluded that Shape Histograms with volume sampling, which are also used in the present paper, consistently give the best performance across different people and motions in measuring 3D shape similarity with unknown correspondence.

III. PATTERN EXTRACTION

A. Three-Dimensional Skeleton Histogram

In this work, we use Shape Histograms to represent 3D skeletons instead of 3D surface meshes. 3D surface meshes and 3D skeletons have different properties: for the same position of body parts, two surface meshes could differ due to clothes or body shapes, while two skeletons are nearly the same. Based on this property, we use shape histograms to define the different action patterns from skeleton positions.

The 3D shape histogram was first introduced by M. Ankerst [17]. The shape histogram of a point is a measure of the distribution of the relative positions of neighboring points. The distribution is defined as a joint histogram, where each histogram axis represents a parameter in a polar coordinate system. Bins around each point are spatially defined, and the number of neighboring points falling in these bins serves as the context of the point.

Along the radial direction, as described in Figure 1, bins are arranged uniformly in log-polar space, increasing the importance of nearby points with respect to points farther away. If there are X bins for the radius, Y bins for the azimuth and Z bins for the elevation, there are X×Y×Z = L bins for the 3D shape histogram in total. The optimal bin size (X, Y, Z) for human shape similarity was reported in [19].


Figure 1. Illustration of Skeleton Histogram (left: spherical bins around the skeleton center in x, y, z; right: the resulting histogram of number of points per bin index).

A 3D skeleton is a set of bone segments that lie inside the model. 3D space is transformed to a spherical coordinate system around the skeleton center of the model (see Figure 1). A skeleton histogram is then constructed by accumulating the points in each bin. Since it is computed about the center of gravity of the skeleton, this histogram is translation invariant.
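To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of accumulating skeleton points into a log-polar spherical histogram. The function name, the default bin counts and the radial range r_max are illustrative assumptions; the optimal (X, Y, Z) is the one reported in [19].

```python
import numpy as np

def shape_histogram(points, center, x_bins=5, y_bins=10, z_bins=5, r_max=1.0):
    """Illustrative sketch: bin skeleton points into a log-polar spherical
    histogram around `center`, giving L = X*Y*Z bins in total."""
    p = points - center                      # translate to the skeleton center
    r = np.linalg.norm(p, axis=1)            # radial distance
    azim = np.arctan2(p[:, 1], p[:, 0])      # azimuth in [-pi, pi]
    elev = np.arcsin(np.clip(p[:, 2] / np.maximum(r, 1e-9), -1.0, 1.0))
    # Logarithmic radial edges emphasize points close to the center.
    r_edges = np.logspace(-1, 0, x_bins + 1) * r_max
    xi = np.clip(np.searchsorted(r_edges, r) - 1, 0, x_bins - 1)
    yi = ((azim + np.pi) / (2 * np.pi) * y_bins).astype(int) % y_bins
    zi = np.clip(((elev + np.pi / 2) / np.pi * z_bins).astype(int), 0, z_bins - 1)
    hist = np.zeros((x_bins, y_bins, z_bins))
    np.add.at(hist, (xi, yi, zi), 1)         # accumulate point counts per bin
    return hist.ravel()                      # flattened histogram of length L
```

Because the binning is done relative to the skeleton center, translation invariance holds by construction.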

For any two skeletons A and B, let their shape histograms be expressed as $h_A(l)$ and $h_B(l)$, where $l = 1, 2, 3, \ldots, L$. A measure of similarity between these two histograms is computed by using the $\chi^2$-distance:

$$d(A, B) = \frac{1}{2} \sum_{l=1}^{L} \frac{\left( h_A(l) - h_B(l) \right)^2}{h_A(l) + h_B(l)} \qquad (1)$$

which in fact evaluates the normalized difference at each bin of two histograms. A lower distance value between two histograms means a higher similarity between two skeletons.
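A direct transcription of Eq. (1) as a sketch; the small eps term is our own safeguard, not part of the original formula:

```python
import numpy as np

def chi2_distance(h_a, h_b, eps=1e-9):
    """Chi-squared distance of Eq. (1); a lower value means more similar
    skeletons. `eps` (an added assumption) avoids division by zero when a
    bin is empty in both histograms."""
    h_a = np.asarray(h_a, dtype=float)
    h_b = np.asarray(h_b, dtype=float)
    return 0.5 * np.sum((h_a - h_b) ** 2 / (h_a + h_b + eps))
```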

B. Key-frames Extraction

We suppose that each action consists of more fundamental unit actions. Given an action sequence, we first divide it into a series of short temporal segments, each of which represents one unit action. From the segment of each unit action, we choose one frame as its representative state, which is called a "Key-frame". In this paper, we use the automatic key-frame selection method proposed by P. Huang [9] to summarize 3D video sequences. Basically, the key-frames are extracted by a graph-based method. We regard each key-frame, denoted by both its index and the neighborhood size it represents, as one node in a graph. Edges are only inserted between two nodes that could be neighboring key-frames, i.e., they should be temporally overlapped. The weight of each edge is evaluated by the difference between the key-frames of its two nodes. Since we expect a set of key-frames that have maximized mutual distances, we can intuitively use the shortest path algorithm to find the optimal solution. The whole process is illustrated in Figure 2.

Figure 2. Illustration of automatic key-frames selection (both axes of the self-similarity matrix are frame numbers).

Formally, this method first computes the self-similarity matrix of shape histograms between all frames of a 3D video sequence,

$$S = \left( s_{i,j} \right)_{N_f \times N_f} = \left\{ d(h_i, h_j) \right\}_{N_f \times N_f} \qquad (2)$$

where $N_f$ is the number of frames in the video.

Each possible key-frame is evaluated by a Conciseness Cost, which is defined as a weighted sum of its representative cost (denoted by its average similarity to its neighbors) and accuracy cost (currently set to 1).

$$c_{i,f_i} = \beta + (1 - \beta) \sum_{k = -f_i}^{+f_i} s_{i,i+k} \qquad (3)$$

where $\beta$ is the parameter that weights the distortion, and $i$ is the index of the key-frame with a local time window $f_i$ covering all the neighboring frames it represents. A Conciseness Matrix is then formed from the costs of all frames in a motion sequence under different neighborhood sizes:

$$C = \left( c_{i,f_i} \right)_{N_f \times \left( \lfloor N_f / 2 \rfloor + 1 \right)}, \quad i = 1, 2, \ldots, N_f \qquad (4)$$

A graph is constructed from the Conciseness Cost Matrix: each element $c_{i,f_i}$ of the Conciseness Matrix corresponds to a graph node $v_{i,f_i}$. Two extra nodes, $v_{source}$ and $v_{sink}$, are added to represent the start and the end of the sequence. Edges are inserted between two nodes if they satisfy the following conditions:

$v_{i,f_i}$ to $v_{j,f_j}$: $\quad e_{ij} = \left( c_{j,f_j} \right) \wedge (i < j) \wedge (i + f_i \geq j - f_j) \qquad (5)$

$v_{source}$ to $v_{i,f_i}$: $\quad e_{source,i} = \left( c_{i,f_i} \right) \wedge (i - f_i = 1) \qquad (6)$

$v_{i,f_i}$ to $v_{sink}$: $\quad e_{i,sink} = (0) \wedge (i + f_i = N_f) \qquad (7)$

which require any two neighboring nodes to be temporally overlapped.


Figure 3. Key-frames extraction and patterns definition.

A list of key-frames is created by finding the shortest path on the graph from $v_{source}$ to $v_{sink}$. Each node on the shortest path represents a key-frame. After extraction from each video in the database, we have a list of key-frames:

$$K = \left\{ \kappa(i, f_i) \right\}_{i=1}^{m} \qquad (8)$$

where $\kappa(i, f_i)$ is a key-frame and $m$ is the number of key-frames extracted from that video. In Figure 3, the blue circles in the key-frames extraction step depict the representation space of the key-frames.
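Putting Eqs. (2)-(8) together, the following sketch gives one possible reading of this graph construction. It assumes 0-based frame indices, the chi2_distance function from Sec. III-A, and the networkx library for the shortest-path search; parameter names such as f_max are illustrative.

```python
import itertools
import networkx as nx
import numpy as np

def extract_keyframes(hists, beta=0.5, f_max=10):
    """Sketch of the key-frame selection: nodes are (frame i, window f)
    pairs scored by the Conciseness Cost of Eq. (3); the shortest
    source-to-sink path (edge rules of Eqs. 5-7) yields the key-frames."""
    n = len(hists)
    # Self-similarity matrix of Eq. (2).
    s = np.array([[chi2_distance(a, b) for b in hists] for a in hists])
    def cost(i, f):                          # Conciseness Cost, Eq. (3)
        return beta + (1 - beta) * s[i, i - f:i + f + 1].sum()
    nodes = [(i, f) for i in range(n) for f in range(1, f_max + 1)
             if i - f >= 0 and i + f <= n - 1]
    g = nx.DiGraph()
    for (i, f) in nodes:
        if i - f == 0:                       # Eq. (6): window starts at frame 0
            g.add_edge('source', (i, f), weight=cost(i, f))
        if i + f == n - 1:                   # Eq. (7): window ends at last frame
            g.add_edge((i, f), 'sink', weight=0.0)
    for (i, fi), (j, fj) in itertools.product(nodes, nodes):
        if i < j and i + fi >= j - fj:       # Eq. (5): temporal overlap
            g.add_edge((i, fi), (j, fj), weight=cost(j, fj))
    path = nx.shortest_path(g, 'source', 'sink', weight='weight')
    return path[1:-1]                        # list of (key-frame index, window)
```

The returned (i, f) pairs correspond to the $\kappa(i, f_i)$ entries of Eq. (8).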

C. Patterns Definition

Although we have extracted a set of key-frames, some of them might represent the same unit action. We hence further group them into patterns. For each pattern, we specify an acceptance threshold, meaning that any key-frame whose distance to a pattern is smaller than this threshold will be recognized as a sample of this pattern; e.g., in Figure 3, pattern P1 includes key-frames 0, 134 and 426.

The actions are considered as sequences of patterns. Each pattern is similar to a word in a document. An action sequence $S$ is defined as a collection of patterns¹:

$$S = \left\{ p_i \right\}_{i=1}^{N_p} \qquad (9)$$

where $N_p$ is the number of patterns in $S$.

We define the patterns based on all the lists of extracted key-frames. Each histogram expresses a skeleton center point in each key-frame. A list of histograms is created corresponding to a list of key-frames:

$$H = \left\{ (h_i, f_i) \right\}_{i=1}^{m} \qquad (10)$$

A pattern $p_i = \left( (h_i, f_i), \theta_i \right)$ includes key-frame $\kappa(j, f_j)$ if it satisfies $d(h_i, h_j) \leq \theta_i$. In this work, the pattern's acceptance threshold $\theta_i$ is calculated as

$$\theta_i = \max\left( d(h_i, h_{i - f_i}),\, d(h_i, h_{i + f_i}) \right) \qquad (11)$$

¹ Note that a sequence is not a set, but a naive collection of patterns, which allows the existence of duplicated patterns.

where $d(\cdot)$ is the distance measure from Eq. (1).

From all the lists of key-frames, we create a bag of patterns $P$, defined as:

$$P = \left\{ p_i \right\}_{i=1}^{N_p} = \left\{ \left( (h_i, f_i), \theta_i \right) \right\}_{i=1}^{N_p} \qquad (12)$$

We convert all the lists of key-frames to sequences of correlated patterns.
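As an illustration of this conversion, the sketch below greedily merges key-frames into patterns using the acceptance threshold of Eq. (11). It assumes the chi2_distance function defined earlier is passed as `dist`; the greedy first-match policy is our own simplification, not necessarily the authors' grouping rule.

```python
def keyframes_to_patterns(hists, keyframes, dist):
    """Sketch: turn a list of key-frames (i, f_i) into a pattern list and a
    pattern-id sequence (Eq. 9). `dist` is the distance measure of Eq. (1)."""
    patterns = []                    # each entry: (prototype histogram, theta)
    sequence = []                    # the action as a sequence of pattern ids
    for (i, f) in keyframes:
        h = hists[i]
        # Acceptance threshold of Eq. (11): distance to the window borders.
        theta = max(dist(h, hists[i - f]), dist(h, hists[i + f]))
        for pid, (proto, p_theta) in enumerate(patterns):
            if dist(proto, h) <= p_theta:    # key-frame falls inside a pattern
                sequence.append(pid)
                break
        else:                                # no match: create a new pattern
            patterns.append((h, theta))
            sequence.append(len(patterns) - 1)
    return patterns, sequence
```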

A category $C_k$ is defined as a set of sequences:

$$C_k = \left\{ S_i \right\}_{i=1}^{N_{seq}(k)}, \quad k = 1, 2, \ldots, N_{cat} \qquad (13)$$

where $N_{seq}(k)$ is the number of sequences in category $C_k$ and $N_{cat}$ is the number of categories.

D. Extraction of Discriminative Patterns

Although the patterns have been defined, they are not guaranteed to be suitable for classifying actions. In this section, we include the category information from the training data to select the discriminative patterns. In our understanding, a discriminative pattern should have two major features:

• It should appear quite often in the target sequence, so as to be a reliable pattern for representing the action;

• It should appear much more often in one action than in all other actions, so that we can use it as a clue for identifying the action.

Based on the above considerations, we define a weight for evaluating the confidence that a sequence $S$ belongs to a specific action $C_k$, given the fact that it contains pattern $p_i$, as:

$$conf(S \in C_k \mid p_i) = TF(p_i \in S) \cdot Str(p_i \in C_k) \qquad (14)$$

There are two parts in Eq. (14). The first part is calculated as:

$$TF(p_i \in S) = \frac{1}{N_p} \sum_{j=1}^{N_p} \delta_{p_i, p_j} \qquad (15)$$

It in fact defines the frequency of pattern $p_i$ in the sequence $S$, which evaluates the importance of $p_i$ in the sequence; $\delta_{p_i, p_j} = 1$ when $p_i = p_j$, and is zero otherwise. We focus on the second part, which defines the discriminative capability of $p_i$ in classifying $C_k$.


We use the Wilson proportion estimate to evaluate how often a pattern appears in an action category.

The second part is computed in a similar way to [14], as follows. We estimate the proportion of sequences containing pattern $p$ to be:

$$\hat{p} = \frac{x_p + 0.5\, z_{\alpha/2}^2}{n + z_{\alpha/2}^2} \qquad (16)$$

where $\hat{p}$ is the Wilson proportion estimate [20], $x_p$ is the number of documents containing pattern $p$ in the collection (category/categories), and $n$ is the size of the collection. We suppose that the confidence of a pattern with respect to the categories follows a normal distribution, and take the 95% confidence interval, i.e., $\alpha = 5\%$, which is calculated as:

$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n + 3.84}} \qquad (17)$$
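As a quick arithmetic check with hypothetical counts: for $x_p = 8$ and $n = 10$, Eq. (16) gives $\hat{p} = (8 + 0.5 \times 3.84)/(10 + 3.84) = 9.92/13.84 \approx 0.717$, and Eq. (17) gives $0.717 \pm 1.96\sqrt{0.717 \times 0.283 / 13.84} \approx 0.717 \pm 0.237$, i.e., the interval $[0.48, 0.95]$.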

For a given category $C_k$, the positive documents are denoted as $p^+$ and the negative documents in the other categories as $p^-$. From Eq. (17), we calculate the lower bound of the confidence interval of $p^+$, labeled MinPos, and the upper bound of $p^-$, labeled MaxNeg. The strength of pattern $p_i$ for category $C_k$ is calculated as:

$$Str(p_i \in C_k) = \begin{cases} \log_2 \left( \dfrac{2\, MinPos}{MinPos + MaxNeg} \right) & \text{if } MinPos > MaxNeg \\ 0 & \text{otherwise} \end{cases} \qquad (18)$$

which means that pattern $p_i$ should appear more often in positive documents than in negative documents, and the difference should be as significant as possible.
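A small sketch of Eqs. (16)-(18), with z fixed to 1.96 so that z² = 3.84 as in Eq. (17); the function names and the count-based interface are illustrative assumptions:

```python
import math

def wilson_bounds(x, n, z=1.96):
    """Wilson proportion estimate (Eq. 16) and its confidence interval
    (Eq. 17); returns the (lower, upper) bounds."""
    p_hat = (x + 0.5 * z * z) / (n + z * z)
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / (n + z * z))
    return p_hat - half, p_hat + half

def strength(x_pos, n_pos, x_neg, n_neg):
    """Pattern strength of Eq. (18) from document counts: x_pos of n_pos
    positive documents and x_neg of n_neg negative documents contain p."""
    min_pos, _ = wilson_bounds(x_pos, n_pos)   # lower bound of p+ (MinPos)
    _, max_neg = wilson_bounds(x_neg, n_neg)   # upper bound of p- (MaxNeg)
    if min_pos > max_neg:
        return math.log2(2.0 * min_pos / (min_pos + max_neg))
    return 0.0
```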

From Eqs. (14), (15) and (18), the confidence that a sequence $S$ belongs to a specific action $C_k$ is calculated by summing up the weights of all its discriminative patterns:

$$conf(S \in C_k) = \sum_{i=1}^{N_p} conf(S \in C_k \mid p_i) \qquad (19)$$

If $conf(S \in C_k)$ has the highest value over all categories $C_j$, $j = 1, 2, \ldots, N_{cat}$, then $S$ is classified into category $C_k$.
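Combining Eqs. (14), (15) and (19), the decision rule can be sketched as follows; `str_table[pid][k]` is assumed to hold the precomputed $Str(p_{pid} \in C_k)$ values from Eq. (18):

```python
from collections import Counter

def classify(sequence, str_table, n_categories):
    """Sketch of the final rule: score each category by summing
    TF(p_i in S) * Str(p_i in C_k) over the sequence (Eqs. 14, 15, 19)
    and return the argmax category index."""
    n_p = len(sequence)
    tf = Counter(sequence)                   # occurrence counts for Eq. (15)
    scores = [sum((tf[pid] / n_p) * str_table[pid][k] for pid in sequence)
              for k in range(n_categories)]
    return max(range(n_categories), key=scores.__getitem__)
```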

IV. EXPERIMENTAL RESULTS

In this section, we present experimental results of using the extracted discriminative patterns for action recognition. Motion capture systems, ranging from marker-based systems to the recent Kinect, can be used to reconstruct the motion of moving subjects by measuring 3D skeletons. The skeleton in a 3D video is a collection of accurate 3D positions of body parts; this clean information is very helpful for extracting the hidden action features. We collected seven actions from the CMU motion capture database [12]: walking up/down stairs, jumping forward, jumping up, boxing, golf swinging, running and walking. Each action has fifteen 3D videos. We randomly sampled two-thirds of the action sequences for training and used the remaining one-third for testing. We repeated the experiment 10 times and averaged the results.

Figure 4. Average classification results.

The overall classification accuracy for each action is shown in Figure 4. Four actions, i.e., walking, running, golf swinging and boxing, are well classified. Walking up/down stairs has high classification accuracy on the training data set but reaches only 88 percent on the testing data. Jumping forward reaches 68 percent on the testing data set and 84 percent on the training data. Three actions, i.e., walking up/down stairs, jumping forward and jumping up, share some of the same skeleton positions, so the number of overlapping patterns among them is high, which reduces the classification accuracy.

Figure 5. Confusion matrix for training (A) and testing (B) data sets.

The confusion matrices on the testing and training data sets are shown in Figure 5. Some strong confusions occur between similar actions, such as jumping up vs. walking (No. 4 to No. 0), jumping forward vs. jumping up (No. 5 to No. 4) and walking up/down stairs vs. jumping up (No. 6 to No. 4). These confusions arise because the pairs of actions share some of the same skeleton positions, which makes their sets of discriminative patterns overlap and confuses the classification. Jumping up contains some of the same discriminative patterns present in walking. The confusions (No. 4 to No. 0) and (No. 5 to No. 4) occur not only on the testing data set but also on the training data set.

We evaluate the overall classification rate of the categories depending on the conf from Eq. (19). The contribution of a pattern to a given classification is evaluated by whether the strength in Eq. (18) is positive or not. We calculate the ROC curve with the True Positive Rate (TPR) and False Positive Rate (FPR) under different confidence thresholds for each category $C_k$:

$$TPR_{C_k} = \frac{N_{TP}^k}{N_{AP}^k} \qquad (20)$$


$$FPR_{C_k} = \frac{N_{FP}^k}{N_{TN}^k} \qquad (21)$$

$N_{TP}^k$ (Number of True Positives) is the number of patterns $p$ in sequence $S$ that satisfy $(S \in C_k) \wedge (Str(p \in C_k) > 0)$, and $N_{AP}^k$ (Number of All Positives) is the number of patterns $p$ that satisfy $Str(p \in C_k) > 0$. $N_{FP}^k$ (Number of False Positives) is the number of patterns $p$ that satisfy $(Str(p \in C_k) > 0) \wedge (Str(p \in C_i) > 0),\ i \neq k$, and $N_{TN}^k$ (Number of True Negatives) is the number of patterns $p$ that satisfy $(p \in C_k) \wedge (p \in C_i),\ i \neq k$. TPR evaluates the average positive discriminative power, while FPR evaluates the average mutual ambiguity of the patterns. Hence, a higher TPR with a lower FPR implies a better discriminative capability of the extracted patterns.

The ROC curves, calculated from Eqs. (20)-(21) for all categories, are shown in Figure 6. As expected, there is a trade-off between a high classification rate and a low false alarm rate. From the fact that the curve lies above the random guess line, we confirm that our method is effective in extracting discriminative patterns, especially for classifying actions such as walking, boxing, golf swinging and running. The extracted patterns for walking up/down stairs, jumping forward and jumping up perform somewhat worse, due to the mutual ambiguity shown in Figure 5. A possible solution is to further explore the mutual difference between pairs of similar actions, which is left as future work.

Figure 6. ROC curves showing the influence of conf on the classification performance.

V. CONCLUSION AND FUTURE WORK

This paper introduces a method to extract discriminative patterns as local features to classify skeleton sequences for human action recognition. Based on the skeleton histogram, we extract key-frames from 3D videos and define patterns as local features. We use a classification concept from information retrieval to estimate the importance of discriminative patterns for a specific categorization. Experimental results indicate that the action recognition model using these discriminative patterns gives very high accuracy. In future work, this model can be extended in a data-driven approach by using knowledge of 3D human skeleton motion to guide 2D feature tracking, or by re-projection into silhouettes to recover pose estimates from multi-camera videos.

REFERENCES

[1] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3D-gradients. BMVC, pages xx-yy, 2008.
[2] G. Willems, T. Tuytelaars, and L. V. Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. ECCV, pages 650-663, 2008.
[3] J. Kilner, J.-Y. Guillemaut, and A. Hilton. 3D action matching with key-pose detection. ICCV Workshops, Kyoto, pages 1-8, 2009.
[4] A. Baak, M. Müller, and H.-P. Seidel. An efficient algorithm for keyframe-based motion retrieval in the presence of temporal deformations. ACM-MIR, 2008.
[5] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. CVPR, pages 1-8, 2008.
[6] L. Wang, X. Geng, C. Leckie, and R. Kotagiri. Moving shape dynamics: a signal processing perspective. CVPR, pages 1-8, 2008.
[7] X. Sun, M.-Y. Chen, and A. Hauptmann. Action recognition via local descriptors and holistic features. IEEE CVPR Workshop for Human Communicative Behaviour Analysis, Miami Beach, Florida, USA, June 2009.
[8] K. Mikolajczyk and H. Uemura. Action recognition with appearance-motion features and fast search trees. Computer Vision and Image Understanding, 115(3), March 2011.
[9] P. Huang, A. Hilton, and J. Starck. Automatic 3D video summarization: key frame extraction from self-similarity. 3DPVT'08, pages 71-78, Atlanta, GA, USA, June 2008.
[10] J. Yamato, J. Ohya, and K. Ishii. Recognizing human action in time-sequential images using hidden Markov model. CVPR, pages 379-385, 1992.
[11] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3):299-318, 2008.
[12] CMU Graphics Lab Motion Capture Database: http://mocap.cs.cmu.edu/
[13] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. SGP '03, pages 156-164, 2003.
[14] P. Soucy and G. W. Mineau. Beyond TFIDF weighting for text categorization in the vector space model. IJCAI, Edinburgh, Scotland, UK, pages 1130-1135, 2005.
[15] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM Trans. Graph., 21(4):807-832, 2002.
[16] A. E. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell., 21(5):433-449, 1999.
[17] M. Ankerst, G. Kastenmüller, H. P. Kriegel, and T. Seidl. 3D shape histograms for similarity search and classification in spatial databases. SSD'99, 1651:207-228, 1999.
[18] P. Huang, A. Hilton, and J. Starck. Shape similarity for 3D video sequences of people. IJCV special issue on 3D Object Retrieval, 89(2-3), September 2010.
[19] P. Huang, J. Starck, and A. Hilton. A study of shape similarity for temporal surface sequences of people. 3DIM'07, pages 408-418, Montréal, Québec, Canada, August 2007.
[20] E. B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association, 22:209-212, 1927.