[IEEE 2013 Visual Communications and Image Processing (VCIP) - Kuching, Malaysia...


Page 1: [IEEE 2013 Visual Communications and Image Processing (VCIP) - Kuching, Malaysia (2013.11.17-2013.11.20)] 2013 Visual Communications and Image Processing (VCIP) - Recognizing human

RECOGNIZING HUMAN ACTIONS BASED ON SPARSE CODING WITH NON-NEGATIVE AND LOCALITY CONSTRAINTS

Yuanbo Chen¹, Yanyun Zhao¹,², Anni Cai¹,²

¹ School of Information and Communication Engineering

² Beijing Key Laboratory of Network System and Network Culture

Beijing University of Posts and Telecommunications, Beijing, China

{bupt cyb, zyy, annicai}@bupt.edu.cn

ABSTRACT

In this paper, Sparse Coding with Non-negative and Locality constraints (SCNL) is proposed to generate discriminative feature descriptions for human action recognition. The non-negative constraint ensures that every data sample lies in the convex hull of its neighbors. The locality constraint ensures that a data sample is represented only by its related neighbor atoms. The sparsity constraint keeps the number of dictionary atoms involved in the sample representation as small as possible. The SCNL model captures the global subspace structures of data better than classical sparse coding, and is more robust to noise than locality-constrained linear coding. Extensive experiments on three well-known human action datasets confirm the significant advantages of the proposed SCNL model.

Index Terms— Human action recognition, SCNL model,

datum-adaptive, locality, sparse

1. INTRODUCTION

Because of the complexity of human actions in the real world, an informative feature description is critical to an action recognition system. Sparse coding (SC) is a widely adopted feature representation owing to its nonlinear representational ability. It has also been shown in neuroscience [1] that the human visual system seeks a sparse code for the incoming image using a few words from a feature vocabulary. Instead of representing a feature by its closest centroid, as Vector Quantization (VQ) does, SC represents each data sample by a weighted combination of dictionary atoms, and thus handles outliers better than the popular VQ coding scheme. Recently, sparse coding has attracted much attention and has proven very successful in action recognition and other classification tasks [2, 3, 4].

This work was supported by Chinese National Natural Science Foundation (90920001, 61101212), National S&T Major Project of the Ministry of S&T (2012ZX03005008, 2012BAH41F03), and the Fundamental Research Funds for the Central Universities.

In an SC system, the dictionary is usually over-complete, a property that allows a test sample to be represented sparsely over the dictionary. However, because SC pursues sparsity alone, similar data samples may be represented by very different subsets of dictionary atoms. The resulting sparse codes can vary significantly in their activation sets, which harms recognition. In addition, SC requires solving an $\ell_1$-norm optimization problem over high-dimensional dictionary atoms, which makes it computationally expensive.

Recently, a fast alternative called Locality-constrained Linear Coding (LLC) [5] was proposed. It imposes a locality constraint under which each descriptor is projected into its local coordinate system only, so nonzero coefficients are assigned mostly to bases near the encoded data. Moreover, because LLC avoids solving an $\ell_1$-norm problem, it is much faster and yields smoother codes than SC, and it has shown better performance than state-of-the-art methods, including ScSPM, in image classification [6].

However, several issues limit the applicability of LLC. In LLC, the number of nearest neighbors (knn) is usually fixed to a small value (e.g., knn=5 in [5]) to obtain sparse solutions, but this is not sparsity in the sense of the $\ell_0$-norm. When knn is set to a large value (such as 10, 50, or larger), the solution becomes dense. Because a fixed global parameter defines the local neighbors, LLC cannot produce datum-adaptive neighborhoods, which makes it unable to reflect the real structure of the data and sensitive to local data noise. Moreover, in LLC the data may be described as a combination of bases involving both additive and subtractive interactions. The fact that features can cancel each other out through subtraction is contrary to the intuitive notion of combining parts to form a whole [7]. Also, from a biological point of view, non-negative representations are consistent with the non-negativity of neural firing rates [7].

Inspired by the above insights, we propose to combine sparsity, locality and non-negativity constraints on high-dimensional data to construct a new informative feature representation named Sparse Coding with Non-negative and Locality constraints (SCNL). The non-negative constraint ensures that every data sample lies in the convex hull of its neighbors. The locality constraint makes data samples consider only their related neighbor atoms, which accommodates the so-called classification assumption that samples of the same class are likely to share the same structure, and thus helps reduce intra-class distance. The sparsity constraint ensures that as few dictionary atoms as possible are involved in the solution. Given a set of data vectors, the SCNL model automatically seeks the most related atoms among their local neighbors, and thus reflects the optimal neighborhood structure of the data. With this property, SCNL captures the global subspace structures of data better than SC and, unlike LLC, has data-adaptive neighborhoods. The proposed SCNL model has several advantages:

(1) Robust to noise. Data noise is inevitable, especially for visual data in the real world. LLC [5] uses an $\ell_2$-norm, so its feature representation changes easily when noise enters the neighborhood. In SCNL, by contrast, the optimal atoms are selected by the $\ell_1$-constraint, which handles outliers better than LLC as well as VQ [8].

(2) Local smooth sparsity. To favor sparsity on an over-complete dictionary, SC might select quite different atoms for similar patches, so that samples of the same class receive very different structures. In the SCNL coding scheme, however, the locality constraint restricts the sparse coding process to atoms within related local neighborhoods, which forces similar data samples to share similar atoms and results in much smoother codes than SC.

(3) Datum-adaptive neighborhood. In action recognition systems, many factors, such as illumination, occlusion, camera motion and cluttered background, may make the data distribution vary significantly across different areas of the feature space, which results in a distinctive neighborhood structure for each datum, even within the same class. Therefore, when a fixed neighborhood is applied, the neighborhoods of some data samples may easily be contaminated with false neighbors. In the proposed SCNL, the explicit sparsity adaptor automatically selects the most valid atoms, which makes the SCNL model datum-adaptive. This property is valuable for applications with unevenly distributed data, and can reflect the optimal sparse neighborhood structure of the data.

We have conducted extensive experiments on three public action datasets, and the results demonstrate that our SCNL model provides informative and discriminative feature representations and is thus very useful for the action recognition task.

The remainder of this paper is organized as follows. Section 2 introduces the formulation of our SCNL model and its optimization algorithm. Section 3 gives the details of applying the SCNL model to action recognition. Our experiments and analysis are presented in Section 4. Finally, Section 5 concludes the paper.

2. SPARSE CODING WITH NON-NEGATIVE AND LOCALITY CONSTRAINTS

2.1. Formulation of SCNL

Suppose samples (e.g., image patches) $X \in \mathbb{R}^{M \times N}$ can be represented as a linear combination of a few columns of a dictionary:

$$X = BC, \tag{1}$$

where $B = [b_1, \ldots, b_K] \in \mathbb{R}^{M \times K}$ is the dictionary with $K$ columns, and each column $b_i$, $i \in \{1, \ldots, K\}$, is an $M$-dimensional basis vector. $C \in \mathbb{R}^{K \times N}$ is the coefficient matrix for all $N$ samples.

Taking the reconstruction error, locality, sparsity and non-negativity into account, the proposed SCNL seeks the optimal solution of the following cost function:

$$\min_{B,C} \ \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \mu\|C\|_1, \quad \text{s.t. } B \succeq 0, \ C \succeq 0. \tag{2}$$

where the first term is the reconstruction error, $\|D \odot C\|_F^2$ is the locality term that keeps the solution within local bases, and $\|C\|_1$ denotes the $\ell_1$-norm of $C$, which makes the solution as sparse as possible. $\lambda$ and $\mu$ are positive regularization parameters that trade off the reconstruction error against the locality and sparsity of the coefficients $C$; a larger $\mu$ leads to sparser solutions. $\odot$ denotes element-wise multiplication, and $D = [D_1, \ldots, D_N] \in \mathbb{R}^{K \times N}$ is the matrix of distances between each input sample $x_i$ and each basis vector, i.e.

$$D_i = \mathrm{dist}(x_i, B) = \left[\exp\left(\|x_i - b_1\|^2/\sigma\right), \ldots, \exp\left(\|x_i - b_K\|^2/\sigma\right)\right]^T, \tag{3}$$

where $\sigma$ is the radius parameter of the Gaussian function. It is set to the mean of all pairwise distances between the training samples and the dictionary bases, computed on the training set of the KTH dataset [9], and this value is then used in all of our experiments.
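To make Eqs. 2 and 3 concrete, the following is a minimal NumPy sketch of the locality matrix $D$ and of the cost value. It assumes the exponent in Eq. 3 is the squared Euclidean distance; the function names `locality_matrix` and `scnl_objective` are illustrative and do not come from the paper.

```python
import numpy as np

def locality_matrix(X, B, sigma):
    """Eq. 3: D[k, i] = exp(||x_i - b_k||^2 / sigma) for samples X (M x N) and atoms B (M x K)."""
    # Squared Euclidean distance between every atom (rows) and every sample (columns).
    sq_dist = (
        np.sum(B**2, axis=0)[:, None]
        - 2.0 * B.T @ X
        + np.sum(X**2, axis=0)[None, :]
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(sq_dist / sigma)

def scnl_objective(X, B, C, D, lam, mu):
    """Value of the cost in Eq. 2 (the constraints B >= 0, C >= 0 are not checked here)."""
    recon = 0.5 * np.linalg.norm(X - B @ C, "fro") ** 2
    locality = lam * np.linalg.norm(D * C, "fro") ** 2
    sparsity = mu * np.abs(C).sum()
    return recon + locality + sparsity
```

Because the weights in $D$ grow exponentially with distance, a coefficient placed on a far-away atom is penalized much more heavily than one placed on a nearby atom, which is how the locality term steers the code toward local bases.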

2.2. Solving the optimization problem

For efficiency, we adopt the inexact Augmented Lagrange Multiplier (IALM) method [10] to solve Eq. 2. To facilitate alternating minimization, we first introduce an auxiliary variable $R$ and consider the following equivalent model:

$$\min_{B,C,R} \ \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \mu\|R\|_1, \quad \text{s.t. } B \succeq 0, \ C \succeq 0, \ C = R. \tag{4}$$


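Then, the problem can be solved by minimizing the following augmented Lagrangian function:

$$\Gamma(B, C, R, \Lambda_1, \Lambda_2, \Lambda_3, \beta) = \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \mu\|R\|_1 + \langle\Lambda_1, B\rangle + \langle\Lambda_2, C\rangle + \langle\Lambda_3, C - R\rangle + \frac{\beta}{2}\left(\|B\|_F^2 + \|C\|_F^2 + \|C - R\|_F^2\right), \tag{5}$$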

where $\Lambda_i$, $i \in \{1, 2, 3\}$, are the Lagrange multipliers and $\beta$ is the penalty parameter for the constraints. Inexact ALM [10] successively minimizes the augmented Lagrangian function $\Gamma$ with respect to $B$, $C$ and $R$, one at a time, while fixing the others at their most recent values, and then updates the multipliers after each sweep of this alternating minimization. These steps can be written in closed form as Eq. 6, and the complete optimization algorithm is outlined in Algorithm 1.

$$\begin{aligned} B^* &= (XC^T - \Lambda_1)(CC^T + \beta I)^{-1}, \\ R^*_{ij} &= \mathrm{soft}\left(\left(C + \frac{\Lambda_3}{\beta}\right)_{ij}, \frac{\mu}{\beta}\right), \\ C^*_i &= (M + 2\lambda D'_i)^{-1} N_i, \end{aligned} \tag{6}$$

where $\mathrm{soft}(\alpha, \delta) = \mathrm{sign}(\alpha)(|\alpha| - \delta)_+$, $x_+ = \max(x, 0)$, $M = B^TB + 2\beta I$, and $N = B^TX - \Lambda_2 - \Lambda_3 + \beta R$. $C_i$, $D_i$ and $N_i$ are the $i$-th columns of the matrices $C$, $D$ and $N$ respectively, $D'_i = \mathrm{diag}(D_i^2)$ with the square taken element-wise, and the superscript $*$ denotes the value at the new iteration.
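As a check on the column-wise update of $C$ in Eq. 6, it can be recovered by setting the gradient of $\Gamma$ in Eq. 5 with respect to a single column $C_i$ to zero. The derivation below is a sketch that uses only the symbols defined above; $\Lambda_{2,i}$, $\Lambda_{3,i}$ and $R_i$ denote the $i$-th columns of the corresponding matrices, a notation introduced here for illustration.

```latex
\begin{aligned}
\frac{\partial \Gamma}{\partial C_i}
  &= -B^T (x_i - B C_i) + 2\lambda\, \mathrm{diag}(D_i^2)\, C_i
     + \Lambda_{2,i} + \Lambda_{3,i} + \beta C_i + \beta (C_i - R_i) = 0 \\
\Rightarrow\quad
  \left(B^T B + 2\beta I + 2\lambda D'_i\right) C_i
  &= B^T x_i - \Lambda_{2,i} - \Lambda_{3,i} + \beta R_i \\
\Rightarrow\quad
  C_i^* &= (M + 2\lambda D'_i)^{-1} N_i .
\end{aligned}
```

The system matrix depends on $i$ only through the diagonal term $2\lambda D'_i$, so $M$ can be formed once per sweep and reused for every column.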

3. IMPLEMENTATION FOR ACTION RECOGNITION

3.1. Key-points extraction

Human action recognition in real-world videos is a challenging problem. In particular, when the camera is not static, the motion field in an action region is often contaminated by background movements. In this work, we extract interest points with the vorticity-based spatio-temporal point detector proposed in [11], which can suppress camera motion to a certain extent and can extract spatio-temporal interest points around the articulations of a human body, such as moving knees, ankles and elbows, even in a cluttered scene with camera movements. HOG features with 8 orientations are then computed on spatio-temporal patches defined around the extracted interest points.

3.2. SCNL-based feature representation

Note that traditional nonlinear coding methods learn a global dictionary from all the training data [5, 6, 12] and do not explicitly exploit the label information available in a supervised setting. It is therefore difficult to interpret class relationships directly from these high-dimensional global coefficients.

Algorithm 1 Efficient IALM Algorithm for SCNL

Input: Data matrix $X$, parameters $\lambda$ and $\mu$.

Initialize: $R = C = 0$, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$, $\beta = 10^{-6}$, $\rho = 1.1$, $\max_\beta = 10^{10}$; initialize the dictionary $B$ by K-means or by selecting samples from the training set.

While not converged, do:

1. Fix the others and update $C$ by:

$$C^* = \arg\min_C \ \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \langle\Lambda_2, C\rangle + \langle\Lambda_3, C - R\rangle + \frac{\beta}{2}\left(\|C\|_F^2 + \|C - R\|_F^2\right)$$

2. Fix the others and update $R$ by:

$$R^* = \arg\min_R \ \mu\|R\|_1 + \langle\Lambda_3, C - R\rangle + \frac{\beta}{2}\|C - R\|_F^2$$

3. Fix the others and update $B$ by:

$$B^* = \arg\min_B \ \frac{1}{2}\|X - BC\|_F^2 + \langle\Lambda_1, B\rangle + \frac{\beta}{2}\|B\|_F^2$$

4. Update the variables as in Eq. 6.

5. Update the multipliers:

$$\Lambda_1 = \Lambda_1 + \beta B, \quad \Lambda_2 = \Lambda_2 + \beta C, \quad \Lambda_3 = \Lambda_3 + \beta(C - R)$$

6. Update the parameter: $\beta = \min(\rho\beta, \max_\beta)$

End while

Output: optimal solution $(C^*, B^*, R^*)$
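The loop structure of Algorithm 1 can be sketched in NumPy as follows. This is a literal transcription of the closed-form updates in Eq. 6 under stated assumptions, not the authors' implementation: a fixed iteration budget `n_iter` replaces the unspecified convergence test, the caller supplies the initial dictionary `B` and the locality matrix `D`, and no claim is made about convergence behaviour or about how non-negativity is enforced in practice. The names `soft` and `ialm_scnl` are illustrative.

```python
import numpy as np

def soft(a, delta):
    """Soft-thresholding operator from Eq. 6: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def ialm_scnl(X, B, D, lam, mu, n_iter=50, beta=1e-6, rho=1.1, beta_max=1e10):
    """Transcription of Algorithm 1 with a fixed iteration budget (assumption)."""
    K, N = B.shape[1], X.shape[1]
    C = np.zeros((K, N))
    R = np.zeros((K, N))
    L1 = np.zeros_like(B)   # Lambda_1
    L2 = np.zeros((K, N))   # Lambda_2
    L3 = np.zeros((K, N))   # Lambda_3
    I = np.eye(K)
    for _ in range(n_iter):
        # Step 1: column-wise C update, C_i = (M + 2*lam*diag(D_i^2))^{-1} N_i.
        M = B.T @ B + 2.0 * beta * I
        Nmat = B.T @ X - L2 - L3 + beta * R
        for i in range(N):
            C[:, i] = np.linalg.solve(M + 2.0 * lam * np.diag(D[:, i] ** 2), Nmat[:, i])
        # Step 2: R update by element-wise soft-thresholding.
        R = soft(C + L3 / beta, mu / beta)
        # Step 3: B update, B = (X C^T - Lambda_1)(C C^T + beta I)^{-1}.
        B = np.linalg.solve(C @ C.T + beta * I, (X @ C.T - L1).T).T
        # Step 5: multiplier updates.
        L1 = L1 + beta * B
        L2 = L2 + beta * C
        L3 = L3 + beta * (C - R)
        # Step 6: increase the penalty parameter.
        beta = min(rho * beta, beta_max)
    return C, B, R
```

The per-column solve costs one $K \times K$ linear system per sample, so for large dictionaries it is the dominant cost of each sweep.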

In this work, we follow the approach of [4]: a class-dependent sub-dictionary is first learned for each class, and the sub-dictionaries are then combined into one structured dictionary. Each data sample is thus represented by a mixture of all actions, and the nonzero weights from more than one class reflect the different contributions of the actions. This strategy reveals the intrinsic property that actions are represented not only by their own model but also by how they are connected to the models of other actions. However, the components shared between actions make the classification task difficult. Sum-pooling is therefore used to aggregate the contributions from all actions on a per-class basis, as described in Algorithm 2. After this step, the contribution of each action to the sample is quantified with invariance to the subset selection within the sub-dictionaries, and the dimensionality of the data is reduced to $A$ (where $A$ is the number of action types).

Moreover, since the locality constraint in SCNL restricts the few atoms with significant values in the solution to local bases, a fast approximation can be used. The k-nearest neighbors (knn $\ll K$) among the bases are first selected according to the distance vector $D_i$ of a sample $x_i$ to compose a new dictionary, and then a much smaller system is solved. The proposed SCNL-based coding scheme is summarized in Algorithm 2. As the number of selected local neighbor bases decreases, the computational complexity of the proposed SCNL scheme is reduced.

Algorithm 2 SCNL-based feature coding algorithm.

Input: Data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{M \times N}$, parameters $\lambda$ and $\mu$;

Steps:

1. Train a class-specific dictionary with Algorithm 1 for each of the action classes (suppose there are $A$ types of actions).

2. Combine all $A$ sub-dictionaries $B_i \in \mathbb{R}^{M \times k_i}$, $i = 1, \ldots, A$, to form the all-classes dictionary $B = [B_1, B_2, \ldots, B_A] \in \mathbb{R}^{M \times K}$, where $K = \sum_{i=1}^{A} k_i$.

3. For each sample patch $x_i$, $i = 1, \ldots, N$, use the k-nearest neighbors based on the distance vector $D_i$ to construct a new dictionary $B'$ containing only the knn significant atoms drawn from $B$.

4. Code each patch by solving

$$c_i = \arg\min_{c_i \succeq 0} \ \frac{1}{2}\|x_i - Bc_i\|^2 + \lambda\|D_i \odot c_i\|^2 + \mu\|c_i\|_1,$$

where the coefficient vector $c_i = [c_i^1, \ldots, c_i^A] \in \mathbb{R}_+^K$, with $c_i^j \in \mathbb{R}_+^{k_j}$ the coefficients over sub-dictionary $B_j$, has only knn significant values.

5. Apply per-class sum-pooling to each sample descriptor $c_i \in \mathbb{R}_+^K$ to reduce its dimension: $s_i = [S(c_i^1), \ldots, S(c_i^A)] \in \mathbb{R}_+^A$, where $S(c_i^j) = \sum_{t=1}^{k_j} c_i^j(t)$ is the sum of the entries of $c_i^j$. Then $s_i$ is the final feature descriptor of sample $x_i$.

6. Repeat steps 3-5 to obtain the feature descriptors of all samples.
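Steps 3 to 5 of Algorithm 2 can be illustrated with the sketch below. The paper does not say which solver is used for the small non-negative problem, so plain projected gradient descent is swapped in here as an assumption; `code_patch`, `sum_pool`, `n_steps` and `class_sizes` are hypothetical names chosen for this example.

```python
import numpy as np

def code_patch(x, B, d, lam, mu, knn, n_steps=500):
    """Approximate step 4 of Algorithm 2 on the knn atoms with the smallest locality weights."""
    idx = np.argsort(d)[:knn]          # step 3: indices of the knn nearest atoms
    Bs, ds = B[:, idx], d[idx]
    # For c >= 0 the cost 0.5*||x - Bs c||^2 + lam*||ds * c||^2 + mu*sum(c) has gradient H c - g.
    H = Bs.T @ Bs + 2.0 * lam * np.diag(ds ** 2)
    g = Bs.T @ x - mu
    step = 1.0 / np.linalg.eigvalsh(H)[-1]   # 1 / largest eigenvalue (Lipschitz constant)
    c_small = np.zeros(knn)
    for _ in range(n_steps):
        # Projected gradient step: move downhill, then clip to the non-negative orthant.
        c_small = np.maximum(c_small - step * (H @ c_small - g), 0.0)
    c = np.zeros(B.shape[1])
    c[idx] = c_small                   # all other coefficients stay exactly zero
    return c

def sum_pool(c, class_sizes):
    """Step 5: sum the coefficients belonging to each class sub-dictionary."""
    bounds = np.cumsum(class_sizes)[:-1]
    return np.array([block.sum() for block in np.split(c, bounds)])
```

Restricting the solve to knn atoms shrinks the system from $K \times K$ to knn $\times$ knn, which is where the speed-up of the fast approximation comes from.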

Sometimes we are interested in whole video clips. After obtaining the set of feature descriptors $\{s_i\}$, $i \in \{1, \ldots, P\}$, for the $P$ patches in a video clip, average-pooling is used to pool the descriptors of all $P$ patches into the video representation $h$:

$$h = \frac{1}{P}\sum_{i=1}^{P} s_i. \tag{7}$$

Then, $\ell_2$-normalization is applied to obtain the final feature description. For classification, a linear SVM is used, and the results are denoted SCNL_LSVM_p and SCNL_LSVM_v for patches and video clips respectively.
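The clip-level descriptor of Eq. 7 followed by $\ell_2$-normalization amounts to a few lines of NumPy. The sketch assumes the patch descriptors are stacked as the rows of a $P \times A$ array; `video_descriptor` and the small constant `eps` are illustrative additions, not part of the paper.

```python
import numpy as np

def video_descriptor(S, eps=1e-12):
    """Average-pool P patch descriptors (rows of S, shape P x A) and l2-normalize."""
    h = S.mean(axis=0)                       # Eq. 7
    return h / max(np.linalg.norm(h), eps)   # eps avoids division by zero (assumption)
```

For example, two patch descriptors `[3, 0]` and `[3, 8]` pool to `[3, 4]`, which normalizes to `[0.6, 0.8]`.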

4. EXPERIMENTS

In this section, we systematically evaluate the effectiveness of the proposed SCNL model for human action recognition on three widely used datasets: KTH, UCF Sports and YouTube. The average accuracy over all action classes is used as the performance measure in all experiments.

The KTH dataset [9] is one of the most commonly used datasets for evaluating human action recognition. It contains 6 types of human actions (boxing, handclapping, handwaving, jogging, running and walking) performed several times by 25 subjects in four different scenarios: outdoors, outdoors with camera motion (zoom in and out), outdoors with different clothes, and indoors, giving a total of 2391 video sequences. We follow the most commonly used experimental setting, i.e., dividing the samples into a test set (9 subjects: 2, 3, 5, 6, 7, 8, 9, 10, 22) and a training set (the remaining 16 subjects).
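The subject-based split described above can be written down directly. The sketch assumes the 25 subjects are numbered 1 to 25 and that each clip is available as a `(subject_id, clip)` pair; both the numbering and the data layout are assumptions made for illustration.

```python
TEST_SUBJECTS = {2, 3, 5, 6, 7, 8, 9, 10, 22}   # as listed in the text
ALL_SUBJECTS = set(range(1, 26))                  # 25 subjects, assumed numbered 1..25
TRAIN_SUBJECTS = ALL_SUBJECTS - TEST_SUBJECTS

def split_by_subject(clips):
    """Split (subject_id, clip) pairs into training and test lists by subject."""
    train = [c for s, c in clips if s in TRAIN_SUBJECTS]
    test = [c for s, c in clips if s in TEST_SUBJECTS]
    return train, test
```

Splitting by subject rather than by clip keeps every recording of a given person on one side of the split, so the test set contains only unseen performers.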

The UCF Sports dataset [13] contains 10 different types of human actions in videos acquired from sports broadcast networks: diving, kicking, weight-lifting, horse-riding, running, skateboarding, golf swinging, swinging1 (gymnastics, on the pommel horse and floor), swinging2 (gymnastics, on the high and uneven bars) and walking. The dataset consists of 150 video clips covering a wide range of scenes and viewpoints, and poses many challenges such as camera motion and jitter, highly cluttered and dynamic backgrounds, compression artifacts, and variable illumination at variable spatial resolutions and frame rates. We use the original videos and adopt a leave-one-out setup in which one clip is used for testing and the remaining clips for training.

The YouTube dataset [14] is a very challenging action dataset with large variations in scale, illumination conditions, object appearance and pose, viewpoint, and cluttered background, as well as large unconstrained camera motion. It contains 1168 video clips and 11 types of actions: diving, biking, basketball shooting, golf swing, soccer juggling, horse riding, swinging, tennis swinging, trampoline jumping, volleyball spiking and walking with a dog. We use the original videos and adopt leave-one-out cross-validation over a pre-defined set of 25 folds.

4.1. Parameter settings

Fig. 1 shows some sample frames from the three action datasets, and the parameters used in the SCNL model for each dataset are listed in Table 1.


Fig. 1. Sample frames from the KTH dataset (first row), the UCF Sports dataset (middle row) and the YouTube dataset (last row).

Table 1. Parameter settings of the SCNL model on each dataset.

Dataset       λ     μ     k_i    knn
KTH           0.4   0.2   1024   10
UCF sports    1.0   0.4   1024   10
YouTube       0.2   1.2   1024   100

4.2. Effectiveness of the proposed SCNL model

To verify the effectiveness of the proposed SCNL model, we evaluate it for action recognition on two commonly used action datasets and compare it with two popular nonlinear coding methods: SC and LLC. Because of some subtle engineering details, we could not reproduce the results reported in the literature, so we perform the comparisons with our own implementations, which share exactly the same interest point detector, the same set of features and the same classifier, and differ only in the coding scheme. A linear SVM is used in this experiment, and the corresponding results are denoted SC_LSVM_v and LLC_LSVM_v for SC and LLC respectively.

The comparison results are shown in Table 2. The proposed SCNL model outperforms the other two popular models, which demonstrates that it provides discriminative feature descriptions and is effective for human action recognition.

4.3. Parameter evaluation

To provide a comprehensive analysis of the proposed SCNL method, we further evaluate its performance under different settings, namely different numbers of nearest neighbors and different pooling and normalization methods, on the challenging UCF Sports dataset.

Table 2. Comparison of various models on the KTH and UCF Sports datasets; the average accuracy (%) over all classes is reported.

Method         KTH     UCF sports
LLC_LSVM_v     95.25   97.33
SC_LSVM_v      96.41   98.00
SCNL_LSVM_v    96.64   99.33

Table 3. Average accuracy (%) of the proposed SCNL model on UCF Sports with different numbers of nearest neighbors (knn).

knn            5       10      20      50
SCNL_MaxW      96.00   91.33   84.00   69.33
SCNL_LSVM_p    99.89   99.92   99.89   99.93
SCNL_LSVM_v    98.67   99.33   98.00   98.00

First, we keep the same parameters and the same pipeline but vary the number of nearest neighbors on the UCF Sports dataset. Two classifiers, a linear SVM and Max-contribution-win, are used in this experiment. The Max-contribution-win rule is given by Eq. 8, and its corresponding result, which usually contains a mixture of actions, is denoted SCNL_MaxW for simplicity.

$$f(h) = \{\, j \mid h_j > h_i, \ j \neq i, \ i, j \in \{1, \ldots, A\} \,\}. \tag{8}$$
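Read as "pick the class whose pooled contribution exceeds all others", Eq. 8 reduces to an argmax over the $A$-dimensional descriptor $h$. A minimal sketch, with the function name `max_contribution_win` chosen here for illustration:

```python
import numpy as np

def max_contribution_win(h):
    """Return the index j of the class with the largest pooled contribution h_j (Eq. 8)."""
    return int(np.argmax(h))
```

Note two conventions of this sketch: class indices are zero-based, and `np.argmax` resolves ties by returning the first maximal index, whereas the strict inequality in Eq. 8 leaves ties undefined.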

The results for different neighborhood sizes are shown in Table 3, from which we make the following observations. (1) There is indeed confusion among different types of actions. The SCNL_MaxW performance drops significantly when the locality constraint is relaxed with a large knn (only 69.33% at knn=50). This is because, at a large knn, the sparsity constraint dominates and brings very different atoms into the solution to satisfy sparsity, which aggravates the sharing of atoms between actions. This demonstrates the importance of the locality constraint in SCNL, which keeps samples of the same class in the same subspace. (2) Although the actions are mixed with each other when the locality constraint is relaxed, our method still achieves excellent classification performance when an SVM is used, even with a linear kernel. Our model also achieves very high accuracy when evaluated on patches (SCNL_LSVM_p). Moreover, if the majority-win rule is applied to the patches, i.e., the dominant label wins, we obtain a 100% recognition rate for video clips on UCF Sports, which is higher than that of SCNL_LSVM_v. The reason is that the average-pooling strategy used in SCNL_LSVM_v represents each video clip by a single low-dimensional feature vector (its dimension equals the number of classes), which might reduce the discriminative ability of the description for complicated human actions in video clips.
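The majority-win rule mentioned above, in which the dominant patch-level label decides the clip label, can be sketched with the standard library. The function name and the tie-breaking behaviour are assumptions of this example; the paper does not specify how ties are handled.

```python
from collections import Counter

def majority_win(patch_labels):
    """Return the most frequent patch-level label in a clip (first seen wins ties)."""
    return Counter(patch_labels).most_common(1)[0][0]
```

Unlike average-pooling, this rule keeps every patch-level decision until the final vote, which is one way to see why it can retain more discriminative information than a single pooled vector.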

Then, to examine the impact of pooling and normalization on the final recognition performance, we perform the following experiment on the UCF Sports dataset. Two feature pooling methods (max-pooling and average-pooling) and two normalization strategies ($\ell_1$-normalization and $\ell_2$-normalization) are considered, and the different combinations of these strategies are evaluated. The results are presented graphically in Fig. 2. Average-pooling combined with $\ell_2$-normalization at knn=10 produces the best performance.

[Figure 2: plot of average precision (%) (vertical axis, 85 to 100) for Maxp-L1, Maxp-L2, Avgp-L1 and Avgp-L2, with one series each for knn = 5, 10, 20 and 50.]

Fig. 2. Performance with different pooling and normalization strategies on the UCF Sports dataset.

Table 4. Comparison of the proposed method with state-of-the-art methods on the KTH, UCF Sports and YouTube datasets in terms of average accuracy (%) over all classes.

Method                 KTH    UCF    YouTube
Kovashka et al. [15]   94.5   87.3   -
Wang et al. [16]       94.2   88.2   84.2
Le et al. [17]         93.9   86.5   75.8
Castrodad et al. [4]   96.3   97.3   89.5
SCNL_LSVM_v            96.6   99.3   86.6

4.4. Comparison with state-of-the-art methods

We also compare our approach with state-of-the-art methods on the KTH, UCF Sports and YouTube datasets. The average recognition rates are presented in Table 4. The proposed method achieves better recognition performance than the state-of-the-art methods on the KTH and UCF Sports datasets. On the YouTube dataset, however, because very few samples were selected to tune the model parameters $\lambda$ and $\mu$, our method does not perform as well as the two-level sparse coding method proposed in [4]. Even so, it still outperforms the other popular methods. Model parameter selection is a hard problem, especially with large sample sets, and to the best of our knowledge there is no good solution for it yet; how to automatically select the best model parameters in nonlinear models is a topic for our future study.

5. CONCLUSION

In this work, we propose a new model, Sparse Coding with Non-negative and Locality constraints (SCNL), for human action recognition. The SCNL model is datum-adaptive, and the combination of the locality and sparsity properties enables the model to generate sparse solutions and to capture the global subspace structures of the data well. Extensive experiments on three well-known human action datasets demonstrate the significant advantages of the proposed SCNL model, which provides discriminative feature descriptions for human action recognition.

6. REFERENCES

[1] R. P. N. Rao, B. A. Olshausen, and M. S. Lewicki, Probabilistic Models of the Brain: Perception and Neural Function, MIT Press, 2002.

[2] K. Guo, P. Ishwar, and J. Konrad, “Action recognition using sparse representation on covariance manifolds of optical flow,” in AVSS, 2010.

[3] G. Taylor, R. Fergus, Y. LeCun, and C. Bregler, “Convolutional learning of spatio-temporal features,” in ECCV, 2010.

[4] A. Castrodad and G. Sapiro, “Sparse modeling of human action from motion imagery,” in IJCV, 2012.

[5] J. J. Wang, J. C. Yang, K. Yu, et al., “Locality-constrained linear coding for image classification,” in CVPR, 2010.

[6] J. C. Yang, K. Yu, Y. H. Gong, and T. Huang, “Linear spatial pyramid matching using sparse coding for image classification,” in CVPR, 2009.

[7] P. O. Hoyer, “Modeling receptive fields with non-negative sparse coding,” in Neurocomputing, 2003.

[8] S. P. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.

[9] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: A local SVM approach,” in ICPR, 2004.

[10] Z. C. Lin, M. M. Chen, L. Q. Wu, and Y. Ma, “The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices,” Technical report, UIUC, 2009.

[11] Y. B. Chen, Z. X. Li, X. Guo, Y. Y. Zhao, and A. N. Cai, “A spatio-temporal interest point detector based on vorticity for action recognition,” in ICME Workshop on BRUREC, 2013.

[12] K. Yu, T. Zhang, and Y. Gong, “Nonlinear learning using local coordinate coding,” in NIPS, 2009.

[13] M. Rodriguez, J. Ahmed, and M. Shah, “Action MACH: A spatio-temporal maximum average correlation height filter for action recognition,” in CVPR, 2008.

[14] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos in the wild,” in CVPR, 2009.

[15] A. Kovashka and K. Grauman, “Learning a hierarchy of discriminative space-time neighborhood features for human action recognition,” in CVPR, 2010.

[16] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu, “Action recognition by dense trajectories,” in CVPR, 2011.

[17] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng, “Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis,” in CVPR, 2011.