[IEEE 2013 Visual Communications and Image Processing (VCIP) - Kuching, Malaysia...
RECOGNIZING HUMAN ACTIONS BASED ON SPARSE CODING WITH NON-NEGATIVE AND LOCALITY CONSTRAINTS
Yuanbo Chen1, Yanyun Zhao1,2, Anni Cai1,2
1School of Information and Communication Engineering
2Beijing Key Laboratory of Network System and Network Culture
Beijing University of Posts and Telecommunications, Beijing, China
{bupt cyb, zyy, annicai}@bupt.edu.cn
ABSTRACT
In this paper, Sparse Coding with Non-negative and Locality
constraints (SCNL) is proposed to generate discriminative
feature descriptions for human action recognition. The
non-negative constraint ensures that every data sample lies in the
convex hull of its neighbors. The locality constraint ensures that
a data sample is represented only by its related neighbor atoms.
The sparsity constraint keeps the number of dictionary atoms
involved in the sample representation as small as possible. The
SCNL model captures the global subspace structures of data better
than classical sparse coding and is more robust to noise than
locality-constrained linear coding. Extensive experiments on three
well-known human action datasets demonstrate the advantages of the
proposed SCNL model.
Index Terms— Human action recognition, SCNL model,
datum-adaptive, locality, sparse
1. INTRODUCTION
Owing to the complexity of human actions in the real world, an
informative feature description is critical to an action recognition
system. Sparse coding (SC) is a widely used feature representation
because of its nonlinear representation ability. It has also been
reported in neuroscience [1] that the human visual system seeks a
sparse code for an incoming image using a few words from a feature
vocabulary. Instead of representing a feature by its closest
centroid, as Vector Quantization (VQ) does, SC represents each data
sample by a weighted combination of dictionary atoms and thus
handles outliers better than the popular VQ coding scheme. Recently,
sparse coding has attracted much attention and has proven very
successful in action recognition and other classification
tasks [2, 3, 4].
This work was supported by Chinese National Natural Science Founda-
tion (90920001, 61101212), National S&T Major Project of the Ministry of
S&T (2012ZX03005008,2012BAH41F03), and the Fundamental Research
Funds for the Central Universities.
In an SC system, the dictionary is usually over-complete, a
property that guarantees the sparsity of a test sample represented
over the dictionary. However, because SC simply pursues sparsity,
similar data samples may be represented by very different subsets
of dictionary atoms, so the resulting sparse codes can vary
significantly in their activation sets, which harms recognition.
In addition, SC requires solving ℓ1-norm optimization problems
over high-dimensional dictionary atoms, which makes it
computationally expensive.
Recently, a fast alternative called Locality-constrained Linear
Coding (LLC) [5] was proposed. It imposes a locality constraint
under which each descriptor is projected into its local coordinate
system only, so nonzero coefficients are assigned mostly to bases
near the encoded data. Moreover, because LLC avoids solving an
ℓ1-norm problem, it is much faster and yields smoother codes than
SC, and it has shown better performance than state-of-the-art
methods, including ScSPM [6], in image classification.
However, several issues limit the application of LLC. In LLC, the
number k of nearest neighbors (knn) is usually fixed to a small
value (e.g., knn=5 in [5]) to obtain sparse solutions, but this is
not sparsity in the sense of the ℓ0-norm. When knn is set to a
large number (such as 10, 50, or larger), the solution becomes
dense. Because a fixed global parameter defines the local
neighbors, LLC cannot produce datum-adaptive neighborhoods, which
makes it unable to reflect the real structure of the data and
sensitive to local data noise. Moreover, in LLC the data may be
described as a combination of bases involving both additive and
subtractive interactions. The fact that features can cancel each
other out through subtraction is contrary to the intuitive notion
of combining parts to form a whole [7]. Also, from a biological
point of view, non-negative representations are consistent with
the non-negativity of neural firing rates [7].
Inspired by the above insights, we propose to combine sparsity,
locality and non-negativity constraints on high-dimensional data
to construct a new informative feature representation named Sparse
Coding with Non-negative and Locality constraints (SCNL). The
non-negative constraint ensures that every data sample lies in the
convex hull of its neighbors. The locality constraint makes data
samples consider only their related neighbor atoms, which
accommodates the classification assumption that samples of the
same class are likely to share the same structure and therefore
helps reduce intra-class distance. The sparsity constraint keeps
the number of dictionary atoms involved in the solutions as small
as possible. Given a set of data vectors, the SCNL model
automatically seeks the most related atoms among their local
neighbors and thus reflects the optimal neighborhood structure of
the data. With this property, SCNL captures the global subspace
structures of data better than SC and, unlike LLC, has
datum-adaptive neighborhoods. The proposed SCNL model has several
advantages:
(1) Robustness to noise. Data noise is inevitable, especially for
visual data in the real world. LLC [5] uses an ℓ2-norm penalty, so
the feature representation changes easily when unfavorable noise
enters the neighborhood. In SCNL, by contrast, the optimal atoms
are selected by the ℓ1 constraint, which handles outliers better
than LLC and VQ [8].
(2) Locally smooth sparsity. To favor sparsity on an over-complete
dictionary, SC may select quite different atoms for similar
patches, giving samples of the same class very different
structures. In the SCNL coding scheme, however, the locality
constraint restricts the sparse coding process to select atoms
from related local neighborhoods, which forces similar data
samples to share similar atoms and results in much smoother codes
than SC.
(3) Datum-adaptive neighborhood. In action recognition systems,
many factors, such as illumination, occlusion, camera motion and
cluttered background, may make the data distribution vary
significantly across different areas of the feature space, which
results in a distinctive neighborhood structure for each datum,
even within the same class. Therefore, when a fixed neighborhood
is applied, the neighborhoods of some data samples may easily be
contaminated with false neighbors. In the proposed SCNL, the
explicit sparsity term automatically selects the most valid atoms,
which makes the SCNL model datum-adaptive. This property is
valuable for applications with unevenly distributed data and can
reflect the optimal sparse neighborhood structure of the data.
We have conducted extensive experiments on three public action
datasets, and the results demonstrate that our SCNL model provides
informative and discriminative feature representations and is
therefore useful for the action recognition task.
The remainder of this paper is organized as follows. In
section 2, we introduce the formulation of our SCNL mod-
el and show its optimization algorithm. Section 3 gives the
details of SCNL model for action recognition. Our experi-
ments and analysis are presented in section 4. Finally, section
5 concludes our paper.
2. SPARSE CODING WITH NON-NEGATIVE AND LOCALITY CONSTRAINTS
2.1. Formulation of SCNL
Suppose samples (e.g., image patches) $X \in \mathbb{R}^{M \times N}$
can be represented as a linear combination of a few columns of a
dictionary:
$$X = BC, \tag{1}$$
where $B = [b_1, \ldots, b_K] \in \mathbb{R}^{M \times K}$ is the
dictionary with $K$ columns, each column $b_i$, $i \in \{1, \ldots, K\}$,
is an $M$-dimensional base vector, and $C \in \mathbb{R}^{K \times N}$
is the coefficient matrix corresponding to all $N$ samples.
Taking the reconstruction error, locality, sparsity and
non-negativity into account, the proposed SCNL can be seen as
seeking the optimal solution of the following cost function:
$$\min_{B,C} \ \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \mu\|C\|_1, \quad \text{s.t.} \ B \geq 0, \ C \geq 0, \tag{2}$$
where the first term is the reconstruction error,
$\|D \odot C\|_F^2$ is the locality term that keeps the solution
within local bases, and $\|C\|_1$ is the $\ell_1$-norm of $C$, which
makes the solution as sparse as possible. $\lambda$ and $\mu$ are
positive regularization parameters that trade off the
reconstruction error against the locality and sparsity of the
coefficients $C$; a larger $\mu$ leads to sparser solutions.
$\odot$ denotes element-wise multiplication, and
$D = [D_1, \ldots, D_N] \in \mathbb{R}^{K \times N}$ is the matrix of
distances between each input sample $x_i$ and each base vector, i.e.
$$D_i = \mathrm{dist}(x_i, B) = \left[\exp\left(\|x_i - b_1\|_2 / \sigma\right), \ldots, \exp\left(\|x_i - b_K\|_2 / \sigma\right)\right]^T, \tag{3}$$
where $\sigma$ is the radius parameter of the Gaussian function. It
is set to the mean of all pairwise distances between the training
samples and the dictionary bases on the training set of the KTH
dataset [9], and this value is then used in all of our experiments.
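As an illustration of Eq. 3, the locality matrix can be sketched in a few lines of NumPy. The function name `locality_matrix` is hypothetical, the distance is assumed to be the (unsquared) Euclidean norm, and falling back to the mean sample-to-atom distance when no σ is given is one reading of the text rather than a confirmed detail.

```python
import numpy as np

def locality_matrix(X, B, sigma=None):
    """Return (D, sigma) with D[k, i] = exp(||x_i - b_k||_2 / sigma), as in Eq. 3.

    X: (M, N) samples as columns; B: (M, K) dictionary atoms as columns.
    If sigma is None, the mean sample-to-atom distance is used (assumption).
    """
    # Pairwise Euclidean distances between atoms (rows) and samples (columns).
    diff = B[:, :, None] - X[:, None, :]        # shape (M, K, N)
    dist = np.sqrt((diff ** 2).sum(axis=0))     # shape (K, N)
    if sigma is None:
        sigma = dist.mean()
    return np.exp(dist / sigma), sigma

X = np.array([[0.0, 1.0], [0.0, 0.0]])   # two 2-D samples: (0, 0) and (1, 0)
B = np.array([[0.0, 3.0], [0.0, 4.0]])   # two atoms: (0, 0) and (3, 4)
D, sigma = locality_matrix(X, B, sigma=1.0)
```

Atoms far from a sample receive a large weight in D, so the term λ‖D ⊙ C‖ in Eq. 2 penalizes their coefficients more heavily than those of nearby atoms.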
2.2. Solving the optimization problem
For efficiency, we adopt the inexact Augmented Lagrange Multiplier
(IALM) method [10] to solve Eq. 2 in this work. To facilitate an
efficient use of alternating minimization, we first introduce an
auxiliary variable $R$ and consider the following equivalent model:
$$\min_{B,C,R} \ \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \mu\|R\|_1, \quad \text{s.t.} \ B \geq 0, \ C \geq 0, \ C = R. \tag{4}$$
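Then, the problem can be solved by minimizing the following
augmented Lagrangian function
$$\Gamma(B, C, R, \Lambda_1, \Lambda_2, \Lambda_3, \beta) = \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \mu\|R\|_1 + \langle\Lambda_1, B\rangle + \langle\Lambda_2, C\rangle + \langle\Lambda_3, C - R\rangle + \frac{\beta}{2}\left(\|B\|_F^2 + \|C\|_F^2 + \|C - R\|_F^2\right), \tag{5}$$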
where $\Lambda_i$, $i \in \{1, 2, 3\}$, are Lagrange multipliers and
$\beta$ is the penalty parameter for the constraints. Inexact
ALM [10] proceeds by successively minimizing the augmented
Lagrangian function $\Gamma$ with respect to $B$, $C$ and $R$, one at
a time, while fixing the others at their most recent values, and
then updating the multipliers after each sweep of this alternating
minimization. Specifically, these steps can be written in closed
form as Eq. 6, and the complete optimization algorithm is outlined
in Algorithm 1.
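$$\begin{aligned}
B^* &= (XC^T - \Lambda_1)(CC^T + \beta I)^{-1}, \\
R^*_{ij} &= \mathrm{soft}\left(\left(C + \frac{\Lambda_3}{\beta}\right)_{ij}, \frac{\mu}{\beta}\right), \\
C^*_i &= (M + 2\lambda D'_i)^{-1} N_i,
\end{aligned} \tag{6}$$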
where $\mathrm{soft}(\alpha, \delta) = \mathrm{sign}(\alpha)(|\alpha| - \delta)_+$,
$x_+ = \max(x, 0)$, $M = B^T B + 2\beta I$, and
$N = B^T X - \Lambda_2 - \Lambda_3 + \beta R$. $C_i$, $D_i$ and $N_i$
are the columns of the matrices $C$, $D$ and $N$ respectively,
$D'_i = \mathrm{diag}(D_i^2)$, and the superscript $*$ denotes values
at the new iteration.
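The closed-form updates of Eq. 6 can be transcribed as a single sweep, as in the following sketch. It is a literal reading of Eq. 6 and of the multiplier updates in Algorithm 1 (C, then R, then B), not a tuned solver. The names `soft`, `ialm_sweep` and `Lam1` to `Lam3` are chosen here for illustration, and the outer loop that grows β by the factor ρ is omitted.

```python
import numpy as np

def soft(a, delta):
    """Soft-thresholding operator: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def ialm_sweep(X, B, R, Lam1, Lam2, Lam3, D, lam, mu, beta):
    """One pass of the Eq. 6 updates followed by the multiplier updates."""
    K = B.shape[1]
    # C-update, one column at a time: (M + 2*lam*diag(D_i^2)) C_i = N_i.
    M_mat = B.T @ B + 2.0 * beta * np.eye(K)
    N_mat = B.T @ X - Lam2 - Lam3 + beta * R
    C = np.empty_like(R)
    for i in range(X.shape[1]):
        A_i = M_mat + 2.0 * lam * np.diag(D[:, i] ** 2)
        C[:, i] = np.linalg.solve(A_i, N_mat[:, i])
    # R-update: element-wise shrinkage of C + Lam3/beta with threshold mu/beta.
    R = soft(C + Lam3 / beta, mu / beta)
    # B-update: solve B (C C^T + beta I) = X C^T - Lam1 for B.
    A = C @ C.T + beta * np.eye(K)              # symmetric, so A^T = A
    B = np.linalg.solve(A, (X @ C.T - Lam1).T).T
    # Multiplier updates as listed in Algorithm 1.
    Lam1 = Lam1 + beta * B
    Lam2 = Lam2 + beta * C
    Lam3 = Lam3 + beta * (C - R)
    return B, C, R, Lam1, Lam2, Lam3

rng = np.random.default_rng(0)
X = rng.random((4, 6))            # M = 4, N = 6
B0 = rng.random((4, 3))           # K = 3 atoms
D = np.ones((3, 6))               # uniform locality weights for the demo
Z = np.zeros((3, 6))
B1, C1, R1, Lam1, Lam2, Lam3 = ialm_sweep(
    X, B0, Z.copy(), np.zeros((4, 3)), Z.copy(), Z.copy(), D,
    lam=0.4, mu=0.2, beta=1e-6)
```

With the small initial β of Algorithm 1, the threshold μ/β is very large, so the first sweep returns R = 0; R only becomes active as β grows over the iterations.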
3. IMPLEMENTATION FOR ACTION RECOGNITION
3.1. Key-points extraction
Human action recognition in real-world videos is a challenging
problem. In particular, when the camera is not static, the motion
field in an action region is often contaminated by background
movements. In this work, we extract interest points with the
vorticity-based spatio-temporal point detector proposed in [11],
which can suppress camera motion to a certain extent and can
extract spatio-temporal interest points around articulations of
the human body, such as moving knees, ankles and elbows, even in a
cluttered field with camera movements. Then, HOG features with 8
orientations are computed on the spatio-temporal patches defined
around the extracted interest points.
3.2. SCNL-based feature representation
Algorithm 1 Efficient IALM Algorithm for SCNL
Input: Data matrix $X$, parameters $\lambda$ and $\mu$.
Initialize: $R = C = 0$, $\Lambda_1 = \Lambda_2 = \Lambda_3 = 0$,
$\beta = 10^{-6}$, $\rho = 1.1$, $\max_\beta = 10^{10}$; initialize the
dictionary $B$ by K-means or by selecting samples from the training
set.
While not converged, do:
1. Fix the others and update $C$ by:
   $C^* = \arg\min_C \frac{1}{2}\|X - BC\|_F^2 + \lambda\|D \odot C\|_F^2 + \langle\Lambda_2, C\rangle + \langle\Lambda_3, C - R\rangle + \frac{\beta}{2}\left(\|C\|_F^2 + \|C - R\|_F^2\right)$
2. Fix the others and update $R$ by:
   $R^* = \arg\min_R \mu\|R\|_1 + \langle\Lambda_3, C - R\rangle + \frac{\beta}{2}\|C - R\|_F^2$
3. Fix the others and update $B$ by:
   $B^* = \arg\min_B \frac{1}{2}\|X - BC\|_F^2 + \langle\Lambda_1, B\rangle + \frac{\beta}{2}\|B\|_F^2$
4. Update the variables as in Eq. 6.
5. Update the multipliers:
   $\Lambda_1 = \Lambda_1 + \beta B$, $\Lambda_2 = \Lambda_2 + \beta C$, $\Lambda_3 = \Lambda_3 + \beta(C - R)$
6. Update the parameter: $\beta = \min(\rho\beta, \max_\beta)$
End while
Output: optimal solution $(C^*, B^*, R^*)$

Note that traditional nonlinear coding methods learn a global
dictionary from all the training data [5, 6, 12] and do not
explicitly exploit the label information available in the
supervised setting. Therefore, it is difficult to interpret the
class relationships directly from these high-dimensional global
coefficients.
In this work, we follow the approach of [4], where a
class-dependent sub-dictionary is first learned for each class and
the sub-dictionaries are then combined into one structured
dictionary. Each data sample is thus represented by a mixture of
all actions, and the nonzero weights from more than one class give
different contributions between actions. This strategy reveals the
intrinsic character that actions are represented not only by their
own model but also by how connected they are to the models of other
actions. However, the components shared between actions make the
classification task difficult. Thus, sum-pooling is used to
quantify the contributions from all actions on a per-class basis,
as described in Algorithm 2. After this step, the contribution of
each action to the sample is quantified with invariance to the
subset selection within the sub-dictionaries, and the
dimensionality of the data is reduced to A (where A is the number
of action types).
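The per-class sum-pooling described above can be sketched as follows, assuming the K coefficients are ordered class by class in the same order as the concatenated sub-dictionaries. The helper name `class_sum_pool` and the `sizes` argument (the atom counts of the sub-dictionaries) are introduced here for illustration only.

```python
import numpy as np

def class_sum_pool(c, sizes):
    """Sum the coefficients of each class block; sizes[j] is the atom count of class j."""
    edges = np.cumsum(sizes)[:-1]      # split points between consecutive sub-dictionaries
    return np.array([block.sum() for block in np.split(c, edges)])

c = np.array([0.5, 0.0, 0.2, 0.0, 0.0, 0.3])   # K = 6 coefficients
s = class_sum_pool(c, sizes=[2, 3, 1])          # A = 3 classes
```

Here the 6-dimensional code is reduced to a 3-dimensional descriptor, one entry per class, regardless of which atoms inside each sub-dictionary were active.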
Algorithm 2 SCNL-based feature coding algorithm.
Input: Data matrix $X = [x_1, \ldots, x_N] \in \mathbb{R}^{M \times N}$,
parameters $\lambda$ and $\mu$.
Steps:
1. Train a class-specific dictionary with Algorithm 1 for each
   action class ($A$ types of actions in total).
2. Combine all $A$ sub-dictionaries $B_i \in \mathbb{R}^{M \times k_i}$,
   $i = 1, \ldots, A$, into the all-classes dictionary
   $B = [B_1, B_2, \ldots, B_A] \in \mathbb{R}^{M \times K}$, where
   $K = \sum_{i=1}^{A} k_i$.
3. For each sample patch $x_i$, $i = 1, \ldots, N$, use the k-nearest
   neighbors based on the distance matrix $D_i$ to construct a new
   dictionary $B'$ with only knn significant atoms drawn from $B$.
4. Code each patch by solving
   $c_i = \arg\min_{c_i \geq 0} \frac{1}{2}\|x_i - B c_i\|^2 + \lambda\|D_i \odot c_i\|^2 + \mu\|c_i\|_1$,
   where the coefficient vector
   $c_i = [c_i^1, \ldots, c_i^j, \ldots, c_i^A] \in \mathbb{R}_+^K$ has only
   knn significant values and $c_i^j$ collects the $k_j$ coefficients
   of class $j$.
5. Apply per-class sum-pooling to each sample descriptor
   $c_i \in \mathbb{R}_+^K$ to reduce its dimension to
   $s_i = [S(c_i^1), \ldots, S(c_i^A)] \in \mathbb{R}_+^A$, where
   $S(c_i^j) = \sum_{t=1}^{k_j} c_i^j(t)$ is the sum of the entries of
   $c_i^j$. Then $s_i$ is the final feature descriptor of sample $x_i$.
6. Repeat steps 3-5 to obtain feature descriptors for all samples.

Moreover, since the locality constraint in SCNL restricts the few
atoms with significant values in the solution to local bases, a
fast approximation can be used. The k-nearest neighbor bases
(knn ≪ K) are first selected according to the distance matrix
$D_i$ of a sample $x_i$ to compose a new dictionary, and then a much
smaller system is solved. The proposed SCNL-based coding scheme is
summarized in Algorithm 2. As the number of selected local neighbor
bases decreases, the computational complexity of the proposed SCNL
scheme decreases.
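The fast approximation can be sketched as below. The paper does not state which solver is used for the small system, so this example swaps in plain projected gradient descent as a stand-in. The function name `scnl_code_local`, the iteration count and the step-size rule are assumptions made for this sketch, and the locality term is taken in squared form to match Eq. 2.

```python
import numpy as np

def scnl_code_local(x, B, d, knn, lam, mu, n_iter=500):
    """Approximate SCNL code of one sample using only its knn nearest atoms.

    x: (M,) sample; B: (M, K) dictionary; d: (K,) locality weights of x (a column of D).
    Minimizes 0.5*||x - B'c||^2 + lam*||d' * c||^2 + mu*sum(c) subject to c >= 0.
    """
    idx = np.argsort(d)[:knn]                        # smallest weights = nearest atoms
    Bl, dl = B[:, idx], d[idx]
    H = Bl.T @ Bl + 2.0 * lam * np.diag(dl ** 2)     # Hessian of the smooth objective
    g0 = Bl.T @ x - mu                               # for c >= 0, ||c||_1 = sum(c) is linear
    step = 1.0 / np.linalg.eigvalsh(H)[-1]           # 1 / largest eigenvalue of H
    c_loc = np.zeros(knn)
    for _ in range(n_iter):
        # Gradient step on the quadratic, then projection onto c >= 0.
        c_loc = np.maximum(c_loc - step * (H @ c_loc - g0), 0.0)
    c = np.zeros(B.shape[1])
    c[idx] = c_loc                                   # scatter back into the K-dim code
    return c

B = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 5.0]])       # three atoms in 2-D
x = np.array([1.0, 0.0])
d = np.array([1.0, 2.0, 10.0])        # third atom is far from x
c = scnl_code_local(x, B, d, knn=2, lam=0.5, mu=0.1)
```

Only a knn-by-knn system is handled per sample, which is the source of the reduction in cost compared with coding over all K atoms.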
Sometimes, we may be interested in video clips. After obtaining a
set of feature descriptors $\{s_i\}$, $i \in \{1, \ldots, P\}$, for the
$P$ patches in a video clip, average-pooling is used to pool all $P$
patch descriptors into the video representation $h$:
$$h = \frac{1}{P}\sum_{i=1}^{P} s_i. \tag{7}$$
Then, $\ell_2$-normalization is applied to obtain the final feature
description. For classification, a linear SVM is used, and for
convenience the results are denoted as SCNL_LSVM_p and SCNL_LSVM_v
for patches and video clips respectively.
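A minimal sketch of the video-level descriptor, combining the average-pooling of Eq. 7 with ℓ2-normalization. The function name `video_descriptor` and the row-per-patch layout of `S` are assumptions of this example.

```python
import numpy as np

def video_descriptor(S):
    """Average-pool patch descriptors (rows of S, shape P x A), then l2-normalize."""
    h = S.mean(axis=0)                       # Eq. 7
    norm = np.linalg.norm(h)
    return h / norm if norm > 0 else h       # an all-zero vector is left unchanged

S = np.array([[3.0, 0.0],
              [3.0, 8.0]])                   # P = 2 patches, A = 2 classes
h = video_descriptor(S)
```

The resulting vector has one entry per action class and unit length, and it is this vector that would be passed to the linear SVM for clip-level classification.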
4. EXPERIMENTS
In this section, we systematically evaluate the effectiveness of
our proposed SCNL model for human action recognition on three
widely used datasets: KTH, UCF Sports and YouTube. The average
accuracy over all the action classes is used as the performance
measure in all experiments.
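The performance measure can be sketched as follows, reading "average accuracy over all the action classes" as the mean of the per-class accuracies. This reading and the function name `mean_class_accuracy` are assumptions of the example.

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Class 0: 2 of 3 correct; class 1: 1 of 1 correct.
acc = mean_class_accuracy([0, 0, 0, 1], [0, 0, 1, 1])
```

Unlike overall accuracy, this measure is not dominated by classes that have many more clips than others.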
The KTH dataset [9] is one of the most commonly used datasets for
evaluating human action recognition. It contains 6 types of human
actions (boxing, handclapping, handwaving, jogging, running and
walking) performed several times by 25 subjects in four different
scenarios: outdoors, outdoors with camera motion (zoom in and
out), outdoors with different clothes, and indoors, giving a total
of 2391 video sequences. We follow the most common experimental
setting, i.e., dividing the samples into a test set (9 subjects:
2, 3, 5, 6, 7, 8, 9, 10, 22) and a training set (the remaining 16
subjects).
The UCF sports dataset [13] contains 10 different types of human
actions in videos acquired from sports broadcast networks: diving,
kicking, weight-lifting, horse-riding, running, skateboarding,
golf swinging, swinging1 (gymnastics, on the pommel horse and
floor), swinging2 (gymnastics, on the high and uneven bars) and
walking. The dataset consists of 150 video clips covering a wide
range of scenes and viewpoints and poses many challenges, such as
camera motion and jitter, highly cluttered and dynamic
backgrounds, compression artifacts, and variable illumination at
variable spatial resolutions and frame rates. We use the original
videos and adopt a leave-one-out setup in which one clip is used
for testing and the remaining clips for training.
The YouTube dataset [14] is a very challenging action dataset with
large variations in illumination conditions, object appearance,
scale and pose, viewpoint and cluttered background, as well as
large unconstrained camera motion. It contains 1168 video clips
and 11 types of actions: diving, biking, basketball shooting, golf
swing, soccer juggling, horse riding, swinging, tennis swinging,
trampoline jumping, volleyball spiking and walking with a dog. We
use the original videos and adopt leave-one-out cross validation
over a pre-defined set of 25 folds.
4.1. Parameter settings
Fig. 1 shows some sample frames from the three action datasets,
and the parameters used in the SCNL model for each dataset are
shown in Table 1.
Fig. 1. Sample frames from the KTH dataset (first row), the UCF
sports dataset (middle row) and the YouTube dataset (last row).
Table 1. Parameter settings of the SCNL model on each dataset.
              λ      μ      k_i     knn
KTH           0.4    0.2    1024    10
UCF sports    1.0    0.4    1024    10
YouTube       0.2    1.2    1024    100
4.2. Effectiveness of the proposed SCNL model
To verify the effectiveness of the proposed SCNL model, we
evaluate it for action recognition on two commonly used action
datasets and compare it with two popular nonlinear coding methods:
SC and LLC. Because of subtle engineering details, we could not
reproduce the results reported in the literature, so we perform
the comparisons with our own implementations, which share exactly
the same interest point detector, the same set of features and the
same classifier and differ only in the coding scheme. A linear SVM
is used in this experiment, and the corresponding results are
denoted as SC_LSVM_v and LLC_LSVM_v for SC and LLC respectively.
The comparison results are shown in Table 2. The table shows that
our proposed SCNL model outperforms the other two popular models
on both datasets, which demonstrates that the proposed SCNL model
provides discriminative feature descriptions and is effective for
human action recognition.
4.3. Parameter evaluation
To provide a comprehensive analysis of the proposed SCNL method,
we further evaluate its performance with different strategies,
such as different k-nearest neighborhoods and different pooling
and normalization methods, on the challenging UCF sports dataset.
Table 2. Comparison of various models on the KTH and UCF sports
datasets; the average accuracy (%) over all classes is reported.
               KTH      UCF sports
LLC_LSVM_v     95.25    97.33
SC_LSVM_v      96.41    98.00
SCNL_LSVM_v    96.64    99.33

Table 3. Average accuracy (%) of the proposed SCNL model on UCF
Sports with different numbers of nearest neighbors (knn).
knn            5        10       20       50
SCNL_MaxW      96.00    91.33    84.00    69.33
SCNL_LSVM_p    99.89    99.92    99.89    99.93
SCNL_LSVM_v    98.67    99.33    98.00    98.00

First, we adopt the same parameters and the same process but with
different k-nearest neighborhoods on the UCF sports dataset to show
the recognition results of our proposed SCNL model. Two
classifiers, linear SVM and Max-contribution-win, are used in this
experiment. The Max-contribution-win formulation is given by
Eq. 8, and its result, which usually contains a mixture of
actions, is denoted as SCNL_MaxW for simplicity.
$$f(h) = \{\, j \mid h_j > h_i,\ \forall i \neq j,\ i, j \in \{1, \ldots, A\} \,\}. \tag{8}$$
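The Max-contribution-win rule of Eq. 8 amounts to picking the class whose pooled contribution is strictly the largest. A small sketch follows, in which the function name `max_contribution_win` and the choice to return `None` on a tie are illustrative assumptions.

```python
import numpy as np

def max_contribution_win(h):
    """Return the index j with h[j] > h[i] for all i != j, or None if the maximum is tied."""
    h = np.asarray(h, dtype=float)
    j = int(np.argmax(h))
    return j if np.sum(h == h[j]) == 1 else None

label = max_contribution_win([0.1, 0.7, 0.2])   # class 1 has the largest contribution
```

No classifier is trained here: the pooled descriptor itself decides the label, which is why this rule is sensitive to atoms shared between actions.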
The comparison results for different neighborhoods are shown in
Table 3, from which we make the following observations. (1) There
is indeed confusion among different types of actions. The
SCNL_MaxW performance drops significantly when the locality
constraint is relaxed with a large knn (only 69.33% at knn=50).
This is because the sparsity constraint dominates at a large knn
and brings very different atoms into the solution to satisfy
sparsity, which aggravates the sharing of atoms between actions.
This demonstrates the importance of the locality constraint in
SCNL, which keeps samples of the same class distributed in the
same subspace. (2) Although the actions are mixed with each other
when the locality constraint is relaxed, our method still achieves
excellent classification performance when an SVM is used, even
with a linear kernel. We also see that our model achieves very
high accuracy when evaluated on patches (SCNL_LSVM_p). Moreover,
if the majority-win rule is applied to patches, i.e., the dominant
patch label wins, we obtain a 100% recognition rate for video
clips on UCF sports, which is higher than that of SCNL_LSVM_v. The
reason is that the average-pooling strategy used in SCNL_LSVM_v
represents each video clip by only one low-dimensional feature
vector (with dimension equal to the number of classes), which
might reduce the discriminative ability of the descriptions for
complicated human actions in video clips.
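The majority-win rule on patches can be sketched with the standard library alone. The function name `majority_win` and the example labels are made up for illustration, and how ties are broken is not specified in the text.

```python
from collections import Counter

def majority_win(patch_labels):
    """Return the most frequent patch-level label as the clip-level label."""
    return Counter(patch_labels).most_common(1)[0][0]

# Hypothetical per-patch predictions for one clip.
clip_label = majority_win(["diving", "walking", "diving", "diving", "kicking"])
```

Because each patch votes separately, a few misclassified patches do not change the clip label, whereas average-pooling first compresses the whole clip into one vector before any decision is made.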
Then, to see the impact of pooling and normalization on the final
recognition performance, we perform the following experiment on
the UCF sports dataset. Two feature pooling methods (max-pooling
and average-pooling) and two normalization strategies
(ℓ1-normalization and ℓ2-normalization) are considered, and
different combinations of strategies are evaluated. The results
are presented graphically in Fig. 2. From the figure we see that
average-pooling combined with ℓ2-normalization at knn=10 produces
the best performance.

[Figure 2: Average Precision (%), axis range 85 to 100, for Maxp-L1, Maxp-L2, Avgp-L1 and Avgp-L2; legend: Knn5, Knn10, Knn20, Knn50]
Fig. 2. Performance with different pooling and normalization
strategies on the UCF sports dataset.

Table 4. Comparison of our proposed method with state-of-the-art
methods on the KTH, UCF Sports and YouTube datasets in terms of
average accuracy (%) over all classes.
                        KTH     UCF     YouTube
Kovashka et al. [15]    94.5    87.3    -
Wang et al. [16]        94.2    88.2    84.2
Le et al. [17]          93.9    86.5    75.8
Castrodad et al. [4]    96.3    97.3    89.5
SCNL_LSVM_v             96.6    99.3    86.6
4.4. Comparison with state-of-the-art methods
We also compare our approach with state-of-the-art methods on the
KTH, UCF sports and YouTube datasets. The average recognition
rates are presented in Table 4. The table shows that the proposed
method achieves better recognition performance than the
state-of-the-art methods on the KTH and UCF sports datasets. On
the YouTube dataset, however, because very few samples were
selected to tune the model parameters λ and μ, our method does not
perform as well as the two-level sparse coding method proposed
in [4], although it still outperforms the other popular methods.
Model parameter selection is a hard problem, especially with large
sample sets, and to the best of our knowledge there is no good
solution for it yet; how to automatically select the best model
parameters in nonlinear models will be a topic of our future
study.
5. CONCLUSION
In this work, we propose a new Sparse Coding with Non-negative and
Locality constraints (SCNL) model for human action recognition.
The SCNL model is datum-adaptive, and the combination of locality
and sparsity enables the model to generate sparse solutions and to
capture the global subspace structures of the data. Extensive
experiments on three well-known human action datasets demonstrate
that the proposed SCNL model provides discriminative feature
descriptions for human action recognition.
6. REFERENCES
[1] R. Rao, B. Olshausen, and M. Lewicki, Probabilistic Models of
the Brain: Perception and Neural Function, MIT Press, 2002.
[2] K. Guo, P. Ishwar, and J. Konrad, “Action recognition using
sparse representation on covariance manifolds of optical flow,”
in AVSS, 2010.
[3] G. Taylor, R. Fergus, Y. Le Cun, and C. Bregler, “Convolution-
al learning of spatio-temporal features,” in ECCV, 2010.
[4] A. Castrodad and G. Sapiro, “Sparse modeling of human action
from motion imagery,” in IJCV, 2012.
[5] J. J. Wang, J. C. Yang, K. Yu, et al., “Locality-constrained
linear coding for image classification,” in CVPR, 2010.
[6] J. C. Yang, K. Yu, Y. H. Gong, and T. Huang, “Linear spa-
tial pyramid matching using sparse coding for image classifi-
cation,” in CVPR, 2009.
[7] Patrik O Hoyer, “Modeling receptive fields with non-negative
sparse coding,” in Neurocomputing, 2003.
[8] Stephen Poythress Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[9] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human
actions: A local SVM approach,” in ICPR, 2004.
[10] Z. C. Lin, M. M. Chen, L. Q. Wu, and Y. Ma, “The augmented
Lagrange multiplier method for exact recovery of corrupted
low-rank matrices,” in Technical report, UIUC, 2009.
[11] Y. B. Chen, Z. X. Li, X. Guo, Y. Y. Zhao, and A. N. Cai, “A
spatio-temporal interest point detector based on vorticity for
action recognition,” in ICME workshop on BRUREC, 2013.
[12] Kai Yu, Tong Zhang, and Yihong Gong, “Nonlinear learning
using local coordinate coding,” in NIPS, 2009.
[13] M. Rodriguez, J. Ahmed, and M. Shah, “Action mach: A
spatio-temporal maximum average correlation height filter for
action recognition,” in CVPR, 2008.
[14] Jingen Liu, Jiebo Luo, and Mubarak Shah, “Recognizing real-
istic actions from videos in the wild,” in CVPR, 2009.
[15] Adriana Kovashka and Kristen Grauman, “Learning a hier-
archy of discriminative space-time neighborhood features for
human action recognition,” in CVPR, 2010.
[16] H. Wang, A. Klaser, C. Schmid, and Cheng-Lin Liu, “Action
recognition by dense trajectories,” in CVPR, 2011.
[17] Quoc V Le, Will Y Zou, Serena Y Yeung, and Andrew Y
Ng, “Learning hierarchical invariant spatio-temporal features
for action recognition with independent subspace analysis,” in
CVPR, 2011.