A Hierarchical Model of Shape and Appearance for Human Action …vision.stanford.edu › posters ›...
Transcript of A Hierarchical Model of Shape and Appearance for Human Action …vision.stanford.edu › posters ›...
A Hierarchical Model of Shape and Appearance forHuman Action Classification
Ref: J.C. Niebles & L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. CVPR 2007. Minneapolis, USA.
Highlights and Summary• A novel model for human action categorization from video sequences.• Our model can be characterized as a constellation of bags-of-features.• Use of hybrid features: combines both static shape and spatio-temporal features.
Hybrid featuresStatic features Dynamic features
Originalframe
Feature detection
Canny edge detector
Spatio-temporal interest pointdetector[Dollár et al. 2005]
Feature description
Shape context[Belongie et al. 2000]
Concatenated brightness gradient
codebookcodebook
Membership assignment
video frame: w = x,awi = xi,ai
xi: i-th feature positionai: i-th feature appearance
Final Representation
Learning
Estimate model parameters using EM
Ω=
==
K
K
1
1 ,,,,, 00,,,,
ω
θθθθ ωωωωω PpAXA
p
X
pLL ΣΣµ
SVM
action class
( )( )
( )
Cclassframep
classframep
classframep
2
1
M
Conclusions• The constellation of bags-of-features is able to capture semantic information of human action classes.• Combines hybrid features: static shape features and dynamic motion features.• Capable of classifying in both frame based and video based manner.
1 University of Illinois at Urbana – Champaign
3 Princeton University2 Universidad del Norte, [email protected]
Juan Carlos Niebles 1,2
Li Fei-Fei 1,3
Algorithm
New sequence
Feature extraction and description
Decide on best model
Recognition
Class 1
Class N
...
Feature extraction
Model 1
Model N
Feature extraction and description Learning
Video frame representation
Input videosequences
Learn a modelfor each class
...
...
Form Codebook
Class 1
Class N
Previous Works
Partlayer
Feature Layer
constellation
Small number of
features
Strong shape
representation
bag-of-features
Large number of
features
No geometrical or
shape information
w
P3
P1 P2
P4
Bg
w
Hierarchical model
Image
Partlayer
Featurelayer
Mixturecomponents
w
P3
P1 P2
P4
Bg
θω
Action Modelsmixture components
bend
ω = 1 ω = 2 ω = 3
jack
run
wave1
jump
pjump
wave2
side
walk
http://vision.cs.princeton.edu
[Weber et al. 2000,
Csurka et al. 2004,
Sudderth et al. 2006]
( )
( ) ( ) ( )∑ ∑Ω
= ∈
∗
≈
1
,,,,
,
ωωωωω θθθπ
Hure layerLocal featPart layer
ppp
p
h
hmYwhYh
θYw
44 344 2143421
Approximated data likelihood:
( ) ( )( )ωωωθ ,, ,|,| LLNp ΣµhYhYT
=
Part layer term:
( )
( ) ( ) ( ) ( )∏ ∏∏= ∈∈
∗ =
P
p PAppearancePart
A
pi
ShapePart
X
pp
r
i
BgAppearanceBg
A
j
ShapeBg
Xr
j
pij
aphxpapxp
p
1
0
0 ,,
,,,
ww
Y
θhmYw
4342144 344 214342143421θθθθ
ω
Local feature layer term:
• Large number of features from the bag-of-features model• Strong shape representation from the constellation model.
Experimental Results• 9 action classes,performed by 9subjects [Blank et al 2005]• Leave one outcross-validation• Video Classificationperformance: 72.8%
( )( ) ( ) ( )
( )old
oldoldold
old
p
ppp
p
ω
ωωωω
θ
θθθπ
θω
Yw
mhYwhhY
Ywh
,
,,,,
,,,
∗
≈
• E-step:
( ) ( )θωθωθθ
,,,ln,,,maxarg hYwYwh
h
ppoldnew ∑=
• M-Step:
• Classify actions in both frame based and video based manner• Video classification based on majority votes of frames
Recognition