A Hierarchical Model of Shape and Appearance for Human Action …vision.stanford.edu › posters ›...

1
A Hierarchical Model of Shape and Appearance for Human Action Classification Ref: J.C. Niebles & L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. CVPR 2007. Minneapolis, USA. Highlights and Summary • A novel model for human action categorization from video sequences. • Our model can be characterized as a constellation of bags-of-features. • Use of hybrid features: combines both static shape and spatio-temporal features. Hybrid features Static features Dynamic features Original frame Feature detection Canny edge detector Spatio-temporal interest point detector [Dollár et al. 2005] Feature description Shape context [Belongie et al. 2000] Concatenated brightness gradient codebook codebook Membership assignment video frame: w = {x,a} w i = {x i ,a i } x i : i-th feature position a i : i-th feature appearance Final Representation Learning Estimate model parameters using EM { } Ω = = = K K 1 1 , , , , , 0 0 , , , , ω θ θ θ θ ω ω ω ω ω P p A X A p X p L L Σ Σ μ SVM action class ( ) ( ) ( ) C class frame p class frame p class frame p 2 1 M Conclusions • The constellation of bags-of-features is able to capture semantic information of human action classes. • Combines hybrid features: static shape features and dynamic motion features. • Capable of classifying in both frame based and video based manner. 1 University of Illinois at Urbana – Champaign 3 Princeton University 2 Universidad del Norte, Colombia [email protected] Juan Carlos Niebles 1,2 [email protected] Li Fei-Fei 1,3 Algorithm New sequence Feature extraction and description Decide on best model Recognition Class 1 Class N . . . Feature extraction Model 1 Model N Feature extraction and description Learning Video frame representation Input video sequences Learn a model for each class . . . . . . Form Codebook Class 1 Class N Previous Works Part layer Feature Layer constellation Small number of features Strong shape representation bag-of-features Large number of features No geometrical or shape information w P 3 P 1 P 2 P 4 Bg w Hierarchical model Image Part layer Feature layer Mixture components w P 3 P 1 P 2 P 4 Bg θ ω Action Models mixture components bend ω = 1 ω = 2 ω = 3 jack run wave1 jump pjump wave2 side walk http://vision.cs.princeton.edu [Weber et al. 2000, Csurka et al. 2004, Sudderth et al. 2006] ( ) ( ) ( ) ( ) Ω = * 1 , , , , , ω ω ω ω ω θ θ θ π H ure layer Local feat Part layer p p p p h h m Y w h Y h θ Y w 4 4 3 4 4 2 1 43 42 1 Approximated data likelihood: ( ) ( ) ( ) ω ω ω θ , , , | , | L L N p Σ μ h Y h Y T = Part layer term: ( ) ( ) ( ) ( ) ( ) ∏∏ = * = P p P Appearance Part A p i Shape Part X p p r i Bg Appearance Bg A j Shape Bg X r j p i j a p h x p a p x p p 1 0 0 , , , , , w w Y θ h m Y w 43 42 1 4 43 4 42 1 43 42 1 43 42 1 θ θ θ θ ω Local feature layer term: Large number of features from the bag-of-features model Strong shape representation from the constellation model. Experimental Results • 9 action classes, performed by 9 subjects [Blank et al 2005] • Leave one out cross-validation • Video Classification performance: 72.8% ( ) ( ) ( ) ( ) ( ) old old old old old p p p p p ω ω ω ω ω θ θ θ θ π θ ω Y w m h Y w h h Y Y w h , , , , , , , , * E-step: ( ) ( ) θ ω θ ω θ θ , , , ln , , , max arg h Y w Y w h h p p old new = M-Step: • Classify actions in both frame based and video based manner • Video classification based on majority votes of frames Recognition

Transcript of A Hierarchical Model of Shape and Appearance for Human Action …vision.stanford.edu › posters ›...

Page 1: A Hierarchical Model of Shape and Appearance for Human Action …vision.stanford.edu › posters › NieblesFeiFei_CVPR07_poster.pdf · 2009-05-06 · Bg Bg Appearance A j Bg Shape

A Hierarchical Model of Shape and Appearance forHuman Action Classification

Ref: J.C. Niebles & L. Fei-Fei. A hierarchical model of shape and appearance for human action classification. CVPR 2007. Minneapolis, USA.

Highlights and Summary• A novel model for human action categorization from video sequences.• Our model can be characterized as a constellation of bags-of-features.• Use of hybrid features: combines both static shape and spatio-temporal features.

Hybrid featuresStatic features Dynamic features

Originalframe

Feature detection

Canny edge detector

Spatio-temporal interest pointdetector[Dollár et al. 2005]

Feature description

Shape context[Belongie et al. 2000]

Concatenated brightness gradient

codebookcodebook

Membership assignment

video frame: w = x,awi = xi,ai

xi: i-th feature positionai: i-th feature appearance

Final Representation

Learning

Estimate model parameters using EM

Ω=

==

K

K

1

1 ,,,,, 00,,,,

ω

θθθθ ωωωωω PpAXA

p

X

pLL ΣΣµ

SVM

action class

( )( )

( )

Cclassframep

classframep

classframep

2

1

M

Conclusions• The constellation of bags-of-features is able to capture semantic information of human action classes.• Combines hybrid features: static shape features and dynamic motion features.• Capable of classifying in both frame based and video based manner.

1 University of Illinois at Urbana – Champaign

3 Princeton University2 Universidad del Norte, [email protected]

Juan Carlos Niebles 1,2

[email protected]

Li Fei-Fei 1,3

Algorithm

New sequence

Feature extraction and description

Decide on best model

Recognition

Class 1

Class N

...

Feature extraction

Model 1

Model N

Feature extraction and description Learning

Video frame representation

Input videosequences

Learn a modelfor each class

...

...

Form Codebook

Class 1

Class N

Previous Works

Partlayer

Feature Layer

constellation

Small number of

features

Strong shape

representation

bag-of-features

Large number of

features

No geometrical or

shape information

w

P3

P1 P2

P4

Bg

w

Hierarchical model

Image

Partlayer

Featurelayer

Mixturecomponents

w

P3

P1 P2

P4

Bg

θω

Action Modelsmixture components

bend

ω = 1 ω = 2 ω = 3

jack

run

wave1

jump

pjump

wave2

side

walk

http://vision.cs.princeton.edu

[Weber et al. 2000,

Csurka et al. 2004,

Sudderth et al. 2006]

( )

( ) ( ) ( )∑ ∑Ω

= ∈

1

,,,,

,

ωωωωω θθθπ

Hure layerLocal featPart layer

ppp

p

h

hmYwhYh

θYw

44 344 2143421

Approximated data likelihood:

( ) ( )( )ωωωθ ,, ,|,| LLNp ΣµhYhYT

=

Part layer term:

( )

( ) ( ) ( ) ( )∏ ∏∏= ∈∈

∗ =

P

p PAppearancePart

A

pi

ShapePart

X

pp

r

i

BgAppearanceBg

A

j

ShapeBg

Xr

j

pij

aphxpapxp

p

1

0

0 ,,

,,,

ww

Y

θhmYw

4342144 344 214342143421θθθθ

ω

Local feature layer term:

• Large number of features from the bag-of-features model• Strong shape representation from the constellation model.

Experimental Results• 9 action classes,performed by 9subjects [Blank et al 2005]• Leave one outcross-validation• Video Classificationperformance: 72.8%

( )( ) ( ) ( )

( )old

oldoldold

old

p

ppp

p

ω

ωωωω

θ

θθθπ

θω

Yw

mhYwhhY

Ywh

,

,,,,

,,,

• E-step:

( ) ( )θωθωθθ

,,,ln,,,maxarg hYwYwh

h

ppoldnew ∑=

• M-Step:

• Classify actions in both frame based and video based manner• Video classification based on majority votes of frames

Recognition