CVPR2010: grouplet: a structured image representation for recognizing human and object interactions

1

Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Bangpeng Yao and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

2

Human-Object Interaction

Playing saxophone

Human, Saxophone

Not playing saxophone

3

Robots interact with objects

Automatic sports commentary

“Kobe is dunking the ball.”

Medical care

Human-Object Interaction

4

Background: Human-Object Interaction

• Schneiderman & Kanade, 2000 • Viola & Jones, 2001 • Huang et al, 2007 • Papageorgiou & Poggio, 2000 • Wu & Nevatia, 2005 • Dalal & Triggs, 2005 • Mikolajczyk et al, 2005 • Leibe et al, 2005 • Bourdev & Malik, 2009 • Felzenszwalb & Huttenlocher, 2005 • Ren et al, 2005 • Ramanan, 2006 • Ferrari et al, 2008 • Yang & Mori, 2008 • Andriluka et al, 2009 • Eichner & Ferrari, 2009

• Lowe, 1999 • Belongie et al, 2002 • Fergus et al, 2003 • Fei-Fei et al, 2004 • Berg & Malik, 2005 • Felzenszwalb et al, 2005 • Grauman & Darrell, 2005 • Sivic et al, 2005 • Lazebnik et al, 2006 • Zhang et al, 2006 • Savarese et al, 2007 • Lampert et al, 2008 • Desai et al, 2009 • Gehler & Nowozin, 2009

• Murphy et al, 2003 • Hoiem et al, 2006 • Shotton et al, 2006

• Rabinovich et al, 2007 • Heitz & Koller, 2008 • Divvala et al, 2009

• Gupta et al, 2009

context

vs.

To be done

• Yao & Fei-Fei, 2010a • Yao & Fei-Fei, 2010b

6

Outline

• Intuition of Grouplet Representation

• Grouplet Feature Representation

• Using Grouplet for Recognition

• Dataset & Experiments

• Conclusion

8

Recognizing Human-Object Interaction is Challenging

Different background

Same object (saxophone), different interactions

Different pose (or viewpoint)

Different lighting

Different instrument, similar pose

Reference image: playing saxophone

9

Grouplet: our intuition

Bag-of-words:
• Thomas & Malik, 2001 • Csurka et al, 2004 • Fei-Fei & Perona, 2005 • Sivic et al, 2005

Spatial pyramid:
• Grauman & Darrell, 2005 • Lazebnik et al, 2006

Part-based:
• Weber et al, 2000 • Fergus et al, 2003 • Leibe et al, 2004 • Felzenszwalb et al, 2005 • Bourdev & Malik, 2009

Grouplet Representation:

[Figure: plot with horizontal axis from 0 to 200 and vertical axis from 0 to 25]

10

Grouplet: our intuition

Grouplet Representation:

• Part-based configuration

• Co-occurrence

• Discriminative

• Dense

Capture the subtle difference in human-object interactions.

12

Grouplet representation (e.g. 2-Grouplet)

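Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.

[Figure: image I with reference point P and a 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_1 : \{A_1, x_1, \sigma_1\}$ and $\lambda_2 : \{A_2, x_2, \sigma_2\}$; each feature unit is drawn as a visual codeword with a Gaussian distribution]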

13

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.

Matching score between Λ and I, in terms of the matching score between each λi and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$$

[Figure: image I with reference point P and a 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_1 : \{A_1, x_1, \sigma_1\}$ and $\lambda_2 : \{A_2, x_2, \sigma_2\}$; each feature unit is drawn as a visual codeword with a Gaussian distribution]

14

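Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
- a′: Its visual appearance;
- x′: Its image location.
• Ω(x): Image neighborhood of x.

Matching score between Λ and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$$

Matching score between λi and I, where $p(A_i \mid a')$ is the codeword assignment score and $\mathcal{N}(x' \mid x_i, \sigma_i)$ is the Gaussian density value:

$$\nu(\lambda_i, I) = \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot \mathcal{N}(x' \mid x_i, \sigma_i)$$

[Figure: image I with reference point P and a 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_1 : \{A_1, x_1, \sigma_1\}$ and $\lambda_2 : \{A_2, x_2, \sigma_2\}$; each feature unit is drawn as a visual codeword with a Gaussian distribution]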
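The matching score on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each patch is a pair of an image location and a dictionary of codeword assignment scores, the Gaussian is isotropic with variance `var`, the neighborhood Ω is a disc of a hypothetical `radius`, and the contributions of the patches in the neighborhood are summed. The names `gaussian_density`, `unit_score` and `grouplet_score` are made up for this sketch.

```python
import math

def gaussian_density(x, mean, var):
    """Isotropic 2-D Gaussian density at x (assumed form of the spatial distribution)."""
    d2 = (x[0] - mean[0]) ** 2 + (x[1] - mean[1]) ** 2
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)

def unit_score(unit, patches, radius):
    """Score of one feature unit (codeword, location, variance) on an image.

    patches: list of (location, {codeword: assignment score}) pairs.
    Sums assignment score times Gaussian density over patches within `radius`.
    """
    codeword, x_i, var = unit
    total = 0.0
    for loc, assignment in patches:
        if math.dist(loc, x_i) <= radius:
            total += assignment.get(codeword, 0.0) * gaussian_density(loc, x_i, var)
    return total

def grouplet_score(grouplet, patches, radius):
    """Score of a grouplet: the minimum score over its feature units."""
    return min(unit_score(u, patches, radius) for u in grouplet)

patches = [((10.0, 10.0), {"A1": 0.9, "A2": 0.1}),
           ((30.0, 12.0), {"A2": 0.8})]
grouplet = [("A1", (10.0, 10.0), 4.0), ("A2", (30.0, 12.0), 4.0)]
score = grouplet_score(grouplet, patches, radius=8.0)
```

Because the grouplet score is a minimum, a grouplet only scores high when every one of its feature units is matched, which is what makes it a co-occurrence feature.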

15

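Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
- a′: Its visual appearance;
- x′: Its image location.
• Ω(x): Image neighborhood of x.
• Δ: A small shift of the location.

Matching score between Λ and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$$

Matching score between λi and I, where $p(A_i \mid a')$ is the codeword assignment score and $\mathcal{N}(x' \mid x_i, \sigma_i)$ is the Gaussian density value:

$$\nu(\lambda_i, I) = \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot \mathcal{N}(x' \mid x_i, \sigma_i)$$

Matching score between Λ and I when each location may shift by a small Δ:

$$\nu(\Lambda, I) = \min_i \max_{\Delta} \sum_{x' \in \Omega(x_i + \Delta)} p(A_i \mid a') \cdot \mathcal{N}(x' \mid x_i + \Delta, \sigma_i)$$

[Figure: image I with reference point P and a 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_1 : \{A_1, x_1, \sigma_1\}$ and $\lambda_2 : \{A_2, x_2, \sigma_2\}$; each feature unit is drawn as a visual codeword with a Gaussian distribution]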
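The effect of the small shift Δ can be illustrated with a short Python sketch. It assumes that each feature unit independently picks its best shift from a small candidate list, that the Gaussian is isotropic with variance `var`, and that the neighborhood is a disc of a hypothetical `radius`; none of these details are stated on the slide, and the function names are made up for the example.

```python
import math

def unit_score(codeword, center, var, patches, radius):
    """Sum of assignment score times Gaussian density over patches near `center`."""
    total = 0.0
    for loc, assignment in patches:
        d2 = (loc[0] - center[0]) ** 2 + (loc[1] - center[1]) ** 2
        if d2 <= radius ** 2:
            density = math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)
            total += assignment.get(codeword, 0.0) * density
    return total

def shifted_unit_score(unit, patches, radius, shifts):
    """Best score of one feature unit over the candidate shifts of its location."""
    codeword, x_i, var = unit
    return max(
        unit_score(codeword, (x_i[0] + dx, x_i[1] + dy), var, patches, radius)
        for dx, dy in shifts
    )

def grouplet_score(grouplet, patches, radius, shifts=((0.0, 0.0),)):
    """Minimum over feature units of the best shifted score."""
    return min(shifted_unit_score(u, patches, radius, shifts) for u in grouplet)

# One patch sits 4 pixels to the right of where the feature unit expects it.
patches = [((14.0, 10.0), {"A1": 1.0})]
grouplet = [("A1", (10.0, 10.0), 4.0)]
rigid = grouplet_score(grouplet, patches, radius=8.0)
tolerant = grouplet_score(grouplet, patches, radius=8.0,
                          shifts=((0.0, 0.0), (4.0, 0.0), (-4.0, 0.0)))
```

In this toy case `tolerant` is larger than `rigid`, since one of the candidate shifts moves the expected location exactly onto the patch.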

16

Grouplet representation

• Part-based configuration

• Co-occurrence

• Discriminative

[Figure: the 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$ matched against four example images, with matching scores 0.6, 0.4, 0.0 and 0.1; the images are labeled "Playing saxophone" and "Other interactions"]

17

Grouplet representation

• Part-based configuration

• Co-occurrence

• Discriminative

• Dense

All possible codewords

Densely sampled image locations

Many possible spatial distributions

All possible combinations of feature units: 1-grouplet, 2-grouplet, 3-grouplet

[Figure: image I with reference point P and a 2-grouplet $\Lambda = \{\lambda_1, \lambda_2\}$, where $\lambda_1 : \{A_1, x_1, \sigma_1\}$ and $\lambda_2 : \{A_2, x_2, \sigma_2\}$]

19

A “Space” of Grouplets

20

A “Space” of Grouplets

Playing violin

Other interactions

21

A “Space” of Grouplets

Playing violin

Other interactions

Playing saxophone

Other interactions

22

A “Space” of Grouplets

Playing violin

Other interactions

Playing saxophone

Other interactions

On background

Shared by different interactions

23

We only need discriminative Grouplets

Shared by different interactions

On background

Playing violin: large ν(Λ,I); other interactions: small ν(Λ,I)

Playing saxophone: large ν(Λ,I); other interactions: small ν(Λ,I)

Number of feature units: N. N is large (192200).

Number of Grouplets: $2^N$, a very large space.
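To get a feel for how large this space is, here is a small arithmetic sketch in Python. It assumes that a grouplet is any subset of the N feature units, so that the count is 2 to the power N, and reports the number of decimal digits of that count rather than the count itself.

```python
import math

N = 192200  # number of feature units quoted on the slide

# Number of decimal digits of 2**N, computed from logarithms so that the
# huge integer never has to be printed.
decimal_digits = math.floor(N * math.log10(2)) + 1

# Number of distinct pairs of feature units (candidate 2-grouplets
# if nothing is pruned).
pairs = math.comb(N, 2)

print(decimal_digits)  # 57858
print(pairs)
```

Even the pairs alone number in the billions, which is why exhaustive evaluation is not practical.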

24

Obtaining discriminative grouplets for a class

• Obtain grouplets with large ν(Λ,I) on the class.

• Remove grouplets that also have large ν(Λ,I) on other classes.

Apriori Mining [Agrawal & Srikant, 1994]: selected 1-grouplets are combined into candidate 2-grouplets.

Number of feature units: N. N is large (192200).

Number of Grouplets: $2^N$, a very large space.

To mine 1000~2000 grouplets, only (2~100)×N grouplets need to be evaluated.
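The mining step can be sketched as a level-wise Apriori-style search. The code below is a simplified illustration, not the authors' procedure: it assumes each image is given as a dictionary from feature-unit id to matching score, defines the "support" of a grouplet on a set of images as the mean of its minimum unit score, and uses hypothetical thresholds `min_pos` and `max_neg`. Because the grouplet score is a minimum, adding a unit can never raise the support, which is what justifies the pruning.

```python
from itertools import combinations

def support(grouplet, images):
    """Mean grouplet score over images; the grouplet score is the minimum unit score."""
    return sum(min(img.get(u, 0.0) for u in grouplet) for img in images) / len(images)

def mine_grouplets(pos_images, neg_images, min_pos, max_neg, max_size=3):
    """Level-wise search for grouplets that score high on one class and low on the rest."""
    units = sorted({u for img in pos_images for u in img})
    frequent = [frozenset([u]) for u in units if support([u], pos_images) >= min_pos]
    selected = []
    for size in range(1, max_size + 1):
        if not frequent:
            break
        selected.extend(frequent)
        if size == max_size:
            break
        frequent_set = set(frequent)
        # Join selected k-grouplets into candidate (k+1)-grouplets.
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == size + 1}
        # Apriori pruning: every k-subset of a candidate must itself be selected.
        candidates = [c for c in candidates if all(c - {u} in frequent_set for u in c)]
        frequent = [c for c in candidates if support(c, pos_images) >= min_pos]
    # Drop grouplets that also score high on the other classes.
    return [g for g in selected if support(g, neg_images) <= max_neg]

pos = [{"a": 0.9, "b": 0.8, "c": 0.1}, {"a": 0.7, "b": 0.9, "c": 0.0}]
neg = [{"a": 0.8, "b": 0.1, "c": 0.0}, {"a": 0.9, "b": 0.0, "c": 0.1}]
mined = mine_grouplets(pos, neg, min_pos=0.5, max_neg=0.2)
```

In the toy data, unit "a" fires on both classes and is removed on its own, while "b" and the pair {"a", "b"} survive because they score high only on the positive class.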

25

Using Grouplets for Classification

Discriminative grouplets $\Lambda_1, \ldots, \Lambda_N$ → image I represented as $[\nu(\Lambda_1, I), \ldots, \nu(\Lambda_N, I)]$ → SVM
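As a rough sketch of the representation step, the Python below turns an image into a vector holding one matching score per mined grouplet. It assumes the per-unit matching scores of the image are already available as a dictionary and that a grouplet is a tuple of unit ids; both are conventions invented for this example. The resulting vectors would then be passed to an SVM trainer from whichever library is at hand, which is not shown here.

```python
def grouplet_score(grouplet, unit_scores):
    """Grouplet score on one image: the minimum score over its feature units."""
    return min(unit_scores.get(u, 0.0) for u in grouplet)

def feature_vector(grouplets, unit_scores):
    """One entry per mined grouplet, in a fixed order shared by all images."""
    return [grouplet_score(g, unit_scores) for g in grouplets]

mined_grouplets = [("a", "b"), ("c",), ("a", "d")]
image_unit_scores = {"a": 0.9, "b": 0.4, "c": 0.2}
vector = feature_vector(mined_grouplets, image_unit_scores)
```

Keeping the grouplet order fixed across images matters, since each vector position must mean the same thing for the classifier.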

27

People-Playing-Musical-Instruments (PPMI) Dataset

http://vision.stanford.edu/resources_links.html

[Figure: example PPMI+ and PPMI- images, with "# Image" counts per category: (172), (164), (191), (148), (177), (133), (179), (149), (200), (188), (198), (169), (185), (167)]

Original image; normalized image (200 images for each interaction)

28

Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset

Classification

• Playing different instruments

• Playing vs. Not playing (e.g. playing violin vs. not playing violin)

Detection

[Figure: example images labeled "Playing saxophone", "Playing bassoon", "Playing saxophone", "Playing French horn", "Playing violin"]

For each interaction, 100 training and 100 testing images.

29

Classification: Playing Different Instruments

• 7-class classification on PPMI+ images

SPM: [Lazebnik et al, 2006]
DPM: [Felzenszwalb et al, 2008]
Constellation: [Fergus et al, 2003] [Niebles & Fei-Fei, 2007]

[Figure: bar chart of classification accuracy (axis from 0.4 to 0.7) for Grouplet+SVM, SPM, DPM, Constellation and BoW; bar values as extracted, in no recoverable order: 59.9%, 54.9%, 39.0%, 37.7%, 65.7%]

[Figure: number of mined grouplets (axis from 0 to 1200) against grouplet size (1 to 6)]

30

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: average PPMI+ images and average PPMI- images for each instrument, and a bar chart of accuracy per instrument (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin) for Grouplet+SVM, DPM, BoW and SPM]

Bassoon, Erhu, Flute, French horn, Saxophone, Violin

31

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: average PPMI+ images and average PPMI- images for each instrument, and a bar chart of accuracy per instrument (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin) for Grouplet+SVM, DPM, BoW and SPM]

Guitar

32

Detecting people playing musical instruments

Procedure:

• Face detection with a low threshold;

• Crop and normalize image regions;

• 8-class classification:

- 7 classes of playing instruments;

- Another class of not playing any instrument.

[Figure: example detections labeled "Playing saxophone", "Not playing", "Not playing"]
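The three-step procedure can be outlined as a small Python function. Everything passed in is a placeholder: `detect_faces`, `crop_and_normalize` and `classify` are hypothetical callables standing in for a face detector, the region normalization and the 8-class classifier, and the label string "not playing" and the default threshold are assumptions made for this sketch.

```python
def detect_interactions(image, detect_faces, crop_and_normalize, classify,
                        face_threshold=0.1):
    """Return (box, label, score) for every face region classified as playing.

    detect_faces(image) yields (box, face_score) pairs; a low threshold keeps
    weak face detections so that few people are missed.
    classify(region) returns (label, score), where label is one of the seven
    playing classes or "not playing".
    """
    results = []
    for box, face_score in detect_faces(image):
        if face_score < face_threshold:
            continue
        region = crop_and_normalize(image, box)
        label, score = classify(region)
        if label != "not playing":
            results.append((box, label, score))
    return results

# Stub components, only to show how the pieces connect.
def fake_faces(image):
    return [((0, 0, 10, 10), 0.5), ((5, 5, 10, 10), 0.05), ((20, 20, 10, 10), 0.3)]

def fake_crop(image, box):
    return box

def fake_classify(region):
    return ("playing saxophone", 0.9) if region[0] == 0 else ("not playing", 0.8)

detections = detect_interactions(None, fake_faces, fake_crop, fake_classify)
```

The extra "not playing" class is what lets the classifier reject the many candidate regions that a permissive face detector produces.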

33

Detecting people playing musical instruments

Playing saxophone

Playing bassoon

Playing French horn

Playing saxophone

Playing French horn

Area under the precision-recall curve:

• Our method: 45.7%;

• Spatial pyramid: 37.3%.
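For reference, one common way to compute an area under a precision-recall curve from ranked detections is sketched below in Python. It uses a simple rectangle rule over the ranked list; the slides do not say which interpolation or matching protocol was used for the reported numbers, so this is an illustration of the metric rather than the evaluation code behind them.

```python
def pr_curve_area(detections, num_positives):
    """Area under the precision-recall curve by the rectangle rule.

    detections: list of (score, is_true_positive) pairs.
    num_positives: number of ground-truth positives, detected or not.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_positives = 0
    area = 0.0
    previous_recall = 0.0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        if is_true_positive:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / num_positives
        # Each step adds precision times the recall gained at this rank.
        area += precision * (recall - previous_recall)
        previous_recall = recall
    return area

example_area = pr_curve_area([(0.9, True), (0.8, False), (0.7, True)], num_positives=2)
```

Missed detections lower the area because they cap the recall that the ranked list can reach.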

34

Detecting people playing musical instruments

Playing French horn

False detection Missed detection

Area under the precision-recall curve:

• Our method: 45.7%;

• Spatial pyramid: 37.3%.

35

Examples of Mined Grouplets

Playing bassoon:

Playing saxophone:

Playing violin:

Playing guitar:

36

Conclusion

• Holistic image-based classification

[B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]

vs.

• Detailed understanding and reasoning: pose estimation & object detection (the next talk)

[B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]

[Figure: example images labeled "Playing saxophone", "Playing bassoon", "Playing saxophone"]

37

Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and anonymous reviewers.

And You