Page 1:

Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Bangpeng Yao and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

Page 2:

Human-Object Interaction

Image labels: Playing saxophone; Human; Saxophone; Not playing saxophone

Page 3:

Human-Object Interaction

• Robots interact with objects
• Automatic sports commentary: “Kobe is dunking the ball.”
• Medical care

Page 4:

Background: Human-Object Interaction

• Schneiderman & Kanade, 2000 • Viola & Jones, 2001 • Huang et al., 2007 • Papageorgiou & Poggio, 2000 • Wu & Nevatia, 2005 • Dalal & Triggs, 2005 • Mikolajczyk et al., 2005 • Leibe et al., 2005 • Bourdev & Malik, 2009 • Felzenszwalb & Huttenlocher, 2005 • Ren et al., 2005 • Ramanan, 2006 • Ferrari et al., 2008 • Yang & Mori, 2008 • Andriluka et al., 2009 • Eichner & Ferrari, 2009

• Lowe, 1999 • Belongie et al., 2002 • Fergus et al., 2003 • Fei-Fei et al., 2004 • Berg & Malik, 2005 • Felzenszwalb et al., 2005 • Grauman & Darrell, 2005 • Sivic et al., 2005 • Lazebnik et al., 2006 • Zhang et al., 2006 • Savarese et al., 2007 • Lampert et al., 2008 • Desai et al., 2009 • Gehler & Nowozin, 2009

• Murphy et al., 2003 • Hoiem et al., 2006 • Shotton et al., 2006

• Rabinovich et al., 2007 • Heitz & Koller, 2008 • Divvala et al., 2009

• Gupta et al., 2009

context

vs.

To be done

• Yao & Fei-Fei, 2010a • Yao & Fei-Fei, 2010b

Page 6:

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 8:

Recognizing Human-Object Interaction is Challenging

Different background

Same object (saxophone), different interactions

Different pose (or viewpoint)

Different lighting

Different instrument, similar pose

Reference image: playing saxophone

Page 9:

Grouplet: our intuition

Bag-of-words
• Thomas & Malik, 2001 • Csurka et al., 2004 • Fei-Fei & Perona, 2005 • Sivic et al., 2005

Spatial pyramid
• Grauman & Darrell, 2005 • Lazebnik et al., 2006

Part-based
• Weber et al., 2000 • Fergus et al., 2003 • Leibe et al., 2004 • Felzenszwalb et al., 2005 • Bourdev & Malik, 2009

Grouplet Representation:

[Figure: histogram plot; axis labels not recoverable]

Page 10:

Grouplet: our intuition

Grouplet Representation:

• Part-based configuration
• Co-occurrence
• Discriminative
• Dense

Captures the subtle differences in human-object interactions.

Page 11:

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 12:

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.

Example 2-grouplet:

$$\Lambda = \{\lambda_1, \lambda_2\}, \qquad \lambda_1: \{A_1, x_1, \sigma_1\}, \qquad \lambda_2: \{A_2, x_2, \sigma_2\}$$

Figure labels: I; P; Visual codewords; Gaussian distribution

Page 13:

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.

Matching score between Λ and I, in terms of the matching score between λi and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$$

Example 2-grouplet:

$$\Lambda = \{\lambda_1, \lambda_2\}, \qquad \lambda_1: \{A_1, x_1, \sigma_1\}, \qquad \lambda_2: \{A_2, x_2, \sigma_2\}$$

Figure labels: I; P; Visual codewords; Gaussian distribution

Page 14:

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
- a′: Its visual appearance;
- x′: Its image location.
• Ω(x): Image neighborhood of x.

Matching score between Λ and I, expanded with the matching score between λi and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i, \sigma_i)$$

Here $p(A_i \mid a')$ is the codeword assignment score and $N(x' \mid x_i, \sigma_i)$ is the Gaussian density value.

Example 2-grouplet:

$$\Lambda = \{\lambda_1, \lambda_2\}, \qquad \lambda_1: \{A_1, x_1, \sigma_1\}, \qquad \lambda_2: \{A_2, x_2, \sigma_2\}$$

Figure labels: I; P; Visual codewords; Gaussian distribution
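A minimal Python sketch of the matching score defined on this slide may help make the notation concrete. It assumes an isotropic 2-D Gaussian whose variance is σi, a circular neighborhood Ω(xi) of a given radius, and codeword assignment scores p(A | a′) that have already been computed for every patch. The names `FeatureUnit`, `Patch`, `unit_score`, `grouplet_score`, and `radius` are hypothetical and are not taken from the talk.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class FeatureUnit:
    codeword: int    # A_i: index of the visual codeword
    location: Point  # x_i: expected image location
    variance: float  # sigma_i: variance of an isotropic 2-D Gaussian (assumption)


@dataclass
class Patch:
    codeword_probs: Dict[int, float]  # p(A | a') for each codeword A
    location: Point                   # x': where the patch sits in the image


def gaussian_density(x: Point, mean: Point, variance: float) -> float:
    """Isotropic 2-D Gaussian density N(x | mean, variance)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * variance)) / (2.0 * math.pi * variance)


def unit_score(unit: FeatureUnit, patches: List[Patch], radius: float) -> float:
    """nu(lambda_i, I): sum over patches in the neighborhood of x_i of
    codeword assignment score times Gaussian density value."""
    total = 0.0
    for patch in patches:
        # Omega(x_i) is taken to be a disc of the given radius (assumption).
        if math.dist(patch.location, unit.location) <= radius:
            p_codeword = patch.codeword_probs.get(unit.codeword, 0.0)
            total += p_codeword * gaussian_density(patch.location, unit.location, unit.variance)
    return total


def grouplet_score(units: List[FeatureUnit], patches: List[Patch], radius: float) -> float:
    """nu(Lambda, I): the weakest feature unit decides the score."""
    return min(unit_score(u, patches, radius) for u in units)


if __name__ == "__main__":
    patches = [Patch({0: 1.0}, (0.0, 0.0)), Patch({1: 0.5}, (10.0, 0.0))]
    grouplet = [FeatureUnit(0, (0.0, 0.0), 1.0), FeatureUnit(1, (10.0, 0.0), 1.0)]
    print(grouplet_score(grouplet, patches, radius=3.0))
```

Taking the minimum means a grouplet only scores highly when every one of its feature units is present, which is how the representation encodes co-occurrence.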

Page 15:

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet.
• λi: Feature unit.
- Ai: Visual codeword;
- xi: Image location;
- σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
- a′: Its visual appearance;
- x′: Its image location.
• Ω(x): Image neighborhood of x.
• Δ: A small shift of the location.

Matching score between Λ and I, without a shift:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \sum_{x' \in \Omega(x_i)} p(A_i \mid a') \cdot N(x' \mid x_i, \sigma_i)$$

Matching score between Λ and I, allowing a small shift Δ of the location:

$$\nu(\Lambda, I) = \min_i \max_j \sum_{x' \in \Omega(x_i + \Delta_j)} p(A_i \mid a') \cdot N(x' \mid x_i + \Delta_j, \sigma_i)$$

In both formulas, $p(A_i \mid a')$ is the codeword assignment score and $N(\cdot)$ is the Gaussian density value.

Example 2-grouplet:

$$\Lambda = \{\lambda_1, \lambda_2\}, \qquad \lambda_1: \{A_1, x_1, \sigma_1\}, \qquad \lambda_2: \{A_2, x_2, \sigma_2\}$$

Figure labels: I; P; Visual codewords; Gaussian distribution
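The shift Δ can be illustrated with a short Python sketch. It follows one reading of this slide: each feature unit independently takes the best score over a small set of candidate shifts, and the grouplet score is the minimum over units. The shift set, the disc-shaped neighborhood, the isotropic Gaussian, and the names `unit_score_at` and `shifted_grouplet_score` are assumptions made for illustration, not details confirmed by the talk.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Patch = Tuple[Dict[int, float], Point]  # (p(A | a') per codeword, location x')
Unit = Tuple[int, Point, float]         # (codeword A_i, location x_i, variance sigma_i)


def unit_score_at(codeword: int, center: Point, variance: float,
                  patches: Sequence[Patch], radius: float) -> float:
    """Score of one feature unit whose Gaussian is centred at `center`."""
    total = 0.0
    for probs, (px, py) in patches:
        d2 = (px - center[0]) ** 2 + (py - center[1]) ** 2
        if d2 <= radius * radius:  # patch lies inside the neighborhood
            density = math.exp(-d2 / (2.0 * variance)) / (2.0 * math.pi * variance)
            total += probs.get(codeword, 0.0) * density
    return total


def shifted_grouplet_score(units: Sequence[Unit], patches: Sequence[Patch],
                           radius: float, shifts: Sequence[Point]) -> float:
    """min over units of (max over candidate shifts of the unit score)."""
    return min(
        max(unit_score_at(cw, (x + dx, y + dy), var, patches, radius)
            for dx, dy in shifts)
        for cw, (x, y), var in units
    )


if __name__ == "__main__":
    patches: List[Patch] = [({0: 1.0}, (1.0, 0.0))]
    units: List[Unit] = [(0, (0.0, 0.0), 1.0)]
    # Without a shift the patch falls outside the small neighborhood.
    print(shifted_grouplet_score(units, patches, 0.5, [(0.0, 0.0)]))
    # Allowing a one-pixel shift recovers the match.
    print(shifted_grouplet_score(units, patches, 0.5, [(0.0, 0.0), (1.0, 0.0)]))
```

Including the zero shift `(0.0, 0.0)` in `shifts` guarantees the shifted score is never lower than the unshifted one.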

Page 16:

Grouplet representation

• Part-based configuration
• Co-occurrence
• Discriminative

Example 2-grouplet:

$$\Lambda = \{\lambda_1, \lambda_2\}, \qquad \lambda_1: \{A_1, x_1, \sigma_1\}, \qquad \lambda_2: \{A_2, x_2, \sigma_2\}$$

Figure labels: I; P; Playing saxophone; Other interactions

Matching scores shown on the example images: 0.6, 0.4, 0.0, 0.1

Page 17:

Grouplet representation

• Part-based configuration
• Co-occurrence
• Discriminative
• Dense
- All possible codewords
- Densely sample image locations
- Many possible spatial distributions
- All possible combinations of feature units: 1-grouplet, 2-grouplet, 3-grouplet

Example 2-grouplet:

$$\Lambda = \{\lambda_1, \lambda_2\}, \qquad \lambda_1: \{A_1, x_1, \sigma_1\}, \qquad \lambda_2: \{A_2, x_2, \sigma_2\}$$

Figure labels: I; P

Page 18:

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 19:

A “Space” of Grouplets

Page 20:

A “Space” of Grouplets

Playing violin

Other interactions

Page 21:

A “Space” of Grouplets

Playing violin

Other interactions

Playing saxophone

Other interactions

Page 22:

A “Space” of Grouplets

Playing violin

Other interactions

Playing saxophone

Other interactions

On background

Shared by different interactions

Page 23:

We only need discriminative Grouplets

Playing violin: Large ν(Λ,I)

Other interactions: Small ν(Λ,I)

Playing saxophone: Large ν(Λ,I)

Other interactions: Small ν(Λ,I)

Shared by different interactions

On background

Number of feature units: N. N is large (192200).

Number of Grouplets: $2^N$, a very large space.

Page 24:

Obtaining discriminative grouplets for a class

• Obtain grouplets with large ν(Λ,I) on the class.
• Remove grouplets with large ν(Λ,I) on other classes.

Apriori Mining [Agrawal & Srikant, 1994]

Selected 1-grouplets → Candidate 2-grouplets

Number of feature units: N. N is large (192200).

Number of Grouplets: $2^N$, a very large space.

Mine 1000~2000 grouplets; only (2~100)×N grouplets need to be evaluated.
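The level-wise idea behind Apriori mining can be sketched in Python as follows. Because the grouplet score is a minimum over its feature units, adding a unit can never raise the score, so a k-grouplet is worth evaluating only if all of its (k-1)-subsets were already selected. The support measure used here (mean matching score over the class images), the thresholds `min_support` and `max_other`, and the function names are assumptions for illustration; the talk does not specify them. Both image lists are assumed to be non-empty.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

# One dict per image: feature-unit id -> precomputed nu(lambda, I).
ImageScores = Dict[str, float]
Grouplet = FrozenSet[str]


def mean_score(grouplet: Grouplet, images: List[ImageScores]) -> float:
    """Mean over images of nu(Lambda, I) = min over the grouplet's units."""
    return sum(min(img.get(u, 0.0) for u in grouplet) for img in images) / len(images)


def mine_grouplets(class_images: List[ImageScores], other_images: List[ImageScores],
                   min_support: float, max_other: float, max_size: int) -> List[Grouplet]:
    units = sorted({u for img in class_images for u in img})
    # Level 1: single feature units that score well on the class.
    level = [frozenset([u]) for u in units
             if mean_score(frozenset([u]), class_images) >= min_support]
    selected = list(level)
    size = 1
    while level and size < max_size:
        size += 1
        previous = set(level)
        # Join step: merge pairs of selected (size-1)-grouplets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Prune step: every (size-1)-subset must itself have been selected.
        candidates = {c for c in candidates if all(c - {u} in previous for u in c)}
        level = [c for c in candidates if mean_score(c, class_images) >= min_support]
        selected.extend(level)
    # Discard grouplets that also score highly on the other classes.
    return [g for g in selected if mean_score(g, other_images) <= max_other]


if __name__ == "__main__":
    playing = [{"a": 0.9, "b": 0.8, "c": 0.1}, {"a": 0.7, "b": 0.9, "c": 0.2}]
    others = [{"a": 0.8, "b": 0.1, "c": 0.1}, {"a": 0.9, "b": 0.0, "c": 0.3}]
    for g in mine_grouplets(playing, others, 0.5, 0.3, max_size=2):
        print(sorted(g))
```

In this toy run, unit `a` is frequent on the class but is dropped on its own because it also fires on the other images, while `b` and the pair `{a, b}` survive.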

Page 25:

Using Grouplets for Classification

Discriminative grouplets: $\Lambda_1, \ldots, \Lambda_N$

Image I → $[\nu(\Lambda_1, I), \ldots, \nu(\Lambda_N, I)]$ → SVM
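As a rough illustration of this slide, the Python sketch below turns one image into the vector of grouplet matching scores that would be handed to an SVM. It assumes the per-unit scores ν(λ, I) have already been computed for the image; the function name `grouplet_feature_vector` and the string unit ids are hypothetical. Training the SVM itself is left to any off-the-shelf implementation and is not shown.

```python
from typing import Dict, FrozenSet, List


def grouplet_feature_vector(unit_scores: Dict[str, float],
                            grouplets: List[FrozenSet[str]]) -> List[float]:
    """Return [nu(Lambda_1, I), ..., nu(Lambda_N, I)] for one image.

    unit_scores maps a feature-unit id to nu(lambda, I); a unit missing
    from the dict is treated as having score 0 (assumption).
    """
    return [min(unit_scores.get(u, 0.0) for u in g) for g in grouplets]


if __name__ == "__main__":
    grouplets = [frozenset({"a"}), frozenset({"a", "b"})]
    image_scores = {"a": 0.9, "b": 0.4}
    print(grouplet_feature_vector(image_scores, grouplets))
```

The order of `grouplets` fixes the order of the feature dimensions, so the same list should be used for training and test images.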

Page 26:

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 27:

People-Playing-Musical-Instruments (PPMI) Dataset

http://vision.stanford.edu/resources_links.html

PPMI+ and PPMI-, # Image (counts in extraction order; the per-instrument assignment is not recoverable): 172, 164, 191, 148, 177, 133, 179, 149, 200, 188, 198, 169, 185, 167

Original image

Normalized image (200 images each interaction)

Page 28:

Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset

Classification Detection

Playing saxophone

Playing bassoon

Playing saxophone

Playing French horn

Playing violin

vs.

Playing violin

Not playing violin

vs.

Playing different instruments

Playing vs. Not playing

For each interaction, 100 training and 100 testing images.

Page 29:

Classification: Playing Different Instruments

• 7-class classification on PPMI+ images

[Figure: bar chart of classification accuracy (axis 0.4 to 0.7). Methods: Grouplet+SVM, SPM, DPM, Constellation, BoW. Bar values, with the method assignment lost in extraction: 65.7%, 59.9%, 54.9%, 39.0%, 37.7%.]

SPM: [Lazebnik et al., 2006]
DPM: [Felzenszwalb et al., 2008]
Constellation: [Fergus et al., 2003] [Niebles & Fei-Fei, 2007]

[Figure: No. of mined Grouplets (0 to 1200) versus Grouplet size (1 to 6).]

Page 30:

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: accuracy for each instrument (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin), comparing Grouplet+SVM, DPM, BoW, and SPM, shown alongside the average PPMI+ and average PPMI- images. Instruments called out on this slide: Bassoon, Erhu, Flute, French horn, Saxophone, Violin.]

Page 31:

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: accuracy for each instrument (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin), comparing Grouplet+SVM, DPM, BoW, and SPM, shown alongside the average PPMI+ and average PPMI- images. Instrument called out on this slide: Guitar.]

Page 32:

Detecting people playing musical instruments

Procedure:

• Face detection with a low threshold;
• Crop and normalize image regions;
• 8-class classification:
- 7 classes of playing instruments;
- Another class of not playing any instrument.

Example outputs: Playing saxophone; Not playing; Not playing

Page 33:

Detecting people playing musical instruments

Detection examples: Playing saxophone; Playing bassoon; Playing French horn; Playing saxophone; Playing French horn

Area under the precision-recall curve:

• Our method: 45.7%;
• Spatial pyramid: 37.3%.
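For readers unfamiliar with the metric, the Python sketch below shows one common way to estimate the area under a precision-recall curve from ranked detections: average the precision measured at each true positive (non-interpolated average precision). The talk does not say which estimator was used, so this choice, the function name `average_precision`, and the toy scores are assumptions.

```python
from typing import Optional, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int],
                      num_positives: Optional[int] = None) -> float:
    """Non-interpolated average precision of ranked detections.

    labels[i] is 1 for a correct detection and 0 for a false one.
    num_positives is the number of ground-truth instances; it defaults to
    the number of correct detections, i.e. it assumes nothing was missed.
    """
    if num_positives is None:
        num_positives = sum(labels)
    if num_positives == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        if is_correct:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this recall step
    return precision_sum / num_positives


if __name__ == "__main__":
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
    # Same detections, but one ground-truth instance was never detected.
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0], num_positives=3))
```

Passing `num_positives` explicitly matters for detection, where missed instances lower recall and therefore the area under the curve.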

Page 34:

Detecting people playing musical instruments

Detection examples: Playing French horn; False detection; Missed detection

Area under the precision-recall curve:

• Our method: 45.7%;
• Spatial pyramid: 37.3%.

Page 35:

Examples of Mined Grouplets

Playing bassoon:

Playing saxophone:

Playing violin:

Playing guitar:

Page 36:

Conclusion

• Holistic image-based classification

[B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]

Vs.

• Detailed understanding and reasoning: pose estimation & object detection (The Next Talk)

[B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]

Figure labels: Playing saxophone; Playing bassoon; Playing saxophone

Page 37:

Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and anonymous reviewers.

And You