Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions


Page 1: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Bangpeng Yao and Li Fei-Fei
Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

Page 2: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Human-Object Interaction

Image labels: Playing saxophone; Human; Saxophone; Not playing saxophone

Page 3: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Human-Object Interaction

• Robots interact with objects
• Automatic sports commentary: “Kobe is dunking the ball.”
• Medical care

Page 4: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Background: Human-Object Interaction

• Schneiderman & Kanade, 2000; Viola & Jones, 2001; Huang et al., 2007; Papageorgiou & Poggio, 2000; Wu & Nevatia, 2005; Dalal & Triggs, 2005; Mikolajczyk et al., 2005; Leibe et al., 2005; Bourdev & Malik, 2009; Felzenszwalb & Huttenlocher, 2005; Ren et al., 2005; Ramanan, 2006; Ferrari et al., 2008; Yang & Mori, 2008; Andriluka et al., 2009; Eichner & Ferrari, 2009

• Lowe, 1999; Belongie et al., 2002; Fergus et al., 2003; Fei-Fei et al., 2004; Berg & Malik, 2005; Felzenszwalb et al., 2005; Grauman & Darrell, 2005; Sivic et al., 2005; Lazebnik et al., 2006; Zhang et al., 2006; Savarese et al., 2007; Lampert et al., 2008; Desai et al., 2009; Gehler & Nowozin, 2009

• Murphy et al., 2003; Hoiem et al., 2006; Shotton et al., 2006

• Rabinovich et al., 2007; Heitz & Koller, 2008; Divvala et al., 2009

• Gupta et al., 2009

context

vs.

To be done

• Yao & Fei-Fei, 2010a; Yao & Fei-Fei, 2010b

Page 6: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 8: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Recognizing Human-Object Interaction is Challenging

Different background

Same object (saxophone), different interactions

Different pose (or viewpoint)

Different lighting

Different instrument, similar pose

Reference image: playing saxophone

Page 9: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet: our intuition

Bag-of-words: Thomas & Malik, 2001; Csurka et al., 2004; Fei-Fei & Perona, 2005; Sivic et al., 2005

Spatial pyramid: Grauman & Darrell, 2005; Lazebnik et al., 2006

Part-based: Weber et al., 2000; Fergus et al., 2003; Leibe et al., 2004; Felzenszwalb et al., 2005; Bourdev & Malik, 2009

Grouplet Representation:

[Figure: plot with horizontal axis 0 to 200 and vertical axis 0 to 25]

Page 10: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet: our intuition

Grouplet Representation:

• Part-based configuration
• Co-occurrence
• Discriminative
• Dense

Capture the subtle difference in human-object interactions.

Page 11: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 12: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet, e.g. $\Lambda = \{\lambda_1, \lambda_2\}$.
• λi: Feature unit, $\lambda_i = \{A_i, x_i, \sigma_i\}$.
  - Ai: Visual codeword;
  - xi: Image location;
  - σi: Variance of spatial distribution.

Figure labels: Visual codewords; Gaussian distribution

Page 13: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet, e.g. $\Lambda = \{\lambda_1, \lambda_2\}$.
• λi: Feature unit, $\lambda_i = \{A_i, x_i, \sigma_i\}$.
  - Ai: Visual codeword;
  - xi: Image location;
  - σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.

Matching score between Λ and I, in terms of the matching score between λi and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I)$$

Figure labels: Visual codewords; Gaussian distribution

Page 14: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet, e.g. $\Lambda = \{\lambda_1, \lambda_2\}$.
• λi: Feature unit, $\lambda_i = \{A_i, x_i, \sigma_i\}$.
  - Ai: Visual codeword;
  - xi: Image location;
  - σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
  - a′: Its visual appearance;
  - x′: Its image location.
• Ω(x): Image neighborhood of x.

Matching score between Λ and I, expanded through the matching score between λi and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \sum_{x' \in \Omega(x_i)} p(A_i \mid a')\, N(x' \mid x_i, \sigma_i)$$

Here $p(A_i \mid a')$ is the codeword assignment score and $N(x' \mid x_i, \sigma_i)$ is the Gaussian density value.

Figure labels: Visual codewords; Gaussian distribution

Page 15: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet representation (e.g. 2-Grouplet)

Notations

• I: Image.
• P: Reference point in the image.
• Λ: Grouplet, e.g. $\Lambda = \{\lambda_1, \lambda_2\}$.
• λi: Feature unit, $\lambda_i = \{A_i, x_i, \sigma_i\}$.
  - Ai: Visual codeword;
  - xi: Image location;
  - σi: Variance of spatial distribution.
• ν(Λ,I): Matching score of Λ and I.
• ν(λi,I): Matching score of λi and I.
• For an image patch:
  - a′: Its visual appearance;
  - x′: Its image location.
• Ω(x): Image neighborhood of x.
• Δ: A small shift of the location.

Matching score between Λ and I, expanded through the matching score between λi and I:

$$\nu(\Lambda, I) = \min_i \nu(\lambda_i, I) = \min_i \sum_{x' \in \Omega(x_i)} p(A_i \mid a')\, N(x' \mid x_i, \sigma_i)$$

Matching score between Λ and I when small shifts of the location are allowed:

$$\nu(\Lambda, I) = \min_i \max_j \sum_{x' \in \Omega(x_i + \Delta_j)} p(A_i \mid a')\, N(x' \mid x_i + \Delta_j, \sigma_i)$$

In both expressions, $p(A_i \mid a')$ is the codeword assignment score and $N(\cdot)$ is the Gaussian density value.

Figure labels: Visual codewords; Gaussian distribution
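The matching scores defined on these slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes ν(λi, I) to be a sum, over patches within a fixed radius of xi, of the codeword assignment score times an isotropic 2-D Gaussian density, treats σi as the variance of that Gaussian, and omits the small shift Δ. The names `unit_score`, `grouplet_score`, and `radius`, as well as the toy data, are hypothetical.

```python
import math

def gaussian_density(x, mean, variance):
    """Isotropic 2-D Gaussian density N(x | mean, variance * I)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * variance)) / (2.0 * math.pi * variance)

def unit_score(unit, patches, radius):
    """nu(lambda_i, I): sum over patches x' in the neighborhood Omega(x_i)."""
    codeword, center, variance = unit
    total = 0.0
    for location, assignment in patches:
        if math.dist(location, center) <= radius:  # assumed form of Omega(x_i)
            # codeword assignment score p(A_i | a') times Gaussian density value
            total += assignment.get(codeword, 0.0) * gaussian_density(location, center, variance)
    return total

def grouplet_score(grouplet, patches, radius=3.0):
    """nu(Lambda, I): the minimum score over the grouplet's feature units."""
    return min(unit_score(unit, patches, radius) for unit in grouplet)

# Toy image: each patch is (location, {codeword: assignment score}).
patches = [((10.0, 10.0), {"A1": 0.9}), ((30.0, 12.0), {"A2": 0.8})]
# Toy 2-grouplet: each feature unit is (codeword, location, variance).
grouplet = [("A1", (10.0, 10.0), 4.0), ("A2", (30.0, 10.0), 4.0)]

score = grouplet_score(grouplet, patches)
score_missing_unit = grouplet_score(grouplet, patches[:1])  # no patch supports A2
print(score, score_missing_unit)
```

Because the grouplet score is a minimum, removing the patch that supports one feature unit drives the whole score to zero, which is the co-occurrence behaviour the slides emphasize.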

Page 16: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet representation

• Part-based configuration
• Co-occurrence
• Discriminative

Example matching scores shown: 0.6, 0.4, 0.0, 0.1

Image groups: Playing saxophone; Other interactions

Figure notation: image $I$, reference point $P$, grouplet $\Lambda = \{\lambda_1, \lambda_2\}$ with $\lambda_1: \{A_1, x_1, \sigma_1\}$ and $\lambda_2: \{A_2, x_2, \sigma_2\}$

Page 17: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Grouplet representation

• Part-based configuration
• Co-occurrence
• Discriminative
• Dense

All possible Codewords

Densely sample image locations

Many possible spatial distributions

All possible combinations of feature units: 1-grouplet, 2-grouplet, 3-grouplet

Figure notation: image $I$, reference point $P$, grouplet $\Lambda = \{\lambda_1, \lambda_2\}$ with $\lambda_1: \{A_1, x_1, \sigma_1\}$ and $\lambda_2: \{A_2, x_2, \sigma_2\}$

Page 18: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 19: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

A “Space” of Grouplets

Page 20: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

A “Space” of Grouplets

Playing violin

Other interactions

Page 21: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

A “Space” of Grouplets

Playing violin

Other interactions

Playing saxophone

Other interactions

Page 22: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

A “Space” of Grouplets

Playing violin

Other interactions

Playing saxophone

Other interactions

On background

Shared by different interactions

Page 23: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

We only need discriminative Grouplets

• Playing violin: large ν(Λ,I); Other interactions: small ν(Λ,I)
• Playing saxophone: large ν(Λ,I); Other interactions: small ν(Λ,I)

Other regions of the space: Shared by different interactions; On background

Number of feature units: N. N is large (192200).

Number of Grouplets: $2^N$, a very large space.
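To give a sense of scale, the size of the grouplet space can be worked out directly. The short sketch below assumes the slide's count means one candidate grouplet per subset of the N = 192200 feature units, i.e. 2^N, and reports how many decimal digits that number has. It uses logarithms rather than printing the integer itself.

```python
import math

N = 192200  # number of feature units quoted on the slide

# Decimal digits of 2**N, via logarithms so the huge integer is never formatted.
digits = math.floor(N * math.log10(2)) + 1

# Cross-check with exact integer arithmetic: 2**N needs N + 1 binary digits.
bits = (2 ** N).bit_length()

print(f"2^{N} has {digits} decimal digits and {bits} bits")
```

A number with tens of thousands of digits cannot be enumerated, which is why a search that evaluates only a small multiple of N candidates matters.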

Page 24: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Obtaining discriminative grouplets for a class

• Obtain grouplets with large ν(Λ,I) on the class.
• Remove grouplets with large ν(Λ,I) from other classes.

Apriori Mining [Agrawal & Srikant, 1994]: Selected 1-grouplets; Candidate 2-grouplets

Number of feature units: N. N is large (192200).

Number of Grouplets: $2^N$, a very large space.

Mine 1000~2000 grouplets, only need to evaluate (2~100)×N grouplets
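The two steps on this slide can be sketched as a level-wise, Apriori-style search. This is a hedged illustration rather than the authors' procedure. It assumes each image is summarized by a list of per-unit matching scores, uses the mean grouplet score over a set of images as the support measure, and applies made-up thresholds `t_pos` and `t_neg`; all function names are hypothetical. The pruning is valid under these assumptions because a grouplet's score is a minimum over its units, so adding a unit can never raise its support.

```python
from itertools import combinations

def grouplet_score(unit_scores, grouplet):
    """nu(Lambda, I): minimum of the member units' scores on one image."""
    return min(unit_scores[u] for u in grouplet)

def support(images, grouplet):
    """Mean matching score over a set of images (assumed support measure)."""
    return sum(grouplet_score(img, grouplet) for img in images) / len(images)

def mine_grouplets(pos_images, neg_images, n_units, max_size=2, t_pos=0.5, t_neg=0.2):
    """Level-wise search for grouplets scoring high on one class and low on the rest."""
    # Level 1: keep single feature units with large scores on the target class.
    frequent = [frozenset([u]) for u in range(n_units)
                if support(pos_images, (u,)) >= t_pos]
    units = sorted(u for g in frequent for u in g)
    mined = list(frequent)
    for size in range(2, max_size + 1):
        survivors = set(frequent)
        # Candidates: extend each surviving grouplet by one surviving unit ...
        candidates = {g | {u} for g in frequent for u in units if u not in g}
        # ... and keep those whose (size - 1)-subsets all survived (Apriori pruning).
        candidates = [c for c in candidates
                      if all(frozenset(s) in survivors for s in combinations(c, size - 1))]
        frequent = [c for c in candidates if support(pos_images, c) >= t_pos]
        mined.extend(frequent)
    # Remove grouplets that also score high on the other classes.
    return [g for g in mined if support(neg_images, g) <= t_neg]

# Toy data: rows are images, columns are scores of feature units 0, 1, 2.
pos_images = [[0.9, 0.8, 0.1], [0.7, 0.9, 0.2]]
neg_images = [[0.8, 0.1, 0.1], [0.9, 0.0, 0.3]]
result = mine_grouplets(pos_images, neg_images, n_units=3)
print(result)
```

In the toy data, unit 0 fires on both classes and is discarded as a 1-grouplet, yet it still contributes to the 2-grouplet {0, 1}, which fires only when both units are present.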

Page 25: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Using Grouplets for Classification

Image $I$ → discriminative grouplets $\Lambda_1, \ldots, \Lambda_N$ → feature vector $[\nu(\Lambda_1, I), \ldots, \nu(\Lambda_N, I)]$ → SVM
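As a small illustration of the pipeline on this slide, the sketch below builds one feature vector per image from the matching scores of a fixed list of grouplets. It assumes each image is already summarized by per-unit matching scores; the grouplets, scores, labels, and function names are toy values chosen for the example, and the classifier itself is left out.

```python
def grouplet_score(unit_scores, grouplet):
    """nu(Lambda, I): minimum of the member units' scores on one image."""
    return min(unit_scores[u] for u in grouplet)

def image_feature_vector(unit_scores, grouplets):
    """[nu(Lambda_1, I), ..., nu(Lambda_N, I)] for one image."""
    return [grouplet_score(unit_scores, g) for g in grouplets]

# Hypothetical mined grouplets, given as tuples of feature-unit indices.
grouplets = [(1,), (0, 1), (0, 2)]

# Toy training set: (per-unit matching scores, label).
train = [([0.9, 0.8, 0.1], "playing"), ([0.8, 0.1, 0.1], "not playing")]

X = [image_feature_vector(scores, grouplets) for scores, _ in train]
y = [label for _, label in train]
print(X, y)
```

`X` and `y` are in the row-per-sample layout that common SVM libraries accept, for example scikit-learn's `SVC().fit(X, y)`; the slides do not say which SVM implementation or kernel was used.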

Page 26: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Outline

• Intuition of Grouplet Representation
• Grouplet Feature Representation
• Using Grouplet for Recognition
• Dataset & Experiments
• Conclusion

Page 27: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

People-Playing-Musical-Instruments (PPMI) Dataset
http://vision.stanford.edu/resources_links.html

PPMI+ and PPMI- image counts (# Image), in slide order: 172, 164, 191, 148, 177, 133, 179, 149, 200, 188, 198, 169, 185, 167

Original image; Normalized image (200 images each interaction)

Page 28: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Recognition Tasks on People-Playing-Musical-Instruments (PPMI) Dataset

Classification Detection

Playing saxophone

Playing bassoon

Playing saxophone

Playing French horn

Playing violin

vs.

Playing violin

Not playing violin

vs.

Playing different instruments

Playing vs. Not playing

For each interaction, 100 training and 100 testing images.

Page 29: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Classification: Playing Different Instruments

• 7-class classification on PPMI+ images

SPM: [Lazebnik et al., 2006]
DPM: [Felzenszwalb et al., 2008]
Constellation: [Fergus et al., 2003] [Niebles & Fei-Fei, 2007]

[Figure: bar chart of classification accuracy (vertical axis 0.4 to 0.7). Bar labels: Grouplet+SVM, SPM, DPM, Constellation, BoW. Values shown: 59.9%, 54.9%, 39.0%, 37.7%, 65.7% (pairing of values to bars not recoverable).]

[Figure: number of mined Grouplets (vertical axis 0 to 1200) versus Grouplet size (horizontal axis 1 to 6).]

Page 30: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: accuracy per instrument (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin) for BoW, DPM, SPM, and Grouplet+SVM, shown with the average PPMI+ and average PPMI- images.]

Bassoon, Erhu, Flute, French horn, Saxophone, Violin

Page 31: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Classifying Playing vs. Not playing

• Seven 2-class classification problems; PPMI+ vs. PPMI- for each instrument.

[Figure: accuracy per instrument (Bassoon, Erhu, Flute, French horn, Guitar, Saxophone, Violin) for BoW, DPM, SPM, and Grouplet+SVM, shown with the average PPMI+ and average PPMI- images.]

Guitar

Page 32: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Detecting people playing musical instruments

Procedure:

• Face detection with a low threshold;
• Crop and normalize image regions;
• 8-class classification:
  - 7 classes of playing instruments;
  - Another class of not playing any instrument.

Image labels: Playing saxophone; Not playing; Not playing

Page 33: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Detecting people playing musical instruments

Playing saxophone

Playing bassoon

Playing French horn

Playing saxophone

Playing French horn

Area under the precision-recall curve:
• Our method: 45.7%;
• Spatial pyramid: 37.3%.
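For readers unfamiliar with the metric, the area under a precision-recall curve can be estimated from a ranked list of detections. The sketch below uses one common step-wise estimator (often called average precision). The slides do not say which estimator produced the reported numbers, so this is an assumption, and the scores and labels are toy values.

```python
def precision_recall_area(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: detector confidences; labels: 1 for a correct detection, 0 otherwise.
    Assumes every true instance appears somewhere in the list of detections.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        return 0.0
    area, true_pos, prev_recall = 0.0, 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        recall = true_pos / n_pos
        precision = true_pos / rank
        # Add a rectangle whenever recall increases.
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area

mixed = precision_recall_area([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
perfect = precision_recall_area([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])
print(mixed, perfect)
```

A ranking that places all correct detections first gives an area of 1.0; interleaving false detections lowers it.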

Page 34: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Detecting people playing musical instruments

Playing French horn

False detection Missed detection

Area under the precision-recall curve:
• Our method: 45.7%;
• Spatial pyramid: 37.3%.

Page 35: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Examples of Mined Grouplets

Playing bassoon:

Playing saxophone:

Playing violin:

Playing guitar:

Page 36: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Conclusion

• Holistic image-based classification
[B. Yao and L. Fei-Fei. “Grouplet: A structured image representation for recognizing human and object interactions.” CVPR 2010.]

Vs.

• Detailed understanding and reasoning: pose estimation & object detection (The Next Talk)
[B. Yao and L. Fei-Fei. “Modeling mutual context of object and human pose in human-object interaction activities.” CVPR 2010.]

Image labels: Playing saxophone; Playing bassoon; Playing saxophone

Page 37: Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions

Thanks to Juan Carlos Niebles, Jia Deng, Jia Li, Hao Su, Silvio Savarese, and anonymous reviewers.

And You