CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities Bangpeng Yao and Li Fei-Fei Computer Science Department, Stanford University {bangpeng,feifeili}@cs.stanford.edu 1


Page 1: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Modeling Mutual Context of Object

and Human Pose in Human-Object

Interaction Activities

Bangpeng Yao and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

1

Page 2: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Robots interact

with objects

Automatic sports

commentary

“Kobe is dunking the ball.”

2

Human-Object Interaction

Medical care

Page 3: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

3

Vs.

Human-Object Interaction

Playing

saxophone

Playing

bassoon

Playing

saxophone

Grouplet is a generic feature for structured objects, or interactions

of groups of objects.

(Previous talk: Grouplet)

Caltech101

HOI activity: Tennis Forehand

Holistic image based classification

Detailed understanding and reasoning

Berg & Malik, 2005: 48%
Grauman & Darrell, 2005: 59%
Gehler & Nowozin, 2009: 77%
OURS: 62%

Page 4: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

4

Human-Object Interaction

Torso

Head

• Human pose estimation

Holistic image based classification

Detailed understanding and reasoning

Page 5: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

5

Human-Object Interaction

Tennis

racket

• Human pose estimation

Holistic image based classification

Detailed understanding and reasoning

• Object detection

Page 6: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

6

Human-Object Interaction

• Human pose estimation

Holistic image based classification

Detailed understanding and reasoning

• Object detection

Torso

Head

Tennis

racket

HOI activity: Tennis Forehand

Page 7: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Background and Intuition

• Mutual Context of Object and Human Pose

Model Representation

Model Learning

Model Inference

• Experiments

• Conclusion

Outline

7

Page 8: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Background and Intuition

• Mutual Context of Object and Human Pose

Model Representation

Model Learning

Model Inference

• Experiments

• Conclusion

Outline

8

Page 9: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Felzenszwalb & Huttenlocher, 2005

• Ren et al, 2005

• Ramanan, 2006

• Ferrari et al, 2008

• Yang & Mori, 2008

• Andriluka et al, 2009

• Eichner & Ferrari, 2009

Difficult part

appearance

Self-occlusion

Image region looks

like a body part

Human pose estimation & Object detection

9

Human pose

estimation is

challenging.

Page 10: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Human pose estimation & Object detection

10

Human pose

estimation is

challenging.

• Felzenszwalb & Huttenlocher, 2005

• Ren et al, 2005

• Ramanan, 2006

• Ferrari et al, 2008

• Yang & Mori, 2008

• Andriluka et al, 2009

• Eichner & Ferrari, 2009

Page 11: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Human pose estimation & Object detection

11

Facilitate

Given the

object is

detected.

Page 12: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Viola & Jones, 2001

• Lampert et al, 2008

• Divvala et al, 2009

• Vedaldi et al, 2009

Small, low-

resolution, partially

occluded

Image region similar

to detection target

Human pose estimation & Object detection

12

Object

detection is

challenging

Page 13: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Human pose estimation & Object detection

13

Object

detection is

challenging

• Viola & Jones, 2001

• Lampert et al, 2008

• Divvala et al, 2009

• Vedaldi et al, 2009

Page 14: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Human pose estimation & Object detection

14

Facilitate

Given the

pose is

estimated.

Page 15: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Human pose estimation & Object detection

15

Mutual Context

Page 16: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Hoiem et al, 2006

• Rabinovich et al, 2007

• Oliva & Torralba, 2007

• Heitz & Koller, 2008

• Desai et al, 2009

• Divvala et al, 2009

• Murphy et al, 2003

• Shotton et al, 2006

• Harzallah et al, 2009

• Li, Socher & Fei-Fei, 2009

• Marszalek et al, 2009

• Bao & Savarese, 2010

Context in Computer Vision

~3-4%

with

context

without

context

Helpful, but only moderately: methods with context outperform those without by ~3-4%.

Previous work – Use context

cues to facilitate object detection:

• Viola & Jones, 2001

• Lampert et al, 2008

16

Page 17: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Context in Computer Vision

Our approach – Two challenging

tasks serve as mutual context of

each other:

With

mutual

context:

Without

context:

17

~3-4%

with

context

without

context

Helpful, but only moderately: methods with context outperform those without by ~3-4%.

Previous work – Use context

cues to facilitate object detection:

• Hoiem et al, 2006

• Rabinovich et al, 2007

• Oliva & Torralba, 2007

• Heitz & Koller, 2008

• Desai et al, 2009

• Divvala et al, 2009

• Murphy et al, 2003

• Shotton et al, 2006

• Harzallah et al, 2009

• Li, Socher & Fei-Fei, 2009

• Marszalek et al, 2009

• Bao & Savarese, 2010

Page 18: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Background and Intuition

• Mutual Context of Object and Human Pose

Model Representation

Model Learning

Model Inference

• Experiments

• Conclusion

Outline

18

Page 19: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

19

H

A

Mutual Context Model Representation

• More than one H for each A;

• Unobserved during training.

A:

Croquet

shot

Volleyball

smash

Tennis

forehand

Intra-class variations

Activity

Object

Human pose

Body parts

lP: location; θP: orientation; sP: scale.

Croquet mallet

Volleyball

Tennis racket

O:

H:

P:

f: Shape context. [Belongie et al, 2002]

P1

Image evidence

fO

f1 f2 fN

O

P2 PN

Page 20: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

20

Mutual Context Model Representation

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, with clique potential $\psi_e$ and clique weight $w_e$.

Graph nodes: A, H, O, P1, P2, ..., PN; image evidence fO, f1, f2, ..., fN.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.
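
As a rough illustration of how co-occurrence potentials and a weighted sum over cliques fit together, the Python sketch below estimates pairwise co-occurrence frequencies from labeled (A, O, H) triples and combines them with clique weights. The function names, the toy labels, and the unit weights are assumptions made for illustration; the slides do not say how the frequencies are normalized.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_potentials(samples):
    """Estimate pairwise co-occurrence frequencies from (A, O, H) triples.

    Returns a dict mapping an edge name such as ("A", "O") to a dict
    that maps a value pair to its relative frequency in `samples`.
    """
    names = ("A", "O", "H")
    tables = {}
    for i, j in combinations(range(3), 2):
        counts = Counter((s[i], s[j]) for s in samples)
        total = sum(counts.values())
        tables[(names[i], names[j])] = {k: v / total for k, v in counts.items()}
    return tables


def model_score(assignment, tables, weights):
    """Weighted sum of clique potentials for one (A, O, H) assignment."""
    score = 0.0
    for (u, v), table in tables.items():
        psi = table.get((assignment[u], assignment[v]), 0.0)
        score += weights[(u, v)] * psi
    return score


# Toy training labels (hypothetical): activity, object, pose cluster id.
samples = [
    ("tennis_forehand", "racket", 0),
    ("tennis_forehand", "racket", 1),
    ("croquet_shot", "mallet", 2),
    ("croquet_shot", "mallet", 2),
]
tables = cooccurrence_potentials(samples)
weights = {edge: 1.0 for edge in tables}  # unit weights, an assumption

good = model_score({"A": "tennis_forehand", "O": "racket", "H": 0}, tables, weights)
bad = model_score({"A": "tennis_forehand", "O": "mallet", "H": 0}, tables, weights)
```

In this toy setup a consistent assignment (forehand with racket) scores higher than an inconsistent one (forehand with mallet), which is the behaviour the co-occurrence terms are meant to encourage.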

Page 21: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

21

Mutual Context Model Representation

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, with clique potential $\psi_e$ and clique weight $w_e$.

Graph nodes: A, H, O, P1, P2, ..., PN; image evidence fO, f1, f2, ..., fN.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

• $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: Spatial relationship among object and body parts.

$\psi_e(O, P_n)$: $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, size)
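
The spatial potential on this slide is annotated with location, orientation, and size terms. The sketch below shows one way such a potential could be evaluated, assuming it factorizes into a binned location offset, a binned orientation difference, and a Gaussian on the scale ratio. The bin counts, the offset range, the lookup tables, and all function names are hypothetical choices for illustration, not values from the talk.

```python
import math


def bin_index(value, lo, hi, n_bins):
    """Map `value` in [lo, hi) to a bin index, clamping out-of-range values."""
    frac = (value - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(frac * n_bins)))


def gaussian(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def spatial_potential(obj, part, loc_table, ori_table, scale_mean, scale_std,
                      max_offset=100.0, n_loc=4, n_ori=4):
    """Product of a binned location term, a binned orientation term,
    and a Gaussian term on the scale ratio.

    `obj` and `part` are dicts with keys "x", "y", "theta" (radians), "s".
    `loc_table[ix][iy]` and `ori_table[k]` hold (assumed) learned bin values.
    """
    dx = obj["x"] - part["x"]
    dy = obj["y"] - part["y"]
    ix = bin_index(dx, -max_offset, max_offset, n_loc)
    iy = bin_index(dy, -max_offset, max_offset, n_loc)
    # Wrap the orientation difference into [-pi, pi).
    dtheta = (obj["theta"] - part["theta"] + math.pi) % (2.0 * math.pi) - math.pi
    k = bin_index(dtheta, -math.pi, math.pi, n_ori)
    ratio = obj["s"] / part["s"]
    return loc_table[ix][iy] * ori_table[k] * gaussian(ratio, scale_mean, scale_std)


# Hypothetical tables: low values everywhere except one favoured offset bin.
loc_table = [[0.05] * 4 for _ in range(4)]
loc_table[3][1] = 0.25
ori_table = [0.1, 0.1, 0.7, 0.1]

torso = {"x": 100.0, "y": 120.0, "theta": 0.0, "s": 1.0}
racket_near = {"x": 170.0, "y": 90.0, "theta": 0.3, "s": 0.5}
racket_far = {"x": 20.0, "y": 200.0, "theta": 0.3, "s": 0.5}

p_near = spatial_potential(racket_near, torso, loc_table, ori_table, 0.5, 0.1)
p_far = spatial_potential(racket_far, torso, loc_table, ori_table, 0.5, 0.1)
```

The two candidate rackets differ only in where they sit relative to the torso, so the ratio of their potentials equals the ratio of the two location-bin values.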

Page 22: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

22

Mutual Context Model Representation

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, with clique potential $\psi_e$ and clique weight $w_e$.

Graph nodes: A, H, O, P1, P2, ..., PN; image evidence fO, f1, f2, ..., fN. Connectivity obtained by structure learning.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

• $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: Spatial relationship among object and body parts.

$\psi_e(O, P_n)$: $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, size)

• Learn structural connectivity among the body parts and the object.

Page 23: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

23

Mutual Context Model Representation

Markov Random Field: $\sum_{e \in E} w_e \psi_e$, with clique potential $\psi_e$ and clique weight $w_e$.

Graph nodes: A, H, O, P1, P2, ..., PN; image evidence fO, f1, f2, ..., fN.

• $\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$: Frequency of co-occurrence between A, O, and H.

• $\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$: Spatial relationship among object and body parts.

$\psi_e(O, P_n)$: $\mathrm{bin}(l_O - l_{P_n}) \cdot \mathrm{bin}(\theta_O - \theta_{P_n}) \cdot \mathcal{N}(s_O / s_{P_n})$ (location, orientation, size)

• Learn structural connectivity among the body parts and the object.

• $\psi_e(O, f_O)$ and $\psi_e(P_n, f_{P_n})$: Discriminative part detection scores. [Andriluka et al, 2009]

Shape context + AdaBoost [Belongie et al, 2002] [Viola & Jones, 2001]

Page 24: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Background and Intuition

• Mutual Context of Object and Human Pose

Model Representation

Model Learning

Model Inference

• Experiments

• Conclusion

Outline

24

Page 25: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

25

Model Learning

H

O

A

fO

f1 f2 fN

P1 P2 PN

$\sum_{e \in E} w_e \psi_e$

cricket

shot

cricket

bowling

Input:

Goals:

Hidden human poses

Page 26: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

26

Model Learning

H

O

A

fO

f1 f2 fN

P1 P2 PN

Input:

Goals:

Hidden human poses

Structural connectivity

$\sum_{e \in E} w_e \psi_e$

cricket

shot

cricket

bowling

Page 27: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

$\sum_{e \in E} w_e \psi_e$

27

Model Learning

Goals:

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

H

O

A

fO

f1 f2 fN

P1 P2 PN

Input:

cricket

shot

cricket

bowling

Page 28: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

28

Model Learning

Goals:

Parameter estimation

Hidden variables

Structure learning

H

O

A

fO

f1 f2 fN

P1 P2 PN

Input:

$\sum_{e \in E} w_e \psi_e$

cricket

shot

cricket

bowling

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Page 29: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

29

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

croquet shot

$\sum_{e \in E} w_e \psi_e$

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Page 30: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

30

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

$\max_E \; \sum_{e \in E} w_e \psi_e - \frac{|E|^2}{2\sigma^2}$

First term: joint density of the model. Second term: Gaussian prior on the number of edges.

Hill-climbing

$\sum_{e \in E} w_e \psi_e$

Hidden human poses

Structural connectivity

Potential parameters

Potential weights
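
The structure-learning slide pairs the joint density of the model with a Gaussian prior on the number of edges and searches by hill-climbing. The sketch below is a minimal greedy version of that idea under a strong simplifying assumption: each candidate edge contributes a fixed, precomputed gain to the fit term, whereas in the actual model the fit would be re-evaluated for each candidate structure. The gains, `sigma`, and the function name are hypothetical.

```python
from itertools import combinations


def hill_climb_structure(nodes, edge_gain, sigma, max_iters=100):
    """Greedy hill-climbing over edge sets.

    Objective: sum of edge_gain[e] for e in E, minus |E|^2 / (2 * sigma^2).
    Each step toggles (adds or removes) the single edge that most improves
    the objective, and the search stops when no toggle helps.
    """
    def objective(edges):
        fit = sum(edge_gain[e] for e in edges)
        return fit - len(edges) ** 2 / (2.0 * sigma ** 2)

    candidates = list(combinations(sorted(nodes), 2))
    edges = set()
    best = objective(edges)
    for _ in range(max_iters):
        best_move, best_value = None, best
        for e in candidates:
            trial = edges ^ {e}  # toggle edge e
            value = objective(trial)
            if value > best_value:
                best_move, best_value = trial, value
        if best_move is None:
            break
        edges, best = best_move, best_value
    return edges, best


# Hypothetical per-edge gains: three informative edges, the rest weak.
nodes = ["O", "P1", "P2", "P3"]
edge_gain = {e: 0.1 for e in combinations(sorted(nodes), 2)}
edge_gain[("O", "P1")] = 2.0
edge_gain[("P1", "P2")] = 1.5
edge_gain[("P2", "P3")] = 1.2

edges, value = hill_climb_structure(nodes, edge_gain, sigma=2.0)
```

With these toy numbers the search keeps the three informative edges and stops, because adding a fourth weak edge costs more in the edge-count penalty than it gains in fit.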

Page 31: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

31

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

$\psi_e(A, O)$, $\psi_e(A, H)$, $\psi_e(O, H)$

$\psi_e(O, P_n)$, $\psi_e(H, P_n)$, $\psi_e(P_m, P_n)$

$\psi_e(O, f_O)$, $\psi_e(P_n, f_{P_n})$

• Maximum likelihood

• Standard AdaBoost

$\sum_{e \in E} w_e \psi_e$

Hidden human poses

Structural connectivity

Potential parameters

Potential weights

Page 32: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

32

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

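Max-margin learning

$\min_{\mathbf{w}, \xi} \; \frac{1}{2} \sum_r \|\mathbf{w}_r\|_2^2 + \beta \sum_i \xi_i$

s.t. $\forall i, r$ where $y(r) \neq y(c_i)$: $\mathbf{w}_{c_i} \cdot \mathbf{x}_i - \mathbf{w}_r \cdot \mathbf{x}_i \geq 1 - \xi_i$; $\forall i$, $\xi_i \geq 0$

Notations

• xi: Potential values of the i-th image.

• wr: Potential weights of the r-th pose.

• y(r): Activity of the r-th pose.

• ξi: A slack variable for the i-th image.

$\sum_{e \in E} w_e \psi_e$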

Hidden human poses

Structural connectivity

Potential parameters

Potential weights
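
To make the max-margin formulation more concrete, the sketch below evaluates an objective of this kind for fixed weights: a squared-norm regularizer over the pose weight vectors plus a weighted sum of slacks, where each slack is the smallest value that satisfies every margin constraint for its image. It only evaluates the objective and does not solve the optimization. The trade-off constant `beta`, the reading of `c` as the pose assigned to each image, and all names and numbers are assumptions for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def max_margin_objective(weights, pose_activity, images, beta):
    """Evaluate 0.5 * sum_r ||w_r||^2 + beta * sum_i xi_i.

    weights[r]       : weight vector w_r of pose r
    pose_activity[r] : activity label y(r) of pose r
    images           : list of (x_i, c_i); c_i is assumed to be the index
                       of the pose assigned to image i
    Each slack xi_i is the smallest non-negative value satisfying all
    margin constraints against poses of a different activity.
    """
    reg = 0.5 * sum(dot(w, w) for w in weights)
    slack_total = 0.0
    for x, c in images:
        own = dot(weights[c], x)
        xi = 0.0
        for r, w in enumerate(weights):
            if pose_activity[r] != pose_activity[c]:
                xi = max(xi, 1.0 - (own - dot(w, x)))
        slack_total += xi
    return reg + beta * slack_total


# Toy problem (hypothetical): two poses, one per activity, two images.
weights = [[1.0, 0.0], [0.0, 1.0]]
pose_activity = ["tennis_forehand", "croquet_shot"]
images = [([2.0, 0.0], 0), ([0.0, 0.5], 1)]

objective = max_margin_objective(weights, pose_activity, images, beta=2.0)
```

The first image clears the margin and needs no slack, while the second falls short by 0.5, so the objective is 1.0 from the regularizer plus 2.0 times 0.5 from the slack.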

Page 33: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

33

Learning Results

Cricket

defensive

shot

Cricket

bowling

Croquet

shot

Page 34: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

34

Learning Results

Tennis

serve

Volleyball

smash

Tennis

forehand

Page 35: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Background and Intuition

• Mutual Context of Object and Human Pose

Model Representation

Model Learning

Model Inference

• Experiments

• Conclusion

Outline

35

Page 36: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

I

36

Model Inference

The learned models

Page 37: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

I

37

Model Inference

The learned models

Head detection

Torso detection

Tennis racket detection

Layout of the object and body parts.

Compositional

Inference

[Chen et al, 2007]

$A_1, H_1, O_1^*, \{P_{1,n}^*\}_n$

Page 38: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

I

38

Model Inference

The learned models

$A_1, H_1, O_1^*, \{P_{1,n}^*\}_n$ ... $A_K, H_K, O_K^*, \{P_{K,n}^*\}_n$

Output
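
The inference slides run each of the K learned models on the image and collect one result per model. The sketch below mirrors only that outer loop. It replaces the compositional inference step [Chen et al, 2007] with an exhaustive search over a short list of candidate layouts, and it uses a toy scoring function; both are stand-ins, and every name and number is hypothetical.

```python
def best_layout(model, candidates, score_fn):
    """Return (score, layout) for the candidate that scores highest under `model`."""
    scored = [(score_fn(model, layout), layout) for layout in candidates]
    return max(scored, key=lambda pair: pair[0])


def infer(models, candidates, score_fn):
    """Run every learned model on the image and rank the K results."""
    results = []
    for model in models:
        score, layout = best_layout(model, candidates, score_fn)
        results.append({"activity": model["activity"], "pose": model["pose"],
                        "score": score, "layout": layout})
    best = max(results, key=lambda r: r["score"])
    return best, results


def offset_score(model, layout):
    """Toy score: negative squared error between the observed object-to-head
    offset and the offset the model prefers (a stand-in for the model score)."""
    dx = layout["object"][0] - layout["head"][0]
    dy = layout["object"][1] - layout["head"][1]
    px, py = model["preferred_offset"]
    return -((dx - px) ** 2 + (dy - py) ** 2)


models = [
    {"activity": "tennis_serve", "pose": 0, "preferred_offset": (0, -40)},
    {"activity": "croquet_shot", "pose": 1, "preferred_offset": (0, 80)},
]
candidates = [
    {"object": (50, 12), "head": (50, 50)},
    {"object": (60, 110), "head": (50, 50)},
]
best, results = infer(models, candidates, offset_score)
```

Each model keeps its own best layout, and the top-scoring model supplies the predicted activity together with the object and part layout.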

Page 39: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

• Background and Intuition

• Mutual Context of Object and Human Pose

Model Representation

Model Learning

Model Inference

• Experiments

• Conclusion

Outline

39

Page 40: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

40

Dataset and Experiment Setup

• Object detection;

• Pose estimation;

• Activity classification.

Tasks:

[Gupta et al, 2009]

Cricket

defensive shot

Cricket

bowling

Croquet

shot

Tennis

forehand

Tennis

serve

Volleyball

smash

Sport data set: 6 classes

180 training (supervised with object and part locations) & 120 testing images

Page 41: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

[Gupta et al, 2009]

Cricket

defensive shot

Cricket

bowling

Croquet

shot

Tennis

forehand

Tennis

serve

Volleyball

smash

Sport data set: 6 classes

41

Dataset and Experiment Setup

• Object detection;

• Pose estimation;

• Activity classification.

Tasks:

180 training (supervised with object and part locations) & 120 testing images

Page 42: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Object Detection Results

[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision, both from 0 to 1). Objects: Cricket bat, Cricket ball, Croquet mallet, Tennis racket, Volleyball. Annotation: Valid region.]

Methods compared: Our Method, Sliding window, Pedestrian context. [Andriluka et al, 2009] [Dalal & Triggs, 2006]

Page 43: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Object Detection Results

43

[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision, both from 0 to 1) for Volleyball and Cricket ball, comparing Our Method, Pedestrian as context, and Scanning window detector.]

[Figure: example detections from Sliding window, Pedestrian context, and Our method. Annotations: Small object; Background clutter.]
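
The detection results are reported as precision-recall curves. For readers unfamiliar with how such a curve is traced, the sketch below sweeps a score threshold over a ranked list of detections. It assumes each detection has already been marked as a true or false positive; the matching criterion is not given in the slides, and the scores below are made up.

```python
def precision_recall_curve(detections, n_ground_truth):
    """Trace (recall, precision) points from scored detections.

    detections     : list of (score, is_true_positive) pairs
    n_ground_truth : number of annotated objects in the test set
    Detections are visited from highest to lowest score, and one point
    is emitted after each detection is accepted.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = 0
    fp = 0
    curve = []
    for _, is_tp in ordered:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recall = tp / n_ground_truth
        precision = tp / (tp + fp)
        curve.append((recall, precision))
    return curve


# Hypothetical detector output: four detections, four annotated objects.
detections = [(0.9, True), (0.3, False), (0.8, False), (0.7, True)]
curve = precision_recall_curve(detections, n_ground_truth=4)
```

A detector that ranks true positives ahead of false positives keeps precision high as recall grows, which is what pushes a curve toward the top-right corner of the plot.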

Page 44: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

44

Dataset and Experiment Setup

• Object detection;

• Pose estimation;

• Activity classification.

Tasks:

[Gupta et al, 2009]

Cricket

defensive shot

Cricket

bowling

Croquet

shot

Tennis

forehand

Tennis

serve

Volleyball

smash

Sport data set: 6 classes

180 training & 120 testing images

Page 45: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

45

Human Pose Estimation Results

Method                 Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006          .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009  .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model         .66    .43 .39    .44 .34    .44 .40    .27 .29    .58

Page 46: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

46

Human Pose Estimation Results

Method                 Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006          .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009  .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model         .66    .43 .39    .44 .34    .44 .40    .27 .29    .58

Tennis serve model: Andriluka et al, 2009; Our estimation result

Volleyball smash model: Andriluka et al, 2009; Our estimation result

Page 47: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

47

Human Pose Estimation Results

Method                 Torso  Upper Leg  Lower Leg  Upper Arm  Lower Arm  Head
Ramanan, 2006          .52    .22 .22    .21 .28    .24 .28    .17 .14    .42
Andriluka et al, 2009  .50    .31 .30    .31 .27    .18 .19    .11 .11    .45
Our full model         .66    .43 .39    .44 .34    .44 .40    .27 .29    .58
One pose per class     .63    .40 .36    .41 .31    .38 .35    .21 .23    .52

Estimation

result

Estimation

result

Estimation

result

Estimation

result

Page 48: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

48

Dataset and Experiment Setup

• Object detection;

• Pose estimation;

• Activity classification.

Tasks:

[Gupta et al, 2009]

Cricket

defensive shot

Cricket

bowling

Croquet

shot

Tennis

forehand

Tennis

serve

Volleyball

smash

Sport data set: 6 classes

180 training & 120 testing images

Page 49: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Activity Classification Results

49

[Figure: bar chart of classification accuracy (y-axis from 0.5 to 0.9).]

Our model: 83.3%
Gupta et al, 2009: 78.9%
Bag-of-Words (SIFT+SVM): 52.5%

Annotations: No scene information; Scene is critical!!

Example images: Cricket shot, Tennis forehand.

Page 50: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

50

Conclusion

Human-Object Interaction

Next Steps

Vs.

• Pose estimation & Object detection on PPMI images.

• Modeling multiple objects and humans.

Grouplet representation

Mutual context model

Page 51: CVPR2010: modeling mutual context of object and human pose in human-object interaction activities

Acknowledgment

• Stanford Vision Lab reviewers:

– Barry Chai (1985-2010)

– Juan Carlos Niebles

– Hao Su

• Silvio Savarese, U. Michigan

• Anonymous reviewers

51