ICML2011: recognizing human-object interaction activities


1

Recognizing Human-Object Interaction Activities

Bangpeng Yao, Aditya Khosla and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

• Action Classification
• Action Retrieval

2

B. Yao and L. Fei-Fei. “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.” CVPR 2010.

B. Yao, A. Khosla, and L. Fei-Fei. “Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses.” ICML 2011.

Visual Recognition

3

Visual Recognition

Focus on Humans

6

Human images are everywhere:

Why are humans important?

7

Why are humans important?

Top 3 most popular synsets in ImageNet:

Deng et al, 2009

http://www.image-net.org/

8

Human Action Recognition

9

Robots interact with objects

Automatic sports commentary

Security – Drunk people detection

Human Action Recognition
Human-Object Interaction

B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. IEEE Computer Vision and Pattern Recognition (CVPR). 2010.

B. Yao, A. Khosla, and L. Fei-Fei. Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses. International Conference on Machine Learning (ICML). 2011.

Robots interact with objects

Automatic sports commentary

Security – Drunk people detection

10

11

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

12

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Difficult part appearance

Self-occlusion

Image region looks like a body part

Human pose estimation & Object detection

13

Human pose estimation is challenging.

Human pose estimation & Object detection

14

Human pose estimation is challenging.

• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Human pose estimation & Object detection

15

Facilitate

Given the object is detected.

• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Small, low-resolution, partially occluded

Image region similar to detection target

Human pose estimation & Object detection

16

Object detection is challenging

Human pose estimation & Object detection

17

Object detection is challenging

• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Human pose estimation & Object detection

18

Facilitate

Given the pose is estimated.

Human pose estimation & Object detection

19

Mutual Context

20

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

Mutual Context Model Representation

21

Croquet shot

Volleyball smash

Tennis forehand

Activity classes:

[Model graph: activity A and image evidence I]

[Yao et al, 2011]

Mutual Context Model Representation

22

Human pose as layout of body parts.

[Model graph: activity A, human pose H, body parts P1 … PL, image I]

[Yao et al, 2011]

Mutual Context Model Representation

23

Volleyball smashing

Cricket bowling

Tennis forehand

Human pose as layout of body parts.

Atomic poses – pose dictionary.


[Yao et al, 2011]

Mutual Context Model Representation

24

List of objects:

Humans can interact with any number of objects:

[Model graph: activity A, objects O1 … OM, human pose H, body parts P1 … PL, image I]

[Yao et al, 2011]

Mutual Context Model Representation

25


[Yao et al, 2011]

Mutual Context Model Representation

26

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

[Yao et al, 2011]

Mutual Context Model Representation

27

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Compatibility between actions, objects, and human poses:

$\psi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \alpha_{i,j,k}\, \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \mathbf{1}(A = a_k)$


[Yao et al, 2011]

Mutual Context Model Representation

28

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling actions:

$\psi_2(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k)\, \gamma_k^{T} s(I)$

where $s(I)$ is the $N_a$-dimensional output of an action classifier.


[Yao et al, 2011]


Mutual Context Model Representation

29

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling objects:

$\psi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O_m = o_j)\, \eta_j^{T} g(O_m) + \sum_{m, m'=1}^{M} \sum_{j, j'=1}^{N_o} \mathbf{1}(O_m = o_j)\, \mathbf{1}(O_{m'} = o_{j'})\, \eta_{j,j'}'^{\,T}\, b(O_m, O_{m'})$

where $g(O_m)$ contains the object detection scores and $b(O_m, O_{m'})$ encodes the spatial relationship between two object windows.

[Yao et al, 2011]


Mutual Context Model Representation

30

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling human poses:

$\psi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \left[ \delta_{i,l}^{T}\, p(\mathbf{x}^{l} \mid h_i) + \zeta_{i,l}^{T}\, f^{l}(I, \mathbf{x}^{l}) \right]$

where $f^{l}(I, \mathbf{x}^{l})$ is the detection score of the $l$-th body part and $p(\mathbf{x}^{l} \mid h_i)$ is the location of the $l$-th body part under the prior of atomic pose $h_i$.

[Yao et al, 2011]


Mutual Context Model Representation

31

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling pose–object interactions:

$\psi_5(O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \lambda_{i,j,l}^{T}\, b(\mathbf{x}^{l}, O_m)$

where $b(\mathbf{x}^{l}, O_m)$ encodes the spatial relationship between the $l$-th body part and the $m$-th object window.

[Yao et al, 2011]
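To make the factorization above concrete, here is a minimal sketch (not the authors' code) of how the five potentials could be combined into one score for a single candidate labeling. All sizes, weight arrays, and detector scores below are hypothetical placeholders standing in for learned parameters and image evidence.

```python
import numpy as np

# Hypothetical sizes: N_a actions, N_h atomic poses, N_o object classes,
# M candidate object windows, L body parts.
N_a, N_h, N_o, M, L = 6, 12, 10, 3, 6
rng = np.random.default_rng(0)

# Placeholder model weights and image evidence.
alpha = rng.normal(size=(N_h, N_o, N_a))              # psi_1: A-O-H co-occurrence weights
action_scores = rng.normal(size=N_a)                  # psi_2: s(I), action classifier output
object_scores = rng.normal(size=(M, N_o))             # psi_3: g(O_m), detector scores per window
part_scores = rng.normal(size=(N_h, L))               # psi_4: per-part terms for each atomic pose
pose_object_comp = rng.normal(size=(N_h, N_o, L, M))  # psi_5: part-window spatial compatibility

def crf_score(a, h, obj_labels):
    """Psi(A=a, O=obj_labels, H=h, I) for one labeling; obj_labels[m] is window m's class."""
    psi1 = sum(alpha[h, obj_labels[m], a] for m in range(M))
    psi2 = action_scores[a]
    psi3 = sum(object_scores[m, obj_labels[m]] for m in range(M))
    psi4 = part_scores[h].sum()
    psi5 = sum(pose_object_comp[h, obj_labels[m], l, m]
               for m in range(M) for l in range(L))
    return psi1 + psi2 + psi3 + psi4 + psi5

print(crf_score(a=0, h=3, obj_labels=[1, 4, 4]))
```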

32

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

Mutual Context Model Learning

33

• Obtaining atomic poses

Annotating

Clustering


[Yao et al, 2011]
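The two steps above, annotating body-part layouts and then clustering them into a pose dictionary, can be sketched as follows. The feature construction and the use of K-means are illustrative assumptions, not necessarily the authors' exact clustering procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annotated training poses: L body parts, each with (x, y, orientation),
# normalized relative to the torso so layouts are comparable across images.
rng = np.random.default_rng(0)
L, n_images = 6, 180
pose_features = rng.normal(size=(n_images, L * 3))

# Cluster the annotated layouts; each cluster acts as one "atomic pose".
n_atomic_poses = 12  # illustrative dictionary size
kmeans = KMeans(n_clusters=n_atomic_poses, n_init=10, random_state=0)
atomic_pose_ids = kmeans.fit_predict(pose_features)   # H label per training image
atomic_pose_dictionary = kmeans.cluster_centers_      # one prototype layout per atomic pose

print(atomic_pose_dictionary.shape)  # (12, 18)
```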

Mutual Context Model Learning

34

• Obtaining atomic poses
• Potentials

– Object & body part detection

One detector for each object or body part

Deformable part model [Felzenszwalb et al, 2008]


[Yao et al, 2011]

Mutual Context Model Learning

35

• Obtaining atomic poses
• Potentials

– Object & body part detection
– Action classification

Spatial pyramid model [Lazebnik et al, 2006]


[Yao et al, 2011]

Mutual Context Model Learning

36

• Obtaining atomic poses
• Potentials

– Object & body part detection
– Action classification
– Spatial relationships

Bin function [Desai et al, 2009]


[Yao et al, 2011]
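A minimal sketch of a spatial bin function in the spirit of Desai et al. (2009), assuming boxes are given as (x1, y1, x2, y2). The particular relation set and thresholds here are hypothetical choices, not the exact binning used by the authors.

```python
def spatial_bins(box_a, box_b):
    """Indicator vector describing where box_b lies relative to box_a.

    The relation set (above / below / overlapping / beside-near / far) is illustrative.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    a_width = box_a[2] - box_a[0]

    overlap = not (box_a[2] < box_b[0] or box_b[2] < box_a[0] or
                   box_a[3] < box_b[1] or box_b[3] < box_a[1])
    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    bins = [0] * 5
    if overlap:
        bins[2] = 1
    elif by < box_a[1]:
        bins[0] = 1   # b is above a
    elif by > box_a[3]:
        bins[1] = 1   # b is below a
    elif dist < 2 * a_width:
        bins[3] = 1   # beside, nearby
    else:
        bins[4] = 1   # far away
    return bins

print(spatial_bins((10, 10, 50, 90), (60, 5, 80, 25)))
```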

Mutual Context Model Learning

37

• Obtaining atomic poses
• Potentials

– Object & body part detection
– Action classification
– Spatial relationships

• Model parameter estimation

Standard Conditional random field: Belief Propagation

[Pearl, 1988]



[Yao et al, 2011]

Model Learning Result

38

Activity classes:

Atomic poses:

Objects:

Model Learning Result

39

Activity classes:

Atomic poses:

Objects:

Tennis Serving

Model Learning Result

40

Activity classes:

Atomic poses:

Objects:

Tennis Serving

Volleyball Smash

41

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

42

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts
– Updating the object detections
– Updating the action and atomic pose labels

[Figure: candidate action classes $a_1, a_2, \dots, a_{N_a}$]

43

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts

$p(H = h_1) = 0.51, \quad p(H = h_2) = 0.06, \quad p(H = h_3) = 0.04$

Mixture model

Re-estimate human pose

[Felzenszwalb et al, 2005; Sapp et al, 2010]

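A minimal sketch of the mixture-model idea for re-estimating the body-part layout: the current posterior over atomic poses (e.g. p(H = h1) = 0.51 above) weights pose-specific part scores. The arrays below are hypothetical stand-ins for the pictorial-structure models of [Felzenszwalb et al, 2005; Sapp et al, 2010].

```python
import numpy as np

rng = np.random.default_rng(0)
N_h, L, n_candidates = 12, 6, 100     # atomic poses, body parts, candidate locations per part

# Hypothetical current posterior over atomic poses and pose-specific part scores.
pose_posterior = rng.dirichlet(np.ones(N_h))            # e.g. 0.51, 0.06, 0.04, ...
part_scores = rng.normal(size=(N_h, L, n_candidates))   # score of candidate c for part l under pose h

# Mixture re-estimation: average the pose-specific scores with the posterior,
# then pick the best candidate location for each body part.
mixture_scores = np.einsum("h,hlc->lc", pose_posterior, part_scores)
best_location_per_part = mixture_scores.argmax(axis=1)
print(best_location_per_part)
```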

44

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts
– Updating the object detections


Start with no objects in the image; then separately evaluate how much each candidate detection window contributes to increasing $\Psi(A, O, H, I)$.


45

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts
– Updating the object detections
– Updating the action and atomic pose labels


Enumerate all possible $A$ and $H$ values to maximize $\Psi(A, O, H, I)$.

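The three update steps can be read as coordinate ascent on $\Psi(A, O, H, I)$. Below is a minimal, self-contained sketch of that loop; the body-part layout update is omitted, and all score arrays are hypothetical placeholders rather than learned potentials.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N_a, N_h, M = 6, 12, 5   # actions, atomic poses, candidate object windows

# Hypothetical scores standing in for the learned potentials.
action_pose_score = rng.normal(size=(N_a, N_h))   # psi_1-like compatibility
action_image_score = rng.normal(size=N_a)         # psi_2: action classifier
window_scores = rng.normal(size=M)                # psi_3: detector confidences
window_pose_score = rng.normal(size=(M, N_h))     # psi_5-like compatibility

def score(a, h, selected):
    s = action_pose_score[a, h] + action_image_score[a]
    for m in selected:
        s += window_scores[m] + window_pose_score[m, h]
    return s

# Start with no objects, then alternate between (1) greedily adding windows that
# increase the score and (2) re-enumerating all action / atomic-pose labels.
a, h, selected = 0, 0, []
for _ in range(5):
    for m in range(M):
        if m not in selected and score(a, h, selected + [m]) > score(a, h, selected):
            selected.append(m)
    a, h = max(product(range(N_a), range(N_h)),
               key=lambda ah: score(ah[0], ah[1], selected))

print(a, h, selected)
```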

46

[Gupta et al, 2009]

Cricket batting Cricket bowling Croquet shot

Tennis forehand Tennis serve Volleyball smash

Sports dataset: 6 classes; 180 training images (supervised with object and body-part locations) and 120 testing images

Action Classification Experiment

47

Action Classification Results

[Bar charts: classification accuracy (0.5–1.0) for cricket batting, cricket bowling, croquet shot, tennis forehand, tennis serving, volleyball smash, and overall, comparing Lazebnik et al. (2006), Yao & Fei-Fei (2010), and our method (Yao et al., 2011); overall accuracies of 83% and 87% are called out.]

48

[Gupta et al, 2009]

Cricket batting Cricket bowling Croquet shot

Tennis forehand Tennis serve Volleyball smash

Object Detection and Pose Estimation

Sports dataset: 6 classes; 180 training images (supervised with object and body-part locations) and 120 testing images

49

Object Detection Results

Method            Felzenszwalb et al. (2010)   Desai et al. (2009)   Yao et al. (2011)
cricket bat                 .17                       .18                  .20
cricket ball                .24                       .27                  .32
cricket stump               .77                       .78                  .77
croquet mallet              .29                       .32                  .34
croquet ball                .50                       .52                  .58
croquet hoop                .15                       .17                  .22
tennis racket               .33                       .31                  .37
tennis ball                 .42                       .46                  .49
volleyball                  .64                       .65                  .67
volleyball net              .04                       .06                  .09
overall                     .36                       .37                  .41


52

Human Pose Estimation Results

Method                   Yao & Fei-Fei (2010)   Andriluka et al. (2009)   Yao et al. (2011)
head                             .58                     .71                     .76
torso                            .66                     .69                     .77
left/right upper arms         .44 / .40               .44 / .40               .52 / .45
left/right lower arms         .27 / .29               .35 / .36               .39 / .37
left/right upper legs         .43 / .39               .58 / .63               .63 / .61
left/right lower legs         .44 / .34               .59 / .71               .60 / .77
overall                          .42                     .55                     .59


54

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

Action Recognition as Classification

Cricket batting

Tennis Forehand

Volleyball Smashing

Playing Bassoon

Playing Guitar

Playing Erhu

Running

Gupta et al (2009); Yao & Fei-Fei (2010)

PASCAL VOC (2010)

Reading

Ikizler-Cinbis et al, 2009; Desai et al, 2010; Yang et al, 2010; Delaitre et al, 2011; Maji et al, 2011

55

Is Classification the End?

stand run

Actions in a continuous space

56

Is Classification the End?

Same action, different meanings

57

Is Classification the End?

More than one action at the same time

Shopping

Calling

58

59

Retrieval Instead of Classification

Retrieval as Similarity Ranking


60

Ref.

Retrieval as Similarity Ranking

Decreasing similarity value

61

Retrieval as Similarity Ranking

Ref.

Decreasing similarity value

62

Ref.

Retrieval as Similarity Ranking

• Challenges:
– How to obtain the ground truth?
– How to perform automatic retrieval?
– How to evaluate a retrieval system?

Decreasing similarity value

63

Action Retrieval: Obtaining Ground Truth

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial: a reference image and a set of comparison images

64

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial:

Reference image

Comparison images

Reference image

Action Retrieval: Obtaining Ground Truth

65

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial:


Reference image

Comparison images

Action Retrieval: Obtaining Ground Truth

66

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial:


Reference image

Comparison images

Action Retrieval: Obtaining Ground Truth

67

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

[Histogram: percentage of trials (0–0.6) by degree of consistency of the eight human annotations: 8:0, 7:1, 6:2, 5:3, 4:4]

Action Retrieval: Obtaining Ground Truth

68

• From pairwise annotation to overall similarity:

For a reference image and comparison images $I_1, \dots, I_N$, each trial gives a pairwise annotation vector (e.g. $-1\; 0\; 1\; 0\; 0\; 0\; 1\; 0\; 0\; {-1}$). The pairwise human annotations are combined into a similarity vector $\mathbf{s} = (s_1, \dots, s_N)$ with $s_i = \mathrm{Sim}(\mathrm{Ref}, I_i)$, subject to $\mathbf{s} \geq 0$ and $\|\mathbf{s}\|_2 = 1$.
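As one illustration of turning pairwise judgments into a similarity vector, the sketch below counts pairwise "wins" per comparison image and then enforces the constraints s >= 0, ||s||_2 = 1 from above. This is a simplified stand-in, not necessarily the aggregation actually used in the paper.

```python
import numpy as np

# Hypothetical pairwise annotations: (i, j) means the annotator judged comparison
# image i to be MORE similar to the reference than comparison image j.
pairwise_votes = [(2, 0), (2, 5), (0, 5), (6, 1), (2, 6), (0, 1), (6, 5)]
n_images = 8

wins = np.zeros(n_images)
for winner, _loser in pairwise_votes:
    wins[winner] += 1

s = np.maximum(wins, 0.0)                       # s >= 0
s = s / np.linalg.norm(s) if s.any() else s     # ||s||_2 = 1
print(np.round(s, 3))
```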

69


70

[Example: comparison images ranked by the resulting similarity to the reference image: 0.260, 0.227, 0.145, 0.135, 0.112, 0.085, 0.075, 0.041, 0.012, 0.006, 0.002, 0.000, 0.000, 0.000, 0.000]

• From pairwise annotation to overall similarity:

Action Retrieval: Obtaining Ground Truth

Action Retrieval: Our Approach

[Comparison images ranked by similarity to the reference, measured in terms of the action class, the human pose, and the objects]

71


• Distance between two images $I$ and $I'$, combining three terms:

$D\big(p(A \mid I), p(A \mid I')\big), \quad D\big(p(H \mid I), p(H \mid I')\big), \quad D\big(p(O \mid I), p(O \mid I')\big)$

with two choices for the distance $D$:

Total variation: $D_{T}(p, q) = \sum_i |p_i - q_i|$

Chi-square statistic: $D_{\chi^2}(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}$
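A minimal sketch of the two distance measures, and of one equally weighted (illustrative) way to combine them over the action, pose, and object distributions; the example distributions are made up.

```python
import numpy as np

def total_variation(p, q):
    """D_T(p, q) = sum_i |p_i - q_i|."""
    return np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def chi_square(p, q, eps=1e-12):
    """D_chi2(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return ((p - q) ** 2 / (p + q + eps)).sum()

# Hypothetical per-image posteriors from the model: p(A|I), p(H|I), p(O|I).
p_A_1, p_A_2 = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
p_H_1, p_H_2 = [0.5, 0.5], [0.4, 0.6]
p_O_1, p_O_2 = [0.9, 0.1], [0.2, 0.8]

pairs = [(p_A_1, p_A_2), (p_H_1, p_H_2), (p_O_1, p_O_2)]
print(round(sum(chi_square(a, b) for a, b in pairs), 4))        # chi-square version
print(round(sum(total_variation(a, b) for a, b in pairs), 4))   # total-variation version
```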

72

Action Retrieval: Our Approach

73

Action Retrieval: Evaluation Metric

For a reference image $I_{ref}$:

• Ranking from an algorithm: $I_{re}^{1}, I_{re}^{2}, \dots, I_{re}^{n}$
• Ranking by ground-truth similarity: $I_{gt}^{1}, I_{gt}^{2}, \dots, I_{gt}^{n}$

$\mathrm{Accuracy}(n) = \frac{\sum_{i=1}^{n} s(I_{re}^{i}, I_{ref})}{\sum_{i=1}^{n} s(I_{gt}^{i}, I_{ref})}$

where $s$ is the ground-truth similarity. Accuracy is plotted against the number of neighborhoods $n$.
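A minimal sketch of this accuracy measure, assuming the ground-truth similarities and the algorithm's ranking are given as plain lists (the numbers below are illustrative).

```python
import numpy as np

def retrieval_accuracy(sim_to_ref, algorithm_ranking, n):
    """Sum of ground-truth similarity over the algorithm's top-n retrievals,
    divided by the best achievable sum (the top n by ground-truth similarity)."""
    sim = np.asarray(sim_to_ref, float)
    retrieved = sim[np.asarray(algorithm_ranking[:n])].sum()
    best = np.sort(sim)[::-1][:n].sum()
    return retrieved / best if best > 0 else 0.0

# Hypothetical ground-truth similarities (as on the earlier slide) and a ranking.
ground_truth_sim = [0.260, 0.227, 0.145, 0.135, 0.112, 0.085, 0.075, 0.041]
ranking = [1, 0, 4, 2, 6, 3, 5, 7]
print(round(retrieval_accuracy(ground_truth_sim, ranking, n=4), 3))
```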

74

Action Retrieval: Result

MC: Mutual Context

[Plots: average precision (0.5–0.8) vs. number of neighbors / number of retrieved images (10–40), with curves for MC overall, MC action only, MC object only, MC pose only, and the SPM baseline, each under the chi-square and total-variation distances.]

75

Action Retrieval: Result

MC: Mutual Context

SPM: spatial pyramid matching (Lazebnik et al, 2006)

• Use the confidence scores of SPM output to evaluate the action similarity.



79

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

80

Conclusion

Human action as human-object interaction:

• Action classification:

• Matching action similarity:


Croquet shot

Tennis forehand

Cricket bowling

81

Acknowledgment

• Stanford Vision Lab reviewers:
– Jia Deng
– Jia Li