ICML2011: recognizing human-object interaction activities


1

Recognizing Human-Object Interaction Activities

Bangpeng Yao, Aditya Khosla and Li Fei-Fei

Computer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

• Action Classification
• Action Retrieval

2

B. Yao and L. Fei-Fei. “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities.” CVPR 2010.

B. Yao, A. Khosla, and L. Fei-Fei. “Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses.” ICML 2011.

Visual Recognition

3

Visual Recognition

Focus on Humans

6

Human images are everywhere:

Why are humans important?

7

Why are humans important?

Top 3 most popular synsets in ImageNet:

Deng et al, 2009

http://www.image-net.org/

8

Human Action Recognition

9

Robots interact with objects

Automatic sports commentary

Security – Drunk people detection

Human Action Recognition
Human-Object Interaction

B. Yao and L. Fei-Fei. Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. IEEE Computer Vision and Pattern Recognition (CVPR). 2010.

B. Yao, A. Khosla, and L. Fei-Fei. Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses. International Conference on Machine Learning (ICML). 2011.

Robots interact with objects

Automatic sports commentary

Security – Drunk people detection

10

11

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

12

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Difficult part appearance

Self-occlusion

Image region looks like a body part

Human pose estimation & Object detection

13

Human pose estimation is challenging.

Human pose estimation & Object detection

14

Human pose estimation is challenging.

• Felzenszwalb & Huttenlocher, 2005
• Ren et al, 2005
• Ramanan, 2006
• Ferrari et al, 2008
• Yang & Mori, 2008
• Andriluka et al, 2009
• Eichner & Ferrari, 2009

Human pose estimation & Object detection

15

Facilitate

Given the object is detected.

• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Small, low-resolution, partially occluded

Image region similar to detection target

Human pose estimation & Object detection

16

Object detection is challenging

Human pose estimation & Object detection

17

Object detection is challenging

• Viola & Jones, 2001
• Lampert et al, 2008
• Divvala et al, 2009
• Vedaldi et al, 2009

Human pose estimation & Object detection

18

Facilitate

Given the pose is estimated.

Human pose estimation & Object detection

19

Mutual Context

20

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

Mutual Context Model Representation

21

Croquet shot

Volleyball smash

Tennis forehand

Activity classes:

[Model graph: activity A and image evidence I]

[Yao et al, 2011]

Mutual Context Model Representation

22

Human pose as layout of body parts.

[Model graph: activity A, human pose H, body parts P1 … PL, image I]

[Yao et al, 2011]

Mutual Context Model Representation

23

Volleyball smashing

Cricket bowling

Tennis forehand

Human pose as layout of body parts.

Atomic poses – pose dictionary.


[Yao et al, 2011]

Mutual Context Model Representation

24

List of objects:

Humans can interact with any number of objects:

[Model graph: activity A, objects O1 … OM, human pose H, body parts P1 … PL, image I]

[Yao et al, 2011]

Mutual Context Model Representation

25


[Yao et al, 2011]

Mutual Context Model Representation

26

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

[Yao et al, 2011]

Mutual Context Model Representation

27

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Compatibility between actions, objects, and human poses:

$\psi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a} \alpha_{i,j,k}\, \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \mathbf{1}(A = a_k)$


[Yao et al, 2011]

Mutual Context Model Representation

28

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling actions:

$\psi_2(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k)\, \gamma_k^{T} s(I)$

where $s(I)$ is the $N_a$-dimensional output of an action classifier.


[Yao et al, 2011]


Mutual Context Model Representation

29

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling objects:

$\psi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O_m = o_j)\, \eta_j^{T} g(O_m) + \sum_{m, m'=1}^{M} \sum_{j, j'=1}^{N_o} \mathbf{1}(O_m = o_j)\, \mathbf{1}(O_{m'} = o_{j'})\, \eta_{j,j'}'^{\,T}\, b(O_m, O_{m'})$

where $g(O_m)$ contains the object detection scores and $b(O_m, O_{m'})$ encodes the spatial relationship between two object windows.

[Yao et al, 2011]


Mutual Context Model Representation

30

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling human poses:

$\psi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i) \left[ \delta_{i,l}^{T}\, p(\mathbf{x}^{l} \mid h_i) + \zeta_{i,l}^{T}\, f^{l}(I, \mathbf{x}^{l}) \right]$

where $f^{l}(I, \mathbf{x}^{l})$ is the detection score of the $l$-th body part and $p(\mathbf{x}^{l} \mid h_i)$ is the location of the $l$-th body part under the prior of atomic pose $h_i$.

[Yao et al, 2011]


Mutual Context Model Representation

31

Conditional Random Field:

$\Psi(A, O, H, I) = \psi_1(A, O, H) + \psi_2(A, I) + \psi_3(O, I) + \psi_4(H, I) + \psi_5(O, H)$

Modeling pose–object interactions:

$\psi_5(O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{l=1}^{L} \mathbf{1}(H = h_i)\, \mathbf{1}(O_m = o_j)\, \lambda_{i,j,l}^{T}\, b(\mathbf{x}^{l}, O_m)$

where $b(\mathbf{x}^{l}, O_m)$ encodes the spatial relationship between the $l$-th body part and the $m$-th object window.

[Yao et al, 2011]
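To make the factorization above concrete, here is a minimal sketch (not the authors' code) of how the five potentials could be combined into one score for a single candidate labeling. All sizes, weight arrays, and detector scores below are hypothetical placeholders standing in for learned parameters and image evidence.

```python
import numpy as np

# Hypothetical sizes: N_a actions, N_h atomic poses, N_o object classes,
# M candidate object windows, L body parts.
N_a, N_h, N_o, M, L = 6, 12, 10, 3, 6
rng = np.random.default_rng(0)

# Placeholder model weights and image evidence.
alpha = rng.normal(size=(N_h, N_o, N_a))              # psi_1: A-O-H co-occurrence weights
action_scores = rng.normal(size=N_a)                  # psi_2: s(I), action classifier output
object_scores = rng.normal(size=(M, N_o))             # psi_3: g(O_m), detector scores per window
part_scores = rng.normal(size=(N_h, L))               # psi_4: per-part terms for each atomic pose
pose_object_comp = rng.normal(size=(N_h, N_o, L, M))  # psi_5: part-window spatial compatibility

def crf_score(a, h, obj_labels):
    """Psi(A=a, O=obj_labels, H=h, I) for one labeling; obj_labels[m] is window m's class."""
    psi1 = sum(alpha[h, obj_labels[m], a] for m in range(M))
    psi2 = action_scores[a]
    psi3 = sum(object_scores[m, obj_labels[m]] for m in range(M))
    psi4 = part_scores[h].sum()
    psi5 = sum(pose_object_comp[h, obj_labels[m], l, m]
               for m in range(M) for l in range(L))
    return psi1 + psi2 + psi3 + psi4 + psi5

print(crf_score(a=0, h=3, obj_labels=[1, 4, 4]))
```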

32

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

Mutual Context Model Learning

33

• Obtaining atomic poses

Annotating

Clustering


[Yao et al, 2011]
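The two steps above, annotating body-part layouts and then clustering them into a pose dictionary, can be sketched as follows. The feature construction and the use of K-means are illustrative assumptions, not necessarily the authors' exact clustering procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical annotated training poses: L body parts, each with (x, y, orientation),
# normalized relative to the torso so layouts are comparable across images.
rng = np.random.default_rng(0)
L, n_images = 6, 180
pose_features = rng.normal(size=(n_images, L * 3))

# Cluster the annotated layouts; each cluster acts as one "atomic pose".
n_atomic_poses = 12  # illustrative dictionary size
kmeans = KMeans(n_clusters=n_atomic_poses, n_init=10, random_state=0)
atomic_pose_ids = kmeans.fit_predict(pose_features)   # H label per training image
atomic_pose_dictionary = kmeans.cluster_centers_      # one prototype layout per atomic pose

print(atomic_pose_dictionary.shape)  # (12, 18)
```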

Mutual Context Model Learning

34

• Obtaining atomic poses
• Potentials

– Object & body part detection

One detector for each object or body part

Deformable part model [Felzenszwalb et al, 2008]


[Yao et al, 2011]

Mutual Context Model Learning

35

• Obtaining atomic poses
• Potentials

– Object & body part detection
– Action classification

Spatial pyramid model [Lazebnik et al, 2006]


[Yao et al, 2011]

Mutual Context Model Learning

36

• Obtaining atomic poses
• Potentials

– Object & body part detection
– Action classification
– Spatial relationships

Bin function [Desai et al, 2009]


[Yao et al, 2011]
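A minimal sketch of a spatial bin function in the spirit of Desai et al. (2009), assuming boxes are given as (x1, y1, x2, y2). The particular relation set and thresholds here are hypothetical choices, not the exact binning used by the authors.

```python
def spatial_bins(box_a, box_b):
    """Indicator vector describing where box_b lies relative to box_a.

    The relation set (above / below / overlapping / beside-near / far) is illustrative.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2.0, (box_a[1] + box_a[3]) / 2.0
    bx, by = (box_b[0] + box_b[2]) / 2.0, (box_b[1] + box_b[3]) / 2.0
    a_width = box_a[2] - box_a[0]

    overlap = not (box_a[2] < box_b[0] or box_b[2] < box_a[0] or
                   box_a[3] < box_b[1] or box_b[3] < box_a[1])
    dist = ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

    bins = [0] * 5
    if overlap:
        bins[2] = 1
    elif by < box_a[1]:
        bins[0] = 1   # b is above a
    elif by > box_a[3]:
        bins[1] = 1   # b is below a
    elif dist < 2 * a_width:
        bins[3] = 1   # beside, nearby
    else:
        bins[4] = 1   # far away
    return bins

print(spatial_bins((10, 10, 50, 90), (60, 5, 80, 25)))
```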

Mutual Context Model Learning

37

• Obtaining atomic poses
• Potentials

– Object & body part detection
– Action classification
– Spatial relationships

• Model parameter estimation

Standard Conditional random field: Belief Propagation

[Pearl, 1988]



[Yao et al, 2011]

Model Learning Result

38

Activity classes:

Atomic poses:

Objects:

Model Learning Result

39

Activity classes:

Atomic poses:

Objects:

Tennis Serving

Model Learning Result

40

Activity classes:

Atomic poses:

Objects:

Tennis Serving

Volleyball Smash

41

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

42

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts
– Updating the object detections
– Updating the action and atomic pose labels

[Figure: candidate action classes $a_1, a_2, \dots, a_{N_a}$]

43

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts

$p(H = h_1) = 0.51, \quad p(H = h_2) = 0.06, \quad p(H = h_3) = 0.04$

Mixture model

Re-estimate human pose

[Felzenszwalb et al, 2005; Sapp et al, 2010]

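A minimal sketch of the mixture-model idea for re-estimating the body-part layout: the current posterior over atomic poses (e.g. p(H = h1) = 0.51 above) weights pose-specific part scores. The arrays below are hypothetical stand-ins for the pictorial-structure models of [Felzenszwalb et al, 2005; Sapp et al, 2010].

```python
import numpy as np

rng = np.random.default_rng(0)
N_h, L, n_candidates = 12, 6, 100     # atomic poses, body parts, candidate locations per part

# Hypothetical current posterior over atomic poses and pose-specific part scores.
pose_posterior = rng.dirichlet(np.ones(N_h))            # e.g. 0.51, 0.06, 0.04, ...
part_scores = rng.normal(size=(N_h, L, n_candidates))   # score of candidate c for part l under pose h

# Mixture re-estimation: average the pose-specific scores with the posterior,
# then pick the best candidate location for each body part.
mixture_scores = np.einsum("h,hlc->lc", pose_posterior, part_scores)
best_location_per_part = mixture_scores.argmax(axis=1)
print(best_location_per_part)
```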

44

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts
– Updating the object detections


Start with no objects in the image; then separately evaluate how much each candidate detection window contributes to increasing $\Psi(A, O, H, I)$.


45

Model Inference for Pose Estimation, Object Detection, and Action Classification

• Initialization
• Iteratively optimize $\Psi(A, O, H, I)$:
– Updating the layout of human body parts
– Updating the object detections
– Updating the action and atomic pose labels


Enumerate all possible $A$ and $H$ values to maximize $\Psi(A, O, H, I)$.

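The three update steps can be read as coordinate ascent on $\Psi(A, O, H, I)$. Below is a minimal, self-contained sketch of that loop; the body-part layout update is omitted, and all score arrays are hypothetical placeholders rather than learned potentials.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N_a, N_h, M = 6, 12, 5   # actions, atomic poses, candidate object windows

# Hypothetical scores standing in for the learned potentials.
action_pose_score = rng.normal(size=(N_a, N_h))   # psi_1-like compatibility
action_image_score = rng.normal(size=N_a)         # psi_2: action classifier
window_scores = rng.normal(size=M)                # psi_3: detector confidences
window_pose_score = rng.normal(size=(M, N_h))     # psi_5-like compatibility

def score(a, h, selected):
    s = action_pose_score[a, h] + action_image_score[a]
    for m in selected:
        s += window_scores[m] + window_pose_score[m, h]
    return s

# Start with no objects, then alternate between (1) greedily adding windows that
# increase the score and (2) re-enumerating all action / atomic-pose labels.
a, h, selected = 0, 0, []
for _ in range(5):
    for m in range(M):
        if m not in selected and score(a, h, selected + [m]) > score(a, h, selected):
            selected.append(m)
    a, h = max(product(range(N_a), range(N_h)),
               key=lambda ah: score(ah[0], ah[1], selected))

print(a, h, selected)
```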

46

[Gupta et al, 2009]

Cricket batting Cricket bowling Croquet shot

Tennis forehand Tennis serve Volleyball smash

Sports dataset: 6 classes; 180 training images (supervised with object and body-part locations) and 120 testing images

Action Classification Experiment

47

Action Classification Results

[Bar charts: classification accuracy (0.5–1.0) for cricket batting, cricket bowling, croquet shot, tennis forehand, tennis serving, volleyball smash, and overall, comparing Lazebnik et al. (2006), Yao & Fei-Fei (2010), and our method (Yao et al., 2011); overall accuracies of 83% and 87% are called out.]

48

[Gupta et al, 2009]

Cricket batting Cricket bowling Croquet shot

Tennis forehand Tennis serve Volleyball smash

Object Detection and Pose Estimation

Sports dataset: 6 classes; 180 training images (supervised with object and body-part locations) and 120 testing images

49

Object Detection Results

Method            Felzenszwalb et al. (2010)   Desai et al. (2009)   Yao et al. (2011)
cricket bat                 .17                       .18                  .20
cricket ball                .24                       .27                  .32
cricket stump               .77                       .78                  .77
croquet mallet              .29                       .32                  .34
croquet ball                .50                       .52                  .58
croquet hoop                .15                       .17                  .22
tennis racket               .33                       .31                  .37
tennis ball                 .42                       .46                  .49
volleyball                  .64                       .65                  .67
volleyball net              .04                       .06                  .09
overall                     .36                       .37                  .41


52

Human Pose Estimation Results

Method                   Yao & Fei-Fei (2010)   Andriluka et al. (2009)   Yao et al. (2011)
head                             .58                     .71                     .76
torso                            .66                     .69                     .77
left/right upper arms         .44 / .40               .44 / .40               .52 / .45
left/right lower arms         .27 / .29               .35 / .36               .39 / .37
left/right upper legs         .43 / .39               .58 / .63               .63 / .61
left/right lower legs         .44 / .34               .59 / .71               .60 / .77
overall                          .42                     .55                     .59


54

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

Action Recognition as Classification

Cricket batting

Tennis Forehand

Volleyball Smashing

Playing Bassoon

Playing Guitar

Playing Erhu

Running

Gupta et al (2009); Yao & Fei-Fei (2010)

PASCAL VOC (2010)

Reading

Ikizler-Cinbis et al, 2009; Desai et al, 2010; Yang et al, 2010; Delaitre et al, 2011; Maji et al, 2011

55

Is Classification the End?

stand run

Actions in a continuous space

56

Is Classification the End?

Same action, different meanings

57

Is Classification the End?

More than one action at the same time

Shopping

Calling

58

59

Retrieval Instead of Classification

Retrieval as Similarity Ranking


60

Ref.

Retrieval as Similarity Ranking

Decreasing similarity value

61

Retrieval as Similarity Ranking

Ref.

Decreasing similarity value

62

Ref.

Retrieval as Similarity Ranking

• Challenges:
– How to obtain the ground truth?
– How to perform automatic retrieval?
– How to evaluate a retrieval system?

Decreasing similarity value

63

Action Retrieval: Obtaining Ground Truth

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial: a reference image and a set of comparison images

64

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial:

Reference image

Comparison images

Reference image

Action Retrieval: Obtaining Ground Truth

65

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial:


Reference image

Comparison images

Action Retrieval: Obtaining Ground Truth

66

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

One trial:


Reference image

Comparison images

Action Retrieval: Obtaining Ground Truth

67

• Human annotation experiment:
– Eight human subjects, the same set of 252 trials.

[Histogram: percentage of trials (0–0.6) by degree of consistency of the eight human annotations: 8:0, 7:1, 6:2, 5:3, 4:4]

Action Retrieval: Obtaining Ground Truth

68

• From pairwise annotation to overall similarity:

For a reference image and comparison images $I_1, \dots, I_N$, each trial gives a pairwise annotation vector (e.g. $-1\; 0\; 1\; 0\; 0\; 0\; 1\; 0\; 0\; {-1}$). The pairwise human annotations are combined into a similarity vector $\mathbf{s} = (s_1, \dots, s_N)$ with $s_i = \mathrm{Sim}(\mathrm{Ref}, I_i)$, subject to $\mathbf{s} \geq 0$ and $\|\mathbf{s}\|_2 = 1$.
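As one illustration of turning pairwise judgments into a similarity vector, the sketch below counts pairwise "wins" per comparison image and then enforces the constraints s >= 0, ||s||_2 = 1 from above. This is a simplified stand-in, not necessarily the aggregation actually used in the paper.

```python
import numpy as np

# Hypothetical pairwise annotations: (i, j) means the annotator judged comparison
# image i to be MORE similar to the reference than comparison image j.
pairwise_votes = [(2, 0), (2, 5), (0, 5), (6, 1), (2, 6), (0, 1), (6, 5)]
n_images = 8

wins = np.zeros(n_images)
for winner, _loser in pairwise_votes:
    wins[winner] += 1

s = np.maximum(wins, 0.0)                       # s >= 0
s = s / np.linalg.norm(s) if s.any() else s     # ||s||_2 = 1
print(np.round(s, 3))
```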

69


70

[Example: comparison images ranked by the resulting similarity to the reference image: 0.260, 0.227, 0.145, 0.135, 0.112, 0.085, 0.075, 0.041, 0.012, 0.006, 0.002, 0.000, 0.000, 0.000, 0.000]

• From pairwise annotation to overall similarity:

Action Retrieval: Obtaining Ground Truth

Action Retrieval: Our Approach

[Comparison images ranked by similarity to the reference, measured in terms of the action class, the human pose, and the objects]

71


• Distance between two images $I$ and $I'$, combining three terms:

$D\big(p(A \mid I), p(A \mid I')\big), \quad D\big(p(H \mid I), p(H \mid I')\big), \quad D\big(p(O \mid I), p(O \mid I')\big)$

with two choices for the distance $D$:

Total variation: $D_{T}(p, q) = \sum_i |p_i - q_i|$

Chi-square statistic: $D_{\chi^2}(p, q) = \sum_i \frac{(p_i - q_i)^2}{p_i + q_i}$
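A minimal sketch of the two distance measures, and of one equally weighted (illustrative) way to combine them over the action, pose, and object distributions; the example distributions are made up.

```python
import numpy as np

def total_variation(p, q):
    """D_T(p, q) = sum_i |p_i - q_i|."""
    return np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def chi_square(p, q, eps=1e-12):
    """D_chi2(p, q) = sum_i (p_i - q_i)^2 / (p_i + q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return ((p - q) ** 2 / (p + q + eps)).sum()

# Hypothetical per-image posteriors from the model: p(A|I), p(H|I), p(O|I).
p_A_1, p_A_2 = [0.7, 0.2, 0.1], [0.6, 0.3, 0.1]
p_H_1, p_H_2 = [0.5, 0.5], [0.4, 0.6]
p_O_1, p_O_2 = [0.9, 0.1], [0.2, 0.8]

pairs = [(p_A_1, p_A_2), (p_H_1, p_H_2), (p_O_1, p_O_2)]
print(round(sum(chi_square(a, b) for a, b in pairs), 4))        # chi-square version
print(round(sum(total_variation(a, b) for a, b in pairs), 4))   # total-variation version
```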

72

Action Retrieval: Our Approach

73

Action Retrieval: Evaluation Metric

For a reference image $I_{ref}$:

• Ranking from an algorithm: $I_{re}^{1}, I_{re}^{2}, \dots, I_{re}^{n}$
• Ranking by ground-truth similarity: $I_{gt}^{1}, I_{gt}^{2}, \dots, I_{gt}^{n}$

$\mathrm{Accuracy}(n) = \frac{\sum_{i=1}^{n} s(I_{re}^{i}, I_{ref})}{\sum_{i=1}^{n} s(I_{gt}^{i}, I_{ref})}$

where $s$ is the ground-truth similarity. Accuracy is plotted against the number of neighborhoods $n$.
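A minimal sketch of this accuracy measure, assuming the ground-truth similarities and the algorithm's ranking are given as plain lists (the numbers below are illustrative).

```python
import numpy as np

def retrieval_accuracy(sim_to_ref, algorithm_ranking, n):
    """Sum of ground-truth similarity over the algorithm's top-n retrievals,
    divided by the best achievable sum (the top n by ground-truth similarity)."""
    sim = np.asarray(sim_to_ref, float)
    retrieved = sim[np.asarray(algorithm_ranking[:n])].sum()
    best = np.sort(sim)[::-1][:n].sum()
    return retrieved / best if best > 0 else 0.0

# Hypothetical ground-truth similarities (as on the earlier slide) and a ranking.
ground_truth_sim = [0.260, 0.227, 0.145, 0.135, 0.112, 0.085, 0.075, 0.041]
ranking = [1, 0, 4, 2, 6, 3, 5, 7]
print(round(retrieval_accuracy(ground_truth_sim, ranking, n=4), 3))
```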

74

Action Retrieval: Result

MC: Mutual Context

[Plots: average precision (0.5–0.8) vs. number of neighbors / number of retrieved images (10–40), with curves for MC overall, MC action only, MC object only, MC pose only, and the SPM baseline, each under the chi-square and total-variation distances.]

75

Action Retrieval: Result

MC: Mutual Context

SPM: spatial pyramid matching (Lazebnik et al, 2006)

• Use the confidence scores of SPM output to evaluate the action similarity.



79

Outline

• Mutual context model for Action Recognition
– Motivation
– Model representation
– Model learning
• Recognition I: Action Classification, Object Detection, and Pose Estimation
• Recognition II: Action Retrieval by Matching Action Similarity
• Conclusion

80

Conclusion

Human action as human-object interaction:

• Action classification:

• Matching action similarity:


Croquet shot

Tennis forehand

Cricket bowling

81

Acknowledgment

• Stanford Vision Lab reviewers:
– Jia Deng
– Jia Li