
Pose from Action: Unsupervised Learning of Pose Features based on Motion

Senthil Purushwalkam, Abhinav Gupta

Robotics Institute, Carnegie Mellon University

Motivation

Human actions are an ordered sequence of poses. Can we leverage human action videos to learn a pose-encoding representation?

Overview

A representation in which poses cluster together should be useful for Pose Estimation and Action Recognition. Given two poses, it should be possible to predict the motion between them.

[Architecture figure: an Appearance ConvNet (VGG-M-2048) maps Frame 1 and Frame 13 (each 224×224×3) to pose representations A and A'; a Motion ConvNet (VGG-M-2048) maps the optical flow maps computed over the 13-frame window, stacked as 224×224×24, to a transformation representation T.]

Surrogate Task

Given: two appearance representations A and A' (pose), and one motion representation T (transformation of pose).

Predict: whether the transformation T could cause the change A → A'.
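A minimal PyTorch sketch of this verification setup, under stated assumptions: the poster does not specify how the three representations are fused or how negatives are formed, so the concatenation-plus-MLP head, the small `backbone()` stand-in for VGG-M-2048, and training on matched vs. mismatched (A, A', T) triplets are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn

def backbone(in_channels, out_dim=2048):
    # Hypothetical small ConvNet standing in for VGG-M-2048 (illustrative only).
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(256, 512, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(512, out_dim), nn.ReLU(),
    )

class PoseFromAction(nn.Module):
    def __init__(self, feat_dim=2048):
        super().__init__()
        # Appearance ConvNet, shared between Frame 1 and Frame 13 (224x224x3 each).
        self.appearance = backbone(in_channels=3, out_dim=feat_dim)
        # Motion ConvNet: 12 two-channel optical flow maps stacked as 24 channels.
        self.motion = backbone(in_channels=24, out_dim=feat_dim)
        # Assumed fusion: concatenate (A, A', T) and classify match vs. mismatch.
        self.classifier = nn.Sequential(
            nn.Linear(3 * feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 2),
        )

    def forward(self, frame1, frame13, flow_stack):
        a  = self.appearance(frame1)    # pose representation A
        a2 = self.appearance(frame13)   # pose representation A'
        t  = self.motion(flow_stack)    # transformation representation T
        return self.classifier(torch.cat([a, a2, t], dim=1))
```

After training on the surrogate task, the Appearance ConvNet is the part that is kept as the pose feature extractor and fine-tuned for pose estimation and action recognition.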

Data and Implementation

We use videos from the UCF101, HMDB51 and ACT video datasets.

Sample two frames separated by ∆n (= 12) frames and extract optical flow for the ∆n intervening frames.

Use the VGG-M-2048 architecture for all CNNs.
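As a sketch of this sampling step, assuming optical flow has been precomputed per frame pair with x and y channels; the helper name and array layout are illustrative:

```python
import numpy as np

DELTA_N = 12  # gap between the two sampled frames, as on the poster

def sample_training_pair(video_frames, flow_maps, rng=np.random):
    """Sample (frame_t, frame_t+12, stacked flow) from one video.

    video_frames: array of shape (num_frames, 224, 224, 3)
    flow_maps:    array of shape (num_frames - 1, 224, 224, 2), where
                  flow_maps[i] is the (x, y) optical flow from frame i to i+1.
    """
    t = rng.randint(0, len(video_frames) - DELTA_N)
    frame_a = video_frames[t]
    frame_b = video_frames[t + DELTA_N]
    # Stack the 12 intervening two-channel flow maps channel-wise: (224, 224, 24).
    flow_stack = np.concatenate(flow_maps[t : t + DELTA_N], axis=-1)
    return frame_a, frame_b, flow_stack
```

Frames t and t + 12 span a 13-frame window, matching the "Frame 1 … Frame 13" labels in the architecture figure, and the 12 two-channel flow maps stack into the 224×224×24 Motion ConvNet input.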

Experiments

Nearest neighbor in the FC6 feature space of the Appearance ConvNet: [retrieval figure]
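A sketch of that retrieval, assuming FC6 features have already been extracted; cosine similarity is an assumed metric, since the poster does not state which distance was used.

```python
import numpy as np

def nearest_neighbors(query_feat, gallery_feats, k=5):
    """Rank gallery images by cosine similarity of FC6 features.

    query_feat:    (d,) FC6 feature of the query image
    gallery_feats: (n, d) FC6 features of the retrieval set
    Returns indices of the k most similar gallery images.
    """
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to each gallery image
    return np.argsort(-sims)[:k]      # top-k most similar
```

Queries whose neighbors share the same pose, rather than merely the same scene, indicate that the learned space clusters poses together, as the Overview motivates.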

Results

Action Recognition:

[Bar chart: classification accuracy (0%–80%) on UCF101 and HMDB51 for Random (full network), Wang et al. (full network), Ours (last 2 layers), Ours (full network), and ImageNet (full network) initialisations.]

Pose Estimation:

[Bar chart: strict PCP evaluation (0%–80%) for the body parts Upper Arms and Lower Arms, comparing Random, UCF101 Pretraining, Wang et al., Ours, and ImageNet initialisations.]

Static Image Action Recognition:

[Bar chart: classification accuracies (0%–25%) for Random Initialisation, SSFA, and Ours.]

Conv1 Filter Visualisation: [figure]

Paper available on arXiv at: https://arxiv.org/abs/1609.05420