Articulated human pose estimation by deep learning

Wei Yang
Supervisor: Xiaogang Wang, Wanli Ouyang

05/01/2023 2

Outline

• Introduction
• Regression by Convolutional Neural Network
• Deformable Convolutional Neural Networks
• Discussion and Future Work

Introduction

Articulated body pose estimation “recovers the pose of an articulated body, which consists of joints and rigid parts using image-based observations.”

http://en.wikipedia.org/wiki/Articulated_body_pose_estimation

Applications

Action recognition

Human tracking

Clothing Parsing

Gaming

Challenges

Classic Approaches

Pictorial Structure (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)
• Unary templates
• Pairwise springs

Mixtures of “mini-parts” (Yang & Ramanan 2011)
• A mixture of templates for each part
• Unary template for each part, given its mixture type
• Pairwise springs between each pair of connected parts, given their mixture types

Deep Learning Methods

Multi-source Deep Learning (Ouyang et al. 2014)
• Candidate estimations
• Deep model uses multiple sources, including appearance score, mixture type, and deformation

DeepPose (Toshev & Szegedy 2014)
• Reasons about pose in a holistic fashion
• Refines the joint predictions by using higher-resolution sub-images

We propose to study pose estimation in two ways

• Holistic view
– Regression of joint locations by convolutional neural networks (CNNs)

• Local information
– Deformable Convolutional Neural Networks

Regression by Convolutional Neural Network

Formulation

• Image: $I$
• Part locations: $\mathbf{p}$

$\psi(I; \theta) = \mathbf{p}$

Location of part $i$: $p_i$

$\psi(\cdot\,; \theta)$: learned by a deep CNN

Basic Architecture of the CNN Regressor

• AlexNet
– Krizhevsky, Sutskever, and Hinton, NIPS 2012
– The first time a deep model was shown to be effective on a large-scale computer vision task

Normalize Scale of Human Body

• The size of the CNN input is fixed
• Simple warping changes the aspect ratio of people
• People appear at different scales in an image

1. Original image
2. Human detection [Ouyang et al. CVPR 2014]
3. Crop by bounding box
4. Padding with mean RGB value
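Steps 3 and 4 can be sketched as below. This is a minimal illustration, not the authors' code: the function name, the `(x0, y0, x1, y1)` box convention, and the centered placement of the crop are all assumptions. The final resize to the fixed CNN input size is left to an image library.

```python
import numpy as np

def crop_and_pad_square(image, bbox, mean_rgb):
    """Crop `image` to `bbox`, then pad the shorter side with `mean_rgb`.

    image:    (H, W, 3) array.
    bbox:     (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive (assumed convention).
    mean_rgb: length-3 sequence, e.g. the mean colour of the training set.
    Returns the square patch and the (dx, dy) offset of the crop inside it.
    """
    x0, y0, x1, y1 = bbox
    h_img, w_img = image.shape[:2]
    # Clip the detection box to the image borders.
    x0, y0 = max(0, x0), max(0, y0)
    x1, y1 = min(w_img, x1), min(h_img, y1)
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]

    # A square canvas filled with the mean colour keeps the person's aspect ratio.
    side = max(h, w)
    patch = np.empty((side, side, 3), dtype=image.dtype)
    patch[:] = np.asarray(mean_rgb, dtype=image.dtype)

    # Center the crop on the canvas (an assumed choice).
    dy, dx = (side - h) // 2, (side - w) // 2
    patch[dy:dy + h, dx:dx + w] = crop
    return patch, (dx, dy)
```

Ground-truth joint locations would need the same shift: subtract the box origin and add `(dx, dy)` before scaling to the network input size.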

Architecture 1

• Loss function:
• Evaluation metric: PCP

Method          Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Yang & Ramanan  84.1   77.1  69.5   65.6   52.5   35.9   60.8
Conv5           58.8   24.1  49.6   36.6   25.8    2.8   31.3

Architecture 2


Method          Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Yang & Ramanan  84.1   77.1  69.5   65.6   52.5   35.9   60.8
Conv5           58.8   24.1  49.6   36.6   25.8    2.8   31.3
Fc8 (AlexNet)   81.1   63.7  72.8   66.6   50.6   21.9   56.9


Architecture 3


Method          Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Yang & Ramanan  84.1   77.1  69.5   65.6   52.5   35.9   60.8
Conv5           58.8   24.1  49.6   36.6   25.8    2.8   31.3
Fc8 (AlexNet)   81.1   63.7  72.8   66.6   50.6   21.9   56.9
Fc10            84.1   68.8  76.8   69.4   54.9   26.8   60.9


PCP and PDJ on LSP

#  Method            Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP

Ours
1  Conv5             58.8   24.1  49.6   36.6   25.8    2.8   31.3
2  Fc8 (AlexNet)     81.1   63.7  72.8   66.6   50.6   21.9   56.9
3  Fc8 (LSP-extend)  83.1   67.2  75.0   68.7   53.4   25.6   59.6
4  Fc10              84.1   68.8  76.8   69.4   54.9   26.8   60.9
5  Fc10 (Fusion)     84.8   71.8  77.6   71.2   55.9   29.2   62.5

State-of-the-art methods
6  Yang & Ramanan    84.1   77.1  69.5   65.6   52.5   35.9   60.8
7  Ouyang et al.     85.8   83.1  76.5   72.2   63.3   46.6   68.6

Results on LSP dataset

Failure Cases

• Articulation
• Foreshortening
• Occlusions and distractions
• Cluttered background or overlapping people

Deformable Convolutional Neural Networks

Motivation

• Local image patches are able to capture:
– Part presence
– Pairwise part spatial relationships

Number of mixture types for each pair: 6

Neighbors: 1, # of relationships:
Neighbors: 2, # of relationships:

[Figure labels: Lower arm, Upper arm]

[Chen & Yuille NIPS 2014]

Tree-structured Relational Graph

– $\mathbf{p}$: positions of body parts
– $\mathbf{t}$: pairwise relationships between parts
– $p_i$: pixel location of part $i$
– $t_{ij}$: pairwise relationship between parts $i$ and $j$
  – Defined by relative position
  – In experiments: 13 types for each pair

Formulation

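$$F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta)$$

Terms and their weights:
• Part presence: $A_i(p_i \mid I; \theta)$, weighted by $\omega_i$
• Pairwise deformation: weighted by $\boldsymbol{\omega}_{ij}^{t_{ij}}$
• Pairwise relationship: weighted by $\omega_{ij}$

Inference:
• Tree structure
• Can be solved efficiently by dynamic programming

The weights $\omega_i$, $\omega_{ij}$, $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are currently learned by latent structural SVM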
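The dynamic-programming inference on a tree can be sketched as a max-sum pass from the leaves to the root. This is a simplified illustration under stated assumptions, not the authors' implementation: each part is restricted to a small set of candidate locations, and each edge has a precomputed score table that already includes the deformation and relationship terms, maximized over the types $t_{ij}$, $t_{ji}$. The function name and argument layout are hypothetical.

```python
import numpy as np

def tree_max_sum(unary, parent, pairwise):
    """Exact MAP inference on a tree by max-sum dynamic programming.

    unary:    list of 1-D arrays; unary[i][k] is the score of part i at candidate k.
    parent:   parent[i] is the parent of part i; part 0 is the root (parent[0] = -1)
              and every parent index is smaller than its children's indices.
    pairwise: dict; pairwise[i][k_child, k_parent] scores the edge (i, parent[i]).
    Returns (best total score, list of chosen candidate indices per part).
    """
    n = len(unary)
    score = [np.asarray(u, dtype=float).copy() for u in unary]
    best_child = {}

    # Leaves-to-root pass: children have larger indices, so they are visited first.
    for i in range(n - 1, 0, -1):
        total = score[i][:, None] + pairwise[i]   # shape (K_i, K_parent)
        best_child[i] = total.argmax(axis=0)      # best child state per parent state
        score[parent[i]] += total.max(axis=0)     # message to the parent

    # Root-to-leaves pass: read back the arg-max choices.
    choice = [0] * n
    choice[0] = int(score[0].argmax())
    for i in range(1, n):
        choice[i] = int(best_child[i][choice[parent[i]]])
    return float(score[0].max()), choice
```

With K candidates per part the cost is O(K²) per edge, which is why the tree structure keeps inference efficient.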

Learning parameters

Derive the type label for each patch
• Use relative position to represent the pairwise relations
• Cluster the relative positions over the whole training set
• Type label: cluster index
• Mean relative position: cluster center
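The clustering step can be sketched with plain k-means. The slide only says "cluster", so the choice of k-means, the function name, and the parameters here are assumptions made for illustration.

```python
import numpy as np

def cluster_relative_positions(rel_pos, num_types, num_iters=50, seed=0):
    """Cluster relative positions (p_j - p_i) of one part pair with k-means.

    rel_pos:   (N, 2) array of offsets collected over the training set.
    num_types: number of clusters, i.e. relationship types for this pair.
    Returns (labels, centers): the type label of each sample and the mean
    relative position of each type.
    """
    rel_pos = np.asarray(rel_pos, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with distinct random samples.
    centers = rel_pos[rng.choice(len(rel_pos), size=num_types, replace=False)]

    def assign(c):
        dists = np.linalg.norm(rel_pos[:, None, :] - c[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centers)
        new_centers = centers.copy()
        for k in range(num_types):
            members = rel_pos[labels == k]
            if len(members) > 0:          # keep the old center if a cluster empties
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    return assign(centers), centers
```

Each training patch then receives the cluster index of its offset as its type label, and the cluster center serves as the mean relative position for that type.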

Casting Full Connections into Convolutions

Elbow

Part presence map

Pairwise relationship map
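The idea behind the slide title can be illustrated as follows: a fully connected layer trained on k × k patches is equivalent to a k × k convolution, so sliding it over a larger feature map produces dense score maps (such as the part presence and pairwise relationship maps) in one pass. This NumPy sketch uses hypothetical function names and an explicit loop for clarity; it is not the authors' network code.

```python
import numpy as np

def fc_on_patch(patch, weights, bias):
    """Fully connected layer applied to one flattened (k, k, C) patch."""
    return weights @ patch.reshape(-1) + bias

def fc_as_convolution(feature_map, weights, bias, k):
    """Apply the same FC weights at every location of an (H, W, C) map.

    weights: (num_outputs, k * k * C) matrix of the fully connected layer.
    Returns an (H - k + 1, W - k + 1, num_outputs) score map.
    """
    h, w, c = feature_map.shape
    # Reinterpret each row of the FC weight matrix as a k x k x C kernel.
    kernels = weights.reshape(-1, k, k, c)
    out = np.empty((h - k + 1, w - k + 1, kernels.shape[0]))
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            window = feature_map[y:y + k, x:x + k, :]
            out[y, x] = np.tensordot(kernels, window, axes=([1, 2, 3], [0, 1, 2])) + bias
    return out
```

Each output location equals the FC layer evaluated on the patch at that location, so no weights need to be retrained.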

PCP and PDJ on LSP dataset and FLIC dataset

Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
FLIC     DCNN           87.0   98.8  -      -      96.5   84.0   91.1

[Figure panels: LSP, FLIC]

Future work

Future Work

• Build an end-to-end system to estimate human pose
• Consider combining local information and the holistic view
• Go beyond tree structure

Thank you

Appendix

• Data Augmentation
• Evaluation Metrics

Data Augmentation

• The amount of training data in existing datasets is insufficient to train deep CNNs
– Statistics of existing datasets

Dataset     # Training images  # Testing images  Type
PARSE       100                205               Full body
LSP         1,000              1,000             Full body
LSP extend  10,000             -                 Full body
FLIC        3,987              1,016             Upper body
MPII        28,821             11,701            Full body

– Number of parameters of AlexNet: 60 million

• Data augmentation is an effective way to prevent overfitting

Data Augmentation (cont.)

• Random padding

• Rotating
– ±[2.5°, 5°, 7.5°, 10°, 15°, 20°]

• Flipping
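The flipping and rotation steps must transform the joint annotations together with the image. Below is a minimal sketch for the annotation side; the function names, the `(x, y)` joint layout, and the `swap_pairs` argument (index pairs of left/right joints, which depend on the dataset's joint order) are assumptions. Rotating the image pixels themselves is left to an image library.

```python
import numpy as np

# Rotation angles from the slide, in degrees, applied in both directions.
ANGLES = [2.5, 5, 7.5, 10, 15, 20]
ROTATIONS = [sign * a for a in ANGLES for sign in (+1, -1)]

def flip_horizontal(image, joints, swap_pairs):
    """Mirror an (H, W, C) image and its (N, 2) joints given as (x, y).

    swap_pairs: list of (left_index, right_index) joint pairs; after mirroring,
    the person's left limbs appear on the other side, so labels are swapped.
    """
    width = image.shape[1]
    flipped = image[:, ::-1].copy()
    out = np.array(joints, dtype=float)
    out[:, 0] = (width - 1) - out[:, 0]
    for a, b in swap_pairs:
        out[[a, b]] = out[[b, a]]
    return flipped, out

def rotate_joints(joints, angle_deg, center):
    """Rotate (N, 2) joints about `center` by `angle_deg`.

    Uses the standard rotation matrix in (x, y) pixel coordinates; because the
    y axis points down in images, a positive angle looks clockwise on screen.
    """
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    center = np.asarray(center, dtype=float)
    return (np.asarray(joints, dtype=float) - center) @ rot.T + center
```

Whatever library rotates the image, its angle convention has to be checked against `rotate_joints` so that pixels and annotations move the same way.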

Evaluation Metrics

• Percentage of Correct Parts (PCP)
– Measures the percentage of correctly localized body parts.
– A candidate body part is treated as correct if both of its segment endpoints lie within 50% of the ground-truth segment length from their annotated locations.

• Percentage of Detected Joints (PDJ)
– Measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies.
– The threshold is normalized by a scale defined as the distance between the left shoulder and the right hip.
– Invariant to scale
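Both metrics can be written down directly from the definitions above. This is a sketch for a single pose with hypothetical function names and array layouts; published evaluation code may differ in details such as how results are aggregated over a dataset.

```python
import numpy as np

def pcp(pred_parts, gt_parts, alpha=0.5):
    """Fraction of correct parts for one pose.

    pred_parts, gt_parts: (N, 2, 2) arrays: N parts, 2 endpoints, (x, y).
    A part counts as correct when both predicted endpoints lie within
    alpha * (ground-truth part length) of the annotated endpoints.
    """
    pred = np.asarray(pred_parts, dtype=float)
    gt = np.asarray(gt_parts, dtype=float)
    lengths = np.linalg.norm(gt[:, 0] - gt[:, 1], axis=1)   # (N,)
    errors = np.linalg.norm(pred - gt, axis=2)              # (N, 2)
    correct = (errors <= alpha * lengths[:, None]).all(axis=1)
    return correct.mean()

def pdj(pred_joints, gt_joints, left_shoulder, right_hip, thresholds):
    """Fraction of detected joints at each normalized threshold for one pose.

    pred_joints, gt_joints: (J, 2) arrays of (x, y) locations.
    left_shoulder, right_hip: joint indices used to define the torso scale.
    thresholds: fractions of the torso scale; one curve point per value.
    """
    pred = np.asarray(pred_joints, dtype=float)
    gt = np.asarray(gt_joints, dtype=float)
    torso = np.linalg.norm(gt[left_shoulder] - gt[right_hip])
    errors = np.linalg.norm(pred - gt, axis=1)
    return np.array([(errors <= t * torso).mean() for t in thresholds])
```

Multiplying the returned fractions by 100 gives percentages comparable to the tables; dividing by the torso scale is what makes PDJ invariant to the size of the person.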
