Articulated human pose estimation by deep learning
Wei Yang
Supervisor: Xiaogang Wang, Wanli Ouyang
05/01/2023 2
Outline
• Introduction
• Regression by Convolutional Neural Network
• Deformable Convolutional Neural Networks
• Discussion and Future Work
Introduction
Articulated body pose estimation “recovers the pose of an articulated body, which consists of joints and rigid parts using image-based observations.”
Classic Approaches
Pictorial Structure (Fischler & Elschlager 1973; Felzenszwalb & Huttenlocher 2005)
• Unary templates
• Pairwise springs
Mixtures of “mini-parts” (Yang & Ramanan 2011)
• A mixture type for each part
• Unary template for each part, given its mixture type
• Pairwise springs between connected parts, given the mixture types of both parts
Deep Learning Methods
Multi-source Deep Learning (Ouyang et al. 2014)
• Candidate estimations
• Deep model uses multiple information sources, including appearance score, mixture type, and deformation
DeepPose (Toshev & Szegedy 2014)
• Reasons about pose in a holistic fashion
• Refines the joint predictions by using higher-resolution sub-images
We propose to study pose estimation in two ways
• Holistic view
  – Regression of joint locations by convolutional neural networks (CNNs)
• Local information
  – Deformable Convolutional Neural Networks
Formulation
• Image: $I$
• Part locations: $\mathbf{p}$
$\psi(I; \theta) = \mathbf{p}$
• Location of part $i$: $p_i$
• $\psi$ is learned by a deep CNN with parameters $\theta$
Basic Architecture of the CNN Regressor
• AlexNet
  – Krizhevsky, Sutskever, and Hinton, NIPS 2012
  – The first deep model shown to be effective on a large-scale computer vision task
Normalize Scale of Human Body
• The size of the CNN input is fixed
• Simple warping changes the aspect ratio of people
• People appear at different scales in an image
1. Original image
2. Human detection [Ouyang et al. CVPR 2014]
3. Crop by bounding box
4. Padding with mean RGB value
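The padding step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the crop is centered and padded to a square, which the slide does not state, and the names `pad_to_square` and `shift_joints` are hypothetical. Resizing the square to the fixed CNN input size is left out.

```python
def pad_to_square(image, mean_rgb):
    """Pad an H x W image (list of rows of (r, g, b) tuples) to a square.

    Assumption: the crop is centered and the border is filled with mean_rgb,
    so the aspect ratio of the person is unchanged by a later resize.
    """
    height, width = len(image), len(image[0])
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2
    # Start from a square filled with the mean color, then paste the crop.
    padded = [[mean_rgb for _ in range(side)] for _ in range(side)]
    for y in range(height):
        for x in range(width):
            padded[top + y][left + x] = image[y][x]
    return padded, (left, top)


def shift_joints(joints, offset):
    """Move (x, y) joint annotations into the padded image's coordinates."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in joints]
```

The joint annotations have to be shifted by the same offset as the pixels, otherwise the regression targets no longer match the padded input.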
Architecture 1
• Loss function:
• Evaluation metric: PCP
Method          Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Yang & Ramanan  84.1   77.1  69.5   65.6   52.5   35.9   60.8
Conv5           58.8   24.1  49.6   36.6   25.8   2.8    31.3
Architecture 2
Method          Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Yang & Ramanan  84.1   77.1  69.5   65.6   52.5   35.9   60.8
Conv5           58.8   24.1  49.6   36.6   25.8   2.8    31.3
Fc8 (AlexNet)   81.1   63.7  72.8   66.6   50.6   21.9   56.9
• Loss function:
• Evaluation metric: PCP
Architecture 3
Method          Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Yang & Ramanan  84.1   77.1  69.5   65.6   52.5   35.9   60.8
Conv5           58.8   24.1  49.6   36.6   25.8   2.8    31.3
Fc8 (AlexNet)   81.1   63.7  72.8   66.6   50.6   21.9   56.9
Fc10            84.1   68.8  76.8   69.4   54.9   26.8   60.9
• Loss function:
• Evaluation metric: PCP
PCP and PDJ on LSP
#  Method            Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
Ours
1  Conv5             58.8   24.1  49.6   36.6   25.8   2.8    31.3
2  Fc8 (AlexNet)     81.1   63.7  72.8   66.6   50.6   21.9   56.9
3  Fc8 (LSP-extend)  83.1   67.2  75.0   68.7   53.4   25.6   59.6
4  Fc10              84.1   68.8  76.8   69.4   54.9   26.8   60.9
5  Fc10 (Fusion)     84.8   71.8  77.6   71.2   55.9   29.2   62.5
State-of-the-art methods
6  Yang & Ramanan    84.1   77.1  69.5   65.6   52.5   35.9   60.8
7  Ouyang et al.     85.8   83.1  76.5   72.2   63.3   46.6   68.6
Failure Cases
• Articulation
• Foreshortening
• Occlusions and distractions
• Cluttered background or overlapping people
Motivation
• Local image patches are able to capture:
  – Part presence
  – Pairwise part spatial relationships
Number of mixture types for each pair: 6
Neighbors: 1, # of relationships:
Neighbors: 2, # of relationships:
[Figure: lower arm and upper arm patches]
[Chen & Yuille NIPS 2014]
Tree-structured Relational Graph
– $V$: positions of body parts
– $E$: pairwise relationships between parts
– $p_i$: pixel location of part $i$
– $t_{ij}$: pairwise relationship between parts $i$ and $j$
  – Defined by relative position
  – In experiments: 13 types for each pair
Formulation
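$$
F(\mathbf{p}, \mathbf{t} \mid I; \boldsymbol{\omega}, \theta) = \sum_{i \in V} \omega_i \cdot A_i(p_i \mid I; \theta) + \sum_{(i,j) \in E} \left[ \omega_{ij} \cdot R(p_i, p_j, t_{ij}, t_{ji} \mid I; \theta) + \boldsymbol{\omega}_{ij}^{t_{ij}} \cdot (\text{deformation feature}) \right]
$$
• $\omega_i \cdot A_i(\cdot)$: part presence
• $\omega_{ij} \cdot R(\cdot)$: pairwise relationship
• $\boldsymbol{\omega}_{ij}^{t_{ij}} \cdot (\cdot)$: pairwise deformation
Inference:
• Tree structure
• Can be solved efficiently by dynamic programming
$\omega_i$, $\omega_{ij}$, and $\boldsymbol{\omega}_{ij}^{t_{ij}}$ are currently learned by a latent structural SVM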
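Dynamic programming on a tree can be sketched as a max-sum pass from the leaves to the root, followed by backtracking. The sketch below is an illustration under simplifying assumptions: each part has a small list of discrete states (standing in for candidate positions and types), `unary` stands in for the weighted part presence scores, and `pairwise` stands in for the sum of the two weighted pairwise terms. The function name `infer_tree` and its arguments are hypothetical.

```python
def infer_tree(unary, children, pairwise, root=0):
    """Max-sum dynamic programming on a tree-structured model.

    unary[i][s]          : score of part i in state s
    children[i]          : list of child parts of part i (missing key = leaf)
    pairwise(i, j, s, t) : score of parent i in state s with child j in state t
    Returns (best total score, {part: best state}).
    """
    best_child_state = {}  # (child, parent state) -> best state of the child

    def upward(i):
        # scores[s] = best score of the subtree rooted at i, given i is in state s
        scores = list(unary[i])
        for j in children.get(i, []):
            child_scores = upward(j)
            for s in range(len(scores)):
                options = [
                    child_scores[t] + pairwise(i, j, s, t)
                    for t in range(len(child_scores))
                ]
                t_best = max(range(len(options)), key=options.__getitem__)
                best_child_state[(j, s)] = t_best
                scores[s] += options[t_best]
        return scores

    root_scores = upward(root)
    states = {root: max(range(len(root_scores)), key=root_scores.__getitem__)}
    # Backtrack from the root to read off the best state of every part.
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children.get(i, []):
            states[j] = best_child_state[(j, states[i])]
            stack.append(j)
    return root_scores[states[root]], states
```

Each edge is processed once, and for each edge every (parent state, child state) pair is scored once, so the cost grows linearly with the number of parts rather than exponentially.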
Learning parameters
Derive the type label for each patch
• Use relative position to represent the pairwise relations
• Cluster the relative positions over the whole training set
• Type label $t_{ij}$: cluster index
• Mean relative position: cluster center
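The clustering step above might look like the following sketch. The slide does not name the clustering algorithm, so plain k-means is an assumption here, as are the deterministic initialization and the helper names `relative_positions` and `kmeans`. Poses are taken to be lists of (x, y) joint positions.

```python
def relative_positions(poses, part_i, part_j):
    """Offset from part_i to part_j in every training pose."""
    return [
        (pose[part_j][0] - pose[part_i][0], pose[part_j][1] - pose[part_i][1])
        for pose in poses
    ]


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; returns (labels, centers)."""
    # Deterministic initialization: k points spread evenly through the list.
    step = max(len(points) // k, 1)
    centers = [points[(i * step) % len(points)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centers
```

The returned label of each training pair plays the role of the type label, and the matching center is the mean relative position for that type.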
Casting Full Connections into Convolutions
[Figure: elbow example, showing the part presence map and the pairwise relationship map]
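The idea behind casting a fully connected layer into a convolution can be illustrated with a toy example: a fully connected unit trained on k x k patches is a dot product, and sliding the same weights over a larger image yields one score per patch position, i.e. a dense map. The sketch assumes a single input channel, a single output unit, and stride 1; the function names are hypothetical.

```python
def fc_on_patch(patch, weights, bias):
    """Fully connected unit applied to one k x k patch (a dot product)."""
    return bias + sum(
        patch[y][x] * weights[y][x]
        for y in range(len(weights))
        for x in range(len(weights[0]))
    )


def fc_as_convolution(image, weights, bias):
    """Slide the same weights over the image: one output per patch position."""
    k_h, k_w = len(weights), len(weights[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            fc_on_patch([row[x:x + k_w] for row in image[y:y + k_h]], weights, bias)
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]
```

Every entry of the output equals what the patch classifier would have produced on the corresponding crop, which is how a patch-level network can yield a score map for the whole image in one pass.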
PCP and PDJ on LSP dataset and FLIC dataset
Dataset  Method         Torso  Head  U.Leg  L.Leg  U.Arm  L.Arm  Mean PCP
LSP      DCNN           92.5   85.1  82.7   76.3   70.2   55.9   74.8
LSP      Ouyang et al.  85.8   83.1  76.5   72.2   63.3   46.6   68.6
FLIC     DCNN           87.0   98.8  -      -      96.5   84.0   91.1
[Figure: PDJ curves on LSP and FLIC]
Future Work
• Build an end-to-end system to estimate human pose
• Consider combining local information and the holistic view
• Go beyond the tree structure
Data Augmentation
• Existing datasets have too few training images to train deep CNNs
  – Statistics of existing datasets
  – Number of parameters of AlexNet: 60 million
• Data augmentation is an effective way to reduce overfitting
Dataset     # Training images  # Testing images  Type
PARSE       100                205               Full body
LSP         1,000              1,000             Full body
LSP extend  10,000             -                 Full body
FLIC        3,987              1,016             Upper body
MPII        28,821             11,701            Full body
Data Augmentation (cont.)
• Random padding
• Rotation: ±{2.5°, 5°, 7.5°, 10°, 15°, 20°}
• Flipping
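When an image is rotated or flipped, the joint annotations have to be transformed in the same way. The sketch below covers only the annotation side and is an illustration under assumptions: the rotation center, the sign convention (which must match whatever routine rotates the pixels), and the left/right index pairs are not given on the slide, and the function names are hypothetical.

```python
import math

# Rotation magnitudes in degrees, taken from the slide; each is used as + and -.
ROTATION_ANGLES = [2.5, 5, 7.5, 10, 15, 20]


def augmented_angles():
    """All signed rotation angles used for augmentation."""
    return [sign * angle for angle in ROTATION_ANGLES for sign in (1, -1)]


def rotate_joints(joints, angle_deg, center):
    """Rotate (x, y) joints about `center` by angle_deg degrees."""
    a = math.radians(angle_deg)
    cx, cy = center
    rotated = []
    for x, y in joints:
        dx, dy = x - cx, y - cy
        rotated.append((
            cx + dx * math.cos(a) - dy * math.sin(a),
            cy + dx * math.sin(a) + dy * math.cos(a),
        ))
    return rotated


def flip_joints(joints, image_width, swap_pairs):
    """Mirror joints horizontally, then swap left/right joint labels."""
    flipped = [(image_width - 1 - x, y) for x, y in joints]
    for left, right in swap_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped
```

Swapping the left/right labels after a flip matters: without it, a mirrored left wrist would still be labeled as a left wrist even though it now appears on the person's right side.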
Evaluation Metrics
• Percentage of Correct Parts (PCP)
  – Measures the percentage of correctly localized body parts.
  – A candidate body part is treated as correct if both of its segment endpoints lie within 50% of the ground-truth segment length of the corresponding annotated endpoints.
• Percentage of Detected Joints (PDJ)
  – Measures performance as a curve: the percentage of correctly localized joints as the localization precision threshold varies. The threshold is normalized by the body scale, defined as the distance between the left shoulder and the right hip.
  – Invariant to scale
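Both metrics can be written in a few lines. The sketch below follows the definitions above for a single image and is not a reference implementation: published PCP variants differ in details (for example, whether both endpoints or their average must pass the test), and the function names and argument layout are hypothetical. It uses `math.dist`, available in Python 3.8 and later.

```python
import math


def pcp(pred_limbs, gt_limbs, alpha=0.5):
    """Fraction of limbs whose two predicted endpoints both lie within
    alpha * (ground-truth limb length) of the annotated endpoints."""
    correct = 0
    for (pa, pb), (ga, gb) in zip(pred_limbs, gt_limbs):
        limit = alpha * math.dist(ga, gb)
        if math.dist(pa, ga) <= limit and math.dist(pb, gb) <= limit:
            correct += 1
    return correct / len(gt_limbs)


def pdj(pred_joints, gt_joints, left_shoulder, right_hip, fraction):
    """Fraction of joints within fraction * torso size of the ground truth,
    where torso size is the left-shoulder-to-right-hip distance."""
    torso = math.dist(gt_joints[left_shoulder], gt_joints[right_hip])
    hits = sum(
        math.dist(p, g) <= fraction * torso
        for p, g in zip(pred_joints, gt_joints)
    )
    return hits / len(gt_joints)
```

Evaluating `pdj` over a range of `fraction` values gives the PDJ curve; because the threshold scales with the torso size, the result does not depend on how large the person appears in the image.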