Digest of Human Detection from CVPR 2015

Jan. 27th, 2016, Daichi SUZUO

Features

1. Combination Features and Models for Human Detection - Y. Jiang et al.

2. Filtered Channel Features for Pedestrian Detection - S. Zhang et al.

Training

3. Learning Scene-Specific Pedestrian Detectors without Real Data - H. Hattori et al.

4. Taking a Deeper Look at Pedestrians - J. Hosang et al.

5. Pedestrian Detection aided by Deep Learning Semantic Tasks - Y. Tian et al.

Dataset / Benchmark

6. Multispectral Pedestrian Detection: Benchmark Dataset and Baseline - S. Hwang et al.

Fundamentals of Human Detection

• Machine-learning-based binary (two-class) classifier

• Sliding window search

(Figure, training: positive-class and negative-class images → convert to image features → train the classifier)

(Figure, detection: crop a window → feature extraction → classifier → human? / not human?)
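The crop → feature extraction → classifier loop of the sliding window search can be sketched as follows. This is a minimal illustration, not code from any of the papers: the function names, the `stride` and `threshold` parameters, and the toy feature/classifier are all assumptions, and plain lists are used to keep the sketch dependency-free.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]    # (x, y, width, height)


def crop(image: Image, x: int, y: int, w: int, h: int) -> List[List[float]]:
    """Cut the window (x, y, w, h) out of the image."""
    return [list(row[x:x + w]) for row in image[y:y + h]]


def sliding_window_detect(
    image: Image,
    window: Tuple[int, int],
    stride: int,
    extract: Callable[[List[List[float]]], List[float]],
    classify: Callable[[List[float]], float],
    threshold: float = 0.5,
) -> List[Box]:
    """Return every window whose classifier score exceeds the threshold."""
    win_w, win_h = window
    height, width = len(image), len(image[0])
    hits = []
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            features = extract(crop(image, x, y, win_w, win_h))
            if classify(features) > threshold:
                hits.append((x, y, win_w, win_h))
    return hits


# Toy stand-ins (hypothetical): mean brightness is the only feature,
# and the "classifier" simply returns that value as its score.
def mean_feature(patch: List[List[float]]) -> List[float]:
    values = [v for row in patch for v in row]
    return [sum(values) / len(values)]


def toy_classifier(features: List[float]) -> float:
    return features[0]


# 6x6 black image with a bright 2x2 square at (2, 2).
toy_image = [[0.0] * 6 for _ in range(6)]
for yy in range(2, 4):
    for xx in range(2, 4):
        toy_image[yy][xx] = 1.0

detections = sliding_window_detect(toy_image, (2, 2), 2, mean_feature, toy_classifier)
```

A practical detector would also scan over multiple scales and merge overlapping hits; both steps are omitted here.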

Image features

1. Combination Features and Models for Human Detection - Y. Jiang et al.

2. Filtered Channel Features for Pedestrian Detection - S. Zhang et al.

1. Combination Features and Models for Human Detection - Y. Jiang et al.

• Popular HOG feature [Dalal05]

(Figure: input image → edge extraction → edge image; for each “cell”, a histogram of the pixel-wise gradient orientation θ, weighted by gradient power)

• Popular HOG feature [Dalal05]: 1st order feature

(Figure: input image → differentiate → 1st derivative; for each “cell”, a histogram of the pixel-wise gradient orientation θ, weighted by gradient power)
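As a rough illustration of the per-cell histogram, the sketch below accumulates gradient power into orientation bins for one cell. It is a simplified assumption-laden version: central differences, unsigned orientation, 9 bins, and no block normalization or bin interpolation. The function name and parameters are hypothetical, not taken from [Dalal05].

```python
import math
from typing import List, Sequence


def cell_gradient_histogram(cell: Sequence[Sequence[float]], n_bins: int = 9) -> List[float]:
    """Histogram of gradient orientation (theta), weighted by gradient magnitude (power)."""
    hist = [0.0] * n_bins
    rows, cols = len(cell), len(cell[0])
    # Border pixels are skipped so that central differences stay inside the cell.
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]   # 1st derivative along x
            gy = cell[y + 1][x] - cell[y - 1][x]   # 1st derivative along y
            power = math.hypot(gx, gy)
            theta = math.atan2(gy, gx) % math.pi   # unsigned orientation in [0, pi)
            bin_index = min(int(theta / math.pi * n_bins), n_bins - 1)
            hist[bin_index] += power
    return hist


# Toy cell containing a vertical edge: dark left half, bright right half.
edge_cell = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
hog_cell = cell_gradient_histogram(edge_cell)
```

For this toy cell all gradient power falls into the first bin, since every interior pixel has a purely horizontal gradient.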

Idea: how about extending it to 0th / 2nd order?





• 2nd order: HOB – “bar” shape

• Same as HOG, just using 2nd derivative

• 0th order: HOC – color feature

• Using HSI color space; H as θ, S as power (I is ignored)

(Figure: color image → convert to HSI → ignore I)
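The “H as θ, S as power” idea can be sketched as a hue histogram weighted by saturation. The conversion below uses the common textbook HSI formulas; the bin count and function names are assumptions for illustration and may differ from what the paper uses.

```python
import math
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]  # components in [0, 1]


def rgb_to_hs(pixel: RGB) -> Tuple[float, float]:
    """Return (hue in radians, saturation in [0, 1]) of the HSI model; I is not computed."""
    r, g, b = pixel
    total = r + g + b
    if total == 0.0:
        return 0.0, 0.0  # black: hue undefined, saturation zero
    saturation = 1.0 - 3.0 * min(r, g, b) / total
    numerator = 0.5 * ((r - g) + (r - b))
    denominator = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if denominator == 0.0:
        return 0.0, saturation  # gray: hue undefined
    hue = math.acos(max(-1.0, min(1.0, numerator / denominator)))
    if b > g:
        hue = 2.0 * math.pi - hue
    return hue, saturation


def color_histogram(pixels: Sequence[RGB], n_bins: int = 8) -> List[float]:
    """Histogram over hue (as theta), weighted by saturation (as power)."""
    hist = [0.0] * n_bins
    for pixel in pixels:
        hue, saturation = rgb_to_hs(pixel)
        bin_index = min(int(hue / (2.0 * math.pi) * n_bins), n_bins - 1)
        hist[bin_index] += saturation
    return hist


# Two red pixels, one blue pixel, one gray pixel (gray has zero saturation).
cell_pixels = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]
hoc_cell = color_histogram(cell_pixels)
```

Weighting by saturation means gray pixels, whose hue is meaningless, contribute nothing to the histogram.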



• Combine them into one vector: HOG-III feature



• Train different classifiers from the same HOG-IIIs

• Detect individually, and fuse into one result

(Figure: input image → HOG-III features → detection by Grammar model [Girshick11] and detection by Poselet model [Bourdev10] → fusion → final result)

(This is one of the key processes of the method. Please refer to the original paper for more details.)



Effect of HOG-III

Feature          AP
HOG              45.8%
HOC+HOG+HOB      50.1%
HOG-III          51.3%

Effect of Fusion

Classifier               AP
Single use of Grammar    45.8%
Single use of Poselet    47.0%
Fusion                   52.3%

Combining HOG-III and Fusion performs best

2. Filtered Channel Features for Pedestrian Detection - S. Zhang et al.

• Extension of “Integral Channel Features” (ChnFtrs) [Dollár09]

• ChnFtrs: extension of the “Viola-Jones method” [Viola02]

(Viola-Jones method)

(Figure: input image → extract “Haar-like” features (scalar) → learn decision trees by AdaBoost)

※ A Haar-like feature is the difference between the pixel sums of the white and black regions
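A Haar-like feature and the rectangle sums it relies on can be sketched as below. The integral image (summed-area table) is the standard trick for getting any rectangle sum from four lookups; the two-rectangle layout and all function names here are illustrative assumptions.

```python
from typing import List, Sequence


def integral_image(image: Sequence[Sequence[float]]) -> List[List[float]]:
    """ii[y][x] holds the sum of image rows [0, y) and columns [0, x)."""
    height, width = len(image), len(image[0])
    ii = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += image[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[float]], x: int, y: int, w: int, h: int) -> float:
    """Sum of the pixels inside the rectangle (x, y, w, h), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii: List[List[float]], x: int, y: int, w: int, h: int) -> float:
    """White (left half) sum minus black (right half) sum; w is assumed to be even."""
    half = w // 2
    white = rect_sum(ii, x, y, half, h)
    black = rect_sum(ii, x + half, y, half, h)
    return white - black


# Toy image: bright left half, dark right half.
toy = [[1.0, 1.0, 0.0, 0.0] for _ in range(4)]
toy_ii = integral_image(toy)
haar_feature = haar_two_rect_horizontal(toy_ii, 0, 0, 4, 4)
```

The same `rect_sum` lookup covers the “sum of rectangle” feature: it is a single rectangle sum over a channel, without the white-minus-black difference.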





(Integral Channel Features)

(Figure: input image → transform → “channel” → extract the sum of a rectangle (unlike Haar-like features) → learn decision trees by AdaBoost)





(Filtered Channel Features)

(Figure: “channel” → apply various filters (convolution) → pick up each pixel value as a feature → learn decision trees by AdaBoost)
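The “filter each channel, then pick up pixel values” step can be sketched as follows. This is a simplified assumption, not the paper's implementation: the two toy kernels are made up, and the filter is applied as a “valid” cross-correlation, which differs from convolution only by a flip of the kernel.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def filter_channel(channel: Matrix, kernel: Matrix) -> List[List[float]]:
    """'Valid' 2D cross-correlation of one channel with one filter kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    return [
        [
            sum(channel[y + j][x + i] * kernel[j][i] for j in range(kh) for i in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def filtered_channel_features(channels: Sequence[Matrix], kernels: Sequence[Matrix]) -> List[float]:
    """Apply every filter to every channel; each response pixel becomes one feature."""
    features: List[float] = []
    for channel in channels:
        for kernel in kernels:
            response = filter_channel(channel, kernel)
            features.extend(value for row in response for value in row)
    return features


toy_channel = [[0.0, 0.0, 1.0, 1.0] for _ in range(3)]
toy_kernels = [
    [[1.0, 1.0], [1.0, 1.0]],    # uniform: sum of a 2x2 rectangle
    [[1.0, -1.0], [1.0, -1.0]],  # horizontal difference
]
fcf = filtered_channel_features([toy_channel], toy_kernels)
```

With a uniform kernel the response is just a rectangle sum, so this formulation contains the rectangle-sum features as a special case. The resulting feature vector would then be fed to the boosted decision trees.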



Using 50 filters performs best

Achieved the highest accuracy

Training

3. Learning Scene-Specific Pedestrian Detectors without Real Data - H. Hattori et al.

4. Taking a Deeper Look at Pedestrians - J. Hosang et al.

5. Pedestrian Detection aided by Deep Learning Semantic Tasks - Y. Tian et al.

• Train the detector on CG-based (synthetic) training data



(Figure: real background (static image) → annotate → composite CG-based humans → simulated scene)

• Not only scene-specific, but also location-specific!



(Figure: grid with overlap (10^2 ~ 10^5 patches) → training images (~10^3 pos, ~10^3 neg for each patch) → joint classifier ensemble training → one classifier per patch: scene-specific, location-specific detectors)
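The grid with overlap can be pictured with the sketch below, which lays square patches over the image and looks up which patches (and hence which location-specific detectors) cover a given point. The `patch` and `stride` parameters, their pixel units, and the function names are hypothetical; the paper's actual grid construction may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def overlapping_grid(width: int, height: int, patch: int, stride: int) -> List[Box]:
    """Square patches placed every `stride` pixels; stride < patch makes them overlap."""
    return [
        (x, y, patch, patch)
        for y in range(0, height - patch + 1, stride)
        for x in range(0, width - patch + 1, stride)
    ]


def patches_containing(point: Tuple[int, int], grid: List[Box]) -> List[int]:
    """Indices of the grid patches (one detector each) that cover the point."""
    px, py = point
    return [
        i for i, (x, y, w, h) in enumerate(grid)
        if x <= px < x + w and y <= py < y + h
    ]


grid = overlapping_grid(width=64, height=48, patch=32, stride=16)
covering = patches_containing((40, 20), grid)
```

Because the patches overlap, one image location is typically covered by several detectors, which is one reason the classifiers are trained jointly as an ensemble.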



Patch size    # detectors    Avg. Precision
8x8           371            .802
16x16         102            .798
32x32         30             .764

Effect of location-specific detection

Example of the detection result

Comparison

“Convnets still underperform the state of the art”

…Really?

Build up know-how for convnet-based detectors


• Small network (CifarNet) / Big network (AlexNet)

• Window size

• How to collect training images

• Fine-tuning

• Number and Type of layers

• …


The convnet with the best configuration outperforms the state of the art!

Interesting points:

• The pos/neg ratio does not affect the accuracy much

• Data augmentation is effective

• Network size should be chosen according to the amount of training samples

• ...



Binary classification is sometimes insufficient…

Human

Not human

(Hard negatives)

It is necessary to use semantic information jointly



Classify pedestrians and recognize semantic attributes at once!




Also recognizes current scene semantics

• Pedestrian attribute (e.g. wearing backpack)

• Background attribute (e.g. road, sky, …)
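One generic way to train pedestrian classification and attribute recognition together is to add the attribute losses to the pedestrian loss. The sketch below is only that generic multi-task idea: the binary cross-entropy form, the per-attribute weights, and the function names are assumptions, and the actual TA-CNN objective is defined in the original paper.

```python
import math
from typing import Sequence


def binary_cross_entropy(prob: float, label: int, eps: float = 1e-12) -> float:
    """Loss for one binary prediction; `prob` is the predicted probability of label 1."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def joint_loss(
    ped_prob: float,
    ped_label: int,
    attr_probs: Sequence[float],
    attr_labels: Sequence[int],
    attr_weights: Sequence[float],
) -> float:
    """Pedestrian loss plus a weighted sum of per-attribute losses."""
    loss = binary_cross_entropy(ped_prob, ped_label)
    for prob, label, weight in zip(attr_probs, attr_labels, attr_weights):
        loss += weight * binary_cross_entropy(prob, label)
    return loss


# One sample: pedestrian = 1, one attribute (e.g. "wearing backpack") = 0.
example_loss = joint_loss(0.5, 1, [0.5], [0], [0.5])
```

The weights (hypothetical here) control how strongly each auxiliary attribute task influences the shared features relative to the main pedestrian task.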




It is difficult to collect various (annotated) negatives from one dataset…

Transfer from other annotated datasets by TA-CNN

(Please refer to the original and related papers for more details about TA-CNN…)



Comparison with CNN-based methods

Example of detection results

Benchmark / Dataset



• Dataset of visible-light and thermal images



Contributions:

• Color and thermal images

• Both training and test data

• Temporally corresponding tags

• Large enough

• …

Takeaways

• Human detection is still challenging

• Deep learning does not necessarily solve every problem at this moment

• There are several pieces of knowledge that might be helpful for your research/hobby/…

References / Supplemental materials

2. Filtered Channel Features for Pedestrian Detection

4. Taking a Deeper Look at Pedestrians

• Author's website: http://rodrigob.github.io/

3. Learning Scene-Specific Pedestrian Detectors without Real Data

• Project: http://vishnu.boddeti.net/projects/detection-by-synthesis.html

• YouTube: https://youtu.be/2Jf7faozHUs

5. Pedestrian Detection aided by Deep Learning Semantic Tasks

• Project: http://mmlab.ie.cuhk.edu.hk/projects/TA-CNN/

6. Multispectral Pedestrian Detection: Benchmark Dataset and Baseline

• Lab: http://rcv.kaist.ac.kr/v2/

All CVPR 2015 papers are available at cv-foundation.org
