
High Level Computer Vision

Part-Based Models for Object Class Recognition Part 1

Bernt Schiele - schiele@mpi-inf.mpg.de Mario Fritz - mfritz@mpi-inf.mpg.de

https://www.mpi-inf.mpg.de/hlcv

High Level Computer Vision - May 23, 2o16 2

Complexity of Recognition

High Level Computer Vision - May 23, 2o16 3

Complexity of Recognition

High Level Computer Vision - May 23, 2o16

• Pictorial Structures [Fischler & Elschlager 1973] ‣ Model has two components

- parts (2D image fragments) - structure (configuration of parts)

4

Class of Object Models: Part-Based Models / Pictorial Structures

High Level Computer Vision - May 23, 2o16

• Bag of Words Models (BoW) ‣ object model = histogram of local features ‣ e.g. local features around interest points

• Global Object Models ‣ object model = global object feature

‣ e.g. HOG (Histogram of Oriented Gradients)

• Part-Based Object Models ‣ object model = models of parts & spatial topology model ‣ e.g. constellation model or ISM (Implicit Shape Model)

• But: What is the Ideal Notion of Parts here? • And: Should those Parts be Semantic?

“State-of-the-Art” in Object Class Representations

5

BoW: no spatial relationships

e.g. HOG: fixed spatial relationships

e.g. ISM: flexible spatial relationships

High Level Computer Vision - May 23, 2o16 6

Part-Based Models - Overview Today (more next week)

• Part-Based using Manual Labeling of Parts ‣ Detection by Components ‣ Multi-Scale Parts

• The Constellation Model ‣ automatic discovery of parts and part-structure

• The Implicit Shape Model (ISM) ‣ parts obtained by clustering interest-points

‣ star-model to model configuration of parts

High Level Computer Vision - May 23, 2o16 7

Manually Selected Parts

• Simplest solution ‣ Let a human expert select a set of parts ‣ (If it doesn’t work, take a different human expert)

Mohan, Papageorgiou, Poggio, ‘01

High Level Computer Vision - May 23, 2o16 8

Example 1: Detection by Components

• Application ‣ Pedestrian detection

• Representation by 4 parts ‣ Part candidates are selected by a human expert

‣ Part detectors are learned and applied independently

‣ The “most suitable” head, legs, and arms are identified by the part detectors

Mohan, Papageorgiou, Poggio, ‘01

High Level Computer Vision - May 23, 2o16 9

• “Structural model” via a Combination Classifier (stacking)

‣ Part scores are fed into the combination classifier

‣ The combination classifier classifies the pattern as “person” or “non-person”

‣ The person is detected as an ensemble of its parts (see the sketch below)

Mohan, Papageorgiou, Poggio, ‘01

Example 1: Detection by Components
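To make the stacking idea concrete, here is a minimal, hypothetical sketch in Python: four part detectors are assumed to already produce one score each per detection window, and a linear SVM (an arbitrary choice here, not necessarily the classifier used by Mohan et al.) is trained on those four scores to make the final person / non-person decision. All data and names are synthetic stand-ins.

# Sketch of the combination classifier (stacking): each window yields one score per
# part detector (head, legs, left arm, right arm); a second classifier decides
# "person" vs. "non-person" from those four scores. Data below is synthetic.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical part scores: person windows tend to score higher on all four parts.
person_scores = rng.normal(loc=1.0, scale=1.0, size=(200, 4))
non_person_scores = rng.normal(loc=-1.0, scale=1.0, size=(200, 4))

X = np.vstack([person_scores, non_person_scores])
y = np.hstack([np.ones(200), np.zeros(200)])

combo = LinearSVC(C=1.0).fit(X, y)                 # combination classifier on part scores

window_scores = np.array([[0.8, 1.2, -0.3, 0.5]])  # part scores of one detection window
print("person" if combo.predict(window_scores)[0] == 1 else "non-person")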

High Level Computer Vision - May 23, 2o16 10

• Detection results

Example 1: Detection by Components

Mohan, Papageorgiou, Poggio, ‘01

High Level Computer Vision - May 23, 2o16 11

Example 1: Detection by Components

• Robustness to occlusion ‣ System still detects pedestrians if a part is not visible

Mohan, Papageorgiou, Poggio, ‘01

High Level Computer Vision - May 23, 2o16 12

• Body model ‣ 4 body parts

(face, head, upper body, legs)

• 7 part detectors ‣ frontal face & head

‣ profile face & head

‣ frontal upper body

‣ profile upper body

‣ legs


Example 2: Multi-Scale Parts


Mikolajczyk, Schmid, Zisserman, ‘04

High Level Computer Vision - May 23, 2o16 13

• Computing local features for body parts (BP) ‣ Computing gradient and Laplacian ‣ Estimating and quantizing local orientations ‣ Grouping local orientations into horizontal and vertical triplets (see the sketch below)


Motivated by local shape context, SIFT [D. Lowe], wavelet features [H. Schneiderman]

Mikolajczyk, Schmid, Zisserman, ‘04

Example 2: Multi-Scale Parts
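A rough sketch of the feature computation described above, assuming a grayscale image given as a numpy array. The bin count, cell size and the way triplets are grouped are illustrative choices, not the exact parameters of Mikolajczyk et al.; the Laplacian channel is omitted for brevity.

# Sketch: quantize gradient orientations per cell, then group neighbouring cells
# into horizontal and vertical triplets. Parameters are illustrative assumptions.
import numpy as np

def quantized_orientations(img, n_bins=8):
    gy, gx = np.gradient(img.astype(float))
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)      # orientations in [0, pi)
    bins = np.floor(orientation / np.pi * n_bins).astype(int) % n_bins
    return bins, magnitude

def triplet_features(bins, magnitude, cell=8, n_bins=8):
    # Orientation histogram per cell, then concatenate triplets of neighbouring cells.
    h, w = bins.shape
    cells = np.zeros((h // cell, w // cell, n_bins))
    for i in range(h // cell):
        for j in range(w // cell):
            b = bins[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = magnitude[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            cells[i, j] = np.bincount(b, weights=m, minlength=n_bins)
    feats = []
    for i in range(cells.shape[0]):
        for j in range(cells.shape[1] - 2):
            feats.append(cells[i, j:j+3].ravel())          # horizontal triplet
    for i in range(cells.shape[0] - 2):
        for j in range(cells.shape[1]):
            feats.append(cells[i:i+3, j].ravel())          # vertical triplet
    return np.array(feats)

img = np.random.rand(64, 32)                               # stand-in for a body-part window
bins, mag = quantized_orientations(img)
print(triplet_features(bins, mag).shape)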

High Level Computer Vision - May 23, 2o16 14

• Learning feature occurrence and co-occurrence on training data ‣ Representing the body parts (BP) with feature distributions

‣ Using aligned examples for training

Positive examples

Occurrence and co-occurrence probability distributions

Negative examples

Example 2: Multi-Scale Parts

High Level Computer Vision - May 23, 2o16 15

• Classifiers ‣ Weak classifiers based on a single feature

‣ Weak classifiers based on feature co-occurrence

‣ Strong classifier trained with AdaBoost

Mikolajczyk, Schmid, Zisserman, ‘04

Example 2: Multi-Scale Parts
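A small sketch of the boosting step, assuming each training window is already described by a feature vector (for instance the orientation triplets above). Synthetic data stands in for real part windows; scikit-learn's AdaBoostClassifier with its default depth-1 decision stumps plays the role of single-feature weak classifiers combined into a strong classifier. Co-occurrence features are not modeled here.

# Sketch of the strong part classifier: AdaBoost over decision stumps, where each
# stump looks at a single feature dimension. All data here is synthetic.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X_pos = rng.normal(0.6, 0.2, size=(300, 24))      # hypothetical part windows
X_neg = rng.normal(0.4, 0.2, size=(300, 24))      # hypothetical background windows
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(300), np.zeros(300)])

strong = AdaBoostClassifier(n_estimators=100)     # default weak learner: depth-1 stump
strong.fit(X, y)
print("training accuracy:", strong.score(X, y))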

High Level Computer Vision - May 23, 2o16 16

[Figure: detection pipeline — input image → feature scale-space → head / upper body / legs log-likelihood maps → classifiers → part detections]

Mikolajczyk, Schmid, Zisserman, ‘04

Example 2: Multi-Scale Parts

High Level Computer Vision - May 23, 2o16 17

[Figure: detection results (a)–(d) — individual part detectors vs. joint likelihood body detector]

Example 2: Human Detection Results

High Level Computer Vision - May 23, 2o16 18

• Fawlty Towers

Human Detection Results

Mikolajczyk, Schmid, Zisserman, ‘04

High Level Computer Vision - May 23, 2o16 19

Discussion

• Approach ‣ Manually selected set of parts - Specific detector trained for each part

‣ Spatial model trained on part activations

‣ Evaluate joint likelihood of part activations

• Advantages ‣ Parts have intuitive meaning.

‣ Standard detection approaches can be used for each part (e.g. SVMs or AdaBoost).

‣ Works well for specific categories.

• Disadvantages ‣ Parts need to be selected manually

- Semantically motivated parts sometimes don’t have a simple appearance distribution

- No guarantee that some important part hasn’t been missed

‣ When switching to another category, the model has to be rebuilt from scratch.

⇒ Goal: Model that can be automatically learned for many categories

High Level Computer Vision - May 23, 2o16 20

Part-Based Models - Overview Today (more next week)

• Part-Based using Manual Labeling of Parts ‣ Detection by Components ‣ Multi-Scale Parts

• The Constellation Model ‣ automatic discovery of parts and part-structure

• The Implicit Shape Model (ISM) ‣ parts obtained by clustering interest-points

‣ star-model to model configuration of parts

High Level Computer Vision - May 23, 2o16 21

[Figure: fully connected shape model over parts x1 … x6]

Weber, Welling, Perona, ’00; Fergus, Zisserman, Perona, 03

Constellation of Parts

High Level Computer Vision - May 23, 2o16 22

Automatic Part Learning

• Basic idea consists of two steps ‣ “Part” candidates in each image

- take the output regions of an interest point detector as part candidates (use a scale-invariant interest point detector for this)

- the interest point detector “guarantees” (sort of ;-) that similar structures will be detected in all images (keyword: repeatability)

‣ “Part learning” - find those regions that occur repeatedly on different instances of the same object:

- for this: group (= cluster) the extracted regions to find those that are characteristic for the object category.

Fergus, Zisserman, Perona, ‘03

High Level Computer Vision - May 23, 2o16 23

Representation of Appearance

[Figure: interest point detection → size-normalized 11x11 patch → luminance normalization → projection onto PCA basis (coefficients c1 … c15)]

Fergus, Zisserman, Perona, ‘03
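A minimal sketch of this appearance representation, assuming patches can be cut out around detected interest points; random patch locations stand in for a real (scale-invariant) detector, while the 11x11 patch size, luminance normalization and 15 PCA coefficients follow the slide.

# Sketch: extract 11x11 patches, normalize luminance, project onto a 15-dim PCA basis.
import numpy as np
from sklearn.decomposition import PCA

def sample_patches(img, n=200, size=11):
    h, w = img.shape
    ys = np.random.randint(0, h - size, n)         # stand-in for interest point locations
    xs = np.random.randint(0, w - size, n)
    patches = np.stack([img[y:y+size, x:x+size].ravel() for y, x in zip(ys, xs)])
    # Luminance normalization: zero mean, unit variance per patch.
    patches = patches - patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8
    return patches

img = np.random.rand(240, 320)                      # stand-in for a training image
patches = sample_patches(img)
pca = PCA(n_components=15).fit(patches)             # learn the PCA basis on training patches
coeffs = pca.transform(patches)                     # c1 ... c15 per patch
print(coeffs.shape)                                 # (200, 15)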

High Level Computer Vision - May 23, 2o16 24

Selected Features & “Parts” (=feature clusters)

interest points / 100 clusters

Weber, Welling, Perona, ‘00

High Level Computer Vision - May 23, 2o16 25

• Repeating structures (clusters in appearance space and in location space) are more likely to belong to the object category than to the background. ⇒ Clusters should mainly represent objects.

Weakly Supervised Training

200 images containing faces 200 background images

Weber, Welling, Perona, ‘00

High Level Computer Vision - May 23, 2o16 26

Constellation Model

• Joint model for appearance and structure (= shape) ‣ X: positions, A: part appearance, S: scale ‣ h: hypothesis = assignment of features (in the image) to parts (of the model)

[Figure: likelihood factors — probability of detection, Gaussian shape pdf, Gaussian part appearance pdf, Gaussian relative scale pdf (over log-scale)]

Weber, Welling, Perona, ’00; Fergus, Zisserman, Perona, 03
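A toy evaluation of one hypothesis h under a constellation-style model: the score is a sum of Gaussian log-densities for shape (joint part positions with a full covariance, i.e. fully connected), per-part appearance and relative log-scale. All parameter values are invented for illustration, and the detection/occlusion term of the full model is omitted.

# Sketch: log-likelihood of one hypothesis (assignment of image features to parts).
import numpy as np
from scipy.stats import multivariate_normal, norm

P = 3                                               # number of parts (toy value)
mu_shape = np.array([0., 0., 30., 5., 15., 25.])    # mean (x, y) of the P parts, stacked
cov_shape = np.eye(2 * P) * 16.0                    # full covariance -> fully connected model
mu_app = np.zeros((P, 15)); cov_app = [np.eye(15)] * P   # per-part appearance Gaussians
mu_scale = np.zeros(P); sigma_scale = np.full(P, 0.3)    # relative log-scale Gaussians

def hypothesis_log_likelihood(positions, appearances, log_scales):
    ll = multivariate_normal(mu_shape, cov_shape).logpdf(positions.ravel())  # shape term
    for p in range(P):
        ll += multivariate_normal(mu_app[p], cov_app[p]).logpdf(appearances[p])  # appearance
        ll += norm(mu_scale[p], sigma_scale[p]).logpdf(log_scales[p])            # rel. scale
    return ll

pos = np.array([[1., -2.], [29., 6.], [16., 24.]])  # candidate feature positions
app = np.zeros((P, 15)); scales = np.zeros(P)
print(hypothesis_log_likelihood(pos, app, scales))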

High Level Computer Vision - May 23, 2o16 27

Training Procedure

• Need to solve two problems ‣ Select a subset of appearance clusters as part candidates

- Greedy strategy - Start with a 3-part model, then test whether an additional part improves the results

‣ Learn the parameters of their joint probability density over appearance & structure

- Expectation Maximization (EM) algorithm

Weber, Welling, Perona, ‘00

High Level Computer Vision - May 23, 2o16 28

• Task: Estimation of model parameters • Chicken and Egg type problem, since we initially know neither:

‣ Model parameters

‣ Assignment of regions to foreground/background

• Let the assignments be a hidden variable and use the EM algorithm to learn them together with the model parameters

Learning

High Level Computer Vision - May 23, 2o16 29

• Find regions: their location, scale & appearance • Initialize model parameters • Use EM and iterate to convergence

‣ E-step: Compute assignments for which regions are foreground/background

‣ M-step: Update model parameters

• Trying to maximize likelihood – consistency in shape & appearance

Learning Procedure
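A heavily simplified EM sketch of this chicken-and-egg problem: with a 1-D stand-in for feature appearance, the loop alternates between soft foreground/background assignments (E-step) and parameter updates (M-step). The real model is of course far richer (parts, shape, scale), but the structure of the iteration is the same.

# Sketch: EM for a Gaussian "part" plus uniform background, on synthetic 1-D data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(2.0, 0.3, 300),   # "part" features
                       rng.uniform(-5, 5, 700)])    # background clutter

mu, sigma, pi_fg = 0.0, 1.0, 0.5                    # initial model parameters
for _ in range(50):
    # E-step: responsibility that each feature belongs to the foreground part.
    fg = pi_fg * norm(mu, sigma).pdf(data)
    bg = (1 - pi_fg) * (1.0 / 10.0)                 # uniform background density on [-5, 5]
    r = fg / (fg + bg)
    # M-step: re-estimate parameters from the soft assignments.
    mu = np.sum(r * data) / np.sum(r)
    sigma = np.sqrt(np.sum(r * (data - mu) ** 2) / np.sum(r))
    pi_fg = r.mean()

print(f"mu={mu:.2f}, sigma={sigma:.2f}, foreground fraction={pi_fg:.2f}")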

High Level Computer Vision - May 23, 2o16 30

Experiments

• Data sets ‣ Motorbikes, Airplanes, Faces, Cars from side and behind, Spotted cats ‣ and background images ‣ Between 200 and 800 images per category

• Training ‣ 50% of images ‣ position of object unknown within image (called weakly supervised)

• Testing ‣ 50% of images ‣ Simple object present/absent test ‣ ROC equal error rate computed, using background set of images

Fergus, Zisserman, Perona, ‘03

High Level Computer Vision - May 23, 2o16 31

Fergus, Zisserman, Perona, ‘03

Example: Motorbikes - Part Hypotheses

High Level Computer Vision - May 23, 2o16 32

Fergus, Zisserman, Perona, ‘03

Example: Motorbikes - Learned Parts

High Level Computer Vision - May 23, 2o16 33

Equal error rate: 7.5%

Fergus, Zisserman, Perona, ‘03

Motorbikes - Constellation Model

High Level Computer Vision - May 23, 2o16 34

Fergus, Zisserman, Perona, ‘03

Background Images

High Level Computer Vision - May 23, 2o16 35

Equal error rate: 4.6%

Fergus, Zisserman, Perona, ‘03

Frontal Faces - Constellation Model

High Level Computer Vision - May 23, 2o16 36

Equal error rate: 9.8%

Airplanes - Constellation Model

Fergus, Zisserman, Perona, ‘03

High Level Computer Vision - May 23, 2o16 37

Equal error rate: 10.0%

Fergus, Zisserman, Perona, ‘03

Spotted Cats - Constellation Model

High Level Computer Vision - May 23, 2o16 38

Equal error rate: 9.7%

Fergus, Zisserman, Perona, ‘03

Cars (Rear Views) - Constellation Model

High Level Computer Vision - May 23, 2o16 39

Fergus, Zisserman, Perona, ‘03

Robustness of the Algorithm

High Level Computer Vision - May 23, 2o16 40

Scale-invariant learning and recognition:

Fergus, Zisserman, Perona, ‘03

ROC equal error rates

High Level Computer Vision - May 23, 2o16 41

Discussion

• Advantages ‣ Works well for different object categories ‣ Can adapt to categories where

- Shape/structure is more important - Appearance is more important

‣ Everything is learned from training data ‣ Weakly-supervised training possible

• Disadvantages ‣ Model contains many parameters that need to be estimated

‣ Computational cost increases exponentially with the number of parts: the number of feature-to-part assignments that must be searched grows as O(N^P)

High Level Computer Vision - May 23, 2o16 42

Part-Based Models - Today

• Part-Based using Manual Labeling of Parts ‣ Detection by Components ‣ Multi-Scale Parts

• The Constellation Model ‣ automatic discovery of parts and part-structure

• The Implicit Shape Model (ISM) ‣ parts obtained by clustering interest-points

‣ star-model to model configuration of parts

High Level Computer Vision - May 23, 2o16

Implicit Shape Model: Object Categorization

• Goals ‣ Learn to recognize object categories

‣ Detect and localize them in real-world scenes

‣ Segment objects from background

• Combination with top-down segmentation ‣ Initial hypothesis generation

‣ Category-specific figure-ground segmentation - used to verify object hypothesis

43

“cow”

“motorbike”

“car”

High Level Computer Vision - May 23, 2o16 44

Codebook Representation

• Extraction of local object patches ‣ Interest points (e.g. Harris detector, Hessian-Laplace, DoG, ...) ‣ inspired by [Agarwal & Roth, 02]

• Collect patches from whole training set ‣ Example:

High Level Computer Vision - May 23, 2o16

Appearance Codebook

• Clustering Results ‣ Visual similarity preserved ‣ Wheel parts, window corners, fenders, ...

45

High Level Computer Vision - May 23, 2o16

Learning the Spatial Layout

• For every codebook entry, store possible “occurrences”

46

‣ Object identity ‣ Pose ‣ Relative position


For a new image, let the matched patches vote for possible object positions

High Level Computer Vision - May 23, 2o16 47

Implicit Shape Model - Representation

• Learn appearance codebook ‣ Extract patches at DoG interest points

‣ Agglomerative clustering ⇒ codebook

• Learn spatial distributions ‣ Match codebook to training images

‣ Record matching positions on object

[Figure: 105 training images (+ motion segmentation) → appearance codebook → spatial occurrence distributions]
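A sketch of building an ISM-style codebook and occurrence table, assuming patch descriptors and their offsets to the object center are already extracted from training images. Agglomerative clustering with a distance threshold is used in the spirit of the slide, but the descriptor size, threshold and linkage are stand-in choices.

# Sketch: cluster patch descriptors into a codebook, then record, for every codebook
# entry, all relative positions ("occurrences") seen on the training objects.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(3)
descriptors = rng.normal(size=(500, 32))            # hypothetical patch descriptors
offsets = rng.normal(scale=20.0, size=(500, 2))     # (dx, dy) to the object center

clustering = AgglomerativeClustering(n_clusters=None, distance_threshold=6.0,
                                     linkage="average").fit(descriptors)
codebook_ids = clustering.labels_

occurrences = {}
for cid, off in zip(codebook_ids, offsets):
    occurrences.setdefault(cid, []).append(off)     # occurrence table per codebook entry

print(len(set(codebook_ids)), "codebook entries")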

High Level Computer Vision - May 23, 2o16

Object Detection: ISM (Implicit Shape Model)

• Appearance of parts: Implicit Shape Model (ISM) [Leibe, Seemann & Schiele, CVPR 2005]

48


High Level Computer Vision - May 23, 2o16 49

Spatial Models for Categorization

[Figure: “star” shape model vs. fully connected shape model over parts x1 … x6]

‣ Fully connected shape model: e.g. Constellation Model ‣ Parts fully connected ‣ Recognition complexity: O(N^P) ‣ Method: Exhaustive search

‣ “Star” shape model: e.g. ISM (Implicit Shape Model) ‣ Parts mutually independent ‣ Recognition complexity: O(N·P) ‣ Method: Generalized Hough Transform

High Level Computer Vision - May 23, 2o16 50

Object Categorization Procedure

[Figure: image patch → interest points → matched codebook entries (interpretations I) → probabilistic voting → object position]
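A sketch of the voting step of the (probabilistic) Generalized Hough Transform used for recognition, reusing the occurrence table idea from the previous sketch. Splitting the vote weight uniformly over matched interpretations and stored occurrences is a simplification of the probabilistic formulation.

# Sketch: matched codebook entries cast weighted votes for the object center; the
# strongest cell in a coarse accumulator is returned as the object hypothesis.
import numpy as np

def cast_votes(feature_positions, matched_entries, occurrences):
    votes = []
    for pos, entries in zip(feature_positions, matched_entries):
        w_entry = 1.0 / len(entries)                # split weight over interpretations
        for cid in entries:
            occs = occurrences[cid]
            w = w_entry / len(occs)                 # ... and over stored occurrences
            for off in occs:
                votes.append((pos[0] + off[0], pos[1] + off[1], w))
    return np.array(votes)

def best_hypothesis(votes, bin_size=10):
    # Accumulate votes on a coarse grid and return the strongest cell center.
    xy = np.floor(votes[:, :2] / bin_size).astype(int)
    acc = {}
    for (bx, by), w in zip(map(tuple, xy), votes[:, 2]):
        acc[(bx, by)] = acc.get((bx, by), 0.0) + w
    (bx, by), score = max(acc.items(), key=lambda kv: kv[1])
    return (bx + 0.5) * bin_size, (by + 0.5) * bin_size, score

# Tiny hypothetical example: two features whose votes agree around (100, 80).
occurrences = {0: [(-20.0, 0.0)], 1: [(15.0, -5.0), (10.0, 0.0)]}
votes = cast_votes([(120, 80), (88, 83)], [[0], [1]], occurrences)
print(best_hypothesis(votes))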

High Level Computer Vision - May 23, 2o16 51

Object Categorization Procedure

[Figure: interest points → matched codebook entries → probabilistic voting → continuous voting space → backprojection of maximum → backprojected hypothesis → refined hypothesis (uniform sampling)]

High Level Computer Vision - May 23, 2o16 52

• 1st hypothesis

Car Categorization - Qualitative Results

2nd hypothesis

4th hypothesis

7th hypothesis

8th hypothesis

High Level Computer Vision - May 23, 2o16 53

Original image / Interest points / Matched patches

Results on Cows

Prob. Votes

High Level Computer Vision - May 23, 2o16 54

1st hypothesis

Results on Cows

High Level Computer Vision - May 23, 2o16 55

2nd hypothesis

Results on Cows

High Level Computer Vision - May 23, 2o16 56

Results on Cows

3rd hypothesis

High Level Computer Vision - May 23, 2o16 57

More Results on Cows…

2nd hypothesis / 8th hypothesis / 14th hypothesis / 16th hypothesis

High Level Computer Vision - May 23, 2o16

Detection Results

• Qualitative Performance (UIUC database, 200 cars) ‣ Recognizes different kinds of cars ‣ Robust to clutter, occlusion, low contrast, noise

58

Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16 59

• Results on UIUC car database ‣ (170 images containing 200 cars)

‣ Good performance, similar to Constellation Model

‣ Still some false positives

Quantitative Evaluation


High Level Computer Vision - May 23, 2o16 60

• Scale-invariant feature selection ‣ Scale-invariant interest points ‣ Rescale extracted patches ‣ Match to constant-size codebook

• Generate scale votes ‣ Scale as 3rd dimension in voting space

‣ Search for maxima in 3D voting space

Scale Invariance

[Figure: 3-D voting space with axes x, y, s and search window]

Leibe, Schiele ‘04
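The scale-invariant extension can be sketched by letting each vote carry a scale coordinate and searching for maxima in the 3-D (x, y, s) space; here it is assumed, for illustration, that each occurrence also stores the scale at which it was observed during training.

# Sketch: scale votes — offsets are rescaled by the relative scale of the feature
# with respect to the stored occurrence, and that relative scale becomes the s-vote.
def cast_scale_votes(pos, scale, entry_occurrences):
    votes = []
    for (dx, dy, s_occ) in entry_occurrences:       # occurrence: (dx, dy, training scale)
        s_vote = scale / s_occ                      # relative scale of feature vs. occurrence
        votes.append((pos[0] + dx * s_vote, pos[1] + dy * s_vote, s_vote))
    return votes

print(cast_scale_votes((120, 80), 2.0, [(-20.0, 0.0, 1.0)]))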

High Level Computer Vision - May 23, 2o16 61

Qualitative Detection Results

Altogether, objects are detected across scale differences of up to a factor of 5.0!

scale = 0.75

scale = 3.71

High Level Computer Vision - May 23, 2o16 62

Discussion

• Approach: Implicit Shape Model ‣ Generate appearance codebook ‣ Learn spatial occurrence distribution for each codebook entry ‣ Recognition using a probabilistic extension of the Generalized Hough Transform

• Advantages ‣ Highly flexible shape model

‣ Each image feature acts independently

‣ Possible to learn good object models already from very few (50-100) training examples

‣ Recognition is fast!

High Level Computer Vision - May 23, 2o16 63

Discussion (2)

• Disadvantages ‣ Each feature acts independently

⇒ Assumption violated if sampled patches overlap

‣ Only loose constraints on object shape

‣ False positives on structured regions of the background

⇒ Hypothesis verification needed

• Idea: Combination with top-down segmentation ‣ Initial hypothesis generation ‣ Category-specific figure-ground segmentation ‣ Hypothesis verification using segmentation

High Level Computer Vision - May 23, 2o16 64

“Closing the Loop”

[Figure: interest points → matched codebook entries → probabilistic voting → continuous voting space → backprojection of maximum → backprojected hypothesis → segmentation → refined hypothesis (uniform sampling)]

High Level Computer Vision - May 23, 2o16 65

Segmentation: Probabilistic Formulation

• Influence of patch e on object hypothesis

• Backprojection to patches e and pixels p:

Leibe, Schiele, ‘03

High Level Computer Vision - May 23, 2o16 66

Segmentation: Probabilistic Formulation

• Resolve patches by interpretations (codebook entries) I

Segmentation information

Influence on object hypothesis

⇒ Store patch segmentation mask for every occurrence position!

Leibe, Schiele, ‘03
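A sketch of the resulting top-down segmentation: every vote that supports the winning hypothesis backprojects its stored patch segmentation mask, and the per-pixel p(figure) is the weighted average of those masks. Mask shape, weights and the averaging are illustrative simplifications of the probabilistic formulation above.

# Sketch: backproject stored patch masks of supporting votes into a p(figure) map.
import numpy as np

def backproject(img_shape, supporting_votes, patch_size=16):
    p_fig = np.zeros(img_shape)
    weight = np.zeros(img_shape)
    for (x, y), w, mask in supporting_votes:        # mask: patch_size x patch_size in [0, 1]
        x0, y0 = int(x) - patch_size // 2, int(y) - patch_size // 2
        p_fig[y0:y0+patch_size, x0:x0+patch_size] += w * mask
        weight[y0:y0+patch_size, x0:x0+patch_size] += w
    # Weighted average where any vote contributed, zero elsewhere.
    return np.divide(p_fig, weight, out=np.zeros_like(p_fig), where=weight > 0)

mask = np.ones((16, 16))                            # stand-in for a stored figure mask
p = backproject((120, 160), [((60, 40), 0.7, mask), ((70, 44), 0.3, mask)])
print(p[40, 60])                                    # per-pixel confidence at (x=60, y=40)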

High Level Computer Vision - May 23, 2o16 67

[Figure: original image, p(figure) map, p(ground) map, resulting segmentation]

High Level Computer Vision - May 23, 2o16

[Figure: original image, p(figure) map, p(ground) map, resulting segmentation]

• Interpretation of p(figure) map ‣ per-pixel confidence in object hypothesis

‣ Use for hypothesis verification

High Level Computer Vision - May 23, 2o16 69

Top-Down Driven Segmentation

• Example 1:

‣ Pedestrian is segmented out since it does not contribute to the car hypothesis

• Example 2:

[Figures: image, hypothesis, segmentation, p(figure) (Example 1); image, p(figure), segmentation, sub-image, contours (Example 2)]

Leibe, Schiele, ‘03

High Level Computer Vision - May 23, 2o16 70

Motorbikes: Segmentation Results

Leibe, Schiele, ‘04

High Level Computer Vision - May 23, 2o16 71

• Secondary hypotheses ‣ Desired property of algorithm! ⇒ robustness to partial occlusion

‣ Standard solution: reject based on bounding box overlap ⇒ Problematic - may lead to missing detections!

⇒ Use segmentations to resolve ambiguities instead

Hypothesis Verification: Motivation

Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16 72

Formalization in MDL Framework

• Savings of a hypothesis [Leonardis, IJCV’95]

S_h = K_0 · S_area − K_1 · S_model − K_2 · S_error

• with ‣ S_area : #pixels N in segmentation

‣ S_model : model cost, assumed constant

‣ S_error : estimate of error, according to

S_error = Σ_{p ∈ Seg(h)} (1 − p(p = figure | h))

• Final form of the equation

S_h = −K_1/K_0 + (1 − K_2/K_0) · N + (K_2/K_0) · Σ_{p ∈ Seg(h)} p(p = figure | h)

High Level Computer Vision - May 23, 2o16 73

Formalization in MDL Framework (2)

• Savings of a combined hypothesis

S_{h1 ∪ h2} = S_h1 + S_h2 − S_area(h1 ∩ h2) + S_error(h1 ∩ h2)

• Goal: Find the combination (vector m) that best explains the image ‣ Quadratic Boolean Optimization problem [Leonardis et al, 95]

‣ In practice it is often sufficient to compute a greedy approximation (see the sketch below)
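A toy sketch of the MDL-based verification using the savings terms defined above: hypotheses are scored by their savings and accepted greedily, discounting image area that is already explained by an accepted, overlapping hypothesis. The constants and the two toy hypotheses are placeholders, and the pairwise combined-savings criterion is reduced to a simple overlap penalty.

# Sketch: savings of a hypothesis and a greedy approximation to hypothesis selection.
import numpy as np

K0, K1, K2 = 1.0, 5.0, 0.9                          # placeholder constants

def savings(p_figure):
    # p_figure: per-pixel p(p = figure | h) over the hypothesized segmentation.
    N = p_figure.size
    s_error = np.sum(1.0 - p_figure)
    return K0 * N - K1 * 1.0 - K2 * s_error         # S_model assumed constant (= 1)

def greedy_select(hypotheses, overlap):
    # hypotheses: list of per-pixel p(figure) maps; overlap[i][j]: #shared pixels.
    chosen = []
    for i in sorted(range(len(hypotheses)), key=lambda i: -savings(hypotheses[i])):
        penalty = sum(overlap[i][j] for j in chosen)    # don't count shared area twice
        if savings(hypotheses[i]) - K0 * penalty > 0:
            chosen.append(i)
    return chosen

h1 = np.full((50, 50), 0.9)                         # strong hypothesis
h2 = np.full((50, 50), 0.2)                         # weak hypothesis overlapping h1
overlap = [[0, 2000], [2000, 0]]
print(greedy_select([h1, h2], overlap))             # -> [0]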

High Level Computer Vision - May 23, 2o16 74

Performance after Verification Stage

• Direct Comparison

195/200 correct detections, 5 false positives

High Level Computer Vision - May 23, 2o16 75

Other Categories: Cows

• Articulated Object Recognition ‣ Use a set of cow sequences (from Derek Magee @ Leeds)

• Extract frames from subset of sequences

Train on 113 images (+ segmentation)

Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16 76

Cows: Results on Novel Sequences

• Object Detections ‣ Single-frame recognition - No temporal continuity used!

Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16 77

• Segmentations from interest points ‣ Single-frame recognition - No temporal continuity used!

Leibe, Leonardis, Schiele, ‘04

Cows: Results on Novel Sequences (2)

High Level Computer Vision - May 23, 2o16

• Segmentations from refined hypotheses ‣ Single-frame recognition - No temporal continuity used!

78

Cows: Results on Novel Sequences (3)Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16

• Object Detections

79

Another ExampleLeibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16

• Segmentations from interest points

80

Another Example (2)Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16

• Segmentations from refined hypotheses

81

Another Example (3)Leibe, Leonardis, Schiele, ‘04

High Level Computer Vision - May 23, 2o16 82

Robustness to Occlusion

• Quantitative results (14 sequences, 2217 frames total) ‣ No difficulties recognizing fully visible cows (99.1% recall)

‣ Robust to significant partial occlusion!

High Level Computer Vision - May 23, 2o16 83

Application: Pedestrian Detection

• Small training set ‣ 105 images (+motion segmentations), 210 silhouettes

• Challenging real-world test set ‣ 209 crowded scenes containing 595 pedestrians


High Level Computer Vision - May 23, 2o16 84

Problem for Articulated Objects

• Overcomplete Segmentations ‣ Flexible spatial model

‣ Segmentations may contain superfluous body parts ⇒ Score of neighboring hypotheses may be reduced!

High Level Computer Vision - May 23, 2o16 85

Qualitative Results

Leibe, Seemann, Schiele, ‘05

High Level Computer Vision - May 23, 2o16 86

Estimated Articulations

Walking direction

Detection scale

Body pose

Problem cases

High Level Computer Vision - May 23, 2o16 87

Detections at EER (Equal Error Rate)

Single-frame recognition – no temporal continuity used! Leibe, Seemann, Schiele, ‘05

High Level Computer Vision - May 23, 2o16 88

Example Detections