Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross...

25
Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Transcript of Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross...

Page 1: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Pedro FelzenszwalbBrown University

Joint work with Ross Girshick and David McAllester

Object Detection Grammars

Tuesday, November 22, 11

Page 2: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

The challenge

[Pascal VOC images]

Objects in each category vary greatly in appearance

Tuesday, November 22, 11

Page 3: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

An evolution of models• HOG features/templates [Dalal, Triggs 2005]

- Invariance to photometric variation and small deformations + SVM training

• Deformable part models (DPM)

- HOG templates + LSVM training

- Invariance to larger deformations

• Mixtures of DPM

- Allow significant variations due tomajor poses and subtypes

Tuesday, November 22, 11

Page 4: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Deformable models

• Can take us a long way...

• But not all the way

Tuesday, November 22, 11

Page 5: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Structure variation• Object in rich categories have variable structure

• These are NOT deformations

• Mixture of deformable models? too many combined choices

- There is always something you never saw before

• Bag of words? not enough structure

• Non-parametric? doesn’t generalizeTuesday, November 22, 11

Page 6: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Richer part-based models• Some parts should be optional

- A person could have a hat or not

• There should be subtypes (mixtures) at the part level

- A person could wear a skirt or pants

- A mouth can be smiling or frowning

• Parts should be reusable

- A wheel model can be used twice in a car model

- Same wheel model can be used in car and truck model et

• This can be done using a grammar/compositional model

- [Jin, Geman, 2006], [Zhu, Mumford, 2006], [Zhu, Yuille, 2005], etc.

Tuesday, November 22, 11

Page 7: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Object detection grammars

• Objects defined in terms of other objects through production rules

- face -> eyes, nose, mouth

• Objects can be defined by multiple productions

- legs -> pants

- legs -> skirt

- Subtypes, structure variability

• Deformation rules allow parts to move relative to each other

- Spatial variability

• Same object can be used in different productions

- Shared parts

(A tractable compositional framework)

Tuesday, November 22, 11

Page 8: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

- person -> face, trunk, arms, lower-part

- face -> eyes, nose, mouth

- face -> hat, eyes, nose, mouth

- hat -> baseball-cap

- hat -> sombrero

- lower-part -> shoe, shoe, legs

- lower-part -> bare-foot, bare-foot, legs

- legs -> pants

- legs -> skirt

shoe

lower-part

person

eyes

face

legsshoenose

mouth

pants

trunk arms

Tuesday, November 22, 11

Page 9: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Relationship to pictorial structures / DPM

• Pictorial structure

- parts (local appearance)

- springs (spatial relationships)

- parts and springs forms a graph --- structure is fixed

• Object detection grammar

- Grammar generates tree of symbols --- structure is variable

- Location of symbol is related to location of parent

- Appearance model associated with each terminal

Tuesday, November 22, 11

Page 10: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Formalism

• Set of terminal symbols T

- (templates)

• Set of nonterminal symbols N

- (objects/parts)

• Set of placements Ω within an image

• Placed symbol X(ω)

- X ∈ T ⋃ N

- ω ∈ Ω

eye((100,80),10)

face((90,10),50)ω might be (x,y) position and scale

Tuesday, November 22, 11

Page 11: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Production rules• Productions define expansions of nonterminals into bags of symbols

• We can expand a nonterminal into a bag of terminals by repeatedly applying productions

- There are choices along the way

- Expansion has score = sum of scores of productions used along the way

- X(ω) ~~s~~> { A1(ω1), ... , An(ωn) } (sequence of expansions)

- Leads to a derivation tree

placed nonterminal

Bag of placed symbols

score

X(ω) --s--> { Y1(ω1), ... , Yn(ωn) }

Tuesday, November 22, 11

Page 12: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Appearance for terminals• Each terminal has an appearance model

- Defined by a scoring function f(A,ω,I)

- Score for placing terminal A at position ω within image I

FA I f(A,ω,I)

f(A,ω,I) might be the response of a HOG filter FA at position ω within I

Tuesday, November 22, 11

Page 13: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Appearance for nonterminals• We extend the appearance model from terminals to nonterminals

• Best expansion of X(ω) into a bag of placed terminals

- Takes into account

1) expansion score

2) appearance model of placed terminals at their placements

• Detect objects (any symbol) by finding high scoring placements

f(X,ω,I) = max ( s + ∑ f(Ai,ωi,I) )X(ω) ~~s~~> { A1(ω1), ... , An(ωn) }

i

Tuesday, November 22, 11

Page 14: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Implementation

• General implementation for a class of grammars(voc-release4)

- Production rules specified by schemas

- Appearance of terminals defined by HOG filters

- Inference done via dynamic programming

- Parameter learning from bounding boxes (LSVM)

Tuesday, November 22, 11

Page 15: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Isolated deformation grammars• Productions defined by two kinds of schemas

• Structure schema

- One production for each placement ω

• Deformation schema

- One production for each ω and displacement δ

• Leads to efficient algorithm for computing scores f(X,ω,I)

X(ω) --s--> { Y1(ω+δ1), ... , Yn(ω+δn) }

X(ω) --s(δ)--> { Y(ω+δ) }

Tuesday, November 22, 11

Page 16: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Face grammar

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

Tuesday, November 22, 11

Page 17: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Face grammar

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

1) Face defined by global template and parts

Tuesday, November 22, 11

Page 18: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Face grammar

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

1) Face defined by global template and parts

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

2) Parts can move relative to their idea location

Tuesday, November 22, 11

Page 19: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Face grammar

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

1) Face defined by global template and parts

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

2) Parts can move relative to their idea location

domain D = ⌦⇥� of the parameter (!, �) in a deformation rule.

An isolated deformation grammar defines a “factored” model such that the position of a symbol

in a derivation tree is only related to the position of its parent.

For example, we can define an isolated deformation grammar for faces with

N = {FACE,EYE,EYE0,MOUTH,MOUTH0},

T = {FACE.FILTER,EYE.FILTER,SMILE.FILTER,FROWN.FILTER}.

We have a structural rule for representing a face in terms of a global template and parts

8! : FACE(!) 0! {FACE.FILTER(!),EYE0(! � �l

),EYE0(! � �r

),MOUTH0(! � �m

)}.

Here �l

, �r

, �m

are constants that specify the ideal displacement between each part and the face.

Note that EYE0 appears twice in the right hand side under di↵erent ideal displacements, to account

for the two eyes in a face. We can move the parts that make up a face from their ideal locations

using deformation rules

8!, � : EYE0(!)||�||2! {EYE(! � �)},

8!, � : MOUTH0(!)||�||2! {MOUTH(! � �)}.

Finally we associate templates with the part nonterminals using structural rules

8! : EYE(!) 0! {EYE.FILTER(!)},

8! : MOUTH(!) s! {SMILE.FILTER(!)},

8! : MOUTH(!) f! {FROWN.FILTER(!)}.

The last two rules specify two di↵erent templates that can be used for the mouth at di↵erent scores

that reflect the prevalence of smiling and frowning faces.

2.6 Labeled Derivation Trees

Let G be an object detection grammar. We define a labeled derivation tree to be a rooted tree

such that: (1) each leaf v has an associated placed terminal A(!); (2) each internal node v has an

associated placed nonterminal X(!), a placed production schema, and a value z for the schema

leading to a placed production with X(!) in the left hand side and the placed symbols associated

7

3) Parts defined by templates

Tuesday, November 22, 11

Page 20: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Learning

• z is an expansion of X(ω) into a bag of terminals

• w is a vector of model parameters

- Score of each structure schema

- Deformation parameters of each deformation schema

- Appearance template for each terminal (HOG filters)

• w can be trained using Latent SVM

f(X,ω,I) = max ( s + ∑ f(Ai,ωi,I) )X(ω) ~~s~~> { A1(ω1), ... , An(ωn) }

f(X,ω,I) = max wTϕ(z)z

Tuesday, November 22, 11

Page 21: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Person detection grammar [NIPS 2011]

• Instantiation includes a variable number of parts

- 1,...,k and occluder if k < 6

• Parts can translate relative to each other

• Parts have subtypes

• Parts have deformable sub-parts (not shown)

• Beats all other methods on PASCAL 2010 (49.5 AP)

108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161

Parts 1-6 (no occlusion) Parts 1-4 & occluder Parts 1-2 & occluder

Example detections and derived filtersSubtype 1 Subtype 2

Part 1

Part 2

Part 3

Part 4

Part 5

Part 6

Occluder

Figure 1: Shallow grammar model. This figure illustrates a shallow version of our grammar model(Section 2.1). This model has six person parts and an occlusion model (“occluder”), each of whichcomes in one of two subtypes. A detection places one subtype of each visible part at a location andscale in the image. If the derivation does not place all parts it must place the occluder. Parts areallowed to move relative to each other but are constrained by deformation penalties.

We consider models with productions specified by two kinds of schemas (a schema is a template forgenerating productions). A structure schema specifies one production for each placement ! 2 ⌦,

X(!) s�! { Y1(! � �1), . . . , Yn

(! � �n

) }. (3)

Here the �i

specify constant displacements within the feature map pyramid. Structure schemas canbe used to define decompositions of objects into other objects.

Let � be the set of possible displacements within a single scale of a feature map pyramid. Adeformation schema specifies one production for each placement ! 2 ⌦ and displacement � 2 �,

X(!)↵·�(�)�! { Y (! � �) }. (4)

Here �(�) is a feature vector and ↵ is a vector of deformation parameters. Deformation schemascan be used to define deformable models. We define �(�) = (dx, dy, dx2, dy2) so that deformationscores are quadratic functions of the displacements.

The parameters of our models are defined by a weight vector w with entries for the score of eachstructure schema, the deformation parameters of each deformation schema and the filter coefficientsassociated with each terminal. Then score(T ) = w ·�(T ) where �(T ) is the sum of (sparse) featurevectors associated with each placed terminal and production in T .

2.1 A grammar model for detecting people

Each component in the person model learned by the voc-release4 system [12] is tuned to detectpeople under a prototypical visibility pattern. Based on this observation we designed, by hand, thestructure of a grammar that models visibility by using structural variability and optional parts. Forclarity, we begin by describing a shallow model (Figure 1) that places all filters at the same resolutionin the feature map pyramid. After explaining this model, we describe a deeper model that includesdeformable subparts at higher resolutions.

Fine-grained occlusion Our grammar model has a start symbol Q that can be expanded using oneof six possible structure schemas. These choices model different degrees of visibility ranging fromheavy occlusion (only the head and shoulders are visible) to no occlusion at all.

Beyond modeling fine-grained occlusion patterns when compared to the mixture models from [12]or [7], our grammar model is also richer in the following ways. In Section 5 we show that each ofthese aspects improves detection performance.

Occlusion model If a person is occluded, then there must be some cause for the occlusion — eitherthe edge of the image or an occluding object such as a desk or dinner table. We model the cause ofocclusion through an occlusion object that has a non-trivial appearance model.

3

Tuesday, November 22, 11

Page 22: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Building the model• Type in manually defined grammar

• Train parameters from bounding box annotations

- Production scores

- Deformation models

- HOG filters for terminals

162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215

Part subtypes The mixture model from [12] has two subtypes for each mixture component. Thesubtypes are forced to be mirror images of each other and correspond roughly to left-facing peopleand right-facing people. Our grammar model has two subtypes for each part, which are also forcedto be mirror images of each other. But in the case of our grammar model, the decision of which partsubtype to instantiate at detection time is independent for each part.

The shallow person grammar model is defined by the following grammar. The indices p (for part), t(for subtype), and k have the following ranges: p 2 {1, . . . , 6}, t 2 {L, R} and k 2 {1, . . . , 5}.

Q(!) sk�! { Y1(! � �1), . . . , Yk

(! � �k

), O(! � �k+1) }

Q(!) s6�! { Y1(! � �1), . . . , Y6(! � �6) }

Yp

(!) 0�! { Yp,t

(!) } Yp,t

(!)↵p,t·�(�)�! { A

p,t

(! � �) }O(!) 0�! { O

t

(!) } Ot

(!)↵t·�(�)�! { A

t

(! � �) }

The grammar has a start symbol Q with six alternate choices that derive people under varying de-grees of visibility (occlusion). Each part has a corresponding nonterminal Y

p

that is placed at someideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivationselects a subtype and displacement for each visible part. The parameters of the grammar (productionscores, deformation parameters and filters) are learned with the discriminative procedure describedin Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Deeper model We extend the shallow model by adding deformable subparts at two scales: (1)the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects,high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they “fall off the bottom” of the feature map pyramid.The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t

in the grammar above, and then adding new pro-ductions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas,the indices r (for resolution) and u (for subpart) have the ranges: r 2 {H,L}, u 2 {1, . . . , N

p

},where N

p

is the number of subparts in a top-level part Yp

.

Yp,t

(!)↵p,t·�(�)�! { Z

p,t

(! � �) }Z

p,t

(!) 0�! {Ap,t

(!), Wp,t,r,1(! � �

p,t,r,1), . . . ,Wp,t,r,Np(! � �p,t,r,Np)}

Wp,t,r,u

(!)↵p,t,r,u·�(�)�! {A

p,t,r,u

(! � �)}

We note that as in [22] our model has hierarchical deformations. The part terminal Ap,t

can moverelative to Q and the subpart terminal A

p,t,r,u

can move relative to Ap,t

.

The displacements �p,t,H,u

place the symbols Wp,t,H,u

one octave below Zp,t

in the feature mappyramid. The displacements �

p,t,L,u

place the symbols Wp,t,L,u

at the same scale as Zp,t

. We addsubparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3and N2 = 2. We find that adding additional subparts does not improve detection performance.

2.2 Inference and test time detection

Inference involves finding high scoring derivations. At test time, because images may contain mul-tiple instances of an object class, we compute the maximum scoring derivation rooted at Q(!), foreach ! 2 ⌦. This can be done efficiently using a standard dynamic programming algorithm [11].

We retain only those derivations that score above a threshold, which we set low enough to ensurehigh recall. We use box(T ) to denote a detection window associated with a derivation T . Given aset of candidate detections, we apply non-maximal suppression to produce a final set of detections.

To define box(T ) we assign a detection window size, in feature map coordinates, to each produc-tions schema that can be applied to the start symbol. This leads to detections with one of six possibleaspect ratios, depending on which production was used in the first step of the derivation. The ab-solute location and size of a detection depends on the placement of Q. For the first five productionschemas, the ideal location of the occlusion part, O, is outside of box(T ).

4

162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215

Part subtypes The mixture model from [12] has two subtypes for each mixture component. Thesubtypes are forced to be mirror images of each other and correspond roughly to left-facing peopleand right-facing people. Our grammar model has two subtypes for each part, which are also forcedto be mirror images of each other. But in the case of our grammar model, the decision of which partsubtype to instantiate at detection time is independent for each part.

The shallow person grammar model is defined by the following grammar. The indices p (for part), t(for subtype), and k have the following ranges: p 2 {1, . . . , 6}, t 2 {L, R} and k 2 {1, . . . , 5}.

Q(!) sk�! { Y1(! � �1), . . . , Yk

(! � �k

), O(! � �k+1) }

Q(!) s6�! { Y1(! � �1), . . . , Y6(! � �6) }

Yp

(!) 0�! { Yp,t

(!) } Yp,t

(!)↵p,t·�(�)�! { A

p,t

(! � �) }O(!) 0�! { O

t

(!) } Ot

(!)↵t·�(�)�! { A

t

(! � �) }

The grammar has a start symbol Q with six alternate choices that derive people under varying de-grees of visibility (occlusion). Each part has a corresponding nonterminal Y

p

that is placed at someideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivationselects a subtype and displacement for each visible part. The parameters of the grammar (productionscores, deformation parameters and filters) are learned with the discriminative procedure describedin Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Deeper model We extend the shallow model by adding deformable subparts at two scales: (1)the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects,high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they “fall off the bottom” of the feature map pyramid.The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t

in the grammar above, and then adding new pro-ductions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas,the indices r (for resolution) and u (for subpart) have the ranges: r 2 {H,L}, u 2 {1, . . . , N

p

},where N

p

is the number of subparts in a top-level part Yp

.

Yp,t

(!)↵p,t·�(�)�! { Z

p,t

(! � �) }Z

p,t

(!) 0�! {Ap,t

(!), Wp,t,r,1(! � �

p,t,r,1), . . . ,Wp,t,r,Np(! � �p,t,r,Np)}

Wp,t,r,u

(!)↵p,t,r,u·�(�)�! {A

p,t,r,u

(! � �)}

We note that as in [22] our model has hierarchical deformations. The part terminal Ap,t

can moverelative to Q and the subpart terminal A

p,t,r,u

can move relative to Ap,t

.

The displacements �p,t,H,u

place the symbols Wp,t,H,u

one octave below Zp,t

in the feature mappyramid. The displacements �

p,t,L,u

place the symbols Wp,t,L,u

at the same scale as Zp,t

. We addsubparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3and N2 = 2. We find that adding additional subparts does not improve detection performance.

2.2 Inference and test time detection

Inference involves finding high scoring derivations. At test time, because images may contain mul-tiple instances of an object class, we compute the maximum scoring derivation rooted at Q(!), foreach ! 2 ⌦. This can be done efficiently using a standard dynamic programming algorithm [11].

We retain only those derivations that score above a threshold, which we set low enough to ensurehigh recall. We use box(T ) to denote a detection window associated with a derivation T . Given aset of candidate detections, we apply non-maximal suppression to produce a final set of detections.

To define box(T ) we assign a detection window size, in feature map coordinates, to each produc-tions schema that can be applied to the start symbol. This leads to detections with one of six possibleaspect ratios, depending on which production was used in the first step of the derivation. The ab-solute location and size of a detection depends on the placement of Q. For the first five productionschemas, the ideal location of the occlusion part, O, is outside of box(T ).

4

162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215

Part subtypes The mixture model from [12] has two subtypes for each mixture component. Thesubtypes are forced to be mirror images of each other and correspond roughly to left-facing peopleand right-facing people. Our grammar model has two subtypes for each part, which are also forcedto be mirror images of each other. But in the case of our grammar model, the decision of which partsubtype to instantiate at detection time is independent for each part.

The shallow person grammar model is defined by the following grammar. The indices p (for part), t(for subtype), and k have the following ranges: p 2 {1, . . . , 6}, t 2 {L, R} and k 2 {1, . . . , 5}.

Q(!) sk�! { Y1(! � �1), . . . , Yk

(! � �k

), O(! � �k+1) }

Q(!) s6�! { Y1(! � �1), . . . , Y6(! � �6) }

Yp

(!) 0�! { Yp,t

(!) } Yp,t

(!)↵p,t·�(�)�! { A

p,t

(! � �) }O(!) 0�! { O

t

(!) } Ot

(!)↵t·�(�)�! { A

t

(! � �) }

The grammar has a start symbol Q with six alternate choices that derive people under varying de-grees of visibility (occlusion). Each part has a corresponding nonterminal Y

p

that is placed at someideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivationselects a subtype and displacement for each visible part. The parameters of the grammar (productionscores, deformation parameters and filters) are learned with the discriminative procedure describedin Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Deeper model We extend the shallow model by adding deformable subparts at two scales: (1)the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects,high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they “fall off the bottom” of the feature map pyramid.The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t

in the grammar above, and then adding new pro-ductions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas,the indices r (for resolution) and u (for subpart) have the ranges: r 2 {H,L}, u 2 {1, . . . , N

p

},where N

p

is the number of subparts in a top-level part Yp

.

Yp,t

(!)↵p,t·�(�)�! { Z

p,t

(! � �) }Z

p,t

(!) 0�! {Ap,t

(!), Wp,t,r,1(! � �

p,t,r,1), . . . ,Wp,t,r,Np(! � �p,t,r,Np)}

Wp,t,r,u

(!)↵p,t,r,u·�(�)�! {A

p,t,r,u

(! � �)}

We note that as in [22] our model has hierarchical deformations. The part terminal Ap,t

can moverelative to Q and the subpart terminal A

p,t,r,u

can move relative to Ap,t

.

The displacements �p,t,H,u

place the symbols Wp,t,H,u

one octave below Zp,t

in the feature mappyramid. The displacements �

p,t,L,u

place the symbols Wp,t,L,u

at the same scale as Zp,t

. We addsubparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3and N2 = 2. We find that adding additional subparts does not improve detection performance.

2.2 Inference and test time detection

Inference involves finding high scoring derivations. At test time, because images may contain mul-tiple instances of an object class, we compute the maximum scoring derivation rooted at Q(!), foreach ! 2 ⌦. This can be done efficiently using a standard dynamic programming algorithm [11].

We retain only those derivations that score above a threshold, which we set low enough to ensurehigh recall. We use box(T ) to denote a detection window associated with a derivation T . Given aset of candidate detections, we apply non-maximal suppression to produce a final set of detections.

To define box(T ) we assign a detection window size, in feature map coordinates, to each produc-tions schema that can be applied to the start symbol. This leads to detections with one of six possibleaspect ratios, depending on which production was used in the first step of the derivation. The ab-solute location and size of a detection depends on the placement of Q. For the first five productionschemas, the ideal location of the occlusion part, O, is outside of box(T ).

4

162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215

Part subtypes The mixture model from [12] has two subtypes for each mixture component. Thesubtypes are forced to be mirror images of each other and correspond roughly to left-facing peopleand right-facing people. Our grammar model has two subtypes for each part, which are also forcedto be mirror images of each other. But in the case of our grammar model, the decision of which partsubtype to instantiate at detection time is independent for each part.

The shallow person grammar model is defined by the following grammar. The indices p (for part), t(for subtype), and k have the following ranges: p 2 {1, . . . , 6}, t 2 {L, R} and k 2 {1, . . . , 5}.

Q(!) sk�! { Y1(! � �1), . . . , Yk

(! � �k

), O(! � �k+1) }

Q(!) s6�! { Y1(! � �1), . . . , Y6(! � �6) }

Yp

(!) 0�! { Yp,t

(!) } Yp,t

(!)↵p,t·�(�)�! { A

p,t

(! � �) }O(!) 0�! { O

t

(!) } Ot

(!)↵t·�(�)�! { A

t

(! � �) }

The grammar has a start symbol Q with six alternate choices that derive people under varying de-grees of visibility (occlusion). Each part has a corresponding nonterminal Y

p

that is placed at someideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivationselects a subtype and displacement for each visible part. The parameters of the grammar (productionscores, deformation parameters and filters) are learned with the discriminative procedure describedin Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Deeper model We extend the shallow model by adding deformable subparts at two scales: (1)the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects,high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they “fall off the bottom” of the feature map pyramid.The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t

in the grammar above, and then adding new pro-ductions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas,the indices r (for resolution) and u (for subpart) have the ranges: r 2 {H,L}, u 2 {1, . . . , N

p

},where N

p

is the number of subparts in a top-level part Yp

.

Yp,t

(!)↵p,t·�(�)�! { Z

p,t

(! � �) }Z

p,t

(!) 0�! {Ap,t

(!), Wp,t,r,1(! � �

p,t,r,1), . . . ,Wp,t,r,Np(! � �p,t,r,Np)}

Wp,t,r,u

(!)↵p,t,r,u·�(�)�! {A

p,t,r,u

(! � �)}

We note that as in [22] our model has hierarchical deformations. The part terminal Ap,t

can moverelative to Q and the subpart terminal A

p,t,r,u

can move relative to Ap,t

.

The displacements �p,t,H,u

place the symbols Wp,t,H,u

one octave below Zp,t

in the feature mappyramid. The displacements �

p,t,L,u

place the symbols Wp,t,L,u

at the same scale as Zp,t

. We addsubparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3and N2 = 2. We find that adding additional subparts does not improve detection performance.

2.2 Inference and test time detection

Inference involves finding high scoring derivations. At test time, because images may contain mul-tiple instances of an object class, we compute the maximum scoring derivation rooted at Q(!), foreach ! 2 ⌦. This can be done efficiently using a standard dynamic programming algorithm [11].

We retain only those derivations that score above a threshold, which we set low enough to ensurehigh recall. We use box(T ) to denote a detection window associated with a derivation T . Given aset of candidate detections, we apply non-maximal suppression to produce a final set of detections.

To define box(T ) we assign a detection window size, in feature map coordinates, to each produc-tions schema that can be applied to the start symbol. This leads to detections with one of six possibleaspect ratios, depending on which production was used in the first step of the derivation. The ab-solute location and size of a detection depends on the placement of Q. For the first five productionschemas, the ideal location of the occlusion part, O, is outside of box(T ).

4

162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215

Part subtypes The mixture model from [12] has two subtypes for each mixture component. Thesubtypes are forced to be mirror images of each other and correspond roughly to left-facing peopleand right-facing people. Our grammar model has two subtypes for each part, which are also forcedto be mirror images of each other. But in the case of our grammar model, the decision of which partsubtype to instantiate at detection time is independent for each part.

The shallow person grammar model is defined by the following grammar. The indices p (for part), t(for subtype), and k have the following ranges: p 2 {1, . . . , 6}, t 2 {L, R} and k 2 {1, . . . , 5}.

Q(!) sk�! { Y1(! � �1), . . . , Yk

(! � �k

), O(! � �k+1) }

Q(!) s6�! { Y1(! � �1), . . . , Y6(! � �6) }

Yp

(!) 0�! { Yp,t

(!) } Yp,t

(!)↵p,t·�(�)�! { A

p,t

(! � �) }O(!) 0�! { O

t

(!) } Ot

(!)↵t·�(�)�! { A

t

(! � �) }

The grammar has a start symbol Q with six alternate choices that derive people under varying de-grees of visibility (occlusion). Each part has a corresponding nonterminal Y

p

that is placed at someideal position relative to Q. Derivations with occlusion include the occlusion symbol O. A derivationselects a subtype and displacement for each visible part. The parameters of the grammar (productionscores, deformation parameters and filters) are learned with the discriminative procedure describedin Section 4. Figure 1 illustrates the filters in the resulting model and some example detections.

Deeper model We extend the shallow model by adding deformable subparts at two scales: (1)the same as, and (2) twice the resolution of the start symbol Q. When detecting large objects,high-resolution subparts capture fine image details. However, when detecting small objects, high-resolution subparts cannot be used because they “fall off the bottom” of the feature map pyramid.The model uses derivations with low-resolution subparts when detecting small objects.

We begin by replacing the productions from Yp,t

in the grammar above, and then adding new pro-ductions. Recall that p indexes the top-level parts and t indexes subtypes. In the following schemas,the indices r (for resolution) and u (for subpart) have the ranges: r 2 {H,L}, u 2 {1, . . . , N

p

},where N

p

is the number of subparts in a top-level part Yp

.

Yp,t

(!)↵p,t·�(�)�! { Z

p,t

(! � �) }Z

p,t

(!) 0�! {Ap,t

(!), Wp,t,r,1(! � �

p,t,r,1), . . . ,Wp,t,r,Np(! � �p,t,r,Np)}

Wp,t,r,u

(!)↵p,t,r,u·�(�)�! {A

p,t,r,u

(! � �)}

We note that as in [22] our model has hierarchical deformations. The part terminal Ap,t

can moverelative to Q and the subpart terminal A

p,t,r,u

can move relative to Ap,t

.

The displacements �p,t,H,u

place the symbols Wp,t,H,u

one octave below Zp,t

in the feature mappyramid. The displacements �

p,t,L,u

place the symbols Wp,t,L,u

at the same scale as Zp,t

. We addsubparts to the first two top-level parts (p = 1 and 2), with the number of subparts set to N1 = 3and N2 = 2. We find that adding additional subparts does not improve detection performance.

2.2 Inference and test time detection

Inference involves finding high scoring derivations. At test time, because images may contain mul-tiple instances of an object class, we compute the maximum scoring derivation rooted at Q(!), foreach ! 2 ⌦. This can be done efficiently using a standard dynamic programming algorithm [11].

We retain only those derivations that score above a threshold, which we set low enough to ensurehigh recall. We use box(T ) to denote a detection window associated with a derivation T . Given aset of candidate detections, we apply non-maximal suppression to produce a final set of detections.

To define box(T ) we assign a detection window size, in feature map coordinates, to each produc-tions schema that can be applied to the start symbol. This leads to detections with one of six possibleaspect ratios, depending on which production was used in the first step of the derivation. The ab-solute location and size of a detection depends on the placement of Q. For the first five productionschemas, the ideal location of the occlusion part, O, is outside of box(T ).

4

Tuesday, November 22, 11

Page 23: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Detections with person grammarQualitative results 1

(a) Full visibility (b) Occlusion boundaries

Figure: Example detections. Parts are blue. The occlusion part, if used,is dashed cyan. (a) Detections of fully visible people. (b) Examples wherethe occlusion part detects an occlusion boundary.

Qualitative results 2

(a) Early termination (b) Mistakes

Figure: Example detections. Parts are blue. The occlusion part, if used,is dashed cyan. (a) Detections where there is no occlusion, but a partialperson is appropriate. (b) Mistakes, where the model did not detectocclusion properly.

Qualitative results 1

(a) Full visibility (b) Occlusion boundaries

Figure: Example detections. Parts are blue. The occlusion part, if used,is dashed cyan. (a) Detections of fully visible people. (b) Examples wherethe occlusion part detects an occlusion boundary.

full visibility occlusion mistakesTuesday, November 22, 11

Page 24: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

EvolutionA Discriminatively Trained, Multiscale, Deformable Part Model

Pedro FelzenszwalbUniversity of [email protected]

David McAllesterToyota Technological Institute at Chicago

[email protected]

Deva RamananUC Irvine

[email protected]

Abstract

This paper describes a discriminatively trained, multi-scale, deformable part model for object detection. Our sys-tem achieves a two-fold improvement in average precisionover the best performance in the 2006 PASCAL person de-tection challenge. It also outperforms the best results in the2007 challenge in ten out of twenty categories. The systemrelies heavily on deformable parts. While deformable partmodels have become quite popular, their value had not beendemonstrated on difficult benchmarks such as the PASCALchallenge. Our system also relies heavily on new methodsfor discriminative training. We combine a margin-sensitiveapproach for data mining hard negative examples with aformalism we call latent SVM. A latent SVM, like a hid-den CRF, leads to a non-convex training problem. How-ever, a latent SVM is semi-convex and the training prob-lem becomes convex once latent information is specified forthe positive examples. We believe that our training meth-ods will eventually make possible the effective use of morelatent information such as hierarchical (grammar) modelsand models involving latent three dimensional pose.

1. IntroductionWe consider the problem of detecting and localizing ob-

jects of a generic category, such as people or cars, in staticimages. We have developed a new multiscale deformablepart model for solving this problem. The models are trainedusing a discriminative procedure that only requires bound-ing box labels for the positive examples. Using these mod-els we implemented a detection system that is both highlyefficient and accurate, processing an image in about 2 sec-onds and achieving recognition rates that are significantlybetter than previous systems.

Our system achieves a two-fold improvement in averageprecision over the winning system [5] in the 2006 PASCALperson detection challenge. The system also outperformsthe best results in the 2007 challenge in ten out of twenty

This material is based upon work supported by the National ScienceFoundation under Grant No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. Themodel is defined by a coarse template, several higher resolutionpart templates and a spatial model for the location of each part.

object categories. Figure 1 shows an example detection ob-tained with our person model.

The notion that objects can be modeled by parts in a de-formable configuration provides an elegant framework forrepresenting object categories [1–3, 6,10, 12, 13,15, 16, 22].While these models are appealing from a conceptual pointof view, it has been difficult to establish their value in prac-tice. On difficult datasets, deformable models are often out-performed by “conceptually weaker” models such as rigidtemplates [5] or bag-of-features [23]. One of our main goalsis to address this performance gap.

Our models include both a coarse global template cov-ering an entire object and higher resolution part templates.The templates represent histogram of gradient features [5].As in [14, 19, 21], we train models discriminatively. How-ever, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection.We also describe a simple and effective strategy for learn-ing parts from weakly-labeled data. In contrast to computa-tionally demanding approaches such as [4], we can learn amodel in 3 hours on a single CPU.

Another contribution of our work is a new methodologyfor discriminative training. We generalize SVMs for han-dling latent variables such as part positions, and introduce anew method for data mining “hard negative” examples dur-ing training. We believe that handling partially labeled datais a significant issue in machine learning for computer vi-sion. For example, the PASCAL dataset only specifies a

1

DPM (CVPR08)AP=0.27 2 DPM (PAMI10)

AP=0.36 6 DPM (voc-release4) AP=0.43

(a) (b) (c) (d) (e) (f) (g)Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks arecentred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each “pixel”shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image.(e) It’s computed R-HOG descriptor. (f,g) The R-HOG descriptor weighted by respectively the positive and the negative SVM weights.

would help to improve the detection results in more generalsituations.Acknowledgments. This work was supported by the Euro-pean Union research projects ACEMEDIA and PASCAL. Wethanks Cordelia Schmid for many useful comments. SVM-Light [10] provided reliable training of large-scale SVM’s.

References[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The8th ICCV, Vancouver, Canada, pages 454–461, 2001.

[2] V. de Poortere, J. Cant, B. Van den Bosch, J. dePrins, F. Fransens, and L. Van Gool. Efficient pedes-trian detection: a test case for svm based categorization.Workshop on Cognitive Vision, 2002. Available online:http://www.vision.ethz.ch/cogvis02/.

[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching ofpictorial structures. CVPR, Hilton Head Island, South Car-olina, USA, pages 66–75, 2000.

[4] W. T. Freeman and M. Roth. Orientation histograms forhand gesture recognition. Intl. Workshop on Automatic Face-and Gesture- Recognition, IEEE Computer Society, Zurich,Switzerland, pages 296–301, June 1995.

[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Com-puter vision for computer games. 2nd International Confer-ence on Automatic Face and Gesture Recognition, Killington,VT, USA, pages 100–105, October 1996.

[6] D. M. Gavrila. The visual analysis of human movement: Asurvey. CVIU, 73(1):82–98, 1999.

[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedes-trian detection: the protector+ system. Proc. of the IEEE In-telligent Vehicles Symposium, Parma, Italy, 2004.

[8] D. M. Gavrila and V. Philomin. Real-time object detection forsmart vehicles. CVPR, Fort Collins, Colorado, USA, pages87–93, 1999.

[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for findingpeople. IJCV, 43(1):45–68, 2001.

[10] T. Joachims. Making large-scale svm learning practical. InB. Schlkopf, C. Burges, and A. Smola, editors, Advances inKernel Methods - Support Vector Learning. The MIT Press,Cambridge, MA, USA, 1999.

[11] Y. Ke and R. Sukthankar. Pca-sift: A more distinctive rep-resentation for local image descriptors. CVPR, Washington,DC, USA, pages 66–75, 2004.

[12] D. G. Lowe. Distinctive image features from scale-invariantkeypoints. IJCV, 60(2):91–110, 2004.

[13] R. K. McConnell. Method of and apparatus for pattern recog-nition, January 1986. U.S. Patent No. 4,567,610.

[14] K. Mikolajczyk and C. Schmid. A performance evaluation oflocal descriptors. PAMI, 2004. Accepted.

[15] K. Mikolajczyk and C. Schmid. Scale and affine invariantinterest point detectors. IJCV, 60(1):63–86, 2004.

[16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detec-tion based on a probabilistic assembly of robust part detectors.The 8th ECCV, Prague, Czech Republic, volume I, pages 69–81, 2004.

[17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-basedobject detection in images by components. PAMI, 23(4):349–361, April 2001.

[18] C. Papageorgiou and T. Poggio. A trainable system for objectdetection. IJCV, 38(1):15–33, 2000.

[19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pic-tures of people. The 7th ECCV, Copenhagen, Denmark, vol-ume IV, pages 700–714, 2002.

[20] Henry Schneiderman and Takeo Kanade. Object detectionusing the statistics of parts. IJCV, 56(3):151–177, 2004.

[21] Eric L. Schwartz. Spatial mapping in the primate sensory pro-jection: analytic structure and relevance to perception. Bio-logical Cybernetics, 25(4):181–194, 1977.

[22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestriansusing patterns of motion and appearance. The 9th ICCV, Nice,France, volume 1, pages 734–741, 2003.

HOG (DT, CVPR05)AP=0.16

108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161

Parts 1-6 (no occlusion) Parts 1-4 & occluder Parts 1-2 & occluder

Example detections and derived filtersSubtype 1 Subtype 2

Part 1

Part 2

Part 3

Part 4

Part 5

Part 6

Occluder

Figure 1: Shallow grammar model. This figure illustrates a shallow version of our grammar model(Section 2.1). This model has six person parts and an occlusion model (“occluder”), each of whichcomes in one of two subtypes. A detection places one subtype of each visible part at a location andscale in the image. If the derivation does not place all parts it must place the occluder. Parts areallowed to move relative to each other but are constrained by deformation penalties.

We consider models with productions specified by two kinds of schemas (a schema is a template forgenerating productions). A structure schema specifies one production for each placement ! 2 ⌦,

X(!) s�! { Y1(! � �1), . . . , Yn

(! � �n

) }. (3)

Here the �i

specify constant displacements within the feature map pyramid. Structure schemas canbe used to define decompositions of objects into other objects.

Let � be the set of possible displacements within a single scale of a feature map pyramid. Adeformation schema specifies one production for each placement ! 2 ⌦ and displacement � 2 �,

X(!)↵·�(�)�! { Y (! � �) }. (4)

Here �(�) is a feature vector and ↵ is a vector of deformation parameters. Deformation schemascan be used to define deformable models. We define �(�) = (dx, dy, dx2, dy2) so that deformationscores are quadratic functions of the displacements.

The parameters of our models are defined by a weight vector w with entries for the score of eachstructure schema, the deformation parameters of each deformation schema and the filter coefficientsassociated with each terminal. Then score(T ) = w ·�(T ) where �(T ) is the sum of (sparse) featurevectors associated with each placed terminal and production in T .

2.1 A grammar model for detecting people

Each component in the person model learned by the voc-release4 system [12] is tuned to detectpeople under a prototypical visibility pattern. Based on this observation we designed, by hand, thestructure of a grammar that models visibility by using structural variability and optional parts. Forclarity, we begin by describing a shallow model (Figure 1) that places all filters at the same resolutionin the feature map pyramid. After explaining this model, we describe a deeper model that includesdeformable subparts at higher resolutions.

Fine-grained occlusion Our grammar model has a start symbol Q that can be expanded using oneof six possible structure schemas. These choices model different degrees of visibility ranging fromheavy occlusion (only the head and shoulders are visible) to no occlusion at all.

Beyond modeling fine-grained occlusion patterns when compared to the mixture models from [12]or [7], our grammar model is also richer in the following ways. In Section 5 we show that each ofthese aspects improves detection performance.

Occlusion model If a person is occluded, then there must be some cause for the occlusion — eitherthe edge of the image or an occluding object such as a desk or dinner table. We model the cause ofocclusion through an occlusion object that has a non-trivial appearance model.

3

Grammar (NIPS11) AP=0.47

Tuesday, November 22, 11

Page 25: Pedro Felzenszwalb Brown University · Pedro Felzenszwalb Brown University Joint work with Ross Girshick and David McAllester Object Detection Grammars Tuesday, November 22, 11

Summary• The big challenge is handling appearance variation

• Object detection grammars can express many types of models

- Mixtures of DPM

- Models with variable structure

- Models with shared parts

- etc. -- think of it as a programming language

• General implementation

- Isolated deformation grammars + HOG + LSVM

• Learning grammar structure is still an open problemTuesday, November 22, 11