Christoph Lampert, Talk part 2


Transcript of Christoph Lampert, Talk part 2

Page 1: Christoph Lampert, Talk part 2

Learning with Structured Inputs and Outputs

Christoph Lampert
Institute of Science and Technology Austria
Vienna (Klosterneuburg), Austria

Microsoft Computer Vision School 2011

Page 2: Christoph Lampert, Talk part 2

Kernel Methods in Computer Vision

Overview... Monday:
9:00 – 10:00 Introduction to Kernel Classifiers
10:00 – 10:10 — mini break —
10:10 – 10:50 More About Kernel Methods

Tuesday:
9:00 – 9:50 Conditional Random Fields
9:50 – 10:00 — mini break —
10:00 – 10:50 Structured Support Vector Machines

Slides available on my home page: http://www.ist.ac.at/~chl

Page 3: Christoph Lampert, Talk part 2

Extended versions of the lectures in book form

Foundations and Trends in Computer Graphics and Vision,
now publishers, www.nowpublishers.com/

Available as PDFs on my homepage

Page 4: Christoph Lampert, Talk part 2

"Normal" Machine Learning: f : X → R.

inputs x ∈ X can be any kind of objects
output y is a real number

▶ classification, regression, density estimation, . . .

Structured Output Learning: f : X → Y.

inputs x ∈ X can be any kind of objects
outputs y ∈ Y are complex (structured) objects

▶ images, text, audio, folds of a protein


Page 6: Christoph Lampert, Talk part 2

What is structured data?

Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.

[Figure: examples — text, molecules / chemical structures, documents/hypertext, images]

Page 7: Christoph Lampert, Talk part 2

What is structured output prediction?

Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, like in classification or regression).

Natural Language Processing:
▶ Automatic Translation (output: sentences)
▶ Sentence Parsing (output: parse trees)

Bioinformatics:
▶ Secondary Structure Prediction (output: bipartite graphs)
▶ Enzyme Function Prediction (output: path in a tree)

Speech Processing:
▶ Automatic Transcription (output: sentences)
▶ Text-to-Speech (output: audio signal)

Robotics:
▶ Planning (output: sequence of actions)

This lecture: Structured Prediction in Computer Vision

Page 8: Christoph Lampert, Talk part 2

Computer Vision Example: Semantic Image Segmentation

input: images ↦ output: segmentation masks

input space X = {images} = [0, 255]^(3·M·N)
output space Y = {segmentation masks} = {0, 1}^(M·N)

(structured output) prediction function: f : X → Y

f(x) := argmax-free form: f(x) := argmin_{y∈Y} E(x, y)

energy function E(x, y) = ∑_i w_i^⊤ ϕ_u(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ_p(y_i, y_j)

Images: [M. Everingham et al. "The PASCAL Visual Object Classes (VOC) challenge", IJCV 2010]

Page 9: Christoph Lampert, Talk part 2

Computer Vision Example: Human Pose Estimation

input: image + body model, output: model fit

input space X = {images}
output space Y = {positions/angles of K body parts} = R^(4K)

prediction function: f : X → Y

f(x) := argmin_{y∈Y} E(x, y)

energy E(x, y) = ∑_i w_i^⊤ ϕ_fit(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ_pose(y_i, y_j)

Images: [Ferrari, Marin-Jimenez, Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008.]

Page 10: Christoph Lampert, Talk part 2

Computer Vision Example: Point Matching

input: image pairs
output: mapping y : x_i ↔ y(x_i)

prediction function: f : X → Y

f(x) := argmax_{y∈Y} F(x, y)

scoring function F(x, y) = ∑_i w_i^⊤ ϕ_sim(x_i, y(x_i)) + ∑_{i,j} w_ij^⊤ ϕ_dist(x_i, x_j, y(x_i), y(x_j)) + ∑_{i,j,k} w_ijk^⊤ ϕ_angle(x_i, x_j, x_k, y(x_i), y(x_j), y(x_k))

[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]

Page 11: Christoph Lampert, Talk part 2

Computer Vision Example: Object Localization

input: image
output: object position (left, top, right, bottom)

input space X = {images}
output space Y = R^4 (bounding box coordinates)

prediction function: f : X → Y

f(x) := argmax_{y∈Y} F(x, y)

scoring function F(x, y) = w^⊤ ϕ(x, y), where ϕ(x, y) = h(x|_y) is a feature vector for an image region, e.g. bag-of-visual-words.

[M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV, 2008]

Page 12: Christoph Lampert, Talk part 2

Computer Vision Examples: Summary

Image Segmentation
y = argmin_{y∈{0,1}^N} E(x, y),  E(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j)

Pose Estimation
y = argmin_{y∈R^{4K}} E(x, y),  E(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j)

Point Matching
y = argmax_{y∈Π_n} F(x, y),  F(x, y) = ∑_i w_i^⊤ ϕ(x_i, y_i) + ∑_{i,j} w_ij^⊤ ϕ(y_i, y_j) + ∑_{i,j,k} w_ijk^⊤ ϕ(y_i, y_j, y_k)

Object Localization
y = argmax_{y∈R^4} F(x, y),  F(x, y) = w^⊤ ϕ(x, y)

Page 13: Christoph Lampert, Talk part 2

Grand Unified View

Predict structured output by maximization

y = argmax_{y∈Y} F(x, y)

of a compatibility function

F(x, y) = 〈w, ϕ(x, y)〉

that is linear in a parameter vector w.

Page 14: Christoph Lampert, Talk part 2

A generic structured prediction problem

X : arbitrary input domain
Y : structured output domain, decompose y = (y1, . . . , yk)

Prediction function f : X → Y by

f(x) = argmax_{y∈Y} F(x, y)

Compatibility function (or negative of "energy")

F(x, y) = 〈w, ϕ(x, y)〉
        = ∑_{i=1}^{k} w_i^⊤ ϕ_i(y_i, x)              (unary terms)
        + ∑_{i,j=1}^{k} w_ij^⊤ ϕ_ij(y_i, y_j, x)     (binary terms)
        + . . .                                      (higher order terms, sometimes)
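To make the decomposition concrete, here is a minimal sketch in Python (all names and numbers are illustrative, not from the lecture): the tables stand in for the precomputed inner products w_i^⊤ ϕ_i and w_ij^⊤ ϕ_ij on a chain.

```python
# Minimal sketch (illustrative numbers): evaluating a compatibility
# function F(x, y) that decomposes into unary and binary terms on a chain.
def F(y, unary, binary):
    score = sum(unary[i][y[i]] for i in range(len(y)))                    # unary terms
    score += sum(binary[i][(y[i], y[i + 1])] for i in range(len(y) - 1))  # binary terms
    return score

unary = [{"A": 1.0, "B": 0.2}, {"A": 0.1, "B": 0.7}]
binary = [{("A", "A"): 0.5, ("A", "B"): 0.0,
           ("B", "A"): 0.0, ("B", "B"): 0.5}]
print(F(("A", "B"), unary, binary))   # 1.0 + 0.7 + 0.0 = 1.7
```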

Page 15: Christoph Lampert, Talk part 2

Example: Sequence Prediction – Handwriting Recognition

X = 5-letter word images, x = (x1, . . . , x5), x_j ∈ {0, 1}^{300×80}
Y = ASCII translation, y = (y1, . . . , y5), y_j ∈ {A, . . . , Z}.

feature function has only unary terms:
ϕ(x, y) = (ϕ1(x, y1), . . . , ϕ5(x, y5)).
F(x, y) = 〈w1, ϕ1(x, y1)〉 + · · · + 〈w5, ϕ5(x, y5)〉

[Figure: input word image; predicted output: Q V E S T]

Advantage: computing y* = argmax_y F(x, y) is easy. We can find each y*_i independently, check 5 · 26 = 130 values.

Problem: only local information, we can't correct errors.

Page 16: Christoph Lampert, Talk part 2

Example: Sequence Prediction – Handwriting Recognition

X = 5-letter word images, x = (x1, . . . , x5), x_j ∈ {0, 1}^{300×80}
Y = ASCII translation, y = (y1, . . . , y5), y_j ∈ {A, . . . , Z}.

one global feature function:
ϕ(x, y) = (0, . . . , 0, Φ(x), 0, . . . , 0), with Φ(x) in the y-th position, if y ∈ D (dictionary),
ϕ(x, y) = (0, . . . , 0, 0, 0, . . . , 0) otherwise.

[Figure: input word image; predicted output: QUEST]

Advantage: access to global information, e.g. from the dictionary D.

Problem: argmax_y 〈w, ϕ(x, y)〉 has to check 26^5 = 11,881,376 values. We need separate training data for each word.

Page 17: Christoph Lampert, Talk part 2

Example: Sequence Prediction – Handwriting Recognition

X = 5-letter word images, x = (x1, . . . , x5), x_j ∈ {0, 1}^{300×80}
Y = ASCII translation, y = (y1, . . . , y5), y_j ∈ {A, . . . , Z}.

feature function with unary and pairwise terms:
ϕ(x, y) = (ϕ1(y1, x), ϕ2(y2, x), . . . , ϕ5(y5, x), ϕ1,2(y1, y2), . . . , ϕ4,5(y4, y5))

[Figure: input word image; predicted output: Q U E S T]

Compromise: computing y* is still efficient (Viterbi best path).
Compromise: neighbor information allows correction of local errors.

Page 18: Christoph Lampert, Talk part 2

Example: Sequence Prediction – Handwriting Recognition

ϕ(x, y) = (ϕ1(y1, x), . . . , ϕ5(y5, x), ϕ1,2(y1, y2), . . . , ϕ4,5(y4, y5)).
w = (w1, . . . , w5, w1,2, . . . , w4,5)
F(x, y) = 〈w, ϕ(x, y)〉 = F1(x1, y1) + · · · + F5(x5, y5) + F1,2(y1, y2) + · · · + F4,5(y4, y5).

Toy example with three letters (local readings Q V E; unary scores F1, F2, F3; pairwise scores F1,2, F2,3):

F_{i,i+1}   Q    U    V    E
Q          0.0  0.9  0.1  0.1
U          0.1  0.1  0.4  0.6
V          0.0  0.1  0.2  0.5
E          0.3  0.5  0.5  1.0

F1: Q 1.8, U 0.2, V 0.1, E 0.4
F2: Q 0.2, U 1.1, V 1.2, E 0.3
F3: Q 0.3, U 0.2, V 0.1, E 1.8

Page 19: Christoph Lampert, Talk part 2

Every y ∈ Y corresponds to a path. argmax_{y∈Y} F is the best path.

F(VUQ, x) = F1(V, x1) + F1,2(V, U) + F2(U, x2) + F2,3(U, Q) + F3(Q, x3)
          = 0.1 + 0.1 + 1.1 + 0.1 + 0.3 = 1.7


Page 21: Christoph Lampert, Talk part 2

Maximal per-letter scores. Total: 1.8 + 0.1 + 1.2 + 0.5 + 1.8 = 5.4


Page 23: Christoph Lampert, Talk part 2

Best (=Viterbi) path. Total: 1.8 + 0.9 + 1.1 + 0.6 + 1.8 = 6.2
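The two totals above can be checked mechanically. The following sketch runs Viterbi (max-sum dynamic programming) on the toy tables from Page 18; the numbers are copied from the slides, the code itself is only an illustration.

```python
# Viterbi decoding on the 3-letter toy chain from Page 18.
labels = ["Q", "U", "V", "E"]
unary = [
    {"Q": 1.8, "U": 0.2, "V": 0.1, "E": 0.4},  # F1
    {"Q": 0.2, "U": 1.1, "V": 1.2, "E": 0.3},  # F2
    {"Q": 0.3, "U": 0.2, "V": 0.1, "E": 1.8},  # F3
]
pair = {  # F_{i,i+1}[a][b], the same table is used for both transitions
    "Q": {"Q": 0.0, "U": 0.9, "V": 0.1, "E": 0.1},
    "U": {"Q": 0.1, "U": 0.1, "V": 0.4, "E": 0.6},
    "V": {"Q": 0.0, "U": 0.1, "V": 0.2, "E": 0.5},
    "E": {"Q": 0.3, "U": 0.5, "V": 0.5, "E": 1.0},
}

# Forward pass: best[i][b] = best score of any path ending in label b at node i.
best = [dict(unary[0])]
back = []
for i in range(1, len(unary)):
    scores, ptr = {}, {}
    for b in labels:
        a_best = max(labels, key=lambda a: best[-1][a] + pair[a][b])
        scores[b] = best[-1][a_best] + pair[a_best][b] + unary[i][b]
        ptr[b] = a_best
    best.append(scores)
    back.append(ptr)

# Backtrack from the best final label.
y = [max(labels, key=lambda b: best[-1][b])]
for ptr in reversed(back):
    y.append(ptr[y[-1]])
y.reverse()
print("".join(y), best[-1][y[-1]])   # -> QUE 6.2 (up to float rounding)
```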

Page 24: Christoph Lampert, Talk part 2

Example: Human Pose Estimation

Pictorial Structures for Articulated Pose Estimation:

[Figure: a tree-structured model with one output variable per body part (Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot), all connected to the image X, with unary terms such as F^(1)_top and pairwise terms such as F^(2)_top,head.]

Many popular models have unary and pairwise terms.

Certain higher order terms are also doable, see C. Rother's lectures.

[Felzenszwalb and Huttenlocher, 2000], [Fischler and Elschlager, 1973]

Page 25: Christoph Lampert, Talk part 2

Thursday/Friday's lecture: how to evaluate argmax_y F(x, y).

chain, tree (loop-free graphs): Viterbi decoding / dynamic programming
grid, arbitrary graph (loopy graphs): approximate inference (e.g. loopy BP)

Today: how to learn a good function F(x, y) from training data.

Page 26: Christoph Lampert, Talk part 2

Parameter Learning in Structured Models

Given: parametric model (family): F(x, y) = 〈w, ϕ(x, y)〉
Given: prediction method: f(x) = argmax_{y∈Y} F(x, y)
Not given: parameter vector w (high-dimensional)

Supervised Training:
Given: example pairs {(x¹, y¹), . . . , (xⁿ, yⁿ)} ⊂ X × Y,
typical inputs with "the right" outputs for them.

Task: determine a "good" w


Page 28: Christoph Lampert, Talk part 2

Parameter Learning in Structured Models

Given: parametric model (family): F(x, y) = 〈w, ϕ(x, y)〉
Given: prediction method: f(x) = argmax_{y∈Y} F(x, y)
Not given: parameter vector w (high-dimensional)

Supervised Training:
Given: example pairs {(x¹, y¹), . . . , (xⁿ, yⁿ)} ⊂ X × Y,
typical inputs with "the right" outputs for them.
Task: determine a "good" w

now–9:50 Probabilistic Training
inspired by probabilistic discriminative classifiers:
Logistic Regression ↦ Conditional Random Fields

10:00–10:50 Maximum Margin Training
inspired by maximum-margin discriminative classifiers:
Support Vector Machines ↦ Structured Output SVMs

Page 29: Christoph Lampert, Talk part 2

Probabilistic Training of Structured Models (Conditional Random Fields)

Page 30: Christoph Lampert, Talk part 2

Probabilistic Models for Structured Prediction

Idea: Establish a one-to-one correspondence between the compatibility function F(x, y) and a conditional probability distribution p(y|x), such that

argmax_{y∈Y} F(x, y) = argmax_{y∈Y} p(y|x)
= maximum of the posterior probability
= Bayes optimal decision

Gibbs Distribution

p(y|x) = (1/Z(x)) e^{F(x,y)}   with   Z(x) = ∑_{y∈Y} e^{F(x,y)}

argmax_{y∈Y} p(y|x) = argmax_{y∈Y} e^{F(x,y)}/Z(x) = argmax_{y∈Y} e^{F(x,y)} = argmax_{y∈Y} F(x, y)


Page 35: Christoph Lampert, Talk part 2

Probabilistic Models for Structured Prediction

For learning, we make the dependence on w explicit:

F(x, y, w) = 〈w, ϕ(x, y)〉

Parametrized Gibbs Distribution – Loglinear Model

p(y|x, w) = (1/Z(x, w)) e^{F(x,y,w)}   with   Z(x, w) = ∑_{y∈Y} e^{F(x,y,w)}
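As a quick illustration of the correspondence (with a made-up weight vector and feature map on a three-element Y), the following sketch builds the Gibbs distribution by enumeration and checks that the maximizer is unchanged:

```python
# Minimal sketch: F(x,y,w) = <w, phi(x,y)> turned into a Gibbs distribution.
import math

w = [0.5, -1.0]
phi = {"a": [2.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 2.0]}   # phi(x, y) per label y

F = {y: sum(wi * fi for wi, fi in zip(w, f)) for y, f in phi.items()}
Z = sum(math.exp(v) for v in F.values())                    # partition function Z(x, w)
p = {y: math.exp(v) / Z for y, v in F.items()}              # Gibbs distribution p(y|x, w)

# argmax_y p(y|x,w) equals argmax_y F(x,y,w): the monotone exp and the
# constant normalizer Z do not change the maximizer.
assert max(p, key=p.get) == max(F, key=F.get)
```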

Page 36: Christoph Lampert, Talk part 2

Probabilistic Models for Structured Prediction

Supervised training:
Given: i.i.d. example pairs D = {(x¹, y¹), . . . , (xⁿ, yⁿ)}.
Task: determine w.
Idea: Maximize the posterior probability p(w|D)

Use Bayes' rule to infer the posterior distribution for w:

p(w|D) = p(D|w) p(w) / p(D) = p((x¹, y¹), . . . , (xⁿ, yⁿ)|w) p(w) / p(D)

(i.i.d.) = p(w) ∏_{i=1}^{n} p(x^i, y^i|w) / p(x^i, y^i) = p(w) ∏_{i=1}^{n} p(y^i|x^i, w) p(x^i|w) / (p(y^i|x^i) p(x^i))

assuming x does not depend on w, i.e. p(x|w) ≡ p(x):

= p(w) ∏_{i=1}^{n} p(y^i|x^i, w) / p(y^i|x^i)


Page 40: Christoph Lampert, Talk part 2

Posterior probability:

p(w|D) = p(w) ∏_i p(y^i|x^i, w) / p(y^i|x^i)
         (Prior)  (Data Term, loglinear)

The prior p(w) expresses prior knowledge. We cannot infer it from data.

Gaussian: p(w) := const. · e^{−‖w‖²/(2σ²)}
▶ Gaussianity is often a good guess if you know nothing else
▶ easy to handle, analytically and numerically

Less common:
Laplacian: p(w) := const. · e^{−‖w‖/σ}
▶ if we expect sparsity in w
▶ but: the properties of the optimization problem change

Page 41: Christoph Lampert, Talk part 2

Learning w from D ≡ maximizing the posterior probability for w

w* = argmax_{w∈R^d} p(w|D) = argmin_{w∈R^d} [− log p(w|D)]

− log p(w|D) = − log [ p(w) ∏_i p(y^i|x^i, w) / p(y^i|x^i) ]
= − log p(w) − ∑_{i=1}^{n} log p(y^i|x^i, w) + ∑_{i=1}^{n} log p(y^i|x^i)
  (insert p(y^i|x^i, w) = (1/Z) e^{F(x^i,y^i,w)}; the last sum is independent of w)
= ‖w‖²/(2σ²) − ∑_{i=1}^{n} [F(x^i, y^i, w) − log Z(x^i, w)] + const.
= ‖w‖²/(2σ²) − ∑_{i=1}^{n} [〈w, ϕ(x^i, y^i)〉 − log ∑_{y∈Y} e^{〈w, ϕ(x^i, y)〉}] + c.

The w-dependent part defines =: L(w).

Page 42: Christoph Lampert, Talk part 2

(Negative) (Regularized) Conditional Log-Likelihood (of D)

L(w) = (1/2σ²)‖w‖² − ∑_{i=1}^{n} [〈w, ϕ(x^i, y^i)〉 − log ∑_{y∈Y} e^{〈w, ϕ(x^i, y)〉}]

Probabilistic parameter estimation or training means solving

w* = argmin_{w∈R^d} L(w).

Same optimization problem as for multi-class logistic regression:

L_logreg(w) = (1/2σ²)‖w‖² − ∑_{i=1}^{n} [〈w_{y^i}, ϕ(x^i)〉 − log ∑_{y∈Y} e^{〈w_y, ϕ(x^i)〉}]

with w = (w1, . . . , wK) for K classes.

Page 43: Christoph Lampert, Talk part 2

Negative Conditional Log-Likelihood for w ∈ R²:

[Figure: contour plots of the negative log-likelihood for σ² = 0.01, σ² = 0.10, σ² = 1.00, and σ² → ∞.]

w* = argmin_{w∈R^d} L(w).

w ∈ R^d continuous (not discrete) ☺
w ∈ R^d high-dimensional (often d ≫ 1000) ☹
optimization has no side-constraints ☺
L(w) differentiable ☺ → we can apply gradient descent
L(w) convex! ☺ → gradient descent will find the global optimum


Page 50: Christoph Lampert, Talk part 2

Steepest Descent Minimization – minimize L(w)

require: tolerance ε > 0
w_cur ← 0
repeat
▶ v ← −∇_w L(w_cur)                  (descent direction)
▶ η ← argmin_{η∈R} L(w_cur + ηv)     (stepsize)
▶ w_cur ← w_cur + ηv                 (update)
until ‖v‖ < ε
return w_cur

Alternatives:
L-BFGS (second-order descent without explicit Hessian)
Conjugate Gradient
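A compact runnable version of this pseudocode, with the exact 1-D minimization for η replaced by a simple backtracking search (an assumption; the slide leaves the line search abstract):

```python
# Steepest descent sketch for any differentiable objective L with gradient grad_L.
import numpy as np

def steepest_descent(L, grad_L, dim, eps=1e-6, max_iter=1000):
    w = np.zeros(dim)
    for _ in range(max_iter):
        v = -grad_L(w)                      # descent direction
        if np.linalg.norm(v) < eps:         # stopping criterion ||v|| < eps
            break
        eta = 1.0                           # backtracking line search (Armijo)
        while L(w + eta * v) > L(w) - 0.5 * eta * np.dot(v, v) and eta > 1e-12:
            eta *= 0.5
        w = w + eta * v                     # update
    return w

# Example: L(w) = ||w - 1||^2 has its minimum at w = (1, ..., 1).
w_star = steepest_descent(lambda w: np.sum((w - 1) ** 2),
                          lambda w: 2 * (w - 1), dim=2)
```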

Page 51: Christoph Lampert, Talk part 2

[Figures (pages 51–57): successive steepest-descent iterates w0, w1, w2 shown on the contour plot of the objective.]
Page 58: Christoph Lampert, Talk part 2

Summary I: Probabilistic Training (Conditional Random Fields)

Well-defined probabilistic setting.
p(y|x, w) log-linear in w ∈ R^d.
Training: maximize p(w|D) for a data set D.
Differentiable, convex optimization; same structure as logistic regression.
⇒ gradient descent will find the global optimum.

For logistic regression: this is where the textbook ends. We're done.

For conditional random fields: we're not in safe waters, yet!

Page 60: Christoph Lampert, Talk part 2

Problem: Computing ∇_w L(w_cur) and evaluating L(w_cur + ηv):

∇_w L(w) = (1/σ²) w − ∑_{i=1}^{n} [ϕ(x^i, y^i) − ∑_{y∈Y} p(y|x^i, w) ϕ(x^i, y)]

L(w) = (1/2σ²)‖w‖² − ∑_{i=1}^{n} [〈w, ϕ(x^i, y^i)〉 − log ∑_{y∈Y} e^{〈w, ϕ(x^i, y)〉}]

Y typically is very (exponentially) large:
▶ N parts, each with a label from a set Y: |Y| = |Y|^N,
  binary image segmentation: |Y| = 2^{640×480} ≈ 10^{92475},
▶ ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^{2568}.

Without structure in Y, we're lost.

Page 62: Christoph Lampert, Talk part 2

∇_w L(w) = (1/σ²) w − ∑_{i=1}^{n} [ϕ(x^i, y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i, y)]

Computing the Gradient (naive): O(K^M n d)

L(w) = (1/2σ²)‖w‖² − ∑_{i=1}^{n} [〈w, ϕ(x^i, y^i)〉 − log Z(x^i, w)],
with log Z(x^i, w) = log ∑_{y∈Y} e^{〈w, ϕ(x^i, y)〉}

Line Search (naive): O(K^M n d) per evaluation of L

n: number of samples
d: dimension of feature space
M: number of output nodes (≈ 10s to 1,000,000s)
K: number of possible labels of each output node (≈ 2 to 100s)

Page 64: Christoph Lampert, Talk part 2

How Does Structure in Y Help?

Computing Z(x, w) = ∑_{y∈Y} e^{〈w, ϕ(x,y)〉}, with ϕ and w decomposing over the factors:

ϕ(x, y) = (ϕ_F(x_F, y_F))_{F∈F}   and   w = (w_F)_{F∈F}

Z(x, w) = ∑_{y∈Y} e^{∑_{F∈F} 〈w_F, ϕ_F(x_F, y_F)〉} = ∑_{y∈Y} ∏_{F∈F} e^{〈w_F, ϕ_F(x_F, y_F)〉},   with Ψ_F(y_F) := e^{〈w_F, ϕ_F(x_F, y_F)〉}

A lot of research goes into analyzing this expression.

Page 65: Christoph Lampert, Talk part 2

Case I: only unary factors

F = {{x1, y1}, . . . , {xM, yM}}.

[Figure: chain of letters Q U E S T with unary factors ϕ1, . . . , ϕ5]

Z = ∑_{y∈Y} ∏_{F∈F} Ψ_F(y_F)
  = ∑_{y∈Y} ∏_{i=1}^{M} Ψ_i(y_i)
  = ∑_{y1∈Y} ∑_{y2∈Y} · · · ∑_{yM∈Y} Ψ1(y1) · · · ΨM(yM)
  = ∑_{y1∈Y} Ψ1(y1) ∑_{y2∈Y} Ψ2(y2) · · · ∑_{yM∈Y} ΨM(yM)
  = [∑_{y1∈Y} Ψ1(y1)] · [∑_{y2∈Y} Ψ2(y2)] · · · [∑_{yM∈Y} ΨM(yM)]

Case I: O(K^M nd) → O(MKnd)
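A tiny numeric check of this factorization (with made-up factor values): the O(K^M) brute-force sum equals the product of M per-node sums.

```python
# Unary-only factors: brute-force Z vs. factorized Z.
import itertools
import math

M, K = 4, 3                                   # 4 nodes, 3 labels each
psi = [[math.exp(0.1 * i + 0.3 * k) for k in range(K)] for i in range(M)]

# Brute force: sum over all K^M joint labelings (O(K^M)).
Z_brute = sum(math.prod(psi[i][y[i]] for i in range(M))
              for y in itertools.product(range(K), repeat=M))

# Factorized: product of per-node sums (O(MK)).
Z_fact = math.prod(sum(psi[i]) for i in range(M))

assert abs(Z_brute - Z_fact) < 1e-9 * Z_fact
```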

Page 71: Christoph Lampert, Talk part 2

Example: Pose Estimation
[Figure: original image vs. result with unary terms only]

Example: Image Segmentation
[Figure: original image vs. result with unary terms only]

Page 72: Christoph Lampert, Talk part 2

Case II: chain/tree with unary and pairwise factors

[Figure: chain of letters Q U E S T with unary factors ϕ1, . . . , ϕ5 and pairwise factors ϕ1,2, . . . , ϕ4,5]

F = {{x1, y1}, {y1, y2}, . . . , {yM−1, yM}, {xM, yM}}.

Z = ∑_{y∈Y} ∏_{F∈F} Ψ_F(y_F)
  = ∑_{y∈Y} ∏_{i=1}^{M} Ψ_i(y_i) ∏_{i=2}^{M} Ψ_{i−1,i}(y_{i−1}, y_i)
  = ∑_{y1∈Y} Ψ1 ∑_{y2∈Y} Ψ1,2 Ψ2 · · · ∑_{yM−1∈Y} Ψ_{M−2,M−1} Ψ_{M−1} ∑_{yM∈Y} Ψ_{M−1,M} Ψ_M
  = (row vector) · (K×K matrix) · · · (K×K matrix) · (column vector)

Case II: O(MK²nd); the independent case was O(MKnd), naive O(K^M nd)
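The same kind of check for the chain case (again with made-up values): the forward vector–matrix recursion computes the same Z as brute-force enumeration.

```python
# Chain with unary + pairwise factors: forward recursion vs. brute force.
import itertools
import math

M, K = 4, 3
unary = [[math.exp(0.2 * i - 0.1 * k) for k in range(K)] for i in range(M)]
pair = [[math.exp(-0.05 * (a - b) ** 2) for b in range(K)] for a in range(K)]

# Forward recursion (O(M K^2)): alpha[b] = sum over all prefixes ending in b.
alpha = unary[0][:]
for i in range(1, M):
    alpha = [unary[i][b] * sum(alpha[a] * pair[a][b] for a in range(K))
             for b in range(K)]
Z_chain = sum(alpha)

# Brute force (O(K^M)) for comparison.
Z_brute = sum(
    math.prod(unary[i][y[i]] for i in range(M))
    * math.prod(pair[y[i - 1]][y[i]] for i in range(1, M))
    for y in itertools.product(range(K), repeat=M))

assert abs(Z_chain - Z_brute) < 1e-9 * Z_brute
```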

Page 77: Christoph Lampert, Talk part 2

Example: Pose Estimation
[Figure: original image vs. unary only vs. unary + pairwise]

Example: Image Segmentation
[Figure: original image vs. unary only vs. unary + pairwise]

Page 78: Christoph Lampert, Talk part 2

Message Passing in Trees

Rephrase the matrix multiplication as sending messages from node to node: Belief Propagation (BP)

Can also compute expectations: E_{y∼p(y|x^i,w)} ϕ(x^i, y).

Message Passing in Loopy Graphs

For a loopy graph, send the same messages and iterate: Loopy Belief Propagation (LBP)

This results only in approximations to Z(x, w) and E_{y∼p(y|x^i,w)} ϕ(x^i, y).
No convergence guarantee, but variants are often used in practice.

More:
our booklet
Pawan Kumar's thesis "Combinatorial and Convex Optimization for Probabilistic Models in Computer Vision" http://ai.stanford.edu/~pawan/kumar-thesis.pdf

Page 79: Christoph Lampert, Talk part 2

∇_w L(w) = (1/σ²) w − ∑_{i=1}^{n} [ϕ(x^i, y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i, y)]

Computing the Gradient: O(K^M nd) → O(MK²nd) (if BP is possible):

L(w) = (1/2σ²)‖w‖² − ∑_{i=1}^{n} [〈w, ϕ(x^i, y^i)〉 − log ∑_{y∈Y} e^{〈w, ϕ(x^i, y)〉}]

Line Search: O(K^M nd) → O(MK²nd) per evaluation of L

n: number of samples (≈ 1,000s to 1,000,000s)
d: dimension of feature space
M: number of output nodes
K: number of possible labels of each output node

Page 81: Christoph Lampert, Talk part 2

What if our training set D is too large (e.g. millions of examples)?

Switch to a Simpler Classifier
Train independent per-node classifiers.
We lose all label dependences ☹. Still slow ☹.

Subsampling
Create a random subset D′ ⊂ D.
Perform gradient descent to maximize p(w|D′).
Ignores all information in D \ D′ ☹.

Parallelize
Train several smaller models on different computers. Merge the models.
Follows the "multi-core" trend ☺. Unclear how to merge the models ☹. Doesn't reduce the computation, just distributes it ☹.

Page 88: Christoph Lampert, Talk part 2

What if our training set D is too large (e.g. millions of examples)?

Stochastic Gradient Descent (SGD)

Keep maximizing p(w|D) ☺. In each gradient descent step:
▶ Create a random subset D′ ⊂ D  ← often just 1–3 elements!
▶ Follow the approximate gradient

∇̃_w L(w) = (1/σ²) w − ∑_{(x^i,y^i)∈D′} [ϕ(x^i, y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i, y)]

Line search is no longer possible. Extra parameter: stepsize η_t.

SGD converges to argmin_w L(w)! (with η_t chosen right)
SGD needs more iterations, but each one is much faster.

more: see L. Bottou, O. Bousquet: "The Tradeoffs of Large Scale Learning", NIPS 2008.
also: http://leon.bottou.org/research/largescale
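A minimal SGD skeleton for this update; `example_grad` is a hypothetical helper that would return ϕ(x^i, y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i, y), e.g. computed with BP on a tree. This is an illustrative sketch, not a complete CRF trainer.

```python
# SGD on the regularized conditional log-likelihood, |D'| = 1 per step.
import random

def sgd(data, example_grad, dim, sigma2=1.0, steps=10000, eta0=0.1):
    w = [0.0] * dim
    for t in range(1, steps + 1):
        x, y = random.choice(data)              # random subset D', here one example
        g = example_grad(w, x, y)               # phi(x^i,y^i) - E_y phi(x^i,y)
        eta = eta0 / (1.0 + t)                  # decaying stepsize eta_t
        w = [wi - eta * (wi / sigma2 - gi)      # follow the approximate gradient
             for wi, gi in zip(w, g)]
    return w
```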

Page 93: Christoph Lampert, Talk part 2

∇_w L(w) = (1/σ²) w − ∑_{i=1}^{n} [ϕ(x^i, y^i) − E_{y∼p(y|x^i,w)} ϕ(x^i, y)]

Computing the Gradient: O(K^M nd) → O(MK²nd) (if BP is possible):

L(w) = (1/2σ²)‖w‖² − ∑_{i=1}^{n} [〈w, ϕ(x^i, y^i)〉 − log ∑_{y∈Y} e^{〈w, ϕ(x^i, y)〉}]

Line Search: O(K^M nd) → O(MK²nd) per evaluation of L

n: number of samples
d: dimension of feature space (≈ 1–10s for ϕ_{i,j}, 100s to 10,000s for ϕ_i)
M: number of output nodes
K: number of possible labels of each output node

Page 94: Christoph Lampert, Talk part 2

Typical feature functions in image segmentation:

ϕ_i(y_i, x) ∈ R^{≈1000}: local image features, e.g. bag-of-words
→ 〈w_i, ϕ_i(y_i, x)〉: local classifier (like logistic regression)

ϕ_{i,j}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R¹: test for same label
→ 〈w_ij, ϕ_ij(y_i, y_j)〉: penalizer for label changes (if w_ij > 0)

combined: argmax_y p(y|x) is a smoothed version of the local cues

[Figure: original image, local classification, local + smoothness]

Page 95: Christoph Lampert, Talk part 2

Typical feature functions in handwriting recognition:

ϕ_i(y_i, x) ∈ R^{≈1000}: image representation (pixels, gradients)
→ 〈w_i, ϕ_i(y_i, x)〉: local classifier if x_i is letter y_i

ϕ_{i,j}(y_i, y_j) = e_{y_i} ⊗ e_{y_j} ∈ R^{26·26}: letter/letter indicator
→ 〈w_ij, ϕ_ij(y_i, y_j)〉: encourage/suppress letter combinations

combined: argmax_y p(y|x) is a "corrected" version of the local cues

[Figure: local classification reads "Q V E S T"; local + "correction" reads "Q U E S T"]

Page 96: Christoph Lampert, Talk part 2

Typical feature functions in pose estimation:

ϕ_i(y_i, x) ∈ R^{≈1000}: local image representation, e.g. HoG
→ 〈w_i, ϕ_i(y_i, x)〉: local confidence map

ϕ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R¹: test for geometric fit
→ 〈w_ij, ϕ_ij(y_i, y_j)〉: penalizer for unrealistic poses

together: argmax_y p(y|x) is a sanitized version of the local cues

[Figure: original image, local classification, local + geometry]

Images: [Ferrari, Marin-Jimenez, Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008.]

Page 97: Christoph Lampert, Talk part 2

Typical feature functions for CRFs in Computer Vision:

ϕ_i(y_i, x): local representation, high-dimensional
→ 〈w_i, ϕ_i(y_i, x)〉: local classifier

ϕ_{i,j}(y_i, y_j): prior knowledge, low-dimensional
→ 〈w_ij, ϕ_ij(y_i, y_j)〉: penalize outliers

learning adjusts the parameters:
▶ unary w_i: learn local classifiers and their importance
▶ pairwise w_ij: learn the importance of smoothing/penalization

argmax_y p(y|x) is a cleaned-up version of the local prediction

Page 98: Christoph Lampert, Talk part 2

Idea: split learning of the unary potentials into two parts: local classifiers, and their importance.

Two-Stage Training
pre-train f_i^y(x) = log p(y_i|x)
use ϕ_i(y_i, x) := f_i^y(x) ∈ R^K (low-dimensional)
keep ϕ_ij(y_i, y_j) as before
perform CRF learning with ϕ_i and ϕ_ij

Advantage:
the CRF feature space is much lower dimensional → faster training
f_i^y(x) can be stronger classifiers, e.g. non-linear SVMs

Disadvantage:
if the local classifiers are really bad, CRF training cannot fix that.

Page 100: Christoph Lampert, Talk part 2

Summary II — Numeric Training

(All?) Current CRF training methods rely on gradient descent.

∇_w L(w) = w/σ² − ∑_{i=1}^{n} [ϕ(x^i, y^i) − ∑_{y∈Y} p(y|x^i, w) ϕ(x^i, y)] ∈ R^d

The faster we can do it, the better (more realistic) models we can use.

A lot of research goes into speeding up CRF training:

problem        "solution"              method(s)
|Y| too large  exploit structure       efficient inference (e.g. LBP)
               smart sampling          contrastive divergence
n too large    mini-batches            stochastic gradient descent
d too large    discriminative ϕ_unary  two-stage training

Page 101: Christoph Lampert, Talk part 2

Conditional Random Fields Summary

CRFs model the conditional probability distribution p(y|x) = (1/Z) e^{〈w,ϕ(x,y)〉}.

x ∈ X is (almost) arbitrary input
▶ we need a feature function ϕ : X × Y → R^d,
▶ we don't need to know/estimate p(x) or p(ϕ(x))!

y ∈ Y is output with a structure
▶ we need structure in Y that makes training/inference feasible
▶ e.g. graph labeling: local properties plus neighborhood relations

w ∈ R^d is the parameter vector
▶ CRF training solves w = argmax_w p(w|D) for a training set D.

after learning, we can predict: f(x) := argmax_y p(y|x).

Many many many more details in the Machine Learning literature, e.g.:
• [M. Wainwright, M. Jordan: "Graphical Models, Exponential Families, and Variational Inference", now publishers, 2008]. Free PDF at http://www.nowpublishers.com
• [D. Koller, N. Friedman: "Probabilistic Graphical Models", MIT Press, 2010]

Page 102: Christoph Lampert, Talk part 2

Extra I: Beyond Fully Supervised Learning

So far, training was fully supervised: all variables were observed. In real life, some variables can be unobserved even during training.

[Figures: missing labels in training data; latent variables, e.g. part location, part occlusion, viewpoint]

Page 103: Christoph Lampert, Talk part 2

Graphical model: three types of variables
x ∈ X always observed,
y ∈ Y observed only in training,
z ∈ Z never observed (latent).

Marginalization over Latent Variables
Construct the conditional likelihood as usual:

p(y, z|x, w) = (1/Z(x, w)) exp(〈w, ϕ(x, y, z)〉)

Derive p(y|x, w) by marginalizing over z:

p(y|x, w) = ∑_{z∈Z} p(y, z|x, w)

Page 104: Christoph Lampert, Talk part 2

Negative regularized conditional log-likelihood:

L(w) = λ‖w‖² − ∑_{n=1}^{N} log p(y^n|x^n, w)
     = λ‖w‖² − ∑_{n=1}^{N} log ∑_{z∈Z} p(y^n, z|x^n, w)
     = λ‖w‖² − ∑_{n=1}^{N} log ∑_{z∈Z} exp(〈w, ϕ(x^n, y^n, z)〉) + ∑_{n=1}^{N} log ∑_{z∈Z, y∈Y} exp(〈w, ϕ(x^n, y, z)〉)

L is not convex in w → it can have local minima
there is no agreed-on best way of treating latent variables

Page 105: Christoph Lampert, Talk part 2

Maximum Margin Training of Structured Models (Structured Output SVMs)

Page 106: Christoph Lampert, Talk part 2

Example: Object Localization

Learn a localization function:

f : X → Y with X = {images}, Y = {bounding boxes}.

Compatibility function:

F : X × Y → R with F(x, y) = 〈w, ϕ(x, y)〉

with ϕ(x, y), e.g., the BoW histogram of the region y in x.

f(x) = argmax_{y∈Y} F(x, y)

Page 108: Christoph Lampert, Talk part 2

Learn a localization function:

f(x) = argmax_{y∈Y} F(x, y) with F(x, y) = 〈w, ϕ(x, y)〉.

ϕ(x, y) = BoW histogram of the region y = [left, top, right, bottom] in the image x.

p(y|x, w) = (1/Z) exp(〈w, ϕ(x, y)〉) with Z = ∑_{y′∈Y} exp(〈w, ϕ(x, y′)〉)

We can't train probabilistically: ϕ doesn't decompose, Z is too hard.
We can do MAP prediction: sliding window or branch-and-bound.

Can we learn structured prediction functions without "Z"?

Page 110: Christoph Lampert, Talk part 2

Loss-Minimizing Parameter Learning

Let d(x, y) be the (unknown) true data distribution.
Let D = {(x1, y1), . . . , (xn, yn)} be i.i.d. samples from d(x, y).
Let ϕ : X × Y → R^D be a feature function.
Let ∆ : Y × Y → R be a loss function.

Find a weight vector w* that leads to minimal expected loss

E_{(x,y)∼d(x,y)} {∆(y, f(x))}   (∗)

for f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉.

Pro:
We directly optimize for the quantity of interest: the expected loss.
No expensive-to-compute partition function Z will show up.

Con:
1) We need to know the loss function already at training time.
2) We don't know d(x, y), so we can't compute (∗).
3) Even if we did, minimizing (∗) would be NP-hard.

Page 112: Christoph Lampert, Talk part 2

1) Loss function

How to judge if a prediction is good? Define a loss function

∆ : Y × Y → R⁺,

where ∆(y′, y) measures the loss incurred by predicting y when y′ is correct.

The loss function is application dependent.

Page 113: Christoph Lampert, Talk part 2

Example 1: 0/1 loss

Loss is 0 for a perfect prediction, 1 otherwise:

∆_{0/1}(y′, y) = ⟦y′ ≠ y⟧ = { 0 if y′ = y, 1 otherwise }

Every mistake is equally bad. Usually not very useful in structured prediction.

Page 114: Christoph Lampert, Talk part 2

Example 2: Hamming loss

Count the number of mislabeled variables:

∆_H(y′, y) = (1/|V|) ∑_{i∈V} ⟦y′_i ≠ y_i⟧

Used, e.g., in image segmentation.

Page 115: Christoph Lampert, Talk part 2

Example 3: Squared error

If we can add elements in Y_i (pixel intensities, optical flow vectors, etc.):

Sum of squared errors

∆_Q(y′, y) = (1/|V|) ∑_{i∈V} ‖y′_i − y_i‖²

Used, e.g., in stereo reconstruction, part-based object detection.

Page 116: Christoph Lampert, Talk part 2

Example 4: Task specific losses

Object detection: outputs are bounding boxes or arbitrary regions (ground truth vs. detection in the image).

Area overlap loss:

∆_AO(y′, y) = 1 − area(y′ ∩ y) / area(y′ ∪ y)

Used, e.g., in PASCAL VOC challenges for object detection, because it is scale-invariant (no bias for or against big objects).
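Small illustrative implementations of the three losses above (0/1 on whole outputs, Hamming on components, and area overlap on axis-aligned boxes):

```python
# Loss functions from Pages 113-116 (illustrative sketches).
def loss_01(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0

def loss_hamming(y_true, y_pred):
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)

def loss_area_overlap(y_true, y_pred):
    # boxes given as (left, top, right, bottom)
    l = max(y_true[0], y_pred[0]); t = max(y_true[1], y_pred[1])
    r = min(y_true[2], y_pred[2]); b = min(y_true[3], y_pred[3])
    inter = max(0, r - l) * max(0, b - t)
    area = lambda y: (y[2] - y[0]) * (y[3] - y[1])
    union = area(y_true) + area(y_pred) - inter
    return 1.0 - inter / union

print(loss_hamming("QUEST", "QVEST"))                      # 0.2
print(loss_area_overlap((0, 0, 10, 10), (5, 0, 15, 10)))   # 1 - 50/150 = 2/3
```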

Page 117: Christoph Lampert, Talk part 2

2)+3) Learning by regularized risk minimization

Instead of minimizing E_{(x,y)∼d(x,y)} {∆(y, f(x))} directly, solve a regularized risk minimization task:

Find w for the decision function f(x) = argmax_y 〈w, ϕ(x, y)〉 by

min_{w∈R^d} λ‖w‖² + ∑_{i=1}^{n} ℓ(y^i, f(x^i))

(Regularization + Error/Loss on the data)

Maximum-Margin Training: Hinge loss

ℓ(y^i, f(x^i)) := max{0, max_{y∈Y} ∆(y^i, y) + 〈w, ϕ(x^i, y)〉 − 〈w, ϕ(x^i, y^i)〉}

with ∆(y, y′) ≥ 0 and ∆(y, y) = 0 a loss function between outputs.
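A sketch of evaluating this structured hinge loss on a toy problem with three candidate outputs and made-up scores:

```python
# Structured hinge loss: max over y of loss-augmented score difference.
def hinge(score, delta, y_true, labels):
    """score(y): <w, phi(x,y)>; delta(y_true, y): task loss."""
    worst = max(delta(y_true, y) + score(y) - score(y_true) for y in labels)
    return max(0.0, worst)

# Toy example with illustrative scores.
scores = {"a": 2.0, "b": 1.5, "c": -1.0}
loss = hinge(lambda y: scores[y],
             lambda yt, y: 0.0 if y == yt else 1.0,
             "a", scores.keys())
print(loss)   # max(0, max(0+0, 1+1.5-2, 1-1-2)) = 0.5
```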

Page 119: Christoph Lampert, Talk part 2

Structured Output Support Vector Machine

Structured Output Support Vector Machine (S-SVM):

min_{w∈R^d} (1/2)‖w‖² + (C/n) ∑_{i=1}^{n} max_{y∈Y} [∆(y^i, y) + 〈w, ϕ(x^i, y)〉 − 〈w, ϕ(x^i, y^i)〉]

Conditional Random Fields (CRFs):

min_{w∈R^d} ‖w‖²/(2σ²) + ∑_{i=1}^{n} log ∑_{y∈Y} exp(〈w, ϕ(x^i, y)〉 − 〈w, ϕ(x^i, y^i)〉)

Sidenote: CRFs and S-SVMs are really not that different:
both do regularized risk minimization, just with different loss terms.
Don't believe people who say one is (in general) better than the other.

Page 122: Christoph Lampert, Talk part 2

S-SVM Objective Function for w ∈ R²:

[Figure: contour plots of the S-SVM objective for C = 0.01, C = 0.10, C = 1.00, and C → ∞.]

Page 123: Christoph Lampert, Talk part 2

Structured Output SVM:

min_{w∈R^d} (1/2)‖w‖² + (C/n) ∑_{i=1}^{n} max_{y∈Y} [∆(y^i, y) + 〈w, ϕ(x^i, y)〉 − 〈w, ϕ(x^i, y^i)〉]

unconstrained optimization, convex, non-differentiable objective.

Page 124: Christoph Lampert, Talk part 2

Structured Output SVM (equivalent formulation):

min_{w∈R^d, ξ∈R^n_+} (1/2)‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i

subject to, for i = 1, . . . , n,

max_{y∈Y} [∆(y^i, y) + 〈w, ϕ(x^i, y)〉 − 〈w, ϕ(x^i, y^i)〉] ≤ ξ_i

n non-linear constraints, convex, differentiable objective.

Page 125: Christoph Lampert, Talk part 2

Structured Output SVM (also equivalent formulation):

min_{w∈R^d, ξ∈R^n_+} (1/2)‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i

subject to, for i = 1, . . . , n,

∆(y^i, y) + 〈w, ϕ(x^i, y)〉 − 〈w, ϕ(x^i, y^i)〉 ≤ ξ_i, for all y ∈ Y

|Y|·n linear constraints, convex, differentiable objective.

Page 126: Christoph Lampert, Talk part 2

Example: A "True" Multiclass SVM

Y = {1, 2, . . . , K}, ∆(y, y′) = { 1 for y ≠ y′, 0 otherwise }.

ϕ(x, y) = (⟦y = 1⟧Φ(x), ⟦y = 2⟧Φ(x), . . . , ⟦y = K⟧Φ(x)) = Φ(x) e_y^⊤ with e_y = y-th unit vector

Solve:

min_{w,ξ} (1/2)‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i

subject to, for i = 1, . . . , n,

〈w, ϕ(x^i, y^i)〉 − 〈w, ϕ(x^i, y)〉 ≥ 1 − ξ_i for all y ∈ Y.

Classification: MAP f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉

→ Crammer–Singer Multiclass SVM

Page 128: Chrisoph Lampert, Talk part 2

Hierarchical Multiclass Classification

The loss function can reflect a class hierarchy, e.g. leaves cat, dog, car, bus grouped under animal/vehicle:

∆(y, y′) := ½ (distance in the tree)

∆(cat, cat) = 0,  ∆(cat, dog) = 1,  ∆(cat, bus) = 2,  etc.

Solve:

min_{w,ξ}  ½‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i

subject to, for i = 1, …, n,

〈w, ϕ(x_i, y_i)〉 − 〈w, ϕ(x_i, y)〉 ≥ ∆(y_i, y) − ξ_i   for all y ∈ Y.
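A minimal sketch of such a tree-distance loss, assuming the hypothetical two-level hierarchy from the slide: {cat, dog} under "animal", {car, bus} under "vehicle", both hanging off a common root.

```python
PARENT = {"cat": "animal", "dog": "animal", "car": "vehicle", "bus": "vehicle"}

def tree_delta(y, yp):
    """Delta(y, y') = 1/2 * (edge distance between leaves y and y' in the tree)."""
    if y == yp:
        return 0.0                      # same leaf: distance 0
    if PARENT[y] == PARENT[yp]:
        return 1.0                      # siblings: path of length 2 via the parent
    return 2.0                          # different subtrees: path of length 4 via the root

print(tree_delta("cat", "cat"), tree_delta("cat", "dog"), tree_delta("cat", "bus"))  # 0.0 1.0 2.0
```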

Page 129: Chrisoph Lampert, Talk part 2

Numeric Solution: Solving an S-SVM like a CRF.

How to solve

min_{w∈R^d}  ½‖w‖² + (C/n) ∑_{i=1}^{n} max_{y∈Y} [ ∆(y_i, y) + 〈w, ϕ(x_i, y)〉 − 〈w, ϕ(x_i, y_i)〉 ] ?

The objective is unconstrained and convex, but non-differentiable → we can't use gradient descent directly.

Idea: use sub-gradient descent.

Page 130: Chrisoph Lampert, Talk part 2

Sub-Gradients

Definition: Let f : R^d → R be a convex function. A vector v ∈ R^d is called a sub-gradient of f at w₀ if

f(w) ≥ f(w₀) + 〈v, w − w₀〉   for all w.

[Figure: a convex function f(w) and the linear lower bound f(w₀) + 〈v, w − w₀〉 touching it at w₀.]

For differentiable f, the gradient v = ∇f(w₀) is the only sub-gradient.


Page 134: Chrisoph Lampert, Talk part 2

Computing a subgradient:

min_w  ½‖w‖² + (C/n) ∑_{i=1}^{n} ℓ_i(w)

with ℓ_i(w) = max{0, max_{y∈Y} ℓ_iy(w)}, and

ℓ_iy(w) := ∆(y_i, y) + 〈w, ϕ(x_i, y)〉 − 〈w, ϕ(x_i, y_i)〉

[Figure: the linear functions ℓ_iy(w) for the different y ∈ Y, and their upper envelope ℓ_i(w).]

For each y ∈ Y, ℓ_iy(w) is a linear function of w.
ℓ_i(w) = max{0, max_y ℓ_iy(w)} is the maximum over all y ∈ Y and 0, hence piecewise linear.
Subgradient of ℓ_i at w₀: find the maximal (active) y, use v = ∇ℓ_iy(w₀).
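In code, this recipe is one line per step: evaluate all linear pieces, pick the active one, return its gradient. A minimal sketch; `phi` and `Delta` are assumed callables, as in the earlier sketches.

```python
import numpy as np

def subgradient_i(w, phi, Delta, x_i, y_i, Y):
    """Subgradient of l_i(w) = max{0, max_y l_iy(w)} at w."""
    vals = [Delta(y_i, y) + w @ phi(x_i, y) - w @ phi(x_i, y_i) for y in Y]
    j = int(np.argmax(vals))
    if vals[j] <= 0.0:                       # the constant 0 branch is active: l_i is flat
        return np.zeros_like(w)
    return phi(x_i, Y[j]) - phi(x_i, y_i)    # gradient of the active linear piece l_iy
```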


Page 142: Chrisoph Lampert, Talk part 2

Subgradient Descent S-SVM Training

input training pairs {(x₁, y₁), …, (x_n, y_n)} ⊂ X × Y,
input feature map ϕ(x, y), loss function ∆(y, y′), regularizer C,
input number of iterations T, stepsizes η_t for t = 1, …, T

1: w ← 0
2: for t = 1, …, T do
3:   for i = 1, …, n do
4:     ŷ ← argmax_{y∈Y} ∆(y_i, y) + 〈w, ϕ(x_i, y)〉 − 〈w, ϕ(x_i, y_i)〉
5:     v_i ← ϕ(x_i, y_i) − ϕ(x_i, ŷ)
6:   end for
7:   w ← w − η_t ( w − (C/n) ∑_i v_i )
8: end for

output prediction function f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉.

Observation: each update of w needs one argmax-prediction per example.
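A self-contained sketch of this training loop on a toy two-class problem. The step-size schedule η_t = 1/t, the block feature map, and the data are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def subgradient_descent_ssvm(data, phi, Delta, Y, dim, C=1.0, T=100, eta=lambda t: 1.0 / t):
    """Loss-augmented argmax per example (line 4), then one descent step in w (line 7)."""
    n = len(data)
    w = np.zeros(dim)
    for t in range(1, T + 1):
        v_sum = np.zeros(dim)
        for x_i, y_i in data:
            scores = [Delta(y_i, y) + w @ phi(x_i, y) - w @ phi(x_i, y_i) for y in Y]
            y_hat = Y[int(np.argmax(scores))]          # loss-augmented prediction
            v_sum += phi(x_i, y_i) - phi(x_i, y_hat)   # v_i from line 5
        w -= eta(t) * (w - (C / n) * v_sum)            # regularizer + loss subgradient
    return w

# Hypothetical toy problem: 2 classes, block feature map.
def phi(x, y):
    out = np.zeros(4)
    out[2 * y:2 * y + 2] = x
    return out

data = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
w = subgradient_descent_ssvm(data, phi, lambda a, b: float(a != b), [0, 1], dim=4)
print([int(np.argmax([w @ phi(x, y) for y in [0, 1]])) for x, _ in data])  # -> [0, 1]
```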

Page 143: Chrisoph Lampert, Talk part 2

Numeric Solution: Solving an S-SVM like a linear SVM.

The other, equivalent, S-SVM formulation was

min_{w∈R^d, ξ∈R^n₊}  ½‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i

subject to, for i = 1, …, n,

〈w, ϕ(x_i, y_i)〉 − 〈w, ϕ(x_i, y)〉 ≥ ∆(y_i, y) − ξ_i,   for all y ∈ Y.

With feature vectors δϕ(x_i, y_i, y) := ϕ(x_i, y_i) − ϕ(x_i, y), this is

min_{w∈R^d, ξ∈R^n₊}  ½‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i

subject to, for i = 1, …, n, for all y ∈ Y,

〈w, δϕ(x_i, y_i, y)〉 ≥ ∆(y_i, y) − ξ_i.


Page 145: Chrisoph Lampert, Talk part 2

Solve

min_{w∈R^d, ξ∈R^n₊}  ½‖w‖² + (C/n) ∑_{i=1}^{n} ξ_i   (QP)

subject to, for i = 1, …, n, for all y ∈ Y,

〈w, δϕ(x_i, y_i, y)〉 ≥ ∆(y_i, y) − ξ_i.

This has the same structure as an ordinary SVM: quadratic objective, linear constraints.

Question: Can't we use an ordinary QP solver to solve this?

Answer: Almost! We could, if there weren't n(|Y| − 1) constraints. E.g. for 100 binary 16×16 segmentation masks, |Y| = 2²⁵⁶ ≈ 10⁷⁷, so n(|Y| − 1) ≈ 10⁷⁹ constraints.


Page 148: Chrisoph Lampert, Talk part 2

Solution: working set training

It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized).

Keep a set of potentially active constraints and update it iteratively:
- Start with working set S = ∅ (no constraints).
- Repeat until convergence:
  - Solve the S-SVM training problem with only the constraints from S.
  - Check whether the solution violates any constraint of the full set:
    - if no: we found the optimal solution, terminate.
    - if yes: add the most violated constraints to S, iterate.

Good practical performance and theoretical guarantees: polynomial-time convergence to within ε of the global optimum.

[Tsochantaridis et al., "Large Margin Methods for Structured and Interdependent Output Variables", JMLR, 2005.]


Page 151: Chrisoph Lampert, Talk part 2

Working Set S-SVM Training

input training pairs {(x₁, y₁), …, (x_n, y_n)} ⊂ X × Y,
input feature map ϕ(x, y), loss function ∆(y, y′), regularizer C

1: S ← ∅
2: repeat
3:   (w, ξ) ← solution of the QP with only the constraints from S
4:   for i = 1, …, n do
5:     ŷ ← argmax_{y∈Y} ∆(y_i, y) + 〈w, ϕ(x_i, y)〉
6:     if ŷ ≠ y_i then
7:       S ← S ∪ {(x_i, ŷ)}
8:     end if
9:   end for
10: until S doesn't change anymore

output prediction function f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉.

Observation: each update of w needs one argmax-prediction per example (but we solve globally for w, not locally).
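A sketch of the working-set loop. One assumption up front: instead of an off-the-shelf QP solver in line 3, the restricted problem is solved here by the same subgradient descent as before, only with the max taken over the working set S_i; everything else follows the pseudocode. `phi` and `Delta` are callables as in the earlier sketches.

```python
import numpy as np

def solve_restricted(data, S, phi, Delta, dim, C, iters=200):
    """Stand-in for line 3: approximately solve the S-SVM restricted to constraints in S."""
    n, w = len(data), np.zeros(dim)
    for t in range(1, iters + 1):
        v = np.zeros(dim)
        for i, (x_i, y_i) in enumerate(data):
            if not S[i]:
                continue                      # no constraints for this example yet
            vals = [Delta(y_i, y) + w @ phi(x_i, y) - w @ phi(x_i, y_i) for y in S[i]]
            j = int(np.argmax(vals))
            if vals[j] > 0:                   # hinge is active: collect its subgradient
                v += phi(x_i, y_i) - phi(x_i, S[i][j])
        w -= (1.0 / t) * (w - (C / n) * v)
    return w

def working_set_train(data, phi, Delta, Y, dim, C=1.0):
    S = [[] for _ in data]                    # line 1: empty working set
    while True:
        w = solve_restricted(data, S, phi, Delta, dim, C)
        changed = False
        for i, (x_i, y_i) in enumerate(data):
            scores = [Delta(y_i, y) + w @ phi(x_i, y) for y in Y]
            y_hat = Y[int(np.argmax(scores))]     # line 5: most violated output
            if y_hat != y_i and y_hat not in S[i]:
                S[i].append(y_hat)                # line 7: grow the working set
                changed = True
        if not changed:                       # line 10: S is stable, terminate
            return w
```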

Page 152: Chrisoph Lampert, Talk part 2

"One-Slack" Structured Output SVM:(equivalent to ordinary S-SVM by ξ = 1

n∑

i ξi)

minw∈Rd ,ξ∈R+

12‖w‖

2 + Cξ

subject ton∑

i=1

[∆(yi , yi) + 〈w, ϕ(x i , yi)〉 − 〈w, ϕ(x i , yi)〉

]≤ nξ,

subject to, for all (y1, . . . , yn) ∈ Y × · · · × Y ,n∑

i=1∆(yi , yi) + 〈w,

n∑i=1

[ϕ(x i , yi)− ϕ(x i , yi)]〉 ≤ nξ,

|Y|n linear constraints, convex, differentiable objective.

We have blown up the constraint set even further:100 binary 16× 16 images: 10177 constraints (instead of 1079).

Page 153: Chrisoph Lampert, Talk part 2

Working Set One-Slack S-SVM Training

input training pairs {(x₁, y₁), …, (x_n, y_n)} ⊂ X × Y,
input feature map ϕ(x, y), loss function ∆(y, y′), regularizer C

1: S ← ∅
2: repeat
3:   (w, ξ) ← solution of the QP with only the constraints from S
4:   for i = 1, …, n do
5:     ŷ_i ← argmax_{y∈Y} ∆(y_i, y) + 〈w, ϕ(x_i, y)〉
6:   end for
7:   S ← S ∪ { ((x₁, …, x_n), (ŷ₁, …, ŷ_n)) }
8: until S doesn't change anymore

output prediction function f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉.

Often faster convergence: we add one strong constraint per iteration instead of n weak ones.

[Joachims, Finley, Yu, "Cutting-Plane Training of Structural SVMs", Machine Learning, 2009.]
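The one-slack variant changes only what a "constraint" is: the n loss-augmented predictions are merged into a single summed constraint. A sketch of that construction (lines 4-7 of the pseudocode), with `phi` and `Delta` as assumed callables:

```python
import numpy as np

def one_slack_constraint(w, data, phi, Delta, Y):
    """One loss-augmented argmax per example, merged into a single constraint that
    reads: total_delta - <w, dphi_sum> <= n * xi."""
    total_delta, dphi_sum, labels = 0.0, np.zeros_like(w), []
    for x_i, y_i in data:
        scores = [Delta(y_i, y) + w @ phi(x_i, y) for y in Y]
        y_hat = Y[int(np.argmax(scores))]
        labels.append(y_hat)
        total_delta += Delta(y_i, y_hat)
        dphi_sum += phi(x_i, y_i) - phi(x_i, y_hat)   # sum_i [phi(x_i,y_i) - phi(x_i,yhat_i)]
    return labels, total_delta, dphi_sum
```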

Page 154: Chrisoph Lampert, Talk part 2

Numeric Solution: Solving an S-SVM like a non-linear SVM.

Dual S-SVM problem:

max_{α∈R^{n|Y|}₊}  ∑_{i=1,…,n; y∈Y} α_iy ∆(y_i, y) − ½ ∑_{i,i′=1,…,n; y,y′∈Y} α_iy α_i′y′ 〈 δϕ(x_i, y_i, y), δϕ(x_i′, y_i′, y′) 〉

subject to, for i = 1, …, n,

∑_{y∈Y} α_iy ≤ C/n.

Like for ordinary SVMs, we can dualize the S-SVM optimization:
- min becomes max,
- the original (primal) variables w, ξ disappear,
- we get one dual variable α_iy for each primal constraint.

n linear constraints, convex, differentiable objective, n|Y| variables.

Page 155: Chrisoph Lampert, Talk part 2

Data appears only inside inner products: kernelize!

Define a joint kernel function k : (X × Y) × (X × Y) → R,

k( (x, y), (x′, y′) ) = 〈 ϕ(x, y), ϕ(x′, y′) 〉.

k measures the similarity between two (input, output)-pairs.

We can express the optimization in terms of k:

〈 δϕ(x_i, y_i, y), δϕ(x_i′, y_i′, y′) 〉
  = 〈 ϕ(x_i, y_i) − ϕ(x_i, y), ϕ(x_i′, y_i′) − ϕ(x_i′, y′) 〉
  = 〈ϕ(x_i, y_i), ϕ(x_i′, y_i′)〉 − 〈ϕ(x_i, y_i), ϕ(x_i′, y′)〉 − 〈ϕ(x_i, y), ϕ(x_i′, y_i′)〉 + 〈ϕ(x_i, y), ϕ(x_i′, y′)〉
  = k( (x_i, y_i), (x_i′, y_i′) ) − k( (x_i, y_i), (x_i′, y′) ) − k( (x_i, y), (x_i′, y_i′) ) + k( (x_i, y), (x_i′, y′) )
  =: K_ii′yy′
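The expansion of K_ii′yy′ translates directly into code. The toy joint kernel at the bottom (inner product times label agreement, i.e. the multiclass construction from earlier) is just an assumed example.

```python
def K_entry(k, x_i, y_i, y, x_j, y_j, y_p):
    """K_ii'yy' = <dphi(x_i,y_i,y), dphi(x_j,y_j,y')>, written with the joint kernel k."""
    return (k((x_i, y_i), (x_j, y_j)) - k((x_i, y_i), (x_j, y_p))
            - k((x_i, y), (x_j, y_j)) + k((x_i, y), (x_j, y_p)))

# Assumed toy joint kernel: k((x,y), (x',y')) = <x, x'> * [y == y']
k = lambda a, b: sum(u * v for u, v in zip(a[0], b[0])) * (a[1] == b[1])
print(K_entry(k, [1.0, 0.0], 0, 1, [1.0, 0.0], 0, 1))  # -> 2.0
```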

Page 156: Chrisoph Lampert, Talk part 2

Kernelized S-SVM problem:

max_{α∈R^{n|Y|}₊}  ∑_{i=1,…,n; y∈Y} α_iy ∆(y_i, y) − ½ ∑_{i,i′=1,…,n; y,y′∈Y} α_iy α_i′y′ K_ii′yy′

subject to, for i = 1, …, n,

∑_{y∈Y} α_iy ≤ C/n.

Too many variables: train with a working set of the α_iy.

Kernelized prediction function:

f(x) = argmax_{y∈Y} ∑_{i,y′} α_iy′ k( (x_i, y′), (x, y) )

Be careful in kernel design: evaluating f can easily become very slow or completely infeasible.

Page 157: Chrisoph Lampert, Talk part 2

What do "joint kernel functions" look like?

k( (x, y), (x′, y′) ) = 〈 ϕ(x, y), ϕ(x′, y′) 〉.

Graphical model: ϕ decomposes into components per factor:

ϕ(x, y) = ( ϕ_f(x_f, y_f) )_{f∈F}

The kernel k then decomposes into a sum over factors:

k( (x, y), (x′, y′) ) = 〈 (ϕ_f(x_f, y_f))_{f∈F}, (ϕ_f(x′_f, y′_f))_{f∈F} 〉
  = ∑_{f∈F} 〈 ϕ_f(x_f, y_f), ϕ_f(x′_f, y′_f) 〉
  = ∑_{f∈F} k_f( (x_f, y_f), (x′_f, y′_f) )

We can define kernels for each factor (e.g. nonlinear), as in the sketch below.
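The factorization is a plain sum in code. A minimal sketch, assuming the (x, y) pair has already been split into per-factor parts:

```python
def joint_kernel(factor_kernels, x_parts, y_parts, xp_parts, yp_parts):
    """k((x,y), (x',y')) = sum_f k_f((x_f, y_f), (x'_f, y'_f))."""
    return sum(k_f((xf, yf), (xpf, ypf))
               for k_f, xf, yf, xpf, ypf
               in zip(factor_kernels, x_parts, y_parts, xp_parts, yp_parts))

# e.g. two factors, each scored by a linear kernel on its (x_f, y_f) pair:
lin = lambda a, b: a[0] * b[0] + a[1] * b[1]
print(joint_kernel([lin, lin], [1.0, 2.0], [0, 1], [1.0, 0.0], [0, 0]))  # -> 1.0
```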

Page 158: Chrisoph Lampert, Talk part 2

Example: figure-ground segmentation with grid structure

(x, y) = (image, binary segmentation mask)

Typical kernels: arbitrary in x, linear (or at least simple) with respect to y.

Unary factors:

k_p( (x, y_p), (x′, y′_p) ) = k( x_N(p), x′_N(p) ) ⟦y_p = y′_p⟧

with k an image kernel on the local neighborhood N(p), e.g. χ² or histogram intersection.

Pairwise factors:

k_pq( (y_p, y_q), (y′_p, y′_q) ) = ⟦y_p = y′_p⟧ ⟦y_q = y′_q⟧

More powerful than all-linear, and MAP prediction is still possible.
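A sketch of the two factor kernels for the grid model, assuming each pixel p comes with a local feature histogram over its neighborhood N(p), and with histogram intersection standing in for the image kernel k.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Image kernel k on local neighborhoods, here: histogram intersection."""
    return float(np.minimum(h1, h2).sum())

def k_unary(x_hist_p, y_p, xp_hist_p, yp_p):
    """k_p((x, y_p), (x', y'_p)) = k(x_N(p), x'_N(p)) * [y_p == y'_p]"""
    return hist_intersection(x_hist_p, xp_hist_p) * float(y_p == yp_p)

def k_pairwise(y_p, y_q, yp_p, yp_q):
    """k_pq((y_p, y_q), (y'_p, y'_q)) = [y_p == y'_p][y_q == y'_q]: no image term at all."""
    return float(y_p == yp_p) * float(y_q == yp_q)

# e.g. two pixels with identical local histograms and matching labels:
h = np.array([0.2, 0.5, 0.3])
print(k_unary(h, 1, h, 1), k_pairwise(1, 0, 1, 0))  # -> 1.0 1.0
```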

Page 159: Chrisoph Lampert, Talk part 2

Example: object localization

(x, y) = (image, bounding box given by left, top, right, bottom)

Only one factor, which includes all of x and y:

k( (x, y), (x′, y′) ) = k_image( x|_y, x′|_y′ )

with k_image an image kernel and x|_y the image region within box y.

MAP prediction is just object localization with a k_image-SVM:

f(x) = argmax_{y∈Y} ∑_{i,y′} α_iy′ k_image( x_i|_y′, x|_y )

Page 160: Chrisoph Lampert, Talk part 2

Summary I – S-SVM Learning

Task: parameter learning. Given a training set {(x₁, y₁), …, (x_n, y_n)} ⊂ X × Y, learn the parameters w of the prediction function

F(x, y) = 〈w, ϕ(x, y)〉,

and predict f : X → Y by f(x) = argmax_{y∈Y} F(x, y) (= MAP).

Solution from the maximum-margin principle: enforce that the correct output is better than all others by a margin:

〈w, ϕ(x_i, y_i)〉 ≥ ∆(y_i, y) + 〈w, ϕ(x_i, y)〉   for all y ∈ Y.

Convex optimization problem (similar to a multiclass SVM); many equivalent formulations → different training algorithms. Training calls argmax repeatedly, but needs no probabilistic inference.

Page 161: Chrisoph Lampert, Talk part 2

Extra I: Beyond Fully Supervised Learning

So far, training was fully supervised: all variables were observed. In real life, some variables are unobserved even during training:
- missing labels in the training data,
- latent variables, e.g. part location,
- latent variables, e.g. part occlusion,
- latent variables, e.g. viewpoint.

Page 162: Chrisoph Lampert, Talk part 2

Three types of variables:
- x ∈ X always observed,
- y ∈ Y observed only during training,
- z ∈ Z never observed (latent).

Decision function: f(x) = argmax_{y∈Y} max_{z∈Z} 〈w, ϕ(x, y, z)〉

Maximization over Latent Variables

Solve:

min_{w,ξ}  ½‖w‖² + (C/N) ∑_{n=1}^{N} ξ_n

subject to, for n = 1, …, N, for all y ∈ Y,

∆(y_n, y) + max_{z∈Z} 〈w, ϕ(x_n, y, z)〉 − max_{z∈Z} 〈w, ϕ(x_n, y_n, z)〉 ≤ ξ_n

Problem: this is not a convex problem → it can have local minima.

[C. Yu, T. Joachims, "Learning Structural SVMs with Latent Variables", ICML, 2009]
Similar idea: [Felzenszwalb, McAllester, Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model", CVPR 2008]
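Prediction with latent variables only adds an inner maximization. A minimal sketch, assuming Y and Z are small enough to enumerate and `phi` is a callable ϕ(x, y, z):

```python
import numpy as np

def predict_latent(w, phi, Y_labels, Z_states, x):
    """f(x) = argmax_y max_z <w, phi(x, y, z)>: the latent variable z is maximized out."""
    best_y, best_score = None, -np.inf
    for y in Y_labels:
        score = max(w @ phi(x, y, z) for z in Z_states)   # inner max over z
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```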


Page 164: Chrisoph Lampert, Talk part 2

Summary – Learning for Structured Prediction

Task: learn a parameter vector w for a function f : X → Y,

f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉

Regularized risk minimization framework:
- loss function:
  - logistic loss → conditional random fields (CRFs),
  - hinge loss → structured support vector machines (S-SVMs),
- regularization:
  - L2 norm of w ↔ Gaussian prior.

Neither guarantees better prediction accuracy than the other; see e.g. [Nowozin et al., ECCV 2010].

The difference is in the numeric optimization:
- Use CRFs if you can do probabilistic inference.
- Use S-SVMs if you can do loss-augmented MAP prediction.

Page 165: Chrisoph Lampert, Talk part 2

Open Research in Structured Learning

Faster training:
- CRFs need many runs of probabilistic inference,
- S-SVMs need many runs of argmax-prediction.

Reduce the amount of training data necessary:
- semi-supervised learning? transfer learning?

Understand the competing training methods:
- when to use probabilistic training, when maximum margin?

Understand S-SVMs with approximate inference:
- very often, exactly solving argmax_y F(x, y) is infeasible,
- can we guarantee convergence even with approximate inference?

Latent-variable S-SVMs / CRFs.

Page 166: Chrisoph Lampert, Talk part 2

For what problems can we learn solutions?
- we need training data,
- we need to be able to perform argmax-prediction.

[Figure: a difficulty axis. From left to right: trivial/boring (e.g. image thresholding), good manual solutions (e.g. stereo reconstruction), interesting & realistic (e.g. object recognition, image segmentation, pose estimation, action recognition, …), horrendously difficult (e.g. true artificial intelligence). Exact MAP is doable at the left end; beyond that, MAP is NP-hard.]

For many interesting problems, we cannot do exact MAP prediction.


Page 168: Chrisoph Lampert, Talk part 2

The Need for Approximate Inference

Image segmentation on a pixel grid:
- two labels and positive pairwise weights: exact MAP doable,
- more than two labels: MAP generally NP-hard.

Stereo reconstruction (grid), #labels = #depth levels:
- convex penalization of disparities: exact MAP doable,
- non-convex (truncated) penalization: NP-hard.

Human pose estimation:
- limbs independent given the torso: exact MAP doable,
- limbs interacting with each other: MAP NP-hard.

Page 169: Chrisoph Lampert, Talk part 2

Approximate inference, ŷ ≈ argmax_{y∈Y} F(x, y), has been used for decades, e.g. for
- classical combinatorial problems: traveling salesman, knapsack,
- energy minimization: Markov/conditional random fields.

It often yields good (= almost perfect) solutions ŷ, can come with guarantees of the form F(x, y*) ≤ F(x, ŷ) ≤ (1 + ε) F(x, y*), and is typically much faster than exact inference.

Can't we use that for S-SVM training? Iterating it is problematic!

In each S-SVM training iteration,
- we solve y* = argmax_y [∆(y_i, y) + F(x_i, y)],
- we use y* to update w,
- we stop when there are no more violated constraints / updates.

With ŷ ≈ argmax_y [∆(y_i, y) + F(x_i, y)],
- errors can accumulate,
- we might stop too early.

Training S-SVMs with approximate MAP prediction is an active research question.


Page 173: Chrisoph Lampert, Talk part 2

Ad: PhD/PostDoc Positions at I.S.T. Austria, Vienna

More info: www.ist.ac.at

I.S.T. Graduate School: 1(2) + 3 yr PhD program
- Computer Vision / Machine Learning (me, Vladimir Kolmogorov)
- Computer Graphics (C. Wojtan)
- Comp. Topology (H. Edelsbrunner)
- Game Theory (K. Chatterjee)
- Software Verification (T. Henzinger)
- Cryptography (K. Pietrzak)
- Comp. Neuroscience (G. Tkacik)
Full scholarships.

PostDoc positions in my group:
- computer vision (object recognition),
- machine learning (structured output learning).

Internships: ask me!