Machine learning of structured outputs


Transcript of Machine learning of structured outputs

Page 1: Machine learning of structured outputs

Machine Learning of Structured Outputs

Christoph Lampert
IST Austria (Institute of Science and Technology Austria), Klosterneuburg

Feb 2, 2011

Page 2: Machine learning of structured outputs

Machine Learning of Structured Outputs

Overview:
- Introduction to Structured Learning
- Structured Support Vector Machines
- Applications in Computer Vision

Slides available at http://www.ist.ac.at/~chl

Page 3: Machine learning of structured outputs

What is Machine Learning?

Definition [T. Mitchell]: Machine Learning is the study of computer algorithms that improve their performance in a certain task through experience.

Example: Backgammon
- Task: play backgammon
- Experience: self-play
- Performance measure: games won against humans

Example: Object Recognition
- Task: determine which objects are visible in images
- Experience: annotated training data
- Performance measure: objects recognized correctly

Page 4: Machine learning of structured outputs

What is structured data?

Definition [ad hoc]: Data is structured if it consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together.

Examples: text, molecules / chemical structures, documents / hypertext, images.

Page 5: Machine learning of structured outputs

The right tool for the problem.

Example: Machine Learning for/of Structured Data

[Figure: image → body model → model fit]

Task: human pose estimation
Experience: images with manually annotated body pose
Performance measure: number of correctly localized body parts

Page 6: Machine learning of structured outputs

Other tasks:

Natural Language Processing:
- Automatic Translation (output: sentences)
- Sentence Parsing (output: parse trees)

Bioinformatics:
- RNA Structure Prediction (output: bipartite graphs)
- Enzyme Function Prediction (output: path in a tree)

Speech Processing:
- Automatic Transcription (output: sentences)
- Text-to-Speech (output: audio signal)

Robotics:
- Planning (output: sequence of actions)

This talk: only Computer Vision examples

Page 7: Machine learning of structured outputs

"Normal" Machine Learning:f : X → R.

inputs X can be any kind of objects
- images, text, audio, sequences of amino acids, . . .

output y is a real number
- classification, regression, . . .

many ways to construct f:
- f(x) = a · ϕ(x) + b,
- f(x) = decision tree,
- f(x) = neural network

Page 8: Machine learning of structured outputs

Structured Output Learning: f : X → Y.

inputs X can be any kind of objects
outputs y ∈ Y are complex (structured) objects
- images, parse trees, folds of a protein, . . .

how to construct f ?

Page 9: Machine learning of structured outputs

Predicting Structured Outputs: Image Denoising

f : input image ↦ output: denoised image

input set X = {grayscale images} ≙ [0, 255]^{M·N}

output set Y = {grayscale images} ≙ [0, 255]^{M·N}

energy minimization f(x) := argmin_{y∈Y} E(x, y)

E(x, y) = λ ∑_i (x_i − y_i)² + µ ∑_{i,j} |y_i − y_j|
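To make this concrete, here is a minimal NumPy sketch (an illustration added for this transcript, not from the original slides) of the denoising energy and a greedy ICM-style minimizer; the weights lam and mu, the coarse 8-level label set, and the image size are arbitrary choices:

import numpy as np

def energy(x, y, lam=1.0, mu=0.5):
    # E(x, y) = lam * sum_i (x_i - y_i)^2 + mu * sum_{i,j} |y_i - y_j|,
    # where (i, j) ranges over horizontally/vertically adjacent pixels.
    data = lam * np.sum((x - y) ** 2)
    smooth = mu * (np.abs(np.diff(y, axis=0)).sum() + np.abs(np.diff(y, axis=1)).sum())
    return data + smooth

def denoise_icm(x, labels=range(0, 256, 32), lam=1.0, mu=0.5, sweeps=5):
    # Greedy coordinate descent (ICM): set each pixel to the label that
    # minimizes its local energy given its neighbors. A crude stand-in for
    # the exact solvers (e.g. graph cuts) discussed later in the talk.
    y = x.copy()
    H, W = y.shape
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                best, best_e = y[i, j], float("inf")
                for l in labels:
                    e = lam * (x[i, j] - l) ** 2
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ni, nj = i + di, j + dj
                        if 0 <= ni < H and 0 <= nj < W:
                            e += mu * abs(l - y[ni, nj])
                    if e < best_e:
                        best, best_e = l, e
                y[i, j] = best
    return y

noisy = np.clip(128 + 40 * np.random.randn(16, 16), 0, 255)
print(energy(noisy, noisy), energy(noisy, denoise_icm(noisy)))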

Page 10: Machine learning of structured outputs

Predicting Structured Outputs: Human Pose Estimation

[Figure: (input image, body model) ↦ output: model fit]

input set X = {images}

output set Y = {positions/angles of K body parts} ≙ R^{4K}.

energy minimization f(x) := argmin_{y∈Y} E(x, y)

E(x, y) = ∑_i w_i^T ϕ_fit(x_i, y_i) + ∑_{i,j} w_ij^T ϕ_pose(y_i, y_j)

Page 11: Machine learning of structured outputs

Predicting Structured Outputs: Shape Matching

input: image pairs

output: mapping y : x_i ↔ y(x_i)

scoring function F(x, y) = ∑_i w_i^T ϕ_sim(x_i, y(x_i)) + ∑_{i,j} w_ij^T ϕ_dist(x_i, x_j, y(x_i), y(x_j))

predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)

[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS, 2008]

Page 12: Machine learning of structured outputs

Predicting Structured Outputs: Tracking (by Detection)

input: image

output: object position

input set X = {images}

output set Y = R² (box center) or R⁴ (box coordinates)

predict f : X → Y by f(x) := argmax_{y∈Y} F(x, y)

scoring function F(x, y) = w^T ϕ(x, y), e.g. an SVM score

images: [C. L., Jan Peters, "Active Structured Learning for High-Speed Object Detection", DAGM 2009]

Page 13: Machine learning of structured outputs

Predicting Structured Outputs: Summary

Image Denoising

y = argmin_y E(x, y),  E(x, y) = w_1 ∑_i (x_i − y_i)² + w_2 ∑_{i,j} |y_i − y_j|

Pose Estimation

y = argmin_y E(x, y),  E(x, y) = ∑_i w_i^T ϕ(x_i, y_i) + ∑_{i,j} w_ij^T ϕ(y_i, y_j)

Point Matching

y = argmax_y F(x, y),  F(x, y) = ∑_i w_i^T ϕ(x_i, y_i) + ∑_{i,j} w_ij^T ϕ(y_i, y_j)

Tracking

y = argmax_y F(x, y),  F(x, y) = w^T ϕ(x, y)

Page 14: Machine learning of structured outputs

Unified Formulation
Predict structured output by maximization

y = argmax_{y∈Y} F(x, y)

of a compatibility function

F(x , y) = 〈w, ϕ(x , y)〉

that is linear in a parameter vector w.
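As a small illustration of this unified formulation (added for this transcript; the feature map and candidate set are invented), prediction is linear scoring followed by an argmax over an enumerable output set:

import numpy as np

def predict(x, candidates, phi, w):
    # f(x) = argmax_{y in Y} <w, phi(x, y)> for an enumerable Y.
    scores = [np.dot(w, phi(x, y)) for y in candidates]
    return candidates[int(np.argmax(scores))]

# Toy example: Y = {0, 1, 2}; phi places x into the block of label y.
phi = lambda x, y: np.concatenate([x if y == k else np.zeros_like(x) for k in range(3)])
w = np.random.randn(6)
print(predict(np.array([1.0, -2.0]), [0, 1, 2], phi, w))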

Page 15: Machine learning of structured outputs

Structured Prediction: how to evaluate argmax_y F(x, y)?

loop-free graphs (chains, trees): Shortest-Path / Belief Propagation (BP)

loopy graphs (grids, arbitrary graphs): GraphCut, approximate inference (e.g. loopy BP)
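For loop-free structures such as chains, the maximization is exact and fast. Below is a sketch (added for this transcript) of max-product dynamic programming (Viterbi) with unary scores theta and pairwise scores psi, both randomly generated here:

import numpy as np

def viterbi(theta, psi):
    # argmax_y sum_t theta[t, y_t] + sum_t psi[y_t, y_{t+1}] on a chain;
    # theta: (T, K) unary scores, psi: (K, K) pairwise scores.
    T, K = theta.shape
    score = theta[0].copy()              # best prefix score per end state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + psi      # cand[a, b]: prefix ends in a, next is b
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(K)] + theta[t]
    y = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):        # trace the best path backwards
        y.append(int(back[t][y[-1]]))
    return y[::-1]

print(viterbi(np.random.randn(5, 3), np.random.randn(3, 3)))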

Structured Learning: how to learn F(x , y) from examples?

Page 16: Machine learning of structured outputs

Machine Learning for Structured Outputs

Learning Problem:
Task: predict structured objects f : X → Y
Experience: example pairs {(x_1, y_1), . . . , (x_N, y_N)} ⊂ X × Y: typical inputs with "correct" outputs for them
Performance measure: ∆ : Y × Y → R

Our choice:
parametric family: F(x, y; w) = 〈w, ϕ(x, y)〉
prediction method: f(x) = argmax_{y∈Y} F(x, y; w)
Task: determine a "good" w

Page 17: Machine learning of structured outputs

Reminder: regularized risk minimization

Find w for decision function F(x, y; w) = 〈w, ϕ(x, y)〉 by

min_{w∈R^d} λ‖w‖² + ∑_{n=1}^{N} ℓ(y_n, F(x_n, ·; w))

Regularization + empirical loss (on training data)

Logistic Loss: Conditional Random Fields
- ℓ(y_n, F(x_n, ·; w)) = log ∑_{y∈Y} exp[F(x_n, y; w) − F(x_n, y_n; w)]

Hinge Loss: Maximum-Margin Training
- ℓ(y_n, F(x_n, ·; w)) = max_{y∈Y} [∆(y_n, y) + F(x_n, y; w) − F(x_n, y_n; w)]

Exponential Loss: Boosting
- ℓ(y_n, F(x_n, ·; w)) = ∑_{y∈Y\{y_n}} exp[F(x_n, y; w) − F(x_n, y_n; w)]
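For a finite label set, the three surrogates are easy to compute from a vector of scores F(x_n, y; w) over all y, the index of y_n, and a loss vector for the hinge case. A sketch added for this transcript, with toy numbers:

import numpy as np

def logistic_loss(scores, n):        # CRF / log loss
    return np.log(np.sum(np.exp(scores - scores[n])))

def hinge_loss(scores, n, delta):    # max-margin; delta[y] = Delta(y_n, y)
    return np.max(delta + scores - scores[n])

def exp_loss(scores, n):             # boosting-style exponential loss
    return np.sum(np.exp(scores - scores[n])) - 1.0   # excludes y = y_n

scores = np.array([1.0, 2.5, 0.3])   # F(x_n, y; w) for Y = {0, 1, 2}, y_n = 1
delta = np.array([1.0, 0.0, 1.0])
print(logistic_loss(scores, 1), hinge_loss(scores, 1, delta), exp_loss(scores, 1))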

Page 18: Machine learning of structured outputs

Maximum Margin Training of Structured Models

(Structured SVMs)

Page 19: Machine learning of structured outputs

Structured Support Vector Machine

Structured Support Vector Machine:

min_{w∈R^d} ½‖w‖² + (C/N) ∑_{n=1}^{N} max_{y∈Y} [∆(y_n, y) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉]

Unconstrained optimization, convex, non-differentiable objective.

[I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun: "Large Margin Methods for Structured and Interdependent Output Variables", JMLR, 2005.]

Page 20: Machine learning of structured outputs

S-SVM Objective Function for w ∈ R²:

[Figure: contour plots of the S-SVM objective over w ∈ R², for C = 0.01, C = 0.10, C = 1.00, and C → ∞]

Page 21: Machine learning of structured outputs

Structured Support Vector Machine:

min_{w∈R^d} ½‖w‖² + (C/N) ∑_{n=1}^{N} max_{y∈Y} [∆(y_n, y) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉]

Unconstrained optimization, convex, non-differentiable objective.

Page 22: Machine learning of structured outputs

Structured SVM (equivalent formulation):

min_{w∈R^d, ξ∈R^N_+} ½‖w‖² + (C/N) ∑_{n=1}^{N} ξ_n

subject to, for n = 1, . . . , N,

max_{y∈Y} [∆(y_n, y) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉] ≤ ξ_n

N non-linear constraints, convex, differentiable objective.

Page 23: Machine learning of structured outputs

Structured SVM (also equivalent formulation):

min_{w∈R^d, ξ∈R^N_+} ½‖w‖² + (C/N) ∑_{n=1}^{N} ξ_n

subject to, for n = 1, . . . , N,

∆(y_n, y) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n, for all y ∈ Y

N·|Y| linear constraints, convex, differentiable objective.

Page 24: Machine learning of structured outputs

Example: A "True" Multiclass SVM

Y = {1, 2, . . . , K},  ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.

ϕ(x, y) = (⟦y = 1⟧ Φ(x), ⟦y = 2⟧ Φ(x), . . . , ⟦y = K⟧ Φ(x)) = Φ(x) e_y^T, where e_y is the y-th unit vector

Solve:

min_{w,ξ} ½‖w‖² + (C/N) ∑_{n=1}^{N} ξ_n

subject to, for n = 1, . . . , N,

〈w, ϕ(x_n, y_n)〉 − 〈w, ϕ(x_n, y)〉 ≥ 1 − ξ_n for all y ∈ Y \ {y_n}.

Classification (MAP): f(x) = argmax_{y∈Y} 〈w, ϕ(x, y)〉

Crammer-Singer Multiclass SVM
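A minimal sketch of this construction (added for this transcript; shapes are invented): the joint feature map places Φ(x) in the block belonging to label y, so 〈w, ϕ(x, y)〉 reads off the score of the y-th block.

import numpy as np

def phi(x, y, K):
    # Joint feature map: Phi(x) in the y-th of K blocks, zeros elsewhere.
    out = np.zeros(K * x.size)
    out[y * x.size:(y + 1) * x.size] = x
    return out

def predict(w, x, K):
    # f(x) = argmax_y <w, phi(x, y)>: the Crammer-Singer decision rule.
    return int(np.argmax([np.dot(w, phi(x, y, K)) for y in range(K)]))

K, d = 4, 3
w = np.random.randn(K * d)
print(predict(w, np.random.randn(d), K))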

Page 25: Machine learning of structured outputs

Hierarchical Multiclass Classification

Loss function can reflect hierarchy, e.g. a tree over the labels cat, dog, car, bus:

∆(y, y′) := ½ (distance in tree)

∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc.

Solve:

min_{w,ξ} ½‖w‖² + (C/N) ∑_{n=1}^{N} ξ_n

subject to, for n = 1, . . . , N,

〈w, ϕ(x_n, y_n)〉 − 〈w, ϕ(x_n, y)〉 ≥ ∆(y_n, y) − ξ_n for all y ∈ Y.
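The tree loss can be computed from parent pointers; a sketch added for this transcript, with the four-class hierarchy above encoded by hand:

def tree_loss(y, yp, parent):
    # Delta(y, y') = 0.5 * (length of the path between y and y' in the tree).
    def ancestors(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path
    a, b = ancestors(y), ancestors(yp)
    common = next(v for v in a if v in b)   # lowest common ancestor
    return 0.5 * (a.index(common) + b.index(common))

parent = {"cat": "animal", "dog": "animal", "car": "vehicle",
          "bus": "vehicle", "animal": "root", "vehicle": "root"}
print(tree_loss("cat", "cat", parent),   # 0.0
      tree_loss("cat", "dog", parent),   # 1.0
      tree_loss("cat", "bus", parent))   # 2.0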

Page 26: Machine learning of structured outputs

Kernelized S-SVM problem: Define

joint kernel function k : (X × Y) × (X × Y) → R,
kernel matrix K_{nn′yy′} = k((x_n, y), (x_{n′}, y′)).

max_{α∈R^{N|Y|}_+} ∑_{n=1,...,N; y∈Y} α_{ny} ∆(y_n, y) − ½ ∑_{n,n′=1,...,N; y,y′∈Y} α_{ny} α_{n′y′} K_{nn′yy′}

subject to, for n = 1, . . . , N,

∑_{y∈Y} α_{ny} ≤ C/N.

Kernelized prediction function:

f(x) = argmax_{y∈Y} ∑_{n,y′} α_{ny′} [k((x_n, y_n), (x, y)) − k((x_n, y′), (x, y))]

Too many variables: train with a working set of the α_{ny}.

Page 27: Machine learning of structured outputs

Applications in Computer Vision

Page 28: Machine learning of structured outputs

Example 1: Category-Level Object Localization

What objects are present? person, car

Page 29: Machine learning of structured outputs

Example 1: Category-Level Object Localization

Where are the objects?

Page 30: Machine learning of structured outputs

Object Localization ⇒ Scene Interpretation

A man inside of a car ⇒ he's driving. A man outside of a car ⇒ he's passing by.

Page 31: Machine learning of structured outputs

Object Localization as Structured Learning:
Given: training examples (x_n, y_n), n = 1, . . . , N

Wanted: prediction function f : X → Y where
- X = {all images}
- Y = {all boxes}

[Figure: f_car(image) = bounding box of the car]

Page 32: Machine learning of structured outputs

Structured SVM framework

Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y_n, y) + 〈w, ϕ(x_n, y)〉.

Solve: min_{w,ξ} ½‖w‖² + C ∑_{n=1}^{N} ξ_n subject to

∀y ∈ Y : ∆(y, y_n) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n

Result:
w* that determines the scoring function F(x, y) = 〈w*, ϕ(x, y)〉,
localization function: f(x) = argmax_y F(x, y).

• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.

Page 33: Machine learning of structured outputs

Feature function: how to represent an (image, box) pair (x, y)?

Observation: whether y is the right box for x depends only on x|_y, the image region inside the box.

ϕ(x, y) := h(x|_y)

where h(r) is a (bag-of-visual-words) histogram representation of the region r.

[Figure: box pairs showing the same object content give similar histograms, ϕ(x, y) = h(x|_y) ≈ h(x′|_{y′}) = ϕ(x′, y′); boxes with different content give dissimilar histograms.]

Page 34: Machine learning of structured outputs

Structured SVM framework

Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y_n, y) + 〈w, ϕ(x_n, y)〉.

Solve: min_{w,ξ} ½‖w‖² + C ∑_{n=1}^{N} ξ_n subject to

∀y ∈ Y : ∆(y, y_n) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n

Result:
w* that determines the scoring function F(x, y) = 〈w*, ϕ(x, y)〉,
localization function: f(x) = argmax_y F(x, y).

• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.

Page 35: Machine learning of structured outputs

Loss function: how to compare two boxes y and y′?

∆(y, y′) := 1 − (area overlap between y and y′)

= 1 − area(y ∩ y′) / area(y ∪ y′)
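For axis-aligned boxes (x1, y1, x2, y2), this loss is a few lines; a sketch added for this transcript:

def box_loss(a, b):
    # Delta(a, b) = 1 - area(a ∩ b) / area(a ∪ b).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return 1.0 - inter / union if union > 0 else 1.0

print(box_loss((0, 0, 2, 2), (1, 1, 3, 3)))   # overlap 1/7 -> loss 6/7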

Page 36: Machine learning of structured outputs

Structured SVM framework

Define:
feature function ϕ : X × Y → R^d,
loss function ∆ : Y × Y → R,
routine to solve argmax_y ∆(y_n, y) + 〈w, ϕ(x_n, y)〉.

Solve: min_{w,ξ} ½‖w‖² + C ∑_{n=1}^{N} ξ_n subject to

∀y ∈ Y : ∆(y, y_n) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n

Result:
w* that determines the scoring function F(x, y) = 〈w*, ϕ(x, y)〉,
localization function: f(x) = argmax_y F(x, y).

• M. Blaschko, C.L.: Learning to Localize Objects with Structured Output Regression, ECCV 2008.

Page 37: Machine learning of structured outputs

How to solve argmax_y ∆(y_n, y) + 〈w, ϕ(x_n, y)〉 ?

Option 1) Sliding Window

[Figure: sliding-window evaluation of ∆ + 〈w, ϕ〉 over candidate boxes, e.g. 1 − 0.3 = 0.7, 1 − 0.8 = 0.2, 1 − 0.1 = 0.9, 1 − 0.2 = 0.8, . . . , 0.3 + 1.4 = 1.7, 0 + 1.5 = 1.5, . . . , 1 − 1.2 = −0.2, 1 − 0.3 = 0.7]

Option 2) Branch-and-Bound Search (another talk)

• C.L., M. Blaschko, T. Hofmann: Beyond Sliding Windows: Object Localization by Efficient Subwindow Search, CVPR 2008.

Page 38: Machine learning of structured outputs

Structured Support Vector Machine

S-SVM Optimization: min_{w,ξ} ½‖w‖² + C ∑_{n=1}^{N} ξ_n

subject to, for n = 1, . . . , N:

∀y ∈ Y : ∆(y, y_n) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n

Solve via constraint generation. Iterate:
- Solve minimization with working set of constraints: new w
- Identify argmax_{y∈Y} ∆(y, y_n) + 〈w, ϕ(x_n, y)〉
- Add violated constraints to working set and iterate

Polynomial time convergence to any precision ε
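The loop can be sketched as follows (an illustration added for this transcript, under strong simplifications: an enumerable output set, and subgradient descent standing in for the QP solver over the working set):

import numpy as np

def train_ssvm(data, phi, delta, Y, dim, C=1.0, rounds=10):
    # Working-set training: alternate between (a) re-fitting w on the
    # current working sets and (b) adding each example's most violated output.
    w = np.zeros(dim)
    work = [set() for _ in data]
    for _ in range(rounds):
        for step in range(200):          # (a) approximate QP by subgradient steps
            grad = w.copy()              # gradient of the regularizer 0.5*||w||^2
            for (x, yn), ws in zip(data, work):
                if not ws:
                    continue
                y = max(ws, key=lambda yy: delta(yn, yy) + w @ phi(x, yy))
                if delta(yn, y) + w @ (phi(x, y) - phi(x, yn)) > 0:
                    grad += (C / len(data)) * (phi(x, y) - phi(x, yn))
            w = w - grad / (1 + step)
        for (x, yn), ws in zip(data, work):   # (b) most violated constraint
            ws.add(max(Y, key=lambda yy: delta(yn, yy) + w @ phi(x, yy)))
    return w

# Toy usage: three labels, scalar inputs, 0/1 loss.
data = [(1.0, 0), (-1.0, 2)]
phi = lambda x, y: x * np.eye(3)[y]
delta = lambda a, b: float(a != b)
w = train_ssvm(data, phi, delta, Y=[0, 1, 2], dim=3)
print([int(np.argmax([w @ phi(x, y) for y in (0, 1, 2)])) for x, _ in data])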

Page 39: Machine learning of structured outputs

Example: Training set (x_1, y_1), . . . , (x_4, y_4)

Page 40: Machine learning of structured outputs

Initialize: no constraints

Solve minimization with working set of constraints ⇒ w = 0
Identify argmax_{y∈Y} ∆(y, y_n) + 〈w, ϕ(x_n, y)〉
- 〈w, ϕ(x_n, y)〉 = 0 → pick any window with ∆(y, y_n) = 1

Add violated constraints to working set and iterate

[Four new constraints, one per training image, each of the form 〈w, ϕ(x_n, y_n)〉 − 〈w, ϕ(x_n, y)〉 ≥ 1 for the selected window y]

Page 41: Machine learning of structured outputs

Working set of constraints:

[Working set: the four constraints 〈w, ϕ(x_n, y_n)〉 − 〈w, ϕ(x_n, y)〉 ≥ 1 added in the previous step]

Solve minimization with working set of constraints
Identify argmax_{y∈Y} ∆(y, y_n) + 〈w, ϕ(x_n, y)〉
Add violated constraints to working set and iterate

[Four new constraints of the same form, now with right-hand sides 1, 0.9, 0.8, and 0.01]

Page 42: Machine learning of structured outputs

Working set of constraints:

[Working set: eight constraints of the form 〈w, ϕ(x_n, y_n)〉 − 〈w, ϕ(x_n, y)〉 ≥ c, with right-hand sides 1, 1, 1, 0.9, 1, 0.8, 1, 0.01]

Solve minimization with working set of constraints
Identify argmax_{y∈Y} ∆(y, y_n) + 〈w, ϕ(x_n, y)〉
Add violated constraints to working set and iterate, . . .

Page 43: Machine learning of structured outputs

S-SVM Optimization: min_{w,ξ} ½‖w‖² + C ∑_{n=1}^{N} ξ_n

subject to, for n = 1, . . . , N:

∀y ∈ Y : ∆(y, y_n) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n

Solve via constraint generation. Iterate:
- Solve minimization with working set of constraints
- Identify argmax_{y∈Y} ∆(y, y_n) + 〈w, ϕ(x_n, y)〉
- Add violated constraints to working set and iterate

Similar to classical bootstrap training, but:
- force a margin between correct and incorrect location scores,
- handle overlapping detections by fractional scores.

Page 44: Machine learning of structured outputs

Results: PASCAL VOC 2006

Example detections for VOC 2006 bicycle, bus and cat.

Precision–recall curves for VOC 2006 bicycle, bus and cat.

Structured training improves detection accuracy.

Page 45: Machine learning of structured outputs

More Recent Results (PASCAL VOC 2009)

aeroplane

Page 46: Machine learning of structured outputs

More Recent Results (PASCAL VOC 2009)

horse

Page 47: Machine learning of structured outputs

More Recent Results (PASCAL VOC 2009)

sheep

Page 48: Machine learning of structured outputs

More Recent Results (PASCAL VOC 2009)

sofa

Page 49: Machine learning of structured outputs

Why does it work?

Learned weights from binary (center) and structured training (right).

Both training methods: positive weights at the object region.
Structured training: negative weights for features just outside the bounding box position.
The posterior distribution over box coordinates becomes more peaked.

Page 50: Machine learning of structured outputs

Example 2: Category-Level Object Segmentation

Where exactly are the objects?

Page 51: Machine learning of structured outputs

Segmentation as Structured Learning:
Given: training examples (x_n, y_n), n = 1, . . . , N

Wanted: prediction function f : X → Y with
- X = {all images}
- Y = {all binary segmentations}

Page 52: Machine learning of structured outputs

Structured SVM framework

Define:
Feature function ϕ(x, y) ∈ R^d:
- unary terms ϕ_i(x, y_i) for each pixel i
- pairwise terms ϕ_ij(x, y_i, y_j) for neighbors (i, j)

Loss function ∆ : Y × Y → R:
- ideally decomposes like ϕ

Solve: min_{w,ξ} ½‖w‖² + C ∑_{n=1}^{N} ξ_n subject to

∀y ∈ Y : ∆(y, y_n) + 〈w, ϕ(x_n, y)〉 − 〈w, ϕ(x_n, y_n)〉 ≤ ξ_n

Result:
w* that determines the scoring function F(x, y) = 〈w*, ϕ(x, y)〉,
segmentation function: f(x) = argmax_y F(x, y).

Page 53: Machine learning of structured outputs

Example choices:

Feature functions, unary terms c = {i}:

ϕ_i(x, y_i) = (0, h(x_i)) for y_i = 0, (h(x_i), 0) for y_i = 1,

where h(x_i) is the color histogram of pixel i.

Feature functions, pairwise terms c = {i, j}:
ϕ_ij(x, y_i, y_j) = ⟦y_i ≠ y_j⟧.

Loss function: Hamming loss
∆(y, y′) = ∑_i ⟦y_i ≠ y′_i⟧

Page 54: Machine learning of structured outputs

How to solve argmax_y [∆(y_n, y) + F(x_n, y)] ?

∆(y_n, y) + F(x_n, y) = ∑_i ⟦y^n_i ≠ y_i⟧ + ∑_i w_i^T h(x^n_i) + w_ij ∑_{i,j} ⟦y_i ≠ y_j⟧

= ∑_i [w_i^T h(x^n_i) + ⟦y^n_i ≠ y_i⟧] + w_ij ∑_{i,j} ⟦y_i ≠ y_j⟧

If w_ij ≥ 0 (which makes sense), then E := −F is submodular: use the GraphCut algorithm to find the global optimum efficiently.

- also possible: (loopy) belief propagation, variational inference, greedy search, simulated annealing, . . .

• [M. Szummer, P. Kohli: "Learning CRFs using graph cuts", ECCV 2008]
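A sketch of the loss-augmented problem (added for this transcript): the Hamming loss adds +1 to the unary score of every label that disagrees with y_n, so it folds directly into the unaries. Simple ICM stands in for graph cuts here so the example stays dependency-free; all shapes and weights are invented:

import numpy as np

def loss_augmented_icm(unary, w_pair, y_true, sweeps=5):
    # Approximately maximize F(x, y) + Delta(y_n, y) on a grid, where the
    # pairwise term is a Potts penalty -w_pair * [y_i != y_j], w_pair >= 0.
    H, W, K = unary.shape
    aug = unary + (np.arange(K)[None, None, :] != y_true[:, :, None])  # +1 where y != y_n
    y = np.argmax(aug, axis=2)            # start from the unary-optimal labeling
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                scores = aug[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        scores -= w_pair * (np.arange(K) != y[ni, nj])
                y[i, j] = int(np.argmax(scores))
    return y

y_n = np.zeros((8, 8), dtype=int)
print(loss_augmented_icm(np.random.randn(8, 8, 2), w_pair=0.5, y_true=y_n))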

Page 55: Machine learning of structured outputs

Extension: Image segmentation with connectedness constraints

Knowing that the object is connected improves segmentation quality.

[Figure: ordinary segmentation ← original image → connected segmentation]

Page 56: Machine learning of structured outputs

Segmentation as Structured Learning:
Given: training examples (x_n, y_n), n = 1, . . . , N

Wanted: prediction function f : X → Y where
- X = {all images (as superpixels)}
- Y = {all connected binary segmentations}

• S. Nowozin, C.L.: Global Connectivity Potentials for Random Field Models, CVPR 2009.

Page 57: Machine learning of structured outputs

Feature functions, unary terms c = {i}:

ϕ_i(x, y_i) = (0, h(x_i)) for y_i = 0, (h(x_i), 0) for y_i = 1,

where h(x_i) is the bag-of-visual-words histogram of superpixel i.

Feature functions, pairwise terms c = {i, j}:
ϕ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧.

Loss function: Hamming loss
∆(y, y′) = ∑_i ⟦y_i ≠ y′_i⟧

Page 58: Machine learning of structured outputs

How to solve f(x) = argmax_{y connected} ∆(y_n, y) + F(x_n, y) ?

Linear programming relaxation with connectivity constraints:
rewrite the energy so that it is linear in new variables µ^l_i and µ^{ll′}_{ij}:

F(x, y) = ∑_i [ w_1^T h_i(x) µ^1_i + w_2^T h_i(x) µ^{−1}_i + ∑_{l≠l′} w_3 µ^{ll′}_{ij} ]

subject to

µ^l_i ∈ {0, 1}, µ^{ll′}_{ij} ∈ {0, 1}, ∑_l µ^l_i = 1, ∑_l µ^{ll′}_{ij} = µ^{l′}_j, ∑_{l′} µ^{ll′}_{ij} = µ^l_i

relax to µ^l_i ∈ [0, 1] and µ^{ll′}_{ij} ∈ [0, 1]
solve the linear program with additional linear constraints:

µ^1_i + µ^1_j − ∑_{k∈S} µ^1_k ≤ 1 for any set S of nodes separating i and j.

Page 59: Machine learning of structured outputs

Example Results:

[Figure: original image vs. segmentation with connectivity]

. . . still room for improvement . . .

Page 60: Machine learning of structured outputs

Summary

Machine Learning of Structured Outputs
Task: predict f : X → Y for (almost) arbitrary Y
Key idea:
- learn a scoring function F : X × Y → R
- predict using f(x) := argmax_y F(x, y)

Structured Support Vector Machines
Parametrize F(x, y) = 〈w, ϕ(x, y)〉
Learn w from training data by the maximum-margin criterion
Needs only:
- feature function ϕ(x, y)
- loss function ∆(y, y′)
- routine to solve argmax_y ∆(y_n, y) + F(x_n, y)

Page 61: Machine learning of structured outputs

Applications
Many different applications in a unified framework:
- Natural Language Processing: parsing
- Computational Biology: secondary structure prediction
- Computer Vision: pose estimation, object localization/segmentation
- . . .

Open Problems
Theory:
- which output structures are useful?
- (how) can we use an approximate argmax_y?

Practice:
- more applications? new domains?
- training speed!

Thank you!