CVPR 2010: Higher Order Models in Computer Vision: Part 1, 2

Tractable Higher Order Models in Computer Vision
Carsten Rother, Sebastian Nowozin
Microsoft Research Cambridge

Transcript of CVPR 2010: Higher Order Models in Computer Vision: Part 1, 2

Page 1

Tractable Higher Order Models in Computer Vision

Carsten Rother Sebastian Nowozin

Microsoft Research Cambridge

Page 2

Schedule

8:30-9:00 Introduction

9:00-10:00 Models: small cliques and special potentials

10:00-10:30 Tea break

10:30-12:00 Inference: relaxation techniques: LP, Lagrangian, dual decomposition

12:00-12:30 Models: global potentials and global parameters + discussion

Page 3

A gentle intro to MRFs

Goal: given observations z, infer the unknown (latent) variables x:

P(x|z) = P(z|x) P(x) / P(z) ~ P(z|x) P(x)

z Є (R,G,B)^n, x Є {0,1}^n

Posterior probability ~ likelihood (data-dependent) × prior (data-independent)

Maximum a Posteriori (MAP): x* = argmax_x P(x|z)

Page 4

Likelihood: P(x|z) ~ P(z|x) P(x)

[Figure: foreground and background color likelihoods in the Red/Green plane]

Page 5

Likelihood: P(x|z) ~ P(z|x) P(x)

Maximum likelihood:

x* = argmax_x P(z|x) = argmax_x ∏i P(zi|xi)

[Figure: log-likelihoods log P(zi|xi=0) and log P(zi|xi=1)]

Page 6

Prior: P(x|z) ~ P(z|x) P(x)

P(x) = 1/f ∏_{i,j Є N} θij(xi,xj),   f = ∑_x ∏_{i,j Є N} θij(xi,xj)   "partition function"

θij(xi,xj) = exp{-|xi-xj|}   "Ising prior"   (exp{-1} ≈ 0.37; exp{0} = 1)
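As a concrete illustration (not part of the slides; all names are made up), here is a brute-force sketch that evaluates the unnormalized Ising prior and its partition function f on a tiny 2x3 binary image:

```python
import itertools
import numpy as np

def ising_prior_unnormalized(x):
    # Product of exp(-|xi - xj|) over 4-connected neighbour pairs.
    h, w = x.shape
    p = 1.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                p *= np.exp(-abs(int(x[i, j]) - int(x[i + 1, j])))
            if j + 1 < w:
                p *= np.exp(-abs(int(x[i, j]) - int(x[i, j + 1])))
    return p

# Partition function f by brute-force enumeration (only feasible for tiny grids).
h, w = 2, 3
f = sum(ising_prior_unnormalized(np.array(bits).reshape(h, w))
        for bits in itertools.product([0, 1], repeat=h * w))

x_smooth = np.zeros((h, w), dtype=int)         # constant labelling
x_checker = np.array([[0, 1, 0], [1, 0, 1]])   # checkerboard labelling
print(ising_prior_unnormalized(x_smooth) / f)   # high prior probability
print(ising_prior_unnormalized(x_checker) / f)  # low prior probability
```

The smooth labelling gets a much higher prior probability than the checkerboard, which is exactly the behaviour the next slide illustrates with sampled modes.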

Page 7

Prior

Pure prior model: P(x) = 1/f ∏_{i,j Є N} exp{-|xi-xj|}

Solutions with the highest probability (modes): P(x) = 0.012, P(x) = 0.012, P(x) = 0.011

Fair samples

The smoothness prior needs the likelihood.

Page 8

Weight prior and likelihood

E(x,z,w) = ∑i θi(xi,zi) + w ∑_{i,j} θij(xi,xj)

[Results for w = 0, 10, 40, 200]

Page 9

Posterior distribution

"Gibbs" distribution: P(x|z) = 1/f(z,w) exp{-E(x,z,w)}

Energy: E(x,z,w) = ∑i θi(xi,zi) + w ∑_{i,j} θij(xi,xj)   (unary terms + pairwise terms)

Likelihood: θi(xi,zi) = -log P(zi|xi=1) xi - log P(zi|xi=0) (1-xi)

Prior: θij(xi,xj) = |xi-xj|

P(x|z) ~ P(z|x) P(x)

Page 10

Energy minimization

P(x|z) = 1/f(z,w) exp{-E(x,z,w)},   f(z,w) = ∑_x exp{-E(x,z,w)}

-log P(x|z) = -log(1/f(z,w)) + E(x,z,w)

MAP is the same as the minimum-energy labelling: x* = argmin_x E(x,z,w)

[Figure: energy landscape with the ML and MAP (global minimum of E) solutions marked]

Page 11

Random Field Models for Computer Vision

Model: variables: discrete or continuous? If discrete: how many labels? Space: discrete or continuous? Dependencies between variables? How many variables? …

Inference: combinatorial optimization, e.g. graph cut; message passing, e.g. BP, TRW; Iterated Conditional Modes (ICM); LP relaxation, e.g. cutting plane; problem decomposition + subgradient

Learning: exhaustive search (grid search); pseudo-likelihood approximation; training in pieces; max-margin

Applications: 2D/3D image segmentation, object recognition, 3D reconstruction, stereo / optical flow, image denoising, texture synthesis, pose estimation, panoramic stitching, …

Page 12

Introducing Factor Graphs

Ways to write probability distributions as graphical models:
- directed graphical models
- undirected graphical models, "traditionally used for MRFs"
- factor graphs, "best way to visualize the underlying energy"

References:
- Pattern Recognition and Machine Learning [Bishop '08, book, chapter 8]
- several lectures at the Machine Learning Summer School 2009 (see video lectures)

Page 13

Factor Graphs

P(x) ~ θ(x1,x2,x3) θ(x2,x4) θ(x3,x4) θ(x3,x5)   "4 factors"

[Figure: factor graph over the unobserved variables x1,…,x5; a factor node connects the variables that appear in that factor]

Gibbs distribution: P(x) ~ exp{-E(x)},   E(x) = θ(x1,x2,x3) + θ(x2,x4) + θ(x3,x4) + θ(x3,x5)
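To make the notation concrete, here is a small sketch (the potential values are invented purely for illustration) that evaluates the Gibbs energy E(x) of the example above and finds the MAP labelling by enumerating all 2^5 binary assignments:

```python
import itertools

# Potentials for E(x) = θ(x1,x2,x3) + θ(x2,x4) + θ(x3,x4) + θ(x3,x5).
def theta_triple(a, b, c):
    return 0.0 if a == b == c else 1.0   # invented triple potential

def theta_pair(a, b):
    return abs(a - b)                    # Ising-style pairwise potential

def energy(x):
    x1, x2, x3, x4, x5 = x
    return (theta_triple(x1, x2, x3) + theta_pair(x2, x4)
            + theta_pair(x3, x4) + theta_pair(x3, x5))

# Brute-force MAP: only 2^5 binary labellings for this tiny factor graph.
x_map = min(itertools.product([0, 1], repeat=5), key=energy)
print(x_map, energy(x_map))
```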

Page 14

Definition "Order"

Definition "Order": the arity (number of variables) of the largest factor

P(x) ~ θ(x1,x2,x3) θ(x2,x4) θ(x3,x4) θ(x3,x5)

[Figure: factor graph of order 3: one factor of arity 3, three of arity 2; the corresponding undirected model contains a triple clique]

Extras:
• I will use "factor" and "clique" in the same way. This is not fully correct, since a clique may or may not decompose into smaller factors.
• The definition of "order" is the same for clique and factor.
• Markov property: a random field with low-order factors/cliques.

Page 15

Random Fields in Vision

4-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N4} θij(xi,xj)

higher(8)-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N8} θij(xi,xj)

Higher-order MRF (order n):
E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn)

MRF with global variables (order 2):
E(x,w) = ∑i θi(xi,w) + ∑_{i,j Є N4} θij(xi,xj)

Page 16

4-connected: Segmentation

E(x) = ∑i θi(xi,zi) + ∑_{i,j Є N4} θij(xi,xj)

[Figure: factor graph with observed variables zi and unobserved variables xi, xj]
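A minimal sketch of this segmentation energy (the unaries below are toy squared distances to assumed fg/bg grey-value means of 0.8 and 0.2, not the actual model from the slides):

```python
import numpy as np

def segmentation_energy(x, z, smoothness=1.0):
    # E(x) = sum_i theta_i(xi, zi) + w * sum_{i,j in N4} |xi - xj|
    means = {1: 0.8, 0: 0.2}   # assumed per-label appearance means
    unary = sum((z[i, j] - means[int(x[i, j])]) ** 2
                for i in range(x.shape[0]) for j in range(x.shape[1]))
    # 4-connected Ising pairwise term: count label changes between neighbours.
    pairwise = (np.abs(np.diff(x, axis=0)).sum()
                + np.abs(np.diff(x, axis=1)).sum())
    return unary + smoothness * pairwise

z = np.array([[0.9, 0.8, 0.1], [0.7, 0.2, 0.1]])   # toy grey-value "image"
x = np.array([[1, 1, 0], [1, 0, 0]])               # candidate segmentation
print(segmentation_energy(x, z))
```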

Page 17

4-connected: Segmentation (CRF)

E(x) = ∑i θi(xi,zi) + ∑_{i,j Є N4} θij(xi,xj,zi,zj)

Conditional Random Field (CRF): no pure prior, since the pairwise terms also depend on the data.

[Figure: factor graphs of the MRF and the CRF, with observed variables zi, zj and unobserved (latent) variables xi, xj]

Page 18

4-connected: Stereo matching

[Figure: image left (a); image right (b); ground-truth depth]

• Images rectified
• Ignore occlusion for now

Labels: d (depth/shift); energy: E(d): {0,…,D-1}^n → R

[Boykov et al. '01]

Page 19

Random Fields in Vision

4-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N4} θij(xi,xj)

higher(8)-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N8} θij(xi,xj)

Higher-order MRF (order n):
E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn)

MRF with global variables (order 2):
E(x,w) = ∑i θi(xi,w) + ∑_{i,j Є N4} θij(xi,xj)

Page 20

Highly connected: discretization artefacts

Higher connectivity can model the true Euclidean length.

[Figure: 4-connected vs. Euclidean; 8-connected vs. Euclidean]

[Boykov et al. '03, '05]

Page 21

3D reconstruction

[Slide credits: Daniel Cremers]

Page 22

Stereo with occlusion

Each pixel is connected to D pixels in the other image.

E(d): {1,…,D}^{2n} → R

[Figure: ground truth; stereo with occlusion [Kolmogorov et al. '02]; stereo without occlusion [Boykov et al. '01]]

Page 23

Texture De-noising

[Figure: test image; test image (60% noise); training images]

[Results: MRF 4-connected; MRF 4-connected (neighbours); MRF 9-connected (7 attractive, 2 repulsive)]

Page 24

Random Fields in Vision

4-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N4} θij(xi,xj)

higher(8)-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N8} θij(xi,xj)

Higher-order MRF (order n):
E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn)

MRF with global variables (order 2):
E(x,w) = ∑i θi(xi,w) + ∑_{i,j Є N4} θij(xi,xj)

Page 25

Reason 4: Use non-local parameters: Interactive Segmentation (GrabCut)

[Boykov and Jolly '01]

GrabCut [Rother et al. '04]

Page 26

MRF with global parameters: Interactive Segmentation (GrabCut)

An object is a compact set of colors [Rother et al. Siggraph '04]:

[Figure: foreground and background color distributions (GMMs), parameterized by the global variable w]

Jointly model the segmentation and the color models:

E(x,w) = ∑i θi(xi,w) + ∑_{i,j Є N4} θij(xi,xj)

E(x,w): {0,1}^n × {GMMs} → R
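A hedged sketch of the joint-optimization idea: alternate between fitting the color models w given the segmentation x, and updating x given w. Everything here is a simplification for illustration: per-label mean colors stand in for the GMMs, and a unary-only update stands in for the graph-cut step:

```python
import numpy as np

def fit_color_models(z, x):
    # Simplified stand-in for GrabCut's GMMs: one mean color per label.
    w = {}
    for l in (0, 1):
        sel = z[x == l]
        w[l] = sel.mean(axis=0) if len(sel) else np.full(3, 0.5)  # fallback if a label is empty
    return w

def update_segmentation(z, w):
    # Minimize the unary part only (the real step is a graph cut that
    # also includes the pairwise smoothness term).
    d0 = ((z - w[0]) ** 2).sum(axis=-1)
    d1 = ((z - w[1]) ** 2).sum(axis=-1)
    return (d1 < d0).astype(int)

# Toy alternation over E(x, w): fix x, fit w; fix w, update x.
rng = np.random.default_rng(0)
z = rng.random((8, 8, 3))              # toy RGB image
x = (z[..., 0] > 0.5).astype(int)      # toy initialization
for _ in range(5):
    w = fit_color_models(z, x)
    x = update_segmentation(z, w)
print(x)
```

Each step can only decrease (or keep) the energy, which is why this kind of alternation converges in practice.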

Page 27

Latent/Hidden CRFs

• ObjCut [Kumar et al. '05]; Deformable Part Model [Felzenszwalb et al. CVPR '08]; PoseCut [Bray et al. '06]; LayoutCRF [Winn et al. '06]
• Hidden variables are never observed (neither in training nor testing), e.g. parts
• Maximize over the hidden variables
• ML: Deep Belief Networks, Restricted Boltzmann Machines [Hinton et al.] (often sampling is done)

[Figure: "parts", "instance label", "instance" (from LayoutCRF [Winn et al. '06])]

Page 28

Random Fields in Vision

4-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N4} θij(xi,xj)

higher(8)-connected, pairwise MRF (order 2):
E(x) = ∑_{i,j Є N8} θij(xi,xj)

Higher-order MRF (order n):
E(x) = ∑_{i,j Є N4} θij(xi,xj) + θ(x1,…,xn)

MRF with global variables (order 2):
E(x,w) = ∑i θi(xi,w) + ∑_{i,j Є N4} θij(xi,xj)

Page 29

First Example

This tutorial:
1. What higher-order models have been used in vision?
2. How is MAP inference done for those models?
3. What is the relationship between higher-order MRFs and MRFs with global variables?

Standard MRF with a connectivity constraint (given user input):

E(x) = P(x) + h(x)   with h(x) = { ∞ if x is not 4-connected; 0 otherwise }

[Figure: user input; Ising prior alone: smooths the boundary, removes noise; with the connectivity term: smooths the boundary only]

Page 30

Inference (very brief summary)

• Message-passing techniques (BP, TRW, TRW-S)
– Defined on the factor graph [Potetz '07, Lan '06]
– Can be applied to any model (in theory) (higher order, multi-label)

• LP relaxation (… more in part III)
– Relax the original problem ({0,1} to [0,1]) and solve with existing techniques (e.g. sub-gradient)
– Can be applied to any model (depending on the solver used)
– Connections to TRW (message passing)

Page 31

Inference (very brief summary)

• Dual/problem decomposition (… more in part III)
– Decompose the (NP-)hard problem into tractable ones; solve with sub-gradient
– Can be applied to any model (depending on the solver used)

• Combinatorial optimization (… more in part II)
– Binary, pairwise MRF: graph cut, BHS (QPBO)
– Multi-label, pairwise: move-making; transformation
– Binary, higher-order factors: transformation
– Multi-label, higher-order factors: move-making + transformation

Page 32

Inference for higher-order models (very brief summary)

• Arbitrary potentials are only tractable for order < 7 (memory, computation time)
• For order ≥ 7, the potentials need some structure that can be exploited to make them tractable

Page 33

Forthcoming book!

• MIT Press (probably end of 2010)
• Most topics of this course and much, much more
• Contributors: the usual suspects: the editors + Boykov, Kolmogorov, Weiss, Freeman, Komodakis, …

Advances in Markov Random Fields for Computer Vision (Blake, Kohli, Rother)

Page 34

Schedule

8:30-9:00 Introduction

9:00-10:00 Models: small cliques and special potentials

10:00-10:30 Tea break

10:30-12:00 Inference: relaxation techniques: LP, Lagrangian, dual decomposition

12:00-12:30 Models: global potentials and discussion

Page 35

Small cliques (<7) (and transformation approach)

Page 36

Optimization: Binary, Pairwise

E(x) = ∑i θi(xi) + ∑_{i,j Є N} θij(xi,xj),   xi Є {0,1}

Submodular:
• θij(0,0) + θij(1,1) ≤ θij(0,1) + θij(1,0)
• The condition holds naturally for many vision problems (e.g. segmentation: |xi-xj|)
• Graph cut computes the global optimum efficiently (~0.5 sec for 1 MPixel [Boykov, Kolmogorov '01])

Non-submodular:
• BHS algorithm (also called QPBO algorithm) [Boros, Hammer and Sun '91, Kolmogorov et al. '07]
• Graph cut on a special graph; output in {0, 1, '?'}
• Partial optimality (various solutions for the '?' nodes)
• Solves the underlying LP relaxation
• Quality depends on the application (see [Rother et al. CVPR '07])
• Extensions exist: QPBOP, QPBOI (see [Rother et al. CVPR '07, Woodford et al. '08])
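The submodularity condition is easy to test directly; a small illustrative helper (assuming the pairwise costs are given as a table):

```python
def is_submodular(theta):
    # theta[(xi, xj)] -> cost; the graph-cut condition from the slide.
    return theta[(0, 0)] + theta[(1, 1)] <= theta[(0, 1)] + theta[(1, 0)]

ising = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): 1}   # |xi - xj|
print(is_submodular(ising))   # True -> graph cut finds the global optimum
print(is_submodular({(0, 0): 2, (1, 1): 2, (0, 1): 1, (1, 0): 1}))  # False -> needs BHS/QPBO
```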

Page 37

Optimization: Binary, Pairwise

Quadratic pseudo-Boolean optimization (QPBO): B^2 → R

f(x1,x2) = θ11 x1x2 + θ10 x1(1-x2) + θ01 (1-x1)x2 + θ00 (1-x1)(1-x2)

f(x1,x2) = a x1x2 + b x1 + c x2 + d

Reminder: encoding for graph cut

[Figure: s-t graph over x1, x2 with edge weights θ00, θ11, θ01-θ00, θ10-θ11; a cut gives a labelling (energy); all weights are positive if submodular (re-parameterization to normal form)]
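A small sketch (names invented) that converts a 2x2 cost table into the polynomial coefficients a, b, c, d; note that a ≤ 0 is exactly the submodularity condition θ00 + θ11 ≤ θ01 + θ10:

```python
import itertools

def table_to_poly(t00, t01, t10, t11):
    # Rewrite the 2x2 cost table as f(x1,x2) = a*x1*x2 + b*x1 + c*x2 + d.
    d = t00
    b = t10 - t00
    c = t01 - t00
    a = t11 - t10 - t01 + t00   # a <= 0 exactly when the table is submodular
    return a, b, c, d

t00, t01, t10, t11 = 0.0, 1.0, 1.0, 0.0   # Ising pair |x1 - x2|
a, b, c, d = table_to_poly(t00, t01, t10, t11)
table = {(0, 0): t00, (0, 1): t01, (1, 0): t10, (1, 1): t11}
for x1, x2 in itertools.product([0, 1], repeat=2):
    assert a * x1 * x2 + b * x1 + c * x2 + d == table[(x1, x2)]
print(a, b, c, d)   # -2.0 1.0 1.0 0.0
```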

Page 38

Optimization: binary, higher-order

f(x1,x2,x3) = θ111 x1x2x3 + θ110 x1x2(1-x3) + θ101 x1(1-x2)x3 + …

f(x1,x2,x3) = a x1x2x3 + b x1x2 + c x2x3 + …

Quadratic polynomials can be handled (previous slide).

Idea: transform the higher-than-2nd-order terms into 2nd-order terms.

Methods:
1. transformation by "substitution"
2. transformation by "min. selection"

Page 39

Transformation by "substitution" [Rosenberg '75, Boros and Hammer '02, Ali et al. ECCV '08]

"Substitution": f(x1,x2,x3) = x1x2x3 + x1x2 + x2

Auxiliary function, with z Є {0,1}:

D(x1,x2,z) = x1x2 - 2x1z - 2x2z + 3z

It satisfies: D(x1,x2,z) = 0 if x1x2 = z, and D(x1,x2,z) > 0 if x1x2 ≠ z.

f(x1,x2,x3) = min_z g(x1,x2,x3,z) = min_z [ z x3 + z + x2 + K D(x1,x2,z) ]

Since K is very large, any minimum has x1x2 = z. Apply this recursively…

Problem:
• Doesn't work in practice [Ishikawa CVPR '09]
• the function D is non-submodular, and "K enforces this strongly"
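The substitution identity can be verified exhaustively; a minimal check (K = 10 is chosen arbitrarily as "large enough" for this example):

```python
import itertools

K = 10  # must exceed the gain from violating x1*x2 = z

f = lambda x1, x2, x3: x1 * x2 * x3 + x1 * x2 + x2
D = lambda x1, x2, z: x1 * x2 - 2 * x1 * z - 2 * x2 * z + 3 * z
g = lambda x1, x2, x3, z: z * x3 + z + x2 + K * D(x1, x2, z)

for x1, x2, x3 in itertools.product([0, 1], repeat=3):
    # D vanishes exactly when z equals the product x1*x2 ...
    assert D(x1, x2, x1 * x2) == 0
    assert all(D(x1, x2, z) > 0 for z in (0, 1) if z != x1 * x2)
    # ... so minimizing g over z recovers f.
    assert f(x1, x2, x3) == min(g(x1, x2, x3, z) for z in (0, 1))
print("substitution identity verified on all 8 inputs")
```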

Page 40

Transformation by "min. selection" [Freedman and Drineas '05, Kolmogorov and Zabih '04, Ishikawa '09]

f(x1,x2,x3) = a x1x2x3

Useful identity, with z Є {0,1}:

-x1x2x3 = min_z -z(x1+x2+x3-2)

Check: if x1 = x2 = x3 = 1 then z = 1; otherwise z = 0.

Transform, case a < 0:

f(x1,x2,x3) = min_z a z (x1+x2+x3-2)

Case a > 0 (similar trick):

f(x1,x2,x3) = min_z a { z(x1+x2+x3-1) + (x1x2+x2x3+x3x1) - (x1+x2+x3) + 1 }
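Both min-selection identities can be checked by enumerating all eight binary inputs; a minimal sketch (the coefficients ±3 are arbitrary):

```python
import itertools

def check(a):
    for x1, x2, x3 in itertools.product([0, 1], repeat=3):
        s = x1 + x2 + x3
        target = a * x1 * x2 * x3
        if a < 0:   # negative-coefficient rule
            got = min(a * z * (s - 2) for z in (0, 1))
        else:       # positive-coefficient rule
            got = min(a * (z * (s - 1) + (x1*x2 + x2*x3 + x3*x1) - s + 1)
                      for z in (0, 1))
        assert got == target, (a, (x1, x2, x3), got, target)

check(-3)
check(+3)
print("min-selection identities verified")
```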

Page 41

Transformation by "min. selection": the general case

A term of degree d is transformed with n_d = floor((d-1)/2) new variables w.

[General formula from [Ishikawa PAMI '09]]

Page 42

Full Procedure [Ishikawa '09]

• Worst case exponential: a potential of order 5 gives up to 15 new variables
• Probably tractable for up to order 6
• May get very hard to solve (non-submodular)
• Code available online on Ishikawa's webpage

General 5th-order potential:
f(x1,…,x5) = a x1x2x3x4x5 + b x1x2x3x4 + c x1x2x3x5 + d x1x2x4x5 + …
… transform until all terms of degree > 2 are replaced by degree-2 terms

Page 43

Application 1: De-noising with Field-of-Experts [Roth and Black '05, Ishikawa '09]

E(x) = ∑i (zi - xi)^2 / 2σ^2  +  ∑c ∑k αk log(1 + 0.5 (Jk xc)^2)

Unary likelihood + FoE prior; xc: set of n×n patches (here 2×2); Jk: set of filters

(from [Ishikawa PAMI '09, Roth et al. '05])

A non-convex optimization problem.

How to handle continuous labels in a discrete MRF?

Page 44

Solve with fusion move [Lempitsky et al. ICCV '07, '08, '10, Woodford et al. '08]

Fusion-move optimization:
1. x = arbitrary labelling
2. E'(x') = binary MRF for the fusion with a proposal: x' = 0 keeps the current label, x' = 1 takes the proposal label (use the BHS algorithm)
3. go to 2) if the energy went down

[Figure: initial labelling fused with a proposal to give the final labelling; "alpha expansion" corresponds to fusing with a constant proposal]

Page 45

Application 1: De-noising with Field-of-Experts [Lempitsky et al. ICCV '07, '08, '10, Woodford et al. '08]

Properties of fusion move:
1. Performance depends on the performance of the BHS algorithm (number of labelled nodes)
2. Guarantee: E does not go up
3. In practice nodes are often labelled, because θij(0,0) + θij(1,1) ≤ θij(0,1) + θij(1,0) often holds ("often low cost")

Page 46

Results

[Figure: original; noisy; pairwise model; FoE ("factor graph, BP-based"); TV-norm (continuous model). Results reported as PSNR/E. From [Ishikawa PAMI '09]]

Page 47

Results

[Figure: comparison with "substitution": blur & random proposals, only 0.0002% of nodes labelled. From [Ishikawa PAMI '09]]

Page 48

Application 2: Curvature in stereo [Woodford et al. CVPR '08]

f(x1,x2,x3) = x1 - 2x2 + x3, where xi Є {0,…,D} is a depth label

Example: slanted plane: f(1,2,3) = 0

[Figure: image; pairwise terms; triple terms. From [Woodford et al. CVPR '08]]

Page 49

Application 3: Higher-order likelihood for optical flow [Glocker et al. ECCV '10]

• Pairwise MRF: likelihood in the unaries as an NCC cost: approximation error!
• Higher-order likelihood: done with triple cliques (ideally higher)
• Also uses 3rd/4th-order terms so as not to penalize any affine motion

[Figure: image 1; image 2; one image; bi-layer triangulation; optical flow]

Page 50

Any size, Special potentials (and transformation approach)

Page 51

Label-Cost Potential [Hoiem et al. '07, Delong et al. '10, Bleyer et al. '10]

E(x) = P(x) + ∑_{l Є L} cl [∃p: xp = l],   E: {1,…,L}^n → R

"pairwise MRF" P(x) plus a "label cost" cl for each label l that is used

[Figure: image; GrabCut-style result; with a cost for each new label [Delong et al. '10] (same function as [Zhu and Yuille '96]); label cost = 4c vs. label cost = 10c. From [Delong et al. '10]]

Basic idea: penalize the complexity of the model
• Minimum description length (MDL)
• Bayesian information criterion (BIC)
• Akaike information criterion (AIC)

Page 52

How is it done…

In an alpha-expansion step (from [Delong et al. '10]):

E(x') = P(x') + ∑_{l Є L} ( cl - cl ∏_{p Є Pl} x'p )

where Pl is the set of pixels currently labelled l: the cost cl for label l is saved exactly when all of its pixels take the expansion move.

[Figure: two example moves x' Є {0,1}^n: in example 1 label b survives the move, so its cost is cb; in example 2 all b-pixels switch, so the cost for b is 0]

Formally, case a < 0:

a ∏_{i Є Pl} xi = min_w a w ( ∑_{i Є Pl} xi - |Pl| + 1 ),   w Є {0,1}

Submodular!
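The transformation used for the label-cost term can be verified exhaustively for a small clique (a = -5 and n = 4 are chosen arbitrarily):

```python
import itertools

a, n = -5, 4   # a = -c_l < 0, clique P_l of n pixels
for x in itertools.product([0, 1], repeat=n):
    prod = 1
    for xi in x:
        prod *= xi
    lhs = a * prod
    rhs = min(a * w * (sum(x) - n + 1) for w in (0, 1))
    assert lhs == rhs
print("label-cost transformation verified")
```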

Page 53

Application: Model fitting [Delong et al. '10]

[Figure: model-fitting results (no MRF)]

Page 54

Example: surface-based stereo [Bleyer et al. '10]

The 3D scene is explained by a small set of 3D surfaces.

[Figure: left image; surfaces and depth without the label-cost prior; surfaces and depth with the label-cost prior]

Page 55

Example: 3D LayoutCRF: recognition and segmentation [Hoiem et al. '07]

[Figure: results with instance cost]

Page 56

Robust (soft) Pn Potts model [Kohli et al. CVPR '07, '08, PAMI '08, IJCV '09]

Page 57

Image Segmentation

E(x) = ∑i ci xi + ∑_{i,j} dij |xi-xj|,   E: {0,1}^n → R

0 → fg, 1 → bg; n = number of pixels

[Boykov and Jolly '01] [Blake et al. '04] [Rother et al. '04]

[Figure: image; unary cost; segmentation]

Page 58

Pn Potts Potentials

Patch dictionary (tree)

h(xp) = { 0 if xi = 0 for all i Є p; Cmax otherwise }

[Figure: patch dictionary; costs range from 0 to Cmax]

[slide credits: Kohli]

Page 59

Pn Potts Potentials

E(x) = ∑i ci xi + ∑_{i,j} dij |xi-xj| + ∑p hp(xp),   E: {0,1}^n → R

h(xp) = { 0 if xi = 0 for all i Є p; Cmax otherwise }

0 → fg, 1 → bg; n = number of pixels

[slide credits: Kohli]

Page 60

Image Segmentation

E(x) = ∑i ci xi + ∑_{i,j} dij |xi-xj| + ∑p hp(xp),   E: {0,1}^n → R

0 → fg, 1 → bg; n = number of pixels

[Figure: image; pairwise segmentation; final segmentation]

[slide credits: Kohli]

Page 61

Application: Recognition and Segmentation

from [Kohli et al. '08]

[Figure: image; unaries only, TextonBoost [Shotton et al. '06]; pairwise CRF only [Shotton et al. '06]; Pn Potts with one super-pixelization; Pn Potts with another super-pixelization]

Page 62

Robust (soft) Pn Potts model

h(xp) = { 0 if xi = 0 for all i Є p; f(∑_{i Є p} xi) otherwise }

[Figure: Pn Potts vs. robust Pn Potts, from [Kohli et al. '08]]

Page 63

Application: Recognition and Segmentation

From [Kohli et al. '08]

[Figure: image; unaries only, TextonBoost [Shotton et al. '06]; pairwise CRF only [Shotton et al. '06]; Pn Potts; robust Pn Potts; robust Pn Potts (different f); one super-pixelization; another super-pixelization]

Page 64

Same idea for surface-based stereo [Bleyer '10]

[Figure: one input image; ground-truth depth; stereo with hard segmentation; stereo with robust Pn Potts]

This approach gets the best result on the Middlebury Teddy image pair.

Page 65

How is it done…

The most general binary function of this form: H(x) = F(∑i xi) with F concave.

[Figure: concave function F of ∑ xi]

The transformation is to a submodular pairwise MRF, hence the optimization is globally optimal.

[slide credits: Kohli]

Page 66

Higher order to Quadratic

• Start with the Pn Potts model:

f(x) = { 0 if all xi = 0; C1 otherwise },   x Є {0,1}^n

min_x f(x) = min_{x, a Є {0,1}} C1 a + C1 (1-a) ∑ xi

Higher-order function → quadratic submodular function

∑xi = 0: a = 0, f(x) = 0
∑xi > 0: a = 1, f(x) = C1

[slide credits: Kohli]
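A quick exhaustive check of this transformation (C1 = 7 and n = 4 are arbitrary):

```python
import itertools

C1, n = 7, 4
f = lambda x: 0 if sum(x) == 0 else C1            # Pn Potts (all-zero vs. rest)
quad = lambda x, a: C1 * a + C1 * (1 - a) * sum(x)

for x in itertools.product([0, 1], repeat=n):
    assert f(x) == min(quad(x, a) for a in (0, 1))
print("Pn Potts -> quadratic transformation verified")
```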

Page 67

Higher order to Quadratic

min_x f(x) = min_{x, a Є {0,1}} C1 a + C1 (1-a) ∑ xi

Higher-order function → quadratic submodular function

[Figure: cost as a function of ∑xi (= 1, 2, 3, …): constant C1 vs. linear C1 ∑xi]

[slide credits: Kohli]

Page 68

Higher order to Quadratic

min_x f(x) = min_{x, a Є {0,1}} C1 a + C1 (1-a) ∑ xi

Higher-order submodular function → quadratic submodular function

[Figure: the cost is the lower envelope of the a = 1 branch (constant C1) and the a = 0 branch (linear C1 ∑xi)]

The lower envelope of concave functions is concave.

[slide credits: Kohli]

Page 69

Higher order to Quadratic

min_x f(x) = min_{x, a Є {0,1}} f1(x) a + f2(x) (1-a)

Higher-order submodular function → quadratic submodular function

[Figure: lower envelope of f1(x) and f2(x) as functions of ∑xi]

The lower envelope of concave functions is concave.

[slide credits: Kohli]

Page 70

Higher order to Quadratic

min_x f(x) = min_{x, a Є {0,1}} f1(x) a + f2(x) (1-a)

Higher-order submodular function → quadratic submodular function

[Figure: lower envelope of the a = 1 and a = 0 branches, f1(x) and f2(x), over ∑xi]

Arbitrary concave functions: sum such potentials up (each breakpoint adds a new binary variable) [Vicente et al. '09]

[slide credits: Kohli]

Page 71

Beyond Pn Potts … soft pattern-based potentials [Rother et al. '08, Komodakis et al. '08]

Motivation: binary image de-noising

[Figure: training image; test image with noise; result of a pairwise MRF, 9-connected (7 attractive, 2 repulsive): higher-order structure not preserved; result of a higher-order MRF]

Page 72

Sparse higher-order functions

Minimize: E(x) = P(x) + ∑c hc(xc)

where hc: {0,1}^|c| → R is a higher-order function (|c| = 10x10 = 100)

It assigns a cost to each of the 2^100 possible labellings!

Exploit the structure of the function to transform it to a pairwise function.

Page 73

How this can be done…

One clique, one pattern P0:

hc(x) = { 0 if xc = P0; k otherwise }

hc(x) = min_{a,b} k a + k (1-b) - k a (1-b) + k ∑_{i Є S0(P0)} (1-a) xi + k ∑_{i Є S1(P0)} b (1-xi)

(S0(P0), S1(P0): the pixels where the pattern is 0 resp. 1)

Check it:
1. Pattern off => a = 1, b = 0 (cost k)
2. Pattern on => a = 0, b = 1 (cost 0)

Problems:
1. The term "kab" is non-submodular
2. Only BP, TRW worked for inference

General potential: add all such terms up.

[Figure: graph construction with edge weights k and -k]
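The one-clique/one-pattern identity can be verified by brute force; a minimal sketch with an arbitrary 4-pixel pattern and k = 6:

```python
import itertools

k = 6
P0 = (0, 1, 1, 0)                      # the pattern, one binary clique of size 4
S0 = [i for i, p in enumerate(P0) if p == 0]
S1 = [i for i, p in enumerate(P0) if p == 1]

def h_direct(x):
    return 0 if tuple(x) == P0 else k

def h_aux(x):
    # Minimize over the auxiliary switching variables a, b.
    return min(k * a + k * (1 - b) - k * a * (1 - b)
               + k * sum((1 - a) * x[i] for i in S0)
               + k * sum(b * (1 - x[i]) for i in S1)
               for a, b in itertools.product([0, 1], repeat=2))

for x in itertools.product([0, 1], repeat=len(P0)):
    assert h_direct(x) == h_aux(x)
print("pattern potential identity verified")
```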

Page 74

Soft multi-label pattern-based potentials

Function per clique:

hc(x) = min { min_{a Є {1,…,L}} ka + ∑i wia [xi ≠ Pa(i)] , kmax }

[Figure: patterns P1, P2, P3 (multi-label) with soft deviation functions w1, w2, w3]

Page 75

How it is done…

We use BP for optimization, since the function is non-submodular and other solvers were inferior.

Function per clique: hc(x) = min { min_{a Є {1,…,L}} ka + ∑i wia [xi ≠ Pa(i)] , kmax }

With a pattern-switching variable z Є {1,…,L+1}:

hc(xc) = min_z f(z) + ∑_{i Є c} g(z,xi)

f(z) = { ka if z = a Є {1,…,L}; kmax if z = L+1 }

g(z,xi) = { wia if z = a and xi ≠ Pa(i); 0 otherwise (in particular if z = L+1) }

Page 76

Results: Multi-label

[Figure: training (256 labels); test (256 labels); test + noise (256 labels); pairwise model (15 labels; 5.6 sec, BP 10 iter.); 10 hard 10x10 patterns (15 labels; 48 sec, BP 10 iter.); 10 soft 10x10 patterns (15 labels; 48 sec, BP 10 iter.)]

Page 77

Standard patch-based MRFs [Learning Low-Level Vision, Freeman IJCV '04]

Multi-label: E(x) = U(x) + P(x),   E: {1,…,L}^n → R

P(x) measures patch overlap; not all labels are possible (a comparison is still to be done)

[Figure: grid of patch variables xi, xj, xk, xl]

Page 78

IMPORTANT

Tea break!