
A 3-clique MRF model for Magnitude of Gradient and the Application of Move-Based Algorithms

Behrooz Nasihatkon Richard Hartley

Abstract

We propose a 3-clique MRF model to isotropically approximate the magnitude of gradient as a regularizer. Compared to the existing edge-based models, this model is able to achieve more accurate approximations of the magnitude of gradient. To optimize the suggested model, we consider the class of move-based algorithms, like alpha-expansion and alpha-beta swap, in which the energy function is minimized by solving a series of binary labeling problems. By considering move-based algorithms in their general form, we give necessary and sufficient conditions for a move policy to give submodular binary moves. In particular, we consider two major types of labels, namely ordered labels and unordered labels, and fully characterize the class of move-based algorithms for which the binary move is submodular. It follows that the alpha-expansion algorithm fails to give a submodular move when the 3-clique model is used with ordered labels. To address this, we introduce the Mirrored Swap, an efficient algorithm for the optimization of the 3-clique isotropic gradient model for ordered labels. The new model is compared to the current edge-based models, both theoretically and practically. The effectiveness of each model and algorithm is studied by running an image completion task.

Index Terms

3-cliques, multilabel MRF, regularization, total variation, isotropic gradient model, move-based optimization, alpha-expansion, alpha-beta swap, submodular functions.

I. INTRODUCTION

In this paper we propose a 3-clique MRF to isotropically model the magnitude of gradient as a means of regularization, and give conditions as to when move-based optimization is applicable. This paper compares the 3-clique model with the current edge-based models both theoretically and practically. We also demonstrate how the obtained conditions can lead to the design of new move-based algorithms when classic move-based algorithms are not applicable.

The choice of a proper regularizer with desirable properties is essential in many computer vision applications. One of the most effective regularizers for real-valued images with a continuous image domain is Total Variation (TV) [13], which can be defined as the integral of the magnitude of gradient over the image domain. The definition can be generalized to cover a large class of non-differentiable and even discontinuous image functions. Total Variation has interesting properties as a regularizer. In particular, it is known as a discontinuity-preserving regularizer, as it equally penalizes sharp and gradual monotonic transitions between two states in a function [13], [5]. In images, for example, this leads to the preservation of edges.

A large variety of techniques has been proposed for the optimization of cost functions with a TV term. Recently, the application of graphical models to this problem has attracted particular attention, thanks to efficient graph-cut algorithms. The basic issue in such approaches is how an MRF lattice can give a good approximation of the magnitude of gradient, or alternatively, the perimeter of function level sets. Boykov and Kolmogorov in [2] propose a way of choosing the neighbourhood system and edge weights in an MRF lattice so that the cut cost approximates the Euclidean length of the segmentation boundary. This approach has been generalized to approximate TV in images with continuous values [7], [4], [8], [6]. The optimization approach used in these papers is more or less similar to what is proposed in [9], in which the target image is obtained by finding its level sets. The level sets are efficiently found by solving a finite number of binary MRF optimization problems in a parametric max-flow approach.

Another approach to the approximation of TV is to use a multi-label MRF model throughout, dealing with discretized image labels. Ishikawa and Geiger in [10] show that for a certain class of multi-label energy functions with pairwise smoothing terms the global optimum can be obtained via graph cuts. Chambolle in [4] uses this approach for the optimization of TV in a multi-label setting. Ishikawa's approach, however, is limited to pairwise potential functions. It also has to deal with a large max-flow graph when the number of labels is large. In particular, the memory requirement can make the implementation of this algorithm impractical on ordinary machines when the number of labels is large.

One problem with the edge-based approaches is that they are never able to give an exact approximation of the magnitude of gradient, even when the gradient is uniform around a node. This is because they are unable to give an isotropic cost for

Behrooz Nasihatkon is with the School of Engineering at the Australian National University and National ICT Australia (NICTA). Richard Hartley is with the School of Engineering at the Australian National University and National ICT Australia (NICTA). Jochen Trumpf is with the School of Engineering at the Australian National University. NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.


gradients in different directions. More exact approximations can be achieved by taking larger neighbourhoods; however, by so doing we increase the graph size and also lose locality in the gradient approximation.

Dealing with discretized images, a very natural approximation of the magnitude of gradient at a node is the square root of the sum of the squared variations in the horizontal and vertical directions. This model, which uses interactions between triples of nodes, is the basic choice for most TV optimization algorithms [5]. It achieves an isotropic cost function, which is exact at least at the nodes in a neighbourhood of which the gradient is uniform. However, when it comes to the application of graph cuts, people avoid using this model, as it gives an MRF with 3-clique potentials for which the application of many of the optimization algorithms used for the edge-based models is not possible or has not been studied.

In this paper we exploit this natural 3-clique model in the context of MRFs. To optimize it we focus on the application of the so-called move-based algorithms [3], in which the multi-label energy function is iteratively minimized by solving a sequence of binary optimization problems. Although they are not guaranteed to find the global minimum, they perform quite well in practice, both in terms of effectiveness and efficiency. Move-based algorithms usually take advantage of graph cuts to solve the binary move subproblem at every stage. In order to do that, the binary move problem must be graph-representable. Kolmogorov and Zabih in [11] prove that an energy function consisting of a sum of two- and three-variable potentials is graph-representable if and only if it is submodular. They also prove that submodularity is a necessary condition for an arbitrary energy function on binary variables to be graph-representable.

When using move-based algorithms we have to notice that not every choice of the node interaction model gives a submodular binary move for a given move algorithm. Therefore, after choosing a regularization model, we need to verify the existence of move-based algorithms giving a graph-representable binary move. The main objective of this paper is to completely specify the conditions under which a move algorithm gives submodular binary move problems for the proposed 3-clique model. This is done by first considering an abstract general move algorithm, and then giving complete conditions on when a move policy gives a submodular move.

Our main focus here is the case of ordered labels, that is, when the labels are finite, consecutive labels are equally spaced, and the distance between each pair of labels is simply their Euclidean distance. This is important because with this model we can approximate the TV. However, we also give some general results for when the distance function is an arbitrary metric or semi-metric. In particular, we study the case of unordered labels, where the distance only depends on whether or not the labels are equal. This latter model can be applied to problems like segmentation.

In the next section we introduce the 3-clique model and theoretically compare it to the current edge-based models in terms of accuracy and cost. Then, in section III we turn to the move-based approach by briefly introducing the alpha-expansion and alpha-beta swap algorithms and giving an abstract procedure representing a move-based algorithm in its general form. In sections IV and V we study the submodularity of the binary move steps of a general move-based algorithm for each of the edge-based and 3-clique models. In each case we give necessary and sufficient conditions for a general binary move policy to yield a submodular move. The results show that the alpha-expansion algorithm does not give a submodular move when used for the minimization of the isotropic 3-clique model with ordered labels. Therefore, in section VI we present the Mirrored Swap algorithm, which possesses many advantages of the expansion algorithm while giving a submodular move for the proposed model. The performance of the proposed models and algorithms has been assessed by applying them to the noise removal problem.

Part of this work has previously been published as a conference paper [12].

II. ISOTROPIC MODELING OF MAGNITUDE OF GRADIENT

In many applications in computer vision, the MRF energy function has the following form:

E(x) = λ ∑_i f_i(x_i) + R(x),    (1)

where x_i is the label of node i, taking values in a label set L, and x = [x_1, x_2, . . . , x_n] ∈ Lⁿ is the vector of labels. The term ∑_i f_i(x_i) represents the data-driven cost applied to each node, and R(x) is the regularization term. The scalar λ determines the emphasis on the data-driven terms versus the regularization term. Our concern here is the regularization term R(x). As mentioned before, we would like R(x) to be a good approximation of the TV. The existing methods approximate TV using edge-based potentials [7]:

R(x) = ∑_{(i,j)∈C2} w_ij |x_i − x_j|,    (2)

where the set C2 represents the collection of all edges, or 2-cliques, in the neighbourhood graph, |·| denotes the absolute value, the set of labels L is either continuous (ℝ or an interval in ℝ) or a discretized ordered set L = {0, 1, . . . , M−1}, and the w_ij are positive weights. The weights w_ij must be properly chosen based on the relative positions of nodes i and j to introduce an approximately isotropic cost [2], [7]. Good approximations of the magnitude of gradient can be obtained by choosing a large neighbourhood around each node in an image lattice. However, here, to have a fair comparison with the 3-clique model,


Fig. 1. (a) An 8-connected lattice, (b) possible edges, and (c) possible shapes for the 3-clique (i, j, k) in an 8-connected lattice. Notice that the order of the elements of (i, j, k) matters: the second and third elements, j and k, are always the horizontal and vertical neighbours of the first element i, respectively.

we consider neighbourhoods with the same locality, that is, where the edges lie within square cells of the lattice. With this restriction, we first consider a four-connected network:

R1(x) = γ ∑_{(i,j)∈H∪V} |x_i − x_j|,    (3)

with H and V denoting the sets of horizontal and vertical edges respectively, and γ being a normalization constant. The second model includes diagonal edges as well:

R2(x) = γ ∑_{(i,j)∈H∪V} |x_i − x_j| + ρ ∑_{(i,j)∈D} |x_i − x_j|,    (4)

with D being the set of diagonal edges. This gives an 8-connected network, as shown in Fig. 1(a,b). A better approximation of the gradient may be achieved by considering potential functions on 3-cliques as follows:

R3(x) = γ ∑_{(i,j,k)∈C3} √((x_i − x_j)² + (x_i − x_k)²),    (5)

where C3 denotes the set of all 3-cliques in an 8-connected lattice, such that for each (i, j, k) ∈ C3 the nodes j and k are respectively the horizontal and vertical neighbours of i (see Fig. 1(a,c)).
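To make these definitions concrete, the sketch below evaluates the three regularizers (3)-(5) on a 2D array of labels. It is only an illustrative implementation under our own conventions: `gamma` and `rho` are placeholder weights (their proper values are discussed next), and cliques that fall outside the lattice are simply skipped.

import numpy as np

def regularizers(x, gamma=1.0, rho=1.0):
    """Evaluate R1 (4-connected), R2 (diagonal-edge) and R3 (3-clique) on a
    2D label array x. A minimal sketch; gamma and rho are placeholder weights."""
    x = np.asarray(x, dtype=float)
    dh = np.abs(x[:, 1:] - x[:, :-1])          # horizontal edge differences
    dv = np.abs(x[1:, :] - x[:-1, :])          # vertical edge differences
    dd1 = np.abs(x[1:, 1:] - x[:-1, :-1])      # "\" diagonal differences
    dd2 = np.abs(x[1:, :-1] - x[:-1, 1:])      # "/" diagonal differences

    R1 = gamma * (dh.sum() + dv.sum())
    R2 = R1 + rho * (dd1.sum() + dd2.sum())

    # 3-cliques: for each node i with a horizontal neighbour j and a vertical
    # neighbour k, add sqrt((x_i - x_j)^2 + (x_i - x_k)^2); each interior
    # square cell contributes four such cliques (Fig. 1(c)).
    R3 = 0.0
    rows, cols = x.shape
    for r in range(rows):
        for c in range(cols):
            for dj in (-1, 1):                 # horizontal neighbour offset
                for dk in (-1, 1):             # vertical neighbour offset
                    if 0 <= c + dj < cols and 0 <= r + dk < rows:
                        R3 += np.hypot(x[r, c] - x[r, c + dj],
                                       x[r, c] - x[r + dk, c])
    return R1, R2, gamma * R3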

For each of the above models, we consider two different cases separately: ordered labels and unordered labels. For both cases, we consider a cost function C_{γ[,ρ]}(θ), representing the local cost of a unit gradient (for ordered labels) or the cost of an edge per unit length (for unordered labels), as a function of the angle θ of the gradient or the edge. In an isotropic setup, an ideal cost function is constant with respect to θ. Therefore, we assess the fitness of each model by measuring how uniform the associated cost function is. Here, assuming that the horizontal and vertical edges of the lattice have length 1, we also normalize the cost function so that the cost of a unit-magnitude gradient (ordered labels) or the cost of an edge per unit length (unordered labels) is expected to be one; that is, ideally we should have C_{γ[,ρ]}(θ) = 1. To measure how well a model fits, we consider the integrated squared deviation of C_{γ[,ρ]}(θ) from 1:

∫₀^{2π} (C_{γ[,ρ]}(θ) − 1)² dθ.    (6)

Also, to choose proper values for the parameters γ (and ρ) in our models, we minimize (6) with respect to γ (and ρ). Notice that for models (3) and (5) this minimization just leads to a proper normalization, while in the diagonal edge model (4) it also gives a proper balance between ρ and γ. Next, we show how to evaluate the cost function for each model, for the two cases of ordered and unordered labels.

A. Ordered Labels

In this case the label set L is of the form {0, 1, . . . , M−1} and the distance between two labels x and y is simply defined as d(x, y) = |x − y|. Usually, this case is used as an approximation to continuous labels, especially when M is large enough. Therefore, to verify our isotropic model for this case, we simply consider the case of continuous labels where L = ℝ.

Consider a single square cell of the lattice, as shown in Fig. 2, and assume by convention that each side of it has unit length. We can approximate the gradient in the middle of this square using (3), (4) and (5). Now, assume a uniform gradient field on


Fig. 2. Suppose each side of the above square cell has unit length. We apply a gradient of magnitude 1 at an angle θ with the horizontal axis. Assuming a label value of 0 for the top-left node, the labels of the other nodes (cos θ, sin θ and sin θ + cos θ) follow as functions of θ.

a neighbourhood around this square, such that every gradient has unit magnitude and its direction makes an angle θ with the horizontal axis. Assuming a label value of 0 for the top-left node, the label values for the other nodes are as shown in Fig. 2. According to these values, for models (3) and (4) the cost of a horizontal edge is |cos θ| and the cost of a vertical edge is |sin θ|. In the square cell there are two horizontal and two vertical edges, but, excluding nodes at the lattice boundary, each horizontal or vertical edge is shared with the next square cell. Therefore, for the 4-connected model (3) the cost contributing to the approximation of the gradient in the middle of the cell is

C_γ(θ) = γ (|sin θ| + |cos θ|).    (7)

As for the model (4), the costs of the two diagonal edges in the square cell are |cos θ + sin θ| and |cos θ − sin θ|. Therefore, according to (4) the gradient is approximated as

C_{γ,ρ}(θ) = γ (|sin θ| + |cos θ|) + ρ (|cos θ + sin θ| + |cos θ − sin θ|).    (8)

Considering the 3-clique model (5), there are four 3-cliques per square cell. The cost is then

C_γ(θ) = 4γ √(|cos θ|² + |sin θ|²) = 4γ.    (9)

The above shows that (5) is an ideal gradient model for continuous labels, as the cost does not depend on the angle of the gradient.

For each of the cost functions (7), (8) and (9), the optimal parameters γ and ρ can be obtained by minimizing (6). The optimal parameter values and the optimal integrated squared deviation of each model are shown in Table I. Evidently, the 3-clique model is the most accurate, with a zero integrated squared deviation. The 8-connected diagonal edge model is the next best, and the 4-connected model with horizontal and vertical edges is the worst. Note that the diagonal 2-clique model has an integrated deviation of about 1/20 that of the 4-connected model.

The cost function for different values of the gradient angle is plotted in Fig. 3. Notice that, according to Table I, for the diagonal edge model (4) the optimal weights satisfy γ/ρ = √2. Therefore, the cost of the gradient is the same for angles 0 and π/4, as is clear from Fig. 3. This coincides with the proper choice of edge weights for an 8-connected network obtained from the approach of [2].

TABLE I
COST FUNCTIONS

method                          4-connected      diag. edge                   3-clique
optimal weights                 γ = 4/(π+2)      γ = (8−4√2)/(π+2√2),         γ = 1/4
                                                 ρ = (4√2−4)/(π+2√2)
integrated squared deviation    ≈ 0.059433       ≈ 0.003420                   0
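As a sanity check (not part of the paper's derivation), the entries of Table I can be reproduced numerically: since the cost functions (7)-(9) are linear in the weights, minimizing the integrated squared deviation (6) amounts to a linear least-squares fit of the cost curve to the constant 1. A minimal NumPy sketch, assuming a fine uniform discretization of θ:

import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 20001)
dtheta = theta[1] - theta[0]

def fit_weights(*basis):
    """Least-squares fit of sum_k w_k * basis_k(theta) to the constant 1,
    i.e. the minimizer of the integrated squared deviation (6)."""
    A = np.stack(basis, axis=1)
    w, *_ = np.linalg.lstsq(A, np.ones_like(theta), rcond=None)
    deviation = np.sum((A @ w - 1.0) ** 2) * dtheta   # Riemann approximation of (6)
    return w, deviation

s, c = np.abs(np.sin(theta)), np.abs(np.cos(theta))
d1 = np.abs(np.cos(theta) + np.sin(theta))
d2 = np.abs(np.cos(theta) - np.sin(theta))

print(fit_weights(s + c))                      # 4-connected (7): gamma ~ 4/(pi+2) ~ 0.778, dev ~ 0.0594
print(fit_weights(s + c, d1 + d2))             # diag. edge (8): gamma ~ 0.392, rho ~ 0.277, dev ~ 0.0034
print(fit_weights(4.0 * np.ones_like(theta)))  # 3-clique (9): gamma = 0.25, dev ~ 0

The printed values should be close to the optimal weights and deviations listed in Table I.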

B. Unordered Labels

In this case, for any pair of label values we can only say whether or not they are equal. The distance between two labels x and y is simply defined as d(x, y) = 1(x≠y), which is equal to one if x ≠ y and equal to zero if x = y. As a special case, we can consider the basic binary labels where L = {0, 1}. This section is organized like the previous one; only the cost functions are calculated in a different way.

For unordered labels the verification cannot be done by approximating with continuous labels. Instead, here we consider the edge cost per unit length. We look at a small section of an edge with an angle θ = arctan(1/m) for some positive integer m. If the grid is fine enough, this edge appears as a set of stairs where every single stair goes m units to the right and then


Fig. 3. A comparison of the edge costs between the 4-connected model (red dashed line), the diagonal edge 8-connected model (blue solid line) and the 3-clique model (black dash-dotted line) for 0 ≤ θ ≤ π/2. The 3-clique model here is rotation-invariant, as its cost function is uniform. The diagonal edge model is far more uniform than the 4-connected model.

one unit down, as shown in Fig. 4. This only covers the angle range 0 < θ ≤ π/4, but other ranges of angles can be modelled similarly. We compare the cost of the edge per unit length for the 2-clique models (3) and (4) and the 3-clique model (5).

Fig. 4. An edge in the grid made by a transition from a white-labelled region to a gray-labelled region. The edge has an angle θ = arctan(1/m); that is, for every m units going right it goes one unit down. The nodes adjacent to the boundary along one stair are labelled i0, i1, . . . , i_{m+1} and j0, j1, . . . , j_{m+1}. Only the active edges (edges with a nonzero cost) of the MRF graph are depicted.

In Fig. 4, an edge is made by the transition from the white-labelled area to the gray-labelled area. For each stair of the edge we have m vertical active edges, namely (i1, j1), . . . , (im, jm), and one horizontal active edge (j0, j1). By active edge we mean an edge with a nonzero cost. Notice that the horizontal edge (im, i_{m+1}) is the same as (j0, j1) for the next stair and hence does not count. Therefore, the cost of each stair using the 4-connected model (3) is m + 1. As the length of each stair is √(m² + 1), the edge cost per unit length is

C_γ(θ) = γ (m + 1)/√(m² + 1) = γ (cot θ + 1)/√(cot² θ + 1) = γ (cos θ + sin θ),    (10)

where 0 ≤ θ ≤ π/4. Therefore, in this case the cost is the same as in the continuous-labels case of the previous section. To obtain the cost of the 8-connected 2-clique model (4), notice that there are 2m active diagonal edges in Fig. 4. Similar to (10), the cost can be calculated as

C_{γ,ρ}(θ) = (γ(m + 1) + 2ρm)/√(m² + 1) = (γ(cot θ + 1) + 2ρ cot θ)/√(cot² θ + 1) = γ (cos θ + sin θ) + 2ρ cos θ,    (11)

for 0 ≤ θ ≤ π/4. One can check that for 0 ≤ θ ≤ π/4 this cost function is exactly the same as the one for the continuous case (8).

Now, consider the 3-clique model (5). In this case, for each stair there are 2 active 3-cliques, (j1, j0, i1) and (im, i_{m+1}, jm), with a cost of √(1 + 1) = √2, and 4m active 3-cliques with a cost of √(1 + 0) = 1. As the length of each stair is √(m² + 1), the cost of the edge per unit length is

C_γ(θ) = γ (4m + 2√2)/√(m² + 1) = γ (4 cot θ + 2√2)/√(cot² θ + 1) = γ (4 cos θ + 2√2 sin θ),

where 0 ≤ θ ≤ π/4.

Similar to the continuous-label case, the optimal values of γ and ρ for each model can be calculated by minimizing the integrated squared deviation defined in (6). The optimal weights are shown in Table II along with the associated integrated


squared deviation. As expected, for the first two models (3) and (4) the results are the same as in the continuous-label case of the previous subsection. However, unlike the case of continuous labels, the 3-clique model here is not uniform. According to Table II, the diagonal edge model (4) is the best among the three. The next best model is the 3-clique model, whose integrated squared deviation is around 6 times bigger than that of the diagonal edge model and around 3 times smaller than that of the 4-connected model. Again, the 4-connected model is the worst. The graphs of the cost functions with the optimal weights are plotted in Fig. 5.

TABLE II
COST FUNCTIONS

method                          4-connected      diag. edge                   3-clique
optimal weights                 γ = 4/(π+2)      γ = (8−4√2)/(π+2√2),         γ = (4√2−2)/(3π+4√2+2)
                                                 ρ = (4√2−4)/(π+2√2)
integrated squared deviation    ≈ 0.059433       ≈ 0.003420                   ≈ 0.020279

Fig. 5. A comparison of the edge costs between the 4-connected model (red dashed line), the diagonal edge 8-connected model (blue solid line) and the 3-clique model (black dash-dotted line) for 0 ≤ θ ≤ π/2. The 3-clique model here is not rotation-invariant and is even less close to the uniform function than the diagonal edge model. The diagonal edge model and the 4-connected model are respectively the best and the worst models in terms of having a more uniform cost function.
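The per-unit-length costs derived in this subsection can also be checked empirically, by building a binary staircase labelling of slope 1/m on a grid and counting active edges and 3-cliques directly. The sketch below is an illustration under our own conventions (the grid size and the `stairs` parameter are arbitrary choices); for large grids its outputs approach (10), (11) and the 3-clique cost derived above.

import numpy as np

def staircase_costs(m, stairs=40, gamma=1.0, rho=1.0):
    """Empirical cost per unit boundary length of a binary staircase edge of
    slope 1/m (Fig. 4), for the 4-connected, diagonal-edge and 3-clique models.
    Illustrative sketch only; boundary effects vanish as `stairs` grows."""
    h, w = stairs + 4, stairs * m + 4
    r, c = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (r * m > c).astype(float)                       # gray region below a line of slope 1/m

    dh = np.abs(x[:, 1:] - x[:, :-1]).sum()             # active horizontal edges
    dv = np.abs(x[1:, :] - x[:-1, :]).sum()             # active vertical edges
    dd = (np.abs(x[1:, 1:] - x[:-1, :-1]).sum()
          + np.abs(x[1:, :-1] - x[:-1, 1:]).sum())      # active diagonal edges

    t3 = 0.0                                            # sum of sqrt over all 3-cliques
    xc = x[1:-1, 1:-1]
    for dj in (-1, 1):                                  # horizontal neighbour offset
        for dk in (-1, 1):                              # vertical neighbour offset
            t3 += np.hypot(xc - x[1:-1, 1 + dj:w - 1 + dj],
                           xc - x[1 + dk:h - 1 + dk, 1:-1]).sum()

    length = stairs * np.sqrt(m * m + 1.0)              # boundary length covered by the stairs
    return (gamma * (dh + dv) / length,                 # should approach (10)
            (gamma * (dh + dv) + rho * dd) / length,    # should approach (11)
            gamma * t3 / length)                        # should approach gamma*(4 cos θ + 2√2 sin θ)

For example, calling staircase_costs(3) with the Table II weights should give values close to the corresponding curves of Fig. 5 at θ = arctan(1/3).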

C. Graph Size and Cost

The first three columns of Table III respectively show the MRF graph size of the binary move step associated with each of the three models (3), (4) and (5). For the 3-clique model (5), the graph has been obtained from the construction proposed in [11], where each 3-clique is modelled by adding one auxiliary node and three edges, one from each node of the 3-clique to the auxiliary node. The table shows that the 3-clique model is much more costly than the other two methods in terms of optimization. This is because each square cell of the lattice has four 3-cliques, as shown in Fig. 1(c). One possible way to deal with the size problem is to use only one of the four possible 3-clique types shown in Fig. 1(c) to model the magnitude of gradient. The graph size for this light 3-clique model is shown in the final column of Table III. Notice that besides the graph size there are other factors affecting the optimization cost of these models. For example, we will show that the alpha-expansion algorithm cannot be applied to the 3-clique models. The algorithm we propose for optimizing them is more costly than alpha-expansion.

TABLE III
THE MRF GRAPH SIZE FOR EACH OF THE MODELS (EXCLUDING THE SINK AND SOURCE EDGES IN THE MAX-FLOW GRAPH)

model    4-connected    diag. edge    3-clique    light 3-clique
nodes    mn             mn            ≈ 5mn       ≈ 2mn
edges    ≈ 2mn          ≈ 4mn         ≈ 12mn      ≈ 3mn


D. Conclusion

In conclusion, the suggested 3-clique model (5) suits the case of ordered labels well, especially when the number of labels M is large. For the case of unordered labels or binary labels, the 3-clique model is not as effective as the diagonal edge model, but still works better than the 4-connected model. This is also expected to be true for ordered labels at sites where sharp edges exist. The diagonal edge model is presumably the best among the three for unordered labels. For ordered labels it also works quite well, and substantially better than the 4-connected model. One disadvantage of the 3-clique model is a large max-flow graph size. One possible workaround for this might be to use only one of the four 3-clique types shown in Fig. 1(c) to approximate the magnitude of gradient.

III. THE GENERAL MOVE ALGORITHM

Consider a general energy function E(x) of the labels x = [x_1, x_2, . . . , x_n], with x_i ∈ L. In a multi-label scenario the size of the set L is normally bigger than two. To minimize this energy function, we focus on move algorithms, in which a binary problem is solved at every iteration, lowering the energy over the iterations. Here, after describing two of the known move algorithms, we present a typical move algorithm in its general form.

a) Alpha-expansion: Perhaps the most popular example of this type is the alpha-expansion algorithm. In this algorithm, a binary variable u_i is assigned to each node i. A parameter α iterates through the different values in the label set L. At each iteration the variable x_i is updated according to

x′_i = l_α^{u_i}(x_i) = x_i ū_i + α u_i,    (12)

where x′_i = l_α^{u_i}(x_i) denotes the updated label value of node i and ū_i = 1 − u_i. The above means that the variable x_i remains unchanged if u_i = 0 and is changed to α if u_i = 1. The vector of labels x ∈ Lⁿ at the next iteration is therefore a function of u = [u_1, u_2, . . . , u_n]. This updated vector of labels is denoted here by

x′ = l_α^u(x) = [l_α^{u_1}(x_1), l_α^{u_2}(x_2), . . . , l_α^{u_n}(x_n)].

A general energy function E(x) evaluated at the new updated labels gives E(l_α^u(x)), which is a function of u. Therefore, at each iteration we choose the u* that minimizes E(l_α^u(x)) and then update the vector of labels to l_α^{u*}(x) ∈ Lⁿ. Fig. 6 shows an outline of the algorithm.

procedure ALPHA-EXPANSION(x, L)
    repeat
        for each α ∈ L do
            u* ← argmin_u E(l_α^u(x))
            x ← l_α^{u*}(x)
        end for
    until convergence
    return x
end procedure

Fig. 6. The alpha-expansion algorithm. The inputs to the algorithm are the initial label values x and the set of labels L.
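In code, the outer loop of Fig. 6 looks roughly as follows. This is only a sketch of the control flow: solve_binary_move is a hypothetical hook standing in for a graph-cut solver of the binary subproblem, which is exactly the step whose submodularity is analysed in sections IV and V.

import numpy as np

def alpha_expansion(x, labels, energy, solve_binary_move, max_sweeps=10):
    """Outer loop of the alpha-expansion algorithm (Fig. 6) -- a sketch.

    x                 : initial array of labels
    labels            : the label set L
    energy            : callable, energy(x) -> float
    solve_binary_move : callable, (x, alpha) -> binary array u intended to minimize
                        E(l_alpha^u(x)); a graph-cut solver would be used here when
                        the move is submodular (hypothetical hook).
    """
    x = np.array(x)
    best = energy(x)
    for _ in range(max_sweeps):
        improved = False
        for alpha in labels:
            u = solve_binary_move(x, alpha)
            x_new = np.where(u == 1, alpha, x)   # the expansion move of eq. (12)
            e_new = energy(x_new)
            if e_new < best:                     # conservative: keep only improving moves
                x, best, improved = x_new, e_new, True
        if not improved:                         # no alpha lowered the energy: converged
            break
    return x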

b) Alpha-beta swap: Another algorithm of this family is called alpha-beta swap. In this algorithm there are two parameters α and β iterating through all possible distinct pairs of label values. At every iteration only the nodes with labels α or β are updated. The update equation is

x′_i = l_{α,β}^{u_i}(x_i) =
    α ū_i + β u_i    if x_i ∈ {α, β},
    x_i              otherwise.    (13)

In other words, the nodes whose label value is neither α nor β stay unchanged, and the other nodes can switch to α or β.

Our results in this research do not depend on a specific update criterion. Instead, we consider a general move algorithm. Here, there is a parameter vector θ that varies over the iterations. At each iteration, given the binary variable u_i, the label of each node i is updated to one of the two possible states l_θ^0(x_i) or l_θ^1(x_i):

x′_i = l_θ^0(x_i) ū_i + l_θ^1(x_i) u_i = l_θ^{u_i}(x_i),    (14)

where x_i denotes the current label of node i. Therefore, different choices of l_θ^0 and l_θ^1 give different update criteria. Here, the pair of functions (l_θ^0, l_θ^1) is called the update policy for the parameter θ. Notice that under this update policy, the updated value for a node i only depends on θ, the current label value x_i and the binary variable u_i. It does not depend on the location of node i or on the label values of its neighbours. A sketch of the general move algorithm can be seen in Fig. 7.

It is clear that an update policy must possess certain properties to act reasonably. For example, as a trivial case, if for all label values x ∈ L we have l_θ^0(x) = l_θ^1(x), then the policy is not sensible, as it cannot make any updates. Other properties can


procedure GENERAL-MOVE-ALGORITHM(x, P)
    repeat
        for each θ ∈ P do
            u* ← argmin_u E(l_θ^u(x))
            x ← l_θ^{u*}(x)
        end for
    until convergence
    return x
end procedure

Fig. 7. The general move algorithm for multi-label energy minimization. The inputs are the initial vector of labels x and the set of parameter values P.

be mentioned for a sensible policy. Here, we mention just one property, namely the state preservation property, which means that each label must be able to preserve its current state after each iteration; in other words, for each label value x ∈ L and each θ we must have either l_θ^0(x) = x or l_θ^1(x) = x. It can easily be seen that alpha-expansion and alpha-beta swap possess this property. Notice that in the state preservation property, which value of u ∈ {0, 1} gives l_θ^u(x) = x depends on x and θ. The important fact about the state preservation property is that, for an update policy with this property and for every θ, l_θ^u(x) is surjective as a function of u ∈ {0, 1} and x ∈ L. Throughout the paper the state preservation property (and hence the surjectivity) is assumed for all update policies.
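The update policies of alpha-expansion and alpha-beta swap can be written down explicitly as pairs of functions (l0, l1) on the label set, which makes the state preservation property trivial to verify mechanically. A small sketch (representing a policy as a pair of Python callables is our own convention, not the paper's notation):

def expansion_policy(alpha):
    """Alpha-expansion: u = 0 keeps the current label, u = 1 switches to alpha."""
    return (lambda x: x, lambda x: alpha)

def swap_policy(alpha, beta):
    """Alpha-beta swap: nodes labelled alpha or beta may swap, all others are frozen."""
    l0 = lambda x: alpha if x in (alpha, beta) else x
    l1 = lambda x: beta if x in (alpha, beta) else x
    return (l0, l1)

def has_state_preservation(policy, labels):
    """Every label can keep its current value for some u in {0, 1}."""
    l0, l1 = policy
    return all(l0(x) == x or l1(x) == x for x in labels)

labels = range(8)
assert has_state_preservation(expansion_policy(3), labels)
assert has_state_preservation(swap_policy(2, 5), labels)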

As mentioned before, in this paper we are concerned with the binary move problem at each iteration of the general move algorithm sketched in Fig. 7. We are looking for update policies for which the energy function E(l_θ^u(x)) is submodular as a function of the binary labels u, in which case there exist polynomial-time algorithms to solve the optimization problem.

IV. SUBMODULARITY IN THE 2-CLIQUE POTENTIAL MODELS

In this section we completely specify the conditions for a general move algorithm to give a submodular move for the following model:

E(x) = λ ∑_i f_i(x_i) + ∑_{(i,j)∈C2} w_ij d(x_i, x_j),    (15)

where C2 is the set of all edges in the MRF graph, d is a semi-metric and the f_i(x_i) are unary potentials. The energy functions using the four-connected prior (3) and the diagonal edge prior (4) are special cases of this type.

The corresponding binary move energy function with a general move policy (l0, l1) is as follows:

E′_x(u) = E(l^u(x)) = ∑_{(i,j)∈C2} w_ij d(l^{u_i}(x_i), l^{u_j}(x_j)) + L(u),    (16)

where the w_ij are positive weights and L(u) represents the linear terms in u_1, u_2, . . . , u_n and ū_1, ū_2, . . . , ū_n coming from the unary terms λ ∑_i f_i(x_i). The submodularity of the energy function E′_x(u) in (16) is equivalent to the submodularity of its restrictions to any two nodes. The restriction of E′_x(u) to non-neighbouring nodes gives only linear terms, and hence is submodular. The restriction to any pair of neighbours i and j is equal to w_ij d(l^{u_i}(x_i), l^{u_j}(x_j)) plus some linear terms. Therefore, the submodularity of E′_x(u) is equivalent to the submodularity of d(l^{u_i}(x_i), l^{u_j}(x_j)) as a function of u_i and u_j for all i and j. Notice that, as we assume that x can be in any state, we need to check the submodularity of E′_x(u) for all possible states x, and hence one must check the submodularity of d′_{x,y}(u, v) = d(l^u(x), l^v(y)), as a function of the binary variables u and v, for all values of x, y ∈ L. Here we investigate this for each of the cases of ordered labels and unordered labels.

A. Ordered Labels

As mentioned in section II, in the case of ordered labels the set of labels is L = {0, 1, . . . , M−1} and the distance function is simply d(x, y) = |x − y|. In this case the energy function can be globally minimized using Ishikawa's approach [10]. However, for a large number of labels Ishikawa's approach might not be applicable, as it requires a large max-flow graph. Therefore, we study the applicability of move-based algorithms as an alternative.

Proposition 1. In the case of ordered labels, that is, L = {0, 1, . . . , M−1} and d(x, y) = |x − y|, the function d′_{x,y}(u, v) = d(l^u(x), l^v(y)) is submodular for all x, y ∈ L for a policy (l0, l1) if and only if for every pair of label values x, y ∈ L we have

min(l0(x), l1(y)) ≤ max(l1(x), l0(y)).    (17)

The condition (17) is not actually very restrictive. One can easily check that for the alpha-expansion algorithm, the alpha-beta swap and even the Mirrored Swap algorithm (to be introduced in section VI), the move policy satisfies this condition. For the sake of compactness, from now on, for every label x ∈ L we may use x0 and x1 to represent l0(x) and l1(x) respectively.
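For a concrete finite label set, condition (17) can be checked by brute force. The sketch below does this for the alpha-expansion and alpha-beta swap policies, written inline as (l0, l1) pairs; the label set and the parameter values are arbitrary illustrative choices.

def satisfies_condition_17(l0, l1, labels):
    """Proposition 1: min(l0(x), l1(y)) <= max(l1(x), l0(y)) for all pairs x, y."""
    return all(min(l0(x), l1(y)) <= max(l1(x), l0(y))
               for x in labels for y in labels)

labels = range(8)
# alpha-expansion with alpha = 3
print(satisfies_condition_17(lambda t: t, lambda t: 3, labels))          # True
# alpha-beta swap with (alpha, beta) = (2, 5)
print(satisfies_condition_17(lambda t: 2 if t in (2, 5) else t,
                             lambda t: 5 if t in (2, 5) else t, labels)) # True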


Proof: Notice that the submodularity condition is

|x0 − y1| + |x1 − y0| ≥ |x0 − y0| + |x1 − y1|    (18)

for all labels x, y ∈ L.

To prove the forward direction, assume a policy (l0, l1) for which the condition (17) does not hold. Then, at least for one pair of labels x, y ∈ L, we have

max(x1, y0) < min(x0, y1).    (19)

By adding max(x0, y1) − min(x1, y0) − max(x1, y0) to both sides we get

max(x0, y1) − min(x1, y0) < min(x0, y1) + max(x0, y1) − max(x1, y0) − min(x1, y0) = x0 + y1 − x1 − y0.

From (19) we know that x0 > y0 and y1 > x1; therefore, the above can be written as

max(x0, y1) − min(x1, y0) < |x0 − y0| + |y1 − x1|.

From (19) we know that max(x1, y0) − min(x0, y1) is negative, and therefore we can add it to the left-hand side of the above to get

max(x0, y1) − min(x0, y1) + max(x1, y0) − min(x1, y0) = |x0 − y1| + |x1 − y0| < |x0 − y0| + |y1 − x1|.

This means d′_{x,y}(u, v) = d(l^u(x), l^v(y)) is not submodular, and this concludes the forward direction of the proof.

To prove the backward direction, assume that condition (17) holds for all pairs of labels. This means that for any pair of labels x, y ∈ L we have

min(x0, y1) ≤ max(x1, y0),    (20)
min(x1, y0) ≤ max(x0, y1).    (21)

Take

z = max(min(x0, y1), min(x1, y0)).

Using (20) it can easily be checked that min(x1, y0) ≤ z ≤ max(x1, y0), and therefore we can write

|x1 − y0| = |x1 − z| + |z − y0|.    (22)

Similarly, using (21) we get min(x0, y1) ≤ z ≤ max(x0, y1) and hence

|x0 − y1| = |x0 − z| + |z − y1|.    (23)

Using (22) and (23) together we can write

|x0 − y1| + |x1 − y0| = |x0 − z| + |z − y1| + |x1 − z| + |z − y0| ≥ |x0 − y0| + |x1 − y1|,

where the last relation holds by the triangle inequality. The above is exactly the submodularity relation (18), and this completes the proof.

B. Unordered Labels

Proposition 2. In the case of unordered labels, that is, where the distance function is defined as d(x, y) = 1(x≠y), the 2-clique potential d′_{x,y}(u, v) = d(l^u(x), l^v(y)) is submodular for all x, y ∈ L if and only if for any pair of active labels x, y ∈ A we have l0(x) ≠ l1(y). (Here A denotes the set of active labels, that is, the labels x with l0(x) ≠ l1(x); see (35).)

Proof: Notice that the submodularity condition in this case is

1(x0≠y1) + 1(x1≠y0) ≥ 1(x0≠y0) + 1(x1≠y1).    (24)

Assume there exist x, y ∈ A such that x0 = y1. Then we must have 1(x0≠y1) = 0, and also 1(x1≠y1) = 1(x1≠x0) = 1 (as x ∈ A) and 1(x0≠y0) = 1(y1≠y0) = 1 (as y ∈ A). Therefore,

1(x0≠y1) + 1(x1≠y0) < 1(x0≠y0) + 1(x1≠y1),


meaning that d′_{x,y}(u, v) is not submodular for this choice of x and y.

On the other hand, assume that for all labels x, y ∈ A we have x0 ≠ y1. Then for any pair of labels x, y, if any of them, say x, is not an active label (x0 = x1), the submodularity condition (24) can easily be checked to hold. Otherwise, if x, y ∈ A, we have x0 ≠ y1 and x1 ≠ y0. Therefore, the left-hand side of (24) is equal to 2, which is always greater than or equal to the right-hand side. This means that d′_{x,y}(u, v) is submodular for any choice of x, y ∈ L.

V. SUBMODULARITY IN THE 3-CLIQUE MODEL

As stated earlier, we deal with energy functions of the form

E(x) = λ ∑_i f_i(x_i) + γ ∑_{(i,j,k)∈C3} T(x_i, x_j, x_k),    (25)

where the f_i(x_i) are unary terms, γ is a positive scalar, C3 denotes the set of all 3-cliques in an 8-connected MRF lattice, and T(x, y, z) = √(d(x, y)² + d(x, z)²), as defined in (5), where d is a semi-metric. The variables x_i, x_j and x_k take values in the set of labels L, whose size is usually bigger than two. Here, a 3-clique is denoted by a 3-tuple (i, j, k) whose second element j is always a horizontal (left or right) neighbour of the first element i, and whose third element k is a vertical neighbour of i, as shown in Fig. 1(b). For a move policy (l0, l1), the corresponding binary move problem is

E′_x(u) = E(l^u(x)) = γ ∑_{(i,j,k)∈C3} T′_{x_i,x_j,x_k}(u_i, u_j, u_k) + L′_x(u),    (26)

where L′_x(u) is a linear function of u_1, u_2, . . . , u_n and ū_1, ū_2, . . . , ū_n coming from the unary part λ ∑_i f_i(x_i) of the multi-label energy function, and the term T′_{x_i,x_j,x_k}(u_i, u_j, u_k) is defined as

T′_{x_i,x_j,x_k}(u_i, u_j, u_k) = T(l^{u_i}(x_i), l^{u_j}(x_j), l^{u_k}(x_k)),    (27)

where l^{u_i}(x_i) = ū_i l0(x_i) + u_i l1(x_i), as defined in (14). Unlike the 2-clique model studied in the previous section, the equivalence of the submodularity of the 3-clique potentials T′_{x_i,x_j,x_k} and the submodularity of the energy function E′_x is not trivial. This is because two or more 3-cliques can intersect at an edge, and thus more than one 3-clique might be involved when the energy function is restricted to the corresponding variables of an edge. We proceed by studying the submodularity of the potential function T′_{x_i,x_j,x_k} on a single 3-clique. Afterwards, we show that the submodularity of T′_{x_i,x_j,x_k} is necessary and sufficient for the submodularity of E′_x in (26) for the whole lattice.

A. Submodularity on a Single 3-Clique

Our main result in this section is that the submodularity of T′_{x_i,x_j,x_k} defined in (27), as a function of u, reduces to the submodularity of its restriction to the diagonally neighbouring variables. We remind the reader that for every label x we may use the short forms x0 and x1 to represent l0(x) and l1(x) respectively. The main theorem is

Theorem 1. With the potential function T′_{x_i,x_j,x_k} defined in (27), and assuming that the update policy has the state preservation property defined in Sec. III, the following are equivalent:
(i) The potential function T′_{x_i,x_j,x_k} defined in (27) is submodular for all values of x_i, x_j and x_k.
(ii) Any restriction of T′_{x_i,x_j,x_k}(u_i, u_j, u_k) to the diagonally neighbouring variables u_j and u_k is submodular for all values of x_i, x_j and x_k.
(iii) For any three labels x, y, z ∈ L we have

(d(x, y1) − d(x, y0)) (d(x, z1) − d(x, z0)) ≥ 0,    (28)

where, as mentioned before, x^u is a compact form for l^u(x).

Proof: To prove the theorem, we first show that (i) implies (ii), then we show that (ii) and (iii) are equivalent, and then we show that (ii) and (iii) together imply (i).

The first part of the proof, that is (i) ⇒ (ii), follows easily from our definition of submodularity for functions of more than two variables.

To show that (ii) ⇔ (iii), assume that any restriction of T′_{x_i,x_j,x_k}(u_i, u_j, u_k) to the diagonally neighbouring variables u_j and u_k is submodular for any values of x_i, x_j and x_k in L; that is to say, the relation

T(l^{u_i}(x_i), l0(x_j), l1(x_k)) + T(l^{u_i}(x_i), l1(x_j), l0(x_k)) ≥ T(l^{u_i}(x_i), l0(x_j), l0(x_k)) + T(l^{u_i}(x_i), l1(x_j), l1(x_k))    (29)


must hold for any values of x_i, x_j and x_k and either value of u_i. Now, notice that x_i, x_j and x_k can take any values in L, and because of the surjectivity of l^u(x), coming from the state preservation property of the update policy, l^{u_i}(x_i) can also take any value in L. Hence, (29) is equivalent to

T(x, y0, z1) + T(x, y1, z0) ≥ T(x, y0, z0) + T(x, y1, z1)

holding for any x, y, z ∈ L, with y0 and y1 being compact ways to write l0(y) and l1(y) for any y ∈ L. The above is equivalent to

√(d(x, y0)² + d(x, z1)²) + √(d(x, y1)² + d(x, z0)²) ≥ √(d(x, y0)² + d(x, z0)²) + √(d(x, y1)² + d(x, z1)²).

By squaring both sides of the above, canceling equal terms from both sides and squaring again we get

(d(x, y1)² − d(x, y0)²) (d(x, z1)² − d(x, z0)²) ≥ 0.

By factoring out the terms (d(x, y1) + d(x, y0)) and (d(x, z1) + d(x, z0)) from the left-hand side of the above we get (28)¹.

From the above discussion it follows that (ii) ⇒ (iii). But, as the steps taken from (ii) to (iii) are reversible, we can say (ii) ⇔ (iii).

The last step is to show that (ii) and (iii) together imply (i). Assume that (ii) and (iii) hold. We have to show that any restriction of T′_{x_i,x_j,x_k}(u_i, u_j, u_k) to horizontally, vertically and diagonally neighbouring variables is submodular. From (ii) we know that the restriction of T′_{x_i,x_j,x_k}(u_i, u_j, u_k) to the diagonally neighbouring variables u_j and u_k is submodular. As the proofs for horizontal and vertical neighbours are similar, we just need to prove the submodularity of the restriction of T′_{x_i,x_j,x_k}(u_i, u_j, u_k) to the horizontally neighbouring variables u_i and u_j. This means that we have to prove that

T(l0(x_i), l1(x_j), l^{u_k}(x_k)) + T(l1(x_i), l0(x_j), l^{u_k}(x_k)) ≥ T(l0(x_i), l0(x_j), l^{u_k}(x_k)) + T(l1(x_i), l1(x_j), l^{u_k}(x_k))    (30)

holds for any values of x_i, x_j, x_k and u_k. As x_i, x_j and x_k can take any values in L, and as l^{u_k}(x_k) can also take any value in L (as a result of the state preservation property of the update policy), (30) is equivalent to

T(y0, z1, t) + T(y1, z0, t) ≥ T(y0, z0, t) + T(y1, z1, t)

holding for any y, z, t ∈ L. Therefore, all we have to prove is

√(d(y0, z1)² + d(y0, t)²) + √(d(y1, z0)² + d(y1, t)²) ≥ √(d(y0, z0)² + d(y0, t)²) + √(d(y1, z1)² + d(y1, t)²).    (31)

Now, it is obvious that if z0 = z1 the above holds as an equality. If z0 ≠ z1, as (iii) holds, by setting y and z in (28) equal to y and z in (31) and setting x in (28) equal to z0, we get

(d(z0, y1) − d(z0, y0)) (d(z0, z1) − d(z0, z0)) ≥ 0.

As d is a semi-metric and z0 ≠ z1, we have d(z0, z0) = 0 and d(z0, z1) > 0. The above relation then gives

d(z0, y1) ≥ d(z0, y0).    (32)

Similarly, by setting x = z1 in (28) we get

d(z1, y0) ≥ d(z1, y1).    (33)

The relation (31) follows from (32) and (33).

B. Submodularity of the Energy Function

Now we turn to finding the condition for the submodularity of the energy function E′_x(u) defined in (26). It turns out that with the multi-label potential functions defined as T(x_i, x_j, x_k) = √(d(x_i, x_j)² + d(x_i, x_k)²), the submodularity of E′_x(u) reduces to the submodularity of a single 3-clique potential T′_{x_i,x_j,x_k}(u_i, u_j, u_k), as stated in the next proposition.

Proposition 3. Assuming the state preservation property defined in Sec. III for an update policy, the energy function E′_x(u) defined in (26) is submodular for any value of x ∈ Lⁿ if and only if the potential function T′_{x_i,x_j,x_k} defined in (27) is submodular for all values of x_i, x_j and x_k ∈ L.

Proof: The backward direction of the proof is immediate, because if the 3-clique potential T′_{x_i,x_j,x_k}(u_i, u_j, u_k) is submodular for all values of x_i, x_j and x_k, then so is E′_x(u), as a sum of 3-clique potential functions plus some linear terms.

¹Notice that, as d(x, y) ≥ 0 (since d is a semi-metric), even when (d(x, y1) + d(x, y0)) is equal to zero our argument is true, as in this case (d(x, y1) − d(x, y0)) would also be equal to zero.


Fig. 8. Two cliques, (i1, j, k) and (i2, k, j), sharing the diagonal neighbours j and k.

To prove the forward direction, assume that E′_x(u) is submodular for any value of the vector of labels x. Then its restriction to any pair of diagonally neighbouring variables must be submodular. Suppose E′_x(u) defined in (26) is restricted to u_j and u_k, where j and k are diagonal neighbours. As there are two cliques sharing the nodes j and k (Fig. 8), call them (i1, j, k) and (i2, k, j), the restriction is

E′_{j,k}(u_j, u_k) = T(x′_{i1}(u_{i1}), x′_j(u_j), x′_k(u_k)) + T(x′_{i2}(u_{i2}), x′_k(u_k), x′_j(u_j)) + L″(u_j) + L″(u_k),    (34)

where x′_j(u_j) is shorthand for l^{u_j}(x_j), and L″(u_j) and L″(u_k) are linear terms that do not play a role in submodularity. From the submodularity of E′_x(u), we know that the restriction E′_{j,k}(u_j, u_k) is submodular as a function of u_j and u_k for any values given to x_{i1}, x_{i2}, u_{i1} and u_{i2} in (34). Therefore, it is submodular for the cases where x_{i1} = x_{i2} and u_{i1} = u_{i2}. By replacing x_{i2} with x_{i1} and u_{i2} with u_{i1} in (34), and considering the fact that T, defined as T(x, y, z) = √(d(x, y)² + d(x, z)²), is symmetric in its last two arguments, we can conclude that

2 T(x′_{i1}(u_{i1}), x′_j(u_j), x′_k(u_k)) + L″(u_j) + L″(u_k)

is submodular. The above being submodular is equivalent to the submodularity of T(x′_{i1}(u_{i1}), x′_j(u_j), x′_k(u_k)). As x_{i1}, x_j, x_k and u_{i1} are arbitrary, this means that the restriction of the function T(x′_i(u_i), x′_j(u_j), x′_k(u_k)) to the variables u_j and u_k is submodular for all possible values of x_i, x_j, x_k and either value of u_i. According to Theorem 1, this is equivalent to the submodularity of T′_{x_i,x_j,x_k}(u_i, u_j, u_k) defined in (27) for every x_i, x_j, x_k ∈ L.

Considering the above proposition along with Theorem 1, we can say that a necessary and sufficient condition for the energy function E′_x(u) to be submodular is condition (iii) of Theorem 1; that is, the relation (28) must hold for all values of x, y and z in L. This condition means that, for any values of x, y and z, the real numbers d(x, y1) − d(x, y0) and d(x, z1) − d(x, z0) must have the same sign (or else either of them must be zero).

Before we go on with checking the submodularity for different types of labels, we mention a simple yet useful lemma. Define the set of active labels as

A = {x ∈ L | l0(x) ≠ l1(x)}.    (35)

Because of the state preservation property, we know that if x ∉ A then l0(x) = l1(x) = x. Therefore, the active labels are those that have the potential to change. It is obvious that for a sensible update policy A must be nonempty.

As an example, in the alpha-beta swap algorithm, at each iteration we have A = {α, β}. Variables with labels other than α and β cannot change. For the alpha-expansion algorithm this set is A = L\{α}, as all the variables can change except those with label α. Intuitively, it seems that, in general, update policies with a smaller set of active labels are more likely to be submodular, as only a few labels have the potential to change. On the other hand, update policies with a larger set of active labels are expected to converge after fewer iterations and give better results, as more labels are involved in the optimization of the binary move sub-problem, for which the global optimum is found provided that it is submodular.

To see how submodularity relates to the active sets, observe that the submodularity condition (28) holds if either y or z is not an active label, that is, if y0 = y1 or z0 = z1. This means that to study submodularity we only need to check the cases where y, z ∈ A. We state this as a lemma whose proof follows easily from the discussion above:

Lemma 1. The energy function E′_x defined in (26) is submodular for any x ∈ Lⁿ if and only if for any y, z ∈ A and any x ∈ L the relation (28) holds.

The above lemma tells us that we only need to check (28) for y, z ∈ A rather than for all y, z ∈ L; however, we still have to check all values of x ∈ L. Now we investigate the submodularity for two kinds of labels, namely ordered labels and unordered labels, using the submodularity condition (28).
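For a concrete finite label set, Lemma 1 translates directly into a brute-force test: given a semi-metric d and a policy (l0, l1), check (28) for every x ∈ L and every pair of active labels y, z ∈ A. A sketch under those assumptions (label set, distance and policy below are illustrative choices):

from itertools import product

def is_submodular_3clique(labels, d, l0, l1):
    """Brute-force check of condition (28): for every x in L and every pair of
    active labels y, z, (d(x,l1(y)) - d(x,l0(y))) * (d(x,l1(z)) - d(x,l0(z))) >= 0.
    Here d is a semi-metric on the labels and (l0, l1) is the update policy."""
    active = [t for t in labels if l0(t) != l1(t)]          # the set A of (35)
    for x, y, z in product(labels, active, active):
        if (d(x, l1(y)) - d(x, l0(y))) * (d(x, l1(z)) - d(x, l0(z))) < 0:
            return False
    return True

# Example (ordered labels, L = {0,...,7}): alpha-beta swap with (alpha, beta) = (2, 5).
labels = range(8)
d = lambda a, b: abs(a - b)
l0 = lambda t: 2 if t in (2, 5) else t
l1 = lambda t: 5 if t in (2, 5) else t
print(is_submodular_3clique(labels, d, l0, l1))             # expected: True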

1) Ordered Labels: In this case labels are of the form L = {0, 1, . . . , M−1} and the distance function is simply the Euclidean distance between labels: d(x, y) = |x − y|. The submodularity condition (28) then becomes

(|x − y1| − |x − y0|) (|x − z1| − |x − z0|) ≥ 0.    (36)

According to Lemma 1, we just have to check the above for the cases where y, z ∈ A. The condition says that for any label x ∈ L, if there exists a label y ∈ A such that x is closer to y0 = l0(y) than to y1 = l1(y), then for any other label z ∈ A, x must


Fig. 9. An example of a mirrored update policy. The set of active labels is A = {y, z, v, w}. Notice that for every t ∈ A we have t0 < t1 and (t0 + t1)/2 ∈ {µ, µ+1/2, µ+1}. For every t ∈ {y, z, v, w} there is a row of blocks in the image, in which the red (dark gray) blocks show the labels x such that |x − t0| < |x − t1| and the blue (light gray) blocks are those labels x such that |x − t0| > |x − t1|. Essentially, the submodularity condition (36) says that the red region of one row must never intersect the blue region of any other row.

be closer to z0 than z1 or have an equal distance from z0 and z1. As we will show, for ordered labels this is equivalent to the case where the labels l1(x) for all x ∈ A are roughly mirrored images of the labels l0(x) for some centre of reflection. This concept is defined below.

Definition 1. An update policy (l0, l1) is called mirrored if
(i) either for all x ∈ A we have l0(x) < l1(x), or for all x ∈ A we have l0(x) > l1(x), and
(ii) there exists µ ∈ L such that for all x ∈ A we have

(l0(x) + l1(x))/2 ∈ {µ, µ+1/2, µ+1}.

It is called mirrored since, for all labels x ∈ A, l1(x) is the mirror image of l0(x); however, the centre of symmetry can be µ, µ+1/2 or µ+1 for different labels. This concept is illustrated in Fig. 9.

The next proposition says that being a mirrored policy is the necessary and sufficient condition for submodularity.

Proposition 4. With the set of labels L defined as {0, 1, . . . , M−1} and the distance function on L×L defined as d(x, y) = |x − y| (ordered labels), and assuming the state preservation property for the update policy, the energy function E′_x defined in (26) is submodular for all values of x if and only if (l0, l1) is a mirrored update policy.

Proof: According to Lemma 1, we only have to check the submodularity condition (28) for all x ∈ L and all y, z ∈ A.

To prove the backward direction, assume that conditions (i) and (ii) of a mirrored policy hold. From condition (i), we consider only the first case, where for all x ∈ A we have l0(x) < l1(x), as the proof assuming the other case is similar. Therefore, for any y, z ∈ A we have y1 > y0 and z1 > z0. Now assume that in (36) the left term in parentheses is negative, that is,

|x − y1| − |x − y0| < 0;    (37)

we will show that in this case the other term (|x − z1| − |x − z0|) is nonpositive and hence the relation (36) holds. First, notice that, as y1 > y0, we have x − y1 < x − y0. This along with (37) gives −(x − y0) < x − y1. It follows that

x > (y0 + y1)/2 = µ + δ_y,

for some δ_y ∈ {0, 1/2, 1}, where the equality comes from property (ii) of a mirrored policy and the fact that y ∈ A. As x and µ are integers and δ_y ≥ 0, we can say x ≥ µ + 1, which gives x ≥ (z0 + z1)/2, as for a mirrored policy we have (z0 + z1)/2 ∈ {µ, µ+1/2, µ+1} for the active label z ∈ A. It then follows that

−(x − z0) ≤ x − z1.    (38)

From z0 < z1 we have x − z1 < x − z0, which along with (38) gives −(x − z0) ≤ x − z1 < x − z0, that is,

|x − z1| ≤ |x − z0|.

Thus, the second term in parentheses in (36) is nonpositive. A similar argument proves that if (|x − y1| − |x − y0|) is positive in (36), then (|x − z1| − |x − z0|) is nonnegative, and hence (36) always holds.

To prove the forward direction, assume that (36) holds. By setting x = 0 (the smallest label value) in (36) we get

(y1 − y0)(z1 − z0) ≥ 0.

As y and z can be any active labels, this gives condition (i) of a mirrored update policy.

From here on we assume that the first possibility in condition (i) proved above is true, that is, for all x ∈ A we have x0 < x1. The proof assuming the second possibility, that is, x0 > x1 for all x ∈ A, is similar.


To get condition (ii) of a mirrored policy, choose µ as

µ = min_{x∈A} ⌊(l0(x) + l1(x))/2⌋,    (39)

and, as A is nonempty, we can set a variable z equal to an optimal x for (39), that is, µ = ⌊(z0 + z1)/2⌋. Now we take an arbitrary active label y ∈ A and show that condition (ii) of a mirrored policy holds for y. From the definition of µ in (39) we know that

(y0 + y1)/2 ≥ µ.    (40)

Now, as z1 > z0, it is obvious that µ + 1 = ⌊(z0 + z1)/2⌋ + 1 is closer to z1 than to z0, and hence |z1 − (µ+1)| − |z0 − (µ+1)| < 0. Therefore, by setting x = µ + 1 in (36) (with y and z as defined above), the submodularity condition (36) gives

|y1 − (µ+1)| − |y0 − (µ+1)| ≤ 0.

This means that y0 is not closer to µ+1 than y1. Therefore

(y0 + y1)/2 ≤ µ + 1.    (41)

The constraints (40) and (41) leave three possibilities for (y0 + y1)/2, that is, (y0 + y1)/2 ∈ {µ, µ + 1/2, µ + 1}. This is condition (ii) for being a mirrored policy.

From the above proposition, one can check that alpha-expansion does not give a submodular function for ordered labels when the size of the label set L is bigger than two. This can be seen, for example, by setting α = 0 and observing that the update policy is not mirrored in this case. However, the alpha-beta swap algorithm does give a submodular binary energy function, as its update policy is mirrored for any choice of α and β, with µ = ⌊(α + β)/2⌋. As mentioned before, the disadvantage of the alpha-beta swap is that only two labels have the potential to change at each iteration. However, according to the above theorem, submodular algorithms can be designed in which more than two labels are changed. For example, at the same iteration where α and β are swapped, α − 1 and β + 1 can also be swapped (given 0 < α < β < M−1), and this does not affect the submodularity.

2) Unordered Labels: In this case, the only thing we know about the labels is whether or not they are equal. The distance function is defined as d(x, y) = 1(x ≠ y) ∈ {0, 1}, which gives 0 if x = y and 1 otherwise. Because of the binary nature of the distance function, compared to the case of ordered labels, it appears that submodularity holds here for a wider range of update policies. We will shortly show that this statement is true in a certain sense.

Here, the submodularity condition (28) becomes

(1(x ≠ y1) − 1(x ≠ y0)) (1(x ≠ z1) − 1(x ≠ z0)) ≥ 0.    (42)

The next proposition gives a necessary and sufficient submodularity condition for the unordered labels.

Proposition 5. With the distance function defined as d(x, y) = 1(x ≠ y) (unordered labels), and assuming the state preservation property for the update policy, the energy function E′x defined in (26) is submodular for all x ∈ Ln if and only if for any pair of active labels y, z ∈ A we have l0(y) ≠ l1(z).

Proof: According to Lemma 1, we only have to check (28) for the case where y, z ∈ A, that is, y0 ≠ y1 and z0 ≠ z1. We start with the proof of the backward direction. Assume that for any y, z ∈ A we have l0(y) ≠ l1(z); this gives y0 ≠ z1 and y1 ≠ z0 for all y, z ∈ A. Therefore, in (42), 1(x ≠ y1) and 1(x ≠ z0) cannot both be equal to 0 at the same time, nor can 1(x ≠ y0) and 1(x ≠ z1). It follows that (42) holds for all y, z ∈ A and any value of x ∈ L.

To prove the other direction, assume that there exist y, z ∈ A such that y0 = z1. Now, in (42) take x = y0 = z1; as y, z ∈ A, it follows that x ≠ y1 and x ≠ z0, and therefore (42) does not hold. This proves the proposition.

From the above theorem it can easily be checked that both alpha-expansion and alpha-beta swap give a submodular binary optimization in the unordered scheme, as it never happens for any y, z ∈ A that l0(y) = l1(z).

Now consider the case where, as in the case of ordered labels, the set L is of the form L = {0, 1, . . . , M−1}, but unlike the ordered label case, we use the distance function d(x, y) = 1(x ≠ y). In this case, if the policy is mirrored, as defined in Definition 1, then we have x0 ≤ µ < y1 or x0 > µ ≥ y1 for all x, y ∈ A, and hence it never happens that x0 = y1. This means that the binary problem would also be submodular in the unordered scheme, according to Proposition 5. The reverse, however, is not true in general, as there are update policies, like that of alpha-expansion, which are not in general submodular in the ordered case while always being submodular in an unordered label system.
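Continuing the illustrative sketch above (again, the helper names are ours, not the paper's), the condition of Proposition 5 can be checked in the same brute-force fashion; with the policies defined earlier, alpha-expansion passes for unordered labels even though it failed the ordered-label test:

```python
def submodular_unordered(policy, active):
    """Proposition 5: submodular for unordered labels iff l0(y) != l1(z)
    for every pair of active labels y, z."""
    return all(policy[y][0] != policy[z][1] for y in active for z in active)

print(submodular_unordered(pe, active(pe)))   # True: expansion is fine with unordered labels
print(submodular_unordered(ps, active(ps)))   # True: so is the swap policy
```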


[Fig. 10(a): graphical illustration of one iteration of the mirrored swap over the labels 0–8, with the centre of reflection marked at s/2.]

(b) The corresponding update table for s = 5, M = 9:

    x        0  1  2  3  4  5  6  7  8
    l0s(x)   0  1  2  2  1  0  6  7  8
    l1s(x)   5  4  3  3  4  5  6  7  8

Fig. 10. (a) Illustration of one iteration of the mirrored swap, with label set L = {0, 1, . . . , 8} and the parameter s = 5, and (b) the corresponding update table. The centre of reflection is s/2 = 2.5 and hence the swapping pairs are (0, 5), (1, 4) and (2, 3). The labels 6, 7 and 8 are inactive. According to section V-B1, a fourth swap of (0, 6) is also possible without violating submodularity; however, we do not include this kind of swap, to keep the algorithm neater. Moreover, when the number of labels is large, including one extra swapping possibility at each iteration does not make a significant difference.

VI. THE MIRRORED SWAP ALGORITHM

In the last two sections, we gave necessary and sufficient conditions for a general move algorithm to give a submodular binary move for each of the proposed MRF models. Amongst all those cases, there was one, namely the 3-clique model (5) for ordered labels, for which the classic alpha-expansion algorithm does not work. A major advantage of alpha-expansion over the alpha-beta swap is that one round of the algorithm takes only M = |L| iterations and at each iteration M−1 labels can potentially change². For the alpha-beta swap algorithm, however, there are M(M−1)/2 iterations per round and only two labels can change at each iteration. Therefore, in alpha-expansion a larger proportion of the optimization task is shouldered by the graph-cut part of the algorithm, for which the global optimum can be achieved. This makes the algorithm converge faster and also perform more effectively.

In section V-B1 we learned that the alpha-beta swap works fine with the 3-clique model for ordered labels. The question is whether a better algorithm is achievable, for which more labels are involved at each iteration and the number of iterations per round is linear in the number of labels M rather than quadratic. According to section V-B1, the answer is yes. Recall from section V-B1 that in a general move algorithm the necessary and sufficient condition for a binary move to be submodular is that the possible updates for all active labels are mirrored images of each other with respect to a common centre of reflection. Thus, what we can do is iterate through the different possible centres of reflection as the move parameter, and for each centre of reflection choose all pairs of labels at an equal distance from the centre, on either side of it, as potential label pairs whose values can be switched. Considering the label set L = {0, 1, . . . , M−1}, the possible values for the centre of reflection are 1/2, 1, 3/2, . . . , ((M−2) + (M−1))/2. Here, for convenience, the centre of reflection is represented by s/2 and the integer s ∈ {1, 2, . . . , 2M−3} is considered as the move parameter. This is illustrated in Fig. 10 for M = 9 and a centre of reflection of 2.5 (s = 5). The general update formulas for a centre of reflection s/2 and number of labels M can then be written as:

l0s(x) = min(x, s−x) if 0 ≤ s−x < M, and l0s(x) = x otherwise,    (43)

l1s(x) = max(x, s−x) if 0 ≤ s−x < M, and l1s(x) = x otherwise.    (44)
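For concreteness, the update maps (43) and (44) are straightforward to implement; the short sketch below (ours, for illustration only) reproduces the update table of Fig. 10 for M = 9 and s = 5.

```python
def l0(s, x, M):
    """Lower update option (43): the smaller of x and its reflection s - x."""
    return min(x, s - x) if 0 <= s - x < M else x

def l1(s, x, M):
    """Upper update option (44): the larger of x and its reflection s - x."""
    return max(x, s - x) if 0 <= s - x < M else x

M, s = 9, 5
print([l0(s, x, M) for x in range(M)])   # [0, 1, 2, 2, 1, 0, 6, 7, 8]
print([l1(s, x, M) for x in range(M)])   # [5, 4, 3, 3, 4, 5, 6, 7, 8]
```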

A schematic of the mirrored swap algorithm is outlined in Fig. 11. Notice that in the algorithm the energy function E : Rn → R is arbitrary. However, only for some energy functions, including the 3-clique model (5), can the binary move update u∗ ← argminu E(lus(x)) be carried out in polynomial time.

Notice that here, unlike alpha-expansion and alpha-beta swap, the number of active labels in each iteration of the algorithm depends on the parameter s. For s = 1 only two labels are active, namely 0 and 1. For s = M−1 we have the maximum number of active labels, which is either M or M−1 depending on whether M is even or odd. On average, however, the number of active labels at each iteration is of the order of M. The number of iterations for each round of the algorithm is 2M−3, which is linear in M, as in alpha-expansion. One can notice that, according to section V-B1, the update policy of the current version of the mirrored swap algorithm can be slightly modified to increase the number of active labels by one in most of the iterations, as shown in Fig. 10. However, this does not have a significant impact when the number of labels M is large.
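As a quick check of these counts (again just an illustrative sketch, using the l0 and l1 helpers defined above), one can enumerate the active labels for each value of s:

```python
def active_labels(s, M):
    """Labels that can actually change under the move with parameter s."""
    return [x for x in range(M) if l0(s, x, M) != l1(s, x, M)]

M = 9
counts = [len(active_labels(s, M)) for s in range(1, 2 * M - 2)]   # s = 1, ..., 2M-3
print(counts)                     # [2, 2, 4, 4, 6, 6, 8, 8, 8, 6, 6, 4, 4, 2, 2]
print(sum(counts) / len(counts))  # 4.8, i.e. on the order of M
```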

Another concern here is the order in which the parameter s iterates through the parameter set {1, 2, . . . , 2M−3}. The general rule of thumb is that, during the course of optimization, big changes should generally happen before small ones. According

²By one round of a general move algorithm we mean a set of iterations in which we go through all possible parameters, that is, the external loop of the general move algorithm shown in Fig. 7.


procedure MIRRORED-SWAP(x, M)
    repeat
        for each s ∈ {1, 2, . . . , 2M−3} do
            u∗ ← argminu E(lus(x))
            x ← lu∗s(x)
        end for
    until convergence
    return x
end procedure

Fig. 11. The mirrored swap algorithm. The inputs to the algorithm are the initial label values x and the number of labels M, implying L = {0, 1, . . . , M−1}. The energy function E is any function Rn → R, and lus(x) = [lu1s(x1), lu2s(x2), . . . , luns(xn)], where lus(x) is defined in (43) and (44).
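To complement the pseudocode of Fig. 11, the following sketch (ours) spells out the outer loop in code. The `binary_move` argument stands in for the graph-cut solver of each binary subproblem and is a placeholder assumption, not something specified in this paper; `l0` and `l1` are the helpers from the earlier sketch.

```python
def mirrored_swap(x, M, energy, binary_move):
    """Outer loop of the mirrored swap algorithm (Fig. 11).

    x           : list of current integer labels, one per site
    M           : number of labels, L = {0, ..., M-1}
    energy      : callable evaluating E on a full labelling
    binary_move : assumed solver returning, for the binary problem defined by
                  (l0_s, l1_s), the optimal 0/1 choice u* per site
    """
    prev = float("inf")
    while energy(x) < prev:                       # "until convergence"
        prev = energy(x)
        for s in range(1, 2 * M - 2):             # one round: s = 1, ..., 2M-3
            u = binary_move(x, s, M)
            # apply the selected update l_s^{u*} componentwise
            x = [l1(s, xi, M) if ui else l0(s, xi, M) for xi, ui in zip(x, u)]
    return x
```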

to this, using the basic order 1, 2, 3, . . . does not seem to be a good choice. An improvement is presumably achieved by randomly shuffling the parameter list before running the algorithm. Another way is to take a tree-like approach, that is, to first consider the label in the middle of the ordered list as the centre of reflection (s = M−1), then the centre of each of the two remaining halves, then the centre of each of the four remaining sublists, and so on.
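One way to realize such a tree-like ordering is shown in the sketch below, purely as an illustration; it bisects the range of s breadth-first, and other realizations consistent with the description above are equally possible.

```python
from collections import deque

def tree_order(M):
    """A breadth-first, 'tree-like' ordering of the move parameters s in {1, ..., 2M-3}:
    the centre of the whole range first (s = M-1), then the centres of the two
    halves, then of the four quarters, and so on."""
    order = []
    queue = deque([(1, 2 * M - 3)])        # closed interval of s values
    while queue:
        lo, hi = queue.popleft()
        if lo > hi:
            continue
        mid = (lo + hi) // 2
        order.append(mid)
        queue.append((lo, mid - 1))
        queue.append((mid + 1, hi))
    return order

print(tree_order(9))   # [8, 4, 12, 2, 6, 10, 14, 1, 3, 5, 7, 9, 11, 13, 15]
```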

VII. EXPERIMENTAL RESULTS

In this section we compare the 4-connected model (3), the diagonal edge model (4) and the 3-clique model (5) by implementing a simple image completion task, in which the whole image is reconstructed given the intensities of a proportion of pixels. The advantage of choosing an image completion task is that there is no need to select among different data-driven cost functions or find a proper balance between data-driven terms and the regularization term. Therefore, for each model the isolated effect of the regularizer can be illustrated. This allows a clear comparison of different regularization models.

In this particular task, we take several 640 × 480 grayscale images and sample each image uniformly with a frequency of 1/4, that is, every 4 pixels, in both the vertical and horizontal directions. Therefore, only the intensities of 1/16 of the pixels are known. The MRF model used for image reconstruction is of the form

E(x) = ∑i fi(xi) + R(x),

where the labels xi ∈ L = {0, 1, . . . , 255} represent the estimated intensity of the image pixels at the different sites i. To estimate the labels, we optimize this model for the four cases where R(x) is the four-connected model (3), the diagonal edge model (4), the basic 3-clique model (5) or the light 3-clique model. As mentioned before, in the light 3-clique model, instead of all four types of 3-cliques listed in Fig. 1(c), only one of them is used. Denoting by Ii the intensity of the main image at site i, and by Mi whether the intensity of the image at site i is known (Mi = 1) or unknown (Mi = 0), the data-driven terms are defined as

fi(xi) = 0   if Mi = 0,
fi(xi) = 0   if Mi = 1 and xi = Ii,
fi(xi) = ∞   if Mi = 1 and xi ≠ Ii.

In other words, the label values are fixed at the sites where the pixel intensity is known, and at the other sites the unary node costs are the same for all labels. One can see that in this case no balancing between the data-driven terms ∑i fi(xi) and the regularization term R(x) is needed.

For 256 possible label values, Ishikawa's method [10] could not be applied to the edge-based models due to memory limitations. Therefore, the 4-connected and diagonal edge models are optimized using alpha-expansion. The Mirrored Swap algorithm is used to optimize the basic and light 3-clique models, as the expansion algorithm fails to give a submodular move. In all cases, only one round of the move-based algorithm is run, since the reduction of energy in further rounds is comparatively negligible. To solve the graph-cut problems for the binary moves, Boykov and Kolmogorov's maxflow software [1] has been used.
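To make the experimental setup concrete, the sketch below (our illustration; numpy and the helper name are assumptions, not part of the paper's implementation) builds the sampling mask and the hard unary costs above for a single image, using a large finite constant in place of ∞.

```python
import numpy as np

def completion_unaries(image, step=4, n_labels=256, inf=1e9):
    """Sampling mask M_i and unary costs f_i(x_i) for the image-completion task.

    image : (H, W) uint8 array of intensities I_i
    step  : sampling stride; every step-th pixel in each direction is known,
            so 1/step**2 of the pixels carry a hard constraint
    """
    H, W = image.shape
    mask = np.zeros((H, W), dtype=bool)
    mask[::step, ::step] = True                    # M_i = 1 at the sampled sites

    unary = np.zeros((H, W, n_labels))
    labels = np.arange(n_labels)
    # known sites: cost 0 for x_i = I_i and a very large cost otherwise
    unary[mask] = np.where(labels[None, :] == image[mask][:, None], 0.0, inf)
    return mask, unary

# usage: mask, unary = completion_unaries(img) with img a 480 x 640 grayscale array
```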

Table IV shows the reconstruction mean squared error corresponding to each of the models for all 25 images. The higher errors belong to more detailed images, like natural scenes, as they possess high-frequency components which are lost by sampling. Except for three cases (Images 2, 7 and 24), the 3-clique models give a lower MSE than the edge-based models, and in Images 2, 7 and 24 the difference is very small. The basic and light 3-clique approaches give more or less the same MSE in all cases. While the diagonal edge model in most cases gives a higher MSE than the four-connected model, its reconstructed images look nicer in terms of having smoother edges. Fig. 12 illustrates the result of applying the four methods to one of the sample images (Image 11); the reconstructed images for the whole set of 25 cases are provided in the supplementary material. In general, it is clear that the 3-clique models perform better than the edge-based models, especially in the reconstruction of the edges in the image. The light 3-clique model performs nearly the same as the basic 3-clique model; however, it is slightly biased in how it reconstructs variations in different directions. For more details see the discussion in the caption of Fig. 12.


Case        4-Conn.   Diag. Edge   3-clique   Lg. 3-clique
Image 1      644.6     677.7        629.2      628.9
Image 2      721.6     781.2        724.5      720.6
Image 3      699.1     731.4        666.8      661.8
Image 4      150.7     153.6        134.6      137.4
Image 5      502.0     548.4        460.8      456.3
Image 6      290.4     287.6        252.3      253.9
Image 7      815.1     918.6        817.1      809.1
Image 8      274.9     284.3        246.3      246.2
Image 9      143.4     138.9        113.8      113.6
Image 10     101.5     111.9         93.4       93.6
Image 11      36.5      38.4         24.7       24.6
Image 12      41.1      45.9         38.6       38.4
Image 13     290.1     300.6        250.0      249.6
Image 14     111.5      92.4         61.2       68.3
Image 15     314.0     320.6        271.8      276.5
Image 16     638.9     659.5        615.8      615.1
Image 17     257.4     222.5        171.0      175.0
Image 18      91.0      93.8         77.0       77.4
Image 19      44.0      46.2         36.4       37.0
Image 20     153.5     147.9        115.3      122.3
Image 21     168.1     167.8        135.1      138.6
Image 22     192.6     219.8        155.3      156.6
Image 23     137.4     146.5        128.2      126.1
Image 24     123.2     144.8        123.7      124.0
Image 25     331.6     369.0        304.3      306.5
Average      291.0     306.0        265.9      266.3

TABLE IV
RECONSTRUCTION MEAN SQUARED ERRORS (MSE) FOR EACH OF THE MODELS, 4-CONNECTED, DIAGONAL EDGE, 3-CLIQUE AND LIGHT 3-CLIQUE, FOR ALL IMAGES.

Case      4-Conn.   Diag. Edge   3-clique   Lg. 3-clique
Img 11    1.1 min   1.9 min      18.6 min   5.4 min

TABLE V
THE AVERAGE RECONSTRUCTION TIME FOR EACH OF THE FOUR MODELS, 4-CONNECTED, DIAGONAL EDGE, 3-CLIQUE AND LIGHT 3-CLIQUE.


Table V shows the time spent by the corresponding move algorithm to optimize each of the models. The rather large time needed for the optimization comes from the large number of labels, namely 256. There are two reasons why the 3-clique models need more optimization time. The first is that they use the Mirrored Swap algorithm, for which the number of graph-cut problems solved in each round is nearly twice that of alpha-expansion: the alpha-expansion used for the 4-connected and diagonal edge models solves 256 graph-cut problems per round, while for the Mirrored Swap algorithm this number is 509. The other factor is the size of the graph used by the graph-cut algorithm in each case, as shown in Table III. Looking at Table III, one realizes that the large time needed for the basic 3-clique model is mostly due to its large graph size.

As the light 3-clique model performs nearly as well as the basic 3-clique model and needs considerably less time to optimize, it is a good candidate regularizer for the case of ordered labels.

VIII. CONCLUSION

In this paper we suggested an isotropic gradient MRF model and studied the submodularity of the binary move problem in move-based optimization algorithms. We considered two types of labels, namely ordered labels and unordered labels. For the more important case of ordered labels, our findings were as follows:

• The suggested model suits the ordered case well, especially if the number of labels is large.
• The necessary and sufficient condition for the binary move update policy to give a submodular binary energy function is that the two update possibilities for each label are (roughly) mirrored reflections of each other.
• As a result, the application of alpha-expansion to the suggested multi-label isotropic model is ruled out for ordered labels.
• The alpha-beta swap can be applied in the ordered-labels scenario, as it gives a submodular binary move problem.
• The proposed Mirrored Swap algorithm is superior to the alpha-beta swap in terms of the number of labels involved in the binary move at each iteration.

For the unordered case we can say:


[Fig. 12 panels: the main image and the reconstructions with the 4-Connected, Diagonal Edge, 3-Clique and Light 3-Clique models, followed by two rows of close-up views.]

Fig. 12. The results of reconstruction for the 4-connected, diagonal edge, 3-clique and light 3-clique models. The third and fourth rows show a close-up view of the boxed areas in the reconstructed images. The most obvious improvement made by the 3-clique models is observable at the edges. In particular, in the third row one can see the staircasing effect for the 4-connected model and, to a smaller degree, for the diagonal edge model. The 3-clique models clearly give smoother reconstructions of the edges. The light 3-clique model performs nearly the same as the basic 3-clique model; however, it does not seem to perform equally well for variations in all directions. In particular, looking at the third row of images, the light 3-clique model appears to give smoother edges than the basic 3-clique model in some directions and to perform worse in others.

• The suggested model is not quite isotropic in terms of giving the same cost to differently oriented edges, yet it works better than the 4-connected model.
• For the unordered case, submodularity holds for a wider range of binary moves, including alpha-expansion and alpha-beta swap.

One major problem with the 3-clique model is its large maxflow graph size, resulting in a long optimization time. Our experiments show that the light version of the 3-clique model, in which only one type of 3-clique is used instead of four, performs nearly the same as the basic 3-clique model, while being optimized more than 4 times faster. While this is a possible workaround, there is still an obvious need for techniques to further speed up the optimization of the proposed model.


REFERENCES

[1] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.
[2] Yuri Boykov and Vladimir Kolmogorov. Computing geodesics and minimal surfaces via graph cuts. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Volume 2, ICCV '03, pages 26–, Washington, DC, USA, 2003. IEEE Computer Society.
[3] Yuri Boykov, Olga Veksler, and Ramin Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, November 2001.
[4] Antonin Chambolle. Total variation minimization and a class of binary MRF models. In Proceedings of the 5th International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition, EMMCVPR '05, pages 136–152, Berlin, Heidelberg, 2005. Springer-Verlag.
[5] Antonin Chambolle, Vicent Caselles, Matteo Novaga, Daniel Cremers, and Thomas Pock. An introduction to total variation for image analysis.
[6] Antonin Chambolle and Jerome Darbon. On total variation minimization and surface evolution using parametric maximum flows. International Journal of Computer Vision, 84(3):288–307, September 2009.
[7] Jerome Darbon and Marc Sigelle. Image restoration with discrete constrained total variation part I: Fast and exact optimization. Journal of Mathematical Imaging and Vision, 26(3):261–276, December 2006.
[8] Donald Goldfarb and Wotao Yin. Parametric maximum flow algorithms for fast total variation minimization. SIAM Journal on Scientific Computing, 31(5):3712–3743, October 2009.
[9] Dorit S. Hochbaum. An efficient algorithm for image segmentation, Markov random fields and related problems. Journal of the ACM, 48(4):686–701, July 2001.
[10] H. Ishikawa and D. Geiger. Segmentation by grouping junctions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '98, pages 125–, Washington, DC, USA, 1998. IEEE Computer Society.
[11] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, February 2004.
[12] B. Nasihatkon and R. Hartley. Move-based algorithms for the optimization of an isotropic gradient MRF model. In 2012 International Conference on Digital Image Computing Techniques and Applications (DICTA), pages 1–8, 2012.
[13] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1–4):259–268, 1992.