An Introduction to Deterministic Annealing
Fabrice Rossi
SAMM – Université Paris 1
2012
Optimization
- in machine learning: optimization is one of the central tools
- methodology:
  - choose a model with some adjustable parameters
  - choose a goodness of fit measure of the model to some data
  - tune the parameters in order to maximize the goodness of fit
- examples: artificial neural networks, support vector machines, etc.
- in other fields:
  - operational research: project planning, routing, scheduling, etc.
  - design: antennas, wings, engines, etc.
  - etc.
Easy or Hard?
- is optimization computationally difficult?
- the convex case is relatively easy:
  - $\min_{x \in C} J(x)$ with $C$ convex and $J$ convex
  - polynomial algorithms (in general)
  - why?
    - a local minimum is global
    - the local tangent hyperplane is a global lower bound
- the non convex case is hard:
  - multiple minima
  - no local to global inference
  - NP hard in some cases
In machine learning
- convex case:
  - linear models with(out) regularization (ridge, lasso)
  - kernel machines (SVM and friends)
  - nonlinear projections (e.g., semidefinite embedding)
- non convex:
  - artificial neural networks (such as multilayer perceptrons)
  - vector quantization (a.k.a. prototype based clustering)
  - general clustering
  - optimization with respect to meta parameters:
    - kernel parameters
    - discrete parameters, e.g., feature selection
- in this lecture, non convex problems:
  - combinatorial optimization
  - mixed optimization
Particular cases
- Pure combinatorial optimization problems:
  - a solution space $\mathcal{M}$: finite but (very) large
  - an error measure $E$ from $\mathcal{M}$ to $\mathbb{R}$
  - goal: solve
    $$M^* = \arg\min_{M \in \mathcal{M}} E(M)$$
  - example: graph clustering
- Mixed problems:
  - a solution space $\mathcal{M} \times \mathbb{R}^p$
  - an error measure $E$ from $\mathcal{M} \times \mathbb{R}^p$ to $\mathbb{R}$
  - goal: solve
    $$(M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$$
  - example: clustering in $\mathbb{R}^n$
Graph clustering
Goal: find an optimal clustering of a graph with $N$ nodes into $K$ clusters

- $\mathcal{M}$: set of all partitions of $\{1, \ldots, N\}$ into $K$ classes (the asymptotic behavior of $|\mathcal{M}|$ is roughly $K^{N-1}$ for a fixed $K$ with $N \to \infty$)
- assignment matrix: a partition in $\mathcal{M}$ is described by an $N \times K$ matrix $M$ such that $M_{ik} \in \{0, 1\}$ and $\sum_{k=1}^K M_{ik} = 1$
- many error measures are available:
  - graph cut measures (node normalized, edge normalized, etc.)
  - modularity
  - etc.
Graph clustering
- original graph
- four clusters

[figure: a 34-node graph, shown first as the original graph and then with its four-cluster partition]
Graph visualization
- legend: woman / bisexual man / heterosexual man
- 2386 persons: unreadable as a raw drawing
- maximal modularity clustering into 39 clusters
- hierarchical display

[figure: the full graph and its clustered, hierarchical display]
Vector quantization
- $N$ observations $(x_i)_{1 \le i \le N}$ in $\mathbb{R}^n$
- $\mathcal{M}$: set of all partitions
- quantization error:
  $$E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} \|x_i - w_k\|^2$$
- the "continuous" parameters are the prototypes $w_k \in \mathbb{R}^n$
- the assignment matrix notation is equivalent to the standard formulation (see the sketch below)
  $$E(M, W) = \sum_{i=1}^N \|x_i - w_{k(i)}\|^2,$$
  where $k(i)$ is the index of the cluster to which $x_i$ is assigned
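To make the assignment-matrix notation concrete, here is a minimal NumPy sketch (not from the slides; the data, prototypes, and variable names are made up for illustration) that evaluates the quantization error both in matrix form and in the standard form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))          # N = 10 observations in R^2
W = rng.normal(size=(3, 2))           # K = 3 prototypes

# hard assignment matrix M (N x K): M[i, k] = 1 iff x_i is assigned to cluster k
sq_dists = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)   # ||x_i - w_k||^2, shape (N, K)
k_of_i = np.argmin(sq_dists, axis=1)
M = np.zeros((10, 3))
M[np.arange(10), k_of_i] = 1.0

# E(M, W) = sum_i sum_k M_ik ||x_i - w_k||^2
E_matrix_form = (M * sq_dists).sum()

# equivalent "standard" form: sum_i ||x_i - w_{k(i)}||^2
E_standard_form = ((X - W[k_of_i]) ** 2).sum()

assert np.isclose(E_matrix_form, E_standard_form)
print(E_matrix_form)
```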
Example
[figure: a two-dimensional example dataset plotted in the $(X_1, X_2)$ plane]
Combinatorial optimization
- a research field by itself
- many problems are NP hard... and one therefore relies on heuristics or specialized approaches:
  - relaxation methods
  - branch and bound methods
  - stochastic approaches:
    - simulated annealing
    - genetic algorithms
- in this presentation: deterministic annealing
Outline
Introduction
Mixed problems
  Soft minimum
  Computing the soft minimum
  Evolution of β

Deterministic Annealing
  Annealing
  Maximum entropy
  Phase transitions
  Mass constrained deterministic annealing

Combinatorial problems
  Expectation approximations
  Mean field annealing
  In practice
Optimization strategies
- mixed problem transformation
  $$(M^*, W^*) = \arg\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W),$$
- remove the continuous part
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \left( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \right)$$
- or the combinatorial part
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \left( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \right)$$
- this is not alternate optimization
Alternate optimization
- elementary heuristic:
  1. start with a random configuration $M^0 \in \mathcal{M}$
  2. compute $W^i = \arg\min_{W \in \mathbb{R}^p} E(M^{i-1}, W)$
  3. compute $M^i = \arg\min_{M \in \mathcal{M}} E(M, W^i)$
  4. go back to 2 until convergence
- e.g., the k-means algorithm for vector quantization (see the sketch after this list):
  1. start with a random partition $M^0 \in \mathcal{M}$
  2. compute the optimal prototypes with
     $$W^i_k = \frac{1}{\sum_{j=1}^N M^{i-1}_{jk}} \sum_{j=1}^N M^{i-1}_{jk} x_j$$
  3. compute the optimal partition with $M^i_{jk} = 1$ if and only if $k = \arg\min_{1 \le l \le K} \|x_j - W^i_l\|^2$
  4. go back to 2 until convergence
- alternate optimization converges to a local minimum
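A minimal sketch of the k-means alternate optimization described above (Lloyd-style iterations); the function name `kmeans`, the empty-cluster fallback, and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Alternate optimization: optimal prototypes given the partition,
    then optimal partition given the prototypes, until convergence."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    labels = rng.integers(0, K, size=N)                      # random initial partition
    for _ in range(n_iter):
        # optimal prototypes: cluster means (re-seed empty clusters with a random point)
        W = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                      else X[rng.integers(N)] for k in range(K)])
        # optimal partition: nearest prototype
        new_labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new_labels, labels):               # convergence
            break
        labels = new_labels
    return W, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0, 3, 6)])
W, labels = kmeans(X, K=3)
print(W)
```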
Combinatorial first
- let us consider the combinatorial-first approach:
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{W \in \mathbb{R}^p} \left( W \mapsto \min_{M \in \mathcal{M}} E(M, W) \right)$$
- not attractive a priori, as $W \mapsto \min_{M \in \mathcal{M}} E(M, W)$:
  - has no reason to be convex
  - has no reason to be $C^1$
- vector quantization example:
  $$F(W) = \sum_{i=1}^N \min_{1 \le k \le K} \|x_i - w_k\|^2$$
  is neither convex nor $C^1$
Example
Clustering elements of $\mathbb{R}$ into 2 clusters.

[figure: $\ln F(W)$ over the prototype plane $(w_1, w_2)$; multiple local minima and singularities: minimizing $F(W)$ is difficult]
Soft minimum
- a soft minimum approximation of $\min_{M \in \mathcal{M}} E(M, W)$ (see the sketch below)
  $$F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$$
  solves the regularity issue: if $E(M, \cdot)$ is $C^1$, then $F_\beta(W)$ is also $C^1$
- if $M^*(W)$ is such that for all $M \ne M^*(W)$, $E(M, W) > E(M^*(W), W)$, then
  $$\lim_{\beta \to \infty} F_\beta(W) = E(M^*(W), W) = \min_{M \in \mathcal{M}} E(M, W)$$
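A tiny numerical illustration of the soft minimum (not from the slides; the list of energies is made up): as $\beta$ grows, $F_\beta$ converges to the hard minimum, and the computation is done in a numerically stable way by factoring out the smallest energy.

```python
import numpy as np

E = np.array([3.0, 1.2, 0.7, 2.5])   # hypothetical values of E(M, W) over a tiny M

def soft_min(E, beta):
    # F_beta = -(1/beta) * ln sum_M exp(-beta * E(M)), computed stably
    m = E.min()
    return m - np.log(np.exp(-beta * (E - m)).sum()) / beta

for beta in (0.1, 1.0, 10.0, 100.0):
    print(beta, soft_min(E, beta))
print("hard minimum:", E.min())
```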
Example
[figure sequence: $E(M, W)$ as a function of $M$, then $\exp(-\beta E(M, W))$ compared with $\exp(-\beta E(M^*, W))$ for $\beta = 10^{-2}$ up to $10^{2}$: as $\beta$ grows, the weight concentrates on the minimizer; a final plot shows $F_\beta$ as a function of $\beta$ over the same range]
Discussion
- positive aspects:
  - if $E(M, W)$ is $C^1$ with respect to $W$ for all $M$, then $F_\beta$ is $C^1$
  - $\lim_{\beta \to \infty} F_\beta(W) = \min_{M \in \mathcal{M}} E(M, W)$
  - $\forall \beta > 0$,
    $$-\frac{1}{\beta} \ln |\mathcal{M}| \le F_\beta(W) - \min_{M \in \mathcal{M}} E(M, W) \le 0$$
  - when $\beta$ is close to 0, $F_\beta$ is generally easy to minimize
- negative aspects:
  - how to compute $F_\beta(W)$ efficiently?
  - how to solve $\min_{W \in \mathbb{R}^p} F_\beta(W)$?
  - why would that work in practice?
- philosophy: where does $F_\beta(W)$ come from?
Computing Fβ(W )
- in general computing $F_\beta(W)$ is intractable: exhaustive calculation of $E(M, W)$ on the whole set $\mathcal{M}$
- a particular case: clustering with an additive cost, i.e.
  $$E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} e_{ik}(W),$$
  where $M$ is an assignment matrix (and e.g. $e_{ik}(W) = \|x_i - w_k\|^2$)
- then (see the sketch below)
  $$F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \exp(-\beta e_{ik}(W))$$
  with computational cost $O(NK)$ (compared to $\simeq K^{N-1}$)
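A minimal check of the factorization (not from the slides; data, prototypes, and the tiny problem size are made up): for a small enough $N$ the brute-force sum over all $K^N$ assignments can be compared with the $O(NK)$ formula.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, K, beta = 6, 2, 2.0
X = rng.normal(size=(N, 1))
W = rng.normal(size=(K, 1))
e = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)      # e_ik(W), shape (N, K)

# O(NK) formula: F_beta(W) = -(1/beta) * sum_i ln sum_k exp(-beta * e_ik)
m = e.min(axis=1, keepdims=True)                        # stabilize the log-sum-exp
F_fast = -(1.0 / beta) * (np.log(np.exp(-beta * (e - m)).sum(axis=1)) - beta * m[:, 0]).sum()

# brute force over all K^N assignments (only feasible for tiny N)
energies = np.array([e[np.arange(N), list(a)].sum() for a in product(range(K), repeat=N)])
F_brute = -(1.0 / beta) * (np.log(np.exp(-beta * (energies - energies.min())).sum())
                           - beta * energies.min())

print(F_fast, F_brute)   # the two values agree
```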
Proof sketch
- let $Z_\beta(W)$ be the partition function given by
  $$Z_\beta(W) = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$$
- assignments in $\mathcal{M}$ are independent and the sum can be rewritten
  $$Z_\beta(W) = \sum_{M_{1.} \in C_K} \ldots \sum_{M_{N.} \in C_K} \exp(-\beta E(M, W)),$$
  with $C_K = \{(1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)\}$
- then
  $$Z_\beta(W) = \prod_{i=1}^N \sum_{M_{i.} \in C_K} \exp\left(-\beta \sum_k M_{ik}\, e_{ik}(W)\right)$$
Independence
- additive cost corresponds to some form of conditional independence
- given the prototypes, observations are assigned independently to their optimal clusters
- in other words: the global optimal assignment is the concatenation of the individual optimal assignments
- this is what makes alternate optimization tractable in the K-means despite its combinatorial aspect:
  $$\min_{M \in \mathcal{M}} \sum_{i=1}^N \sum_{k=1}^K M_{ik} \|x_i - w_k\|^2$$
Minimizing Fβ(W )
- assume $E$ to be $C^1$ with respect to $W$, then
  $$\nabla F_\beta(W) = \frac{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W)) \nabla_W E(M, W)}{\sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))}$$
- at a minimum $\nabla F_\beta(W) = 0$ $\Rightarrow$ solve the equation or use gradient descent
- same calculation problem as for $F_\beta(W)$: in general, an exhaustive scan of $\mathcal{M}$ is needed
- if $E$ is additive
  $$\nabla F_\beta(W) = \sum_{i=1}^N \frac{\sum_{k=1}^K \exp(-\beta e_{ik}(W)) \nabla_W e_{ik}(W)}{\sum_{k=1}^K \exp(-\beta e_{ik}(W))}$$
Fixed point scheme
- a simple strategy to solve $\nabla F_\beta(W) = 0$
- starting from a random value of $W$:
  1. compute (expectation phase)
     $$\mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^K \exp(-\beta e_{il}(W))}$$
  2. keeping the $\mu_{ik}$ constant, solve for $W$ (maximization phase)
     $$\sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W) = 0$$
  3. loop to 1 until convergence
- generally converges to a minimum of $F_\beta(W)$ (for a fixed $\beta$)
Vector quantization
- if $e_{ik}(W) = \|x_i - w_k\|^2$ (vector quantization),
  $$\nabla_{w_k} F_\beta(W) = 2 \sum_{i=1}^N \mu_{ik}(W)(w_k - x_i),$$
  with
  $$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}$$
- setting $\nabla F_\beta(W) = 0$ leads to (see the sketch below)
  $$w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)} \sum_{i=1}^N \mu_{ik}(W)\, x_i$$
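A minimal sketch of the fixed point iteration for vector quantization at a fixed $\beta$ (not from the slides; the function name `soft_vq_step` and the toy data are assumptions for illustration).

```python
import numpy as np

def soft_vq_step(X, W, beta):
    """One EM-like fixed point iteration of F_beta for vector quantization."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)      # ||x_i - w_k||^2
    d2 -= d2.min(axis=1, keepdims=True)                      # numerical stability
    mu = np.exp(-beta * d2)
    mu /= mu.sum(axis=1, keepdims=True)                      # soft memberships mu_ik
    W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]             # weighted means
    return W_new, mu

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0)])
W = X.mean(axis=0) + rng.normal(scale=1e-3, size=(2, 2))     # near-collapsed start
for _ in range(100):
    W, mu = soft_vq_step(X, W, beta=5.0)
print(W)
```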
Links with the K-means
- if $\mu_{il} = \delta_{k(i)=l}$, we obtain the prototype update rule of the K-means
- in the K-means, we have
  $$k(i) = \arg\min_{1 \le l \le K} \|x_i - w_l\|^2$$
- in deterministic annealing, we have
  $$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}$$
- this is a soft minimum version of the crisp rule of the k-means
Links with EM
- isotropic Gaussian mixture with a unique and fixed variance $\varepsilon$
  $$p_k(x \mid w_k) = \frac{1}{(2\pi\varepsilon)^{p/2}}\, e^{-\frac{1}{2\varepsilon}\|x - w_k\|^2}$$
- given the mixing coefficients $\delta_k$, responsibilities are
  $$P(x_i \in C_k \mid x_i, W) = \frac{\delta_k\, e^{-\frac{1}{2\varepsilon}\|x_i - w_k\|^2}}{\sum_{l=1}^K \delta_l\, e^{-\frac{1}{2\varepsilon}\|x_i - w_l\|^2}}$$
- identical update rules for $W$
- $\beta$ can be seen as an inverse variance: it quantifies the uncertainty about the clustering results
So far...
- if $E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik}\, e_{ik}(W)$, one tries to reach $\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W)$ using
  $$F_\beta(W) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \exp(-\beta e_{ik}(W))$$
- $F_\beta(W)$ is smooth
- a fixed point EM-like scheme can be used to minimize $F_\beta(W)$ for a fixed $\beta$
- but:
  - the k-means algorithm is also an EM-like algorithm: it does not reach a global optimum
  - how to handle $\beta \to \infty$?
Limit case β → 0
- limit behavior of the EM scheme:
  - $\mu_{ik} = \frac{1}{K}$ for all $i$ and $k$ (and $W$!)
  - $W$ is therefore a solution of
    $$\sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W) = 0$$
  - no iteration needed
- vector quantization:
  - we have
    $$w_k = \frac{1}{N} \sum_{i=1}^N x_i$$
  - each prototype is the center of mass of the data: only one cluster!
- in general the case $\beta \to 0$ is easy (unique minimum under mild hypotheses)
Example
Back to the example: clustering elements of $\mathbb{R}$ into 2 clusters.

[figure sequence: contour plots over $(w_1, w_2)$ of the original objective, of the soft minimum with $\beta = 10^{-1}$ (a simple landscape), and of the soft minimum with $\beta = 0.5$ (a more complex landscape)]
Increasing β
- when $\beta$ increases, $F_\beta(W)$ converges to $\min_{M \in \mathcal{M}} E(M, W)$
- optimizing $F_\beta(W)$ becomes more and more complex:
  - local minima
  - rougher and rougher: $F_\beta(W)$ remains $C^1$ but with large values of the gradient at some points
- path following strategy (homotopy):
  - for an increasing series $(\beta_l)_l$ with $\beta_0 = 0$
  - initialize $W^*_0 = \arg\min_W F_0(W)$ for $\beta = 0$ by solving $\sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W) = 0$
  - compute $W^*_l$ via the EM scheme for $\beta_l$ starting from $W^*_{l-1}$
- a similar strategy is used in interior point algorithms
Example
Clustering 6 elements of $\mathbb{R}$ into 2 classes.

[figure: the six data points on the real line, roughly at 1, 2, 3, 7, 7.5, and 8.25]
Behavior of Fβ(W )
[figure sequence: contour plots over $(w_1, w_2)$ of $\min_{M \in \mathcal{M}} E(M, W)$ and of $F_\beta(W)$ for $\beta = 10^{-2}, 10^{-1.75}, \ldots, 1$; the smooth single-basin landscape at small $\beta$ progressively deforms into the rugged landscape of the original problem]
One step closer to the final algorithm
- given an increasing series $(\beta_l)_l$ with $\beta_0 = 0$ (a sketch follows below):
  1. compute $W^*_0$ such that $\sum_{i=1}^N \sum_{k=1}^K \nabla_W e_{ik}(W^*_0) = 0$
  2. for $l = 1$ to $L$:
     2.1 initialize $W^*_l$ to $W^*_{l-1}$
     2.2 compute $\mu_{ik} = \frac{\exp(-\beta_l e_{ik}(W^*_l))}{\sum_{m=1}^K \exp(-\beta_l e_{im}(W^*_l))}$
     2.3 update $W^*_l$ such that $\sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W^*_l) = 0$
     2.4 go back to 2.2 until convergence
  3. use $W^*_L$ as an estimation of $\arg\min_{W \in \mathbb{R}^p} (\min_{M \in \mathcal{M}} E(M, W))$
- this is (up to some technical details) the deterministic annealing algorithm for minimizing $E(M, W) = \sum_{i=1}^N \sum_{k=1}^K M_{ik}\, e_{ik}(W)$
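A minimal sketch of the full annealing loop for vector quantization (not from the slides; the geometric $\beta$ schedule, the noise level, and the function name `da_vq` are illustrative assumptions, with no critical temperature computation).

```python
import numpy as np

def da_vq(X, K, betas, n_inner=50, noise=1e-3, seed=0):
    """Deterministic annealing for vector quantization with a fixed beta schedule."""
    rng = np.random.default_rng(seed)
    W = np.tile(X.mean(axis=0), (K, 1))                   # beta -> 0 solution: the global mean
    for beta in betas:
        W = W + rng.normal(scale=noise, size=W.shape)     # break prototype symmetry
        for _ in range(n_inner):
            d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
            d2 -= d2.min(axis=1, keepdims=True)
            mu = np.exp(-beta * d2)
            mu /= mu.sum(axis=1, keepdims=True)
            W_new = (mu.T @ X) / mu.sum(axis=0)[:, None]
            if np.allclose(W_new, W, atol=1e-8):
                break
            W = W_new
    return W, mu.argmax(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(100, 2)) for m in ((0, 0), (4, 0), (2, 3))])
W, labels = da_vq(X, K=3, betas=np.geomspace(0.01, 100.0, 30))
print(W)
```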
What’s next?
- general topics:
  - why the name annealing?
  - why should that work? (and other philosophical issues)
- specific topics:
  - technical "details" (e.g., annealing schedule)
  - generalization:
    - non additive criteria
    - general combinatorial problems
Annealing
- from Wikipedia:
  Annealing is a heat treatment wherein a material is altered, causing changes in its properties such as strength and hardness. It is a process that produces conditions by heating to above the re-crystallization temperature and maintaining a suitable temperature, and then cooling.
- in our context, we'll see that
  - $T = \frac{1}{k_B \beta}$ acts as a temperature and models some thermal agitation
  - $E(W, M)$ is the energy of a configuration of the system, while $F_\beta(W)$ corresponds to the free energy of the system at a given temperature
  - increasing $\beta$ reduces the thermal agitation
Simulated Annealing
- a classical combinatorial optimization algorithm for computing $\min_{M \in \mathcal{M}} E(M)$
- given an increasing series $(\beta_l)_l$ (see the sketch below):
  1. choose a random initial configuration $M_0$
  2. for $l = 1$ to $L$:
     2.1 take a small random step from $M_{l-1}$ to build $M^c_l$ (e.g., change the cluster of an object)
     2.2 if $E(M^c_l) < E(M_{l-1})$ then set $M_l = M^c_l$
     2.3 otherwise set $M_l = M^c_l$ with probability $\exp(-\beta_l(E(M^c_l) - E(M_{l-1})))$
     2.4 and set $M_l = M_{l-1}$ in the other case
  3. use $M_L$ as an estimation of $\arg\min_{M \in \mathcal{M}} E(M)$
- naive vision:
  - always accept improvement
  - "thermal agitation" allows escaping local minima
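A minimal simulated annealing sketch (not from the slides): the energy is the within-cluster sum of squares of a hard partition, the proposal moves a single point, and the toy data, schedule, and function names are illustrative assumptions.

```python
import numpy as np

def energy(labels, X, K):
    """E(M): within-cluster sum of squares for a hard partition."""
    E = 0.0
    for k in range(K):
        pts = X[labels == k]
        if len(pts):
            E += ((pts - pts.mean(axis=0)) ** 2).sum()
    return E

def simulated_annealing(X, K, betas, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))
    E = energy(labels, X, K)
    for beta in betas:
        for _ in range(len(X)):                        # a few proposals per temperature
            i = rng.integers(len(X))
            cand = labels.copy()
            cand[i] = rng.integers(K)                  # small random step: move one point
            E_cand = energy(cand, X, K)
            # accept improvements, and worse moves with probability exp(-beta * dE)
            if E_cand < E or rng.random() < np.exp(-beta * (E_cand - E)):
                labels, E = cand, E_cand
    return labels, E

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ((0, 0), (3, 0))])
labels, E = simulated_annealing(X, K=2, betas=np.geomspace(0.1, 50.0, 200))
print(E)
```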
Statistical physics
- consider a system with state space $\mathcal{M}$
- denote $E(M)$ the energy of the system when in state $M$
- at thermal equilibrium with the environment, the probability for the system to be in state $M$ is given by the Boltzmann (Gibbs) distribution
  $$P_T(M) = \frac{1}{Z_T} \exp\left(-\frac{E(M)}{k_B T}\right)$$
  with $T$ the temperature, $k_B$ Boltzmann's constant, and
  $$Z_T = \sum_{M \in \mathcal{M}} \exp\left(-\frac{E(M)}{k_B T}\right)$$
Back to annealing
- physical analogy:
  - maintain the system at thermal equilibrium:
    - set a temperature
    - wait for the system to settle at this temperature
  - slowly decrease the temperature:
    - works well in real systems (e.g., crystallization)
    - allows the system to explore the state space
- computer implementation:
  - direct computation of $P_T(M)$ (or related quantities)
  - sampling from $P_T(M)$
Simulated Annealing Revisited

- simulated annealing samples from $P_T(M)$!
- more precisely: the asymptotic distribution of $M_l$ for a fixed $\beta$ is given by $P_{1/(k_B \beta)}$
- how? Metropolis-Hastings Markov Chain Monte Carlo
- principle:
  - $P$ is the target distribution
  - $Q(\cdot \mid \cdot)$ is the proposal distribution (sampling friendly)
  - start with $x^t$, get $x'$ from $Q(x \mid x^t)$
  - set $x^{t+1}$ to $x'$ with probability
    $$\min\left(1, \frac{P(x')\, Q(x^t \mid x')}{P(x^t)\, Q(x' \mid x^t)}\right)$$
  - keep $x^{t+1} = x^t$ when this fails
Simulated Annealing Revisited

- in SA, $Q$ is the random local perturbation
- major feature of Metropolis-Hastings MCMC:
  $$\frac{P_T(M')}{P_T(M)} = \exp\left(-\frac{E(M') - E(M)}{k_B T}\right)$$
  so $Z_T$ is not needed
- underlying assumption: symmetric proposals
  $$Q(x^t \mid x') = Q(x' \mid x^t)$$
- rationale:
  - sampling directly from $P_T(M)$ for small $T$ is difficult
  - track likely areas during cooling
Mixed problems
- Simulated Annealing applies to combinatorial problems
- in mixed problems, this would mean removing the continuous part
  $$\min_{M \in \mathcal{M},\, W \in \mathbb{R}^p} E(M, W) = \min_{M \in \mathcal{M}} \left( M \mapsto \min_{W \in \mathbb{R}^p} E(M, W) \right)$$
- in the vector quantization example
  $$E(M) = \sum_{i=1}^N \sum_{k=1}^K M_{ik} \left\| x_i - \frac{1}{\sum_{j=1}^N M_{jk}} \sum_{j=1}^N M_{jk}\, x_j \right\|^2$$
- no obvious direct relation to the proposed algorithm
Statistical physics
- thermodynamic (free) energy: the useful energy available in a system (a.k.a. the one that can be extracted to produce work)
- the Helmholtz (free) energy is $F_T = U - TS$, where $U$ is the internal energy of the system and $S$ its entropy
- one can show that
  $$F_T = -k_B T \ln Z_T$$
- the natural evolution of a system is to reduce its free energy
- deterministic annealing mimics the cooling process of a system by tracking the evolution of the minimal free energy
Gibbs distribution
- still no use of $P_T$...
- important property:
  - for $f$ a function defined on $\mathcal{M}$,
    $$E_{P_T}(f(M)) = \frac{1}{Z_T} \sum_{M \in \mathcal{M}} f(M) \exp\left(-\frac{E(M)}{k_B T}\right)$$
  - then
    $$\lim_{T \to 0} E_{P_T}(f(M)) = f(M^*),$$
    where $M^* = \arg\min_{M \in \mathcal{M}} E(M)$
- useful to track global information
Membership functions
- in the EM-like phase, we have
  $$\mu_{ik} = \frac{\exp(-\beta e_{ik}(W))}{\sum_{l=1}^K \exp(-\beta e_{il}(W))}$$
- this does not come out of thin air:
  $$\mu_{ik} = E_{P_{\beta,W}}(M_{ik}) = \sum_{M \in \mathcal{M}} M_{ik}\, P_{\beta,W}(M)$$
- $\mu_{ik}$ is therefore the probability for $x_i$ to belong to cluster $k$ under the Gibbs distribution
- at the limit $\beta \to \infty$, the $\mu_{ik}$ peak to Dirac-like membership functions: they give the "optimal" partition
Example

Clustering 6 elements of $\mathbb{R}$ into 2 classes (the same points as before, roughly at 1, 2, 3, 7, 7.5, and 8.25).

Membership functions

[figure sequence: for increasing $\beta$, the prototype positions $(w_1, w_2)$ and the membership probabilities of the six points; the memberships start uniform, sharpen as $\beta$ grows, split at a phase transition (shown before and after), and finally become crisp Dirac-like assignments]
Summary
- two principles:
  - physical systems tend to reach a state of minimal free energy at a given temperature
  - when the temperature is slowly decreased, a system tends to reach a well organized state
- annealing algorithms tend to reproduce this slow cooling behavior
- simulated annealing:
  - uses Metropolis-Hastings MCMC to sample the Gibbs distribution
  - easy to implement but needs a very slow cooling
- deterministic annealing:
  - direct minimization of the free energy
  - computes expectations of global quantities with respect to the Gibbs distribution
  - aggressive cooling is possible, but the $Z_T$ computation must be tractable
Maximum entropy
- another interpretation/justification of DA
- reformulation of the problem: find a probability distribution on $\mathcal{M}$ which is "regular" and gives a low average error
- in other words: find $P_W$ such that
  - $\sum_{M \in \mathcal{M}} P_W(M) = 1$
  - $\sum_{M \in \mathcal{M}} E(W, M) P_W(M)$ is small
  - the entropy of $P_W$, $-\sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M)$, is high
- entropy plays a regularization role
- this leads to the minimization of
  $$\sum_{M \in \mathcal{M}} E(W, M) P_W(M) + \frac{1}{\beta} \sum_{M \in \mathcal{M}} P_W(M) \ln P_W(M)$$
  where $\beta$ sets the trade-off between fitting and regularity
Maximum entropy
- some calculations show that the minimum over $P_W$ is given by
  $$P_{\beta,W}(M) = \frac{1}{Z_{\beta,W}} \exp(-\beta E(M, W)),$$
  with
  $$Z_{\beta,W} = \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W)).$$
- plugged back into the optimization criterion, we end up with the soft minimum
  $$F_\beta(W) = -\frac{1}{\beta} \ln \sum_{M \in \mathcal{M}} \exp(-\beta E(M, W))$$
So far...
- three derivations of deterministic annealing:
  - soft minimum
  - thermodynamics inspired annealing
  - maximum entropy principle
- all of them boil down to two principles:
  - replace the crisp optimization problem in the $\mathcal{M}$ space by a soft one
  - track the evolution of the solution when the crispness of the approximation increases
- now the technical details...
Fixed point
- remember the EM phase:
  $$\mu_{ik}(W) = \frac{\exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \exp(-\beta \|x_i - w_l\|^2)}$$
  $$w_k = \frac{1}{\sum_{i=1}^N \mu_{ik}(W)} \sum_{i=1}^N \mu_{ik}(W)\, x_i$$
- this is a fixed point method, i.e., $W = U_\beta(W)$ for a data dependent $U_\beta$
- the fixed point is generally stable, except during a phase transition
Phase transition
[figures: prototype positions $(w_1, w_2)$ and membership probabilities of the six points just before the phase transition (collapsed prototypes, near-uniform memberships) and just after it (separated prototypes, split memberships)]
Stability
[figure: two contour plots over $(w_1, w_2)$]
Stability
- unstable fixed point:
  - even during a phase transition, the exact previous fixed point can remain a fixed point
  - but we want the transition to take place: this is a new cluster birth!
  - fortunately, the previous fixed point is unstable during a phase transition
- modified EM phase:
  1. initialize $W^*_l$ to $W^*_{l-1} + \varepsilon$
  2. compute $\mu_{ik} = \frac{\exp(-\beta e_{ik}(W^*_l))}{\sum_{m=1}^K \exp(-\beta e_{im}(W^*_l))}$
  3. update $W^*_l$ such that $\sum_{i=1}^N \sum_{k=1}^K \mu_{ik} \nabla_W e_{ik}(W^*_l) = 0$
  4. go back to 2 until convergence
- the noise is important to prevent missing phase transitions
Annealing strategy
- neglecting symmetries, there are $K - 1$ phase transitions for $K$ clusters
- tracking $W^*$ between transitions is easy
- trade-off:
  - slow annealing: long running time but no transition is missed
  - fast annealing: quicker but with a higher risk of missing a transition
- but critical temperatures can be computed:
  - via a stability analysis of fixed points
  - corresponds to a linear approximation of the fixed point equation $U_\beta$ around a fixed point
Critical temperatures
- for vector quantization (see the sketch below)
- first temperature:
  - the fixed point stability is related to the variance of the data
  - the critical $\beta$ is $1/(2\lambda_{\max})$ where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix of the data
- in general:
  - the stability of the whole system is related to the stability of each cluster
  - the $\mu_{ik}$ play the role of membership functions for each cluster
  - the critical $\beta$ for cluster $k$ is $1/(2\lambda^k_{\max})$ where $\lambda^k_{\max}$ is the largest eigenvalue of
    $$\sum_i \mu_{ik} (x_i - w_k)(x_i - w_k)^T$$
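A small sketch of the critical temperature computation (not from the slides); it normalizes the $\mu$-weighted scatter matrix by $\sum_i \mu_{ik}$ so that, for the first transition, it coincides with the data covariance mentioned above — an assumption on the normalization convention.

```python
import numpy as np

def critical_beta(X, mu_k, w_k):
    """Critical beta for one cluster: 1 / (2 * lambda_max), with lambda_max the
    largest eigenvalue of the mu-weighted (normalized) covariance of the cluster."""
    D = X - w_k
    C = (mu_k[:, None] * D).T @ D / mu_k.sum()
    return 1.0 / (2.0 * np.linalg.eigvalsh(C).max())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w0 = X.mean(axis=0)            # single collapsed prototype at beta -> 0
mu0 = np.ones(len(X))          # all points fully in the unique cluster
print(critical_beta(X, mu0, w0))
```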
A possible final algorithm
1. initialize $W^1 = \frac{1}{N} \sum_{i=1}^N x_i$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \frac{\exp(-\beta \|x_i - W^l_k\|^2)}{\sum_{m=1}^K \exp(-\beta \|x_i - W^l_m\|^2)}$
       2.3.3 compute $W^l_k = \frac{1}{\sum_{i=1}^N \mu_{ik}} \sum_{i=1}^N \mu_{ik}\, x_i$
       2.3.4 go back to 2.3.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W \in \mathbb{R}^p}(\min_{M \in \mathcal{M}} E(M, W))$ and $\mu_{ik}$ as the corresponding $M$
Computational cost
- $N$ observations in $\mathbb{R}^p$ and $K$ clusters
- prototype storage: $Kp$
- $\mu$ matrix storage: $NK$
- one EM iteration costs $O(NKp)$
- we need at most $K - 1$ full EM runs: $O(NK^2p)$
- compared to the K-means:
  - more storage
  - no simple distance calculation tricks: all distances must be computed
  - roughly corresponds to $K - 1$ k-means runs, not taking into account the number of distinct $\beta$ values considered during each phase transition
- neglecting the eigenvalue analysis (in $O(p^3)$)...
Collapsed prototypes
- when $\beta \to 0$, only one cluster
- waste of computational resources: $K$ identical calculations
- this is a general problem:
  - each phase transition adds new clusters
  - prior to that, prototypes are collapsed
  - but we need them to track cluster birth...
- a related problem: uniform cluster weights
- in a Gaussian mixture
  $$P(x_i \in C_k \mid x_i, W) = \frac{\delta_k\, e^{-\frac{1}{2\varepsilon}\|x_i - w_k\|^2}}{\sum_{l=1}^K \delta_l\, e^{-\frac{1}{2\varepsilon}\|x_i - w_l\|^2}}$$
  prototypes are weighted by the mixing coefficients $\delta$
Mass constrained DA

- we introduce prototype weights $\delta_k$ and plug them into the soft minimum formulation
  $$F_\beta(W, \delta) = -\frac{1}{\beta} \sum_{i=1}^N \ln \sum_{k=1}^K \delta_k \exp(-\beta e_{ik}(W))$$
- $F_\beta$ is minimized under the constraint $\sum_{k=1}^K \delta_k = 1$
- this leads to (see the sketch below)
  $$\mu_{ik}(W) = \frac{\delta_k \exp(-\beta \|x_i - w_k\|^2)}{\sum_{l=1}^K \delta_l \exp(-\beta \|x_i - w_l\|^2)}$$
  $$\delta_k = \frac{1}{N} \sum_{i=1}^N \mu_{ik}(W)$$
  $$w_k = \frac{1}{N \delta_k} \sum_{i=1}^N \mu_{ik}(W)\, x_i$$
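A minimal sketch of one mass constrained fixed point iteration at a given $\beta$ (not from the slides; the function name `mcda_step` and the toy data are illustrative assumptions).

```python
import numpy as np

def mcda_step(X, W, delta, beta):
    """One fixed point iteration of mass constrained DA (weighted soft assignments)."""
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    d2 -= d2.min(axis=1, keepdims=True)
    mu = delta[None, :] * np.exp(-beta * d2)
    mu /= mu.sum(axis=1, keepdims=True)               # mu_ik
    delta_new = mu.mean(axis=0)                       # delta_k = (1/N) sum_i mu_ik
    W_new = (mu.T @ X) / (len(X) * delta_new)[:, None]
    return W_new, delta_new, mu

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(60, 2)) for m in ((0, 0), (3, 1))])
W = X.mean(axis=0) + rng.normal(scale=1e-3, size=(2, 2))
delta = np.full(2, 0.5)
for _ in range(200):
    W, delta, mu = mcda_step(X, W, delta, beta=4.0)
print(W, delta)
```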
MCDA
- increases the similarity with EM for Gaussian mixtures:
  - exactly the same algorithm for a fixed $\beta$
  - isotropic Gaussian mixture with variance $\frac{1}{2\beta}$
- however:
  - the variance is generally a parameter in the Gaussian mixtures
  - different goals: minimal distortion versus maximum likelihood
  - no support for other distributions in DA
Cluster birth monitoring
- avoid wasting computational resources:
  - consider a cluster $C_k$ with a prototype $w_k$
  - prior to a phase transition in $C_k$:
    - duplicate $w_k$ into $w'_k$
    - apply some noise to both prototypes
    - split $\delta_k$ into $\frac{\delta_k}{2}$ and $\frac{\delta'_k}{2}$
  - apply the EM-like algorithm and monitor $\|w_k - w'_k\|$
  - accept the new cluster if $\|w_k - w'_k\|$ becomes "large"
- two strategies:
  - eigenvalue based analysis (costly, but quite accurate)
  - opportunistic: always maintain duplicate prototypes and promote diverging ones
Final algorithm
1. initialize $W^1 = \frac{1}{N} \sum_{i=1}^N x_i$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 duplicate the prototype of the critical cluster and split the associated weight
   2.4 for $t$ values of $\beta$ around $\beta_l$:
       2.4.1 add some small noise to $W^l$
       2.4.2 compute $\mu_{ik} = \frac{\delta_k \exp(-\beta \|x_i - W^l_k\|^2)}{\sum_{m=1}^K \delta_m \exp(-\beta \|x_i - W^l_m\|^2)}$
       2.4.3 compute $\delta_k = \frac{1}{N} \sum_{i=1}^N \mu_{ik}$
       2.4.4 compute $W^l_k = \frac{1}{N \delta_k} \sum_{i=1}^N \mu_{ik}\, x_i$
       2.4.5 go back to 2.4.2 until convergence
3. use $W^K$ as an estimation of $\arg\min_{W \in \mathbb{R}^p}(\min_{M \in \mathcal{M}} E(M, W))$ and $\mu_{ik}$ as the corresponding $M$
Example
A simple dataset

[figure: a simple two-dimensional dataset in the $(X_1, X_2)$ plane]
Example
Error evolution and cluster births

[figure: the error $E$ as a function of $\beta$ during annealing, with cluster births]
Clusters

[figure sequence: the clusters obtained on the simple dataset at successive stages of the annealing, from a single cluster to the final partition]
Summary
- deterministic annealing addresses combinatorial optimization by smoothing the cost function and tracking the evolution of the solution while the smoothing progressively vanishes
- motivated by:
  - heuristic (soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximal entropy)
- in practice:
  - rather simple multiple EM-like algorithm
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - excellent results (frequently better than those obtained by the K-means)
Combinatorial only
- the situation is quite different for a combinatorial problem
  $$\arg\min_{M \in \mathcal{M}} E(M)$$
- no soft minimum heuristic
- no direct use of the Gibbs distribution $P_T$; one needs to rely on $E_{P_T}(f(M))$ for interesting $f$
- calculation of $P_T$ is not tractable:
  - tractability is related to independence
  - if $E(M) = E(M_1) + \ldots + E(M_D)$, then the sum $\sum_{M_1} \ldots \sum_{M_D}$ factorizes and independent optimization can be done on each variable...
Graph clustering
- a non oriented graph with $N$ nodes and weight matrix $A$ ($A_{ij}$ is the weight of the connection between nodes $i$ and $j$)
- maximal modularity clustering (see the sketch after this list):
  - node degree $k_i = \sum_{j=1}^N A_{ij}$, total weight $m = \frac{1}{2} \sum_{i,j} A_{ij}$
  - null model $P_{ij} = \frac{k_i k_j}{2m}$, $B_{ij} = \frac{1}{2m}(A_{ij} - P_{ij})$
  - assignment matrix $M$
  - modularity of the clustering
    $$\mathrm{Mod}(M) = \sum_{i,j} \sum_k M_{ik} M_{jk} B_{ij}$$
- difficulty: the coupling $M_{ik} M_{jk}$ (non linearity)
- in the sequel $E(M) = -\mathrm{Mod}(M)$
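A minimal sketch of the modularity computation (not from the slides; the toy graph of two cliques and the function name `modularity` are illustrative assumptions).

```python
import numpy as np

def modularity(A, labels):
    """Mod(M) = sum_{i,j} sum_k M_ik M_jk B_ij with B = (A - k k^T / 2m) / 2m."""
    k = A.sum(axis=1)                     # node degrees
    m = A.sum() / 2.0                     # total weight
    B = (A - np.outer(k, k) / (2.0 * m)) / (2.0 * m)
    M = np.zeros((len(A), labels.max() + 1))
    M[np.arange(len(A)), labels] = 1.0    # assignment matrix
    return np.trace(M.T @ B @ M)          # equals sum_{i,j,k} M_ik M_jk B_ij

# hypothetical toy graph: two 4-node cliques joined by a single edge
A = np.zeros((8, 8))
A[:4, :4] = 1.0; A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0
print(modularity(A, np.array([0, 0, 0, 0, 1, 1, 1, 1])))   # good split: higher modularity
print(modularity(A, np.array([0, 1, 0, 1, 0, 1, 0, 1])))   # bad split: lower modularity
```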
Strategy for DA
- choose some interesting statistics:
  - $E_{P_T}(f(M))$
  - for instance $E_{P_T}(M_{ik})$ for clustering
- compute an approximation of $E_{P_T}(f(M))$ at high temperature $T$
- track the evolution of the approximation while lowering $T$
- use $\lim_{T \to 0} E_{P_T}(f(M)) = f(\arg\min_{M \in \mathcal{M}} E(M))$ to claim that the approximation converges to the interesting limit
- rationale: approximating $E_{P_T}(f(M))$ for a low $T$ is probably difficult; we use the path following strategy to ease the task
Example
[figure sequence: a one-dimensional function on $[-1, 1]$, then the corresponding Gibbs distribution over $x$ for $\beta = 10^{-1}$ up to $10^{2}$, progressively concentrating on the global minimizer; a final plot shows the Gibbs expectation $\hat{x}$ as a function of $\beta$]
Expectation approximation
- computing
  $$E_{P_Z}(f(Z)) = \int f(Z) \, dP_Z$$
  has many applications
- for instance, in machine learning:
  - performance evaluation (generalization error)
  - EM for probabilistic models
  - Bayesian approaches
- two major numerical approaches:
  - sampling
  - variational approximation
Sampling
- (strong) law of large numbers
  $$\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^N f(z_i) = \int f(Z) \, dP_Z$$
  if the $z_i$ are independent samples of $Z$
- many extensions, especially to dependent samples (slower convergence)
- MCMC methods: build a Markov chain with stationary distribution $P_Z$ and take the average of $f$ on a trajectory
- applied to $E_{P_T}(f(M))$: this is simulated annealing!
Variational approximation
- main idea:
  - the intractability of the calculation of $E_{P_Z}(f(Z))$ is induced by the complexity of $P_Z$
  - let's replace $P_Z$ by a simpler distribution $Q_Z$...
  - ...and $E_{P_Z}(f(Z))$ by $E_{Q_Z}(f(Z))$
- calculus of variations:
  - optimization over functional spaces
  - in this context: choose an optimal $Q_Z$ in a space of probability measures $\mathcal{Q}$ with respect to a goodness of fit criterion between $P_Z$ and $Q_Z$
- probabilistic context:
  - simple distributions: factorized distributions
  - quality criterion: Kullback-Leibler divergence
Variational approximation
- assume $Z = (Z_1, \ldots, Z_D) \in \mathbb{R}^D$ and $f(Z) = \sum_{i=1}^D f_i(Z_i)$
- for a general $P_Z$, the structure of $f$ cannot be exploited in $E_{P_Z}(f(Z))$
- variational approximation:
  - choose $\mathcal{Q}$ a set of tractable distributions on $\mathbb{R}$
  - solve
    $$Q^* = \arg\min_{Q = (Q_1 \times \ldots \times Q_D) \in \mathcal{Q}^D} \mathrm{KL}(Q \| P_Z)$$
    with
    $$\mathrm{KL}(Q \| P_Z) = -\int \ln \frac{dP_Z}{dQ} \, dQ$$
  - approximate $E_{P_Z}(f(Z))$ by
    $$E_{Q^*}(f(Z)) = \sum_{i=1}^D E_{Q^*_i}(f_i(Z_i))$$
Application to DA
- general form: $P_\beta(M) = \frac{1}{Z_\beta} \exp(-\beta E(M))$
- intractable because $E(M)$ is not linear in $M = (M_1, \ldots, M_D)$
- variational approximation:
  - replace $E$ by $F(W, M)$ defined by
    $$F(W, M) = \sum_{i=1}^D W_i M_i$$
  - use for $Q_{W,\beta}$
    $$Q_{W,\beta} = \frac{1}{Z_{W,\beta}} \exp(-\beta F(W, M))$$
  - optimize $W$ via $\mathrm{KL}(Q_{W,\beta} \| P_\beta)$
Fitting the approximation
- a complex expectation calculation is replaced by a new optimization problem
- we have
  $$\mathrm{KL}(Q_{W,\beta} \| P_\beta) = \ln Z_\beta - \ln Z_{W,\beta} + \beta E_{Q_{W,\beta}}(E(M) - F(W, M))$$
- tractable by factorization on $F$, except for $Z_\beta$
- solved by $\nabla_W \mathrm{KL}(Q_{W,\beta} \| P_\beta) = 0$
  - tractable because $Z_\beta$ does not depend on $W$
  - generally solved via a fixed point approach (EM-like again)
- similar to variational approximation in EM or Bayesian methods
Example
- in the case of (graph) clustering, we use
  $$F(W, M) = \sum_{i=1}^N \sum_{k=1}^K W_{ik} M_{ik}$$
- a crucial (general) property is that, when $i \ne j$,
  $$E_{Q_{W,\beta}}(M_{ik} M_{jl}) = E_{Q_{W,\beta}}(M_{ik})\, E_{Q_{W,\beta}}(M_{jl})$$
- a (long and boring) series of equations leads to the general mean field equations
  $$\frac{\partial E_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \sum_{k=1}^K \frac{\partial E_{Q_{W,\beta}}(M_{jk})}{\partial W_{jl}} W_{jk}, \quad \forall j, l.$$
Example
- classical EM-like scheme to solve the mean field equations:
  1. we have
     $$E_{Q_{W,\beta}}(M_{ik}) = \frac{\exp(-\beta W_{ik})}{\sum_{l=1}^K \exp(-\beta W_{il})}$$
  2. keep $E_{Q_{W,\beta}}(M_{ik})$ fixed and solve the simplified mean field equations for $W$
  3. update $E_{Q_{W,\beta}}(M_{ik})$ based on the new $W$ and loop on 2 until convergence
- in the graph clustering case
  $$W_{jk} = 2 \sum_{i \ne j} E_{Q_{W,\beta}}(M_{ik})\, B_{ij}$$
Summary
- deterministic annealing for combinatorial optimization
- given an objective function $E(M)$:
  - choose a linear parametric approximation $F(W, M)$ of $E(M)$, with the associated distribution $Q_{W,\beta} = \frac{1}{Z_{W,\beta}} \exp(-\beta F(W, M))$
  - write the mean field equations, i.e.
    $$\nabla_W \left( -\ln Z_{W,\beta} + \beta E_{Q_{W,\beta}}(E(M) - F(W, M)) \right) = 0$$
  - use an EM-like algorithm to solve the equations:
    - given $W$, compute $E_{Q_{W,\beta}}(M_{ik})$
    - given $E_{Q_{W,\beta}}(M_{ik})$, solve the equations
- back to our classical questions:
  - why would that work?
  - how to do that in practice?
Physical interpretation
- back to the (generalized) Helmholtz (free) energy
  $$F_\beta = U - \frac{S}{\beta}$$
- the free energy can be generalized to any distribution $Q$ on the states, by
  $$F_{\beta,Q} = E_Q(E(M)) - \frac{H(Q)}{\beta},$$
  where $H(Q) = -\sum_{M \in \mathcal{M}} Q(M) \ln Q(M)$ is the entropy of $Q$
- the Boltzmann-Gibbs distribution minimizes the free energy, i.e.
  $$F_T \le F_{T,Q}$$
Physical interpretation
- in fact we have
  $$F_{T,Q} = F_T + \frac{1}{\beta} \mathrm{KL}(Q \| P),$$
  where $P$ is the Gibbs distribution
- minimizing $\mathrm{KL}(Q \| P) \ge 0$ over $Q$ corresponds to finding the best upper bound of the free energy on a class of distributions
- in other words: given that the system states must be distributed according to a distribution in $\mathcal{Q}$, find the most stable distribution
Mean field

- $E(M)$ is difficult to handle because of coupling (dependencies) while $F(W, M)$ is linear and corresponds to a de-coupling in which dependencies are replaced by mean effects
- example:
  - modularity
    $$E(M) = \sum_i \sum_k M_{ik} \sum_j M_{jk} B_{ij}$$
  - approximation
    $$F(W, M) = \sum_i \sum_k M_{ik} W_{ik}$$
  - the complex influence of $M_{ik}$ on $E$ is replaced by a single parameter $W_{ik}$: the mean effect associated with a change in $M_{ik}$
In practice
- homotopy again:
  - at high temperature ($\beta \to 0$):
    - $E_{Q_{W,\beta}}(M_{ik}) = \frac{1}{K}$
    - the fixed point equation is generally easy to solve; for instance in graph clustering
      $$W_{jk} = \frac{2}{K} \sum_{i \ne j} B_{ij}$$
  - slowly decrease the temperature
  - use the mean field at the previous higher temperature as a starting point for the fixed point iterations
- phase transitions:
  - stable versus unstable fixed points (eigenanalysis)
  - noise injection and mass constrained version
A graph clustering algorithm
1. initialize $W^1_{jk} = \frac{2}{K} \sum_{i \ne j} B_{ij}$
2. for $l = 2$ to $K$:
   2.1 compute the critical $\beta_l$
   2.2 initialize $W^l$ to $W^{l-1}$
   2.3 for $t$ values of $\beta$ around $\beta_l$:
       2.3.1 add some small noise to $W^l$
       2.3.2 compute $\mu_{ik} = \frac{\exp(-\beta W_{ik})}{\sum_{m=1}^K \exp(-\beta W_{im})}$
       2.3.3 compute $W_{jk} = 2 \sum_{i \ne j} \mu_{ik} B_{ij}$
       2.3.4 go back to 2.3.2 until convergence
3. threshold $\mu_{ik}$ into an optimal partition of the original graph

- mass constrained approach variant
- stability based transition detection (slower annealing)

A sketch of this mean field annealing loop follows below.
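A minimal sketch of mean field annealing for modularity maximization (not from the slides): the toy graph, the fixed geometric schedule, the function name `mean_field_modularity`, and the sign convention (memberships favor large mean fields, since the slides absorb the minus sign of $E(M) = -\mathrm{Mod}(M)$ into $W$) are assumptions for illustration; there is no critical temperature analysis here.

```python
import numpy as np

def mean_field_modularity(A, K, betas, n_inner=100, noise=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    k = A.sum(axis=1)
    m = A.sum() / 2.0
    B = (A - np.outer(k, k) / (2.0 * m)) / (2.0 * m)
    np.fill_diagonal(B, 0.0)                            # mean field couples i with j != i
    W = np.tile((2.0 / K) * B.sum(axis=1), (K, 1)).T    # high temperature starting point
    for beta in betas:
        W = W + rng.normal(scale=noise, size=W.shape)   # break cluster symmetry
        for _ in range(n_inner):
            # memberships E_Q(M_ik): larger mean field W_ik -> larger membership
            mu = np.exp(beta * (W - W.max(axis=1, keepdims=True)))
            mu /= mu.sum(axis=1, keepdims=True)
            W_new = 2.0 * B @ mu                        # W_jk = 2 sum_{i != j} mu_ik B_ij
            if np.allclose(W_new, W, atol=1e-9):
                break
            W = W_new
    return mu.argmax(axis=1)

# toy graph: two 4-node cliques joined by one edge
A = np.zeros((8, 8))
A[:4, :4] = 1.0; A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0
print(mean_field_modularity(A, K=2, betas=np.geomspace(1.0, 500.0, 40)))
```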
Karate
- clustering of a simple graph
- Zachary's Karate club
- four clusters

[figure: the 34-node karate club graph with the four clusters found]
Modularity
Evolution of the modularity during annealing.

[figure: modularity as a function of $\beta$]
Phase transitions
- first phase: 2 clusters
- second phase: 4 clusters

[figure: the two-cluster and four-cluster partitions]
DA Roadmap
How to apply deterministic annealing to a combinatorial optimization problem $E(M)$?

1. define a linear mean field approximation $F(W, M)$
2. specialize the mean field equations, i.e. compute
   $$\frac{\partial E_{Q_{W,\beta}}(E(M))}{\partial W_{jl}} = \frac{\partial}{\partial W_{jl}} \left( \frac{1}{Z_{W,\beta}} \sum_M E(M) \exp(-\beta F(W, M)) \right)$$
3. identify a fixed point scheme to solve the mean field equations, using $E_{Q_{W,\beta}}(M_i)$ as a constant if needed
4. wrap the corresponding EM scheme in an annealing loop: this is the basic algorithm
5. analyze the fixed point scheme to find stability conditions: this leads to an advanced annealing schedule
Summary
- deterministic annealing addresses combinatorial optimization by solving a related but simpler problem and by tracking the evolution of the solution while the simpler problem converges to the original one
- motivated by:
  - heuristic (soft minimum)
  - statistical physics (Gibbs distribution)
  - information theory (maximal entropy)
- works mainly because of the solution following strategy (homotopy): do not solve a difficult problem from scratch, but rather start from a good guess of the solution
Summary
- difficulties:
  - boring and technical calculations (when facing a new problem)
  - a bit tricky to implement (phase transitions, noise injection, etc.)
  - tends to be slow in practice: computing exp is very costly even on modern hardware (roughly 100 flops)
- advantages:
  - versatile
  - appears as a simple EM-like algorithm embedded in an annealing loop
  - very interesting intermediate results
  - excellent final results in practice
References
T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1-14, January 1997.

K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86(11):2210-2239, November 1998.

F. Rossi and N. Villa. Optimizing an organized modularity measure for topographic graph clustering: a deterministic annealing approach. Neurocomputing, 73(7-9):1142-1163, March 2010.