Optimal Transport
http://bicmr.pku.edu.cn/~wenzw/bigdata2019.html
Acknowledgement: these slides are based on Prof. Gabriel Peyré's lecture notes.
Applications: comparing measures
Optimal transport compares probability measures; applications in images, vision, graphics, machine learning, ...
[Figure: the L2 mean of two measures versus their optimal transport mean.]

Applications: toward high-dimensional OT
[Figure omitted.]
Outline
1 Problem Formulation
2 Applications
3 Entropic Regularization
4 Sinkhorn’s Algorithm
5 Sinkhorn-Newton method
6 Shielding Neighborhood Method
Kantorovich's Formulation
Discrete optimal transport. Input: two discrete probability measures
$$\alpha = \sum_{i=1}^{m} a_i \delta_{x_i}, \qquad \beta = \sum_{j=1}^{n} b_j \delta_{y_j}. \quad (1)$$
$X = \{x_i\}_i$, $Y = \{y_j\}_j$: given point clouds; the $x_i$, $y_j$ are vectors.
$a_i, b_j$: positive weights with $\sum_{i=1}^{m} a_i = \sum_{j=1}^{n} b_j = 1$.
$C_{ij}$: costs, $C_{ij} = c(x_i, y_j) \geq 0$.

Couplings
$$U(a, b) \stackrel{\mathrm{def}}{=} \{\Pi \in \mathbb{R}^{m \times n}_{+};\ \Pi \mathbf{1}_n = a,\ \Pi^{\top}\mathbf{1}_m = b\} \quad (2)$$
is called the set of couplings with respect to $\alpha$ and $\beta$.
Kantorovich's Formulation
In optimal transport we want to compute the following quantity [Kantorovich 1942].

Optimal transport distance
$$L_C(a, b) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j}\Pi_{i,j};\ \Pi \in U(a, b) \Big\}. \quad (3)$$
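Problem (3) is a linear program in the $m \cdot n$ entries of $\Pi$. As a sanity check, here is a minimal sketch that solves (3) directly with scipy.optimize.linprog; the point clouds and weights are made-up illustration data.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 5, 6
x = rng.normal(size=(m, 2))   # point cloud X (made-up data)
y = rng.normal(size=(n, 2))   # point cloud Y (made-up data)
a = np.full(m, 1.0 / m)       # uniform weights summing to 1
b = np.full(n, 1.0 / n)
C = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2) ** 2  # C_ij = |x_i - y_j|^2

# Equality constraints Pi 1_n = a and Pi^T 1_m = b, with Pi flattened row by row.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0   # row sums equal a_i
for j in range(n):
    A_eq[m + j, j::n] = 1.0            # column sums equal b_j
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None), method="highs")
P = res.x.reshape(m, n)                # optimal coupling Pi
print("L_C(a, b) =", res.fun)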
Push Forward
Radon measures $(\alpha, \beta)$ on $(\mathcal{X}, \mathcal{Y})$. Transfer of measure by $T : \mathcal{X} \to \mathcal{Y}$: the push forward. The measure $T_{\#}\alpha$ on $\mathcal{Y}$ is defined by
$$T_{\#}\alpha(Y) = \alpha(T^{-1}(Y)), \quad \text{for all measurable } Y \subset \mathcal{Y}. \quad (4)$$
Equivalently,
$$\int_{\mathcal{Y}} g(y)\, \mathrm{d}T_{\#}\alpha(y) \stackrel{\mathrm{def}}{=} \int_{\mathcal{X}} g(T(x))\, \mathrm{d}\alpha(x). \quad (5)$$
Discrete measures: $T_{\#}\alpha = \sum_i a_i \delta_{T(x_i)}$.
Smooth densities, $\mathrm{d}\alpha = \rho(x)\mathrm{d}x$, $\mathrm{d}\beta = \xi(x)\mathrm{d}x$:
$$T_{\#}\alpha = \beta \iff \xi(T(x))\,|\det(\partial T(x))| = \rho(x). \quad (6)$$
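A quick sanity check of (6): the map $T(x) = 2x$ pushes the uniform density $\rho \equiv 1$ on $[0, 1]$ forward to the uniform density $\xi \equiv \tfrac{1}{2}$ on $[0, 2]$, and indeed
$$\xi(T(x))\,|\det(\partial T(x))| = \tfrac{1}{2} \cdot 2 = 1 = \rho(x).$$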
Monge problem
The Monge problem seeks a map that associates to each point $x_i$ a single point $y_j$, and which must push the mass of $\alpha$ toward the mass of $\beta$, namely:
$$\forall j,\quad b_j = \sum_{i:\, T(x_i) = y_j} a_i.$$
Discrete case:
$$\min_{T} \sum_i c(x_i, T(x_i)), \quad \text{s.t. } T_{\#}\alpha = \beta.$$
Arbitrary measures:
$$\min_{T} \int_{\mathcal{X}} c(x, T(x))\, \mathrm{d}\alpha(x), \quad \text{s.t. } T_{\#}\alpha = \beta.$$
Couplings between General Measures
Projectors:
$$P_{\mathcal{X}} : (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto x \in \mathcal{X}, \qquad P_{\mathcal{Y}} : (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto y \in \mathcal{Y}. \quad (7)$$

Couplings between general measures:
$$U(\alpha, \beta) \stackrel{\mathrm{def}}{=} \{\pi \in \mathcal{M}_{+}(\mathcal{X} \times \mathcal{Y});\ P_{\mathcal{X}\#}\pi = \alpha,\ P_{\mathcal{Y}\#}\pi = \beta\} \quad (8)$$
is called the set of couplings with respect to $\alpha$ and $\beta$.
Cases of Couplings
[Figure: couplings π between measures α and β in the discrete, semi-discrete, and continuous cases.]

More Examples
[Figure: further examples of couplings π between measures α and β.]
Kantorovich Problem for General Measures
Optimal transport distance between general measures:
$$\mathcal{L}(\alpha, \beta, c) \stackrel{\mathrm{def}}{=} \min_{\pi \in U(\alpha, \beta)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y)\, \mathrm{d}\pi(x, y). \quad (9)$$
Probabilistic interpretation:
$$\min_{(X, Y)} \{\mathbb{E}_{(X, Y)}(c(X, Y));\ X \sim \alpha,\ Y \sim \beta\}. \quad (10)$$
Wasserstein Distance
Metric space $\mathcal{X} = \mathcal{Y}$. Distance $d(x, y)$ (nonnegative, symmetric, identity of indiscernibles, triangle inequality). Cost $c(x, y) = d(x, y)^p$, $p \geq 1$.

Wasserstein distance:
$$W_p(\alpha, \beta) \stackrel{\mathrm{def}}{=} \mathcal{L}(\alpha, \beta, d^p)^{1/p}. \quad (11)$$
Theorem. $W_p$ is a distance, and
$$W_p(\alpha_n, \alpha) \to 0 \iff \alpha_n \rightharpoonup \alpha \ \text{(weak convergence)}. \quad (12)$$
Example:
$$W_p(\delta_x, \delta_y) = d(x, y). \quad (13)$$
Dual form
Dual problem (discrete case):
$$\max_{f \in \mathbb{R}^m,\ g \in \mathbb{R}^n} f^{\top} a + g^{\top} b, \quad \text{subject to } f_i + g_j \leq C_{ij},\ \forall (i, j). \quad (14)$$
Relation between any primal and dual solutions (complementary slackness):
$$P_{ij} > 0 \Rightarrow f_i + g_j = C_{ij}.$$
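This relation can be checked numerically. A minimal sketch, assuming the POT package (used elsewhere in these slides), where ot.emd with log=True returns the dual potentials alongside the primal plan:

import numpy as np
import ot  # POT package

rng = np.random.default_rng(1)
m, n = 4, 5
a = np.full(m, 1.0 / m)
b = np.full(n, 1.0 / n)
C = rng.random((m, n))              # made-up cost matrix
P, log = ot.emd(a, b, C, log=True)  # primal plan and dual potentials
f, g = log["u"], log["v"]
# Complementary slackness: wherever P_ij > 0, f_i + g_j equals C_ij.
i, j = np.nonzero(P > 1e-12)
print(np.allclose(f[i] + g[j], C[i, j]))  # expected: True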
Wasserstein barycenter
Define $C \stackrel{\mathrm{def}}{=} M_{XY}$, where $(M_{XY})_{ij} = d(x_i, y_j)^p$. Write the Wasserstein distance as
$$L(a, b, C) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j}\Pi_{i,j};\ \Pi \in U(a, b) \Big\}. \quad (15)$$
Given a set of point clouds and their corresponding probability vectors $\{(Y^{(k)}, b^{(k)})\}$, $k = 1, \dots, N$, find a support $X = \{x_i\}$ with a probability vector $a$ such that $(X, a)$ is an optimal solution of
$$\min_{X, a}\ \frac{1}{N} \sum_{k=1}^{N} L(a, b^{(k)}, M_{XY^{(k)}}).$$
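A minimal sketch of this free-support barycenter problem, assuming POT's ot.lp.free_support_barycenter (which alternates between solving the N transport problems and updating the support X); the point clouds below are made-up illustration data:

import numpy as np
import ot

rng = np.random.default_rng(2)
N, n_pts = 3, 20
# Made-up input point clouds Y^(k) with uniform weights b^(k).
Ys = [rng.normal(loc=k, size=(n_pts, 2)) for k in range(N)]
bs = [np.full(n_pts, 1.0 / n_pts) for _ in range(N)]
# Barycenter support of k_bary points with fixed uniform weights a.
k_bary = 10
X_init = rng.normal(size=(k_bary, 2))
a = np.full(k_bary, 1.0 / k_bary)
X = ot.lp.free_support_barycenter(Ys, bs, X_init, b=a)
print(X.shape)  # (10, 2): optimized barycenter support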
Applications: Grayscale image equalization
[Figure: displacement interpolation between grayscale histograms at t = 0, 0.25, 0.5, 0.75, 1.]
Applications: image color adaptation
Example: https://github.com/rflamary/POT/blob/master/notebooks/plot_otda_color_images.ipynb

# Given color images stored in RGB format: I1, I2
# Convert an image to a matrix (one pixel per line)
X1 = im2mat(I1)
X2 = im2mat(I2)
# Take samples
Xs = X1[idx1, :]
Xt = X2[idx2, :]
# Scatter plot of colors
pl.scatter(Xs[:, 0], Xs[:, 2], c=Xs)
# Sinkhorn transport
ot_sinkhorn = ot.da.SinkhornTransport(reg_e=1e-1)
ot_sinkhorn.fit(Xs=Xs, Xt=Xt)
# Prediction between images
transp_Xs_sinkhorn = ot_sinkhorn.transform(Xs=X1)
transp_Xt_sinkhorn = ot_sinkhorn.inverse_transform(Xt=X2)
Applications: image color adaptation
[Figure: original and color-adapted images.]

Applications: image color palette equalization
[Figure: optimal transport between the color palettes of two images.]

Applications: shape interpolation
[Figure: optimal transport interpolation between shapes.]
Applications: MRI Data Processing
[Figure: L2 barycenter versus $W_2^2$ barycenter of MRI data.]
Ground cost $c = d_M$: geodesic distance on the cortical surface $M$.
Applications: shape registration
Shape registration as regularized loss minimization:
$$\min_{\varphi\ \mathrm{diffeo}}\ D(\varphi(\mu), \nu) + R(\varphi).$$
Hilbertian loss (MMD/RKHS): $D(\mu, \nu) = \| k_{\sigma} \star (\mu - \nu) \|^2_{L^2}$.
Sinkhorn divergence: $D(\mu, \nu) = W_{\varepsilon}(\mu, \nu)$.
[Figure: input/output registrations; Sinkhorn iterates at iteration 0 and 250.]
Sinkhorn's iterates "propagate" a small-bandwidth kernel.
Automatic differentiation: a game changer for advanced losses and models.
Do not use OT for registration ... but as a loss.
Joint work with J. Feydy, B. Charlier, F.-X. Vialard.
Applications: word mover's distance
Normalized bag-of-words (nBOW); word travel cost (word2vec distance); document distance $\sum_{i,j} T_{ij}\, c(i, j)$: a transportation problem.
[Figure: $\mathrm{dist}(D_1, D_2) = W_2(\mu, \nu)$ between the word clouds $\mu$, $\nu$ of two documents.]
Applications: word mover's distance
$$\min_{\Pi \geq 0}\ \sum_{i,j} \Pi_{ij} c_{ij} \quad \text{s.t.} \quad \sum_{j=1}^{n} \Pi_{ij} = d_i, \qquad \sum_{i=1}^{n} \Pi_{ij} = d'_j.$$
$x_i$: word2vec embedding; $c_{ij} = \| x_i - x_j \|_2$.
If word $i$ appears $w_i$ times in the document, we set $d_i = w_i / \sum_j w_j$.
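A minimal sketch of this transportation problem with POT, using made-up embeddings as stand-ins for word2vec vectors:

import numpy as np
import ot

rng = np.random.default_rng(3)
n1, n2, dim = 6, 8, 50
X1 = rng.normal(size=(n1, dim))    # stand-in word2vec embeddings, document 1
X2 = rng.normal(size=(n2, dim))    # stand-in word2vec embeddings, document 2
w1 = rng.integers(1, 5, size=n1)   # raw word counts
w2 = rng.integers(1, 5, size=n2)
d1 = w1 / w1.sum()                 # nBOW weights d_i = w_i / sum_j w_j
d2 = w2 / w2.sum()
C = ot.dist(X1, X2, metric="euclidean")  # c_ij = ||x_i - x_j||_2
wmd = ot.emd2(d1, d2, C)           # optimal transport cost = word mover's distance
print(wmd)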
Applications: topic models
[Figure: four example topics.] Top-left topic: competitions. Top-right: time. Bottom-left: soccer actions. Bottom-right: drugs.
Applications
[Figure: a source shape and a collection of target shapes $(X_s)_s$; pipeline: geodesic distances → GW distances → MDS visualization in 2-D and 3-D.]
Use $T$ to define a registration between color distributions and shapes.
Discrete OT Review
Given an integer $n > 1$, we write $\Sigma_n$ for the discrete probability simplex
$$\Sigma_n \stackrel{\mathrm{def}}{=} \Big\{ a \in \mathbb{R}^{n}_{+};\ \sum_{i=1}^{n} a_i = 1 \Big\}. \quad (16)$$
Given $a \in \Sigma_m$, $b \in \Sigma_n$, the optimal transport problem is to compute
$$L(a, b, C) \stackrel{\mathrm{def}}{=} \min\Big\{ \sum_{i,j} C_{i,j} P_{i,j};\ P \in U(a, b) \Big\}, \quad (17)$$
where $U(a, b)$ is the set of couplings between $a$ and $b$.
Entropy
The discrete entropy of a positive matrix $P$ (with $\sum_{ij} P_{ij} = 1$) is defined as
$$H(P) \stackrel{\mathrm{def}}{=} -\sum_{i,j} P_{i,j}(\log(P_{i,j}) - 1). \quad (18)$$
For a positive vector $u \in \Sigma_n$, the entropy is defined analogously:
$$H(u) \stackrel{\mathrm{def}}{=} -\sum_{i} u_i(\log(u_i) - 1). \quad (19)$$
For two positive vectors $u, v \in \Sigma_n$, the Kullback-Leibler divergence (or KL divergence) is defined as
$$\mathrm{KL}(u \| v) = -\sum_{i=1}^{n} u_i \log\Big(\frac{v_i}{u_i}\Big). \quad (20)$$
The KL divergence is always nonnegative, $\mathrm{KL}(u \| v) \geq 0$, by Jensen's inequality ($\mathbb{E}[f(Y)] \geq f(\mathbb{E}[Y])$ for convex $f$, applied to $f = -\log$ and $Y = v_i/u_i$).
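A small NumPy sketch of these definitions, checking nonnegativity of the KL divergence on made-up vectors:

import numpy as np

def H(P):
    # Entropy (18)/(19): -sum P * (log P - 1), for strictly positive entries.
    return -np.sum(P * (np.log(P) - 1.0))

def KL(u, v):
    # KL divergence (20): -sum_i u_i log(v_i / u_i).
    return -np.sum(u * np.log(v / u))

u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
print(H(np.outer(u, v)))                         # entropy of the product coupling u b^T
print(KL(u, v) >= 0, np.isclose(KL(u, u), 0.0))  # True True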
Entropic regularization
Given $a \in \Sigma_m$, $b \in \Sigma_n$ and a cost matrix $C \in \mathbb{R}^{m \times n}_{+}$, the entropic regularization of the transportation problem reads
$$L_{\varepsilon}(a, b, C) = \min_{P \in U(a, b)}\ \langle P, C \rangle - \varepsilon H(P). \quad (21)$$
The case $\varepsilon = 0$ corresponds to the classic (linear) optimal transport problem.
For $\varepsilon > 0$, problem (21) has an $\varepsilon$-strongly convex objective and therefore admits a unique optimal solution $P^{\star}_{\varepsilon}$. This is not (necessarily) true for $\varepsilon = 0$. But we have the following proposition.
Entropic regularization
Proposition. When $\varepsilon \to 0$, the unique solution $P_{\varepsilon}$ of (21) converges to the optimal solution with maximal entropy within the set of all optimal solutions of the unregularized transportation problem, namely,
$$P_{\varepsilon} \xrightarrow{\varepsilon \to 0} \operatorname*{argmax}_{P} \{ H(P);\ P \in U(a, b),\ \langle P, C \rangle = L_0(a, b, C) \}. \quad (22)$$
The above proposition motivates us to solve the problems in (21) sequentially and then take $\varepsilon \to 0$.
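This behavior is easy to observe numerically. A sketch with POT, assuming the unregularized plan is unique (the stabilized log-domain solver avoids underflow for small ε):

import numpy as np
import ot

rng = np.random.default_rng(4)
m = n = 8
a = np.full(m, 1.0 / m)
b = np.full(n, 1.0 / n)
C = ot.dist(rng.normal(size=(m, 2)), rng.normal(size=(n, 2)))
P0 = ot.emd(a, b, C)  # unregularized optimal plan
for eps in [1.0, 0.1, 0.01]:
    P_eps = ot.sinkhorn(a, b, C, reg=eps, method="sinkhorn_stabilized")
    print(eps, np.abs(P_eps - P0).max())  # deviation shrinks as eps -> 0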
Entropic regularization
Proof. We consider a sequence $(\varepsilon_{\ell})_{\ell}$ such that $\varepsilon_{\ell} \to 0$ and $\varepsilon_{\ell} > 0$. We denote $P_{\ell} = P^{\star}_{\varepsilon_{\ell}}$. Since $U(a, b)$ is bounded, we can extract a subsequence (that we do not relabel, for simplicity) such that $P_{\ell} \to P^{\star}$. Since $U(a, b)$ is closed, $P^{\star} \in U(a, b)$. We consider any $P$ such that $\langle C, P \rangle = L_0(a, b, C)$. By optimality of $P$ and $P_{\ell}$ for their respective optimization problems (for $\varepsilon = 0$ and $\varepsilon = \varepsilon_{\ell}$), one has
$$0 \leq \langle C, P_{\ell} \rangle - \langle C, P \rangle \leq \varepsilon_{\ell}(H(P_{\ell}) - H(P)). \quad (23)$$
Since $H$ is continuous, taking the limit $\ell \to +\infty$ in this expression shows that $\langle C, P^{\star} \rangle = \langle C, P \rangle$. Furthermore, dividing by $\varepsilon_{\ell}$ and taking the limit shows that $H(P) \leq H(P^{\star})$. The result now follows from the strict concavity of $H$.
Entropic regularization
By the concavity of entropy, for $\alpha > 0$, we introduce the convex set
$$U_{\alpha}(a, b) \stackrel{\mathrm{def}}{=} \{ P \in U(a, b)\ |\ \mathrm{KL}(P \| ab^{\top}) \leq \alpha \} = \{ P \in U(a, b)\ |\ H(P) \geq H(a) + H(b) - 1 - \alpha \}. \quad (24)$$
Definition: Sinkhorn Distance
$$d_{C,\alpha}(a, b) \stackrel{\mathrm{def}}{=} \min_{P \in U_{\alpha}(a, b)} \langle C, P \rangle. \quad (25)$$
Proposition. For $\alpha \geq 0$, $d_{C,\alpha}(a, b)$ is symmetric and satisfies all triangle inequalities. Moreover, $\mathbf{1}_{a \neq b}\, d_{C,\alpha}(a, b)$ satisfies all three distance axioms.
Entropic regularization
Proposition. For $\alpha$ large enough, the Sinkhorn distance $d_{C,\alpha}$ is the transport distance $d_C$.
Proof. Note that for any $P \in U(a, b)$, we have
$$H(P) \geq \frac{1}{2}(H(a) + H(b)), \quad (26)$$
so for $\alpha \geq \frac{1}{2}(H(a) + H(b)) - 1$, we have $U_{\alpha}(a, b) = U(a, b)$.
Sinkhorn's algorithm
For solving (21), consider its Lagrangian
$$\mathcal{L}^{\varepsilon}_{C}(P, \alpha, \beta) = \langle C, P \rangle - \varepsilon H(P) + \alpha^{\top}(P\mathbf{1}_n - a) + \beta^{\top}(P^{\top}\mathbf{1}_m - b). \quad (27)$$
Setting $\partial \mathcal{L}^{\varepsilon}_{C} / \partial p_{ij} = 0$ gives
$$p_{ij} = e^{-\frac{c_{ij} + \alpha_i + \beta_j}{\varepsilon}}, \quad (28)$$
so we can write
$$P_{\varepsilon} = \mathrm{diag}(e^{-\frac{\alpha}{\varepsilon}})\, e^{-\frac{C}{\varepsilon}}\, \mathrm{diag}(e^{-\frac{\beta}{\varepsilon}}). \quad (29)$$
Noting that the constraints
$$P_{\varepsilon}\mathbf{1}_n = a, \qquad P_{\varepsilon}^{\top}\mathbf{1}_m = b \quad (30)$$
must hold, we can then use Sinkhorn's algorithm to find $P_{\varepsilon}$!
Sinkhorn's algorithm
Let $u = e^{-\frac{\alpha}{\varepsilon}}$, $v = e^{-\frac{\beta}{\varepsilon}}$ and $K = e^{-C/\varepsilon}$. We restate the KKT system of (21):
$$P_{\varepsilon} = \mathrm{diag}(u) K \mathrm{diag}(v), \qquad a = \mathrm{diag}(u) K v, \qquad b = \mathrm{diag}(v) K^{\top} u. \quad (31)$$
Sinkhorn's algorithm then amounts to alternating updates of the form
$$u^{(k+1)} = \mathrm{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \mathrm{diag}(K^{\top} u^{(k+1)})^{-1} b. \quad (32)$$
Sinkhorn's algorithm
1. Compute $K = e^{-\frac{C}{\varepsilon}}$.
2. Compute $\widetilde{K} = \mathrm{diag}(a)^{-1} K$.
3. Initialize the scaling factor $u \in \mathbb{R}^m$ (e.g. $u = \mathbf{1}_m$).
4. Iteratively update $u$:
$$u = \mathbf{1} ./ (\widetilde{K}(b ./ (K^{\top} u))),$$
until a stopping criterion is reached.
5. Compute
$$v = b ./ (K^{\top} u),$$
and eventually
$$P_{\varepsilon} = \mathrm{diag}(u) K \mathrm{diag}(v).$$
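A direct NumPy transcription of the steps above; a minimal sketch, without the log-domain stabilization or convergence checks a production implementation would add:

import numpy as np

def sinkhorn(a, b, C, eps, n_iter=1000):
    # Step 1: Gibbs kernel K = exp(-C / eps).
    K = np.exp(-C / eps)
    # Step 2: K_tilde = diag(a)^{-1} K.
    K_tilde = K / a[:, None]
    # Step 3: initialize the scaling factor.
    u = np.ones_like(a)
    # Step 4: fixed-point iteration on u.
    for _ in range(n_iter):
        u = 1.0 / (K_tilde @ (b / (K.T @ u)))
    # Step 5: recover v and the optimal plan P_eps = diag(u) K diag(v).
    v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Usage: P = sinkhorn(a, b, C, eps=0.1); then P.sum(1) ≈ a and P.sum(0) ≈ b.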
Sinkhorn-Newton method
The dual problem of (21) is
$$\min_{\alpha, \beta}\ \langle a, \alpha \rangle + \langle b, \beta \rangle + \varepsilon\, \langle e^{-\frac{\alpha}{\varepsilon}}, K e^{-\frac{\beta}{\varepsilon}} \rangle, \quad (33)$$
with $\alpha, \beta$ the dual variables; its first-order optimality conditions read
$$\mathrm{diag}(e^{-\frac{\alpha}{\varepsilon}}) K e^{-\frac{\beta}{\varepsilon}} = a, \qquad \mathrm{diag}(e^{-\frac{\beta}{\varepsilon}}) K^{\top} e^{-\frac{\alpha}{\varepsilon}} = b.$$
[1] proposes using Newton's method to solve this system.
Sinkhorn-Newton method
Let
$$F(\alpha, \beta) = \begin{pmatrix} \mathrm{diag}(e^{-\frac{\alpha}{\varepsilon}}) K e^{-\frac{\beta}{\varepsilon}} - a \\ \mathrm{diag}(e^{-\frac{\beta}{\varepsilon}}) K^{\top} e^{-\frac{\alpha}{\varepsilon}} - b \end{pmatrix}. \quad (34)$$
We want to find $\alpha, \beta$ such that $F(\alpha, \beta) = 0$, so that
$$P_{\varepsilon} = \mathrm{diag}(e^{-\frac{\alpha}{\varepsilon}})\, e^{-\frac{C}{\varepsilon}}\, \mathrm{diag}(e^{-\frac{\beta}{\varepsilon}}). \quad (35)$$
The Newton iteration is given by
$$\begin{pmatrix} \alpha^{(k+1)} \\ \beta^{(k+1)} \end{pmatrix} = \begin{pmatrix} \alpha^{(k)} \\ \beta^{(k)} \end{pmatrix} - J_F^{-1}(\alpha^{(k)}, \beta^{(k)})\, F(\alpha^{(k)}, \beta^{(k)}), \quad (36)$$
where
$$J_F = \frac{1}{\varepsilon} \begin{pmatrix} \mathrm{diag}(P \mathbf{1}_n) & P \\ P^{\top} & \mathrm{diag}(P^{\top} \mathbf{1}_m) \end{pmatrix}. \quad (37)$$
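A minimal NumPy sketch of the Newton iteration (36). For convenience it works in the sign-flipped potentials f = −α, g = −β, so that P = diag(e^{f/ε}) K diag(e^{g/ε}) and the Jacobian of the marginal map is exactly the block matrix in (37); since that matrix is singular (see the next proposition), the Newton step is computed with a minimum-norm least-squares solve. A robust implementation would add damping or a line search.

import numpy as np

def sinkhorn_newton(a, b, C, eps, n_iter=20):
    # Newton iteration on F(f, g) = (P 1_n - a, P^T 1_m - b),
    # with P = diag(exp(f/eps)) K diag(exp(g/eps)) and f = -alpha, g = -beta.
    m, n = C.shape
    K = np.exp(-C / eps)
    f, g = np.zeros(m), np.zeros(n)
    for _ in range(n_iter):
        P = np.exp(f[:, None] / eps) * K * np.exp(g[None, :] / eps)
        F = np.concatenate([P.sum(1) - a, P.sum(0) - b])
        J = np.block([[np.diag(P.sum(1)), P],
                      [P.T, np.diag(P.sum(0))]]) / eps   # the matrix in (37)
        step = np.linalg.lstsq(J, F, rcond=None)[0]      # min-norm step: J is singular
        f -= step[:m]
        g -= step[m:]
    return np.exp(f[:, None] / eps) * K * np.exp(g[None, :] / eps)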
Sinkhorn-Newton method: Convergence
Proposition. For $\alpha \in \mathbb{R}^m$ and $\beta \in \mathbb{R}^n$, the Jacobian matrix $J_F(\alpha, \beta)$ is symmetric positive semidefinite, and its kernel is given by
$$\ker(J_F(\alpha, \beta)) = \mathrm{span}\Big\{ \begin{pmatrix} \mathbf{1}_m \\ -\mathbf{1}_n \end{pmatrix} \Big\}. \quad (38)$$
Proof. $J_F$ is clearly symmetric. For arbitrary $\gamma \in \mathbb{R}^m$ and $\phi \in \mathbb{R}^n$, one has
$$\begin{pmatrix} \gamma^{\top} & \phi^{\top} \end{pmatrix} J_F \begin{pmatrix} \gamma \\ \phi \end{pmatrix} = \frac{1}{\varepsilon} \sum_{ij} P_{ij} (\gamma_i + \phi_j)^2 \geq 0,$$
which holds with equality if and only if $\gamma_i + \phi_j = 0$ for all $i, j$ (recall $P_{ij} > 0$), leading us to (38).
Sinkhorn-Newton method: Convergence
Lemma. Let $F : D \to \mathbb{R}^n$ be a continuously differentiable mapping with $D \subset \mathbb{R}^n$ open and convex. Suppose that $F'(x)$ is invertible for each $x \in D$, and assume that the following affine covariant Lipschitz condition holds:
$$\| F'(x)^{-1} (F'(y) - F'(x))(y - x) \| \leq \omega \| y - x \|^2 \quad (39)$$
for $x, y \in D$. Let $F(x) = 0$ have a solution $x^*$. For the initial guess $x^{(0)}$, assume that $B(x^*, \|x^{(0)} - x^*\|) \subset D$ and that
$$\omega \| x^{(0)} - x^* \| < 2.$$
Then the ordinary Newton iterates remain in the open ball $B(x^*, \|x^{(0)} - x^*\|)$ and converge to $x^*$ at an estimated quadratic rate
$$\| x^{(k+1)} - x^* \| \leq \frac{\omega}{2} \| x^{(k)} - x^* \|^2. \quad (40)$$
Moreover, the solution $x^*$ is unique in the open ball $B(x^*, 2/\omega)$.
Sinkhorn-Newton method: Convergence
ProofDenote e(k) = x(k) − x∗. Let us prove the lemma by induction:
‖e(k+1)‖ = ‖x(k) − (F′(x(k)))−1(F(x(k) − F(x∗))− x∗‖= ‖e(k) − (F′(x(k)))−1(F(x(k) − F(x∗))‖= ‖(F′(x(k)))−1((F(x∗)− F(x(k))) + F′(x(k))e(k))‖
= ‖(F′(x(k)))−1∫ −1
s=0(F′(x(k) + se(k))− F′(x(k)))e(k) ds‖
≤ ω‖∫ −1
s=0s ds‖e(k)‖2 =
ω
2‖e(k)‖2 < ‖e(k)‖.
(41)
Alsoω‖e(k+1)‖ ≤ ω‖e(k)‖ < 2. (42)
For the uniqueness part, let x(0) = x∗∗ 6= x∗ be a different solution,thenx(1) = x∗∗, then consider (40) when k = 0.
47/59
Sinkhorn-Newton method: Convergence
Proposition
For any k ∈ N with P(k)ε,ij > 0, the affine covariante Lipschitz condition
holds in the `∞-norm for
ω ≤ (e1ε − 1)
(1 + 2e
1ε
max{‖P(k)ε 1n‖∞, ‖(P(k)
ε )>1m‖∞}minij P(k)
ε,ij
)(43)
when ‖y− x‖∞ ≤ 1.
The proof for this proposition is tedious and therefore we refer theinterested readers to the paper [1].
48/59
Relationship with Sinkhorn's algorithm
Recall the KKT system (31) of problem (21), written with $u = e^{-\frac{\alpha}{\varepsilon}}$, $v = e^{-\frac{\beta}{\varepsilon}}$ and $K = e^{-C/\varepsilon}$:
$$P_{\varepsilon} = \mathrm{diag}(u) K \mathrm{diag}(v), \qquad a = \mathrm{diag}(u) K v, \qquad b = \mathrm{diag}(v) K^{\top} u, \quad (44)$$
and the alternating Sinkhorn updates
$$u^{(k+1)} = \mathrm{diag}(K v^{(k)})^{-1} a, \qquad v^{(k+1)} = \mathrm{diag}(K^{\top} u^{(k+1)})^{-1} b. \quad (45)$$
Relationship with Sinkhorn’s algorithm
Define
G(u, v) =
(diag(u)Kv− a
diag(v)K>u− b
). (46)
Process analogously to the Sinkhorn-Newton method we justdiscussed, note that
JG(u, v) =
(diag(Kv) diag(u)K
diag(v)K> diag(K>u)
). (47)
If we neglect the off-diagonal blocks above, i.e.,
JG(u, v) =
(diag(Kv) 0
0 diag(K>u)
), (48)
and perform the Newton iteration(u(k+1)
v(k+1)
)=
(u(k)
v(k)
)− J−1
G (u(k), v(k))G(u(k), v(k)), (49)
50/59
Relationship with Sinkhorn’s algorithm
We getu(k+1) = diag(Kv(k))−1a,
v(k+1) = diag(K>u(k))−1b.(50)
So the Sinkhorn’s algorithm simply approximates one Newton step byneglecting the off-diagonal blocks and replacing u(k) by u(k+1).
51/59
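This identification is easy to verify numerically; a quick sketch on made-up data:

import numpy as np

rng = np.random.default_rng(5)
m = n = 4
a = np.full(m, 1.0 / m); b = np.full(n, 1.0 / n)
K = np.exp(-rng.random((m, n)) / 0.5)
u, v = rng.random(m), rng.random(n)
# One diagonal-Newton step (49) on G(u, v):
G = np.concatenate([u * (K @ v) - a, v * (K.T @ u) - b])
step = G / np.concatenate([K @ v, K.T @ u])  # apply the inverse of (48)
u_new, v_new = u - step[:m], v - step[m:]
# It coincides with the update (50):
print(np.allclose(u_new, a / (K @ v)), np.allclose(v_new, b / (K.T @ u)))  # True True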
Shielding Neighborhood Method
Main features:
- Proposed by Bernhard Schmitzer [2] in 2016
- Exploits a hierarchical multiscale scheme
- Solves a sequence of sparse subproblems instead of one large dense problem
- Any existing discrete solver can be used as an internal black box
- Significant improvements in both runtime and memory requirements have been observed
Shielding Neighborhood Method
Definition. For a subset $\mathcal{N} \subset X \times Y$, we call $\mathcal{N}$ a neighborhood.
In this method we repeatedly consider the problem restricted to $\mathcal{N}$. It is important that $\mathcal{N}$ be small (sparse) compared to $X \times Y$.

Definition (Short-Cut). For a neighborhood $\mathcal{N} \subset X \times Y$ and a coupling $\pi$ with $\mathrm{spt}\,\pi \subset \mathcal{N}$, let $((x_2, y_2), \dots, (x_{n-1}, y_{n-1}))$ be an ordered tuple of pairs in $\mathrm{spt}\,\pi$. We say $((x_2, y_2), \dots, (x_{n-1}, y_{n-1}))$ is a short-cut for $(x_1, y_n) \in X \times Y$ if $(x_i, y_{i+1}) \in \mathcal{N}$ for $i = 1, \dots, n-1$ and
$$c(x_1, y_n) \geq c(x_1, y_2) + \sum_{i=2}^{n-1} \big( c(x_i, y_{i+1}) - c(x_i, y_i) \big). \quad (51)$$
Shielding Neighborhood Method
The goal is to find a clever way to choose a small neighborhood $\mathcal{N}$ such that there is a short-cut for every $(x, y) \notin \mathcal{N}$.

Definition (Shielding Condition). Let $x \in X$, $y \in Y$ and $(x_s, y_s) \in X \times Y$. We say $(x_s, y_s)$ shields $x$ from $y$ when
$$c(x, y) + c(x_s, y_s) > c(x, y_s) + c(x_s, y).$$

Definition (Shielding Neighborhood). For a given coupling $\pi$, we say that a neighborhood $\mathcal{N} \subset X \times Y$, $\mathcal{N} \supset \mathrm{spt}\,\pi$, is a shielding neighborhood if for every pair $(x, y) \notin \mathcal{N}$ there exists some $(x_s, y_s) \in \mathrm{spt}\,\pi$ with $(x_s, y_s) \in \mathcal{N}$ such that $(x_s, y_s)$ shields $x$ from $y$.
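A direct transcription of the shielding condition; a small sketch with a generic cost function (for the squared Euclidean cost the condition reduces to $\langle x_s - x,\ y - y_s \rangle > 0$):

import numpy as np

def shields(c, x_s, y_s, x, y):
    # (x_s, y_s) shields x from y iff c(x,y) + c(xs,ys) > c(x,ys) + c(xs,y).
    return c(x, y) + c(x_s, y_s) > c(x, y_s) + c(x_s, y)

c = lambda p, q: float(np.sum((p - q) ** 2))  # squared Euclidean cost
x, y = np.array([0.0, 0.0]), np.array([3.0, 0.0])
x_s, y_s = np.array([1.0, 0.0]), np.array([2.0, 0.0])
print(shields(c, x_s, y_s, x, y))  # True: (x_s, y_s) lies "between" x and y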
Multiscale Scheme
Definition (Hierarchical Partition and Multiscale Measure Approximation). For a discrete set $X$, a hierarchical partition is an ordered tuple $(X_0, \dots, X_K)$ of partitions of $X$, where $X_0 = \{\{x\} : x \in X\}$ is the trivial partition of $X$ into singletons, and each subsequent level is generated by merging cells from the previous level.
For simplicity, we assume that the coarsest level is the trivial partition into one set: $X_K = \{X\}$. We call $K > 0$ the depth of $X$.
This induces a directed tree graph with vertex set $\cup_{k'=0}^{K} X_{k'}$; for $k \in \{1, \dots, K\}$, we say $x' \in X_j$, $j < k$, is a descendant of $x \in X_k$ when $x' \subset x$. We call $x'$ a child of $x$ if $j = k - 1$, and a leaf if $j = 0$.
Main Algorithm
Assume that we already have two subalgorithms:
- solveLocal: $2^{X \times Y} \to \Pi(\mu, \nu)$ such that for $\mathcal{N} \in 2^{X \times Y}$ the coupling solveLocal($\mathcal{N}$) is locally primal optimal w.r.t. $\mathcal{N}$. When $\mathcal{N}$ is sparse, any discrete OT (DOT) solver can quickly provide an answer.
- shield: $\Pi(\mu, \nu) \to 2^{X \times Y}$ such that for $\pi \in \Pi(\mu, \nu)$ the neighborhood shield($\pi$) is shielding for $\pi$.
Main Algorithm
Algorithm 1: Main algorithm
1. k ← K
2. π ← solveDense(k)    // this can be done immediately
3. while k > 0 do
4.   k ← k − 1
5.   N1 ← {}
6.   for (x, y) ∈ spt π do
7.     N1 ← N1 ∪ (children(x) × children(y))
8.   π ← solveSparse(k, N1)    // solveSparse is presented below
9. return π
solveSparse
solveSparse uses the current scale level k and a feasible neighborhood N1 initialized from the coarser level to compute the global optimizer π and the corresponding neighborhood N at the current hierarchical level.

Algorithm 2: solveSparse
1. Input: current scale level k, initial feasible neighborhood N1
2. i ← 1
3. while i = 1 or C(πi) ≠ C(πi−1) do
4.   πi+1 ← solveLocal(Ni)    // use any DOT solver
5.   Ni+1 ← shield(πi+1, k)   // shield is discussed later
6.   i ← i + 1
7. return πi    // returning Ni is not necessary
References
G. Peyré and M. Cuturi, Computational Optimal Transport, arXiv:1803.00567, 2018. https://github.com/rflamary/POT
[1] C. Brauer, C. Clason, D. Lorenz, and B. Wirth, A Sinkhorn-Newton method for entropic optimal transport, arXiv:1710.06635, 2017.
[2] B. Schmitzer, A sparse multiscale algorithm for dense optimal transport, Journal of Mathematical Imaging and Vision, 56 (2016), pp. 238-259.