
Regularized Optimal Transport is Ground Cost Adversarial

François-Pierre Paty 1 Marco Cuturi 2 1

Abstract

Regularizing the optimal transport (OT) problem has proven crucial for OT theory to impact the field of machine learning. For instance, it is known that regularizing OT problems with entropy leads to faster computations and better differentiation using the Sinkhorn algorithm, as well as better sample complexity bounds than classic OT. In this work we depart from this practical perspective and propose a new interpretation of regularization as a robust mechanism, and show using Fenchel duality that any convex regularization of OT can be interpreted as ground cost adversarial. This incidentally gives access to a robust dissimilarity measure on the ground space, which can in turn be used in other applications. We propose algorithms to compute this robust cost, and illustrate the interest of this approach empirically.

1. Introduction

Optimal transport (OT) has become a generic tool in machine learning, with applications in various domains such as supervised machine learning (Frogner et al., 2015; Abadeh et al., 2015; Courty et al., 2016), graphics (Solomon et al., 2015; Bonneel et al., 2016), imaging (Rabin & Papadakis, 2015; Cuturi & Peyré, 2016), generative models (Arjovsky et al., 2017; Salimans et al., 2018), biology (Hashimoto et al., 2016; Schiebinger et al., 2019) or NLP (Grave et al., 2019; Alaux et al., 2019). The key to using OT in these applications lies in the different forms of regularization of the original OT problem, as introduced in references (Villani, 2009; Santambrogio, 2015). Adding a small convex regularization to the classical linear cost not only helps on the algorithmic side, by convexifying the objective and allowing for faster solvers, but also introduces a regularity trade-off that prevents overfitting on data measures.

1 CREST / ENSAE Paris, Institut Polytechnique de Paris. 2 Google Brain. Correspondence to: François-Pierre Paty <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Regularizing OT  Although entropy-regularized OT is the most studied regularization of OT, due to its algorithmic advantages (Cuturi, 2013), several other convex regularizations of the transport plan have been proposed in the community: quadratically-regularized OT (Essid & Solomon, 2017), OT with capacity constraints (Korman & McCann, 2015), Group-Lasso regularized OT (Courty et al., 2016), OT with Laplacian regularization (Flamary et al., 2014), Tsallis regularized OT (Muzellec et al., 2017), among others. On the other hand, regularizing the dual Kantorovich problem was shown in (Liero et al., 2018) to be equivalent to unbalanced OT, that is, optimal transport with relaxed marginal constraints.

Understanding why regularization helps  The question of understanding why regularizing OT proves critical has triggered several approaches. A compelling reason is statistical: although classical OT suffers from the curse of dimensionality, as its empirical version converges at a rate of order (1/n)^{1/d} (Dudley, 1969; Fournier & Guillin, 2015; Weed et al., 2019), regularized OT and more precisely Sinkhorn divergences have a sample complexity of O(1/√n) (Genevay et al., 2019; Mena & Niles-Weed, 2019). Entropic OT was also shown to perform maximum likelihood estimation in the Gaussian deconvolution model (Rigollet & Weed, 2018). Taking another approach, (Dessein et al., 2018; Blondel et al., 2018) have considered general classes of convex regularizations and characterized them from a more geometrical perspective.

Robustness  Recently, several papers (Genevay et al., 2018; Flamary et al., 2018; Deshpande et al., 2019; Kolouri et al., 2019; Niles-Weed & Rigollet, 2019; Paty & Cuturi, 2019) have proposed to maximize OT with respect to the ground cost, which can in turn be interpreted in light of ground metric learning (Cuturi & Avis, 2014). This approach can also be viewed as an instance of robust optimization (Ben-Tal & Nemirovski, 1998; Ben-Tal et al., 2009; Bertsimas et al., 2011): instead of considering a data-dependent, hence unstable, minimization problem min_x f_θ̂(x) where θ̂ represents the data, the robust optimization literature adversarially chooses the parameters θ in a neighborhood Θ of the data: max_{θ∈Θ} min_x f_θ(x). Continuing along these lines, we make a connection between regularizing and maximizing OT.



Contributions  Our main goal is to provide a novel interpretation of regularized optimal transport in terms of ground cost robustness: regularizing OT amounts to maximizing unregularized OT with respect to the ground cost. Our contributions are:

1. We show that any convex regularization of the transport plan corresponds to ground-cost robustness (§ 3);

2. We reinterpret classical regularizations of OT in the ground-cost adversarial setting (§ 4);

3. We prove, under some technical assumption, a duality theorem for regularized OT, which we use to show that under the same assumption, there exists an optimal adversarial ground cost that is separable (§ 5);

4. We extend ground-cost robustness to the case of more than two measures (§ 6);

5. We propose algorithms to solve the above-mentioned problems (§ 7) and illustrate them on data (§ 8).

2. Background on Optimal Transport and Notations

Let X be a compact Hausdorff space, and define P(X) as the set of Borel probability measures over X. We write C(X) for the set of continuous functions from X to R, endowed with the supremum norm. For φ, ψ ∈ C(X), we write φ⊕ψ ∈ C(X²) for the function φ⊕ψ : (x, y) ↦ φ(x) + ψ(y).

For n ∈ N, we write JnK = {1, ..., n}. All vectors will be denoted with bold symbols. For a Boolean assertion A, we write ι(A) for its indicator function: ι(A) = 0 if A is true and ι(A) = +∞ otherwise.

Kantorovich Formulation of OT  For µ, ν ∈ P(X), we write Π(µ, ν) for the set of couplings

Π(µ, ν) = {π ∈ P(X²) s.t. ∀A, B ⊂ X Borel, π(A×X) = µ(A), π(X×B) = ν(B)}.

For a real-valued continuous function c ∈ C(X²), the optimal transport cost between µ and ν is defined as

T_c(µ, ν) := inf_{π∈Π(µ,ν)} ∫_{X²} c(x, y) dπ(x, y).   (1)

Since c is continuous and X is compact, the infimum in (1) is attained, see Theorem 1.4 in (Santambrogio, 2015). Problem (1) admits the following dual formulation, see Proposition 1.11 and Theorem 1.39 in (Santambrogio, 2015):

T_c(µ, ν) = max_{φ,ψ∈C(X), φ⊕ψ≤c} ∫ φ dµ + ∫ ψ dν.   (2)
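For intuition in the discrete setting, problem (1) is a linear program. The sketch below is our own illustration (not part of the paper): it assumes histograms µ, ν and a cost matrix c, and computes T_c(µ, ν) with SciPy.

```python
# Minimal sketch (ours): discrete OT cost T_c(mu, nu) of eq. (1) as a linear program.
# Assumes mu, nu are nonnegative histograms summing to 1 and c is an (n x m) cost matrix.
import numpy as np
from scipy.optimize import linprog

def ot_cost(mu, nu, c):
    n, m = c.shape
    # Equality constraints: row sums of pi equal mu, column sums equal nu.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j pi_ij = mu_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i pi_ij = nu_j
    res = linprog(c.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)    # T_c(mu, nu) and an optimal plan

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5
    mu = np.full(n, 1.0 / n)
    nu = np.full(n, 1.0 / n)
    x, y = rng.normal(size=(n, 2)), rng.normal(size=(n, 2))
    c = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared Euclidean ground cost
    print(ot_cost(mu, nu, c)[0])
```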

Space of Measures  Since X is compact, the dual space of C(X²) is the set M(X²) of Borel finite signed measures over X². For F : M(X²) → R, we recall that F is Fréchet-differentiable at π if there exists ∇F(π) ∈ C(X²) such that for any h ∈ M(X²), as t → 0,

F(π + th) = F(π) + t ∫ ∇F(π) dh + o(t).

Similarly, G : C(X²) → R is Fréchet-differentiable at c if there exists ∇G(c) ∈ M(X²) such that for any h ∈ C(X²), as t → 0,

G(c + th) = G(c) + t ∫ h d∇G(c) + o(t).

Legendre–Fenchel Transformation  For any functional F : M(X²) → R ∪ {+∞}, we can define its convex conjugate F* : C(X²) → R ∪ {+∞} and biconjugate F** : M(X²) → R ∪ {+∞} as

F*(c) := sup_{π∈M(X²)} ∫ c dπ − F(π),

F**(π) := sup_{c∈C(X²)} ∫ c dπ − F*(c).

F* is always lower semi-continuous (lsc) and convex as the supremum of continuous linear functions.

Specific notations  For F : M(X²) → R ∪ {+∞}, we write dom(F) = {π ∈ M(X²) | F(π) < +∞} for its domain, and will say that F is proper if dom(F) ≠ ∅.

We denote by F the set of proper lsc convex functions F : M(X²) → R ∪ {+∞}, and for µ, ν ∈ P(X), we define the set F(µ, ν) of lsc convex functions that are proper on Π(µ, ν):

F(µ, ν) = {F ∈ F | ∃π ∈ Π(µ, ν), F(π) < +∞}.

3. Ground Cost Adversarial Optimal Transport

3.1. Definition

Instead of considering the classical linear formulation of optimal transport (1), we consider in this paper the following more general nonlinear convex formulation:

Definition 1. Let F ∈ F. For µ, ν ∈ P(X), we define:

W_F(µ, ν) = inf_{π∈Π(µ,ν)} F(π).   (3)

When F(π) = ∫ c dπ, problem (3) corresponds to the classical optimal transport problem defined in (1) and W_F = T_c.


Lemma 1. The infimum in (3) is attained. Moreover, if F ∈ F(µ, ν), then W_F(µ, ν) < +∞.

Proof. We can apply Weierstrass's theorem since Π(µ, ν) is compact and F is lsc by definition. For F ∈ F(µ, ν), there exists π0 ∈ Π(µ, ν) such that F(π0) < +∞, so W_F(µ, ν) ≤ F(π0) < +∞.

The main result of this paper is the following interpretation of problem (3) as a ground-cost adversarial OT problem:

Theorem 1. For µ, ν ∈ P(X) and F ∈ F(µ, ν), minimizing F over Π(µ, ν) is equivalent to the following convex problem:

W_F(µ, ν) = sup_{c∈C(X²)} T_c(µ, ν) − F*(c).   (4)

Proof. Since F is proper, lsc and convex, the Fenchel–Moreau theorem ensures that it is equal to its convex biconjugate F**, so:

min_{π∈Π(µ,ν)} F(π) = min_{π∈Π(µ,ν)} F**(π) = min_{π∈Π(µ,ν)} sup_{c∈C(X²)} ∫ c dπ − F*(c).

Define the objective l(π, c) := ∫ c dπ − F*(c). Since F* is lsc as the convex conjugate of F, for any π ∈ Π(µ, ν), l(π, ·) is usc. It is also concave as the sum of concave functions. Likewise, for any c ∈ C(X²), l(·, c) is continuous and convex (in fact linear). Since Π(µ, ν) and C(X²) are convex, and Π(µ, ν) is compact, we can use Sion's minimax theorem to swap the min and the sup:

min_{π∈Π(µ,ν)} F(π) = sup_{c∈C(X²)} min_{π∈Π(µ,ν)} ∫ c dπ − F*(c) = sup_{c∈C(X²)} T_c(µ, ν) − F*(c).

Finally, c ↦ T_c(µ, ν) − F*(c) is concave since F* is convex and c ↦ T_c(µ, ν) is concave as the minimum of linear functionals.

Remark 1. Note that the inequality

W_F(µ, ν) ≥ sup_{c∈C(X²)} T_c(µ, ν) − F*(c)

in fact holds for any F : M(X²) → R ∪ {+∞}, since F ≥ F** always holds.

The supremum in equation (4) is not necessarily attained. Under some regularity assumption on F, we show that the supremum is attained, and relate the optimal couplings and the optimal ground costs:

Proposition 1. Let µ, ν ∈ P(X) and F ∈ F(µ, ν). Suppose that F is Fréchet-differentiable on Π(µ, ν). Then the supremum in (4) is attained at c? = ∇F(π?) where π? is any minimizer of (3). Conversely, suppose F* is Fréchet-differentiable everywhere. If c? is the unique maximizer in (4), then π? = ∇F*(c?) is a minimizer of (3).

See the proof in the appendix. In Section 5, we will further characterize c? for a certain class of functions F ∈ F.
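As a concrete illustration of Proposition 1 (a worked instance we add here, anticipating the entropic case of Example 1 below, with the normalization used in its proof): for F(π) = ∫ c0 dπ + ε KL(π‖µ⊗ν), one has

∇F(π) = c0 + ε log(dπ/d(µ⊗ν)),

so an optimal adversarial cost is c? = c0 + ε log(dπ?/d(µ⊗ν)), i.e. the prior cost shifted by ε times the log-density of the optimal entropic plan. Since this density has the form exp((φ⊕ψ − c0)/ε), the resulting c? equals φ⊕ψ, which anticipates the separability result of Corollary 2.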

One interesting particular case of Theorem 1 is when the convex cost π ↦ F(π) is a convex regularization of the classical linear optimal transport:

Corollary 1. Let c0 ∈ C(X²), µ, ν ∈ P(X). Let ε > 0 and R ∈ F(µ, ν). Then:

min_{π∈Π(µ,ν)} ∫ c0 dπ + εR(π) = sup_{c∈C(X²)} T_c(µ, ν) − εR*((c − c0)/ε).   (5)

Proof. We apply Theorem 1 with F(π) = ∫ c0 dπ + εR(π), for which we only need to compute the convex conjugate:

F*(c) = sup_{π∈M(X²)} ∫ (c − c0) dπ − εR(π) = ε sup_{π∈M(X²)} ∫ ((c − c0)/ε) dπ − R(π) = εR*((c − c0)/ε).

Corollary 1 shows that the ground cost c0 in regularized optimal transport acts as a prior on the adversarial ground cost. Indeed, in equation (5) the penalization term εR*((c − c0)/ε) forces any optimal adversarial ground cost to be "close" to c0, the closeness being measured in terms of the convex conjugate of the regularization, R*.

Remark 2. We can also consider the minimization of a proper usc concave function F over Π(µ, ν). Since −F ∈ F, by reusing the argument of the proof of Theorem 1 (see a proof in the appendix):

inf_{π∈Π(µ,ν)} F(π) = inf_{c∈C(X²)} T_c(µ, ν) + (−F)*(−c).

Minimizing a concave function of the transport plan π ∈ Π(µ, ν), or equivalently maximizing a convex function of π, amounts to finding a ground cost c ∈ C(X²) that minimizes the transport cost between µ and ν plus a convex penalization on c. Note that this is not a convex problem since the objective is the sum of a concave function and a convex function. When µ and ν are discrete measures, Π(µ, ν) is a finite-dimensional compact polytope, so one of its extreme points has to be a minimizer of F.


In the ground cost maximization problem, the maximization is carried out over all continuous functions c on X², and in particular we do not impose that c takes only nonnegative values. In other words, an optimal adversarial ground cost may take negative values, which prevents us from directly interpreting optimal adversarial ground costs as suitable dissimilarity measures over X. In the following subsection, we impose that c ≥ 0 in the adversarial problem when the space X is discrete, and prove an analogue of Corollary 1.

3.2. Discrete Separable Case

In this subsection, we will focus on the discrete case where the space X = JnK for some n ∈ N. A probability measure µ ∈ P(X) is then a histogram of size n that we will represent by a vector µ ∈ R^n_+ such that Σ_{i=1}^n µ_i = 1. Cost functions c ∈ C(X²) and transport plans π ∈ Π(µ, ν) are now matrices c, π ∈ R^{n×n}.

We focus on regularization functions R that are separable, i.e. of the form

R(π) = Σ_{i=1}^n Σ_{j=1}^n R_ij(π_ij)

for some differentiable convex proper lsc R_ij : R → R.

In applications, it is natural to constrain the adversarial ground cost c ∈ R^{n×n} to take nonnegative entries. Adding this constraint on the adversarial cost corresponds to linearizing "at short range" the regularization R for "small transport values".

Figure 1. The entropy regularization R(x) = x log(x) and its linearized version R̃(x) for small transport values.

Proposition 2. Let ε > 0. For µ, ν ∈ P(X), it holds:

sup_{c∈R^{n×n}_+} T_c(µ, ν) − ε Σ_ij R*_ij((c_ij − c0_ij)/ε) = min_{π∈Π(µ,ν)} ⟨c0, π⟩ + ε Σ_ij R̃_ij(π_ij)   (6)

where R̃_ij : R → R is the continuous convex function defined as

R̃_ij(x) := R_ij(x) if x ≥ R*_ij'(−c0_ij/ε), and R̃_ij(x) := (−c0_ij/ε) x − R*_ij(−c0_ij/ε) otherwise.

Moreover, if R_ij is of class C¹, then R̃_ij is also C¹.

4. Examples

4.1. Ground Cost Adversarial Interpretation of Classical OT Regularizations

As presented in the introduction, several convex regularizations R have been proposed in the literature. We give the ground cost adversarial counterpart for some of them: two examples in the continuous setting, and four p-norm based regularizations in the discrete case.

Example 1 (Entropic Regularization). Let µ, ν ∈ P(X). For π ∈ Π(µ, ν), we define its relative entropy as KL(π‖µ⊗ν) = ∫ log(dπ/d(µ⊗ν)) dπ. Then for c0 ∈ C(X²) and ε > 0, it holds:

min_{π∈Π(µ,ν)} ∫ c0 dπ + ε KL(π‖µ⊗ν) = sup_{c∈C(X²)} T_c(µ, ν) − ε ∫ exp((c − c0)/ε) dµ⊗ν + ε.

Proof. For π ∈ M(X²), let

R(π) = ∫ log(dπ/d(µ⊗ν)) dπ − ∫ dπ + 1 if π ≪ µ⊗ν, and R(π) = +∞ otherwise.

R is convex, and using Proposition 7 in (Feydy et al., 2019),

R*(c) = ∫ (e^c − 1) dµ⊗ν.

Applying Corollary 1 concludes the proof.
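For a discrete sanity check of the left-hand side of Example 1, the following sketch (ours, not from the paper; function and variable names are ours) runs Sinkhorn scaling iterations with the kernel K = (µν^⊤) ⊙ exp(−c0/ε), which solves min_{π∈Π(µ,ν)} ⟨c0, π⟩ + ε KL(π‖µ⊗ν) in the discrete setting.

```python
# Minimal Sinkhorn sketch (ours) for the discrete analogue of Example 1.
# Assumes strictly positive histograms mu, nu and a cost matrix c0.
import numpy as np

def entropic_ot(mu, nu, c0, eps, n_iter=1000):
    K = np.outer(mu, nu) * np.exp(-c0 / eps)   # kernel relative to mu x nu
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):                    # Sinkhorn scaling iterations
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    pi = u[:, None] * K * v[None, :]           # optimal entropic plan
    kl = np.sum(pi * np.log(pi / np.outer(mu, nu)))
    return np.sum(pi * c0) + eps * kl          # primal value of Example 1
```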

Another case of interest is the so-called Subspace Robust Wasserstein distance recently proposed by (Paty & Cuturi, 2019). Here, the set of adversarial metrics is parameterized by a finite-dimensional parameter Ω, which makes it possible to recover an adversarial metric defined on the whole space even when the measures are finitely supported.

Example 2 (Subspace Robust Wasserstein). Let d ∈ N, k ∈ JdK and µ, ν ∈ P(R^d) with finite second-order moments. For π ∈ Π(µ, ν), define V_π = ∫ (x − y)(x − y)^⊤ dπ(x, y) and λ_1(V_π) ≥ ... ≥ λ_d(V_π) its ordered eigenvalues. Then F : π ↦ Σ_{l=1}^k λ_l(V_π) is convex, and

S_k(µ, ν) := min_{π∈Π(µ,ν)} Σ_{l=1}^k λ_l(V_π) = max_{0⪯Ω⪯I, Tr(Ω)=k} T_{d²_Ω}(µ, ν)

where d²_Ω(x, y) = (x − y)^⊤ Ω (x − y) is the squared Mahalanobis distance.

Proof. See Theorem 1 in (Paty & Cuturi, 2019). Note that in this case, X = R^d is not compact. This is not a problem since F* ≡ +∞ outside a compact set, i.e. the set of metrics on which the maximization takes place is compact. Indeed, one can show that:

F*(c) = ι(∃ 0 ⪯ Ω ⪯ I with Tr(Ω) = k s.t. c = d²_Ω).

Let us now consider p-norm based examples, which subsume quadratically-regularized (p = 2) OT studied in (Essid & Solomon, 2017; Lorenz et al., 2019), capacity-constrained (p = +∞) OT proposed by (Korman & McCann, 2015), and Tsallis regularized (p < 0) OT introduced by (Muzellec et al., 2017).

For a matrix w ∈ R^{n×n}_+ with Σ_ij w_ij = n² and π ∈ R^{n×n}, we denote by ‖π‖^p_{w,p} = Σ_ij w_ij |π_ij|^p the w-weighted (powered) p-norm of π. We also write 1/w for the matrix defined by (1/w)_ij = 1/w_ij. In the following, unless otherwise mentioned, we take p, q ∈ [1, +∞] such that 1/p + 1/q = 1, c0 ∈ R^{n×n}, ε > 0.

Example 3 (‖·‖^p_{w,p} Regularization).

min_{π∈Π(µ,ν)} ⟨c0, π⟩ + ε (1/p) ‖π‖^p_{w,p} = sup_{c∈R^{n×n}} T_c(µ, ν) − ε (1/q) ‖(c − c0)/ε‖^q_{1/w^{q−1}, q}.

In particular when p = 2 and w = 1, this corresponds to quadratically-regularized OT studied in (Essid & Solomon, 2017; Lorenz et al., 2019).

We give the details of the (straightforward) computations in the appendix.

Example 4 (‖·‖_{w,p} Penalization).

min_{π∈Π(µ,ν)} ⟨c0, π⟩ + ε ‖π‖_{w,p} = sup_{c∈R^{n×n}, ‖c−c0‖_{1/w,q}≤ε} T_c(µ, ν).

Proof. We apply Corollary 1 with R : R^{n×n} → R defined as R(π) = ‖π‖_{w,p}, for which we need to compute its convex conjugate. We know that the conjugate of ‖·‖_p is ι(‖·‖_q ≤ 1), and using classical results about convex conjugates, ‖·‖*_{w,p} = ι(‖·‖_{1/w,q} ≤ 1).

Example 5 (‖·‖_{w,p} Regularization).

min_{π∈Π(µ,ν), ‖π‖_{w,p}≤ε} ⟨c0, π⟩ = sup_{c∈R^{n×n}} T_c(µ, ν) − ε ‖c − c0‖_{1/w,q}.

In particular when p = +∞ and w = 1, this coincides with capacity-constrained OT proposed by (Korman & McCann, 2015).

Proof. We apply Corollary 1 with R : R^{n×n} → R defined as R(π) = ι(‖π‖_{w,p} ≤ 1), for which we need to compute its convex conjugate. We know that the conjugate of ι(‖·‖_p ≤ 1) is ‖·‖_q, and using classical results about convex conjugates, ι(‖·‖_{w,p} ≤ 1)* = ‖·‖_{1/w,q}.

Example 6 (Tsallis Regularization). For q ∈ (0, 1), the Tsallis regularized OT problem (Muzellec et al., 2017)

min_{π∈Π(µ,ν)} ⟨c0, π⟩ − ε (1/(1−q)) Σ_ij (π_ij^q − π_ij)

is equivalent to

sup_{c∈R^{n×n}, c≤c0} T_c(µ, ν) − ε^{1/(1−q)} (−p)^{−p} ‖1/(c0 − c)‖^{−p}_{−p} + ε/(1−q)

where p < 0 is such that 1/p + 1/q = 1.

We give the details of the computations in the appendix.

4.2. A Link With the Matching Literature in Economics

Maximizing the OT problem with respect to the ground cost has been proposed in the matching literature in economics as a way to recover a ground cost when only a matching is observed, see e.g. (Dupuy & Galichon, 2014; Galichon & Salanié, 2015; Dupuy et al., 2016). In this subsection, we reinterpret their methods by showing that they are equivalent to some regularized OT problems. In other words, instead of interpreting a regularization problem as a robust OT problem as in subsection 4.1, we go the other way around and show that this practical OT maximization problem corresponds to a regularized OT problem.

Practitioners observe two probability measures µ, ν ∈ P(X) (e.g. features from a group of men and a group of women) and a matching π0 ∈ Π(µ, ν) (e.g. dating or marriage data). Under the assumption that the matching is optimal for some criterion, this criterion can be recovered by finding a ground cost c? ∈ C(X²) such that the matching π0 is an optimal transport plan for the cost c?. Then c?(x, y) can be interpreted as the unwillingness of two people with characteristics x and y to be matched.

As shown in Theorem 3 in (Galichon & Salanié, 2015),

sup_{c∈C(X²)} T_c(µ, ν) − ∫ c dπ0 = ι(π0 ∈ Π(µ, ν))   (7)

and if π0 ∈ Π(µ, ν), the supremum is attained at any c? ∈ C(X²) such that π0 is an optimal transport plan for the cost c?. Indeed, the first order condition for the maximization problem and the envelope theorem give the result.

In practice, economists are more interested in discovering which features best explain the observed matching π0. To this end, they choose a parametric model for the cost c, for example a Mahalanobis model c ∈ {d²_Ω : (x, y) ↦ (x − y)^⊤ Ω (x − y) | Ω ⪰ 0, ‖Ω‖ ≤ 1}. More generally, we can rewrite problem (7) as

sup_{c∈C(X²)} T_c(µ, ν) − ∫ c dπ0 − R*(c)   (8)


where R ∈ F is a lsc convex functional, e.g. R*(c) = ι(∃ Ω ⪰ 0, ‖Ω‖ ≤ 1, c = d²_Ω) for the Mahalanobis model.

Using Theorem 1, we can then reinterpret problem (8):

sup_{c∈C(X²)} T_c(µ, ν) − ∫ c dπ0 − R*(c) = min_{π∈Π(µ,ν)} R(π − π0)

where we have used the fact that R** = R. Solving equation (8) amounts to finding a matching π ∈ Π(µ, ν) that is close to the observed matching π0, as measured by R.

5. Characterization of the Adversarial Cost and Duality

Theorem 1 shows that regularizing OT is equivalent to maximizing unregularized OT with respect to the ground cost. This gives access to a robustly computed ground cost c?. In this section, we first prove a duality theorem for problem (3) that we use to further characterize c?. We will first need a technical assumption on F:

Definition 2. Let F ∈ F. We will say that F is separably ∗-increasing if for any φ, ψ ∈ C(X) and any c ∈ C(X²):

φ⊕ψ ≤ c ⇒ F*(φ⊕ψ) ≤ F*(c).   (9)

In particular, if F* is increasing, F is separably ∗-increasing.

Although this condition does not always hold, e.g. in the discrete separable case of Proposition 2 or in the SRW case of Example 2, it is satisfied in several cases of interest, e.g. for the entropic or ‖·‖^p_{w,p} regularizations:

Example 7. For µ, ν ∈ P(X), c0 ∈ C(X²) and ε > 0, the entropy-regularized OT function

F : π ↦ ∫ c0 dπ + ε KL(π‖µ⊗ν)

is separably ∗-increasing.

Proof. As in the proof of Example 1,

F*(c) = ε ∫ (exp((c − c0)/ε) − 1) dµ⊗ν,

which verifies condition (9) as an increasing functional.

Example 8. In the discrete setting X = JnK, let µ, ν ∈ P(X), c0 ∈ R^{n×n}, w ∈ R^{n×n}_+ summing to n². Take p > 1 and ε > 0. With φ_p(x) = x^p if x ≥ 0 and φ_p(x) = +∞ if x < 0, the ‖·‖^p_{w,p}-regularized OT function

F : π ↦ ⟨c0, π⟩ + ε Σ_ij w_ij φ_p(π_ij)

is separably ∗-increasing.

Proof. Note that minimizing F over Π(µ, ν) ⊂ R^{n×n}_+ is equivalent to minimizing F̃ : π ↦ ⟨c0, π⟩ + ε Σ_ij w_ij |π_ij|^p. One can show that, with q > 1 such that 1/p + 1/q = 1 and (x)_+ := max{0, x}:

F*(c) = ε (1/q) ‖(c − c0)_+ / ε‖^q_{1/w^{q−1}, q}

which clearly verifies condition (9).

When F is separably ∗-increasing, we can easily prove a duality theorem for problem (3):

Theorem 2 (W_F duality). Let µ, ν ∈ P(X) and F ∈ F(µ, ν) a separably ∗-increasing function. Then:

W_F(µ, ν) = max_{φ,ψ∈C(X)} ∫ φ dµ + ∫ ψ dν − F*(φ⊕ψ).   (10)

Proof. Using Theorem 1 and Kantorovich duality (2):

W_F(µ, ν) = sup_{c∈C(X²)} T_c(µ, ν) − F*(c)
= sup_{c∈C(X²)} max_{φ,ψ∈C(X), φ⊕ψ≤c} ∫ φ dµ + ∫ ψ dν − F*(c)
= sup_{c∈C(X²)} max_{φ,ψ∈C(X)} ∫ φ dµ + ∫ ψ dν − F*(c) − ι(φ⊕ψ ≤ c)
= max_{φ,ψ∈C(X)} ∫ φ dµ + ∫ ψ dν + sup_{c∈C(X²)} [−F*(c) − ι(φ⊕ψ ≤ c)]
= max_{φ,ψ∈C(X)} ∫ φ dµ + ∫ ψ dν − inf_{c∈C(X²), φ⊕ψ≤c} F*(c).

Since F is separably ∗-increasing, for any φ, ψ ∈ C(X),

inf_{c∈C(X²), φ⊕ψ≤c} F*(c) = F*(φ⊕ψ),

which shows the desired duality result.

Theorem 2 subsumes the already known duality results for entropy-regularized OT and quadratically-regularized OT. It also enables us to characterize the optimal adversarial ground cost when the convex objective F ∈ F is separably ∗-increasing:

Corollary 2. If φ?, ψ? are optimal solutions in (10), the cost φ?⊕ψ? ∈ C(X²) is an optimal adversarial cost in (4).

Proof. For φ, ψ ∈ C(X), note that

T_{φ⊕ψ}(µ, ν) = ∫ φ dµ + ∫ ψ dν.


Then using W_F duality:

W_F(µ, ν) = max_{φ,ψ∈C(X)} ∫ φ dµ + ∫ ψ dν − F*(φ⊕ψ)
= max_{φ,ψ∈C(X)} T_{φ⊕ψ}(µ, ν) − F*(φ⊕ψ)
≤ sup_{c∈C(X²)} T_c(µ, ν) − F*(c)
= W_F(µ, ν)

where we have used Theorem 1 in the last line. This shows that the inequality is in fact an equality, so if φ?, ψ? are optimal dual potentials in (10), φ?⊕ψ? is an optimal adversarial cost in (4).

Corollary 2 is quite striking. Indeed, in the regularized formulation of Corollary 1, any optimal ground cost c? in equation (5) should be close (in the R* sense) to the prior cost c0 because of the penalization term εR*((c − c0)/ε). But under the assumption that F is separably ∗-increasing, we have just shown that regardless of c0, there exists an optimal adversarial ground cost that is separable.

6. Adversarial Ground-Cost for Several Measures

For two measures µ, ν ∈ P(X) and a separably ∗-increasing function F ∈ F(µ, ν), Corollary 2 shows that there exists an optimal adversarial ground cost c? that is separable. This separability, which holds e.g. in the entropic or quadratic case, means that the OT problem for c? is degenerate in the sense that any transport plan is optimal for the cost c?. From a metric learning point of view, c? is not a suitable dissimilarity measure on X. But why limit ourselves to two measures? If we observe N ∈ N measures µ1, ..., µN ∈ P(X), we could look for a ground cost c ∈ C(X²) that is adversarial to all the pairs:

sup_{c∈C(X²)} Σ_{i≠j} T_c(µi, µj) − F*(c)

for some convex regularization F* : C(X²) → R ∪ {+∞}. We will specifically focus on the case where we observe a sequence of measures µ_{1:T} := µ1, ..., µT ∈ P(X), T ≥ 2. When we observe such time-dependent data, we can look for a sequence of adversarial costs c_{1:T−1} := c1, ..., c_{T−1} ∈ C(X²) which is globally adversarial:

Definition 3. For D : C(X²) × C(X²) → R ∪ {+∞} and F_t ∈ F(µ_t, µ_{t+1}), t ∈ JT−1K, we define:

W_{D,F}(µ_{1:T}) := sup_{c_{1:T−1}} Σ_{t=1}^{T−1} T_{c_t}(µ_t, µ_{t+1}) − D(c_t, c_{t+1}) − F*_t(c_t)   (11)

with the convention D(c_{T−1}, c_T) = 0.

In problem (11), D acts as a time-regularization by forcing the adversarial sequence of ground costs to vary "continuously" with time.

Taking inspiration from the Subspace Robust Wasserstein (SRW) distance, we propose, as a particular case of Definition 3, a generalization of SRW to the case of a sequence of measures µ1, ..., µT, T ≥ 2:

Definition 4. Let d ∈ N and k ∈ JdK. Define R_k = {Ω ∈ R^{d×d} | 0 ⪯ Ω ⪯ I, Tr(Ω) = k}. We define the sequential SRW between µ1, ..., µT ∈ P(R^d) as:

T S_{k,η}(µ_{1:T}) := sup_{Ω1,...,Ω_{T−1}∈R_k} Σ_{t=1}^{T−1} T_{d²_{Ω_t}}(µ_t, µ_{t+1}) − η B²(Ω_t, Ω_{t+1})   (12)

where B²(A, B) = Tr(A + B − 2(A^{1/2} B A^{1/2})^{1/2}) is the squared Bures metric (Bures, 1969; Bhatia et al., 2018) on the SDP cone.

Note that problem (12) is convex. If T = 2, the sequential SRW is equal to the usual SRW distance: T S_{k,η}(µ1, µ2) = S_k(µ1, µ2).
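For reference, B² in (12) can be evaluated directly from its definition; a small numerical sketch of ours (assuming symmetric positive semidefinite inputs and using SciPy's matrix square root):

```python
# Squared Bures metric B^2(A, B) = Tr(A + B - 2 (A^{1/2} B A^{1/2})^{1/2}),
# used to regularize the sequence of matrices in (12). Sketch only (ours).
import numpy as np
from scipy.linalg import sqrtm

def bures_sq(A, B):
    A_half = np.real(sqrtm(A))                 # discard numerical imaginary parts
    cross = np.real(sqrtm(A_half @ B @ A_half))
    return np.trace(A + B - 2.0 * cross)

# Example between two random SPD matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4)); A = X @ X.T + 1e-3 * np.eye(4)
Y = rng.normal(size=(4, 4)); B = Y @ Y.T + 1e-3 * np.eye(4)
print(bures_sq(A, B))
```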

7. Algorithms

From now on, we only consider the discrete case X = JnK.

7.1. Projected (Sub)gradient Ascent Solves Nonnegative Adversarial Cost OT

In the setting of subsection 3.2, we propose to run a projected subgradient ascent on the ground cost c ∈ R^{n×n}_+ to solve problem (6). Note that in this case, F(π) := ⟨c0, π⟩ + ε Σ_ij R̃_ij(π_ij) is not separably ∗-increasing, so we can hope that the optimal adversarial ground cost will not be separable.

At each iteration of the ascent, we need to compute a subgradient of g : c ↦ T_c(µ, ν) − εR*((c − c0)/ε), given by Danskin's theorem:

∂g(c) = conv{ π? − ∇R*((c − c0)/ε) | π? ∈ arg min_{π∈Π(µ,ν)} ⟨c, π⟩ }.

Although projected subgradient ascent does converge, having access to gradients instead of subgradients, and hence to more regularity, helps convergence. We therefore propose to replace T_c(µ, ν) by its entropy-regularized version

S^η_c(µ, ν) = min_{π∈Π(µ,ν)} ⟨c, π⟩ + η Σ_ij π_ij(log π_ij − 1)

in the definition of the objective g.


Algorithm 1 Projected (Sub)gradient Ascent for Nonnegative Adversarial Cost

Input: Histograms µ, ν ∈ R^n, learning rate lr
Initialize c ∈ R^{n×n}_+
for i = 0 to MAXITER do
    π? ← OT(µ, ν, cost = c)
    c ← Proj_{R^{n×n}_+}[c + lr π? − lr ∇R*((c − c0)/ε)]
end for

Then g is differentiable, because there exists a unique solution π? in the entropic case (hence ∂g(c) is a singleton). This will also speed up the computation of the gradient at each iteration using Sinkhorn's algorithm. We can interpret this addition of a small entropy term in the adversarial cost formulation as a further regularization of the primal:

Corollary 3. Using the same notations as in Theorem 1, for η ≥ 0:

sup_{c∈R^{n×n}} S^η_c(µ, ν) − F*(c) = min_{π∈Π(µ,ν)} F(π) + η Σ_ij π_ij(log π_ij − 1).

Proof. Let R(π) := Σ_ij π_ij(log π_ij − 1). Then:

sup_{c∈R^{n×n}} S^η_c(µ, ν) − F*(c)
= sup_{c∈R^{n×n}} min_{π∈Π(µ,ν)} ⟨π, c⟩ + ηR(π) − F*(c)
= min_{π∈Π(µ,ν)} ηR(π) + sup_{c∈R^{n×n}} ⟨π, c⟩ − F*(c)
= min_{π∈Π(µ,ν)} ηR(π) + F(π)

where we have used Sion's minimax theorem as in the proof of Theorem 1 to swap the min and the sup, as well as the fact that F = F** given by the Fenchel–Moreau theorem.
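To make this section concrete, here is a minimal sketch (ours, not the authors' implementation) of Algorithm 1 for the linearized entropic case of Section 8.1, where ∇R*(z) = exp(z) elementwise; the inner OT subproblem is solved exactly as a linear program, i.e. this is the subgradient variant.

```python
# Sketch of Algorithm 1 (ours): projected (sub)gradient ascent on a nonnegative
# adversarial cost c, for the entropic case where R*(z) = exp(z) elementwise.
import numpy as np
from scipy.optimize import linprog

def ot_plan(mu, nu, c):
    """Exact OT plan between histograms mu, nu for cost matrix c (LP)."""
    n, m = c.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    res = linprog(c.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(n, m)

def adversarial_cost(mu, nu, c0, eps, lr=0.1, n_iter=200):
    c = np.maximum(c0, 0.0)                     # start in the nonnegative orthant
    for _ in range(n_iter):
        pi = ot_plan(mu, nu, c)                 # best-response transport plan (Danskin)
        grad = pi - np.exp((c - c0) / eps)      # (sub)gradient of the objective
        c = np.maximum(c + lr * grad, 0.0)      # projected ascent step
    return c
```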

7.2. Sinkhorn-like Algorithm for ∗-increasing F ∈ F

If the function F ∈ F is separably ∗-increasing, we can directly write the optimality conditions for the concave dual problem (10):

µ = ∇F*(φ? ⊕ ψ?) 1   (13)
ν = ∇F*(φ? ⊕ ψ?)^⊤ 1   (14)

where 1 is the vector of all ones. We can then alternate between fixing ψ and solving for φ in (13), and fixing φ and solving for ψ in (14). In the case of entropy-regularized OT, this is equivalent to Sinkhorn's algorithm. In quadratically-regularized OT, this is equivalent to the alternate minimization proposed by (Blondel et al., 2018). We give the detailed derivation of these facts in the appendix.

7.3. Coordinate Ascent for Sequential SRW

Problem (12) is a globally convex problem in Ω1, ..., Ω_{T−1}. We propose to run a randomized coordinate ascent on the concave objective, i.e. to select τ ∈ JT−1K randomly at each iteration and to take a gradient step on Ω_τ. We need to compute a subgradient of the objective h : Ω_τ ↦ Σ_{t=1}^{T−1} T_{d²_{Ω_t}}(µ_t, µ_{t+1}) − ηB²(Ω_t, Ω_{t+1}), given by:

∇h(Ω_τ) = V(π_τ?) − η ∂_1 B²(Ω_τ, Ω_{τ+1}) − η ∂_2 B²(Ω_{τ−1}, Ω_τ)   (15)

where π ↦ V(π) is defined in Example 2, π_τ? ∈ R^{n×n} is any optimal transport plan between µ_τ and µ_{τ+1} for the cost d²_{Ω_τ}, and ∂_1 B², ∂_2 B² are the gradients of the squared Bures metric with respect to the first and second arguments, computed e.g. in (Muzellec & Cuturi, 2018).

Algorithm 2 Randomized (Block) Coordinate Ascent for Sequential SRW

Input: Measures µ1, ..., µT ∈ P(R^d), dimension k, learning rate lr
Initialize Ω1, ..., Ω_{T−1} ∈ R^{d×d}
for i = 0 to MAXITER do
    Draw τ ∈ JT−1K
    π_τ? ← OT(µ_τ, µ_{τ+1}, cost = d²_{Ω_τ})
    Ω_τ ← Proj_{R_k}[Ω_τ + lr ∇h(Ω_τ)] using (15)
end for

8. Experiments

8.1. Linearized Entropy-Regularized OT

We consider the entropy-regularized OT problem in the discrete setting:

S^ε_{c0}(µ, ν) = min_{π∈Π(µ,ν)} ⟨c0, π⟩ + εR(π)

where c0 ∈ R^{n×n} and R : π ↦ Σ_ij π_ij(log π_ij − 1).

Since R is separable, we can constrain the associated adversarial cost to be nonnegative by linearizing the entropic regularization. By Proposition 2, this amounts to solving

sup_{c∈R^{n×n}_+} T_c(µ, ν) − ε Σ_ij exp((c_ij − c0_ij)/ε) = min_{π∈Π(µ,ν)} ⟨c0, π⟩ + ε Σ_ij R̃_ij(π_ij)   (16)

where R̃_ij : R → R is defined as

R̃_ij(x) := x(log x − 1) if x ≥ exp(−c0_ij/ε), and R̃_ij(x) := (−c0_ij/ε) x − exp(−c0_ij/ε) otherwise.
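For completeness, this linearized regularizer replaces the entropy below the threshold exp(−c0_ij/ε) by its tangent line, which keeps it convex and C¹. A short elementwise sketch (ours):

```python
# Sketch (ours): elementwise linearized entropy R~_ij of Section 8.1.
# Below the threshold exp(-c0_ij/eps), x(log x - 1) is replaced by its
# tangent line of slope -c0_ij/eps.
import numpy as np

def linearized_entropy(x, c0, eps):
    threshold = np.exp(-c0 / eps)
    safe_x = np.where(x > 0, x, 1.0)                       # avoid log(0) warnings
    entropy = np.where(x > 0, x * (np.log(safe_x) - 1.0), 0.0)
    tangent = (-c0 / eps) * x - threshold
    return np.where(x >= threshold, entropy, tangent)
```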


We first consider N = 100 pairs of measures (µi, νi) in dimension d = 1000, each measure being a uniform measure on n = 100 samples from a Gaussian distribution with covariance matrix drawn from a Wishart distribution with k = d degrees of freedom. For each pair, we run Algorithm 1 to solve problem (16). This gives an adversarial cost c?_ε. We plot in Figure 2 the mean value of |W_ε − T_{‖·‖²}(µi, νi)| depending on ε, for W_ε equal to T_{c?_ε}(µi, νi), S^ε(µi, νi) and the value of (16). For small values of ε, all three values converge to the real Wasserstein distance. For large ε, Sinkhorn stabilizes to the MMD (Genevay et al., 2016) while the robust cost goes to 0 (since the adversarial cost goes to 0).

Figure 2. Mean value (over 100 runs) of the difference between the classical (2-Wasserstein) OT cost W and Sinkhorn S^ε (orange dashed), the OT cost with adversarial nonnegative cost T_{c_ε} (blue line) and the value of problem (16), T_{c_ε} − εR̃ (green dot-dashed), depending on ε. The shaded areas represent the min-max, 10%-90% and 25%-75% percentiles, and appear negligible except for numerical errors.

In Figure 3, we visualize the effect of the regularization ε on the ground cost c?_ε itself, for measures µ, ν plotted in Figure 3a. We use multidimensional scaling on the adversarial cost matrix c?_ε (with distances between points from the same measure unchanged) to recover points in R². For large values of ε, the adversarial cost goes to 0, which corresponds in the primal to a fully diffusive transport plan π = µν^⊤.

Figure 3. Effect of the regularization strength on the metric: (a) original points; (b) ε = 0.01; (c) ε = 2. As ε grows, the associated adversarial cost shrinks the distances.

8.2. Learning a Metric on the Color Space

We consider 20 measures (µi)_{i=1,...,10}, (νj)_{j=1,...,10} on the red-green-blue color space identified with X = [0, 1]³. Each measure is a point cloud corresponding to the colors used in a painting, divided into two types: ten portraits by Modigliani (µi, i ∈ M) and ten by Schiele (νj, j ∈ S). As in the SRW and sequential SRW formulations, we learn a metric c_Ω ∈ C(X²), parameterized by a matrix 0 ⪯ Ω ⪯ I such that Tr Ω = 1, that best separates the Modiglianis from the Schieles:

Ω? ∈ arg max_{Ω∈R_1} Σ_{i∈M} Σ_{j∈S} T_{d²_Ω}(µi, νj).

We compute Ω? using projected SGD. We then use this "one-dimensional" metric d²_{Ω?} as a ground metric for OT-based color transfer (Rabin et al., 2014): an optimal transport plan π between two color palettes µi, νj gives a way to transfer colors from one painting to the other. Visually, transferring the colors using the classical quadratic cost ‖·‖² or the adversarially-learnt one-dimensional metric d²_{Ω?} makes no major difference, showing that when regularized, OT can extract sufficient information from lower dimensional representations.

Figure 4. Color transfer, best zoomed in. (a) and (b): Original paintings by Modigliani and Schiele. (c): Schiele's painting with Modigliani's colors, using the learnt adversarial one-dimensional metric d²_{Ω?}. (d): Schiele's painting with Modigliani's colors, using the (three-dimensional) Euclidean metric ‖·‖².

9. Conclusion

In this paper, we have shown that any convex regularization of optimal transport can be recast as a ground cost adversarial problem. Under some technical assumption on the regularization, we proved a duality theorem for regularized OT, which we used to characterize the optimal ground cost as a separable function of its two arguments. In order to overcome this degeneracy, we proposed to constrain the robust ground cost to take nonnegative values. We also proposed a framework to learn a sequence of ground costs which is adversarial to a time-varying sequence of measures. Future work includes learning a continuous adversarial cost c_θ parameterized by a neural network, under some regularity constraints (e.g. c_θ is Lipschitz). On the application side, learning low-dimensional representations of time-evolving data could be applied in biology as a refinement of the methodology of (Schiebinger et al., 2019).

Acknowledgements  We acknowledge the support of a "Chaire d'excellence de l'IDEX Paris Saclay". We would like to thank Boris Muzellec and Théo Lacombe for fruitful discussions and relevant remarks.


References

Abadeh, S. S., Esfahani, P. M. M., and Kuhn, D. Distributionally robust logistic regression. In Advances in Neural Information Processing Systems, pp. 1576–1584, 2015.
Alaux, J., Grave, E., Cuturi, M., and Joulin, A. Unsupervised hyper-alignment for multilingual word embeddings. In International Conference on Learning Representations, 2019.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. Proceedings of the 34th International Conference on Machine Learning, 70:214–223, 2017.
Ben-Tal, A. and Nemirovski, A. Robust convex optimization. Mathematics of Operations Research, 23(4):769–805, 1998.
Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization, volume 28. Princeton University Press, 2009.
Bertsimas, D., Brown, D. B., and Caramanis, C. Theory and applications of robust optimization. SIAM Review, 53(3):464–501, 2011.
Bhatia, R., Jain, T., and Lim, Y. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, to appear, 2018.
Blondel, M., Seguy, V., and Rolet, A. Smooth and sparse optimal transport. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 880–889, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/blondel18a.html.
Bonneel, N., Peyré, G., and Cuturi, M. Wasserstein barycentric coordinates: histogram regression using optimal transport. ACM Transactions on Graphics, 35(4):71:1–71:10, 2016.
Bures, D. An extension of Kakutani's theorem on infinite product measures to the tensor product of semifinite w*-algebras. Transactions of the American Mathematical Society, 135:199–212, 1969.
Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.
Cuturi, M. Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pp. 2292–2300, 2013.
Cuturi, M. and Avis, D. Ground metric learning. Journal of Machine Learning Research, 15:533–564, 2014.
Cuturi, M. and Peyré, G. A smoothed dual approach for variational Wasserstein problems. SIAM Journal on Imaging Sciences, 9(1):320–343, 2016.
Deshpande, I., Hu, Y.-T., Sun, R., Pyrros, A., Siddiqui, N., Koyejo, S., Zhao, Z., Forsyth, D., and Schwing, A. G. Max-sliced Wasserstein distance and its use for GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10648–10656, 2019.
Dessein, A., Papadakis, N., and Rouas, J.-L. Regularized optimal transport and the ROT mover's distance. The Journal of Machine Learning Research, 19(1):590–642, 2018.
Dudley, R. M. The speed of mean Glivenko–Cantelli convergence. Annals of Mathematical Statistics, 40(1):40–50, 1969.
Dupuy, A. and Galichon, A. Personality traits and the marriage market. Journal of Political Economy, 122(6):1271–1319, 2014.
Dupuy, A., Galichon, A., and Sun, Y. Estimating matching affinity matrix under low-rank constraints. arXiv:1612.09585, 2016.
Essid, M. and Solomon, J. Quadratically-regularized optimal transport on graphs. arXiv preprint arXiv:1704.08200, 2017.
Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S.-i., Trouvé, A., and Peyré, G. Interpolating between optimal transport and MMD using Sinkhorn divergences. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89, pp. 2681–2690. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/feydy19a.html.
Flamary, R., Courty, N., Rakotomamonjy, A., and Tuia, D. Optimal transport with Laplacian regularization. In NIPS 2014, Workshop on Optimal Transport and Machine Learning, 2014.
Flamary, R., Cuturi, M., Courty, N., and Rakotomamonjy, A. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.
Fournier, N. and Guillin, A. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.
Frogner, C., Zhang, C., Mobahi, H., Araya, M., and Poggio, T. A. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pp. 2053–2061, 2015.


Galichon, A. and Salanié, B. Cupid's invisible hand: Social surplus and identification in matching models. Available at SSRN 1804623, 2015.
Genevay, A., Cuturi, M., Peyré, G., and Bach, F. Stochastic optimization for large-scale optimal transport. In Advances in Neural Information Processing Systems, pp. 3440–3448, 2016.
Genevay, A., Peyré, G., and Cuturi, M. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pp. 1608–1617, 2018.
Genevay, A., Chizat, L., Bach, F., Cuturi, M., and Peyré, G. Sample complexity of Sinkhorn divergences. In Chaudhuri, K. and Sugiyama, M. (eds.), Proceedings of Machine Learning Research, volume 89, pp. 1574–1583. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/genevay19a.html.
Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with Wasserstein Procrustes. 2019.
Hashimoto, T., Gifford, D., and Jaakkola, T. Learning population-level diffusions with generative RNNs. In International Conference on Machine Learning, pp. 2417–2426, 2016.
Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., and Rohde, G. Generalized sliced Wasserstein distances. In Advances in Neural Information Processing Systems, pp. 261–272, 2019.
Korman, J. and McCann, R. Optimal transportation with capacity constraints. Transactions of the American Mathematical Society, 367(3):1501–1521, 2015.
Liero, M., Mielke, A., and Savaré, G. Optimal entropy-transport problems and a new Hellinger–Kantorovich distance between positive measures. Inventiones Mathematicae, 211(3):969–1117, 2018.
Lorenz, D. A., Manns, P., and Meyer, C. Quadratically regularized optimal transport. arXiv preprint arXiv:1903.01112, 2019.
Mena, G. and Niles-Weed, J. Statistical bounds for entropic optimal transport: sample complexity and the central limit theorem. In Advances in Neural Information Processing Systems, pp. 4543–4553, 2019.
Muzellec, B. and Cuturi, M. Generalizing point embeddings using the Wasserstein space of elliptical distributions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 10258–10269. Curran Associates, Inc., 2018.
Muzellec, B., Nock, R., Patrini, G., and Nielsen, F. Tsallis regularized optimal transport and ecological inference. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
Niles-Weed, J. and Rigollet, P. Estimation of Wasserstein distances in the spiked transport model. arXiv preprint arXiv:1909.07513, 2019.
Paty, F.-P. and Cuturi, M. Subspace robust Wasserstein distances. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 5072–5081, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/paty19a.html.
Rabin, J. and Papadakis, N. Convex color image segmentation with optimal transport distances. In International Conference on Scale Space and Variational Methods in Computer Vision, pp. 256–269. Springer, 2015.
Rabin, J., Ferradans, S., and Papadakis, N. Adaptive color transfer with relaxed optimal transport. In 2014 IEEE International Conference on Image Processing (ICIP), pp. 4852–4856. IEEE, 2014.
Rigollet, P. and Weed, J. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11-12):1228–1235, 2018.
Salimans, T., Zhang, H., Radford, A., and Metaxas, D. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkQkBnJAb.
Santambrogio, F. Optimal Transport for Applied Mathematicians. Birkhäuser, 2015.
Schiebinger, G., Shu, J., Tabaka, M., Cleary, B., Subramanian, V., Solomon, A., Gould, J., Liu, S., Lin, S., Berube, P., et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.
Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional Wasserstein distances: efficient optimal transportation on geometric domains. ACM Transactions on Graphics, 34(4):66:1–66:11, 2015.
Villani, C. Optimal Transport: Old and New, volume 338. Springer Verlag, 2009.
Weed, J., Bach, F., et al. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.


A. Proofs

A.1. Proof for Proposition 1

Proof. Let π? be a minimizer of (3). Then using the optimality condition for sup_{c∈C(X²)} ∫ c dπ − F*(c), any c such that π? ∈ ∂F*(c) is a best response to π?. But by the Fenchel–Young inequality, such c are exactly those in ∂F(π?) = {∇F(π?)}. Since ∇F(π?) is the unique best response to π?, it is necessarily optimal in (4). Conversely, if there is a unique maximizer c?, then as a result of the above, c? = ∇F(π?) for some minimizer π? of the primal. Then ∇F*(c?) is optimal in the primal.

A.2. Proof for Remark 2

Proof. As in the proof of Theorem 1:

infπ∈Π(µ,ν)

F (π) = infπ∈Π(µ,ν)

−(−F )∗∗(π)

= infπ∈Π(µ,ν)

− supc∈C(X 2)

∫c dπ − (−F )∗(c)

= infπ∈Π(µ,ν)

infc∈C(X 2)

∫−c dπ + (−F )∗(c)

= infπ∈Π(µ,ν)

infc∈C(X 2)

∫c dπ + (−F )∗(−c)

= infc∈C(X 2)

Tc(µ, ν) + (−F )∗(−c).

A.3. Proof for Proposition 2

Proof. As in the proof of Theorem 1, we use Sion's minimax theorem to get

sup_{c∈R^{n×n}_+} min_{π∈Π(µ,ν)} ⟨c, π⟩ − ε Σ_ij R*_ij((c_ij − c0_ij)/ε)
= min_{π∈Π(µ,ν)} sup_{c∈R^{n×n}_+} ⟨c, π⟩ − ε Σ_ij R*_ij((c_ij − c0_ij)/ε).

Since the optimization in c ∈ R^{n×n}_+ is separable, we only need to consider this optimization coordinate by coordinate, i.e. we only need to compute sup_{c_ij∈R_+} π_ij c_ij − f*_ij(c_ij) for all i, j ∈ JnK, where f*_ij(c_ij) = εR*_ij((c_ij − c0_ij)/ε).

Fix π ∈ Π(µ, ν) and i, j ∈ JnK, and define g_ij : R ∋ c_ij ↦ π_ij c_ij − f*_ij(c_ij).

Suppose that z_ij = f'_ij(π_ij) ≥ 0. Then

f_ij(π_ij) = f**_ij(π_ij) = g_ij(z_ij) = sup_{c_ij∈R} g_ij(c_ij),

and since z_ij ≥ 0, sup_{c_ij∈R_+} g_ij(c_ij) = f_ij(π_ij). This means that R̃_ij(π_ij) = R_ij(π_ij).

Suppose now that z_ij = f'_ij(π_ij) < 0. This means that

sup_{c_ij∈R_+} g_ij(c_ij) < sup_{c_ij∈R} g_ij(c_ij).

Since g_ij is concave, this shows that sup_{c_ij∈R_+} g_ij(c_ij) = g_ij(0) = −f*_ij(0), i.e. R̃_ij(π_ij) = (−c0_ij/ε) π_ij − R*_ij(−c0_ij/ε).

Since R_ij is convex, R'_ij is increasing with pseudo-inverse R*_ij'. Furthermore, the optimality condition in the convex conjugate problem gives, for any α ∈ R:

R*_ij(α) = α R*_ij'(α) − R_ij(R*_ij'(α)).

So if R_ij is of class C¹, taking α = −c0_ij/ε, as x increases to R*_ij'(−c0_ij/ε),

R̃_ij(x) = (−c0_ij/ε) x − R*_ij(−c0_ij/ε) → R_ij(R*_ij'(−c0_ij/ε)),

meaning that R̃_ij is of class C¹.

A.4. Proof for Example 3

Proof. We denote by sgn(x) the set {+1} if x > 0, {−1} if x < 0 and [−1, 1] if x = 0. We apply Corollary 1 with R : R^{n×n} → R defined as R(π) = (1/p) ‖π‖^p_{w,p}, for which we need to compute its convex conjugate:

R*(c) = sup_{π∈R^{n×n}} ⟨π, c⟩ − (1/p) Σ_ij w_ij |π_ij|^p.

Subdifferentiating with respect to π_ij:

c_ij ∈ (1/p) w_ij ∂_{π_ij} |π_ij|^p = w_ij sgn(π_ij) |π_ij|^{p−1}.

This implies that sgn(π_ij) = sgn(c_ij), so:

π_ij = sgn(c_ij) |c_ij/w_ij|^{q−1}.

Finally,

R*(c) = Σ_ij [ c_ij sgn(c_ij) |c_ij/w_ij|^{q−1} − (1/p) w_ij |c_ij/w_ij|^q ]
= (1/q) Σ_ij |c_ij|^q / w_ij^{q−1}
= (1/q) ‖c‖^q_{1/w^{q−1}, q}.


A.5. Proof for Example 6

Proof. Since π ∈ Π(µ, ν), Σ_ij π_ij = 1, so we can drop it for now and only consider the term R(π) = (1/(q−1)) ‖π‖^q_q, which is separable in the coordinates of π:

R(π) = Σ_ij f(π_ij)

where we have defined the convex function

f(x) = (1/(q−1)) x^q if x ≥ 0, and f(x) = +∞ otherwise.

We compute its convex conjugate:

f*(y) = sup_{x≥0} xy − (1/(q−1)) x^q = (y/p)^p if y ≤ 0, and f*(y) = +∞ if y > 0,

where p = q/(q−1) ≤ 0 is such that 1/p + 1/q = 1. Then R*(c) = +∞ if c has a positive entry, and over R^{n×n}_−:

R*(c) = Σ_ij f*(c_ij) = Σ_ij (c_ij/p)^p = Σ_ij ((−c_ij)/(−p))^p = (−p)^{−p} Σ_ij (1/(−c_ij))^{−p}.

Adding the term ε/(1−q) we left aside to the result of Corollary 1, we find that Tsallis regularized OT is equal to:

sup_{c∈R^{n×n}} T_c(µ, ν) − εR*((c − c0)/ε) + ε/(1−q)
= sup_{c∈R^{n×n}, c≤c0} T_c(µ, ν) − ε(−p)^{−p} Σ_ij [ε/(c0_ij − c_ij)]^{−p} + ε/(1−q)
= sup_{c∈R^{n×n}, c≤c0} T_c(µ, ν) − ε^{1/(1−q)} (−p)^{−p} Σ_ij [1/(c0_ij − c_ij)]^{−p} + ε/(1−q)
= sup_{c∈R^{n×n}, c≤c0} T_c(µ, ν) − ε^{1/(1−q)} (−p)^{−p} ‖1/(c0 − c)‖^{−p}_{−p} + ε/(1−q).

A.6. Proof for Subsection 7.2

Entropic OT  In the case of entropic OT,

F(π) = ⟨π, c0⟩ + ε Σ_ij π_ij[log π_ij − 1],

so

F*(c) = ε Σ_ij exp((c_ij − c0_ij)/ε)

and

∇F*(c) = [exp((c_ij − c0_ij)/ε)]_ij.

Then the system of equations (13)-(14) is:

∀i, µ_i = Σ_j exp((φ?_i + ψ?_j − c0_ij)/ε) = exp(φ?_i/ε) [K exp(ψ?/ε)]_i
∀j, ν_j = Σ_i exp((φ?_i + ψ?_j − c0_ij)/ε) = exp(ψ?_j/ε) [K^⊤ exp(φ?/ε)]_j

where K = exp(−c0/ε) ∈ R^{n×n} and exp is taken elementwise. Then solving alternatively for φ and ψ is exactly Sinkhorn's algorithm.

Quadratic OT  In the case of quadratic OT, using the notations and results from Example 8:

F(π) = ⟨π, c0⟩ + ε Σ_ij φ_2(π_ij),

and

F*(c) = (1/(2ε)) Σ_ij [(c_ij − c0_ij)_+]².

Then:

∇F*(c) = (1/ε) (c − c0)_+.

The system of equations (13)-(14) is:

∀i, εµ_i = Σ_j (φ?_i + ψ?_j − c0_ij)_+
∀j, εν_j = Σ_i (φ?_i + ψ?_j − c0_ij)_+

which is what (Blondel et al., 2018) solve in their appendix B.