Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic...

47
Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0205 [email protected], [email protected] Distributionally robust stochastic optimization (DRSO) is an approach to optimization under uncertainty in which, instead of assuming that there is a known true underlying probability distribution, one hedges against a chosen set of distributions. In this paper we first point out that the set of distributions should be chosen to be appropriate for the application at hand, and that some of the choices that have been popular until recently are, for many applications, not good choices. We consider sets of distributions that are within a chosen Wasserstein distance from a nominal distribution, for example an empirical distribution resulting from available data. The paper points out that such a choice of sets has two advantages: (1) The resulting distributions hedged against are more reasonable than those resulting from other popular choices of sets. (2) The problem of determining the worst-case expectation over the resulting set of distributions has desirable tractability properties. We derive a dual reformulation of the corresponding DRSO problem and construct approximate worst-case distributions (or an exact worst-case distribution if it exists) explicitly via the first- order optimality conditions of the dual problem. Our contributions are five-fold. (i) We identify necessary and sufficient conditions for the existence of a worst-case distribution, which are naturally related to the growth rate of the objective function. (ii) We show that the worst-case distributions resulting from an appropriate Wasserstein distance have a concise structure and a clear interpretation. (iii) Using this structure, we show that data-driven DRSO problems can be approximated to any accuracy by robust optimization problems, and thereby many DRSO problems become tractable by using tools from robust optimization. (iv) To the best of our knowledge, our proof of strong duality is the first constructive proof for DRSO problems, and we show that the constructive proof technique is also useful in other contexts. (v) Our strong duality result holds in a very general setting, and we show that it can be applied to infinite dimensional process control problems and worst-case value-at-risk analysis. Key words : distributionally robust optimization; data-driven decision-making; ambiguity set; worst-case distribution MSC2000 subject classification : Primary: 90C15; secondary: 90C46 OR/MS subject classification : Primary: programming: stochastic 1. Introduction In decision making problems under uncertainty, a decision maker wants to choose a decision x from a feasible region X . The objective function Ψ : X × Ξ R depends on a quantity ξ Ξ whose value is not known to the decision maker at the time that the decision has to be made. In some settings it is reasonable to assume that ξ is a random element with distribution μ supported on Ξ, for example, if multiple realizations of ξ will be encountered. In such settings, the decision making problems can be formulated as stochastic optimization problems as follows: inf xX E μ [Ψ(x, ξ )]. We refer to Shapiro et al. [40] for a thorough study of stochastic optimization. One major criticism of the formulation above for practical applications is the requirement that the underlying distribution μ be known to the decision maker. 
Even if multiple realizations of ξ are observed, μ still may not be known exactly, while use of a distribution different from μ may sometimes result in bad decisions. Another major criticism is that in many applications there are not multiple realizations of ξ that will be encountered, for example in problems involving events that may either happen once or not happen at all, and thus the notion of a “true” underlying distribution does not apply. These criticisms motivate the notion of distributionally robust stochastic optimization (DRSO), that does 1

Transcript of Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic...

Page 1: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

Distributionally Robust Stochastic Optimization withWasserstein Distance

Rui Gao, Anton J. KleywegtSchool of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0205

[email protected], [email protected]

Distributionally robust stochastic optimization (DRSO) is an approach to optimization under uncertaintyin which, instead of assuming that there is a known true underlying probability distribution, one hedgesagainst a chosen set of distributions. In this paper we first point out that the set of distributions should bechosen to be appropriate for the application at hand, and that some of the choices that have been popularuntil recently are, for many applications, not good choices. We consider sets of distributions that are withina chosen Wasserstein distance from a nominal distribution, for example an empirical distribution resultingfrom available data. The paper points out that such a choice of sets has two advantages: (1) The resultingdistributions hedged against are more reasonable than those resulting from other popular choices of sets.(2) The problem of determining the worst-case expectation over the resulting set of distributions has desirabletractability properties. We derive a dual reformulation of the corresponding DRSO problem and constructapproximate worst-case distributions (or an exact worst-case distribution if it exists) explicitly via the first-order optimality conditions of the dual problem. Our contributions are five-fold. (i) We identify necessary andsufficient conditions for the existence of a worst-case distribution, which are naturally related to the growthrate of the objective function. (ii) We show that the worst-case distributions resulting from an appropriateWasserstein distance have a concise structure and a clear interpretation. (iii) Using this structure, we showthat data-driven DRSO problems can be approximated to any accuracy by robust optimization problems,and thereby many DRSO problems become tractable by using tools from robust optimization. (iv) To thebest of our knowledge, our proof of strong duality is the first constructive proof for DRSO problems, andwe show that the constructive proof technique is also useful in other contexts. (v) Our strong duality resultholds in a very general setting, and we show that it can be applied to infinite dimensional process controlproblems and worst-case value-at-risk analysis.

Key words : distributionally robust optimization; data-driven decision-making; ambiguity set; worst-casedistribution

MSC2000 subject classification : Primary: 90C15; secondary: 90C46OR/MS subject classification : Primary: programming: stochastic

1. Introduction In decision making problems under uncertainty, a decision maker wants tochoose a decision x from a feasible region X. The objective function Ψ :X ×Ξ→R depends on aquantity ξ ∈Ξ whose value is not known to the decision maker at the time that the decision has tobe made. In some settings it is reasonable to assume that ξ is a random element with distributionµ supported on Ξ, for example, if multiple realizations of ξ will be encountered. In such settings,the decision making problems can be formulated as stochastic optimization problems as follows:

infx∈X

Eµ[Ψ(x, ξ)].

We refer to Shapiro et al. [40] for a thorough study of stochastic optimization. One major criticism ofthe formulation above for practical applications is the requirement that the underlying distributionµ be known to the decision maker. Even if multiple realizations of ξ are observed, µ still may not beknown exactly, while use of a distribution different from µ may sometimes result in bad decisions.Another major criticism is that in many applications there are not multiple realizations of ξ thatwill be encountered, for example in problems involving events that may either happen once ornot happen at all, and thus the notion of a “true” underlying distribution does not apply. Thesecriticisms motivate the notion of distributionally robust stochastic optimization (DRSO), that does

1

Page 2: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

2

not rely on the notion of a known true underlying distribution. One chooses a set M of probabilitydistributions to hedge against, and then finds a decision that provides the best hedge against theset M of distributions by solving the following minmax problem:

infx∈X

supµ∈M

Eµ[Ψ(x, ξ)]. (DRSO)

Such a minmax approach has its roots in Von Neumann’s game theory and has been used in manyfields such as inventory management (Scarf et al. [38], Gallego and Moon [21]), statistical decisionanalysis (Berger [8]), as well as stochastic optimization (Zackova [49], Dupacova [17], Shapiro andKleywegt [41]). Recently it regained attention in the operations research literature, and sometimesis called data-driven stochastic optimization or ambiguous stochastic optimization.

A central question is: how to choose a good set of distributions M to hedge against? A good choiceof M should take into account the properties of the practical application as well as the tractabilityof problem (DRSO). Two typical ways of constructing M are moment-based and distance-based.The moment-based approach considers distributions whose moments (such as mean and covariance)satisfy certain conditions (Scarf et al. [38], Delage and Ye [16], Popescu [35], Zymler et al. [51]).It has been shown that in many cases the resulting DRSO problem can be formulated as a conicquadratic or semi-definite program. However, the moment-based approach is based on the curiousassumption that certain conditions on the moments are known exactly but that nothing else aboutthe relevant distribution is known. More often in applications, either one has data from repeatedobservations of the quantity ξ, or one has no data, and in both cases the moment conditions donot describe exactly what is known about ξ. In addition, the resulting worst-case distributionssometimes yield overly conservative decisions (Wang et al. [46], Goh and Sim [23]). For example,Wang et al. [46] shows that for the newsvendor problem, by hedging against all the distributionswith fixed mean and variance, Scarf’s moment approach yields a two-point worst-case distribution,and the resulting decision does not perform well under other more likely scenarios.

The distance-based approach considers distributions that are close, in the sense of a chosenstatistical distance, to a nominal distribution ν, such as an empirical distribution or a Gaussiandistribution (El Ghaoui et al. [18], Calafiore and El Ghaoui [13]). Popular choices of the statisticaldistance are φ-divergences (Bayraksan and Love [5], Ben-Tal et al. [6]), which include Kullback-Leibler divergence (Jiang and Guan [25]), Burg entropy (Wang et al. [46]), and Total Variationdistance (Sun and Xu [42]) as special cases, Prokhorov metric (Erdogan and Iyengar [19]), andWasserstein distance (Wozabal [47, 48], Esfahani and Kuhn [20], Zhao and Guan [50]).

1.1. Motivation: Potential issues with φ-divergence Despite its widespread use, φ-divergence has a number of shortcomings. Here we highlight some of these shortcomings. In a typicalsetup using φ-divergence, Ξ is partitioned into B+ 1 bins represented by points ξ0, ξ1, . . . , ξB ∈ Ξ.The nominal distribution ν associates Ni observations with bin i. That is, the nominal distributionis given by ν := (N0/N,N1/N, . . . ,NB/N), whereN :=

∑B

i=0Ni. Let ∆B := {(p0, p1, . . . , pB)∈RB+1+ :∑B

j=0 pj = 1} denote the set of probability distributions on the same set of bins. Let φ : [0,∞) 7→R bea chosen convex function such that φ(1) = 0, with the conventions that 0φ(a/0) := a limt→∞ φ(t)/tfor all a > 0, and 0φ(0/0) := 0. Then the φ-divergence between µ= (p0, . . . , pB), ν = (q0, . . . , qB) ∈∆B is defined by

Iφ(µ,ν) :=B∑j=0

qjφ

(pjqj

).

Let θ > 0 denote a chosen radius. Then Mφ := {µ∈∆B : Iφ(µ,ν)≤ θ} denotes the set of probabilitydistributions given by the chosen φ-divergence and radius θ. The DRSO problem corresponding tothe φ-divergence ball Mφ is then given by

infx∈X

supµ∈∆B

{B∑j=0

pjΨ(x, ξj) : Iφ(µ,ν)≤ θ

}.

Page 3: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

3

It has been shown in Ben-Tal et al. [6] that the φ-divergence ball Mφ can be viewed as a statisticalconfidence region (Pardo [32]), and for several choices of φ, the inner maximization of the problemabove is tractable.

One well-known shortcoming of φ-divergence balls is that, either they are not rich enough tocontain distributions that are often relevant, or they they hedge against many distributions thatare too extreme. For example, for some choices of φ-divergence such as Kullback-Leibler divergence,if the nominal qi = 0, then pi = 0, that is, the φ-divergence ball includes only distributions thatare absolutely continuous with respect to the nominal distribution ν, and thus does not includedistributions with support on points where the nominal distribution ν is not supported. As a result,if Ξ = RK and ν is discrete, then there are no continuous distributions in the φ-divergence ballMφ. Some other choices of φ-divergence exhibit in some sense the opposite behavior. For example,the Burg entropy ball includes distributions with some amount of probability allowed to be shiftedfrom ν to any other bin, with the amount of probability allowed to be shifted depending only onθ and not on how extreme the bin is. See Section 5.1 for more details regarding this potentialshortcoming.

Next we illustrate another shortcoming of φ-divergence that will motivate the use of Wassersteindistance.

Example 1. Suppose that there is an underlying true image (1b), and a decision maker possesses,instead of the true image, an approximate image (1a) obtained with a less than perfect device thatloses some of the contrast. The images are summarized by their gray-scale histograms. (In fact, (1a)was obtained from (1b) by a low-contrast intensity transformation (Gonzalez and Woods [24]), bywhich the black pixels become somewhat whiter and the white pixels become somewhat blacker.This type of transformation changes only the gray-scale value of a pixel and not the location ofa gray-scale value, and therefore it can also be regarded as a transformation from one gray-scalehistogram to another gray-scale histogram.) As a result, roughly speaking, the observed histogramν is obtained by shifting the true histogram µtrue inwards. Also consider the pathological image(1c) that is too dark to see many details, with histogram µpathol. Suppose that the decision makerconstructs a Kullback-Leibler (KL) divergence ball MφKL := {µ ∈∆B : IφKL(µ,ν)≤ θ}. Note thatIφKL(µtrue, ν) = 5.05> IφKL(µpathol, ν) = 2.33. Therefore, if θ is chosen small enough (less than 2.33)for M to exclude the pathological image (1c), then M will also exclude the true image (1b). If θis chosen large enough (greater than 5.05) for M to include the true image (1b), then M also hasto include the pathological image (1c), and then the resulting decision may be overly conservativedue to hedging against irrelevant distributions. If an intermediate value is chosen for θ (between2.33 and 5.05), then M includes the pathological image (1c) and excludes the true image (1b). Incontrast, note that the Wasserstein distance W1 satisfies W1(µtrue, ν) = 30.7<W1(µpathol, ν) = 84.0,and thus Wasserstein distance does not exhibit the problem encountered with KL divergence (seealso Example 3).

The reason for such behavior is that φ-divergence does not incorporate a notion of how close twopoints ξ, ξ′ ∈Ξ are to each other, for example, how likely it is that observation is ξ′ given that thetrue value is ξ. In Example 1, Ξ = {0,1, . . . ,255} represents 8-bit gray-scale values. In this case,we know that the likelihood that a pixel with gray-scale value ξ ∈ Ξ is observed with gray-scalevalue ξ′ ∈ Ξ is decreasing in the absolute difference between ξ and ξ′. However, in the definitionof φ-divergence, only the relative ratio pj/qj for the same gray-scale value j is taken into account,while the distances between different gray-scale values is not taken into account. This phenomenonhas been observed in studies of image retrieval (Rubner et al. [37], Ling and Okada [28]).

The drawbacks of φ-divergence motivates us to consider sets M that incorporate a notion ofhow close two points ξ, ξ′ ∈ Ξ are to each other. One such choice of M is based on Wassersteindistance. Specifically, consider any underlying metric d on Ξ which measures the closeness of any

Page 4: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

4

8-bit Gray-scale0

0.5

1

1.5

2

Fre

quence

×104

0 50 100 150 200 250

(a) Observed image with histogramν

8-bit Gray-scale0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

Fre

quence

0 50 100 150 200 250

(b) True image with histogram µtrue

8-bit Gray-scale0

0.5

1

1.5

2

2.5

3

Fre

quence

×104

0 50 100 150 200 250

(c) Pathological image with his-togram µpathol

Figure 1. Three images and their gray-scale histograms. For KL divergence, it holds that IφKL(µtrue, ν) = 5.05>IφKL(µpathol, ν) = 2.33, while in contrast, Wasserstein distance satisfies W1(µtrue, ν) = 30.70<W1(µpathol, ν) = 84.03.

two points in Ξ. Let p≥ 1, and let P(Ξ) denote the set of Borel probability measures on Ξ. Thenthe Wasserstein distance of order p between two distributions µ,ν ∈P(Ξ) is defined as

Wp(µ,ν) = minγ∈P(Ξ2)

{E1/p

(ξ,ζ)∼γ [dp(ξ, ζ)] : γ has marginal distributions µ,ν

}.

More detailed explanation and discussion on Wasserstein distance will be presented in Section 2.Given a radius θ > 0, the Wasserstein ball of probability distributions M is defined by

M := {µ∈P(Ξ) : Wp(µ,ν)≤ θ}.

1.2. Related work Wasserstein distance and the related field of optimal transport, which isa generalization of the transportation problem, have been studied in depth. In 1942, together withthe linear programming problem (Kantorovich [27]), Kantorovich [26] tackled Monge’s problemoriginally brought up in the study of optimal transport. In the stochastic optimization literature,Wasserstein distance has been used for single stage stochastic optimization (Wozabal [47, 48]), andfor multistage stochastic optimization (Pflug and Pichler [34]). The challenge for solving (DRSO)is that, the inner maximization involves a supremum over possibly an infinite dimensional spaceof distributions. To tackle this problem, existing works focus on the setup when ν is the empir-ical distribution on a finite-dimensional space. Wozabal [47] transformed the inner maximizationproblem of (DRSO) into a finite-dimensional non-convex program, by using the fact that if ν issupported on at most N points, then there are extreme distributions of M that are supported onat most N + 3 points. Recently, using duality theory of conic linear programming (Shapiro [39]),Esfahani and Kuhn [20] and Zhao and Guan [50] showed that under certain conditions, the innermaximization problem of (DRSO) is actually equivalent to a finite-dimensional convex problem.

In this paper, we consider any arbitrary nominal distribution ν on a Polish space, and studythe tractability of (DRSO) via strong duality. By the time we completed the first version of thispaper, we learned that Blanchet and Murthy [9] also considered a similar problem to ours and alsoobtained a strong duality result. Our focus and our approach to this problem differ from theirs inthe following ways. First, we prove the strong duality result for the inner maximization of (DRSO)

Page 5: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

5

using a novel, yet simple, constructive approach, in contrast with the non-constructive approachesin their work and also in Esfahani and Kuhn [20] and Zhao and Guan [50]. This enables us toestablish the structural characterization of the worst-case distributions of the data-driven DRSO(Corollary 2(ii)), which improves the result of Wozabal [47] and the more recent result of Owhadiand Scovel [31] on extremal distributions of Wasserstein balls (Remark 6). It also enables us tobuild a close connection between DRSO and robust optimization (Corollary 2(iii)). Second, wefocus on Wasserstein distance of order p (p≥ 1), while they consider more general transport metricsin which the distance between two points ξ, ξ′ ∈Ξ is measured by a lower semicontinuous functionrather than a metric dp(ξ, ξ′) as in our case. Nevertheless, our proof remains valid for such moregeneral transport metrics (Remark 3). In the meantime, focusing on Wasserstein distance enablesus to relate the condition for the existence of a worst-case distribution to the important notionof the “growth rate” of the objective function, and enables us to provide practical guidance forchoosing the ambiguity set and controlling the degree of conservativeness based on the objectivefunction (Remark 2).

1.3. Main contributions

• General Setting. We prove a strong duality result for DRSO problems with Wasserstein dis-tance in a very general setting. We show that

supµ∈P(Ξ)

{Eµ[Ψ(x, ξ)] : Wp(µ,ν)≤ θ

}= min

λ≥0

{λθp−

∫Ξ

infξ∈Ξ

[λdp(ξ, ζ)−Ψ(x, ξ)]ν(dζ)

}holds for any Polish space (Ξ, d) and measurable function Ψ (Theorem 1).1. Both Esfahani and Kuhn [20] and Zhao and Guan [50] assume that Ξ is a convex subset

of RK with some associated norm. The greater generality of our results enables one toconsider interesting problems such as the process control problems in Sections 4.1 and4.2, where Ξ is the set of finite counting measures on [0,1], which is infinite-dimensionaland non-convex.

2. Both Esfahani and Kuhn [20] and Zhao and Guan [50] assume that the nominal distribu-tion ν is an empirical distribution, while we allow ν to be any Borel probability measure.The greater generality enables one to study problems such as the worst-case Value-at-Riskanalysis in Section 4.3.

3. Both Esfahani and Kuhn [20] and Zhao and Guan [50] only consider Wasserstein distanceof order p = 1. By considering a bigger family of Wasserstein distances, we establishthe importance for DRSO problems of the notion of the “growth rate” of the objectivefunction, which measures how fast the objective function grows compared to a polynomialof order p. It turns out that the growth rate of the objective function determines thefiniteness of the worst-case objective value (Proposition 2), and it plays an important rolein the existence conditions for the worst-case distribution (Corollary 1). This is of practicalimportance, since it provides guidance for choosing the proper Wasserstein distance andfor controlling the degree of conservativeness based on the structure of the objectivefunction.

• Constructive Proof of Duality. We prove the strong duality result using a novel, elementary,constructive approach. The results of Esfahani and Kuhn [20] and Zhao and Guan [50] andother strong duality results in the literature are based on the established Hahn-Banach the-orem for certain infinite dimensional vector spaces. In contrast, our proof idea is new andis relatively elementary and straightforward: we use the weak duality result as well as thefirst-order optimality condition of the dual problem to construct a sequence of primal feasiblesolutions whose objective values converge to the dual optimal value. Our proof uses relativelyelementary tools, without resorting to other “big hammers”.

Page 6: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

6

• Existence Conditions for and Insightful Structure of Worst-case Distributions. As a by productof our constructive proof, we identify necessary and sufficient conditions for the existence ofworst-case distributions, and a structural characterization of worst-case distributions (Corol-lary 1). Specifically, for data-driven DRSO problems where ν = 1

N

∑N

i=1 δξi (where δξ denotesthe unit mass on ξ), whenever a worst-case distribution exists, there is a worst-case distribu-tion µ∗ supported on at most N + 1 points with the following concise structure:

µ∗ =1

N

N∑i=1i6=i0

δξi∗ +p0

Nδξi0∗

+1− p0

Nδξi0∗,

for some i0 ∈ {1, . . . ,N}, p0 ∈ [0,1] and

ξi∗ ∈ arg minξ∈Ξ

{λ∗dp(ξ, ξi)−Ψ(x, ξ)

}, ∀ i 6= i0, ξi0

∗, ξi0∗ ∈ arg min

ξ∈Ξ

{λ∗dp(ξ, ξi0)−Ψ(x, ξ)

},

where λ∗ is the dual minimizer (Corollary 2). Thus µ∗ can be viewed as a perturbation ofν, where the mass on ξi is perturbed to ξi∗ for all i 6= i0, a fraction p0 of the mass on ξi0 is

perturbed to ξi0∗

, and the remaining fraction 1− p0 of the mass on ξi0 is perturbed to ξi0∗ . In

particular, uncertainty quantification problems have a worst-case distribution with this simplestructure, and can be solved by a greedy procedure (Example 7). Our result regarding theexistence of a worst-case distribution with such a structure improves the result of Wozabal[47] and the more recent result of Owhadi and Scovel [31] regarding the extremal distributionsof Wasserstein balls.

• Connection with Robust Optimization. Using the structure of a worst-case distribution, weprove that data-driven DRSO problems can be approximated by robust optimization problemsto any accuracy (Corollary 2(iii)). We use this result to show that two-stage linear DRSOproblems with linear decision rules have a tractable semi-definite programming approxima-tion (Section 5.2). Moreover, the robust optimization approximation becomes exact when theobjective function Ψ is concave in ξ. In addition, if Ψ is convex in x, then the correspondingDRSO problem can be formulated as a convex-concave saddle point problem.

The rest of this paper is organized as follows. In Section 2, we review some results on theWasserstein distance. Next we prove strong duality for general nominal distributions in Section 3.1,and in Section 3.2 we derive additional results for finite-supported nominal distributions. Then,in Sections 4 and 5, we apply our results on strong duality and the structural description of theworst-case distributions to a variety of DRSO problems. We conclude this paper in Section 6.Auxiliary results, as well as proofs of some Lemmas, Corollaries and Propositions, are provided inthe Appendix.

2. Notation and Preliminaries In this section, we introduce notation and briefly outlinesome known results regarding Wasserstein distance. For a more detailed discussion we refer toVillani [44, 45].

Let Ξ be a Polish (separable complete metric) space with metric d. Let B(Ξ) denote the Borelσ-algebra on Ξ, and let Bν(Ξ) denote the completion of B(Ξ) with respect to a measure ν inB(Ξ) such that the measure space (Ξ,Bν(Ξ), ν) is complete (see, e.g., Definition 1.11 in Ambrosioet al. [2]). Let B(Ξ) denote the set of Borel measures on Ξ, and let P(Ξ) denote the set of Borelprobability measures on Ξ. To facilitate later discussion, we introduce the push-forward operatoron measures.

Definition 1 (Push-forward Measure). Given measurable spaces (Ξ,B(Ξ)) and(Ξ′,B(Ξ′)), a measurable function T : Ξ 7→Ξ′, and a measure ν ∈B(Ξ), let T#ν ∈B(Ξ′) denote thepush-forward measure of ν through T , defined by

T#ν(A) := ν(T−1(A)) = ν{ζ ∈Ξ : T (ζ)∈A}, ∀ measurable sets A⊂Ξ′.

Page 7: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

7

That is, T#ν is obtained by transporting (“pushing forward”) ν from Ξ to Ξ′ using the functionT . For i∈ {1,2}, let πi : Ξ×Ξ 7→Ξ denote the canonical projections given by πi(ξ1, ξ2) = ξi. Thenfor a measure γ ∈P(Ξ×Ξ), πi#γ ∈P(Ξ) is the i-th marginal of γ given by π1

#γ(A) = γ(A×Ξ) andπ2

#γ(A) = γ(Ξ×A).

Definition 2 (Wasserstein distance). The Wasserstein distance Wp(µ,ν) between µ,ν ∈Pp(Ξ) is defined by

W pp (µ,ν) := min

γ∈P(Ξ×Ξ)

{∫Ξ×Ξ

dp(ξ, ζ)γ(dξ, dζ) : π1#γ = µ,π2

#γ = ν

}. (1)

That is, the Wasserstein distance between µ,ν is the minimum cost (in terms of dp) of redis-tributing mass from ν to µ, which is why it is also called the “earth mover’s distance”. Wassersteindistance is a natural way of comparing two distributions when one is obtained from the other byperturbations. The minimum on the right side of (1) is attained, because d is non-negative, con-tinuous and thus lower semicontinuous (Theorem 1.3 of [44]). The following example is a familiarspecial case of problem (1).

Example 2 (Transportation problem). Consider µ=∑M

i=1 piδξi and ν =∑N

j=1 qjδξj , where

M,N ≥ 1, pi, qj ≥ 0, ξi, ξj ∈ Ξ for all i, j, and∑M

i=1 pi =∑N

j=1 qj = 1. Then problem (1) becomesthe classical transportation problem in linear programming:

minγij≥0

{M∑i=1

N∑j=1

dp(ξi, ξj)γij :N∑j=1

γij = pi, ∀ i,M∑i=1

γij = qj,∀ j

}.

Example 3 (Revisiting Example 1). Next we evaluate the Wasserstein distance between thehistograms in Example 1. To evaluate W1(µtrue, ν), note that the least cost way of transportingmass from ν to µtrue is to move the mass outwards. In contrast, to evaluate W1(µpathol, ν), onehas to transport mass relatively long distances from right to left (changing the gray-scale valuesof pixels by large amounts), resulting in a larger cost than W1(µtrue, ν). Therefore W1(µpathol, ν)>W1(µtrue, ν).

Wasserstein distance has a dual representation due to Kantorovich’s duality (Theorem 5.10 in[45]):

W pp (µ,ν) = sup

u∈L1(µ),v∈L1(ν)

{∫Ξ

u(ξ)µ(dξ) +

∫Ξ

v(ζ)ν(dζ) : u(ξ) + v(ζ)≤ dp(ξ, ζ), ∀ ξ, ζ ∈Ξ

}, (2)

where L1(ν) represents the L1 space of ν-measurable (i.e., (Bν(Ξ),B(R))-measurable) functions.In addition, the set of functions under the supremum above can be replaced by u, v ∈Cb(Ξ), whereCb(Ξ) denotes the set of continuous and bounded real-valued functions on Ξ. Particularly, whenp= 1, by the Kantorovich-Rubinstein Theorem, (2) can be simplified to (see, e.g., Equation (5.11)in [45])

W1(µ,ν) = supu∈L1(µ)

{∫Ξ

u(ξ)d(µ− ν)(ξ) : u is 1-Lipschitz

}.

So for an L-Lipschitz function Ψ : Ξ 7→R, it holds that∣∣Eµ[Ψ(ξ)]−Eν [Ψ(ξ)]

∣∣≤LW1(µ,ν)≤Lθ forall µ∈M.

We remark that Definition 2 and the results above can be extended to finite Borel measures.Moreover, we have the following result.

Lemma 1. For any finite Borel measures µ,ν ∈ B(Ξ) with µ(Ξ) 6= ν(Ξ), it holds that Wp(µ,ν) =∞.

Page 8: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

8

Another important feature of Wasserstein distance is that Wp metrizes weak convergence inPp(Ξ) (cf. Theorem 6.9 in Villani [45]). That is, for any sequence {µk}∞k=1 of measures in Pp(Ξ)and µ ∈ Pp(Ξ), it holds that limk→∞Wp(µk, µ) = 0 if and only if µk converges weakly to µ and∫

Ξdp(ξ, ζ0)µk(dξ)→

∫Ξdp(ξ, ζ0)µ(dξ) as k →∞. Therefore, convergence in the Wasserstein dis-

tance of order p implies convergence up to the p-th moment. Villani [45, chapter 6] discusses theadvantages of Wasserstein distance relative to other distances, such as the Prokhorov metric, thatmetrize weak convergence.

3. Tractable Reformulation via Duality. In this section we develop a tractable reformu-lation by deriving its strong dual. We suppress the variable x of Ψ in this section, and results areinterpreted pointwise for each x. Given ν ∈ P(Ξ) and Ψ ∈ L1(ν), for any θ > 0 and p ∈ [1,∞), theinner maximization problem of (DRSO) is written as

vP := supµ∈M

∫Ξ

Ψ(ξ)µ(dξ) = supµ∈P(Ξ)

{∫Ξ

Ψ(ξ)µ(dξ) : Wp(µ,ν)≤ θ}. (Primal)

Our main goal is to derive its strong dual

vD := infλ≥0

{λθp−

∫Ξ

infξ∈Ξ

[λdp(ξ, ζ)−Ψ(ξ)

]ν(dζ)

}. (Dual)

The dual problem is a one-dimensional convex minimization problem with respect to λ,the Lagrangian multiplier of the Wasserstein constraint in the primal problem. The terminfξ∈Ξ[λdp(ξ, ζ)−Ψ(ξ)] is called Moreau-Yosida regularization of −Ψ with parameter 1/λ in theliterature (cf. Parikh and Boyd [33]). Its measurability with respect to ν will be established inLemma 3(i) in Section 3.1.

3.1. General Nominal Distribution In this subsection, we prove the strong duality resultfor a general nominal distribution ν on a Polish space Ξ. Such generality broadens the applicabilityof the result for (DRSO). For example, the result is useful when the nominal distribution is somedistribution such as a Gaussian distribution on RK (Section 4.3), or even some stochastic process(Sections 4.1 and 4.2). We begin with the weak duality, which is an application of Lagrangian weakduality.

Proposition 1 (Weak duality). Consider any ν ∈ P(Ξ) and Ψ ∈ L1(ν). Then for any p ∈[1,∞) and θ > 0, it holds that vP ≤ vD.

To prove the strong duality, we consider two separate case: vD =∞ and vD <∞. As can be seenfrom (Dual), if the term −

∫Ξ

infξ∈Ξ[λdp(ξ, ζ)−Ψ(ξ)] is infinite for all λ≥ 0, then vD =∞. Thus,to facilitate our analysis, we introduce the following definitions.

Definition 3 (Regularization Operator Φ). Let Φ :R×Ξ→R∪{−∞} be given by

Φ(λ, ζ) := infξ∈Ξ

{λdp(ξ, ζ)−Ψ(ξ)

}.

Definition 4 (Growth rate). Define the growth rate κ of Ψ as

κ := inf

{λ≥ 0 :

∫Ξ

infξ∈Ξ

[λdp(ξ, ζ)−Ψ(ξ)

]ν(dζ)>−∞

}.

Particularly, if∫

ΞΦ(λ, ζ)ν(dζ) =−∞ for all λ≥ 0, then κ=∞.

By definition, for all λ≥ 0 and all ζ ∈Ξ, Φ(λ, ζ)≤−Ψ(ζ). Also, for all λ> κ, Φ(λ, ·)∈L1(ν) andthus Φ(λ, ζ)∈R for ν-almost all ζ. In the sequel, for a function f :R+→R∪{−∞}, we denote bydom(f) its effective domain

dom(f) := {λ≥ 0 : f(λ)>−∞},and denote by int(dom(f)) the interior of dom(f).

Page 9: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

9

Proposition 2 (Strong duality with infinite optimal value). Consider any ν ∈ P(Ξ) andΨ∈L1(ν). Let p∈ [1,∞). Suppose θ > 0 and κ=∞. Then vP = vD =∞.

Remark 1 (Equivalent expression for κ). By Lemma 6 in the Appendix, for any nominaldistribution

ν ∈Pp(Ξ) :=

{µ∈P(Ξ) :

∫Ξ

dp(ζ, ζ0)ν(dζ)<∞ for some ζ0 ∈Ξ

},

the growth rate κ <∞ is equivalent to the following condition: there exists ζ0 ∈ Ξ, L,M > 0 suchthat Ψ(ξ)−Ψ(ζ0) ≤ L ·dp(ξ, ζ) +M for all ξ ∈Ξ. In addition, when Ξ is unbounded, we have that

κ = lim supξ∈Ξ: dp(ξ,ζ)→∞

max{0,Ψ(ξ)−Ψ(ζ)}dp(ξ, ζ)

for any ζ ∈Ξ.

Remark 2 (Choosing Wasserstein order p). Define

p := inf{p≥ 1 : limsup

d(ζ′,ζ0)→∞

Ψ(ζ ′)−Ψ(ζ0)

dp(ζ ′, ζ0)<∞

}.

Proposition 2 suggests that a meaningful formulation of (DRSO) should be such that the Wasser-stein order p is greater than or equal to p. In both Esfahani and Kuhn [20] and Zhao and Guan [50]only p= 1 is considered. By considering higher orders p in our analysis, we have more flexibilityto choose the ambiguity set and control the degree of conservativeness based on the informationof function Ψ.

The next theorem establishes the strong duality when the growth rate κ is finite.

Theorem 1 (Strong duality with finite optimal value). Consider any ν ∈ P(Ξ) and Ψ ∈L1(ν). Let p∈ [1,∞) and θ > 0. Suppose κ<∞. Then vP = vD <∞.

To prove this theorem, we first study some properties of the regularization operator Φ. For anyλ∈ dom

(∫Ξ

Φ(λ, ζ)ν(dζ)), define D,D :R×Ξ→R∪{+∞} by

D(λ, ζ) := lim supδ↓0

{supξ∈Ξ

{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) ≤ Φ(λ, ζ) + δ

}},

D(λ, ζ) := lim infδ↓0

{infξ∈Ξ

{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) ≤ Φ(λ, ζ) + δ

}}.

(3)

Lemma 2 (Properties of the regularization operator Φ). Let (Ξ, d) be a Polish space. Con-sider any ν ∈P(Ξ) and Ψ∈L1(ν). Let p∈ [1,∞). Suppose κ<∞. Then the following holds ν-a.s.

(i) [Monotonicity] Φ(·, ζ) is nondecreasing and concave. In addition, for λ2 > λ1 ∈ dom(Φ(·, ζ)),it holds that D(λ2, ζ)≤D(λ1, ζ)≤D(λ1, ζ).

(ii) [Bound] For any λ> λ0 ∈ dom(Φ(·, ζ)),

(λ−λ0)D(λ, ζ) ≤ Φ(λ0, ζ)−Ψ(ζ).

(iii) [Derivative] For any λ ∈ int(domΦ(·, ζ)), the left partial derivative ∂Φ(λ, ζ)/∂λ− exists andsatisfies

D(λ, ζ) ≤ ∂Φ(λ, ζ)

∂λ−≤ lim

λ1↑λD(λ1, ζ).

For any λ∈ domΦ(·, ζ), the right partial derivative ∂Φ(λ, ζ)/∂λ+ exists and satisfies

limλ2↓λ

D(λ2, ζ) ≤ ∂Φ(λ, ζ)

∂λ+≤ D(λ, ζ).

Page 10: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

10

Lemma 3 (Measurability). Consider any ν ∈P(Ξ) and Ψ∈L1(ν). Let p∈ [1,∞).(i) Φ(λ, ·), D(λ, ·), and D(λ, ·) are ν-measurable.

(ii) Suppose κ<∞. Let λ∈ dom(∫

ΞΦ(λ, ζ)ν(dζ)

)be such that D(λ, ζ)<∞. For any δ, ε≥ 0 such

that the sets

E(ζ) :={ξ ∈Ξ : λdp(ξ, ζ)−Ψ(ξ) ≤ Φ(λ, ζ) + δ, dp(ξ, ζ) ≤ D(λ, ζ)− ε

},

E(ζ) :={ξ ∈Ξ : λdp(ξ, ζ)−Ψ(ξ) ≤ Φ(λ, ζ) + δ, dp(ξ, ζ) ≥ D(λ, ζ) + ε

}are non-empty for all ζ ∈Ξ, there exists ν-measurable mappings T ,T : Ξ→Ξ such that T (ζ)∈E(ζ) and T (ζ)∈E(ζ) for ν-almost all ζ.

(iii) Suppose κ<∞. For any ε > 0, and any ν-measurable function M such that the set

E(ζ) := {ξ ∈Ξ : Ψ(ξ)−Ψ(ζ)≥ (κ− ε)dp(ξ, ζ), dp(ξ, ζ)≥M(ζ)}

is non-empty for all ζ ∈ Ξ, there exists a ν-measurable mapping T : Ξ→ Ξ such that T (ζ) ∈E(ζ) for ν-almost all ζ.

Proof of Theorem 1. As a consequence of weak duality (Proposition 1), it suffices to show thatvP ≥ vD. Let h :R+→R∪{∞} denote the dual objective function

h(λ) := λθp−∫

Ξ

Φ(λ, ζ)ν(dζ).

By Lemma 2(i), h(λ) is the sum of a linear function λθp and an (extended real-valued) convexfunction −

∫Ξ

Φ(λ, ζ)ν(dζ) on [0,∞). In addition, since Φ(λ, ζ) ≤ −Ψ(ζ), it follows that h(λ) ≥λθp+

∫Ξ

Ψ(ζ)ν(dζ)→∞ as λ→∞. Thus h is a convex function on [0,∞) that goes to∞ as λ→∞.By Definition 4 of κ, either (i) h has a minimizer λ∗ >κ, or (ii) infλ≥0 h(λ) = limλ↓κ h(λ). Next weconsider these two cases separately. In each case, we construct a sequence of primal feasible solutionswhich converges to the dual optimal value by exploiting the first-order optimality condition of thedual.• Case 1: h has a minimizer λ∗ >κ.

The first-order optimality conditions ∂∂λ−h(λ∗)≤ 0 and ∂

∂λ+h(λ∗)≥ 0 at imply that

∂λ+

(∫Ξ

Φ(λ∗, ζ)ν(dζ)

)≤ θp ≤ ∂

∂λ−

(∫Ξ

Φ(λ∗, ζ)ν(dζ)

). (4)

We verify that we can exchange the partial derivative and integration in (4). To show the firstinequality, consider any decreasing sequence λn ↓ λ∗. Let

fn(ζ) :=Φ(λn, ζ)−Φ(λ∗, ζ)

λn−λ∗

Since Φ(·, ζ) is nondecreasing for all ζ, fn(ζ) ≥ 0 for all ζ. In addition, since Φ(·, ζ) is concave,fn ≤ fn+1 for all n. Note that limn→∞ fn(ζ) = ∂

∂λ+Φ(λ∗, ζ). Thus it follows from the monotone

convergence theorem that

∂λ+

(∫Ξ

Φ(λ∗, ζ)ν(dζ)

)=

∫Ξ

∂λ+Φ(λ∗, ζ)ν(dζ). (5)

To show the second inequality in (4), consider any increasing sequence λn ↑ λ∗ with λ1 > κ. Thenby monotonicity and concavity of Φ(·, ζ), we have that D(λ1, ζ)≥ fn(ζ)≥ fn+1(ζ)≥ 0 for all n. Letλ0 ∈ (κ,λ1). It follows from Lemma 2(ii) that

f1(ζ) ≤ D(λ1, ζ) ≤ 1

λ1−λ0

(Ψ(ζ) + Φ(λ0, ζ)) ∈ L1(ν).

Page 11: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

11

Thus, applying the reverse Fatou’s lemma on the sequence {fn}n gives

∂λ−

(∫Ξ

Φ(λ∗, ζ)ν(dζ)

)=

∫Ξ

∂λ−Φ(λ∗, ζ)ν(dζ). (6)

Using (4), (5), (6), and Lemma 2(iii), we have that

θp ≥ ∂

∂λ+

(∫Ξ

Φ(λ∗, ζ)ν(dζ)

)=

∫Ξ

∂λ+Φ(λ∗, ζ)ν(dζ) ≥

∫Ξ

limλ↓λ∗

D(λ, ζ)ν(dζ)

θp ≤ ∂

∂λ−

(∫Ξ

Φ(λ∗, ζ)ν(dζ)

)=

∫Ξ

∂λ−Φ(λ∗, ζ)ν(dζ) ≤

∫Ξ

limλ↑λ∗

D(λ, ζ)ν(dζ).

(7)

Observe that for any λ> κ and δ, ε > 0, the sets E,E in Lemma 3(ii) are well-defined. Hence thereexists ν-measurable mappings T λ, T λ : Ξ→Ξ such that for ν-almost all ζ,

T λ(ζ)∈{ξ ∈Ξ : λdp(ξ, ζ)−Ψ(ξ) ≤ Φ(λ, ζ) + δ, dp(ξ, ζ) > D(λ, ζ)− ε

},

T λ(ζ)∈{ξ ∈Ξ : λdp(ξ, ζ)−Ψ(ξ) ≤ Φ(λ, ζ) + δ, dp(ξ, ζ)<D(λ, ζ) + ε

}.

In particular, for any λ1, λ2 with κ< λ1 <λ∗ <λ2, it follows from (7) and Lemma 2(i) that

θp ≥∫

Ξ

limλ↓λ∗

D(λ, ζ)ν(dζ)≥∫

Ξ

D(λ2, ζ)ν(dζ)≥∫

Ξ

D(λ2, ζ)ν(dζ)≥∫

Ξ

dp(T λ2(ζ), ζ)ν(dζ)− ε,

θp ≤∫

Ξ

limλ↑λ∗

D(λ, ζ)ν(dζ)≤∫

Ξ

D(λ1, ζ)ν(dζ)≤∫

Ξ

D(λ1, ζ)ν(dζ)≤∫

Ξ

dp(T λ1(ζ), ζ)ν(dζ) + ε.

(8)

Based on (8), we now construct a feasible primal solution. Note that there is a qεδ(λ1, λ2)∈ [0,1]such that

qεδ(λ1, λ2)

[∫Ξ

dp(T λ1(ζ), ζ)ν(dζ) + ε

]+ (1− qεδ(λ1, λ2))

[∫Ξ

dp(T λ2(ζ), ζ)ν(dζ)− ε

]= θp. (9)

Let qε := θp

θp+max{0,[1−2qεδ(λ1,λ2)]}ε . Define a distribution µεδ(λ1, λ2) by

µεδ(λ1, λ2) := qεqεδ(λ1, λ2) · (T λ1)#ν+ qε(1− qεδ(λ1, λ2)) · (T λ2

)#ν+ (1− qε)ν.

Then µεδ(λ1, λ2) is primal feasible:

W pp (µεδ(λ1, λ2), ν)≤ qεqεδ(λ1, λ2)

∫Ξ

dp(T λ1(ζ), ζ)ν(dζ) + qε(1− qεδ(λ1, λ2))

∫Ξ

dp(T λ2(ζ), ζ)ν(dζ)

= qε(θp + (1− 2qεδ(λ1, λ2))ε)≤ θp.

Furthermore, recall that for all ζ ∈Ξ,

λ1dp(T λ1

(ζ), ζ)−Φ(λ1, ζ)− δ ≤ Ψ(T λ1(ζ)) ≤ λ1d

p(T λ1(ζ), ζ)−Φ(λ1, ζ),

λ2dp(T λ2

(ζ), ζ)−Φ(λ2, ζ)− δ ≤ Ψ(T λ2(ζ)) ≤ λ2d

p(T λ2(ζ), ζ)−Φ(λ2, ζ).

This, together with (9), implies that

vP ≥∫

Ξ

Ψ(ζ)µεδ(λ1, λ2)(dζ)

=qεqεδ(λ1, λ2)

∫Ξ

Ψ(T λ1(ζ))ν(dζ) + qε(1− qεδ(λ1, λ2))

∫Ξ

Ψ(T λ2(ζ))ν(dζ) + (1− qε)

∫Ξ

Ψ(ζ)ν(dζ)

Page 12: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

12

≥qεqεδ(λ1, λ2)

∫Ξ

[λ1d

p(T λ1(ζ), ζ)−Φ(λ1, ζ)− δ

]ν(dζ)

+ qε(1− qεδ(λ1, λ2))

∫Ξ

[λ2d

p(T λ2(ζ), ζ)−Φ(λ2, ζ)− δ

]ν(dζ) + (1− qε)

∫Ξ

Ψ(ζ)ν(dζ)

≥qελ1[θp + (1− 2qεδ(λ1, λ2))ε]− qεqεδ(λ1, λ2)

∫Ξ

Φ(λ1, ζ)ν(dζ)

− qε(1− qεδ(λ1, λ2))

∫Ξ

Φ(λ2, ζ)ν(dζ)− qεδ+ (1− qε)∫

Ξ

Ψ(ζ)ν(dζ).

Let λ1 ↑ λ∗ and λ2 ↓ λ∗, and apply the monotone convergence theorem on∫

ΞΦ(λ1, ζ)ν(dζ) and∫

ΞΦ(λ2, ζ)ν(dζ),

limλ1↑λ∗,λ2↓λ∗

{qεδ(λ1, λ2)

∫Ξ

Φ(λ1, ζ)ν(dζ) + (1− qεδ(λ1, λ2))

∫Ξ

Φ(λ2, ζ)ν(dζ)

}=

∫Ξ

Φ(λ∗, ζ)ν(dζ),

hence

vP ≥ qελ∗[θp + limsupλ1↑λ∗,λ2↓λ∗

(1− 2qεδ(λ1, λ2))ε]− qε∫

Ξ

Φ(λ∗, ζ)ν(dζ)− qεδ+ (1− qε)∫

Ξ

Ψ(ζ)ν(dζ).

Taking the limit as ε→ 0, and observing that qε = θp

θp+max{0,[1−2qεδ(λ1,λ2)]}ε ∈ ( θp

θp+ε,1), it follows that

vP ≥ λ∗θp−∫

Ξ

Φ(λ∗, ζ)ν(dζ)− δ.

Since δ > 0 can be arbitrarily small, it follows that vP ≥ vD.

• Case 2: infλ≥0 h(λ) = limλ↓κ h(λ), and no λ> κ is dual optimal.Then h is strictly increasing and convex on (κ,∞). For any λ> λ0 >κ, h(λ)>h(λ0) is equivalentlywritten into ∫

Ξ

[Φ(λ, ζ)−Φ(λ0, ζ)]ν(dζ) < (λ−λ0)θp. (10)

Consider any δ ∈(0, (λ−λ0)θp−

∫Ξ

[Φ(λ, ζ)−Φ(λ0, ζ)]ν(dζ)). It follows from Lemma 3 that there

exists a ν-measurable map T λ : Ξ→Ξ such that λdp(T λ(ζ), ζ)−Ψ(T λ(ζ))≤Φ(λ, ζ) + δ. Also, notethat Φ(λ0, ζ)≤ λ0d

p(T λ(ζ), ζ)−Ψ(T λ(ζ)). Thus,

Φ(λ, ζ)−Φ(λ0, ζ) ≥ (λ−λ0)dp(T λ(ζ), ζ)− δ.

Combining this with (10) yields ∫Ξ

dp(T λ(ζ), ζ)ν(dζ)< θp.

Hence, the distribution (T λ)#ν is primal feasible.Next, we separately consider the cases κ= 0 and κ> 0. If κ= 0, then

vP ≥∫

Ξ

Ψ(ξ)(T λ)#ν(dξ) ≥∫

Ξ

[λdp(T λ(ζ), ζ)−Φ(λ, ζ)− δ]ν(dζ) ≥ −∫

Ξ

Φ(λ, ζ)ν(dζ)− δ.

Since δ can be chosen arbitrarily small, it follows that vP ≥−∫

ΞΦ(λ, ζ)ν(dζ) for all λ> κ. Therefore

vP ≥ limλ↓0

{−∫

Ξ

Φ(λ, ζ)ν(dζ)

}= lim

λ↓0

{λθp−

∫Ξ

Φ(λ, ζ)ν(dζ)

}= inf

λ≥0h(λ) = vD.

Page 13: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

13

Otherwise, if κ> 0, for any ε∈ (0, κ), define

D(κ− ε, ζ) := supξ∈Ξ

{dp(ξ, ζ) : Ψ(ξ)−Ψ(ζ)≥ (κ− ε)dp(ξ, ζ)}.

Then we claim that∫

ΞD(κ− ε, ζ)ν(dζ) =∞. Indeed, note that Ψ(ξ)≤ λ0d

p(ξ, ζ)−Φ(λ0, ζ) for allξ ∈Ξ and any λ0 >κ, and thus∫

Ξ

Φ(κ− ε, ζ)ν(dζ) =

∫Ξ

infξ∈Ξ

{(κ− ε)dp(ξ, ζ)−Ψ(ξ) : Ψ(ξ)−Ψ(ζ)≥ (κ− ε)dp(ξ, ζ)

}ν(dζ)

≥∫

Ξ

infξ∈Ξ

{−Ψ(ξ) : Ψ(ξ)−Ψ(ζ)≥ (κ− ε)dp(ξ, ζ)

}ν(dζ)

≥∫

Ξ

infξ∈Ξ

{−λ0d

p(ξ, ζ) + Φ(λ0, ζ) : Ψ(ξ)−Ψ(ζ)≥ (κ− ε)dp(ξ, ζ)}ν(dζ)

≥ −λ0

∫Ξ

D(κ− ε, ζ)ν(dζ) +

∫Ξ

Φ(λ0, ζ)ν(dζ).

Then∫

ΞD(κ− ε, ζ)ν(dζ) cannot be finite, since otherwise

∫Ξ

Φ(κ− ε, ζ)ν(dζ)>−∞, which contra-dicts to the Definition 4 of κ. The above claim implies that for any R> 0, there exists M ∈L1(ν)with

∫ΞMdν ≥R such that for ν-almost all ζ,

{ξ ∈Ξ : Ψ(ξ)−Ψ(ζ)≥ (κ− ε)dp(ξ, ζ), dp(ξ, ζ)≥M(ζ)} 6= ∅.

Then by Lemma 3(iii), for any ε > 0, R> θp, there exists a ν-measurable mapping TR

ε : Ξ→Ξ suchthat

TR

ε (ζ)−Ψ(ζ) ≥ (κ− ε) · dp(TRε (ζ), ζ), ∀ζ ∈Ξ,

and that ∫Ξ

dp(TR

ε (ζ), ζ)ν(dζ) > R.

Without loss of generality we can assume that∫

Ξdp(T

R

ε (ζ), ζ)ν(dζ)<∞, since otherwise consider

Er := {ζ ∈ Ξ : dp(TR

ε (ζ), ζ) ≤ r}, then limr→∞Er = Ξ, and thus there exists a r0 > 0 such thatR<

∫Er0

dp(T n(ζ), ζ)ν(dζ)<∞. Now we can construct a primal feasible solution

µRδ (λ, ε) := qRε (T λ)#ν+ (1− qRε )(TR

ε )#ν,

where qRε ∈ (0,1) is such that

qRε

∫Ξ

dp(T λ(ζ), ζ)ν(dζ) + (1− qRε )

∫Ξ

dp(TR

ε (ζ), ζ)ν(dζ) = θp.

Then by construction µRδ (λ, ε) is primal feasible. Moreover,

vP ≥∫

Ξ

Ψ(ξ)µRδ (λ, ε)(dξ)

= qRε

∫Ξ

Ψ(T λ(ζ))ν(dζ) + (1− qRε )

∫Ξ

Ψ(TR

ε (ζ))ν(dζ)

≥ qRε

∫Ξ

[λdp(T λ(ζ), ζ)−Φ(λ, ζ)− δ]ν(dζ) + (1− qRε )

∫Ξ

[(κ− ε)dp(TRε (ζ), ζ) + Ψ(ζ)

]ν(dζ)

≥ (κ− ε)(qRε

∫Ξ

dp(T λ(ζ), ζ)ν(dζ) + (1− qRε )

∫Ξ

dp(TR

ε (ζ), ζ)ν(dζ))

− qRε∫

Ξ

Φ(λ, ζ)ν(dζ)− qRε δ+ (1− qRε )

∫Ξ

Ψ(ζ)ν(dζ)

= (κ− ε)θp− qRε∫

Ξ

Φ(λ, ζ)ν(dζ)− qRε δ+ (1− qRε )

∫Ξ

Ψ(ζ)ν(dζ).

Page 14: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

14

For any fixed λ > κ, δ, ε can be chosen arbitrarily small, and qRε can be arbitrarily close to 1by choosing sufficiently large R. Thus, vP ≥ κθp −

∫Ξ

Φ(λ, ζ)ν(dζ)− δ. Hence vP ≥ infλ>κ{λθp −∫Ξ

Φ(λ, ζ)ν(dζ)}= vD. �

Remark 3. We remark all the above results and proofs in Section 3.1 (expect for Remark 1)remains to hold if we replace the transportation cost dp(·, ·) with any measurable, non-negativecost function c(·, ·) that satisfies c(ξ, ζ) = 0 if and only if ξ = ζ.

Next, we investigate existence conditions for worst-case distributions and their structure. In therest of this subsection, we assume that Ψ is upper-semicontinuous, and every bounded subset in(Ξ, d) is totally bounded, which is satisfied by, for example, any finite-dimensional normed space.Under this assumption, when λ> κ, Lemma 2(ii) and the upper semi-continuity of Ψ imply that theset arg minξ∈Ξ{λdp(ξ, ζ)−Ψ(ξ)} is nonempty, and that min/maxξ∈Ξ{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) =Φ(λ, ζ)} can be obtained. When λ= κ and ν({ζ ∈ Ξ : arg minξ∈Ξ{λdp(ξ, ζ)−Ψ(ξ)}= ∅}) = 0, theupper semi-continuity of Ψ imply that the set minξ∈Ξ{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) = Φ(λ, ζ)} can beobtained, but maxξ∈Ξ{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) = Φ(λ, ζ)} can be infinite. Thus, when (i) λ > κ,and (ii) λ= κ, and ν({ζ ∈Ξ : arg minξ∈Ξ{λdp(ξ, ζ)−Ψ(ξ)}=∅}) = 0, the following quantity

D0(λ, ζ) := maxξ∈Ξ{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) = Φ(λ, ζ)},

D0(λ, ζ) := minξ∈Ξ{dp(ξ, ζ) : λdp(ξ, ζ)−Ψ(ξ) = Φ(λ, ζ)},

are well-defined (where D0(λ, ζ) can be infinite). Then D0(λ, ζ) and D0(λ, ζ) represent respectivelythe closest and furthest distances between ζ and any point in arg minξ∈Ξ{λdp(ξ, ζ)−Ψ(ξ)}. Wenote that D0(λ, ζ) (resp. D0(λ, ζ)) may not be equal to D(λ, ζ) (resp. D(λ, ζ)) as defined in (3).

Corollary 1 (Worst-case distribution). Consider any ν ∈ P(Ξ) and Ψ ∈ L1(ν). Let p ∈[1,∞) and θ > 0. Suppose κ<∞. Assume Ψ is upper-semicontinuous, and that bounded subsets of(Ξ, d) are totally bounded. Then the following holds:(i) [Existence condition] A worst-case distribution exists if and only if any of the following con-

ditions hold:1. There exists a dual minimizer λ∗ >κ.2. λ∗ = κ> 0 is the unique minimizer, ν

({ζ ∈Ξ : arg minξ∈Ξ{κdp(ξ, ζ)−Ψ(ξ)}=∅}

)= 0, and∫

Ξ

D0(κ, ζ)ν(dζ) ≤ θp ≤∫

Ξ

D0(κ, ζ)ν(dζ).

3. λ∗ = 0 is the unique minimizer, arg maxξ∈Ξ{Ψ(ξ)} is nonempty, and∫Ξ

D0(0, ζ)ν(dζ)≤ θp.

(ii) If ν(ζ ∈ Ξ : −Ψ(ζ) > infξ∈Ξ {κdp(ξ, ζ)−Ψ(ξ)}

)= 0, then λ∗ = κ for any θ > 0. Otherwise,

there is θ0 > 0 such that λ∗ >κ for any θ < θ0.(iii) [Structure] Whenever a worst-case distribution exists, there exists a worst-case distribution µ∗

which can be represented as a convex combination of two distributions T∗#ν and T ∗#ν, each of

which is a perturbation of ν, as follows:

µ∗ = p∗T∗#ν+ (1− p∗)T ∗#ν,

where p∗ ∈ [0,1], and T∗, T ∗ : Ξ→Ξ satisfy

ν(ζ ∈Ξ : T

∗(ζ), T ∗(ζ) /∈ arg min

ξ∈Ξ

{λ∗dp(ξ, ζ)−Ψ(ξ)})

= 0 (11)

Page 15: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

15

(iv) If Ξ is convex, Ψ is concave, and dp(·, ζ) is convex for all ζ ∈ Ξ, then there exists T ∗ : Ξ→ Ξsuch that T ∗#ν is optimal.

Remark 4. Compared with Corollary 4.7 in Esfahani and Kuhn [20], Corollary 1(i) provides acomplete description of the necessary and sufficient conditions for the existence of a worst-casedistribution. Note that Example 1 in Esfahani and Kuhn [20] corresponds to λ∗ = κ= 1 and p= 1.

Example 4. We present several examples that correspond to different cases in Corollary 1(i). Inall these examples, Ξ = [0,∞), d(ξ, ζ) = |ξ− ζ| for all ξ, ζ ∈Ξ, p= 1, θ > 0, and ν = δ0.

(a) Ψa(ξ) = max{0, ξ− a} (b) Ψ(ξ) = max{0,1− ξ2} (c) Ψ±(ξ) = 1 + ξ± 1ξ+1

Figure 2. Examples for existence and non-existence of the worst-case distribution

1. Ψa(ξ) = max{0, ξ− a} for some a∈R. It follows that λ∗ = κ= 1.– If a ≤ 0, then arg minξ∈Ξ{dp(ξ,0)−Ψa(ξ)} = [0,∞), hence D0(κ, ζ) = 0 and D0(κ, ζ) =∞

satisfying condition (2.). One of the worst-case distributions is µ∗ = δθ with vP = vD = θ−a.– If a > 0, then arg minξ∈Ξ{dp(ξ,0)−Ψa(ξ)} = {0}, hence D0(κ, ζ) = D0(κ, ζ) = 0 < θ, thus

condition (2.) is violated. There is no worst-case distribution, but the objective value ofµε = (1− ε)δ0 + εδθ/ε converges to vP = vD = θ as ε→ 0.

2. Ψ(ξ) = max{0,1 − ξ2}. It follows that λ∗ = κ = 0, and arg maxξ∈Ξ Ψ(ξ) = {0}. Thus condi-tion (3.) is satisfied, and the worst-case distribution is µ∗ = δ0 = ν.

3. Ψ±(ξ) = 1 + ξ± 1ξ+1

. It follows that κ= 1. Note that Ψ′±(ξ) = 1∓ 1(ξ+1)2

.

– Note that Ψ′+(ξ) < κ = 1 on Ξ. Also, Ψ+ satisfies the condition in (ii), thus for all θ > 0it holds that λ∗+ = κ= 1 and arg minξ∈Ξ{λ∗+dp(ξ,0)−Ψ+(ξ)}= {0}. There is no worst-casedistribution, but the objective value of µε = (1− ε)δ0 + εδθ/ε converges to vP = vD = 2 + θ asε→ 0.

– Note that Ψ′−(ξ) > κ = 1 on Ξ. Also, arg minλ≥0

{λθ− infξ∈Ξ

{λξ−

(1 + ξ− 1

ξ+1

)}}=

arg minλ≥1

{λ(θ+ 1)− 2

√λ− 1

}={

1 + 1(θ+1)2

}. Thus λ∗− > 1 = κ.

3.2. Finite-Supported Nominal Distribution In this section, we restrict attention to thecase in which the nominal distribution ν = 1

N

∑N

i=1 δξi for some ξi ∈ Ξ, i= 1, . . . ,N . This occurs,for example, in a data-driven setting in which the decision maker collects N observations thatconstitute an empirical distribution.

Corollary 2 (Data-Driven DRSO). Consider any ν = 1N

∑N

i=1 δξi. Let p ∈ [1,∞) and θ > 0.Then the following hold:(i) [Strong duality] The primal problem (Primal) has a strong dual problem

vP = vD = infλ≥0

{λθp +

1

N

N∑i=1

supξ∈Ξ

[Ψ(ξ)−λdp(ξ, ξi)

]}. (12)

Page 16: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

16

(ii) [Structure of the worst-case distribution] Whenever a worst-case distribution exists, there existsone which is supported on at most N + 1 points and has the form

µ∗ =1

N

∑i6=i0

δξi∗ +p0

Nδξi0∗

+1− p0

Nδξi0∗

(13)

where i0 ∈ {1, . . . ,N}, p0 ∈ [0,1], ξi0∗, ξi0∗ ∈ arg minξ∈Ξ{λ∗dp(ξ, ξi0) − Ψ(ξ)}, and ξi∗ ∈

arg minξ∈Ξ{λ∗dp(ξ, ξi)−Ψ(ξ)} for all i 6= i0.

(iii) [Robust-program approximation] Suppose that there exists ζ0 ∈Ξ, L,M ≥ 0 such that |Ψ(ξ)−Ψ(ζ0)|< Ldp(ξ, ζ0) +M for all ξ ∈ Ξ. Let K be any positive integer and consider the robustoptimization problem

vK := sup(ξik)i,k∈MK

1

NK

N∑i=1

K∑k=1

Ψ(ξik),

with uncertainty set

MK :=

{(ξik)i,k :

1

NK

N∑i=1

K∑k=1

dp(ξik, ξi)≤ θp, ξik ∈Ξ ∀ i, k}.

If λ∗ >κ, then there exists L′,M ′ > 0, such that

vK ≤ supµ∈M

Eµ[Ψ(ξ)] ≤ vK +M ′+L′D

NK,

where D is a constant independent of K. In addition, if Ξ is convex and Ψ is concave, thenv1 = vP = vD.

Statement (ii) shows that the worst-case distribution µ∗ is a perturbation of ν = 1N

∑N

i=1 δξi ,

where N − 1 out of the N points, {ξi}i 6=i0 , are perturbed with full mass 1/N to a maximizer ξi∗respectively, while at most one point ξi0 is split and perturbed to two maximizers ξi0

∗and ξ

i0∗ . (If

the set of maximizers is a singleton, then there is no need to split). Using this structure, we obtainstatement (iii), which suggests that the primal problem can be approximated by a robust programwith uncertainty set MK , which is a subset of M that contains all distributions supported on NKpoints with equal probability 1

NK. Particularly, when Ψ is concave, such approximation is exact;

and when Ψ is Lipschitz and p= 1, then v1 is an O(1/N)-approximation of vP = vD.

Remark 5. The results in Corollary 2 hold for arbitrary metric space Ξ. In fact, the Polish spaceassumption on Ξ is only used for the measurability results in Lemma 3, which becomes trivial infinite-supported case.

Remark 6. Under compactness assumption on Ξ, Wozabal [47] pointed out that to solve(Primal), it suffices to consider the set of extreme points of the Wasserstein ball M contains distri-butions that are supported on at most N+3 points. Later in Owhadi and Scovel [31], this result wasimproved to N +2 for Polish space or Borel subsets of Polish space. Statement (ii) further strengththese results — for arbitrary metric space (see Remark 5 above), it suffices to consider distribu-tions that are supported on at most N + 1 points, and such bound is tight as shown by Example 7below. Moreover, the weight of the extreme distribution do not change much as compared to thenominal distribution. As can be immediately seen from the proof, the result of statement (ii) canbe generalized as following. Suppose ν = 1

N

∑N

i=1 νiδξi , then whenever the worst-case distributionexists, there exists one of the form∑

i 6=i0

νiδξi∗ + p0νi0δξi0∗+ (1− p0)νi0δξi0∗

.

Page 17: Distributionally Robust Stochastic Optimization with ... · Distributionally Robust Stochastic Optimization with Wasserstein Distance Rui Gao, Anton J. Kleywegt School of Industrial

17

Remark 7 (Total Variation metric). By choosing the discrete metric d(ξ, ζ) = 1{ξ 6=ζ} on Ξ,the Wasserstein distance is equal to Total Variation distance (Gibbs and Su [22]), which can beused for the situation where the distance of perturbation does not matter and provides a ratherconservative decision. In this case, suppose θ is chosen such that Nθ is an integer, then there isno fractional point in (13) and the problem is reduced to the robust program with uncertainty setM1, whether Ξ (Ψ) is convex (concave) or not.

Proof of Corollary 2.(i) It follows directly from the proof of Theorem 1 and Proposition 2.(ii) By Corollary 1(iii), whenever the worst-case distribution exists, there is one supported on at

most 2N points and has the form

µ∗ =1

N

N∑i=1

piδξi∗+ (1− pi)δ

ξi∗, (14)

where pi ∈ [0,1], and ξi∗, ξi

∗ ∈ arg minξ∈Ξ{λ∗dp(ξ, ξi) − Ψ(ξ)}. (In fact, Corollary 1(iii) proves astronger statement that there exists a worst-case distribution such that all pi are equal, but herewe allow them to vary in order to obtain a worst-case distribution with a different form.) Given

ξi∗, ξi

∗ for all i, the problem

max0≤pi≤1

{ 1

N

N∑i=1

piΨ(ξi∗) + (1− pi)Ψ(ξ

i

∗) :1

N

N∑i=1

pidp(ξi∗, ξi) + (1− pi)dp(ξi∗, ξi)≤ θp

}is a linear program with N variables, one equality constraint and 2N inequality constraints pi ≤1, pi ≥ 1, i = 1, . . . ,N . Thus according to linear programming theory, there exists an optimalsolution such that among the 2N inequality constraints, at least N − 1 of them hold as equality,or equivalently, at most one pi is fractional. Therefore there exists a worst-case distribution whichis supported on at most N + 1 points, and has the form (13).

(iii) Note that by the assumption on Ψ and Remark 1 we have κ ≤ L < ∞. Also note that, using an idea similar to the proof of (ii) above, the distributions $\mu^\varepsilon_\delta(\lambda_1,\lambda_2)$, $(T_\lambda)_\#\nu$ and $\mu^R_\delta(\lambda,\varepsilon)$ defined in the proof of Theorem 1 can be written in the form
$$\frac1N\sum_{i=1}^N\big[p_{i1}\delta_{\xi^{i1}}+p_{i2}\delta_{\xi^{i2}}+p_{i3}\delta_{\xi^{i3}}\big],\qquad p_{i1}+p_{i2}+p_{i3}=1.$$
Given $\{\xi^{ij}:1\le i\le N,\ 1\le j\le3\}$, the problem
$$\max_{0\le p_{ij}\le1}\Big\{\frac1N\sum_{i=1}^N\sum_{j=1}^3 p_{ij}\Psi(\xi^{ij})\ :\ \frac1N\sum_{i=1}^N\sum_{j=1}^3 p_{ij}\,d^p(\xi^{ij},\widehat\xi^i)\le\theta^p\Big\}$$
is a linear program with 3N variables, one equality constraint, and 3N inequality constraints $p_{ij}\le1$, $p_{ij}\ge0$, $i=1,\dots,N$, $j=1,2,3$. Thus, by linear programming theory, there exists an optimal solution such that among the 3N inequality constraints at least 3N − 1 hold with equality, or equivalently, at most one $p_{ij}$ is fractional. Hence, for any ε-optimal solution µ, there exists a solution of the form
$$\mu_\varepsilon=\frac1N\sum_{i\neq i_0}\delta_{\xi^i_\varepsilon}+\frac{p_\varepsilon}{N}\delta_{\overline\xi^{i_0}_\varepsilon}+\frac{1-p_\varepsilon}{N}\delta_{\underline\xi^{i_0}_\varepsilon},$$
which yields an objective value no worse than µ. Define
$$\xi^{ik}=\begin{cases}\xi^i_\varepsilon, & 1\le k\le K,\ i\neq i_0,\\ \overline\xi^{i_0}_\varepsilon, & 1\le k\le\lfloor Kp_\varepsilon\rfloor,\ i=i_0,\\ \underline\xi^{i_0}_\varepsilon, & \lfloor Kp_\varepsilon\rfloor<k\le K,\ i=i_0.\end{cases}$$


Then $\{\xi^{ik}\}_{i,k}$ belongs to $M_K$. By Lemma 2(ii), for any $\lambda>\lambda_0\in\mathrm{dom}(\Phi(\cdot,\widehat\xi^{i_0}))$,
$$d^p(\overline\xi^{i_0}_\varepsilon,\widehat\xi^{i_0})\ \le\ \frac{1}{\lambda-\lambda_0}\big(\Phi(\lambda_0,\widehat\xi^{i_0})-\Psi(\widehat\xi^{i_0})\big)\ =:\ D.$$
Since $\big|p_\varepsilon-\lfloor Kp_\varepsilon\rfloor/K\big|<1/K$, it follows that
$$\big|v_K-\mathbb{E}_{\mu_\varepsilon}[\Psi(\xi)]\big|\ \le\ \frac1N\big|p_\varepsilon-\lfloor Kp_\varepsilon\rfloor/K\big|\cdot\big(\Psi(\overline\xi^{i_0}_\varepsilon)-\Psi(\underline\xi^{i_0}_\varepsilon)\big)\ \le\ \frac{1}{NK}\big(\Psi(\overline\xi^{i_0}_\varepsilon)-\Psi(\widehat\xi^{i_0})\big)\ \le\ \frac{M+L\,d^p(\overline\xi^{i_0}_\varepsilon,\widehat\xi^{i_0})}{NK}\ \le\ \frac{M+LD}{NK}.$$
Letting ε → 0, we obtain the result. □

Example 5 (Saddle-point problem). When Ψ(x, ξ) is convex in x and concave in ξ, p = 1, and $d=\|\cdot\|_2$, Corollary 2(iii) shows that the DRSO problem is equivalent to the convex-concave saddle-point problem
$$\min_{x\in X}\ \max_{(\xi^1,\dots,\xi^N)\in Y}\ \frac1N\sum_{i=1}^N\Psi(x,\xi^i),$$
with the ℓ1/ℓ2-norm uncertainty set
$$Y=\Big\{(\xi^1,\dots,\xi^N)\in\Xi^N\ :\ \sum_{i=1}^N\|\xi^i-\widehat\xi^i\|_2\le N\theta\Big\}.$$
Therefore it can be solved by the Mirror-Prox algorithm (Nemirovski [29], Nesterov and Nemirovski [30]).
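As an illustration, the sketch below evaluates the inner maximization over the ℓ1/ℓ2 uncertainty set Y for a fixed decision x, with a bilinear $\Psi(x,\xi)=x^\top A\xi$ (convex in x, linear and hence concave in ξ). The data A, the nominal points, and θ are illustrative assumptions, not taken from the paper; the outer minimization over x would wrap this step inside a saddle-point method such as Mirror-Prox.

```python
# Minimal sketch: inner worst-case expectation over the l1/l2 uncertainty set Y
# for a fixed decision x, assuming a bilinear loss Psi(x, xi) = x^T A xi.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, K = 5, 3
A = rng.normal(size=(K, K))                   # assumed loss matrix
xi_hat = rng.normal(size=(N, K))              # assumed nominal sample points
theta = 0.5
x = rng.normal(size=K)                        # a fixed first-stage decision

Xi = cp.Variable((N, K))                      # perturbed scenarios (xi^1, ..., xi^N)
objective = cp.Maximize(cp.sum(Xi @ (A.T @ x)) / N)
constraints = [cp.sum(cp.norm(Xi - xi_hat, 2, axis=1)) <= N * theta]
cp.Problem(objective, constraints).solve()
print("inner worst-case value:", objective.value)
```

Because the constraint is a single sum of Euclidean norms, the projection/maximization step remains a small second-order cone program regardless of the dimension K.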

Example 6 (Piecewise concave objective). Esfahani and Kuhn [20] prove that when p = 1, Ξ is a convex subset of $\mathbb{R}^K$ equipped with some norm $\|\cdot\|$, and $\Psi(\xi)=\max_{1\le j\le J}\Psi_j(\xi)$ with each $\Psi_j$ concave, the DRSO is equivalent to a convex program. We show here that this can be obtained as a corollary of the structure of the worst-case distribution. Indeed, using the concavity of the $\Psi_j$ and Corollary 2(i), it suffices to consider distributions of the form
$$\frac1N\sum_{i=1}^N\sum_{j=1}^J p_{ij}\,\delta_{\xi^{ij}},\qquad \sum_{j=1}^J p_{ij}=1,\quad \mathrm{card}\{j:p_{ij}>0\}\le2\ \text{ for each } i,$$
where card denotes cardinality. Relaxing the cardinality constraints yields the following problem:
$$\sup_{p_{ij}\ge0,\ \xi^{ij}\in\Xi}\Big\{\frac1N\sum_{i=1}^N\sum_{j=1}^J p_{ij}\Psi(\xi^{ij})\ :\ \frac1N\sum_{i=1}^N\sum_{j=1}^J p_{ij}\,d(\xi^{ij},\widehat\xi^i)\le\theta,\ \ \sum_{j=1}^J p_{ij}=1\ \forall i\Big\}.$$
Replacing $\xi^{ij}$ by $\widehat\xi^i+(\xi^{ij}-\widehat\xi^i)/p_{ij}$, by the positive homogeneity of norms and the convexity-preserving property of perspective functions (cf. Section 2.3.3 in Boyd and Vandenberghe [12]), we obtain an equivalent convex program reformulation of the primal problem:
$$\sup_{\substack{p_{ij}\ge0,\ \sum_j p_{ij}=1\\ \xi^{ij}\in\mathbb{R}^K}}\Big\{\frac1N\sum_{i=1}^N\sum_{j=1}^J p_{ij}\,\Psi_j\Big(\widehat\xi^i+\frac{\xi^{ij}-\widehat\xi^i}{p_{ij}}\Big)\ :\ \frac1N\sum_{i=1}^N\sum_{j=1}^J d(\xi^{ij},\widehat\xi^i)\le\theta,\ \ \widehat\xi^i+\frac{\xi^{ij}-\widehat\xi^i}{p_{ij}}\in\Xi\ \forall i,j\Big\}.$$


So we recover Theorem 4.5 in Esfahani and Kuhn [20], which was obtained there by a separate procedure of dualizing the reformulation (12) twice.
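As a sanity check of the reformulation, the following sketch solves the convex program in the special case where each piece $\Psi_j$ is affine, $\Xi=\mathbb{R}^K$, and d is the Euclidean norm, so that the perspective terms reduce to linear expressions. All data (the coefficients $a_j,b_j$, the samples, and θ) are illustrative assumptions.

```python
# Minimal sketch of the convex reformulation in Example 6 with affine pieces
# Psi_j(xi) = a_j^T xi + b_j and Xi = R^K.  After the perspective substitution,
# p_ij * Psi_j(xi_hat^i + zeta_ij / p_ij) = p_ij * (a_j^T xi_hat^i + b_j) + a_j^T zeta_ij,
# where zeta_ij := xi^{ij} - xi_hat^i, so the problem is a norm-constrained LP.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
N, K, J = 6, 2, 3
xi_hat = rng.normal(size=(N, K))              # assumed nominal samples
a = rng.normal(size=(J, K))                   # assumed slopes of the affine pieces
b = rng.normal(size=J)                        # assumed intercepts
theta = 0.3

p = cp.Variable((N, J), nonneg=True)          # mixture weights p_ij
zetas = [cp.Variable((J, K)) for _ in range(N)]

obj, cost = 0, 0
for i in range(N):
    obj += p[i, :] @ (a @ xi_hat[i] + b) + cp.sum(cp.multiply(a, zetas[i]))
    cost += cp.sum(cp.norm(zetas[i], 2, axis=1))   # sum_j ||zeta_ij||_2
constraints = [cp.sum(p, axis=1) == 1, cost / N <= theta]
cp.Problem(cp.Maximize(obj / N), constraints).solve()
print("worst-case expectation:", obj.value / N)
```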

Example 7 (Uncertainty quantification). When $\Xi=\mathbb{R}^K$ and $\Psi=-\mathbf{1}_C$, where C is an open set, the worst-case distribution µ∗ of the problem
$$\sup_{\mu\in M}\mathbb{E}_\mu[-\mathbf{1}_C(\xi)]\ =\ -\min_{\mu\in M}\mu(C)$$
has a clear interpretation. The worst-case distribution perturbs ν so that the set C contains as little probability mass as possible, which can be achieved in a greedy fashion as follows. Suppose $\{\widehat\xi^i\}_{i=1}^N$ are sorted such that $\widehat\xi^1,\dots,\widehat\xi^I\in C$, $\widehat\xi^{I+1},\dots,\widehat\xi^N\notin C$, and $d^p(\widehat\xi^1,\Xi\setminus C)\le\cdots\le d^p(\widehat\xi^I,\Xi\setminus C)$. Then, to save transportation budget, $\widehat\xi^{I+1},\dots,\widehat\xi^N$ stay where they are, and the points $\widehat\xi^i$ with smaller index have priority to be transported to ∂C. It may happen that some point $\widehat\xi^{i_0}$ ($i_0\le I$) cannot be transported to ∂C with its full mass 1/N, since otherwise the Wasserstein distance constraint would be violated. In this case, only part of the mass is transported and the rest stays (see Figure 3). Therefore the worst-case distribution has the form
$$\mu_*=\frac1N\sum_{i=1}^{i_0-1}\delta_{\xi^i_*}\;+\;\frac{p_0}{N}\delta_{\widehat\xi^{i_0}}\;+\;\frac{1-p_0}{N}\delta_{\xi^{i_0}_*}\;+\;\frac1N\sum_{i=i_0+1}^N\delta_{\widehat\xi^i}.$$
In fact, the dual optimizer $\lambda_*$ is such that
$$\xi^i_*\in\arg\min_{\xi\in\Xi}\{\lambda_* d^p(\xi,\widehat\xi^i)+\mathbf{1}_C(\xi)\}=\arg\min_{\xi\in\partial C} d^p(\xi,\widehat\xi^i),\qquad\forall i\le I,$$
and
$$\xi^{i_0}_*\in\arg\min_{\xi\in\Xi}\{\lambda_* d^p(\xi,\widehat\xi^{i_0})+\mathbf{1}_C(\xi)\}=\begin{cases}\{\widehat\xi^{i_0}\}\cup\arg\min_{\xi\in\partial C}d^p(\xi,\widehat\xi^{i_0}), & p_0\neq0,\\ \arg\min_{\xi\in\partial C}d^p(\xi,\widehat\xi^{i_0}), & p_0=0.\end{cases}$$

Figure 3. When $\Psi=-\mathbf{1}_C$, the worst-case distribution perturbs the nominal distribution in a greedy fashion. The solid and diamond dots are the support of the nominal distribution ν. $\widehat\xi^1,\widehat\xi^2,\widehat\xi^3$ are the three interior points closest to ∂C and thus are transported to $\xi^1_*,\xi^2_*,\xi^3_*$, respectively. $\widehat\xi^4$ is the fourth closest interior point to ∂C, but cannot be transported to ∂C with full mass because of the Wasserstein distance constraint, so it is split into $\overline\xi^4_*$ and $\underline\xi^4_*$.
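The greedy construction described above is straightforward to implement. The sketch below computes the worst-case probability $\min_{\mu\in M}\mu(C)$ in the one-dimensional case $\Xi=\mathbb{R}$, $C=(-1,1)$, and p = 1; the sample points and radius are illustrative assumptions.

```python
# Minimal sketch of the greedy worst-case probability in Example 7 for
# Xi = R, C = (-1, 1), p = 1: points inside C are pushed to the boundary,
# cheapest first, until the transport budget is exhausted.
import numpy as np

rng = np.random.default_rng(2)
xi = rng.uniform(-2, 2, size=12)              # assumed nominal sample points
theta, p = 0.15, 1

inside = np.abs(xi) < 1                       # points lying in C
dist = (1.0 - np.abs(xi[inside])) ** p        # d^p to the boundary of C
order = np.argsort(dist)

N = len(xi)
budget = N * theta ** p                       # budget in per-unit-mass transport cost
mass_in_C = float(inside.sum())               # in units of points, each of mass 1/N
for d in dist[order]:
    if budget >= d:                           # move the whole 1/N mass to the boundary
        budget -= d
        mass_in_C -= 1.0
    else:                                     # move only a fraction of the mass
        mass_in_C -= budget / d
        break
print("worst-case mu(C):", mass_in_C / N)
```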

Using a similar idea, we can prove that the worst-case probability is continuous with respect to the boundary.

Proposition 3 (Continuity with respect to the boundary). Let $\Xi=\mathbb{R}^K$, $\nu\in\mathcal{P}(\Xi)$, θ > 0, and $M=\{\mu\in\mathcal{P}(\Xi):W_p(\mu,\nu)\le\theta\}$. Then for any Borel set $C\subset\Xi$,
$$\inf_{\mu\in M}\mu(C)\ =\ \min_{\mu\in M}\mu(\mathrm{int}(C)).$$


Now let us consider the special case in which $\Xi=\{\xi^0,\dots,\xi^B\}$ for some positive integer B. In this case, let $N_i$ be the number of samples equal to $\xi^i$ and let $q_i=N_i/N$, $i=0,\dots,B$; then the nominal distribution is $\nu=\sum_{i=0}^B q_i\delta_{\xi^i}$. Write $\nu:=(q_0,\dots,q_B)^\top\in\Delta_B$ and identify $\mu\in\Delta_B$ with its weight vector $(p_0,\dots,p_B)$. The DRSO becomes
$$\min_{x\in X}\max_{\mu\in\Delta_B}\Big\{\sum_{i=0}^B p_i\Psi(x,\xi^i)\ :\ W_p(\mu,\nu)\le\theta\Big\}.\qquad (15)$$

Corollary 3. Problem (15) has a strong dual
$$\min_{x\in X,\ \lambda\ge0}\Big\{\lambda\theta^p+\sum_{i=0}^B q_iy_i\ :\ y_i\ge\Psi(x,\xi^j)-\lambda d^p(\xi^i,\xi^j),\ \forall\,i,j=0,\dots,B\Big\}.\qquad (16)$$
For any x, the worst-case distribution can be computed from
$$\max_{\mu\in\Delta_B,\ \gamma\in\mathbb{R}^{(B+1)\times(B+1)}_+}\Big\{\sum_{i=0}^B p_i\Psi(x,\xi^i)\ :\ \sum_{i,j}d^p(\xi^i,\xi^j)\gamma_{ij}\le\theta^p,\ \sum_j\gamma_{ij}=p_i\ \forall i,\ \sum_i\gamma_{ij}=q_j\ \forall j\Big\}.\qquad (17)$$

Proof. Reformulation (16) follows from Theorem 1, and (17) is obtained using the equivalent definition of the Wasserstein distance in Example 2. □
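For a fixed x, problem (17) is a small transportation-type linear program. The following sketch solves it with scipy; the support, nominal weights, loss values, θ, and p are illustrative assumptions.

```python
# Minimal sketch of (17): worst-case distribution over a finite support for a fixed x.
import numpy as np
from scipy.optimize import linprog

B = 10
xi = np.arange(B + 1, dtype=float)            # support points xi_0, ..., xi_B
q = np.full(B + 1, 1.0 / (B + 1))             # assumed nominal weights q_j
Psi = np.abs(xi - 4.0)                        # assumed losses Psi(x, xi_i) for the fixed x
theta, p = 1.0, 1.0

D = np.abs(xi[:, None] - xi[None, :]) ** p    # d^p(xi_i, xi_j)

# Variable gamma_{ij} >= 0 (flattened row-major); p_i = sum_j gamma_{ij}.
c = -np.repeat(Psi, B + 1)                    # maximize sum_i Psi_i * p_i
A_ub, b_ub = D.reshape(1, -1), [theta ** p]   # sum_{ij} d^p_{ij} gamma_{ij} <= theta^p
A_eq = np.zeros((B + 1, (B + 1) ** 2))        # column marginals: sum_i gamma_{ij} = q_j
for j in range(B + 1):
    A_eq[j, j::B + 1] = 1.0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=q, bounds=(0, None))

gamma = res.x.reshape(B + 1, B + 1)
p_worst = gamma.sum(axis=1)                   # worst-case probability weights
print("worst-case expectation:", Psi @ p_worst)
```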

4. Applications. In this section, we apply our results to on/off system control, intensity estimation, and worst-case value-at-risk analysis. In the first two problems, the nominal distribution is a point process, and the corresponding underlying space Ξ is the space of counting measures (sample paths), which is non-convex and infinite dimensional. In the third problem, the nominal distribution is an arbitrary probability distribution on a finite-dimensional space, such as a Gaussian distribution. Hence the results in Esfahani and Kuhn [20] and Zhao and Guan [50] cannot be applied.

4.1. On/Off System Control. In this problem, the decision maker faces a point process and controls a two-state (on/off) system. The point process is assumed to be exogenous, that is, the arrival times are not affected by the on/off state of the system. When the system is switched on, a cost of c per unit time is incurred, and each arrival while the system is on contributes one unit of revenue. When the system is off, no cost is incurred and no revenue is earned. The decision maker wants to choose a control that maximizes the total profit over a finite time horizon. This problem is a prototype for problems in sensor networks and revenue management.

In many practical settings, the decision maker does not have a probability distribution for the point process. Instead, the decision maker has observations of historical sample paths of the point process, which constitute an empirical point process. Note that if one were to use the Sample Average Approximation (SAA) method with the empirical point process, it would yield a degenerate control in which the system is switched on only at the arrival time points of the empirical point process. Consequently, if future arrival times differ from the empirical arrival times by even a little, the system would be switched off and no revenue would be earned. Because of this degeneracy and instability of the SAA method, we resort to the distributionally robust approach.

To model the problem, we scale the finite time horizon to [0,1], and let
$$\Xi=\Big\{\sum_{m=1}^M\delta_{\xi_m}\ :\ M\in\mathbb{Z}_+,\ \xi_m\in[0,1],\ m=1,\dots,M\Big\}$$
be the space of finite counting measures on [0,1]. The point processes on [0,1] are then defined by the set $\mathcal{P}(\Xi)$ of Borel probability measures on Ξ. To define the Wasserstein distance between two


point processes $\mu,\nu\in\mathcal{P}(\Xi)$, we need to define the metric d on the space Ξ of counting measures. We assume that the metric d on Ξ satisfies the following conditions (note that in this subsection, when we write the $W_1$ distance between two Borel measures, we use the extended definition mentioned in Section 2):
(i) The metric space (Ξ, d) is a Polish space.
(ii) For any $\eta=\sum_{m=1}^M\delta_{\zeta_m}$ and $\tilde\eta=\sum_{m=1}^M\delta_{\xi_m}$, where M is a nonnegative integer and $\{\zeta_m\}_{m=1}^M,\{\xi_m\}_{m=1}^M\subset[0,1]$, it holds that
$$d(\eta,\tilde\eta)=W_1(\eta,\tilde\eta)=\sum_{m=1}^M\big|\xi_{(m)}-\zeta_{(m)}\big|,$$
where $\xi_{(m)}$ (resp. $\zeta_{(m)}$) are the order statistics of $\{\xi_m\}$ (resp. $\{\zeta_m\}$).
(iii) For any Borel set $C\subset[0,1]$, θ ≥ 0, and $\eta=\sum_{m=1}^M\delta_{\zeta_m}$, where M is a positive integer and $\{\zeta_m\}_{m=1}^M\subset[0,1]$, it holds that
$$\inf_{\tilde\eta\in\Xi}\big\{\tilde\eta(C)\ :\ d(\eta,\tilde\eta)=\theta\big\}\ \ge\ \inf_{\tilde\eta\in\mathcal{B}([0,1])}\big\{\tilde\eta(C)\ :\ W_1(\eta,\tilde\eta)\le\theta\big\}.$$

We note that condition (ii) is only imposed on $\eta,\tilde\eta\in\Xi$ such that $\eta([0,1])=\tilde\eta([0,1])$. Possible choices for d are
$$d\Big(\sum_{m=1}^M\delta_{\xi_m},\sum_{l=1}^L\delta_{\zeta_l}\Big)=\sum_{m=1}^{\min\{M,L\}}\big|\xi_{(m)}-\zeta_{(m)}\big|+|M-L|,$$
$$d\Big(\sum_{m=1}^M\delta_{\xi_m},\sum_{l=1}^L\delta_{\zeta_l}\Big)=\begin{cases}\max\{M,L\}, & M\neq L,\\ \sum_{m=1}^M\big|\xi_{(m)}-\zeta_{(m)}\big|, & M=L,\end{cases}$$
or
$$d\Big(\sum_{m=1}^M\delta_{\xi_m},\sum_{l=1}^L\delta_{\zeta_l}\Big)=\begin{cases}+\infty, & M\neq L,\\ \sum_{m=1}^M\big|\xi_{(m)}-\zeta_{(m)}\big|, & M=L.\end{cases}\qquad (18)$$

These metrics are similar to the ones in Barbour and Brown [4] and Chen and Xia [14]. Given themetric d, we choose the distance between two point processes µ,ν ∈P(Ξ) to be W1(µ,ν) as definedin (1).

Suppose we have N sample paths $\eta^i=\sum_{m=1}^{M_i}\delta_{\xi^i_m}$, $i=1,\dots,N$, where $M_i$ is a nonnegative integer and $\xi^i_m\in[0,1]$ for all i, m. Then the nominal distribution is $\nu=\frac1N\sum_{i=1}^N\delta_{\eta^i}$, and the ambiguity set is $M=\{\mu\in\mathcal{P}(\Xi):W_1(\mu,\nu)\le\theta\}$. Let X denote the set of all functions $x:[0,1]\to\{0,1\}$ such that $x^{-1}(1)$ is a Borel set, where $x^{-1}(1):=\{t\in[0,1]:x(t)=1\}$. The decision maker looks for a control $x\in X$ that maximizes the total profit by solving
$$v^*:=\sup_{x\in X}\Big\{v(x):=-c\int_0^1 x(t)\,dt+\inf_{\mu\in M}\mathbb{E}_{\eta\sim\mu}\big[\eta(x^{-1}(1))\big]\Big\}.\qquad (19)$$

We now investigate the structure of the optimal control. Let $\mathrm{int}(x^{-1}(1))$ be the interior of the set $x^{-1}(1)$ in the space [0,1] with its canonical topology (so that $0,1\in\mathrm{int}([0,1])$).

Proposition 4. For any $\nu\in\mathcal{P}(\Xi)$ and control x, it holds that
$$\inf_{\mu\in M}\mathbb{E}_{\eta\sim\mu}[\eta(x^{-1}(1))]=\inf_{\rho\in\mathcal{P}(\mathcal{B}([0,1])\times\Xi)}\Big\{\mathbb{E}_{(\tilde\eta,\eta)\sim\rho}\big[\tilde\eta(\mathrm{int}(x^{-1}(1)))\big]\ :\ \mathbb{E}_{(\tilde\eta,\eta)\sim\rho}\big[W_1(\tilde\eta,\eta)\big]\le\theta,\ \pi^2_\#\rho=\nu\Big\}.\qquad (20)$$
Suppose $\nu=\frac1N\sum_{i=1}^N\delta_{\eta^i}$ with $\eta^i=\sum_{m=1}^{M_i}\delta_{\xi^i_m}$. Then there exists a nonnegative integer M such that
$$v^*=\sup_{\substack{\underline{x}_j,\overline{x}_j\in[0,1],\\ \underline{x}_j<\overline{x}_j<\underline{x}_{j'}<\overline{x}_{j'},\ \forall\,1\le j<j'\le M}}\Big\{v\Big(\sum_{j=1}^M\mathbf{1}_{[\underline{x}_j,\overline{x}_j]}\Big):=-c\sum_{j=1}^M(\overline{x}_j-\underline{x}_j)+\inf_{\mu\in M}\mathbb{E}_{\eta\sim\mu}\big[\eta\{\cup_{j=1}^M[\underline{x}_j,\overline{x}_j]\}\big]\Big\}.\qquad (21)$$


Note that
$$\inf_{\mu\in M}\mathbb{E}_\mu[\eta(x^{-1}(1))]=\inf_{\gamma\in\mathcal{P}(\Xi^2)}\Big\{\mathbb{E}_{(\tilde\eta,\eta)\sim\gamma}[\tilde\eta(x^{-1}(1))]\ :\ \mathbb{E}_\gamma[d(\tilde\eta,\eta)]\le\theta,\ \pi^2_\#\gamma=\nu\Big\}.$$
Hence (20) shows that, without changing the optimal value, we can replace d by $W_1$ in the constraint and enlarge the set of joint distributions from $\mathcal{P}(\Xi^2)$ to $\mathcal{P}(\mathcal{B}([0,1])\times\Xi)$. Moreover, (21) shows that it suffices to consider the set of policies for which the on-state duration is a finite disjoint union of intervals of positive length. We next show that, given a control $\sum_{j=1}^M\mathbf{1}_{[\underline{x}_j,\overline{x}_j]}$, the computation of the worst-case point process reduces to a linear program. For every $1\le i\le N$ and $1\le m\le M_i$, if $\xi^i_m\in\cup_{j=1}^M[\underline{x}_j,\overline{x}_j]$, we set $j^i_m\in\{1,\dots,M\}$ to be such that $\xi^i_m\in[\underline{x}_{j^i_m},\overline{x}_{j^i_m}]$; otherwise $j^i_m=0$. We also set $\overline{x}_0$ to be any real number.

Proposition 5. The objective $v\big(\sum_{j=1}^M\mathbf{1}_{[\underline{x}_j,\overline{x}_j]}\big)$ defined in (21) can be written as
$$\sum_{j=1}^M\Big[-c(\overline{x}_j-\underline{x}_j)+\frac1N\sum_{i=1}^N\sum_{m=1}^{M_i}\mathbf{1}_{[\underline{x}_j,\overline{x}_j]}(\xi^i_m)\Big]+\min_{\substack{\underline{p}^i_m,\overline{p}^i_m\ge0\\ \underline{p}^i_m+\overline{p}^i_m\le1}}\Big\{-\frac1N\sum_{i=1}^N\sum_{\substack{1\le m\le M_i:\\ j^i_m>0}}(\underline{p}^i_m+\overline{p}^i_m)\ :\ \frac1N\sum_{i=1}^N\sum_{\substack{1\le m\le M_i:\\ j^i_m>0}}\big(\underline{p}^i_m\,|\underline{x}_{j^i_m}-\xi^i_m|+\overline{p}^i_m\,|\overline{x}_{j^i_m}-\xi^i_m|\big)\le\theta\Big\}.$$
Moreover, the above linear program can be solved by a greedy algorithm (see Algorithm 1), and there exists a worst-case point process of the form
$$\mu_*(x)=\frac1N\sum_{\substack{i=1\\ i\neq i_0}}^N\delta_{\eta^i_*}+\frac{p_0}{N}\delta_{\overline\eta^{i_0}_*}+\frac{1-p_0}{N}\delta_{\underline\eta^{i_0}_*},$$
where $i_0\in\{1,\dots,N\}$, $\eta^i_*\in\Xi$ with $\eta^i_*([0,1])=\eta^i([0,1])$ for all $i\neq i_0$, and $\overline\eta^{i_0}_*,\underline\eta^{i_0}_*\in\Xi$ with $\overline\eta^{i_0}_*([0,1])=\underline\eta^{i_0}_*([0,1])=\eta^{i_0}([0,1])$.

Algorithm 1 Greedy Algorithm

1: $\tilde\theta\leftarrow0$; $k\leftarrow1$; $\underline{p}^i_m,\overline{p}^i_m\leftarrow0$; $d^i_m\leftarrow\min\big(|\underline{x}_{j^i_m}-\xi^i_m|,\,|\overline{x}_{j^i_m}-\xi^i_m|\big)$ for all i, m.
2: Sort $\{d^i_m\}_{1\le i\le N,\,1\le m\le M_i}$ in increasing order, denoted $d^{i(1)}_{m(1)}\le d^{i(2)}_{m(2)}\le\cdots\le d^{i(\sum_i M_i)}_{m(\sum_i M_i)}$.
3: while $\tilde\theta<N\theta$ do
4:   if $d^{i(k)}_{m(k)}=|\underline{x}_{j^{i(k)}_{m(k)}}-\xi^{i(k)}_{m(k)}|$ then $\underline{p}^{i(k)*}_{m(k)}\leftarrow\min\big(1,(N\theta-\tilde\theta)/d^{i(k)}_{m(k)}\big)$.
5:   else $\overline{p}^{i(k)*}_{m(k)}\leftarrow\min\big(1,(N\theta-\tilde\theta)/d^{i(k)}_{m(k)}\big)$.
6:   end if
7:   $\tilde\theta\leftarrow\tilde\theta+d^{i(k)}_{m(k)}$; $k\leftarrow k+1$.
8: end while
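A minimal implementation of the greedy step is sketched below: every arrival that falls in an on-interval is pushed to the nearer endpoint of its interval, cheapest arrivals first, until the budget Nθ is exhausted, with the last arrival moved only fractionally. The sample paths, intervals, and θ are illustrative assumptions.

```python
# Minimal sketch of the greedy solution of the linear program in Proposition 5.
import numpy as np

paths = [np.array([0.12, 0.35, 0.80]),        # assumed empirical sample paths
         np.array([0.05, 0.42]),
         np.array([0.33, 0.71, 0.95])]
intervals = [(0.1, 0.2), (0.3, 0.5)]          # assumed on-intervals [x_low_j, x_up_j]
theta = 0.04
N = len(paths)

costs = []                                    # cheapest exit cost per covered arrival
for eta in paths:
    for t in eta:
        for lo, hi in intervals:
            if lo <= t <= hi:
                costs.append(min(t - lo, hi - t))
costs.sort()

budget, removed = N * theta, 0.0
for d in costs:
    if budget >= d:
        budget -= d
        removed += 1.0                        # full 1/N mass leaves the on-set
    else:
        removed += budget / d if d > 0 else 1.0
        break
print("worst-case expected number of arrivals in the on-set:", (len(costs) - removed) / N)
```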

Example 8. We illustrate our results as follows. Suppose the number of arrivals has a Poisson distribution Poisson(λ), and, given the number of arrivals, the arrival times are i.i.d. with density f(t), t ∈ [0,1]. Then problem (19) becomes $\max_x\int_{x^{-1}(1)}[-c+\lambda f(t)]\,dt$, with optimal control $x^*(t)=\mathbf{1}\{\lambda f(t)>c\}$. Note that f ≡ 1 corresponds to the Poisson point process with rate λ. In this example, we instead consider $f(t)=k[a+\sin(wt+s)]$, with a > 1 and $k=1/[a+(\cos(s)-\cos(w+s))/w]$. In particular, let


Figure 4. Optimal control for the true process and the DRSO. (Top panel: the arrival time density f(t) and the true optimal control over t ∈ [0,1]; bottom panel: the sample paths.)

$w=5\pi$, $s=\tfrac{5}{2}\pi$, $a=1.1$, and $c=\lambda=10$, so that $x^{*-1}(1)=[0,0.1]\cup[0.3,0.5]\cup[0.7,0.9]$. In the numerical experiment, suppose we have N = 10 sample paths, each containing $M_i\sim\mathrm{Poisson}(\lambda)$, $i=1,\dots,N$, i.i.d. arrival time points. The optimal controls for the true process and the DRSO are shown in Figure 4. We observe that even with a relatively small number of samples, the two controls do not differ much from each other, and thus the DRSO indeed provides a good solution to the original process control problem.

4.2. Intensity Estimation for a Non-homogeneous Poisson Process. Consider estimating the intensity function a(t) of a non-homogeneous Poisson process A(t) using the maximum likelihood method. Given N i.i.d. sample paths $\eta^i=\sum_{m=1}^{M_i}\delta_{\xi^i_m}$, $i=1,\dots,N$, the log-likelihood function (see, e.g., Daley and Vere-Jones [15]) is written as
$$\int_0^T-a(t)\,dt+\frac1N\sum_{i=1}^N\sum_{m=1}^{M_i}\ln\big(a(\xi^i_m)\big).$$
A common practice is to partition the time horizon [0, T] into several intervals and assume a(t) is constant on each interval. Then the maximum likelihood estimator equals the average arrival rate on each interval. Such an approach suffers from the drawback that the estimator is sensitive to the partition into intervals. If the partition is so fine that many intervals have zero observations, then the estimator also vanishes on those intervals. On the other hand, if the partition is very coarse, then the estimator remains constant over long intervals, which may not reflect reality. There appears to be no systematic way to adaptively choose the partition for this sample average method. Meanwhile, a distributionally robust formulation with φ-divergence has the same problem, since the resulting estimator vanishes on intervals with zero observations.

Consider the distributionally robust formulation with Wasserstein distance
$$\min_{a(t)}\Big\{\int_0^T a(t)\,dt+\max_{\mu\in M}\mathbb{E}_{\eta\sim\mu}\Big[\int_0^T-\ln(a(t))\,\eta(dt)\Big]\Big\},\qquad (22)$$
where M is the same as in the previous subsection, namely, the Wasserstein ball centered at the empirical process. To facilitate further analysis, we choose (18) as the definition of the distance between two counting measures. Our strong duality results imply that the dual reformulation of (22) is given by
$$\min_{a(t),\ \lambda\ge0}\Big\{\int_0^T a(t)\,dt+\lambda\theta+\frac1N\sum_{i=1}^N\sum_{1\le m\le M_i}\sup_{\xi\in[0,T]}\big\{-\ln(a(\xi))-\lambda|\xi-\xi^i_m|\big\}\Big\}.$$
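For a piecewise-constant a(t), the inner supremum over ξ ∈ [0, T] reduces to a maximum over the pieces, and the dual reformulation becomes a finite-dimensional convex program. The sketch below solves it with cvxpy; the sample paths, horizon, number of pieces, and θ are illustrative assumptions.

```python
# Minimal sketch of the dual reformulation of (22) with a piecewise-constant intensity.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
T, P, N, theta = 10.0, 20, 20, 0.5
paths = [np.sort(rng.uniform(0, T, rng.poisson(8))) for _ in range(N)]  # assumed data

edges = np.linspace(0, T, P + 1)
length = np.diff(edges)
a = cp.Variable(P, pos=True)                  # piecewise-constant intensity values
lam = cp.Variable(nonneg=True)

terms = []
for eta in paths:
    for t in eta:
        # distance from the arrival time t to each piece (0 if t lies in the piece)
        dist = np.maximum(edges[:-1] - t, 0) + np.maximum(t - edges[1:], 0)
        terms.append(cp.max(-cp.log(a) - lam * dist))
objective = length @ a + lam * theta + cp.sum(cp.hstack(terms)) / N
cp.Problem(cp.Minimize(objective)).solve()
print("DRSO intensity estimate:", np.round(a.value, 3))
```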

The following proposition shows that the optimal estimator is constant if the radius of the Wasserstein ball is large enough.

Proposition 6. For sufficiently large θ, the optimal solution a∗(t) is constant.


Figure 5. Estimation of the intensity function using DRSO and SAA. (Six panels labeled "Pieces = 20/50/100" for each of the two true intensities; each panel plots intensity against time over [0,10] for the true intensity, the Wasserstein estimator, and the SAA estimator.)

To solve the problem numerically, let us assume a(t) is piecewise constant. In our numerical experiments, we assume the underlying true intensity function is given by a(t) = 0.5 + 0.5t or a(t) = 1 + sin(πt), t ∈ [0,10]. We fix the sample size (number of sample paths) at N = 20 and vary the number of pieces in {20, 50, 100}. The radius θ is chosen via cross-validation, for which half of the sample paths are used for training and the remaining half for calibration. The out-of-sample performance is measured in terms of the L2 distance between the estimated intensity and the true intensity. The estimation results and out-of-sample performances are shown in Figure 5 and Table 1. We observe that the DRSO with Wasserstein distance has superior out-of-sample performance in all cases. The shape of the DRSO estimator is insensitive to the fineness of the partition for the piecewise constant function. In contrast, the maximum likelihood estimator behaves poorly if the partition is not chosen correctly, for example, when the number of pieces is too large.

Table 1. Out-of-sample performance of DRSO and SAA

                      0.2 + 0.2t                 1 + sin(πt)
Pieces            20      50      100        20      50      100
Wasserstein     0.394   0.481    0.544     2.008   2.122    2.276
SAA             1.510   6.536   11.906     6.160   6.591   11.766

4.3. Worst-case Value-at-Risk. Value-at-risk is a popular risk measure in the financial industry. Given a random variable Z and α ∈ (0,1), the value-at-risk $\mathrm{VaR}_\alpha[Z]$ of Z with respect to the measure ν is defined by
$$\mathrm{VaR}_\alpha[Z]:=\inf\{t:\mathbb{P}_\nu\{Z\le t\}\ge1-\alpha\}.$$

In the spirit of El Ghaoui et al. [18], we consider the following worst-case VaR problem. Suppose we are given a portfolio consisting of n assets with allocation weights w satisfying $\sum_{i=1}^n w_i=1$ and $w\ge0$. Let $\xi_i$ be the (random) return rate of asset i, $i=1,\dots,n$, and let $r=\mathbb{E}[\xi]$ be the vector of expected return rates. Assume the metric d is induced by the infinity norm $\|\cdot\|_\infty$ on $\mathbb{R}^K$. The worst-case VaR with respect to the set of probability distributions M is defined as
$$\mathrm{VaR}^{wc}_\alpha(w):=\min_q\Big\{q\ :\ \inf_{\mu\in M}\mathbb{P}_\mu\{-w^\top\xi\le q\}\ge1-\alpha\Big\}.$$

Proposition 7. Let $q\ge\mathrm{VaR}_\alpha[-w^\top\xi]$, θ > 0, α ∈ (0,1), and $w\in\{w'\in\mathbb{R}^n:\sum_{i=1}^n w'_i=1,\ w'\ge0\}$. Define
$$\beta_0:=\min\Bigg(1,\ \frac{\big(\alpha-\nu\{\xi:-w^\top\xi>\mathrm{VaR}_\alpha[-w^\top\xi]\}\big)\,\big(q-\mathrm{VaR}_\alpha[-w^\top\xi]\big)^p}{\Big|\theta^p-\mathbb{E}_\nu\big[(q+w^\top\xi)^p\,\mathbf{1}_{\{-w^\top\xi>\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big]\Big|}\Bigg).$$
Then $\inf_{\mu\in M}\mathbb{P}_\mu\{-w^\top\xi\le q\}\ge1-\alpha$ is equivalent to
$$\mathbb{E}_\nu\big[((q+w^\top\xi)^+)^p\,\mathbf{1}_{\{-w^\top\xi>\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big]+\beta_0\,\mathbb{E}_\nu\big[((q+w^\top\xi)^+)^p\,\mathbf{1}_{\{-w^\top\xi=\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big]\ \ge\ \theta^p.$$
In particular, when ν is a continuous distribution, the condition above reduces to
$$\mathbb{E}_\nu\big[((q+w^\top\xi)^+)^p\,\mathbf{1}_{\{-w^\top\xi\ge\mathrm{VaR}_\alpha[-w^\top\xi]\}}\big]\ \ge\ \theta^p.$$

Figure 6. Worst-case VaR. When $-w^\top\xi$ is continuously distributed and p = 1, $\mathrm{VaR}^{wc}_\alpha$ equals the value q for which the area of the shaded region equals θ.

Example 9 (Worst-case VaR with Gaussian nominal distribution). Suppose $\nu\sim N(\mu,\Sigma)$ and consider p = 1. It follows that $-w^\top\xi\sim N(-w^\top\mu,\,w^\top\Sigma w)$ and $\mathrm{VaR}_\alpha[-w^\top\xi]=-w^\top\mu+\sqrt{w^\top\Sigma w}\,\Phi^{-1}(1-\alpha)$. By Proposition 7, $\mathrm{VaR}^{wc}_\alpha(-w^\top\xi)$ is the minimal q such that (see Figure 6)
$$f(q):=\frac{1}{\sqrt{2\pi w^\top\Sigma w}}\int_{\mathrm{VaR}_\alpha[-w^\top\xi]}^q(q-y)\,e^{-\frac{(y+w^\top\mu)^2}{2w^\top\Sigma w}}\,dy\ \ge\ \theta.\qquad (23)$$
Since f(q) is monotone, (23) can be solved efficiently via any one-dimensional search algorithm.
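The following sketch carries out this one-dimensional search: it evaluates f(q) by numerical integration and finds the root of f(q) − θ by bracketing and Brent's method. The portfolio data, α, and θ are illustrative assumptions.

```python
# Minimal sketch of solving (23) by a one-dimensional search for the Gaussian nominal case.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

w = np.array([0.5, 0.5])                      # assumed portfolio weights
mu = np.array([0.05, 0.08])                   # assumed mean returns
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])
alpha, theta = 0.05, 0.002

m, s2 = -w @ mu, w @ Sigma @ w                # -w^T xi ~ N(m, s2)
var_alpha = m + np.sqrt(s2) * norm.ppf(1 - alpha)

def f(q):
    # f(q) = (2*pi*s2)^{-1/2} * int_{VaR}^{q} (q - y) exp(-(y - m)^2 / (2 s2)) dy
    integrand = lambda y: (q - y) * np.exp(-(y - m) ** 2 / (2 * s2))
    return quad(integrand, var_alpha, q)[0] / np.sqrt(2 * np.pi * s2)

hi = var_alpha + 1.0
while f(hi) < theta:                          # grow the bracket until f exceeds theta
    hi += 1.0
q_wc = brentq(lambda q: f(q) - theta, var_alpha, hi)
print("worst-case VaR:", q_wc)
```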

We remark that the above result indicates that finding the worst-case VaR is tractable. It should be noted, however, that finding the best allocation weights, i.e., optimizing over w, is still hard, since the VaR constraint is essentially a chance constraint.

5. Discussions. In this section, we discuss some advantages of the Wasserstein ambiguity set. In Section 5.1, we compare the Wasserstein ambiguity set with φ-divergence ambiguity sets for the newsvendor problem. In Section 5.2, we illustrate how the close connection between DRSO and robust programming (Corollary 2(iii)) can expand the tractability of DRSOs.


Table 2. Examples of φ-divergence

Divergence          φ          φ(t), t ≥ 0       I_φ(µ,ν)
Kullback-Leibler    φ_kl       t log t           ∑_j p_j log(p_j/q_j)
Burg entropy        φ_b        − log t           ∑_j q_j log(q_j/p_j)
χ²-distance         φ_χ²       (t−1)²/t          ∑_j (p_j−q_j)²/p_j
Modified χ²         φ_mχ²      (t−1)²            ∑_j (p_j−q_j)²/q_j
Hellinger           φ_h        (√t−1)²           ∑_j (√p_j−√q_j)²
Total Variation     φ_tv       |t−1|             ∑_j |p_j−q_j|

5.1. Newsvendor problem: a comparison to φ-divergence. In this subsection, we discuss some advantages of the Wasserstein ambiguity set by performing a numerical study on distributionally robust newsvendor problems, with an emphasis on the worst-case distribution.

In the newsvendor model, the decision maker must decide the inventory level before the random demand is realized, facing both overage and underage costs. The problem can be formulated as
$$\min_{x\ge0}\ \mathbb{E}_\mu\big[h(x-\xi)^++b(\xi-x)^+\big],$$
where x is the decision variable for the initial inventory level, ξ is the random demand, and h, b represent the overage and underage costs per unit, respectively. We assume that b, h > 0 and that ξ is supported on {0, 1, ..., B} for some positive integer B. Sometimes demand data are expensive to obtain; for instance, when a company introduces a new product, demand data may be collected by setting up pilot stores. Then the decision maker may want to consider the DRSO counterpart
$$\min_{x\ge0}\ \sup_{\mu\in\Delta_B}\big\{\mathbb{E}_\mu[h(x-\xi)^++b(\xi-x)^+]\ :\ W_p(\mu,\nu)\le\theta\big\}.$$

Using Corollary 3, we obtain a convex programming reformulation
$$\min_{x,\lambda\ge0}\Big\{\lambda\theta^p+\sum_{i=0}^B q_iy_i\ :\ y_i\ge\max\big[h(x-j),\,b(j-x)\big]-\lambda|i-j|^p,\ \forall\,0\le i,j\le B\Big\}.$$
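The reformulation above is a linear program in (x, λ, y) once p = 1 is fixed; the sketch below solves it with cvxpy. The empirical demand sample, costs, B, and θ are illustrative assumptions.

```python
# Minimal sketch of the distributionally robust newsvendor reformulation for p = 1.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
B, N, theta, b, h = 100, 50, 2.0, 1.0, 1.0
demand = rng.binomial(100, 0.5, N)            # assumed demand sample on {0, ..., B}
q = np.bincount(demand, minlength=B + 1) / N  # empirical weights q_i

x = cp.Variable(nonneg=True)
lam = cp.Variable(nonneg=True)
y = cp.Variable(B + 1)
j = np.arange(B + 1)
dist = np.abs(j[:, None] - j[None, :])        # |i - j|

constraints = []
for i in range(B + 1):
    constraints += [y[i] >= h * (x - j) - lam * dist[i],   # overage piece, all j at once
                    y[i] >= b * (j - x) - lam * dist[i]]   # underage piece
cp.Problem(cp.Minimize(lam * theta + q @ y), constraints).solve()
print("robust order quantity:", float(x.value))
```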

On the other hand, one might also consider a φ-divergence ambiguity set (Table 2 shows some common φ-divergences). As mentioned in Section 1, the worst-case distribution in a φ-divergence ambiguity set may be problematic. Indeed, when $\lim_{t\to\infty}\phi(t)/t=\infty$, such as for $\phi_{kl}$ and $\phi_{m\chi^2}$, the φ-divergence ambiguity set fails to include sufficiently many relevant distributions. In fact, since $0\,\phi(p_j/0)=p_j\lim_{t\to\infty}\phi(t)/t=\infty$ for all $p_j>0$, the φ-divergence ambiguity set does not include any distribution that is not absolutely continuous with respect to the nominal distribution ν.

When $\lim_{t\to\infty}\phi(t)/t<\infty$, such as for $\phi_b$, $\phi_{\chi^2}$, $\phi_h$, $\phi_{tv}$, the situation is even worse. Define $I_0:=\{1\le j\le N:q_j>0\}$ and $j_M:=\arg\max_{1\le j\le N}\{\Psi(\xi^j):q_j=0\}$. Assume the values $\Psi(\xi^j)$ are all distinct; then, according to Ben-Tal et al. [6] and Bayraksan and Love [5], the worst-case distribution satisfies
$$p^*_j/q_j\in\partial\phi^*\Big(\frac{\Psi(\xi^j)-\beta^*}{\lambda^*}\Big),\qquad\forall j\in I_0,\qquad (24a)$$
$$p^*_j=0,\qquad\forall j\notin I_0\cup\{j_M\},\qquad (24b)$$
$$p^*_{j_M}=\begin{cases}1-\sum_{j\in I_0}p^*_j, & \text{if }\beta^*=\Psi(\xi^{j_M})-\lambda^*\lim_{t\to\infty}\phi(t)/t,\\ 0, & \text{if }\beta^*>\Psi(\xi^{j_M})-\lambda^*\lim_{t\to\infty}\phi(t)/t,\end{cases}\qquad (24c)$$
for some $\lambda^*\ge0$ and $\beta^*\ge\Psi(\xi^{j_M})-\lambda^*\lim_{t\to\infty}\phi(t)/t$. Equation (24b) shows that the supports of the worst-case distribution and of the nominal distribution can differ by at most one point, $\xi^{j_M}$. If $p^*_{j_M}>0$, (24c) shows that probability mass is moved away from the scenarios in $I_0$ to the worst scenario $\xi^{j_M}$. Note that in many applications where the support of ξ is unknown, the choice of the underlying space Ξ (and thus of $\xi^{j_M}$) may be arbitrary. Hence the worst-case behavior is sensitive to the specification of Ξ and the shape of the function Ψ.


Figure 7. Histograms of worst-case distributions resulting from the Wasserstein distance and from the Burg entropy. Panels: (a) Binomial(200, 0.5), N = 500; (b) Binomial(200, 0.5), N = 50; (c) truncated Geometric(0.1), N = 500; (d) truncated Geometric(0.1), N = 50. Each panel plots relative frequency against random demand for the Wasserstein worst-case distribution, the Burg-entropy worst-case distribution, and the empirical distribution.

We perform a numerical test whose setup is similar to Wang et al. [46] and Ben-Tal et al. [6]. We set b = h = 1, B = 100, and N ∈ {50, 500}, representing small and large datasets. The data are then generated from Binomial(100, 0.5) and from Geometric(0.1) truncated to [0, 100]. For a fair comparison, we calibrate the radius of the ambiguity set so that it covers the underlying distribution with probability greater than 95%.

When the underlying distribution is Binomial, intuitively, the symmetry of the Binomial distribution and b = h = 1 imply that the optimal initial inventory level is close to B/2 = 50, and the corresponding worst-case distribution should resemble a mixture distribution with two modes, representing high and low demand respectively. This intuition is consistent with the solid curves in Figures 7a and 7b, which show the worst-case distribution in the Wasserstein ambiguity set. In addition, their tails are smooth and reasonable for both small and large datasets. In contrast, if the Burg entropy is used to define the ambiguity set (dashed curves in Figures 7a and 7b), the worst-case distribution has disconnected support and is not symmetric. There is a large spike at the boundary point 100, corresponding to the "popping" behavior mentioned in Bayraksan and Love [5]. Especially when the dataset is small, the spike is huge, which makes the solution overly conservative.

When the underlying distribution is Geometric, intuitively, the worst-case distribution should have one spike for low demand and a heavy tail for high demand. Again, this is consistent with the worst-case distribution in the Wasserstein ambiguity set (solid curves in Figures 7c and 7d). When the Burg entropy is used (dashed curves in Figures 7c and 7d), the tail has unrealistic spikes at the boundary. For a distribution with unbounded support, the tail of the worst-case distribution is very sensitive to the choice of the truncation threshold B. Hence, the conclusion from this numerical test is that the Wasserstein ambiguity set is likely to yield a more reasonable, robust, and realistic worst-case distribution.


5.2. Two-stage DRSO: connection with robust optimization. In Corollary 2(iii) we established the close connection between the DRSO problem and robust programming. More specifically, we showed that every DRSO problem can be approximated by robust programs to rather high accuracy, which significantly enlarges the applicability of the DRSO approach. To illustrate this point, in this section we show the tractability of two-stage linear DRSOs.

Consider the two-stage distributionally robust stochastic optimization problem
$$\min_{x\in X}\ c^\top x+\sup_{\mu\in M}\mathbb{E}_\mu[\Psi(x,\xi)],\qquad (25)$$
where Ψ(x, ξ) is the optimal value of the second-stage problem
$$\min_{y\in\mathbb{R}^m}\big\{q(\xi)^\top y\ :\ T(\xi)x+W(\xi)y\le h(\xi)\big\},$$
and
$$q(\xi)=q^0+\sum_{l=1}^s\xi_lq^l,\quad T(\xi)=T^0+\sum_{l=1}^s\xi_lT^l,\quad W(\xi)=W^0+\sum_{l=1}^s\xi_lW^l,\quad h(\xi)=h^0+\sum_{l=1}^s\xi_lh^l.$$

We assume p = 2 and $\Xi=\mathbb{R}^K$ with the Euclidean distance d. In general, the two-stage problem (25) is NP-hard. However, we will show that, with tools from robust programming, we can obtain a tractable approximation of (25). Let $M_1:=\{(\xi^1,\dots,\xi^N)\in\Xi^N:\frac1N\sum_{i=1}^N\|\xi^i-\widehat\xi^i\|_2^2\le\theta^2\}$. Using Theorem 2(ii) with K = 1, we obtain the adjustable-robust-programming approximation
$$\min_{x\in X}\Big\{c^\top x+\sup_{(\xi^i)_i\in M_1}\frac1N\sum_{i=1}^N\Psi(x,\xi^i)\Big\}
=\min_{\substack{x\in X,\ t\in\mathbb{R}\\ y:\,\Xi\to\mathbb{R}^m}}\Big\{t\ :\ t\ge c^\top x+\frac1N\sum_i q(\xi^i)^\top y(\xi^i)\ \ \forall(\xi^i)_i\in M_1,\ \ T(\xi)x+W(\xi)y(\xi)\le h(\xi)\ \ \forall\xi\in\bigcup_{i=1}^N\{\xi'\in\Xi:\|\xi'-\widehat\xi^i\|_2\le\theta\sqrt N\}\Big\},\qquad (26)$$

where the second set of inequalities follows from the fact that $T(\xi)x+W(\xi)y(\xi)\le h(\xi)$ should hold for every realization ξ that receives positive probability under some distribution in $M_1$. Although problem (26) is still intractable in general, there is a substantial literature on approximations to problem (26). One popular approach is to consider the so-called affinely adjustable robust counterpart (AARC), as follows. We assume that y is an affine function of ξ:
$$y(\xi)=y^0+\sum_{l=1}^s\xi_ly^l,\qquad\forall\xi\in\bigcup_{i=1}^N B_i,$$
for some $y^0,y^l\in\mathbb{R}^m$, where $B_i:=\{\xi'\in\Xi:\|\xi'-\widehat\xi^i\|_2\le\theta\sqrt N\}$. Then the AARC of (26) is

$$\min_{\substack{x\in X,\ t\in\mathbb{R}\\ y^l\in\mathbb{R}^m,\ l=0,\dots,s}}\Big\{t\ :\ c^\top x+\frac1N\sum_{i=1}^N\Big(q^0+\sum_{l=1}^s\xi^i_lq^l\Big)^{\!\top}\Big(y^0+\sum_{l=1}^s\xi^i_ly^l\Big)-t\le0\ \ \forall(\xi^i)_i\in M_1,$$
$$\qquad\Big(T^0+\sum_{l=1}^s\xi_lT^l\Big)x+\Big(W^0+\sum_{l=1}^s\xi_lW^l\Big)\Big(y^0+\sum_{l=1}^s\xi_ly^l\Big)-\Big(h^0+\sum_{l=1}^s\xi_lh^l\Big)\le0\ \ \forall\xi\in\bigcup_{i=1}^NB_i\Big\}.$$

Uζ = {(ζil)i,l :∑i,l

ζ2il ≤Nθ2}.


Set z = [x; t;{yl}sl=0], and define

α0(z) :=−[c>x+

1

N

N∑i=1

(q0 +s∑l=1

ξilql)>(y0 +

s∑l=1

ξilyl)− t

],

βil0 (z) :=−

[(q0 +

∑s

l′=1 ξil′q

l′)>yl′+ ql

′>(y0 +

∑s

l′=1 ξil′y

l′)]

2N,

Γ(l,l′)0 (z) :=−q

lyl′+ ql

′yl

2N,

∀1≤ i≤N, 1≤ l, l′ ≤ s.

Then the first set of constraints in (27) is equivalent to

α0(z) + 2∑i,l

βil0 (z)ζil +∑i

∑l,l′

Γ(l,l′)0 ζilζil′ ≥ 0, ∀(ζil)i,l ∈ Uζ . (28)

It follows from Theorem 4.2 in Ben-Tal et al. [7] that (28) holds if and only if there exists $\lambda_0\ge0$ such that
$$(\alpha_0(z)-\lambda_0)v^2+2v\sum_{i,l}\beta^{il}_0(z)w_{il}+\sum_i\sum_{l,l'}\Gamma^{(l,l')}_0(z)\,w_{il}w_{il'}+\frac{\lambda_0}{N\theta^2}\sum_{i,l}w_{il}^2\ge0,\qquad\forall v\in\mathbb{R},\ \forall w_{il}\in\mathbb{R},\ \forall i,l,$$
or, in matrix form,
$$\exists\,\lambda_0\ge0:\quad\begin{pmatrix}\Gamma_0\otimes I_N+\frac{\lambda_0}{N\theta^2}I_{sN}&\mathrm{vec}(\beta_0)\\[2pt] \mathrm{vec}(\beta_0)^\top&\alpha_0(z)-\lambda_0\end{pmatrix}\succeq0,\qquad (29)$$
where $I_N$ (resp. $I_{sN}$) is the N- (resp. sN-) dimensional identity matrix, ⊗ denotes the Kronecker product of matrices, and vec denotes the vectorization of a matrix.

Now we reformulate the second set of constraints in (27). For all $1\le i\le N$, $1\le j\le m$, and $1\le l,l'\le s$, we set
$$\alpha_{ij}(z):=-\Big[\Big(T^0_j+\sum_{l=1}^s\widehat\xi^i_lT^l_j\Big)x+\Big(W^0_j+\sum_{l=1}^s\widehat\xi^i_lW^l_j\Big)\Big(y^0+\sum_{l=1}^s\widehat\xi^i_ly^l\Big)-\Big(h^0_j+\sum_{l=1}^s\widehat\xi^i_lh^l_j\Big)\Big],$$
$$\beta^l_{ij}(z):=-\frac{T^l_jx+\big(W^0_j+\sum_{l'=1}^s\widehat\xi^i_{l'}W^{l'}_j\big)y^l+W^l_j\big(y^0+\sum_{l'=1}^s\widehat\xi^i_{l'}y^{l'}\big)-h^l_j}{2},\qquad
\Gamma^{(l,l')}_j(z):=-\frac{W^l_jy^{l'}+W^{l'}_jy^l}{2}.$$
Let $\eta^i:=\xi-\widehat\xi^i$ for $1\le i\le N$. Then the second set of constraints in (27) is equivalent to
$$\alpha_{ij}(z)+2\beta_{ij}(z)^\top\eta^i+(\eta^i)^\top\Gamma_j(z)\eta^i\ge0,\qquad\forall\eta^i\in\{\eta'\in\mathbb{R}^K:\|\eta'\|_2\le\theta\sqrt N\},\ \forall\,1\le i\le N,\ 1\le j\le m.$$
Again, by Theorem 4.2 in Ben-Tal et al. [7], we have the further equivalence
$$\exists\,\lambda_{ij}\ge0:\quad\begin{pmatrix}\Gamma_j(z)+\frac{\lambda_{ij}}{N\theta^2}I_s&\beta_{ij}(z)\\[2pt] \beta_{ij}(z)^\top&\alpha_{ij}(z)-\lambda_{ij}\end{pmatrix}\succeq0,\qquad\forall\,1\le i\le N,\ 1\le j\le m.\qquad (30)$$

Combining (29) and (30) we obtain the following result.

Proposition 8. An exact reformulation of the AARC of (26) is given by
$$\min_{\substack{x\in X,\ t\in\mathbb{R},\ y^l\in\mathbb{R}^m,\ l=0,\dots,s\\ \lambda_0\ge0,\ \lambda_{ij}\ge0,\ i=1,\dots,N,\ j=1,\dots,m}}\big\{t\ :\ \text{(29) and (30) hold}\big\}.$$

Note that (26) is a fairly good approximation of the original two-stage DRSO problem (25) by Theorem 1. Hence, as long as the AARC of (26) is reasonably good, its semidefinite-program reformulation in Proposition 8 provides a good tractable approximation of the two-stage linear DRSO (25).


6. Conclusions. In this paper, we developed a constructive proof method to derive the dual reformulation of distributionally robust stochastic optimization with Wasserstein distance in a general setting. This approach allows us to obtain a precise structural description of the worst-case distribution and connects distributionally robust stochastic optimization to classical robust programming. Based on our results, we obtain many theoretical and computational implications. In future work, extensions to multi-stage distributionally robust stochastic optimization will be explored.

Appendix A: Auxiliary results

Lemma 4. Consider any p ≥ 1 and any ε > 0. Then there exists $C_p(\varepsilon)\ge1$ such that
$$(x+y)^p\ \le\ (1+\varepsilon)x^p+C_p(\varepsilon)y^p$$
for all x, y ≥ 0.

Lemma 5. Let $\zeta_0\in\Xi$. Then for any $\lambda>\lambda_1>\kappa$, there exists a constant C > 0 such that
$$\frac{\lambda-\lambda_1}{2}\,\overline{D}(\lambda,\zeta)\ \le\ \Phi(\lambda,\zeta)-\Phi(\lambda_1,\zeta_0)+\lambda_1C\,d^p(\zeta,\zeta_0).$$

Lemma 6. (i) The quantity
$$\limsup_{\xi\in\Xi:\ d^p(\xi,\zeta)\to\infty}\frac{\Psi(\xi)-\Psi(\zeta)}{d^p(\xi,\zeta)}$$
is independent of ζ.
(ii) Suppose
$$\nu\in\mathcal{P}_p(\Xi):=\Big\{\mu\in\mathcal{P}(\Xi)\ :\ \int_\Xi d^p(\zeta,\zeta_0)\,\mu(d\zeta)<\infty\ \text{for some}\ \zeta_0\in\Xi\Big\}.$$
Then the growth rate κ defined in Definition 4 is finite if and only if there exist $\zeta_0\in\Xi$ and L, M > 0 such that
$$\Psi(\xi)-\Psi(\zeta_0)\ \le\ L\,d^p(\xi,\zeta_0)+M,\qquad\forall\xi\in\Xi.\qquad (31)$$
(iii) When κ < ∞, it holds that
$$\kappa\ =\ \limsup_{\xi\in\Xi:\ d^p(\xi,\zeta)\to\infty}\frac{\Psi(\xi)-\Psi(\zeta)}{d^p(\xi,\zeta)}$$
for any ζ ∈ Ξ.

Lemma 7. Let C be a Borel set in Ξ with nonempty boundary ∂C. Then for any ε > 0, there exists a Borel map $T_\varepsilon:\partial C\to\Xi\setminus\mathrm{cl}(C)$ such that $d(\xi,T_\varepsilon(\xi))<\varepsilon$ for all $\xi\in\partial C$.

Appendix B: Proofs

B.1. Proofs of Lemmas

Proof of Lemma 1. Let $(u_0,v_0)$ be any feasible solution of the maximization problem in (2). For any $t\in\mathbb{R}$ and any $\xi,\zeta\in\Xi$, let $u_t(\xi):=u_0(\xi)+t$ and $v_t(\zeta):=v_0(\zeta)-t$. Then $u_t(\xi)+v_t(\zeta)\le d^p(\xi,\zeta)$ for all $\xi,\zeta\in\Xi$, and
$$\int_\Xi u_t(\xi)\mu(d\xi)+\int_\Xi v_t(\zeta)\nu(d\zeta)=\int_\Xi u_0(\xi)\mu(d\xi)+\int_\Xi v_0(\zeta)\nu(d\zeta)+t\,[\mu(\Xi)-\nu(\Xi)].$$
Since $\mu(\Xi)\neq\nu(\Xi)$,
$$\sup_{t\in\mathbb{R}}\Big\{\int_\Xi u_t(\xi)\mu(d\xi)+\int_\Xi v_t(\zeta)\nu(d\zeta)\Big\}=\infty,$$
and thus $W_p^p(\mu,\nu)=\infty$. □


Proof of Lemma 2.
(i) For any ζ ∈ Ξ, Φ(·, ζ) is the infimum of nondecreasing affine functions of λ, Φ(λ, ζ) < ∞ for all λ ≥ 0, and Φ(λ, ζ) > −∞ for all λ > κ; thus Φ(·, ζ) is nondecreasing and concave on [0, ∞). For the second part, consider any ζ ∈ Ξ and any $\lambda_2>\lambda_1$ such that $\Phi(\lambda_i,\zeta)>-\infty$ for i = 1, 2. For any δ > 0, choose $\xi^\delta_i\in\Xi$ such that $\lambda_id^p(\xi^\delta_i,\zeta)-\Psi(\xi^\delta_i)<\Phi(\lambda_i,\zeta)+\delta$ for i = 1, 2. It follows that
$$\lambda_2d^p(\xi^\delta_2,\zeta)-\Psi(\xi^\delta_2)<\Phi(\lambda_2,\zeta)+\delta\le\lambda_2d^p(\xi^\delta_1,\zeta)-\Psi(\xi^\delta_1)+\delta=(\lambda_2-\lambda_1)d^p(\xi^\delta_1,\zeta)+\lambda_1d^p(\xi^\delta_1,\zeta)-\Psi(\xi^\delta_1)+\delta<(\lambda_2-\lambda_1)d^p(\xi^\delta_1,\zeta)+\Phi(\lambda_1,\zeta)+2\delta\le(\lambda_2-\lambda_1)d^p(\xi^\delta_1,\zeta)+\lambda_1d^p(\xi^\delta_2,\zeta)-\Psi(\xi^\delta_2)+2\delta.$$
Hence $(\lambda_2-\lambda_1)d^p(\xi^\delta_2,\zeta)<(\lambda_2-\lambda_1)d^p(\xi^\delta_1,\zeta)+2\delta$. Dividing both sides by $\lambda_2-\lambda_1$, letting δ ↓ 0, and using definition (3), we obtain that $\overline{D}(\lambda_2,\zeta)\le\underline{D}(\lambda_1,\zeta)\le\overline{D}(\lambda_1,\zeta)$.

(ii) By Definition 3, for any $\lambda_0\in\mathrm{dom}(\Phi(\cdot,\zeta))$ and for ν-almost all ζ, it holds that
$$\Psi(\xi)\le\lambda_0d^p(\xi,\zeta)+\Phi(\lambda_0,\zeta),\qquad\forall\xi\in\Xi.$$
On the other hand, for every ξ ∈ Ξ that satisfies $\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta$ for some δ ≥ 0, it holds that
$$\lambda d^p(\xi,\zeta)-\Psi(\xi)-\delta\le-\Psi(\zeta).$$
Combining the two inequalities above yields
$$\lambda d^p(\xi,\zeta)+\Psi(\zeta)-\delta\le\lambda_0d^p(\xi,\zeta)+\Phi(\lambda_0,\zeta),$$
or equivalently,
$$(\lambda-\lambda_0)d^p(\xi,\zeta)-\delta\le-\Psi(\zeta)+\Phi(\lambda_0,\zeta).$$
Taking the supremum over the set $\{\xi\in\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta\}$ and then the limsup as δ ↓ 0, and using definition (3) of $\overline{D}(\lambda,\zeta)$, we obtain that
$$(\lambda-\lambda_0)\overline{D}(\lambda,\zeta)\ \le\ -\Psi(\zeta)+\Phi(\lambda_0,\zeta).$$

(iii) Consider any ζ and any $\lambda_2>\lambda_1$ with $\Phi(\lambda_1,\zeta)>-\infty$. For any δ > 0, choose $\xi^\delta_i\in\Xi$ such that $\lambda_id^p(\xi^\delta_i,\zeta)-\Psi(\xi^\delta_i)\le\Phi(\lambda_i,\zeta)+\delta$ for i = 1, 2. Then
$$\Phi(\lambda_1,\zeta)-\Phi(\lambda_2,\zeta)\le\lambda_1d^p(\xi^\delta_2,\zeta)-\Psi(\xi^\delta_2)-\big[\lambda_2d^p(\xi^\delta_2,\zeta)-\Psi(\xi^\delta_2)\big]+\delta=(\lambda_1-\lambda_2)d^p(\xi^\delta_2,\zeta)+\delta.$$
Similarly, $\Phi(\lambda_2,\zeta)-\Phi(\lambda_1,\zeta)\le(\lambda_2-\lambda_1)d^p(\xi^\delta_1,\zeta)+\delta$. It follows that
$$d^p(\xi^\delta_2,\zeta)-\frac{\delta}{\lambda_2-\lambda_1}\ \le\ \frac{\Phi(\lambda_2,\zeta)-\Phi(\lambda_1,\zeta)}{\lambda_2-\lambda_1}\ \le\ d^p(\xi^\delta_1,\zeta)+\frac{\delta}{\lambda_2-\lambda_1}.$$
Then it follows from (3) that
$$\underline{D}(\lambda_2,\zeta)\ \le\ \frac{\Phi(\lambda_2,\zeta)-\Phi(\lambda_1,\zeta)}{\lambda_2-\lambda_1}\ \le\ \overline{D}(\lambda_1,\zeta).$$


Since Φ(·, ζ) is finite-valued and concave on (κ, ∞), the left and right derivatives $\partial\Phi(\lambda,\zeta)/\partial\lambda^\pm$ exist. Setting $\lambda_1=\lambda$ and letting $\lambda_2\downarrow\lambda$ in the inequality above, it follows that
$$\lim_{\lambda_2\downarrow\lambda}\underline{D}(\lambda_2,\zeta)\ \le\ \frac{\partial\Phi(\lambda,\zeta)}{\partial\lambda^+}\ \le\ \overline{D}(\lambda,\zeta).$$
Similarly, setting $\lambda_2=\lambda$ and letting $\lambda_1\uparrow\lambda$, it follows that
$$\underline{D}(\lambda,\zeta)\ \le\ \frac{\partial\Phi(\lambda,\zeta)}{\partial\lambda^-}\ \le\ \lim_{\lambda_1\uparrow\lambda}\overline{D}(\lambda_1,\zeta).\ \square$$

Proof of Lemma 3.
(i) By Definition 1.11 in Ambrosio et al. [2], ν has an extension, still denoted by ν, such that the measure space $(\Xi,\mathcal{B}_\nu,\nu)$ is complete. Note that for any b ∈ R, it holds that
$$\{\zeta\in\Xi:\Phi(\lambda,\zeta)<b\}=\{\zeta\in\Xi:\exists\,\xi\in\Xi\text{ such that }\lambda d^p(\xi,\zeta)-\Psi(\xi)<b\}=\pi^2\big(\{(\xi,\zeta)\in\Xi\times\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)<b\}\big).$$
The set $\{(\xi,\zeta)\in\Xi\times\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)<b\}$ on the right-hand side is measurable. Since (Ξ, d) is Polish, it follows from the measurable projection theorem (cf. Theorem 8.3.2 in Aubin and Frankowska [3]) that Φ(λ, ·) is $(\mathcal{B}_\nu,\mathcal{B}(\mathbb{R}))$-measurable.
Define functions $\overline{C},\underline{C}$ by
$$\overline{C}(\lambda,\zeta,\delta):=\sup_{\xi\in\Xi}\{d^p(\xi,\zeta):\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta\},\qquad \underline{C}(\lambda,\zeta,\delta):=\inf_{\xi\in\Xi}\{d^p(\xi,\zeta):\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta\}.$$
For any b ∈ R it holds that
$$\{\zeta\in\Xi:\overline{C}(\lambda,\zeta,\delta)>b\}=\pi^2\big(\{(\xi,\zeta)\in\Xi\times\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta,\ d^p(\xi,\zeta)>b\}\big),$$
and thus it follows from the measurable projection theorem that $\overline{C}(\lambda,\cdot,\delta)$ is $(\mathcal{B}_\nu,\mathcal{B}(\mathbb{R}))$-measurable. Similarly,
$$\{\zeta\in\Xi:\underline{C}(\lambda,\zeta,\delta)<b\}=\pi^2\big(\{(\xi,\zeta)\in\Xi\times\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta,\ d^p(\xi,\zeta)<b\}\big),$$
and thus $\underline{C}(\lambda,\cdot,\delta)$ is $(\mathcal{B}_\nu,\mathcal{B}(\mathbb{R}))$-measurable. It follows that $\overline{D}(\lambda,\cdot)=\limsup_{\delta\downarrow0}\overline{C}(\lambda,\cdot,\delta)$ and $\underline{D}(\lambda,\cdot)=\liminf_{\delta\downarrow0}\underline{C}(\lambda,\cdot,\delta)$ are also $(\mathcal{B}_\nu,\mathcal{B}(\mathbb{R}))$-measurable, because measurability is preserved under limsup and liminf.

(ii) Consider any δ, ε > 0. Define multi-valued mappings $\overline{S},\underline{S}:\mathbb{R}_+\times\Xi\rightrightarrows\Xi$ by
$$\overline{S}(\lambda,\zeta,\delta,\varepsilon):=\{\xi\in\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta,\ d^p(\xi,\zeta)\ge\overline{D}(\lambda,\zeta)-\varepsilon\},$$
$$\underline{S}(\lambda,\zeta,\delta,\varepsilon):=\{\xi\in\Xi:\lambda d^p(\xi,\zeta)-\Psi(\xi)\le\Phi(\lambda,\zeta)+\delta,\ d^p(\xi,\zeta)\le\underline{D}(\lambda,\zeta)+\varepsilon\}.$$
For each ζ ∈ Ξ, it follows from the measurability of Ψ and $d^p(\cdot,\zeta)$ that $\overline{S}(\lambda,\zeta,\delta,\varepsilon)$ and $\underline{S}(\lambda,\zeta,\delta,\varepsilon)$ are in $\mathcal{B}(\Xi)$. Since (Ξ, d) is Polish and ν is a complete finite measure, it follows from Aumann's measurable selection theorem (see, e.g., Theorem 18.26 in Aliprantis and Border [1]) that ν-measurable selections $\overline{T},\underline{T}:\Xi\to\Xi$ exist such that $\overline{T}(\zeta)\in\overline{S}(\lambda,\zeta,\delta,\varepsilon)$ and $\underline{T}(\zeta)\in\underline{S}(\lambda,\zeta,\delta,\varepsilon)$.

(iii) Define a multi-valued mapping $S:\mathbb{R}_+\times\Xi\rightrightarrows\Xi$ by
$$S(\varepsilon,\zeta):=\{\xi\in\Xi:\Psi(\xi)-\Psi(\zeta)\ge(\kappa-\varepsilon)d^p(\xi,\zeta),\ d^p(\xi,\zeta)\ge M(\zeta)\}.$$
For each ζ ∈ Ξ, it follows from the measurability of Ψ, M and $d^p(\cdot,\zeta)$ that $S(\varepsilon,\zeta)\in\mathcal{B}(\Xi)$. Then, using the same argument as in (ii), there exists a ν-measurable selection $T:\Xi\to\Xi$ such that $T(\zeta)\in S(\varepsilon,\zeta)$. □


Proof of Lemma 4. Note that if x = 0, then the inequality holds for any $C_p(\varepsilon)\ge1$. Next we consider the case x > 0, and let t := y/x. Let
$$t_0(\varepsilon):=\sup\{t>0:1+\varepsilon\ge(1+t)^p\}.$$
Note that $t_0(\varepsilon)>0$. Next let
$$C_p(\varepsilon):=\max\Big\{1,\ \sup_{t\ge t_0(\varepsilon)}\frac{(1+t)^{p-1}}{t^{p-1}}\Big\}.$$
Note that $C_p(\varepsilon)<\infty$ because $\lim_{t\to\infty}(1+t)^{p-1}/t^{p-1}=1$. Next, consider
$$f(t):=1+\varepsilon+C_p(\varepsilon)t^p-(1+t)^p.$$
Then f(t) ≥ 0 for all $t\in[0,t_0(\varepsilon)]$. Also, $f'(t)=C_p(\varepsilon)pt^{p-1}-p(1+t)^{p-1}\ge0$ for all $t\in[t_0(\varepsilon),\infty)$. Therefore f(t) ≥ 0 for all t ≥ 0, which establishes the inequality for x > 0. □

Proof of Lemma 5. It follows from Lemma 4 with $\varepsilon:=\frac{\lambda-\lambda_1}{2\lambda_1}$ that
$$\lambda_1d^p(\xi,\zeta_0)\ \le\ \frac{\lambda+\lambda_1}{2}d^p(\xi,\zeta)+\lambda_1C_p(\varepsilon)d^p(\zeta,\zeta_0)$$
for all $\xi,\zeta,\zeta_0\in\Xi$. Thus
$$\lambda d^p(\xi,\zeta)-\Psi(\xi)=\frac{\lambda-\lambda_1}{2}d^p(\xi,\zeta)-\Psi(\xi)+\frac{\lambda+\lambda_1}{2}d^p(\xi,\zeta)\ \ge\ \frac{\lambda-\lambda_1}{2}d^p(\xi,\zeta)-\Psi(\xi)+\lambda_1d^p(\xi,\zeta_0)-\lambda_1C_p(\varepsilon)d^p(\zeta,\zeta_0)\ \ge\ \frac{\lambda-\lambda_1}{2}d^p(\xi,\zeta)+\Phi(\lambda_1,\zeta_0)-\lambda_1C_p(\varepsilon)d^p(\zeta,\zeta_0).$$
Hence, for every ξ ∈ Ξ that satisfies $\lambda d^p(\xi,\zeta)-\Psi(\xi)<\Phi(\lambda,\zeta)+\delta$ for some δ ≥ 0, it holds that
$$\frac{\lambda-\lambda_1}{2}d^p(\xi,\zeta)<\Phi(\lambda,\zeta)-\Phi(\lambda_1,\zeta_0)+\lambda_1C_p(\varepsilon)d^p(\zeta,\zeta_0)+\delta.$$
Taking the supremum over such ξ on both sides and then the limsup as δ ↓ 0, we obtain that
$$\frac{\lambda-\lambda_1}{2}\overline{D}(\lambda,\zeta)\ \le\ \Phi(\lambda,\zeta)-\Phi(\lambda_1,\zeta_0)+\lambda_1C_p(\varepsilon)d^p(\zeta,\zeta_0).\ \square$$

Proof of Lemma 6.(i) We prove this by contradiction. Suppose that for some ζ0, ζ1 ∈Ξ, it holds that

κ0 < κ1 := lim supd(ξ,ζ1)→∞

Ψ(ξ)−Ψ(ζ1)

dp(ξ, ζ1)

(κ1 =∞ is allowed, and the case κ0 >κ1 can be shown similarly). Choose any ε∈ (0, κ1−κ0). Thenthere exists an R such that for all ξ with d(ξ, ζ0)>R it holds that

Ψ(ξ)−Ψ(ζ1) = Ψ(ξ)−Ψ(ζ0) + Ψ(ζ0)−Ψ(ζ1)

< (κ0 + ε)dp(ξ, ζ0) + [Ψ(ζ0)−Ψ(ζ1)]

≤ (κ0 + ε)[d(ξ, ζ1) + d(ζ1, ζ0)

]p+ [Ψ(ζ0)−Ψ(ζ1)].


It follows that

limsupd(ξ,ζ1)→∞

Ψ(ξ)−Ψ(ζ1)

dp(ξ, ζ1)≤ limsup

d(ξ,ζ1)→∞

(κ0 + ε) [d(ξ, ζ1) + d(ζ1, ζ0)]p

+ [Ψ(ζ0)−Ψ(ζ1)]

dp(ξ, ζ1)

= κ0 + ε < κ1,

which is a contradiction.(ii) (Sufficiency). If (31) holds, then

κ0 := limsupξ∈Ξ: d(ξ,ζ0)→∞

Ψ(ξ)−Ψ(ζ0)

dp(ξ, ζ0)≤ L < ∞.

Then by (i),

κ0 = limsupξ∈Ξ: d(ξ,ζ)→∞

Ψ(ξ)−Ψ(ζ)

dp(ξ, ζ), ∀ζ ∈Ξ. (32)

We are going to show that κ= κ0, that is,∫

ΞΦ(λ, ζ)ν(dζ)>−∞ for any λ> κ0.

To this end, we first show that Φ(λ, ζ)>−∞ for any λ> κ0 and ζ ∈Ξ. In fact, by (32), for anyζ ∈Ξ, there is a C > 0 such that for all ξ ∈Ξ with dp(ξ, ζ)>C, it holds that

Ψ(ξ)−Ψ(ζ)

dp(ξ, ζ)<

λ+κ0

2,

that is, (λ+κ0

2)dp(ξ, ζ)−Ψ(ξ)>−Ψ(ζ). Thus, for all ξ ∈Ξ with dp(ξ, ζ)>C, it holds that

λdp(ξ, ζ)−Ψ(ξ) =λ+κ0

2dp(ξ, ζ)−Ψ(ξ) +

λ−κ0

2dp(ξ, ζ)

>−Ψ(ζ) +λ−κ0

2·C >−∞,

and hence

infξ∈Ξ

{λdp(ξ, ζ)−Ψ(ξ) : dp(ξ, ζ)>C

}≥−Ψ(ζ) +

λ−κ0

2·C >−∞. (33)

Using (31) we have that for any ζ, ξ ∈Ξ,

Ψ(ξ)−Ψ(ζ) = Ψ(ξ)−Ψ(ζ0) + Ψ(ζ0)−Ψ(ζ)

≤Ldp(ξ, ζ0) + Ψ(ζ0)−Ψ(ζ)

≤ 2p−1L[dp(ξ, ζ) + dp(ζ, ζ0)] + Ψ(ζ0)−Ψ(ζ)

=:L′ · dp(ξ, ζ) +M(ζ),

where the second inequality follows from the elementary inequality (a+ b)p ≤ 2p−1(ap + bp) for anya, b≥ 0 and p≥ 1, and

L′ := 2p−1L, M(ζ) := 2p−1Ldp(ζ, ζ0) + Ψ(ζ0)−Ψ(ζ). (34)

It then follows that

infξ∈Ξ

{λdp(ξ, ζ)−Ψ(ξ) : dp(ξ, ζ)≤C

}≥ inf

ξ∈Ξ

{−Ψ(ξ) : dp(ξ, ζ)≤C

}≥−Ψ(ζ)−L(ζ)C −M(ζ)>−∞.

Therefore, Φ(λ, ζ)>−∞ for all ζ ∈Ξ and λ> κ0.


Then we show that∫

ΞΦ(λ, ζ)ν(dζ)>−∞ for any λ> κ0. Let λ1 ∈ (κ0, λ). Choose any ζ0 ∈Ξ. It

follows from Lemma 5 that there is a constant C such that

Φ(λ, ζ) ≥ λ−λ1

2D(λ, ζ) + Φ(λ1, ζ

0)−Cdp(ζ, ζ0) ≥ Φ(λ1, ζ0)−Cdp(ζ, ζ0).

Integrating over ζ with respect to ν yields that∫Ξ

Φ(λ, ζ)ν(dζ) ≥ Φ(λ1, ζ0)−C

∫Ξ

dp(ζ, ζ0)ν(dζ) > −∞.

Therefore we have shown that κ= κ0.(Necessity). Observe that in the proof of sufficiency, we have shown that the condition (31) is

equivalent to the following condition: there exists L′ ≥ 0 and M(ζ)∈L1(ν) such that

Ψ(ξ)−Ψ(ζ) ≤ L′dp(ξ, ζ) +M(ζ), ∀ξ, ζ ∈Ξ.

Suppose κ <∞, and we are going to prove the necessity by contradiction. Assume the aboveequivalent condition does not hold, then for any λ≥ 0,

infξ∈Ξ

[λdp(ξ, ζ)−Ψ(ξ)] /∈ L1(ν),

which, implies that κ=∞, a contradiction.(iii) follows directly from the proof of (ii). �

Proof of Lemma 7. Since Ξ is separable, ∂C has a countable dense subset $\{\xi_i\}_{i=1}^\infty$. For each $\xi_i$, there exists $\xi'_i\in\Xi\setminus\mathrm{cl}(C)$ such that $\varepsilon_i:=2d(\xi_i,\xi'_i)<\varepsilon$. Thus $\partial C\subset\bigcup_{i=1}^\infty B_{\varepsilon_i}(\xi_i)$, where $B_{\varepsilon_i}(\xi_i)$ is the open ball centered at $\xi_i$ with radius $\varepsilon_i$. Define
$$i^*(\xi):=\min_{i\ge1}\{i:\xi\in B_{\varepsilon_i}(\xi_i)\},\quad\xi\in\partial C,\qquad\text{and}\qquad T_\varepsilon(\xi):=\xi'_{i^*(\xi)},\quad\xi\in\partial C.$$
Then $T_\varepsilon$ satisfies the requirements of the lemma. □

B.2. Proofs of Corollaries

Proof of Corollary 1.

(i) (Necessity). First we consider that there exists a dual minimizer λ∗ > κ. Note that if Ψ isupper-semicontinuous and bounded subsets of (Ξ, d) are totally bounded, then in the proof ofLemma 3(ii), C(λ, ζ, δ) and C(λ, ζ, δ) are well-defined for δ= 0, and the supremum and the infimumare attained. It then follows that S(λ, ζ, δ, ε) and S(λ, ζ, δ, ε) in the proof of Lemma 3(ii) are alsowell-defined for ε= δ= 0. Furthermore, we note that if Ψ is upper-semicontinuous bounded subsetsof (Ξ, d) are totally bounded, then D0(·, ζ) is left continuous and D0(·, ζ) is right continuous on(κ,∞). Indeed, fixing λ > κ, let {λn}n be a sequence monotonically increasing to λ with λn ∈((λ+κ)/2, λ) for all n. By Lemma 2(i), it holds that

D0(λn+1, ζ) ≤ D0(λn, ζ) ≤ D0((λ+κ)/2, ζ), ∀n. (35)

Suppose on the contrary that there exists δ > 0 such that limn→∞D0(λn, ζ)≥D0(λ, ζ) + δ. Let

ξn ∈ arg minξ∈Ξ

{d(ξ, ζ) : ξ ∈ arg min

ξ∈Ξ

{λnd

p(ξ, ζ)−Ψ(ξ)}},

ξ0 ∈ arg minξ∈Ξ

{d(ξ, ζ) : ξ ∈ arg min

ξ∈Ξ

{λdp(ξ, ζ)−Ψ(ξ)

}}.


By definition we have that

λndp(ξn, ζ)−Ψ(ξn)≤ λndp(ξ0, ζ)−Ψ(ξ0).

From (35) and totally boundedness, up to a subsequence we can assume {ξn}n converges to someξ∗ ∈Ξ. Let n→∞ and by lower-semicontinuity of −Ψ,

λdp(ξ∗, ζ)−Ψ(ξ∗)≤ lim infk→∞

λndp(ξn, ζ)−Ψ(ξn)≤ λdp(ξ0, ζ)−Ψ(ξ0),

thus we obtain that ξ∗ is a minimizer of infξ∈Ξ λdp(ξ, ζ)−Ψ(ξ), but d(ξ∗, ζ)≥D0(λ, ζ) + δ, which

contradicts to the definition of D0(λ, ζ). Therefore we have shown that D0(λ, ζ) is left continuouswith respect to λ. Using a similar argument we can show the right continuity of D(·, ζ).

It follows from the above results and Lemma 2(i)(iii) that for λ> κ, it holds that

∂Φ(λ, ζ)

∂λ+= D(λ, ζ),

∂Φ(λ, ζ)

∂λ−= D(λ, ζ).

Then mimicking the proof of Theorem 1 Case 1, we define mappings

T λ∗(ζ)∈{ξ ∈Ξ : λ∗dp(ξ, ζ)−Ψ(ξ) = Φ(λ∗, ζ), dp(ξ, ζ) = D(λ∗, ζ)

},

T λ∗(ζ)∈{ξ ∈Ξ : λ∗dp(ξ, ζ)−Ψ(ξ) = Φ(λ∗, ζ), dp(ξ, ζ) = D(λ∗, ζ)

},

and define q ∈ [0,1] such that

q

∫Ξ

dp(T (ζ), ζ)ν(dζ) + (1− q)∫

Ξ

dp(T (ζ), ζ)ν(dζ) = θp.

Letµ∗ := q ·T#ν+ (1− q) ·T#ν. (36)

Then µ∗ is feasible and∫Ξ

Ψ(ξ)µ∗(dξ) = q

∫Ξ

Ψ(T (ζ))ν(dζ) + (1− q)∫

Ξ

Ψ(T (ζ))ν(dζ)

= q

∫Ξ

[λ∗dp(T (ζ), ζ)−Φ(λ∗, ζ)

]ν(dζ) + (1− q)

∫Ξ

[λ∗dp(T (ζ)−Φ(λ∗, ζ)

]ν(dζ)

= λ∗θp−∫

Ξ

Φ(λ∗, ζ)ν(dζ) = vD.

Therefore µ∗ is optimal.Next, suppose that λ∗ = κ> 0 is the unique minimizer, ν

({ζ ∈Ξ : arg minξ∈Ξ{κdp(ξ, ζ)−Ψ(ξ)}=

∅})

= 0, and condition (2.) holds. Then the sets E,E in Lemma 3(ii) are well-defined for λ= κ andδ = ε= 0, hence there exists ν-measurable maps T ,T : Ξ→ Ξ, such that κdp(T (ζ), ζ)−Ψ(T (ζ)) =Φ(κ, ζ) = κdp(T (ζ), ζ)−Ψ(T (ζ)) holds ν-a.s., and that∫

Ξ

dp(T (ζ), ζ)ν(dζ) ≤ θp ≤∫

Ξ

dp(T (ζ), ζ)ν(dζ).

Then the distribution defined by (36) is optimal using the same argument.Third, suppose that λ∗ = 0 is the unique minimizer, arg maxξ∈Ξ{Ψ(ξ)} is nonempty, and condi-

tion (3.) holds. Then the sets E,E in Lemma 3(ii) are well-defined for λ= 0 and δ = ε= 0, hencethere exists ν-measurable map T : Ξ→Ξ, such that T (ζ)∈ arg maxξ∈Ξ{Ψ(ξ)} holds ν-a.s., and that∫

Ξ

dp(T (ζ), ζ)ν(dζ)≤ θp.


Define µ := T#ν. It follows that∫Ξ

Ψ(ξ)µ(dξ) =

∫Ξ

Ψ(T (ζ))ν(dζ) = maxξ∈Ξ

Ψ(ξ) = vD.

Therefore we have shown the “if” part.(Sufficiency). Let µ be a primal feasible solution, and let (ξ, ζ) be a random vector with

marginal distributions µ and ν. Let γζ be a conditional distribution of ξ given ζ such that∫Ξ2 d

p(ξ, ζ)γζ(dξ)ν(dζ)≤ θp. Then the weak duality implies for any λ≥ 0,∫Ξ

Ψ(ξ)µ(dξ) =

∫Ξ2

[Ψ(ξ)−λdp(ξ, ζ)]γζ(dξ)ν(dζ) +

∫Ξ2

λdp(ξ, ζ)γζ(dξ)ν(dζ)

≤−∫

Ξ

Φ(λ, ζ)ν(dζ) +λθp.

Hence, to make the inequality holds as equality (and thus µ is a worst-case distribution and λ isdual optimal), it must hold that(a) arg minξ∈Ξ{λdp(ξ, ζ)−Ψ(ξ)} is non-empty for ν-almost every ζ.(b) For ν-almost every ζ, the conditional distribution γζ of ξ should be supported on the set

arg minξ∈Ξ{λdp(ξ, ζ)−Ψ(ξ)}.(c) λ · (θp−

∫Ξ2 d

p(ξ, ζ)γζ(dξ)ν(dζ)) = 0.Now suppose all conditions in Corollary 1(i) fail to hold. This could happen when(1) λ∗ = κ> 0 and (2.) fails to hold.(2) λ∗ = 0 and (3.) fails to hold.

Let us first consider (1). Suppose µ is an optimal solution. If∫

ΞD0(κ, ζ)ν(dζ)< θp, then together

with (a)(b) we have that∫Ξ

∫Ξ

dp(ξ, ζ)γζ(dξ)ν(dζ)≤∫

Ξ

D0(κ, ζ)ν(dζ)< θp,

which contracts to (c). If θp <∫

ΞD0(κ, ζ)ν(dζ), by Lemma 2(ii)(iii) and Lemma 3, we have that

∂λ+

∫Ξ

Φ(κ, ζ)ν(dζ) =

∫Ξ

D0(κ, ζ)ν(dζ)> θp.

Therefore, there exists λ > κ, such that λθp −∫

ΞΦ(λ, ζ)ν(dζ)< κθp −

∫Ξ

Φ(κ, ζ)ν(dζ), and thus κcannot be a dual minimizer, a contradiction.

Next we consider (2), i.e., λ∗ = 0 and∫

ΞD0(0, ζ)ν(dζ) > θp. Suppose µ is optimal,

namely,∫

ΞΨ(ξ)µ(dξ) = maxξ∈Ξ Ψ(ξ). Then (a)(b) and

∫ΞD0(0, ζ)ν(dζ) > θp imply that∫

Ξ2 dp(ξ, ζ)γζ(dξ)ν(dζ)> θp, thus we arrive at a contradiction.

(ii) If −Ψ(ζ)≤ infξ∈Ξ {κdp(ξ, ζ)−Ψ(ξ)} ν-almost everywhere, i.e., Ψ(ξ)−Ψ(ζ)≤ κdp(ξ, ζ), thenfor any λ> κ, Φ(λ, ζ) = Ψ(ζ). Hence the dual optimal solution λ∗ = κ.

Otherwise there exists a set E ⊂Ξ with ν(E)> 0 such that Ψ(ζ)>Φ(κ, ζ) for all ζ ∈E, and thus∫Ξ

Ψ(ζ)ν(dζ)>∫

ΞΦ(κ, ζ)ν(dζ). Then by continuity (follows from concavity) of

∫Ξ

Φ(·, ζ)ν(dζ), thereexists λ0 >κ such that

∫Ξ

Ψ(ζ)ν(dζ)>∫

ΞΦ(λ0, ζ)ν(dζ). For such λ0, using the upper-semicontinuity

and totally boundedness assumptions and Lemma 3, there exists a ν-measurable map Tλ0: Ξ→Ξ,

such that λ0dp(Tλ0

(ζ), ζ)−Ψ(Tλ0(ζ)) = Φ(λ0, ζ), and

ε :=

∫Ξ

dp(Tλ0(ζ), ζ)ν(dζ)> 0,

since otherwise∫

ΞΨ(ζ)ν(dζ) =

∫Ξ

Φ(λ0, ζ)ν(dζ). Choose θ < ε1/p, then we claim that λ = κ iscannot be optimal. Indeed, according to Case 2 in the proof of Theorem 1, λ∗ = κ implies


∫Ξdp(T 0

λ(ζ), ζ)ν(dζ)< θp < ε for all λ > κ, and in particular, for λ= λ0. Thus we arrive at a con-tradiction.

(iii) This directly follows from the fact that in the proof for necessity in (i), in each case weconstruct an optimal solution with the the structure described as in (iii).

(iv) Note that the concavity of Ψ implies that κ<∞. Let µ= qT#ν+(1−q) ·T#ν be an optimalsolution as described in (i). We define T : Ξ→Ξ by

T (ζ) := q ·T (ζ) + (1− q) ·T (ζ).

Then it follows from Ξ being convex that T (ζ)∈Ξ for all ζ ∈Ξ. It follows from dp(·, ζ) being convexfor all ζ ∈Ξ that

W pp (T#ν, ν)≤

∫Ξ

dp(T (ζ), ζ)ν(dζ)

=

∫Ξ

dp(q ·T (ζ) + (1− q) ·T (ζ), ζ)ν(dζ)

≤ q∫

Ξ

dp(T (ζ), ζ)ν(dζ) + (1− q)∫

Ξ

dp(T (ζ), ζ)ν(dζ)

=W pp (µ,ν) ≤ θp.

Also, it follows from Ψ being concave that∫Ξ

Ψ(T (ζ))ν(dζ) =

∫Ξ

Ψ(q ·T (ζ) + (1− q) ·T (ζ))ν(dζ)

≥q∫

Ξ

Ψ(T (ζ))ν(dζ) + (1− q)∫

Ξ

Ψ(T (ζ))ν(dζ)

=Eµ∗ [Ψ] = vD.

Hence T#ν is an optimal solution. �

B.3. Proofs of Propositions

Proof of Proposition 1. Let Q := {µ ∈ P(Ξ) : Wp(µ,ν) < ∞}. For any µ ∈ Q, let γµ ∈ P(Ξ2) denote a minimizer in the definition (1) of Wp(µ,ν). Then, by the tower property of conditional probability, ∫

Ξ

Ψ(ξ)µ(dξ) =

∫Ξ2

Ψ(ξ)γµ(dξ, dζ) =

∫Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ),

and

W pp (µ,ν) =

∫Ξ2

dp(ξ, ζ)γµ(dξ, dζ) =

∫Ξ2

dp(ξ, ζ)γµζ (dξ)ν(dζ),

where γµζ denotes the conditional distribution of ξ given ζ when the joint distribution of (ξ, ζ) isγ. Then the primal problem can be written as

vP = supµ∈Q

{∫Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ) :

∫Ξ2

dp(ξ, ζ)γµζ (dξ)ν(dζ)≤ θp}.

Next we show that

vP ≤ supµ∈Q

infλ≥0

{∫Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ) +λ

(θp−

∫Ξ2

dp(ξ, ζ)γµζ (dξ)ν(dζ)

)}. (37)


If∫

ΞΨ(ξ)µ(dξ)<∞ for all µ∈Q, then for any µ∈M= {µ∈P(Ξ) :Wp(µ,ν)≤ θ} it holds that

infλ≥0

{∫Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ) +λ(θp−

∫Ξ2

dp(ξ, ζ)γµζ (dξ)ν(dζ))}

=

∫Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ),

and for any µ∈Q\M it holds that

infλ≥0

{∫Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ) +λ(θp−

∫Ξ2

dp(ξ, ζ)γµζ (dξ)ν(dζ))}

=−∞.

Thus the objective functions in (Primal) and the right side of (37) are the same for all µ∈Q, andtherefore (37) holds as an equality.

Otherwise, if∫

ΞΨ(ξ)µ(dξ) =∞ for some µ∈Q, then for any λ≥ 0, we have that∫

Ξ2

Ψ(ξ)γµζ (dξ)ν(dζ) +λ

(θp−

∫Ξ2

dp(ξ, ζ)γµζ (dξ)ν(dζ)

)=∞,

because∫

Ξ2 dp(ξ, ζ)γµζ (dξ)ν(dζ) =W p

p (µ,ν)<∞, and thus (37) holds. Therefore we conclude that

vP ≤ infλ≥0

{λθp + sup

µ∈Q

{∫Ξ2

[Ψ(ξ)−λdp(ξ, ζ)

]γµζ (dξ)ν(dζ)

}}≤ inf

λ≥0

{λθp +

∫Ξ

supξ∈Ξ

[Ψ(ξ)−λdp(ξ, ζ)

]ν(dζ)

}= vD. �

Proof of Proposition 2. If κ = ∞, then for any n > 0 we have that
$$\phi_n(\zeta):=\sup_{\xi\in\Xi}\{\Psi(\xi)-\Psi(\zeta)-n\,d^p(\xi,\zeta)\}\ \notin\ L^1(\nu).$$
Hence, for any n > 0, there exists E ⊂ Ξ with ν(E) > 0 such that $\phi_n(\zeta)>n$ for all ζ ∈ E. Observe that $\phi_n(\zeta)=-\Phi(n,\zeta)-\Psi(\zeta)$. Thus by Lemma 3, there exists a ν-measurable mapping $T^n:E\to\Xi$ such that
$$T^n(\zeta)\in\{\xi\in\Xi:\Psi(\xi)-\Psi(\zeta)\ge n\,d^p(\xi,\zeta)+n/2\}.$$
For r = 1, 2, ..., consider the set
$$E_r:=\{\zeta\in E:d^p(T^n(\zeta),\zeta)\le r\}.$$
Then $\lim_{r\to\infty}E_r=E$ and $\lim_{r\to\infty}\nu(E_r)=\nu(E)$. Hence there exists $r_0$ such that $\nu(E_{r_0})>0$ and $\int_{E_{r_0}}d^p(T^n(\zeta),\zeta)\nu(d\zeta)<\infty$. Let $\overline{T}^n$ be the restriction of $T^n$ to $E_{r_0}$. Now define a distribution
$$\mu_n:=p\cdot\overline{T}^n_\#\nu+(1-p)\cdot\nu,\qquad\text{where}\quad p:=\min\Big(1,\ \frac{\theta^p}{\int_{E_{r_0}}d^p(\overline{T}^n(\zeta),\zeta)\nu(d\zeta)}\Big).$$
Then $\mu_n$ is a primal feasible solution, and
$$\int_\Xi\Psi\,d\mu_n-\int_\Xi\Psi\,d\nu\ \ge\ p\cdot\Big(n\int_{E_{r_0}}d^p(\overline{T}^n(\zeta),\zeta)\nu(d\zeta)+n/2\Big)\ \ge\ \min(n\theta^p,\,n/2).$$
Letting n → ∞, we conclude that $v_P=\infty=v_D$. □


Proof of Proposition 3. Since −1{ξ ∈ int(C)} is upper semicontinuous and binary-valued, byCorollary 1 the worst-case distribution of minµ∈M µ(int(C)) exists. Thus it suffices to show thatfor any ε > 0, there exists µ∈M such that µ(C)≤minµ∈M µ(int(C)) + ε. Observe that there existsan optimal transportation plan γ0 such that

supp γ0 ⊂(supp ν× supp ν

)∪((supp ν ∩ int(C))× ∂C

).

Set µ0 := π2#γ0, then µ0 is an optimal solution for minµ∈M µ(int(C)).

If µ0(∂C) = 0, there is nothing to show, so we assume that µ0(∂C)> 0. We first consider the caseν(int(C)) = 0 (and thus µ0 can be chosen to be ν and the worst-case value is 0). By Lemma 7, wecan define a Borel map Tε which maps each ξ ∈ ∂C to some ξ′ ∈ Ξ \ cl(C) with d(ξ, ξ′)< ε ∈ (0, θ)and is an identity mapping elsewhere. We further define a distribution µε by

µε(A) := µ0(A \ ∂C) +µ0{ξ ∈ ∂C : Tε(ξ)∈A}, for all Borel set A⊂Ξ.

Then Wp(µε, µ0) =Wp(µε, ν)≤ ε < θ and µε(C) = µ0(int(C)).Now let us consider ν(int(C))> 0. For any ε∈ (0, θ), we define a distribution µ′ε by

µ′ε(A) :=µ0(A∩ int(C)) +ε

θ

(γ0{(A∩ int(C))× ∂C}+ ν(A∩ ∂C)

)+(

1− ε

θ

)µ0{ξ ∈ ∂C : Tε(ξ)∈A}+µ0(A \ cl(C)), for all Borel set A⊂Ξ.

Thenµ′ε(C) = µ0(int(C)) +

ε

θ[µ0(∂C)− ν(∂C) + ν(∂C)] + 0 + 0

≤ µ0(int(C)) +ε

θ.

Note that W pp (µ0, ν) =

∫int(C)×∂C d

p(ξ, ζ)γ0(dξ, dζ), it follows that

Wp(µ′ε, ν)≤

(1− ε

θ

)∫int(C)×∂C

dp(ξ, ζ)γ0(dξ, dζ) +(

1− ε

θ

)ε+ 0

≤ θ.

Hence the proof is completed. �

Proof of Proposition 4. Observe that

infµ∈M

Eη∼µ[η(x−1(1))] = infµ∈P(Ξ)

{Eη∼µ[η(x−1(1))] : min

γ∈Γ(µ,ν)Eγ [d(η, η)]≤ θ

}= inf

γ∈P(Ξ2)

{E(η,η)∼γ [η(x−1(1))] : Eγ [d(η, η)]≤ θ, π2

#γ = ν}.

(38)

For any γ ∈ P(Ξ2), denote by γη the conditional distribution of θ := d(η, η) given η, and by γη,θthe conditional distribution of η given η and θ. Using tower property of conditional probability, wehave that for any γ ∈P(Ξ2) with π2

#γ = ν,

E(η,η)∼γ [η(x−1(1))] =Eη∼ν[Eθ∼γη

[Eη∼γη,θ [η(x−1(1))]

]],

and

Eγ [d(η, η)] =Eη∼ν[Eθ∼γη [θ]

].


Observe that the right-hand side of the second equation above does not depend on γ_{η̄,θ}. Thereby (38) can be reformulated as

inf_{µ∈M} E_{η∼µ}[η(x^{-1}(1))]
= inf_{{γ_{η̄}}, {γ_{η̄,θ}}} { E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ E_{η∼γ_{η̄,θ}}[η(x^{-1}(1))] ] ] : E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] ≤ θ }
= inf_{{γ_{η̄}}} { E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ inf_{{γ_{η̄,θ}}} E_{η∼γ_{η̄,θ}}[η(x^{-1}(1))] ] ] : E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] ≤ θ },    (39)

where the second equality follows from the interchangeability principle (cf. Theorem 14.60 in Rockafellar and Wets [36]). We claim that

inf_{{γ_{η̄}}} { E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ inf_{{γ_{η̄,θ}}} E_{η∼γ_{η̄,θ}}[η(x^{-1}(1))] ] ] : E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] ≤ θ }
= inf_{ρ∈P(B([0,1])×Ξ)} { E_{(η,η̄)∼ρ}[η(int(x^{-1}(1)))] : E_{(η,η̄)∼ρ}[W_1(η, η̄)] ≤ θ, π_2#ρ = ν }.    (40)

Indeed, let ρ be any feasible solution of the right-hand side of (40). We denote by ρ_{η̄} the conditional distribution of θ̄ := W_1(η, η̄) given η̄, and by ρ_{η̄,θ̄} the conditional distribution of η given η̄ and θ̄.

When η̄ = 0 (i.e., no arrival) or θ̄ = 0, set γ_{η̄} = δ_0 and γ_{η̄,θ̄} = δ_{η̄}; that is, we choose γ_{η̄} and γ_{η̄,θ̄} such that η = η̄. When η̄ ≠ 0 and θ̄ > 0, applying Corollary 1 (Example 7) and Proposition 3 to the problem min_{η∈B([0,1])} {η(x^{-1}(1)) : W_1(η, η̄) ≤ θ̄}, we have that for any ε > 0, there exists an ε-optimal solution η of the form

η = Σ_{i=1, i≠i_0}^{η̄([0,1])} δ_{ξ_i} + p_{η̄,θ̄} δ_{ξ^+_{i_0}} + (1 − p_{η̄,θ̄}) δ_{ξ^−_{i_0}},

where 1 ≤ i_0 ≤ η̄([0,1]), p_{η̄,θ̄} ∈ [0,1], ξ_i ∈ [0,1] for all i ≠ i_0, and ξ^±_{i_0} ∈ [0,1]. Define

η^±_{η̄,θ̄} := Σ_{i=1, i≠i_0}^{η̄([0,1])} δ_{ξ_i} + δ_{ξ^±_{i_0}}.

It follows that η^±_{η̄,θ̄}([0,1]) = η̄([0,1]), and

p_{η̄,θ̄} η^+_{η̄,θ̄}(x^{-1}(1)) + (1 − p_{η̄,θ̄}) η^−_{η̄,θ̄}(x^{-1}(1)) ≤ ε + min_{η∈B([0,1])} { η(int(x^{-1}(1))) : W_1(η, η̄) ≤ θ̄ },    (41)

and

p_{η̄,θ̄} W_1(η^+_{η̄,θ̄}, η̄) + (1 − p_{η̄,θ̄}) W_1(η^−_{η̄,θ̄}, η̄) ≤ θ̄.    (42)

Define γ_{η̄} and γ_{η̄,θ} by

γ_{η̄}(C) := ∫_0^∞ [ p_{η̄,θ̄} 1{W_1(η^+_{η̄,θ̄}, η̄) ∈ C} + (1 − p_{η̄,θ̄}) 1{W_1(η^−_{η̄,θ̄}, η̄) ∈ C} ] ρ_{η̄}(dθ̄), for all Borel sets C ⊂ [0,∞),

and

γ_{η̄,θ}(A) := ∫_0^∞ ∫_Ξ [ p_{η̄,θ̄} 1{ η^+_{η̄,θ̄} ∈ A, W_1(η^+_{η̄,θ̄}, η̄) = θ } + (1 − p_{η̄,θ̄}) 1{ η^−_{η̄,θ̄} ∈ A, W_1(η^−_{η̄,θ̄}, η̄) = θ } ] ρ_{η̄,θ̄}(dη) ρ_{η̄}(dθ̄), for all Borel sets A ⊂ Ξ.


Then ({γ_{η̄}}, {γ_{η̄,θ}}) is a feasible solution to the left-hand side of (40). Indeed, by condition (ii), we have d(η^±_{η̄,θ̄}, η̄) = W_1(η^±_{η̄,θ̄}, η̄); hence (42) implies that p_{η̄,θ̄} d(η^+_{η̄,θ̄}, η̄) + (1 − p_{η̄,θ̄}) d(η^−_{η̄,θ̄}, η̄) ≤ θ̄. Then, taking expectations on both sides,

E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] = ∫_Ξ ∫_0^∞ [ p_{η̄,θ̄} d(η^+_{η̄,θ̄}, η̄) + (1 − p_{η̄,θ̄}) d(η^−_{η̄,θ̄}, η̄) ] ρ_{η̄}(dθ̄) ν(dη̄) ≤ E_{η̄∼ν}[ E_{θ̄∼ρ_{η̄}}[θ̄] ] ≤ θ,

hence {γ_{η̄}} is feasible. Similarly, taking expectations on both sides of (41), we have that

E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ E_{η∼γ_{η̄,θ}}[η(x^{-1}(1))] ] ] ≤ ε + E_ρ[η(int(x^{-1}(1)))].

Letting ε → 0, we obtain that

inf_{{γ_{η̄}}} { E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ inf_{{γ_{η̄,θ}}} E_{η∼γ_{η̄,θ}}[η(x^{-1}(1))] ] ] : E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] ≤ θ }
≤ inf_{ρ∈P(B([0,1])×Ξ)} { E_{(η,η̄)∼ρ}[η(int(x^{-1}(1)))] : E_{(η,η̄)∼ρ}[W_1(η, η̄)] ≤ θ, π_2#ρ = ν }.

To show the opposite direction of the above inequality, observe that inf_{µ_{η̄,θ}} E_{η∼µ_{η̄,θ}}[η(x^{-1}(1))] = inf_{η∈Ξ} {η(x^{-1}(1)) : d(η, η̄) = θ}. Hence

inf_{{γ_{η̄}}} { E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ inf_{{γ_{η̄,θ}}} E_{η∼γ_{η̄,θ}}[η(x^{-1}(1))] ] ] : E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] ≤ θ }
= inf_{{γ_{η̄}}} { E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ inf_{η∈Ξ} { η(x^{-1}(1)) : d(η, η̄) = θ } ] ] : E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[θ] ] ≤ θ }.    (43)

Let ({γ_{η̄}}, {η_{η̄,θ}}) be a feasible solution of the right-hand side of (43). Then the joint distribution ρ ∈ P(B([0,1]) × Ξ) defined by

ρ(B) := ∫_{π_2(B)} ∫_0^∞ 1{η_{η̄,θ} ∈ π_1(B)} γ_{η̄}(dθ) ν(dη̄), for all Borel sets B ⊂ B([0,1]) × Ξ,

is a feasible solution of the right-hand side of (40). By condition (iii), we have that

inf_{η∈Ξ} { η(x^{-1}(1)) : d(η, η̄) = θ } ≥ inf_{η∈B([0,1])} { η(int(x^{-1}(1))) : W_1(η, η̄) ≤ θ },

and thus E_{η̄∼ν}[ E_{θ∼γ_{η̄}}[ inf_η { η(x^{-1}(1)) : d(η, η̄) = θ } ] ] ≥ E_{(η,η̄)∼ρ}[η(int(x^{-1}(1)))]. This proves the opposite direction, so (40) holds. Together with (39), we obtain (20).

It then follows that it suffices to consider only policies x such that x^{-1}(1) is an open set. Then, by Corollary 1 (Example 7), the problem min_{η∈B([0,1])} { η(x^{-1}(1)) : W_1(η, η̄) ≤ θ } admits a worst-case distribution η_{η̄,θ}; let λ_{η̄,θ} be the associated dual optimizer. Let Ξ̄ := {ξ̄_{im} : i = 1, . . . , N, m = 1, . . . , M_i}. We claim that it suffices to further restrict attention to those policies x such that each connected component of x^{-1}(1) contains at least one point of Ξ̄. Indeed, suppose there exists a connected component C_0 of x^{-1}(1) such that C_0 ∩ Ξ̄ = ∅. Then for every ζ ∈ supp η̄,

arg min_{ξ∈[0,1]} [ 1_{x^{-1}(1)}(ξ) + |ξ − ζ| ] ∉ C_0,

and thus η_{η̄,θ}(x^{-1}(1)) = η_{η̄,θ}(x^{-1}(1) \ C_0). Hence x′ := 1_{x^{-1}(1)\C_0} achieves a higher objective value v(x′) than v(x), and so x cannot be optimal. We finally conclude that there exist {x_j, x̄_j}_{j=1}^M, where M ≤ card(Ξ̄), such that (21) holds. □


Proof of Proposition 5. Using Corollary 2 and Proposition 4, we have that

v( Σ_{j=1}^M 1_{[x_j, x̄_j]} ) = min_{0≤p_i≤1, η_i, η′_i ∈ Ξ} { (1/N) Σ_{i=1}^N Σ_{j=1}^M [ −c(x̄_j − x_j) + p_i η_i{[x_j, x̄_j]} + (1 − p_i) η′_i{[x_j, x̄_j]} ] : (1/N) Σ_{i=1}^N [ p_i W_1(η_i, η̄_i) + (1 − p_i) W_1(η′_i, η̄_i) ] ≤ θ }.    (44)

By the equivalent definition of the one-dimensional Wasserstein distance [43], for η_i = Σ_{m=1}^{M_i} δ_{ξ_{im}} we have that W_1(η_i, η̄_i) = min_σ Σ_{m=1}^{M_i} |ξ_{im} − ξ̄_{iσ(m)}|, where the minimum is taken over all permutations σ of {1, . . . , M_i}. Hence

v( Σ_{j=1}^M 1_{[x_j, x̄_j]} ) − Σ_{j=1}^M ( −c(x̄_j − x_j) )
= min_{ξ_{im}, ξ′_{im} ∈ [0,1], 0≤p_i≤1} { (1/N) Σ_{i=1}^N [ p_i Σ_{m=1}^{M_i} Σ_{j=1}^M 1_{[x_j, x̄_j]}(ξ_{im}) + (1 − p_i) Σ_{m=1}^{M_i} Σ_{j=1}^M 1_{[x_j, x̄_j]}(ξ′_{im}) ] : (1/N) Σ_{i=1}^N [ p_i Σ_{m=1}^{M_i} |ξ_{im} − ξ̄_{im}| + (1 − p_i) Σ_{m=1}^{M_i} |ξ′_{im} − ξ̄_{im}| ] ≤ θ }.    (45)

Using Example 6, we have that

v( Σ_{j=1}^M 1_{[x_j, x̄_j]} ) − Σ_{j=1}^M ( −c(x̄_j − x_j) )
= min { (1/N) Σ_{i=1}^N ( p_i Σ_{m=1}^{M_i} Σ_{j=1}^M [ 1_{[x_j, x̄_j]}(ξ̄_{im}) − (p_{imj} + p̄_{imj}) ] + (1 − p_i) Σ_{m=1}^{M_i} Σ_{j=1}^M [ 1_{[x_j, x̄_j]}(ξ̄_{im}) − (p′_{imj} + p̄′_{imj}) ] ) :
(1/N) Σ_{i=1}^N ( p_i Σ_{m=1}^{M_i} Σ_{j=1}^M [ p_{imj} |x_j − ξ̄_{im}| + p̄_{imj} |x̄_j − ξ̄_{im}| ] + (1 − p_i) Σ_{m=1}^{M_i} Σ_{j=1}^M [ p′_{imj} |x_j − ξ̄_{im}| + p̄′_{imj} |x̄_j − ξ̄_{im}| ] ) ≤ θ,
Σ_{j=1}^M (p_{imj} + p̄_{imj}) ≤ 1, Σ_{j=1}^M (p′_{imj} + p̄′_{imj}) ≤ 1, for all i, m },

where the minimum is taken over all p_i, p_{imj}, p̄_{imj}, p′_{imj}, p̄′_{imj} ∈ [0,1]. Replacing p_i p_{imj} + (1 − p_i) p′_{imj} by p_{imj} and p_i p̄_{imj} + (1 − p_i) p̄′_{imj} by p̄_{imj}, and noticing that at optimality p_{imj}, p̄_{imj} > 0 only if ξ̄_{im} ∈ [x_j, x̄_j], and that at most one of {p_{imj}, p̄_{imj}}_{i,m,j} can be fractional, we obtain the result. □
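Remark. As an illustration of the one-dimensional Wasserstein computation invoked in the proof above: for two counting measures with the same number of unit-mass atoms on [0,1], the minimum over permutations in W_1(η_i, η̄_i) = min_σ Σ_m |ξ_{im} − ξ̄_{iσ(m)}| is attained by matching the atoms in increasing order, so W_1 can be evaluated by a sort. The following Python sketch is our own illustration (the function name is not from the paper):

import numpy as np

def w1_counting_measures(xi, xi_bar):
    # W1 between sum_m delta_{xi[m]} and sum_m delta_{xi_bar[m]} with equal atom counts:
    # the optimal permutation matches atoms in sorted order, so
    # min_sigma sum_m |xi[m] - xi_bar[sigma(m)]| = sum_m |sort(xi)[m] - sort(xi_bar)[m]|.
    assert len(xi) == len(xi_bar)
    return float(np.sum(np.abs(np.sort(xi) - np.sort(xi_bar))))

For example, w1_counting_measures([0.1, 0.7], [0.2, 0.5]) returns 0.3, whereas the other matching would cost 0.9.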

Proof of Proposition 6. The dual optimizer of the inner maximization problem of (22) is zero when θ is sufficiently large, whence the worst-case value of the inner maximization problem equals sup_{ξ∈[0,T]} −ln(a(ξ)). Then the overall objective function equals

∫_0^T a(t) dt + sup_{ξ∈[0,T]} −ln(a(ξ)).

Set b = ∫_0^T a(t) dt. Then, due to the second term sup_{ξ∈[0,T]} −ln(a(ξ)), the constant solution a(t) ≡ b/T yields an objective value no larger than that of a(·). Hence we complete the proof. □
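To spell out the comparison in the last step (a short verification added here for completeness; it is implicit in the argument above): any feasible a(·) > 0 with ∫_0^T a(t) dt = b satisfies inf_{ξ∈[0,T]} a(ξ) ≤ b/T, and hence

∫_0^T a(t) dt + sup_{ξ∈[0,T]} −ln(a(ξ)) = b − ln( inf_{ξ∈[0,T]} a(ξ) ) ≥ b − ln(b/T),

while the constant profile a ≡ b/T attains the value b − ln(b/T).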

Proof of Proposition 7. Define C_w := {ξ : −w⊤ξ < q} for all w. As in Example 7, there exists a worst-case distribution µ* attaining the infimum inf_{µ∈M} P_µ{−w⊤ξ < q}, and there exist maps T*, T_* such that for each ζ ∈ supp ν, it holds that T*(ζ), T_*(ζ) ∈ {ζ} ∪ arg min_{ξ∈Ξ\C_w} ||ξ − ζ||_∞^p.

With this in mind, let γ* be the optimal transport plan between ν and µ*, and let

t* := ν-ess sup_{ζ∈Ξ} { min_{ξ∈Ξ\C_w} ||ξ − ζ||_∞ : ζ ≠ T*(ζ) }.

So t* is the longest transportation distance among all the points that are transported. (We note that t* = ∞ is allowed in the definition; however, as will be shown, this would violate the probability bound.) Then µ* transports all the points in supp ν ∩ {ξ : q − t* < −w⊤ξ < q}, and possibly a fraction β* ∈ [0,1] of the mass in supp ν ∩ {ξ : −w⊤ξ = q − t*}. Also note that, by Hölder's inequality, the distance between two hyperplanes {ξ : −w⊤ξ = s} and {ξ : −w⊤ξ = s′} equals |s − s′|/||w||_1 = |s − s′|. Using this characterization, let us define a probability measure ν_w on ℝ by

ν_w{(−∞, s)} := ν{ξ : −w⊤ξ < s}, for all s ∈ ℝ;

then, using a change of measure, the total transportation cost can be computed as

∫_{(Ξ\C_w)×C_w} d^p(ξ, ζ) γ*(dξ, dζ) = ∫_{(q−t*)^+}^{q^-} (q − s)^p ν_w(ds) + β* ν_w({q − t*}) t*^p ≤ θ^p.    (46)

On the other hand, using the marginal property and the characterization of γ*,

µ*(C_w) = ∫_{C_w×Ξ} γ*(dξ, dζ)
= ν(C_w) − ∫_{(Ξ\C_w)×C_w} γ*(dξ, dζ)
= 1 − ν_w([q,∞)) − β* ν_w({q − t*}) − ν_w((q − t*, q))
= 1 − ν_w((q − t*, ∞)) − β* ν_w({q − t*}).

Thereby the condition inf_{µ∈M} µ(C_w) ≥ 1 − α is equivalent to

β* ν_w({q − t*}) + ν_w((q − t*, ∞)) ≤ α.    (47)

Now consider the quantity

J := ∫_{(VaR_α[−w⊤ξ])^+}^{q^-} (q − s)^p ν_w(ds) + β_0 ν_w({VaR_α[−w⊤ξ]}) (q − VaR_α[−w⊤ξ])^p − θ^p.

If J < 0, then due to the monotonicity in t* of the right-hand side of (46), either q − t* < VaR_α[−w⊤ξ], or q − t* = VaR_α[−w⊤ξ] and β* > β_0; but in both cases (47) is violated. On the other hand, if J ≥ 0, again by monotonicity, either q − t* > VaR_α[−w⊤ξ], or q − t* = VaR_α[−w⊤ξ] and β* ≤ β_0, and thus (47) is satisfied. Therefore we complete the proof. □
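Remark. For an empirical nominal distribution, the feasibility check described in this proof (push the cheapest mass across the level q and compare the required budget with θ^p) can be carried out directly. The following Python sketch is our own illustration, not code from the paper; it assumes ||w||_1 = 1 as in the proposition, splits the boundary atom linearly, and resolves the boundary case according to the J ≥ 0 criterion above. The function name and interface are hypothetical.

import numpy as np

def drso_chance_constraint_holds(xi, w, q, alpha, theta, p=1):
    # Decide whether inf_{mu in M} mu{-w'xi < q} >= 1 - alpha for the empirical
    # nominal distribution of the sample xi (rows are observations).
    s = np.asarray(xi) @ np.asarray(w) * (-1.0)   # s_i = -w' xi_i
    n = len(s)
    already = np.sum(s >= q) / n                  # nominal mass with -w'xi >= q
    extra = alpha - already                       # extra mass to push across level q
    if extra < 0:
        return False                              # nominal tail alone exceeds alpha
    below = np.sort(s[s < q])[::-1]               # candidates, closest to q first
    masses = np.full(below.shape, 1.0 / n)
    cum = np.cumsum(masses)
    # fraction of each atom the adversary moves when targeting total mass `extra`
    moved = np.clip(extra - (cum - masses), 0.0, masses)
    cheapest_cost = np.sum(moved * (q - below) ** p)
    return cheapest_cost >= theta ** p            # corresponds to J >= 0 above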

Appendix C: Selecting Radius θ. We mainly use a classical result on the Wasserstein distance from Bolley et al. [10]. Let ν_N be the empirical distribution of ξ obtained from the underlying distribution ν_0. In Theorem 1.1 (see also Remark 1.4) of Bolley et al. [10], it is shown that

P{W_1(ν_N, ν_0) > θ} ≤ C(θ) e^{−(λ/2) N θ^2}

for some constant λ depending on ν_0 and some C depending on θ. Since their result holds for general distributions, we here simplify it for our purpose and explicitly compute the constants λ and C. For a more detailed analysis, we refer the reader to Section 2.1 in Bolley et al. [10].


Noticing that by assumption supp ν_0 ⊂ [0, B], the truncation step in Bolley et al. [10] is no longer needed; thus the probability bound (2.12) (see also (2.15)) of Bolley et al. [10] reduces to

P{W_1(ν_N, ν_0) > θ} ≤ max( 8eB/δ, 1 )^{N(δ/2)} e^{−(λ/8) N (θ−δ)^2}

for some constant λ > 0 and δ ∈ (0, θ), where e is the base of the natural logarithm and N(δ/2) is the minimal number of balls of radius δ/2 needed to cover the support of ξ; in our case, N(δ/2) = B/δ.

Now let us compute λ. By Theorem 1.1 of Bolley et al. [10], λ is the constant appearing in the Talagrand inequality

W_1(µ, ν_0) ≤ √( (2/λ) I_{φ_kl}(µ, ν_0) ),

where the Kullback-Leibler divergence of µ with respect to ν_0 is defined by I_{φ_kl}(µ, ν_0) = +∞ if µ is not absolutely continuous with respect to ν_0, and otherwise I_{φ_kl}(µ, ν_0) = ∫ f log f dν_0, where f is the Radon-Nikodym derivative dµ/dν_0. Corollary 4 in Bolley and Villani [11] shows that λ can be chosen as

λ = [ inf_{ζ_0∈Ξ, α>0} (1/α) ( 1 + log ∫ e^{α d^2(ξ, ζ_0)} ν_0(dξ) ) ]^{−1},

which can be estimated from data. Finally, we obtain the concentration inequality

P{W_1(ν_N, ν_0) > θ} ≤ max( 8eB/δ, 1 )^{B/δ} e^{−(λ/8) N (θ−δ)^2}.    (48)

In the numerical experiment, we choose δ to make the right-hand side of (48) as small as possible, and θ is chosen such that the right-hand side of (48) equals 0.05.
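The calibration just described can be carried out numerically. The following Python sketch is our own illustration of this procedure (not code from the paper): it assumes one-dimensional data supported on [0, B], estimates λ from the sample via the Bolley and Villani [11] formula above (restricting ζ_0 to the sample points and α to a grid), and returns the smallest θ on a grid for which the right-hand side of (48), minimized over δ ∈ (0, θ), drops to the target level 0.05. The helper names and the grid ranges are our own choices.

import numpy as np

def estimate_lambda(xi, alphas=np.logspace(-3, 2, 60)):
    # lambda = [ inf_{zeta0, alpha>0} (1/alpha)(1 + log E exp(alpha d^2(xi, zeta0))) ]^{-1},
    # with the expectation replaced by the empirical mean and zeta0 ranging over the sample.
    best = np.inf
    for zeta0 in xi:
        d2 = (xi - zeta0) ** 2
        for a in alphas:
            best = min(best, (1.0 + np.log(np.mean(np.exp(a * d2)))) / a)
    return 1.0 / best

def log_rhs_48(theta, delta, lam, N, B):
    # logarithm of the right-hand side of (48)
    return (B / delta) * np.log(max(8 * np.e * B / delta, 1.0)) - (lam / 8) * N * (theta - delta) ** 2

def select_theta(xi, B, target=0.05):
    N = len(xi)
    lam = estimate_lambda(np.asarray(xi, dtype=float))
    for theta in np.linspace(1e-3, B, 2000):
        deltas = np.linspace(1e-4, 0.999 * theta, 200)
        if min(log_rhs_48(theta, d, lam, N, B) for d in deltas) <= np.log(target):
            return theta
    return B  # the bound never reaches the target on this grid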

Acknowledgments. The authors would like to thank Wilfrid Gangbo, David Goldberg, Alex Shapiro, and Weijun Xie for several stimulating discussions, and Yuan Li for providing the image (Figure 1b).

References

[1] Aliprantis CD, Border K (2006) Infinite Dimensional Analysis: A Hitchhiker's Guide (Springer Science & Business Media).
[2] Ambrosio L, Fusco N, Pallara D (2000) Functions of Bounded Variation and Free Discontinuity Problems, volume 254 (Oxford: Clarendon Press).
[3] Aubin JP, Frankowska H (2009) Set-Valued Analysis (Springer Science & Business Media).
[4] Barbour AD, Brown TC (1992) Stein's method and point process approximation. Stochastic Processes and their Applications 43(1):9–31.
[5] Bayraksan G, Love DK (2015) Data-driven stochastic programming using phi-divergences. Tutorials in Operations Research.
[6] Ben-Tal A, Den Hertog D, De Waegenaere A, Melenberg B, Rennen G (2013) Robust solutions of optimization problems affected by uncertain probabilities. Management Science 59(2):341–357.
[7] Ben-Tal A, Goryashko A, Guslitzer E, Nemirovski A (2004) Adjustable robust solutions of uncertain linear programs. Mathematical Programming 99(2):351–376.
[8] Berger JO (1984) The robust Bayesian viewpoint. Studies in Bayesian Econometrics and Statistics: In Honor of Leonard J. Savage, edited by Joseph B. Kadane 4(2):63–124.
[9] Blanchet J, Murthy KR (2016) Quantifying distributional model risk via optimal transport. arXiv preprint arXiv:1604.01446.
[10] Bolley F, Guillin A, Villani C (2007) Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields 137(3-4):541–593.


[11] Bolley F, Villani C (2005) Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 14, 331–352.
[12] Boyd S, Vandenberghe L (2004) Convex Optimization (Cambridge University Press).
[13] Calafiore GC, El Ghaoui L (2006) On distributionally robust chance-constrained linear programs. Journal of Optimization Theory and Applications 130(1):1–22.
[14] Chen LH, Xia A (2004) Stein's method, Palm theory and Poisson process approximation. Annals of Probability 2545–2569.
[15] Daley DJ, Vere-Jones D (2003) An Introduction to the Theory of Point Processes, Volume I: Elementary Theory and Methods (Springer), second edition, ISBN 0-387-95541-0.
[16] Delage E, Ye Y (2010) Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research 58(3):595–612.
[17] Dupacova J (1987) The minimax approach to stochastic programming and an illustrative application. Stochastics: An International Journal of Probability and Stochastic Processes 20(1):73–88.
[18] El Ghaoui L, Oks M, Oustry F (2003) Worst-case value-at-risk and robust portfolio optimization: A conic programming approach. Operations Research 51(4):543–556.
[19] Erdogan E, Iyengar G (2006) Ambiguous chance-constrained problems and robust optimization. Mathematical Programming 107(1-2):37–61.
[20] Esfahani PM, Kuhn D (2015) Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. arXiv preprint arXiv:1505.05116.
[21] Gallego G, Moon I (1993) The distribution free newsboy problem: review and extensions. Journal of the Operational Research Society 825–834.
[22] Gibbs AL, Su FE (2002) On choosing and bounding probability metrics. International Statistical Review 70(3):419–435.
[23] Goh J, Sim M (2010) Distributionally robust optimization and its tractable approximations. Operations Research 58(4-part-1):902–917.
[24] Gonzalez RC, Woods RE (2006) Digital Image Processing (3rd Edition) (Prentice-Hall, Inc.).
[25] Jiang R, Guan Y (2015) Data-driven chance constrained stochastic program. Mathematical Programming 1–37.
[26] Kantorovich LV (1942) On the translocation of masses. Dokl. Akad. Nauk SSSR, volume 37, 199–201.
[27] Kantorovich LV (1960) Mathematical methods of organizing and planning production. Management Science 6(4):366–422.
[28] Ling H, Okada K (2007) An efficient earth mover's distance algorithm for robust histogram comparison. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(5):840–853.
[29] Nemirovski A (2004) Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization 15(1):229–251.
[30] Nesterov Y, Nemirovski A (2013) On first-order algorithms for ℓ1/nuclear norm minimization. Acta Numerica 22:509–575.
[31] Owhadi H, Scovel C (2015) Extreme points of a ball about a measure with finite support. arXiv preprint arXiv:1504.06745.
[32] Pardo L (2005) Statistical Inference Based on Divergence Measures (CRC Press).
[33] Parikh N, Boyd SP (2014) Proximal algorithms. Foundations and Trends in Optimization 1(3):127–239.
[34] Pflug GC, Pichler A (2014) Multistage Stochastic Optimization (Springer).
[35] Popescu I (2007) Robust mean-covariance solutions for stochastic optimization. Operations Research 55(1):98–112.
[36] Rockafellar RT, Wets RJB (2009) Variational Analysis, volume 317 (Springer Science & Business Media).
[37] Rubner Y, Tomasi C, Guibas LJ (2000) The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2):99–121.
[38] Scarf H, Arrow K, Karlin S (1958) A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production 10:201–209.


[39] Shapiro A (2001) On duality theory of conic linear problems. Semi-Infinite Programming, 135–165 (Springer).
[40] Shapiro A, Dentcheva D, et al. (2014) Lectures on Stochastic Programming: Modeling and Theory, volume 16 (SIAM).
[41] Shapiro A, Kleywegt A (2002) Minimax analysis of stochastic problems. Optimization Methods and Software 17(3):523–542.
[42] Sun H, Xu H (2015) Convergence analysis for distributionally robust optimization and equilibrium problems. Mathematics of Operations Research.
[43] Vallender S (1974) Calculation of the Wasserstein distance between probability distributions on the line. Theory of Probability & Its Applications 18(4):784–786.
[44] Villani C (2003) Topics in Optimal Transportation. Number 58 (American Mathematical Soc.).
[45] Villani C (2008) Optimal Transport: Old and New, volume 338 (Springer Science & Business Media).
[46] Wang Z, Glynn PW, Ye Y (2015) Likelihood robust optimization for data-driven problems. Computational Management Science 1–21.
[47] Wozabal D (2012) A framework for optimization under ambiguity. Annals of Operations Research 193(1):21–47.
[48] Wozabal D (2014) Robustifying convex risk measures for linear portfolios: A nonparametric approach. Operations Research 62(6):1302–1315.
[49] Zackova J (1966) On minimax solutions of stochastic linear programming problems. Časopis pro pěstování matematiky 91(4):423–430.
[50] Zhao C, Guan Y (2015) Data-driven risk-averse stochastic optimization with Wasserstein metric. Available on Optimization Online.
[51] Zymler S, Kuhn D, Rustem B (2013) Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming 137(1-2):167–198.