
CME307/MS&E311: Optimization Lecture Note #01

Mathematical Optimization Models and Applications

Yinyu Ye

Department of Management Science and Engineering

Stanford University

Stanford, CA 94305, U.S.A.

http://www.stanford.edu/˜yyye

Chapters 1, 2.1-2, 6.1-2, 7.2, 11.1, 11.4


What you learn in CME307/MS&E311

• Present a core element, mathematical optimization theories and algorithms, for the ICME/MS&E disciplines.

• Provide mathematical proofs and in-depth theoretical analyses of the optimization/game models/algorithms discussed in MS&E211.

• Introduce additional conic and nonlinear/nonconvex optimization/game models/problems compared to MS&E310.

• Describe new/recent effective optimization/game models/methods/algorithms in Data Science, Machine Learning and AI.

• Emphasis is on nonlinear, nonconvex and stochastic/sample-based optimization theories and practices, together with convex analyses.


Mathematical Optimization

The field of optimization is concerned with the study of maximization and minimization of mathematical functions. Very often the arguments of (i.e., variables or unknowns in) these functions are subject to side conditions or constraints. By virtue of its great utility in such diverse areas as applied science, engineering, economics, finance, medicine, and statistics, optimization holds an important place in both the practical world and the scientific world. Indeed, as far back as the eighteenth century, the famous Swiss mathematician and physicist Leonhard Euler (1707-1783) proclaimed[a] that ". . . nothing at all takes place in the Universe in which some rule of maximum or minimum does not appear."

[a] See Leonhardo Eulero, Methodus inveniendi lineas curvas maximi minimive proprietate gaudentes, Lausanne & Geneva, 1744, p. 245.


Mathematical Optimization/Programming (MP)

The class of mathematical optimization/programming problems considered in this course can all be expressed in the form

(P)  minimize   f(x)
     subject to x ∈ X,

where X is usually specified by constraints:

     ci(x) = 0, i ∈ E;    ci(x) ≤ 0, i ∈ I.
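As a concrete illustration of the form (P), the sketch below solves a small problem with one equality and one inequality constraint using SciPy; the objective and constraints are made up for illustration and are not from the lecture.

```python
# Sketch: an instance of (P) — minimize f(x) subject to one constraint in E
# and one in I — solved with scipy.optimize.minimize (SLSQP is chosen
# automatically when constraints are present).
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] ** 2 + x[1] ** 2                        # objective f(x)
cons = [
    {"type": "eq",   "fun": lambda x: x[0] + x[1] - 1.0},  # c1(x) = 0 (i in E)
    {"type": "ineq", "fun": lambda x: x[0] - 0.2},         # c2(x) = 0.2 - x1 <= 0 (i in I)
]
res = minimize(f, x0=np.array([0.0, 0.0]), constraints=cons)
# the minimizer of x1^2 + x2^2 on the line x1 + x2 = 1 is (0.5, 0.5)
```

Note that SciPy writes inequality constraints as fun(x) ≥ 0, so the "≤ 0" convention of (P) is encoded with a sign flip.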


Model Classifications

Optimization problems are generally divided into Unconstrained, Linear and Nonlinear Programming based

upon the objective and constraints of the problem

• Unconstrained Optimization: Ω is the entire space Rn

• Linear Optimization: If both the objective and the constraint functions are linear/affine

• Conic Linear Optimization: If both the objective and the constraint functions are linear/affine, but

variables in a convex cone.

• Nonlinear Optimization: If the constraints contain general nonlinear functions

• Stochastic Optimization: Optimize the expected objective function with random parameters

• Fixed-Point Computation: Optimization of multiple agents with zero-sum objectives

• There are integer program, mixed-integer program etc.

We present a few optimization examples in this lecture that we would cover through out this course.


Unconstrained Optimization: Logistic Regression I

Given a training data point ai ∈ Rn, according to the logistic model, the probability that it is in a class C is represented by

     e^(aiᵀx + x0) / (1 + e^(aiᵀx + x0)).

Thus, given some training data points, we would like to determine the intercept x0 and slope vector x ∈ Rn such that

     e^(aiᵀx + x0) / (1 + e^(aiᵀx + x0)) ≈ 1 if ai ∈ C, and ≈ 0 otherwise.

Then the probability of giving the "right classification answer" for all training data points is

     ( ∏_{ai∈C} e^(aiᵀx + x0) / (1 + e^(aiᵀx + x0)) ) · ( ∏_{ai∉C} 1 / (1 + e^(aiᵀx + x0)) ).


Logistic Regression II

Therefore, we would like to maximize this probability when deciding the intercept x0 and slope vector x ∈ Rn:

     ( ∏_{ai∈C} e^(aiᵀx + x0) / (1 + e^(aiᵀx + x0)) ) · ( ∏_{ai∉C} 1 / (1 + e^(aiᵀx + x0)) )
   = ( ∏_{ai∈C} 1 / (1 + e^(−aiᵀx − x0)) ) · ( ∏_{ai∉C} 1 / (1 + e^(aiᵀx + x0)) ).

Taking logarithms, this is equivalent to the minimization problem

     min_{x0,x}  ∑_{ai∈C} ln(1 + e^(−aiᵀx − x0)) + ∑_{ai∉C} ln(1 + e^(aiᵀx + x0)).

This is an unconstrained optimization problem, where the objective is a convex function of the decision variables: intercept x0 and slope vector x ∈ Rn.
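A minimal sketch of fitting (x0, x) by gradient descent on this convex objective; the synthetic data, labels, and step size below are assumptions for illustration, not from the slides. With labels li = ±1, the objective can be written as ∑i ln(1 + e^(−li(aiᵀx + x0))).

```python
# Sketch: gradient descent on the logistic negative log-likelihood,
# sum_i ln(1 + exp(-l_i (a_i^T x + x0))), with labels l_i = +/-1.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))                 # training points a_i in R^2
# synthetic, linearly separable classes (hypothetical data)
labels = np.where(A @ np.array([1.0, -2.0]) + 0.5 > 0, 1.0, -1.0)

Aug = np.hstack([np.ones((100, 1)), A])       # prepend 1 to absorb the intercept x0
w = np.zeros(3)                               # w = (x0, x1, x2)
for _ in range(2000):
    z = labels * (Aug @ w)
    # gradient of the loss: -sum_i l_i a_i / (1 + exp(z_i))
    grad = -(Aug * (labels / (1.0 + np.exp(z)))[:, None]).sum(axis=0)
    w -= 0.01 * grad

loss = np.log1p(np.exp(-labels * (Aug @ w))).sum()
```

Since the objective is convex, plain gradient descent with a small enough step size converges; in practice one would use a solver such as Newton's method or L-BFGS.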


Data Classification: Support Vector Machine

Similar to logistic regression, suppose we have two-class discrimination data. We assign the first class the value 1 and the second the value −1 of a binary variable. A powerful discrimination method is the Support Vector Machine (SVM).

Let the first-class data points be given by ai ∈ Rd, i = 1, ..., n1, and the second-class data points by bj ∈ Rd, j = 1, ..., n2. We would like to find a hyperplane separating the two classes:

     minimize   β + µ∥x∥²
     subject to aiᵀx + x0 + β ≥ 1, ∀i,
                bjᵀx + x0 − β ≤ −1, ∀j,
                β ≥ 0,

where µ is a fixed positive regularization parameter.

This is a constrained quadratic program. If µ = 0, then it is a linear program!
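For the µ = 0 case, the model above is a linear program in the variables (x, x0, β); a sketch using SciPy's linprog on a tiny made-up separable data set (the points are assumptions for illustration):

```python
# Sketch: the mu = 0 SVM model as an LP — minimize beta subject to
# a_i^T x + x0 + beta >= 1 and b_j^T x + x0 - beta <= -1, beta >= 0.
import numpy as np
from scipy.optimize import linprog

a = np.array([[2.0, 2.0], [3.0, 1.5]])     # class +1 points a_i (made up)
b = np.array([[-1.0, -1.0], [-2.0, 0.0]])  # class -1 points b_j (made up)

# variables v = (x1, x2, x0, beta); objective: minimize beta
c = np.array([0.0, 0.0, 0.0, 1.0])
A_ub = np.vstack([
    np.hstack([-a, -np.ones((2, 1)), -np.ones((2, 1))]),  # -(a_i^T x + x0 + beta) <= -1
    np.hstack([ b,  np.ones((2, 1)), -np.ones((2, 1))]),  #   b_j^T x + x0 - beta <= -1
])
b_ub = -np.ones(4)
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * 3 + [(0, None)])    # x, x0 free; beta >= 0
x, x0, beta = res.x[:2], res.x[2], res.x[3]
```

Since the data are separable, the optimal β is 0 and the hyperplane aᵀx + x0 = 0 strictly separates the classes.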



Figure 1: Linear Classifier: Support Vector Machine


Constrained Optimization: Conic Linear Programming in Abstract Form

     minimize   cᵀx
     subject to Ax = b,
                x ∈ K.

Linear Programming (LP): when K is the nonnegative orthant cone.

Second-Order Cone Programming (SOCP): when K is the second-order cone.

Semidefinite Programming (SDP): when K is the semidefinite matrix cone.

LP example:
     min 2x1 + x2 + x3   s.t. x1 + x2 + x3 = 1, (x1; x2; x3) ≥ 0.

SOCP example:
     min 2x1 + x2 + x3   s.t. x1 + x2 + x3 = 1, √(x2² + x3²) ≤ x1.

SDP example:
     min 2x1 + x2 + x3   s.t. x1 + x2 + x3 = 1, [x1 x2; x2 x3] ≽ 0.
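The LP instance above is small enough to check numerically; a sketch using SciPy's linprog:

```python
# Sketch: min 2x1 + x2 + x3  s.t.  x1 + x2 + x3 = 1, x >= 0.
# Any mass on x1 costs 2 while x2, x3 cost 1, so the optimal value is 1.
from scipy.optimize import linprog

res = linprog(c=[2.0, 1.0, 1.0],
              A_eq=[[1.0, 1.0, 1.0]], b_eq=[1.0],
              bounds=[(0, None)] * 3)
```

The SOCP and SDP instances need a conic solver (e.g. a modeling tool such as CVXPY), which linprog does not handle.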


Reinforcement Learning: Markov Decision/Game Process

• RL/MDPs provide a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

• Markov game processes (MGPs) provide a mathematical framework for modeling the sequential decision-making of a two-person turn-based zero-sum game.

• MDPs/MGPs are useful for studying a wide range of optimization/game problems solved via dynamic programming, an approach known at least as early as the 1950s (cf. Shapley 1953, Bellman 1957).

• Modern applications include dynamic planning under uncertainty, reinforcement learning, social networking, and almost all other stochastic dynamic/sequential decision/game problems in the Mathematical, Physical, Management and Social Sciences.


MDP Stationary Policy and Cost-to-Go Value

• An MDP problem is defined by a given number of states, indexed by i, where each state has a set of actions, Ai, to take. Each action j ∈ Ai is associated with an (immediate) cost cj of taking it, and a probability distribution pj over all possible states at the next time period.

• A stationary policy for the decision maker is a function π = (π1, π2, · · · , πm) that specifies an action in each state, πi ∈ Ai, that the decision maker will take at any time period; it also leads to a cost-to-go value for each state.

• The MDP is to find a stationary policy to minimize/maximize the expected discounted sum over the infinite horizon with a discount factor 0 ≤ γ < 1:

     ∑_{t=0}^{∞} γᵗ E[c_{π_{i_t}}(i_t, i_{t+1})].

• If the states are partitioned into two sets, one set minimizing and the other maximizing the discounted sum, then the process becomes a two-person turn-based zero-sum stochastic game.


An MDGP Toy Example: A Maze Runner

Actions are shown in red, blue and black; all actions have zero cost except the one from state 4 to the exit/termination state 5. Which action should be taken from every state to minimize the total cost (the optimal policy)?


Toy Example: Game Setting

States 0, 1, 2 minimize, while States 3, 4 maximize.


The Cost-to-Go Values of the States

Cost-to-go values of each state when the actions in red are taken: the current policy is not optimal, since there are better actions available to reduce the cost.


The Optimal Cost-to-Go Value Vector

Let y ∈ Rm represent the cost-to-go values of the m states, the ith entry for the ith state, under a given policy.

The MDP problem entails choosing an optimal policy whose cost-to-go value vector y* satisfies

     y*i = min_{j∈Ai} { cj + γ pjᵀ y* }, ∀i,

with optimal policy

     π*i = argmin_{j∈Ai} { cj + γ pjᵀ y* }, ∀i.

In the game setting, the conditions become

     y*i = min_{j∈Ai} { cj + γ pjᵀ y* }, ∀i ∈ I−,

and

     y*i = max_{j∈Ai} { cj + γ pjᵀ y* }, ∀i ∈ I+.

Both are fixed-point or saddle-point problems. The MDP problem can be cast as a linear program.


The Equivalent LP Formulation for MDP

This model can be reformulated as an LP:

     maximize_y  ∑_{i=1}^{m} yi
     subject to  yi − γ pjᵀ y ≤ cj,  ∀ j ∈ Ai,  i = 1, ..., m.

Theorem 1 When y is maximized, at least one inequality constraint in Ai must become an equality for every state i; that is, the maximal y is a fixed-point solution.


The Maze Runner Example

The fixed-point formulation:

     y0 = min{ 0 + γy1, 0 + γ(0.5y2 + 0.25y3 + 0.125y4 + 0.125y5) }
     y1 = min{ 0 + γy2, 0 + γ(0.5y3 + 0.25y4 + 0.25y5) }
     y2 = min{ 0 + γy3, 0 + γ(0.5y4 + 0.5y5) }
     y3 = min{ 0 + γy4, 0 + γy5 }
     y4 = 1 + γy5
     y5 = 0 (or y5 = 0 + γy5)

The LP formulation:

     maximize_y  y0 + y1 + y2 + y3 + y4 + y5
     subject to  each equality above changed into a "≤" inequality.
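The fixed point can also be computed by value iteration, i.e., repeatedly applying the minimization equations; a sketch, assuming a discount factor γ = 0.95 (the slides leave γ unspecified):

```python
# Sketch: value iteration on the maze fixed-point system, gamma = 0.95 (assumed).
gamma = 0.95
y = [0.0] * 6                      # cost-to-go values y0..y5, initialized to 0
for _ in range(200):               # Jacobi-style sweeps until convergence
    y = [
        min(gamma * y[1], gamma * (0.5*y[2] + 0.25*y[3] + 0.125*y[4] + 0.125*y[5])),
        min(gamma * y[2], gamma * (0.5*y[3] + 0.25*y[4] + 0.25*y[5])),
        min(gamma * y[3], gamma * (0.5*y[4] + 0.5*y[5])),
        min(gamma * y[4], gamma * y[5]),
        1.0 + gamma * y[5],        # state 4: cost 1 to the exit
        0.0,                       # state 5 is the termination state
    ]
```

Here the iteration settles on y = (0, 0, 0, 0, 1, 0): since every state other than 4 has an action avoiding the costly transition, only state 4 carries a positive cost-to-go value.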


More LP Examples: Sparsest Data Fitting

We want to find a sparsest solution fitting exact data measurements, that is, to minimize the number of non-zero entries in x such that Ax = b:

     minimize   ∥x∥0 = |{j : xj ≠ 0}|
     subject to Ax = b.

Sometimes this objective can be accomplished by LASSO:

     minimize   ∥x∥1 = ∑_{j=1}^{n} |xj|
     subject to Ax = b.

It can be equivalently represented by (?)

     minimize   ∑_{j=1}^{n} yj
     subject to Ax = b, −y ≤ x ≤ y;

or

     minimize   ∑_{j=1}^{n} (x′j + x″j)
     subject to A(x′ − x″) = b, x′ ≥ 0, x″ ≥ 0.

Both are linear programs!
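The second (split-variable) reformulation can be sketched as follows; the small underdetermined system below is made up for illustration:

```python
# Sketch: minimize sum(x' + x'') s.t. A(x' - x'') = b, x', x'' >= 0,
# which solves min ||x||_1 s.t. Ax = b.
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])   # hypothetical 2x3 measurement matrix
b = np.array([1.0, 1.0])
n = A.shape[1]

c = np.ones(2 * n)                # objective: sum of x'_j + x''_j
A_eq = np.hstack([A, -A])         # A x' - A x'' = b
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
x = res.x[:n] - res.x[n:]         # recover x = x' - x''
```

For this system the solutions are x = (1−t, t, 1−t); the ℓ1 norm |t| + 2|1−t| is minimized at t = 1, so the LP recovers x = (0, 1, 0), which is also the sparsest solution.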


Sparsest Data Fitting continued

A better approximation of the objective can be accomplished by

     minimize   ∥x∥p := (∑_{j=1}^{n} |xj|^p)^{1/p}
     subject to Ax = b;

or

     minimize   ∥Ax − b∥² + µ (∑_{j=1}^{n} |xj|^p)^{1/p}

for some 0 < p < 1, where µ > 0 is a regularization parameter.

Or simply

     minimize   ∥x∥p^p := ∑_{j=1}^{n} |xj|^p
     subject to Ax = b;

or

     minimize   ∥Ax − b∥² + β (∑_{j=1}^{n} |xj|^p).

The former is a linearly constrained (nonconvex) optimization problem!


SOCP Example: Facility Location

cj is the location of client j = 1, 2, ..., m, and y is the location of a facility to be built.

     minimize   ∑_j ∥y − cj∥p.

Or equivalently (?)

     minimize   ∑_j δj
     subject to y + xj = cj, ∥xj∥p ≤ δj, ∀j.

This is a p-order conic linear program for p ≥ 1.

In particular, when p = 2, it is an SOCP problem.

For simplicity, consider m = 3.
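For p = 2 and m = 3, the unconstrained form can be minimized directly with a general-purpose solver; a sketch with made-up client locations (a conic solver would be used for the SOCP form):

```python
# Sketch: minimize sum_j ||y - c_j||_2 over y in R^2 for three clients.
import numpy as np
from scipy.optimize import minimize

C = np.array([[0.0, 0.0],
              [4.0, 0.0],
              [2.0, 3.0]])                        # client locations c_1..c_3 (made up)

obj = lambda y: np.linalg.norm(y - C, axis=1).sum()
res = minimize(obj, x0=C.mean(axis=0))            # start from the centroid
```

The minimizer is the geometric median (Fermat point) of the three clients, which for this triangle lies strictly inside it and improves on the centroid.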



Figure 2: Facility Location at Point y.


Graph Realization and Sensor Network Localization

Given a graph G = (V, E) and a set of non-negative weights {dij : (i, j) ∈ E}, the goal is to compute a realization of G in the Euclidean space Rd for a given low dimension d, in which the distance information is preserved.

More precisely: given anchors ak ∈ Rd and distances dij for (i, j) ∈ Nx and dkj for (k, j) ∈ Na, find xi ∈ Rd such that

     ∥xi − xj∥² = d²ij, ∀ (i, j) ∈ Nx, i < j,
     ∥ak − xj∥² = d²kj, ∀ (k, j) ∈ Na.

This is a set of quadratic equations.

Does the system have a localization or realization of all the xj's? Is the localization unique? Is there a certification for the solution to make it reliable or trustworthy? Is the system partially localizable with certification?

It can be relaxed to an SOCP (change "=" to "≤") or an SDP.



Figure 3: 50-node 2-D Sensor Localization.


Matrix Representation of SNL and SDP Relaxation

Let X = [x1 x2 ... xn] be the d × n matrix to be determined, and let ej be the vector of all zeros except a 1 in the jth position. Then

     xi − xj = X(ei − ej)   and   ak − xj = [I X](ak; −ej),

so that

     ∥xi − xj∥² = (ei − ej)ᵀ XᵀX (ei − ej),

     ∥ak − xj∥² = (ak; −ej)ᵀ [I X]ᵀ[I X] (ak; −ej) = (ak; −ej)ᵀ [ I  X ; Xᵀ  XᵀX ] (ak; −ej).
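These identities are easy to verify numerically; a quick sketch on random data (the dimensions and values are arbitrary):

```python
# Sketch: checking ||x_i - x_j||^2 = (e_i - e_j)^T X^T X (e_i - e_j) and the
# anchor identity with the block matrix [I X; X^T X^T X] on random data.
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 4
X = rng.normal(size=(d, n))                 # X = [x1 ... xn]
a = rng.normal(size=d)                      # an anchor a_k
e = np.eye(n)

i, j = 0, 2
lhs = np.linalg.norm(X[:, i] - X[:, j]) ** 2
rhs = (e[i] - e[j]) @ (X.T @ X) @ (e[i] - e[j])

v = np.concatenate([a, -e[j]])              # the stacked vector (a_k; -e_j)
M = np.block([[np.eye(d), X], [X.T, X.T @ X]])
lhs2 = np.linalg.norm(a - X[:, j]) ** 2
rhs2 = v @ M @ v
```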


Or, equivalently,

     (ei − ej)ᵀ Y (ei − ej) = d²ij, ∀ (i, j) ∈ Nx, i < j,

     (ak; −ej)ᵀ [ I  X ; Xᵀ  Y ] (ak; −ej) = d²kj, ∀ (k, j) ∈ Na,

     Y = XᵀX.

Relax Y = XᵀX to Y ≽ XᵀX, which is equivalent to the matrix inequality

     [ I  X ; Xᵀ  Y ] ≽ 0.

This matrix has rank at least d; if its rank equals d, then Y = XᵀX, and the converse is also true.

The problem is now an SDP problem.


More CLP Examples: Portfolio Management

For an expected-return vector r and covariance matrix V of an investment portfolio, one management model is:

     minimize   xᵀVx
     subject to rᵀx ≥ µ,
                eᵀx = 1, x ≥ 0;

or simply

     minimize   xᵀVx
     subject to rᵀx ≥ µ,
                eᵀx = 1,

where e is the vector of all ones.

This is a convex quadratic program.
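A sketch of the first model on made-up data (the returns, covariance, and target µ below are assumptions for illustration):

```python
# Sketch: min x^T V x  s.t.  r^T x >= mu, e^T x = 1, x >= 0, via SLSQP.
import numpy as np
from scipy.optimize import minimize

r = np.array([0.10, 0.08, 0.05])                  # hypothetical expected returns
V = np.array([[0.10, 0.02, 0.01],
              [0.02, 0.08, 0.01],
              [0.01, 0.01, 0.05]])                # hypothetical covariance matrix
mu = 0.07                                         # target expected return

cons = [{"type": "ineq", "fun": lambda x: r @ x - mu},   # r^T x >= mu
        {"type": "eq",   "fun": lambda x: x.sum() - 1.0}]  # e^T x = 1
res = minimize(lambda x: x @ V @ x, x0=np.ones(3) / 3,
               constraints=cons, bounds=[(0, None)] * 3, method="SLSQP")
x = res.x
```

A dedicated QP solver would be the usual choice in practice; SLSQP is used here only to keep the sketch self-contained.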


More CLP Examples: Robust Portfolio Management

In applications, r and V may be estimated under various scenarios, say ri and Vi for i = 1, ..., m. Then we would like:

     minimize   max_i xᵀVix
     subject to min_i riᵀx ≥ µ,
                eᵀx = 1, x ≥ 0;

or equivalently

     minimize   α
     subject to riᵀx ≥ µ, ∀i,
                √(xᵀVix) ≤ α, ∀i,
                eᵀx = 1, x ≥ 0.

If we factorize Vi = RiᵀRi and let yi = Rix, we can rewrite the problem as

     minimize   α
     subject to riᵀx ≥ µ, yi − Rix = 0, ∀i,
                ∥yi∥ ≤ α, ∀i,
                eᵀx = 1, x ≥ 0.

This is an SOCP-LP problem with additional benefits.


Stochastic Optimization and Learning

In the real world, we most often solve

     minimize_{x∈X}  E_{Fξ}[h(x, ξ)]   (1)

where ξ represents random variables with joint distribution Fξ.

• Pros: In many cases, the expected value is a good measure of performance.

• Cons: One has to know the exact distribution of ξ to perform the stochastic optimization, so we most frequently use a sample distribution. Deviation from the assumed distribution may then result in sub-optimal solutions. Even knowing the distribution, the solution/decision is generically risky.


Learning with Noises/Distortions

Goodfellow et al. [2014]


Distributionally Robust Optimization and Learning

Why the error? Believing that the sample distribution is the true distribution...

In practice, although the exact distribution of the random variables may not be known, one usually has certain observed samples or training data and other statistical information. Thus, we can consider an enlarged distribution set D that confidently contains the sample distribution, and solve

     minimize_{x∈X}  max_{Fξ∈D}  E_{Fξ}[h(x, ξ)]   (2)

In DRO, we consider a set of distributions D and minimize the expected value under the worst distribution in D. When choosing D, we need to consider the following:

• Tractability

• Practical (statistical) meaning

• Performance (the potential loss compared to the benchmark cases)

This is a nonlinear min-max optimization / zero-sum-game problem.