Prediction without probability: a PDE approach to a model ... · Asymptotically self-similar blowup...

Prediction without probability: a PDE approachto a model problem from machine learning

Robert V. KohnCourant Institute, NYU

Joint work with Kangping Zhu (PhD 2014)and Nadejda Drenska (in progress)

Mathematics for Nonlinear Phenomena:Analysis and Computation

celebrating Yoshikazu Giga’s contributions and impact

Sapporo, August 2015

Robert V. Kohn Prediction without probability

Looking back

We met in Tokyo inJuly 1982, at aUS-Japan seminar.

Giga came to Courant soon thereafter. We decided to study blowupof ut = ∆u + up. Over the next few years we had a lot of fun.

Asymptotically self-similar blowup of semilinear heat equations,CPAM (1985)

Characterizing blowup using similarity variables, IUMJ (1987)

Nondegeneracy of blowup for semilinear heat equations, CPAM (1989)


Over the years

Our paths have crossed many times, and in many ways.

Navier-Stokes

1983, Giga: Time & spatialanalyticity of solutions of theNavier-Stokes equations

1983, Caffarelli-Kohn-Nirenberg:Partial regularity of suitable wksolns of the Navier-Stokes eqns

The Aviles-Giga functional

1987, Aviles-Giga: A math’l pbmrelated to the physical theory ofliquid crystal configurations

2000, Jin-Kohn: Singularperturbation and the energy offolds


Over the years

Crystalline surface energies

1998, M-H Giga & Y Giga:Evolving graphs by singularweighted curvature (the first ofmany joint papers!)

1994, Girao-Kohn: Convergenceof a crystalline algorithm for. . . the motion of a graph byweighted curvature

Level-set representations of interface motion

1991, Chen-Giga-Goto:Uniqueness and existence ofviscosity solutions of generalizedmean curvature flow equations

2005, Kohn-Serfaty: Adeterministic-control-basedapproach to motion by curvature


Over the years

Hamilton-Jacobi approach to spiral growth

2013, Giga-Hamamuki:Hamilton-Jacobi equations withdiscontinuous source terms

1999, Kohn-Schulze: Ageometric model for coarseningduring spiral-mode growth of thinfilms

Finite-time flattening of stepped crystals

2011, Giga-Kohn:Scale-invariant extinction timeestimates for some singulardiffusion equations


Over the years

Many thanks for

- your huge impact on our field- your leadership (both scientific and practical)- helping our community grow and prosper- a lot of fun in our joint projects- your friendship over the years.


Today’s mathematical topic

Prediction without probability: a PDE approach to a modelproblem from machine learning

1 The problem (“prediction with expert advice”)2 Two very simple experts3 Two more realistic experts4 Perspective


Prediction with expert advice

Basic idea: given

a data stream

a notion of prediction

some experts

the overall goal is to beat the (retrospectively) best-performing expert– or at least, not do too much worse.

Jargon: minimize regret.

Widely-used paradigm in machine learning. Many variants, assoc todifferent types of data, classes of experts, notions of prediction.

Note analogy to a common business news feature . . .


The stock prediction problemA classic model problem (T Cover 1965, and many people since):

A stock goes up or down (datastream is binary, no probability)

Investor buys (or sells) f shares of stock at each time step,|f | ≤ 1. Effectively, he is making a prediction.

Two experts (to be specified soon). Regret wrt a given expert =(expert’s gain) - (investor’s gain).

Typical goal: minimize the worst-case value of regret wrtbest-performing expert at a given future time T .

More general goal: Minimize worst-case value of

φ(regret wrt expert 1, regret wrt expert 2)

at time T . (The “typical goal” is φ(x1, x2) = max{x1, x2}.)Robert V. Kohn Prediction without probability

Very simple experts vs more realistic experts

Recall: stock goes up or down (datastream is binary, no probability)

Investor buys (or sells) f shares of stock at each time step,|f | ≤ 1. Effectively, he is making a prediction.

Two experts, each using a public algorithm to make his choice.

FIRST PASS: Two very simple experts – one always expects the stockto go up (he chooses f = 1), the other always expects the stock to godown (he chooses f = −1).

SECOND PASS: Two more realistic experts – each looks at the last dmoves, and makes a choice depending on this recent history.

First pass: Kangping Zhu. Second pass: Nadejda Drenska.


Getting started: two very simple experts

Essentially an optimal control problem:

state space: (x1, x2) = (regret wrt + expert, regret wrt - expert).

control: investor’s stock purchase |f | ≤ 1.

value function: v(x , t) = optimal (worst-case) time-T result,starting from relative regrets x = (x1, x2) at time t .

Dynamic programming principle:

v(x1, x2, t) = min|f |≤1

maxb=±1

v(new position, t + 1)

= min|f |≤1

maxb=±1

v(x1 + b(1− f ), x2 − b(1 + f ), t + 1)

for t < T , with final-time condition v(x ,T ) = φ(x).


The dynamic programming principle

Recall: (x1, x2) = (regret wrt + expert, regret wrt - expert), whereregret = (expert’s gain) - (investor’s gain).

If investor buys f shares and market goes up, investor gains f , the+ expert gains 1, the − expert gains −1. So state moves from (x1, x2)to (x1 + (1− f ), x2 + (−1− f )).

Similarly, if investor buys f shares and market goes down, statemoves from (x1, x2) to (x1 − (1− f ), x2 − (−1− f )).

Hence the dynamic programming principle:

v(x1, x2, t) = min|f |≤1

maxb=±1

v(new position, t + 1)

= min|f |≤1

maxb=±1

v(x1 + b(1− f ), x2 − b(1 + f ), t + 1)


Scaling

In machine learning, one is interested in how regret accumulatesover many time steps.

To access this question, it is natural to rescale the problem andlook for a continuum limit.

Our problem has no probability. But our rescaling is like thepassage from random walk to diffusion.

Our problem shares many features with the two-person-gameinterpretation of motion by curvature (work with Sylvia Serfaty,CPAM 2006).

So: consider a scaled version of problem: stock moves are ±ε andtime steps are ε2. The value function is still the optimal worst-casetime-T result. The principle of dynamic programming becomes

wε(x1, x2, t) = min|f |≤1

maxb=±1

wε(x1 + εb(1− f ), x2 − εb(1 + f ), t + ε2).

We expect that w(x , t) = limε→0 wε should solve a PDE.Robert V. Kohn Prediction without probability

The PDE

The PDE is, roughly speaking, the Hamilton-Jacobi-Bellman eqnassoc to our optimal control problem. Sketch of (formal) derivation:

(1) Use Taylor expansion to estimatew(x1 + εb(1− f ), x2 − εb(1 + f ), t + ε2).

(2) Investor chooses f to make the O(ε) terms vanish, sinceotherwise they kill him; this gives f = (∂1w − ∂2w)/(∂1w + ∂2w).

(3) The O(ε2) terms are insensitive to b = ±1; they give thenonlinear PDE

wt + 2〈D2w∇⊥w

∂1w + ∂2w,∇⊥w

∂1w + ∂2w〉 = 0 with ∇⊥w = (−∂2w , ∂1w).

This final-value problem is to be solved with w = φ at t = T .


More detailed derivation of pdeDynamic programming principle:

wε(x1, x2, t) = max|f |≤1

minb=±1

wε(x1 + εb(1− f ), x2 − εb(1 + f ), t + ε2)

Taylor expansion:

w(x1+εb(1−f ), x2−εb(1+f ), t+ε2) ≈ w(x1, x2, t)+εb(1−f )w1−εb(1+f )w2

+ 12 w11ε

2b2(1− f )2 − w12ε2b2(1− f )(1 + f ) + 1

2 w22ε2b2(1 + f )2 + wtε

2

After substitution and reorganization:

0 ≈ max|f |≤1

minb=±1

{εb[(1− f )w1 − (1 + f )w2]

+ ε2b2[ 12 w11(1− f )2 − w12(1− f )(1 + f ) + 1

2 w22(1 + f )2 + wt ]}

Order-ε term vanishes when

f =∂1w − ∂2w∂1w + ∂2w

.

Note: we expect ∂1w > 0 and ∂2w > 0. Also: condition |f | ≤ 1 is automatic.


Unexpected properties of the PDE

Our PDE is geometric. In fact,

∂tw + 2〈D2w∇⊥w

∂1w + ∂2w,∇⊥w

∂1w + ∂2w〉 = 0

can be rewritten as

∂tw|∇w |

= 2κ|∇w |2

(∂1w + ∂2w)2 ,

where

κ = −div(∇w|∇w |

)is the curvature of a level set of w . Thus the normal velocity of eachlevel set is

vnor =2κ

(n1 + n2)2

where κ is its curvature and n is its unit normal.


Unexpected properties of the PDE – cont’d

Our PDE is the linear heat eqn indisguise. In fact, in the rotated(and scaled) coordinate system

ξ = x1 − x2, η = x1 + x2,

each level set of w is an evolving graph over the ξ axis. Moreover, thefunction η(ξ, t) associated with this graph, defined by

w(ξ, η(ξ, t), t) = const

solves the linear heat eqn

ηt + 2ηξξ = 0 for t < T .

The proof is elementary: one checks that for the evolving graph, thenormal velocity is what our PDE says it should be.

Corollary: existence, regularity, and (more or less) explicit solutionsfor a broad class of final-time data.

Thanks to Y. Giga for this observation.Robert V. Kohn Prediction without probability

Convergence as ε→ 0

Main result: If φ(x) = w(x ,T ) is smooth, then

w(x , t)− Cε ≤ wε(x , t) ≤ w(x , t) + Cε

where C is independent of ε. (It grows linearly with T − t .)

Method: A verification argument. One inequality is obtained byconsidering the particular strategy

f = (∂1w − ∂2w)/(∂1w + ∂2w).

The other involves showing (as seen in the formal argument) that noother strategy can do better.

For the most standard regret-minimization problem,φ(x1, x2) = max{x1, x2} is not smooth. In this case our result is a bitweaker; the errors are of order ε| log ε|.


Sketch of one inequalityGoal: show that

wε(x , t) ≤ w(x , t) + Cε.

Strategy: Estimate wε(z0, t0) by finding asequence (z1, t1), (z2, t2), . . . (zN , tN) suchthat

tj+1 = tj + ε2 for each j , and tN = T .

wε(zj+1, tj+1) ≥ wε(zj , tj ) for each j .

w(zj+1, tj+1) = w(zj , tj ) + O(ε3).

Since N = (T − t0)/ε2, it follows easily that

w(z0, t0) = w(zN , tN) + O(ε).

Since wε = w at the final time T , we get

wε(z0, t0) ≤ wε(zN , tN) = w(zN , tN) ≤ w(z0, t0) + Cε.


Sketch of one inequality, cont’d

The sequence: Recall the dynamicprogramming principle

wε(z0, t0) = min|f |≤1

maxb=±1

wε(

z0 + εb(

f−1f+1

), t0 + ε2

)A specific choice of f gives an inequality;the choice from formal argument gives(

f−1f+1

)= 2

∇⊥w∂1w + ∂2w

evaluated at (z0, t0). Call this v0. Then

wε(z0, t0) ≤ maxb=±1

wε(z0 + εbv0, t0 + ε2) .

Let b0 achieve the max, and set z1 = z0 + εb0v0, t1 = t0 + ε2; we have

wε(z0, t0) ≤ wε(z1, t1).

Iterate to find (zj , tj ), j = 2,3, . . ..Robert V. Kohn Prediction without probability

Sketch of one inequality – cont’d

Proof that w(zj+1, tj+1) = w(zj , tj ) + O(ε3):use the PDE. (Note: since w is smooth,Taylor expansion is honest.)

Using the specific choice of (z1, t1) we get

w(z1, t1) = w(z0, t0) + terms of order ε vanish

+ ε2(∂tw + 2〈D2w ∇⊥w

∂1w+∂2w , ∇⊥w∂1w+∂2w 〉

)+ O(ε3)

in which the RHS is evaluated at (z0, t0).

Using the PDE for w this becomes the desired estimate

w(z1, t1) = w(z0, t0) + terms of order ε2 vanish + O(ε3).

The argument applies for any j . Error terms come from O(ε3) terms inTaylor expansion; so the implicit constant is uniform if D3w and wtt

are uniformly bounded for all x ∈ R and all t < T .Robert V. Kohn Prediction without probability

1 The problem (“prediction with expert advice”)2 Two very simple experts3 Two more realistic experts4 Perspective


More realistic experts

So far our experts were very simple (independent of history). Nowlet’s consider two history-dependent experts.

Keep d days of history. Typical state is thusm = (0001011)2 ∈ {0,1, · · · ,2d − 1}. It is updated each day.

Each expert’s prediction is a (known) function of history. The q expertbuys f = q(m) shares; the r expert buys f = r(m) shares.

Otherwise no change: the goal is to optimize the (worst-case) time-Tvalue of regret wrt best-performing expert, or more generally

φ(regret wrt q expert, regret wrt r expert)


Dynamic programming becomes a mess

Problem: Dynamic programming doesn’t work so well any more.Apparently

state = (regret wrt q expert, regret wrt r expert, history)

so we’re looking for 2d distinct functions of space and time,wm(x1, x2, t). Dynamic programming principle can be formulated(coupling all 2d functions). We seem headed for a system of PDEs.

However:

(a) Regret accumulates slowly while states change rapidly; so valuefunction should be approx indep of state.

(b) Investor should choose f to achieve market indifference (atleading order in Taylor expansion).

(c) Accumulation of regret occurs at order ε2 (in Taylor expansion).

Using these ideas, we will again get a scalar PDE in the limit ε→ 0.


Identifying the PDEFormal derivation: ignore dependence of value function w(x1, x2, t)on ε and m; now

x1 = regret wrt q expert, x2 = regret wrt r expert.

If investor chooses f and market goes up/down (b = ±1),

x1 changes by bε(q(m)− f ), x2 changes by bε(r(m)− f ).

So market indifference at order ε requires

w1(q(m)− f ) + w2(r(m)− f ) = −w1(q(m)− f )− w2(r(m)− f ).

Solve for f : if current state is m, then investor should choose

f = (w1q(m) + w2r(m))/(w1 + w2).

Accumulation of regret is at order ε2. With f set by marketindifference, change in w is ε2 times

wt +12 (q(m)− r(m))2〈D2w ∇⊥w

∂1w+∂2w , ∇⊥w∂1w+∂2w 〉.

Worst-case scenario is the one that makes regret accumulate fastest.Robert V. Kohn Prediction without probability

Identifying the PDE, cont’dRecall: accumulation of regret per step at state m is

12 (q(m)− r(m))2〈D2w ∇⊥w

∂1w+∂2w , ∇⊥w∂1w+∂2w 〉.

Essentially a problem from graph theory: seek

limN→∞

maxpaths of length N

1N

N∑j=1

(q(mj )− r(mj ))2.

In fact:

It suffices to consider cycles.

There are good algorithms for finding optimal cycles.Robert V. Kohn Prediction without probability

Identifying the PDE, cont’d

Thus finally: the PDE is

wt +12

C∗〈D2w∇⊥w

∂1w + ∂2w,∇⊥w

∂1w + ∂2w〉 = 0,

whereC∗ = max

cycles

1cycle length

∑(q(mj )− r(mj ))2.

Summarizing: for two history-dependent experts,

Investor’s choice of f depends on the state as well as on ∇w(x);it achieves leading-order market indifference.

The value function w solves (almost) the same eqn as before(still reducible to the linear heat eqn!). All that changes is the“diffusion coefficient.”

Rigorous analysis still uses a verification argument (though thereare some new subtleties).


What about many history-dependent experts?

Can something similar be done for many history-dependent experts?

If there are K experts then w = w(x1, · · · , xK , t).

Market indifference at order ε still gives a formula for f .

Accumulation of regret sees D2w and Dw (not just a scalarquantity built from them); so the graph problem dependsnontrivially on D2w and Dw .

The formal PDE is much more nonlinear than for two experts.(Analysis: in progress.)


Stepping back

Mathematical messages

Stock prediction problem has a continuous-time limit. Reductionto linear heat eqn provides a rather explicit solution.

It provides another example where a deterministic two-persongame leads to 2nd order nonlinear PDE. (For earlier examples,see Kohn-Serfaty CPAM 2006 and CPAM 2010.)

Our analysis was elementary, since PDE is linked to linear heateqn. In other settings, when PDE solution is not smooth,convergence has been proved (without a rate) using viscositymethods.


Stepping back

Comparison to the machine learning literature

ML is mostly discrete. It was known that for the unscaled game,worst-case regret after N steps is of order

√N (compare: our

parabolic scaling). Our analysis gives the prefactor.

ML guys are smart. For the classic problem of minimizingworst-case regret, Andoni & Panigrahy found the samestrategies that come from our analysis (arXiv:1305.1359) – butdidn’t have the tools to prove they’re optimal.

The link to a linear heat eqn gives surprisingly explicit solutionsin the continuum limit.


Stepping back

Is this just a curiosity?

Key point: since behavior over many time steps is of interest,continuous time viewpoint should be useful.

But: the stock prediction problem is very simple: a binary timeseries and a linear “loss function.” What about other examples?

One might ask: when is worst-case regret minimization a goodidea? Not obvious. . .


Happy Birthday, Yoshi!


Prediction without probability: a PDE approach to a model ... · Asymptotically self-similar blowup...

Documents

Transcript of Prediction without probability: a PDE approach to a model ... · Asymptotically self-similar blowup...