An Online Learning Approach to Data-driven Algorithm Design


Transcript of "An Online Learning Approach to Data-driven Algorithm Design" (…ninamf/talks/online-data-driven.pdf)

Page 1:

Maria-Florina (Nina) Balcan

An Online Learning Approach to Data-driven Algorithm Design

Carnegie Mellon University

Page 2:

Analysis and Design of Algorithms

Classic algo design: solve a worst case instance.

• Easy domains have optimal poly-time algos.

E.g., sorting, shortest paths

• Most domains are hard.

Data driven algo design: use learning & data for algo design.

• Suited when we repeatedly solve instances of the same algorithmic problem.

E.g., clustering, partitioning, subset selection, auction design, …

Page 3:

Prior work: largely empirical.

• Artificial Intelligence: E.g., [Xu-Hutter-Hoos-LeytonBrown, JAIR 2008]

• Computational Biology: E.g., [DeBlasio-Kececioglu, 2018]

• Game Theory: E.g., [Likhodedov and Sandholm, 2004]

• Different methods work better in different settings.

• Large family of methods – what’s best in our application?

Data Driven Algorithm Design

Data driven algo design: use learning & data for algo design.

Page 4:

Prior work: largely empirical.

Our Work: Data driven algos with formal guarantees.

• Different methods work better in different settings.

• Large family of methods – what’s best in our application?

Data Driven Algorithm Design

Data driven algo design: use learning & data for algo design.

• Several case studies of widely used algo families.

• General principles: push boundaries of algorithm design and machine learning.

Page 5:

Algorithm Design as Distributional Learning

Goal: given a large family of algos and a sample of typical instances from a domain, find an algo that performs well on new instances from the same domain.

[Figure: a large family 𝐅 of algorithms (e.g., greedy, dynamic programming, MST + farthest location) applied to a sample of i.i.d. typical inputs (Input 1, Input 2, …, Input m) from problems such as clustering and facility location.]

[Gupta-Roughgarden, ITCS’16 &SICOMP’17]

Page 6:

Statistical Learning Approach to AAD

Challenge: “nearby” algos can have drastically different behavior.

Sample Complexity: How large should our sample of typical instances be in order to guarantee good performance on new instances?

m = O(dim(F)/ϵ) instances suffice to ensure generalizability.

[Figures: for an algorithm parameter α ∈ ℝ^s, the piecewise structure of the IQP objective value as a function of the rounding parameter s, and of auction revenue (in terms of the highest and second-highest bids) as a function of the reserve price r.]

Page 7:

Prior Work: [Gupta-Roughgarden, ITCS’16 &SICOMP’17] proposed model; analyzed greedy algos for subset selection pbs (knapsack & independent set).

Our results: new algorithm classes for a wide range of problems.

Algorithm Design as Distributional Learning

• Clustering: Parametrized Linkage. [Balcan-Nagarajan-Vitercik-White, COLT 2017]
  [Figure: DATA → linkage step (single linkage, complete linkage, α-weighted comb, …, Ward's alg) → DP step (DP for k-means, DP for k-median, DP for k-center) → CLUSTERING.] dim(F) = O(log n)

• Clustering: Parametrized Lloyds. [Balcan-Dick-White, NeurIPS 2018]
  [Figure: DATA → seeding step (random seeding, farthest first traversal, k-means++, …, D^α sampling) → local search step (L2-local search, β-local search) → CLUSTERING.] dim(F) = O(k log n)

• Alignment pbs (e.g., string alignment): parametrized dynamic prog. [Balcan-DeBlasio-Dick-Kingsford-Sandholm-Vitercik, 2019]

Page 8:

Our results: new algorithm classes for a wide range of problems.

Algorithm Design as Distributional Learning

• Partitioning pbs via IQPs: SDP + Rounding. E.g., Max-Cut, Max-2SAT, Correlation Clustering. [Balcan-Nagarajan-Vitercik-White, COLT 2017]
  [Figure: Integer Quadratic Programming (IQP) instance → SDP relaxation → rounding (GW rounding, 1-linear rounding, s-linear rounding, …) → feasible solution to the IQP.]

• MIPs: Branch and Bound Techniques. [Balcan-Dick-Sandholm-Vitercik, ICML'18]
  [Figure: MIP instance (max c ∙ x s.t. Ax = b, x_i ∈ {0,1} ∀ i ∈ I) → choose a leaf of the search tree (best-bound, depth-first) → fathom and terminate if possible → choose a variable to branch on (most fractional, α-linear, product).]

• Automated mechanism design for revenue maximization: parametrized VCG auctions, posted prices, lotteries. [Balcan-Sandholm-Vitercik, EC 2018]

dim(F) = O(log n)

Page 9:

Algorithm Design as Distributional Learning

[Balcan-DeBlasio-Dick-Kingsford-Sandholm-Vitercik, 2019]

Our results: general sample complexity tools via the structure of the dual class.

Thm: Suppose that for each cost_I(α) there are ≤ N boundary fns f_1, f_2, … ∈ F s.t. within each region defined by them, ∃ g ∈ G s.t. cost_I(α) = g(α). Then

Pdim({cost_α}) = O((d_F* + d_G*) log(d_F* + d_G*) + d_F* log N),

where d_F* = VC-dim of the dual of F and d_G* = Pdim of the dual of G.

[Figures: piecewise structure of the IQP objective value as a function of the rounding parameter s, and of auction revenue (in terms of the highest and second-highest bids) as a function of the reserve price r, as on Page 6.]

Page 10:

Online Algorithm Selection

• So far, batch setting: collection of typical instances given upfront.

• [Balcan-Dick-Vitercik, FOCS 2018], [Balcan-Dick-Pegden, 2019]: online alg. selection.

• Challenge: cannot use known techniques; scoring fns are non-convex, with lots of discontinuities.

• Identify general properties (piecewise Lipschitz fns with dispersed discontinuities) sufficient for strong bounds.

• Show these properties hold for many alg. selection pbs.


Page 11:

Structure of the Talk

• Overview. Data driven algo design as batch learning.

• Data driven algo design via online learning.

• An example: clustering via parametrized linkage.

• Online learning of non-convex (piecewise Lipschitz) fns.

Page 12:

Example: Clustering Problems

Clustering: Given a set of objects, organize them into natural groups.

• E.g., cluster news articles, or web pages, or search results by topic.

• Or, cluster customers according to purchase history.

Often need to solve such problems repeatedly.

• E.g., clustering news articles (Google news).

• Or, cluster images by who is in them.

Page 13:

Example: Clustering Problems

Clustering: Given a set of objects, organize them into natural groups.

Objective-based clustering. Input: a set of objects S and a distance d. Output: centers {c_1, c_2, …, c_k}.

• k-means: minimize Σ_p min_i d²(p, c_i).

• k-median: minimize Σ_p min_i d(p, c_i).

• k-center/facility location: minimize the maximum radius.

• Finding OPT is NP-hard, so no universal efficient algo that works on all domains.
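As a concrete reference for these objectives, here is a minimal sketch in Python (illustrative helper names, not from the talk; points and centers are assumed to be coordinate tuples, and d defaults to Euclidean distance):

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points given as coordinate tuples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans_cost(points, centers, d=euclidean):
    # k-means objective: sum of squared distances to the nearest center.
    return sum(min(d(p, c) for c in centers) ** 2 for p in points)

def kmedian_cost(points, centers, d=euclidean):
    # k-median objective: sum of distances to the nearest center.
    return sum(min(d(p, c) for c in centers) for p in points)

def kcenter_cost(points, centers, d=euclidean):
    # k-center objective: maximum distance to the nearest center (max radius).
    return max(min(d(p, c) for c in centers) for p in points)

# Example: three 1-D points and two centers.
pts = [(0.0,), (1.0,), (10.0,)]
print(kmeans_cost(pts, [(0.5,), (10.0,)]))   # 0.25 + 0.25 + 0.0 = 0.5
```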

Page 14:

Clustering: Linkage + Dynamic Programming

Family of poly time 2-stage algorithms:

1. Use a greedy linkage-based algorithm to organize data into a hierarchy (tree) of clusters.

2. Dynamic programming over this tree to identify the pruning corresponding to the best clustering (sketched in code after the figure below).

[Figure: a hierarchy (tree) over points A, B, C, D, E, F and two different prunings of it into clusters.]
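A minimal sketch of step 2 (the DP over the hierarchy), assuming the tree is given as a map from each internal node to its two children and a single-cluster cost function is supplied; all names are illustrative, not from the talk:

```python
def best_pruning(node, k, children, cluster_cost):
    """Cheapest way to split the points under `node` into exactly k clusters,
    where each cluster must be a node of the hierarchy.

    node         -- a hashable tree node (e.g., a frozenset of points)
    children     -- dict mapping each internal node to its (left, right) children
    cluster_cost -- function scoring a single cluster (e.g., its k-means cost)
    Returns (cost, list_of_clusters).
    """
    if k == 1:
        return cluster_cost(node), [node]
    if node not in children:                 # a leaf cannot be split further
        return float("inf"), []
    left, right = children[node]
    best_cost, best_clusters = float("inf"), []
    # Try every way of splitting the budget k between the two subtrees.
    for k_left in range(1, k):
        cost_l, clusters_l = best_pruning(left, k_left, children, cluster_cost)
        cost_r, clusters_r = best_pruning(right, k - k_left, children, cluster_cost)
        if cost_l + cost_r < best_cost:
            best_cost, best_clusters = cost_l + cost_r, clusters_l + clusters_r
    return best_cost, best_clusters
```

Memoizing on (node, k) turns this recursion into the usual polynomial-time tree DP.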

Page 15:

Clustering: Linkage + Dynamic Programming

1. Use a linkage-based algorithm to get a hierarchy.

2. Dynamic programming to find the best pruning.

[Figure: DATA → linkage step (single linkage, complete linkage, α-weighted comb, …, Ward's algo) → DP step (DP for k-means, DP for k-median, DP for k-center) → CLUSTERING.]

Both steps can be done efficiently.

Page 16:

Linkage Procedures for Hierarchical Clustering

Bottom-Up (agglomerative)

[Figure: topic hierarchy: all topics → sports (soccer, tennis) and fashion (Gucci, Lacoste).]

• Start with every point in its own cluster.

• Repeatedly merge the “closest” two clusters.

Different defs of “closest” give different algorithms.

Page 17:

Linkage Procedures for Hierarchical Clustering

Have a distance measure on pairs of objects.

d(x,y) – distance between x and y

E.g., # keywords in common, edit distance, etc.

[Figure: topic hierarchy, as on the previous slide.]

• Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)

• Average linkage: dist(A, B) = avg_{x∈A, x′∈B} dist(x, x′)

• Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)

• Parametrized family, α-weighted linkage: dist(A, B) = α min_{x∈A, x′∈B} dist(x, x′) + (1 − α) max_{x∈A, x′∈B} dist(x, x′)
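A minimal sketch of the α-weighted linkage procedure, assuming a pairwise distance function d is given; it uses a naive search over cluster pairs for clarity (illustrative names, not from the talk):

```python
def alpha_weighted_linkage(points, d, alpha):
    """Agglomerative clustering with the alpha-weighted combination of
    single linkage (min) and complete linkage (max). Returns the merge history."""
    clusters = [frozenset([p]) for p in points]
    history = []                                   # (cluster_a, cluster_b, merged)

    def link(A, B):
        pair_dists = [d(x, y) for x in A for y in B]
        # alpha-weighted combination of single linkage and complete linkage.
        return alpha * min(pair_dists) + (1 - alpha) * max(pair_dists)

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] | clusters[j]
        history.append((clusters[i], clusters[j], merged))
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return history

# Example: 1-D points with absolute-difference distance and alpha = 0.5.
# alpha_weighted_linkage([0.0, 1.0, 5.0], lambda x, y: abs(x - y), 0.5)
```

The merge history returned here can be turned into the `children` map used by the pruning sketch on Page 14.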

Page 18:

Clustering: Linkage + Dynamic Programming

1. Use a linkage-based algorithm to get a hierarchy.

2. Dynamic programming to find the best pruning.

[Figure: DATA → linkage step (single linkage, complete linkage, α-weighted comb, …, Ward's algo) → DP step (DP for k-means, DP for k-median, DP for k-center) → CLUSTERING, as on Page 15.]

• Used in practice. E.g., [Filippova-Gadani-Kingsford, BMC Informatics]

• Strong properties. E.g., best known algos for perturbation-resilient instances of k-median, k-means, k-center. (Perturbation resilience: small changes to the input distances shouldn't move the optimal solution by much.) [Balcan-Liang, SICOMP 2016] [Awasthi-Blum-Sheffet, IPL 2011] [Angelidakis-Makarychev-Makarychev, STOC 2017]

Page 19:

Clustering: Linkage + Dynamic Programming

Parameter α ∈ ℝ.

Key fact: If we fix a clustering instance of n pts and vary α, there are at most O(n⁸) switching points where the behavior on that instance changes.

So, the cost function is piecewise-constant with at most O(n⁸) pieces.

[Figure: piecewise-constant cost as a function of α ∈ ℝ.]

Page 20:

Clustering: Linkage + Dynamic Programming

[Figure: four sets of points 𝒩1, 𝒩2, 𝒩3, 𝒩4, with points p, p′, q, q′, r, r′, s, s′.]

Key idea:
• For a given α, which will merge first, 𝒩1 and 𝒩2, or 𝒩3 and 𝒩4?
• Depends on which of (1 − α)d(p, q) + αd(p′, q′) or (1 − α)d(r, s) + αd(r′, s′) is smaller.
• An interval boundary is an equality involving 8 points, so there are O(n⁸) interval boundaries.

Key fact: If we fix a clustering instance of n pts and vary α, there are at most O(n⁸) switching points where the behavior on that instance changes.
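Spelling out the one-step algebra behind this key idea (a math sketch in the slide's notation): setting the two candidate merge scores equal and solving for α gives each interval boundary.

```latex
(1-\alpha)\,d(p,q) + \alpha\,d(p',q') \;=\; (1-\alpha)\,d(r,s) + \alpha\,d(r',s')
\quad\Longleftrightarrow\quad
\alpha \;=\; \frac{d(p,q)-d(r,s)}{\bigl(d(p,q)-d(r,s)\bigr)+\bigl(d(r',s')-d(p',q')\bigr)} .
```

Each such boundary is determined by a choice of 8 points, which gives the O(n⁸) bound.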

Page 21:

Online Algorithm Selection

• [Balcan-Dick-Vitercik, FOCS 2018], [Balcan-Dick-Pegden, 2019]: online alg. selection.

• Challenge: cannot use known techniques; scoring fns are non-convex, with lots of discontinuities.

• Identify general properties (piecewise Lipschitz fns with dispersed discontinuities) sufficient for strong bounds.

• Show these properties hold for many alg. selection pbs.


Page 22:

Online Algorithm Selection via Online Optimization

Online optimization of general piecewise Lipschitz functions

Goal: minimize regret: max_{ρ∈𝒞} Σ_{t=1}^T u_t(ρ) − E[Σ_{t=1}^T u_t(ρ_t)]
(the performance of the best parameter in hindsight minus our cumulative performance).

On each round t ∈ {1, …, T}:

1. Online learning algo chooses a parameter ρ_t.

2. Adversary selects a piecewise Lipschitz function u_t: 𝒞 → [0, H]
   • corresponds to some pb instance and its induced scoring fnc

3. Get feedback:
   Payoff: score of the parameter we selected, u_t(ρ_t).
   Full information: observe the whole function u_t(∙).
   Bandit feedback: observe only the payoff u_t(ρ_t).

Page 23:

Online Regret Guarantees

Existing techniques (for finite, linear, or convex case): select 𝛒𝐭probabilistically based on performance so far.

• Regret guarantee: max𝛒∈𝒞

σt=1T ut(𝛒) − 𝔼 σt=1

T ut 𝛒𝐭 = ෩𝐎 𝐓 ×⋯

• Probability exponential in performance [Cesa-Bianchi and Lugosi 2006]

No-regret: per-round regret approaches 0 at rate ෩𝐎 𝟏/ 𝐓 .

Challenge: with discontinuities, cannot get no-regret in general.

To achieve low regret, need structural condition.

• Adversary can force online algo to “play 20 questions” while hiding an arbitrary real number.

• Round 1: adversary splits parameter space in half and randomly chooses one half to perform well, other half to perform poorly.

• Round 2: repeat on parameters that performed well in round 1. Etc.

• Any algorithm does poorly half the time in expectation but ∃ perfect 𝝆.

Page 24:

Dispersion, Sufficient Condition for No-Regret

[Figure: a piecewise Lipschitz function is Lipschitz within each piece. Disperse: few boundaries within any interval. Not disperse: many boundaries within one interval.]

u_1(∙), …, u_T(∙) is (w, k)-dispersed if any ball of radius w contains boundaries for at most k of the u_i.
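A minimal sketch of how one might check (w, k)-dispersion for a one-dimensional parameter, assuming each u_i is summarized by the list of its discontinuity locations (illustrative helper, not from the talk):

```python
import bisect

def is_dispersed(boundaries_per_fn, w, k):
    """Check (w, k)-dispersion in 1-D: every interval of radius w (width 2w)
    contains discontinuities of at most k of the functions.

    boundaries_per_fn -- boundaries_per_fn[i] lists the discontinuity
                         locations of u_i.
    """
    # All (location, function index) pairs, sorted by location.
    pts = sorted((x, i) for i, xs in enumerate(boundaries_per_fn) for x in xs)
    locs = [x for x, _ in pts]
    # The worst interval can be taken to start at some boundary point, so slide
    # a window [x, x + 2w] over the sorted locations and count distinct functions.
    for start, (x, _) in enumerate(pts):
        end = bisect.bisect_right(locs, x + 2 * w)
        if len({i for _, i in pts[start:end]}) > k:
            return False
    return True
```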

Page 25:

Dispersion, Sufficient Condition for No-Regret

Full info: exponentially weighted forecaster [Cesa-Bianchi-Lugosi 2006].

On each round t ∈ {1, …, T}:
• Sample ρ_t from the distribution p_t(ρ) ∝ exp(λ Σ_{s=1}^{t−1} u_s(ρ)), i.e., the density of ρ is exponential in its performance so far.

Our Results: for disperse fns, regret Õ(√(Td) × (fnc of problem)).
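A minimal sketch of this forecaster over a one-dimensional parameter, assuming the continuous density is approximated by sampling from a fine grid (illustrative; the analysis in the talk works with the continuous distribution directly):

```python
import math
import random

def exponentially_weighted_forecaster(utilities, grid, lam):
    """Full-information online algorithm selection over a finite parameter grid.

    utilities -- list of T functions; utilities[t](rho) = u_t(rho) in [0, H]
    grid      -- candidate parameter values (a discretization of the space C)
    lam       -- learning rate lambda
    Returns the list of parameters rho_t that were played.
    """
    cumulative = [0.0] * len(grid)          # U_t(rho) for each grid point
    played = []
    for u_t in utilities:
        # p_t(rho) proportional to exp(lambda * U_t(rho)); subtract the max
        # cumulative payoff before exponentiating, for numerical stability.
        m = max(cumulative)
        weights = [math.exp(lam * (u - m)) for u in cumulative]
        rho_t = random.choices(grid, weights=weights, k=1)[0]
        played.append(rho_t)
        # Full information: observe u_t everywhere and update every grid point.
        for idx, rho in enumerate(grid):
            cumulative[idx] += u_t(rho)
    return played
```

With λ set as in the theorem on the next slide, the continuous version of this sampler attains the stated regret bound.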

Page 26:

Full info: exponentially weighted forecaster [Cesa-Bianchi-Lugosi 2006].

On each round t ∈ {1, …, T}:
• Sample ρ_t from the distribution p_t(ρ) = exp(λ U_t(ρ)) / W_t, where U_t(ρ) = Σ_{s=0}^{t−1} u_s(ρ) is the total payoff of ρ up to time t and W_t = ∫ exp(λ U_t(ρ)) dρ.

Our Results:
Thm: Assume u_1(∙), …, u_T(∙) are piecewise L-Lipschitz and (w, k)-dispersed (parameter space 𝒞 ⊆ ℝ^d of radius R). The expected regret is O(H(√(Td log(R/w)) + k) + TLw).

For most problems: set w ≈ 1/√T and k = √T × (fnc of problem).

Page 27:

Dispersion, Sufficient Condition for No-Regret

Key idea: Provide upper and lower bounds on W_{T+1}/W_1, where W_t = ∫ exp(λ U_t(ρ)) dρ is the normalization factor.

On each round t ∈ {1, …, T}:
• Sample ρ_t from p_t(ρ) = exp(λ U_t(ρ)) / W_t, where U_t(ρ) = Σ_{s=0}^{t−1} u_s(ρ) is the total payoff of ρ up to time t.

• Upper bound: classic; relate the algo's performance to the magnitude of the update:
W_{t+1}/W_t ≤ exp((e^{Hλ} − 1) P_t / H), where P_t = expected alg payoff at time t.
Hence W_{T+1}/W_1 ≤ exp((e^{Hλ} − 1) P_ALG / H), where P_ALG = expected alg total payoff.

Thm: u_1(∙), …, u_T(∙) are piecewise L-Lipschitz and (w, k)-dispersed. The expected regret is O(H(√(Td log(R/w)) + k) + TLw).
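For completeness, the standard one-line calculation behind the upper bound (a sketch; it only uses that u_t takes values in [0, H], the convexity bound e^{λx} ≤ 1 + (e^{Hλ} − 1)x/H for x ∈ [0, H], and 1 + z ≤ e^z):

```latex
\frac{W_{t+1}}{W_t}
  = \int p_t(\rho)\, e^{\lambda u_t(\rho)}\, d\rho
  \le \int p_t(\rho)\Bigl(1 + \tfrac{e^{H\lambda}-1}{H}\, u_t(\rho)\Bigr)\, d\rho
  = 1 + \frac{(e^{H\lambda}-1)\,P_t}{H}
  \le \exp\!\Bigl(\frac{(e^{H\lambda}-1)\,P_t}{H}\Bigr).
```

Multiplying over t = 1, …, T telescopes to the stated bound on W_{T+1}/W_1.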

Page 28:

Dispersion, Sufficient Condition for No-Regret

Key idea: analyze Wt = ∫ exp λUt 𝛒 d𝛒 (norm. factor)

pt 𝛒 =exp λUt 𝛒

𝐖𝐭

On each round t ∈ 1,… , T :

• Sample a 𝛒t from distr. pt:Ut 𝛒 = σs=0

t−1 us(𝛒) total payoff of 𝛒 up to time t

• Lower bound (uses dispersion + Lipschitz):

Let 𝛒∗ be the optimal parameter vector in hindsight.adsf

For all 𝛒 ∈ B 𝛒∗, w ,adsf

a

t=1

T

ut 𝛒 ≥

t=1

T

ut 𝛒∗ − Hk − TLw.

≤ k fncs us have disc. in B 𝛒∗, w & range is 0, H

remaining fnsL-Lipschitz

UT+1 𝛒 ≥ UT+1 𝛒∗ − Hk − TLw.For all 𝛒 ∈ B 𝛒∗, w ,adsf

Thm: u1(∙), … , uT(∙) are piecewise L-Lipschitz and (𝐰, 𝐤)-dispersed.

The expected regret is O H Td logR

𝐰+ 𝐤 + TL𝐰 .

Page 29:

Dispersion, Sufficient Condition for No-Regret

Key idea: analyze Wt = ∫ exp λUt 𝛒 d𝛒 (norm. factor)

pt 𝛒 =exp λUt 𝛒

𝐖𝐭

On each round t ∈ 1,… , T :

• Sample a 𝛒t from distr. pt:Ut 𝛒 = σs=0

t−1 us(𝛒) total payoff of 𝛒 up to time t

• Lower bound

UT+1 𝛒 ≥ UT+1 𝛒∗ − Hk − TLw.For all 𝛒 ∈ B 𝛒∗, w ,adsf

WT+1 ≥ Vol B 𝛒∗, w exp λ UT+1(𝛒∗ − Hk − TLw)

Get WT+1

W1≥

w

R

d

exp λ UT+1(𝛒∗ − Hk − TLw)

Thm: u1(∙), … , uT(∙) are piecewise L-Lipschitz and (𝐰, 𝐤)-dispersed.

The expected regret is O H Td logR

𝐰+ 𝐤 + TL𝐰 .

Page 30:

Dispersion, Sufficient Condition for No-Regret

Key idea: analyze Wt = ∫ exp λUt 𝛒 d𝛒 (norm. factor)

pt 𝛒 =exp λUt 𝛒

𝐖𝐭

On each round t ∈ 1,… , T :

• Sample a 𝛒t from distr. pt:Ut 𝛒 = σs=0

t−1 us(𝛒) total payoff of 𝛒 up to time t

expeHλ − 1 PALG

H≥

w

R

d

exp λ UT+1(𝛒∗ − Hk − TLw)

Take logs, use std approx of 𝑒𝑧 and fact that PALG ≤ HT:

UT+1 ρ∗ − PALG ≤ H2Tλ +d logR/w

λ+ Hk + TLw

Set λ =1

H

d

Tlog

R

wO H Td log

R

𝐰+ 𝐤 + TL𝐰Regret

Thm: u1(∙), … , uT(∙) are piecewise L-Lipschitz and (𝐰, 𝐤)-dispersed.

The expected regret is O H Td logR

𝐰+ 𝐤 + TL𝐰 .

Page 31:

Example: Clustering with ρ-weighted linkage

ρ-weighted linkage: dist(A, B) = ρ min_{x∈A, x′∈B} dist(x, x′) + (1 − ρ) max_{x∈A, x′∈B} dist(x, x′)

Theorem: If the T instances have distances selected in [0, B] from κ-bounded densities, then for any w, with prob ≥ 1 − δ, we get (w, k)-dispersion for
k = O(w n⁸ κ² B² T) + O(√(T log(n/δ))).
(The first term bounds, for any given interval I of width w, the expected # of instances with discontinuities in I; the second comes from a uniform convergence argument.)

Implies expected regret O(H √(T log(TnκB))).

Page 32:

Example: Clustering with ρ-weighted linkage

ρ-weighted linkage: dist(A, B) = ρ min_{x∈A, x′∈B} dist(x, x′) + (1 − ρ) max_{x∈A, x′∈B} dist(x, x′)

Theorem: If the T instances have distances in [0, B] from κ-bounded densities, then for any w, with prob ≥ 1 − δ, we get (w, k)-dispersion for
k = O(w n⁸ κ² B² T) + O(√(T log(n/δ))).

Key ideas:
• Given ρ, which pair merges first, C_1 and C_2 or C_3 and C_4? The critical value of ρ is where (1 − ρ)d_1 + ρd_2 = (1 − ρ)d_3 + ρd_4, i.e., ρ = (d_1 − d_3) / ((d_1 − d_3) + (d_4 − d_2)).
  [Figure: clusters C_1, C_2, C_3, C_4 with pairwise distances d_1, d_2, d_3, d_4.]
• Each instance has O(n⁸) discontinuities of this form.
• Each discontinuity location comes from an O((κB)²)-bounded density.
• So, for any given interval I of width w, the expected # of instances with discontinuities in I is O(w n⁸ κ² B² T). Uniform convergence on top.

Page 33:

Proving Dispersion

• Use functional form of discontinuities to get distribution of discontinuity locations, and expected number in any given interval.

• Randomness either from input instances (smoothed model) or from randomness in the algorithms themselves

• Use uniform convergence result on the top.

Lemma: Let u_1, …, u_T: ℝ → ℝ be independent piecewise-Lipschitz fns, each with ≤ N discontinuities. For an interval I, let D(I) be the # of u_i with at least one discontinuity in I. Then with prob ≥ 1 − δ:
sup_I (D(I) − E[D(I)]) ≤ O(√(T log(N/δ))).

Proof Sketch:

• For an interval I, let f_I(i) = 1 iff u_i has a discontinuity in I. Then D(I) = Σ_i f_I(i).

• Show the VC-dim of ℱ = {f_I} is O(log N), then use uniform convergence.

Page 34:

Proof: VC-dimension of ℱ is O(log N)

Lemma: For an interval I and a set s of ≤ N points, let f_I(s) = 1 iff s has a point inside I. The VC-dimension of the class ℱ = {f_I} is O(log N).

Proof idea: How many ways can ℱ label d instances (d sets of ≤ N pts) s_1, …, s_d?

• The d instances have at most Nd points in their union.

• For f_I and f_I′ to label the instances differently, a minimal requirement is that the intervals I and I′ can't cover the exact same points.

• But, given Nd points, there are at most (Nd)² + 1 different subsets of them that can be covered using intervals.

To shatter d instances, we must have 2^d ≤ (Nd)² + 1, so the VC-dim d = O(log N).

Page 35:

Example: Knapsack

A problem instance is a collection of items i, each with a value v_i and a size s_i, along with a knapsack capacity C.

Goal: select the most valuable subset of items that fits. Find a subset V maximizing Σ_{i∈V} v_i subject to Σ_{i∈V} s_i ≤ C.

Greedy family of algorithms:

• Greedily pack items in order of value (highest value first), subject to feasibility.

• Greedily pack items in order of v_i / s_i^ρ, subject to feasibility (a sketch follows below).
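A minimal sketch of the parametrized greedy rule, with items assumed to be given as (value, size) pairs (illustrative names, not from the talk):

```python
def greedy_knapsack(items, capacity, rho):
    """Pack items greedily in decreasing order of value / size**rho, keeping an
    item only if it still fits. rho = 0 recovers 'highest value first'.

    items    -- list of (value, size) pairs, sizes >= 1
    capacity -- knapsack capacity C
    Returns (total_value, chosen_items).
    """
    order = sorted(items, key=lambda vs: vs[0] / (vs[1] ** rho), reverse=True)
    total_value, used, chosen = 0.0, 0.0, []
    for value, size in order:
        if used + size <= capacity:        # feasibility check
            chosen.append((value, size))
            used += size
            total_value += value
    return total_value, chosen
```

Sweeping ρ over a fixed instance traces out the piecewise-constant utility u_t(ρ) that the theorem below analyzes.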

Thm: The adversary chooses knapsack instances with n items, capacity C, sizes s_i in [1, C], and values v_i in (0, 1] drawn independently from b-bounded distributions. The utilities u_1, …, u_T are piecewise constant and, with prob ≥ 1 − δ, (w, k)-dispersed for k = O(wTn²b² log C) + O(√(T log(n/δ))).

Page 36:

Online Learning Summary

• Exploit dispersion to get no-regret.

• Learning theory: techniques of independent interest beyond algorithm selection.

• Can bound Rademacher complexity as a function of dispersion.

• For partitioning pbs via IQPs, SDP + Rounding, exploit smoothness in the algorithms themselves.

• For clustering & subset selection via greedy, assume input instances smoothed to show dispersion.

• Also differential privacy bounds.

• Also bandit and semi-bandit feedback.