Maria-Florina (Nina) Balcan
An Online Learning Approach to Data-driven Algorithm Design
Carnegie Mellon University
Analysis and Design of Algorithms
Classic algo design: solve a worst case instance.
• Easy domains have optimal poly time algos.
E.g., sorting, shortest paths
• Most domains are hard.
Data driven algo design: use learning & data for algo design.
• Suited when we repeatedly solve instances of the same algorithmic problem.
E.g., clustering, partitioning, subset selection, auction design, …
Prior work: largely empirical.
• Artificial Intelligence: E.g., [Xu-Hutter-Hoos-LeytonBrown, JAIR 2008]
• Computational Biology: E.g., [DeBlasio-Kececioglu, 2018]
• Game Theory: E.g., [Likhodedov and Sandholm, 2004]
• Different methods work better in different settings.
• Large family of methods – what’s best in our application?
Data Driven Algorithm Design
Data driven algo design: use learning & data for algo design.
Our work: data-driven algos with formal guarantees.
• Several case studies of widely used algo families.
• General principles: push boundaries of algorithm design and machine learning.
Algorithm Design as Distributional Learning
Goal: given a large family of algorithms and a sample of typical instances from a domain, find an algorithm that performs well on new instances from the same domain.
[Figure: a large family F of algorithms (e.g., MST-based, Greedy, Dynamic Programming, …, farthest-location heuristics) and a sample of i.i.d. typical inputs from the domain (clustering / facility location instances: Input 1, Input 2, …, Input m).]
[Gupta-Roughgarden, ITCS’16 & SICOMP’17]
Statistical Learning Approach to AAD
Challenge: “nearby” algos can have drastically different behavior.
Sample Complexity: How large should our sample of typical instances be in order to guarantee good performance on new instances?
m = O(dim(F)/ϵ) instances suffice to ensure generalizability.
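To make the batch approach concrete, here is a minimal sketch (not from the talk) of the empirical route this sample-complexity bound licenses: discretize the parameter space, evaluate each candidate parameter on the sampled instances, and return the empirically best one. The function names and the `cost` callback are hypothetical.

```python
import numpy as np

def erm_select_parameter(instances, cost, param_grid):
    """Empirical risk minimization over a finite grid of algorithm parameters.

    instances  -- sampled problem instances, assumed i.i.d. from the domain
    cost       -- cost(instance, param) -> float: cost of the parametrized algo on that instance
    param_grid -- candidate parameter values (a discretization of the family F)
    """
    best_param, best_avg = None, float("inf")
    for p in param_grid:
        avg = float(np.mean([cost(inst, p) for inst in instances]))
        if avg < best_avg:
            best_param, best_avg = p, avg
    return best_param, best_avg

# Hypothetical usage:
# alpha_hat, _ = erm_select_parameter(train_instances, linkage_dp_cost, np.linspace(0, 1, 200))
```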
[Figure: the algorithm's performance is a volatile function of its parameters, e.g., the IQP objective value as a function of the rounding parameter, and auction revenue as a function of the reserve price r (with the highest and second-highest bids marked).]
Prior Work: [Gupta-Roughgarden, ITCS’16 & SICOMP’17] proposed the model; analyzed greedy algos for subset selection problems (knapsack & independent set).
Our results: new algorithm classes for a wide range of problems.
Algorithm Design as Distributional Learning
[Diagram: DATA → linkage step (single linkage, complete linkage, α-weighted combination, …, Ward’s algorithm) → dynamic programming step (DP for k-means, DP for k-median, DP for k-center) → CLUSTERING.]
Clustering: Parametrized Linkage
[Balcan-Nagarajan-Vitercik-White, COLT 2017]
Parametrized Lloyds
[Diagram: DATA → seeding step (random seeding, farthest first traversal, k-means++, …, D^α sampling) → local search step (L2-local search, β-local search) → CLUSTERING.]
[Balcan-Dick-White, NeurIPS 2018]
Alignment problems (e.g., string alignment): parametrized dynamic programming.
[Balcan-DeBlasio-Dick-Kingsford-Sandholm-Vitercik, 2019]
dim(F) = O(log n); dim(F) = O(k log n)
Our results: new algorithm classes for a wide range of problems.
Algorithm Design as Distributional Learning
• Partitioning problems via IQPs (e.g., Max-Cut, Max-2SAT, Correlation Clustering): SDP relaxation + rounding.
[Diagram: Integer Quadratic Programming (IQP) instance → SDP relaxation → rounding (GW rounding, 1-linear rounding, s-linear rounding, …) → feasible solution to the IQP.]
[Balcan-Nagarajan-Vitercik-White, COLT 2017]
• MIPs: Branch and Bound Techniques [Balcan-Dick-Sandholm-Vitercik, ICML’18]
MIP instance: max c ∙ x s.t. Ax = b, x_i ∈ {0,1} ∀i ∈ I.
Branch and bound: choose a leaf of the search tree (best-bound, depth-first, …); choose a variable to branch on (most fractional, α-linear, product, …); fathom and terminate if possible.
• Automated mechanism design for revenue maximization
Parametrized VCG auctions, posted prices, lotteries.
[Balcan-Sandholm-Vitercik, EC 2018]
dim(F) = O(log n)
Algorithm Design as Distributional Learning
[Balcan-DeBlasio-Kingsford-Dick-Sandholm-Vitercik, 2019]
Our results: general sample complexity tools via the structure of the dual class.
Thm: Suppose for each cost_I(α) there are ≤ N boundary functions f1, f2, … ∈ F such that within each region they define, ∃ g ∈ G with cost_I(α) = g(α). Then
Pdim({cost_α}) = O((dF∗ + dG∗) log(dF∗ + dG∗) + dF∗ log N),
where dF∗ = VC-dim of the dual of F and dG∗ = Pdim of the dual of G.
Online Algorithm Selection
• So far, batch setting: collection of typical instances given upfront.
• [Balcan-Dick-Vitercik, FOCS 2018], [Balcan-Dick-Pegden, 2019]: online algorithm selection.
• Challenge: cannot use known techniques; scoring functions are non-convex, with lots of discontinuities.
• Identify general properties (piecewise Lipschitz fns with dispersed discontinuities) sufficient for strong bounds.
• Show these properties hold for many algorithm selection problems.
Structure of the Talk
• Overview. Data driven algo design as batch learning.
• Data driven algo design via online learning.
• An example: clustering via parametrized linkage.
• Online learning of non-convex (piecewise Lipschitz) fns.
Example: Clustering Problems
Clustering: Given a set of objects, organize them into natural groups.
• E.g., cluster news articles, or web pages, or search results by topic.
• Or, cluster customers according to purchase history.
Often need to solve such problems repeatedly.
• E.g., clustering news articles (Google news).
• Or, cluster images by who is in them.
Example: Clustering Problems
Clustering: Given a set of objects, organize them into natural groups.
Objective based clustering
Input: a set of objects S and a distance function d.
Output: centers {c1, c2, …, ck}.
k-means: minimize Σ_p min_i d²(p, c_i).
k-median: minimize Σ_p min_i d(p, c_i).
k-center/facility location: minimize the maximum radius.
• Finding OPT is NP-hard, so no universal efficient algo that works on all domains.
Clustering: Linkage + Dynamic Programming
Family of poly time 2-stage algorithms:
1. Use a greedy linkage-based algorithm to organize data into a hierarchy (tree) of clusters.
2. Dynamic programming over this tree to identify the pruning corresponding to the best clustering.
[Figure: a hierarchy (tree) over points A, B, C, D, E, F, and two prunings of the tree into clusters.]
Clustering: Linkage + Dynamic Programming
1. Use a linkage-based algorithm to get a hierarchy.
2. Dynamic programming to find the best pruning (a sketch of this step follows below).
Both steps can be done efficiently.
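A minimal sketch (my own, not from the talk) of step 2: dynamic programming over a binary cluster tree to find the minimum-cost pruning into k clusters. The `Node` representation is hypothetical, and the cluster cost shown is a simple k-median cost with centers restricted to data points.

```python
from collections import namedtuple

# A cluster-tree node: the points it contains, and its two children (None at leaves).
Node = namedtuple("Node", ["points", "left", "right"])

def cluster_cost(points, d):
    """k-median cost of a single cluster, with the center restricted to be a data point."""
    return min(sum(d(c, p) for p in points) for c in points)

def best_pruning(node, k, d, memo=None):
    """Minimum total cost of pruning the subtree rooted at `node` into exactly k clusters.
    Returns (cost, clusters)."""
    memo = {} if memo is None else memo
    key = (id(node), k)
    if key in memo:
        return memo[key]
    if k == 1:
        res = (cluster_cost(node.points, d), [node.points])
    elif node.left is None:  # a leaf cannot be split further
        res = (float("inf"), [])
    else:
        res = (float("inf"), [])
        for k_left in range(1, k):  # split the cluster budget between the two children
            c1, p1 = best_pruning(node.left, k_left, d, memo)
            c2, p2 = best_pruning(node.right, k - k_left, d, memo)
            if c1 + c2 < res[0]:
                res = (c1 + c2, p1 + p2)
    memo[key] = res
    return res
```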
Linkage Procedures for Hierarchical Clustering
Bottom-Up (agglomerative)
[Figure: a topic hierarchy: all topics → sports (soccer, tennis) and fashion (Gucci, Lacoste).]
• Start with every point in its own cluster.
• Repeatedly merge the “closest” two clusters.
Different defs of “closest” give different algorithms.
Linkage Procedures for Hierarchical Clustering
Have a distance measure on pairs of objects.
d(x,y) – distance between x and y
E.g., # keywords in common, edit distance, etc.
• Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
• Average linkage: dist(A, B) = avg_{x∈A, x′∈B} dist(x, x′)
• Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)
• Parametrized family, α-weighted linkage:
  dist(A, B) = α min_{x∈A, x′∈B} dist(x, x′) + (1 − α) max_{x∈A, x′∈B} dist(x, x′)
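A naive sketch of the α-weighted merge rule just defined (assumptions mine: an unoptimized O(n³)-style loop, points assumed hashable, and the merge tree returned as nested tuples):

```python
def alpha_weighted_linkage(points, d, alpha):
    """Agglomerative clustering with
    dist(A, B) = alpha * min d(x, x') + (1 - alpha) * max d(x, x') over x in A, x' in B."""
    clusters = [(frozenset([p]), p) for p in points]  # (member set, merge tree)

    def link(A, B):
        pair = [d(x, y) for x in A for y in B]
        return alpha * min(pair) + (1 - alpha) * max(pair)

    while len(clusters) > 1:
        # find the pair of clusters that is closest under the alpha-weighted distance
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: link(clusters[ij[0]][0], clusters[ij[1]][0]),
        )
        (A, tA), (B, tB) = clusters[i], clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((A | B, (tA, tB)))
    return clusters[0][1]  # the full merge tree (the hierarchy fed to the DP step)
```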
Clustering: Linkage + Dynamic Programming
1. Use a linkage-based algorithm to get a hierarchy.
2. Dynamic programming to find the best pruning.
• Used in practice. E.g., [Filippova-Gadani-Kingsford, BMC Bioinformatics]
• Strong properties. E.g., best known algos for perturbation-resilient instances of k-median, k-means, k-center.
Perturbation resilience (PR): small changes to input distances shouldn't move the optimal solution by much.
[Balcan-Liang, SICOMP 2016] [Awasthi-Blum-Sheffet, IPL 2011] [Angelidakis-Makarychev-Makarychev, STOC 2017]
Clustering: Linkage + Dynamic Programming
Key fact: If we fix a clustering instance of n points and vary α ∈ ℝ, there are at most O(n⁸) switching points where the behavior on that instance changes.
So, the cost function is piecewise constant with at most O(n⁸) pieces.
[Figure: cost on a fixed instance as a piecewise-constant function of α ∈ ℝ.]
Clustering: Linkage + Dynamic Programming
[Figure: two candidate merges: clusters 𝒩1 and 𝒩2 (via point pairs (p, q) and (p′, q′)), and clusters 𝒩3 and 𝒩4 (via point pairs (r, s) and (r′, s′)).]
• For a given α, which will merge first: 𝒩1 and 𝒩2, or 𝒩3 and 𝒩4?
• Depends on which of (1 − α)d(p, q) + αd(p′, q′) or (1 − α)d(r, s) + αd(r′, s′) is smaller.
Key idea:
• An interval boundary corresponds to an equality involving 8 points, so there are O(n⁸) interval boundaries.
Key fact: If we fix a clustering instance of n points and vary α, there are at most O(n⁸) switching points where the behavior on that instance changes.
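A quick way to see this piecewise-constant structure empirically (a sketch under my own assumptions; `cost_of_alpha` is a hypothetical callback that runs the α-weighted linkage + DP pipeline on one fixed instance and returns the clustering cost):

```python
import numpy as np

def empirical_switch_points(cost_of_alpha, grid_size=2000):
    """Sweep alpha over a fine grid on a fixed instance and record where the cost changes.
    The returned values approximate the (at most O(n^8)) switching points."""
    alphas = np.linspace(0.0, 1.0, grid_size)
    costs = [cost_of_alpha(a) for a in alphas]
    return [alphas[i] for i in range(1, grid_size) if costs[i] != costs[i - 1]]
```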
Online Algorithm Selection
• [Balcan-Dick-Vitercik, FOCS 2018], [Balcan-Dick-Pegden, 2019]: online algorithm selection.
• Challenge: cannot use known techniques; scoring functions are non-convex, with lots of discontinuities.
• Identify general properties (piecewise Lipschitz fns with dispersed discontinuities) sufficient for strong bounds.
• Show these properties hold for many algorithm selection problems.
Online Algorithm Selection via Online Optimization
Online optimization of general piecewise Lipschitz functions
Goal: minimize regret: max_{ρ∈𝒞} Σ_{t=1}^{T} u_t(ρ) − 𝔼[Σ_{t=1}^{T} u_t(ρ_t)]
(performance of the best parameter in hindsight minus our cumulative performance).
On each round t ∈ {1, …, T}:
1. Online learning algo chooses a parameter ρ_t.
2. Adversary selects a piecewise Lipschitz function u_t: 𝒞 → [0, H]
   • corresponds to some problem instance and its induced scoring function.
3. Get feedback:
   Payoff: score of the parameter we selected, u_t(ρ_t).
   Full information: observe the function u_t(∙).
   Bandit feedback: observe only the payoff u_t(ρ_t).
Online Regret Guarantees
Existing techniques (for the finite, linear, or convex case): select ρ_t probabilistically based on performance so far.
• Probability exponential in performance [Cesa-Bianchi and Lugosi 2006].
• Regret guarantee: max_{ρ∈𝒞} Σ_{t=1}^{T} u_t(ρ) − 𝔼[Σ_{t=1}^{T} u_t(ρ_t)] = Õ(√T × ⋯).
No-regret: per-round regret approaches 0 at rate Õ(1/√T).
Challenge: with discontinuities, cannot get no-regret in general.
To achieve low regret, need a structural condition.
• Adversary can force online algo to “play 20 questions” while hiding an arbitrary real number.
• Round 1: adversary splits parameter space in half and randomly chooses one half to perform well, other half to perform poorly.
• Round 2: repeat on parameters that performed well in round 1. Etc.
• Any algorithm does poorly half the time in expectation but ∃ perfect 𝝆.
[Figure: a piecewise Lipschitz function (Lipschitz within each piece). Dispersed: few discontinuity boundaries within any interval. Not dispersed: many boundaries within an interval.]
Dispersion, Sufficient Condition for No-Regret
u1(∙), …, uT(∙) are (w, k)-dispersed if any ball of radius w contains boundaries for at most k of the u_i.
Full info: exponentially weighted forecaster [Cesa-Bianchi-Lugosi 2006]
Our results: for dispersed functions, regret Õ(√(Td) × (fnc of problem)).
On each round t ∈ {1, …, T}:
• Sample ρ_t from the distribution p_t: p_t(ρ) ∝ exp(λ Σ_{s=1}^{t−1} u_s(ρ)).
Dispersion, Sufficient Condition for No-Regret
Full info: exponentially weighted forecaster [Cesa-Bianchi-Lugosi 2006]
On each round t ∈ {1, …, T}:
• Sample ρ_t from the distribution p_t: p_t(ρ) = exp(λ U_t(ρ)) / W_t
  (density of ρ exponential in its performance so far),
  where U_t(ρ) = Σ_{s=0}^{t−1} u_s(ρ) is the total payoff of ρ up to time t and W_t = ∫ exp(λ U_t(ρ)) dρ.
Our results:
Thm: Assume u1(∙), …, uT(∙) are piecewise L-Lipschitz and (w, k)-dispersed. The expected regret is O(H(√(Td log(R/w)) + k) + TLw).
For most problems:
• Set w ≈ 1/√T, k = √T × (fnc of problem).
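A minimal sketch of the full-information exponentially weighted forecaster above, with the continuous distribution p_t approximated by a finite grid over the parameter set (the discretization is my simplification, not part of the algorithm as analyzed):

```python
import numpy as np

def exp_weighted_forecaster(utilities, grid, lam, seed=0):
    """utilities -- list of functions u_t(rho) with values in [0, H], one per round (full information)
    grid      -- 1-D array of candidate parameters discretizing the set C
    lam       -- the step size lambda from the theorem
    Returns the played parameters and the total payoff obtained."""
    rng = np.random.default_rng(seed)
    U = np.zeros(len(grid))                    # U_t(rho): total payoff of each rho so far
    played, total = [], 0.0
    for u_t in utilities:
        w = np.exp(lam * (U - U.max()))        # proportional to exp(lam * U_t(rho)); max-shift for stability
        p = w / w.sum()
        rho_t = rng.choice(grid, p=p)          # sample rho_t from p_t
        played.append(rho_t)
        total += u_t(rho_t)
        U += np.array([u_t(r) for r in grid])  # full-information update of every parameter
    return played, total
```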
Dispersion, Sufficient Condition for No-Regret
Key idea: provide upper and lower bounds on W_{T+1}/W_1, where W_t = ∫ exp(λ U_t(ρ)) dρ is the normalization factor.
• Upper bound: classic, relate the algorithm's performance to the magnitude of the update:
  W_{t+1}/W_t ≤ exp((e^{Hλ} − 1) P_t / H), where P_t = expected algorithm payoff at time t.
  So W_{T+1}/W_1 ≤ exp((e^{Hλ} − 1) P_ALG / H), where P_ALG = expected algorithm total payoff.
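For completeness, the one-line calculation behind the per-round upper bound (the standard exponential-weights argument, filled in by me; it uses convexity of e^{λz} on [0, H] and 1 + x ≤ e^x):

```latex
\frac{W_{t+1}}{W_t}
  = \int \frac{e^{\lambda U_t(\rho)}}{W_t}\, e^{\lambda u_t(\rho)}\, d\rho
  = \mathbb{E}_{\rho \sim p_t}\!\left[e^{\lambda u_t(\rho)}\right]
  \le \mathbb{E}_{\rho \sim p_t}\!\left[1 + \frac{(e^{H\lambda}-1)\, u_t(\rho)}{H}\right]
  \le \exp\!\left(\frac{(e^{H\lambda}-1)\, P_t}{H}\right),
\qquad P_t = \mathbb{E}_{\rho \sim p_t}\!\left[u_t(\rho)\right].
```

Multiplying over t = 1, …, T telescopes into the bound on W_{T+1}/W_1 stated above.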
Dispersion, Sufficient Condition for No-Regret
Key idea: analyze W_t = ∫ exp(λ U_t(ρ)) dρ (the normalization factor).
• Lower bound (uses dispersion + Lipschitzness):
  Let ρ* be the optimal parameter vector in hindsight. For all ρ ∈ B(ρ*, w):
  Σ_{t=1}^{T} u_t(ρ) ≥ Σ_{t=1}^{T} u_t(ρ*) − Hk − TLw,
  i.e., U_{T+1}(ρ) ≥ U_{T+1}(ρ*) − Hk − TLw.
  (At most k of the functions u_s have a discontinuity in B(ρ*, w) and their range is [0, H]; the remaining functions are L-Lipschitz.)
Dispersion, Sufficient Condition for No-Regret
• Lower bound (continued): since U_{T+1}(ρ) ≥ U_{T+1}(ρ*) − Hk − TLw for all ρ ∈ B(ρ*, w),
  W_{T+1} ≥ Vol(B(ρ*, w)) exp(λ(U_{T+1}(ρ*) − Hk − TLw)).
  Get W_{T+1}/W_1 ≥ (w/R)^d exp(λ(U_{T+1}(ρ*) − Hk − TLw)).
Dispersion, Sufficient Condition for No-Regret
• Combining the upper and lower bounds on W_{T+1}/W_1:
  exp((e^{Hλ} − 1) P_ALG / H) ≥ (w/R)^d exp(λ(U_{T+1}(ρ*) − Hk − TLw)).
  Take logs, use the standard approximation of e^z and the fact that P_ALG ≤ HT:
  U_{T+1}(ρ*) − P_ALG ≤ H²Tλ + d log(R/w)/λ + Hk + TLw.
  Set λ = (1/H)√((d/T) log(R/w)): Regret = O(H(√(Td log(R/w)) + k) + TLw).
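The choice of λ is just the usual balancing of the two λ-dependent terms; spelled out (my filling-in of the arithmetic):

```latex
H^2 T \lambda \;=\; \frac{d \log(R/w)}{\lambda}
\;\Longleftrightarrow\;
\lambda \;=\; \frac{1}{H}\sqrt{\frac{d \log(R/w)}{T}},
\qquad\text{and then}\qquad
H^2 T \lambda + \frac{d \log(R/w)}{\lambda} \;=\; 2H\sqrt{T\, d \log(R/w)}.
```

Adding back the Hk + TLw terms gives the stated regret O(H(√(Td log(R/w)) + k) + TLw).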
Example: Clustering with ρ-weighted linkage
ρ-weighted linkage: dist(A, B) = ρ min_{x∈A, x′∈B} dist(x, x′) + (1 − ρ) max_{x∈A, x′∈B} dist(x, x′)
Theorem: If the T instances have distances in [0, B] drawn from κ-bounded densities, then for any w, with probability ≥ 1 − δ, we get (w, k)-dispersion for k = O(wn⁸κ²B²T) + O(√(T log(n/δ))).
(The first term bounds, for any given interval I, the expected # of instances with discontinuities in I; the second term comes from a uniform convergence argument.)
Implies expected regret of O(H√(T log(TnκB))).
Example: Clustering with ρ-weighted linkage
• Each instance has O(n⁸) discontinuities, of the form ρ = (d1 − d3) / ((d1 − d3) + (d4 − d2)).
  [Figure: clusters C1, C2, C3, C4 with relevant pairwise distances d1, d2, d3, d4. Given ρ, which merges first: C1 and C2, or C3 and C4? The critical value is where (1 − ρ)d1 + ρd2 = (1 − ρ)d3 + ρd4.]
• Each discontinuity is drawn from an O((κB)²)-bounded density.
• So, for any given interval I of width w, the expected # of instances with discontinuities in I is O(wn⁸κ²B²T). Uniform convergence on top of this.
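A tiny sketch of where a single discontinuity comes from, solving the tie equation above for ρ (the numeric distances in the usage comment are made up for illustration):

```python
def critical_rho(d1, d2, d3, d4):
    """Value of rho where the two candidate merges tie:
    (1 - rho)*d1 + rho*d2 == (1 - rho)*d3 + rho*d4,
    i.e. rho = (d1 - d3) / ((d1 - d3) + (d4 - d2))."""
    denom = (d1 - d3) + (d4 - d2)
    return (d1 - d3) / denom if denom != 0 else None  # None: the merge order never flips

# Example: critical_rho(2.0, 5.0, 3.0, 4.0) == 0.5; both merge scores equal 3.5 at rho = 0.5.
```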
Proving Dispersion
Key ideas:
• Use functional form of discontinuities to get distribution of discontinuity locations, and expected number in any given interval.
• Randomness either from input instances (smoothed model) or from randomness in the algorithms themselves
• Use uniform convergence result on the top.
Lemma: Let u1, …, uT : ℝ → ℝ be independent piecewise Lipschitz functions, each with ≤ N discontinuities. For an interval I, let D(I) be the # of u_i with at least one discontinuity in I. Then w.h.p. (≥ 1 − δ):
sup_I (D(I) − E[D(I)]) ≤ O(√(T log(N/δ))).
Proof sketch:
• For an interval I, let f_I(i) = 1 iff u_i has a discontinuity in I. Then D(I) = Σ_i f_I(i).
• Show the VC-dimension of ℱ = {f_I} is O(log N), then use uniform convergence.
Proof: VC-dimension of ℱ is O(log N)
Lemma: For an interval I and a set s of ≤ N points, let f_I(s) = 1 iff s has a point inside I. The VC-dimension of the class ℱ = {f_I} is O(log N).
Proof idea: how many ways can ℱ label d instances (d sets of ≤ N points) s1, …, s_d?
• The d instances have at most Nd points in their union.
• For f_I and f_I′ to label the instances differently, a minimal requirement is that the intervals I and I′ don't cover exactly the same points.
• But given Nd points, there are at most (Nd choose 2) + 1 different subsets of them that can be covered using intervals.
To shatter d instances, we must have 2^d ≤ (Nd choose 2) + 1, so the VC-dimension d = O(log N).
Example: Knapsack
A problem instance is a collection of items i, each with a value v_i and a size s_i, along with a knapsack capacity C.
Goal: select the most valuable subset of items that fits. Find a subset V to maximize Σ_{i∈V} v_i subject to Σ_{i∈V} s_i ≤ C.
Greedy family of algorithms:
• Greedily pack items in order of value (highest value first), subject to feasibility.
• Greedily pack items in order of v_i / s_i^ρ, subject to feasibility.
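A minimal sketch of this one-parameter greedy family (representation choices mine):

```python
def greedy_knapsack(values, sizes, capacity, rho):
    """Pack items in decreasing order of v_i / s_i**rho, skipping items that no longer fit.
    rho = 0 recovers the pure value-order greedy; rho = 1 is the classic value/size ratio."""
    order = sorted(range(len(values)), key=lambda i: values[i] / sizes[i] ** rho, reverse=True)
    chosen, used, total = [], 0.0, 0.0
    for i in order:
        if used + sizes[i] <= capacity:
            chosen.append(i)
            used += sizes[i]
            total += values[i]
    return chosen, total

# Example: values=[10, 7, 5], sizes=[8, 4, 3], capacity=10:
#   rho = 0 packs item 0 only (value 10); rho = 1 packs items 1 and 2 (value 12).
```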
Thm: Adversary chooses knapsack instances with n items, capacity C, sizes s_i in [1, C], values v_i in (0, 1], and item values drawn independently from b-bounded distributions. Utilities u1, …, uT are piecewise constant and, w.h.p. (≥ 1 − δ), (w, k)-dispersed for k = O(wTn²b² log C) + O(√(T log(n/δ))).
Online Learning Summary
• Exploit dispersion to get no-regret.
• Learning theory: techniques of independent interest beyond algorithm selection.
• Can bound Rademacher complexity as a function of dispersion.
• For partitioning problems via IQPs (SDP + rounding), exploit smoothness in the algorithms themselves.
• For clustering & subset selection via greedy, assume input instances are smoothed to show dispersion.
• Also differential privacy bounds.
• Also bandit and semi-bandit feedback.