Mathematical Theories of Interaction with Oracles
Liu Yang, Carnegie Mellon University
© Liu Yang 2013
Thesis Committee
Avrim Blum (co-chair), Jaime Carbonell (co-chair), Manuel Blum, Sanjoy Dasgupta (UC San Diego), Yishay Mansour (Tel Aviv University), Joel Spencer (Courant Institute, NYU)
Outline
• Active Property Testing
- Do we need to imitate humans to advance AI?
- I see airplanes can fly without flapping their wings.
Property Testing
• Given access to a massive dataset: want to quickly determine whether a given fn f has some given property P or is far from having it
• Goal: test from a very small number of queries
• One motivation: preprocessing step before learning
Property Testing
• Instance space X = R^n (distribution D over X)
• Tested function f : X -> {-1,1}
• A property P of Boolean fns is a subset of all Boolean fns h : X -> {-1,1} (e.g., LTFs)
• dist_D(f, P) := min_{g ∈ P} Pr_{x~D}[f(x) ≠ g(x)]
• Standard type of query: membership query (ask for f(x) at an arbitrary point x)
Property Testing: An Example
• E.g., union of d intervals: 0----++++----+++++++++-----++---+++--------1
- UINT_4? Accept! UINT_3? Depends on ε
- Model selection: testing can tell us how big d needs to be to be close to the target (double and guess: d = 2, 4, 8, 16, …)
• If f ∈ P, should accept w.p. 2/3
• If dist(f, P) > ε, should reject w.p. 2/3
Property Testing and Learning: Motivation
• What is property testing for?
- Quickly tell whether we are using the right fn class
- Estimate the complexity of a fn without actually learning it
• Want to do it with fewer queries than learning
Standard Model Uses Membership Queries
• Results on testing basic Boolean fns using MQs: constant query complexity for UINT_d, dictators, LTFs, …
However …
Membership Queries are Unrealistic for ML Problems: An Object Recognition Example
• Recognizing cat/dog? An MQ gives … "Is this a dog or a cat?"
An Example: Movie Reviews
• Is this a positive or negative review?
• Typical representation in ML (bag of words): {fell, holding, interest, movie, my, of, short, this}
• The original review (what human labelers see): "This movie fell short of holding my interest."
- The object a human expert labels has more structure than the internal representation used by the alg.
- MQs construct examples in the internal representation.
- It can be very difficult to order a constructed example's words so that a human can label it (esp. for long reviews).
Passive Testing: Wastes Too Many Queries
• ML people move on
• Can we SAVE #queries?
• Passive model (sample from D): queried samples exist in nature, but quite wasteful (many examples are uninformative)
Active Testing
• Pool of unlabeled data (poly-size)
• Alg can ask for labels, but only of points in the pool
• Goal: small #queries
Property Tester
• Definition. An s-sample, q-query ε-tester for P over the distribution D is a randomized algorithm A that draws s samples from D (cheap), sequentially queries for the value of f on q of those samples (expensive), and then
1. Accepts w.p. at least 2/3 when f ∈ P
2. Rejects w.p. at least 2/3 when dist_D(f, P) > ε
• Active tester: s = poly(n). Passive tester: s = q. MQ tester: s = ∞ (D = Unif).
Active Property Testing
• Testing as a preprocessing step for learning
• Need an example where active testing gets the same query-complexity savings as MQ, is better in query complexity than passive, and needs fewer queries than learning
• Union of d intervals: active testing helps! 0----++++----+++++++++-----++---+++--------1
- Testing tells how big d needs to be to be close to the target
- #Labels: active testing needs O(1), passive testing needs Θ(√d), active learning needs Θ(d)
Our Results

                        Active Testing    Passive Testing    Active Learning
Union of d intervals    O(1)              Θ(d^{1/2})         Θ(d)
Dictator                Θ(log n)          Θ(log n)           Θ(log n)
Linear threshold fn     O(n^{1/2})        Θ̃(n^{1/2})         Θ(n)
Cluster assumption      O(1)              Ω(N^{1/2})         Θ(N)

NEW!! MQ-like on testing UINT_d; passive-like on testing dictator.
Testing Unions of Intervals
0----++++----+++++++++-----++---+++--------1
• Theorem. Testing UINT_d in the active testing model can be done using O(1) queries.
• Recall: learning requires Ω(d) examples.
Testing Unions of Intervals (cont.)
• Suppose the uniform distribution
• Definition: Fix δ > 0. The local δ-noise sensitivity of a fn f : [0,1] -> {0,1} at x ∈ [0,1] is NS_δ(f, x) = Pr_{y~Unif[x−δ, x+δ]}[f(y) ≠ f(x)]. The noise sensitivity of f is NS_δ(f) = E_x[NS_δ(f, x)].
• Proposition (easy): Fix δ > 0. Let f : [0,1] -> {0,1} be a union of d intervals. Then NS_δ(f) ≤ dδ.
• Lemma (hard): Fix δ = ε²/(32d). Let f : [0,1] -> {0,1} be a fn with noise sensitivity bounded by NS_δ(f) ≤ dδ(1 + ε/4). Then f is ε-close to a union of d intervals.
Easy Lemma
• Lemma. If f is a union of ≤ d intervals, then NS_δ(f) ≤ dδ.
Proof sketch:
- The probability that x lands within distance δ of one of the ≤ 2d boundaries is at most 2d·2δ.
- The probability that y crosses a boundary, given that x is within distance δ of it, is 1/4.
- So Pr[f(x) ≠ f(y)] ≤ (2d·2δ)·(1/4) = dδ. (A sketch of estimating NS_δ follows.)
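The tester behind these lemmas just estimates NS_δ(f) from random nearby pairs and compares it to the gap the two lemmas create. A minimal sketch, assuming oracle access to f for simplicity (the actual tester queries only pooled samples, and the exact acceptance cutoff here is illustrative):

```python
import random

def estimate_noise_sensitivity(f, delta, num_pairs=100000, rng=random):
    """Estimate NS_delta(f) = Pr[f(x) != f(y)] with x ~ Unif[0,1] and
    y ~ Unif[x - delta, x + delta] (clipped to stay in [0,1])."""
    disagree = 0
    for _ in range(num_pairs):
        x = rng.random()
        y = min(max(x + rng.uniform(-delta, delta), 0.0), 1.0)
        disagree += f(x) != f(y)
    return disagree / num_pairs

# Illustrative f: a union of d = 3 intervals given by its endpoints.
INTERVALS = ((0.1, 0.2), (0.4, 0.6), (0.8, 0.9))
f = lambda x: 1 if any(a <= x <= b for a, b in INTERVALS) else 0

eps, d = 0.1, 3
delta = eps**2 / (32 * d)
ns = estimate_noise_sensitivity(f, delta)
# Accept when the estimate falls below a cutoff between d*delta (easy lemma)
# and d*delta*(1 + eps/4) (hard lemma); the midpoint here is illustrative.
print(ns, "vs cutoff", d * delta * (1 + eps / 8))
```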
Hard Lemma
• Lemma. Fix δ = ε²/(32d). If f is ε-far from a union of d intervals, then NS_δ(f) > (1 + ε/4)dδ.
Proof strategy: if NS_δ(f) is small, do "self-correction":
g(x) = E[f(y) | y ~ Unif[x−δ, x+δ]]; f′(x) = round g(x) to 0 if ≤ τ, or to 1 if ≥ 1−τ.
Hard Lemma (cont.)
Proof strategy:
- Argue dist(f, f′) ≤ ε/2.
- Show f′ is a union of ≤ d(1 + ε/2) intervals.
- This implies dist(f′, P) ≤ ε/2.
[Figure: a union of intervals on [0,1] under the uniform distribution; only points within distance δ of an interval boundary contribute to the noise sensitivity.]
Testing Unions of Intervals (cont.)
• Theorem. Testing UINT_d in the active testing model can be done using O(1) queries.
• If the distribution is non-uniform, use data to stretch/squash the axis, making the distribution near-uniform.
• Total number of unlabeled samples: O(d^{1/2}).
Testing Linear Threshold Fns
• Linear threshold functions (LTFs): f(x) = sign(⟨w, x⟩), for w, x ∈ R^n
Testing Linear Threshold Fns (cont.)
• Theorem. We can efficiently test LTFs under the Gaussian distribution with Õ(n^{1/2}) labeled examples in both the active and passive testing models.
• We have lower bounds of Ω̃(n^{1/3}) for active testing and Ω̃(n^{1/2}) for passive testing.
• Learning LTFs needs Ω(n) labels under the Gaussian, so testing is better than learning in this case.
Testing Linear Threshold Fns (cont.)
• [MORS'10] => it suffices to estimate E[f(x) f(y) ⟨x,y⟩] up to ± poly(ε).
• Intuition: an LTF is characterized by a nice linear relation between the angle (⟨x,y⟩) and the probability of having the same label (f(x)f(y) = 1).
Testing Linear Threshold Fns (cont.)
• [MORS'10] => it suffices to estimate E[f(x) f(y) ⟨x,y⟩] up to ± poly(ε).
• Could take m random pairs and use the empirical average.
- But most pairs x, y have |⟨x,y⟩| ≈ n^{1/2} (CLT), so we would need m = Ω(n) pairs to get within ± poly(ε).
• Solution: take O(n^{1/2}) random points and average f(x)f(y)⟨x,y⟩ over all O(n) pairs x, y (sketched below).
- Concentration inequalities for U-statistics [Arcones '95] imply this works.
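A minimal sketch of this pairing trick; the direction w and the sample sizes are purely illustrative:

```python
import numpy as np

def ustat_estimate(f, n, num_points, seed=0):
    """Estimate E[f(x) f(y) <x,y>] under the standard Gaussian by averaging
    f(x)f(y)<x,y> over ALL pairs among num_points labeled samples
    (a U-statistic), instead of over num_points/2 independent pairs."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((num_points, n))   # num_points Gaussian samples
    labels = np.array([f(x) for x in X])       # one label query per point
    G = X @ X.T                                # all pairwise inner products
    P = np.outer(labels, labels) * G           # f(x)f(y)<x,y> for every pair
    iu = np.triu_indices(num_points, k=1)      # distinct pairs only
    return P[iu].mean()

# Example with a known LTF; w is an arbitrary illustrative direction.
n = 100
w = np.ones(n) / np.sqrt(n)
f = lambda x: 1.0 if w @ x >= 0 else -1.0
print(ustat_estimate(f, n, num_points=4 * int(np.sqrt(n))))
```

With O(√n) labeled points we get O(n) (dependent) pairs; the cited U-statistic concentration results are what justify treating this dependent average like Ω(n) independent pairs.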
General Testing Dimension
• The testing dimension characterizes (up to constant factors) the intrinsic number of label requests needed to test the given property w.r.t. the given distribution.
• All our lower bounds are proved via testing dimensions.
Minimax Argument
• min_Alg max_f P(Alg mistaken) = max_{π0} min_Alg P(Alg mistaken)
• W.l.o.g., π0 = α·π + (1−α)·π′, with π ∈ Π_0 and π′ ∈ Π_ε
• Let π_S, π′_S be the induced distributions on the labels of S.
• For a given π0, min_Alg P(Alg makes a mistake | S) ≤ 1 − d_S(π, π′)
Passive Testing Dim
• Define d_passive as the largest q ∈ N s.t. …
• Theorem: The sample complexity of passive testing is Θ(d_passive).
• Compare with VC dimension: there we want a set S s.t. every labeling occurs at least once.
Active Testing Dim
• Fair(π, π′, U): distribution of labeled (y, ℓ): w.p. 1/2 choose y ~ π_U, ℓ = 1; w.p. 1/2 choose y ~ π′_U, ℓ = 0.
• err*(H; P): error of the optimal fn in H w.r.t. data drawn from distribution P over labeled examples.
• Given u = poly(n) unlabeled examples, d_active(u): the largest q ∈ N s.t. …
• Theorem: Active testing with failure probability 1/8 using u unlabeled examples needs Ω(d_active(u)) label queries; it can be done with O(u) unlabeled examples and O(d_active(u)) label queries.
Application: Dictator Fns
• Theorem: For dictator functions under the uniform distribution, d_active(u) = Θ(log n) (for any large-enough u = poly(n)).
• Corollary: Any class that contains dictator functions requires Ω(log n) queries to test in the active model, including poly-size decision trees, functions of low Fourier degree, juntas, DNFs, etc.
Application: Dictator Fns (cont.)
• Theorem: For dictator functions under the uniform distribution, d_active(u) = Θ(log n) (for any large-enough u = poly(n)).
- π = uniform over dictator fns
- π′ = uniform over all Boolean fns
Application: LTFs
• Theorem. For LTFs under the standard n-dim Gaussian distribution, d_passive = Ω((n/log n)^{1/2}) and d_active(u) = Ω((n/log n)^{1/3}) (for any u = poly(n)).
- π: distribution over LTFs obtained by choosing w ~ N(0, I_{n×n}) and outputting f(x) = sgn(⟨w, x⟩).
- π′: uniform distribution over all functions.
- To obtain d_passive: bound the total variation distance between the distribution of Xw/√n and N(0, I_{q×q}).
- To obtain d_active: similar to the dictator lower bound, but relying on strong concentration bounds for the spectrum of random matrices.
Open Problems
• Matching lower/upper bounds for active testing of LTFs: √n?
• Tolerant testing, ε/2 vs. ε (UINT_d, LTFs)
• Testing LTFs under general distributions
Outline
• Learnability of DNF with Representation-Specific Queries
- Liu: We do statistical learning for …
- Marvin: But we haven't done well at the fundamentals, e.g., knowledge representation.
Learning DNF Formulas
• Poly-sized DNF: #terms = n^{O(1)}, e.g., f = (x1 ∧ x2) ∨ (x1 ∧ x4)
- A natural form of knowledge representation
- PAC-learning DNF appears to be very hard.
• Notation: n: the number of variables. Concept space C: a collection of fns h : {0,1}^n -> {0,1}. Unknown target fn f*: the true labeling fn. err(h) = Pr_{x~D}[h(x) ≠ f*(x)] (distribution D over X).
• The best known alg in the standard model takes exponential time over arbitrary distributions; over Unif, no poly-time alg is known.
New Models: Interaction with Oracles
- Boolean queries: K(x, y) = 1 if x and y share some term
- Numerical queries: K(x, y) = #terms x and y share
- "Hi Tim, do x and y have some term in common?" "Yes!"
• Imagine … a query about similarity of TYPE
• Fraud detection: are two fraudulent transactions x and y of the same type (identity theft, stolen cards, BIN attack, skimming)? YES => x and y share a term
• Type of query: a pair of POSITIVE examples from a random dataset; the teacher says YES if they share some term, or reports how many terms they share.
• Question: can we efficiently learn DNF with this type of query?
Warm-Up: Disjoint DNF w/ Boolean Queries
• Use similarity queries to partition positive examples into t buckets, one per term.
• Separately learn a conjunction for each bucket (intersect the positive examples in it).
• OR the results. (See the sketch below.)
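A minimal sketch of this warm-up, assuming examples are 0/1 tuples and the Boolean similarity oracle K is supplied:

```python
def learn_disjoint_dnf(positives, K, n):
    """Warm-up sketch: for a *disjoint* DNF each positive example satisfies
    exactly one term, so K (K(x, y) = 1 iff x and y share a term) induces an
    equivalence relation. Partition the positives into buckets, intersect
    each bucket into a conjunction, and OR the conjunctions."""
    buckets = []
    for x in positives:
        for b in buckets:
            if K(x, b[0]):              # shares a term with this bucket
                b.append(x)
                break
        else:
            buckets.append([x])         # first example of a new term
    terms = []
    for b in buckets:
        # keep exactly the coordinates on which the whole bucket agrees
        fixed = [(i, b[0][i]) for i in range(n)
                 if all(x[i] == b[0][i] for x in b)]
        terms.append(fixed)
    return lambda x: int(any(all(x[i] == v for i, v in t) for t in terms))
```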
Pos Result 1: Weakly Disjoint DNF w/ Boolean Queries
- Distinguishing example for T1: an example satisfying T1 and no other term
- Weakly disjoint: for each term, a poly(n, 1/ε) fraction of random examples satisfies it and no other term.
- Graph: nodes = positive examples; an edge exists iff K(x, y) = 1.
- Neighbor method: for each example, take all its neighbors in the graph and learn a conjunction.
- The neighbor method, w.p. 1−δ, produces an ε-accurate DNF if the target is weakly disjoint.
Hardness Results: Boolean Queries
Thm. Learning DNF from random data under arbitrary distributions with Boolean queries is as hard as learning DNF from random data under arbitrary distributions with only labels (no queries).
- Group-learn: tell whether data came from D+ or D−
- Reduction from group-learning DNF in the standard model to our model
- How to use our alg A to group-learn? Simulate the oracle by always answering YES (K = 1) whenever a query is made to two positive examples; given the output of A, we obtain a group-learning alg for the original problem.
Hardness Results: Approximate Numerical Queries
Thm. Learning DNF from random data under arbitrary distributions with approximate numerical queries is as hard as learning DNF from random data under arbitrary distributions with only labels; i.e., if C is the #terms x_i and x_j satisfy in common, the oracle returns a value in [(1 − τ)C, (1 + τ)C].
Pos Result 3: Learn DNF w/ Numerical Queries
Thm. Under the uniform distribution, with numerical queries, we can learn any poly(n)-term DNF. View f(x) = T1(x) + T2(x) + … + Tt(x).
- Sample m = O((t/ε) log(t/(εδ))) landmark points.
- For landmark x_i, F_i(·) = K(x_i, ·) is a sum-of-monotone-terms fn (remove the terms not satisfied by the positive example x_i); K is the numerical query.
- Use a subroutine to learn a hypothesis h_i that is ε/(2m)-accurate w.r.t. F_i.
• Subroutine: learn a sum of t monotone terms over Unif, using time and samples poly(t, n, 1/ε).
- Combine all hypotheses h_i into h: h(x) = 0 if h_i(x) = 0 for all i, else h(x) = 1. (A skeleton of this landmark scheme follows.)
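A skeleton of the landmark scheme; `sample_positives` and `learn_sum_of_terms` are assumed to be supplied (the latter is the subroutine of the next slide):

```python
def learn_dnf_numerical(sample_positives, K, learn_sum_of_terms, m, eps):
    """Skeleton of the landmark method above. `sample_positives(m)` draws m
    random positive examples, K is the numerical similarity oracle, and
    `learn_sum_of_terms(F, accuracy)` learns a sum of monotone terms over
    the uniform distribution. All three callables are assumptions."""
    landmarks = sample_positives(m)
    # F_i(x) = K(x_i, x) is the sum of the monotone terms that x_i satisfies.
    hyps = [learn_sum_of_terms(lambda x, xi=xi: K(xi, x), eps / (2 * m))
            for xi in landmarks]
    # h(x) = 0 iff every learned sum is 0 on x, else 1.
    return lambda x: int(any(h(x) != 0 for h in hyps))
```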
Learn Sum of Monotone Terms
[Figure: greedy search over variable subsets S ⊆ {x1, …, x9}. For each candidate S, estimate the Fourier coefficient of S and run the inclusion check: magnitude ≥ ε/(16t)? Starting from S = {x1}, the search keeps S = {x1, x3}, then {x1, x3, x4}, …, and outputs x1 ∧ x3 ∧ x4 ∧ x9.]
Learn Sum of Monotone Terms: Greedy Alg
• Set θ = ε/(8t). Examine each parity fn of size 1 and estimate its Fourier coefficient (up to θ/4 accuracy).
• Place all coefficients of magnitude ≥ θ/2 into a list L1.
• For j = 2, 3, … repeat:
- For each parity fn Φ_S in list L_{j−1} and each x_i not in S, estimate the Fourier coefficient of Φ_{S ∪ {x_i}}.
- If the estimate is ≥ θ/2, add it to list L_j (if not already in).
- Maintain list L_j: size-j parity fns with coefficient magnitude ≥ θ.
• Construct fn g: a weighted sum of the parities with identified coefficients.
• Output the fn h obtained by rounding g. (A sketch follows.)
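A sketch of the greedy search, with Fourier coefficients estimated from uniform samples (sample sizes are illustrative):

```python
import numpy as np

def greedy_fourier(F, n, t, eps, num_samples=50000, seed=0):
    """Sketch of the greedy algorithm above: grow parity sets level by level,
    keeping a set S when the estimated Fourier coefficient of chi_S against F
    (under the uniform distribution on {0,1}^n) has magnitude >= theta/2.
    F maps a 0/1 vector to a real value (e.g., #terms satisfied)."""
    rng = np.random.default_rng(seed)
    theta = eps / (8 * t)
    X = rng.integers(0, 2, size=(num_samples, n))       # uniform samples
    vals = np.array([F(x) for x in X], dtype=float)

    def coeff(S):                                       # estimate F_hat(S)
        chi = np.prod(1 - 2 * X[:, sorted(S)], axis=1)  # chi_S(x) in {-1,+1}
        return float(np.mean(vals * chi))

    kept, level = {}, []
    for i in range(n):                                  # size-1 parities
        c = coeff({i})
        if abs(c) >= theta / 2:
            kept[frozenset({i})] = c
            level.append(frozenset({i}))
    while level:                                        # sizes j = 2, 3, ...
        nxt = []
        for S in level:
            for i in range(n):
                T = S | {i}
                if len(T) > len(S) and T not in kept:
                    c = coeff(T)
                    if abs(c) >= theta / 2:
                        kept[T] = c
                        nxt.append(T)
        level = nxt
    # g(x) = sum over kept S of kept[S] * chi_S(x); h rounds/thresholds g.
    return kept
```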
Other Positive Results

                                                 Binary    Numeric
O(log n)-term DNF (any distrib.)                           ✔
2-term DNF (any distrib.)                         ✔        ✔
DNF, each var in at most O(log n) terms (Unif)    ✔        ✔
log(n)-junta (Unif)                               ✔        ✔
log(n)-junta (any distrib.)                                ✔
DNF with ≤ 2^{O(√log n)} terms (Unif)             ✔        ✔

Open problems:
- Learn arbitrary DNF (Unif, Boolean queries)?
- Learn arbitrary DNF (any distrib., numerical queries)?
Outline
• Active Learning with a Drifting Distribution
- If not every poem has a proof, can we at least try to make every theorem proved beautiful like a poem?
Active Learning with a Drifting Distribution: Model
• Scenario:
- An unobservable sequence of distributions D1, D2, …, with each Dt ∈ 𝒟
- An unobservable time-independent regular conditional distribution, represented by a fn η
- (X1, Y1), (X2, Y2), …: an infinite sequence of independent r.v.s such that Xt ~ Dt and the conditional distribution of Yt given Xt satisfies Pr(Yt = 1 | Xt = x) = η(x)
• Active learning protocol: at each time t, the alg is presented with Xt and is required to predict a label Ŷt; then it may optionally request to see the true label value Yt
• Interested in the cumulative #mistakes up to time T and the total #labels requested up to time T
[Figure: data points x1, x2, x3, x4, …, xt drawn from a drifting sequence of distributions D1, D2, D3, D4, …, Dt, which live in a distribution space over the data space.]
Definitions and Notation
• Instance space X = R^n
• Distribution space 𝒟 of distributions on X
• Concept space C of classifiers h : X -> {-1,1}; assume C has VC dimension vc < ∞
• Dt: data distribution on X at time t
• Unknown target fn h*: the true labeling fn
• err_t(h) = Pr_{x~Dt}[h(x) ≠ h*(x)]
• In the realizable case, h* ∈ C and err_t(h*) = 0.
Def: Disagreement Coefficient, TVD
• The disagreement coefficient of h* under a distribution P on X is defined (for r > 0) as θ = sup_{r>0} P(DIS(B(h*, r)))/r, where B(h*, r) = {h ∈ C : P(x : h(x) ≠ h*(x)) ≤ r} and DIS(H) = {x : ∃h, g ∈ H with h(x) ≠ g(x)}.
• The total variation distance of probability measures P and Q on a sigma-algebra 𝒜 of subsets of the sample space is defined via ||P − Q|| = sup_{A∈𝒜} |P(A) − Q(A)|.
Assumptions
• Independence of the Xt variables
• VC dim vc < ∞
• Assumption 1 (totally bounded): 𝒟 is totally bounded (i.e., for every ε > 0 it has a finite ε-cover in total variation). For each ε > 0, denote by 𝒟_ε a minimal subset of 𝒟 s.t. every element of 𝒟 is within ε of some element of 𝒟_ε (i.e., a minimal ε-cover of 𝒟).
• Assumption 2 (poly-covers): |𝒟_ε| ≤ c·ε^{−m}, where c, m ≥ 0 are constants.
Realizable-Case Active Learning: CAL
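CAL maintains the version space V of concepts consistent with every label seen so far and requests a label only for points in the disagreement region DIS(V). A minimal sketch, with a finite hypothesis list standing in for the concept space C:

```python
def cal(stream, hypotheses, query_label):
    """Minimal sketch of CAL in the realizable case: keep the version space V
    of hypotheses consistent with every queried label; request a label only
    when V disagrees on the current point (i.e., it lies in DIS(V)),
    otherwise predict the unanimous label. `query_label(x)` returns the
    true label Y_t; `hypotheses` is a finite stand-in for C."""
    V = list(hypotheses)
    predictions, num_queries = [], 0
    for x in stream:
        labels = {h(x) for h in V}
        if len(labels) > 1:                   # x in DIS(V): must query
            y = query_label(x)
            num_queries += 1
            V = [h for h in V if h(x) == y]   # shrink the version space
            predictions.append(y)
        else:
            predictions.append(labels.pop())  # unanimous: no query needed
    return predictions, num_queries
```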
Sublinear Result: Realizable Case
Theorem. If 𝒟 is totally bounded, then CAL achieves an expected mistake bound o(T); and if additionally the disagreement coefficient of h* is bounded under every D ∈ 𝒟, then CAL makes an expected number of queries o(T).
[Proof sketch]: Partition 𝒟 into buckets of diameter < ε. Pick a time T_ε past all indices of the finitely-occurring buckets, by which every infinitely-occurring bucket has at least L(ε) samples.
Number of Mistakes
• Alternative scenario:
- Let P_i be in bucket i
- Swap the L(ε) samples for bucket i with L(ε) samples from P_i
- Take L(ε) large enough that E[diam(V)]_alternative < √ε.
- Note: E[diam(V)] ≤ E[diam(V)]_alternative + Σ over the L(ε) values of t of ||P_i − D_t|| < √ε + L(ε)·ε.
- So E[diam(V)] -> 0 as T -> ∞, hence E[#mistakes] = o(T).
Number of Queries
• E[#queries] = Σ_t P(make a query at time t)
• P(make a query at time t) = E[P(DIS(V_{t−1}))]
• Bounding P(DIS(V_{t−1})) via the disagreement coefficient, together with E[diam(V_{t−1})] -> 0, gives E[#queries] = o(T).
Explicit Bound: Realizable Case
Theorem. If the poly-covers assumption is satisfied (|𝒟_ε| ≤ c·ε^{−m}), then CAL achieves an expected mistake bound … and an E[#queries] … .
[Proof sketch] Fix any ε > 0, and enumerate 𝒟_ε = {P_1, P_2, …}. For t ∈ N, let K(t) be the index k of the element of 𝒟_ε closest to Dt. Alternative data sequence: let X′_t be independent, with X′_t ~ P_{K(t)}. This way all samples corresponding to distributions in a given bucket come from the same distribution. Let V′_t be the corresponding version spaces.
E[#mistakes]: a classic PAC bound controls the version space in terms of the #previous distributions in Dt's bucket; since each bucket has at most T samples, summing over t and taking ε appropriately gives the stated theorem.
To bound E[#queries]: again, it is controlled by the quantity just bounded, so taking ε appropriately gives the stated result.
Learning with Noise: Noise Conditions
• Strictly benign noise condition: …
• Special case: Tsybakov's noise condition: η satisfies the strictly benign noise condition and, for some c > 0 and α ≥ 0, …
• Uniform Tsybakov assumption: the Tsybakov assumption is satisfied for all D ∈ 𝒟 with the same c and α values.
Agnostic CAL [DHM]
• Based on a subroutine: …
Tsybakov Noise: Sublinear Results & Explicit Bound
Theorem. If 𝒟 is totally bounded and η satisfies the strictly benign noise condition, then ACAL achieves an excess expected mistake bound o(T); and if additionally …, then ACAL makes an expected number of queries o(T).
Theorem. If the poly-covers assumption and the uniform Tsybakov assumption are satisfied, then ACAL achieves an expected excess number of mistakes and an expected number of queries such that … .
Outline
• Transfer Learning
- Do not ask what Bayesians can do for Machine Learning; ask what Machine Learning can do for Bayesians.
Transfer Learning
• Principle: solving a new learning problem is easier given that we've solved several already!
• How does it help?
- The new task is directly "related" to previous tasks [e.g., Ben-David & Schuller '03; Evgeniou, Micchelli, & Pontil '05]
- Previous tasks give us useful sub-concepts [e.g., Thrun '96]
- Can gather statistical info on the variety of concepts [e.g., Baxter '97; Ando & Zhang '04]
• Example: speech recognition
- After training a few times, we figured out the dialects.
- Next time, just identify the dialect.
- Much easier than training a recognizer from scratch.
Model of Transfer Learning
• Motivation: learners are often not too altruistic.
• Layer 1: draw tasks i.i.d. from an unknown prior: targets h1*, h2*, …, hT* for Tasks 1, 2, …, T.
• Layer 2: per task t, draw data i.i.d. from the target: (x_{t1}, y_{t1}), …, (x_{tk}, y_{tk}).
• Each completed task yields a better estimate of the prior!
- Marvin: So you assume learning French is similar to learning English?
- Liu: It indeed seems many English words have a French counterpart …
Identifiability of Priors from Joint Distributions
• Let the prior π be any distribution on C; example: (w, b) ~ multivariate normal
• Target h*_π ~ π
• Data X = (X1, X2, …) i.i.d. D, independent of h*_π
• Z(π) = ((X1, h*_π(X1)), (X2, h*_π(X2)), …)
• Let [m] = {1, …, m}
• Denote X_I = {X_i}_{i∈I} (I: a subset of the natural numbers)
• Z_I(π) = {(X_i, h*_π(X_i))}_{i∈I}
• Theorem: Z_{[vc]}(π1) =_d Z_{[vc]}(π2) iff π1 = π2.
Identifiability of Priors by vc-Dimensional Joint Distributions
• Thresholds: ---------0 … 1 with +++++ to the right of the threshold
- For two points x1 < x2: Pr(+,+) = Pr(+,·), Pr(−,−) = Pr(·,−), Pr(+,−) = 0, so Pr(−,+) = Pr(·,+) − Pr(+,+) = Pr(·,+) − Pr(+,·).
- For any k > 1 points, can directly reduce the number of labels in the joint probability from k to 1 (checked numerically below):
P(-----------(−+)+++++++++++++++++) = P((−+)) = P((·+)) − P((++)) = P((·+)) − P((+·)) + P((+−)) (an unrealized labeling!) = P((·+)) − P((+·))
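A quick empirical check of this two-point identity; the uniform prior on the threshold is illustrative:

```python
import random

def check_threshold_identity(x1, x2, num_draws=100000, rng=random):
    """Empirical check of the reduction for thresholds: for x1 < x2,
    Pr(-,+) should equal Pr(.,+) - Pr(+,.), so pairwise label probabilities
    are determined by single-point ones. Prior: theta ~ Unif[0,1], and
    f(x) = + iff x >= theta (an illustrative choice)."""
    assert x1 < x2
    minus_plus = plus_first = plus_second = 0
    for _ in range(num_draws):
        theta = rng.random()              # draw h* from the prior
        l1, l2 = x1 >= theta, x2 >= theta
        minus_plus += (not l1) and l2     # event (-, +)
        plus_first += l1                  # Pr(+, .): first point labeled +
        plus_second += l2                 # Pr(., +): second point labeled +
    lhs = minus_plus / num_draws
    rhs = (plus_second - plus_first) / num_draws
    return lhs, rhs                       # the two values should nearly agree

print(check_threshold_identity(0.3, 0.7))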
• Theorem: Z_{[vc]}(π1) =_d Z_{[vc]}(π2) iff π1 = π2.
Proof Sketch
• Let ρ_m(h, g) = (1/m) Σ_{i=1}^m 1[h(X_i) ≠ g(X_i)]. Then vc < ∞ implies that, w.p. 1, for all h, g ∈ C with h ≠ g, lim_{m->∞} ρ_m(h, g) = ρ(h, g) > 0.
• ρ is a metric on C by assumption, so w.p. 1 each h ∈ C labels the ∞-sequence (X1, X2, …) distinctly: (h(X1), h(X2), …).
• => w.p. 1 the conditional distribution of the label sequence Z(π)|X identifies π => the distribution of Z(π) identifies π, i.e., Z_∞(π1) =_d Z_∞(π2) implies π1 = π2.
Identifiability of Priors from Joint Distributions (cont.)
[Slides: reduce the ∞-sample statement to the vc-sample statement via lower-dimensional conditional distributions (moving y′ closer to ỹ).]
Transfer Learning Setting
• Collection Π of distributions on C (known)
• Target distribution π* ∈ Π (unknown)
• Independent target fns h1*, …, hT* ~ π* (unknown)
• Independent i.i.d. D data sets X^{(t)} = (X_1^{(t)}, X_2^{(t)}, …), t ∈ [T]
• Define Z^{(t)} = ((X_1^{(t)}, h_t*(X_1^{(t)})), (X_2^{(t)}, h_t*(X_2^{(t)})), …)
• The learning alg "gets" Z^{(1)}, then produces ĥ1, then "gets" Z^{(2)}, then produces ĥ2, etc., in sequence.
• Interested in: the values ρ(ĥt, ht*), and the number of h_t*(X_j^{(t)}) values the alg needs to access.
Estimating the Prior
• Principle: learning would be easier if we knew π*
• Fact: π* is identifiable from the distribution of Z_{[vc]}^{(t)}
• Strategy: take samples Z_{[vc]}^{(i)} from past tasks 1, …, t−1, use them to estimate the distribution of Z_{[vc]}^{(i)}, and convert that into an estimate π′_{t−1} of π*
• Use π′_{t−1} in a prior-dependent learning alg for the new task ht*
• Assume Π is totally bounded in total variation
• Can estimate π* at a bounded rate: ||π* − π′_t|| < δ_t, where δ_t converges to 0 (holds w.h.p.)
Transfer Learning
• Given a prior-dependent learner A(ε, π), with E[#labels accessed] = Λ(ε, π), producing ĥ with E[ρ(ĥ, h*)] ≤ ε:
For t = 1, …, T:
- If δ_{t−1} > ε/4, run prior-independent learning on Z_{[vc/ε]}^{(t)} to get ĥt
- Else let π″_t = argmin_{π ∈ B(π′_{t−1}, δ_{t−1})} Λ(ε/2, π) and run A(ε/2, π″_t) on Z^{(t)} to get ĥt
• Theorem: For all t, E[ρ(ĥt, ht*)] ≤ ε, and limsup_{T->∞} E[#labels accessed]/T ≤ Λ(ε/2, π*) + vc. (A skeleton of the loop follows.)
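A skeleton of this loop; the three callables (the prior estimator and the two learners) are assumptions standing in for the components named on the slide:

```python
def transfer_learning(tasks, T, eps, prior_independent_learner,
                      prior_dependent_learner, estimate_prior):
    """Sketch of the transfer strategy above: while the prior estimate is
    still coarse (delta > eps/4), fall back to prior-independent learning;
    once it is accurate enough, feed the estimated prior into the
    prior-dependent learner A(eps/2, pi)."""
    history, hypotheses = [], []
    for t in range(1, T + 1):
        pi_hat, delta = estimate_prior(history)  # ||pi* - pi_hat|| < delta whp
        data_t = tasks[t - 1]                    # Z^(t): labeled data, task t
        if delta > eps / 4:
            h_t = prior_independent_learner(data_t, eps)
        else:
            h_t = prior_dependent_learner(data_t, eps / 2, pi_hat)
        hypotheses.append(h_t)
        history.append(data_t)   # its first vc labeled points feed the estimate
    return hypotheses
```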
- Yonatan: I'll send you an email to summarize what we just discussed.
- Liu: Thank you, but I have now invented a model to transfer knowledge with provable guarantees, so I use that all the time.
- Yonatan: But that's an asymptotic guarantee. My life span is finite. So I'm still gonna send you an email.
Outline
• Online Allocation and Pricing with Economies of Scale
- Jamie Dimon: Economies of scale are a good thing. If we didn't have them, we'd still be living in tents and eating buffalo.
Setting
• Christmas season
- Nov: customer survey
- Dec: purchasing and selling
• Buyers arrive online one at a time, with valuations on items sampled i.i.d. from some unknown distribution
Thrifty Santa Claus
• Each shopper wants only one item, though it might prefer some items to others
• Buyers: binary valuations
• Goal of the seller: satisfy everyone; minimize the total cost to the seller
Hardness: Set Cover
• If cost grows much more rapidly, then even if all customers' valuations were known up front, this would be (roughly) a set-cover problem, and we could not hope to achieve cost o(log n) times optimal.
• Natural case: for each good, the cost (to the seller) of ordering T copies is sublinear in T.
[Figure: production cost and marginal cost vs. #copies, for α = 0, α ∈ (0, 1), and α = 1.]
Thrifty Santa Claus: Results
• Marginal cost non-increasing: does an optimal strategy of a simple form exist? Order items by some permutation; give each new buyer the earliest item it desires in the permutation.
• What if n (#buyers) >> k (#items) AND the marginal cost does not fall too rapidly (rate 1/T^α for 0 ≤ α < 1)?
- We can efficiently perform allocation with cost at most a constant factor greater than OPT.
Algorithm
• Alg: use the initial buyers to learn about the distribution; determine how best to allocate to new buyers.
• If the cost fn is c(x) = Σ_{i=1}^x 1/i^α, for α ∈ [0, 1):
- Run greedy weighted set cover => total cost ≤ O(1/(1−α)) · OPT.
• Essentially a smooth variant of set cover (see the sketch below).
• If the average cost is within some factor of the marginal cost, we have a greedy alg with a constant approximation ratio.
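A set-cover-style greedy sketch of the offline allocation under the cost function above (an illustrative variant, not the thesis's exact procedure):

```python
def greedy_allocation(buyers, k, alpha):
    """Greedy sketch: the T-th copy of any item costs 1/T^alpha, each buyer
    (a set of acceptable items) needs one copy of some acceptable item, and
    we repeatedly serve the group that is cheapest per newly satisfied
    buyer. Assumes every buyer accepts at least one of the k items."""
    copies = [0] * k                          # copies ordered so far, per item
    unsatisfied = set(range(len(buyers)))
    total_cost = 0.0

    def batch_cost(item, m):                  # cost of the next m copies
        return sum(1.0 / (copies[item] + i) ** alpha for i in range(1, m + 1))

    while unsatisfied:
        best = None                           # (cost per buyer, item, takers)
        for item in range(k):
            takers = [b for b in unsatisfied if item in buyers[b]]
            if takers:
                rate = batch_cost(item, len(takers)) / len(takers)
                if best is None or rate < best[0]:
                    best = (rate, item, takers)
        _, item, takers = best
        total_cost += batch_cost(item, len(takers))
        copies[item] += len(takers)
        unsatisfied -= set(takers)
    return copies, total_cost
```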
Sample Complexity Analysis
• How complicated does the allocation rule need to be to achieve good performance?
• Theorem: …
Outline
• Factor Models for Correlated Auctions
The Problem
• An auctioneer sells a good to a group of n buyers.
• The seller wants to maximize his revenue.
• Each buyer maximizes his utility of getting the good: valuation − price.
• The seller doesn't know the exact valuations of the players.
• He knows the distribution D from which the vector of valuations (v1, …, vn) is drawn.
Our Contribution
• When D is a product distribution, Myerson gives a dominant-strategy truthful auction.
• For general correlated distributions, it is not known how to create truthful auctions, or how to use player j's bid to capture info about player i.
• What if the correlation between buyer valuations is driven by common factors?
Example
• Two firms produce the same type of good
• Each firm's "value": its production cost
• Firms need to hire workers (wage W) and rent capital (rate Z)
• l_i: #workers firm i needs to produce one unit
• k_i: amount of capital the firm needs
• ε_i: fixed costs unique to firm i
• Firm i's cost: C_i = l_i·W + k_i·Z + ε_i
• The firms' costs are correlated: they hire workers and rent capital from the same pool.
The Factor Model
• Factor model: V = λF + U (simulated below), where
- V: vector of observations
- λ: matrix of coefficients
- F: vector of factors
- U: vector of idiosyncratic components, independent of each other and of the factors
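A minimal simulation of such a model; the loadings and the standard-normal distributions are illustrative choices, not the thesis's:

```python
import numpy as np

def sample_valuations(n, r, lam=None, seed=0):
    """Draw one profile of n correlated buyer valuations from the factor
    model V = lam @ F + U: F is a vector of r common factors, lam an n x r
    loading matrix, and U idiosyncratic noise independent of F and of each
    other (all distributions here are illustrative)."""
    rng = np.random.default_rng(seed)
    if lam is None:
        lam = rng.uniform(0.0, 1.0, size=(n, r))  # illustrative loadings
    F = rng.standard_normal(r)                    # common factors
    U = rng.standard_normal(n)                    # idiosyncratic components
    return lam @ F + U

# With many bidders, the common factors can be inferred from the bid
# vector, which is what the auction's estimation step exploits.
print(sample_valuations(n=5, r=2))
```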
Discussions
• It is possible that:
- The designer and bidders might not know the common factors
- Bidders might only know their own valuations
- The seller only knows the joint distribution of bidders' valuations
• The seller can RECOVER the factor model by making inferences from the observed bids.
• Aggregate info: the common factors are inferred from the collective knowledge of all players.
The Auction
• Thm: When the correlation follows this factor model, this auction is dominant-strategy truthful, ex-post individually rational, and asymptotically optimal.
Dominant-Strategy Truthfulness
• Toss a coin and choose between:
- A 2nd-price auction: truthful
- A mechanism M that estimates the factors from a random set S of bidders; bidders in S receive utility 0 regardless of the allocation and price output by M
• Players in S are incentivized to be truthful by the small incentive they get from participating in the 2nd-price auction.
Dominant-Strategy Truthfulness (cont.)
• The remaining bidders form the set R = {1, …, n} − S; they receive incentives from both the 2nd-price auction and mechanism M.
• M offers them allocation and price vectors x(b_R), p(b_R) by running Myerson(b_R, V_R | f̂) on the players' bids and on the conditional distributions estimated for these players.
• No player in R can influence the estimated conditional distribution V_R | f̂, and Myerson's optimal auction is truthful.
Thanks!
Hofstadter's Law: It always takes longer than you expect, even when you take into account Hofstadter's Law.