Efficient classification for metric data
Lee-Ad Gottlieb Weizmann Institute
Aryeh Kontorovich Ben Gurion U.
Robert Krauthgamer Weizmann Institute

Classification problem
Probabilistic concept learning
  S is a set of n examples (x,y) drawn from X × {-1,1} according to some unknown probability distribution P.
  The learner produces a hypothesis h: X → {-1,1}.
  A good hypothesis (classifier) minimizes the generalization error P{(x,y): h(x) ≠ y}.
A popular solution uses kernels
  Data are represented as vectors; kernels take the dot-product of vectors.

Finite metric space
(X,d) is a metric space if
  X = set of points
  d = distance function that is
    nonnegative
    symmetric
    subject to the triangle inequality
Classification for metric data?
Problem: no vector representation → no notion of dot-product → can't use kernels
What can be done in this setting?
[Figure: three cities (Haifa, Tel-Aviv, Jerusalem) with pairwise distances 151 km, 95 km and 62 km]

Preliminary definition
The Lipschitz constant L of a function f: X → R is the smallest value that satisfies, for all points $x_i, x_j$ in X,
  $L \ge |f(x_i) - f(x_j)| / d(x_i, x_j)$
Consider a hypothesis consistent with all of S.
Its Lipschitz constant is determined by the closest pair of differently labeled points:
  $L \ge 2 / d(x_i, x_j)$ for all $x_i$ in $S^-$, $x_j$ in $S^+$
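
The lower bound on L can be computed directly from a labeled sample. The following is a minimal sketch, assuming the sample is a list of (point, label) pairs and the metric is passed in as a function; the name `min_lipschitz_constant` and the toy data are illustrative and do not come from the slides.

```python
from itertools import product


def min_lipschitz_constant(sample, dist):
    """Return 2 / d(S+, S-): the lower bound on the Lipschitz constant of
    any function that takes the value y_i at every sample point x_i."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    if not pos or not neg:
        return 0.0  # a constant function fits a one-class sample
    closest = min(dist(p, q) for p, q in product(pos, neg))
    if closest == 0:
        raise ValueError("identical points carry opposite labels")
    return 2.0 / closest


# Toy sample on the real line (hypothetical data).
sample = [(0.0, -1), (1.0, -1), (3.0, +1), (7.0, +1)]
L = min_lipschitz_constant(sample, lambda a, b: abs(a - b))
```

In this toy sample the closest oppositely labeled pair is (1.0, 3.0) at distance 2, so the bound evaluates to L = 1.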

Classification for metric data
A powerful framework for this problem was introduced by von Luxburg & Bousquet (vLB, JMLR '04)
  The natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
  Given the classifier h, the problem of evaluating h at new points in X reduces to the problem of finding a Lipschitz function consistent with h.
    This is the Lipschitz extension problem, a classic problem in analysis.
  For example, $f(x) = \min_i [y_i + 2d(x, x_i)/d(S^+, S^-)]$, with the minimum taken over all $(x_i, y_i)$ in S.
    Function evaluation reduces to exact nearest neighbor search (NNS), assuming zero training error.
    This gives strong theoretical motivation for the NNS classification heuristic.
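
The example extension f can be written out in a few lines. This is only a sketch under stated assumptions: the sample is a list of (point, label) pairs with both classes present, the metric is an arbitrary function `dist`, and the names `lipschitz_extension` and `margin` are illustrative. It scans all n sample points, which is the brute-force form of the exact search the slide refers to.

```python
def lipschitz_extension(sample, dist):
    """Build f(x) = min_i [y_i + 2 d(x, x_i) / d(S+, S-)] from a labeled sample."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    margin = min(dist(p, q) for p in pos for q in neg)  # d(S+, S-)

    def f(x):
        # Linear scan over the sample: one distance evaluation per point.
        return min(y + 2.0 * dist(x, xi) / margin for xi, y in sample)

    return f


# Toy sample on the real line (hypothetical data).
sample = [(0.0, -1), (1.0, -1), (3.0, +1), (7.0, +1)]
f = lipschitz_extension(sample, lambda a, b: abs(a - b))

values_on_sample = [f(x) for x, _ in sample]  # f reproduces the labels
between = (f(1.5), f(2.5))                    # values on either side of the gap
```

Within each class the minimum is attained at the point nearest to x, so f(x) depends only on the distances from x to its nearest positive and nearest negative sample points. That is why evaluating f amounts to nearest neighbor search.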

Two new directions
The framework of vLB leaves open two further questions:
  Efficient evaluation of the classifier h on X
    In an arbitrary metric space, exact NNS requires Θ(n) time.
    Can we do better?
  Bias-variance tradeoff
    Which sample points in S should h ignore?
[Figure: query point q, with distances marked ~1 and ~1 to sample points labeled -1 and +1]

Doubling dimension
Definition: the ball B(x,r) is the set of all points within distance r from x.
The doubling constant $\lambda$ of a metric M is the minimum value such that every ball can be covered by $\lambda$ balls of half the radius.
  First used by [Ass-83], algorithmically by [Cla-97].
  The doubling dimension is $\dim(M) = \log \lambda(M)$ [GKL-03].
  A metric is doubling if its doubling dimension is constant.
Packing property of doubling spaces
  A set with diameter D and minimum inter-point distance a contains at most $(D/a)^{O(\log \lambda)}$ points.
[Figure caption: here $\lambda \le 7$.]
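
The packing property follows from applying the doubling property repeatedly. A sketch of the argument is given below; the explicit constant 4, the base-2 logarithm and the condition D ≥ 2a are artifacts of this particular derivation and are not stated on the slide.

```latex
% S: a set with diameter D and minimum inter-point distance a.
% S lies inside a ball of radius D centred at any one of its points.
% Applying the doubling property k times covers that ball by
% \lambda^k balls of radius D / 2^k. Choose k so that this radius is below a/2:
\[
  k = \left\lfloor \log_2 \frac{2D}{a} \right\rfloor + 1
  \quad\Longrightarrow\quad
  \frac{D}{2^{k}} < \frac{a}{2} .
\]
% Two points in a common ball of radius r are at distance at most 2r < a,
% so each small ball holds at most one point of S:
\[
  |S| \;\le\; \lambda^{k}
      \;\le\; \lambda^{\log_2 (4D/a)}
      \;=\; \left(\frac{4D}{a}\right)^{\log_2 \lambda}
      \;\le\; \left(\frac{D}{a}\right)^{3 \log_2 \lambda}
      \qquad (D \ge 2a).
\]
```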

Application I
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
  vLB provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis:
  A Lipschitz function shatters a set → inter-point distance is at least 2/L
  Packing property → the set has $(DL)^{O(\log \lambda)}$ points
  So the fat-shattering dimension is low.

Application I
Theorem:
  For any f that classifies a sample of size n correctly, we have with probability at least $1-\delta$
    $P\{(x,y) : \mathrm{sgn}(f(x)) \ne y\} \le \frac{2}{n}\left(d \log(34en/d) \log(578n) + \log(4/\delta)\right)$.
  Likewise, if f is correct on all but k examples, we have with probability at least $1-\delta$
    $P\{(x,y) : \mathrm{sgn}(f(x)) \ne y\} \le \frac{k}{n} + \left[\frac{2}{n}\left(d \ln(34en/d) \log_2(578n) + \ln(4/\delta)\right)\right]^{1/2}$.
  In both cases, $d \le 8\lceil LD \rceil^{\log \lambda + 1}$.
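
To get a feel for the two bounds, they can be evaluated numerically. The sketch below takes d as an input rather than computing it from L, D and the doubling constant. It also assumes that the unsubscripted "log" in the first bound is base 2; the slide does not say which base is meant. The function names and sample values are illustrative.

```python
import math


def bound_consistent(n, d, delta):
    """Bound for a classifier with zero sample error.
    Assumption: the unsubscripted logs are base 2."""
    return (2.0 / n) * (
        d * math.log2(34 * math.e * n / d) * math.log2(578 * n)
        + math.log2(4 / delta)
    )


def bound_with_errors(n, d, k, delta):
    """Bound for a classifier that errs on k of the n sample points."""
    inner = (2.0 / n) * (
        d * math.log(34 * math.e * n / d) * math.log2(578 * n)
        + math.log(4 / delta)
    )
    return k / n + math.sqrt(inner)


# Hypothetical values: one million examples, d = 10, confidence 95%.
b0 = bound_consistent(10**6, 10, 0.05)
b1 = bound_with_errors(10**6, 10, 1000, 0.05)
```

Both bounds shrink as n grows and loosen as d grows. This is the tension the bias-variance discussion exploits: tolerating k sample errors costs k/n but may allow a smaller L, and hence a smaller d.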

Application II
Evaluation of h for new points in X
  Lipschitz extension function $f(x) = \min_i [y_i + 2d(x, x_i)/d(S^+, S^-)]$
  Requires exact nearest neighbor search, which can be expensive!
New tool: $(1+\epsilon)$-approximate nearest neighbor search
  $\lambda^{O(1)} \log n + \lambda^{O(-\log \epsilon)}$ time [KL-04, HM-05, BKL-06, CG-06]
  If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
    $g(x) = (1+\epsilon) f(x) + \epsilon$
    $h(x) = (1+\epsilon) f(x) - \epsilon$
  Note that $g(x) \ge f(x) \ge h(x)$.
  g(x) and h(x) have Lipschitz constant $(1+\epsilon)L$, so they and the approximate function generalize well.
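
The sandwich relation between g, f and h can be checked in two lines. The derivation below takes g(x) = (1+ε) f(x) + ε and h(x) = (1+ε) f(x) − ε, and assumes that f takes values in [−1, 1] (for instance after truncation); that range assumption is not stated on the slide.

```latex
% Assumption: -1 \le f(x) \le 1 and \epsilon > 0.
\begin{align*}
  g(x) - f(x) &= (1+\epsilon) f(x) + \epsilon - f(x) = \epsilon\,\bigl(1 + f(x)\bigr) \ge 0, \\
  f(x) - h(x) &= f(x) - (1+\epsilon) f(x) + \epsilon = \epsilon\,\bigl(1 - f(x)\bigr) \ge 0.
\end{align*}
% Scaling f by (1+\epsilon) scales its Lipschitz constant L by the same factor,
% and adding the constant \pm\epsilon leaves it unchanged.
```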

Bias-variance tradeoff
Which sample points in S should h ignore?
  If f is correct on all but k examples, we have with probability at least $1-\delta$
    $P\{(x,y) : \mathrm{sgn}(f(x)) \ne y\} \le \frac{k}{n} + \left[\frac{2}{n}\left(d \ln(34en/d) \log_2(578n) + \ln(4/\delta)\right)\right]^{1/2}$,
  where $d \le 8\lceil LD \rceil^{\log \lambda + 1}$.

Bias-variance tradeoff
Algorithm
  Fix a target Lipschitz constant L
    Out of $O(n^2)$ possibilities
  Locate all pairs of points from $S^+$ and $S^-$ whose distance is less than 2/L
    At least one point of each such pair has to be taken as an error
  Goal: remove as few points as possible
Minimum vertex cover
  NP-complete
  Admits a 2-approximation in O(E) time
Minimum vertex cover on a bipartite graph
  Equivalent to maximum matching (Kőnig's theorem)
  Admits an exact solution in $O(n^{2.376})$ randomized time
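
The reduction can be sketched as follows: build the bipartite conflict graph for a given L, then either take the endpoints of a greedily built maximal matching (a 2-approximate cover) or compute the size of a maximum matching, which by Kőnig's theorem equals the minimum number of points to discard. The exact routine here uses plain augmenting paths, which is simpler but slower than the $O(n^{2.376})$ randomized method cited on the slide. All function names and the toy data are illustrative assumptions.

```python
def conflict_edges(sample, dist, L):
    """Index pairs (i, j) with y_i = +1, y_j = -1 and d(x_i, x_j) < 2/L."""
    pos = [i for i, (_, y) in enumerate(sample) if y == +1]
    neg = [j for j, (_, y) in enumerate(sample) if y == -1]
    return [(i, j) for i in pos for j in neg
            if dist(sample[i][0], sample[j][0]) < 2.0 / L]


def greedy_cover(edges):
    """Endpoints of a greedy maximal matching: a 2-approximate vertex cover."""
    cover = set()
    for i, j in edges:
        if i not in cover and j not in cover:
            cover.update((i, j))
    return cover


def max_matching_size(edges):
    """Maximum bipartite matching size via augmenting paths.
    By Konig's theorem this equals the minimum vertex cover size."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, []).append(j)
    match_of = {}  # negative-point index -> matched positive-point index

    def augment(i, seen):
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in match_of or augment(match_of[j], seen):
                match_of[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in adj)


# Toy sample on the real line (hypothetical data), target L = 1.
sample = [(0.0, -1), (1.0, -1), (1.2, +1), (3.2, +1), (3.1, -1), (7.0, +1)]
edges = conflict_edges(sample, lambda a, b: abs(a - b), L=1.0)
approx = greedy_cover(edges)
exact_size = max_matching_size(edges)
```

In this toy sample the conflict graph has four edges. Two points suffice to break all conflicts (for example indices 2 and 4), while the greedy cover picks four, which is exactly the worst case allowed by the factor-2 guarantee.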

Bias-variance tradeoff
Algorithm:
  For each of the $O(n^2)$ values of L
    Run the matching algorithm to find the minimum error
    Evaluate the generalization bound for this value of L
  $O(n^{4.376})$ randomized time
Better algorithm
  Binary search over the $O(n^2)$ values of L
  For each value, either
    Run the matching algorithm
      Find the minimum error in $O(n^{2.376} \log n)$ randomized time
      Evaluate the generalization bound for this value of L
    or run the greedy 2-approximation
      Approximate the minimum error in $O(n^2 \log n)$ time
      Evaluate the approximate generalization bound for this value of L

Conclusion
Results:
  Generalization bounds for Lipschitz classifiers in doubling spaces
  Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
  Efficient calculation of the bias-variance tradeoff
Continuing research
  Similar results for continuous labels