
Page 1:

Efficient classification for metric data

Lee-Ad Gottlieb Weizmann Institute

Aryeh Kontorovich Ben Gurion U.

Robert Krauthgamer Weizmann Institute

Page 2:


Classification problem

Probabilistic concept learning:

S is a set of n examples (x, y) drawn from X × {−1, 1} according to some unknown probability distribution P.

The learner produces a hypothesis h: X → {−1, 1}. A good hypothesis (classifier) minimizes the generalization error

P{(x,y): h(x) ≠ y}

A popular solution uses kernels: the data are represented as vectors, and kernels take dot-products of vectors.

Page 3:


Finite metric space

(X, d) is a metric space if

X = set of points
d = distance function, which is:

Nonnegative
Symmetric
Satisfies the triangle inequality

Classification for metric data? Problem:

No vector representation → No notion of dot-product → Can’t use kernels

What can be done in this setting?

[Figure: map with pairwise distances between Haifa, Tel-Aviv, and Jerusalem: 151 km, 95 km, 62 km]
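As a concrete sketch of metric data without vectors, the following Python snippet verifies the metric axioms on the three-city example (the pairing of distances to city pairs follows the figure and is an assumption for illustration):

```python
from itertools import permutations

# A finite metric given as a symmetric distance table (km).
# The assignment of distances to city pairs is illustrative.
cities = ["Haifa", "Tel-Aviv", "Jerusalem"]
dist = {
    ("Haifa", "Tel-Aviv"): 95,
    ("Haifa", "Jerusalem"): 151,
    ("Tel-Aviv", "Jerusalem"): 62,
}

def d(x, y):
    """Symmetric lookup with d(x, x) = 0."""
    if x == y:
        return 0
    return dist.get((x, y), dist.get((y, x)))

# Check the metric axioms: nonnegativity, symmetry, triangle inequality.
for x, y in permutations(cities, 2):
    assert d(x, y) >= 0 and d(x, y) == d(y, x)
for x, y, z in permutations(cities, 3):
    assert d(x, z) <= d(x, y) + d(y, z), (x, y, z)
print("All metric axioms hold.")
```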

Page 4:


Preliminary definition

The Lipschitz constant L of a function f: X → R is the smallest value satisfying, for all points xi, xj in X,

L ≥ |f(xi) − f(xj)| / d(xi, xj)

Consider a hypothesis consistent with all of S. Its Lipschitz constant is determined by the closest pair of differently labeled points:

L ≥ 2 / d(xi, xj) for all xi in S−, xj in S+
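A minimal sketch (function and variable names are ours) of the resulting smallest achievable Lipschitz constant for a consistent hypothesis:

```python
import numpy as np

def min_lipschitz_constant(X_pos, X_neg, metric):
    """Smallest Lipschitz constant of any function taking the value +1
    on X_pos and -1 on X_neg: L = 2 / (distance between the closest
    pair of oppositely labeled points)."""
    closest = min(metric(p, q) for p in X_pos for q in X_neg)
    return 2.0 / closest

# Example with Euclidean points
X_pos = [np.array([0.0, 0.0]), np.array([0.0, 1.0])]
X_neg = [np.array([2.0, 0.0])]
euclid = lambda a, b: float(np.linalg.norm(a - b))
print(min_lipschitz_constant(X_pos, X_neg, euclid))  # 2 / 2.0 = 1.0
```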

Page 5:


Classification for metric data

A powerful framework for this problem was introduced by von Luxburg & Bousquet (vLB, JMLR '04).

The natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions

Given the values of the classifier h on S, evaluating h at new points of X reduces to finding a Lipschitz function consistent with h: the Lipschitz extension problem, a classic problem in analysis.

For example, f(x) = min_i [yi + 2 d(x, xi) / d(S+, S−)] over all (xi, yi) in S. Function evaluation then reduces to exact nearest neighbor search (assuming zero training error), a strong theoretical motivation for the NNS classification heuristic.
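As a concrete illustration, here is a minimal Python sketch of this classifier (names and the brute-force minimum are ours; a real implementation would answer the minimum via a nearest neighbor query within each label class):

```python
import numpy as np

def make_lipschitz_classifier(X_train, y_train, metric):
    """Build the Lipschitz extension f(x) = min_i [y_i + 2 d(x, x_i) / d(S+, S-)]
    and classify new points by sgn(f(x)). Assumes the sample is consistent
    (zero training error), so d(S+, S-) > 0."""
    pos = [x for x, y in zip(X_train, y_train) if y == +1]
    neg = [x for x, y in zip(X_train, y_train) if y == -1]
    # d(S+, S-): distance between the closest oppositely labeled pair
    margin = min(metric(p, q) for p in pos for q in neg)

    def f(x):
        # Brute force for clarity; in practice, a nearest neighbor search.
        return min(y + 2.0 * metric(x, xi) / margin
                   for xi, y in zip(X_train, y_train))

    return f

# Usage: classify by the sign of f
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
y = [+1, +1, -1]
f = make_lipschitz_classifier(X, y, euclid)
print(np.sign(f((0.5, 0.5))), np.sign(f((2.8, 0.2))))  # 1.0 -1.0
```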

Page 6:


Two new directions

The framework of vLB leaves open two further questions:

Efficient evaluation of the classifier h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?

Bias-variance tradeoff: which sample points in S should h ignore?

[Figure: a query point q at distance ~1 from both a −1-labeled point and a +1-labeled point]

Page 7:


Doubling dimension

Definition: the ball B(x, r) = all points within distance r from x.

The doubling constant λ (of a metric M) is the minimum value such that every ball can be covered by λ balls of half the radius. First used by [Ass-83], algorithmically by [Cla-97]. The doubling dimension is dim(M) = log λ(M) [GKL-03]. A metric is doubling if its doubling dimension is constant.

Packing property of doubling spaces: a set with diameter D and minimum inter-point distance a contains at most (D/a)^O(log λ) points.

[Figure: a ball covered by balls of half the radius; here λ ≤ 7]
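A rough sketch, of our own construction, for empirically estimating the doubling constant of a finite sample: cover each ball B(x, r) greedily with half-radius balls centered at a maximal (r/2)-separated subset of the ball:

```python
import numpy as np

def doubling_constant_estimate(points, metric):
    """Estimate the doubling constant of a finite metric: for every ball
    B(x, r) induced by a pair of points, greedily pick a maximal
    (r/2)-separated subset of the ball; these points are centers of
    half-radius balls covering B(x, r)."""
    lam = 1
    for x in points:
        for y in points:
            r = metric(x, y)
            if r == 0:
                continue
            ball = [p for p in points if metric(x, p) <= r]
            centers = []
            for p in ball:  # greedy (r/2)-net of the ball
                if all(metric(p, c) > r / 2 for c in centers):
                    centers.append(p)
            lam = max(lam, len(centers))
    return lam

pts = [np.array(v, dtype=float) for v in [(0, 0), (1, 0), (0, 1), (1, 1)]]
euclid = lambda a, b: float(np.linalg.norm(a - b))
print(doubling_constant_estimate(pts, euclid))  # 4 for the unit square corners
```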

Page 8:


Application I

We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension. vLB provided similar bounds using covering numbers and Rademacher averages.

Fat-shattering analysis:

A Lipschitz function shatters a set → inter-point distance is at least 2/L
Packing property → the set has at most (DL)^O(log λ) points
So the fat-shattering dimension is low.
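The chain above can be written out as a short derivation (a sketch, taking margin 1 and shattering offsets r_i = 0 for simplicity):

```latex
% If an L-Lipschitz f realizes opposite labels on x_i, x_j
% (at margin 1, offsets r_i = 0), then
%   f(x_i) \ge 1,\; f(x_j) \le -1
%   \;\Rightarrow\; 2 \le |f(x_i) - f(x_j)| \le L\, d(x_i, x_j)
%   \;\Rightarrow\; d(x_i, x_j) \ge 2/L.
% The packing property (diameter D) then bounds any shattered set:
\[
  |\,\text{shattered set}\,|
  \;\le\; \Bigl(\frac{D}{2/L}\Bigr)^{O(\log\lambda)}
  \;=\; (DL)^{O(\log\lambda)} .
\]
```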

Page 9:


Application I

Theorem: For any f that classifies a sample of size n correctly, we have with probability at least 1 − δ

P {(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d ln(34en/d) log₂(578n) + ln(4/δ)).

Likewise, if f is correct on all but k examples, we have with probability at least 1 − δ

P {(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^(1/2).

In both cases, d ≤ 8⌈LD⌉^(log λ + 1).

Page 10:


Application II

Evaluation of h for new points in X.

Lipschitz extension function: f(x) = min_i [yi + 2 d(x, xi) / d(S+, S−)]

Requires exact nearest neighbor search, which can be expensive!

New tool: (1+ε)-approximate nearest neighbor search, in λ^O(1) · log n + ε^−O(log λ) time [KL-04, HM-05, BKL-06, CG-06].

If we evaluate f(x) using approximate NNS, we can show that the result agrees with (the sign of) at least one of

g(x) = (1+ε) f(x) + ε
h(x) = (1+ε) f(x) − ε

Note that g(x) ≥ f(x) ≥ h(x).

g(x) and h(x) have Lipschitz constant (1+ε)L, so they, and hence the approximately evaluated function, generalize well.
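A small numerical sketch, of our own construction, that simulates a (1+ε)-approximate NNS by inflating each distance by a factor of at most 1+ε and checks that the approximate value is sandwiched between f(x) and g(x):

```python
import random
import numpy as np

def f_exact(x, S, margin, metric):
    return min(y + 2.0 * metric(x, xi) / margin for xi, y in S)

def f_approx(x, S, margin, metric, eps, rng):
    # Simulate a (1+eps)-approximate NNS: each returned distance
    # overestimates the true one by a factor in [1, 1+eps].
    return min(y + 2.0 * (1 + rng.uniform(0, eps)) * metric(x, xi) / margin
               for xi, y in S)

rng = random.Random(0)
euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
S = [((0.0, 0.0), +1), ((3.0, 0.0), -1)]
margin, eps = 3.0, 0.1
for _ in range(1000):
    x = (rng.uniform(-1, 4), rng.uniform(-1, 1))
    f = f_exact(x, S, margin, euclid)
    ft = f_approx(x, S, margin, euclid, eps, rng)
    g = (1 + eps) * f + eps
    assert f <= ft <= g + 1e-12  # approximate value lies in [f(x), g(x)]
```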

Page 11:


Bias-variance tradeoff

Which sample points in S should h ignore?

If f is correct on all but k examples, we have with probability at least 1 − δ

P {(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log₂(578n) + ln(4/δ))]^(1/2), where d ≤ 8⌈LD⌉^(log λ + 1).

[Figure: a sample of −1 and +1 points, some of which may be ignored as outliers]

Page 12:


Bias-variance tradeoff

Algorithm:

Fix a target Lipschitz constant L (out of O(n²) possibilities).
Locate all pairs of points from S+ and S− whose distance is less than 2/L; at least one point of each such pair must be counted as an error.
Goal: remove as few points as possible.

Page 13:



Removing the fewest points is a minimum vertex cover problem (points are vertices, violating pairs are edges): NP-complete in general, but it admits a 2-approximation in O(|E|) time.
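A sketch of the classical maximal-matching 2-approximation (our code; the edges are the oppositely labeled pairs at distance below 2/L):

```python
def vertex_cover_2approx(edges):
    """Greedy 2-approximation for minimum vertex cover: take both
    endpoints of a maximal matching. Runs in O(|E|) time."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.add(u)   # (u, v) joins the matching;
            cover.add(v)   # both endpoints enter the cover
    return cover

# Edges between close opposite-label points (hypothetical indices)
edges = [("p0", "n0"), ("p0", "n1"), ("p1", "n1")]
print(vertex_cover_2approx(edges))  # {'p0', 'n0', 'p1', 'n1'}
```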

Page 14:



Moreover, the graph here is bipartite (S+ on one side, S− on the other), so minimum vertex cover is equivalent to maximum matching (Kőnig's theorem) and admits an exact solution in O(n^2.376) randomized time.
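Since in a bipartite graph the minimum cover size equals the maximum matching size, a simple augmenting-path sketch (the classical O(V·E) algorithm, not the O(n^2.376) algebraic one cited above) computes the minimum number of errors:

```python
def max_bipartite_matching(left, adj):
    """Maximum matching via augmenting paths (Kuhn's algorithm).
    By Kőnig's theorem its size equals the minimum vertex cover,
    i.e. the minimum number of sample points to discard.
    adj[u] = neighbors of left-vertex u on the right side."""
    match_right = {}  # right vertex -> matched left vertex

    def try_augment(u, visited):
        for v in adj.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be rematched elsewhere
            if v not in match_right or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in left)

# Violating pairs between S+ = {p0, p1} and S- = {n0, n1} (hypothetical)
adj = {"p0": ["n0", "n1"], "p1": ["n1"]}
print(max_bipartite_matching(["p0", "p1"], adj))  # 2 = minimum error count
```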

Page 15:


Bias-variance tradeoff

Naive algorithm:

For each of the O(n²) values of L:
Run the matching algorithm to find the minimum error.
Evaluate the generalization bound for this value of L.

Total: O(n^4.376) randomized time.

Better algorithm: binary search over the O(n²) values of L. For each value, either

Run the matching algorithm: find the minimum error, giving O(n^2.376 log n) randomized time in total, and evaluate the generalization bound for this value of L. Or
Run the greedy 2-approximation: approximate the minimum error, giving O(n² log n) time in total, and evaluate the approximate generalization bound for this value of L.
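Putting the pieces together, a simplified sketch of the model selection loop (our own code: it scans all candidate values of L with the greedy cover rather than binary searching, and penalty(L) is a schematic placeholder for the bound's complexity term):

```python
import numpy as np

def choose_lipschitz_constant(X, y, metric, penalty):
    """For each candidate L = 2/d over oppositely labeled pairs,
    approximate the minimum number k of points to discard (greedy
    vertex cover) and score k/n + penalty(L); return the best L."""
    n = len(X)
    pos = [i for i in range(n) if y[i] == +1]
    neg = [i for i in range(n) if y[i] == -1]
    pair_d = sorted(metric(X[i], X[j]) for i in pos for j in neg)
    best = (float("inf"), None)
    for d0 in pair_d:            # candidate Lipschitz constant L = 2/d0
        L = 2.0 / d0
        # violating pairs: opposite labels at distance < 2/L = d0
        edges = [(i, j) for i in pos for j in neg if metric(X[i], X[j]) < d0]
        cover = set()
        for u, v in edges:       # greedy 2-approximate vertex cover
            if u not in cover and v not in cover:
                cover.update((u, v))
        score = len(cover) / n + penalty(L)
        best = min(best, (score, L))
    return best[1]

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
X = [(0, 0), (0.1, 0), (1, 0), (2, 0)]
y = [+1, -1, -1, -1]
print(choose_lipschitz_constant(X, y, euclid, penalty=lambda L: 0.1 * L))
```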

Page 16:


Conclusion

Results:

Generalization bounds for Lipschitz classifiers in doubling spaces
Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
Efficient computation of the bias-variance tradeoff

Continuing research: similar results for continuous labels.