Efficient classification for metric data

Lee-Ad Gottlieb Weizmann Institute

Aryeh Kontorovich Ben Gurion U.

Robert Krauthgamer Weizmann Institute


Classification problem

Probabilistic concept learning
S is a set of n examples (x,y) drawn from X × {-1,1} according to some unknown probability distribution P.
The learner produces a hypothesis h: X → {-1,1}.
A good hypothesis (classifier) minimizes the generalization error P{(x,y): h(x) ≠ y}.

A popular solution uses kernels
Data are represented as vectors; kernels take the dot-product of vectors.


Finite metric space

(X,d) is a metric space if
X = set of points
d = distance function that is
  Nonnegative
  Symmetric
  Subject to the triangle inequality
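The three listed properties can be checked mechanically on a finite point set. The following is a minimal sketch, not part of the talk: the function name `is_metric`, the tolerance `tol`, and the point labels "a", "b", "c" are assumptions made for illustration, and only the three distance values come from the slides.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the three listed axioms for a distance function d on a finite set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                        # nonnegativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

# Hypothetical three-point example; labels "a", "b", "c" are placeholders.
_table = {frozenset("ab"): 151.0, frozenset("ac"): 95.0, frozenset("bc"): 62.0}

def d(x, y):
    """Distance lookup; a point is at distance 0 from itself."""
    return 0.0 if x == y else _table[frozenset((x, y))]

print(is_metric(["a", "b", "c"], d))
```

The check is brute force over all triples, so it is only meant for small examples.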

Classification for metric data?

Problem: No vector representation → No notion of dot-product → Can't use kernels

What can be done in this setting?

[Figure: example metric on three cities (Haifa, Jerusalem, Tel-Aviv) with pairwise distances 151km, 95km and 62km]


Preliminary definition

The Lipschitz constant L of a function f: X → R is the smallest value that satisfies, for all points x_i, x_j in X,
L ≥ |f(x_i) - f(x_j)| / d(x_i,x_j)

Consider a hypothesis consistent with all of S.
Its Lipschitz constant is determined by the closest pair of differently labeled points:
L ≥ 2 / d(x_i,x_j) for all x_i in S−, x_j in S+
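On a finite set the definition can be evaluated directly as a maximum over pairs. The sketch below is an illustration only: the function name and the toy sample on the real line are assumptions, not taken from the talk. It shows that for a labeling with values in {-1,+1} the constant equals 2 divided by the smallest distance between differently labeled points.

```python
from itertools import combinations

def lipschitz_constant(points, f, d):
    """Smallest L with |f(x) - f(y)| <= L * d(x, y) over all pairs of distinct points."""
    return max(abs(f(x) - f(y)) / d(x, y) for x, y in combinations(points, 2))

# Toy sample on the real line (assumed for illustration): two -1 points, one +1 point.
labels = {0.0: -1, 1.0: -1, 3.0: +1}
L = lipschitz_constant(list(labels), labels.get, lambda x, y: abs(x - y))
print(L)  # closest differently labeled pair is (1.0, 3.0), at distance 2
```

The function assumes the points are distinct, so that every pairwise distance is positive.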


Classification for metric data

A powerful framework for this problem was introduced by von Luxburg & Bousquet (vLB, JMLR '04).

The natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.

Given the classifier h, the problem of evaluating h at new points in X reduces to the problem of finding a Lipschitz function consistent with h.
This is the Lipschitz extension problem, a classic problem in Analysis.

For example, f(x) = min_i [y_i + 2d(x, x_i)/d(S+,S−)] over all (x_i,y_i) in S.
Function evaluation reduces to exact Nearest Neighbor Search (NNS), assuming zero training error.
This gives strong theoretical motivation for the NNS classification heuristic.
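The example extension can be written out in a few lines. This is a brute-force sketch under the assumption that the sample is a list of (x_i, y_i) pairs with labels in {-1,+1} and at least one point of each label; the function name and the toy data are illustrative and not from the talk.

```python
def lipschitz_extension(sample, d):
    """Return f(x) = min_i [y_i + 2 d(x, x_i) / d(S+, S-)] for a labeled sample."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    margin = min(d(p, q) for p in pos for q in neg)   # d(S+, S-)

    def f(x):
        # Linear scan over the sample; this is the Θ(n) evaluation cost.
        return min(y + 2.0 * d(x, xi) / margin for xi, y in sample)

    return f

# Toy sample on the real line (assumed for illustration).
sample = [(0.0, -1), (1.0, -1), (3.0, +1)]
f = lipschitz_extension(sample, lambda x, y: abs(x - y))
print(f(1.5), f(2.5))
```

In this toy example f reproduces the labels on the sample points, and the sign of f at 1.5 and 2.5 matches the label of the nearest sample point, which is the connection to nearest neighbor search that the slide mentions.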


Two new directions

The framework of vLB leaves open two further questions:

Efficient evaluation of the classifier h on X
In an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?

Bias-variance tradeoff
Which sample points in S should h ignore?

[Figure: query point q, distances ~1, points labeled -1 and +1]


Doubling dimension

Definition: Ball B(x,r) = all points within distance r from x.

The doubling constant λ (of a metric M) is the minimum value such that every ball can be covered by λ balls of half the radius.
First used by [Ass-83], algorithmically by [Cla-97].
The doubling dimension is dim(M) = log λ(M) [GKL-03].
A metric is doubling if its doubling dimension is constant.

Packing property of doubling spaces
A set with diameter D and minimum inter-point distance a contains at most (D/a)^O(log λ) points.

Here λ ≤ 7.
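For a small finite metric, the covering definition can be explored by brute force. The sketch below is an assumption-laden illustration rather than an algorithm from the talk: it uses closed balls, tries only radii equal to pairwise distances, and covers each ball greedily with half-radius balls centered at points of the ball. The greedy cover is valid but not necessarily the smallest one, so the result is an upper-bound estimate for the balls it examines.

```python
def greedy_doubling_estimate(points, d):
    """Greedy upper-bound estimate of the doubling constant of a finite metric."""
    radii = sorted({d(x, y) for x in points for y in points if d(x, y) > 0})
    worst = 1
    for x in points:
        for r in radii:
            ball = [p for p in points if d(x, p) <= r]      # closed ball B(x, r)
            uncovered, used = list(ball), 0
            while uncovered:
                c = uncovered[0]                            # centre of next half-radius ball
                uncovered = [p for p in uncovered if d(c, p) > r / 2]
                used += 1
            worst = max(worst, used)
    return worst

# Two points at distance 1: the ball of radius 1 needs two balls of radius 1/2.
print(greedy_doubling_estimate([0.0, 1.0], lambda x, y: abs(x - y)))
```

The running time is polynomial but high (roughly n^4 distance evaluations in the worst case), so this is only useful for building intuition on toy inputs.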


Application I

We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
vLB provided similar bounds using covering numbers and Rademacher averages.

Fat-shattering analysis:
If a Lipschitz function shatters a set, the inter-point distance is at least 2/L.
By the packing property, the set has at most (DL)^O(log λ) points.
So the fat-shattering dimension is low.


Application I

Theorem:
For any f that classifies a sample of size n correctly, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log(578n) + log(4/δ)).

Likewise, if f is correct on all but k examples, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log_2(578n) + ln(4/δ))]^(1/2).

In both cases, d ≤ 8⌈LD⌉^(log λ + 1).
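To get a feel for how the two bounds scale with n, they can be evaluated numerically. The sketch below makes two assumptions that the slide does not settle: the unbased "log" in the first bound is read as base 2, and the confidence parameter is called `delta`. The dimension term d is passed in as a plain number rather than computed from L, D and the doubling constant. Function names are illustrative.

```python
import math

def separable_bound(n, d, delta):
    """(2/n) (d log(34en/d) log(578n) + log(4/delta)), with log read as base 2 (assumption)."""
    return (2.0 / n) * (d * math.log2(34 * math.e * n / d) * math.log2(578 * n)
                        + math.log2(4 / delta))

def agnostic_bound(n, d, k, delta):
    """k/n + sqrt((2/n) (d ln(34en/d) log_2(578n) + ln(4/delta)))."""
    inner = (2.0 / n) * (d * math.log(34 * math.e * n / d) * math.log2(578 * n)
                         + math.log(4 / delta))
    return k / n + math.sqrt(inner)

# The bounds shrink as the sample grows, for a fixed dimension term d.
for n in (10**4, 10**5, 10**6):
    print(n, separable_bound(n, d=10, delta=0.05), agnostic_bound(n, d=10, k=0, delta=0.05))
```

For small n relative to d the values exceed 1, in which case the bound carries no information.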


Application II

Evaluation of h for new points in X

Lipschitz extension function: f(x) = min_i [y_i + 2d(x, x_i)/d(S+,S−)]
Requires exact nearest neighbor search, which can be expensive!

New tool: (1+ε)-approximate nearest neighbor search
λ^O(1) log n + λ^O(-log ε) time [KL-04, HM-05, BKL-06, CG-06]

If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
g(x) = (1+ε) f(x) + ε
h(x) = (1+ε) f(x) - ε
Note that g(x) ≥ f(x) ≥ h(x).

g(x) and h(x) have Lipschitz constant (1+ε)L, so both they and the approximate function generalize well.
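The ordering of g, f and h can be verified in one line each. The following worked step assumes the two functions read g(x) = (1+ε)f(x) + ε and h(x) = (1+ε)f(x) − ε, with ε > 0 the approximation parameter; the restriction to −1 ≤ f(x) ≤ 1 is an added observation, not a statement from the slides.

```latex
g(x) - f(x) = \varepsilon\,\bigl(f(x) + 1\bigr) \ge 0 \quad \text{whenever } f(x) \ge -1,
\qquad
f(x) - h(x) = \varepsilon\,\bigl(1 - f(x)\bigr) \ge 0 \quad \text{whenever } f(x) \le 1 .
```

Both g and h are obtained from f by scaling by (1+ε) and adding a constant, which is why their Lipschitz constant is (1+ε)L.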


Bias-variance tradeoff

Which sample points in S should h ignore?

If f is correct on all but k examples, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log_2(578n) + ln(4/δ))]^(1/2),
where d ≤ 8⌈LD⌉^(log λ + 1).

[Figure: sample points labeled -1 and +1]


Bias-variance tradeoff

Algorithm
Fix a target Lipschitz constant L, out of O(n^2) possibilities.
Locate all pairs of points from S+ and S− whose distance is less than 2/L.
At least one point of each such pair has to be taken as an error.
Goal: Remove as few points as possible.
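The pairs that rule out a target constant form a bipartite conflict graph. Since a consistent hypothesis needs L ≥ 2/d(x_i,x_j) for every differently labeled pair, a pair conflicts with a target L exactly when its distance is below 2/L. The sketch below builds these edges by brute force; the function name and toy sample are assumptions for illustration.

```python
def conflict_edges(sample, d, L):
    """Index pairs (i, j) with y_i = +1, y_j = -1 and d(x_i, x_j) < 2 / L."""
    pos = [i for i, (_, y) in enumerate(sample) if y == +1]
    neg = [j for j, (_, y) in enumerate(sample) if y == -1]
    return [(i, j) for i in pos for j in neg
            if d(sample[i][0], sample[j][0]) < 2.0 / L]

# Toy sample on the real line (assumed): the +1 point at 1.2 sits among the -1 points.
sample = [(0.0, -1), (1.0, -1), (1.2, +1), (3.0, +1)]
edges = conflict_edges(sample, lambda x, y: abs(x - y), L=1.0)
print(edges)
```

In this example every conflict involves the point at index 2, so treating that single point as an error is enough to reach the target constant L = 1.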






Minimum vertex cover
NP-Complete
Admits a 2-approximation in O(E) time

Minimum vertex cover on a bipartite graph
Equivalent to maximum matching (König's theorem)
Admits an exact solution in O(n^2.376) randomized time
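The equivalence with maximum matching can be made concrete. The sketch below uses a simple augmenting-path matching followed by the standard König construction; it is a swapped-in textbook method that runs in O(VE) time, not the fast randomized matching algorithm behind the O(n^2.376) figure. All names are illustrative.

```python
def min_vertex_cover_bipartite(left, right, edges):
    """Exact minimum vertex cover of a bipartite graph via maximum matching."""
    adj = {u: [] for u in left}
    for u, v in edges:
        adj[u].append(v)
    match_r = {}                          # right vertex -> matched left vertex

    def augment(u, seen):
        # Try to match u, re-routing earlier matches along an augmenting path.
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_r or augment(match_r[v], seen):
                match_r[v] = u
                return True
        return False

    for u in left:
        augment(u, set())
    matched_l = set(match_r.values())

    # Koenig: explore alternating paths starting from unmatched left vertices.
    visited_l, visited_r = set(), set()
    stack = [u for u in left if u not in matched_l]
    while stack:
        u = stack.pop()
        if u in visited_l:
            continue
        visited_l.add(u)
        for v in adj[u]:
            if v not in visited_r:
                visited_r.add(v)
                if v in match_r:          # continue along the matching edge
                    stack.append(match_r[v])
    # Cover = unvisited left vertices plus visited right vertices.
    return [u for u in left if u not in visited_l], [v for v in right if v in visited_r]

# Star graph: one right vertex conflicts with three left vertices.
print(min_vertex_cover_bipartite(["a", "b", "c"], ["x"], [("a", "x"), ("b", "x"), ("c", "x")]))
```

The recursion in `augment` can get deep on large graphs, so an iterative matching routine would be preferable beyond toy sizes.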


Bias-variance tradeoff

Algorithm:
For each of O(n^2) values of L
  Run matching algorithm to find minimum error
  Evaluate generalization bound for this value of L
O(n^4.376) randomized time

Better algorithm
Binary search over O(n^2) values of L
For each value
  Run matching algorithm
    Find minimum error in O(n^2.376 log n) randomized time
    Evaluate generalization bound for this value of L
  Run greedy 2-approximation
    Approximate minimum error in O(n^2 log n) time
    Evaluate approximate generalization bound for this value of L
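The selection loop can be sketched end to end with the greedy 2-approximation. The code below makes several simplifying assumptions: it scans all candidate values of L linearly instead of using binary search, it takes the candidates to be 2/d(x_i,x_j) over differently labeled pairs, and it relies on a hypothetical caller-supplied function `capacity(L)` that returns the dimension term d for a given L. None of these names come from the talk.

```python
import math

def greedy_cover_size(edges):
    """2-approximate vertex cover: both endpoints of a greedily built maximal matching."""
    covered = set()
    for u, v in edges:
        if u not in covered and v not in covered:
            covered.update((u, v))
    return len(covered)

def error_bound(n, d_cap, k, delta):
    """k/n + sqrt((2/n) (d ln(34en/d) log_2(578n) + ln(4/delta)))."""
    inner = (2.0 / n) * (d_cap * math.log(34 * math.e * n / d_cap) * math.log2(578 * n)
                         + math.log(4 / delta))
    return k / n + math.sqrt(inner)

def select_lipschitz_constant(sample, d, capacity, delta=0.05):
    """Return (bound, L, k) minimizing the approximate bound over candidate L values."""
    n = len(sample)
    pos = [i for i, (_, y) in enumerate(sample) if y == +1]
    neg = [j for j, (_, y) in enumerate(sample) if y == -1]
    dists = sorted({d(sample[i][0], sample[j][0]) for i in pos for j in neg})
    best = None
    for t in dists:                       # target L = 2/t; pairs closer than t conflict
        edges = [(i, j) for i in pos for j in neg
                 if d(sample[i][0], sample[j][0]) < t]
        k = greedy_cover_size(edges)      # at most twice the minimum number of errors
        bound = error_bound(n, capacity(2.0 / t), k, delta)
        if best is None or bound < best[0]:
            best = (bound, 2.0 / t, k)
    return best

# Toy run; capacity below is a made-up placeholder, not the bound on d from the talk.
sample = [(0.0, -1), (1.0, -1), (1.2, +1), (3.0, +1)]
best = select_lipschitz_constant(sample, lambda x, y: abs(x - y), lambda L: max(1.0, L))
print(best)
```

With only four sample points the bound itself is vacuous; the example is meant to show the control flow, where a larger L costs capacity and a smaller L costs discarded points.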


Conclusion

Results:
Generalization bounds for Lipschitz classifiers in doubling spaces
Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
Efficient calculation of the bias-variance tradeoff

Continuing research
Similar results for continuous labels
