Similarity-based Classifiers: Problems and Solutions.
Transcript of Similarity-based Classifiers: Problems and Solutions.
Similarity-based Classifiers:
Problems and Solutions
2
Classifying based on similarities: Van Gogh or Monet?

[Figure: example paintings, labeled Van Gogh and Monet]
3
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \Omega$ (paintings), $y_i \in \mathcal{G}$ (painter), $i = 1, \dots, n$.
4
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \dots, n$.

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$.

Training similarities: $S = \left[\psi(x_i, x_j)\right]_{n \times n}$, $\; y = \left[y_1 \dots y_n\right]^T$.
5
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \dots, n$.

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$.

Training similarities: $S = \left[\psi(x_i, x_j)\right]_{n \times n}$, $\; y = \left[y_1 \dots y_n\right]^T$.

Test similarities: $s = \left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T$, and $\psi(x, x)$.

Problem: Estimate the class label $y$ for a test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.
6
Examples of Similarity Functions

Computational biology:
- Smith-Waterman algorithm (Smith & Waterman, 1981)
- FASTA algorithm (Lipman & Pearson, 1985)
- BLAST algorithm (Altschul et al., 1990)

Computer vision:
- Tangent distance (Duda et al., 2001)
- Earth mover's distance (Rubner et al., 2000)
- Shape matching distance (Belongie et al., 2002)
- Pyramid match kernel (Grauman & Darrell, 2007)

Information retrieval:
- Levenshtein distance (Levenshtein, 1966), sketched in code below
- Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)
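To make the flavor of such functions concrete, here is a minimal sketch (mine, not from the talk) of the Levenshtein edit distance from the information-retrieval list above. Note that a distance like this is a perfectly good $\psi : \Omega \times \Omega \to \mathbb{R}$ even though it is not an inner product.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```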
7
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram of approaches: MDS, similarities as kernels, SVM, similarities as features, theory, k-NN weights, generative models, SDA]
9
Can we treat similarities as kernels?
Kernels are inner products in some Hilbert space.
10
Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Properties of an inner product $\langle x, z \rangle$:
- conjugate symmetric (real and symmetric, for our purposes)
- linear: $\langle ax, z \rangle = a \langle x, z \rangle$
- positive definite: $\langle x, x \rangle > 0$ unless $x = 0$

Example inner product: $\langle x, z \rangle = x^T z$.

An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.
11
Can we treat similarities as kernels?
Kernels are inner products in some Hilbert space.
Inner products are similarities.
Are our notions of similarity always inner products? No!
12
Example: Amazon similarity
Ω = space of all books; ψ(A, B) = % of customers who buy book A after viewing book B on Amazon. S is a 96 × 96 matrix over 96 books.

Is S inner-product-like?

[Figure: the 96 × 96 similarity matrix S]
13
Example: Amazon similarity

Ω = space of all books; ψ(A, B) = % of customers who buy book A after viewing book B on Amazon. S is 96 × 96.

ψ(HTF, Bishop) = 3, but ψ(Bishop, HTF) = 8: asymmetric!

[Figure: the 96 × 96 similarity matrix S, and its eigenvalues plotted against rank]
Example: Amazon similarity
Ω = space of all books; ψ(A, B) = % of customers who buy book A after viewing book B on Amazon. S is 96 × 96.

[Figure: the eigenvalues of S by rank; several eigenvalues are negative]

Negative eigenvalues: S is not PSD!
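A quick sketch of how one might check these two properties numerically with NumPy (my illustration; `S` here is any square similarity matrix, such as the 96 × 96 Amazon matrix, whose data is not reproduced in this transcript):

```python
import numpy as np

def check_kernel_properties(S: np.ndarray, tol: float = 1e-10):
    """Report whether a similarity matrix is symmetric and PSD."""
    symmetric = np.allclose(S, S.T, atol=tol)
    # Symmetrize before the eigendecomposition, as on the next slides.
    eigvals = np.linalg.eigvalsh(0.5 * (S + S.T))
    psd = eigvals.min() >= -tol
    return symmetric, psd, eigvals

# Toy asymmetric, indefinite "similarity" matrix for illustration:
S = np.array([[5.0, 3.0], [8.0, 5.0]])
symmetric, psd, eigvals = check_kernel_properties(S)
print(symmetric, psd, eigvals)  # False False [-0.5 10.5]
```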
15
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, then eigendecompose: $S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

Clip: $S_{\mathrm{clip}} = U\, \mathrm{diag}(\max(\lambda_1, 0), \dots, \max(\lambda_n, 0))\, U^T$

[Figure: S and its projection $S_{\mathrm{clip}}$ onto the PSD cone]

$S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.
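For completeness, here is the short standard argument behind that closeness claim (sketched by me; the slide states it without proof):

```latex
% S symmetric with S = U \Lambda U^T. The Frobenius norm is invariant
% under the orthogonal change of basis M \mapsto U^T M U, so for any
% PSD K, writing B = U^T K U (also PSD, hence b_{ii} \ge 0):
\|K - S\|_F^2 = \|B - \Lambda\|_F^2
  = \sum_i (b_{ii} - \lambda_i)^2 + \sum_{i \ne j} b_{ij}^2
  \;\ge\; \sum_i \min_{\beta \ge 0}\,(\beta - \lambda_i)^2
  = \sum_{i:\, \lambda_i < 0} \lambda_i^2,
% with equality when B = \mathrm{diag}(\max(\lambda_1,0),\dots,\max(\lambda_n,0)),
% i.e. when K = S_{\mathrm{clip}}.
```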
16
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, $\; S = U \Lambda U^T$, $\; \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

Flip: $S_{\mathrm{flip}} = U\, \mathrm{diag}(|\lambda_1|, \dots, |\lambda_n|)\, U^T$ (similar effect: $S_{\mathrm{new}} = S^T S$)
17
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, $\; S = U \Lambda U^T$.

Shift: $S_{\mathrm{shift}} = U\, (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$
18
Well, let's just make S be a kernel matrix

Flip, clip, or shift? Best bet is clip.
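A minimal NumPy sketch of the three spectrum modifications, assuming `S` is a real square similarity matrix (the function names are mine, not from the talk):

```python
import numpy as np

def symmetrize(S):
    return 0.5 * (S + S.T)

def clip_spectrum(S):
    """S_clip: zero out negative eigenvalues (nearest PSD in Frobenius norm)."""
    lam, U = np.linalg.eigh(symmetrize(S))
    return (U * np.maximum(lam, 0.0)) @ U.T

def flip_spectrum(S):
    """S_flip: take absolute values of the eigenvalues."""
    lam, U = np.linalg.eigh(symmetrize(S))
    return (U * np.abs(lam)) @ U.T

def shift_spectrum(S):
    """S_shift: add |min(lambda_min(S), 0)| to every eigenvalue."""
    lam, U = np.linalg.eigh(symmetrize(S))
    shift = abs(min(lam.min(), 0.0))
    return (U * (lam + shift)) @ U.T

S = np.array([[5.0, 3.0], [8.0, 5.0]])
for f in (clip_spectrum, flip_spectrum, shift_spectrum):
    print(f.__name__, np.linalg.eigvalsh(f(S)))  # all nonnegative spectra
```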
19
Well, let's just make S be a kernel matrix

Or, learn the best kernel matrix for the SVM (Luss & d'Aspremont, NIPS 2007; Chen et al., ICML 2009):

$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$
20
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram, repeated; next up: similarities as features]
21
Let the similarities to the training samples be features

Let $\left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T \in \mathbb{R}^n$ be the feature vector for $x$.

- SVM (Graepel et al., 1998; Liao & Noble, 2003)
- Linear programming (LP) machine (Graepel et al., 1999)
- Linear discriminant analysis (LDA) (Pekalska et al., 2001)
- Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
- Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

A representative sparse formulation over the similarity features:
$$\underset{\alpha}{\text{minimize}} \;\; \tfrac{1}{2}\|y - S\alpha\|_2^2 + \epsilon \|\alpha\|_1$$

Does this work asymptotically? Our results suggest you need to choose a slow-growing subset of the n similarity features.
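A minimal scikit-learn sketch of the sim-as-feature idea: each row of the training similarity matrix serves as that sample's feature vector, and a test point is featurized by its similarities to the training set. The stand-in data here is mine; this illustrates the idea, not the exact experimental setup below.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 40
# Stand-in similarity data: any n x n matrix works as features,
# even an asymmetric, indefinite one.
S_train = rng.random((n, n))
y_train = rng.integers(0, 2, size=n)
s_test = rng.random((5, n))  # similarities of 5 test points to the n training points

# Rows of S are ordinary feature vectors, so any off-the-shelf classifier
# applies; a linear SVM corresponds to "SVM sim-as-feature (linear)" below.
clf = SVC(kernel="linear").fit(S_train, y_train)
print(clf.predict(s_test))
```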
22
Test error (%):

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex  Voting
# classes                    47         2            101          139       10     2
# samples                    204        100          8677         945      3090    435

SVM (clip)                   81.24      13.00        33.49        4.18     57.83   4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18        4.29     55.54   5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16        3.92     55.72   5.52
P-SVM                        70.12      14.25        34.23        4.05     63.81   5.34
23
The same comparison, adding the local SVM-kNN method (Zhang et al., 2006):

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex  Voting
SVM-kNN (clip)               17.56      13.75        36.82        4.23     61.25   5.23
SVM (clip)                   81.24      13.00        33.49        4.18     57.83   4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18        4.29     55.54   5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16        3.92     55.72   5.52
P-SVM                        70.12      14.25        34.23        4.05     63.81   5.34
24
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram, repeated; next up: weighted k-NN]
25
Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^k w_i\, I_{\{y_i = g\}}$$

An algorithmic parallel of the exemplar model of human learning.
26
Weighted Nearest-Neighbors

For $w_i \geq 0$ and $\sum_i w_i = 1$, the vote also gives a class posterior estimate:

$$\hat{P}(Y = g \mid X = x) = \sum_{i=1}^k w_i\, I_{\{y_i = g\}}$$

Good for asymmetric costs, good for interpretation, good for system integration.
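A small sketch (mine) of the weighted vote and the implied posterior estimate, assuming the k nearest neighbors' labels and nonnegative, normalized weights are already in hand:

```python
import numpy as np

def weighted_knn_vote(labels, weights):
    """Weighted k-NN vote; weights are assumed nonnegative and summing to 1,
    so the per-class weight totals double as class-posterior estimates."""
    classes = np.unique(labels)
    posterior = {g: weights[labels == g].sum() for g in classes}
    y_hat = max(posterior, key=posterior.get)
    return y_hat, posterior

labels = np.array([0, 1, 1, 0])
weights = np.array([0.5, 0.2, 0.2, 0.1])
print(weighted_knn_vote(labels, weights))  # class 0 wins with posterior 0.6
```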
27
Design Goals for the Weights
28
Design Goals for the Weights
Design Goal 1 (Affinity): wi should be an increasing function of ψ(x, xi).
30
Design Goals for the Weights (Chen et al., JMLR 2009)

Design Goal 2 (Diversity): wi should be a decreasing function of ψ(xi, xj).
31
Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$$\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0, \; \sum_i w_i = 1$$

[Figure: x inside the convex hull of x1, x2, x3, x4 (non-unique solution)]
32
Linear Interpolation Weights

$$\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0, \; \sum_i w_i = 1$$

[Figures: x inside the convex hull of x1, ..., x4 (non-unique solution); x outside the hull (no solution)]
33
LIME Weights

Linear interpolation weights will meet these goals. Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$$\underset{w}{\text{minimize}} \;\; \Big\|\sum_{i=1}^k w_i x_i - x\Big\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i$$
$$\text{subject to} \;\; \sum_{i=1}^k w_i = 1, \quad w_i \geq 0, \; i = 1, \dots, k.$$
35
LIME Weights

The maximum-entropy term pushes the weights toward being equal.
36
LIME Weights

Maximum entropy yields an exponential-form solution, is consistent (Friedlander & Gupta, IEEE IT 2005), and averages out noise.
37
Kernelize Linear Interpolation (Chen et al., JMLR 2009)

LIME weights:
$$\underset{w}{\text{minimize}} \;\; \Big\|\sum_{i=1}^k w_i x_i - x\Big\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i \quad \text{subject to} \;\; \sum_{i=1}^k w_i = 1, \; w_i \geq 0.$$

Let $X = [x_1, \dots, x_k]$, rewrite with matrices, and change to a ridge regularizer:
$$\underset{w}{\text{minimize}} \;\; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to} \;\; w \geq 0, \; \mathbf{1}^T w = 1.$$
38
Kernelize Linear Interpolation

The ridge term $\tfrac{\lambda}{2} w^T w$ regularizes the variance of the weights.
39
Kernelize Linear Interpolation

The objective depends on the data only through the inner products $X^T X$ and $x^T X$, so we can replace them with a kernel, or with similarities!
40
KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:
$$\underset{w}{\text{minimize}} \;\; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to} \;\; w \geq 0, \; \mathbf{1}^T w = 1.$$
41
KRI Weights Satisfy Design Goals

Affinity: $s = \left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T$, so the $-s^T w$ term makes $w_i$ high if $\psi(x, x_i)$ is high.
42
KRI Weights Satisfy Design Goals

Diversity: $\tfrac{1}{2} w^T S w = \tfrac{1}{2} \sum_{i,j} \psi(x_i, x_j)\, w_i w_j$, so weight is discouraged on pairs of neighbors that are similar to each other.
43
KRI Weights Satisfy Design Goals

Make S PSD and the problem is a QP with box constraints; can solve with SMO.
44
KRI Weights Satisfy Design Goals

Remove the constraints on the weights:
$$\arg\min_w \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \;=\; (S + \lambda I)^{-1} s$$

This can be shown equivalent to local ridge regression: the KRR weights.
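A minimal sketch contrasting the two weight rules, assuming S has already been made PSD. The KRR weights are the closed form above; the KRI weights solve the constrained QP, here via SciPy's SLSQP for illustration rather than the SMO solver mentioned on the slide.

```python
import numpy as np
from scipy.optimize import minimize

def krr_weights(S, s, lam):
    """Unconstrained kernel ridge regression weights: (S + lam*I)^{-1} s."""
    k = S.shape[0]
    return np.linalg.solve(S + lam * np.eye(k), s)

def kri_weights(S, s, lam):
    """Kernel ridge interpolation weights: same objective with
    w >= 0 and 1'w = 1 enforced."""
    k = S.shape[0]
    obj = lambda w: 0.5 * w @ S @ w - s @ w + 0.5 * lam * w @ w
    res = minimize(obj, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

# Example 1 from the next slide: S = 5*I, s = [4, 3, 2, 1].
S = 5.0 * np.eye(4)
s = np.array([4.0, 3.0, 2.0, 1.0])
print(krr_weights(S, s, lam=1.0))  # s / 6: need not sum to 1
print(kri_weights(S, s, lam=1.0))  # nonnegative, sums to 1
```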
45
Weighted k-NN: Example 1

$$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$$

$$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$$

[Plots: the KRI and KRR weights $w_1, \dots, w_4$ as functions of $\lambda$ from $10^{-2}$ to $10^{2}$. Here $w_{\mathrm{KRR}} = s/(5 + \lambda)$ shrinks toward zero as $\lambda$ grows, while the KRI weights stay on the simplex and flatten toward the uniform value $1/4$]
46
Weighted k-NN: Example 2

$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$$

(same KRI and KRR rules as in Example 1)

[Plots: weights vs. $\lambda$. The test point is equally similar to all four neighbors, yet both rules put the most weight on $x_1$, which is dissimilar from the other neighbors, and the least on the mutually similar pair $x_2, x_3$: the diversity goal at work]
47
Weighted k-NN: Example 3

$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$$

[Plots: weights vs. $\lambda$. Affinity now matters too: $x_2$, the neighbor most similar to the test point, receives the largest weight, and the unconstrained KRR weights dip below zero for small $\lambda$]
48
Test error (%):

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex  Voting
# samples                    204        100          8677         945      3090    435
# classes                    47         2            101          139       10     2

LOCAL
k-NN                         16.95      17.00        41.55        4.23     61.21   5.80
affinity k-NN                15.00      15.00        39.20        4.23     61.15   5.86
KRI k-NN (clip)              17.68      14.00        30.13        4.15     61.20   5.29
KRR k-NN (pinv)              16.10      15.25        29.90        4.31     61.18   5.52
SVM-KNN (clip)               17.56      13.75        36.82        4.23     61.25   5.23

GLOBAL
SVM sim-as-kernel (clip)     81.24      13.00        33.49        4.18     57.83   4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18        4.29     55.54   5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16        3.92     55.72   5.52
P-SVM                        70.12      14.25        34.23        4.05     63.81   5.34
52
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram, repeated; next up: generative models (SDA)]
53
Generative Classifiers

Model the probability of what you see given each class:
- Linear discriminant analysis
- Quadratic discriminant analysis
- Gaussian mixture models
- ...

Pro: produces class probabilities.
54
Generative Classifiers

Our goal: model $P(T(s) \mid g)$, where $g$ is the class and $T(s)$ is a set of descriptive statistics of $s$.

We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \dots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class.
55
Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Model $P(T(s) \mid g)$:
- Assume the G similarities are class-conditionally independent.
- Estimate each $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given its empirical mean; the result is exponential.
- Reduce model bias by applying the model locally (local SDA).
- Reduce estimation variance by regularizing over localities.
56
Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Regularized local SDA performance: competitive.
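A rough sketch of the SDA scoring rule under the stated assumptions: per class, the similarity to each class centroid is modeled by an exponential max-entropy density whose mean matches the training data, and the G class-conditional similarities are treated as independent. Everything here (function names, using medoids as centroids, the nonnegative-similarity assumption) is my illustration, not Cazzanti and Gupta's exact estimator.

```python
import numpy as np

def sda_fit(S, y):
    """Per class g: a centroid (medoid) index, and for each centroid mu_h
    the empirical mean of psi(x, mu_h) over training x in class g.
    Assumes similarities are nonnegative (exponential support)."""
    classes = np.unique(y)
    # Medoid of class g: the member most similar, on average, to its classmates.
    centroids = [np.flatnonzero(y == g)[np.argmax(S[np.ix_(y == g, y == g)].mean(0))]
                 for g in classes]
    # means[g][h] = average similarity of class-g samples to centroid mu_h.
    means = np.array([[S[y == g][:, c].mean() for c in centroids] for g in classes])
    return classes, centroids, means

def sda_predict(s_test, classes, centroids, means, priors=None):
    """Max-entropy density on [0, inf) given a mean m is Exponential(1/m);
    score each class by the product of its G independent densities."""
    t = s_test[centroids]  # T(s) = [psi(x, mu_1), ..., psi(x, mu_G)]
    loglik = np.array([np.sum(-np.log(m) - t / m) for m in means])
    if priors is not None:
        loglik += np.log(priors)
    return classes[np.argmax(loglik)]
```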
57
Some Conclusions

- Performance depends heavily on the oddities of each dataset.
- Weighted k-NN with affinity-diversity weights works well.
- Preliminary: regularized local SDA works well.
- Probabilities are useful.
- Local models are useful:
  - less approximating
  - it is hard to model the entire space (is there an underlying manifold?)
  - always feasible
62
Lots of Open Questions

- Making S PSD
- Fast k-NN search for similarities
- Similarity-based regression
- Relationship with learning on graphs
- Trying it out on real data
- Fusion with Euclidean features (see our FUSION 2009 papers)
- Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
See "Similarity-based Classification" by Chen et al., JMLR 2009.
64
Training and Test Consistency

For a test sample x, given $s = \left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T$, shall we classify x as
$$\hat{y} = \mathrm{sgn}\big((c^\star)^T s + b^\star\big)\,?$$

No! If a training sample were used as a test sample, its predicted class could change!
65
Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Amazon, Aural Sonar, and Protein datasets]
66
Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Voting, Yeast-5-7, and Yeast-5-12 datasets]
67
SVM Review

Empirical risk minimization (ERM) with regularization:
$$\underset{f \in \mathcal{H}_K}{\text{minimize}} \;\; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2$$

Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$

SVM primal:
$$\underset{c, b, \xi}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c \quad \text{subject to} \;\; \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0.$$

[Figure: hinge loss and 0-1 loss as functions of the margin $y f(x)$]
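A tiny NumPy illustration (mine, not from the talk) of the hinge loss next to the 0-1 loss it upper-bounds:

```python
import numpy as np

def hinge_loss(margin):            # margin = y * f(x)
    return np.maximum(1.0 - margin, 0.0)

def zero_one_loss(margin):
    return (margin <= 0).astype(float)

m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge_loss(m))     # [2.  1.  0.5 0.  0. ]
print(zero_one_loss(m))  # [1. 1. 0. 0. 0.]
```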
68
Learning the Kernel Matrix

Find the best K for classification, regularized toward S:
$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$

The SVM that learns the full kernel matrix:
$$\underset{c, b, \xi, K}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to} \;\; \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0, \;\; K \succeq 0.$$
69
Related Work

SVM dual:
$$\underset{\alpha}{\text{maximize}} \;\; \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha \quad \text{subject to} \;\; y^T \alpha = 0, \; 0 \leq \alpha \leq C\mathbf{1}.$$

Robust SVM (Luss & d'Aspremont, 2007):
$$\underset{\alpha}{\text{maximize}} \; \min_{K \succeq 0} \; \Big( \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2 \Big) \quad \text{subject to} \;\; y^T \alpha = 0, \; 0 \leq \alpha \leq C\mathbf{1}.$$

"This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K."
70
Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T \alpha = 0, \; 0 \leq \alpha \leq C\mathbf{1}\}$ and rewrite the robust SVM as
$$\max_{\alpha \in \mathcal{A}} \; \min_{K \succeq 0} \; \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2.$$

Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and f(μ, ν) a function on M × N that is quasiconcave in μ, quasiconvex in ν, upper semi-continuous in μ for each ν ∈ N, and lower semi-continuous in ν for each μ ∈ M. Then
$$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$$
71
Related Work

By Sion's minimax theorem, the robust SVM is equivalent to
$$\min_{K \succeq 0} \; \max_{\alpha \in \mathcal{A}} \; \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2.$$

Compare with our kernel-learning objective:
$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F.$$

[Diagram: a primal-dual pair $L(x, \lambda^\star)$ / $L(x^\star, \lambda)$ with zero duality gap]

Because the inner SVM has zero duality gap, the inner maximization over the dual variable $\alpha$ equals the inner minimization over $f$, so the two formulations match up to the squared vs. unsquared Frobenius penalty.
72
Learning the Kernel Matrix

It is not trivial to directly solve:
$$\underset{c, b, \xi, K}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to} \;\; \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0, \;\; K \succeq 0.$$

Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then
$$\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$$
if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^\dagger z \geq 0$.

Let $z = Kc$, and notice that $c^T K c = z^T K^\dagger z$ since $K K^\dagger K = K$.
73
Learning the Kernel Matrix

However, it can be expressed as a convex conic program:
$$\underset{z, b, \xi, K, u, v}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta u + \gamma v$$
$$\text{subject to} \;\; \mathrm{diag}(y)(z + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0, \;\; \begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0, \;\; \|K - S\|_F \leq v.$$

We can recover the optimal $c^\star$ by $c^\star = (K^\star)^\dagger z^\star$.
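A sketch of this conic program in CVXPY, following the formulation above (variable names mirror the slide; `S` and `y` are assumed given, an SDP-capable solver such as SCS is assumed installed, and this is my illustration rather than the authors' code):

```python
import cvxpy as cp
import numpy as np

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    n = len(y)
    K = cp.Variable((n, n), symmetric=True)
    z = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(n)
    u = cp.Variable()
    v = cp.Variable()
    # Schur-complement block: encodes u >= z' K^dagger z with K PSD.
    M = cp.bmat([[K, cp.reshape(z, (n, 1))],
                 [cp.reshape(z, (1, n)), cp.reshape(u, (1, 1))]])
    constraints = [
        cp.multiply(y, z + b) >= 1 - xi,   # diag(y)(z + b*1) >= 1 - xi
        xi >= 0,
        M >> 0,
        cp.norm(K - S, "fro") <= v,
    ]
    prob = cp.Problem(cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v),
                      constraints)
    prob.solve()
    c = np.linalg.pinv(K.value) @ z.value  # recover c* = (K*)^dagger z*
    return K.value, c, b.value
```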
74
Learning the Spectrum Modification

Concerns about learning the full kernel matrix:
- Though the problem is convex, the number of variables is O(n²).
- The flexibility of the model may lead to overfitting.