Similarity-based Classifiers: Problems and Solutions.
Transcript of Similarity-based Classifiers: Problems and Solutions.
Similarity-based Classifiers:
Problems and Solutions
2
Classifying based on similarities: Van Gogh or Monet?

[Figure: example paintings, labeled Van Gogh and Monet]
3
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \Omega$ (paintings), $y_i \in \mathcal{G}$ (painter), $i = 1, \dots, n$.
4
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \dots, n$.

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$.

Training similarities: $S = \left[\psi(x_i, x_j)\right]_{n \times n}$, $\; y = \left[y_1 \dots y_n\right]^T$.
5
The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \dots, n$.

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$.

Training similarities: $S = \left[\psi(x_i, x_j)\right]_{n \times n}$, $\; y = \left[y_1 \dots y_n\right]^T$.

Test similarities: $s = \left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T$, and $\psi(x, x)$.

Problem: Estimate the class label $y$ for a test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.
6
Examples of Similarity Functions

Computational biology:
- Smith-Waterman algorithm (Smith & Waterman, 1981)
- FASTA algorithm (Lipman & Pearson, 1985)
- BLAST algorithm (Altschul et al., 1990)

Computer vision:
- Tangent distance (Duda et al., 2001)
- Earth mover's distance (Rubner et al., 2000)
- Shape matching distance (Belongie et al., 2002)
- Pyramid match kernel (Grauman & Darrell, 2007)

Information retrieval:
- Levenshtein distance (Levenshtein, 1966), sketched in code below
- Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)
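To make the flavor of such functions concrete, here is a minimal sketch (mine, not from the talk) of the Levenshtein edit distance from the information-retrieval list above. Note that a distance like this is a perfectly good $\psi : \Omega \times \Omega \to \mathbb{R}$ even though it is not an inner product.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```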
7
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram of approaches: MDS, similarities as kernels, SVM, similarities as features, theory, k-NN weights, generative models, SDA]
9
Can we treat similarities as kernels?
Kernels are inner products in some Hilbert space.
10
Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Properties of an inner product $\langle x, z \rangle$:
- conjugate symmetric (real and symmetric, for our purposes)
- linear: $\langle ax, z \rangle = a \langle x, z \rangle$
- positive definite: $\langle x, x \rangle > 0$ unless $x = 0$

Example inner product: $\langle x, z \rangle = x^T z$.

An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.
11
Can we treat similarities as kernels?
Kernels are inner products in some Hilbert space.
Inner products are similarities.
Are our notions of similarity always inner products? No!
12
Example: Amazon similarity
Ω = space of all books; ψ(A, B) = % of customers who buy book A after viewing book B on Amazon. S is a 96 × 96 matrix over 96 books.

Is S inner-product-like?

[Figure: the 96 × 96 similarity matrix S]
13
Example: Amazon similarity

Ω = space of all books; ψ(A, B) = % of customers who buy book A after viewing book B on Amazon. S is 96 × 96.

ψ(HTF, Bishop) = 3, but ψ(Bishop, HTF) = 8: asymmetric!

[Figure: the 96 × 96 similarity matrix S, and its eigenvalues plotted against rank]
Example: Amazon similarity
Ω = space of all books; ψ(A, B) = % of customers who buy book A after viewing book B on Amazon. S is 96 × 96.

[Figure: the eigenvalues of S by rank; several eigenvalues are negative]

Negative eigenvalues: S is not PSD!
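A quick sketch of how one might check these two properties numerically with NumPy (my illustration; `S` here is any square similarity matrix, such as the 96 × 96 Amazon matrix, whose data is not reproduced in this transcript):

```python
import numpy as np

def check_kernel_properties(S: np.ndarray, tol: float = 1e-10):
    """Report whether a similarity matrix is symmetric and PSD."""
    symmetric = np.allclose(S, S.T, atol=tol)
    # Symmetrize before the eigendecomposition, as on the next slides.
    eigvals = np.linalg.eigvalsh(0.5 * (S + S.T))
    psd = eigvals.min() >= -tol
    return symmetric, psd, eigvals

# Toy asymmetric, indefinite "similarity" matrix for illustration:
S = np.array([[5.0, 3.0], [8.0, 5.0]])
symmetric, psd, eigvals = check_kernel_properties(S)
print(symmetric, psd, eigvals)  # False False [-0.5 10.5]
```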
15
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, then eigendecompose: $S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

Clip: $S_{\mathrm{clip}} = U\, \mathrm{diag}(\max(\lambda_1, 0), \dots, \max(\lambda_n, 0))\, U^T$

[Figure: S and its projection $S_{\mathrm{clip}}$ onto the PSD cone]

$S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.
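For completeness, here is the short standard argument behind that closeness claim (sketched by me; the slide states it without proof):

```latex
% S symmetric with S = U \Lambda U^T. The Frobenius norm is invariant
% under the orthogonal change of basis M \mapsto U^T M U, so for any
% PSD K, writing B = U^T K U (also PSD, hence b_{ii} \ge 0):
\|K - S\|_F^2 = \|B - \Lambda\|_F^2
  = \sum_i (b_{ii} - \lambda_i)^2 + \sum_{i \ne j} b_{ij}^2
  \;\ge\; \sum_i \min_{\beta \ge 0}\,(\beta - \lambda_i)^2
  = \sum_{i:\, \lambda_i < 0} \lambda_i^2,
% with equality when B = \mathrm{diag}(\max(\lambda_1,0),\dots,\max(\lambda_n,0)),
% i.e. when K = S_{\mathrm{clip}}.
```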
16
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, $\; S = U \Lambda U^T$, $\; \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$.

Flip: $S_{\mathrm{flip}} = U\, \mathrm{diag}(|\lambda_1|, \dots, |\lambda_n|)\, U^T$ (similar effect: $S_{\mathrm{new}} = S^T S$)
17
Well, let's just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T)$, $\; S = U \Lambda U^T$.

Shift: $S_{\mathrm{shift}} = U\, (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$
18
Well, let's just make S be a kernel matrix

Flip, clip, or shift? Best bet is clip.
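A minimal NumPy sketch of the three spectrum modifications, assuming `S` is a real square similarity matrix (the function names are mine, not from the talk):

```python
import numpy as np

def symmetrize(S):
    return 0.5 * (S + S.T)

def clip_spectrum(S):
    """S_clip: zero out negative eigenvalues (nearest PSD in Frobenius norm)."""
    lam, U = np.linalg.eigh(symmetrize(S))
    return (U * np.maximum(lam, 0.0)) @ U.T

def flip_spectrum(S):
    """S_flip: take absolute values of the eigenvalues."""
    lam, U = np.linalg.eigh(symmetrize(S))
    return (U * np.abs(lam)) @ U.T

def shift_spectrum(S):
    """S_shift: add |min(lambda_min(S), 0)| to every eigenvalue."""
    lam, U = np.linalg.eigh(symmetrize(S))
    shift = abs(min(lam.min(), 0.0))
    return (U * (lam + shift)) @ U.T

S = np.array([[5.0, 3.0], [8.0, 5.0]])
for f in (clip_spectrum, flip_spectrum, shift_spectrum):
    print(f.__name__, np.linalg.eigvalsh(f(S)))  # all nonnegative spectra
```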
19
Well, let's just make S be a kernel matrix

Or, learn the best kernel matrix for the SVM (Luss & d'Aspremont, NIPS 2007; Chen et al., ICML 2009):

$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$
20
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram, repeated; next up: similarities as features]
21
Let the similarities to the training samples be features

Let $\left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T \in \mathbb{R}^n$ be the feature vector for $x$.

- SVM (Graepel et al., 1998; Liao & Noble, 2003)
- Linear programming (LP) machine (Graepel et al., 1999)
- Linear discriminant analysis (LDA) (Pekalska et al., 2001)
- Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
- Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

A representative sparse formulation over the similarity features:
$$\underset{\alpha}{\text{minimize}} \;\; \tfrac{1}{2}\|y - S\alpha\|_2^2 + \epsilon \|\alpha\|_1$$

Does this work asymptotically? Our results suggest you need to choose a slow-growing subset of the n similarity features.
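A minimal scikit-learn sketch of the sim-as-feature idea: each row of the training similarity matrix serves as that sample's feature vector, and a test point is featurized by its similarities to the training set. The stand-in data here is mine; this illustrates the idea, not the exact experimental setup below.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 40
# Stand-in similarity data: any n x n matrix works as features,
# even an asymmetric, indefinite one.
S_train = rng.random((n, n))
y_train = rng.integers(0, 2, size=n)
s_test = rng.random((5, n))  # similarities of 5 test points to the n training points

# Rows of S are ordinary feature vectors, so any off-the-shelf classifier
# applies; a linear SVM corresponds to "SVM sim-as-feature (linear)" below.
clf = SVC(kernel="linear").fit(S_train, y_train)
print(clf.predict(s_test))
```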
22
Test error (%):

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex  Voting
# classes                    47         2            101          139       10     2
# samples                    204        100          8677         945      3090    435

SVM (clip)                   81.24      13.00        33.49        4.18     57.83   4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18        4.29     55.54   5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16        3.92     55.72   5.52
P-SVM                        70.12      14.25        34.23        4.05     63.81   5.34
23
The same comparison, adding the local SVM-kNN method (Zhang et al., 2006):

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex  Voting
SVM-kNN (clip)               17.56      13.75        36.82        4.23     61.25   5.23
SVM (clip)                   81.24      13.00        33.49        4.18     57.83   4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18        4.29     55.54   5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16        3.92     55.72   5.52
P-SVM                        70.12      14.25        34.23        4.05     63.81   5.34
24
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram, repeated; next up: weighted k-NN]
25
Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^k w_i\, I_{\{y_i = g\}}$$

An algorithmic parallel of the exemplar model of human learning.
26
Weighted Nearest-Neighbors

For $w_i \geq 0$ and $\sum_i w_i = 1$, the vote also gives a class posterior estimate:

$$\hat{P}(Y = g \mid X = x) = \sum_{i=1}^k w_i\, I_{\{y_i = g\}}$$

Good for asymmetric costs, good for interpretation, good for system integration.
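A small sketch (mine) of the weighted vote and the implied posterior estimate, assuming the k nearest neighbors' labels and nonnegative, normalized weights are already in hand:

```python
import numpy as np

def weighted_knn_vote(labels, weights):
    """Weighted k-NN vote; weights are assumed nonnegative and summing to 1,
    so the per-class weight totals double as class-posterior estimates."""
    classes = np.unique(labels)
    posterior = {g: weights[labels == g].sum() for g in classes}
    y_hat = max(posterior, key=posterior.get)
    return y_hat, posterior

labels = np.array([0, 1, 1, 0])
weights = np.array([0.5, 0.2, 0.2, 0.1])
print(weighted_knn_vote(labels, weights))  # class 0 wins with posterior 0.6
```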
27
Design Goals for the Weights
28
Design Goals for the Weights
Design Goal 1 (Affinity): wi should be an increasing function of ψ(x, xi).
30
Design Goals for the Weights (Chen et al., JMLR 2009)

Design Goal 2 (Diversity): wi should be a decreasing function of ψ(xi, xj).
31
Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$$\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0, \; \sum_i w_i = 1$$

[Figure: x inside the convex hull of x1, x2, x3, x4 (non-unique solution)]
32
Linear Interpolation Weights

$$\sum_i w_i x_i = x, \quad \text{such that } w_i \geq 0, \; \sum_i w_i = 1$$

[Figures: x inside the convex hull of x1, ..., x4 (non-unique solution); x outside the hull (no solution)]
33
LIME Weights

Linear interpolation weights will meet these goals. Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$$\underset{w}{\text{minimize}} \;\; \Big\|\sum_{i=1}^k w_i x_i - x\Big\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i$$
$$\text{subject to} \;\; \sum_{i=1}^k w_i = 1, \quad w_i \geq 0, \; i = 1, \dots, k.$$
35
LIME Weights

The maximum-entropy term pushes the weights toward being equal.
36
LIME Weights

Maximum entropy yields an exponential-form solution, is consistent (Friedlander & Gupta, IEEE IT 2005), and averages out noise.
37
Kernelize Linear Interpolation (Chen et al., JMLR 2009)

LIME weights:
$$\underset{w}{\text{minimize}} \;\; \Big\|\sum_{i=1}^k w_i x_i - x\Big\|_2^2 + \lambda \sum_{i=1}^k w_i \log w_i \quad \text{subject to} \;\; \sum_{i=1}^k w_i = 1, \; w_i \geq 0.$$

Let $X = [x_1, \dots, x_k]$, rewrite with matrices, and change to a ridge regularizer:
$$\underset{w}{\text{minimize}} \;\; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to} \;\; w \geq 0, \; \mathbf{1}^T w = 1.$$
38
Kernelize Linear Interpolation

The ridge term $\tfrac{\lambda}{2} w^T w$ regularizes the variance of the weights.
39
Kernelize Linear Interpolation

The objective depends on the data only through the inner products $X^T X$ and $x^T X$, so we can replace them with a kernel, or with similarities!
40
KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:
$$\underset{w}{\text{minimize}} \;\; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to} \;\; w \geq 0, \; \mathbf{1}^T w = 1.$$
41
KRI Weights Satisfy Design Goals

Affinity: $s = \left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T$, so the $-s^T w$ term makes $w_i$ high if $\psi(x, x_i)$ is high.
42
KRI Weights Satisfy Design Goals

Diversity: $\tfrac{1}{2} w^T S w = \tfrac{1}{2} \sum_{i,j} \psi(x_i, x_j)\, w_i w_j$, so weight is discouraged on pairs of neighbors that are similar to each other.
43
KRI Weights Satisfy Design Goals

Make S PSD and the problem is a QP with box constraints; can solve with SMO.
44
KRI Weights Satisfy Design Goals

Remove the constraints on the weights:
$$\arg\min_w \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \;=\; (S + \lambda I)^{-1} s$$

This can be shown equivalent to local ridge regression: the KRR weights.
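A minimal sketch contrasting the two weight rules, assuming S has already been made PSD. The KRR weights are the closed form above; the KRI weights solve the constrained QP, here via SciPy's SLSQP for illustration rather than the SMO solver mentioned on the slide.

```python
import numpy as np
from scipy.optimize import minimize

def krr_weights(S, s, lam):
    """Unconstrained kernel ridge regression weights: (S + lam*I)^{-1} s."""
    k = S.shape[0]
    return np.linalg.solve(S + lam * np.eye(k), s)

def kri_weights(S, s, lam):
    """Kernel ridge interpolation weights: same objective with
    w >= 0 and 1'w = 1 enforced."""
    k = S.shape[0]
    obj = lambda w: 0.5 * w @ S @ w - s @ w + 0.5 * lam * w @ w
    res = minimize(obj, np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

# Example 1 from the next slide: S = 5*I, s = [4, 3, 2, 1].
S = 5.0 * np.eye(4)
s = np.array([4.0, 3.0, 2.0, 1.0])
print(krr_weights(S, s, lam=1.0))  # s / 6: need not sum to 1
print(kri_weights(S, s, lam=1.0))  # nonnegative, sums to 1
```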
45
Weighted k-NN: Example 1

$$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$$

$$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$$

[Plots: the KRI and KRR weights $w_1, \dots, w_4$ as functions of $\lambda$ from $10^{-2}$ to $10^{2}$. Here $w_{\mathrm{KRR}} = s/(5 + \lambda)$ shrinks toward zero as $\lambda$ grows, while the KRI weights stay on the simplex and flatten toward the uniform value $1/4$]
46
Weighted k-NN: Example 2

$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$$

(same KRI and KRR rules as in Example 1)

[Plots: weights vs. $\lambda$. The test point is equally similar to all four neighbors, yet both rules put the most weight on $x_1$, which is dissimilar from the other neighbors, and the least on the mutually similar pair $x_2, x_3$: the diversity goal at work]
47
Weighted k-NN: Example 3

$$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \qquad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$$

[Plots: weights vs. $\lambda$. Affinity now matters too: $x_2$, the neighbor most similar to the test point, receives the largest weight, and the unconstrained KRR weights dip below zero for small $\lambda$]
48
Test error (%):

                             Amazon-47  Aural Sonar  Caltech-101  Face Rec  Mirex  Voting
# samples                    204        100          8677         945      3090    435
# classes                    47         2            101          139       10     2

LOCAL
k-NN                         16.95      17.00        41.55        4.23     61.21   5.80
affinity k-NN                15.00      15.00        39.20        4.23     61.15   5.86
KRI k-NN (clip)              17.68      14.00        30.13        4.15     61.20   5.29
KRR k-NN (pinv)              16.10      15.25        29.90        4.31     61.18   5.52
SVM-KNN (clip)               17.56      13.75        36.82        4.23     61.25   5.23

GLOBAL
SVM sim-as-kernel (clip)     81.24      13.00        33.49        4.18     57.83   4.89
SVM sim-as-feature (linear)  76.10      14.25        38.18        4.29     55.54   5.40
SVM sim-as-feature (RBF)     75.98      14.25        38.16        3.92     55.72   5.52
P-SVM                        70.12      14.25        34.23        4.05     63.81   5.34
52
Approaches to Similarity-based Classification

Classify x given S, y, s, and ψ(x, x).

[Overview diagram, repeated; next up: generative models (SDA)]
53
Generative Classifiers

Model the probability of what you see given each class:
- Linear discriminant analysis
- Quadratic discriminant analysis
- Gaussian mixture models
- ...

Pro: produces class probabilities.
54
Generative Classifiers

Our goal: model $P(T(s) \mid g)$, where $g$ is the class and $T(s)$ is a set of descriptive statistics of $s$.

We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \dots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class.
55
Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Model $P(T(s) \mid g)$:
- Assume the G similarities are class-conditionally independent.
- Estimate each $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given its empirical mean; the result is exponential.
- Reduce model bias by applying the model locally (local SDA).
- Reduce estimation variance by regularizing over localities.
56
Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Regularized local SDA performance: competitive.
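A rough sketch of the SDA scoring rule under the stated assumptions: per class, the similarity to each class centroid is modeled by an exponential max-entropy density whose mean matches the training data, and the G class-conditional similarities are treated as independent. Everything here (function names, using medoids as centroids, the nonnegative-similarity assumption) is my illustration, not Cazzanti and Gupta's exact estimator.

```python
import numpy as np

def sda_fit(S, y):
    """Per class g: a centroid (medoid) index, and for each centroid mu_h
    the empirical mean of psi(x, mu_h) over training x in class g.
    Assumes similarities are nonnegative (exponential support)."""
    classes = np.unique(y)
    # Medoid of class g: the member most similar, on average, to its classmates.
    centroids = [np.flatnonzero(y == g)[np.argmax(S[np.ix_(y == g, y == g)].mean(0))]
                 for g in classes]
    # means[g][h] = average similarity of class-g samples to centroid mu_h.
    means = np.array([[S[y == g][:, c].mean() for c in centroids] for g in classes])
    return classes, centroids, means

def sda_predict(s_test, classes, centroids, means, priors=None):
    """Max-entropy density on [0, inf) given a mean m is Exponential(1/m);
    score each class by the product of its G independent densities."""
    t = s_test[centroids]  # T(s) = [psi(x, mu_1), ..., psi(x, mu_G)]
    loglik = np.array([np.sum(-np.log(m) - t / m) for m in means])
    if priors is not None:
        loglik += np.log(priors)
    return classes[np.argmax(loglik)]
```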
57
Some Conclusions

- Performance depends heavily on the oddities of each dataset.
- Weighted k-NN with affinity-diversity weights works well.
- Preliminary: regularized local SDA works well.
- Probabilities are useful.
- Local models are useful:
  - less approximating
  - it is hard to model the entire space (is there an underlying manifold?)
  - always feasible
62
Lots of Open Questions

- Making S PSD
- Fast k-NN search for similarities
- Similarity-based regression
- Relationship with learning on graphs
- Trying it out on real data
- Fusion with Euclidean features (see our FUSION 2009 papers)
- Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
See "Similarity-based Classification" by Chen et al., JMLR 2009.
64
Training and Test Consistency

For a test sample x, given $s = \left[\psi(x, x_1) \dots \psi(x, x_n)\right]^T$, shall we classify x as
$$\hat{y} = \mathrm{sgn}\big((c^\star)^T s + b^\star\big)\,?$$

No! If a training sample were used as a test sample, its predicted class could change!
65
Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Amazon, Aural Sonar, and Protein datasets]
66
Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Voting, Yeast-5-7, and Yeast-5-12 datasets]
67
SVM Review

Empirical risk minimization (ERM) with regularization:
$$\underset{f \in \mathcal{H}_K}{\text{minimize}} \;\; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2$$

Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$

SVM primal:
$$\underset{c, b, \xi}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c \quad \text{subject to} \;\; \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0.$$

[Figure: hinge loss and 0-1 loss as functions of the margin $y f(x)$]
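A tiny NumPy illustration (mine, not from the talk) of the hinge loss next to the 0-1 loss it upper-bounds:

```python
import numpy as np

def hinge_loss(margin):            # margin = y * f(x)
    return np.maximum(1.0 - margin, 0.0)

def zero_one_loss(margin):
    return (margin <= 0).astype(float)

m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
print(hinge_loss(m))     # [2.  1.  0.5 0.  0. ]
print(zero_one_loss(m))  # [1. 1. 0. 0. 0.]
```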
68
Learning the Kernel Matrix

Find the best K for classification, regularized toward S:
$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$$

The SVM that learns the full kernel matrix:
$$\underset{c, b, \xi, K}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to} \;\; \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0, \;\; K \succeq 0.$$
69
Related Work

SVM dual:
$$\underset{\alpha}{\text{maximize}} \;\; \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha \quad \text{subject to} \;\; y^T \alpha = 0, \; 0 \leq \alpha \leq C\mathbf{1}.$$

Robust SVM (Luss & d'Aspremont, 2007):
$$\underset{\alpha}{\text{maximize}} \; \min_{K \succeq 0} \; \Big( \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2 \Big) \quad \text{subject to} \;\; y^T \alpha = 0, \; 0 \leq \alpha \leq C\mathbf{1}.$$

"This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K."
70
Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T \alpha = 0, \; 0 \leq \alpha \leq C\mathbf{1}\}$ and rewrite the robust SVM as
$$\max_{\alpha \in \mathcal{A}} \; \min_{K \succeq 0} \; \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2.$$

Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and f(μ, ν) a function on M × N that is quasiconcave in μ, quasiconvex in ν, upper semi-continuous in μ for each ν ∈ N, and lower semi-continuous in ν for each μ ∈ M. Then
$$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$$
71
Related Work

By Sion's minimax theorem, the robust SVM is equivalent to
$$\min_{K \succeq 0} \; \max_{\alpha \in \mathcal{A}} \; \mathbf{1}^T \alpha - \tfrac{1}{2} \alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha + \rho \|K - S\|_F^2.$$

Compare with our kernel-learning objective:
$$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F.$$

[Diagram: a primal-dual pair $L(x, \lambda^\star)$ / $L(x^\star, \lambda)$ with zero duality gap]

Because the inner SVM has zero duality gap, the inner maximization over the dual variable $\alpha$ equals the inner minimization over $f$, so the two formulations match up to the squared vs. unsquared Frobenius penalty.
72
Learning the Kernel Matrix

It is not trivial to directly solve:
$$\underset{c, b, \xi, K}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to} \;\; \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0, \;\; K \succeq 0.$$

Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then
$$\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$$
if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^\dagger z \geq 0$.

Let $z = Kc$, and notice that $c^T K c = z^T K^\dagger z$ since $K K^\dagger K = K$.
73
Learning the Kernel Matrix

However, it can be expressed as a convex conic program:
$$\underset{z, b, \xi, K, u, v}{\text{minimize}} \;\; \tfrac{1}{n} \mathbf{1}^T \xi + \eta u + \gamma v$$
$$\text{subject to} \;\; \mathrm{diag}(y)(z + b\mathbf{1}) \geq \mathbf{1} - \xi, \;\; \xi \geq 0, \;\; \begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0, \;\; \|K - S\|_F \leq v.$$

We can recover the optimal $c^\star$ by $c^\star = (K^\star)^\dagger z^\star$.
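A sketch of this conic program in CVXPY, following the formulation above (variable names mirror the slide; `S` and `y` are assumed given, an SDP-capable solver such as SCS is assumed installed, and this is my illustration rather than the authors' code):

```python
import cvxpy as cp
import numpy as np

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    n = len(y)
    K = cp.Variable((n, n), symmetric=True)
    z = cp.Variable(n)
    b = cp.Variable()
    xi = cp.Variable(n)
    u = cp.Variable()
    v = cp.Variable()
    # Schur-complement block: encodes u >= z' K^dagger z with K PSD.
    M = cp.bmat([[K, cp.reshape(z, (n, 1))],
                 [cp.reshape(z, (1, n)), cp.reshape(u, (1, 1))]])
    constraints = [
        cp.multiply(y, z + b) >= 1 - xi,   # diag(y)(z + b*1) >= 1 - xi
        xi >= 0,
        M >> 0,
        cp.norm(K - S, "fro") <= v,
    ]
    prob = cp.Problem(cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v),
                      constraints)
    prob.solve()
    c = np.linalg.pinv(K.value) @ z.value  # recover c* = (K*)^dagger z*
    return K.value, c, b.value
```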
74
Learning the Spectrum Modification

Concerns about learning the full kernel matrix:
- Though the problem is convex, the number of variables is O(n²).
- The flexibility of the model may lead to overfitting.