Introduction · F1: Solution by Semidefinite Programming · F2: Solution by Quadratic Programming · Experiments, Discussions, Conclusions
Learning Dissimilarities by Ranking: From SDP to QP
Hua Ouyang, Alexander Gray
FASTLab, College of Computing
Georgia Institute of Technology
07/07/2008 ICML
Outline
1 Introduction: Motivation and Problem
2 Solution by Semidefinite Programming
3 Solution by Quadratic Programming
4 Experiments
5 Discussions and Conclusions
Motivations · Problem Formulation
Introduction and Motivations
- Ranking: learning of orderings: "A ranks lower than B" or "C ranks higher than D" (e.g. SVM ranking, RankBoost)
- Our focus: learning rankings of dissimilarities between two samples (d-ranking): "A is more similar to B than C is to D" or "The distance between E and F is larger than that between G and H"
- Not well studied; existing work: nonmetric MDS (NMDS), Generalized NMDS (GNMDS)
- A kind of dissimilarity (metric) learning, cf. optimal Mahalanobis distance, kernel learning, multidimensional scaling
- Preserving metrics (distances) vs. preserving dissimilarities
Applications of d-ranking
Any applications? Wherever exact distances do not exist or are not accurate: movie recommendation (e.g. Netflix), protein folding, social science, information retrieval, ...
[Figure: illustration with points A, B, C, D and X; only the labels survive extraction.]
Problem Formulation of d-ranking
Different formulations:
- F1: Input: ranks d_ij ≤ d_kl. Output: coefficients in R^L, assuming a Euclidean metric in R^L.
- F2: Input: coefficients in R^D, ranks d_ij ≤ d_kl. Output: an explicit dissimilarity d : R^D × R^D → R.
- F3: Input: coefficients in R^D, ranks d_ij ≤ d_kl. Output: f : R^D → R^L, coefficients in R^L.
We investigate F1 and F2 in this work; F3 is left as future work.
Basic Ideas · Solution by SDP
F1- Basic Ideas
- F1: Input: ranks d_ij ≤ d_kl. Output: coefficients in R^L, assuming a Euclidean metric.
- Goal: find a proper Gram matrix K with K_mn = 〈x_m, x_n〉, so that ‖x_i − x_j‖_2^2 = K_ii − 2K_ij + K_jj.
- Recover the low-dimensional embedding by eigendecomposing K.
- K ⪰ 0: a semidefinite programming problem.
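The Gram-matrix identity and the eigendecomposition recovery can be checked numerically; a minimal sketch (the points and dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))              # 5 points in R^2 (illustrative)
X -= X.mean(axis=0)                          # centering gives sum_ab K_ab = 0
K = X @ X.T                                  # Gram matrix, K_mn = <x_m, x_n>

# ||x_i - x_j||_2^2 = K_ii - 2 K_ij + K_jj
d2 = np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :]

# Recover a low-dimensional embedding by eigendecomposing K (K is PSD)
w, V = np.linalg.eigh(K)                     # eigenvalues in ascending order
Y = V[:, -2:] * np.sqrt(np.clip(w[-2:], 0, None))

# The embedding reproduces the pairwise distances encoded in K
d2_rec = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
assert np.allclose(d2, d2_rec)
```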
F1- Solution by SDP
Generalized nonmetric multidimensional scaling (GNMDS) [1]
min ∑_{ijkl} ξ_ijkl + λ tr(K)
s.t. (K_kk − 2K_kl + K_ll) − (K_ii − 2K_ij + K_jj) + ξ_ijkl ≥ 1
     ∑_{ab} K_ab = 0
     ξ_ijkl ≥ 0, K ⪰ 0
This is a semidefinite program and can be solved by SeDuMi, CSDP, SDPT3, etc.
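The talk solves the SDP with off-the-shelf solvers; purely as an illustration, the same objective can be attacked with a first-order sketch: hinge penalties on violated rank constraints, a trace penalty, centering, and projection onto the PSD cone. The step size, penalty weight, iteration count, and toy instance below are my own choices, not from the talk:

```python
import numpy as np

def gnmds_first_order(n, quads, lam=0.01, lr=0.05, iters=200):
    """First-order sketch of the GNMDS objective (the talk uses SDP solvers
    such as SeDuMi/CSDP/SDPT3, so treat this as illustration only).
    quads lists constraints (i, j, k, l) meaning d_ij <= d_kl."""
    sq = lambda K, a, b: K[a, a] - 2 * K[a, b] + K[b, b]  # squared distance
    C = np.eye(n) - np.ones((n, n)) / n                   # centering projector
    K = np.eye(n)
    for _ in range(iters):
        G = lam * np.eye(n)                               # gradient of lam * tr(K)
        for i, j, k, l in quads:
            if sq(K, k, l) - sq(K, i, j) < 1:             # hinge term is active
                for a, b, s in ((k, l, -1.0), (i, j, 1.0)):
                    G[a, a] += s                          # subgradient of the hinge:
                    G[b, b] += s                          # push d_kl^2 up, d_ij^2 down
                    G[a, b] -= s
                    G[b, a] -= s
        K = C @ (K - lr * G) @ C                          # descent step, then center
        w, V = np.linalg.eigh(K)
        K = (V * np.clip(w, 0, None)) @ V.T               # project onto the PSD cone
    return K

# Toy instance: 4 points on a line, all strict distance comparisons as input
pts = np.array([0.0, 1.0, 3.0, 6.0])
pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
quads = [(i, j, k, l) for (i, j) in pairs for (k, l) in pairs
         if abs(pts[i] - pts[j]) < abs(pts[k] - pts[l])]
K = gnmds_first_order(4, quads)
```

The learned K stays centered and PSD by construction, and the hinge subgradients drive the squared distances toward the input ordering.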
F1- Solution by SDP (cont.)
Modified GNMDS
min ∑_{ijkl} ξ_ijkl + λ tr(K)
s.t. (K_kk − 2K_kl + K_ll) − (K_ii − 2K_ij + K_jj) − ξ_ijkl ≥ 1
     ∑_{ab} K_ab = 0
     ξ_ijkl ≥ 0, K ⪰ 0
This ensures that differences between distances are at least 1, pulling the embedded samples apart.
Example: City Location Recovery
[Figure: true city locations vs. the GNMDS recovery and vs. the modified GNMDS recovery, with the eigenvalue spectra of the two learned Gram matrices; only axis ticks and point labels survive extraction.]
10 cities in Europe, 850 rank pairs, 2-D embedding space. Error rate: 20.3% (GNMDS) vs. 14.14% (modified GNMDS).
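Error rates like these can be computed as the fraction of input rank constraints that a recovered embedding violates; the talk does not spell out its exact metric, so the helper below is an assumption:

```python
import numpy as np

def rank_error_rate(Y, quads):
    """Fraction of constraints (i, j, k, l) -- meaning d_ij <= d_kl -- that the
    embedding Y (one row per sample) violates.  Hypothetical metric: the talk
    does not specify how its 20.3% / 14.14% figures were computed."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    violated = sum(d2[i, j] > d2[k, l] for i, j, k, l in quads)
    return violated / len(quads)
```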
Basic Ideas · Solution by QP
F2- Basic Ideas
- F2: Input: coefficients in R^D, ranks d_ij ≤ d_kl. Output: d : R^D × R^D → R.
- Goal: estimate a proper dissimilarity measure d(·, ·) from a limited amount of ranking information.
- Minimize the empirical error while controlling the complexity of d (structural risk minimization): regularized dissimilarity learning.
- Reminiscent of large-margin classifiers, e.g. the SVM.
F2- Solution by QP
d-ranking-VM, primal problem:
min (1/N) ∑_{ijkl} ξ_ijkl + λ‖d‖_H^2
s.t. ∀(i, j, k, l) ∈ S: d(x_k, x_l) − d(x_i, x_j) − ξ_ijkl ≥ 1,
     ξ_ijkl ≥ 0

where S is the ranking set, with diss(k, l) > diss(i, j). Solving this problem needs the following representer theorem for hyper-RKHS:
Representer Theorem for Hyper-RKHS
Theorem
Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, by X a set, and by L : (X^2 × R^2)^M → R ∪ {∞} an arbitrary loss function. Then each minimizer d ∈ H of the regularized risk

L((x_1, y_1, d(x_1)), . . . , (x_M, y_M, d(x_M))) + Ω(‖d‖_H)

admits a representation of the form d(x) = ∑_{i=1}^{M} c_i k(x_i, x), where c_i ∈ R and H is a hyper-RKHS induced by the hyperkernel k.
F2- Solution by QP
Utilizing KKT conditions, we arrive at the dual problem:
max ∑_p α_p − Aᵀ(Q − P)ᵀ K (Q − P) A / (4λ)
s.t. α_p ≥ 0, ∀(i, j, k, l) ∈ S

where K ∈ R^{M×M} is the hyperkernel matrix over the M dissimilarity pairs, N = |S|, A ∈ R^N is the vector whose pth element is α_p, and P, Q ∈ R^{M×N} encode the rank information. (With P, Q ∈ R^{M×N}, the quadratic form is well-defined only for an M × M kernel matrix.)
This is a quadratic program; it can be solved by general-purpose optimization tools or by specialized sequential methods, e.g. SMO.
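Since the Hessian Bᵀ K B (with B = Q − P) is positive semidefinite, even plain projected gradient ascent converges on this dual. A sketch under made-up dimensions and a placeholder PSD matrix standing in for the hyperkernel matrix (the talk itself suggests QP solvers or SMO):

```python
import numpy as np

def dual_pga(K, P, Q, lam=1.0, iters=500):
    """Projected gradient ascent on the d-ranking-VM dual
        max_alpha  1^T alpha - alpha^T B^T K B alpha / (4 lam),  alpha >= 0,
    with B = Q - P.  A sketch only, not the SMO solver mentioned in the talk."""
    B = Q - P
    H = B.T @ K @ B / (2.0 * lam)             # gradient of the quadratic part is H @ alpha
    lr = 1.0 / (np.linalg.norm(H, 2) + 1.0)   # step below 1 / Lipschitz constant
    alpha = np.zeros(P.shape[1])
    for _ in range(iters):
        grad = 1.0 - H @ alpha                # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, None)   # project onto alpha >= 0
    return alpha

# Made-up instance: M = 4 dissimilarity pairs, N = 3 rank constraints
rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))
K = G @ G.T + 1e-6 * np.eye(4)                # PSD stand-in for the hyperkernel matrix
P = np.zeros((4, 3))
Q = np.zeros((4, 3))
for p, (ij, kl) in enumerate([(0, 1), (1, 2), (2, 3)]):
    P[ij, p] = 1.0                            # constraint p says diss(kl) > diss(ij)
    Q[kl, p] = 1.0
alpha = dual_pga(K, P, Q)
```

With a step size below 1/‖H‖ the iterates increase the dual objective monotonically, so the final α is feasible and strictly better than the zero start.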
F2- Hyperkernels [2,3]
Proposition
Let k_a(·, ·) and k_b(·, ·) be positive definite kernels. Then for all x_1, x_1′, x_2, x_2′ ∈ X and all α, β > 0, both (k_a(x_1, x_2))^α (k_b(x_1′, x_2′))^β and αk_a(x_1, x_2) + βk_b(x_1′, x_2′) give a hyperkernel k.
Examples:

k((x_1, x_1′), (x_2, x_2′)) = exp(−(‖x_1 − x_2‖^2 + ‖x_1′ − x_2′‖^2) / (2σ^2))

k((x_1, x_1′), (x_2, x_2′)) = exp(−‖x_1 − x_2‖^2 / (2σ^2)) + exp(−‖x_1′ − x_2′‖^2 / (2σ^2))
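The first (Gaussian) hyperkernel, together with the representer theorem, gives a concrete form for the learned dissimilarity, d(x, x′) = ∑_p c_p k(pair_p, (x, x′)). A sketch (in practice the coefficients c_p come from the QP; here they are placeholders):

```python
import numpy as np

def gaussian_hyperkernel(p1, p2, sigma=1.0):
    """k((x1, x1'), (x2, x2')) = exp(-(||x1 - x2||^2 + ||x1' - x2'||^2) / (2 sigma^2))."""
    (x1, x1p), (x2, x2p) = p1, p2
    s = np.sum((x1 - x2) ** 2) + np.sum((x1p - x2p) ** 2)
    return np.exp(-s / (2.0 * sigma ** 2))

def dissimilarity(train_pairs, c, x, xp, sigma=1.0):
    """Representer-theorem expansion d(x, x') = sum_p c_p k(pair_p, (x, x'));
    c would be obtained by solving the dual QP (placeholder values here)."""
    return sum(cp * gaussian_hyperkernel(tp, (x, xp), sigma)
               for tp, cp in zip(train_pairs, c))
```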
Experiments · Discussions and Conclusions · Open Problems
Experiments
Obtaining ranks of pairs from data:

M = C(n, 2),  N = C(M, 2) = n^4/8 − n^3/4 − n^2/8 + n/4

Table: Some examples of N vs. n.

n | 2 | 3 | 4 | 10 | 20 | 50 | 100 | 1000
N | 0 | 3 | 15 | 990 | 17955 | 749700 | 12248775 | 1.2475e+11

If A > B and B > C, then A > C can be ignored: N = n^2/2 − n/2 − 1.
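These counts follow directly from binomial coefficients; a small helper reproduces the table (the `transitive` flag, a name of my own choosing, implements the M − 1 reduction):

```python
from math import comb

def num_rank_constraints(n, transitive=False):
    """Number of dissimilarity-rank comparisons among n samples.
    Without transitivity: N = C(M, 2), where M = C(n, 2) pairwise dissimilarities.
    Exploiting transitivity (A > B and B > C imply A > C), a total order on the
    M distances needs only M - 1 adjacent comparisons: n^2/2 - n/2 - 1."""
    M = comb(n, 2)
    return M - 1 if transitive else comb(M, 2)
```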
Experiments: 109 US Cities
2-D locations of 109 US cities. Only adjacent dissimilarities are considered: 5885 rankings of pairwise distances.
[Figure: 2-D embeddings recovered for the 109 US cities; only axis ticks survive extraction.]
Experiments: USPS Digits and UMist Human Faces
[Figure: results of d-ranking-VM on the USPS digits and UMist human faces.]
Discussions and Conclusions
- Learning orderings can be solved by SDP and by QP.
- Pros of GNMDS: recovers a low-dimensional embedding; needs only rank information, not the values of the original samples.
- Cons of GNMDS: solving the SDP is hard, especially for large-scale problems; existing SDP solvers handle only N < 50; cannot predict unseen samples.
- Pros of learning orderings by QP: solving a QP is much easier than an SDP; with SMO, problems with N > 10^3 can be solved; can make predictions for unseen samples.
- Cons of learning orderings by QP: cannot recover a low-dimensional embedding explicitly; learning a dissimilarity measure needs the values of the original samples; how to choose a good hyperkernel is an open problem.
Open Challenges
- The regularization properties of hyperkernels; better hyperkernels.
- The formulation F3: Input: coefficients in R^D, orderings d_ij ≤ d_kl. Output: f : R^D → R^L, coefficients in R^L.
- Real-world nonmetric experiments on manifolds.
- Ranking to classification? Learning optimal dissimilarities for the kNN classifier.
References
[1] S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie, "Generalized Non-metric Multidimensional Scaling", AISTATS 2007.
[2] C. S. Ong, A. J. Smola, and R. C. Williamson, "Learning the Kernel with Hyperkernels", JMLR 2005.
[3] R. Kondor and T. Jebara, "Gaussian and Wishart Hyperkernels", NIPS 2006.
Thank you. Q&A.