Introduction · F1: Solution by Semidefinite Programming · F2: Solution by Quadratic Programming · Experiments, Discussions, Conclusions
Learning Dissimilarities by Ranking: From SDP to QP
Hua Ouyang, Alexander Gray
FASTLab, College of Computing
Georgia Institute of Technology
07/07/2008 ICML
Outline
1 Introduction: Motivation and Problem
2 Solution by Semidefinite Programming
3 Solution by Quadratic Programming
4 Experiments
5 Discussions and Conclusions
Motivations · Problem Formulation
Introduction and Motivations
- Ranking: learning of orderings: "A ranks lower than B" or "C ranks higher than D" (e.g. SVM ranking, RankBoost)
- Our focus: learning rankings of dissimilarities between two samples (d-ranking): "A is more similar to B than C is to D" or "The distance between E and F is larger than that between G and H"
- Not well studied; existing work: nonmetric MDS (NMDS), Generalized NMDS (GNMDS)
- A kind of dissimilarity (metric) learning, cf. optimal Mahalanobis distance, kernel learning, multidimensional scaling
- Preserving metrics (distances) vs. preserving dissimilarities
Applications of d-ranking
Any applications? Wherever exact distances do not exist or are not accurate: movie recommendation (e.g. Netflix), protein folding, social science, information retrieval, ...
[Figure: illustration with points A, B, C, D and X; only the labels survive extraction.]
Problem Formulation of d-ranking
Different formulations:
- F1: Input: ranks d_ij ≤ d_kl. Output: coefficients in R^L, assuming a Euclidean metric in R^L.
- F2: Input: coefficients in R^D, ranks d_ij ≤ d_kl. Output: an explicit dissimilarity d : R^D × R^D → R.
- F3: Input: coefficients in R^D, ranks d_ij ≤ d_kl. Output: f : R^D → R^L, coefficients in R^L.
We investigate F1 and F2 in this work; F3 is left as future work.
Basic Ideas · Solution by SDP
F1- Basic Ideas
- F1: Input: ranks d_ij ≤ d_kl. Output: coefficients in R^L, assuming a Euclidean metric.
- Goal: find a proper Gram matrix K with K_mn = 〈x_m, x_n〉, so that ‖x_i − x_j‖_2^2 = K_ii − 2K_ij + K_jj.
- Recover the low-dimensional embedding by eigendecomposing K.
- K ⪰ 0: a semidefinite programming problem.
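The Gram-matrix identity and the eigendecomposition recovery can be checked numerically; a minimal sketch (the points and dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))              # 5 points in R^2 (illustrative)
X -= X.mean(axis=0)                          # centering gives sum_ab K_ab = 0
K = X @ X.T                                  # Gram matrix, K_mn = <x_m, x_n>

# ||x_i - x_j||_2^2 = K_ii - 2 K_ij + K_jj
d2 = np.diag(K)[:, None] - 2 * K + np.diag(K)[None, :]

# Recover a low-dimensional embedding by eigendecomposing K (K is PSD)
w, V = np.linalg.eigh(K)                     # eigenvalues in ascending order
Y = V[:, -2:] * np.sqrt(np.clip(w[-2:], 0, None))

# The embedding reproduces the pairwise distances encoded in K
d2_rec = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
assert np.allclose(d2, d2_rec)
```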
F1- Solution by SDP
Generalized nonmetric multidimensional scaling (GNMDS) [1]
min ∑_{ijkl} ξ_ijkl + λ tr(K)
s.t. (K_kk − 2K_kl + K_ll) − (K_ii − 2K_ij + K_jj) + ξ_ijkl ≥ 1
     ∑_{ab} K_ab = 0
     ξ_ijkl ≥ 0, K ⪰ 0
This is a semidefinite program and can be solved by SeDuMi, CSDP, SDPT3, etc.
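The talk solves the SDP with off-the-shelf solvers; purely as an illustration, the same objective can be attacked with a first-order sketch: hinge penalties on violated rank constraints, a trace penalty, centering, and projection onto the PSD cone. The step size, penalty weight, iteration count, and toy instance below are my own choices, not from the talk:

```python
import numpy as np

def gnmds_first_order(n, quads, lam=0.01, lr=0.05, iters=200):
    """First-order sketch of the GNMDS objective (the talk uses SDP solvers
    such as SeDuMi/CSDP/SDPT3, so treat this as illustration only).
    quads lists constraints (i, j, k, l) meaning d_ij <= d_kl."""
    sq = lambda K, a, b: K[a, a] - 2 * K[a, b] + K[b, b]  # squared distance
    C = np.eye(n) - np.ones((n, n)) / n                   # centering projector
    K = np.eye(n)
    for _ in range(iters):
        G = lam * np.eye(n)                               # gradient of lam * tr(K)
        for i, j, k, l in quads:
            if sq(K, k, l) - sq(K, i, j) < 1:             # hinge term is active
                for a, b, s in ((k, l, -1.0), (i, j, 1.0)):
                    G[a, a] += s                          # subgradient of the hinge:
                    G[b, b] += s                          # push d_kl^2 up, d_ij^2 down
                    G[a, b] -= s
                    G[b, a] -= s
        K = C @ (K - lr * G) @ C                          # descent step, then center
        w, V = np.linalg.eigh(K)
        K = (V * np.clip(w, 0, None)) @ V.T               # project onto the PSD cone
    return K

# Toy instance: 4 points on a line, all strict distance comparisons as input
pts = np.array([0.0, 1.0, 3.0, 6.0])
pairs = [(a, b) for a in range(4) for b in range(a + 1, 4)]
quads = [(i, j, k, l) for (i, j) in pairs for (k, l) in pairs
         if abs(pts[i] - pts[j]) < abs(pts[k] - pts[l])]
K = gnmds_first_order(4, quads)
```

The learned K stays centered and PSD by construction, and the hinge subgradients drive the squared distances toward the input ordering.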
F1- Solution by SDP (cont.)
Modified GNMDS
min ∑_{ijkl} ξ_ijkl + λ tr(K)
s.t. (K_kk − 2K_kl + K_ll) − (K_ii − 2K_ij + K_jj) − ξ_ijkl ≥ 1
     ∑_{ab} K_ab = 0
     ξ_ijkl ≥ 0, K ⪰ 0
This ensures that differences between distances are at least 1, pulling the embedded samples apart.
Example: City Location Recovery
[Figure: true city locations vs. the GNMDS recovery and vs. the modified GNMDS recovery, with the eigenvalue spectra of the two learned Gram matrices; only axis ticks and point labels survive extraction.]
10 cities in Europe, 850 rank pairs, 2-D embedding space. Error rate: 20.3% (GNMDS) vs. 14.14% (modified GNMDS).
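Error rates like these can be computed as the fraction of input rank constraints that a recovered embedding violates; the talk does not spell out its exact metric, so the helper below is an assumption:

```python
import numpy as np

def rank_error_rate(Y, quads):
    """Fraction of constraints (i, j, k, l) -- meaning d_ij <= d_kl -- that the
    embedding Y (one row per sample) violates.  Hypothetical metric: the talk
    does not specify how its 20.3% / 14.14% figures were computed."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    violated = sum(d2[i, j] > d2[k, l] for i, j, k, l in quads)
    return violated / len(quads)
```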
Basic Ideas · Solution by QP
F2- Basic Ideas
- F2: Input: coefficients in R^D, ranks d_ij ≤ d_kl. Output: d : R^D × R^D → R.
- Goal: estimate a proper dissimilarity measure d(·, ·) from a limited amount of ranking information.
- Minimize the empirical error while controlling the complexity of d (structural risk minimization): regularized dissimilarity learning.
- Reminiscent of large-margin classifiers, e.g. the SVM.
F2- Solution by QP
d-ranking-VM, primal problem:
min (1/N) ∑_{ijkl} ξ_ijkl + λ‖d‖_H^2
s.t. ∀(i, j, k, l) ∈ S: d(x_k, x_l) − d(x_i, x_j) − ξ_ijkl ≥ 1,
     ξ_ijkl ≥ 0

where S is the ranking set, with diss(k, l) > diss(i, j). Solving this problem needs the following representer theorem for hyper-RKHS:
Representer Theorem for Hyper-RKHS
Theorem
Denote by Ω : [0, ∞) → R a strictly monotonically increasing function, by X a set, and by L : (X^2 × R^2)^M → R ∪ {∞} an arbitrary loss function. Then each minimizer d ∈ H of the regularized risk

L((x_1, y_1, d(x_1)), . . . , (x_M, y_M, d(x_M))) + Ω(‖d‖_H)

admits a representation of the form d(x) = ∑_{i=1}^{M} c_i k(x_i, x), where c_i ∈ R and H is a hyper-RKHS induced by the hyperkernel k.
F2- Solution by QP
Utilizing KKT conditions, we arrive at the dual problem:
max ∑_p α_p − Aᵀ(Q − P)ᵀ K (Q − P) A / (4λ)
s.t. α_p ≥ 0, ∀(i, j, k, l) ∈ S

where K ∈ R^{M×M} is the hyperkernel matrix over the M dissimilarity pairs, N = |S|, A ∈ R^N is the vector whose pth element is α_p, and P, Q ∈ R^{M×N} encode the rank information. (With P, Q ∈ R^{M×N}, the quadratic form is well-defined only for an M × M kernel matrix.)
This is a quadratic program; it can be solved by general-purpose optimization tools or by specialized sequential methods, e.g. SMO.
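Since the Hessian Bᵀ K B (with B = Q − P) is positive semidefinite, even plain projected gradient ascent converges on this dual. A sketch under made-up dimensions and a placeholder PSD matrix standing in for the hyperkernel matrix (the talk itself suggests QP solvers or SMO):

```python
import numpy as np

def dual_pga(K, P, Q, lam=1.0, iters=500):
    """Projected gradient ascent on the d-ranking-VM dual
        max_alpha  1^T alpha - alpha^T B^T K B alpha / (4 lam),  alpha >= 0,
    with B = Q - P.  A sketch only, not the SMO solver mentioned in the talk."""
    B = Q - P
    H = B.T @ K @ B / (2.0 * lam)             # gradient of the quadratic part is H @ alpha
    lr = 1.0 / (np.linalg.norm(H, 2) + 1.0)   # step below 1 / Lipschitz constant
    alpha = np.zeros(P.shape[1])
    for _ in range(iters):
        grad = 1.0 - H @ alpha                # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, None)   # project onto alpha >= 0
    return alpha

# Made-up instance: M = 4 dissimilarity pairs, N = 3 rank constraints
rng = np.random.default_rng(1)
G = rng.standard_normal((4, 4))
K = G @ G.T + 1e-6 * np.eye(4)                # PSD stand-in for the hyperkernel matrix
P = np.zeros((4, 3))
Q = np.zeros((4, 3))
for p, (ij, kl) in enumerate([(0, 1), (1, 2), (2, 3)]):
    P[ij, p] = 1.0                            # constraint p says diss(kl) > diss(ij)
    Q[kl, p] = 1.0
alpha = dual_pga(K, P, Q)
```

With a step size below 1/‖H‖ the iterates increase the dual objective monotonically, so the final α is feasible and strictly better than the zero start.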
F2- Hyperkernels [2,3]
Proposition
Let k_a(·, ·) and k_b(·, ·) be positive definite kernels. Then for all x_1, x_1′, x_2, x_2′ ∈ X and all α, β > 0, both (k_a(x_1, x_2))^α (k_b(x_1′, x_2′))^β and αk_a(x_1, x_2) + βk_b(x_1′, x_2′) give a hyperkernel k.
Examples:

k((x_1, x_1′), (x_2, x_2′)) = exp(−(‖x_1 − x_2‖^2 + ‖x_1′ − x_2′‖^2) / (2σ^2))

k((x_1, x_1′), (x_2, x_2′)) = exp(−‖x_1 − x_2‖^2 / (2σ^2)) + exp(−‖x_1′ − x_2′‖^2 / (2σ^2))
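The first (Gaussian) hyperkernel, together with the representer theorem, gives a concrete form for the learned dissimilarity, d(x, x′) = ∑_p c_p k(pair_p, (x, x′)). A sketch (in practice the coefficients c_p come from the QP; here they are placeholders):

```python
import numpy as np

def gaussian_hyperkernel(p1, p2, sigma=1.0):
    """k((x1, x1'), (x2, x2')) = exp(-(||x1 - x2||^2 + ||x1' - x2'||^2) / (2 sigma^2))."""
    (x1, x1p), (x2, x2p) = p1, p2
    s = np.sum((x1 - x2) ** 2) + np.sum((x1p - x2p) ** 2)
    return np.exp(-s / (2.0 * sigma ** 2))

def dissimilarity(train_pairs, c, x, xp, sigma=1.0):
    """Representer-theorem expansion d(x, x') = sum_p c_p k(pair_p, (x, x'));
    c would be obtained by solving the dual QP (placeholder values here)."""
    return sum(cp * gaussian_hyperkernel(tp, (x, xp), sigma)
               for tp, cp in zip(train_pairs, c))
```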
Experiments · Discussions and Conclusions · Open Problems
Experiments
Obtaining ranks of pairs from data:

M = C(n, 2),  N = C(M, 2) = n^4/8 − n^3/4 − n^2/8 + n/4

Table: Some examples of N vs. n.

n | 2 | 3 | 4 | 10 | 20 | 50 | 100 | 1000
N | 0 | 3 | 15 | 990 | 17955 | 749700 | 12248775 | 1.2475e+11

If A > B and B > C, then A > C can be ignored: N = n^2/2 − n/2 − 1.
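These counts follow directly from binomial coefficients; a small helper reproduces the table (the `transitive` flag, a name of my own choosing, implements the M − 1 reduction):

```python
from math import comb

def num_rank_constraints(n, transitive=False):
    """Number of dissimilarity-rank comparisons among n samples.
    Without transitivity: N = C(M, 2), where M = C(n, 2) pairwise dissimilarities.
    Exploiting transitivity (A > B and B > C imply A > C), a total order on the
    M distances needs only M - 1 adjacent comparisons: n^2/2 - n/2 - 1."""
    M = comb(n, 2)
    return M - 1 if transitive else comb(M, 2)
```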
Experiments: 109 US Cities
2-D locations of 109 US cities. Only adjacent dissimilarities are considered: 5885 rankings of pairwise distances.
[Figure: 2-D embeddings recovered for the 109 US cities; only axis ticks survive extraction.]
Experiments: USPS Digits and UMist Human Faces
[Figure: results of d-ranking-VM on the USPS digits and UMist human faces.]
Discussions and Conclusions
- Learning orderings can be solved by SDP and by QP.
- Pros of GNMDS: recovers a low-dimensional embedding; needs only rank information, not the values of the original samples.
- Cons of GNMDS: solving the SDP is hard, especially for large-scale problems; existing SDP solvers handle only N < 50; cannot predict unseen samples.
- Pros of learning orderings by QP: solving a QP is much easier than an SDP; with SMO, problems with N > 10^3 can be solved; can make predictions for unseen samples.
- Cons of learning orderings by QP: cannot recover a low-dimensional embedding explicitly; learning a dissimilarity measure needs the values of the original samples; how to choose a good hyperkernel is an open problem.
Open Challenges
- The regularization properties of hyperkernels; better hyperkernels.
- The formulation F3: Input: coefficients in R^D, orderings d_ij ≤ d_kl. Output: f : R^D → R^L, coefficients in R^L.
- Real-world nonmetric experiments on manifolds.
- Ranking to classification? Learning optimal dissimilarities for the kNN classifier.
References
[1] S. Agarwal, J. Wills, L. Cayton, G. Lanckriet, D. Kriegman, and S. Belongie, "Generalized Non-metric Multidimensional Scaling", AISTATS 2007.
[2] C. S. Ong, A. J. Smola, and R. C. Williamson, "Learning the Kernel with Hyperkernels", JMLR 2005.
[3] R. Kondor and T. Jebara, "Gaussian and Wishart Hyperkernels", NIPS 2006.
Thank you. Q&A.