
1

Heterogeneous Cross Domain Ranking in Latent Space

Bo Wang 1, Jie Tang 2, Wei Fan 3, Songcan Chen 1, Zi Yang 2, Yanzhu Liu 4

1 Nanjing University of Aeronautics and Astronautics

2 Tsinghua University

3 IBM T.J. Watson Research Center, USA

4 Peking University

2

Introduction

• The web is becoming more and more heterogeneous

• Ranking is a fundamental problem on the web

– unsupervised vs. supervised

– homogeneous vs. heterogeneous

3

Motivation

[Figure: a heterogeneous academic network. Authors (Dr. Tang, Limin, Prof. Wang, Prof. Li), papers (SVM..., Association..., Tree CRF..., Semantic..., EOS..., Annotation...) and conferences (IJCAI, ISWC, WWW) are linked by write, cite, publish, coauthor and PC-member relations.]

Heterogeneous cross domain ranking

[Figure: for the query “data mining”, three types of objects are listed: conferences (KDD, SDM, ICDM, PAKDD), papers (Principles of Data Mining; Data Mining: Concepts and Techniques) and authors (P. Yu), with several entries shown as question marks.]

Main Challenges

1) How to capture the correlation between heterogeneous objects?

2) How to preserve the preference orders between objects across heterogeneous domains?

4

Outline

• Related Work

• Heterogeneous cross domain ranking

• Experiments

• Conclusion

5

Related Work

• Learning to rank

– Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07]

– Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]

– Ranking adaptation: [Chen, 08]

• Transfer learning

– Instance-based: [Dai, 07] [Gao, 08]

– Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07]

– Model-based: [Bonilla, 08]

6

Outline

• Related Work

• Heterogeneous cross domain ranking

– Basic idea

– Proposed algorithm: HCDRank

• Experiments

• Conclusion

7

[Figure: basic idea. For the query “data mining”, a conference ranking (KDD, SDM, ADMA, PKDD, PAKDD) and an expert ranking (Jiawei Han, Jie Tang, Bo Wang, Alice, Jerry, Bob, Tom), labelled source domain and target domain, are mapped into a shared latent space; mis-ranked pairs are marked in both domains.]
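The idea of counting mis-ranked pairs in both domains after mapping into a shared latent space can be sketched as follows. This is only an illustration under stated assumptions: the projection matrix `U`, the weight vectors `w_src` and `w_tgt`, the trade-off `C`, and all function names are hypothetical and do not come from the slides.

```python
from itertools import combinations

def project(x, U):
    """Map a d-dimensional feature vector x into a k-dimensional latent space (U is d x k)."""
    return [sum(x[a] * U[a][b] for a in range(len(x))) for b in range(len(U[0]))]

def score(x, U, w):
    """Linear ranking score of x, computed in the latent space."""
    return sum(wb * zb for wb, zb in zip(w, project(x, U)))

def misranked_pairs(scores, labels):
    """Count pairs whose label order is contradicted (or tied) by the scores."""
    count = 0
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # no preference between equally labelled objects
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        if scores[hi] <= scores[lo]:
            count += 1
    return count

def joint_misranked(U, w_src, w_tgt, src, tgt, C=1.0):
    """src and tgt are (feature_vectors, labels) tuples over the same d features.

    Returns source mis-ranked pairs plus C times target mis-ranked pairs.
    """
    src_scores = [score(x, U, w_src) for x in src[0]]
    tgt_scores = [score(x, U, w_tgt) for x in tgt[0]]
    return misranked_pairs(src_scores, src[1]) + C * misranked_pairs(tgt_scores, tgt[1])
```

A learning algorithm would search for `U`, `w_src` and `w_tgt` that keep this joint count small; in practice the count is replaced by a smooth or convex surrogate loss.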

8

The Proposed Algorithm: HCDRank

[Objective function (equation image not preserved), with annotations: How to define? How to optimize? Non-convex; dual problem.]
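As a rough guide to the shape such an objective can take, here is a generic sketch of a regularized two-domain ranking objective. It is consistent with the conclusion's description (two ranking losses minimized simultaneously under a regularized framework), but every symbol below is an assumption for illustration and is not taken from the slide.

```latex
\min_{w_S,\, w_T} \;
  \sum_{(i,j) \in P_S} \ell\bigl(w_S^\top (x_i - x_j)\bigr)
  \;+\; C \sum_{(i,j) \in P_T} \ell\bigl(w_T^\top (x_i - x_j)\bigr)
  \;+\; \lambda\, J(w_S, w_T)
```

Here $P_S$ and $P_T$ stand for the preference pairs in the source and target domains, $\ell$ for a pairwise ranking loss such as the hinge loss $\ell(z) = \max(0, 1 - z)$, $C$ for the weight of the target-domain loss, and $J$ for a regularizer that couples the two weight vectors through shared latent features.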

9

[Algorithm listing (image not preserved), with complexity annotations:]

• Alternately optimize matrices M and D: O(2T · sN log N)

• Construct the transformation matrix: O(d^3)

• Learning in the latent space: O(sN log N)

• Total: O((2T+1) · sN log N + d^3)
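The per-step costs add up to the stated total, which a few lines of arithmetic make explicit. The slide does not define the symbols, so the readings used here are assumptions: `T` as the number of alternating iterations, `N` as the number of training instances, `s` as a per-instance feature factor, and `d` as the feature dimension. The function names are hypothetical, and constant factors are dropped.

```python
import math

def hcdrank_cost_terms(T, s, N, d):
    """Return the three cost terms annotated on the slide (constants dropped)."""
    rank_unit = s * N * math.log(N)  # one "sN log N" ranking pass
    return {
        "alternate_M_and_D": 2 * T * rank_unit,  # O(2T * sN log N)
        "transformation_matrix": d ** 3,         # O(d^3)
        "latent_learning": rank_unit,            # O(sN log N)
    }

def hcdrank_total_cost(T, s, N, d):
    """Sum of the three terms, i.e. (2T + 1) * sN log N + d^3."""
    return sum(hcdrank_cost_terms(T, s, N, d).values())
```

For large N and moderate d the ranking passes dominate, and the cubic term matters only when the feature dimension is large.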

10

Outline

• Related Work

• Heterogeneous cross domain ranking

• Experiments

– Ranking on Homogeneous data

– Ranking on Heterogeneous data

– Ranking on Heterogeneous tasks

• Conclusion

11

Experiments

• Data sets

– Homogeneous data set: LETOR_TR

• 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR and OHSUMED_TR

– Heterogeneous academic data set: ArnetMiner.org

• 14,134 authors, 10,716 papers, and 1,434 conferences

– Heterogeneous task data set

• 9 queries, 900 experts, 450 best supervisor candidates

• Evaluation measures

– MAP

– NDCG
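For reference, the two evaluation measures can be computed as in the sketch below. The slides do not say which NDCG variant was used; the gain `2^rel - 1` with a `log2(rank + 1)` discount is an assumption (it is a common choice), and the function names are illustrative.

```python
import math

def average_precision(rels):
    """AP for one query; rels lists relevance in ranked order (> 0 means relevant)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0

def mean_average_precision(rels_per_query):
    """MAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in rels_per_query) / len(rels_per_query)

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top k positions (assumed gain 2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0
```

MAP treats relevance as binary, while NDCG rewards placing highly graded items near the top, so the two can disagree on graded data.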

12

Ranking on Homogeneous data

• LETOR_TR

– We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario

– Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR

• Baselines

13

[Figure: result panels for OHSUMED_TR, TREC2004_TR and TREC2003_TR, annotated with cosine similarity values; 0.01, 0.23 and 0.18 appear on the slide.]
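The cosine similarity values quoted on this slide follow the standard definition, sketched below. The slide does not say which pair of vectors is compared for each data set, so the inputs `u` and `v` are left generic here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A value near 0 indicates nearly orthogonal vectors, and a value near 1 indicates nearly parallel ones.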

14

Training Time

15

Ranking on Heterogeneous data

• ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences

• Training and test data set:

– 44 most frequently queried keywords from the log file

• Author collection: Libra, Rexa and ArnetMiner

• Conference collection: Libra, ArnetMiner

• Ground truth:

– Conference: online resources

– Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking

16

Feature Definition

Features Description

L1-L10 Low-level language model features

H1-H3 High-level language model features

S1 Number of years the conference has been held

S2 Total citations of the conference over the most recent 5 years

S3 Total citations of the conference over the most recent 10 years

S4 Number of years since the expert's first paper

S5 Total citations of all the expert's publications

S6 Number of the expert's papers cited more than 5 times

S7 Number of the expert's papers cited more than 10 times
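As an illustration of how the expert-side features S4-S7 could be derived from a publication list, consider the sketch below. The input format (a list of `(year, citation_count)` tuples), the `current_year` parameter and the function name are hypothetical; the slides do not describe the extraction pipeline.

```python
def expert_features(papers, current_year):
    """Compute S4-S7 for one expert from (year, citation_count) tuples (assumed format)."""
    if not papers:
        return {"S4": 0, "S5": 0, "S6": 0, "S7": 0}
    first_year = min(year for year, _ in papers)
    return {
        "S4": current_year - first_year,               # years since first paper
        "S5": sum(c for _, c in papers),               # total citations
        "S6": sum(1 for _, c in papers if c > 5),      # papers cited more than 5 times
        "S7": sum(1 for _, c in papers if c > 10),     # papers cited more than 10 times
    }
```

The conference-side features S1-S3 could be computed analogously from per-year conference records.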

17

Expert Finding Results

18

Feature Correlation Analysis

19

Ranking on Heterogeneous tasks

• Expert finding task vs. best supervisor finding task

• Training and test data set:

– Expert finding task: ranking lists from ArnetMiner or annotated lists

– Best supervisor finding task: the 9 most frequent queries from the ArnetMiner log file

• For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation

• Ground truth:

– Collected feedback about the candidates (yes / no / not sure)

20

Feature Definition

Features Description

L1-L10 Low-level language model features

H1-H3 High-level language model features

B1 The year he/she published his/her first paper

B2 The number of papers of an expert

B3 The number of papers in recent 2 years

B4 The number of papers in recent 5 years

B5 The number of citations of all his/her papers

B6 The number of papers cited more than 5 times

B7 The number of papers cited more than 10 times

B8 PageRank score

SumCo1-SumCo8 The sum of coauthors’ B1-B8 scores

AvgCo1-AvgCo8 The average of coauthors’ B1-B8 scores

SumStu1-SumStu8 The sum of his/her advisees’ B1-B8 scores

AvgStu1-AvgStu8 The average of his/her advisees’ B1-B8 scores
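The Sum/Avg features aggregate the B1-B8 scores of an expert's coauthors or advisees. A minimal sketch follows, assuming a hypothetical layout in which `base` maps each person to a list of eight B-scores and `neighbors` lists the coauthors (prefix `"Co"`) or advisees (prefix `"Stu"`); none of these names come from the slides.

```python
def aggregate_neighbors(base, neighbors, prefix):
    """Build Sum<prefix>1-8 and Avg<prefix>1-8 from the neighbors' B1-B8 scores."""
    n_feats = 8
    rows = [base[name] for name in neighbors if name in base]
    sums = [sum(row[k] for row in rows) for k in range(n_feats)]
    # With no known neighbors, averages default to 0.0 (an assumption).
    avgs = [s / len(rows) if rows else 0.0 for s in sums]
    feats = {}
    for k in range(n_feats):
        feats[f"Sum{prefix}{k + 1}"] = sums[k]
        feats[f"Avg{prefix}{k + 1}"] = avgs[k]
    return feats
```

Calling it once with coauthors and once with advisees yields the 32 aggregate features listed in the table.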

21

Best supervisor finding results

22

Experimental Results

23

Outline

• Related Work

• Heterogeneous cross domain ranking

• Experiments

• Conclusion

24

Conclusion

• We formally define the problem of heterogeneous cross domain ranking and propose a general framework

• We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in two domains

• Experimental results on three different types of data sets verify the effectiveness of the proposed algorithm

25

Data Set

26

Ranking on Heterogeneous data

• A subset of ArnetMiner (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences

• 44 most frequently queried keywords from the log file

• Author collection:

– For each query, we gathered the top 30 experts from Libra, Rexa and ArnetMiner

• Conference collection:

– For each query, we gathered the top 30 conferences from Libra and ArnetMiner

• Ground truth:

– Three online resources

• http://www.cs.ualberta.ca/~zaiane/htmldocs/ConfRanking.html

• http://www3.ntu.edu.sg/home/ASSourav/crank.htm

• http://www.cs-conference-ranking.org/conferencerankings/alltopics.html

– Two faculty members and five graduate students from CS provided human judgments

28

Ranking on Heterogeneous tasks

• For the expert finding task, we can use results from ArnetMiner or annotated lists as training data

• For the best supervisor task, the 9 most frequent queries from the ArnetMiner log file are used

– For each query, we sent emails to 100 researchers

• Top 50 researchers by ArnetMiner

• Top 50 researchers who started publishing papers only in recent years (91.6% of them are currently graduates or postdoctoral researchers)

– Collection of feedback

• 50 best supervisor candidates (yes / no / not sure)

• Other candidates could also be added

– Ground truth