Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang1, Jie Tang2, Wei Fan3, Songcan Chen1, Zi Yang2, Yanzhu Liu4
1 Nanjing University of Aeronautics and Astronautics
2 Tsinghua University
3 IBM T.J. Watson Research Center, USA
4 Peking University
Introduction
• The web is becoming more and more heterogeneous
• Ranking is the fundamental problem on the web
  – unsupervised vs. supervised
  – homogeneous vs. heterogeneous
[Figure: academic network linking researchers (Dr. Tang, Limin, Prof. Wang, Prof. Li), papers (SVM..., Association..., Tree CRF..., Semantic..., EOS..., Annotation...) and conferences (IJCAI, ISWC, WWW) through write, cite, publish, coauthor and PC-member relations]
Motivation
[Figure: heterogeneous cross domain ranking for the query "data mining". Conferences: KDD, SDM, ICDM, PAKDD. Papers: "Principles of Data Mining", "Data Mining: Concepts and Techniques", "?". Authors: P. Yu, "?". Network nodes shown alongside: ISWC, WWW, IJCAI, Dr. Tang, Prof. Wang, Limin, Tree CRF..., SVM..., EOS...]
Main Challenges
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?
Related Work
• Learning to rank
  – Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07]
  – Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]
  – Ranking adaptation: [Chen, 08]
• Transfer learning
  – Instance-based: [Dai, 07] [Gao, 08]
  – Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07]
  – Model-based: [Bonilla, 08]
Outline
• Related Work
• Heterogeneous cross domain ranking
  – Basic idea
  – Proposed algorithm: HCDRank
• Experiments
• Conclusion
[Figure: query "data mining". Conference rankings (KDD, SDM, ADMA, PKDD, PAKDD) in the source domain and expert rankings (Jiawei Han, Jie Tang, Bo Wang, Alice, Jerry, Bob, Tom) in the target domain are mapped into a shared latent space; mis-ranked pairs are marked in both the source domain and the target domain]
• Alternately optimize matrices M and D: O(2T*sN log N)
• Construct transformation matrix: O(d^3)
• Learning in latent space: O(sN log N)
• Overall: O((2T+1)*sN log N + d^3)
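The overall cost is simply the sum of the three step costs. The sketch below checks that arithmetic numerically. The slide does not define T, s, N or d, so the code treats them as plain positive numbers; the function name `step_costs` and the sample values are made up for illustration, and the log base is arbitrary because it only changes a constant factor.

```python
import math


def step_costs(T, s, N, d):
    """Operation counts for each step, up to constant factors."""
    ranking_pass = s * N * math.log2(N)  # one sN log N pass
    alternate = 2 * T * ranking_pass     # alternately optimize M and D
    transform = d ** 3                   # construct the transformation matrix
    latent = ranking_pass                # learning in latent space
    return {
        "alternate": alternate,
        "transform": transform,
        "latent": latent,
        "total": alternate + transform + latent,
    }


# Illustrative values only; they are not taken from the slides.
costs = step_costs(T=10, s=5, N=1024, d=44)
closed_form = (2 * 10 + 1) * 5 * 1024 * math.log2(1024) + 44 ** 3
```

With these values the sum of the three steps and the closed form (2T+1)*sN log N + d^3 agree.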
Outline
• Related Work
• Heterogeneous cross domain ranking
• Experiments
  – Ranking on Homogeneous data
  – Ranking on Heterogeneous data
  – Ranking on Heterogeneous tasks
• Conclusion
Experiments
• Data sets
  – Homogeneous data set: LETOR_TR
    • 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR and OHSUMED_TR
  – Heterogeneous academic data set: ArnetMiner.org
    • 14,134 authors, 10,716 papers, and 1,434 conferences
  – Heterogeneous task data set:
    • 9 queries, 900 experts, 450 best supervisor candidates
• Evaluation measures
  – MAP
  – NDCG
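For readers unfamiliar with the two evaluation measures, here is a minimal sketch of MAP and NDCG. It assumes the common textbook definitions: average precision over binary relevance labels, normalized by the number of relevant items in the list, and NDCG with gain 2^rel - 1 and discount log2(rank + 1). The slides do not say which exact variant was used, so the formulas and function names below are assumptions rather than the authors' evaluation script.

```python
import math


def average_precision(relevance):
    """AP for one ranked list of binary labels (1 = relevant, 0 = not)."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """MAP: mean of AP over all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


def dcg_at_k(grades, k):
    """Discounted cumulative gain over the top-k graded labels."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(grades[:k], start=1))


def ndcg_at_k(grades, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0
```

A list that is already in ideal order, such as grades `[3, 2, 1]`, gets an NDCG of 1.0 at any cutoff.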
Ranking on Homogeneous data
• LETOR_TR
  – We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario
  – Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
• Baselines
[Figure: three result panels for OHSUMED_TR, TREC2004_TR and TREC2003_TR, annotated with cosine similarity values 0.01, 0.23 and 0.18; the pairing of values with panels is not recoverable]
Ranking on Heterogeneous data
• ArnetMiner data set (www.arnetminer.org)
  – 14,134 authors, 10,716 papers, and 1,434 conferences
• Training and test data set:
  – 44 most frequently queried keywords from the log file
    • Author collection: Libra, Rexa and ArnetMiner
    • Conference collection: Libra, ArnetMiner
• Ground truth:
  – Conference: online resources
  – Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking
Feature Definition
Features   Description
L1-L10     Low-level language model features
H1-H3      High-level language model features
S1         Number of years the conference has been held
S2         Total citations of the conference in the last 5 years
S3         Total citations of the conference in the last 10 years
S4         Number of years since the expert's first paper
S5         Total citations of all publications of the expert
S6         Number of the expert's papers cited more than 5 times
S7         Number of the expert's papers cited more than 10 times
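The expert-side features S4 to S7 are simple aggregates over an expert's publication record. The sketch below shows one way to compute them; the `Paper` record, the `expert_features` function and the sample data are hypothetical and only illustrate the definitions in the table, not the authors' extraction code.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    year: int       # publication year
    citations: int  # number of citations received


def expert_features(papers, current_year):
    """Compute S4-S7 for one expert from a list of Paper records."""
    if not papers:
        return {"S4": 0, "S5": 0, "S6": 0, "S7": 0}
    return {
        "S4": current_year - min(p.year for p in papers),  # years since first paper
        "S5": sum(p.citations for p in papers),            # total citations
        "S6": sum(1 for p in papers if p.citations > 5),   # papers cited > 5 times
        "S7": sum(1 for p in papers if p.citations > 10),  # papers cited > 10 times
    }


# Made-up publication record for illustration.
papers = [Paper(2001, 12), Paper(2005, 6), Paper(2008, 3)]
features = expert_features(papers, current_year=2009)
```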
Ranking on Heterogeneous tasks
• Expert finding task vs. best supervisor finding task
• Training and test data set:
  – Expert finding task: ranking lists from ArnetMiner or annotated lists
  – Best supervisor finding task: 9 most frequent queries from the ArnetMiner log file
    • For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
• Ground truth:
  – Feedback collected about the candidates (yes/ no/ not sure)
Feature Definition
Features           Description
L1-L10             Low-level language model features
H1-H3              High-level language model features
B1                 The year he/she published his/her first paper
B2                 The number of papers of an expert
B3                 The number of papers in recent 2 years
B4                 The number of papers in recent 5 years
B5                 The number of citations of all his/her papers
B6                 The number of papers cited more than 5 times
B7                 The number of papers cited more than 10 times
B8                 PageRank score
SumCo1-SumCo8      The sum of coauthors' B1-B8 scores
AvgCo1-AvgCo8      The average of coauthors' B1-B8 scores
SumStu1-SumStu8    The sum of his/her advisees' B1-B8 scores
AvgStu1-AvgStu8    The average of his/her advisees' B1-B8 scores
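The SumCo/AvgCo and SumStu/AvgStu features aggregate the B1-B8 scores of a researcher's coauthors or advisees. A minimal sketch of that aggregation follows; the function name `aggregate_features`, the dictionary layout and the sample researchers are hypothetical, and returning 0.0 for a researcher with no neighbours is an assumption the slides do not address.

```python
def aggregate_features(base_scores, neighbours, prefix):
    """Sum and average the B1-B8 scores of a researcher's neighbours.

    base_scores: dict mapping researcher name -> list of 8 numbers (B1..B8)
    neighbours:  names of coauthors (prefix="Co") or advisees (prefix="Stu")
    """
    rows = [base_scores[name] for name in neighbours if name in base_scores]
    features = {}
    for i in range(8):
        total = sum(row[i] for row in rows)
        features[f"Sum{prefix}{i + 1}"] = total
        # Assumption: a researcher with no known neighbours gets 0.0.
        features[f"Avg{prefix}{i + 1}"] = total / len(rows) if rows else 0.0
    return features


# Made-up B1..B8 scores for two coauthors.
base_scores = {
    "alice": [2001, 10, 2, 5, 120, 6, 4, 0.3],
    "bob": [2005, 4, 1, 3, 30, 2, 1, 0.1],
}
co = aggregate_features(base_scores, ["alice", "bob"], "Co")
```

The same function yields the SumStu/AvgStu features when called with the advisee list and `prefix="Stu"`.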
Conclusion
• We formally define the problem of heterogeneous cross domain ranking and propose a general framework
• We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in two domains
• Experimental results on three different types of data sets verify the effectiveness of the proposed algorithm
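To make "simultaneously minimizing two ranking loss functions in two domains" concrete, the sketch below evaluates a generic joint objective: a pairwise hinge loss in each domain, computed after projecting each domain's features into a shared latent space, plus an L2 regularizer. This is an illustration of the general idea only and is not the HCDRank objective; the projection matrices `P_src` and `P_tgt`, the shared weight vector `w`, the trade-off `C`, the regularization weight `lam` and the choice of hinge loss and L2 penalty are all assumptions.

```python
def project(P, x):
    """Map feature vector x into the latent space (one row of P per latent dim)."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in P]


def score(w, P, x):
    """Ranking score: shared weights applied to the latent representation."""
    return sum(wi * zi for wi, zi in zip(w, project(P, x)))


def pairwise_hinge_loss(w, P, pairs):
    """Sum of hinge losses over preference pairs (preferred, other)."""
    return sum(max(0.0, 1.0 - (score(w, P, a) - score(w, P, b)))
               for a, b in pairs)


def joint_objective(w, P_src, P_tgt, src_pairs, tgt_pairs, C=1.0, lam=0.1):
    """Source loss + C * target loss + lam * ||w||^2."""
    reg = sum(v * v for v in w)
    return (pairwise_hinge_loss(w, P_src, src_pairs)
            + C * pairwise_hinge_loss(w, P_tgt, tgt_pairs)
            + lam * reg)


# Toy data: 2-d source features, 3-d target features, 2-d latent space.
w = [1.0, 0.0]
P_src = [[1.0, 0.0], [0.0, 1.0]]
P_tgt = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
src_pairs = [([2.0, 0.0], [0.0, 0.0])]            # ranked correctly with margin
tgt_pairs = [([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]  # mis-ranked pair
value = joint_objective(w, P_src, P_tgt, src_pairs, tgt_pairs)
```

In this toy setup the source pair contributes no loss, while the mis-ranked target pair is penalized, so lowering the objective means fixing the target ordering without breaking the source one.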
Ranking on Heterogeneous data
• A subset of ArnetMiner (www.arnetminer.org)
  – 14,134 authors, 10,716 papers, and 1,434 conferences
• 44 most frequently queried keywords from the log file
• Author collection:
  – For each query, we gathered the top 30 experts from Libra, Rexa and ArnetMiner
• Conference collection:
  – For each query, we gathered the top 30 conferences from Libra and ArnetMiner
• Ground truth:
  – Three online resources
    • http://www.cs.ualberta.ca/~zaiane/htmldocs/ConfRanking.html
    • http://www3.ntu.edu.sg/home/ASSourav/crank.htm
    • http://www.cs-conference-ranking.org/conferencerankings/alltopics.html
  – Two faculty members and five graduate students from CS provided human judgments
Ranking on Heterogeneous tasks
• For the expert finding task, we can use results from ArnetMiner or annotated lists as training data
• For the best supervisor task, the 9 most frequent queries from the ArnetMiner log file are used
  – For each query, we sent emails to 100 researchers
    • Top 50 researchers by ArnetMiner
    • Top 50 researchers who started publishing papers only in recent years (91.6% of them are currently graduate students or postdoctoral researchers)
  – Collection of feedback
    • 50 best supervisor candidates (yes/ no/ not sure)
    • Also add other candidates
  – Ground truth