
Formal Models for Expert Finding on DBLP Bibliography Data

Presented by: Hongbo Deng
Co-worked with: Irwin King and Michael R. Lyu

Department of Computer Science and Engineering
The Chinese University of Hong Kong

Dec. 16, 2008

ICDM2008


Introduction

Traditional information retrieval
Expert finding task

[Figure: the example query "Data mining" shown for each task]


Outline

Introduction
Related work
Methodology
  Modeling Expertise
  Statistical language model
  Topic-based model
  Hybrid model
Experiments
Conclusions


Introduction

Expert finding has received increased interest
  W3C collection in 2005 and 2006 (introduced and used by TREC)
  CSIRO collection in 2007

Nearly all of the work has been evaluated on the W3C collection

We address the expert finding task in a real-world academic field
  An important practical problem
  Some special problems and difficulties


Problems

How to represent the expertise of a researcher?
  The publications of a researcher

How to identify experts for a given query?
  Relevance between a query and publications
  Publications act as the “bridge” between query and experts

What dataset can be used?
  DBLP bibliography (limited information)
  Use Google Scholar as a data supplement

How to measure the relevance between a query and documents?
  Language model, vector space model, etc.

Should we treat each publication equally?


Our Work

Our setting: DBLP bibliography and Google Scholar
  More than 955,000 articles with over 574,000 authors
  About 20 GB of metadata crawled from Google Scholar

Differences from the W3C setting
  Covers a wider range of topics
  Contains many more expert candidates

Applications
  Find experts for consultation on a new research field
  Assign papers to reviewers automatically
  Recommend panels of reviewers for grant applications


Related Work

Document model & Candidate model (Balog et al., SIGIR’06 & SIGIR’07)
Hierarchical language models (Petkova and Croft, ICTAI’06)
Voting model (Macdonald and Ounis, CIKM’06)
Author-Persona-Topic model (Mimno and McCallum, KDD’07)
……

These approaches do not consider the importance of documents.
They are hard to use in large-scale expert finding.


Expertise Modeling

Expert finding
  p(ca|q): what is the probability of a candidate ca being an expert given the query topic q?
  Rank candidates ca according to this probability.

Approach: Using Bayes’ theorem,

$$p(ca \mid q) = \frac{p(ca, q)}{p(q)}$$

where p(ca, q) is the joint probability of a candidate and a query, and p(q) is the probability of a query.


Expertise Modeling

Problem: How to estimate p(ca, q)?

Model 1: Statistical language model
  Document-based approach
  Find the experts from the associated publications

Model 2: Topic-based model
  Association between the query and several similar topics

Model 3: Hybrid model
  Combination of Model 1 and Model 2

[Figures: two graphical models, one over the nodes q, D, ca and one over the nodes T, q, ca, D]


Basic Language Model

The probability pl(ca,q):
  (equation annotations: "Language model", "Conditionally independent")

Fig1. Baseline model (nodes q, D, ca)

Find documents relevant to the query
Model the knowledge of an expert from the associated documents
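The two steps above can be sketched in code. The formula on the slide is not preserved in this transcript, so the sketch assumes the usual document-based form suggested by Fig1 and the annotations: the score of a candidate is the sum over documents d of p(d) · p(q|d) · p(ca|d), with a uniform p(d) in the basic model. The Jelinek-Mercer smoothing, the uniform split of p(ca|d) over co-authors, the function names and the toy data are all illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.5):
    """p(q|d) under a unigram language model with Jelinek-Mercer smoothing (assumed)."""
    doc_counts = Counter(doc_terms)
    prob = 1.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_terms) if doc_terms else 0.0
        p_coll = coll_counts[term] / coll_len
        prob *= (1 - lam) * p_doc + lam * p_coll
    return prob

def baseline_scores(query_terms, docs, authors_of):
    """Score each candidate by sum_d p(d) * p(q|d) * p(ca|d) with uniform p(d)."""
    coll_counts = Counter(t for terms in docs.values() for t in terms)
    coll_len = sum(coll_counts.values())
    p_d = 1.0 / len(docs)  # basic model: every document is equally important
    scores = Counter()
    for doc_id, terms in docs.items():
        p_q_d = query_likelihood(query_terms, terms, coll_counts, coll_len)
        authors = authors_of[doc_id]
        for ca in authors:
            # assumed: p(ca|d) is split evenly among the co-authors of d
            scores[ca] += p_d * p_q_d / len(authors)
    return scores

# Hypothetical toy collection: three publications and their authors.
docs = {
    "d1": ["language", "model", "retrieval"],
    "d2": ["expert", "finding", "language", "model"],
    "d3": ["graph", "mining"],
}
authors_of = {"d1": ["a1"], "d2": ["a1", "a2"], "d3": ["a3"]}

scores = baseline_scores(["language", "model"], docs, authors_of)
ranking = [ca for ca, _ in scores.most_common()]
```

Here the publications act as the bridge between the query and the candidates: a1 is ranked first because both relevant documents are associated with a1.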

Page 11: 1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.

11

Hongbo Deng, Irwin King and Michael R. LyuDepartment of Computer Science and Engineering

The Chinese University of Hong Kong ICDM 2008

Weighted Language Model

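Fig2. A query example (query q, documents d1 and d2, authors a1 and a2, edge weights 0.1 and 0.2, citation labels "cited by 200" and "cited by 10")

Fig3. Weighted model (nodes q, D, ca, with document prior p(d))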
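A small sketch may help show why a document prior changes the ranking. It assumes, purely for illustration, that p(d) grows with the logarithm of the citation count; the exact weighting used in the talk is not preserved in this transcript. The function name, the relevance values and the mapping of citation counts to documents are hypothetical.

```python
import math

def citation_prior(citations):
    """Turn {doc_id: citation_count} into a normalized prior p(d).

    Assumption: weight = ln(e + citations), so an uncited paper still gets weight 1.
    """
    weights = {d: math.log(math.e + c) for d, c in citations.items()}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}

# Hypothetical example: two documents that match the query equally well.
p_q_given_d = {"d1": 0.2, "d2": 0.2}
author_of = {"d1": "a1", "d2": "a2"}

uniform = {d: 0.5 for d in p_q_given_d}
weighted = citation_prior({"d1": 200, "d2": 10})

uniform_scores = {author_of[d]: uniform[d] * p_q_given_d[d] for d in p_q_given_d}
weighted_scores = {author_of[d]: weighted[d] * p_q_given_d[d] for d in p_q_given_d}
```

With a uniform prior the two authors tie; with the citation-based prior the author of the more highly cited document is ranked higher.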


Topic-based Model

Observation: researchers usually describe their expertise as a combination of several topics

Each candidate is represented as a weighted sum of multiple topics Z
  (equation annotations: "Similarity between query and topics"; "z -> as a query, estimate p")

Fig4. Topic-based model (nodes Z, q, ca, D)
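The weighted-sum idea can be sketched as follows. The slide's formula is not preserved here, so the sketch assumes that each topic z contributes a candidate score obtained by issuing z as a query, and that these scores are mixed using normalized query-topic similarities p(q|θz). All names and numbers are hypothetical.

```python
def topic_based_scores(query_topic_sim, candidate_given_topic):
    """Mix per-topic candidate scores, weighted by query-topic similarity.

    query_topic_sim: {topic: similarity between the query and the topic}
    candidate_given_topic: {topic: {candidate: score when the topic is used as a query}}
    """
    total = sum(query_topic_sim.values())
    scores = {}
    for topic, sim in query_topic_sim.items():
        weight = sim / total  # normalize so the topic weights sum to 1
        for ca, p in candidate_given_topic.get(topic, {}).items():
            scores[ca] = scores.get(ca, 0.0) + weight * p
    return scores

# Hypothetical values for a single query.
query_topic_sim = {"information retrieval": 0.6, "text mining": 0.3, "databases": 0.1}
candidate_given_topic = {
    "information retrieval": {"a1": 0.7, "a2": 0.3},
    "text mining": {"a1": 0.2, "a2": 0.8},
    "databases": {"a3": 1.0},
}

topic_scores = topic_based_scores(query_topic_sim, candidate_given_topic)
```

A candidate who is strong on the topics closest to the query (a1 here) ends up ahead of one who is strong only on a weakly related topic (a3).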


Topic-based Model

Topic z: Information retrieval

Records returned by Google Scholar, used to represent the topic as θz:

1. Introduction to Modern Information retrieval
2. Information retrieval
3. Modern Information retrieval
5. A language modeling approach to information retrieval
7. Information filtering and information retrieval
……
99. Cross-language information retrieval
100. On modeling information retrieval with probabilistic inference


Topic-based Model

Challenge: Which similar topics should be selected?

T1: Calculate p(q|θz), select the top K ranked topics
  Assumes topics are independent

Ideal similar topics:
  Include topics from many different subtopics
  Do not include topics with high redundancy
  Define a conditional probability function to quantify the novelty and penalize the redundancy of a topic

T2: T3:


Topic Selection Algorithm

T2:

T3:
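The T2 and T3 formulas are not preserved in this transcript. As a stand-in, the sketch below uses a generic greedy, MMR-style rule that rewards relevance to the query and penalizes similarity to topics already selected. It illustrates the novelty and redundancy idea only and is not the authors' T2 or T3 definition; the function names, the `penalty` parameter, the Jaccard similarity and the relevance values are all assumptions.

```python
def jaccard(topic_a, topic_b):
    """Word-overlap similarity between two topic names (illustrative choice)."""
    a, b = set(topic_a.split()), set(topic_b.split())
    return len(a & b) / len(a | b)

def select_topics(relevance, similarity, k, penalty=0.5):
    """Greedily pick k topics, trading relevance against redundancy.

    relevance: {topic: relevance to the query, e.g. p(q|theta_z)}
    similarity: function (topic, topic) -> value in [0, 1]
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def gain(topic):
            # redundancy = similarity to the closest topic already selected
            redundancy = max((similarity(topic, s) for s in selected), default=0.0)
            return (1 - penalty) * relevance[topic] - penalty * redundancy
        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical relevance values for one query.
relevance = {
    "information retrieval": 0.9,
    "modern information retrieval": 0.85,
    "information filtering": 0.5,
    "machine learning": 0.4,
}

top_k_only = select_topics(relevance, jaccard, k=2, penalty=0.0)
diverse = select_topics(relevance, jaccard, k=2, penalty=0.5)
```

With `penalty=0.0` the rule reduces to plain top-K selection as in T1 and picks two near-duplicate topics; with the penalty enabled the second pick moves to a topic that covers a different subtopic.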


Hybrid Model

Combine the advantages of pl and pt

Defined as:
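The definition itself is not preserved in this transcript. One common way to combine two score lists is a linear interpolation of normalized scores, sketched below purely as an assumption; the `alpha` parameter, the function name and the numbers are hypothetical and may differ from the combination used in the talk.

```python
def hybrid_scores(p_l, p_t, alpha=0.5):
    """Interpolate normalized language-model (p_l) and topic-based (p_t) scores."""
    def normalize(scores):
        total = sum(scores.values())
        return {ca: s / total for ca, s in scores.items()} if total else dict(scores)

    norm_l, norm_t = normalize(p_l), normalize(p_t)
    candidates = set(norm_l) | set(norm_t)
    return {
        ca: alpha * norm_l.get(ca, 0.0) + (1 - alpha) * norm_t.get(ca, 0.0)
        for ca in candidates
    }

# Hypothetical scores from the two models for one query.
p_l = {"a1": 0.03, "a2": 0.01}
p_t = {"a2": 0.4, "a3": 0.1}

combined = hybrid_scores(p_l, p_t, alpha=0.5)
best = max(combined, key=combined.get)
```

Normalizing first keeps one model from dominating only because its raw scores are on a larger scale.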


Experiments

DBLP Collection

Limitation
  No abstracts or index terms
  Hard to represent the document

Representation for documents
  Use Google Scholar for data supplementation
  Title as query; crawl the top 10 returned records
  Up to 20 GB of metadata (HTML pages)
  The citation count of the publication

[Figure: document d represented ("rep") by its title from DBLP, supplemented ("sup") by Google Scholar]


Topic Collection

2,498 well-defined topics from eventseer
Crawl the top 100 returned records from Google Scholar


Benchmark Dataset

A benchmark dataset with 7 topics and expert lists


Evaluation Metrics

Precision at rank n (P@n):

Mean Average Precision (MAP):

Bpref: a score based on the number of non-relevant candidates ranked above relevant ones
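The metric formulas are not preserved in this transcript, so the following is a minimal sketch of the standard definitions. The bpref variant shown is the original form, in which each relevant candidate is penalized by the number of judged non-relevant candidates ranked above it, capped at R, the number of relevant candidates. The function names and the toy ranking are illustrative.

```python
def precision_at_n(ranking, relevant, n):
    """Fraction of the top n ranked candidates that are relevant."""
    return sum(1 for ca in ranking[:n] if ca in relevant) / n

def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant candidate."""
    hits, total = 0, 0.0
    for rank, ca in enumerate(ranking, start=1):
        if ca in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def bpref(ranking, relevant, nonrelevant):
    """bpref over judged candidates only; unjudged candidates are ignored."""
    r_count = len(relevant)
    if r_count == 0:
        return 0.0
    nonrel_seen, score = 0, 0.0
    for ca in ranking:
        if ca in nonrelevant:
            nonrel_seen += 1
        elif ca in relevant:
            score += 1 - min(nonrel_seen, r_count) / r_count
    return score / r_count

# Hypothetical ranked list for one topic.
ranking = ["a1", "a4", "a2", "a5", "a3"]
relevant = {"a1", "a2", "a3"}
nonrelevant = {"a4", "a5"}

p_at_3 = precision_at_n(ranking, relevant, 3)
ap = average_precision(ranking, relevant)
map_score = mean_average_precision([(ranking, relevant)])
bp = bpref(ranking, relevant, nonrelevant)
```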


Preliminary Experiments

Performed on two corpora using the basic language model (B1)
  “Title” corpus: only using the title
  “GS” corpus: the representation from Google Scholar

Evaluation results on two corpora (%)

More effective to represent d using Google Scholar


Model 1: Statistical Language Models

Evaluation results of language models

Weighted language models B2 and B3 outperform B1
It is important to consider the prior probability


Model 2: Topic-based Models

Vary the number of topics (K) from 5 to 100
Results using different values for K

The number of topics is cut off automatically for T2 & T3


Model 2: Topic-based Models

Comparison of the three topic-based models


Model 3: Hybrid Models

Evaluation results of hybrid model

Hybrid model outperforms the pure language model and topic-based model in most of the metrics


Conclusions and Future Work

Conclusions
  Address the expert finding task in a real-world academic field
  Propose a weighted language model
  Investigate a topic-based model to interpret the expert finding task
  Integrate the language model with the topic-based model
  Demonstrate that the hybrid model achieves the best performance in the evaluation results

Future work
  Take into account other types of information
  Refine the results by utilizing social network analysis


Q&A

Thanks!


Comparison to Other Systems

Evaluation results of our language models and the method TS


Example results