PageRank extensions and a new way to measure scientific impacts

PAGERANK EXTENSIONS AND A NEW WAY TO MEASURE SCIENTIFIC IMPACTS

Son NGUYEN KIM

1

OUTLINE

I. Context

II. What is PageRank?

III. PageRank extensions

IV. Pira Algorithm

V. Scientific impact measures

VI. Database

VII. Future works

2

I. CONTEXT

1. SNOW

2. PIRA

3

1. SNOW

Create a pertinent ranking method for a dynamic context (forum, Twitter, Facebook, …)

Example: forum, YouTube

4

2. PIRA (PUBLICATION INDUCED RESEARCH ANALYSES)

Validation project for Snow

Create new PageRank-like method to rank authors and scientific publications

[Figure: author-paper graph with Author and Paper vertices, Authorship edges, and Citation edges]

5

II. WHAT IS PAGERANK?

1. How does Google work?

2. Definition of PageRank

3. Damping factor

6

1. HOW DOES GOOGLE WORK?

[Diagram: Crawling, DB, Calculate website's score (PageRank), DB]

Website PageRank-score

Google 1000000

Yahoo 900000

Le Monde 1000

7

1. HOW DOES GOOGLE WORK?

[Diagram: query "David Beckham", keywords "David + beckham", DB, websites related to the keywords]

Website Request-score

davidbeckham.com 1000

lequipe.fr 999

Ex: Request-score = number of keyword appearances * PR-score

8

2. DEFINITION OF PAGERANK

Random surfer model

Score of a website (vertex) is the probability it is visited in an infinite journey

9

3. DAMPING FACTOR

Resemblance to Markov chain:

State space = set of vertices

State transition = edge

Irreducibility => unique stationary distribution

Irreducibility = strong connectedness of the graph

Graph is strongly connected => scores are well defined and unique

What if the graph is not strongly connected?

10

[Figure: example graph with vertices A, B, C, D]

3. DAMPING FACTOR

The probability a web surfer gets bored and decides to jump to a random website

The transition probability becomes non-zero between every pair of vertices
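Written out as a transition matrix, this could look as follows. This is a sketch under assumptions: N is the number of vertices, df is the jump probability, every vertex has at least one outgoing edge, and the symbols P, A and outdeg are notation introduced here, not taken from the slides.

```latex
P_{ij} = \frac{df}{N} + (1 - df)\,\frac{A_{ij}}{\mathrm{outdeg}(i)},
\qquad
A_{ij} =
\begin{cases}
1 & \text{if there is an edge } i \to j \\
0 & \text{otherwise}
\end{cases}
```

Every entry satisfies P_ij >= df / N > 0, so the chain is irreducible and the scores (the stationary distribution) are unique. With N = 4 vertices, the jump term is df / 4.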

11

[Figure: the graph with vertices A, B, C, D, drawn twice; edge labels: df / 4 and 1 - df]

3. DAMPING FACTOR EFFECT

The damping factor can change the score order of two vertices!

Score(V3) > Score(V5)

Score(V3) < Score(V5)

12

III. PAGERANK EXTENSIONS

1. Extended graph

2. P-weight

3. C-weight

4. Path Diversity

13

1. EXTENDED GRAPH

Different vertex types, different edge types

More general: vertices and edges can carry properties

Type is shorthand for the property « type »

14

Vertex types and their properties:

Company website: Address, Profit, ...

Personal blog: Creator, Theme, ...

Edge types and their properties:

Commercial link

Partnership link

Friend link: year, place

Publicity link: Price

2. P-WEIGHT

Assign a different probability weight (p-weight) to different edges

This value determines the appreciation of an object towards another

Ex: in Facebook, “comment” edge vs. “like” edge

This value is relative

[Figure: vertices A, B, C; edge labels P-weight = 1 and P-weight = 2]
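A minimal sketch of how relative p-weights could be turned into transition probabilities. The function name and the layout (A -> B with p-weight 1, A -> C with p-weight 2) are assumptions made for illustration; the slide does not say which edge carries which weight.

```python
def transition_probabilities(out_edges):
    """Turn relative p-weights into probabilities that sum to 1.

    out_edges maps each target vertex to the p-weight of the edge leading to it.
    """
    total = sum(out_edges.values())
    return {target: weight / total for target, weight in out_edges.items()}


# Hypothetical layout: A -> B with p-weight 1, A -> C with p-weight 2.
probs = transition_probabilities({"B": 1, "C": 2})
print(probs)
```

Only the ratio matters: p-weights 10 and 20 would give the same probabilities, which is one way to read "this value is relative".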

15

3. C-WEIGHT

[Figure: edge e from V1 to V2]

• Increment the counter differently during the journey: counter(V2) = counter(V2) + c-weight(e)

• Represents the appreciation a vertex gives to another

• Example: author to paper, « wrote » edge: c-weight = 0
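The counter update can be sketched as below. Only c-weight(wrote) = 0 comes from the slide; the other c-weights, the vertex names and the journey itself are hypothetical.

```python
from collections import Counter

# Hypothetical c-weights per edge type; only "wrote" = 0 is taken from the slide.
C_WEIGHT = {"wrote": 0, "cite": 1, "isWrittenBy": 1}


def count_visits(journey):
    """journey is a list of (source, edge_type, target) steps already taken."""
    counter = Counter()
    for _source, edge_type, target in journey:
        # counter(V2) = counter(V2) + c-weight(e)
        counter[target] += C_WEIGHT[edge_type]
    return counter


journey = [
    ("A1", "wrote", "P1"),        # no credit: an author visiting his own paper
    ("P1", "cite", "P2"),         # P2 is credited by the citation
    ("P2", "isWrittenBy", "A2"),  # A2 is credited through his paper
    ("A2", "wrote", "P2"),        # no credit again
]
counter = count_visits(journey)
print(dict(counter))
```

In this sketch P2 is visited twice but credited once, because the second arrival came through a « wrote » edge.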

16

C-WEIGHT VS P-WEIGHT

Score(V3) > Score(V5)

Score(V3) = Score(V5)

• Two ways to favor a vertex

• Are they equivalent?

17

4. PATH DIVERSITY

Remember the journey

Add an additional check before incrementing counter(w): compare w to the previously visited vertices

Ex: if w already appears in this list, the visitor has probably fallen into a cycle (e.g. w --> w1 --> w2 --> w), so w should not receive any credit

[Figure: cycle w --> w1 --> w2 --> w]

18
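A minimal sketch of this check, assuming the journey remembers a fixed number of previous vertices. The parameter `memory` is hypothetical; the slide does not say how long the remembered list is.

```python
from collections import deque


def count_with_path_diversity(journey, memory=3):
    """Count visits, skipping any vertex seen among the last `memory` vertices."""
    recent = deque(maxlen=memory)  # previously visited vertices, oldest dropped first
    counter = {}
    for w in journey:
        if w not in recent:
            counter[w] = counter.get(w, 0) + 1
        # The vertex is remembered whether or not it received credit.
        recent.append(w)
    return counter


# The visitor loops through the cycle w -> w1 -> w2 -> w twice.
loop = ["w", "w1", "w2", "w", "w1", "w2", "w"]
counts = count_with_path_diversity(loop, memory=3)
print(counts)
```

With a memory of 3, each vertex of the cycle is credited only once, however many times the visitor goes around it.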

IV. PIRA ALGORITHM

1. Author paper graph

2. Journey extensions

20

1. AUTHOR PAPER GRAPH

21

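[Figure: paper set and author set; « cite » edges between papers, « wrote » edges from authors to papers, « IsWrittenBy » edges from papers to authors]

The score of an author or a paper is the probability that it is visited in an infinite journey on the author-paper graph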

2. JOURNEY EXTENSIONS

a) Choosing type before choosing nodes

b) C-weight

c) P-weight

22

2. JOURNEY EXTENSION

A. CHOOSING TYPE BEFORE CHOOSING NODE

23

[Figure: a vertex P linked to three A vertices and two P vertices; edge probabilities 1/6, 1/6, 1/6 and 1/4, 1/4]
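The probabilities 1/6 and 1/4 are consistent with picking an edge type uniformly first and then an edge of that type uniformly. The sketch below assumes exactly that, and assumes the figure shows a paper with three authors that cites two papers; neither assumption is stated on the slide, and all names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def type_first_probabilities(out_edges):
    """out_edges is a list of (edge_type, target) pairs leaving one vertex.

    Pick an edge type uniformly, then pick an edge of that type uniformly.
    Targets are assumed to be distinct.
    """
    by_type = defaultdict(list)
    for edge_type, target in out_edges:
        by_type[edge_type].append(target)
    probs = {}
    for targets in by_type.values():
        for target in targets:
            probs[target] = Fraction(1, len(by_type)) * Fraction(1, len(targets))
    return probs


# Hypothetical reading of the figure: a paper with three authors citing two papers.
edges = [
    ("IsWrittenBy", "A1"), ("IsWrittenBy", "A2"), ("IsWrittenBy", "A3"),
    ("cite", "P1"), ("cite", "P2"),
]
probs = type_first_probabilities(edges)
print(probs["A1"], probs["P1"])
```

Without the type-first step, each of the five edges would be followed with probability 1/5.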

2. JOURNEY EXTENSION

A. CHOOSING TYPE BEFORE CHOOSING NODE

24

If « wrote » edge and « cite » edge are treated equally: Score(A2) = Score(A1)!

With the extension « choosing type first »: Score(A2) = Score(A1) / n

2B. JOURNEY EXTENSION:

C-WEIGHT FOR “WROTE” EDGE

C-weight(wrote) = 0

25

An author does not directly express appreciation for his own papers:

[Figure: author A with a « wrote » edge to paper P]

2B. JOURNEY EXTENSION:

C-WEIGHT FOR “WROTE” EDGE

26

If no c-weight(wrote): Score(P2) > Score(P1)

If c-weight(wrote) = 0: Score(P2) = Score(P1)

2C. JOURNEY EXTENSION

P-WEIGHT FOR “WROTE” EDGE

If no p-weight, then Score(P3) < Score(P4) !

27

C-weight(wrote) = 0 => Score(P1) = Score(P2)

Score(P3) = Score(P4)

2C. JOURNEY EXTENSION:

P-WEIGHT FOR “WROTE” EDGE

[Figure: author a and paper p joined by edge e]

Solution: P-weight(e) = 1 / nbAuthor(p)

P-weight(w1) = P-weight(w2) + P-weight(w3) + P-weight(w4)

Score(P1) = Score(P2)

Score(P3) = Score(P4)
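A minimal sketch of the rule P-weight(e) = 1 / nbAuthor(p), applied to every « wrote » edge. The function name and the sample data are hypothetical.

```python
def wrote_p_weights(authors_of):
    """Return the p-weight of every « wrote » edge, keyed by (author, paper).

    authors_of maps a paper to the list of its authors.
    """
    return {
        (author, paper): 1 / len(authors)
        for paper, authors in authors_of.items()
        for author in authors
    }


# Hypothetical data: one single-author paper and one paper with three authors.
weights = wrote_p_weights({"P1": ["a1"], "P2": ["a2", "a3", "a4"]})
total_p2 = sum(w for (_author, paper), w in weights.items() if paper == "P2")
print(weights[("a1", "P1")], total_p2)
```

The three edges into the co-authored paper together weigh as much as the single edge into the single-author paper, which mirrors P-weight(w1) = P-weight(w2) + P-weight(w3) + P-weight(w4).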

28

V. SCIENTIFIC IMPACT MEASURES

1. Author measures

a) Classic measures

b) Scenarios

c) Summary

2. Paper measures

a. Classic measures

b. Summary

29

1. AUTHOR MEASURES

A. CLASSIC MEASURES

Publication

Citation

H-Index, G-Index

PR-A: PageRank on Author graph

30

1A. CLASSIC MEASURES

PUBLICATION

31

[Figure: author A with « wrote » edges to three papers P]

Publication-score(A) = 3

1A. CLASSIC MEASURES

CITATION

32

[Figure: author A with « wrote » edges to three papers P; three further papers P attached by « cite » edges]

Citation-score(A) = 5

1A. CLASSIC MEASURES

H-INDEX

33

[Figure: author A with « wrote » edges to three papers P; three further papers P attached by « cite » edges]

HIndex-score(A) = 2
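The three classic scores can be reproduced with a few lines. The per-paper citation counts below are hypothetical: they are chosen only so that the totals match the slides (3 papers, 5 citations, H-index 2), since the exact split is not recoverable from the figure.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citations, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for the author's three papers.
citations = [3, 2, 0]
publication_score = len(citations)
citation_score = sum(citations)
print(publication_score, citation_score, h_index(citations))
```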

1A. CLASSIC MEASURES

PR-A: PAGERANK ON AUTHOR GRAPH

34

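• Author graph

• P-weight

[Figure: papers p and q joined by a « cite » edge; authors A, A', A'' and B, B' attached by « wrote » edges; edge labels 1 and P-weight = 1/6]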

1. AUTHOR MEASURES

B. SCENARIOS

Quality of paper

Number of co-authors

Quality of citing papers

Self citation

Score range

35

1B. SCENARIO: QUALITY OF PAPER

Quality of paper

score(A1) > score(A2)

Publication No

Citation Yes

H-Index Yes

PR-A Yes

Pira Yes

36

Score(A1) > Score(A2)

1C. SCENARIO: NUMBER OF CO-AUTHORS

score(A1) < score(A4)

Publication No

Citation No

H-Index No

PR-A Yes

Pira Yes

Number of co-authors

37

Score(A1) < Score(A2)

1D. SCENARIO: QUALITY OF CITING PAPER

score(A1) > score(A2)

Publication No

Citation No

H-Index No

PR-A No …

Pira Yes

Quality of citing papers

38

Score(A1) > Score(A2)

1E. SCENARIO: SELF-CITATION

score(A2) > score(A1)

Publication No

Citation No

H-Index No

PR-A Yes

Pira Yes

Self-citation

39

Score(A1) < Score(A2)

1G. SCENARIO: SCORE RANGE

Millions of authors

Score range (in general)

Publication: < 1000

Citation < 20000

H-Index < 100, G-Index < 200

PR-A, Pira: infinity

Sufficient score range

Publication No

Citation No

H-Index No

PR-A Yes

Pira Yes

40

1. AUTHOR MEASURES

C. SUMMARY

Criteria/Measures        Publication  Citation  H-Index  PR-A  Pira
Paper quality            No           Yes       Yes      Yes   Yes
Number of co-authors     No           No        No       Yes   Yes
Citing papers' quality   No           No        No       No    Yes
Self-citation's effect   No           No        No       Yes   Yes
Domain specific          No           No        No       Yes   Yes
Score range              No           No        No       Yes   Yes

41

2. PAPER MEASURES

A. CLASSIC MEASURES

Citation

PR-P: PageRank for papers

Papers

Cite

42

2B. SUMMARY FOR PAPER MEASURES

Criteria/Measures         Citation  PR-P  Pira
Citing papers' quality    No        Yes   Yes
Self-citation's effect    No        Yes   Yes
Domain specific           No        Yes   Yes
Score range               No        Yes   Yes
Citing authors' quality   No        No    Yes

43

VI. DATABASE

Aggregation of DBLP and CiteSeerX

246039 authors (73241 in DBLP) and 281207 papers (67772 in DBLP)

The theoretical scenarios have been found in the database

44

VII. FUTURE WORKS

Pira

Optimization: more than 10 times faster than the matrix-multiplication algorithm

Apply path diversity

Take into account the content

Create a complete search engine for scientific world

Snow

A lot of work to do …

45

Thank you !

46

1D. SCENARIO: QUALITY OF CITING PAPER

score(A1) > score(A2)

Publication No

Citation No

H-Index No

PR-A No …

Pira Yes

Quality of citing papers

PR-A vs Pira

47

2. JOURNEY EXTENSION

A. CHOOSING TYPE BEFORE CHOOSING NODE

Score(P2) = Score(A1) + Score(A2)

Score(A2) = Score(P2) / 2

=> Score(A2) = Score(A1)!

48

2. JOURNEY EXTENSION

A. CHOOSING TYPE BEFORE CHOOSING NODE

Choose type first!

Score(P2) = Score(A2) + Score(A1) / n

Score(A2) = Score(P2) / 2

=> Score(A2) = Score(A1) / n

49

1F. SCENARIO: DOMAIN SPECIFIC

Domain specific: the average number of citations varies from domain to domain.

An average paper is cited about 6 times in life sciences and less than once in mathematics

Domain specific

Publication No

Citation No

H-Index No

PR-A Yes

Pira Yes

50

PR-P VS PIRA

Quality of citing authors

51

VII. SNOW ON FORUM

1. Forum graph

2. Algorithm

3. Demonstration

52

1. FORUM GRAPH

53

2. ALGORITHM

Post User

wasWrittenBy, p-weight = 5

Wrote, c-weight = 0

Positive, c-weight = 2

Negative, c-weight = -2

• Path diversity = 6

• C-weight & P-weight

54

3. DEMONSTRATION

55

3. DAMPING FACTOR

df: the probability that a web surfer gets bored and decides to jump to a random website

In practice, Google sets df to 0.15

The score of a website is the probability that it is visited in an infinite journey on the web graph that follows two rules:

When the visitor arrives at a website A that has at least one outgoing link (i.e. links to at least one website):

• with probability df, the visitor picks a random website (from the set of all websites) to jump to and restarts the journey

• with probability 1 - df, the visitor picks a random outgoing link of A to follow

If the website does not have any outgoing links, the visitor picks a random website to restart the journey from.
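The two rules can be simulated directly. This is a sketch, not Google's implementation: it estimates the scores as visit frequencies over a long but finite journey, and the function name, the step count, the seed and the four-site graph are all assumptions made for illustration.

```python
import random


def random_surfer_scores(graph, df=0.15, steps=200_000, seed=0):
    """Estimate scores as visit frequencies of a long random journey.

    graph maps each website to the list of websites it links to.
    """
    rng = random.Random(seed)
    sites = list(graph)
    visits = dict.fromkeys(sites, 0)
    current = rng.choice(sites)
    for _ in range(steps):
        visits[current] += 1
        links = graph[current]
        if not links or rng.random() < df:
            # Dead end, or bored visitor: restart from a random website.
            current = rng.choice(sites)
        else:
            # Follow one outgoing link of the current website.
            current = rng.choice(links)
    return {site: count / steps for site, count in visits.items()}


# Hypothetical four-site web graph; D has no outgoing links.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
scores = random_surfer_scores(web)
print(scores)
```

In this graph D is reached only through random jumps, so its estimated score stays well below that of C, which receives links from both A and B.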

Adaptation to a Markov chain

56