The Path Ranking Algorithms for Relational Retrieval Problems Presented by Ni Lao 2009.9.3.

Page 1

The Path Ranking Algorithms for Relational Retrieval Problems

Presented by Ni Lao

2009.9.3

Page 2

Outline

• Problem definition

• Related works and our contribution

• Random walk algorithms with path parameterization

• Experiment

Page 3

IR Trend

• Data in current information retrieval tasks are becoming increasingly diverse in entity types and relation types
– relational databases (Balmin et al., 2004)
– citation networks (e.g., CiteSeer, DBLP)
– movie databases (e.g., IMDB)
– music databases (Konstas et al., 2009)
– homeland security (Lin & Chalupsky, 2008)
– structured retrieval of annotated text (Bilotti et al., 2007)
– Personal Information Management (PIM; Minkov & Cohen, 2007)

Page 4

Formal Definition

• These relational structured data can be represented by an Entity-Relation (ER) graph
– a set of entity types T={T}
– a set of entities E={e}. Each entity is typed with e.T ∈ T. The instantiation of type T is I(T)={e | e.T=T}.
– a set of typed and ordered relations R={R}. Each relation R links a pair of entity types (R.T1, R.T2).
  • R(e1,e2)=1/0 denotes whether or not e1 and e2 have relation R
  • R(e,•)={e' | R(e,e')=1} denotes the set of entities that have relation R with e

– an Entity-Relation graph G=(T,E,R).

Page 5

Formal Definition

• Generally we define the Relational Retrieval (RR) problem as
– given a set of query entities Eq={e'}
– predict the relevance of each entity e of the target entity type Tq
– call q=(Eq,Tq) a query

Page 6

Related Works

• Keyword search in relational databases
– Answers are defined as trees connecting all query entities, with the target entity as root
  • BANKS (Bhalotia et al., 2002; Bhavana et al., 2008)
  • DBXplorer (Agrawal et al., 2002)
  • Discover (Hristidis & Papakonstantinou, 2002)
  • BLINKS (He et al., 2007)

• Ad-hoc retrieval style task definition
– Entities are ranked by their closeness to the query words
– Closeness is defined by a random walk on the graph
  • PageRank (Brin & Page, 1998)
  • Topic-sensitive PageRank (Haveliwala, 2002)
  • Personalized PageRank (Jeh & Widom, 2003)
  • ObjectRank (Balmin et al., 2004)
  • Personal information management (Minkov & Cohen, 2007)
  • Gene detection (Arnold & Cohen, 2009)

Page 7

Related Works

• Improving random walk models in a supervised fashion
– quadratic programming (Tsoi et al., 2003)
– simulated annealing (Nie et al., 2005)
– back-propagation (Diligenti et al., 2005; Minkov & Cohen, 2007)
– limited-memory Newton method (Agarwal et al., 2006)

• Limitations
  • Expressive power of the model
    – Actual paths (as opposed to individual relations used during the random walk) can be very indicative (Minkov & Cohen, 2007)
  • Lack of training data
    – 18 testing queries by Richardson & Domingos (2001)
    – 4 testing queries by Balmin et al. (2004)
    – 10 training and testing queries by Chakrabarti and Agarwal (2006)
    – <30 training queries on various tasks by Minkov et al. (2006)
    – learning from page orders generated by artificially manipulated models by Tsoi et al. (2003) and Agarwal et al. (2006)

Page 8

Path Matters

• Example

Page 9

This Work

• Path Ranking Algorithm (PRA)
– Modify random walk model to path parameterization
– Modify PageRank to path parameterization
– Modify teleport learning to path parameterization
– Demonstrate the importance of L1 and L2 regularization
– Provide the first large scale evaluation
  • Several realistic tasks, each having thousands of training and testing queries

Page 10

PRA: Single Entity Queries

• Given an ER graph G=(T,E,R) and a query q=(Eq,Tq)
– A type path P=(T1, …,Tn) is a sequence of entity types, with the constraint that (Ti,Ti+1) ∈ R.
– Let P(q, l) be the set of type paths that start with the query entity's type, end with Tq, and have length ≤ l.
– For each type path P=(T1, …,Tn) in P(q, l), we define a series of distributions hi(e), e.T=Ti

$$h_i(e) = \sum_{e' \in I(T_{i-1})} h_{i-1}(e') \cdot \frac{R_i(e', e)}{|R_i(e', \cdot)|}$$
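The series of distributions hi(e) on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the graph is a hypothetical dictionary of adjacency lists, a path is given as a list of relation names rather than entity types, and the starting distribution is assumed to put all probability mass on the single query entity (the slide does not state the base case).

```python
from collections import defaultdict


def path_distribution(relations, path, query_entity):
    """Propagate probability mass from query_entity along a relation path.

    relations: dict relation name -> dict entity -> list of neighbour entities
               (hypothetical layout, chosen for this sketch)
    path:      list of relation names to follow in order
    Returns a dict entity -> probability for the last step of the path.
    """
    h = {query_entity: 1.0}  # assumed base case: all mass on the query entity
    for rel in path:
        nxt = defaultdict(float)
        for src, mass in h.items():
            targets = relations[rel].get(src, [])
            if not targets:
                continue  # mass at entities without outgoing edges is dropped
            share = mass / len(targets)  # divide by |R(e', .)|
            for dst in targets:
                nxt[dst] += share
        h = dict(nxt)
    return h


# Toy citation graph (made-up data).
relations = {
    "Cite": {"p1": ["p2", "p3"], "p2": ["p3"]},
    "WrittenBy": {"p2": ["a1"], "p3": ["a1", "a2"]},
}
h = path_distribution(relations, ["Cite", "WrittenBy"], "p1")
print(h)
```

On the toy graph, paper p1 cites p2 and p3 with equal probability, and the mass then flows to their authors, so a1 receives more probability than a2.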

Page 11

PRA: Single Entity Queries

• All the type paths can be summarized as a prefix tree, with each node corresponding to a distribution hi(e) over the entities

[Figure: prefix tree of type paths over Paper and Author nodes, with edges labeled Write, WrittenBy, Cite, and CiteBy]

• A PRA model (G, l, θ) ranks I(Tq) by the scoring function

$$\mathrm{score}(e; l, \theta) = \sum_{P \in \mathcal{P}(q, l)} \theta_P \, h_P(e)$$

• In matrix form s=Aθ, where s is a (sparse) column vector of scores of each entity, θ is a column vector of weights for the type paths, and each column of A is the distribution hP(e) of a path P
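As a rough illustration of the scoring function, the sketch below combines per-path distributions into entity scores by a weighted sum and then sorts the entities. The path names, distributions, and weights are made-up values, and storing each distribution as a sparse dictionary is an assumption of this sketch.

```python
def score_entities(path_distributions, theta):
    """Weighted sum of path distributions: score(e) = sum_P theta_P * h_P(e).

    path_distributions: dict path name -> dict entity -> probability
    theta:              dict path name -> weight (missing paths count as 0)
    """
    scores = {}
    for path, h in path_distributions.items():
        weight = theta.get(path, 0.0)
        for entity, prob in h.items():
            scores[entity] = scores.get(entity, 0.0) + weight * prob
    return scores


def rank(scores):
    """Entities sorted by decreasing score, ties broken by name."""
    return sorted(scores, key=lambda e: (-scores[e], e))


# Hypothetical distributions for two type paths and hypothetical weights.
path_distributions = {
    "Cite/WrittenBy": {"a1": 0.75, "a2": 0.25},
    "WrittenBy": {"a2": 1.0},
}
theta = {"Cite/WrittenBy": 2.0, "WrittenBy": 0.5}
scores = score_entities(path_distributions, theta)
ranking = rank(scores)
print(scores, ranking)
```

Each dictionary plays the role of one sparse column of A, so the loop is the product Aθ computed column by column.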

Page 13

Parameter Learning

• Given a set of training data D={(q(m), A(m), y(m))} where m=1…M, y(m)(e)=1/0, we define a regularized objective function

$$O(\theta) = \sum_{m=1..M} o_m(\theta) - \lambda_1 |\theta|_1 - \lambda_2 |\theta|_2 / 2$$

• om(θ) can take various forms, such as log-loss (logistic regression), negative hinge loss (SVM), or negative exponential loss (boosting)

• Here we use log-loss, which is easy to optimize and does not penalize outlier samples as harshly as exponential loss

• Use orthant-wise L-BFGS (Andrew & Gao, 2007) to tune θ

Page 14

Parameter Learning

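• Let P(m) be the index set of relevant entities, and N(m) the index set of irrelevant entities (how to choose them will be discussed later)

• We use the average log-likelihood of positive and negative entities as the objective om(θ)

$$o_m(\theta) = |P^{(m)}|^{-1} \sum_{i \in P^{(m)}} \ln p_i^{(m)} + |N^{(m)}|^{-1} \sum_{i \in N^{(m)}} \ln\left(1 - p_i^{(m)}\right)$$

$$p_i^{(m)} = p\left(y_i^{(m)} = 1 \mid q^{(m)}; \theta\right) = \frac{\exp\left(\theta^T A_i^{(m)}\right)}{1 + \exp\left(\theta^T A_i^{(m)}\right)}$$

• Its gradient

$$\frac{\partial o_m(\theta)}{\partial \theta} = |P^{(m)}|^{-1} \sum_{i \in P^{(m)}} \left(1 - p_i^{(m)}\right) A_i^{(m)} - |N^{(m)}|^{-1} \sum_{i \in N^{(m)}} p_i^{(m)} A_i^{(m)}$$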
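The per-query log-likelihood objective and its gradient can be sketched as follows. This is a plain-Python illustration under stated assumptions: A is stored as a list of rows (one row of path features per entity), the data are made up, and regularization is left out.

```python
import math


def dot(theta, row):
    return sum(t * a for t, a in zip(theta, row))


def sigmoid(z):
    # Equivalent to exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))


def objective_and_gradient(A, positives, negatives, theta):
    """Average log-likelihood of positive and negative entities for one query.

    A:         list of rows; A[i] holds the path features of entity i
    positives: indices of relevant entities
    negatives: indices of irrelevant entities
    Returns (objective value, gradient with respect to theta).
    """
    o = 0.0
    grad = [0.0] * len(theta)
    for i in positives:
        p = sigmoid(dot(theta, A[i]))
        o += math.log(p) / len(positives)
        for d, a in enumerate(A[i]):
            grad[d] += (1.0 - p) * a / len(positives)
    for i in negatives:
        p = sigmoid(dot(theta, A[i]))
        o += math.log(1.0 - p) / len(negatives)
        for d, a in enumerate(A[i]):
            grad[d] -= p * a / len(negatives)
    return o, grad


# Made-up example: three entities, two type paths.
A = [[0.8, 0.1], [0.2, 0.0], [0.0, 0.9]]
positives, negatives = [0], [1, 2]
print(objective_and_gradient(A, positives, negatives, [0.0, 0.0]))
```

A quick way to sanity-check such a gradient is to compare it against a finite-difference estimate of the objective at a few values of θ.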

Page 15

Parameter Learning

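• For a retrieval system we may prefer to optimize pair-wise margins
– For each pair of entities, predict whether one should be ranked higher than the other (ei ≻ ej)

$$p_{j,k}^{(m)} = p\left(e_j \succ e_k \mid q^{(m)}; \theta\right) = \frac{\exp\left(\theta^T A_j^{(m)} - \theta^T A_k^{(m)}\right)}{1 + \exp\left(\theta^T A_j^{(m)} - \theta^T A_k^{(m)}\right)}$$

$$o_m(\theta) = |P^{(m)}|^{-1} |N^{(m)}|^{-1} \sum_{j \in P^{(m)}} \sum_{k \in N^{(m)}} \ln p_{j,k}^{(m)}$$

• Its gradient

$$\frac{\partial o_m(\theta)}{\partial \theta} = |P^{(m)}|^{-1} |N^{(m)}|^{-1} \sum_{j \in P^{(m)}} \sum_{k \in N^{(m)}} \left(1 - p_{j,k}^{(m)}\right) \left(A_j^{(m)} - A_k^{(m)}\right)$$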
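The pair-wise variant can be sketched in the same style: every (positive, negative) pair contributes the log-probability that the positive entity is ranked above the negative one. The row-per-entity layout of A and the numbers are assumptions of this sketch.

```python
import math


def pairwise_objective_and_gradient(A, positives, negatives, theta):
    """Average log-probability that each positive outranks each negative.

    A[i] is the row of path features for entity i (hypothetical layout).
    Returns (objective value, gradient with respect to theta).
    """
    n_pairs = len(positives) * len(negatives)
    o = 0.0
    grad = [0.0] * len(theta)
    for j in positives:
        for k in negatives:
            diff = [a - b for a, b in zip(A[j], A[k])]  # A_j - A_k
            z = sum(t * d for t, d in zip(theta, diff))
            p = 1.0 / (1.0 + math.exp(-z))  # p(e_j ranked above e_k)
            o += math.log(p) / n_pairs
            for d, x in enumerate(diff):
                grad[d] += (1.0 - p) * x / n_pairs
    return o, grad


# Made-up example: entity 0 is relevant, entities 1 and 2 are not.
A_pw = [[0.8, 0.1], [0.2, 0.0], [0.0, 0.9]]
o_pw, g_pw = pairwise_objective_and_gradient(A_pw, [0], [1, 2], [0.0, 0.0])
print(o_pw, g_pw)
```

Note that the cost grows with the product of the number of positives and negatives, which is one reason the negative sampling discussed later in the talk matters.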

Page 16

EntityRank: Using Query Independent Paths

• PageRank assigns an importance score (independent of the query terms) to each web page; this importance score is later combined with a (query-dependent) relevance score

• We add to each query a special entity e0 of special type T0, which has a relation to each entity type in the system; e0 is linked to each entity in the entity-relation graph.

• T0 therefore introduces a set of query-independent type paths, which can be calculated offline

[Figure: type-path tree rooted at T0 over Author and Paper nodes, with edges labeled Wrote, WrittenBy, Cite, and CiteBy]

Page 17

Modeling Hidden Factors

• Entity specific information not captured by the model

– E.g., some documents ranked lower for a query may still be interesting to the users because of features not captured in the data

– E.g., different users may have completely different information needs and goals under the same query

– The identity of entity matters

Page 18

Modeling Hidden Factors

• Hidden Factors
– We introduce a set of hidden factors for each entity, one for each path starting from the entity and leading to the target entity type
  • papers -(cite)→ papers -(written by)→ authors
  • papers -(written by)→ authors
  • authors
  • authors -(write)→ papers -(written by)→ authors
– Distribution matrix A is augmented to [A Ahf], where each column of Ahf is the distribution of a certain hidden path.
– Similarly θ is augmented to [θ; θhf]
– Suppose there are 10,000 authors and 10,000 papers in the graph. Then the model would include 40,000 parameters (one per starting entity for each of the four paths above: 4 × 10,000).

Page 19

Modeling Hidden Factors

• However, hidden factors are
– Large: a parameter space of potentially |E|^2
– Redundant: many hidden paths point to the same target entities

• Simplified model: Instantiated Relations
– For a task with query type Qq and target type Tq,
– define a set of special relations from a special entity e1 to each target entity e in Tq, and a set of special relations from each query entity e' in Qq to each target entity e in Tq
– Each such relation has its own weight

Page 20

Modeling Hidden Factors

• How can so many parameters be optimized efficiently?

• Only add important relations to the model
– Importance is measured by |∂O(θ)/∂θR|
– Add at most the top b (batch size) relations at each L-BFGS iteration
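The selection step above might look like the following sketch: among the relations not yet in the model, pick the b with the largest gradient magnitude. The function name, the dictionary layout, and the gradient values are hypothetical, and how the gradients are obtained inside the optimizer is not covered here.

```python
def select_relations(gradients, active, b):
    """Pick at most b inactive relations with the largest |dO/dtheta_R|.

    gradients: dict relation name -> partial derivative at the current theta
    active:    set of relation names already in the model
    b:         batch size
    """
    candidates = [
        (abs(g), rel)
        for rel, g in gradients.items()
        if rel not in active and g != 0.0
    ]
    # Largest magnitude first; ties broken by name for determinism.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [rel for _, rel in candidates[:b]]


# Made-up gradient values for four instantiated relations.
gradients = {">Nature": -0.9, ">Cell": 0.4, ">Gene": 0.05, ">Yeast": -0.6}
chosen = select_relations(gradients, active={">Cell"}, b=2)
print(chosen)
```

Relations that are never selected keep a weight of zero and so never have to be stored or updated.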

Page 21

Efficiency Considerations

• Path Blocking
– Forbid the random walk from following a relation immediately after its reverse (e.g., write after write^-1)

• Maintaining Distribution Sparsity
– For time/memory efficiency, sampling has been used to get an approximate but sparse estimate of the distribution
– Here we use a truncation strategy:
  • At each random walk step, hi(e) = max(0, hi(e) − γE[hi(e)])
  • E[hi] is the average of hi(e) over entities with non-zero values

• Few positive entities vs. thousands (or millions) of negative entities?
– First sort all the negative entities with the initial model (all feature weights uniformly set to 1.0), then
– Square sampling: take negative entities at the k(k+1)/2-th positions
– Cubic sampling: take the k(k+1)(k+2)/6-th positions
– Exponential sampling: take the (2^k − 1)-th positions, k=0,1,2,3,...
– Top-k sampling: take the top-k negative entities
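Both the truncation rule and the sampling positions are easy to check on small inputs. The sketch below is an illustration under assumptions: distributions are sparse dictionaries, positions are 0-based with k starting at 0 for all three schemes, and the truncated distribution is not renormalized (the slide does not say whether it should be).

```python
def truncate(h, gamma):
    """Apply h(e) = max(0, h(e) - gamma * mean of non-zero h) and drop zeros."""
    nonzero = [v for v in h.values() if v > 0.0]
    if not nonzero:
        return {}
    cutoff = gamma * sum(nonzero) / len(nonzero)
    return {e: v - cutoff for e, v in h.items() if v - cutoff > 0.0}


def sample_positions(scheme, limit):
    """Positions below `limit` in the sorted list of negative entities."""
    formulas = {
        "square": lambda k: k * (k + 1) // 2,
        "cubic": lambda k: k * (k + 1) * (k + 2) // 6,
        "exponential": lambda k: 2 ** k - 1,
    }
    formula = formulas[scheme]
    positions, k = [], 0
    while formula(k) < limit:
        positions.append(formula(k))
        k += 1
    return positions


sparse = truncate({"a": 0.6, "b": 0.3, "c": 0.1}, gamma=0.5)
print(sparse)
for name in ("square", "cubic", "exponential"):
    print(name, sample_positions(name, 20))
```

All three schemes sample densely near the top of the ranking and ever more sparsely further down, so the hardest negatives are well represented while the total stays small.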

Page 22

Experiment: Data

• Data sources
– PubMed: on-line archive of over 18 million biological abstracts
– PubMed Central (PMC): full-text copies of over 1 million of these papers
– Saccharomyces Genome Database (SGD): a database of information concerning the yeast

• The nodes of our network are:
– 48,641 papers contained in SGD
– 69,161 authors of SGD papers
– 5,816 genes of yeast, mentioned in SGD
– 58 years, from 1950 through 2008
– 1,126 journals
– 39,827 unique title terms, after applying a stop word list of size 429

Page 23

Experiment: Data

• The edges of our network are
– 376,010 Citation relations among papers
– 1,604 RelatesTo relations from genes to other genes
– 178,233 Authorship relations from authors to the papers
  • further distinguished as: any author, first author, and last author
– 160,621 Mention relations from papers to the genes they discuss
  • further distinguished into 49 categories in the SGD database, such as “Evolution”, “Function/Process”, “Mutants/Phenotypes”, etc.
– HasTitleTerm, InJournal, InYear relations for each paper
– Before relations from each year to its next year

[Figure: network schema with node types <Author>, <Paper>, <Gene>, <Year>, <Title Term>, <Institute>, and <Journal>, and edges labeled First/last/any author, Cites, Mentions (of aspect), Relates to, Before, In, and Contains]

Page 24

Experiment: Task

• The Paper Completion Tasks (PCT)
– Treat a paper as a big form with fields; the task is to predict one field given some of the other fields. We have 16k query-judgment pairs for each task
– Y-J:
  • Eq=year, Tq={journal}
  • suggest hot journals of a year
– YGW-J:
  • Eq=year ∪ genes ∪ words, Tq={journal}
  • suggest a journal in which to publish a research work
– YGW-P:
  • Eq=year ∪ genes ∪ words, Tq={citation}
  • help with literature review
– YA-G/YA-W:
  • Eq=year ∪ authors, Tq={gene/title}
  • suggest topics a researcher might currently be interested in

• Time variant graph
– each edge is tagged with a time stamp (year in this case)
– only consider edges that are earlier than the query when doing the random walk
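The time-variant restriction can be pictured as a filter applied when a random walk step looks up neighbours. The edge-list layout and the years below are made up for illustration, and "earlier than the query" is read as strictly earlier, which is an assumption.

```python
def neighbours_before(edges, source, query_year):
    """Targets reachable from `source` over edges stamped before query_year.

    edges: list of (source, target, year) tuples (hypothetical layout)
    """
    return [
        dst
        for src, dst, year in edges
        if src == source and year < query_year
    ]


# Made-up citation edges tagged with the year of the citing paper.
edges = [
    ("p1", "p2", 1999),
    ("p1", "p3", 2005),
    ("p1", "p4", 2003),
    ("p2", "p4", 1998),
]
visible = neighbours_before(edges, "p1", 2003)
print(visible)
```

Filtering this way keeps the walk from using information that would not have existed at the time of the query.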

Page 25

Experiment: Main Result

• Compare retrieval qualities by MAP

          Y-J    YGW-J  YGW-P  YA-G   YA-W
RRAb      0.252  0.404  0.127  0.144  0.201
RRA       0.251  0.462  0.151  0.144  0.199
PRAb      0.252  0.404  0.126  0.143  0.201
PRA       0.247  0.447  0.163  0.149  0.200
PRAeb     0.206  0.271  0.154  0.135  0.184
PRAe      0.339  0.464  0.162  0.151  0.202
PRA-ir30  0.386  0.487  0.167  0.141  0.16
PRA-hf5   0.375  0.479  0.155  0.151  0.179
PRA-P     0.246  0.456  0.166  0.148  0.194
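For readers unfamiliar with the metric, mean average precision (MAP) can be computed as sketched below. The slides do not spell out which MAP variant was used, so this is the standard textbook definition applied to made-up rankings.

```python
def average_precision(ranking, relevant):
    """Mean of precision@rank taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for position, entity in enumerate(ranking, start=1):
        if entity in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranking, set of relevant entities), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Two hypothetical queries with their rankings and relevant sets.
runs = [
    (["j1", "j2", "j3", "j4"], {"j1", "j3"}),
    (["a", "b"], {"b"}),
]
map_value = mean_average_precision(runs)
print(map_value)
```

Relevant entities that never appear in a ranking still count in the denominator, so missing them lowers the score.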

Page 26

Experiment: Main Result

• Compare scalabilities by training time (s)

          Y-J   YGW-J  YGW-P   YA-G  YA-W
RRAb      0     0      0       0     0
RRA       32    121    121     39    68
PRAb      0     0      0       0     0
PRA       38    133    2,024   190   69
PRAeb     0     0      408     0     0
PRAe      30    146    4,286   274   120
PRA-ir30  75    253    10,512  393   790
PRA-hf5   126   2,660  749*    96*   328*
PRA-P     30    117    9,520   234   120

Max path length of hidden factors is set to 2, and set to 1 in the tasks marked with *

Page 27

Experiment: threshold γ

• Effect of threshold γ on RRAb: not very sensitive

[Figure: MAP (y-axis, 0 to 0.5) plotted against the threshold (x-axis, 0 to 1.5) for tasks Y-J, YGW-J, and YGW-P]

Page 28

Experiment: L2

• Usually there is a bump

[Figure: two panels, labeled YGW-J and YGW-P, plotting MAP against λ2 on a log scale; series PRA2, PRAe2, PRA3, PRAe3, PRA4, PRAe4 (with an -S suffix in one panel)]

Page 29

Experiment: L1

• Usually there is a bump

[Figure: two panels, labeled "YGW-J, λ1=0" and "YGW-P, λ1=0", plotting MAP against λ2 on a log scale; series PRA2, PRAe2, PRA3, PRAe3, PRA4, PRAe4 (with an -S suffix in one panel)]

Page 30

Experiment: L1

• Can eliminate paths without reducing MAP

• YGW-P, λ2=0.003

[Figure: two panels plotting MAP and the number of features against λ1 on a log scale (0.001 to 10); series PRA3-S, PRAe3-S, PRA4-S, PRAe4-S, PRA5-S, PRAe5-S]

Page 31

Peek into the Model Weights

• Y-J, PRA3-i10

weight  Path
26.5    y(_Be)y(_Ye)p(Jo)j
0.89    >Proc_Natl_Acad_Sci_U_S_A
0.87    >Mol_Cell_Biol
0.63    >EMBO_J
0.59    >Genetics
-2.26   >Cell
-2.65   >Nature
-2.81   >Mol_Microbiol
-3.14   >Glycobiology
-3.15   >Nat_Genet
-3.2    >Eur_J_Biochem
-3.21   >FEBS_Lett
-3.21   >Gene
-3.22   >Mol_Gen_Genet
-3.23   >J_Biol_Chem
-3.25   >Yeast

Page 32

Peek into the Model Weights

• YGW-J, PRAe3-i5

weight  Path
24.06   w(_Ti)p(_Ci)p(Jo)j
19.98   w(_Ti)p(Jo)j
13.90   T(pa)p(_Ci)p(Jo)j
9.68    g(_Ge)p(_Ci)p(Jo)j
3.11    w(_Ti)p(Ci)p(Jo)j
…       …
-2.26   >Nature
-2.63   >Biochim_Biophys_Acta
-2.68   >J_Biol_Chem
-2.73   >Science
-2.79   >Biochem_Biophys_Res_Commun
-3.01   >Cell
-3.05   >FEBS_Lett
-3.81   >Eur_J_Biochem
-3.84   >Gene
-4.17   >Mol_Gen_Genet
-4.23   >Yeast

Page 33

Peek into the Model Weights

• YGW-P, PRAe5

ID  weight  Path
1   132.1   w(_Ti)p(Ci)p
2   34.18   w(_Ti)p
3   13.49   w(_Ti)p(_Ci)p
4   9.795   g(_Ge)p(Ci)p
5   2.9     g(_Ge)p
6   -0.539  g(_Ge)p(FA)a(_LA)p(Au)a(_LA)p
7   -0.869  g(_Ge)p(FA)a(_LA)p
8   -1.282  g(_Ge)p(FA)a(_LA)p(FA)a(_LA)p
9   -3.9    g(_Ge)p(LA)a(_FA)p
10  -6.589  T(ye)y(_Be)y(_Be)y(_Ye)p
11  -6.591  T(ye)y(_Be)y(_Ye)p
12  -6.594  T(ye)y(_Ye)p

Page 34

• Thanks