The Path Ranking Algorithms for Relational Retrieval Problems Presented by Ni Lao 2009.9.3.


Outline

• Problem definition

• Related works and our contribution

• Random walk algorithms with path parameterization

• Experiment

IR Trend

• Data in current information retrieval tasks is becoming increasingly diverse in entity types and relation types
– relational databases (Balmin et al., 2004)
– citation networks (e.g., CiteSeer, DBLP)
– movie databases (e.g., IMDB)
– music databases (Konstas et al., 2009)
– homeland security (Lin & Chalupsky, 2008)
– structured retrieval of annotated text (Bilotti et al., 2007)
– Personal Information Management (PIM; Minkov & Cohen, 2007)

Formal Definition

• Such relational data can be represented by an Entity-Relation (ER) graph
– a set of entity types T={T}
– a set of entities E={e}. Each entity is typed, with e.T ∈ T. The instantiation of type T is I(T)={e | e.T=T}.
– a set of typed and ordered relations R={R}. Each relation is defined over a pair of entity types (R.T1, R.T2).
  • R(e1,e2)=1/0 denotes whether or not e1 and e2 have relation R
  • R(e,•)={e' | R(e,e')=1} denotes the set of entities that have relation R with e
– an Entity-Relation graph G=(T,E,R)
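The definitions above can be sketched as a small data structure. This is only an illustration: the class name `ERGraph`, its method names, and the toy citation data are assumptions and do not come from the slides.

```python
from collections import defaultdict


class ERGraph:
    """Toy Entity-Relation graph G = (T, E, R); all names are illustrative."""

    def __init__(self):
        self.entity_type = {}          # e -> e.T
        self.relation_types = {}       # R -> (R.T1, R.T2)
        self.edges = defaultdict(set)  # (R, e) -> R(e, .)

    def add_entity(self, e, t):
        self.entity_type[e] = t

    def add_relation(self, r, t1, t2):
        self.relation_types[r] = (t1, t2)

    def add_edge(self, r, e1, e2):
        t1, t2 = self.relation_types[r]
        # Enforce the typing constraint of relation r.
        assert self.entity_type[e1] == t1 and self.entity_type[e2] == t2
        self.edges[(r, e1)].add(e2)

    def instances(self, t):
        """I(T) = {e | e.T = T}."""
        return {e for e, et in self.entity_type.items() if et == t}

    def holds(self, r, e1, e2):
        """R(e1, e2) as 1/0."""
        return int(e2 in self.edges.get((r, e1), set()))

    def neighbors(self, r, e):
        """R(e, .) = {e' | R(e, e') = 1}."""
        return set(self.edges.get((r, e), set()))


g = ERGraph()
g.add_entity("p1", "Paper")
g.add_entity("p2", "Paper")
g.add_entity("a1", "Author")
g.add_relation("Cite", "Paper", "Paper")
g.add_relation("WrittenBy", "Paper", "Author")
g.add_edge("Cite", "p1", "p2")
g.add_edge("WrittenBy", "p2", "a1")
```

Relations are stored as directed adjacency sets, so R(e,•) is a single lookup; a reverse relation would have to be added as a separate relation.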

Formal Definition

• Generally, we define the Relational Retrieval (RR) problem as
– given a set of query entities Eq={e'}
– predict the relevance of each entity e of the target entity type Tq
– call q=(Eq,Tq) a query

Related Works

• Keyword search in relational databases
– An answer is defined as a tree that connects all query entities and has the target entity as its root
  • BANKS (Bhalotia et al., 2002; Bhavana et al., 2008)
  • DBXplorer (Agrawal et al., 2002)
  • Discover (Hristidis & Papakonstantinou, 2002)
  • BLINKS (He et al., 2007)

• Ad-hoc retrieval style task definition
– Entities are ranked by their closeness to the query words
– Closeness is defined by a random walk on the graph
  • PageRank (Brin & Page, 1998)
  • Topic-sensitive PageRank (Haveliwala, 2002)
  • Personalized PageRank (Jeh & Widom, 2003)
  • ObjectRank (Balmin et al., 2004)
  • Personal information management (Minkov & Cohen, 2007)
  • Gene detection (Arnold & Cohen, 2009)

Related Works

• Improving random walk models in a supervised fashion
– quadratic programming (Tsoi et al., 2003)
– simulated annealing (Nie et al., 2005)
– back-propagation (Diligenti et al., 2005; Minkov & Cohen, 2007)
– limited-memory Newton method (Agarwal et al., 2006)

• Limitations

• Expressive power of the model
– Actual paths (as opposed to the individual relations used during a random walk) can be very indicative (Minkov & Cohen, 2007)

• Lack of training data
– 18 testing queries by Richardson & Domingos (2001)
– 4 testing queries by Balmin et al. (2004)
– 10 training and testing queries by Chakrabarti and Agarwal (2006)
– <30 training queries on various tasks by Minkov et al. (2006)
– learning from page orders generated by artificially manipulated models, by Tsoi et al. (2003) and Agarwal et al. (2006)

Path Matters

• Example

This Work

• Path Ranking Algorithm (PRA)
– Modify the random walk model to use path parameterization
– Modify PageRank to use path parameterization
– Modify teleport learning to use path parameterization
– Demonstrate the importance of L1 and L2 regularization
– Provide the first large-scale evaluation
  • Several realistic tasks, each having thousands of training and testing queries

PRA: Single Entity Queries

• Given an ER graph G=(T,E,R) and a query q=(Eq,Tq)
– A type path P=(T1, …,Tn) is a sequence of entity types, with the constraint that (Ti,Ti+1) ∈ R.
– Let P(q, l) be the set of type paths that start with the type of the query entity, end with Tq, and have length ≤ l.
– For each type path P=(T1, …,Tn) in P(q, l), we define a series of distributions hi(e), e.T=Ti

$$h_{i+1}(e) = \sum_{e' \in I(T_i)} h_i(e') \, \frac{R_i(e', e)}{|R_i(e', \cdot)|}$$

where R_i is the relation that links T_i to T_{i+1}.
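A minimal sketch of the path-constrained random walk defined by this recursion. It assumes that the initial distribution puts all probability mass on the single query entity (the slides do not state the starting distribution), and it identifies a path by its sequence of relation names; `path_distribution` and the toy edges are hypothetical.

```python
def path_distribution(edges, relation_path, start):
    """Apply h_{i+1}(e) = sum_{e'} h_i(e') * R_i(e', e) / |R_i(e', .)| step by step.

    edges: dict mapping relation name -> {entity: set of neighbours}
    relation_path: list of relation names, one per step of the walk
    start: the query entity, assumed to carry all initial probability mass
    """
    h = {start: 1.0}
    for rel in relation_path:
        nxt = {}
        for e_prev, mass in h.items():
            nbrs = edges.get(rel, {}).get(e_prev, ())
            if not nbrs:
                continue  # dead end: this mass is dropped
            share = mass / len(nbrs)  # uniform split over R_i(e', .)
            for e in nbrs:
                nxt[e] = nxt.get(e, 0.0) + share
        h = nxt
    return h


edges = {
    "Cite": {"p1": {"p2", "p3"}},
    "WrittenBy": {"p2": {"a1"}, "p3": {"a1", "a2"}},
}
# Paper -(Cite)-> Paper -(WrittenBy)-> Author, starting from paper p1.
h = path_distribution(edges, ["Cite", "WrittenBy"], "p1")
```

In this toy graph the walk reaches a1 through both cited papers and a2 through only one, so a1 receives the larger share of the mass.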

PRA: Single Entity Queries

• All the type paths can be summarized as a prefix tree, with each node corresponding to a distribution hi(e) over the entities

[Figure: prefix tree of type paths over Paper and Author nodes, with edges labeled WrittenBy, Write, Cite, and CiteBy]

• A PRA model (G, l, θ) ranks I(Tq) by the scoring function

$$\mathrm{score}(e; \theta, l) = \sum_{P \in \mathcal{P}(q, l)} h_P(e) \, \theta_P$$

• In matrix form, s=Aθ, where s is a (sparse) column vector of scores for each entity, θ is a column vector of weights for the type paths, and each column of A is the distribution hP(e) of a path P
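The scoring function is a weighted sum of path distributions, which can be sketched with sparse dictionaries in place of the matrix product s=Aθ. The function name, path labels, and weights below are made up for illustration.

```python
def score_entities(path_distributions, theta):
    """score(e) = sum over paths P of h_P(e) * theta_P, computed sparsely.

    path_distributions: dict path label -> {entity: h_P(e)}  (columns of A)
    theta: dict path label -> weight theta_P
    """
    scores = {}
    for path, h in path_distributions.items():
        w = theta.get(path, 0.0)
        for e, prob in h.items():
            scores[e] = scores.get(e, 0.0) + w * prob
    return scores


dists = {
    "Paper-Cite-Paper-WrittenBy-Author": {"a1": 0.75, "a2": 0.25},
    "Paper-WrittenBy-Author": {"a2": 1.0},
}
theta = {
    "Paper-Cite-Paper-WrittenBy-Author": 2.0,
    "Paper-WrittenBy-Author": -1.0,
}
scores = score_entities(dists, theta)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Entities that no path reaches never appear in `scores`, which is what keeps the score vector sparse.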

Parameter Learning

• Given a set of training data D={(q(m), A(m), y(m))}, where m=1…M and y(m)(e)=1/0, we define a regularized objective function

$$O(\theta) = \sum_{m=1..M} o_m(\theta) - \lambda_1 |\theta|_1 - \lambda_2 |\theta|_2^2 / 2$$

• o(m)(θ) can take various forms, such as log-loss (logistic regression), negative hinge loss (SVM), or negative exponential loss (boosting)

• Here we use log-loss, which is easy to optimize and does not penalize outlier samples as harshly as exponential loss does

• Use orthant-wise L-BFGS (Andrew & Gao, 2007) to tune θ
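A small sketch of how the regularized objective combines the per-query terms with the two penalties. It assumes that the objective is maximized (so both penalties are subtracted) and that the L2 term is the squared norm divided by 2; the operators in the extracted slide equation are lost, so both points are assumptions. The function name and numbers are illustrative.

```python
def regularized_objective(per_query_objectives, theta, lam1, lam2):
    """O(theta) = sum_m o_m(theta) - lam1 * |theta|_1 - lam2 * |theta|_2^2 / 2.

    Assumes a maximization objective with a squared L2 penalty.
    """
    l1 = sum(abs(w) for w in theta)
    l2_sq = sum(w * w for w in theta)
    return sum(per_query_objectives) - lam1 * l1 - lam2 * l2_sq / 2.0


# Two queries with log-likelihood-style objectives, two path weights.
value = regularized_objective([-0.5, -0.25], [1.0, -2.0], lam1=0.1, lam2=0.5)
```

The L1 term is not differentiable at zero, which is the reason an orthant-wise variant of L-BFGS is used rather than the plain method.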

Parameter Learning

• Let P(m) be the index set of relevant entities, and N(m) the index set of irrelevant entities (how to choose them will be discussed later)

• We use the average log-likelihood of positive and negative entities as the objective om(θ)

$$o_m(\theta) = |P^{(m)}|^{-1} \sum_{i \in P^{(m)}} \ln p_i^{(m)} + |N^{(m)}|^{-1} \sum_{i \in N^{(m)}} \ln (1 - p_i^{(m)})$$

$$p_i^{(m)} = p(y_i^{(m)} = 1 \mid q^{(m)}; \theta) = \frac{\exp(\theta^T A_i^{(m)})}{1 + \exp(\theta^T A_i^{(m)})}$$

• Its gradient

$$\frac{\partial o_m(\theta)}{\partial \theta} = |P^{(m)}|^{-1} \sum_{i \in P^{(m)}} (1 - p_i^{(m)}) A_i^{(m)} - |N^{(m)}|^{-1} \sum_{i \in N^{(m)}} p_i^{(m)} A_i^{(m)}$$
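The average log-likelihood objective and its gradient can be sketched for a single query as follows, with a finite-difference check of one gradient component. The feature rows stand for the rows A_i of the path-distribution matrix; the function names and numbers are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def objective_and_gradient(rows, pos, neg, theta):
    """Average log-likelihood o_m(theta) and its gradient for one query.

    rows: list of feature rows A_i, one per candidate entity
    pos, neg: index lists of relevant / irrelevant entities
    """
    def dot(a):
        return sum(w * x for w, x in zip(theta, a))

    o = 0.0
    grad = [0.0] * len(theta)
    for i in pos:
        p = sigmoid(dot(rows[i]))
        o += math.log(p) / len(pos)
        for d, x in enumerate(rows[i]):
            grad[d] += (1.0 - p) * x / len(pos)  # positives pull the score up
    for i in neg:
        p = sigmoid(dot(rows[i]))
        o += math.log(1.0 - p) / len(neg)
        for d, x in enumerate(rows[i]):
            grad[d] -= p * x / len(neg)  # negatives push the score down
    return o, grad


rows = [[0.75, 0.0], [0.25, 1.0], [0.0, 0.5]]
theta = [0.3, -0.2]
o, grad = objective_and_gradient(rows, pos=[0], neg=[1, 2], theta=theta)

# Finite-difference approximation of the first gradient component.
eps = 1e-6
o_plus, _ = objective_and_gradient(rows, [0], [1, 2], [theta[0] + eps, theta[1]])
numeric = (o_plus - o) / eps
```

Averaging separately over positives and negatives keeps the few relevant entities from being swamped by the many irrelevant ones.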

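Parameter Learning

• For a retrieval system we may prefer to optimize pair-wise margins
– Predict, for each pair of entities, whether one should be ranked higher than the other (ei ≻ ej)

$$p_{j,k}^{(m)} = p(e_j \succ e_k \mid q^{(m)}; \theta) = \frac{\exp(\theta^T A_j^{(m)} - \theta^T A_k^{(m)})}{1 + \exp(\theta^T A_j^{(m)} - \theta^T A_k^{(m)})}$$

$$o_m(\theta) = |P^{(m)}|^{-1} |N^{(m)}|^{-1} \sum_{j \in P^{(m)}} \sum_{k \in N^{(m)}} \ln p_{j,k}^{(m)}$$

• Its gradient

$$\frac{\partial o_m(\theta)}{\partial \theta} = |P^{(m)}|^{-1} |N^{(m)}|^{-1} \sum_{j \in P^{(m)}} \sum_{k \in N^{(m)}} (1 - p_{j,k}^{(m)}) (A_j^{(m)} - A_k^{(m)})$$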
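A sketch of the pair-wise variant for one query, averaging the log-probability that each relevant entity outranks each irrelevant one. The function name and the toy feature rows are assumptions made for illustration.

```python
import math


def pairwise_objective_and_gradient(rows, pos, neg, theta):
    """o_m = mean over pairs (j in pos, k in neg) of ln p_jk, where
    p_jk = sigmoid(theta . (A_j - A_k)). Returns (o_m, gradient)."""
    n_pairs = len(pos) * len(neg)
    o = 0.0
    grad = [0.0] * len(theta)
    for j in pos:
        for k in neg:
            diff = [a - b for a, b in zip(rows[j], rows[k])]
            z = sum(w * x for w, x in zip(theta, diff))
            p = 1.0 / (1.0 + math.exp(-z))
            o += math.log(p) / n_pairs
            for d, x in enumerate(diff):
                grad[d] += (1.0 - p) * x / n_pairs
    return o, grad


rows = [[0.75, 0.0], [0.25, 1.0], [0.0, 0.5]]
# With all weights at zero every pair is a coin flip, so o_m = ln 0.5.
o, grad = pairwise_objective_and_gradient(rows, pos=[0], neg=[1, 2], theta=[0.0, 0.0])
```

The cost grows with the number of positive-negative pairs, which is one reason the choice of negative entities matters.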

EntityRank: Using Query Independent Paths

• PageRank assigns an importance score (independent of the query terms) to each web page, and this importance score is later combined with a relevance score (dependent on the query terms)

• We add to each query a special entity e0 of special type T0, which has a relation to each entity type in the system; e0 is linked to every entity in the entity-relation graph.

• T0 therefore introduces a set of query-independent type paths, which can be calculated offline

[Figure: type paths starting from T0 over Paper and Author nodes, with edges labeled Wrote, WrittenBy, CiteBy, and Cite]
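One way to picture a query-independent path is a walk whose first step goes from e0 uniformly to every entity of some type. The sketch below assumes that uniform first step (it follows from e0 being linked to every entity) and uses hypothetical function names and toy citation edges.

```python
def walk(edges, relation_path, start_dist):
    """Propagate a distribution along a relation path, splitting mass uniformly."""
    h = dict(start_dist)
    for rel in relation_path:
        nxt = {}
        for e_prev, mass in h.items():
            nbrs = edges.get(rel, {}).get(e_prev, ())
            for e in nbrs:
                nxt[e] = nxt.get(e, 0.0) + mass / len(nbrs)
        h = nxt
    return h


def query_independent_distribution(edges, entities_of_type, relation_path):
    """First step from e0: uniform over all entities of the chosen type."""
    n = len(entities_of_type)
    start = {e: 1.0 / n for e in entities_of_type}
    return walk(edges, relation_path, start)


edges = {"Cite": {"p1": {"p3"}, "p2": {"p3"}, "p3": {"p1"}}}
papers = ["p1", "p2", "p3"]
# T0 -> Paper -(Cite)-> Paper: a citation-based prior that does not depend on the query.
prior = query_independent_distribution(edges, papers, ["Cite"])
```

Because nothing in `prior` depends on the query, it can be computed once and cached.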

Modeling Hidden Factors

• Entity-specific information is not captured by the model

– E.g., a document ranked lower for a query may still be interesting to users because of features not captured in the data

– E.g., different users may have completely different information needs and goals under the same query

– The identity of the entity matters


• Hidden Factors
– We introduce a set of hidden factors for each entity, one for each path starting from the entity and leading to the target entity type
  • papers -(cite)→ papers -(written by)→ authors
  • papers -(written by)→ authors
  • authors
  • authors -(write)→ papers -(written by)→ authors
– The distribution matrix A is augmented to [A Ahf], where each column of Ahf is the distribution of a certain hidden path.
– Similarly, θ is augmented to [θ; θhf]
– Suppose there are 10,000 authors and 10,000 papers in the graph. Then the model would include 40,000 parameters


• However, hidden factors are
– Large: potentially |E|^2 space
– Redundant: many hidden paths point to the same target entities

• Simplified model: Instantiated Relations
– For a task with query type Qq and target type Tq,
– define a set of special relations from the special entity e1 to each target entity e in Tq, and a set of special relations from each query entity e' in Qq to each target entity e in Tq.
– Each such relation has its own weight


• How can so many parameters be optimized efficiently?

• Only add important relations to the model
– Importance is measured by |∂O(θ)/∂θR|
– Add at most the top b (batch size) relations at each L-BFGS iteration
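The selection step can be sketched as picking the inactive relations with the largest gradient magnitude. The function name, the relation labels, and the gradient values are hypothetical; how the gradients are obtained is outside this sketch.

```python
def select_relations(gradient_by_relation, active, batch_size):
    """Return up to `batch_size` inactive relations with the largest |dO/dtheta_R|."""
    candidates = [
        (abs(g), r) for r, g in gradient_by_relation.items() if r not in active
    ]
    candidates.sort(reverse=True)  # largest gradient magnitude first
    return [r for _, r in candidates[:batch_size]]


grads = {
    "e1->Nature": -0.9,
    "e1->Cell": 0.4,
    "y2008->Yeast": 0.05,
    "e1->Gene": 0.7,
}
active = {"e1->Gene"}  # already in the model, so never re-added
added = select_relations(grads, active, batch_size=2)
```

Capping the additions at b per iteration bounds how fast the parameter vector can grow.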

Efficiency Considerations

• Path Blocking
– Forbid the random walk from following a relation immediately after its reverse (e.g. Write after Write^-1)

• Maintaining Distribution Sparsity
– For time and memory efficiency, sampling has been used to get an approximate but sparse estimate of the distribution
– Here we use a truncation strategy:
  • At each random walk step, hi(e) = max(0, hi(e) − γE[hi(e)])
  • E[hi] is the average of hi(e) over entities with non-zero values

• Few positive entities vs. thousands (or millions) of negative entities?
– First sort all the negative entities with the initial model (all feature weights are uniformly set to 1.0), then
– Square sampling: take the negative entities at the k(k+1)/2-th positions
– Cubic sampling: take the k(k+1)(k+2)/6-th positions
– Exponential sampling: take the (2^k − 1)-th positions, k=0,1,2,3,...
– Top-k sampling: take the top-k negative entities
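A sketch of the truncation rule and of the three position-based sampling schemes. It assumes 0-based positions (so that k=0 selects the top-ranked negative), reads the exponential scheme as 2^k − 1, and does not renormalize after truncation since the slides do not say whether that is done. Function names are illustrative.

```python
def truncate(h, gamma):
    """h(e) <- max(0, h(e) - gamma * mean of non-zero h); zeros are dropped."""
    nonzero = [v for v in h.values() if v > 0]
    if not nonzero:
        return {}
    cut = gamma * sum(nonzero) / len(nonzero)
    return {e: v - cut for e, v in h.items() if v - cut > 0}


def sample_positions(scheme, limit):
    """0-based ranks of the negatives to keep, among the first `limit` ranks."""
    formulas = {
        "square": lambda k: k * (k + 1) // 2,
        "cubic": lambda k: k * (k + 1) * (k + 2) // 6,
        "exponential": lambda k: 2 ** k - 1,
    }
    f = formulas[scheme]
    positions, k = [], 0
    while f(k) < limit:
        positions.append(f(k))
        k += 1
    return positions


sparse = truncate({"a": 0.6, "b": 0.3, "c": 0.1}, gamma=0.5)
square = sample_positions("square", limit=12)
cubic = sample_positions("cubic", limit=12)
exponential = sample_positions("exponential", limit=12)
```

All three schemes sample densely near the top of the ranking, where hard negatives are, and increasingly sparsely further down.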

Experiment: Data

• Data sources
– PubMed: an on-line archive of over 18 million biological abstracts
– PubMed Central (PMC): full-text copies of over 1 million of these papers
– Saccharomyces Genome Database (SGD): a database of information concerning yeast

• The nodes of our network are:
– 48,641 papers contained in SGD
– 69,161 authors of SGD papers
– 5,816 yeast genes mentioned in SGD
– 58 years, from 1950 through 2008
– 1,126 journals
– 39,827 unique title terms, after applying a stop word list of size 429

Experiment: Data

• The edges of our network are
– 376,010 Citation relations among papers
– 1,604 RelatesTo relations from genes to other genes
– 178,233 Authorship relations from authors to papers
  • Further distinguished as: any author, first author, and last author
– 160,621 Mention relations from papers to the genes they discuss
  • Further distinguished into 49 categories in the SGD database, such as "Evolution", "Function/Process", and "Mutants/Phenotypes"
– HasTitleTerm, InJournal, and InYear relations for each paper
– Before relations from each year to its next year

[Figure: network schema. Node types: Author, Paper, Gene, Year, Title Term, Institute, Journal. Edge labels: First/last/any author, Cites, Mentions (of aspect), Relates to, Before, In, Contains]

Experiment: Task

• The Paper Completion Tasks (PCT)
– Treat a paper as a big form with fields; the task is to predict one field given some of the other fields. We have 16k query-judgment pairs for each task
– Y-J:
  • Eq=year, Tq={journal}
  • suggest hot journals of a year
– YGW-J:
  • Eq=year ∪ genes ∪ words, Tq={journal}
  • suggest a journal in which to publish a research work
– YGW-P:
  • Eq=year ∪ genes ∪ words, Tq={citation}
  • help with literature review
– YA-G/YA-W:
  • Eq=year ∪ authors, Tq={gene/title}
  • suggest topics a researcher might currently be interested in

• Time-variant graph
– each edge is tagged with a time stamp (year in this case)
– when doing the random walk, only consider edges that are earlier than the query
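The time-variant restriction amounts to filtering edges by time stamp before the walk. The sketch below assumes "earlier than the query" means strictly earlier than the query year; the tuple layout and function name are illustrative.

```python
def edges_before(timed_edges, query_year):
    """Keep edges whose time stamp is strictly earlier than the query year."""
    return [(r, e1, e2) for (r, e1, e2, year) in timed_edges if year < query_year]


timed_edges = [
    ("Cite", "p3", "p1", 2005),
    ("Cite", "p4", "p3", 2007),
    ("WrittenBy", "p4", "a1", 2007),
]
# A query about a 2007 paper may only use edges stamped before 2007.
visible = edges_before(timed_edges, query_year=2007)
```

Filtering this way keeps information from the query paper's own year, and later, out of the prediction.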

Experiment: Main Result

• Compare retrieval quality by MAP

Model     Y-J    YGW-J  YGW-P  YA-G   YA-W
RRAb      0.252  0.404  0.127  0.144  0.201
RRA       0.251  0.462  0.151  0.144  0.199
PRAb      0.252  0.404  0.126  0.143  0.201
PRA       0.247  0.447  0.163  0.149  0.200
PRAeb     0.206  0.271  0.154  0.135  0.184
PRAe      0.339  0.464  0.162  0.151  0.202
PRA-ir30  0.386  0.487  0.167  0.141  0.16
PRA-hf5   0.375  0.479  0.155  0.151  0.179
PRA-P     0.246  0.456  0.166  0.148  0.194
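For reference, MAP (mean average precision) can be computed as in the following sketch, using the standard definition. The function names and the toy rankings are illustrative and are not the evaluation code behind the reported numbers.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant entity appears."""
    hits, total = 0, 0.0
    for k, e in enumerate(ranked, start=1):
        if e in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant entities), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["j1", "j2", "j3"], {"j1"}),        # AP = 1.0
    (["j2", "j1", "j3"], {"j1", "j3"}),  # AP = (1/2 + 2/3) / 2
]
map_score = mean_average_precision(runs)
```

Relevant entities that never appear in the ranking still count in the denominator, so missing them lowers the score.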

Experiment: Main Result

• Compare scalability by training time (s)

Model     Y-J   YGW-J  YGW-P   YA-G  YA-W
RRAb      0     0      0       0     0
RRA       32    121    121     39    68
PRAb      0     0      0       0     0
PRA       38    133    2,024   190   69
PRAeb     0     0      408     0     0
PRAe      30    146    4,286   274   120
PRA-ir30  75    253    10,512  393   790
PRA-hf5   126   2,660  749*    96*   328*
PRA-P     30    117    9,520   234   120

The max path length of hidden factors is set to 2, and to 1 in the tasks marked with *

Experiment: threshold γ

• Effect of threshold γ on RRAb: not very sensitive

[Figure: MAP vs. threshold γ for Y-J, YGW-J, and YGW-P]

Experiment: L2

• Usually there is a bump

[Figure: MAP vs. λ2 on YGW-J; curves PRA2-S, PRAe2-S, PRA3-S, PRAe3-S, PRA4-S, PRAe4-S]

[Figure: MAP vs. λ2 on YGW-P; curves PRA2, PRAe2, PRA3, PRAe3, PRA4, PRAe4]

Experiment: L1

• Usually there is a bump

[Figure: MAP vs. λ2 on YGW-J, λ1=0]

[Figure: MAP vs. λ2 on YGW-P, λ1=0; curves PRA2, PRAe2, PRA3, PRAe3, PRA4, PRAe4]

Experiment: L1

• Can eliminate paths without reducing MAP

•YGW-P, λ2=0.003

[Figure: MAP vs. λ1]

[Figure: No. Features vs. λ1]

Peek into the Model Weights

• Y-J, PRA3-i10

weight  Path
26.5    y(_Be)y(_Ye)p(Jo)j
0.89    >Proc_Natl_Acad_Sci_U_S_A
0.87    >Mol_Cell_Biol
0.63    >EMBO_J
0.59    >Genetics
…
-2.26   >Cell
-2.65   >Nature
-2.81   >Mol_Microbiol
-3.14   >Glycobiology
-3.15   >Nat_Genet
-3.2    >Eur_J_Biochem
-3.21   >FEBS_Lett
-3.21   >Gene
-3.22   >Mol_Gen_Genet
-3.23   >J_Biol_Chem
-3.25   >Yeast

Peek into the Model Weights

• YGW-J, PRAe3-i5

weight  Path
24.06   w(_Ti)p(_Ci)p(Jo)j
19.98   w(_Ti)p(Jo)j
13.90   T(pa)p(_Ci)p(Jo)j
9.68    g(_Ge)p(_Ci)p(Jo)j
3.11    w(_Ti)p(Ci)p(Jo)j
…       …
-2.26   >Nature
-2.63   >Biochim_Biophys_Acta
-2.68   >J_Biol_Chem
-2.73   >Science
-2.79   >Biochem_Biophys_Res_Commun
-3.01   >Cell
-3.05   >FEBS_Lett
-3.81   >Eur_J_Biochem
-3.84   >Gene
-4.17   >Mol_Gen_Genet
-4.23   >Yeast

Peek into the Model Weights

• YGW-P, PRAe5

ID  weight  Path
1   132.1   w(_Ti)p(Ci)p
2   34.18   w(_Ti)p
3   13.49   w(_Ti)p(_Ci)p
4   9.795   g(_Ge)p(Ci)p
5   2.9     g(_Ge)p
…
6   -0.539  g(_Ge)p(FA)a(_LA)p(Au)a(_LA)p
7   -0.869  g(_Ge)p(FA)a(_LA)p
8   -1.282  g(_Ge)p(FA)a(_LA)p(FA)a(_LA)p
9   -3.9    g(_Ge)p(LA)a(_FA)p
10  -6.589  T(ye)y(_Be)y(_Be)y(_Ye)p
11  -6.591  T(ye)y(_Be)y(_Ye)p
12  -6.594  T(ye)y(_Ye)p

• Thanks
