Relational Retrieval Using a Combination of Path-Constrained Random Walks
![Page 1: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/1.jpg)
Relational Retrieval Using a Combination of Path-Constrained Random Walks
Ni Lao
Joint work with William Cohen
2010.6.22
![Page 2: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/2.jpg)
Outline
• Problem definition and related work
• Retrieval Models with PCRW (ECML PKDD 2010)
  – Path Ranking Algorithm (PRA)
  – Ext.1: query-independent experts
  – Ext.2: popular entity experts
• Comparing efficient random walk strategies (KDD 2010)
  – Sampling
  – Truncation
![Page 3: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/3.jpg)
Scientific Literature
• Can be represented as a labeled directed graph
  – Typed nodes: documents, terms, metadata
  – Labeled edges: “authorOf”, “datePublished”
• Can support a family of typed proximity queries
  – Input: a set of query nodes + expected answer type
  – Output: a list of nodes of the desired answer type, ordered by proximity to the query nodes
• Many tasks
  – Ad hoc retrieval
    • term nodes → documents
  – Gene recommendation (Andrew & Cohen’09)
    • user, year → gene
  – Reference (citation) recommendation
    • topic → paper
  – Expert finding
    • topic → user
  – Collaborator recommendation (Liben-Nowell and Kleinberg)
    • scientist → scientist through the co-authorship relation
![Page 4: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/4.jpg)
Biology Literature Data
• Data of this study
  – Yeast: 0.2M nodes, 5.5M links
  – Fly: 0.8M nodes, 3.5M links
• Human-labeled task
  – Literature recommendation: author, year → paper
• Automatically labeled tasks
  – Gene recommendation: author, year → gene
  – Venue recommendation: genes, title words → journal
  – Reference recommendation: title words, year → paper
  – Expert-finding: title words, genes → author
• E.g. the fly graph:
[Figure: schema of the fly graph. Labels and counts recoverable from the slide: Publication 126,813; Author 233,229; Write 679,903; Gene 516,416; Protein 414,824; Cite 1,267,531; Bioentity 5,823,376; Physical/Genetic interactions 1,352,820; Downstream/Upstream; Year 58; Journal 1,801; Transcribe 293,285; before; Title Terms 102,223. Three further counts (689,812; 1,785,626; 2,060,275) have lost their labels.]
![Page 5: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/5.jpg)
Related Work
• Keyword search in relational databases: each answer is a tree connecting all query entities and a target entity
  – BANKS (Bhalotia et al., 2002; Bhavana et al., 2008)
  – DBXplorer (Agrawal et al., 2002)
  – Discover (Hristidis & Papakonstantinou, 2002)
  – BLINKS (He et al., 2007)
• Similarity measures based on Random Walk with Restart (RWR)
  – Topic-sensitive PageRank (Haveliwala, 2002)
  – Personalized PageRank (Jeh & Widom, 2003)
  – ObjectRank (Balmin et al., 2004)
  – Personal information management (Minkov & Cohen, 2007)
• Improving the RWR model by tuning edge weights
  – quadratic programming (Tsoi et al., 2003)
  – simulated annealing (Nie et al., 2005)
  – back-propagation (Diligenti et al., 2005; Minkov & Cohen, 2007)
  – limited-memory Newton method (Agarwal et al., 2006)
![Page 6: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/6.jpg)
The Limitation of RWR Models
• One-parameter-per-edge-label RWR proximity measures are limited because the context in which an edge label appears is ignored

| Path | Comments |
| --- | --- |
| a(Read)p(Gene)g(_Gen)p | Don't read about genes which I have already read |
| a(Read)p(Auth)a(_Aut)p | Read about my favorite authors |
| a(_Aut)p(Gene)g(_Gen)p | Read about the genes that I am working on |
| a(_Aut)p(Aff)af(_Aff)p | Don't read papers from my own lab |
![Page 7: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/7.jpg)
This Work
• A new proximity measure on labeled graphs
  – Path Constrained Random Walk (PCRW)
  – a weighted combination of simple “path experts”, each of which corresponds to a particular labeled path through the graph
• Citation recommendation task as an example
  – In the TREC-CHEM Prior Art Search Task [11], people found that instead of directly searching for patents with the query words, it is much more effective to first find patents with similar topics, then aggregate these patents’ citations
  – Our model systematically generates many relation paths and learns a proper weighting
[Table: Weight, Path]
![Page 8: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/8.jpg)
Outline
• Problem definition and related work
• Retrieval Models with PCRW (ECML PKDD 2010)
  – Path Ranking Algorithm (PRA)
  – Ext.1: query-independent experts
  – Ext.2: popular entity experts
• Comparing efficient random walk strategies (KDD 2010)
  – Sampling
  – Truncation
![Page 9: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/9.jpg)
Definitions
• An Entity-Relation graph G=(T,E,R) is
  – a set of entity types T={T}
  – a set of entities E={e}; each entity is typed, with e.T ∈ T
  – a set of typed and ordered relations R={R}
    • dom(R) := R.T1, range(R) := R.T2
• Relational Retrieval (RR) Problem
  – Given a query q=(Eq,Tq)
    • where Eq={e'} is a set of seed entities, and Tq is the target entity type
  – Produce the relevance of each entity e in Tq
• A relation path P=(R1, …, Rn)
  – a sequence of relations, with the constraint that Ri.T2 = Ri+1.T1
  – E.g.
![Page 10: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/10.jpg)
Path Constrained Random Walk
• Recursively define a distribution hP(e) for the path P=R1R2…RL as
  – where P'=R1R2…RL-1. Each entity passes its probability mass evenly to all of its children in a particular relation
• For the length-zero path, it is an even distribution over the query entities
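The recursion can be sketched in a few lines of Python. The formula itself did not survive on this slide, so the sketch assumes the reading h_P(e) = Σ_e' h_P'(e') · I(R_L(e', e)) / |R_L(e')|, which is one way to express "each entity passes its probability mass evenly to all of its children". The toy graph, the relation names, and the function name `pcrw` are hypothetical.

```python
from collections import defaultdict

def pcrw(edges, query_entities, path):
    """Distribution h_P(e) for a relation path P, starting from the query entities.

    edges: dict mapping relation name -> dict mapping entity -> list of children.
    """
    # Length-zero path: even distribution over the query entities.
    h = {e: 1.0 / len(query_entities) for e in query_entities}
    for relation in path:
        nxt = defaultdict(float)
        for entity, mass in h.items():
            children = edges.get(relation, {}).get(entity, [])
            # Assumption: mass at an entity with no children in this relation is dropped.
            for child in children:
                nxt[child] += mass / len(children)
        h = dict(nxt)
    return h

# Hypothetical toy graph: two authors, three papers.
edges = {
    "Write": {"a1": ["p1", "p2"], "a2": ["p2"]},
    "Cite": {"p1": ["p3"], "p2": ["p1", "p3"]},
}
h = pcrw(edges, ["a1"], ["Write", "Cite"])
```

Starting from author a1, the walk first spreads mass evenly over the papers a1 wrote, then over the papers those papers cite.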
![Page 11: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/11.jpg)
Relation Trees
• Given
  – a graph G and a query q=(Eq,Tq), Eq={e0},
• Define P(q, L) as the set of relation paths
  – that start with T, end with Tq, and have length ≤ L
• A relation tree of P(q, L) is
  – the prefix tree of all the paths, with each node corresponding to a distribution hP(e) over the entities
[Figure: example relation tree whose nodes are the entity types Paper and Author, connected by the relations Write, WrittenBy, Cite, and CiteBy]
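Enumerating P(q, L) amounts to a depth-first walk over the type schema, which is also how the prefix tree would be grown. The sketch below is illustrative only: the schema dictionary and the function name `enumerate_paths` are hypothetical, and the empty path is left out by assumption.

```python
def enumerate_paths(schema, start_type, target_type, max_len):
    """All relation paths of length 1..max_len from start_type to target_type.

    schema: dict mapping relation name -> (domain type, range type).
    """
    results = []

    def extend(path, current_type):
        if path and current_type == target_type:
            results.append(tuple(path))
        if len(path) == max_len:
            return
        for name, (dom, rng) in sorted(schema.items()):
            if dom == current_type:  # constraint R_i.T2 == R_{i+1}.T1
                extend(path + [name], rng)

    extend([], start_type)
    return results

# Hypothetical schema using the relation names from the figure.
schema = {
    "Write": ("Author", "Paper"),
    "WrittenBy": ("Paper", "Author"),
    "Cite": ("Paper", "Paper"),
    "CiteBy": ("Paper", "Paper"),
}
paths = enumerate_paths(schema, "Paper", "Paper", 2)
```

Each recursive call corresponds to one node of the relation tree, so a distribution hP(e) could be cached per call and shared by every path with the same prefix.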
![Page 12: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/12.jpg)
Retrieval Based on PCRW
• A model (G, L, θ) ranks IE(Tq) by
$$\mathrm{score}(e; L, \theta) = \sum_{P \in \mathcal{P}(q, L)} h_P(e)\,\theta_P$$
• In matrix form, s = Aθ
  – s is a (sparse) column vector of scores
  – θ is a column vector of weights for the paths P(q,L)
  – each column of A is the distribution hP(e) of a path P
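A minimal sketch of the scoring step, keeping s sparse by storing only entities that some path reaches. The path distributions and weights below are made-up numbers, and `score_entities` is a hypothetical name.

```python
def score_entities(path_distributions, weights):
    """score(e) = sum over paths P of h_P(e) * theta_P, stored sparsely as a dict."""
    scores = {}
    for path, h in path_distributions.items():
        theta = weights.get(path, 0.0)
        for entity, prob in h.items():
            scores[entity] = scores.get(entity, 0.0) + theta * prob
    return scores

# Each entry plays the role of one column of A (made-up values).
path_distributions = {
    ("Cite",): {"p1": 0.5, "p2": 0.5},
    ("WrittenBy", "Write"): {"p2": 0.25, "p3": 0.75},
}
weights = {("Cite",): 2.0, ("WrittenBy", "Write"): -1.0}
scores = score_entities(path_distributions, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

A negative weight lets a path act as a penalty, which is what the "don't read" paths on the earlier slide require.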
![Page 13: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/13.jpg)
Parameter Estimation
• Given a set of training data
  – D={(q(m), A(m), y(m))}, m=1…M, y(m)(e)=1/0
• We can define a regularized objective function
$$O(\theta) = \sum_{m=1..M} o_m(\theta) - \lambda_1 |\theta|_1 - \lambda_2 |\theta|_2 / 2$$
• Use average log-likelihood as the objective om(θ)
$$o_m(\theta) = |P_m|^{-1} \sum_{i \in P_m} \ln p_i^{(m)} + |N_m|^{-1} \sum_{i \in N_m} \ln\left(1 - p_i^{(m)}\right)$$
  – Pm: the index set of relevant entities
  – Nm: the index set of irrelevant entities (how to choose them will be discussed later)
$$p_i^{(m)} = p\left(y_i^{(m)} = 1 \mid q^{(m)}; \theta\right) = \frac{\exp\left(\theta^T A_i^{(m)}\right)}{1 + \exp\left(\theta^T A_i^{(m)}\right)}$$
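The objective can be written out directly, which is handy for sanity checks before handing it to an optimizer. This is a sketch under stated assumptions: the L2 penalty is taken as the squared norm divided by 2, which is one reading of the slide, and the function names and data layout are hypothetical.

```python
import math

def sigmoid(z):
    # Same value as exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))

def query_objective(theta, A, positives, negatives):
    """Average log-likelihood o_m(theta) for one training query.

    A: list of feature rows, A[i][j] = h_{P_j}(e_i).
    positives / negatives: index lists P_m and N_m.
    """
    def p(i):
        return sigmoid(sum(t * a for t, a in zip(theta, A[i])))

    pos = sum(math.log(p(i)) for i in positives) / len(positives)
    neg = sum(math.log(1.0 - p(i)) for i in negatives) / len(negatives)
    return pos + neg

def objective(theta, queries, lam1, lam2):
    """O(theta): sum of o_m minus the L1 and L2 penalties.

    Assumption: the L2 term is the squared norm divided by 2.
    """
    data = sum(query_objective(theta, *q) for q in queries)
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t * t for t in theta)
    return data - lam1 * l1 - lam2 * l2 / 2.0

A = [[0.2, 0.8], [0.5, 0.5]]
value = objective([0.0, 0.0], [(A, [0], [1])], 0.1, 0.1)
```

With all weights at zero every p_i is 0.5, so the value is 2 ln 0.5 whatever the features are.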
![Page 14: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/14.jpg)
Parameter Estimation
• Selecting the negative entity set Nm
  – Few positive entities vs. thousands (or millions) of negative entities?
  – First sort all the negative entities with an initial model (uniform weight 1.0)
  – Then take negative entities at the k(k+1)/2-th position, k = 1, 2, …
• The gradient
$$\frac{\partial o_m(\theta)}{\partial \theta} = |P_m|^{-1} \sum_{i \in P_m} \left(1 - p_i^{(m)}\right) A_i^{(m)} - |N_m|^{-1} \sum_{i \in N_m} p_i^{(m)} A_i^{(m)}$$
• Use orthant-wise L-BFGS (Andrew & Gao, 2007) to estimate θ
  – Efficient
  – Can deal with L1 regularization
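Both steps on this slide are short enough to sketch. The code assumes positions are 1-based (so the kept ranks are 1, 3, 6, 10, …) and that the gradient is the derivative of the average log-likelihood with a logistic p_i; the function names are hypothetical and no optimizer is included.

```python
import math

def select_negatives(ranked_negatives):
    """Keep the negatives at 1-based positions k(k+1)/2: 1, 3, 6, 10, ..."""
    picked, k = [], 1
    while k * (k + 1) // 2 <= len(ranked_negatives):
        picked.append(ranked_negatives[k * (k + 1) // 2 - 1])
        k += 1
    return picked

def gradient(theta, A, positives, negatives):
    """Gradient of o_m(theta) with respect to theta."""
    def p(i):
        z = sum(t * a for t, a in zip(theta, A[i]))
        return 1.0 / (1.0 + math.exp(-z))

    grad = [0.0] * len(theta)
    for i in positives:
        for j, a in enumerate(A[i]):
            grad[j] += (1.0 - p(i)) * a / len(positives)
    for i in negatives:
        for j, a in enumerate(A[i]):
            grad[j] -= p(i) * a / len(negatives)
    return grad

picked = select_negatives(list(range(1, 11)))  # negatives already sorted, ranks 1..10
g = gradient([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0], [1])
```

The quadratic spacing keeps most of the top-ranked (hard) negatives while thinning out the long tail.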
![Page 15: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/15.jpg)
L2 Regularization
• Improves retrieval quality
  – On the personal paper recommendation task
[Figure: two panels plotting negative log-likelihood and MAP against λ2 (λ1=0), with curves for l=2, l=3, l=4]
![Page 16: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/16.jpg)
L1 Regularization
• Does not improve retrieval quality
[Figure: two panels plotting negative log-likelihood and MAP against λ1 (λ2=0.00001), with curves for l=2, l=3, l=4]
![Page 17: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/17.jpg)
L1 Regularization
• But can help select features
[Figure: two panels plotting MRR and number of active features against λ1 (λ2=0.00001), with curves for l=2, l=3, l=4]
![Page 18: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/18.jpg)
Ext.1: Query Independent Paths
• PageRank
  – assigns an importance score (query independent) to each web page
  – later combined with a relevance score (query dependent)
• Generalize to the multiple entity and relation type setting
  – We add to each query a special entity e0 of special type T0
  – T0 has a relation to all other entity types
  – e0 has links to each entity
  – Therefore, we have a set of query-independent relation paths (distributions of which can be calculated offline)
• Example
[Figure: relation tree rooted at T0 with Paper and Author nodes connected by Wrote, WrittenBy, Cite, and CiteBy; the resulting distributions are annotated as all papers, all authors, well cited papers, and productive authors]
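The "well cited papers" expert can be reproduced on a toy graph: one step from e0 gives a uniform distribution over all papers, and one further Cite step moves that mass to whatever gets cited. This is only a sketch; the three-paper citation graph and the helper `walk` are hypothetical.

```python
from collections import defaultdict

def walk(h, relation_edges):
    """One PCRW step: each entity splits its mass evenly among its children."""
    nxt = defaultdict(float)
    for entity, mass in h.items():
        children = relation_edges.get(entity, [])
        for child in children:
            nxt[child] += mass / len(children)
    return dict(nxt)

papers = ["p1", "p2", "p3"]
cite = {"p1": ["p3"], "p2": ["p3"], "p3": ["p1"]}

# The special entity e0 links to every paper, so one step gives "all papers".
all_papers = walk({"e0": 1.0}, {"e0": papers})
# Following Cite from there concentrates mass on well cited papers.
well_cited = walk(all_papers, cite)
```

Because neither distribution depends on the query, both can be computed once and reused for every query.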
![Page 19: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/19.jpg)
Ext.2: Entity Biases
• There are entity-specific characteristics which cannot be captured by a general model
  – E.g. some documents with a lower rank for a query may be interesting to the users because of features not captured in the data (log mining)
  – E.g. different users may have completely different information needs and goals under the same query (personalized)
  – The identity of the entity matters
![Page 20: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/20.jpg)
Ext.2: Popular Entity Biases
• For a task with query type T0 and target type Tq,
  – Introduce a bias θe for each entity e in IE(Tq)
  – Introduce a bias θe’,e for each entity pair (e’,e) where e is in IE(Tq) and e’ is in IE(T0)
• Then
  – Or in matrix form
• Efficiency consideration
  – Only add to the model the top J parameters (measured by |∂O(θ)/∂θe|) at each L-BFGS iteration
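The scoring formula is missing from this slide, so the sketch below assumes the simplest additive form: the path-based score plus θe plus one θe’,e term per query entity e’. The top-J step then activates the inactive bias parameters with the largest gradient magnitude. All names and numbers are hypothetical.

```python
def biased_score(base_score, entity, query_entities, entity_bias, pair_bias):
    """Path-based score plus an entity bias and query/target pair biases (assumed additive)."""
    score = base_score + entity_bias.get(entity, 0.0)
    for q in query_entities:
        score += pair_bias.get((q, entity), 0.0)
    return score

def top_j_candidates(gradients, active, j):
    """Pick the J not-yet-active bias parameters with the largest |gradient|."""
    inactive = [(name, g) for name, g in gradients.items() if name not in active]
    inactive.sort(key=lambda item: abs(item[1]), reverse=True)
    return [name for name, _ in inactive[:j]]

s = biased_score(0.4, "p7", ["a1"], {"p7": 0.1}, {("a1", "p7"): 0.25})
chosen = top_j_candidates({"p1": 0.02, "p2": -0.5, "p3": 0.3, "p4": 0.9}, {"p4"}, 2)
```

Growing the parameter set a few entries at a time keeps the model small even though the number of possible entity and pair biases is very large.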
![Page 21: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/21.jpg)
Experiment Setup
• Data sources for bioinformatics
  – PubMed: online archive of over 18 million biological abstracts
  – PubMed Central (PMC): full-text copies of over 1 million of these papers
  – Saccharomyces Genome Database (SGD): a database for yeast
  – Flymine: a database for fruit flies
• Tasks
  – Gene recommendation: author, year → gene
  – Venue recommendation: genes, title words → journal
  – Reference recommendation: title words, year → paper
  – Expert-finding: title words, genes → author
• Data split
  – 2000 training, 2000 tuning, 2000 test
• Time-variant graph
  – each edge is tagged with a time stamp (year)
  – during the random walk, only consider edges that are earlier than the query
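The time constraint is a filter applied before (or during) the walk. A minimal sketch, assuming each edge is stored as a (source, relation, target, year) tuple and that "earlier" means strictly earlier than the query year; both the layout and the function name are hypothetical.

```python
def edges_before(edges, query_year):
    """Keep only edges whose time stamp is strictly earlier than the query year."""
    return [(src, rel, dst) for (src, rel, dst, year) in edges if year < query_year]

edges = [
    ("p1", "Cite", "p0", 2003),
    ("p2", "Cite", "p1", 2006),
    ("p3", "Cite", "p1", 2008),
]
visible = edges_before(edges, 2007)
```

Filtering this way keeps citations that did not yet exist at query time from leaking into the features.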
![Page 22: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/22.jpg)
Example Features
• A PRA+qip+pop model trained for the reference recommendation task on the yeast data
1) papers co-cited with the on-topic papers
6) resembles a commonly used ad-hoc retrieval system
7,8) papers cited during the past two years
9) well cited papers
10,11) (important) early papers about specific query terms (genes)
12,13) general papers published during the past two years
14) old papers
![Page 23: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/23.jpg)
Experiment Result
• Compare the MAP of PCRW to
  – RWR model
  – query independent paths (qip)
  – popular entity biases (pop)
Except those marked †, all improvements are statistically significant at p<0.05 using a paired t-test
![Page 24: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/24.jpg)
Outline
• Problem definition and related work
• Retrieval Models with PCRW (ECML PKDD 2010)
  – Path Ranking Algorithm (PRA)
  – Ext.1: query-independent experts
  – Ext.2: popular entity experts
• Comparing efficient random walk strategies (KDD 2010)
  – Sampling
  – Truncation
![Page 25: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/25.jpg)
Four Strategies for Efficiency
• Fingerprint Strategy (Fogaras et al. 2004)
  – Simulate a large number of random walkers
• Fixed Truncation
  – Truncate by a fixed value
• Beam Truncation
  – Keep the top W most probable entities
• Weighted Particle Filtering
  – A combination of exact inference and sampling
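The two truncation strategies are easy to state as operations on an intermediate distribution. The sketch below is illustrative only: the slides do not say whether the surviving mass is renormalized, so it is left as-is, and the function names and example values are hypothetical.

```python
def fixed_truncation(h, epsilon):
    """Drop entities whose probability falls below a fixed value epsilon."""
    return {e: p for e, p in h.items() if p >= epsilon}

def beam_truncation(h, width):
    """Keep only the `width` most probable entities."""
    top = sorted(h.items(), key=lambda item: item[1], reverse=True)[:width]
    return dict(top)

h = {"p1": 0.5, "p2": 0.3, "p3": 0.15, "p4": 0.05}
fixed = fixed_truncation(h, 0.1)
beam = beam_truncation(h, 2)
```

Either function would be applied after each step of the walk, so the size of the intermediate distribution stays bounded.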
![Page 26: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/26.jpg)
Weighted Particle Filtering
• Start from exact inference, then switch to sampling when the branching is heavy
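One way to read this is as a per-entity switch: if splitting an entity's mass among its children still leaves each share above a minimum particle size, propagate exactly; otherwise send a few equal-weight particles to sampled children. The sketch below follows that reading and is not taken from the slides; `min_mass`, the sampling rule, and the toy graph are assumptions.

```python
import random
from collections import defaultdict

def particle_step(h, relation_edges, min_mass, rng):
    """One walk step that is exact for light branching and sampled for heavy branching."""
    nxt = defaultdict(float)
    for entity, mass in h.items():
        children = relation_edges.get(entity, [])
        if not children:
            continue
        if mass / len(children) >= min_mass:
            # Light branching: propagate exactly.
            for child in children:
                nxt[child] += mass / len(children)
        else:
            # Heavy branching: send a few equal-weight particles to sampled children.
            n = max(1, int(mass / min_mass))
            for _ in range(n):
                nxt[rng.choice(children)] += mass / n
    return dict(nxt)

rng = random.Random(0)
edges = {"a": ["b", "c"], "b": [f"x{i}" for i in range(100)], "c": ["y"]}
h1 = particle_step({"a": 1.0}, edges, 0.05, rng)
h2 = particle_step(h1, edges, 0.05, rng)
```

In this sketch the total mass is preserved at every step, while the number of entities touched from a high-degree node is capped by the particle count rather than by its degree.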
![Page 27: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/27.jpg)
Results on the Yeast Data
[Figure: three panels, Expert Finding (T0 = 0.17s, L = 3), Gene Recommendation (T0 = 1.6s, L = 4), and Reference Recommendation (T0 = 2.7s, L = 3), comparing RWR, Particle Filtering, Fixed Truncation, Beam Truncation, Exact, Exact (Edge-Parameter), and Exact (No Learning)]
![Page 28: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/28.jpg)
Results on the Fly Data
[Figure: three panels, Expert Finding (T0 = 0.15s, L = 3), Gene Recommendation (T0 = 1.8s, L = 4), and Reference Recommendation (T0 = 0.9s, L = 3), comparing RWR, Particle Filtering, Fixed Truncation, Beam Truncation, Exact, Exact (Edge-Parameter), and Exact (No Learning)]
![Page 29: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/29.jpg)
Observations
• Sampling strategies are better than truncation strategies
• Particle filtering produces better MAP than fingerprinting
  – By reducing the variance of the estimates
  – 10~100-fold speedup compared to exact RW
• Retrieval quality is improved in many cases
  – By producing better weights for the model
  – See (Lao & Cohen, KDD 2010) for details
![Page 30: Relational Retrieval Using a Combination of Path-Constrained Random Walks](https://reader036.fdocuments.in/reader036/viewer/2022081504/568146a4550346895db3c085/html5/thumbnails/30.jpg)
The End
• Thanks