Proximity on Large Graphs
Transcript of Proximity on Large Graphs
4/8/2008
1
SCS CMU
Proximity on Large Graphs
Speaker:
Hanghang Tong
2008-4-10 15-826 Guest Lecture
SCS CMU
Graphs are everywhere!
2
SCS CMU
Graph Mining: the big picture Graph/Global Level
Subgraph/Community Level
3Node Level We are here!
4/8/2008
2
SCS CMU
Proximity on Graph: What?
A BH1 1
I J1
1 1
4
D1 1
EF
G1 11
a.k.a Relevance, Closeness, ‘Similarity’…
SCS CMU
Proximity is the main tool behind…
• Link prediction [Liben-Nowell+], [Tong+]• Ranking [Haveliwala], [Chakrabarti+]• Email Management [Minkov+]• Image caption [Pan+]
5
g p [ ]• Neighborhooh Formulation [Sun+]• Conn. subgraph [Faloutsos+], [Tong+], [Koren+]• Pattern match [Tong+]• Collaborative Filtering [Fouss+]• Many more…
Will return to this later
SCS CMU
Roadmap
• Motivation• Part I: Definitions• Part II: Fast Solutions
• Basic: RWR
• Variants
• Asymmetry of Prox.
• Group Prox
• Prox w/ Attributes
6
• Part III: Applications• Conclusion
Prox w/ Attributes
• Prox w/ Time
4/8/2008
3
SCS CMU
Why not shortest path?
A BD1 1
A BD1 1
E F11 1
Some ``bad’’ proximities
7
‘pizza delivery guy’ problem
A BD1 1
EF
G1 11 A BD1 1
‘multi-facet’ relationship
SCS CMU
A BD1 1
Why not max. netflow?Some ``bad’’ proximities
8
A BD1 11 E
No punishment on long paths
SCS CMU
What is a ``good’’ Proximity?
A BH1 1
I J1
1 1
9
D1 1
EF
G1 11
• Multiple Connections
• Quality of connection
•Direct & In-directed Conns
•Length, Degree, Weight…
…
4/8/2008
4
SCS CMU
1
2
910
811
12
Random walk with restart
10
4
3
56
7
11
SCS CMU
Random walk with restartNode 4
Node 1Node 2Node 3Node 4Node 5Node 6
0.130.100.130.220.130.05
13
2
910
811
120.13
0.10
0.13
0.08
0.04
0.02
0.04
0.03
11
Node 7Node 8Node 9Node 10Node 11Node 12
0.050.080.040.030.040.02
4
5 6
7
0.13
0.05
0.05
Ranking vector More red, more relevantNearby nodes, higher scores
4r
SCS CMU
Why RWR is a good score?
2c 3cc ...
W 2W 3W
12
c c
all paths from ito j with length 1
all paths from ito j with length 2
all paths from ito j with length 3
4/8/2008
5
SCS CMU
Roadmap
• Motivation• Part I: Definitions• Part II: Fast Solutions
• Basic: RWR
• Variants
• Asymmetry of Prox.
• Group Prox
• Prox w/ Attributes
13
• Part III: Applications• Conclusion
Prox w/ Attributes
• Prox w/ Time
SCS CMU
Variant: escape probability• Define Random Walk (RW) on the graph• Esc_Prob(A B)
– Prob (starting at A, reaches B before returning to A)
14Esc_Prob = Pr (smile before cry)
A Bthe remaining graph
SCS CMU
Other Variants
• Other measure by RWs– Community Time/Hitting Time [Fouss+]– SimRank [Jeh+]
• Equivalence of Random Walks
15
Equivalence of Random Walks– Electric Networks:
• EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] – String Systems
• Katz [Katz], [Huang+], [Scholkopf+]• Matrix-Forest-based Alg [Chobotarev+]
4/8/2008
6
SCS CMU
Other Variants
• Other measure by RWs– Community Time/Hitting Time [Fouss+]– SimRank [Jeh+]
• Equivalence of Random WalksAll are related to, or similar to
16
Equivalence of Random Walks– Electric Networks:
• EC [Doyle+]; SAEC[Faloutsos+]; CFEC[Koren+] – String Systems
• Katz [Katz], [Huang+], [Scholkopf+]• Matrix-Forest-based Alg [Chobotarev+]
,random walk with restart!
SCS CMU
Roadmap
• Motivation• Part I: Definitions• Part II: Fast Solutions
• Basic: RWR
• Variants
• Asymmetry of Prox.
• Group Prox
• Prox w/ Attributes
17
• Part III: Applications• Conclusion
Prox w/ Attributes
• Prox w/ Time
SCS CMU
Asymmetry of Proximity [Tong+ KDD07 a]
A B
1 1
1 A B
1 1
0.5
18
111
11
10.51
What is Prox from A to B?What is Prox from B to A?
What is Prox between A and B?
4/8/2008
7
SCS CMU
Asymmetry also exists in un-directed graphs
• Hanghang’s most important conf. is KDD• The most important author in KDD is ...
19
So is love…Hanghang
KDD
SCS CMU
Roadmap
• Motivation• Part I: Definitions• Part II: Fast Solutions
• Basic: RWR
• Variants
• Asymmetry of Prox.
• Group Prox
• Prox w/ Attributes
20
• Part III: Applications• Conclusion
Prox w/ Attributes
• Prox w/ Time
SCS CMU
Group Proximity [Tong+ 2007]
• Q: How close are Accountants to SECs?
21
• A: Prob (starting at any RED, reaches anyGREEN before touching any RED again)
4/8/2008
8
SCS CMU
Proximity on Attribute Graphs
22
What is the proximity from node 7 to 10?If we know that…
SCS CMU
Sol: Augmented graphs
23
SCS CMU
Attributes on nodes/edges (ER graph) [Chakrabarti+ WWW07]
skip
Works
24
Wrote Sent Received
In-Replied-toCited
4/8/2008
9
SCS CMU
Proximity w/ Time
• Sol #1: treat time an categorical attr. [Minkov+]
S l #2 t li t i [T ]
25
• Sol #2: aggregate slice matrices [Tong+]
Time
•Global aggregation
•Slide window
•Exponential emphasis
SCS CMU
Summary of Part I
• Goal: Summarize multiple … relationships
• Solutions
26
– Basic: Random Walk with Restart– Property: Asymmetry– Variants: Esc_Prob and many others.– Generalization: Group Prox.; w/ Attr.; w/ Time
SCS CMU
• Motivation• Part I: Definitions• Part II: Fast Solutions
Roadmap
• B_Lin: RWR
• FastAllDAP: Esc_Prob
• BB_Lin: Skewed BGs
27
• Part III: Applications• Conclusion
• FastUpdate: Time-Evolving
4/8/2008
10
SCS CMU
Preliminary: Sherman–Morrison Lemma
Tv
uAA =If:
28
1 11 1 1
1( )1
TT
T
A u v AA A u v Av A u
− −− − −
−
⋅ ⋅= + ⋅ = −+ ⋅ ⋅
Then:
SCS CMU
SM Lemma: Applications
• RLS – and almost any algorithm in time series!
• Leave-one-out cross validation for LS• Kalman filtering• Incremental matrix decomposition• … and all the fast sols we will introduce!
29
SCS CMU
Computing RWR
9 1012
0.13 0 1/3 1/3 1/3 0 0 0 0 0 0 0 00.10 1/3 0 1/3 0 0 0 0 1/4 0 0 0 0.13
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟
01/3 1/3 0 1/3 0 0 0 0 0 0 0 0
⎛ ⎞⎜⎜⎜⎜
0.13 00.10 00.13 0
⎛ ⎞ ⎛ ⎞⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟
Ranking vector Starting vectorAdjacent matrix
(1 )i i ir cWr c e= + −Restart p
30
1
43
2
5 6
7
811
120.220.130.05
0.90.050.080.040.030.040.02
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟ = ×⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
1/3 0 1/3 0 1/4 0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0 0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2 0 0 0 0 0 0 0 1/4 0 1/3 0 1/2 0 0 0 0 0 0 0 0 0 1/3 1/3 0
⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎜⎝ ⎠
0.220.13 00.05 0
0.10.05 00.08 00.04 00.03 00.04 0
2 0
1
0.0
⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟+ ×⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎟⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠
n x n n x 1n x 1
1
4/8/2008
11
SCS CMU
Beyond RWR : Maxwell Equation for Web![Chakrabarti]
31
P-PageRank[Haveliwala]
PageRank[Haveliwala]
RWR[Pan, Sun]
SM Learning[Zhou, Zhu]
RL in CBIR[He]
Fast RWR (B_Lin) Finds the Root Solution !
SCS CMU
• RWR is the building block for computing…– Escape Probability (augmented w/sink)
[Tong+]– Effective Conductanc Resistance Dist
Beyond RWR
32
.. Effective Conductanc Resistance Dist.Commute Time
– MRF (special structure) [Cohen]• Similar Idea of B_Lin to compute other
measurements
SCS CMU
• Q: Given query i, how to solve it?
0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 0
0001
⎛ ⎞ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
33
1/3 0 1/3 0 1/4
0.9= ×
0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0
0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2
0 0 0 0 0
00
0.10000
0 0 1/4 0 1/3 0 1/2 0 0 0 0 0 0 0 0 0 0 1/3 1/3
1
0 0
⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
+ ×⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠
??
Adjacent matrix Starting vector
4/8/2008
12
SCS CMU
14
32
6
9 10
8 11120.13
0.10
0.13
0 05
0.08
0.04
0.02
0.04
0.03
OntheFly: 0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4
0.9= ×
0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0
0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 0
000
00
0.1000
1
⎛ ⎞ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
+ ×⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
000100000
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟
0.130.100.130.220.130.050.050.080.04
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟
34
5 6
70.13
0.05
0.05
0 0 0 0 0 0 0 0 1/2 0 1/3 1/2 0 0 0 0 0
0 0 0 1/4 0 1/3 0 1/2 0
0 0 0 0 0 0 0 0 0 1/3 1/3 0 0
⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠
000
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
0.030.040.02
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
No pre-computation/ light storage
Slow on-line response O(mE)
ir ir
SCS CMU
0.20 0.13 0.14 0.13 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.340.28 0.20 0.13 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.450.14 0.13 0.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.330.13 0.10 0.13 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.320.09 0.09 0.09 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.560.03 0.04 0.04 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.22
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟
PreCompute1 2 3 4 5 6 7 8 9 10 11 12r r r r r r r r r r r r
14
32
6
9 10
8 11120.13
0.10
0.13
0 05
0.08
0.04
0.02
0.04
0.03
R:
35
0.03 0.04 0.04 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.220.08 0.11 0.04 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.130.03 0.04 0.03 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.790.04 0.04 0.04 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.800.04 0.05 0.04 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.720.02 0.03 0.02 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
5 6
70.13
0.05
0.05
[Haveliwala]
SCS CMU
2.20 1.28 1.43 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.341.28 2.02 1.28 0.96 0.64 0.53 0.53 0.85 0.60 0.48 0.53 0.451.43 1.28 2.20 1.29 0.68 0.56 0.56 0.63 0.44 0.35 0.39 0.331.29 0.96 1.29 2.06 0.95 0.78 0.78 0.61 0.43 0.34 0.38 0.320.91 0.86 0.91 1.27 2.41 1.97 1.97 1.05 0.73 0.58 0.66 0.560.37 0.35 0.37 0.52 0.98 2.06 1.37 0.43 0.30 0.24 0.27 0.22
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟
PreCompute:
14
32
6
9 10
8 11120.13
0.10
0.13
0.08
0.04
0.02
0.04
0.03
36
0.37 0.35 0.37 0.52 0.98 1.37 2.06 0.43 0.30 0.24 0.27 0.220.84 1.14 0.84 0.82 1.05 0.86 0.86 2.13 1.49 1.19 1.33 1.130.29 0.40 0.29 0.28 0.36 0.30 0.30 0.74 1.78 1.00 0.76 0.790.35 0.48 0.35 0.34 0.44 0.36 0.36 0.89 1.50 2.45 1.54 1.800.39 0.53 0.39 0.38 0.49 0.40 0.40 1.00 1.14 1.54 2.28 1.720.22 0.30 0.22 0.21 0.28 0.22 0.22 0.56 0.79 1.20 1.14 2.05
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
5 6
70.13
0.05
0.05
Fast on-line response
Heavy pre-computation/storage costO(n ) O(n )3 2
4/8/2008
13
SCS CMU
Q: How to Balance?
37
On-line Off-line
SCS CMU B_Lin: Basic Idea[Tong+]
1 32
9 10
8 11120.13
0.10
0.13
0.08
0.04
0.02
0 04
0.03
13
29 10
811
12
Find Community 14
32
5 6
7
9 10
8 1112
9 10
811
12
14
32
5 6
7
9 10
8 1112
13
2
38
45 6
70.13
0.05
0.05
0.044
5 6
7
Fix the remaining
Combine1
43
2
5 6
7
9 10
8 1112
5 6
7
14
32
5 6
7
9 10
8 1112
4
SCS CMU
Pre-computational stage• Q: • A: A few small, instead of ONE BIG, matrices inversions
Efficiently compute and store Q-1
39
4/8/2008
14
SCS CMU
• Q: Efficiently recover one column of Q• A: A few, instead of MANY, matrix-vector multiplication
On-Line Query Stage
0⎛ ⎞⎜ ⎟
-1
ie ir
40
+
00
000
1
00000
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
SCS CMU
Pre-compute Stage
• p1: B_Lin Decomposition– P1.1 partition– P1.2 low-rank approximation
• p2: Q matrices
41
p– P2.1 computing (for each partition)– P2.2 computing (for concept space)
11Q−
Λ
SCS CMU
P1.1: partition
1
43
2
5 6
9 10
811
12
1
43
29 10
811
12
skip
42
5 6
7
4
5 6
7
Within-partition links cross-partition links
4/8/2008
15
SCS CMU
P1.1: block-diagonal
1
43
2
5 6
9 10
811
121
43
2
5 6
9 10
811
12
skip
43
5
7
5
7
SCS CMU
P1.2: LRA for
31
4
2
5 6
7
9 10
811
121
43
2
5 6
7
9 10
811
12
skip
44
|S| << |W2|~
SCS CMU
=
skip
45
+
4/8/2008
16
SCS CMU
p2.1 Computing skip
46
c
SCS CMU
Comparing and
• Computing Time– 100,000 nodes; 100 partitions– Computing 100,00x is Faster!
St C t
11Q−
11Q−
skip
47
• Storage Cost – 100x saving!
Q1,1
Q1,2
Q1,k
=11Q−
SCS CMU
W +~ ~~
skip
48
11Q− + ?
4/8/2008
17
SCS CMU
p2.2 Computing:
1S − UV=_
-1Q
1,1
Q1,2
Q
skip
49
1
43
2
5 6
7
9 10
8 11
12
Q1,k
SCS CMU
We have:
Communities Bridges
skip
50
SM Lemma says:1 1 1
1 11
1U VcQQ Q Q− − −− + Λ≈
SCS CMU
On-Line Stage• Q
+
000
000
1
⎛ ⎞⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟ ?
ie ir
skip
51
+
Query
000000
⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎜ ⎟⎝ ⎠
Result
?
• A (SM lemma)Pre-Computation
4/8/2008
18
SCS CMU
On-Line Query Stage
q1:
skip
52
q1:q2:q3:q4:q5:q6:
SCS CMU skip
53
SCS CMU
Query Time vs. Pre-Compute Time
Log Query Time
•Quality: 90%+
54
Log Pre-compute Time
•Quality: 90%+ •On-line:
•Up to 150x speedup•Pre-computation:
•Two orders saving
4/8/2008
19
SCS CMU
• Motivation• Part I: Definitions• Part II: Fast Solutions
Roadmap
• B_Lin: RWR
• FastAllDAP: Esc_Prob
• BB_Lin: Skewed BGs
55
• Part III: Applications• Conclusion
• FastUpdate: Time-Evolving
SCS CMU
FastAllDAP [Tong+]
• Q: How to compute – Esc_Prob = Pr (smile before cry)?
56Footnote: augmented w/ universal sink as practical modification
A Bthe remaining graph
SCS CMU
Solving DAP (Straight-forward way)
2 1,
ˆProx( )=c ( )ti j i ji j p I cP p c p−→ − +i i i i
1-c: fly-out probability (to black-hole)
57
One matrix inversion, one proximity!
1 x (n-2) 1 x (n-2)(n-2) x (n-2)
4/8/2008
20
SCS CMU1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
P=1 5
3
2
6
4
0.5 0.5
0.5
0.50.5
0.5
0.5
1
0.5 1
P: Transition matrix (row norm )
58
Esc_Prob(1->5)= I - +
-1P: Transition matrix (row norm.)
2cc
SCS CMU
• Case 1, Medium Size Graph– Matrix inversion is feasible, but…– What if we want many proximities?– Q: How to get all (n ) proximities efficiently?
Challenges
2
59
Q g ( ) p y– A: FastAllDAP!
• Case 2: Large Size Graph – Matrix inversion is infeasible– Q: How to get one proximity efficiently?– A: FastOneDAP! skip
SCS CMU
FastAllDAP
• Q1: How to efficiently compute all possible proximities on a medium size graph?
60
g p– a.k.a. how to efficiently solve multiple
linear systems simultaneously?• Goal: reduce # of matrix inversions!
4/8/2008
21
SCS CMU
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
FastAllDAP: Observation
1 5
3
2
6
4
0.5 0.5
0.5
0.50.5
0.5
0.5
1
0.5 1
P=
61
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
Need two different matrix inversions!
P=
SCS CMU
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
FastAllDAP: Rescue
P=
Overlap between
Prox(1 5)
Prox(1 6)
62
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
Redundancy among different linear systems!
P=
Overlap between two gray parts!
( )
SCS CMU
FastAllDAP: Theorem
• Theorem: • Example:
63
• Proof: by SM Lemma
4/8/2008
22
SCS CMU
FastAllDAP: Algorithm• Alg.
– Compute Q– For i,j =1,…, n, compute
64
• Computational Save O(1) instead of O(n )!
• Example– w/ 1000 nodes, – 1m matrix inversion vs. 1 matrix!
2
SCS CMU
FastAllDAP
Time (sec)Straight-Solver 1,000x
faster!
65
Size of Graph
FastAllDAP
SCS CMU
• Motivation• Part I: Definitions• Part II: Fast Solutions
Roadmap
• B_Lin: RWR
• FastAllDAP: Esc_Prob
• BB_Lin: Skewed BGs
66
• Part III: Applications• Conclusion
• FastUpdate: Time-Evolving
4/8/2008
23
SCS CMU
RWR on Bipartite Graph
n
authorsAuthor-Conf. Matrix
Observation: n >> m!
Examples:
67
n
mConferences
Examples: 1. DBLP: 400k aus, 3.5k confs2. NetFlix: 2.7M usrs, 18k mvs
SCS CMU
• Q: Given query i, how to solve it?
0 1/3 1/3 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 0 0 0 1/4 0 0 0 01/3 1/3 0 1/3 0 0 0 0 0 0 0 01/3 0 1/3 0 1/4 0 0 0 0 0 0 0
0001
⎛ ⎞ ⎛ ⎞⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
RWR on Skewed bipartite graphs
… .. .
. .
68
1/3 0 1/3 0 1/4
0.9= ×
0 0 0 0 0 0 00 0 0 1/3 0 1/2 1/2 1/4 0 0 0 00 0 0 0 1/4 0 1/2 0 0 0 0 0
0 0 0 0 1/4 1/2 0 0 0 0 0 0 0 1/3 0 0 1/4 0 0 0 1/2 0 1/3 00 0 0 0 0 0 0 1/4 0 1/3 0 00 0 0 0 0 0 0 0 1/2 0 1/3 1/2
0 0 0 0 0
00
0.10000
0 0 1/4 0 1/3 0 1/2 0 0 0 0 0 0 0 0 0 0 1/3 1/3
1
0 0
⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟
+ ×⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟ ⎜ ⎟⎝ ⎠ ⎝ ⎠
??…. .. . . …. ...
0
0
n m
Ar
… .
. .. . …. .. . . …. ... Ac
SCS CMU
• Step 1:
• Step 2:
BB_Lin: Pre-Computation [Tong+ 06]
M = Ac ArX
1( 0 9 )I M −Λ = ×
2-step RWR for Conferences
All Conf-Conf Prox. Scores
• Step 2:
• Cost:
• Examples – NetFlix: 1.5hr for pre-computation; – DBLP: 1 few minutes 69
( 0.9 )I MΛ = − ×3( )O m m E+ ⋅
4/8/2008
24
SCS CMU
BB_Lin: Pre-Computation [Tong+ 06]
• Step 1:
• Step 2:
M = Ac ArX
1( 0 9 )I M −Λ = ×
2-step RWR for Conferences
All Conf-Conf Prox. Scores
70
• Step 2: ( 0.9 )I MΛ = − ×
SCS CMU
BB_Lin: Pre-Computation [Tong+ 06]
• Step 1:
• Step 2:
M = Ac ArX
1( 0 9 )I M −Λ = ×
2-step RWR for Conferences
All Conf-Conf Prox. Scores
71
• Step 2:
• Cost:
• Examples – NetFlix: 1.5hr for pre-computation; – DBLP: 1 few minutes
( 0.9 )I MΛ = − ×3( )O m m E+ ⋅
Ac/ArE edges
m x m
SCS CMU
BB_Lin: On-Line Stage
Case 1: - Conf - Conf
authors
Conferences
Read out !
72Ac/Ar
E edges
4/8/2008
25
SCS CMU
BB_Lin: On-Line Stage
Case 2: - Au - Conf
authors
Conferences
73Ac/Ar
E edges
1 matrix-vec!
SCS CMU
BB_Lin: On-Line Stage
Case 3: - Au - Au
authors
Conferences
74Ac/Ar
E edges
2 m atrix-vec!
SCS CMU
BB_Lin: Examples
• NetFlix dataset (2.7m user x 18k movies)– 1.5hr for pre-computation; – <1 sec for on-line
• DBLP dataset (400k authors x 3.5k confs)– A few minutes for pre-computation– <0.01 sec for on-line
75
4/8/2008
26
SCS CMU
• Motivation• Part I: Definitions• Part II: Fast Solutions
Roadmap
• B_Lin: RWR
• FastAllDAP: Esc_Prob
• BB_Lin: Skewed BGs
76
• Part III: Applications• Conclusion
• FastUpdate: Time-Evolving
SCS CMU
Challenges
• BB_Lin is good for skewed bipartite graphs– for NetFlix (2.7M nodes and 100M edges)– w/ 1.5 hr pre-computation for m x m core matrix– fraction of seconds for on-line query
77
• But…what if the graph is evolving over time– New edges/nodes arrive; edge weights increase…– 1.5hr itself becomes a part of on-line cost!
SCS CMU
1( 0.9 )I M −Λ = − × 3( )O m m E+ ⋅t=0
Q: How to update the core matrix?
78
t=11( 0.9 )I M −Λ = − ×
~ ~ 3( )O m m E+ ⋅
?
4/8/2008
27
SCS CMU
Update the core matrix
• Step 1:
M = Ac ArX~ M= X+
• Step 2:
79
1( 0.9 )I M −Λ = − ×~ ~ ?Rank 2 update
= + X
2( )O m 3( )O m m E+ ⋅
SCS CMU
Update : General Case [Tong+ 2008]
M = Ac ArX~n authors
m Conferences
• E’ edges changed• Involves n’ authors, m’ confs.
• Observation
80min( ', ') 'n m E
SCS CMU
• Observation: – the rank of update is small!
• Algorithm:E’ d h d
Update : General Case
min( ', ') 'n m E n authors
m Conferences
81
– E’ edges changed– Involves n’ authors, m’ confs.
– our Alg.
– (details in the paper)
2(min( ', ') ')O n m m E+3( )O m m E+ ⋅
81
4/8/2008
28
SCS CMU
FastOneUpdateTime (Seconds)
82
176x speedup
40x speedup
Datasets
SCS CMU
Fast-Batch-UpdateTime (Seconds)Time (Seconds)
83
Min (n’, m’)E’
15x speed-up on average!
SCS CMU
Summary of Part II
• Goal: Efficiently Solve Linear System(s)
• Sols.
84
– B_Lin: Approximate one large linear system– FastAllDAP: multiple inner-related linear
systems– BB_Lin: the intrinsic complexity is small– FastUpdate: (smooth) dynamic linear system
4/8/2008
29
SCS CMU1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3 1 3 2 3 3 3 4 3 5 3 6
p p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥B Li
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
FastAllDAP…
85
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p p
⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
B_Lin
1,1 1,2 1,3 1,4 1,5 1,6
2,1 2,2 2,3 2,4 2,5 2,6
3,1 3,2 3,3 3,4 3,5 3,6
4,1 4,2 4,3 4,4 4,5 4,6
5,1 5,2 5,3 5,4 5,5 5,6
6,1 6,2 6,3 6,4 6,5 6,6
p p p p p pp p p p p pp p p p p pp p p p p pp p p p p pp p p p p p
⎡ ⎤⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥⎣ ⎦
BB_Lin…FastUpdate
SCS CMU
• Motivation• Part I: Definitions• Part II: Fast Solutions
Roadmap
• Link Prediction
• NF
• gCap
86
• Part III: Applications• Conclusion
• CePS
• G-Ray
• pTrack/cTrack
SCS CMU
0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.090
0.02
0.04
0.06
0.08
0.1
0.12
0.14
0.16
0.18 Link Prediction: existence
with link
density
Prox (i j)+Prox (j i)
P i ff ti t di ti i h d d bl !
870 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0
0.05
0.1
0.15
0.2
0.25
no link
density
Prox (i j)+Prox (j i)
Prox. is effective to distinguish red and blue!
4/8/2008
30
SCS CMU
Link Prediction: direction
• Q: Given the existence of the link, what is the direction of the link?
• A: Compare prox(i j) and prox(j i)>70%
88
>70%
Prox (i j) - Prox (j i)
density
SCS CMU
Neighborhood Formulation
KDDPhilip S. Yu
IJCAI
Ning Zhong
… …
Q: what is most relatedconference to ICDM
89
ICDM
SDM
NIPS
AAAI M. Jordan
Ning Zhong
R. Ramakrishnan
…
…Conference Author
A: RWR![Sun ICDM2005]
SCS CMU
NF: example
KDD
SDM
PKDD
PAKDD
ICML0.009
0.0080.007
0.005
90
ICDM
KDD
ECML
CIKM
DMKD
SIGMOD
ICML
ICDE
0.011
0.005
0.0050.004
0.004
0.004
4/8/2008
31
SCS CMU
gCaP: Automatic Image Caption• Q
…
Sea Sun Sky Wave{ } { }Cat Forest Grass Tiger
91
{?, ?, ?,}
?A: RWR!
[Pan KDD2004]
SCS CMU
Image
Region
92
Test Image
Sea Sun Sky Wave Cat Forest Tiger Grass
Image
Keyword
SCS CMU
Image
Region
93
Test Image
Sea Sun Sky Wave Cat Forest Tiger Grass
Image
Keyword
4/8/2008
32
SCS CMU
Center-Piece Subgraph(CePS)
B
?
Q
CePS guy
94
A C
?
Original GraphBlack: query nodes
CePSA: RWR! [Tong KDD 2006]
Red: Max (Prox(Red, A) x Prox(Red, B) x Prox(Red, C))
SCS CMU
CePS: Example
R. Agrawal Jiawei Han
H.V. Jagadish
Laks V.S. Lakshmanan
Heikki
15 10 13
1 110
95
V. Vapnik M. Jordan
Mannila
Christos Faloutsos
Padhraic Smyth
Corinna Cortes
1 1
6
1 1
4 Daryl Pregibon
2
11 3
16
SCS CMU
K_SoftAnd: Relaxation of AND
96
Asking AND query? No Answer!
Disconnected Communities
Noise
4/8/2008
33
SCS CMU
0.0103
2_SoftAnd
x 1e-4 5
11 124
0.07100.10100.1010
0.4505
1 7
5
10
11
9 8
12
13
4
3
62
0.0103
0.0046
0.0019
0.0046
0.0024
0.0046
0.0046
0.0103
0.0019
0.00190.00460.0046
0.0103
97
1 7
5
10
11
9 8
12
13
4
3
62
0.0103
0.1617
0.1387
0.1617
0.0849
0.1617
0.1617
0.0103
0.1387
0.13870.16170.1617
And
1_SoftAnd (OR)
1 7
10
9 8
133
62
0.4505
0.1010
0.0710
0.1010
0.2267
0.1010
0.1010
0.4505
0.0710
SCS CMU
R. Agrawal Jiawei Han
H.V. Jagadish
Laks V.S. Lakshmanan
Umeshwar D l
1510 13
3 3
CePS: 2 Soft_ANDDB
98
V. Vapnik M. Jordan
Dayal
Bernhard Scholkopf
Peter L. Bartlett
Alex J. Smola
5 2 2
327
4
Stat.
SCS CMU OutputInputQuery Graph
99Attributed Data Graph
Matching Subgraph
Graph X-Ray
4/8/2008
34
SCS CMU
G-Ray: How to?
matching nodematching node
matching node
100
matching node
Goodness = Prox (12, 4) x Prox (4, 12) xProx (7, 4) x Prox (4, 7) x
Prox (11, 7) x Prox (7, 11) xProx (12, 11) x Prox (11, 12)
SCS CMU
Effectiveness: star-query
101
Query Result
SCS CMU
Effectiveness: line-query
102
Query
Result
4/8/2008
35
SCS CMU
Query
Effectiveness: loop-query
103
Result
SCS CMU
pTrack
• [Given]– (1) a large, skewed time-evolving bipartite
graphs, – (2) the query nodes of interest
Author A’ Rank in KDD
Year
104
• [Track]– (1) top-k most related nodes for each query
node at each time step t; – (2) the proximity score (or rank of proximity)
between any two query nodes at each time step t
SCS CMU
Philip S. Yu’s Top-5 conferences up to each year
ICDEICDCS
SIGMETRICSPDISVLDB
CIKMICDCSICDE
SIGMETRICSICMCS
KDDSIGMOD
ICDMCIKM
ICDCS
ICDMKDDICDESDMVLDB
105
1992 1997 2002 2007
DatabasesPerformanceDistributed Sys.
DatabasesData Mining
4/8/2008
36
SCS CMU
KDD’s Rank wrt. VLDB over years
Rank
106
Year
Data Mining and Databases are more and more relavant!
SCS CMU
cTrack
• [Given]– (1) a large, skewed time-evolving graphs, – (2) the query nodes of interest
[T k]
107
• [Track]– (1) top-k most central nodes at each time step t; – (2) the centrality score (or rank of centrality) for
each query node at each time step t
SCS CMU
Ranking of Centrality up to each year (in NIPS)
T. SejnowskiRank of Influential-ness
108
M. Jordan
G.HintonC. Koch
Year
4/8/2008
37
SCS CMU 10 most influential authors up to each year
T. Sejnowski
109
Author-paper bipartite graph from NIPS 1987-1999. 3k. 1740 papers, 2037 authors, spreading over 13 years
M. Jordan
SCS CMU
FastAllDAP
BB_Lin
FastUpdate
Computations
Link Prediction
NF
gCap
CePS
G-Ray
pTrack
cTrack
Applications
EfficientlySolve Linear
System(s)
Use Proximity as
Building block
110
A BH1 1
D1 1
EF
G1 11
I J11 1
Definitions
B_LinFastAllDAPcTrack
ProximityOn
Graphs
Weighted Multiple
Relationship
System(s)
SCS CMU
Take-home Messages
• Proximity Definitions – RWR– and a lot of variants
• Computations– SM Lemma
111
1 11 1 1
1( )1
TT
T
A u v AA A u v Av A u
− −− − −
−
⋅ ⋅= + ⋅ = −+ ⋅ ⋅
4/8/2008
38
SCS CMU
References• L. Page, S. Brin, R. Motwani, & T. Winograd. (1998), The
PageRank Citation Ranking: Bringing Order to the Web, Technical report, Stanford Library.
• T.H. Haveliwala. (2002) Topic-Sensitive PageRank. In WWW, 517-526, 2002
• J Y Pan H J Yang C Faloutsos & P Duygulu (2004) Automatic• J.Y. Pan, H.J. Yang, C. Faloutsos & P. Duygulu. (2004) Automatic multimedia cross-modal correlation discovery. In KDD, 653-658, 2004.
• C. Faloutsos, K. S. McCurley & A. Tomkins. (2002) Fast discovery of connection subgraphs. In KDD, 118-127, 2004.
• J. Sun, H. Qu, D. Chakrabarti & C. Faloutsos. (2005) Neighborhood Formation and Anomaly Detection in Bipartite Graphs. In ICDM, 418-425, 2005.
• W. Cohen. (2007) Graph Walks and Graphical Models. Draft.112
SCS CMU
References• P. Doyle & J. Snell. (1984) Random walks and electric
networks, volume 22. Mathematical Association America, New York.
• Y. Koren, S. C. North, and C. Volinsky. (2006) Measuring and extracting proximity in networks. In KDD, 245–255, 2006.
• A Agarwal S Chakrabarti & S Aggarwal (2006) Learning to• A. Agarwal, S. Chakrabarti & S. Aggarwal. (2006) Learning to rank networked entities. In KDD, 14-23, 2006.
• S. Chakrabarti. (2007) Dynamic personalized pagerank in entity-relation graphs. In WWW, 571-580, 2007.
• F. Fouss, A. Pirotte, J.-M. Renders, & M. Saerens. (2007) Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. IEEE Trans. Knowl. Data Eng. 19(3), 355-369 2007.
113
SCS CMU
References• H. Tong & C. Faloutsos. (2006) Center-piece subgraphs:
problem definition and fast solutions. In KDD, 404-413, 2006.• H. Tong, C. Faloutsos, & J.Y. Pan. (2006) Fast Random Walk
with Restart and Its Applications. In ICDM, 613-622, 2006.• H. Tong, Y. Koren, & C. Faloutsos. (2007) Fast direction-
aware proximity for graph mining In KDD 747 756 2007aware proximity for graph mining. In KDD, 747-756, 2007.• H. Tong, B. Gallagher, C. Faloutsos, & T. Eliassi-Rad. (2007)
Fast best-effort pattern matching in large attributed graphs. In KDD, 737-746, 2007.
• H. Tong, S. Papadimitriou, P.S. Yu & C. Faloutsos. (2008) Proximity Tracking on Time-Evolving Bipartite Graphs. to appear in SDM 2008.
114