Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin...

25
Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China

Transcript of Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin...

Page 1: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Information Network Analysis and Discovery

Cuiping LiGuoming He

Information School, Renmin University of China

Page 2: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Related Work1. Whole graph Level -Macro properties (Laws, generators) -Summary/Visualization -Index

2. Sub-graph Level

-Frequent Pattern Mining

-Clustering (Community/group detection) -Connected Sub-graph, Central Piece -Pattern Match

3. Node or Link Level -Ranking

-Proximity/Similarity-Node Classification-Outlier Detection (Abnormal nodes/links)

Page 3: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Node Proximity/Similarity: Why?

• Link prediction [Liben-Nowell+], [Tong+]• Ranking [Haveliwala], [Chakrabarti+]• Email Management [Minkov+]• Image caption [Pan+]• Neighborhooh Formulation [Sun+]• Conn. subgraph [Faloutsos+], [Tong+], [Koren+]• Pattern match [Tong+]• Collaborative Filtering [Fouss+]• Many more…

Page 4: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Node Similarity: Related Work(1)• Computer Network’99: Finding related pages in the World

Wide Web, Jeffrey Dean , Monika R. Henzinger (adapting from HITS)

• KDD’02: SimRank: A Measure of Structural-Context Similarity, Glen Jeh, Jennifer Widom (Adapting from PageRank)– Exploiting Hierarchical Domain Structure to Compute Simil

arity. P. Ganesan, H. Garcia-Molina, and J. Widom. Transactions on Information Systems, 21(1): 64-93, January 2003.

• Vertex similarity in networks: Phys. Rev. E 73, 026120 (2006)

• Optimization on simrank– WWW’05: Scaling link-base similarity search,

D.Fogaras, B. Racz (Approximate)– VLDB’08: Accuracy Estimate and Optimization

Techniques for SimRank Computation Dmitry Lizorkin, Pavel Velikhov , Maxim Grinev, Denis Turdakov .

Page 5: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Node Similarity: Related Work(2)

• Domain-Integrated of simrank:– VLDB’08: Simrank++: Query Rewriting through Link

Analysis of the Click Graph, Loannis Antonellis (Stanford University), Hector Garcia-Molina (Stanford University), Chi-Chao Chang (Yahoo!). (keywords, ads)

• Clustering using simrank:– SIGIR’03: ReCom: Reinforcement Clustering of multi-

type interrelated data objects, J. Wang, H.J. Zeng, Z. Chen, H.J. LU,L. Tao

– VLDB’06: LinkCLus:Efficient Clustering via Heterogeneous Semantic Links, Xiaoxin Yin, Jiawei Han, Philip Yu

Page 6: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Existing Research: Limitation 1

• Not Dynamic– Static Algorithm

• Iterative

– Challenges of Dynamic Network• Re-computation even one node or edge changes

– Our Solution• Non-iterative• Incremental Computation

Cuiping Li, Jiawei Han, Guoming He, Xin Jin, Yizhou Sun, Yintao Yu, Tianyi Wu, "Fast Computation of SimRank for Static and Dynamic Information Networks", Int. Conf. on Extending Data Base Technology (EDBT'10), Lausanne, Switzerland, March 2010

Page 7: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Existing Research: Limitation 2

• Not Efficient– Our Solution: employ the modern hardware

resource• GPU (Graphic Process Unit)• Multi-Processor

Page 8: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Compute Node Similarity for Dynamic Network

• SimRank formula

• Or

• Intuition– Two objects are similar if they are referenced by

similar objects.

Page 9: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

How to Compute SimRank Incremetally

• Fist glance at SimRank formula– It is Iterative. Has no chance to be computed incrementally

• Key Observation– SimRank iteration formula has the same form as the well-known

Sylvester Equations, based on this, we can compute SimRank without iteration.

Page 10: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Vec-Operator and Kronecker Products

• Vec-Operator

– Vec flattens an n x n matrix A into an n2 x 1 vector

– It stacks the columns of the matrix on top of each other, from left to right

• Kronecker Product

– Product of two matrices A and B

– Each element of A is multiplied with the full matrix B:

Page 11: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Sylvester Equations

• Sylvester Equations: X=SXT + X0

– Given three n x n matrixes S, T, and X0

– We want to determine X– Solvable in O(n3)

Page 12: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Sylvester Equations• Rewrite the Sylvester Equations as

vec(X)=vec(SXT) + vec(X0)

• Exploit the well-known fact

vec(SXT) = (TTS)vec(X)

• We can get

vec(X)= (TTS)vec(X) + vec(X0)

• We can get (I - TTS)vec(X) = vex(X0)

• Now we have to solve

vec(X)=(I - TTS)-1 vec(X0)

Page 13: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.
Page 14: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

SimRank

• SimRank has the same form as the Sylvester equationsX=cATXA +(1-c)e,

(A is the normalized adjacent matrix, e is an identity matirx)

• Similarly, for SimRank, we have to solve

vec(X)=(I -cAT AT)-1 vec((1-c)e)

vec(X)= (1-c) (I -cAT AT)-1 vec(e)– AT AT can be solved in O(n3)– More importantly, when A is sparse/skew, we can

improve the efficiently further.

Page 15: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

15

Advantages of non-iterative method

vec(X)= (1-c) (I -cAT AT)-1 vec(e)

• It can be solved approximately

• It can be computed incrementally

• It can be computed pair-wisely

Page 16: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

vec(X)= (1-c) (I -cWW)-1 vec(e)

• 利用奇异值分解 SVD 和 Sherman-Morrison方程求 L的逆

Approximation

W =

IvecVVUUcIcSvec

VVUUcIL

UUVVc

VUW

1

1

1

Page 17: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

• W 的 low rank SVD分解

• k的大小– k越大,计算时间越长,精确度越高

• Error Bound

Approximation

kknkkkknnn VUW

Page 18: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

• 预计算

• 计算某对结点 (i,j) 的 SimRank

Approximation

rlji

jil

VcVIcjiS

UUV

,

:,:,

1,

IvecVVV

UUVVc

r 1

Page 19: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Incremental Computation

只需要对 U,,V进行维护即可

Page 20: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Applications

• Similarity Tracking: return the N most similar nodes of i at each time step t.

• Centrality Tracking: return the N most central nodes at each time step t.

Page 21: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Experimental Result on DBLP

Top-10 Most Similar Terms for ‘Prof. Jennifer Widom’ up to Each Time Step

Page 22: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Experimental Result

Top-10 Most Similar Authors for ‘Prof. Jennifer Widom’ up to Each Time Step

Page 23: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

• 预计算时间

Page 24: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

• 计算不同个数结点对的时间

Page 25: Information Network Analysis and Discovery Cuiping Li Guoming He Information School, Renmin University of China.

Wikepedia Data

• We set the threshold T to be 1.0e-6. For k=15– the pre-compute time of the Wikipedia dataset

is approx. 5.68 hours– the query time for every 1000 node pairs is

3.718 seconds