FAST COUNTING OF TRIANGLES IN LARGE NETWORKS: ALGORITHMS AND LAWS
RPI Theory Seminar, 24 November 2008
Charalampos (Babis) Tsourakakis, School of Computer Science, Carnegie Mellon University
http://www.cs.cmu.edu/~ctsourak
Counting Triangles
RPI, November 2008
Given an undirected, simple graph G(V,E), a triangle is a set of 3 vertices such that any two of them are connected by an edge of the graph.
Related Problems
a) Decide if a graph is triangle-free.
b) Count the total number of triangles δ(G).
c) Count the number of triangles δ(v) that each vertex v participates in.
d) List the triangles that each vertex v participates in.
Our focus
$$\delta(v) = |\{(u,w) \in E : (v,u) \in E,\ (v,w) \in E\}|$$
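The definitions of δ(G) and δ(v) can be checked directly on a toy graph. The sketch below is a brute-force illustration only: the function name `count_triangles` and the vertex labelling 0..n-1 are assumptions of this example, and the method is not one of the algorithms discussed in the talk.

```python
from itertools import combinations

def count_triangles(n, edges):
    """Return (delta_G, delta_v) for an undirected simple graph.

    n     : number of vertices, labelled 0..n-1 (an assumption of this sketch)
    edges : iterable of (u, v) pairs
    """
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    per_vertex = [0] * n
    total = 0
    # Test every set of 3 vertices: it is a triangle iff all 3 pairs are edges.
    for u, v, w in combinations(range(n), 3):
        if v in adj[u] and w in adj[v] and w in adj[u]:
            total += 1
            for x in (u, v, w):
                per_vertex[x] += 1
    return total, per_vertex

# K4 without the edge (2,3): the triangles are {0,1,2} and {0,1,3}.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
total, per_vertex = count_triangles(4, edges)
print(total, per_vertex)  # 2 [2, 2, 1, 1]
```

Since every triangle has 3 vertices, the per-vertex counts always sum to 3δ(G).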
Why is triangle counting important*?
• Social Network Analysis: “Friends of friends are friends” [WF94]
• Web Spam Detection [BPCG08]
• Hidden Thematic Structure of the Web [EM02]
• Motif Detection, e.g. in biological networks [YPSB05]
*A few indicative reasons, from the graph mining perspective
Why is triangle counting important?
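Furthermore, two often used metrics are:
• Clustering Coefficient
$$CC(G) = \frac{1}{|V'|}\sum_{v \in V'} cc(v) = \frac{1}{|V'|}\sum_{v \in V'} \frac{\delta(v)}{\tau(v)}$$
where: $V' = \{v : d(v) \ge 2\}$ and $\tau(v) = \binom{d(v)}{2}$
• Transitivity Ratio
$$TR(G) = \frac{3\,\delta(G)}{\tau(G)}$$
where: $\delta(G) = \frac{1}{3}\sum_{v \in V}\delta(v)$ and $\tau(G) = \sum_{v \in V}\tau(v)$
Here τ(v) is the number of triples at node v.
[Figure: a triple at node v; a triangle]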
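A minimal sketch of the two metrics, assuming the reading CC(G) = mean of δ(v)/τ(v) over vertices of degree at least 2 and TR(G) = 3δ(G)/τ(G), with τ(v) = d(v) choose 2. The function name `clustering_metrics` and the adjacency-dict input are assumptions of this example.

```python
def clustering_metrics(adj):
    """adj: dict vertex -> set of neighbours (undirected, simple graph).

    Returns (CC, TR). Assumes at least one vertex has degree >= 2 and
    that vertex labels are comparable (e.g. integers).
    """
    delta = {}  # triangles per vertex
    tau = {}    # triples centred at each vertex: d(v) choose 2
    for v, nbrs in adj.items():
        d = len(nbrs)
        tau[v] = d * (d - 1) // 2
        # A triangle at v is a pair of neighbours that are themselves adjacent.
        delta[v] = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    v_prime = [v for v in adj if len(adj[v]) >= 2]
    cc = sum(delta[v] / tau[v] for v in v_prime) / len(v_prime)
    total_triangles = sum(delta.values()) // 3
    tr = 3 * total_triangles / sum(tau.values())
    return cc, tr

# Triangle {0,1,2} with a pendant vertex 3 attached to vertex 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cc, tr = clustering_metrics(adj)
print(cc, tr)
```

On this toy graph the two metrics differ (7/9 vs. 3/5), which shows that they weight high-degree vertices differently.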
Outline
• Related Work
• Proposed Method
• Experiments
• Triangle-related Laws
• Triangles in Kronecker Graphs
• Future Work & Open Problems
Counting methods
Dense graphs
                   Fast          Low space
Time complexity    O(n^2.37)     O(n^3)
Space complexity   O(n^2)        O(m)

Sparse graphs
                   Fast                           Low space
Time complexity    O(m^0.7 n^1.2 + n^(2+o(1)))    e.g. O(n d_max^2)
Space complexity   Θ(n^2) (eventually)            Θ(m)
Outline of the Proposed Method
• EigenTriangle theorem
• EigenTriangleLocal theorem
• EigenTriangle algorithm
• EigenTriangleLocal algorithm
• Efficiency & Complexity
  • Power law degree distributions
  • Gershgorin discs
  • Real world network spectra
Theorem [EigenTriangle]
Theorem: The number of triangles δ(G) in an undirected, simple graph G(V,E) is given by:
$$\delta(G) = \frac{1}{6}\sum_{i=1}^{|V|} \lambda_i^3$$
where $\lambda_1, \lambda_2, \ldots, \lambda_{|V|}$ are the eigenvalues of the adjacency matrix of graph G.
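As a quick sanity check of the formula, consider the complete graph K_4 (an illustrative choice, not an example from the talk). Its adjacency matrix has eigenvalues 3, -1, -1, -1, and every set of 3 vertices forms a triangle:

```latex
\delta(K_4) = \frac{1}{6}\left(3^3 + 3\cdot(-1)^3\right)
            = \frac{27 - 3}{6}
            = 4
            = \binom{4}{3}
```

Note that the single top eigenvalue alone already gives 27/6 = 4.5, which is within 12.5% of the exact count.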
Proof
Call A the adjacency matrix of the graph and consider the i-th diagonal element α_ii of A^3. This element counts the closed walks of length 3 that start and end at vertex i, so it equals twice the number of triangles vertex i participates in: each triangle through i is traversed in two directions, i-j-k and i-k-j. Summing over the 3 participating vertices, each triangle is counted 6 times, so the trace of A^3 is 6δ(G). Furthermore, if Ax=λx, then λ^3 is an eigenvalue of A^3 (*), and vice versa, if λ is an eigenvalue of A^3, then the cube root of λ is an eigenvalue of A. Since the trace equals the sum of the eigenvalues, the theorem follows.
* A^3 x = AAAx = AAλx = λAAx = λAλx = λ^2 Ax = λ^3 x
Theorem [EigenTriangleLocal]
Theorem: The number of triangles δ(i) that vertex i participates in is equal to:
$$\delta(i) = \frac{1}{2}\sum_{j=1}^{|V|} \lambda_j^3 u_{ij}^2$$
where $u_{ij}$ is the i-th entry of the j-th eigenvector $u_j$.
Proof [Sketch]: Follows from the previous theorem and the fact that A is symmetric, therefore diagonalizable, and also $A^3 = U \Lambda^3 U^T$.
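A worked check on the single triangle K_3 (an illustrative choice, not from the talk), reading u_ij as the i-th entry of the unit eigenvector belonging to λ_j. The eigenvalues are 2, -1, -1, and the eigenvector for λ_1 = 2 is (1,1,1)/√3, so u_{i1}^2 = 1/3 for every vertex i. Because the rows of the orthogonal matrix U have unit norm, the remaining two entries satisfy u_{i2}^2 + u_{i3}^2 = 2/3:

```latex
\delta(i) = \frac{1}{2}\left(2^3\cdot\frac{1}{3} + (-1)^3\cdot\frac{2}{3}\right)
          = \frac{1}{2}\left(\frac{8}{3} - \frac{2}{3}\right)
          = 1
```

Each of the 3 vertices participates in exactly 1 triangle, as expected.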
EigenTriangle Algorithm
EigenTriangleLocal Algorithm
Why are these two algorithms efficient?
Skewed Degree Distributions
Skewed degree distributions are ubiquitous in nature! They have been termed “the signature of human activity” [FKP02], but they also appear in all other kinds of networks, e.g. biological ones. See [N05][M04] for generative models of power law distributions.
They are typically referred to as power laws, even if we sometimes abuse the strict definition of a power law, i.e. $\log(y) = a\log(x) + b$.
Examples of power laws
Newman [N05] demonstrated how often power laws appear using many different types of networks, ranging from word frequencies to population of cities.
Many cities have a small population
Few cities have a huge population
Gershgorin’s Discs
Theorem: Let B be an arbitrary n×n matrix. Then the eigenvalues λ of B are located in the union of the n discs
$$|\lambda - b_{kk}| \le \sum_{j \ne k} |b_{kj}|$$
For a proof see Demmel [D97], p.82.
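A small sketch of the discs for an adjacency matrix. For a simple graph the diagonal is zero, so every disc is centred at 0 with radius equal to the vertex degree, and the resulting bound is |λ| ≤ d_max. The function name `gershgorin_discs` and the star-graph example are assumptions of this illustration.

```python
def gershgorin_discs(B):
    """Return one (center, radius) pair per row of the square matrix B,
    given as a list of lists."""
    discs = []
    for k, row in enumerate(B):
        # Radius: sum of absolute values of the off-diagonal entries of row k.
        radius = sum(abs(x) for j, x in enumerate(row) if j != k)
        discs.append((row[k], radius))
    return discs

# Star graph: vertex 0 is connected to vertices 1, 2, 3.
A = [[0, 1, 1, 1],
     [1, 0, 0, 0],
     [1, 0, 0, 0],
     [1, 0, 0, 0]]
discs = gershgorin_discs(A)
bound = max(abs(c) + r for c, r in discs)
print(discs)   # [(0, 3), (0, 1), (0, 1), (0, 1)]
print(bound)   # 3
```

The true eigenvalues of this star are ±√3 ≈ ±1.73 and 0, so even here the bound of 3 is loose.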
Gershgorin Discs
Bounds on the airports network (Observe how loose)
Typical real world spectra
[Figure panels: Airports, Political blogs]
Top Eigenvalues
Zooming in on the top eigenvalues and plotting the rank vs. the eigenvalue in log-log scale reveals that the top eigenvalues follow a power law [FFF99].
Some years later, Mihail & Papadimitriou [MP02] and Chung, Lu and Vu [CLV03] proved this fact.
Our idea
Simple & clear: Use a low-rank approximation of A^3 to estimate its diagonal elements and its trace.
It also suggests a way of thinking: Take advantage of special properties (e.g. power laws) to reduce the complexity of certain computational tasks in real-world networks.
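The idea can be sketched as follows, assuming NumPy is available. This is not the talk's algorithm verbatim: the function name `eigen_triangle` is hypothetical, a fixed rank k replaces any stopping rule, and the dense solver `numpy.linalg.eigh` stands in for the Lanczos method.

```python
import numpy as np

def eigen_triangle(A, k):
    """Estimate the total and per-vertex triangle counts from the k
    eigenpairs of largest absolute value of the symmetric 0/1 matrix A."""
    lam, U = np.linalg.eigh(A)               # dense solver, stand-in for Lanczos
    top = np.argsort(-np.abs(lam))[:k]       # indices of the k largest |lambda|
    lam_k, U_k = lam[top], U[:, top]
    total = np.sum(lam_k ** 3) / 6.0         # sum_j lambda_j^3 / 6
    local = (U_k ** 2) @ (lam_k ** 3) / 2.0  # sum_j lambda_j^3 * u_ij^2 / 2
    return total, local

# K4 without the edge (2,3): 2 triangles, per-vertex counts [2, 2, 1, 1].
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [1, 1, 0, 0]], dtype=float)

exact_total, exact_local = eigen_triangle(A, 4)   # full rank: exact counts
approx_total, approx_local = eigen_triangle(A, 2) # rank 2: an estimate
print(exact_total, exact_local)
print(approx_total, approx_local)
```

With k equal to the number of vertices the result is exact up to rounding; smaller k trades accuracy for fewer eigenpairs.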
Summing up: Why does it work?
The first main reason: apart from the top eigenvalues, the bulk of the spectrum is almost symmetric around 0, so the cubes of these eigenvalues nearly cancel in the sum.
Cubing strongly amplifies this phenomenon: the few top eigenvalues dominate the sum of cubes.
Complexity Analysis
The main computational bottleneck, which determines the complexity, is the Lanczos method.
Lanczos runs in time linear in the number of non-zero entries of the matrix, i.e. the edges, assuming that we compute a small, constant number of eigenvalues.
Convergence of Lanczos is fast due to the eigenvalue power law (see Kaniel-Paige theory [GL89]).
Datasets
Competitor: Node Iterator
The Node Iterator algorithm considers one node at a time, looks at its neighbors, and counts how many pairs of them are connected to each other.
Complexity: O(n d_max^2)
We report the results as the speedup that the EigenTriangle algorithm gives compared to the running time of Node Iterator.
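A minimal sketch of Node Iterator as described above. The function name `node_iterator` and the adjacency-dict representation are assumptions of this example, not the implementation used in the experiments.

```python
from itertools import combinations

def node_iterator(adj):
    """adj: dict vertex -> set of neighbours (undirected, simple graph).
    Returns the total number of triangles."""
    closed_pairs = 0
    for v, nbrs in adj.items():
        # Check every pair of neighbours of v: at most d_max^2 pairs per node.
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:
                closed_pairs += 1
    # Each triangle is found once from each of its 3 vertices.
    return closed_pairs // 3

k4 = {i: set(range(4)) - {i} for i in range(4)}
print(node_iterator(k4))  # 4
```

The inner loop examines d(v) choose 2 pairs for each node v, which gives the O(n d_max^2) bound.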
Results: #Eigenvalues vs. Speedup
Results: #Edges vs. Speedup
Main points
Some interesting facts about the two scatterplots:
• The mean approximation rank required for at least 95% accuracy is 6.2.
• Speedups are between 33.7x and 1159x. The mean speedup is 250x.
• Notice the increasing speedup as the size of the network grows.
Zooming in
Zooming in on this point
Evaluating the Local Counting Method
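• Pearson’s correlation coefficient ρ
• Relative Reconstruction Error
$$RRE = \frac{1}{|V|}\sum_{i=1}^{|V|}\frac{|\delta(i) - \delta'(i)|}{\delta(i)}$$
where δ'(i) denotes the estimated number of triangles of vertex i.
Political Blogs: RRE 7*10^-4, ρ 99.97%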
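Both evaluation measures are easy to compute from the exact and the estimated per-vertex counts. The sketch below assumes RRE is the mean of |δ(i) - δ'(i)| / δ(i) and that every exact count is positive; the function names and the sample numbers are hypothetical.

```python
from math import sqrt

def relative_reconstruction_error(true_counts, est_counts):
    """Mean of |delta(i) - delta'(i)| / delta(i).
    Assumes every true count is > 0 (otherwise the ratio is undefined)."""
    n = len(true_counts)
    return sum(abs(t - e) / t for t, e in zip(true_counts, est_counts)) / n

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

true_counts = [10, 4, 2, 1]          # hypothetical exact counts
est_counts = [9.5, 4.2, 2.0, 1.1]    # hypothetical estimates
rre = relative_reconstruction_error(true_counts, est_counts)
rho = pearson(true_counts, est_counts)
print(rre, rho)
```

RRE measures the per-vertex error, while ρ only measures how well the estimates track the exact counts linearly, so the two are complementary.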
#Eigenvalues vs. ρ for three networks
Observe how a low rank results in almost optimal results. This holds for surprisingly many real world networks.
Triangle Participation Law
Plot of the number of triangles δ (x-axis) vs. the number of vertices that participate in δ triangles.
a) EPINIONS, who-trusts-whom network
b) ASN, social network
c) HEP_TH, collaboration network
Degree Triangle Law
Plot of the degree d_i (x-axis) vs. the mean number of triangles that nodes with degree d_i participate in.
[Figure panels: Epinions, ASN]
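The data behind both kinds of plot can be derived from the per-vertex triangle counts. The sketch below is illustrative: the function name `triangle_laws` and the adjacency-dict input are assumptions, and the per-vertex counts are computed by brute force rather than with the EigenTriangleLocal algorithm.

```python
from collections import Counter, defaultdict

def triangle_laws(adj):
    """adj: dict vertex -> set of neighbours (comparable labels, e.g. ints).

    Returns (participation, degree_triangle):
      participation[t]   = number of vertices in exactly t triangles
      degree_triangle[d] = mean number of triangles over vertices of degree d
    """
    delta = {v: sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
             for v, nbrs in adj.items()}
    participation = Counter(delta.values())
    by_degree = defaultdict(list)
    for v, nbrs in adj.items():
        by_degree[len(nbrs)].append(delta[v])
    degree_triangle = {d: sum(ts) / len(ts) for d, ts in by_degree.items()}
    return participation, degree_triangle

# Triangle {0,1,2} with a pendant vertex 3 attached to vertex 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
participation, degree_triangle = triangle_laws(adj)
print(dict(participation), degree_triangle)
```

Plotting either dictionary on log-log axes gives the kind of view shown on these slides.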
Kronecker Graphs
This model was introduced in [LCKF05]. It is based on the simple operation of the Kronecker product to generate graphs that mimic real world networks.
Deterministic Kronecker Graphs: Kronecker Product of the adjacency matrix at the current step k with the initiator adjacency matrix (typically small).
Stochastic Kronecker Graphs: Kronecker Product of the matrix at the current step k with the initiator matrix. The initiator matrix contains probabilities. For more details see [LF07].
Triangles in Kronecker Graphs
Some notation first:
• A: n×n initiator adjacency matrix of the undirected, simple graph G_A
• B = A^[k]: k-th Kronecker product
• λ = (λ_1, ..., λ_n): the eigenvalues of A
• Δ(G_A), Δ(G_B): number of triangles of G_A, G_B
Theorem [KroneckerTRC]
$$\Delta(G_B) = 6^{k}\,\Delta(G_A)^{k+1}, \quad k \ge 0$$
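A numerical check on a small initiator, reading the theorem as Δ(G_B) = 6^k Δ(G_A)^(k+1) with A^[0] = A, so that A^[k] is A multiplied k times by itself with the Kronecker product. The helper names `kron` and `triangles` are assumptions of this sketch, and triangles are counted via trace(A^3)/6.

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def triangles(A):
    """Triangle count of a simple undirected 0/1 adjacency matrix,
    computed as trace(A^3) / 6."""
    n = len(A)
    nbrs = [[j for j in range(n) if A[i][j]] for i in range(n)]
    # trace(A^3) = number of closed walks i -> j -> k -> i
    trace = sum(A[k][i] for i in range(n) for j in nbrs[i] for k in nbrs[j])
    return trace // 6

K3 = [[0, 1, 1],
      [1, 0, 1],
      [1, 1, 0]]          # initiator: a single triangle

B = K3
for k in range(1, 3):
    B = kron(B, K3)       # B = A^[k]
    print(k, triangles(B), 6 ** k * triangles(K3) ** (k + 1))
```

Both columns agree for k = 1 and k = 2 (6 and 36 triangles), and the Kronecker powers stay simple graphs because the diagonal of A is zero.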
Proof
We use induction on the number of recursion steps k. For k=0 the theorem trivially holds.
Assume now that KroneckerTRC holds for some r ≥ 0. Call C = A^[r], D = A^[r+1], and let [μ_i], i=1..s, be the eigenvalues of C. By the assumption
$$\Delta(G_C) = 6^{r}\,\Delta(G_A)^{r+1}$$
The eigenvalues of D are given by the Kronecker product μ ⊗ λ, i.e. all products μ_i λ_j. By the EigenTriangle theorem, the number of triangles in D is given by:
$$\begin{aligned}
\Delta(G_D) &= \frac{1}{6}\sum_{i=1}^{s}\sum_{j=1}^{n} (\mu_i \lambda_j)^3
             = \frac{1}{6}\sum_{i=1}^{s}\sum_{j=1}^{n} \mu_i^3 \lambda_j^3 \\
            &= \frac{1}{6}\sum_{i=1}^{s} \mu_i^3 \cdot 6\,\Delta(G_A)
             = \frac{1}{6}\cdot 6\,\Delta(G_C)\cdot 6\,\Delta(G_A) \\
            &= 6\,\Delta(G_C)\,\Delta(G_A)
             = 6^{r+1}\,\Delta(G_A)^{r+2}
\end{aligned}$$
Therefore KroneckerTRC holds for all k ≥ 0. Q.E.D.
Theoretical Challenge I: Spectra of real world networks
Can we prove things about the distribution of the eigenvalues, adopting a random graph model such as the expected degree model G(w) [CLV03]?
An analog to Wigner’s semicircle law for random Erdős-Rényi graphs (see Füredi-Komlós [FK81]).
[Figure: spectrum of $G_{40,\,1/2}$ over 100000 iterations [S07]]
Theoretical Challenge I: Spectra of real world networks
Empirically, the rest of the spectrum follows a triangular-like distribution [FDBV01].
Can we prove something about this empirical observation?
Theoretical Challenge II: Eigenvectors of real world networks
Things are even “worse” than in the case of spectra: very little is known about the eigenvectors. Related work: see [P08] for random graphs.
Theoretical Challenge III: Degree Triangle Law
Prove the pattern we saw using the expected degree random graph model G(w) (see [S04]).
Conjecture: The relationship we observed probably appears for some values of the slope of the degree distribution. Further recent experiments showed that for some graphs this pattern does not hold.
Experimental Challenge I: Compare with Streaming Methods
• Streaming or semi-streaming methods perform one or O(1) passes over the graph [YKS02][BFLSS06][BPCG08].
• Common underlying idea: sophisticated sampling methods.
• Implement and compare.
Practical Challenge I: Triangles in Large Scale Graph Mining
• Many Giga-byte and Peta-byte sized graphs. How to handle these graphs? HADOOP
• EigenTriangle algorithms are based just on simple matrix-vector multiplications.
• These are easy to parallelize on all sorts of architectures (distributed memory, shared memory).
• See [DHV93] for the details.
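To illustrate why matrix-vector multiplication is the only primitive needed, the sketch below computes the top eigenvalue from an edge list. It uses plain power iteration as a simpler stand-in for Lanczos; the function names and the start vector are assumptions of this example, and power iteration does not converge on bipartite graphs, where λ and -λ have equal magnitude.

```python
from math import sqrt

def matvec(edges, x):
    """y = A x for the adjacency matrix of an undirected graph given as an
    edge list. Each edge contributes independently, so the loop can be split
    across workers and the partial results summed."""
    y = [0.0] * len(x)
    for u, v in edges:
        y[u] += x[v]
        y[v] += x[u]
    return y

def top_eigenvalue(edges, n, iters=100):
    """Power iteration built only on matvec (stand-in for Lanczos)."""
    x = [1.0 + i for i in range(n)]          # arbitrary non-zero start vector
    for _ in range(iters):
        y = matvec(edges, x)
        norm = sqrt(sum(t * t for t in y))
        x = [t / norm for t in y]
    y = matvec(edges, x)
    return sum(a * b for a, b in zip(x, y))  # Rayleigh quotient (x has unit norm)

k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
lam1 = top_eigenvalue(k4_edges, 4)
print(lam1, lam1 ** 3 / 6)   # top eigenvalue of K4 and its rank-1 triangle estimate
```

Each pass touches every edge once, which matches the linear-in-edges cost per iteration mentioned in the complexity analysis.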
PEGASUS: Peta-Graph Mining from the Triangle perspective
On-going work with U Kang and Christos Faloutsos in collaboration with Yahoo! Research.
Among others: Implement EigenTriangle algorithms in HADOOP and compare to other methods.
Find outliers in graphs with many billions of edges wrt triangles.
Soon… Stay tuned!
Curious about:
Acknowledgements
Christos Faloutsos and Yiannis Koutis, for the helpful discussions
Maria Tsiarli, for the PEGASUS logo
References
[WF94] Wasserman, Faust: “Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences)”
[EM02] Eckmann, Moses: “Curvature of co-links uncovers hidden thematic layers in the World Wide Web”
[BPCG08] Becchetti, Boldi, Castillo, Gionis: “Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs”
[FKP02] Fabrikant, Koutsoupias, Papadimitriou: “Heuristically Optimized Trade-offs: A New Paradigm for Power Laws in the Internet”
[N05] Newman: “Power laws, Pareto distributions and Zipf's law”
[M04] Mitzenmacher: “A brief history of generative models for power law and lognormal distributions”
[FK81] Furedi, Komlos: “Eigenvalues of random symmetric matrices”
[S04] Danilo Sergi: “Random graph model with power-law distributed triangle subgraphs”
[D97] Demmel: “Applied Numerical Linear Algebra”
[LCKF05] Leskovec, Chakrabarti, Kleinberg, Faloutsos: “Realistic, Mathematically Tractable Graph Generation and Evolution using Kronecker Multiplication”
[LF07] Leskovec, Faloutsos: “Scalable Modeling of Real Graphs using Kronecker Multiplication”
[FFF99] Faloutsos, Faloutsos, Faloutsos: “On power-law relationships of the Internet topology”
[MP02] Mihail, Papadimitriou: “On the Eigenvalue Power Law”
[CLV03] Chung, Lu, Vu: “Spectra of Random Graphs with given expected degrees”
[YKS02] Bar-Yossef, Kumar, Sivakumar: “Reductions in streaming algorithms, with an application to counting triangles in graphs”
[GL89] Golub, Van Loan: “Matrix Computations”
[BFLSS06] Buriol, Frahling, Leonardi, Spaccamela, Sohler: “Counting triangles in data streams”
[DHV93] Demmel, Heath, van der Vorst: “Parallel Numerical Linear Algebra”
[YPSB05] Ye, Peyser, Spencer, Bader: “Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast”
[P08] Pradipta Mitra: “Entrywise Bounds for Eigenvectors of Random Graphs”
[FDBV01] Farkas, Derenyi, Barabasi, Vicsek: “Spectra of "real-world" graphs: Beyond the semi-circle law”
[S07] Spielman’s “Spectral Graph Theory and its Applications” class (YALE): http://www.cs.yale.edu/homes/spielman/eigs/
[F08] Faloutsos’ “Multimedia Databases and Data Mining” class (CMU): http://www.cs.cmu.edu/~christos/courses/826.S08
For more references, see also the paper: http://www.cs.cmu.edu/~ctsourak/tsourICDM08.pdf