Estimating Clustering Coefficients and Size of Social Networks via Random Walk
Stephen J. Hardiman*
Capital Fund Management, France
Liran Katzir
Advanced Technology Labs, Microsoft Research, Israel
*Research was conducted while the author was unaffiliated.
Motivation: Social Networks
Facebook Twitter Qzone Google+
Sina Weibo
Habbo Renren
LinkedIn Vkontakte
Bebo
Tagged Orkut
Netlog
Friendster hi5
Flixster
MyLife Classmates.com
Sonico.com
Plaxo
Motivation: External access
[Figure: example graph on nodes v1 to v9.]
Social Analytics
The online social network
Disk Space
Communication
Privacy
Task: Estimate parameters
Business development/ advertisement/ market size.
Predicting Social Products’ Potential.
Global Clustering Coefficient
Network Average CC
Number of Registered Users
Global Clustering Coefficient
$\text{Global CC} = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$
[Figure: example graph on nodes v1 to v9, with one triangle and one connected triplet highlighted.]
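The definition can be checked on a small graph. The following is a minimal sketch, assuming the graph is given as a dictionary from each node to the set of its neighbours; the function name `global_cc` and the toy graph are illustrative and do not come from the talk.

```python
from itertools import combinations


def global_cc(adj):
    """Exact global clustering coefficient of an undirected simple graph.

    adj maps each node to the set of its neighbours (an assumed input format).
    """
    # A pair of neighbours of a node that are themselves linked closes a triangle there.
    # Every triangle is counted once per corner, so this total is 3 * (number of triangles).
    closed = sum(
        1
        for nbrs in adj.values()
        for a, b in combinations(nbrs, 2)
        if b in adj[a]
    )
    # A connected triplet is a centre node together with two of its neighbours.
    triplets = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    return closed / triplets if triplets else 0.0


# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(global_cc(toy))  # 3 closed triplets out of 5 connected triplets
```

This exact computation reads the whole graph, which is what the external access setting rules out.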
Global Clustering Coefficient
Exact: [Alon et al, 1997]
Estimation – input is read at least once:
• Random Access: [Avron, 2010]
• Streaming Model: [Buriol et al, 2006]
Estimation – sampling:
• Random Access: [Schank et al, 2005]
• External Access: This work.
Local Clustering Coefficient
$C_i = \frac{\#\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$
$d_i$: degree of node $v_i$
[Figure: example graph on nodes v1 to v9.]
$d_1 = 1$, $d_9 = 2$, $d_2 = 3$
$C_2 = 1/3$
Network Average CC = average local CC
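A minimal sketch of the local coefficient and its network average, again assuming a dictionary-of-sets graph. Treating nodes of degree below 2 as having coefficient 0 is a convention assumed here, not something the slides state; the toy graph is hypothetical.

```python
def local_cc(adj, v):
    """Local clustering coefficient C_v: links among v's neighbours over d(d-1)/2."""
    nbrs = list(adj[v])
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention: no pair of neighbours exists
    links = sum(
        1
        for i in range(d)
        for j in range(i + 1, d)
        if nbrs[j] in adj[nbrs[i]]
    )
    return links / (d * (d - 1) / 2)


def network_average_cc(adj):
    """Average of the local coefficients over all nodes."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)


# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(local_cc(toy, "c"), network_average_cc(toy))
```

In the toy graph node c has three neighbours with one link among them, so its coefficient is 1/3, the same value the slide reports for v2.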
Network Average CC
Exact: Naïve.
Estimation – input is read at least once:
• Streaming Model: [Becchetti et al, 2010]
Estimation – sampling:
• Random Access: [Schank et al, 2005]
• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work – Improved accuracy.
Number of Registered Users
Exact: trivial
Estimation – sampling:
• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work – Improved accuracy.
Random Walk
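[Figure: random walk on the example graph of nodes v1 to v9.]
Sampled Nodes: v1 v2 v3 v4 v5
Stationary Distribution = $\frac{d_i}{\sum_j d_j}$
Node probabilities shown in the figure: 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22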
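The walk itself only needs the neighbour list of the current node, which is what external access provides. A minimal sketch, assuming a dictionary-of-sets graph; `random_walk`, `stationary_distribution` and the toy graph are illustrative names, not part of the talk.

```python
import random


def stationary_distribution(adj):
    """Stationary probability d_i / sum_j d_j of each node under a simple random walk."""
    total = sum(len(nbrs) for nbrs in adj.values())
    return {v: len(nbrs) / total for v, nbrs in adj.items()}


def random_walk(adj, start, length, rng=None):
    """Return `length` nodes visited by a simple random walk that starts at `start`."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        # Move to a neighbour of the current node chosen uniformly at random.
        # Sorting makes a seeded walk reproducible regardless of set ordering.
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk


# Hypothetical toy graph: triangle a-b-c plus a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(stationary_distribution(toy))
print(random_walk(toy, "a", 10, random.Random(0)))
```

High-degree nodes are visited more often, which is why the estimators in the following slides reweight by degree.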
Random Walk - Summary
[Figure: the example graph after the walk. Legend: visible nodes, invisible nodes, sampled nodes, visible edges, invisible edges.]
Global CC Algorithm
1. $\Psi_g$: the sampled nodes' average of (degree − 1).
   $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.
2. $\Phi_g$: the sampled nodes' average of $\phi_k d_k$.
The estimated global clustering coefficient:
$\hat{c}_g = \frac{\Phi_g}{\Psi_g}$
$\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle.
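The two steps can be sketched as follows. This is a minimal illustration, assuming a dictionary-of-sets graph and a walk given as a list of nodes; averaging $\Phi_g$ over the interior positions only and $\Psi_g$ over all positions follows the worked example in the slides, and the function name and toy data are hypothetical.

```python
def estimate_global_cc(adj, walk):
    """Random-walk estimate of the global clustering coefficient, Phi_g / Psi_g."""
    r = len(walk)
    if r < 3:
        raise ValueError("need at least three sampled nodes")
    # Psi_g: average of (degree - 1) over all sampled nodes.
    psi = sum(len(adj[v]) - 1 for v in walk) / r
    # Phi_g: average of phi_k * d_k over interior positions, where phi_k = 1
    # when the previous and the next node of the walk are adjacent.
    phi = sum(
        len(adj[walk[k]])
        for k in range(1, r - 1)
        if walk[k + 1] in adj[walk[k - 1]]
    ) / (r - 2)
    if psi == 0:
        raise ValueError("all sampled nodes have degree 1")
    return phi / psi


# Hypothetical toy graph and a hand-picked walk along its edges.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(estimate_global_cc(toy, ["d", "c", "a", "b", "c"]))
```

Only the degree of each visited node and an adjacency test between the previous and next node are needed, so the walk never has to look beyond its own neighbourhood. On a walk this short the estimate can exceed 1; the guarantees apply to long walks.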
Global CC Example
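[Figure: the sampled walk on nodes v1 to v5 of the example graph; v6 and v7 are also shown.]
$\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$
$\Phi_g = \frac{1}{3}(0 + 2 + 0) = \frac{2}{3}$
$\Psi_g = \frac{1}{5}(0 + 2 + 1 + 3 + 1) = \frac{7}{5}$
Estimate: $\hat{c}_g = \frac{2}{3} \cdot \frac{5}{7} = \frac{10}{21} \approx 0.476$
True value: $c_g = \frac{9}{23} \approx 0.39$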
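The arithmetic of this example can be reproduced with exact fractions. The sketch below only uses the numbers on the slide: the three $\phi_k d_k$ terms (0, 2, 0) and the five degree-minus-one terms (0, 2, 1, 3, 1); the variable names are illustrative.

```python
from fractions import Fraction

phi_times_degree = [0, 2, 0]         # phi_k * d_k for k = 2, 3, 4
degree_minus_one = [0, 2, 1, 3, 1]   # d_k - 1 for the five sampled nodes

Phi_g = Fraction(sum(phi_times_degree), len(phi_times_degree))   # 2/3
Psi_g = Fraction(sum(degree_minus_one), len(degree_minus_one))   # 7/5
c_hat = Phi_g / Psi_g                                            # 10/21

print(c_hat, float(c_hat))
print(float(Fraction(9, 23)))  # the true value quoted on the slide
```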
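Expectation of $\phi_k$
$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D} E[\phi_k d_k \mid x_k = v_i]$ (total expectation)
$= \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i}{d_i d_i} \cdot d_i$ ($d_i d_i$ combinations; $2 l_i$ yield $\phi_k = 1$)
$= \sum_{i=1}^{n} \frac{2 l_i}{D}$
$l_i$: the number of triangles containing $v_i$.
$d_i$: the degree of node $v_i$.
$n$: the number of nodes.
$D = \sum_{i=1}^{n} d_i$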
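Global CC Proof
$D = \sum_{i=1}^{n} d_i$
$l_i$: the number of triangles containing $v_i$.
$d_i$: the degree of node $v_i$.
$n$: the number of nodes.
$E[\Phi_g] = E[\phi_k d_k] = \frac{2}{D} \sum_{i=1}^{n} l_i$
$E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$
$\hat{c}_g = \frac{\Phi_g}{\Psi_g} \cong \frac{E[\Phi_g]}{E[\Psi_g]}$ (by concentration bounds) $= \frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$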
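The last equality relies on two counting identities that the slide leaves implicit. They are written out below, with $T$ introduced here (it is not a symbol from the slides) for the total number of triangles: each triangle is counted at each of its three corners, and each connected triplet is a node together with an unordered pair of its neighbors.

```latex
\sum_{i=1}^{n} l_i = 3T,
\qquad
\frac{1}{2}\sum_{i=1}^{n} d_i (d_i - 1) = \#\,\text{connected triplets}
\quad\Longrightarrow\quad
\frac{2\sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)}
  = \frac{3T}{\#\,\text{connected triplets}} = c_g
```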
Guarantees
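For any $\epsilon \le \frac{1}{8}$ and $\delta \le 1$, we have
$\mathrm{Prob}\left[(1 - \epsilon)\, c_g \le \hat{c}_g \le (1 + \epsilon)\, c_g\right] \ge 1 - \delta$
when the number of samples, $r$, satisfies
$r \ge r_g = O(\text{mixing time}(\epsilon))$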
Network Average CC Algorithm
1. $\Psi_l$: the sampled nodes' average of 1/degree.
   $\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, 0 otherwise.
2. $\Phi_l$: the sampled nodes' average of $\phi_k \frac{1}{d_k - 1}$.
The estimated network average CC:
$\hat{c}_l = \frac{\Phi_l}{\Psi_l}$
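A minimal sketch of this estimator under the same assumptions as before: a dictionary-of-sets graph, a walk given as a list of nodes, $\Phi_l$ averaged over interior positions and $\Psi_l$ over all positions. The function name and toy data are hypothetical.

```python
def estimate_average_cc(adj, walk):
    """Random-walk estimate of the network average clustering coefficient, Phi_l / Psi_l."""
    r = len(walk)
    if r < 3:
        raise ValueError("need at least three sampled nodes")
    # Psi_l: average of 1/degree over all sampled nodes.
    psi = sum(1 / len(adj[v]) for v in walk) / r
    # Phi_l: average of phi_k / (d_k - 1) over interior positions.
    # phi_k = 1 requires two distinct adjacent neighbours, so d_k >= 2 there.
    phi = sum(
        1 / (len(adj[walk[k]]) - 1)
        for k in range(1, r - 1)
        if walk[k + 1] in adj[walk[k - 1]]
    ) / (r - 2)
    return phi / psi


# Hypothetical toy graph and a hand-picked walk along its edges.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(estimate_average_cc(toy, ["d", "c", "a", "b", "c"]))
```

The 1/degree weights undo the walk's bias toward high-degree nodes, so the ratio targets the plain average of the local coefficients.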
Evaluations
Network        n (size)     D/n      cl       cg
DBLP           977,987      8.457    0.7231   0.1868
Orkut          3,072,448    76.28    0.1704   0.0413
Flickr         2,173,370    20.92    0.3616   0.1076
Live Journal   4,843,953    17.69    0.3508   0.1179
DBLP facts: The paper with the most co-authors has 119 listed authors. The most prolific author is Vincent Poor, with 798 entries.
Global CC
Relative improvement ranges between 300% and 500% depending on the network.
[Figure: DBLP Network. Relative estimation value (y-axis, 0 to 3.5) against percentage of mined nodes (x-axis, 0 to 2) for Gjoka et al*, Ribeiro et al*, and This work.]
Network Average CC
Relative improvement ranges between 50% and 400% depending on the network.
[Figure: Orkut Network. Relative estimation value (y-axis, 0 to 2.5) against percentage of mined nodes (x-axis, 0 to 2) for Ribeiro et al, Gjoka et al, and Random walk.]
Conclusions
1. New external access estimator for the Global Clustering Coefficient.
2. Improved estimator for the Network Average Clustering Coefficient.
3. Improved estimator for the number of registered users.
Estimating Sizes of Social Networks via Biased Sampling
Liran Katzir
Yahoo! Labs, Haifa, Israel
Edo Liberty
Yahoo! Labs, Haifa, Israel
Oren Somekh
Yahoo! Labs, Haifa, Israel
The Birthday “Paradox”
The expected number of collisions in a list of $r$ i.i.d. uniform samples from a set of $n$ elements is $\frac{r(r-1)}{2n}$.
A collision is a pair of identical samples.
Example: Samples: X = (d, b, b, a, b, e). Total 3 collisions: (x2, x3), (x2, x5), and (x3, x5).
Cardinality estimation uniform
Needs $r = O(\sqrt{n})$ samples to converge. Used by [Ye et al, 2010] to estimate the size.
When C collisions are observed
$n \cong \frac{r(r-1)}{2C}$
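A minimal sketch of the uniform-sampling estimator, applied to the six samples of the birthday example. The function names are illustrative; counting collisions as unordered pairs of equal samples follows the definition on the slide.

```python
from collections import Counter


def count_collisions(samples):
    """Number of unordered pairs of identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def estimate_size_uniform(samples):
    """Estimate the population size from i.i.d. uniform samples: r(r-1) / (2C)."""
    r = len(samples)
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; more samples are needed")
    return r * (r - 1) / (2 * collisions)


X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X), estimate_size_uniform(X))
```

With the three collisions of the example, the estimate is 6·5/(2·3) = 5.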
Stationary distribution sampling
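[Figure: random walk on the example graph of nodes v1 to v9.]
Sampled Nodes: v5 v2 v5 v4 v2
Stationary Distribution = $\frac{d_i}{\sum_j d_j}$
Node probabilities shown in the figure: 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22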
Cardinality estimation stationary
Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \mathrm{zipf}(n, 2)$.
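When C collisions are observed
$n \cong \frac{\left(\sum d_x\right) \left(\sum \frac{1}{d_x}\right)}{2C}$
Example:
[Figure: example graph on nodes v1 to v9.]
Sampled nodes: v5 v2 v5 v4 v2
$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14$
$\sum \frac{1}{d_x} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$
$\hat{n} = \frac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$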
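The example can be reproduced in a few lines. This is a minimal sketch, assuming the sampled nodes and a lookup of their degrees are available; the degrees of v2, v4 and v5 are the ones used in the example, and the function names are illustrative.

```python
from collections import Counter


def count_collisions(samples):
    """Number of unordered pairs of identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def estimate_size_stationary(samples, degree):
    """Size estimate from degree-biased samples: (sum d_x)(sum 1/d_x) / (2C)."""
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; more samples are needed")
    sum_deg = sum(degree[x] for x in samples)
    sum_inv_deg = sum(1 / degree[x] for x in samples)
    return sum_deg * sum_inv_deg / (2 * collisions)


degree = {"v2": 3, "v4": 4, "v5": 2}        # degrees taken from the example
samples = ["v5", "v2", "v5", "v4", "v2"]    # two collisions: v5 twice, v2 twice
print(estimate_size_stationary(samples, degree))
```

The degree sums correct for the fact that high-degree nodes collide more often than a uniform sample would.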
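Size Estimation Proof
$D = \sum_{i=1}^{n} d_i$
$d_i$: the degree of node $v_i$.
$n$: the number of nodes.
$E[d_x] = \sum_{i=1}^{n} \frac{d_i}{D} d_i$
$E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{1}{d_i} = \frac{n}{D}$
$E[C] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}$ (per pair of samples)
$\hat{n} = \frac{\left(\sum d_x\right) \left(\sum \frac{1}{d_x}\right)}{2C} \cong \frac{E\left[\sum d_x\right] E\left[\sum \frac{1}{d_x}\right]}{2 E[C]}$ (by concentration bounds) $\cong \frac{\left(\sum_{i=1}^{n} \frac{d_i}{D} d_i\right) \frac{n}{D}}{\sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}} = n$
The expectations above are per sample and per pair of samples. With $r$ samples each sum scales by $r$ and $2C$ by $r(r-1)$, so these factors cancel up to $r/(r-1)$.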
Improvements
1. Using all samples (Hardiman et al 2009).
2. Using Conditional Monte Carlo (This work).
All Samples
Restrict computation to indexes at least m steps apart: $I = \{(k, l) : |k - l| \ge m\}$
A collision is only considered within $I$: $\Phi = \sum_{(k,l) \in I} 1_{x_k = x_l}$
The ratio of degrees is similarly defined:
$\Psi = \sum_{(k,l) \in I} \frac{d_{x_k}}{d_{x_l}}$
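A minimal sketch of the two sums. Combining them as $\Psi / \Phi$ is an assumption made here by analogy with the collision formula of the earlier slides (both sums run over ordered pairs, so the factor 2 is absorbed); the slide itself does not spell out the final estimator. The function name is illustrative, and the degrees and walk are reused from the earlier example.

```python
def estimate_size_all_samples(walk, degree, m):
    """Size estimate Psi / Phi over ordered index pairs at least m steps apart."""
    r = len(walk)
    phi = 0      # collisions x_k == x_l within the index set I
    psi = 0.0    # degree ratios d_{x_k} / d_{x_l} within the index set I
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                if walk[k] == walk[l]:
                    phi += 1
                psi += degree[walk[k]] / degree[walk[l]]
    if phi == 0:
        raise ValueError("no collisions within I; a longer walk is needed")
    return psi / phi


degree = {"v2": 3, "v4": 4, "v5": 2}      # degrees taken from the earlier example
walk = ["v5", "v2", "v5", "v4", "v2"]
print(estimate_size_all_samples(walk, degree, m=2))
```

The gap m is meant to make the paired samples close to independent, so that every step of the walk can contribute rather than only a thinned subsequence. The double loop is quadratic in the walk length and is kept only for clarity.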
Conditional Monte Carlo
A collision between $x_k$ and $x_l$ is replaced by the conditional collision at steps $k+1$ and $l+1$, respectively.
$E\left[1_{x_{k+1} = x_{l+1}} \mid x_k, x_l\right] = \frac{\#\text{common neighbors of } x_k \text{ and } x_l}{d_{x_k} d_{x_l}}$
Conditional Monte Carlo
• The pair $(v_4, v_7)$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.
[Figure: example graph on nodes v1 to v9.]
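The conditional collision term is a one-line computation once the neighbor sets of the two sampled nodes are known. A minimal sketch, assuming a dictionary-of-sets graph; the nodes `p` and `q` below are hypothetical and only mimic the degrees 4 and 3 with one common neighbor that produce the 1/12 of the example.

```python
def conditional_collision(adj, u, v):
    """Probability that independent next steps from u and from v land on the same node."""
    common = len(adj[u] & adj[v])
    return common / (len(adj[u]) * len(adj[v]))


# Hypothetical neighbourhoods: p has degree 4, q has degree 3, and they share node c.
adj = {
    "p": {"c", "p1", "p2", "p3"},
    "q": {"c", "q1", "q2"},
}
print(conditional_collision(adj, "p", "q"))
```

Every pair with at least one common neighbor now adds a fractional amount to the collision counter, instead of only exact repeats adding 1.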
Size Estimation
[Figure: DBLP Network. Relative estimation value (y-axis, 0 to 2.5) against percentage of mined nodes (x-axis, 0.5 to 2.5) for Prior art and This work.]
Thanks