
Estimating Clustering Coefficients and Size of Social Networks via Random Walk

Stephen J. Hardiman*

Capital Fund Management

France

Liran Katzir

Advanced Technology Labs Microsoft Research, Israel

*Research was conducted while the author was unaffiliated

Motivation: Social Networks

Facebook Twitter Qzone Google+

Sina Weibo

Habbo Renren

LinkedIn Vkontakte

Bebo

Tagged Orkut

Netlog

Friendster hi5

Flixster

MyLife Classmates.com

Sonico.com

Plaxo

Motivation: External access

[Figure: example graph with nodes v1 to v9]

Social Analytics

The online social network

Disk Space

Communication

Privacy

Task: Estimate parameters

Business development/ advertisement/ market size.

Predicting Social Products’ Potential.

Global Clustering Coefficient

Network Average

CC

Number of Registered

Users

Global CC = $\frac{3 \times \text{number of triangles}}{\text{number of connected triplets}}$


[Figure: example graph with nodes v1 to v9, highlighting a triangle and a connected triplet]
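The definition above can be checked by brute force on a small graph. The following is a minimal sketch, assuming the graph is stored as a dictionary of neighbor sets; the function name `global_cc` and the 4-node graph are illustrative and not taken from the slides.

```python
from itertools import combinations

def global_cc(adj):
    """Return 3 * (#triangles) / (#connected triplets) for an undirected graph."""
    # Count each triangle once, as an unordered triple of mutually adjacent nodes.
    triangles = sum(
        1
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    )
    # A connected triplet is a pair of neighbors around a center node.
    triplets = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    return 3 * triangles / triplets

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(global_cc(adj))  # 1 triangle, 5 connected triplets: 3 * 1 / 5 = 0.6
```

This exact computation needs the whole graph, which is precisely what external access does not provide.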


Exact: [Alon et al, 1997]

Estimation – input is read at least once:

• Random Access: [Avron, 2010]

• Streaming Model: [Buriol et al, 2006]

Estimation – sampling:

• Random Access: [Schank et al, 2005]

• External Access: This work.

Local Clustering Coefficient

$C_i = \frac{\#\,\text{connections between } v_i\text{'s neighbors}}{d_i (d_i - 1)/2}$

$d_i$: degree of node $i$

[Figure: example graph with nodes v1 to v9]

$d_1 = 1$, $d_9 = 2$, $d_2 = 3$

$C_2 = 1/3$

Network Average CC = average local CC
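A minimal sketch of the local coefficient and its network average, assuming a dictionary-of-sets graph. The convention that nodes of degree below 2 get a coefficient of 0 is an assumption of this sketch, and the graph and function names are illustrative.

```python
def local_cc(adj, v):
    """C_v = (#edges among v's neighbors) / (d_v (d_v - 1) / 2)."""
    nbrs = list(adj[v])
    d = len(nbrs)
    if d < 2:
        return 0.0  # assumed convention: no neighbor pairs, so coefficient 0
    links = sum(
        1
        for i in range(d)
        for j in range(i + 1, d)
        if nbrs[j] in adj[nbrs[i]]
    )
    return links / (d * (d - 1) / 2)

def network_average_cc(adj):
    """Average of the local coefficients over all nodes."""
    return sum(local_cc(adj, v) for v in adj) / len(adj)

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(local_cc(adj, "c"))       # one link (a-b) among three neighbors: 1/3
print(network_average_cc(adj))  # (1 + 1 + 1/3 + 0) / 4 = 7/12, about 0.583
```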

Network Average CC

Exact: Naïve.

Estimation – input is read at least once:

• Streaming Model: [Becchetti et al, 2010]


• Random Access: [Schank et al, 2005]

• External Access: [Ribeiro et al 2010], [Gjoka et al, 2010], This work – Improved accuracy.

Number of Registered Users

Exact: trivial


• External Access: [Hardiman et al 2009], [Katzir et al, 2011], This work – Improved accuracy.

Random Walk

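[Figure: random walk on the example graph with nodes v1 to v9; each node is labeled with its stationary probability: 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22]

Sampled nodes: v1, v2, v3, v4, v5

Stationary distribution: $\pi_i = \frac{d_i}{\sum_{j=1}^{n} d_j}$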
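The claim that a simple random walk visits node $v_i$ with probability proportional to $d_i$ can be verified exactly on a toy graph. This is a minimal sketch: the 4-node graph and the helper names `stationary`, `step_distribution`, and `random_walk` are assumptions made for illustration.

```python
import random
from fractions import Fraction

def stationary(adj):
    """Degree-proportional distribution d_i / D, with D the sum of degrees."""
    D = sum(len(nbrs) for nbrs in adj.values())
    return {v: Fraction(len(nbrs), D) for v, nbrs in adj.items()}

def step_distribution(adj, dist):
    """Distribution after one random-walk step when starting from `dist`."""
    out = {v: Fraction(0) for v in adj}
    for v, p in dist.items():
        for u in adj[v]:
            out[u] += p / len(adj[v])  # move to each neighbor with prob 1/d_v
    return out

def random_walk(adj, start, r, rng):
    """Return r sampled nodes, each a uniformly chosen neighbor of the last."""
    walk = [start]
    while len(walk) < r:
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk

# Hypothetical graph: triangle a-b-c plus a pendant edge c-d.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
pi = stationary(adj)
print(step_distribution(adj, pi) == pi)  # True: one step leaves pi unchanged
print(random_walk(adj, "a", 6, random.Random(0)))
```

Only the neighbor list of the current node is needed at each step, which matches the external access setting.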

Random Walk - Summary

[Figure: example graph with nodes v1 to v9, distinguishing visible nodes, invisible nodes, and sampled nodes, as well as visible edges and invisible edges]

Global CC Algorithm

1. $\Psi_g$: the average of $d_k - 1$ over the sampled nodes.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, and $0$ otherwise.

2. $\Phi_g$: the average of $\phi_k d_k$ over the sampled nodes.

The estimated global clustering coefficient:

$\hat{c}_g = \frac{\Phi_g}{\Psi_g}$

$\phi_k = 1$ iff $v_{k-1}, v_k, v_{k+1}$ is a triangle.
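The two steps can be sketched in a few lines. This is an illustrative implementation, not the authors' code: it assumes a dictionary-of-sets graph, averages $\phi_k d_k$ over the $r-2$ interior positions of the walk and $d_k - 1$ over all $r$ positions, and uses a hypothetical 7-node graph constructed so that a 5-step walk has degrees 1, 3, 2, 4, 2.

```python
def estimate_global_cc(adj, walk):
    """Estimate the global clustering coefficient as Phi_g / Psi_g."""
    r = len(walk)
    deg = [len(adj[x]) for x in walk]
    # Psi_g: average of (d_k - 1) over all sampled nodes.
    psi = sum(d - 1 for d in deg) / r
    # Phi_g: average of phi_k * d_k over interior positions, where phi_k = 1
    # exactly when the previous and next nodes of the walk are adjacent.
    phi = sum(
        deg[k]
        for k in range(1, r - 1)
        if walk[k + 1] in adj[walk[k - 1]]
    ) / (r - 2)
    return phi / psi

# Hypothetical graph (an assumption, not the graph drawn on the slides).
adj = {
    "v1": {"v2"},
    "v2": {"v1", "v3", "v4"},
    "v3": {"v2", "v4"},
    "v4": {"v2", "v3", "v5", "v6"},
    "v5": {"v4", "v7"},
    "v6": {"v4"},
    "v7": {"v5"},
}
walk = ["v1", "v2", "v3", "v4", "v5"]
print(estimate_global_cc(adj, walk))  # (2/3) / (7/5) = 10/21, about 0.476
```

Testing whether $v_{k-1}$ and $v_{k+1}$ are adjacent uses only neighbor lists already fetched during the walk, so no extra queries are needed.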

Global CC Example

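[Figure: random walk on an example graph with nodes v1 to v7]

$\phi_2 = 0$, $\phi_3 = 1$, $\phi_4 = 0$

$\Phi_g = \frac{1}{3}(0 + 2 + 0) = \frac{2}{3}$

$\Psi_g = \frac{1}{5}(0 + 2 + 1 + 3 + 1) = \frac{7}{5}$

Estimate: $\hat{c}_g = \frac{2}{3} \cdot \frac{5}{7} = \frac{10}{21} \approx 0.476$

Exact value: $c_g = \frac{9}{23} \approx 0.39$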

Expectation of 𝝓𝒌

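$E[\phi_k d_k] = \sum_{i=1}^{n} \frac{d_i}{D}\, E[\phi_k d_k \mid x_k = v_i]$ (total expectation)

$= \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{2 l_i d_i}{d_i d_i}$ (of the $d_i d_i$ combinations of $(x_{k-1}, x_{k+1})$, $2 l_i$ yield $\phi_k = 1$)

$= \sum_{i=1}^{n} \frac{2 l_i}{D}$

$l_i$: the number of triangles containing $v_i$.

$d_i$: the degree of node $v_i$.

$n$: the number of nodes.

$D = \sum_{i=1}^{n} d_i$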

Global CC Proof

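$D = \sum_{i=1}^{n} d_i$

$l_i$: the number of triangles containing $v_i$.

$E[\Phi_g] = E[\phi_k d_k] = \frac{2}{D} \sum_{i=1}^{n} l_i$

$E[\Psi_g] = \frac{1}{D} \sum_{i=1}^{n} d_i (d_i - 1)$

$\hat{c}_g = \frac{\Phi_g}{\Psi_g} \cong \frac{E[\Phi_g]}{E[\Psi_g]}$ (by concentration bounds on both $\Phi_g$ and $\Psi_g$)

$= \frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)} = c_g$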
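The final equality with $c_g$ is stated without comment on the slide. A short justification follows, where $T$ (the number of triangles) and $W$ (the number of connected triplets) are symbols introduced here for convenience and do not appear in the slides:

```latex
\begin{align*}
\sum_{i=1}^{n} l_i &= 3T
  && \text{(each triangle is counted once at each of its three nodes)} \\
\sum_{i=1}^{n} \frac{d_i (d_i - 1)}{2} &= W
  && \text{(a connected triplet is a pair of neighbors of its center node)} \\
\frac{2 \sum_{i=1}^{n} l_i}{\sum_{i=1}^{n} d_i (d_i - 1)}
  &= \frac{2 \cdot 3T}{2W} = \frac{3T}{W} = c_g
\end{align*}
```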

Guarantees

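For any $\epsilon \le \frac{1}{8}$ and $\delta \le 1$, we have

$\mathrm{Prob}\left[(1 - \epsilon)\, c_g \le \hat{c}_g \le (1 + \epsilon)\, c_g\right] \ge 1 - \delta$

when the number of samples, $r$, satisfies

$r \ge r_g = O(\text{mixing time}(\epsilon))$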

Network Average CC Algorithm

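1. $\Psi_l$: the average of $1/d_k$ over the sampled nodes.

$\phi_k = 1$ if there is an edge $v_{k-1} - v_{k+1}$, and $0$ otherwise.

2. $\Phi_l$: the average of $\phi_k \frac{1}{d_k - 1}$ over the sampled nodes.

The estimated network average CC:

$\hat{c}_l = \frac{\Phi_l}{\Psi_l}$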
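A minimal sketch of this estimator, under the same assumptions as for the global one: a dictionary-of-sets graph, $\Phi_l$ averaged over the $r-2$ interior positions of the walk, and $\Psi_l$ averaged over all $r$ positions. The graph and the function name are illustrative, not taken from the slides.

```python
def estimate_average_cc(adj, walk):
    """Estimate the network average clustering coefficient as Phi_l / Psi_l."""
    r = len(walk)
    deg = [len(adj[x]) for x in walk]
    # Psi_l: average of 1/d_k over all sampled nodes.
    psi = sum(1 / d for d in deg) / r
    # Phi_l: average of phi_k / (d_k - 1) over interior positions. When
    # phi_k = 1 the node has two distinct neighbors, so d_k - 1 >= 1.
    phi = sum(
        1 / (deg[k] - 1)
        for k in range(1, r - 1)
        if walk[k + 1] in adj[walk[k - 1]]
    ) / (r - 2)
    return phi / psi

# Hypothetical graph (an assumption, not the graph drawn on the slides).
adj = {
    "v1": {"v2"},
    "v2": {"v1", "v3", "v4"},
    "v3": {"v2", "v4"},
    "v4": {"v2", "v3", "v5", "v6"},
    "v5": {"v4", "v7"},
    "v6": {"v4"},
    "v7": {"v5"},
}
walk = ["v1", "v2", "v3", "v4", "v5"]
print(estimate_average_cc(adj, walk))  # (1/3) / (31/60) = 20/31, about 0.645
```

The weights $1/d_k$ undo the degree bias of the walk: in expectation $\Psi_l$ is $n/D$ and $\Phi_l$ is the sum of local coefficients divided by $D$, so the ratio targets the average local coefficient.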

Evaluations

Network        n (size)    D/n      cl       cg
DBLP             977,987    8.457   0.7231   0.1868
Orkut          3,072,448   76.28    0.1704   0.0413
Flickr         2,173,370   20.92    0.3616   0.1076
Live Journal   4,843,953   17.69    0.3508   0.1179

DBLP facts: The paper with the most co-authors has 119 listed authors. The most prolific author is Vincent Poor, with 798 entries.

Global CC

Relative improvement ranges between 300% and 500% depending on the network.

[Figure: DBLP Network. Relative estimation value (y-axis, 0 to 3.5) versus percentage of mined nodes (x-axis, 0 to 2), comparing Gjoka et al*, Ribeiro et al*, and this work]

Network Average CC

Relative improvement ranges between 50% and 400% depending on the network.

[Figure: Orkut Network. Relative estimation value (y-axis, 0 to 2.5; x-axis, 0 to 2), comparing Ribeiro et al, Gjoka et al, and random walk]

Conclusions

1. New external access estimator for the Global Clustering Coefficient.

2. Improved estimator for Network Average Clustering Coefficient.

3. Improved estimator for number of registered users.

Estimating Sizes of Social Networks via Biased Sampling

Liran Katzir

Yahoo! Labs, Haifa, Israel

Edo Liberty


Oren Somekh


The Birthday “Paradox”

A collision is a pair of identical samples.

The expected number of collisions in a list of $r$ i.i.d. uniform samples from a set of $n$ elements is $\frac{r(r-1)}{2n}$.

Example: Samples: X = (d, b, b, a, b, e). Total of 3 collisions: (x2, x3), (x2, x5), and (x3, x5).
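Counting collisions does not require comparing every pair: a value seen $c$ times contributes $c(c-1)/2$ pairs. A minimal sketch, with function names chosen here for illustration:

```python
from collections import Counter

def count_collisions(samples):
    """Number of unordered pairs of positions holding identical samples."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())

def expected_collisions(r, n):
    """Expected collisions among r i.i.d. uniform samples from n elements."""
    return r * (r - 1) / (2 * n)

X = ["d", "b", "b", "a", "b", "e"]
print(count_collisions(X))        # 3, since b occurs three times
print(expected_collisions(6, 4))  # 6 * 5 / (2 * 4) = 3.75
```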

Cardinality estimation uniform

Needs $r = O(\sqrt{n})$ samples to converge. Used by [Ye et al, 2010] to estimate the size.

When $C$ collisions are observed:

$n \cong \frac{r(r-1)}{2C}$
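Inverting the expected collision count gives the estimator directly. The sketch below is illustrative; raising an error when no collision is observed is a design choice of this example, since the formula is undefined for $C = 0$.

```python
from collections import Counter

def estimate_size_uniform(samples):
    """Estimate n as r (r - 1) / (2 C) from uniform i.i.d. samples."""
    r = len(samples)
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    return r * (r - 1) / (2 * collisions)

X = ["d", "b", "b", "a", "b", "e"]
print(estimate_size_uniform(X))  # 6 * 5 / (2 * 3) = 5.0
```

With so few samples the estimate is very noisy; the example only shows the arithmetic.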

Stationary distribution sampling

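[Figure: random walk on the example graph with nodes v1 to v9; each node is labeled with its stationary probability: 1/22, 3/22, 2/22, 2/22, 3/22, 2/22, 3/22, 4/22, 2/22]

Sampled nodes: v5, v2, v5, v4, v2

Stationary distribution: $\pi_i = \frac{d_i}{\sum_{j=1}^{n} d_j}$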

Cardinality estimation stationary

Needs $r = O(\sqrt[4]{n} \log n)$ samples to converge when $d_i \sim \mathrm{zipf}(\sqrt{n}, 2)$.

When $C$ collisions are observed:

$n \cong \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C}$

Example:

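[Figure: example graph with nodes v1 to v9]

Sampled nodes: v5, v2, v5, v4, v2 (two collisions, so $C = 2$)

$\sum d_x = 2 + 3 + 2 + 4 + 3 = 14$

$\sum \frac{1}{d_x} = \frac{1}{2} + \frac{1}{3} + \frac{1}{2} + \frac{1}{4} + \frac{1}{3} = \frac{23}{12}$

$\hat{n} = \frac{14 \cdot \frac{23}{12}}{2 \cdot 2} \approx 6.7$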
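The same arithmetic can be reproduced exactly with rational numbers. In this minimal sketch the degrees of v2, v4, and v5 are read off the sums on the slide, while the function name and the error handling are assumptions of the example.

```python
from collections import Counter
from fractions import Fraction

def estimate_size_stationary(samples, degree):
    """Estimate n as (sum of d_x) * (sum of 1/d_x) / (2 C)."""
    collisions = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    if collisions == 0:
        raise ValueError("no collisions observed; draw more samples")
    sum_d = sum(degree[x] for x in samples)
    sum_inv_d = sum(Fraction(1, degree[x]) for x in samples)
    return sum_d * sum_inv_d / (2 * collisions)

degree = {"v2": 3, "v4": 4, "v5": 2}
samples = ["v5", "v2", "v5", "v4", "v2"]
n_hat = estimate_size_stationary(samples, degree)
print(n_hat, float(n_hat))  # 161/24, about 6.7
```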

Global CC Proof

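$D = \sum_{i=1}^{n} d_i$

$E[d_x] = \sum_{i=1}^{n} \frac{d_i}{D}\, d_i$

$E\left[\frac{1}{d_x}\right] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{1}{d_i} = \frac{n}{D}$

$E[C] = \sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}$

$\hat{n} = \frac{\left(\sum d_x\right)\left(\sum \frac{1}{d_x}\right)}{2C} \cong \frac{E\left[\sum d_x\right] E\left[\sum \frac{1}{d_x}\right]}{2E[C]}$ (by concentration bounds on the numerator and on $C$)

$\cong \frac{\left(\sum_{i=1}^{n} \frac{d_i}{D}\, d_i\right) \frac{n}{D}}{\sum_{i=1}^{n} \frac{d_i}{D} \cdot \frac{d_i}{D}} = n$

The expectations of $d_x$ and $1/d_x$ are per sample, and $E[C]$ is per pair of samples. With $r$ samples the numerator gains a factor $r^2$ and $2E[C]$ a factor $r(r-1)$, which cancel for large $r$.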

Improvements

1. Using all samples (Hardiman et al 2009).

2. Using Conditional Monte Carlo (This work).

All Samples

Restrict the computation to index pairs at least $m$ steps apart: $I = \{(k, l) : |k - l| \ge m\}$.

A collision is only considered within $I$: $\Phi = \left|\{(k, l) \in I : x_k = x_l\}\right|$

The ratio of degrees is similarly defined:

$\Psi = \sum_{(k, l) \in I} \frac{d_{x_k}}{d_{x_l}}$
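A minimal sketch of the two sums. It assumes that $I$ contains ordered pairs and that the size estimate is the ratio $\Psi / \Phi$; neither is stated explicitly on the slide, so both should be read as assumptions of this example. The walk and degrees reuse the earlier five-sample example, and `m = 2` is chosen only to keep the arithmetic small.

```python
from fractions import Fraction

def all_samples_sums(walk, degree, m):
    """Return (Phi, Psi) over ordered index pairs (k, l) with |k - l| >= m."""
    r = len(walk)
    phi = 0
    psi = Fraction(0)
    for k in range(r):
        for l in range(r):
            if abs(k - l) >= m:
                phi += int(walk[k] == walk[l])  # collision indicator
                psi += Fraction(degree[walk[k]], degree[walk[l]])
    return phi, psi

degree = {"v2": 3, "v4": 4, "v5": 2}
walk = ["v5", "v2", "v5", "v4", "v2"]
phi, psi = all_samples_sums(walk, degree, m=2)
print(phi, psi)   # 4 and 155/12
print(psi / phi)  # assumed estimate Psi / Phi = 155/48
```

In practice $m$ would be chosen large enough that samples $m$ steps apart are close to independent; the quadratic loop is kept for clarity rather than speed.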

Conditional Monte Carlo

A collision between $x_k$ and $x_l$ is replaced by the conditional collision at steps $k+1$ and $l+1$, respectively.

$E\left[1_{x_{k+1} = x_{l+1}} \mid x_k, x_l\right] = \frac{\#\,\text{common neighbors of } x_k \text{ and } x_l}{d_{x_k} d_{x_l}}$

Conditional Monte Carlo

• The pair $v_4, v_7$ is not a collision, but it contributes $\frac{1}{12}$ to the collision counter.

[Figure: example graph with nodes v1 to v9]
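The conditional contribution of a pair is the probability that two independent next steps land on the same node. A minimal sketch follows; the edge list is hypothetical, chosen so that v4 has degree 4, v7 has degree 3, and they share one neighbor, which reproduces the 1/12 quoted above.

```python
from collections import defaultdict
from fractions import Fraction

def conditional_collision(adj, u, v):
    """P(next step from u equals next step from v) = |N(u) & N(v)| / (d_u d_v)."""
    common = len(adj[u] & adj[v])
    return Fraction(common, len(adj[u]) * len(adj[v]))

# Hypothetical edges around v4 and v7 (not the full graph from the slides).
edges = [
    ("v4", "v2"), ("v4", "v3"), ("v4", "v5"), ("v4", "v8"),
    ("v7", "v5"), ("v7", "v6"), ("v7", "v9"),
]
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

print(conditional_collision(adj, "v4", "v7"))  # 1/12
```

Because almost every pair of sampled nodes contributes a small positive amount, the counter is far less sparse than a count of exact collisions.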

Size Estimation

[Figure: DBLP Network. Relative estimation value (y-axis, 0 to 2.5; x-axis, 0.5 to 2.5), comparing prior art and this work]

Thanks
