569: Sampling Graphs (2013)jlu.myweb.cs.uwindsor.ca/569/569graphSampling2013.pdf · Lots of...

JianguoLu

Introduction:Sampling

SizeestimationBiascorrection

UniformSampling

AverageDegree

WeiboSampling

569: Sampling Graphs (2013)

Jianguo Lu

University of Windsor

November 18, 2013

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Outline

1 Introduction: Sampling

2 Size estimationBias correctionUniform Sampling

3 Average Degree

4 Weibo Sampling

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Why sampling

Applications�� DeepWeb

�� OSN�� SouceCode

�� SemanticWeb|

|Properties

�� Size�� Distribution

�� Ranking�� Community

�� Diameter�� CluterteringCoefficient

We need to estimate the properties for two reasons:

Data in its entirety is not available (e.g., Facebook), or without central control (e.g., WWW),or evolving.Data is big. Quadratic algorithms are not feasible.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Deep web graph model

d1

q7

d2

q1 q3 q6

d3 d4 d5

q2q4

d6d7 d8d9

q5

A: bipartite graph

d1

q7

d2

q1

q3

q6

d3

d4

d5

q2

q4

d6

d7

d8

d9

q5

B: same graph as (A) in spring model layout

Figure: Hidden data source as a bipartite graph

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

What to sample for

data size;

distributions;

clustering coefficient;

communities;

influential bloggers (degree, pageRank, Katz centralities, etc.)

Outliers (spammers, zombies, inflated followers)

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

How to sample

22

1

6

3

3

53 1

4

1

1

2

1

3

53

4

Original graph Random Node

6

3

3

53

2

6

3

3

5

Random Edge Random Walk

Sample by random node;Sample by random edge;Sample by random walk;Combinations and modifications. e.g., random walk with uniform random restart.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

What is different from traditional sampling

Most of the networks are scale-free. Degrees have very large variance. Uniformrandom sampling does not work.

Precise sampling: quantities are digitalized, making the sampling processprecise. e.g., know the exact degree, and can choose uniformly at random fromthe neighbouring nodes. Only possible for ONLINE social networks not real socialnetworks.

Access interface: provide interface APIs, many options. Can design new samplingschemes using APIs.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Size estimation

Applications

Size of web, search engines

Size of Online Social Network (Twitter, Weibo)

Number of bugs in programs

...

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Capture-Recapture Method

The estimator often usedWhen all the elements have equal probability of being sampled,

N =n1n2

d(1)

where n1 and n2 are the number of samples in two capture occasions, d is theduplicates.

Lawrence and C. Giles. Searching the world wide web. Science,280(5360):98-100, 1998.

A. Broder and et al. Estimating corpus size via queries. In CIKM, pages 594-603.ACM, 2006.

L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks viabiased sampling. In WWW, pages 597-606. ACM, 2011.

Petersson et al., Capture–recapture in software inspections after 10 yearsresearch—-theory, evaluation and application, Journal of Systems and Software,2004.

Lots of research on obtaining uniform random sampling, using methods such asMetropolis-Hasting Random Walk

Bar-Yossef et al. Random sampling from a search engine’s index, JACM 2008.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Multiple Capture-Recapture

Equal sampling probability

N =n2

2C(2)

where n is total number of sampled elements, C is the number of collisions

Unequal sampling probability

N =n2

2CΓ (3)

where Γ is the normalized variance of the degrees of the graph

How large is Γ?

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Γ in various datasets

Graph N(×103) γ or√

Γ− 1 Φ(×10−5)WikiTalk [?] 2,388 26.32 2,700

BerkStan [?] 654 14.51 5.3EmailEu [?] 224 13.66 13Stanford [?] 255 11.51 5.8

Skitter [?] 1,694 10.46 56Youtube [?] 1,134 9.64 440

NotreDame [?] 325 6.40 9.4Gowalla [?] 196 5.54 1,200Epinion [?] 75 4.02 610Google [?] 855 4.00 62

Slashdot [?] 82 3.35 1,900Facebook [?] 2,937 3.14 590

Flickr [?] 105 2.64 68IMDB [?] 374 2.05 130DBLP [?] 511 1.61 560

Amazon [?] 410 1.27 98Gnutella [?] 62 1.21 9,100

CitePatents [?] 3,764 1.20 1,100

Table: Statistics of the 18 real-world graphs, sorted in descending order of the coefficient of degreevariation γ. Φ is the conductance.

For Twitter data, Γ ≈ 1300.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Outline



3 Average Degree

4 Weibo Sampling

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Bias Correction

TheoremThe relative bias of N̂ can be approximated by the reciprocal of E(C), i.e.,

RB ≈1

E(C)(4)

Jianguo Lu, Dingding Li, Bias Correction in Small Sample from Big Data, TKDE, 2013.

5000 5500 6000 6500 7000 7500 8000 8500 9000 9500 100000.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.1

0.11

sample size n

mean e

st N

1/E(C)RB

Figure: RB and 1/E(C) against sample sizes in simulation study. It shows that N̂ is biased upwards,and the relative bias can be approximated by the reciprocal of E(C).

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Our bias corrected estimators

N =n2

2(C + 1)(5)

N =n2

2(C + 1)Γ (6)

5000 5500 6000 6500 7000 7500 8000 8500 9000 9500 100000.98

1

1.02

1.04

1.06

1.08

1.1

1.12x 10

6

sample size n

mean e

st N

hat Nhat N*

Figure: N̂ and N̂S over 104 runs for various sample size in simulation study. Red dotted line is thetrue value.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Twitter data

N=41million.

0 500 1000 1500 2000 2500 3000 3500 40000

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.161/E(C)RB

0 500 1000 1500 2000 2500 3000 3500 40004.1

4.2

4.3

4.4

4.5

4.6

4.7

4.8x 10

7

sample size n

mean e

st N

hat Nhat N*

Expected bias Corrected estimation

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Outline



3 Average Degree

4 Weibo Sampling

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Uniform Random Sampling not Recommended

Lemma (Variance of N̂N )

The estimated variance of RN estimator N̂N is

v̂ar(N̂N ) ≈N2

E(C)≈

2N3

n2(7)

Lemma (Variance of N̂E )

The estimated variance of RE estimator N̂E is

v̂ar(N̂E ) ≈2N3

n2Γ

(1 +

nΓCV 2(Γ)

2N

), (8)

where CV (Γ) is the coefficient of variation of Γ.

Theorem (RN vs. RE)

To achieve the same variance of N̂E , N̂N needs to use at most√

Γ times moresamples.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Estimated vs. observed

0.05

0.1

0.15

0.2

RSE

(1)WikiTalk

0.05

0.1

0.15

RSE

(7)NotreDame

10 15 200.05

0.1

0.15

RSE

Sample Size

(13)Flickr

0.05

0.1

0.15

0.2(2)BerkStan

0.05

0.1

0.15

(8)Gowalla

10 15 200.05

0.1

0.15

Sample Size

(14)IMDB

0.05

0.1

0.15

0.2(3)EmailEu

0.05

0.1

0.15

(9)Epinions

10 15 200.05

0.1

0.15

Sample Size

(15)DBLP

0.05

0.1

0.15

0.2(4)Stanford

0.05

0.1

0.15

(10)Google

10 15 200.05

0.1

0.15

Sample Size

(16)Amazon

0.05

0.1

0.15

0.2(5)Skitter

0.05

0.1

0.15

(11)Slashdot

10 15 200.05

0.1

0.15

Sample Size

(17)Gnutella

0.05

0.1

0.15

0.2(6)Youtube

0.05

0.1

0.15

(12)Facebook

10 15 200.05

0.1

0.15

Sample Size

(18)Citation

Figure: Estimated and observed RSE of RE sampling with the growth of sample size over 18datasets. The sample size ranges between 10×

√2N/Γ and 20×

√2N/Γ, i.e., the expected

collisions are between 100 and 400.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

RN and RE Sampling on Facebook Data

10 15 200

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.1

Sample Size (×√

2N)

RS

E

Observed RN

Estimated RN

Observed RE

Estimated RE

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Comparison of RN, RE, and RW Sampling

Figure: Comparison of three sampling methods. The sample size n =√

2NC where√

C = 10. It showsthat for RN sampling (red solid bars), the relative standard error is equal to 1/

√C = 0.1 across all the

datasets. RE sampling is consistently smaller than RN sampling.RW sampling can approximate REsampling for some datasets. For NotreDame etc. that have low conductance, RW is grossly wrong.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Why RW Varies

100~+∞

10~992~91

Figure: Subgraphs obtained by RW sampling from Flickr, EmailEu, Stanford and Youtube. Eachsubgraph contains 60,000 nodes. Node colour represents its degree in the original graph.Green=1; Blue=2 ∼ 9; Orange= 10∼99; Red=100∼ ∞.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Sampling for average degree

Average degree is an important metrics in any network

In and out average degrees in Weibo are different.

Naive method–arithmetic sample mean

Problem– Variance is too large because of the power law

Solution– Use PPS (RE) sampling and harmonic mean estimator

On Twitter, PPS sampling can be hundreds of times better

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Average Degree Estimation

〈̂d〉RN =1n

n∑i=1

dxi (9)

〈̂d〉RE = 〈̂d〉RW = n

[ n∑i=1

1dxi

]−1

(10)

TheoremSuppose the degrees follow Zipf’s law with exponent one, i.e., di = A

α+i . The varianceof the random node estimator is

var(〈̂d〉RN ) ≈〈d〉2

n

(N[

(α+ 1) ln2 N + α

1 + α

]−1− 1

). (11)

var(〈̂d〉RE ) ≈〈d〉2

n

(12

lnN + α

1 + α− 1). (12)

Main conclusionRE is better than RN in many cases; RW depends on graph conductance.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Degree distribution

100

101

102

103

104

105

106

107

100

101

102

103

104

105

106

107

Fre

qu

en

cy

(1)Twitter

100

101

102

103

104

105

106

107

100

101

102

103

104

105

106

(2)WikiTalk

100

101

102

103

104

105

100

101

102

103

104

105

(3)BerkStan

100

101

102

103

104

105

106

100

101

102

103

104

(4)EmailEu

100

101

102

103

104

105

100

101

102

103

104

105

(5)Stanford

100

101

102

103

104

105

106

100

101

102

103

104

105

(6)Skitter

100

101

102

103

104

105

106

100 101 102 103 104 105

Fre

qu

en

cy

(7)Youtube

100

101

102

103

104

105

106

100

101

102

103

104

105

(8)NotreDame

100

101

102

103

104

105

100

101

102

103

104

105

(9)Gowalla

100

101

102

103

104

105

100

101

102

103

104

(10)Epinion

100

101

102

103

104

105

106

100

101

102

103

104

(11)Google

100

101

102

103

104

105

100

101

102

103

104

(12)Slashdot

100

101

102

103

104

105

106

107

100

101

102

103

104

Fre

qu

en

cy

degree

(13)Facebook-1

100

101

102

103

104

105

100

101

102

103

104

degree

(14)Flickr

100

101

102

103

104

100 101 102 103 104

degree

(15)Facebook-2

100

101

102

103

104

105

100

101

102

103

104

degree

(16)Amazon

100

101

102

103

104

105

106

100

101

102

103

degree

(17)Citation

100

101

102

103

104

105

106

100

101

102

degree

(18)RoadNet

Plots are sorted in decreasing order of coefficient of variation γ.

Most of them follow power-law, yet they are very different

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Comparison of Three Sampling methods

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Comparison of RN and RE

0 5 10 15 20 25 30 35 400

5

10

15

20

25

γ

Ra

tio

o

f R

N a

nd

R

E

0 2 4 6 80

1

2

3

4

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Comparison of RN, RE, and RW Samplings

0

2

4

6

8R

RM

SE

(1)Twitter

0

0.5

1

RR

MS

E

(7)Youtube

0 500 10000

0.5

1

Sample Size

RR

MS

E

(13)Facebook−1

0

1

2

3

4(2)WikiTalk

0

0.5

1

1.5

(8)NotreDame

0 500 10000

0.5

1

1.5

Sample Size

(14)Flickr

0

0.5

1

1.5

2

2.5(3)BerkStan

0

0.2

0.4

0.6

0.8

(9)Gowalla

0 500 10000

0.2

0.4

0.6

Sample Size

(15)Facebook−2

0

0.5

1

1.5

2(4)EmailEu

0

0.2

0.4

0.6

(10)Epinion

0 500 10000

0.1

0.2

0.3

Sample Size

(16)Amazon

0

0.5

1

1.5(5)Stanford

0

0.2

0.4

0.6

(11)Google

0 500 10000

0.2

0.4

0.6

Sample Size

(17)Citation

0

0.5

1

1.5(6)Skitter

0

0.2

0.4

0.6

(12)Slashdot

0 500 10000

0.05

0.1

0.15

Sample Size

(18)RoadNet

Figure: RRMSEs of RN, RE, and RW samplings as a function of sample size for 18 graphs.The dotted, dashed, and solid lines are for RN(. . . ), RE(−−), and RW(–) samplingsrespectively. It shows that in most cases the sample size does not change the relativepositions of the sampling methods. The exceptions are the web graphs 3 and 5 where RWsampling does not improve with the increase of sample size because of the random walktraps.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Sample distribution

100

105

100

102

104

(2)WikiTalk

100

105

100

101

102

103

(3)BerkStan

100

102

104

100

102

104

(4)EmailEu

100

105

100

101

102

103

(5)Stanford

100

105

100

101

102

103

(6)Skitter

100

105

100

101

102

103

Fre

quency

(7)Youtube

100

105

100

101

102

103

(8)NotreDame

100

105

100

101

102

103

(9)Gowalla

100

102

104

100

101

102

103

(10)Epinion

100

102

104

100

101

102

103

(11)Google

100

102

104

100

101

102

103

(12)Slashdot

100

102

104

100

101

102

103

Degree

Fre

quency

(13)Facebook−1

100

102

104

100

101

102

103

Degree

(14)Flickr

100

102

104

100

101

102

Degree

(15)Facebook−2

100

102

104

100

102

104

Degree

(16)Amazon

100

102

104

100

101

102

103

Degree

(17)Citation

100

100

102

104

Degree

(18)RoadNet

Figure: The degree distributions of the samples obtained from RE (Random Edge)samplings. n=8,000. The log-log plots in the first two rows exhibit a “V" shape, where thesampled small nodes resemble the distribution of the original graph, while the sampledlarge nodes have a tail pointing upwards. These plots in the first two rows indicate that bothsmall and large nodes are well represented in the sample. The plots in the last row indicatethat the sample distribution is similar to the original distribution, therefore the RRMSE of REsampling is similar to that of RN sampling.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Graph conductance and RW sampling

1.5 2 2.5 3 3.5 4 4.50

2

4

6

8

10

12

14

16

18

−log10(φ)

Ra

tio

of

RW

an

d R

E

Pearson Corr:0.68652Facebbok−2

Youtube

Amazon

Flickr

NotreDame

Stanford

BerkStan

Figure: Standard error ratio between RW and RE vs. graph conductance Φ for 18 datasets.Sample size is 400.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Network Structure

Flickr NotreDame Stanford Web

100~+∞

10~992~91

Amazon Facebook-2 Youtube

Figure: Random walks on six networks. Flickr, NotreDame and Stanford have looselyconnected components while Amazon, Facebook and Youtube are well enmeshed. Eachrandom walk contains 6× 104 nodes except NotreDame which has 15× 104 nodes. Nodecolour indicates the degree of the node. Green=1; Blue=2∼9; Yellow=10∼99; Red=100+.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Conductance

10-4

10-3

10-2

10-1

100

100

101

102

103

104

105

Conducta

nce

Number of nodes in the cluster

Flickr

10-5

10-4

10-3

10-2

10-1

100

100

101

102

103

104

105

106

Conducta

nce


NotreDame

10-5

10-4

10-3

10-2

10-1

100

100

101

102

103

104

105

106

Conducta

nce


Stanford

10-4

10-3

10-2

10-1

100

100

101

102

103

104

105

106

Conducta

nce


Amazon

10-2

10-1

100

100

101

102

103

104

105

Conducta

nce


Facebook-2

10-3

10-2

10-1

100

100

101

102

103

104

105

106

Conducta

nce


Youtube

Figure: Conductance Φ(S) over |S|, the size of the the components, for six networks. Plotsare drawn using SNAP API described in [?].

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Weibo

Very important

We can access only partial informationWhat is the global picture?

SizeDistributionMost influentialOverall topology (e.g. clustering coefficient)Message diffusion, Critical nodesCommunities...

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Star Sampling

Select nodes uniformly at random(e.g., nodes 1 and 6);

Take all the neighbours as sample(nodes 2,3,4,5,5,7,9);

It approximate PPS (probabilityproportional to size) sampling;

More efficient than random walk bytaking all the neighbours instead arandom one;

We sampled around one millionWeibo stars;

1

2 3

4 5 6 7

89

10

11

12

1

2 3

4 5 6 7

89

10

11

12

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Degrees and Messages

Figure: Estimated out-degree, in-degree, and message distributions of Weibo.

Average in-degree and out-degree as 32.10 (CI 31.91, 32.29) and 54.39 (CI 49.02,59.76), respectively.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Estimation of followers

fi di 〈̂d〉i Difference Ratio1 85016 23,335,290 16,859,105 6,476,185 0.382 75243 15,945,306 14,921,069 1,024,237 0.063 71417 15,247,604 14,162,354 1,085,250 0.074 37914 13,394,620 7,518,539 5,876,081 0.785 61962 13,278,161 12,287,380 990,781 0.086 63308 13,153,177 12,554,298 598,879 0.047 59969 12,990,041 11,892,158 1,097,883 0.098 57100 12,604,270 11,323,220 1,281,050 0.119 59406 12,097,122 11,780,512 316,610 0.02

10 54264 12,003,137 10,760,827 1,242,310 0.11

Table: Estimation for the top 10 Weibo accounts. fi : capture frequency of account i ; di : claimedin-degree or number of followers; 〈̂d〉i : estimated number of followers; Ratio = (di − 〈̂d〉i )/〈̂d〉i .

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Estimated Followers

0 200 400 6000

0.5

1

1.5

2

2.5x 10

7

Accounts(top 500)

Follo

wer

s

D

Estimated

Claimed

100

102

104

104

105

106

107

Accounts(all)

Follo

wer

sE

Estimated

Claimed

0 5000 10000−2

0

2

4

6

8x 10

5

Diff

eren

ce

Accounts(all)

F

0 5000 10000−50

0

50

100

150A

Rat

io

Accounts0 5000 10000

−0.5

0

0.5

1

1.5

Sm

ooth

ed ra

tioAccounts

B

100

102

104

100

101

102

103

Rat

io

Accounts (large ratio)

C

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Estimated vs. Claimed Followers

0 0.5 1 1.5 2 2.5

x 107

0

0.5

1

1.5

2

2.5x 10

7

Claimed followers

Estim

ate

d follow

ers

Account

when X=Y

Figure: Estimated followers vs. claimed followers in log-log scale. The Pearson correlationcoefficient is 0.9797.

JianguoLu



UniformSampling

AverageDegree

WeiboSampling

Whether is it correct ...

Relative standard deviation of the estimator is

RSD(d̂i ) = 1/√

fi . (13)

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Top Users

De

gre

es

EmailEu

True Value

Est. Average

Error Bound

0.5

1

1.5

2

2.5

3

x 104

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Top Users

De

gre

es

youtube

True Value

Est. Average

Error Bound

Figure: Sampling accuracy on existing data

569: Sampling Graphs (2013)jlu.myweb.cs.uwindsor.ca/569/569graphSampling2013.pdf · Lots of...

Documents

Transcript of 569: Sampling Graphs (2013)jlu.myweb.cs.uwindsor.ca/569/569graphSampling2013.pdf · Lots of...