Measuring the Size of the Web

Measuring the Size of the Web Dongwon Lee, Ph.D. IST 501, Fall 2014 Penn State


Page 1: Measuring the Size of the Web

Measuring the Size of the Web

Dongwon Lee, Ph.D.

IST 501, Fall 2014

Penn State

Page 2: Measuring the Size of the Web

Studying the Web

To study the characteristics of the Web: statistics, topology, behavior, …

Why? Scientific curiosity and practical value.

E.g., search engine coverage

Nature 1999

Page 3: Measuring the Size of the Web

Web as Platform

The Web has become a new computation platform, which poses new challenges:

Scale, efficiency, heterogeneity, impact on people’s lives

Page 4: Measuring the Size of the Web

Eg, How Big is the Web?

Q1: How many web sites?

Q2: How many web pages?

Q3: How many surface/deep web pages?

Research method: mostly the experimental method, used to validate novel solutions

Page 5: Measuring the Size of the Web

Q1: How Many Web Sites?

DNS registrars: lists of domain names

Issues:

Not every domain is a web site
A domain may contain more than one web site
Registrars are under no obligation to keep their lists correct
So many of them …

Page 6: Measuring the Size of the Web

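How Many Web Sites?

Brute force: poll every IP address

IPv4: 256 × 256 × 256 × 256 = 2^32 ≈ 4.3 billion addresses
IPv6: 2^128 addresses

At 10 sec/IP with 1,000 simultaneous connections: 2^32 × 10 / (1000 × 24 × 60 × 60) ≈ 497 days (about 460 days if 2^32 is rounded down to 4 billion)

Not going to work!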
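The back-of-the-envelope estimate can be checked with a few lines of Python. This is only a sketch: the function name `polling_days` and its parameters are illustrative, and it assumes every probe takes the full 10 seconds and that exactly 1,000 connections stay busy the whole time.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def polling_days(num_addresses, seconds_per_ip=10, connections=1000):
    """Days needed to poll every address once with a fixed number of parallel connections."""
    total_seconds = num_addresses * seconds_per_ip / connections
    return total_seconds / SECONDS_PER_DAY

ipv4_days = polling_days(2**32)             # exact IPv4 address space
rounded_days = polling_days(4_000_000_000)  # address space rounded to 4 billion

print(f"IPv4, exact 2^32: {ipv4_days:.0f} days")
print(f"IPv4, rounded to 4 billion: {rounded_days:.0f} days")
```

Either way the scan takes well over a year, and the IPv6 space of 2^128 addresses is out of reach entirely.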

Page 7: Measuring the Size of the Web

How Many Web Sites?

2nd attempt: Sampling

T: All 4 billion IPs
S: Sampled IPs
V: Valid replies

Estimated number of web sites: $|T| \cdot \frac{|V|}{|S|}$

Page 8: Measuring the Size of the Web

How Many Web Sites?

1. Select |S| random IPs
2. Send HTTP requests to port 80 at the selected IPs
3. Count valid replies: “HTTP 200 OK” = |V|
4. |T| = 2^32

Estimated number of web sites: $|T| \cdot \frac{|V|}{|S|}$
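The sampling steps can be sketched in Python as follows. To keep the example runnable offline, the real HTTP request is replaced by a hypothetical `fake_probe` function that pretends 1 in 100 addresses answers with "200 OK"; the names `estimate_web_sites` and `fake_probe` are illustrative and not part of the original slides.

```python
import random

def estimate_web_sites(probe, sample_size, total=2**32, seed=0):
    """Estimate |T| * |V| / |S| from a uniform random sample of IPv4 addresses."""
    rng = random.Random(seed)
    valid = 0
    for _ in range(sample_size):
        ip = rng.randrange(total)   # one uniformly random address in [0, 2^32)
        if probe(ip):               # True stands for a valid "HTTP 200 OK" reply
            valid += 1
    return total * valid / sample_size

def fake_probe(ip):
    # Hypothetical stand-in for a real request to port 80:
    # pretends that exactly 1 in 100 addresses hosts a web server.
    return ip % 100 == 0

estimate = estimate_web_sites(fake_probe, sample_size=200_000)
print(f"Estimated web sites: {estimate:,.0f}")
```

With a real probe, the same estimator would inherit whatever the probe miscounts, so the quality of the estimate depends on how "valid reply" is defined.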

Q: What are the issues here?

Page 9: Measuring the Size of the Web

Issues

Virtual hosting
Ports other than 80
Temporarily unavailable sites
…

Page 10: Measuring the Size of the Web

OCLC Survey (2002)

OCLC (Online Computer Library Center)

Results: http://wcp.oclc.org/

Still room for growth (at least for Web sites)?

Page 11: Measuring the Size of the Web

NetCraft Web Server Survey (2010)

Goal is to measure web server market share
Also records the # of sites their crawlers visited
August 2010: 213,458,815 distinct sites

http://news.netcraft.com/archives/category/web-server-survey/

Page 12: Measuring the Size of the Web

NetCraft Web Server Survey (2013)

Goal is to measure web server market share
Also records the # of sites their crawlers visited
August 2013: 716,822,317 distinct sites

http://news.netcraft.com/archives/category/web-server-survey/

Page 13: Measuring the Size of the Web

NetCraft Web Server Survey (2014)

Goal is to measure web server market share
Also records the # of sites their crawlers visited
August 2014: 992,177,228 distinct sites

http://news.netcraft.com/archives/category/web-server-survey/

Page 14: Measuring the Size of the Web

Q2: How Many Web Pages?

Sampling based?

Issue here?

T: All URLs
S: Sampled URLs
V: Valid replies

Estimated number of web pages: $|T| \cdot \frac{|V|}{|S|}$

Page 15: Measuring the Size of the Web

How Many Web Pages?

Method #1:

For each site with a valid reply, download all pages
Measure the average # of pages per site
(Avg # of pages per site) × (total # of sites)

Result [Lawrence & Giles, 1999]:

289 pages per site, 2.8M sites
289 × 2.8M ≈ 800M web pages

Page 16: Measuring the Size of the Web

Further Issues

A small # of sites with TONS of pages: sampling could miss these sites
Majority of sites with a small # of pages: lots of samples necessary

[Figure: no. of pages (y-axis, 0 to 1,000,000) vs. no. of sites (x-axis, 0 to 1000); annotation: 99.99% of the sites]
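The effect of a skewed page-count distribution on Method #1 can be seen in a small simulation. The sketch below uses a made-up population (9,999 sites with 10 pages each and one site with 1,000,000 pages); the numbers and the name `estimate_total_pages` are assumptions chosen only to illustrate the point.

```python
import random

def estimate_total_pages(pages_per_site, sample_size, total_sites, seed=0):
    """Method #1: (avg # of pages in a sample of sites) x (total # of sites)."""
    rng = random.Random(seed)
    sample = rng.sample(pages_per_site, sample_size)
    return sum(sample) / sample_size * total_sites

# Hypothetical population: many small sites and a single huge one.
population = [10] * 9_999 + [1_000_000]
true_total = sum(population)

# Repeat the estimate with 20 different random samples of 100 sites each.
estimates = [
    estimate_total_pages(population, sample_size=100,
                         total_sites=len(population), seed=s)
    for s in range(20)
]
print("true total:", true_total)
print("min estimate:", min(estimates), "max estimate:", max(estimates))
```

A sample that misses the huge site underestimates the total by roughly a factor of ten, and a sample that happens to include it overestimates by roughly a factor of a hundred, so no single small sample lands near the true value.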

Page 17: Measuring the Size of the Web

How Many Web Pages?

Method #2: Random sampling

T: All pages
B: Base set
S: Random samples

Assume: $\frac{|B|}{|T|} \approx \frac{|B \cap S|}{|S|}$, so $|T| \approx |B| \cdot \frac{|S|}{|B \cap S|}$
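A minimal sketch of the overlap-based estimate, assuming the sample S is uniform over all pages so that the fraction of S falling inside the base set B approximates |B| / |T| (the same relation behind the WebWalker figures later in the deck, where 250M / 35% gives roughly 720M). The function name `estimate_size_by_overlap` and the toy data are hypothetical.

```python
def estimate_size_by_overlap(base_size, sample, in_base):
    """Estimate |T| as |B| * |S| / |B ∩ S| from a random sample S."""
    overlap = sum(1 for page in sample if in_base(page))
    if overlap == 0:
        raise ValueError("no sampled page falls in the base set; cannot estimate")
    return base_size * len(sample) / overlap

# Toy example: pages are the integers 0..999 and the base set is 0..249.
base = set(range(250))
sample = list(range(0, 1000, 10))   # stand-in for a uniform random sample, |S| = 100
estimate = estimate_size_by_overlap(len(base), sample, base.__contains__)
print(estimate)
```

The estimate is only as good as the uniformity of S, which is why the following slides focus on how to obtain random pages.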

Page 18: Measuring the Size of the Web

Random Page?

Idea: Random walk

Start from a portal home page (e.g., Yahoo)
Estimate the size of the portal: B
Follow random links, say 10,000 times
Select the pages
At the end, a set of random web pages S is gathered

Page 19: Measuring the Size of the Web

Straightforward Random Walk

Follow a random out-link at each step

[Figure: example Web graph including google.com, amazon.com, and pike.psu.edu, with walk steps numbered 1 to 9]

Issues?

Page 20: Measuring the Size of the Web

Straightforward Random Walk

Follow a random out-link at each step

[Figure: example Web graph including google.com, amazon.com, and pike.psu.edu, with walk steps numbered 1 to 9]

Issues?

1. Gets stuck in sinks and in dense Web communities
2. Biased towards popular pages
3. Converges slowly, if at all
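The first issue is easy to reproduce on a toy graph. The sketch below follows a random out-link at each step on a hypothetical five-page graph (the page names and the helper `random_walk` are made up for illustration) and stops as soon as it reaches a page with no out-links.

```python
import random
from collections import Counter

def random_walk(out_links, start, steps, seed=0):
    """Follow a uniformly random out-link at each step; stop early at a sink."""
    rng = random.Random(seed)
    page = start
    visits = Counter([start])
    for _ in range(steps):
        links = out_links.get(page, [])
        if not links:               # sink: no out-links, the walk is stuck
            break
        page = rng.choice(links)
        visits[page] += 1
    return page, visits

# Hypothetical toy graph: "hub" is linked from every other page, "sink" links nowhere.
toy_web = {
    "a": ["hub", "b"],
    "b": ["hub"],
    "hub": ["a", "b", "c"],
    "c": ["hub", "sink"],
    "sink": [],
}
last_page, visits = random_walk(toy_web, start="a", steps=10_000)
print("stopped at:", last_page)
print("visit counts:", visits.most_common())
```

Although 10,000 steps were requested, the walk ends at the sink long before that, and the visit counts collected along the way are far from uniform.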

Page 21: Measuring the Size of the Web

Going to Converge?

Random walks on a regular, undirected graph yield a uniformly distributed sample

Theorem [Markov chain folklore]: After $O\left(\frac{1}{\epsilon} \log N\right)$ steps, a random walk reaches the stationary distribution

$\epsilon$: depends on the graph structure
N: number of nodes

Idea: Transform the Web graph into a regular, undirected graph, then perform a random walk

Problem: The Web is neither regular nor undirected

Page 22: Measuring the Size of the Web

Intuition

Random walk on the undirected Web graph (not regular):

High chance of being at a “popular” node at a particular time
Increase the chance of being at an “unpopular” node by staying there longer through self-loops

[Figure: unpopular nodes and a popular node]

Page 23: Measuring the Size of the Web

WebWalker: Undirected Regular Random Walk on the Web

Fact: A random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.

Self-loop weight: $w(v) = \deg_{\max} - \deg(v)$

[Figure: example Web graph (google.com, amazon.com, pike.psu.edu) with numeric annotations]

Follow a random out-link or a random in-link at each step
Use weighted self-loops to even out pages’ degrees
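A minimal sketch of the self-loop idea on a toy undirected graph, not the WebWalker implementation itself. Each node v gets a self-loop of weight deg_max - deg(v), so every node has the same total degree; the code then computes the exact distribution of the walk after a number of steps. The graph and the function names are hypothetical.

```python
def self_loop_weights(neighbors):
    """w(v) = deg_max - deg(v) for every node of an undirected graph."""
    d_max = max(len(adj) for adj in neighbors.values())
    return {v: d_max - len(adj) for v, adj in neighbors.items()}, d_max

def transition_matrix(neighbors):
    """One step: each real edge has probability 1/deg_max, the self-loop gets the rest."""
    weights, d_max = self_loop_weights(neighbors)
    P = {v: {u: 0.0 for u in neighbors} for v in neighbors}
    for v, adj in neighbors.items():
        for u in adj:
            P[v][u] += 1.0 / d_max
        P[v][v] += weights[v] / d_max
    return P

def distribution_after(P, start, steps):
    """Exact probability of being at each node after `steps` steps from `start`."""
    dist = {v: 0.0 for v in P}
    dist[start] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in P}
        for v, p in dist.items():
            for u, q in P[v].items():
                nxt[u] += p * q
        dist = nxt
    return dist

# Hypothetical undirected toy graph: "hub" is popular, "c" is unpopular.
toy_graph = {
    "hub": ["a", "b", "c"],
    "a": ["hub", "b"],
    "b": ["hub", "a"],
    "c": ["hub"],
}
weights, d_max = self_loop_weights(toy_graph)
final = distribution_after(transition_matrix(toy_graph), start="hub", steps=200)
print(weights)
print({v: round(p, 6) for v, p in final.items()})
```

On this small graph the distribution settles at one quarter per node, regardless of how popular each node is; on the real Web the number of steps needed is the hard part.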

Page 24: Measuring the Size of the Web

Ideal Random Walk

Generate the regular, undirected graph:

Make edges undirected
Decide d, the maximum # of edges per page: say, 300,000
If edge(n) < 300,000, then add self-loops

Perform random walks on the graph

$\epsilon \approx 10^{-5}$ for the 1996 Web, $N \approx 10^9$
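To get a feel for the walk length these figures imply, the sketch below evaluates (1/ε) · ln N. It rests on several assumptions: that the convergence bound has the form (1/ε) log N, that the values for the 1996 Web are ε = 10^-5 and N = 10^9, that the logarithm is natural, and that the constant hidden by the big-O is 1. The result is an order of magnitude, not a prediction.

```python
import math

def mixing_steps(epsilon, num_nodes):
    """(1/epsilon) * ln(N): order of magnitude of the bound, hidden constant ignored."""
    return math.log(num_nodes) / epsilon

steps = mixing_steps(epsilon=1e-5, num_nodes=1e9)
print(f"about {steps:,.0f} steps")
```

Under these assumptions a single uniform sample costs on the order of millions of steps, which is why convergence speed is listed among the issues.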

Page 25: Measuring the Size of the Web

WebWalker Results (2000)

Size of the Web (pages):

AltaVista: |B| = 250M
|B ∩ S| / |S| = 35%
Estimated |T| ≈ 720M

Avg page size: 12K
Avg # of out-links: 10

Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz. Approximating Aggregate Queries about Web Pages via Random Walks. VLDB, 2000.

Page 26: Measuring the Size of the Web

How large is SE’s Index?

Prepare a representative corpus (e.g., DMOZ)
Draw a word W with known frequency percentage F (e.g., “the” is present in 60% of all documents within the corpus)
Submit W to a search engine E
If E reports that X documents contain W, extrapolate the total size of E’s index as ≈ X / F
Repeat multiple times and compute the average
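The extrapolation can be sketched as below. The hit counts and frequencies are invented for illustration and are not real measurements; the function names are likewise hypothetical.

```python
def estimate_index_size(reported_hits, corpus_fraction):
    """Extrapolate the index size as X / F for a single word."""
    if not 0 < corpus_fraction <= 1:
        raise ValueError("corpus_fraction must be in (0, 1]")
    return reported_hits / corpus_fraction

def average_index_size(observations):
    """Average the X / F estimates over several (X, F) pairs."""
    estimates = [estimate_index_size(x, f) for x, f in observations]
    return sum(estimates) / len(estimates)

# Hypothetical (reported hit count X, corpus frequency F) pairs.
observations = [
    (6_000_000_000, 0.60),
    (2_400_000_000, 0.25),
    (1_100_000_000, 0.10),
]
average = average_index_size(observations)
print(f"estimated index size: {average:,.0f} documents")
```

Averaging over several words smooths out words whose frequency in the corpus differs from their frequency on the Web, though it cannot correct for hit counts that the search engine itself only approximates.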

Page 27: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2010)

28 billion

Page 28: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2011)

46 billion

Page 29: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2013)

46 billion

Page 30: Measuring the Size of the Web

http://www.worldwidewebsize.com/ (2013)

10 billion

Page 31: Measuring the Size of the Web

Google Reveals Itself (2008)

1998: 26 million URLs
2000: 1 billion URLs
2008: 1 trillion URLs

Not all of them are indexed:

Duplicates
Auto-generated pages (e.g., calendars)
Spam

Experts suspect (2010) that the Google index holds at least 40 billion pages

Page 32: Measuring the Size of the Web

Deep Web (aka Hidden Web)

[Figure: a query submitted through an HTML FORM interface, and the answers returned]

Page 33: Measuring the Size of the Web

Q3: Size of Deep Web?

Deep Web: Information reachable only through a query interface (e.g., HTML FORM)

Often backed by a DBMS

Estimation: (Avg size of a record) × (Avg # of records per site) × (Total # of Deep Web sites)

How to estimate? By sampling

Page 34: Measuring the Size of the Web

Size of Deep Web?

Total # of Deep Web sites: estimated from |B ∩ S| / |S|

Avg size of a record:
Issue random queries
Estimate reply size

Avg # of records per site:
Permute all possible queries for the FORM
Issue all queries and count valid returns

Page 35: Measuring the Size of the Web

Size of Deep Web (2005)

BrightPlanet report estimates:

Avg size of a record: 14 KB
Avg # of records per site: 5M
Total # of Deep Web sites: 200,000
Size of the Deep Web: about 10^16 bytes (10 petabytes)
1,000 times larger than the “Surface Web”
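The product behind the size estimate can be checked directly. The sketch assumes that the per-site figure means about 5 million records (not 5 megabytes) and that 1 KB is 1,000 bytes; the function name `deep_web_bytes` is illustrative.

```python
def deep_web_bytes(avg_record_bytes, avg_records_per_site, num_sites):
    """(avg size of a record) x (avg # of records per site) x (total # of sites)."""
    return avg_record_bytes * avg_records_per_site * num_sites

size = deep_web_bytes(
    avg_record_bytes=14_000,         # 14 KB per record
    avg_records_per_site=5_000_000,  # assumed: 5 million records per site
    num_sites=200_000,
)
print(f"{size:.1e} bytes = {size / 1e15:.0f} PB")
```

Under these assumptions the product is 1.4 × 10^16 bytes, which matches the order of magnitude quoted for the Deep Web.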

How to access it? Wrapper/Mediator (aka Web scraping)

http://brightplanet.com/the-deep-web/deep-web-faqs/ : obsolete now