Crawling the Infinite Web (WAW 2004 Rome)


Transcript of Crawling the Infinite Web (WAW 2004 Rome)

Page 1: Crawling the Infinite Web (WAW 2004 Rome)

Outline Introduction Models Experiments Summary

Crawling the Infinite Web: Five Levels are Enough

Ricardo Baeza-Yates and Carlos Castillo

Center for Web Research

www.cwr.cl

WAW 2004

R. Baeza-Yates and C. Castillo Center for Web Research

Crawling the Infinite Web

Page 2: Crawling the Infinite Web (WAW 2004 Rome)

1 Introduction

2 Models

3 Experiments

4 Summary

Page 3: Crawling the Infinite Web (WAW 2004 Rome)

Introduction

Dynamic page: “a page which is created on request”

Dynamic pages with links to other dynamic pages

Malicious: loops and/or near-duplicates

Legitimate: recommendation systems, calendars, iterative algorithms, etc.

The number of pages on the Web can be considered infinite

Page 8: Crawling the Infinite Web (WAW 2004 Rome)

Conflicting interests

Web site administrator: would like to have all of the Web site indexed

Search engine administrator: would like to use the available network and storage capacity efficiently

Search engine user: would like to find what they are looking for

Page 11: Crawling the Infinite Web (WAW 2004 Rome)

Our approach

Users do not go so deep inside Web sites

If something is important it has to be easily reachable

We will download only a few levels of each Web site

How many levels?

How much do you lose?
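
The idea of downloading only a few levels per site can be sketched as a breadth-first crawl with a depth cap. This is a minimal sketch, not the authors' crawler: `calendar_links` is a hypothetical link generator standing in for a dynamic site with an unbounded number of pages (every month page links to its neighbours), and `max_depth` is an assumed parameter.

```python
from collections import deque


def calendar_links(url):
    """Hypothetical dynamic site: every month page links to its neighbours.

    URLs look like "/cal/<n>"; the home page "/" links to month 0.
    The set of reachable pages is unbounded.
    """
    if url == "/":
        return ["/cal/0", "/about"]
    if url.startswith("/cal/"):
        month = int(url.rsplit("/", 1)[1])
        return [f"/cal/{month + 1}", f"/cal/{month - 1}"]
    return []  # static leaf page


def crawl(start, get_links, max_depth):
    """Breadth-first crawl that stops expanding pages at max_depth.

    Depth is the minimum number of clicks from the start page (level 0).
    Returns a dict mapping each downloaded URL to its depth.
    """
    depth = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # do not follow links beyond the depth cap
        for link in get_links(url):
            if link not in depth:  # skip loops and already-seen pages
                depth[link] = depth[url] + 1
                queue.append(link)
    return depth


# Levels 0..4, i.e. five levels of the hypothetical site.
pages = crawl("/", calendar_links, max_depth=4)
print(len(pages), max(pages.values()))
```

The crawl terminates even though the site is infinite, because the depth cap bounds the number of pages that are ever expanded.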

Page 16: Crawling the Infinite Web (WAW 2004 Rome)

Models

Navigating a tree ≈ Moving through levels

Page 17: Crawling the Infinite Web (WAW 2004 Rome)

Actions

Possible actions at a given level

Page 18: Crawling the Infinite Web (WAW 2004 Rome)

Type of models we study

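There is a set of atomic actions $A = \{\text{next}, \text{start/jump}, \text{back}, \text{stay}, \text{prev}, \text{fwd}\}$

$\Pr(\text{action} \mid \ell)$ is the probability of taking an action at level $\ell$

$\sum_{\text{action} \in A} \Pr(\text{action} \mid \ell) = 1$

The probability $\Pr(\text{next} \mid \ell)$ is constant

Stationary distribution → how much time users spend at each level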
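
The stationary distribution mentioned above can be written out explicitly. The notation is an assumption added for illustration and does not appear on the slide: $P_{ij}$ is the probability of moving from level $i$ to level $j$ in one step (the sum of the probabilities of the actions that lead from $i$ to $j$), and $x_j$ is the long-run fraction of time spent at level $j$.

```latex
x_j = \sum_{i \ge 0} x_i \, P_{ij} \quad \text{for all } j \ge 0,
\qquad
\sum_{j \ge 0} x_j = 1,
\qquad
\sum_{i=0}^{k} x_i = \text{fraction of time spent at levels } 0 \ldots k
```

The last sum is the quantity a crawler cares about: the share of user activity it captures when it stops at depth $k$.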

Page 24: Crawling the Infinite Web (WAW 2004 Rome)

Model A

Forwards and backwards one level at a time

Birth and death process
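
A minimal numerical sketch of Model A as a birth-and-death chain. The parametrization is an assumption, since the slide does not give it: from level i the user goes forward with probability q and back one level with probability 1 - q, and at level 0 "back" is taken to mean staying at level 0. The chain is truncated at `max_level` so that it can be iterated. Under these assumptions, detailed balance gives x_i proportional to (q / (1 - q))^i, which can only be normalized when q < 1/2.

```python
def model_a_stationary(q, max_level=60, iterations=3000):
    """Power iteration for a truncated birth-and-death chain on levels 0..max_level."""
    x = [1.0] + [0.0] * max_level  # start with all mass at level 0
    for _ in range(iterations):
        new = [0.0] * (max_level + 1)
        for level, mass in enumerate(x):
            forward = min(level + 1, max_level)  # reflect at the truncation level
            backward = max(level - 1, 0)         # "back" at level 0 stays at level 0
            new[forward] += q * mass
            new[backward] += (1.0 - q) * mass
        x = new
    return x


def model_a_closed_form(q, level):
    """Geometric solution x_i = (1 - r) * r**i with r = q / (1 - q), valid for q < 1/2."""
    r = q / (1.0 - q)
    return (1.0 - r) * r ** level


q = 0.3  # assumed value, chosen below 1/2 so that a stationary distribution exists
numeric = model_a_stationary(q)
for level in range(5):
    print(level, round(numeric[level], 4), round(model_a_closed_form(q, level), 4))
```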

Page 26: Crawling the Infinite Web (WAW 2004 Rome)

Model B

Back to first level

Birth and death process with extinction
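
Under one possible reading of Model B (an assumption, since the slide only names it), the user goes to the next level with probability q and otherwise returns to the first level, level 0. Then x_i = q x_{i-1} for i ≥ 1, so x_i = (1 - q) q^i, and the cumulative probability of levels 0..k is 1 - q^(k+1). The sketch below uses that closed form to ask how many levels a crawl must cover to reach a target share of visits; `coverage` and `levels_needed` are illustrative names.

```python
import math


def coverage(q, k):
    """Cumulative probability of levels 0..k when x_i = (1 - q) * q**i."""
    return 1.0 - q ** (k + 1)


def levels_needed(q, target):
    """Smallest number of levels (k + 1) whose cumulative probability reaches target."""
    return math.ceil(math.log(1.0 - target) / math.log(q))


for q in (0.51, 0.62, 0.72):  # example values of q between 0.5 and 0.75
    n = levels_needed(q, 0.90)
    print(q, n, round(coverage(q, n - 1), 3))
```

Because the coverage is 1 - q^(k+1), the required depth grows only logarithmically as the target approaches 1, but it grows quickly as q approaches 1.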

Page 28: Crawling the Infinite Web (WAW 2004 Rome)

Model C

Back to any previous level

Birth and death process with extinction and disaster?
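
A numerical sketch for one possible reading of Model C, labeled as an assumption because the slide does not spell out the transition probabilities: from level i the user goes forward with probability q, and otherwise lands on one of the levels 0..i chosen uniformly at random. Under that assumption the balance equations are solved by x_i = (1 - q)^2 (i + 1) q^i, and the code checks this against power iteration on a chain truncated at `max_level`.

```python
def model_c_stationary(q, max_level=200, iterations=500):
    """Power iteration; a non-forward step from level i lands uniformly on 0..i."""
    size = max_level + 1
    x = [1.0] + [0.0] * max_level  # start with all mass at level 0
    for _ in range(iterations):
        new = [0.0] * size
        # Mass arriving at level j by going back comes from every level i >= j,
        # each contributing (1 - q) * x[i] / (i + 1).
        tail = 0.0
        for level in range(max_level, -1, -1):
            tail += (1.0 - q) * x[level] / (level + 1)
            new[level] += tail
        for level in range(max_level):
            new[level + 1] += q * x[level]
        new[max_level] += q * x[max_level]  # truncation: forward from the last level stays put
        x = new
    return x


def model_c_closed_form(q, level):
    """x_i = (1 - q)**2 * (i + 1) * q**i under the uniform-back assumption."""
    return (1.0 - q) ** 2 * (level + 1) * q ** level


q = 0.63  # assumed example value
numeric = model_c_stationary(q)
for level in range(5):
    print(level, round(numeric[level], 4), round(model_c_closed_form(q, level), 4))
```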

Page 29: Crawling the Infinite Web (WAW 2004 Rome)

Cumulative probability of levels 0 ... k

Based on solutions given in the paper

Page 30: Crawling the Infinite Web (WAW 2004 Rome)

Experiments

Anonymized access logs for 13 Web sites

Educational - Commercial - Reference - Organization - Blogs

Analysis of access logs to extract ≈ 250,000 user sessions
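
Extracting sessions from an access log can be sketched as follows. The slide does not describe the method, so everything here is an assumption for illustration: records are `(timestamp_in_seconds, visitor_id, url)` tuples, a visitor is identified by an opaque id, and a session ends after `timeout` seconds of inactivity (1800 is an arbitrary but common choice).

```python
from collections import defaultdict


def extract_sessions(records, timeout=1800):
    """Group (timestamp, visitor_id, url) records into per-visitor sessions.

    A new session starts when the same visitor has been idle for more
    than `timeout` seconds. Returns a list of (visitor_id, [urls]) pairs.
    """
    by_visitor = defaultdict(list)
    for timestamp, visitor, url in sorted(records):
        by_visitor[visitor].append((timestamp, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        current = [hits[0][1]]
        for (prev_time, _), (time, url) in zip(hits, hits[1:]):
            if time - prev_time > timeout:
                sessions.append((visitor, current))
                current = []
            current.append(url)
        sessions.append((visitor, current))
    return sessions


# Hypothetical log: visitor "a" comes back after a long pause, visitor "b" does not.
log = [
    (0, "a", "/"), (40, "a", "/news"), (5000, "a", "/"),
    (10, "b", "/"), (20, "b", "/docs"), (30, "b", "/docs/api"),
]
sessions = extract_sessions(log)
print(sessions)
```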

Page 33: Crawling the Infinite Web (WAW 2004 Rome)

Distribution of visits per level

Page 34: Crawling the Infinite Web (WAW 2004 Rome)

Model fitting

Code  Type                 Country  Model  q     Error
E1    Educational          Chile    B      0.51  0.88%
E2    Educational          Spain    B      0.51  2.29%
E3    Educational          US       B      0.64  0.72%
C1    Commercial           Chile    B      0.55  0.39%
C2    Commercial           Chile    B      0.62  5.17%
R1    Reference            Chile    B      0.54  2.96%
R2    Reference            Chile    B      0.59  2.75%
O1    Organization         Italy    C      0.35  2.27%
O2    Organization         US       B      0.62  2.31%
OB1   Organization + Blog  Chile    B      0.65  2.07%
OB2   Organization + Blog  Chile    B      0.72  0.35%
B1    Blog                 Chile    C      0.79  0.88%
B2    Blog                 Chile    C      0.63  1.01%

Page 36: Crawling the Infinite Web (WAW 2004 Rome)

Observed distribution of transitions

Level  Obs.    Next   Start  Jump   Back   Stay   Prev
0      247985  0.457  –      0.527  –      0.008  –
1      120482  0.459  –      0.332  0.185  0.017  –
2      70911   0.462  0.111  0.235  0.171  0.014  –
3      42311   0.497  0.065  0.186  0.159  0.017  0.069
4      27129   0.514  0.057  0.157  0.171  0.009  0.088
5      17544   0.549  0.048  0.138  0.143  0.009  0.108
6      10296   0.555  0.037  0.133  0.155  0.009  0.106
7      6326    0.596  0.033  0.135  0.113  0.006  0.113
8      4200    0.637  0.024  0.104  0.127  0.006  0.096
9      2782    0.663  0.015  0.108  0.113  0.006  0.089
10     2089    0.662  0.037  0.084  0.120  0.005  0.086

Pr(next) is not constant: it rises from 0.457 at level 0 to 0.662 at level 10. If you have already spent some time in the Web site, then you can spend some more.
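
The Obs. column can be turned into a cumulative share of observations per depth, which is the quantity behind the "five levels" claim. This sketch only uses the counts listed for levels 0 to 10; deeper levels are not listed on the slide, so the shares are relative to those eleven levels only.

```python
from itertools import accumulate

# Number of observations at levels 0..10, copied from the Obs. column.
observations = [247985, 120482, 70911, 42311, 27129, 17544,
                10296, 6326, 4200, 2782, 2089]

total = sum(observations)
cumulative_share = [running / total for running in accumulate(observations)]

for level, share in enumerate(cumulative_share):
    print(f"levels 0..{level}: {share:.3f}")
```

On these counts, levels 0 to 4 account for roughly 92% of the listed observations, while levels 0 to 3 account for about 87%.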

Page 37: Crawling the Infinite Web (WAW 2004 Rome)

Pagerank and depth

Cumulative Pagerank by levels in the Chilean Web

Page 38: Crawling the Infinite Web (WAW 2004 Rome)

Pagerank and depth

Correlation of Pagerank and depth is low at deeper levels

Page 39: Crawling the Infinite Web (WAW 2004 Rome)

Summary

90% of the visits are at most 4-5 clicks away from the home page, except in blogs

Simple models try to explain this behavior

In the paper: explicit methodology, closed-form solutions to the models, references

Page 42: Crawling the Infinite Web (WAW 2004 Rome)

Open problems

A model which better fits empirical data

Analyzing blogs

Analyzing the textual content of pages to decide when to stop

Relationship of this with the spam detection problem

Try adaptive strategies: which are the factors that affect the desired crawling depth in a Web site?

There are other ways of defining which pages to download from an infinite set

Page 48: Crawling the Infinite Web (WAW 2004 Rome)

Questions and comments . . .
