Web search engines

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Two main difficulties

The Web: Size: more than tens of billions of pages

Language and encodings: hundreds…

Distributed authorship: SPAM, format-less,…

Dynamic: in one year 35% survive, 20% untouched

The User: Query composition: short (2.5 terms on average) and imprecise

Query results: 85% of users look at just one result page

Several needs: Informational, Navigational, Transactional

Extracting “significant data” is difficult !!

Matching “user needs” is difficult !!

Evolution of Search Engines
First generation -- use only on-page, web-text data

Word frequency and language

Second generation -- use off-page, web-graph data:
Link (or connectivity) analysis
Anchor-text (how people refer to a page)

Third generation -- answer "the need behind the query":
Focus on the "user need", rather than on the query
Integrate multiple data sources
Click-through data

1995-1997 AltaVista, Excite, Lycos, etc

1998: Google

Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

Google, Yahoo, MSN, ASK, ...

The web-graph: properties

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading 19.1 and 19.2

The Web’s Characteristics

Size: 1 trillion pages available (Google, 7/08)
50 billion static pages, 5-40 KB per page => terabytes & terabytes; size grows every day!!
Change: 8% new pages and 25% new links change weekly; life time of about 10 days

The Bow Tie

[Figure: the bow-tie structure of the web graph, with its SCC core and the surrounding WCC]

Some definitions

Weakly connected component (WCC): a set of nodes such that from any node you can reach any other node via an undirected path.
Strongly connected component (SCC): a set of nodes such that from any node you can reach any other node via a directed path.

Find the CORE
Iterate the following process:
Pick a random vertex v
Compute all nodes reached from v: O(v)
Compute all nodes that reach v: I(v)
Compute SCC(v) := I(v) ∩ O(v)
Check whether it is the largest SCC found so far
If the CORE is about ¼ of the vertices, a random vertex falls inside it with probability ≈ 1/4, so after 20 iterations the probability of not finding the core is (3/4)^20 ≈ 0.3% < 1%.
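A minimal in-memory sketch of this probing process (the toy graph and helper names are illustrative, not from the slides), using BFS for the forward and backward reachability sets:

```python
# Sketch of the CORE-probing process: pick random vertices, intersect the
# forward and backward reachability sets, keep the largest SCC found.
import random
from collections import deque

def reachable(adj, start):
    """BFS: all nodes reachable from `start` following the edges in `adj`."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def probe_core(adj, radj, nodes, iterations=20):
    best = set()
    for _ in range(iterations):
        v = random.choice(nodes)
        O = reachable(adj, v)     # O(v): nodes reached from v
        I = reachable(radj, v)    # I(v): nodes that reach v
        scc = O & I               # SCC(v) = I(v) ∩ O(v)
        if len(scc) > len(best):
            best = scc
    return best

# Toy graph: {1, 2, 3} form a directed cycle (the "core"), 4 dangles off it.
adj  = {1: [2], 2: [3], 3: [1, 4], 4: []}
radj = {2: [1], 3: [2], 1: [3], 4: [3]}   # transposed graph
print(probe_core(adj, radj, nodes=[1, 2, 3, 4]))   # almost surely {1, 2, 3}
```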

Compute SCCs

Classical algorithm:
1) DFS(G)
2) Transpose G into G^T
3) DFS(G^T), visiting vertices in decreasing order of the time their visit ended in step 1
4) Every resulting DFS tree is an SCC

DFS hard to compute on disk: no locality
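A compact sketch of the four steps above (this is Kosaraju's algorithm), assuming the graph fits in RAM; an iterative DFS is used only to avoid recursion limits, and the toy graph is illustrative:

```python
# Kosaraju's algorithm: DFS on G to get finishing times, then DFS on the
# transpose G^T in decreasing finishing order; each tree of the second
# pass is one SCC.
def kosaraju_sccs(adj):
    nodes = set(adj) | {v for vs in adj.values() for v in vs}

    # Step 1: DFS on G, recording vertices by increasing finishing time.
    order, visited = [], set()
    for s in nodes:
        if s in visited:
            continue
        stack = [(s, iter(adj.get(s, ())))]
        visited.add(s)
        while stack:
            u, it = stack[-1]
            advanced = False
            for v in it:
                if v not in visited:
                    visited.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    advanced = True
                    break
            if not advanced:
                order.append(u)
                stack.pop()

    # Step 2: transpose G into G^T.
    radj = {u: [] for u in nodes}
    for u, vs in adj.items():
        for v in vs:
            radj[v].append(u)

    # Steps 3-4: visit G^T in decreasing finishing time; each tree is an SCC.
    sccs, assigned = [], set()
    for s in reversed(order):
        if s in assigned:
            continue
        component, stack = [], [s]
        assigned.add(s)
        while stack:
            u = stack.pop()
            component.append(u)
            for v in radj[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        sccs.append(component)
    return sccs

print(kosaraju_sccs({1: [2], 2: [3], 3: [1], 4: [3]}))   # e.g. [[4], [1, 3, 2]]
```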

DFS

DFS(u: vertex)
  color[u] ← GRAY
  d[u] ← time; time ← time + 1
  foreach v in succ[u] do
    if (color[v] == WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time; time ← time + 1

Classical Approach

main() {
  foreach vertex v do
    color[v] ← WHITE
  endFor
  time ← 0
  foreach vertex v do
    if (color[v] == WHITE) then
      DFS(v)
  endFor
}

Semi-External DFS (1)

Key observation: If the bit-array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.

• Bit array of nodes (visited or not)

• Array of successors (stack of the DFS-recursion)

What about millions/billions of nodes?

Key observation: A forest is a DFS forest if and only if there are no FORWARD CROSS edges among the non-tree edges.

Algorithm: incrementally construct a tentative DFS forest which minimizes the number of such edges, in passes...

Another Semi-External DFS (3)

• Bit array of nodes (visited or not)

• Array of successors (stack of the DFS-recursion)

Key assumption: We assume that 2n edges, and the auxiliary data structures, fit in memory.

Observing Web Graph

We do not know which percentage of it we know

The only way to discover the graph structure of the web as hypertext is via large scale crawls

Warning: the picture might be distorted by:
Size limitations of the crawl
Crawling rules
Perturbations of the "natural" process of birth and death of nodes and links

Why is it interesting?

Largest artifact ever conceived by humans
Exploit the structure of the Web for:
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization
Predicting the evolution of the Web
Sociological understanding

Many other large graphs…
Physical network graph: V = routers, E = communication links
The "cosine" graph (undirected, weighted): V = static web pages, E = semantic distance between pages
Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q and has been clicked by some user who issued q
Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address book, email, ...)

The size of the web

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading 19.5

What is the size of the web ?

Issues:
The web is really infinite

Dynamic content, e.g., calendar

Static web contains syntactic duplication, mostly due to mirroring (~30%)

Some servers are seldom connected

Who cares?
Media, and consequently the users
Engine design

What can we attempt to measure?

The relative sizes of search engines
Document extension: e.g., engines index pages not yet crawled, by indexing their anchor-text
Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)
The coverage of a search engine relative to another particular crawling process

Relative Size from Overlap
Given two engines A and B:
Sample URLs randomly from A and check if they are contained in B, and vice versa.
Each test involves: (i) Sampling, (ii) Checking.
Example: if A ∩ B = (1/2) * Size(A) and A ∩ B = (1/6) * Size(B), then (1/2) * Size(A) = (1/6) * Size(B), hence Size(A) / Size(B) = (1/6)/(1/2) = 1/3.
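A small simulation of this overlap estimate on synthetic URL sets (names and sizes are made up so that the true ratio is 1/3):

```python
# Estimate |A| / |B| from the overlap: sample URLs from A and check their
# containment in B, and vice versa; then |A|/|B| ≈ frac_B_in_A / frac_A_in_B.
import random

# Synthetic "indexes": B is three times larger than A, half of A lies in B.
A = {f"http://site-{i}.example/" for i in range(10_000)}
B = {f"http://site-{i}.example/" for i in range(5_000, 35_000)}

def containment_fraction(sample_from, check_in, n=2_000):
    sample = random.sample(sorted(sample_from), n)
    return sum(url in check_in for url in sample) / n

frac_A_in_B = containment_fraction(A, B)   # ≈ |A ∩ B| / |A| = 1/2
frac_B_in_A = containment_fraction(B, A)   # ≈ |A ∩ B| / |B| = 1/6
print("estimated Size(A)/Size(B) =", frac_B_in_A / frac_A_in_B)   # ≈ 1/3
```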

Sampling URLs

Ideal strategy: Generate a random URL and check for containment in each index.

Problem: Random URLs are hard to find!

Approach 1: Generate a random URL contained in a given engine

Suffices for the estimation of relative size

Approach 2: Random walks or random IP addresses
In theory: might give us a true estimate of the size of the web (as opposed to just the relative sizes of the indexes)

Random URLs from random queries

Generate a random query: how?
Lexicon: 400,000+ words from a web crawl
Conjunctive queries: w1 AND w2, e.g., vocalists AND rsi
Get 100 result URLs from engine A
Choose a random URL as the candidate to check for presence in engine B (next slide)
This distribution induces a probability weight W(p) for each page.
Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|
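A toy sketch of this procedure; since real query APIs are beyond the scope of the slides, engine A is simulated here as a tiny inverted index (word → set of URLs), and all data is made up:

```python
# Pick a random conjunctive query from a lexicon, run it on "engine A"
# (a toy inverted index), and choose one random result URL as the candidate.
import random

lexicon = ["vocalists", "rsi", "crawler", "tuscany", "anchovy"]

# Toy inverted index standing in for engine A: word -> set of result URLs.
engine_a = {
    "vocalists": {"http://a.example/1", "http://a.example/2"},
    "rsi":       {"http://a.example/2", "http://a.example/3"},
    "crawler":   {"http://a.example/3"},
    "tuscany":   {"http://a.example/1", "http://a.example/4"},
    "anchovy":   {"http://a.example/4"},
}

def random_result_url(index, max_results=100):
    w1, w2 = random.sample(lexicon, 2)    # random conjunctive query w1 AND w2
    results = sorted(index.get(w1, set()) & index.get(w2, set()))[:max_results]
    return (w1, w2), (random.choice(results) if results else None)

query, candidate = random_result_url(engine_a)
print("query:", query, "-> candidate URL to check in engine B:", candidate)
```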

Query-based checking

Strong query to check whether an engine B has a document D:
Download D and get its list of words
Use 8 low-frequency words as an AND query to B
Check if D is present in the result set
Problems: near duplicates, redirects, engine time-outs. Is an 8-word query good enough?
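A sketch of the strong-query check along the same lines, with engine B again simulated as a toy inverted index and word frequencies taken from a made-up frequency table:

```python
# Strong-query check: take the 8 lowest-frequency words of document D and
# verify that D appears in the AND-query result set of engine B.
from collections import Counter

def strong_query_check(doc_url, doc_words, engine_b_index, global_freq, k=8):
    # Pick the k rarest words of D according to a global frequency table.
    rare = sorted(set(doc_words), key=lambda w: global_freq.get(w, 0))[:k]
    # AND query on engine B: intersect the posting lists of the rare words.
    results = None
    for w in rare:
        postings = engine_b_index.get(w, set())
        results = postings if results is None else results & postings
    return doc_url in (results or set())

# Toy data (made up): one document and a tiny inverted index for "engine B".
doc_url   = "http://a.example/2"
doc_words = ["the", "vocalists", "suffered", "rsi", "injuries"]
engine_b  = {"vocalists": {doc_url}, "rsi": {doc_url}, "suffered": {doc_url},
             "injuries": {doc_url}, "the": {doc_url, "http://b.example/9"}}
freqs     = Counter({"the": 10_000, "suffered": 40, "injuries": 30,
                     "vocalists": 5, "rsi": 2})
print(strong_query_check(doc_url, doc_words, engine_b, freqs))   # True
```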

Advantages & disadvantages

Statistically sound under the induced weight.
Biases induced by the random queries:
Query bias: favors content-rich pages in the language(s) of the lexicon
Ranking bias (solution: use conjunctive queries & fetch all results)
Checking bias: duplicates
Query restriction bias: the engine might not deal properly with an 8-word conjunctive query
Malicious bias: sabotage by the engine
Operational problems: time-outs, failures, engine inconsistencies, index modification

Random IP addresses

Generate random IP addresses

Find a web server at the given address, if there is one
Collect all pages from that server; from these, choose a page at random
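A minimal sketch of the IP-sampling step: draw random IPv4 addresses and test whether anything answers on port 80 (a real study would also skip reserved ranges and then crawl the responding host, both omitted here):

```python
# Sample random IPv4 addresses and check whether a web server answers on
# port 80 within a short timeout.
import random
import socket

def random_ipv4():
    # Naive sampling over the whole IPv4 space; a real study would skip
    # reserved and private ranges.
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def has_web_server(ip, timeout=1.0):
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

hits = [ip for ip in (random_ipv4() for _ in range(20)) if has_web_server(ip)]
print("responding hosts:", hits)
```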

Advantages & disadvantages
Advantages: clean statistics; independent of crawling strategies
Disadvantages:
Many hosts might share one IP, or might not accept requests
No guarantee that all pages are linked to the root page, and thus can be collected
The power law for # pages/host generates a bias towards sites with few pages

Random walks

View the Web as a directed graph; build a random walk on this graph
Includes various "jump" rules back to visited sites: does not get stuck in spider traps, and can follow all links!
Converges to a stationary distribution
Must assume the graph is finite and independent of the walk; these conditions are not satisfied (many traps…)
Time to convergence is not really known

Sample from stationary distribution of walk
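A toy sketch of such a walk on a small directed graph: with some probability the walker "jumps" back to an already visited node, so it never gets stuck in a trap; visit frequencies approximate the stationary distribution one would sample from. Graph and parameters are illustrative:

```python
# Random walk with jumps on a directed graph: with probability `jump` (or when
# stuck in a trap) teleport to a random already-visited node, otherwise follow
# a random out-link; visit counts approximate the stationary distribution.
import random
from collections import Counter

def random_walk(adj, start, steps=100_000, jump=0.15):
    visits, visited, u = Counter(), [start], start
    for _ in range(steps):
        visits[u] += 1
        out = adj.get(u, [])
        if not out or random.random() < jump:
            u = random.choice(visited)        # "jump" rule: never gets stuck
        else:
            u = random.choice(out)            # follow a random out-link
            if u not in visited:
                visited.append(u)
    total = sum(visits.values())
    return {node: round(count / total, 3) for node, count in visits.items()}

# Toy web graph with a spider trap at node "d" (no out-links).
adj = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
print(random_walk(adj, start="a"))
```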

Advantages & disadvantages

Advantages: "statistically clean" method, at least in theory!
Disadvantages:
The list of seeds is a problem
The practical approximation might not be valid
Non-uniform distribution

Subject to link spamming

Conclusions

No sampling solution is perfect.
Lots of new ideas... but the problem is getting harder.
Quantitative studies are fascinating and a good research problem.

The web-graph: storage

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Reading 20.4

Definition

Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN & no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, with α ≈ 2.1

The In-degree distribution

[Plots: in-degree distribution on the Altavista crawl (1999) and the WebBase crawl (2001)]
In-degree follows a power-law distribution: Pr[in-degree(u) = k] ∝ 1/k^α, with α ≈ 2.1
This also holds for: out-degree, size of the components, ...
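A quick numerical illustration of how heavy this tail is, normalizing 1/k^2.1 over k = 1..10^6 (the cut-off is an arbitrary choice for the example):

```python
# Tail weight of a power law Pr[in-degree = k] ∝ 1/k^2.1, truncated at 10^6.
alpha, kmax = 2.1, 10**6
weights = [k ** -alpha for k in range(1, kmax + 1)]
Z = sum(weights)                      # normalization constant (≈ ζ(2.1) ≈ 1.56)

def tail(threshold):                  # Pr[in-degree >= threshold]
    return sum(weights[threshold - 1:]) / Z

print("Pr[in-degree >= 10]  ≈", round(tail(10), 4))    # ≈ 0.049: a heavy tail
print("Pr[in-degree >= 100] ≈", round(tail(100), 4))   # ≈ 0.004
```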

Definition

Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v
Isolated URLs are ignored (no IN, no OUT)
Three key properties:
Skewed distribution: the probability that a node has x links is ∝ 1/x^α, with α ≈ 2.1
Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%)
Similarity: pages that are close in lexicographic order of their URLs tend to share many outgoing links

A Picture of the Web Graph

[Figure: a snapshot of the web graph: 21 million pages, 150 million links]

URL-sorting

[Figure: list of URLs sorted lexicographically, e.g. Stanford and Berkeley pages cluster together]

Front-compression of URLs + delta encoding of IDs

Front-coding
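A minimal sketch of front-coding on a sorted URL list: each URL is stored as the length of the prefix shared with the previous URL plus the remaining suffix (the URLs are made up):

```python
# Front-coding of a sorted URL list: (shared-prefix length, suffix) pairs.
import os

def front_encode(sorted_urls):
    encoded, prev = [], ""
    for url in sorted_urls:
        lcp = len(os.path.commonprefix([prev, url]))   # longest common prefix
        encoded.append((lcp, url[lcp:]))
        prev = url
    return encoded

def front_decode(encoded):
    urls, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        urls.append(prev)
    return urls

urls = sorted([
    "http://www.berkeley.edu/",
    "http://www.berkeley.edu/admissions",
    "http://www.stanford.edu/",
    "http://www.stanford.edu/research",
])
enc = front_encode(urls)
print(enc)                  # the stored suffixes are much shorter than the URLs
assert front_decode(enc) == urls
```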

The library WebGraph

Successor list S(x) = {s1, s2, ..., sk} is gap-encoded as {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
For negative entries (only the first gap s1 - x can be negative, since the list is sorted): map v >= 0 to 2v and v < 0 to 2|v| - 1

[Table: uncompressed adjacency lists vs. adjacency lists with compressed gaps (exploiting locality)]
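A small sketch of this gap encoding of a successor list, using the sign-folding map above for the (possibly negative) first gap; the example list is a toy one:

```python
# Gap-encode the successor list of node x: the first entry is s1 - x (mapped
# to a non-negative integer, since it may be negative), the following ones
# are the gaps between consecutive successors minus 1.
def to_nonneg(v):
    return 2 * v if v >= 0 else 2 * abs(v) - 1   # fold the sign into parity

def from_nonneg(n):
    return n // 2 if n % 2 == 0 else -(n + 1) // 2

def encode_gaps(x, successors):                  # successors must be sorted
    gaps = [to_nonneg(successors[0] - x)]
    gaps += [b - a - 1 for a, b in zip(successors, successors[1:])]
    return gaps

def decode_gaps(x, gaps):
    succ = [x + from_nonneg(gaps[0])]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]     # toy successor list of x = 15
gaps = encode_gaps(15, succ)
print(gaps)                                      # [3, 1, 0, 0, 0, 0, 3, 0, 178]
assert decode_gaps(15, gaps) == succ
```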

Copy-lists

[Table: uncompressed adjacency lists vs. adjacency lists with copy lists (exploiting similarity)]
Each bit of y's copy list tells whether the corresponding successor of the reference x is also a successor of y.
The reference index is chosen in [0, W] as the one giving the best compression.
Reference chains are possibly limited in length.
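A toy sketch of how a copy list is built and expanded: one bit per successor of the reference x, with the successors of y not covered by x kept aside as extra nodes (node IDs are illustrative):

```python
# Copy list of y w.r.t. its reference x: bit i is 1 iff the i-th successor of
# x is also a successor of y; successors of y not covered by x are "extras".
def make_copy_list(ref_succ, succ):
    succ_set, ref_set = set(succ), set(ref_succ)
    bits = [1 if s in succ_set else 0 for s in ref_succ]
    extras = [s for s in succ if s not in ref_set]
    return bits, extras

def expand_copy_list(ref_succ, bits, extras):
    copied = [s for s, b in zip(ref_succ, bits) if b == 1]
    return sorted(copied + extras)

ref_succ = [13, 15, 16, 17, 18, 19, 23, 24, 203]   # successors of reference x
succ     = [15, 16, 17, 22, 23, 24, 315]           # successors of node y
bits, extras = make_copy_list(ref_succ, succ)
print(bits)     # [0, 1, 1, 1, 0, 0, 1, 1, 0]
print(extras)   # [22, 315]
assert expand_copy_list(ref_succ, bits, extras) == succ
```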

Copy-blocks = RLE(Copy-list)

[Table: adjacency lists with copy lists vs. adjacency lists with copy blocks (RLE of the bit sequences)]
The first copy block is 0 if the copy list starts with 0;
The last block is omitted (we know the total length…);
The length is decremented by one for all blocks except the first.
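A sketch of the run-length encoding of a copy list into copy blocks, following the rules above; the bit string reuses the copy-list example, and details of the actual WebGraph format may differ:

```python
# RLE a copy list into copy blocks: runs alternate 1s/0s starting with 1s, so
# the first block is 0 if the list starts with 0; the last run is omitted and
# every length after the first is stored decremented by one.
from itertools import groupby

def to_copy_blocks(bits):
    runs, expected = [], 1                     # runs must alternate 1, 0, 1, ...
    for value, group in groupby(bits):
        if value != expected:                  # only possible at the start
            runs.append(0)
            expected = 1 - expected
        runs.append(len(list(group)))
        expected = 1 - expected
    runs = runs[:-1]                           # the last block is omitted
    if not runs:                               # the copy list was a single run
        return []
    return [runs[0]] + [r - 1 for r in runs[1:]]

def from_copy_blocks(blocks, total_len):
    runs = ([blocks[0]] + [b + 1 for b in blocks[1:]]) if blocks else []
    bits, value = [], 1
    for r in runs:
        bits += [value] * r
        value = 1 - value
    bits += [value] * (total_len - len(bits))  # re-create the omitted last run
    return bits

bits = [0, 1, 1, 1, 0, 0, 1, 1, 0]             # copy list from the example above
blocks = to_copy_blocks(bits)
print(blocks)                                  # [0, 0, 2, 1, 1]
assert from_copy_blocks(blocks, len(bits)) == bits
```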


Extra-nodes: Compressing Intervals

[Table: adjacency lists with copy blocks vs. final adjacency lists with intervals and residuals; runs of consecutive IDs among the extra-nodes become intervals]
Examples of the encoded values:
0 = (15 - 15) * 2 (positive)
2 = (23 - 19) - 2 (jump >= 2)
600 = (316 - 16) * 2
12 = (22 - 16) * 2 (positive)
3018 = 3041 - 22 - 1 (difference)

Intervals: use their left extreme and length

Interval length: decremented by Lmin = 2
Residuals: coded as differences between consecutive residuals, or from the source node (for the first one)
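A simplified sketch of this last step: runs of consecutive IDs among the extra nodes (length >= Lmin = 2) become intervals stored as (left extreme, length - Lmin), and what is left over are the residuals; the actual WebGraph coder further gap-encodes the left extremes and residuals as in the rules above. The extra-node list here is a toy one:

```python
# Split the extra nodes into intervals of consecutive IDs (length >= Lmin) and
# leftover residuals; each interval keeps (left extreme, length - Lmin).
LMIN = 2

def split_intervals(extra_nodes, lmin=LMIN):
    intervals, residuals = [], []
    run = [extra_nodes[0]]
    for v in extra_nodes[1:] + [None]:         # the sentinel flushes the last run
        if v is not None and v == run[-1] + 1:
            run.append(v)
            continue
        if len(run) >= lmin:
            intervals.append((run[0], len(run) - lmin))
        else:
            residuals.extend(run)
        run = [v]
    return intervals, residuals

extras = [15, 16, 17, 22, 23, 316, 317, 3041]  # toy extra-node list
intervals, residuals = split_intervals(extras)
print(intervals)   # [(15, 1), (22, 0), (316, 0)]
print(residuals)   # [3041]
```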

This is a Java and C++ library (≈ 3 bits/edge).