Page 1:

Web search engines

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Page 2:

Two main difficulties

The Web:
Size: more than tens of billions of pages

Language and encodings: hundreds…

Distributed authorship: SPAM, format-less content, …

Dynamic: after one year, about 35% of pages survive and only 20% are untouched

The User:
Query composition: short (2.5 terms on average) and imprecise

Query results: 85% of users look at just one result page

Several needs: informational, navigational, transactional

Extracting “significant data” is difficult!

Matching “user needs” is difficult!

Page 3:

Evolution of Search Engines

First generation (1995-1997: AltaVista, Excite, Lycos, etc.) -- use only on-page, web-text data
Word frequency and language

Second generation (1998: Google) -- use off-page, web-graph data
Link (or connectivity) analysis
Anchor text (how people refer to a page)

Third generation (Google, Yahoo, MSN, ASK, …) -- answer “the need behind the query”
Focus on the “user need”, rather than on the query
Integrate multiple data sources
Click-through data

Fourth generation -- Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research]

Page 4:
Page 5:
Page 6:
Page 7:


Page 8:
Page 9:
Page 10:

This is a search engine!!!

Page 11:

II° generation, III° generation, IV° generation

Page 12:

Quality of a search engine

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Reading 8

Page 13:

Is it good?

How fast does it index?
Number of documents/hour (average document size)

How fast does it search?
Latency as a function of index size

Expressiveness of the query language

Page 14:

Measures for a search engine

All of the preceding criteria are measurable

The key measure: user happiness…useless answers won’t make a user happy

Page 15:

Happiness: elusive to measure

The commonest approach is to measure the relevance of search results. How do we measure it?

Requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair

Page 16:

Evaluating an IR system

Standard benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years
Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant

On the Web everything is more complicated, since we cannot mark the entire corpus!

Page 17:

General scenario

[Venn diagram: the document collection, with two overlapping sets, Relevant and Retrieved]

Page 18:

Precision vs. Recall

Precision: % of retrieved docs that are relevant [issue: how much “junk” is found]
Recall: % of relevant docs that are retrieved [issue: how much “info” is found]

[Venn diagram: Relevant and Retrieved sets within the collection]

Page 19:

How to compute them

Precision: fraction of retrieved docs that are relevant
Recall: fraction of relevant docs that are retrieved

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

                 Relevant              Not Relevant
Retrieved        tp (true positive)    fp (false positive)
Not Retrieved    fn (false negative)   tn (true negative)
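A minimal sketch of these two formulas in Python (the function name is mine, not from the slides):

def precision_recall(tp, fp, fn):
    # Precision P = tp/(tp + fp), Recall R = tp/(tp + fn)
    return tp / (tp + fp), tp / (tp + fn)

# Example: 40 relevant docs retrieved, 10 junk docs retrieved, 20 relevant docs missed
P, R = precision_recall(tp=40, fp=10, fn=20)   # P = 0.8, R = 0.666...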

Page 20:

Some considerations

Can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

Precision usually decreases

Page 21:

Precision-Recall curve

We measure Precision at various levels of Recall. Note: it is an AVERAGE over many queries.

[Plot: precision (y-axis) vs. recall (x-axis), with measured points marked x]

Page 22:

A common picture

[Plot: precision (y-axis) vs. recall (x-axis), with measured points marked x]

Page 23:

F measure

Combined measure (weighted harmonic mean):

1/F = α (1/P) + (1 − α) (1/R)

People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R), that is F1 = 2PR/(P + R).

Use this if you need to optimize a single measure that balances precision and recall.
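A small sketch of the weighted and balanced F measure under these definitions (code and names are mine):

def f_measure(p, r, alpha=0.5):
    # Weighted harmonic mean: 1/F = alpha * (1/P) + (1 - alpha) * (1/R)
    return 1.0 / (alpha / p + (1.0 - alpha) / r)

# Balanced F1 (alpha = 1/2) reduces to 2PR / (P + R)
print(f_measure(0.8, 2/3))   # ~0.727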

Page 24:

The web-graph: properties

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Reading 19.1 and 19.2

Page 25:

The Web’s Characteristics

Size: 1 trillion pages available (Google, 7/08)

50 billion static pages; 5-40 KB per page => terabytes and terabytes; size grows every day!

Change: 8% new pages and 25% new links per week; page lifetime of about 10 days

Page 26:

The Bow Tie

Page 27:

Some definitions

Weakly connected components (WCC): a set of nodes such that from any node you can go to any other via an undirected path.
Strongly connected components (SCC): a set of nodes such that from any node you can go to any other via a directed path.

[Figure: SCC and WCC examples]

Page 28:

Find the CORE

Iterate the following process:
Pick a random vertex v
Compute all nodes reached from v: O(v)
Compute all nodes that reach v: I(v)
Compute SCC(v) := I(v) ∩ O(v)
Check whether it is the largest SCC found so far

If the CORE is about ¼ of the vertices, then after 20 iterations the probability of not finding the core is < 1% (given that the graph is available).
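A minimal sketch of this iteration in Python, using BFS for the two reachability sets (the dictionary-based graph representation and the names are my own assumptions):

from collections import deque
import random

def reachable(adj, v):
    # BFS over the edges in adj, starting from v
    seen, queue = {v}, deque([v])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def find_core(adj, radj, rounds=20):
    # adj = successors; radj = predecessors (the transposed graph)
    best, nodes = set(), list(adj)
    for _ in range(rounds):
        v = random.choice(nodes)
        scc_v = reachable(adj, v) & reachable(radj, v)   # O(v) ∩ I(v)
        if len(scc_v) > len(best):
            best = scc_v
    return best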

Page 29:

Compute SCCs

Classical algorithm (Kosaraju):
1) DFS(G)
2) Compute the transposed graph G^T
3) DFS(G^T), visiting vertices in decreasing order of the time their visit ended at step 1
4) Every tree of the resulting forest is an SCC

DFS is hard to compute on disk: no locality
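A compact in-memory sketch of this classical algorithm in Python (iterative DFS to avoid recursion limits; it assumes every node appears as a key of adj):

def kosaraju_sccs(adj):
    # Pass 1: order nodes by decreasing DFS finish time on G
    order, seen = [], set()
    for s in adj:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj[s]))]
        while stack:
            u, it = stack[-1]
            pushed = False
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, []))))
                    pushed = True
                    break
            if not pushed:
                order.append(u)   # u's visit has ended
                stack.pop()
    # Pass 2: DFS on the transposed graph, in decreasing finish time
    radj = {}
    for u, vs in adj.items():
        for v in vs:
            radj.setdefault(v, []).append(u)
    sccs, seen = [], set()
    for s in reversed(order):
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in radj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sccs.append(comp)   # every tree of this forest is an SCC
    return sccs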

Page 30:

DFS

DFS(u: vertex)
  color[u] ← GRAY
  d[u] ← time; time ← time + 1
  foreach v in succ[u] do
    if (color[v] == WHITE) then
      p[v] ← u
      DFS(v)
  endFor
  color[u] ← BLACK
  f[u] ← time; time ← time + 1

Classical approach:

main()
  foreach vertex v do
    color[v] ← WHITE
  endFor
  foreach vertex v do
    if (color[v] == WHITE) then DFS(v)
  endFor

Page 31:

Semi-External DFS

Key observation: if the bit-array fits in internal memory, then a DFS takes |V| + |E|/B disk accesses.

• Bit array of nodes (visited or not)

• Array of successors

• Stack of the DFS-recursion

Page 32:

What about millions/billions of nodes?

Key observation: A forest is a DFS forest if and only if there are no FORWARD CROSS edges among the non-tree edges

Algorithm? Incrementally construct a tentative DFS forest that minimizes the overall number of such edges, working in passes...

Page 33:

A Semi-External DFS

• Bit array of nodes (visited or not)

• Array of successors (stack of the DFS-recursion)

Key assumption: the 2n edges, plus the auxiliary data structures, fit in internal memory. Rearrange the nodes in the adjacency lists: the ones with large subtrees go to the front.

Page 34:

Observing Web Graph

We do not know which percentage of it we know

The only way to discover the graph structure of the web is via large scale crawls

Warning: the picture might be distorted by:
Size limitations of the crawl
Crawling rules
Perturbations of the “natural” process of birth and death of nodes and links

Page 35:

Why is it interesting?

The largest artifact ever conceived by humans

Exploit the structure of the Web for:
Crawl strategies
Search
Spam detection
Discovering communities on the web
Classification/organization

Predict the evolution of the Web
Sociological understanding

Page 36:

Many other large graphs…

Physical network graph
V = routers; E = communication links

The “cosine” graph (undirected, weighted)
V = static web pages; E = semantic distance between pages

Query-log graph (bipartite, weighted)
V = queries and URLs; E = (q,u) if u is a result for q and has been clicked by some user who issued q

Social graph (undirected, unweighted)
V = users; E = (x,y) if x knows y (Facebook, address book, email, …)

Page 37:

The size of the web

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Reading 19.5

Page 38:

What is the size of the web?

Issues:
The web is really infinite
Dynamic content, e.g., a calendar
The static web contains syntactic duplication, mostly due to mirroring (~30%)
Some servers are seldom connected

Who cares?
The media, and consequently the users
Engine design

Page 39:

What can we attempt to measure?

The relative sizes of search engines
Document extension: e.g., engines index pages not yet crawled, by indexing anchor text
Document restriction: all engines restrict what is indexed (first n words, only relevant words, etc.)

The coverage of a search engine relative to another particular crawling process.

Page 40:

Relative Size from Overlap
Given two engines A and B:

Sample URLs randomly from A, check if contained in B, and vice versa.
Each test involves: (i) sampling a URL, (ii) checking that URL.

If A ∩ B = (1/2) * Size A
and A ∩ B = (1/6) * Size B
then (1/2) * Size A = (1/6) * Size B,
so Size A / Size B = (1/6)/(1/2) = 1/3

Sec. 19.5
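A toy sketch of this estimator, assuming we already have a random-URL sampler and a membership checker for each engine (all four primitives passed in are hypothetical stand-ins):

def estimate_size_ratio(sample_a, in_b, sample_b, in_a, n=1000):
    # Fraction of A's sample found in B estimates |A ∩ B| / |A|
    frac_a_in_b = sum(in_b(sample_a()) for _ in range(n)) / n
    # Fraction of B's sample found in A estimates |A ∩ B| / |B|
    frac_b_in_a = sum(in_a(sample_b()) for _ in range(n)) / n
    # Size A / Size B = (|A ∩ B|/|B|) / (|A ∩ B|/|A|)
    return frac_b_in_a / frac_a_in_b

With the numbers above, frac_a_in_b = 1/2 and frac_b_in_a = 1/6 give a ratio of 1/3, as on the slide.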

Page 41:

Sampling URLs

Ideal strategy: Generate a random URL and check for containment in each index.

Problem: Random URLs are hard to find!

Approach 1: Generate a random URL surely contained in a given search engine

Approach 2: Random walks or random IP addresses

Page 42:

#1: Random URL in SE via random queries

Generate a random query:
Lexicon: 400,000+ words from a web crawl
Conjunctive queries: w1 AND w2 (e.g., vocalists AND rsi)

Get 100 result URLs from engine A.
Choose a random URL as the candidate to check for presence in search engine B (next slide).

This induces a probability weight W(p) for each page p.

Conjecture: W(SE_A) / W(SE_B) ≈ |SE_A| / |SE_B|

Page 43:

URL checking

Download the document D at the given URL and get its list of words. Use 8 low-frequency words as an AND query to B. Check if D is present in the result set.

Problems:
Near duplicates
Engine time-outs
Is an 8-word query good enough?
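A sketch of this check under stated assumptions: fetch, tokenize, and query_engine_b are hypothetical stand-ins, and the word frequencies come from some precomputed lexicon:

def url_check(url, lexicon_freq, fetch, tokenize, query_engine_b):
    # Download D and extract its words
    words = set(tokenize(fetch(url)))
    # Pick the 8 lowest-frequency words as a highly selective AND query
    rare = sorted(words, key=lambda w: lexicon_freq.get(w, 0))[:8]
    results = query_engine_b(" AND ".join(rare))
    # D is considered present in B if its URL shows up in the result set
    return url in results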

Page 44:

Advantages & disadvantages

Statistically sound under the induced weight.

Biases induced by the random queries:
Query bias: favors content-rich pages in the language(s) of the lexicon
Ranking bias [solution: use conjunctive queries & fetch all results]
Query restriction bias: the engine might not deal properly with an 8-word conjunctive query
Malicious bias: sabotage by the engine
Operational problems: time-outs, failures, engine inconsistencies, index modification

Page 45:

#2: Random IP addresses

Find a web server at the given IP address
If there’s one:
Collect all pages from that server
From these, choose a page at random

Page 46:

Advantages & disadvantages

Advantages:
Clean statistics
Independent of crawling strategies

Disadvantages:
Many hosts might share one IP, or not accept requests
No guarantee that all pages are linked to the root page, and thus can be collected
The power law for # pages/host generates a bias towards sites with few pages

Page 47:

Conclusions

No sampling solution is perfect.

Lots of new ideas... but the problem is getting harder

Quantitative studies are fascinating and a good research problem

Page 48:

The web-graph: storage

Paolo Ferragina
Dipartimento di Informatica

Università di Pisa

Reading 20.4

Page 49:

Definition

Directed graph G = (V,E):
V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN & no OUT links)

Three key properties:
Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1

Page 50:

The In-degree distribution

Altavista crawl, 1999; WebBase crawl, 2001

In-degree follows a power-law distribution:

Pr[in-degree(u) = k] ∝ 1/k^α, with α ≈ 2.1

This also holds for: out-degree, the size of components, ...
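The slides do not show how to fit the exponent, but a standard estimate of α from observed in-degrees is the continuous-approximation maximum-likelihood estimator; a small sketch:

import math

def estimate_alpha(degrees, k_min=1):
    # MLE for Pr[k] ∝ k^(-alpha): alpha ≈ 1 + n / Σ ln(k_i / (k_min - 0.5))
    ks = [k for k in degrees if k >= k_min]
    return 1.0 + len(ks) / sum(math.log(k / (k_min - 0.5)) for k in ks)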

Page 51:

Definition

Directed graph G = (V,E):
V = URLs, E = (u,v) if u has a hyperlink to v

Isolated URLs are ignored (no IN, no OUT)

Three key properties:
Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1

Locality: usually most of the hyperlinks point to other URLs on the same host (about 80%).

Similarity: pages close to each other in lexicographic order tend to share many outgoing links.

Page 52:

A Picture of the Web Graph

[Figure: dot-plot of the web adjacency matrix, axes i and j]

21 million pages, 150 million links

Page 53:

URL-sorting

[Figure: the adjacency plot after URL-sorting, with host blocks labeled Stanford and Berkeley]

Page 54:

Front-compression of URLs + delta encoding of IDs

Front-coding
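A minimal sketch of front-coding a sorted list of URLs (store, for each URL, the length of the prefix shared with the previous one plus the new suffix; code and names are mine):

def front_code(sorted_urls):
    out, prev = [], ""
    for u in sorted_urls:
        # Length of the longest common prefix with the previous URL
        lcp = 0
        while lcp < min(len(prev), len(u)) and prev[lcp] == u[lcp]:
            lcp += 1
        out.append((lcp, u[lcp:]))
        prev = u
    return out

def front_decode(coded):
    urls, prev = [], ""
    for lcp, suffix in coded:
        prev = prev[:lcp] + suffix
        urls.append(prev)
    return urls

# front_code(["http://a.com/x", "http://a.com/y"]) -> [(0, "http://a.com/x"), (13, "y")]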

Page 55:

The library WebGraph

Successor list S(x) = {s1, s2, ..., sk} is stored as the gap sequence {s1 − x, s2 − s1 − 1, ..., sk − s(k−1) − 1}

For negative entries (only the first gap can be negative), v ≥ 0 is mapped to 2v and v < 0 to 2|v| − 1.

[Figure: uncompressed adjacency list vs. adjacency list with compressed gaps (locality)]
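A small sketch of this gap encoding; the signed mapping for the first entry follows the rule above (code is mine):

def to_natural(v):
    # v >= 0 -> 2v (even); v < 0 -> 2|v| - 1 (odd)
    return 2 * v if v >= 0 else 2 * (-v) - 1

def encode_successors(x, succ):
    # succ is sorted increasingly; only the first gap can be negative
    gaps = [to_natural(succ[0] - x)]
    gaps += [succ[i] - succ[i - 1] - 1 for i in range(1, len(succ))]
    return gaps

def decode_successors(x, gaps):
    g = gaps[0]
    first = g // 2 if g % 2 == 0 else -(g + 1) // 2
    out = [x + first]
    for g in gaps[1:]:
        out.append(out[-1] + g + 1)
    return out

# encode_successors(15, [13, 15, 16]) -> [3, 1, 0]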

Page 56:

Copy-lists: close nodes share many successors

Each bit of the copy-list tells whether the corresponding successor of the reference x is also a successor of y;
the reference index is chosen in [0, W] so as to give the best compression.

[Figure: uncompressed adjacency list vs. adjacency list with copy-lists (similarity)]

Reference chains are possibly limited in length.
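A simplified sketch of encoding and decoding with a copy-list, assuming the reference's successor list is already decoded (this simplifies what the WebGraph library actually does; names are mine):

def encode_with_copy_list(succ, ref_succ):
    # One bit per successor of the reference: 1 if it is also a successor of y
    succ_set = set(succ)
    bits = [1 if s in succ_set else 0 for s in ref_succ]
    copied = {s for s, b in zip(ref_succ, bits) if b}
    extra = [s for s in succ if s not in copied]   # the "extra nodes"
    return bits, extra

def decode_with_copy_list(bits, extra, ref_succ):
    copied = [s for s, b in zip(ref_succ, bits) if b]
    return sorted(copied + extra)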

Page 57:

Copy-blocks = RLE(Copy-list)

The first copy-block is 0 if the copy-list starts with 0;
each RLE length is decremented by one (for all blocks except the first);
the last block is omitted (we know the total length from the outdegree).

[Figure: adjacency list with copy-lists vs. adjacency list with copy-blocks (RLE on bit sequences)]
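A sketch of turning a copy-list into copy-blocks under the three rules above (code is mine):

def copy_list_to_blocks(bits):
    # RLE with runs assumed to start with 1s: a leading 0-run yields a first block of 0
    runs, cur, length = [], 1, 0
    for b in bits:
        if b == cur:
            length += 1
        else:
            runs.append(length)
            cur, length = b, 1
    runs.append(length)
    # Decrement all lengths except the first; omit the last block
    blocks = [runs[0]] + [r - 1 for r in runs[1:]]
    return blocks[:-1]

# copy_list_to_blocks([0, 1, 1, 1, 0]) -> [0, 0, 2]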

Page 58:

Extra-nodes: Compressing Intervals

The extra-nodes exploit consecutivity:
Intervals: encoded by their left extreme and length
Interval length: decremented by Lmin = 2
Residuals: encoded as differences between consecutive residuals, or from the source node (for the first one)

[Figure: adjacency list with copy-blocks, before and after compressing the extra-nodes into intervals]

Worked examples from the slide:
0 = (15 − 15) * 2 (positive)
2 = (23 − 19) − 2 (jump >= 2)
600 = (316 − 16) * 2
12 = (22 − 16) * 2 (positive)
3018 = 3041 − 22 − 1 (difference)

This is a Java and C++ library (≈3 bits/edge)
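A small sketch of splitting the extra-nodes into intervals and residuals under the rules above (Lmin = 2; code is mine):

def split_intervals(extra, l_min=2):
    # Find maximal runs of consecutive integers; keep runs of length >= l_min as intervals
    intervals, residuals, i = [], [], 0
    while i < len(extra):
        j = i
        while j + 1 < len(extra) and extra[j + 1] == extra[j] + 1:
            j += 1
        if j - i + 1 >= l_min:
            # Encode as (left extreme, length decremented by l_min)
            intervals.append((extra[i], j - i + 1 - l_min))
        else:
            residuals.extend(extra[i:j + 1])
        i = j + 1
    return intervals, residuals

# split_intervals([15, 16, 17, 22, 316]) -> ([(15, 1)], [22, 316])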