
The Anatomy of a Large-Scale Hypertextual Web Search Engine:

Google

Bakkaloglu, Mehmet

Google (a common spelling of 'googol')

• googol: 10^100, or 1 followed by 100 zeros

• googolplex: 10^googol = 10^(10^100), or 1 followed by a googol zeros

The name 'googol' was invented in the 1930s by a child (the mathematician Edward Kasner's nine-year-old nephew) who was asked to think up a name for a very big number, namely 1 with a hundred zeros after it.

General Architecture of a Search Engine

• Spider (Crawler) – gathering information

• Indexer – analyzing information

• Searcher – displaying information (ranking)

What makes Ranking difficult:

• The Web is not well controlled (it is not like a closed Information Retrieval system): anyone can publish anything they want

A word can be repeated many times (if frequency is one of the ranking criteria then this is bad)

Metadata may be abused

• “Cloaking”: a website returns altered web pages to a search engine accessing the site, usually to distort search engine rankings

Sub-Optimal Ranking Methods:

• manually maintain a list!

• simply return the document that is closest to the query

What information does Google use in ranking Web Pages?

• Link Structure (PageRank)

IR (Information Retrieval) Measures:

• Anchor Text

• Font (relative to the rest of the page), Capitalization, Position in Page

• Plain Hits vs. Fancy Hits (URL, title, anchor text, meta tag)

• Location information of different hits (proximity)

Link Structure (PageRank)

• The idea behind PageRank is citation counting

• Count the number of links pointing to a page,

• But place different importance levels on each link (e.g. a link from Yahoo vs. a link from a personal web page)

How does PageRank actually work?

Markov Chains:

Limiting probability of a page ~ Probability that a surfer will visit a page

[Diagram: a random surfer moving among three linked pages A, B, C]

PageRank Example:

[Diagram: A links to B and to C (probability 1/2 each), B links to C (probability 1), C links to A (probability 1)]

Equations:

P(A)=P(C)

P(B)=(1/2)*P(A)

P(C)=(1/2)*P(A)+P(B)

P(A)+P(B)+P(C)=1

Limiting probabilities:

P(A)=0.4, P(B)=0.2, P(C)=0.4
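These limiting probabilities can be checked numerically. Below is a minimal sketch (plain Python, not Google's code) that iterates the Markov chain above; the transition probabilities are read off the diagram and equations:

    # Power iteration on the three-page example (A, B, C).
    # Links: A -> B (1/2), A -> C (1/2), B -> C (1), C -> A (1).
    P = {
        'A': {'B': 0.5, 'C': 0.5},
        'B': {'C': 1.0},
        'C': {'A': 1.0},
    }

    rank = {page: 1 / 3 for page in P}          # start from the uniform distribution

    for _ in range(100):                        # repeat: pi <- pi * P
        new_rank = {page: 0.0 for page in P}
        for src, out_links in P.items():
            for dst, prob in out_links.items():
                new_rank[dst] += rank[src] * prob
        rank = new_rank

    print(rank)   # approaches {'A': 0.4, 'B': 0.2, 'C': 0.4}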

Problem with this approach:

[Diagram: as before, A links to B and to C (probability 1/2 each) and C links to A, but B now links only to itself]

Equations:

P(A)=P(C)

P(B)=(1/2)*P(A)+P(B)

P(C)=(1/2)*P(A)

P(A)+P(B)+P(C)=1


Limiting probabilities:

P(A)=0, P(B)=1, P(C)=0

This is no good!!! (B is a "rank sink": once the surfer reaches B there is no way out, so all of the probability piles up there.)

Solution: Use a Damping Factor

[Diagram: the same three-page graph, with each link probability multiplied by d and an extra random-jump edge of probability (1-d)/3 from every page to every page]

Solution: Use a Damping Factor (continued)

P(C)= [(1-d)/3]*[P(A)+P(B)+P(C)] + d*[(1/2)*P(A)]

Rationale:

The user follows links for a while, then gets bored and jumps to a random page

Question:

How should we apply the damping factor?

Equally to all pages or more heavily to a subset of pages?

P(C)= [(1-d)/3] + d*[(1/2)*P(A)]   (using P(A)+P(B)+P(C)=1)

In general:

P(X) = (1-d)/N + d*[P(T1)/C(T1) + … + P(Tm)/C(Tm)]

where T1…Tm are the pages linking to X, C(T) is the number of links going out of T, and N is the total number of pages
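To make the update rule concrete, here is a minimal PageRank sketch (the graph and d = 0.85 are illustrative assumptions, not values from the paper):

    # links[p] is the list of pages p points to; C(T) = len(links[T]).
    def pagerank(links, d=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - d) / n for p in pages}         # random-jump term (1-d)/N
            for src, out in links.items():
                for dst in out:
                    new_rank[dst] += d * rank[src] / len(out)  # d * P(T)/C(T)
            rank = new_rank
        return rank

    # The earlier "rank sink" graph: with damping, B no longer absorbs everything.
    print(pagerank({'A': ['B', 'C'], 'B': ['B'], 'C': ['A']}))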

On a typical workstation each iteration takes ~ 6 min.

“The PageRank Citation Ranking: Bringing Order to the Web”

Copy of paper available at:

http://citeseer.nj.nec.com/368196.html

Anchor Text

Idea:

• Links provide information about the pages they are pointing to

• Also allows the inclusion of documents:

which have links pointing to them but which cannot be crawled

e.g.: images, programs, databases

(cannot be indexed by text-based search engines)
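One simple way to exploit this idea is to index the words of each link's anchor text under the page the link points to; the sketch below does exactly that (the URLs and link data are made-up examples):

    # Index anchor text under the *target* page, so even uncrawlable targets
    # (images, programs, databases) become searchable by the words used to link to them.
    from collections import defaultdict

    anchor_index = defaultdict(set)   # word -> set of target URLs

    links = [  # (source page, target URL, anchor text) -- illustrative data
        ("pageA.html", "http://example.com/logo.png", "company logo"),
        ("pageB.html", "http://example.com/logo.png", "our logo"),
    ]

    for src, target, anchor in links:
        for word in anchor.lower().split():
            anchor_index[word].add(target)

    print(anchor_index["logo"])   # {'http://example.com/logo.png'}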

General Architecture of a Search Engine

• Spider (Crawler) – gathering information

• Indexer – analyzing information

• Searcher – displaying information

Main Concerns of Google: Fast and Space-Efficient

Architecture:

Distributed Crawling

Barrels: Forward vs. Inverted Index

Forward: Partially sorted (each barrel holds a word range)

Inverted: Sorted

• Two steps (for performance reasons??)

• Is using word ranges the best solution, or should the barrels be balanced based on popularity (when searching)?
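To illustrate the two index forms, here is a toy sketch (made-up documents, not Google's on-disk barrel format):

    # Forward index: for each document, the words it contains and their hit positions.
    from collections import defaultdict

    forward_index = {
        "doc1": {"search": [3], "engine": [4, 17]},
        "doc2": {"engine": [1], "google": [2]},
    }

    # Inverting it: for each word, the documents (and hit positions) that contain it.
    inverted_index = defaultdict(dict)
    for doc_id, words in forward_index.items():
        for word, positions in words.items():
            inverted_index[word][doc_id] = positions

    print(inverted_index["engine"])   # {'doc1': [4, 17], 'doc2': [1]}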

Inverted Barrels

Sort by docID or sort by ranking?

Hybrid solution: use 2 sets of barrels

• One for title and anchor hits (they have more importance than plain hits)

• and one for all hits

Hit Lists

• Capitalization

• Font Size

• Position

No Color Information!!!

Question: How much effect does each of these properties have on the ordering of web pages? (i.e. what is the trade-off in using them?)
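For concreteness, here is one way the per-hit properties above (capitalization, font size, position) could be packed into a compact two-byte record; the paper describes a similar layout, but the exact field widths below are illustrative:

    # 1 bit capitalization | 3 bits font size | 12 bits word position
    def pack_hit(capitalized, font_size, position):
        assert 0 <= font_size < 8 and 0 <= position < 4096
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack_hit(hit):
        return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

    hit = pack_hit(capitalized=True, font_size=3, position=42)
    print(unpack_hit(hit))   # (True, 3, 42)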

How often should Google’s database be updated?

Well, there were some limitations (back in 1998):

• Crawling 26 million pages takes ~ 9 days

• Indexing 24 million pages takes ~ 5 days

• Sorting them takes ~ 1 day

• Plus PageRanking

In reality, Google was updated roughly every 1 to 4 weeks

Incremental Updating??

Smart algorithms to decide which pages should be crawled

(or Cooperation from Web Servers)

http://searchenginewatch.com/reports/sizes.html
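One possible "smart" recrawl policy (purely illustrative, not something described in the paper) is to prioritize pages by estimated importance and by how stale they are:

    import heapq, time

    def build_crawl_queue(pages):
        """pages: list of (url, pagerank, last_crawled_timestamp) tuples."""
        now = time.time()
        heap = []
        for url, rank, last_crawled in pages:
            priority = -rank * (now - last_crawled)   # negated: heapq is a min-heap
            heapq.heappush(heap, (priority, url))
        return heap

    queue = build_crawl_queue([
        ("http://example.com/news", 0.8, 1.0e9),
        ("http://example.com/archive", 0.1, 1.0e9),
    ])
    print(heapq.heappop(queue)[1])   # the high-PageRank page is recrawled first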

Improvements in Ranking

1) User Feedback

• User preferences (relevance)

e.g.: DirectHit (a system that measures what users click on from search results in order to refine relevancy rankings)

• Personalize PageRank by increasing the weights of users’ bookmarks (see the sketch after this list)

2) Use correlation information among different words?

(e.g.: "networks" vs. "computer networks")
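The bookmark idea can be sketched as a "personalized" variant of the earlier pagerank() function: the (1-d) random-jump mass is spread over the user's bookmarks instead of over all pages (the graph, bookmark set, and d = 0.85 below are illustrative assumptions):

    def personalized_pagerank(links, bookmarks, d=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        # Random jumps land only on bookmarked pages.
        jump = {p: (1 - d) / len(bookmarks) if p in bookmarks else 0.0 for p in pages}
        for _ in range(iterations):
            new_rank = dict(jump)
            for src, out in links.items():
                for dst in out:
                    new_rank[dst] += d * rank[src] / len(out)
            rank = new_rank
        return rank

    # B gets a boost because this user has bookmarked it.
    print(personalized_pagerank({'A': ['B', 'C'], 'B': ['A'], 'C': ['A']}, bookmarks={'B'}))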

Improvements in Ranking (Continued)

3) XML issues

HTML: <td width="20%" valign="top"><small><font face="Arial">Hamburg</font></small></td>

Code: <td width="20%" valign="top"><% = & " " & rs.fields("city") %></td>

XML: <City>Hamburg</City>
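The point of the XML form is that the field is trivially machine-readable; a small sketch of pulling it out with Python's standard library:

    import xml.etree.ElementTree as ET

    # The <City> element carries the data directly, with no presentation markup to strip.
    city = ET.fromstring("<City>Hamburg</City>").text
    print(city)   # Hamburg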

Is Google’s Ranking optimal?

Hmm… There are some bad examples:

At one time, a search for "What is more evil than Satan?" used to return Microsoft's home page

Any ideas about why this happened?

Sergey Brin: lots of sites point to Microsoft as evil