
Search Engines and Engineering Searches

Khurshid Ahmad

Professor of Computer Science,

Department of Computer Science,

Trinity College, Dublin, IRELAND.

A Lecture for First Year Engineering Students (1E8)

April 2009



Search Engines and Engineering Searches: Preamble

• Key web pages:
  – https://www.cs.tcd.ie/Khurshid.Ahmad/Teaching.html
  – http://infolab.stanford.edu/~backrub/google.html
  – http://en.wikipedia.org/wiki/Page_rank
  – http://www.iprcom.com/papers/pagerank/
  – http://searchenginewatch.com/showPage.html?page=2167961


Search Engines and Engineering Searches: Preamble

• How does one collect texts from the ‘web’?
  – You find a search engine (Google);
  – Type in keywords or web site names;
  – Get thousands of web documents, texts, images, e-mails, travel agents ...
  – Select texts YOURSELVES;
  – Store texts;
  – Analyse texts.
• But this is feasible only for a few hundred thousand documents; what happens when millions of new documents are being created every week?
• The deluge of information is such that we need helpers/servants (robots) to do some of the spade work.
• THIS IS THE BURDEN OF MY TALK


Search Engines and Engineering Searches:Preamble

Whenever you search the web using a search engine, you're asking the engine to scan its index of sites and match your keywords and phrases with those in the texts of documents within the engine's database.

It is important to remember that when you are using a search engine, you are NOT searching the entire web as it exists at this moment. You are actually searching a portion of the web, captured in a fixed index created at an earlier date.
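The idea of searching a stored index rather than the live web can be sketched in a few lines of Python. This is only an illustration: the three pages, their addresses, and the names `build_index` and `search` are made up here, and a real engine's index is far larger and richer.

```python
from collections import defaultdict

# Hypothetical snapshot of three pages, captured at an earlier date.
SNAPSHOT = {
    "tcd.ie": "trinity college dublin research news",
    "example.org/a": "college courses and students",
    "example.org/b": "engineering research news",
}

def build_index(pages):
    """Map each word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the pages that contain every keyword of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

INDEX = build_index(SNAPSHOT)
print(sorted(search(INDEX, "trinity college")))
```

A page created after the snapshot was taken cannot be found, however well it matches, because the query only ever consults `INDEX`.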


Search Engines and Engineering Searches:Preamble

Spiders and (Ro)Bots: Every search engine employs programs, called robots or spiders, to ‘move’ or ‘crawl’ from web site to web site by following the links that the pages on one web site may have to other web sites.

Upon arrival at a web site, a spider or a robot usually notes all the ‘content’ words on the individual pages of the web site. Also, a web site owner may volunteer his or her web pages, by giving the search engine the Uniform Resource Locator (URL) to be crawled!
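The crawling behaviour described above can be sketched as a breadth-first walk over links. The miniature web below is invented for illustration (no network access is involved), and `crawl` is a hypothetical helper, not the code of any real spider.

```python
from collections import deque

# Hypothetical miniature web: each address maps to the addresses it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
    "island.example": ["a.example"],  # no other page links here
}

def crawl(links, seeds):
    """Visit pages breadth-first, following links from the seed addresses."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

print(crawl(LINKS, ["a.example"]))
```

In this sketch `island.example` is never reached from `a.example`, because nothing links to it; it is only visited if its owner volunteers the address, that is, if it is added to `seeds`.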


Search Engines and Engineering Searches

• The web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available.
  – Documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), and type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database).
• External meta information is information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations.

Sergey Brin & Lawrence Page (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems, Volume 30, pp. 107-117. (Available at citeseer.ist.psu.edu/brin98anatomy.html.)


Search Engines and Engineering Searches

Google search for “trinity college” leads to 1.57 million documents


Search Engines and Engineering Searches: Some Definitions

• HYPERLINK
  – A hyperlink, or simply a link, is a reference in a hypertext document to another document or other resource.
  – A hyperlink is similar to a citation in literature.
  – Combined with a data network and a suitable access protocol, a hyperlink can be used to fetch the resource referenced. This can then be saved, viewed, or displayed as part of the referencing document.
  – Hyperlinks are part of the foundation of the World Wide Web.
  – A link has two ends, called anchors, and a direction. The link starts at the source anchor and points to the destination anchor. However, the term link is often used for the source anchor, while the destination anchor is called the link target.

http://en.wikipedia.org/wiki/Hyperlink
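As an illustration of the two ends of a link, the following sketch pulls the destination and the visible source-anchor text out of a fragment of HTML using Python's standard `html.parser` module. The class name `LinkCollector` and the sample fragment are assumptions made for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (destination, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # destination of the link currently open, if any
        self._text = []     # pieces of visible text inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Made-up fragment of a page that points to www.tcd.ie.
SAMPLE = '<p>Visit <a href="http://www.tcd.ie">Trinity College</a> today.</p>'
collector = LinkCollector()
collector.feed(SAMPLE)
print(collector.links)
```

Each collected pair holds the link target first and the text of the source anchor second.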



Search Engines and Engineering Searches

On the Trinity College web page (www.tcd.ie) there are about 334 unique words used 902 times:

Words             Frequency   Relative Frequency
tcd                      58                 6.2%
trinity                  36                 3.9%
college                  34                 3.6%
and                      22                 2.4%
in                       21                 2.3%
news                     18                 1.9%
communications           17                 1.8%
research                 15                 1.6%
information              14                 1.5%
to                       12                 1.3%

My query “trinity college” matched the most frequent keywords on the page: trinity & college.
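A word-frequency table of this kind can be produced with a few lines of Python. The sketch below is an assumption about how such counts might be computed, not the tool used for the slide, and the sample text is made up rather than taken from www.tcd.ie.

```python
from collections import Counter

def word_frequencies(text):
    """Return (word, count, relative frequency in %) sorted by decreasing count."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return [(w, c, 100.0 * c / total) for w, c in counts.most_common()]

# Made-up text standing in for a page; it is not the www.tcd.ie page.
SAMPLE = "tcd news tcd research trinity college tcd trinity"
for word, count, percent in word_frequencies(SAMPLE):
    print(f"{word:10s} {count:3d} {percent:5.1f}%")
```

The relative frequency is simply the count of a word divided by the total number of word occurrences on the page.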


Search Engines and Engineering Searches

On the Trinity College web page (www.tcd.ie) there are about 334 unique words used 902 times:

Words             Frequency   Relative Frequency
the                      12                 1.3%
of                       12                 1.3%
strong                   12                 1.3%
students                 12                 1.3%
dublin                   12                 1.3%
side                     10                 1.1%
on                        9                 1.0%
for                       9                 1.0%
library                   7                 0.8%
courses                   7                 0.8%

And there are students and courses as well!!


Search Engines and Engineering Searches

When I typed “Trinity College” on Google, the system returned 1.57 Million pages and Trinity College, Dublin was ranked very high


Search Engines and Engineering Searches

When I typed “tcd” on Google, the system returned 4.4 million pages and Trinity College, Dublin was ranked first (and the abbreviation ‘tcd’ occurred 58 times on the web page pointed to, comprising 6.2% of all words on the page)!

Notice the advertisements under the ‘sponsored links’!


Search Engines and Engineering Searches

My query “trinity college” matched the most frequent keywords on the Trinity College web page (www.tcd.ie): trinity & college (see the word-frequency table shown earlier).

But there were other reasons for our web page to be ranked amongst the 10 most relevant pages for this query.


Search Engines and Engineering Searches

When I typed “Trinity College” on Google, the system returned 1.57 million pages and Trinity College, Dublin was ranked very high.

Given the very large number of web pages that have the same keywords that I have chosen, how does a program know which of the web pages to display amongst the first 10 pages and which of the web pages to place after the 1,569,990th web page?


Search Engines and Engineering Searches

When I typed “Trinity College” on Google, the system returned 1.57 million pages and Trinity College, Dublin was ranked very high.

Ranking Pages

Google computes the rank of a page, or loosely its relevance to an arbitrary end user, by (a) matching the keywords in the user's query to keywords in an index of all documents ‘crawled’ by Google; and (b) the ‘prestige’ or ‘authority’ of the pages that contain those keywords.

Given the very large number of web pages that have the same keywords that I have chosen, which of the web pages is more important?


Search Engines and Engineering Searches:Preamble

• Crawlers, a key component of a search engine enterprise, have additional indices: the keywords in anchor tags are given extra weight; the references to other web sites are taken into account (who cites whom); what font size was used (for emphasis or censoring); the position of the keywords in the document; etc.

The question the crawlers want an answer to is

‘HOW IMPORTANT IS THIS TEXT?’


Search Engines and Engineering Searches:Preamble

• Various search engine enterprises have programs that search for new websites and re-visit websites already visited.
  – The visits are for the purpose of examining documents on the web sites;
  – Retrieving documents and analysing the documents, for example,
    • by computing the occurrence of keywords and by identifying new keywords,
    • by capturing images and audio streams on the websites, and
    • by identifying references to other documents and noting the web address of the others if available;
  – Creating an index of the documents that comprises the results of the analysis.


Search Engines and Engineering Searches:Preamble

• Various search engine enterprises have programs that search for new websites and re-visit websites already visited. These programs are called crawlers. The enterprises also have programs for creating indices, or addresses, of documents based on keywords and references to other sites. The enterprise is supported by databases (of the addresses and terms).

Sergey Brin & Lawrence Page (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems, Volume 30, pp. 107-117. (Available at citeseer.ist.psu.edu/brin98anatomy.html.)

The diagram shows how the Google prototype crawled, indexed and searched.


Search Engines and Engineering Searches:Motivation

• The anchor texts (bits on the page pointing to others)



Search Engines and Engineering Searches:Motivation

Web pages point to each other: the more authoritative page has more links both to and from


Search Engines and Engineering Searches:Page Ranks

Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd (1998). The PageRank Citation Ranking: Bringing Order to the Web. (citeseer.ist.psu.edu/page98pagerank.html)


Search Engines and Engineering Searches:Some Definitions

Anchor text is the visible text in a hyperlink.

Anchor text gets a lot of weight in search engine algorithms because the linked text is usually relevant to the landing page.

The objective of search engines is to provide highly relevant search results; this is where anchor text helps as the tendency is, more often than not, to hyperlink words relevant to the landing page.

Usually this is exploited by webmasters to procure high results in SERPS (search engine results pages). The concept of Google Bombing was/is possible through anchor text manipulation. Much has been written on anchor text which is available on the web today.

Although the search engines are well aware of anchor text manipulation, not much change can be expected in the SE algorithms in the near future because the brighter side of the picture cannot be overlooked: anchor text delivers relevance.

http://en.wikipedia.org/wiki/Anchor_text


Search Engines and Engineering Searches

Anchoring web pages and overcoming anarchy

Google exploited the elements of a web page that are used to point to other web pages. The anchor text was used in a very novel way. Google’s initial documents cite two reasons for exploiting the anchors, in addition to the keyword frequency:

1. Sometimes, anchors have been found to provide more accurate descriptions of web pages than the pages themselves.

2. Anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases.

Amy Langville and Carl Meyer (2005). The Use of Linear Algebra by Web Search Engines. http://citeseer.ist.psu.edu/718721.html. (Bulletin of the International Linear Algebra Society, No. 33, Dec. 2005, pp. 2-6.)
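The second reason can be illustrated with a small sketch in which a destination is indexed under the words of the anchors that point to it, so that even an image becomes findable by keyword. The link triples and the function name `index_anchor_text` are hypothetical; this is not Google's implementation.

```python
from collections import defaultdict

# Hypothetical links: (source page, destination, anchor text).
LINKS = [
    ("a.example", "tcd.ie", "Trinity College Dublin"),
    ("b.example", "tcd.ie/crest.png", "Trinity College crest"),
    ("c.example", "other.example", "engineering courses"),
]

def index_anchor_text(links):
    """Index each destination under the words of the anchors pointing to it."""
    index = defaultdict(set)
    for _source, destination, anchor in links:
        for word in anchor.lower().split():
            index[word].add(destination)
    return index

ANCHOR_INDEX = index_anchor_text(LINKS)
print(sorted(ANCHOR_INDEX["trinity"]))
```

Here the image `tcd.ie/crest.png` contains no text at all, yet it is indexed under ‘trinity’, ‘college’ and ‘crest’ purely because of the words other pages used when linking to it.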


Search Engines and Engineering Searches:Motivation

Sergey Brin & Lawrence Page (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems, Volume 30, pp. 107-117. (Available at citeseer.ist.psu.edu/brin98anatomy.html.)

Neither of these students has submitted his PhD thesis yet, as both have been busy elsewhere.


Search Engines and Engineering Searches: Some definitions

• Hit Lists
  – A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible.

Sergey Brin & Lawrence Page (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems, Volume 30, pp. 107-117. (Available at citeseer.ist.psu.edu/brin98anatomy.html.)
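A hit list can be pictured with the following minimal sketch. It records only position and capitalization, because plain text carries no font information, and it makes no attempt at the efficient representation the definition calls for; the names `Hit` and `hit_list` are invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    position: int      # word offset within the document
    capitalized: bool  # whether the occurrence starts with a capital letter

def hit_list(text, word):
    """List every occurrence of `word` in `text` with position and capitalization."""
    hits = []
    for position, token in enumerate(text.split()):
        if token.strip(".,;:!?").lower() == word.lower():
            hits.append(Hit(position, token[:1].isupper()))
    return hits

# Made-up one-sentence document.
SAMPLE = "Trinity College is a college in Dublin."
print(hit_list(SAMPLE, "college"))
```

For the sample sentence the word ‘college’ yields two hits, one capitalized (word offset 1) and one not (word offset 4).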


Search Engines and Engineering Searches:

• Plain Hits and Fancy Hits:
  – Fancy Hits: Uniform Resource Locator (URL); Web Page Title; Anchor Text; Meta Tag;
  – Plain Hits: Word lists in text documents.

http://en.wikipedia.org/wiki/Hyperlink


Search Engines and Engineering Searches:

• PageRank and Plain Hits and Fancy Hits:
• In addition to plain and fancy hits, Google’s hit list contains information about the frequency of usage of the web pages cited (the authority of the web page/site), and there is much inference about the pragmatic intent of the author of the web page:
  – Font size; Position of the keywords;
  – Title of the page; Keywords in anchor

http://en.wikipedia.org/wiki/Hyperlink


Search Engines and Engineering Searches:

According to Brin and Page: The Google search engine has two important features that help it produce high precision results.

First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank[…].

Second, Google utilizes links to improve search results.


Search Engines and Engineering Searches:

Let us define the PageRank of a web page A as Pr(A) and assume that n other pages, T1, …, Tn, point to, or have a link with, page A.

Also assume that each page Ti has C(Ti) links in all, going out to all pages including page A.


Search Engines and Engineering Searches:

The PageRank of a web page A, Pr(A), is given in terms of the n other pages, T1, …, Tn, which point to A. Brin and Page outlined their iterative formula:

$$\mathrm{Pr}(A) = (1 - d) + d \left( \frac{\mathrm{Pr}(T_1)}{C(T_1)} + \frac{\mathrm{Pr}(T_2)}{C(T_2)} + \cdots + \frac{\mathrm{Pr}(T_n)}{C(T_n)} \right)$$

The damping factor d ensures that (a) the other pages (T1, …, Tn) have no undue influence on page A; and (b) through its complement, 1 − d, that even if no other page links to web page A, it still has a rank.
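One application of the formula can be written directly as a small function. This is a sketch: the function name `pagerank_update`, the representation of each citing page as a (rank, number of outgoing links) pair, the two citing pages in the example, and the default d = 0.85 are all assumptions made for illustration.

```python
def pagerank_update(incoming, d=0.85):
    """One application of Pr(A) = (1 - d) + d * sum(Pr(T_i) / C(T_i)).

    `incoming` holds one (rank, out_links) pair per page T_i that points to A.
    """
    return (1 - d) + d * sum(rank / out_links for rank, out_links in incoming)

# Hypothetical page A with two citing pages: T1 (rank 1.0, 2 outgoing links)
# and T2 (rank 0.5, 1 outgoing link).
print(pagerank_update([(1.0, 2), (0.5, 1)]))

# A page that no other page links to still receives 1 - d.
print(pagerank_update([]))
```

The second call shows point (b) above: with an empty list of citing pages the sum is zero and the rank is just 1 − d.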


Search Engines and Engineering Searches:Searching the Web

•Iteration: The repetition of an operation upon its product, as in finding the cube of a cube; esp. the repeated application of a formula devised to provide a closer approximation to the solution of a given equation when an approximate solution is substituted in the formula, so that a series of successively closer approximations may be obtained; a single application of such a formula; also, the formula itself.


Search Engines and Engineering Searches:Searching the Web

•Recursion formula: An equation relating the value of a function for a given value of its argument (or arguments) to its values for other values of the argument(s).


Search Engines and Engineering Searches:

Assume that pages A and T1, …, Tn are linked only to each other. Then the PageRank of web page T1, Pr(T1), is given in terms of the other pages T2, …, Tn and page A, using the same iterative formula of Brin and Page:

$$\mathrm{Pr}(T_1) = (1 - d) + d \left( \frac{\mathrm{Pr}(A)}{C(A)} + \frac{\mathrm{Pr}(T_2)}{C(T_2)} + \cdots + \frac{\mathrm{Pr}(T_n)}{C(T_n)} \right)$$


Search Engines and Engineering Searches:

The computation of the PageRank of interlinked pages is an iterative computation. In order to compute the PageRank of page A we have to compute the PageRank of pages T1, …, Tn:

$$\mathrm{Pr}(A) = (1 - d) + d \left( \frac{\mathrm{Pr}(T_1)}{C(T_1)} + \frac{\mathrm{Pr}(T_2)}{C(T_2)} + \cdots + \frac{\mathrm{Pr}(T_n)}{C(T_n)} \right)$$

And, in order to compute the PageRank of page T1 we have to compute the PageRank of pages A and T2, …, Tn:

$$\mathrm{Pr}(T_1) = (1 - d) + d \left( \frac{\mathrm{Pr}(A)}{C(A)} + \frac{\mathrm{Pr}(T_2)}{C(T_2)} + \cdots + \frac{\mathrm{Pr}(T_n)}{C(T_n)} \right)$$


Search Engines and Engineering Searches:

$$\frac{\mathrm{Pr}(A) + \mathrm{Pr}(T_1) + \mathrm{Pr}(T_2) + \cdots + \mathrm{Pr}(T_{n-1}) + \mathrm{Pr}(T_n)}{n} = 1$$

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.



Search Engines and Engineering Searches:

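Consider a simple example of two web pages, A and B, that are linked only to each other and as such have only one link each, so that C(A) = C(B) = 1:

$$\mathrm{Pr}(A) = (1 - d) + d \left( \frac{\mathrm{Pr}(B)}{C(B)} \right)$$

$$\mathrm{Pr}(B) = (1 - d) + d \left( \frac{\mathrm{Pr}(A)}{C(A)} \right)$$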


Search Engines and Engineering Searches:

Consider a simple example of two web pages, A and B, that are linked only to each other and as such have only one link each, so that C(A) = C(B) = 1; hence

$$\mathrm{Pr}(A) = (1 - d) + d \cdot \mathrm{Pr}(B)$$

$$\mathrm{Pr}(B) = (1 - d) + d \cdot \mathrm{Pr}(A)$$


Search Engines and Engineering Searches:

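Let us guess the value of Pr(A) and Pr(B) to be ONE, and assume that the damping factor d = 0.85:

$$\mathrm{Pr}(A) = (1 - d) + d \times 1$$

$$\mathrm{Pr}(B) = (1 - d) + d \times 1$$

or

$$\mathrm{Pr}(A) = (1 - 0.85) + 0.85 \times 1 = 0.15 + 0.85 = 1$$

$$\mathrm{Pr}(B) = (1 - 0.85) + 0.85 \times 1 = 0.15 + 0.85 = 1$$

and

$$(\mathrm{Pr}(A) + \mathrm{Pr}(B))/2 = 1$$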


Search Engines and Engineering Searches:

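But what if the value of Pr(A) and Pr(B) was guessed to be ZERO?

$$\mathrm{Pr}(A) = (1 - d) + d \times 0$$

$$\mathrm{Pr}(B) = (1 - d) + d \times \mathrm{Pr}(A)$$

or, using the freshly computed value of Pr(A) when computing Pr(B),

$$\mathrm{Pr}(A) = (1 - 0.85) + 0.85 \times 0 = 0.15$$

$$\mathrm{Pr}(B) = (1 - 0.85) + 0.85 \times 0.15 = 0.2775$$

BUT

$$(\mathrm{Pr}(A) + \mathrm{Pr}(B))/2 \neq 1$$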


Search Engines and Engineering Searches:

But what if the value of Pr(A) and Pr(B) was guessed to be ZERO?

Iteration   Pr(A)   Pr(B)
    1       0       0
    2       0.15    0.28
    3       0.39    0.48
    4       0.56    0.62
    5       0.68    0.73
    6       0.77    0.80
    7       0.83    0.86
    8       0.88    0.90
    9       0.91    0.93
   10       0.94    0.95
   11       0.95    0.96
   12       0.97    0.97
   13       0.98    0.98
   14       0.98    0.99
   15       0.99    0.99
   16       0.99    0.99
   17       0.99    0.99
   18       1.00    1.00
   19       1.00    1.00
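The iteration for the two-page example can be reproduced with a short loop. The sketch assumes, as the first few figures in the table suggest (0.15, then 0.2775), that Pr(B) is computed from the freshly updated Pr(A) within each iteration, and that iteration 1 simply holds the initial guess; the function name `iterate_two_pages` is made up.

```python
def iterate_two_pages(steps, start=0.0, d=0.85):
    """Return [(iteration, Pr(A), Pr(B))], with iteration 1 holding the guess.

    Pr(B) is computed from the freshly updated Pr(A), the update order
    assumed here for the two-page example with C(A) = C(B) = 1.
    """
    pr_a = pr_b = start
    rows = [(1, pr_a, pr_b)]
    for iteration in range(2, steps + 1):
        pr_a = (1 - d) + d * pr_b
        pr_b = (1 - d) + d * pr_a
        rows.append((iteration, pr_a, pr_b))
    return rows

for iteration, pr_a, pr_b in iterate_two_pages(19):
    print(f"{iteration:2d}  {pr_a:.2f}  {pr_b:.2f}")
```

Calling `iterate_two_pages(19, start=1.0)` instead shows the other case: a guess of ONE is already the solution, so the values do not move.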


Search Engines and Engineering Searches:Searching the Web

• Ian Rogers, a researcher in PageRank algorithms, has a very informative presentation available at http://www.iprcom.com/papers/pagerank/


Search Engines and Engineering Searches:Searching the Web

• PageRank, the second link analysis algorithm from 1998, is the heart of Google.

• Brin and Page use a recursive scheme similar to Kleinberg’s. Their original idea was that a page is important if it is pointed to by other important pages. That is, they decided that the importance of your page (its PageRank score) is determined by summing the PageRanks of all pages that point to yours.

• In building a mathematical definition of PageRank, Brin and Page also reasoned that when an important page points to several places, its weight (PageRank) should be distributed proportionately.

• In other words, if YAHOO! points to your Web page, that’s good, but you shouldn’t receive the full weight of YAHOO! because they point to many other places. If YAHOO! points to 999 pages in addition to yours, then you should only get credit for 1/1000 of YAHOO!’s PageRank.

Amy Langville and Carl Meyer (2005). The Use of Linear Algebra by Web Search Engines. http://citeseer.ist.psu.edu/718721.html. (Bulletin of the International Linear Algebra Society, No. 33, Dec. 2005, pp. 2-6.)
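The proportional sharing of a page's weight among the pages it points to can be seen in a small iterative computation over a whole graph. This is a sketch under stated assumptions: the four-page web, the function name `pagerank`, the starting guess of 1.0 for every page, the fixed number of iterations, and d = 0.85 are all chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate Pr(p) = (1 - d) + d * sum(Pr(q) / C(q)) over pages q linking to p.

    `links` maps every page to the list of pages it points to.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each citing page passes on its rank divided by its number of links.
            votes = sum(
                ranks[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) + d * votes
        ranks = new_ranks
    return ranks

# Hypothetical web: a hub that links to three pages, each of which links back.
WEB = {
    "hub": ["p1", "p2", "p3"],
    "p1": ["hub"],
    "p2": ["hub"],
    "p3": ["hub"],
}
for page, rank in sorted(pagerank(WEB).items()):
    print(f"{page}: {rank:.3f}")
```

Each of `p1`, `p2` and `p3` receives only a third of the hub's weight, while the hub collects the full weight of all three, so the hub ends up with the highest rank.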


Search Engines and Engineering Searches

•PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page.

• The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. Google calculates a page's importance from the votes cast for it. How important each vote is, is taken into account when a page's PageRank is calculated.

•PageRank is Google's way of deciding a page's importance. It matters because it is one of the factors that determines a page's ranking in the search results. It isn't the only factor that Google uses to rank pages, but it is an important one.

From Phil Craven’s web page at http://www.webworkshop.net/pagerank.html


Search Engines and Engineering Searches

• The Page Rank Algorithm:

• Prestige of a page is proportional to sum of prestige of citing pages

• Standard bibliometric measure of influence

• Simulate a random walk on the Web to precompute prestige of all pages

• Sort keyword-matched responses by decreasing prestige

http://www.cse.iitb.ac.in/~soumen/doc/emnlp2000/emnlp2000b.ppt
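The last step in the list above, sorting keyword-matched responses by decreasing prestige, can be sketched as follows. The pages, their word sets and their prestige values are invented for illustration, and `ranked_search` is a hypothetical helper; as noted elsewhere in these slides, a real engine combines prestige with other factors.

```python
# Hypothetical index entries: page -> (words on the page, precomputed prestige).
PAGES = {
    "tcd.ie": ({"trinity", "college", "dublin"}, 1.9),
    "fan.example": ({"trinity", "college", "photos"}, 0.4),
    "news.example": ({"college", "news"}, 1.2),
}

def ranked_search(pages, query):
    """Keep pages containing every query word, sorted by decreasing prestige."""
    words = set(query.lower().split())
    matches = [
        (prestige, url)
        for url, (page_words, prestige) in pages.items()
        if words <= page_words
    ]
    return [url for prestige, url in sorted(matches, reverse=True)]

print(ranked_search(PAGES, "trinity college"))
```

Both `tcd.ie` and `fan.example` match the query equally well on keywords; it is the precomputed prestige alone that places `tcd.ie` first.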


Search Engines and Engineering Searches: The end of innocence!

•A Google bomb or Google wash is an attempt to influence the ranking of a given site in results returned by the Google search engine. Due to the way that Google's PageRank algorithm works, a website will be ranked higher if the sites that link to that page all use consistent anchor text. Googlebomb is used both as a verb and a noun.

http://en.wikipedia.org/wiki/Google_Bomb