Behind the Scenes at a Search Engine William Denton <[email protected]> Web Librarian, York University Libraries 20 March 2008 http://www.library.yorku.ca/binaries/frontiers/20080320-denton-search-engine.ppt


Behind the Scenes at a Search Engine

William Denton <[email protected]>

Web Librarian, York University Libraries

20 March 2008

http://www.library.yorku.ca/binaries/frontiers/20080320-denton-search-engine.ppt

Denton: Search Engines / 20 March 2008 / York 2

To be covered

The three basic parts of a web search
Search engine optimization
Advertising, and how to avoid it
Library databases and the deep web

Ask questions any time.

Google, Yahoo, Ask, Live, A9

www.google.com: australopithecus
search.yahoo.com: australopithecus
www.ask.com: australopithecus
search.live.com: australopithecus
a9.com: australopithecus

The computing power, bandwidth, and electricity use are mind-blowing

David F. Carr, How Google Works

Urs Hölzle talk on Google’s Linux cluster (2002)

Ginger Strand, Keyword: Evil (Harper’s, March 2008)

What happens when you search?

You enter some words

You get good links

And sometimes other good stuff

Three things

1. How does it know about everything?

2. How does it decide what’s relevant?

3. How does it serve you the results?

1. How does it know about everything?

Crawlers are continually moving around the web, looking for whatever they can find.

Different search engines crawl different numbers of pages, but all of them crawl billions.

A visit from a Googlebot

66.249.73.229 - - [09/Mar/2008:04:15:47 -0400] "GET /ccm/jsp/homepage.jsp HTTP/1.1" 200 13670 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) - -"
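To see what that log line contains, here is a minimal sketch that splits it into fields. It assumes the server writes the common Apache "combined" log layout (host, identity, user, time, request, status, size, referer, user agent); the names LOG_PATTERN, parse_log_line, and looks_like_googlebot are made up for this illustration.

```python
import re

# Assumed layout: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The log line from the slide, joined back onto one line.
LINE = (
    '66.249.73.229 - - [09/Mar/2008:04:15:47 -0400] '
    '"GET /ccm/jsp/homepage.jsp HTTP/1.1" 200 13670 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html) - -"'
)


def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def looks_like_googlebot(fields):
    """True if the user-agent string claims to be Googlebot."""
    return "Googlebot" in fields["agent"]


fields = parse_log_line(LINE)
print(fields["host"], fields["request"], fields["status"])
print("Claims to be Googlebot?", looks_like_googlebot(fields))
```

The user-agent string is only a claim made by the visitor, so a check like this identifies polite crawlers, not impostors.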

robots.txt

Polite web crawlers first check the robots.txt file on a web site and obey its rules.

Wikipedia’s robots.txt
York’s robots.txt
www.robotstxt.org
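A polite crawler's check can be sketched with Python's standard urllib.robotparser module. The rules, the crawler names, and the example.org URLs below are invented for illustration; a real crawler would download the site's own robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, not taken from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/

User-agent: BadBot
Disallow: /
"""


def build_parser(robots_txt):
    """Load robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_parser(EXAMPLE_ROBOTS_TXT)

# An ordinary crawler may fetch the home page but not /private/.
print(rules.can_fetch("ExampleCrawler", "http://www.example.org/index.html"))
print(rules.can_fetch("ExampleCrawler", "http://www.example.org/private/notes.html"))

# The crawler named in the second rule is shut out entirely.
print(rules.can_fetch("BadBot", "http://www.example.org/index.html"))
```

Nothing enforces these rules. Obeying robots.txt is a courtesy, which is why the slide says "polite".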

Crawl, download, scan, repeat

Crawlers crawl and crawl and download all the pages (and Word files, PDFs, and spreadsheets) they come across. They scan each page for links and add them to their list of pages to crawl, ad infinitum.
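The crawl, download, scan, repeat loop can be sketched in a few lines. This is a toy under stated assumptions: FAKE_WEB is a pretend four-page web with made-up URLs so the code runs without a network, and fetch stands in for a real HTTP download. Real crawlers also handle robots.txt, politeness delays, and many file types.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# A tiny pretend web (hypothetical URLs and pages).
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.org/a.html": '<a href="/b.html">B</a> <a href="/c.html">C</a>',
    "http://example.org/b.html": '<a href="/">Home</a>',
    "http://example.org/c.html": "No links here.",
}


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Stand-in for an HTTP download; returns None for unknown pages."""
    return FAKE_WEB.get(url)


def crawl(seed, max_pages=100):
    frontier = deque([seed])  # pages waiting to be crawled
    seen = {seed}             # pages already queued, so loops end
    downloaded = {}
    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


pages = crawl("http://example.org/")
print(sorted(pages))
```

The seen set is what keeps "ad infinitum" from becoming an endless loop between pages that link to each other.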

2. How does it decide what’s relevant?

What’s on the page
Inbound and outbound links (PageRank)
Frequency of updates
The kind of site it is
Clickthroughs and usage analysis
People tweaking the rules
Secret other stuff

Term frequency

How many times a word appears in a document

Inverse document frequency

Total number of documents / number of documents containing the term

(Actually the logarithm of this.)

TF-IDF

TF-IDF of a keyword in a page = TF * IDF

Example

100 web pages. Keyword: naillie
#1 has 8 mentions. TF = 8.
#2, 17, 19, 76 have 4 mentions. TF = 4.
20 pages have 1 mention. TF = 1.

25 pages mention naillie (1 + 4 + 20), so IDF = log2 (100 / 25) = 2

TF-IDF of naillie in #1 = 8 * 2 = 16    High!
TF-IDF of naillie in #2, 17, 19, 76 = 4 * 2 = 8    Not so high
TF-IDF of naillie in 20 others = 1 * 2 = 2    Small
TF-IDF of naillie in all the rest = 0 * 2 = 0    Irrelevant
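The arithmetic in this example can be checked with a short sketch. The slide does not say which 20 pages have a single mention, so using pages 80 to 99 is an arbitrary assumption; the function names are also just for illustration.

```python
import math

TOTAL_PAGES = 100

# Mentions of "naillie" per page number, as on the slide.
mentions = {1: 8, 2: 4, 17: 4, 19: 4, 76: 4}
# Assumption: pages 80-99 stand in for the 20 pages with one mention.
mentions.update({page: 1 for page in range(80, 100)})


def idf(total_pages, pages_with_term):
    """Inverse document frequency, using log base 2 as the slide does."""
    return math.log2(total_pages / pages_with_term)


def tf_idf(term_count, idf_value):
    """TF-IDF of a keyword in a page = TF * IDF."""
    return term_count * idf_value


naillie_idf = idf(TOTAL_PAGES, len(mentions))
scores = {page: tf_idf(count, naillie_idf) for page, count in mentions.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

print("IDF =", naillie_idf)
for page, score in ranked[:6]:
    print(f"page #{page}: TF-IDF = {score}")
```

Pages that never mention the keyword are simply absent from the dictionary, which matches their TF-IDF of 0.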

HTML helps a lot

<title>The title of the page</title>

<h1>The most important heading</h1>

<h2>Lesser headings</h2>

<a href="http://www.yorku.ca/">Hyperlinks</a>

Text at the top of the page
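Why does HTML help? Because a program can tell the parts of a page apart. Here is a minimal sketch using Python's standard html.parser module that pulls out the text in the title, headings, and links. The class name FieldExtractor and the sample page are invented for this illustration, and real search engines do not publish how they weight each part.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text found inside <title>, <h1>, <h2>, and <a> tags."""

    FIELDS = {"title", "h1", "h2", "a"}

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in self.FIELDS}
        self._open = []  # stack of currently open tags we care about

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            self.fields[self._open[-1]].append(text)


# Hypothetical page built from the tags on the slide.
PAGE = """
<html><head><title>The title of the page</title></head>
<body>
<h1>The most important heading</h1>
<h2>Lesser headings</h2>
<p>Text at the top of the page, with a
<a href="http://www.yorku.ca/">Hyperlink</a>.</p>
</body></html>
"""

extractor = FieldExtractor()
extractor.feed(PAGE)
print("title:", extractor.fields["title"])
print("h1:", extractor.fields["h1"])
print("h2:", extractor.fields["h2"])
print("links:", extractor.fields["a"])
```

Once the words are sorted into fields like this, a ranking rule could count a keyword in the title for more than the same keyword in body text.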

Semantic markup

Wrong: say how it should look

<b><font size="26">
<i>Upon the Distinction Between the Ashes of the Various Tobaccos</i>
</font></b>

Right: say what it is (then apply a look)

<h1>Upon the Distinction Between the Ashes of the Various Tobaccos</h1>

PageRank!

Algorithm designed by Larry Page and Sergey Brin when they were at Stanford
One of the things Google uses in deciding how important a web page is
US Patent 6,285,999
The Anatomy of a Large-Scale Hypertextual Web Search Engine (Brin, Page, 1998)

http://en.wikipedia.org/wiki/PageRank

Wikipedia has a good explanation of PageRank
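The basic idea can be sketched as a simplified, textbook-style PageRank: every page repeatedly shares its rank among the pages it links to. This is not Google's production algorithm. The four-page link graph is made up, the damping factor of 0.85 is the value usually quoted from the Brin and Page paper, and spreading the rank of dead-end pages evenly is one common convention among several.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: every other page links to "home".
LINKS = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "contact": ["home"],
}

ranks = pagerank(LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy web, "home" comes out on top because every other page links to it, while "blog" and "contact" sit at the bottom because nothing links to them.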

Other weightings

Frequency of updates
The kind of site it is: blog? wiki? institutional repository?
Who runs the site?
Search engine companies tweak the rules
They don’t give away their secrets, but lots of people try to reverse engineer the algorithms

Google’s explanation

“Google combines PageRank with sophisticated text-matching techniques to find pages that are both important and relevant to your search. Google goes far beyond the number of times a term appears on a page and examines dozens of aspects of the page's content (and the content of the pages linking to it) to determine if it's a good match for your query.”

http://www.google.com/technology/

Live’s explanation

“Live Search website ranking is completely automated. The Live Search ranking algorithm analyzes factors such as web page content, the number and quality of websites that link to your pages, and the relevance of your website’s content to keywords. The algorithm is complex and is never human-mediated. You can't pay to boost your website’s relevance ranking; however, we do offer advertising options for website owners.”

http://help.live.com/help.aspx?mkt=en-us&project=wl_webmasters

What doesn’t it look at?

Crawlers don’t see inside fancy things like Flash plug-ins, so if your home page is a Flash intro, the search engines may go no further
But they can infer what’s in an image
Most search engines ignore <meta> tags and other metadata

3. How does it serve you the results?

You enter terms,

it looks them up in a reverse index,

then it formats the results on the way out.

Reverse index

naillie 1, 2, 17, 19, 76, etc.

partridge 1, 2, 35, 76, 8, 65

Not showing weights and rankings etc.
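A reverse index (more often called an inverted index) can be sketched as a dictionary from each word to the pages that contain it. The four pages and their text below are made up, and like the slide, this version leaves out weights and rankings.

```python
from collections import defaultdict

# Hypothetical mini-collection: page number -> page text.
PAGES = {
    1: "the naillie and the partridge",
    2: "a partridge met a naillie",
    8: "partridge in a pear tree",
    17: "naillie naillie naillie",
}


def build_index(pages):
    """Map each word to the sorted list of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Return pages containing every word in the query (an AND search)."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(PAGES)
print("naillie:", index["naillie"])
print("partridge:", index["partridge"])
print("naillie partridge:", search(index, "naillie partridge"))
```

The index is built once, ahead of time. Answering a query is then a lookup and an intersection, with no need to reread any pages.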

Formatting on the way out

Look at the cached copy; get the title, keywords in context, etc.

Google: toronto shoes
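A keyword-in-context snippet can be sketched as cutting a window of text around the first place the search term appears in the cached copy. The cached text and the 30-character window below are invented for illustration; real engines choose and highlight snippets in ways they do not fully document.

```python
import re


def keyword_in_context(text, keyword, width=30):
    """Return a short snippet of text around the first match of keyword."""
    match = re.search(re.escape(keyword), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(match.start() - width, 0)
    end = min(match.end() + width, len(text))
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end].strip() + suffix


# Hypothetical cached page text.
CACHED_TEXT = (
    "Welcome to our store. We sell handmade shoes in downtown Toronto, "
    "and we repair boots and sandals while you wait."
)

print(keyword_in_context(CACHED_TEXT, "shoes"))
print(keyword_in_context(CACHED_TEXT, "toronto"))
```

Working from the cached copy means the results page can be built without contacting the original site at search time.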

Search engine optimization

SEO means bringing in people by designing your web site so that search engines list you high on results pages when people search for certain keywords, without buying ads.

White hat SEO

Improving your organic search results by making your site understandable by search engine crawlers, and by getting people to link to you.

White hat SEO

Have good content

Semantic markup again: <title>Good Titles Matter</title> <h1>So do headings</h1>

Use text

Host on a reliable, trustworthy site

Black hat SEO

Pulling tricks to make search engines think your site is more popular than it is or to mislead them about the content.

SEO browser extensions

Google Toolbar
Search Firefox Add-ons for “pagerank” and “seo”, but mind the privacy issues on what’s being reported where about the pages you view

Advertising

Search engines sell ads. You can pay to get your web page listed at the top of the results page for desired keywords.

Search engines are advertising companies.

Google: AdWords and AdSense

AdWords is Google’s program for selling ads on its site in results pages. Try their Keyword Tool.

AdSense puts little boxes of context-relevant ads on web pages of people who want to make a little money. (Or a lot.)

Yahoo

Video explaining how “sponsored search” works

See how much it would cost to buy some keywords

Avoiding advertising

The Firefox extensions Adblock and Customize Google will make your web browsing ad-free and much more pleasant.

Invisible web or deep web

Search engines miss a lot of web content:

It’s behind a login
It’s dynamically generated
It’s embedded in Flash or Java applets

2/3 of the web goes unseen

He, Patel, Zhang, Chang, Accessing the Deep Web: A Survey (Communications of the ACM, 50: 5, May 2007)

Library databases

Library databases (JSTOR, PsycInfo, Scholars Portal) are part of the deep web. They have enormous amounts of information that’s hidden from the public … except when it’s not, as through Google Scholar.

Library databases

Differences in rankings, algorithms
Full use of metadata
Usually sorted by date
PageRank is based on citation analysis, but these databases don’t use citation analysis to rank relevant papers

Scholars Portal: search Natural Sciences for australopithecus

Final note on privacy

Everything you do at a search engine is logged. They track you by IP address and with cookies and logins.

Assume all that information is stored forever.

See http://blog.searchenginewatch.com/blog/060206-150030

Further reading: online

searchenginewatch.com

John Battelle’s Searchblog

Online: Exploring Technology and Resources for Information Professionals

Wikipedia (usually quite strong on technical articles)

The library’s Computer Science Research Guide

Further reading: books

Battelle, John. 2006. The Search: The Inside Story of How Google and Its Rivals Changed Everything (Portfolio)

Berry, Michael W., and Murray Browne. 2005. Understanding Search Engines: Mathematical Modelling and Text Retrieval (SIAM)

Levene, Mark. 2006. An Introduction to Search Engines and Web Navigation (Addison-Wesley)
