Search Engines

Needles and Haystacks

E-Commerce

Prof. Sheizaf Rafaeli

News…

In winter 2004, Google jumped up to 4,300,000,000 pages. Still a drop in the bucket.

Yahoo, Ask Jeeves (AJ) and others still running
Froogle. Google News. Google Compute. The Deskbar, Cooking with Google, ratemyprofessors.com…
Teoma, Kartoo, Vivisimo, Booble, amazon, imdb, VisualThesaurus, Wikipedia, Touchgraph, Grokker

Some concepts…

Manual vs. automatic vs. metasearching
Deep “hidden” web
“Webliographies”, Blogging
Googlewhacking, google bombing, to “be googled”
Web archives, wayback and alexa
Launchpads and toolbars: Microsoft, Google, Clicksearch, Babylon, Alexa
AI in searching (Google, AJ)

How much information is on the web?

35 GB? 300 GB? 3 TB? more?
Mid 1999 estimate: 800 million pages
Mid 2000 estimate: 3 billion (thousand million) pages
Mid 2003 estimate: 15 billion pages + “Deep Web”
Google now indexes (only?) well over 4 billion
Early 2001 “Deep Web” estimate: 500 billion
How do you even estimate?
How can you find what you are looking for?
Doesn’t this remind you of going to the library???
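
One way to approach "How do you even estimate?" is overlap analysis: compare what two engines have indexed and infer the total from how much they share (a capture-recapture estimate). The sketch below assumes the two indexes are independent random samples of the web, which real indexes are not, so treat the result as a rough lower bound. The function name and all figures are made up for illustration.

```python
def estimate_total_pages(indexed_a: int, indexed_b: int, overlap: int) -> float:
    """Lincoln-Petersen (capture-recapture) estimate of the total population.

    If B is an independent sample, overlap / indexed_b approximates the
    fraction of the web that A covers, so total ~= indexed_a * indexed_b / overlap.
    """
    if overlap <= 0:
        raise ValueError("need a non-empty overlap to estimate")
    return indexed_a * indexed_b / overlap


# Hypothetical figures, in millions of pages (not real measurements).
estimate = estimate_total_pages(indexed_a=400, indexed_b=300, overlap=150)
print(f"Estimated indexable web: {estimate:.0f} million pages")
```

The smaller the overlap between two engines, the larger the implied total, which is why low overlap between engines suggests that each covers only a small part of the web.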

Engines Idling Roughly

Search engines were supposed to be the Grand Central stations of the Internet: a starting point for every venture into an overwhelming world of information. It appears, however, that people are comfortable clicking around on their own. Only 7 percent of Web pages are accessed through a search engine, a portion that has remained almost static since 1999.

Engines Idling Roughly

While search engines may not drive all that much traffic, they do take up a lot of time. Six out of 10 people online report using search engines more than one hour a week, according to a survey by pollster Roper Starch; more than a third search the Net diligently over two hours every week.

Engines Idling Roughly

Not surprisingly, many of these surfers are annoyed. Overall, 71 percent of people online say they get frustrated while searching the Net. And it doesn't take them long to lose their cool: about half are frustrated within 15 minutes. But despite the Web's enormous size, about 80 percent of people say they usually find what they need when searching.

Engines Idling Roughly

But even the most comprehensive search engine, Google, captures only 42 percent of indexable Web pages. And that number drops dramatically for the competition. Second-ranked Fast, a Norwegian search technology, and Inktomi index 19 percent and 17 percent of the Web, respectively.

Still, when it comes to searching, less is more. Specialized search sites may be the key to helping people find what they're looking for. "Internet users need relevance when conducting searches," predicting the emergence of "vertical" search engines for specific user groups.

How do you find things at the library?

Several models:
  Walk around until you find something
  Walk around until you forget what you want
  Walk around until you find a place to nap
  Use the library catalog
  Use the services of someone who knows the collection (Reference Librarian)

Search Engines

Not all are American or even English. Here, e.g., are several Hebrew engines:

Walla (וואלה): http://www.walla.co.il
Achla (אחלה): http://www.achla.co.il
Tapuz (תפוז): http://www.tapuz.co.il
Nana (נענע): http://www.nana.co.il
Sababa (סבבה): http://www.sababa.co.il
Haaretz and IOL
Nadav Harel and iguide

Problems with search engines

Coverage

Problems with search engines

Invalid Links

Problems with search engines

Search Engines Refer Only A Small Percentage Of Traffic To Web Sites Worldwide

Are Search Engines truly so important?

What do Search Engines search?

They do NOT search the Web!
  That is, they do not search the web the very moment you ask for something.
  Rather, they search their databases or indexes.
Search engines store the contents of millions of websites in an index or DB, and your query is matched up against that.
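
The idea of matching a query against a stored index, rather than against the live web, can be sketched with a small inverted index. This is a minimal illustration, not how any particular engine is built; the page URLs and texts are hypothetical.

```python
from collections import defaultdict

# Hypothetical pages the engine has already fetched and stored.
pages = {
    "fish.example/care": "how to care for tropical fish",
    "olympics.example/2004": "athens olympics 2004 schedule",
    "cook.example/fish": "how to cook fish",
}


def build_index(docs):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return pages containing every query word, consulting only the index."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


index = build_index(pages)
print(sorted(search(index, "fish")))
print(sorted(search(index, "cook fish")))
```

Note that `search` never touches `pages`: a page that changed or vanished after indexing would still be returned, which is one source of invalid links.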

What do Search Engines search?

They don’t even catalog the entire contents of the WWW!
  Nowhere near, in fact... you only get what they have!
For the most part, they don’t have the contents of the websites they show you, only links to these sites

How do they find it?

They use
  Spiders, webbots and bots
  Crawlers, worms, and harvesters
  Wanderers, indexers, and sitesuckers
What are they?
  Self-directed browsers which go from link to link, retrieving all or part of the contents of any given site for inclusion in the search engine's database.
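
The link-to-link behavior described above can be sketched as a breadth-first crawl. To keep the example runnable without a network, it assumes a tiny in-memory "web" (`FAKE_WEB`, hypothetical URLs) in place of real HTTP requests; a real spider would also need politeness rules, error handling, and URL normalization.

```python
from collections import deque
from html.parser import HTMLParser

# A tiny in-memory "web" standing in for real HTTP fetches (hypothetical URLs).
FAKE_WEB = {
    "a.example": '<a href="b.example">B</a> <a href="c.example">C</a>',
    "b.example": '<a href="c.example">C</a>',
    "c.example": '<a href="a.example">A</a> <a href="d.example">D</a>',
    "d.example": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, fetch, max_pages=100):
    """Breadth-first crawl: follow links from page to page, storing contents."""
    database = {}
    queue = deque([start])
    while queue and len(database) < max_pages:
        url = queue.popleft()
        if url in database:
            continue  # already visited
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to store
        database[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(link for link in parser.links if link not in database)
    return database


db = crawl("a.example", FAKE_WEB.get)
print(sorted(db))
```

Pages that nothing links to are never reached this way, which is one reason engines cover only part of the web.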

How do I find what I want?

“Excuse me, do you have anything on fish..?”

“Do you have anything about the Olympics?”

How do I find what I want?

It pays to know how to ask
It pays to understand how collections work

Know the lingo

Boolean Operators
False Drops
Directories
Full-Text Indexing
Stemming
Webliographies
Hits
Recall
Precision
Keywords
Meta-Search Engines
Presentation order

Know the lingo

Boolean Operators
  Logical connectives, taken from formal logic, used to combine search terms. The most common are AND, OR and NOT, with parentheses () for grouping
  Examples:
    icons AND NOT relig*
    free AND pictures AND NOT (nude OR naked)
  Many sites claim to support them, only a few work well... trial and error
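
The two example queries map directly onto set operations: AND is intersection, OR is union, and AND NOT is set difference. The sketch below runs them over a few made-up documents; the `matching` helper and the treatment of a trailing `*` as a prefix match are assumptions for illustration.

```python
# Hypothetical document collection: id -> text.
docs = {
    1: "free icons for your desktop",
    2: "religious icons of the orthodox church",
    3: "free pictures of mountains",
    4: "free nude pictures",
}


def matching(term):
    """Doc ids containing the term; a trailing * matches any word with that prefix."""
    if term.endswith("*"):
        prefix = term[:-1]
        return {d for d, text in docs.items()
                if any(w.startswith(prefix) for w in text.split())}
    return {d for d, text in docs.items() if term in text.split()}


# icons AND NOT relig*
print(sorted(matching("icons") - matching("relig*")))

# free AND pictures AND NOT (nude OR naked)
print(sorted((matching("free") & matching("pictures"))
             - (matching("nude") | matching("naked"))))
```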

Know the lingo

False Drops
  Documents or websites retrieved that are not relevant to the user’s needs
  Examples:
    Let’s do a quick search for XXX

Know the lingo

Directories
  A hierarchical search that proceeds through increasingly more specific headings or sub-topics
  Let’s visit

Know the lingo

Full-Text Indexing
  An indexing method in which every word in the web page is put into the database, with the exception of prepositions, conjunctions, and the like.
Controlled-language indexing
  How directories are implemented
Both of these are done for you by the Search Engine
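
Dropping prepositions, conjunctions, and similar words is usually done with a stop-word list. A minimal sketch of that step follows; the list here is a short illustrative one, and each engine that uses stop words maintains its own.

```python
import re

# A short illustrative stop-word list (an assumption, not any engine's real list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "or", "for"}


def index_terms(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


print(index_terms("The History of the Olympics in Athens and Sydney"))
# -> ['history', 'olympics', 'athens', 'sydney']
```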

Know the lingo

Stemming
  A type of search that uses the common root of a word to include all possible occurrences of that word
  Example:
    "child*" would yield results that include childhood, childless, children, etc.
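
The "child*" example can be sketched as a prefix match against the words an engine knows. The wildcard form shown on the slide is often called truncation; full stemming algorithms instead reduce each word to its root automatically. The vocabulary and function name below are hypothetical.

```python
# Hypothetical vocabulary of indexed words.
vocabulary = ["child", "childhood", "childless", "children", "chill", "school"]


def expand_truncation(pattern, words):
    """Expand a pattern such as 'child*' to every word sharing that root."""
    if pattern.endswith("*"):
        root = pattern[:-1]
        return [w for w in words if w.startswith(root)]
    return [w for w in words if w == pattern]


print(expand_truncation("child*", vocabulary))
# -> ['child', 'childhood', 'childless', 'children']
```

A short root widens the net quickly: `chil*` would also pull in "chill", a false drop for someone searching about children.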

Know the lingo

Hits
  Documents, or references to documents, that are returned in response to a query
  Note: a hit is not necessarily relevant
Recall
  The degree to which all the relevant documents in a collection are returned, e.g., if a search engine retrieves 80 of 100 available relevant documents, its recall is 80%.
  How do you determine recall on the web?

Know the lingo

Precision
  A standard way of measuring the accuracy of an information retrieval system
  The number of relevant documents obtained divided by the total number of documents retrieved
  in other words: (useful stuff / what you got)
  remember that a hit is not necessarily relevant
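
Precision and recall can be computed side by side once you know which documents are relevant, which is exactly what is hard to know on the web. A small worked sketch with made-up document ids, reusing the 80-of-100 figure from the recall definition:

```python
def precision_recall(retrieved, relevant):
    """Precision = useful / retrieved; recall = useful / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    useful = retrieved & relevant
    precision = len(useful) / len(retrieved) if retrieved else 0.0
    recall = len(useful) / len(relevant) if relevant else 0.0
    return precision, recall


# 100 relevant documents exist (ids 0-99); the ids are hypothetical.
relevant = set(range(100))
# The engine returns 80 of them plus 120 false drops (ids 1000-1119).
retrieved = set(range(80)) | set(range(1000, 1120))

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.0%} recall={r:.0%}")
# -> precision=40% recall=80%
```

The two measures pull against each other: returning more hits tends to raise recall and lower precision.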

Know the lingo

Keywords
  A search that looks for specific words provided by cataloged sites
  Typically, a search engine agent looks for keywords contained in the <META> tag
  A website developer can manipulate the <META> tag to increase the visibility of his/her site, at the expense of accuracy
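
Reading keywords out of a page's <META> tag can be sketched with Python's standard `html.parser` module. The sample page is made up, and the repeated "free" in it illustrates the kind of keyword stuffing a developer might try.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collect the comma-separated keywords from <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip())


# Hypothetical page source.
page = """<html><head>
<title>Tropical fish care</title>
<META NAME="keywords" CONTENT="fish, aquarium, tropical fish, free, free, free">
</head><body>Feeding schedules for tropical fish.</body></html>"""

parser = MetaKeywordParser()
parser.feed(page)
print(parser.keywords)
# -> ['fish', 'aquarium', 'tropical fish', 'free', 'free', 'free']
```

An engine that trusts these keywords blindly would rank this page for "free" even though the body never mentions it.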

Some Search Tips

Use the plus (+) and minus (-) signs in front of words to force their inclusion and/or exclusion in searches.

Use double quotation marks (" ") around phrases to ensure they are searched exactly as is

Put your most important keywords first in the string.

Type keywords and phrases in lower case to find both lower and upper case versions.

Use truncation and wildcards (e.g., *) to look for variations in spelling and word form.

Know whether or not the search engine you are using maintains a stop word list

The “Deep Web”

Regular web searches only drag nets across the surface

The “Deep Web”

The “Deep Web”

500 times larger than surface web
95% of it is public and free
Content in deep web 1000+ times better quality
7,500 terabytes (TB) of information
45,000 search engines in “surface web”

Presentation order (1)

Presentation order may be more important than just being mentioned.
Is order affected by commercial fees?
"A page is important if a bunch of important pages point to it," explained Brin. (Google.com) "It's the sum of the pages that point to it."
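
Brin's circular definition (a page is important if important pages point to it) can be resolved by repeated averaging over the link graph. The sketch below is a simplified PageRank-style iteration, not Google's actual implementation; the damping value of 0.85, the iteration count, and the four-page graph are assumptions for illustration.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style power iteration over a link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of importance.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on, split among its links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical site: page -> pages it links to.
links = {
    "home": ["news", "about"],
    "news": ["home"],
    "about": ["home"],
    "blog": ["home", "news"],
}
ranks = link_rank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page:6s} {score:.3f}")
```

Here "home" ends up on top because every other page points to it, while "blog", which nothing links to, keeps only the baseline share.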

Presentation order (2)

Location, Location, Location...and Frequency
  keywords appearing in the title, top are more relevant than others, etc.
Link popularity
Relevancy (person, institution)
Meta tags
Penalty items
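
The "location and frequency" idea can be sketched as a toy scoring function: count how often the query words occur, and weight occurrences in the title more heavily than those in the body. The weight of 3.0 and the sample pages are arbitrary assumptions; real engines combine many more signals, including the others listed above.

```python
def relevance_score(query, title, body, title_weight=3.0):
    """Toy score: count query-word occurrences, weighting title hits more."""
    score = 0.0
    title_words = title.lower().split()
    body_words = body.lower().split()
    for word in query.lower().split():
        score += title_weight * title_words.count(word)  # location
        score += body_words.count(word)                  # frequency
    return score


# Hypothetical (title, body) pairs.
pages = [
    ("Athens travel guide", "hotels near the olympics stadium"),
    ("Cooking fish", "recipes for grilled fish"),
    ("Olympics schedule", "full olympics schedule for athens"),
]
ranked = sorted(pages, key=lambda p: relevance_score("olympics", *p), reverse=True)
for title, body in ranked:
    print(relevance_score("olympics", title, body), title)
```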

Meta-Search Engines

Use multiple search engines in parallel to provide an answer to a single query
They are front-ends to other search engines and their collections, and typically do not contain their own databases
Examples
  Surfwax, Vivisimo, Ask Jeeves, Metacrawler, The Mining Company
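
A meta-search front-end can be sketched as: send the same query to several engines in parallel, then merge and de-duplicate what comes back. The three engines below are stubs returning fixed, made-up results; the merge rule (more engines agreeing first, then best position) is one simple choice among many.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub engines standing in for real remote services (hypothetical results).
def engine_a(query):
    return ["a.example/1", "shared.example/x", "a.example/2"]


def engine_b(query):
    return ["shared.example/x", "b.example/1"]


def engine_c(query):
    return ["c.example/1", "shared.example/x", "b.example/1"]


def meta_search(query, engines):
    """Send one query to every engine in parallel and merge the answers."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    votes = {}  # url -> (number of engines returning it, best position seen)
    for results in result_lists:
        for position, url in enumerate(results):
            count, best = votes.get(url, (0, position))
            votes[url] = (count + 1, min(best, position))
    # Most engines first, then best position, then URL for a stable order.
    return sorted(votes, key=lambda u: (-votes[u][0], votes[u][1], u))


print(meta_search("olympics", [engine_a, engine_b, engine_c]))
```

The front-end holds no index of its own: everything it can return comes from the engines behind it.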

The Best Search Engine is…

Whichever one you can actually find things with
  Sometimes their indexing is a little more “natural” to you
  Some people prefer search engines that use directories (Yahoo! and others) and some prefer simple indexing (Altavista and others)
  Some people prefer the “human touch” (“webliographies”, “about” The Mining Company).

Getting Listed and Noticed (promoting your page)

Have worthwhile content/service
List manually with engines
Submission services (like www.submitit.com)
Advertise in print, other media
Use graphics, scripts appropriately
Use good keywords
Use <META> tag tricks
Get complimentary links, awards
Join “rings”
Be aware of XML, Ratings and PICS

Disintermediation?

Re-intermediation!
Infomediaries! (portals, agents, consultants, experts)

Hagel and Singer: Net Worth: The emerging role of the infomediary in the race for customer information

Resources

Webhound: www.mcli.dist.maricopa.edu/webhound/
websearch.about.com
Search Engine Watch: www.searchenginewatch.com
www.searchiq.com
The Spider’s Apprentice, at http://www.monash.com/spidap.html

Resources

www.MetaSpy.com

Resources

S. Lawrence, C. L. Giles, Accessibility of Information on the Web, Nature, 400, pp. 107-109, 1999.

S. Lawrence, C. L. Giles, Searching the World Wide Web, Science, 280, p. 98, 1998.

BrightPlanet’s “Deep Web White Paper”, 2000, at http://128.121.227.57/download/deepwebwhitepaper.pdf