
Search - on the Web and Locally

Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006

First -- projects

The class web page suggests these types of projects:

1. A detailed literature review on one of the topics of this course. This involves discovering, reading, summarizing and comparing published material about either search technology or personal information management. Conference papers are an appropriate source of materials. Materials found on the web are fine, as long as you do a suitable evaluation of the credibility of the resource.

2. A comparative review of a number of tools for one type of information management. For example, you might compare several photo management tools, describing each and listing the features that set each apart from the others and then summarizing their strengths and weaknesses. Your report would conclude with your evaluation of the state of the art of this type of information management based on your review of these materials.

3. A significant contribution to an open source project related to our topics. Do you have a way to improve Lucene? Can you find a tool for managing e-mail that you can improve? You must prepare your project for evaluation by the class and for submission to the open source project organization.

4. A totally new tool that you have created. Have you had an idea for a useful tool and never got around to doing anything about it? Maybe this will be the beginning of an important product.

First - the search

• Describe your experience in finding the required reading
  – What steps did you take?
  – Were there any problems?
  – Was anything about the search difficult?
  – Was anything different from what you expected?

Initial discussion

• What surprised you in these articles?
  – What did you recognize from previous courses but did not expect to see in a discussion of Web search?
  – What works differently from the image you had?
• What would you like to have learned that was not included?
• What are the biggest areas of challenge to the Web search enterprise?
• Are there things that cannot be solved?
  – Are there issues of scale that are just impossible?
  – Are there limitations that just cannot be overcome?
• Are there problems to solve that require more work but are within the range of manageable improvements?

The Web Search

• Three distinct phases:
  – Crawling
  – Indexing
  – Searching

• Each has specific challenges to address

Crawlers

• Basic process
  – Open an HTML page that has at least one anchor tag (<a href="…..">link description</a>)
  – Send an HTTP request to the linked site and receive the page
  – Parse the page, looking for other anchor tags
  – Place anchors on a queue for further processing
  – Submit the actual page for indexing and storing
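The basic process above can be sketched in a few lines of Python. This is a minimal illustration, not how a production crawler works: the FAKE_WEB dictionary and the fetch function are hypothetical stand-ins for real HTTP requests, and the example.test URLs are made up.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML text.
FAKE_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B again</a>',
    "http://example.test/b.html": '<a href="/">home</a>',
}


def fetch(url):
    """Stand-in for the HTTP request; returns None for unknown URLs."""
    return FAKE_WEB.get(url)


def crawl(seed):
    queue = deque([seed])   # anchors waiting to be processed
    seen = {seed}           # URLs already queued, so loops do not repeat
    indexed = []            # pages "submitted for indexing"
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = AnchorParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        indexed.append(url)
    return indexed


print(crawl("http://example.test/"))
```

The seen set is the one addition beyond the slide: without it, pages that link to each other would be queued forever.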

Indexers

• Scanning
  – “For each indexable term … the indexer writes a posting consisting of a document number and a term number to a temporary file.”
  – Parse this sentence: What is an indexable term? Posting? Document number? Term number? What does a posting look like?
• Invert the file
  – Sort by term, secondarily by document number
  – Record start location and list length for each term
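One way to picture scanning and inversion is the toy indexer below. It is a sketch under simplifying assumptions: an "indexable term" is taken to be any whitespace-separated lowercase word, the "temporary file" is an in-memory list, and the function and variable names are illustrative only.

```python
def build_index(docs):
    """docs: list of strings; the document number is the list position."""
    term_numbers = {}   # term -> term number (assigned on first sight)
    postings = []       # the "temporary file": (term number, document number)

    # Scanning: write one posting per indexable term.
    for doc_number, text in enumerate(docs):
        for term in text.lower().split():
            term_number = term_numbers.setdefault(term, len(term_numbers))
            postings.append((term_number, doc_number))

    # Inversion: sort by term, secondarily by document number.
    # Duplicate postings (a term repeated in one document) are dropped.
    postings = sorted(set(postings))

    # Record the start location and list length for each term.
    doc_list = [doc for _, doc in postings]
    dictionary = {}
    for position, (term_number, _) in enumerate(postings):
        start, length = dictionary.get(term_number, (position, 0))
        dictionary[term_number] = (start, length + 1)
    return term_numbers, dictionary, doc_list


terms, dictionary, doc_list = build_index(
    ["web search", "web crawler", "search engine search"]
)
start, length = dictionary[terms["search"]]
print(doc_list[start:start + length])   # documents containing "search"
```

Here a posting is written as (term number, document number) so that a plain sort gives the required order.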

Searching (Query Processing)

• Look up each query term in the term dictionary
• Get the postings list
• Find documents that match all search terms
  – Find documents for each term and merge lists where common documents occur
• Rank documents and report
  – As many as required or until end of the list
• It is still possible to find a result on one search and not find that same item on a subsequent search for the same terms
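The "merge lists where common documents occur" step can be sketched as a walk over two sorted postings lists. The index layout (term mapped to a sorted list of document numbers) is an assumption made for this example, and ranking is left out.

```python
def intersect(a, b):
    """Merge two sorted document-number lists, keeping common entries."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def search(index, query):
    """index: term -> sorted list of document numbers (hypothetical layout)."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)   # start with the shortest list to keep work small
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


index = {"web": [0, 1, 4, 7], "search": [1, 2, 4, 9], "engine": [4, 9]}
print(search(index, "web search"))
print(search(index, "web search engine"))
```

Because both lists are sorted by document number, each list is read only once, which is why the inversion step sorts postings that way.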

Expanding from the basics

• Each of the phases of web searching is simple in concept, but complicated by the sheer magnitude of the task.

• The same ideas applied on a smaller scale, in a company intranet for example, can be done efficiently.

• The Web presents special challenges.

Crawling

• A single machine running a simple crawling algorithm would not do well in finding all Web pages.

• Large data centers
  – Redundancy and fault tolerance
  – Parallel operation
  – (SIGCSE talk by Marissa Mayer of Google)
    http://video.google.com/videoplay?docid=-1243280683715323550&hl=en

Crawling reality

• Speed - amazing numbers:
  – At 1 sec per HTTP request, max 86,400 pages per day = 634 years for 20 billion pages
• Politeness
  – Avoid overwhelming web servers
• Excluded content
  – robots.txt
• Duplicate content
  – Identifying duplicates can be tricky - why?
• Continuous crawling
  – Keeping current
  – Note comment about “current time” - how would you fix that?
  – Priority queue for crawling schedule - why?
• Spam
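The speed figure can be checked with a few lines of arithmetic. Note that 86,400 is the number of seconds in a day, so the 634-year estimate corresponds to one request per second on a single machine. The crawlers parameter is an assumption added here to show why parallel operation matters; it ignores politeness limits and failures.

```python
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400
PAGES = 20_000_000_000            # 20 billion pages, as on the slide


def crawl_years(seconds_per_request, crawlers=1):
    """Years needed to fetch PAGES pages with sequential requests."""
    pages_per_day = crawlers * SECONDS_PER_DAY / seconds_per_request
    return PAGES / pages_per_day / 365


print(round(crawl_years(1.0)))                          # single machine
print(round(crawl_years(1.0, crawlers=10_000) * 365))   # days, 10,000 machines
```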

Indexing large collections

• The Web is the ultimate “large collection”
• “Estimating 500 terms in each of 20 billion pages” --> 10 trillion entries!
• Divide and conquer, as the crawler did
  – Each indexer builds a partial file in memory
  – Stops when memory is full
  – Write to disk, clear memory, and start over
• Merge the partial files to make the full index
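The build-flush-merge cycle can be sketched as follows. This is an illustration only: memory_limit is a hypothetical cap on the number of postings held at once, and each sorted list stands in for a partial file that a real indexer would write to disk.

```python
import heapq


def build_partial_indexes(postings, memory_limit):
    """Split a stream of (term, doc) postings into sorted runs."""
    runs, buffer = [], []
    for posting in postings:
        buffer.append(posting)
        if len(buffer) >= memory_limit:   # "memory is full"
            runs.append(sorted(buffer))   # write the partial file
            buffer = []                   # clear memory and start over
    if buffer:
        runs.append(sorted(buffer))       # flush whatever is left
    return runs


def merge_partial_indexes(runs):
    """Merge the sorted partial files into one full, sorted index."""
    return list(heapq.merge(*runs))


stream = [("web", 3), ("search", 1), ("web", 1), ("crawl", 2), ("search", 2)]
runs = build_partial_indexes(stream, memory_limit=2)
print(merge_partial_indexes(runs))
```

heapq.merge reads its inputs lazily, so the same pattern works when the runs are too large to hold in memory together.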

Data structures for indexing

• Trees, tries, hash tables
  – Various ways to organize the terms for easy lookup
• Numbers of terms
  – Not just all words in all languages
  – Acronyms, proper names, etc.
  – Must deal with common phrases also
    • Separate index entries (postings) for common word combinations
• Compression
  – Saves space, increases processing
• Anchor text -- fie on those who use “click here”!!
• Link popularity score
  – Give a score to a page based on popularity, also on query-independent factors.
  – Think about the implications of this.
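As an illustration of the compression point, one widely used approach (not necessarily what any particular search engine does) stores the gaps between sorted document numbers and encodes each gap in a variable number of bytes. Small gaps then take one byte instead of a full integer, at the price of a decoding step when the list is read. All function names below are illustrative.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers 7 bits per byte; the high bit marks
    the final byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()       # most significant group first
        chunk[-1] |= 0x80     # flag the last byte of this number
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:       # final byte: finish the current number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


def compress_postings(doc_numbers):
    """Store gaps between sorted document numbers, then byte-encode them."""
    if not doc_numbers:
        return b""
    gaps = [doc_numbers[0]]
    gaps += [b - a for a, b in zip(doc_numbers, doc_numbers[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    docs, total = [], 0
    for gap in vbyte_decode(data):
        total += gap          # running sum turns gaps back into numbers
        docs.append(total)
    return docs


packed = compress_postings([824, 829, 215406])
print(len(packed), decompress_postings(packed))
```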

Query Processing

• Most queries are short, do not provide much context
• Result quality -- use some of the techniques from information retrieval
  – Once a preliminary list of responses is obtained, treat that as the collection and use IR techniques to improve the quality of the response.
    • Some limitations. No way to judge how complete the initial list is.
  – Techniques are part of the trade secrets of the companies
• Speeding:
  – Skipping
  – Early termination
  – Document numbering
  – Caching
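Of the speed-up techniques listed, caching is the easiest to sketch: repeated queries are answered from memory instead of walking the postings lists again. The index contents and function name below are made up for the example, and real engines cache at several levels in ways the slides do not describe.

```python
from functools import lru_cache

CALLS = {"count": 0}   # counts how often the slow path actually runs

# Hypothetical index: term -> document numbers.
HYPOTHETICAL_INDEX = {
    "web": (0, 1, 4),
    "search": (1, 4, 9),
}


@lru_cache(maxsize=1024)
def cached_search(query):
    """Return documents containing all query terms; results are memoized."""
    CALLS["count"] += 1
    term_lists = [set(HYPOTHETICAL_INDEX.get(t, ())) for t in query.split()]
    if not term_lists:
        return ()
    return tuple(sorted(set.intersection(*term_lists)))


cached_search("web search")   # computed
cached_search("web search")   # served from the cache
print(CALLS["count"])
```

A cached answer can go stale when the index is rebuilt, so a real system would need some way to expire entries.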
