
Search - on the Web and Locally

Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006


First -- projects

The class web page suggests these types of projects:

1. A detailed literature review on one of the topics of this course. This involves discovering, reading, summarizing and comparing published material about either search technology or personal information management. Conference papers are an appropriate source of materials. Materials found on the web are fine, as long as you do a suitable evaluation of the credibility of the resource.

2. A comparative review of a number of tools for one type of information management. For example, you might compare several photo management tools, describing each and listing the features that set each apart from the others and then summarizing their strengths and weaknesses. Your report would conclude with your evaluation of the state of the art of this type of information management based on your review of these materials.

3. A significant contribution to an open source project related to our topics. Do you have a way to improve Lucene? Can you find a tool for managing e-mail that you can improve? You must prepare your project for evaluation by the class and for submission to the open source project organization.

4. A totally new tool that you have created. Have you had an idea for a useful tool and never got around to doing anything about it? Maybe this will be the beginning of an important product.


First - the search

• Describe your experience in finding the required reading
    – What steps did you take?
    – Were there any problems?
    – Was anything about the search difficult?
    – Was anything different from what you expected?


Initial discussion

• What surprised you in these articles?
    – What did you recognize from previous courses but did not expect to see in a discussion of Web Search?
    – What works differently from the image you had?
• What would you like to have learned that was not included?
• What are the biggest areas of challenge to the Web Search enterprise?
• Are there things that cannot be solved?
    – Are there issues of scale that are just impossible?
    – Are there limitations that just cannot be overcome?
• Are there problems to solve that require more work but are within the range of manageable improvements?


The Web Search

• Three Distinct Phases:
    – Crawling
    – Indexing
    – Searching

• Each has specific challenges to address


Crawlers

• Basic process
    – Open an HTML page that has at least one anchor tag
        • <a href="…..">link description</a>
    – Send HTTP request to the site and receive the page.
    – Parse the page, looking for other anchor tags
    – Place anchors on a queue for further processing
    – Submit the actual page for indexing and storing
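The basic process above can be sketched in a few lines of Python. This is only a minimal illustration, not how any production crawler is written: the `crawl` and `AnchorParser` names, the `max_pages` limit, and the in-memory `FAKE_WEB` dictionary (used in place of real HTTP requests so the sketch runs offline) are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    queue = deque([seed])   # anchors waiting for further processing
    seen = {seed}           # URLs already queued, so loops are not followed
    indexed = []            # stand-in for "submit the page for indexing"
    while queue and len(indexed) < max_pages:
        url = queue.popleft()
        html = fetch(url)   # a real crawler would send an HTTP request here
        if html is None:
            continue
        parser = AnchorParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
        indexed.append(url)
    return indexed


# Hypothetical three-page "web" held in memory.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B again</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

order = crawl("http://example.test/", FAKE_WEB.get)
```

The `seen` set is the one addition beyond the slide's steps: without it, the link from page B back to the home page would keep the queue from ever emptying.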


Indexers

• Scanning
    – “For each indexable term … the indexer writes a posting consisting of a document number and a term number to a temporary file.”
        • Parse this sentence: What is an indexable term? Posting? Document number? Term number? What does a posting look like?
• Invert the file
    – Sort by term, secondarily by document number
    – Record start location and list length for each term
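One possible answer to "what does a posting look like?" is sketched below, assuming a posting is simply a (term number, document number) pair and an "indexable term" is a lowercase run of letters or digits. The function names and the tiny `DOCS` collection are made up for the illustration, and a Python list stands in for the temporary file.

```python
import re


def scan(docs):
    """Return (term dictionary, postings) for a list of document strings.

    Each posting is a (term_number, document_number) pair; terms are
    numbered in order of first appearance.
    """
    term_numbers = {}
    postings = []
    for doc_number, text in enumerate(docs):
        # Assumption: an indexable term is a lowercase run of letters/digits.
        for term in sorted(set(re.findall(r"[a-z0-9]+", text.lower()))):
            term_number = term_numbers.setdefault(term, len(term_numbers))
            postings.append((term_number, doc_number))
    return term_numbers, postings


def invert(postings):
    """Sort by term, then document; record start and length for each term."""
    ordered = sorted(postings)
    doc_numbers = [doc for _, doc in ordered]
    directory = {}  # term_number -> (start location, list length)
    for position, (term_number, _) in enumerate(ordered):
        start, length = directory.get(term_number, (position, 0))
        directory[term_number] = (start, length + 1)
    return doc_numbers, directory


def postings_for(term, term_numbers, doc_numbers, directory):
    """Look up the document numbers that contain term."""
    if term not in term_numbers:
        return []
    start, length = directory[term_numbers[term]]
    return doc_numbers[start:start + length]


DOCS = ["web search engines", "search the web", "local search"]
terms, temp_file = scan(DOCS)
doc_numbers, directory = invert(temp_file)
```

After inversion, all document numbers for one term sit next to each other, so a lookup such as `postings_for("web", terms, doc_numbers, directory)` only needs the start location and the list length.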


Searching (Query Processing)

• Look up query term in term dictionary
• Get the postings list
• Find documents that match all search terms
    – Find documents for each term and merge lists where common documents occur
• Rank documents and report
    – As many as required or until end of the list
• Still possible to find a result on one search and not find that same item on a subsequent search of the same terms
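The lookup, merge, and rank steps can be sketched as follows. The `INDEX` and `SCORES` dictionaries are invented sample data, and ranking by a single stored score per document is a simplifying assumption; real engines combine many signals.

```python
def intersect(a, b):
    """Merge two ascending lists of document numbers, keeping common ones."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def search(query, index, scores, limit=10):
    """Return up to limit documents containing every query term, best first."""
    terms = query.lower().split()
    if not terms:
        return []
    lists = [index.get(term, []) for term in terms]  # dictionary lookup
    lists.sort(key=len)              # start with the shortest postings list
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    ranked = sorted(result, key=lambda doc: (-scores.get(doc, 0.0), doc))
    return ranked[:limit]


# Hypothetical postings lists and per-document scores.
INDEX = {"web": [1, 3, 5, 8], "search": [1, 2, 3, 8, 9], "local": [2, 9]}
SCORES = {1: 0.2, 3: 0.9, 8: 0.5}

hits = search("web search", INDEX, SCORES)
```

Because both lists are sorted by document number, the merge reads each list once from front to back, which is why the inverted file is sorted "secondarily by document number".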


Expanding from the basics

• Each of the phases of web searching is simple in concept, but complicated by the sheer magnitude of the task.

• The same ideas, applied on a smaller scale (in a company intranet, for example), can be carried out efficiently.

• The Web presents special challenges.


Crawling

• A single machine running a simple crawling algorithm would not do well in finding all Web pages.

• Large data centers
    – Redundancy and fault tolerance
    – Parallel operation
    – (SIGCSE talk by Marissa Mayer of Google)
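A quick back-of-the-envelope calculation shows why a single machine is not enough and what parallel operation buys. The sketch assumes one HTTP request per second (the rate that gives 86,400 requests per day) and 20 billion pages; the figure of 10,000 machines is purely hypothetical, and perfect parallelism is assumed.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400


def crawl_years(pages, seconds_per_request=1.0, machines=1):
    """Years needed to fetch every page once, ignoring failures and retries."""
    pages_per_day = machines * SECONDS_PER_DAY / seconds_per_request
    return pages / pages_per_day / 365


single = crawl_years(20_000_000_000)                      # one simple crawler
parallel = crawl_years(20_000_000_000, machines=10_000)   # hypothetical fleet

print(f"one machine: about {single:.0f} years")
print(f"10,000 machines: about {parallel * 365:.0f} days")
```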


Crawling reality

• Speed - amazing numbers:
    – @ 1 sec per HTTP request, max 86,400 per day = 634 years for 20 billion pages
• Politeness
    – Overwhelming web servers
• Excluded content
    – Robots.txt
• Duplicate content
    – Identifying duplicates can be tricky - why?
• Continuous crawling
    – Keeping current
    – Note comment about “current time” - how would you fix that?
    – Priority queue for crawling schedule - why?
• Spam
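Three of these concerns can be illustrated together with Python's standard library: `urllib.robotparser` for excluded content, a hash for exact duplicates, and `heapq` for a priority-queue crawling schedule. The robots.txt text, the `ExampleBot` agent name, the URLs, and the due times are all hypothetical. The fingerprint shown only catches pages that are identical after whitespace and case normalization, which hints at why duplicates are tricky: near-duplicates with a different date or advertisement slip through.

```python
import hashlib
import heapq
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def fingerprint(html):
    """Hash of whitespace- and case-normalized text; exact duplicates only."""
    normalized = " ".join(html.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def schedule(urls_with_due_time, agent="ExampleBot"):
    """Yield the allowed URLs, earliest due time first."""
    heap = [(due, url) for url, due in urls_with_due_time
            if robots.can_fetch(agent, url)]      # respect excluded content
    heapq.heapify(heap)
    while heap:
        _, url = heapq.heappop(heap)
        yield url


# Assumption: pages that change often get an earlier due time.
PENDING = [
    ("http://example.test/news", 5.0),
    ("http://example.test/private/report", 1.0),
    ("http://example.test/home", 2.0),
]
plan = list(schedule(PENDING))
```

A priority queue fits the schedule because the crawler always wants the page that is most overdue next, and new or rescheduled pages can be pushed in at any time without re-sorting everything.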


Indexing large collections

• The Web is the ultimate “large collection”
• “Estimating 500 terms in each of 20 billion pages” --> 10 trillion entries!
• Divide and conquer, as the crawler did
    – Each indexer builds a partial file in memory
    – Stops when memory is full
    – Write to disk, clear memory, and start over
• Merge the partial files to make the full index
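The build-partial-files-then-merge idea can be sketched as below. This is a toy illustration under stated assumptions: "memory is full" is modeled by a made-up `memory_limit` of four postings, postings are (term, document number) pairs written as tab-separated text, and the function names are invented. `heapq.merge` does the merging because it reads already-sorted inputs lazily instead of loading them all at once.

```python
import heapq
import os
import tempfile


def write_run(postings, directory):
    """Sort one memory-load of (term, doc_number) postings; write it to disk."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for term, doc_number in sorted(postings):
            out.write(f"{term}\t{doc_number}\n")
    return path


def read_run(path):
    """Stream the postings of one partial file back in sorted order."""
    with open(path, encoding="utf-8") as run:
        for line in run:
            term, doc_number = line.rstrip("\n").split("\t")
            yield term, int(doc_number)


def build_index(docs, directory, memory_limit=4):
    """Write partial files of at most memory_limit postings, then merge them."""
    runs, buffer = [], []
    for doc_number, text in enumerate(docs):
        for term in sorted(set(text.lower().split())):
            buffer.append((term, doc_number))
            if len(buffer) >= memory_limit:       # "memory is full"
                runs.append(write_run(buffer, directory))
                buffer = []                       # clear memory, start over
    if buffer:
        runs.append(write_run(buffer, directory))
    index = {}
    for term, doc_number in heapq.merge(*(read_run(p) for p in runs)):
        index.setdefault(term, []).append(doc_number)
    return index, len(runs)


with tempfile.TemporaryDirectory() as workdir:
    full_index, run_count = build_index(
        ["web search engines", "search the web", "local search"], workdir)
```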


Data structures for indexing

• Trees, tries, hash tables
    – Various ways to organize the terms for easy lookup
• Numbers of terms
    – Not just all words in all languages
    – Acronyms, proper names, etc.
    – Must deal with common phrases also
        • Separate index entries (postings) for common word combinations
• Compression
    – Saves space, increases processing
• Anchor text -- fie on those who use “click here”!!
• Link popularity score
    – Give a score to a page based on popularity, also on query-independent factors.
    – Think about the implications of this.
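To make the compression bullet concrete, here is a sketch of one common textbook scheme: store the gaps between successive document numbers rather than the numbers themselves, and pack each gap into as few bytes as possible. The slides do not say which scheme the article or any search engine uses, so treat the choice of variable-byte gap encoding, and the function names, as assumptions for illustration.

```python
def encode_postings(doc_numbers):
    """Gap-encode an ascending list, then pack each gap in 7-bit groups."""
    out = bytearray()
    previous = 0
    for doc_number in doc_numbers:
        gap = doc_number - previous
        previous = doc_number
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits; more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # high bit marks the last byte of a gap
    return bytes(out)


def decode_postings(data):
    """Reverse encode_postings: rebuild gaps, then sum them up."""
    doc_numbers = []
    current = 0    # running sum of the gaps decoded so far
    gap = 0
    shift = 0
    for byte in data:
        if byte & 0x80:              # last byte of this gap
            gap |= (byte & 0x7F) << shift
            current += gap
            doc_numbers.append(current)
            gap = 0
            shift = 0
        else:
            gap |= byte << shift
            shift += 7
    return doc_numbers


postings = [3, 7, 300, 100000]
packed = encode_postings(postings)
```

Here four document numbers fit in 7 bytes instead of the 16 bytes that four 32-bit integers would take. The trade-off named on the slide is visible too: every lookup now has to run `decode_postings` before it can use the list.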


Query Processing

• Most queries are short and do not provide much context
• Result quality -- use some of the techniques from information retrieval
    – Once a preliminary list of responses is obtained, treat that as the collection and use IR techniques to improve the quality of the response.
        • Some limitations. No way to judge how complete the initial list is.
    – Techniques are part of the trade secrets of the companies
• Speeding up:
    – Skipping
    – Early termination
    – Document numbering
    – Caching
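Three of the speed-up ideas can be combined in one small sketch. It rests on an assumption about what "document numbering" means here: documents are numbered in decreasing order of a query-independent score, so the smallest document numbers are the best pages. Under that assumption, an intersection that walks the postings in ascending order produces the best matches first and can stop as soon as it has enough (early termination). `functools.lru_cache` stands in for a result cache. The `INDEX` data and function names are invented, and skipping is not shown.

```python
from functools import lru_cache
from itertools import islice

# Hypothetical index. Assumption: document 0 has the highest
# query-independent score, document 1 the next highest, and so on.
INDEX = {
    "web": (0, 2, 3, 7, 9, 12),
    "search": (0, 1, 3, 4, 9, 12, 15),
}


def intersect_lazily(a, b):
    """Yield common document numbers of two ascending sequences one by one."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)


@lru_cache(maxsize=1024)
def top_results(query, k=3):
    """Return the first k documents matching every term; results are cached."""
    terms = query.lower().split()
    if not terms:
        return ()
    matches = iter(INDEX.get(terms[0], ()))
    for term in terms[1:]:
        matches = intersect_lazily(matches, INDEX.get(term, ()))
    return tuple(islice(matches, k))   # early termination: stop after k hits


first = top_results("web search")
again = top_results("web search")      # answered from the cache
```

Because the intersection is a generator, the tail of each postings list is never read once `k` results have been found, and the second identical query does no index work at all.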
