IS530 Lesson 12 Boolean vs. Statistical Retrieval Systems


Page 1: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

IS530 Lesson 12

Boolean vs. Statistical Retrieval Systems

Page 2: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Boolean or Statistical?

Most web search engines default to statistical retrieval and offer Boolean as an advanced option

Most proprietary online systems default to Boolean and offer statistical retrieval as an alternative

Statistical search engine vs. relevance ranking of Boolean results

Page 3: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Web Search Engines

Databases generated by robotic (non-human) programs: spiders, wanderers, web walkers, agents

Full-text indexing of website contents

Supports advanced, complex search strategies

Page 4: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

3 Parts of a Web Search Engine

1. Spider or web-crawler reads webpage, follows links

2. Index catalogs webpages read by spider

3. Search engine software matches queries, lists most relevant site first
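The index and query-matching parts can be illustrated with a minimal sketch. It assumes the spider has already fetched a few pages; the page URLs, the texts, and the scoring rule (count of distinct query words found) are hypothetical and far simpler than anything a real engine uses.

```python
from collections import defaultdict

# Hypothetical pages that a spider might already have fetched.
PAGES = {
    "a.example/ice": "slipping on ice in winter",
    "a.example/snow": "snow and ice removal rules",
    "a.example/risk": "assumption of risk in sports",
}

def build_index(pages):
    """Part 2: map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Part 3: rank pages by how many distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(query.lower().split()):
        for url in index.get(word, ()):
            scores[url] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))

index = build_index(PAGES)
results = search(index, "snow ice")
```

Here `results` lists `a.example/snow` first because it is the only page containing both query words.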

Page 5: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

3 Parts of an Online System

1) Database building software (dataware): follows rules with known fields

2) Index/dictionary file: list of all words, and sometimes phrases, in the indexed fields

3) Search engine software: matches queries; Boolean or statistical; output in LIFO or relevance order

Page 6: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Boolean Operators

AND: limits search, decreases hits, increases precision

OR: expands search, increases hits, increases recall

NOT: limits search; seldom used, too strong

Proximity operators: Adj, (N)ear, (W)ith; limit a search, increase precision
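The effect of each Boolean operator on the number of hits can be sketched with set operations. The posting lists below are hypothetical record numbers, not data from any real system.

```python
# Hypothetical posting lists: term -> set of matching record numbers.
postings = {
    "ice": {1, 2, 3, 5},
    "snow": {2, 4, 5, 6},
    "hockey": {3, 5},
}

# AND keeps only records containing both terms (fewer hits).
and_hits = postings["ice"] & postings["snow"]

# OR keeps records containing either term (more hits).
or_hits = postings["ice"] | postings["snow"]

# NOT drops every record containing the excluded term.
not_hits = postings["ice"] - postings["hockey"]
```

With these lists, AND returns 2 records, OR returns 6, and NOT returns 2. Record 5 mentions both snow and hockey, so NOT discards it even though it might have been relevant, which is one reason the operator is considered too strong.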

Page 7: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Command Interface Boolean Searching (Westlaw)

Find information about the assumption of risk involving people who fall after slipping in wintery conditions.

assum! /5 risk /p (ic* or snow****) /p (slip! or fell or fall***)
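The truncation symbols in the query above can be sketched as regular expressions. The slide does not define the symbols, so this sketch assumes that a trailing `!` matches any number of additional characters and that each trailing `*` allows at most one additional character. The helper names `term_to_regex` and `matches` are hypothetical, and the `/5` and `/p` proximity connectors are not modeled.

```python
import re

def term_to_regex(term):
    """Translate one search term into a regular expression.

    Assumed semantics (not stated on the slide):
      trailing '!'  -> any number of extra word characters
      trailing '*'s -> at most one extra character per asterisk
    """
    stem = term.rstrip("!*")
    suffix = term[len(stem):]
    if suffix == "":
        tail = ""
    elif suffix == "!":
        tail = r"\w*"
    elif set(suffix) == {"*"}:
        tail = r"\w{0,%d}" % len(suffix)
    else:
        raise ValueError("mixed truncation symbols are not handled: " + term)
    return re.compile(re.escape(stem) + tail, re.IGNORECASE)

def matches(term, word):
    """True if the whole word satisfies the truncated term."""
    return term_to_regex(term).fullmatch(word) is not None
```

Under these assumptions `matches("assum!", "assumption")` and `matches("snow****", "snowfall")` are true, while `matches("ic*", "icing")` is false because "icing" adds three characters to the stem.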

Page 8: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Natural Language and Relevance Ranking (WIN)

I need information on assumption of risk involving a person who has fallen on ice or snow.

Page 9: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Non-Boolean Retrieval Systems

Statistical (associative, probabilistic, or relevance systems)

Linguistic (semantic)

Page 10: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Statistical Retrieval Systems

Incorporate relevance ranking

May incorporate relevance feedback

May have natural language interface

Used by almost all web search engines

Page 11: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Algorithm

Latin algorismus, after al-Khwārizmī, Arabian mathematician (AD 825)

Step-by-step procedure for solving mathematical problems (Merriam-Webster, http://www.m-w.com/)

Statistical search engines use weighting algorithms to compute relevance

Page 12: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Statistical Search Engines

Weighting algorithms are proprietary

Search engines differ in how they assign weights and compute relevance ranking

Search results differ: studies found only about 40% overlap
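An overlap figure like this can be computed in more than one way, and the slide does not say which the studies used. The sketch below assumes one common definition, shared results divided by all distinct results from both engines. The result lists are hypothetical.

```python
def overlap(results_a, results_b):
    """Share of all distinct results that both engines returned."""
    a, b = set(results_a), set(results_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top-7 result lists from two engines for the same query.
engine_a = ["p1", "p2", "p3", "p4", "p5", "p6", "p7"]
engine_b = ["p4", "p5", "p6", "p7", "p8", "p9", "p10"]

share = overlap(engine_a, engine_b)
```

Four of the ten distinct pages appear in both lists, so `share` is 0.4.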

Page 13: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Statistical Web Retrieval Factors

Popularity: number of other sites that link to a site; authoritative sites given heavier weight (Google)

Meta-tags may boost ranking (Inktomi/Overture)

Direct Hit may boost ranking (HotBot)
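The popularity factor can be sketched by counting, for each page, how many other pages link to it. The link graph is hypothetical, and the sketch deliberately omits the heavier weighting of authoritative sites, so it is not Google's actual algorithm.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "b"],
}

def inlink_counts(links):
    """Count how many distinct other pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # a linking page counts once
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts

counts = inlink_counts(links)
# Most linked-to page first; ties broken alphabetically.
ranked = sorted(counts, key=lambda page: (-counts[page], page))
```

Page `c` ranks first with three incoming links, and page `d` ranks last because nothing links to it.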

Page 14: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Linguistic Retrieval System

Natural Language & Relevance Ranking

WIN - (Westlaw Is Natural) has some elements

I need information on assumption of risk involving a person who has fallen on ice or snow.

Page 15: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

WIN Steps

1. Enter query in plain English

2. System removes stop phrases

3. Matches legal phrases from thesaurus, adjusts weighting

4. Removes stop words

Page 16: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

WIN Steps (cont.)

5. Stemming

6. Searches database indexes in OR relationship

7. Statistical comparison applied

8. Results placed in ranked order
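The eight steps can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the stop phrase list, the phrase thesaurus and its weights, the stop word list, the crude suffix-stripping `stem` function, and the weighted-sum score that stands in for the statistical comparison. It is not WIN's actual processing.

```python
import re

STOP_PHRASES = ["i need information on"]         # hypothetical
LEGAL_PHRASES = {"assumption of risk": 2.0}      # hypothetical phrase -> weight
STOP_WORDS = {"a", "on", "or", "who", "has", "involving", "of", "the"}

def clean(text):
    """Lowercase and replace punctuation with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def stem(word):
    """Crude suffix stripper standing in for a real stemmer (step 5)."""
    for suffix in ("ing", "en", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse_query(query):
    text = clean(query)                          # step 1: plain English in
    for phrase in STOP_PHRASES:                  # step 2: stop phrases
        text = text.replace(phrase, " ")
    weights = {}
    for phrase, weight in LEGAL_PHRASES.items(): # step 3: legal phrases
        if phrase in text:
            weights[phrase] = weight
            text = text.replace(phrase, " ")
    for word in text.split():
        if word not in STOP_WORDS:               # step 4: stop words
            weights[stem(word)] = 1.0            # step 5: stemming
    return weights

def score(doc, weights):
    """Steps 6-7: any term may match (OR); matched weights are summed."""
    text = clean(doc)
    stems = {stem(w) for w in text.split()}
    total = 0.0
    for term, weight in weights.items():
        if (" " in term and term in text) or term in stems:
            total += weight
    return total

def rank(docs, weights):
    """Step 8: documents with a nonzero score, best first."""
    scored = [(score(d, weights), d) for d in docs]
    return [d for s, d in sorted(scored, key=lambda p: -p[0]) if s > 0]

QUERY = ("I need information on assumption of risk involving "
         "a person who has fallen on ice or snow.")
DOCS = [
    "Snow removal duties of a landlord.",
    "Assumption of risk when a skier falls on ice.",
    "Contract formation and consideration.",
]
weights = parse_query(QUERY)
ranking = rank(DOCS, weights)
```

The skier document ranks first because it matches the weighted legal phrase plus two stemmed terms, the landlord document matches only "snow", and the contract document is dropped because it matches nothing.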

Page 17: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Factors in Determining Relevance

Proximity of query words to each other

Position of query words: keywords in title rank higher, as do keywords in a headline or near the top

Relative length of document (“normalization”)

Stemming

Page 18: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Factors in Determining Relevance (cont.)

Ignore very frequent terms

Inverse document frequency (terms that are rare across the collection get heavier weight)

Relevance feedback

Stop words

Query expansion/thesaurus
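Several of these factors (stop words, down-weighting of frequent terms, and document length normalization) can be combined in a short sketch. It uses a textbook TF-IDF weighting as a stand-in, since the engines' own weighting algorithms are proprietary; the documents and stop word list are hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini-collection.
DOCS = {
    "d1": "ice ice snow fall",
    "d2": "snow removal contract law and the duty of the owner",
    "d3": "the law of the contract",
}
STOP_WORDS = {"the", "of", "and"}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf_scores(query, docs):
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for words in tokens.values():
        df.update(set(words))
    scores = {}
    for name, words in tokens.items():
        counts = Counter(words)
        total = 0.0
        for term in tokenize(query):
            if counts[term]:
                tf = counts[term] / len(words)       # length normalization
                idf = math.log(n_docs / df[term])    # rare terms weigh more
                total += tf * idf
        scores[name] = total
    return scores

scores = tf_idf_scores("ice snow", DOCS)
ranking = sorted(scores, key=lambda name: -scores[name])
```

For the query "ice snow", `d1` ranks first: "ice" appears in only one document, so it carries more weight than "snow", and `d1` is short, so its term frequencies are high.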

Page 19: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Features Users Can Control

Designating “bound phrases”

Flagging terms that must be present*

Specifying truncat?

Indicating (synonym groups)

Synonym dictionaries

Page 20: IS530  Lesson 12 Boolean vs. Statistical  Retrieval Systems

Web Sites that list search engines and features:

www.pandia.com

www.searchenginewatch.com

http://notess.com