9/6/2001 Information Organization and Retrieval
Introduction to Information Retrieval (cont.): Boolean Model
University of California, Berkeley
School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson
The Standard Retrieval Interaction Model
IR is an Iterative Process
[Figure labels: Repositories, Workspace, Goals]
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
[Figure: a sequence of queries Q0, Q1, Q2, Q3, Q4, Q5]
Restricted Form of the IR Problem
• The system has available only pre-existing, “canned” text passages.
• Its response is limited to selecting from these passages and presenting them to the user.
• It must select, say, 10 or 20 passages out of millions or billions!
Information Retrieval
• Revised Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries.
• This set of assumptions underlies the field of Information Retrieval.
Some IR History
– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from mid 1950’s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed (‘60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical advances (‘70s)
  • Refinements and Advances in application (‘80s)
  • User Interfaces, Large-scale testing and application (‘90s)
Structure of an IR System
[Figure: Information Storage and Retrieval System. Adapted from Soergel, p. 19]
• Search Line
  – Interest profiles & Queries
  – Formulating query in terms of descriptors
  – Storage of profiles
  – Store 1: Profiles/Search requests
• Storage Line
  – Documents & data
  – Indexing (Descriptive and Subject)
  – Storage of Documents
  – Store 2: Document representations
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
• Comparison/Matching
• Potentially Relevant Documents
Relevance (introduction)
• In what ways can a document be relevant to a query?
  – Answer precise question precisely.
    – Who is buried in Grant’s tomb? Grant.
  – Partially answer question.
    – Where is Danville? Near Walnut Creek.
  – Suggest a source for more information.
    – What is lymphedema? Look in this Medical Dictionary.
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...
• Ideally, IR systems should retrieve ALL and ONLY the RELEVANT documents for a user…
Query Languages
• A way to express the question (information need)
• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)
Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
  • thesaurus terms
– connectors
  • AND
  • OR
  • NOT
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:
• Cat x x x x
• Dog x x x x x
• Collar x x x x
• Leash x x x x
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:
• Cat x x
• Dog x x
• Collar x x
• Leash x x
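The two slides above can be checked mechanically. The following is a minimal sketch, assuming a document is represented simply as the set of query terms it contains (the names `TERMS`, `matches`, `working`, and `failing` are illustrative, not from the lecture). It enumerates every presence/absence combination of the four terms and separates those that satisfy the query from those that do not.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")

def matches(doc_terms):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) against a set of terms."""
    return (("cat" in doc_terms or "dog" in doc_terms)
            and ("collar" in doc_terms or "leash" in doc_terms))

# Enumerate all 2**4 = 16 presence/absence combinations of the four terms.
combos = [
    {t for t, present in zip(TERMS, flags) if present}
    for flags in product([False, True], repeat=4)
]
working = [c for c in combos if matches(c)]
failing = [c for c in combos if not matches(c)]

print(len(working), "combinations satisfy the query")
print(len(failing), "combinations do not")
```

A combination works only when at least one term from each parenthesized group is present, which is why a document containing both Cat and Dog but neither Collar nor Leash fails.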
Boolean Logic
[Figure: Venn diagrams relating sets A, B, and C]
De Morgan's Law:
$\overline{A \cap B} = \overline{A} \cup \overline{B}$
$\overline{A \cup B} = \overline{A} \cap \overline{B}$
Boolean Queries
– Usually expressed as INFIX operators in IR
  • ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  • ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  • (a AND b AND c AND d)
– Some rules - (De Morgan revisited)
  • NOT(a) AND NOT(b) = NOT(a OR b)
  • NOT(a) OR NOT(b) = NOT(a AND b)
  • NOT(NOT(a)) = a
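The rewrite rules above can be verified by brute force over a truth table. This is a small sketch rather than part of the lecture; the function names `rules_hold` and `example_query` are made up for illustration.

```python
from itertools import product

def rules_hold():
    """Check the three rewrite rules on every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

def example_query(a, b, c):
    """((a AND b) OR (c AND (NOT b))) written with Python's infix operators."""
    return (a and b) or (c and (not b))

print(rules_hold())
```

Python's `and`, `or`, and `not` follow the same conventions as the slide: the binary connectors are infix and `not` is a unary prefix operator.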
Boolean Logic
[Figure: Venn diagram of three terms t1, t2, t3 over documents D1 to D11, divided into eight regions m1 to m8]
Each region m1 ... m8 is one of the eight conjunctions of t1, t2, and t3 in which every term appears either plain or negated.
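To make the eight regions concrete, here is a minimal sketch that enumerates every conjunction of three terms in plain or negated form and reports which region a document falls into. The document contents are hypothetical, and the order in which the regions are generated is not meant to match the slide's m1 to m8 numbering.

```python
from itertools import product

TERMS = ("t1", "t2", "t3")

def minterm_label(flags):
    """Render one region as a conjunction of plain or negated terms."""
    return " AND ".join(t if f else f"NOT {t}" for t, f in zip(TERMS, flags))

# Three terms give 2**3 = 8 mutually exclusive regions.
minterms = [minterm_label(flags) for flags in product([True, False], repeat=3)]

def region_of(doc_terms):
    """Return the single region that a document (a set of terms) belongs to."""
    return minterm_label(tuple(t in doc_terms for t in TERMS))

# Hypothetical documents, for illustration only.
docs = {"D1": {"t1"}, "D2": {"t1", "t2"}, "D3": {"t1", "t2", "t3"}, "D4": set()}
for name, terms in docs.items():
    print(name, "->", region_of(terms))
```

Every document lands in exactly one region, so any Boolean query over t1, t2, t3 can be answered by taking the union of the regions it covers.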
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure: Venn diagram of Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
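The relaxed query is an OR over every three-term subset of the four concepts, which amounts to requiring at least three of the four. A minimal sketch, assuming each document is represented as a set of index terms (the lowercase term spellings and function names are illustrative):

```python
from itertools import combinations

FACETS = ("cracks", "beams", "width_measurement", "prestressed_concrete")

def formal_query(index_terms):
    """All four concepts must be present."""
    return all(t in index_terms for t in FACETS)

def relaxed_query(index_terms):
    """OR over every 3-term AND, mirroring the relaxed query term by term."""
    return any(all(t in index_terms for t in triple)
               for triple in combinations(FACETS, 3))

doc = {"cracks", "beams", "width_measurement"}
print(formal_query(doc), relaxed_query(doc))
```

A document indexed with only three of the concepts fails the formal query but is retrieved by the relaxed one.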
Pseudo-Boolean Queries
• A new notation, from web search
  – +cat dog +collar leash
• Does not mean the same thing!
• Need a way to group combinations.
• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”
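One way to see why the web notation differs from a Boolean query is to sketch its matching rule. The sketch below assumes the common web-search convention that a leading "+" marks a required term while unmarked terms are optional and only influence ranking; that convention is an assumption here, not something the slide spells out. The function names are illustrative, and quoted phrases are not handled.

```python
def parse_plus_query(query):
    """Split a query into required (+term) and optional terms."""
    required, optional = [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:].lower())
        else:
            optional.append(token.lower())
    return required, optional

def matches_plus_query(query, doc_terms):
    """Under the assumed convention, only the required terms decide a match."""
    required, _ = parse_plus_query(query)
    return all(t in doc_terms for t in required)

query = "+cat dog +collar leash"
print(parse_plus_query(query))
print(matches_plus_query(query, {"dog", "leash"}))
```

Under this reading a document containing only Dog and Leash is rejected, although it satisfies (Cat OR Dog) AND (Collar OR Leash), so the two notations are not interchangeable.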
[Figure: retrieval pipeline with components Information need, text input, Parse, Query, Pre-process, Index, Collections, Rank]
Result Sets
• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
• Example: Dialog query
  • (Redford AND Newman)
  • -> S1 1450 documents
  • (S1 AND Sundance)
  • -> S2 898 documents
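Refining a result set can be pictured as set intersection over posting sets. The sketch below uses tiny made-up posting sets (the document IDs are hypothetical and unrelated to the counts in the Dialog example) to show that narrowing S1 with an extra AND term gives the same answer as running the full conjunction on the whole collection.

```python
# Hypothetical postings: term -> set of document IDs containing it.
postings = {
    "redford": {1, 2, 3, 5},
    "newman": {2, 3, 5, 8},
    "sundance": {3, 8, 9},
}

# S1: (Redford AND Newman), run on the entire collection.
s1 = postings["redford"] & postings["newman"]

# S2: (S1 AND Sundance), run only on the previous result set.
s2 = s1 & postings["sundance"]

# The same refinement run from scratch on the entire collection.
from_scratch = postings["redford"] & postings["newman"] & postings["sundance"]

print(sorted(s1), sorted(s2), s2 == from_scratch)
```

The equivalence holds because the refinement only adds an AND term; reformulations that broaden the query with OR would need the entire collection again.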
[Figure: retrieval pipeline with components Information need, text input, Parse, Query, Pre-process, Index, Collections, Rank, extended with Reformulated Query and Re-Rank]
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?
• Fancier methods have been investigated
  – p-norm is most famous
    • usually impractical to implement
    • usually hard for user to understand
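The question about many hits on one term versus one hit on each term shows up immediately when ordering by total hits. A minimal sketch with two hypothetical documents (the names `total_hits` and `ranked` are illustrative):

```python
from collections import Counter

def total_hits(query_terms, doc_tokens):
    """Sum the occurrences of every query term in the document."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

query = ["cat", "dog"]
docs = {
    "A": "cat cat cat cat".split(),  # many hits on a single term
    "B": "cat dog".split(),          # one hit on each term
}

ranked = sorted(docs, key=lambda name: total_hits(query, docs[name]), reverse=True)
print(ranked)
```

Document A outranks B on total hits even though only B contains both query terms, which is exactly the ambiguity the slide raises.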
Boolean
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial systems until the WWW
Faceted Boolean Query
• Strategy: break query into facets (polysemous with earlier meaning of facets)
  – conjunction of disjunctions
      a1 OR a2 OR a3
      AND
      b1 OR b2
      AND
      c1 OR c2 OR c3 OR c4
  – each facet expresses a topic
      “rain forest” OR jungle OR amazon
      AND
      medicine OR remedy OR cure
      AND
      Smith OR Zhou
Faceted Boolean Query
• Query still fails if one facet missing
• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match
• Problem: Facets still undifferentiated
• Alternative: assign weights to facets
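Coordination level ranking can be sketched in a few lines: count how many facets a document satisfies and sort by that count. The facets below reuse the example from the previous slide; the documents and function names are hypothetical, and every facet is weighted equally.

```python
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

def facets_satisfied(doc_terms):
    """Number of facets with at least one of their terms in the document."""
    return sum(1 for facet in FACETS if facet & doc_terms)

def strict_match(doc_terms):
    """The pure faceted Boolean query: every facet must be satisfied."""
    return facets_satisfied(doc_terms) == len(FACETS)

def coordination_rank(docs):
    """Order document names by how many facets they satisfy, best first."""
    return sorted(docs, key=lambda name: facets_satisfied(docs[name]), reverse=True)

docs = {
    "d3": {"smith"},
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
}
print(coordination_rank(docs))
```

Only d1 passes the strict query, but the ranking still surfaces d2 and d3 as partial matches. Assigning weights to facets would replace the count of 1 per facet with a per-facet weight.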
Proximity Searches
• Proximity: terms occur within K positions of one another
  – pen w/5 paper
• A “Near” function can be more vague
  – near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
• Phrase Variants
  – “retrieval of information” “information retrieval”
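A proximity operator such as pen w/5 paper can be sketched over a list of token positions. This example assumes that "within K positions" means the absolute difference between two token positions is at most K, in either order; real systems differ in how they count, and the function name `within` is illustrative.

```python
def within(tokens, a, b, k):
    """True if some occurrence of a and some occurrence of b are at most k positions apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= k for i in pos_a for j in pos_b)

near_text = "put pen to paper".split()
far_text = "the pen lay on top of the paper".split()

print(within(near_text, "pen", "paper", 5))
print(within(far_text, "pen", "paper", 5))
```

To require a specific order, the test `abs(i - j) <= k` could be replaced with `0 < j - i <= k`.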
Filters
• Filters: Reduce set of candidate docs
• Often specified simultaneously with query
• Usually restrictions on metadata
  – restrict by:
    • date range
    • internet domain (.edu .com .berkeley.edu)
    • author
    • size
    • limit number of documents returned
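Metadata filters of this kind can be sketched as a function that discards candidates before or after query matching. The record fields (`date`, `domain`, `author`) and the sample records are hypothetical, chosen only to mirror the restrictions listed above.

```python
from datetime import date

def apply_filters(docs, date_from=None, date_to=None, domain=None,
                  author=None, limit=None):
    """Keep only documents whose metadata passes every filter that was given."""
    kept = []
    for d in docs:
        if date_from is not None and d["date"] < date_from:
            continue
        if date_to is not None and d["date"] > date_to:
            continue
        if domain is not None and not d["domain"].endswith(domain):
            continue
        if author is not None and d["author"] != author:
            continue
        kept.append(d)
    # The limit caps how many documents are returned.
    return kept if limit is None else kept[:limit]

# Hypothetical candidate documents.
candidates = [
    {"id": 1, "date": date(2001, 9, 6), "domain": "sims.berkeley.edu", "author": "hearst"},
    {"id": 2, "date": date(1999, 1, 15), "domain": "example.com", "author": "smith"},
    {"id": 3, "date": date(2000, 5, 1), "domain": "cornell.edu", "author": "salton"},
]

recent_edu = apply_filters(candidates, date_from=date(2000, 1, 1), domain=".edu")
print([d["id"] for d in recent_edu])
```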
How are the texts handled?
• What happens if you take the words exactly as they appear in the original text?
• What about punctuation, capitalization, etc.?
• What about spelling errors?
• What about plural vs. singular forms of words?
• What about cases and declension in non-English languages?
• What about non-Roman alphabets?
Content Analysis
• Automated Transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:
  – Automated Thesaurus Generation
  – Phrase Detection
  – Categorization
  – Clustering
  – Summarization
Techniques for Content Analysis
• Statistical
  – Single Document
  – Full Collection
• Linguistic
  – Syntactic
  – Semantic
  – Pragmatic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)
Text Processing
• Standard Steps:
  – Recognize document structure
    • titles, sections, paragraphs, etc.
  – Break into tokens
    • usually space and punctuation delimited
    • special issues with Asian languages
  – Stemming/morphological analysis
  – Store in inverted index (to be discussed later)
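The tokenizing and indexing steps can be sketched as follows. This is a simplified illustration, not the lecture's implementation: it treats any run of ASCII letters or digits as a token, lowercases everything, skips document-structure recognition and stemming, and stores only document IDs (no positions or counts). The sample documents are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

docs = {
    1: "The cat chased the dog.",
    2: "A dog, a collar, and a leash!",
}
index = build_inverted_index(docs)
print(sorted(index["dog"]))
```

With such an index, a Boolean AND becomes an intersection of posting sets and an OR becomes a union. The ASCII-only pattern is exactly where the slide's caveats about other languages and alphabets would bite.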
[Figure: retrieval pipeline (Information need, text input, Parse, Query, Pre-process, Index, Collections, Rank), annotated “How is the query constructed?” and “How is the text processed?”]
Document Processing Steps
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
  – Inflectional Morphology
    • E.g., inflect verb endings and noun number
    • Never change grammatical class
      – dog, dogs
      – tengo, tienes, tiene, tenemos, tienen
  – Derivational Morphology
    • Derive one word from another
    • Often change grammatical class
      – build, building; health, healthy
Automated Methods
• Powerful multilingual tools exist for morphological analysis
  – PCKimmo, Xerox Lexical technology
  – Require a grammar and dictionary
  – Use “two-level” automata
• Stemmers:
  – Very dumb rules work well (for English)
  – Porter Stemmer: Iteratively remove suffixes
  – Improvement: pass results through a lexicon
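To illustrate what "very dumb rules" and the lexicon improvement might look like, here is a toy suffix stripper. It is explicitly not the Porter algorithm: the suffix list is invented for this example, only a single longest-match pass is applied, and the optional `lexicon` argument is a hypothetical set of known stems used to veto bad results.

```python
# Illustrative suffix list, longest first; not Porter's rule set.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "s")

def naive_stem(word, lexicon=None):
    """Strip the first matching suffix, keeping at least three characters.

    If a lexicon is given, a stem is accepted only when it is a known word.
    """
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            if lexicon is None or stem in lexicon:
                return stem
    return word

print(naive_stem("dogs"), naive_stem("building"), naive_stem("organization"))
print(naive_stem("organization", lexicon={"dog", "build"}))
```

Without a lexicon, "organization" is cut down to a non-word stem; with one, the word is left untouched because no candidate stem is recognized.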
Errors Generated by Porter Stemmer (Krovetz 93)
Too Aggressive          Too Timid
organization/organ      european/europe
policy/police           cylinder/cylindrical
execute/executive       create/creation
arm/army                search/searcher
Next
• Statistical Properties of Text
• Preparing information for search: Lexical analysis
• Introduction to the Vector Space model of IR.