2003.10.28 - SLIDE 1IS 202 – FALL 2003
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2003
http://www.sims.berkeley.edu/academics/courses/is202/f03/
SIMS 202: Information Organization and Retrieval
Lecture 17: Boolean IR and Text Processing
2003.10.28 - SLIDE 2IS 202 – FALL 2003
Announcements
• Wishter volunteers meeting tonight 7:00
• Testers needed!!
  – UI Tests on Image Gallery/Annotation software
    • Thursday between 2-4
    • and Friday 10-4.
  – The tests will be approximately 1 ½ hours (but most likely will run a bit shorter.)
  – Signup sheet will be available at the end of class
2003.10.28 - SLIDE 3IS 202 – FALL 2003
Lecture Overview
• Review
  – Introduction to Information Retrieval
  – The Information Seeking Process
  – History of IR Research
• IR System Structure (revisited)
• Central Concepts in IR
• Boolean Logic and Boolean IR Systems
• Text Processing
• Discussion
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.28 - SLIDE 5IS 202 – FALL 2003
IR is an Iterative Process
[Figure: diagram with the labels Repositories, Workspace, and Goals]
2003.10.28 - SLIDE 6IS 202 – FALL 2003
Berry-Picking Model
[Figure: a search path moving through successive queries Q0, Q1, Q2, Q3, Q4, Q5]
A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
2003.10.28 - SLIDE 7IS 202 – FALL 2003
Restricted Form of the IR Problem
• The system has available only pre-existing, “canned” text passages
• Its response is limited to selecting from these passages and presenting them to the user
• It must select, say, 10 or 20 passages out of millions or billions!
2003.10.28 - SLIDE 8IS 202 – FALL 2003
Information Retrieval
• Revised Task Statement:
Build a system that retrieves documents that users are likely to find relevant to their queries
• This set of assumptions underlies the field of Information Retrieval
2003.10.28 - SLIDE 10IS 202 – FALL 2003
Structure of an IR System
[Figure: Information Storage and Retrieval System]
• Search line: Interest profiles & queries; Formulating query in terms of descriptors; Storage of profiles; Store 1: Profiles/Search requests
• Storage line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store 2: Document representations
• Comparison/Matching of the two stores yields Potentially Relevant Documents
• Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
Adapted from Soergel, p. 19
2003.10.28 - SLIDE 12IS 202 – FALL 2003
Central Concepts in IR
• Documents
• Queries
• Collections
• Evaluation
• Relevance
2003.10.28 - SLIDE 13IS 202 – FALL 2003
Documents
• What do we mean by a document?
  – Full document?
  – Document surrogates?
  – Pages?
• Buckland (JASIS, Sept. 1997) “What is a Document”
• Are IR systems better called Document Retrieval systems?
• A document is a representation of some aggregation of information, treated as a unit
2003.10.28 - SLIDE 14IS 202 – FALL 2003
Collection
• A collection is some physical or logical aggregation of documents
  – A database
  – A Library
  – An index?
  – Others?
2003.10.28 - SLIDE 15IS 202 – FALL 2003
Queries
• A query is some expression of a user’s information needs
• Can take many forms
  – Natural language description of need
  – Formal query in a query language
• Queries may not be accurate expressions of the information need
  – Differences between conversation with a person and formal query expression
2003.10.28 - SLIDE 16IS 202 – FALL 2003
Evaluation: Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
• Others?
2003.10.28 - SLIDE 17IS 202 – FALL 2003
What To Evaluate?
• How much of the information need was satisfied
• How much was learned about a topic
• Incidental learning
  – How much was learned about the collection
  – How much was learned about other topics
• How inviting the system is…
2003.10.28 - SLIDE 18IS 202 – FALL 2003
What To Evaluate?
What can be measured that reflects users’ ability to use system? (Cleverdon 66)
– Coverage of information
– Form of presentation
– Effort required/ease of use
– Time and space efficiency
– Recall (effectiveness)
  • Proportion of relevant material actually retrieved
– Precision (effectiveness)
  • Proportion of retrieved material actually relevant
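The two effectiveness measures can be sketched as set arithmetic. This is a minimal illustration, assuming the retrieved and relevant documents are available as sets of ids; the function name `precision_recall` and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 6 relevant overall, 3 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})
print(p, r)  # 0.75 0.5
```

Returning 0.0 for an empty set is a convention chosen here to avoid division by zero, not something the slide specifies.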
2003.10.28 - SLIDE 19IS 202 – FALL 2003
Relevance (revisited)
• “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’ know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.”
» Saracevic, 1975 p. 324
2003.10.28 - SLIDE 20IS 202 – FALL 2003
Relevance
• How relevant is the document
  – For this user, for this information need
• Subjective, but
• Measurable to some extent
  – How often do people agree a document is relevant to a query?
• How well does it answer the question?
  – Complete answer? Partial?
  – Background information?
  – Hints for further exploration?
2003.10.28 - SLIDE 21IS 202 – FALL 2003
Relevance Research and Thought
• Review to 1975 by Saracevic
• Reconsideration of user-centered relevance by Schamber, Eisenberg and Nilan, 1990
• Special Issue of JASIS on relevance (April 1994, 45(3))
2003.10.28 - SLIDE 22IS 202 – FALL 2003
Saracevic
• Relevance is considered as a measure of effectiveness of the contact between a source and a destination in a communications process
  – Systems view
  – Destinations view
  – Subject Literature view
  – Subject Knowledge view
  – Pertinence
  – Pragmatic view
2003.10.28 - SLIDE 23IS 202 – FALL 2003
Define Your Own Relevance
• As we saw last time, most definitions of relevance follow a “formula”:
  – Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor
From Saracevic, 1975 and Schamber 1990
2003.10.28 - SLIDE 24IS 202 – FALL 2003
Schamber, Eisenberg and Nilan
• “Relevance is the measure of retrieval performance in all information systems, including full-text, multimedia, question-answering, database management and knowledge-based systems.”
• Systems-oriented relevance: Topicality
2003.10.28 - SLIDE 25IS 202 – FALL 2003
Schamber, et al. Conclusions
• “Relevance is a multidimensional concept whose meaning is largely dependent on users’ perceptions of information and their own information need situations
• Relevance is a dynamic concept that depends on users’ judgments of the quality of the relationship between information and information need at a certain point in time.
• Relevance is a complex but systematic and measurable concept if approached conceptually and operationally from the user’s perspective.”
2003.10.28 - SLIDE 26IS 202 – FALL 2003
Janes’ View
Topicality
Pertinence
Relevance
Utility
Satisfaction
2003.10.28 - SLIDE 28IS 202 – FALL 2003
Query Languages
• A way to express the question (information need)
• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)
2003.10.28 - SLIDE 29IS 202 – FALL 2003
Simple Query Language: Boolean
– Terms + Connectors (or operators)
– Terms
  • Words
  • Normalized (stemmed) words
  • Phrases
  • Thesaurus terms
– Connectors
  • AND
  • OR
  • NOT
2003.10.28 - SLIDE 30IS 202 – FALL 2003
Boolean Queries
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)
2003.10.28 - SLIDE 31IS 202 – FALL 2003
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:
[Table: documents 1 to 7 marked against the terms CAT, DOG, COLLAR, and LEASH; every document shown contains at least one of Cat/Dog and at least one of Collar/Leash]
2003.10.28 - SLIDE 32IS 202 – FALL 2003
Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations works:
[Table: documents 1 to 7 marked against the terms CAT, DOG, COLLAR, and LEASH; no document shown contains both one of Cat/Dog and one of Collar/Leash]
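To make the query concrete, here is a small sketch that evaluates (Cat OR Dog) AND (Collar OR Leash) against a toy collection. The documents and their terms are hypothetical and are not the ones in the slide's table.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    1: {"cat", "collar"},
    2: {"dog", "leash"},
    3: {"cat", "dog", "collar", "leash"},
    4: {"cat", "dog"},        # fails: no Collar or Leash
    5: {"collar", "leash"},   # fails: no Cat or Dog
    6: set(),                 # fails: no query terms at all
}


def matches(terms):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) for one document's term set."""
    return ("cat" in terms or "dog" in terms) and ("collar" in terms or "leash" in terms)


result = sorted(doc_id for doc_id, terms in docs.items() if matches(terms))
print(result)  # [1, 2, 3]
```

A document must satisfy both parenthesized groups, so having every term from one group is not enough.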
2003.10.28 - SLIDE 33IS 202 – FALL 2003
Boolean Logic
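[Figure: Venn diagram of two sets A and B]
$C = A \cap B$
$C = A \cup B$
$C = \overline{A}$
De Morgan's Law:
$\overline{A \cap B} = \overline{A} \cup \overline{B}$
$\overline{A \cup B} = \overline{A} \cap \overline{B}$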
2003.10.28 - SLIDE 34IS 202 – FALL 2003
Boolean Queries
• Usually expressed as INFIX operators in IR
  – ((a AND b) OR (c AND b))
• NOT is UNARY PREFIX operator
  – ((a AND b) OR (c AND (NOT b)))
• AND and OR can be n-ary operators
  – (a AND b AND c AND d)
• Some rules - (De Morgan revisited)
  – NOT(a) AND NOT(b) = NOT(a OR b)
  – NOT(a) OR NOT(b) = NOT(a AND b)
  – NOT(NOT(a)) = a
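The three rules above can be checked exhaustively, since a and b each take only two values. A minimal sketch (the function name `rules_hold` is made up for this example):

```python
from itertools import product


def rules_hold():
    """Return True if the three identities hold for every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True


print(rules_hold())  # True
```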
2003.10.28 - SLIDE 35IS 202 – FALL 2003
Boolean Logic
[Figure: Venn diagram of three terms t1, t2, t3 containing documents D1 through D11. The circles divide the space into eight regions m1 through m8; each region corresponds to one combination of t1, t2, and t3 being present or absent.]
2003.10.28 - SLIDE 36IS 202 – FALL 2003
Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: Cracks AND Beams AND Width_measurement AND Prestressed_concrete
[Figure: Venn diagram of four sets: Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
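The relaxed query is the OR of every three-concept AND, which amounts to "at least three of the four concepts are present". A sketch under that reading; the concept strings and the sample document are illustrative assumptions:

```python
from itertools import combinations

CONCEPTS = ("cracks", "beams", "width_measurement", "prestressed_concrete")


def formal_query(terms):
    """All four concepts must be present (the AND of all terms)."""
    return all(c in terms for c in CONCEPTS)


def relaxed_query(terms):
    """OR over every 3-concept AND, as in the relaxed query."""
    return any(all(c in terms for c in combo) for combo in combinations(CONCEPTS, 3))


# Hypothetical document indexed with three of the four concepts.
doc = {"cracks", "beams", "prestressed_concrete"}
print(formal_query(doc), relaxed_query(doc))  # False True
```

`combinations(CONCEPTS, 3)` generates exactly the four three-term groups listed in the relaxed query.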
2003.10.28 - SLIDE 37IS 202 – FALL 2003
Pseudo-Boolean Queries
• A new notation, from web search
  – +cat dog +collar leash
• Does not mean the same thing!
• Need a way to group combinations
• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”
2003.10.28 - SLIDE 38IS 202 – FALL 2003
Another View of IR
[Figure: IR pipeline diagram with the components Information Need, Text Input, Query, Parse, Pre-Process, Collections, Index, and Rank]
2003.10.28 - SLIDE 39IS 202 – FALL 2003
Result Sets
• Run a query, get a result set
• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set
• Example: Dialog query
  • (Redford AND Newman)
  • -> S1 1450 documents
  • (S1 AND Sundance)
  • -> S2 898 documents
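Running a reformulated query on a result set can be pictured as intersecting the saved set with one more postings set. This sketch uses tiny made-up postings, so the counts differ from the Dialog figures; it does not reproduce Dialog's actual command syntax.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
postings = {
    "redford": {1, 2, 3, 5, 8},
    "newman": {2, 3, 5, 7, 8},
    "sundance": {3, 4, 8, 9},
}

s1 = postings["redford"] & postings["newman"]  # Redford AND Newman
s2 = s1 & postings["sundance"]                 # S1 AND Sundance, run on the result set only
print(len(s1), len(s2))  # 4 2
```

Because S2 is computed from S1, it can never be larger than S1.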
2003.10.28 - SLIDE 40IS 202 – FALL 2003
Feedback Queries
[Figure: IR pipeline diagram with a feedback loop; components are Information Need, Text Input, Query, Parse, Pre-Process, Collections, Index, and Rank, plus Reformulated Query and Re-Rank]
2003.10.28 - SLIDE 41IS 202 – FALL 2003
Ordering of Retrieved Documents
• Pure Boolean has no ordering
• In practice:
  – Order chronologically
  – Order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?
• Fancier methods have been investigated
  – p-norm is most famous
    • Usually impractical to implement
    • Usually hard for user to understand
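Ordering by total hits, and the problem the slide raises about it, can be seen in a short sketch. The documents and the whitespace tokenization are simplifying assumptions.

```python
from collections import Counter


def total_hits(text, query_terms):
    """Total number of occurrences of any query term in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)


# Hypothetical documents that both satisfy "cat OR dog".
docs = {
    "d1": "cat cat cat cat",  # many hits on one term
    "d2": "cat dog",          # one hit on each term
}
query = ["cat", "dog"]
ranked = sorted(docs, key=lambda d: total_hits(docs[d], query), reverse=True)
print(ranked)  # ['d1', 'd2']
```

Here d1 outranks d2 even though it never mentions "dog", which is exactly the "one of each term or many of one term" question.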
2003.10.28 - SLIDE 42IS 202 – FALL 2003
Boolean
• Advantages
  – Simple queries are easy to understand
  – Relatively easy to implement
• Disadvantages
  – Difficult to specify what is wanted
  – Too much returned, or too little
  – Ordering not well determined
• Dominant language in commercial systems until the WWW
2003.10.28 - SLIDE 43IS 202 – FALL 2003
Faceted Boolean Query
• Strategy: Break query into facets (polysemous with earlier meaning of facets)
  – Conjunction of disjunctions
    • (a1 OR a2 OR a3) AND
    • (b1 OR b2) AND
    • (c1 OR c2 OR c3 OR c4)
  – Each facet expresses a topic
    • (“rain forest” OR jungle OR amazon) AND
    • (medicine OR remedy OR cure) AND
    • (Smith OR Zhou)
2003.10.28 - SLIDE 44IS 202 – FALL 2003
Faceted Boolean Query
• Query still fails if one facet missing
• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match
• Problem: Facets still undifferentiated
• Alternative: Assign weights to facets
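Coordination level ranking can be sketched as counting, per document, how many facets have at least one matching term. The facets below come from the example query on the previous slide; the documents are hypothetical and assumed to be already reduced to sets of index terms.

```python
# Facets from the example query; each facet is a set of alternative terms.
FACETS = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]


def coordination_level(doc_terms):
    """Number of facets with at least one term present in the document."""
    return sum(1 for facet in FACETS if facet & doc_terms)


# Hypothetical documents, represented as sets of index terms or phrases.
docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}
ranked = sorted(docs, key=lambda d: coordination_level(docs[d]), reverse=True)
print([(d, coordination_level(docs[d])) for d in ranked])
# [('d1', 3), ('d2', 2), ('d3', 1), ('d4', 0)]
```

A strict faceted Boolean query would return only d1; the ranking keeps d2 and d3 as partial matches. Weighting facets, the last alternative on the slide, would replace the `1` in the sum with a per-facet weight.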
2003.10.28 - SLIDE 45IS 202 – FALL 2003
Proximity Searches
• Proximity: Terms occur within K positions of one another
  – pen w/5 paper
• A “Near” function can be more vague
  – near(pen, paper)
• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”
• Phrase Variants
  – “retrieval of information” “information retrieval”
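A "within K positions" test can be sketched from token positions. This is an illustrative reading of `pen w/5 paper` as "in either order, at most K token positions apart"; real systems differ in how they count positions and whether order matters.

```python
def positions(tokens, term):
    """Positions (0-based token offsets) at which term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(tokens, a, b, k):
    """True if some occurrence of a is within k positions of some occurrence of b."""
    return any(abs(i - j) <= k for i in positions(tokens, a) for j in positions(tokens, b))


# Hypothetical sentence: "pen" is at position 2 and "paper" at position 9.
tokens = "put the pen down next to the stack of paper".split()
print(within(tokens, "pen", "paper", 5))  # False
print(within(tokens, "pen", "paper", 7))  # True
```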
2003.10.28 - SLIDE 46IS 202 – FALL 2003
Filters
• Filters: Reduce set of candidate docs
• Often specified simultaneously with query
• Usually restrictions on metadata
  – Restrict by:
    • Date range
    • Internet domain (.edu .com .berkeley.edu)
    • Author
    • Size
    • Limit number of documents returned
2003.10.28 - SLIDE 47IS 202 – FALL 2003
Boolean Systems
• Most of the commercial database search systems that pre-date the WWW are based on Boolean search
  – Dialog, Lexis-Nexis, etc.
• Most Online Library Catalogs are Boolean systems
  – E.g., MELVYL
• Database systems use Boolean logic for searching
• Many of the search engines sold for intranet search of web sites are Boolean
2003.10.28 - SLIDE 48IS 202 – FALL 2003
Why Boolean?
• Easy to implement
• Efficient searching across very large databases
• Easy to explain results
  – “Has to have all of the words…” (AND)
  – “Has to have at least one of the words…” (OR)
2003.10.28 - SLIDE 50IS 202 – FALL 2003
Content Analysis
• Automated Transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:
  – Automated Thesaurus Generation
  – Phrase Detection
  – Categorization
  – Clustering
  – Summarization
2003.10.28 - SLIDE 51IS 202 – FALL 2003
Techniques for Content Analysis
• Statistical
  – Single Document
  – Full Collection
• Linguistic
  – Syntactic
  – Semantic
  – Pragmatic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)
2003.10.28 - SLIDE 52IS 202 – FALL 2003
Text Processing
• Standard Steps:
  – Recognize document structure
    • Titles, sections, paragraphs, etc.
  – Break into tokens
    • Usually space and punctuation delineated
    • Special issues with Asian languages
  – Stemming/morphological analysis
  – Store in inverted index (to be discussed later)
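Two of the standard steps, tokenizing and storing in an inverted index, can be sketched in a few lines. This is a simplified illustration: the tokenizer handles only lowercase ASCII letters and digits, the structure-recognition and stemming steps are skipped, and the function names and documents are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[tok].add(doc_id)
    return {tok: sorted(ids) for tok, ids in index.items()}


# Hypothetical two-document collection.
docs = {1: "Cracks in concrete beams.", 2: "Measuring beams, not cracks!"}
index = build_inverted_index(docs)
print(index["beams"])     # [1, 2]
print(index["concrete"])  # [1]
```

With the index in this shape, a Boolean AND becomes an intersection of two document-id lists and an OR becomes a union.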
2003.10.28 - SLIDE 53IS 202 – FALL 2003
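Content Analysis Areas
[Figure: IR pipeline diagram with the components Information Need, Text Input, Query, Parse, Pre-Process, Collections, Index, and Rank, annotated with two questions: “How is the text processed?” and “How is the query constructed?”]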
2003.10.28 - SLIDE 54
Document Processing Steps
From “Modern IR” Textbook
2003.10.28 - SLIDE 55IS 202 – FALL 2003
Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
  – Inflectional Morphology
    • E.g., inflect verb endings and noun number
    • Never change grammatical class
      – dog, dogs
      – tengo, tienes, tiene, tenemos, tienen
  – Derivational Morphology
    • Derive one word from another
    • Often change grammatical class
      – build, building; health, healthy
2003.10.28 - SLIDE 56IS 202 – FALL 2003
Automated Methods
• Powerful multilingual tools exist for morphological analysis
  – PCKimmo, Xerox Lexical technology
  – Require a grammar and dictionary
  – Use “two-level” automata
• Stemmers:
  – Very dumb rules work well (for English)
  – Porter Stemmer: Iteratively remove suffixes
  – Improvement: Pass results through a lexicon
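To give a feel for "very dumb rules", here is a toy suffix stripper that removes suffixes iteratively. It is not the Porter algorithm, which uses ordered rule steps with conditions on the stem; the suffix list and the `min_stem` threshold are assumptions made for this sketch.

```python
SUFFIXES = ("ing", "es", "s", "ed", "ly")  # hypothetical rule list, checked in this order


def strip_suffixes(word, min_stem=3):
    """Repeatedly remove a listed suffix while at least min_stem characters remain."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


print(strip_suffixes("dogs"))       # dog
print(strip_suffixes("buildings"))  # build
print(strip_suffixes("policy"))     # policy
```

Rules this crude over-strip easily, which is why passing the results through a lexicon is listed as an improvement.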
2003.10.28 - SLIDE 57IS 202 – FALL 2003
Errors Generated by Porter Stemmer
Too Aggressive          Too Timid
organization/organ      european/europe
policy/police           cylinder/cylindrical
execute/executive       create/creation
arm/army                search/searcher
From Krovetz ‘93
2003.10.28 - SLIDE 58IS 202 – FALL 2003
Lecture Overview
• Review– Introduction to Information Retrieval– The Information Seeking Process– History of IR Research
• IR System Structure (revisited)• Central Concepts in IR• Boolean Logic• Boolean IR Systems• Discussion
Credit for some of the slides in this lecture goes to Marti Hearst
2003.10.28 - SLIDE 59IS 202 – FALL 2003
Questions from Patrick Riley
• In Plato's Meno Dialogue, Plato asks: "How does one investigate what one does not know?" Plato's question is similar to typical questions we encounter in this and other readings of INFOSYS 202: how do we overcome the synonymy and polysemy problems faced by lexical searching? Can the LSA (Latent Semantic Analysis) and SVD (singular value decomposition) statistical techniques demonstrated by Dumais et al. solve the lexicon deficiencies in information retrieval?
2003.10.28 - SLIDE 60IS 202 – FALL 2003
Paradox
• The “Fundamental paradox of Information Retrieval” as stated by Roland Hjerppe
  – The need to describe that which you do not know in order to find it
2003.10.28 - SLIDE 61IS 202 – FALL 2003
Questions from Patrick Riley
• This paper is from 1988... do you know of any applications or advancements of this LSA approach from the information retrieval community? (Example: AI (LSA passed the TOEFL).)
• And what are some of the limitations of using this corpus-based text comparison mechanism? (Example: no use of word order, incompleteness?) How does the LSA approach differ from other statistical approaches you've encountered? (Example: Google's "Similar Pages" feature.)
2003.10.28 - SLIDE 62IS 202 – FALL 2003
Questions from Joe Hall
• I would really like to see a show of hands (in class, I can't see you now!) of how many people have heard of either of the terms "Singular-value Decomposition" or "Eigenvector Decomposition" before you sat down to read this article. (I ask because we use this a lot in numerical approximation of radiative transfer in astrophysics... SVD is definitely a litmus test as to whether or not a problem is difficult.)
2003.10.28 - SLIDE 63IS 202 – FALL 2003
Questions from Joe Hall
• I'm going to get picky here. In the Conclusion, Dumais et al. claim, "The latent structure [LSI] approach is useful for helping people find textual information in large collections." However, their results (and those of other researchers!) mostly contradict this claim. So which is it... does the SVD approach "offer no improvement over term matching methods" only for "relatively homogenous" groups of documents like "information science documents." Does LSI work best on widely different documents? Take a look at this paper's abstract which contradicts the Dumais findings: http://tinyurl.com/smfo
2003.10.28 - SLIDE 64IS 202 – FALL 2003
Questions from Joe Hall
• If you raised your hand for the first question, you may know that SVD is very computationally intensive... Dumais claims that "it need only be done once for each dataset." That's no fun... most datasets change over time... not only that, but most datasets grow with time... which means that SVD techniques can only be used on small, static, homogeneous data sets (if you buy the link I showed above)... what fun is that? Where is SVD-enabled LSI useful? Is it merely a fascination of IR researchers and a way to write fancy grant proposals to make the next Maserati payment?
2003.10.28 - SLIDE 65IS 202 – FALL 2003
Questions from Tu Tran
• In what context was this paper written? What was the state of the IR field?
• Imagine you are an information specialist and had to explain LSI and SVD to your non-mathematically oriented/non-technical manager. How would you do it?
• The paper did not include any user studies. Can you imagine tasks where users would not find this system useful?
2003.10.28 - SLIDE 66IS 202 – FALL 2003
Next Time
• Statistical Properties of Texts and Vector Representation
• Readings/Discussion:
  – Cooper, “Getting Beyond Boole” Dan
  – Bates, “How to use Controlled Vocabularies More Effectively in Online Searching” Ann
  – Hearst, “Improving Full-Text Precision on Short Queries Using Simple Constraints” Simon
  – Modern IR – Chapter 7 Sean