2003.10.28 - SLIDE 1IS 202 – FALL 2003

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2003

http://www.sims.berkeley.edu/academics/courses/is202/f03/

SIMS 202: Information Organization and Retrieval

Lecture 17: Boolean IR and Text Processing

2003.10.28 - SLIDE 2IS 202 – FALL 2003

Announcements

• Wishter volunteers meeting tonight 7:00

• Testers needed!!
  – UI tests on Image Gallery/Annotation software
    • Thursday between 2-4
    • and Friday 10-4
  – The tests will take approximately 1 ½ hours (but will most likely run a bit shorter)

– Signup sheet will be available at the end of class

2003.10.28 - SLIDE 3IS 202 – FALL 2003

Lecture Overview

• Review
  – Introduction to Information Retrieval
  – The Information Seeking Process
  – History of IR Research
• IR System Structure (revisited)
• Central Concepts in IR
• Boolean Logic and Boolean IR Systems
• Text Processing
• Discussion

Credit for some of the slides in this lecture goes to Marti Hearst

2003.10.28 - SLIDE 5IS 202 – FALL 2003

IR is an Iterative Process

Repositories

Workspace

Goals

2003.10.28 - SLIDE 6IS 202 – FALL 2003

Berry-Picking Model

Q0

Q1

Q2

Q3

Q4

Q5

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)

2003.10.28 - SLIDE 7IS 202 – FALL 2003

Restricted Form of the IR Problem

• The system has available only pre-existing, “canned” text passages

• Its response is limited to selecting from these passages and presenting them to the user

• It must select, say, 10 or 20 passages out of millions or billions!

2003.10.28 - SLIDE 8IS 202 – FALL 2003

Information Retrieval

• Revised Task Statement:

Build a system that retrieves documents that users are likely to find relevant to their queries

• This set of assumptions underlies the field of Information Retrieval

2003.10.28 - SLIDE 10IS 202 – FALL 2003

Structure of an IR System

[Figure: Information Storage and Retrieval System. Search line: interest profiles & queries, formulating query in terms of descriptors, storage of profiles, Store 1: profiles/search requests. Storage line: documents & data, indexing (descriptive and subject), storage of documents, Store 2: document representations. Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language). Comparison/matching of the two stores yields potentially relevant documents.]

Adapted from Soergel, p. 19

2003.10.28 - SLIDE 12IS 202 – FALL 2003

Central Concepts in IR

• Documents

• Queries

• Collections

• Evaluation

• Relevance

2003.10.28 - SLIDE 13IS 202 – FALL 2003

Documents

• What do we mean by a document?
  – Full document?
  – Document surrogates?
  – Pages?

• Buckland (JASIS, Sept. 1997) “What is a Document”

• Are IR systems better called Document Retrieval systems?

• A document is a representation of some aggregation of information, treated as a unit

2003.10.28 - SLIDE 14IS 202 – FALL 2003

Collection

• A collection is some physical or logical aggregation of documents
  – A database
  – A library
  – An index?
  – Others?

2003.10.28 - SLIDE 15IS 202 – FALL 2003

Queries

• A query is some expression of a user’s information needs

• Can take many forms
  – Natural language description of need
  – Formal query in a query language

• Queries may not be accurate expressions of the information need
  – Differences between conversation with a person and formal query expression

2003.10.28 - SLIDE 16IS 202 – FALL 2003

Evaluation: Why Evaluate?

• Determine if the system is desirable

• Make comparative assessments

• Others?

2003.10.28 - SLIDE 17IS 202 – FALL 2003

What To Evaluate?

• How much of the information need was satisfied

• How much was learned about a topic

• Incidental learning
  – How much was learned about the collection
  – How much was learned about other topics

• How inviting the system is…

2003.10.28 - SLIDE 18IS 202 – FALL 2003

What To Evaluate?

What can be measured that reflects users’ ability to use system? (Cleverdon 66)

– Coverage of information
– Form of presentation
– Effort required/ease of use
– Time and space efficiency
– Recall
  • Proportion of relevant material actually retrieved
– Precision
  • Proportion of retrieved material actually relevant

Effectiveness: recall and precision
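The two effectiveness measures defined on this slide can be sketched in a few lines. This is a minimal illustration, assuming relevance judgments are available as a set of document ids; the function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 5 judged relevant, 3 in common.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d7", "d9"})
print(p, r)  # 0.75 0.6
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here for simplicity, not something the slide specifies.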

2003.10.28 - SLIDE 19IS 202 – FALL 2003

Relevance (revisited)

• “Intuitively, we understand quite well what relevance means. It is a primitive ‘y’ know’ concept, as is information for which we hardly need a definition. … if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance.”

» Saracevic, 1975 p. 324

2003.10.28 - SLIDE 20IS 202 – FALL 2003

Relevance

• How relevant is the document
  – For this user, for this information need

• Subjective, but
• Measurable to some extent
  – How often do people agree a document is relevant to a query?

• How well does it answer the question?
  – Complete answer? Partial?
  – Background information?
  – Hints for further exploration?

2003.10.28 - SLIDE 21IS 202 – FALL 2003

Relevance Research and Thought

• Review to 1975 by Saracevic

• Reconsideration of user-centered relevance by Schamber, Eisenberg and Nilan, 1990

• Special Issue of JASIS on relevance (April 1994, 45(3))

2003.10.28 - SLIDE 22IS 202 – FALL 2003

Saracevic

• Relevance is considered as a measure of effectiveness of the contact between a source and a destination in a communications process
  – Systems view
  – Destinations view
  – Subject Literature view
  – Subject Knowledge view
  – Pertinence
  – Pragmatic view

2003.10.28 - SLIDE 23IS 202 – FALL 2003

Define Your Own Relevance

• As we saw last time, most definitions of relevance follow a “formula”:
  – Relevance is the (A) gage of relevance of an (B) aspect of relevance existing between an (C) object judged and a (D) frame of reference as judged by an (E) assessor

From Saracevic, 1975 and Schamber 1990

2003.10.28 - SLIDE 24IS 202 – FALL 2003

Schamber, Eisenberg and Nilan

• “Relevance is the measure of retrieval performance in all information systems, including full-text, multimedia, question-answering, database management and knowledge-based systems.”

• Systems-oriented relevance: Topicality

2003.10.28 - SLIDE 25IS 202 – FALL 2003

Schamber, et al. Conclusions

• “Relevance is a multidimensional concept whose meaning is largely dependent on users’ perceptions of information and their own information need situations.

• Relevance is a dynamic concept that depends on users’ judgments of the quality of the relationship between information and information need at a certain point in time.

• Relevance is a complex but systematic and measurable concept if approached conceptually and operationally from the user’s perspective.”

2003.10.28 - SLIDE 26IS 202 – FALL 2003

Janes’ View

Topicality

Pertinence

Relevance

Utility

Satisfaction

2003.10.28 - SLIDE 28IS 202 – FALL 2003

Query Languages

• A way to express the question (information need)

• Types:
  – Boolean
  – Natural Language
  – Stylized Natural Language
  – Form-Based (GUI)

2003.10.28 - SLIDE 29IS 202 – FALL 2003

Simple Query Language: Boolean

– Terms + Connectors (or operators)
– Terms
  • Words
  • Normalized (stemmed) words
  • Phrases
  • Thesaurus terms
– Connectors
  • AND
  • OR
  • NOT

2003.10.28 - SLIDE 30IS 202 – FALL 2003

Boolean Queries

• Cat

• Cat OR Dog

• Cat AND Dog

• (Cat AND Dog)

• (Cat AND Dog) OR Collar

• (Cat AND Dog) OR (Collar AND Leash)

• (Cat OR Dog) AND (Collar OR Leash)
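The example queries above can be evaluated as set operations, with AND as intersection and OR as union. The sketch below assumes a hypothetical five-document collection; the document ids, their term sets, and the helper name `posting` are illustrative only.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    1: {"cat", "collar"},
    2: {"dog", "leash"},
    3: {"cat", "dog"},
    4: {"collar", "leash"},
    5: {"cat", "dog", "collar", "leash"},
}


def posting(term):
    """Return the set of document ids whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


cat, dog, collar, leash = (posting(t) for t in ("cat", "dog", "collar", "leash"))

# AND is set intersection (&); OR is set union (|).
q1 = (cat & dog) | (collar & leash)   # (Cat AND Dog) OR (Collar AND Leash)
q2 = (cat | dog) & (collar | leash)   # (Cat OR Dog) AND (Collar OR Leash)

print(sorted(q1))  # [3, 4, 5]
print(sorted(q2))  # [1, 2, 5]
```

The two queries use the same four terms but retrieve different documents, which is why the parentheses matter.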

2003.10.28 - SLIDE 31IS 202 – FALL 2003

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:

Doc # 1 2 3 4 5 6 7
CAT X X X X X
DOG X X X X X
COLLAR X X X X X
LEASH X X X X

2003.10.28 - SLIDE 32IS 202 – FALL 2003

Boolean Queries

• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations works:

Doc # 1 2 3 4 5 6 7
CAT X X
DOG X X
COLLAR X X
LEASH X X

2003.10.28 - SLIDE 33IS 202 – FALL 2003

Boolean Logic

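[Figure: Venn diagrams of two sets A and B, with the shaded region C in each case]

$C = A$

$C = \overline{A}$

$C = A \cap B$

$C = A \cup B$

De Morgan's Law:

$\overline{A \cap B} = \overline{A} \cup \overline{B}$

$\overline{A \cup B} = \overline{A} \cap \overline{B}$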

2003.10.28 - SLIDE 34IS 202 – FALL 2003

Boolean Queries

• Usually expressed as INFIX operators in IR
  – ((a AND b) OR (c AND b))

• NOT is UNARY PREFIX operator
  – ((a AND b) OR (c AND (NOT b)))

• AND and OR can be n-ary operators
  – (a AND b AND c AND d)

• Some rules (De Morgan revisited)
  – NOT(a) AND NOT(b) = NOT(a OR b)
  – NOT(a) OR NOT(b) = NOT(a AND b)
  – NOT(NOT(a)) = a
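The three rewrite rules can be checked exhaustively, since two Boolean variables have only four truth assignments. The sketch below is one way to do that; the function name `de_morgan_holds` is hypothetical.

```python
from itertools import product


def de_morgan_holds():
    """Check the three rules on every truth assignment of a and b."""
    for a, b in product([False, True], repeat=2):
        # NOT(a) AND NOT(b) = NOT(a OR b)
        if ((not a) and (not b)) != (not (a or b)):
            return False
        # NOT(a) OR NOT(b) = NOT(a AND b)
        if ((not a) or (not b)) != (not (a and b)):
            return False
        # NOT(NOT(a)) = a
        if (not (not a)) != a:
            return False
    return True


print(de_morgan_holds())  # True
```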

2003.10.28 - SLIDE 35IS 202 – FALL 2003

Boolean Logic

[Figure: Venn diagram of three terms t1, t2 and t3 over documents D1 to D11. The three circles partition the collection into eight regions m1 to m8; each region is a conjunction of t1, t2 and t3 in which every term appears either plain or negated.]

2003.10.28 - SLIDE 36IS 202 – FALL 2003

Boolean Searching

“Measurement of the width of cracks in prestressed concrete beams”

Formal Query:
Cracks AND Beams AND Width_measurement AND Prestressed_concrete

[Figure: Venn diagram of four overlapping sets: Cracks, Beams, Width measurement, Prestressed concrete]

Relaxed Query:
(C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
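The relaxed query is the formal query weakened to "any three of the four terms". A small sketch of how such a query could be generated, and an equivalent membership test, follows; the function names `relaxed_query` and `matches` are hypothetical, and the generated groups may appear in a different order from the slide while meaning the same thing.

```python
from itertools import combinations


def relaxed_query(terms, k):
    """Build an OR of every AND-group of k terms, as a query string."""
    groups = ["(" + " AND ".join(group) + ")" for group in combinations(terms, k)]
    return " OR ".join(groups)


def matches(doc_terms, terms, k):
    """Equivalent test: the document contains at least k of the query terms."""
    return len(set(terms) & set(doc_terms)) >= k


print(relaxed_query(["C", "B", "W", "P"], 3))
# (C AND B AND W) OR (C AND B AND P) OR (C AND W AND P) OR (B AND W AND P)

print(matches({"C", "B", "P"}, ["C", "B", "W", "P"], 3))  # True
print(matches({"C", "B"}, ["C", "B", "W", "P"], 3))       # False
```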

2003.10.28 - SLIDE 37IS 202 – FALL 2003

Pseudo-Boolean Queries

• A new notation, from web search
  – +cat dog +collar leash

• Does not mean the same thing!

• Need a way to group combinations

• Phrases:
  – “stray cat” AND “frayed collar”
  – +“stray cat” +“frayed collar”
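To see why the "+" notation does not mean the same thing as (Cat OR Dog) AND (Collar OR Leash), here is a minimal sketch. It assumes the common web-search convention that a "+" prefix marks a required term and unmarked terms are optional; it ignores quoted phrases, and the function names are hypothetical.

```python
def parse_plus_query(query):
    """Split a '+term term' query into (required, optional) term sets."""
    required, optional = set(), set()
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:])
        elif token != "+":
            optional.add(token)
    return required, optional


def plus_match(doc_terms, query):
    """A document matches if it contains every required (+) term."""
    required, _optional = parse_plus_query(query)
    return required <= set(doc_terms)


def boolean_match(doc_terms):
    """(Cat OR Dog) AND (Collar OR Leash)."""
    terms = set(doc_terms)
    return bool(terms & {"cat", "dog"}) and bool(terms & {"collar", "leash"})


doc = {"dog", "leash"}
print(boolean_match(doc))                        # True
print(plus_match(doc, "+cat dog +collar leash"))  # False: cat and collar are required
```

Under this assumption the optional terms would only influence ranking, which the sketch leaves out.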

2003.10.28 - SLIDE 38IS 202 – FALL 2003

Another View of IR

[Figure: flow diagram with the components Information Need, Text Input, Query, Parse, Pre-Process, Index, Collections and Rank]

2003.10.28 - SLIDE 39IS 202 – FALL 2003

Result Sets

• Run a query, get a result set

• Two choices
  – Reformulate query, run on entire collection
  – Reformulate query, run on result set

• Example: Dialog query
  – (Redford AND Newman) -> S1 1450 documents
  – (S1 AND Sundance) -> S2 898 documents

2003.10.28 - SLIDE 40IS 202 – FALL 2003

Feedback Queries

[Figure: the IR flow diagram (Information Need, Text Input, Query, Parse, Pre-Process, Index, Collections, Rank) extended with a feedback loop: Reformulated Query and Re-Rank]

2003.10.28 - SLIDE 41IS 202 – FALL 2003

Ordering of Retrieved Documents

• Pure Boolean has no ordering
• In practice:
  – Order chronologically
  – Order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?

• Fancier methods have been investigated
  – p-norm is most famous
    • Usually impractical to implement
    • Usually hard for user to understand

2003.10.28 - SLIDE 42IS 202 – FALL 2003

Boolean

• Advantages
  – Simple queries are easy to understand
  – Relatively easy to implement

• Disadvantages
  – Difficult to specify what is wanted
  – Too much returned, or too little
  – Ordering not well determined

• Dominant language in commercial systems until the WWW

2003.10.28 - SLIDE 43IS 202 – FALL 2003

Faceted Boolean Query

• Strategy: Break query into facets (polysemous with earlier meaning of facets)
  – Conjunction of disjunctions
    • (a1 OR a2 OR a3) AND
    • (b1 OR b2) AND
    • (c1 OR c2 OR c3 OR c4)
  – Each facet expresses a topic
    • (“rain forest” OR jungle OR amazon) AND
    • (medicine OR remedy OR cure) AND
    • (Smith OR Zhou)

2003.10.28 - SLIDE 44IS 202 – FALL 2003

Faceted Boolean Query

• Query still fails if one facet missing

• Alternative: Coordination level ranking
  – Order results in terms of how many facets (disjuncts) are satisfied
  – Also called Quorum ranking, Overlap ranking, and Best Match

• Problem: Facets still undifferentiated

• Alternative: Assign weights to facets
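Coordination level ranking can be sketched as counting, for each document, how many facets have at least one matching term, then sorting by that count. The facets below are the ones from the previous slide (lowercased); the documents, their term sets, and the function names are hypothetical.

```python
# Facets from the example query; each facet is a disjunction (OR) of terms.
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

# Hypothetical documents: id -> set of index terms.
docs = {
    "a": {"jungle", "cure", "zhou"},
    "b": {"amazon", "remedy"},
    "c": {"smith"},
    "d": {"weather"},
}


def coordination_level(doc_terms, facets):
    """Number of facets satisfied by at least one term of the document."""
    return sum(1 for facet in facets if facet & set(doc_terms))


def rank(docs, facets):
    """Order documents by facets satisfied (descending), dropping zero scores."""
    scored = [(coordination_level(terms, facets), doc_id) for doc_id, terms in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


print(rank(docs, facets))  # ['a', 'b', 'c']
```

A strict faceted Boolean query would return only document "a"; the ranking keeps "b" and "c" as partial matches. Weighting facets, as the last bullet suggests, would replace the count of 1 per facet with a per-facet weight.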

2003.10.28 - SLIDE 45IS 202 – FALL 2003

Proximity Searches

• Proximity: Terms occur within K positions of one another
  – pen w/5 paper

• A “Near” function can be more vague
  – near(pen, paper)

• Sometimes order can be specified
• Also, Phrases and Collocations
  – “United Nations” “Bill Clinton”

• Phrase Variants
  – “retrieval of information” “information retrieval”
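A "within K positions" test such as pen w/5 paper can be sketched by comparing token positions. The sketch assumes the text is already tokenized; the function names `positions` and `within`, the `ordered` flag, and the sample sentence are hypothetical.

```python
def positions(tokens, term):
    """Return every position at which `term` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def within(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and b lies within k positions.

    With ordered=True, a must come before b.
    """
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            gap = j - i
            if ordered and 0 < gap <= k:
                return True
            if not ordered and 0 < abs(gap) <= k:
                return True
    return False


tokens = "she put the pen down on the paper".split()
print(within(tokens, "pen", "paper", 5))                # True: 4 positions apart
print(within(tokens, "pen", "paper", 3))                # False
print(within(tokens, "paper", "pen", 5, ordered=True))  # False: wrong order
```

A phrase such as “United Nations” is the special case `within(tokens, "united", "nations", 1, ordered=True)` on lowercased tokens.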

2003.10.28 - SLIDE 46IS 202 – FALL 2003

Filters

• Filters: Reduce set of candidate docs

• Often specified simultaneously with query

• Usually restrictions on metadata
  – Restrict by:
    • Date range
    • Internet domain (.edu .com .berkeley.edu)
    • Author
    • Size
    • Limit number of documents returned

2003.10.28 - SLIDE 47IS 202 – FALL 2003

Boolean Systems

• Most of the commercial database search systems that pre-date the WWW are based on Boolean search
  – Dialog, Lexis-Nexis, etc.

• Most Online Library Catalogs are Boolean systems
  – E.g., MELVYL

• Database systems use Boolean logic for searching

• Many of the search engines sold for intranet search of web sites are Boolean

2003.10.28 - SLIDE 48IS 202 – FALL 2003

Why Boolean?

• Easy to implement

• Efficient searching across very large databases

• Easy to explain results
  – “Has to have all of the words…” (AND)
  – “Has to have at least one of the words…” (OR)

2003.10.28 - SLIDE 50IS 202 – FALL 2003

Content Analysis

• Automated transformation of raw text into a form that represents some aspect(s) of its meaning

• Including, but not limited to:
  – Automated Thesaurus Generation
  – Phrase Detection
  – Categorization
  – Clustering
  – Summarization

2003.10.28 - SLIDE 51IS 202 – FALL 2003

Techniques for Content Analysis

• Statistical
  – Single Document
  – Full Collection

• Linguistic
  – Syntactic
  – Semantic
  – Pragmatic

• Knowledge-Based (Artificial Intelligence)

• Hybrid (Combinations)

2003.10.28 - SLIDE 52IS 202 – FALL 2003

Text Processing

• Standard Steps:
  – Recognize document structure
    • Titles, sections, paragraphs, etc.
  – Break into tokens
    • Usually space and punctuation delineated
    • Special issues with Asian languages
  – Stemming/morphological analysis
  – Store in inverted index (to be discussed later)
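Two of the standard steps, tokenizing and storing in an inverted index, can be sketched as follows. This is a simplified illustration for space- and punctuation-delimited English text only: it skips document structure and stemming, and the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    1: "Cracks in prestressed concrete beams.",
    2: "Measuring the width of cracks",
}
index = build_inverted_index(docs)

print(tokenize(docs[1]))        # ['cracks', 'in', 'prestressed', 'concrete', 'beams']
print(sorted(index["cracks"]))  # [1, 2]
print(sorted(index["beams"]))   # [1]
```

With such an index, a Boolean AND is an intersection of the sets stored under each query term.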

2003.10.28 - SLIDE 53IS 202 – FALL 2003

Content Analysis Areas

[Figure: the IR flow diagram (Information Need, Text Input, Query, Parse, Pre-Process, Index, Collections, Rank), annotated with two questions: “How is the text processed?” and “How is the query constructed?”]

2003.10.28 - SLIDE 54

Document Processing Steps

From “Modern IR” Textbook

2003.10.28 - SLIDE 55IS 202 – FALL 2003

Stemming and Morphological Analysis

• Goal: “normalize” similar words
• Morphology (“form” of words)
  – Inflectional Morphology
    • E.g., inflect verb endings and noun number
    • Never change grammatical class
      – dog, dogs
      – tengo, tienes, tiene, tenemos, tienen
  – Derivational Morphology
    • Derive one word from another
    • Often change grammatical class
      – build, building; health, healthy

2003.10.28 - SLIDE 56IS 202 – FALL 2003

Automated Methods

• Powerful multilingual tools exist for morphological analysis
  – PCKimmo, Xerox Lexical technology
  – Require a grammar and dictionary
  – Use “two-level” automata

• Stemmers:
  – Very dumb rules work well (for English)
  – Porter Stemmer: Iteratively remove suffixes
  – Improvement: Pass results through a lexicon
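The idea of iteratively removing suffixes can be illustrated with a deliberately crude toy stemmer. This is not the Porter algorithm: the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made up for this sketch.

```python
# Toy suffix list, checked in this order; NOT the Porter rule set.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")
MIN_STEM = 3  # never strip a suffix if fewer than 3 letters would remain


def toy_stem(word):
    """Repeatedly strip the first matching suffix until none applies."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


print(toy_stem("searchers"))  # search
print(toy_stem("buildings"))  # build
print(toy_stem("dogs"))       # dog
print(toy_stem("wing"))       # wing (stem would be too short)
print(toy_stem("news"))       # new  (an over-aggressive conflation)
```

Even this toy shows both behaviors discussed in the lecture: dumb rules handle many English words, and they also conflate unrelated words, which is what passing results through a lexicon is meant to catch.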

2003.10.28 - SLIDE 57IS 202 – FALL 2003

Errors Generated by Porter Stemmer

Too Aggressive          Too Timid
organization/organ      european/europe
policy/police           cylinder/cylindrical
execute/executive       create/creation
arm/army                search/searcher

From Krovetz ‘93

2003.10.28 - SLIDE 58IS 202 – FALL 2003

Lecture Overview


• IR System Structure (revisited)
• Central Concepts in IR
• Boolean Logic
• Boolean IR Systems
• Discussion


2003.10.28 - SLIDE 59IS 202 – FALL 2003

Questions from Patrick Riley

• In Plato's Meno Dialogue, Plato asks: "How does one investigate what one does not know?" Plato's question is similar to typical questions we encounter in this and other readings of INFOSYS 202: how do we overcome the synonymy and polysemy problems faced by lexical searching? Can the LSA (Latent Semantic Analysis) and SVD (singular value decomposition) statistical techniques demonstrated by Dumais et al. solve the lexicon deficiencies in information retrieval?

2003.10.28 - SLIDE 60IS 202 – FALL 2003

Paradox

• The “Fundamental paradox of Information Retrieval” as stated by Roland Hjerppe
  – The need to describe that which you do not know in order to find it

2003.10.28 - SLIDE 61IS 202 – FALL 2003

Questions from Patrick Riley

• This paper is from 1988... do you know of any applications or advancements of this LSA approach from the information retrieval community? (Example: AI (LSA passed the TOEFL).)

• And what are some of the limitations of using this corpus-based text comparison mechanism? (Example: no use of word order, incompleteness?) How does the LSA approach differ from other statistical approaches you've encountered? (Example: Google's "Similar Pages" feature.)

2003.10.28 - SLIDE 62IS 202 – FALL 2003

Questions from Joe Hall

• I would really like to see a show of hands (in class, I can't see you now!) of how many people have heard of either of the terms "Singular-value Decomposition" or "Eigenvector Decomposition" before you sat down to read this article. (I ask because we use this a lot in numerical approximation of radiative transfer in astrophysics... SVD is definitely a litmus test as to whether or not a problem is difficult.)

2003.10.28 - SLIDE 63IS 202 – FALL 2003


• I'm going to get picky here. In the Conclusion, Dumais et al. claim, "The latent structure [LSI] approach is useful for helping people find textual information in large collections." However, their results (and those of other researchers!) mostly contradict this claim. So which is it? Does the SVD approach "offer no improvement over term matching methods" only for "relatively homogenous" groups of documents like "information science documents"? Does LSI work best on widely different documents? Take a look at this paper's abstract, which contradicts the Dumais findings: http://tinyurl.com/smfo

2003.10.28 - SLIDE 64IS 202 – FALL 2003


• If you raised your hand for the first question, you may know that SVD is very computationally intensive... Dumais claims that "it need only be done once for each dataset." That's no fun... most datasets change over time... not only that, but most datasets grow with time... which means that SVD techniques can only be used on small, static, homogenous data sets (if you buy the link I showed above)... what fun is that? Where is SVD-enabled LSI useful? Is it merely a fascination of IR researchers and a way to write fancy grant proposals to make the next Maserati payment?

2003.10.28 - SLIDE 65IS 202 – FALL 2003

Questions from Tu Tran

• In what context was this paper written? What was the state of the IR field?

• Imagine you are an information specialist and had to explain LSI and SVD to your non-mathematically oriented/non-technical manager. How would you do it?

• The paper did not include any user studies. Can you imagine tasks where users would not find this system useful?

2003.10.28 - SLIDE 66IS 202 – FALL 2003

Next Time

• Statistical Properties of Texts and Vector Representation

• Readings/Discussion:
  – Cooper, “Getting Beyond Boole” (Dan)
  – Bates, “How to use Controlled Vocabularies More Effectively in Online Searching” (Ann)
  – Hearst, “Improving Full-Text Precision on Short Queries Using Simple Constraints” (Simon)
  – Modern IR, Chapter 7 (Sean)
