CS276A Text Information Retrieval, Mining, and Exploitation Lecture 9 5 Nov 2002.

Page 1.

CS276A Text Information Retrieval, Mining, and Exploitation

Lecture 9, 5 Nov 2002

Page 2.

Recap: Relevance Feedback

Rocchio Algorithm:

Typical weights: alpha = 8, beta = 64, gamma = 64.

Tradeoff alpha vs. beta/gamma: if we have a lot of judged documents, we want a higher beta/gamma. But we usually don't …
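The Rocchio equation itself did not survive extraction from the slide; the standard form it refers to, with D_r the judged-relevant and D_nr the judged-nonrelevant documents, is shown below (whether the 1/|D| normalizations are folded into beta and gamma varies by formulation, which is why the slide's weights 8/64/64 need not match textbook values):

```latex
\vec{q}_{m} = \alpha\,\vec{q}_{0}
  + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
  - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j
```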

Page 3.

Pseudo Feedback

[Diagram: pseudo feedback loop. Initial query -> retrieve documents -> take the top k documents -> label the top k docs relevant -> apply relevance feedback -> retrieve again.]
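The loop above can be sketched in a few lines. This is a toy illustration with bag-of-words count vectors and only the positive (beta) Rocchio term, since pseudo feedback labels no documents non-relevant; the scoring and weights are illustrative, not the lecture's exact method:

```python
from collections import Counter

def retrieve(query_vec, docs, n=None):
    """Rank documents by a simple dot-product score against the query vector."""
    scores = [(sum(query_vec.get(t, 0.0) * tf for t, tf in doc.items()), i)
              for i, doc in enumerate(docs)]
    ranked = [i for _, i in sorted(scores, reverse=True)]
    return ranked[:n] if n else ranked

def pseudo_feedback(query, docs, k=2, alpha=1.0, beta=0.75):
    """One round of pseudo relevance feedback: label the top k retrieved
    docs relevant and move the query toward their centroid (positive
    Rocchio term only; there are no judged non-relevant docs here)."""
    q = Counter(query.split())
    top_k = retrieve(q, docs, n=k)
    expanded = {t: alpha * tf for t, tf in q.items()}
    for i in top_k:
        for t, tf in docs[i].items():
            expanded[t] = expanded.get(t, 0.0) + beta * tf / k
    return retrieve(expanded, docs)

# Toy collection: each document is a bag of term counts.
docs = [Counter("apple fruit pie".split()),
        Counter("apple fruit tree orchard".split()),
        Counter("orchard tree farm".split())]
ranking = pseudo_feedback("apple", docs, k=2)
```

After expansion, document 2, which shares no term with the original query "apple", receives a nonzero score via the terms borrowed from the top-ranked documents.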

Page 4.

Pseudo-Feedback: Performance

Page 5.

Today’s topics

User Interfaces
Browsing
Visualization

Page 6.

The User in Information Access

[Diagram: the user's information access loop. Information need -> find starting point -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done? If yes, stop; if no, reformulate.]

Page 7.

The User in Information Access

[Diagram: the same information access loop (information need -> find starting point -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done?), annotated "Focus of most IR!" at the query -> send to system -> receive results steps.]

Page 8.

Information Access in Context

[Diagram: information access in context. High-level goal -> information access -> analyze -> synthesize -> done? If yes, stop; if no, continue the loop.]

Page 9.

The User in Information Access

[Diagram (repeated): the information access loop. Information need -> find starting point -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done? If yes, stop; if no, reformulate.]

Page 10.

Starting points

Source selection: Highwire Press, Lexis-Nexis, Google!

Overviews: directories/hierarchies, visual maps, clustering

Page 11.

Highwire Press

Source Selection

Page 12.

Hierarchical browsing

[Diagram: a hierarchy with levels 0, 1, and 2.]

Page 13.
Page 14.

Visual Browsing: Themescape

Page 15.

Browsing

[Diagram: browsing as a path from a starting point through neighboring documents (marked "x") to the answer. Credit: William Arms, Cornell]

Page 16.

Scatter/Gather

Scatter/Gather allows the user to find a set of documents of interest through browsing:

Take the collection and scatter it into n clusters.
Pick the clusters of interest and merge them.
Iterate.
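The scatter/gather loop above can be sketched as follows. The clustering here is a deliberately crude stand-in (group each document by its most common term) for the real cluster algorithm Scatter/Gather used; only the interaction pattern is the point:

```python
from collections import Counter, defaultdict

def scatter(docs):
    """Scatter: partition docs into clusters. As a stand-in for real
    clustering, group each doc by its most common term."""
    clusters = defaultdict(list)
    for doc in docs:
        label = Counter(doc.split()).most_common(1)[0][0]
        clusters[label].append(doc)
    return dict(clusters)

def gather(clusters, chosen_labels):
    """Gather: merge the clusters the user picked into one sub-collection."""
    return [doc for label in chosen_labels for doc in clusters[label]]

docs = ["apple apple pie", "apple apple tart",
        "tree tree farm", "tree tree orchard"]
round1 = scatter(docs)              # scatter the whole collection
subset = gather(round1, ["apple"])  # user picks the clusters of interest
round2 = scatter(subset)            # iterate on the narrowed collection
```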

Page 17.

Scatter/Gather

Page 18.

Scatter/gather

Page 19.

How to Label Clusters

Show titles of typical documents: titles are easy to scan, and authors create them for quick scanning! But you can only show a few titles, which may not fully represent the cluster.

Show words/phrases prominent in the cluster: more likely to fully represent the cluster; use distinguishing words/phrases; but harder to scan.
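One common way to pick distinguishing words (an illustrative choice; the slide does not prescribe a method) is to score each term by how much more frequent it is in the cluster than in the collection as a whole:

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, n=3):
    """Score each term by its cluster frequency relative to its
    (add-one smoothed) collection frequency; return the top n terms."""
    cluster_tf = Counter(t for d in cluster_docs for t in d.split())
    coll_tf = Counter(t for d in all_docs for t in d.split())
    scores = {t: tf / (coll_tf[t] + 1) for t, tf in cluster_tf.items()}
    # Sort by descending score, breaking ties alphabetically.
    return [t for t, _ in sorted(scores.items(),
                                 key=lambda kv: (-kv[1], kv[0]))[:n]]

docs = ["stocks bonds market", "stocks market rally",
        "goals match referee", "match penalty referee"]
finance_label = label_cluster(docs[:2], docs)
```

Terms that also appear in other clusters are penalized, so the label leans toward words specific to this cluster.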

Page 20.

Visual Browsing: Hyperbolic Tree

Page 21.

Visual Browsing: Hyperbolic Tree

Page 22.

UWMS Data Mining Workshop

Study of Kohonen Feature Maps

H. Chen, A. Houston, R. Sewell, and B. Schatz, JASIS 49(7)

Comparison: Kohonen map vs. Yahoo.

Task: "window shop" for an interesting home page, then repeat with the other interface.

Results: users starting with the map could repeat the task in Yahoo (8/11); users starting with Yahoo were unable to repeat it in the map (2/14).

Credit: Marti Hearst

Page 23.


Study (cont.)

Participants liked:
correspondence of region size to # of documents
overview (but also wanted zoom)
ease of jumping from one topic to another
multiple routes to topics
use of category and subcategory labels

Credit: Marti Hearst

Page 24.


Study (cont.)

Participants wanted:
hierarchical organization
other orderings of concepts (alphabetical)
integration of browsing and search
correspondence of color to meaning
more meaningful labels
labels at the same level of abstraction
more labels fit in the given space
combined keyword and category search
multiple category assignment (sports + entertainment)

Credit: Marti Hearst

Page 25.

Browsing

Effectiveness depends on:
starting point
ease of orientation (are similar docs "close"? intuitive organization?)
how adaptive the system is

Compare to physical browsing (library, grocery store).

Page 26.

Searching vs. Browsing

Information need dependent:
open-ended (find an interesting quote on the virtues of friendship) -> browsing
specific (directions to Pacific Bell Park) -> searching

User dependent:
some users prefer searching, others browsing (confirmed in many studies: some hate to type)
you don't need to know the vocabulary for browsing

System dependent (some web sites don't support search).

Searching and browsing are often interleaved.

Page 27.

Searchers vs. Browsers

1/3 of users do not search at all.
1/3 rarely search (or enter URLs only).
Only 1/3 understand the concept of search.
(ISP data from 2000)

Page 28.

Exercise

Observe your own information-seeking behavior: WWW, university library, grocery store.

Are you a searcher or a browser?

How do you reformulate your query?
Read bad hits, then minus terms.
Read good hits, then plus terms.
Try a completely different query.
…

Page 29.

The User in Information Access

[Diagram (repeated): the information access loop. Information need -> find starting point -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done? If yes, stop; if no, reformulate.]

Page 30.

Query Specification

Recall: relevance feedback, query expansion, spelling correction, query-log-based mining.

Interaction styles for query specification
Queries on the Web
Parametric search
Term browsing

Page 31.

Query Specification: Interaction Styles

Shneiderman 97:
Command language
Form fill-in
Menu selection
Direct manipulation
Natural language

Example: how does each apply to Boolean queries?

Credit: Marti Hearst

Page 32.

Command-Based Query Specification

command attribute value connector …
find pa shneiderman and tw user#

What are the attribute names? What are the command names? What are the allowable values?

Credit: Marti Hearst

Page 33.

Form-Based Query Specification (Altavista)

Credit: Marti Hearst

Page 34.

Form-Based Query Specification (Melvyl)

Credit: Marti Hearst

Page 35.

Form-based Query Specification (Infoseek)

Credit: Marti Hearst

Page 36.

Direct Manipulation Spec.: VQUERY (Jones 98)

Credit: Marti Hearst

Page 37.

Menu-based Query Specification (Young & Shneiderman 93)

Credit: Marti Hearst

Page 38.

Query Specification/Reformulation

A good user interface makes it easy for the user to reformulate the query.

Challenge: no single user interface is ideal for all types of information needs.

Page 39.

Types of Information Needs

Need an answer to a question (who won the game?)
Re-find a particular document
Find a good recipe for tonight's dinner
Authoritative summary of information (HIV review)
Exploration of a new area (browse sites about Baja)

Page 40.

Queries on the Web: Most Frequent on 2002/10/26

Page 41.

Queries on the Web (2000)

Page 42.

Intranet Queries (Aug 2000)

3351 bearfacts
3349 telebears
1909 extension
1874 schedule+of+classes
1780 bearlink
1737 bear+facts
1468 decal
1443 infobears
1227 calendar
989 career+center
974 campus+map
920 academic+calendar
840 map
773 bookstore
741 class+pass
738 housing
721 tele-bears
716 directory
667 schedule
627 recipes
602 transcripts
582 tuition
577 seti
563 registrar
550 info+bears
543 class+schedule
470 financial+aid

Source: Ray Larson

Page 43.

Intranet Queries

Summary of sample data from 3 weeks of UCB queries:
13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
6.7% Schedule of classes or final exams (6222)
5.4% Summer Session (5041)
3.2% Extension (2932)
3.1% Academic Calendar (2846)
2.4% Directories (2202)
1.7% Career Center (1588)
1.7% Housing (1583)
1.5% Map (1393)

Average query length over the last 4 months: 1.8 words.
This suggests what is difficult to find from the home page.

Source: Ray Larson

Page 44.

Query Specification: Feast or Famine

Famine vs. feast: too few results, or far too many.

Specifying a well-targeted query is hard.
This is a bigger problem for Boolean queries.

Page 45.

Parametric search

Each document has, in addition to text, some "meta-data", e.g.:
Language = French
Format = pdf
Subject = Physics
Date = Feb 2000
etc.

A parametric search interface allows the user to combine a full-text query with selections on these parameters, e.g., language, date range, etc.
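A minimal sketch of combining metadata selections with a full-text query; the field names and matching rules here are illustrative, not a specific product's interface:

```python
def parametric_search(docs, text_query=None, **params):
    """Keep documents whose meta-data matches every selected parameter
    and, if a text query is given, whose text contains every query term."""
    hits = []
    for doc in docs:
        # Every selected parameter must match the document's meta-data.
        if any(doc["meta"].get(k) != v for k, v in params.items()):
            continue
        # Every full-text query term must occur in the document text.
        if text_query and not all(t in doc["text"].lower()
                                  for t in text_query.lower().split()):
            continue
        hits.append(doc)
    return hits

docs = [
    {"text": "Quantum field theory notes",
     "meta": {"language": "English", "format": "pdf"}},
    {"text": "Notes de mecanique quantique",
     "meta": {"language": "French", "format": "pdf"}},
    {"text": "Quantum computing slides",
     "meta": {"language": "English", "format": "ppt"}},
]
results = parametric_search(docs, text_query="quantum",
                            language="English", format="pdf")
```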

Page 46.

Parametric search example

Notice that the output is a (large) table. Various parameters in the table (column headings) may be clicked on to effect a sort.

Page 47.

Parametric search example

We can add text search.

Page 48.

Interfaces for term browsing

Page 49.
Page 50.

The User in Information Access

[Diagram (repeated): the information access loop. Information need -> find starting point -> formulate/reformulate -> query -> send to system -> receive results -> explore results -> done? If yes, stop; if no, reformulate.]

Page 51.

Explore Results

Determine: do these results answer my question?
Summarization; more generally: provide context.

Hypertext navigation: can I find the answer by following a link?

Browsing and clustering (again): browse to explore results.

Page 52.

Explore Results: Context

We can't present complete documents in the result set – too much information.

Present information about each doc:
must be concise (so we can show many docs)
must be informative

Typical information about each document:
summary
context of query words
meta-data: date, author, language, file name/URL
context of the document in the collection
information about the structure of the document

Page 53.

Context in Collection: Cha-Cha

Page 54.

Category Labels

Advantages:
interpretable
capture summary information
describe multiple facets of content
domain dependent, and so descriptive

Disadvantages:
do not scale well (for organizing documents)
domain dependent, so costly to acquire
may mismatch users' interests

Credit: Marti Hearst

Page 55.

Evaluate Results: Context in Hierarchy (Cat-a-Cone)

Page 56.

Explore Results: Summarization

Query-dependent summarization:
KWIC (keyword in context) lines (a la Google)

Query-independent summarization:
summary written by the author (if available)
exploit genre (news stories)
sentence extraction
natural language generation
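A KWIC line can be sketched as a fixed window of characters around the first query-term hit; the window size and ellipsis style here are arbitrary choices, not how any particular engine does it:

```python
def kwic(text, term, width=20):
    """Return a keyword-in-context line: the first occurrence of `term`
    with up to `width` characters of context on each side, or None if
    the term does not occur."""
    low = text.lower()
    pos = low.find(term.lower())
    if pos < 0:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    # Mark truncated context with ellipses.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix

snippet = kwic("Relevance feedback improves retrieval effectiveness "
               "in most settings.", "retrieval", width=10)
```

A real snippet generator would pick the window containing the most query terms and break on word boundaries; this sketch only shows the core idea.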

Page 57.

Evaluate Results: Structure of Document (SeeSoft)

Page 58.

Personalization

[Diagram: Outride Personalized Search System. A user query goes through Query Augmentation, informed by the user's interests, demographics, click stream, search history, and application usage. The augmented query is sent to intranet and Web search (search engine schema: keyword x doc ID x link rank). Results pass through Result Processing against the Outride schema (user x content x history x demographics), and the result set is shown in the Outride Side Bar interface.]

Page 59.
Page 60.

How Long to Get an Answer?

Average task completion time in seconds:
AOL 89.6
Excite 83.5
Yahoo! 81.0
Google 75.4
Outride 38.9

SOURCE: ZDLabs/eTesting, Inc. October 2000

Page 61: CS276A Text Information Retrieval, Mining, and Exploitation Lecture 9 5 Nov 2002.

Table 1. User actions study results.

Search Engine   User Actions   Difference (%)
Outride         11.2           -
Google          21.2           89.6
Yahoo!          22.4           100.5
AOL             23.1           107.0
Excite          23.3           108.5
Average         22.5           101.4

Table 2. Overall timing results (in seconds, rank in parentheses).

Engine    Expert Time (Rank)   Novice Time (Rank)   Average (Rank)   % Difference
Outride   32.8 (1)             45.1 (1)             38.9 (1)         0%
AOL       92.3 (5)             87.0 (4)             89.6 (5)         130.2%
Excite    75.7 (3)             91.3 (5)             83.5 (4)         114.5%
Google    72.5 (2)             78.4 (3)             75.4 (2)         93.7%
Yahoo!    85.1 (4)             76.9 (2)             81.0 (3)         107.9%

SOURCE: ZDLabs/eTesting, Inc. October 2000

Page 62.

Novices versus Experts (Average Time to Complete Task)

[Chart: time in seconds by user skill level.
Novices: others 91.30, Outride 45.07.
Experts: others 75.70, Outride 32.83.]

SOURCE: ZDLabs/eTesting, Inc. October 2000

Page 63.

Performance of Interactive Retrieval

Page 64.

Boolean Queries: Interface Issues

Boolean logic is difficult for the average user.

Much research was done on interfaces facilitating the creation of Boolean queries by non-experts.

Much of this research was made obsolete by the web.

The current view is that non-expert users are best served with non-Boolean or simple +/- Boolean queries (pioneered by AltaVista).

But Boolean queries are the standard for certain groups of expert users (e.g., lawyers).
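The +/- style can be sketched as: a "+term" must appear, a "-term" must not, and bare terms only contribute to ranking. This is a simplified reading of the convention, not AltaVista's actual implementation:

```python
def plus_minus_match(query, text):
    """Interpret a +/- query against one document: '+term' is required,
    '-term' is forbidden, bare terms just add to a ranking score."""
    words = set(text.lower().split())
    score = 0
    for tok in query.lower().split():
        if tok.startswith("+") and tok[1:] not in words:
            return False, 0      # required term missing
        if tok.startswith("-") and tok[1:] in words:
            return False, 0      # forbidden term present
        if not tok.startswith(("+", "-")) and tok in words:
            score += 1           # optional term: boost the ranking
    return True, score

ok, score = plus_minus_match("+jaguar -car speed",
                             "the jaguar runs at top speed")
```

The appeal over full Boolean syntax is that a query with no operators degrades gracefully into a plain ranked query.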

Page 65.

User Interfaces: Other Issues

Technical HCI issues:
how to use screen real estate
one monolithic window or many?
undo operator
give access to history
alternative interfaces for novice/expert users
accessibility for users with disabilities

Page 66.

Take-Away

Don't ignore the user in information retrieval.

Finding matching documents for a query is only part of information access and "knowledge work".

In addition to core information retrieval, information access interfaces need to support:
finding starting points
formulation/reformulation of queries
exploring/evaluating results

Page 67.

Exercise

Current information retrieval user interfaces are designed for typical computer screens.

How would you design a user interface for a wall-size screen?

Page 68.

Resources

MIR Ch. 10.0–10.7.

Donna Harman. Overview of the Fourth Text REtrieval Conference (TREC-4). National Institute of Standards and Technology.

Cutting, Karger, Pedersen, Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. ACM SIGIR.

Hearst. Cat-a-Cone: An Interactive Interface for Specifying Searches and Viewing Retrieval Results in a Large Category Hierarchy. ACM SIGIR.