Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management...

22
Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management Eytan Adar et al Presented by Xiao Hu CS491CXZ

Transcript of Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management...

Haystack: Per-User Information Environment

1999 Conference on Information and Knowledge Management

Eytan Adar et al

Presented by Xiao Hu

CS491CXZ

Outline

What is HaystackHaystack and IRData ModelSystem ArchitectureInformation Gathering Problems

What is Haystack?

A software for organizing and retrieving personal information

Totally personalized One user, one Haystack

Personal digital bookshelfA prototype

Haystack and IR

IR large corpus precision-recall metric “expert”relevance judge

IF (collaborative filtering) preference for similar

users require explicit user input

Haystack personal collection user’s satisfaction particular user focus on searching specific to one usercan observe user’s

implicit information needs

All Users / Groups of Users A Single User

Haystack Functionality

Automated data gathering Information maximization gathering as much information as possible

Customized information collectionAdaptation to individual query needs

A IR system that adapts to its user

?

General Data Model

Accommodate all information arbitrary pieces of data metadata links between them

Facilitate data growth new data user’s annotation user’s information behavior

A semantic network

full text searching

bibliographic info. Searching

associate searching

adapt to the user

General Data Model (illustration)

Haystack Document

%!PS Postscript …

tie.Body

Glue todays …

http://goose.lcs.mit.edu …

File Type Guess Rule

Postscript …

needle.Binary

tie.Text

needle.Text

needle.Text

needle.Text

needle.Text

tie.Location.URL

tie.file Type.Postscript

tie.Creator

Needle tie Bale

General Data Model (summary)

Inheritance hierarchy Straw needle: primitive information bale: collection of related straws tie: relationship b/w straws

Metadata representationRecursive metadata annotation Interface Haystack to external “services”

Index agents controlling external devices

System Architecture

Database, searching engine an adapter as interface to various engines

Core Haystack system (root server) data model implementation operation-system-like services

Client level services user interface proxy services data augmenting

services

annotation, querying, browsing

observing interaction with external information resources

modifying data, adding links,…

System Architecture (illustration)

“Real” world

Proxy services: WWW, Mail,

Filesystem, etc.

User Event Driven Clients: Type Guesset, Fetch Services, etc.

Internet Services: Java, WWW, Emacs, Command Line, etc. Query

“Helper” Services

Internet Services: Java, WWW, Emacs, Command Line, etc.

Query Query

Haystack Data

Other Info. Sys. IR System Lore DBM

Client Layer

Service Layer

Database Layer

Haystack Root Server

Indexing in Haystack

Straws generate textual informationIR system stores such informationInfo. from each straw will be regarded as

one unit of indexing allows to associate pieces of information

Incrementally indexing whenever a series of changes happen

Outline

What is HaystackHaystack and IRData ModelSystem ArchitectureData Gathering Problems

Information gathering

User’s explicit annotationUser’s behaviors observed by the system

interaction with outside world (www, emails) interaction with Haystack

building query paths – adapting to the user’s style

Analyzing corpus already in Haystack indexing metadata extraction adding links between documents

User’s explicit annotation

Probably the best information sourceMight not be realisticNicer interface to encourage usersHCI studies

Observers

Proxy services WWW, email proxies

Recording webpages the user sees Tracing the path of browsing Recording visiting time ……

Query observer Using query interactions to mold the data model

to the user

Plug in new data

Adding links b/w nodes

Facilitating retrieval

Query Observer

Integrates queries into the data modelQuery straw

a bale, containing query text, rank of docs, …. attached nodes of matched documents annotations from user’s choices relevance

feedback

tuned to a particular user

Query path a chain of query straws in a single searching good for future retrievals:

presenting similar query terms adapting relevance of documents

by reindexing documents with text of the query path

Information gathering

User’s explicit annotationUser’s behaviors observed by the system

interaction with outside world (www, emails) interaction with Haystack

building query paths – adapting to the user’s style

Analyzing corpus already in Haystack indexing metadata extraction adding links between documents

Data augmenting clients

Data driven clients

Data augmenting clients

“Real” world

Proxy services: WWW, Mail,

Filesystem, etc.

User Event Driven Clients: Type Guesset, Fetch Services, etc.

Internet Services: Java, WWW, Emacs, Command Line, etc. Query

“Helper” Services

Internet Services: Java, WWW, Emacs, Command Line, etc.

Query Query

Haystack Data

Other Info. Sys. IR System Lore DBM

Client Layer

Service Layer

Database Layer

Haystack Root Server

Data augmenting clients

digesting existing information, generating new information

Independent but cooperating Fetch clients Type inference clients Extractor clients Field finder clients

Triggered by events: data changes in Haystack

Data augmenting clients: example

Figure 3: Data driven clients in action

Fetch client

Type inference client

Extractor client

(a) (b)

Document Bits

(0010100111…)

URL URL

(http://web.mit.edu…)

Document

(Haystack: A…) URL Document Bits

Type

URL

Document Bits

Type (Postscript)

Document Document

(d) (c)

Document

SummaryA prototype of a personalized information

organization and retrieval systemRelationship with IRGeneral Data Model

graph, straws, …System Architecture

three layers: DB, core, clientsData Gathering

three approaches

Problems

Information maximization assumption the more, the better? for one user, but has to be prepared for all users what are useful clues?

Efficiency issues dynamic indexing a slow system (512M memory, 2G disk…)

Today’s haystack project semantic web, RDF, ontology, user interface …