Increasing the Information Density in Digital Library Results IndoUS Workshop Gio Wiederhold Stanford University 22 June 2003


Increasing the Information Density in Digital Library Results

IndoUS Workshop

Gio Wiederhold, Stanford University

22 June 2003


Map

Courtesy of Univ. of Pittsburgh DL Project


Attention is the issue.

  "What information consumes is rather obvious; it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it."

[Herb Simon]

Complementary objective:

Don't waste the attention afforded to information


My focus: science and business use (I set aside the artistic aspects for now; they are also important)

Required by customer:

1. knowledge to process information, and

2. tools to facilitate that process
   – Locate
   – Select
   – Articulate, not Integrate
   – Summarize
   – Project - exploit data mining


Technologies to filter Information

Survey of Technologies from common to rare

1. Ranking

2. Eliminate redundancy

3. Assure novelty

4. Abstraction

5. Data mining

6. Reduction for visual presentation

7. Modeling

8. Prediction

9. Finding Abnormal Events


1. Ranking

Assumption: The consumer only considers a few documents at the top of the list.

a. Ranking by authority
   • Select sites that are valued in a context
   • a journal versus a workshop report
   • a recent document

b. Ranking by reference authority
   • recursive value by references to it (Google)
   • extracts global communal knowledge

c. Rank by customer's context
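
One way to read "recursive value by references" is a PageRank-style iteration, in which a document's value is fed by the value of the documents that reference it. The sketch below is a minimal illustration under that assumption; the function name `reference_rank`, the damping factor of 0.85, and the sample citation graph are hypothetical and not taken from the talk.

```python
def reference_rank(links, damping=0.85, iterations=50):
    """Rank documents by the value of the documents that reference them.

    links maps each document to the list of documents it references.
    """
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        new_rank = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # A document passes its value on to what it references.
                share = damping * rank[d] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A document with no references spreads its value evenly.
                share = damping * rank[d] / n
                for t in docs:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical citation graph: who references whom.
citations = {
    "survey": ["journal_paper"],
    "workshop_report": ["journal_paper", "survey"],
    "journal_paper": [],
    "tech_note": ["journal_paper"],
}
ranks = reference_rank(citations)
ranked = sorted(ranks, key=ranks.get, reverse=True)
```

Here `ranked` lists the most-referenced document first. Ranking by the customer's context (item c) would need a further signal that this sketch does not model.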


2. Eliminate redundancy

a. If similar documents are retrieved
   • present the latest one
   • present the highest ranked one, per a suitable criterion, i.e., the user's context.

b. Only report differences among documents
   • look for additional material
   • decide what are significant differences
   • abstract differences (see later)
   • show differences in layout only as maps
   • compute a metric of difference
   • deal with many documents
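
As a rough illustration of "compute a metric of difference" and "only report differences", the following sketch uses Python's standard `difflib` module. The helper names `similarity` and `differences` and the two sample documents are assumptions made for this example; deciding which differences are significant remains the hard part and is not addressed here.

```python
import difflib


def similarity(a, b):
    """Word-level similarity in [0, 1]; 1.0 means identical word sequences."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()


def differences(old, new):
    """Return the lines that appear in `new` but not in `old`."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical near-duplicate documents.
doc_v1 = "Digital libraries need ranking.\nUsers read only the top results."
doc_v2 = (
    "Digital libraries need ranking.\n"
    "Users read only the top results.\n"
    "Novelty matters more than volume."
)

score = similarity(doc_v1, doc_v2)
added = differences(doc_v1, doc_v2)
```

A system could present only `doc_v2` together with `added` when `score` exceeds some chosen threshold, instead of showing both documents in full.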


3. Value is in the Novelty

a. Information relative to a document collection
   – Exploits prior technologies

b. Information relative to a customer
   – What is the knowledge held by an individual?
   – Can it be captured?

Domain recognition to determine context

Avoid (unsolvable?) problem of `common knowledge'
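
Novelty relative to a collection can be approximated very crudely as the share of a document's terms that the collection has not yet seen. The sketch below assumes that simplification; the names `terms` and `novelty` and the sample texts are hypothetical. The same function could be applied relative to a customer by passing the documents that customer is known to have read, which sidesteps but does not solve the problem of common knowledge.

```python
def terms(text):
    """Lower-cased words with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()").lower() for w in text.split()} - {""}


def novelty(document, collection):
    """Fraction of the document's terms that do not occur in the collection."""
    seen = set()
    for known in collection:
        seen |= terms(known)
    doc_terms = terms(document)
    if not doc_terms:
        return 0.0
    return len(doc_terms - seen) / len(doc_terms)


# Hypothetical collection and candidate documents.
collection = [
    "Ranking orders documents by authority.",
    "Redundant documents waste attention.",
]
familiar = "Ranking documents by authority."
fresh = "Models predict abnormal events."

familiar_score = novelty(familiar, collection)
fresh_score = novelty(fresh, collection)
```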


4. Abstraction

a. Only present essentials of textual documents
   1. Domain-independent abstraction: selecting sentences that appear to represent the contents
   2. Domain-specific text can be effectively abstracted
      a. pathology reports -- being done
      b. automatic annotation of gene sequences from papers

b. Abstracting contents of document collections
   1. Classify
   2. Differentiate (2.b)
   3. Integrate
   4. Semantic matching if the sources are autonomous
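
Domain-independent abstraction by sentence selection (item a.1) can be sketched with a simple frequency heuristic: score each sentence by how often its words occur in the whole text and keep the top scorers. This is one common baseline, not necessarily the method the talk has in mind; the stopword list, the function name `abstract`, and the sample text are assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "that", "it", "for", "on"}


def content_words(sentence):
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def abstract(text, max_sentences=1):
    """Select the sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Keep the selected sentences in their original order.
    return [s for s in sentences if s in top]


sample = ("Digital libraries return too many documents. "
          "Ranking helps users find relevant documents in digital libraries. "
          "The weather was pleasant.")
summary = abstract(sample, max_sentences=1)
```

Domain-specific abstraction (item a.2) would replace the frequency score with knowledge of the domain's vocabulary and report structure.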


5. Data mining

Out of scope for digital library research, but

a. Linking data-mining results with information from textual sources strengthens users' explanatory capabilities.

b. Data-mining develops models that can be further exploited


6. Reduction for Visualization

Motivated by modern customers’ settings

a. Reduce numeric data for visual presentation
   1. Common
   2. Can be automated, but rarely done well

b. Reduce textual information into visuals. Requires
   1. Abstraction
   2. Placing the result into some model, i.e., temporal or spatial aspects:
      • Progress notes for a patient – disease model
      • Description of an exploratory journey – attach to a map
      • Progress of a scientific project – versus proposal


7. Modeling

Models of a domain allow analysis & manipulation

a. to discern novelty

b. representation of normal behavior
   • corporate finances from 10-K
   • ecological processes, and global change
   • metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.


8. Prediction

Current information technologies { databases, data-mining, digital libraries } provide only background information for decision-making.

Today, the decision maker
   1. copies results into a spreadsheet
   2. adds formulas to make extrapolations into the future

a. Continue model scenarios into the possible futures
   1. Investments - monetary, personnel, research, . . .
   2. Probabilities of outcomes etc.

b. Allow comparison of alternatives

Information systems should not terminate their support with the past, but should also extrapolate the results with the models used for analysis.
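
The spreadsheet step described above, extrapolating past results and comparing alternatives, can be sketched as a least-squares trend that is continued under different scenarios. Everything specific here is an assumption for illustration: the function names `fit_line` and `extrapolate`, the `slope_factor` parameter that stands in for a scenario, and the sample figures.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(values, periods, slope_factor=1.0):
    """Continue the fitted trend; slope_factor scales the trend for a scenario."""
    slope, intercept = fit_line(values)
    last_fitted = intercept + slope * (len(values) - 1)
    return [last_fitted + slope * slope_factor * k
            for k in range(1, periods + 1)]


# Hypothetical past results and two alternatives to compare.
history = [100, 110, 120, 130]
alternatives = {"status quo": 1.0, "extra investment": 1.5}
forecasts = {name: extrapolate(history, 2, factor)
             for name, factor in alternatives.items()}
```

Comparing `forecasts["status quo"]` with `forecasts["extra investment"]` gives the side-by-side view of alternatives that item b asks for; attaching probabilities to outcomes would require a richer model than a single trend line.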


9. Finding Abnormal Events

• A hard challenge is discovering abnormal situations.
   – e.g., looking for terrorists.
   Note: observables are the effect of many good and a few bad scenarios

• Traditional data-mining finds frequent relationships
   – abduct the processes that generate those data
   – serves marketing folk

• Intelligence tasks seek unusual or abnormal behavior
   1. Use model based on recent incidents
      • flight-school enrollments of terrorists
   2. Create and use a reasonable, but hypothetical model
      • shipping containers can carry nuclear devices into the US
   3. Create a model of normal findings


9+. Create & exploit a normal state model

Prerequisite for finding abnormal events: abnormalities can only be identified if normality can be quantified

a. Populate an initial model with normal findings
   – Coverage: all likely causes of some observable(s)

b. Identify variation not due to known causes
   • Temporal tracking is better than static schemes

c. Increase coverage as needed - feedback to b.

d. Maintain models to recognize unexplainables

Such models will be large since observed data are the aggregate of activities from many domains, e.g., travel patterns: business, holidays, family visits, and emergencies.
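
Steps a to c can be illustrated with a deliberately small, single-variable stand-in for a normal-state model: record normal findings, flag observations that deviate too far from them, and feed accepted observations back into the model. The class name `NormalStateModel`, the threshold of three standard deviations, and the sample data are assumptions; the models the slide describes would cover many causes and domains rather than one number.

```python
from statistics import mean, stdev


class NormalStateModel:
    """Quantifies normality for one observable from a list of normal findings."""

    def __init__(self, normal_findings, threshold=3.0):
        # Step a: populate the model with normal findings (at least two).
        self.findings = list(normal_findings)
        self.threshold = threshold

    def is_abnormal(self, value):
        # Step b: variation beyond the threshold is not explained by the model.
        mu = mean(self.findings)
        sigma = stdev(self.findings)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

    def observe(self, value):
        # Step c: normal observations extend coverage (feedback to step b).
        abnormal = self.is_abnormal(value)
        if not abnormal:
            self.findings.append(value)
        return abnormal


# Hypothetical weekly counts of some observable, e.g. bookings on a route.
model = NormalStateModel([100, 104, 98, 101, 97, 103, 99, 102])
ordinary = model.observe(102)
unusual = model.observe(160)
```

Because each accepted observation updates the baseline, the model tracks the observable over time instead of comparing against a fixed, static scheme.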


Benefits

A `business model' for justifying ongoing DL research is needed [Y.T.Chien]

• A business model includes benefits and costs
   1. Benefits:
      – Broad access to knowledge
      – Education of the next generation
      – Preservation of cultural heritage
      – Mutual, inter-cultural understanding, reduction of conflicts
      – Improved decision-making
   2. Costs:
      – Time and money spent on information systems
         • Technology
         • Contents
      – Time spent on obtaining the information
      – Time spent on analyzing the information
      – Due to errors

Focus of prior slides



Cost of Errors -- balance

• Type 1 errors: omitted relevant information
   • Lost opportunities
   • Unperceived risk
   • Suboptimal choices
   • Cost: f (variance)
      – High if variance is high
      – Low if variance is low
         • purchasing

• Type 2 errors: excess irrelevant information
   • Overload
   • Inability to analyze all
   • Risk of being misled
   • Cost: delay, human
      – High if excess is high
         • human time is valuable
      – Low if precision is high
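
The two error types map naturally onto the standard retrieval measures: omitted relevant information is what recall misses, and excess irrelevant information is what lowers precision. The sketch below makes that mapping explicit; the function name `error_rates`, the result keys, and the sample document identifiers are assumptions for illustration.

```python
def error_rates(retrieved, relevant):
    """Share of relevant items omitted and share of retrieved items that are excess."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return {
        "omitted_relevant": 1.0 - recall,       # Type 1 in the slide's terms
        "excess_irrelevant": 1.0 - precision,   # Type 2 in the slide's terms
    }


# Hypothetical result list and ground truth.
rates = error_rates(retrieved=["d1", "d2", "d3", "d4"],
                    relevant=["d1", "d2", "d5"])
```

Balancing the two, as the slide title suggests, means weighting each rate by its cost in the application at hand; that weighting is not modeled here.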


Exploiting Information

[Diagram labels: Data and their relationships; knowledge; Decision (?); Action 1, Action 2, Action 3; Effects]

Has not been an explicit focus of DL research.

It is the point that generates benefits.


The Major Feedback Loop

[Diagram labels: knowledge; computer scientists, will provide tools; user exploiting communities: distilled knowledge, categorized; user contributing communities: human knowledge, validated by data]


Conclusion

Much work is left to be done with digital libraries

Exploiting the results will motivate more investment
   • In technology
   • In content breadth and depth

Customers' expectations will change
   o Global access is here
   o Heterogeneity will remain, cause errors
   o Ubiquitous access is near


Optional discussion points

• Interfaces

• Personalization

• Heterogeneity

• Computer scientists and their customers

• Data versus relationships

• Disruptive factors


New user interface settings

The new generation

• is more comfortable with screen displays

• can navigate to analyses, backup

• considers paper to be heavy and awkward

• is poor in handwriting and spelling

• is facile in brief keyboard messages

• expects simple voice command technology


Personalization, 2 models

1. Everything about an individual
   – learn all about the individual
      • slow, delayed, lags

2. An individual as a member of groups
   • learn about the likely memberships { 8th grader, ...; carpenter, ...; opera goer, ...; ... }
   • learn and assign knowledge to group
   • inherit knowledge collected in those groups
   • so that the individual also benefits

Context
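
The second model, an individual as a member of groups, can be sketched as a lookup in which an individual's context is the union of what is known about each group they likely belong to. The group names echo the slide's examples; the stored terms and the function names `inherited_context` and `learn` are hypothetical.

```python
def inherited_context(memberships, group_knowledge):
    """An individual's context: the union of knowledge held for their groups."""
    context = set()
    for group in memberships:
        context |= group_knowledge.get(group, set())
    return context


def learn(group, term, group_knowledge):
    """Assign newly learned knowledge to a group, not to one individual."""
    group_knowledge.setdefault(group, set()).add(term)


# Hypothetical knowledge collected per group.
knowledge = {
    "carpenter": {"woodworking", "joinery"},
    "opera goer": {"opera", "verdi"},
}
memberships = ["carpenter", "opera goer"]

before = inherited_context(memberships, knowledge)
learn("carpenter", "timber framing", knowledge)
after = inherited_context(memberships, knowledge)
```

Because learning is attached to the group, every member inherits the new term at once, which avoids the slow per-individual learning of the first model.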


Interoperation/Interoperability

• Heterogeneity is a fact, and attempts at enforcing consistency are misguided

• Natural consistency will be an outcome of collaboration


Data and their relationships

• Data are verifiable first-order objects
   – observable
   – automatic acquisition is common

• Relationships are also first-order objects
   – defined by metadata in context { schemas, references, dependencies, is-a, causality, ... }
   – Hard to discover
   – Instances verifiable in contexts
   – Needed for exploitation


Customers and Computer Scientists

• Mutual arrogance fed by misunderstandings

• Differing scientific paradigms
   – Mathematical: formal, definite
   – Social, biological: case-based, indefinite


Disruptive factors [June 03 NSF meet]

• Technologies
   – ubiquitous access
   – community empowerment
      • data & semantics contribution
   – Machine translation of modest quality

• Sociological
   – Imposed privacy constraints
   – TIA reactions national/international
   – Commercial pressures
      • skimming the cream


Roadblocks [Y.T. Chien]

• lack of a business model

• matching technology to user needs

• define a research pipeline [NAS HPCC report]