
Increasing the Information Density in Digital Library Results

IndoUS Workshop

Gio Wiederhold, Stanford University

22 June 2003


[Map: courtesy of Univ. of Pittsburgh DL Project]


Attention is the issue.

"What information consumes is rather obvious; it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it."

[Herb Simon]

Complementary objective:

Don't waste the attention afforded to information


My focus: science and business use (for now I set aside the artistic aspects, which are also important)

Required by customer:

1. knowledge to process information, and

2. tools to facilitate that process
   – Locate
   – Select
   – Articulate, not Integrate
   – Summarize
   – Project - exploit data mining


Technologies to filter Information

Survey of Technologies from common to rare

1. Ranking

2. Eliminate redundancy

3. Assure novelty

4. Abstraction

5. Data mining

6. Reduction for visual presentation

7. Modeling

8. Prediction

9. Finding Abnormal Events


1. Ranking

Assumption: The consumer only considers a few documents at the top of the list.

a. Ranking by authority
   • Select sites that are valued in a context,
   • a journal versus a workshop report,
   • a recent document.

b. Ranking by reference authority
   • recursive value by references to it (Google)
   • extracts global communal knowledge

c. Rank by customer's context
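The "recursive value by references" idea in item b can be illustrated with a small power iteration. This is only a minimal sketch in the spirit of PageRank, not the method any particular search engine uses; the function name `rank_by_references`, the damping value 0.85, and the three sample documents are assumptions made for illustration.

```python
def rank_by_references(links, damping=0.85, iterations=50):
    """Score documents recursively by the references pointing at them.

    links maps each document to the list of documents it references.
    All names and the damping value are illustrative assumptions.
    """
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    score = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Documents with no outgoing references spread their score evenly.
        dangling = sum(score[d] for d in docs if not links.get(d))
        new = {d: (1.0 - damping) / n + damping * dangling / n for d in docs}
        for src, targets in links.items():
            if targets:
                share = damping * score[src] / len(targets)
                for t in targets:
                    new[t] += share
        score = new
    ranking = sorted(docs, key=lambda d: score[d], reverse=True)
    return ranking, score


# Hypothetical reference graph: two later documents cite a classic paper.
links = {
    "survey": ["classic"],
    "workshop_note": ["classic", "survey"],
    "classic": [],
}
ranking, scores = rank_by_references(links)
print(ranking)
```

A document's score here depends on the scores of the documents that reference it, which is what makes the value recursive and communal rather than assigned by an editor.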


2. Eliminate redundancy

a. If similar documents are retrieved
   • present the latest one
   • present the highest ranked one, per a suitable criterion, e.g., the user's context.

b. Only report differences among documents
   • look for additional material
   • decide what are significant differences
   • abstract differences (see later)
   • show differences in layout only as maps
   • compute metric of difference
   • deal with many documents
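Item a can be sketched as follows: group retrieved documents that overlap heavily and present only the latest of each group. The similarity measure (Jaccard overlap of three-word shingles), the 0.5 threshold, and the sample documents are assumptions chosen for illustration; other metrics of difference would fit the same outline.

```python
from datetime import date


def shingles(text, k=3):
    """Return the set of k-word sequences in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def drop_redundant(docs, threshold=0.5):
    """docs is a list of (doc_id, published, text); keep the latest of each similar group."""
    kept = []
    for doc in sorted(docs, key=lambda d: d[1], reverse=True):
        signature = shingles(doc[2])
        if all(jaccard(signature, shingles(other[2])) < threshold for other in kept):
            kept.append(doc)
    return [d[0] for d in kept]


# Hypothetical retrieval result: v2 is a lightly revised copy of v1.
docs = [
    ("v1", date(2003, 1, 10), "digital libraries need better ranking of search results"),
    ("v2", date(2003, 6, 22), "digital libraries need better ranking of search results today"),
    ("other", date(2002, 5, 1), "metabolic models describe drug effects on organisms"),
]
print(drop_redundant(docs))
```

Sorting by rank instead of by date would give the "present the highest ranked one" variant.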


3. Value is in the Novelty

a. Information relative to a document collection
   – Exploits prior technologies

b. Information relative to a customer
   – What is the knowledge held by an individual?
   – Can it be captured?

Domain recognition to determine context

Avoid (unsolvable?) problem of `common knowledge'
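Novelty relative to a collection (item a) can be approximated very crudely as the share of a document's terms that the collection has not yet seen. The sketch below assumes whitespace tokenization and a hypothetical `novelty` function; it says nothing about the harder question in item b of capturing what an individual already knows.

```python
def novelty(document, seen_texts):
    """Fraction of distinct terms in document that appear in none of seen_texts."""
    known = set()
    for text in seen_texts:
        known.update(text.lower().split())
    terms = set(document.lower().split())
    if not terms:
        return 0.0
    return len(terms - known) / len(terms)


# Hypothetical collection of already-delivered texts.
seen = ["ranking by authority", "ranking by references"]
print(novelty("ranking by novelty", seen))
print(novelty("ranking by authority", seen))
```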


4. Abstraction

a. Only present essentials of textual documents
   1. Domain-independent abstraction: selecting sentences that appear to represent the contents
   2. Domain-specific text can be effectively abstracted
      a. pathology reports -- being done
      b. automatic annotation of gene-sequences from papers

b. Abstracting contents of document collections
   1. Classify
   2. Differentiate (2.b)
   3. Integrate
   4. Semantic matching if the sources are autonomous
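Domain-independent abstraction by sentence selection (item a.1) can be sketched with a simple frequency heuristic: score each sentence by the average frequency of its content words and keep the top ones. The stopword list, the scoring rule, and the function name `abstract` are assumptions for illustration, not a description of any production summarizer.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "are"}


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def abstract(text, max_sentences=1):
    """Select the sentences whose words are most frequent in the whole text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in top]


text = "Attention is scarce. Information consumes attention. The weather was nice."
print(abstract(text))
print(abstract(text, max_sentences=2))
```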


5. Data mining

Out of scope for digital library research, but

a. Linking data-mining results with information from textual sources strengthens users' explanatory capabilities.

b. Data-mining develops models that can be further exploited


6. Reduction for Visualization

Motivated by modern customers' settings

a. Reduce numeric data for visual presentation
   1. Common
   2. Can be automated, but rarely done well

b. Reduce textual information into visuals. Requires
   1. Abstraction
   2. Placing the result into some model, e.g., temporal or spatial aspects:
      • Progress notes for a patient – disease model
      • Description of an exploratory journey – attach to a map
      • Progress of a scientific project – versus proposal
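Reducing numeric data for a display (item a) often comes down to summarizing many readings into as many buckets as the screen can show. The sketch below keeps the minimum, mean, and maximum of each bucket so that spikes are not averaged away; the function name `reduce_for_plot` and the sample readings are assumptions for illustration.

```python
def reduce_for_plot(values, buckets):
    """Summarize values into at most `buckets` (min, mean, max) triples."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    n = len(values)
    summary = []
    for b in range(buckets):
        lo = b * n // buckets
        hi = (b + 1) * n // buckets
        chunk = values[lo:hi]
        if chunk:  # skip empty buckets when there are fewer values than buckets
            summary.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return summary


# Hypothetical series of eight readings reduced to two plotted points.
readings = [1, 2, 3, 4, 10, 12, 11, 13]
print(reduce_for_plot(readings, 2))
```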


7. Modeling

Models of a domain allow analysis & manipulation

a. to discern novelty

b. representation of normal behavior
   • corporate finances from 10-K
   • ecological processes, and global change
   • metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.


8. Prediction

Current information technologies { databases, data-mining, digital libraries } provide only background information for decision-making.

Today, the decision maker
1. copies results into a spreadsheet
2. adds formulas to make extrapolations into the future

a. Continue model scenarios into the possible futures
   1. Investments - monetary, personnel, research, . . .
   2. Probabilities of outcomes etc.

b. Allow comparison of alternatives

Information systems should not end their support with the past, but should also extrapolate the results using the models used for analysis.
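What the decision maker does today in a spreadsheet can be sketched directly: fit a trend to past results, extend it into the future, and compare alternatives. The sketch assumes a plain least-squares line as the model; the scenario names and figures are invented for illustration and carry no empirical meaning.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, future_x):
    """Extend the fitted line to a future point."""
    slope, intercept = fit_line(xs, ys)
    return slope * future_x + intercept


# Hypothetical past outcomes for two investment alternatives.
years = [2000, 2001, 2002, 2003]
alternatives = {
    "invest_in_content": [10.0, 12.0, 14.0, 16.0],
    "invest_in_tools": [10.0, 11.0, 12.0, 13.0],
}
forecast = {name: extrapolate(years, ys, 2005) for name, ys in alternatives.items()}
best = max(forecast, key=forecast.get)
print(forecast, best)
```

A fuller treatment would attach probabilities to outcomes, as item a.2 suggests, rather than report a single projected value per alternative.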


9. Finding Abnormal Events

• A hard challenge is discovering abnormal situations
  – e.g., looking for terrorists

Note: observables are the effect of many good and a few bad scenarios

• Traditional data-mining finds frequent relationships
  – abduct the processes that generate those data
  – serves marketing folk

• Intelligence tasks seek unusual or abnormal behavior
  1. Use model based on recent incidents
     • flight-school enrollments of terrorists
  2. Create and use a reasonable, but hypothetical model
     • shipping containers can carry nuclear devices into the US
  3. Create a model of normal findings


9+. Create & exploit a normal state model

Prerequisite for finding abnormal events: abnormalities can only be identified if normality can be quantified

a. Populate an initial model with normal findings
   – Coverage: all likely causes of some observable(s)

b. Identify variation not due to known causes
   • Temporal tracking is better than static schemes

c. Increase coverage as needed - feedback to b.

d. Maintain models to recognize unexplainables

Such models will be large, since observed data are the aggregate of activities from many domains, e.g., travel patterns: business, holidays, family visits, and emergencies.
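Steps a and b can be sketched with the simplest possible normal-state model: a moving baseline of recent observations, with a value flagged when it deviates from that baseline by more than a few standard deviations. The window size, the threshold of 3, and the traveler counts are assumptions for illustration; a realistic model would first account for known causes such as holidays.

```python
from statistics import mean, stdev


def find_abnormal(series, window=5, threshold=3.0):
    """Return indices whose value deviates strongly from the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change counts as a deviation.
            deviates = series[i] != mu
        else:
            deviates = abs(series[i] - mu) / sigma > threshold
        if deviates:
            flagged.append(i)
    return flagged


# Hypothetical daily traveler counts with one unexplained spike.
daily_travelers = [100, 102, 98, 101, 99, 100, 103, 180, 101, 99]
print(find_abnormal(daily_travelers))
```

Because the baseline moves with time, this is a small instance of temporal tracking rather than a static scheme. Note that the spike itself enters later baselines and widens them, which is one reason such models need maintenance (step d).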


Benefits

A `business model' for justifying ongoing DL research is needed [Y.T.Chien]

• A business model includes benefits and costs

1. Benefits:
   – Broad access to knowledge
   – Education of the next generation
   – Preservation of cultural heritage
   – Mutual, inter-cultural understanding, reduction of conflicts
   – Improved decision-making

2. Costs
   – Time and money spent on information systems
     • Technology
     • Contents
   – Time spent on obtaining the information
   – Time spent on analyzing the information
   – Due to errors

Focus of prior slides



Cost of Errors -- balance

• Type 1 errors: omitted relevant information
  • Lost opportunities
  • Unperceived risk
  • Suboptimal choices
  • Cost: f (variance)
    – High if variance is high
    – Low if variance is low
      • purchasing

• Type 2 errors: excess irrelevant information
  • Overload
  • Inability to analyze all
  • Risk of being misled
  • Cost: delay, human
    – High if excess is high
      • human time is valuable
    – Low if precision is high
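The two kinds of error can be counted for a single query once a set of truly relevant documents is known. The sketch below reports omitted relevant items and excess irrelevant items alongside the usual recall and precision measures; the function name `error_balance` and the document identifiers are assumptions for illustration.

```python
def error_balance(retrieved, relevant):
    """Count omitted relevant and excess irrelevant items for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return {
        "omitted_relevant": sorted(relevant - retrieved),   # missed information
        "excess_irrelevant": sorted(retrieved - relevant),  # overload
        "recall": len(hits) / len(relevant) if relevant else 1.0,
        "precision": len(hits) / len(retrieved) if retrieved else 1.0,
    }


# Hypothetical query result judged against a known relevant set.
result = error_balance(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
)
print(result)
```

Low recall corresponds to the cost of omitted information, and low precision to the cost of human time spent on excess material; the balance between them depends on the task.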


Exploiting Information

[Diagram labels: Data and their relationships; knowledge; Decision (?); Action 1, Action 2, Action 3; Effects]

Has not been an explicit focus of DL research.

It is the point that generates benefits


The Major Feedback Loop

[Diagram labels:
 • knowledge
 • computer scientists, will provide tools
 • user exploiting communities: distilled knowledge, categorized
 • user contributing communities: human knowledge, validated by data]


Conclusion

Much work is left to be done with digital libraries

Exploiting the results will motivate more investment
• In technology
• In content breadth and depth

Customers' expectations will change
o Global access is here
o Heterogeneity will remain, cause errors
o Ubiquitous access is near


Optional discussion points

• Interfaces

• Personalization

• Heterogeneity

• Computer scientists and their customers

• Data versus relationships

• Disruptive factors


New user interface settings

The new generation

• is more comfortable with screen displays

• can navigate to analyses, backup

• considers paper to be heavy and awkward

• is poor in handwriting and spelling

• is facile in brief keyboard messages

• expects simple voice command technology


Personalization, 2 models

1. Everything about an individual
   – learn all about the individual
     • slow, delayed, lags

2. An individual as a member of groups
   • learn about the likely memberships
     { 8th grader, ...; carpenter, ...; opera goer, ...; ...}
   • learn and assign knowledge to group
   • inherit knowledge collected in those groups
   • as a result, the individual also benefits

Context
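The second model can be sketched as inheritance: knowledge is learned and stored per group, and an individual's context is assembled from the groups he or she likely belongs to. The group names follow the slide's examples; the knowledge items and the function name `inherited_knowledge` are invented for illustration.

```python
# Hypothetical knowledge collected per group, not per individual.
GROUP_KNOWLEDGE = {
    "8th grader": {"basic algebra"},
    "carpenter": {"joinery", "wood types"},
    "opera goer": {"Verdi"},
}


def inherited_knowledge(memberships, group_knowledge):
    """Union of the knowledge assigned to each of an individual's groups."""
    known = set()
    for group in memberships:
        # Unknown groups contribute nothing rather than raising an error.
        known |= group_knowledge.get(group, set())
    return known


context = inherited_knowledge(["carpenter", "opera goer"], GROUP_KNOWLEDGE)
print(sorted(context))
```

Nothing has to be learned about the individual beyond the memberships, which avoids the slow start of the first model.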


Interoperation/Interoperability

• Heterogeneity is a fact, and attempts at enforcing consistency are misguided

• Natural consistency will be an outcome of collaboration


Data and their relationships

• Data are verifiable first-order objects
  – observable
  – automatic acquisition is common

• Relationships are also first-order objects
  – defined by metadata in context
    { schemas, references, dependencies, is-a, causality, ... }
  – Hard to discover
  – Instances verifiable in contexts
  – Needed for exploitation


Customers and Computer Scientists

• Mutual arrogance fed by misunderstandings

• Differing scientific paradigms
  – Mathematical: formal, definite
  – Social, biological: case-based, indefinite


Disruptive factors [June 03 NSF meet]

• Technologies
  – ubiquitous access
  – community empowerment
    • data & semantics contribution
  – Machine translation of modest quality

• Sociological
  – Imposed privacy constraints
  – TIA reactions national/international
  – Commercial pressures
    • skimming the cream


Roadblocks [Y.T. Chien]

• lack of a business model

• matching technology to user needs

• define a research pipeline [NAS HPCC report]