Increasing the Information Density in Digital Library Results
IndoUS Workshop
Gio Wiederhold, Stanford University
22 June 2003
June 2003 IndoUS Gio 2
Map
Courtesy of Univ. of Pittsburgh DL Project
Attention is the issue.
"What information consumes is rather obvious; it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it."
[Herb Simon]
Complementary objective:
Don't waste the attention afforded to information
My focus: science and business use (I set aside the artistic aspects for now, although they are also important)
Required by customer:
1. knowledge to process information, and
2. tools to facilitate that process
   – Locate
   – Select
   – Articulate, not Integrate
   – Summarize
   – Project: exploit data mining
Technologies to filter Information
Survey of Technologies from common to rare
1. Ranking
2. Eliminate redundancy
3. Assure novelty
4. Abstraction
5. Data mining
6. Reduction for visual presentation
7. Modeling
8. Prediction
9. Finding Abnormal Events
1. Ranking
Assumption: The consumer only considers a few documents at the top of the list.
a. Ranking by authority
   • Select sites that are valued in a context
   • a journal versus a workshop report
   • a recent document
b. Ranking by reference authority
   • recursive value by references to it (Google)
   • extracts global communal knowledge
c. Ranking by customer's context
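A minimal sketch of ranking by reference authority (item b), in which a document's value is computed recursively from the documents that reference it. This is a simplified PageRank-style power iteration, not the algorithm of any particular search engine. The function name `rank_by_references`, the `damping` value, and the iteration count are assumptions made for illustration.

```python
def rank_by_references(links, damping=0.85, iterations=50):
    """Rank documents by recursive reference authority.

    links maps each document id to the list of documents it references.
    Returns (ids sorted from most to least authoritative, score per id).
    The damping factor and iteration count are illustrative assumptions.
    """
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    score = {d: 1.0 / n for d in docs}

    for _ in range(iterations):
        # Every document starts each round with a small base value.
        new = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # A document passes its value on to what it references.
                share = damping * score[d] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A document with no references spreads its value evenly.
                share = damping * score[d] / n
                for t in docs:
                    new[t] += share
        score = new

    ranking = sorted(docs, key=lambda d: score[d], reverse=True)
    return ranking, score
```

For example, with `{"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}` the document `c` is referenced most often and ranks first, and `a` ranks second because the highly valued `c` references it.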
2. Eliminate redundancy
a. If similar documents are retrieved
   • present the latest one
   • present the highest ranked one, per a suitable criterion, i.e., user's context
b. Only report differences among documents
   • look for additional material
   • decide what are significant differences
   • abstract differences (see later)
   • show differences in layout only as maps
   • compute metric of difference
   • deal with many documents
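A minimal sketch of item a together with the "compute metric of difference" point: near-duplicate documents are detected with Jaccard similarity over word shingles, and only the latest document of each group is presented. The choice of metric, the shingle size, the `threshold`, and all function names are assumptions for illustration; the slides do not prescribe a particular measure.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences in a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Similarity of two shingle sets: 1.0 is identical, 0.0 is disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_redundant(docs, threshold=0.6):
    """Keep the latest document from each group of near-duplicates.

    docs is a list of (doc_id, iso_date, text) tuples. ISO dates sort
    correctly as strings. The threshold is an illustrative assumption.
    """
    kept = []
    for doc in sorted(docs, key=lambda d: d[1], reverse=True):
        signature = shingles(doc[2])
        # Newer documents are visited first, so anything similar to an
        # already kept document is an older near-duplicate.
        if all(jaccard(signature, shingles(k[2])) < threshold for k in kept):
            kept.append(doc)
    return kept
```

Replacing the date in the sort key with a rank score would give the "present the highest ranked one" variant.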
3. Value is in the Novelty
a. Information relative to a document collection
   – Exploits prior technologies
b. Information relative to a customer
   – What is the knowledge held by an individual?
   – Can it be captured?
Domain recognition to determine context
Avoid (unsolvable?) problem of `common knowledge'
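A minimal sketch of novelty relative to a customer (item b), under the strong assumption that the customer's knowledge can be approximated by the set of terms in documents already read. Whether knowledge can really be captured this way is exactly the open question raised above; the function name `novelty` and the term-set profile are hypothetical.

```python
def novelty(doc_text, known_terms):
    """Fraction of a document's distinct terms that the customer has not seen.

    known_terms is a set of lowercase terms standing in for the
    customer's knowledge (an illustrative assumption).
    """
    terms = set(doc_text.lower().split())
    if not terms:
        return 0.0
    return len(terms - known_terms) / len(terms)


def most_novel_first(doc_texts, known_terms):
    """Order candidate documents so the most novel ones come first."""
    return sorted(doc_texts, key=lambda t: novelty(t, known_terms), reverse=True)
```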
4. Abstraction
a. Only present essentials of textual documents
   1. Domain-independent abstraction: selecting sentences that appear to represent the contents
   2. Domain-specific text can be effectively abstracted
      a. pathology reports -- being done
      b. automatic annotation of gene-sequences from papers
b. Abstracting contents of document collections
   1. Classify
   2. Differentiate (2.b)
   3. Integrate
   4. Semantic matching if the sources are autonomous
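A minimal sketch of domain-independent abstraction (item a.1): sentences are scored by the average collection frequency of their words, and the highest scoring ones are kept in their original order. This is one simple frequency heuristic among many; the stopword list, the scoring rule, and the name `abstract` are assumptions for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "was", "are", "for", "on", "that"}


def content_words(sentence):
    """Lowercased words of a sentence with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def abstract(text, max_sentences=2):
    """Select the sentences that appear to represent the contents."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order for readability.
    return [sentences[i] for i in sorted(best[:max_sentences])]
```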
5. Data mining
Out of scope for digital library research, but
a. Linking data-mining results with information from textual sources strengthens users' explanatory capabilities.
b. Data-mining develops models that can be further exploited
6. Reduction for Visualization
Motivated by modern customers’ settings
a. Reduce numeric data for visual presentation
   1. Common
   2. Can be automated, but rarely done well
b. Reduce textual information into visuals. Requires
   1. Abstraction
   2. Placing the result into some model, i.e., temporal or spatial aspects:
      • Progress notes for a patient – disease model
      • Description of an exploratory journey – attach to a map
      • Progress of a scientific project – versus proposal
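A minimal sketch of item a, reducing a long numeric series to a fixed number of points for display. Each bucket keeps its minimum, mean, and maximum so that outliers stay visible after reduction. The bucketing scheme and the name `reduce_series` are assumptions for illustration, not a method taken from the slides.

```python
def reduce_series(values, buckets):
    """Reduce a numeric series to at most `buckets` (min, mean, max) triples."""
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    n = len(values)
    buckets = min(buckets, n)
    reduced = []
    for b in range(buckets):
        # Integer arithmetic splits the series into nearly equal chunks.
        start = b * n // buckets
        end = (b + 1) * n // buckets
        chunk = values[start:end]
        reduced.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return reduced
```

Plotting the mean as a line and the min/max as a band gives a compact view in which a single extreme value, such as the 100 in `[1, 2, ..., 9, 100]`, is still visible.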
7. Modeling
Models of a domain allow analysis & manipulation
a. to discern novelty
b. representation of normal behavior
   • corporate finances from 10-K
   • ecological processes, and global change
   • metabolic models, needed to formulate an understanding of food, drug, and environmental effects on organisms.
8. Prediction
Current information technologies { databases, data-mining, digital libraries } provide only background information for decision-making
Today: decision maker
   1. copies results into a spreadsheet
   2. adds formulas to make extrapolations into the future
a. Continue model scenarios into the possible futures
   1. Investments - monetary, personnel, research, . . .
   2. Probabilities of outcomes etc.
b. Allow comparison of alternatives
Information systems should not terminate their support with the past, but should also extrapolate the results with the models used for analysis
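A minimal sketch of what the spreadsheet step does by hand: fit a model to past results and continue it into the future, then compare alternatives. A least-squares straight line stands in for "the models used for analysis"; real models would be domain-specific. The function names and the idea of comparing scenarios by a single projected value are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(xs, ys, future_x):
    """Continue the fitted model to a future point."""
    slope, intercept = fit_line(xs, ys)
    return slope * future_x + intercept


def compare_alternatives(scenarios, future_x):
    """Project each named scenario and order them by projected value.

    scenarios maps a name to a (xs, ys) pair of past observations.
    """
    projected = {name: extrapolate(xs, ys, future_x) for name, (xs, ys) in scenarios.items()}
    return sorted(projected.items(), key=lambda item: item[1], reverse=True)
```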
9. Finding Abnormal Events
• A hard challenge is discovering abnormal situations.
   – e.g., looking for terrorists.
   Note: observables are the effect of many good and a few bad scenarios
• Traditional data-mining finds frequent relationships
   – abduct the processes that generate those data
   – serves marketing folk
• Intelligence tasks seek unusual or abnormal behavior
   1. Use model based on recent incidents
      • flight-school enrollments of terrorists
   2. Create and use a reasonable, but hypothetical model
      • shipping containers can carry nuclear devices into the US
   3. Create a model of normal findings
9+. Create & exploit a normal state model
Prerequisite for finding abnormal events: abnormalities can only be identified if normality can be quantified
a. Populate an initial model with normal findings
   – Coverage: all likely causes of some observable(s)
b. Identify variation not due to known causes
   • Temporal tracking is better than static schemes
c. Increase coverage as needed - feedback to b.
d. Maintain models to recognize unexplainables
Such models will be large since observed data are the aggregate of activities from many domains, e.g., travel patterns: business, holidays, family visits, emergencies.
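A minimal sketch of steps a and b with temporal tracking: "normal" is quantified as the mean and standard deviation of a trailing window, and an observation is flagged when it deviates from that baseline by more than a chosen number of standard deviations. A rolling z-score is only one simple way to quantify normality; the window size, threshold, and the name `find_abnormal` are assumptions for illustration.

```python
from statistics import mean, stdev


def find_abnormal(series, window=5, threshold=3.0):
    """Return indexes of observations that deviate from the recent normal state.

    The normal state at each step is modeled by the preceding `window`
    observations, so the model tracks change over time instead of using
    one static baseline. window must be at least 2.
    """
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly constant baseline: any change is unexplained.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged
```

Note that a flagged value still enters later baselines here. A fuller model would explain or exclude it first, which corresponds to the feedback from step c to step b.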
Benefits
A `business model' for justifying ongoing DL research is needed [Y.T.Chien]
• A business model includes benefits and costs
1. Benefits:
   – Broad access to knowledge
   – Education of the next generation
   – Preservation of cultural heritage
   – Mutual, inter-cultural understanding, reduction of conflicts
   – Improved decision-making
2. Costs:
   – Time and money spent on information systems
      • Technology
      • Contents
   – Time spent on obtaining the information
   – Time spent on analyzing the information
   – Due to errors
Focus of prior slides
Cost of Errors -- balance
• Type 1 errors
Omitted relevant information
   • Lost opportunities
   • Unperceived risk
   • Suboptimal choices
   • Cost: f (variance)
      – High if variance is high
      – Low if variance is low
         • purchasing
• Type 2 errors
Excess irrelevant information
   • Overload
   • Inability to analyze all
   • Risk of being misled
   • Cost: delay, human
      – High if excess is high
         • human time is valuable
      – Low if precision is high
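A minimal sketch of the balance between the two error types as defined on this slide: omitted relevant items (Type 1) and excess irrelevant items (Type 2), measured for one result set. Recall and precision are the standard retrieval measures for these two effects. The per-item cost parameters and the name `error_balance` are hypothetical; the slide gives no numeric cost model.

```python
def error_balance(retrieved, relevant, cost_per_omission, cost_per_excess):
    """Measure omitted and excess items for one result set and price them.

    cost_per_omission and cost_per_excess are illustrative, user-supplied
    weights (for example, value of a lost opportunity versus reading time).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    omitted = relevant - retrieved   # Type 1 on the slide: relevant but missing
    excess = retrieved - relevant    # Type 2 on the slide: present but irrelevant
    return {
        "recall": len(hits) / len(relevant) if relevant else 1.0,
        "precision": len(hits) / len(retrieved) if retrieved else 1.0,
        "type1_cost": len(omitted) * cost_per_omission,
        "type2_cost": len(excess) * cost_per_excess,
    }
```

Tightening a filter typically lowers `type2_cost` while raising `type1_cost`, so the sum of the two is the quantity to balance.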
Exploiting Information
[Diagram labels: Data and their relationships; knowledge; Decision (?); Action 1, Action 2, Action 3; Effects]
Callout: Has not been an explicit focus of DL research. It is the point that generates benefits.
The Major Feedback Loop
[Diagram: a loop around knowledge, with these labels]
• computer scientists, will provide tools
• user exploiting communities: distilled knowledge, categorized
• user contributing communities: human knowledge, validated by data
Conclusion
Much work is left to be done with digital libraries
Exploiting the results will motivate more investment
   • In technology
   • In content breadth and depth
Customers' expectations will change
   o Global access is here
   o Heterogeneity will remain, cause errors
   o Ubiquitous access is near
Optional discussion points
• Interfaces
• Personalization
• Heterogeneity
• Computer scientists and their customers
• Data versus relationships
• Disruptive factors
New user interface settings
The new generation
• is more comfortable with screen displays
• can navigate to analyses, backup
• considers paper to be heavy and awkward
• is poor in handwriting and spelling
• is facile in brief keyboard messages
• expects simple voice command technology
Personalization, 2 models
1. Everything about an individual
   – learn all about the individual
      • slow, delayed, lags
2. An individual as a member of groups
   • learn about the likely memberships { 8th grader, ...; carpenter, ..; opera goer, ...; ...}
   • learn and assign knowledge to group
   • inherit knowledge collected in those groups
   • leads, so that the individual also benefits
Context
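A minimal sketch of the second personalization model: knowledge is attached to groups, and an individual inherits whatever has been collected for the groups they likely belong to. The dictionary-of-sets representation and the function names `assign` and `inherited_knowledge` are hypothetical; how memberships are learned is left out.

```python
def assign(group_knowledge, group, topic):
    """Learn something and attach it to a group rather than to one person."""
    group_knowledge.setdefault(group, set()).add(topic)


def inherited_knowledge(memberships, group_knowledge):
    """Knowledge an individual inherits from all of their likely groups."""
    known = set()
    for group in memberships:
        # Groups with nothing collected yet simply contribute nothing.
        known |= group_knowledge.get(group, set())
    return known
```

A new individual benefits immediately from what the groups already hold, which is the sense in which this model leads while the per-individual model lags.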
Interoperation/Interoperability
• Heterogeneity is a fact, and attempts at enforcing consistency are misguided
• Natural consistency will be an outcome of collaboration
Data and their relationships
• Data are verifiable first-order objects
   – observable
   – automatic acquisition is common
• Relationships are also first-order objects
   – defined by metadata in context { schemas, references, dependencies, is-a, causality, ... }
   – Hard to discover
   – Instances verifiable in contexts
   – Needed for exploitation
Customers and Computer Scientists
• Mutual arrogance fed by misunderstandings
• Differing scientific paradigms
   – Mathematical: formal, definite
   – Social, biological: case-based, indefinite
Disruptive factors [June 03 NSF meet]
• Technologies
   – ubiquitous access
   – community empowerment
      • data & semantics contribution
   – Machine translation of modest quality
• Sociological
   – Imposed privacy constraints
   – TIA reactions national/international
   – Commercial pressures
      • skimming the cream
Roadblocks [Y.T. Chien]
• lack of a business model
• matching technology to user needs
• define a research pipeline [NAS HPCC report]