Integrating Heterogeneous in situ Information using SPARCE
Sudarshan Murthy
CSE 606 INI: Fall 2003
This work is supported by US NSF Grant IIS 0086002.
Apr 22, 2023 Integrating Heterogeneous in situ Information using SPARCE
Observations
• People often superimpose new interpretations onto existing information (from heterogeneous sources)
• They excerpt information and create annotations
• They integrate existing information and new interpretations
– Prepare many arrangements of the same information
– Organize using appropriate models and schemas (possibly different from any of the sources)
Goal
Facilitate integration of heterogeneous in situ information, of varying granularity, with minimal mediation, using superimposed information to enhance base information, given one superimposed information model and schema (possibly different from any base information model and schema).
Benefits
• Likely discover information not completely contained in base sources
– Information cannot always be obtained by a query distributed over base sources
• Exploit human expertise
– Annotations and relationships created by humans can be valuable
• Minimize volume of base data mediated
– We only retrieve selected information
Outline
• Goal
• Background
– Superimposed information management, SPARCE
• Information integration example
• Proposal
• Future work
• Conclusion
What is Superimposed Information?
Data placed over existing information sources to help organize, access, connect, and reuse information elements in those sources. [Maier 1999, Delcambre 2001]
[Figure: a superimposed layer above a base layer of information sources 1, 2, …, n; marks connect the superimposed layer to the base layer]
Marks
• A Mark is a reference to a base-layer element [Delcambre 2001]
– Several mark implementations exist
– Addressing scheme usually depends on the base type
– PDF mark uses page no., and starting and ending word indexes; MS Word mark uses starting and ending character indexes
• Marks provide uniform interface across base-layer types and access protocols
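The idea of base-type-specific addressing behind one uniform interface can be sketched as follows. This is only an illustration, not SPARCE's actual API: the class names, the `address()` method, the address string formats, and the sample file names are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class Mark(ABC):
    """Reference to a base-layer element (hypothetical interface)."""

    @abstractmethod
    def address(self) -> str:
        """Return a base-type-specific address as a string."""


@dataclass(frozen=True)
class PdfMark(Mark):
    # A PDF mark addresses a page plus starting and ending word indexes.
    document: str
    page: int
    start_word: int
    end_word: int

    def address(self) -> str:
        return f"{self.document}#page={self.page};words={self.start_word}-{self.end_word}"


@dataclass(frozen=True)
class WordMark(Mark):
    # An MS Word mark addresses starting and ending character indexes.
    document: str
    start_char: int
    end_char: int

    def address(self) -> str:
        return f"{self.document}#chars={self.start_char}-{self.end_char}"


# Callers treat every mark the same way, whatever the base type.
marks = [PdfMark("garlic.pdf", 1, 10, 42), WordMark("notes.doc", 120, 310)]
addresses = [m.address() for m in marks]
```

A superimposed application written against `Mark` alone would not need to change when a new base type is plugged in.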
Excerpts and Contexts
• Excerpt is the content of a marked region
– Type of an excerpt varies: text, graphics, …
• Context is information related to a mark
• Context element is one piece of context
– Section heading, containing paragraph text, and font name are examples
• Many kinds of context elements exist
– Content, Presentation, Location, Topology, …
• Context definition varies across and within base types
Example Context
Name             Value
Excerpt          Garlic permits traditional and multi-media data to be stored in a variety of existing data repositories, including databases, files, text managers, …
Font name        Times New Roman
Italics          True
Section Heading  Abstract:
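One way to hold a context like the one above in memory is a list of named elements, each tagged with its kind. The sketch below is an assumption-laden illustration: the `ContextElement` type and the `elements_of_kind` helper are hypothetical, and the kind assigned to each element is a guess, since the slide does not say which kind each element belongs to.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ContextElement:
    kind: str   # "Content", "Presentation", ...; the assignments below are assumed
    name: str
    value: Any


def elements_of_kind(context, kind):
    """Return a name -> value mapping for the elements of one kind."""
    return {e.name: e.value for e in context if e.kind == kind}


# Values taken from the example context; the excerpt is shortened here.
context = [
    ContextElement("Content", "Excerpt", "Garlic permits traditional and multi-media data ..."),
    ContextElement("Presentation", "Font name", "Times New Roman"),
    ContextElement("Presentation", "Italics", True),
    ContextElement("Topology", "Section Heading", "Abstract:"),
]

presentation = elements_of_kind(context, "Presentation")
```

Grouping by kind lets an application ask only for the part of the context it needs, for example presentation details when rendering an excerpt.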
Superimposed Applications
• These are applications that manipulate superimposed information
• They associate marks and context elements with superimposed information elements
• They are free to choose display and data models based on their needs
• A user can activate a mark to navigate to base layer or examine context without expressly navigating to base layer
SPARCE
• SPARCE: Superimposed Pluggable Architecture for Contexts and Excerpts
– Middleware for superimposed information management
• Address base information regardless of its type, location, and access protocol
• Retrieve excerpts and contexts
– Use the same programmatic interface to work with any base type
• View excerpts and contexts side by side with superimposed information
Overview
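[Figure: architecture overview. Superimposed layer: SA 1 (XML) and SA 2 (Relations). Middleware: SPARCE, holding marks. Base layer: Word and Acrobat.]
Marks are stored in the form:
<mark ID="…"> <type>…</type> <address>…</address> … </mark>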
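A mark record with the `mark`, `type`, and `address` elements named in the overview could be written out along these lines. This is a minimal sketch using Python's standard library rather than SPARCE's own serializer; the function name and the sample ID, type, and address values are invented for illustration, and the elided parts of the real format are left out.

```python
import xml.etree.ElementTree as ET


def mark_to_xml(mark_id, base_type, address):
    """Serialize one mark using the element names shown on the slide."""
    mark = ET.Element("mark", {"ID": mark_id})
    ET.SubElement(mark, "type").text = base_type
    ET.SubElement(mark, "address").text = address
    return ET.tostring(mark, encoding="unicode")


# Sample values are hypothetical.
xml_text = mark_to_xml("m1", "PDF", "page=3;words=10-42")

# Reading the record back recovers the same fields.
parsed = ET.fromstring(xml_text)
```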
Information Integration Example
Setup
• SPARCE extended for information integration
– XML serialization introduced
– Pluggable context transformers infrastructure added
– A query interface developed
• RIDPad extended for information integration
– Annotations, XML serialization (and DOM) added
• Information models supported
– Object model (COM)
– XML (DOM and serialized)
• Example uses XML
Input
• Five items (two groups)
• An item contains a label and a comment
• Five base documents (all PDF; heterogeneous?)
• Granularity of marks varies
Generating XML Data (1)
Generating XML Data (2)
XML Data Generated
[Figure callouts: XML for the two groups; Mark; Context]
Querying*
For each item, get text content from the context (of its mark)
* Currently supports XSLT and XPath; XQuery coming soon
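The query above can be approximated with the limited XPath support in Python's standard library. This is a sketch, not the talk's query interface: the `Group`, `Item`, `Mark`, and `Context` element names follow the slides, but the `Text` child element, the group name, the second item, and all text values are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialized superimposed information with nested marks and contexts.
DOC = """
<Group name="Mediators">
  <Item name="CLIO">
    <Mark id="m1">
      <Context><Text>Sample text near the CLIO mark</Text></Context>
    </Mark>
  </Item>
  <Item name="Garlic">
    <Mark id="m2">
      <Context><Text>Sample text near the Garlic mark</Text></Context>
    </Mark>
  </Item>
</Group>
"""


def item_texts(xml_text):
    """For each item, get the text content from the context of its mark."""
    root = ET.fromstring(xml_text.strip())
    return {
        item.get("name"): item.findtext("./Mark/Context/Text")
        for item in root.iter("Item")
    }


texts = item_texts(DOC)
```

The same selection in full XPath would be a path such as `//Item/Mark/Context/Text`, evaluated once per item.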
This system isn’t very smart.
Preserve the Layers
<Item name='CLIO'>
  <Mark id='…'>
    <Context …>
    </Context>
  </Mark>
</Item>

<Group name='…'>
  <Item name='…'>
  </Item>
</Group>
-------------------------------------
<Mark id='…'>…</Mark>
<Mark id='…'>…</Mark>
-------------------------------------
<Context …>…</Context>
<Context …>…</Context>
Why Preserve the Layers
• The information sources are different
– SI: Superimposed application
– Marks: SPARCE
– Contexts: Base applications (via context agents)
• A hierarchy is inefficient and unnecessary
– Mark and context information is replicated
– Context can be large (broad)
– Joins can provide the same result
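The claim that joins can provide the same result as a hierarchy can be illustrated with three flat collections joined on a mark identifier. The sketch below assumes a `mark_id` key links the layers; that key, the dictionary layout, and every sample value other than the CLIO item name and the "Abstract:" heading are invented for the example.

```python
# Three separate layers; each mark and each context is stored once.
items = [
    {"name": "CLIO", "mark_id": "m1"},
    {"name": "CLIO again", "mark_id": "m1"},   # second item reusing the same mark
    {"name": "Garlic", "mark_id": "m2"},
]
marks = {
    "m1": {"type": "PDF", "address": "page=1;words=5-30"},
    "m2": {"type": "PDF", "address": "page=2;words=1-12"},
}
contexts = {
    "m1": {"Section Heading": "Abstract:"},
    "m2": {"Section Heading": "Introduction"},
}


def join_layers(items, marks, contexts):
    """Join items to marks and contexts on mark_id, without copying either."""
    rows = []
    for item in items:
        mark_id = item["mark_id"]
        if mark_id not in marks:
            continue  # skip items whose mark is unknown
        rows.append({
            "item": item["name"],
            "mark": marks[mark_id],
            "context": contexts.get(mark_id, {}),
        })
    return rows


rows = join_layers(items, marks, contexts)
```

Two items that share a mark end up pointing at the same stored mark and context, whereas a nested serialization would repeat both under each item.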
Start with the Query
• Figure out what information is in scope
– Only some superimposed information elements might qualify
– Only some marks might qualify
– Only some context elements might be needed
• Minimize the amount of information retrieved
– Push “selects” down and distribute “selects”
• Helped by preserving the layers
• Enables parallel and distributed query execution
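Pushing selects down might look like the following: filter the superimposed layer first, then ask the base layer only for the marks and context elements that survive. This is a rough sketch of the idea, not the proposed bi-level query system; `fetch_context`, the `BASE` stand-in for base applications, and all data values are hypothetical.

```python
# Stand-in for base applications; contents are made up.
BASE = {
    "m1": {"Text": "text one", "Font name": "Times New Roman"},
    "m2": {"Text": "text two", "Font name": "Arial"},
    "m3": {"Text": "text three", "Font name": "Courier"},
}
fetched = []  # log of (mark id, requested element names)


def fetch_context(mark_id, wanted):
    """Retrieve only the requested context elements for one mark."""
    fetched.append((mark_id, tuple(wanted)))
    return {name: BASE[mark_id][name] for name in wanted}


ITEMS = [
    {"name": "A", "group": "g1", "mark_id": "m1"},
    {"name": "B", "group": "g2", "mark_id": "m2"},
    {"name": "C", "group": "g1", "mark_id": "m3"},
]


def query(items, group, wanted):
    # Select in the superimposed layer first ...
    selected = [i for i in items if i["group"] == group]
    # ... so the base layer is contacted only for qualifying marks and elements.
    return {i["name"]: fetch_context(i["mark_id"], wanted) for i in selected}


result = query(ITEMS, "g1", ["Text"])
```

Because each `fetch_context` call is independent, the per-mark retrievals could in principle run in parallel or against different base applications.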
Exploit Relationships
• Relationships in superimposed layer can have many benefits
– Improve recall (for user)
– Alternative execution plans (for query processor)
• XML has no native support for relationships
– Can be implemented using XPointer, XLink, etc.
Future Work
• Test proposal
• Bi-level query system
– Develop example queries
– Model the system, build it, test it
• Support other information models
– RDF should be easy, relational might not be
– Support for new models can be added without affecting existing implementations
– Sun’s “No Recompile” guarantee for superimposed applications
Some restrictions may apply
Conclusion
• Enhancing base information with superimposed information makes possible new queries over base information
• Heterogeneous in situ base information can be integrated and queried using SPARCE
• The naïve XML implementation makes a good straw man
• If this stuff holds water, a bi-level query system may be in my future
Questions?
Ask me about a demo