Managing a Space of Heterogeneous Data
![Page 1: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/1.jpg)
Managing a Space of Heterogeneous Data
Xin (Luna) Dong, University of Washington
March, 2007
![Page 2: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/2.jpg)
Once upon a time…
![Page 3: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/3.jpg)
Nowadays…
D1
D2
D3
D4
D5
![Page 4: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/4.jpg)
Mappings Between Heterogeneous Data Sources
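MovieDVD

| Name | Length | Status | Price | Rate |
|---|---|---|---|---|
| The Departed | 151 mins | In stock | $34.99 | Excellent |
| … | … | … | … | … |

Movie

| ID | Title | Year | Genre | Runtime | Director |
|---|---|---|---|---|---|
| 15827 | The Departed | 2006 | Crime | 151 min | 32468 |

Director

| DirectorID | Name |
|---|---|
| 32468 | Martin Scorsese |

Review

| MovieID | Review |
|---|---|
| 15827 | Martin Scorsese Hits the Streets Again! |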
![Page 5: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/5.jpg)
Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front
D1
D2
D3
D4
D5
Mediated Schema
[Figure: query labels Q and Q1–Q5]
![Page 6: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/6.jpg)
In Many Applications it is Hard to Obtain Precise Semantic Mappings
D1
D2
D3
D4
D5?
![Page 7: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/7.jpg)
Scenario 1. Different Websites About Movies
![Page 8: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/8.jpg)
Intranet, Internet
Scenario 2. Personal Information Space
![Page 9: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/9.jpg)
In Many Applications it is Hard to Obtain Precise Semantic Mappings
D1
D2
D3
D4
D5
Mediated Schema
Q
![Page 10: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/10.jpg)
Managing Dataspaces
Dataspaces [Halevy et al., PODS’06]
• Collections of heterogeneous data sources
• Do not necessarily include semantic mappings
• Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
My goal: Provide quality search, querying and browsing as the system evolves
![Page 11: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/11.jpg)
Heterogeneity at Different Levels
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
![Page 12: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/12.jpg)
Heterogeneity at Instance Level
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• The same real-world object can be referred to using different attribute values
Current work
• Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])
Contributions
• Reference reconciliation: reconcile instances of multiple classes with only limited attributes [Sigmod’05]
![Page 13: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/13.jpg)
Heterogeneity at Schema Level
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• The same domain can be described using different schemas
• Data can be (semi-)structured or unstructured
Current work
• Schema matching (surveyed in [Rahm&Bernstein, 2001])
• Query reformulation (surveyed in [Halevy 2000])
Contributions
• Probabilistic schema mapping [VLDB’07]
• Visualizing heterogeneous data [InfoVis’07]
![Page 14: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/14.jpg)
Heterogeneity at Query Level
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• Different terms and different levels of structural detail
• Keyword search: ‘Semex Dong’
• Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)
![Page 15: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/15.jpg)
Heterogeneity at Query Level
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Form of heterogeneity
• Different terms and different levels of structural detail
• Keyword search: ‘Semex Dong’
• Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)
Current work
• Keyword search on databases (Discover, DBExplorer, etc.)
Contributions
• Seamless querying of structured and unstructured data
• Indexing heterogeneous data [Sigmod’07]
• Answering structured queries on unstructured data [WebDB’06]
![Page 16: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/16.jpg)
Outline
• Problem definition and goals
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  ◦ Reference reconciliation [Sigmod 2005]
  ◦ Indexing heterogeneous data [Sigmod 2007]
  ◦ Answering structured queries on unstructured data [WebDB 2006]
  ◦ Probabilistic schema mapping [VLDB 2007]
  ◦ Visualizing heterogeneous data [InfoVis 2007]
• Future research directions
![Page 17: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/17.jpg)
[Figure: association graph with edge labels OriginatedFrom, PublishedIn, ConfHomePage, ExperimentOf, ArticleAbout, BudgetOf, CourseGradeIn, AddressOf, Cites, CoAuthor, FrequentEmailer, HomePage, Sender, EarlyVersion, Recipient, AttachedTo, PresentationFor, ComeFrom]
Semex Generates a Logical View of Meaningful Objects and Associations
![Page 18: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/18.jpg)
Semex Provides Association Browsing of One’s Personal Information
Names
Emails
Alon. Y. Levy
![Page 19: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/19.jpg)
Semex Provides Association Browsing of One’s Personal Information A Platform for Personal Information
Management and Integration
Title
Year
![Page 20: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/20.jpg)
Semex Provides Association Browsing of One’s Personal Information
CIDR
![Page 21: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/21.jpg)
Semex Provides Association Browsing of One’s Personal Information
Trio: A System for Integrated Management of Data, Accuracy, and Lineage
![Page 22: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/22.jpg)
Question 1: Which emails has my advisor sent me about my thesis?
[email protected]@[email protected]@transformic.com
![Page 23: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/23.jpg)
Question 2: Who have been working on schema matching?
Search ‘Schema Matching’
6 Messages, 67 Articles
31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan)
![Page 24: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/24.jpg)
Question 3: Which of my friends published in Sigmod 2007?
My friends who published papers in
Sigmod 2007
![Page 25: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/25.jpg)
Semex Architecture
[Figure: architecture diagram. Modules: Data Integration Module, Schema Management Module, Data Analysis Module. Components: Domain Model, Domain Manager, Reference Reconciliater, Association DB, Objects, Associations, Extractors, Integrator, Indexer, Index, Searcher, Browser, Analyzer. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB.]
![Page 26: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/26.jpg)
Outline
• Problem definition and our principle
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  ◦ Reference reconciliation [Sigmod 2005]
  ◦ Indexing heterogeneous data [Sigmod 2007]
  ◦ Answering structured queries on unstructured data [WebDB 2006]
  ◦ Probabilistic schema mapping [VLDB 2007]
  ◦ Visualizing heterogeneous data [InfoVis 2007]
• Future research directions
![Page 27: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/27.jpg)
Heterogeneity at Different Levels
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Instance level
• Reference Reconciliation [Sigmod’05]
Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]
Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]
![Page 28: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/28.jpg)
Reference Reconciliation is Crucial in Dataspaces
Xin (Luna) Dong
xin dong
xinluna dong
luna
dongxin
x. dong
Lab-#dong xin
dong xin luna
Names
Emails
![Page 29: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/29.jpg)
Previous Approaches
A very active area of research in databases, data mining and AI
Most current approaches assume matching tuples from a single database table
• Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
• New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
Harder for a complex information space
![Page 30: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/30.jpg)
Challenges for a Complex Information Space
Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)
Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
![Page 31: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/31.jpg)
Challenges for a Complex Information Space
Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)
Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
p7=(“Eugene Wong”, “[email protected]”)
p8=(null, “[email protected]”)
p9=(“mike”, “[email protected]”)
Challenges:
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes
![Page 32: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/32.jpg)
Intuition: Exploit Association Network
We extract from dataspaces networks of instances and associations between the instances
Key: exploit the network, specifically, the clues hidden in the associations
![Page 33: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/33.jpg)
Strategy I. Exploiting Richer Evidence
Cross-attribute similarity: Name&email
• p5=(“Stonebraker, M.”, null)
• p8=(null, “[email protected]”)
Context Information I: Contact list
• p5=(“Stonebraker, M.”, null, {p4, p6})
• p8=(null, “[email protected]”, {p7})
• p6=p7
Context Information II: Authored articles
• p2=(“Michael Stonebraker”, null)
• p5=(“Stonebraker, M.”, null)
• p2 and p5 authored the same article
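Cross-attribute evidence can be illustrated with a small sketch. The heuristic below is an assumption made for illustration only, not the similarity function from the Sigmod’05 work: it treats an email as evidence for a name when the address's local part contains a name token or equals the name's initials. The addresses are hypothetical, since the ones on the slides are redacted.

```python
import re

def name_tokens(name):
    """Lowercase alphabetic tokens, e.g. 'Stonebraker, M.' -> ['stonebraker', 'm']."""
    return re.findall(r"[a-z]+", name.lower())

def name_email_evidence(name, email):
    """Return True if the email's local part looks derived from the name.

    Hypothetical heuristic: the local part contains a name token of
    length >= 3, or it equals the initials of the name tokens.
    """
    local = email.split("@", 1)[0].lower()
    tokens = name_tokens(name)
    if any(len(t) >= 3 and t in local for t in tokens):
        return True
    initials = "".join(t[0] for t in tokens)
    return len(initials) >= 2 and local == initials

# Hypothetical addresses standing in for the redacted ones.
print(name_email_evidence("Stonebraker, M.", "stonebraker@example.edu"))
print(name_email_evidence("Wong, E.", "stonebraker@example.edu"))
```

A real system would combine such a signal with the contact-list and co-authorship evidence listed above rather than rely on it alone.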
![Page 34: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/34.jpg)
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Chart: #(Person Partitions) by Evidence (x-axis ticks 1 to 4). Value labels: 3159, 1409, 1750.]
Person references: 24076; real-world persons (gold standard): 1750
![Page 35: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/35.jpg)
Considering Richer Evidence Improves the Result
[Chart: #(Person Partitions) by Evidence. Attr-wise: 3159; Name&Email: 2169; Article: 2169; Contact: 2096. Labeled gaps: 1409, 346.]
Person references: 24076; real-world persons: 1750
![Page 36: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/36.jpg)
Strategy II. Propagate Information Between Reconciliation Decisions
Article:
a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)
Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
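The propagation idea can be sketched on the example above: once two article references are reconciled, that decision becomes evidence for reconciling their authors and venues. The matching rules here (case-insensitive title plus pages, equal last names, equal years) and the union-find bookkeeping are simplifying assumptions for illustration, not the algorithm from the Sigmod’05 paper.

```python
def last_name(name):
    """'Robert S. Epstein' -> 'epstein'; 'Epstein, R.S.' -> 'epstein'."""
    if "," in name:
        return name.split(",")[0].strip().lower()
    return name.split()[-1].lower()

class UnionFind:
    """Tracks which references have been reconciled into one object."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

persons = {
    "p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
    "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E.",
}
venues = {
    "c1": ("ACM Conference on Management of Data", "1978"),
    "c2": ("ACM SIGMOD", "1978"),
}
articles = {
    "a1": ("Distributed Query Processing", "169-180", ["p1", "p2", "p3"], "c1"),
    "a2": ("Distributed query processing", "169-180", ["p4", "p5", "p6"], "c2"),
}

def reconcile(articles, persons, venues):
    uf = UnionFind()
    ids = sorted(articles)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            title_a, pages_a, authors_a, venue_a = articles[a]
            title_b, pages_b, authors_b, venue_b = articles[b]
            if title_a.lower() == title_b.lower() and pages_a == pages_b:
                uf.union(a, b)
                # Propagate: authors of the same article with equal last names.
                for x in authors_a:
                    for y in authors_b:
                        if last_name(persons[x]) == last_name(persons[y]):
                            uf.union(x, y)
                # Propagate: venues of the same article held in the same year.
                if venues[venue_a][1] == venues[venue_b][1]:
                    uf.union(venue_a, venue_b)
    return uf

uf = reconcile(articles, persons, venues)
print(uf.find("p2") == uf.find("p5"), uf.find("c1") == uf.find("c2"))
```

Without the article match, "ACM Conference on Management of Data" and "ACM SIGMOD" share too little text to be merged on attribute similarity alone, which is the point of propagating decisions.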
![Page 37: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/37.jpg)
Propagating Information Between Reconciliation Decisions Further Improves the Result
[Chart: #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact). Traditional: 3159, 2169, 2169, 2096. Propagation: 3159, 2146, 2135, 2022. Labeled gaps: 1409, 346, 272.]
Person references: 24076; real-world persons: 1750
![Page 38: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/38.jpg)
Strategy III. Reference Enrichment
p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “[email protected]”, {p7})
p9=(“mike”, “[email protected]”, null)
p8-9=(“mike”, “[email protected]”, {p7})
References Enrichment Improves the Result More than Information Propagation
[Chart: #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact). Legend: Traditional, Enrichment, Propagation. Value labels: 3159, 2169, 2169, 2096 and 3169, 2036, 2036, 1910. Labeled gaps: 1409, 346, 160.]
Person references: 24076; real-world persons: 1750
![Page 40: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/40.jpg)
Applying Both Information Propagation and Reference Enrichment Gets the Best Result
[Chart: #(Person Partitions) by Evidence (Attr-wise, Name&Email, Article, Contact). Legend: Traditional, Enrichment, Propagation, Full. Value labels: 3159, 2169, 2169, 2096 and 3169, 2002, 1990, 1873. Labeled gaps: 1409, 346, 125.]
Person references: 24076; real-world persons: 1750
![Page 41: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/41.jpg)
Experiment Settings
Data sets: four personal data sets
Use the same parameters and thresholds for all data sets
Measures

$$\text{Precision} = \frac{\#(\text{correctly reconciled reference pairs})}{\#(\text{reconciled reference pairs})}$$

$$\text{Recall} = \frac{\#(\text{correctly reconciled reference pairs})}{\#(\text{reference pairs that refer to the same real-world object})}$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
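The pair-based measures can be computed directly from a predicted partition and a gold-standard partition of the references. The sketch below assumes both are given as lists of sets of reference ids; the tiny example data is made up for illustration.

```python
from itertools import combinations

def pairs(partition):
    """All unordered reference pairs that share a cluster."""
    return {
        frozenset(pair)
        for cluster in partition
        for pair in combinations(sorted(cluster), 2)
    }

def precision_recall_f(predicted, gold):
    pred, true = pairs(predicted), pairs(gold)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(true) if true else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Made-up partitions: the prediction wrongly adds p3 and misses cross-cluster links.
predicted = [{"p2", "p5"}, {"p8", "p9", "p3"}]
gold = [{"p2", "p5", "p8", "p9"}, {"p3"}]
print(precision_recall_f(predicted, gold))
```

Here 2 of the 4 predicted pairs are correct and 2 of the 6 gold pairs are found, so precision is 1/2 and recall is 1/3.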
![Page 42: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/42.jpg)
Precision and Recall Increase Largely Compared with Attr-wise Matching
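| Dataset | Attr-wise Matching: Precision | Recall | F | Association Network: Precision | Recall | F |
|---|---|---|---|---|---|---|
| A | 0.995 | 0.509 | 0.673 | 0.982 | 0.947 | 0.964 |
| B | 0.81 | 0.803 | 0.806 | 0.958 | 0.891 | 0.923 |
| C | 0.987 | 0.782 | 0.873 | 0.814 | 0.925 | 0.867 |
| D | 0.694 | 0.837 | 0.759 | 0.942 | 0.737 | 0.827 |
| Avg | 0.872 | 0.733 | 0.778 | 0.924 | 0.875 | 0.895 |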
![Page 43: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/43.jpg)
Heterogeneity at Different Levels
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Instance level
• Reference Reconciliation [Sigmod’05]
Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]
Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]
![Page 44: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/44.jpg)
Seamless Querying of Structured and Unstructured Data
Structured Data & Semi-structured Data
Unstructured Data
RDF
webpages
XML
Doc
RDB
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword Search “dataspaces”
![Page 45: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/45.jpg)
I. Answering Structured Queries on Unstructured Data
Structured Data & Semi-structured Data
Unstructured Data
RDF
webpages
XML
Doc
RDB
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword Search “dataspaces”
[Figure labels: DB, DB, IR, ?]
Our approach: query translation
• Transform a structured query into keyword search
• Keyword search on unstructured data
![Page 46: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/46.jpg)
Challenges
Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword query: select title from paper where title LIKE +dataspaces and year +2005
Top-10 precision: 0
![Page 47: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/47.jpg)
Challenges
Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword query: title paper title +dataspaces year +2005
Top-10 precision: 0
![Page 48: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/48.jpg)
Challenges
Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword query: +dataspaces +2005
Top-10 precision: 0.2
![Page 49: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/49.jpg)
Challenges
Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword query: +dataspaces +2005 paper title
Top-10 precision: 0.2
![Page 50: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/50.jpg)
Challenges
Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword query: +dataspaces +2005 paper
Top-10 precision: 0.6
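The best-performing translation in this example keeps the constants from the WHERE clause as required terms and adds the table name as an optional term, dropping SQL keywords and attribute names. A minimal sketch of that shape follows; the function name and the pre-parsed input format are assumptions, and the actual WebDB’06 method for choosing keywords is not described on these slides.

```python
def to_keyword_query(table, predicates):
    """Build a keyword query from a table name and (attribute, constant) pairs.

    Constants become required terms (prefixed with '+'); the table name is
    appended as an optional term. LIKE wildcards ('%') are stripped.
    """
    required = ["+" + str(value).strip("%").lower() for _, value in predicates]
    return " ".join(required + [table.lower()])

query = to_keyword_query("paper", [("title", "%Dataspaces%"), ("year", "2005")])
print(query)
```

For the running example this yields `+dataspaces +2005 paper`, the variant reported above with top-10 precision 0.6.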
![Page 51: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/51.jpg)
II. Answering Queries that Combine Keywords and Structural Information
Structured Data & Semi-structured Data
Unstructured Data
RDF
webpages
XML
Doc
RDB
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword Search “dataspaces”
![Page 52: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/52.jpg)
II. Answering Queries that Combine Keywords and Structural Information
Structured Data & Semi-structured Data
Unstructured Data
RDF
webpages
XML
Doc
RDB
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)
Keyword Search “dataspaces”
![Page 53: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/53.jpg)
Neighborhood Keyword Queries: Return Implicitly Relevant Instances in Answers to Keyword Queries
Search ‘Schema Matching’
6 Messages, 67 Articles
31 Persons working on Schema Matching (e.g., Jeff Naughton, Anhai Doan, Phil Bernstein, Renee Miller)
![Page 54: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/54.jpg)
Predicate Queries: Queries that Combine Keywords and Simple Structural Requirements
Message (Sender “Halevy”) (Recipient “Luna”) (Subject “thesis”)
![Page 55: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/55.jpg)
II. Answering Queries that Combine Keywords and Structural Information
Structured Data & Semi-structured Data
Unstructured Data
RDF
webpages
XML
Doc
RDB
Structured Queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’
Keyword-based Structure-aware Queries: Article (title “dataspaces”) (year “2005”)
Keyword Search “dataspaces”
![Page 56: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/56.jpg)
Indexing Heterogeneous Data
Challenges
• Index data from heterogeneous data sources
• Capture both text values and structural information
Traditional Indexes
• Build a separate index for each attribute to support structured queries
• Build an inverted list to support keyword search
• XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
![Page 57: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/57.jpg)
Index Heterogeneous Data Using an Inverted List
Desktop
Alon Halevy
Luna Dong
Semex: …authoredPaper
author
authoredPaper
author
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Alon
Dong
Halevy
Luna
Semex
Xin
Inverted List
![Page 58: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/58.jpg)
Desktop
Index Heterogeneous Data Using an Inverted List
Alon Halevy
Semex: …authoredPaper
author
authoredPaper
author
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1
Inverted List
Luna Dong
Query: Dong
![Page 59: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/59.jpg)
Desktop
Incorporate Attribute Labels in the Inverted List
Alon Halevy
Semex: …authoredPaper
author
authoredPaper
author
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Alon 1
Dong 1 1
Halevy 1
Luna 1
Semex 1
Xin 1
Inverted List
Luna Dong
Query: firstName “Dong”
![Page 60: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/60.jpg)
Desktop
Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong”
Alon Halevy
Semex: …authoredPaper
author
authoredPaper
author
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1
Inverted List
Luna Dong
Query: firstName “Dong” → “Dong/firstName/”
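A minimal sketch of an inverted list whose keys combine a keyword with its attribute label, following the `keyword/attribute/` key shape shown above. The instance ids (`person1`, `student1`, and so on) and the dictionary-based storage are assumptions for illustration; the Sigmod’07 implementation is not shown on the slides.

```python
from collections import defaultdict

def build_index(instances):
    """Map 'keyword/attribute/' keys to the ids of instances containing them.

    instances: {instance_id: {attribute: text value}}
    """
    index = defaultdict(set)
    for inst_id, attrs in instances.items():
        for attr, value in attrs.items():
            for word in value.split():
                index[f"{word}/{attr}/"].add(inst_id)
    return index

# Hypothetical ids for the desktop objects and the departmental database row.
instances = {
    "person1": {"name": "Alon Halevy"},
    "person2": {"name": "Luna Dong"},
    "paper1": {"title": "Semex"},
    "student1": {"lastName": "Xin", "firstName": "Dong"},
}
index = build_index(instances)

# The predicate query firstName "Dong" becomes a lookup of one key.
print(index["Dong/firstName/"])
```

A plain keyword lookup for "Dong" would return both `person2` and `student1`; the labeled key narrows the answer to the instance whose firstName is "Dong".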
![Page 61: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/61.jpg)
Desktop
Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
Alon Halevy
Semex: …authoredPaper
author
authoredPaper
author
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Alon/name/ 1
Dong/name/ 1
Dong/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/lastName/ 1
Inverted List
Luna Dong
![Page 62: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/62.jpg)
Desktop
Incorporate Attribute Hierarchy in the Inverted List
Query: name “Dong”
Alon Halevy
Semex: …authoredPaper
author
authoredPaper
author
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1
Inverted List
Luna Dong
Query: name “Dong” → “Dong/name/*”
name
firstName lastName
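With the attribute hierarchy encoded in the key, a query on a parent attribute turns into a prefix lookup. The sketch below assumes a child-to-parent map for the hierarchy and uses a linear scan for the prefix match; a sorted key list would make the prefix search faster. Names such as `PARENTS` and the instance ids are hypothetical.

```python
PARENTS = {"firstName": "name", "lastName": "name"}  # child attribute -> parent

def attribute_path(attr, parents):
    """'firstName' -> 'name/firstName'; attributes without a parent are unchanged."""
    path = [attr]
    while path[0] in parents:
        path.insert(0, parents[path[0]])
    return "/".join(path)

def build_index(instances, parents):
    index = {}
    for inst_id, attrs in instances.items():
        for attr, value in attrs.items():
            for word in value.split():
                key = f"{word}/{attribute_path(attr, parents)}/"
                index.setdefault(key, set()).add(inst_id)
    return index

def prefix_lookup(index, keyword, attr_path):
    """Answer 'attr_path keyword' by matching every key with prefix 'keyword/attr_path/'."""
    prefix = f"{keyword}/{attr_path}/"
    hits = set()
    for key, ids in index.items():
        if key.startswith(prefix):
            hits |= ids
    return hits

instances = {
    "person2": {"name": "Luna Dong"},
    "student1": {"lastName": "Xin", "firstName": "Dong"},
}
index = build_index(instances, PARENTS)
print(prefix_lookup(index, "Dong", "name"))
```

The query name "Dong" now matches both `Dong/name/` and `Dong/name/firstName/`, so instances that store the value under a sub-attribute are returned as well.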
![Page 63: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/63.jpg)
Desktop
Incorporate Association Labels in the Inverted List
Query: author “Dong”
Alon Halevy
authoredPaper
author
authoredPaper
StuID lastName firstName …
1000001 Xin Dong …
… … … …
Departmental Database
Inverted List
Luna Dong
Semex: …
Alon/name/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/title/ 1
Xin/name/lastName/ 1
author
![Page 64: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/64.jpg)
Desktop
Incorporate Association Labels in the Inverted List
Alon Halevy
authoredPaper
author
authoredPaper
author
StuID LastName FirstName …
1000001 Xin Dong …
… … … …
Departmental Database
Inverted List
Luna Dong
Semex: …
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1
Query: author “Dong” → “Dong/author/*”
![Page 65: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/65.jpg)
Desktop
Answering Neighborhood Keyword Queries
Alon Halevy
authoredPaper
author
authoredPaper
author
StuID LastName FirstName …
1000001 Xin Dong …
… … … …
Departmental Database
Inverted List
Luna Dong
Semex: …
Alon/author/ 1
Alon/name/ 1
Dong/author/ 1
Dong/name/ 1
Dong/name/firstName/ 1
Halevy/name/ 1
Luna/name/ 1
Semex/authoredPaper/ 1 1
Semex/title/ 1
Xin/name/LastName/ 1
Query: Semex → “Semex/*”
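Association labels can be added to the same inverted list: for an association such as "paper1 has author person2", the keywords of the target instance are indexed under the association label and point back to the source instance. A neighborhood keyword query is then a prefix lookup on the keyword alone. The triple format and instance ids below are assumptions for illustration.

```python
def build_index(attributes, associations):
    """attributes: {id: {attr: text}}.
    associations: (source_id, label, target_id) triples, read as
    'target is the <label> of source', e.g. ('paper1', 'author', 'person2').
    """
    index = {}

    def add(key, inst_id):
        index.setdefault(key, set()).add(inst_id)

    for inst_id, attrs in attributes.items():
        for attr, value in attrs.items():
            for word in value.split():
                add(f"{word}/{attr}/", inst_id)
    for source, label, target in associations:
        for value in attributes[target].values():
            for word in value.split():
                add(f"{word}/{label}/", source)
    return index

def neighborhood(index, keyword):
    """Instances containing the keyword or associated with an instance that does."""
    prefix = keyword + "/"
    hits = set()
    for key, ids in index.items():
        if key.startswith(prefix):
            hits |= ids
    return hits

attributes = {
    "person1": {"name": "Alon Halevy"},
    "person2": {"name": "Luna Dong"},
    "paper1": {"title": "Semex"},
}
associations = [
    ("paper1", "author", "person1"),
    ("paper1", "author", "person2"),
    ("person1", "authoredPaper", "paper1"),
    ("person2", "authoredPaper", "paper1"),
]
index = build_index(attributes, associations)
print(index["Dong/author/"], neighborhood(index, "Semex"))
```

The key `Semex/authoredPaper/` holds both authors, which is why the neighborhood query for "Semex" returns the paper together with the two persons.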
![Page 66: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/66.jpg)
Experimental Setting
Data sets
• A 50MB personal data set
• Two 10GB XML data sets: Wikipedia, XMark Benchmark
Queries: with one predicate or keyword
• Predicate Query with leaf attributes
• Predicate Query with branch attributes
• Predicate Query with associations
• Neighborhood Keyword Query
Measure: in milliseconds
• Index-lookup time
• Query-answering time
![Page 67: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/67.jpg)
Our Indexing Method Significantly Improves Query Answering
| Query Type | Plain Inverted List (10.6MB): Index Lookup (ms) | Plain Inverted List (10.6MB): Query Answer (ms) | Extended Inverted List (28.1MB): Index Lookup (ms) | Extended Inverted List (28.1MB): Query Answer (ms) |
|---|---|---|---|---|
| Pred Query with leaf attributes | 2 | 22 | 4 | 6 |
| Pred Query with branch attributes | 3 | 43 | 4 | 6 |
| Pred Query with associations | 3 | 88 | 6 | 17 |
| Neighborhood Keyword Query | 18 | 4174 | 48 | 97 |
![Page 68: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/68.jpg)
Our Indexing Method Scales Well
| | Wikipedia | XMark w/o asso | XMark with asso |
|---|---|---|---|
| Index | 4.15hr (1.13GB) | 6.64hr (3.04GB) | 12.72hr (4.08GB) |
| Pred Query with leaf attributes | 156 | 94 | 116 |
| Pred Query with branch attributes | - | 67 | 93 |
| Pred Query with associations | - | - | 217 |
| Neighborhood Keyword Query | 1646 | 1838 | 13468 |
![Page 69: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/69.jpg)
Heterogeneity at Different Levels
Name: First: Luna, Last: Dong
E-Mail Addresses: [email protected]
@inproceedings{dong05,author=“Xin Dong”,title=“Semex: A Platform for Personal Information Management and Integration”,booktitle=“VLDB 2005 PhD Workshop”,…}
Instance level
• Reference Reconciliation [Sigmod’05]
Query level
• Answering structured queries on unstructured data [WebDB’06]
• Indexing heterogeneous data [Sigmod’07]
Schema level
• Probabilistic schema mapping [VLDB’07]
• Visualization of heterogeneous data [InfoVis’07]
![Page 70: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/70.jpg)
Probabilistic Schema Mapping
S=(pname, email-addr, home-addr, office-addr)
T=(name, mailing-addr)

| Possible Mapping | Probability |
|---|---|
| {(pname,name),(home-addr, mailing-addr)} | 0.5 |
| {(pname,name),(office-addr, mailing-addr)} | 0.4 |
| {(pname,name),(email-addr, mailing-addr)} | 0.1 |
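One way to hold such a probabilistic mapping in code is a list of (attribute correspondence, probability) pairs. The validity check below is an assumption about what a well-formed mapping should satisfy (probabilities in [0, 1] that sum to 1, and no two source attributes mapped to the same target attribute within one mapping); the representation itself is illustrative.

```python
import math

# Each possible mapping: {source attribute: target attribute}, with its probability.
p_mapping = [
    ({"pname": "name", "home-addr": "mailing-addr"}, 0.5),
    ({"pname": "name", "office-addr": "mailing-addr"}, 0.4),
    ({"pname": "name", "email-addr": "mailing-addr"}, 0.1),
]

def is_valid(p_mapping):
    """Check the assumed well-formedness conditions of a probabilistic mapping."""
    probs = [p for _, p in p_mapping]
    in_range = all(0.0 <= p <= 1.0 for p in probs)
    sums_to_one = math.isclose(sum(probs), 1.0)
    one_to_one = all(len(set(m.values())) == len(m) for m, _ in p_mapping)
    return in_range and sums_to_one and one_to_one

print(is_valid(p_mapping))
```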
![Page 71: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/71.jpg)
By-Table vs. By-Tuple Semantics
![Page 72: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/72.jpg)
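By-Table vs. By-Tuple Semantics

Ds =

| pname | email-addr | home-addr | office-addr |
| --- | --- | --- | --- |
| Alice | alice@ | Mountain View | Sunnyvale |
| Bob | bob@ | Sunnyvale | San Jose |

DT = (possible target instances, one per mapping)

| Probability | (name, mailing-addr) tuples in DT |
| --- | --- |
| 0.5 | (Alice, Mountain View), (Bob, Sunnyvale) |
| 0.4 | (Alice, Sunnyvale), (Bob, San Jose) |
| 0.1 | (Alice, alice@), (Bob, bob@) |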
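Under by-table semantics a single mapping applies to every source tuple, so there is one possible target instance per mapping. The following is a minimal sketch under that reading, with hypothetical helper names (`apply_mapping`, `by_table_instances`, `by_table_answer`) and a simplifying assumption that each target instance is just the direct image of Ds. It also answers the query "SELECT mailing-addr FROM T" by summing the probabilities of the instances that contain each value.

```python
from collections import defaultdict

# Source instance Ds from the slide (e-mail values are truncated there too).
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

# Possible mappings {target attribute: source attribute} with probabilities.
PMAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]

TARGET_ATTRS = ("name", "mailing-addr")


def apply_mapping(mapping, row):
    """Translate one source row into a (name, mailing-addr) target tuple."""
    return tuple(row[mapping[attr]] for attr in TARGET_ATTRS)


def by_table_instances(pmapping, source_rows):
    """One target instance per mapping, applied to every source row."""
    return [([apply_mapping(m, r) for r in source_rows], p)
            for m, p in pmapping]


def by_table_answer(pmapping, source_rows):
    """Probability of each mailing-addr value across the possible instances."""
    answer = defaultdict(float)
    for instance, prob in by_table_instances(pmapping, source_rows):
        for value in {addr for _, addr in instance}:
            answer[value] += prob
    return dict(answer)


for instance, prob in by_table_instances(PMAPPING, DS):
    print(prob, instance)
print(by_table_answer(PMAPPING, DS))
```

With the slide's data, Sunnyvale appears in the instances with probability 0.5 and 0.4, so its answer probability is 0.5 + 0.4 = 0.9.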
![Page 73: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/73.jpg)
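By-Table vs. By-Tuple Semantics

Ds =

| pname | email-addr | home-addr | office-addr |
| --- | --- | --- | --- |
| Alice | alice@ | Mountain View | Sunnyvale |
| Bob | bob@ | Sunnyvale | San Jose |

DT = (possible target instances, one per sequence of per-tuple mappings)

| Probability | (name, mailing-addr) tuples in DT |
| --- | --- |
| 0.2 | (Alice, Mountain View), (Bob, San Jose) |
| 0.16 | (Alice, Sunnyvale), (Bob, San Jose) |
| 0.04 | (Alice, Sunnyvale), (Bob, bob@) |
| … | … |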
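Under by-tuple semantics each source tuple picks its mapping independently, so the probability of a target instance is the product of the chosen mappings' probabilities (for example 0.5 × 0.4 = 0.2). The sketch below enumerates every mapping sequence for the two-row example; `by_tuple_instances` is a hypothetical name, and treating the target instance as the direct image of Ds is a simplifying assumption.

```python
from itertools import product

# Source instance Ds from the slide (e-mail values are truncated there too).
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]

# Possible mappings {target attribute: source attribute} with probabilities.
PMAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]

TARGET_ATTRS = ("name", "mailing-addr")


def apply_mapping(mapping, row):
    """Translate one source row into a (name, mailing-addr) target tuple."""
    return tuple(row[mapping[attr]] for attr in TARGET_ATTRS)


def by_tuple_instances(pmapping, source_rows):
    """Yield (target instance, probability) for every mapping sequence."""
    for choice in product(pmapping, repeat=len(source_rows)):
        instance = [apply_mapping(m, row)
                    for (m, _), row in zip(choice, source_rows)]
        prob = 1.0
        for _, p in choice:
            prob *= p  # rows choose their mappings independently
        yield instance, prob


for instance, prob in by_tuple_instances(PMAPPING, DS):
    print(round(prob, 4), instance)
```

The number of sequences grows as (number of mappings) to the power of (number of source tuples), which is why naive enumeration does not scale.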
![Page 74: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/74.jpg)
Theoretical Results
• Query answering in by-table semantics
  • In PTIME in the size of the data
• Query answering in by-tuple semantics
  • In general #P-complete in the size of the data
  • In PTIME for two types of queries
    • The query contains a single table that is a target in a probabilistic mapping
    • If a join attribute is in a table that is a target in a probabilistic mapping, the query returns the attribute
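To give some intuition for the single-table case, the sketch below answers "SELECT mailing-addr FROM T" under by-tuple semantics in two ways: by enumerating every mapping sequence, and by using independence across tuples, P(v in answer) = 1 − Π_i (1 − P(tuple i maps to v)). This is an illustration only, not the algorithm from the talk; the function names are hypothetical, and the target instance is assumed to be the direct image of the source rows.

```python
from collections import defaultdict
from itertools import product
from math import prod

# Source rows and possible mappings, as in the slides' running example.
DS = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "San Jose"},
]
PMAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]


def answer_by_enumeration(pmapping, rows, attr):
    """Brute force over all len(pmapping) ** len(rows) mapping sequences."""
    answer = defaultdict(float)
    for choice in product(pmapping, repeat=len(rows)):
        seq_prob = prod(p for _, p in choice)
        values = {row[m[attr]] for (m, _), row in zip(choice, rows)}
        for value in values:
            answer[value] += seq_prob
    return dict(answer)


def answer_per_tuple(pmapping, rows, attr):
    """Polynomial time: combine per-row distributions using independence."""
    per_row = []
    for row in rows:
        dist = defaultdict(float)
        for m, p in pmapping:
            dist[row[m[attr]]] += p  # P(this row yields this value)
        per_row.append(dist)
    values = set().union(*per_row)
    return {v: 1.0 - prod(1.0 - d.get(v, 0.0) for d in per_row)
            for v in values}


print(answer_by_enumeration(PMAPPING, DS, "mailing-addr"))
print(answer_per_tuple(PMAPPING, DS, "mailing-addr"))
```

For Sunnyvale, Alice reaches it with probability 0.4 and Bob with 0.5, so both functions give 1 − 0.6 × 0.5 = 0.7.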
![Page 75: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/75.jpg)
More Theoretical Results
• Query answering in both semantics is in PTIME in the size of the probabilistic mapping
• Compressed representations of probabilistic mappings
  • We propose two compact representations of probabilistic mappings, such that query answering is still in PTIME in the size of the mapping
  • When we encode probabilistic mappings using a Bayes Net, query answering can be exponential in the size of the mapping
![Page 76: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/76.jpg)
Conclusions
• Goal: Provide quality search, querying and browsing for dataspaces
• Thesis Contributions
  • An algorithm for reference reconciliation
  • An indexing method for supporting queries that combine keywords and structure
  • An algorithm for answering structured queries on unstructured data
  • The concept and theoretical foundation for Probabilistic Schema Mapping
  • An approach for visualizing heterogeneous data
  • A PIM system incorporating the above
![Page 77: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/77.jpg)
Future Work I. Evolve Semantic Relationships Between Data Sources on an As-needed Basis
[Figure: data sources D1 to D5, a mediated schema, and a query Q]
![Page 78: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/78.jpg)
Future Work II. Manage Dataspaces at the Web-Scale
[Figure: data sources D1 to D5]
![Page 79: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/79.jpg)
Future Work II. Manage Dataspaces at the Web-Scale
Challenges: Large scale and complex domains
Future directions:
1. Probabilistic data integration
2. Information redundancy
3. Universal search
[Figure: Keyword Search]
![Page 80: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/80.jpg)
Research Methodology
Machine Learning, Information Retrieval, Database
Theory
1. Probabilistic Schema Mapping [VLDB’07]
2. XML Query Containment [VLDB’04]
3. Optimization of Query Difference (Submitted)
System
1. Semex Personal Information Management System [Sigmod’05 Best Demo]
2. Woogle Web Service Search Engine [VLDB’04]
![Page 81: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/81.jpg)
Acknowledgement
[Figure: Semex association graph around Project: Semex, with nodes Person: Luna, Person: Alon, Person: Jayant, Person: Michelle, Person: Yuhan, Stanford Visual Grp, Article, and CIDR, and edges labeled projectLeader, participant, advisor, co-worker, collaborator, About, and publishedIn]
![Page 82: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/82.jpg)
![Page 83: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/83.jpg)
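Our Algorithm Equals or Outperforms Attr-wise Matching in All Classes

| Class | Attr-wise Matching: Precision | Attr-wise Matching: Recall | Association Network: Precision | Association Network: Recall |
| --- | --- | --- | --- | --- |
| Person | 0.872 | 0.733 | 0.924 | 0.875 |
| Article | 0.997 | 0.977 | 0.999 | 0.976 |
| Venue | 0.935 | 0.790 | 0.987 | 0.937 |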
![Page 84: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/84.jpg)
Results on Cora Dataset Are Competitive with Other Reported Results
Results reported in other record linkage papers:
• Precision/Recall = 0.990/0.925 [Cohen et al., 2002]
• Precision/Recall = 0.842/0.909 [Parag and Domingos, 2004]
• F-measure = 0.867 [Bilenko and Mooney, 2003]

| Class | Attr-wise Matching: Prec/Recall | Attr-wise Matching: F-msre | Dependency Graph: Prec/Recall | Dependency Graph: F-msre |
| --- | --- | --- | --- | --- |
| Article | 0.985/0.913 | 0.948 | 0.985/0.924 | 0.954 |
| Person | 0.994/0.985 | 0.989 | 1/0.987 | 0.993 |
| Venue | 0.982/0.362 | 0.529 | 0.837/0.714 | 0.771 |
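The F-msre columns are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The slide does not state the formula, so that is an assumption; the short sketch below (with the hypothetical helper `f_measure`) recomputes the six table values from their Prec/Recall pairs.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (class, method) -> (precision, recall), copied from the table.
RESULTS = {
    ("Article", "Attr-wise Matching"): (0.985, 0.913),
    ("Person", "Attr-wise Matching"): (0.994, 0.985),
    ("Venue", "Attr-wise Matching"): (0.982, 0.362),
    ("Article", "Dependency Graph"): (0.985, 0.924),
    ("Person", "Dependency Graph"): (1.0, 0.987),
    ("Venue", "Dependency Graph"): (0.837, 0.714),
}

for (cls, method), (p, r) in RESULTS.items():
    print(cls, method, round(f_measure(p, r), 3))
```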
![Page 85: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/85.jpg)
Experiment Settings
Measure: Diversity and Dispersion
• Diversity: For every result partition, how many real-world objects are included; ideally should be 1 (related to precision)
• Dispersion: For every real-world object, how many result partitions include it; ideally should be 1 (related to recall)
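The two measures can be computed from a reconciliation result and a ground truth as follows. This is a sketch under the assumption that both are reported as averages (over partitions and over real-world objects, respectively); the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def diversity_and_dispersion(predicted, truth):
    """predicted: reference -> result partition id;
    truth: reference -> real-world object id.
    Returns (average diversity, average dispersion)."""
    if not predicted:
        raise ValueError("no references to evaluate")
    objects_in_partition = defaultdict(set)
    partitions_of_object = defaultdict(set)
    for ref, partition in predicted.items():
        obj = truth[ref]
        objects_in_partition[partition].add(obj)
        partitions_of_object[obj].add(partition)
    diversity = (sum(len(s) for s in objects_in_partition.values())
                 / len(objects_in_partition))
    dispersion = (sum(len(s) for s in partitions_of_object.values())
                  / len(partitions_of_object))
    return diversity, dispersion


# Toy example: three references to one person are split over two partitions.
predicted = {"r1": "p1", "r2": "p1", "r3": "p2", "r4": "p3"}
truth = {"r1": "luna", "r2": "luna", "r3": "luna", "r4": "alon"}
print(diversity_and_dispersion(predicted, truth))
```

In the toy example every partition holds a single real-world object (diversity 1.0), but one object is spread over two partitions, so dispersion is (2 + 1) / 2 = 1.5.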
![Page 86: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/86.jpg)
Diversity and Dispersion Are Very Close to 1
| Dataset (#per/#ref) | Attr-wise Matching: Diversity/Dispersion | Dependency Graph: Diversity/Dispersion |
| --- | --- | --- |
| A (1750/24076) | 1.18/1.003 | 1.047/1.003 |
| B (1989/36359) | 1.067/1.01 | 1.039/1.008 |
| C (1570/15160) | 1.053/1.003 | 1.03/1.017 |
| D (1518/17199) | 1.041/1.004 | 1.023/1.005 |
| Avg | 1.085/1.005 | 1.035/1.008 |
![Page 87: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/87.jpg)
![Page 88: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/88.jpg)
I. Visualizing Heterogeneous Data
• Current data visualization
  • Considers only data residing in a single database
  • Allows users to specify a visualization for each type of data (e.g., Haystack [Karger et al., 2005])
• Visualization of dataspaces needs to consider data from heterogeneous sources
![Page 89: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/89.jpg)
Example Visualization: A Map Marked with Papers
![Page 90: Managing a Space of Heterogeneous Data](https://reader034.fdocuments.in/reader034/viewer/2022051517/568149ee550346895db71fd4/html5/thumbnails/90.jpg)
Example Visualization: A Calendar with Presentation Slides