CSE 636 Data Integration
Overview
2
Data Warehouse Architecture
[Diagram: several Data Sources feed ETL Tools (Extract-Transform-Load) and Data Cleaning, which load a Relational Database (Warehouse); Users and Applications work on the warehouse through OLAP / Decision Support and Data Cubes / Data Mining]
3
Virtual Integration Architecture
• Leave the data in the sources
• When a query comes in:
  – Determine the relevant sources to the query
  – Break down the query into sub-queries for the sources
  – Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration
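The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two book sources, the music source, the `CATALOG` lookup, and all sample rows are hypothetical stand-ins and do not come from the slides.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

# Hypothetical sources: each one answers a title lookup over its own data.
def book_source_1(title: str) -> List[Row]:
    data = [{"title": "On the Road", "price": 10.86}]
    return [r for r in data if r["title"].lower() == title.lower()]

def book_source_2(title: str) -> List[Row]:
    data = [{"title": "On the Road", "price": 8.86}]
    return [r for r in data if r["title"].lower() == title.lower()]

def music_source(title: str) -> List[Row]:
    data = [{"title": "On the Road", "price": 12.00}]
    return [r for r in data if r["title"].lower() == title.lower()]

# Assumed design-time knowledge: which sources cover which kind of item.
CATALOG: Dict[str, List[Callable[[str], List[Row]]]] = {
    "books": [book_source_1, book_source_2],
    "music": [music_source],
}

def answer(kind: str, title: str, max_price: float) -> List[Row]:
    relevant = CATALOG.get(kind, [])              # determine the relevant sources
    partial = [src(title) for src in relevant]    # one sub-query per source
    rows = [r for rows in partial for r in rows
            if r["price"] <= max_price]           # filter what the sources could not
    return sorted(rows, key=lambda r: r["price"])  # combine into one answer

cheap = answer("books", "on the road", 10.0)
```

Because nothing is copied ahead of time, every call reads the sources as they are at query time, which is what keeps the data fresh.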
4
Virtual Integration Architecture
[Diagram, design time: End Users and Applications query a Global Schema; Schema Mappings relate the Global Schema to the Local Schema of each Data Source]
Sources can be:
• Relational DBs
• Excel Files
• Web Sites
• Web Services
5
Schema Mappings
• Differences in:
  – Names in schema
  – Attribute grouping
  – Coverage of databases
  – Granularity and format of attributes
Inventory Database A
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Inventory Database B
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)
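To make the listed differences concrete, here is a small Python sketch that maps one BooksAndMusic row from Inventory Database A onto the Books and Authors tables of Inventory Database B. The attribute correspondences (for example ItemID to ISBN), the name-splitting rule, and the sample row are assumptions chosen for illustration, not mappings given on the slide.

```python
# Assumed attribute correspondences from A.BooksAndMusic to B.Books (naming difference).
BOOKS_AND_MUSIC_TO_BOOKS = {
    "Title": "Title",
    "ItemID": "ISBN",
    "SuggestedPrice": "Price",
}

def split_author(author):
    """Granularity difference: A stores one Author string, B stores first and last name."""
    first, _, last = author.rpartition(" ")
    return {"FirstName": first, "LastName": last}

def to_database_b(item):
    """Map one BooksAndMusic row to B's tables (attribute grouping difference)."""
    if item["ItemType"] != "Books":
        return None  # coverage difference: this sketch only maps books
    books = {b_attr: item[a_attr] for a_attr, b_attr in BOOKS_AND_MUSIC_TO_BOOKS.items()}
    authors = {"ISBN": item["ItemID"], **split_author(item["Author"])}
    return {"Books": books, "Authors": authors}

# Hypothetical sample row.
row = {"Title": "On the Road", "Author": "Jack Kerouac", "ItemID": 123,
       "ItemType": "Books", "SuggestedPrice": 10.86}
mapped = to_database_b(row)
```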
6
Issues for Schema Mappings
[Diagram, design time: End Users and Applications query a Global Schema; Schema Mappings relate the Global Schema to the Local Schema of each Data Source]
• What formalisms to express them?
• How to create them?
• Can we discover them somehow?
• How do we use them?
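One simple way to approach the discovery question is to propose candidate attribute matches from name similarity alone. The sketch below uses Python's standard `difflib.SequenceMatcher`; the 0.5 threshold and the function names are arbitrary assumptions, and this is a naive baseline rather than a method described in the slides.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def suggest_matches(global_attrs, local_attrs, threshold=0.5):
    """For each global attribute, propose the most similar local attribute name."""
    suggestions = {}
    for g in global_attrs:
        best = max(local_attrs, key=lambda l: similarity(g, l))
        if similarity(g, best) >= threshold:
            suggestions[g] = best
    return suggestions

global_books = ["Title", "ISBN", "Price", "DiscountPrice", "Edition"]
local_books_and_music = ["Title", "Author", "Publisher", "ItemID", "ItemType",
                         "SuggestedPrice", "Categories", "Keywords"]
candidates = suggest_matches(global_books, local_books_and_music)
```

Name similarity pairs Title with Title and Price with SuggestedPrice, but it cannot relate ISBN to ItemID, so suggestions like these would still need review by a person or evidence from the data itself.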
7
Virtual Integration Architecture
[Diagram, run time: a Query on the Global Schema enters the Mediator, which performs Reformulation, Optimization, and Execution; Wrappers connect the Mediator to each Data Source and its Local Schema; the Result is returned to the user]
8
Issues for Query Processing
Reformulation
[Diagram: the Reformulation step of the Mediator, between the Query on the Global Schema and the Local Schemas of the Data Sources]
• User queries refer to the global schema
• Data is stored in the sources in a local schema
• Rewriting algorithms
9
Issues for Query Processing
Reformulation
Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)
Local Schema A
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)
Query on the global schema:
SELECT ISBN, Price
FROM Books
WHERE Title = 'on the road'
Reformulated query on Local Schema A:
SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = 'on the road'
AND ItemType = 'Books'
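The equivalence of the two queries can be checked with a small sketch using Python's built-in `sqlite3`. It assumes the mapping is written as a view that defines the global Books relation over BooksAndMusic (a global-as-view style mapping; the slides do not say which formalism is used), and the two sample rows are invented for the test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BooksAndMusic ("
    "Title TEXT, Author TEXT, ItemID INTEGER, ItemType TEXT, SuggestedPrice REAL)"
)
# Invented sample rows: one book and one music item with the same title.
conn.executemany(
    "INSERT INTO BooksAndMusic VALUES (?, ?, ?, ?, ?)",
    [
        ("on the road", "Jack Kerouac", 123, "Books", 10.86),
        ("on the road", "Sample Artist", 777, "Music", 12.00),
    ],
)

# Assumed mapping: the global relation Books defined as a view over the source.
conn.execute(
    "CREATE VIEW Books AS "
    "SELECT Title, ItemID AS ISBN, SuggestedPrice AS Price "
    "FROM BooksAndMusic WHERE ItemType = 'Books'"
)

# Query as the user writes it, against the global schema.
global_rows = conn.execute(
    "SELECT ISBN, Price FROM Books WHERE Title = 'on the road'"
).fetchall()

# Reformulated query, against Local Schema A.
local_rows = conn.execute(
    "SELECT ItemID, SuggestedPrice FROM BooksAndMusic "
    "WHERE Title = 'on the road' AND ItemType = 'Books'"
).fetchall()
```

Here the database engine unfolds the view; a mediator has to perform the same rewriting itself because the sources do not know the global schema.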
10
Issues for Query Processing
Query Translation
[Diagram: the Mediator (Reformulation, Optimization, Execution) sends the Query through a Wrapper to a Data Source; each Data Source has its own Local Schema under the Global Schema]
• Different query languages
11
Issues for Query Processing
Query Translation
Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)
SELECT ISBN, Price
FROM Books
WHERE Title = 'on the road'
Local Source A
http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
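A wrapper for a source like this turns the selection condition into request parameters. The sketch below builds the URL shown above with `urllib.parse.urlencode`; the function name is hypothetical, and the parameter names are taken from the example URL only.

```python
from urllib.parse import urlencode

def to_source_a_url(title, item_type="Books"):
    """Translate `WHERE Title = <title>` on Books into Local Source A's URL form."""
    base = "http://www.amazon.com/homepage.html"
    # urlencode escapes spaces as '+', matching the form used in the example.
    return base + "?" + urlencode({"ItemType": item_type, "Title": title})

url = to_source_a_url("on the road")
```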
12
Issues for Query Processing
Data Translation
[Diagram: a Wrapper passes a Data Source's answer back to the Mediator (Reformulation, Optimization, Execution); each Data Source has its own Local Schema under the Global Schema]
• Different data models
13
Issues for Query Processing
Data Translation
Local Result A
<table>
 <tr>
  <td>
   <a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br>
   <a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b>
  </td>
 </tr>
</table>
Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)
Title        ISBN  Price  …  …
On the Road  123   10.86  …  …
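The translation from the HTML result to a tuple of the global schema can be sketched with a regular expression. The pattern below is an assumption fitted to this one fragment; a real wrapper would need something more robust than a single regex.

```python
import re

HTML = (
    "<table> <tr> <td> <a href=/details?isbn=123> <b>On the Road</b> </a> "
    "-- by Jack Kerouac; Paperback <br> <a href=/details?isbn=123> Buy new </a> "
    ":<b class=price>$10.86</b> </td> </tr></table>"
)

# Assumed page layout: a details link with the ISBN, a bold title, then a price element.
ROW = re.compile(
    r"<a href=/details\?isbn=(?P<isbn>\d+)>\s*<b>(?P<title>[^<]+)</b>"
    r".*?<b class=price>\$(?P<price>[\d.]+)</b>",
    re.DOTALL,
)

def extract_books(html):
    """Return one dict per book found, using global-schema attribute names."""
    return [
        {"Title": m["title"].strip(), "ISBN": int(m["isbn"]), "Price": float(m["price"])}
        for m in ROW.finditer(html)
    ]

books = extract_books(HTML)
```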
14
Issues for Query Processing
Query Execution
[Diagram: the Execution step of the Mediator (Reformulation, Optimization, Execution) sends the Query through Wrappers to the Data Sources, each with its own Local Schema under the Global Schema]
• Access as many data sources as needed
• Duplicate/redundant and irrelevant data
• Limited query capabilities
15
Issues for Query Processing
Limited Query Capabilities
Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)
Query on the global schema:
SELECT ISBN, Price, DiscountPrice
FROM Books
WHERE Title = 'on the road'
Local Schema A
BooksAndMusic(Title, Author, ItemID, ItemType, SuggestedPrice)
Query form accepted by source A:
SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = ?
Local Schema B
DiscountBooks(Title, Edition, ISBN, GreatPrice)
Query form accepted by source B:
SELECT GreatPrice
FROM DiscountBooks
WHERE ISBN = ?
Execution:
• Query sent to source A:
  SELECT ItemID, SuggestedPrice
  FROM BooksAndMusic
  WHERE Title = 'on the road'
• Result from source A:
  ItemID  SuggestedPrice
  123     10.86
• Query sent to source B, with the ISBN bound to the ItemID returned by A:
  SELECT GreatPrice
  FROM DiscountBooks
  WHERE ISBN = 123
• Result from source B:
  GreatPrice
  8.86
• Combined answer:
  ISBN  Price  DiscountPrice
  123   10.86  8.86
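The plan above, where values returned by source A are fed into source B's single parameterized query, can be sketched as follows. The two functions are in-memory stand-ins for the sources, and dropping books that have no entry in source B is an assumption of this sketch, not something the slide specifies.

```python
def query_source_a(title):
    """Stand-in for: SELECT ItemID, SuggestedPrice FROM BooksAndMusic WHERE Title = ?"""
    table = [{"Title": "on the road", "ItemID": 123, "SuggestedPrice": 10.86}]
    return [(r["ItemID"], r["SuggestedPrice"]) for r in table if r["Title"] == title]

def query_source_b(isbn):
    """Stand-in for: SELECT GreatPrice FROM DiscountBooks WHERE ISBN = ?"""
    table = [{"ISBN": 123, "GreatPrice": 8.86}]
    return [r["GreatPrice"] for r in table if r["ISBN"] == isbn]

def books_by_title(title):
    """Answer SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = ?"""
    results = []
    for item_id, price in query_source_a(title):      # source A can only be asked by title
        for great_price in query_source_b(item_id):   # source B can only be asked by ISBN
            results.append({"ISBN": item_id, "Price": price,
                            "DiscountPrice": great_price})
    return results

rows = books_by_title("on the road")
```

The order of the two calls is forced by the sources: B cannot be queried until A has supplied an ISBN.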
16
Issues for Query Processing
Query Answering
[Diagram: the Mediator (Reformulation, Optimization, Execution) collects the answers from the Wrappers of the Data Sources and returns the Query Result]
• Combine the results and further process them if needed
• Mainly union and merge
• Inconsistencies
17
Issues for Query Processing
Query Answering (Union)
ItemID  SuggestedPrice
123     10.86

ISBN  GreatPrice
456   8.86

Union:
ISBN  Price
123   10.86
456   8.86
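A minimal sketch of this union in Python: each source result is first renamed to the global attribute names, then the rows are concatenated with exact duplicates removed. The helper names are hypothetical.

```python
def rename(rows, mapping):
    """Rename each row's attributes from a local schema to the global schema."""
    return [{mapping[k]: v for k, v in row.items()} for row in rows]

def union(*row_sets):
    """Concatenate row sets, dropping exact duplicate rows."""
    seen, out = set(), []
    for rows in row_sets:
        for row in rows:
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                out.append(row)
    return out

result_a = [{"ItemID": 123, "SuggestedPrice": 10.86}]
result_b = [{"ISBN": 456, "GreatPrice": 8.86}]

combined = union(
    rename(result_a, {"ItemID": "ISBN", "SuggestedPrice": "Price"}),
    rename(result_b, {"ISBN": "ISBN", "GreatPrice": "Price"}),
)
```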
18
Issues for Query Processing
Query Answering (Merge)
ItemID  Title
123     On the Road
Primary key: ItemID

ISBN  Edition  Price
123   2nd      8.86
Primary key: ISBN

Merged on the primary key:
ISBN  Title        Edition  Price
123   On the Road  2nd      8.86
Primary key: ISBN
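The merge can be sketched as a join of the two results on their primary keys. The function below is a hypothetical helper that keeps only keys present in both results and names the key after the right-hand result's key attribute.

```python
def merge_on_key(left, right, left_key, right_key):
    """Join two results whose primary keys identify the same objects."""
    index = {row[right_key]: row for row in right}
    merged = []
    for row in left:
        match = index.get(row[left_key])
        if match is not None:
            out = {right_key: row[left_key]}
            out.update({k: v for k, v in row.items() if k != left_key})
            out.update({k: v for k, v in match.items() if k != right_key})
            merged.append(out)
    return merged

result_a = [{"ItemID": 123, "Title": "On the Road"}]
result_b = [{"ISBN": 123, "Edition": "2nd", "Price": 8.86}]
merged = merge_on_key(result_a, result_b, "ItemID", "ISBN")
```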
19
Issues for Query Processing
Query Answering (Inconsistencies)
ItemID  Title        Edition
123     On the Road  1st
Primary key: ItemID

ISBN  Edition  Price
123   2nd      8.86
Primary key: ISBN

Merged on the primary key (the two sources disagree on Edition):
ISBN  Title        Edition  Price
123   On the Road  ???      8.86
Primary key: ISBN
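Detecting such a conflict during the merge is straightforward; deciding which value to keep is the hard part and is left open here. In this sketch the conflicting attribute is set to `None` (standing in for the "???" above) and both candidate values are reported. The function and parameter names are assumptions.

```python
def merge_with_conflicts(a_row, b_row, renames):
    """Merge two rows describing the same object and report disagreeing attributes."""
    a_row = {renames.get(k, k): v for k, v in a_row.items()}
    merged, conflicts = dict(a_row), {}
    for k, v in b_row.items():
        if k in merged and merged[k] != v:
            conflicts[k] = (merged[k], v)  # keep both values for later resolution
            merged[k] = None               # unresolved, like "???" in the merged row
        else:
            merged[k] = v
    return merged, conflicts

merged, conflicts = merge_with_conflicts(
    {"ItemID": 123, "Title": "On the Road", "Edition": "1st"},
    {"ISBN": 123, "Edition": "2nd", "Price": 8.86},
    renames={"ItemID": "ISBN"},
)
```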
21
Peer-Based Integration
[Diagram: a network of five peers (Peer 1 to Peer 5); queries are posed at individual peers]
22
Peer-Based Integration
• No need for a central mediated schema
• Peers serve as mediators for other peers
• A peer can be both a server and a client
• Semantic relationships are specified locally (between small sets of peers)
• Queries are posed using the peer’s schema
• Answers come from anywhere in the system
• This is not P2P file sharing.
  – Data has rich semantics
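As a rough illustration of these points, the sketch below lets a query posed in one peer's schema travel along locally specified mappings, translating the attribute name at each hop. The three peers, their attribute names, their data, and the one-directional mappings are all hypothetical.

```python
from collections import deque

# Hypothetical peers: each stores rows under its own attribute names.
PEER_DATA = {
    "Peer1": [{"Title": "On the Road"}],
    "Peer2": [{"BookName": "Big Sur"}],
    "Peer3": [{"Name": "Desolation Angels"}],
}

# Mappings exist only between neighbouring peers: (from, to) -> {attr_from: attr_to}.
MAPPINGS = {
    ("Peer1", "Peer2"): {"Title": "BookName"},
    ("Peer2", "Peer3"): {"BookName": "Name"},
}

def ask(peer, attribute):
    """Collect values of `attribute`, named in `peer`'s schema, from all reachable peers."""
    answers, seen = [], {peer}
    queue = deque([(peer, attribute)])
    while queue:
        current, attr = queue.popleft()
        answers += [row[attr] for row in PEER_DATA[current] if attr in row]
        for (src, dst), attr_map in MAPPINGS.items():
            if src == current and dst not in seen and attr in attr_map:
                seen.add(dst)
                queue.append((dst, attr_map[attr]))  # translate the attribute for the next peer
    return answers

titles = ask("Peer1", "Title")
```

Peer1 has no direct mapping to Peer3, yet its query reaches Peer3 through Peer2, which acts as a mediator for it.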