Data Integration - Brown University

Transcript of Data Integration - Brown University

Page 1: Data Integration - Brown University

Data Integration
Sam Birch & Alex Leblang

Page 2: Data Integration - Brown University

Two faces of data integration

● Businesses
○ Have relatively more structured databases which they need to organize
● Research on integrating less structured data
○ Databases coming from different organizations without a common architecture

Page 4: Data Integration - Brown University

One safe place: the data warehouse

Businesses want to control and access their operational data in a single place:

1. Backups and versioning
2. Single query interface
3. Quality & consistency
4. Separation of analytic and operational workloads

Page 5: Data Integration - Brown University

Data warehouse design

According to Bill Inmon, a data warehouse means:

1. All data about a single real-world thing are linked;
2. Data are never over-written or deleted;
3. Comprises all (or nearly all) of an organization's data in a consistent way;
4. Comprises all versions of the data in the operational system;
5. Not virtualized

Page 6: Data Integration - Brown University

Update methods

1. ETL
2. Virtual / federated databases
3. Change data capture

Page 7: Data Integration - Brown University

ETL

● Extract, Transform, Load
● Take data from one data source, transform it, and load it into another location, usually a data warehouse
● Generally periodic (hourly, daily…)
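To make the pattern concrete, here is a minimal ETL sketch in Python. The source file orders.csv, its column names, and the SQLite file standing in for the warehouse are all assumptions for illustration, not part of the slides:

import csv
import sqlite3

def extract(path):
    # Extract: read raw rows from the operational export.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize values and drop malformed rows.
    clean = []
    for row in rows:
        try:
            clean.append((row["order_id"],
                          row["customer"].strip().lower(),
                          float(row["amount"])))
        except (KeyError, TypeError, ValueError, AttributeError):
            continue  # skip rows that fail validation
    return clean

def load(rows, db_path="warehouse.db"):
    # Load: append into the warehouse table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(order_id TEXT, customer TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

# A scheduler (cron, Airflow, ...) would run this periodically, e.g. hourly:
# load(transform(extract("orders.csv")))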

Page 8: Data Integration - Brown University

Virtual (federated) DBs

● A method for integrating data virtually
● No actual (physical) data integration
● Virtual database systems give you the feel of data integration without the need to maintain a single data warehouse
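As a toy illustration of the idea, the sketch below presents two separate SQLite files as one logical table through a view, without copying any data. The file names and the assumption that both files already contain a sales table with the same columns are hypothetical:

import sqlite3

# Connect to one component database and attach the other; nothing is copied.
con = sqlite3.connect("sales_east.db")
con.execute("ATTACH DATABASE 'sales_west.db' AS west")

# A view plays the role of the mediated interface over both sources.
con.execute("""
    CREATE TEMP VIEW all_sales AS
    SELECT * FROM main.sales
    UNION ALL
    SELECT * FROM west.sales
""")
for row in con.execute("SELECT COUNT(*) FROM all_sales"):
    print(row)
con.close()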

Page 9: Data Integration - Brown University

How they work

● Federated databases map autonomous database systems into a single database
● This is done over a computer network, which allows the component databases to be geographically distributed
● Federated databases can be loosely or tightly coupled

Page 10: Data Integration - Brown University

Loosely / Tightly Coupled

● Loosely coupled databases require each component to construct its own schema
○ ...forces the user to have knowledge of the schemas when using the database
● Tightly coupled databases use independent processes to create a single schema used across the federated database
○ ...moves much of the work from the user or DBA to the software itself

Page 11: Data Integration - Brown University

Change data capture

● Keep track of diffs, like version control for your database
● Helpful for large datasets with relatively small changes
● Different implementations (a timestamp-based sketch follows):
○ Timestamps on rows
○ Version numbers on rows
○ Triggers on tables
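A minimal sketch of the timestamp approach, assuming a source table with a last_modified column (the table and column names are hypothetical):

import sqlite3

def capture_changes(con, since):
    # Return only the rows changed after `since` -- the diff to ship downstream.
    cur = con.execute(
        "SELECT id, payload, last_modified FROM source_table "
        "WHERE last_modified > ? ORDER BY last_modified",
        (since,),
    )
    return cur.fetchall()

# Each sync remembers the newest timestamp it saw and starts there next time:
# changes = capture_changes(con, last_sync)
# if changes:
#     last_sync = changes[-1][2]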

Page 12: Data Integration - Brown University

Reforming unstructured data

Page 13: Data Integration - Brown University

Large and unstructured

The 4 Vs (according to Dong):
○ a large Volume of sources
○ changing at a high Velocity
○ as well as a huge Variety of sources
○ with lots of questions regarding data Veracity

Dong et al.

Page 14: Data Integration - Brown University

Goals

● Schema alignment
● Record linkage
● Data fusion

Page 15: Data Integration - Brown University

Schema Alignment

Dong et al

Page 16: Data Integration - Brown University

Schema Alignment

● Mediated Schema
○ Identify a domain-specific model covering the sources
● Attribute Matching
○ Identify similarities between schema attributes (a toy sketch follows)
● Schema Mapping
○ Specify how to map records between the different schemas

Dong et al
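A naive sketch of attribute matching: pair up attributes from two schemas by string similarity of their names. Real systems also use data values, types, and learned models; the schemas below are made up for illustration:

from difflib import SequenceMatcher

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_attributes(schema_a, schema_b, threshold=0.5):
    # For each attribute in schema_a, find its most similar name in schema_b.
    matches = []
    for a in schema_a:
        best = max(schema_b, key=lambda b: name_similarity(a, b))
        score = name_similarity(a, best)
        if score >= threshold:
            matches.append((a, best, round(score, 2)))
    return matches

print(match_attributes(["phone_number", "addr"], ["phone", "address", "zip"]))
# [('phone_number', 'phone', 0.59), ('addr', 'address', 0.73)]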

Page 17: Data Integration - Brown University

Record Linkage

Dong et al

Page 18: Data Integration - Brown University

Record Linkage

Dong et al

Page 19: Data Integration - Brown University

Data Fusion

Dong et al

Page 20: Data Integration - Brown University

Data Fusion

Dong et al

Page 21: Data Integration - Brown University

Data Fusion

● Reconciliation of conflicting non-identifying content
○ Voting
○ Source Quality
○ Copy Detection

Dong et al
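A minimal sketch of voting-based fusion, with an optional per-source quality weight; the source names and values are made up:

from collections import defaultdict

def fuse(claims, quality=None):
    # claims: list of (source, value) pairs for one data item;
    # quality: optional dict mapping source -> weight (defaults to 1.0).
    votes = defaultdict(float)
    for source, value in claims:
        votes[value] += (quality or {}).get(source, 1.0)
    return max(votes, key=votes.get)

claims = [("src_a", "02912"), ("src_b", "02912"), ("src_c", "02906")]
print(fuse(claims))                          # '02912' wins the plain vote
print(fuse(claims, quality={"src_c": 5.0}))  # '02906' once src_c outweighs the rest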

Page 22: Data Integration - Brown University

Dealing With Different Data Sources

● Semantic Heterogeneity
● Access Pattern Limitations
● Integrity Constraints
● Data-level Heterogeneity

http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source

Page 23: Data Integration - Brown University

Semantic Heterogeneity

● Data integration can suffer from many issues
● Differences in:
○ organization of tables
○ naming of schemas
○ data-level representation

http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source

Page 24: Data Integration - Brown University

Data-level Heterogeneity

● “115 Waterman St. 02912” / “Brown University CIT”
● “Tim Kraska” / “Timothy Kraska” / “[email protected]”
● IRA: Individual Retirement Account or Irish Republican Army?
● Arbitrarily hard: different descriptions / photos of the same place

Page 25: Data Integration - Brown University

Entity resolution (ER)

“[The] problem of identifying and linking/grouping different manifestations of the same real world object.”

Getoor, 2012.

Ironically, AKA: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match...

Page 26: Data Integration - Brown University

Motivating examples

● Mining unstructured data (e.g. webpages)
● Governance (census, intelligence)
● Generally, when data comes from different organizations

Getoor, 2012.

Page 27: Data Integration - Brown University

ER Challenges

● Fundamental ambiguity
● Diversity in representations (format, truncation, ambiguity)
● Errors
● Missing data
● Records from different times
● Relationships in addition to equality

Getoor, 2012.

Page 28: Data Integration - Brown University

Normalization

● Transform data into a format which is more likely to match other similar data
○ Splitting / combining rows
● Canonicalization (e.g. phone numbers, URLs, case of text, expanding truncations)
○ Maximally informative, but standard format (a sketch follows)
● Logic is specific to the data

Getoor, 2012.
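A canonicalization sketch for one data type, assuming US-style phone numbers as the target convention (the convention itself is an illustrative choice):

import re

def normalize_phone(raw):
    digits = re.sub(r"\D", "", raw)         # drop everything but digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                 # drop the US country code
    if len(digits) != 10:
        return None                         # can't canonicalize; leave for review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

for raw in ["401-863-1000", "(401) 863 1000", "+1 401.863.1000"]:
    print(normalize_phone(raw))             # all map to '(401) 863-1000'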

Page 29: Data Integration - Brown University

Pairwise matching

[Figure: pipeline from raw data through normalization to matching features]

Getoor, 2012.

Page 30: Data Integration - Brown University

Matching features

● Edit distance (e.g. Levenshtein) for typos
● Set/vector similarity (Jaccard index, TF/IDF, dot-product)
● Alignment (e.g. Monge-Elkan)
● Phonetic (e.g. Soundex)

Getoor, 2012.
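Sketches of two of these features, Levenshtein edit distance and Jaccard similarity on token sets (the example strings are made up):

def levenshtein(a, b):
    # Classic dynamic program: minimum number of edits turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    # Overlap of token sets: robust to word reordering, unlike edit distance.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

print(levenshtein("Kraska", "Krasca"))      # 1 (one substituted character)
print(jaccard("Tim Kraska", "Kraska Tim"))  # 1.0 (same tokens, different order)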

Page 31: Data Integration - Brown University

Record linkage

● Matching record to record rather than datum to datum
○ May also require schema alignment
● Average of component similarities (a sketch follows)
● Specific rules about each column
● Probabilistic models with ML
○ Training data is not trivial to obtain: most pairs are obviously not matches

Getoor, 2012.
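A sketch of the weighted-average strategy, combining per-column similarity functions into one record-level score; the columns, weights, and threshold are illustrative assumptions:

def exact(a, b):
    return 1.0 if a == b else 0.0

def record_similarity(rec_a, rec_b, sims, weights):
    # sims: column -> similarity function; weights: column -> relative weight.
    total = sum(weights.values())
    return sum(weights[c] * sims[c](rec_a[c], rec_b[c]) for c in sims) / total

sims = {"name": exact, "zip": exact}
weights = {"name": 2.0, "zip": 1.0}
a = {"name": "tim kraska", "zip": "02912"}
b = {"name": "tim kraska", "zip": "02906"}
print(record_similarity(a, b, sims, weights))  # ~0.67, below, say, a 0.85 threshold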

Page 32: Data Integration - Brown University

Collective matching and constraints

● Some data matching operations aren’t independent of the other data in the record
○ e.g. two research papers in the same venue are more likely to be by the same authors
● Expressed as constraints over the matching relationships of columns in a record (a union-find sketch follows)
○ Transitivity (if A = B, and B = C, then A = C)
○ Exclusivity (if A = B then B != C)

Getoor, 2012.
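Transitivity, in particular, can be enforced by merging pairwise match decisions into clusters with a union-find structure, as in this sketch:

parent = {}

def find(x):
    # Follow parent pointers to the cluster representative, compressing as we go.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    # Declare x and y matched: merge their clusters.
    parent[find(x)] = find(y)

union("A", "B")
union("B", "C")
print(find("A") == find("C"))  # True: A = B and B = C force A = C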

Page 33: Data Integration - Brown University

Getoor, 2012.

Page 34: Data Integration - Brown University

ACSDb

● Developed at Google
● The authors looked at 14.1 billion HTML tables and found 154 million that they considered to contain high-quality relational data
● Work was done in 2008

Cafarella et al

Page 35: Data Integration - Brown University

ACSDb

● The authors created the attribute correlation statistics database (ACSDb)
● The ACSDb is “a set of statistics about the schemas in the corpus”

Cafarella et al

Page 36: Data Integration - Brown University

ACSDb

● The ACSDb makes possible (a toy auto-complete sketch follows):
○ schema auto-complete
○ attribute synonym finding
○ join-graph traversal

Cafarella et al
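A toy version of the schema auto-complete idea: from a corpus of schemas, suggest the attributes that co-occur most often with the ones the user has chosen so far. The tiny corpus here is made up; the real ACSDb was built from the 154 million extracted tables:

from collections import Counter
from itertools import combinations

corpus = [["make", "model", "year", "price"],
          ["make", "model", "color"],
          ["make", "model", "year", "mileage"]]

cooccur = Counter()
for schema in corpus:
    for a, b in combinations(sorted(set(schema)), 2):
        cooccur[(a, b)] += 1

def suggest(given, k=3):
    # Rank attributes by how often they co-occur with the ones already chosen.
    scores = Counter()
    for (a, b), n in cooccur.items():
        if a in given and b not in given:
            scores[b] += n
        elif b in given and a not in given:
            scores[a] += n
    return [attr for attr, _ in scores.most_common(k)]

print(suggest({"make"}))  # e.g. ['model', 'year', 'price']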

Page 37: Data Integration - Brown University

ACSDb

Cafarella et al

Page 38: Data Integration - Brown University

ACSDb

Cafarella et al

Page 39: Data Integration - Brown University

2013 Follow-Up

Schema Extraction for Tabular Data on the Web

This VLDB 2013 paper introduces row classes, a more flexible method for determining a table's schema

Adelfio et al

Page 40: Data Integration - Brown University

Conclusion

● For businesses, there are tradeoffs between specialized systems and integration
● Lots of research is being done on combining very large amounts of disparate data

Page 41: Data Integration - Brown University

References

● Xin Luna Dong and Divesh Srivastava, Big Data Integration, tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013
● Lise Getoor and Ashwin Machanavajjhala, Entity Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012)
● Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang, WebTables: Exploring the Power of Tables on the Web, PVLDB 1(1): 538-549 (2008)
● Marco D. Adelfio and Hanan Samet, Schema Extraction for Tabular Data on the Web, in Proceedings of the International Conference on Very Large Data Bases (VLDB), 2013
● http://research.cs.wisc.edu/dibook/