Data Integration - Brown University

Data Integration

Sam Birch & Alex Leblang

Two faces of data integration

● Businesses
  ○ Have relatively more structured databases which they need to organize
● Research on integrating less structured data
  ○ Databases coming from different organizations without common architecture

Businesses have a lot of data

…in many different databases

(Logos: PostgreSQL, Vertica, Microsoft SQL Server, Teradata, Greenplum, Amazon S3, Oracle, Cloudera, Hortonworks)

One safe place: the data warehouse

Businesses want to control and access their operational data in a single place:

1. Backups and versioning
2. Single query interface
3. Quality & consistency
4. Separation of analytic and operational workloads


Data warehouse design

According to Bill Inmon, a data warehouse means:

1. All data about a single real-world thing are linked;
2. Data are never over-written or deleted;
3. Comprises all (or nearly all) of an organization's data in a consistent way;
4. Comprises all versions of the data in the operational system;
5. Not virtualized

Update methods

1. ETL
2. Virtual / federated databases
3. Change data capture

ETL

● Extract, Transform, Load
● Take data from one data source, transform it, and load it into another location, usually a data warehouse
● Generally periodic (hourly, daily…)
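The three ETL steps can be sketched in a few lines. This is a minimal illustration using Python's built-in `sqlite3`, not a production pipeline; the table names (`orders`, `fact_orders`), the columns, and the cents-to-dollars transform are all hypothetical choices made for the example.

```python
import sqlite3

def run_etl(source: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy rows from a (hypothetical) operational table into a warehouse table."""
    # Extract: read the raw rows from the operational database.
    rows = source.execute(
        "SELECT id, customer, amount_cents FROM orders"
    ).fetchall()

    # Transform: clean up names and convert cents to a decimal amount.
    cleaned = [
        (oid, customer.strip().lower(), cents / 100.0)
        for oid, customer, cents in rows
    ]

    # Load: write the cleaned rows into the warehouse.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps a periodic re-run from failing on rows
    # that were already loaded (an assumption made for this sketch).
    warehouse.executemany(
        "INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", cleaned
    )
    warehouse.commit()
    return len(cleaned)

def demo_etl():
    """Build a tiny in-memory source, run one ETL pass, return the result."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "  Alice ", 1250), (2, "BOB", 300)],
    )
    warehouse = sqlite3.connect(":memory:")
    run_etl(source, warehouse)
    return warehouse.execute(
        "SELECT id, customer, amount FROM fact_orders ORDER BY id"
    ).fetchall()
```

In practice a scheduler would call something like `run_etl` hourly or daily, matching the periodic nature noted above.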


Virtual (federated) DBs

● A method for integrating data virtually
● No actual (physical) data integration
● Virtual database systems give you the feel of data integration without the need for maintaining one single data warehouse

How they work

● Federated databases map autonomous database systems into a single database
● This is done over a computer network and has the advantage of possible geographic distribution
● Federated databases can be loosely or tightly coupled
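As a rough sketch of the idea, the snippet below answers one query over two autonomous databases at query time, without copying data into a warehouse. Everything here is an assumption made for illustration: the two in-memory `sqlite3` databases stand in for remote systems, and the table and column names (`customers`, `clients`, `name`, `city`) are hypothetical.

```python
import sqlite3

def make_sources():
    """Create two independent databases with different local schemas."""
    east = sqlite3.connect(":memory:")
    east.execute("CREATE TABLE customers (name TEXT, city TEXT)")
    east.execute("INSERT INTO customers VALUES ('Ada', 'Providence')")

    west = sqlite3.connect(":memory:")
    west.execute("CREATE TABLE clients (full_name TEXT, town TEXT)")
    west.execute("INSERT INTO clients VALUES ('Grace', 'Seattle')")
    return east, west

def federated_customers(east, west):
    """Answer a query over the shared (name, city) schema at query time.

    Each source keeps its own data; the mediator only knows how to
    rewrite the shared query into each source's local schema.
    """
    mappings = [
        (east, "SELECT name, city FROM customers"),
        (west, "SELECT full_name, town FROM clients"),
    ]
    results = []
    for conn, local_query in mappings:
        results.extend(conn.execute(local_query).fetchall())
    return sorted(results)
```

Because the mapping to one shared schema lives in the mediator rather than with the user, this sketch is closer to the tightly coupled style.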

Loosely / Tightly Coupled

● Loosely coupled databases require each component to construct its own schema
  ○ ...forces the user to have knowledge of the schema when using the database
● Tightly coupled databases use independent processes to create a schema used across the federated database
  ○ ...moves much of the work from the user or DBA to the software itself

Change data capture

● Keep track of diffs, like version control for your database
● Helpful for large data with smaller changes
● Different implementations:
  ○ Timestamps on rows
  ○ Version number on rows
  ○ Triggers on tables
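The "version number on rows" variant can be sketched as follows. This is a minimal illustration with an in-memory `sqlite3` database; the `users` table, its `version` column, and the idea of the consumer remembering a high-water mark are assumptions made for the example. A timestamp column would work the same way.

```python
import sqlite3

def changes_since(conn: sqlite3.Connection, last_version: int):
    """Return only the rows whose version is newer than the last one seen."""
    return conn.execute(
        "SELECT id, name, version FROM users WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()

def demo_cdc():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)", [(1, "ann", 1), (2, "bob", 2)]
    )

    # First sync: everything is new.
    first = changes_since(conn, 0)
    high_water = max(version for _, _, version in first)

    # The writer bumps the version whenever it changes a row.
    conn.execute("UPDATE users SET name = 'anna', version = 3 WHERE id = 1")

    # Second sync: only the changed row is returned.
    second = changes_since(conn, high_water)
    return first, second
```

One limitation of this sketch is that a deleted row simply disappears and is never reported; trigger-based capture is one way to record deletions as well.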

Reforming unstructured data

Large and unstructured

The 4 Vs (according to Dong)

○ large Volume of sources
○ changing at a high Velocity
○ as well as a huge Variety of sources
○ with many questions regarding data Veracity

Dong et al.

Goals

● Schema alignment
● Record linkage
● Data fusion

Schema Alignment

Dong et al

Schema Alignment

● Mediated Schema
  ○ Identify domain specific modeling
● Attribute Matching
  ○ Identify similarities between schema attributes
● Schema Mapping
  ○ Specify how to map records between different schemas
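Attribute matching can be illustrated with a simple name-similarity heuristic. This sketch is not the method from Dong et al.; it only compares attribute names using `difflib.SequenceMatcher` from the Python standard library, and the 0.5 threshold and the example attribute names are arbitrary assumptions.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity of two attribute names in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_attributes(schema_a, schema_b, threshold=0.5):
    """Map each attribute in schema_a to its most similar name in schema_b.

    Attributes whose best candidate scores below the threshold are left
    unmatched rather than forced onto a poor candidate.
    """
    matches = {}
    for a in schema_a:
        best = max(schema_b, key=lambda b: name_similarity(a, b))
        if name_similarity(a, best) >= threshold:
            matches[a] = best
    return matches

mapping = match_attributes(
    ["name", "phone_number", "zip"],
    ["full_name", "phone", "zipcode"],
)
```

Real systems typically also compare data values and types, since names alone miss synonyms such as "town" and "city".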

Dong et al

Record Linkage

Dong et al

Data Fusion

Dong et al

Data Fusion

● Reconciliation of conflicting non-identifying content
  ○ Voting
  ○ Source Quality
  ○ Copy Detection
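Voting and source quality can be sketched together as a weighted vote. This is an illustrative simplification, not the algorithm from Dong et al.: the source names, the quality scores, and the default weight of 0.5 for unknown sources are all hypothetical, and copy detection is not modeled.

```python
from collections import defaultdict

def fuse(claims, source_quality=None):
    """Pick one value from conflicting (source, value) claims.

    With no source_quality, every source gets one vote (plain voting).
    With source_quality, each vote is weighted by the source's score.
    """
    weights = defaultdict(float)
    for source, value in claims:
        if source_quality is None:
            weights[value] += 1.0
        else:
            # Unknown sources get a neutral weight (an assumption).
            weights[value] += source_quality.get(source, 0.5)
    return max(weights, key=weights.get)

claims = [("s1", "Providence"), ("s2", "Boston"), ("s3", "Boston")]
by_vote = fuse(claims)
by_quality = fuse(claims, {"s1": 0.95, "s2": 0.3, "s3": 0.3})
```

The two results differ here: the majority says "Boston", while the single high-quality source outweighs two low-quality ones. If `s3` had merely copied `s2`, counting both votes would be misleading, which is the motivation for copy detection.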

Dong et al

Dealing With Different Data Sources

● Semantic Heterogeneity
● Access Pattern Limitations
● Integrity Constraints
● Data-level Heterogeneity

http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source



Semantic Heterogeneity

● Data integration can suffer from many issues
● Differences in:
  ○ organization of tables
  ○ naming of schemas
  ○ data-level representation

http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source



Data-level Heterogeneity

● “115 Waterman St. 02912” / “Brown University CIT”
● “Tim Kraska” / “Timothy Kraska” / “[email protected]”
● IRA: Individual Retirement Account or Irish Republican Army?
● Arbitrarily hard: different descriptions / photos of the same place

Entity resolution (ER)

“[The] problem of identifying and linking/grouping different manifestations of the same real world object.”

Getoor, 2012.

Ironically, AKA: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match...

Motivating examples

● Mining unstructured data (e.g. webpages)
● Governance (census, intelligence)
● Generally, when data comes from different organizations

Getoor, 2012.

ER Challenges

● Fundamental ambiguity
● Diversity in representations (format, truncation, ambiguity)
● Errors
● Missing data
● Records from different times
● Relationships in addition to equality

Getoor, 2012.

Normalization

● Transform data into a format which is more likely to match other similar data
  ○ Splitting / combining rows
● Canonicalization (e.g. phone numbers, URLs, case of text, expanding truncations)
  ○ Maximally informative, but standard format
● Logic is specific to data
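Canonicalization can be illustrated with two small helpers. These are sketches under stated assumptions: `normalize_phone` assumes ten-digit North American numbers with a default country code of 1, and the phone numbers shown are made-up examples. As the last bullet says, real normalization logic depends on the data.

```python
import re

def normalize_phone(raw: str, default_country: str = "1") -> str:
    """Reduce a phone number to a '+<country><number>' string of digits."""
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, parentheses
    if len(digits) == 10:
        # No country code present: assume the default one.
        digits = default_country + digits
    return "+" + digits

def normalize_text(raw: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(raw.lower().split())

same_number = (
    normalize_phone("(555) 010-0199") == normalize_phone("1 555 010 0199")
)
```

After normalization, two differently formatted spellings of the same number compare equal with plain string equality, which makes the later matching steps much simpler.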

Getoor, 2012.

Pairwise matching

Raw data → Normalized data → Matching features

Getoor, 2012.

Matching features

● Edit distance (e.g. Levenshtein) for typos
● Set/vector similarity (Jaccard index, TF/IDF, dot-product) for
● Alignment (e.g. Monge-Elkan)
● Phonetic (e.g. Soundex)
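Two of these features are short enough to write out. The sketch below implements Levenshtein edit distance with the standard dynamic-programming recurrence and the Jaccard index over word sets; splitting on whitespace to form the sets is an assumption made for the example (character n-grams are another common choice).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Size of the intersection over size of the union of the word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)
```

For example, `levenshtein("kitten", "sitting")` is 3, and `jaccard("Tim Kraska", "Timothy Kraska")` is 1/3 because only one of the three distinct words is shared.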

Getoor, 2012.

Record linkage

● Matching record to record rather than datum to datum
  ○ May also require schema alignment
● Average of component similarities
● Specific rules about each column
● Probabilistic models with ML
  ○ Training data not trivial: most pairs are obviously not matches
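The "average of component similarities" approach can be sketched as below. The per-field similarity uses `difflib.SequenceMatcher` from the standard library as a stand-in for any of the matching features; the 0.8 threshold, the field names, and the example records are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1: dict, r2: dict, fields) -> float:
    """Unweighted average of the per-field similarities."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

def is_match(r1: dict, r2: dict, fields, threshold: float = 0.8) -> bool:
    """Declare a match when the average similarity reaches the threshold."""
    return record_similarity(r1, r2, fields) >= threshold

fields = ["name", "city"]
r1 = {"name": "Tim Kraska", "city": "Providence"}
r2 = {"name": "Timothy Kraska", "city": "Providence"}
r3 = {"name": "Alex Leblang", "city": "Boston"}
```

Here `r1` and `r2` are linked while `r1` and `r3` are not. Column-specific rules or a learned model would replace the unweighted average when some fields are more informative than others.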

Getoor, 2012.

Collective matching and constraints

● Some data matching operations aren’t independent of the other data in the record
  ○ e.g. two research papers in the same venue are more likely to be by the same authors
● Expressed in constraints over the matching relationships of columns in a record
  ○ Transitivity (if A = B, and B = C then A = C)
  ○ Exclusivity (if A = B then B != C)
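The transitivity constraint can be enforced by taking the transitive closure of the pairwise matches. One common way to do this, shown here as a sketch rather than as the method from Getoor's tutorial, is a union-find (disjoint-set) structure; the record identifiers are placeholders.

```python
def cluster_matches(records, matched_pairs):
    """Group records into entities so that matching is transitive.

    If A matches B and B matches C, all three end up in one cluster
    even though A and C were never compared directly.
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return sorted(sorted(c) for c in clusters.values())

entities = cluster_matches(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

Blindly closing over noisy pairwise matches can chain unrelated records together, so exclusivity-style constraints are a useful counterweight.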

Getoor, 2012.


ACSDb

● Developed at Google
● The authors looked at 14.1 billion HTML tables and from that found 154 million that they considered to contain high quality relational data
● Work was done in 2008

Cafarella et al

ACSDb

● The authors created the attribute correlation statistics database (ACSDb)
● The ACSDb is “a set of statistics about the schemas in the corpus”

Cafarella et al

ACSDb

● The ACSDb makes possible:
  ○ schema auto-complete
  ○ attribute synonym finding
  ○ join-graph traversal
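Schema auto-complete can be sketched on top of simple schema counts. This is a toy approximation, not the implementation from Cafarella et al.: it stores how often each distinct schema occurs and suggests the attribute that most often co-occurs with the ones already given. The function names and the example schemas are hypothetical.

```python
from collections import Counter

def build_stats(schemas):
    """Count how often each distinct schema (set of attribute names) occurs."""
    return Counter(frozenset(s) for s in schemas)

def autocomplete(stats, given):
    """Suggest the attribute that most often co-occurs with the given ones."""
    given = set(given)
    counts = Counter()
    for schema, freq in stats.items():
        if given <= schema:  # the schema contains every given attribute
            for attr in schema - given:
                counts[attr] += freq
    return counts.most_common(1)[0][0] if counts else None

stats = build_stats([
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["make", "model", "year"],
    ["name", "address"],
])
```

Calling `autocomplete(stats, ["make"])` suggests "model", and repeating the call with the growing attribute list extends the schema one attribute at a time.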

Cafarella et al

ACSDb

Cafarella et al

2013 Follow Up

Extracting Tabular Data on the Web

This VLDB 2013 paper uses row classes as a more flexible way to determine the table schema

Adelfio et al

Conclusion

● For businesses there are tradeoffs between specialized systems and integration

● Much current research addresses combining very large amounts of disparate data

References

● Luna Dong and Divesh Srivastava, Big Data Integration, Tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013
● Lise Getoor, Ashwin Machanavajjhala, Entity Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012)
● Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang, WebTables: Exploring the Power of Tables on the Web, PVLDB 1(1): 538-549 (2008)
● Marco D. Adelfio, Hanan Samet, Schema Extraction for Tabular Data on the Web, In International Conference on Very Large Data Bases (VLDB), 2013
● http://research.cs.wisc.edu/dibook/
