Data Integration - Brown University
Data Integration
Sam Birch & Alex Leblang
Two faces of data integration
● Businesses
  ○ Have relatively more structured databases which they need to organize
● Research on integrating less structured data
  ○ Databases coming from different organizations without common architecture
Businesses have a lot of data
…in many different databases
One safe place: the data warehouse
Businesses want to control and access their operational data in a single place:
1. Backups and versioning
2. Single query interface
3. Quality & consistency
4. Separation of analytic and operational workloads
Data warehouse design
According to Bill Inmon, a data warehouse means:
1. All data about a single real-world thing are linked;
2. Data are never overwritten or deleted;
3. Comprises all (or nearly all) of an organization's data in a consistent way;
4. Comprises all versions of the data in the operational system;
5. Not virtualized
Update methods
1. ETL
2. Virtual / federated databases
3. Change data capture
ETL
● Extract, Transform, Load
● Take data from one data source, transform it, and load it into another location, usually a data warehouse
● Generally periodic (hourly, daily…)
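The three ETL steps can be sketched as follows. This is a minimal illustration, not a production pipeline: the `orders` and `fact_orders` tables, their columns, and the cleaning rules are all hypothetical, and in-memory SQLite stands in for both the operational database and the warehouse.

```python
import sqlite3

# Hypothetical operational source holding raw order rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " alice ", 1250), (2, "BOB", 399)],
)

# Hypothetical warehouse with its own, cleaner schema.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount_dollars REAL)"
)

def extract(conn):
    """Extract: read the raw rows from the source system."""
    return conn.execute("SELECT id, customer, amount_cents FROM orders").fetchall()

def transform(rows):
    """Transform: trim and lowercase names, convert cents to dollars."""
    return [(oid, name.strip().lower(), cents / 100) for oid, name, cents in rows]

def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl():
    """One batch run; a scheduler would call this hourly or daily."""
    load(warehouse, transform(extract(source)))

run_etl()
loaded = warehouse.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()
```

Because the job runs periodically, the warehouse lags the source by up to one period.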
Virtual (federated) DBs
● A method for integrating data virtually
● No actual (physical) data integration
● Virtual database systems give you the feel of data integration without the need to maintain a single data warehouse
How they work
● Federated databases map autonomous database systems into a single database
● This is done over a computer network and has the advantage of possible geographic distribution
● Federated databases can be loosely or tightly coupled
Loosely / Tightly Coupled
● Loosely coupled databases require each component to construct its own schema
  ○ ...forces the user to have knowledge of the schema when using the database
● Tightly coupled databases use independent processes to create a schema used across the federated database
  ○ …moves much of the work from the user or DBA to the software itself
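As a rough sketch of the tightly coupled idea, the code below exposes one shared `customer(name, city)` view over two autonomous databases, so the caller never sees the local schemas. Everything here is hypothetical (the table names, the per-component queries, the `all_customers` helper), and in-memory SQLite connections stand in for remote systems reached over a network.

```python
import sqlite3

# Two autonomous component databases with different local schemas (hypothetical).
sales_eu = sqlite3.connect(":memory:")
sales_eu.execute("CREATE TABLE kunden (name TEXT, stadt TEXT)")
sales_eu.execute("INSERT INTO kunden VALUES ('Anna', 'Berlin')")

sales_us = sqlite3.connect(":memory:")
sales_us.execute("CREATE TABLE customers (full_name TEXT, city TEXT)")
sales_us.execute("INSERT INTO customers VALUES ('Bob', 'Providence')")

# Federated schema: customer(name, city). Each component is paired with the
# local query that produces rows in that shape.
COMPONENTS = [
    (sales_eu, "SELECT name, stadt FROM kunden"),
    (sales_us, "SELECT full_name, city FROM customers"),
]

def all_customers():
    """Answer a query on the federated schema by asking every component.

    No data is copied into a central store; rows are fetched at query time.
    """
    rows = []
    for conn, query in COMPONENTS:
        rows.extend(conn.execute(query).fetchall())
    return sorted(rows)
```

In a loosely coupled setup the caller would have to know and write the two local queries directly.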
Change data capture
● Keep track of diffs, like version control for your database
● Helpful for large data with smaller changes
● Different implementations:
  ○ Timestamps on rows
  ○ Version number on rows
  ○ Triggers on tables
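A minimal sketch of the version-number approach, using a hypothetical in-memory `VersionedTable` class rather than a real database. Each write stamps the row with a monotonically increasing version, so a consumer only needs to remember the last version it saw. Deletes are not captured here; a real system would need tombstones or triggers for that.

```python
class VersionedTable:
    """Toy table that stamps every written row with a version number."""

    def __init__(self):
        self.rows = {}    # key -> (value, version of last write)
        self.version = 0  # table-wide counter, bumped on every write

    def upsert(self, key, value):
        self.version += 1
        self.rows[key] = (value, self.version)

    def changes_since(self, last_seen_version):
        """Return only the rows written after the given version."""
        return {
            key: value
            for key, (value, version) in self.rows.items()
            if version > last_seen_version
        }

table = VersionedTable()
table.upsert("a", "first")
table.upsert("b", "second")
checkpoint = table.version       # the consumer has now seen everything
table.upsert("b", "second, edited")
delta = table.changes_since(checkpoint)
```

Only the edited row `b` appears in `delta`, which is the point when the data is large and the changes are small.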
Reforming unstructured data
Large and unstructured
The 4 Vs (according to Dong)
  ○ large Volume of sources
  ○ changing at a high Velocity
  ○ as well as a huge Variety of sources
  ○ with lots of questions regarding data Veracity
Dong et al.
Goals
● Schema alignment
● Record linkage
● Data fusion
Schema Alignment
Dong et al
Schema Alignment
● Mediated Schema
  ○ Identify domain-specific modeling
● Attribute Matching
  ○ Identify similarities between schema attributes
● Schema Mapping
  ○ Specify how to map records between different schemas
Dong et al
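A small sketch of what a schema mapping can look like once attribute matching is done. The mediated schema here is `(name, year)`; the source names, attribute names, and the `to_mediated` helper are all made up for illustration, and the mapping is written by hand rather than learned.

```python
# Hypothetical attribute matches: source attribute -> mediated-schema attribute.
MAPPINGS = {
    "source_a": {"title": "name", "yr": "year"},
    "source_b": {"movie_name": "name", "release_year": "year"},
}

def to_mediated(source, record):
    """Rewrite one source record into the mediated schema.

    Attributes with no known match are dropped.
    """
    mapping = MAPPINGS[source]
    return {mapping[attr]: value for attr, value in record.items() if attr in mapping}

rec_a = to_mediated("source_a", {"title": "Alien", "yr": 1979, "rating": "R"})
rec_b = to_mediated("source_b", {"movie_name": "Alien", "release_year": 1979})
```

After mapping, records from both sources share attribute names and can be compared directly.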
Record Linkage
Dong et al
Data Fusion
Dong et al
Data Fusion
● Reconciliation of conflicting non-identifying content
  ○ Voting
  ○ Source Quality
  ○ Copy Detection
Dong et al
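Voting and source quality can be combined in one small sketch: each source votes for the value it claims, and a vote counts for more when the source is trusted more. The `fuse` function, the source names, and the quality scores below are hypothetical, and copy detection (discounting sources that copy from each other) is not modeled.

```python
from collections import defaultdict

def fuse(claims, source_quality=None):
    """Pick one value from conflicting claims.

    claims: list of (source, value) pairs for the same attribute of one entity.
    source_quality: optional dict of per-source weights; without it every
    source gets weight 1.0, which is plain majority voting.
    """
    weights = defaultdict(float)
    for source, value in claims:
        if source_quality is None:
            weights[value] += 1.0
        else:
            weights[value] += source_quality.get(source, 1.0)
    return max(weights, key=weights.get)

claims = [("s1", "Providence"), ("s2", "Boston"), ("s3", "Boston")]
by_vote = fuse(claims)
by_quality = fuse(claims, {"s1": 0.9, "s2": 0.3, "s3": 0.3})
```

With equal weights the majority value wins; with the assumed quality scores the single trusted source outweighs the two weaker ones.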
Dealing With Different Data Sources
● Semantic Heterogeneity
● Access Pattern Limitations
● Integrity Constraints
● Data-level Heterogeneity
http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source
Semantic Heterogeneity
● Data integration can suffer from many issues
● Differences in:
  ○ organization of tables
  ○ naming of schemas
  ○ data-level representation
http://research.cs.wisc.edu/dibook/ Chapter 3: Data Source
Data-level Heterogeneity
● “115 Waterman St. 02912” / “Brown University CIT”
● “Tim Kraska” / “Timothy Kraska” / “[email protected]”
● IRA: Individual Retirement Account or Irish Republican Army?
● Arbitrarily hard: different descriptions / photos of the same place
Entity resolution (ER)
“[The] problem of identifying and linking/grouping different manifestations of the same real world object.”
Getoor, 2012.
Ironically, AKA: deduplication, entity clustering, merge/purge, fuzzy match, record linkage, approximate match...
Motivating examples
● Mining unstructured data (e.g. webpages)
● Governance (census, intelligence)
● Generally, when data comes from different organizations
Getoor, 2012.
ER Challenges
● Fundamental ambiguity
● Diversity in representations (format, truncation, ambiguity)
● Errors
● Missing data
● Records from different times
● Relationships in addition to equality
Getoor, 2012.
Normalization
● Transform data into a format which is more likely to match other similar data
  ○ Splitting / combining rows
● Canonicalization (e.g. phone numbers, URLs, case of text, expanding truncations)
  ○ Maximally informative, but standard format
● Logic is specific to data
Getoor, 2012.
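Two canonicalization helpers as a sketch. Both functions are hypothetical and, as the slide says, data-specific: `normalize_phone` assumes North American numbers and a default country code of 1, and `normalize_name` only handles case and whitespace.

```python
import re

def normalize_phone(raw, default_country="1"):
    """Reduce a phone number to '+' followed by digits only.

    Assumes a 10-digit input is a national number missing its country code.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country + digits
    return "+" + digits

def normalize_name(raw):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(raw.lower().split())

phone_a = normalize_phone("(401) 555-0100")
phone_b = normalize_phone("1-401-555-0100")
name = normalize_name("  Tim   KRASKA ")
```

Two differently formatted inputs end up as the same string, so a later exact comparison can match them.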
Pairwise matching
[Figure: Raw data, Normalized data, Matching features]
Getoor, 2012.
Matching features
● Edit distance (e.g. Levenshtein) for typos
● Set/vector similarity (Jaccard index, TF/IDF, dot-product) for
● Alignment (e.g. Monge-Elkan)
● Phonetic (e.g. Soundex)
Getoor, 2012.
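Two of the features listed above, written out as a sketch: Levenshtein edit distance using the standard dynamic-programming recurrence, and the Jaccard index over whitespace-separated tokens. Tokenizing on whitespace is an assumption made here; other tokenizations (character n-grams, for instance) are equally valid.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Size of the token intersection divided by size of the token union."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)
```

Edit distance is sensitive to character-level typos, while Jaccard ignores word order and tolerates missing words.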
Record linkage
● Matching record to record rather than datum to datum
  ○ May also require schema alignment
● Average of component similarities
● Specific rules about each column
● Probabilistic models with ML
  ○ Training data not trivial: most pairs are obviously not matches
Getoor, 2012.
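The "average of component similarities" option can be sketched as below. The per-field similarity uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whichever matching feature fits the column; the field names, the sample records, and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(rec_a, rec_b, fields):
    """Unweighted average of the per-field similarities."""
    return sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

def is_match(rec_a, rec_b, fields, threshold=0.8):
    """Declare a match when the average similarity reaches the threshold."""
    return record_similarity(rec_a, rec_b, fields) >= threshold

FIELDS = ["name", "city"]
r1 = {"name": "Tim Kraska", "city": "Providence"}
r2 = {"name": "Timothy Kraska", "city": "Providence"}
r3 = {"name": "xxxx", "city": "qqqq"}
```

Column-specific rules or a trained probabilistic model would replace the plain average and the fixed threshold.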
Collective matching and constraints
● Some data matching operations aren’t independent of the other data in the record
  ○ e.g. two research papers in the same venue are more likely to be by the same authors
● Expressed in constraints over the matching relationships of columns in a record
  ○ Transitivity (if A = B, and B = C then A = C)
  ○ Exclusivity (if A = B then B != C)
Getoor, 2012.
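One common way to enforce the transitivity constraint is to take the transitive closure of the pairwise matches, which a union-find structure does cheaply. This is a sketch of that one constraint only, with a hypothetical `cluster` function; it does not model exclusivity or the collective (relational) evidence described above.

```python
def cluster(records, matched_pairs):
    """Group records so that matching is transitive.

    records: iterable of hashable record ids.
    matched_pairs: iterable of (a, b) pairs judged to be the same entity.
    Returns a sorted list of sorted groups.
    """
    parent = {r: r for r in records}

    def find(x):
        # Walk up to the group representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return sorted(sorted(group) for group in groups.values())

entities = cluster(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
```

A and C end up in the same group even though they were never compared directly.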
ACSDb
● Developed at Google
● The authors looked at 14.1 billion HTML tables and from that found 154 million that they considered to contain high quality relational data
● Work was done in 2008
Cafarella et al
ACSDb
● The authors created the attribute correlation statistics database (ACSDb)
● ACSDb is “a set of statistics about the schemas in the corpus”
Cafarella et al
ACSDb
● ACSDb makes possible:
  ○ schema auto-complete
  ○ attribute synonym finding
  ○ join-graph traversal
Cafarella et al
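To make schema auto-complete concrete, here is a toy sketch loosely in the spirit of schema statistics: count how often each schema occurs in a corpus, then suggest the attributes that most often co-occur with the ones already typed. The corpus, the `build_stats` and `suggest` functions, and the ranking rule are assumptions made for illustration and should not be read as the paper's actual algorithm.

```python
from collections import Counter

def build_stats(schemas):
    """Count how many tables in the corpus use each schema (attribute set)."""
    return Counter(frozenset(schema) for schema in schemas)

def suggest(stats, given, top=3):
    """Rank candidate attributes by the estimated probability that they
    appear in a schema containing all of the given attributes."""
    given = set(given)
    total = 0
    counts = Counter()
    for schema, n in stats.items():
        if given <= schema:          # schema contains every given attribute
            total += n
            for attr in schema - given:
                counts[attr] += n
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(attr, n / total) for attr, n in ranked[:top]]

corpus = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["make", "model", "year"],
    ["name", "address"],
]
stats = build_stats(corpus)
suggestions = suggest(stats, ["make"])
```

Given `make`, the sketch proposes `model` first because every schema containing `make` in this toy corpus also contains `model`.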
ACSDb
Cafarella et al
2013 Follow Up
Extracting Tabular Data on the Web
A VLDB 2013 paper discusses row classes, which offer a more flexible method for determining a table's schema
Adelfio et al
Conclusion
● For businesses there are tradeoffs between specialized systems and integration
● Lots of research is being done involving combining very large amounts of disparate data
References
● Luna Dong and Divesh Srivastava, Big Data Integration, Tutorial in Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013
● Lise Getoor, Ashwin Machanavajjhala, Entity Resolution: Theory, Practice & Open Challenges, PVLDB 5(12): 2018-2019 (2012)
● Michael J. Cafarella, Alon Y. Halevy, Daisy Zhe Wang, Eugene Wu, Yang Zhang, WebTables: Exploring the Power of Tables on the Web, PVLDB 1(1): 538-549 (2008)
● Marco D. Adelfio, Hanan Samet, Schema Extraction for Tabular Data on the Web, In International Conference on Very Large Data Bases (VLDB), 2013
● http://research.cs.wisc.edu/dibook/