Page 1: Data Integration: The Teenage Years Alon Halevy (Google) Anand Rajaraman (Kosmix) Joann Ordille (Avaya) VLDB 2006.

Data Integration: The Teenage Years

Alon Halevy (Google)

Anand Rajaraman (Kosmix)

Joann Ordille (Avaya)

VLDB 2006

Page 2

Agenda

• A few perspectives on the last 10 years
  – Technical, commercial

• Perspectives from our personal paths

• Wild speculations about the future

• This is not a survey on data integration
  (See the paper in the proceedings for another non-survey)

Page 3

Acknowledgements

Other members of the Information Manifold Project:
  – Jaewoo Kang (NCSU, Korea Univ.)
  – Divesh Srivastava (AT&T Labs)
  – Shuky Sagiv (Hebrew U.)
  – Tom Kirk

Page 4

Acknowledgements

To the SIGMOD 1996 Program committee

For rejecting the earlier version of the paper.

Page 5

Timeline

[Timeline figure: 1995 to 2006]

Page 6

Data Integration

[Figure labels: Legacy Databases; Services and Applications; Enterprise Databases]

[Figure labels: Sequenceable Entity; Gene; Phenotype; Structured Vocabulary; Experiment; Protein; Nucleotide Sequence; Microarray Experiment]

Page 7

The Information Manifold

• Goal: integrate data from multiple sources on the web:
  Find the Woody Allen movies playing in my area, and their reviews

• Need to describe the data sources:
  – Contents, constraints, access patterns

Page 8

[Architecture figure labels: wrapper (one per source); Mediated Schema; Semantic mappings; query reformulation; optimization & execution; Design time; Run time]

Page 9

Semantic Mappings [a.k.a. Source Descriptions]

Source schemas:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)

Mediated Schema:
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …

[Figure label on the mappings: logic]

Page 10

Global-as-View (GAV)

Sources: R1, R2, R3, R4, R5

Mediated Schema:
CD: ASIN, Title, Genre, …
Artist: ASIN, name, …

Mapping:
CD(A,T,G) :- R1(A,T,G)
CD(A,T,G) :- R2(A,T), R3(T,G)
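
A minimal sketch of how GAV rules might be used at query time, assuming the sources are small in-memory sets. The sample tuples and the function names `cd_from_sources` and `query_titles_by_genre` are made up for illustration. Because each rule defines the mediated relation CD directly over the sources, a query over CD can be answered by unfolding the rules:

```python
# Toy source relations (hypothetical data, not from the talk).
R1 = {("B001", "Kind of Blue", "Jazz")}   # R1(ASIN, Title, Genre)
R2 = {("B002", "La Vie en Rose")}         # R2(ASIN, Title)
R3 = {("La Vie en Rose", "French")}       # R3(Title, Genre)


def cd_from_sources(r1, r2, r3):
    """Unfold the two GAV rules for CD(A, T, G).

    CD(A,T,G) :- R1(A,T,G)
    CD(A,T,G) :- R2(A,T), R3(T,G)
    """
    cd = set(r1)                          # first rule: copy R1
    for (a, t) in r2:                     # second rule: join R2 and R3 on title
        for (t2, g) in r3:
            if t == t2:
                cd.add((a, t, g))
    return cd


def query_titles_by_genre(genre):
    """Mediated-schema query Q(T) :- CD(A, T, genre), answered by unfolding."""
    return {t for (_a, t, g) in cd_from_sources(R1, R2, R3) if g == genre}


french = query_titles_by_genre("French")
jazz = query_titles_by_genre("Jazz")
```

The user only ever mentions CD; the rule bodies decide which sources are touched and how they are joined.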

Page 11

Local-as-View (LAV)

Sources: R1, R2, R3, R4, R5

Mediated Schema:
CD: ASIN, Title, Genre, Year
Artist: ASIN, Name, …

Mapping:
R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970
R2(A,T) :- CD(A,T,"French",Y)

Page 12

Query Answering in LAV = Answering Queries Using Views

Given a set of views V1,…,Vn,

And a query Q,

Can we answer Q using only the answers to V1,…,Vn?
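
A small illustration of the question, assuming the two LAV view definitions R1 and R2 from the Local-as-View slide. The mediated instance, the sample tuples, and the function names are hypothetical; a real integration system never sees the mediated database, only the view answers. The sketch checks that a rewriting written purely over the views returns answers contained in those of the query Q(T) :- CD(A,T,"French",Y):

```python
# Hypothetical instance of the mediated schema; a real system never sees it.
CD = {
    ("A1", "Ne me quitte pas", "French", 1959),
    ("A2", "Moon Safari", "French", 1998),
    ("A3", "Kind of Blue", "Jazz", 1959),
}
ARTIST = {("A1", "Jacques Brel"), ("A3", "Miles Davis")}


def view_r1(cd, artist):
    """R1(A,T,G) :- CD(A,T,G,Y), Artist(A,N), Y < 1970"""
    artist_asins = {a for (a, _n) in artist}
    return {(a, t, g) for (a, t, g, y) in cd if a in artist_asins and y < 1970}


def view_r2(cd):
    """R2(A,T) :- CD(A,T,"French",Y)"""
    return {(a, t) for (a, t, g, _y) in cd if g == "French"}


def query(cd):
    """Q(T) :- CD(A,T,"French",Y), evaluated on the (unavailable) database."""
    return {t for (_a, t, g, _y) in cd if g == "French"}


def rewriting(r1, r2):
    """Union of two rewritings that mention only the views:
    Q'(T) :- R2(A,T)   and   Q'(T) :- R1(A,T,"French")."""
    return {t for (_a, t) in r2} | {t for (_a, t, g) in r1 if g == "French"}


# Sources are assumed sound but possibly incomplete: here R2 is missing a tuple.
r1 = view_r1(CD, ARTIST)
r2 = view_r2(CD) - {("A1", "Ne me quitte pas")}

answers = rewriting(r1, r2)
```

In this toy case the tuple missing from R2 is recovered through R1, which is why a rewriting for integration is typically a union over several sources rather than a single equivalent plan.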

Page 13

AQUV (I)

• [Larson et al., 85 & 87], [Tsatalos et al., 94], [Chaudhuri et al., 95], …

• Focus on AQUV (answering queries using views) for:
  – Query optimization
  – Supporting physical data independence

• Every commercial DBMS supports AQUV.

Page 14

AQUV (II)

• AQUV for data integration:
  – Find maximally contained rewriting
  – Not necessarily equivalent rewriting

• Algorithms:
  – Bucket algorithm [LRO, 96]
  – Inverse rules [Duschka, 97]
  – Minicon [Pottinger and Halevy, 2000]

• Views and security: [Miklau and Suciu, 04]

Survey: Halevy, VLDB Journal, 2001

Page 15

Some Subsequent Results

• Semantics of data integration:
  – Abiteboul & Duschka, 1998: certain answers
  – Open vs. closed world assumption (OWA vs. CWA)

• CWA is bad complexity news!

Survey: Lenzerini, PODS 2002

Page 16

Certain Answers

Mediated schema: Route (Origin, Destination)

Source 1: Origins
SF
NY

Source 2: Destinations
Seattle
Seoul

Possible databases:

Origin  Destination
SF      Seattle
NY      Seoul

Origin  Destination
SF      Seoul
NY      Seattle

Query: Route (SF, Seattle)?
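
The example can be checked mechanically. The sketch below is an illustration under stated assumptions, not the formal definition: it considers only Route instances over the four named cities and requires their projections to equal the two sources (a closed-world reading). The function names are hypothetical. A tuple is a certain answer only if it appears in every such database:

```python
from itertools import combinations, product

ORIGINS = {"SF", "NY"}                 # Source 1
DESTINATIONS = {"Seattle", "Seoul"}    # Source 2


def possible_databases():
    """Enumerate Route instances over the known cities whose projections
    equal the two sources (a closed-world reading of the example)."""
    pairs = sorted(product(ORIGINS, DESTINATIONS))
    for size in range(len(pairs) + 1):
        for subset in combinations(pairs, size):
            db = set(subset)
            if ({o for o, _ in db} == ORIGINS
                    and {d for _, d in db} == DESTINATIONS):
                yield db


def certain_answers():
    """Tuples of Route that appear in every possible database."""
    dbs = list(possible_databases())
    result = set(dbs[0])
    for db in dbs[1:]:
        result &= db
    return result


databases = list(possible_databases())
is_certain = ("SF", "Seattle") in certain_answers()
```

The enumeration includes the two databases shown on the slide, plus the larger instances that also satisfy both sources. Route (SF, Seattle) is missing from at least one of them, so it is not a certain answer.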

Page 17

Some Subsequent Results

• Limitations due to binding patterns
  – Input title, get book info [Rajaraman et al., 95]

• Additional query processing capabilities
  – Form applies multiple predicates

• Disjunction, negation in sources.

• Ordering sources, probabilistic mappings
  – [Florescu et al., 97, Doan et al., Dong et al.]

• GLAV [Millstein et al., 99]

Survey: Lenzerini, PODS 2002
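
A rough sketch of what a binding-pattern limitation means for execution, assuming two made-up sources: one that lists titles for a given author, and one that, like a web form, returns book info only when a title is supplied. All names and data are hypothetical. The executor cannot scan the second source, so it has to feed it bindings obtained from the first (a dependent join):

```python
# Hypothetical sources; names and data are illustrative only.
BOOK_INFO = {
    "Foundations of Databases": {"isbn": "isbn-1", "price": 60.0},
    "Database Systems": {"isbn": "isbn-2", "price": 95.0},
}
TITLES_BY_AUTHOR = {"Abiteboul": ["Foundations of Databases"]}


def titles_by_author(author):
    """Source with binding pattern (author: bound, title: free)."""
    return TITLES_BY_AUTHOR.get(author, [])


def book_info(title):
    """Source with binding pattern (title: bound, info: free).
    It cannot be scanned; a title must be supplied."""
    return BOOK_INFO.get(title)


def prices_for_author(author):
    """Dependent join: bindings for `title` come from the first source
    and are fed, one at a time, into the second."""
    results = []
    for title in titles_by_author(author):
        info = book_info(title)
        if info is not None:
            results.append((title, info["price"]))
    return results


prices = prices_for_author("Abiteboul")
```

The order of the two calls is forced by the access patterns, which is why binding patterns constrain the space of executable plans.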

Page 18

A word on Description Logics

• Selecting relevant sources = reasoning.

• Description logics to the rescue:
  – [Catarci and Lenzerini, 93]

• Information Manifold
  – Combined the Classic DL with Datalog (CARIN)
  – See AAAI-96 (not SIGMOD)

• Brought DL and DB closer together.
  – A very active area of research today.

Page 19

[Timeline figure: 1995 to 2006]

Page 20

XML and Semi-structured Data

• Tsimmis: semi-structured data for integration.

• XML: whetted the integration appetites
  – We have the syntax
  – Now just solve the silly semantics problems
  – Don’t bother: we’ll all standardize on DTDs.

• XML will have a significant role in the data integration industry and research.

Page 21

[Timeline figure: 1995 to 2006]

Page 22

Back in the Lab…

• Two observations:
  – Who’s going to write all these LAV/GAV formulas?
  – This was the bottleneck.

• Once we have mappings, how can we execute queries?
  – Traditional plan-then-execute doesn’t work.

Page 23

Semantic Mappings

Inventory Database A:
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B:
Books(Title, ISBN, Price, DiscountPrice, Edition)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
BookCategories(ISBN, Category)
CDCategories(ASIN, Category)
Artists(ASIN, ArtistName, GroupName)
Authors(ISBN, FirstName, LastName)

“Standards are great, but there are too many of them.”

Page 24

Techniques for Schema Mapping
[Survey by Rahm and Bernstein, VLDBJ 2001]

• Compare schema elements based on:
  – Names (or n-grams)
  – Data types and instances
  – Text descriptions, integrity constraints

• Combine multiple techniques:
  – [Momis, Cupid, LSD, Coma]

• Create mappings from matches
  – [Clio @ IBM + Miller]
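
As one illustration of the name-based cue, here is a minimal sketch that compares element names by character n-grams. It is not taken from any of the cited systems: the Jaccard score, the padding character, the threshold of 0.2, and the function names are all assumptions chosen for the example. The element names are borrowed from the real-estate example later in the talk:

```python
def ngrams(name, n=3):
    """Character n-grams of a lowercased, padded element name."""
    padded = f"#{name.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def name_similarity(a, b, n=3):
    """Jaccard similarity of the two names' n-gram sets."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def best_matches(source_elems, target_elems, threshold=0.2):
    """For each source element, the most similar target element,
    kept only if its score reaches the (arbitrary) threshold."""
    matches = {}
    for s in source_elems:
        score, target = max((name_similarity(s, t), t) for t in target_elems)
        if score >= threshold:
            matches[s] = target
    return matches


source = ["listed-price", "phone", "comments", "location"]
mediated = ["address", "price", "agent-phone", "description"]
matches = best_matches(source, mediated)
```

Names alone pair listed-price with price and phone with agent-phone, but they find nothing for comments or location, which is the motivation for combining several kinds of evidence.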

Page 25

A Machine Learning Approach
[Doan et al., 2001, ACM Distinguished Dissertation 2003]

• Many mapping tasks are repetitive

• Learn from previous experience:
  – Build a classifier for every element of the mediated schema.
  – Many kinds of cues => meta-strategy learning

[Figure: given matches to the mediated schema, predict new ones]

Page 26

Matching Real-Estate Sources

Mediated schema: address, price, agent-phone, description

Schema of realestate.com: location, listed-price, phone, comments

realestate.com
location     listed-price   phone            comments
Miami, FL    $250,000       (305) 729 0831   Fantastic house
Boston, MA   $110,000       (617) 253 1429   Great location
...          ...            ...              ...

homes.com
price      contact-phone    extra-info
$550,000   (278) 345 7215   Beautiful yard
$320,000   (617) 335 2315   Great beach
...        ...              ...

Learned hypotheses:
  – If “phone” occurs in the name => agent-phone
  – If “fantastic” & “great” occur frequently in data values => description

Page 27

Reference Reconciliation
To Join or not to Join?

• Many ways to refer to the same object in the world:
  – “IBM”, “International Business Machines”
  – Alon Levy, Alon Halevy

• Automated methods are a necessity
  – Can’t go through all the data manually

• Very active area in ML, KDD, DB, UAI, …
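
A deliberately simple sketch of an automated check for the two kinds of variation on this slide, using Python's standard `difflib`. The acronym rule, the 0.8 threshold, and the function names are assumptions made for illustration; the research systems alluded to here rely on much richer evidence:

```python
from difflib import SequenceMatcher


def acronym(name):
    """Initial letters of the words in a name, uppercased."""
    return "".join(word[0] for word in name.split()).upper()


def same_entity(a, b, threshold=0.8):
    """Heuristic test that two strings refer to the same real-world object:
    either one is the acronym of the other, or the strings are similar."""
    if a.upper() == acronym(b) or b.upper() == acronym(a):
        return True
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


ibm = same_entity("IBM", "International Business Machines")
alon = same_entity("Alon Levy", "Alon Halevy")
other = same_entity("IBM", "Intel")
```

A fixed threshold like this both misses true matches and merges distinct people with similar names, which is part of why the problem remains an active research area.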

Page 28

Query Processing
To Plan or to Execute?

• In addition to distributed query processing issues:
  – Few statistics, if any.
  – Network behavior issues: latency, burstiness, …
  – Garlic @IBM

• “Adaptive query processing”:
  – Stonebraker saw it coming in Ingres.
  – Revivals by Graefe (1993) and DeWitt (1998).
  – Query scrambling [Urhan & Franklin]
  – Eddies [Avnur & Hellerstein]
  – Convergent query processing [Ives et al.]

Page 29

[Timeline figure: 1995 to 2006]

Page 30

Commercialization

• Late 90’s – anything goes.

• Want money from VC’s?
  – Say “XML” 3 times loud and clear.

• Academia at the forefront:
  – Nimble (UW), Cohera (Berkeley), Enosys (UCSD), …

• Big companies took notice
  – Some faster than others

Page 31

Commercialization Retrospective
[See Panel-of-Experts, SIGMOD 05]

• Uphill battle vs. the warehousing folks
  – Virtual integration was more “pay-as-you-go”

• Another battle with the EAI folks
  – Should really be a symbiosis there.

• Go vertical or horizontal?
  – Obvious: go vertical if you can find the right one.

• The technology worked
  – But it’s all in the timing…

Page 32

After $30M…

[Architecture figure labels, Nimble product stack]
Front-End: XML Query; User Applications; Lens™ File InfoBrowser™; Software Developers Kit; NIMBLE™ APIs
Management Tools: Lens Builder™; Integration Builder; Security Tools; Data Administrator; Concordance Developer
Integration Layer: Nimble Integration Engine™ (Compiler, Executor, Metadata Server, Cache)
Common XML View over: Relational; Data Warehouse/Mart; Legacy; Flat File; Web Pages

Page 33

[Timeline figure: 1995 to 2006, labeled NASDAQ]

Page 34

So… Back in the Lab

• Model management

• Peer data management systems

• Data exchange

Page 35

Model Management
[Bernstein et al.]

• Generic infrastructure for managing schemas and mappings:
  – Manipulate models and mappings as bulk objects
  – Operators to create & compose mappings, merge & diff models
  – Short operator scripts can solve schema integration, schema evolution, reverse engineering, etc.

• First challenge: semantics of operators.

Page 36

Peer Data Management Systems

[Figure: network of peers (Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo)) linked by LAV and GLAV mappings, with query labels Q and Q1 to Q6]

Page 37

PDMS-Related Projects

• Piazza (Washington)
• Hyperion (Toronto)
• PeerDB (Singapore)
• Local relational models (Trento, Toronto)
• Active XML (INRIA)
• Edutella (Hannover, Germany)
• Semantic Gossiping (EPFL Lausanne)
• Raccoon (UC Irvine)
• Orchestra (U. Penn)

Page 38

PDMS Challenges

[Figure: peer network of Berkeley, Stanford, DBLP, UW (Washington), UW (Wisconsin), CiteSeer, UW (Waterloo)]

• Semantics:
  – Careful about cycles

• Optimization:
  – Compose mappings
  – Prune paths

• Manage networks:
  – Consistency
  – Quality
  – Caching

Page 39

Data Exchange

• Key question: given an instance of S and a mapping, create an instance for T.

• [Fagin, Kolaitis, Popa & Tan]

[Figure: source schema S, mapping M, target schema T]
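
A toy sketch of the key question, assuming a made-up source relation Emp(name, dept) and target relations Works(name, dept_id) and Dept(dept_id, dept_name). The mapping and the function name `exchange` are hypothetical. The target needs a department id that the source does not provide, so the sketch invents a fresh placeholder (a labeled null) for each source tuple, in the spirit of chase-based approaches:

```python
from itertools import count


def exchange(emp):
    """Apply the hypothetical mapping
        Emp(n, d) -> exists I. Works(n, I) and Dept(I, d)
    to a source instance, inventing a fresh labeled null for I each time."""
    fresh = count(1)
    works, dept = set(), set()
    for (name, dept_name) in sorted(emp):
        null = f"_N{next(fresh)}"          # labeled null standing for the unknown dept id
        works.add((name, null))
        dept.add((null, dept_name))
    return works, dept


source_instance = {("Alice", "Sales"), ("Bob", "Research")}
works, dept = exchange(source_instance)
```

Many target instances satisfy such a mapping. Which one to materialize, and how to answer queries over it, is the subject of the cited line of work.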

Page 40

[Timeline figure: 1995 to 2006]

Page 41

[Timeline figure: 1995 to 2006, followed by "?"]

Page 42

2006 Status Report
[The People Angle]

• Joann @ Avaya
  – Integrating communications into business processes

• Anand @ Kosmix
  – Creating a new kind of search company

• Alon @ Google
  – Working for Joann’s old boss
  – Deep web evangelist
  – Pondering data management for the masses

Page 43

2006 Status Report
[Enterprise Angle]

• Enterprise Information Integration is established:
  – IBM, BEA, Oracle, MetaMatrix, Composite, Actuate, …

• Impact on design tools:
  – IBM Rational Data Architect
  – ADO.NET v. 3

Page 44

Forrester Says…

"Enterprises are facing the growing challenges of using disparate sources of data managed by different applications, including problems with data integration, security, performance, availability and quality.... New technology is emerging that Forrester has coined "information fabric," a term defined as a virtualized data layer that integrates heterogeneous data and content repositories in real time.... The potential benefits of this technology are so great that enterprises should develop a strategy to leverage information fabric technology as it becomes more widely available."

Page 45

2006 Status Report
[Web Angle]

• Vertical search engines: one domain

• At scale: need even better source descriptions
  – Deep web can be surfaced

• Terminology: Data integration = mashups!

Page 46

Wikipedia:

A mashup is a website or Web 2.0 application that uses content from more than one source to create a completely new service. This is akin to transclusion.

Page 47

Page 48

Page 49

Page 50

Looking Ahead

• Data management: from the enterprise to the masses

• Challenges:
  – Databases of everything
  – Need support for collaboration
  – Help people structure their data
  – Pay-as-you-go data management

Page 51

Pay-as-you-go Data Management

[Figure: Benefit plotted against Investment (time, cost), with one curve for Dataspaces and one for Data integration solutions. Artist: Mike Franklin]

Dataspaces: Franklin, Halevy, Maier [see PODS 2006]

Page 52

Big Carrots

Page 53

Reusing Human Attention

• Principle:
  – User action = statement of semantic relationship
  – Leverage actions to infer other semantic relationships

• Examples
  – Providing a semantic mapping
    • Infer other mappings
  – Writing a query
    • Infer content of sources, relationships between sources
  – Creating a “digital workspace”
    • Infer “relatedness” of documents/sources
    • Infer co-reference between objects in the dataspace
  – Annotating, cutting & pasting, browsing among docs

Page 54

Conclusion

• We’ve done extremely well as a community!

• Next challenge: data management and integration tools for the masses