Dataspaces: Progress and Prospects Michael J. Franklin UC Berkeley & Truviso BNCOD July 7, 2009

M. Franklin BNCOD 2009 7 July 2009

Dataspaces: Progress and Prospects

Michael J. Franklin UC Berkeley & Truviso

BNCOD July 7, 2009

Dataspace: The Final Frontier?

Outline

•  Dataspaces – some history
•  Dataspaces – what are they, really?
•  Some emerging examples
•  Example technologies
•  What’s missing?
•  What’s next?

The SIGMOD Credo

Codd made relations, all else is the work of man.

Leopold Kronecker (paraphrased by Raghu Ramakrishnan?)

The Politics of Dataspaces

•  Roots: CIDR 2005 Conference
   –  “Gloom and Doom” panel
   –  David DeWitt’s call for a unifying goal
   –  Juxtaposed with lots of great work across the web, new devices, scalable computing, …

An Aside: The cycle of DB Angst

Did we “miss the boat” on something cool?

Are we polishing a “round ball”?

Dataspaces: Timeline

•  CIDR 2005 (January)
•  A small group started looking for commonality and a “grand challenge”
•  We put a name on it.
•  Ran an early draft by an impromptu group of advisors at SIGMOD 2005 (June 05).
•  Wrote it up for SIGMOD Record (Dec 05) [Franklin, Halevy, Maier]
•  Kept working on pretty much what we were already doing!

What’s in a name?

Dataspaces – what are they?

Dataspaces

Inclusive
   Deal with all the data of interest – in whatever form

Co-existence not Integration
   No integrated schema, no single warehouse, no ownership required

Pay-as-you-go
   –  Keyword search is bare minimum.
   –  More function and increased consistency as you add work.
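
The pay-as-you-go idea can be sketched in a few lines of Python. This is only an illustration and not code from the talk: the `Dataspace` and `Item` classes, their method names, and the sample data are all hypothetical. It shows keyword search working on every item from the start, while a structured query only returns results once someone has invested the work of attaching attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    source: str                                  # where the data lives (mailbox, file, table, ...)
    text: str                                    # raw content, searchable with no setup
    attrs: dict = field(default_factory=dict)    # structure added later, if ever


class Dataspace:
    """Hypothetical toy dataspace: nothing is integrated up front."""

    def __init__(self):
        self.items = []

    def add(self, source, text):
        item = Item(source, text)
        self.items.append(item)
        return item

    def keyword_search(self, word):
        # The bare-minimum service: works on every item, whatever its form.
        needle = word.lower()
        return [it for it in self.items if needle in it.text.lower()]

    def annotate(self, item, **attrs):
        # "Adding work": a user or tool attaches structure to one item.
        item.attrs.update(attrs)

    def structured_query(self, **conditions):
        # Only items that have received structure can answer this.
        return [
            it for it in self.items
            if all(it.attrs.get(k) == v for k, v in conditions.items())
        ]


ds = Dataspace()
ds.add("email", "Meeting with Alon about the SIGMOD Record draft")
sheet = ds.add("spreadsheet", "Halevy, Alon; coauthor")

hits = ds.keyword_search("alon")                    # available immediately
before = ds.structured_query(role="coauthor")       # empty: no structure yet
ds.annotate(sheet, person="Alon Halevy", role="coauthor")
after = ds.structured_query(role="coauthor")        # now finds the spreadsheet item
```

The design choice worth noting is that `annotate` is optional and incremental: items that never receive attributes stay reachable through keyword search.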

Compare to Data Integration

A quintessential schema-first approach.

[Figure: a Mediated Schema connected through semantic mappings to five source wrappers.]

Courtesy of Alon Halevy

Structured Data Management

A “Modern” View of Data Management

The Structure Spectrum

Structured (schema-first)
   Relational Database
   Formatted Messages

Semi-Structured (schema-later)
   XML
   Tagged Text/Media

Unstructured (schema-never)
   Plain Text
   Media

Whither Structured Data?

•  Conventional Wisdom: only 20% of data is structured.
•  Decreasing due to:
   –  Consumer applications
   –  Enterprise search
   –  Media applications

But Structure Matters!

[Chart: Functionality plotted against Time (and cost), with curves for Structured (schema-first), Unstructured (schema-less), and Dataspaces (pay-as-you-go).]

Structure enables computers to help users manipulate and maintain the data.

An Alternative View

[Chart: systems placed along two axes, Semantic Integration and Administrative Control, each running between Strong and Weak. Systems shown: Desktop Search, Web Search, Virtual Organization, Federated DBMS, DBMS.]

Some Interesting Points on the Structure Spectrum

Web-scale Structured Data

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

HTML Tables extracted from the Web

Relations generated by information extraction from web pages

Database Views in the Deep Web accessed through HTML Forms on the Web
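
As a rough illustration of how a relation like the Name / Title / Organization table can be produced from the sample text, here is a toy extractor in Python. It is an assumption-laden sketch, not the method behind the slide: the three regular expressions are hand-written for exactly these sentences, and the names `TEXT`, `PATTERNS`, and `extract` are hypothetical. Real information extraction systems are far more general.

```python
import re

# Abbreviated version of the sample text on the slide.
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates was against open source. "
    'But today he appears to have changed his mind. "We love the concept of '
    'shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered saying"
)

# A person name is approximated as two capitalized words (a strong assumption).
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"

PATTERNS = [
    # "<Org> [Corporation] CEO <Name>"
    re.compile(r"(?P<org>[A-Z][a-z]+)(?: Corporation)? (?P<title>CEO) " + NAME),
    # "<Name>, a <Org> VP"
    re.compile(NAME + r", an? (?P<org>[A-Z][a-z]+) (?P<title>VP)"),
    # "<Name>, founder of the <Org ...>"
    re.compile(
        NAME + r", (?P<title>founder) of the (?P<org>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    ),
]


def extract(text):
    """Return (name, title, organization) tuples found by the toy patterns."""
    rows = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            title = m.group("title")
            title = title[0].upper() + title[1:]   # "founder" -> "Founder"
            rows.append((m.group("name"), title, m.group("org")))
    return rows


rows = extract(TEXT)
for row in rows:
    print(row)
```

Patterns this brittle miss most real sentences, which is part of why extracted web relations come with uncertainty attached.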

The Future of Analytics

•  Analytics traditionally a key DB use case
   –  Need to understand data to manipulate it
•  “Barbarians at the Gate”
   –  Procedural cloud-based approaches gaining interest
   –  Scalability for massive data sets
   –  But, we’ve seen this movie before!

The View From the Clouds

•  “Pig Latin” [Olston et al. SIGMOD 08]
   –  Why have a schema?
        1) Transactional (referential?) Consistency
        2) Fast point look ups through indexes
        3) Curation for future (other) users
   –  Flexible, optional, nested data model
   –  Data remains in files (no admin)
•  “Column Family” models of BigTable, HBase, Cassandra, CouchDB, …
•  “Schema on Read”? == Errors on Read?
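
The “Errors on Read” concern can be made concrete with a small Python sketch. Everything here is illustrative and assumed: the sample records, the `read_with_schema` helper, and the schema format are hypothetical and do not correspond to Pig Latin or any particular system. The point is only that, when no schema is enforced at write time, malformed records surface when a reader finally applies one.

```python
import json

# Records were appended to a file with no checks at write time.
RAW_LINES = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": "seven"}',   # wrong type, accepted silently
    '{"usr": "cy", "clicks": 5}',           # misspelled field, accepted silently
]


def read_with_schema(lines, schema):
    """Apply a schema while reading; return (good_rows, errors)."""
    rows, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        record = json.loads(line)
        try:
            # Each schema entry maps a field name to a conversion function.
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, ValueError, TypeError) as exc:
            # The problem is discovered only now, by the reader.
            errors.append((lineno, repr(exc)))
    return rows, errors


rows, errors = read_with_schema(RAW_LINES, {"user": str, "clicks": int})
print(rows)
print(errors)
```

A schema-first system would have rejected the second and third records when they were written; here each reader has to decide what to do with them.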

Other Examples

Personal Information Management (iMemex), Question answering, Scientific Collaboration

Outline

•  Dataspaces – some history
•  Dataspaces – what are they, really?
•  Some emerging examples
•  Example technologies
•  What’s missing?
•  What’s next?

DataSpace Technology

•  Probabilistic Databases
•  Schema Matching
•  Judicious use of User Input
•  Approx. Query Answering
•  Uncertainty Management
•  Data Model Learning
•  Provenance and Annotation
•  Structured + Unstructured Search

Roomba: Soliciting User Feedback*

•  A “web 2.0” spin on Reference Reconciliation.
   –  Inspired by “ESP Game” for image labeling by Von Ahn & Dabbish; “MOBS” architecture by Doan et al.
•  Use automated techniques to generate candidate matches.
•  Ask users to confirm.
•  Problem: which matches are most important?

* “Soliciting User Feedback in a Dataspace System”, Shawn Jeffery, Michael Franklin, Alon Halevy; SIGMOD 2008.

Roomba Overview

•  Based on Value of Perfect Information (VPI) (see Russell and Norvig)

•  Choose matches that provide largest increase in dataspace utility.

•  Must consider: Query Workload, # Records per Term, and Confidence of Matches.
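
To make the ordering idea concrete, here is a small Python sketch of ranking candidate matches by value of perfect information. It is not Roomba's actual utility function. It assumes a toy model in which a match is worth `query_weight * records` if applied correctly, costs the same amount if applied wrongly, and is worth zero if left unapplied. The class `CandidateMatch`, the functions `vpi` and `order_for_feedback`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMatch:
    left: str
    right: str
    confidence: float     # matcher's probability that the match is correct
    query_weight: float   # share of the query workload touching these terms
    records: int          # number of records that mention these terms


def vpi(m):
    """Expected utility gain from asking a user about m (toy utility model)."""
    value = m.query_weight * m.records
    # Without feedback, the system applies the match only if that helps in expectation.
    act_now = max(0.0, (2 * m.confidence - 1) * value)
    # With a perfect answer, the match is applied exactly when it is correct.
    with_answer = m.confidence * value
    return with_answer - act_now


def order_for_feedback(matches):
    # Ask about the matches whose confirmation is expected to help the most.
    return sorted(matches, key=vpi, reverse=True)


candidates = [
    CandidateMatch("M. Franklin", "Michael Franklin", 0.95, 0.5, 100),
    CandidateMatch("UCB", "UC Berkeley", 0.50, 0.3, 40),
    CandidateMatch("BNCOD", "VLDB", 0.10, 0.2, 10),
]
ranked = order_for_feedback(candidates)
for m in ranked:
    print(f"{m.left} = {m.right}: VPI {vpi(m):.2f}")
```

Under this toy model the score reduces to `value * min(confidence, 1 - confidence)`, so a heavily queried but uncertain match outranks one the matcher is already nearly sure about, which matches the three factors listed on the slide.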

Roomba: Sample Result

[Chart: curves labeled Perfect Knowledge and VPI-Based Ordering.]

Data Integration at Web-scale

•  A typical data integration solution is impractical for web-scale data
   –  Too many domains of interest (Web Data is about everything)
   –  Huge number of sources for each domain
   –  Designing Mediated Schema is infeasible
   –  Data sources are dirty, incomplete, and lack metadata
•  Solution: A Data Integration Solution that is
   –  Automated
   –  Best Effort
   –  Pay-as-you-go

“Functional Dependency Generation and Applications in Pay-as-you-go Data Integration Systems”, WebDB 2009, Wang, Dong, Das Sarma, Franklin, Halevy

Probabilistic Functional Dependencies (pFDs)

•  Idea: use probabilistic Functional Dependencies to guide automated approaches
   –  Normalize mediated schemas
   –  Identify low quality data sources
•  Definition of a probabilistic FD (pFD): X →p A, where p is the likelihood that the FD holds in general
•  “Learn” pFDs by counting data and schema instances
   –  Note: this will get you a bad grade in your database course.
•  Related work
   –  TANE, CORDS
   –  Conditional Functional Dependencies
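
One way to picture “learning by counting” is the following Python sketch. The slide does not give the counting scheme, so this one is an assumption: for each distinct value of X, take the share of rows that agree with the most common A value, then average those shares. It also works on a single table, whereas the talk concerns many sources. The function name `pfd_probability` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def pfd_probability(rows, x, a):
    """Estimate p for the pFD x -> a from one table given as a list of dicts."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[x]][row[a]] += 1      # count A values seen with each X value
    if not groups:
        return 0.0
    # Per X value: fraction of rows agreeing with the majority A value.
    per_value = [max(c.values()) / sum(c.values()) for c in groups.values()]
    return sum(per_value) / len(per_value)


rows = [
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94720", "city": "Berkley"},    # dirty value
    {"zip": "10001", "city": "New York"},
]

p_zip_city = pfd_probability(rows, "zip", "city")
p_city_zip = pfd_probability(rows, "city", "zip")
print(f"zip -> city: {p_zip_city:.3f}")
print(f"city -> zip: {p_city_zip:.3f}")
```

A classical FD either holds on every row or fails. The estimate degrades gracefully instead, so one misspelled city lowers the probability without discarding the dependency.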

Results for pFD Generation Algorithms on “Web Tables”

[Chart: fidelity of generated FDs with confidence 0.8 against “gold standard” FDs.]

Normalizing a Mediated Schema

•  Generating the minimal pFD-set
   –  Prune low-probability pFDs
   –  Prune pFDs that can be generated by transitivity
•  Avoid over-splitting

[Figure: pFD graph over the attribute labels title; author, authors, author(s); journal title; journal; issn; subject, subjects; conference, meeting, colloquium; zip; address; city. Edge probabilities range from 0.9 to 1.0.]
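
The two pruning steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm from the WebDB paper: it handles only single-attribute pFDs, uses a hypothetical threshold of 0.8, and removes a pFD when its right side can still be reached from its left side through the other kept pFDs. The function names and the example probabilities are made up and are not read off the slide's figure.

```python
def reachable(edges, start, goal):
    """Depth-first search: can goal be derived from start by chaining edges?"""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        for u, v in edges:
            if u == node and v not in seen:
                if v == goal:
                    return True
                seen.add(v)
                stack.append(v)
    return False


def minimal_pfd_set(pfds, threshold=0.8):
    """pfds maps (lhs, rhs) -> probability; returns the pruned mapping."""
    # Step 1: prune low-probability pFDs.
    kept = {fd: p for fd, p in pfds.items() if p >= threshold}
    # Step 2: prune pFDs implied by transitivity, weakest first.
    for fd in sorted(kept, key=kept.get):
        others = set(kept) - {fd}
        if reachable(others, fd[0], fd[1]):
            del kept[fd]
    return kept


pfds = {
    ("address", "zip"): 0.95,
    ("zip", "city"): 1.0,
    ("address", "city"): 0.9,    # implied by address -> zip -> city
    ("city", "zip"): 0.3,        # too weak to keep
}
minimal = minimal_pfd_set(pfds)
print(minimal)
```

Checking one pFD at a time against the current kept set avoids deleting two pFDs that are each derivable only through the other.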

Results for Schema Normalization

PayGo Quality Metrics

•  Measuring quality of data sources
•  Measuring and improving quality of an integration (e.g. mediated schema, schema mapping, etc.)
•  FD-based quality measuring framework is an example:
   –  Identify dirty data sources
   –  Improve the mediated schema

What’s Missing?

•  Metrics!!!!
   –  Key idea: you pay more to get better data. Must define “better”!
   –  Application-, user-, context-dependent
   –  Relation to Data Quality work
•  Benchmarks
   –  Key to progress
•  Support for collaboration/data-sharing/visualization
   –  Particularly with uncertainty in base data and inferences
•  More data/media types
•  Focus on “serious” analytics workloads
•  …Your ideas here…

Metcalfe’s (not Moore’s) Law will drive future DBMS innovation

[Figure: connected systems labeled Data Center, EDGE, Data Warehouse, Inventory, PoS, and ERP.]

• More connectivity means more data to integrate.

• Dataspace-style techniques will play an ever-larger role.

Conclusions

•  More connectivity means more data.
•  Many would simply throw away the benefits of structure due to “schema-first” problems.
•  Dataspaces provide a framework for intelligent use of structural information.
•  Could also meet the goal of a “grand challenge” for the DB Community.

As an inherently unsolvable problem…

Dataspace may, in fact, be the final frontier.