Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.


Page 1: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Web Data and the Resurrection of Database Theory

Dan Suciu, University of Washington

Page 2: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

“In theory there is no difference between theory and practice. In practice there is.”

Jan L.A. van de Snepscheut

September 12, 1953 - February 23, 1994

Page 3: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Short History of Database Theory

The legendary beginnings, 1970-1971:
• Relational databases are the brainchild of a theoretician (Codd)
• Heavily debated at the time (against CODASYL)
• It took several years for the concept to be validated in practice

Theory driving the industry

Page 4: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Short History of Database Theory

The golden years (end of 70s, early 80s)
• Relational theory
  – Functional dependencies
  – Query containment
• Transactions
• Access methods

Theory listening to the industry

Page 5: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Short History of Database Theory

Refined decadence (end of 80s, early 90s)
• Descriptive complexity
• Logic databases
• Complex objects
• Constraint databases

Divorce ?

Page 6: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

“Database Metatheory: Asking the Big Queries”

Christos Papadimitriou, in PODS, 1995
• Theory is inevitable: CS is a science of the artificial, and its artifact is being changed by the very act of studying it
• Kuhn’s paradigm principle, for natural sciences:
  Immature science → Normal science → Crisis → Revolution

Page 7: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Is DB Theory in a Crisis Today ?

• Industry’s focus:
  – one particular data model: relational/SQL
  – one particular application (client-server)
• Theory’s focus is on Logic:
  – New data models, query languages (query containment, complex objects, recursion)
  – New applications (incomplete information, query rewriting using views)

Page 8: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

One Example of Unused Theory

Containment of conjunctive queries is NP-complete [Chandra and Merlin’77]

Dozens of extensions:
• With union and difference [Sagiv and Yannakakis’81]
• With order predicates [Klug’88, van der Meyden’92]
• With complex objects [Levy and Suciu’97]
• With regular expressions [Florescu, Levy and Suciu’98]

Page 9: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Query Containment

The query:

Q1 = SELECT DISTINCT x.name, x.phone
     FROM Person x, Person y, Person z
     WHERE x.department = y.department AND x.manager = z.manager

Is minimized to:

Q2 = SELECT DISTINCT x.name, x.phone
     FROM Person x

The following can be checked: Q1 ⊆ Q2 and Q1 ⊇ Q2

…hence Q1=Q2

Minimization not used by RDBMSs today
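The containment check for Q1 and Q2 can be sketched with the classical homomorphism test for conjunctive queries: Q1 ⊆ Q2 holds when the atoms of Q2 can be mapped into the atoms of Q1 while preserving the head. The sketch below is only an illustration. The tuple encoding of queries, the function name `contained_in`, and the assumed column order of Person (name, phone, department, manager) are not from the slides, and constants and NULLs are ignored.

```python
def contained_in(q1, q2):
    """Return True if q1 is contained in q2 (set semantics, variables only).

    A query is (head, body): head is a tuple of variables, body is a list of
    atoms (relation_name, tuple_of_variables). Containment holds iff some
    homomorphism maps q2's body into q1's body and q2's head onto q1's head.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False

    # The head of q2 must be mapped position by position onto the head of q1.
    seed = {}
    for v2, t1 in zip(head2, head1):
        if seed.setdefault(v2, t1) != t1:
            return False

    def extend(i, h):
        # Try to map the i-th atom of q2 onto some atom of q1 (backtracking).
        if i == len(body2):
            return True
        rel, args = body2[i]
        for rel1, args1 in body1:
            if rel1 != rel or len(args1) != len(args):
                continue
            h2 = dict(h)
            if all(h2.setdefault(a, b) == b for a, b in zip(args, args1)):
                if extend(i + 1, h2):
                    return True
        return False

    return extend(0, seed)


# Assumed schema: Person(name, phone, department, manager).
Q1 = (("n", "p"), [("Person", ("n", "p", "d", "m")),      # x
                   ("Person", ("n2", "p2", "d", "m2")),   # y, same department
                   ("Person", ("n3", "p3", "d3", "m"))])  # z, same manager
Q2 = (("n", "p"), [("Person", ("n", "p", "d", "m"))])

print(contained_in(Q1, Q2), contained_in(Q2, Q1))
```

Both directions succeed here, which matches the slide's conclusion that Q1 and Q2 are equivalent. The search is exponential in the worst case, in line with the NP-completeness result.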

Page 10: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Why Today Things Are Changing

Just one reason: The Web

More precisely:
• A new data model
  – Semistructured data
  – XML syntax
• New applications
  – Transformation
  – Integration

Page 11: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Web Data Management

• Who creates the new rules
  – W3C working groups
  – Sometimes the industry
  The new artifacts are not concepts, but standards
• The double role of theory
  – Long term: conceptualize/rationalize
    • E.g. keys for XML [Buneman, Davidson, Fan, Hara, Tan’01]
  – Short term: answer technical questions

Page 12: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Some Questions for Database Theory

• XML publishing
• Typechecking XML transformations
• XML storage
• Data distribution

Page 13: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

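[Figure: architecture diagram. Relational, object-relational, and legacy data are transformed and integrated into XML data, which applications and a warehouse exchange over the Web (HTTP). The diagram marks where the four questions arise: XML publishing, XML storage, XML typechecking, and XML distribution.]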

Page 14: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Publishing

Today:
• Legacy data
  – fragmented into many flat relations
  – 3rd normal form
  – proprietary
• XML data
  – nested
  – un-normalized
  – public (450 schemas at www.biztalk.org)

Page 15: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Publishing: an Example

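Legacy data in E/R:

[Figure: E/R diagram. Entities: Eu-Stores (euSid, name, country), US-Stores (usSid, name, url), Products (pid, name, priceUSD). Relationships: Eu-Sales (date) between Eu-Stores and Products, US-Sales (date, tax) between US-Stores and Products.]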

Page 16: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Publishing: an Example

• XML view

<allsales>
  <country>
    <name> France </name>
    <store>
      <name> Nicolas </name>
      <product>
        <name> Blanc de Blanc </name>
        <sold> 10/10/2000 </sold>
        <sold> 12/10/2000 </sold>
        …
      </product>
      <product>…</product>…
    </store>….
  </country>
  …
</allsales>

• In summary: group by country → store → product
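The grouping can be illustrated with a small sketch that nests flat (country, store, product, date) rows into the `allsales` shape. This is only an assumed, simplified stand-in for what a publishing system does. The function name `publish` and the flat row format are hypothetical, and the two sample rows reuse the values shown in the XML view.

```python
import xml.etree.ElementTree as ET


def publish(rows):
    """Nest flat (country, store, product, date) rows into an <allsales> tree.

    One element is created per distinct grouping key, mirroring
    "group by country, then store, then product".
    """
    root = ET.Element("allsales")
    countries, stores, products = {}, {}, {}
    for country, store, product, date in sorted(rows):
        c = countries.get(country)
        if c is None:
            c = countries[country] = ET.SubElement(root, "country")
            ET.SubElement(c, "name").text = country
        s = stores.get((country, store))
        if s is None:
            s = stores[(country, store)] = ET.SubElement(c, "store")
            ET.SubElement(s, "name").text = store
        p = products.get((country, store, product))
        if p is None:
            p = products[(country, store, product)] = ET.SubElement(s, "product")
            ET.SubElement(p, "name").text = product
        ET.SubElement(p, "sold").text = date
    return root


# Sample rows taken from the XML view on the slide.
ROWS = [
    ("France", "Nicolas", "Blanc de Blanc", "10/10/2000"),
    ("France", "Nicolas", "Blanc de Blanc", "12/10/2000"),
]
tree = publish(ROWS)
print(ET.tostring(tree, encoding="unicode"))
```

The dictionaries keyed by (country), (country, store), and (country, store, product) ensure each grouping key produces one element, so repeated sales of the same product end up as sibling `sold` elements.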

Page 17: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Output “schema”:

[Figure: output schema tree; * marks a repeated child, ? an optional child]

allsales
  country *
    name (PCDATA)
    store *
      name (PCDATA)
      url ? (PCDATA)
      product *
        name (PCDATA)
        sold *
          date (PCDATA)
          tax ? (PCDATA)

Page 18: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Publishing

In SilkRoute [Fernandez, Suciu, Tan ’00]

{ FROM EuStores $S, EuSales $L, Products $P
  WHERE $S.euSid = $L.euSid AND $L.pid = $P.pid
  CONSTRUCT
    <allsales()>
      <country($S.country)>
        <name> $S.country </name>
        <store($S.euSid)>
          <name> $S.name </name>
          <product($P.pid)>
            <name> $P.name </name>
            <price> $P.priceUSD </price>
          </product>
        </store>
      </country>
    </allsales>
}
/* union */
{ FROM USStores $S, USSales $L, Products $P
  WHERE $S.usSid = $L.usSid AND $L.pid = $P.pid
  CONSTRUCT
    <allsales()>
      <country("USA")>
        <name> USA </name>
        <store($S.usSid)>
          <name> $S.name </name>
          <url> $S.url </url>
          <product($P.pid)>
            <name> $P.name </name>
            <price> $P.priceUSD </price>
            <tax> $L.tax </tax>
          </product>
        </store>
      </country>
    </allsales>
}

Page 19: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Internal Representation

View Tree:

[Figure: view tree; * marks a repeated child, ? an optional child, and the variable after the colon is the text value of a leaf]

allsales()
  country(c) *
    name(c): c
    store(c,x) *
      name(n): n
      url(c,x,u) ?: u
      product(c,x,y) *
        name(n): n
        sold(c,x,y,d) *
          date(c,x,y,d): d
          Tax(c,x,y,d,t) ?: t

Non-recursive datalog (SELECT DISTINCT … ):

allsales() :-
country(c) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
country("USA") :-
store(c,x) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
store(c,x) :- USStores(x,_,_), USSales(x,y,_), Products(y,_,_), c="USA"
url(c,x,u) :- USStores(x,_,u), USSales(x,y,_), Products(y,_,_)

Large query (x100 lines), large XML answer (x100 MB)

Page 20: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Users Ask Specific XML Queries

• find names, urls of all stores who sold on 1/1/2000 (in XML-QL / XQuery melange):

WHERE <allsales/country/store>
        <product/sold/date> 1/1/2000 </>
        <name> $X </>
        <url> $Y </>
      </>
RETURN $X , $Y

Small query, small answer

Page 21: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Query Composition

[Figure: the view tree (left) next to the XML-QL query pattern (right). The pattern is allsales / country / store with three branches: product / sold / date = 1/1/2000, name = $X, and url = $Y. Pattern nodes are labeled $n1 to $n5 and $Z.]

“Evaluate” the XML pattern(s) on the view tree, combine all datalog rules

Page 22: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Query Composition

Result (in theory…):

( SELECT S.name, S.url
  FROM USStores S, USSales L, Products P
  WHERE S.usSid=L.usSid AND L.pid=P.pid AND L.date='1/1/2000' )

UNION

( SELECT S2.name, S2.url
  FROM EUStores S1, EUSales L1, Products P1,
       USStores S2, USSales L2, Products P2
  WHERE S1.euSid=L1.euSid AND L1.pid=P1.pid AND L1.date='1/1/2000'
    AND S2.usSid=L2.usSid AND L2.pid=P2.pid
    AND S1.country='USA' AND S1.euSid = S2.usSid )

Page 23: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Complexity of XML Publishing

• But in practice: 5-7 times more joins !
  – Need query minimization
• Could this be avoided ?
  – We thought hard and couldn’t find a better way
  – Asked students to re-implement: same problem
  – It is NP-hard !

Page 24: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Publishing Is NP-Hard

View Tree:

customer
  order ? (PCDATA)
  complaint ? (PCDATA)

order() :- Q1
complaint() :- Q2

XML query:

WHERE <customer> <order> $x </> <complaint> $y </> </>
RETURN ( )

The composed SQL query is: Q1 JOIN Q2

Minimizing it is NP-hard ! (can be shown…)

Page 25: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Recent Advancements in Query Containment

Definition: FO^k = First Order Logic with k variables

Fact: If Q2 ∈ FO^k and k “is small”, then Q1 ⊆ Q2 can be checked efficiently

[Kolaitis, Vardi’98], [Vardi’00], [Chekuri, Rajaraman’97]

Page 26: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Publishing: Finale

Prediction: techniques based on FO^k and/or query width will be deployed in XML publishing in the future

(perhaps under different names)

Page 27: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Typechecking

Purpose: ensure that the generated XML conforms to the desired DTD (or XML Schema)

Two kinds:
• Dynamic typechecking
  – Easy: lots of XML validating parsers available
• Static typechecking
  – Hard: need complex analysis of the XML generation program

Page 28: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Typechecking

XML generation programs:
• Publishing: RDBMS → XML (e.g. SilkRoute)
• Transformation: XML → XML (e.g. XSL, XQuery)
• Integration: XML + XML → XML

This talk: XML → XML

Page 29: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

The XML Typechecking Problem

Given an XML → XML transformation f:

Type Checking Problem: Given DTDs τ1, τ2, check ∀D ∈ τ1, f(D) ∈ τ2

sometimes τ1 = any: check ∀D, f(D) ∈ τ2

Page 30: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Today’s Systems Try to Do Type Inference

Type Inference Problem: Given DTD τ1, find the DTD f(τ1) = {f(D) | D ∈ τ1}

Today’s systems:
• “Compute” f(τ1)
• Check f(τ1) ⊆ τ2 (which is possible)

sometimes τ1 = any: compute f(any), check f(any) ⊆ τ2

Page 31: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Theory’s Role: Send a Warning

This approach fails in general !

But it may work OK in most “practical” cases...

Page 32: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Why XML Type Inference Fails

XQuery f =

RETURN <a>
         (FROM Employee $x RETURN <b/>),
         (FROM Employee $x RETURN <c/>),
         (FROM Employee $x RETURN <d/>)
       </a>

• “Inferred” (wrong) DTD f(any):

<!ELEMENT a (b*,c*,d*)>

• “Real” output “DTD”:

<!ELEMENT a ({b^n,c^n,d^n | n ≥ 0})>

• Fails to typecheck f(any) ⊆ τ2 when τ2 =

<!ELEMENT a ((b,b)*,(c,c)*,(d,d)* | (b,b)*,b,(c,c)*,c,(d,d)*,d)>
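The gap between the inferred DTD and the real output can be seen in a small sketch. It encodes the children of `<a>` as a string over the letters b, c, d and uses regular expressions as stand-ins for the two content models. The encoding and the names `f`, `INFERRED`, `TAU2`, and `conforms` are assumptions made for illustration only.

```python
import re


def f(num_employees):
    """Children of <a> produced by the query when Employee has n rows."""
    n = num_employees
    return "b" * n + "c" * n + "d" * n


# Content models written as regular expressions over child element names.
INFERRED = re.compile(r"b*c*d*")                           # (b*,c*,d*)
TAU2 = re.compile(r"(bb)*(cc)*(dd)*|(bb)*b(cc)*c(dd)*d")   # the target DTD


def conforms(content_model, children):
    """True if the whole child sequence matches the content model."""
    return content_model.fullmatch(children) is not None


# Every real output satisfies the target DTD ...
print(all(conforms(TAU2, f(n)) for n in range(20)))
# ... but the inferred type admits sequences the target DTD rejects,
# so the check "inferred type is a subset of the target" fails.
print(conforms(INFERRED, "b"), conforms(TAU2, "b"))
```

The sequence `b` is allowed by the inferred model but is never produced by the query, which is why inclusion of the inferred type in the target fails even though every actual output is valid.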

Page 33: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

The Typechecking Problem in Theory and Practice

• In practice, we care about typechecking
• Question for theory: is this possible ?
• Positive result [Milo, Suciu, Vianu, 2000]:
  – Decidable for k-pebble tree transducers
  – Hence: decidable for:
    • Join-free XQuery
    • Simple XSLT programs
• Negative result [Alon, Milo, Neven, Suciu, Vianu 2001]:
  – Undecidable for transformations with value joins

Page 34: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

The Typechecking: Finale

Prediction: systems will continue to use type inference, but will never be as robust as type checking in programming languages

Need to understand well their applicability

Page 35: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Storage

Problem:
• Given: a (large) XML data instance
• Goal: store/process it in a RDBMS
• Problem: find the relational schema !
• Current approaches:
  – Generic schema [Florescu, Kossmann 99]
  – Derive schema from DTD [Shanmugasundaram et al 99]
  – Derive schema from XML data [Deutsch, Fernandez, Suciu 99]

Page 36: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

The Theory of XML Storage

• The simplest case: flat, unique subelements

M =

Oid        E1  E2  E3  E4  …  E5000
&1          1   0   0   1  …  0
&2          0   1   1   0  …  0
&3          0   1   0   1  …  0
&4          0   1   1   1  …  0
&5          1   0   1   0  …  0
&6          1   1   0   0  …  0
…           …   …
&10000000   0   1   0   0  …  0

• How do we cover all 1’s most economically ?
  – R1(E2, E3, E4), R2(E1, E5, E9, E12), …
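One simple, not necessarily economical, way to cover all the 1’s is to create one relation per distinct set of subelements. The sketch below does this for the six rows shown above, restricted to the four visible columns E1 to E4. The truncation, the dictionary encoding, and the function name `relations_by_signature` are assumptions for illustration, not a method described in the talk.

```python
from collections import defaultdict

COLUMNS = ["E1", "E2", "E3", "E4"]

# The six visible rows of M, truncated to the four visible columns.
M = {
    "&1": [1, 0, 0, 1],
    "&2": [0, 1, 1, 0],
    "&3": [0, 1, 0, 1],
    "&4": [0, 1, 1, 1],
    "&5": [1, 0, 1, 0],
    "&6": [1, 1, 0, 0],
}


def relations_by_signature(matrix, columns):
    """Group object ids by the exact set of subelements they contain.

    Each group becomes one candidate relation whose attributes are the
    subelements in the signature, so every 1 in the matrix is covered.
    """
    groups = defaultdict(list)
    for oid, row in matrix.items():
        signature = tuple(c for c, bit in zip(columns, row) if bit)
        groups[signature].append(oid)
    return dict(groups)


for signature, oids in relations_by_signature(M, COLUMNS).items():
    print(signature, oids)
```

On this fragment every row has a different signature, so the naive scheme needs six relations. That is the motivation for asking how few relations, such as R1(E2, E3, E4), could cover the data if some null values are tolerated.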

Page 37: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

The Theory of XML Storage

• XML storage and matrix rank

M =

Oid        E1  E2  E3  E4  …  E5000
&1          1   0   0   1  …  0
&2          0   1   1   1  …  0
&3          0   1   1   1  …  0
&4          0   1   1   1  …  0
&5          1   1   0   0  …  0
&6          1   1   0   0  …  0
&7          0   0   0   1  …  …
…           …   …
&10000000   1   0   0   1  …  0

• Can store XML data in k relations ⇒ rank(M)=k
• Conversely: if rank(M)=k what about storage ?
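To make the link between storage and rank concrete, the following sketch computes the rank of the visible fragment of M (rows &1 to &7, columns E1 to E4) by Gaussian elimination over the rationals and compares it with the number of distinct row patterns. The truncation to four columns and the helper name `rank` are assumptions. Nothing here is claimed about the full 5000-column matrix.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix over the rationals, by Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0  # number of pivots found so far
    for col in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


# Visible rows &1..&7 of M, truncated to columns E1..E4.
ROWS = [
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]

distinct_patterns = len({tuple(row) for row in ROWS})
print(rank(ROWS), distinct_patterns)
```

On this fragment the rank and the number of distinct row patterns coincide: four patterns, hence four relations with one relation per pattern, and rank four.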

Page 38: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

XML Storage: Finale

Prediction: we will see several clever XML storage techniques discovered in the near future

Page 39: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

The Data Distribution

• Many data consumers, many places to cache
• Data can be replicated, transformed
  – How to transform it ? The view selection problem
  – Where to place it ? The data distribution problem

NP-complete

Prediction: no predictions here (too early…)

Page 40: Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

Conclusions: Resurrection of Database Theory

• Is theory irrelevant ?
  – [Papadimitriou, 95]: wrong question to ask
    • Respect for practice: only a recent development in human culture
    • Applicability pressure in CS: annoying trend of last 10 years or so
• Database theory: are we in a revolution ?
  – The past: researchers created artifacts for the industry
  – Today: society (Web, W3C) is creating artifacts for researchers to study, improve

Prediction: there will be no difference between theory and practice… at least, in theory !