Post on 22-Dec-2015

Efficiently Publishing Relational Data as XML Documents

Jayavel Shanmugasundaram

University of Wisconsin-Madison / IBM Almaden Research Center

Joint work with: Rimon Barr, Michael Carey, Bruce Lindsay, Hamid Pirahesh, Berthold Reinwald, Eugene Shekita

Outline

• Why?

• How?

• Which?

• Hence

XML Example

<department name="Purchasing">

<emplist>

<employee> John </employee>

<employee> Mary </employee>

</emplist>

<projlist>

<project> Internet </project>

<project> Recycling </project>

</projlist>

</department>

What is the big deal about XML?

• Elegantly models complex, hierarchical/graph-structured data

• Domain-specific tags (unlike HTML)

• Simple!

Fast emerging as the dominant standard for data exchange on the WWW

Why Relational Data?

• Most business data stored in relational databases

• Unlikely to change in the near future

– Scalability, Reliability, Performance, Tools

Need efficient means to publish relational data as XML documents

Usage Scenario

[Figure: an application or user sends a query to produce XML documents, over the Internet, to the existing database system (RDBMS); the XML result comes back and is processed or displayed in a browser.]

Example Relational Schema

Department

DeptId  DeptName
10      Purchasing

Project

ProjId  DeptId  ProjName
888     10      Internet
795     10      Recycling

Employee

EmpId  DeptId  EmpName  Salary
101    10      John     50K
91     10      Mary     70K

XML Representation

<department name="Purchasing">
  <emplist>
    <employee> John </employee>
    <employee> Mary </employee>
  </emplist>
  <projlist>
    <project> Internet </project>
    <project> Recycling </project>
  </projlist>
</department>

Main Issues

• Relational data is flat, XML is a tagged graph

• How do we specify translation from a flat model to a graph model?

– A query language to map from relations to XML

• How do we transform flat representations to tagged nested representations?

– Efficient implementation strategies

Outline

• Why?

• How?

– Language?

– Mechanism?

• Which?

• Hence

Transformation Languages

• Two obvious choices:

– XML Query Language

– SQL

XMLQL: Default XML View

<defaultview>

<department>

<row> <deptid>10</> <deptname>Purchasing</> </row>

</department>

<employee>

<row> <empid>101</> <deptid>10</> <empname>John</> <salary>50K</> </row>

<row> <empid>91</> <deptid>10</> <empname>Mary</> <salary>70K</> </row>

</employee>

<project>

<row> <projid>888</> <deptid>10</> <projname>Internet</> </row>

<row> <projid>795</> <deptid>10</> <projname>Recycling</> </row>

</project>

</defaultview>

XMLQL: Query Over Default View

WHERE <defaultview.department.row>
        <deptid> $did </> <deptname> $dname </>
      </> IN DefaultView
CONSTRUCT <department name=$dname>
            <emplist>
              { WHERE <defaultview.employee.row>
                        <deptid> $did </> <empname> $ename </>
                      </> IN DefaultView
                CONSTRUCT <employee> $ename </> }
            </emplist>
            <projlist>
              { WHERE <defaultview.project.row>
                        <deptid> $did </> <projname> $pname </>
                      </> IN DefaultView
                CONSTRUCT <project> $pname </> }
            </projlist>
          </>

XMLQL: Query Result

<department name="Purchasing">
  <emplist>
    <employee> John </employee>
    <employee> Mary </employee>
  </emplist>
  <projlist>
    <project> Internet </project>
    <project> Recycling </project>
  </projlist>
</department>

XMLQL: Pros and Cons

• Pros:

– Natural for XML users

– Infrastructure to build hierarchies of XML views

– One query language for XML and relational data

• Cons:

– Ignores existing API (JDBC), tools, support

– Need to mature new query language (aggregates etc.)

SQL: Key Ideas

• Sub-queries to specify nesting

• Scalar functions to specify tags/attributes

– XML Constructors

• Aggregate functions to group child elements

SQL: Query to publish XML

Select DEPT(d.name,
            <subquery to produce emplist>,
            <subquery to produce projlist>)
From Department d

SQL: XML Constructor

Define XML Constructor DEPT(dname: varchar(20), emplist: xml, projlist: xml) As (
  <department name=$dname>
    <emplist> $emplist </emplist>
    <projlist> $projlist </projlist>
  </department>
)

SQL: Query to publish XML

Select DEPT(d.name,
            (Select XMLAGG(EMP(e.name))
             From Employee e
             Where e.deptno = d.deptno),
            <subquery to produce projlist>)
From Department d

SQL: XML Constructor

Define XML Constructor EMP(ename: varchar(20)) As (
  <employee> $ename </employee>
)

SQL: Query to publish XML

Select DEPT(d.name,
            (Select XMLAGG(EMP(e.name))
             From Employee e
             Where e.deptno = d.deptno),
            (Select XMLAGG(PROJ(p.name))
             From Project p
             Where p.deptno = d.deptno))
From Department d

Query Result

<department name="Purchasing">

<emplist>

<employee> John </employee>

<employee> Mary </employee>

</emplist>

<projlist>

<project> Internet </project>

<project> Recycling </project>

</projlist>

</department>

(<XML Result>)
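
As a rough illustration of the SQL approach, the sketch below emulates the publishing query in SQLite from Python. SQLite has no XML constructors and no XMLAGG, so plain string concatenation stands in for DEPT, EMP and PROJ, and group_concat stands in for XMLAGG. Those substitutions, the helper names build_db and publish, and the use of the schema's column names (DeptId, DeptName, EmpName, ProjName) are assumptions made for this example; it is not the system described in the talk, and it does no XML escaping.

```python
import sqlite3


def build_db():
    """Create the example schema from the slides in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
        CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
        CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
        INSERT INTO Department VALUES (10, 'Purchasing');
        INSERT INTO Employee VALUES (101, 10, 'John', '50K'), (91, 10, 'Mary', '70K');
        INSERT INTO Project VALUES (888, 10, 'Internet'), (795, 10, 'Recycling');
    """)
    return conn


# Sub-queries express nesting, concatenation plays the role of the XML
# constructors, and group_concat plays the role of XMLAGG (assumed stand-ins).
PUBLISH_SQL = """
SELECT '<department name="' || d.DeptName || '">'
    || '<emplist>'
    || COALESCE((SELECT group_concat('<employee>' || e.EmpName || '</employee>', '')
                 FROM Employee e WHERE e.DeptId = d.DeptId), '')
    || '</emplist>'
    || '<projlist>'
    || COALESCE((SELECT group_concat('<project>' || p.ProjName || '</project>', '')
                 FROM Project p WHERE p.DeptId = d.DeptId), '')
    || '</projlist>'
    || '</department>'
FROM Department d
"""


def publish(conn):
    """Return one XML string per department."""
    return [row[0] for row in conn.execute(PUBLISH_SQL)]


if __name__ == "__main__":
    for doc in publish(build_db()):
        print(doc)
```

The whole document is produced by a single statement, which is the point of the SQL-based language: nesting, tagging and grouping are all visible to the relational engine.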

SQL: Pros and Cons

• Pros:

– Reuses SQL infrastructure/API

– Natural for SQL users

– Efficient execution inside relational engine

• Cons:

– Limited support for XML View Composition

Outline

• Why?

• How?

– Language?

– Mechanism?

• Which?

• Hence

Relations to XML: Issues

• Two main differences:

– Nesting (structuring)

– Tagging

• Space of alternatives:

– Late Tagging, Late Structuring: Inside Engine or Outside Engine

– Late Tagging, Early Structuring: Inside Engine or Outside Engine

– Early Tagging, Early Structuring: Inside Engine or Outside Engine

Stored Procedure Approach

• Issue queries for sub-structures and tag them

• Could be a Stored Procedure

• Problem: Too many SQL queries!

[Figure: the DBMS engine holds Department, Employee and Project; separate queries return (10, Purchasing), then (John) and (Mary), then (Internet) and (Recycling).]

Early Tagging, Early Structuring, Outside Engine
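
The query-per-sub-structure pattern can be sketched as follows, here as client code against SQLite rather than a real stored procedure. The function names make_example_db and publish_with_nested_queries, and the query counter, are hypothetical and exist only to make the "too many SQL queries" problem visible: the count grows as 1 + 2 x (number of departments).

```python
import sqlite3
from xml.sax.saxutils import escape, quoteattr


def make_example_db():
    """Build the slide's example tables in memory (assumed column names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
        CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
        CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
        INSERT INTO Department VALUES (10, 'Purchasing');
        INSERT INTO Employee VALUES (101, 10, 'John', '50K'), (91, 10, 'Mary', '70K');
        INSERT INTO Project VALUES (888, 10, 'Internet'), (795, 10, 'Recycling');
    """)
    return conn


def publish_with_nested_queries(conn):
    """Return (xml_documents, number_of_queries_issued)."""
    queries = 0
    docs = []
    depts = conn.execute(
        "SELECT DeptId, DeptName FROM Department ORDER BY DeptId").fetchall()
    queries += 1
    for dept_id, dept_name in depts:
        # Two further queries for every department: one per child list.
        emps = conn.execute(
            "SELECT EmpName FROM Employee WHERE DeptId = ? ORDER BY EmpId",
            (dept_id,)).fetchall()
        projs = conn.execute(
            "SELECT ProjName FROM Project WHERE DeptId = ? ORDER BY ProjId",
            (dept_id,)).fetchall()
        queries += 2
        # Tag each sub-structure as soon as it is fetched (early tagging).
        parts = [f"<department name={quoteattr(dept_name)}>", "<emplist>"]
        parts += [f"<employee>{escape(name)}</employee>" for (name,) in emps]
        parts += ["</emplist>", "<projlist>"]
        parts += [f"<project>{escape(name)}</project>" for (name,) in projs]
        parts += ["</projlist>", "</department>"]
        docs.append("".join(parts))
    return docs, queries


if __name__ == "__main__":
    documents, count = publish_with_nested_queries(make_example_db())
    print(count, "queries issued")
```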

Correlated CLOB Approach

• Problem: Correlated execution of sub-queries

Select DEPT(d.name,
            (Select XMLAGG(EMP(e.name))
             From Employee e
             Where e.deptno = d.deptno),
            (Select XMLAGG(PROJ(p.name))
             From Project p
             Where p.deptno = d.deptno))
From Department d

Early Tagging, Early Structuring, Inside Engine

De-Correlated CLOB Approach

• Problem: CLOBs during processing

With EmpStruct (deptname, empinfo) AS (
    Select d.deptname, XMLAGG(EMP(e.empname))
    From department d left join employee e
    on d.deptid = e.deptid
    Group By d.deptname),
ProjStruct (deptname, projinfo) AS (
    Select d.deptname, XMLAGG(PROJ(p.projname))
    From department d left join project p
    on d.deptid = p.deptid
    Group By d.deptname)
Select DEPT(d1.deptname, d1.empinfo, d2.projinfo)
From EmpStruct d1 full join ProjStruct d2
on d1.deptname = d2.deptname

Early Tagging, Early Structuring, Inside Engine
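
A runnable approximation of the de-correlated form, again in SQLite from Python, might look like the sketch below. It is written under several assumptions: group_concat replaces XMLAGG, string concatenation replaces the XML constructors, the two intermediate results are joined on DeptId with an inner join (both derive from Department, so no rows are lost, and older SQLite versions lack FULL JOIN), and the helper names are made up for the example.

```python
import sqlite3


def make_db():
    """Example tables from the slides plus one department with no children."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Department (DeptId INTEGER, DeptName TEXT);
        CREATE TABLE Employee (EmpId INTEGER, DeptId INTEGER, EmpName TEXT, Salary TEXT);
        CREATE TABLE Project (ProjId INTEGER, DeptId INTEGER, ProjName TEXT);
        INSERT INTO Department VALUES (10, 'Purchasing'), (20, 'Sales');
        INSERT INTO Employee VALUES (101, 10, 'John', '50K'), (91, 10, 'Mary', '70K');
        INSERT INTO Project VALUES (888, 10, 'Internet'), (795, 10, 'Recycling');
    """)
    return conn


# Each child list is built once by a join plus GROUP BY instead of once per
# department by a correlated sub-query. group_concat skips NULLs, so a
# department without children yields NULL, which COALESCE turns into ''.
DECORRELATED_SQL = """
WITH EmpStruct (DeptId, DeptName, EmpInfo) AS (
    SELECT d.DeptId, d.DeptName,
           COALESCE(group_concat('<employee>' || e.EmpName || '</employee>', ''), '')
    FROM Department d LEFT JOIN Employee e ON d.DeptId = e.DeptId
    GROUP BY d.DeptId, d.DeptName
),
ProjStruct (DeptId, ProjInfo) AS (
    SELECT d.DeptId,
           COALESCE(group_concat('<project>' || p.ProjName || '</project>', ''), '')
    FROM Department d LEFT JOIN Project p ON d.DeptId = p.DeptId
    GROUP BY d.DeptId
)
SELECT '<department name="' || s1.DeptName || '">'
    || '<emplist>' || s1.EmpInfo || '</emplist>'
    || '<projlist>' || s2.ProjInfo || '</projlist>'
    || '</department>'
FROM EmpStruct s1 JOIN ProjStruct s2 ON s1.DeptId = s2.DeptId
ORDER BY s1.DeptId
"""


def publish_decorrelated(conn):
    """Return one XML string per department, ordered by DeptId."""
    return [row[0] for row in conn.execute(DECORRELATED_SQL)]
```

The intermediate EmpInfo and ProjInfo columns are exactly the large string values that the slide flags as the cost of this approach.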

Late Tagging, Late Structuring

• XML document content produced without structure (in arbitrary order)

• Tagger enforces order as final step

[Figure: Relational Query Processing -> Unstructured content -> Tagging -> Result XML Document]

Redundant Relation Approach

• How do we represent nested content as relations?

Source tuples: (10, Purchasing), (10, Internet), (10, Recycling), (10, John), (10, Mary)

Joined result:

(Purchasing, John, Internet)

(Purchasing, John, Recycling)

(Purchasing, Mary, Internet)

(Purchasing, Mary, Recycling)

• Problem: Large relation due to data redundancy!

Late Tagging, Late Structuring
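
The redundancy is easy to see in a few lines of Python. This is only a sketch of the row counts, with a hypothetical function name; it joins one department with its employees and projects the way a plain relational join would.

```python
def redundant_relation(dept_name, employees, projects):
    """Join a department with both child lists: one row per (employee, project) pair."""
    return [(dept_name, emp, proj) for emp in employees for proj in projects]


if __name__ == "__main__":
    rows = redundant_relation("Purchasing", ["John", "Mary"], ["Internet", "Recycling"])
    for row in rows:
        print(row)
    # 2 employees and 2 projects already give 4 rows; the size is the
    # product of the child counts, not their sum.
    print(len(rows), "rows")
```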

Outer Union Approach

• How do we represent nested content as relations?

[Figure: Department is joined separately with Employee and with Project, and the two results are combined by a union.]

Department tuple: (10, Purchasing)

Department-Project tuples: (Purchasing, Internet), (Purchasing, Recycling)

Department-Employee tuples: (Purchasing, John), (Purchasing, Mary)

Outer union result (the last column identifies the tuple type):

(Purchasing, null, Internet, 0)

(Purchasing, null, Recycling, 0)

(Purchasing, John, null, 1)

(Purchasing, Mary, null, 1)

• Problem: Wide tuples (having many columns)

Late Tagging, Late Structuring
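
For comparison, a minimal sketch of the outer union for the same data is shown below. The function name is hypothetical; the tuple layout (department, employee, project, type) and the type codes 0 and 1 follow the tuples on the slide, with None standing in for null.

```python
def outer_union(dept_name, employees, projects):
    """One row per child; columns that do not apply to a row are None."""
    rows = [(dept_name, None, proj, 0) for proj in projects]   # type 0: project rows
    rows += [(dept_name, emp, None, 1) for emp in employees]   # type 1: employee rows
    return rows


if __name__ == "__main__":
    for row in outer_union("Purchasing", ["John", "Mary"], ["Internet", "Recycling"]):
        print(row)
```

The row count is now the sum of the child counts, at the price of one extra column per kind of child.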

Hash-based Tagger

• Results not structured early

– In arbitrary order

• Tagger has to enforce order during tagging

– Hash-based approach

• Inside/Outside engine tagger

Late Tagging, Late Structuring

• Problem: Requires memory for entire document
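
One way to picture a hash-based tagger is the sketch below, which assumes outer-union rows of the form (department, employee, project, type) arriving in arbitrary order. The function name hash_tag and the dictionary layout are illustrative choices, not the implementation from the talk. Every row is held in the hash table until the input is exhausted, which is why memory grows with the document.

```python
from xml.sax.saxutils import escape, quoteattr


def hash_tag(rows):
    """Group unordered outer-union rows by department, then emit tagged XML."""
    groups = {}  # department name -> {"emps": [...], "projs": [...]}
    for dept, emp, proj, kind in rows:
        entry = groups.setdefault(dept, {"emps": [], "projs": []})
        if kind == 1:          # employee row
            entry["emps"].append(emp)
        else:                  # project row
            entry["projs"].append(proj)

    # Tagging can only start once all rows have been hashed.
    docs = []
    for dept, entry in groups.items():
        emps = "".join(f"<employee>{escape(e)}</employee>" for e in entry["emps"])
        projs = "".join(f"<project>{escape(p)}</project>" for p in entry["projs"])
        docs.append(
            f"<department name={quoteattr(dept)}>"
            f"<emplist>{emps}</emplist><projlist>{projs}</projlist></department>"
        )
    return docs
```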

Late Tagging, Early Structuring

• Structured XML document content produced

• Tagger just adds tags (constant space)

[Figure: Relational Query Processing -> Structured content -> Tagging -> Result XML Document]

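Sorted Outer Union Approach

[Figure: element tree with root A; A has children B and C; B has children D and E; C has children F and G.]

Outer-union tuples, one per leaf, with n marking a null column (columns A B C D E F G):

A B n D n n n

A B n n E n n

A n C n n F n

A n C n n n G

Sort By: Aid, Bid, Cid

• Problem: Only partial ordering required

Late Tagging, Early Structuring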

Constant Space Tagger

• Detects changes in XML document hierarchy

• Adds appropriate opening/closing tags

• Inside/outside engine

Late Tagging, Early Structuring
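
A constant-space tagger over sorted input can be sketched as a generator that remembers only the current department and the current child kind. The row format (department, kind, value) with kinds "emp" and "proj", the function name stream_tags, and the two tag tables are assumptions made for this example. The sketch also simplifies: a child list with no rows is not emitted at all.

```python
from xml.sax.saxutils import escape, quoteattr

LIST_TAG = {"emp": "emplist", "proj": "projlist"}
ITEM_TAG = {"emp": "employee", "proj": "project"}


def stream_tags(sorted_rows):
    """Yield XML fragments for rows sorted by (department, kind)."""
    cur_dept = None   # department currently open, if any
    cur_kind = None   # child list currently open, if any
    for dept, kind, value in sorted_rows:
        if dept != cur_dept:
            # The department changed: close whatever is open, open the new one.
            if cur_dept is not None:
                yield f"</{LIST_TAG[cur_kind]}>"
                yield "</department>"
            yield f"<department name={quoteattr(dept)}>"
            cur_dept, cur_kind = dept, None
        if kind != cur_kind:
            # The child kind changed within the department: switch lists.
            if cur_kind is not None:
                yield f"</{LIST_TAG[cur_kind]}>"
            yield f"<{LIST_TAG[kind]}>"
            cur_kind = kind
        yield f"<{ITEM_TAG[kind]}>{escape(value)}</{ITEM_TAG[kind]}>"
    if cur_dept is not None:
        yield f"</{LIST_TAG[cur_kind]}>"
        yield "</department>"


if __name__ == "__main__":
    rows = sorted([
        ("Purchasing", "proj", "Internet"),
        ("Purchasing", "emp", "John"),
        ("Purchasing", "proj", "Recycling"),
        ("Purchasing", "emp", "Mary"),
    ])
    print("".join(stream_tags(rows)))
```

Because the sort has already grouped the rows, the tagger never needs to look back, so its state does not grow with the size of the document.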

Classification of Alternatives

Late Tagging, Late Structuring

– Inside Engine: Unsorted Outer Union (Tagging inside)

– Outside Engine: Unsorted Outer Union (Tagging outside)

Late Tagging, Early Structuring

– Inside Engine: Sorted Outer Union (Tagging inside)

– Outside Engine: Sorted Outer Union (Tagging outside)

Early Tagging, Early Structuring

– Inside Engine: De-Correlated CLOB, Correlated CLOB

– Outside Engine: Stored Procedure

Outline

• Why?

• How?

– Language?

– Mechanism?

• Which?

• Hence

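Performance Evaluation

[Figure: hierarchy of test tables. TABLE0 is the root with children TABLE00 and TABLE01; TABLE00 has children TABLE000 and TABLE001; TABLE01 has children TABLE010 and TABLE011.]

Parameters varied: Query Depth, Query Fan Out, Database Size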

Inside vs. Outside Engine

[Figure: Time (in seconds, axis 0 to 60) against Query Fan Out (2, 3, 4) for Stored Proc, CLOB-Corr, CLOB-DeCorr, Redundant R, Unsorted OU (Out), Unsorted OU (In), Sorted OU (Out), Sorted OU (In).]

Where Does Time Go?

[Figure: Time (in seconds, axis 0 to 35) broken down into Execution, Bind Out, Tagging and XML File.]

Effect of Query Fan Out

[Figure: Time (in seconds, axis 0 to 15) against Query Fan Out (2, 3, 4) for CLOB-Corr, CLOB-DeCorr, Unsorted OU, Sorted OU.]

Effect of Query Depth

[Figure: Time (in seconds, axis 0 to 60) against Query Depth (2, 3, 4) for CLOB-Corr, CLOB-DeCorr, Unsorted OU, Sorted OU.]

Memory Considerations

• Sorted outer union more robust

• Relational sort highly scalable!

Outline

• Why?

• How?

– Language?

– Mechanism?

• Which?

• Hence

Conclusion

• Publishing XML from relational sources is important on the Internet

• Language alternatives:

– SQL based

– XML query language based

• Implementation alternatives:

– Inside engine >> Outside engine

– Unsorted Outer Union: when main memory is sufficient

– Sorted Outer Union: otherwise