Database Systems and XML

Database Systems and XML David Wu CS 632 April 23, 2001


Page 1: Database Systems and XML

Database Systems and XML

David Wu

CS 632

April 23, 2001

Page 2: Database Systems and XML

Researched Papers

• J. Shanmugasundaram, et al. "Efficiently Publishing Relational Data as XML Documents", VLDB Conference, September 2000.

• J. Shanmugasundaram, et al. "Relational Databases for Querying XML Documents: Limitations and Opportunities," VLDB Conference, September 1999.

Page 3: Database Systems and XML

Efficiently Publishing Relational Data as XML Documents

Page 4: Database Systems and XML

Motivation

• Relational database systems and XML are heavily used on the Web.

• Would like some way to publish relational data as XML.

Page 5: Database Systems and XML

What is Needed

• Language to specify the conversion from relational data to XML.

• Implementation to efficiently carry out the conversion.

Page 6: Database Systems and XML

SQL Based Language

Page 7: Database Systems and XML

Implementation Alternatives

Main differences between relations and XML:

• XML docs have tags

• XML has nested structure

Page 8: Database Systems and XML

Early Tagging, Early Structuring

• Stored Procedure Approach (outside engine)

– Performs a nested-loop join by issuing queries for each nested structure in the desired XML.

– High overhead due to the number of queries.

– Fixed join order.
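A minimal sketch of the stored-procedure style, assuming a hypothetical two-table schema (`customer`, `account`) in an in-memory SQLite database. The table names, column names, and data are illustrative and not from the paper. One child query is issued per parent row, which is where the query overhead and the fixed join order come from.

```python
import sqlite3
from xml.sax.saxutils import escape

def publish_nested_loop(conn):
    """Build XML by issuing one child query per parent row (outside the engine)."""
    parts = []
    queries = 1  # the outer query over customers
    for cust_id, name in conn.execute("SELECT id, name FROM customer ORDER BY id"):
        parts.append(f"<customer><name>{escape(name)}</name>")
        # Inner query: runs once per customer, so the join order is fixed.
        rows = conn.execute(
            "SELECT acct_num FROM account WHERE cust_id = ? ORDER BY acct_num",
            (cust_id,),
        )
        queries += 1
        for (acct_num,) in rows:
            parts.append(f"<account>{escape(acct_num)}</account>")
        parts.append("</customer>")
    return "".join(parts), queries

# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (cust_id INTEGER, acct_num TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO account VALUES (1, 'A-1'), (1, 'A-2'), (2, 'B-1');
""")
xml, query_count = publish_nested_loop(conn)
```

With two customers the sketch issues three queries; the count grows with the number of parent rows.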

Page 9: Database Systems and XML

Early Tagging, Early Structuring

• Correlated CLOB Approach (inside engine)

– One large query with sub-queries is run within the engine.

– Must add XML constructor support to the engine.

– XML fragments from the constructors are stored as CLOBs (Character Large Objects). Costly to handle.

• De-Correlated CLOB Approach (inside engine)

– Perform query de-correlation to give the optimizer more flexibility.

Page 10: Database Systems and XML

Late Tagging, Late Structuring

Two phases:

1) Content creation

2) Tagging and structuring

Page 11: Database Systems and XML

Late Tagging, Late Structuring

Content creation: Redundant Relation Approach

– Join all source tables

– Both content and processing redundancy

Page 12: Database Systems and XML

Late Tagging, Late Structuring

Content creation: Outer Union Approach

– Separate the children of the same parent (e.g. one tuple should represent either account or purchaseOrder).

– At the end, outer union the results.

– Still some data redundancy (e.g. parent info)
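The difference between the redundant relation and the outer union can be sketched in SQLite, assuming a hypothetical schema with one parent table and two child tables (names and data are illustrative, not from the paper). The outer union is emulated here with `UNION ALL` and `NULL` padding.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE account (cust_id INTEGER, acct_num TEXT);
    CREATE TABLE purchase_order (cust_id INTEGER, po_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ann');
    INSERT INTO account VALUES (1, 'A-1'), (1, 'A-2');
    INSERT INTO purchase_order VALUES (1, 10), (1, 11), (1, 12);
""")

# Redundant relation: joining everything multiplies accounts by orders.
redundant = conn.execute("""
    SELECT c.id, c.name, a.acct_num, p.po_id
    FROM customer c
    JOIN account a ON a.cust_id = c.id
    JOIN purchase_order p ON p.cust_id = c.id
""").fetchall()

# Outer union: each tuple carries either an account or a purchase order,
# with NULL padding for the columns of the other child.
outer_union = conn.execute("""
    SELECT c.id, c.name, a.acct_num, NULL AS po_id
    FROM customer c JOIN account a ON a.cust_id = c.id
    UNION ALL
    SELECT c.id, c.name, NULL AS acct_num, p.po_id
    FROM customer c JOIN purchase_order p ON p.cust_id = c.id
""").fetchall()
```

For two accounts and three orders, the full join yields 2 × 3 rows while the outer union yields 2 + 3; the customer name is still repeated in every row, which is the remaining redundancy.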

Page 13: Database Systems and XML

Late Tagging, Late Structuring

Outer Union Plan:

Page 14: Database Systems and XML

Late Tagging, Late Structuring

Structuring/Tagging: Hash-based Tagger

• Group by hashing

• Extract tuples and tag them.
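A rough sketch of a hash-based tagger, assuming outer-union style input tuples of the hypothetical form `(cust_id, name, acct_num, po_id)`. The element names and the sorted output order are assumptions made for a deterministic illustration. Note that the hash table holds all content before any tag is emitted.

```python
from xml.sax.saxutils import escape

def hash_tagger(rows):
    """rows: (cust_id, name, acct_num, po_id) tuples in arbitrary order."""
    groups = {}  # hash table keyed by parent id; holds all content in memory
    for cust_id, name, acct_num, po_id in rows:
        entry = groups.setdefault(
            cust_id, {"name": name, "accounts": [], "orders": []}
        )
        if acct_num is not None:
            entry["accounts"].append(acct_num)
        if po_id is not None:
            entry["orders"].append(po_id)
    out = []
    # Extract each group and tag it (sorted only to make the output stable).
    for cust_id in sorted(groups):
        g = groups[cust_id]
        out.append(f"<customer><name>{escape(g['name'])}</name>")
        out.extend(f"<account>{escape(a)}</account>" for a in sorted(g["accounts"]))
        out.extend(f"<po>{p}</po>" for p in sorted(g["orders"]))
        out.append("</customer>")
    return "".join(out)

# Hypothetical unsorted outer-union tuples.
rows = [
    (2, "Bob", None, 20),
    (1, "Ann", "A-2", None),
    (1, "Ann", None, 10),
    (1, "Ann", "A-1", None),
]
tagged = hash_tagger(rows)
```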

Page 15: Database Systems and XML

Late Tagging, Early Structuring

• Late Tagging, Late Structuring requires much memory for the hash table.

• Fix by creating “structured content” and then tag.

Page 16: Database Systems and XML

Late Tagging, Early Structuring

Structured content: Sorted Outer Union Approach

– Desired format:

1. Parent information comes before or with its child

2. All info of a node and its descendants occurs together

3. Relative order of the tuples matches user-specified order

– Achieved by sorting the result of the outer union on the ids.

Page 17: Database Systems and XML

Late Tagging, Early Structuring

• Tagging Sorted Data: Constant Space Tagger

– Can append tags as soon as data is seen.

– Only need to remember the parent ids of the last tuple seen to know when to append closing tags.
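A minimal sketch of constant-space tagging over sorted outer-union tuples, assuming the hypothetical tuple shape `(cust_id, name, acct_num, po_id)`. The sort is done in Python here only to keep the example self-contained; in the approach described above it would be performed by the engine.

```python
from xml.sax.saxutils import escape

def constant_space_tagger(sorted_rows):
    """Yield XML fragments from rows sorted by parent id; remembers only the last id."""
    last_id = None
    for cust_id, name, acct_num, po_id in sorted_rows:
        if cust_id != last_id:
            if last_id is not None:
                yield "</customer>"  # parent id changed: close the previous element
            yield f"<customer><name>{escape(name)}</name>"
            last_id = cust_id
        if acct_num is not None:
            yield f"<account>{escape(acct_num)}</account>"
        if po_id is not None:
            yield f"<po>{po_id}</po>"
    if last_id is not None:
        yield "</customer>"

# Hypothetical unsorted outer-union tuples.
rows = [
    (2, "Bob", None, 20),
    (1, "Ann", None, 10),
    (1, "Ann", "A-2", None),
    (1, "Ann", "A-1", None),
]
# Sort by parent id, accounts before orders (stand-in for the engine's sort).
sorted_rows = sorted(rows, key=lambda r: (r[0], r[2] is None, r[2] or "", r[3] or 0))
streamed = "".join(constant_space_tagger(sorted_rows))
```

Because the tagger is a generator, fragments can be written out as soon as each tuple arrives, without a hash table.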

Page 18: Database Systems and XML

Experiment

• Inside Engine

• Outside Engine

Page 19: Database Systems and XML

Breakdown of Construction

Page 20: Database Systems and XML

Summary of Results

• Constructing inside the relational engine is more efficient.

• When processing can be done in main memory, the Unsorted Outer Union approach wins.

• When main memory is not enough, the Sorted Outer Union approach is best.

Page 21: Database Systems and XML

Relational Databases for Querying XML Documents

Page 22: Database Systems and XML

Why Bother?

• XML is becoming the standard for data representation on the WWW.

• A query engine designed to tap information from XML documents is valuable.

• Relational database systems are a mature technology and could be used to support XML querying.

Page 23: Database Systems and XML

Basic Idea

Step 1: Generate a relational schema from the DTD

Step 2: Parse the XML document and load the data into tuples of the relational table.

Step 3: Translate the semi-structured XML queries into SQL corresponding to the relational data.

Step 4: Convert the result back to XML.

Page 24: Database Systems and XML

Translating XML to Relational Schema

Main Issues:

1. DTD complexity

2. Arbitrary nesting of XML DTDs vs. two-level nature of relational schemas.

3. Set-valued attributes and recursion

Page 25: Database Systems and XML

1) Flattening transformation

2) Simplification transformation of unary operations

3) Grouping transformation

Page 26: Database Systems and XML

Techniques to translate XML DTD to relations.

• Basic Inlining Technique

• Shared Inlining Technique

• Hybrid Inlining technique

Page 27: Database Systems and XML

Basic Inlining Technique

• Inline as many descendants of an element as possible into a single relation. (author: firstname, lastname, address)

• Every element will have a relation corresponding to it. (firstname, lastname, and address will all have relations)

Page 28: Database Systems and XML

Basic Inlining Technique (cont.)

Complications:

1) Set-valued attributes (e.g. Article)

• Solve by using foreign keys and other tables.

2) Recursion

• Solve with relational keys and relational recursive processing to retrieve the relationship.

Page 29: Database Systems and XML

Tools used in creating relations

DTD Graph

– Nodes are elements, attributes, and operators

– Each element appears once

– Attributes and operators appear as many times as they do in the DTD

– Cycles in the graph indicate recursion
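As an illustration of how cycles signal recursion, the sketch below models a simplified DTD graph as an adjacency dictionary and reports the elements that can reach themselves. The graph contents are a hypothetical example, and operator and attribute nodes are omitted for brevity.

```python
def find_recursive_elements(dtd_graph):
    """Return the set of elements that lie on a cycle of the DTD graph."""
    def reachable(start):
        # Iterative DFS over child edges, starting from the children of `start`.
        seen, stack = set(), list(dtd_graph.get(start, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(dtd_graph.get(node, []))
        return seen
    return {n for n in dtd_graph if n in reachable(n)}

# Hypothetical simplified DTD graph: element -> child elements.
dtd_graph = {
    "book": ["booktitle", "author"],
    "monograph": ["title", "author", "editor"],
    "editor": ["monograph"],
    "author": ["name"],
    "name": ["firstname", "lastname"],
}
recursive = find_recursive_elements(dtd_graph)
```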

Page 30: Database Systems and XML

Tools used in creating relations

Element Graphs

– Generated from the DTD graph

– Created by doing a DFS from an element node

Page 31: Database Systems and XML

Creating a Relation

Given an element graph, the root is made into a relation with all descendants inlined into it, except:

1) Children directly below a “*” are made into separate relations;

2) Each node with a backpointer edge is made into a separate relation.

These additional relations are named by their path from the root and have parentID fields that serve as foreign keys (e.g. article.author has the attribute article.author.parentID)

Page 32: Database Systems and XML

Problems with Basic

• Creates a large number of relations

• Not efficient for certain queries

– Good: “list all authors of books”

– Bad: “list all authors having first name Jack”

Page 33: Database Systems and XML

Shared Inlining Technique

Idea: Identify commonly used element nodes and share them by creating separate relations for them.

Page 34: Database Systems and XML

Shared Inlining Technique

Rules for creating relations:

– Nodes with in-degree>1 have relations made

– Nodes with in-degree=1 are inlined

– Nodes with in-degree=0 have relations made

– Nodes following “*” have relations made

– Among mutually recursive nodes that all have in-degree=1, one is made into a relation
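The in-degree rules can be sketched over a simplified DTD graph given as `(parent, child, starred)` edges, where `starred` marks a child reached through a “*” node. The edge list is a hypothetical example, and the rule for mutually recursive nodes with in-degree 1 is deliberately left out to keep the sketch short.

```python
from collections import Counter

def shared_relations(edges):
    """edges: (parent, child, starred) triples of a simplified DTD graph."""
    nodes = {n for parent, child, _ in edges for n in (parent, child)}
    in_degree = Counter(child for _, child, _ in edges)
    below_star = {child for _, child, starred in edges if starred}
    relations = set()
    for node in nodes:
        # in-degree 0, in-degree > 1, or below "*": gets its own relation.
        if in_degree[node] == 0 or in_degree[node] > 1 or node in below_star:
            relations.add(node)
        # Otherwise (in-degree 1, not below "*") the node is inlined.
    return relations

# Hypothetical edges; True means the child sits below a "*" operator.
edges = [
    ("book", "booktitle", False),
    ("book", "author", False),
    ("article", "title", False),
    ("article", "author", True),
    ("monograph", "title", False),
    ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False),
]
shared = shared_relations(edges)
```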

Page 35: Database Systems and XML

Shared Inlining Technique

Rules for designing the schema:

– Relation X inlines all nodes Y that it can reach such that the path from X to Y does not contain a node that is to be made a separate relation.

– Inlined elements are flagged as being a root with the isRoot field.

Page 36: Database Systems and XML

Problems with Shared

• Too many joins required!

Page 37: Database Systems and XML

Hybrid Inlining Technique

• Same as Shared except Hybrid also inlines elements that…

– have in-degree>1 AND

– are not recursive AND

– are not reached through a “*” node.
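A sketch of the Hybrid decision over a simplified DTD graph given as `(parent, child, starred)` edges. It starts from the in-degree rules of Shared (relations for in-degree 0, in-degree > 1, or nodes below “*”) and then inlines the nodes that meet all three Hybrid conditions. The edge list is hypothetical, and the Shared rule for mutually recursive nodes is omitted.

```python
from collections import Counter

def on_cycle(node, children):
    """True if `node` can reach itself by following child edges."""
    seen, stack = set(), list(children.get(node, []))
    while stack:
        cur = stack.pop()
        if cur == node:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(children.get(cur, []))
    return False

def hybrid_relations(edges):
    """edges: (parent, child, starred) triples of a simplified DTD graph."""
    nodes = {n for p, c, _ in edges for n in (p, c)}
    in_degree = Counter(c for _, c, _ in edges)
    below_star = {c for _, c, starred in edges if starred}
    children = {}
    for p, c, _ in edges:
        children.setdefault(p, []).append(c)
    relations = set()
    for node in nodes:
        shared_rule = (
            in_degree[node] == 0 or in_degree[node] > 1 or node in below_star
        )
        # Hybrid additionally inlines shared, non-recursive, non-starred nodes.
        hybrid_inline = (
            in_degree[node] > 1
            and not on_cycle(node, children)
            and node not in below_star
        )
        if shared_rule and not hybrid_inline:
            relations.add(node)
    return relations

# Hypothetical edges; True means the child sits below a "*" operator.
edges = [
    ("book", "booktitle", False),
    ("book", "author", False),
    ("article", "title", False),
    ("article", "author", True),
    ("monograph", "title", False),
    ("monograph", "author", False),
    ("monograph", "editor", False),
    ("editor", "monograph", True),
    ("author", "name", False),
]
hybrid = hybrid_relations(edges)
```

In this example `title` (in-degree 2, not recursive, not below “*”) is inlined into its parents, while `author` keeps its own relation because one of its incoming edges passes through “*”.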

Page 38: Database Systems and XML

Evaluation Metric

For path expressions of length N, data was gathered on:

• The avg number of SQL queries generated

• The avg number of joins in each SQL query

• The total average number of joins in order to process the path expression

Page 39: Database Systems and XML

Results for N=3

• For Basic, one third of the DTD tests did not run to completion due to lack of virtual memory. Basic is thus ignored.

Page 40: Database Systems and XML

Results for N=3

Page 41: Database Systems and XML

Results for N=3

• Group 1: Hybrid reduces joins per query and increases the number of queries by a smaller amount => Hybrid requires fewer joins than Shared.

• Group 2: Hybrid reduces joins per query and increases the number of queries by a comparable amount => Hybrid and Shared are about the same.

Page 42: Database Systems and XML

Results for N=3

• Group 3: Hybrid reduces some joins per query but increases the number of queries by a lot => Hybrid generates more joins than Shared.

• Hybrid and Shared perform similarly in both joins per query and number of queries => Hybrid and Shared are about the same.

Page 43: Database Systems and XML

Semi-Structured Queries to SQL

Semi-structured query languages

– Allow path expressions with various operators and wildcards.

[Figure: example queries in XML-QL and Lorel]

Page 44: Database Systems and XML

Simple Path to SQL

1. The relations corresponding to the start of the root path are added to the FROM clause.

2. If needed, the path expressions are translated to joins.
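The two steps above can be sketched as a small translator. It assumes a hypothetical schema convention that is not from the paper: each separate relation has an `id` column and a `parentID` foreign key, and inlined elements become columns named by joining the path steps with underscores.

```python
def path_to_sql(path, relations):
    """Translate a simple path such as 'book.author.name' into a SQL string.

    relations: set of element names stored as separate tables (hypothetical);
    every other step is assumed to be inlined as a column of the nearest table.
    """
    steps = path.split(".")
    tables = [steps[0]]  # relation for the start of the path goes in FROM
    joins = []
    column_parts = []
    for step in steps[1:]:
        if step in relations:
            # Crossing into a separate relation becomes a parent/child join.
            joins.append(f"{step}.parentID = {tables[-1]}.id")
            tables.append(step)
            column_parts = []
        else:
            column_parts.append(step)
    column = "_".join(column_parts) if column_parts else "id"
    sql = f"SELECT {tables[-1]}.{column} FROM {', '.join(tables)}"
    if joins:
        sql += " WHERE " + " AND ".join(joins)
    return sql

inlined_sql = path_to_sql("book.booktitle", {"book", "author"})
joined_sql = path_to_sql("book.author.name.lastname", {"book", "author"})
```

A path that stays inside one relation needs no join, while a path that crosses into another relation adds one join per crossing.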

Page 45: Database Systems and XML

Simple Recursive Path to SQL

1. Find initialization of the recursion (e.g. *.monograph.editor with condition monograph.title= “Subclass Cirripedia”)

2. Find the actual recursive path expression (e.g. monograph.editor)

3. Union the two
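One way to picture the three steps is a recursive common table expression, shown here in SQLite. The `monograph` table layout (a `parent_id` pointing at the enclosing monograph, with the editor name inlined) and all data values are hypothetical, chosen only to make the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE monograph (id INTEGER PRIMARY KEY, parent_id INTEGER,
                            title TEXT, editor_name TEXT);
    INSERT INTO monograph VALUES
        (1, NULL, 'Subclass Cirripedia', 'Editor A'),
        (2, 1,    'Nested Monograph 1',  'Editor B'),
        (3, 2,    'Nested Monograph 2',  'Editor C'),
        (4, NULL, 'Unrelated',           'Editor D');
""")
rows = conn.execute("""
    WITH RECURSIVE q(id, editor_name) AS (
        -- 1. Initialization of the recursion
        SELECT id, editor_name FROM monograph
        WHERE title = 'Subclass Cirripedia'
        -- 3. Union the two
        UNION ALL
        -- 2. The recursive path expression (monograph.editor.monograph)
        SELECT m.id, m.editor_name
        FROM monograph m JOIN q ON m.parent_id = q.id
    )
    SELECT editor_name FROM q ORDER BY id
""").fetchall()
editors = [name for (name,) in rows]
```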

Page 46: Database Systems and XML

Arbitrary Path to Simple Recursive Path

• Use a general technique to translate path expressions to many simple (recursive) path expressions.

Page 47: Database Systems and XML

Relational Results to XML: Simple Structuring

Requires only attaching appropriate tags to each tuple.

Page 48: Database Systems and XML

Relational Results to XML: Tag Variables

Have the relational query contain the tag value in the result tuple. Then just convert it to a tag during XML generation.
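A tiny sketch of the idea, assuming result tuples of the hypothetical form `(tag, value)` where the first column carries the tag name produced by the query:

```python
from xml.sax.saxutils import escape

def rows_to_xml(rows):
    """Turn (tag, value) result tuples into elements named by the tag column."""
    return "".join(f"<{tag}>{escape(value)}</{tag}>" for tag, value in rows)

# Hypothetical result tuples with the tag name as data.
tag_rows = [("title", "XML & Databases"), ("author", "Wu")]
tag_xml = rows_to_xml(tag_rows)
```

The sketch escapes values only; it assumes the tag column already holds valid XML element names.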

Page 49: Database Systems and XML

Grouping

a) Could sort the result tuples by the group-by field and scan through them in order when generating the XML.

b) Could do a grouping operation.
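Option (a) can be sketched with a sort followed by a single ordered scan. The `(author, title)` tuple shape and the element names are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter
from xml.sax.saxutils import escape

def group_to_xml(rows):
    """rows: (author, title) tuples; emit titles grouped under each author."""
    out = []
    # Sort by the group-by field, then scan once; groupby relies on that order.
    for author, group in groupby(sorted(rows), key=itemgetter(0)):
        out.append(f"<author><name>{escape(author)}</name>")
        out.extend(f"<title>{escape(title)}</title>" for _, title in group)
        out.append("</author>")
    return "".join(out)

# Hypothetical result tuples.
result_rows = [("Wu", "T2"), ("Lee", "T1"), ("Wu", "T1")]
grouped_xml = group_to_xml(result_rows)
```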

Page 50: Database Systems and XML

Other Cases

• Complex Element Construction

– e.g. asking for all article elements, assuming there may be multiple sub-elements (e.g. author & title)

– Difficult to do in the traditional relational model.

• Heterogeneous Results

– e.g. asking for either title or author of article.

– Could be done in two queries and then merged.

Page 51: Database Systems and XML

Other Cases

• Nested Queries

– Could be rewritten in terms of SQL queries using outer joins.

Page 52: Database Systems and XML

Conclusion

Suggested modifications to relational systems:

• Untyped/variable-typed references.

• Information retrieval style indices

• Flexible comparison operators

• Multiple-query optimization/execution

• More powerful recursion support.