On the Path to Efficient XML Queries


IBM Almaden Research Center

© 2006 IBM Corporation

Andrey Balmin, Kevin Beyer, Fatma Özcan
IBM Almaden Research Center

Matthias Nicola
IBM Silicon Valley Lab


New languages = new abilities + new pitfalls

XQuery

– A new query language designed specifically for XML data

SQL / XML

– Added XML as a data type, including XQuery sequences

– Added XQuery as a sublanguage


Purpose

Teach users of the new XML query languages

Teach the teachers

Share our users’ experiences

Influence languages


Focus

Large databases with many moderately sized XML documents

Schema flexibility is required

– Many schemas in one collection

– No schema validation used

– Schemas with xs:any

– Documents like Atom Syndication and RSS that allow any extension

Therefore

– Document filtering is primary concern

– Limited type inference

– Any data is possible


Index eligibility

Definition: an index I is eligible to answer predicate P of query Q if, for any collection of XML documents D, the following holds:

Q(D) = Q(I(P, D))

where I(P, D) is the set of XML documents produced by probing index I with predicate P.

This is not as obvious as it is in relational databases.
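The definition can be tried out on a toy collection. The sketch below is only an illustration in Python, not DB2 behavior: the document model, `build_index`, `probe`, and the two sample queries are all assumptions made for the example. A check on one sample collection can refute eligibility but cannot prove it, because the definition quantifies over every collection D.

```python
# Toy model (assumption): a document is (doc_id, list of lineitem prices).
docs = [("d1", [50.0, 175.0]), ("d2", []), ("d3", [201.0]), ("d4", [20.0])]

def build_index(collection):
    """Index I: one (price, doc_id) entry per price in every document."""
    return sorted((price, doc_id) for doc_id, prices in collection for price in prices)

def probe(index, low):
    """I(P, D) for P = 'price > low': ids of documents with a matching entry."""
    return {doc_id for price, doc_id in index if price > low}

def q_some(collection):
    """Query: documents that contain some price > 100."""
    return [doc_id for doc_id, prices in collection if any(p > 100 for p in prices)]

def q_none(collection):
    """Query: documents that contain no price > 100."""
    return [doc_id for doc_id, prices in collection if not any(p > 100 for p in prices)]

def same_result_after_probe(query, collection, low=100):
    """Compare Q(D) with Q(I(P, D)) on this one collection."""
    hits = probe(build_index(collection), low)
    prefiltered = [doc for doc in collection if doc[0] in hits]
    return query(collection) == query(prefiltered)

print(same_result_after_probe(q_some, docs))  # index pre-filtering is harmless here
print(same_result_after_probe(q_none, docs))  # pre-filtering drops the answers
```

The second query needs exactly the documents that the index probe throws away, so the same index on the same predicate is not eligible for it.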


XML indexes in DB2

Index a linear XPath pattern over a column as a particular datatype

CREATE INDEX index-name ON table(xml-column) USING 'pattern' AS type

pattern ::= namespace-decls? (( / | // ) axis? ( name-test | kind-test ))+

axis ::= @ | child:: | attribute:: | self:: | descendant:: | descendant-or-self::

name-test ::= qname | * | ncname:* | *:ncname

kind-test ::= node() | text() | comment() | processing-instruction()

type ::= varchar | double | date | timestamp


Query pattern must be at least as restrictive as the index pattern

CREATE INDEX li_price ON orders(orddoc)
  USING XMLPATTERN '//lineitem/@price' AS double

Can use the index: more restrictive

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')//order[ lineitem/@price > 100 ]
return $i

Cannot use the index: less restrictive

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')//order[ lineitem/@* > 100 ]
return $i


Match index and query predicate data type

CREATE INDEX li_price ON orders(orddoc)
  USING XMLPATTERN '//lineitem/@price' AS double

Can use the index: numeric predicate and index

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')//order[ lineitem/@price > 100 ]
return $i

Cannot use the index: string predicate

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')//order[ lineitem/@price > "100" ]
return $i
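Why the string predicate cannot be served by an index built AS double can be seen by comparing the two orderings on a few made-up attribute values. This is a Python sketch under stated assumptions: the value list and the helper `as_double` are hypothetical, and Python's code-point string comparison stands in for XQuery string comparison under a code-point collation.

```python
# Hypothetical @price values as they might appear in untyped documents.
raw_prices = ["99", "100", "1000", "250", "N/A"]

def as_double(value):
    """Return the value as a float, or None if it is not a valid number."""
    try:
        return float(value)
    except ValueError:
        return None

# A double index can only hold entries for values that cast to a number.
double_index = [v for v in raw_prices if as_double(v) is not None]

# Numeric predicate: @price > 100
numeric_hits = [v for v in double_index if as_double(v) > 100]

# String predicate: @price > "100" (compared character by character)
string_hits = [v for v in raw_prices if v > "100"]

print(numeric_hits)
print(string_hits)
```

The two predicates select different values ("99" qualifies only as a string), and "N/A" satisfies the string predicate although a double index has no entry for it. Probing the double index would therefore lose answers.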


Data Types for Joins

CREATE INDEX o_custid ON orders(orddoc)
  USING XMLPATTERN '//custid' AS double

CREATE INDEX c_custid ON customer(cdoc)
  USING XMLPATTERN '/customer/id' AS double

Cannot use the indexes: unknown comparison type

for $i in db2-fn:xmlcolumn("ORDERS.ORDDOC")/order
for $j in db2-fn:xmlcolumn("CUSTOMER.CDOC")/customer
where $i/custid = $j/id
return $i

Can use the indexes: at least one cast required

for $i in db2-fn:xmlcolumn("ORDERS.ORDDOC")/order
for $j in db2-fn:xmlcolumn("CUSTOMER.CDOC")/customer
where $i/custid/xs:double(.) = $j/id/xs:double(.)
return $i


SQL/XML Query Functions

XMLQuery

Scalar function that returns a (possibly empty) XQuery sequence for every row

XMLExists

Predicate that returns true iff the XQuery sequence produced is not empty

XMLTable

Produces a table with one row for each item in the row-producing XQuery sequence, and with one column per column-producing XQuery expression. The columns may be XQuery sequences or cast to simple SQL types.


XMLQuery does not filter rows (usually)

Cannot use the index:

SELECT XMLQuery('$order//lineitem[ @price > 100 ]'
                passing orddoc as "order")
FROM orders

Can use the index:

VALUES (XMLQuery('db2-fn:xmlcolumn("ORDERS.ORDDOC")//lineitem[ @price > 100 ]'))

Can use the index:

db2-fn:xmlcolumn('ORDERS.ORDDOC')//lineitem[ @price > 100 ]

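Result (slide figure, in the order of the three queries):

– SELECT XMLQuery: one sequence per row, empty ones included: (LI1, LI2), (), (LI3), ()

– VALUES (XMLQuery): a single sequence: (LI1, LI2, LI3)

– Stand-alone XQuery: three items: LI1, LI2, LI3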


XMLExists filters rows (usually)

Can use the index

SELECT ordid, orddoc
FROM orders
WHERE XMLExists('$order//lineitem[ @price > 100 ]'
                passing orddoc as "order")

Cannot use the index: false exists (a Boolean false is still a non-empty sequence, so every row qualifies)

SELECT ordid, orddoc
FROM orders
WHERE XMLExists('$order//lineitem/@price > 100'
                passing orddoc as "order")

Need XMLTest which uses XQuery’s Effective Boolean Value
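The difference between "the sequence exists" and "the sequence is true" can be modeled in a few lines of Python. Everything here is an assumption for illustration: XQuery sequences are Python lists, the row data is invented, and `effective_boolean_value` is a simplified stand-in for what the proposed XMLTest would evaluate, covering only the two shapes used in the example.

```python
# Toy model: an XQuery sequence is a Python list; one list of prices per row.
rows = {1: [175, 150, 20], 2: [30, 40], 3: [201, 10]}

def bracket_predicate(prices):
    """$order//lineitem[@price > 100]: the sequence of qualifying items."""
    return [p for p in prices if p > 100]

def general_comparison(prices):
    """$order//lineitem/@price > 100: a sequence holding exactly one Boolean."""
    return [any(p > 100 for p in prices)]

def xml_exists(sequence):
    """XMLExists: true iff the sequence is non-empty."""
    return len(sequence) > 0

def effective_boolean_value(sequence):
    """Simplified EBV: a single Boolean counts as itself, else non-emptiness."""
    if len(sequence) == 1 and isinstance(sequence[0], bool):
        return sequence[0]
    return len(sequence) > 0

filtered = [rid for rid, p in rows.items() if xml_exists(bracket_predicate(p))]
unfiltered = [rid for rid, p in rows.items() if xml_exists(general_comparison(p))]
with_ebv = [rid for rid, p in rows.items()
            if effective_boolean_value(general_comparison(p))]

print(filtered, unfiltered, with_ebv)
```

With the bracketed predicate, rows without an expensive line item produce an empty sequence and are dropped. With the bare comparison, every row produces a one-item sequence, so the existence test keeps them all; only an EBV-style test filters again.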


XMLQuery + XMLExists vs. XMLTable

Can use the index

SELECT ordid,
       XMLQuery('$order//lineitem[@price > 100]'
                passing orddoc as "order")
FROM orders
WHERE XMLExists('$order//lineitem[@price > 100]'
                passing orddoc as "order")

XMLTable: More efficient and less redundant

SELECT o.ordid, t.lineitem
FROM orders o,
     XMLTable('$order//lineitem[@price > 100]'
              passing o.orddoc as "order"
              COLUMNS "lineitem" XML BY REF PATH '.')
     as t(lineitem)


Predicates in XMLTable column expressions

Can use the index

SELECT o.ordid, t.lineitem
FROM orders o,
     XMLTable('$order//lineitem[@price > 100]'
              passing o.orddoc as "order"
              COLUMNS "lineitem" XML BY REF PATH '.')
     as t(lineitem)

Cannot use the index

SELECT o.ordid, t.lineitem, t.price
FROM orders o,
     XMLTable('$order//lineitem'
              passing o.orddoc as "order"
              COLUMNS "lineitem" XML BY REF PATH '.',
                      "price" DECIMAL(6,3) PATH '@price[. > 100]')
     as t(lineitem, price)

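Result of the first query (predicate in the row-producing expression):

ordid  lineitem
1      LI1
1      LI2
3      LI3

Result of the second query (predicate only in the column expression):

ordid  lineitem  price
1      LI4       null
1      LI1       175
1      LI2       150
2      LI5       null
2      LI6       null
3      LI3       201
3      LI7       null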
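The row counts of the two XMLTable queries follow from where the predicate sits. The Python sketch below mimics that under loud assumptions: the prices of LI4, LI5, LI6, and LI7 are invented (the slide shows only that they fail the predicate), and `row_filter` and `column_filter` are hypothetical helpers, not SQL/XML functions.

```python
# Line items per order. Prices of 100 or less are invented for illustration.
orders = {
    1: [("LI4", 80), ("LI1", 175), ("LI2", 150)],
    2: [("LI5", 60), ("LI6", 90)],
    3: [("LI3", 201), ("LI7", 40)],
}

def row_filter(data):
    """Predicate in the row-producing expression: failing items yield no row."""
    return [(oid, li)
            for oid, items in data.items()
            for li, price in items if price > 100]

def column_filter(data):
    """Predicate in a column expression: every item yields a row,
    and the price column is None (SQL null) when the predicate fails."""
    return [(oid, li, price if price > 100 else None)
            for oid, items in data.items()
            for li, price in items]

for row in row_filter(orders):
    print(row)
for row in column_filter(orders):
    print(row)
```

Because the second form must return a row for every line item, including those whose price fails the test, a probe of the price index cannot be used to discard documents.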


Joining XML Values in SQL/XML

Can use index on product/id, but not p.id

SELECT p.name, o.orddoc
FROM products p, orders o
WHERE XMLExists('$order//lineitem/product[ id eq $pid ]'
                passing o.orddoc as "order", p.id as "pid")

Can use index on p.id, but not product/id

SELECT p.name, o.orddoc
FROM products p, orders o
WHERE p.id = XMLCast(XMLQuery('$order//lineitem/product/id'
                              passing o.orddoc as "order")
                     as VARCHAR(13))

Need to unify XQuery and SQL data types


Joining XML Values in SQL/XML

Probably cannot use XML indexes: SQL types differ from XML

SELECT c.name, o.orddoc
FROM orders o, customer c
WHERE XMLCast(XMLQuery('$order/order/custid'
                       passing o.orddoc as "order") as DOUBLE)
    = XMLCast(XMLQuery('$cust/customer/id'
                       passing c.cdoc as "cust") as DOUBLE)

Can use XML indexes

SELECT c.name, o.orddoc
FROM orders o, customer c
WHERE XMLExists('$order/order[ custid/xs:double(.) = $cust/customer/id/xs:double(.) ]'
                passing o.orddoc as "order", c.cdoc as "cust")


XQuery Let Clauses

Can use the index

for $doc in db2-fn:xmlcolumn('ORDERS.ORDDOC')
for $item in $doc//lineitem[ @price > 100 ]
return <result>{ $item }</result>

for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order
return $ord/lineitem[ @price > 100 ]

Cannot use the index

for $doc in db2-fn:xmlcolumn('ORDERS.ORDDOC')
let $item := $doc//lineitem[ @price > 100 ]
return <result>{ $item }</result>

for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order
return <result>{ $ord/lineitem[ @price > 100 ] }</result>

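Results (slide figure; R is a <result> element, L1 to L3 are line items):

– Can use the index (for clause): R/L1, R/L2, R/L3

– Cannot use the index (let clause or constructor): R/L1,L2, R, R/L3, R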
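The contrast between binding with for and binding with let can be replayed in Python. This is a sketch under assumptions, not XQuery evaluation: each document is reduced to an invented list of line item prices, and a constructed <result> element is modeled as a ("result", items) tuple.

```python
# Toy model: each document is the list of its lineitem prices (invented values).
docs = [[175, 150, 20], [30], [201], []]

def for_version(collection):
    """for $item in $doc//lineitem[@price > 100]: one result per qualifying item."""
    return [("result", [item])
            for doc in collection
            for item in doc if item > 100]

def let_version(collection):
    """let $item := $doc//lineitem[@price > 100]: one result per document,
    even when the bound sequence is empty."""
    return [("result", [item for item in doc if item > 100])
            for doc in collection]

print(for_version(docs))
print(let_version(docs))
```

Iterating with for produces nothing for a document without a qualifying item, so such documents can be skipped through the index. The let binding, like the element constructor around a path, yields an empty <result> for them, so every document has to be visited.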


XQuery Let Clauses

Can use the index

for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order
where $ord/lineitem/@price > 100
return <result>{ $ord/lineitem }</result>

Same as above

for $ord in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order
let $price := $ord/lineitem/@price
where $price > 100
return <result>{ $ord/lineitem }</result>

Result (slide figure, both queries): R/L4,L1,L2 and R/L3,L7


Context is everything

$i is bound to the document node

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')
return $i/order/lineitem

$j is bound to <my_order>, so $j/my_order/lineitem is empty; $j/lineitem is the path that matches

for $j in ( for $o in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order
            return <my_order>{ $o/* }</my_order> )
return $j/my_order/lineitem


Remember the dot

Produces a type error: no document node at root

let $order := <new_order>{ db2-fn:xmlcolumn('ORDERS.ORDDOC')/order[custid > 1001] }</new_order>
return $order[ //customer/name ]

An absolute path expression is shorthand for a path that starts at

fn:root(.) treat as document-node()

With the dot, the path is relative to $order: $order[ .//customer/name ]

Absolute path expressions are bad style


Construction and View Composition

Want to rewrite this…

let $view :=
  for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order/lineitem
  return <ordered>{ $i/@quantity,
                    $i/product/@price,
                    <pid>{ $i/product/id/data(.) }</pid> }</ordered>
for $j in $view
where $j/pid = '17'
return $j/@price


Construction and View Composition

… into this

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')/order/lineitem
where $i/product/id/data(.) = '17'
return $i/product/@price

but…


Construction and View Composition

Data type changed to untypedAtomic

– id is string: comparison is now an error

– id is long: comparison is now as double instead of long

List types are concatenated

Error for duplicate @price attributes lost

New node identity lost

Parent axis is broken

Any sequence should live in tree without change.

Separate identity from construction?


Remember the namespaces

Index definition and query must match namespaces

CREATE INDEX li_price ON orders(orddoc)
  USING XMLPATTERN '//lineitem/@price' AS double

Cannot use the index. Which is right?

declare default element namespace "http://ournamespaces.com/order";

for $i in db2-fn:xmlcolumn('ORDERS.ORDDOC')//order[ lineitem/@price > 100 ]
return $i
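The mismatch is easy to reproduce outside the database. The sketch below uses Python's standard `xml.etree.ElementTree` on two invented sample documents; `count_lineitems` is a hypothetical helper. It only illustrates that an unqualified name and a namespace-qualified name never match each other, which is the situation between the index pattern and the query above.

```python
import xml.etree.ElementTree as ET

NS = "http://ournamespaces.com/order"

# The same order, once without a namespace and once in a default namespace.
plain = ET.fromstring('<order><lineitem price="150"/></order>')
namespaced = ET.fromstring(
    f'<order xmlns="{NS}"><lineitem price="150"/></order>'
)

def count_lineitems(root, namespace=None):
    """Count descendant lineitem elements, optionally qualified by a namespace URI."""
    tag = f"{{{namespace}}}lineitem" if namespace else "lineitem"
    return len(root.findall(f".//{tag}"))

print(count_lineitems(plain))            # unqualified name, unqualified data
print(count_lineitems(namespaced))       # unqualified name, namespaced data
print(count_lineitems(namespaced, NS))   # qualified name, namespaced data
print(count_lineitems(plain, NS))        # qualified name, unqualified data
```

Only the pairs where pattern and data agree on the namespace find the element. Whether the index definition or the query should change depends on which form the stored documents actually use.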


Elements and text nodes differ

CREATE INDEX PRICE_TEXT ON orders(orddoc)
  USING XMLPATTERN '//price' AS varchar

Cannot use the index

for $ord in db2-fn:xmlcolumn("ORDERS.ORDDOC")/order[ lineitem/price/text() = "99.50" ]
return $ord

Element might have more data than just text

<price>99.50<unit>USD</unit></price>
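The gap between an element's value and its text node can be checked with Python's standard `xml.etree.ElementTree`. This is an illustration only: it assumes that an index on the element pattern //price is keyed on the element's full string value, which is how the slide's example reads, and it says nothing about DB2 internals.

```python
import xml.etree.ElementTree as ET

price = ET.fromstring("<price>99.50<unit>USD</unit></price>")

# String value of the element: all descendant text concatenated.
# Assumption: this is what an index on //price AS varchar would be keyed on.
string_value = "".join(price.itertext())

# Text node child of the element: what price/text() selects.
# In ElementTree, .text is the text before the first child element.
text_node = price.text

print(repr(string_value))
print(repr(text_node))
```

The predicate on text() matches this element, while a lookup of "99.50" among the element string values would not find it, so the element index cannot answer the text() predicate.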


Attributes are shy

No attributes

//*

//node()

Only attributes

//@*

//attribute::node()

Empty result due to "principal node kind"

//@*/self::*


Between predicates are not obvious

Might not be between: multiple prices

lineitem[ price > 100 and price < 200 ]

Between or error

lineitem[ price gt 100 and price lt 200 ]

Always between

lineitem/price/data()[ . > 100 and . < 200 ]

Between if not list type

lineitem/price[ . > 100 and . < 200 ]

Between if not list type

lineitem[ @price > 100 and @price < 200 ]
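The first three variants can be mimicked in Python to show why they differ when a line item has several prices. The functions below are simplified models, not an XQuery implementation: a line item is reduced to a list of numeric prices, general comparisons are modeled with `any`, and the value-comparison error is modeled as a `TypeError`.

```python
def general_gt(values, bound):
    """General comparison '>': true if ANY value exceeds the bound."""
    return any(v > bound for v in values)

def general_lt(values, bound):
    """General comparison '<': true if ANY value is below the bound."""
    return any(v < bound for v in values)

def naive_between(prices, low, high):
    """lineitem[ price > low and price < high ]"""
    return general_gt(prices, low) and general_lt(prices, high)

def value_comparison_between(prices, low, high):
    """lineitem[ price gt low and price lt high ]: error if more than one price."""
    if len(prices) > 1:
        raise TypeError("value comparison needs a singleton operand")
    return bool(prices) and low < prices[0] < high

def per_value_between(prices, low, high):
    """lineitem/price/data()[ . > low and . < high ]: each value tested alone."""
    return [p for p in prices if low < p < high]

prices = [50, 250]  # one line item with two prices, neither between 100 and 200
print(naive_between(prices, 100, 200))
print(per_value_between(prices, 100, 200))
```

With two prices, one above the lower bound and another below the upper bound, the general-comparison form accepts the line item although no single price lies in the range. The per-value form tests each price on its own and selects nothing.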


Conclusions

Easy to make mistakes without schema constraints

Many subtle differences in expressions

Improve construction composition

Unify SQL and XQuery type systems

Add XMLTest to SQL/XML