Web Databases - comp.polyu.edu.hkcstyng/webdb.07/lectures/lesson10.pdf · comparative performance...
XML Benchmarks 1
Web Databases
XML System Benchmarks
Benchmarks
• Many XML DB tools
  – Design and adopt benchmarks to allow comparative performance analysis
• Four key criteria (Jim Gray 1993)
  – Relevance
  – Portability
  – Scalability
  – Simplicity
Existing XML Benchmarks
• Application benchmarks
  – XOO7 (National University of Singapore, University of Auckland, Arizona State University)
  – XMach-1 (University of Leipzig)
  – XMark (CWI, Inria, Microsoft, BEA, XQRL, Fraunhofer-IPSI)
  – XBench (University of Waterloo + IBM)
Benchmark Dataset
• Must be complex enough to capture all characteristics of XML data representation
• Capture the document (ordering) and navigation (references) features
• Scalability
  – The depth of the tree can be controlled by varying the number of repetitions of recursive elements
  – The width of the tree can be adjusted by varying the cardinality of some elements
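The two scaling knobs above can be illustrated with a small sketch. It is not the generator of any benchmark; the function names and the "Assembly" tag are assumptions chosen for illustration. Depth is the number of times a recursive element is nested, and width is the number of children per inner element.

```python
import xml.etree.ElementTree as ET


def build_tree(depth: int, width: int, tag: str = "Assembly") -> ET.Element:
    """Return an element nested `depth` levels deep, with `width` children per inner node."""
    node = ET.Element(tag)
    if depth > 1:
        for _ in range(width):
            node.append(build_tree(depth - 1, width, tag))
    return node


def count_elements(root: ET.Element) -> int:
    """Count every element in the tree, including the root."""
    return sum(1 for _ in root.iter())


def tree_depth(root: ET.Element) -> int:
    """Number of levels on the longest root-to-leaf path."""
    return 1 + max((tree_depth(child) for child in root), default=0)


small = build_tree(depth=3, width=2)   # baseline
deeper = build_tree(depth=5, width=2)  # more repetitions of the recursive element
wider = build_tree(depth=3, width=4)   # higher cardinality per element
```

Raising `depth` lengthens every path, while raising `width` multiplies the number of siblings at each level, so both change the document size but stress different query types.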
Benchmark Queries
• Types of queries
  – Data-centric
    • Join, aggregation, sorting (R9, R10, R11)
  – Document-centric
    • Element/document ordering (R17, R21)
  – Navigational
    • Traversal (R13, R20)
Benchmark Queries
• R1
  – Query all data types and collections of possibly multiple XML documents
• R2
  – Allow data-oriented, document-oriented, and mixed queries
• R3
  – Accept streaming data
• R4
  – Support operations on various data models
• R5
  – Allow conditions/constraints on text elements
• R6
  – Support hierarchical and sequence queries
Benchmark Queries
• R7
  – Manipulate NULL values
• R8
  – Support quantifiers (some, all, not) in queries
• R9
  – Allow queries that combine different parts of document(s)
• R10
  – Support for aggregation
• R11
  – Able to generate sorted results
• R12
  – Support composition of operations
Benchmark Queries
• R13
  – Allow navigation (reference traversals)
• R14
  – Able to use environment information as part of queries
• R15
  – Able to support XML updates if data model allows
• R16
  – Support type coercion
• R17
  – Preserve the structure of the documents
• R18
  – Transform and create XML documents
Benchmark Queries
• R19
  – Support ID creation
• R20
  – Structural recursion
• R21
  – Element ordering
XOO7
• Derived from the OO7 benchmark
• XOO7
  – Bressan, Dobbie 2001
  – Bressan, Lee 2001
  – http://www.comp.nus.edu.sg/~ebh/XOO7.html
• Allows datasets of varying sizes
XOO7 - DTD
<!ELEMENT Module (Manual, ComplexAssembly)>
<!ATTLIST Module MyID NMTOKEN #REQUIRED
                 type CDATA #REQUIRED
                 buildDate NMTOKEN #REQUIRED>
<!ELEMENT Manual (#PCDATA)>
<!ATTLIST Manual MyID NMTOKEN #REQUIRED
                 title CDATA #REQUIRED
                 textLen NMTOKEN #REQUIRED>
<!ELEMENT ComplexAssembly (ComplexAssembly+ | BaseAssembly+)>
<!ATTLIST ComplexAssembly MyID NMTOKEN #REQUIRED
                          type CDATA #REQUIRED
                          buildDate NMTOKEN #REQUIRED>
<!ELEMENT BaseAssembly (CompositePart+)>
<!ATTLIST BaseAssembly MyID NMTOKEN #REQUIRED
                       type CDATA #REQUIRED
                       buildDate NMTOKEN #REQUIRED>
XOO7 - DTD
<!ELEMENT CompositePart (Document, Connection+)>
<!ATTLIST CompositePart MyID NMTOKEN #REQUIRED
                        type CDATA #REQUIRED
                        buildDate NMTOKEN #REQUIRED>
<!ELEMENT Document (#PCDATA | para)+>
<!ATTLIST Document MyID NMTOKEN #REQUIRED
                   title CDATA #REQUIRED>
<!ELEMENT para (#PCDATA)>
<!ELEMENT Connection (AtomicPart, AtomicPart)>
<!ATTLIST Connection type CDATA #REQUIRED
                     length NMTOKEN #REQUIRED>
<!ELEMENT AtomicPart EMPTY>
<!ATTLIST AtomicPart MyID NMTOKEN #REQUIRED
                     type CDATA #REQUIRED
                     buildDate NMTOKEN #REQUIRED
                     x NMTOKEN #REQUIRED
                     y NMTOKEN #REQUIRED
                     docId NMTOKEN #REQUIRED>
XOO7 ERD
[Figure: entity-relationship diagram with the entities Module, Manual, Document, ComplexAssembly, CompositeParts, BaseAssembly, Assembly, DesignObj, AtomicPart]
XOO7 Queries
• Query 1 (R1, R2)
  – Randomly generate 5 numbers in the range of AtomicPart's MyID, then return the AtomicParts with those 5 MyIDs.
  – FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection/AtomicPart[@MyID = 221 or @MyID = 1000 or @MyID = 535 or @MyID = 13 or @MyID = 2000]
    RETURN $a
• Query 2 (R1, R2)
  – Randomly generate 5 titles for Documents, then return the first paragraph of each Document found by lookup on these titles.
  – FOR $d IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document[@title = "Composite Part 00000009" or @title = "Composite Part 00000050" or @title = "Composite Part 00000034" or @title = "Composite Part 00000022" or @title = "Composite Part 00000080"]
    RETURN $d/para[1]
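For readers without an engine that accepts the draft query syntax above, the exact-match lookup of Query 1 can be approximated in Python. This is a sketch only: the sample document is a hand-made fragment with a single assembly level (not benchmark data), and `atomic_parts_by_id` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment that follows the DTD's nesting; not generated benchmark data.
SAMPLE = """
<ComplexAssembly MyID="1" type="type001" buildDate="1990">
  <BaseAssembly MyID="2" type="type001" buildDate="1990">
    <CompositePart MyID="3" type="type001" buildDate="1990">
      <Connection type="t" length="10">
        <AtomicPart MyID="13"/>
        <AtomicPart MyID="221"/>
      </Connection>
      <Connection type="t" length="20">
        <AtomicPart MyID="535"/>
        <AtomicPart MyID="7"/>
      </Connection>
    </CompositePart>
  </BaseAssembly>
</ComplexAssembly>
"""


def atomic_parts_by_id(root: ET.Element, wanted: set) -> list:
    """Return AtomicPart elements whose MyID is in `wanted`, in document order."""
    path = "BaseAssembly/CompositePart/Connection/AtomicPart"
    return [a for a in root.findall(path) if int(a.get("MyID")) in wanted]


root = ET.fromstring(SAMPLE.strip())
hits = atomic_parts_by_id(root, {221, 1000, 535, 13, 2000})
hit_ids = [a.get("MyID") for a in hits]
```

IDs that do not occur in the document (1000 and 2000 here) simply contribute nothing, which mirrors the `or` chain in the query's predicate.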
XOO7 Queries
• Query 3 (R4)
  – Select 5% of AtomicParts via buildDate (in a certain period).
  – FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection/AtomicPart[@buildDate .>=. 1900 and @buildDate .<. 1950]
    Return $a
• Query 4 (R13)
  – Find each CompositePart that is later than the BaseAssembly using it (comparing the buildDate attribute).
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly,
        $c IN $b/CompositePart[@buildDate .>. $b/@buildDate]
    RETURN $c
• Query 5 (R9)
  – Within the same BaseAssembly, return the AtomicParts whose docId equals the MyID of a Document.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly,
        $d IN $b/CompositePart/Document
    LET $a := $b/CompositePart/Connection/AtomicPart
    WHERE $d/@MyID = $a/@docId
    RETURN $a
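The join in Query 5 can be sketched in Python as a lookup against a set of Document IDs. The sketch follows the prose description of the query rather than being a literal translation of its FOR/LET/WHERE form, and the sample data and helper name are assumptions.

```python
import xml.etree.ElementTree as ET

# Hand-made BaseAssembly fragment; attribute values are illustrative only.
SAMPLE = """
<BaseAssembly MyID="1">
  <CompositePart MyID="10">
    <Document MyID="100" title="d"/>
    <Connection type="t" length="5">
      <AtomicPart MyID="1000" docId="100"/>
      <AtomicPart MyID="1001" docId="999"/>
    </Connection>
  </CompositePart>
</BaseAssembly>
"""


def parts_with_matching_document(base: ET.Element) -> list:
    """AtomicParts in `base` whose docId equals the MyID of a Document in the same BaseAssembly."""
    doc_ids = {d.get("MyID") for d in base.findall("CompositePart/Document")}
    parts = base.findall("CompositePart/Connection/AtomicPart")
    return [a for a in parts if a.get("docId") in doc_ids]


base = ET.fromstring(SAMPLE.strip())
matched_ids = [a.get("MyID") for a in parts_with_matching_document(base)]
```

Building the ID set first turns the nested-loop comparison into a single pass over the AtomicParts, which is one way an engine might evaluate such a value join.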
XOO7 Queries
• Query 6 (R9)
  – Select all BaseAssemblies from one XML database that have the same "type" attribute as, and an earlier buildDate than, a BaseAssembly in another database.
  – FOR $b1 IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly,
        $b2 IN document("/export/home/liyg/genxml/small32.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly
    Where $b1/@type = $b2/@type
      and $b1/@buildDate .<. $b2/@buildDate
    RETURN $b1
• Query 7 (R5)
  – Randomly generate two phrases among all phrases in Documents. Select those documents containing the 2 phrases.
  – FOR $d IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document[contains(., "00000010") and contains(., "document")]
    Return $d
• Query 8
  – (to be changed)
XOO7 Queries
• Query 9 (R18)
  – Select all AtomicParts with corresponding CompositeParts as their sub-elements.
  – FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection/AtomicPart
    Return <AtomicPart $a/@*>
             shallow($a/../..)
           </AtomicPart>
• Query 10 (R17, R18)
  – Select all ComplexAssembly with type "type008" without the knowledge of the path.
  – FOR $ca IN document("small31.xml")//ComplexAssembly[./@type = "type008"]
    RETURN $ca
• Query 11 (R9, R21)
  – Among the first 5 Connections of each CompositePart, select those with length greater than "len".
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[position() .<=. 5][@length .>. 60000]
XOO7 Queries
• Query 12 (R9, R21)
  – For each CompositePart, select the first 5 Connections with length greater than "len".
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[@length .>. 60000][position() .<=. 5]
• Query 13 (R9, R10)
  – For each BaseAssembly count the number of documents.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly
    LET $d := $b/CompositePart/Document
    RETURN count($d)
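Queries 11 and 12 differ only in the order of their two predicates, and that order changes the result. A minimal Python sketch of the two evaluation orders over a list of Connection lengths (the sample values are made up):

```python
def first_n_then_filter(lengths, n, threshold):
    """Query 11 style: take the first n items, then keep those above the threshold."""
    return [x for x in lengths[:n] if x > threshold]


def filter_then_first_n(lengths, n, threshold):
    """Query 12 style: keep items above the threshold, then take the first n of them."""
    return [x for x in lengths if x > threshold][:n]


# Illustrative Connection lengths of one CompositePart, in document order.
lengths = [70000, 100, 65000, 200, 300, 90000, 80000]

q11 = first_n_then_filter(lengths, 5, 60000)
q12 = filter_then_first_n(lengths, 5, 60000)
```

With these values the first form can only see the first five Connections, so it misses the two long ones at the end, while the second form finds all four.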
XOO7 Queries
• Query 14 (R11, R14)
  – Sort CompositePart in descending order where buildDate is within a year from current year.
  – FUNCTION year() {
      "2002"
    }
    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    Where $c/@buildDate .>=. (year()-1)
    RETURN
      <result>
        $c
      </result> sortby (buildDate DESCENDING)
• Query 15 (R8)
  – Find BaseAssemblies not of type "type008".
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly[not(@type='type008')]
    RETURN $b
XOO7 Queries
• Query 16 (R18)
  – Return all BaseAssembly of type "type008" without any child nodes.
  – FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly[./@type="type008"]
    Return shallow($b)
• Query 17 (R9, R10)
  – Return all CompositePart having Connection elements with length greater than Avg(length) within the same CompositePart without child elements.
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart,
        $con IN $c/Connection[./@length .>. avg($c/Connection/@length)]
    Return shallow($con)
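The aggregation in Query 17 and the childless copy it returns can be sketched as follows. The `shallow` helper below is an assumption about what the query's `shallow()` does (copy tag and attributes, drop children); it is not taken from any engine's documentation, and the sample data is made up.

```python
import xml.etree.ElementTree as ET

# Hand-made CompositePart with three Connections of lengths 10, 20 and 60.
SAMPLE = """
<CompositePart MyID="1">
  <Connection type="a" length="10"><AtomicPart MyID="1"/><AtomicPart MyID="2"/></Connection>
  <Connection type="b" length="20"><AtomicPart MyID="3"/><AtomicPart MyID="4"/></Connection>
  <Connection type="c" length="60"><AtomicPart MyID="5"/><AtomicPart MyID="6"/></Connection>
</CompositePart>
"""


def shallow(elem: ET.Element) -> ET.Element:
    """Assumed semantics: a copy with the same tag and attributes but no child elements."""
    return ET.Element(elem.tag, dict(elem.attrib))


def connections_above_average(composite: ET.Element) -> list:
    """Childless copies of the Connections longer than the average within this CompositePart."""
    conns = composite.findall("Connection")
    if not conns:
        return []
    avg = sum(float(c.get("length")) for c in conns) / len(conns)
    return [shallow(c) for c in conns if float(c.get("length")) > avg]


composite = ET.fromstring(SAMPLE.strip())
above = connections_above_average(composite)
```

The average is computed per CompositePart, so each part is compared only against its own Connections.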
XOO7 Queries
• Query 18 (R17, R18)
  – For CompositePart of type "type008", give 'Result' containing ID of CompositePart and Document.
  – FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    Return
      <Result $c/@MyID>
        $c/Document
      </Result>
• Query 19
  – Select all of CompositePart, Document and AtomicPart.
  – <Result>
      Let $m := document("small31.xml") FILTER (self::CompositePart OR self::Document OR self::AtomicPart)
      return $m
    </Result>
XOO7 Queries
• Query 20
  – Select the last connection of each CompositePart.
  – For $c in document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    return $c/Connection[position() = last()]
• Query 21
  – Select the third connection's AtomicParts of each CompositePart.
  – for $c in document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart,
        $cn in $c/Connection[position() = 3]
    return $cn/AtomicPart
XOO7 Queries
• Query 22
  – Select each AtomicPart that occurs before a sibling and has a smaller MyID than that sibling.
  – for $c in document("small31.xml")/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection,
        $a1 in $c/AtomicPart
    return $c/AtomicPart[(. BEFORE $a1) AND (./@MyID .<. $a1/@MyID)]
• Query 23
  – Select all Documents after the Document with MyID = 25.
  – FOR $doc in document("small31.xml")
    LET $d := $doc/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document[@MyID = 25]
    return
      <After_DOC>
        $doc/ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Document AFTER $d
      </After_DOC>
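The AFTER operator in Query 23 depends on document order. A rough Python equivalent walks the tree in document order and collects every Document seen after the anchor. The sample data and the helper name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-made fragment; Documents appear in the order 24, 25, 26, 27.
SAMPLE = """
<BaseAssembly MyID="1">
  <CompositePart MyID="10"><Document MyID="24" title="a"/></CompositePart>
  <CompositePart MyID="11"><Document MyID="25" title="b"/></CompositePart>
  <CompositePart MyID="12"><Document MyID="26" title="c"/></CompositePart>
  <CompositePart MyID="13"><Document MyID="27" title="d"/></CompositePart>
</BaseAssembly>
"""


def documents_after(root: ET.Element, my_id: str) -> list:
    """Document elements that follow, in document order, the Document whose MyID is `my_id`."""
    seen_anchor = False
    result = []
    for doc in root.iter("Document"):  # iter() visits elements in document order
        if seen_anchor:
            result.append(doc)
        elif doc.get("MyID") == my_id:
            seen_anchor = True
    return result


root = ET.fromstring(SAMPLE.strip())
after_ids = [d.get("MyID") for d in documents_after(root, "25")]
```

If the anchor Document does not exist, the sketch returns an empty list.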
XOO7 DB Parameters
Parameters               Small     Medium    Large
NumAtomicPerComposite    20        200       200
NumConnPerAtomic         3, 6, 9   3, 6, 9   3, 6, 9
DocumentSize (bytes)     500       1000      1000
ManualSize (bytes)       2000      4000      4000
NumCompositePerModule    50        50        500
NumAssmPerAssm           3         3         3
NumAssmLevels            5         5         7
NumCompositePerAssm      3         3         3
NumModules               1         1         1
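A quick calculation shows how the assembly parameters drive dataset size. It rests on an assumption that is not stated on the slide: NumAssmLevels counts the BaseAssembly level too, which matches the query paths (four ComplexAssembly steps followed by BaseAssembly for the 5-level configuration). The function names are hypothetical.

```python
def base_assembly_count(num_assm_per_assm: int, num_assm_levels: int) -> int:
    """Number of BaseAssemblies, assuming the last of `num_assm_levels` levels is the base level."""
    return num_assm_per_assm ** (num_assm_levels - 1)


def composite_references(num_assm_per_assm: int, num_assm_levels: int,
                         num_composite_per_assm: int) -> int:
    """Total CompositePart references held by all BaseAssemblies."""
    return base_assembly_count(num_assm_per_assm, num_assm_levels) * num_composite_per_assm


small_bases = base_assembly_count(3, 5)      # Small and Medium configurations
large_bases = base_assembly_count(3, 7)      # Large configuration
small_refs = composite_references(3, 5, 3)
```

Under this assumption, moving from 5 to 7 assembly levels multiplies the number of BaseAssemblies by nine.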
XBench
• Capture different application database characteristics
• Capture different application workload characteristics
• Capture full XQuery functionality
• XBench
  – db.uwaterloo.ca/~ddbms/projects/xbench
XBench
• “Relevant, portable, scalable and simple”
• Text and non-text documents
  – Text (e.g., digital libraries)
    • Order of elements important
    • Mixed content
  – Non-text (e.g., transactional data)
    • Only child elements and only data
• Structured (schema-based) and non-structured (schema-less)
• Single and multiple documents
• Ability to deal with XML Schema definitions/DTDs as well as the lack of them
XBench
• Scalability
  – Small: 10MB, Normal: 100MB, Large: 1GB, Huge: 10GB
• XML Documents
  – Balanced and skewed tree structures
  – Exploit XML features (links, notations, entities, namespaces)
• Workload
  – Queries, updates, bulk loading
• XQuery compatibility
• Implementation independence
System Under Test
• Single machine
• All applications on the same machine
  – A DBMS
  – A Client
    • Send / Receive
    • Measure & Log
• No Web interaction overhead in this version
• Similar to XMark, different from XMach-1
Database Design
• Characterization
  – Text-centric (TC) vs data-centric (DC) - Application
  – Single document (SD) vs multiple documents (MD) - Document
      SD                                              MD
TC    GCIDE Dictionary, OED                           Reuters news corpus, Springer DL, DBLP
DC    E-commerce catalogs, IMDB (Internet Movie DB)   E-commerce transactional data
Document Characteristics
Statistics of Parameters in Some Application Domains
Application       Stat  Size           Elems        Attrs      AvgA/E  MinA/E  MaxA/E  AvgDepth  MinDepth  MaxDepth  AvgFanOut  MinFanOut  MaxFanOut  Text          Text%  AttV         AttV%
cXML (46)         Avg   2,340.0        33.2         20.9       0.8     0.0     3.9     4.2       2.5       5.1       1.9        1.0        4.6        189.0         7.9%   192.0        10.3%
                  Min   294.0          3.0          6.0        0.3     0.0     2.0     2.0       1.0       2.0       1.0        1.0        1.0        0.0           0.0%   59.0         4.3%
                  Max   6,954.0        87.0         60.0       2.0     0.0     5.0     8.1       3.0       10.0      3.4        1.0        10.0       543.0         16.9%  515.0        29.8%
DBLP (4362)       Avg   581.8          15.4         3.9        0.2     0.0     1.0     1.0       1.0       1.0       14.4       14.4       14.4       271.1         47.7%  40.3         6.3%
                  Min   233.0          5.0          1.0        0.0     0.0     1.0     1.0       1.0       1.0       4.0        1.0        4.0        98.0          16.0%  12.0         0.6%
                  Max   5,937.0        138.0        125.0      0.9     0.0     1.0     1.2       1.0       2.0       137.0      137.0      137.0      2,273.0       67.4%  1,462.0      33.1%
WCS               -     1,598,165.0    11,294.0     65,107.0   5.8     1.0     27.0    1.0       1.0       1.0       11,293.0   11,293.0   11,293.0   11,294.0      0.7%   458,222.0    28.7%
GCIDE             -     57,917,440.0   2,267,510.0  23.0       0.0     0.0     2.0     2.4       1.0       7.0       4.0        1.0        239,185.0  33,594,747.0  58.0%  572.0        0.0%
IMDB1             -     4,006,587.0    155,887.0    21,960.0   0.1     0.0     1.0     2.0       2.0       2.0       8.4        3.0        18,590.0   1,143,552.0   28.5%  36,437.0     0.9%
IMDB2             -     11,036,866.0   283,065.0    24,881.0   0.1     0.0     1.0     3.0       2.0       4.0       5.4        1.0        11,024.0   4,446,604.0   40.3%  72,054.0     0.7%
OLAPCube          -     63,618.0       662.0        3,805.0    5.7     0.0     8.0     2.1       2.0       4.0       55.1       1.0        613.0      2,488.0       3.9%   14,557.0     22.9%
Reuters (1952)    Avg   2,967.1        38.6         40.7       1.1     0.0     4.0     2.2       1.0       4.0       3.6        1.0        14.1       1,557.0       47.5%  480.8        18.1%
                  Min   1,146.0        20.0         23.0       0.2     0.0     4.0     1.9       1.0       4.0       2.2        1.0        7.0        223.0         17.8%  263.0        2.5%
                  Max   11,214.0       229.0        161.0      1.5     0.0     4.0     2.9       1.0       4.0       20.9       1.0        200.0      10,005.0      89.2%  1,966.0      30.2%
Shakespeare (37)  Avg   213,448.5      4,856.5      0.0        0.0     0.0     0.0     3.9       1.0       5.0       5.6        1.0        155.3      126,893.6     59.5%  0.0          0.0%
                  Min   141,345.0      3,153.0      0.0        0.0     0.0     0.0     3.9       1.0       4.0       4.6        1.0        71.0       84,458.0      55.8%  0.0          0.0%
                  Max   288,735.0      6,636.0      0.0        0.0     0.0     0.0     4.0       1.0       5.0       7.0        2.0        434.0      170,648.0     64.2%  0.0          0.0%
XMark             -     116,524,435.0  1,666,315.0  381,878.0  0.2     0.0     2.0     4.6       2.0       11.0      3.7        1.0        25,500.0   81,286,567.0  69.8%  4,284,980.0  3.7%
Database Characterization
• Element types
• Tree structure of element types
• Distribution of children to elements
• Distribution of element values to types
• Attribute names
• Distribution of attribute values to names
• Distribution of attributes to elements
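Several of these characteristics can be measured directly from a document. The sketch below computes a handful of them with Python's standard library. The exact definitions are assumptions (for example, fan-out is averaged over elements that have at least one child), so the numbers are not guaranteed to match how any published statistics were computed.

```python
import xml.etree.ElementTree as ET


def document_stats(root: ET.Element) -> dict:
    """Collect element/attribute counts, depth, and fan-out for one document."""
    elems = attrs = max_depth = 0
    fanouts = []  # child counts of non-leaf elements only (an assumed definition)
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        elems += 1
        attrs += len(node.attrib)
        max_depth = max(max_depth, depth)
        children = list(node)
        if children:
            fanouts.append(len(children))
        stack.extend((child, depth + 1) for child in children)
    return {
        "elems": elems,
        "attrs": attrs,
        "avg_attrs_per_elem": attrs / elems,
        "max_depth": max_depth,
        "avg_fanout": sum(fanouts) / len(fanouts) if fanouts else 0.0,
        "max_fanout": max(fanouts, default=0),
    }


doc = ET.fromstring('<a x="1"><b y="2" z="3"/><c><d/></c></a>')
stats = document_stats(doc)
```

Running the function over every document in a collection and taking the average, minimum, and maximum of each field gives per-collection summaries of the kind used to characterize an application domain.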
Data Gathering Methodology
• Analysis
• Abstraction
  – Statistical analysis to develop probability distributions for each document
• Generalization
  – Statistically combining the two document characteristics to come up with one document
• Database generation
  – Use ToxGene from University of Toronto
      SD                                                          MD
TC    GCIDE Dictionary, OED                                       Reuters news corpus, Springer DL
DC    TPC-W (ITEM+AUTHOR+ADDRESS+COUNTRY tables; catalog data)    TPC-W (All tables; transactional data)
Analysis - DC (TPC-W -> XML)
• Element oriented mapping vs. attribute oriented mapping
• Existing mapping methods
  – Flat translation (FT)
  – Nesting based translation (NeT)
  – Constraint based translation (CoT)
• Improved mapping methods are used
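The difference between the two mapping styles is easiest to see on a single relational row. The sketch below is a flat, row-at-a-time illustration only, not the improved mapping method the slide refers to; the sample row and its values are made up.

```python
import xml.etree.ElementTree as ET


def row_to_element_oriented(table: str, row: dict) -> ET.Element:
    """Each column becomes a child element holding the value as text."""
    elem = ET.Element(table)
    for column, value in row.items():
        ET.SubElement(elem, column).text = str(value)
    return elem


def row_to_attribute_oriented(table: str, row: dict) -> ET.Element:
    """Each column becomes an attribute on a single element."""
    return ET.Element(table, {column: str(value) for column, value in row.items()})


# Illustrative row; the values are invented for this example.
row = {"I_ID": 1, "I_TITLE": "XML Benchmarks"}

as_elements = row_to_element_oriented("ITEM", row)
as_attributes = row_to_attribute_oriented("ITEM", row)
```

The element-oriented form yields deeper documents with more elements, while the attribute-oriented form yields flat elements with a high attribute-to-element ratio, so the choice directly shapes the characteristics a benchmark database exhibits.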
Analysis - TC
• Stats of occurrence of <chapter>
• Stats of occurrence of <section> for each <chapter>
• Stats of occurrence of <p> for each <chapter>
• Stats of occurrence of <p> for each <section>
• Stats of lengths of content of <p>
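Occurrence statistics like these amount to a histogram of child counts per parent. A minimal sketch, assuming direct parent-child nesting and using a made-up sample document:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made text-centric sample: three chapters with 2, 1 and 2 sections.
SAMPLE = """
<book>
  <chapter><section><p>one</p></section><section><p>two</p><p>three</p></section></chapter>
  <chapter><section><p>four</p></section></chapter>
  <chapter><section><p>five</p></section><section><p>six</p></section></chapter>
</book>
"""


def child_count_distribution(root: ET.Element, parent_tag: str, child_tag: str) -> Counter:
    """Histogram: how many `parent_tag` elements have k direct `child_tag` children."""
    return Counter(len(parent.findall(child_tag)) for parent in root.iter(parent_tag))


book = ET.fromstring(SAMPLE.strip())
sections_per_chapter = child_count_distribution(book, "chapter", "section")
paragraphs_per_section = child_count_distribution(book, "section", "p")
```

Dividing each count by the total turns the histogram into an empirical probability distribution that a data generator can sample from.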
Generalization - TC
• Merge two or more semantically same element types
  – Same document
  – Different documents
• Assumptions
  – All data sources are equally important
  – Frequencies change proportionally w.r.t. data size
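One plausible reading of the "equally important" assumption is that each source's distribution is normalized first and the normalized distributions are then averaged with equal weight, so a large source does not dominate a small one. The sketch below implements that reading; it is an interpretation, not a documented procedure, and the function name is hypothetical.

```python
def merge_distributions(distributions: list) -> dict:
    """Average several count histograms after normalizing each to probabilities."""
    merged = {}
    weight = 1.0 / len(distributions)  # every source counts equally
    for dist in distributions:
        total = sum(dist.values())
        for value, count in dist.items():
            merged[value] = merged.get(value, 0.0) + weight * count / total
    return merged


# Two sources for the same element type: occurrences per parent -> number of parents.
source_a = {1: 1, 2: 1}   # small source
source_b = {2: 200}       # much larger source
merged = merge_distributions([source_a, source_b])
```

Because the sources are normalized before averaging, the result does not depend on how many documents each source contributed.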
Generation - ToxGene
• Template based tool generating synthetic XML documents
• The ToxGene Specification Language (TSL) is based on XML Schema
• Features
  – Distribution
  – Re-use
TSL
<tox-distribution name = "c1" type = "exponential" minInclusive = "5" maxInclusive = "100" mean = "35"/>
...
<simpleType name = "my_float">
  <restriction base = "float">
    <tox-number tox-distribution = "c1"/>
  </restriction>
</simpleType>
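To get a feel for the kind of values such a declaration describes, the sketch below draws numbers from an exponential distribution with mean 35 and keeps only those in [5, 100]. How ToxGene itself enforces the bounds is not stated here, so rejection sampling is an assumption, and the function name is hypothetical.

```python
import random


def sample_truncated_exponential(mean: float, low: float, high: float,
                                 rng: random.Random) -> float:
    """Draw from an exponential with the given mean, retrying until the value is in [low, high]."""
    while True:
        value = rng.expovariate(1.0 / mean)  # expovariate takes the rate, i.e. 1 / mean
        if low <= value <= high:
            return value


rng = random.Random(0)  # fixed seed so repeated runs produce the same values
samples = [sample_truncated_exponential(35.0, 5.0, 100.0, rng) for _ in range(1000)]
```

Note that discarding out-of-range draws shifts the mean of the accepted values away from 35, so a generator that clamps or rescales instead would produce a different distribution.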
Example Database Schema – TC/MD
[Figure: schema diagram]
TPC-W Schema
[Figure: schema diagram]
DC/SD Schema
[Figure: schema diagram]
Synthetic Data Characteristics
Sources                 Size Par.      # Files       Small  Normal  Large   Huge
TC/SD (dictionary.xml)  entry_num      1             1K     10K     100K    1000K
TC/MD (articleXX.xml)   article_num    55-55555      55     555     5555    55555
DC/SD (catalog.xml)     num_items      1             3111   31110   311100  3111000
DC/MD (customer.xml)    num_customers  1             2880   28800   288000  2880000
DC/MD (item.xml)        num_items      1             10     100     1000    10000
DC/MD (orderXX.xml)     num_orders     2592-2592000  2592   25920   259200  2592000
DC/MD (author.xml)      num_authors    1             4      40      400     4000
DC/MD (country.xml)     fixed          1             -      -       -       -
DC/MD (address.xml)     num_addresses  1             5760   57600   576000  5760000
Workload
• Core queries
  – Exact match (shallow/deep): 8 queries
  – Function application:
  – Ordered access (relative/absolute):
  – Queries with quantifiers (existential and universal):
  – Sorting queries (by string types/by others):
• Text-centric queries
  – Document construction (structure preserving/transforming):
  – Irregular data (missing elements/null values):
  – Individual document retrieval:
  – Text search (single/multiple word):
• Data-centric queries
  – References and joins:
  – Data type casting:
End of Lecture