Transcript of: Web Databases (comp.polyu.edu.hkcstyng/webdb.07/lectures/lesson10.pdf)

Page 1

Web Databases

XML System Benchmarks

Page 2

Benchmarks

• Many XML DB tools
  – Design and adopt benchmarks to allow comparative performance analysis
• Four key criteria (Jim Gray 1993)
  – Relevance
  – Portability
  – Scalability
  – Simplicity

Page 3

Existing XML Benchmarks

• Application benchmarks
  – X007 (National University of Singapore, University of Auckland, Arizona State University)
  – XMach-1 (University of Leipzig)
  – XMark (CWI, INRIA, Microsoft, BEA, XQRL, Fraunhofer-IPSI)
  – XBench (University of Waterloo + IBM)

Page 4

Benchmark Dataset

• Must be complex enough to capture all characteristics of XML data representation

• Capture the document (ordering) and navigation (references) features

• Scalability
  – Depth of the tree can be controlled by varying the number of repetitions of recursive elements
  – Width of the tree can be adjusted by varying the cardinality of some elements
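These two knobs can be sketched with a tiny generator. The snippet below is my illustration (not part of any benchmark's tooling); depth is driven by recursion of a ComplexAssembly-style element and width by a fan-out parameter:

```python
import xml.etree.ElementTree as ET

def make_assembly(depth: int, fanout: int) -> ET.Element:
    """Recursive element controls depth; child cardinality controls width."""
    node = ET.Element("ComplexAssembly")
    for _ in range(fanout):
        if depth == 1:
            ET.SubElement(node, "BaseAssembly")   # leaf level
        else:
            node.append(make_assembly(depth - 1, fanout))
    return node

def count_elements(root: ET.Element) -> int:
    return 1 + sum(count_elements(child) for child in root)

root = make_assembly(depth=3, fanout=2)
# 1 + 2 + 4 ComplexAssembly nodes plus 8 BaseAssembly leaves
print(count_elements(root))  # 15
```

Raising the fan-out or adding a level of depth grows the document geometrically, which is what makes these parameters useful scaling knobs.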

Page 5

Benchmark Queries

• Types of queries
  – Data-centric
    • Join, aggregation, sorting (R9, R10, R11)
  – Document-centric
    • Element/document ordering (R17, R21)
  – Navigational
    • Traversal (R13, R20)

Page 6

Benchmark Queries

• R1 – Query all data types and collections of possibly multiple XML documents
• R2 – Allow data-oriented, document-oriented, and mixed queries
• R3 – Accept streaming data
• R4 – Support operations on various data models
• R5 – Allow conditions/constraints on text elements
• R6 – Support hierarchical and sequence queries

Page 7

Benchmark Queries

• R7 – Manipulate NULL values
• R8 – Support quantifiers (some, all, not) in queries
• R9 – Allow queries that combine different parts of document(s)
• R10 – Support aggregation
• R11 – Able to generate sorted results
• R12 – Support composition of operations

Page 8

Benchmark Queries

• R13 – Allow navigation (reference traversals)
• R14 – Able to use environment information as part of queries
• R15 – Able to support XML updates if the data model allows
• R16 – Support type coercion
• R17 – Preserve the structure of the documents
• R18 – Transform and create XML documents

Page 9

Benchmark Queries

• R19 – Support ID creation
• R20 – Structural recursion
• R21 – Element ordering

Page 10

X007

• Derived from the OO7 benchmark for object-oriented databases
• X007
  – Bressan, Dobbie 2001
  – Bressan, Lee 2001
  – http://www.comp.nus.edu.sg/~ebh/XOO7.html
• Allows datasets of varying sizes

Page 11

X007 - DTD

<!ELEMENT Module (Manual, ComplexAssembly)>
<!ATTLIST Module MyID NMTOKEN #REQUIRED
                 type CDATA #REQUIRED
                 buildDate NMTOKEN #REQUIRED>
<!ELEMENT Manual (#PCDATA)>
<!ATTLIST Manual MyID NMTOKEN #REQUIRED
                 title CDATA #REQUIRED
                 textLen NMTOKEN #REQUIRED>
<!ELEMENT ComplexAssembly (ComplexAssembly+ | BaseAssembly+)>
<!ATTLIST ComplexAssembly MyID NMTOKEN #REQUIRED
                          type CDATA #REQUIRED
                          buildDate NMTOKEN #REQUIRED>
<!ELEMENT BaseAssembly (CompositePart+)>
<!ATTLIST BaseAssembly MyID NMTOKEN #REQUIRED
                       type CDATA #REQUIRED
                       buildDate NMTOKEN #REQUIRED>

Page 12

X007 - DTD (continued)

<!ELEMENT CompositePart (Document, Connection+)>
<!ATTLIST CompositePart MyID NMTOKEN #REQUIRED
                        type CDATA #REQUIRED
                        buildDate NMTOKEN #REQUIRED>
<!ELEMENT Document (#PCDATA | para)+>
<!ATTLIST Document MyID NMTOKEN #REQUIRED
                   title CDATA #REQUIRED>
<!ELEMENT para (#PCDATA)>
<!ELEMENT Connection (AtomicPart, AtomicPart)>
<!ATTLIST Connection type CDATA #REQUIRED
                     length NMTOKEN #REQUIRED>
<!ELEMENT AtomicPart EMPTY>
<!ATTLIST AtomicPart MyID NMTOKEN #REQUIRED
                     type CDATA #REQUIRED
                     buildDate NMTOKEN #REQUIRED
                     x NMTOKEN #REQUIRED
                     y NMTOKEN #REQUIRED
                     docId NMTOKEN #REQUIRED>
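To make the DTD concrete, here is a minimal instance built with Python's standard library; all attribute values below are invented for illustration:

```python
import xml.etree.ElementTree as ET

# One element of each type declared in the DTD; the values are made up.
module = ET.Element("Module", MyID="1", type="type001", buildDate="1995")
manual = ET.SubElement(module, "Manual", MyID="2", title="Manual 1", textLen="11")
manual.text = "manual text"
ca = ET.SubElement(module, "ComplexAssembly", MyID="3", type="type002", buildDate="1996")
ba = ET.SubElement(ca, "BaseAssembly", MyID="4", type="type003", buildDate="1997")
cp = ET.SubElement(ba, "CompositePart", MyID="5", type="type004", buildDate="1998")
doc = ET.SubElement(cp, "Document", MyID="6", title="Composite Part 00000005")
ET.SubElement(doc, "para").text = "first paragraph"
conn = ET.SubElement(cp, "Connection", type="type005", length="60000")
# A Connection links exactly two empty AtomicParts; docId refers to a Document.
ET.SubElement(conn, "AtomicPart", MyID="7", type="type006",
              buildDate="1999", x="0", y="0", docId="6")
ET.SubElement(conn, "AtomicPart", MyID="8", type="type006",
              buildDate="1999", x="1", y="0", docId="6")

print(ET.tostring(module, encoding="unicode")[:40])
```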

Page 13

X007 ERD

[Entity-relationship diagram: Module, Manual, Document, ComplexAssembly, BaseAssembly, CompositeParts, AtomicPart, with Assembly and DesignObj as generalizations]

Page 14

X007 Queries

• Query 1 (R1, R2)
  – Randomly generate 5 numbers in the range of AtomicPart's MyID, then return the AtomicParts matching those 5 numbers.

    FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
        /Connection/AtomicPart[@MyID = 221 or @MyID = 1000 or
        @MyID = 535 or @MyID = 13 or @MyID = 2000]
    RETURN $a

• Query 2 (R1, R2)
  – Randomly generate 5 titles of Documents, then return the first paragraph of each Document found by lookup on those titles.

    FOR $d IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
        /Document[@title = "Composite Part 00000009" or
                  @title = "Composite Part 00000050" or
                  @title = "Composite Part 00000034" or
                  @title = "Composite Part 00000022" or
                  @title = "Composite Part 00000080"]
    RETURN $d/para[1]

Page 15

X007 Queries

• Query 3 (R4)
  – Select 5% of AtomicParts via buildDate (in a certain period).

    FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
        /Connection/AtomicPart[@buildDate >= 1900 and @buildDate < 1950]
    RETURN $a

• Query 4 (R13)
  – Find the CompositeParts built later than the BaseAssembly using them (comparing the buildDate attributes).

    FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly,
        $c IN $b/CompositePart[@buildDate > $b/@buildDate]
    RETURN $c

• Query 5 (R9)
  – Within the same BaseAssembly, return the AtomicParts whose docId equals the MyID of some Document.

    FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly,
        $d IN $b/CompositePart/Document
    LET $a := $b/CompositePart/Connection/AtomicPart
    WHERE $d/@MyID = $a/@docId
    RETURN $a
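Outside an XQuery engine, the value join in Query 5 (Document MyID against AtomicPart docId within one BaseAssembly) can be mimicked with Python's standard library; the tiny inline document below is invented for illustration:

```python
import xml.etree.ElementTree as ET

BASE = """
<BaseAssembly>
  <CompositePart>
    <Document MyID="6"/>
    <Connection>
      <AtomicPart MyID="7" docId="6"/>
      <AtomicPart MyID="8" docId="9"/>
    </Connection>
  </CompositePart>
</BaseAssembly>
"""

base = ET.fromstring(BASE)
# Join condition: AtomicPart/@docId = Document/@MyID within the same BaseAssembly.
doc_ids = {d.get("MyID") for d in base.findall("CompositePart/Document")}
matches = [a for a in base.findall("CompositePart/Connection/AtomicPart")
           if a.get("docId") in doc_ids]
print([a.get("MyID") for a in matches])  # ['7']
```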

Page 16

X007 Queries

• Query 6 (R9)
  – Select all BaseAssemblies from one XML database that have an earlier buildDate than BaseAssemblies with the same "type" attribute in another database.

    FOR $b1 IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly,
        $b2 IN document("/export/home/liyg/genxml/small32.xml")
        /ComplexAssembly/ComplexAssembly/ComplexAssembly/ComplexAssembly/BaseAssembly
    WHERE $b1/@type = $b2/@type
      AND $b1/@buildDate < $b2/@buildDate
    RETURN $b1

• Query 7 (R5)
  – Randomly generate two phrases among all phrases in Documents. Select the Documents containing both phrases.

    FOR $d IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
        /Document[contains(., "00000010") and contains(., "document")]
    RETURN $d

• Query 8
  – (to be changed)

Page 17

X007 Queries

• Query 9 (R18)
  – Select all AtomicParts with their corresponding CompositeParts as sub-elements.

    FOR $a IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
        /Connection/AtomicPart
    RETURN <AtomicPart $a/@*>
             shallow($a/../..)
           </AtomicPart>

• Query 10 (R17, R18)
  – Select all ComplexAssemblies of type "type008" without knowledge of the path.

    FOR $ca IN document("small31.xml")//ComplexAssembly[./@type = "type008"]
    RETURN $ca

• Query 11 (R9, R21)
  – Among the first 5 Connections of each CompositePart, select those with length greater than "len".

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[position() <= 5][@length > 60000]

Page 18

X007 Queries

• Query 12 (R9, R21)
  – For each CompositePart, select the first 5 Connections with length greater than "len".

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[@length > 60000][position() <= 5]

• Query 13 (R9, R10)
  – For each BaseAssembly, count the number of Documents.

    FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly
    LET $d := $b/CompositePart/Document
    RETURN count($d)
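Queries 11 and 12 differ only in the order of the two predicates, and the order changes the result: Query 11 keeps the first five Connections and then filters by length, while Query 12 filters by length first and keeps the first five survivors. A plain-Python analogue (length values invented):

```python
lengths = [70000, 10000, 80000, 90000, 20000, 95000, 99000]

# Query 11 style: [position() <= 5] then [@length > 60000]
q11 = [n for n in lengths[:5] if n > 60000]
# Query 12 style: [@length > 60000] then [position() <= 5]
q12 = [n for n in lengths if n > 60000][:5]

print(q11)  # [70000, 80000, 90000]
print(q12)  # [70000, 80000, 90000, 95000, 99000]
```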

Page 19

X007 Queries

• Query 14 (R11, R14)
  – Sort CompositeParts in descending order of buildDate, where buildDate is within a year of the current year.

    FUNCTION year() {
      "2002"
    }
    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    WHERE $c/@buildDate >= (year() - 1)
    RETURN <result>
             $c
           </result> sortby (buildDate DESCENDING)

• Query 15 (R8)
  – Find BaseAssemblies not of type "type008".

    FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly[not(@type='type008')]
    RETURN $b

Page 20

X007 Queries

• Query 16 (R18)
  – Return all BaseAssemblies of type "type008" without any child nodes.

    FOR $b IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly[./@type="type008"]
    RETURN shallow($b)

• Query 17 (R9, R10)
  – Return, without child elements, all Connections whose length is greater than the average Connection length within the same CompositePart.

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart,
        $con IN $c/Connection[./@length > avg($c/Connection/@length)]
    RETURN shallow($con)

Page 21

X007 Queries

• Query 18 (R17, R18)
  – For CompositeParts of type "type008", give a 'Result' element containing the ID of the CompositePart and its Document.

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN <Result $c/@MyID>
             $c/Document
           </Result>

• Query 19
  – Select all CompositeParts, Documents and AtomicParts.

    <Result>
      LET $m := document("small31.xml")
          FILTER (self::CompositePart OR self::Document OR self::AtomicPart)
      RETURN $m
    </Result>

Page 22

X007 Queries

• Query 20
  – Select the last Connection of each CompositePart.

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart
    RETURN $c/Connection[position() = last()]

• Query 21
  – Select the AtomicParts of the third Connection of each CompositePart.

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart,
        $cn IN $c/Connection[position() = 3]
    RETURN $cn/AtomicPart

Page 23

X007 Queries

• Query 22
  – Select the AtomicParts whose MyID is smaller than a sibling's and that occur before that sibling.

    FOR $c IN document("small31.xml")/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/ComplexAssembly/BaseAssembly/CompositePart/Connection,
        $a1 IN $c/AtomicPart
    RETURN $c/AtomicPart[(. BEFORE $a1) AND (./@MyID < $a1/@MyID)]

• Query 23
  – Select all Documents after the Document with MyID = 25.

    FOR $doc IN document("small31.xml")
    LET $d := $doc/ComplexAssembly/ComplexAssembly/ComplexAssembly
        /ComplexAssembly/BaseAssembly/CompositePart/Document[@MyID = 25]
    RETURN <After_DOC>
             $doc/ComplexAssembly/ComplexAssembly/ComplexAssembly
             /ComplexAssembly/BaseAssembly/CompositePart/Document AFTER $d
           </After_DOC>

Page 24

X007 DB Parameters

Parameter              Small    Medium   Large
NumAtomicPerComposite  20       200      200
NumConnPerAtomic       3, 6, 9  3, 6, 9  3, 6, 9
DocumentSize (bytes)   500      1000     1000
ManualSize (bytes)     2000     4000     4000
NumCompositePerModule  50       50       500
NumAssmPerAssm         3        3        3
NumAssmLevels          5        5        7
NumCompositePerAssm    3        3        3
NumModules             1        1        1
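Assuming the document is a pure tree, as the four ComplexAssembly steps in the query paths suggest, the Small parameters imply the following element counts (my derivation; the slide does not state them):

```python
# Small configuration from the table above.
num_assm_per_assm = 3
num_assm_levels = 5          # 4 ComplexAssembly levels + 1 BaseAssembly level
num_composite_per_assm = 3
num_atomic_per_composite = 20

base_assemblies = num_assm_per_assm ** (num_assm_levels - 1)   # 3**4 = 81
composite_parts = base_assemblies * num_composite_per_assm     # 243
atomic_parts = composite_parts * num_atomic_per_composite      # 4860

print(base_assemblies, composite_parts, atomic_parts)  # 81 243 4860
```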

Page 25

XBench

• Capture different application database characteristics

• Capture different application workload characteristics

• Capture full XQuery functionality
• XBench
  – db.uwaterloo.ca/~ddbms/projects/xbench

Page 26

XBench

• “Relevant, portable, scalable and simple”
• Text and non-text documents
  – Text (e.g., digital libraries)
    • Order of elements important
    • Mixed content
  – Non-text (e.g., transactional data)
    • Only child elements and only data
• Structured (schema-based) and non-structured (schema-less)
• Single and multiple documents
• Ability to deal with XML Schema definitions/DTDs as well as the lack of them

Page 27

XBench

• Scalability
  – Small: 10MB, Normal: 100MB, Large: 1GB, Huge: 10GB
• XML documents
  – Balanced and skewed tree structures
  – Exploit XML features (links, notations, entities, namespaces)
• Workload
  – Queries, updates, bulk loading
• XQuery compatibility
• Implementation independence

Page 28

System Under Test

• Single machine
• All applications on the same machine
  – A DBMS
  – A client
    • Send / receive
    • Measure & log
• No Web interaction overhead in this version
• Similar to XMark, different from XMach-1

Page 29

Database Design

• Characterization
  – Text-centric (TC) vs data-centric (DC): the application dimension
  – Single document (SD) vs multiple documents (MD): the document dimension

        MD                                       SD
  DC    E-commerce transactional data            E-commerce catalogs, IMDB (Internet Movie DB)
  TC    Reuters news corpus, Springer DL, DBLP   GCIDE Dictionary, OED

Page 30

Document Characteristics

Statistics of Parameters in Some Application Domains

Application       Stat  Size           Elems        Attrs      Avg A/E  Min A/E  Max A/E  Avg Depth  Min Depth  Max Depth  Avg FanOut  Min FanOut  Max FanOut  Text          Text%  AttV         AttV%
cXML (46)         Avg   2,340.0        33.2         20.9       0.8      0.0      3.9      4.2        2.5        5.1        1.9         1.0         4.6         189.0         7.9%   192.0        10.3%
                  Min   294.0          3.0          6.0        0.3      0.0      2.0      2.0        1.0        2.0        1.0         1.0         1.0         0.0           0.0%   59.0         4.3%
                  Max   6,954.0        87.0         60.0       2.0      0.0      5.0      8.1        3.0        10.0       3.4         1.0         10.0        543.0         16.9%  515.0        29.8%
DBLP (4362)       Avg   581.8          15.4         3.9        0.2      0.0      1.0      1.0        1.0        1.0        14.4        14.4        14.4        271.1         47.7%  40.3         6.3%
                  Min   233.0          5.0          1.0        0.0      0.0      1.0      1.0        1.0        1.0        4.0         1.0         4.0         98.0          16.0%  12.0         0.6%
                  Max   5,937.0        138.0        125.0      0.9      0.0      1.0      1.2        1.0        2.0        137.0       137.0       137.0       2,273.0       67.4%  1,462.0      33.1%
WCS               -     1,598,165.0    11,294.0     65,107.0   5.8      1.0      27.0     1.0        1.0        1.0        11,293.0    11,293.0    11,293.0    11,294.0      0.7%   458,222.0    28.7%
GCIDE             -     57,917,440.0   2,267,510.0  23.0       0.0      0.0      2.0      2.4        1.0        7.0        4.0         1.0         239,185.0   33,594,747.0  58.0%  572.0        0.0%
IMDB1             -     4,006,587.0    155,887.0    21,960.0   0.1      0.0      1.0      2.0        2.0        2.0        8.4         3.0         18,590.0    1,143,552.0   28.5%  36,437.0     0.9%
IMDB2             -     11,036,866.0   283,065.0    24,881.0   0.1      0.0      1.0      3.0        2.0        4.0        5.4         1.0         11,024.0    4,446,604.0   40.3%  72,054.0     0.7%
OLAPCube          -     63,618.0       662.0        3,805.0    5.7      0.0      8.0      2.1        2.0        4.0        55.1        1.0         613.0       2,488.0       3.9%   14,557.0     22.9%
Reuters (1952)    Avg   2,967.1        38.6         40.7       1.1      0.0      4.0      2.2        1.0        4.0        3.6         1.0         14.1        1,557.0       47.5%  480.8        18.1%
                  Min   1,146.0        20.0         23.0       0.2      0.0      4.0      1.9        1.0        4.0        2.2         1.0         7.0         223.0         17.8%  263.0        2.5%
                  Max   11,214.0       229.0        161.0      1.5      0.0      4.0      2.9        1.0        4.0        20.9        1.0         200.0       10,005.0      89.2%  1,966.0      30.2%
Shakespeare (37)  Avg   213,448.5      4,856.5      0.0        0.0      0.0      0.0      3.9        1.0        5.0        5.6         1.0         155.3       126,893.6     59.5%  0.0          0.0%
                  Min   141,345.0      3,153.0      0.0        0.0      0.0      0.0      3.9        1.0        4.0        4.6         1.0         71.0        84,458.0      55.8%  0.0          0.0%
                  Max   288,735.0      6,636.0      0.0        0.0      0.0      0.0      4.0        1.0        5.0        7.0         2.0         434.0       170,648.0     64.2%  0.0          0.0%
XMark             -     116,524,435.0  1,666,315.0  381,878.0  0.2      0.0      2.0      4.6        2.0        11.0       3.7         1.0         25,500.0    81,286,567.0  69.8%  4,284,980.0  3.7%

Page 31

Database Characterization

• Element types
• Tree structure of element types
• Distribution of children to elements
• Distribution of element values to types
• Attribute names
• Distribution of attribute values to names
• Distribution of attributes to elements
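All of these characteristics can be gathered in a single traversal; a sketch with the standard library over a toy document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a x='1'><b>hi</b><b y='2'>there</b><c/></a>")

def walk(elem, depth=0):
    yield elem, depth
    for child in elem:
        yield from walk(child, depth + 1)

nodes = list(walk(root))
elem_types = {e.tag for e, _ in nodes}                 # element types
n_attrs = sum(len(e.attrib) for e, _ in nodes)         # attributes in the doc
max_depth = max(d for _, d in nodes)                   # tree depth
max_fanout = max(len(e) for e, _ in nodes)             # children per element
text_chars = sum(len(e.text or "") for e, _ in nodes)  # text content volume

print(sorted(elem_types), n_attrs, max_depth, max_fanout, text_chars)
# ['a', 'b', 'c'] 2 1 3 7
```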

Page 32

Data Gathering Methodology

• Analysis
• Abstraction
  – Statistical analysis to develop probability distributions for each document
• Generalization
  – Statistically combine the two document characteristics to come up with one document
• Database generation
  – Use ToXgene from the University of Toronto

        MD                                       SD
  DC    TPC-W (all tables; transactional data)   TPC-W (ITEM+AUTHOR+ADDRESS+COUNTRY tables; catalog data)
  TC    Reuters news corpus, Springer DL         GCIDE Dictionary, OED

Page 33

Analysis - DC (TPC-W -> XML)

• Element-oriented mapping vs. attribute-oriented mapping
• Existing mapping methods
  – Flat translation (FT)
  – Nesting-based translation (NeT)
  – Constraint-based translation (CoT)
• Improved mapping methods are used
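As a minimal sketch of what flat translation (FT) does: one element per table, one child per row, one leaf per column. The table and column names below are illustrative, not XBench's actual TPC-W mapping:

```python
import xml.etree.ElementTree as ET

rows = [{"I_ID": "1", "I_TITLE": "book A"},
        {"I_ID": "2", "I_TITLE": "book B"}]

table = ET.Element("ITEM")                  # the table becomes an element
for row in rows:
    tup = ET.SubElement(table, "tuple")     # each row becomes a child element
    for col, val in row.items():
        ET.SubElement(tup, col).text = val  # each column becomes a leaf

print(ET.tostring(table, encoding="unicode"))
```

Nesting-based (NeT) and constraint-based (CoT) translation go further, folding repeating values and key/foreign-key relationships into the element hierarchy instead of emitting flat tuples.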

Page 34

Analysis - TC

• Stats of occurrence of <chapter>
• Stats of occurrence of <section> for each <chapter>
• Stats of occurrence of <p> for each <chapter>
• Stats of occurrence of <p> for each <section>
• Stats of lengths of content of <p>

Page 35

Generalization - TC

• Merge two or more semantically equivalent element types
  – Same document
  – Different documents
• Assumptions
  – All data sources are equally important
  – Frequencies change proportionally w.r.t. data size
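One way to read the equal-importance assumption: per-source frequency estimates are averaged with equal weight when element types are merged. The numbers below are invented for illustration:

```python
# Relative frequency of element types in two data sources (made-up values).
freq_a = {"chapter": 0.10, "section": 0.30, "p": 0.60}
freq_b = {"chapter": 0.20, "section": 0.20, "p": 0.60}

# Equal weights: each source contributes half to the merged distribution.
combined = {tag: round((freq_a[tag] + freq_b[tag]) / 2, 2) for tag in freq_a}
print(combined)  # {'chapter': 0.15, 'section': 0.25, 'p': 0.6}
```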

Page 36

Generation - ToXgene

• Template-based tool for generating synthetic XML documents
• The ToXgene Specification Language (TSL) is based on XML Schema
• Features
  – Distribution
  – Re-use

Page 37

TSL

<tox-distribution name = "c1"
    type = "exponential" minInclusive = "5"
    maxInclusive = "100" mean = "35"/>
...
<simpleType name = "my_float">
  <restriction base = "float">
    <tox-number tox-distribution = "c1"/>
  </restriction>
</simpleType>
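The <tox-distribution> above requests exponentially distributed numbers with mean 35, truncated to [5, 100]. A rejection-sampling sketch of the same idea (ToXgene's actual algorithm may differ):

```python
import random

def truncated_exp(mean, lo, hi, rng):
    """Draw from an exponential with the given mean, rejecting out-of-range values."""
    while True:
        x = rng.expovariate(1.0 / mean)
        if lo <= x <= hi:
            return x

rng = random.Random(42)                        # fixed seed for reproducibility
samples = [truncated_exp(35, 5, 100, rng) for _ in range(1000)]
print(min(samples) >= 5, max(samples) <= 100)  # True True
```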

Page 38

Example Database Schema – TC/MD

Page 39

TPC-W Schema

Page 40

DC/SD Schema

Page 41

Synthetic Data Characteristics

Source                  Size Par.      # Files       Small  Normal  Large    Huge
DC/MD (address.xml)     num_addresses  1             5,760  57,600  576,000  5,760,000
DC/MD (country.xml)     fixed          1             -      -       -        -
DC/MD (author.xml)      num_authors    1             4      40      400      4,000
DC/MD (orderXX.xml)     num_orders     2592-2592000  2,592  25,920  259,200  2,592,000
DC/MD (item.xml)        num_items      1             10     100     1,000    10,000
DC/MD (customer.xml)    num_customers  1             2,880  28,800  288,000  2,880,000
DC/SD (catalog.xml)     num_items      1             3,111  31,110  311,100  3,111,000
TC/MD (articleXX.xml)   article_num    5555555555555555-55555
TC/SD (dictionary.xml)  entry_num      1             1K     10K     100K     1000K

Page 42

Workload

• Core queries
  – Exact match (shallow/deep): 8 queries
  – Function application:
  – Ordered access (relative/absolute):
  – Queries with quantifiers (existential and universal):
  – Sorting queries (by string types/by others):
• Text-centric queries
  – Document construction (structure preserving/transforming):
  – Irregular data (missing elements/null values):
  – Individual document retrieval:
  – Text search (single/multiple word):
• Data-centric queries
  – References and joins:
  – Data type casting:

Page 43

End of Lecture