Purifying XML Structures: Ph.D. Defense
Taro L. Saito, University of Tokyo
http://www.xerial.org/

Outline
• Introduction
  – XML and Structural Fluctuation
  – Amoeba Join
• Purifying XML Structures
  – Functional Dependencies for XML
  – Amoeba Join Decomposition
  – Ubiquitous Keys
• Implementation
  – Amoeba Join Processing Algorithms
  – XML Indexing
  – Experimental Results
• Conclusions
  – Applications
  – Summary of Contributions & Future Work

Introduction
• XML (Extensible Markup Language)
  – A markup language representing a tree structure
  – Since 1996, XML has been broadly used as a data representation format
• Major drawbacks
  – Hierarchical representation of data is too complex
    • for both humans and computer programs
    • reminiscent of the 1970s debate
      – Relational vs. hierarchical DB
  – There exist many alternative tree structures
    • to represent the same data model

<bookstore>
  <order>
    <customer>John</customer>
    <book>
      <title>Data on the Web</title>
    </book>
  </order>
</bookstore>

[Figure: tree with root order, children customer and book, and title under book]

Structural Fluctuation
• Differently structured XML documents
  – representing the same data model, e.g. Amazon.com
    • for order, customer, book nodes
  – The hierarchical order of order and customer is reversed.
  – The order node is behind the pending node.

[Figure: two differently structured trees over the nodes order, customer, book, pending, and note, with the text "cancelled"]

Querying Structural Fluctuation
• Standards of XML processing: XPath, SAX, DOM, etc.
• Many parse states:
  – If we find an order, then parse customer and book
  – or if we first find a customer, then parse pending/order and book ...
  – Such query processing is tedious and error-prone!
• Why do we need different programs to parse XML data with the same meaning?

[Figure: the same two trees over order, customer, book, pending, and note]

Structural Fluctuations
• In general, the number of structural fluctuations of n nodes is n^(n-1)
  – Enumeration of rooted labeled trees of n nodes
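
The n^(n-1) count can be checked by brute force for small n. The following is a minimal sketch, not taken from the slides: it enumerates every parent assignment over n labeled nodes and keeps those that form a tree rooted at a chosen node. Sibling order is ignored, and the function names are hypothetical.

```python
from itertools import product


def reaches_root(node, parent, root):
    """Follow parent pointers from node; fail if a cycle is hit before the root."""
    seen = set()
    while node != root:
        if node in seen:
            return False
        seen.add(node)
        node = parent[node]
    return True


def count_rooted_labeled_trees(n):
    """Count rooted trees on n labeled nodes by trying every parent assignment."""
    count = 0
    for root in range(n):
        others = [v for v in range(n) if v != root]
        # Each non-root node picks any node as its parent; cycles are rejected.
        for parents in product(range(n), repeat=len(others)):
            parent = dict(zip(others, parents))
            if all(reaches_root(v, parent, root) for v in others):
                count += 1
    return count


counts = [count_rooted_labeled_trees(n) for n in range(1, 5)]
print(counts)  # [1, 2, 9, 64], i.e. n ** (n - 1) for n = 1..4
```

Already at n = 4 there are 64 ways to arrange the same four labeled nodes, which is why writing one parser per structure does not scale.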

Current Solution
• Disallow structural fluctuations by using a schema
  – XML Schema, DTD, RELAX NG, etc.
• However, fixing a tree structure involves work that is irrelevant to defining a data model.
  – Why do we have to choose only one tree structure?

Heuristic Approach
• SLCA (Smallest Lowest Common Ancestor)
  – [Li, VLDB2004], [Xu, SIGMOD2005]
  – An LCA node that does not contain other LCA nodes.
  – However, it easily leads to unintended results

[Figure: tree with nodes data, order, customer, and book; the slca of (customer, book) is marked]
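
To make the SLCA definition concrete, here is a small sketch under stated assumptions: nodes carry (start, end) preorder intervals, the two example trees are invented for illustration, and the helper names are hypothetical. It is a naive quadratic check, not the algorithms of the cited papers.

```python
def label_nodes(tree):
    """Flatten a (label, children) tree into (start, end, label) triples.

    A node d is a descendant of a when a.start < d.start < a.end.
    """
    nodes = []
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        start = counter
        counter += 1
        slot = len(nodes)
        nodes.append(None)  # reserve the preorder position
        for child in children:
            visit(child)
        nodes[slot] = (start, counter, label)

    visit(tree)
    return nodes


def slca(nodes, labels):
    """Return nodes whose subtree holds every label and no smaller such node."""
    def covers(a):
        return all(
            any(a[0] <= d[0] < a[1] and d[2] == lab for d in nodes)
            for lab in labels
        )

    candidates = [a for a in nodes if covers(a)]
    return [a for a in candidates
            if not any(a[0] < b[0] < a[1] for b in candidates)]


# Assumed example 1: the customer and book sit under one order.
tree1 = ("data", [("order", [("customer", []), ("book", [])]), ("customer", [])])
# Assumed example 2: the customer and the book are only related through the root.
tree2 = ("data", [("customer", []), ("order", [("book", [])])])

result1 = [n[2] for n in slca(label_nodes(tree1), ["customer", "book"])]
result2 = [n[2] for n in slca(label_nodes(tree2), ["customer", "book"])]
print(result1, result2)  # ['order'] ['data']
```

In the second tree the only answer is the root, which pairs a customer and a book that may have nothing to do with each other; this is the kind of unintended result the slide warns about.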

Amoeba Join
• Amoeba Join: AJ(order, customer, book)
  – [WebDB 2006]
  – retrieves node tuples such that
    • one of the (order, customer, book) nodes is a common ancestor of the others.
  – Handles every structural fluctuation

[Figure: the two trees over order, customer, book, pending, and note, with each amoeba and its amoeba root marked]
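
The definition above can be written down as a brute-force check. This is a sketch only: the two tree shapes are an assumption loosely modeled on the figures, the interval labeling and function names are hypothetical, and it is not the processing algorithm used in Xerial.

```python
from itertools import product


def label_nodes(tree):
    """Flatten a (label, children) tree into (start, end, label) triples."""
    nodes = []
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        start = counter
        counter += 1
        slot = len(nodes)
        nodes.append(None)
        for child in children:
            visit(child)
        nodes[slot] = (start, counter, label)

    visit(tree)
    return nodes


def is_ancestor(a, d):
    """True when d lies strictly inside the interval of a."""
    return a[0] < d[0] < a[1]


def amoeba_join(nodes, labels):
    """Return every tuple (one node per label) in which one node is a
    common ancestor of all the others."""
    domains = [[n for n in nodes if n[2] == lab] for lab in labels]
    result = []
    for combo in product(*domains):
        for root in combo:
            if all(other == root or is_ancestor(root, other) for other in combo):
                result.append(combo)
                break
    return result


# Assumed shapes: order on top in d1, customer on top in d2.
d1 = ("order", [("customer", []), ("book", [])])
d2 = ("customer", [("book", []), ("pending", [("order", []), ("note", [])])])

labels = ["order", "customer", "book"]
amoebas1 = amoeba_join(label_nodes(d1), labels)
amoebas2 = amoeba_join(label_nodes(d2), labels)
print(len(amoebas1), len(amoebas2))  # 1 1
```

The same call finds the tuple in both documents, although the amoeba root is the order node in the first and the customer node in the second.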

Semantics of XML Structures
• Semantics implied in XML data
  – Each order node should have a single book node
    • Invalid structures might be retrieved if such semantics of the data are not considered.
  – Instances of such invalid structures could be numerous
• To represent the semantics of XML data, we introduce functional dependencies for XML

Functional Dependency (FD)
• Functional Dependency
  – X → Y: if two tuples p, q agree on X, then they also agree on Y

  order  book  title
  1      b1    Database Systems
  2      b1    Database Systems
  3      b2    Data on the Web

  order  book
  1      b1
  2      b1
  3      b2

  book  title
  b1    Database Systems
  b2    Data on the Web

• FDs: order → book, book → title
• FD is generally used to avoid redundancies of data
  – Normal Form
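
As an illustration of the FD definition and of the decomposition shown in the tables, the sketch below checks the two FDs on the slide's rows and verifies that joining the two projections gives back the original table. The helper names `holds` and `project` are hypothetical.

```python
rows = [
    {"order": 1, "book": "b1", "title": "Database Systems"},
    {"order": 2, "book": "b1", "title": "Database Systems"},
    {"order": 3, "book": "b2", "title": "Data on the Web"},
]


def holds(rows, lhs, rhs):
    """X -> Y holds if rows that agree on X also agree on Y."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


def project(rows, attrs):
    """Project rows onto attrs, dropping duplicate tuples."""
    return {tuple(row[a] for a in attrs) for row in rows}


order_book = project(rows, ["order", "book"])
book_title = project(rows, ["book", "title"])

# Natural join of the two projections on the shared book attribute.
rejoined = {(o, b, t) for (o, b) in order_book for (b2, t) in book_title if b == b2}
original = project(rows, ["order", "book", "title"])

print(holds(rows, ["order"], ["book"]))  # True
print(holds(rows, ["book"], ["order"]))  # False: b1 appears in orders 1 and 2
print(rejoined == original)              # True: the decomposition is lossless
```

The title "Database Systems" is stored twice in the original table and once after decomposition, which is the redundancy the normal forms aim to remove.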

Data Modeling & FD
• FD has an essential role in data modeling
  – describes one-to-one, one-to-many, many-to-many relationships
    • e.g. ER (Entity-Relationship) diagram, UML (Unified Modeling Language)
• Example
  – order → book, order → customer
    • An order has a book. An order has a customer.
    • A book has many orders. A customer has many orders (one-to-many)
  – book → title, title → book
    • A book has a title; a title belongs to a book (one-to-one)
  – customer, book → order
    • An order connects many customers and books (many-to-many)

[Figure: ER diagram relating customer, order, book, and title, with cardinalities 1, m, and n]

Functional Dependencies for XML
• Previous work on FDs for XML
  – [Buneman et al., WWW2001], [Arenas and Libkin, TODS2004]
  – based on fixed paths
    • because there was no counterpart of relations (tables) in XML
  – e.g. /order → /order/book
    • Structural fluctuations are not allowed:
      – In reality, however, the path constraint that a book must be a child of an order is too strong.
  – Their definition has no lossless decomposition

Relation in XML
• Relation in XML allows a zigzag shape
• For an FD: book, customer → order
  – (book, customer, order) must be an amoeba

A set of FDs defines XML structures
• Traditional approach:
  – XML data (Structured Data) -> Data Model
• Our approach: Data Model (FD) -> XML Structures
  – Allows various XML structures to describe a data model
  – Enhances the expressive power of XML databases

Amoeba Join Satisfying FDs
• AJ_F (order, book, customer)
  – retrieves a relation in XML satisfying a set F of FDs
• Makes it easier to manage multiple hierarchies of XML tree structures
  – An amoeba join AJ_F (order, book, customer) can track D2

Amoeba Join Decomposition

FD Based XML Query Processing
• No explicit path structures are required
• Examples:
  – FDs
    • book, customer → order
    • order → book
    • order → customer
  – A query for book and order nodes: AJ_F (book, order)
    • book and order nodes compose amoebas
  – A query for book and customer nodes:
    • AJ_F (book, customer)
      – book and customer nodes might be connected through order nodes
    • Thus, AJ_F (book, customer, order) is evaluated
• Relation in XML is dynamically determined according to query targets

[Figure: ER diagram relating customer, order, book, and title, with cardinalities 1, m, and n]

Functional Dependencies and Keys
• Key is a special case of a functional dependency
  – e.g. order (id) → book, customer
    • order (id) is a key
• Using a relation in XML, we can define keys for XML
  – [order@id] → book, customer
    • Given an order id, we can uniquely determine book and customer nodes
    • XML structures: <<order, book>>, <<order, customer>>
• More general description of keys
  – In [Buneman et al., WWW2001], it is not allowed to reverse the position of order and book nodes

  order  book  customer
  1      b1    John
  2      b2    Lucy

Ubiquitous Keys

Querying without using Structures
• AJ(book, [pending, order, title])
  – book nodes are merged using ubiquitous keys

Amoeba Join Processing

Sweep Amoeba Join Algorithm
• Fetch all input nodes
  – AJ(org, manager, location)
  – Sort input nodes in document order.
• Sweep sorted input nodes
  – Assume the smallest node in the input nodes to be an amoeba root.
  – Search its descendant region for components of amoebas

[Figure: company tree with department, org, manager, location, and name nodes and the text values "Kyoto", "Tokyo", "David", and "Michael"; two amoebas over org, manager, and location are marked]
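
The two steps above can be sketched as follows. This is one reading of the slide under assumptions, not the Xerial implementation: nodes are (start, end, label) triples with preorder intervals, the example tree is invented, and the function name is hypothetical. Because the input is sorted by start value, the descendant region of each root candidate is a contiguous slice found by binary search.

```python
from bisect import bisect_left
from itertools import product


def sweep_amoeba_join(nodes, labels):
    """Sweep input nodes in document order, treating each as an amoeba root
    and searching only its descendant region for the other components."""
    # Step 1: fetch the input nodes and sort them in document order.
    inputs = sorted(n for n in nodes if n[2] in labels)
    starts = [n[0] for n in inputs]
    result = []
    # Step 2: sweep; descendants of root are exactly inputs[i + 1:hi].
    for i, root in enumerate(inputs):
        hi = bisect_left(starts, root[1])
        region = inputs[i + 1:hi]
        domains = [
            [root] if lab == root[2] else [n for n in region if n[2] == lab]
            for lab in labels
        ]
        # An empty domain yields no tuples, so incomplete amoebas are skipped.
        result.extend(product(*domains))
    return result


# Assumed tree: company(org(manager, location), department(manager(org(location))))
nodes = [
    (0, 8, "company"),
    (1, 4, "org"), (2, 3, "manager"), (3, 4, "location"),
    (4, 8, "department"),
    (5, 8, "manager"), (6, 8, "org"), (7, 8, "location"),
]

amoebas = sweep_amoeba_join(nodes, ["org", "manager", "location"])
for amoeba in amoebas:
    print(amoeba)
```

In this example the first amoeba is rooted at an org node and the second at a manager node, so both structural variants are found in a single pass over the sorted input.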

Disk I/O Optimization
• AJ(org, manager, location = "Tokyo")
  – Choose pivot nodes from a small input domain
  – Traverse upward to find amoeba root candidates
  – Search space for amoebas is localized under the amoeba root candidates.

[Figure: the same company tree, with a location node marked as the pivot and the amoeba root candidates (org, manager) marked]

XML Indexing

History of XML Indexing
• Hundreds of XML indexing papers ...
  – tailored to specific queries
    • XPath query, structural-join (A//D), twig-queries, text search, etc.
  – from many research areas
    • Database community
      – DataGuides (1997), 1-index (1999), XR-tree (2002), PathFinder (2006)
      – Node labeling (static or updatable)
        » Dewey order, ORDPATH (2004), BLAS (2004), Pbi (2005)
    • Information Retrieval (IR)
      – inverted indexes for text data, SLCA (2005)
    • Compressed index
      – XBW (Ferragina, WWW2006)

Multidimensional Aspects of XML
• Tree-structure index
  – Ancestor, Descendant (subtree), Sibling
• Path-structure index
  – Suffix-path (//headline/item)
• An XML index that can process both structures simultaneously is strongly required

Our Approach
• [DASFAA2007]
• Integrating tree-structure and path indexes
  – As a multidimensional index
    • (start, end, level, path)
  – It can be implemented on top of the B+-tree
• Why B+-trees?
  – Index structures and transaction management, recovery, logging, caching, etc. are interdependent.
  – We already have many transaction management techniques for B+-trees
    • Transaction management on R-trees is not seriously supported.
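
To show what a (start, end, level, path) entry looks like, here is a minimal sketch that labels a small tree in one depth-first pass and answers tree-structure and path queries from the same entries. The example document and function name are assumptions, and a plain Python list stands in for the B+-tree.

```python
def index_entries(tree):
    """Assign (start, end, level, path) to each node of a (label, children) tree."""
    entries = []
    counter = 0

    def visit(node, level, parent_path):
        nonlocal counter
        label, children = node
        path = parent_path + "/" + label
        start = counter
        counter += 1
        slot = len(entries)
        entries.append(None)  # keep entries in document order
        for child in children:
            visit(child, level + 1, path)
        entries[slot] = (start, counter, level, path)

    visit(tree, 0, "")
    return entries


# Assumed example document.
doc = ("news", [("headline", [("item", [])]), ("item", [])])
entries = index_entries(doc)
root, headline = entries[0], entries[1]

# Descendants: start value falls strictly inside the ancestor's interval.
descendants = [e for e in entries if headline[0] < e[0] < headline[1]]
# Children: descendants exactly one level deeper.
children = [e for e in entries
            if root[0] < e[0] < root[1] and e[2] == root[2] + 1]
# Suffix-path query //item: the path ends with the step sequence.
items = [e for e in entries if e[3].endswith("/item")]

for e in entries:
    print(e)
```

The start and end values answer ancestor and descendant queries, the level value narrows a subtree to children or siblings, and the path value answers suffix-path queries, which is why the four values are treated as one multidimensional key.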

Inverted-Path Index
• Align inverted paths in lexicographical order
  – facilitates suffix-path queries
• Examples (suffix-path query range):
  – //item [6, 11)
  – //headline/item [6, 8)

Z-Order
• Align multidimensional points (nodes) in z-order
  – The interleave function gives the z-order in the multidimensional space
• Each step in the z-order splits a slice into two
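
A common form of the interleave function is plain bit interleaving, sketched below. This is the textbook construction, offered as an assumption about what the slide means; the function name, the coordinate order, and the fixed bit width are choices made for the example.

```python
def interleave(coords, bits):
    """Interleave the bits of each coordinate, most significant bit first.

    The result is the z-order (Morton) value of the point.
    """
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> bit) & 1)
    return z


# Z-order over a 2 x 2 grid with one bit per dimension.
grid = {(x, y): interleave((x, y), 1) for x in range(2) for y in range(2)}
print(grid)  # {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}

# With two bits per dimension, x = 1 (01) and y = 2 (10) interleave to 0110.
z = interleave((1, 2), 2)
print(z)  # 6
```

Each successive bit of the z-value halves the remaining region along one dimension in turn, which matches the slide's remark that each step splits a slice into two. Since the z-value is a single integer, the points can be stored as ordinary keys in a B+-tree.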

Range Query
• Traverse the B+-tree in z-order

Experimental Results

Implementation
• Xerial
  – http://www.xerial.org/
  – XML Database Management System
    • XML data is multi-dimensionally indexed
    • supporting amoeba joins & XPath queries
  – Implemented in C++
    • about 150,000 lines of code
      – Query compiler & scheduler, query processing algorithms
      – Database indexing, XML processor, etc.
• Machine environment for experiments
  – Windows XP notebook
  – Pentium M 2GHz, 1GB main memory
  – 5,400 rpm HDD (100GB)

Database Size
• Data set: XMark benchmark XML document
• Xerial is space-efficient

Suffix-Path Query Performance
• Xerial and the path-start index are the fastest

Subtree Retrieval Performance
• All of the indexes show similar performance
  – XML nodes are sorted in the order of start values

Ancestor Retrieval
• The number of nodes preceding a context node affects the ancestor-query performance.

Sibling Query Performance
• Without indexes for level values, retrieval of sibling nodes is inefficient

Amoeba Join Performance
• Algorithm
  – QK: Quicker, SW: Sweep, BF: Brute Force
• Index
  – I: Index Scan, S: Sequential Scan
• The Quicker algorithm is fastest when we can localize search regions

Improvement by AJ Decomposition
• Without decomposing amoeba joins, the number of XML structures to be retrieved explodes.

Perspectives

Applications
• Our methods can be applied to various XML databases
• Examples of promising applications
• File systems
  – Represent files in XML format
    • reorganizing files and enriching their information with tags
• Bioinformatics
  – Reorganization of data is frequent
    • Statistical analysis (classification, transformation, cleansing, etc.)
  – Integration of various data sources is required

SCMD
• SCMD (Saccharomyces Cerevisiae Morphological Database)
  – [NAR04], [NAR05], [PNAS05]

Deep Copies of XML Data
• Many duplicates (deep copies) of data

<cell>
  <size>…</size>
  <roundness>…</roundness>
  <cluster>
    <function>…</function>
    <property>…</property>
  </cluster>
</cell>

<cluster>
  <function>…</function>
  <property>…</property>
  <cell>
    <size>…</size>
    <roundness>…</roundness>
  </cell>
</cluster>

[Figure: graph-structured data model over cell, size, roundness, cluster, property, function, and gene]

Shallow Copies of XML Data
• Graph-structured data model can be decomposed into several trees
• To connect nodes in trees, we need shallow copies of nodes.

<cell id="1">
  <size>…</size>
  <roundness>…</roundness>
</cell>

<cluster>
  <function>…</function>
  <property>…</property>
  <cell id="1"/>
</cluster>

[Figure: graph-structured data model over cell, size, roundness, cluster, property, function, and gene]

• With FD-based query processing
  – It becomes easier to manage shallow-copy representation of XML data

Future Work
• Query optimization
  – Efficient amoeba join decomposition scheduling
  – Integration of index lookups and cost-based optimization
  – Indexes for amoeba structures
• More complex semantics
  – Ownership of nodes
  – Scope of attributes
• Updates of XML data
  – Detecting violations of FDs
  – Automatic construction of XML structures
    • from unstructured data

Our Contributions
• Amoeba Join
  – Tracks various XML structures
• Functional Dependency
  – defines XML structures of interest
  – Conceptual change: Data model (FD) defines XML structures
• Amoeba Join Decomposition
  – speeds up FD-based query processing
• XML Indexing
  – A space-efficient XML indexing technique

Thank you!