Sedna XML Database System: Internal Representation

Leonid Novak, Ph.D., Software developer, [email protected], Institute for System Programming, Russian Academy of Sciences

description

Describes internal data representation, XPath execution, value indexes, microoperations and update statements

Transcript of Sedna XML Database System: Internal Representation

Page 1: Sedna XML Database System: Internal Representation

Sedna XML Database System: Internal Representation

Leonid Novak, Ph.D., Software developer

[email protected]@ispras.ru

Institute for System Programming, Russian Academy of Sciences

Page 2: Sedna XML Database System: Internal Representation

Agenda

► Data structures
► Descriptive schema of XML documents
► XPath execution modes
► Labeling scheme
► Strings and serialization
► Indexes
► Microoperations
► Update statements

Page 3: Sedna XML Database System: Internal Representation

Sedna Database objects

► Database
► Collection and Stand-alone document
► Document in Collection
► Schema, Index, Trigger, Module
► Node
► Atomic Value (utf-8)

Statement-level:

► Context
► Sequence
► Tuple…

Page 4: Sedna XML Database System: Internal Representation

Internal Data Representation: Descriptive Schema

Page 5: Sedna XML Database System: Internal Representation

Internal Data Representation: Descriptive Schema Driven Storage

Page 6: Sedna XML Database System: Internal Representation

Internal Data Representation: Storing data in blocks

► Blocks are chained into bidirectional lists
► Node descriptors are ordered across blocks according to document order
► Bi-directional references from the descriptive schema node to/from the block

Page 7: Sedna XML Database System: Internal Representation

Internal Data Representation: Node Descriptor Structure

► Fixed-size descriptor inside block
► All pointers are direct except parent
► Long and short pointers are used
► Label: the numbering scheme number
► Indirection record is the OID
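A minimal sketch of why an indirect parent pointer helps with updates, assuming the indirection record acts as a stable OID that maps to the descriptor's current location. The names `Descriptor`, `Storage`, `add`, `parent_of` and `move` are hypothetical and only illustrate the idea; they are not Sedna's actual structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    label: tuple                      # numbering scheme label
    address: int                      # where the descriptor currently lives
    parent_oid: Optional[int] = None  # indirect: key into the indirection table


class Storage:
    """Hypothetical store: a dict stands in for blocks of descriptors."""

    def __init__(self):
        self.by_address = {}   # address -> Descriptor
        self.indirection = {}  # OID -> current address
        self._next_oid = 0

    def add(self, label, address, parent_oid=None):
        self.by_address[address] = Descriptor(label, address, parent_oid)
        oid = self._next_oid
        self._next_oid += 1
        self.indirection[oid] = address
        return oid

    def parent_of(self, oid):
        node = self.by_address[self.indirection[oid]]
        if node.parent_oid is None:
            return None
        # Two hops: OID -> address -> descriptor.
        return self.by_address[self.indirection[node.parent_oid]]

    def move(self, oid, new_address):
        # Moving a descriptor changes one indirection entry;
        # the children that refer to it by OID are not touched.
        node = self.by_address.pop(self.indirection[oid])
        node.address = new_address
        self.by_address[new_address] = node
        self.indirection[oid] = new_address


storage = Storage()
parent = storage.add((1,), address=100)
child_a = storage.add((1, 1), address=200, parent_oid=parent)
child_b = storage.add((1, 2), address=201, parent_oid=parent)
storage.move(parent, 900)
```

After `move`, both children still resolve to the same parent descriptor, now at its new address, without any child descriptor having been rewritten.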

Page 8: Sedna XML Database System: Internal Representation

Labeling Scheme

► Prefix-based (Dewey encoding) labeling (easy updates)
► Label: $[a_1 \dots a_n]$, where $a_i \in [0..255]$
► Document order: $A[a_1..a_n] < B[b_1..b_m]$ iff: $\exists i\ \forall j < i:\ a_j = b_j$ and $a_i < b_i$
► Ancestor: $A[a_1..a_n]$ is an ancestor of $B[b_1..b_m]$ iff: $n < m$ and $\forall j \le n:\ a_j = b_j$ and $b_{n+1} \ne 255$
► 255 is used as a delimiter in generic prefix encoding. In contrast to the generic approach, we don't use it per depth level per label.
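The two label rules on this slide can be sketched directly in Python. The function names `precedes` and `is_ancestor` are made up for illustration. The slide's document-order rule only covers labels that differ at some position, so the sketch returns `False` when one label is a prefix of the other; how Sedna orders that case is not stated here.

```python
from typing import Sequence

Label = Sequence[int]  # each component is assumed to lie in 0..255


def precedes(a: Label, b: Label) -> bool:
    """Document order per the slide: a position i exists where a and b
    agree on every earlier component and a[i] < b[i]."""
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return False  # no differing position within the common length


def is_ancestor(a: Label, b: Label) -> bool:
    """a is an ancestor of b: a is a proper prefix of b and the component
    of b right after that prefix is not 255."""
    n, m = len(a), len(b)
    return n < m and list(b[:n]) == list(a) and b[n] != 255
```

For example, `precedes([1, 2, 3], [1, 3])` holds because the labels first differ at the second component and 2 < 3, while `is_ancestor([1, 2], [1, 2, 255])` is rejected by the 255 check.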

Page 9: Sedna XML Database System: Internal Representation

XPath Evaluation Scenarios

► Simple absolute XPath: /library/book/title (descriptive schema evaluation only)
► Absolute XPath with descendant axes: /library//title (descriptive schema with merge by labeling scheme)
► XPath with predicates: /library/book[title="XQuery"]/author
► following, sibling, parent, …: /library//author[text()="Tolstoy"]/..
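The first two scenarios can be sketched as follows, assuming each descriptive schema node keeps its data nodes in document order. `SchemaNode`, `eval_child_path` and `eval_descendant` are hypothetical names, labels are modeled as plain tuples, and the lists stand in for chains of blocks.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    name: str
    children: dict = field(default_factory=dict)  # child name -> SchemaNode
    nodes: list = field(default_factory=list)     # (label, value), document order

    def child(self, name):
        return self.children.setdefault(name, SchemaNode(name))


def walk(root, steps):
    """Follow child steps through the schema only; no data is read."""
    node = root
    for step in steps:
        node = node.children.get(step)
        if node is None:
            return None
    return node


def schema_descendants(schema_node, name):
    found = []
    for c in schema_node.children.values():
        if c.name == name:
            found.append(c)
        found.extend(schema_descendants(c, name))
    return found


def eval_child_path(root, steps):
    # /library/book/title: one schema node answers the whole query.
    target = walk(root, steps)
    return list(target.nodes) if target else []


def eval_descendant(root, prefix_steps, name):
    # /library//title: several schema nodes may match, so their
    # already ordered lists are merged by label.
    start = walk(root, prefix_steps)
    if start is None:
        return []
    return list(heapq.merge(*(s.nodes for s in schema_descendants(start, name))))


root = SchemaNode("")
library = root.child("library")
library.child("book").child("title").nodes = [((1, 1, 1), "XQuery"),
                                              ((1, 3, 1), "War and Peace")]
library.child("paper").child("title").nodes = [((1, 2, 1), "Sedna")]
```

With this sample data, `eval_child_path(root, ["library", "book", "title"])` touches a single schema node, whereas `eval_descendant(root, ["library"], "title")` interleaves the book and paper titles by label.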

Page 10: Sedna XML Database System: Internal Representation

Various features

► Persistent and temporary (constructed) nodes have identical representation.
► Namespace nodes: explicit and implicit declaration.
► Strings: short and long strings. Random access for long strings.
► System documents.
► Serialization parameters: indent, character maps

Page 11: Sedna XML Database System: Internal Representation

Internal Data Representation: Conclusion

► Fast execution of XPath expressions
  - Descriptive schema as structural index
  - Clustering: avoid reading needless data
► Support for updates
  - Node descriptors have a fixed size within a block
  - Node descriptors are partly ordered
  - The parent pointer of node descriptor is indirect
  - Indirection record is OID
► Numbering scheme based algorithms are used
► Disadvantages:
  - Data serialization is not very fast
  - Space expenditure in case of very unstable structures

Page 12: Sedna XML Database System: Internal Representation

Indexes

► Create Index title ON path1 BY path2 as type
  - path1: nodes to be indexed
  - path2: these nodes' values are used as keys
  - type: an atomic type the keys are cast to
► index-scan(title, value, mode)
  - value: key value (type promotion)
  - mode: one of (EQ, LT, GT, GE, LE)
► Drop index title
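The scan modes above can be illustrated with a small in-memory stand-in. This sketch swaps the real index structure for a sorted Python list searched with `bisect`; the class `ValueIndex` and its methods are hypothetical, and the `cast` argument plays the role of the atomic type the keys are cast to.

```python
import bisect


class ValueIndex:
    """Sorted list of (key, node) pairs; duplicate keys are allowed."""

    def __init__(self, cast):
        self.cast = cast    # atomic type the keys are cast to, e.g. int
        self.entries = []   # kept sorted by (key, node)

    def insert(self, raw_value, node):
        bisect.insort(self.entries, (self.cast(raw_value), node))

    def scan(self, value, mode):
        key = self.cast(value)  # the probe value is cast like the keys
        keys = [k for k, _ in self.entries]
        lo = bisect.bisect_left(keys, key)    # first position with k >= key
        hi = bisect.bisect_right(keys, key)   # first position with k > key
        ranges = {
            "EQ": (lo, hi),
            "LT": (0, lo),
            "LE": (0, hi),
            "GT": (hi, len(keys)),
            "GE": (lo, len(keys)),
        }
        start, stop = ranges[mode]
        return [node for _, node in self.entries[start:stop]]


index = ValueIndex(int)
index.insert("1869", "War and Peace")
index.insert("1877", "Anna Karenina")
index.insert("1869", "The Idiot")
```

Here `index.scan("1869", "EQ")` returns both books stored under the key 1869, which shows the index accepting several nodes for one key.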

Page 13: Sedna XML Database System: Internal Representation

XML vs. SQL indexes

► Dynamic type casting
► Non-uniqueness of (key, value) pair
► Support of dynamic structure changes
► Support for XQuery updates

Page 14: Sedna XML Database System: Internal Representation

Index implementation details & tradeoffs

► B+-tree
► Clusterization
► Error counters
► Pre-sorting during create
► Markers on Schema

► Index update is part of micro-operation
► Long keys are not supported (>PAGE_SIZE/2)
► Physical optimization is not supported (yet)

Page 15: Sedna XML Database System: Internal Representation

Full-text indices and IR

► Integration with external engine: dtSearch
► CREATE FULL_TEXT INDEX title ON path TYPE type ("XML", "stringvalue", "delimited", "customized")
► ftscan based on IR-oriented language
  - and, or, near, contains, wildcards…
  - Stemming and morphology
  - Highlighting in results
► ACID support and lazy evaluation

Page 16: Sedna XML Database System: Internal Representation

Microoperations

► An atomic unbreakable piece of work with DB
► Minimal logical unit for logical undo-redo
► Insert_node (left_sibling, right_sibling, parent…)
  - Inserts new node to descriptive schema (if needed)
  - Inserts new node to blocks (or appends to an existing text node)
  - Index updates, logs, locks…
  - Checks well-formedness (attribute duplicates)
  - Optimized for bulk-loading
► Delete (node)
  - Deletes leaf node (i.e. node w/o children and attributes)
  - Merges text nodes (if needed)
  - Index updates, logs, locks…
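A rough sketch of microoperations as the unit of logical undo, assuming every operation records its logical inverse in a log. The classes `Node` and `Store` and the record layout are invented for illustration. Schema maintenance, index updates, locking and text-node merging are left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


class Store:
    def __init__(self):
        self.log = []  # one logical undo record per microoperation

    def _attach(self, parent, node, left_sibling):
        pos = 0 if left_sibling is None else parent.children.index(left_sibling) + 1
        parent.children.insert(pos, node)
        node.parent = parent

    def insert_node(self, parent, name, left_sibling=None):
        node = Node(name)
        self._attach(parent, node, left_sibling)
        self.log.append(("undo_insert", node))
        return node

    def delete(self, node):
        if node.children:
            raise ValueError("only leaf nodes can be deleted")
        parent = node.parent
        pos = parent.children.index(node)
        left = parent.children[pos - 1] if pos > 0 else None
        parent.children.pop(pos)
        node.parent = None
        self.log.append(("undo_delete", parent, node, left))

    def rollback(self):
        # Undo whole microoperations in reverse order.
        while self.log:
            record = self.log.pop()
            if record[0] == "undo_insert":
                node = record[1]
                node.parent.children.remove(node)
                node.parent = None
            else:
                _, parent, node, left = record
                self._attach(parent, node, left)


store = Store()
doc = Node("library")
first = store.insert_node(doc, "book")
second = store.insert_node(doc, "paper", left_sibling=first)
store.delete(first)
names_before_rollback = [c.name for c in doc.children]
store.rollback()
names_after_rollback = [c.name for c in doc.children]
```

Because each log record describes a complete logical step, rollback never sees a half-applied insert or delete.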

Page 17: Sedna XML Database System: Internal Representation

Sedna updates

► UPDATE insert SourceExpr1 (into|preceding|following) TargetExpr2
► UPDATE delete Expr
► UPDATE delete_undeep Expr
► UPDATE rename Expr on QName
► UPDATE replace $var [as type] in Expr1 with Expr2($var)

Page 18: Sedna XML Database System: Internal Representation

XQuery vs. Sedna updates

► Same expressive power
► No detachments in Sedna (XQueryP issue)
► All updates are top-level in Sedna
► Avoid intermediate copying of nodes of SourceExpression
► Straightforward mappings: insert, delete, rename, replace (->)
► Artificial mapping: replace value (->), delete undeep (<-), replace (<-)
► Transform: straightforward (with copying), artif. (on versions)
► To extend existing expressions in Sedna (FLWR, Comma…): pending update list must be implemented

Page 19: Sedna XML Database System: Internal Representation

Future modifications

► To speed up performance:
  - Physical optimization with indexes and statistics
  - Indirection records inside data blocks
  - Index support for fast serialization (region indexes etc.)
► To decrease XML data size:
  - Non-fixed size for node descriptors
  - Prefix numbering scheme optimization
► Additional functionality:
  - XQuery update facility
  - XQueryP support