Post on 20-Dec-2015
From Semistructured Data to XML: Migrating the Lore Data Model and Query Language
Roy Goldman, Jason McHugh, Jennifer Widom
Stanford University
http://www-db.stanford.edu/lore/
Introduction
• Lore
– Originally a DBMS designed specifically for semistructured data
– Semistructured data models and XML share many similarities
– Migrating Lore to work with XML
• Modifications to data model
• Changes to query language
• Changes to DataGuides
OEM (Object Exchange Model)
• Lore’s original data model
• All entities are atomic or complex objects
• Each object has a unique object identifier (oid)
• Atomic objects contain a value from one of the atomic types (integer, real, string, etc.)
• Complex objects are sets of <label, subobject> pairs
• Can be thought of as a labeled directed graph
– objects are nodes
– complex objects have labeled outgoing edges
– atomic objects contain their value
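The OEM definitions above can be sketched in a few lines of Python. This is only an illustration, not Lore's implementation: the class names `Atomic` and `Complex`, the counter-based oid generator, and the sample values are assumptions made for the example. The labels reuse the DBGroup.Member.Project.Title path that appears later in these slides.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Tuple, Union

_oids = count(1)  # hypothetical oid generator: 1, 2, 3, ...


@dataclass
class Atomic:
    """Atomic object: holds a single value of an atomic type."""
    value: Union[int, float, str]
    oid: int = field(default_factory=lambda: next(_oids))


@dataclass
class Complex:
    """Complex object: a collection of (label, subobject) pairs."""
    pairs: List[Tuple[str, object]] = field(default_factory=list)
    oid: int = field(default_factory=lambda: next(_oids))

    def add(self, label, obj):
        # Adds a labeled outgoing edge; OEM attaches no meaning to the order.
        self.pairs.append((label, obj))
        return obj

    def children(self, label):
        return [obj for lbl, obj in self.pairs if lbl == label]


# Illustrative data: two members that share one project object.
db_group = Complex()
project = Complex()
project.add("Title", Atomic("Lore"))
m1 = db_group.add("Member", Complex())
m1.add("Name", Atomic("Jeff Ullman"))
m1.add("Project", project)
m2 = db_group.add("Member", Complex())
m2.add("Name", Atomic("Jennifer Widom"))
m2.add("Project", project)

names = [a.value for m in db_group.children("Member") for a in m.children("Name")]
```

Because both members hold an edge to the same `project` object, the structure is a graph rather than a tree, which is the property the comparison with XML turns on.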
Differences between XML and OEM
• XML has attributes
• XML is ordered, OEM is not
• XML does not directly support graph structure
– Uses special attribute types to encode graph structure
– Example:
<Person Id = 'P1' Name = 'Jeff Ullman' Colleague = 'P2'/>
<Person Id = 'P2' Name = 'Jennifer Widom' Colleague = 'P1'/>
<Publication Title = 'A First Course In Database Systems' Author = 'P1 P2'/>
– Attribute Id is of type ID, Colleague is of type IDREF, and Author is of type IDREFS
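[Figure: graph of the example. Nodes for Jeff Ullman and Jennifer Widom are connected by two Colleague edges, and two Author edges link the publication to them.]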
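As a rough sketch of how the ID/IDREF(S) encoding above turns back into graph edges, the following Python uses the standard `xml.etree.ElementTree` module. Several things here are assumptions for illustration: the wrapping `<DB>` root element, the helper name `crosslinks`, and the hand-written `ID_ATTR` / `REF_ATTRS` declarations, which stand in for the attribute types a DTD would normally supply.

```python
import xml.etree.ElementTree as ET

# The example from the slide, wrapped in a hypothetical <DB> root element.
DOC = """
<DB>
  <Person Id="P1" Name="Jeff Ullman" Colleague="P2"/>
  <Person Id="P2" Name="Jennifer Widom" Colleague="P1"/>
  <Publication Title="A First Course In Database Systems" Author="P1 P2"/>
</DB>
"""

# Assumption: attribute types are declared by hand instead of read from a DTD.
ID_ATTR = "Id"                      # type ID
REF_ATTRS = {"Colleague", "Author"}  # types IDREF and IDREFS

root = ET.fromstring(DOC)
by_id = {e.get(ID_ATTR): e for e in root.iter() if e.get(ID_ATTR) is not None}


def crosslinks(elem):
    """Return (attribute name, target element) pairs for IDREF(S) attributes."""
    links = []
    for name, value in elem.attrib.items():
        if name in REF_ATTRS:
            # An IDREFS value is a whitespace-separated list of IDs.
            for ref in value.split():
                links.append((name, by_id[ref]))
    return links


pub = root.find("Publication")
authors = [target.get("Name") for _, target in crosslinks(pub)]
colleague_of_p1 = crosslinks(by_id["P1"])[0][1].get("Name")
```

Here `authors` holds the names reached through the two Author edges, and `colleague_of_p1` follows the Colleague edge from P1 to P2.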
Literal vs. Semantic Data Model
• Should an XML data model be a literal tree corresponding to XML’s text representation? (where IDREF(S) are nothing but string attributes)
• Or should it be a graph that includes all the intended links? (preserving the semantic graph structure)
• It should be... BOTH!
– Both literal and semantic modes should be supported
– The user or application can select between the two
Lore’s XML Data Model
• An XML element is a pair <eid, value>
• eid is a unique element identifier
• value is either an atomic text string or a complex value containing the following four components:
– A string-valued tag corresponding to the XML tag for that element
– An ordered list of attribute-name/atomic-value pairs (attribute-name is a string, atomic-value has an atomic type)
– An ordered list of crosslink subelements of the form <label, eid> where label is a string. Crosslink subelements are introduced via an attribute of type IDREF(S)
– An ordered list of normal subelements of the form <label, eid> where label is a string. Normal subelements are introduced via lexical nesting within an XML document
XML Document/Graph Example
• eids appear within nodes (&1, &2, etc.)
• Attributes appear within brackets next to the nodes
• Two types of edges:
– Normal subelement edges labeled with destination subelement’s tag (solid line)
– Crosslink edges labeled with the attribute name that introduced the link (dashed line)
• Semantic vs. Literal:
– In semantic mode, omit attributes of type IDREF(S)
– In literal mode, omit crosslink edges
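The element structure and the two viewing modes described above might be modeled as follows. This is a sketch under stated assumptions, not Lore's storage layout: the names `TextElement`, `Element`, `view`, and the `idref_attrs` field (which records which attribute names have type IDREF or IDREFS) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TextElement:
    """Element whose value is an atomic text string."""
    eid: int
    text: str


@dataclass
class Element:
    """Element whose value is complex: tag plus three ordered lists."""
    eid: int
    tag: str
    attributes: list = field(default_factory=list)   # (name, atomic value) pairs
    crosslinks: list = field(default_factory=list)   # (label, eid), from IDREF(S)
    subelements: list = field(default_factory=list)  # (label, eid), from nesting
    idref_attrs: frozenset = frozenset()             # hypothetical bookkeeping


def view(elem, mode):
    """Return (attributes, edges) visible in 'literal' or 'semantic' mode."""
    if mode == "literal":
        # Literal: IDREF(S) stay plain string attributes, no crosslink edges.
        return list(elem.attributes), list(elem.subelements)
    if mode == "semantic":
        # Semantic: IDREF(S) attributes are dropped, crosslink edges appear.
        attrs = [(n, v) for n, v in elem.attributes if n not in elem.idref_attrs]
        return attrs, elem.subelements + elem.crosslinks
    raise ValueError(f"unknown mode: {mode}")


person1 = Element(
    eid=1,
    tag="Person",
    attributes=[("Id", "P1"), ("Name", "Jeff Ullman"), ("Colleague", "P2")],
    crosslinks=[("Colleague", 2)],
    idref_attrs=frozenset({"Colleague"}),
)

literal_view = view(person1, "literal")
semantic_view = view(person1, "semantic")
```

Storing both the attribute and the crosslink lets either view be produced from the same element on demand, which matches the idea that the user or application selects the mode.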
Migrating Lorel (Lore’s query language)
• Distinguishing between attributes and subelements
– Lorel uses path expressions
• A sequence of labels such as DBGroup.Member.Project.Title
• Can also contain wildcards and regular expressions
– Path expression qualifiers differentiate between attributes and subelements
• Placing a ‘>’ before a label matches subelements only
• Placing a ‘@’ before a label matches attributes only
• Absence of qualifier means match both
– Examples:
• DBGroup.Member.>Name will match name elements that are subelements of DBGroup.Member elements
• DBGroup.Member.@Name will match name attributes of DBGroup.Member elements
• DBGroup.Member.Name will match both
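To make the three qualifier cases concrete, here is a small Python sketch that evaluates dotted paths over an `xml.etree.ElementTree` tree. It is not Lorel: the function name `evaluate` and the sample document are assumptions, wildcards and regular expressions are not handled, and the path is expected to have at least two labels with the first one naming the root element.

```python
import xml.etree.ElementTree as ET

# Illustrative document: one member has Name as both attribute and subelement.
DOC = """
<DBGroup>
  <Member Name="Widom"><Name>Jennifer Widom</Name></Member>
  <Member><Name>Jeff Ullman</Name></Member>
</DBGroup>
"""


def evaluate(root, path):
    """Evaluate a dotted path with optional '>' / '@' qualifiers per label."""
    first, *rest = path.split(".")
    current = [root] if root.tag == first else []
    hits = []
    for step in rest:
        qualifier = step[0] if step[0] in ">@" else ""
        label = step.lstrip(">@")
        hits, next_elems = [], []
        for elem in current:
            if qualifier in (">", ""):       # subelements allowed
                for child in elem.findall(label):
                    hits.append(("subelement", child.text))
                    next_elems.append(child)
            if qualifier in ("@", "") and label in elem.attrib:  # attributes allowed
                hits.append(("attribute", elem.get(label)))
        current = next_elems  # only elements can be traversed further
    return hits


root = ET.fromstring(DOC)
only_sub = evaluate(root, "DBGroup.Member.>Name")
only_attr = evaluate(root, "DBGroup.Member.@Name")
both = evaluate(root, "DBGroup.Member.Name")
```

With this data, `only_sub` contains the two Name subelements, `only_attr` contains the single Name attribute, and `both` contains all three matches.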
Migrating Lorel (continued...)
• Comparisons
– How do we compare two different things? (for example, comparing constants with attribute values)
• All XML components are treated as atomic values...
• Functions that transform elements into strings:
– Flatten(e) : Ignoring all tags, recursively serialize all text values in the subtree rooted at element e
– Concatenate(e) : Concatenates all immediate text children of element e (subelements are ignored)
– Tag(e) : Returns the XML tag of element e
– Eid(e) : Returns the eid of element e as a string
– XML(e) : Transforms the graph, starting with element e, into an XML document
• Default Semantics (when no functions are specified):
– atomic (Text) element : the text itself
– elements with no attributes and only one or more Text elements as children : concatenation of the children’s text values
– all others : the element’s eid represented as a string
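The difference between Flatten and Concatenate is easiest to see on mixed content. The sketch below approximates three of the functions using Python's `xml.etree.ElementTree`. It is an analogy rather than Lorel's implementation: ElementTree stores text in `.text` and `.tail` instead of separate Text elements, the lowercase function names and the sample markup are assumptions, and Eid and XML are left out because ElementTree has no eids.

```python
import xml.etree.ElementTree as ET


def flatten(e):
    """Ignore all tags and serialize every text value in the subtree of e."""
    return "".join(e.itertext())


def concatenate(e):
    """Join only the immediate text children of e; subelements are skipped."""
    # In ElementTree, text directly inside e is e.text plus each child's tail.
    parts = [e.text or ""] + [child.tail or "" for child in e]
    return "".join(parts)


def tag(e):
    """Return the XML tag of e."""
    return e.tag


# Illustrative element with text both directly inside it and in a subelement.
e = ET.fromstring("<Title>A <em>First</em> Course</Title>")

flat = flatten(e)       # includes the text inside <em>
concat = concatenate(e)  # skips the text inside <em>
```

Flatten picks up "First" from the nested element, while Concatenate sees only the text that sits directly under Title.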
Migrating Lorel (continued...)
• Range qualifiers
– The expression [range] can be optionally applied to any path expression component or variable
• Example: select y from DBGroup.Member x, x.Office[1-2] y
– returns the first two Office subelements of every group member
• Example: select y[1-2] from DBGroup.Member x, x.Office y
– returns the first two Office subelements over ALL members
• Order-by clause
– Query results are ordered lists of eids that identify the elements selected by the query (attributes are coerced into elements)
– order-by-document-order orders results based on original XML document
• Newly constructed elements are placed at the end of the document order with no specified order among them
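The two range-qualifier examples on this slide differ only in where the range is applied, which can be mimicked with Python list slicing. This is an analogy, not Lorel semantics in full: the sample data is made up, and the range [1-2] is read as 1-based and inclusive, based on the "first two" wording above.

```python
# Illustrative data: each member has an ordered list of Office subelements.
members = [
    {"Office": ["o1", "o2", "o3"]},
    {"Office": ["o4", "o5"]},
]

# select y from DBGroup.Member x, x.Office[1-2] y
# The range restricts the Office list of each member before y is bound.
per_member = [y for x in members for y in x["Office"][0:2]]

# select y[1-2] from DBGroup.Member x, x.Office y
# The range restricts the bindings of y taken over all members.
overall = [y for x in members for y in x["Office"]][0:2]
```

`per_member` keeps up to two offices for each member, while `overall` keeps only the first two offices encountered across the whole group.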
Migrating Lorel (continued...)
• Transformations and structured results
– Using queries to restructure XML data
• The with clause (added to the standard select-from-where construct)
– Query result will replicate all data selected by the select clause, along with all data reachable via a set of path expressions in the with clause
• Skolem functions
– Allow more expressive data restructuring
– Accept a list of variables as arguments and produce one unique element for every binding of elements and/or attributes to the arguments
• Updates
– Lorel supports an expressive update language
– Changes for XML model:
• ability to create both attributes and elements
• order-relevant updates
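The Skolem-function idea from this slide, one unique element per distinct binding of the arguments, behaves much like a memoized constructor. The Python sketch below illustrates only that property. The class name `Skolem`, the dictionary used to represent a new element, and the second publication title are assumptions made for the example, not Lorel syntax.

```python
from itertools import count


class Skolem:
    """Returns the same new element whenever it is called with the same arguments."""

    def __init__(self, name):
        self.name = name
        self._elements = {}   # argument tuple -> created element
        self._ids = count(1)

    def __call__(self, *args):
        if args not in self._elements:
            # First time this binding is seen: create its unique element.
            self._elements[args] = {
                "id": f"{self.name}{next(self._ids)}",
                "args": args,
                "children": [],
            }
        return self._elements[args]


# Illustrative restructuring: regroup (author, title) pairs by author.
pairs = [
    ("Jeff Ullman", "A First Course In Database Systems"),
    ("Jennifer Widom", "A First Course In Database Systems"),
    ("Jeff Ullman", "Another Title (made up)"),
]

author_elem = Skolem("Author")
for name, title in pairs:
    author_elem(name)["children"].append(title)

ullman = author_elem("Jeff Ullman")
```

Because `author_elem("Jeff Ullman")` yields the same element on every call, both of that author's titles end up under a single new element instead of two separate ones.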
Migrating Lorel (continued...)
• DataGuides
– Can be used when a DTD is not supplied
– A notion of order must be introduced
• Problem: this could result in very large DataGuides
– When DTDs exist, DataGuides are built from those DTDs
– Combining DTDs and DataGuides
• DTDs available for specific portions of an XML database
• DataGuides can be used over portions not specified by DTDs
Conclusion
• As of June 1999, the migration of Lore to an XML model is nearly complete