COMP60411 Semi-structured Data and the Web
Datatypes
Relax NG, XML Schema, and Tree Grammars

Conny Hedeler and Uli Sattler
University of Manchester

Datatypes and Representations


SE3/M3: Evaluating Robustness

• Robustness is the ability of a computer system to cope with errors during execution, or the ability of an algorithm to continue to operate despite abnormalities in input, calculations, etc. [wikipedia]
• Robustness in the face of change
  – A measure of evolvability
    • If something changes, does our system break?
    • If it breaks, do we know that it broke?
    • If it broke, can we fix it?
    • If we “fixed” it, can we tell? How hard is it?
• Robustness is an organization-wide phenomenon
  – Fragility in one area can be compensated for by another
    • E.g., by someone who never sleeps and knows the system
  – Different sorts of fragility
    • With different probabilities and costs

SE3/M3: XQuery, schemas, and types

[Figure: processing pipeline. Components: XML doc., Schema, Query, Schema-aware parser, PSVI (tree adorned with default values & types), Schema-aware query processor, Query processor, Query answer]

Example queries:
    (0.5 * 2) cast as xs:integer
    (validate {doc("el1.xml")})//element(*,AxiomType)

Some SE3 Questions
• Which query is most robust to changes in the schema?
  1. /*/(equivalent|subsumes|...)
  2. /*/*[ssd:axiom(.)]
  3. /*/element(*,el:Axiom)
  4. They are equi-robust (and fragile)
  5. They are equi-robust (and robust)
• Which query is most widely usable?
  1. /*/(equivalent|subsumes|...)
  2. /*/*[ssd:axiom(.)]
  3. /*/element(*,el:Axiom)
  4. They are equi-usable (and not widely usable)
  5. They are equi-usable (and widely usable)

Basics of Types
• What, in the most general sense, is a datatype?
  1. A set of (data) values
  2. A description of the arguments of a function
  3. Anything derived from xs:anyType
  4. An annotation of a variable/node/element
• Anything naming or describing a set
  – ...has an associated type!
  – But we may or may not be able to express this type
• Types are just sets (of “values”)
  – The “extensional” view
• A Type System is a language for
  – describing types (the “intensional” view)
  – associating types with other linguistic entities
    • E.g., literals, variables, expressions, programs

A Typical Type System & XSD
• some primitive or built-in or “basic” types
  – Integer, strings, etc.
  – xs:anyType, xs:string, xs:duration, ...
• some constructor to build composite types
  – Arrays, records, dictionaries, etc.
  – xs:list, xs:union
• other constructors
  – To, for example, create other derived types
  – xs:restriction, xs:extension
• a syntax for associating types with variables, items, ...
  – And functions, etc.
  – <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  – <person xsi:type="LongPersonType" phone="5433">
• A set of conditions for success or failure (Type Errors)

A Brief Tour of Type Systems
• Strong vs. Weak
  – Type errors are caught/reported vs. silently succeeding/causing havoc
• Static vs. Dynamic
  – Check type at compile time vs. at run time
• Explicit/Manifest vs. Implicit/Latent
  – Type of everything (vars, functions, elements) has (not) to be declared
  – Implicit: requires type inference
  – Explicit: requires type checking
• Nominal vs. Structural
  – Nominal: type compatibility relies on features of the declaration
    • I declare two types, “miles” and “feet”, whose values are integers
    • 1 as miles != 1 as feet
  – Structural: type compatibility relies entirely on value structure
    • 1 as miles == 1 as feet (1 is the same integer!)
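The nominal vs. structural distinction can be illustrated with a minimal Python sketch. The class names Miles and Feet and the helper structurally_equal are hypothetical, chosen only to mirror the “miles” and “feet” example; this is an analogy, not XQuery semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Miles:
    value: int  # a distance, declared as "miles"


@dataclass(frozen=True)
class Feet:
    value: int  # a distance, declared as "feet"


def structurally_equal(a, b):
    # Structural view: only the underlying value structure is compared.
    return a.value == b.value


# Nominal view: dataclass equality is False when the declared classes differ,
# even though both wrap the same integer.
nominal_result = Miles(1) == Feet(1)

# Structural view: 1 is the same integer in both.
structural_result = structurally_equal(Miles(1), Feet(1))
```

Here the declaration (the class) decides compatibility in the nominal case, while the structural comparison ignores it.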

Some questions
• Java’s type system is primarily
  – strong, manifest, and nominal
• XQuery’s type system is primarily
  – strong, latent, and nominal

Some Expression Examples
• if ($aBool) then 1+1 else "2"
  – its type depends on the value of $aBool
• if (true()) then 1+1 else "2" is an instance of xs:integer:
  – (if (true()) then 3+1 else "2") instance of xs:integer returns “true”
• if (false()) then 1+1 else "2" is an instance of xs:string:
  – (if (false()) then 3+1 else "2") instance of xs:string returns “true”
• (if ($aBool) then 1+1 else "2") instance of (xs:integer | xs:string)
  – Not legal XQuery

Mistyped
• Obvious conflict
  – "2" + 2
• Making this conflict less obvious:
  – (if (false()) then 1+1 else "2") + 2
    • Same error as above
  – (if (true()) then 1+1 else "2") + 2
    • This is accepted!
• Making this conflict even less obvious:
  – declare function ssd:test($x as xs:boolean) as xs:integer { if ($x) then 1+1 else "2" + 2 };
  – declare function ssd:test($x as xs:boolean) as xs:integer { if ($x) then 1+1 else "2" };
  – Slide annotations on these two functions: “My checker doesn’t flag this error” and “It does flag this one!”
• Error message: Arithmetic operator is not defined for arguments of types (xs:integer, xs:string)

Simple Promotion
• Explicit
  – (1.0 + ("1" cast as xs:integer)) instance of xs:decimal
  – True!
• Implicit
  – ((1.0 treat as xs:decimal) + 125E2) instance of xs:double
  – Also true
  – Note that treat as and cast as are not the same:
    • ("1.0" treat as xs:decimal)
      – doesn’t work
      – Error: Required item type of value in 'treat as' expression is xs:decimal; supplied value has item type xs:string
    • ("1.0" cast as xs:decimal)
      – This results in 1
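The difference between cast as and treat as can be mimicked in a small Python sketch. The function names cast_as_decimal and treat_as_decimal are hypothetical, and Python's Decimal only stands in for xs:decimal; this is an analogy for the behaviour described above, not an XQuery implementation.

```python
from decimal import Decimal


def cast_as_decimal(value):
    # "cast as": build a new decimal value from the supplied lexical form.
    return Decimal(value)


def treat_as_decimal(value):
    # "treat as": no conversion, only a check that the value already has the type.
    if not isinstance(value, Decimal):
        raise TypeError(
            f"required type is Decimal; supplied value has type {type(value).__name__}"
        )
    return value


# Casting the string "1.0" converts it to a decimal equal to 1.
converted = cast_as_decimal("1.0")
```

Calling treat_as_decimal("1.0") raises a TypeError, in the same spirit as the 'treat as' error message, while treat_as_decimal(converted) simply returns the value unchanged.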

Complex Casting

http://msdn.microsoft.com/en-us/library/ms191231.aspx

Getting to PSVI
• Consider a very simple XQuery

    import schema default element namespace "…" at "el-typed.xsd";
    <instance-of>
      <constant name="sally"/>
      <atomic name="Person"/>
    </instance-of>/element(*, ClassExpression)

  – No results!
• Must validate!

    import schema default element namespace "…" at "el-typed.xsd";
    validate{
      <instance-of>
        <constant name="sally"/>
        <atomic name="Person"/>
      </instance-of>
    }/element(*, ClassExpression)

  – Returns: <atomic xmlns="..." name="Person"/>
  – validate generates a PSVI

http://owl.cs.manchester.ac.uk/2010/comp/ssd-60372/day2/el


import schema namespace el="http://www.cs.manchester.ac.uk/pgt/COMP60411/el" at "el-typed.xsd";
import schema namespace owl="http://www.w3.org/2002/07/owl#" at "owl2-xml.xsd";

declare namespace ex="http://ex.org";

(: Convert one typed el axiom into a typed owl axiom :)
declare function ex:convertAxiom($ax as element(*, el:Axiom))
    as element(*, owl:Axiom) {
  typeswitch ($ax)
    case schema-element(el:equivalent) return
      validate {
        <owl:EquivalentClasses>{
          for $expr in $ax/* return ex:convertExpression($expr)
        }</owl:EquivalentClasses>
      }
    (: every other kind of axiom becomes a placeholder :)
    default return
      validate {
        <owl:EquivalentClasses>
          <owl:Class IRI="http://BOGUS"/>
          <owl:Class IRI="http://BOGUS"/>
        </owl:EquivalentClasses>
      }
};

(: Convert one typed el class expression into a typed owl class expression :)
declare function ex:convertExpression($expr as element(*, el:ClassExpression))
    as element(*, owl:ClassExpression) {
  if ($expr instance of element(el:atomic))
  then validate { <owl:Class IRI="{$expr/@name}"/> }
  else validate { <owl:Class IRI="http://BOGUS"/> }
};

(: Convert a whole typed el ontology :)
declare function ex:convert($ont as element(*, el:Ontology))
    as element(owl:Ontology, owl:Ontology) {
  validate {
    <owl:Ontology>{
      for $e in $ont/element(*, el:Axiom) return ex:convertAxiom($e)
    }</owl:Ontology>
  }
};

ex:convert(validate { doc("el1.xml")/* })




...that was a Complex Typed “Cast”
• where input and output are all typed
• ...how do we ensure that our system works correctly with types?

    <?xml version="1.0" encoding="UTF-8"?>
    <owl:Ontology xmlns:owl="http://www.w3.org/2002/07/owl#">
      <owl:EquivalentClasses>
        <owl:Class IRI="http://BOGUS"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
      <owl:EquivalentClasses>
        <owl:Class IRI="http://BOGUS"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
      <owl:EquivalentClasses>
        <owl:Class IRI="Person"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
      <owl:EquivalentClasses>
        <owl:Class IRI="http://BOGUS"/>
        <owl:Class IRI="http://BOGUS"/>
      </owl:EquivalentClasses>
    </owl:Ontology>

  – IRI="Person" is the only “proper” value




Static type check - type soundness

• A (statically verified) type safe program
  – has some guaranteed behavior
    • and thus can be transformed or optimized in aggressive ways
  – may be more brittle
    • fails hard on invalid input
    • accepts maybe less input than possible

Type-inference rules are written in such a way that any value that can be returned by an expression is guaranteed to conform to the static type inferred for the expression. This property of a type system is called type soundness. A consequence of this property is that a query that raises no type errors during static analysis will also raise no type errors during execution on valid input data. The importance of type soundness depends somewhat on which errors are classified as "type errors," as we will see below.

http://www.informit.com/articles/article.aspx?p=100667&seqNum=6

Data Representations
• Data and data structures have representations
  – (More or less) Physical embodiments
  – (Ultimately) Bits in a machine
• The “same” data can have distinct representations
  – 1 vs. “one”
• The “same” data structure can have distinct representations
  – At different levels of abstraction
• One key distinction
  – Internal (“in-memory”)
  – External (“on disk”)
  – “Location” doesn’t really matter
• Generally:
  – External representations are for exchange between (heterogeneous) systems

Conversion
• We can go from external to internal (e2i)
  – Parsing, reading, loading, de-serializing, unmarshalling
• We can go from internal to external (i2e)
  – Serializing, writing, printing, saving, marshalling
  – Different systems may have different internals
    • At least in detail
  – Different applications may behave differently
• There and back again: Roundtripping
  – Internal to external to internal (i2e2i)
  – External to internal to external (e2i2e)
  – Ideally preserves key properties
    • Which?
    • When is it OK not to preserve?

What is an XML “Document”?

[Figure: the levels at which an XML document can be viewed, with data unit examples and the information or property required at each level]
• Levels, from top to bottom: cognitive; application; tree adorned with... (namespace, schema, nothing); tree; token; character; bit
• Data unit examples: Element and Attribute nodes (tree levels); <foo:Name t="8">Bob (complex and simple tokens); < foo:Name t="8">Bob (characters); 10011010 (bits)
• Information or property required: a schema; well-formedness; which encoding (e.g., UTF-8)
• Annotations: “Errors here mean no XML!” (SAX ErrorHandler); “Yay! XPath! XSLT! Etc.”
• Moving between levels: parse, validate, erase, serialise
• “Same” inputs can have different “meanings”! (external validation)
• ...we can have many... for “the same” meaning

The Essence of XML (with WXS)
• Thesis:
  – “XML is touted as an external format for representing data.”
• Two properties
  – Self-describing
    • Destroyed by external validation
  – Round-tripping
    • Destroyed by defaults and union types

http://bit.ly/essenceOfXML2

The Essence of XML (with WXS)
• Roundtripping issues
  – Internal to external and back
    • Take an element, foo, with content {“one”, “2”, 3}
    • Its (simple) type is a list of union of integer and string
    • Serialize
      – <foo>one 2 3</foo>
    • Parse and validate
      – Content is {“one”, 2, 3}: the string “2” has become the integer 2
  – External to internal and back
    • “001” to 1 to “1”
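Both round-tripping failures can be reproduced with a minimal Python sketch. The functions serialize, parse_union and parse are hypothetical stand-ins for a serializer and a validating parser, and the sketch assumes the union tries integer before string.

```python
def serialize(items):
    # Internal to external: a list value becomes space-separated text.
    return " ".join(str(item) for item in items)


def parse_union(token):
    # Assumed union order: try integer first, fall back to string.
    try:
        return int(token)
    except ValueError:
        return token


def parse(text):
    # External to internal: split the list and type each token via the union.
    return [parse_union(token) for token in text.split()]


# Internal to external and back (i2e2i): the string "2" comes back as the integer 2.
original = ["one", "2", 3]
roundtripped = parse(serialize(original))

# External to internal and back (e2i2e): the lexical form "001" comes back as "1".
reserialized = serialize(parse("001"))
```

Under these assumptions roundtripped differs from original, because the serialized text no longer records which items were strings.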

The Essence of XML (with WXS)
• Conclusion:
  – “So the essence of XML is this: the problem it solves is not hard, and it does not solve the problem well.”
• It’s not obvious
  – That the issues are serious (enough)
  – That the problem solved is all that easy
  – That there aren’t other, worse issues




Tree Grammars

27

Observations/Q3:

• Documents/trees are finite structures
• A schema/grammar can describe no/finitely/infinitely many documents/trees
• For a given set of documents/trees, we can design various schemas/grammars

Remember: Tree Grammars

[Figure: a DTD (cartoon.dtd, beginning <!ELEMENT cartoon (prolog, panels)>), documents referencing it via <!DOCTYPE cartoon SYSTEM "cartoon.dtd">, and two example trees with nodes labelled A and B, addressed by positions ε, 0, 1, ...]

Example grammar:
    N = {Book, PA, Editor, A, Paper, F, L}
    Σ = {B, Name, F, L, A, P}
    S = {Book, Paper}
    P = { Book → B Editor|PA,
          Paper → P PA+,
          Editor → Name F,L,
          PA → Name L,A,
          F → F ε,
          L → L ε,
          A → A ε }

๏ A set of trees is called a tree language (like sets of strings are languages)
  • A tree language can be empty, finite, or infinite
๏ A tree language TS is local / single-type / regular if there exists a local / single-type / (any) tree grammar G such that L(G) = TS.
  ‣ for one TS, there can be different tree grammars accepting exactly TS…

Properties of Local and Single-Type Tree Languages

- the following observation is an immediate consequence of the definitions of local and single-type tree languages
  ★ Every local tree language is single-type, and every single-type tree language is regular.
- the next observation is a bit more tricky:
  ★ There are regular tree languages that are not single-type, and there are single-type tree languages that are not local.

Loc ⊊ ST ⊊ Reg
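The three grammar classes can be told apart by looking for competing non-terminals, i.e., distinct non-terminals whose rules use the same terminal. The following Python sketch assumes the usual definitions (local: no competing non-terminals at all; single-type: none within one content model or among the start symbols). The function classify, the rule encoding, and the example grammars are hypothetical.

```python
from itertools import combinations


def classify(rules, starts):
    """rules maps non-terminal -> (terminal, non-terminals used in its content model)."""

    def competing(a, b):
        # Two distinct non-terminals compete if their rules share a terminal.
        return a != b and rules[a][0] == rules[b][0]

    def has_competition(names):
        known = sorted(n for n in names if n in rules)
        return any(competing(a, b) for a, b in combinations(known, 2))

    if not has_competition(rules):
        return "local"
    content_models = [set(content) for _, content in rules.values()]
    if has_competition(starts) or any(has_competition(c) for c in content_models):
        return "regular"
    return "single-type"


# DTD-like grammar: every terminal has exactly one non-terminal.
dtd_like = {"T": ("T", ["N1", "N2"]), "N1": ("N1", ["M"]),
            "N2": ("N2", []), "M": ("M", [])}

# Two non-terminals for element N, but never in the same content model.
wxs_like = {"AT": ("a", ["N_AS_ALIST"]), "BT": ("b", ["N_AS_BLIST"]),
            "N_AS_ALIST": ("N", []), "N_AS_BLIST": ("N", [])}

# Two non-terminals for element person in the same content model.
ambiguous = {"A": ("A", ["NEWPERSON", "OLDPERSON"]),
             "NEWPERSON": ("person", []), "OLDPERSON": ("person", [])}
```

The sketch classifies a grammar, not a language: a language is local (or single-type) as soon as some grammar of that kind accepts it.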

Single-Typedness and PSVIs...

• Imagine the following XML Schema was legal,

    <xs:element name="A">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
          <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

• and you’d ask (a schema-aware XQuery processor) to return all elements of type NewPersonType
  – ...as in //element(*, NewPersonType)
• from the little document below (little.xml):

    <A> <person> .... </person> </A>

• the answer would depend on the PSVI constructed for little.xml
  – what is the type of /A/person?
  – NewPersonType or OldPersonType?
• To avoid such confusion/nondeterminism, the UPA constraint ensures single-typedness, which ensures a unique PSVI

Why & when single-type matters

★ A single-type grammar can have no more than one run on a tree.
  • a run corresponds to a PSVI
    – as it labels input tree/document nodes with non-terminals/types
  • ...hence validation against a schema that corresponds to a single-type grammar results in a unique PSVI
    – (PSVI = DOM tree adorned with default values & types)
    – hence schema-aware queries know/agree on what to return!
★ A regular grammar can have more than one run on a tree.
  • ...hence validation against a schema that does not correspond to a single-type grammar may result in one of many PSVIs
    – hence schema-aware queries may differ in their answer!
✴ Use single-type schema language for schema-aware querying!

Tree Grammars: 1 more thing

• BTW, w.l.o.g., we can assume that no two production rules have the same non-terminal on the left-hand side and the same terminal, i.e., no N → P PA and N → P (Editor,Editor*).
  – We can rewrite those, e.g., to N → P (PA | (Editor,Editor*))
• ...so, how did we get here? From DTDs and XML schemas!

Tree Grammars ⇆ DTDs
• since DTDs don’t have “types”, just element names, they correspond to grammars of a peculiar, simple kind:
★ Tree grammars for DTDs are always local
  ...even if the DTD has a non-deterministic content model

    <!ELEMENT T (N1,N2*)>
    <!ELEMENT N1 (M|(M,M))>
    <!ELEMENT N2 (#PCDATA)>
    <!ELEMENT M (#PCDATA)>

  – <!ELEMENT N1 (M|(M,M))> is not deterministic and thus illegal (but can be replaced with <!ELEMENT N1 (M,(M|ε))>)

    F = (N, Σ, S, P) with
    N = {T, N1, N2, M, pcdata}
    Σ = {T, N1, N2, M, pcdata}
    S = {T}
    P = { T → T (N1,N2*),
          N1 → N1 (M|(M,M)),
          N2 → N2 pcdata,
          M → M pcdata,
          pcdata → pcdata ε }

[Figure: example tree with nodes ε: T; 0: N1; 0,0: M; 0,0,0: pcdata; 1: N2; 1,0: pcdata]

Remember?!
• in DTDs and in WXS, content models are further restricted (for compatibility with SGML)
  – [DTD] deterministic (or 1-unambiguous),
    e.g., (M|(M,M)) is not deterministic, (M,(M|ε)) is.
    e.g., ((b, c) | (b, d)) is not deterministic, b,(c|d) is.

From http://www.w3.org/TR/REC-xml/:

As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.

More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.
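The follow-set test described in the quoted passage can be sketched in Python. This is a minimal illustration using a position-based (Glushkov-style) construction rather than the exact textbook algorithm the specification cites; the constructors sym, seq, alt, opt, star and the function is_deterministic are hypothetical names.

```python
from itertools import count


def sym(name): return ("sym", name)
def seq(a, b): return ("seq", a, b)
def alt(a, b): return ("alt", a, b)
def opt(a): return ("opt", a)
def star(a): return ("star", a)


def is_deterministic(model):
    labels = {}   # position -> element name at that leaf
    follow = {}   # position -> positions that may come directly after it
    counter = count()

    def walk(node):
        """Return (nullable, first, last) and fill labels/follow."""
        kind = node[0]
        if kind == "sym":
            p = next(counter)
            labels[p] = node[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind in ("opt", "star"):
            _, first, last = walk(node[1])
            if kind == "star":
                for p in last:          # the body may repeat
                    follow[p] |= first
            return True, first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for p in l1:                    # sequence: left part is followed by right part
            follow[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last

    _, first, _ = walk(model)

    def clash(positions):
        # Two different positions carrying the same element name.
        names = [labels[p] for p in positions]
        return len(names) != len(set(names))

    return not clash(first) and not any(clash(s) for s in follow.values())
```

For example, is_deterministic(alt(sym("M"), seq(sym("M"), sym("M")))) reports the content model (M|(M,M)) as non-deterministic, while seq(sym("M"), opt(sym("M"))) passes.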


Tree Grammars and DTDs
• so, DTDs are local (and thus single-type) because they don’t have any types at all
  – and not because their content model is deterministic!
  – they are single-type even with non-deterministic content model
• hence we could extend DTDs with types and still be single-type...provided we impose suitable restrictions

Tree Grammars ⇆ WXS

• tree grammars also capture the basic, structural part of WXS:
  ✓ types (complex and anonymous)
  ‣ model groups (we ignore them)
  ‣ derivation by extension and restriction (we ignore them)
  ‣ substitution groups (we ignore them)
  ‣ integrity constraints like keys (must be ignored, don’t fit into tree grammars)
• we only deal with simple XML schemas, but the general approach works for more

Tree Grammars ⇆ WXS

• one stupid problem with this: in XSD, we can have
  – named types, e.g., <xs:complexType name="BBlist">
  – unnamed types, e.g., <xs:element name="mylist">...
• ...hence we invent a lot of type names for unnamed types,
  – e.g., MYLIST for mylist
• we use a two-stage approach: to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar
  2. then flatten the generalized tree grammar into a tree grammar G
• this will be done such that T validates against S iff T is accepted by G.

Translating WXS into Tree Grammars
• take a simple XML Schema S and translate it into grammar G(S):
➡ for each top-level element in S of the form

    <xs:element name="mylist" type="Blist"></xs:element>

  • add the following production rule to G(S):
    – MYLIST → mylist BLIST^TYPE
    – add MYLIST, BLIST^TYPE to non-terminals, add mylist to terminals
➡ for each top-level element in S of the form

    <xs:element name="mylist">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="ename" type="Comp" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

  • add the following production rules to G(S):
    – MYLIST → mylist ENAME,ENAME*
    – ENAME → ename COMP^TYPE
  • what is the default for minOccurs?

Translating WXS into Tree Grammars

➡ for each top-level element in S of the form

    <xs:complexType name="Blist">
      <xs:sequence>
        <xs:element name="friend" type="Person" minOccurs="1" maxOccurs="2"/>
      </xs:sequence>
    </xs:complexType>

  • add the following production rules to G(S):
    – BLIST^TYPE → (FRIEND | (FRIEND,FRIEND))    %% generalized rule: to be expanded!
    – FRIEND → friend PERSON^TYPE
    – add BLIST^TYPE, FRIEND, PERSON^TYPE to non-terminals, add friend to terminals
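The way minOccurs and maxOccurs turn into content models in these rules can be sketched as a small Python helper. The function occurrence_model is hypothetical; it relies on the XSD defaults minOccurs = 1 and maxOccurs = 1 and writes content models in the notation of the slides.

```python
def occurrence_model(nonterminal, min_occurs=1, max_occurs=1):
    """Content model for one element particle; max_occurs may be "unbounded"."""
    if max_occurs == "unbounded":
        # min_occurs required copies, then any number of further copies.
        return ",".join([nonterminal] * min_occurs + [nonterminal + "*"])

    def copies(k):
        if k == 0:
            return "ε"
        if k == 1:
            return nonterminal
        return "(" + ",".join([nonterminal] * k) + ")"

    # One alternative per allowed number of occurrences.
    options = [copies(k) for k in range(min_occurs, max_occurs + 1)]
    if len(options) == 1:
        return options[0]
    return "(" + " | ".join(options) + ")"


ename_model = occurrence_model("ENAME", max_occurs="unbounded")
friend_model = occurrence_model("FRIEND", min_occurs=1, max_occurs=2)
```

With these inputs, ename_model reproduces ENAME,ENAME* and friend_model reproduces (FRIEND | (FRIEND,FRIEND)).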

Translating WXS into Tree Grammars
➡ for each top-level element in S of the form

    <xs:complexType name="BBlist">
      <xs:choice>
        <xs:sequence>
          <xs:element name="A" type="xs:string"/>
          <xs:element name="B" type="xs:string"/>
        </xs:sequence>
        <xs:sequence>
          <xs:element name="A" type="xs:string"/>
          <xs:element name="C" type="xs:string"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>

    %% UPA violation: Oxygen complains!

  • add the following production rules to G(S):
    – BBLIST^TYPE → (A,B) | (A,C)    %% generalized rule -- to be expanded!
    – A → A STRING^TYPE
    – B → B STRING^TYPE
    – C → C STRING^TYPE
    – add BBLIST^TYPE, A, B, C, STRING^TYPE to non-terminals, add A, B, C to terminals

Translating WXS into Tree Grammars
• Consider the following case:

    <xs:complexType name="AT">
      <xs:sequence>
        <xs:element name="N" type="Alist" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>

    <xs:complexType name="BT">
      <xs:sequence>
        <xs:element name="N" type="Blist" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>

• To handle cases like the one above we can’t always add rules
  – AT^TYPE → N*, BT^TYPE → N*
  – N → N ??LIST^TYPE
• Instead, we translate these as
  – AT^TYPE → N^AS^ALIST^TYPE*
  – BT^TYPE → N^AS^BLIST^TYPE*
  – N^AS^ALIST^TYPE → N ALIST^TYPE
  – N^AS^BLIST^TYPE → N BLIST^TYPE

Translating WXS into Tree Grammars
Our translation yields almost a tree grammar:
• it produces illegal rules of the form X → e, i.e., without a terminal
  – e.g., BLIST^TYPE → (FRIEND | (FRIEND,FRIEND))
• our grammar model doesn’t handle those (check definition of a run)
๏ hence we expand these illegal rules:

    repeat
      pick illegal rule X → e:
      – remove X → e from rule set
      – replace all occurrences of X in rule set with e
    until no illegal rules are left in rule set

• e.g., MYLIST → mylist BLIST^TYPE would be transformed into
  – MYLIST → mylist (FRIEND | (FRIEND,FRIEND))
• ...and if we had <xs:element name="yourlist" type="Blist"/> then we would also have
  – YOURLIST → yourlist BLIST^TYPE and thus
  – YOURLIST → yourlist (FRIEND | (FRIEND,FRIEND))
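The expansion loop can be written down as a short Python sketch. The function expand and the token-list encoding of content models are hypothetical; the sketch assumes that no illegal rule mentions its own left-hand side directly.

```python
def expand(legal, illegal):
    """legal: non-terminal -> (terminal, content tokens); illegal: non-terminal -> content tokens."""
    legal = {nt: (term, list(content)) for nt, (term, content) in legal.items()}
    illegal = {nt: list(content) for nt, content in illegal.items()}

    def substitute(content, name, replacement):
        # Replace every occurrence of `name` by the tokens of its content model.
        out = []
        for token in content:
            out.extend(replacement if token == name else [token])
        return out

    while illegal:
        # Pick an illegal rule X -> e and remove it from the rule set.
        name, replacement = illegal.popitem()
        # Replace all occurrences of X in the remaining rules with e.
        for nt, (term, content) in legal.items():
            legal[nt] = (term, substitute(content, name, replacement))
        for nt, content in illegal.items():
            illegal[nt] = substitute(content, name, replacement)
    return legal


blist = ["(", "FRIEND", "|", "(", "FRIEND", ",", "FRIEND", ")", ")"]
expanded = expand(
    legal={
        "MYLIST": ("mylist", ["BLIST^TYPE"]),
        "YOURLIST": ("yourlist", ["BLIST^TYPE"]),
    },
    illegal={"BLIST^TYPE": blist},
)
```

After the call, both MYLIST and YOURLIST carry the content model (FRIEND | (FRIEND,FRIEND)), and no rule mentions BLIST^TYPE any more.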

Translating WXS into Tree Grammars
• Expanding illegal rules even works with cyclic type definitions - try

    <xs:complexType name="NType">
      <xs:choice>
        <xs:element name="test2" type="AType"/>
        <xs:element name="EndElement" type="xs:string"/>
      </xs:choice>
    </xs:complexType>

    <xs:complexType name="AType">
      <xs:choice>
        <xs:element name="test1" type="NType"/>
        <xs:element name="EndElement" type="xs:string"/>
      </xs:choice>
    </xs:complexType>

• This gives you these rules, including 2 illegal rules

    NType^TYPE → (TEST2 | ENDELEMENT)
    TEST2 → test2 AType^TYPE
    ENDELEMENT → EndElement STRING^TYPE
    ...
    AType^TYPE → (TEST1 | ENDELEMENT)
    TEST1 → test1 NType^TYPE
    ENDELEMENT → EndElement STRING^TYPE
    ...

• that can be expanded as follows:

    TEST2 → test2 (TEST1 | ENDELEMENT)
    ENDELEMENT → EndElement STRING^TYPE
    ...
    TEST1 → test1 (TEST2 | ENDELEMENT)
    ENDELEMENT → EndElement STRING^TYPE
    ...

Translating WXS into Tree Grammars

• So, to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar G’
  2. then expand G’ into a tree grammar G
★ Then any tree T validates against S iff T is accepted by G.
• So, what are the tree grammars we get as results?
  – they are tree grammars
  – are they single-type?
  – are they local?
★ Tree grammars corresponding to WXS are not local.
  • E.g., consider
    – N^AS^ALIST^TYPE → N ALIST^TYPE
    – N^AS^BLIST^TYPE → N BLIST^TYPE
    – ...N^AS^ALIST^TYPE and N^AS^BLIST^TYPE are competing!

Translating WXS into Tree Grammars
★ Tree grammars corresponding to WXS are single-type.
  – This is ensured by the Unique Particle Attribution constraint in WXS.
• Tree grammars corresponding to DTDs are local, ...hence
★ DTDs are less expressive than XML schemata.
• That is, there are tree languages that we can describe in WXS, but not in DTDs, e.g.,

    N = {Book, PA, Editor, A, Paper, F, L}
    Σ = {B,N,A,P,C}
    S = {Book, Paper}
    P = { Book → B Editor|PA,
          Paper → P PA,
          Editor → N F,L,
          PA → N L,A,
          F → F ε,
          L → L ε,
          A → A ε }

[Figure: two trees in this language. First tree: ε: B; 0: N; 0,0: F; 0,1: L. Second tree: ε: P; 0: N; 0,0: L; 0,1: A]

Remember:
• In XML Schema, content model is constrained as well
  – to make validation easier & for compatibility with SGML
  – e.g., through the Unique Particle Attribution constraint:

A content model must be formed such that during validation of an element information item sequence, the particle component contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

Rephrasing: a content model M must be formed such that, during validation of an element E’s child node sequence E1...Ek, we can, starting from i = 1 and increasing, associate each Ei with a single particle contained (possibly implicitly) in M without examining the content or attributes of Ei, and without any information about any Ej with j > i.









Translating WXS into Tree Grammars
★ Tree grammars corresponding to WXS are single-type.
  – This is ensured by the Unique Particle Attribution constraint in WXS.
• We know: validation against a schema that corresponds to a single-type grammar results in a unique PSVI
  – (PSVI = DOM tree adorned with default values & types)
  – hence schema-aware queries know/agree on what to return!
  – hence WXS can be used for schema-aware querying!

Using more than 1 schema:

[Figure: processing pipeline with two schemas. Components: XML doc.; rich Schema 1; Schema-aware parser for rich schema language, e.g. RelaxNG, with outcomes “doesn’t validate” (ErrorHandler) and “validates”; single-type Schema 2; Schema-aware parser for s-t schema language, e.g. XSD; PSVI (tree adorned with default values & types); Schema-aware query processor; Query or other input; Query answer; Your application]

Content models & types in DTD & WXS

• (we already know that) in WXS, we have a type hierarchy
  – an element of a type X derived by restriction or extension from Y can be used in place of an element of type Y
    • but you have to say so explicitly:

        <person phone="2">
          <Name>Peter</Name>
          <DoB>1966-05-04</DoB>
        </person>
        <person xsi:type="LongPersonType" phone="5432">
          <Name>Paul</Name>
          <DoB>1967-05-04</DoB>
          <address>Manchester</address>
        </person>

  – we call this ‘named’ typing:
    • sub-types are declared (restriction or extension), and not inferred (by comparing structure)
  – in DTDs, we don’t have types!
• In order to prevent difficulties in WXS as caused by types, the Element Declarations Consistent constraint is imposed; it rules out, e.g.:

    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
        <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
    </xs:complexType>

Summary

• So far, we have seen how to translate schema languages into tree grammars: we saw that
  – each DTD can be faithfully translated into a local tree grammar, and therefore into a single-type one
    • hence each DTD corresponds to a single-type grammar
    • hence there is exactly 1 PSVI for each document that validates against a DTD
  – each XML schema can be faithfully translated into a single-type tree grammar,
    • hence there is exactly 1 PSVI for each document that validates against an XML schema
• ...we also saw that parts of the UPA constraint help to generate the PSVI: do we need other parts?

Relax NG, a very powerful schema language


Relax NG: yet another schema language

• Relax NG was designed to be a simpler schema language
• (described in a readable on-line book by Eric Van der Vlist)
• and allows us to describe XML documents in terms of their tree abstractions:
  – no default attributes
  – no entity declarations
  – no key/uniqueness constraints
  – minimal datatypes: only “token” and “string” like DTDs (but a mechanism to use XSD datatypes)
• since it is so simple/flexible
  – it’s (claimed to be) easy to use
  – it doesn’t have complex constraints on description of element content like determinism/1-unambiguity
  – it’s claimed to be reliable
  – but you need other tools to do other things (like datatypes and attributes)

Relax NG: another side of Determinism

• remember that DTDs and WXS required their content models to be
  – [DTD] deterministic (and thus look-ahead-free)
  – [WXS] deterministic (EDC, every matching child node sequence matches in exactly one way only)
  – [WXS] UPA constraint expresses both and other constraints even more
• determinism & single-typedness have a reason:
  – some tools annotate a (valid) document while parsing:
    • type information -- to be exploited, e.g., for concise queries (remember assignment?)
    • default attribute values
  – if your schema is not single-type, then
    • tools validating the same document against the same schema may construct different PSVIs
    • this can happen with different tools or different runs of the same tool

Relax NG: another side of Validation
Reasons why one would want to validate an XML document:
• ensure that structure is ok
• ensure that values in elements/attributes are of the correct type
• generate PSVI to work with
• check constraints on co-occurrence of elements/how they are related
• check other integrity constraints, e.g., a person’s age vs. their mother’s age
• check constraints on elements/their value against external data
  – postcode correctness
  – VAT/tax/other numeric constraints
  – spell checking

...only few of these checks can be carried out by validating against schemas...

Relax NG was designed to
1. validate structure and
2. link to datatype validators to type check values of elements/attributes

Relax NG: basic principles
• both DTDs and XSD allow the user to describe documents
  – by descriptions of their elements and attributes, e.g., an element “person” must have two element child nodes, name and address, and ....
• Relax NG is based on patterns (similar to XPath expressions):
  – a pattern is a description of a set of valid node sets
  – we can view our example as different combinations of different parts, and design patterns for each
  – enhanced flexibility

    <?xml version="1.0" encoding="UTF-8"?>
    <people>
      <person age="41">
        <name>
          <first>Harry</first>
          <last>Potter</last>
        </name>
        <address>4 Main Road </address>
        <project type="epsrc" id="1"> DeCompO </project>
        <project type="eu" id="3"> TONES </project>
      </person>
      <person>....
    </people>

Relax NG: good to know
Relax NG comes in 2 syntaxes
• the compact syntax
  – succinct
  – human readable

    grammar {
      start = element name {
        element first { text },
        element last { text }
      }
    }

• the XML syntax
  – verbose
  – machine readable

    <grammar xmlns="http:..." xmlns:a="http:.." datatypeLibrary="http:...">
      <start>
        <element name="name">
          <element name="first"><text/></element>
          <element name="last"><text/></element>
        </element>
      </start>
    </grammar>

• Trang converts between the two, phew! (and also into/from other schema languages)
• Trang can be used from Oxygen

Relax NG - structure validation:
• 3 kinds of patterns, for the 3 “central” nodes:
  – text: <text/>
  – attribute: <attribute name="age"/> <attribute name="type"/>
  – element:
      <element name="name">
        <element name="first"><text/></element>
        <element name="last"><text/></element>
      </element>
    in compact syntax:
      element name { element first { text }, element last { text } }
• these can be combined
  – ordered groups
  – unordered groups
  – choices
• we can constrain cardinalities of patterns
• text nodes
  – can be marked as “data” and linked
• we can specify libraries of patterns


Relax NG - structure validation: ordered groups
• we can name patterns
• in strange “chains”
• we can use ?, *, and + (use “?” if optional):

grammar {
  start = people-element
  people-element = element people { person-element+ }
  person-element = element person {
    attribute age { text },
    name-element,
    address-element+,
    project-element*
  }
  name-element = element name {
    element first { text },
    element middle { text }?,
    element last { text }
  }
  address-element = element address { text }
  project-element = element project {
    attribute type { text },
    attribute id { text },
    text
  }
}

Relax NG - structure validation: ordered groups in XML syntax (Trang knows…)

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="people"><ref name="people-content"/></element>
  </start>
  <define name="people-content">
    <oneOrMore>
      <element name="person"><ref name="person-content"/></element>
    </oneOrMore>
  </define>
  <define name="person-content">
    <attribute name="age"/>
    <element name="name"><ref name="name-content"/></element>
    <oneOrMore>
      <element name="address"><text/></element>
    </oneOrMore>
    <zeroOrMore>
      <element name="project"><ref name="project-content"/></element>
    </zeroOrMore>
  </define>
  <define name="name-content">
    <element name="first"><text/></element>
    <optional><element name="middle"><text/></element></optional>
    <element name="last"><text/></element>
  </define>
  <define name="project-content">
    <attribute name="type"/>
    <attribute name="id"/>
    <text/>
  </define>
</grammar>









Relax NG - structure validation: different styles
• so far, we modelled ‘element centric’...we can model ‘content centric’:

grammar {
  start = element people { people-content }
  people-content = element person { person-content }+
  person-content =
    attribute age { text },
    element name { name-content },
    element address { text }+,
    element project { project-content }*
  name-content =
    element first { text },
    element middle { text }?,
    element last { text }
  project-content =
    attribute type { text },
    attribute id { text },
    text
}


Relax NG - structure validation: ordered groups
• we can combine patterns in fancy ways:

grammar {
  start = element people { people-content }
  people-content = element person { person-content }+
  person-content = HR-stuff, contact-stuff
  HR-stuff = attribute age { text }, project-content
  contact-stuff =
    attribute phone { text },
    element name { name-content },
    element address { text }
  name-content =
    element first { text },
    element middle { text }?,
    element last { text }
  project-content = element project {
    attribute type { text },
    attribute id { text },
    text
  }+
}



Relax NG: structure validation summary
• Relax NG’s specification of structure differs from DTDs and XSD:
  – grammar oriented
  – 2 syntaxes with automatic translation
  – flexible: we can gather different aspects of elements into different patterns
  – unconstrained: no constraints regarding unambiguity/1-ambiguity/deterministic content model/Unique Particle Constraints/Element Declarations Consistent
  – as in XSD, we have an “ALL” construct for unordered groups: “interleave”, written &

here, the patterns must appear in the specified order (except for attributes, which are allowed to appear in any order in the start tag):

element person {
  attribute age { text },
  attribute phone { text },
  name-element,
  address-element+,
  project-element*
}

here, the patterns can appear in any order:

element person {
  attribute age { text } &
  attribute phone { text } &
  name-element &
  address-element+ &
  project-element*
}
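To make the difference between the two person patterns concrete, here is a minimal Python sketch that checks a list of child element names against the ordered group and against the interleave. It is not how a Relax NG validator works internally: attributes are left out, and the names `valid_ordered`, `valid_interleave` and `LIMITS` are made up for this illustration. Counting occurrences suffices for the interleave only because every operand here is a single element pattern with a cardinality.

```python
import re
from collections import Counter


def encode(children):
    """Encode a sequence of child element names as one string, e.g. 'name address '."""
    return "".join(name + " " for name in children)


# Ordered group: name-element, address-element+, project-element*
ORDERED = re.compile(r"(name )(address )+(project )*")

# Interleave: name-element & address-element+ & project-element*
# (minimum, maximum) occurrences per element name; None means unbounded.
LIMITS = {"name": (1, 1), "address": (1, None), "project": (0, None)}


def valid_ordered(children):
    """True if the children appear exactly in the order the pattern specifies."""
    return ORDERED.fullmatch(encode(children)) is not None


def valid_interleave(children):
    """True if the children satisfy the cardinalities, in any order."""
    counts = Counter(children)
    if any(name not in LIMITS for name in counts):
        return False
    for name, (low, high) in LIMITS.items():
        n = counts[name]
        if n < low or (high is not None and n > high):
            return False
    return True
```

A child sequence such as project, address, name is rejected by `valid_ordered` but accepted by `valid_interleave`.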

Translating Relax NG into tree grammars by example 1

grammar {
  start = AddressBook
  AddressBook = element addressBook { Card* }
  Card = element card { Inline }
  Inline = Name, Email+
  Name = element name { text }
  Email = element email { text }
}

Translate into G = (N, Σ, S, P) with
N = {AddressBook, Card, Inline, Name, Email, Pcdata}
Σ = {addressBook, card, name, email, pcdata}
S = {AddressBook}
P = {AddressBook → addressBook Card*,
     Card → card Inline,
     Inline → Name, Email+,
     Name → name Pcdata,
     Email → email Pcdata,
     Pcdata → pcdata ϵ }

• “element y” ➟ y ∈ Σ
• ...possibly also an “uppercased copy” ➟ Y ∈ N
• all other user-defined symbols X ➟ X ∈ N
• ...translating Relax NG rules is easy (depending on Relax NG style)
• ...let’s see one more



grammar {
  start = p-el
  p-el = element people { per-el+ }
  per-el = element person {
    attribute age { text },
    na-el,
    ad-el+,
    pro-el*
  }
  na-el = element name {
    element first { text },
    element middle { text }?,
    element last { text }
  }
  ad-el = element address { text }
  pro-el = element project {
    attribute type { text },
    attribute id { text },
    text
  }
}

• attribute patterns: Ignore!

Translate into G = (N, Σ, S, P) with
N = {P-EL, PER-EL, NA-EL, AD-EL, PRO-EL, FIRST, MIDDLE, LAST, Pcdata}
Σ = {people, person, name, first, middle, last, address, project, pcdata}
S = {P-EL}
P = {P-EL → people PER-EL, PER-EL*,
     PER-EL → person NA-EL, AD-EL, AD-EL*, PRO-EL*,
     NA-EL → name FIRST, (MIDDLE|ϵ), LAST,
     FIRST → first Pcdata,
     MIDDLE → middle Pcdata,
     LAST → last Pcdata,
     AD-EL → address Pcdata,
     PRO-EL → project Pcdata,
     Pcdata → pcdata ϵ }

This Relax NG style makes translation of rules easy
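The translation hints can be turned into a small Python sketch for the simplest case, the address book schema, where every definition is either a single `element` pattern or a plain combination of named patterns. This is an illustration under strong assumptions, not a Relax NG parser: it does not handle nested `element` patterns or attributes, and `translate`, `ELEMENT_DEF` and `PLAIN_DEF` are names invented here. Definitions without an element become rules with terminal `None`, i.e. generalized rules.

```python
import re

# "Name = element name { text }"  ->  name, tag, content
ELEMENT_DEF = re.compile(r"^([\w-]+)\s*=\s*element\s+([\w-]+)\s*\{\s*(.*?)\s*\}$")
# "Inline = Name, Email+"  ->  name, content
PLAIN_DEF = re.compile(r"^([\w-]+)\s*=\s*(.*?)$")

ADDRESS_BOOK = [
    "AddressBook = element addressBook { Card* }",
    "Card = element card { Inline }",
    "Inline = Name, Email+",
    "Name = element name { text }",
    "Email = element email { text }",
]


def translate(definitions):
    """Return (N, Sigma, P); P maps a non-terminal to (terminal or None, content model)."""
    productions = {"Pcdata": ("pcdata", "ϵ")}
    for line in definitions:
        line = line.strip()
        m = ELEMENT_DEF.match(line)
        if m:
            name, tag, content = m.groups()
            # "element y" gives y in Sigma; the pattern "text" becomes Pcdata
            productions[name] = (tag, "Pcdata" if content == "text" else content)
        else:
            name, content = PLAIN_DEF.match(line).groups()
            # no element, hence no terminal: a generalized rule
            productions[name] = (None, content)
    nonterminals = set(productions)
    terminals = {tag for tag, _ in productions.values() if tag is not None}
    return nonterminals, terminals, productions


N, SIGMA, P = translate(ADDRESS_BOOK)
```

The start symbol is taken directly from the `start = AddressBook` line and is not computed by the sketch.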



grammar {
  start = element people { people-content }
  people-content = element person { person-content }+
  person-content =
    attribute age { text },
    element name { name-content },
    element address { text }+,
    element project { project-content }*
  name-content =
    element first { text },
    element middle { text }?,
    element last { text }
  project-content =
    attribute type { text },
    attribute id { text },
    text
}

Translate into G = (N, Σ, S, P) with
N = {PEOPLE, P-C, PER-C, NA, NA-C, PERSON, PRO-C, ADR, PROJ, FIRST, MIDDLE, LAST, Pcdata}
Σ = {people, person, name, first, middle, last, address, project, pcdata}
S = {PEOPLE}
P = {PEOPLE → people P-C,
     P-C → PERSON, PERSON*,
     PERSON → person PER-C,
     PER-C → NA, ADR, ADR*, PROJ*,
     NA → name NA-C,
     ADR → address Pcdata,
     PROJ → project PRO-C,
     PRO-C → pcdata ϵ,
     NA-C → FIRST, (MIDDLE|ϵ), LAST,
     FIRST → first Pcdata,
     MIDDLE → middle Pcdata,
     LAST → last Pcdata,
     Pcdata → pcdata ϵ }

expand! ☞ rules without a terminal, such as P-C and PER-C

This Relax NG style makes translation of rules less easy… and leads to generalized rules!


Two things we have already seen when translating WXS:
• “generalized” rules -- which can & need to be expanded, as for WXS:
• we might have to “contextualise” names and types of elements: ...

...
people-content = element person { person-content }+
.....
person-content = attribute age { text }, element name {name-content}, element address { text }+, element project {project-content}*

...
PERSON → person PER-C,
PER-C → NA, ADR, ADR*, PROJ*,     ☜ expand!
NA → name NA-C,
ADR → address Pcdata,
...

for each illegal rule X → e:
– remove X → e from rule set
– replace all occurrences of X in rule set with e
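The expansion step can be sketched in Python as follows. The representation is an assumption made for illustration: rules are stored in a dict from non-terminal to (terminal, content model), and an illegal (generalized) rule is one whose terminal is `None`. The inserted expression is wrapped in parentheses so that operator precedence is preserved, which the slide leaves implicit. The sketch also assumes that illegal rules do not refer to themselves, directly or indirectly.

```python
import re

# A non-terminal name: starts with a letter, may contain letters, digits, '-', '^'.
TOKEN = re.compile(r"[A-Za-z][\w^-]*")


def expand(rules):
    """Remove every rule X -> e without a terminal and inline (e) wherever X occurs."""
    rules = dict(rules)
    while True:
        illegal = [x for x, (terminal, _) in rules.items() if terminal is None]
        if not illegal:
            return rules
        x = illegal[0]
        _, e = rules.pop(x)  # remove X -> e from the rule set

        def substitute(match):
            # replace whole-name occurrences of X only, so P-C never touches PER-C
            return f"({e})" if match.group(0) == x else match.group(0)

        rules = {
            n: (terminal, TOKEN.sub(substitute, content))
            for n, (terminal, content) in rules.items()
        }


RULES = {
    "PEOPLE": ("people", "P-C"),
    "P-C": (None, "PERSON, PERSON*"),
    "PERSON": ("person", "PER-C"),
    "PER-C": (None, "NA, ADR, ADR*, PROJ*"),
}
EXPANDED = expand(RULES)
```

After expansion only rules of the form N → a e remain, here for PEOPLE and PERSON.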



2. we might have to “contextualise” names and types of elements, to handle schemas where the same element name is used in different contexts with different types:

...
people-content = element person { person-content }+, element friend { friend-content }+
.....
person-content = attribute age { text }, element name {name-content}, ...
friend-content = attribute age { text }, element name {friend-name-content}, ...

...
P-C → PERSON, PERSON*, FRIEND, FRIEND*
PERSON → person PER-C,
FRIEND → friend FRIE-C,
PER-C → NA^NA-C, ...
FRIE-C → NA^FRIE-NA-C, ...
NA^NA-C → name NA-C,
NA^FRIE-NA-C → name FRIE-NA-C,
...

Translating Relax NG into tree grammars
• each Relax NG schema can be faithfully translated into a tree grammar:
  – local? no: the example on the previous slide leads to competing non-terminals (NA^NA-C and NA^FRIE-NA-C):

      ... NA^NA-C → name NA-C, NA^FRIE-NA-C → name FRIE-NA-C, ...

  – single-type? no: see the example below, where NA^NA-C and NA^FO-NA-C compete and occur in the same RHS:

      ... person-content = attribute age { text }, (element name {name-content} | element name {foreign-name-content}), ...

      ... PER-C → NA^NA-C | NA^FO-NA-C, NA^NA-C → name NA-C, NA^FO-NA-C → name FO-NA-C, ...

  – so is Relax NG as powerful as tree grammars?
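The two tests used above (do competing non-terminals exist, and do they occur in the same right-hand side?) can be sketched in Python. The dict representation of P and the names `competing_pairs` and `classify` are assumptions made for this illustration; the definitions follow the usual ones for local and single-type tree grammars, including the requirement that start symbols do not compete.

```python
import re
from itertools import combinations

TOKEN = re.compile(r"[A-Za-z][\w^-]*")


def competing_pairs(P):
    """Pairs of distinct non-terminals whose rules share the same terminal."""
    return {frozenset((a, b)) for a, b in combinations(P, 2) if P[a][0] == P[b][0]}


def classify(P, start):
    """Return 'local', 'single-type' or 'regular' for productions P and start symbols."""
    pairs = competing_pairs(P)
    if not pairs:
        return "local"
    # non-terminals occurring together: one group per content model, plus the start set
    groups = [set(TOKEN.findall(content)) for _, content in P.values()]
    groups.append(set(start))
    if any(pair <= group for pair in pairs for group in groups):
        return "regular"
    return "single-type"


FRIENDS = {  # same element name in different contexts
    "PEOPLE": ("people", "PERSON+, FRIEND+"),
    "PERSON": ("person", "NA^NA-C"),
    "FRIEND": ("friend", "NA^FRIE-NA-C"),
    "NA^NA-C": ("name", "Pcdata"),
    "NA^FRIE-NA-C": ("name", "Pcdata"),
    "Pcdata": ("pcdata", "ϵ"),
}
FOREIGN = {  # competing non-terminals in the same right-hand side
    "PERSON": ("person", "NA^NA-C | NA^FO-NA-C"),
    "NA^NA-C": ("name", "NA-C"),
    "NA^FO-NA-C": ("name", "FO-NA-C"),
}
```

With these inputs, `classify(FRIENDS, {"PEOPLE"})` gives "single-type" and `classify(FOREIGN, {"PERSON"})` gives "regular".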

Relax NG is indeed as powerful as tree grammars
★ Every tree grammar can be faithfully translated into a Relax NG schema.

• Proof (not too hard): given a tree grammar G = (N, Σ, S, P),
  1. translate each production rule N → t regexp in P into

       N = element t { regexp }

     (fortunately, the tree grammar regular expression syntax is very close to, and more strict than, Relax NG regular expression syntax)
  2. put the resulting statements into a grammar, where N1, ..., Nk are all start symbols, i.e., S = {N1, ..., Nk}:

       grammar { start = N1 | ... | Nk ..... }

  3. call the resulting schema GS

★ Then T ∈ L(G) if and only if T validates against GS.
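The construction in the proof is mechanical enough to sketch in a few lines of Python. The function name `to_relax_ng` and the dict representation of P are assumptions for this illustration, and so is the small example grammar. Two details go beyond the slide: ϵ is written as the Relax NG pattern `empty`, and non-terminal names are assumed to be valid Relax NG identifiers already (names containing `^` would need renaming first).

```python
def to_relax_ng(productions, start):
    """Translate P (non-terminal -> (terminal, regexp)) into Relax NG compact syntax."""
    lines = ["grammar {", "  start = " + " | ".join(start)]
    for nonterminal, (terminal, regexp) in productions.items():
        body = regexp.replace("ϵ", "empty")  # ϵ has no literal form in Relax NG
        lines.append(f"  {nonterminal} = element {terminal} {{ {body} }}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example grammar with start symbol S.
EXAMPLE = {
    "S": ("a", "B, B*"),
    "B": ("b", "(C, C) | C"),
    "C": ("c", "ϵ | C"),
}
SCHEMA = to_relax_ng(EXAMPLE, ["S"])
```

Each production becomes exactly one definition, so the resulting schema has one named pattern per non-terminal.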

Tree Grammars and Schema Languages
• Harvest Time!

[Figure: the grammar classes Loc, ST, and Reg matched with DTD, WXS, and Relax NG, with our knowledge]

• but then, isn’t validation of an XML document against a Relax NG schema really complicated and complex (i.e., space and/or time consuming)?
• perhaps it’s even undecidable or intractable?

How costly is validity testing? Does it matter against which kind of schema? Is Single-Type cheaper than general?

How costly is schema validation?

[Figure: an XML doc. is checked either against a rich Schema 1 by a schema-aware parser for a rich schema language (e.g. Relax NG) or against a single-type Schema 2 by a schema-aware parser for a single-type schema language (e.g. XSD). If it validates, the PSVI (tree adorned with default values & types) goes to your application or, with a query or other input, to a schema-aware query processor that returns the query answer; if it doesn’t validate, the ErrorHandler is called.]

Schema Languages and Tree Grammars
• We have learned about a third, flexible, liberal schema language, Relax NG
  – how to translate Relax NG schemas into tree grammars
  ➡ more liberal than single-type/XSD
• Now, we will look at
  – the problem of
  – algorithms for
  validating a document against a schema!

[Figure: an algorithm with inputs Tree T and Grammar G; output “yes”, if T ∈ L(G); “no”, otherwise]

See the paper by Murata, Lee, Mani, Kawaguchi

• To design our “schema validator”,
  1. we start with the easy case: assume that G is local (this gives us automatically a validator for the structural aspect of DTDs)
  2. then expand the algorithm to single-type (this gives us automatically a validator for the structural aspect of WXS)
  3. then expand to general tree grammars (...Relax NG)
  – we also assume that we have a subroutine MatchAlgo
      input: String w, regular expression e
      output: “yes”, if w ∈ L(e) (w matches e); “no”, otherwise
  – ...if time permits, we will see later how to build that one (it’s based on a translation of regular expressions into finite state machines (aka automata)); otherwise
    • remember your undergraduate studies (?)
    • read it up, e.g., in the textbook by Hopcroft, Ullman

[Figure: ValAlgo with inputs Tree T and Grammar G; output “yes”, if T ∈ L(G); “no”, otherwise]
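As a stand-in for MatchAlgo, the following Python sketch reuses the standard `re` module instead of building a finite state machine by hand. That is a substitution made for brevity: Python’s engine is a backtracking matcher, not the automaton construction mentioned above. The function names and the encoding (each non-terminal followed by `;`) are assumptions of this sketch, and the content model syntax is the one used on these slides (`,` for sequence, `|`, `*`, `+`, `?`, and ϵ).

```python
import re

# A non-terminal name: starts with a letter, may contain letters, digits, '-', '^'.
TOKEN = re.compile(r"[A-Za-z][\w^-]*")


def to_python_regex(e):
    """Turn a content model over non-terminals into a Python regular expression."""
    # every non-terminal N becomes the group (?:N;) so that names cannot run together
    pattern = TOKEN.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", e)
    # ϵ is the empty word, ',' is plain concatenation, blanks carry no meaning
    return pattern.replace("ϵ", "").replace(",", "").replace(" ", "")


def match_algo(w, e):
    """Return True ('yes') if the list of non-terminals w is in L(e), else False ('no')."""
    word = "".join(n + ";" for n in w)
    return re.fullmatch(to_python_regex(e), word) is not None
```

For instance, `match_algo(["C", "C"], "(C,C)|C")` holds, while three C’s in a row are rejected.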

[Figure: ValAlgo with inputs XML doc/Tree T and local Grammar G; output “yes”, if T ∈ L(G); “no”, otherwise]

Input: DOM Tree for T, local tree grammar G = (N, Σ, S, P)
NT is a stack of strings of non-terminals (to store NTs of child nodes)
R is a stack of production rules

Traverse T in a depth-first, left-to-right manner
When an element E is visited on the way down,
    if there is a production rule N → a e in P with a = E’s tag name    (at most one such rule: locality)
    then push N → a e onto R    (store rule for E’s content in R)
         and push ϵ onto NT     (start remembering E’s child nodes)
    else report “not accepted” and stop
When an element E is visited on the way up,
    pop a rule N → a e out of R                      (retrieve rule for E’s content from R)
    pop a string of non-terminals w out of NT        (retrieve E’s child nodes)
    if w matches e
    then pop a string w’ of non-terminals out of NT and push w’N onto NT    (add E’s non-terminal to its predecessor siblings)
    else report “not accepted” and stop
report “accepted” and stop

See the paper by Murata, Lee, Mani, Kawaguchi
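The algorithm above can be sketched in Python on an in-memory tree, here an `xml.etree.ElementTree` element rather than a W3C DOM. The sketch deviates from the slide in three labelled ways: NT starts with one empty entry so that the root’s non-terminal has somewhere to go, the root’s non-terminal is finally checked against the start symbols, and text nodes and attributes are ignored. All function names and the dict representation of P are assumptions.

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"[A-Za-z][\w^-]*")


def matches(w, e):
    """MatchAlgo stand-in: is the list of non-terminals w in L(e)?"""
    pattern = TOKEN.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", e)
    pattern = pattern.replace("ϵ", "").replace(",", "").replace(" ", "")
    return re.fullmatch(pattern, "".join(n + ";" for n in w)) is not None


def validate_local(root, productions, start):
    """Validate an ElementTree element against a local tree grammar."""
    # locality: each terminal (tag name) belongs to at most one rule
    by_tag = {tag: (n, e) for n, (tag, e) in productions.items()}
    R = []      # stack of production rules, stored as (N, e)
    NT = [[]]   # stack of non-terminal strings; bottom entry is an addition for the root

    def visit(elem):
        # way down: store the rule and start remembering the child nodes
        if elem.tag not in by_tag:
            return False
        R.append(by_tag[elem.tag])
        NT.append([])
        for child in elem:  # depth-first, left-to-right
            if not visit(child):
                return False
        # way up: retrieve the rule and the children, then match
        n, e = R.pop()
        w = NT.pop()
        if not matches(w, e):
            return False
        NT[-1].append(n)  # w'N: add N to its predecessor siblings
        return True

    # the start symbol check is an addition to the slide's algorithm
    return visit(root) and NT[0][0] in start


G = {"S": ("a", "B,B*"), "B": ("b", "(C,C)|C"), "C": ("c", "ϵ|C")}
TREE = ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>")
ACCEPTED = validate_local(TREE, G, {"S"})
```

The recursion plays the role of the traversal; the two explicit stacks hold exactly the information the slide describes.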

Stacks and tree traversal, observations
• our algorithm visits a tree in a depth-first, left-to-right manner
• whenever we visit a node on our way
  – down, we push relevant information for this node on stacks
  – up, we pop relevant information for this node from stacks
• hence, whenever we are at a node n during this traversal, all relevant information regarding all ancestors of n is (in reverse order) on our stacks

• Let’s see how the algorithm works:
  – G = ({S,B,C}, {a,b,c}, {S}, P) with P = { S → a B,B*,  B → b (C,C)|C,  C → c ϵ|C }
  – T is the tree with root a and two b children; the first b has two c children, the second b has one c child

R is the stack of rules, NT is the stack of NT strings; both are listed top first.

 1. start:
      R = [ ]
      NT = [ ]
 2. down a:
      R = [ S → a B,B* ]
      NT = [ ϵ ]
 3. down first b:
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, ϵ ]
 4. down first c:
      R = [ C → c ϵ|C,  B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, ϵ, ϵ ]
 5. up first c: pop C → c ϵ|C and w = ϵ; yes, ϵ ∈ L(ϵ|C)
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, ϵ ]
 6. pop w’ = ϵ, push w’C = C:
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ C, ϵ ]
 7. down second c:
      R = [ C → c ϵ|C,  B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, C, ϵ ]
 8. up second c: pop C → c ϵ|C and w = ϵ; yes, ϵ ∈ L(ϵ|C)
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ C, ϵ ]
 9. pop w’ = C, push w’C = CC:
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ CC, ϵ ]
10. up first b: pop B → b (C,C)|C and w = CC; yes, CC ∈ L((C,C)|C)
      R = [ S → a B,B* ]
      NT = [ ϵ ]
11. pop w’ = ϵ, push w’B = B:
      R = [ S → a B,B* ]
      NT = [ B ]
12. down second b:
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, B ]
13. down third c:
      R = [ C → c ϵ|C,  B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, ϵ, B ]
14. up third c: pop C → c ϵ|C and w = ϵ; yes, ϵ ∈ L(ϵ|C)
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ ϵ, B ]
15. pop w’ = ϵ, push w’C = C:
      R = [ B → b (C,C)|C,  S → a B,B* ]
      NT = [ C, B ]
16. up second b: pop B → b (C,C)|C and w = C; yes, C ∈ L((C,C)|C)
      R = [ S → a B,B* ]
      NT = [ B ]
17. pop w’ = B, push w’B = BB:
      R = [ S → a B,B* ]
      NT = [ BB ]
18. up a: pop S → a B,B* and w = BB; yes, BB ∈ L(B,B*)
      R = [ ]
      NT = [ ]
19. report “accepted” and stop: “accepted” (“yes”), T ∈ L(G), as required by the validation problem stated earlier

Validating trees against tree grammars
• want to implement this algorithm?
  – walk the DOM tree in a depth-first, left-to-right way, or
  – use a SAX parser and do it in a streaming fashion
    • no need to keep whole tree in memory
    • validate-while-u-parse!
• ...and we can use this algorithm for general DTDs!
• ...next week, we’ll see how this works for
  – single-type tree grammars (and WXS)
    • rather straightforward because we still only have at most one run of our tree grammar on the input tree
  – general tree grammars (and Relax NG)…
  – ...all validate-while-u-parse!
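A streaming variant for local tree grammars can be sketched with Python’s standard `xml.sax` module: `startElement` corresponds to visiting an element on the way down, `endElement` to visiting it on the way up. The class and function names, the `NotAccepted` exception and the dict representation of P are assumptions of this sketch; text and attributes are ignored, and the final start symbol check is an addition, as is the extra bottom entry on NT.

```python
import re
import xml.sax

TOKEN = re.compile(r"[A-Za-z][\w^-]*")


def matches(w, e):
    """MatchAlgo stand-in: is the list of non-terminals w in L(e)?"""
    pattern = TOKEN.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", e)
    pattern = pattern.replace("ϵ", "").replace(",", "").replace(" ", "")
    return re.fullmatch(pattern, "".join(n + ";" for n in w)) is not None


class NotAccepted(Exception):
    """Raised as soon as the document is known not to be in L(G)."""


class LocalValidator(xml.sax.ContentHandler):
    def __init__(self, productions, start):
        super().__init__()
        self.by_tag = {tag: (n, e) for n, (tag, e) in productions.items()}
        self.start = set(start)
        self.R = []       # stack of production rules, stored as (N, e)
        self.NT = [[]]    # stack of non-terminal strings, plus a bottom entry for the root

    def startElement(self, name, attrs):   # way down
        if name not in self.by_tag:
            raise NotAccepted(name)
        self.R.append(self.by_tag[name])
        self.NT.append([])

    def endElement(self, name):            # way up
        n, e = self.R.pop()
        w = self.NT.pop()
        if not matches(w, e):
            raise NotAccepted(name)
        self.NT[-1].append(n)

    def endDocument(self):
        if self.NT[0][0] not in self.start:
            raise NotAccepted("root")


def validates(xml_text, productions, start):
    """Validate while parsing; the tree is never built in memory."""
    try:
        xml.sax.parseString(xml_text.encode("utf-8"), LocalValidator(productions, start))
        return True
    except NotAccepted:
        return False


G = {"S": ("a", "B,B*"), "B": ("b", "(C,C)|C"), "C": ("c", "ϵ|C")}
```

Because validation stops at the first `NotAccepted`, an invalid document is rejected without reading the rest of the input.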
