COMP60411: Semi-structured Data and the Web
Datatypes; Relax NG, XML Schema, and Tree Grammars
Conny Hedeler and Uli Sattler, University of Manchester
Datatypes and Representations
SE3/M3: Evaluating Robustness
• Robustness in the face of change
  – A measure of evolvability
    • If something changes, does our system break?
    • If it breaks, do we know that it broke?
    • If it broke, can we fix it?
    • If we “fixed” it, can we tell/how hard is it?
• Robustness is an organization-wide phenomenon
  – Fragility in one area can be compensated for by another
    • E.g., by someone who never sleeps and knows the system
  – Different sorts of fragility
    • With different probabilities and costs
• Robustness is the ability of a computer system to cope with errors during execution or the ability of an algorithm to continue to operate despite abnormalities in input, calculations, etc. [Wikipedia]
SE3/M3: XQuery, schemas, and types
[Figure: processing pipeline. Labels: XML doc.; Schema; Schema-aware parser; PSVI (tree adorned with default values & types); Query; Schema-aware query processor; Query processor; Query Answer.]
Example queries:
  ((0.5 * 2) cast as xs:integer)
  (validate {doc("el1.xml")})//element(*, AxiomType)
Some SE3 Questions
• Which query is most robust to changes in the schema?
  1. /*/(equivalent|subsumes|...)
  2. /*/*[ssd:axiom(.)]
  3. /*/element(*,el:Axiom)
  4. They are equi-robust (and fragile)
  5. They are equi-robust (and robust)
• Which query is most widely usable?
  1. /*/(equivalent|subsumes|...)
  2. /*/*[ssd:axiom(.)]
  3. /*/element(*,el:Axiom)
  4. They are equi-usable (and not widely usable)
  5. They are equi-usable (and widely usable)
Basics of Types
• What, in the most general sense, is a datatype?
  1. A set of (data) values
  2. A description of the arguments of a function
  3. Anything derived from xs:anyType
  4. An annotation of a variable/node/element
• Anything naming or describing a set
  – ...has an associated type!
  – But we may or may not be able to express this type
• Types are just sets (of “values”)
  • The “extensional” view
• A Type System is a language for
  – describing types (the “intensional” view)
  – associating types with other linguistic entities
    • E.g., literals, variables, expressions, programs
A Typical Type System & XSD
• some primitive or built-in or “basic” types
  – Integer, strings, etc.
  – xs:anyType, xs:string, xs:duration, ...
• some constructor to build composite types
  – Arrays, records, dictionaries, etc.
  – xs:list, xs:union
• other constructors
  – To, for example, create other derived types
  – xs:restriction, xs:extension
• a syntax for associating types with variables, items, ...
  – And functions, etc.
  – <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  – <person xsi:type="LongPersonType" phone="5433">
• A set of conditions for success or failure (Type Errors)
A Brief Tour of Type Systems
• Strong vs. Weak
  – Type errors are caught/reported vs. silently succeeding/causing havoc
• Static vs. Dynamic
  – Check types at compile time vs. at run time
• Explicit/Manifest vs. Implicit/Latent
  – The type of everything (variables, functions, elements) has (not) to be declared
  – Implicit: requires type inference
  – Explicit: requires type checking
• Nominal vs. Structural
  – Nominal: type compatibility relies on features of the declaration
    • I declare two types, “miles” and “feet”, whose values are integers
    • 1 as miles != 1 as feet
  – Structural: type compatibility relies entirely on value structure
    • 1 as miles == 1 as feet (1 is the same integer!)
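The nominal/structural distinction can be sketched in Python. This is only an illustration, not XQuery or Java semantics: the classes Miles and Feet and the two comparison helpers are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Miles:
    value: int


@dataclass(frozen=True)
class Feet:
    value: int


def nominal_equal(a, b) -> bool:
    """Nominal view: values are comparable only if their declared types match."""
    return type(a) is type(b) and a.value == b.value


def structural_equal(a, b) -> bool:
    """Structural view: only the structure of the value (here, one integer) matters."""
    return a.value == b.value


one_mile, one_foot = Miles(1), Feet(1)
print(nominal_equal(one_mile, one_foot))     # False: 1 as miles != 1 as feet
print(structural_equal(one_mile, one_foot))  # True: 1 is the same integer
```

Under the nominal comparison the declaration decides compatibility; under the structural one the two values are indistinguishable.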
Some questions
• Java’s type system is primarily
  • strong, manifest, and nominal
• XQuery’s type system is primarily
  • strong, latent, and nominal
Some Expression Examples
• if ($aBool) then 1+1 else "2"
  – its type depends on the value of $aBool:
• if (true()) then 1+1 else "2" is an instance of xs:integer:
  – (if (true()) then 3+1 else "2") instance of xs:integer returns “true”
• if (false()) then 1+1 else "2" is an instance of xs:string:
  – (if (false()) then 3+1 else "2") instance of xs:string returns “true”
• (if ($aBool) then 1+1 else "2") instance of (xs:integer | xs:string)
  – Not legal XQuery
Mistyped
• Obvious conflict
  – "2" + 2
• Making this conflict less obvious:
  – (if (false()) then 1+1 else "2") + 2
    • Same error as above
  – (if (true()) then 1+1 else "2") + 2
    • This is accepted!
• Making this conflict even less obvious:
  – declare function ssd:test($x as xs:boolean) as xs:integer {
      if ($x) then 1+1 else "2" + 2 };
  – declare function ssd:test($x as xs:boolean) as xs:integer {
      if ($x) then 1+1 else "2" };
• Callouts on these two declarations:
  – My checker doesn’t flag this error
  – It does flag this one!
• Error message: Arithmetic operator is not defined for arguments of types (xs:integer, xs:string)
Simple Promotion
• Explicit
  – (1.0 + ("1" cast as xs:integer)) instance of xs:decimal
  – True!
• Implicit
  – ((1.0 treat as xs:decimal) + 125E2) instance of xs:double
  – Also true
  – Note that treat as and cast as are not the same:
    • ("1.0" treat as xs:decimal)
      – doesn’t work: Required item type of value in 'treat as' expression is xs:decimal; supplied value has item type xs:string
    • ("1.0" cast as xs:decimal)
      – This results in 1
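The difference between cast as (convert the value) and treat as (assert the type, never convert) can be mimicked in Python. This is a rough analogue rather than XQuery semantics; the helper names cast_as_decimal and treat_as_decimal are made up for this sketch.

```python
from decimal import Decimal


def cast_as_decimal(value):
    """Analogue of 'cast as': convert the value to the target type."""
    return Decimal(value)


def treat_as_decimal(value):
    """Analogue of 'treat as': check that the value already has the target type."""
    if not isinstance(value, Decimal):
        raise TypeError(
            f"required type is Decimal; supplied value has type {type(value).__name__}"
        )
    return value


print(cast_as_decimal("1.0") == Decimal(1))   # the string is converted
print(treat_as_decimal(Decimal("1.0")))       # already a Decimal: passes through
try:
    treat_as_decimal("1.0")                   # a string is not a Decimal
except TypeError as err:
    print("treat as failed:", err)
```

As on the slide, the conversion succeeds on the string "1.0", while the assertion rejects it because no conversion is attempted.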
Complex Casting
http://msdn.microsoft.com/en-us/library/ms191231.aspx
Getting to PSVI
• Consider a very simple XQuery

  import schema default element namespace "…" at "el-typed.xsd";
  <instance-of>
    <constant name="sally"/>
    <atomic name="Person"/>
  </instance-of>/element(*, ClassExpression)

  – No results!
• Must validate!

  import schema default element namespace "…" at "el-typed.xsd";
  validate {
    <instance-of>
      <constant name="sally"/>
      <atomic name="Person"/>
    </instance-of>
  }/element(*, ClassExpression)

  – Returns: <atomic xmlns="..." name="Person"/>
  – validate generates a PSVI
import schema namespace el="http://www.cs.manchester.ac.uk/pgt/COMP60411/el" at "el-typed.xsd";
import schema namespace owl="http://www.w3.org/2002/07/owl#" at "owl2-xml.xsd";
declare namespace ex="http://ex.org";

(: Convert one typed EL axiom into a validated OWL axiom. :)
declare function ex:convertAxiom($ax as element(*, el:Axiom)) as element(*, owl:Axiom) {
  typeswitch ($ax)
    case schema-element(el:equivalent) return
      validate {
        <owl:EquivalentClasses>{
          for $expr in $ax/* return ex:convertExpression($expr)
        }</owl:EquivalentClasses>
      }
    default return
      validate {
        <owl:EquivalentClasses>
          <owl:Class IRI="http://BOGUS"/>
          <owl:Class IRI="http://BOGUS"/>
        </owl:EquivalentClasses>
      }
};

(: Convert one typed EL class expression; only atomic classes are handled properly. :)
declare function ex:convertExpression($expr as element(*, el:ClassExpression)) as element(*, owl:ClassExpression) {
  if ($expr instance of element(el:atomic))
  then validate { <owl:Class IRI="{$expr/@name}"/> }
  else validate { <owl:Class IRI="http://BOGUS"/> }
};

(: Convert a whole typed EL ontology, axiom by axiom. :)
declare function ex:convert($ont as element(*, el:Ontology)) as element(owl:Ontology, owl:Ontology) {
  validate {
    <owl:Ontology>{
      for $e in $ont/element(*, el:Axiom) return ex:convertAxiom($e)
    }</owl:Ontology>
  }
};

ex:convert(validate { doc("el1.xml")/* })
...that was a Complex Typed “Cast”
• where all input and output are typed
• ...how do we ensure that our system works correctly with types?

  <?xml version="1.0" encoding="UTF-8"?>
  <owl:Ontology xmlns:owl="http://www.w3.org/2002/07/owl#">
    <owl:EquivalentClasses>
      <owl:Class IRI="http://BOGUS"/>
      <owl:Class IRI="http://BOGUS"/>
    </owl:EquivalentClasses>
    <owl:EquivalentClasses>
      <owl:Class IRI="http://BOGUS"/>
      <owl:Class IRI="http://BOGUS"/>
    </owl:EquivalentClasses>
    <owl:EquivalentClasses>
      <owl:Class IRI="Person"/>
      <owl:Class IRI="http://BOGUS"/>
    </owl:EquivalentClasses>
    <owl:EquivalentClasses>
      <owl:Class IRI="http://BOGUS"/>
      <owl:Class IRI="http://BOGUS"/>
    </owl:EquivalentClasses>
  </owl:Ontology>

  (Callout on <owl:Class IRI="Person"/>: the only “proper” value)
Static type check - type soundness
• A (statically verified) type safe program
  – has some guaranteed behavior
    • and thus can be transformed or optimized in aggressive ways
  – may be more brittle
    • fails hard on invalid input
    • accepts maybe less input than possible

Type-inference rules are written in such a way that any value that can be returned by an expression is guaranteed to conform to the static type inferred for the expression. This property of a type system is called type soundness. A consequence of this property is that a query that raises no type errors during static analysis will also raise no type errors during execution on valid input data. The importance of type soundness depends somewhat on which errors are classified as "type errors," as we will see below.
http://www.informit.com/articles/article.aspx?p=100667&seqNum=6
Data Representations
• Data and data structures have representations
  – (More or less) Physical embodiments
  – (Ultimately) Bits in a machine
• The “same” data can have distinct representations
  – 1 vs. “one”
• The “same” data structure can have distinct representations
  – At different levels of abstraction
• One key distinction
  – Internal (“in-memory”)
  – External (“on disk”)
  – “Location” doesn’t really matter
• Generally:
  – External representations are for exchange between (heterogeneous) systems
Conversion
• We can go from external to internal (e2i)
  – Parsing, reading, loading, de-serializing, unmarshalling
• We can go from internal to external (i2e)
  – Serializing, writing, printing, saving, marshalling
  – Different systems may have different internals
    • At least in detail
  – Different applications may behave differently
• There and back again: Roundtripping
  – Internal to external to internal (i2e2i)
  – External to internal to external (e2i2e)
  – Ideally preserves key properties
    • Which?
    • When is it OK not to preserve them?
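A small e2i2e round trip can be tried with Python's standard xml.etree.ElementTree module. This is a sketch, and the sample document is invented for illustration. The tree-level properties (element names, attributes, text) survive, while character-level details of the external form need not.

```python
import xml.etree.ElementTree as ET

# External representation: note the single quotes and the double space.
external = "<person  phone='5433'><Name>Bob</Name></person>"

# e2i: parse the external representation into an in-memory tree.
internal = ET.fromstring(external)

# i2e: serialize the tree back to text.
external_again = ET.tostring(internal, encoding="unicode")

# ...and e2i once more, to compare what survived the round trip.
internal_again = ET.fromstring(external_again)

print(external_again)
print("same characters:", external_again == external)
print("same attributes:", internal.attrib == internal_again.attrib)
print("same text:", internal.find("Name").text == internal_again.find("Name").text)
```

Which properties count as "key" depends on the application: here attribute quoting and insignificant whitespace are treated as disposable.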
What is an XML “Document”?
[Figure, shown in four builds: the levels at which an XML document can be viewed, with example data units and the information or property required at each level.]
• Levels: cognitive; application; tree adorned with...; tree; token (complex, simple); character; bit
• Data unit examples: a tree of Element and Attribute nodes; <foo:Name t="8">Bob (token levels); < foo:Name t="8">Bob (character level); 10011010 (bit level)
• Information or property required (labels as shown): namespace; schema; nothing; a schema; well-formedness; which encoding (e.g., UTF-8)
• Callouts:
  – Errors here mean no XML! (SAX ErrorHandler)
  – Yay! XPath! XSLT! Etc.
  – Moving between levels: parse, validate, erase, serialise
  – “Same” inputs can have different “meanings”! (external validation)
  – ...we can have many... for “the same” meaning
The Essence of XML (with WXS)
• Thesis:
  – “XML is touted as an external format for representing data.”
• Two properties
  – Self-describing
    • Destroyed by external validation
  – Round-tripping
    • Destroyed by defaults and union types
http://bit.ly/essenceOfXML2
The Essence of XML (with WXS)
• Roundtripping issues
  – Internal to external and back
    • Take an element, foo, with content {“one”, “2”, 3}
    • Its (simple) type is a list of union of integer and string
    • Serialize
      – <foo>one 2 3</foo>
    • Parse and validate
      – Content is {“one”, 2, 3}: the string “2” has become the integer 2
  – External to internal and back
    • “001” to 1 to “1”
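The round-tripping problem can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the union is taken to be union(integer, string) with the first matching member type winning, and the function names are hypothetical.

```python
def serialize(items):
    """i2e: a list-typed simple value is written as whitespace-separated tokens."""
    return " ".join(str(item) for item in items)


def parse_union_integer_string(token):
    """Type one token against union(integer, string): first matching member wins."""
    try:
        return int(token)
    except ValueError:
        return token


def parse(text):
    """e2i: split on whitespace and type each token via the union."""
    return [parse_union_integer_string(token) for token in text.split()]


original = ["one", "2", 3]
external = serialize(original)   # the content of <foo>...</foo>
roundtripped = parse(external)

print(external)                  # one 2 3
print(roundtripped)              # ['one', 2, 3]
print(roundtripped == original)  # False: the string "2" came back as an integer

# External to internal and back: the lexical form "001" is lost.
print(serialize(parse("001")))   # 1
```

Serialization erases the distinction between the string “2” and the integer 2, so the parser has no way to restore it.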
The Essence of XML (with WXS)
• Conclusion:
  – “So the essence of XML is this: the problem it solves is not hard, and it does not solve the problem well.”
• It’s not obvious
  – That the issues are serious (enough)
  – That the problem solved is all that easy
  – That there aren’t other, worse issues
Tree Grammars
Observations/Q3:
• Documents/trees are finite structures
• A schema/grammar can describe no/finitely/infinitely many documents/trees
• For a given set of documents/trees, we can design various schemas/grammars
[Figure: excerpts of cartoon.dtd and of two cartoon documents (truncated in the source), and two example trees whose nodes carry positions ε, 0, 1, ... and labels A and B.]
N = {Book, PA, Editor, A, Paper, F, L}
Σ = {B, Name, F, L, A, P}
S = {Book, Paper}
P = { Book → B (Editor | PA),
      Paper → P PA+,
      Editor → Name (F, L),
      PA → Name (L, A),
      F → F ε,
      L → L ε,
      A → A ε }
Remember: Tree Grammars
๏ A set of trees is called a tree language (like sets of strings are languages)
  • A tree language can be empty, finite, or infinite
๏ A tree language TS is local / single-type / regular if there exists a local / single-type / (any) tree grammar G such that L(G) = TS.
  ‣ for one TS, there can be different tree grammars accepting exactly TS…
Properties of Local and Single-Type Tree Languages
- the following observation is an immediate consequence of the definitions of local and single-type tree languages:
  ★ Every local tree language is single-type, and every single-type tree language is regular.
- the next observation is a bit more tricky:
  ★ There are regular tree languages that are not single-type, and there are single-type tree languages that are not local.

Loc ⊊ ST ⊊ Reg
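The two grammar classes can be checked mechanically. The sketch below assumes the usual definitions, which this excerpt does not restate: two distinct non-terminals compete if their rules share the same terminal; a grammar is local if no non-terminals compete, and single-type if no competing non-terminals occur in the same content model or among the start symbols. The dictionary encoding and the second grammar (elements a and b with empty N content) are hypothetical.

```python
from itertools import combinations

# Each rule: non-terminal -> (terminal, non-terminals used in its content model).
BOOK_RULES = {
    "Book":   ("B", {"Editor", "PA"}),
    "Paper":  ("P", {"PA"}),
    "Editor": ("Name", {"F", "L"}),
    "PA":     ("Name", {"L", "A"}),
    "F":      ("F", set()),
    "L":      ("L", set()),
    "A":      ("A", set()),
}
BOOK_START = {"Book", "Paper"}

# Hypothetical WXS-style grammar: N is typed differently under a and under b.
TYPED_RULES = {
    "A": ("a", {"N_AS_ALIST"}),
    "B": ("b", {"N_AS_BLIST"}),
    "N_AS_ALIST": ("N", set()),
    "N_AS_BLIST": ("N", set()),
}
TYPED_START = {"A", "B"}


def competing(rules, x, y):
    """Two distinct non-terminals compete if their rules share the same terminal."""
    return x != y and rules[x][0] == rules[y][0]


def is_local(rules):
    """Local: no two non-terminals compete at all."""
    return not any(competing(rules, x, y) for x, y in combinations(rules, 2))


def is_single_type(rules, start):
    """Single-type: no competitors within one content model or among start symbols."""
    groups = [content for _, content in rules.values()] + [start]
    return not any(
        competing(rules, x, y)
        for group in groups
        for x, y in combinations(sorted(group), 2)
    )


print(is_local(BOOK_RULES), is_single_type(BOOK_RULES, BOOK_START))
print(is_local(TYPED_RULES), is_single_type(TYPED_RULES, TYPED_START))
```

In the book grammar, Editor and PA both produce Name and both occur in Book's content model, so that grammar is neither local nor single-type. The second grammar has competing non-terminals too, but never inside one content model, so it is single-type without being local.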
Single-Typedness and PSVIs...
• Imagine the following XML Schema was legal,

  <xs:element name="A">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
        <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

• and you asked a schema-aware XQuery processor to return all elements of type NewPersonType
  • ...as in //element(*, NewPersonType)
• from the little document below (little.xml):

  <A> <person> .... </person> </A>

• the answer would depend on the PSVI constructed for little.xml
  – what is the type of /A/person?
  – NewPersonType or OldPersonType?
• To avoid such confusion/nondeterminism, the UPA constraint ensures single-typedness, which in turn ensures a unique PSVI
Why & when single-type matters
★ A single-type grammar can have no more than one run on a tree.
  • a run corresponds to a PSVI
    – as it labels input tree/document nodes with non-terminals/types
  • ...hence validation against a schema that corresponds to a single-type grammar results in a unique PSVI
    – (PSVI = DOM tree adorned with default values & types)
    – hence schema-aware queries know/agree on what to return!
★ A regular grammar can have more than one run on a tree.
  • ...hence validation against a schema that does not correspond to a single-type grammar may result in one of many PSVIs
    – hence schema-aware queries may differ in their answer!
✴ Use a single-type schema language for schema-aware querying!
Tree Grammars: 1 more thing
• BTW, w.l.o.g., we can assume that no two production rules have the same non-terminal on the left-hand side and the same terminal, i.e., no
    N → P PA and N → P (Editor,Editor*).
  We can always rewrite those, e.g., to
    N → P (PA | (Editor,Editor*))
• ...so, how did we get here? From DTDs and XML schemas!
Tree Grammars ⇆ DTDs
• since DTDs don’t have “types”, just element names, they correspond to grammars of a peculiar, simple kind:
★ Tree grammars for DTDs are always local... even if the DTD has a non-deterministic content model
  – Note: <!ELEMENT N1 (M|(M,M))> is not deterministic and thus illegal (but can be replaced with <!ELEMENT N1 (M,(M|ε))>)

  <!ELEMENT T (N1,N2*)>
  <!ELEMENT N1 (M|(M,M))>
  <!ELEMENT N2 (#PCDATA)>
  <!ELEMENT M (#PCDATA)>

  F = (N, Σ, S, P) with
  N = {T, N1, N2, M, pcdata}
  Σ = {T, N1, N2, M, pcdata}
  S = {T}
  P = { T → T (N1,N2*),
        N1 → N1 (M|(M,M)),
        N2 → N2 pcdata,
        M → M pcdata,
        pcdata → pcdata ε }

  Example tree (position: label): ε: T; 0: N1; 0,0: M; 0,0,0: pcdata; 1: N2; 1,0: pcdata
Remember?!
• in DTDs and in WXS, content models are further restricted (for compatibility with SGML)
  – [DTD] deterministic (or 1-unambiguous),
    e.g., (M|(M,M)) is not deterministic, (M,(M|ε)) is.
    e.g., ((b, c) | (b, d)) is not deterministic, b,(c|d) is.
From http://www.w3.org/TR/REC-xml/:
As noted in 3.2.1 Element Content, it is required that content models in element type declarations be deterministic. This requirement is for compatibility with SGML (which calls deterministic content models "unambiguous"); XML processors built using SGML systems may flag non-deterministic content models as errors.
More formally: a finite state automaton may be constructed from the content model using the standard algorithms, e.g. algorithm 3.5 in section 3.9 of Aho, Sethi, and Ullman [Aho/Ullman]. In many such algorithms, a follow set is constructed for each position in the regular expression (i.e., each leaf node in the syntax tree for the regular expression); if any position has a follow set in which more than one following position is labeled with the same element type name, then the content model is in error and may be reported as an error.
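The follow-set test described in the quotation can be sketched in a few lines of Python. This is an illustrative implementation of the standard position-based (Glushkov-style) construction, not code from any XML processor. The tuple encoding of content models ("sym", "seq", "alt", "opt", "star") is an assumption made for this example.

```python
def analyse(node, labels, follow):
    """Return (nullable, first, last) position sets; fill labels and follow sets."""
    kind = node[0]
    if kind == "sym":
        pos = len(labels)          # each leaf of the expression gets its own position
        labels.append(node[1])
        follow[pos] = set()
        return False, {pos}, {pos}
    if kind in ("opt", "star"):
        _, first, last = analyse(node[1], labels, follow)
        if kind == "star":         # after the last position we may loop to the first
            for p in last:
                follow[p] |= first
        return True, first, last
    n1, f1, l1 = analyse(node[1], labels, follow)
    n2, f2, l2 = analyse(node[2], labels, follow)
    if kind == "alt":
        return n1 or n2, f1 | f2, l1 | l2
    # kind == "seq": positions ending the left part are followed by the right part
    for p in l1:
        follow[p] |= f2
    first = f1 | f2 if n1 else f1
    last = l1 | l2 if n2 else l2
    return n1 and n2, first, last


def is_deterministic(model):
    """True if no first/follow set contains two positions with the same element name."""
    labels, follow = [], {}
    _, first, _ = analyse(model, labels, follow)
    for candidates in [first, *follow.values()]:
        names = [labels[p] for p in candidates]
        if len(names) != len(set(names)):
            return False
    return True


M, b, c, d = ("sym", "M"), ("sym", "b"), ("sym", "c"), ("sym", "d")
print(is_deterministic(("alt", M, ("seq", M, M))))              # (M|(M,M))
print(is_deterministic(("seq", M, ("opt", M))))                 # (M,(M|ε))
print(is_deterministic(("alt", ("seq", b, c), ("seq", b, d))))  # ((b,c)|(b,d))
print(is_deterministic(("seq", b, ("alt", c, d))))              # b,(c|d)
```

For (M|(M,M)) the set of possible first positions already contains two different positions labelled M, which is exactly the situation the specification describes as an error.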
Tree Grammars and DTDs
• so, DTDs are local (and thus single-type) because they don’t have any types at all
  – and not because their content model is deterministic!
  – they are single-type even with a non-deterministic content model
• hence we could extend DTDs with types and still be single-type... provided we impose suitable restrictions
Tree Grammars ⇆ WXS
• tree grammars also capture the basic, structural part of WXS:
  ✓ types (complex and anonymous)
  ‣ model groups (we ignore them)
  ‣ derivation by extension and restriction (we ignore them)
  ‣ substitution groups (we ignore them)
  ‣ integrity constraints like keys (must be ignored, don’t fit into tree grammars)
• we only deal with simple XML schemas, but the general approach works for more
Tree Grammars ⇆ WXS
• one stupid problem with this: in XSD, we can have
  – named types, e.g., <xs:complexType name="BBlist">
  – unnamed types, e.g., <xs:element name="mylist">...
• ...hence we invent a lot of type names for unnamed types,
  – e.g., MYLIST for mylist
• we use a two-stage approach: to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar
  2. then flatten the generalized tree grammar into a tree grammar G
• this will be done such that T validates against S iff T is accepted by G.
Translating WXS into Tree Grammars
• take a simple XML Schema S and translate it into grammar G(S):
➡ for each top-level element in S of the form
  – <xs:element name="mylist" type="Blist"></xs:element>
  • add the following production rule to G(S):
    – MYLIST → mylist BLIST^TYPE
    – add MYLIST, BLIST^TYPE to non-terminals, add mylist to terminals
➡ for each top-level element in S of the form
  – <xs:element name="mylist">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="ename" type="Comp" maxOccurs="unbounded"/>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  • add the following production rules to G(S):
    – MYLIST → mylist ENAME,ENAME*
    – ENAME → ename COMP^TYPE
  • what is the default for minOccurs? (It is 1, hence ENAME,ENAME* rather than ENAME*.)
Translating WXS into Tree Grammars
➡ for each top-level element in S of the form
  – <xs:complexType name="Blist">
      <xs:sequence>
        <xs:element name="friend" type="Person" minOccurs="1" maxOccurs="2"/>
      </xs:sequence>
    </xs:complexType>
  • add the following production rules to G(S):
    – BLIST^TYPE → (FRIEND | (FRIEND,FRIEND))    %% generalized rule: to be expanded!
    – FRIEND → friend PERSON^TYPE
    – add BLIST^TYPE, FRIEND, PERSON^TYPE to non-terminals, add friend to terminals
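How minOccurs/maxOccurs turn into a content-model expression can be sketched as a small Python function. The function name and the string output format are assumptions made for this example; the defaults of 1 for both bounds are those of XML Schema, and None stands in for maxOccurs="unbounded".

```python
def occurrences(name, min_occurs=1, max_occurs=1):
    """Expand an element particle into a content-model expression over one non-terminal.

    max_occurs=None stands for maxOccurs="unbounded".
    """
    if max_occurs is None:
        # min_occurs required copies, followed by arbitrarily many more
        return ",".join([name] * min_occurs + [name + "*"])
    if max_occurs < min_occurs:
        raise ValueError("maxOccurs must not be smaller than minOccurs")

    def group(n):
        if n == 0:
            return "ε"
        if n == 1:
            return name
        return "(" + ",".join([name] * n) + ")"

    alternatives = [group(n) for n in range(min_occurs, max_occurs + 1)]
    if len(alternatives) == 1:
        return alternatives[0]
    return "(" + " | ".join(alternatives) + ")"


print(occurrences("FRIEND", 1, 2))     # (FRIEND | (FRIEND,FRIEND))
print(occurrences("ENAME", 1, None))   # ENAME,ENAME*
print(occurrences("N", 0, None))       # N*
```

The first two calls reproduce the content models used for Blist and mylist on the slides.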
Translating WXS into Tree Grammars
➡ for each top-level element in S of the form
  – <xs:complexType name="BBlist">
      <xs:choice>
        <xs:sequence>
          <xs:element name="A" type="xs:string"/>
          <xs:element name="B" type="xs:string"/>
        </xs:sequence>
        <xs:sequence>
          <xs:element name="A" type="xs:string"/>
          <xs:element name="C" type="xs:string"/>
        </xs:sequence>
      </xs:choice>
    </xs:complexType>
    %% UPA violation: Oxygen complains!
  • add the following production rules to G(S):
    – BBLIST^TYPE → (A,B) | (A,C)    %% generalized rule: to be expanded!
    – A → A STRING^TYPE
    – B → B STRING^TYPE
    – C → C STRING^TYPE
    – add BBLIST^TYPE, A, B, C, STRING^TYPE to non-terminals, add A, B, C to terminals
Translating WXS into Tree Grammars
• Consider the following case:

  <xs:complexType name="AT">
    <xs:sequence>
      <xs:element name="N" type="Alist" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="BT">
    <xs:sequence>
      <xs:element name="N" type="Blist" minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

• To handle cases like the one above we can’t always add rules
  – AT^TYPE → N*, BT^TYPE → N*
  – N → N ??LIST^TYPE
• Instead, we translate these as
  – AT^TYPE → N^AS^ALIST^TYPE*
  – BT^TYPE → N^AS^BLIST^TYPE*
  – N^AS^ALIST^TYPE → N ALIST^TYPE
  – N^AS^BLIST^TYPE → N BLIST^TYPE
Translating WXS into Tree Grammars
Our translation yields almost a tree grammar:
• it produces illegal rules of the form X → e, i.e., without a terminal
  – e.g., BLIST^TYPE → (FRIEND | (FRIEND,FRIEND))
• our grammar model doesn’t handle those (check the definition of a run)
๏ hence we expand these illegal rules:
  • e.g., MYLIST → mylist BLIST^TYPE would be transformed into
    – MYLIST → mylist (FRIEND | (FRIEND,FRIEND))
  • ...and if we had <xs:element name="yourlist" type="Blist"/>, then we would also have
    – YOURLIST → yourlist BLIST^TYPE and thus
    – YOURLIST → yourlist (FRIEND | (FRIEND,FRIEND))
• Expansion procedure:
    repeat
      pick an illegal rule X → e:
      – remove X → e from the rule set
      – replace all occurrences of X in the rule set with e
    until no illegal rules are left in the rule set
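The expansion loop can be sketched in Python as follows. The encoding is an assumption made for this illustration: each rule maps a non-terminal to a pair (terminal, content tokens), and a rule is illegal when its terminal is None. The sketch also assumes that no illegal rule mentions itself directly in its own body.

```python
def expand(rules):
    """Inline every illegal rule X -> e (no terminal) into the remaining rules."""
    rules = dict(rules)
    while True:
        illegal = [x for x, (terminal, _) in rules.items() if terminal is None]
        if not illegal:
            return rules
        x = illegal[0]
        _, body = rules.pop(x)                     # remove X -> e from the rule set
        for lhs, (terminal, content) in list(rules.items()):
            replaced = []
            for token in content:                  # replace each occurrence of X by e
                replaced.extend(body if token == x else [token])
            rules[lhs] = (terminal, replaced)


GENERALIZED = {
    "MYLIST": ("mylist", ["BLIST^TYPE"]),
    "YOURLIST": ("yourlist", ["BLIST^TYPE"]),
    "BLIST^TYPE": (None, ["(", "FRIEND", "|", "(", "FRIEND", ",", "FRIEND", ")", ")"]),
    "FRIEND": ("friend", ["PERSON^TYPE"]),
}

for lhs, (terminal, content) in expand(GENERALIZED).items():
    print(lhs, "→", terminal, "".join(content))
```

Working on token lists rather than on raw strings avoids accidentally rewriting a name that merely contains X as a substring.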
Translating WXS into Tree Grammars
• Expanding illegal rules even works with cyclic type definitions - try

  <xs:complexType name="NType">
    <xs:choice>
      <xs:element name="test2" type="AType"/>
      <xs:element name="EndElement" type="xs:string"/>
    </xs:choice>
  </xs:complexType>

  <xs:complexType name="AType">
    <xs:choice>
      <xs:element name="test1" type="NType"/>
      <xs:element name="EndElement" type="xs:string"/>
    </xs:choice>
  </xs:complexType>

• This gives you these rules, including 2 illegal rules

  NType^TYPE → (TEST2 | ENDELEMENT)
  TEST2 → test2 AType^TYPE
  ENDELEMENT → EndElement STRING^TYPE
  ...
  AType^TYPE → (TEST1 | ENDELEMENT)
  TEST1 → test1 NType^TYPE
  ENDELEMENT → EndElement STRING^TYPE
  ...

• that can be expanded as follows:

  TEST2 → test2 (TEST1 | ENDELEMENT)
  ENDELEMENT → EndElement STRING^TYPE
  ...
  TEST1 → test1 (TEST2 | ENDELEMENT)
  ENDELEMENT → EndElement STRING^TYPE
  ...
Translating WXS into Tree Grammars
• So, to transform an XML schema S into a tree grammar G,
  1. we translate S into a generalized tree grammar G’
  2. then expand G’ into a tree grammar G
★ Then any tree T validates against S iff T is accepted by G.
• So, what are the tree grammars we get as results?
  – they are tree grammars
  – are they single-type?
  – are they local?
★ Tree grammars corresponding to WXS are not local.
  • E.g., consider
    – N^AS^ALIST^TYPE → N ALIST^TYPE
    – N^AS^BLIST^TYPE → N BLIST^TYPE
    – ...N^AS^ALIST^TYPE and N^AS^BLIST^TYPE are competing!
Translating WXS into Tree Grammars
★ Tree grammars corresponding to WXS are single-type.
  – This is ensured by the Unique Particle Attribution constraint in WXS.
• Tree grammars corresponding to DTDs are local, ...hence
★ DTDs are less expressive than XML schemata.
• That is, there are tree languages that we can describe in WXS, but not in DTDs, e.g.,

  N = {Book, PA, Editor, A, Paper, F, L}
  Σ = {B, N, A, P, C}
  S = {Book, Paper}
  P = { Book → B (Editor | PA),
        Paper → P PA,
        Editor → N (F, L),
        PA → N (L, A),
        F → F ε,
        L → L ε,
        A → A ε }

[Figure: two example trees, one rooted at B and one rooted at P, each with a child N at position 0 that has children at positions 0,0 and 0,1.]
Remember:
• In XML Schema, the content model is constrained as well
  – to make validation easier & for compatibility with SGML
  – e.g., through the Unique Particle Attribution constraint:

A content model must be formed such that during validation of an element information item sequence, the particle component contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#cos-nonambig

Rephrasing: a content model M must be formed such that, during validation of an element E’s child node sequence E1...Ek, we can, starting from i = 1 and increasing, associate each Ei with a single particle contained (possibly implicitly) in M without examining the content or attributes of Ei, and without any information about any Ej with j > i.
Translating WXS into Tree Grammars
★ Tree grammars corresponding to WXS are single-type.
  – This is ensured by the Unique Particle Attribution constraint in WXS.
• We know: validation against a schema that corresponds to a single-type grammar results in a unique PSVI
  – (PSVI = DOM tree adorned with default values & types)
  – hence schema-aware queries know/agree on what to return!
  – hence WXS can be used for schema-aware querying!
Using more than 1 schema:
[Figure: validation pipeline with two schemas. Labels: XML doc.; rich Schema 1; schema-aware parser for a rich schema language, e.g., RelaxNG (outcomes: validates / doesn’t validate, ErrorHandler); single-type Schema 2; schema-aware parser for a single-type schema language, e.g., XSD; PSVI (tree adorned with default values & types); Query or other input; schema-aware Query processor; Query Answer; Your application.]
Content models & types in DTD & WXS
• (we already know that) in WXS, we have a type hierarchy
  – an element of a type X derived by restriction or extension from Y can be used in place of an element of type Y
    • but you have to say so explicitly:

      <person phone="2">
        <Name>Peter</Name>
        <DoB>1966-05-04</DoB>
      </person>
      <person xsi:type="LongPersonType" phone="5432">
        <Name>Paul</Name>
        <DoB>1967-05-04</DoB>
        <address>Manchester</address>
      </person>

  – we call this ‘named’ typing:
    • sub-types are declared (restriction or extension), and not inferred (by comparing structure)
  – in DTDs, we don’t have types!
• To prevent the difficulties that types can cause in WXS, the Element Declarations Consistent constraint is imposed, which rules out content models such as:

  <xs:complexType>
    <xs:sequence>
      <xs:element name="person" type="NewPersonType" minOccurs="0" maxOccurs="1"/>
      <xs:element name="person" type="OldPersonType" minOccurs="0" maxOccurs="1"/>
    </xs:sequence>
  </xs:complexType>
Summary
• So far, we have seen how to translate schema languages into tree grammars. We saw that
  – each DTD can be faithfully translated into a local tree grammar, and therefore into a single-type one
    • hence each DTD corresponds to a single-type grammar
    • hence there is exactly 1 PSVI for each document that validates against a DTD
  – each XML schema can be faithfully translated into a single-type tree grammar,
    • hence there is exactly 1 PSVI for each document that validates against an XML schema
• ...we also saw that parts of the UPA constraint help to generate the PSVI: do we need the other parts?
Relax NG, a very powerful schema language
Relax NG: yet another schema language
• Relax NG was designed to be a simpler schema language
• (described in a readable on-line book by Eric Van der Vlist)
• and allows us to describe XML documents in terms of their tree abstractions:
  – no default attributes
  – no entity declarations
  – no key/uniqueness constraints
  – minimal datatypes: only “token” and “string”, like DTDs (but a mechanism to use XSD datatypes)
• since it is so simple/flexible
  – it’s (claimed to be) easy to use
  – it doesn’t have complex constraints on the description of element content like determinism/1-unambiguity
  – it’s claimed to be reliable
  – but you need other tools to do other things (like datatypes and attributes)
Relax NG: another side of Determinism
• remember that DTDs and WXS required their content models to be
  – [DTD] deterministic (and thus look-ahead-free)
  – [WXS] deterministic (EDC: every matching child node sequence matches in exactly one way only)
  – [WXS] the UPA constraint expresses both of these, and further constraints besides
• determinism & single-typedness have a reason:
  – some tools annotate a (valid) document while parsing:
    • type information -- to be exploited, e.g., for concise queries (remember the assignment?)
    • default attribute values
  – if your schema is not single-type, then
    • tools validating the same document against the same schema may construct different PSVIs
    • this can happen with different tools or different runs of the same tool
Relax NG: another side of Validation
Reasons why one would want to validate an XML document:
• ensure that the structure is ok
• ensure that values in elements/attributes are of the correct type
• generate a PSVI to work with
• check constraints on co-occurrence of elements/how they are related
• check other integrity constraints, e.g., a person’s age vs. their mother’s age
• check constraints on elements/their values against external data
  – postcode correctness
  – VAT/tax/other numeric constraints
  – spell checking
...only a few of these checks can be carried out by validating against schemas...
Relax NG was designed to
  1. validate structure and
  2. link to datatype validators to type check values of elements/attributes
Relax NG: basic principles
• both DTDs and XSD allow the user to describe documents
– by descriptions of their elements and attributes, e.g., an element “person” must have two element child nodes, name and address, and ....
• Relax NG is based on patterns (similar to XPath expressions):
– a pattern is a description of a set of valid node sets
– we can view our example as different combinations of different parts, and design patterns for each
– enhanced flexibility
<?xml version="1.0" encoding="UTF-8"?>
<people>
  <person age="41">
    <name>
      <first>Harry</first>
      <last>Potter</last>
    </name>
    <address>4 Main Road </address>
    <project type="epsrc" id="1"> DeCompO </project>
    <project type="eu" id="3"> TONES </project>
  </person>
  <person>....
</people>
57
Relax NG: good to know
Relax NG comes in 2 syntaxes
• the compact syntax
– succinct
– human readable
• the XML syntax
– verbose
– machine readable
Trang converts between the two, phew! (and also into/from other schema languages)
Trang can be used from Oxygen
grammar {
  start = element name {
    element first { text },
    element last { text } }
}
<grammar xmlns="http:..." xmlns:a="http:.." datatypeLibrary="http:...">
  <start>
    <element name="name">
      <element name="first"><text/></element>
      <element name="last"><text/></element>
    </element>
  </start>
</grammar>
58
Relax NG - structure validation:
• 3 kinds of patterns, for the 3 “central” nodes:
– text: <text/>
– attribute: <attribute name="age"/> <attribute name="type"/>
– element:
<element name="name">
  <element name="first"> <text/></element>
  <element name="last"> <text/></element>
</element>
• these can be combined
– ordered groups
– unordered groups
– choices
• we can constrain cardinalities of patterns
• text nodes
– can be marked as “data” and linked
• we can specify libraries of patterns
element name { element first { text }, element last { text }}
59
Relax NG - structure validation: ordered groups
• we can name patterns
• in strange “chains”
• we can use ?, *, and +:
grammar { start = people-element
people-element = element people { person-element+ }
person-element = element person { attribute age { text }, name-element, address-element+, project-element*}
name-element = element name { element first { text }, element middle { text }?, element last { text } }
address-element = element address { text }
project-element = element project { attribute type { text }, attribute id {text}, text }}
use “?” if optional
Relax NG - structure validation: ordered groups in XML syntax (Trang knows…)
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="people"><ref name="people-content"/></element>
  </start>
  <define name="people-content">
    <oneOrMore>
      <element name="person"><ref name="person-content"/></element>
    </oneOrMore>
  </define>
  <define name="person-content">
    <attribute name="age"/>
    <element name="name"><ref name="name-content"/></element>
    <oneOrMore>
      <element name="address"><text/></element>
    </oneOrMore>
    <zeroOrMore>
      <element name="project"><ref name="project-content"/></element>
    </zeroOrMore>
  </define>
  <define name="name-content">
    <element name="first"><text/></element>
    <optional><element name="middle"><text/></element></optional>
    <element name="last"><text/></element>
  </define>
  <define name="project-content">
    <attribute name="type"/><attribute name="id"/><text/>
  </define>
</grammar>
60
61
Relax NG - structure validation: different styles
grammar { start = element people {people-content}
people-content = element person { person-content }+
person-content = attribute age { text }, element name {name-content}, element address { text }+, element project {project-content}*
name-content = element first { text }, element middle { text }?, element last { text }
project-content = attribute type { text }, attribute id {text}, text }
grammar { start = people-element
people-element = element people { person-element+ }
person-element = element person { attribute age { text }, name-element, address-element+, project-element*}
name-element = element name { element first { text }, element middle { text }?, element last { text } }
address-element = element address { text }
project-element = element project { attribute type { text }, attribute id {text}, text }}
• so far, we modelled ‘element centric’...we can model ‘content centric’:
62
Relax NG - structure validation: ordered groups
• we can combine patterns in fancy ways:
grammar {
start = element people {people-content}
people-content = element person { person-content }+
person-content = HR-stuff, contact-stuff
HR-stuff = attribute age { text }, project-content
contact-stuff = attribute phone { text }, element name {name-content}, element address { text }
name-content = element first { text }, element middle { text }?, element last { text }
project-content = element project { attribute type { text }, attribute id {text}, text }+
}
63
Relax NG: structure validation summary
• Relax NG’s specification of structure differs from DTDs and XSD:
– grammar oriented
– 2 syntaxes with automatic translation
– flexible: we can gather different aspects of elements into different patterns
– unconstrained: no constraints regarding unambiguity/1-unambiguity/deterministic content model/Unique Particle Attribution/Element Declarations Consistent
– like for XSD, we have an “ALL” construct for unordered groups, “interleave” &:
here, the patterns must appear in the specified order (except for attributes, which are allowed to appear in any order in the start tag):
element person { attribute age { text }, attribute phone { text }, name-element, address-element+, project-element* }
here, the patterns can appear in any order:
element person { attribute age { text } & attribute phone { text } & name-element & address-element+ & project-element* }
Translating Relax NG into tree grammars by example 1
• ...let’s see one more
64
grammar {
start = AddressBook
AddressBook = element addressBook { Card* }
Card = element card { Inline }
Inline = Name, Email+
Name = element name { text }
Email = element email { text }
}
Translate into G = (N, Σ, S, P) with
N = {AddressBook, Card, Inline, Name, Email, Pcdata}
Σ = {addressBook, card, name, email, pcdata}
S = {AddressBook}
P = { AddressBook → addressBook Card*,
      Card → card Inline,
      Inline → Name, Email+,
      Name → name Pcdata,
      Email → email Pcdata,
      Pcdata → pcdata ϵ }
“element y” ➟ y ∈ Σ
...possibly also “uppercased copy” ➟ Y ∈ N
all other user defined symbols X ➟ X ∈ N
...translating Relax NG rules is easy (depending on Relax NG style)
Translating Relax NG into tree grammars by example 2
65
grammar { start = p-el
p-el = element people { per-el+ }
per-el = element person { attribute age { text }, na-el, ad-el+, pro-el*}
na-el = element name { element first { text }, element middle { text }?, element last { text } }
ad-el = element address { text }
pro-el = element project { attribute type { text }, attribute id {text}, text }}
Translate into G = (N, Σ, S, P) with
N = {P-EL, PER-EL, NA-EL, AD-EL, PRO-EL, FIRST, MIDDLE, LAST, Pcdata}
Σ = {people, person, name, first, middle, last, address, project, pcdata}
S = {P-EL}
P = { P-EL → people PER-EL, PER-EL*,
      PER-EL → person NA-EL, AD-EL, AD-EL*, PRO-EL*,
      NA-EL → name FIRST, (MIDDLE|ϵ), LAST,
      FIRST → first Pcdata,
      MIDDLE → middle Pcdata,
      LAST → last Pcdata,
      AD-EL → address Pcdata,
      PRO-EL → project Pcdata,
      Pcdata → pcdata ϵ }
Ignore!
Ignore!
This Relax NG style makes translation of rules easy
Translating Relax NG into tree grammars by example 3
66
grammar { start = element people {people-content}
people-content = element person { person-content }+
person-content = attribute age { text }, element name {name-content}, element address { text }+, element project {project-content}*
name-content = element first { text }, element middle { text }?, element last { text }
project-content = attribute type { text }, attribute id {text}, text }
Translate into G = (N, Σ, S, P) with
N = {PEOPLE, P-C, PER-C, NA, NA-C, PERSON, ADR, PROJ, PRO-C, FIRST, MIDDLE, LAST, Pcdata}
Σ = {people, person, name, first, middle, last, address, project, pcdata}
S = {PEOPLE}
P = { PEOPLE → people P-C,
      P-C → PERSON, PERSON*,
      PERSON → person PER-C,
      PER-C → NA, ADR, ADR*, PROJ*,
      NA → name NA-C,
      ADR → address Pcdata,
      PROJ → project PRO-C,
      PRO-C → pcdata ϵ,
      NA-C → FIRST, (MIDDLE|ϵ), LAST,
      FIRST → first Pcdata,
      MIDDLE → middle Pcdata,
      LAST → last Pcdata,
      Pcdata → pcdata ϵ }
expand!
expand!
This Relax NG style makes translation of rules less easy… and leads to generalized rules!
Translating Relax NG into tree grammars by example 3
Two things we have already seen when translating WXS:
• “generalized” rules -- which can & need to be expanded, as for WXS:
...
people-content = element person { person-content }+
.....
person-content = attribute age { text }, element name {name-content}, element address { text }+, element project {project-content}*
...
PERSON → person PER-C,
PER-C → NA, ADR, ADR*, PROJ*,
NA → name NA-C,
ADR → address Pcdata,
...
expand!
for each illegal rule X → e:
– remove X → e from rule set
– replace all occurrences of X in rule set with e
• we might have to “contextualise” names and types of elements: ...
67
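The expansion step for generalized rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the dictionary representation, the names `expand` and `NAME`, and the use of `None` to mark a rule without a terminal are all choices made for this sketch, and generalized rules are assumed not to be recursive. The replacement is wrapped in parentheses so that operator precedence is preserved.

```python
import re

# A non-terminal name as written on the slides, e.g. "P-C" or "NA^NA-C".
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_^-]*")

def expand(rules):
    """Expand generalized rules.

    rules maps a non-terminal to (terminal, content model); a terminal of
    None marks a generalized ("illegal") rule X → e.
    Assumption of this sketch: generalized rules are not recursive.
    """
    rules = dict(rules)
    while True:
        generalized = [x for x, (t, _) in rules.items() if t is None]
        if not generalized:
            return rules
        x = generalized[0]
        _, e = rules.pop(x)  # remove X → e from the rule set

        def substitute(match, x=x, e=e):
            # replace an occurrence of X by (e), leave other names alone
            return "(" + e + ")" if match.group(0) == x else match.group(0)

        rules = {n: (t, NAME.sub(substitute, model))
                 for n, (t, model) in rules.items()}

# Excerpt of the rule set from example 3 (only some rules are shown).
rules = {
    "PEOPLE": ("people", "P-C"),
    "P-C": (None, "PERSON,PERSON*"),
    "NA": ("name", "NA-C"),
    "NA-C": (None, "FIRST,(MIDDLE|ϵ),LAST"),
}
expanded = expand(rules)
```

After the call, `expanded` contains only rules of the form N → a e; the generalized rules for P-C and NA-C have been substituted into the rules that used them.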
Translating Relax NG into tree grammars by example 4
68
2. we might have to “contextualise” names and types of elements, to handle schemas where the same element name is used in different contexts with different types:
...
people-content = element person { person-content }+, element friend {friend-content }+
.....
person-content = attribute age { text }, element name {name-content}, ...
friend-content = attribute age { text }, element name {friend-name-content},...
...
P-C → PERSON, PERSON*, FRIEND, FRIEND*
PERSON → person PER-C,
FRIEND → friend FRIE-C,
PER-C → NA^NA-C, ...
FRIE-C → NA^FRIE-NA-C, ...
NA^NA-C → name NA-C,
NA^FRIE-NA-C → name FRIE-NA-C,
...
Translating Relax NG into tree grammars
• each Relax NG schema can be faithfully translated into a tree grammar:
– local? no: example on previous slide leads to competing non-terminals (NA^PER-C and NA^FRIE-C)
... NA^PER-C → name NA-C, NA^FRIE-C → name NA-C, ...
– single-type? no: see example below, NA^NA-C and NA^FO-NA-C compete and occur in the same RHS
...
person-content = attribute age { text }, element name {name-content} | element name {foreign-name-content}, ...
...
PER-C → NA^NA-C | NA^FO-NA-C
NA^NA-C → name NA-C,
NA^FO-NA-C → name FO-NA-C,
...
– so is Relax NG as powerful as tree grammars?
69
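To make “competing non-terminals” concrete, here is a small Python sketch that checks whether a (fully expanded) tree grammar is local or single-type. The representation and the function names are assumptions made for this sketch; the two example grammars are simplified versions of the person/friend and foreign-name examples (content of name is reduced to Pcdata, and each non-terminal is assumed to have exactly one rule).

```python
import re
from itertools import combinations

NAME = re.compile(r"[A-Za-z][A-Za-z0-9_^-]*")  # non-terminal names

def competing_pairs(rules):
    """Pairs of different non-terminals whose rules share the same terminal.

    rules maps a non-terminal to (terminal, content model).
    """
    return {frozenset((a, b)) for a, b in combinations(rules, 2)
            if rules[a][0] == rules[b][0]}

def is_local(rules):
    # local: no competing non-terminals at all
    return not competing_pairs(rules)

def is_single_type(rules, start):
    # single-type: competing non-terminals never occur together in one
    # content model, and start symbols do not compete with each other
    pairs = competing_pairs(rules)
    groups = [set(start)] + [set(NAME.findall(m)) for _, m in rules.values()]
    return not any(pair <= group for group in groups for pair in pairs)

# Simplified person/friend grammar: the two "name" non-terminals compete,
# but never inside the same right-hand side.
g_friend = {
    "PEOPLE": ("people", "PERSON,PERSON*,FRIEND,FRIEND*"),
    "PERSON": ("person", "NA^NA-C"),
    "FRIEND": ("friend", "NA^FRIE-NA-C"),
    "NA^NA-C": ("name", "Pcdata"),
    "NA^FRIE-NA-C": ("name", "Pcdata"),
    "Pcdata": ("pcdata", "ϵ"),
}
# Simplified foreign-name grammar: competing non-terminals in one RHS.
g_foreign = {
    **g_friend,
    "PERSON": ("person", "NA^NA-C|NA^FO-NA-C"),
    "NA^FO-NA-C": ("name", "Pcdata"),
}
```

With these inputs, `g_friend` is not local but still single-type, while `g_foreign` is neither, which matches the two “no” answers above.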
Relax NG schema is indeed as powerful as tree grammars
★ Every tree grammar can be faithfully translated into a Relax NG schema.
• Proof (not too hard): given a tree grammar G = (N, Σ, S, P),
1. translate each production rule N → t regexp in P into
N = element t { regexp }
(fortunately, the tree grammar regular expression syntax is very close to and more strict than Relax NG regular expression syntax)
2. Put the resulting statements into a grammar, where N1 , ... , Nk are all start symbols, i.e., S = {N1 , ... , Nk}
grammar {start = N1 | ... | Nk ..... }
3. Call the resulting schema GS
★ Then T ∈ L(G) if and only if T validates against GS.
70
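The translation in the proof can be sketched in a few lines of Python. The function name `to_relaxng`, the dictionary representation, and two conventions are assumptions of this sketch and not part of the slides: ϵ is written as the Relax NG pattern `empty`, and a rule for the terminal pcdata is emitted as `text`. Non-terminal names are assumed to be valid Relax NG identifiers (so no `^`).

```python
def to_relaxng(rules, start):
    """Render a tree grammar as Relax NG compact syntax (a sketch).

    rules maps a non-terminal N to (t, regexp) for a rule N → t regexp;
    start is a list of start symbols N1, ..., Nk.
    """
    lines = ["grammar {", "  start = " + " | ".join(start)]
    for n, (t, regexp) in rules.items():
        if t == "pcdata":
            # assumption of this sketch: text nodes become the `text` pattern
            lines.append(f"  {n} = text")
        else:
            body = regexp.replace("ϵ", "empty")
            lines.append(f"  {n} = element {t} {{ {body} }}")
    lines.append("}")
    return "\n".join(lines)

# The address book grammar, with the generalized rule for Inline expanded.
address_book = {
    "AddressBook": ("addressBook", "Card*"),
    "Card": ("card", "Name,Email+"),
    "Name": ("name", "Pcdata"),
    "Email": ("email", "Pcdata"),
    "Pcdata": ("pcdata", "ϵ"),
}
schema = to_relaxng(address_book, ["AddressBook"])
print(schema)
```

The printed schema has one `N = element t { regexp }` definition per production rule and a `start` line listing the start symbols, as in steps 1 and 2 of the proof.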
Tree Grammars and Schema Languages• Harvest Time!
• but then, isn’t validation of an XML document against a Relax NG schema really complicated and complex (i.e., space and/or time consuming)?
• perhaps it’s even undecidable or intractable?
71
[Figure: classes of tree grammars Loc, ST, Reg, labelled DTD, WXS, Relax NG]
with our knowledge
How costly is validity testing?…
Does it matter against which kind of schema?
…Is Single-Type cheaper than general?
72
73
How costly is schema validation?
[Figure: an XML doc. goes either to a schema-aware parser for a rich schema language (e.g. RelaxNG, with rich Schema 1) or to a schema-aware parser for a single-type (s-t) schema language (e.g. XSD, with single-type Schema 2). If it validates, the PSVI (tree adorned with default values & types) is passed to a schema-aware query processor, which takes a query or other input from your application and returns the query answer; if it doesn’t validate, the ErrorHandler is called.]
Schema Languages and Tree Grammars
• We have learned about a third, flexible, liberal schema language, Relax NG
– how to translate Relax NG schemas into tree grammars
➡ more liberal than single-type/XSD
• Now, we will look at:
– the problem of
– algorithms for
validating a document against a schema!
74
algorithm: input Tree T, Grammar G; output “yes”, if T ∈ L(G), “no”, otherwise
See the paper by Murata, Lee, Mani, Kawaguchi
• To design our “schema validator”,
1. we start with the easy case: assume that G is local (this gives us automatically a validator for structural aspect of DTDs)
2. then expand algorithm to single-type (this gives us automatically a validator for structural aspect of WXS)
3. then expand to general tree grammars (...Relax NG)
– we also assume that we have a subroutine MatchAlgo:
input: String w, regular expression e
output: “yes”, if w ∈ L(e) (w matches e); “no”, otherwise
– ...if time permits, we will see later how to build that one (it’s based on a translation of regular expressions into finite state machines (aka automata)), otherwise
• remember your undergraduate studies (?)
• read it up, e.g., in the textbook by Hopcroft, Ullman
75
ValAlgo:
input: Tree T, Grammar G
output: “yes”, if T ∈ L(G); “no”, otherwise
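As a stand-in for the MatchAlgo subroutine, the sketch below rewrites a content model in the slide notation into a Python regular expression and lets the standard `re` module do the matching. This is a shortcut chosen for illustration, not the automaton construction mentioned above; the function name `matches` and the encoding of each non-terminal N as the token `<N>` are assumptions of this sketch.

```python
import re

def matches(w, model):
    """Does the string of non-terminals w (a list) match the content model?

    The model uses "," for sequence, "|" for choice, "*", "+", "?" and
    parentheses as on the slides, and "ϵ" for the empty string.
    """
    parts = []
    # commas and blanks are skipped: sequence is plain concatenation in `re`
    for tok in re.findall(r"[A-Za-z][A-Za-z0-9_^-]*|[()|*+?ϵ]", model):
        if tok == "ϵ":
            parts.append("(?:)")                     # matches the empty string
        elif tok in ("(", ")", "|", "*", "+", "?"):
            parts.append(tok)                        # same meaning in `re`
        else:
            parts.append("(?:<" + re.escape(tok) + ">)")  # one non-terminal
    encoded = "".join("<%s>" % n for n in w)
    return re.fullmatch("".join(parts), encoded) is not None

print(matches(["B", "B"], "B,B*"))     # two B children against B,B*
print(matches(["C"], "(C,C)|C"))       # one C child against (C,C)|C
print(matches([], "ϵ|C"))              # no children against ϵ|C
```

All three calls correspond to checks of the form w ∈ L(e); a validator only needs the yes/no answer.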
Input: DOM Tree for T, local tree grammar G = (N, Σ, S, P),
NT is a stack of strings of non-terminals (to store NTs of child nodes)
R is a stack of production rules
Traverse T in a depth-first, left-to-right manner
When an element E is visited on way down,
  if there is a production rule N → a e in P with a = E’s tag name (locality: at most one such rule)
  then push N → a e onto R (store rule for E’s content in R)
       and push ϵ onto NT (start remembering E’s child nodes)
  else report “not accepted” and stop
When an element E is visited on way up,
  pop a rule N → a e out of R (retrieve rule for E’s content from R)
  pop a string of non-terminals w out of NT (retrieve E’s child nodes)
  if w matches e
  then pop a string w’ of non-terminals out of NT and push w’N onto NT (add E’s non-terminal to its predecessor siblings)
  else report “not accepted” and stop
report “accepted” and stop
76
ValAlgo: input XML doc/Tree T, local Grammar G; output “yes”, if T ∈ L(G), “no”, otherwise
See the paper by Murata, Lee, Mani, Kawaguchi
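The algorithm above can be sketched in Python on top of `xml.etree.ElementTree`. Several things here are assumptions of the sketch rather than part of the slide: the rule table is keyed by tag name (possible because G is local), text nodes and attributes are ignored, NT starts with one empty string that collects the root’s non-terminal, the final check that this non-terminal is a start symbol is an addition, and `matches` is a simple `re`-based stand-in for MatchAlgo.

```python
import re
import xml.etree.ElementTree as ET

def matches(w, model):
    """Stand-in for MatchAlgo: does the list of non-terminals w match model?"""
    parts = []
    for tok in re.findall(r"[A-Za-z][A-Za-z0-9_^-]*|[()|*+?ϵ]", model):
        if tok == "ϵ":
            parts.append("(?:)")
        elif tok in ("(", ")", "|", "*", "+", "?"):
            parts.append(tok)
        else:
            parts.append("(?:<" + re.escape(tok) + ">)")
    return re.fullmatch("".join(parts), "".join("<%s>" % n for n in w)) is not None

def validate_local(root, rules, start):
    """rules maps a tag name a to (N, e) for the unique rule N → a e."""
    R = []        # stack of production rules
    NT = [[]]     # stack of strings of non-terminals (bottom: root level)
    todo = [(root, False)]            # (element, are we on the way up?)
    while todo:
        elem, going_up = todo.pop()
        if not going_up:              # E is visited on the way down
            rule = rules.get(elem.tag)
            if rule is None:
                return False          # "not accepted"
            R.append(rule)
            NT.append([])             # push ϵ
            todo.append((elem, True))
            for child in reversed(list(elem)):   # keep left-to-right order
                todo.append((child, False))
        else:                         # E is visited on the way up
            nonterminal, model = R.pop()
            w = NT.pop()
            if not matches(w, model):
                return False          # "not accepted"
            NT[-1].append(nonterminal)           # pop w', push w'N
    return NT[0][0] in start          # extra check, not on the slide

# G = ({S,B,C}, {a,b,c}, {S}, P) with P = {S → a B,B*, B → b (C,C)|C, C → c ϵ|C}
rules = {"a": ("S", "B,B*"), "b": ("B", "(C,C)|C"), "c": ("C", "ϵ|C")}
doc = ET.fromstring("<a><b><c/><c/></b><b><c/></b></a>")
print(validate_local(doc, rules, {"S"}))
```

The explicit `todo` stack only drives the depth-first traversal; `R` and `NT` play exactly the roles described in the algorithm.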
Stacks and tree traversal, observations
• our algorithm visits a tree in a depth-first, left-to-right manner
• whenever we visit a node on our way
– down, we push relevant information for this node on stacks
– up, we pop relevant information for this node from stacks
• hence, whenever we are at a node n during this traversal, all relevant information regarding all ancestors of n is (in reverse order) on our stacks
77
• Let’s see how the algorithm works:
– G = ({S,B,C},{a,b,c},{S},P) with P = { S → a B,B*, B → b (C,C)|C, C → c ϵ|C }
– T = a( b( c, c ), b( c ) )
– below, the stack of rules R and the stack of NT strings NT are listed top first
start: R = [ ], NT = [ ]
78
a on way down: R = [ S → a B,B* ], NT = [ ϵ ]
79
1st b on way down: R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; ϵ ]
80
1st c on way down: R = [ C → c ϵ|C ; B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; ϵ ; ϵ ]
81
1st c on way up: popped C → c ϵ|C and w = ϵ; yes, ϵ ∈ L(ϵ|C); R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; ϵ ]
82
1st c on way up, continued: popped w’ = ϵ, pushed w’C = C; R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ C ; ϵ ]
83
2nd c on way down: R = [ C → c ϵ|C ; B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; C ; ϵ ]
84
2nd c on way up: popped C → c ϵ|C and w = ϵ; yes, ϵ ∈ L(ϵ|C); R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ C ; ϵ ]
85
2nd c on way up, continued: popped w’ = C, pushed w’C = CC; R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ CC ; ϵ ]
86
1st b on way up: popped B → b (C,C)|C and w = CC; yes, CC ∈ L((C,C)|C); R = [ S → a B,B* ], NT = [ ϵ ]
87
1st b on way up, continued: popped w’ = ϵ, pushed w’B = B; R = [ S → a B,B* ], NT = [ B ]
88
2nd b on way down: R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; B ]
89
3rd c on way down: R = [ C → c ϵ|C ; B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; ϵ ; B ]
90
3rd c on way up: popped C → c ϵ|C and w = ϵ; yes, ϵ ∈ L(ϵ|C); R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ ϵ ; B ]
91
3rd c on way up, continued: popped w’ = ϵ, pushed w’C = C; R = [ B → b (C,C)|C ; S → a B,B* ], NT = [ C ; B ]
92
2nd b on way up: popped B → b (C,C)|C and w = C; yes, C ∈ L((C,C)|C); R = [ S → a B,B* ], NT = [ B ]
93
2nd b on way up, continued: popped w’ = B, pushed w’B = BB; R = [ S → a B,B* ], NT = [ BB ]
94
a on way up: popped S → a B,B* and w = BB; yes, BB ∈ L(B,B*); R = [ ], NT = [ ]
95
“accepted” (“yes”), T ∈ L(G)
report “accepted” and stop ☜ Check slide 74
96
Validating trees against tree grammars
• want to implement this algorithm?
– walk the DOM tree in a depth-first, left-to-right way, or
– use a SAX parser and do it in a streaming fashion
• no need to keep whole tree in memory
• validate-while-u-parse!
• ...and we can use this algorithm for general DTDs!
• ...next week, we’ll see how this works for
– single-type tree grammars (and WXS)
• rather straightforward because we still only have at most one run of our tree grammar on the input tree
– general tree grammars (and Relax NG)…
– ...all validate-while-u-parse!
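The streaming variant for local grammars can be sketched with Python’s `xml.sax`: `startElement` plays the role of “visited on way down” and `endElement` the role of “visited on way up”. The class and function names, the `NotAccepted` exception, the rule table keyed by tag name, the final start-symbol check and the `re`-based `matches` helper are all assumptions of this sketch; text and attributes are ignored, and malformed XML is left to raise the parser’s own exception.

```python
import re
import xml.sax

class NotAccepted(Exception):
    """Raised by the handler to stop parsing: "not accepted"."""

def matches(w, model):
    """Stand-in for MatchAlgo: does the list of non-terminals w match model?"""
    parts = []
    for tok in re.findall(r"[A-Za-z][A-Za-z0-9_^-]*|[()|*+?ϵ]", model):
        if tok == "ϵ":
            parts.append("(?:)")
        elif tok in ("(", ")", "|", "*", "+", "?"):
            parts.append(tok)
        else:
            parts.append("(?:<" + re.escape(tok) + ">)")
    return re.fullmatch("".join(parts), "".join("<%s>" % n for n in w)) is not None

class LocalValidator(xml.sax.ContentHandler):
    def __init__(self, rules):
        super().__init__()
        self.rules = rules   # tag name a -> (N, e) for the unique rule N → a e
        self.R = []          # stack of production rules
        self.NT = [[]]       # stack of strings of non-terminals

    def startElement(self, name, attrs):      # way down
        rule = self.rules.get(name)
        if rule is None:
            raise NotAccepted(name)
        self.R.append(rule)
        self.NT.append([])                    # push ϵ

    def endElement(self, name):               # way up
        nonterminal, model = self.R.pop()
        w = self.NT.pop()
        if not matches(w, model):
            raise NotAccepted(name)
        self.NT[-1].append(nonterminal)       # pop w', push w'N

def validate_stream(xml_bytes, rules, start):
    handler = LocalValidator(rules)
    try:
        xml.sax.parseString(xml_bytes, handler)
    except NotAccepted:
        return False
    return handler.NT[0][0] in start          # extra check, not on the slide

rules = {"a": ("S", "B,B*"), "b": ("B", "(C,C)|C"), "c": ("C", "ϵ|C")}
print(validate_stream(b"<a><b><c/><c/></b><b><c/></b></a>", rules, {"S"}))
```

Only the two stacks are kept in memory, never the whole tree, which is the point of validate-while-u-parse.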
97