Spring 2005 Daria Barger – DB Seminar 1
Efficient Incremental Validation of XML Documents
Denilson Barbosa, Alberto O. Mendelzon, Leonid Libkin, Laurent Mignet, Marcelo Arenas
Presented by Daria Barger
Outline
Introduction
Types of constraints
Update operations
Incremental validation
Experiments
Conclusions
Future work
Introduction
The problems of storing and querying XML documents have attracted a great deal of interest.
Other aspects of XML data management, however, have not yet been satisfactorily explored.
Among them is the problem of checking that documents are valid with respect to their specifications, and that they remain valid after updates.
DTD
One popular form of XML document specification is the Document Type Definition (DTD).
A DTD D is a grammar that defines a set of documents L(D).
Each document in L(D) is said to be valid with respect to D.
The Validation Problem
The validation problem is: given a DTD D and an XML document X, is it the case that X ∈ L(D)?
The incremental validation problem is: let U be some update operation. Given X ∈ L(D), is it the case that U(X) ∈ L(D)?
Validation of structural constraints
Elements are declared in a DTD by rules of the form <!ELEMENT e c>, where e is the element name and c is its content model.
<?xml version="1.0"?>
<!ELEMENT db (person*)>
<!ELEMENT person (name, dep, email, tel*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT dep (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
Content model is a regular expression E over element names: the element is valid iff the string formed by concatenating the names of its child elements belongs to L(E), the language denoted by E.
Content model is #PCDATA: validation can be done trivially.
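A minimal sketch of the structural check, assuming the example DTD above. The content models are translated by hand into Python regular expressions over a space-terminated sequence of child tag names; that encoding and the names `CONTENT_MODELS`, `PCDATA` and `is_valid` are illustrative assumptions, and Python's `re` engine is a backtracking matcher, not the automaton used in the talk.

```python
import re
import xml.etree.ElementTree as ET

# Content models of the example DTD, hand-translated to regexes over
# "tag tag tag " strings (every child tag is followed by one space).
CONTENT_MODELS = {
    "db": r"(person )*",
    "person": r"name dep email (tel )*",
}
PCDATA = {"name", "dep", "email", "tel"}  # elements declared as (#PCDATA)


def is_valid(elem):
    """Return True if elem and all of its descendants match their content model."""
    children = list(elem)
    if elem.tag in PCDATA:
        return not children  # text only: no child elements allowed
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False  # element not declared in the DTD
    word = "".join(child.tag + " " for child in children)
    if re.fullmatch(model, word) is None:
        return False
    return all(is_valid(child) for child in children)


good = ET.fromstring(
    "<db><person><name>A</name><dep>CS</dep><email>a@x</email>"
    "<tel>1</tel><tel>2</tel></person></db>"
)
bad = ET.fromstring("<db><person><name>A</name><email>a@x</email></person></db>")
```

Here `is_valid(good)` holds, while `bad` is rejected because its person element has no dep child.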
Validation of attributes
Attribute validation is trivial, except for the ID and IDREF attribute types.
A valid XML document must satisfy:
1. The values of all ID attributes are unique.
2. The value of each IDREF attribute is equal to the value of some ID attribute.
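The two conditions can be sketched as a single pass over the document. The attribute names `id` and `ref` are assumptions made for this example; in a real DTD the ID and IDREF attributes are whichever ones the attribute declarations mark as such.

```python
import xml.etree.ElementTree as ET


def check_id_idref(root, id_attr="id", idref_attr="ref"):
    """Full (non-incremental) check of the ID and IDREF conditions."""
    ids = set()
    refs = []
    for elem in root.iter():
        if id_attr in elem.attrib:
            value = elem.attrib[id_attr]
            if value in ids:
                return False  # condition 1 violated: duplicate ID value
            ids.add(value)
        if idref_attr in elem.attrib:
            refs.append(elem.attrib[idref_attr])
    # Condition 2: every IDREF must name an ID that occurs somewhere.
    return all(ref in ids for ref in refs)


ok = ET.fromstring('<db><p ref="2"/><p id="1"/><p id="2" ref="1"/></db>')
duplicate = ET.fromstring('<db><p id="1"/><p id="1"/></db>')
dangling = ET.fromstring('<db><p id="1" ref="9"/></db>')
```

References are collected first and resolved at the end, so an IDREF may point to an element that appears later in the document.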
1-unambiguous regular expressions
The specification of XML DTDs restricts the regular expressions used for defining element content to be 1-unambiguous (deterministic).
Marking: subscript the symbols of E to obtain E', in which every symbol occurrence is distinct (in the example, an E containing a, b, b, c becomes an E' containing a, b1, b2, c).
Position: a subscripted symbol in E'. For a given position x, χ(x) denotes the corresponding (unmarked) symbol in Σ.
For example: pos(E') = {a, b1, b2, c}, χ(b1) = b.
1-unambiguous regular expressions (2)
A regular expression E is 1-unambiguous if and only if for all words u, v, w over the subscripted alphabet pos(E) and all x, y in pos(E), the conditions uxv, uyw ∈ L(E') and x ≠ y imply χ(x) ≠ χ(y).
Which regular expression is deterministic?
– (ab)|(ac)
– a(b|c)
– a(a+b)*ac
The Glushkov automaton for Regular Expressions
$G_E = (Q, \Sigma, \delta, q_I, F)$, where:
1. $Q = pos(E) \cup \{q_I\}$
2. For $a \in \Sigma$, let $\delta(q_I, a) = \{x \mid x \in first(E),\ \chi(x) = a\}$
3. For $x \in pos(E)$, $a \in \Sigma$, let $\delta(x, a) = \{y \mid y \in follow(E, x),\ \chi(y) = a\}$
4. $F = last(E) \cup \{q_I\}$ if $\varepsilon \in L(E)$, and $F = last(E)$ otherwise
first(E): set of positions that appear as the first symbol of some word in L(E')
last(E): set of positions that appear as the last symbol of some word in L(E')
follow(E,x): set of positions that appear immediately after position x in some word in L(E')
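A rough sketch of how the sets first, last and follow can be computed, and how they give a determinism test: E is deterministic when no two distinct positions in first(E), or in any follow(E,x), carry the same symbol. The tuple encoding of expressions and the function names are assumptions of this example, and positions are numbered 1, 2, ... instead of being written as subscripted symbols.

```python
from itertools import count


def glushkov(expr):
    """Return (chi, nullable, first, last, follow) for a regex syntax tree.

    Tree nodes (an assumed encoding): ("sym", a), ("cat", l, r),
    ("alt", l, r), ("star", e).
    """
    chi, follow = {}, {}
    ids = count(1)

    def walk(node):
        kind = node[0]
        if kind == "sym":
            x = next(ids)  # marking: each occurrence gets its own position
            chi[x] = node[1]
            follow[x] = set()
            return False, {x}, {x}
        if kind == "star":
            _, first, last = walk(node[1])
            for x in last:  # the body may repeat: last positions loop back
                follow[x] |= first
            return True, first, last
        n1, f1, l1 = walk(node[1])
        n2, f2, l2 = walk(node[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        for x in l1:  # concatenation: the right part follows the left part
            follow[x] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last

    nullable, first, last = walk(expr)
    return chi, nullable, first, last, follow


def is_deterministic(expr):
    chi, _, first, _, follow = glushkov(expr)

    def distinct_symbols(positions):
        symbols = [chi[x] for x in positions]
        return len(symbols) == len(set(symbols))

    return distinct_symbols(first) and all(
        distinct_symbols(f) for f in follow.values()
    )


def sym(a):
    return ("sym", a)


E1 = ("alt", ("cat", sym("a"), sym("b")), ("cat", sym("a"), sym("c")))  # (ab)|(ac)
E2 = ("cat", sym("a"), ("alt", sym("b"), sym("c")))                     # a(b|c)
# a(a+b)*ac, reading "+" as union (an assumption about the slide's notation)
E3 = ("cat", ("cat", ("cat", sym("a"), ("star", ("alt", sym("a"), sym("b")))),
              sym("a")), sym("c"))
```

Under these assumptions only a(b|c) passes the test: in (ab)|(ac) both first positions carry a, and in a(a+b)*ac the first a can be followed by two different positions that both carry a.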
Update operations
Append(p,y) - insert element y as the last child of element p.
[Figure: a document tree before and after Append; y becomes the last child of p.]
Update operations (2)
InsertBefore(x,y) – insert element y as the immediate left sibling of element x. (This operation is not defined if x is the root of the document.)
[Figure: a document tree before and after InsertBefore; y becomes the immediate left sibling of x.]
Update operations (3)
Delete(x) – delete element x from the document. Note that if x is the root of the document, the operation is trivially valid.
[Figure: a document tree before and after Delete(x); x is removed from the tree.]
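The three update operations can be sketched on a plain in-memory tree. The `Node` class and the function names are illustrative assumptions; no validation is performed here, only the structural change.

```python
class Node:
    def __init__(self, tag):
        self.tag = tag
        self.parent = None
        self.children = []


def append(p, y):
    """Append(p, y): y becomes the last child of p."""
    y.parent = p
    p.children.append(y)


def insert_before(x, y):
    """InsertBefore(x, y): y becomes the immediate left sibling of x."""
    if x.parent is None:
        raise ValueError("InsertBefore is not defined for the root")
    siblings = x.parent.children
    siblings.insert(siblings.index(x), y)
    y.parent = x.parent


def delete(x):
    """Delete(x): detach x, together with its subtree, from the document."""
    if x.parent is not None:
        x.parent.children.remove(x)
        x.parent = None


root, name, email = Node("person"), Node("name"), Node("email")
append(root, name)
append(root, email)
insert_before(email, Node("dep"))
tags_after_insert = [c.tag for c in root.children]
delete(name)
tags_after_delete = [c.tag for c in root.children]
```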
Observation
The incremental validation concerns only the content of the element where the update takes place. For example, after an Append(p,y) operation only the content of p needs to be revalidated.
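The approach
Together with the i-th child of p we store the value of $\hat{\delta}(w_1 \ldots w_i)$, i.e. the state reached after reading the first i children, for the automaton that validates the content model of p.
This requires auxiliary storage of size O(n log d), where n is the size of the XML document and d is the size of the DTD.
[Figure: element p with children $w_1, w_2, w_3, \ldots, w_k$, annotated with $\hat{\delta}(w_1)$, $\hat{\delta}(w_1, w_2)$, $\hat{\delta}(w_1, w_2, w_3)$, ..., $\hat{\delta}(w_1 \ldots w_k)$.]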
Append at the end
Append(p,y) operation:
If $\delta(\hat{\delta}(w_1 \ldots w_k), y) \in F$ then the operation succeeds.
Cost: $O(\log n + \log d)$ time.
[Figure: element p with children $w_1, \ldots, w_k$ and their stored states, with y appended after $w_k$.]
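A small sketch of the stored-state idea for Append, assuming a hand-built deterministic automaton for the content model (name, dep, email, tel*). The names `Content`, `DELTA`, `FINAL` and `Q_I` are illustrative. Transitions live in a Python dict here, so the complexity figures quoted in the talk, which refer to its own storage structures, do not carry over.

```python
Q_I = "q_I"  # initial state
# Hand-built automaton for the content model (name, dep, email, tel*).
DELTA = {
    (Q_I, "name"): "name",
    ("name", "dep"): "dep",
    ("dep", "email"): "email",
    ("email", "tel"): "tel",
    ("tel", "tel"): "tel",
}
FINAL = {"email", "tel"}


class Content:
    """Children of one element p, each stored with the state reached after it."""

    def __init__(self, tags):
        self.tags, self.states = [], []
        state = Q_I
        for tag in tags:  # one full validation when the content is loaded
            state = DELTA[(state, tag)]
            self.tags.append(tag)
            self.states.append(state)
        if state not in FINAL:
            raise ValueError("initial content is not valid")

    def try_append(self, tag):
        """Append(p, y): one transition from the state stored with the last child."""
        last = self.states[-1] if self.states else Q_I
        nxt = DELTA.get((last, tag))
        if nxt is None or nxt not in FINAL:
            return False  # update rejected, content left unchanged
        self.tags.append(tag)
        self.states.append(nxt)
        return True


person = Content(["name", "dep", "email"])
first_try = person.try_append("tel")    # a further tel is allowed
second_try = person.try_append("name")  # name cannot follow tel
```

Only the state stored with the last child is consulted, so the existing children are never re-read.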
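Arbitrary insertions and deletions
Delete(x) operation, where x is the i-th child $w_i$ of p and $w = w_1 \ldots w_k$:
We should revalidate $w_{i+1} \ldots w_k$, starting from the stored state $\hat{\delta}(w_1 \ldots w_{i-1})$.
Problem: complexity. The time is linear in |w|, apart from logarithmic terms in n and d, and |w| = O(n).
[Figure: element p with children $w_1, \ldots, w_i, \ldots, w_k$ and their stored states $\hat{\delta}(w_1), \hat{\delta}(w_1, w_2), \ldots, \hat{\delta}(w_1 \ldots w_i), \ldots, \hat{\delta}(w_1 \ldots w_k)$.]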
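The naive deletion step can be sketched as follows: resume the automaton from the state stored with the left sibling of the deleted child and re-run it over the remaining siblings. The function name and the list-based storage are assumptions; the sample automaton is the Glushkov automaton of E = a(b*|cb*), written out by hand with the two b positions named b1 and b2.

```python
def delete_child(tags, states, i, delta, final, q_i):
    """Try to delete child i (0-based) of an element.

    tags[j] is the j-th child and states[j] the state stored with it.
    Returns the new (tags, states) if the content stays valid, else None.
    """
    state = states[i - 1] if i > 0 else q_i  # state before the deleted child
    new_states = states[:i]
    for tag in tags[i + 1:]:  # revalidate every child to the right
        state = delta.get((state, tag))
        if state is None:
            return None  # no transition: deletion would break validity
        new_states.append(state)
    if state not in final:
        return None
    return tags[:i] + tags[i + 1:], new_states


# Glushkov automaton of E = a(b1*|c b2*), written out by hand.
DELTA = {
    ("q_I", "a"): "a",
    ("a", "b"): "b1",
    ("a", "c"): "c",
    ("b1", "b"): "b1",
    ("c", "b"): "b2",
    ("b2", "b"): "b2",
}
FINAL = {"a", "b1", "c", "b2"}

tags = ["a", "c", "b", "b"]
states = ["a", "c", "b2", "b2"]
after_deleting_c = delete_child(tags, states, 1, DELTA, FINAL, "q_I")
after_deleting_a = delete_child(tags, states, 0, DELTA, FINAL, "q_I")
```

Deleting c succeeds but changes the state stored with every b from b2 to b1, which is why the loop may have to touch all right siblings. Deleting a is rejected because no word of L(E) starts with c.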
1,2 Conflict Free Regular Expression
Possible solution: if $\hat{\delta}(w_1 \ldots w_{i-1} w_{i+1}) = \hat{\delta}(w_1 \ldots w_{i-1} w_i w_{i+1})$, the states stored for the remaining children are unchanged and no further revalidation is needed.
This condition does not always hold, e.g.:
Let's consider E = a(b1*|cb2*).
w = acb…b. All b's match state b2.
Delete c from w, obtaining w' = ab…b.
Now all b's match state b1.
We should revalidate the entire string.
Definition of 1,2 Conflict-free
Let E be a regular expression over alphabet Σ.
follow(E,x): the set of positions in E that can follow x in some path through E.
Define $follow_2(E, x) = \{\, y \in pos(E) \mid \exists z \in follow(E, x) \text{ such that } y \in follow(E, z) \,\}$.
E is a 1,2 conflict-free regular expression if:
1) E is deterministic
2) For every $x, y, z \in pos(E)$, if $y \in follow(E, x)$ and $z \in follow_2(E, x)$, then $\chi(y) = \chi(z)$ implies $y = z$
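Condition 2 can be checked directly from the follow sets. The sketch below assumes the follow sets and the map χ are given as plain dictionaries, written out by hand for two expressions; the function names are illustrative, and condition 1 (determinism) is not checked here.

```python
def follow2(follow, x):
    """Positions reachable from x in exactly two follow steps."""
    return {y for z in follow[x] for y in follow[z]}


def conflicts(follow, chi):
    """Return all triples (x, y, z) that violate condition 2."""
    bad = []
    for x in follow:
        for y in follow[x]:
            for z in follow2(follow, x):
                if chi[y] == chi[z] and y != z:
                    bad.append((x, y, z))
    return bad


# E = a(b1*|c b2*): positions a, b1, c, b2.
FOLLOW_E = {"a": {"b1", "c"}, "b1": {"b1"}, "c": {"b2"}, "b2": {"b2"}}
CHI_E = {"a": "a", "b1": "b", "c": "c", "b2": "b"}

# Content model (name, dep, email, tel*): every symbol occurs once.
FOLLOW_P = {"name": {"dep"}, "dep": {"email"}, "email": {"tel"}, "tel": {"tel"}}
CHI_P = {p: p for p in FOLLOW_P}
```

For a(b*|cb*) the check reports the triple (a, b1, b2): b1 can follow a directly, b2 can follow a after one intermediate position, and both carry the symbol b. The person content model has no such triple.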
Restricted forms of DTD
1,2 Conflict Free DTD: there is no “flipping” between automaton states after the update. The per-update complexity for a 1,2 Conflict Free DTD is O(log n + log d) time and O(n log d) auxiliary space.
Conflict-free DTD: no repeated symbols. The per-update complexity is O(log n + log d) time and constant auxiliary space.
Incremental validation of ID and IDREF for adding element
Append(p,y) and InsertBefore(x,y) operations require checking that no two ID attributes have the same value and that every IDREF attribute in y refers to an ID value that exists in the document.
The complexity: O(|y| log n) time and linear auxiliary space, where |y| is the size of the added subtree.
Incremental validation of ID and IDREF for deleting element
For a Delete(x) operation we have to check that the subtree rooted at x contains no node whose ID attribute is referenced by some node that is not a descendant of x.
[Figure: example with nodes a, b, c.]
Checking the reference counter on delete requires O(log n) time.
Updating the reference counters when inserting or removing an IDREF attribute requires O(h log n) time.
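One possible reading of the reference-counter idea, as a sketch: keep the set of ID values and, for each ID, the number of IDREF attributes pointing at it. The class `IdIndex` and its interface (lists of ID and IDREF values collected from the affected subtree) are assumptions of this example, and Python hash tables stand in for whatever index the time bounds above refer to.

```python
from collections import Counter


class IdIndex:
    def __init__(self):
        self.ids = set()
        self.refcount = Counter()  # ID value -> number of IDREFs pointing at it

    def insert(self, ids, refs):
        """ids / refs: ID and IDREF values found in the inserted subtree."""
        if len(set(ids)) != len(ids) or self.ids & set(ids):
            return False  # some ID value would occur twice
        if any(r not in self.ids and r not in ids for r in refs):
            return False  # dangling IDREF
        self.ids |= set(ids)
        self.refcount.update(refs)
        return True

    def delete(self, ids, refs):
        """ids / refs: ID and IDREF values found in the deleted subtree."""
        internal = Counter(r for r in refs if r in ids)
        # An ID may only disappear if every reference to it disappears too.
        if any(self.refcount[i] > internal[i] for i in ids):
            return False
        self.ids -= set(ids)
        self.refcount.subtract(refs)
        for i in ids:
            del self.refcount[i]
        return True


index = IdIndex()
steps = [
    index.insert(["a"], []),     # new element with ID a
    index.insert(["b"], ["a"]),  # new element with ID b that references a
    index.delete(["a"], []),     # rejected: a is still referenced by b
    index.insert(["a"], []),     # rejected: ID a already exists
    index.delete(["b"], ["a"]),  # fine: nothing references b
    index.delete(["a"], []),     # fine now: the reference from b is gone
]
```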
Valid Insertion
[Plot: time in microseconds (logarithmic axis, 100 to 1e+08) against document size (64K, 512K, 4M, 32M, 256M, 2G) for five methods: Incr CF, Incr 1.2 CF, Incr Arb, Full Arb, Full CF.]
Valid Deletion
[Plot: time in microseconds (logarithmic axis, 100 to 1e+08) against document size (64K, 512K, 4M, 32M, 256M, 2G) for five methods: Incr CF, Incr 1.2 CF, Incr Arb, Full Arb, Full CF.]
Invalid Deletion
[Plot: time in microseconds (logarithmic axis, 10 to 1000) against document size (64K, 512K, 4M, 32M, 256M, 2G) for five methods: Incr CF, Incr 1.2 CF, Incr Arb, Full Arb, Full CF.]
Conclusions
1. Handled insertion and deletion of subtrees (not leaf nodes only).
2. Validated ID and IDREF attributes.
3. Characterized a class of DTDs, appearing to capture most real-life DTDs, that admits a logarithmic-time, constant-space incremental validation algorithm.
4. Conducted experiments showing that the method is practical for large data documents and behaves much better than full revalidation.
Future Work
Handling complex updates involving several insertions and deletions as a single transaction.