Representing and Querying XML with Incomplete Information Serge Abiteboul INRIA Luc Segoufin INRIA...
-
Upload
miguel-lynch -
Category
Documents
-
view
218 -
download
1
Transcript of Representing and Querying XML with Incomplete Information Serge Abiteboul INRIA Luc Segoufin INRIA...
Representing and QueryingXMLwith Incomplete Information
Serge AbiteboulINRIA
Luc SegoufinINRIA
Victor VianuUCSD
pods 2001 Abiteboul-Segoufin-Vianu 2
Organization
• Motivations• Simplifying assumptions• Model of incompleteness• Answering queries• Results• Discussion• Conclusion
Motivations
pods 2001 Abiteboul-Segoufin-Vianu 4
The Web is a world of incompleteness
• Information you get from the web is seldom complete:• Queries return you some - not all - data • Limited storage capability• Documents change on the Web:
expiration• Sites are unavailable…
• Context: A warehouse of XML documents from the Web, Xyleme
pods 2001 Abiteboul-Segoufin-Vianu 5
This work
• This work: simple, practically appealing approach to managing incomplete information
• Sequence of queries to the web • (q1,A1)+(q2,A2)+… • Answers are cached
• Process a new query without access to the web• Give an incomplete answer• Explain incompleteness to user • Seek additional information, i.e., find minimal set
of queries to fully answer
pods 2001 Abiteboul-Segoufin-Vianu 6
Related works
• Semantic caching• Answering queries using views
• keep (Qi,Ai)
• try to rewrite query Q into Q’(A1,...,An)
• reject if you cannot
• Incomplete database • (Qi,Ai) is some incomplete knowledge of DB
• Related to querying incomplete information – e.g. Lipski-Imielinski
pods 2001 Abiteboul-Segoufin-Vianu 7
Challenge: balance expressiveness and tractability
• Choice of data model• Choice of the query language• Choice of a representation of
incompleteness
• Results• Simple, practical solution• Extra features lead to serious problems
Simplifying Assumptions
pods 2001 Abiteboul-Segoufin-Vianu 9
Data is XML: trees
dealer
UsedCars NewCars
ad ad
model year model
<dealer> <UsedCars> <ad> <model>Honda</model> <year>96</year> </ad> </UsedCars> <NewCars> <ad> <model>Acura</model> </ad> </NewCars></dealer>
Honda 96 Acura
pods 2001 Abiteboul-Segoufin-Vianu 10
Simplified XML
=can =444 =electronique=can =444 =electronique=nik =234 =electronic=nik =234 =electronic
=camera=camera=camera=camera
=c.jpg=c.jpg
value functionvalue function
unordered treesunordered trees
name price cat picturename price cat picture
catalogcatalog
productproduct
subcategorysubcategory
productproduct
name price categoryname price category
subcategorysubcategory
labelling functionlabelling function
pods 2001 Abiteboul-Segoufin-Vianu 11
Simple XML types
catalogcatalog
productproduct
name price cat picturename price cat picture
subcategorysubcategory
**
**
1 : 1 child (default)1 : 1 child (default)* : 0 or more* : 0 or more+ : 1 or more+ : 1 or more? : 0 or 1? : 0 or 1
pods 2001 Abiteboul-Segoufin-Vianu 12
Prefix Selection Queries (ps-queries)
catalogcatalog
productproduct
name price cat=elecname price cat=elec
subcategorysubcategory
<200<200
Query1Query1catalogcatalog
productproduct
name name
Query2Query2
picturepicture
pods 2001 Abiteboul-Segoufin-Vianu 13
Simplifications
Data
• No order• No distinction
attribute/element• No recursion• No links
Query
• No complex path expressions
• No join• No repeated child
productproduct
name cat=elec cat=toyname cat=elec cat=toy
NONO
pods 2001 Abiteboul-Segoufin-Vianu 14
Crucial assumption: XID
prodprod
canon 120 eleccanon 120 elec
cameracamera
&245&245 prodprod&245&245
c.jpgc.jpg
++c.jpgc.jpg
prodprod
canon 120 eleccanon 120 elec
&245&245
cameracamera
==
• URLsURLs• ID/IDrefsID/IDrefs
Representation of incomplete information:
Incomplete trees
pods 2001 Abiteboul-Segoufin-Vianu 16
Document Type Definition (DTD) are used to represent incompleteness
• Set of rules: e r• e element name• r regular expression• Set of trees satisfying a
DTD d: tree(d)• Shortcoming of DTDs
• An element has a single definition independently of the context
• Type of ad depends on the context
dealerdealer
newxarnewxarusedcarusedcar
adadadad
modelmodel yearyear modelmodel
pods 2001 Abiteboul-Segoufin-Vianu 17
Solution: specialization (decoupled tags)
• adused and adnew h(adused)=h(adnew )=ad
dealerdealer
newxarnewxarusedcarusedcar
adadnewadadused
modelmodel yearyear modelmodel
dealerdealer
newxarnewxarusedcarusedcar
adadadad
modelmodel yearyear modelmodel
hh
pods 2001 Abiteboul-Segoufin-Vianu 18
DTDs + Specialization
The sets of trees that can be specified: the regular unranked tree languages [Bruggeman—Klein+Murata+Wood]
• Same closure properties: intersection, union, complement
• Same complexity
pods 2001 Abiteboul-Segoufin-Vianu 19
Example
Q1: name, subcat, price of electronic products with price Q1: name, subcat, price of electronic products with price less than $200less than $200
Q2: name, pictures of cameras at least pictured onceQ2: name, pictures of cameras at least pictured once
--------------------------------------------------------
Q3: name, price, pictures of cameras costing less than Q3: name, price, pictures of cameras costing less than $100 and at least pictured once$100 and at least pictured once
can be can be completelycompletely answered using A1, A2 answered using A1, A2
Q4: list all camerasQ4: list all cameras
can be can be partiallypartially answered using A1, A2 answered using A1, A2
pods 2001 Abiteboul-Segoufin-Vianu 20
catalogcatalog
cdplayercdplayer
productproduct
canon 120 eleccanon 120 elec
cameracamera
productproduct
nikon 199 elecnikon 199 elec
cameracamera
productproduct
sony 175 elecsony 175 elec
product1product1 product2product2
****
Q1: name, subcat, price of electronic products with price less than 200Q1: name, subcat, price of electronic products with price less than 200
missingmissing
pods 2001 Abiteboul-Segoufin-Vianu 21
Missing data after Q1
product1product1
name price cat picturename price cat picture
subcategorysubcategory
**
product2product2
name price cat picturename price cat picture
subcategorysubcategory
**
!=elec!=elec =elec=elec>200>200
pods 2001 Abiteboul-Segoufin-Vianu 22
catalogcatalog
productproduct
canon 120 eleccanon 120 elec
cameracamera
productproduct
nikon 199 elecnikon 199 elec
cameracamera
productproduct
sony 175 elecsony 175 elec
cdplayercdplayer
product2aproduct2a
Q2: name, pictures of cameras at least pictured onceQ2: name, pictures of cameras at least pictured once
product1product1
missingmissingproduct2cproduct2c
product2product2**
** product2bproduct2b**
c.jpgc.jpg akai a.jpg elecakai a.jpg elec
cameracamera
3333
pods 2001 Abiteboul-Segoufin-Vianu 23
Incomplete information
• Known information• Prefix of the real data tree
• Missing information• Extended tree type• Conditions on data values• Specializations, disjunctions
pods 2001 Abiteboul-Segoufin-Vianu 24
product1product1
name price name price catcat picture picture
subcategorysubcategory
**
!=elec!=elec
product2product2aa
name name priceprice catcat picture picture
subcategorysubcategory
=elec=elec>200>200
name price name price catcat
product3product3
elecelecproduct2product2bb
namename priceprice catcat picturepicture
**
=elec=elec>200>200
product2product2cc
namename priceprice catcat
subcategorysubcategory
=elec=elec>200>200
subcategorysubcategory!=camera!=camera
subcategorysubcategory!=camera!=camera
no pictureno picture
no pictureno picture
product +product +
Known data
Missing data
Answering Queries
pods 2001 Abiteboul-Segoufin-Vianu 26
Complete answer to Q3
• Q3: name, price, Q3: name, price, pictures of cameras pictures of cameras costing less than $150 costing less than $150 and having at least one and having at least one picturepicture
• Can be fully answered Can be fully answered using available using available informationinformation
• Need to check whether Need to check whether answer is completeanswer is complete
catalogcatalog
prodprod
canon 120 canon 120 c.jpgc.jpg
pods 2001 Abiteboul-Segoufin-Vianu 27
Incomplete answer to Q4• Provide known cameras• Explain incompleteness
canoncanon nikonnikon sony sony akaiakai
more productsmore products
name name
price>200price>200andandno pictureno picture
pods 2001 Abiteboul-Segoufin-Vianu 28
Completing answer to Q4
• It suffices to ask:
productproduct
name price cat name price cat
sub=camerasub=camera
=elec=elec>200>200 picturepicture
00
pods 2001 Abiteboul-Segoufin-Vianu 29
Revisit the types• DTD • Conditions• Specialization: same
element name may have several types
• Not sufficient
• Need to extend again the types: disjunctions
productproduct2b2b
**
=elec=elec>200>200
subcategorysubcategory!=camera!=camera
namename priceprice catcat picturepicture
pods 2001 Abiteboul-Segoufin-Vianu 30
Disjunction
??
??
vehiclevehicle
datadata engineengine
descriptiondescription
sailsail
vehiclevehicle
datadata
descriptiondescription
vehiclevehicle
datadataengineengine
sailsail
Query1’Query1’ Query2’Query2’
vehiclevehicle
data=“….”data=“….”
description=“….”description=“….”
Empty!Empty!&322&322
pods 2001 Abiteboul-Segoufin-Vianu 31
Disjunction continued
• Type of &322vehicle1 + vehicle2
vehicle2vehicle2
datadata
descriptiondescription
sailsail
vehicle1vehicle1
datadata engineengine
descriptiondescription
The type of &322 can not be described independently of that of data below
Results
pods 2001 Abiteboul-Segoufin-Vianu 33
Representation System:Lipski’s+Imielinski’s
reprep rep(T)rep(T)Set of possible Set of possible worldsworlds
q(rep(T))q(rep(T))==
rep(q(T))rep(q(T))
Set of possible Set of possible answersanswers
TT
Representation Representation of informationof information
q(T)q(T) reprep
Representation Representation of resultof result
pods 2001 Abiteboul-Segoufin-Vianu 34
Representation System for PS-queries
• Incomplete tree T to representq1
-1(A1) … qk-1(Ak)
• PS-query q
• q(T) can be computed in ptime(representation of the answer can be
computed in ptime)
pods 2001 Abiteboul-Segoufin-Vianu 35
Querying Incomplete Trees
• Given T and a query q, one can • Give in ptime the sure answers up to
our current knowledge• Check in ptime whether query q can be
fully anwered• Generate in ptime queries to complete
answer
pods 2001 Abiteboul-Segoufin-Vianu 36
Comparison with IL
Relational model
• Relational calculus/algebra
• Conditional table
• Closed or open world
• Representation system
XML tree model
• Weaker language (no join)
• Weaker system (no variable)
+ Closed and open World
• Representation system
pods 2001 Abiteboul-Segoufin-Vianu 37
Drawback: exponential blowup
• Incomplete information may become exponential w.r.t the sequence of query/answer q1/A1;q2/A2…
11 11 qqii::
Answers are emptyAnswers are empty
databasedatabase
a=ia=i b=ib=i
databasedatabase
aa bb
Type:Type:
pods 2001 Abiteboul-Segoufin-Vianu 38
Dealing with exponential blowup
• Make the representation more complex using disjunctions of types• Size of representation stays polynomial• Manipulations much more complex
• Restrict tree types and PS-queries • Already very/too? simple
• Accept to loose some information • Ask extra queries to simplify
representation
Discussion
pods 2001 Abiteboul-Segoufin-Vianu 40
Discussion: extend language
• Some results in paper• Extensions often lead to intractability
• E.G. : K-pebble transducers [Milo,Suciu,Vianu] that somehow subsume XML-QL and XSL• No (known) representation system• Testing rep(T) is empty is non-
elementary
pods 2001 Abiteboul-Segoufin-Vianu 41
Discussion : node Ids
Without node Ids• much less information to integrate
results• more complex• tedious case analysis
pods 2001 Abiteboul-Segoufin-Vianu 42
Discussion: ordering
• Ordering in XML, DTD, queries • Problem is totally different and very complex
• Example: • Q1/A1: list of males; Q2/A2: list of females; Q3: list all
• Depending on the type of input• (Male)*(Female)* A3= A1 || A2• (Male Female)* A3= shuffle(A1,A2)• (Male + Female)* we cannot answer A3
• Regular expression processing
pods 2001 Abiteboul-Segoufin-Vianu 43
Conclusion
• Framework for acquiring, maintaining, querying incomplete XML data
• Limitations: • simple queries• no order and Id assumption • small extensions lead to problems
• Possible to represent the incompleteness• Possible to answer with incompleteness• Possible to obtain queries to provide full
answer