Post on 19-Jan-2016
Part One: XML and Databases
Soumen Chakrabarti
CSE, IIT Bombay
Form and content
• The Web today
– HTML generated by hand, WYSIWYG editors, ‘webified’ databases
– HTML specifies rendering for human reading
– Screen scraping required to consolidate data
• The Web in the future
– Common interchange format (XML)
– Concentrate on content, not form
– Represent data class broader than relations
Role of databases
• Contribute
– Data storage and indexing
– Query processing and optimization
– Views, transformations, integration
• Adopt
– Search modalities
– Content-based approximate search
– Linguistic analysis
Features of semi-structured data
• No explicit schema, or volatile schema
• Schema size comparable to data size
• Structure changes without notice
• Heterogeneous, deeply nested, irregular
• Has nature of documents rather than tables
Semi-structured data model example
[Figure: Object Exchange Model (OEM) graph. The root &o1 (Bib) has paper and book edges to complex objects &o12, &o24, and &o29, which are linked to one another by references edges. Further edges are labeled author, title, year, http, publisher, and page, with firstname, lastname, first, and last below them. Atomic objects hold values such as “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”, 122, and 133. Legend: complex object, atomic object.]
Syntax
{ paper: { author: “Abiteboul”,
author: { firstname: “Victor”,
lastname: “Vianu”},
title: “Regular path queries …”,
page: { first: 122, last: 133 }
}
}
Some observations
• Missing or additional attributes
• Multiple attributes
• Different types in different objects
• Heterogeneous collections
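A minimal sketch of how the OEM object from the Syntax slide might be held in memory. The encoding is an assumption made for illustration, not part of OEM: a complex object is a list of (label, value) pairs so that a label such as author can repeat, and an atomic object is a plain Python value. The helper name children is hypothetical.

```python
# Hypothetical in-memory encoding of the OEM example: a complex object is a
# list of (label, value) pairs, an atomic object is a plain str or int.
paper = [
    ("author", "Abiteboul"),                      # atomic author
    ("author", [("firstname", "Victor"),          # complex author
                ("lastname", "Vianu")]),
    ("title", "Regular path queries …"),          # title elided as in the slide
    ("page", [("first", 122), ("last", 133)]),
]
bib = [("paper", paper)]


def children(obj, label):
    """Return every value reachable from obj over an edge named label."""
    if not isinstance(obj, list):     # atomic objects have no outgoing edges
        return []
    return [value for (name, value) in obj if name == label]


authors = children(paper, "author")
print(len(authors))                           # multiple attributes with one label
print([type(a).__name__ for a in authors])    # different types in different objects
print(children(paper, "year"))                # missing attribute gives an empty list
```

The two author values show "multiple attributes" and "different types in different objects" at once: one is a string, the other a nested object.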
Object ID’s and references
<person id=“o555”><name>Jane </name></person>
<person id=“o456”><name>Mary</name><children idref=“o123 o555”/></person>
<person id=“o123” mother=“o456”><name>John</name></person>
[Figure: object graph. o456 has children edges to o123 and o555; o123 has a mother edge to o456.]
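A small sketch of following these ID references with Python's standard xml.etree.ElementTree. The wrapping people element, the plain ASCII quotes, and the helper names are assumptions added so the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The three persons from the slide, wrapped in a hypothetical <people> root
# and written with ASCII quotes so the parser accepts them.
DOC = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(DOC)
by_id = {p.get("id"): p for p in root.iter("person")}   # id -> element


def name_of(oid):
    return by_id[oid].findtext("name").strip()


def children_of(oid):
    """Follow the space-separated reference list on <children>, if present."""
    node = by_id[oid].find("children")
    if node is None:
        return []
    return [name_of(ref) for ref in node.get("idref").split()]


def mother_of(oid):
    ref = by_id[oid].get("mother")
    return name_of(ref) if ref else None


print(children_of("o456"))
print(mother_of("o123"))
```

The parser itself does not resolve references; the by_id dictionary plays the role of the object-ID lookup.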
Names and acronyms
• OEM (Object Exchange Model): a semi-structured data model from Stanford, 1995
• Lore: a system for storing data adhering to the OEM
• Lorel: a query language for Lore
• XML (eXtensible Markup Language): a simplification of SGML and a generalization of HTML
• XML-QL: Query language for XML
Lorel query examples
select Bib.paper.title
from Bib.paper
where Bib.paper.year > 1995

select X.title
from Bib.paper X, Bib.(paper|book) Y
where Y.author.lastname? = “Ullman”
  and Y.reference+ X

Callouts on the second query:
• Alternative: (paper|book)
• Transitive closure: reference+
• Navigating partially known structures
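A rough sketch of how the second Lorel query could be evaluated over an edge-labeled graph. The graph, its node names, and the helpers step and plus are all hypothetical. The sketch reads lastname? as an optional step and reference+ as one or more reference edges, which is an interpretation rather than a statement of how Lore evaluates it.

```python
from collections import deque

# Hypothetical edge-labeled graph: node -> list of (label, target).
# Targets without an entry of their own stand for atomic values.
EDGES = {
    "bib": [("paper", "p1"), ("paper", "p2"), ("book", "b1")],
    "b1": [("author", "a1"), ("reference", "p1")],
    "p1": [("title", "t1"), ("reference", "p2")],
    "p2": [("title", "t2")],
    "a1": [("lastname", "Ullman")],
}


def step(nodes, labels):
    """Follow one edge whose label is in labels from every node in nodes."""
    return {dst for n in nodes for (lab, dst) in EDGES.get(n, []) if lab in labels}


def plus(nodes, labels):
    """Transitive closure: nodes reachable over one or more such edges."""
    seen, queue = set(), deque(step(nodes, labels))
    while queue:
        n = queue.popleft()
        if n not in seen:
            seen.add(n)
            queue.extend(step({n}, labels))
    return seen


def ullman_cited_titles():
    papers = step({"bib"}, {"paper"})                    # Bib.paper X
    result = set()
    for y in step({"bib"}, {"paper", "book"}):           # Bib.(paper|book) Y
        authors = step({y}, {"author"})
        names = authors | step(authors, {"lastname"})    # author.lastname?
        if "Ullman" in names:
            for x in plus({y}, {"reference"}) & papers:  # Y.reference+ X
                result |= step({x}, {"title"})           # select X.title
    return result


print(sorted(ullman_cited_titles()))
```

Tracking visited nodes in plus keeps the closure finite even when reference edges form a cycle.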
XML-QL query examples
where <book language=“french”>
        <publisher><name>Morgan Kaufmann</name></publisher>
        <author> $a </author>
      </book> in “www.a.b.c/bib.xml”
construct $a

where <book language = $l>
        <author> $a </>
      </> in “www.a.b.c/bib.xml”
construct <result><author>$a</><lang>$l</></>
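To make the binding semantics of the second query concrete, here is a rough emulation in Python using xml.etree.ElementTree. This is not XML-QL; the inline document standing in for bib.xml, the author strings, and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of bib.xml.
BIB = """
<bib>
  <book language="french"><author>Author A</author><author>Author B</author></book>
  <book language="english"><author>Author C</author></book>
  <book><author>Author D</author></book>
</bib>
"""


def authors_with_language(xml_text):
    """Bind $l and $a for each matching book/author pair, then build <result>."""
    results = []
    for book in ET.fromstring(xml_text).iter("book"):
        lang = book.get("language")
        if lang is None:          # the pattern requires a language attribute
            continue
        for author in book.findall("author"):
            result = ET.Element("result")
            ET.SubElement(result, "author").text = author.text
            ET.SubElement(result, "lang").text = lang
            results.append(result)
    return results


for r in authors_with_language(BIB):
    print(ET.tostring(r, encoding="unicode"))
```

Each (book, author) pair that matches the pattern yields one constructed result element, so a book with two authors contributes two results.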
XML storage in ternary relation
[Figure: data graph. &o1 has a paper edge to &o2; &o2 has title, author, author, and year edges to &o3, &o4, &o5, and &o6, whose values are “The Calculus”, “…”, “…”, and “1986”.]

Ref
Source  Label   Dest
&o1     paper   &o2
&o2     title   &o3
&o2     author  &o4
&o2     author  &o5
&o2     year    &o6

Val
Node  Value
&o3   The Calculus
&o4   …
&o5   …
&o6   1986
• Too many joins
• Label name storage redundant
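A short sketch of why this layout needs many joins, using Python's built-in sqlite3 module. The table and column names follow the slide; the query itself (titles of papers from 1986) is an example chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Ref(source TEXT, label TEXT, dest TEXT);
    CREATE TABLE Val(node TEXT, value TEXT);
""")
conn.executemany("INSERT INTO Ref VALUES (?, ?, ?)", [
    ("&o1", "paper", "&o2"),
    ("&o2", "title", "&o3"),
    ("&o2", "author", "&o4"),
    ("&o2", "author", "&o5"),
    ("&o2", "year", "&o6"),
])
conn.executemany("INSERT INTO Val VALUES (?, ?)", [
    ("&o3", "The Calculus"),
    ("&o4", "…"),          # author values are elided in the slide
    ("&o5", "…"),
    ("&o6", "1986"),
])

# Titles of papers whose year is 1986: one Ref join per path step and one
# Val join per atomic value that is read or tested.
rows = conn.execute("""
    SELECT tv.value
    FROM Ref p
    JOIN Ref t  ON t.source = p.dest AND t.label = 'title'
    JOIN Val tv ON tv.node = t.dest
    JOIN Ref y  ON y.source = p.dest AND y.label = 'year'
    JOIN Val yv ON yv.node = y.dest
    WHERE p.source = '&o1' AND p.label = 'paper' AND yv.value = '1986'
""").fetchall()
print(rows)
```

Even this two-attribute lookup touches five table instances, and the label strings 'paper', 'title', and 'year' are repeated in every Ref row that uses them.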
Storage optimization through mining
[Figure: data graph of four paper objects with author, title, and year edges; author objects have fn and ln edges.]

Paper1
author  title
X       X

Paper2
fn1  ln1  fn2  ln2  title  year
X    X    X    X    X      -
X    X    -    -    X      X
X    X    -    -    X      -
• Inline common cases
• Tolerate a few nulls
Schema extraction
• Schema: a template for type/semantics specification
• Conformance
– Does the data conform to a given schema?
• Classification
– If so, which objects belong to what classes/types?
• Applications
– Storage and query optimization
Graph simulation
Given two edge-labeled graphs G1 and G2, a simulation is a relation R between their nodes such that if (x1, x2) is in R and (x1, a, y1) is in G1, then there exists (x2, a, y2) in G2 (same label) such that (y1, y2) is in R.
[Figure: x1 in G1 and x2 in G2 are related by R; edges labeled a lead to y1 and y2, which are also related by R.]
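The definition can be turned into a simple fixpoint computation: start from all node pairs and discard pairs that violate the condition until nothing changes. The sketch below is a naive version under the assumption that a graph is a set of (source, label, target) triples; the function name and the tiny example graphs are hypothetical.

```python
def maximal_simulation(g1, g2):
    """Largest R such that (x1, x2) in R and (x1, a, y1) in g1 imply some
    (x2, a, y2) in g2 with (y1, y2) in R. Graphs are sets of
    (source, label, target) triples."""
    nodes1 = {n for (s, _, t) in g1 for n in (s, t)}
    nodes2 = {n for (s, _, t) in g2 for n in (s, t)}
    rel = {(x1, x2) for x1 in nodes1 for x2 in nodes2}   # start with all pairs
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            # Every a-edge leaving x1 must be matched by an a-edge leaving x2
            # whose targets are still related.
            ok = all(
                any((y1, y2) in rel
                    for (s2, b, y2) in g2 if s2 == x2 and b == a)
                for (s1, a, y1) in g1 if s1 == x1
            )
            if not ok:
                rel.discard((x1, x2))
                changed = True
    return rel


# Hypothetical data graph D and schema graph S.
D = {("r", "employee", "p1"), ("r", "employee", "p2"), ("p1", "manages", "p2")}
S = {("Root", "employee", "Emp"), ("Emp", "manages", "Emp")}

R = maximal_simulation(D, S)
print(sorted(R))
```

In the example, p2 has no outgoing edges and is therefore simulated by every schema node, while r is simulated only by Root. This quadratic refinement is meant to mirror the definition, not to be efficient.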
Upper and lower bound schema
• Lower bound schema
– Conformance: find simulation R from S to D
– Classification: check if (c, x) in R
– Used in storage optimization
• Upper bound schema (data guides)
– Conformance: find simulation R from D to S
– Classification: check if (x, c) in R
– Used in path index generation and query optimization
Sample data
[Figure: sample data graph. The root &r has a company edge to &c and employee edges to &p1 through &p8. The employee nodes have worksfor edges to &c, and manages and managedby edges connect pairs of employees.]
Lower bound schema
[Figure: lower bound schema graph with four classes.]
• Root: &r
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c
Edge labels: company, employee (two edges), manages, managedby, worksfor (two edges)
Storage using lower bound schema
[Figure: lower-bound schema with classes Root, Company, Employee, and string; edge labels company, person, works-for, c.e.o., address, name, and managed-by.]

Employee
oid  name  managed-by  works-for
…    …     …           …
…    …     …           …

Store rest in overflow graph
Upper bound schema (DataGuides)
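[Figure: upper bound schema (DataGuide) graph with five classes.]
• Root: &r
• Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8
• Bosses: &p1, &p4, &p6
• Regulars: &p2, &p3, &p5, &p7, &p8
• Company: &c
Edge labels: company, employee, manages (two edges), managedby (two edges), worksfor (three edges)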
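One common way to obtain such an upper bound schema is a subset construction over the data graph, where each schema node is the set of data nodes reached by some label path from the root. The sketch below assumes that approach; the function name and the three-employee example graph are hypothetical and much smaller than the sample data.

```python
def build_dataguide(edges, root):
    """Subset construction: each DataGuide node is the set of data nodes
    reached by one label path from the root. edges is a set of
    (source, label, target) triples."""
    start = frozenset([root])
    guide, todo = {}, [start]
    while todo:
        state = todo.pop()
        if state in guide:            # already expanded
            continue
        targets = {}
        for (src, label, dst) in edges:
            if src in state:
                targets.setdefault(label, set()).add(dst)
        guide[state] = {lab: frozenset(ds) for lab, ds in targets.items()}
        todo.extend(guide[state].values())
    return guide


# Hypothetical data: one boss (p1) managing two regular employees.
DATA = {
    ("r", "company", "c"),
    ("r", "employee", "p1"), ("r", "employee", "p2"), ("r", "employee", "p3"),
    ("p1", "manages", "p2"), ("p1", "manages", "p3"),
    ("p2", "managedby", "p1"), ("p3", "managedby", "p1"),
    ("p1", "worksfor", "c"), ("p2", "worksfor", "c"), ("p3", "worksfor", "c"),
}

guide = build_dataguide(DATA, "r")
for state, out in guide.items():
    print(sorted(state), {lab: sorted(t) for lab, t in out.items()})
```

The result has an all-employees node, a bosses node, and a regulars node, which mirrors the Employees, Bosses, and Regulars classes on the slide. In the worst case the number of such sets can grow exponentially with the size of the data.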
Query optimization issues
Select x from A.B x where exists y in x.C: y=5
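[Figure: three example data trees rooted at A. The children of A are labeled B or D, and each has a C child holding a value; the leaf values are 5, 5, 5 in the first tree and 4, 4, 5 in the second and third.]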
What makes the problem difficult
• Selectivity estimation
• Index selection
• Access cost models
• Clustering choices
Part Two: Information Retrieval and Databases
Soumen Chakrabarti
CSE, IIT Bombay
Information retrieval (IR)
• Search
– ‘Inverted’ index
– Boolean match
– Relevance ranking
• Classification
– Learn topics from examples
• Clustering
– Discover topics from a document collection
• Never done inside a relational database
[Figure: inverted index with positional postings.]
cat → D5: 3, 37, 50; D7: 9, 20
dog → D7: 7, 90, 400; D20: 22, 533
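A minimal sketch of a positional inverted index with Boolean AND matching, in the spirit of the cat/dog postings above. The three documents, the whitespace tokenizer, and the function names are assumptions for illustration; the positions they produce are not the ones in the figure.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, *terms):
    """Documents containing every term (Boolean match)."""
    postings = [set(index.get(t, {})) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical documents.
DOCS = {
    "D5": "the cat sat",
    "D7": "cat chases dog",
    "D20": "dog sleeps",
}

index = build_index(DOCS)
print(dict(index["cat"]))
print(boolean_and(index, "cat", "dog"))
```

Keeping word positions in the postings is what later allows phrase and proximity (NEAR) queries, at the cost of a larger index.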
Current style of loose integration
• RDBMS provides hooks
• Declare some columns as textual with keyword index
• Inserts, updates, and deletes trigger external program, e.g., Verity search engine
• Search engine maintains separate indices
• Simple query rewriting to combine relational and text-match where-clauses
Reasons
• Space
– BLOB vs. pure relational representation
– Average English word is only 5 bytes
• Time
– Most text engines are resigned to flexible (i.e., no) model for data consistency
– Much faster read-only access than relational database lookups
New features desired
• Operations that are more complex than keyword search can benefit from tighter coupling with RDBMS
• Approximate search is essential (Anand Rajaraman, Amazon.com, SIGMOD 99)
– Misspelling book title, author name common
– Variant of OEM edge label (author/writer/poet)
• Similarity extends to structure as well (‘Travolta’ NEAR ‘Cage’ = ‘Face/Off’)
Case study: generalized ‘like’
• SQL has limited string matching constructs
– like ‘%x’, ‘x%’, ‘%x%’
– x must be exact match
• Need more lenient match
– Applications: LDAP, IR
• String edit distance is not suitable
– “Given query, order strings in database in increasing order of edit distance and pick top 5”
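For reference, the quoted edit-distance query can be sketched as below, assuming the standard Levenshtein distance with unit costs; the function names and example strings are illustrative. This brute-force form computes a distance against every stored string, which is one plausible reason the slide calls it unsuitable.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))            # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                             # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete ca
                           cur[j - 1] + 1,            # insert cb
                           prev[j - 1] + (ca != cb))) # substitute or match
        prev = cur
    return prev[-1]


def top_k_by_edit_distance(query, strings, k=5):
    """Order strings by increasing edit distance to query and keep k."""
    return sorted(strings, key=lambda s: (edit_distance(query, s), s))[:k]


print(edit_distance("pascal", "rascal"))
print(top_k_by_edit_distance("rascal", ["nascent", "pascal", "rascal"], k=2))
```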
Sliding-window matching
nascent → nas asc sce cen ent
pascal → pas asc sca cal
rascal → ras asc sca cal
• Given a query, scan to get a set of 3-grams
• Similarity of string in database to query = number of shared 3-grams
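The 3-gram similarity can be sketched in a few lines. The function names are hypothetical, and dropping strings that share no 3-gram with the query is a choice made here, not something the slide specifies.

```python
def trigrams(s):
    """Set of 3-grams seen by sliding a window of width 3 over s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(a, b):
    """Number of 3-grams shared by the two strings."""
    return len(trigrams(a) & trigrams(b))


def top_matches(query, strings, k=5):
    """Rank stored strings by shared 3-grams, best first."""
    scored = [(similarity(query, s), s) for s in strings]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [s for score, s in scored if score > 0][:k]


print(sorted(trigrams("nascent")))
print(similarity("pascal", "rascal"))
print(top_matches("rascal", ["nascent", "pascal"]))
```

With the slide's words, rascal shares asc, sca, and cal with pascal but only asc with nascent, so pascal ranks first. In a database setting the 3-grams would typically be stored in an indexed table so that candidates are found by lookup rather than by scanning every string.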
Issues
• Minimally disruptive architecture
• Low storage overheads
• Fast query processing
• Good selectivity estimates
• Combining with other predicates for ranking
• Efficiently handling updates