Post on 04-Jan-2016
1
Lecture 08: XML and Semistructured Data
2
Outline
• XML (Section 17)
– XML syntax, semistructured data
– Document Type Definitions (DTDs)
• XPath
3
Additional Readings on XML
• XML
– http://www.w3.org/XML/1999/XML-in-10-points
– www.zvon.org/xxl/XMLTutorial/General/book_en.html
– http://db.bell-labs.com/galax/
– http://www.w3.org/TR/REC-xml-names (1/99)
• XPath
– http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html
• XQuery
– http://www.w3.org/TR/xmlquery-use-cases/
– http://www.xmlportfolio.com/xquery.html
• Main source: www.w3.org (but hard to read)
4
XML
• eXtensible Markup Language
• XML 1.0 – a recommendation from W3C, 1998
• Roots: SGML (used in publishing).
• After the roots: a format for sharing data
5
XML Data
• Relational data does not have a syntax
– I can’t “give” you my relational database
– Need to import it from another syntax, like CSV (comma-separated values)
• XML = rich syntax for data
– But XML is not relational: semistructured
• Usage:
– Map any data to XML
– Store it in files, exchange on the Web, etc.
– Even query it directly, using XPath, XQuery
6
XML Data Sharing and Exchange
[Figure: applications exchange XML data over the Web (HTTP). Data sources: relational data, object-relational, legacy data. Specific data management tasks on the XML data: Transform, Integrate, Warehouse.]
7
From HTML to XML
HTML describes the layout
8
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
9
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML describes the structure
10
XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated <red/>
Well-formed XML document
• it has matching tags
• tags are properly nested
• single root element
• and more constraints, e.g. on names
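The well-formedness rules above can be tried out with any XML parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the three sample strings are made up for illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents (not from the slides).
good = "<book><title>Foundations</title><red/></book>"
bad_nesting = "<book><title>Foundations</book></title>"  # tags not properly nested
two_roots = "<book/><book/>"                             # more than one root element

print(is_well_formed(good))         # True
print(is_well_formed(bad_nesting))  # False
print(is_well_formed(two_roots))    # False
```

The parser only checks well-formedness here; it does not check the document against any DTD.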
11
More XML: Attributes
<book price = "55" currency = "USD">
<title> Foundations of Databases </title>
<author> Abiteboul </author>
…
<year> 1995 </year>
</book>
Attributes are an alternative way to represent data
12
More XML: IDs and References
<person id="o555"> <name> Jane </name> </person>
<person id="o456"> <name> Mary </name>
<children idref="o123 o555"/>
</person>
<person id="o123" mother="o456"><name>John</name>
</person>
Scope of IDs and references is the document
13
More XML: CDATA Section
• Syntax: <![CDATA[ .....any text here...]]>
• Example:
<example> <![CDATA[ some text here </notAtag> <>]]></example>
14
More XML: Entity References
• Syntax: &entityname;
• Used like macros
• Example: <element> this is less than &lt; </element>
Some predefined entities:
&lt;     <
&gt;     >
&amp;    &
&apos;   '
&quot;   "
&#38;    Unicode char (numeric character reference)
complete list: http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html
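To see entity references being expanded, a short sketch assuming Python's standard xml.etree.ElementTree parser; the second test string is made up for illustration:

```python
import xml.etree.ElementTree as ET

# The example from the slide: the parser replaces &lt; with the character <.
elem = ET.fromstring("<element> this is less than &lt; </element>")
print(repr(elem.text))  # ' this is less than < '

# The five predefined entities plus one numeric character reference (&#38; is &).
chars = ET.fromstring("<e>&lt;&gt;&amp;&apos;&quot;&#38;</e>")
print(chars.text)       # <>&'"&
```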
15
More XML: Processing Instructions
• Syntax: <?target argument?>
• Example:
<product> <name> Alarm Clock </name> <?ringBell 20?> <price> 19.99 </price> </product>
• Processed by external applications, e.g. PHP (bad style)
16
More XML: Comments
• Syntax <!-- .... Comment text... -->
• Yes, they are part of the data model !!!
17
XML Data: a Tree !
<data>
<person id="o555" >
<name> Mary </name>
<address>
<street> Maple </street>
<no> 345 </no>
<city> Seattle </city>
</address>
</person>
<person>
<name> John </name>
<address> Thailand </address>
<phone> 23456 </phone>
</person>
</data>
[Figure: the document drawn as a tree. Element nodes: data, person, name, address, street, no, city, phone. Text nodes: Mary, Maple, 345, Seattle, John, Thailand, 23456. Attribute node: id = o555.]
Order matters !!!
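The tree view can be reproduced by walking a parsed document. This is a sketch assuming Python's xml.etree.ElementTree; note that this library stores text as a property of the enclosing element rather than as separate text nodes, so the walk function below (a hypothetical helper) only approximates the data model in the figure:

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>"""

def walk(elem, depth=0):
    """Yield one line per element, attribute, and non-blank text value."""
    pad = "  " * depth
    yield f"{pad}element {elem.tag}"
    for key, value in elem.attrib.items():
        yield f"{pad}  attribute {key} = {value}"
    if elem.text and elem.text.strip():
        yield f"{pad}  text {elem.text.strip()}"
    for child in elem:  # children are visited in document order
        yield from walk(child, depth + 1)

lines = list(walk(ET.fromstring(doc)))
print("\n".join(lines))
```

Children are returned in document order, which is the "order matters" point of the slide.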
18
From Relational Data to XML Data
Relational table persons:
name   phone
John   3634
Sue    6343
Dick   6363
XML: persons
<persons>
<row> <name>John</name> <phone>3634</phone> </row>
<row> <name>Sue</name> <phone>6343</phone> </row>
<row> <name>Dick</name> <phone>6363</phone> </row>
</persons>
[Figure: the XML as a tree. persons has three row children; each row has a name child ("John", "Sue", "Dick") and a phone child (3634, 6343, 6363).]
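The mapping from a table to XML is mechanical: one element per row, one child element per column. A minimal sketch, assuming Python's xml.etree.ElementTree; the function name table_to_xml is hypothetical:

```python
import xml.etree.ElementTree as ET

# The persons table from the slide, as plain Python data.
columns = ("name", "phone")
rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

def table_to_xml(table_name, columns, rows):
    """Build <table_name> with one <row> per tuple and one child per column."""
    root = ET.Element(table_name)
    for row in rows:
        row_elem = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(row_elem, column).text = value
    return root

persons = table_to_xml("persons", columns, rows)
xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```

The column names end up repeated inside every row, which is the sense in which the schema becomes part of the data.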
19
XML Data
• XML is self-describing
• Schema elements become part of the data
– Relational schema: persons(name,phone)
– In XML <persons>, <name>, <phone> are part of the data, and are repeated many times
• Consequence: XML is much more flexible
• XML = semistructured data
20
Semi-structured Data Explained
• Missing attributes:
<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name></person>
– Joe has no phone !
• Could represent in a table with nulls
name   phone
John   1234
Joe    -
21
Semi-structured Data Explained
• Repeated attributes
<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone></person>
– two phones !
• Impossible in tables:
name   phone
Mary   2345 3456 ???
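Code that reads semistructured data has to expect zero, one, or many occurrences of a child element. A small sketch, assuming Python's xml.etree.ElementTree; the enclosing <persons> element is an assumption added so the three person examples form one document:

```python
import xml.etree.ElementTree as ET

doc = """<persons>
  <person> <name>John</name> <phone>1234</phone> </person>
  <person> <name>Joe</name> </person>
  <person> <name>Mary</name> <phone>2345</phone> <phone>3456</phone> </person>
</persons>"""

# findall returns a (possibly empty) list, so missing and repeated
# phones are handled by the same code path.
phones = {
    person.findtext("name"): [p.text for p in person.findall("phone")]
    for person in ET.fromstring(doc)
}
print(phones)  # {'John': ['1234'], 'Joe': [], 'Mary': ['2345', '3456']}
```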
22
Semistructured Data Explained
• Attributes with different types in different objects
<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone></person>
– structured name !
• Nested collections (no 1NF)
• Heterogeneous collections:
– <db> contains both <book>s and <publisher>s
23
Document Type Definitions (DTD)
• part of the original XML specification
• an XML document may have a DTD
• XML document:
– well-formed = if tags are correctly closed
– valid = if it has a DTD and conforms to it
• validation is useful in data exchange
24
Very Simple DTD
<!DOCTYPE company [
<!ELEMENT company ((person|product)*)>
<!ELEMENT person (ssn, name, office, phone?)>
<!ELEMENT ssn (#PCDATA)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT office (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT product (pid, name, description?)>
<!ELEMENT pid (#PCDATA)>
<!ELEMENT description (#PCDATA)>
]>
25
Very Simple DTD
Example of valid XML document:
<company>
<person> <ssn> 123456789 </ssn> <name> John </name> <office> B432 </office> <phone> 1234 </phone> </person>
<person> <ssn> 987654321 </ssn> <name> Jim </name> <office> B123 </office> </person>
<product> ... </product>
...
</company>
26
DTD: The Content Model
<!ELEMENT tag (CONTENT)>
• CONTENT is the content model:
– Complex = a regular expression over other elements
– Text-only = #PCDATA
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*
27
DTD: Regular Expressions
sequence
DTD: <!ELEMENT name (firstName, lastName)>
XML: <name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name>
optional
DTD: <!ELEMENT name (firstName?, lastName)>
XML: <name> <lastName> . . . . . </lastName> </name>
star (repeated occurrence)
DTD: <!ELEMENT person (name, phone*)>
XML: <person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . . </person>
alternation
DTD: <!ELEMENT person (name, (phone|email))>
XML: <person> <name> . . . . . </name> <email> . . . . . </email> </person>
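Because a complex content model is a regular expression over child element names, it can be checked with an ordinary regex engine. The sketch below is not a DTD validator: it assumes Python's re and xml.etree.ElementTree, handles only element-only models (no #PCDATA, EMPTY, or ANY), and the helper names content_model_to_regex and children_match are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.-]*")

def content_model_to_regex(model: str) -> re.Pattern:
    """Translate e.g. '(name, phone*)' into a regex over 'name;phone;...'."""
    # Each element name becomes a token ending in ';' so names cannot run together.
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # Sequence (',') is plain concatenation; whitespace is insignificant.
    # The operators |, ?, * and parentheses already mean the same in a regex.
    pattern = re.sub(r"[,\s]", "", pattern)
    return re.compile(pattern)

def children_match(elem, model: str) -> bool:
    """True if the sequence of child tags of `elem` matches the content model."""
    child_names = "".join(child.tag + ";" for child in elem)
    return content_model_to_regex(model).fullmatch(child_names) is not None

person = ET.fromstring("<person><name/><phone/><phone/></person>")
name = ET.fromstring("<name><lastName/></name>")

print(children_match(person, "(name, phone*)"))         # True: star allows two phones
print(children_match(person, "(name, (phone|email))"))  # False: exactly one phone or email
print(children_match(name, "(firstName?, lastName)"))   # True: firstName is optional
print(children_match(name, "(firstName, lastName)"))    # False: firstName is required
```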
28
DTD: Attributes
• Document Type Definition
<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person age         CDATA         #REQUIRED
                 birthdate   CDATA         #IMPLIED
                 nationality CDATA         #FIXED "CH"
                 gender      (male|female) "female">
– age: mandatory (#REQUIRED, which takes no default value)
– birthdate: optional (#IMPLIED)
– nationality: fixed default value "CH" (#FIXED)
– gender: enumeration (male|female) with default "female"
• Document
<person age="24" nationality="CH" gender="male">
<ssn> … </ssn> …<phone> … </phone> </person>
29
DTD: Entities
• DTD:
– external entity:
<!ENTITY address SYSTEM "address.xml">
– internal entity:
<!ENTITY name "<name>Tim Berners Lee</name>">
• Document:
<celebrity>&name;&address;</celebrity>
30
Inclusion of DTD in Documents
External DTD Declaration
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" "http://www.test.org/test.dtd">
<test> "test" is a document element </test>
Internal DTD Declaration
<!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
<test/>
Mixed usage
<!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [ <!ENTITY hello "hello world"> ]>
<test>&hello;</test>
31
XML Namespaces
• Different DTDs can use the same names!
– how to avoid conflicts when combining names from different DTDs?
• An XML namespace is a collection of names (markup vocabulary)
– identified by a URI reference, which a prefix stands for in the document
32
XML Namespaces
• name ::= [prefix:]localname
<book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'>
<title> … </title>
<number> 15 </number>
<isbn:number> …. </isbn:number>
</book>
– xmlns='urn:loc.gov:book' declares the default namespace
– unprefixed names (book, title, number) belong to the default namespace
33
XML Namespaces
<tag xmlns:mystyle = "http://…">
…
<mystyle:title> … </mystyle:title>
<mystyle:number> …
</tag>
– mystyle:title and mystyle:number belong to this namespace
• syntactic: <number> , <isbn:number>
• semantic: URL used as unique identifier
– URL may not exist, has no function
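A namespace-aware parser resolves each prefix to its URI, so the two number elements from the book example get different names. A sketch assuming Python's xml.etree.ElementTree, which writes resolved names as {uri}localname; the element contents T and N are placeholders for the text elided on the slide:

```python
import xml.etree.ElementTree as ET

doc = """<book xmlns='urn:loc.gov:book' xmlns:isbn='www.isbn-org.org/def'>
  <title>T</title>
  <number>15</number>
  <isbn:number>N</isbn:number>
</book>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(root.tag)  # {urn:loc.gov:book}book
for tag in tags:
    print(tag)
# {urn:loc.gov:book}title
# {urn:loc.gov:book}number
# {www.isbn-org.org/def}number

# Lookups use the URI; the prefix chosen in the query is local to this mapping.
ns = {"isbn": "www.isbn-org.org/def"}
print(root.find("isbn:number", ns).text)  # N
```

The parser never fetches the URI; it is used only as an identifier, as the slide notes.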
34
Querying XML Data
• XPath = simple navigation through the tree
• XQuery = the SQL of XML
• XSLT = recursive traversal
– will not discuss
• XQuery and XSLT build on XPath
35
Sample Data for Queries
<bib>
<book>
<publisher> Addison-Wesley </publisher>
<author> Serge Abiteboul </author>
<author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
<author> Victor Vianu </author>
<title> Foundations of Databases </title>
<year> 1995 </year>
</book>
<book price="55">
<publisher> Freeman </publisher>
<author> Jeffrey D. Ullman </author>
<title> Principles of Database and Knowledge Base Systems </title>
<year> 1998 </year>
</book>
</bib>
36
Data Model for XPath
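[Figure: the sample data as a tree. "The root" is a node above the root element bib; bib has two book children; the first book has children publisher (Addison-Wesley), author (Serge Abiteboul), . . . .]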
37
XPath: Simple Expressions
/bib/book/year
Result: <year> 1995 </year>
<year> 1998 </year>
/bib/paper/year
Result: empty (there were no papers)
38
XPath: Restricted Kleene Closure
//author
Result: <author> Serge Abiteboul </author>
<author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
<author> Victor Vianu </author>
<author> Jeffrey D. Ullman </author>
/bib//first-name
Result: <first-name> Rick </first-name>
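These path expressions can be tried in Python, with one caveat: the standard xml.etree.ElementTree module supports only a subset of XPath and evaluates paths relative to an element, so /bib/book/year is written ./book/year starting from the bib element. A sketch under that assumption, with whitespace trimmed from the sample data:

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>""")

years = [y.text for y in bib.findall("./book/year")]          # /bib/book/year
papers = bib.findall("./paper/year")                          # /bib/paper/year
authors = bib.findall(".//author")                            # //author
first_names = [f.text for f in bib.findall(".//first-name")]  # /bib//first-name

print(years)         # ['1995', '1998']
print(papers)        # []
print(len(authors))  # 4
print(first_names)   # ['Rick']
```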
39
XPath: Text Nodes
/bib/book/author/text()
Result: Serge Abiteboul
Victor Vianu
Jeffrey D. Ullman
Rick Hull doesn’t appear because his author element contains first-name and last-name elements rather than text
Functions in XPath:
– text() = matches the text value
– node() = matches any node (= * or @* or text())
– name() = returns the name of the current tag
40
XPath: Wildcard
//author/*
Result: <first-name> Rick </first-name>
<last-name> Hull </last-name>
* Matches any element
41
XPath: Attribute Nodes
/bib/book/@price
Result: "55"
@price means that price has to be an attribute
42
XPath: Predicates
/bib/book/author[first-name]
Result: <author> <first-name> Rick </first-name>
<last-name> Hull </last-name>
</author>
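A predicate in square brackets filters the nodes selected so far without changing what is returned. A minimal sketch, assuming Python's xml.etree.ElementTree (which supports the [tag] form of predicate) and a cut-down copy of the sample data:

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
  </book>
  <book price="55">
    <author>Jeffrey D. Ullman</author>
  </book>
</bib>""")

# Equivalent of /bib/book/author[first-name]: authors that have a
# first-name child. The result is the author element, not the first-name.
structured = bib.findall("./book/author[first-name]")

print(len(structured))                      # 1
print(structured[0].tag)                    # author
print(structured[0].findtext("last-name"))  # Hull
```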
43
XPath: More Predicates
/bib/book/author[firstname][address[.//zip][city]]/lastname
Result: <lastname> … </lastname>
<lastname> … </lastname>
44
XPath: More Predicates
/bib/book[@price < "60"]
/bib/book[author/@age < "25"]
/bib/book[author/text()]
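Comparison predicates such as [@price < "60"] are part of XPath, but Python's xml.etree.ElementTree does not implement them. The sketch below, which assumes that library and a cut-down copy of the sample data, selects candidates with the supported [@price] predicate and applies the comparison in Python; it approximates the XPath semantics rather than running the XPath itself:

```python
import xml.etree.ElementTree as ET

bib = ET.fromstring("""<bib>
  <book>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <title>Foundations of Databases</title>
  </book>
  <book price="55">
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
  </book>
</bib>""")

# Approximates /bib/book[@price < "60"]: books without a price never qualify.
cheap = [b for b in bib.findall("./book[@price]") if float(b.get("price")) < 60]
cheap_titles = [b.findtext("title") for b in cheap]

# Approximates /bib/book[author/text()]: books with an author that has text.
with_text = [
    b for b in bib.findall("./book")
    if any((a.text or "").strip() for a in b.findall("author"))
]

print(cheap_titles)    # ['Principles of Database and Knowledge Base Systems']
print(len(with_text))  # 2
```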
45
XPath: Summary
bib                                     matches a bib element
*                                       matches any element
/                                       matches the root
/bib                                    matches a bib element under root
bib/paper                               matches a paper in bib
bib//paper                              matches a paper in bib, at any depth
//paper                                 matches a paper at any depth
paper|book                              matches a paper or a book
@price                                  matches a price attribute
bib/book/@price                         matches price attribute in book, in bib
bib/book[@price<"55"]/author/lastname   matches…