Lecture 08: XML and Semistructured Data

Post on 04-Jan-2016


1

Lecture 08: XML and Semistructured Data

2

Outline

• XML (Section 17)
  – XML syntax, semistructured data
  – Document Type Definitions (DTDs)

• XPath

3

Additional Readings on XML

• XML
  – http://www.w3.org/XML/1999/XML-in-10-points
  – www.zvon.org/xxl/XMLTutorial/General/book_en.html
  – http://db.bell-labs.com/galax/
  – http://www.w3.org/TR/REC-xml-names (1/99)

• XPath
  – http://java.sun.com/webservices/docs/ea2/tutorial/doc/JAXPXSLT2.html

• XQuery
  – http://www.w3.org/TR/xmlquery-use-cases/
  – http://www.xmlportfolio.com/xquery.html

• Main source: www.w3.org (but hard to read)

4

XML

• eXtensible Markup Language

• XML 1.0 – a recommendation from W3C, 1998

• Roots: SGML (used in publishing).

• Grown beyond those roots: a format for sharing data

5

XML Data

• Relational data does not have a syntax
  – I can't "give" you my relational database
  – Need to import it from another syntax, like CSV (comma-separated values)

• XML = rich syntax for data
  – But XML is not relational: it is semistructured

• Usage:
  – Map any data to XML
  – Store it in files, exchange it on the Web, etc.
  – Even query it directly, using XPath, XQuery

6

XML Data Sharing and Exchange

[Figure: applications and data sources (relational data, object-relational, legacy data) exchange XML data over the Web (HTTP). Labels: Transform, Integrate, Warehouse. Caption: Specific data management tasks.]

7

From HTML to XML

HTML describes the layout

8

HTML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>
    Abiteboul, Hull, Vianu
    <br> Addison Wesley, 1995

<p> <i> Data on the Web </i>
    Abiteboul, Buneman, Suciu
    <br> Morgan Kaufmann, 1999

9

XML

<bibliography>
  <book>
    <title> Foundations… </title>
    <author> Abiteboul </author>
    <author> Hull </author>
    <author> Vianu </author>
    <publisher> Addison Wesley </publisher>
    <year> 1995 </year>
  </book>
</bibliography>

XML describes the structure

10

XML Terminology

• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements are nested
• empty element: <red></red>, abbreviated <red/>

A well-formed XML document:

• has matching tags
• has properly nested tags
• has a single root element
• satisfies more constraints, e.g. on names
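The well-formedness rules above can be tried out with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample strings are made up for illustration and are not part of the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule on the slide.
samples = {
    "matching and nested": "<book><title>Foundations</title></book>",
    "empty element": "<red/>",
    "mismatched tags": "<book><title>Foundations</book></title>",
    "two root elements": "<book/><book/>",
}

results = {label: is_well_formed(xml) for label, xml in samples.items()}
for label, ok in results.items():
    print(f"{label}: {ok}")
```

The parser rejects the last two samples because tags are improperly nested in one and there is more than one root element in the other.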

11

More XML: Attributes

<book price="55" currency="USD">
  <title> Foundations of Databases </title>
  <author> Abiteboul </author>
  <year> 1995 </year>
</book>

Attributes are an alternative way to represent data.

12

More XML: IDs and References

<person id="o555"> <name> Jane </name> </person>

<person id="o456"> <name> Mary </name>
  <children idref="o123 o555"/>
</person>

<person id="o123" mother="o456"> <name> John </name>
</person>

Scope of IDs and references is the document
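As a rough illustration of how an application might follow such references, the sketch below builds an id-to-element map by hand. It assumes a hypothetical `<people>` wrapper element (the slide's fragment has no single root), and it treats `id`, `idref`, and `mother` as IDs and references purely by convention, since `xml.etree.ElementTree` does not read attribute types from a DTD.

```python
import xml.etree.ElementTree as ET

# The slide's three persons, wrapped in a hypothetical <people> root
# so that the text is a well-formed document.
doc = """
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name>
    <children idref="o123 o555"/>
  </person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
"""

root = ET.fromstring(doc)

# IDs are scoped to the document, so one dictionary covers all of them.
by_id = {p.get("id"): p for p in root.iter("person")}

# Follow Mary's references to her children (a space-separated list of ids).
child_ids = by_id["o456"].find("children").get("idref").split()
child_names = [by_id[i].findtext("name") for i in child_ids]

# Follow John's reference back to his mother.
mother_name = by_id[by_id["o123"].get("mother")].findtext("name")

print(child_names)
print(mother_name)
```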

13

More XML: CDATA Section

• Syntax: <![CDATA[ .....any text here...]]>

• Example:

<example> <![CDATA[ some text here </notAtag> <>]]></example>


14

More XML: Entity References

• Syntax: &entityname;

• Used like macros

• Example: <element> this is less than &lt; </element>

Some predefined entities:

&lt;     <
&gt;     >
&amp;    &
&apos;   '
&quot;   "
&#38;    numeric character reference: the Unicode character with code 38, i.e. &

complete list: http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html
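To see what a parser hands to the application for CDATA sections and entity references, here is a small sketch with Python's standard `xml.etree.ElementTree`. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# A CDATA section: markup characters inside it are taken literally.
cdata = ET.fromstring(
    "<example><![CDATA[ some text here </notAtag> <>]]></example>"
)

# A predefined entity reference and a numeric character reference.
entity = ET.fromstring("<element>this is less than &lt;</element>")
char_ref = ET.fromstring("<element>&#38;</element>")

print(repr(cdata.text))
print(repr(entity.text))
print(repr(char_ref.text))

# Serializing escapes the character again.
serialized = ET.tostring(entity, encoding="unicode")
print(serialized)
```

In both cases the application sees plain text; the CDATA markers and the entity syntax are only a way of writing that text in the file.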

15

More XML: Processing Instructions

• Syntax: <?target argument?>

• Example:

<product>
  <name> Alarm Clock </name>
  <?ringBell 20?>
  <price> 19.99 </price>
</product>

• Processed by external applications, e.g. PHP (bad style)

16

More XML: Comments

• Syntax <!-- .... Comment text... -->

• Yes, they are part of the data model !!!

17

XML Data: a Tree !

<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>

[Figure: the document drawn as a tree. The element node data has two person element nodes as children. The first person has an attribute node id (o555) and element nodes name (text node Mary) and address; address has children street (Maple), no (345), and city (Seattle). The second person has name (John), address (Thailand), and phone (23456). Legend: element node, text node, attribute node.]

Order matters !!!
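The tree view can be reproduced with a short recursive walk. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `walk` and the output layout are choices made here for illustration. Children are visited in document order, which is the order the slide says matters.

```python
import xml.etree.ElementTree as ET

doc = """<data>
  <person id="o555">
    <name> Mary </name>
    <address>
      <street> Maple </street>
      <no> 345 </no>
      <city> Seattle </city>
    </address>
  </person>
  <person>
    <name> John </name>
    <address> Thailand </address>
    <phone> 23456 </phone>
  </person>
</data>"""

def walk(elem, depth=0, out=None):
    """Collect one line per element node: tag, attribute nodes, text node."""
    out = [] if out is None else out
    attrs = "".join(f" @{k}={v}" for k, v in elem.attrib.items())
    text = (elem.text or "").strip()
    out.append("  " * depth + elem.tag + attrs + (f": {text}" if text else ""))
    for child in elem:  # iteration follows document order
        walk(child, depth + 1, out)
    return out

root = ET.fromstring(doc)
lines = walk(root)
print("\n".join(lines))
```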

18

From Relational Data to XML Data

Relational table persons:

name   phone
John   3634
Sue    6343
Dick   6363

XML: persons

<persons>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</persons>

[Figure: the XML drawn as a tree. The persons node has three row children; each row has a name child and a phone child, with text values "John" 3634, "Sue" 6343, "Dick" 6363.]
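The mapping from a table to XML is mechanical: one element per row, one child element per column. A minimal sketch with Python's standard `xml.etree.ElementTree` follows; the in-memory `rows` list stands in for a real database query, which is an assumption made here to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The persons(name, phone) table from the slide, as plain tuples.
columns = ("name", "phone")
rows = [("John", "3634"), ("Sue", "6343"), ("Dick", "6363")]

persons = ET.Element("persons")
for row in rows:
    row_elem = ET.SubElement(persons, "row")
    for column, value in zip(columns, row):
        # The column name becomes a tag, so the schema is repeated in every row.
        ET.SubElement(row_elem, column).text = value

xml_text = ET.tostring(persons, encoding="unicode")
print(xml_text)
```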

19

XML Data

• XML is self-describing

• Schema elements become part of the data
  – Relational schema: persons(name, phone)
  – In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times

• Consequence: XML is much more flexible

• XML = semistructured data

20

Semi-structured Data Explained

• Missing attributes:

<person> <name> John</name> <phone>1234</phone> </person>
<person> <name>Joe</name> </person>   (no phone!)

• Could represent in a table with nulls:

name   phone
John   1234
Joe    -

21

Semi-structured Data Explained

• Repeated attributes:

<person> <name> Mary</name> <phone>2345</phone> <phone>3456</phone> </person>   (two phones!)

• Impossible in tables:

name   phone
Mary   2345   3456   ???
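Code that reads semistructured data has to expect zero, one, or many occurrences of a child element. The sketch below shows one way to do that with Python's standard `xml.etree.ElementTree`; the `<persons>` wrapper element is hypothetical, added only so the three person fragments from the slides form one document.

```python
import xml.etree.ElementTree as ET

doc = """<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
</persons>"""

# findall returns a (possibly empty) list, so missing and repeated
# phones are handled by the same code path.
phones = {
    person.findtext("name"): [p.text for p in person.findall("phone")]
    for person in ET.fromstring(doc)
}

for name, numbers in phones.items():
    print(name, numbers)
```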

22

Semistructured Data Explained

• Attributes with different types in different objects:

<person> <name> <first> John </first> <last> Smith </last> </name> <phone>1234</phone> </person>   (structured name!)

• Nested collections (no 1NF)

• Heterogeneous collections:
  – <db> contains both <book>s and <publisher>s

23

Document Type Definitions (DTD)

• part of the original XML specification

• an XML document may have a DTD

• XML document:
  – well-formed = tags are correctly closed
  – valid = it has a DTD and conforms to it

• validation is useful in data exchange

24

Very Simple DTD

<!DOCTYPE company [
  <!ELEMENT company ((person|product)*)>
  <!ELEMENT person (ssn, name, office, phone?)>
  <!ELEMENT ssn (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT office (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT product (pid, name, description?)>
  <!ELEMENT pid (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>

25

Very Simple DTD

Example of valid XML document:

<company>
  <person>
    <ssn> 123456789 </ssn>
    <name> John </name>
    <office> B432 </office>
    <phone> 1234 </phone>
  </person>
  <person>
    <ssn> 987654321 </ssn>
    <name> Jim </name>
    <office> B123 </office>
  </person>
  <product> ... </product>
  ...
</company>

26

DTD: The Content Model

<!ELEMENT tag (CONTENT)>   (CONTENT is the content model)

• Content model:
  – Complex = a regular expression over other elements
  – Text-only = #PCDATA
  – Empty = EMPTY
  – Any = ANY
  – Mixed content = (#PCDATA | A | B | C)*

27

DTD: Regular Expressions

Each DTD declaration is followed by an XML fragment that conforms to it.

sequence:

<!ELEMENT name (firstName, lastName)>

<name> <firstName> . . . . . </firstName> <lastName> . . . . . </lastName> </name>

optional:

<!ELEMENT name (firstName?, lastName)>

<name> <lastName> . . . . . </lastName> </name>

star (repeated occurrence):

<!ELEMENT person (name, phone*)>

<person> <name> . . . . . </name> <phone> . . . . . </phone> <phone> . . . . . </phone> <phone> . . . . . </phone> . . . . . . </person>

alternation:

<!ELEMENT person (name, (phone|email))>

<person> <name> . . . . . </name> <email> . . . . . </email> </person>
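Because a complex content model is a regular expression over element names, it can be checked with an ordinary regex engine. The sketch below is a toy checker, not a DTD validator: the helpers `model_to_regex` and `conforms` are invented for this illustration, they handle only element-content models (sequence, `|`, `?`, `*`, `+`), and they ignore #PCDATA, EMPTY, ANY, and attributes.

```python
import re
import xml.etree.ElementTree as ET

NAME = re.compile(r"[A-Za-z_][\w.\-]*")

def model_to_regex(model: str) -> re.Pattern:
    """Turn a model like '(name, phone*)' into a regex over 'tag;' tokens."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that tag plus a ';' separator.
    body = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + ";)", compact)
    # ',' means sequence, which is plain concatenation in a regex.
    return re.compile(body.replace(",", ""))

def conforms(xml_text: str, model: str) -> bool:
    """Check the direct children of the root element against the model."""
    elem = ET.fromstring(xml_text)
    children = "".join(child.tag + ";" for child in elem)
    return model_to_regex(model).fullmatch(children) is not None

checks = [
    conforms("<name><firstName/><lastName/></name>", "(firstName, lastName)"),
    conforms("<name><lastName/></name>", "(firstName, lastName)"),
    conforms("<name><lastName/></name>", "(firstName?, lastName)"),
    conforms("<person><name/><phone/><phone/><phone/></person>", "(name, phone*)"),
    conforms("<person><name/><email/></person>", "(name, (phone|email))"),
    conforms("<person><name/><phone/><email/></person>", "(name, (phone|email))"),
]
print(checks)
```

The second and last checks fail: a missing firstName violates the plain sequence, and the alternation allows a phone or an email but not both.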

28

DTD: Attributes

• Document Type Definition

<!ELEMENT person (ssn, name, office, phone?)>
<!ATTLIST person
  age          CDATA          #REQUIRED
  birthdate    CDATA          #IMPLIED
  nationality  CDATA          #FIXED "CH"
  gender       (male|female)  "female">

  – age: mandatory (#REQUIRED takes no default value)
  – birthdate: optional
  – nationality: fixed value
  – gender: enumeration, with default "female"

• Document

<person age="24" nationality="CH" gender="male">
  <ssn> … </ssn> … <phone> … </phone>
</person>

29

DTD: Entities

• DTD:

<!ENTITY address SYSTEM "address.xml">   (external entity)
<!ENTITY name "<name>Tim Berners Lee</name>">   (internal entity)

• Document:

<celebrity>&name;&address;</celebrity>

30

Inclusion of DTD in Documents

External DTD declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE test PUBLIC "-//Test AG//DTD test V1.0//EN" "http://www.test.org/test.dtd">
<test> "test" is a document element </test>

Mixed usage:

<!DOCTYPE test SYSTEM "http://www.test.org/test.dtd" [
  <!ENTITY hello "hello world">
]>
<test>&hello;</test>

Internal DTD declaration:

<!DOCTYPE test [ <!ELEMENT test EMPTY> ]>
<test/>
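An entity declared in the internal DTD subset is expanded by the parser before the application sees the text. The sketch below shows this with Python's standard `xml.etree.ElementTree`, which is a non-validating parser. The external `SYSTEM` reference from the slide is left out on purpose, as a simplifying assumption, so that only the internal subset is involved.

```python
import xml.etree.ElementTree as ET

# Internal subset only: one general entity named hello.
doc = """<!DOCTYPE test [
  <!ENTITY hello "hello world">
]>
<test>&hello;</test>"""

root = ET.fromstring(doc)
print(root.tag, repr(root.text))
```

Note that parsing succeeds whether or not the document conforms to its DTD: this parser checks well-formedness, not validity.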

31

XML Namespaces

• Different DTDs can use the same names!
  – how to avoid conflicts when combining names from different DTDs?

• An XML namespace is a collection of names (markup vocabulary)
  – identified by a URI reference, which a document binds to a prefix

32

XML Namespaces

• name ::= [prefix:]localname

<book xmlns='urn:loc.gov:book'
      xmlns:isbn='www.isbn-org.org/def'>
  <title> … </title>
  <number> 15 </number>
  <isbn:number> …. </isbn:number>
</book>

xmlns='urn:loc.gov:book' declares the default namespace; the unprefixed names (book, title, number) belong to the default namespace.

33

XML Namespaces

<tag xmlns:mystyle="http://…">
  <mystyle:title> … </mystyle:title>
  <mystyle:number> …
</tag>

mystyle:title and mystyle:number belong to this namespace.

• syntactic: <number>, <isbn:number>

• semantic: URL used as unique identifier
  – URL may not exist, has no function
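A namespace-aware parser resolves each prefix to its URI, so `<number>` and `<isbn:number>` become two different names. The sketch below uses Python's standard `xml.etree.ElementTree`, which writes resolved names as `{uri}localname`. The element text values and the query prefixes `b` and `i` are placeholders chosen here, since the slide elides the content.

```python
import xml.etree.ElementTree as ET

# The book example from the earlier slide, with placeholder text values.
doc = """<book xmlns='urn:loc.gov:book'
      xmlns:isbn='www.isbn-org.org/def'>
  <title>placeholder title</title>
  <number>15</number>
  <isbn:number>placeholder isbn</isbn:number>
</book>"""

root = ET.fromstring(doc)
tags = [child.tag for child in root]
print(root.tag)
print(tags)

# Prefixes used in queries are local to the query; only the URIs must match.
ns = {"b": "urn:loc.gov:book", "i": "www.isbn-org.org/def"}
plain_number = root.findtext("b:number", namespaces=ns)
isbn_number = root.findtext("i:number", namespaces=ns)
print(plain_number, isbn_number)
```

The URIs are compared as plain strings and never dereferenced, which matches the slide's point that the URL may not exist.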

34

Querying XML Data

• XPath = simple navigation through the tree

• XQuery = the SQL of XML

• XSLT = recursive traversal
  – will not discuss

• XQuery and XSLT build on XPath

35

Sample Data for Queries

<bib>
  <book>
    <publisher> Addison-Wesley </publisher>
    <author> Serge Abiteboul </author>
    <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
    <author> Victor Vianu </author>
    <title> Foundations of Databases </title>
    <year> 1995 </year>
  </book>
  <book price="55">
    <publisher> Freeman </publisher>
    <author> Jeffrey D. Ullman </author>
    <title> Principles of Database and Knowledge Base Systems </title>
    <year> 1998 </year>
  </book>
</bib>

36

Data Model for XPath

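[Figure: the sample data drawn as a tree. "The root" is a node above bib; bib is "the root element". bib has two book children; the first book has children publisher (Addison-Wesley), author (Serge Abiteboul), . . . .]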

37

XPath: Simple Expressions

/bib/book/year

Result: <year> 1995 </year>
        <year> 1998 </year>

/bib/paper/year

Result: empty (there were no papers)

38

XPath: Restricted Kleene Closure

//author

Result: <author> Serge Abiteboul </author>
        <author> <first-name> Rick </first-name> <last-name> Hull </last-name> </author>
        <author> Victor Vianu </author>
        <author> Jeffrey D. Ullman </author>

/bib//first-name

Result: <first-name> Rick </first-name>

39

XPath: Text Nodes

/bib/book/author/text()

Result: Serge Abiteboul
        Victor Vianu
        Jeffrey D. Ullman

Rick Hull doesn't appear because his author element contains first-name and last-name elements rather than text

Functions in XPath:
  – text() = matches the text value
  – node() = matches any node (= * or @* or text())
  – name() = returns the name of the current tag

40

XPath: Wildcard

//author/*

Result: <first-name> Rick </first-name>
        <last-name> Hull </last-name>

* matches any element

41

XPath: Attribute Nodes

/bib/book/@price

Result: "55"

@price means that price has to be an attribute

42

XPath: Predicates

/bib/book/author[first-name]

Result: <author> <first-name> Rick </first-name>
                 <last-name> Hull </last-name>
        </author>
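Several of the path expressions from the preceding slides can be tried against the sample data with Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath. In this sketch, paths are written relative to the bib root element (`./book/year` rather than `/bib/book/year`), and `text()` and `@price` steps, which this library does not support as path steps, are emulated with `.text` and `.get()`. Whitespace inside the elements is dropped to keep the results tidy.

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>"""

bib = ET.fromstring(doc)  # the root element; paths below start from it

years = [y.text for y in bib.findall("./book/year")]            # /bib/book/year
papers = bib.findall("./paper/year")                            # /bib/paper/year
authors = bib.findall(".//author")                              # //author
first_names = [e.text for e in bib.findall(".//first-name")]    # /bib//first-name
author_children = [e.tag for e in bib.findall(".//author/*")]   # //author/*
structured = bib.findall("./book/author[first-name]")           # author[first-name]

# Emulating /bib/book/author/text(): keep authors that directly contain text.
author_texts = [a.text for a in bib.findall("./book/author") if a.text]

# Emulating /bib/book/@price: select books having the attribute, then read it.
prices = [b.get("price") for b in bib.findall("./book[@price]")]

print(years, papers, len(authors), first_names)
print(author_children, len(structured), author_texts, prices)
```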

43

XPath: More Predicates

/bib/book/author[first-name][address[.//zip][city]]/last-name

Result: <last-name> … </last-name>
        <last-name> … </last-name>

(illustrative: the sample data above contains no address elements)

44

XPath: More Predicates

/bib/book[@price < "60"]

/bib/book[author/@age < "25"]

/bib/book[author/text()]
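Comparison predicates such as `[@price < "60"]` are outside the XPath subset of Python's standard `xml.etree.ElementTree`, so one workaround is to select candidates with a supported path and apply the comparison in Python. The sketch below does that for the first and third expressions, with paths relative to the bib root element. The abbreviated sample data keeps only the fields the predicates touch, and interpreting the price as a number is an assumption made here.

```python
import xml.etree.ElementTree as ET

doc = """<bib>
  <book>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <title>Foundations of Databases</title>
  </book>
  <book price="55">
    <author>Jeffrey D. Ullman</author>
    <title>Principles of Database and Knowledge Base Systems</title>
  </book>
</bib>"""

bib = ET.fromstring(doc)

# /bib/book[@price < "60"]: books without a price attribute never match.
cheap = [
    book for book in bib.findall("./book[@price]")
    if float(book.get("price")) < 60
]
cheap_titles = [book.findtext("title") for book in cheap]

# /bib/book[author/text()]: books with at least one author holding text directly.
with_text_author = [
    book for book in bib.findall("./book")
    if any((a.text or "").strip() for a in book.findall("author"))
]

print(cheap_titles)
print(len(with_text_author))
```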

45

XPath: Summary

bib                                       matches a bib element
*                                         matches any element
/                                         matches the root
/bib                                      matches a bib element under root
bib/paper                                 matches a paper in bib
bib//paper                                matches a paper in bib, at any depth
//paper                                   matches a paper at any depth
paper|book                                matches a paper or a book
@price                                    matches a price attribute
bib/book/@price                           matches price attribute in book, in bib
bib/book[@price<"55"]/author/last-name    matches…