Models and languages for semistructured data Bridging documents and databases.

  • date post

    22-Dec-2015
  • Category

    Documents


Models and languages for semistructured data

Bridging documents and databases

Lectures

1. Introduction to data models
2. Query languages for relational databases
3. Models and query languages for object databases
4. Models and query languages for semistructured data, XML
5. Embedded query languages
6. Guest lecture on Object Role Modelling

Why do we like types?

Types facilitate understanding

Types enable compact representations

Types enable query optimisation

Types facilitate consistency enforcement

Background assumptions for typed data

Data stable over time

Organisational body to control data

Exercise: Give an example of a context where these assumptions do not hold

Semistructured data

Semistructured data is schemaless and self-describing

The data and the description of the data are integrated

An example

{name: {first: “John”, last: “Smith”}, tel: 112233, email: “[email protected]”}

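[Figure: the same data drawn as a tree. The root has edges name, tel and email; the name node has edges first and last; the leaves are “John”, “Smith”, 112233 and the email string.]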

Another example

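[Figure: a graph whose root has two person edges, leading to the nodes &o1 and &o2. Node &o1 has edges name (“Eva”) and age (40); node &o2 has edges name (“Abel”) and age (20); a child edge leads from &o1 to &o2.]

{person: &o1{name: “Eva”, age: 40, child: &o2},
 person: &o2{name: “Abel”, age: 20}}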

An object identifier, such as &o1, before a structure, binds the object identifier to the identity of that structure. The object identifier can then be used to refer to the structure.
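As a rough illustration, the example can be modelled in Python as a table from object identifiers to structures, with references stored as identifier strings. The names `objects`, `root`, `deref` and `get` are assumptions made for this sketch and are not part of the ssd notation.

```python
# Each object identifier is bound to one structure, stored as a list of
# (label, value) edges. Lists are used because a label may occur more than once.
objects = {
    "&o1": [("name", "Eva"), ("age", 40), ("child", "&o2")],
    "&o2": [("name", "Abel"), ("age", 20)],
}

# The root has two edges labelled "person", referring to &o1 and &o2.
root = [("person", "&o1"), ("person", "&o2")]


def deref(value):
    """Follow an object identifier to its structure; return other values unchanged."""
    if isinstance(value, str) and value.startswith("&"):
        return objects[value]
    return value


def get(structure, label):
    """Return the (dereferenced) values of all edges with the given label."""
    return [deref(v) for (l, v) in structure if l == label]


eva = deref(root[0][1])        # the structure bound to &o1
child = get(eva, "child")[0]   # follows the reference &o2
child_names = get(child, "name")
print(child_names)
```

Storing the reference as an identifier rather than a nested copy is what lets several edges point at the same structure.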

Terminology

The following is an ssd-expression:

&o1{name: “Eva”, age: 40, child: &o2}

Object identifier: &o1

Labels: name, age, child

Values: “Eva”, 40, &o2

A database

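[Figure: the example database drawn as a tree. The root db has an edge biblio. Below biblio are one paper (node n1: authors Crick and Wallace, title DNA spiral, date 1956) and two books (node n2: author Darwin, title Origin, date 1848; node n3: author Marx, title Kapital, date 1860), followed by further entries (…).]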

Path expressions

A path expression is a sequence of labels: l1.l2…ln

A path expression results in a set of nodes

Path properties are specified by regular expressions on two levels: on the alphabet of labels and on the alphabet of characters that comprise labels

A path expression

[Figure: the example database tree, as shown under “A database”.]

biblio.book.author

A path expression

[Figure: the example database tree, as shown under “A database”.]

biblio.(book | paper).author

Examples of path expressions

biblio.book.author - authors of books

biblio.paper.author - authors of papers

biblio.(book | paper).author - authors of books or papers

biblio._.author - authors of anything

biblio._*.author - nodes at the ends of paths starting with biblio, ending with author, and having an arbitrary sequence of labels between
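The path expressions above can be tried out with a small evaluator. This is only a sketch under stated assumptions: the database is a tree of (label, value) edge lists, alternatives are written without spaces as `(book|paper)`, and only plain labels, `_` and `_*` are supported. The function names `children`, `descendants` and `evaluate` are invented for the illustration.

```python
# The example database as nested lists of (label, value) edges.
db = [("biblio", [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
])]


def children(node):
    """Edges leaving a node; atomic values have none."""
    return node if isinstance(node, list) else []


def descendants(node):
    """The node itself plus every node below it (used for the `_*` step)."""
    found = [node]
    for _, child in children(node):
        found.extend(descendants(child))
    return found


def evaluate(path, root):
    """Evaluate a dot-separated path expression and return the nodes reached."""
    nodes = [root]
    for step in path.split("."):
        if step == "_*":
            nodes = [d for n in nodes for d in descendants(n)]
            continue
        # `_` accepts any label; otherwise collect the allowed alternatives.
        allowed = None if step == "_" else set(step.strip("()").split("|"))
        nodes = [child for n in nodes for label, child in children(n)
                 if allowed is None or label in allowed]
    return nodes


print(evaluate("biblio.book.author", db))
print(evaluate("biblio.(book|paper).author", db))
print(evaluate("biblio._*.author", db))
```

The result is returned as a list in document order; a fuller implementation would return a set of nodes and would also need to handle cycles introduced by object identifiers.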

Example of a label pattern

((b | B)ook | (a | A)uthor)(s)? - book, Book, author, Author, books, Books, authors, Authors
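The label pattern can be checked with an ordinary regular expression library. The sketch below uses Python's `re` module; the helper name `label_matches` and the list of test labels are assumptions for the illustration.

```python
import re

# The label pattern from the slide, written in Python regex syntax.
LABEL_PATTERN = re.compile(r"((b|B)ook|(a|A)uthor)(s)?")


def label_matches(label):
    """True when the whole label (not just a prefix) matches the pattern."""
    return LABEL_PATTERN.fullmatch(label) is not None


labels = ["book", "Books", "author", "Authors", "title", "bookss"]
matching = [label for label in labels if label_matches(label)]
print(matching)
```

`fullmatch` is used so that a label such as "bookss" is rejected rather than matched on its first five characters.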

An exercise

biblio._*.author.(“[s | S]ection”)

Which ones of the following paths match the path expression above?

1. Biblio.author.Section
2. Biblio.cat.rat.hat.author.section
3. Biblio.author
4. Biblio.cat.author.section.Section

A simple query

select author: X
from biblio.book.author X

Result: {author: “Darwin”, author: “Marx”}

A query with a condition

select row: X
from biblio._ X
where “Crick” in X.author

Result:
{row: {author: “Crick”, author: “Wallace”, date: 1956, title: “The spiral DNA”}, …}
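One way to read select-from-where is as a comprehension: the from clause binds variables to nodes, the where clause filters the bindings, and the select clause builds one labelled result per binding. The sketch below assumes a tree of (label, value) edge lists; the helper `sub` is a name invented for the illustration.

```python
db = [("biblio", [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("date", 1956), ("title", "The spiral DNA")]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
    ("book", [("author", "Marx"), ("title", "Kapital"), ("date", 1860)]),
])]


def sub(node, label):
    """Values of all edges of `node` that carry `label` (empty for atomic values)."""
    if not isinstance(node, list):
        return []
    return [value for (l, value) in node if l == label]


biblio = sub(db, "biblio")[0]

# select author: X from biblio.book.author X
simple = [("author", x)
          for book in sub(biblio, "book")
          for x in sub(book, "author")]

# select row: X from biblio._ X where "Crick" in X.author
with_condition = [("row", x)
                  for (_, x) in biblio
                  if "Crick" in sub(x, "author")]

print(simple)
print(with_condition)
```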

Two exercises

select row: {title: Y, date: Z}
from biblio.paper X, X.title Y, X.date Z

select row: {author: Y, date: Z}
from biblio.book X, X.author Y, X.date Z

A database

[Figure: the example database tree, with one paper (n1) and two books (n2, n3) below biblio.]

select row: {title: Y, date: Z}
from biblio.paper X, X.title Y, X.date Z

Nested queries

select row: (select author: Y from X.author Y)

from biblio.book X

Three exercises

Which authors have written a book or a paper in 1992?

Which authors have written a book together with Jones?

Which authors have written both a book and a paper?

Expressing relations

r1:

a b c
1 2 3
3 2 2
4 3 1

r2:

b d e
1 1 3
3 4 2
2 3 1

{ r1: { row: {a: 1, b: 2, c: 3}, row: {a: 3, b: 2, c: 2}, row: {a: 4, b: 3, c: 1} },
  r2: { row: {b: 1, d: 1, e: 3}, row: {b: 3, d: 4, e: 2}, row: {b: 2, d: 3, e: 1} } }

Expressing relational joins

select a: A, d: D
from r1.row X, r2.row Y, X.a A, X.b B, Y.b B’, Y.d D
where B = B’
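The join query can be mimicked with a nested comprehension over the rows of the two relations. This sketch assumes the tables read row by row as (1, 2, 3), (3, 2, 2), (4, 3, 1) for r1 and (1, 1, 3), (3, 4, 2), (2, 3, 1) for r2. Each row is a dictionary because its labels are unique; the rows themselves are kept in a list because the label row repeats.

```python
db = {
    "r1": [{"a": 1, "b": 2, "c": 3},
           {"a": 3, "b": 2, "c": 2},
           {"a": 4, "b": 3, "c": 1}],
    "r2": [{"b": 1, "d": 1, "e": 3},
           {"b": 3, "d": 4, "e": 2},
           {"b": 2, "d": 3, "e": 1}],
}

# select a: A, d: D
# from r1.row X, r2.row Y, X.a A, X.b B, Y.b B', Y.d D
# where B = B'
joined = [{"a": x["a"], "d": y["d"]}
          for x in db["r1"]
          for y in db["r2"]
          if x["b"] == y["b"]]

print(joined)
```

This is a plain nested-loop join; it is meant to show the meaning of the query, not an efficient evaluation strategy.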

Label variables

select L: X
from biblio._*.L X
where matches(“.*Shakespeare.*”, X)

L is a label variable.

[Figure: a database in which biblio has two books: n2 (author Shakespeare, title Macbeth, date 1622) and n3 (author Smith, title Best of Shakespeare, date 1992), followed by further entries (…).]

Label variables

select L: X
from biblio._*.L X
where matches(“.*Shakespeare.*”, X)

{author: “Shakespeare”, title: “Best of Shakespeare”}
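A label variable binds to the label of an edge rather than to a node. A minimal sketch of the query, assuming a tree of (label, value) edge lists and a hypothetical helper `edges_below` that enumerates every edge at any depth (the `_*` part of the path):

```python
import re

db = [("biblio", [
    ("book", [("author", "Shakespeare"), ("title", "Macbeth"), ("date", 1622)]),
    ("book", [("author", "Smith"), ("title", "Best of Shakespeare"), ("date", 1992)]),
])]


def edges_below(node):
    """Yield every (label, value) edge at any depth below `node`."""
    if isinstance(node, list):
        for label, value in node:
            yield label, value
            yield from edges_below(value)


biblio = db[0][1]

# select L: X from biblio._*.L X where matches(".*Shakespeare.*", X)
result = [(label, x)
          for label, x in edges_below(biblio)
          if isinstance(x, str) and re.fullmatch(".*Shakespeare.*", x)]

print(result)
```

The label that ends up in the result (author or title) is not known when the query is written; it is taken from the data.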

Turning labels into data

select publ: {type: L, author: A}

from biblio.L X, X.author A

[Figure: the example database restricted to the paper n1 (authors Crick and Wallace, title DNA spiral, date 1956) and the book n2 (author Darwin, title Origin, date 1848).]

{publ: {type: “paper”, author: “Crick”},
 publ: {type: “paper”, author: “Wallace”},
 publ: {type: “book”, author: “Darwin”}}
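The same query can be sketched in Python, where the label bound to L is simply copied into the result as a value. The edge-list representation is an assumption of this sketch.

```python
# The part of the database shown above: one paper and one book below biblio.
biblio = [
    ("paper", [("author", "Crick"), ("author", "Wallace"),
               ("title", "The spiral DNA"), ("date", 1956)]),
    ("book", [("author", "Darwin"), ("title", "Origin"), ("date", 1848)]),
]

# select publ: {type: L, author: A}
# from biblio.L X, X.author A
publications = [("publ", [("type", label), ("author", a)])
                for label, x in biblio
                for inner_label, a in x
                if inner_label == "author"]

for publication in publications:
    print(publication)
```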

An exercise

List all publications in 1992, their types, and titles.

Basic XML syntax

XML is a textual representation of data

An element is a text bounded by tags

<name> John </name>

Here <name> is the start-tag, John is the content and </name> is the end-tag; together they form an element.

<name> </name> can be abbreviated as <name/>

Basic XML syntax

Elements may contain subelements

<person>
  <name> John </name>
  <tel> 112233 </tel>
  <email> [email protected] </email>
</person>

XML attributes

An attribute is defined by a name-value pair within a tag

<price currency = “dollar”> 500 </price>

<length unit = “cm”> 25 </length>

XML attributes and elements

<product>
  <name> widget </name>
  <price> 10 </price>
</product>

<product price = “10”>
  <name> widget </name>
</product>

<product name = “widget” price = “10”/>

XML and ssd-expressions

<person>
  <name> John </name>
  <tel> 112233 </tel>
  <email> [email protected] </email>
</person>

{person: {name: “John”, tel: 112233, email: “[email protected]”}}
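The correspondence can be sketched with Python's standard `xml.etree.ElementTree` parser. The function name `to_ssd` is an assumption, and the input is a shortened version of the person element with only name and tel.

```python
import xml.etree.ElementTree as ET


def to_ssd(element):
    """Turn an element into a (label, value) edge.

    Elements without subelements become (tag, stripped text);
    other elements become (tag, list of child edges).
    """
    kids = list(element)
    if not kids:
        return (element.tag, (element.text or "").strip())
    return (element.tag, [to_ssd(kid) for kid in kids])


xml_text = """
<person>
  <name> John </name>
  <tel> 112233 </tel>
</person>
"""

edge = to_ssd(ET.fromstring(xml_text))
print(edge)
```

Note that tel comes out as the string "112233", not the number 112233: element content in XML is text, so any typing has to be added by the application. Attributes and mixed content are ignored in this sketch.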

XML references

<person id = “p1”>
  <name> John </name>
  <tel> 112233 </tel>
</person>

<person id = “p2”>
  <name> Peter </name>
  <tel> 998877 </tel>
  <boss idref = “p1”/>
</person>

The id attribute is an element identifier; the idref attribute is a reference attribute.
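A reference is followed by looking up the element that carries the matching identifier. The sketch below does this with `xml.etree.ElementTree`; the enclosing people element is an assumption added so that the input has a single root.

```python
import xml.etree.ElementTree as ET

xml_text = """
<people>
  <person id="p1"><name> John </name><tel> 112233 </tel></person>
  <person id="p2"><name> Peter </name><tel> 998877 </tel><boss idref="p1"/></person>
</people>
"""

root = ET.fromstring(xml_text)

# Index every person by its element identifier.
by_id = {person.get("id"): person for person in root.iter("person")}

# Follow the reference attribute of Peter's boss element.
peter = by_id["p2"]
boss = by_id[peter.find("boss").get("idref")]
boss_name = boss.findtext("name").strip()
print(boss_name)
```

The parser itself treats id and idref as ordinary attributes; the link between them exists only because the application builds the index.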

Document Type Definitions

<!DOCTYPE db [
  <!ELEMENT db (person*)>
  <!ELEMENT person (name, age, email)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT age (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>

An exercise on DTDs as schemas

<db>
  <r1> <a> a1 </a> <b> b1 </b> </r1>
  <r1> <a> a2 </a> <b> b2 </b> </r1>
  <r2> <c> a1 </c> <d> b1 </d> </r2>
  <r2> <c> c2 </c> <d> d2 </d> </r2>
  <r3> <a> a1 </a> <c> b1 </c> </r3>
</db>

Write down a DTD for the data above!

Attributes in DTDs

<product>
  <name language = “Swedish” department = “music”> trumpet </name>
  <price currency = “dollar”> 500 </price>
  <length unit = “cm”> 25 </length>
</product>

<!ATTLIST name language CDATA #REQUIRED
               department CDATA #IMPLIED>
<!ATTLIST price currency CDATA #REQUIRED>
<!ATTLIST length unit CDATA #REQUIRED>

Reference attributes in DTDs

<!DOCTYPE people [
  <!ELEMENT people (person*)>
  <!ELEMENT person (name)>
  <!ELEMENT name (#PCDATA)>
  <!ATTLIST person id ID #REQUIRED
                   boss IDREF #REQUIRED
                   friends IDREFS #IMPLIED>
]>
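The attribute declarations above amount to three checks: every id is present and unique, every boss is present and refers to an existing id, and every name listed in friends refers to an existing id. Python's standard `xml.etree.ElementTree` parser does not validate against a DTD, so the sketch below spells the checks out by hand. The function `check_people` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_people(xml_text):
    """Return a list of violations of the person attribute declarations."""
    root = ET.fromstring(xml_text)
    persons = root.findall("person")
    ids = [p.get("id") for p in persons]
    problems = []
    for p in persons:
        pid = p.get("id")
        # id ID #REQUIRED: present and unique within the document.
        if pid is None:
            problems.append("person without required id")
        elif ids.count(pid) > 1:
            problems.append(f"{pid}: id is not unique")
        # boss IDREF #REQUIRED: present and equal to exactly one existing id.
        boss = p.get("boss")
        if boss is None:
            problems.append(f"{pid}: required boss is missing")
        elif boss not in ids:
            problems.append(f"{pid}: boss {boss!r} is not an existing id")
        # friends IDREFS #IMPLIED: optional, space-separated existing ids.
        for friend in (p.get("friends") or "").split():
            if friend not in ids:
                problems.append(f"{pid}: friend {friend!r} is not an existing id")
    return problems


# Hypothetical sample data, not taken from the lecture.
sample = """
<people>
  <person id="anna" boss="bo" friends="bo"><name> Anna </name></person>
  <person id="bo" friends="anna carl"><name> Bo </name></person>
</people>
"""

problems = check_people(sample)
for problem in problems:
    print(problem)
```

A validating parser would also check the element declarations (content models); this sketch covers the attributes only.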

An exercise

<people>
  <person id = “sven” boss = “olle”>
    <name> Sven Svensson </name>
  </person>
  <person id = “olle” friends = “nils eva”>
    <name> Olle Olsson </name>
  </person>
  <person id = “pelle” boss = “nils eva”>
    <name> Per Persson </name>
  </person>
</people>

Does this XML element conform to the previous DTD?

Limitations of DTDs as schemas

DTDs impose order

No base types

The types of IDREFs cannot be constrained

XSL - extensible stylesheet language

<bib>
  <book>
    <title> t1 </title>
    <author> a1 </author>
    <author> a2 </author>
  </book>
  <paper>
    <title> t2 </title>
    <author> a3 </author>
    <author> a4 </author>
  </paper>
  <book>
    <title> t3 </title>
    <author> a5 </author>
    <author> a6 </author>
  </book>
</bib>

Template rules and XSL patterns

<xsl:template>
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match = “bib/*/title”>
  <result>
    <xsl:value-of/>
  </result>
</xsl:template>

Each xsl:template element is a template rule. The value of its match attribute, here bib/*/title, is an XSL pattern.

Output:

<result> t1 </result>
<result> t2 </result>
<result> t3 </result>
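The effect of the pattern bib/*/title can be approximated without an XSLT processor. The sketch below uses the limited path syntax of Python's `xml.etree.ElementTree`, which is an analogue of the pattern and not XSL itself; Python's standard library has no XSLT support.

```python
import xml.etree.ElementTree as ET

xml_text = """
<bib>
  <book>
    <title> t1 </title>
    <author> a1 </author> <author> a2 </author>
  </book>
  <paper>
    <title> t2 </title>
    <author> a3 </author> <author> a4 </author>
  </paper>
  <book>
    <title> t3 </title>
    <author> a5 </author> <author> a6 </author>
  </book>
</bib>
"""

bib = ET.fromstring(xml_text)

# "./*/title" selects title elements that are grandchildren of bib,
# whatever the element in between is called (book or paper).
titles = [title.text.strip() for title in bib.findall("./*/title")]

# Wrap each value in a result element, as the second template rule does.
results = [f"<result> {t} </result>" for t in titles]
print("\n".join(results))
```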

Two exercises

select row: {title: Y, date: Z}
from biblio.paper X, X.title Y, X.date Z

{row: {title: “The spiral DNA”, date: 1956}, {title: “Origin”, date: 1848}, {title: “Kapital”, date: 1860}}

select row: {author: Y, date: Z}
from biblio.book X, X.author Y, X.date Z

Which authors have written a book or a paper in 1992?

select author: X
from biblio.(book | paper) Y, Y.author X
where Y.date = 1992

Which authors have written a book together with Jones?

select author: X
from biblio.book Y, Y.author X
where “Jones” in Y.author

Which authors have written both a book and a paper?

select author: A
from biblio.book B, biblio.paper P, B.author A
where B.author = P.author

select author: A1
from biblio.book B, biblio.paper P, B.author A1, P.author A2
where A1 = A2

List all publications in 1992, their types, and titles.

select publ: {type: L, title: T}
from biblio.L X, X.title T
where X.date = 1992

<!DOCTYPE db [
  <!ELEMENT db (r1*, r2*, r3*)>
  <!ELEMENT r1 (a, b)>
  <!ELEMENT r2 (c, d)>
  <!ELEMENT r3 (a, c)>
  <!ELEMENT a (#PCDATA)>
  <!ELEMENT b (#PCDATA)>
  <!ELEMENT c (#PCDATA)>
  <!ELEMENT d (#PCDATA)>
]>

<db>
  <r1> <a> a1 </a> <b> b1 </b> </r1>
  <r1> <a> a2 </a> <b> b2 </b> </r1>
  <r2> <c> a1 </c> <d> b1 </d> </r2>
  <r2> <c> c2 </c> <d> d2 </d> </r2>
  <r3> <a> a1 </a> <c> b1 </c> </r3>
</db>