1 Chapter 29 Semistructured Data and XML Transparencies.


Chapter 29

Semistructured Data and XML

Transparencies

Chapter - Objectives

• What semistructured data is.
• Concepts of the Object Exchange Model (OEM), a model for semistructured data.
• Main language elements of XML.
• Difference between well-formed and valid XML documents.
• How Document Type Definitions (DTDs) can be used to define the valid syntax of an XML document.

Chapter - Objectives

• About other related XML technologies.
• Limitations of DTDs and how the W3C XML Schema overcomes these limitations.
• How RDF and RDF Schema provide a foundation for processing meta-data.
• Proposals for a W3C Query Language.

Introduction

XML 1.0 was formally ratified by the W3C in 1998, yet it is already set to impact every aspect of programming, including graphical interfaces, embedded systems, distributed systems, and database management.

It is already becoming the de facto standard for data communication within the software industry, and is quickly replacing EDI systems as the primary medium for data interchange among businesses.

Some analysts believe it will become the language in which most documents are created and stored, both on and off the Internet.

Introduction

Due to the nature of information on the Web and the inherent flexibility of XML, it is expected that much of the data encoded in XML will be semistructured; i.e., the data may be irregular or incomplete, and its structure may change rapidly or unpredictably.

Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.

Semistructured Data

Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably.

Semistructured data is data that has some structure, but the structure may not be rigid, regular, or complete.

Generally, the data does not conform to a fixed schema (sometimes the terms schema-less or self-describing are used to describe such data).

Semistructured Data

The information normally associated with a schema is contained within the data itself.

In some forms of semistructured data there is no separate schema; in others it exists but places only loose constraints on the data.

Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.

Semistructured Data

Has gained importance recently for various reasons:
– may be desirable to treat Web sources like a database, but cannot constrain these sources with a schema;
– may be desirable to have a flexible format for data exchange between disparate databases;
– emergence of XML as standard for data representation and exchange on the Web, and similarity between XML documents and semistructured data.


Example 29.1

Example 29.1

Note, data is not regular:
– for John White, hold first and last names, but for Ann Beech store single name and also store a salary;
– for property at 2 Manor Rd, store a monthly rent whereas for property at 18 Dale Rd, store an annual rent;
– for property at 2 Manor Rd, store property type (flat) as a string, whereas for property at 18 Dale Rd, store type (house) as an integer value.


Example 29.1

Object Exchange Model (OEM)

Data in OEM is schema-less and self-describing, and can be thought of as a labeled directed graph where nodes are objects, consisting of:
– unique object identifier (for example, &7),
– descriptive textual label (street),
– type (string),
– a value (“22 Deer Rd”).

Objects are decomposed into atomic and complex:
– an atomic object contains a value for a base type (e.g., integer or string) and can be recognized in the diagram as one that has no outgoing edges;
– all other objects are complex objects whose type is a set of object identifiers.

Object Exchange Model (OEM)

A label indicates what the object represents and is used to identify the object and to convey the meaning of the object, and so should be as informative as possible.

Labels can change dynamically.

A name is a special label that serves as an alias for a single object and acts as an entry point into the database (for example, DreamHome is a name that denotes object &1).


Object Exchange Model (OEM)

An OEM object can be considered as a quadruple (label, oid, type, value).

For example:

{Staff, &4, set, {&9, &10}}

{name, &9, string, “Ann Beech”}

{salary, &10, decimal, 12000}
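The quadruples above can be sketched in Python as a small table keyed by object identifier. This is only an illustration: the class name `OEMObject`, the `db` dictionary, and the helper functions are assumptions made for the sketch and are not part of OEM itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    """One OEM quadruple (label, oid, type, value); the class name is hypothetical."""
    label: str
    oid: str
    type: str  # "set" for complex objects, otherwise a base type name
    value: Union[str, int, float, frozenset]


# The three objects from the slide, keyed by object identifier.
db = {
    "&4": OEMObject("Staff", "&4", "set", frozenset({"&9", "&10"})),
    "&9": OEMObject("name", "&9", "string", "Ann Beech"),
    "&10": OEMObject("salary", "&10", "decimal", 12000),
}


def is_atomic(obj: OEMObject) -> bool:
    """Atomic objects hold a base-type value; complex objects hold a set of oids."""
    return obj.type != "set"


def children(obj: OEMObject) -> dict:
    """Map each child label of a complex object to that child's value."""
    return {db[oid].label: db[oid].value for oid in obj.value}


staff = db["&4"]
staff_fields = children(staff)
print(is_atomic(staff), sorted(staff_fields))
```

Following an oid through `db` plays the role of following an edge in the labeled directed graph.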

Semistructured Data - Case Study: Object Exchange Model

OEM Features

• Common model for heterogeneous information exchange, self-describing
• Each object: OID, Label, Type, Value
– OID: unique identifier or NULL
– Label: character string descriptor
– Type: atomic data type or set
– Value: atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL

Representing Semistructured Data Using OEM

<collection, {b1, a1, ...}>
  b1: <book, {t, a}>
    t: <title, “Database and ...”>
    a: <author, {n, p}>
      n: <name, “Jeff Ullman”>
      p: <picture, “/gifs/ullman.gif”>
  a1: <article, {v, w, x}>
    v: <author, “Gio Wiederhold”>
    w: <title, “Mediators in the …”>
    x: <journal, “IEEE Computer”>
  ...

[Figure callouts: Label, Set Value, Atomic Value, Memory Addresses]

An OEM Query Language: OEM-QL

• Logic-based language for OEM
– Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
P :-
P:<articles {<journal “IEEE Computer”>}>
• Get titles of books by “Jeff Ullman”
<answer_title T> :-
<book {<author “Jeff Ullman”> <title T>}>
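The pattern-matching idea behind such queries can be imitated in a few lines of Python. This is a rough sketch, not OEM-QL: objects are modelled as (label, value) pairs, the helper names `sub_objects` and `matches` are hypothetical, and the label `article` is taken from the sample data rather than from the query text.

```python
# Each object is a (label, value) pair; a complex value is a list of objects.
collection = ("collection", [
    ("book", [
        ("title", "Database and ..."),
        ("author", [("name", "Jeff Ullman"), ("picture", "/gifs/ullman.gif")]),
    ]),
    ("article", [
        ("author", "Gio Wiederhold"),
        ("title", "Mediators in the …"),
        ("journal", "IEEE Computer"),
    ]),
])


def sub_objects(obj):
    """Return the child objects of a complex object, or an empty list if atomic."""
    _label, value = obj
    return value if isinstance(value, list) else []


def matches(obj, label, required_children):
    """True if obj carries the label and contains every required (label, value) child."""
    return obj[0] == label and all(c in sub_objects(obj) for c in required_children)


# "Get articles published in IEEE Computer", then bind their titles.
articles = [o for o in sub_objects(collection)
            if matches(o, "article", [("journal", "IEEE Computer")])]
titles = [value for o in articles for (label, value) in sub_objects(o) if label == "title"]
print(titles)
```

The list comprehension over `title` children corresponds to generating variable bindings for a pattern variable such as T.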

XML

Vendors introduced some browser-specific HTML tags, making it difficult to develop sophisticated, widely viewable Web documents.

W3C has produced a new standard called XML, which could preserve the general application independence that makes HTML portable and powerful.

XML

XML is a restricted version of SGML, designed especially for Web documents.

SGML allows a document to be logically separated into two parts: one that defines the structure of the document (the DTD), and another containing the text itself.

By giving documents a separately defined structure, and by giving authors the ability to define custom structures, SGML provides an extremely powerful document management system.

However, SGML has not been widely adopted due to its inherent complexity.

XML

XML attempts to provide a similar function to SGML, but is less complex and, at the same time, network-aware.

XML retains key SGML advantages of extensibility, structure, and validation.

Since XML is a restricted form of SGML, any fully compliant SGML system will be able to read XML documents (although the opposite is not true).

XML is not intended as a replacement for SGML or HTML.

XML (eXtensible Markup Language)

origins: HTML + SGML (ISO Standard, 1986, ~600 pp)

W3C standard (~26 pp): XML syntax + DTDs

XML = HTML − presentational tags + user-defined DTD (tags + nesting)

=> a metalanguage for defining other languages via DTDs

=> XML is more like SGML than HTML

XML = SGML − {complexity, document perspective} + {simplicity, data exchange perspective}

Advantages of XML

• Simplicity
• Open standard and platform/vendor-independent
• Extensibility
• Reuse
• Separation of content and presentation
• Improved load balancing

Advantages of XML

• Support for integration of data from multiple sources
• Ability to describe data from a wide variety of applications
• More advanced search engines
• New opportunities.

Why are Database folks so excited about XML?

XML is just a syntax for (self-describing) data

This is still exciting because
– No standard syntax for relational data
– With XML, we can
» Translate any legacy data to XML
» Exchange data in XML format
» Ship over the web, input to any application


XML machine accessible meaning

This is what a web-page in natural language looks like for a machine

XML machine accessible meaning

[Figure: a CV page with tagged sections: CV, name, education, work, private]

XML allows “meaningful tags” to be added to parts of the text

XML machine accessible meaning

[Figure: a CV page with tagged sections: CV, name, education, work, private]

But to your machine, the tags look like this….

XML machine accessible meaning

Schemas help….

[Figure: two CV documents, each with tagged sections: CV, name, education, work, private]

…by relating common terms between documents

But other people use other schemas

[Figure: a CV page with tagged sections: CV, name, education, work, private]

Someone else has one like this….

But other people use other schemas

[Figure: three CV documents, each with tagged sections: CV, name, education, work, private]

…which don’t fit in

Moral: There is still need for ontology mapping.


An HTML document

HTML code

<title>ICS185/ICS180 - Spring, 2003</title>
<body bgcolor="#d0d0ff">
<H2>Index</H2>
<UL>
  <LI> <a HREF = "#announcements">Announcements </a>
  <LI> <a HREF = "#geninfo">Course Information </a>
</UL>

<H2>Course Information</H2>
<a href="geninfo.html">General Information</a>. The following are a few important entries:
<UL>
  <li> <A HREF = "geninfo.html#goals">Course Goals</A><BR>
  <li> <A HREF = "geninfo.html#crsenum">About the course numbers</A><BR>
</UL>
</body>

What is the problem?

To do more fancy things with documents:
– need to make their logical structure explicit.

Otherwise, software applications
– do not know what is what
– do not have any handle over documents.

An XML document

<?xml version="1.0" ?>
<bib>
  <vendor id="id3_4">
    <name>QuickBooks</name>
    <email>[email protected]</email>
    <phone>1-800-333-9999</phone>
    <book>
      <title>Inorganic Chemistry</title>
      <publisher>Brooks/Cole Publishing</publisher>
      <year>1991</year>
      <author>
        <firstname>James</firstname>
        <lastname>Bowser</lastname>
      </author>
      <price>43.72</price>
    </book>
  </vendor>
</bib>
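Because the tags label the data, a program can pull out fields by name. The sketch below uses Python's standard `xml.etree.ElementTree` module on a copy of the document above; the variable names are made up for the example and the `email` element is left out of the copy.

```python
import xml.etree.ElementTree as ET

bib_xml = """<?xml version="1.0" ?>
<bib>
  <vendor id="id3_4">
    <name>QuickBooks</name>
    <phone>1-800-333-9999</phone>
    <book>
      <title>Inorganic Chemistry</title>
      <publisher>Brooks/Cole Publishing</publisher>
      <year>1991</year>
      <author>
        <firstname>James</firstname>
        <lastname>Bowser</lastname>
      </author>
      <price>43.72</price>
    </book>
  </vendor>
</bib>"""

root = ET.fromstring(bib_xml)       # root is the <bib> element
vendor = root.find("vendor")        # first <vendor> child
vendor_id = vendor.get("id")        # attribute lookup

# One (title, author, price) tuple per <book> under the vendor.
books = [
    (
        book.findtext("title"),
        f'{book.findtext("author/firstname")} {book.findtext("author/lastname")}',
        float(book.findtext("price")),
    )
    for book in vendor.findall("book")
]
print(vendor_id, books)
```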


<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

</bibliography>

HTML describes presentation

XML describes content

What is XML?

• eXtensible Markup Language
• Data are identified using tags (identifiers enclosed in angle brackets: <...>)
• Collectively, the tags are known as “markup”
• XML tags tell you what the data means, rather than how to display it

XML versus relational

– Relational: structured
– XML: semi-structured
– Plain text file: unstructured

How does XML work?

XML allows developers to write their own Document Type Definitions (DTD)

DTD is a markup language’s rule book that describes the sets of tags and attributes that are used to describe specific content

If you want to use a certain tag, then it must first be defined in DTD


Key Components in XML

Three generic components, and one customizable component

– XML Content
– DTD Rules
– XML Parser
– Application


Meta Markup Language

Not a language – but a way of specifying other languages

Meta-markup language – gives the rules by which other markup languages can be written

Portable - platform independent

Markup Languages

Presentation based:
– Markup languages that describe information for presentation for human consumption

Content based:
– Describe information that is of interest to another computer application

HTML and XML

HTML tag says "display this data in bold font"
– <b>...</b>

XML tag acts like a field name in your program

It puts a label on a piece of data that identifies it
– <message>...</message>


HTML vs. XML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book> …

</bibliography>

“Self-describing”

- Schema info part of the data

- Good for data exchange (albeit baroque for storage)

Simple Example

XML data for a messaging application:

<message>
  <to>[email protected]</to>
  <from>[email protected]</from>
  <text> Why is it good? Let me count the ways... </text>
</message>

Element

Data between the tag and its matching end tag defines an element of the data

Comment:
– <!-- This is a comment -->

Example

<!-- Using attributes -->
<message to="[email protected]" from="[email protected]">
  <text>Why is it good? Let me count the ways...</text>
</message>

Attributes

Tags can also contain attributes

Attributes contain additional information included as part of the tag, within the tag's angle brackets

Attribute name is followed by an equals sign and the attribute value

Other Basics

White space is essentially irrelevant

Commas between attributes are not ignored - if present, they generate an error

Case sensitive: “message” and “MESSAGE” are different

Well Formed XML

Every tag has a closing tag

XML represents hierarchical data structures having one tag to contain others
– Tags have to be completely nested

Correct:
– <message>..<to>..</to>..</message>

Incorrect:
– <message>..<to>..</message>..</to>
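One way to test these rules is to hand the text to a parser and see whether it is accepted. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags are accepted.
print(is_well_formed("<message><to>a</to></message>"))
# Overlapping tags are rejected.
print(is_well_formed("<message><to>a</message></to>"))
# Tag names are case sensitive, so the end tag below does not match.
print(is_well_formed("<message>a</MESSAGE>"))
```

This checks well-formedness only; it says nothing about whether the document is valid against a DTD.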

Empty Tag

Empty tag is used when it makes sense to have a tag that stands by itself and doesn't enclose any content - a "flag"
– You can create an empty tag by ending it with />
– <flag/>

Example

<message to="[email protected]" from="[email protected]" subject="XML is good">
  <flag/>
  <text> Why is it good? Let me count the ways... </text>
</message>

Tree representation

<BOOKS>
  <book id="123" loc="library">
    <author>Hull</author>
    <title>California</title>
    <year> 1995 </year>
  </book>
  <article id="555" ref="123">
    <author>Su</author>
    <title> Purdue</title>
  </article>
</BOOKS>

[Figure: tree rooted at BOOKS with two children: book (id 123, loc="library"; author Hull, title California, year 1995) and article (id 555; author Su, title Purdue), with a ref link from the article to the book]

Prolog in XML Files

XML file always starts with a prolog

The minimal prolog contains a declaration that identifies the document as an XML document:
<?xml version="1.0"?>

The declaration may also contain additional information
– version - version of the XML used in the data
– encoding - identifies the character set used
– standalone - whether the document references an external entity or data type specification

Detailed Example of XML File

Simple version of the kind of XML data you could use for a slide presentation

You can use your text editor to create the data
– Step 1: create a file named slideSample01.xml
– Step 2: write the declaration, which identifies the file as an XML document
<?xml version='1.0' encoding='us-ascii'?>

Defining the Root Element

– Step 3: Adding a comment…
<!-- A SAMPLE set of slides -->
– Step 4: Defining the Root Element…
<slideshow>
</slideshow>

After the declaration, every XML file defines exactly one element, known as the root element

Any other elements in the file are contained within that element

Attributes

A slide presentation has a title...
<slideshow title="Sample Slide Show">
</slideshow>

Adding Nested Elements

– Step 5: Adding Nested Elements
<slideshow ...>
  <!-- TITLE SLIDE -->
  <slide title="Title of Talk"/>
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML </title>
  </slide>
</slideshow>

Attribute vs. Element

type of the slide is defined as an attribute
– Slides could be earmarked for a mostly technical or mostly executive audience with type="tech" or type="exec", or identified as suitable for both with type="all"

title is defined as an element

The title is something the audience will see
– So it is an element

The type is something that never gets presented
– So it is an attribute

Adding Text

– Step 6: Adding Text
<slideshow>
  …
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item>Who uses it?</item>
  </slide>
</slideshow>

Adding an Empty Element

– Step 7: Adding an Empty Element
<slideshow>
  …
  <!-- OVERVIEW -->
  <slide>
    …
    <!-- define an empty list item -->
    <item/>
    …
  </slide>
</slideshow>

Complete Example

<?xml version="1.0" encoding="us-ascii" ?>
<!-- A SAMPLE set of slides -->
<slideshow title="Sample Slide Show">
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>
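To see how an application might read the completed slide show, here is a small sketch using Python's standard `xml.etree.ElementTree`. The variable names are assumptions for the example; the document is passed as bytes so that its own encoding declaration is honoured.

```python
import xml.etree.ElementTree as ET

slideshow_xml = b"""<?xml version="1.0" encoding="us-ascii" ?>
<!-- A SAMPLE set of slides -->
<slideshow title="Sample Slide Show">
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>"""

root = ET.fromstring(slideshow_xml)   # the root element <slideshow>
show_title = root.get("title")        # attribute on the root element

# For each slide: its type attribute, its title text, and the text of each item.
# An empty element such as <item /> has no text, which ElementTree reports as None.
slides = [
    (slide.get("type"), slide.findtext("title"), [item.text for item in slide.findall("item")])
    for slide in root.findall("slide")
]
print(show_title)
print(slides)
```

Comments are skipped by the default parser, so only elements, attributes, and text reach the application.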

XML Parsing – IE Example

Processing Instructions

An XML file can also contain processing instructions that give commands or information to an application that is processing the XML data:

<?target instructions?>
– target is the name of the application that is expected to do the processing
– instructions is a string of characters that embodies the information or commands for the application to process
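A parser exposes the target and the instructions separately. The sketch below uses Python's standard `xml.dom.minidom`; the target `my_app` and its instruction text are invented purely for illustration.

```python
from xml.dom import minidom

# A document with one processing instruction before the root element.
doc = minidom.parseString(
    '<?xml version="1.0"?><?my_app do_something now?><slideshow/>'
)

# Collect (target, instructions) for every processing instruction at document level.
# The XML declaration itself is not reported as a processing instruction node.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(pis)
```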


XML

XML - Elements

Elements, or tags, are the most common form of markup.

First element must be a root element, which can contain other (sub)elements.

XML document must have one root element (<STAFFLIST>).

Element begins with start-tag (<STAFF>) and ends with end-tag (</STAFF>).

XML elements are case sensitive.

An element can be empty, in which case it can be abbreviated to <EMPTYELEMENT/>.

Elements must be properly nested.

XML - Attributes

Attributes are name-value pairs that contain descriptive information about an element.

Attribute is placed inside start-tag after corresponding element name with the attribute value enclosed in quotes.
<STAFF branchNo = "B005">

Could also have represented branch as subelement of STAFF.

A given attribute may only occur once within a tag, while subelements with same tag may be repeated.

Document Type Definition (DTD)

DTD specifies the types of tags that can be included in the XML document
– it defines which tags are valid, and in what arrangements
– where text is expected, letting the parser determine whether the whitespace it sees is significant or ignorable

An optional part of the document prolog

XML document and DTD

[Figure: XML Document: a Slideshow tree with two slide nodes, their title nodes (DB, AI) and item nodes (item1, item2, item3). XML DTD: Slideshow contains Slide (+); Slide contains title and item (*)]

<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple "slide show".-->
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

Detailed DTD Example

– Step 1: Create a file named slideshow.dtd
– Step 2: Enter an XML declaration
<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple "slide show". -->
– Step 3: Specify contents of a slideshow element
…
<!-- slideshow contains 1+ slide elements -->
<!ELEMENT slideshow (slide+)>

Qualifiers

<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple example. -->
<!ELEMENT slideshow (slide+)>

slideshow element contains slide elements and nothing else

Qualifier   Meaning
?           Optional (zero or one)
*           Zero or more
+           One or more

Grouping multiple items

((image, title)+)

Every image element must be paired with a title element

Plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur

Defining Text and Nested Elements

– Step 4: Defining Text and Nested Elements
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

Text = Parsed Character DATA (PCDATA)

"#" that precedes PCDATA indicates that what follows is a special word, rather than an element name

Complete Example

<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple "slide show".-->
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
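The qualifiers ?, * and + behave much like regular-expression quantifiers over the sequence of child elements. The sketch below makes that analogy concrete for this DTD. It is not a DTD validator: the regular expressions are translated by hand, the names `CONTENT_MODELS` and `conforms` are made up, and text content and attributes are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regular expressions over "tag " sequences, one per ELEMENT declaration.
CONTENT_MODELS = {
    "slideshow": r"(slide )+",      # (slide+)
    "slide": r"title (item )*",     # (title, item*)
    "title": r"",                   # (#PCDATA): no child elements
    "item": r"(item )*",            # (#PCDATA | item)*: text mixed with nested items
}


def conforms(elem: ET.Element) -> bool:
    """Check child-element order against the content model, recursively."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return False                # element not declared
    child_tags = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, child_tags) is None:
        return False
    return all(conforms(child) for child in elem)


good = ET.fromstring(
    "<slideshow><slide><title>Overview</title>"
    "<item>Why is XML great?</item><item/></slide></slideshow>"
)
no_slides = ET.fromstring("<slideshow/>")
wrong_order = ET.fromstring(
    "<slideshow><slide><item>first</item><title>late</title></slide></slideshow>"
)
print(conforms(good), conforms(no_slides), conforms(wrong_order))
```

A real validating parser performs this kind of check, along with many others, directly from the DTD.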

Attribute Types

(#PCDATA | item)*

Vertical bar (|) indicates an “or” condition

In this case, either PCDATA or an item can occur

Attribute Type   Specifies...
CDATA            "Unparsed character data" (a text string).
ID               A name that no other ID attribute shares.
IDREF            A reference to an ID defined elsewhere in the document.
IDREFS           A space-separated list containing one or more ID references.
ENTITY           The name of an entity defined in the DTD.
ENTITIES         A space-separated list of entities.
NMTOKEN          A valid XML name composed of letters, numbers, hyphens, underscores, and colons.
NMTOKENS         A space-separated list of names.
NOTATION         The name of a DTD-specified notation, which describes a non-XML data format, such as those used for image files.

What you cannot do?

Double-definition for an item element doesn't work
<!ELEMENT item (#PCDATA) >
<!ELEMENT item (#PCDATA, item+) >
– Produces a "duplicate definition" warning
– The second definition is ignored

XML Names and NMTOKEN

Name Characters are letters, digits, hyphens, underscores, colons or full stops.

An NMTOKEN is any collection of Name Characters

NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline etc.)

Case is significant: PERSON and person are distinct names

Attribute and Element names must be (a subset of) NMTOKEN with restriction
– Names cannot begin with a digit
– Names cannot begin with xml (or any variant gotten by case changes) – system will use this prefix
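These rules translate naturally into regular expressions. The sketch below is a simplification: it covers ASCII characters only (XML also allows many non-ASCII letters), it follows XML 1.0 in letting a name start with a letter, underscore, or colon, and the function names are made up for the example.

```python
import re

NAME_CHAR = r"[A-Za-z0-9._:-]"                    # letters, digits, '.', '_', ':', '-'
NMTOKEN = re.compile(rf"{NAME_CHAR}+")            # any non-empty run of name characters
NAME = re.compile(rf"[A-Za-z_:]{NAME_CHAR}*")     # restricted first character


def is_nmtoken(s: str) -> bool:
    return NMTOKEN.fullmatch(s) is not None


def is_name(s: str) -> bool:
    """A name is an NMTOKEN that starts properly and avoids the reserved xml prefix."""
    return NAME.fullmatch(s) is not None and not s.lower().startswith("xml")


print(is_nmtoken("2003-05"), is_name("2003-05"))  # token, but not a name (leading digit)
print(is_name("person"), is_name("PERSON"))       # both valid, and distinct names
print(is_name("XmlData"))                         # reserved prefix in any letter case
```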

Element Declarations: EMPTY

Keyword ELEMENT introduces a new element
<!ELEMENT NAME CONTENT_MODEL>

Element name must begin with a letter, and may additionally contain digits and some punctuation, i.e. ‘.’, ‘-’, ‘_’, and ‘:’ as we described earlier under NMTOKEN

If an element can hold no child elements, and also no text, then it is known as an empty element and denoted by EMPTY for CONTENT_MODEL
– This seems trivial but it isn’t because the presence or absence of this element in an XML file can be used as a flag
– As an example we can find several in HTML such as HR and IMG which never have children and include no text. Here we would write
<!ELEMENT HR EMPTY>
and then <HR/> or <HR></HR> generates a horizontal line

EMPTY elements can have attributes such as the SRC attribute in <IMG/> to specify source of image.

Element Declarations: ANY

An element declared to have a content of ANY may contain all of the other elements declared in the DTD

This is not quite the same as no DTD for the file

<!DOCTYPE fred [
<!ELEMENT fred ANY >
]>
<fred>
  <people>Me and You</people>
  <people>Them</people>
</fred>

Gets an error due to presence of <people> tag

Adding <!ELEMENT people ANY > inside DTD declaration produces a valid document.

Entities

The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages.

Entities are defined in the DTD and come in several flavors:
– General Entities are referenced as &EntName;
– Parameter Entities are referenced as %EntName;

We have already seen the character entities
– &amp; for &
– &apos; for '
– &gt; for >
– &lt; for <
– &quot; for "

These are built in but you could add other such entities with
– <!ENTITY aitself "A" > and &aitself; would be replaced by A

General Entities

As another example, we can use in DTD
<!ENTITY TODAY "May 12 2003" >
and
<comment>&TODAY; was very quiet in Irvine</comment>
is parsed as
<comment>May 12 2003 was very quiet in Irvine</comment>

General Entity references can be nested inside a DTD, e.g., one can write
<!ENTITY YEAR "2003" >
<!ENTITY TODAY "May 12 &YEAR;" >

However one must use Parameter Entities and not General Entities for macro substitution in other DTD declarations like <!ATTLIST and <!ELEMENT

Parameter entities are defined as in
<!ENTITY % CUSTARDTAGS "(NAME,DATE,ORDERS)" >
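The expansion of general entities can be observed with a parser. The sketch below relies on Python's standard `xml.etree.ElementTree`, whose underlying parser expands general entities declared in an internal DTD; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Internal DTD with a nested general entity, as in the slide.
doc = """<!DOCTYPE comment [
  <!ENTITY YEAR "2003">
  <!ENTITY TODAY "May 12 &YEAR;">
]>
<comment>&TODAY; was very quiet in Irvine</comment>"""

expanded = ET.fromstring(doc).text
print(expanded)

# The five built-in character entities need no declaration at all.
builtin = ET.fromstring("<t>&lt;&amp;&gt;&apos;&quot;</t>").text
print(builtin)
```

The application sees only the replacement text; the entity references themselves are gone after parsing.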

Parameter Entities

<!ENTITY % peopletags "(firstname,lastname,dateofbirth)" >
<!ELEMENT student %peopletags; >
<!ELEMENT teacher %peopletags; >
<!ELEMENT administrator %peopletags; >

Defines a bunch of people ELEMENTS to have the same child elements

Parameter entities are even more commonly used for attributes because almost always several ELEMENTS share the same attributes (with often a basic set being augmented in different ways for different ELEMENTS)
– This basic set can be set in a parameter Entity

Defining Implied Attributes

Attributes must be declared in the DTD before they can be used

“Implied” means that this attribute is optional and there is no default value

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #IMPLIED>

The attribute year can be defined or undefined in the element population.

Valid Examples:
– <population year="2000">80</population>
– <population>80</population>

Defining Required Attributes

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #REQUIRED>
– The population must contain a year attribute:
<population year="1996">80</population>

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year (2000|2001) #REQUIRED>
– The population must contain a year attribute of 2000 or 2001
<population year="2000">80</population>
– No quotes on the enumeration values

Defining Default Attributes

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA "2000">

All these are valid
– <population year="2001">80</population>
– <population year="2000">80</population>
– <population>80</population>

Defining Fixed Attributes

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #FIXED "2000">
– Invalid <population year="2001">80</population>
– Valid <population year="2000">80</population>
– Valid <population>80</population>
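The four kinds of attribute default (#IMPLIED, #REQUIRED, a plain default, and #FIXED) can be summarised as one small decision function. This is a hand-written sketch of the rules, not how any particular parser is implemented; the function name `effective_value` and the kind labels are assumptions.

```python
def effective_value(kind, default, given):
    """Return the attribute value a validating parser would report.

    kind    -- "IMPLIED", "REQUIRED", "DEFAULT" or "FIXED"
    default -- the value written in the ATTLIST declaration, if any
    given   -- the value written in the document, or None if absent
    Raises ValueError when the document breaks the declaration.
    """
    if kind == "REQUIRED" and given is None:
        raise ValueError("required attribute is missing")
    if kind == "FIXED":
        if given is not None and given != default:
            raise ValueError("fixed attribute must equal its declared value")
        return default
    if kind == "DEFAULT" and given is None:
        return default
    return given  # IMPLIED (possibly None) or an explicitly given value


print(effective_value("IMPLIED", None, None))      # attribute simply absent
print(effective_value("DEFAULT", "2000", None))    # declared default is supplied
print(effective_value("FIXED", "2000", "2000"))    # matches the fixed value
```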

Defining Unique Attributes

<!ELEMENT animal (name)>
<!ATTLIST animal code ID #REQUIRED>

The code attribute has to be unique in the XML document
– <animal code="T50"><name>Lion</name></animal>
– <animal code="T51"><name>Rabbit</name></animal>

Page 88: 1 Chapter 29 Semistructured Data and XML Transparencies.

89

Referring Unique Attributes

<!ELEMENT website (url)>
<!ATTLIST website animal_refer IDREF #REQUIRED>

The value of the animal_refer attribute must match the value of an ID attribute that appears in the document:
– <website animal_refer="T50"><url>http://www.lions.com</url></website>

Page 89: 1 Chapter 29 Semistructured Data and XML Transparencies.

90

Referring Multiple Unique Attributes

<!ELEMENT website (url)>
<!ATTLIST website contents IDREFS #REQUIRED>

The contents attribute contains a whitespace-separated list of ID values:
– <website contents="T50 T51"><url>http://www.animals.com</url></website>
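The IDREF and IDREFS constraint amounts to a referential-integrity check: every referenced value must exist as an ID somewhere in the document. The following is a hand-written approximation, since the standard library does not enforce it; the zoo root element, the second website entry with the unknown reference T99, and the helper dangling_refs are all made up for the illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<zoo>
  <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>
  <website contents="T50 T51"><url>http://www.animals.com</url></website>
  <website contents="T50 T99"><url>http://www.example.com</url></website>
</zoo>"""


def dangling_refs(xml_text):
    """Return referenced values that match no animal code in the document."""
    root = ET.fromstring(xml_text)
    known_ids = {animal.get("code") for animal in root.iter("animal")}
    missing = []
    for site in root.iter("website"):
        # An IDREFS value is a whitespace-separated list of ID values.
        for ref in site.get("contents", "").split():
            if ref not in known_ids:
                missing.append(ref)
    return missing


print(dangling_refs(DOC))
```

A validating parser would reject the second website element, because T99 is not the value of any ID attribute.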

Page 90: 1 Chapter 29 Semistructured Data and XML Transparencies.

91

XML Example - the DTD

<!ELEMENT addressBook (person)+>
<!ELEMENT person (name, email*, link?)>
<!ATTLIST person id ID #REQUIRED>
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (#PCDATA|family|given)*>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY>
<!ATTLIST link manager IDREF #IMPLIED
               subordinates IDREFS #IMPLIED>

Page 91: 1 Chapter 29 Semistructured Data and XML Transparencies.

92

DOCTYPE declarations

Internal: the DTD is defined locally, inside the document.
External: the DTD is stored in an external file.
Both forms can be combined.

Page 92: 1 Chapter 29 Semistructured Data and XML Transparencies.

93

Internal DTD

<?xml version="1.0" standalone="yes" ?>
<!--open the DOCTYPE declaration - the open square bracket indicates an internal DTD-->
<!DOCTYPE foo [
<!--define the internal DTD-->
  <!ELEMENT foo (#PCDATA)>
<!--close the DOCTYPE declaration-->
]>
<foo>Hello World.</foo>

Page 93: 1 Chapter 29 Semistructured Data and XML Transparencies.

94

Internal DTD: rules

The document type declaration must be placed between the XML declaration and the first element (root element) in the document.

The keyword DOCTYPE must be followed by the name of the root element in the XML document.

The keyword DOCTYPE must be in upper case.
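The second rule (the name after DOCTYPE must match the root element) is a validity constraint, so a non-validating parser will not enforce it, but it is easy to inspect. The sketch below uses the standard-library expat binding and its StartDoctypeDeclHandler callback; the helper name inspect_doctype and the dictionary keys are assumptions chosen for this example.

```python
from xml.parsers import expat

DOC = """<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE foo [
  <!ELEMENT foo (#PCDATA)>
]>
<foo>Hello World.</foo>
"""


def inspect_doctype(xml_text):
    """Collect the DOCTYPE name, its identifiers, and the root element name."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["doctype"] = name
        info["system_id"] = system_id
        info["public_id"] = public_id
        info["internal_subset"] = bool(has_internal_subset)

    def on_start(name, attrs):
        # The first start tag seen is the root element.
        info.setdefault("root", name)

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return info


info = inspect_doctype(DOC)
print(info["doctype"] == info["root"], info["internal_subset"])
```

For a purely internal DTD such as this one, both the system and public identifiers are absent.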

Page 94: 1 Chapter 29 Semistructured Data and XML Transparencies.

95

External DTD

Useful for creating a common DTD that can be shared between multiple documents.

Any change made to the external DTD automatically applies to all the documents that reference it.

Two types: private and public.

Rules:
– If any elements, attributes, or entities are used in the XML document that are referenced or defined in an external DTD, standalone="no" must be included in the XML declaration.

Page 95: 1 Chapter 29 Semistructured Data and XML Transparencies.

96

"Private" External DTDs

Identified by the keyword SYSTEM.
Intended for use by a single author or group of authors.
Example:

<!DOCTYPE root_element SYSTEM "DTD_location">

where DTD_location is a relative or absolute URL (for example, one beginning with "http://" or "file://").

Page 96: 1 Chapter 29 Semistructured Data and XML Transparencies.

97

"Private" External DTDs (cont)

XML document:
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE document SYSTEM "subjects.dtd">
<document> … </document>

subjects.dtd:
<!ELEMENT document …>
…

Page 97: 1 Chapter 29 Semistructured Data and XML Transparencies.

98

"Public" External DTDs

Identified by the keyword PUBLIC.
Intended for broad use.

<!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location">

where:
– DTD_location: a relative or absolute URL.
– DTD_name: follows the syntax "prefix//owner_of_the_DTD//description_of_the_DTD//ISO 639_language_identifier".
– "DTD_location" is used to find the public DTD if it cannot be located by the "DTD_name".

Page 98: 1 Chapter 29 Semistructured Data and XML Transparencies.

99

"Public" External DTDs (cont)

<?xml version="1.0" standalone="no" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
  <HEAD>
    <TITLE>A typical HTML file</TITLE>
  </HEAD>
  <BODY>
    …
  </BODY>
</HTML>

Page 99: 1 Chapter 29 Semistructured Data and XML Transparencies.

100

"Public" External DTDs (cont)

Valid DTD_name prefixes:
– ISO: The DTD is an ISO standard. All ISO standards are approved.
– +: The DTD is an approved non-ISO standard.
– -: The DTD is an unapproved non-ISO standard.
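To make the DTD_name syntax concrete, the sketch below splits a public identifier on its "//" separators and interprets the prefix according to the three cases listed above. The function name parse_public_name and the dictionary keys are hypothetical, and the code assumes the four-field form shown on these slides rather than covering every identifier found in practice.

```python
def parse_public_name(dtd_name):
    """Split a four-field DTD_name and describe its prefix."""
    prefix, owner, description, language = dtd_name.split("//")
    if prefix.startswith("ISO"):
        status = "ISO standard"
    elif prefix == "+":
        status = "approved non-ISO standard"
    elif prefix == "-":
        status = "unapproved non-ISO standard"
    else:
        status = "unknown prefix"
    return {
        "status": status,
        "owner": owner,
        "description": description,
        "language": language,
    }


fields = parse_public_name("-//W3C//DTD HTML 4.0 Transitional//EN")
for key, value in fields.items():
    print(key, "=", value)
```

Applied to the HTML 4.0 identifier from the previous slide, this reports W3C as the owner and classifies the DTD as an unapproved non-ISO standard, which here only means that it was not registered with ISO.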

Page 100: 1 Chapter 29 Semistructured Data and XML Transparencies.

101

Combining Internal and External DTDs

A document can use both internal and external DTD subsets.
The internal DTD subset is specified between the square brackets of the DOCTYPE declaration.
The declaration for the external DTD subset is placed before the square brackets, immediately after the SYSTEM keyword.
Declaring an ELEMENT with the same name in both the internal and external DTD subsets is invalid.

Page 101: 1 Chapter 29 Semistructured Data and XML Transparencies.

102

Example

<?xml version="1.0" standalone="no" ?>
<!DOCTYPE document SYSTEM "subjects.dtd" [
  <!ATTLIST assessment assessment_type (exam | assignment | prac) #IMPLIED>
  <!ELEMENT results (#PCDATA)>
]>

subjects.dtd:
<!ELEMENT document (title*,subjectID,subjectname,prerequisite?,classes,assessment,syllabus,textbooks*)>
<!ELEMENT prerequisite (subjectID,subjectname)>
…

Page 102: 1 Chapter 29 Semistructured Data and XML Transparencies.

103

DTD Validation

An XML document can be well-formed but invalid under the rules of its DTD.

e.g. DTD rule: <!ELEMENT name (#PCDATA)>

Acceptable:
<name> Giancarlo Succi </name>

Unacceptable:
<name>
  <first_name> Giancarlo </first_name>
  <last_name> Succi </last_name>
</name>
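The distinction can be seen with a parser that checks only well-formedness. In the sketch below, Python's standard-library ElementTree accepts both fragments, because both are well-formed; the (#PCDATA) rule is then approximated by a hand-written check that the element has no child elements. The helper is_text_only is an assumption standing in for what a validating parser would do with the DTD.

```python
import xml.etree.ElementTree as ET

ACCEPTABLE = "<name> Giancarlo Succi </name>"
UNACCEPTABLE = (
    "<name>"
    "<first_name> Giancarlo </first_name>"
    "<last_name> Succi </last_name>"
    "</name>"
)
NOT_WELL_FORMED = "<name> Giancarlo Succi"  # missing end tag


def is_text_only(xml_text):
    """Parse the text (raises ParseError if not well-formed), then mimic the
    (#PCDATA) content model by requiring that the root has no child elements."""
    root = ET.fromstring(xml_text)
    return len(root) == 0


print(is_text_only(ACCEPTABLE))
print(is_text_only(UNACCEPTABLE))

try:
    is_text_only(NOT_WELL_FORMED)
except ET.ParseError:
    print("not well-formed")
```

Only the third input fails to parse; the second parses cleanly and is rejected solely by the content-model check, which is the sense in which it is well-formed but invalid.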

Page 103: 1 Chapter 29 Semistructured Data and XML Transparencies.

104

Beyond DTDs…

DTD limitations:
– Simple document structures
– Lack of "real" datatypes

Advanced schema languages:
– XML Schema
– Relax NG
– …

Page 104: 1 Chapter 29 Semistructured Data and XML Transparencies.

105

References

http://www.java.sun.com/xml/docs/tutorial/TOC.html
http://www.xml.com/pub/a/1999/09/expat/index.html
http://xmlfiles.com/dtd/dtd_attributes.asp
http://xmlwriter.net/xml_guide/doctype_declaration.shtml