1 Chapter 29 Semistructured Data and XML Transparencies.

1

Chapter 29

Semistructured Data and XML

Transparencies

2

Chapter - Objectives

What semistructured data is.

Concepts of the Object Exchange Model (OEM), a model for semistructured data.

Main language elements of XML.

Difference between well-formed and valid XML documents.

How Document Type Definitions (DTDs) can be used to define the valid syntax of an XML document.

3

Chapter - Objectives

About other related XML technologies.

Limitations of DTDs and how the W3C XML Schema overcomes these limitations.

How RDF and RDF Schema provide a foundation for processing meta-data.

Proposals for a W3C Query Language.

5

Introduction

XML 1.0 was formally ratified by the W3C only in 1998. Yet it is set to impact every aspect of programming, including graphical interfaces, embedded systems, distributed systems, and database management.

It is already becoming the de facto standard for data communication within the software industry, and is quickly replacing EDI systems as the primary medium for data interchange among businesses.

Some analysts believe it will become the language in which most documents are created and stored, both on and off the Internet.

6

Introduction

Due to the nature of information on the Web and the inherent flexibility of XML, it is expected that much of the data encoded in XML will be semistructured; i.e., the data may be irregular or incomplete, and its structure may change rapidly or unpredictably.

Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.

7

Semistructured Data

Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably.

Semistructured data is data that has some structure, but structure may not be rigid, regular, or complete.

Generally, the data does not conform to a fixed schema (sometimes the terms schema-less or self-describing are used to describe such data).

8

Semistructured Data

The information normally associated with a schema is contained within the data itself.

In some forms of semistructured data there is no separate schema, in others it exists but only places loose constraints on the data.

Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.

9

Semistructured Data

Has gained importance recently for various reasons:

– may be desirable to treat Web sources like a database, but cannot constrain these sources with a schema;

– may be desirable to have a flexible format for data exchange between disparate databases;

– emergence of XML as standard for data representation and exchange on the Web, and similarity between XML documents and semistructured data.

10

Example 29.1

11

Example 29.1

Note, data is not regular:

– for John White, hold first and last names, but for Ann Beech store single name and also store a salary;

– for property at 2 Manor Rd, store a monthly rent whereas for property at 18 Dale Rd, store an annual rent;

– for property at 2 Manor Rd, store property type (flat) as a string, whereas for property at 18 Dale Rd, store type (house) as an integer value.

12

Example 29.1

13

Object Exchange Model (OEM)

Data in OEM is schema-less and self-describing, and can be thought of as a labeled directed graph where the nodes are objects, each consisting of:
– a unique object identifier (for example, &7),
– a descriptive textual label (street),
– a type (string),
– a value (“22 Deer Rd”).

Objects are decomposed into atomic and complex:
– an atomic object contains a value for a base type (e.g., integer or string) and can be recognized in the diagram as one that has no outgoing edges;
– all other objects are complex objects whose type is a set of object identifiers.

14

Object Exchange Model (OEM)

A label indicates what the object represents and is used to identify the object and to convey the meaning of the object, and so should be as informative as possible.

Labels can change dynamically.

A name is a special label that serves as an alias for a single object and acts as an entry point into the database (for example, DreamHome is a name that denotes object &1).

15

Object Exchange Model (OEM)

An OEM object can be considered as a quadruple (label, oid, type, value).

For example:

{Staff, &4, set, {&9, &10}}

{name, &9, string, “Ann Beech”}

{salary, &10, decimal, 12000}
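The quadruple view can be sketched in a few lines of Python. This is only an illustration of the (label, oid, type, value) idea using the three objects above; the class name `OEMObject` and the helper functions are assumptions made for this sketch, not part of OEM itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OEMObject:
    """One OEM object as a (label, oid, type, value) quadruple."""
    label: str
    oid: str
    type: str
    value: Union[str, int, float, frozenset]


# The three objects from the example, keyed by object identifier.
db = {
    "&4": OEMObject("Staff", "&4", "set", frozenset({"&9", "&10"})),
    "&9": OEMObject("name", "&9", "string", "Ann Beech"),
    "&10": OEMObject("salary", "&10", "decimal", 12000),
}


def is_atomic(obj):
    """Atomic objects hold a base-type value; complex ones hold a set of oids."""
    return obj.type != "set"


def children(obj, db):
    """Follow the outgoing edges of a complex object (none for atomic objects)."""
    if is_atomic(obj):
        return []
    return sorted((db[oid] for oid in obj.value), key=lambda o: o.label)


# Collect the labeled values reachable from the Staff object.
staff_fields = {child.label: child.value for child in children(db["&4"], db)}
```

Because every object carries its own label and type, the structure can be walked without consulting any separate schema.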

16

Semistructured Data - Case Study: Object Exchange Model

17

OEM Features

• Common model for heterogeneous information exchange, self-describing

• Each object:

OID     Label     Type     Value

OID: unique identifier or NULL
Label: character string descriptor
Type: atomic data type or set
Value: atomic value or set of object references

• “Help pages” for labels

• Query language OEM-QL

18

Representing Semistructured Data Using OEM

<collection, {b1, a1, ...}>
    b1: <book, {t, a}>
        t: <title, “Database and ...”>
        a: <author, {n, p}>
            n: <name, “Jeff Ullman”>
            p: <picture, “/gifs/ullman.gif”>
    a1: <article, {v, w, x}>
        v: <author, “Gio Wiederhold”>
        w: <title, “Mediators in the …”>
        x: <journal, “IEEE Computer”>
    ...

[Figure callouts: Label, Set Value, Atomic Value, Memory Addresses]

19

An OEM Query Language: OEM-QL

• Logic-based language for OEM
– Match object patterns, generate variable bindings, construct new OEM objects from existing ones

• Get articles published in “IEEE Computer”

P :-
P:<articles {<journal “IEEE Computer”>}>

• Get titles of books by “Jeff Ullman”

<answer_title T> :-
<book {<author “Jeff Ullman”> <title T>}>

20

XML

Vendors introduced some browser-specific HTML tags, making it difficult to develop sophisticated, widely viewable Web documents.

W3C has produced a new standard called XML, which preserves the general application independence that makes HTML portable and powerful.

21

XML

XML is a restricted version of SGML, designed especially for Web documents.

SGML allows a document to be logically separated into two parts: one that defines the structure of the document (the DTD), and the other containing the text itself.

By giving documents a separately defined structure, and by giving authors the ability to define custom structures, SGML provides an extremely powerful document management system.

However, SGML has not been widely adopted due to its inherent complexity.

22

XML

XML attempts to provide a similar function to SGML, but is less complex and, at the same time, network-aware.

XML retains key SGML advantages of extensibility, structure, and validation.

Since XML is a restricted form of SGML, any fully compliant SGML system will be able to read XML documents (although the opposite is not true).

XML is not intended as a replacement for SGML or HTML.

23

XML (eXtensible Markup Language)

Origins: HTML + SGML (ISO Standard, 1986, ~600 pp)

W3C standard (~26 pp): XML syntax + DTDs

XML = HTML - presentational tags + user-defined DTD (tags + nesting)

=> a metalanguage for defining other languages via DTDs

=> XML is more like SGML than HTML

XML = SGML - {complexity, document perspective} + {simplicity, data exchange perspective}

24

Advantages of XML

Simplicity

Open standard and platform/vendor-independent

Extensibility

Reuse

Separation of content and presentation

Improved load balancing

25

Advantages of XML

Support for integration of data from multiple sources

Ability to describe data from a wide variety of applications

More advanced search engines

New opportunities.

26

Why are Database folks so excited about XML?

XML is just a syntax for (self-describing) data

This is still exciting because

– No standard syntax for relational data

– With XML, we can
  » Translate any legacy data to XML
  » Exchange data in XML format: ship over the web, input to any application

27

XML machine accessible meaning

This is what a web-page in natural language looks like for a machine

28


[Figure: CV document; tag labels CV, name, education, work, private]

XML allows “meaningful tags” to be added to parts of the text

29

[Figure: CV document; tag labels CV, name, education, work, private]

But to your machine, the tags look like this….

30

Schemas help….

[Figure: two CV documents; tag labels CV, name, education, work, private]

…by relating common terms between documents

31

But other people use other schemas

[Figure: CV document; tag labels CV, name, education, work, private]

Someone else has one like this….

32

But other people use other schemas

[Figure: three CV documents; tag labels CV, name, education, work, private]

…which don’t fit in

Moral: There is still need for ontology mapping.

33

An HTML document

34

HTML code

<title>ICS185/ICS180 - Spring, 2003</title>
<body bgcolor="#d0d0ff">
<H2>Index</H2>
<UL>
  <LI> <a HREF = "#announcements">Announcements </a>
  <LI> <a HREF = "#geninfo">Course Information </a>
</UL>

<H2>Course Information</H2>
<a href="geninfo.html">General Information</a>. The following are a few important entries:
<UL>
  <li> <A HREF = "geninfo.html#goals">Course Goals</A><BR>
  <li> <A HREF = "geninfo.html#crsenum">About the course numbers</A><BR>
</UL>
</body>

35

What is the problem?

To do more fancy things with documents:
– need to make their logical structure explicit.

Otherwise, software applications
– do not know what is what
– do not have any handle over documents.

36

An XML document

<?xml version="1.0" ?>
<bib>
  <vendor id="id3_4">
    <name>QuickBooks</name>
    <email>[email protected]</email>
    <phone>1-800-333-9999</phone>
    <book>
      <title>Inorganic Chemistry</title>
      <publisher>Brooks/Cole Publishing</publisher>
      <year>1991</year>
      <author>
        <firstname>James</firstname>
        <lastname>Bowser</lastname>
      </author>
      <price>43.72</price>
    </book>
  </vendor>
</bib>

37

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

…

</bibliography>

HTML describes presentation

XML describes content

38

What is XML?

eXtensible Markup Language

Data are identified using tags (identifiers enclosed in angle brackets: <...>)

Collectively, the tags are known as “markup”

XML tags tell you what the data means, rather than how to display it

39

XML versus relational

– Relational: structured
– XML: semi-structured
– Plain text file: unstructured

40

How does XML work?

XML allows developers to write their own Document Type Definitions (DTDs)

A DTD is a markup language’s rule book: it describes the sets of tags and attributes that are used to describe specific content

If you want to use a certain tag, then it must first be defined in the DTD

41

Key Components in XML

Three generic components, and one customizable component

XML Content

DTD Rules

XML Parser

Application

42

Meta Markup Language

Not a language – but a way of specifying other languages

Meta-markup language – gives the rules by which other markup languages can be written

Portable - platform independent

43

Markup Languages

Presentation based:
– Markup languages that describe information for presentation for human consumption

Content based:
– Describe information that is of interest to another computer application

44

HTML and XML

HTML tag says "display this data in bold font" – <b>...</b>

XML tag acts like a field name in your program

It puts a label on a piece of data that identifies it
– <message>...</message>

45

HTML vs. XML

<h1> Bibliography </h1>

<p> <i> Foundations of Databases </i>

Abiteboul, Hull, Vianu

<br> Addison Wesley, 1995

<p> <i> Data on the Web </i>

Abiteboul, Buneman, Suciu

<br> Morgan Kaufmann, 1999

<bibliography>

<book> <title> Foundations… </title>

<author> Abiteboul </author>

<author> Hull </author>

<author> Vianu </author>

<publisher> Addison Wesley </publisher>

<year> 1995 </year>

</book>

…

</bibliography>

“Self-describing”

- Schema info part of the data

- Good for data exchange (albeit baroque for storage)

46

Simple Example

XML data for a messaging application:

<message>
  <to>[email protected]</to>
  <from>[email protected]</from>
  <text> Why is it good? Let me count the ways... </text>
</message>
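As a rough sketch of how an application might read such a message, the snippet below parses it with Python's standard `xml.etree.ElementTree` module. The two addresses are placeholders invented for this example, since the addresses in the slide are not legible.

```python
import xml.etree.ElementTree as ET

# Placeholder addresses (assumed for illustration only).
doc = """<message>
  <to>alice@example.org</to>
  <from>bob@example.org</from>
  <text>Why is it good? Let me count the ways...</text>
</message>"""

root = ET.fromstring(doc)

# Each child element acts like a named field of the message.
fields = {child.tag: child.text for child in root}
```

The tag names play the role of field names, so the receiving program needs no positional conventions to find the recipient or the text.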

47

Element

Data between the tag and its matching end tag defines an element of the data

Comment:
– <!-- ... -->

48

Example

<message to="[email protected]" from="[email protected]">
  <text>Why is it good? Let me count the ways...</text>
</message>

49

Attributes

Tags can also contain attributes

Attributes contain additional information included as part of the tag, within the tag's angle brackets

Attribute name is followed by an equality sign and the attribute value
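A minimal sketch of reading attributes with Python's standard `xml.etree.ElementTree`; the addresses are placeholders assumed for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<message to="alice@example.org" from="bob@example.org">'
    "<text>Why is it good?</text>"
    "</message>"
)

# Attributes are exposed as name/value pairs on the element.
attrs = dict(root.attrib)
recipient = root.get("to")
subject = root.get("subject")  # attribute absent in this document
```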

50

Other Basics

White space is essentially irrelevant

Commas between attributes are not ignored - if present, they generate an error

Case sensitive: “message” and “MESSAGE” are different
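These three rules can be checked quickly with a parser. The helper name `is_well_formed` is an assumption for this sketch; it simply reports whether Python's standard parser accepts the text.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Extra white space (including a line break) between attributes is accepted.
spaced = is_well_formed('<message   to="a"\n   from="b" />')

# A comma between attributes is rejected.
comma = is_well_formed('<message to="a", from="b"/>')

# Names are case sensitive, so this end tag does not match the start tag.
mixed_case = is_well_formed("<message></MESSAGE>")
```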

51

Well Formed XML

Every tag has a closing tag

XML represents hierarchical data structures having one tag to contain others
– Tags have to be completely nested

Correct:
– <message>..<to>..</to>..</message>

Incorrect:
– <message>..<to>..</message>..</to>
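The nesting rule amounts to a stack discipline: each end tag must match the most recently opened start tag. The following is a simplified sketch of that idea, not a real XML parser; the regular expression and the function name `properly_nested` are assumptions for illustration, and comments, CDATA sections, and declarations are deliberately ignored.

```python
import re

# Matches <name ...>, </name>, and <name ... />; a deliberate simplification.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


def properly_nested(text):
    """Check that start and end tags pair up in last-opened, first-closed order."""
    stack = []
    for closing, name, empty in TAG.findall(text):
        if empty:                      # <flag/> opens and closes itself
            continue
        if not closing:
            stack.append(name)         # start tag: remember it
        elif not stack or stack.pop() != name:
            return False               # end tag does not match innermost start tag
    return not stack                   # every start tag must have been closed


correct = properly_nested("<message>..<to>..</to>..</message>")
incorrect = properly_nested("<message>..<to>..</message>..</to>")
```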

52

Empty Tag

Empty tag is used when it makes sense to have a tag that stands by itself and doesn't enclose any content - a "flag"
– You can create an empty tag by ending it with />
– <flag/>

53

Example

<message to="[email protected]" from="[email protected]" subject="XML is good">
  <flag/>
  <text> Why is it good? Let me count the ways... </text>
</message>

54

Tree representation

<BOOKS>
  <book id="123" loc="library">
    <author>Hull</author>
    <title>California</title>
    <year> 1995 </year>
  </book>
  <article id="555" ref="123">
    <author>Su</author>
    <title> Purdue</title>
  </article>
</BOOKS>

[Figure: tree with root BOOKS; child book (123, loc="library") with author Hull, title California, year 1995; child article (555, ref to 123) with author Su, title Purdue]
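The tree view can be reproduced by walking the parsed document. This is a small sketch using Python's standard `xml.etree.ElementTree`; the function name `outline` and the `by_id` lookup are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

doc = """<BOOKS>
  <book id="123" loc="library">
    <author>Hull</author>
    <title>California</title>
    <year> 1995 </year>
  </book>
  <article id="555" ref="123">
    <author>Su</author>
    <title> Purdue</title>
  </article>
</BOOKS>"""


def outline(elem, depth=0):
    """Return one (depth, tag, attributes, text) row per node, parents first."""
    text = (elem.text or "").strip()
    rows = [(depth, elem.tag, dict(elem.attrib), text)]
    for child in elem:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(doc)
rows = outline(root)

# The ref attribute on article is an edge to the element whose id it names.
by_id = {child.get("id"): child for child in root}
cited = by_id[root.find("article").get("ref")]
```

Following `ref` turns the tree into a graph, which is what the cross edge in the diagram expresses.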

55

Prolog in XML Files

XML file always starts with a prolog

The minimal prolog contains a declaration that identifies the document as an XML document:
<?xml version="1.0"?>

The declaration may also contain additional information
– version - version of the XML used in the data
– encoding - identifies the character set used
– standalone - whether the document references an external entity or data type specification

56

Detailed Example of XML File

Simple version of the kind of XML data you could use for a slide presentation

You can use your text editor to create the data
– Step 1: create a file named slideSample01.xml
– Step 2: write the declaration, which identifies the file as an XML document
<?xml version='1.0' encoding='us-ascii'?>

57

Defining the Root Element

– Step 3: Adding a comment…
– Step 4: Defining the Root Element…
<slideshow> </slideshow>

After the declaration, every XML file defines exactly one element, known as the root element

Any other elements in the file are contained within that element

58

Attributes

A slide presentation has a title...

<slideshow title="Sample Slide Show">
</slideshow>

59

Adding Nested Elements

– Step 5: Adding Nested Elements

<slideshow...
  <slide title="Title of Talk"/>
  <slide type="all">
    <title>Introduction to XML </title>
  </slide>
</slideshow>

60

Attribute vs. Element

type of the slide is defined as an attribute
– Slides could be earmarked for a mostly technical or mostly executive audience with type="tech" or type="exec", or identified as suitable for both with type="all"

title is defined as an element

The title is something the audience will see
– So it is an element

The type is something that never gets presented
– So it is an attribute

61

Adding Text

– Step 6: Adding Text

<slideshow>
  …
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item>Who uses it?</item>
  </slide>
</slideshow>

62

Adding an Empty Element

– Step 7: Adding an Empty Element

<slideshow>
  …
  <slide>
    …
    <item/>
    …
  </slide>
</slideshow>

63

Complete Example

<?xml version="1.0" encoding="us-ascii" ?>
<slideshow title="Sample Slide Show">
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>
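One way an application might consume the finished file is sketched below with Python's standard `xml.etree.ElementTree`. The document text is inlined here rather than read from slideSample01.xml so the example is self-contained.

```python
import xml.etree.ElementTree as ET

doc = """<?xml version="1.0" encoding="us-ascii"?>
<slideshow title="Sample Slide Show">
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item/>
  </slide>
</slideshow>"""

# Parse from bytes so the declared encoding is honored by the parser.
root = ET.fromstring(doc.encode("ascii"))

show_title = root.get("title")

# For each slide: its title and the text of its items (None for an empty item).
slides = [
    (slide.findtext("title"), [item.text for item in slide.findall("item")])
    for slide in root.findall("slide")
]
```

The empty `<item/>` shows up as an element with no text, which is how the "flag" use of empty elements appears to a program.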

64

XML Parsing – IE Example

65

Processing Instructions

An XML file can also contain processing instructions that give commands or information to an application that is processing the XML data:

<?target instructions?>
– target is the name of the application that is expected to do the processing
– instructions is a string of characters that embodies the information or commands for the application to process
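A short sketch of how a program can pick up processing instructions, using Python's standard `xml.dom.minidom`. The target `myapp` and its instruction string are hypothetical names invented for this example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?myapp mode="exec"?>'   # hypothetical target and instructions
    "<slideshow/>"
)

# Processing instructions appear as nodes with a target and a data string.
pis = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == minidom.Node.PROCESSING_INSTRUCTION_NODE
]
```

Note that the XML declaration itself is not reported as a processing instruction, even though it uses similar syntax.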

66

XML

67

XML -Elements

Elements, or tags, are the most common form of markup.

First element must be a root element, which can contain other (sub)elements.

XML document must have one root element (<STAFFLIST>).

Element begins with start-tag (<STAFF>) and ends with end-tag (</STAFF>).

XML elements are case sensitive.

An element can be empty, in which case it can be abbreviated to <EMPTYELEMENT/>.

Elements must be properly nested.

68

XML - Attributes

Attributes are name-value pairs that contain descriptive information about an element.

Attribute is placed inside the start-tag after the corresponding element name, with the attribute value enclosed in quotes.

<STAFF branchNo = "B005">

Could also have represented branch as subelement of STAFF.

A given attribute may only occur once within a tag, while subelements with same tag may be repeated.

69

Document Type Definition (DTD)

DTD specifies the types of tags that can be included in the XML document
– it defines which tags are valid, and in what arrangements
– where text is expected, letting the parser determine whether the whitespace it sees is significant or ignorable

An optional part of the document prolog

70

XML document and DTD

[Figure: XML Document tree (Slideshow, slide, title, item; titles DB and AI; items item1, item2, item3) beside XML DTD tree (Slideshow, Slide +, title, item *)]

XML DTD

<?xml version='1.0' encoding='us-ascii'?>
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

71

Detailed DTD Example

– Step 1: Create a file named slideshow.dtd

– Step 2: Enter an XML declaration
<?xml version='1.0' encoding='us-ascii'?>

– Step 3: Specify the contents of a slideshow element
…
<!-- slideshow contains 1+ slide elements -->
<!ELEMENT slideshow (slide+)>

72

Qualifiers

<?xml version='1.0' encoding='us-ascii'?>
<!ELEMENT slideshow (slide+)>

slideshow element contains slide elements and nothing else

Qualifier   Meaning
?           Optional (zero or one)
*           Zero or more
+           One or more

73

Grouping multiple items

((image, title)+)

Every image element must be paired with a title element

Plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur
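The qualifiers behave much like regular-expression operators over the sequence of child element names. The sketch below makes that analogy concrete; it is not a DTD validator. The content models are translated to regular expressions by hand, the `gallery` element name is hypothetical, and text content is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models; each child name is followed by one space.
CONTENT_MODELS = {
    "slideshow": r"(slide )+",        # (slide+)
    "slide": r"title (item )*",       # (title, item*)
    "gallery": r"(image title )+",    # ((image, title)+), hypothetical element
}


def children_match(elem):
    """Compare an element's child sequence against its recorded content model."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return True                   # no rule recorded in this sketch
    sequence = "".join(child.tag + " " for child in elem)
    return re.fullmatch(model, sequence) is not None


def conforms(root):
    """True if every element in the tree satisfies its content model."""
    return all(children_match(elem) for elem in root.iter())


good = conforms(ET.fromstring(
    "<slideshow><slide><title>t</title><item>a</item><item>b</item></slide></slideshow>"
))
empty_show = conforms(ET.fromstring("<slideshow/>"))
unpaired = conforms(ET.fromstring("<gallery><image/><image/></gallery>"))
```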

74

Defining Text and Nested Elements

– Step 4: Defining Text and Nested Elements
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

Text = Parsed Character DATA (PCDATA)

"#" that precedes PCDATA indicates that what follows is a special word, rather than an element name

75

Complete Example

<?xml version='1.0' encoding='us-ascii'?>
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

76

Attribute Types

(#PCDATA | item)*

Vertical bar (|) indicates an “or” condition

In this case, either PCDATA or an item can occur

Attribute Type   Specifies...
CDATA            "Unparsed character data" (a text string).
ID               A name that no other ID attribute shares.
IDREF            A reference to an ID defined elsewhere in the document.
IDREFS           A space-separated list containing one or more ID references.
ENTITY           The name of an entity defined in the DTD.
ENTITIES         A space-separated list of entities.
NMTOKEN          A valid XML name composed of letters, numbers, hyphens, underscores, and colons.
NMTOKENS         A space-separated list of names.
NOTATION         The name of a DTD-specified notation, which describes a non-XML data format, such as those used for image files.

77

What you cannot do?

Double-definition for an item element doesn't work

<!ELEMENT item (#PCDATA) >
<!ELEMENT item (#PCDATA, item+) >

– Produces a "duplicate definition" warning
– The second definition is ignored

78

XML Names and NMTOKEN

Name Characters are letters, digits, hyphens, underscores, colons or full stops.

An NMTOKEN is any collection of Name Characters

NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline etc.)

Case is significant: PERSON and person are distinct names

Attribute and Element names must be (a subset of) NMTOKEN with restrictions
– Names cannot begin with a digit
– Names cannot begin with xml (or any variant obtained by case changes): the system reserves this prefix

79

Element Declarations: EMPTY

Keyword ELEMENT introduces a new element
<!ELEMENT NAME CONTENT_MODEL>

Element name must begin with a letter, and may additionally contain digits and some punctuation, i.e. ‘.’, ‘-’, ‘_’, and ‘:’, as we described earlier under NMTOKEN

If an element can hold no child elements, and also no text, then it is known as an empty element and denoted by EMPTY for CONTENT_MODEL
– This seems trivial but it isn’t, because the presence or absence of this element in an XML file can be used as a flag
– As an example we can find several in HTML, such as HR and IMG, which never have children and include no text. Here we would write <!ELEMENT HR EMPTY> and then <HR/> or <HR></HR> generates a horizontal line

EMPTY elements can have attributes, such as the SRC attribute in <IMG/> to specify the source of the image.

80

Element Declarations: ANY

An element declared to have a content of ANY may contain all of the other elements declared in the DTD

This is not quite the same as no DTD for the file

<!DOCTYPE fred [
  <!ELEMENT fred ANY >
]>
<fred>
  <people>Me and You</people>
  <people>Them</people>
</fred>

Gets an error due to presence of <people> tag

Adding <!ELEMENT people ANY > inside DTD declaration produces a valid document.

81

Entities

The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages.

ENTITYs are defined in the DTD and come in several flavors:
– General Entities are referenced as &EntName;
– Parameter Entities are referenced as %EntName;

We have already seen the character entities
– &amp; for &
– &apos; for '
– &gt; for >
– &lt; for <
– &quot; for "

These are built in, but you could add other such entities with
– <!ENTITY aitself "A"> and &aitself; would be replaced by A

82

General Entities

As another example, we can use in the DTD
<!ENTITY TODAY "May 12 2003">
and
<comment>&TODAY; was very quiet in Irvine</comment>
is parsed as
<comment>May 12 2003 was very quiet in Irvine</comment>

General Entity references can be nested inside a DTD, e.g., one can write
<!ENTITY YEAR "2003">
<!ENTITY TODAY "May 12 &YEAR;">

However one must use Parameter Entities and not General Entities for macro substitution in other DTD declarations like <!ATTLIST and <!ELEMENT

Parameter entities are defined as in
<!ENTITY % CUSTARDTAGS "(NAME,DATE,ORDERS)">
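General entity substitution can be observed directly with a parser that reads the internal DTD subset. This sketch uses Python's standard `xml.etree.ElementTree`; the trailing "&amp; nearby" is added here (an assumption for illustration) to show a built-in entity alongside the declared ones.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE comment [
  <!ENTITY YEAR "2003">
  <!ENTITY TODAY "May 12 &YEAR;">
]>
<comment>&TODAY; was very quiet in Irvine &amp; nearby</comment>"""

# The parser replaces &TODAY; (and the nested &YEAR;) before handing over the text.
text = ET.fromstring(doc).text
```

The application only ever sees the expanded text; the entity names are gone once parsing is complete.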

83

Parameter Entities

<!ENTITY % peopletags "(firstname,lastname,dateofbirth)">
<!ELEMENT student %peopletags; >
<!ELEMENT teacher %peopletags; >
<!ELEMENT administrator %peopletags; >

Defines a bunch of people ELEMENTs to have the same child elements

Parameter entities are even more commonly used for attributes, because several ELEMENTs almost always share the same attributes (often with a basic set being augmented in different ways for different ELEMENTs)
– This basic set can be defined in a parameter entity

84

Defining Implied Attributes

Attributes must be declared in the DTD to be able to be used

“Implied” means that this attribute is optional and there is no default value

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #IMPLIED>

The attribute year can be defined or undefined in the element population.

Valid Examples:
– <population year="2000">80</population>
– <population>80</population>

85

Defining Required Attributes

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #REQUIRED>
– The population must contain a year attribute:
<population year="1996">80</population>

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year (2000|2001) #REQUIRED>
– The population must contain a year attribute of 2000 or 2001:
<population year="2000">80</population>
– No quotes on the enumeration values

86

Defining Default Attributes

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA "2000">

All these are valid
– <population year="2001">80</population>
– <population year="2000">80</population>
– <population>80</population>
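A default declared in an internal DTD subset is supplied by the parser when the attribute is omitted. The sketch below illustrates this with Python's standard `xml.etree.ElementTree`; the helper name `year_of` is an assumption for the example, and the parser applies the default without performing full validation.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset carrying the default-value declaration.
DTD = (
    "<!DOCTYPE population ["
    "<!ELEMENT population (#PCDATA)>"
    '<!ATTLIST population year CDATA "2000">'
    "]>"
)


def year_of(xml_fragment):
    """Parse the fragment under the DTD above and return its year attribute."""
    return ET.fromstring(DTD + xml_fragment).get("year")


explicit = year_of('<population year="2001">80</population>')
defaulted = year_of("<population>80</population>")
```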

87

Defining Fixed Attributes

<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #FIXED "2000">

– Invalid: <population year="2001">80</population>
– Valid: <population year="2000">80</population>
– Valid: <population>80</population>

88

Defining Unique Attributes

<!ELEMENT animal (name)>
<!ATTLIST animal code ID #REQUIRED>

The code attribute has to be unique in the XML document
– <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>

89

Referring Unique Attributes

<!ELEMENT website (url)>
<!ATTLIST website animal_refer IDREF #REQUIRED>

animal_refer attribute refers to a previously defined ID attribute
– <website animal_refer="T50">
    <url>http://www.lions.com</url>
  </website>

90

Referring Multiple Unique Attributes

<!ELEMENT website (url)>
<!ATTLIST website contents IDREFS #REQUIRED>

contents attribute contains a series of IDs
– <website contents="T50 T51">
    <url>http://www.animals.com</url>
  </website>
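The ID, IDREF, and IDREFS constraints can be pictured as simple checks over attribute values. The following sketch performs those checks by hand in Python (the standard library does not validate against a DTD); the enclosing `zoo` root element is a hypothetical wrapper added so the fragments form one document.

```python
import xml.etree.ElementTree as ET

doc = """<zoo>
  <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>
  <website contents="T50 T51"><url>http://www.animals.com</url></website>
</zoo>"""

root = ET.fromstring(doc)

# ID: no two code attributes may share a value.
codes = [animal.get("code") for animal in root.iter("animal")]
ids_unique = len(codes) == len(set(codes))

# IDREFS: a space-separated list, each entry naming an existing ID.
by_code = {animal.get("code"): animal.findtext("name") for animal in root.iter("animal")}
refs = root.find("website").get("contents").split()
referenced = [by_code.get(ref) for ref in refs]
dangling = [ref for ref in refs if ref not in by_code]
```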

91

XML Example - the DTD

<!ELEMENT addressBook (person)+>
<!ELEMENT person (name, email*, link?) >
<!ATTLIST person id ID #REQUIRED >
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (#PCDATA | family | given)* >
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY >
<!ATTLIST link manager IDREF #IMPLIED
               subordinates IDREFS #IMPLIED>

92

DOCTYPE declarations

Internal: local definition of DTD

External: to an external file

Can combine both

93

Internal DTD

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE foo [
  <!ELEMENT foo (#PCDATA)>
]>
<foo>Hello World.</foo>

94

Internal DTD: rules

The document type declaration must be placed between the XML declaration and the first element (root element) in the document.

The keyword DOCTYPE must be followed by the name of the root element in the XML document.

The keyword DOCTYPE must be in upper case.

95

External DTD

Useful for creating a common DTD that can be shared between multiple documents.

Any changes that are made to the external DTD automatically update all the documents that reference it.

Two types: private, and public.

Rules:
– If any elements, attributes, or entities are used in the XML document that are referenced or defined in an external DTD, standalone="no" must be included in the XML declaration.

http://xmlwriter.net/xml_guide/xml_declaration.shtml

96

"Private" External DTDs

Identified by the keyword SYSTEM

Intended for use by a single author or group of authors.

Example:
<!DOCTYPE root_element SYSTEM "DTD_location">

where: DTD_location is a relative or absolute URL (such as “http://” and “file://”).

97

"Private" External DTDs (cont)

XML document:
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE document SYSTEM "subjects.dtd">
<document> … </document>

subjects.dtd:
<!ELEMENT document …>
…

98

“Public" External DTDs

Identified by the keyword PUBLIC

Intended for broad use.

<!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location">

where:
– DTD_location: relative or absolute URL
– DTD_name: follows the syntax
  "prefix//owner_of_the_DTD//description_of_the_DTD//ISO 639_language_identifier"
– "DTD_location" is used to find the public DTD if it cannot be located by the "DTD_name".

http://xmlwriter.net/xml_guide/glossary.shtml#URL

99

“Public" External DTDs (cont)

<?xml version="1.0" standalone="no" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
    "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
  <HEAD>
    <TITLE>A typical HTML file</TITLE>
  </HEAD>
  <BODY>
    …
  </BODY>
</HTML>

100

“Public" External DTDs (cont)

Valid DTD_name Prefix:

ISO : The DTD is an ISO standard. All ISO standards are approved.

+ : The DTD is an approved non-ISO standard.

- : The DTD is an unapproved non-ISO standard.

101

Combining Internal and External DTDs

A document can use both internal and external DTD subsets.

The internal DTD subset is specified between the square brackets of the DOCTYPE declaration.

The declaration for the external DTD subset is placed before the square brackets, immediately after the SYSTEM keyword.

Declaring an ELEMENT with the same name in both the internal and external DTD subsets is invalid.

102

Example

<?xml version="1.0" standalone="no" ?>
<!DOCTYPE document SYSTEM "subjects.dtd" [
  <!ATTLIST assessment assessment_type (exam | assignment | prac)>
  <!ELEMENT results (#PCDATA)>
]>

subjects.dtd
<!ELEMENT document (title*,subjectID,subjectname,prerequisite?,classes,assessment,syllabus,textbooks*)>
<!ELEMENT prerequisite (subjectID,subjectname)>
…

103

DTD Validation

XML content can be well-formed but invalid under DTD rules

e.g. DTD rule: <!ELEMENT name (#PCDATA)>

Acceptable:
<name> Giancarlo Succi </name>

Unacceptable:
<name>
  <first_name> Giancarlo </first_name>
  <last_name> Succi </last_name>
</name>
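The distinction can be seen with a non-validating parser: both fragments parse, because both are well-formed, yet only the first satisfies the (#PCDATA) rule. The check below is a hand-written stand-in for that single rule, not a DTD validator; the function name `is_pcdata_only` is an assumption for this sketch.

```python
import xml.etree.ElementTree as ET


def is_pcdata_only(elem):
    """Stand-in for the rule <!ELEMENT name (#PCDATA)>: no child elements allowed."""
    return len(elem) == 0


# Both documents are well-formed, so parsing succeeds for each.
acceptable = ET.fromstring("<name> Giancarlo Succi </name>")
unacceptable = ET.fromstring(
    "<name><first_name> Giancarlo </first_name><last_name> Succi </last_name></name>"
)

acceptable_ok = is_pcdata_only(acceptable)
unacceptable_ok = is_pcdata_only(unacceptable)
```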

104

Beyond DTDs…

DTD limitations
– Simple document structures
– Lack of “real” datatypes

Advanced schema languages
– XML Schema
– Relax NG
– …

105

References


http://www.java.sun.com/xml/docs/tutorial/TOC.html

http://www.xml.com/pub/a/1999/09/expat/index.html

http://xmlfiles.com/dtd/dtd_attributes.asp

http://xmlwriter.net/xml_guide/doctype_declaration.shtml
