1 Chapter 30 Semistructured Data and XML Transparencies © Pearson Education Limited 1995, 2005.
1
Chapter 29
Semistructured Data and XML
Transparencies
2
Chapter - Objectives
What semistructured data is.
Concepts of the Object Exchange Model (OEM), a model for semistructured data.
Main language elements of XML.
Difference between well-formed and valid XML documents.
How Document Type Definitions (DTDs) can be used to define the valid syntax of an XML document.
3
Chapter - Objectives
About other related XML technologies.
Limitations of DTDs and how the W3C XML Schema overcomes these limitations.
How RDF and RDF Schema provide a foundation for processing meta-data.
Proposals for a W3C Query Language.
5
Introduction
XML 1.0 was formally ratified by the W3C in 1998, and it is already set to impact every aspect of programming, including graphical interfaces, embedded systems, distributed systems, and database management.
It is already becoming the de facto standard for data communication within the software industry, and is quickly replacing EDI systems as the primary medium for data interchange among businesses.
Some analysts believe it will become the language in which most documents are created and stored, both on and off the Internet.
6
Introduction
Due to the nature of information on the Web and the inherent flexibility of XML, it is expected that much of the data encoded in XML will be semistructured; i.e., the data may be irregular or incomplete, and its structure may change rapidly or unpredictably.
Unfortunately, relational, object-oriented, and object-relational DBMSs do not handle data of this nature particularly well.
7
Semistructured Data
Data that may be irregular or incomplete and have a structure that may change rapidly or unpredictably.
Semistructured data is data that has some structure, but the structure may not be rigid, regular, or complete.
Generally, the data does not conform to a fixed schema (the terms schema-less and self-describing are sometimes used to describe such data).
8
Semistructured Data
The information normally associated with a schema is contained within the data itself.
In some forms of semistructured data there is no separate schema, in others it exists but only places loose constraints on the data.
9
Semistructured Data
Has gained importance recently for various reasons:
– may be desirable to treat Web sources like a database, but cannot constrain these sources with a schema;
– may be desirable to have a flexible format for data exchange between disparate databases;
– emergence of XML as standard for data representation and exchange on the Web, and similarity between XML documents and semistructured data.
10
Example 29.1
11
Example 29.1
Note, data is not regular:
– for John White, hold first and last names, but for Ann Beech store single name and also store a salary;
– for property at 2 Manor Rd, store a monthly rent whereas for property at 18 Dale Rd, store an annual rent;
– for property at 2 Manor Rd, store property type (flat) as a string, whereas for property at 18 Dale Rd, store type (house) as an integer value.
12
Example 29.1
13
Object Exchange Model (OEM)
Data in OEM is schema-less and self-describing, and can be thought of as a labeled directed graph where the nodes are objects, each consisting of:
– a unique object identifier (for example, &7),
– a descriptive textual label (street),
– a type (string),
– a value (“22 Deer Rd”).
Objects are decomposed into atomic and complex:
– an atomic object contains a value for a base type (e.g., integer or string) and can be recognized in the diagram as one that has no outgoing edges;
– all other objects are complex objects, whose type is a set of object identifiers.
14
Object Exchange Model (OEM)
A label indicates what the object represents and is used to identify the object and to convey the meaning of the object, and so should be as informative as possible.
Labels can change dynamically.
A name is a special label that serves as an alias for a single object and acts as an entry point into the database (for example, DreamHome is a name that denotes object &1).
15
Object Exchange Model (OEM)
An OEM object can be considered as a quadruple (label, oid, type, value).
For example:
{Staff, &4, set, {&9, &10}}
{name, &9, string, “Ann Beech”}
{salary, &10, decimal, 12000}
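The quadruple view can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name `OEMObject`, the `store` dictionary keyed by oid, and the helper `describe` are hypothetical names, not part of OEM itself; the three objects are the ones listed above.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class OEMObject:
    """One OEM object as a (label, oid, type, value) quadruple."""
    label: str
    oid: str
    type: str  # "set" marks a complex object; anything else is a base type
    value: Union[str, int, float, List[str]]

    def is_atomic(self) -> bool:
        return self.type != "set"


# The three objects from the slide.
objects = [
    OEMObject("Staff", "&4", "set", ["&9", "&10"]),
    OEMObject("name", "&9", "string", "Ann Beech"),
    OEMObject("salary", "&10", "decimal", 12000),
]
store: Dict[str, OEMObject] = {obj.oid: obj for obj in objects}


def describe(oid: str, store: Dict[str, OEMObject]) -> dict:
    """Follow the outgoing edges of a complex object and collect the
    label/value pairs of its atomic children."""
    parent = store[oid]
    if parent.is_atomic():
        return {parent.label: parent.value}
    children = [store[child_oid] for child_oid in parent.value]
    return {child.label: child.value for child in children if child.is_atomic()}


print(describe("&4", store))
```

Because every object carries its own label and type, no separate schema is consulted anywhere in the lookup.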
16
Semistructured Data - Case Study: Object Exchange Model
17
OEM Features
• Common model for heterogeneous information exchange, self-describing
• Each object has four fields:
  OID    unique identifier or NULL
  Label  character string descriptor
  Type   atomic data type or set
  Value  atomic value or set of object references
• “Help pages” for labels
• Query language OEM-QL
18
Representing Semistructured Data Using OEM
<collection, {b1, a1, ...}>
  b1: <book, {t, a}>
    t: <title, “Database and ...”>
    a: <author, {n, p}>
      n: <name, “Jeff Ullman”>
      p: <picture, “/gifs/ullman.gif”>
  a1: <article, {v, w, x}>
    v: <author, “Gio Wiederhold”>
    w: <title, “Mediators in the …”>
    x: <journal, “IEEE Computer”>
  ...
[Figure annotations: Label, Set Value, Atomic Value, Memory Addresses]
19
An OEM Query Language: OEM-QL
• Logic-based language for OEM
  – Match object patterns, generate variable bindings, construct new OEM objects from existing ones
• Get articles published in “IEEE Computer”
  P :-
    P:<articles {<journal “IEEE Computer”>}>
• Get titles of books by “Jeff Ullman”
  <answer_title T> :-
    <book {<author “Jeff Ullman”> <title T>}>
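The idea of matching an object pattern and binding a variable can be imitated in plain Python. The sketch below is not OEM-QL: the nested `(label, value)` tuples and the helpers `select` and `field` are hypothetical stand-ins, and only direct sub-objects with atomic values are matched. The data is the collection from the previous slide, with the truncated titles kept as they appear there.

```python
# Each object is a (label, value) pair; a value is either an atomic string
# or a list of sub-objects (a "set value").
collection = ("collection", [
    ("book", [
        ("title", "Database and ..."),
        ("author", [("name", "Jeff Ullman"), ("picture", "/gifs/ullman.gif")]),
    ]),
    ("article", [
        ("author", "Gio Wiederhold"),
        ("title", "Mediators in the …"),
        ("journal", "IEEE Computer"),
    ]),
])


def matches(obj, label, required):
    """True if obj has the given label and, for every (sub_label, atom) in
    required, a direct sub-object with exactly that label and value."""
    obj_label, value = obj
    if obj_label != label or not isinstance(value, list):
        return False
    return all(pair in value for pair in required)


def select(root, label, required):
    """All direct members of root that match the pattern."""
    return [obj for obj in root[1] if matches(obj, label, required)]


def field(obj, label):
    """Values of the direct sub-objects of obj that carry the given label."""
    return [value for (sub_label, value) in obj[1] if sub_label == label]


# "Get articles published in IEEE Computer", then bind their titles.
articles = select(collection, "article", [("journal", "IEEE Computer")])
titles = [title for article in articles for title in field(article, "title")]
print(titles)
```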
20
XML
Vendors introduced browser-specific HTML tags, making it difficult to develop sophisticated, widely viewable Web documents.
The W3C produced a new standard called XML, which preserves the general application independence that makes HTML portable and powerful.
21
XML
XML is a restricted version of SGML, designed especially for Web documents.
SGML allows a document to be logically separated into two parts: one that defines the structure of the document (the DTD), and another containing the text itself.
By giving documents a separately defined structure, and by giving authors the ability to define custom structures, SGML provides an extremely powerful document management system.
However, SGML has not been widely adopted due to its inherent complexity.
22
XML
XML attempts to provide a similar function to SGML, but is less complex and, at the same time, network-aware.
XML retains key SGML advantages of extensibility, structure, and validation.
Since XML is a restricted form of SGML, any fully compliant SGML system will be able to read XML documents (although the opposite is not true).
XML is not intended as a replacement for SGML or HTML.
23
XML (eXtensible Markup Language)
origins: HTML + SGML (ISO Standard, 1986, ~600 pp)
W3C standard (~26 pp): XML syntax + DTDs
XML = HTML - presentational tags + user-defined DTD (tags + nesting)
=> a metalanguage for defining other languages via DTDs
=> XML is more like SGML than HTML
XML = SGML - {complexity, document perspective} + {simplicity, data exchange perspective}
24
Advantages of XML
Simplicity
Open standard and platform/vendor-independent
Extensibility
Reuse
Separation of content and presentation
Improved load balancing
25
Advantages of XML
Support for integration of data from multiple sources
Ability to describe data from a wide variety of applications
More advanced search engines
New opportunities.
26
Why are Database folks so excited about XML?
XML is just a syntax for (self-describing) data
This is still exciting because
– No standard syntax for relational data
– With XML, we can
  » Translate any legacy data to XML
  » Exchange data in XML format
    Ship over the web, input to any application
27
XML machine accessible meaning
This is what a web-page in natural language looks like for a machine
28
XML machine accessible meaning
[Figure: a CV page whose parts are marked up with the tags CV, name, education, work, and private]
XML allows “meaningful tags” to be added to parts of the text
29
XML machine accessible meaning
[Figure: the tagged CV (CV, name, education, work, private) as a machine sees it]
But to your machine, the tags look like this….
30
XML machine accessible meaning
Schemas help….
[Figure: two CV documents, each tagged with CV, name, education, work, and private, linked to each other]
…by relating common terms between documents
31
But other people use other schemas
[Figure: a CV document marked up by someone else]
Someone else has one like this….
32
But other people use other schemas
[Figure: the two linked CV documents next to a third CV document that uses another schema]
…which don’t fit in
Moral: There is still need for ontology mapping.
33
An HTML document
34
HTML code
<title>ICS185/ICS180 - Spring, 2003</title>
<body bgcolor="#d0d0ff">
<H2>Index</H2>
<UL>
  <LI> <a HREF = "#announcements">Announcements </a>
  <LI> <a HREF = "#geninfo">Course Information </a>
</UL>
<H2>Course Information</H2>
<a href="geninfo.html">General Information</a>. The following are a few important entries:
<UL>
  <li> <A HREF = "geninfo.html#goals">Course Goals</A><BR>
  <li> <A HREF = "geninfo.html#crsenum">About the course numbers</A><BR>
</UL>
</body>
35
What is the problem?
To do more fancy things with documents:
– need to make their logical structure explicit.
Otherwise, software applications
– do not know what is what
– do not have any handle over documents.
36
An XML document
<?xml version="1.0" ?>
<bib>
  <vendor id="id3_4">
    <name>QuickBooks</name>
    <email>[email protected]</email>
    <phone>1-800-333-9999</phone>
    <book>
      <title>Inorganic Chemistry</title>
      <publisher>Brooks/Cole Publishing</publisher>
      <year>1991</year>
      <author>
        <firstname>James</firstname>
        <lastname>Bowser</lastname>
      </author>
      <price>43.72</price>
    </book>
  </vendor>
</bib>
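Because the tags name the data, an application can pull out fields by name. A minimal sketch using Python's standard `xml.etree.ElementTree`, assuming an abridged copy of the document above (the email and phone elements are left out) and a hypothetical helper `book_records`:

```python
import xml.etree.ElementTree as ET

# Abridged copy of the bibliography document from the slide.
BIB_XML = """<?xml version="1.0" ?>
<bib>
  <vendor id="id3_4">
    <name>QuickBooks</name>
    <book>
      <title>Inorganic Chemistry</title>
      <publisher>Brooks/Cole Publishing</publisher>
      <year>1991</year>
      <author>
        <firstname>James</firstname>
        <lastname>Bowser</lastname>
      </author>
      <price>43.72</price>
    </book>
  </vendor>
</bib>"""


def book_records(xml_text):
    """Return one flat dictionary per <book>, tagged with its vendor."""
    root = ET.fromstring(xml_text)
    records = []
    for vendor in root.findall("vendor"):
        for book in vendor.findall("book"):
            author = book.find("author")
            records.append({
                "vendor_id": vendor.get("id"),        # attribute lookup
                "vendor": vendor.findtext("name"),    # child element text
                "title": book.findtext("title"),
                "year": int(book.findtext("year")),
                "author": f"{author.findtext('firstname')} {author.findtext('lastname')}",
                "price": float(book.findtext("price")),
            })
    return records


for record in book_records(BIB_XML):
    print(record)
```

Note that the parser returns every value as text; the conversions to `int` and `float` are the application's own decision, since the document carries no type information.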
37
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
HTML describes presentation
XML describes content
38
What is XML?
eXtensible Markup Language
Data are identified using tags (identifiers enclosed in angle brackets: <...>)
Collectively, the tags are known as “markup”
XML tags tell you what the data means, rather than how to display it
39
XML versus relational
– Relational: structured
– XML: semi-structured
– Plain text file: unstructured
40
How does XML work?
XML allows developers to write their own Document Type Definitions (DTD)
A DTD is a markup language’s rule book: it describes the sets of tags and attributes that are used to describe specific content
For a document to be valid, any tag it uses must first be defined in the DTD
41
Key Components in XML
Three generic components, and one customizable component
– XML Content
– DTD Rules
– XML Parser
– Application
42
Meta Markup Language
Not a language – but a way of specifying other languages
Meta-markup language – gives the rules by which other markup languages can be written
Portable - platform independent
43
Markup Languages
Presentation based:
– Markup languages that describe information for presentation for human consumption
Content based:
– Describe information that is of interest to another computer application
44
HTML and XML
HTML tag says "display this data in bold font" – <b>...</b>
XML tag acts like a field name in your program
It puts a label on a piece of data that identifies it
– <message>...</message>
45
HTML vs. XML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
“Self-describing”
- Schema info part of the data
- Good for data exchange (albeit baroque for storage)
46
Simple Example
XML data for a messaging application:
  <message>
    <to>[email protected]</to>
    <from>[email protected]</from>
    <text> Why is it good? Let me count the ways... </text>
  </message>
47
Element
Data between the tag and its matching end tag defines an element of the data
Comment:– <!-- This is a comment -->
48
Example
<!-- Using attributes -->
<message to="[email protected]" from="[email protected]">
  <text>Why is it good? Let me count the ways...</text>
</message>
49
Attributes
Tags can also contain attributes
Attributes contain additional information included as part of the tag, within the tag's angle brackets
Attribute name is followed by an equality sign and the attribute value
50
Other Basics
White space is essentially irrelevant
Commas between attributes are not ignored - if present, they generate an error
Case sensitive: “message” and “MESSAGE” are different
51
Well Formed XML
Every tag has a closing tag
XML represents hierarchical data structures having one tag to contain others
– Tags have to be completely nested
Correct:
– <message>..<to>..</to>..</message>
Incorrect:
– <message>..<to>..</message>..</to>
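Any conforming parser rejects a document that breaks these rules, so well-formedness can be tested simply by trying to parse. A minimal sketch with Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False if the parser reports
    a well-formedness error (unclosed tag, bad nesting, and so on)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: <to> closes before <message> does.
print(is_well_formed("<message><to>you</to></message>"))
# Overlapping tags: <message> closes while <to> is still open.
print(is_well_formed("<message><to>you</message></to>"))
# Names are case sensitive, so </MESSAGE> does not close <message>.
print(is_well_formed("<message>hello</MESSAGE>"))
```

This checks only well-formedness; whether the document also obeys a DTD is a separate question.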
52
Empty Tag
Empty tag is used when it makes sense to have a tag that stands by itself and doesn't enclose any content - a "flag"
– You can create an empty tag by ending it with />
– <flag/>
53
Example
<message to="[email protected]" from="[email protected]" subject="XML is good">
  <flag/>
  <text> Why is it good? Let me count the ways... </text>
</message>
54
Tree representation
<BOOKS>
  <book id="123" loc="library">
    <author>Hull</author>
    <title>California</title>
    <year> 1995 </year>
  </book>
  <article id="555" ref="123">
    <author>Su</author>
    <title> Purdue</title>
  </article>
</BOOKS>
[Figure: tree with root BOOKS; a book node (id 123, loc="library") with children author (Hull), title (California), and year (1995); an article node (id 555) with children author (Su) and title (Purdue), plus a ref edge to node 123]
55
Prolog in XML Files
XML file always starts with a prolog
The minimal prolog contains a declaration that identifies the document as an XML document:
  <?xml version="1.0"?>
The declaration may also contain additional information
– version - version of the XML used in the data
– encoding - identifies the character set used
– standalone - whether the document references an external entity or data type specification
56
Detailed Example of XML File
A simple version of the kind of XML data you could use for a slide presentation
You can use your text editor to create the data
– Step 1: create a file named slideSample01.xml
– Step 2: write the declaration, which identifies the file as an XML document
    <?xml version='1.0' encoding='us-ascii'?>
57
Defining the Root Element
– Step 3: Adding a comment…
    <!-- A SAMPLE set of slides -->
– Step 4: Defining the Root Element…
    <slideshow> </slideshow>
After the declaration, every XML file defines exactly one element, known as the root element
Any other elements in the file are contained within that element
58
Attributes
A slide presentation has a title...
  <slideshow title="Sample Slide Show">
  </slideshow>
59
Adding Nested Elements
– Step 5: Adding Nested Elements
    <slideshow...
      <!-- TITLE SLIDE -->
      <slide title="Title of Talk"/>
      <!-- TITLE SLIDE -->
      <slide type="all">
        <title>Introduction to XML </title>
      </slide>
    </slideshow>
60
Attribute vs. Element
type of the slide is defined as an attribute
– Slides could be earmarked for a mostly technical or mostly executive audience with type="tech" or type="exec", or identified as suitable for both with type="all"
title is defined as an element
The title is something the audience will see
– So it is an element
The type is something that never gets presented
– So it is an attribute
61
Adding Text
– Step 6: Adding Text
    <slideshow>…
      <!-- OVERVIEW -->
      <slide type="all">
        <title>Overview</title>
        <item>Why is XML great?</item>
        <item>Who uses it?</item>
      </slide>
    </slideshow>
62
Adding an Empty Element
– Step 7: Adding an Empty Element
    <slideshow> …
      <!-- OVERVIEW -->
      <slide> …
        <!-- define an empty list item -->
        <item/>…
      </slide>
    </slideshow>
63
Complete Example
<?xml version="1.0" encoding="us-ascii" ?>
<!-- A SAMPLE set of slides -->
<slideshow title="Sample Slide Show">
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>
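To see how an application consumes this file, the sketch below reads the slide show with Python's standard `xml.etree.ElementTree`. It is an illustration only: the function name `summarize` is hypothetical, the document is pasted in as a string rather than read from slideSample01.xml, and the first slide's title follows the wording used in Step 5.

```python
import xml.etree.ElementTree as ET

SLIDESHOW_XML = """<?xml version="1.0"?>
<!-- A SAMPLE set of slides -->
<slideshow title="Sample Slide Show">
  <!-- TITLE SLIDE -->
  <slide type="all">
    <title>Introduction to XML</title>
  </slide>
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    <item>Why is XML great?</item>
    <item />
  </slide>
</slideshow>"""


def summarize(xml_text):
    """Return the show title and, per slide, (type attribute, title text,
    list of item texts). An empty <item/> has no text, so it yields None."""
    root = ET.fromstring(xml_text)
    slides = [
        (slide.get("type"),
         slide.findtext("title"),
         [item.text for item in slide.findall("item")])
        for slide in root.findall("slide")
    ]
    return root.get("title"), slides


show_title, slides = summarize(SLIDESHOW_XML)
print(show_title)
for slide in slides:
    print(slide)
```

The attribute (`type`) is read with `get`, while the presented content (`title`, `item`) is read as element text, which mirrors the attribute-versus-element distinction made earlier.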
64
XML Parsing – IE Example
65
Processing Instructions
An XML file can also contain processing instructions that give commands or information to an application that is processing the XML data:
  <?target instructions?>
– target is the name of the application that is expected to do the processing
– instructions is a string of characters that embodies the information or commands for the application to process
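A parser hands processing instructions to the application as a target plus an uninterpreted string. A small sketch using Python's standard `xml.dom.minidom`, which keeps such nodes in the tree; the target `slideviewer` and its instruction text are made up for this example.

```python
from xml.dom import minidom

# "slideviewer" is a hypothetical application name used as the PI target.
DOC = '<?xml version="1.0"?><?slideviewer theme="dark"?><slideshow/>'


def processing_instructions(xml_text):
    """Return (target, instructions) for every processing instruction that
    appears at the top level of the document."""
    document = minidom.parseString(xml_text)
    return [
        (node.target, node.data)
        for node in document.childNodes
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    ]


print(processing_instructions(DOC))
```

The XML declaration itself looks like a processing instruction but is not reported as one; only the `slideviewer` instruction is returned.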
66
XML
67
XML -Elements
Elements, or tags, are the most common form of markup.
The first element must be a root element, which can contain other (sub)elements.
An XML document must have one root element (<STAFFLIST>).
An element begins with a start-tag (<STAFF>) and ends with an end-tag (</STAFF>).
XML elements are case sensitive.
An element can be empty, in which case it can be abbreviated to <EMPTYELEMENT/>.
Elements must be properly nested.
68
XML - Attributes
Attributes are name-value pairs that contain descriptive information about an element.
An attribute is placed inside the start-tag after the corresponding element name, with the attribute value enclosed in quotes: <STAFF branchNo = "B005">
Could also have represented branch as subelement of STAFF.
A given attribute may only occur once within a tag, while subelements with same tag may be repeated.
69
Document Type Definition (DTD)
DTD specifies the types of tags that can be included in the XML document
– it defines which tags are valid, and in what arrangements
– it defines where text is expected, letting the parser determine whether the whitespace it sees is significant or ignorable
An optional part of the document prolog
70
XML document and DTD
[Figure: tree of an XML document (Slideshow, slide, title, and item nodes; titles DB and AI; items item1 to item3) next to the tree of its DTD (Slideshow, Slide+, title, item*)]
XML Document
XML DTD
<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple "slide show".-->
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
71
Detailed DTD Example
– Step 1: Create a file named slideshow.dtd
– Step 2: Enter an XML declaration
    <?xml version='1.0' encoding='us-ascii'?>
    <!-- DTD for a simple "slide show". -->
– Step 3: Specify the contents of a slideshow element
    …
    <!-- slideshow contains 1+ slide elements -->
    <!ELEMENT slideshow (slide+)>
72
Qualifiers
<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple example. -->
<!ELEMENT slideshow (slide+)>
slideshow element contains slide elements and nothing else
Qualifier   Meaning
?           Optional (zero or one)
*           Zero or more
+           One or more
73
Grouping multiple items
((image, title)+)
Every image element must be paired with a title element
Plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur
74
Defining Text and Nested Elements
– Step 4: Defining Text and Nested Elements
    <!ELEMENT slide (title, item*)>
    <!ELEMENT title (#PCDATA)>
    <!ELEMENT item (#PCDATA | item)* >
Text = Parsed Character DATA (PCDATA)
"#" that precedes PCDATA indicates that what follows is a special word, rather than an element name
75
Complete Example
<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple "slide show".-->
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
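The qualifiers ?, * and + behave like their regular-expression counterparts, which suggests a rough way to check a document against these four declarations. The sketch below is not a DTD parser or a real validator: the table `CONTENT_MODELS` is a hand translation of the declarations into regular expressions over child tag names, the function name `violations` is hypothetical, and stray text inside element-only content is not checked. Python's standard library has no DTD validator, so a real application would use a validating parser instead.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of each content model into a regex over the sequence
# of child tag names, each name followed by one space.
CONTENT_MODELS = {
    "slideshow": r"(slide )+",   # (slide+)
    "slide": r"title (item )*",  # (title, item*)
    "title": r"",                # (#PCDATA): text only, no child elements
    "item": r"(item )*",         # (#PCDATA | item)*: text mixed with items
}


def violations(element):
    """Return a list of messages for every element whose children do not
    match its content model, checking the whole subtree."""
    problems = []
    children = "".join(child.tag + " " for child in element)
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        problems.append(f"undeclared element <{element.tag}>")
    elif re.fullmatch(model, children) is None:
        problems.append(f"<{element.tag}> has children [{children.strip()}]")
    for child in element:
        problems.extend(violations(child))
    return problems


good = ET.fromstring(
    "<slideshow><slide><title>Overview</title>"
    "<item>Why is XML great?</item><item/></slide></slideshow>"
)
bad = ET.fromstring("<slideshow><slide><item>No title</item></slide></slideshow>")

print(violations(good))
print(violations(bad))
```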
76
Attribute Types
(#PCDATA | item)*
Vertical bar (|) indicates an “or” condition
In this case, either PCDATA or an item can occur
Attribute Type   Specifies...
CDATA            "Unparsed character data" (a text string).
ID               A name that no other ID attribute shares.
IDREF            A reference to an ID defined elsewhere in the document.
IDREFS           A space-separated list containing one or more ID references.
ENTITY           The name of an entity defined in the DTD.
ENTITIES         A space-separated list of entities.
NMTOKEN          A valid XML name composed of letters, numbers, hyphens, underscores, colons, and full stops.
NMTOKENS         A space-separated list of names.
NOTATION         The name of a DTD-specified notation, which describes a non-XML data format, such as those used for image files.
77
What you cannot do?
Double-definition for an item element doesn't work
  <!ELEMENT item (#PCDATA) >
  <!ELEMENT item (#PCDATA, item+) >
– Produces a "duplicate definition" warning
– The second definition is ignored
78
XML Names and NMTOKEN
Name Characters are letters, digits, hyphens, underscores, colons or full stops.
An NMTOKEN is any collection of Name Characters
NMTOKENS is any list of NMTOKENs separated by white space (space, tab, newline etc.)
Case is significant: PERSON and person are distinct names
Attribute and Element names must be (a subset of) NMTOKEN with restrictions
– Names cannot begin with a digit
– Names cannot begin with xml (or any variant obtained by case changes); the system reserves this prefix
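These rules translate directly into two regular expressions. The sketch below is a simplification under stated assumptions: it only accepts ASCII letters and digits, whereas real XML names also allow many Unicode characters, and it additionally rejects a leading hyphen or full stop, which the XML specification also forbids at the start of a name. The function names are hypothetical.

```python
import re

# ASCII-only approximations of the two token classes.
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:\-]+")
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")


def is_nmtoken(text: str) -> bool:
    """Any non-empty run of name characters."""
    return NMTOKEN_RE.fullmatch(text) is not None


def is_name(text: str) -> bool:
    """An NMTOKEN that starts with a letter, underscore, or colon and does
    not begin with 'xml' in any mix of upper and lower case."""
    return NAME_RE.fullmatch(text) is not None and not text.lower().startswith("xml")


for candidate in ["person", "PERSON", "2cool", "XmlThing", "first name"]:
    print(candidate, is_nmtoken(candidate), is_name(candidate))
```

For example, `2cool` is an acceptable NMTOKEN but not a name, and `first name` is neither, because white space separates tokens.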
79
Element Declarations: EMPTY
Keyword ELEMENT introduces a new element
  <!ELEMENT NAME CONTENT_MODEL>
Element name must begin with a letter, and may additionally contain digits and some punctuation, i.e. ‘.’, ‘-’, ‘_’, and ‘:’, as described earlier under NMTOKEN
If an element can hold no child elements, and also no text, then it is known as an empty element and is denoted by EMPTY for CONTENT_MODEL
– This seems trivial but it isn’t, because the presence or absence of this element in an XML file can be used as a flag
– Examples in HTML include HR and IMG, which never have children and include no text. Here we would write
    <!ELEMENT HR EMPTY>
  and then <HR/> or <HR></HR> generates a horizontal line
EMPTY elements can have attributes, such as the SRC attribute in <IMG/> to specify the source of the image.
80
Element Declarations: ANY
An element declared to have a content of ANY may contain all of the other elements declared in the DTD
This is not quite the same as no DTD for the file
  <!DOCTYPE fred [
    <!ELEMENT fred ANY >
  ]>
  <fred>
    <people>Me and You</people>
    <people>Them</people>
  </fred>
Gets an error due to presence of <people> tag
Adding <!ELEMENT people ANY > inside DTD declaration produces a valid document.
81
Entities
The DTD of an XML document can contain entity declarations. These are like macro substitutions in other languages.
ENTITYs are defined in the DTD and come in several flavors:
– General Entities are referenced as &EntName;
– Parameter Entities are referenced as %EntName;
We have already seen the character entities
– &amp; for &
– &apos; for '
– &gt; for >
– &lt; for <
– &quot; for "
These are built in but you could add other such entities with
– <!ENTITY aitself "A" > and &aitself; would be replaced by A
82
General Entities
As another example, we can use in the DTD
  <!ENTITY TODAY "May 12 2003" >
and
  <comment>&TODAY; was very quiet in Irvine</comment>
is parsed as
  <comment>May 12 2003 was very quiet in Irvine</comment>
General Entity references can be nested inside a DTD, e.g., one can write
  <!ENTITY YEAR "2003" >
  <!ENTITY TODAY "May 12 &YEAR;" >
However one must use Parameter Entities and not General Entities for macro substitution in other DTD declarations like <!ATTLIST and <!ELEMENT
Parameter entities are defined as in
  <!ENTITY % CUSTARDTAGS "(NAME,DATE,ORDERS)" >
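The substitution of general entities can be observed with Python's standard `xml.etree.ElementTree`, whose underlying parser expands entities declared in the internal DTD subset. A minimal sketch using the TODAY and YEAR entities from this slide; the variable names are assumptions, and parsing untrusted documents this way is not advisable because entity expansion can be abused.

```python
import xml.etree.ElementTree as ET

# Internal DTD subset declaring a nested pair of general entities.
DOC = """<!DOCTYPE comment [
  <!ENTITY YEAR "2003">
  <!ENTITY TODAY "May 12 &YEAR;">
]>
<comment>&TODAY; was very quiet in Irvine</comment>"""

comment = ET.fromstring(DOC)
print(comment.text)

# The five built-in entities need no declaration at all.
builtin = ET.fromstring("<sample>&lt;tag&gt; &amp; &quot;text&quot; &apos;s</sample>")
print(builtin.text)
```

By the time the application sees the element text, the references have already been replaced; the entity names themselves are no longer visible.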
83
Parameter Entities
<!ENTITY % peopletags "(firstname,lastname,dateofbirth)" >
<!ELEMENT student %peopletags; >
<!ELEMENT teacher %peopletags; >
<!ELEMENT administrator %peopletags; >
Defines a group of people ELEMENTS to have the same child elements
Parameter entities are even more commonly used for attributes, because several ELEMENTS almost always share the same attributes (often with a basic set being augmented in different ways for different ELEMENTS)
– This basic set can be defined in a parameter entity
84
Defining Implied Attributes
Attributes must be declared in the DTD to be able to be used
“Implied” means that this attribute is optional and there is no default value
  <!ELEMENT population (#PCDATA)>
  <!ATTLIST population year CDATA #IMPLIED>
The attribute year can be defined or undefined in the element population.
Valid Examples:
– <population year="2000">80</population>
– <population>80</population>
85
Defining Required Attributes
<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #REQUIRED>
– The population must contain a year attribute:
    <population year="1996">80</population>
<!ELEMENT population (#PCDATA)>
<!ATTLIST population year (2000|2001) #REQUIRED>
– The population must contain a year attribute of 2000 or 2001:
    <population year="2000">80</population>
– No quotes on the enumeration values
86
Defining Default Attributes
<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA "2000">
All these are valid
– <population year="2001">80</population>
– <population year="2000">80</population>
– <population>80</population>
87
Defining Fixed Attributes
<!ELEMENT population (#PCDATA)>
<!ATTLIST population year CDATA #FIXED "2000">
– Invalid: <population year="2001">80</population>
– Valid: <population year="2000">80</population>
– Valid: <population>80</population>
88
Defining Unique Attributes
<!ELEMENT animal (name)>
<!ATTLIST animal code ID #REQUIRED>
The code attribute has to be unique in the XML document
– <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>
89
Referring Unique Attributes
<!ELEMENT website (url)>
<!ATTLIST website animal_refer IDREF #REQUIRED>
The animal_refer attribute refers to the previously defined ID attribute
– <website animal_refer="T50"><url>http://www.lions.com</url></website>
90
Referring Multiple Unique Attributes
<!ELEMENT website (url)>
<!ATTLIST website contents IDREFS #REQUIRED>
The contents attribute contains a series of IDs
– <website contents="T50 T51"><url>http://www.animals.com</url></website>
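A validating parser enforces two things for these attribute types: ID values are unique, and every IDREF or IDREFS value points at an existing ID. The sketch below imitates both checks by hand with Python's standard `xml.etree.ElementTree`, which does not validate. The wrapper element `zoo` and the function name `check_ids` are hypothetical, and the attribute names are passed in explicitly because the sketch does not read the DTD.

```python
import xml.etree.ElementTree as ET

# The animal and website elements from the slides, wrapped in a made-up root.
DOC = """<zoo>
  <animal code="T50"><name>Lion</name></animal>
  <animal code="T51"><name>Rabbit</name></animal>
  <website contents="T50 T51"><url>http://www.animals.com</url></website>
</zoo>"""


def check_ids(root, id_attr="code", refs_attr="contents"):
    """Return (duplicate IDs, dangling references) for the given document."""
    seen, duplicates = set(), []
    for element in root.iter():
        value = element.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    dangling = [
        ref
        for element in root.iter()
        for ref in element.get(refs_attr, "").split()  # IDREFS is space-separated
        if ref not in seen
    ]
    return duplicates, dangling


print(check_ids(ET.fromstring(DOC)))
print(check_ids(ET.fromstring(DOC.replace("T50 T51", "T50 T99"))))
```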
91
XML Example - the DTD
<!ELEMENT addressBook (person)+>
<!ELEMENT person (name, email*, link?) >
<!ATTLIST person id ID #REQUIRED >
<!ATTLIST person gender (male|female) #IMPLIED>
<!ELEMENT name (#PCDATA|(family,given))>
<!ELEMENT family (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT link EMPTY >
<!ATTLIST link manager IDREF #IMPLIED
               subordinates IDREF #IMPLIED>
92
DOCTYPE declarations
Internal: local definition of DTD
External: to an external file
Can combine both
93
Internal DTD
<?xml version="1.0" standalone="yes" ?>
<!--open the DOCTYPE declaration - the open square bracket indicates an internal DTD-->
<!DOCTYPE foo [
<!--define the internal DTD-->
  <!ELEMENT foo (#PCDATA)>
<!--close the DOCTYPE declaration-->
]>
<foo>Hello World.</foo>
94
Internal DTD: rules
The document type declaration must be placed between the XML declaration and the first element (root element) in the document.
The keyword DOCTYPE must be followed by the name of the root element in the XML document.
The keyword DOCTYPE must be in upper case.
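The second rule can be checked mechanically. A non-validating parser will not complain when the DOCTYPE name and the root element differ, so the sketch below compares them by hand using Python's standard `xml.dom.minidom`; the function name `doctype_matches_root` is an assumption made for this example.

```python
from xml.dom import minidom

GOOD = """<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE foo [
  <!ELEMENT foo (#PCDATA)>
]>
<foo>Hello World.</foo>"""

# Same document, but the DOCTYPE names an element that is not the root.
BAD = GOOD.replace("<!DOCTYPE foo", "<!DOCTYPE bar")


def doctype_matches_root(xml_text: str) -> bool:
    """True if the document has a DOCTYPE whose name equals the tag name
    of the root element."""
    document = minidom.parseString(xml_text)
    doctype = document.doctype
    return doctype is not None and doctype.name == document.documentElement.tagName


print(doctype_matches_root(GOOD))
print(doctype_matches_root(BAD))
```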
95
External DTD
Useful for creating a common DTD that can be shared between multiple documents.
Any changes that are made to the external DTD automatically update all the documents that reference it.
Two types: private, and public.
Rules:
– If any elements, attributes, or entities are used in the XML document that are referenced or defined in an external DTD, standalone="no" must be included in the XML declaration.
96
"Private" External DTDs
Identified by the keyword SYSTEM
Intended for use by a single author or group of authors.
Example:
  <!DOCTYPE root_element SYSTEM "DTD_location">
where: DTD_location is a relative or absolute URL (such as “http:/” and “file:/”).
97
"Private" External DTDs (cont)
XML document:
  <?xml version="1.0" standalone="no" ?>
  <!DOCTYPE document SYSTEM "subjects.dtd">
  <document> … </document>
subjects.dtd:
  <!ELEMENT document …>
  …
98
“Public" External DTDs
Identified by the keyword PUBLIC
Intended for broad use.
  <!DOCTYPE root_element PUBLIC "DTD_name" "DTD_location">
where:
– DTD_location: relative or absolute URL
– DTD_name: follows the syntax:
    "prefix//owner_of_the_DTD//description_of_the_DTD//ISO 639_language_identifier"
– "DTD_location" is used to find the public DTD if it cannot be located by the "DTD_name".
99
“Public" External DTDs (cont)
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
  "http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
  <HEAD>
    <TITLE>A typical HTML file</TITLE>
  </HEAD>
  <BODY>
    …
  </BODY>
</HTML>
100
“Public" External DTDs (cont)
Valid DTD_name Prefix:
– ISO : The DTD is an ISO standard. All ISO standards are approved.
– + : The DTD is an approved non-ISO standard.
– - : The DTD is an unapproved non-ISO standard.
101
Combining Internal and External DTDs
A document can use both internal and external DTD subsets.
The internal DTD subset is specified between the square brackets of the DOCTYPE declaration.
The declaration for the external DTD subset is placed before the square brackets, immediately after the SYSTEM keyword.
Declaring an ELEMENT with the same name in both the internal and external DTD subsets is invalid.
102
Example
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE document SYSTEM "subjects.dtd" [
  <!ATTLIST assessment assessment_type (exam | assignment | prac)>
  <!ELEMENT results (#PCDATA)>
]>
subjects.dtd
<!ELEMENT document (title*,subjectID,subjectname,prerequisite?,classes,assessment,syllabus,textbooks*)>
<!ELEMENT prerequisite (subjectID,subjectname)>
…
103
DTD Validation
XML content can be well-formed but invalid under DTD rules
e.g. DTD rule:
  <!ELEMENT name (#PCDATA)>
Acceptable:
  <name> Giancarlo Succi </name>
Unacceptable:
  <name>
    <first_name> Giancarlo </first_name>
    <last_name> Succi </last_name>
  </name>
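The difference between the two cases shows up clearly in code: both fragments parse, so both are well-formed, but only the first satisfies the rule that name holds text only. The sketch below hard-codes that single rule; the function name `satisfies_pcdata_only` is hypothetical, and a real application would hand the DTD to a validating parser rather than test rules one by one.

```python
import xml.etree.ElementTree as ET

ACCEPTABLE = "<name> Giancarlo Succi </name>"
UNACCEPTABLE = (
    "<name>"
    "<first_name> Giancarlo </first_name>"
    "<last_name> Succi </last_name>"
    "</name>"
)


def satisfies_pcdata_only(xml_text: str) -> bool:
    """Parse the fragment (raising ParseError if it is not well-formed) and
    report whether the root element has no child elements, as required by
    a (#PCDATA) content model."""
    element = ET.fromstring(xml_text)
    return len(element) == 0


print(satisfies_pcdata_only(ACCEPTABLE))
print(satisfies_pcdata_only(UNACCEPTABLE))
```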
104
Beyond DTDs…
DTD limitations
– Simple document structures
– Lack of “real” datatypes
Advanced schema languages
– XML Schema
– Relax NG
– …