9 . 11 .200 7

45
Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz 9 9 . . 11 11 .200 .200 7 7

description

Standards for digital encoding Tomaž Erjavec Institut für Informationsverarbeitung Geisteswissenschaftliche Fakultät Karl-Franzens-Universität Graz. 9 . 11 .200 7. Overview. a few words about me a few words about you a short introduction to standards some words on XML Practicum: - PowerPoint PPT Presentation

Transcript of 9 . 11 .200 7

Page 1: 9 . 11 .200 7

Standards for digital encoding

Tomaž Erjavec

Institut für InformationsverarbeitungGeisteswissenschaftliche

Fakultät

Karl-Franzens-Universität Graz 99..1111.200.20077

Page 2: 9 . 11 .200 7

Overview

1.1. a few words about mea few words about me

2.2. a few words about youa few words about you

3.3. a short introduction to standardsa short introduction to standards

4.4. some words on some words on XMLXML

Practicum:Practicum:

writing a small document in XMLwriting a small document in XML

(recipes)(recipes)

Page 3: 9 . 11 .200 7

Lecturer Tomaž ErjavecTomaž Erjavec

Department of Knowledge Technologies Department of Knowledge Technologies Jožef Stefan InstituteJožef Stefan InstituteLjubljanaLjubljana

http://nl.ijs.si/et/http://nl.ijs.si/et/ [email protected]@ijs.si corpora and other language resources, corpora and other language resources,

standards, annotation, text-critical standards, annotation, text-critical editionseditions

Web page for this course: Web page for this course: http://nl.ijs.si/et/teach/graz07/standards/http://nl.ijs.si/et/teach/graz07/standards/

Page 4: 9 . 11 .200 7

Students

background: field of study, background: field of study, exposure to exposure to

XMLXML namespacesnamespaces TEITEI XSLTXSLT

emails?emails? expectations?expectations?

Page 5: 9 . 11 .200 7

Standards

dictionarydictionary: : an obligatory uniform an obligatory uniform regulation for measurment, quantity or regulation for measurment, quantity or quality quality // // that which specifies how that which specifies how something can or must besomething can or must be

consensually accepted regulations, which consensually accepted regulations, which are public and contain explicit definitionsare public and contain explicit definitions

the main purpuse is to harmonise the main purpuse is to harmonise industrial practice in various fields in industrial practice in various fields in order to enable interchangeorder to enable interchange

Page 6: 9 . 11 .200 7

Some history XVIII XVIII centurycentury: : in in FrancFrance each region (village) has e each region (village) has

its own units of measurement; also, different its own units of measurement; also, different objects (say a field or forest) are measured objects (say a field or forest) are measured differentlydifferently

how to definine a uniform system of how to definine a uniform system of measurements: search for a single unit from which measurements: search for a single unit from which it would be possible to derive all other measuresit would be possible to derive all other measures

meter: meter: one ten-millionth of the length of the one ten-millionth of the length of the meridian through Paris, from the North Pole to the meridian through Paris, from the North Pole to the equatorequator

the importance of standardisation grows with the the importance of standardisation grows with the industrial revolutionindustrial revolution: : mechanical and electrical mechanical and electrical engineering, construction workengineering, construction work……

today, standards encompas even such “soft” fields today, standards encompas even such “soft” fields as the organisation of bussines as the organisation of bussines (ISO 9000)(ISO 9000)

big bussiness: companies that check compliance big bussiness: companies that check compliance with standardswith standards

Page 7: 9 . 11 .200 7

Standardisation bodies

publish publish standardstandards according to strictly defined s according to strictly defined proceduresprocedures::

nanational tional standardstandardss: DIN, ANSI, : DIN, ANSI, SISTSIST international international standardstandardss: IEC, ISO: IEC, ISO ISOISO International Organization for Standardization, International Organization for Standardization,

Geneva (1947)Geneva (1947) ISOISO Technical Technical CCommitteeommittees are composed of s are composed of

members from participating countriesmembers from participating countries, , who then who then develop and approve standards from their fielddevelop and approve standards from their field

ISO TCs can be further composed sub-committees ISO TCs can be further composed sub-committees (SC) and these can containing Working Groups (SC) and these can containing Working Groups (WG)(WG)

Page 8: 9 . 11 .200 7

ISO TC 37

Technical Technical CCommittee on ommittee on TerminologyTerminology important for all other standards, as important for all other standards, as

each standard must contain a section each standard must contain a section on terminologyon terminology

basic definitions,…, basic definitions,…, ISO 639ISO 639, MARTIF, MARTIF in 2001 name of TC 37 changed toin 2001 name of TC 37 changed to: :

… … and other language and content and other language and content resourcesresources

ISO TC 34 ISO TC 34 SC4SC4: Language Resources : Language Resources ManagementManagement

Page 9: 9 . 11 .200 7

W3C The The World Wide Web ConsortiumWorld Wide Web Consortium first recommendation was HTML (1992)first recommendation was HTML (1992) best known versions of best known versions of HTMLHTML: 3.2, 4.1: 3.2, 4.1 XML 1.0XML 1.0 released February 1998 released February 1998 Many XML related standards:Many XML related standards:

DOM Level 1 V1.0 (October 1998) DOM Level 1 V1.0 (October 1998) XML Namespaces V1.0 (January 1999) XML Namespaces V1.0 (January 1999) XPath V1.0 (November 1999) XPath V1.0 (November 1999) XSLT V1.0 (November 1999) XSLT V1.0 (November 1999) XHTML V1.0 (January 2000) XHTML V1.0 (January 2000) XML Schema V1.0 (May 2001) XML Schema V1.0 (May 2001) XLink V1.0 (June 2001) XLink V1.0 (June 2001) XPointer V1.0 (September 2001) XPointer V1.0 (September 2001) XSL V1.0 (October 2001) XSL V1.0 (October 2001) XML Information Set V1.0 (October 2001) XML Information Set V1.0 (October 2001) XPath 2.0 WD (April 2002) XPath 2.0 WD (April 2002)

Page 10: 9 . 11 .200 7

Why standards for encoding of digital data?

The encoding of digital data is typically bound to a The encoding of digital data is typically bound to a particular piece of software e.g. a text editorparticular piece of software e.g. a text editor..

ProblProblemsems:: longevitylongevity:: rapid advances in technology make rapid advances in technology make

programs obsolete very soon, and the data bound programs obsolete very soon, and the data bound to these programs becomes unreadableto these programs becomes unreadable

interchangeinterchange: : difficult to use data on other difficult to use data on other platformsplatforms

exploitationexploitation:: difficult to re-use the data for other difficult to re-use the data for other purposespurposes

intelligibility:intelligibility: the data are understandable only to the data are understandable only to the program (no public and stable specifications of the program (no public and stable specifications of the formatthe format))

validationvalidation:: we don’t know whether certain data is we don’t know whether certain data is written according to the format specification or written according to the format specification or notnot

Page 11: 9 . 11 .200 7

Language data text editorstext editors: : very loose encoding, too very loose encoding, too

oriented to the visual appearance of oriented to the visual appearance of texttext

databasesdatabases: : too rigid encoding, does not too rigid encoding, does not allow for mixture of content (text) and allow for mixture of content (text) and structure structure ((markupmarkup))

ISO ISO 88798879 SGML (Standard Generalised Markup Language), 1986

defined a language for the representation of texts that will be processed by computer programs

Page 12: 9 . 11 .200 7

SGML

it defined an encoding which is: very general, as it is a “metalanguage” (a very general, as it is a “metalanguage” (a

language for describing other languages) and language for describing other languages) and lets you design your own customised markup lets you design your own customised markup languages for different types of documentslanguages for different types of documents

interchangable between computer platforms resistant to changes in technology enables the use of documents for various

purposes enables automatic validation whether a certain

document is compliant with the standard

Page 13: 9 . 11 .200 7

Problems with SGML

the the standard standard is very complexis very complex software for using it was either very software for using it was either very

expensive or very “expensive or very “aacademiccademic”” the conversion of existing the conversion of existing

documents into documents into SGML SGML was was expensiveexpensive

so, the use of SGML was limited to so, the use of SGML was limited to large companies or academialarge companies or academia

Page 14: 9 . 11 .200 7

The Web HTML HTML was an applicatoin of was an applicatoin of SGMLSGML but but SGML SGML compliant compliant HTML HTML is used by very is used by very

few web pagesfew web pages.... HTML HTML is also not expressive enough for the is also not expressive enough for the

encoding of arbitrary web dataencoding of arbitrary web data the need for a new standard for encoding the need for a new standard for encoding

web data that would have all the web data that would have all the advantages of SGML without its advantages of SGML without its weaknesessweaknesess

eXtended Markup Language, XML (1998)eXtended Markup Language, XML (1998)

Page 15: 9 . 11 .200 7

XML now

XML XML became very popular, and is became very popular, and is becoming the universal medium becoming the universal medium for interchange of (language) datafor interchange of (language) data

many related standardsmany related standards many freely available tools for many freely available tools for

processing XMLprocessing XML many programs support import many programs support import

and export of data in XMLand export of data in XML

Page 16: 9 . 11 .200 7

What is XML? XML is a definition of device-independent, XML is a definition of device-independent,

system-independent methods of storing system-independent methods of storing and processing texts in electronic form and processing texts in electronic form

XML is a project of W3C; hence, it is an XML is a project of W3C; hence, it is an open and non-proprietary specification open and non-proprietary specification

XML is a subset of SGMLXML is a subset of SGML XML is a “metalanguage” -- a language XML is a “metalanguage” -- a language

for describing other languages -- which for describing other languages -- which lets you design your own customised lets you design your own customised markup languages for different types of markup languages for different types of documentsdocuments

Page 17: 9 . 11 .200 7

What is a Markup Language? markupmarkup (equivalently, (equivalently, encodingencoding) )

making explicit an interpretation of textmaking explicit an interpretation of text markup languagemarkup language

a set of markup conventions used together a set of markup conventions used together for encoding texts. for encoding texts.

A markup language must specify: A markup language must specify: how markup is to be distinguished from text, how markup is to be distinguished from text, what the markup means, what the markup means, what markup is allowed, what markup is allowed, what markup is required what markup is required

Page 18: 9 . 11 .200 7

Structure of XML documents<poem><poem> <title>The SICK ROSE</title><title>The SICK ROSE</title> <stanza> <stanza> <line>O Rose thou art sick.</line> <line>O Rose thou art sick.</line> <line>The invisible worm,</line><line>The invisible worm,</line> <line>That flies in the night</line><line>That flies in the night</line> <line>In the howling storm:</line><line>In the howling storm:</line> </stanza> </stanza> <stanza> <stanza> <line>Has found out thy <line>Has found out thy

bed</line> bed</line> <line>Of crimson joy:</line> <line>Of crimson joy:</line> <line>And his dark secret <line>And his dark secret

love</line> love</line> <line>Does thy life destroy.</line> <line>Does thy life destroy.</line> </stanza></stanza></poem> </poem>

dodoccument = ument = text text + + mark-upmark-up

element = element = start tagstart tag + + contentcontent + + end tagend tag

generic identifier generic identifier = name of the = name of the tagtag

element element contains contains text or elements text or elements or both (or or both (or nothing)nothing)

Page 19: 9 . 11 .200 7

XML data model

<poem><title>The SICK ROSE</title> <stanza><line>O Rose thou art sick.</line> <poem><title>The SICK ROSE</title> <stanza><line>O Rose thou art sick.</line> <line>The invisible worm,</line> <line>That flies in the night</line> <line>In the <line>The invisible worm,</line> <line>That flies in the night</line> <line>In the howling storm:</line></stanza> <stanza><line>Has found out thy bed</line> <line>Of howling storm:</line></stanza> <stanza><line>Has found out thy bed</line> <line>Of crimson joy:</line> <line>And his dark secret love</line> <line>Does thy life crimson joy:</line> <line>And his dark secret love</line> <line>Does thy life destroy.</line></stanza></poem> destroy.</line></stanza></poem>

~~

The SICK ROSEThe SICK ROSE

poempoem

titletitle stanzastanza

linelinelineline lineline

O Rose thou O Rose thou

art sickart sick

The invisible The invisible

worm,worm,

That flies That flies

in the nightin the night

linelinelineline lineline

Has found Has found

out thy bedout thy bedOf crimson joyOf crimson joy

And his dark And his dark

secret lovesecret love

stanzastanza

lineline

In the In the

howling stormhowling storm::

lineline

Does thy Does thy

life destroy.life destroy.

serializatioserializationn

data data modelmodel

Page 20: 9 . 11 .200 7

Empty elements

elements with contentelements with content: : <<tagtag> … </tag>> … </tag>

empty elements have no content:empty elements have no content:<tag/><tag/>

used for indicating “points” in the used for indicating “points” in the document, for example page breaksdocument, for example page breaks

formallyformally<<tagtag/>/> = = <<tagtag></></tagtag>>

Page 21: 9 . 11 .200 7

Attributesused to describe properties of elementsused to describe properties of elementsExample: <table id="P1" status='revised'> ... </table>Example: <table id="P1" status='revised'> ... </table>

given as given as attribute-value pairsattribute-value pairs inside the start-tag inside the start-tag value must be inside matching quotation marks, single value must be inside matching quotation marks, single

or double; or double; order in which attribute-value pairs are supplied inside a order in which attribute-value pairs are supplied inside a

tag has no significance; tag has no significance; an XML processor can use the values of the attributes in an XML processor can use the values of the attributes in

any way it chooses; the id attribute is a slightly special any way it chooses; the id attribute is a slightly special case in that, by convention, it is always used to supply a case in that, by convention, it is always used to supply a unique value to identify a particular element unique value to identify a particular element occurrence, which may be used for cross reference occurrence, which may be used for cross reference purposes. purposes.

Page 22: 9 . 11 .200 7

Comments Comments can appear anywhere in text (but not in Comments can appear anywhere in text (but not in

markup) markup) Comments start with <!-- and end with -->Comments start with <!-- and end with --> Comments cannot be nested and cannot contain -- Comments cannot be nested and cannot contain -- e.g.e.g.

<poem> <poem> <title>The SLICK <!-- is this an typo? --> <title>The SLICK <!-- is this an typo? --> ROSE</title>ROSE</title> <stanza> <stanza> <line>O Rose thou art sick.</line> <line>O Rose thou art sick.</line> <!-- some lines missing --> <!-- some lines missing --> </stanza> </stanza> <!-- here comes the second stanza --> <!-- here comes the second stanza --> </poem> </poem>

Note that in XML 'meta-markup' starts with <! or <? Note that in XML 'meta-markup' starts with <! or <?

Page 23: 9 . 11 .200 7

Example: annotated corpus

Page 24: 9 . 11 .200 7

Example: dictionary

Page 25: 9 . 11 .200 7

Entities XML XML documents can also contain entity references, documents can also contain entity references,

which are, when processing the document, substituted which are, when processing the document, substituted by their interpretation (the entity)by their interpretation (the entity)

an entity reference starts with the character ampersand an entity reference starts with the character ampersand and ends with the semicolon: and ends with the semicolon: &&…; …;

a few entities are a few entities are predefinirane predefinirane in XMLin XML: : &&ltlt;; = < = < &&gtgt;; = > = > &&ampamp;; = & = & &&aposapos; ; = = '' &&quotquot; ; = = ""

< and & are “magic” characters and must always be < and & are “magic” characters and must always be escaped when using them in the text:escaped when using them in the text:

1 < 21 < 2 must be written as must be written as 1 &lt; 21 &lt; 2 Procter & GambleProcter & Gamble must be written as must be written as Procter &amp; GambleProcter &amp; Gamble entities are also used for other purposes entities are also used for other purposes

Page 26: 9 . 11 .200 7

XML declaration

Every XML document must begin with an XML declaration which does two things:

specifies that this is an XML document, and which version of the XML standard it follows

specifies which character encoding the document uses: <?xml version="1.0" ?> <?xml version="1.0" encoding="iso-8859-

1" ?> The default, and recommended, encoding is

UTF-8

Page 27: 9 . 11 .200 7

Minimal requirements the the dodoccument ument starts with the starts with the XMLXML declaration declaration tags and entities are correctly writtentags and entities are correctly written

Wrong: <a x=y>1 &lt 2</a> Wrong: <a x=y>1 &lt 2</a> the document must be a tree:the document must be a tree:

every start tag has a matching end-tagevery start tag has a matching end-tag ((<<namename>> ≠≠ <<NameName>> ≠≠ <<NAMENAME>> ))

elements are correctly nestedelements are correctly nestedWrongWrong: : <a>…<b>…</a>…</b><a>…<b>…</a>…</b>

the document has a single top-level elementthe document has a single top-level element a a well-formedwell-formed XMLXML document document

Page 28: 9 . 11 .200 7

Splot the mistake<greeting>Hello world!</greeting><greeting>Hello world!</Greeting>

<greeting><grunt>Ho</grunt> world!</greeting><grunt>Ho <greeting>world!</greeting></grunt><greeting><grunt>Ho world!</greeting></grunt>

<grunt type=loud>Ho</grunt><grunt type="loud"></grunt>

<grunt type= "loud"><grunt type ="loud"/>

Page 29: 9 . 11 .200 7

Another bad XML document<HTML> <HTML> <HEAD><TITLE>Links</TITLE></HEAD> <HEAD><TITLE>Links</TITLE></HEAD> <BODY> <BODY> <H1 align=center>Interesting<BR>WWW links</H1> <H1 align=center>Interesting<BR>WWW links</H1> <UL> <UL> <li><A HREF="http://www.w3.org/XML">W3C XML</A> <li><A HREF="http://www.w3.org/XML">W3C XML</A> <li><A HREF="http://xml.coverpages.org/">Cover's <li><A HREF="http://xml.coverpages.org/">Cover's

pages</A>pages</A></ul> </ul> <FORM action="http://www.google.com/search" <FORM action="http://www.google.com/search"

method=get> method=get> <A href="http://www.google.com/">Google</a> <A href="http://www.google.com/">Google</a> <input type=text name=q size=28 maxlength=256> <input type=text name=q size=28 maxlength=256> <input type=hidden name=meta value="lr=&hl=en"> <input type=hidden name=meta value="lr=&hl=en"> </FORM> </FORM> </BODY> </BODY> </HTML> </HTML>

Page 30: 9 . 11 .200 7

Defining the rules

A valid XML document conforms to rules which are stated in an external schema (“element grammar”) of some sort.

A schema specifies: names for all elements used names and datatypes and (occasionally)

default values for their attributes rules about how elements can nest and a few other things, depending on the

schema language

n.b. A schema does not specify anything about what elements "mean"

Page 31: 9 . 11 .200 7

In XML a schema is optional! XML allows you to make up your own tags, and

doesn’t require a schema... The XML concept is dangerously powerful:

XML elements are light in semantics one man’s <p> is another’s <para> (or is

it?) the appearance of interchangeability may

be worse than its absence But XML is too good to ignore

mainstream software development proliferation of tools the language of the web

Page 32: 9 . 11 .200 7

What can a schema (or DTD) do for you? ensure that your documents use only

predefined elements, attributes, and entities

enforce structural rules such as ‘every chapter must begin with a heading’ or ‘recipes must include an ingredient list’

make sure that the same thing is always called by the same name

schema languages vary in the amount of validation they support

Page 33: 9 . 11 .200 7

Schema languages

Schemas can be written in:XML DTD Language

(inherited from SGML)The W3C schema language

(main successor of DTDs)The ISO Relax NG schema language

(mostly used by latest version of TEI)

Page 34: 9 . 11 .200 7

A simple DTDXML documentXML document::<city><city>

<name>Graz<<name>Graz<//namename>><inhabitants>285,470<<inhabitants>285,470<//inhabitants>inhabitants><<countrycountry>>AustriaAustria</</countrycountry>>

<<//citycity>>

DTD:DTD:<<!ELEMENT !ELEMENT citycity ( (namename, , inhabitantsinhabitants, , countrycountry))>><<!ELEMENT !ELEMENT namename ((##PCDATA)PCDATA)>><<!ELEMENT !ELEMENT inhabitantsinhabitants ((##PCDATA)>PCDATA)><<!ELEMENT !ELEMENT countrycountry ((##PCDATA)>PCDATA)>

Page 35: 9 . 11 .200 7

A more complex DTD

<!ELEMENT anthology (poem+)> <!ELEMENT anthology (poem+)> <!ELEMENT poem (title?, <!ELEMENT poem (title?,

stanza+)> stanza+)> <!ELEMENT title (#PCDATA) > <!ELEMENT title (#PCDATA) > <!ELEMENT stanza (line+) > <!ELEMENT stanza (line+) > <!ELEMENT line (#PCDATA) ><!ELEMENT line (#PCDATA) >

<anothology><anothology>

<poem><poem>

<title>The SICK ROSE</title><title>The SICK ROSE</title>

<stanza> <stanza>

<line>O Rose thou art sick.</line> <line>O Rose thou art sick.</line>

<line>The invisible worm,</line><line>The invisible worm,</line>

<line>That flies in the <line>That flies in the night</line>night</line>

<line>In the howling <line>In the howling storm:</line>storm:</line>

</stanza> </stanza>

<stanza> <stanza>

<line>Has found out thy <line>Has found out thy bed</line> bed</line>

<line>Of crimson joy:</line> <line>Of crimson joy:</line>

<line>And his dark secret <line>And his dark secret love</line> love</line>

<line>Does thy life <line>Does thy life destroy.</line> destroy.</line>

</stanza></stanza>

</poem> </poem>

</anothology></anothology>

An element definition gives:An element definition gives: the name of the element the name of the element its content modelits content model

Page 36: 9 . 11 .200 7

Content Model Operators ( open bracket for grouping ( open bracket for grouping ) close bracket ) close bracket , follows , follows | or | or ? maybe ? maybe * repeated 0 or more times * repeated 0 or more times + repeated once or more times + repeated once or more times

<!ELEMENT poem <!ELEMENT poem (title?, (title?,

(line+ (line+ | | (refrain?, (stanza, refrain?)+)(refrain?, (stanza, refrain?)+)))

) ) >>

Page 37: 9 . 11 .200 7

Mixed contentIf an element contains #PCDATA and element content, If an element contains #PCDATA and element content,

#PCDATA must always appear as the first option in an #PCDATA must always appear as the first option in an alternation; the group containing it must use the star alternation; the group containing it must use the star operator; it may appear once only, and in the outermost operator; it may appear once only, and in the outermost model group. model group.

<!ELEMENT ltem1 (#PCDATA | para)*> <!ELEMENT ltem1 (#PCDATA | para)*> <!-- OK --> <!-- OK --> <!ELEMENT item2 (#PCDATA | para | note)*> <!-- OK --> <!ELEMENT item2 (#PCDATA | para | note)*> <!-- OK --> <!ELEMENT item3 (#PCDATA , para)*> <!-- WRONG! --> <!ELEMENT item3 (#PCDATA , para)*> <!-- WRONG! --> <!ELEMENT item4 (para | #PCDATA)*> <!-- WRONG! --> <!ELEMENT item4 (para | #PCDATA)*> <!-- WRONG! --> <!ELEMENT item5 (#PCDATA | para)+> <!-- WRONG! --> <!ELEMENT item5 (#PCDATA | para)+> <!-- WRONG! --> <!ELEMENT item6 (para | (#PCDATA | note)*)> <!-- <!ELEMENT item6 (para | (#PCDATA | note)*)> <!--

WRONG! --> WRONG! -->

Page 38: 9 . 11 .200 7

Content model ambiguity

XML parsing is deterministic so content XML parsing is deterministic so content model must not be ambiguous.model must not be ambiguous.

<!ELEMENT x (a, (b | c) )> <!ELEMENT x (a, (b | c) )> <!-- OK --> <!-- OK -->

<!ELEMENT x ((a, b)|(a, c))> <!-- WRONG! <!ELEMENT x ((a, b)|(a, c))> <!-- WRONG! --> -->

Page 39: 9 . 11 .200 7

Empty Content

Empty elements do not have content. To Empty elements do not have content. To distinguish them from those with content distinguish them from those with content in well-formed XML documents, they have in well-formed XML documents, they have a special form: the tag ends with a slash.a special form: the tag ends with a slash.

In the DTD: In the DTD: <!ELEMENT pageBreak EMPTY><!ELEMENT pageBreak EMPTY>

In the document: In the document: ... <p> The page ends here. ... <p> The page ends here. <pageBreak/> Here starts a new one. <pageBreak/> Here starts a new one. </p> ... </p> ...

Page 40: 9 . 11 .200 7

Attributes In theIn the DTD: DTD:attribute nameattribute name; t; typeype defaultdefault<!ATTLIST t<!ATTLIST tablablee

tytyppee CDATA CDATA #IMPLIED#IMPLIED allowedallowedid id ID ID #REQUIRED#REQUIRED necessarynecessarystatus status ((draftdraft||

revisedrevised || finalfinal ) ) ""draftdraft"" default default

valuevalue>>

In the XML documentIn the XML document::

<tabl<tablee id= id=""tab.12tab.12"" t tyyppee= = "summary" "summary" status= status= "revised""revised">>

Page 41: 9 . 11 .200 7

Entities in the DTD: in the DTD:

<!ENTITY xml-url "http://www.w3.org/XML/"> <!<!ENTITY xml-url "http://www.w3.org/XML/"> <!ENTITY xml-ref "<A href='&xml-url;'>&xml-ENTITY xml-ref "<A href='&xml-url;'>&xml-url;</A>"> url;</A>">

in the document: in the document: <hint>Read about XML at &xml-ref;.</hint> <hint>Read about XML at &xml-ref;.</hint>

after processing: after processing: <hint>Read about XML at <A <hint>Read about XML at <A href='http://www.w3.org/XML/'>http://www.w3.ohref='http://www.w3.org/XML/'>http://www.w3.org/XML/</A>.</hint>rg/XML/</A>.</hint>

Page 42: 9 . 11 .200 7

Character references Character references are used for cases where Character references are used for cases where

certain characters cannot represented (entered, certain characters cannot represented (entered, stored, transmitted, displayed) directly. stored, transmitted, displayed) directly.

Character reference starts with Character reference starts with &# followed by the decimal number of the &# followed by the decimal number of the

character, or by character, or by &#x followed by the hexadecimal number of &#x followed by the hexadecimal number of

the character, and ending with ;, e.g.: the character, and ending with ;, e.g.: Saarbr&#252;cken Saarbr&#252;cken

When processing, such references are When processing, such references are substituted by their codepoint substituted by their codepoint

Codepoints can be found on the Codepoints can be found on the Unicode Web pagesUnicode Web pages

Page 43: 9 . 11 .200 7

External Entities

External entity references are substituted External entity references are substituted by the contents of files: by the contents of files: <!ENTITY Chap1 SYSTEM <!ENTITY Chap1 SYSTEM "P4X/p4chap2.xml"> "P4X/p4chap2.xml"> <!ENTITY Chap2 SYSTEM <!ENTITY Chap2 SYSTEM "http://www.tei-c.org/P4X/p4chap2.xml"> "http://www.tei-c.org/P4X/p4chap2.xml">

External entities are referenced in the External entities are referenced in the document just as internal ones are:document just as internal ones are: <body> &Chap1; &Chap2; </body> <body> &Chap1; &Chap2; </body>

Page 44: 9 . 11 .200 7

The Document Type Declaration Specifies:Specifies: the root element of the document, the root element of the document, the external entity containing the DTD the external entity containing the DTD and/or the (part of the) DTD contained in the internal subset and/or the (part of the) DTD contained in the internal subset e.g.e.g. <!DOCTYPE anthology SYSTEM "anthology.dtd"><!DOCTYPE anthology SYSTEM "anthology.dtd"> <!DOCTYPE anthology SYSTEM "antology.dtd" [ <!DOCTYPE anthology SYSTEM "antology.dtd" [

<!ENTITY jbw "Jabberwocky"> <!ENTITY jbw "Jabberwocky"> ]> ]>

<!DOCTYPE anthology [ <!DOCTYPE anthology [ <!ELEMENT anthology (poem+)> <!ELEMENT anthology (poem+)> <!ELEMENT poem (title?, stanza+)> <!ELEMENT poem (title?, stanza+)> <!ELEMENT title (#PCDATA) > <!ELEMENT title (#PCDATA) > <!ELEMENT stanza (line+) > <!ELEMENT stanza (line+) > <!ELEMENT line (#PCDATA) > <!ELEMENT line (#PCDATA) >

]>]>

Page 45: 9 . 11 .200 7

A Complete Valid XML Document <?xml version="1.0" encoding="us-ascii"?> <?xml version="1.0" encoding="us-ascii"?> <!DOCTYPE anthology [ <!DOCTYPE anthology [

<!ELEMENT anthology (poem+)> <!ELEMENT anthology (poem+)> <!ELEMENT poem (title?, stanza+)> <!ELEMENT poem (title?, stanza+)> <!ELEMENT title (#PCDATA) > <!ELEMENT title (#PCDATA) > <!ELEMENT stanza (line+) > <!ELEMENT stanza (line+) > <!ELEMENT line (#PCDATA) > <!ELEMENT line (#PCDATA) >

]> ]> <anthology> <anthology> <poem><poem> <title>The SICK ROSE</title> <title>The SICK ROSE</title> <stanza> <stanza> <line>O Rose thou art sick.</line> <line>O Rose thou art sick.</line> <line>The invisible worm,</line> <line>The invisible worm,</line> <line>That flies in the night</line> <line>That flies in the night</line> <line>In the howling storm:</line> <line>In the howling storm:</line> </stanza> </stanza> <stanza> <stanza> <line>Has found out thy bed</line> <line>Has found out thy bed</line> <line>Of crimson joy:</line> <line>Of crimson joy:</line> <line>And his dark secret love</line> <line>And his dark secret love</line> <line>Does thy life destroy.</line> <line>Does thy life destroy.</line> </stanza> </stanza> </poem> </poem> </anthology> </anthology>