A GENTLE INTRODUCTION TO SGMLcsis.pace.edu/~marchese/CS835/Lect2/sgmlintro.pdf · tem known as the...

50
1. Introduction 2. What’s special about SGML 3. Textual structure 4. SGML structures 5. The DTD 6. More on element declarations 7. Attributes 8. SGML entities 9. Marked sections 10. Putting it all together 11. Using SGML 12. Notes A GENTLE INTRODUCTION TO SGML C. M. Sperberg-McQueen and Lau Burnard for circulation among the members INDIAN T E X users group

Transcript of A GENTLE INTRODUCTION TO SGMLcsis.pace.edu/~marchese/CS835/Lect2/sgmlintro.pdf · tem known as the...

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

A GENTLEINTRODUCTION

TO SGMLC. M. Sperberg-McQueen and Lau Burnard

for circulation among the members

I N D I A N

TEXusersgroup

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

2

1 INTRODUCTION

The encoding scheme defined by these Guidelines is formulated as an application of a sys-tem known as the Standard Generalized Markup Language (SGML)[Note 1]. SGML is aninternational standard for the definition of device-independent, system-independent meth-ods of representing texts in electronic form. This chapter presents a brief tutorial guide toits main features, for those readers who have not encountered it before. For a more technicalaccount of TEI practice in using the SGML standard, see chapter 28: Conformance ; for amore technical description of the subset of SGML used by the TEI encoding scheme, seechapter 39: Formal Grammar for the TEI Interchange Format Subset of SGML.

SGML is an international standard for the description of marked-up electronic text.More exactly, SGML is ametalanguage, that is, a means of formally describing a language,in this case, amarkup language. Before going any further we should define these terms.

Historically, the wordmarkuphas been used to describe annotation or other markswithin a text intended to instruct a compositor or typist how a particular passage shouldbe printed or laid out. Examples include wavy underlining to indicate boldface, specialsymbols for passages to be omitted or printed in a particular font and so forth. As theformatting and printing of texts was automated, the term was extended to cover all sorts ofspecialmarkup codesinserted into electronic texts to govern formatting, printing, or otherprocessing.

Generalizing from that sense, we define markup, or (synonymously)encoding, as anymeans of making explicit an interpretation of a text. At a banal level, all printed textsare encoded in this sense: punctuation marks, use of capitalization, disposition of lettersaround the page, even the spaces between words, might be regarded as a kind of markup, thefunction of which is to help the human reader determine where one word ends and anotherbegins, or how to identify gross structural features such as headings or simple syntacticunits such as dependent clauses or sentences. Encoding a text for computer processing isin principle, like transcribing a manuscript fromscriptio continua, a process of making

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

3

explicit what is conjectural or implicit, a process of directing the user as to how the contentof the text should be interpreted.

By markup languagewe mean a set of markup conventions used together for encodingtexts. A markup language must specify what markup is allowed, what markup is required,how markup is to be distinguished from text, and what the markup means. SGML providesthe means for doing the first three; documentation such as these Guidelines is required forthe last.

The present chapter attempts to give an informal introduction—much less formal thanthe standard itself—to those parts of SGML of which a proper understanding is necessaryto make best use of these Guidelines.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

4

2 WHAT ’ S SPECIAL ABOUT SGML?

There are three characteristics of SGML which distinguish it from other markup languages:its emphasis on descriptive rather than procedural markup; itsdocument typeconcept; andits independence of any one system for representing the script in which a text is written.These three aspects are discussed briefly below, and then in more depth in sectionsSGMLStructuresandDefining SGML Document Structures: The DTD.

2.1 Descriptive MarkupA descriptive markup system uses markup codes which simply provide names to categorizeparts of a document. Markup codes such as<para> or \end{list} simply identify aportion of a document and assert of it that “the following item is a paragraph,” or “this is theend of the most recently begun list,” etc. By contrast, a procedural markup system defineswhat processing is to be carried out at particular points in a document: “call procedurePARA with parameters 1, b and x here” or “move the left margin 2 quads left, move the rightmargin 2 quads right, skip down one line, and go to the new left margin,” etc. In SGML,the instructions needed to process a document for some particular purpose (for example,to format it) are sharply distinguished from the descriptive markup which occurs withinthe document. Usually, they are collected outside the document in separate procedures orprograms.

With descriptive instead of procedural markup the same document can readily be pro-cessed by many different pieces of software, each of which can apply different processinginstructions to those parts of it which are considered relevant. For example, a content anal-ysis program might disregard entirely the footnotes embedded in an annotated text, whilea formatting program might extract and collect them all together for printing at the end ofeach chapter. Different sorts of processing instructions can be associated with the sameparts of the file. For example, one program might extract names of persons and places froma document to create an index or database, while another, operating on the same text, mightprint names of persons and places in a distinctive typeface.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

5

2.2 Types of Document

Secondly, SGML introduces the notion of adocument type, and hence adocument typedefinition(DTD). Documents are regarded as having types, just as other objects processedby computers do. The type of a document is formally defined by its constituent parts andtheir structure. The definition of a report, for example, might be that it consisted of a titleand possibly an author, followed by an abstract and a sequence of one or more paragraphs.Anything lacking a title, according to this formal definition, would not formally be a report,and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader.

If documents are of known types, a special purpose program (called aparser) can beused to process a document claiming to be of a particular type and check that all the el-ements required for that document type are indeed present and correctly ordered. Moresignificantly, different documents of the same type can be processed in a uniform way. Pro-grams can be written which take advantage of the knowledge encapsulated in the documentstructure information, and which can thus behave in a more intelligent fashion.

2.3 Data Independence

A basic design goal of SGML was to ensure that documents encoded according to its pro-visions should be transportable from one hardware and software environment to anotherwithout loss of information. The two features discussed so far both address this requirementat an abstract level; the third feature addresses it at the level of the strings of bytes (charac-ters) of which documents are composed. SGML provides a general purpose mechanism forstring substitution, that is, a simple machine-independent way of stating that a particularstring of characters in the document should be replaced by some other string when the doc-ument is processed. One obvious application for this mechanism is to ensure consistencyof nomenclature; another, more significant one, is to counter the notorious inability of dif-ferent computer systems to understand each other’s character sets, or of any one system to

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

6

provide all the graphic characters needed for a particular application, by providing descrip-tive mappings for non-portable characters. The strings defined by this string-substitutionmechanism are calledentitiesand they are discussed below inSection 8 SGML Entities.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

7

3 TEXTUAL STRUCTURE

A text is not an undifferentiated sequence of words, much less of bytes. For differentpurposes, it may be divided into many different units, of different types or sizes. A prosetext such as this one might be divided into sections, chapters, paragraphs, and sentences.A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences ofprose and verse might be divided into volumes, gatherings, and pages.

Structural units of this kind are most often used to identify specific locations or ref-erence points within a text (“the third sentence of the second paragraph in chapter ten”;“canto 10, line 1234”; “page 412,” etc.) but they may also be used to subdivide a text intomeaningful fragments for analytic purposes (“is the average sentence length of section 2different from that of section 5?” “how many paragraphs separate each occurrence of theword ‘nature’?” “how many pages?”). Other structural units are more clearly analytic, inthat they characterize a section of a text. A dramatic text might regard each speech by adifferent character as a unit of one kind, and stage directions or pieces of action as unitsof another kind. Such an analysis is less useful for locating parts of the text (“the 93rdspeech by Horatio in Act 2”) than for facilitating comparisons between the words used byone character and those of another, or those used by the same character at different pointsof the play.

In a prose text one might similarly wish to regard as units of different types passages indirect or indirect speech, passages employing different stylistic registers (narrative, polemic,commentary, argument, etc.), passages of different authorship and so forth. And for certaintypes of analysis (most notably textual criticism) the physical appearance of one particularprinted or manuscript source may be of importance: paradoxically, one may wish to usedescriptive markup to describe presentational features such as typeface, line breaks, use ofwhite space and so forth.

These textual structures overlap with each other in complex and unpredictable ways.Particularly when dealing with texts as instantiated by paper technology, the reader needs

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

8

to be aware of both the physical organization of the book and the logical structure of thework it contains. Many great works (Sterne’sTristram Shandyfor example) cannot be fullyappreciated without an awareness of the interplay between narrative units (such as chaptersor paragraphs) and page divisions. For many types of research, it is the interplay betweendifferent levels of analysis which is crucial: the extent to which syntactic structure andnarrative structure mesh, or fail to mesh, for example, or the extent to which phonologicalstructures reflect morphology.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

9

4 SGML STRUCTURES

This section describes the simple and consistent mechanism for the markup or identifica-tion of structural textual units which is provided by SGML. It also describes the methodsSGML provides for the expression of rules defining how combinations of such units canmeaningfully occur in any text.

4.1 Elements

The technical term used in the SGML standard for a textual unit, viewed as a structuralcomponent, iselement. Different types of elements are given different names, but SGMLprovides no way of expressing the meaning of a particular type of element, other than itsrelationship to other element types. That is, all one can say about an element called (forinstance)<blort> is that instances of it may (or may not) occur within elements of type<farble>, and that it may (or may not) be decomposed into elements of type<blortette>.It should be stressed that the SGML standard is entirely unconcerned with the semanticsof textual elements: these are application dependent.[Note 2]. It is up to the creators ofSGML conformant tag sets (such as these Guidelines) to choose intelligible names for theelements they identify and to document their proper use in text markup. That is one purposeof this document. From the need to choose element names indicative of function comes thetechnical term for the name of an element type, which isgeneric identifier, or GI.

Within a marked up text (adocument instance), each element must be explicitly markedor tagged in some way. The standard provides for a variety of different ways of doing this,the most commonly used being to insert a tag at the beginning of the element (astart-tag)and another at its end (anend-tag). The start- and end-tag pair are used to bracket offthe element occurrences within the running text, in rather the same way as different typesof parentheses or quotation marks are used in conventional punctuation. For example, aquotation element in a text might be tagged as follows:

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

10

... Rosalind’s remarks <quote>This is the silliest stuffthat ere I heard of!</quote> clearly indicate ...

As this example shows, a start-tag takes the form<name>, where the opening anglebracket indicates the start of the start-tag, “name” is the generic identifier of the elementwhich is being delimited, and the closing angle bracket indicates the end of a tag. An end-tag takes an identical form, except that the opening angle bracket is followed by a solidus(slash) character, so that the corresponding end-tag would be</name> [Note 3].

4.2 Content Models: An Example

An element may beempty, that is, it may have no content at all, or it may contain simpletext. More usually, however, elements of one type will beembedded(contained entirely)within elements of a different type.

To illustrate this, we will consider a very simple structural model. Let us assume thatwe wish to identify within an anthology only poems, their titles, and the stanzas and linesof which they are composed. In SGML terms, our document type is theanthology, and itconsists of a series ofpoems. Each poem has embedded within it one element, a title, andseveral occurrences of another, a stanza, each stanza having embedded within it a numberof line elements. Fully marked up, a text conforming to this model might appear as follows[Note 4]:

<anthology><poem><title>The SICK ROSE</title>

<stanza><line>O Rose thou art sick.</line><line>The invisible worm,</line><line>That flies in the night</line>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

11

<line>In the howling storm:</line></stanza><stanza>

<line>Has found out thy bed</line><line>Of crimson joy:</line><line>And his dark secret love</line><line>Does thy life destroy.</line>

</stanza></poem>

<!-- more poems go here -->

</anthology>

It should be stressed that this example doesnot use the same names as are proposedfor corresponding elements elsewhere in these Guidelines: the above isnot a valid TEIdocument. It will however serve as an introduction to the basic notions of SGML. Whitespace and line breaks have been added to the example for the sake of visual clarity only;they have no particular significance in the SGML encoding itself. Also, the line

<!-- more poems go here -->

is an SGMLcommentand is not treated as part of the text.This example makes no assumptions about the rules governing, for example, whether

or not a title can appear in places other than preceding the first stanza, or whether lines can

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

12

appear which are not included in a stanza: that is why its markup appears so verbose. Insuch cases, the beginning and end of every element must be explicitly marked, because thereare no identifiable rules about which elements can appear where. In practice, however, rulescan usually be formulated to reduce the need for so much tagging. For example, consideringour greatly over-simplified model of a poem, we could state the following rules:

• An anthology contains a number of poems and nothing else.

• A poem always has a single title element which precedes the first stanza and containsno other elements.

• Apart from the title, a poem consists only of stanzas.

• Stanzas consist only of lines and every line is contained by a stanza.

• Nothing can follow a stanza except another stanza or the end of a poem.

• Nothing can follow a line except another line or the start of a new stanza.

From these rules, it may be inferred that we do not need to mark the ends of stanzas orlines explicitly. From rule 2 it follows that we do not need to mark the end of the title—itis implied by the start of the first stanza. Similarly, from rules 3 and 1 it follows that weneed not mark the end of the poem: since poems cannot occur within poems but must occurwithin anthologies, the end of a poem is implied by the start of the next poem, or by theend of the anthology. Applying these simplifications, we could mark up the same poem asfollows:

<anthology><poem><title>The SICK ROSE<stanza>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

13

<line>O Rose thou art sick.<line>The invisible worm,<line>That flies in the night<line>In the howling storm:

<stanza><line>Has found out thy bed<line>Of crimson joy:<line>And his dark secret love<line>Does thy life destroy.

<poem><!-- more poems go here -->

</anthology>

The ability to use rules stating which elements can be nested within others to simplifymarkup is a very important characteristic of SGML. Before considering these rules further,you may wish to consider how text marked up in the form above could be processed bya computer for very many different purposes. A simple indexing program could extractonly the relevant text elements in order to make a list of titles, or of words used in thepoem text; a simple formatting program could insert blank lines between stanzas, perhapsindenting the first line of each, or inserting a stanza number. Different parts of each poemcould be typeset in different ways. A more ambitious analytic program could relate the useof punctuation marks to stanzaic and metrical divisions.[Note 5]. Scholars wishing to seethe implications of changing the stanza or line divisions chosen by the editor of this poemcan do so simply by altering the position of the tags. And of course, the text as presentedabove can be transported from one computer to another and processed by any program (or

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

14

person) capable of making sense of the tags embedded within it with no need for the sort oftransformations and translations needed to move word processor files around.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

15

5 DEFINING SGML D OCUMENT STRUCTURES: T HE DTD

Rules such as those described above are the first stage in the creation of a formal specifi-cation for the structure of an SGML document, ordocument type definition, usually abbre-viated toDTD. In creating a DTD, the document designer may be as lax or as restrictiveas the occasion warrants. A balance must be struck between the convenience of followingsimple rules and the complexity of handling real texts. This is particularly the case whenthe rules being defined relate to texts which already exist: the designer may have only thehaziest of notions as to an ancient text’s original purpose or meaning and hence find it verydifficult to specify consistent rules about its structure. On the other hand, where a new textis being prepared to an exact specification, for example for entry into a textual databaseof some kind, the more precisely stated the rules, the better they can be enforced. Evenin the case where an existing text is being marked up, it may be beneficial to define a re-strictive set of rules relating to one particular view or hypothesis about the text—if only asa means of testing the usefulness of that view or hypothesis. It is important to rememberthat every document type definition is an interpretation of a text. There is no single DTDwhich encompasses any kind of absolute truth about a text, although it may be convenientto privilege some DTDs above others for particular types of analysis.

At present, SGML is most widely used in environments where uniformity of documentstructure is a major desideratum. In the production of technical documentation, for example,it is of major importance that sections and subsections should be properly nested, that crossreferences should be properly resolved and so forth. In such situations, documents are seenas raw material to match against pre-defined sets of rules. As discussed above, however,the use of simple rules can also greatly simplify the task of tagging accurately elementsof less rigidly constrained texts. By making these rules explicit, the scholar reduces his orher own burdens in marking up and verifying the electronic text, while also being forcedto make explicit an interpretation of the structure and significant particularities of the textbeing encoded.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

16

5.1 An Example DTD

A DTD is expressed in SGML as a set of declarative statements, using a simple syntaxdefined in the standard. For our simple model of a poem, the following declarations wouldbe appropriate:

<!ELEMENT anthology - - (poem+)><!ELEMENT poem - - (title?, stanza+)><!ELEMENT title - O (#PCDATA) ><!ELEMENT stanza - O (line+) ><!ELEMENT line O O (#PCDATA) >

These five lines are examples of formal SGML element declarations. A declaration, likean element, is delimited by angle brackets; the first character following the opening bracketmust be an exclamation mark, followed immediately by one of a small set of SGML-definedkeywords, specifying the kind of object being declared. The five declarations above are allof the same type: each begins with anELEMENTkeyword, indicating that it declares anelement, in the technical sense defined above. Each consists of three parts: a name or groupof names, two characters specifyingminimization rules, and acontent model.Each of theseparts is discussed further below. Components of the declaration are separated by whitespace, that is one or more blanks, tabs or newlines.

The first part of each declaration above gives the generic identifier of the element whichis being declared, for example ‘poem’, ‘title’, etc. It is possible to declare several elementsin one statement, as discussed below.

5.2 Minimization Rules

The second part of the declaration specifies what are calledminimization rulesfor the ele-ment concerned. These rules determine whether or not start- and end-tags must be present

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

17

in every occurrence of the element concerned. They take the form of a pair of characters,separated by white space, the first of which relates to the start-tag, and the second to theend-tag. In either case, either a hyphen or a letter O (for “omissible” or “optional”) mustbe given; the hyphen indicating that the tag must be present, and the letter O that it may beomitted. Thus, in this example, every element except<line> must have a start-tag. Onlythe<poem> and<anthology> elements must have end-tags as well.

5.3 Content Model

The third part of each declaration, enclosed in parentheses, is called thecontent modelof the element, because it specifies what element occurrences may legitimately contain.Contents are specified either in terms of other elements or using special reserved words.There are several such reserved words, of which by far the most commonly encounteredis PCDATA, as in this example. This is an abbreviation for parsed character data, and itmeans that the element being defined may contain any valid character data. If an SGMLdeclaration is thought of as a structure like a family tree, with a single ancestor at the top(in our case, this would be<anthology>), then almost always, following the branches ofthe tree downwards (for example, from<anthology> to<poem> to<stanza> to<line>and<title>) will lead eventually to PCDATA. In our example,<title> and<line> are sodefined. Since their content models say PCDATA only and name no embedded elements,they may not contain any embedded elements.

5.4 Occurrence Indicators

The declaration for<stanza> in the example above states that a stanza consists of one ormore lines. It uses anoccurrence indicator(the plus sign) to indicate how many times theelement named in its content model may occur. There are three occurrence indicators inthe SGML syntax, conventionally represented by the plus sign, the question mark, and theasterisk or star[Note 6]. The plus sign means that there may be one or more occurrences ofthe element concerned; the question mark means that there may be at most one and possibly

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

18

no occurrence; the star means that the element concerned may either be absent or appearone or more times. Thus, if the content model for<stanza> were(LINE*) , stanzas withno lines would be possible as well as those with more than one line. If it were(LINE?) ,again empty stanzas would be countenanced, but no stanza could have more than a singleline. The declaration for<poem> in the example above thus states that a<poem> cannothave more than one title, but may have none, and that it must have at least one<stanza>and may have several.

5.5 Group Connectors

The content model(TITLE?, STANZA+) contains more than one component, and thusneeds additionally to specify the order in which these elements (<title> and<stanza>)may appear. This ordering is determined by thegroup connector(the comma) used betweenits components. There are three possible group connectors, conventionally represented bycomma, vertical bar, and ampersand[Note 7]. The comma means that the components itconnects must both appear in the order specified by the content model. The ampersandindicates that the components it connects must both appear but may appear in any order.The vertical bar indicates that only one of the components it connects may appear. If thecomma in this example were replaced by an ampersand, a title could appear either beforethe stanzas of a<poem> or at the end (but not between stanzas). If it were replaced by avertical bar, then a<poem> would consist of either a title or just stanzas—but not both!

5.6 Model Groups

In our example so far, the components of each content model have been either single ele-ments or PCDATA. It is quite permissible however to define content models in which thecomponents are lists of elements, combined by group connectors. Such lists, known asmodel groups, may also be modified by occurrence indicators and themselves combined bygroup connectors. To demonstrate these facilities, let us now expand our example to in-clude non-stanzaic types of verse. For the sake of demonstration, we will categorize poems

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

19

as one ofstanzaic, couplets, or blank (or stichic). A blank-verse poem consists simply oflines (we ignore the possibility of verse paragraphs for the moment)[Note 8] so no addi-tional elements need be defined for it. A couplet is defined as a<line1> followed by a<line2>.

<!ELEMENT couplet O O (line1, line2) >

The elements<line1> and<line2> (which are distinguished to enable studies ofrhyme scheme, for example) have exactly the same content model as the existing<line>element. They can therefore share the same declaration. In this situation, it is convenientto supply aname groupas the first component of a single element declaration, rather thangive a series of declarations differing only in the names used. A name group is a list of GIsconnected by any group connector and enclosed in parentheses, as follows:

<!ELEMENT (line | line1 | line2) O O (#PCDATA) >

The declaration for the<poem> element can now be changed to include all three pos-sibilities:

<!ELEMENT poem - O (title?, (stanza+ | couplet+ | line+) ) >

That is, a poem consists of an optional title, followed by one or several stanzas, or oneor several couplets, or one or several lines. Note the difference between this definition andthe following:

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

20

<!ELEMENT poem - O (title?, (stanza | couplet | line)+ ) >

The second version, by applying the occurrence indicator to the group rather than toeach element within it, would allow for a single poem to contain a mixture of stanzas,couplets or blank verse.

Quite complex models can easily be built up in this way, to match the structural com-plexity of many types of text. As a further example, consider the case of stanzaic verse inwhich a refrain or chorus appears. A refrain may be composed of repetitions of the line ele-ment, or it may simply be text, not divided into verse lines. A refrain can appear at the startof a poem only, or as an optional addition following each stanza. This could be expressedby a content model such as the following:

<!ELEMENT refrain - - (#PCDATA | line+)><!ELEMENT poem - O (title?,

( (line+)| (refrain?, (stanza, refrain?)+ ) ))>

That is, a poem consists of an optional title, followed by either a sequence of lines, or anun-named group, which starts with an optional refrain, followed by one of more occurrencesof another group, each member of which is composed of a stanza followed by an optionalrefrain. A sequence such as ‘refrain - stanza - stanza - refrain’ follows this pattern, as doesthe sequence ‘stanza - refrain - stanza - refrain’. The sequence ‘refrain - refrain - stanza -stanza’ does not, however, and neither does the sequence “stanza - refrain - refrain - stanza.”Among other conditions made explicit by this content model are the requirements that at

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

21

least one stanza must appear in a poem, if it is not composed simply of lines, and that ifthere is both a title and a stanza they must appear in that order.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

22

6 COMPLICATING THE I SSUE: M ORE ON ELEMENT DECLARATIONS

In the simple cases described so far, it has been assumed that one can identify the immediateconstituents of every element defined in a textual structure. A poem consists of stanzas,and an anthology consists of poems. Stanzas do not float around unattached to poems orcombined into some other unrelated element; a poem cannot contain an anthology. All theelements of a given document type may be arranged into a hierarchic structure, arrangedlike a family tree with a single ancestor at the top and many children (mostly the elementscontaining #PCDATA) at the bottom. This gross simplification turns out to be surprisinglyeffective for a large number of purposes. It is not however adequate for the full complexityof real textual structures. In particular, it does not cater for the case of more or less freelyfloating elements that can appear at almost any hierarchic level in the structure, and it doesnot cater for the case where different elements overlap or several different trees may beidentified in the same document. To deal with the first case, SGML provides theexceptionmechanism; to deal with the second, SGML permits the definition of ‘concurrent’ documentstructures.

6.1 Exceptions to the Content ModelIn most documents, there will be some elements that can occur at any level of its structure.Annotations, for example, might be attached to the whole of a poem, to a stanza, to a lineof a stanza or to a single word within it. In a textual critical edition, the same might be trueof variant readings. In this simple case, the complexity of adding an annotation elementas an optional component of every content model is not particularly onerous; in a morerealistically complex model perhaps containing some ten or twenty levels such an approachcan become much more difficult.

To cope with this, SGML allows for any content model to be further modified by meansof an exceptionlist. There are two types of exception:inclusions, that is, additional ele-ments that can be included at any point in the model group or any of its constituent elements;andexclusions, that is, elements that cannot be included within the current model.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

23

To extend our declarations further to allow for annotations and variant readings, whichwe will assume can appear anywhere within the text of a poem, we first need to add decla-rations for these two elements:

<!ELEMENT (note | variant) - - (#PCDATA)>

The note and variant elements must have both start- and end-tags, since they can appearanywhere. Rather than add them to the content model for each type of poem, we can addthem in the form of an inclusion list to the poem element, which now reads:

<!ELEMENT poem - O (title?, (stanza+ | couplet+ | line+) )+(note | variant) >

The plus sign at the start of the(NOTE | VARIANT) name list indicates that this is aninclusion exception. With this addition, notes or variants can appear at any point in thecontent of a poem element—even those (such as<title>) for which we have defined acontent model of #PCDATA. They can thus also appear within notes or variants!

If we wanted for some reason to prevent notes or variants appearing within titles, wecould add an exclusion exception to the declaration for<title> above:

<!ELEMENT title - O (#PCDATA) -(note | variant) >

The minus sign at the start of the (NOTE — VARIANT) name list indicates that this is anexclusion exception. With this addition, notes and variants will be prohibited from appear-ing within titles, notwithstanding their potential inclusion implied by the previous additionto the content model for<poem>.

In the same way, we could prevent notes and variants from nesting within notes andvariants by modifying the definition above to read

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

24

<!ELEMENT (note | variant) - - (#PCDATA) -(note | variant) >

The meticulous reader will note that this precludes both variants within notes and noteswithin variants. Inclusion and exclusion exceptions should be used with care as their rami-fications may not be immediately apparent.

6.2 Concurrent Structures

All the structures we have so far discussed have been simply hierarchic: that is, at everylevel of the tree, each node is entirely contained by a parent node. The figure below repre-sents the structure of a document conforming to the simple DTD we have so far defined asa tree (drawn on its side through exigencies of space). We have already seen how Blake’spoem can be divided into a title and two stanzas, each of four lines. In this diagram, we adda second poem, consisting of one stanza and a title, to make up an instance of an anthology:

|-------------------title|| |----line1| |----line2

|------POEM1---|----stanza1---|----line3| | |----line4| || | |----line5| |----stanza2---|----line6| |----line7| |----line8

anthology-||

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

25

| |-------------------title| || | |----line1| | |----line2|------POEM2---|----stanza1---|----line3

|----line4|----line5

Clearly, there are many such trees that might be drawn to describe the structure of thisor other anthologies. Some of them might be representable as further subdivisions of thistree: for example, we might subdivide the lines into individual words, since no word crossesa line boundary. But equally clearly there are many other trees that might be drawn whichdo not fit within this tree. We might, for example, be interested in syntactic structures —which rarely respect the formal boundaries of verse. Or, to take a simpler example, wemight want to represent the pagination of different editions of the same text.

One way of doing this would be to group the lines and titles of our current model intopages. A declaration for such an element is simple enough:

<!ELEMENT page - - ((title?, line+)+) >

That is, a page consists of one or more unnamed groups, each of which contains an optionaltitle, followed by a sequence of lines. (Note, incidentally, that this model prohibits a titleappearing on its own at the foot of a page). However, simply inserting the element<page>into the hierarchy already defined is not as easy as it might seem. Some poems are longerthan a single page, and other pages contain more than one poem. We cannot therefore insert

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

26

the element<page> between<anthology> and<poem> in the hierarchy, nor can it gobetween<poem> and<stanza>, nor yet in both places at once! What is needed is theability to create a separate hierarchy, with the same elements at the bottom (the stanzas,lines and titles), but combined into a different superstructure. This is the ability which theCONCUR feature of SGML gives.

A separate document type definition must be created for each hierarchic tree into whichthe text is to be structured. The definition we have so far built up for the anthology looks,in full, like this:

<!DOCTYPE anthology [<!ELEMENT anthology - - (poem+) ><!ELEMENT poem - - (title?, stanza+) ><!ELEMENT stanza - O (line+) ><!ELEMENT (title | line) - O (#PCDATA) >]>

As this example shows, the name of a document type must always be the same as the nameof the largest element in it, that is the element at the top of the hierarchy. The syntax usedis discussed further below. Let us now add to this declaration a second definition for aconcurrent document type, which we will call a paged anthology, or<p.anth> for short:

<!DOCTYPE p.anth [<!ELEMENT p.anth - - (page+) ><!ELEMENT page - - ((title?, line+)+) ><!ELEMENT (title|line) - O (#PCDATA) >]>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

27

We have now defined two different ways of looking at the same basic text—the PC-DATA components grouped by both these document type definitions into lines or titles. Inone view, the lines are grouped into stanzas and poems; in the other they are grouped intopages only. Notice that it is exactly the same text which is visible in both views: the twohierarchies simply allow us to arrange it in two different ways.

To mark up the two views, it will be necessary to indicate which hierarchy each elementbelongs to. This is done by including the name of the document type (the view) withinparentheses immediately before the identifier concerned, inside both start- and end-tags.Thus, pages (which are only visible in the<p.anth> document type) must be tagged witha<(p.anth)page> tag at their start and a</(p.anth)page> at their end. In the same way,as poems and stanzas appear only in the<anthology> document type, they must nowbe tagged using<(anthology)poem> and<(anthology)stanza> tags respectively. Forthe line and title elements, however, which appear in both hierarchies, no document typespecification need be given: any tag containing only a name is assumed to mark an elementpresent in every active document type.

As a simple example, let us assume that Blake’s poem appears in some paged anthology,with the page break occurring half way through the first stanza. The poem might then bemarked up as follows:

<(anthology)anthology><(p.anth)p.anth><(p.anth)page>

<!-- other titles and lines on this page here -->

<(anthology)poem><title>The SICK ROSE<(anthology)stanza>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

28

<line>O Rose thou art sick.<line>The invisible worm,

</(p.anth)page><(p.anth)page>

<line>That flies in the night<line>In the howling storm:

<(anthology)stanza><line>Has found out thy bed<line>Of crimson joy:<line>And his dark secret love<line>Does thy life destroy.

</(anthology)poem>

<!-- rest of material on this page here --></(p.anth)page>

</(p.anth)p.anth)</(anthology)anthology>

It is now possible to select only the elements concerned with a particular view from thetext, even though both are represented in the tagging. A processor concerned only with thepagination will see only those elements whose tags include the P.ANTH specification, orwhich have no specification at all. A processor concerned only with the ANTHOLOGYview of things will not see the page breaks. And a processor concerned to inter-relate thetwo views can do so unambiguously.

A note of caution is appropriate: CONCUR is an optional feature of SGML, and notall available SGML software systems support it, while those which do, do not always do

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

29

so according to the letter of the standard. For that reason, if for no other, wherever theseGuidelines have identified a potential application of CONCUR, they also invariably suggestalternative methods as well. For fuller discussion of these issues, see chapter 31: MultipleHierarchies.

Note also that we cannot introduce a new element, a page number for example, into the<p.anth> document type, since there is no existing data in the<anthology> documenttype which could be fitted into it. One way of adding that extra information is discussed inthe nextsection.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

30

7 ATTRIBUTES

In the SGML context, the word ‘attribute’, like some other words, has a specific technicalsense. It is used to describe information which is in some sense descriptive of a specificelement occurrence but not regarded as part of its content. For example, you might wishto add astatus attribute to occurrences of some elements in a document to indicatetheir degree of reliability, or to add anidentifier attribute so that you could refer toparticular element occurrences from elsewhere within a document. Attributes are useful inprecisely such circumstances.

Although different elements may have attributes with the same name, (for example,in the TEI scheme, every element is defined as having anid attribute), they are alwaysregarded as different, and may have different values assigned to them. If an element hasbeen defined as having attributes, the attribute values are supplied in the document instanceasattribute-value pairsinside the start-tag for the element occurrence. An end-tag may notcontain an attribute-value specification, since it would be redundant.

For example

<poem id=P1 status="draft"> ... </poem>

The<poem> element has been defined as having two attributes:id andstatus. Forthe instance of a<poem> in this example, represented here by an ellipsis, theid attributehas the valueP1 and thestatus attribute has the valuedraft. An SGML processorcan use the values of the attributes in any way it chooses; for example, a formatter mightprint a poem element which has the status attribute set todraft in a different way fromone with the same attribute set torevised ; another processor might use the same attributeto determine whether or not poem elements are to be processed at all. Theid attribute isa slightly special case in that, by convention, it is always used to supply a unique value to

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

31

identify a particular element occurrence, which can be used for cross reference purposes,as discussed further below.

Like elements, attributes are declared in the SGML document type declaration, usingrather similar syntax. As well as specifying its name and the element to which it is to beattached, it is possible to specify (within limits) what kind of value is acceptable for anattribute and a default value.

The following declarations could be used to define the two attributes we have specifiedabove for the<poem> element:

<!ATTLIST poemid ID #IMPLIEDstatus (draft|revised|published) draft >

The declaration begins with the symbolATTLIST , which introduces anattribute listspecification. The first part of this specifies the element (or elements) concerned. In ourexample, attributes have been declared only for the<poem> element. If several elementsshare the same attributes, they may all be defined in a single declaration; just as with elementdeclarations, several names may be given in a parenthesized list. Following this name (orlist of names), is a series of rows, one for each attribute being declared, each containingthree parts. These specify the name of the attribute, the type of value it takes, and a defaultvalue respectively.

Attribute names (id andstatus in this example) are subject to the same restrictionsas other names in SGML; they need not be unique across the whole DTD, however, butonly within the list of attributes for a given element.

The second part of an attribute specification can take one of two forms, both illustratedabove. The first case uses one of a number of special keywords to declare what kind of value

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

32

an attribute may take. In the example above, the special keywordID is used to indicate thatthe attributeID will be used to supply a unique identifying value for each poem instance(see further the discussion below). Among other possible SGML keywords are

• CDATA The attribute value may contain any valid character data; tags may be in-cluded in the value, but they will not be recognized by the SGML parser, and will notbe processed as tags normally are

• IDREF The attribute value must contain a pointer to some other element (see furtherthe discussion of ID below)

• NMTOKEN The attribute value is aname token, that is, (more or less) any string ofalphanumeric characters

• NUMBER The attribute value is composed only of numerals

In the example above, a list of the possible values for thestatus attribute has beensupplied. This means that a parser can check that no<poem> is defined for which thestatus attribute does not have one ofdraft , revised , or published as its value.Alternatively, if the declared value had been either CDATA or NAME, a parser would haveaccepted almost any string of characters (status=awful or status=12345678 ifit had been aNMTOKEN; status="anything goes" or status = "well, AL-MOST anything" if it were CDATA). Sometimes, of course, the set of possible valuescannot be pre-defined. Where it can, as in this case, it is generally better to do so.

The last piece of each information in each attribute definition specifies how a parsershould interpret the absence of the attribute concerned. This can be done by supplyingone of the special keywords listed below, or (as in this case) by supplying a specific valuewhich is then regarded as the value for every element which does not supply a value for theattribute concerned. Using the example above, if a poem is simply tagged<poem>, theparser will treat it exactly as if it were tagged<poem status=draft>. Alternatively, one ofthe following keywords may be used to specify a default value for an attribute:

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

33

• #REQUIRED A value must be specified.

• #IMPLIED A value need not be supplied (as in the case ofID above).

• #CURRENT If no value is supplied in this element occurrence, the last specifiedvalue should be used.

For example, if the attribute definition above were rewritten as

<!ATTLIST poemid ID #IMPLIEDstatus (draft | revised | published) #CURRENT >

then poems which appear in the anthology simply tagged<poem> would be treated as ifthey had the same status as the preceding poem. If the keyword were #REQUIRED ratherthan #CURRENT, the parser would report such poems as erroneously tagged, as it wouldif any value other thandraft , published , or revised were supplied. The use of#CURRENT implies that whatever value is specified for this attribute on the first poem willapply to all subsequent poems, until altered by a new value. Only the status of the firstpoem need therefore be supplied, if all are the same.

It is sometimes necessary to refer to an occurrence of one textual element from withinanother, an obvious example being phrases such as “see note 6” or “as discussed in chapter5.” When a text is being produced the actual numbers associated with the notes or chaptersmay not be certain. If we are using descriptive markup, such things as page or chapter num-bers, being entirely matters of presentation, will not in any case be present in the marked uptext: they will be assigned by whatever processor is operating on the text (and may indeeddiffer in different applications). SGML therefore provides a special mechanism by whichany element occurrence may be given a special identifier, a kind of label, which may beused to refer to it from anywhere else within the same text. The cross-reference itself is

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

34

regarded as an element occurrence of a specific kind, which must also be declared in theDTD. In each case, the identifying label (which may be arbitrary) is supplied as the valueof a special attribute.

Suppose, for example, we wish to include a reference within the notes on one poem thatrefers to another poem. We will first need to provide some way of attaching a label to eachpoem: this is done by defining an attribute for the<poem> element, as suggested above.

<!ATTLIST poem id ID #IMPLIED >

Here we define an attributeid , the value of which must be of typeID . It is not requiredthat any attribute of typeID have the nameid as well; it is however a useful conventionalmost universally observed. Note that not every poem need carry anid attribute and theparser may safely ignore the lack of one in those which do not. Only poems to which weintend to refer need use this attribute; for each such poem we should now include in itsstart-tag some unique identifier, for example:

<POEM ID=Rose>Text of poem with identifier ’ROSE’

</POEM>

<POEM ID=P40>Text of poem with identifier ’P40’

</POEM>

<POEM>This poem has no identifier

</POEM>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

35

Next we need to define a new element for the cross reference itself. This will not haveany content—it is only a pointer—but it has an attribute, the value of which will be theidentifier of the element pointed at. This is achieved by the following declarations:

<!ELEMENT poemref - O EMPTY ><!ATTLIST poemref target IDREF #REQUIRED >

The<poemref> element needs no end-tag because it has no content. It has a singleattribute calledtarget . The value of this attribute must be of type IDREF (the keywordused for cross reference pointers of this type) and it must be supplied.

With these declarations in force, we can now encode a reference to the poem with idRose as follows:

Blake’s poem on the sick rose <POEMREF TARGET=Rose> ...

When an SGML parser encounters this empty element it will simply check that anelement exists with the identifierRose. Different SGML processors could take any numberof additional actions: a formatter might construct an exact page and line reference for thelocation of the poem in the current document and insert it, or just quote the poem’s title orfirst lines. A hypertext style processor might use this element as a signal to activate a linkto the poem being referred to. The purpose of the SGML markup is simply to indicate thata cross reference exists: it does not determine what the processor is to do with it.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

36

8 SGML ENTITIES

The aspects of SGML discussed so far are all concerned with the markup of structuralelements within a document. SGML also provides a simple and flexible method of encodingand naming arbitrary parts of the actual content of a document in a portable way. In SGMLthe wordentity has a special sense: it means a named part of a marked up document,irrespective of any structural considerations. An entity might be a string of characters or awhole file of text. To include it in a document, we use a construction known as anentityreference. For example, the following declaration

<!ENTITY tei "Text Encoding Initiative">

defines an entity whose name istei and whose value is the string “Text Encoding Ini-tiative” [Note 9]. This is an instance of anentity declaration, which declares aninternalentity. The following declaration, by contrast, declares asystem entity:

<!ENTITY ChapTwo SYSTEM "sgmlmkup.txt">

This defines a system entity whose name isChapTwo and whose value is the textassociated with the system identifier — in this case, the system identifier is the name of anoperating system file and the replacement text of the entity is the contents of the file.

Once an entity has been declared, it may be referenced anywhere within a document.This is done by supplying its name prefixed with the ampersand character and followed bythe semicolon. The semicolon may be omitted if the entity reference is followed by a spaceor record end.

When an SGML parser encounters such anentity reference,it immediately substitutesthe value declared for the entity name. Thus, the passage “The work of the &tei; has only

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

37

just begun” will be interpreted by an SGML processor exactly as if it read “The work of theText Encoding Initiative has only just begun”. In the case of a system entity, it is, of course,the contents of the operating system file which are substituted, so that the passage “Thefollowing text has been suppressed: &ChapTwo;” will be expanded to include the whole ofwhatever the system finds in the filesgmlmkup.txt [Note 10].

This obviously saves typing, and simplifies the task of maintaining consistency in aset of documents. If the printing of a complex document is to be done at many sites, thedocument body itself might use an entity reference, such as&site; , wherever the nameof the site is required. Different entity declarations could then be added at different sitesto supply the appropriate string to be substituted for this name, with no need to change thetext of the document itself.

This string substitutionmechanism has many other applications. It can be used to cir-cumvent the notorious inadequacies of many computer systems for representing the fullrange of graphic characters needed for the display of modern English (let alone the require-ments of other modern scripts or of ancient languages). So-called ‘special characters’ notdirectly accessible from the keyboard (or if accessible not correctly translated when trans-mitted) may be represented by an entity reference.

Suppose, for example, that we wish to encode the use of ligatures in early printed texts.The ligatured form of ‘ct’ might be distinguished from the non-ligatured form by encodingit as &ctlig; rather thanct. Other special typographic features such as leafstops orrules could equally well be represented by mnemonic entity references in the text. Whenprocessing such texts, an entity declaration would be added giving the desired representa-tion for such textual elements. If, for example, ligatured letters are of no interest, we wouldsimply add a declaration such as

<!ENTITY ctlig "ct" >

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

38

and the distinction present in the source document would be removed. If, on the otherhand, a formatting program capable of representing ligatured characters is to be used, wemight replace the entity declaration to give whatever sequence of characters such a programrequires as the expansion.

A list of entity declarations is known as anentity set. Standard entity sets are providedfor use with most SGML processors, in which the names used will normally be taken fromthe lists of such names published as an annex to the SGML standard and elsewhere, asmentioned above.

The replacement values given in an entity declaration are, of course, highly systemdependent. If the characters to be used in them cannot be typed in directly, SGML providesa mechanism to specify characters by their numeric values, known ascharacter references.A character reference is distinguished from other characters in the replacement string bythe fact that it begins with a special symbol, conventionally the sequence ‘&#’, and endswith the normal semicolon. For example, if the formatter to be used represents the ligaturedform of ct by the characters c and t prefixed by the character with decimal value 102, theentity declaration would read:

<!ENTITY ctlig "ct" >

Note that character references will generally not make sense if transferred to another hard-ware or software environment: for this reason, their use is only recommended in situationslike this.

Useful though the entity reference mechanism is for dealing with occasional depar-tures from the expected character set, no one would consider using it to encode extendedpassages, such as quotations in Greek or Russian in an English text. In such situations,different mechanisms are appropriate. These are discussed elsewhere in these Guidelines(see chapter 4: Characters and Character sets).

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

39

A special form of entities,parameter entities, may be used within SGML markup dec-larations; these differ from the entities discussed above (which technically are known asgeneral entities) in two ways:

• Parameter entities are usedonly within SGML markup declarations; with some spe-cial exceptions which will not be discussed here, they will normally not be foundwithin the document itself.

• Parameter entities are delimited by percent sign and semicolon, rather than by am-persand and semicolon.

Declarations for parameter entities take the same form as those for general entities,but insert a percent sign between the keywordENTITY and the name of the entity itself.White space (blanks, tabs, or line breaks) must occur on both sides of the percent sign.An internal parameter entity namedTEI.prose , with an expansion ofINCLUDE, and anexternal parameter entity namedTEI.extensions.dtd , which refers to the system filemystuff.dtd , could be declared thus[Note 11]:

<!ENTITY % TEI.prose ’INCLUDE’><!ENTITY % TEI.extensions.dtd SYSTEM ’mystuff.dtd’>

The TEI document type definition makes extensive use of parameter entities to controlthe selection of different tag sets and to make it easier to modify the TEI DTD. Numerousexamples of their use may thus be found in chapter 3 : Structure of the TEI Document TypeDefinition.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

(a) Introduction

(b) What’s special about SGML

(c) Textual structure

(d) SGML structures

(e) The DTD

(f) More on element declarations

(g) Attributes

(h) SGML entities

(i) Marked sections

(j) Putting it all together

(k) Using SGML

(l) Notes

J I

J I

40

9 MARKED SECTIONS

It is occasionally convenient to mark some portion of a text for special treatment by theSGML parser. Certain portions of legal boilerplate, for example, might need to be includedor omitted systematically, depending on the state or country in which the document wasintended to be valid. (Thus the statement “Liability is limited to $50,000.” might need to beincluded in Delaware, but excluded in Maryland.) Technical manuals for related productsmight share a great deal of information but differ in some details; it might be convenientto maintain all the information for the entire set of related products in a single document,selecting at display or print time only those portions relevant to one specific product. (Thus,a discussion of how to change the oil in a car might use the same text for most steps, butoffer different advice on removing the carburetor, depending on the specific engine modelin question.)

SGML provides themarked sectionconstruct to handle such practical requirementsof document production. In general, as the examples above are intended to suggest, it ismore obviously useful in the production of new texts than in the encoding of pre-existingtexts. Most users of the TEI encoding scheme will never need to use marked sections, andmay wish to skip the remainder of this discussion. The TEI DTD makes extensive useof marked sections, however, and this section should be read and understood carefully byanyone wishing to follow in detail the discussions in chapter chapter 3 : Structure of theTEI Document Type Definition.

The ‘special processing’ offered for marked sections in SGML can be of several types,each associated with one of the following keywords:

1. INCLUDE The marked section should be included in the document and processednormally.

2. IGNORE The marked section should be ignored entirely; if the SGML applicationprogram produces output from the document, the marked section will be excluded

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

41

from the document.

3. CDATA The marked section may contain strings of characters which look likeSGML tags or entity references, but which should not be recognized as such by theSGML parser. (These Guidelines use such CDATA marked sections to enclose theexamples of SGML tagging.)

4. RCDATA The marked section may contain strings of characters which look likeSGML tags, but which should not be recognized as such by the SGML parser; entityreferences, on the other hand, may be present and should be recognized and expandedas normal.

5. TEMP The passage included in the marked section is a temporary part of the doc-ument; the marked section is used primarily to indicate its location, so that it can beremoved or revised conveniently later.

When a marked section occurs in the text, it is preceded by amarked-section startstring, which contains one or more keywords from the list above; its end is marked by amarked-section closestring. The second and last lines of the following example are thestart and close of a marked section to be ignored:

In such cases, the bank will reimburse the customer forall losses.<![ IGNORE [Liability is limited to $50,000.]]>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

42

Of the marked section keywords, the most important for understanding the TEI DTD areINCLUDEandIGNORE; these can be used to include and exclude portions of a document— or a DTD — selectively, so as to adjust it to relevant circumstances (e.g. to allow a userto select portions of the DTD relevant to the document in question).

The literal keywordsINCLUDEandIGNORE, however, are not much use in adjusting aDTD or a document to a user’s requirements, however. (To change the text above to includethe excluded sentence, for example, a user would have to edit the text manually and changeIGNOREto INCLUDE. It might be thought just as easy to add and delete the sentencemanually.) But the keywords need not be given as literal values; they can be represented bya parameter entity reference. In a document with many sentences which should be includedonly in Maryland, for example, each such sentence can be included in a marked sectionwhose keyword is represented by a reference to a parameter entity namedMaryland . Theearlier example would then be:

In such cases, the bank will reimburse the customer forall losses.<![ %Maryland; [Liability is limited to $50,000.]]>

When the entityMaryland is defined asIGNORE, the marked sections so marked willall be excluded. If the definition is changed to the following, the marked sections will beincluded in the document:

<!ENTITY % Maryland ’INCLUDE’>

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

43

When parameter entities are used in this way to control marked sections in a DTD, theexternal DTD file normally contains a default declaration. If the user wishes to override thedefault (as by including the Maryland sections), adding an appropriate declaration to theDTD subset suffices to override the default[Note 12].

The examples of parameter entity declarations at the end of the preceding section cannow be better understood. The declarations

<!ENTITY % TEI.prose ’INCLUDE’><!ENTITY % TEI.extensions.dtd SYSTEM ’mystuff.dtd’>

have the effect ofincluding in the DTD all the sections marked as relevant to prose,since in the external DTD files such sections are all included in marked sections con-trolled by the parameter entityTEI.prose . They also override the default declarationof TEI.extensions.dtd (which declares this entity as an empty string), so as to in-clude the filemystuff.dtd in the DTD.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

44

10 PUTTING I T ALL TOGETHER

An SGML conformant document has a number of parts, not all of which have been dis-cussed in this chapter, and many of which the user of these Guidelines may safely ignore.For completeness, the following summary of how the parts are inter-related may howeverbe found useful.

An SGML document consists of an SGMLprologand adocument instance. The prologcontains anSGML declaration(described below) and adocument type definition, whichcontains element and entity declarations such as those described above. Different softwaresystems may provide different ways of associating the document instance with the prolog;in some cases, for example, the prolog may be ‘hard-wired’ into the software used, so thatit is completely invisible to the user.

10.1 The SGML DeclarationThe SGML declaration specifies basic facts about the dialect of SGML being used suchas the character set, the codes used for SGML delimiters, the length of identifiers, etc. Itscontent for TEI-conformant document types is discussed further in chapter 28 and chapter39. Normally the SGML declaration will be held in the form of compiled tables by theSGML processor and will thus be invisible to the user.

10.2 The DTDThe document type definition specifies the document type definition against which the doc-ument instance is to be validated. Like the SGML declaration it may be held in the formof compiled tables within the SGML processor, or associated with it in some way which isinvisible to the user, or requires only that the name of the document type be specified beforethe document is validated.

At its simplest the document type definition consists simply of a base document typedefinition (possibly also one or more concurrent document type definitions) which is pre-fixed to the document instance. For example:

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

45

<!DOCTYPE my.dtd [<!-- all declarations for MY.DTD go here -->...

]><my.dtd>

This is an instance of a MY.DTD type document</my.dtd>

More usually, the document type definition will be held in a separate file and invokedby reference, as follows:

<!DOCTYPE tei.2 system "tei2.dtd" []><tei.2>

This is an instance of an unmodified TEI type document</tei.2>

Here, the text of the TEI.2 document type definition is not given explicitly, but theSGML processor is told that it may be read from the file with the system identifier given inquotation marks. The square brackets may still be supplied, as in this example, even thoughthey enclose nothing.

The part enclosed by square brackets is known as thedocument type declaration subsetor ‘DTD subset’. Its purpose is to specify any modification to be made to the DTD beinginvoked, thus:

<!DOCTYPE tei.2 SYSTEM "tei2.dtd" [

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

46

<!ENTITY tla "Three Letter Acronym"><!ELEMENT my.tag - - (#PCDATA)><!-- any other special-purpose declarations or

re-definitions go in here -->]><tei.2>

This is an instance of a modified TEI.2 type document,which may contain <my.tag>my special tags</my.tag> andreferences to my usual entities such as &tla;.

</tei.2>

In this case, the document type definition in force includes first the contents of theDTD subset, and then the contents of the file specified after the keywordSYSTEM. Theorder is important, because in SGML only the first declaration of an entity counts. In theabove example, therefore, the declaration of the entitytla in the DTD subset would takeprecedence over any declaration of the same entity in the filetei2.dtd . It is perfectlylegal SGML for entities to be declared twice; this is the usual method for allowing usermodification of SGML DTDs. (Elements, by contrast, may not be declared more than once;if a declaration for<my.tag> were contained in filetei.dtd , the SGML parser wouldsignal an error.) Combining and extending the TEI document type definitions is discussedfurther in chapter chapter 3 : Structure of the TEI Document Type Definition.

10.3 The Document Instance

The document instance is the content of the document itself. It contains only text, markupand general entity references, and thus may not contain any new declarations. A convenientway of building up large documents in a modular fashion might be to use the DTD subsetto declare entities for the individual pieces or modules, thus:

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

47

<!DOCTYPE tei.2 [<!ENTITY chap1 system "chap1.txt"><!ENTITY chap2 system "chap2.txt"><!ENTITY chap3 "-- not yet written --">

]><tei.2><teiHeader> ... </teiHeader><text>

<front> ... </front><body>

&chap1;&chap2;&chap3;...

</body></text></tei.2>

In this example, the DTD contained in filetei2.dtd has been extended by entitydeclarations for each chapter of the work. The first two are system entities referring to thefile in which the text of particular chapters is to be found; the third a dummy, indicating thatthe text does not yet exist (alternatively, an entity with a null value could be used). In thedocument instance, the entity references &chap1; etc. will be resolved by the parser to givethe required contents. The chapter files themselves will not, of course, contain any element,attribute list, or entity declarations—just tagged text.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

48

11 USING SGML

A variety of software is available to assist in the tasks of creating, validating and processingSGML documents. Only a few basic types can be described here. At the heart of most suchsoftware is an SGMLparser: that is, a piece of software which can take a document typedefinition and generate from it a software system capable of validating any document in-voking that DTD. Output from a parser, at its simplest, is just “yes” (the document instanceis valid) or “no” (it is not). Most parsers will however also produce a new version of thedocument instance incanonical form(typically with all end-tags supplied and entity refer-ences resolved) or formatted according to user specifications. This form can then be usedby other pieces of software (loosely or tightly coupled with the parser) to provide additionalfunctions, such as structured editing, formatting and database management.

A structured editoris a kind of intelligent word-processor. It can use information ex-tracted from a processed DTD to prompt the user with information about which elementsare required at different points in a document as the document is being created. It can alsogreatly simplify the task of preparing a document, for example by inserting tags automati-cally.

A formatteroperates on a tagged document instance to produce a printed form of it.Many typographic distinctions, such as the use of particular typefaces or sizes, are inti-mately related to structural distinctions, and formatters can thus usefully take advantage ofdescriptive markup. It is also possible to define the tagging structure expected by a format-ting program in SGML terms, as a concurrent document structure.

Text-oriented database management systems typically use inverted file indexes to pointinto documents, or subdivisions of them. A search can be made for an occurrence of someword or word pattern within a document or within a subdivision of one. Meaningful sub-divisions of input documents will of course be closely related to the subdivisions specifiedusing descriptive markup. It is thus simple for textual database systems to take advantageof SGML-tagged documents. Much research work is also currently going into ways of

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

49

extending the capabilities of existing (non-text) database systems to take advantage of thestructuring information made explicit by SGML markup.

Hypertextsystems improve on other methods of handling text by supporting associativelinks within and across documents. Again, the basic building block needed for such systemsis also a basic building block of SGML markup: the ability to identify and to link togetherindividual document elements comes free as a part of the SGML way of doing things. Bytagging links explicitly, rather than using proprietary software, developers of hypertextscan be sure that the resources they create will continue to be useful. To load an SGMLdocument into a hypertext system requires only a processor which can correctly interpretSGML tags such as those discussed in chapter 14: Linking, Segmentation, and Alignment.

NOTES

[Note 1] International Organization for Standardization, ISO 8879:Informationprocessing—Text and office systems—Standard Generalized Markup Language(SGML)([Geneva]: ISO, 1986)..

[Note 2] Work is currently going on in the standards community to create (using SGMLsyntax) a definition of a standard document style semantics and specification languageor DSSSL.

[Note 3] The actual characters used for the delimiting characters (the angle brackets, excla-mation mark and solidus) may be redefined, but it is conventional to use the charactersused in this description.

[Note 4] The example is taken from William Blake’sSongs of innocence and experience(1794). The markup is designed for illustrative purposes and is not TEI-conformant.

[Note 5] Note that this simple example has not addressed the problem of marking elementssuch as sentences explicitly; the implications of this are discussed below inSection 6.

I N D I A N

TEXusersgroup

A Gentle Introduction to SGML

C. M. Sperberg-McQueen and LauBurnard

1. Introduction

2. What’s special about SGML

3. Textual structure

4. SGML structures

5. The DTD

6. More on element declarations

7. Attributes

8. SGML entities

9. Marked sections

10. Putting it all together

11. Using SGML

12. Notes

J I

J I

50

[Note 6] Like the delimiters, these are assigned formal names by the standard and may beredefined with an appropriate SGML declaration.

[Note 7] What are here called group connectors are referred to by the SGML standardsimply as connectors; the longer term is preferred here to stress the fact that theseconnectors are used only in SGML model groups and name groups. Like the delimitersand the occurrence indicators, group connectors are assigned formal names by thestandard and may be redefined with an appropriate SGML declaration.

[Note 8] It will not have escaped the astute reader that the fact that verse paragraphs neednot start on a line boundary seriously complicates the issue; see furtherSection 6.

[Note 9] By convention case is significant in entity names, unlike element names.

[Note 10] Strictly speaking, SGML does not require system entities to be files; they can inprinciple be any data source available to the SGML processor: files, results of databasequeries, results of calls to system functions — anything at all. It is simpler, however,when first learning SGML, to think of system entities as referring to files, and thisdiscussion therefore ignores the other possibilities. All existing SGML processors dosupport the use of system entities to refer to files; fewer support the other possible usesof system entities.

[Note 11] Such entity declarations might be used in extending the TEI base tag set for proseusing the declarations found in mystuff.dtd.

[Note 12] This is so because the declarations in the DTD subset are read before those in theexternal DTD file, and the first declaration of a given entity is the one which counts.