Encoding Information for Interchange: St Malo, 1998
Encoding Information for Interchange
An introduction to the TEI
Lou Burnard
Humanities Computing Unit
Oxford University
The problem
• SGML/XML markup is powerful, flexible, and can be customised to meet most (all?) needs
• But to use it, you need a formal specification (aka document type definition or DTD)
• Where do you get one from?
• How do you choose?
Some answers
• Roll your own
  – from scratch
  – within an existing framework
• Take what’s on offer
• Use the TEI architecture
Where did the TEI come from?
• From the humanities research community
  – librarians and cybernauts
  – linguists, historians, lexicographers...
• Sponsors
  – ACH Association for Computers and the Humanities
  – ACL Association for Computational Linguistics
  – ALLC Association for Literary and Linguistic Computing
• Funders
  – U.S. National Endowment for the Humanities
  – Mellon Foundation
  – Commission of the European Communities DG XIII
  – Social Sciences and Humanities Research Council of Canada
… and where is it going?
• Continued work in new application areas
  – manuscript description
  – physical description
  – non-SGML data
  – XML conformance
• Continued take-up
• Need for new infrastructure
• Corrected reprint of P3 due summer 1998
Goals of the TEI
a user-driven codification of existing best practice
• better interchange and integration of data
• support for all texts, in all languages, from all periods
• guidance for the perplexed: what to encode
• assistance for the specialist: how to encode any information of interest
TEI Deliverables
• coherent set of recommendations for text encoding
• comprising several distinct SGML tagsets
• based on existing practice
• documented in a reference manual
• tutorials for general and specialised audiences
... but no software
The TEI modus operandi...
• identify significant particularities independent of notation or realisation
• avoid controversy, over-delicacy, inadequacy
• seek generalizable solutions, acceptable to a consensus
... and some consequences
• focus on content, not presentation
• descriptive, not prescriptive
• Occam's razor
• modular, extensible dtd
• highly general in application, needs customization for particular areas
Who uses TEI?
• see http://www-tei.uic/orgs/tei/app/
• digital librarians and archivists
  – LC, HTI, UVA, CETH, OTA...
• Language Engineering projects
  – EAGLES, BNC, MULTEX, Parole, Silfide
• academic researchers
  – Women Writers Project, Project Orlando, Model Editions Partnership, Canterbury Tales Project, Bodleian Library, and many more...
Designing your DTD
• How can a single mark-up scheme handle a large variety of requirements?
  – all texts are alike
  – every text is different
• Learn from the database designers
  – one construct, many views
  – each view a selection from the whole
How many dtds might you need?
• one (the Corporate or WKWBFY approach)
• none (the Anarchic or NWEUMP approach)
• as many as it takes (the Mixed Economy or WNSA approach)
or is there a better way?
The TEI solution: modularization
a single main DTD with many faces (a British DTD)
• a (very) large number of element and attribute definitions
• organised as tagsets (core, base, additional, or auxiliary)
• grouped into classes
Combining Tag Sets
• And how does one combine tagsets? The how-many-dtds problem is back.
  – all tag sets, all the time (the table d'hôte model)
  – a few pre-selected combinations (the combination plate model)
  – in completely unconstrained abandon (the smorgasbord model)
  – one from column A, two from column B (the Chinese menu model)
The Chicago Pizza Model
<!ENTITY % base "(deepDish|thinCrust|stuffed)" >
<!ENTITY % topping "(pepperoni|mushrooms|sausage|pepper|anchovies| ...)" >
<!ELEMENT pizza - - (%base;, (tomatoSauce & cheese), (%topping;)*) >
To build a view of the TEI dtd, take...
• the core tagsets
• the base of your choice
• the toppings of your choice
<!DOCTYPE TEI.2 SYSTEM 'tei2.dtd' [
<!ENTITY % TEI.prose 'INCLUDE' >
<!ENTITY % TEI.analysis 'INCLUDE' >
]>
<TEI.2>.....</TEI.2>
… trim to fit ...
• user extension files
• rename elements
• undefine elements to be redefined* or removed
<!ENTITY % TEI.extensions.ent SYSTEM 'myMods.ent' >
<!ENTITY % n.p 'para' >
<!ENTITY % seg 'IGNORE' >
* see later
… and cook thoroughly
• ‘compile’ the dtd to remove all parameterization
• easier to use for some software
• better project management
• see http://firth.natcorp.ox.ac.uk/~tei/pizza.html
• don’t forget the documentation!
TEI base tagsets
• one only must be selected
• defines basic structural components
• currently defined:
  – prose, verse, drama
  – transcribed speech
  – dictionaries
  – terminological databases
• mixtures of bases require special treatment
TEI additional tagsets
• sets of elements for specialised application areas
• can be mixed and matched ad lib
• currently provided:
  – linking and alignment; analysis; feature structures; certainty; physical transcription; textual criticism; names and dates; graphs and trees; figures and tables; language corpora....
How does this work?
• Main dtd consists of marked sections, each (potentially) containing one tagset
• By default, all tagsets are IGNOREd
<![ %TEI.tagset; [
<!-- declarations for tagset here -->
]]>
<!ENTITY % TEI.tagset "INCLUDE">
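The switch works because, in SGML, the first declaration of an entity is the binding one, and the document's DTD subset is read before the main dtd. A minimal sketch of the reading order, assuming the prose and verse bases are controlled by entities named TEI.prose and TEI.verse (TEI.verse is an assumption following the naming pattern used on these slides):

```sgml
<!-- 1. Document's DTD subset, read first: switch the prose base on -->
<!ENTITY % TEI.prose "INCLUDE" >

<!-- 2. Main dtd, read afterwards: the defaults. TEI.prose is already
     bound, so its default is skipped; TEI.verse keeps its default. -->
<!ENTITY % TEI.prose "IGNORE" >
<!ENTITY % TEI.verse "IGNORE" >

<!-- 3. Each tagset sits in a marked section keyed on its entity -->
<![ %TEI.prose; [
  <!-- prose declarations: processed -->
]]>
<![ %TEI.verse; [
  <!-- verse declarations: skipped -->
]]>
```

Nothing in the main dtd has to be edited to select a tagset; the choice lives entirely in the document's own subset.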
How does this work? (contd)
• Tagsets contain element and attlist declarations, each also enclosed by a marked section
• By default all elements are INCLUDEd
<![ %element; [
<!ELEMENT %n.element; - - (#PCDATA)>
<!ATTLIST %n.element; %a.global; >
]]>
<!ENTITY % element "IGNORE">
How does this work? (contd)
• Element names (GIs) are always referred to indirectly, so that they may be renamed
<!ELEMENT %n.elem1; - - (%n.elem2;+)>
<!ENTITY % n.elem1 "elem1">
<!ENTITY % n.elem2 "foo">
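As a worked illustration of renaming, here is a sketch using the n.p to para rename from the "trim to fit" slide. The default value "p" and the simplified (#PCDATA) content model are assumptions made for the example, not the actual TEI declarations:

```sgml
<!-- In the user's extensions file, read before the main dtd:
     rename <p> to <para> -->
<!ENTITY % n.p "para" >

<!-- In the main dtd: the default name, skipped because n.p is
     already declared -->
<!ENTITY % n.p "p" >

<!-- The element is declared once, against the entity
     (content model simplified for this sketch) -->
<!ELEMENT %n.p; - - (#PCDATA) >

<!-- After expansion the parser effectively sees:
     <!ELEMENT para - - (#PCDATA) >
-->
```

Because every content model refers to %n.p; rather than to a literal name, the rename propagates to all places where the element may occur.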
Element Classes
• Model classes
  – elements which share syntactic properties (i.e. occur in same position)
• Attribute classes
  – elements which share attributes
• Class membership can be inherited
• Another way of doing architectural forms
Some TEI model classes
• divn: structural elements like divisions
  <div>, <div1>, <div2>, <lg>, <lg1>...
• divtop: elements which can appear at the start of a divn element
  <head>, <epigraph>, <byLine>...
• chunk: paragraph-like elements
  <sp>, <p>, <lg>, <l>…
• phrase: elements which appear within chunks
  <hi>, <foreign>, <date>, <q> ...
Some TEI semantic classes
• data: phrases likely to be normalised or processed non textually
  <date>, <time>, <name>...
• biblpart: specialised components of bibliographic descriptions
  <author>, <title>, <editor>...
• demographic: descriptive features of participants in a language interaction
  <birth>, <socEcstat>, <occupation>...
Some TEI attribute classes
• global: attributes which are available to every element
  n, lang, id, TEIform
• linking: attributes for elements which have linking semantics
  targType, targOrder, evaluate
The class system in action
• Simplifies documentation and understanding of the DTD
• Parameterizes content models
  – different for different bases
• Simplifies customization
  – class membership is unaffected
  – adding new elements to an existing class
Parameterized content models
• “Components”, for example:
  – a dictionary is composed of entries
  – a play is composed of speeches
  – a novel is composed of paragraphs
• in each case, the basic “text soup” (and the structural divisions) remain the same, but they are organized differently
How does this work? (contd)
• the component class has different members in different bases
<![ %TEI.prose; [
<!ENTITY % m.component "p|list|note">
]]>
<![ %TEI.dictionaries; [
<!ENTITY % m.component "entry">
]]>
<!ENTITY % component.seq "(%m.component;)+">
<!ELEMENT div - - (head?, (%component.seq;), div*) >
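Expanding the entities by hand shows what the parser ends up with for each base. This is only a sketch derived from the declarations on this slide, whose class memberships are abbreviated compared with the real tagsets:

```sgml
<!-- Prose base selected: m.component is "p|list|note",
     so the div declaration resolves to -->
<!ELEMENT div - - (head?, ((p|list|note)+), div*) >

<!-- Dictionary base selected instead: m.component is "entry",
     and the same declaration would resolve to
     <!ELEMENT div - - (head?, ((entry)+), div*) >
-->
```

The declaration of div is written once; only the value of m.component changes with the choice of base.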
Customization...
• Removing an element involves
  – undeclaring it
  – (NB: ISO 8879 permits references to undefined elements, though not all vendors know this)
• Adding a new element involves
  – determining its class
  – defining it
  – adding it to that class
Customization (contd)
• Modification of an element implies removal followed by addition
• Class membership should be unaffected
<!-- in TEI.extensions.ent -->
<!ENTITY % p "IGNORE">
<!-- in TEI.extensions.dtd -->
<!ELEMENT %n.p; - - (#PCDATA)>
How does this work? (contd)
• Each model class is defined as a parameter entity
• Reference to class members is always indirect
• Membership extensible (by a kludge)
<!ENTITY % x.class "">
<!ENTITY % m.class "%x.class; name1 | name2 | name3 ..." >
<!ELEMENT %n.element; - - (%m.class;)+>
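The kludge is easiest to see with a concrete class. The sketch below assumes the phrase class is held in an entity named m.phrase (following the m.class pattern) with an abbreviated, illustrative membership; the new members it and ro are borrowed from the Lampeter example:

```sgml
<!-- In the user's extensions file, read first: two new class members.
     The trailing "|" lets the value be prefixed to the default list. -->
<!ENTITY % x.phrase "it|ro|" >

<!-- In the main dtd: the empty default is skipped, because x.phrase
     is already declared -->
<!ENTITY % x.phrase "" >
<!ENTITY % m.phrase "%x.phrase; hi | foreign | date | q" >

<!-- m.phrase now expands to:  it|ro| hi | foreign | date | q
     With no user extension it would expand to:  hi | foreign | date | q -->
```

The user-supplied value has to carry its own trailing separator, since the default must also be valid when x.phrase is empty; that asymmetry is why the slide calls it a kludge.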
An example: the Lampeter corpus
• Requirements
  – light presentational tagging
  – structural markup for access
  – demographic information about text production
  – small number of tags to ease data capture and validation
• Implementation
  – tagsets: prose base, and tags from four additional sets
  – some extensions, many exclusions
The Lampeter corpus DTD subset
<!DOCTYPE TEICORPUS.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose "INCLUDE">
<!ENTITY % TEI.corpus "INCLUDE">
<!ENTITY % TEI.figures "INCLUDE">
<!ENTITY % TEI.transcr "INCLUDE">
<!ENTITY % TEI.extensions.ent SYSTEM "lampext.ent">
<!ENTITY % TEI.extensions.dtd SYSTEM "lampext.dtd">
<!-- more declarations here -->
]>
The Lampeter corpus extensions.ent
<!ENTITY % analytic 'IGNORE' >
<!ENTITY % biblStruct 'IGNORE' >
<!-- hic desunt multa -->
<!ENTITY % supplied 'IGNORE' >
<!ENTITY % x.phrase "it|ro|sc|su|bo|go|">
<!ENTITY % x.biblPart "printer|pubFormat|bookSeller|">
<!ENTITY % x.demographic "socecstatusPat|biogNote|">
<!ENTITY % x.globincl "gap|">
The Lampeter corpus extensions.dtd
<!ELEMENT (it|ro|sc|su|bo|go) - - (%phrase.seq;)>
<!ELEMENT (persName|printer|pubFormat|bookSeller|biogNote|socecstatusPat) - - (%phrase.seq;) >
NB: This is a provisional version only! (no attlists, no documentation…)
Summary
• Designing a successful DTD involves careful, conscious, controlled theft
• Modularize the task
• A class system helps identify
  – what is true of all documents
  – what is true of some documents
• Modifiability can be compatible with standardization