Mdst3703 2013-09-17-text-models
![Page 1: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/1.jpg)
Text Models and Markup
Prof. Alvarado
MDST 3703
17 September 2013
![Page 2: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/2.jpg)
Business
• Plan B: If Home Directory is not working for you, please use the Hive
– Go to http://its.virginia.edu/hive/connected.html
– Install VMware Client
– Use Notepad++
– Home Directory is linked on your Desktop (also as the J drive)
• Tutorials
– If you feel lost about HTML, let me know
![Page 3: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/3.jpg)
Review 1: Textual Signals
• Each of the authors last week viewed the text as a kind of signal
• A signal is a pattern that contains messages
• Messages can be grasped through parsing the signal
• What were the messages? How were they parsed?
![Page 4: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/4.jpg)
A text can be viewed as a long signal consisting of characters selected from a common set of characters
![Page 5: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/5.jpg)
A model of communication. Messages get converted into signals and back into messages by means of a shared code.
[Diagram: Person 1 (ENCODING) and Person 2 (DECODING), linked by a SHARED CODE]
![Page 6: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/6.jpg)
| Author | Parsed elements | Decoded message |
|---|---|---|
| Levi-Strauss | Relations and bundles | Structural oppositions |
| Colby | Thesaurus words | Thematic patterns |
| Ramsay | Scenes | Genres |
![Page 7: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/7.jpg)
Text is like this. This is a map of DC generated by thousands of individual Flickr and Twitter events.
The picture is a kind of signal—collective and unconscious, yet meaningful.
The patterns discerned from the signals are not intentional, but they are the products of intentional activity.
http://anthonyflo.tumblr.com/post/7590868323/photographer-and-self-described-geek-of-maps
[Text is like this]
![Page 8: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/8.jpg)
Review 2: Semantic HTML
• Also called POSH: “Plain Old Semantic HTML”
• The use of HTML to describe a text, not to format it (CSS is used to format)
• DIV, SPAN, CLASS, and ID are general purpose tools to provide more flexible markup
• What kinds of things can POSH be used to describe?
![Page 9: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/9.jpg)
Segue
Semantic markup may be used to support the analysis of each of our authors, including Aristotle
Aristotle: Elements of drama, Elements of plot
<div class="plot-element" id="reversal-of-fortune"> ... </div>
Levi-Strauss: Relations and Bundles in myths
<span class="relation"> ... </span>
Colby: Theme words in folktales
<span class="antagonism">fight</span>
Ramsay: Scenes in plays
<div class="scene"> ... </div>
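Once a text carries semantic markup like this, a program can pull out the marked units by class name. The following is a minimal sketch using Python's standard `html.parser`; the `ClassCollector` class, the `theme_words` helper, and the sample sentence are hypothetical illustrations, not part of the course materials.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the text inside every element carrying a given class.

    Simplification: assumes every start tag has a matching end tag
    (a bare <br> with no closing slash would throw the depth count off).
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0    # greater than 0 while inside a matching element
        self.found = []   # collected text, one entry per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1              # nested element inside a match
        elif self.wanted_class in classes:
            self.depth = 1               # a new match begins here
            self.found.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def theme_words(html, theme):
    """Return the text of every element whose class list includes `theme`."""
    parser = ClassCollector(theme)
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical sample text, marked up in the style of the Colby example.
sample = ('<div class="scene"><p>The brothers '
          '<span class="antagonism">fight</span> and then '
          '<span class="antagonism">quarrel</span>.</p></div>')
print(theme_words(sample, "antagonism"))
```

The same function works for any of the class names above: asking for `"scene"` instead returns the full text of each scene.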
![Page 10: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/10.jpg)
Let’s step back and look more closely at “text”
Let’s look at some material examples
![Page 11: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/11.jpg)
page o’ text
Real world text comes packaged in documents
![Page 12: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/12.jpg)
How is text conveyed in a document?
A document is a material artifact—a medium with which to convey a signal
![Page 13: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/13.jpg)
![Page 14: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/14.jpg)
What is text?
![Page 15: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/15.jpg)
Visual Signifiers
• Small caps
• Indentation
• Alignment
• Italics
• Space
All used to signify elements of text
![Page 16: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/16.jpg)
Other examples
![Page 17: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/17.jpg)
[Charrette]
![Page 18: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/18.jpg)
[The Wasteland]
![Page 19: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/19.jpg)
[Critical Edition]
![Page 20: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/20.jpg)
[OED]
![Page 21: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/21.jpg)
Documents have three Levels: Structure, Content, Style
Structure
The organization of content into units (elements) and logical relationships (e.g. reading order)
Content
TEXT, images, video clips, etc.
Style
Screen and print layout
Fonts, colors, etc.
![Page 22: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/22.jpg)
Descriptive markup languages allow us to define the structure of documents for computational purposes
Theoretically, they do not specify layout or content
![Page 23: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/23.jpg)
[PDF, Procedural Markup]
In contrast to procedural markup like PDF
![Page 24: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/24.jpg)
So, how are documents structured?
![Page 25: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/25.jpg)
Hierarchically …
(theoretically)
![Page 26: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/26.jpg)
Document Elements and Structures
Play
– Act +
  • Scene +
    – Line +
Book
– Chapter +
  • Verse +
Letter
– Heading
  • Return Address
  • Date
  • Recipient Info
    – Name
    – Title
    – Address
– Content
  • Salutation
  • Paragraph +
  • Closing
![Page 27: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/27.jpg)
These are all “trees”
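To make the tree idea concrete, here is a small sketch that writes the Letter structure as nested elements and prints it back as an indented outline. It uses Python's standard `xml.etree.ElementTree`; the element names (`return-address`, `recipient`, and so on) are invented for this illustration and do not come from any particular schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical letter, with the same shape as the Letter outline above.
letter = ET.fromstring(
    "<letter>"
    "<heading><return-address/><date/>"
    "<recipient><name/><title/><address/></recipient></heading>"
    "<content><salutation/><paragraph/><paragraph/><closing/></content>"
    "</letter>"
)


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(letter)))
```

The recursion in `outline` mirrors the structure itself: every element is handled the same way, whether it is the whole letter or a single name.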
![Page 28: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/28.jpg)
XML is a markup language
It is a more powerful system for semantic markup than POSH
![Page 29: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/29.jpg)
What is XML?
• Stands for eXtensible Markup Language
– Actually invented after the web
– A simplification of SGML, the language used to create HTML
– It specifies a set of rules for creating specialized markup languages such as HTML (in its XHTML form) and TEI
• It is a simplified version of SGML
– Standard Generalized Markup Language
• SGML was invented in the early 1970s to wrest the control of documents from computer people who were taking over industries like law and accounting
![Page 30: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/30.jpg)
![Page 31: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/31.jpg)
XML looks like this
Notice how the element names reference units, not layout or style
![Page 32: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/32.jpg)
Also markup for “in-line” elements
![Page 33: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/33.jpg)
XML Premises
1. All documents are composed of elements.
2. Elements contain content.
3. Elements have no layout.
4. Elements are hierarchically ordered.
5. Elements are to be indicated by “markup”: tags that define the beginning and end of an element.
![Page 34: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/34.jpg)
XML Markup Rules
• Tags signify structural elements
• Three kinds of tag
– Start and End, e.g. <p> and </p>
– Singleton, e.g. <br />
• Start and singleton tags can have attributes
– Simple key/value pairs
– <div class="stanza" style="color:red;">
• Basic rules
– All attribute values must be quoted
– All tags must nest (no overlaps!)
![Page 35: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/35.jpg)
Documents in XML that meet these rules are “well formed”
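Well-formedness is something a parser can check mechanically. The sketch below uses Python's standard `xml.etree.ElementTree`, which raises `ParseError` on input that breaks the rules; the `is_well_formed` helper and the three sample strings are hypothetical, written only to illustrate the two basic rules.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Tags nest properly and the attribute value is quoted.
nested = '<div class="stanza"><p>one<br /></p></div>'
# The <p> is still open when </div> arrives: the tags overlap.
overlapping = '<div class="stanza"><p>one</div></p>'
# The attribute value is not quoted.
unquoted = '<div class=stanza><p>one</p></div>'

for name, text in [("nested", nested),
                   ("overlapping", overlapping),
                   ("unquoted", unquoted)]:
    print(name, is_well_formed(text))
```

Only the first string passes. A web browser would tolerate all three as HTML, which is one practical difference between HTML and XML parsing.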
![Page 36: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/36.jpg)
XML also provides Document Types
• A Document Type Definition (DTD) defines a set of tags and rules for using them
– Specifies elements, attributes, and possible combinations
– E.g. in HTML, the ol and ul elements must contain li elements
• A DTD is just one kind of schema system used by XML
• Schemas express data models of/for texts
– TEI is a powerful way of describing primary source materials for scholars
• Documents that use a schema properly are called “valid”
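The difference between well formed and valid can be sketched with the ol/ul rule. Python's standard library does not validate against DTDs, so the code below is a hand-rolled stand-in, not real DTD validation: the `ALLOWED_CHILDREN` table and the `violations` function are hypothetical and check only this one kind of rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: parent tag -> tags allowed as direct children.
ALLOWED_CHILDREN = {"ol": {"li"}, "ul": {"li"}}


def violations(root):
    """Return (parent, child) tag pairs that break the rule table."""
    problems = []
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag)
        if allowed is None:
            continue  # no rule recorded for this element
        for child in parent:
            if child.tag not in allowed:
                problems.append((parent.tag, child.tag))
    return problems


good = ET.fromstring("<body><ul><li>a</li><li>b</li></ul></body>")
bad = ET.fromstring("<body><ol><li>a</li><p>b</p></ol></body>")

print(violations(good))
print(violations(bad))
```

Both documents parse, so both are well formed; only the first also satisfies the rule, which is the sense in which a document is valid against a schema.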
![Page 37: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/37.jpg)
Originally, DTDs defined “genres” like business letter or mortgage form
They were later used to define more abstract models of textual content
![Page 38: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/38.jpg)
XML is used everywhere
• HTML
– E.g. Embed codes
• TEI (Text Encoding Initiative)
• RSS
• Civilization IV
• Playlists (e.g. XSPF or “spiff”)
• Google Maps (KML)
![Page 39: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/39.jpg)
The Text Encoding Initiative created TEI to mark up scholarly documents
Mainly primary sources such as books and manuscripts
![Page 40: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/40.jpg)
TEI
• Written in XML (was SGML)
• The dominant language used to encode scholarly text
• Scholars can select from a large set of elements or define their own elements to match what they are interested in
![Page 41: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/41.jpg)
Examples
• The TEI Header
– http://tbe.kantl.be/TBE/examples/TBED02v00.htm
• TEI Prose
– http://tbe.kantl.be/TBE/examples/TBED03v00.htm
• Find others at the TEI By Example Project
– http://tbe.kantl.be/TBE/
![Page 42: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/42.jpg)
XML and TEI both contain an implicit theory of text
What is it?
![Page 43: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/43.jpg)
OHCO
• XML (and therefore HTML and TEI) implies a certain theory of text
– A text is an OHCO
• OHCO
– Ordered Hierarchy of Content Objects
• An OHCO is a kind of tree
– Elements follow each other in sequences
– Elements can contain other elements
![Page 44: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/44.jpg)
What are the advantages of this view?
![Page 45: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/45.jpg)
OHCO allows for easy processing
• Every element has a precise address in the text
– E.g. HTML/body/p[1]
• Texts can be described in the language of kinship
– Ancestors, parents, siblings, children, etc.
• Texts can be restructured and manipulated by known patterns and algorithms
– Traversing
– Pruning
– Cross-referencing
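These advantages can be tried out directly. The sketch below uses Python's standard `xml.etree.ElementTree` on a tiny made-up document: it looks up an element by its address, lists an element's ancestors, walks the tree in order, and prunes a branch. The sample document and variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny hypothetical document.
doc = ET.fromstring(
    "<html><body>"
    "<p>First paragraph</p>"
    "<p>Second <em>paragraph</em></p>"
    "</body></html>"
)

# Address: the path is written relative to the root element <html>.
first_p = doc.find("body/p[1]")

# Kinship: ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in doc.iter() for child in parent}
em = doc.find(".//em")
ancestors = []
node = em
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Traversing: iter() visits every element in document order.
tags_in_order = [element.tag for element in doc.iter()]

# Pruning: remove the second paragraph (and everything inside it).
body = doc.find("body")
body.remove(body.find("p[2]"))

print(first_p.text)
print(ancestors)
print(tags_in_order)
print(len(body))
```

Note that the traversal list is taken before the pruning step, so it still includes the `em` element that is removed along with its parent paragraph.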
![Page 46: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/46.jpg)
What are the disadvantages of OHCO?
![Page 47: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/47.jpg)
Logical vs. Physical Structure
THIS IS WHAT WE ENCOUNTERED AT THE END OF LAST WEEK’S STUDIO
![Page 48: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/48.jpg)
Two common structures that overlap
Pages and Paragraphs
![Page 49: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/49.jpg)
<page n="2">
. . .
<p id="foo">His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife</p>
</page>
<page n="3">
<p id="bar" prev_id="foo"> a very superior character to anything deserved by his own.</p>
. . .
</page>
Solution 1: Split Elements
![Page 50: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/50.jpg)
<p>His good looks and his rank had one fair claim on his attachment, since to them he must have owed a wife <pb n="3" /> a very superior character to anything deserved by his own.</p>
Solution 2: Use “Milestones”
One structure gets backgrounded
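A backgrounded structure can still be recovered by a program. The sketch below reads the milestone version of the paragraph and regroups its text by page, using Python's standard `xml.etree.ElementTree`. The `text_by_page` function is hypothetical, and it assumes the `<pb/>` milestones are direct children of the paragraph and that the starting page number is supplied by the caller.

```python
import xml.etree.ElementTree as ET


def text_by_page(paragraph, first_page):
    """Split a paragraph's text at <pb n="..."/> milestones.

    Assumes the <pb/> elements are direct children of the paragraph and
    that text before the first milestone belongs to `first_page`.
    """
    pages = {first_page: paragraph.text or ""}
    current = first_page
    for child in paragraph:
        if child.tag == "pb":
            # A milestone: the text that follows belongs to a new page.
            current = child.get("n")
            pages.setdefault(current, "")
            pages[current] += child.tail or ""
        else:
            # Any other inline element stays on the current page.
            pages[current] += "".join(child.itertext()) + (child.tail or "")
    return {page: text.strip() for page, text in pages.items()}


p = ET.fromstring(
    '<p>His good looks and his rank had one fair claim on his attachment, '
    'since to them he must have owed a wife <pb n="3" /> a very superior '
    'character to anything deserved by his own.</p>'
)
pages = text_by_page(p, first_page="2")
for number, text in pages.items():
    print(number, "|", text)
```

The paragraph stays whole in the markup, and the page view is computed on demand. The reverse choice (pages as containers, paragraphs rebuilt from `prev_id` links) is the split-element approach.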
![Page 51: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/51.jpg)
Wittgenstein’s Manuscripts
What about this?
![Page 52: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/52.jpg)
The problem of overlap suggests that OHCO is not as simple as it looks
How does Renear “solve” the problem?
![Page 53: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/53.jpg)
Each OHCO markup schema represents an analytical perspective,
an interpretive model
![Page 54: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/54.jpg)
[Charrette]
![Page 55: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/55.jpg)
So, XML, TEI, POSH – these allow us to impose a model on a text
How does Unsworth characterize these models?
![Page 56: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/56.jpg)
A markup schema is a “knowledge representation”
![Page 57: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/57.jpg)
A KR is a model that comprises
1. A set of categories (aka Ontology)
Names and relationships between names
2. A set of inference rules (aka Logic)
A method of traversing names and relations
3. A medium for computation
A medium for mechanically producing inferences
4. A language for expressing these things
Such as a programming or markup language
![Page 58: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/58.jpg)
What tools besides XML does Unsworth reference as useful for KR?
![Page 59: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/59.jpg)
Tables
![Page 60: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/60.jpg)
What are some differences between trees and tables?
![Page 61: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/61.jpg)
Tables are more rigid
Trees allow for indefinite depth
But tables are easier to manipulate
In any case, tables and trees are two major kinds of data structure that you will encounter …
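One way to see the trade-off is to flatten a tree into a table. In the sketch below (Python's standard `xml.etree.ElementTree`), each element becomes one row with a fixed set of columns: path, depth, and text. The `tree_to_rows` function and the sample play are hypothetical illustrations.

```python
import xml.etree.ElementTree as ET


def tree_to_rows(root):
    """Flatten a tree into (path, depth, text) rows, one row per element."""
    rows = []

    def walk(element, path, depth):
        here = f"{path}/{element.tag}"
        rows.append((here, depth, (element.text or "").strip()))
        for child in element:
            walk(child, here, depth + 1)

    walk(root, "", 0)
    return rows


# A hypothetical play with one act, one scene, and two lines.
play = ET.fromstring(
    "<play><act><scene><line>First line</line>"
    "<line>Second line</line></scene></act></play>"
)
rows = tree_to_rows(play)
for row in rows:
    print(row)
```

The table has the same three columns however deep the tree goes, so depth has to be recorded as data rather than expressed by nesting. In exchange, the rows are easy to filter, sort, and count.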
![Page 62: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/62.jpg)
How to reconcile these tools?
![Page 63: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/63.jpg)
A Proposed Model
• Texts are not documents
– Documents are media, Texts are messages
• Texts and documents are part of a system composed of “levels”
– They are effectively archaeology sites with stratigraphic layers
– Erasures are like cities building on top of each other
• Each level of the system is described by an appropriate set of tools
– Document structures: XML
– Textual structures, embedded ontologies: Tables
![Page 64: Mdst3703 2013-09-17-text-models](https://reader033.fdocuments.in/reader033/viewer/2022061118/5469d15daf7959cb768b603a/html5/thumbnails/64.jpg)
Basic Levels
• Document
– Physical objects (paper)
– Logical objects (defined by space, style, punctuation, etc.)
– Style and layout (also defined by space, color, etc.)
– Can have superimposed versions
• Text
– Sequences of characters
– Grammatical features
– Figures and poetic features
– Etc.