Documents: form vs. content ?
![Page 1: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/1.jpg)
1
Digital preservation. Principles and potential role of XML
Giovanni Michetti
Urbino, 9th October 2002
![Page 2: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/2.jpg)
2
Documents: form vs. content?
Traditional environment:
Form
Content
![Page 3: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/3.jpg)
3
Documents: form vs. content?
Digital environment:
Form
Content
![Page 4: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/4.jpg)
4
Documents: structure
Structure is unavoidably inside documents
As complexity grows, structure grows
Structure is (part of the) message
We deal with structure not only in the digital environment
![Page 5: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/5.jpg)
5
Documents: structure and digital environment
Moving information onto new media
Need for functionality to manage the explosive growth of information
Need to make structure explicit
![Page 6: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/6.jpg)
6
Markup
The proper description of an information resource requires:
identifying its logical components
making its structure explicit
Markup
![Page 7: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/7.jpg)
7
Markup
Markup: every means of making the interpretation of a document explicit
![Page 8: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/8.jpg)
8
From a record ...
University of Urbino
Faculty of Arts
Rome, 1st August 2002
Dr. Giovanni Michetti
Protocol n. 1234/AB
Subject: Teaching appointment
We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.
The Dean
Prof. Giorgio Cerboni Baiardi
Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]
![Page 9: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/9.jpg)
9
… to a marked record ...
<XML>
<letter>
<sender>University of Urbino
Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to</text>
<author>The Dean
Prof. Giorgio Cerboni Baiardi</author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter>
</XML>
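Once the record is marked up, a parser can recover its logical components by name. The following is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`; it assumes `letter` as the root element (the `<XML>` wrapper of the slide is left out) and abbreviates the body text and the heading for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the marked-up letter (body text and heading shortened).
LETTER = """<letter>
<sender>University of Urbino Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records.</text>
<author>The Dean Prof. Giorgio Cerboni Baiardi</author>
<heading>Faculty of Arts, Piano S. Lucia 6 - 61029 Urbino</heading>
</letter>"""

root = ET.fromstring(LETTER)

# The tags name the logical components, in document order.
components = [child.tag for child in root]
print(components)

# Each component can be addressed by name rather than by its position on the page.
print(root.findtext("subject"))
```

The structure that was implicit in the layout of the paper letter is now explicit and machine-readable.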
![Page 10: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/10.jpg)
10
… to a DTD ...
<!ELEMENT letter (sender, date, addressee, protocolnumb, subject, text, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
![Page 11: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/11.jpg)
11
… to a more precise DTD
<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachments?, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT attachments (#PCDATA)>
![Page 12: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/12.jpg)
12
Let’s refine the markup ...
<XML>
<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to</text>
<author><role>The Dean</role><name>Prof. Giorgio Cerboni Baiardi</name></author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter>
</XML>
![Page 13: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/13.jpg)
13
... keeping on refining ...
<XML>
<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
[Protocolnumb + Subject + Text]
<author><role>The Dean</role><name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name></author>
<heading><bureau>Faculty of Arts</bureau><address>Piano S. Lucia 6 - 61029 Urbino</address>
<tel>Tel: 0722.320125</tel><fax>Fax: 0722.322553</fax><email>Email: [email protected]</email></heading>
</letter>
</XML>
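Finer markup pays off when the record is queried: components that were buried inside a single string can be addressed individually. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` and a shortened copy of the refined letter with `letter` as the root element:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the refined letter (protocol number, subject, text and heading omitted).
REFINED = """<letter>
<sender><body>University of Urbino</body><bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<author><role>The Dean</role>
<name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name>
</author>
</letter>"""

letter = ET.fromstring(REFINED)

# Path expressions follow the nesting of the elements.
surname = letter.findtext("author/name/surname")
role = letter.findtext("author/role")
issued_on = letter.findtext("date/time")

print(surname, "|", role, "|", issued_on)
```

With the coarser markup of the first version, the surname could only be obtained by splitting the text of `<author>` by hand.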
![Page 14: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/14.jpg)
14
… and let’s refine the DTD
<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (body, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title, propername, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau, address, tel, fax, email)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
![Page 15: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/15.jpg)
15
The final DTD
<!ELEMENT letter (sender, date, addressee+, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading?)>
<!ELEMENT sender (body?, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role?, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title?, propername?, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau?, address?, tel?, fax?, email?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
![Page 16: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/16.jpg)
16
XML declaration
Every XML document should start with an XML declaration, like
<?xml version="1.0"?>
Such a declaration must be right at the start of the document: there should be nothing before it (comments, instructions, white space, ...)
![Page 17: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/17.jpg)
17
XML declaration
A parser uses the first 5 characters <?xml to understand which character encoding the document uses
The version attribute must have value 1.0
![Page 18: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/18.jpg)
18
XML declaration
It is possible to specify the character encoding using the optional encoding attribute.
Example:
<?xml version="1.0" encoding="ISO-8859-1"?>
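The effect of the declaration can be observed with any parser. The sketch below assumes Python's standard `xml.etree.ElementTree`; the `sender` content with an accented character is a hypothetical example chosen to make the encoding matter.

```python
import xml.etree.ElementTree as ET

# "Università" contains a non-ASCII character. The document is stored as
# ISO-8859-1 bytes and declares that encoding so the parser can decode it.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><sender>Università di Urbino</sender>'
raw = doc.encode("iso-8859-1")

sender = ET.fromstring(raw)
print(sender.text)

# The declaration must come first: even a single leading space is an error.
try:
    ET.fromstring(b" " + raw)
    leading_space_ok = True
except ET.ParseError:
    leading_space_ok = False
print(leading_space_ok)
```

Note that the declaration is written in lower case (`<?xml`), since XML is case-sensitive.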
![Page 19: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/19.jpg)
19
Elements
Elements are the most important components of XML documents: they are the logical components through which you can identify the structure of documents. Example:
<author>Giovanni Michetti</author>
Parts of the element: the delimiters (< and >), the tag name (author), the start-tag (<author>), the content (Giovanni Michetti) and the end-tag (</author>)
![Page 20: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/20.jpg)
20
Elements
Each start-tag must have a corresponding end-tag (starting with a forward slash)
Empty elements (like <img>, <br>, <hr> in HTML) are represented by a tag starting with a delimiter and ending with a forward slash before the closing bracket. Example: <image/>
![Page 21: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/21.jpg)
21
Attributes
Attributes are expressed as name-value pairs associated with elements and appearing only in start-tags
Names are separated from related values by an equal sign (=). Values are wrapped in single or double quotes
Attributes must be associated with elements
The order of the attributes inside a start-tag is not significant
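These rules can be checked with a short experiment. The sketch assumes Python's standard `xml.etree.ElementTree`; the attribute names `lang` and `protocol` are hypothetical and do not come from the letter DTD.

```python
import xml.etree.ElementTree as ET

# Same attributes, different order, different quote style,
# and an empty-element tag versus a start-tag/end-tag pair.
a = ET.fromstring('<letter lang="en" protocol="1234/AB"/>')
b = ET.fromstring("<letter protocol='1234/AB' lang='en'></letter>")

# Attributes are exposed as a name -> value mapping, so order plays no role.
same_attributes = a.attrib == b.attrib
print(same_attributes)
print(a.get("protocol"))
```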
![Page 22: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/22.jpg)
22
XML tree
An XML document is a kind of hierarchical tree. It starts from a root (the root or document element) and develops from it into child elements, which can be siblings of one another
![Page 23: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/23.jpg)
23
XML tree
Each element has one and only one parent (except the root)
Each element except the root is completely wrapped inside another element
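The tree view can be made concrete by walking a parsed document. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` and a shortened letter; `parent_of` and `path_to` are hypothetical helper names introduced here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<letter>"
    "<sender><body>University of Urbino</body><bureau>Faculty of Arts</bureau></sender>"
    "<date>1st August 2002</date>"
    "</letter>"
)

# ElementTree does not store parent links, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

def path_to(element):
    """Return the tag names from the root down to the given element."""
    names = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        names.append(element.tag)
    return "/".join(reversed(names))

bureau = root.find("sender/bureau")
print(path_to(bureau))

# body and bureau are siblings: they share the same parent.
siblings = [child.tag for child in parent_of[bureau]]
print(siblings)
```

Every element appears exactly once as a key of `parent_of`, except the root, which has no parent.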
![Page 24: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/24.jpg)
24
Entities
Example:
<author>Giovanni Michetti</author>
The string Giovanni Michetti (the element content) is also called character data. Character data can appear anywhere inside elements, or as values of attributes
![Page 25: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/25.jpg)
25
Entities
There are special characters that are not allowed in text blocks: what if we want to use the less-than symbol < in a mathematical formula (a < b)?
Two stratagems: (1) CDATA sections, (2) entity references
![Page 26: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/26.jpg)
26
Entities
1. CDATA sections: They start with the CDATA start marker
<![CDATA[
and end with the CDATA end marker
]]>
![Page 27: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/27.jpg)
27
Entities
2. Entity references:
Example:
&lt; <
The parser recognizes the entity reference &lt; and substitutes it with the proper value <
![Page 28: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/28.jpg)
28
Entities
A parser is a piece of software able to read and interpret an XML document. A parser reads the XML document as plain text
Some parsers (validating parsers) are able to check the conformance of an XML document with a DTD
![Page 29: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/29.jpg)
29
Entities
Standard (i.e. predefined) entities:
&lt; <
&gt; >
&amp; &
&apos; '
&quot; "
Any XML parser recognizes these entities and substitutes them with the proper values
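Both stratagems lead to the same character data once the document is parsed. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils`; the `formula` element is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Stratagem 1: a CDATA section. Stratagem 2: an entity reference.
with_cdata = ET.fromstring("<formula><![CDATA[a < b]]></formula>")
with_reference = ET.fromstring("<formula>a &lt; b</formula>")
print(with_cdata.text, "|", with_reference.text)

# When writing XML, the special characters have to be escaped first.
escaped = escape("a < b & c > d")
print(escaped)
```

`escape` replaces `&`, `<` and `>` by default; quotes inside attribute values need separate handling.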
![Page 30: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/30.jpg)
30
Well-formed documents
Any XML document must be well formed: it has to comply with some constraints, some of which are:
Each start-tag has a corresponding end-tag
Elements can’t overlap
There must be one and only one root element
Attribute values must be quoted
An element can’t contain different attributes with the same name
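Each of the constraints listed above can be tested by handing a small document to a parser, which has to reject anything that is not well formed. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; `is_well_formed` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

samples = {
    "ok": "<letter><date>1st August 2002</date></letter>",
    "missing end-tag": "<letter><date>1st August 2002</letter>",
    "overlapping elements": "<letter><b><i>text</b></i></letter>",
    "two root elements": "<letter/><letter/>",
    "unquoted attribute value": "<letter lang=en/>",
    "duplicate attribute": '<letter lang="en" lang="it"/>',
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample is accepted; each of the others violates exactly one constraint from the list.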
![Page 31: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/31.jpg)
31
Document Type Definition (DTD)
Once able to create a set of attributes and tags, we need to share it with other users in order to adopt the same syntax
We need a Document Type Definition (DTD)
![Page 32: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/32.jpg)
32
Document Type Definition (DTD)
A DTD defines what markup can be used in a document that is supposed to conform to a specific structure, whose components are identified by tags
![Page 33: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/33.jpg)
33
Document Type Definition (DTD)
For example, a DTD defines what elements a document can contain, their occurrences, their order, and so on
A DTD can set out which attributes an element can take and whether a value is required. It is also possible to define a set of predefined values for the attributes, and so on
![Page 34: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/34.jpg)
34
Internal and external DTD
A DTD can be an external file or it can be included as part of the XML document. If it is an external file, the XML document must contain an explicit reference inside the Document Type Declaration:
<!DOCTYPE MyXMLDoc SYSTEM "file.dtd">
![Page 35: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/35.jpg)
35
Internal and external DTD
A DTD can also be written inside the document type declaration. In this case we have an internal DTD, like:
<!DOCTYPE MyXMLDoc [
<!ELEMENT MyXMLDoc (#PCDATA)>
]>
In this case, all the constraints on the structure of the document are provided as declarations inside the square brackets
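The difference between the two forms is visible in the parsed document type declaration. A minimal sketch, assuming Python's standard `xml.dom.minidom`, which records the declaration but neither fetches `file.dtd` nor validates against it:

```python
from xml.dom import minidom

# External DTD: the declaration only carries a reference to the file.
external = minidom.parseString(
    '<!DOCTYPE MyXMLDoc SYSTEM "file.dtd"><MyXMLDoc>text</MyXMLDoc>'
)

# Internal DTD: the declarations sit inside the square brackets.
internal = minidom.parseString(
    """<!DOCTYPE MyXMLDoc [
<!ELEMENT MyXMLDoc (#PCDATA)>
]>
<MyXMLDoc>text</MyXMLDoc>"""
)

print(external.doctype.name, external.doctype.systemId)
print(internal.doctype.name, internal.doctype.systemId)
```

In both cases the name after `<!DOCTYPE` is the name of the root element of the document.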
![Page 36: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/36.jpg)
36
Element declarations
A DTD is a set of declarations, the most important of which is the element declaration. Any DTD must have at least one element declaration (for the root element)
The syntax for a declaration is:
<!ELEMENT elementname (contentmodel)>
![Page 37: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/37.jpg)
37
Element declarations
Example:
<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+|line+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>
![Page 38: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/38.jpg)
38
Cardinality suffixes
Cardinality suffixes are symbols used to specify how many times an element can occur at a certain point of the structure. Symbols used are:
? 0-1
+ 1-n
* 0-n
(none) 1
![Page 39: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/39.jpg)
39
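Connectors
Connectors are symbols used to specify order and relationships between components of a model
Symbols used are:
, (comma): sequence, the components must appear in the given order
| (vertical line): choice, exactly one of the components appears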
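Cardinality suffixes and connectors behave much like the operators of a regular expression applied to the sequence of child element names. The sketch below exploits this analogy in Python; it is a simplified illustration rather than a validator, it handles only element-only content models (no `#PCDATA`, `EMPTY` or `ANY`), and `content_model_to_regex` and `matches` are hypothetical helper names.

```python
import re

def content_model_to_regex(model: str) -> re.Pattern:
    """Translate an element-only DTD content model into a regular expression."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that name followed by a space.
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: f"(?:{re.escape(m.group(0))} )",
        compact,
    )
    # The comma means "followed by", which in a regex is plain concatenation;
    # ?, +, * and | already have the same meaning in both notations.
    return re.compile(pattern.replace(",", ""))

def matches(model: str, children: list) -> bool:
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None

POEM = "(title?, (stanza+|line+))"
print(matches(POEM, ["title", "stanza", "stanza"]))
print(matches(POEM, ["line", "line"]))
print(matches(POEM, ["stanza", "line"]))
```

The third call fails because the vertical line allows either stanzas or lines inside a poem, not a mixture of the two.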
![Page 40: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/40.jpg)
40
Attribute declarations
An attribute declaration defines the attributes associated with a given element
The syntax for a declaration is:
<!ATTLIST element_name attribute_definition*>
where an attribute definition is like:
attribute_name attribute_type default_declaration
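As an illustration, the sketch below declares a hypothetical `lang` attribute of type `CDATA` with the default value `"it"` in an internal DTD, and parses two documents with Python's standard `xml.etree.ElementTree`. It relies on the underlying expat parser reporting attribute defaults taken from the internal DTD; no validation takes place.

```python
import xml.etree.ElementTree as ET

# Internal DTD with one attribute definition: name, type, default declaration.
DTD = """<!DOCTYPE letter [
<!ELEMENT letter (#PCDATA)>
<!ATTLIST letter lang CDATA "it">
]>
"""

defaulted = ET.fromstring(DTD + "<letter>Buongiorno</letter>")
explicit = ET.fromstring(DTD + '<letter lang="en">Good morning</letter>')

# The default applies only when the start-tag does not specify the attribute.
print(defaulted.get("lang"))
print(explicit.get("lang"))
```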
![Page 41: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/41.jpg)
41
Valid documents
Well-formed documents: XML documents conforming to the rules laid down in the XML 1.0 specification
Valid documents: well-formed documents conforming to the rules laid down in a DTD
![Page 42: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/42.jpg)
42
Stylesheets
So far the structure. But how can we render documents in the proper way?
Stylesheets
![Page 43: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/43.jpg)
43
Stylesheets
Since content is separated from style, we no longer need to rewrite the whole document each time we want to change the layout: we simply need to change the “instructions” that modify rendering. In other words, we can modify representation without modifying content
XSL (eXtensible Stylesheet Language) is a style language based upon DSSSL (Document Style Semantics and Specification Language)
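The principle can be illustrated without XSL itself. In the sketch below two plain Python functions stand in for two stylesheets (this is an analogy, not XSL); both render the same parsed content, and the function names and the shortened letter are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

SOURCE = (
    "<letter><subject>Teaching appointment</subject>"
    "<text>We inform you that you have been offered the teaching.</text></letter>"
)
letter = ET.fromstring(SOURCE)

def render_plain(doc):
    """First set of rendering instructions: plain text, subject in capitals."""
    return f"{doc.findtext('subject').upper()}\n\n{doc.findtext('text')}"

def render_html(doc):
    """Second set of rendering instructions: HTML heading and paragraph."""
    return f"<h1>{escape(doc.findtext('subject'))}</h1><p>{escape(doc.findtext('text'))}</p>"

print(render_plain(letter))
print(render_html(letter))
```

Changing the layout means changing or replacing a rendering function; `SOURCE` is never touched.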
![Page 44: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/44.jpg)
44
So far the document …
… but a document is (generally) part of a file, which is in turn part of a series or a more complex archival collection
Archival bond
![Page 45: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/45.jpg)
45
The object of analysis:from documents ...
![Page 46: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/46.jpg)
46
… to files ...
![Page 47: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/47.jpg)
47
.....
![Page 48: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/48.jpg)
48
… to series
![Page 49: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/49.jpg)
49
Archives:a complex system of relationships
File
Series
Archive
Document
![Page 50: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/50.jpg)
50
Preserving, of course; but what?
Preserving
Original data
Context allowing data to be interpreted
Hardware
??
?
![Page 51: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/51.jpg)
51
Preserving context
Preserving the context
Need to manage a network of metadata
![Page 52: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/52.jpg)
52
XML technologies
XML Schema
Document Object Model (DOM)
Simple API for XML (SAX)
XSLT/XPath
XML Query
XLink
XPointer
XML Base
XForms
XML Fragment Interchange
XInclude
![Page 53: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/53.jpg)
53
XML features
It’s a formal, non-proprietary standard: it is acceptable to a wide range of users
It’s a meta-language: it allows DTDs to be defined and documents to be validated
It allows highly structured documents to be managed
It’s human-readable and self-descriptive: good chances to last
It uses Unicode text: no problems related to internationalization
![Page 54: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/54.jpg)
54
XML features
It’s a family of technologies
It’s modular
It’s license-free and platform-independent
It can be transported across the Web using existing transport protocols: re-use of communication and security structures already in place
![Page 55: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/55.jpg)
55
XML features
It makes metadata easy to manage
It provides a very good mechanism for representing the layout
It’s easy and powerful, but not too expensive
![Page 56: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/56.jpg)
56
XML double-edged features
1. It’s a meta-language: it allows DTDs to be defined, hence the danger of specialization (each user community with its own language)
Without a common language, XML is not so competitive with respect to other mechanisms of data interchange
XSL does allow translation between different encodings, but it can be quite complex
RosettaNet and OASIS: trying to adopt common languages
![Page 57: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/57.jpg)
57
XML double-edged features
2. It’s self-descriptive: you can create documents without using a DTD ...
![Page 58: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/58.jpg)
58
XML double-edged features
3. It supports sophisticated searching by means of the tags embedded in the text, but poor markup (incomplete or incorrect) greatly reduces search effectiveness
![Page 59: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/59.jpg)
59
XML limitations
It’s a syntax: it contains no semantics, so you need to use other XML modules such as XML Schema and RDF
It’s based upon text: the size of the markup can be much larger than the data itself
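The overhead is easy to measure on a small fragment. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` and the finely marked-up author name from the letter example:

```python
import xml.etree.ElementTree as ET

record = (
    "<name><title>Prof.</title><propername>Giorgio</propername>"
    "<surname>Cerboni Baiardi</surname></name>"
)

# The data is the character content; everything else is markup.
data = "".join(ET.fromstring(record).itertext())
markup_chars = len(record) - len(data)

print(f"data: {len(data)} characters, markup: {markup_chars} characters")
```

For this fragment the tags take up more than twice as many characters as the data they describe.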
![Page 60: Documents: form vs. content ?](https://reader036.fdocuments.in/reader036/viewer/2022062323/568152cb550346895dc0e644/html5/thumbnails/60.jpg)
60
Preservation
Some considerations ...