XML for Information Management – Day 3 Airi Salminen XML for Information Management University of Erlangen-Nuremberg Computational Linguistics Instructor: Professor Airi Salminen http://users.jyu.fi/~airi/ 26.4.-30.4.2010


Outline

1. Structured documents
2. Formal grammars in XML
3. Natural languages in XML documents
4. Adding meaning by markup
5. Text indexing
6. Logical structure of XML documents


1. Structured documents

Structured document

‣ structure, content, and external presentation can be separated from each other and processed separately

‣ structural components have names

‣ structural components can be recognized by software modules

‣ the structure can be defined
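A minimal sketch of the points above, assuming Python's standard `xml.etree.ElementTree` module: the software recognizes the named structural components and can process the content separately from the structure. The sample document and the function names are illustrative assumptions, not part of the course material.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are chosen for illustration.
SAMPLE = """<rhymecollection>
  <title>Nursery rhymes</title>
  <rhyme>
    <line>One, two,</line>
    <line>buckle my shoe.</line>
  </rhyme>
</rhymecollection>"""


def component_names(xml_text):
    """Return the names of the structural components in document order, without duplicates."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter():
        if element.tag not in names:
            names.append(element.tag)
    return names


def content_only(xml_text):
    """Return the text content with all markup (structure) removed."""
    root = ET.fromstring(xml_text)
    return [text.strip() for text in root.itertext() if text.strip()]
```

Calling `component_names(SAMPLE)` lists the structure, while `content_only(SAMPLE)` returns only the content; layout is not involved in either step.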


Structured document: an open language standard, e.g., SGML, XML

‣ Structure: different languages for defining the structure, e.g., DTD, XML Schema, RELAX NG for XML
‣ Content
‣ Layout: different languages for defining the layout, e.g., CSS and XSL for XML


Example files for the structure, content, and layout of a structured document:

‣ DTD.txt
‣ rhymes-with-ext-dtd.txt
‣ rhymes-with-ext-dtd.xml
‣ rhymes-style.txt
‣ rhymes-style.css
‣ rhymes-with-style-and-ext-dtd.xml
‣ rhymes-with-style-and-ext-dtd.txt


Management of structured documents

‣ document management

‣ management of the data contained in documents


Characteristics in the management of structured documents

‣ Design. Adopting the approach of structured document management in an environment often requires careful planning before the creation of documents. Design includes schema design and layout design.

‣ Content production. Content can be produced by different types of software, e.g., by a syntax-directed editor. Validity is checked against the schema.

‣ Evolution. Schema versioning, layout versioning.

‣ Operations. The most typical operation is some kind of transformation.

‣ Software. Many kinds of software systems are used.


2. Formal grammars in XML

A formal grammar is a way to describe the syntax of a language. It consists of:

‣ terminal symbols (alphabet)
‣ nonterminal symbols
‣ production rules
‣ start symbol

The language defined by a grammar consists of all those strings over the alphabet that can be generated by starting with the start symbol and then applying the production rules until no nonterminal symbols are present.
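The definition can be tried out with a small sketch. The toy grammar below (S → a S b | a b) and all names in the code are assumptions made for illustration; the function generates the strings of the language by starting from the start symbol and applying production rules until no nonterminal symbols remain.

```python
from collections import deque

# Toy grammar, assumed for illustration only: S -> a S b | a b
TERMINALS = {"a", "b"}
NONTERMINALS = {"S"}
PRODUCTIONS = {"S": [["a", "S", "b"], ["a", "b"]]}
START = "S"


def generate(max_length):
    """Return all strings of at most max_length terminal symbols derivable from START."""
    results = set()
    start_form = (START,)
    queue = deque([start_form])
    seen = {start_form}
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal symbol, if any.
        index = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if index is None:
            results.add("".join(form))  # only terminal symbols are left
            continue
        for rhs in PRODUCTIONS[form[index]]:
            new_form = form[:index] + tuple(rhs) + form[index + 1:]
            # Pruning by length is safe here because no rule of this
            # grammar shortens a sentential form.
            if len(new_form) <= max_length and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(results, key=lambda s: (len(s), s))
```

With this grammar, `generate(6)` yields the strings in which some number of a's is followed by the same number of b's, up to six symbols.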


In XML there are two kinds of formal grammars with their own notations:

‣ the grammar defining the XML syntax in the XML specification

‣ DTD


The XML specification uses the EBNF (Extended Backus-Naur Form) notation with metasymbols ?, *, +, |, and ( ).

The syntax of XML 1.0 is described by production rules numbered from [1] to [89]. Some of the rules included in the first edition have been left out in later editions, and some others have been added, for example, [28a], [28b].

The notation of XML syntax is described in Section 6 of the specification: 6. Notation.


A?      A is optional
A | B   A and B are alternatives
A+      A occurs once or more
A*      A may be missing or occurs once or more
A - B   A but not B
A B     B after A
( )     grouping

Example rules in XML 1.0:

document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
Misc ::= Comment | PI | S
Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
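As a hedged illustration of how such a rule reads, the Comment production can be approximated with a regular expression in Python. This is a sketch, not a conforming parser: it treats every character as a legal Char, which is an assumption, and the names `COMMENT` and `is_comment` are made up for the example.

```python
import re

# Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
#   (Char - '-')        ->  [^-]
#   '-' (Char - '-')    ->  -[^-]
# Simplification: any character counts as Char here.
COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_comment(text):
    """Return True if the whole string matches the Comment production."""
    return COMMENT.fullmatch(text) is not None
```

The rule explains why `<!-- a-b -->` is accepted while a double hyphen inside a comment, as in `<!-- a -- b -->`, is rejected.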


Production rules in a DTD:

<!ELEMENT rhymecollection (title?, rhyme+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT rhyme (line+)>
<!ELEMENT line (#PCDATA)>

The element type declarations of a DTD describe only the hierarchic structure of elements, not their concrete syntax. The details of the concrete syntax (begin-tag, end-tag, etc.) are described in the XML specification.
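Because a content model such as (title?, rhyme+) is a production rule over child element names, it can be checked with ordinary regular-expression machinery. The following is a rough sketch under stated assumptions: it handles element content models only (no #PCDATA, EMPTY, or ANY), and the function names are hypothetical.

```python
import re

# Simplified pattern for element names (an assumption; XML names allow more characters).
NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_to_regex(model):
    """Translate an element content model into a regex over 'name;' tokens."""
    compact = "".join(model.split())  # whitespace is not significant
    # Each element name becomes a group matching that name plus a separator.
    with_names = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + ";)", compact)
    # A comma means sequence, which is plain concatenation in a regex;
    # the metasymbols ?, *, +, | and ( ) keep their meaning.
    return with_names.replace(",", "")


def children_match(model, child_names):
    """Return True if the sequence of child element names satisfies the model."""
    sequence = "".join(name + ";" for name in child_names)
    return re.fullmatch(content_model_to_regex(model), sequence) is not None
```

For example, `children_match("(title?, rhyme+)", ["title", "rhyme", "rhyme"])` holds, while a collection containing only a title does not satisfy the model.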


The XML specification defines the concrete syntax of XML documents.

The distinction between the concrete and abstract syntax of XML is not quite clear. W3C has developed four slightly different models to describe the abstract syntax:

• XML Information Set
• DOM model
• XPath 1.0 model
• XQuery 1.0 and XPath 2.0 data model

Analysis of differences in the models: Salminen, A., & Tompa, F.W. (2001). Requirements for XML document database systems. Proc. of the ACM Symposium on Document Engineering (DocEng '01) (pp. 85-94). New York: ACM Press.


3. Natural languages in XML documents

Natural language may occur in XML marked-up text in the:

• content of elements
• markup
  • element, attribute, and entity names
  • attribute values
  • comments


Natural language in the markup is NOT utilized by the XML processor, BUT it can be utilized by

• human individuals in
  • reading the marked-up text
  • information access
  • communicating with other individuals about the schema or marked-up content
• some software applications, for example, text analysis software


4. Adding meaning by markup

It is important that the element and attribute names are meaningful to human readers.

In the following example the names are not useful in information access:

<AAA XXX="5">
  <rki YYY="Hamlet">Where wilt thou lead me? speak; I'll go no further.</rki>
  <rki YYY="ghost">Mark me.</rki>
</AAA>


Natural language in XML documents provides semantic information to human readers and for human communication.

Meaningful markup is useful for human users in information retrieval and in specifying transformations.

Markup may provide rich semantic and linguistic information.


Example of combining structural, semantic and linguistic markup:

She smelled like trees.

<Chapter section='1'>
  <Paragraph id='143' FragmentCode='1.12'>
    <Narration narrator='Benjy'>
      <Subject person='Caddy'>She</Subject> <Senses mode='smell'>smelled</Senses> like <Imagery referent='tree'>trees</Imagery>
    </Narration>
  </Paragraph>
</Chapter>

Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.


Another markup for the same text:

She smelled like trees.

<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'>
          <Subject person='Caddy'>She</Subject> smelled like trees.
        </Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>

Example from Smith, J., Deshaye, J., & Stoicheff, P., Callimachus - Avoiding the pitfalls of XML for collaborative text analysis. Literary and Linguistic Computing 21 (2), 2006, 199-218.
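A small sketch of how software can use such semantic markup, assuming Python's standard `xml.etree.ElementTree` and the first of the two markup variants of the sentence. The queries and the helper name `plain_text` are illustrative assumptions; they are not taken from the Callimachus system.

```python
import xml.etree.ElementTree as ET

MARKUP = """<Chapter section='1'>
  <Paragraph id='143' FragmentCode='1.12'>
    <Narration narrator='Benjy'>
      <Subject person='Caddy'>She</Subject>
      <Senses mode='smell'>smelled</Senses> like
      <Imagery referent='tree'>trees</Imagery>
    </Narration>
  </Paragraph>
</Chapter>"""

root = ET.fromstring(MARKUP)


def plain_text(element):
    """Return the text content with markup removed and whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


# Semantic queries that rely on the meaningful element and attribute names.
smell_words = [e.text for e in root.findall(".//Senses[@mode='smell']")]
narrators = [e.get("narrator") for e in root.iter("Narration")]
subjects = [(e.get("person"), e.text) for e in root.iter("Subject")]
```

The same queries would have to be rewritten for the second markup variant, where the sense is recorded as an attribute of Imagery instead of a separate Senses element.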


Some other examples:

http://nrrc.mitre.org/NRRC/Docs_Data/MPQA_04/approval_time.htm

http://www.cs.cmu.edu/~awb/festival_demos/sable.html

http://www.etang.umontreal.ca/bwp1800/essays/flanders_encoding4.html


In the Semantic Web, semantic information about the meaning of the markup vocabulary of documents is available as additional metadata in a formal, standardized form.

The concepts and meanings are defined in formal ontologies.

Software applications can understand the meanings.


5. Text indexing

In information retrieval environments, collections of natural language documents are usually indexed; retrieval is based on the index terms included in the index.

[Figure: documents, index, search engine, query, answer]
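The idea of index-based retrieval can be sketched with a simple inverted index. The tokenization rule, the document identifiers, and the choice to require all query terms (AND semantics) are assumptions made for this example; real search engines are considerably more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case index terms (a deliberately naive rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each index term to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the identifiers of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical document collection.
DOCUMENTS = {
    "d1": "She smelled like trees.",
    "d2": "Mark me.",
    "d3": "The trees bloom and fade.",
}
INDEX = build_index(DOCUMENTS)
```

The query is answered from the index alone; the documents themselves are not scanned at query time.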


6. Logical structure of XML documents

Components of the logical structure

• declarations
• elements
• comments
• processing instructions


document ::= prolog element Misc*

‣ prolog: declarations, comments, processing instructions
‣ element: elements, comments, processing instructions
‣ Misc*: comments, processing instructions
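The three parts of the rule document ::= prolog element Misc* can be observed with a DOM parser. The sketch below assumes Python's standard `xml.dom.minidom`; the sample document and the label names are made up for illustration. Note that the XML declaration itself is not reported as a node by this parser.

```python
from xml.dom import minidom, Node

# Hypothetical document: a comment and a processing instruction in the prolog,
# the document element, and a trailing comment (Misc*).
DOC = """<?xml version="1.0"?>
<!-- comment in the prolog -->
<?app instruction in the prolog?>
<rhyme><line>One, two,</line></rhyme>
<!-- comment after the document element -->"""

TYPE_NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.DOCUMENT_TYPE_NODE: "document type declaration",
}


def top_level_parts(xml_text):
    """Return the kinds of the top-level components of a document, in order."""
    document = minidom.parseString(xml_text)
    return [TYPE_NAMES.get(node.nodeType, "other") for node in document.childNodes]
```

Exactly one element appears at the top level; everything before it belongs to the prolog and everything after it to Misc*.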


Declarations:

‣ XML declaration [23]
‣ document type declaration [28]
‣ markup declaration [29]
  • element type declaration [45] (to constrain the logical structure)
  • attribute list declaration [52] (to constrain the logical structure)
  • entity declaration [70] (to constrain the physical structure)
  • notation declaration [82] (to constrain the physical structure)
‣ encoding declaration [80]
‣ standalone document declaration [32]
‣ text declaration [77]


Typical element type declarations:

<!ELEMENT product (mfg, model, description, clock?)>
<!ELEMENT model (#PCDATA)>
<!ELEMENT description (#PCDATA | feature)*>
<!ELEMENT clock EMPTY>

‣ element content defined: product
‣ mixed content defined: model, description
‣ empty element defined: clock


empty element defined:

<!ELEMENT clock EMPTY>

two forms of the element allowed in a well-formed document:

<clock></clock>
<clock/>
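That the two forms are equivalent for an application can be checked with a parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree`; the helper name `is_empty` is an assumption.

```python
import xml.etree.ElementTree as ET


def is_empty(xml_text):
    """Return True if the element has neither child elements nor text content."""
    element = ET.fromstring(xml_text)
    return len(element) == 0 and not element.text


# Both forms produce the same parsed element and the same serialization.
pair_form = ET.fromstring("<clock></clock>")
short_form = ET.fromstring("<clock/>")
same_serialization = ET.tostring(pair_form) == ET.tostring(short_form)
```

After parsing, the application cannot tell which of the two forms was used in the source text.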


element content: definition by content models with metasymbols

*     iteration (none or more)
+     iteration (once or more)
|     alternatives
?     optional
,     successive
( )   grouping

#PCDATA is not accepted in the content model!

Example from XHTML 1.0 Strict DTD:

<!ELEMENT table (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>


mixed content: definition has basically two forms

(#PCDATA)
(#PCDATA | e1 | … | en)*

#PCDATA is always included in the content specification and comes first in the list of alternatives

examples:

<!ELEMENT text (#PCDATA)>
<!ELEMENT section (#PCDATA | subsection)*>
<!ELEMENT section (#PCDATA | subsection | paragraph)*>
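In an instance document, mixed content means that character data and child elements are interleaved. The sketch below shows how such content looks to an application using Python's standard `xml.etree.ElementTree`; the sample section and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of <!ELEMENT section (#PCDATA | subsection)*>
SECTION = "<section>Intro text <subsection>nested part</subsection> closing text</section>"


def mixed_content(xml_text):
    """Return the content of an element as ('text', value) and ('element', name) items."""
    element = ET.fromstring(xml_text)
    parts = []
    if element.text:  # character data before the first child element
        parts.append(("text", element.text))
    for child in element:
        parts.append(("element", child.tag))
        if child.tail:  # character data following this child element
            parts.append(("text", child.tail))
    return parts
```

ElementTree stores the character data that follows a child element in that child's `tail` attribute, which is why the loop reads both `text` and `tail`.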


Attribute list declarations are used:

• to define the set of attributes pertaining to a given element type
• to establish type constraints for these attributes
• to provide default values for attributes


<!ATTLIST poem author CDATA #REQUIRED>

‣ element type: poem
‣ attribute name: author
‣ attribute type: string (CDATA)
‣ constraint (#REQUIRED): the attribute must be specified for all elements of type poem


Defining constraints

[60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue)

#REQUIRED: the attribute must always be provided in all elements of the given type

#IMPLIED: the attribute can be provided in an element; no default value is provided

AttValue: a default value is given between single or double quotes

#FIXED AttValue: instances of the attribute must match the given default value
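The four alternatives of DefaultDecl can be summarized as a small function that computes the value an application would see for an attribute. This is a sketch under assumptions: the function name and the error messages are hypothetical, and AttValue is simplified to any quoted string without further checks.

```python
import re

# Simplified reading of rule [60]; AttValue is taken to be any quoted string.
DEFAULT_DECL = re.compile(
    r"""^(?:(#REQUIRED)|(#IMPLIED)|(?:(#FIXED)\s+)?(?:"([^"]*)"|'([^']*)'))$"""
)


def effective_value(default_decl, specified=None):
    """Return the attribute value seen by the application, or raise ValueError.

    default_decl is the DefaultDecl part of an attribute definition;
    specified is the value given in the element, or None if it is absent.
    """
    match = DEFAULT_DECL.match(default_decl.strip())
    if match is None:
        raise ValueError("not a DefaultDecl: " + default_decl)
    required, implied, fixed, double_quoted, single_quoted = match.groups()
    if required:
        if specified is None:
            raise ValueError("required attribute is missing")
        return specified
    if implied:
        return specified  # may be None: no default value is provided
    default = double_quoted if double_quoted is not None else single_quoted
    if fixed and specified is not None and specified != default:
        raise ValueError("value does not match the fixed default")
    return default if specified is None else specified
```

For instance, `effective_value('"en"')` falls back to the declared default, while `effective_value("#IMPLIED")` reports that no value is available.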


Attribute types

[54] AttType ::= StringType | TokenizedType | EnumeratedType

tokenized types:

• ENTITY, ENTITIES: entity names
• NMTOKEN, NMTOKENS: text tokens consisting of characters accepted in names
• ID: names that uniquely identify elements
• IDREF, IDREFS: references to ID type identifiers

enumerated types:

• NOTATION: identifies notations
• enumeration


<?xml version="1.0"?>
<!DOCTYPE text [
<!ELEMENT text (line+)>
<!ELEMENT line (#PCDATA)>
<!ATTLIST line
  id ID #REQUIRED
  seeline IDREFS #IMPLIED>
]>
<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>
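The constraints expressed by ID and IDREFS can be checked with a few lines of code. Python's standard `xml.etree.ElementTree` does not validate against the DTD, so this sketch takes the attribute names `id` and `seeline` as explicit parameters; that, and the function name, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEXT = """<text>
<line id="r1">This is the first line</line>
<line id="r2" seeline="r1">This is the second line, but look at the first too</line>
</text>"""


def check_references(xml_text, id_attr="id", idrefs_attr="seeline"):
    """Return a list of problems: duplicate IDs and IDREFS values without a target."""
    root = ET.fromstring(xml_text)
    ids = [e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None]
    problems = []
    for value in sorted(set(ids)):
        if ids.count(value) > 1:
            problems.append("duplicate ID: " + value)
    known = set(ids)
    for element in root.iter():
        # An IDREFS value is a whitespace-separated list of names.
        for ref in (element.get(idrefs_attr) or "").split():
            if ref not in known:
                problems.append("unresolved IDREF: " + ref)
    return problems
```

A validating parser performs these checks as part of validation; the sketch only mirrors the two rules to make them concrete.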


XML-aware web browsers support the visualization of the tree structure: example

<Chapter section='1'>
  <Narration narrator='Benjy'>
    <Imagery place='tree' mode='simile' sense='smell'>
      <Fragment code='1.12'>
        <Paragraph id='143'>
          <Subject person='Caddy'>She</Subject> smelled like trees.
        </Paragraph>
      </Fragment>
    </Imagery>
  </Narration>
</Chapter>


Different abstract models describe the tree in slightly different ways.

<poem author="Murasaki Shikibu" born="974">
<!-- The poem is translated from Japanese by Kenneth Rexroth -->
<line>This life of ours would not cause you sorrow</line>
<line>if you thought of it as like</line>
<line>the mountain cherry blossoms</line>
<line>which bloom and fade in a day. </line>
</poem>


Node types of XPath 1.0

Root node
  Element node: poem
    Attribute node: author = Murasaki Shikibu
    Attribute node: born = 974
    Comment node: The poem is translated from Japanese by Kenneth Rexroth
    Element node: line
      Text node: This life of ours would not cause you sorrow
    Element node: line
      Text node: if you thought of it as like
    Element node: line
      Text node: the mountain cherry blossoms
    Element node: line
      Text node: which bloom and fade in a day.
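The node types can be counted with a short traversal. The sketch uses Python's standard `xml.dom.minidom`, i.e. the DOM model rather than the XPath 1.0 model itself; treating the DOM document node as the XPath root node is a simplifying assumption, as is writing the poem without whitespace between the elements.

```python
from collections import Counter
from xml.dom import minidom, Node

# The poem written without whitespace between elements, so that no
# whitespace-only text nodes appear in the tree.
POEM = (
    '<poem author="Murasaki Shikibu" born="974">'
    "<!-- The poem is translated from Japanese by Kenneth Rexroth -->"
    "<line>This life of ours would not cause you sorrow</line>"
    "<line>if you thought of it as like</line>"
    "<line>the mountain cherry blossoms</line>"
    "<line>which bloom and fade in a day. </line>"
    "</poem>"
)


def count_nodes(xml_text):
    """Count the nodes of a document by node type."""
    counts = Counter()

    def visit(node):
        if node.nodeType == Node.DOCUMENT_NODE:
            counts["root"] += 1
        elif node.nodeType == Node.ELEMENT_NODE:
            counts["element"] += 1
            counts["attribute"] += node.attributes.length
        elif node.nodeType == Node.TEXT_NODE:
            counts["text"] += 1
        elif node.nodeType == Node.COMMENT_NODE:
            counts["comment"] += 1
        for child in node.childNodes:
            visit(child)

    visit(minidom.parseString(xml_text))
    return dict(counts)
```

If the poem is parsed with its line breaks between the elements, the DOM additionally reports whitespace-only text nodes, one of the small differences between the abstract models and their implementations.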