Markup Languages and Complex Documents (MLCD)

Universität zu Köln, 10.12.2004

Claus Huitfeldt (University of Bergen) and C.M. Sperberg-McQueen (World Wide Web Consortium)

http://teksttek.aksis.uib.no/projects/mlcd

What MLCD is about

Creating a markup
– notation
– data structure
– grammar
– semantics

for "complex" documents, i.e. documents with overlapping, fragmented or disordered elements, multiple co-existing alternative structures etc.

The structure of this talk

• About document markup
• The success of SGML/XML
• The problems of SGML/XML
• MLCD’s aims and organization etc.
• Data structure
• Notation
• Prototype software
• Grammars
• Related work

What Markup is

(or, what we mean by "markup")

Markup is information added to the character stream of a text document, normally meta-information about the contents, intended interpretation, or processing of part of the character stream.

Markup is
– embedded
– separable

Why Markup is Important

• Form of representation affects:
  – what computers can do with texts
  – the way we think about texts
  – designers of computer text systems
  – authors and readers

• Formal representation may serve as a model for theories of text, or as a tool in theorising about texts.

• Therefore: Shortcomings and problems of current markup systems are also important

The basic elements of markup (The “Tripod”):

• Notation (linearisation)

• Data structure (graph representation)

• Constraint language (grammar)

Why SGML/XML (or is it HTML/PDF?) is such a success

Tight integration of:
– A simple notation
  • (the angle brackets)
– A straightforward data structure with a natural interpretation
  • (document tree and attribution of properties to elements)
– A powerful constraint language
  • (the DTD, a context-free grammar)

Why SGML/XML is still not perfect

Problems representing:
– overlapping elements
– discontiguous elements
– disordered elements
– structural variation (alternate ordering)
  • micro-level variation
  • macro-level alternate ordering
– fragmentation, transposition, disorder
– coextensive elements?
– context-sensitive constraints
– attribute co-occurrence constraints

i.e.: complex structures.

Are complex structures important?

Yes. An example: Overlap

• Overlap exists in real texts:
  – pages and paragraphs
  – physical and formal structure
  – details of inscriptions and other structures
  – verse lines and speeches in verse drama
  – direct discourse and verse lines or sentences

• Overlap also exists in electronic texts:
  – Overlap is explicitly allowed, and used for the encoding of overlapping features, in certain non-SGML systems such as:
    • MECS, FFF (Folio Flat File), TACT/COCOA
  – Such systems may also create "spurious" or "dumb" overlap. And some systems contain overlap although they are not supposed to: HTML!
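
To make "overlap" concrete, here is a minimal sketch in Python. It assumes each feature of a text can be described as a span of character offsets; the `Span` type, the function name and the offsets are illustrative and not part of MLCD. Two spans overlap in the problematic sense when they share text but neither contains the other, as with a paragraph running across a page break.

```python
from typing import NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered


def properly_overlap(a: Span, b: Span) -> bool:
    """True if a and b share text but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares_text and not (a_in_b or b_in_a)


# Hypothetical offsets: two pages, a paragraph crossing the page break,
# and a heading that sits entirely on the first page.
page1 = Span("page", 0, 50)
page2 = Span("page", 50, 100)
para = Span("paragraph", 30, 70)
heading = Span("heading", 0, 20)
```

With these offsets the paragraph properly overlaps both pages, while the heading simply nests inside the first page and could be encoded in a tree without difficulty.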

Doesn’t SGML/XML have solutions?

Yes, but no: SGML/XML "solutions", such as
– milestones
– fragmentation
– virtual elements
– stand-off markup
– CONCUR

are artificial and cumbersome, and are supported neither by SGML/XML as such nor by existing software.

(From now on: SGML/XML -> XML.)

Why doesn’t XML support complex structures?

• Non-hierarchical links are possible, but
• XML element structure is supported by a context-free grammar
• which requires hierarchical nesting of elements

• What we have called "complex structures" are not supported by context-free grammars
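
The nesting requirement can be illustrated with a stack, the mechanism a parser for nested elements typically relies on. The following is a rough sketch, not any real XML parser: it assumes a document has already been reduced to a list of `("open" | "close", name)` pairs, and both the representation and the function name are invented for this illustration.

```python
def is_well_nested(tags):
    """Check that every close tag matches the most recently opened element."""
    stack = []
    for kind, name in tags:
        if kind == "open":
            stack.append(name)
        elif not stack or stack.pop() != name:
            # A close tag that does not match the innermost open element
            return False
    return not stack  # nothing may be left open at the end


nested = [("open", "s"), ("open", "a"), ("close", "a"), ("close", "s")]
overlapping = [("open", "a"), ("open", "b"), ("close", "a"), ("close", "b")]
```

The first sequence is accepted. The second is rejected at the close of `a`, because `b` is still the innermost open element: this is the overlap case that hierarchical nesting rules out.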

Departure point for MLCD: XML and MECS

• MECS:
  – Simple notation for overlapping structures
  – Generic, like XML
  – Used in markup of 20,000 pages
  – Host of software
  – But:
    • No data structure
      – (Just left-to-right scan)
    • No grammar
      – (Though well-formedness and GI vocabulary constraints defined)

The idea behind MLCD

• Combine the best of both worlds (XML and MECS), i.e.

• create a markup language which
  – can handle complex structures (overlap and beyond),
  – is based on a markup tripod tightly integrated like that of XML (notation, data structure and grammar).

Today

[Diagram with the labels: XML, SGML, MECS]

Tomorrow

[Diagram with the labels: XML, SGML, MECS, MLCD]

MLCD – project organization

• Project period: 2001-07
• Project partners
  – Host: Aksis
    • Programmer, researcher, administration
  – UoB
    • Philosophy, Linguistics, Humanities informatics…
  – GSLIS (semantics)
    • Renear, Dubin
  – Sperberg-McQueen (W3C)
  – Others…?

• Achievements
  – Notation (TexMECS)
  – Data structure (GODDAG)
  – Experimental software (wff checker, loader/linearizer, visualizer, MOTS15, BECHAMEL)

• Plans
  – Grammar
  – Prototype software
  – Further partners…?

MLCD notation: TexMECS ("Trivially extended MECS")

• Design goals:
  – Isomorphic to XML and MECS for relevant documents
  – Every TexMECS document corresponds to a GODDAG structure, and vice versa
  – Correct GODDAG construction without application-specific knowledge
  – Simplicity of parsing, minimal number of magic characters

TexMECS elements

Only two reserved characters: < and |

empty: <e att="val">

with ID: <e@foo att="val">

contiguous: <e|...|e>

interrupted: <e|...|-e> ... <+e|...|e>

unordered: <|e||...||e|>

virtual: <^e^foo att="val">

self-overlap: <e~1|...<e~2|...|e~1>...|e~2>

Other TexMECS mechanisms

Internal entities: <&eacute>

Structured internal entities: <&dot.fullstop> vs. <&dot.decimal>

External entities: <<url>>, e.g. <<vw117-a>> or <<http://www.w3.org/XML>>

Comments: <* ... *>. Note that comments can nest.

CDATA sections: <#CDATA< ... >>.

TexMECS examples

<s|<a| John <b| loves |a> Mary |b>|s>

<sp who="HUGHIE"|<p|How did that translation go?|p>
<lg type="haiku"|
<l|da de dum de dum,|l>
<l@frog|gets a new frog,|l>
<l|...|l>|lg>
|sp>
<sp who="LOUIS"|<p|Er ...|p>
<lg|
<l@new|it's a new pond.|l>|lg>
|sp>
<sp who="DEWEY"|
<p|Ah ...|p>
<lg|<l@pond|When the old pond|l>|lg>
<p|Right. That's it.|p>
|sp>

<lg|<^l^pond><^l^frog><^l^new>|lg>
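
As a rough illustration of how the first example can be read, the Python sketch below scans a string for plain start-tags of the form `<e|` and end-tags of the form `|e>` and records which stretch of text each element covers. It is only a toy under stated assumptions: it is not the MLCD software, it ignores attributes, IDs, interrupted, unordered and virtual elements, entities and comments, and it assumes element names match `\w+` and that no name is open twice at once.

```python
import re

# Plain start-tag "<name|" or plain end-tag "|name>" only (an assumption)
TAG = re.compile(r"<(?P<open>\w+)\||\|(?P<close>\w+)>")


def spans(doc):
    """Return (text, {name: (start, end)}) with offsets into the tag-free text."""
    text_parts, open_at, result = [], {}, {}
    pos = 0     # current position in the marked-up source
    length = 0  # length of the tag-free text collected so far
    for m in TAG.finditer(doc):
        chunk = doc[pos:m.start()]
        text_parts.append(chunk)
        length += len(chunk)
        pos = m.end()
        if m.group("open"):
            open_at[m.group("open")] = length
        else:
            name = m.group("close")
            result[name] = (open_at.pop(name), length)
    text_parts.append(doc[pos:])
    return "".join(text_parts), result


text, found = spans("<s|<a| John <b| loves |a> Mary |b>|s>")
```

Here `a` covers the text up to and including " loves " and `b` starts at " loves ", so the two elements share that stretch while neither contains the other. Nothing in the scan needs a stack, which matches the remark that MECS-style notation can be processed by a simple left-to-right scan.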

MLCD data structure: GODDAG ("generalized ordered-descendant directed acyclic graph")

[Two diagrams, labelled "Not:" and "But:"]

Overlap is simply multiple parentage.

GODDAG – general description

A GODDAG is a directed acyclic graph (DAG):
• Every node is either a leaf or a non-terminal.
• Each leaf is labeled with a string.
• Each non-terminal is labeled with an identifier.
• Directed arcs identify parent/child relation; paths identify ancestor/descendant relation.
• Node n is a leaf node iff n is not a parent.
• Node n is a non-terminal node iff n is not a leaf node.
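
The description above can be sketched as a small Python data structure. This is one possible reading for illustration, not the project's implementation: the `Node` class and the `leaves` helper are hypothetical names, and the example graph is built by hand for the sentence "John loves Mary" with overlapping elements `a` and `b`.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    label: str  # a string for a leaf, an identifier for a non-terminal
    children: list = field(default_factory=list)  # ordered outgoing arcs

    @property
    def is_leaf(self):
        return not self.children  # a leaf is a node that is not a parent


def leaves(node):
    """Leaves reachable from node, left to right, each listed once."""
    if node.is_leaf:
        return [node]
    seen, out = set(), []
    for child in node.children:
        for leaf in leaves(child):
            if id(leaf) not in seen:  # a shared leaf is reached via several parents
                seen.add(id(leaf))
                out.append(leaf)
    return out


john, loves, mary = Node("John"), Node("loves"), Node("Mary")
a = Node("a", [john, loves])
b = Node("b", [loves, mary])  # "loves" has two parents: a and b
s = Node("s", [a, b])
```

The leaf `loves` is a child of both `a` and `b`, so the structure is a DAG rather than a tree, while the frontier of `s` still reads John, loves, Mary.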

Restricted GODDAGs

• Leaf nodes are ordered.
• Each non-terminal dominates a contiguous subsequence of leaves.
• No two nodes dominate the same subsequence of the frontier.

Unrestricted GODDAGs

• For each node n, arcs (n → x) are ordered.
• Leaves need not have any ordering; no contiguity rule for non-terminals.
• Two non-terminals may dominate the same set of leaves.
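
A hedged sketch of how the restricted conditions might be checked in Python. It assumes the leaves are numbered 0, 1, 2, ... in frontier order and that each non-terminal is given as the set of leaf positions it dominates. That representation and the function name are assumptions of this illustration, and the uniqueness test is applied to non-terminals only.

```python
def is_restricted(dominates):
    """dominates: dict mapping each non-terminal to the set of leaf positions
    it dominates. Checks contiguity and that no range is dominated twice."""
    seen = set()
    for positions in dominates.values():
        ordered = sorted(positions)
        if not ordered:
            return False  # a non-terminal must dominate at least one leaf
        first, last = ordered[0], ordered[-1]
        contiguous = ordered == list(range(first, last + 1))
        if not contiguous or (first, last) in seen:
            return False
        seen.add((first, last))
    return True


overlap_ok = {"s": {0, 1, 2}, "a": {0, 1}, "b": {1, 2}}
discontiguous = {"s": {0, 1, 2}, "x": {0, 2}}
coextensive = {"a": {0, 1}, "b": {0, 1}}
```

Simple overlap passes the check. An element that skips a leaf, or two elements covering exactly the same leaves, would need an unrestricted GODDAG.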

Features of GODDAGs

Just like a tree:

• simple inheritance

• overriding

• additive meaning

• positional meaning

GODDAG and spurious overlap

<a|…<b|…|a>…|b>

<a|<b|…|a>…|b>
<a|…<b||a>…|b>
<a|…<b|…|a>|b>

However, GODDAGs can be "cleaned" by removing spurious overlap.
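
One way to read these four patterns is to compare the content each element covers rather than the order of its tags. The Python sketch below does that under an assumption of this illustration, not a definition taken from the slides: overlap is treated as spurious when the two elements share no content, or when one element's content lies entirely within the other's. Each "…" is counted as one unit of content.

```python
def classify(a, b):
    """a, b: (start, end) content offsets of two elements whose tags overlap."""
    (a0, a1), (b0, b1) = a, b
    if a1 <= b0 or b1 <= a0:
        return "spurious: no shared content"
    if (b0 <= a0 and a1 <= b1) or (a0 <= b0 and b1 <= a1):
        return "spurious: one could nest in the other"
    return "genuine overlap"


# Content offsets for the four patterns above
genuine = classify((0, 2), (1, 3))     # <a|…<b|…|a>…|b>
same_start = classify((0, 1), (0, 2))  # <a|<b|…|a>…|b>
adjacent = classify((0, 1), (1, 2))    # <a|…<b||a>…|b>
same_end = classify((0, 2), (1, 2))    # <a|…<b|…|a>|b>
```

Under this reading only the first pattern is genuine overlap. The other three could be rewritten with nested or adjacent tags without changing which content belongs to which element.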

Software

– Well-formedness checker
– Loader/linearizer for XML, MECS, TexMECS
– Visualizer
– Retrieval / concordance program

Demo: http://teksttek.aksis.uib.no/projects/mlcd

– BECHAMEL

MLCD’s open slot: A Grammar

• Without a formal grammar, no constraint language.
• Without a constraint language, no proper markup system.

• Validation Requirements
  – allow for overlap
  – allow validation of virtual elements
  – partial validation?
  – modular specification, operations on grammar fragments
    • union
    • intersection
    • difference

The Chomsky hierarchy

• regular grammars (regular expressions)
• context-free grammars (BNF, ...)
• context-sensitive (monotonic) grammars
• unrestricted phrase-structure grammars

What we want is either something a little more powerful than context-free grammars, or else something a little weaker.

Formalisms to consider

• attribute/affix grammars
  – attribute grammars (Knuth 1968)
  – affix grammars (descended from Van Wijngaarden two-level grammars used in Algol 68 Report)
  – extend context-free grammars in limited (tractable) ways by passing parameters on non-terminals

• tree automata
• parallel parsing (intersection of multiple grammars), cf. CONCUR
• exotica (GPSG slash formalism? graph grammars? constraint grammars?)
• standard context-free grammars plus ad hoc rules?
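
The parallel-parsing idea can be hinted at with a very reduced Python sketch. It assumes a document is a stream of `("open" | "close", name)` pairs and that each hierarchy is simply a set of element names. Instead of running a full grammar per hierarchy, it only checks that the tags belonging to each hierarchy nest properly on their own. All names here are hypothetical, and the check is a stand-in for validation, not a proposal from the slides.

```python
def well_nested(tags):
    """Stack check: each close tag must match the innermost open element."""
    stack = []
    for kind, name in tags:
        if kind == "open":
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack


def valid_in_all(tags, hierarchies):
    """Accept only if the projection onto every hierarchy nests properly."""
    return all(
        well_nested([t for t in tags if t[1] in names])
        for names in hierarchies
    )


# A paragraph "p" that runs across a page break
stream = [
    ("open", "page"), ("open", "p"), ("close", "page"),
    ("open", "page"), ("close", "p"), ("close", "page"),
]
```

Treated as two separate hierarchies, `{"page"}` and `{"p"}`, the stream is accepted. Treated as a single hierarchy containing both names, it is rejected, which is the situation a single context-free grammar faces.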

Related work

• Text Encoding Initiative SIG on overlapping markup
• The ARCHway Project (Kentucky)
• OSIS (Steve DeRose)
• JITTs (Patrick Durusau and Brook O’Donnell)
• LMNL project (Wendell Piez and Jeni Tennison)
• Bielefeld Text Technology group?
• Others ???

Thank you

http://teksttek.aksis.uib.no/projects/mlcd