Making Multi-Structured Documents

description

Slides shown to Elisa Bertino (25 November 2008) about the construction of multi-structured documents.

Transcript of Making Multi-Structured Documents

Page 1: Making Multi-Structured Documents

[email protected] - http://liris.cnrs.fr/~peportie [email protected] - http://liris.cnrs.fr/~scalabre

Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon

Université Claude Bernard Lyon 1, bâtiment Nautibus, 43, boulevard du 11 novembre 1918, F-69622 Villeurbanne cedex

http://liris.cnrs.fr

UMR 5205

Lyon - 25/11/2008

Multi-structured documents

Modeling and creation

Page 2: Making Multi-Structured Documents

The MSD problem

Several specific uses call for several structure types

e.g. physical, logical, semantic, poetic, linguistic

A recurrent problem in the Digital Humanities

TEI recommendations and overlapping hierarchies

Example queries:

Find all damaged words that contain damaged characters only.

Indicate, for each word containing restored characters, the location of the corresponding line.

Page 3: Making Multi-Structured Documents

Medieval Manuscript (1)

Transcription of old manuscripts

Page 4: Making Multi-Structured Documents

Medieval Manuscript (2)

Physical structure

Page 5: Making Multi-Structured Documents

Medieval Manuscript (3)

Lexical structure

Page 6: Making Multi-Structured Documents

Medieval Manuscript (4)

Damaged characters structure

Page 7: Making Multi-Structured Documents

Medieval Manuscript (5)

Image regions structure

Page 8: Making Multi-Structured Documents

Medieval Manuscript (6)

Relations between structures

[Figure: the physical structure, the lexical structure, the damaged characters structure and the text regions structure, connected by the relations "transcription", "lines localization", "broken words localization" and "damaged characters localization".]

A multi-structured document is a document having multiple structures linked together through shared content or other inter-structural relations.
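The definition above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class names (Fragment, Element, Structure), the helper functions and the sample data are invented here and do not come from the slides. Shared content is cut into fragments, each structure anchors its elements on fragment identifiers, and two elements are treated as linked when they share at least one fragment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Fragment:
    """A piece of the shared content (hypothetical id and text)."""
    fid: str
    text: str


@dataclass
class Element:
    """A node of one structure, anchored on fragments of the shared content."""
    name: str
    fragment_ids: List[str]


@dataclass
class Structure:
    """One structure of the document, e.g. physical or lexical."""
    name: str
    elements: List[Element] = field(default_factory=list)


# Invented sample content: one word written across two lines.
fragments = [Fragment("F1", "mul"), Fragment("F2", "ti"), Fragment("F3", "ple")]
by_id = {f.fid: f for f in fragments}

physical = Structure("physical", [Element("line", ["F1", "F2"]),
                                  Element("line", ["F3"])])
lexical = Structure("lexical", [Element("word", ["F1", "F2", "F3"])])


def text_of(element: Element) -> str:
    """Rebuild the text of an element from the fragments it refers to."""
    return "".join(by_id[fid].text for fid in element.fragment_ids)


def linked(a: Element, b: Element) -> bool:
    """Two elements are linked when they share at least one fragment."""
    return bool(set(a.fragment_ids) & set(b.fragment_ids))


print(text_of(lexical.elements[0]))
print(linked(physical.elements[1], lexical.elements[0]))
```

In this toy document the word overlaps both lines, which is the kind of overlap a single XML tree cannot express directly.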

Page 9: Making Multi-Structured Documents

Modern Manuscript (1)

Modern manuscript of J.T. Desanti

Page 10: Making Multi-Structured Documents

Modern Manuscript (2)

Physical structure: lines

Page 11: Making Multi-Structured Documents

Modern Manuscript (3)

Idiomatic structure

Page 12: Making Multi-Structured Documents

Modern Manuscript (4)

Alterations structure

Page 13: Making Multi-Structured Documents

Existing works (1)

(too) specific “models”

[Figure: chart with a 0 to 3 scale comparing approaches on six criteria: model expressivity, model genericity, implementation, usability of XML tools, query mechanisms, structures and data changes. Approaches compared: TEI Guidelines (redundant encoding, empty elements, virtual elements, stand-off markup), CONCUR, MuLaX, MECS / TexMECS, LMNL, MonetDB.]

Page 14: Making Multi-Structured Documents

Existing works (2)

Generic models

[Figure: chart with a 0 to 3 scale comparing approaches on six criteria: model expressivity, model genericity, implementation, usability of XML tools, query mechanisms, structures and data changes. Approaches compared: Delay Nodes, Annotation Graphs, RDF (RDFTEF), MCT, MSXD, GODDAG, MSDM / MultiX.]

Page 15: Making Multi-Structured Documents

Multi-Structured Document Model

MSDM

Page 16: Making Multi-Structured Documents

MSDM (2)

Relations between structures

[Figure: the physical structure, the lexical structure, the damaged characters structure and the text region structure, shown together with a base structure and the relations "localization of broken words", "localization of lines", "transcription" and "localization of damaged characters".]

MultiX; XInclude; etc.

Stand-Off Markup

Page 17: Making Multi-Structured Documents

MultiX (1)

Base Structure

Page 18: Making Multi-Structured Documents

MultiX (2)

Composition for a line of the physical structure

<msd:comp id="C1" idrefs="F1 F2 F3=F4 F5 F6 F7" />

<line n="1"><msd:clink target="BS" label="text content" to="C1"/></line>
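As a hedged illustration of how such a composition could be resolved, the Python sketch below follows the msd:clink of a line to its msd:comp and concatenates the fragments the composition lists. The msd:comp, msd:clink and msd:frag names and the idrefs and to attributes come from the slides; the namespace URI, the doc and base wrapper elements, the fragment texts and the content_of helper are assumptions made for this example only. It does not claim to reproduce the actual MultiX processing.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the real MultiX namespace is not given in the slides.
MSD = "urn:example:msd"
FRAG = "{%s}frag" % MSD
COMP = "{%s}comp" % MSD
CLINK = "{%s}clink" % MSD

# Invented sample document; <doc> and <base> are hypothetical wrappers.
DOC = """
<doc xmlns:msd="urn:example:msd">
  <base id="BS">
    <msd:frag id="F1">Lorem </msd:frag>
    <msd:frag id="F2">ipsum </msd:frag>
    <msd:frag id="F3">dolor</msd:frag>
  </base>
  <msd:comp id="C1" idrefs="F1 F2"/>
  <line n="1"><msd:clink target="BS" label="text content" to="C1"/></line>
</doc>
"""

root = ET.fromstring(DOC)

# Index the fragments of the base structure and the compositions by id.
frags = {f.get("id"): (f.text or "") for f in root.iter(FRAG)}
comps = {c.get("id"): c.get("idrefs").split() for c in root.iter(COMP)}


def content_of(element):
    """Concatenate the fragments reachable through the element's clinks."""
    parts = []
    for clink in element.iter(CLINK):
        for fid in comps[clink.get("to")]:
            parts.append(frags[fid])
    return "".join(parts)


line = root.find("line")
print(repr(content_of(line)))
```

The line element itself holds no text: its content is obtained only by going through the composition, which is what lets several structures reuse the same fragments.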

Page 19: Making Multi-Structured Documents

MultiX (3)

Querying MultiX documents: XQuery functions

rebuild ($elem-seq as element()*) as element()*

share-content ($e as element()) as xs:boolean

share-content-with ($e as element(), $str_name as xs:string) as element()*

share-fragments ($e1 as element(), $e2 as element()) as xs:boolean

get-shared-fragments ($e1 as element(), $e2 as element()) as element(msd:frag)*

includes-fragments-of ($e1 as element(), $e2 as element()) as xs:boolean

etc.
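The slides give only the signatures of these functions. The Python sketch below shows one plausible reading of three of them, with each element reduced to the list of fragment identifiers it is composed of. The semantics are inferred from the function names and are an assumption, not the documented behaviour of the MultiX library.

```python
from typing import List, Sequence


def share_fragments(e1: Sequence[str], e2: Sequence[str]) -> bool:
    """Assumed meaning: the two elements have at least one fragment in common."""
    return not set(e1).isdisjoint(e2)


def get_shared_fragments(e1: Sequence[str], e2: Sequence[str]) -> List[str]:
    """Assumed meaning: the common fragments, in the order they occur in e1."""
    other = set(e2)
    return [fid for fid in e1 if fid in other]


def includes_fragments_of(e1: Sequence[str], e2: Sequence[str]) -> bool:
    """Assumed meaning: every fragment of e2 is also a fragment of e1."""
    return set(e2) <= set(e1)


# Invented sample elements from three different structures.
line = ["F1", "F2", "F3"]
word = ["F3", "F4"]
damaged = ["F3"]

print(share_fragments(line, word))
print(get_shared_fragments(line, word))
print(includes_fragments_of(word, damaged))
```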

Page 20: Making Multi-Structured Documents

MultiX (4)

Find all damaged words that contain damaged characters only.
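One way to read this query, sketched in Python rather than XQuery: a word qualifies when every fragment it is composed of also belongs to some element of the damaged characters structure. The word and damage identifiers, the fragment ids and this interpretation of "damaged characters only" are assumptions made for the example.

```python
# Invented sample data: each element maps to the fragment ids it is built from.
words = {
    "w1": ["F1", "F2"],   # both fragments are damaged
    "w2": ["F3"],         # intact
    "w3": ["F4", "F5"],   # only partly damaged
}
damaged_characters = {
    "d1": ["F1"],
    "d2": ["F2"],
    "d3": ["F4"],
}

# All fragments covered by at least one damaged-characters element.
damaged_fragments = set().union(*damaged_characters.values())

# Words made of damaged fragments only.
fully_damaged = [w for w, frs in words.items() if set(frs) <= damaged_fragments]

# Words that touch at least one damaged fragment, for comparison.
touched = [w for w, frs in words.items() if damaged_fragments.intersection(frs)]

print(fully_damaged)
print(touched)
```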

Page 21: Making Multi-Structured Documents

MultiX (5)

Creation and evolution of MultiX documents

A parser (MXP) creates an internal representation from separate structures

Useful when the structures are known a priori

Page 22: Making Multi-Structured Documents

Creation and Evolution of MSD

Little or no a priori knowledge about the structures

Common situation for scholars in the humanities

E.g. transcription of a poem found in a manuscript using the vocabulary defined by the TEI schema

The loss of his clothes hardly mattered, because
He had seven coats on when he came,
With three pairs of boots—but the worst of it was,
He had wholly forgotten his name.

He would answer to “Hi!” or to any loud cry,
Such as “Fry me!” or “Fritter my wig!”
To “What-you-may-call-um!” or “What-was-his-name!”
But especially “Thing-um-a-jig!”

Page 23: Making Multi-Structured Documents

Before restructuring

[Figure: the same two stanzas, with the stanzas, sentences and verses structures drawn over the base structure, its composition nodes and its fragments.]

Page 24: Making Multi-Structured Documents

Restructuring is necessary

[Figure: the same two stanzas, with the stanzas, sentences and verses structures drawn over the base structure, its composition nodes and its fragments.]

Page 25: Making Multi-Structured Documents

Automatic restructuring

[Figure: the same two stanzas, with the stanzas, sentences and verses structures.]
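The slides do not describe the restructuring algorithm itself. As a rough sketch of the underlying idea, assuming each structure is given as a list of character spans over the same text, the base structure can be recomputed by cutting the text at every boundary used by any structure; each element is then expressed as a composition of the resulting fragments. The function names, the span representation and the sample offsets are all hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def build_fragments(text_length: int, structures: Dict[str, List[Span]]) -> List[Span]:
    """Cut the text at every boundary used by any structure."""
    cuts = {0, text_length}
    for spans in structures.values():
        for start, end in spans:
            cuts.add(start)
            cuts.add(end)
    points = sorted(cuts)
    return list(zip(points, points[1:]))


def compose(span: Span, fragments: List[Span]) -> List[int]:
    """Indices of the fragments that together cover the given span."""
    start, end = span
    return [i for i, (s, e) in enumerate(fragments) if start <= s and e <= end]


# Invented ten-character text with two verses and two sentences that overlap.
text = "abcdefghij"
verses = [(0, 4), (4, 10)]
sentences = [(0, 6), (6, 10)]

# With the verses alone, two fragments are enough.
before = build_fragments(len(text), {"verses": verses})

# Adding the sentences forces the second fragment to be split.
after = build_fragments(len(text), {"verses": verses, "sentences": sentences})

print(before)
print(after)
print(compose(verses[1], after))
```

Adding a structure can therefore change the fragments that existing compositions point to, which is why the earlier structures have to be updated when a new one is introduced.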

Page 26: Making Multi-Structured Documents

User intervention in restructuring

[Figure: the same two stanzas, with the stanzas, verses and sentences structures.]

Page 27: Making Multi-Structured Documents

Perspectives

Shared responsibilities

Who is responsible for each document structure?

What is the life cycle of newly created document structures?

Use of formal knowledge

Formal knowledge (the tree structure of well-formed XML documents) made automatic restructuring possible

It seems necessary to find simple formal conditions for restructuring times …