Making Multi-Structured Documents

Post on 22-May-2015


Slides shown to Elisa Bertino (25 November 2008) about the construction of multi-structured documents.


pierre-edouard.portier@liris.cnrs.fr - http://liris.cnrs.fr/~peportie
sylvie.calabretto@liris.cnrs.fr - http://liris.cnrs.fr/~scalabre

Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon

Université Claude Bernard Lyon 1, bâtiment Nautibus, 43 boulevard du 11 novembre 1918, F-69622 Villeurbanne cedex

http://liris.cnrs.fr


Lyon - 25/11/2008


Multi-structured documents

Modelling and creation


The MSD problem

Several specific uses, several structure types

e.g. physical, logical, semantic, poetic, linguistic

A recurrent problem in the Digital Humanities

TEI recommendations and overlapping hierarchies

Example queries:

Find all damaged words that contain damaged characters only.

Indicate, for each word containing restored characters, the location of the corresponding line.


Medieval Manuscript (1)

Transcription of old manuscripts


Medieval Manuscript (2)

Physical structure


Medieval Manuscript (3)

Lexical structure


Medieval Manuscript (4)

Damaged characters structure


Medieval Manuscript (5)

Image regions structure


Medieval Manuscript (6)

Relations between structures

[Figure: Physical structure, Lexical structure, Damaged characters structure and Text regions structure; relations shown: transcription, lines localization, broken words localization, damaged characters localization]

A multi-structured document is a document having multiple structures linked together through a shared content or other inter-structural relations.
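The definition above can be illustrated with a small data-structure sketch. This is only one possible reading, not the authors' model: the class names, the fragment identifiers and the sample content are all assumptions made for illustration. Each structure maps its elements to the identifiers of shared content fragments, and two elements from different structures are related when they cover a common fragment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Structure:
    """One structure of the document (hypothetical representation)."""
    name: str
    # element label -> ids of the shared content fragments it covers
    elements: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class MultiStructuredDocument:
    """Several structures linked through shared content fragments."""
    fragments: Dict[str, str]
    structures: Dict[str, Structure] = field(default_factory=dict)

    def text_of(self, structure: str, element: str) -> str:
        """Concatenate the fragments covered by one element."""
        ids = self.structures[structure].elements[element]
        return "".join(self.fragments[i] for i in ids)

    def related(self, structure: str, element: str, other: str) -> List[str]:
        """Elements of `other` sharing at least one fragment with `element`."""
        wanted = set(self.structures[structure].elements[element])
        return [name for name, ids in self.structures[other].elements.items()
                if wanted & set(ids)]


# Illustrative content: the word "document" is broken across two lines.
doc = MultiStructuredDocument(
    fragments={"F1": "a ", "F2": "docu", "F3": "ment", "F4": " here"},
    structures={
        "physical": Structure("physical", {"line1": ["F1", "F2"],
                                           "line2": ["F3", "F4"]}),
        "lexical": Structure("lexical", {"w1": ["F1"], "w2": ["F2", "F3"],
                                         "w3": ["F4"]}),
    },
)

print(doc.text_of("lexical", "w2"))
print(doc.related("lexical", "w2", "physical"))
```

In this sketch the broken word is related to both lines because it shares one fragment with each, which is the kind of overlap a single XML hierarchy cannot express directly.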


Modern Manuscript (1)

Modern manuscript of J.T. Desanti


Modern Manuscript (2)

Physical structure: lines


Modern Manuscript (3)

Idiomatic structure


Modern Manuscript (4)

Alterations structure


Existing works (1)

(too) specific “models”

[Chart: each approach is rated on a scale from 0 to 3 for six criteria: model expressivity, model genericity, implementation, usability of XML tools, query mechanisms, structures and data changes]

Approaches compared: TEI Guidelines (redundant encoding, empty elements, virtual elements, stand-off markup), CONCUR, MuLaX, MECS / TexMECS, LMNL, MonetDB


Existing works (2)

Generic models

[Chart: each approach is rated on a scale from 0 to 3 for six criteria: model expressivity, model genericity, implementation, usability of XML tools, query mechanisms, structures and data changes]

Approaches compared: Delay Nodes, Annotation Graphs, RDF (RDFTEF), MCT, MSXD, GODDAG, MSDM / MultiX


Multi-Structured Document Model

MSDM


MSDM (2)

Relations between structures

[Figure: Physical structure, Lexical structure, Damaged characters structure, Text region structure and Base structure; relations shown: Localization of broken words, Localization of lines, Transcription, Localization of damaged characters]

MultiX; XInclude; etc.

Stand-Off Markup


MultiX (1)

Base Structure


MultiX (2)

Composition for a line of the physical structure

<msd:comp id="C1" idrefs="F1 F2 F3=F4 F5 F6 F7" />

<line n="1"><msd:clink target="BS" label="text content" to="C1"/></line>
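To make the indirection concrete, here is a minimal Python sketch of how a line's text might be rebuilt by following a clink to a composition node and then to its fragments. Only the names msd:comp, msd:clink and msd:frag appear in the slides; the namespace URI, the msd:base wrapper element, the document layout and the fragment contents are assumptions. The `=` sign seen in the slide's idrefs value is not interpreted here, so the sample uses plain space-separated identifiers.

```python
import xml.etree.ElementTree as ET

MSD_NS = "urn:example:msd"  # hypothetical namespace URI, not from the slides
NS = {"msd": MSD_NS}

# Hypothetical layout: the msd:base wrapper and the fragment texts are invented.
SAMPLE = f"""
<doc xmlns:msd="{MSD_NS}">
  <msd:base id="BS">
    <msd:frag id="F1">The </msd:frag>
    <msd:frag id="F2">loss </msd:frag>
    <msd:frag id="F3">of</msd:frag>
    <msd:comp id="C1" idrefs="F1 F2 F3"/>
  </msd:base>
  <line n="1"><msd:clink target="BS" label="text content" to="C1"/></line>
</doc>
"""


def text_content(root, element):
    """Follow each clink of `element` to its composition node, then to fragments."""
    frags = {f.get("id"): f.text or ""
             for f in root.iterfind(".//msd:frag", NS)}
    comps = {c.get("id"): c.get("idrefs", "").split()
             for c in root.iterfind(".//msd:comp", NS)}
    parts = []
    for link in element.iterfind("msd:clink", NS):
        parts.extend(frags[ref] for ref in comps[link.get("to")])
    return "".join(parts)


root = ET.fromstring(SAMPLE.strip())
line = root.find("line")
print(text_content(root, line))
```

Keeping the text only in the fragments means that several structures can point at the same content without duplicating it.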


MultiX (3)

Querying MultiX documents: XQuery functions

rebuild($elem-seq as element()*) as element()*

share-content($e as element()) as xs:boolean

share-content-with($e as element(), $str_name as xs:string) as element()*

share-fragments($e1 as element(), $e2 as element()) as xs:boolean

get-shared-fragments($e1 as element(), $e2 as element()) as element(msd:frag)*

includes-fragments-of($e1 as element(), $e2 as element()) as xs:boolean

Etc.
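The slides give only the signatures of these functions. The Python sketch below shows what three of them might compute, with semantics inferred from their names alone, which is an assumption. Elements are modelled here simply as ordered lists of fragment identifiers rather than XML nodes.

```python
def get_shared_fragments(e1, e2):
    """Fragments of e1 that also belong to e2, in e1's order (assumed semantics)."""
    other = set(e2)
    return [frag for frag in e1 if frag in other]


def share_fragments(e1, e2):
    """True when the two elements have at least one fragment in common."""
    return bool(get_shared_fragments(e1, e2))


def includes_fragments_of(e1, e2):
    """True when every fragment of e2 is also a fragment of e1."""
    return set(e2) <= set(e1)


# Illustrative elements: a line and a word that continues on the next line.
line = ["F1", "F2", "F3"]
word = ["F3", "F4"]

print(get_shared_fragments(line, word))
print(share_fragments(line, word))
print(includes_fragments_of(line, word))
```

Reducing the element comparisons to set operations on fragment identifiers is what makes overlap queries across structures straightforward in this reading.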


MultiX (4)

Find all damaged words that contain damaged characters only.
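The query shown on this slide did not survive in the transcript. As a rough illustration only, the sketch below assumes that a word qualifies when it has at least one fragment and every one of its fragments is covered by some element of the damaged characters structure. The function name, the dictionaries and the sample identifiers are hypothetical.

```python
def fully_damaged_words(words, damaged):
    """Words whose fragments are all covered by damaged-character elements.

    Both arguments map an element label to a list of fragment ids
    (a hypothetical representation, not the MultiX encoding).
    """
    damaged_fragments = {frag for frags in damaged.values() for frag in frags}
    return [name for name, frags in words.items()
            if frags and set(frags) <= damaged_fragments]


# Illustrative data: only w2 is made of damaged fragments exclusively.
words = {"w1": ["F1"], "w2": ["F2", "F3"], "w3": ["F4", "F5"]}
damaged = {"d1": ["F2"], "d2": ["F3", "F4"]}

print(fully_damaged_words(words, damaged))
```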


MultiX (5)

Creation and evolution of MultiX documents

A parser (MXP) creates an internal representation from separate structures

Useful with a priori known structures


Creation and Evolution of MSD

Little or no a priori knowledge about the structures

Common situation for scholars in the humanities

E.g. transcription of a poem found in a manuscript using the vocabulary defined by the TEI schema

The loss of his clothes hardly mattered, because
He had seven coats on when he came,
With three pairs of boots--but the worst of it was,
He had wholly forgotten his name.

He would answer to “Hi!” or to any loud cry,
Such as “Fry me!” or “Fritter my wig!”
To “What-you-may-call-um!” or “What-was-his-name!”
But especially “Thing-um-a-jig!”


Before restructuring

[Figure: the same two stanzas, annotated with three structures (stanzas, sentences, verses) above a base structure made of composition nodes and fragments]


Restructuring is necessary

[Figure: the same two stanzas, annotated with three structures (stanzas, sentences, verses) above a base structure made of composition nodes and fragments]


Automatic restructuring

[Figure: the same two stanzas, annotated with the stanzas, sentences and verses structures]
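The slides do not spell out the restructuring procedure. One plausible reading, offered here only as an assumption, is that when a new structure introduces boundaries that fall inside existing fragments, the base structure must be re-cut so that every element of every structure is again a sequence of whole fragments. The sketch below computes such a fragmentation from character spans; the function names and the sample spans are hypothetical.

```python
def fragment(text, *structures):
    """Cut `text` at every boundary used by any structure.

    Each structure is a list of (start, end) character spans.
    Returns the resulting fragments as (start, end) pairs.
    """
    cuts = {0, len(text)}
    for spans in structures:
        for start, end in spans:
            cuts.update((start, end))
    points = sorted(cuts)
    return list(zip(points, points[1:]))


def compose(span, fragments):
    """Indices of the fragments that make up one element's span."""
    start, end = span
    return [i for i, (a, b) in enumerate(fragments) if start <= a and b <= end]


# Illustrative spans: the first sentence runs past the end of the first verse.
text = "abcdefgh"
verses = [(0, 4), (4, 8)]
sentences = [(0, 6), (6, 8)]

fragments = fragment(text, verses, sentences)
print(fragments)
print(compose(sentences[0], fragments))
print(compose(verses[1], fragments))
```

In this example the second verse and the first sentence end up sharing one fragment, which is the overlap that forces the base structure to be finer than either structure taken alone.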


User intervention in restructuring

[Figure: the same two stanzas, annotated with the stanzas, verses and sentences structures]


Perspectives

Shared responsibilities

Who is responsible for each document structure?

What is the life cycle of newly created document structures?

Use of formal knowledge

Formal knowledge (the tree structure of well-formed XML documents) made automatic restructuring possible.

It seems necessary to find simple formal conditions for restructuring times …