Enabling xComForTable Mapping to the Linguistic Annotation Framework

Marion Freese

Sony International (Europe) GmbH;

IMS, Universität Stuttgart;

hmb Datentechnik GmbH

LREC 2004, 05/29/2004, Marion Freese

Overview

xComForT – Outline
Relevance for richly annotated corpora
xComForT Features
  – Adaptation to new text formats
  – Integration of annotation tools
Proposal for integration into LAF
Summary

xComForT – What is it?

extensible Common Format for Text
based on
  – XML
  – Text Encoding Initiative (TEI)
  – Corpus Encoding Standard (CES / XCES)
provides extensibility and reusability

xComForT – What’s it for?

NOT
  – Standard for linguistic annotation
BUT
  – Standards proposal for structural annotation of primary data
  – Common anchor for linguistic annotations (LA)
  – Set of guidelines for LA architecture (company-internal standard)

Example: Newspaper (plain text)

[Figure: newspaper page in plain text with labelled regions: byline, copyright, meta information, headline, quotation, byline, dateline, paragraph]

xComForT – Primary Document Example

<xcomfortDoc type="text" extension="SZ" version="v0.6" TEIform="TEI.2">
  <cesHeader ...> <!-- ... --> </cesHeader>
  <text xml:lang="de">
    <!-- ... -->
      zu erhalten.</p>
      <byline type="signer">
        <docAuthor type="short">mgd</docAuthor>
      </byline>
    </div>

    <div type="article" id="d19990104_a12">
      <opener id="d19990104_a12o">
        <divMeta>
          <publDate>Montag, 4. Januar 1999</publDate>
          <cat target="ns8"><hi>BAYERN</hi></cat>
          <!-- ... -->
        </divMeta>
        <head id="d19990104_a12hl1">Kafkaeskes Augsburg</head>
        <head id="d19990104_a12hl2" type="sub">Der nächste Akzent <!-- ... --></head>
        <byline type="main">Von <docAuthor type="full">Peter Richter</docAuthor></byline>
        <dateline><location>Augsburg</location> – </dateline>
        <p id="d19990104_a12p1">Auch wenn nicht <!-- ... --></p>
  <!-- ... -->
</xcomfortDoc>

xComForT – Data Architecture

[Figure: xComForT data architecture. The xComForT storage format is layered as level 1 (base document), level 2 (token level), level 3 (1st linguistic level) and level 4 (2nd linguistic level).]
Streams shown: token stream; e.g. morpheme, syllable streams; e.g. sentence, chunk, mw streams; e.g. PoS, lemma, pronunciation streams; e.g. parse tree stream; e.g. intonation stream; segInfo
Link types shown between streams: substring / 1:1; range-to / 1:1; 1:1 (#id)

Relevance for richly annotated Corpora

Standoff-Markup
  – supports huge amount of annotation data
      » alternative / concurrent / ambiguous annotations
      » partial / underspecified results
      » flexible merging
      » various annotation types (multimodal, multimedia, metadata, …)
      media independence
  – reduces annotation dependencies
Support for integration of external tools for annotation and exploitation
common standards-based starting point for rich annotation

Comparison with CES

Structural markup and linguistic annotation are strictly separated in xComForT
  provides a common base format for arbitrary linguistic annotation
  allows the use of a consistent annotation schema
Primary document DTD is easily extensible while retaining TEI conformance
xComForT provides more flexibility than CES with respect to resource formats (e.g. integration of different modalities possible)

Creation of an extended DTD for storage

[Figure: creation of an extended DTD for storage]
core markup definition: xComForT.ent, xComForT.dtd
extension definition: xcomfort_new.ent, xcomfort_new.dtd
TEI conformant storage format: xComForT_store.dtd
TEI conformant extension storage format: xComForT_store_new.dtd
Further files shown: class.mod, class.new, class.comments, elem.mod, elem.new, elem.comments
Further label shown: template

Extension Definition Support

core markup definition contains an extension entity for each element and entity, e.g.
      » <!ENTITY % x.byline ''>
      » <!ELEMENT byline (#PCDATA | author %x.byline;)*>
an extension definition redefines the entity, e.g.
      <!ENTITY % x.byline '| interviewer'>
so that the element declaration expands to
      <!ELEMENT byline (#PCDATA | author | interviewer)*>
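
The effect of such an extension entity can be mimicked in a few lines of Python. This is a minimal sketch, not part of the xComForT toolbox: the function name `expand_parameter_entities` and the list-based entity table are assumptions made for illustration. It relies on the XML rule that the first declaration of an entity is binding, so an extension has to be declared before the core default is read. The trailing `*` is what a validating parser requires for mixed content.

```python
import re

def expand_parameter_entities(content_model, declarations):
    """Expand %name; references in a DTD content model.

    `declarations` is a list of (name, value) pairs in document order.
    As in XML, the first declaration of an entity wins.
    """
    table = {}
    for name, value in declarations:
        table.setdefault(name, value)  # later redeclarations are ignored
    return re.sub(r"%([\w.]+);", lambda m: table[m.group(1)], content_model)

CORE_MODEL = "(#PCDATA | author %x.byline;)*"

# Core definition alone: the extension entity is empty.
core_only = expand_parameter_entities(CORE_MODEL, [("x.byline", "")])

# Extension declared first, core default second: the extension is binding.
extended = expand_parameter_entities(
    CORE_MODEL, [("x.byline", "| interviewer"), ("x.byline", "")]
)

print(core_only)  # (#PCDATA | author )*
print(extended)   # (#PCDATA | author | interviewer)*
```

Keeping one empty entity per element means an extension only has to supply the alternatives it adds, while the core content model stays untouched.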

Integration of Annotation Tools

Toolbox support for converting annotation tool output to xComForT

[Figure: conversion with annotate.perl. Labels shown: xComForT document; element names, e.g. <elem>p</elem>; type of annotation, e.g. sentence; annotation stream, e.g. <s xlink:href=".."/>]

text nodes for annotation tool input:
  <tn ancestors="div p" parentID="div1.p1">With</tn>
  <tn ancestors="div p" parentID="div1.p1">the</tn>
  ...
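
How such text-node lines could be produced from a primary document is sketched below. This is not the toolbox's annotate.perl, whose implementation is not shown on the slides: the function `text_nodes`, the sample document and the choice to emit one `<tn>` per whitespace-separated word are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def text_nodes(elem, ancestors=()):
    """Yield one <tn> line per whitespace-separated word below `elem`."""
    path = ancestors + (elem.tag,)
    parent_id = elem.get("id", "")

    def emit(text):
        for word in (text or "").split():
            yield "<tn ancestors=%s parentID=%s>%s</tn>" % (
                quoteattr(" ".join(path)), quoteattr(parent_id), escape(word))

    yield from emit(elem.text)
    for child in elem:
        yield from text_nodes(child, path)
        # Tail text after a child belongs to the current element.
        yield from emit(child.tail)

# Hypothetical primary document fragment.
SAMPLE = '<div id="div1"><p id="div1.p1">With the</p></div>'
lines = list(text_nodes(ET.fromstring(SAMPLE)))
print("\n".join(lines))
```

Carrying the ancestor path and the parent id on every node lets a tool work on flat input while its output can still be anchored back to the structural markup.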

Linguistic Annotation Tools – implemented examples

input and output formats of
  – Tokenizer (from IMS, University of Stuttgart)
      » tokens
      » sentences
  – IMS TreeTagger
      » lemma
      » part-of-speech

Relation to current LAF standardization issues (1)

General requirements for the standard for a Linguistic Annotation Framework (LAF) (cf. Ide & Romary 2003)
xComForT conforms to these requirements, i.e. to
  – Media independence
  – Human readability
  – Processability

Relation to current LAF standardization issues (2)

Remaining requirements are xComForT’s main features, i.e.
  – Consistency
  – Uniformity
  – Incrementality
  – Expressiveness
Two proposals for integration into the LAF
  Mapping between proprietary resource formats and the LAF annotation data model
  Resource reusability

Proposal to the LAF (1-1)

LAF architecture (Ide & Romary)

Dump format

Proposal to the LAF (1-2)

Dump Format conforming to xComForT guidelines
Advantages
  – Direct mapping from/to user-defined formats
  – Support for annotation tool integration
  – Easy conversion into proprietary formats
Disadvantages
  – xComForT is possibly not the most adequate/efficient processing format
  – Different requirements of processing format vs. exchange format

Proposal to the LAF (2-1)

LAF architecture (Ide & Romary)

Intermediate Format between resource and LAF dump format

Proposal to the LAF (2-2)

Intermediate Format (Common Document Format)
Disadvantages
  – One more mapping step
Advantages
  – Standards-based adaptation to proprietary formats
  – Mapping to dump format tightly defined and targeted
  – Common mapping tool, e.g. provided by the LAF

Example: Potential LAF dump format

“Jones followed him into the front room, closing the door behind him” (Ide & Romary 2001)

<struct id="s0" type="S">
  <struct id="s1" type="NP" xlink:href="xptr(substring(p/s[1]/text(),1,5))" rel="SBJ"/>
  <struct id="s2" type="VP" xlink:href="xptr(substring(p/s[1]/text(),7,8))"/>
  <struct id="s3" type="NP" xlink:href="xptr(substring(p/s[1]/text(),16,3))"/>
  <struct id="s4" type="PP" xlink:href="xptr(substring(p/s[1]/text(),20,4))" rel="DIR">
    <struct id="s5" type="NP" xlink:href="xptr(substring(p/s[1]/text(),25,14))"/>
  </struct>
  <struct id="s6" type="S" rel="ADV"> <!-- ... --></struct>
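
The substring pointers in this example can be checked against the sentence with a short Python sketch. It assumes that `p/s[1]/text()` resolves to exactly the quoted sentence and that `substring` takes a 1-based start position and a length, as in XPath; the helper name `xpointer_substring` is made up for illustration.

```python
SENTENCE = ("Jones followed him into the front room, "
            "closing the door behind him")

def xpointer_substring(text, start, length):
    """Return `length` characters of `text` from 1-based position `start`."""
    return text[start - 1:start - 1 + length]

# (id, type, start, length) copied from the <struct> elements above.
STRUCTS = [
    ("s1", "NP", 1, 5),
    ("s2", "VP", 7, 8),
    ("s3", "NP", 16, 3),
    ("s4", "PP", 20, 4),
    ("s5", "NP", 25, 14),
]

spans = {sid: xpointer_substring(SENTENCE, start, length)
         for sid, _type, start, length in STRUCTS}
for sid, covered in spans.items():
    print(sid, repr(covered))
# s1 'Jones'
# s2 'followed'
# s3 'him'
# s4 'into'
# s5 'the front room'
```

Character offsets like these depend on the exact text of the primary document, so any change to that text shifts every pointer that follows it.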

Example: Possible xComForT Representation (1)

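[Figure: xComForT storage format for the example, layered as level 1, level 2 (token level), level 3 (1st linguistic level) and level 4 (2nd linguistic level)]
Files shown: PTBraw.xml (level 1); token.xml (token level); sentence.xml; chunk.xml (segments); chunk_relation.xml (segInfo)
Link types shown between files: substring; range-to; 1:1 (#id)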

Example: Possible xComForT Representation (2)

chunk.xml

  <segments level="ling1" type="chunk" xml:base="token.xml">
    <chunk id="div1.p1.chunk1" type="NP" xlink:href="#div1.p1.tok1"/>
    <chunk id="div1.p1.chunk2" type="VP" xlink:href="#div1.p1.tok2"/>
    <chunk id="div1.p1.chunk3" type="NP" xlink:href="#div1.p1.tok3"/>
    <chunk id="div1.p1.chunk4" type="PP" xlink:href="#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))"/>
    <chunk id="div1.p1.chunk5" type="NP" xlink:href="#xpointer(id('div1.p1.tok5')/range-to(id('div1.p1.tok7')))"/>
  </segments>

chunk_relation.xml

  <segInfo level="ling2" type="rel" xml:base="chunk.xml">
    <rel id="div1.p1.chunk1.rel" xlink:href="#div1.p1.chunk1">SBJ</rel>
    <rel id="div1.p1.chunk4.rel" xlink:href="#div1.p1.chunk4">DIR</rel>
  </segInfo>
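
The following Python sketch shows how the two pointer forms used in chunk.xml, a bare `#id` and an `id(...)/range-to(id(...))` expression, could be resolved against a token stream. The content of token.xml is not shown on the slides, so the token list and ids below are assumptions derived from the example sentence, and the regular expression covers only these two pointer shapes rather than XPointer in general.

```python
import re

TOKENS = ["Jones", "followed", "him", "into", "the", "front", "room"]
# Hypothetical token ids in the style of the example: div1.p1.tok1 ...
TOKEN_IDS = ["div1.p1.tok%d" % (i + 1) for i in range(len(TOKENS))]

RANGE = re.compile(
    r"^#xpointer\(id\('([^']+)'\)/\s*range-to\(id\('([^']+)'\)\)\)$")

def resolve(href):
    """Return the tokens addressed by a '#id' or range-to pointer."""
    match = RANGE.match(href)
    if match:
        first, last = (TOKEN_IDS.index(g) for g in match.groups())
    else:
        first = last = TOKEN_IDS.index(href.lstrip("#"))
    return TOKENS[first:last + 1]

CHUNKS = {
    "div1.p1.chunk1": "#div1.p1.tok1",
    "div1.p1.chunk4":
        "#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))",
    "div1.p1.chunk5":
        "#xpointer(id('div1.p1.tok5')/range-to(id('div1.p1.tok7')))",
}
resolved = {cid: " ".join(resolve(href)) for cid, href in CHUNKS.items()}
print(resolved)
```

Because chunks point at token ids rather than character offsets, the chunk layer stays valid as long as the token layer keeps its ids, whatever happens to the offsets in the base document.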

Summary

standards-based
  common tools available and usable
stand-off annotation
  easy plugging-in of linguistic annotation schema
easily extensible markup of primary document
  easy adaptation to arbitrary resource
Standard base format, e.g. to simplify support for mapping into the Linguistic Annotation Framework

xComForTable Mapping to the LAF

Thanks for your attention!

… Any questions?

Structural Markup improves Analysis

e.g. sentence boundary detection

Then things would get even worse. (see also pages 4 and 11)
SHADOWS
By Leena Dhingra
I couldn’t possibly do that.

tokenizer input: <p>-elements (without <rs>-elements)

correct sentence markup

<p>[..]Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p></div>
<div><head>SHADOWS</head><byline>By Leena Dhingra</byline><p>I couldn’t possibly do that.</p>
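
The benefit for sentence boundary detection can be illustrated with a small Python sketch. It is not the IMS tokenizer: the splitter `naive_sentences` is a deliberately crude stand-in, and the `<text>` wrapper and closing tags in the sample are assumptions added to make the fragment well-formed.

```python
import re
import xml.etree.ElementTree as ET

PLAIN = ("Then things would get even worse. (see also pages 4 and 11) "
         "SHADOWS By Leena Dhingra I couldn't possibly do that.")

SAMPLE = (
    "<text>"
    "<div><p>Then things would get even worse."
    '<rs type="see also"> (see also pages 4 and 11)</rs></p></div>'
    "<div><head>SHADOWS</head><byline>By Leena Dhingra</byline>"
    "<p>I couldn't possibly do that.</p></div>"
    "</text>"
)

def text_without(elem, skip=("rs",)):
    """Concatenate the text of `elem`, leaving out the elements in `skip`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(text_without(child, skip))
        parts.append(child.tail or "")  # text after the child is kept
    return "".join(parts)

def naive_sentences(text):
    """Rough splitter: break at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Without markup, reference, headline and byline run into the next sentence.
plain_sentences = naive_sentences(PLAIN)

# With markup, only <p> content (minus <rs>) reaches the splitter.
root = ET.fromstring(SAMPLE)
marked_sentences = [s for p in root.iter("p")
                    for s in naive_sentences(text_without(p))]

print(plain_sentences)
print(marked_sentences)
```

On the plain text the splitter returns one correct sentence followed by a single run that merges the reference, headline, byline and next sentence; on the marked-up input it returns the two sentences cleanly.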

Example – Discontinuous Material

CES

xComForT

  <div id="d19990607_a1" type="article">
    <opener><!-- ... --></opener>
    <discontinuous id="d19990607_a1.discontinuous" type="rubbish">
      Die Gewinnzahlen
      Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
    </discontinuous>
    <closer><!-- ... --></closer>
  </div>

Die Gewinnzahlen

Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor

  <!ELEMENT discontinuous (#PCDATA)>
  <!ATTLIST discontinuous
            id    ID                              #REQUIRED
            type  (rubbish | editorial | ..)      #IMPLIED>

Example – Meta Information

CES

xComForT

  <opener>
    <divMeta>
      <publDate>Montag, 7. Juni 1999</publDate>
      <cat target="ns1">NACHRICHTEN</cat>
      <distribution>M / F</distribution>
      <publBy>Süddeutsche Zeitung</publBy>
      <volNr>Nr. 127</volNr> / <pageNr>Seite 7</pageNr>
    </divMeta>
  </opener>

Montag, 7. Juni 1999 NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7

  <opener><date>Montag, 7. Juni 1999</date> NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7</opener>

reference to taxonomy
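
With the meta information marked up field by field, as in the `<divMeta>` version above, each value becomes directly addressable. The Python sketch below illustrates this; the function `div_meta` and the key `cat_target` are hypothetical names chosen for the example, not part of xComForT.

```python
import xml.etree.ElementTree as ET

OPENER = """
<opener>
  <divMeta>
    <publDate>Montag, 7. Juni 1999</publDate>
    <cat target="ns1">NACHRICHTEN</cat>
    <distribution>M / F</distribution>
    <publBy>Süddeutsche Zeitung</publBy>
    <volNr>Nr. 127</volNr> / <pageNr>Seite 7</pageNr>
  </divMeta>
</opener>
"""

def div_meta(opener_xml):
    """Map each child of <divMeta> to its text; keep the <cat> target too."""
    meta = ET.fromstring(opener_xml.strip()).find("divMeta")
    fields = {child.tag: (child.text or "").strip() for child in meta}
    # The target attribute carries the reference into the taxonomy.
    fields["cat_target"] = meta.find("cat").get("target")
    return fields

fields = div_meta(OPENER)
print(fields["publDate"], "|", fields["cat_target"], "|", fields["pageNr"])
# Montag, 7. Juni 1999 | ns1 | Seite 7
```

In an opener that tags only the date, the category, distribution, publisher, volume and page would have to be recovered from running text by pattern matching instead.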