EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access...

25
EAGLES/ISLE Workshop LREC 2000 • Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar College Data Architectures and Software Support for Large Corpora Towards an American National Corpus
  • date post

    21-Dec-2015
  • Category

    Documents

  • view

    213
  • download

    0

Transcript of EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access...

Page 1: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

The XML FrameworkIts Implications for Corpus Access and Use

Nancy Ide

Department of Computer Science

Vassar College

Data Architectures and Software Support for Large CorporaTowards an American National Corpus

Page 2: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XML

• emerging standard for data representation and exchange on the World Wide Web

• powerful tool for data representation and access

• obvious standard for interchange of language resources– supports text, speech, video, audio– ...and linkage among them!

Page 3: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XML provides more than SGML

• better linkage mechanisms

• XSLT for document access and transformation

• XML schemas

• provision for accessing all or part of multiple DTDs

Page 4: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XML Links• "stand-off" annotation is the accepted norm for

annotated resources

• maintain all or most annotations in separate documents– each references appropriate locations in the

original data – yields a finely linked hypertext format where the

links specify a semantic role rather than navigational options

Page 5: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Requirements of the stand-off architecture

• address XML elements

• address characters and chains of characters within those elements

• address elements and characters both within the same document and in other XML documents

Page 6: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XML Path Language (XPath)

• concise notation for element localization in the document tree– /div/p[2]/s[3] - third sentence of second

paragraph in each <div>– /descendant::p - all <p> elements

• predicates for accessing characters within elements– substring(/p/s[2]/text(),10,12)

Page 7: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XPointer

• extends XPath syntax to allow : – addressing points and ranges as well as

nodes– locating information by string matching– use of addressing expressions in URI-

references as fragment identifiers

Page 8: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XLink

• uni- or multi-directional links

• can specify how link is to be activated– by hand or automatically by the browser

• can specify what to do with the target fragment – replace it or insert it into the source document

Page 9: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Links to External Documents

• None in SGML

• HyTime/TEI invented "doc" attribute

• CES used "doc" with inheritance to avoid repetition of the attribute– not supported by SGML processors

• XML: XLink and xml:base attribute

Page 10: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XSLT• a powerful tree-traversal language

• translate any XML document into another document in any form– html– XML– plain text– etc.

• most to offer for handling annotated resources

Page 11: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XSLT Capabilities

selection of elements or portions of element content using the XPath syntax

rearrangement, transformation of extracted information (text content, element names, etc.) in the target document

• addition of information to the target document

Page 12: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

A Simple Example<?xml version="1.0">

<chunk type="BODY" lang="en"

xml:base=

"http://www.cs.vassar.edu/~ME/Oen.xcesDoc#">

<par xlink:href="xptr(substring(//p[1]">

<s xlink:href="xptr(substring(//p/s[1]">

<tok type="WORD"

xlink:href=

"xptr(substring(//p/s[1]/text(),1,2">

<orth>It</orth>

<disamb>

<base>it</base>

<msd>Pp3ns</msd>

<ctag>PPER3</ctag></lex>

<lex>

<base>it</base>

<msd>Pp3ns</msd>

<ctag>PPER3</ctag></lex></tok>...

xcesAnadocumentxcesAnadocument

Page 13: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

<xsl:stylesheet version="1.0" xmnls:xsl= "http://www.w3.org/1999/XSL/Transform">

<xsl:template match= “/”> <html> <body> <xsl:apply-templates/> </body> </html></xsl:template>

<xsl:template match="//par"/> <xsl:for-each select=”//tok”/> <xsl:value-of select=”orth”/> <xsl:text>|</xsl:text> <xsl:value-of select=”disamb/base”/> <xsl:text>|</xsl:text> <xsl:value-of select=”disamb/ctag”/> </xsl:for-each> </xsl:template>

</xsl:stylesheet>

XSLT creates HTML

XSLTdocumentXSLTdocument

Page 14: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Result

It|it|PPER3 was|be|PAST3 a|a|DINTbright|bright|ADJEcold|cold|ADJE day|day|NN…

Page 15: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Possibilities

• create new documents containing selected annotations

• transduce XML encoded documents to tool-internal formats

• generate a new document with all phonemes that appear in a certain context (or all the unique contexts of a certain phoneme), etc.

Page 16: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

XML Schemas

• constrain and document the meaning, usage and relationships of the constituent parts of XML documents– datatypes– elements and their content– attributes and their values

• provide default values for attributes and elements

Page 17: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Impact for language resources

• provide means to define an abstract data model for a class of documents– e.g., data model for annotations and annotated

objects– one of the most important tasks for corpus and

tool creators

• provide for much tighter validation of document form and content

Page 18: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Capabilities

• different attribute declarations and/or content models can apply to elements with the same name in different contexts– allows for more tightly constrained content

models than possible with DTDs– e.g., <name> in header and <name> in text

likely have different content constraints

Page 19: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

• define equivalence classes for groups of elements and/or attributes– may be used in the same ways as defined

for a particular named element

• in CES used parameter entities to make a class of phrase-level objects (for example)– a "kludge"

Page 20: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

• constrain attribute or element values (or combinations) to be unique, e.g.,– only one entry in a computational lexicon can

be defined with a given word form – only one paragraph can have an attribute

indicating that it is the 23rd– only one disambiguated form is given for each

token – only one correspondence for a given item in an

alignment document

Useful for error detection and preventionUseful for error detection and prevention

Page 21: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

• establish dependencies based on element or attribute values, for example:– prevent nouns from being assigned a tense– specify that tokens with type attribute value

PUNCT include only <orth> elements containing specific characters

– specify annotation labels elsewhere, constrain element content to these values only

• e.g., constrain the values of the <msd> element in an XCES annotation document to the EAGLES morpho-syntactic specifications

Another means for error control and validationAnother means for error control and validation

Page 22: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Why is XML a good thing?• search, extraction, and transformation

capabilities answer most current and foreseen needs for corpus-based language engineering

• means to fully implement the stand-off data architecture

• processing tools for XML recommendations are freely distributed– no need for costly and time-consuming tool

development

Page 23: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Conclusion• XML will allow for

– representation of multi-lingual, multi-modal resources

– implementation of the stand-off scheme– compatibility with the WWW, enabling

• exploitation by LE researchers via the web

• harmonization and combination of LRE resources with other WWW data

– distributed model for data delivery

Page 24: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

P.S....

• A set of XML recommendations for encoding language resources exists:– XCES (XML version of the Corpus Encoding

Standard--CES)– http://www.cs.vassar.edu/XCES

Page 25: EAGLES/ISLE Workshop LREC 2000 Athens, Greece The XML Framework Its Implications for Corpus Access and Use Nancy Ide Department of Computer Science Vassar.

Data Architectures and Software Support for Large Corpora

Data Architectures and Software Support for Large Corpora

EAGLES/ISLE WorkshopLREC 2000 • Athens, Greece

Acknowledgements

• Laurent Romary (LORIA/CNRS)

• Patrice Bonhomme (LORIA/CNRS)