A Flexible and Extensible Architecture for Linguistic Annotation Steven Bird *, David Day †, John...


Page 1: A Flexible and Extensible Architecture for Linguistic Annotation Steven Bird *, David Day †, John Garofolo ‡, John Henderson †, Christophe Laprun ‡ and.

A Flexible and Extensible Architecture for Linguistic Annotation

Steven Bird*, David Day†, John Garofolo‡, John Henderson†, Christophe Laprun‡ and Mark Liberman*

*Linguistic Data Consortium, University of Pennsylvania

†MITRE Corporation

‡National Institute of Standards and Technology


Tradition: Create formats and tools for each research domain

• Existing bazaar of formats and tools discourages exchange and reuse

[Figure labels: SGML, RDB]


Background

• Participant “Troika” motivated by application needs

– NIST work in evaluation infrastructure

– LDC work in corpus building and annotation graph research

– MITRE work in multi-modal visualization/annotation, extraction technology, Alembic Workbench

• Began collaboration in early summer ’99

– Initially, exploring feasibility of fitting together existing resources under Bird & Liberman annotation graph formalism

• Early goals

– develop ability to construct flexible and extensible tools and data formats for existing research domains and applications

– focus task to create formats to support ACE infrastructure

• Project has evolved substantially as we continue to explore new domains and uses


Base Ontology for Linguistic Annotation of Signals

• Establishing an annotation requires specifying

– The source signal that is being annotated

– The particular region of the signal about which one wants to say something

– The content of the annotation being asserted about that region of the signal

[Figure labels: Signal, Region, Annotation]


The Annotation Graph Model

• The Annotation Graph model, a proper subset of the more general case, addresses annotation for one-dimensional signals (text, audio)

– intervals specified with start and end nodes

• nodes have (optional) offsets

– annotations specified as labeled arcs between nodes

• labels are fielded records (attributes + values)

– collection of annotations => annotation graph

• Formal definition

– labeled directed acyclic graph, with a partial time function on nodes (see Bird & Liberman 2000)
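As a rough illustration of the model on this slide, the sketch below represents nodes with optional offsets and annotations as labeled arcs whose labels are attribute/value records. All class and method names (`Node`, `Arc`, `AnnotationGraph`, `spanning`) are hypothetical and are not taken from the ATLAS API. The sketch only rejects arcs that run backwards in time; it does not enforce full acyclicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    """A graph node; the offset is optional (the time function is partial)."""
    node_id: str
    offset: Optional[float] = None


@dataclass
class Arc:
    """An annotation: a labeled arc whose label is a fielded record."""
    start: Node
    end: Node
    label: Dict[str, str]


@dataclass
class AnnotationGraph:
    """A collection of arcs (hypothetical names, not the ATLAS API)."""
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start: Node, end: Node, **label: str) -> Arc:
        # Only checked when both offsets are known; untimed nodes are allowed.
        if start.offset is not None and end.offset is not None:
            if start.offset > end.offset:
                raise ValueError("arc would run backwards in time")
        arc = Arc(start, end, dict(label))
        self.arcs.append(arc)
        return arc

    def spanning(self, t: float) -> List[Arc]:
        """Arcs whose start and end offsets are both known and cover t."""
        return [a for a in self.arcs
                if a.start.offset is not None and a.end.offset is not None
                and a.start.offset <= t <= a.end.offset]


# Two words share an untimed middle node; a speaker turn spans both.
n0, n1, n2 = Node("n0", 0.0), Node("n1"), Node("n2", 1.2)
graph = AnnotationGraph()
graph.add(n0, n1, type="word", value="hello")
graph.add(n1, n2, type="word", value="world")
graph.add(n0, n2, type="speaker", value="A")
covering = [a.label["value"] for a in graph.spanning(0.5)]
```

Because `n1` has no offset, only the speaker arc can be located in time, which is the practical consequence of the time function being partial.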


ATLAS Generalized Model

• The generalized model has been designed to accommodate non-linear signals such as images:

– annotation elements describing regions within signals with signal pointer(s) and content-bearing attributes

<Annotation>
  <Source> <Region> … </Region> </Source>
  <Content> … </Content>
</Annotation>

[Figure labels: Signal, Region, Annotation, Content]

– annotation sets containing clusters of annotation elements

• annotations may be treated as signals themselves

• standoff annotations provide alignment of annotations & signals
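The generalized model can be pictured with a few plain data classes. This is a minimal sketch under stated assumptions: the class and field names below (`Signal`, `Region`, `Annotation`, `AnnotationSet`, `values`) are illustrative inventions, not ATLAS definitions, and the bounding-box region type is just one plausible way to carve out part of an image.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Signal:
    """A source signal such as a text, audio, or image resource."""
    signal_id: str
    mime_class: str


@dataclass
class Region:
    """Part of a signal; the target may itself be an annotation."""
    signal: Union["Signal", "Annotation"]
    kind: str
    values: Dict[str, int]


@dataclass
class Annotation:
    """Standoff annotation: a pointer into a signal plus content attributes."""
    annotation_id: str
    source: Region
    content: Dict[str, str]


@dataclass
class AnnotationSet:
    """A cluster of annotation elements."""
    annotations: List[Annotation] = field(default_factory=list)


# A non-linear signal: the region is a box, not a start/end interval.
image = Signal("Image1", "IMAGE")
box = Region(image, "bounding-box", {"x": 10, "y": 20, "width": 64, "height": 32})
face = Annotation("a1", box, {"type": "face"})

# An annotation treated as a signal: a comment attached to annotation a1.
note = Annotation("a2", Region(face, "whole", {}),
                  {"type": "comment", "text": "check label"})
aset = AnnotationSet([face, note])
```

Neither the image nor annotation `a1` is modified when `a2` is added, which is the point of keeping annotations standoff.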


Extensibility

• Impossible to anticipate all the varieties of “linguistic signals” and the ways one might wish to annotate them

• ATLAS includes a mechanism for declaring new signal classes and defining new ways of carving out regions of those signals via

– the definition of an anchor type for the new signal class

– the creation of an anchor “plug-in” component

• ATLAS will support general purpose signal classes for popular linguistic resource types

– Signals: text, audio, images, video

– Symbol tables: word lists, part-of-speech tagsets, …

– Attribute value matrices: dictionaries, thesauri, knowledge representation propositions, …

– Tree databases: Treebanks, …

– Signal alignments: bilingual corpora, …
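One way to picture the anchor plug-in idea is a registry that maps each signal class to a function that validates and builds anchors for it. The slides do not describe how ATLAS implements plug-ins, so everything here (`ANCHOR_TYPES`, `register_anchor`, `make_anchor`, and the two example anchor types) is a hypothetical sketch of the general technique.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: signal class name -> anchor-building function.
ANCHOR_TYPES: Dict[str, Callable[..., Tuple]] = {}


def register_anchor(signal_class: str):
    """Decorator that declares an anchor type for a new signal class."""
    def decorator(factory):
        ANCHOR_TYPES[signal_class] = factory
        return factory
    return decorator


@register_anchor("audio")
def audio_interval(start_msec: int, end_msec: int) -> Tuple:
    # One-dimensional signals are anchored by an interval.
    if start_msec > end_msec:
        raise ValueError("start must not exceed end")
    return ("interval", start_msec, end_msec)


@register_anchor("image")
def image_box(x: int, y: int, width: int, height: int) -> Tuple:
    # Two-dimensional signals need a different way of carving out regions.
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive size")
    return ("box", x, y, width, height)


def make_anchor(signal_class: str, **values: int) -> Tuple:
    """Build an anchor using whichever plug-in is registered for the class."""
    try:
        factory = ANCHOR_TYPES[signal_class]
    except KeyError:
        raise KeyError(f"no anchor plug-in registered for {signal_class!r}") from None
    return factory(**values)


audio_anchor = make_anchor("audio", start_msec=453, end_msec=497)
image_anchor = make_anchor("image", x=10, y=20, width=64, height=32)
```

Supporting a further signal class then means registering one more function, with no change to `make_anchor` or to existing anchor types.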


ATLAS Layers

• Approach: Separate/abstract physical and logical levels from application-specific levels for maximum flexibility.

– Physical level provides a persistent representation of logical level data for long-term storage, exchange, and pipelining

• XML-based ATLAS Interchange Format (AIF)

• Relational database implementation

– Logical level provides a structural framework for the manipulation of annotation data

• annotation elements and sets

• atomic operators (creation, manipulation, destruction)

– Application level specifies semantic interpretation of annotation data and provides user interfaces

• application-specific (developer-provided)
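The separation of levels can be sketched as a logical class that offers atomic operators and delegates persistence to an interchangeable store. This is only an illustration of the layering idea: `PhysicalStore`, `InMemoryJsonStore`, and `LogicalLevel` are hypothetical names, and the JSON-backed store is a stand-in chosen for brevity, not the AIF or relational implementations named on the slide.

```python
import json
from typing import Dict, Protocol


class PhysicalStore(Protocol):
    """What the logical level needs from any physical representation."""
    def save(self, data: Dict[str, dict]) -> None: ...
    def load(self) -> Dict[str, dict]: ...


class InMemoryJsonStore:
    """Stand-in physical level; a real one might write AIF or use an RDB."""
    def __init__(self) -> None:
        self._blob = "{}"

    def save(self, data: Dict[str, dict]) -> None:
        self._blob = json.dumps(data, sort_keys=True)

    def load(self) -> Dict[str, dict]:
        return json.loads(self._blob)


class LogicalLevel:
    """Atomic operators on annotation elements, independent of storage."""
    def __init__(self, store: PhysicalStore) -> None:
        self._store = store
        self._annotations = store.load()

    def create(self, ann_id: str, **fields: str) -> None:
        if ann_id in self._annotations:
            raise KeyError(f"annotation {ann_id!r} already exists")
        self._annotations[ann_id] = dict(fields)

    def update(self, ann_id: str, **fields: str) -> None:
        self._annotations[ann_id].update(fields)

    def destroy(self, ann_id: str) -> None:
        del self._annotations[ann_id]

    def get(self, ann_id: str) -> dict:
        return dict(self._annotations[ann_id])

    def commit(self) -> None:
        self._store.save(self._annotations)


# Application level: a transcription tool would call the operators like this.
store = InMemoryJsonStore()
level = LogicalLevel(store)
level.create("a1", type="transcription", text="hello")
level.update("a1", text="hello world")
level.commit()
reopened = LogicalLevel(store)
```

Swapping `InMemoryJsonStore` for another class with the same two methods would leave the application code unchanged.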


Layered Solution

[Figure: layered architecture diagram; labels from top to bottom]

• Applications: Evaluation Software, Conversion Tools, Query Systems, Visualization and Exploration, Extraction Systems, Annotation Tools, Automatic Aligners

• ATLAS API

• ATLAS CORE

• ATLAS Logical Level

• ATLAS Physical Level: RDB, AIF Files


ATLAS Architecture

[Figure: component groups arranged around the ATLAS Internal Representation]

• Annotation: AC1, AC2, ACn

• Visualization: VC1, VC2, VCn

• Format Exchange: EC1, EC2, ECn

• Search/Access: SC1, SC2, SCn

• Persistent Storage: RDBMS, flat files (AIF)

• XML Processing: DTD validation, XML parser, XSLT

• Data Access: file sharing, network protocols, multi-user/collaboration, privacy


ATLAS Interchange Format: An Example

<AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
  <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
  <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>

  <Annotation id="a1" type="transcription">
    <Source>
      <Region Signal="Audio1" type="interval">
        <Value type="integer" role="start" unit="msec">453</Value>
        <Value type="integer" role="end" unit="msec">497</Value>
      </Region>
    </Source>
    <Content>
      <Region Signal="Text1" type="interval">
        <Value type="integer" role="start" unit="char">25</Value>
        <Value type="integer" role="end" unit="char">29</Value>
      </Region>
    </Content>
  </Annotation>

  <Annotation id="a2" type="transcription"> … </Annotation>
  …
</AnnotationSet>

[Figure callouts: Annot set, Signal types, Annot element, Source Signal, Standoff Content]
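To show how a tool might consume the AIF example on this slide, the sketch below parses it with Python's standard `xml.etree.ElementTree` and pairs each audio interval with the text span it aligns to. Two assumptions are made: the two `Signal` elements are written as self-closing so the document is well-formed, and the elided second annotation is omitted. The helper `read_region` is a hypothetical name, not part of any ATLAS tool.

```python
import xml.etree.ElementTree as ET

# The slide's example, with <Signal> self-closed (an assumption) and the
# elided second annotation left out.
AIF = """\
<AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
  <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
  <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
  <Annotation id="a1" type="transcription">
    <Source>
      <Region Signal="Audio1" type="interval">
        <Value type="integer" role="start" unit="msec">453</Value>
        <Value type="integer" role="end" unit="msec">497</Value>
      </Region>
    </Source>
    <Content>
      <Region Signal="Text1" type="interval">
        <Value type="integer" role="start" unit="char">25</Value>
        <Value type="integer" role="end" unit="char">29</Value>
      </Region>
    </Content>
  </Annotation>
</AnnotationSet>
"""


def read_region(region):
    """Return (signal id, unit, start, end) for one <Region> element."""
    values = {v.get("role"): v for v in region.findall("Value")}
    start, end = values["start"], values["end"]
    return (region.get("Signal"), start.get("unit"),
            int(start.text), int(end.text))


root = ET.fromstring(AIF)
signals = {s.get("ID"): s.get("mime-class") for s in root.findall("Signal")}

alignments = []
for ann in root.findall("Annotation"):
    source = read_region(ann.find("Source/Region"))
    content = read_region(ann.find("Content/Region"))
    alignments.append((ann.get("id"), source, content))
```

Here the content of the transcription is itself a region of a second signal (`Text1`), so the annotation acts as an alignment between the audio and the text rather than embedding the transcribed words.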


Potential ATLAS Applications

• Corpora:

– data exchange/reuse, consistent meta data formats

– multi-layered/multi-linked annotation

– multi-lingual dictionaries, aligned multi-lingual data

– aligned multi-modal data (audio/video/image/text)

– lexicons with varying levels of structure

• Tools

– modular/reusable annotation components

– development infrastructure

– conversion tools

• Applications

– internal/external data representation

– faster prototyping and development

– evaluation

– data pipelining and plug-and-play data exchange

– document segmentation/zoning


ATLAS Projects Underway

• Evaluation Formats:

– ACE Entity Detection and Tracking (EDT) Evaluation

– DARPA/NIST ASR/Segmentation scoring

• Corpora:

– NSF linguistic exploration project on low-density languages

– NSF Talkbank

– UMD Image Recognition Evaluation Corpus

• Tools:

– LDC annotation tools

– MITRE Alembic Workbench

– Emu speech database access tools

– DGA speech Transcriber

– next generation SCLITE


Development Status

• ATLAS Prototype Suite implemented:

– ATLAS Interchange Format (AIF) XML DTD

– Annotation graph API definition

– Core API implementations (C++, Java) for annotation graphs

• Extending the architecture for new signal types

• Defining query language

• Currently soliciting research community input

– ACE, TIDES, DARPA ASR, ISLE, CES, industry ...

• Complete ATLAS 1.0 (Beta) (Sep. 2000)

– Internal representation, AIF, basic query language, sample applications (transcription/annotation tools, conversion tools)

• Open Source ATLAS (Winter, 2000-2001)

• ATLAS Website:

– http://www.nist.gov/speech/atlas