A Flexible and Extensible Architecture for Linguistic Annotation Steven Bird *, David Day †, John...
A Flexible and Extensible Architecture for Linguistic Annotation
Steven Bird*, David Day†, John Garofolo‡, John Henderson†, Christophe Laprun‡ and Mark Liberman*
*Linguistic Data Consortium, University of Pennsylvania
†MITRE Corporation
‡National Institute of Standards and Technology
Tradition: Create formats and tools for each research domain
• Existing bazaar of formats and tools discourages exchange and reuse
[Diagram labels: SGML, RDB]
Background
• Participant “Troika” motivated by application needs
– NIST work in evaluation infrastructure
– LDC work in corpus building and annotation graph research
– MITRE work in multi-modal visualization/annotation, extraction technology, Alembic Workbench
• Began collaboration in early summer ‘99
– Initially, exploring feasibility of fitting together existing resources under Bird & Liberman annotation graph formalism
• Early goals
– develop ability to construct flexible and extensible tools and data formats for existing research domains and applications
– focus task to create formats to support ACE infrastructure
• Project has evolved substantially as we continue to explore new domains and uses
Base Ontology for Linguistic Annotation of Signals
• Establishing an annotation requires specifying
– The source signal that is being annotated
– The particular region of the signal about which one wants to say something
– The content of the annotation being asserted about that region of the signal
[Diagram labels: Signal, Region, Annotation]
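The three parts of the ontology (signal, region, content) can be sketched as plain data classes. This is a minimal illustration only: the class and field names below are assumptions made for the example and are not the ATLAS API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Signal:
    """The source being annotated (field names are hypothetical)."""
    signal_id: str
    mime_class: str  # e.g. "AUDIO" or "TEXT"


@dataclass(frozen=True)
class Region:
    """A part of one signal; here simplified to an interval."""
    signal: Signal
    start: int
    end: int
    unit: str

    def __post_init__(self) -> None:
        if self.start > self.end:
            raise ValueError("region start must not exceed end")


@dataclass
class Annotation:
    """Content asserted about one region of one signal."""
    region: Region
    content: dict[str, Any] = field(default_factory=dict)


audio = Signal("Audio1", "AUDIO")
word = Annotation(Region(audio, 453, 497, "msec"), {"type": "transcription"})
print(word.region.signal.signal_id, word.region.end - word.region.start)
```

Keeping the region separate from the content means the same content structure can be attached to any kind of signal, which is the point the slide makes.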
The Annotation Graph Model
• The Annotation Graph model, a proper subset of the more general case, addresses annotation for one-dimensional signals (text, audio)
– intervals specified with start and end nodes
• nodes have (optional) offsets
– annotations specified as labeled arcs between nodes
• labels are fielded records (attributes + values)
– collection of annotations => annotation graph
• Formal definition
– labeled directed acyclic graph, with a partial time function on nodes (see Bird & Liberman 2000)
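A rough sketch of the definition above: nodes with optional offsets (the partial time function), arcs labeled with fielded records, and an acyclicity check. The class name, method names, and the example labels are invented for illustration; the check uses Kahn's topological-sort algorithm, which is one standard way to test for cycles and is not claimed to be how ATLAS does it. The sketch does not verify that offsets increase along arcs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnnotationGraph:
    # node id -> optional offset; None means the node is untimed
    offsets: dict[str, Optional[float]] = field(default_factory=dict)
    # arcs stored as (start node, end node, label record)
    arcs: list[tuple[str, str, dict[str, str]]] = field(default_factory=list)

    def add_node(self, node: str, offset: Optional[float] = None) -> None:
        self.offsets[node] = offset

    def add_arc(self, start: str, end: str, **label: str) -> None:
        if start not in self.offsets or end not in self.offsets:
            raise KeyError("both nodes must be added before the arc")
        self.arcs.append((start, end, dict(label)))

    def is_acyclic(self) -> bool:
        """Kahn's algorithm: acyclic iff every node can be removed in order."""
        indegree = {node: 0 for node in self.offsets}
        for _, end, _ in self.arcs:
            indegree[end] += 1
        ready = [node for node, degree in indegree.items() if degree == 0]
        removed = 0
        while ready:
            node = ready.pop()
            removed += 1
            for start, end, _ in self.arcs:
                if start == node:
                    indegree[end] -= 1
                    if indegree[end] == 0:
                        ready.append(end)
        return removed == len(self.offsets)


g = AnnotationGraph()
g.add_node("n0", 0.0)
g.add_node("n1")          # no offset: the time function is partial
g.add_node("n2", 0.9)
g.add_arc("n0", "n1", type="word", value="w1")
g.add_arc("n1", "n2", type="word", value="w2")
g.add_arc("n0", "n2", type="phrase", value="p1")
print(g.is_acyclic())
```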
ATLAS Generalized Model
• The generalized model has been designed to accommodate non-linear signals such as images:
– annotation elements describing regions within signals with signal pointer(s) and content-bearing attributes
  <Annotation>
    <Source> <Region> … </Region> </Source>
    <Content> … </Content>
  </Annotation>
[Diagram labels: Signal, Region, Annotation, Content]
– annotation sets containing clusters of annotation elements
• annotations may be treated as signals themselves
• standoff annotations provide alignment of annotations & signals
Extensibility
• Impossible to anticipate all the varieties of “linguistic signals” and the ways one might wish to annotate them
• ATLAS includes a mechanism for declaring new signal classes and defining new ways of carving out regions of those signals via
– the definition of an anchor type for the new signal class
– the creation of an anchor “plug-in” component
• ATLAS will support general purpose signal classes for popular linguistic resource types
– Signals: text, audio, images, video
– Symbol tables: word lists, part-of-speech tagsets, …
– Attribute value matrices: dictionaries, thesauri, knowledge representation propositions, …
– Tree databases: Treebanks, …
– Signal alignments: bilingual corpora, …
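One way to picture the anchor plug-in idea is a registry that maps each signal class to its own anchor type. The sketch below is purely hypothetical: the registry, the decorator, and the anchor dictionaries are assumptions for illustration and do not describe the actual ATLAS plug-in mechanism.

```python
from typing import Callable, Dict

# Hypothetical registry: signal class name -> function that validates an anchor
AnchorValidator = Callable[[dict], bool]
ANCHOR_TYPES: Dict[str, AnchorValidator] = {}


def register_anchor(signal_class: str) -> Callable[[AnchorValidator], AnchorValidator]:
    """Decorator that plugs in an anchor type for one signal class."""
    def decorator(validator: AnchorValidator) -> AnchorValidator:
        ANCHOR_TYPES[signal_class] = validator
        return validator
    return decorator


@register_anchor("audio")
def time_anchor(anchor: dict) -> bool:
    # A point on a one-dimensional signal: a non-negative offset in msec.
    return anchor.get("unit") == "msec" and anchor.get("offset", -1) >= 0


@register_anchor("image")
def pixel_anchor(anchor: dict) -> bool:
    # A point on a two-dimensional signal: non-negative x and y coordinates.
    return anchor.get("x", -1) >= 0 and anchor.get("y", -1) >= 0


def is_valid_anchor(signal_class: str, anchor: dict) -> bool:
    """Dispatch to whichever plug-in was registered for the signal class."""
    if signal_class not in ANCHOR_TYPES:
        raise KeyError(f"no anchor type declared for {signal_class!r}")
    return ANCHOR_TYPES[signal_class](anchor)


print(is_valid_anchor("audio", {"unit": "msec", "offset": 453}))
print(is_valid_anchor("image", {"x": 10, "y": -2}))
```

Supporting a new signal class then means registering one more validator, with no change to the dispatch code.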
ATLAS Layers
• Approach: Separate/abstract physical and logical levels from application-specific levels for maximum flexibility.
– Physical level provides a persistent representation of logical level data for long-term storage, exchange, and pipelining
• XML-based ATLAS Interchange Format (AIF)
• Relational database implementation
– Logical level provides a structural framework for the manipulation of annotation data
• annotation elements and sets
• atomic operators (creation, manipulation, destruction)
– Application level specifies semantic interpretation of annotation data and provides user interfaces
• application-specific (developer-provided)
Layered Solution
[Diagram labels: Applications (Evaluation Software, Conversion Tools, Query Systems, Visualization and Exploration, Extraction Systems, Annotation Tools, Automatic Aligners); ATLAS API; ATLAS Core; ATLAS Logical Level; ATLAS Physical Level (RDB, AIF Files)]
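The split between logical and physical levels can be sketched as an in-memory annotation set with atomic operators, plus a separate, swappable storage component. All names here are assumptions for illustration, and the XML written by `XmlStore` is a simplified stand-in, not real AIF.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Protocol


class AnnotationSet:
    """Logical level: atomic create / update / delete on annotation elements."""

    def __init__(self) -> None:
        self.annotations: Dict[str, Dict[str, str]] = {}

    def create(self, ann_id: str, **attrs: str) -> None:
        if ann_id in self.annotations:
            raise ValueError(f"annotation {ann_id!r} already exists")
        self.annotations[ann_id] = dict(attrs)

    def update(self, ann_id: str, **attrs: str) -> None:
        self.annotations[ann_id].update(attrs)

    def delete(self, ann_id: str) -> None:
        del self.annotations[ann_id]


class Store(Protocol):
    """Physical level: any persistent representation of the logical data."""

    def dump(self, annotations: AnnotationSet) -> str: ...


class XmlStore:
    """One possible physical form: a simplified XML serialization."""

    def dump(self, annotations: AnnotationSet) -> str:
        root = ET.Element("AnnotationSet")
        for ann_id, attrs in annotations.annotations.items():
            ET.SubElement(root, "Annotation", {"id": ann_id, **attrs})
        return ET.tostring(root, encoding="unicode")


# Application level: works only through the logical operators.
aset = AnnotationSet()
aset.create("a1", type="transcription")
aset.create("a2", type="transcription")
aset.delete("a2")
xml_text = XmlStore().dump(aset)
print(xml_text)
```

Because applications touch only `AnnotationSet`, a relational store could replace `XmlStore` without changing application code.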
ATLAS Architecture
[Diagram: the ATLAS Internal Representation, with four groups of components: Annotation (AC1, AC2, ACn), Visualization (VC1, VC2, VCn), Format Exchange (EC1, EC2, ECn), Search/Access (SC1, SC2, SCn)]
• Persistent Storage
– RDBMS
– flat files (AIF)
• XML Processing
– DTD validation
– XML parser
– XSLT
• Data Access
– file sharing
– network protocols
– multi-user/collaboration
– privacy
ATLAS Interchange Format: An Example
  <AnnotationSet id="http://ace.program/ocr/9801.10/9801.10.omni.xml">
    <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
    <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
    <Annotation id="a1" type="transcription">
      <Source>
        <Region Signal="Audio1" type="interval">
          <Value type="integer" role="start" unit="msec">453</Value>
          <Value type="integer" role="end" unit="msec">497</Value>
        </Region>
      </Source>
      <Content>
        <Region Signal="Text1" type="interval">
          <Value type="integer" role="start" unit="char">25</Value>
          <Value type="integer" role="end" unit="char">29</Value>
        </Region>
      </Content>
    </Annotation>
    <Annotation id="a2" type="transcription"> … </Annotation>
    …
  </AnnotationSet>
[Diagram callouts: annotation set, signal types, annotation element, source signal, standoff content]
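To show how a consumer might read such a file, the sketch below parses an AIF-like fragment with Python's standard `xml.etree.ElementTree` and pairs each annotation's source interval with its standoff content interval. The fragment reuses the element and attribute names from the example; its `id` value, the `read_interval` helper, and the closed `Signal` elements are assumptions, since the AIF DTD is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified fragment modeled on the slide's example.
AIF_FRAGMENT = """
<AnnotationSet id="example">
  <Signal mime-class="AUDIO" mime-type="wav" encoding="wav" ID="Audio1"/>
  <Signal mime-class="TEXT" mime-type="PLAIN" encoding="UTF8" ID="Text1"/>
  <Annotation id="a1" type="transcription">
    <Source>
      <Region Signal="Audio1" type="interval">
        <Value type="integer" role="start" unit="msec">453</Value>
        <Value type="integer" role="end" unit="msec">497</Value>
      </Region>
    </Source>
    <Content>
      <Region Signal="Text1" type="interval">
        <Value type="integer" role="start" unit="char">25</Value>
        <Value type="integer" role="end" unit="char">29</Value>
      </Region>
    </Content>
  </Annotation>
</AnnotationSet>
"""


def read_interval(region: ET.Element) -> tuple:
    """Return (signal id, start, end, unit) for an interval Region element."""
    values = {value.get("role"): value for value in region.findall("Value")}
    start, end = values["start"], values["end"]
    return (region.get("Signal"), int(start.text), int(end.text), start.get("unit"))


root = ET.fromstring(AIF_FRAGMENT)
# annotation id -> (source interval, standoff content interval)
alignments = {
    ann.get("id"): (
        read_interval(ann.find("Source/Region")),
        read_interval(ann.find("Content/Region")),
    )
    for ann in root.findall("Annotation")
}
print(alignments["a1"])
# → (('Audio1', 453, 497, 'msec'), ('Text1', 25, 29, 'char'))
```

Here the transcription is itself a region of a text signal, so the annotation aligns 453 to 497 msec of audio with characters 25 to 29 of the text.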
Potential ATLAS Applications
• Corpora:
– data exchange/reuse, consistent metadata formats
– multi-layered/multi-linked annotation
– multi-lingual dictionaries, aligned multi-lingual data
– aligned multi-modal data (audio/video/image/text)
– lexicons with varying levels of structure
• Tools
– modular/reusable annotation components
– development infrastructure
– conversion tools
• Applications
– internal/external data representation
– faster prototyping and development
– evaluation
– data pipelining and plug-and-play data exchange
– document segmentation/zoning
ATLAS Projects Underway
• Evaluation Formats:
– ACE Entity Detection and Tracking (EDT) Evaluation
– DARPA/NIST ASR/Segmentation scoring
• Corpora:
– NSF linguistic exploration project on low-density languages
– NSF Talkbank
– UMD Image Recognition Evaluation Corpus
• Tools:
– LDC annotation tools
– MITRE Alembic Workbench
– Emu speech database access tools
– DGA speech Transcriber
– next generation SCLITE
Development Status
• ATLAS Prototype Suite implemented:
– ATLAS Interchange Format (AIF) XML DTD
– Annotation graph API definition
– Core API implementations (C++, Java) for annotation graphs
• Extending the architecture for new signal types
• Defining query language
• Currently soliciting research community input
– ACE, TIDES, DARPA ASR, ISLE, CES, industry ...
• Complete ATLAS 1.0 (Beta) (Sep. 2000)
– Internal representation, AIF, basic query language, sample applications (transcription/annotation tools, conversion tools)
• Open Source ATLAS (Winter, 2000-2001)
• ATLAS Website:
– http://www.nist.gov/speech/atlas