osis linguistic annotation
definitions and requirements
kirk e. lowery
westminster hebrew institute
sbl computer-assisted research group
the context
why osis linguistic annotation?
the goal of osis
to exchange electronic bibles: any language, medium, presentation style
to add “meta-information” to those texts (keywords: “link”, “hierarchy”, “pyramid”)
to easily transform these texts: the target transformation is unknown
to cut costs (time, money, people): production, presentation, distribution of bibles plus “meta-data”
why exchange bible texts?
coordination within organizations
cooperation between organizations and between individuals
publish in multiple formats and media from one “canonical” source
long-term archival
the changing definition of “publish”
documents have a life cycle!
who wants to exchange texts?
bible publishers: commercial publishing houses, denominations & bible societies
bible translators: translation teams & editors, consultants & supervisors
bible scholars: original languages, text criticism, text analysis and commentary
text “meta-data”
what information needs to be captured?
translators: managing the translation process
document versions & responsibility
comments & corrections by editors
handling presentation issues: script direction, “rubies”
linking source, relay & target translations
linking supplementary information: notes, glossaries, maps
translators & scholars: focus on the text
manuscript collation & description
text criticism: establishment of the original
linguistic analysis
text segmentation
segment id: from phoneme to text structures
linguistic mapping of source & target
alignment: parallel & synoptic texts
linguistic annotation
how can we capture the information?
required
a way to segment the text
a mechanism for associating labels with an arbitrary text-span
a means to declare labels used in analysis
a common linguistic vocabulary
language-specific grammar terms
a protocol for user redefinition
segmenting text
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
<seg id="gn1:1,3.1">):ELOHIYM </seg>
<seg id="gn1:1,4.1">)"T</seg>
<seg id="gn1:1,5.1">HA</seg>
<seg id="gn1:1,5.2">$.FMAYIM</seg>
<seg id="gn1:1,6.1">W:</seg>
<seg id="gn1:1,6.2">)"T</seg>
<seg id="gn1:1,7.1">HF</seg>
<seg id="gn1:1,7.2">)FREC</seg>
start tag
unique identification
hebrew text
end tag
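The ids in the sample look as if they encode book, chapter, verse, word, and word-part (for example `gn1:1,5.2` for the second part of the fifth word of Genesis 1:1). That layout is inferred from the example and is not stated in the slides. Under that assumption, a minimal Python sketch for splitting an id into its parts could look like this; `SegId` and `parse_seg_id` are hypothetical names.

```python
import re
from typing import NamedTuple


class SegId(NamedTuple):
    """Parts of a segment id, under the layout assumed above."""
    book: str
    chapter: int
    verse: int
    word: int
    part: int


# Assumed layout: <book><chapter>:<verse>,<word>.<part>
SEG_ID = re.compile(r"^([a-z]+)(\d+):(\d+),(\d+)\.(\d+)$")


def parse_seg_id(seg_id: str) -> SegId:
    """Split an id such as 'gn1:1,5.2' into named fields."""
    match = SEG_ID.match(seg_id)
    if match is None:
        raise ValueError(f"unrecognized segment id: {seg_id!r}")
    book, *numbers = match.groups()
    return SegId(book, *(int(n) for n in numbers))
```

Because the fields are ordered tuples, sorting a list of parsed ids returns the segments in reading order within a book.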
adding annotation (1)
<seg id="gn1:1,1.1">B.:
  <lemma>B.</lemma><particle type="preposition" />
</seg>
<seg id="gn1:1,1.2">R")$IYT
  <lemma>R")$IYT</lemma><noun type="common" features="fsa" />
</seg>
<seg id="gn1:1,2.1">B.FRF)
  <lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" />
</seg>
<seg id="gn1:1,3.1">):ELOHIYM
  <lemma>):ELOHIYM</lemma><noun type="common" features="mpa" />
</seg>
<seg id="gn1:1,4.1">)"T
  <lemma homonym="1">)"T</lemma><particle type="object_marker" />
</seg>
content tag
“milestone” tag
adding annotation (2)
<seg id="gn1:1,5.1">HA
  <lemma>H</lemma><particle type="article" />
</seg>
<seg id="gn1:1,5.2">$.FMAYIM
  <lemma>$FMAYIM</lemma><noun type="common" features="mpa" />
</seg>
<seg id="gn1:1,6.1">W:
  <lemma>W</lemma><particle type="conjunction" />
</seg>
<seg id="gn1:1,6.2">)"T
  <lemma homonym="1">)"T</lemma><particle type="object_marker" />
</seg>
<seg id="gn1:1,7.1">HF
  <lemma>H</lemma><particle type="article" />
</seg>
<seg id="gn1:1,7.2">)FREC
  <lemma>)EREC</lemma><noun type="common" features="fsa" />
</seg>
content tag
“milestone” tag
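To show how annotated segments of this shape might be consumed, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The `<verse>` wrapper and the function name `read_segments` are assumptions added so the sample parses as one document. The sketch also assumes that the first child other than `<lemma>` carries the part-of-speech label.

```python
import xml.etree.ElementTree as ET

# Two segments copied from the slides, wrapped in a hypothetical <verse> root.
SAMPLE = '''<verse>
<seg id="gn1:1,1.1">B.:
  <lemma>B.</lemma><particle type="preposition" />
</seg>
<seg id="gn1:1,2.1">B.FRF)
  <lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" />
</seg>
</verse>'''


def read_segments(xml_text):
    """Return one record per <seg>: id, surface text, lemma, label, features."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("seg"):
        lemma = seg.find("lemma")
        # Assumption: the first non-<lemma> child is the grammatical label.
        label = next((child for child in seg if child.tag != "lemma"), None)
        rows.append({
            "id": seg.get("id"),
            "surface": (seg.text or "").strip(),  # text before the first child
            "lemma": lemma.text if lemma is not None else None,
            "pos": label.tag if label is not None else None,
            "features": dict(label.attrib) if label is not None else {},
        })
    return rows


for row in read_segments(SAMPLE):
    print(row["id"], row["surface"], row["lemma"], row["pos"], row["features"])
```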
the hard part: linguistic labels
must be standard
must be applicable to any conceivable language
labels are the “linguistic inventory”
must be compatible with current and future linguistic theories
labels must be linguistic theory-neutral
must be redefinable by the user
standard solutions: labels
expert advisory group on language engineering standards (eagles) <http://www.ilc.pi.cnr.it/EAGLES/home.html>
an initiative of the european commission (1993)
standard grammar labels of morphology and syntax for european languages
create osis standard labels for hebrew, aramaic and greek
standard solutions: mechanism
the text encoding initiative (tei) guidelines
chapter 14: linking, segmentation, & alignment
chapter 16: feature structures
chapter 26: feature system declaration
“stand-off” markup (xlink) or “up-close-and-personal” (inline)?
separate meta-data about the text from the text itself?
“either-or” or “both-and”?
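The inline versus stand-off question can be made concrete with a small Python sketch. It takes inline-annotated segments of the kind shown in the earlier slides and separates them into plain text segments plus a list of annotation records that point back to each segment by id. The record layout (`target`, `tag`, `attrib`, `text`), the `<verse>` wrapper, and the name `split_standoff` are assumptions for illustration. This is not XLink and not an OSIS or TEI mechanism.

```python
import xml.etree.ElementTree as ET

# One inline-annotated segment from the slides, in a hypothetical <verse> root.
INLINE = (
    '<verse><seg id="gn1:1,5.1">HA'
    '<lemma>H</lemma><particle type="article" /></seg></verse>'
)


def split_standoff(inline_xml):
    """Return (text-only XML, annotation records keyed by segment id)."""
    root = ET.fromstring(inline_xml)
    annotations = []
    # Copy the segment list first because the tree is modified in the loop.
    for seg in list(root.iter("seg")):
        for child in list(seg):
            annotations.append({
                "target": seg.get("id"),   # pointer back into the text
                "tag": child.tag,
                "attrib": dict(child.attrib),
                "text": child.text,
            })
            seg.remove(child)
    return ET.tostring(root, encoding="unicode"), annotations


text_only, records = split_standoff(INLINE)
print(text_only)
for record in records:
    print(record)
```

Keeping the records separate leaves the base text untouched, so several competing analyses can point at the same segment ids. The cost is that every consumer has to resolve the pointers.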
formal requirements
what we must do, exactly
labels
claims made about the data itself vs claims about the claims that can be made!
the linguistic model vs the analysis allowed by the model
example: does Hebrew have “adverbs”?
a library of labels as comprehensive as possible
definitions to clarify what “thing” is being labeled
labels are names for grammatical objects
labels as objects
grammatical “objects” have “attributes” or “features”
features can vary over a range of “values”
objects & features have defaults that could be changed
objects & features could be easily extended
objects & features can be arranged linearly or hierarchically
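The points above (objects with features, value ranges, changeable defaults, extension, hierarchy) can be sketched as a small data model. The class names, the default values, and the expansions "singular/plural/dual" and "absolute/construct" are assumptions for illustration. The slides give only the abbreviations `ns np nd` and `sa sc`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Feature:
    """A named feature with its allowed values and an optional default."""
    name: str
    values: Tuple[str, ...]
    default: Optional[str] = None


@dataclass
class GrammaticalObject:
    """A label whose features may be inherited from a parent label."""
    label: str
    features: Dict[str, Feature] = field(default_factory=dict)
    parent: Optional["GrammaticalObject"] = None

    def all_features(self) -> Dict[str, Feature]:
        """Merge inherited features; a child's definition overrides its parent's."""
        merged = self.parent.all_features() if self.parent else {}
        merged.update(self.features)
        return merged

    def assign(self, **chosen: str) -> Dict[str, str]:
        """Build a feature assignment, filling defaults and checking values."""
        known = self.all_features()
        unknown = set(chosen) - set(known)
        if unknown:
            raise KeyError(f"{self.label} has no feature(s): {sorted(unknown)}")
        result = {}
        for name, feature in known.items():
            value = chosen.get(name, feature.default)
            if value is None:
                continue  # no choice and no default: leave unspecified
            if value not in feature.values:
                raise ValueError(f"{value!r} is not a value of {name}")
            result[name] = value
        return result


noun = GrammaticalObject("noun", {
    "gender": Feature("gender", ("masculine", "feminine", "neuter")),
    "number": Feature("number", ("singular", "plural", "dual"), "singular"),
    "state": Feature("state", ("absolute", "construct"), "absolute"),
})
common_noun = GrammaticalObject("common noun", parent=noun)
print(common_noun.assign(gender="feminine"))
```

A user redefinition then amounts to adding or overriding an entry in a child's `features` without touching the parent.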
mechanism
user language declaration
all labels and their relationships
done by “exclusion”, not inclusion
sensitive to linguistic theory
levels of language: resolution of ambiguity
lexical, semantic, phonemic, morphologic, phrase-, clause-, discourse-, theological levels
“context-free” and “context-bound” analysis
part-of-speech resolution
tei feature structures
the feature element
the most basic markup
requires a label and any number of values
<f t="feature name" value="feature value">
the feature structure element
<fs name="feature structure name">
may contain any number of nested <f> and <fs>
models some grammatical object
tei feature example
<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>
tei feature structure example
<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
  <f name="state" org="set" fVal="sa sc" />
</fs>
tei feature library example
<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>
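The `fVal` attribute in the feature structure example appears to point at `sym` ids defined in a feature value library like the one above. A minimal Python sketch of resolving those pointers follows. The function names are hypothetical, and the lookup is a plain dictionary rather than real XML ID/IDREF resolution.

```python
import xml.etree.ElementTree as ET

# Library and feature copied from the slides.
LIBRARY = '''<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>'''

FEATURE = '<f name="gender" org="set" fVal="gm gf gn" />'


def load_symbols(library_xml):
    """Map each <sym> id in a feature value library to its value."""
    root = ET.fromstring(library_xml)
    return {sym.get("id"): sym.get("value") for sym in root.iter("sym")}


def resolve_feature(feature_xml, symbols):
    """Return the feature name and the values its fVal ids point to."""
    feature = ET.fromstring(feature_xml)
    ids = feature.get("fVal", "").split()
    return feature.get("name"), [symbols[i] for i in ids]


print(resolve_feature(FEATURE, load_symbols(LIBRARY)))
```

An id that is missing from the library raises `KeyError`, which is one simple way to catch a label used in analysis but never declared.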
a different approach
<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>
Dictionary of Packard-Style Greek Morphology Codes
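In this dictionary style the grammatical information sits in prose paragraphs rather than in typed features, so a program has to recover it by splitting each "Name: value" line. A minimal Python sketch follows, assuming every `<p>` has that shape; `read_entry` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Dictionary entry copied from the slide.
ENTRY = '''<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>'''


def read_entry(div_xml):
    """Return the entry's osisID and its 'Name: value' paragraphs as a dict."""
    div = ET.fromstring(div_xml)
    fields = {}
    for paragraph in div.iter("p"):
        name, separator, value = (paragraph.text or "").partition(":")
        if separator:  # skip paragraphs that are not 'Name: value'
            fields[name.strip()] = value.strip()
    return div.get("osisID"), fields


print(read_entry(ENTRY))
```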
what can we do with feature structure marked up text?
self-organizing topic maps
compare linguistic hypotheses with actual usage
XSLT transforms
automated tagging of new features
comparative linguistic study
source↔target language grammar mapping
conclusions
where do we go from here?
in the short-term
complete a first pass of language modeling
mark up real biblical text with annotation
distribute to translators and scholars for feedback
does this meet your needs?
is it practical enough that you will use it?
is it flexible enough for your language(s) and linguistic theories?
in the long-term
determine if tei feature structures are sufficient
decide whether to require “inline” or “standoff” markup, or to allow either
determine the best way of integrating linguistic markup with the osis core tag set
explore ideas for authoring software or, at least, linguistic annotation utility programs