OSIS linguistic annotation

definitions and requirements

Kirk E. Lowery, Westminster Hebrew Institute

SBL Computer-Assisted Research Group

the context

why OSIS linguistic annotation?

the goal of OSIS

to exchange electronic Bibles: any language, medium, presentation style

to add “meta-information” to those texts (keywords: “link”, “hierarchy”, “pyramid”)

to easily transform these texts: the target transformation is unknown

to cut costs in the production, presentation, and distribution of Bibles plus “meta-data”: time, money, people

why exchange Bible texts?

coordination within organizations

cooperation between organizations and between individuals

publish in multiple formats and media from one “canonical” source

long-term archival

the changing definition of “publish”

documents have a life cycle!

who wants to exchange texts?

Bible publishers: commercial publishing houses; denominations & Bible societies

Bible translators: translation teams & editors; consultants & supervisors

Bible scholars: original languages, text criticism; text analysis and commentary

text “meta-data”

what information needs to be captured?

translators: managing the translation process

document versions & responsibility

comments & corrections by editors

handling presentation issues (script direction, “rubies”)

linking source, relay & target translations

linking supplementary information (notes, glossaries, maps)

translators & scholars: focus on the text

manuscript collation & description

text criticism: establishment of the original

linguistic analysis: text segmentation; segment id (from phoneme to text structures); linguistic mapping of source & target

alignment: parallel & synoptic texts

linguistic annotation

how can we capture the information?

required

a way to segment the text

a mechanism for associating labels with an arbitrary text-span

a means to declare labels used in analysis: a common linguistic vocabulary; language-specific grammar terms

a protocol for user redefinition

segmenting text

<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
<seg id="gn1:1,3.1">):ELOHIYM </seg>
<seg id="gn1:1,4.1">)"T</seg>
<seg id="gn1:1,5.1">HA</seg>
<seg id="gn1:1,5.2">$.FMAYIM</seg>
<seg id="gn1:1,6.1">W:</seg>
<seg id="gn1:1,6.2">)"T</seg>
<seg id="gn1:1,7.1">HF</seg>
<seg id="gn1:1,7.2">)FREC</seg>

callouts on each segment: start tag; unique identification (the id attribute); Hebrew text; end tag
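A minimal Python sketch of how a program might read such segments. The slide does not spell out the id scheme, so the reading of `gn1:1,1.1` as book, chapter, verse, word, and part of word is an assumption, and the `<verse>` wrapper element is hypothetical (added only so the fragment parses as a single XML document).

```python
import re
import xml.etree.ElementTree as ET

# Three segments copied from the slide, wrapped in a hypothetical <verse>
# root element so that the fragment is one well-formed document.
SAMPLE = '''<verse>
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
</verse>'''

# Assumed id layout: book, chapter ":" verse "," word "." part of word.
SEG_ID = re.compile(r"^([a-z]+)(\d+):(\d+),(\d+)\.(\d+)$")


def parse_seg_id(seg_id):
    """Split a segment id into its (assumed) components."""
    match = SEG_ID.match(seg_id)
    if match is None:
        raise ValueError(f"unrecognized segment id: {seg_id!r}")
    book, *numbers = match.groups()
    chapter, verse, word, part = (int(n) for n in numbers)
    return {"book": book, "chapter": chapter, "verse": verse,
            "word": word, "part": part}


def read_segments(xml_text):
    """Return (parsed id, transliterated text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(parse_seg_id(seg.get("id")), seg.text) for seg in root.iter("seg")]


segments = read_segments(SAMPLE)
```

Because every segment carries its own identifier, annotation can later point at a segment by id without depending on its position in the file.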

adding annotation (1)

<seg id="gn1:1,1.1">B.:
  <lemma>B.</lemma>
  <particle type="preposition" />
</seg>
<seg id="gn1:1,1.2">R")$IYT
  <lemma>R")$IYT</lemma>
  <noun type="common" features="fsa" />
</seg>
<seg id="gn1:1,2.1">B.FRF)
  <lemma homonym="1">B.R)</lemma>
  <verb stem="q" conjugation="p" pgn="3ms" />
</seg>
<seg id="gn1:1,3.1">):ELOHIYM
  <lemma>):ELOHIYM</lemma>
  <noun type="common" features="mpa" />
</seg>
<seg id="gn1:1,4.1">)"T
  <lemma homonym="1">)"T</lemma>
  <particle type="object_marker" />
</seg>

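callouts: content tag (an element that encloses text, such as <lemma>); “milestone” tag (an empty element, such as <particle type="preposition" />)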

adding annotation (2)

<seg id="gn1:1,5.1">HA
  <lemma>H</lemma>
  <particle type="article" />
</seg>
<seg id="gn1:1,5.2">$.FMAYIM
  <lemma>$FMAYIM</lemma>
  <noun type="common" features="mpa" />
</seg>
<seg id="gn1:1,6.1">W:
  <lemma>W</lemma>
  <particle type="conjunction" />
</seg>
<seg id="gn1:1,6.2">)"T
  <lemma homonym="1">)"T</lemma>
  <particle type="object_marker" />
</seg>
<seg id="gn1:1,7.1">HF
  <lemma>H</lemma>
  <particle type="article" />
</seg>
<seg id="gn1:1,7.2">)FREC
  <lemma>)EREC</lemma>
  <noun type="common" features="fsa" />
</seg>

callouts: content tag; “milestone” tag
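The inline annotation shown on these two slides can be read back mechanically. The following Python sketch is one possible reading, not part of OSIS itself: it assumes that every child of `<seg>` other than `<lemma>` is an empty label element, and it again wraps the segments in a hypothetical `<verse>` element.

```python
import xml.etree.ElementTree as ET

# Two annotated segments copied from the slides; <verse> is a hypothetical
# wrapper so that the fragment parses as one document.
SAMPLE = '''<verse>
<seg id="gn1:1,2.1">B.FRF)
  <lemma homonym="1">B.R)</lemma>
  <verb stem="q" conjugation="p" pgn="3ms" />
</seg>
<seg id="gn1:1,5.1">HA
  <lemma>H</lemma>
  <particle type="article" />
</seg>
</verse>'''


def read_annotations(xml_text):
    """Collect surface form, lemma, and label attributes for each segment."""
    records = []
    for seg in ET.fromstring(xml_text).iter("seg"):
        lemma = seg.find("lemma")
        # Assumption: every child other than <lemma> is an empty label tag.
        labels = {child.tag: dict(child.attrib)
                  for child in seg if child.tag != "lemma"}
        records.append({
            "id": seg.get("id"),
            # The surface text precedes the first child element.
            "surface": (seg.text or "").strip(),
            "lemma": lemma.text if lemma is not None else None,
            "homonym": lemma.get("homonym") if lemma is not None else None,
            "labels": labels,
        })
    return records


records = read_annotations(SAMPLE)
```

Each record keeps the text, its dictionary form, and its grammatical labels together, which is the practical appeal of inline markup; a stand-off design would keep the same labels in a separate file keyed by segment id.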

the hard part: linguistic labels

must be standard

must be applicable to any conceivable language

labels are the “linguistic inventory”

must be compatible with current and future linguistic theories

labels must be linguistic theory-neutral

must be redefinable by the user

standard solutions: labels

Expert Advisory Group on Language Engineering Standards (EAGLES) <http://www.ilc.pi.cnr.it/EAGLES/home.html>

an initiative of the European Commission (1993)

standard grammar labels of morphology and syntax for European languages

create OSIS standard labels for Hebrew, Aramaic and Greek

standard solutions: mechanism

the Text Encoding Initiative (TEI) guidelines

chapter 14: linking, segmentation, & alignment

chapter 16: feature structures

chapter 26: feature system declaration

“stand-off” markup (XLink) or “up-close-and-personal” (inline)?

separate meta-data about the text from the text itself?

“either-or” or “both-and”?

formal requirements

what we must do, exactly

labels

claims made about the data itself vs claims about the claims that can be made!

the linguistic model vs the analysis allowed by the model

example: does Hebrew have “adverbs”?

a library of labels as comprehensive as possible

definitions to clarify what “thing” is being labeled

labels are names for grammatical objects

labels as objects

grammatical “objects” have “attributes” or “features”

features can vary over a range of “values”

objects & features have defaults that could be changed

objects & features could be easily extended

objects & features can be arranged linearly or hierarchically

mechanism

user language declaration

all labels and their relationships

done by “exclusion”, not inclusion

sensitive to linguistic theory

levels of language: resolution of ambiguity

lexical, semantic, phonemic, morphologic, phrase-, clause-, discourse-, theological levels

“context-free” and “context-bound” analysis

part-of-speech resolution

TEI feature structures

the feature element

the most basic markup

requires a label and any number of values

<f t="feature name" value="feature value">

the feature structure element

<fs name="feature structure name">

may contain any number of nested <f> and <fs>

models some grammatical object

TEI feature example

<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>

TEI feature structure example

<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
  <f name="state" org="set" fVal="sa sc" />
</fs>

TEI feature library example

<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>
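The feature structure and the feature library are linked by symbol ids: the `fVal` attribute lists ids that a library defines. A small Python sketch of that lookup follows, under the assumption that `fVal` is a whitespace-separated list of `<sym>` ids. Only the gender library appears on the slides, so the number and state symbols stay unresolved here rather than being guessed.

```python
import xml.etree.ElementTree as ET

# Feature value library and feature structure copied from the slides.
FV_LIB = '''<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine"/>
    <sym id="gf" value="feminine"/>
    <sym id="gn" value="neuter"/>
  </vAlt>
</fvLib>'''

FS = '''<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn"/>
  <f name="number" org="set" fVal="ns np nd"/>
  <f name="state" org="set" fVal="sa sc"/>
</fs>'''


def load_symbols(fvlib_xml):
    """Map each <sym> id in a feature value library to its value."""
    root = ET.fromstring(fvlib_xml)
    return {sym.get("id"): sym.get("value") for sym in root.iter("sym")}


def resolve_features(fs_xml, symbols):
    """For each <f>, look up the symbol ids listed in its fVal attribute.

    Ids that no loaded library defines resolve to None.
    """
    resolved = {}
    for f in ET.fromstring(fs_xml).iter("f"):
        ids = f.get("fVal", "").split()
        resolved[f.get("name")] = [symbols.get(i) for i in ids]
    return resolved


symbols = load_symbols(FV_LIB)
features = resolve_features(FS, symbols)
# Features that still need a library before they can be interpreted.
unresolved = sorted(name for name, values in features.items() if None in values)
```

A check of this kind is one way to test whether a user's language declaration is complete: any feature left in `unresolved` refers to labels that have not yet been declared.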

a different approach

<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  Part of speech: adjective
  Case: accusative
  Number: plural
  Gender: feminine
  Degree: comparative
</div>

Dictionary of Packard-Style Greek Morphology Codes
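For contrast with named features, a positional code such as `A APFC` has to be decoded by position. The Python sketch below does this for the single entry shown above. The lookup tables contain only the values that appear in that entry, and the field order (case, number, gender, degree) is inferred from it; neither is a complete or authoritative statement of the Packard scheme.

```python
# Partial lookup tables: only the values that appear in the entry above.
PART_OF_SPEECH = {"A": "adjective"}
CASE = {"A": "accusative"}
NUMBER = {"P": "plural"}
GENDER = {"F": "feminine"}
DEGREE = {"C": "comparative"}

# Assumed field order of the parse code for an adjective.
ADJECTIVE_FIELDS = [
    ("case", CASE),
    ("number", NUMBER),
    ("gender", GENDER),
    ("degree", DEGREE),
]


def decode_adjective(code):
    """Decode a code such as 'A APFC' into named features."""
    pos_code, parse_code = code.split()
    result = {"part of speech": PART_OF_SPEECH[pos_code]}
    # Each letter is interpreted by its position in the parse code.
    for letter, (name, table) in zip(parse_code, ADJECTIVE_FIELDS):
        result[name] = table[letter]
    return result


decoded = decode_adjective("A APFC")
```

The same letter means different things in different positions ("A" is both adjective and accusative), which is the kind of implicit convention that explicit feature names avoid.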

what can we do with feature structure marked up text?

self-organizing topic maps

compare linguistic hypotheses with actual usage

XSLT transforms

automated tagging of new features

comparative linguistic study

source↔target language grammar mapping

conclusions

where do we go from here?

in the short term

complete a first pass of language modeling

mark up real biblical text with annotation

distribute to translators and scholars for feedback

does this meet your needs?

is it practical enough that you will use it?

is it flexible enough for your language(s) and linguistic theories?

in the long term

determine if TEI feature structures are sufficient

decide whether to require “inline” or “standoff” markup, or to allow either

determine the best way of integrating linguistic markup with the OSIS core tag set

explore ideas for authoring software or, at least, linguistic annotation utility programs
