LAF Fabric
Dirk Roorda DANS/TLA 2013-12-12 VU/ETCBC Amsterdam
What is LAF?
Linguistic Annotation Framework
ISO standard 24612:2012 Nancy Ide, Laurent Romary
A data model for stand-off markup, plus a recommended serialization: GrAF
LAF examples
OANC Open American National Corpus
WIVU-ETCBC Text Database Hebrew Bible
OANC
 17K  8 feb 2013   ch5-logical.xml
217K  8 feb 2013   ch5-mpqa.xml
265K  8 feb 2013   ch5-nc.xml
 16K  8 feb 2013   ch5-ne.xml
1,3M  8 feb 2013   ch5-penn.xml
1,4M  8 feb 2013   ch5-ptb.xml
990K  8 feb 2013   ch5-ptbtok.xml
 48K  8 feb 2013   ch5-s.xml
274K  8 feb 2013   ch5-seg.xml
177K  8 feb 2013   ch5-vc.xml
2,6K  3 jun 09:05  ch5.hdr
 31K  8 feb 2013   ch5.txt
 19K  8 feb 2013   resource-header.xml
text
Semantics enters with purpose.
For this to be true, it is not necessary that the carriers of purpose, say, the same bacterium heading upstream in the glucose gradient, be conscious.
I hope my definition of an autonomous agent is useful, an autocatalytic system carrying out a work cycle, now rather broadened by the realization that autonomous agents also do often detect and measure and record displacements of external systems from equilibrium that can be used to extract work, then do extract work, propagating work and constraint construction, from their environment.
2101-2110
annotations
<region xml:id="mpqa-r64" anchors="2101 2110"/>
...
<node xml:id="mpqa-n64">
  <link targets="mpqa-r64"/>
</node>
<a xml:id="mpqa-N81257" label="target" ref="mpqa-n64" as="mpqa">
  <fs>
    <f name="id" value="semantics"/>
  </fs>
</a>
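The stand-off idea on this slide can be sketched in a few lines: an annotation points to a node, the node links to a region, and the region anchors into the primary text. This is a minimal sketch, not LAF-Fabric code. It assumes that the anchors are character offsets usable as a Python slice, and it uses a toy text with offsets 0-9 instead of the real 2101-2110; the `<graph>` wrapper and the ids `r1`, `n1`, `a1` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Toy primary text (hypothetical offsets; the slide uses 2101-2110).
primary_text = "Semantics enters with purpose."

graf = """
<graph>
  <region xml:id="r1" anchors="0 9"/>
  <node xml:id="n1"><link targets="r1"/></node>
  <a xml:id="a1" label="target" ref="n1" as="mpqa">
    <fs><f name="id" value="semantics"/></fs>
  </a>
</graph>
"""

def annotated_spans(xml_text, text):
    """Resolve annotation -> node -> region -> slice of the primary text."""
    root = ET.fromstring(xml_text)
    # region id -> (start, end)
    regions = {
        r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
        for r in root.iter("region")
    }
    # node id -> list of region ids
    node_regions = {
        n.get(XML_ID): n.find("link").get("targets").split()
        for n in root.iter("node")
    }
    result = []
    for a in root.iter("a"):
        features = {f.get("name"): f.get("value") for f in a.iter("f")}
        for region_id in node_regions[a.get("ref")]:
            start, end = regions[region_id]
            result.append((a.get("label"), features, text[start:end]))
    return result

spans = annotated_spans(graf, primary_text)
```

The primary text itself is never modified; everything that is known about the span lives in the separate annotation layer.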
headers
Resource header
Primary data header
Annotation header
Metadata, namespaces, annotation labels, statistics
SHEBANQ
<node xml:id="n_88917">
  <link targets="r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8 r_9 r_10 r_11"/>
</node>
<a xml:id="a_88917" label="sentence_atom" ref="n_88917" as="lingo"/>
<a xml:id="a_f71355" label="ft" ref="n_88917" as="lingo">
  <fs>
    <f name="sentence_atom_number" value="0"/>
  </fs>
</a>
<edge xml:id="e_1" from="n_88917" to="n_84383"/>
<a xml:id="a_e1" label="parents" ref="e_1" as="link"/>
<region xml:id="r_1" anchors="0 5"/>
<node xml:id="n_2"><link targets="r_1"/></node>
<a xml:id="a_2" label="word" ref="n_2" as="monads"/>
<region xml:id="r_2" anchors="6 23"/>
<node xml:id="n_3"><link targets="r_2"/></node>
<a xml:id="a_3" label="word" ref="n_3" as="monads"/>
<region xml:id="w_1" anchors="24 24"/>
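Edges get their meaning from the annotations that refer to them: the `parents` label above sits on an annotation whose `ref` is an edge id. The following is a rough sketch of collecting labeled edges into a lookup table, assuming the fragment is wrapped in a `<graph>` root (the wrapper and the helper name `labeled_edges` are illustrative, not part of LAF-Fabric).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The edge and its annotation are copied from the slide;
# the <graph> wrapper and the bare nodes are added for illustration.
graf = """
<graph>
  <node xml:id="n_88917"/>
  <node xml:id="n_84383"/>
  <edge xml:id="e_1" from="n_88917" to="n_84383"/>
  <a xml:id="a_e1" label="parents" ref="e_1" as="link"/>
</graph>
"""

def labeled_edges(xml_text):
    """Return {label: {source node: [target nodes]}} for annotated edges."""
    root = ET.fromstring(xml_text)
    edges = {
        e.get(XML_ID): (e.get("from"), e.get("to"))
        for e in root.iter("edge")
    }
    by_label = defaultdict(lambda: defaultdict(list))
    for a in root.iter("a"):
        ref = a.get("ref")
        if ref in edges:  # the annotation targets an edge, not a node
            source, target = edges[ref]
            by_label[a.get("label")][source].append(target)
    return by_label

edges = labeled_edges(graf)
```

Here the sentence_atom node `n_88917` ends up with `n_84383` as its only `parents` target.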
[Diagram legend]
primary text: UNICODE-utf8
regions: r_<monad number>
nodes: n_<object id>
annotations (features)
annotations (empty)
labeled edges: parents
lexeme_utf8 = ראשית
old_lexeme_utf8 = ראשית
vocalized_lexeme_utf8 = ראשית
surface_consonants_utf8 = ראשית
graphical_lexeme_utf8 = ראשי
בראשית ברא א.הים את השמים ואת הארץ׃
regions (id, anchors):
r_1   0-5
r_2   6-23
w_1   24
r_3   25-38
w_2   39
r_4   40-57
w_3   58
r_5   59-66
w_4   67
r_6   68-71
r_7   72-91
w_5   92
r_8   93-96
r_9   97-104
w_6   105
r_10  106-109
r_11  110-121
p_7   122-123
nodes (connected by "parents" edges, plus one "mother" edge at the subphrases):
n_2 .. n_12: word (11 word nodes)
n_77637: subphrase
n_77638: subphrase
n_40770: phrase_atom
n_59559: phrase
    determination=determined
    is_apposition=false
    number_within_clause=4
    phrase_function=Objc
    phrase_type=PP
n_34680: clause_atom
    clause_atom_number=1
    clause_atom_relation=0
    clause_atom_relation_daughter_tense=unknown
    clause_atom_relation_kind=No_relation
    clause_atom_relation_mother_tense=unknown
    clause_atom_relation_preposition_class=none
    clause_atom_type=xQtl
    indentation=0
n_28737: clause
n_88917: sentence_atom
n_84383: sentence
    number_within_chapter=1
region ranges shown for the higher nodes: r_7 .. r_5, r_11 .. r_9, r_11 .. r_5 (twice), r_11 .. r_1 (four times)
<a xml:id="a_f22" label="ft" ref="n_3" as="utf8">
  <fs>
    <f name="lexeme_utf8" value="ראשית"/>
    <f name="old_lexeme_utf8" value="ראשית"/>
    <f name="vocalized_lexeme_utf8" value="ראשית"/>
    <f name="surface_consonants_utf8" value="ראשית"/>
    <f name="graphical_lexeme_utf8" value="ראשי"/>
  </fs>
</a>
link to regions
More about SHEBANQ
data: 2.27 GB XML (99.2%)
nodes: 1,453,175
edges: 1,524,637
regions: 800,087
features: 42,545,492
xml ids: 12,831,550
words: 426,499
Performance Problems
POIO: load time +60 min, RAM +20 GB
ExistDb: load time +30 min (initial), count features +60 min
nodes/edges/features directly modeled as objects in Python
xquery not a handy tool for relevant queries
need extensive index building
identifier chasing
What is it?
a compiler
    LAF-XML ==> Python arrays
    2,270 MB ==> 485 MB binary data
    60 min load time ==> 1 s
a task execution environment
    runs custom Python scripts, offering them a LAF API
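Why compiling helps can be illustrated with Python's standard `array` module: string ids are replaced by integer positions, and the numeric data is stored as flat binary that loads without any XML parsing. This is only a sketch of the general idea, not LAF-Fabric's actual file format; the function names and the file name `region_starts.bin` are made up.

```python
from array import array
import os
import tempfile

def compile_regions(regions):
    """Turn (xml_id, start, end) triples into an id index plus flat arrays."""
    index = {}            # xml id -> integer position
    starts = array("I")   # unsigned ints, compact in memory and on disk
    ends = array("I")
    for xml_id, start, end in regions:
        index[xml_id] = len(starts)
        starts.append(start)
        ends.append(end)
    return index, starts, ends

def save(arr, path):
    """Write the raw array contents to a binary file."""
    with open(path, "wb") as fh:
        arr.tofile(fh)

def load(path, typecode="I"):
    """Read a binary file back into an array of the given typecode."""
    arr = array(typecode)
    n_items = os.path.getsize(path) // arr.itemsize
    with open(path, "rb") as fh:
        arr.fromfile(fh, n_items)
    return arr

# Region anchors as they appear in the SHEBANQ fragment.
regions = [("r_1", 0, 5), ("r_2", 6, 23), ("w_1", 24, 24)]
index, starts, ends = compile_regions(regions)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "region_starts.bin")
    save(starts, path)
    reloaded = load(path)
```

After compilation, looking up a region is an integer index into an array rather than a chase through XML identifiers.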
Where is it?
clone it from GitHub https://github.com/dirkroorda/laf-fabric
run it locally
share your custom tasks
share your own annotations
Example: Esther
linguistic variation among the books of the Bible
count the common nouns of Esther
compare their frequencies in Esther with those in other books of the Bible
Wanted: a tab-separated file with frequencies for 216 common nouns for all books
Genesis Exodus Lamentations Esther Daniel Ezra
אב 27.8 3.31 2.94 2.27 6.95 9.1
אבדנ 0 0 0 1.51 0 0
אבל 0.534 0 1.47 1.51 1.16 0
אגרת 0 0 0 1.51 0 0
אורה 0 0 0 0.755 0 0
אח 23.8 2.34 0 0.755 0 6.82
אחד 6.28 13.6 0 3.78 7.34 6.82
אחר 11.4 3.86 0 1.51 1.93 2.27
אחשדרפנימ 0 0 0 2.27 0 0.569
אי 0.134 0 0 0.755 0.386 0
אינ 4.94 3.03 16.2 7.55 3.48 2.27
Given (1): LAF annotation files with part-of-speech features
<a xml:id="amf21" label="ft" ref="n3"><fs>
  <f name="noun_type" value="common"/>
  <f name="part_of_speech" value="noun"/>
  <f name="phrase_dependent_part_of_speech" value="noun"/>
  <f name="pronoun_type" value="none"/>
</fs></a>
<a xml:id="amf33" label="ft" ref="n4"><fs>
  <f name="noun_type" value="none"/>
  <f name="part_of_speech" value="verb"/>
  <f name="phrase_dependent_part_of_speech" value="verb"/>
  <f name="pronoun_type" value="none"/>
</fs></a>
<a xml:id="amf45" label="ft" ref="n5"><fs>
  <f name="noun_type" value="common"/>
  <f name="part_of_speech" value="noun"/>
  <f name="phrase_dependent_part_of_speech" value="noun"/>
  <f name="pronoun_type" value="none"/>
</fs></a>
Given (2): information about the books (where they start and end)
<region xml:id="s_1254379" anchors="4609273 4664382"/>
<node xml:id="n1254379"><link targets="s_1254379"/></node>
<a xml:id="as1254379" label="db" ref="n1254379"><fs>
  <f name="otype" value="book"/>
  <f name="oid" value="1254379"/>
  <f name="monads" value="368500-373120"/>
  <f name="minmonad" value="368500"/>
  <f name="maxmonad" value="373120"/>
</fs></a>
<a xml:id="asf34" label="sft" ref="n1254379"><fs>
  <f name="book" value="Esther"/>
</fs></a>
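The `minmonad`/`maxmonad` features make it possible to tell which book a word belongs to from its monad number alone. Below is a small sketch using binary search over the book start positions. Esther's range is taken from the annotation above; the second entry and the helper name `book_of` are made up for illustration.

```python
from bisect import bisect_right

# (minmonad, maxmonad, book). Esther's range comes from the slide;
# "HypotheticalBook" is invented so the lookup has more than one entry.
books = [
    (1, 100, "HypotheticalBook"),
    (368500, 373120, "Esther"),
]
books.sort()
starts = [entry[0] for entry in books]

def book_of(monad):
    """Return the book whose monad range contains `monad`, or None."""
    i = bisect_right(starts, monad) - 1  # last book starting at or before monad
    if i >= 0:
        low, high, name = books[i]
        if low <= monad <= high:
            return name
    return None
```

For example, `book_of(370000)` falls inside Esther's range, while a monad just past 373120 belongs to no listed book.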
to the workbench!
a small python script
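from collections import defaultdict

target_book = "Esther"

# Assumed initialisation (not shown on the slide): per-book counters
books = []
words = defaultdict(int)
lexemes = defaultdict(lambda: defaultdict(int))

# NN and F are supplied by the workbench (see "Receive task object")
for node in NN():
    this_type = F.shebanq_db_otype.v(node)
    if this_type == "word":
        p_o_s = F.shebanq_ft_part_of_speech.v(node)
        if p_o_s == "noun":
            noun_type = F.shebanq_ft_noun_type.v(node)
            if noun_type == "common":
                # count the common noun for the book we are currently in
                words[book_name] += 1
                lexeme = F.shebanq_ft_lexeme_utf8.v(node)
                lexemes[book_name][lexeme] += 1
    elif this_type == "book":
        # a book node starts a new book: remember its name
        book_name = F.shebanq_sft_book.v(node)
        books.append(book_name)
        ontarget = F.shebanq_sft_book.v(node) == target_book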
source code
Declare features
"features": {
    "shebanq": {
        "node": [
            "db.otype",
            "ft.part_of_speech,noun_type,lexeme_utf8",
            "sft.book",
        ],
        "edge": [
        ],
    },
},
The workbench will load the selected features and unload the others.
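The declaration and the names used in the script line up: `"ft.part_of_speech,noun_type,lexeme_utf8"` under `"shebanq"` corresponds to `F.shebanq_ft_part_of_speech`, `F.shebanq_ft_noun_type` and `F.shebanq_ft_lexeme_utf8`. The sketch below spells out that naming pattern as inferred from the slides; `expand_features` is a hypothetical helper, not part of the LAF-Fabric API.

```python
def expand_features(declaration):
    """Expand 'label.name1,name2' specs into namespace_label_name strings."""
    names = []
    for namespace, kinds in declaration.items():
        for spec in kinds.get("node", []):
            # 'ft.part_of_speech,noun_type' -> label 'ft', two feature names
            label, feature_list = spec.split(".", 1)
            for feature in feature_list.split(","):
                names.append(f"{namespace}_{label}_{feature}")
    return names

features = {
    "shebanq": {
        "node": [
            "db.otype",
            "ft.part_of_speech,noun_type,lexeme_utf8",
            "sft.book",
        ],
        "edge": [],
    },
}

names = expand_features(features)
```

Only the features named in the declaration need to be in memory, which is what keeps a task's footprint small.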
Receive task object
def task(graftask):
    (msg, NN, F, X) = graftask.get_mappings()
And use supplied methods for rapid data access.
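To try the calling convention without the workbench, a task can be run against a stand-in object. The sketch below assumes only what the slides show: `get_mappings()` returns `(msg, NN, F, X)`, `NN()` yields nodes, and each feature offers `.v(node)`. The classes `Feature`, `Features` and `StubTask` are hypothetical test doubles. The roles of `msg` and `X` are not explained on the slides, so placeholders are used for them.

```python
from collections import defaultdict

class Feature:
    """Stand-in for one feature: looks up a value per node."""
    def __init__(self, values):
        self._values = values

    def v(self, node):
        return self._values.get(node)

class Features:
    """Stand-in for F: exposes each feature as an attribute."""
    def __init__(self, table):
        for name, values in table.items():
            setattr(self, name, Feature(values))

class StubTask:
    """Hypothetical replacement for the task object handed to task()."""
    def __init__(self, nodes, table):
        self._nodes = nodes
        self._features = Features(table)

    def get_mappings(self):
        msg = print                      # placeholder
        NN = lambda: iter(self._nodes)   # node iterator
        X = None                         # placeholder
        return (msg, NN, self._features, X)

def task(graftask):
    (msg, NN, F, X) = graftask.get_mappings()
    words = defaultdict(int)
    book_name = None
    for node in NN():
        otype = F.shebanq_db_otype.v(node)
        if otype == "book":
            book_name = F.shebanq_sft_book.v(node)
        elif otype == "word" and F.shebanq_ft_noun_type.v(node) == "common":
            words[book_name] += 1
    return dict(words)

# Four toy nodes: one book followed by three words, two of them common nouns.
stub = StubTask(
    nodes=[0, 1, 2, 3],
    table={
        "shebanq_db_otype": {0: "book", 1: "word", 2: "word", 3: "word"},
        "shebanq_sft_book": {0: "Esther"},
        "shebanq_ft_noun_type": {1: "common", 2: "none", 3: "common"},
    },
)
counts = task(stub)
```

The same `task` function could then be handed to the real workbench unchanged, provided the real task object behaves as the slides suggest.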
run your task
gather results
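Turning the collected counts into the wanted tab-separated file could look like the sketch below. How the slide's frequencies are normalised is not stated; this sketch assumes occurrences per 1000 common-noun tokens of each book, formatted to three significant digits. The function name and the toy data are made up.

```python
import csv
import io

def frequency_table(lexemes, words, books, per=1000):
    """Build a tab-separated table: one row per lexeme, one column per book.

    Assumption: frequency = per * count / total common nouns in the book.
    """
    all_lexemes = sorted({lx for counts in lexemes.values() for lx in counts})
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["lexeme"] + books)
    for lx in all_lexemes:
        row = [lx]
        for book in books:
            count = lexemes[book].get(lx, 0)
            row.append(f"{per * count / words[book]:.3g}")
        writer.writerow(row)
    return out.getvalue()

# Toy data (invented): two books, two lexemes.
books = ["BookA", "BookB"]
words = {"BookA": 200, "BookB": 50}
lexemes = {"BookA": {"x": 5, "y": 1}, "BookB": {"x": 1}}

table = frequency_table(lexemes, words, books)
```

Writing `table` to a file gives something that opens directly in a spreadsheet or in R.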
Genesis Exodus Leviticus Numbers Deuteronomy Joshua Judges I_Samuel II_Samuel I_Kings II_Kings Isaiah Jeremiah Ezekiel Hosea Joel Amos Obadiah Jonah Micah Nahum
אב 27.8 3.31 4.76 11.6 13.4 9.61 17.3 12.9 7.62 18.7 15.6 3.08 7.94 3.21 1.14 2.3 2.7 0 0 3.59 0
אבדנ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
אבל 0.534 0 0 0 0.189 0 0 0.243 0.817 0 0 0.294 0.378 0.119 0 0 4.05 0 0 1.8 0
אגרת 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
אורה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
אח 23.8 2.34 4.38 2.58 9.08 3.57 8.81 2.92 9.25 2.16 1.13 1.03 2.39 1.07 3.42 2.3 2.7 17.4 0 3.59 0
אחד 6.28 13.6 9.33 24.3 4.92 16.5 8.2 10.5 9.53 12.4 7.47 2.94 1.76 12.7 1.14 0 6.76 8.7 4.55 0 0
אחר 11.4 3.86 4.76 3.94 5.86 11 12.8 12.9 16.1 11 7.02 1.47 6.8 2.98 7.99 11.5 4.05 0 0 0 0
אחשדרפנימ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
אי 0.134 0 0 0 0 0 0 0 0 0 0 2.79 0.63 1.07 0 0 0 0 0 0 0
אינ 4.94 3.03 4 2.58 5.68 1.37 8.2 8.02 4.08 4.92 4.53 13.5 11.2 2.86 17.1 6.9 6.76 8.7 0 10.8 26.3
איש 21.2 13.2 17.9 17.8 17 19.8 60.5 51.3 38.4 16.7 28.8 9.4 20.3 10.6 11.4 9.2 4.05 26.1 40.9 14.4 3.76
אלפ 0.267 1.52 0 14.1 1.89 3.29 10.6 6.81 5.17 4.13 2.49 1.03 0.378 4.76 0 0 1.35 0 0 3.59 0
אמ 3.47 0.965 2.86 0.272 2.46 0.824 6.08 0.973 0.817 3.15 4.98 0.734 1.13 1.19 4.57 0 0 0 0 1.8 0
אמה 1.74 9.37 0.571 0.952 2.08 0.275 0.608 2.67 1.63 9.44 0.679 0.147 0.504 10.5 0 0 0 0 0 0 3.76
אמנה 0 0 0 0 0 0 0 0 0 0 0.226 0 0 0 0 0 0 0 0 0 0
אמת 0.802 0.276 0 0 0.568 0.824 0.912 0.243 0.817 0.983 0.453 1.76 1.39 0.238 1.14 0 0 0 0 1.8 0
אפר 0.134 0 0 0.272 0 0 0 0 0.272 0.393 0 0.44 0.126 0.238 0 0 0 0 4.55 0 0
ארבע 4.01 6.48 1.33 7.75 3.41 4.67 4.25 1.95 1.91 6.29 1.81 0.44 0.882 6.91 0 0 13.5 0 4.55 0 0
ארגמנ 0 3.58 0 0.136 0 0 0.304 0 0 0 0 0 0.126 0.238 0 0 0 0 0 0 0
ארצ 41.5 18.7 15.6 16.7 37.3 29.4 18.2 12.6 10.9 11 16.1 27.9 34.1 23.6 22.8 27.6 31.1 8.7 9.09 26.9 11.3
אשה 20.3 5.79 14.7 7.75 7.95 3.02 21 13.4 13.3 7.47 4.3 1.76 4.54 2.62 5.71 0 2.7 0 0 1.8 3.76
בגד 1.87 3.17 10.5 2.72 0.189 0 1.52 0.973 2.18 0.787 4.53 2.06 0.63 1.67 0 2.3 1.35 0 0 0 0
בד 1.87 5.79 2.48 2.99 1.7 0.824 2.73 1.22 2.45 3.34 1.13 1.61 0.252 1.19 1.14 0 0 0 0 0 0
בהט 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
בוצ 0 0 0 0 0 0 0 0 0 0 0 0 0 0.119 0 0 0 0 0 0 0
בזה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
בזיונ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
בינ 10.2 4.82 3.43 2.58 3.22 5.49 5.77 5.84 3.27 5.11 2.26 1.03 1.13 5.48 1.14 2.3 0 8.7 4.55 1.8 0
בירה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
בית 14.6 8.13 10.1 7.89 8.51 6.86 21.6 16 32.1 38.3 34.2 11 18.5 21.6 17.1 13.8 36.5 43.5 0 30.5 3.76
ביתנ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
בכי 0.134 0 0 0 0.189 0 0.304 0 0.272 0 0.226 1.17 1.01 0 0 2.3 0 0 0 0 0
בנ 48.8 32.1 30.5 83.1 24 66.2 62.6 33.8 56.9 37.2 50.3 12.3 28.2 22.7 28.5 34.5 14.9 17.4 13.6 10.8 0
בעל 0.534 1.93 0.19 0.136 0.568 0.275 9.12 0.973 0.817 2.56 6.57 0.587 1.76 0 6.85 2.3 0 0 0 0 3.76
בקר 4.81 6.2 4 8.43 2.65 1.37 3.34 7.29 4.35 3.15 2.04 2.35 0.756 1.9 4.57 2.3 4.05 0 4.55 1.8 0
בקשה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
בשמ 0 0.827 0 0 0 0 0 0 0 0.787 0.226 0.294 0 0.119 0 0 0 0 0 0 0
בת 14.7 3.17 4.38 3.54 4.16 4.39 8.2 3.89 5.44 2.56 3.85 3.96 5.17 5.24 4.57 4.6 1.35 0 0 14.4 0
בתולה 0.134 0.276 0.381 0 0.757 0 0.608 0 0.544 0.197 0.226 0.734 1.01 0.238 0 2.3 2.7 0 0 0 0
גבורה 0 0.138 0 0 0.189 0 0.608 0 0 0.787 1.59 1.03 0.756 0.238 0 0 0 0 0 3.59 0
גדולה 0 0 0 0 0 0 0 0 0.544 0 0 0 0 0 0 0 0 0 0 0 0
גולה 0 0 0 0 0 0 0 0 0 0 0.453 0 1.26 1.31 0 0 1.35 0 0 0 3.76
גורל 0 0 0.952 0.952 0 7.14 0.912 0 0 0 0 0.44 0.126 0.119 0 2.3 0 8.7 13.6 1.8 3.76
גליל 0 0 0 0 0 0 0 0 0 0.393 0 0.147 0 0 0 0 0 0 0 0 0
Next steps
Usage by ETCBC workflow for adding annotations (Wido van Peursen, Janet Dyk, whoever needs new kinds of data in and out of the database)
Wider Digital Humanities pattern seeking (Rens Bod)
Incorporate in NLPLAB VU (Wouter van Atteveldt)
Combine with POIO, NEO4J backend? (Peter Bouda)
Discuss at workshops: TLA Nijmegen (done), CLIN (accepted), DH2014?
Links
Docs: laf-fabric.readthedocs.org
GitHub: github.com/laf-fabric
ETCBC: vu.nl/etcbc
LAF: iso.org/laf