LAF Fabric

LAF Fabric Dirk Roorda DANS / TLA 2013-12-12 VU/ETCBC Amsterdam

description

Workbench for processing big LAF resources (Linguistic Annotation Framework)

Transcript of LAF Fabric

Page 2: LAF Fabric

What is LAF?

Linguistic Annotation Framework

ISO standard 24612:2012 Nancy Ide, Laurent Romary

A data model for stand-off markup, plus a recommended serialization: GrAF

Page 4: LAF Fabric

OANC

17K   8 feb 2013   ch5-logical.xml
217K  8 feb 2013   ch5-mpqa.xml
265K  8 feb 2013   ch5-nc.xml
16K   8 feb 2013   ch5-ne.xml
1,3M  8 feb 2013   ch5-penn.xml
1,4M  8 feb 2013   ch5-ptb.xml
990K  8 feb 2013   ch5-ptbtok.xml
48K   8 feb 2013   ch5-s.xml
274K  8 feb 2013   ch5-seg.xml
177K  8 feb 2013   ch5-vc.xml
2,6K  3 jun 09:05  ch5.hdr
31K   8 feb 2013   ch5.txt
19K   8 feb 2013   resource-header.xml

Page 5: LAF Fabric

text

Semantics enters with purpose.

For this to be true, it is not necessary that the carriers of purpose, say, the same bacterium heading upstream in the glucose gradient, be conscious.

I hope my definition of an autonomous agent is useful, an autocatalytic system carrying out a work cycle, now rather broadened by the realization that autonomous agents also do often detect and measure and record displacements of external systems from equilibrium that can be used to extract work, then do extract work, propagating work and constraint construction, from their environment.

2101-2110

Page 6: LAF Fabric

annotations

<region xml:id="mpqa-r64" anchors="2101 2110"/>
...
<node xml:id="mpqa-n64">
  <link targets="mpqa-r64"/>
</node>
<a xml:id="mpqa-N81257" label="target" ref="mpqa-n64" as="mpqa">
  <fs>
    <f name="id" value="semantics"/>
  </fs>
</a>
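The chain from annotation to node to region to primary text can be followed with a few dictionary lookups. The sketch below is only an illustration, not part of LAF-Fabric: the `<graph>` wrapper element, the padded stand-in for ch5.txt and the function name `resolve_annotations` are assumptions, and real GrAF files declare an XML namespace that this fragment omits.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The fragment from the slide, wrapped in a hypothetical <graph> root
# so that it parses as one document.
GRAF = """<graph>
  <region xml:id="mpqa-r64" anchors="2101 2110"/>
  <node xml:id="mpqa-n64"><link targets="mpqa-r64"/></node>
  <a xml:id="mpqa-N81257" label="target" ref="mpqa-n64" as="mpqa">
    <fs><f name="id" value="semantics"/></fs>
  </a>
</graph>"""

# Stand-in primary data: padding so that "Semantics" starts at offset 2101.
PRIMARY = " " * 2101 + "Semantics enters with purpose."


def resolve_annotations(graf_xml, primary_text):
    """Follow annotation -> node -> region -> text for every <a> element."""
    root = ET.fromstring(graf_xml)
    # region id -> (start, end) anchors
    regions = {
        r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
        for r in root.iter("region")
    }
    # node id -> list of region ids it links to
    node_regions = {
        n.get(XML_ID): [t for link in n.iter("link")
                        for t in link.get("targets").split()]
        for n in root.iter("node")
    }
    resolved = []
    for annot in root.iter("a"):
        features = {f.get("name"): f.get("value") for f in annot.iter("f")}
        spans = [regions[r] for r in node_regions[annot.get("ref")]]
        texts = [primary_text[start:end] for start, end in spans]
        resolved.append((annot.get("label"), features, texts))
    return resolved


RESULT = resolve_annotations(GRAF, PRIMARY)
```

Every lookup here goes through a string identifier, which is cheap for one annotation but adds up over millions of them.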


Page 7: LAF Fabric

headers

Resource header

Primary data header

Annotation header

Metadata, namespaces, annotation labels, statistics

Page 8: LAF Fabric

SHEBANQ

<node xml:id="n_88917"><link targets="r_1 r_2 r_3 r_4 r_5 r_6 r_7 r_8 r_9 r_10 r_11"/></node>
<a xml:id="a_88917" label="sentence_atom" ref="n_88917" as="lingo"/>
<a xml:id="a_f71355" label="ft" ref="n_88917" as="lingo"><fs><f name="sentence_atom_number" value="0"/></fs></a>

<edge xml:id="e_1" from="n_88917" to="n_84383"/>
<a xml:id="a_e1" label="parents" ref="e_1" as="link"/>

<region xml:id="r_1" anchors="0 5"/>
<node xml:id="n_2"><link targets="r_1"/></node>
<a xml:id="a_2" label="word" ref="n_2" as="monads"/>

<region xml:id="r_2" anchors="6 23"/>
<node xml:id="n_3"><link targets="r_2"/></node>
<a xml:id="a_3" label="word" ref="n_3" as="monads"/>

<region xml:id="w_1" anchors="24 24"/>

[Diagram: Genesis 1:1 as a LAF graph]

Legend
primary text: Unicode (UTF-8)
regions: r_ + monad number
nodes: n_ + object id
annotations (empty)
annotations (features)
labeled edges: parents, mother

Primary text
בראשית ברא א.הים את השמים ואת הארץ׃

Regions (anchors)
r_1 0-5, r_2 6-23, w_1 24, r_3 25-38, w_2 39, r_4 40-57, w_3 58, r_5 59-66, w_4 67, r_6 68-71, r_7 72-91, w_5 92, r_8 93-96, r_9 97-104, w_6 105, r_10 106-109, r_11 110-121, p_7 122-123

Word nodes
n_2 n_3 n_4 n_5 n_6 n_7 n_8 n_9 n_10 n_11 n_12, each annotated "word"
word features shown: lexeme_utf8=ראשית, old_lexeme_utf8=ראשית, vocalized_lexeme_utf8=ראשית, surface_consonants_utf8=ראשית, graphical_lexeme_utf8=ראשי

Higher nodes
n_84383 sentence: number_within_chapter=1
n_88917 sentence_atom
n_28737 clause
n_34680 clause_atom: clause_atom_number=1, clause_atom_relation=0, clause_atom_relation_daughter_tense=unknown, clause_atom_relation_kind=No_relation, clause_atom_relation_mother_tense=unknown, clause_atom_relation_preposition_class=none, clause_atom_type=xQtl, indentation=0
n_59559 phrase: determination=determined, is_apposition=false, number_within_clause=4, phrase_function=Objc, phrase_type=PP
n_40770 phrase_atom
n_77637 subphrase
n_77638 subphrase

Links from higher nodes to regions (as drawn)
r_7 .. r_5, r_11 .. r_9, r_11 .. r_5 (twice), r_11 .. r_1 (four times)

<a xml:id="a_f22" label="ft" ref="n_3" as="utf8">
  <fs>
    <f name="lexeme_utf8" value="ראשית"/>
    <f name="old_lexeme_utf8" value="ראשית"/>
    <f name="vocalized_lexeme_utf8" value="ראשית"/>
    <f name="surface_consonants_utf8" value="ראשית"/>
    <f name="graphical_lexeme_utf8" value="ראשי"/>
  </fs>
</a>

link to regions

Linguistic Annotation Framework

Page 9: LAF Fabric

More about SHEBANQ

data 2.27 GB XML 99.2%

nodes     1,453,175
edges     1,524,637
regions   800,087
features  42,545,492
xml ids   12,831,550
words     426,499

Page 10: LAF Fabric

LAF Processors

http://www.poio.eu

Page 11: LAF Fabric

LAF Processors

http://www.exist-db.org

Page 12: LAF Fabric

Performance Problems

POIO: load time +60 min, RAM +20 GB

eXist-db: load time +30 min (initial), count features +60 min

nodes/edges/features directly modeled as objects in Python


XQuery is not a handy tool for the relevant queries

need extensive index building

identifier chasing

Page 13: LAF Fabric

LAF-Fabric

http://laf-fabric.readthedocs.org

Page 14: LAF Fabric

What is it?

a compiler
LAF-XML ==> Python arrays
2,270 MB ==> 485 MB binary data
60 min load time ==> 1 s

a task execution environment
runs custom Python scripts, offering them a LAF API
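To give an idea of what "Python arrays" can mean in contrast to one object per node, here is a minimal sketch using the standard `array` module. It is not LAF-Fabric's actual binary layout, which the slides do not describe; the function names and the offsets-plus-targets scheme are assumptions chosen for illustration.

```python
from array import array


def compile_links(node_links):
    """Flatten per-node region lists into two unsigned-int arrays.

    node_links[i] is the list of region numbers that node i links to.
    """
    offsets = array("I", [0])   # offsets[i]..offsets[i+1] delimits node i
    targets = array("I")        # all region numbers, concatenated
    for links in node_links:
        targets.extend(links)
        offsets.append(len(targets))
    return offsets, targets


def regions_of(node, offsets, targets):
    """Return the region numbers of one node as a plain list."""
    return list(targets[offsets[node]:offsets[node + 1]])


# Toy data: two word nodes and one node spanning three regions.
OFFSETS, TARGETS = compile_links([[1], [2], [1, 2, 3]])
```

Two flat arrays of machine integers hold the same information as a list of node objects, and they can be written to and read from disk in bulk with `array.tofile` and `array.fromfile`.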

Page 15: LAF Fabric

Where is it?

clone it from GitHub https://github.com/dirkroorda/laf-fabric

run it locally

share your custom tasks

share your own annotations

Page 16: LAF Fabric

Example: Esther

linguistic variation among the bible books

count the common nouns of Esther

compare their freqs in Esther with those in other books of the Bible

Page 17: LAF Fabric

Wanted: a tab separated file with frequencies for 216 common nouns for all books

lexeme  Genesis  Exodus  Lamentations  Esther  Daniel  Ezra
אב  27.8  3.31  2.94  2.27  6.95  9.1
אבדנ  0  0  0  1.51  0  0
אבל  0.534  0  1.47  1.51  1.16  0
אגרת  0  0  0  1.51  0  0
אורה  0  0  0  0.755  0  0
אח  23.8  2.34  0  0.755  0  6.82
אחד  6.28  13.6  0  3.78  7.34  6.82
אחר  11.4  3.86  0  1.51  1.93  2.27
אחשדרפנימ  0  0  0  2.27  0  0.569
אי  0.134  0  0  0.755  0.386  0
אינ  4.94  3.03  16.2  7.55  3.48  2.27

Page 18: LAF Fabric

Given (1): LAF annotation files with part-of-speech features

<a xml:id="amf21" label="ft" ref="n3"><fs>
  <f name="noun_type" value="common"/>
  <f name="part_of_speech" value="noun"/>
  <f name="phrase_dependent_part_of_speech" value="noun"/>
  <f name="pronoun_type" value="none"/>
</fs></a>
<a xml:id="amf33" label="ft" ref="n4"><fs>
  <f name="noun_type" value="none"/>
  <f name="part_of_speech" value="verb"/>
  <f name="phrase_dependent_part_of_speech" value="verb"/>
  <f name="pronoun_type" value="none"/>
</fs></a>
<a xml:id="amf45" label="ft" ref="n5"><fs>
  <f name="noun_type" value="common"/>
  <f name="part_of_speech" value="noun"/>
  <f name="phrase_dependent_part_of_speech" value="noun"/>
  <f name="pronoun_type" value="none"/>
</fs></a>

Page 19: LAF Fabric

Given (2): information about the books (where they start and end)

<region xml:id="s_1254379" anchors="4609273 4664382"/>
<node xml:id="n1254379"><link targets="s_1254379"/></node>
<a xml:id="as1254379" label="db" ref="n1254379"><fs>
  <f name="otype" value="book"/>
  <f name="oid" value="1254379"/>
  <f name="monads" value="368500-373120"/>
  <f name="minmonad" value="368500"/>
  <f name="maxmonad" value="373120"/>
</fs></a>
<a xml:id="asf34" label="sft" ref="n1254379"><fs>
  <f name="book" value="Esther"/>
</fs></a>
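The minmonad and maxmonad features are enough to decide which book a given word belongs to. The sketch below shows one way to do that lookup with a binary search; the function name `book_of` and the tuple layout are hypothetical, and only the Esther range comes from the slide.

```python
import bisect


def book_of(monad, books):
    """Find the book containing a monad number.

    books is a list of (minmonad, maxmonad, name) tuples sorted by minmonad.
    Returns the book name, or None when the monad falls outside every book.
    """
    starts = [first for first, _last, _name in books]
    i = bisect.bisect_right(starts, monad) - 1
    if i >= 0 and books[i][0] <= monad <= books[i][1]:
        return books[i][2]
    return None


# Only the range shown on the slide; other books would be added the same way.
BOOKS = [(368500, 373120, "Esther")]
```

The task script on the next slide avoids this lookup altogether by walking the nodes in order and remembering the last book node it has seen.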

Page 20: LAF Fabric

to the workbench!

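a small Python script

    import collections

    # NN and F are supplied by the workbench (see "Receive task object")
    target_book = "Esther"
    words = collections.defaultdict(int)    # common nouns per book
    lexemes = collections.defaultdict(lambda: collections.defaultdict(int))  # lexeme counts per book
    books = []                              # book names in order of appearance

    # assumes NN() yields each book node before the words it contains
    for node in NN():
        this_type = F.shebanq_db_otype.v(node)
        if this_type == "word":
            p_o_s = F.shebanq_ft_part_of_speech.v(node)
            if p_o_s == "noun":
                noun_type = F.shebanq_ft_noun_type.v(node)
                if noun_type == "common":
                    words[book_name] += 1
                    lexeme = F.shebanq_ft_lexeme_utf8.v(node)
                    lexemes[book_name][lexeme] += 1
        elif this_type == "book":
            book_name = F.shebanq_sft_book.v(node)
            books.append(book_name)
            ontarget = book_name == target_book

source code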

Page 21: LAF Fabric

Declare features

    "features": {
        "shebanq": {
            "node": [
                "db.otype",
                "ft.part_of_speech,noun_type,lexeme_utf8",
                "sft.book",
            ],
            "edge": [
            ],
        },
    },

The workbench will load the selected features and unload the other features.
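The declaration strings and the names used in the script line up in a regular way: namespace "shebanq", label "ft" and feature "part_of_speech" become F.shebanq_ft_part_of_speech. A small sketch of that expansion follows; `expand_features` is a hypothetical helper written for this illustration, not a LAF-Fabric function.

```python
def expand_features(declaration):
    """Expand a feature declaration into (kind, qualified_name) pairs.

    Each entry has the form "label.name1,name2,..."; the qualified name
    joins namespace, label and feature name with underscores.
    """
    names = []
    for namespace, kinds in declaration.items():
        for kind in ("node", "edge"):
            for entry in kinds.get(kind, []):
                label, _, feature_list = entry.partition(".")
                for feature in feature_list.split(","):
                    names.append((kind, f"{namespace}_{label}_{feature}"))
    return names


DECLARED = expand_features({
    "shebanq": {
        "node": [
            "db.otype",
            "ft.part_of_speech,noun_type,lexeme_utf8",
            "sft.book",
        ],
        "edge": [],
    },
})
```

The five resulting names are exactly the attributes of F that the Esther script reads.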

Page 22: LAF Fabric

Receive task object

    def task(graftask):
        (msg, NN, F, X) = graftask.get_mappings()

And use supplied methods for rapid data access.
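A task written this way can be tried out on toy data without the workbench. The stub below is purely hypothetical: the classes `FeatureStub`, `FStub` and `GrafTaskStub` are invented here, and they mimic only what the slides show (a `get_mappings()` call returning `msg`, `NN`, `F`, `X`, and features read as `F.<name>.v(node)`). The roles of `msg` and `X` are not described in the slides, so they are filled with placeholders.

```python
class FeatureStub:
    """One feature: maps node numbers to values."""

    def __init__(self, values):
        self._values = values

    def v(self, node):
        return self._values.get(node)


class FStub:
    """Exposes every feature in a table as an attribute, like F in the slides."""

    def __init__(self, table):
        for name, values in table.items():
            setattr(self, name, FeatureStub(values))


class GrafTaskStub:
    """Hypothetical stand-in for the task object passed in by the workbench."""

    def __init__(self, node_order, table):
        self._nodes = node_order
        self._features = FStub(table)

    def get_mappings(self):
        def msg(text):          # placeholder: just print the message
            print(text)

        def NN():               # iterate over the nodes in the given order
            return iter(self._nodes)

        return (msg, NN, self._features, None)  # X is not modeled here


def task(graftask):
    (msg, NN, F, X) = graftask.get_mappings()
    return [F.shebanq_db_otype.v(node) for node in NN()]


STUB = GrafTaskStub([0, 1, 2], {"shebanq_db_otype": {0: "book", 1: "word", 2: "word"}})
TYPES = task(STUB)
```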

Page 23: LAF Fabric

run your task

Page 24: LAF Fabric

gather results

Genesis Exodus Leviticus Numbers Deuteronomy Joshua Judges I_Samuel II_Samuel I_Kings II_Kings Isaiah Jeremiah Ezekiel Hosea Joel Amos Obadiah Jonah Micah Nahum

אב 27.8 3.31 4.76 11.6 13.4 9.61 17.3 12.9 7.62 18.7 15.6 3.08 7.94 3.21 1.14 2.3 2.7 0 0 3.59 0

אבדנ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

אבל 0.534 0 0 0 0.189 0 0 0.243 0.817 0 0 0.294 0.378 0.119 0 0 4.05 0 0 1.8 0

אגרת 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

אורה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

אח 23.8 2.34 4.38 2.58 9.08 3.57 8.81 2.92 9.25 2.16 1.13 1.03 2.39 1.07 3.42 2.3 2.7 17.4 0 3.59 0

אחד 6.28 13.6 9.33 24.3 4.92 16.5 8.2 10.5 9.53 12.4 7.47 2.94 1.76 12.7 1.14 0 6.76 8.7 4.55 0 0

אחר 11.4 3.86 4.76 3.94 5.86 11 12.8 12.9 16.1 11 7.02 1.47 6.8 2.98 7.99 11.5 4.05 0 0 0 0

אחשדרפנימ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

אי 0.134 0 0 0 0 0 0 0 0 0 0 2.79 0.63 1.07 0 0 0 0 0 0 0

אינ 4.94 3.03 4 2.58 5.68 1.37 8.2 8.02 4.08 4.92 4.53 13.5 11.2 2.86 17.1 6.9 6.76 8.7 0 10.8 26.3

איש 21.2 13.2 17.9 17.8 17 19.8 60.5 51.3 38.4 16.7 28.8 9.4 20.3 10.6 11.4 9.2 4.05 26.1 40.9 14.4 3.76

אלפ 0.267 1.52 0 14.1 1.89 3.29 10.6 6.81 5.17 4.13 2.49 1.03 0.378 4.76 0 0 1.35 0 0 3.59 0

אמ 3.47 0.965 2.86 0.272 2.46 0.824 6.08 0.973 0.817 3.15 4.98 0.734 1.13 1.19 4.57 0 0 0 0 1.8 0

אמה 1.74 9.37 0.571 0.952 2.08 0.275 0.608 2.67 1.63 9.44 0.679 0.147 0.504 10.5 0 0 0 0 0 0 3.76

אמנה 0 0 0 0 0 0 0 0 0 0 0.226 0 0 0 0 0 0 0 0 0 0

אמת 0.802 0.276 0 0 0.568 0.824 0.912 0.243 0.817 0.983 0.453 1.76 1.39 0.238 1.14 0 0 0 0 1.8 0

אפר 0.134 0 0 0.272 0 0 0 0 0.272 0.393 0 0.44 0.126 0.238 0 0 0 0 4.55 0 0

ארבע 4.01 6.48 1.33 7.75 3.41 4.67 4.25 1.95 1.91 6.29 1.81 0.44 0.882 6.91 0 0 13.5 0 4.55 0 0

ארגמנ 0 3.58 0 0.136 0 0 0.304 0 0 0 0 0 0.126 0.238 0 0 0 0 0 0 0

ארצ 41.5 18.7 15.6 16.7 37.3 29.4 18.2 12.6 10.9 11 16.1 27.9 34.1 23.6 22.8 27.6 31.1 8.7 9.09 26.9 11.3

אשה 20.3 5.79 14.7 7.75 7.95 3.02 21 13.4 13.3 7.47 4.3 1.76 4.54 2.62 5.71 0 2.7 0 0 1.8 3.76

בגד 1.87 3.17 10.5 2.72 0.189 0 1.52 0.973 2.18 0.787 4.53 2.06 0.63 1.67 0 2.3 1.35 0 0 0 0

בד 1.87 5.79 2.48 2.99 1.7 0.824 2.73 1.22 2.45 3.34 1.13 1.61 0.252 1.19 1.14 0 0 0 0 0 0

בהט 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

בוצ 0 0 0 0 0 0 0 0 0 0 0 0 0 0.119 0 0 0 0 0 0 0

בזה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

בזיונ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

בינ 10.2 4.82 3.43 2.58 3.22 5.49 5.77 5.84 3.27 5.11 2.26 1.03 1.13 5.48 1.14 2.3 0 8.7 4.55 1.8 0

בירה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

בית 14.6 8.13 10.1 7.89 8.51 6.86 21.6 16 32.1 38.3 34.2 11 18.5 21.6 17.1 13.8 36.5 43.5 0 30.5 3.76

ביתנ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

בכי 0.134 0 0 0 0.189 0 0.304 0 0.272 0 0.226 1.17 1.01 0 0 2.3 0 0 0 0 0

בנ 48.8 32.1 30.5 83.1 24 66.2 62.6 33.8 56.9 37.2 50.3 12.3 28.2 22.7 28.5 34.5 14.9 17.4 13.6 10.8 0

בעל 0.534 1.93 0.19 0.136 0.568 0.275 9.12 0.973 0.817 2.56 6.57 0.587 1.76 0 6.85 2.3 0 0 0 0 3.76

בקר 4.81 6.2 4 8.43 2.65 1.37 3.34 7.29 4.35 3.15 2.04 2.35 0.756 1.9 4.57 2.3 4.05 0 4.55 1.8 0

בקשה 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

בשמ 0 0.827 0 0 0 0 0 0 0 0.787 0.226 0.294 0 0.119 0 0 0 0 0 0 0

בת 14.7 3.17 4.38 3.54 4.16 4.39 8.2 3.89 5.44 2.56 3.85 3.96 5.17 5.24 4.57 4.6 1.35 0 0 14.4 0

בתולה 0.134 0.276 0.381 0 0.757 0 0.608 0 0.544 0.197 0.226 0.734 1.01 0.238 0 2.3 2.7 0 0 0 0

גבורה 0 0.138 0 0 0.189 0 0.608 0 0 0.787 1.59 1.03 0.756 0.238 0 0 0 0 0 3.59 0

גדולה 0 0 0 0 0 0 0 0 0.544 0 0 0 0 0 0 0 0 0 0 0 0

גולה 0 0 0 0 0 0 0 0 0 0 0.453 0 1.26 1.31 0 0 1.35 0 0 0 3.76

גורל 0 0 0.952 0.952 0 7.14 0.912 0 0 0 0 0.44 0.126 0.119 0 2.3 0 8.7 13.6 1.8 3.76

גליל 0 0 0 0 0 0 0 0 0 0.393 0 0.147 0 0 0 0 0 0 0 0 0
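A table like this can be produced from the counters that the task script fills. The sketch below assumes that each cell is the number of occurrences per 1000 common nouns in that book, printed with three significant digits. The slides do not state the formula; the Esther values shown earlier are all multiples of about 0.755, which would fit this reading. The function name `frequency_table` and the toy counts are made up for the illustration.

```python
def frequency_table(words, lexemes, books, target_book, per=1000):
    """Build a tab separated table with one row per lexeme of target_book.

    words[book] is the number of common nouns in the book,
    lexemes[book][lexeme] the number of occurrences of one lexeme.
    """
    lines = ["\t".join(["lexeme"] + list(books))]
    for lexeme in sorted(lexemes[target_book]):
        cells = [lexeme]
        for book in books:
            total = words.get(book, 0)
            count = lexemes.get(book, {}).get(lexeme, 0)
            freq = per * count / total if total else 0
            cells.append(f"{freq:.3g}")   # three significant digits
        lines.append("\t".join(cells))
    return "\n".join(lines) + "\n"


# Toy counts, not taken from the Hebrew data.
WORDS = {"A": 1000, "B": 500}
LEXEMES = {"A": {"x": 5}, "B": {"x": 1, "y": 2}}
TABLE = frequency_table(WORDS, LEXEMES, ["A", "B"], "B")
```

Restricting the rows to the lexemes of the target book matches the goal of comparing Esther's common nouns with the other books.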

Page 25: LAF Fabric

Next steps

Usage by ETCBC workflow for adding annotations: Wido van Peursen, Janet Dyk, whoever needs new kinds of data in and out of the database

Wider Digital Humanities pattern seeking: Rens Bod

Incorporate in NLPLAB VU: Wouter van Atteveldt

Combine with POIO NEO4J backend? Peter Bouda

Discuss at workshops: TLA Nijmegen (done), CLIN (accepted), DH2014?

Page 27: LAF Fabric

thank you

slideshare.net/dirkroorda