Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko Comenius University, Faculty of Education Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics [email protected]


Page 1: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Optimizing Word Sketches for a Large-Scale Lexicographic Project
• Vladimír Benko
• Comenius University, Faculty of Education
• Slovak Academy of Sciences, Ľ. Štúr Institute of Linguistics
• [email protected]

Page 2: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Slovník súčasného slovenského jazyka (Dictionary of the Contemporary Slovak Language)
• A long-term project
• First presented: EURALEX 1992, Helsinki
• Actual compilation started in 1996
• First volume (A–G) dated 2006 (appeared January 2007)
• Second volume (H–L) to appear in December 2010
• Third volume (M–P1): 75% compiled

Page 3: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Infrastructure

• 1996:
  • one PC per room, MS-DOS
  • Novell Server
  • some PCs at home, mostly without Internet connection
• today:
  • dual/triple screen PC for every lexicographer
  • 4 servers (2 for dictionary projects, 2 for corpora)
  • PC at home, Internet connection

Page 4: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Slovník súčasného slovenského jazyka

• Lexical data: plain text + lightweight markup language (similar to Wikipedia markup)
  • "headword" (bold)
  • 'example' (italics)
  • |label| (smaller print)
  • [*reference] (smaller print)
  • {structure} (sense numbers, idiom indicators)
  • !identification line
  • ?comment line
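The delimiter conventions above can be illustrated with a small renderer. This is only a sketch under stated assumptions: the HTML tags, the `structure` class name, the function name `render_line` and the treatment of `!` and `?` lines as non-printing are all hypothetical choices, not the project's actual tooling.

```python
import html
import re

# Hypothetical mapping from the delimiters listed on the slide to HTML.
# The {structure} rule comes last because its replacement contains quotes.
INLINE_RULES = [
    (re.compile(r'"([^"]*)"'), r"<b>\1</b>"),                # "headword" -> bold
    (re.compile(r"'([^']*)'"), r"<i>\1</i>"),                # 'example' -> italics
    (re.compile(r"\|([^|]*)\|"), r"<small>\1</small>"),      # |label| -> smaller print
    (re.compile(r"\[\*([^\]]*)\]"), r"<small>\1</small>"),   # [*reference] -> smaller print
    (re.compile(r"\{([^}]*)\}"), r'<span class="structure">\1</span>'),  # {structure}
]


def render_line(line):
    """Return a (kind, text) pair for one line of an entry.

    Assumption: '!' and '?' lines carry metadata and are not typeset.
    """
    if line.startswith("!"):
        return ("identification", line[1:].strip())
    if line.startswith("?"):
        return ("comment", line[1:].strip())
    # Escape only &, < and >, so the quote delimiters survive for the rules.
    text = html.escape(line, quote=False)
    for pattern, replacement in INLINE_RULES:
        text = pattern.sub(replacement, text)
    return ("text", text)


if __name__ == "__main__":
    print(render_line('"holiť sa" |nedok.| {1} \'h. sa namokro\''))
    print(render_line("?check the second sense"))
```

Keeping the rules in a flat list makes it easy to add or reorder delimiters, which suits a markup that is meant to stay lightweight.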

Page 5: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Corpora

• 5 M corpus since 1998
• 20 M corpus in 2000
• Slovak National Corpus since 2003
• at present:
  • 550 M (60 % newspapers and journals)
  • web corpus (87 M, growing)
• WSE since 2007
• now: version 4 of Slovak Sketch Grammar

Page 6: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rules

• *DUAL

Page 7: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rules

• *DUAL

• 2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM"
• 1:"NOM" ("ADJ" "KON")? 2:"ADJ"
• 1:"V.*" 2:"ADV"
• 2:"ADV" 1:"V.*"

Page 8: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rules

• *DUAL
• =modifier/modifié
• 2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM"
• 1:"NOM" ("ADJ" "KON")? 2:"ADJ"
• 1:"V.*" 2:"ADV"
• 2:"ADV" 1:"V.*"
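To make the first pattern concrete, here is a minimal sketch of what `2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM"` selects from a tagged sentence. It assumes that a bare quoted string in the rule is matched against the `tag` attribute, that label `1:` marks the keyword and `2:` the collocate, and that the example sentence and the function name `adj_noun_pairs` are invented for illustration. It is plain Python, not the corpus query engine.

```python
import re

# Tokens between the adjective and the noun must be adjectives or adverbs.
FILLER = re.compile(r"AD[JV]")


def adj_noun_pairs(tokens, max_gap=3):
    """Return (noun, adjective) pairs from a list of (word, tag) tokens.

    Mirrors: 2:"ADJ" [tag="AD[JV]"]{0,3} 1:"NOM"
    """
    pairs = []
    for i, (word, tag) in enumerate(tokens):
        if tag != "ADJ":
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j >= len(tokens):
                break
            next_word, next_tag = tokens[j]
            if next_tag == "NOM":
                pairs.append((next_word, word))
            # Stop extending the gap once a token cannot act as a filler.
            if not FILLER.fullmatch(next_tag):
                break
    return pairs


if __name__ == "__main__":
    sentence = [("une", "DET"), ("belle", "ADJ"), ("grande", "ADJ"), ("maison", "NOM")]
    print(adj_noun_pairs(sentence))
```

Under `*DUAL`, each pair found this way would feed both relations named in the rule header, once from the noun's point of view and once from the adjective's.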

Page 9: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (CNC, “A” Style)

• is_subj_of
• has_subj
• is_obj4_of
• has_obj4
• a_modifier
• modifies
• prec_prep
• coord
• gen1
• gen2

Page 10: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (CNC, “A” Style)

• KW is_subj_of CL
• KW has_subj CL
• KW is_obj4_of CL
• KW has_obj4 CL
• CL is a_modifier of KW
• KW modifies CL
• CL is prec_prep of KW
• KW & CL are coord'ed
• CL is gen1 case
• KW is gen2 case

Page 11: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

 #   Rule name                    KW   CL   KW   CL   CL2
 1   =coord                       *    *    *    *
 2   =post_inf                    *    Vb
 3   =prec_verb                   Sb   Vb
 4   =post_verb                   Sb   Vb
 5   =a_modifier/modifies         Sb   Aj   Aj   Sb
 6   =prec_prep                   Sb   R
 7   =post_prep                   Vb   R
 8   =post_%s                     Sb   Sb             Pp
 8   =post_%s                     Vb   Sb             Pp
 9   =prec_%s                     Sb   Sb             Pp
 9   =prec_%s                     Vb   Sb             Pp
10   =byt_adj/subj_byt            Sb   Aj   Aj   Sb
11   =gen_1/gen_2                 Sb   Sb   Sb   Sb
12   =is_subj_of/has_subj         Sb   Vb   Vb   Sb
13   =is_obj2_of/has_obj2         Sb   Vb   Vb   Sb
14   =is_obj3_of/has_obj3         Sb   Vb   Vb   Sb
15   =is_obj4_of/has_obj4         Sb   Vb   Vb   Sb
16   =is_obj7_of/has_obj7         Sb   Vb   Vb   Sb
17   =passive/subj_of_passive     Sb   Vb   Vb   Sb
18   =categ1/categ2               Sb   Sb   Sb   Sb
19   =ajine1/ajine2               Sb   Sb   Sb   Sb

Page 12: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (“A” Style)

• Rule names motivated syntactically (named by syntactic function)
• Keyword/Collocate position (usually) not indicated
• Keyword/Collocate PoS implied
• Some relationships difficult to name
• Transparent for basic relationships
• Difficult to extend
• Precision preferred over recall

Page 13: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (“V” Style)

• *DUAL
• =a_modifier/modifies
• 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]

Page 14: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (“V” Style)

• *DUAL
• =a_modifier/modifies
• 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]

• *DUAL
• =Aj X/X Nn
• 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]

Page 15: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (“V” Style)

• *DUAL
• =Aj X/X Nn
• 2:[tag="A.*"] []{0,2} 1:[tag="N.*"]

• =Aj X
• 2:[tag="A.*"] []{0,2} 1:[]

• =X Nn
• 1:[] []{0,2} 2:[tag="N.*"]

Page 16: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (“V” Style)

• Keyword: X (upper case)
• Collocate: Vb, Aj, Av, … (upper case + lower case)
• Collocate: Y (upper case)
• Keyword/Collocate modifier/restriction: sgX (lower case + upper case), usually in UNARY rules
• Secondary Collocate: %s (lower case), in TRINARY rules

Page 17: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (SNC, “V” Style)

• (BINARY)
  • Vb X / X Vb
  • Av X / X Av
  • Nm X / X Nm
  • Aj X / X Aj
  • Y X / X Y
  • Pp X / X Pp
• SYMMETRIC
  • X , X
  • X Cj X
• TRINARY
  • pp X Y, …

Page 18: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (SNC, “V” Style)

• UNARY
  • sgX, plX
  • pX, cX, sX
  • nomX, genX, datX, accX, vocX, locX, insX
  • 1pX, 2pX, 3pX
  • SbX, ...

Page 19: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

 #   Rule name          KW      CL   CL2
 1   =Y cX              Aj/Av   *
 2   =cX Y              Aj/Av   *
 3   =Y sX              Aj/Av   *
 4   =sX Y              Aj/Av   *
 5   =Y Cj X/X Cj Y     *       *
 6   =Vb X/X Vb         *       Vb
 7   =Av X/X Av         *       Av
 8   =Aj X              *       Aj
 9   =X Aj              *       Aj
10   =Sb X              *       Sb
11   =X Sb              *       Sb
12   =Y X               *       *′
13   =X Y               *       *′
14   =Pp X              *       Pp
15   =X Pp              *       Pp
16   =Y %s X            *       *    Pp
17   =%s Y X            *       *    Pp
18   =X %s Y            *       *    Pp
19   =%s X Y            *       *    Pp

Page 20: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Word Sketch Rule Names (“V” Style)

• Rule names motivated collocationally (named by PoS of Keyword/Collocate)
• Keyword/Collocate position indicated explicitly
• Keyword/Collocate PoS (usually) indicated explicitly
• All relationships can be named uniformly
• Name of syntactic function not present
• Easily extensible
• Recall preferred over precision

Page 21: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Special Treatment: Reflexive Verbs

• Reflexivity of verbs in Slovak:
• Reflexive formant sa or si in the vicinity of a verb, which can be regarded as
  • a) Lexical morpheme (“inherent” reflexivity)
  • b) Reflexive pronoun (“proper” reflexivity or reciprocity)
  • c) Grammatical formant (reflexive form of a non-reflexive verb)

Page 22: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Special Treatment: Reflexive Verbs

• In dictionaries:
  • Case (a) implies the creation of a new headword, either in a common entry with the non-reflexive form of the respective verb or in an entry of its own
  • Case (b) may generate a new headword or be indicated in another way (e.g. within the example zone), depending on the type and size of the dictionary
  • Case (c) is a syntactic phenomenon; dictionaries usually do not treat it systematically

Page 23: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Special Treatment: Reflexive Verbs

• Reflexives in SSSJ: always in separate entries

• holiť sa -lí sa -lia sa hoľ sa! -lil sa -liac sa -liaci sa -lený -lenie sa nedok.
• {1} (ø; čím) rezaním odstraňovať zo svojej tváre (al. častí tela) chlpy: musí sa denne h.; h. sa namokro; h. sa v podpazuší; Husto zarastal a zavčasu sa holil. [Š. Žáry]; A ráno sa holí mojou žiletkou. [M. Zelinka]; Sám sa holiť nemohol, lebo sa mu od ťažkej roboty triasli ruky. [B. Šikula]; h. sa strojčekom, žiletkou, britvou; holí sa každé ráno; Dlho som sa neholil, narástla mi brada a fúzy. [P. Jaroš]
• {2} dávať si odstraňovať chlpy z tváre (obyč. britvou): otec sa už roky holí u toho istého holiča;
• opak. holievať sa -va sa -vajú sa -val sa; dok. -> oholiť sa

Page 24: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Reflexive verbs (Slovak Orthographic Dictionary)

Type   Count        %   Example headword                      ø   sa   si
   1    9478    66.55   abdikovať -uje -ujú nedok. i dok.     x
   2    1019     7.16   zahniezďovať sa -uje -ujú nedok.          x
   3    3415    23.98   páčiť -i -ia nedok.                   x   x
   4      64     0.45   zapamätávať si -a -ajú nedok.                  x
   5     183     1.29   líhavať -a -ajú nedok.                x        x
   6       3     0.02   posťažovať sa -uje -ujú dok.              x    x
   7      79     0.55   uvedomiť -í -ia dok.                  x   x    x
Total  14241   100.00

Page 25: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Reflexive verbs (Slovak Orthographic Dictionary)

[Chart with categories 1 to 7; image not preserved in the transcript]

Page 26: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Special Treatment: Reflexive Verbs

• To separate Word Sketches for the reflexive and non-reflexive forms of a verb, we need:
  • (1) Secondary segmentation: splitting sentences into smaller chunks
  • (2) Secondary markup: indicating reflexivity for verbs
  • (3) Use of the secondary markup in Word Sketch rules

Page 27: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Secondary Segmentation and Markup

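• <s>Francúzski vojenskí dôstojníci a humanitní pracovníci na juhozápade cez víkend varovali pred novým exodom vystrašených Rwanďanov, predovšetkým Hutuov, ktorí sa boja odchodu francúzskych vojakov dozerajúcich na poriadok v oblasti, avizovaného na koniec tohto mesiaca.</s>
• (English: French military officers and aid workers in the southwest warned over the weekend of a new exodus of frightened Rwandans, especially Hutus, who fear the departure of the French troops overseeing order in the area, announced for the end of this month.)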

Page 28: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Secondary Segmentation and Markup

• <s0>Francúzski vojenskí dôstojníci</s0>
• <s0>a humanitní pracovníci na juhozápade cez víkend varovali pred novým exodom vystrašených Rwanďanov,</s0>
• <s0>predovšetkým Hutuov,</s0>
• <s0>ktorí sa boja odchodu francúzskych vojakov dozerajúcich na poriadok v oblasti,</s0>
• <s0>avizovaného na koniec tohto mesiaca.</s0>

Page 29: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Secondary Segmentation and Markup

• <s0>Francúzski vojenskí dôstojníci a humanitní pracovníci na juhozápade cez víkend varovali pred novým exodom vystrašených Rwanďanov,</s0>
• <s0>predovšetkým Hutuov, ktorí sa boja odchodu francúzskych vojakov dozerajúcich na poriadok v oblasti,</s0>
• <s0>avizovaného na koniec tohto mesiaca.</s0>

Page 30: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Secondary Segmentation and Markup

• <s0>Francúzski vojenskí dôstojníci a humanitní pracovníci na juhozápade cez víkend varovali_r0 pred novým exodom vystrašených Rwanďanov,</s0>
• <s0>predovšetkým Hutuov, ktorí sa boja_r1 odchodu francúzskych vojakov dozerajúcich na poriadok v oblasti,</s0>
• <s0>avizovaného na koniec tohto mesiaca.</s0>
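The marking step shown on this slide can be sketched as follows. This is only an illustration under several assumptions: chunks arrive already segmented, verbs are recognised by a tag starting with `V` (a made-up tagset), `r0` means no reflexive formant in the chunk and `r1` means `sa` is present. The slides do not show a label for `si`, so the caller has to supply one through the hypothetical `marks` parameter.

```python
def mark_reflexivity(chunk, marks=None):
    """Attach a reflexivity label to every verb in one <s0> chunk.

    chunk: list of (word, tag) pairs; tags starting with "V" count as verbs
           (hypothetical tagset).
    marks: mapping from reflexive formant to label; only "sa" -> "r1" is
           taken from the slide, anything else is the caller's assumption.
    """
    if marks is None:
        marks = {"sa": "r1"}
    label = "r0"
    for word, _tag in chunk:
        if word.lower() in marks:
            label = marks[word.lower()]
            break
    return [
        (word, tag, label if tag.startswith("V") else None)
        for word, tag in chunk
    ]


if __name__ == "__main__":
    chunk = [("ktorí", "P"), ("sa", "R"), ("boja", "V"), ("odchodu", "S")]
    for token in mark_reflexivity(chunk):
        print(token)
```

Restricting the search for the formant to a single chunk is what makes the secondary segmentation matter: a `sa` in a neighbouring clause no longer affects the verb.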

Page 31: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko
Page 32: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko
Page 33: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Optimizing: Some Minor Issues

• Choosing optimal browser
  • Mozilla Firefox for dual screen display
  • Google Chrome for dual window display
• Default Word Sketch parameters
  • minimal frequency 4
  • minimal salience –2.0
  • no collocation clustering
  • minimal unary score –20.0
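The effect of the frequency and salience defaults can be sketched as a simple filter. The `Collocate` record, the function name `apply_defaults`, the sample values and the assumption that both thresholds are inclusive are all hypothetical; the real filtering happens inside the Word Sketch tool.

```python
from dataclasses import dataclass


@dataclass
class Collocate:
    lemma: str
    frequency: int
    salience: float


def apply_defaults(collocates, min_frequency=4, min_salience=-2.0):
    """Keep collocates that reach both thresholds, most salient first.

    Assumption: the thresholds are inclusive (>=).
    """
    kept = [
        c for c in collocates
        if c.frequency >= min_frequency and c.salience >= min_salience
    ]
    return sorted(kept, key=lambda c: c.salience, reverse=True)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = [
        Collocate("britva", 12, 7.5),
        Collocate("ráno", 40, -1.0),
        Collocate("deň", 3, 2.0),      # dropped: frequency below 4
        Collocate("byť", 500, -3.5),   # dropped: salience below -2.0
    ]
    for c in apply_defaults(sample):
        print(c.lemma, c.frequency, c.salience)
```

A negative salience floor keeps frequent but weakly associated collocates visible, which is consistent with the preference for recall over precision stated earlier in the talk.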

Page 34: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Optimizing: Some Minor Issues

• Default Screen Layout
  • fixed order of tables
  • 4 columns only (easier to print)
  • 32 lines per table (to fit the screen)
  • font selection: Georgia (set in browser)

Page 35: Optimizing Word Sketches for a Large-Scale Lexicographic Project Vladimír Benko

• Infrastructure

• 2 servers (eugen & samo)*
• Debian, Ubuntu
• Apache, Lighttpd
• hot backup
• three “gates”:
  – stable
  – beta (Sandbox)
  – alpha (Rockbox)
• common authentication
• ________
• * Eugen Jóna (1909–1985), Eugen Pauliny (1912–1983), Samuel Czambel (1856–1909) ... Slovak linguists