On the Ambiguity of Serbian Texts and Methods to disambiguate it
![Page 1: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/1.jpg)
1
On the Ambiguity of Serbian Texts and Methods to
disambiguate it
Cvetana Krstev, Duško Vitas,
University of Belgrade
8th Intex/Nooj Workshop
![Page 2: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/2.jpg)
2
What is the ambiguity?
• the assignment of different lemmas
• the assignment of different grammatical categories
![Page 3: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/3.jpg)
3
The ambiguity in Serbian
In Serbian many word forms are homographs although not homophones, because stress marks are not recorded. Readings of the written form “gore”:
gőre adv. up
gőrē adv. worse
gòrē P3s goreti,V+Ek to burn
gòre A3s
gòrē P3s gorjeti,V+Ijk to burn
gòre A3s
gòre fs2 gora forest

| | short | long |
|---|---|---|
| up | ő | ô |
| down | ò | ó |
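A minimal sketch of why these forms become homographs in ordinary text: once the accent marks are dropped, the distinct pronunciations listed on this slide collapse into a single written form. The helper name `strip_accents` is hypothetical, and the approach is only suitable for this illustration, since it would also strip the diacritics of regular Serbian Latin letters such as č, š and ž.

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Remove combining marks, mimicking ordinary unaccented spelling."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Accented forms copied from the slide.
accented = ["gőre", "gőrē", "gòrē", "gòre"]

# Every accented variant maps to the same written form.
written = {strip_accents(w) for w in accented}
print(written)
```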
![Page 4: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/4.jpg)
4
The ambiguity in Serbian (2)
rodoslovna,rodoslovni.A2+PosQ:akms2g:akms4v:aefs1g:aefs5g:akns2g:aenp1g:aenp4g:aenp5g
rodoslovne,rodoslovni.A2+PosQ:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g
rodoslovni,rodoslovni.A2+PosQ:adms1g:aems4q:aems5g:aemp1g:aemp5g
rodoslovnih,rodoslovni.A2+PosQ:aemp2g:aefp2g:aenp2g
rodoslovnim,rodoslovni.A2+PosQ:aems6g:aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aens6g:aenp3g:aenp6g:aenp7g
rodoslovnima,rodoslovni.A2+PosQ:aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aenp3g:aenp6g:aenp7g
rodoslovno,rodoslovni.A2+PosQ:aens1g:aens4g:aens5g
rodoslovnog,rodoslovni.A2+PosQ:adms2g:adms4v:adns2g
rodoslovnoga,rodoslovni.A2+PosQ:adms2g:adms4v:adns2g
rodoslovnoj,rodoslovni.A2+PosQ:aefs3g:aefs7g
rodoslovnom,rodoslovni.A2+PosQ:adms3g:adms7g:aefs6g:adns3g:adns7g
…
rodoslovnima ← 9 sets of grammatical categories
e: the form is the same for definite and indefinite
g: the form is the same for animate and inanimate
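The entries on this slide can be taken apart mechanically. The sketch below assumes the layout `form,lemma.CODE+markers:cat1:cat2...` that the entries appear to follow; the function name `parse_entry` is hypothetical and not part of any tool mentioned in the talk.

```python
def parse_entry(line: str):
    """Split 'form,lemma.CODE+markers:cat1:cat2' into its parts."""
    form, rest = line.split(",", 1)
    lemma_and_code, *categories = rest.split(":")
    lemma, code = lemma_and_code.rsplit(".", 1)
    return form, lemma, code, categories

# Entry copied from the slide.
entry = ("rodoslovnima,rodoslovni.A2+PosQ:"
         "aemp3g:aemp6g:aemp7g:aefp3g:aefp6g:aefp7g:aenp3g:aenp6g:aenp7g")

form, lemma, code, categories = parse_entry(entry)
# One form, one lemma, but several sets of grammatical categories.
print(form, lemma, code, len(categories))
```

Counting the categories this way gives one measure of ambiguity per word form; counting distinct lemmas per form gives the other.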
![Page 5: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/5.jpg)
5
Disambiguation process
• Reconstructing word forms
• Using filter dictionaries
• Using restricted dictionaries
• Using dictionaries of compounds
• Using disambiguation grammars
![Page 6: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/6.jpg)
6
Reconstructing word forms – date adverbial phrases
![Page 7: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/7.jpg)
7
Reconstructing word forms – date adverbial phrases (2)
i izdavanxem YUBA kartica 20. februara 2002. godine.
celog sistema. Zato je josx pocyetkom 1996. godine jedan
i www.plivamed.net. U petom mjesecu 2001.godine smo oformlx
cxe biti odrzxan u novembru ove godine u Neumu, a za prvog
| | Simple forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 54196 | 86915 | 1.60 | 174079 | 3.21 |
| After | 54126 | 86768 | 1.60 | 173727 | 3.21 |
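The two ratio columns appear to be the average number of associated lemmas per simple form and the average number of lemma-plus-category readings per simple form. The sketch below recomputes them from the counts on this slide, assuming the first row of figures is the state before the step and the second row the state after it; the dictionary keys are illustrative names.

```python
# Counts copied from the slide (first row, then second row).
rows = {
    "before": {"forms": 54196, "lemmas": 86915, "lemmas_categ": 174079},
    "after": {"forms": 54126, "lemmas": 86768, "lemmas_categ": 173727},
}

def ratios(row):
    """Average lemmas per form and average lemma+category readings per form."""
    return (round(row["lemmas"] / row["forms"], 2),
            round(row["lemmas_categ"] / row["forms"], 2))

for name, row in rows.items():
    print(name, ratios(row))
```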
![Page 8: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/8.jpg)
8
Reconstructing word forms – forms written with digits, etc.
![Page 9: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/9.jpg)
9
Reconstructing word forms – forms written with digits(2)
sxkovi iznosili oko 500 hilxada maraka. Znacyajna usxteda
poput SAP-ovog ili IBM-ovog, dobijate i organizaciju firme
cyelicyne industrije 1890-ih nije postojao. Ali, poznata je
sveta drma tezxinom od 81,7 milijardi dolara u 160 zemalxa,
odnosno ukupno bezmalo pola milijarde (464 miliona)! Predxe
| | Simple forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 54126 | 86768 | 1.60 | 173727 | 3.21 |
| After | 54064 | 86507 | 1.60 | 173693 | 3.21 |
![Page 10: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/10.jpg)
10
Using filter dictionaries
mi,ja.PRO01+Prs:sx3i
mi,mi.PRO03+Prs:px1r
mi,miti.V35+Imperf+Tr+Iref+Ref:Ays:Azs
li,li.PAR
li,liti.V98+Imperf+Tr+It+Iref:Ays:Azs
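One way to picture a filter dictionary is as a small, high-priority lexicon that is consulted first, so that unlikely readings of very frequent forms never reach the text annotation. The sketch below is only an assumption about the mechanism: the names `general`, `filter_dict` and `lookup` are hypothetical, and which readings the real filter dictionary keeps is not stated on the slide.

```python
# General dictionary readings, copied from the entries shown for "li".
general = {
    "li": ["li,li.PAR", "li,liti.V98+Imperf+Tr+It+Iref:Ays:Azs"],
}

# Hypothetical filter dictionary: the choice of kept reading is an assumption.
filter_dict = {
    "li": ["li,li.PAR"],
}

def lookup(form):
    """Filter entries take priority; otherwise fall back to the general dictionary."""
    if form in filter_dict:
        return filter_dict[form]
    return general.get(form, [])

print(lookup("li"))
```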
![Page 11: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/11.jpg)
11
Using filter dictionaries (2)
Very cautious filter dictionary with only 41 entries:
| | Simple forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 54064 | 86507 | 1.60 | 173693 | 3.21 |
| After | 53858 | 81607 | 1.52 | 166908 | 3.10 |
![Page 12: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/12.jpg)
12
Using restricted dictionaries
• Dictionaries contain lemmas for both standard pronunciations, Ekavian and Ijekavian. Texts, however, are usually written in only one.
• Dictionaries contain lemmas for both the Serbian and Croatian languages (or variants of Serbo-Croatian).
![Page 13: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/13.jpg)
13
Using restricted dictionaries (2)
crvene,crven.A17+Col:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g
crvene,crveneti.V547+Imperf+It+Iref+Ref+Ek:Pzp:Ays:Azs
crvene,crveniti.V54+Imperf+Tr+Iref:Pzp
crvene,crvenxeti.V747+Imperf+It+Iref+Ref+Ijk:Pzp
| | Simple forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 53858 | 81607 | 1.52 | 166908 | 3.10 |
| After | 53809 | 80890 | 1.50 | 165546 | 3.08 |
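Restricting the dictionary can be sketched as dropping every entry that carries the marker of the pronunciation not used in the text. The example below uses the four readings of “crvene” from this slide and assumes an Ekavian text, so entries marked `+Ijk` are removed. The helper names `markers` and `restrict` are hypothetical.

```python
# Entries copied from the slide.
entries = [
    "crvene,crven.A17+Col:aemp4g:aefs2g:aefp1g:aefp4g:aefp5g",
    "crvene,crveneti.V547+Imperf+It+Iref+Ref+Ek:Pzp:Ays:Azs",
    "crvene,crveniti.V54+Imperf+Tr+Iref:Pzp",
    "crvene,crvenxeti.V747+Imperf+It+Iref+Ref+Ijk:Pzp",
]

def markers(entry):
    """Return the '+'-separated markers of an entry's code, e.g. {'Imperf', 'Ek'}."""
    head = entry.split(":", 1)[0]      # drop the grammatical categories
    code = head.rsplit(".", 1)[1]      # drop 'form,lemma.'
    return set(code.split("+")[1:])    # drop the inflectional class (e.g. V547)

def restrict(entries, excluded):
    """Keep only the entries that do not carry the excluded marker."""
    return [e for e in entries if excluded not in markers(e)]

ekavian_only = restrict(entries, "Ijk")
print(len(entries), "->", len(ekavian_only))
```

Entries without either marker, such as the adjective reading, are kept in both cases.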
![Page 14: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/14.jpg)
14
Using dictionaries of compounds
bez obzira na,bez obzira na.PREP+C+Ncn+p4
bez,bez.PREP+p2
na,na.INT
na,na.PREP+p4+p7
obzira,obzir.N1:ms2q:mp2q
obzira,obzirati.V519+Imperf+It+Ref:Ays:Azs
| | Simple forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 53809 | 80890 | 1.50 | 165546 | 3.08 |
| After | 48698 | 72597 | 1.49 | 147714 | 3.03 |
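Recognizing a compound as a single unit removes the separate, ambiguous analyses of its components. The sketch below uses a greedy longest-match strategy, which is an assumption about how compounds might be preferred, not a description of the tool used in the talk. The one-entry dictionary and the test phrase “bez obzira na cenu” are made up for illustration.

```python
# Hypothetical mini-dictionary of compounds, keyed by their token sequence.
compounds = {("bez", "obzira", "na"): "bez obzira na.PREP+C+Ncn+p4"}
max_len = max(len(key) for key in compounds)

def segment(tokens):
    """Greedy longest match: prefer a compound over its simple components."""
    units, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in compounds:
                units.append(" ".join(candidate))
                i += n
                break
        else:
            # No compound starts here: keep the simple form.
            units.append(tokens[i])
            i += 1
    return units

print(segment("bez obzira na cenu".split()))
```

Treating the three tokens as one unit is consistent with the drop in the number of forms reported on this slide.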
![Page 15: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/15.jpg)
15
Using disambiguation grammars – positional constraint
An ambiguous form is an interjection if it is followed by an exclamation mark.
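A positional constraint of this kind can be sketched as a rule that looks one token to the right. The example reuses the two readings of “na” (INT and PREP) from the compound slide; the tag `PUNCT` and the function name are hypothetical.

```python
def apply_interjection_rule(tokens, readings):
    """readings[i] is the list of candidate tags for tokens[i]."""
    result = []
    for i, tags in enumerate(readings):
        followed_by_bang = i + 1 < len(tokens) and tokens[i + 1] == "!"
        if followed_by_bang and "INT" in tags:
            tags = ["INT"]          # keep only the interjection reading
        result.append(list(tags))
    return result

tokens = ["na", "!"]
readings = [["INT", "PREP"], ["PUNCT"]]
print(apply_interjection_rule(tokens, readings))
```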
![Page 16: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/16.jpg)
16
Using disambiguation grammars – positional constraint (2)
After a sentence or phrase boundary, “mi” and “ti” are personal pronouns in the nominative case (once other possibilities have been excluded).
![Page 17: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/17.jpg)
17
Using disambiguation grammars – sequential constraint
“da” is a conjunction (and not a form of the verb dati, “to give”) if it is followed by an auxiliary verb in clitic form.
![Page 18: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/18.jpg)
18
Using disambiguation grammars – sequential and positional constraints
sxargarepe evropska unija ne samo da je prihvatila nasxu i
da,.CONJ
da,.ADV
da,.INT
da,.PAR
da,dati.V103+Perf+Tr+Iref+Ref:Pzs:Ays:Azs
| | Forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 48698 | 72595 | 1.49 | 147714 | 3.03 |
| After | 48698 | 71809 | 1.47 | 146491 | 3.01 |
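The sequential constraint on “da” can be sketched as follows, using the concordance fragment “ne samo da je prihvatila” from this slide. The clitic list is illustrative rather than exhaustive, the short tags stand in for the full dictionary entries, and the `"?"` placeholders mark tokens whose readings are not shown on the slide.

```python
# Illustrative (not exhaustive) set of auxiliary clitics.
AUX_CLITICS = {"sam", "si", "je", "smo", "ste", "su"}

def disambiguate_da(tokens, readings):
    """Keep only the conjunction reading of 'da' before an auxiliary clitic."""
    result = [list(tags) for tags in readings]
    for i, tok in enumerate(tokens[:-1]):
        if tok == "da" and tokens[i + 1] in AUX_CLITICS and "CONJ" in result[i]:
            result[i] = ["CONJ"]
    return result

da_readings = ["CONJ", "ADV", "INT", "PAR", "V"]
tokens = "ne samo da je prihvatila".split()
readings = [["?"], ["?"], da_readings, ["?"], ["?"]]
print(disambiguate_da(tokens, readings)[2])
```

A sequence such as “da da” is left untouched by this rule, since the second “da” is not an auxiliary clitic.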
![Page 19: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/19.jpg)
19
Using disambiguation grammars – agreement
An adjective, possessive pronoun or numeral has to agree in gender, number, and case with the noun that follows.
![Page 20: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/20.jpg)
20
Using disambiguation grammars – agreement (2)
povecxati nxegov proboj u regionu. Rumunska proporcija
u,.PREP+p2
u,.PREP+p4
u,.PREP+p7
regionu,region.N1:ms3q
regionu,region.N1:ms7q
| | Forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 48698 | 71809 | 1.47 | 146491 | 3.01 |
| After | 48698 | 66284 | 1.36 | 129167 | 2.65 |
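The example entries on this slide pair the preposition “u” with the noun form “regionu”, so the sketch below checks case compatibility between the two and keeps only the readings that share a case. It assumes that the digit after `+p` is the case governed by the preposition and that the third character of a noun category code is its case; both are inferred from the entries shown, and the helper names are hypothetical. The same intersection idea would extend to gender and number for adjectives.

```python
# Entries copied from the slide.
prep_entries = ["u,.PREP+p2", "u,.PREP+p4", "u,.PREP+p7"]
noun_entries = ["regionu,region.N1:ms3q", "regionu,region.N1:ms7q"]

def prep_case(entry):
    """Case governed by the preposition: the digit in its '+pN' marker (assumed)."""
    return entry.rsplit("+p", 1)[1]

def noun_case(entry):
    """Case of the noun: third character of its category code (assumed layout)."""
    return entry.rsplit(":", 1)[1][2]

# Cases that both the preposition and the noun allow.
shared = {prep_case(p) for p in prep_entries} & {noun_case(n) for n in noun_entries}

kept_preps = [p for p in prep_entries if prep_case(p) in shared]
kept_nouns = [n for n in noun_entries if noun_case(n) in shared]
print(kept_preps, kept_nouns)
```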
![Page 21: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/21.jpg)
21
Using disambiguation grammars – agreement of personal names
Special rules for the agreement of first name and surname.
![Page 22: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/22.jpg)
22
Using disambiguation grammars – agreement (2)
raspalio je Mladxan Dinkicx sxakom o okrugli sto "Platne kartice -
Mladxan,Mladxan.N1002+Hum+NProp+First+SR:ms1v
Mladxan,mladxan.A7:akms1g:akms4q
Dinkicx,Dinkicx.N28+NProp+Hum+Last+SR:ms1v
| | Forms | Assoc. lemmas | Ratio | Lemmas + categ. | Ratio |
|---|---|---|---|---|---|
| Before | 48698 | 66284 | 1.36 | 129167 | 2.65 |
| After | 48698 | 66255 | 1.36 | 129101 | 2.65 |
![Page 23: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/23.jpg)
23
The order of grammar application
←Apply first
Apply second →
![Page 24: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/24.jpg)
24
Careful construction of grammars
Syntactic ambiguity:
Zalagacxu se da ti trosxkovi budu minimalni.
I will do my best to minimize these expenses.
I will do my best to minimize your expenses.
Although some cases are much more frequent...
Kličke je bio voljan da da automobil.
Kličke was willing to give the car.
Mislio sam da ti tvoja gospođa ne da da je viđaš.
I thought that your wife does not let you see her.
![Page 25: On the Ambiguity of Serbian Texts and Methods to disambiguate it](https://reader035.fdocuments.in/reader035/viewer/2022081603/5681581b550346895dc581d9/html5/thumbnails/25.jpg)
25
Thank you!