School of Computing, Faculty of Engineering
Linguistically Informed and Corpus Informed
Morphological Analysis of Arabic
Majdi Sawalha & Eric Atwell
School of Computing, University of Leeds, Leeds, LS2 9JT, UK
sawalha@comp.leeds.ac.uk, eric@comp.leeds.ac.uk
Outline
Introduction
Arabic Morphological Analyzers
• Arabic Corpora & Lexicons
• Analytical Study of Tri-literal Roots of Arabic
• Specifications of the Morphological Analyzer
Morphological Features of Arabic Words and Tag Set
Evaluation and Results
• Gold Standard for Evaluation
• Morphochallenge 2009 Qur’an Gold Standard
Introduction
Methodologies for developing a robust Arabic morphological analyzer
• Syllable-based Morphology (SBM)
• Root-Pattern Methodology
• Lexeme-based Morphology
• Stem-based Arabic lexicon with grammar and lexis specifications
• Using tagged corpora and computer algorithms to build a morphological database of the tagged words
Roots, stems, patterns and affixes are pre-stored. Grammar and linguistic information are encoded in the analyzers.
Arabic Morphological Analyzers
Buckwalter Morphological Analyzer
• Uses pre-stored dictionaries of words, stems and affixes constructed manually.
Khoja’s Stemmer
• Removes the longest prefix and suffix of the word,
• Matches the processed word with lists of noun and verb patterns to extract the correct root of the word.
Al-Shalabi et al.
• Depends on mathematical calculations of weights assigned to the letters of the word,
• The algorithm selects the letters with lower weights as root letters.
Comparative Evaluation of Arabic Morphological Analyzers
Studying freely available morphological analyzers and stemmers.
Developing a gold standard for evaluation.
Results:
• More work is needed on the development of morphological analysis of Arabic.
• We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.
Arabic Corpora
The Qur’an
78,000 tokens, 19,000 vowelized word types, 15,000 non-vowelized word types.
The Corpus of Contemporary Arabic (CCA)
A Modern Standard Arabic text corpus of 1 million words.
The Penn Arabic Treebank
734 files, 166,000 words of written Modern Standard Arabic.
The text of 15 traditional Arabic lexicons as corpora.
About 11 million words and 2 million word types of both modern and classical Arabic text.
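The token and word-type figures above depend on whether short vowels are kept. The following is a minimal sketch of how such counts could be computed, assuming whitespace tokenization and treating the Unicode marks U+064B to U+0652 as the diacritics to strip. The function names and the sample string are illustrative and are not taken from the corpora.

```python
# Arabic short vowels, tanween, shaddah and sukun (U+064B..U+0652).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def strip_diacritics(word):
    """Return the word without short-vowel marks (its non-vowelized form)."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def corpus_counts(text):
    """Count tokens, vowelized word types and non-vowelized word types."""
    tokens = text.split()  # assumption: tokens are separated by whitespace
    return {
        "tokens": len(tokens),
        "vowelized_types": len(set(tokens)),
        "non_vowelized_types": len({strip_diacritics(t) for t in tokens}),
    }


# Illustrative sample: four tokens, three vowelized types, one bare type.
sample = "كَتَبَ كُتُب كَتَبَ كُتِبَ"
counts = corpus_counts(sample)
print(counts)
```

Stripping the diacritics merges differently vowelized forms of the same letter sequence, which is why a corpus has fewer non-vowelized than vowelized word types.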
Arabic Lexicons
Methodologies of ordering lexical entries in the Arabic lexicons
• Al-Khalil methodology (lists the lexical entries by the place of articulation of their letters, starting from the farthest in the mouth to the nearest)
• Abi Obaid methodology (lists the lexical entries by similarity in meaning)
• Al-Jawhari methodology (lists the lexical entries by the last letter of the word)
• Al-Barmaki methodology (lists the lexical entries alphabetically)
Arabic Lexicons
A sample of Arabic lexicon
كتب: الكِتابُ: معروف، والجمع كُتُبٌ وكُتْبٌ. كَتَبَ الشيءَ يَكْتُبه كَتْباً وكِتاباً وكِتابةً، وكَتَّبَه: خَطَّه؛ قال أَبو النجم: أَقْبَلْتُ من عِنْدِ زياد كالخَرِفْ، تَخُطُّ رِجْلايَ بخَطٍّ مُخْتَلِفْ، تُكَتِّبانِ في الطَّريقِ لامَ أَلِفْ قال: ورأَيت في بعض النسخِ تِكِتِّبانِ، بكسر التاء، وهي لغة بَهْراءَ، يَكْسِرون التاء، فيقولون: تِعْلَمُونَ ...
k t b: [Alkitab] the book; is well known. The plural forms are [kutubun] and [kutbun]. [kataba Alshay’] He wrote something, [yaktubuhu] the action of writing something. [katban], [kitaban] and [kitabatan] mean the art of writing. And [kattabahu], writing it, means drawing it up. Abu Al-Najim said: I returned from Ziyad's place [after meeting him] like a senile man, my legs drawing different lines (meaning walking in an irregular way); they wrote [tukattibani] on the road the letters Lam Alif (describing how crazily and irregularly he was walking). He said: I saw in another version the word “they wrote” as [tikittibani], with the short vowel kasrah on the first letter [taa], as used in the dialect of Bahraa’ [an Arab tribe]. They say: [ti’lamuwn] (you know).
A sample of Arabic-English Dictionary by Edward Lane
Analytical Study of Tri-literal Roots of Arabic
Tri-literal roots were classified into 3 main groups and 22 detailed groups.
Experiment 1: Qur’an words derived from tri-literal roots were analyzed (45,534 words and 1,610 tri-literal roots).
Qur’an tokens:
• Intact, 61.06%
• Defective, 32.12%
• Compound, 6.82%
Tri-literal roots of Qur’an:
• Intact, 1,097, 68.14%
• Defective, 468, 29.07%
• Compound, 45, 2.80%
Analytical Study of Tri-literal Roots of Arabic
Experiment 2: Word types of a broad lexical resource constructed by analyzing 15 Arabic lexicons, which contains 376,167 word types.
Word types of broad-lexical resource:
• Intact, 68.25%
• Defective, 29.42%
• Compound, 2.33%
Roots of broad-lexical resource:
• Intact, 5,368, 63.14%
• Defective, 2,825, 33.23%
• Compound, 309, 3.63%
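The root percentages in both experiments follow directly from the root counts. As a quick check, the sketch below recomputes each group's share from the counts reported on the slides; the function name `shares` is an illustrative choice.

```python
def shares(counts):
    """Return each group's share of the total, in percent, to two decimals."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


# Root counts as reported for the Qur'an and for the broad lexical resource.
quran_roots = {"Intact": 1097, "Defective": 468, "Compound": 45}
lexicon_roots = {"Intact": 5368, "Defective": 2825, "Compound": 309}

quran_shares = shares(quran_roots)
lexicon_shares = shares(lexicon_roots)
print(sum(quran_roots.values()), quran_shares)
print(sum(lexicon_roots.values()), lexicon_shares)
```

The Qur'an counts add up to the 1,610 tri-literal roots stated for Experiment 1, and the recomputed shares agree with the reported percentages.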
Specifications of the Morphological Analyzers - Inputs
Input: single words or text (fully vowelized, partially vowelized, or non-vowelized)
Tokenization: Arabic word, number, currency or punctuation mark.
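A minimal sketch of the tokenization step, assuming a regular-expression tokenizer and Unicode general categories to tell the four token classes apart. The character ranges, the class names and the `tokenize` function are assumptions for illustration, not the analyzer's actual implementation.

```python
import re
import unicodedata

# Assumed ranges: Arabic letters and diacritics (U+0621..U+0652) plus
# superscript alef (U+0670) and alef wasla (U+0671).
ARABIC = r"[\u0621-\u0652\u0670\u0671]+"
NUMBER = r"\d+(?:[.,]\d+)*"
TOKEN = re.compile(ARABIC + "|" + NUMBER + r"|\S")


def classify(token):
    """Label a token as an Arabic word, number, currency or punctuation mark."""
    if re.fullmatch(ARABIC, token):
        return "arabic_word"
    if re.fullmatch(NUMBER, token):
        return "number"
    if all(unicodedata.category(ch) == "Sc" for ch in token):
        return "currency"
    if all(unicodedata.category(ch).startswith("P") for ch in token):
        return "punctuation"
    return "other"


def tokenize(text):
    """Split text into tokens and pair each token with its class."""
    return [(tok, classify(tok)) for tok in TOKEN.findall(text)]


tokens = tokenize("كتب 12.5 $ .")
print(tokens)
```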
Processing Arabic words:
• Resolving doubled letters marked with shaddah: وَصَّى waS~aY becomes وَصْصَى waSoSaY
• Resolving the extension (maddah): آمَنُوا |manuwA becomes ءَامَنُوا ‘AmanuwA
Only one short vowel may appear on any letter of an Arabic word, so each letter is followed by one short-vowel slot ("-" marks a letter without a short vowel):
Position    1  2  3  4  5  6  7  8  9  10 11 12
waSoSaY     w  a  S  o  S  a  Y  -
‘AmanuwA    ‘  -  A  -  m  a  n  u  w  -  A  -
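The two normalization steps and the letter/vowel position layout can be sketched on Buckwalter transliteration as follows. This is an illustration under stated assumptions: the shaddah mark `~` is assumed to follow its consonant directly, the ASCII apostrophe stands for hamza, and the function names are hypothetical.

```python
# Buckwalter marks that sit on a letter: short vowels, sukun and tanween.
MARKS = set("aiuoFNK")


def resolve_shaddah(bw):
    """Expand 'X~' into 'XoX': the doubled letter is written twice,
    the first copy carrying sukun (o)."""
    out = []
    for ch in bw:
        if ch == "~" and out:
            out.extend(["o", out[-1]])
        else:
            out.append(ch)
    return "".join(out)


def resolve_maddah(bw):
    """Expand alif maddah '|' into hamza followed by alif."""
    return bw.replace("|", "'A")


def to_slots(bw):
    """Lay the word out as letter, mark, letter, mark, ... with '-' for
    a letter that carries no mark."""
    slots = []
    for ch in bw:
        if ch in MARKS and slots and slots[-1] == "-":
            slots[-1] = ch  # fill the empty mark slot of the last letter
        else:
            slots.extend([ch, "-"])
    return slots


print(resolve_shaddah("waS~aY"))           # waSoSaY
print(resolve_maddah("|manuwA"))           # 'AmanuwA
print(" ".join(to_slots("'AmanuwA")))
```

Because every letter owns exactly one mark slot, the six letters of the second example occupy twelve positions.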
Stop Words (Unambiguous Words)
A stop word has only one morphological analysis wherever it appears in the text.
About 40% of the tokens of any text are stop words.
The system contains a list of 1,368 stop words.
Personal pronouns: أنا “>nA” I, هي “hy” she
Relative pronouns: الذي “Al*y” who (sm), التي “Alty” who (sf)
Demonstrative pronouns: هذا “h*A” this (sm), هذه “h*h” this (sf)
Prepositions: في “fy” in, على “ElY” on, إلى “<lY” to
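Since each stop word has a single analysis, it can be resolved by a direct table lookup before any further processing. The sketch below assumes a dictionary keyed by standard Buckwalter transliteration (where `>` is أ and `<` is إ) and holds only the nine examples listed above; the real list has 1,368 entries, and the function names are illustrative.

```python
# Tiny excerpt of a stop-word table: form -> (category, gloss).
STOP_WORDS = {
    ">nA": ("personal pronoun", "I"),
    "hy": ("personal pronoun", "she"),
    "Al*y": ("relative pronoun", "who (sm)"),
    "Alty": ("relative pronoun", "who (sf)"),
    "h*A": ("demonstrative pronoun", "this (sm)"),
    "h*h": ("demonstrative pronoun", "this (sf)"),
    "fy": ("preposition", "in"),
    "ElY": ("preposition", "on"),
    "<lY": ("preposition", "to"),
}


def analyze_stop_word(token):
    """Return the single stored analysis, or None if the token is not a stop word."""
    return STOP_WORDS.get(token)


def stop_word_ratio(tokens):
    """Fraction of tokens that are resolved by the stop-word table."""
    return sum(t in STOP_WORDS for t in tokens) / len(tokens)


# Made-up token sequence, for illustration only.
ratio = stop_word_ratio(["fy", "ktb", "ElY", "byt", "Alwld"])
print(analyze_stop_word("fy"), ratio)
```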
Clitics, Prefixes and Suffixes
Proclitics, prefixes, suffixes and enclitics were collected from traditional Arabic grammar books.
Clitics and affixes lists were checked using four Arabic corpora:
• The Qur’an
• The Corpus of Contemporary Arabic (CCA)
• The Penn Arabic Treebank
• The text of the 15 traditional Arabic lexicons as a corpus
Clitics, Prefixes and Suffixes
215 Proclitics & Prefixes
Prefix | Example | TagP1 | TagP2 | TagP3
wAl وال | wAlsmA’ والسماء | p--t--------------- و | r---d-----d-------- ال |
fst فست | fst*krwn فستذكرون | p--t--------------- ف | p--i--------------- س | r---s-nus---------- ت
127 Suffixes & Enclitics
Suffix | Example | TagP1 | TagP2 | TagP3
ywn يون | AlHwArywn الحواريون | r---j-------------- ي | r---l-mp-n---?----? ون |
tmwhA تموهما | >wrvtmwhA أورثتموها | r---&-mps??----h--- تم | r---l-mp-n---?----? و | r---&-ndt??----h--- هما
Clitics, Prefixes and Suffixes
Words are divided into three parts of different sizes.
The first part is searched for in the proclitics & prefixes list.
The third part is searched for in the suffixes & enclitics list.
Analyzed word: يَعْمَلُونَ yaEomaluwna
First part | Second part | Third part | Prefixes & suffixes analysis
y ي | Eml عمل | wn ون | Candidate analysis
yE يع | m م | lwn لون | Not accepted
(empty) | yEmlw يعملو | n ن | Not accepted
(empty) | yEmlwn يعملون | (empty) | Candidate analysis
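The three-part division can be sketched as an enumeration of every split of the word, keeping only the splits whose first part is a known prefix and whose third part is a known suffix. The two affix sets below are tiny hypothetical stand-ins for the 215-entry and 127-entry lists, and the empty string represents a word with no prefix or no suffix.

```python
# Hypothetical miniature affix lists; "" means "no affix".
PREFIXES = {"", "y"}
SUFFIXES = {"", "wn"}


def split_three(word):
    """Yield every (first, second, third) division of the word."""
    n = len(word)
    for i in range(n + 1):
        for j in range(i, n + 1):
            yield word[:i], word[i:j], word[j:]


def candidate_analyses(word):
    """Keep divisions whose outer parts are listed affixes and whose
    middle part is not empty."""
    return [
        (first, second, third)
        for first, second, third in split_three(word)
        if second and first in PREFIXES and third in SUFFIXES
    ]


candidates = candidate_analyses("yEmlwn")
for parts in candidates:
    print(parts)
```

With these lists, the non-vowelized word yEmlwn yields four candidate analyses, while divisions such as yE + m + lwn are rejected because lwn is not a listed suffix.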
Root or Stem
The system uses a list of about 12,000 roots extracted by analyzing 15 traditional Arabic language lexicons
The second part of the word is searched for in the root list.
Analyzed word: يَعْمَلُونَ yaEomaluwna
First part | Second part | Third part | Affixes analysis | Affixes and root analysis
y ي | Eml عمل | wn ون | Candidate analysis | Accepted analysis
y ي | Emlwn عملون | (empty) | Candidate analysis | Not accepted analysis
(empty) | yEml يعمل | wn ون | Candidate analysis | Not accepted analysis
(empty) | yEmlwn يعملون | (empty) | Candidate analysis | Not accepted analysis
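The root check can be sketched as a filter over the candidate analyses that survived the affix check. The root set here is a hypothetical three-entry stand-in for the list of about 12,000 roots, and the status labels are illustrative.

```python
# Hypothetical miniature root list (Buckwalter transliteration).
ROOTS = {"Eml", "ktb", "rHm"}

# Candidate analyses of yEmlwn after the affix check: (first, second, third).
CANDIDATES = [
    ("y", "Eml", "wn"),
    ("y", "Emlwn", ""),
    ("", "yEml", "wn"),
    ("", "yEmlwn", ""),
]


def check_roots(candidates, roots):
    """Label each candidate by whether its second part is a listed root."""
    return [
        (cand, "accepted" if cand[1] in roots else "not accepted")
        for cand in candidates
    ]


results = check_roots(CANDIDATES, ROOTS)
accepted = [cand for cand, status in results if status == "accepted"]
print(accepted)
```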
Word Pattern
Different words are derived from their roots using certain patterns.
Derived words inherit the morphological features of their derivation patterns.
The system has a list of patterns which are extracted from traditional Arabic language grammar books.
• 2,730 verb patterns
• 985 noun patterns
• Morphological-feature POS tags are assigned to each pattern in the list.
• Patterns are fully vowelized.
Verb Patterns | POS Tag
فَعَلْتَ faEalota | v-p---mss---an?-st?-
فَعَلْنَا faEalonaA | v-p---npf---an?-st?-
فَعَلْتُ faEalotu | v-p---nsf---an?-st?-
Noun Patterns | POS Tag
فاعُولاء fAEuwlA’ | nw----??-??----?qt-?
اِفْعِيلال AifoEiylAl | nw----??-??----?qt-?
أُفْعُلاوَى >ufoEulAwaY | nw----??-??----?qt-?
Pattern Matching Algorithms
First algorithm: depends on the word and its root as inputs.
• The root letters of the word are replaced by the letters (Fa’, Ain, Lam, [Lam]) (ف ، ع ، ل ، [ل]).
Replacing the root letters is not an easy task!
Second algorithm: depends on a pre-stored list of patterns.
• Searches the pattern list for patterns of the same size as the analyzed word, after removing its affixes.
• E.g.: the word كتب ktb matches the following patterns:
  فَعْل faEol, فَعَل faEal, فَعُل faEul, فَعِل faEil, فُعْل fuEol, فُعَل fuEal, فُعُل fuEul, فُعِل fuEil, فِعْل fiEol, فِعِل fiEil
• Replaces the letters (Fa’, Ain, Lam, [Lam]) (ف ، ع ، ل ، [ل]) of the pattern with the corresponding letters of the word.
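A minimal sketch of the second algorithm on Buckwalter transliteration, assuming that the pattern letters f, E and l are always placeholders for root letters and that every other pattern letter must equal the word letter at the same position. This simplification, the ten-pattern list and the function names are assumptions for illustration.

```python
MARKS = set("aiuo~FNK")   # Buckwalter short vowels, sukun, shaddah, tanween
PLACEHOLDERS = set("fEl")  # assumed stand-ins for the root letters


def skeleton(pattern):
    """Return the pattern without its vowel marks."""
    return "".join(ch for ch in pattern if ch not in MARKS)


def apply_pattern(word, pattern):
    """Vowelize a non-vowelized word with a pattern of the same size,
    or return None if the pattern does not fit."""
    if len(skeleton(pattern)) != len(word):
        return None
    letters = iter(word)
    out = []
    for ch in pattern:
        if ch in MARKS:
            out.append(ch)          # copy the vowel mark from the pattern
            continue
        letter = next(letters)
        if ch not in PLACEHOLDERS and ch != letter:
            return None             # fixed pattern letter does not match
        out.append(letter)
    return "".join(out)


PATTERNS = ["faEol", "faEal", "faEul", "faEil", "fuEol",
            "fuEal", "fuEul", "fuEil", "fiEol", "fiEil"]

vowelized = [apply_pattern("ktb", p) for p in PATTERNS]
print(vowelized)
print(apply_pattern("yEmlwn", "yafoEaluwna"))
```

Each matching pattern produces one candidate vowelization of the word, and the tag stored with the pattern can then be attached to that candidate.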
Word Pattern: The second algorithm (Example)
Analyzed word: يَعْمَلُونَ yaEomaluwna
Matched Patterns | Tag
يُفْعَلُونَ yufoEaluwna | v-c---mpt--ipn?-tt?
يُفْعِلُونَ yufoEiluwna | v-c---mpt--ipn?-at?
يَفْعَلُونَ yafoEaluwna | v-c---mpt--ian?-st?
يَفْعِلُونَ yafoEiluwna | v-c---mpt--ian?-st?
يَفْعُلُونَ yafoEuluwna | v-c---mpt--ian?-st?
Vowelization
Helps in determining some morphological features of the words.
Analyzed word: كتب ktb
Pattern | Vowelization
فَعْل faEol | كَتْب katob
فَعَل faEal | كَتَب katab
فَعُل faEul | كَتُب katub
فَعِل faEil | كَتِب katib
فُعْل fuEol | كُتْب kutob
فُعَل fuEal | كُتَب kutab
فُعُل fuEul | كُتُب kutub
فُعِل fuEil | كُتِب kutib
فِعْل fiEol | كِتْب kitob
فِعِل fiEil | كِتِب kitib
Morphological Features of Arabic Words and Tag Set
The part-of-speech tag set is designed following the traditional grammar classifications.
The tag set has 22 morphological features of Arabic words.
Each tag consists of 22 characters.
E.g.
• v at the first position indicates verb, n at the second position indicates proper name. At the seventh position, m indicates masculine and f indicates feminine.
• "-" is used if a certain feature is not applicable to the tagged word.
• "?" is used if a certain feature applies to the word, but its value is not available at the moment or the automatic tagger could not guess it.
http://www.comp.leeds.ac.uk/sawalha/tagset
Morphological Features of Arabic Words and Tag Set
Position | Morphological Features Categories | Arabic
1 | Main POS | أقسام الكلام الرئيسية
2 | POS of Noun | أقسام فرعية (الاسم)
3 | POS of Verb | أقسام فرعية (الفعل)
4 | POS of Particle | أقسام فرعية (الحرف)
5 | Residuals | أقسام فرعية (أخرى)
6 | Punctuations | علامات الترقيم
7 | Gender | الجنس
8 | Number | العدد
9 | Person | الشخص
10 | Morphology | الصرف
11 | Case and Mood | الحالة الإعرابية للاسم أو الفعل
12 | Case and Mood marks | علامة الإعراب أو البناء
13 | Definiteness | المعرفة والنكرة
14 | Voice | المبني للمعلوم والمبني للمجهول
15 | Emphasize | المؤكد وغير المؤكد
16 | Transitivity | اللازم والمتعدي
17 | Humanness | العاقل وغير العاقل
18 | Variability & Conjugation | التصريف
19 | Augmented & Unaugmented | المجرد والمزيد
20 | Root letters | عدد أحرف الجذر
21 | Verb Internal Structure | بنية الفعل
22 | Noun finals | أقسام الاسم تبعاً للفظ آخره
http://www.comp.leeds.ac.uk/sawalha/tagset
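A positional tag of this kind can be decoded by pairing each character with the feature category of its position. The sketch below uses the 22 category names from the table above; the helper `build_tag`, the returned dictionary layout and the example tag are illustrative assumptions, and the meaning of individual value codes is left to the published tag set.

```python
FEATURES = [
    "Main POS", "POS of Noun", "POS of Verb", "POS of Particle", "Residuals",
    "Punctuations", "Gender", "Number", "Person", "Morphology",
    "Case and Mood", "Case and Mood marks", "Definiteness", "Voice",
    "Emphasize", "Transitivity", "Humanness", "Variability & Conjugation",
    "Augmented & Unaugmented", "Root letters", "Verb Internal Structure",
    "Noun finals",
]


def build_tag(values):
    """Build a 22-character tag from {position (1-based): code};
    unspecified positions are filled with '-' (not applicable)."""
    chars = ["-"] * len(FEATURES)
    for position, code in values.items():
        chars[position - 1] = code
    return "".join(chars)


def decode(tag):
    """Map feature names to codes, skipping '-' and reporting '?' as unknown."""
    if len(tag) != len(FEATURES):
        raise ValueError("expected a tag of %d characters" % len(FEATURES))
    return {
        name: ("unknown" if code == "?" else code)
        for name, code in zip(FEATURES, tag)
        if code != "-"
    }


# Hypothetical tag: a verb (v at position 1), masculine (m at position 7),
# with a number value that the tagger could not determine.
tag = build_tag({1: "v", 7: "m", 8: "?"})
decoded = decode(tag)
print(tag, decoded)
```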
Morphological Features of Arabic Words and Tag Set
Sample of tagged document using the morphological feature Tag Set
وَوَصَّيْنَا الْإِنسَانَ بِوَالِدَيْهِ حُسْنًا
We have recommended that a person must take good care of their parents.
Word | Transliteration | Gloss | Tag
وَ | wa | and | p--t-----a---------
وَصَّيْ | waS~ayo | recommended | v-p---npf--iano-at&
نَا | naA | we | p--&---p-n---------
الْإِنسَانَ | Alo<insaAna | human | nq----np-ad-----bt-
بِ | bi | to | p--r-----g---------
وَالِدَيْ | waAlidayo | parents | nw----nd-gd-----at-
هِ | hi | his | p--&-----g---------
حُسْنًا | HusonAF | well | no----nu-ai-----st-
Evaluation and Results: Gold Standard for Evaluation
Gold standards are used to evaluate and measure the actual accuracy of automatic systems.
To construct a gold standard for evaluation, we need to determine:
• The Problem Domain
  • Evaluating morphological analyzers and part-of-speech taggers.
• The Corpora
  • Corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text.
  • Two versions of the Qur’an text: vowelized Qur’an text and non-vowelized Qur’an text.
  • The Corpus of Contemporary Arabic (Al-Sulaiti & Atwell, 2006).
Gold Standard for Evaluation
Gold Standard Format
• Includes morphological and part-of-speech information for each word of the gold standard, one word per line with the fields separated by tabs.
• Contains the root and the pattern information of the words.
• The gold standard will be stored in flat text files using Unicode UTF-8 encoding, or in XML.
Gold Standard Size
• It must be relatively large.
• It should cover most cases that morphological analyzers have to handle.
• It is measured by the number of words it contains.
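A minimal sketch of how a tab-separated gold standard could be used to score an analyzer, assuming one word per line with the columns word, root and pattern in that order. The column order, the function names and the predicted roots are assumptions for illustration; the four rows reuse the non-vowelized Qur’an sample that appears in these slides.

```python
# One word per line: word <TAB> root <TAB> pattern (assumed column order).
GOLD_TEXT = (
    "bsm\tsm\tNone\n"
    "Allh\tNone\tNone\n"
    "AlrHm_n\trHm\tfElAn\n"
    "AlrHym\trHm\tfEyl\n"
)


def load_gold(text):
    """Parse tab-separated gold-standard lines into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        word, root, pattern = line.split("\t")[:3]
        rows.append({"word": word, "root": root, "pattern": pattern})
    return rows


def accuracy(gold_rows, predictions, field):
    """Share of words whose predicted value equals the gold value."""
    correct = sum(
        1 for row, guess in zip(gold_rows, predictions) if row[field] == guess
    )
    return correct / len(gold_rows)


gold = load_gold(GOLD_TEXT)
predicted_roots = ["sm", "None", "rHm", "rHy"]  # made-up analyzer output
score = accuracy(gold, predicted_roots, "root")
print(score)
```

Measuring accuracy per field keeps root extraction and pattern assignment separately comparable across analyzers.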
Morphochallenge 2009 Gold Standard
http://www.cis.hut.fi/morphochallenge2009/
MorphoChallenge aims to develop an unsupervised morphological analyzer to be used for different languages including Arabic.
A gold standard of the Qur’an has been constructed to evaluate morphological analyzers in the Morphochallenge 2009 competition.
• Its size is 78,004 words.
• It contains the full morphological analysis for each word, according to the morphological analysis of the Qur’an in the tagged database of the Qur’an developed at the University of Haifa (Dror et al., 2004).
Morphochallenge 2009 Qur’an Gold Standard
Word | Root | Pattern | Morphological analysis
Vowelized Arabic script:
بِسْمِ | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
اللَّهِ | None | None | للَاه+Noun+ProperName+Gen+Def
الرَّحْمـَنِ | رحم | فَعلَان | رَحمَان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
الرَّحِيمِ | رحم | فَعِيل | رَحِيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Non-vowelized Arabic script:
بسم | سم | None | ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen
الله | None | None | للاه+Noun+ProperName+Gen+Def
الرحمـن | رحم | فعلان | رحمان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
الرحيم | رحم | فعيل | رحيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Vowelized transliteration:
bisomi | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
All~hi | None | None | llaah+Noun+ProperName+Gen+Def
Alr~aHom_ani | rHm | faElaAn | raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Alr~aHiymi | rHm | faEiyl | raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
Non-vowelized transliteration:
bsm | sm | None | b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen
Allh | None | None | llAh+Noun+ProperName+Gen+Def
AlrHm_n | rHm | fElAn | rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
AlrHym | rHm | fEyl | rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def
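The sample lines can be read mechanically: the first three fields are the word, its root and its pattern, and the rest is a comma-separated list of morphemes, each written as a form followed by `+`-joined feature labels. The sketch below assumes whitespace-separated fields as they appear on the slide and treats the literal string `None` as a missing value; the function names are illustrative.

```python
def parse_analysis(analysis):
    """Split 'b+Prep , sm+Noun+...' into [(form, [features]), ...]."""
    morphemes = []
    for part in analysis.split(","):
        part = part.strip()
        if not part:
            continue  # ignore an empty piece left by a trailing comma
        form, *features = part.split("+")
        morphemes.append((form, features))
    return morphemes


def parse_line(line):
    """Parse one gold-standard line into word, root, pattern and morphemes."""
    word, root, pattern, analysis = line.split(None, 3)
    return {
        "word": word,
        "root": None if root == "None" else root,
        "pattern": None if pattern == "None" else pattern,
        "morphemes": parse_analysis(analysis),
    }


entry = parse_line("bsm sm None b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen ,")
print(entry)
```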
Thank you!
Questions?