School of Computing, Faculty of Engineering. CL 2009, 17/07/09.

Linguistically Informed and Corpus Informed

Morphological Analysis of Arabic

Majdi Sawalha & Eric Atwell

School of Computing, University of Leeds, Leeds, LS2 9JT, UK

sawalha@comp.leeds.ac.uk, eric@comp.leeds.ac.uk

Outline

Introduction

Arabic Morphological Analyzers

• Arabic Corpora & Lexicons

• Analytical Study of Tri-literal Roots of Arabic

• Specifications of the Morphological Analyzer

Morphological Features of Arabic Words and Tag Set

Evaluation and Results

• Gold Standard for Evaluation

• Morphochallenge 2009 Qur’an Gold Standard

Introduction

Methodologies for developing a robust Arabic morphological analyzer

• Syllable-based Morphology (SBM)

• Root-Pattern Methodology

• Lexeme-based Morphology

• Stem-based Arabic lexicon with grammar and lexis specifications

• Using tagged corpora and computer algorithms to build morphological database of the tagged words

Roots, stems, patterns and affixes are pre-stored. Grammatical and linguistic information is encoded in the analyzers.

Arabic Morphological Analyzers

Buckwalter Morphological Analyzer

• Uses pre-stored dictionaries of words, stems and affixes constructed manually.

Khoja’s Stemmer

• Removes the longest prefix and suffix of the word,

• Matches the processed word with lists of noun and verb patterns to extract the correct root of the word.

Al-Shalabi et al

• Depends on mathematical calculations of weights assigned to the letters of the word,

• The algorithm selects the letters with lower weights as root letters.

Comparative Evaluation of Arabic Morphological Analyzers

Studying freely available morphological analyzers and stemmers.

Developing a gold standard for evaluation.

Results:

• More work is needed on the development of morphological analysis of Arabic.

• We cannot rely on such analyzers for further analysis such as part-of-speech tagging and parsing.

Arabic Corpora

The Qur’an

78,000 tokens, 19,000 vowelized word types, 15,000 non-vowelized word types.

The Corpus of Contemporary Arabic (CCA)

A Modern Standard Arabic text corpus consisting of 1 million words.

The Penn Arabic Treebank

734 files, 166,000 words of written Modern Standard Arabic.

The text of 15 traditional Arabic lexicons as corpora.

About 11 million words and 2 million word types of both modern and classical Arabic text.

Arabic Lexicons

Methodologies of ordering lexical entries in the Arabic lexicons

• Al-Khalil methodology (listed the lexical entries based on the pronunciation of the letters, starting from the farthest point in the mouth to the nearest).

• Abi Obaid methodology (listed the lexical entries based on similarity in meaning).

• Al-Jawhari methodology (listed the lexical entries based on the last letter of the word).

• Al Barmaki methodology (listed the lexical entries alphabetically).

Arabic Lexicons

A sample of Arabic lexicon

كتب: الكِتابُ: معروف، والجمع كُتُبٌ وكُتْبٌ. كَتَبَ الشيءَ يَكْتُبه كَتْباً وكِتاباً وكِتابةً، وكَتَّبَه: خَطَّه؛ قال أَبو النجم:

أَقْبَلْتُ من عِنْدِ زياد كالخَرِفْ، تَخُطُّ رِجْلايَ بخَطٍّ مُخْتَلِفْ، تُكَتِّبانِ في الطَّريقِ لامَ أَلِفْ

قال: ورأَيت في بعض النسخِ تِكِتِّبانِ، بكسر التاء، وهي لغة بَهْراءَ، يَكْسِرون التاء، فيقولون: تِعْلَمُونَ ...

k t b: [Alkitab] the book; it is well known. The plural forms are [kutubun] and [kutbun]. [kataba Alshay’] he wrote something; [yaktubuhu] the action of writing something. [katban], [kitaban] and [kitabatan] mean the art of writing. [kattabahu], writing it, means drawing it up. Abu Al-Najim said: I came back from Ziyad's place [after meeting him] like a senile man, my legs drawing different lines (meaning he walked in an unusual way); they write [tukattibani] the letters Lam Alif on the road (describing how erratically he was walking). He said: I saw in another version the word "they write" as [tikittibani], with the short vowel kasrah on the first letter [taa], as used in the dialect of Bahraa’ [an Arab tribe]. They say: [ti’lamuwn] (you know).

A sample of Arabic-English Dictionary by Edward Lane

Analytical Study of Tri-literal Roots of Arabic

Tri-literal roots were classified into 3 main groups and 22 detailed groups.

Experiment 1: Qur’an words derived from tri-literal roots were analyzed (45,534 words and 1,610 tri-literal roots).

Qur’an tokens: Intact 61.06%, Defective 32.12%, Compound 6.82%

Tri-literal roots of Qur’an: Intact 1,097 (68.14%), Defective 468 (29.07%), Compound 45 (2.80%)

Analytical Study of Tri-literal Roots of Arabic

Experiment 2:

Word types of a broad lexical resource constructed by analyzing 15 Arabic lexicons, which contains 376,167 word types.

Word types of broad-lexical resource: Intact 68.25%, Defective 29.42%, Compound 2.33%

Roots of broad-lexical resource: Intact 5,368 (63.14%), Defective 2,825 (33.23%), Compound 309 (3.63%)
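The root percentages reported for the two experiments can be recomputed from the root counts on the slides. The sketch below is only an arithmetic check; the function name `shares` is chosen here for illustration and is not part of the analyzer.

```python
def shares(counts):
    """Return each category's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}

# Root counts as reported on the slides.
quran_roots = {"Intact": 1097, "Defective": 468, "Compound": 45}        # Experiment 1
lexicon_roots = {"Intact": 5368, "Defective": 2825, "Compound": 309}    # Experiment 2

quran_total = sum(quran_roots.values())
lexicon_total = sum(lexicon_roots.values())
quran_shares = shares(quran_roots)
lexicon_shares = shares(lexicon_roots)

print(quran_total, quran_shares)
print(lexicon_total, lexicon_shares)
```

The Experiment 1 counts add up to the 1,610 tri-literal roots stated for the Qur’an.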

Specifications of the Morphological Analyzers - Inputs

Input: single words or text (fully vowelized, partially vowelized, or non-vowelized)

Tokenization: Arabic word, number, currency or punctuation mark.
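A minimal sketch of such a tokenizer, assuming a regular-expression approach (the slides do not say how tokenization is implemented). The character ranges, the group names and the function `tokenize` are illustrative assumptions; anything that is not an Arabic word, a number, a currency sign or a punctuation mark is returned character by character as "other".

```python
import re
import unicodedata

# Arabic letters, tatweel and short-vowel marks; Western and Arabic-Indic digits.
TOKEN_RE = re.compile(
    r"(?P<arabic_word>[\u0621-\u063A\u0640-\u0652\u0670\u0671]+)"
    r"|(?P<number>[0-9\u0660-\u0669]+(?:[.,\u066B\u066C][0-9\u0660-\u0669]+)*)"
    r"|(?P<other>\S)"
)

def tokenize(text):
    """Split text into (token, kind) pairs; whitespace is skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token, kind = match.group(), match.lastgroup
        if kind == "other":
            category = unicodedata.category(token)
            if category == "Sc":              # Unicode currency symbols
                kind = "currency"
            elif category.startswith("P"):    # Unicode punctuation classes
                kind = "punctuation"
        tokens.append((token, kind))
    return tokens

sample = tokenize("\u0641\u064a 2009\u060c $5")
print(sample)
```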

Processing Arabic words:

• Resolving doubled letters marked with Shaddah: وَصَّى waS~aY becomes وَصْصَى waSoSaY

• Resolving the extension (Maddah): آمَنُوا |manuwA becomes ءامَنُوا 'AmanuwA

Only one short vowel might appear on any letter of the Arabic word, so each letter is followed by one short-vowel position ("-" when empty):

Position:              1  2  3  4  5  6  7  8  9  10 11 12
ءامَنُوا 'AmanuwA:     '  -  A  -  m  a  n  u  w  -  A  -
وَصْصَى waSoSaY:       w  a  S  o  S  a  Y  -
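The two normalization steps and the letter/vowel positions can be sketched on Buckwalter-transliterated input. This is an illustration under stated assumptions, not the analyzer's own code: the function names are invented here, the Shaddah sign "~" is assumed to follow its letter directly, and the Maddah is assumed to appear only as "|".

```python
# Buckwalter symbols: "~" Shaddah, "|" Alef with Maddah, "'" Hamza, "A" Alef,
# "o" Sukun; a, u, i are short vowels and F, N, K are tanween marks.
DIACRITICS = set("aiuoFNK")

def resolve_shaddah(word):
    """Expand X~ into XoX: the first copy takes Sukun, the second keeps the vowel."""
    out = []
    for ch in word:
        if ch == "~" and out:
            out.extend(["o", out[-1]])
        else:
            out.append(ch)
    return "".join(out)

def resolve_maddah(word):
    """Rewrite Alef with Maddah as Hamza followed by Alef."""
    return word.replace("|", "'A")

def normalize(word):
    return resolve_shaddah(resolve_maddah(word))

def to_slots(word):
    """Pair every letter with one short-vowel position ('-' if it has none)."""
    slots, i = [], 0
    while i < len(word):
        mark = "-"
        if i + 1 < len(word) and word[i + 1] in DIACRITICS:
            mark = word[i + 1]
            i += 1
        slots.extend([word[i - 1] if mark != "-" else word[i], mark])
        i += 1
    return slots

print(normalize("waS~aY"), to_slots(normalize("waS~aY")))
print(normalize("|manuwA"), to_slots(normalize("|manuwA")))
```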

Stop Words (Unambiguous Words)

A stop word has only one morphological analysis wherever it appears in the text.

About 40% of the tokens in any text are stop words.

The system contains a list of 1,368 stop words.

Personal pronouns: أنا “>nA” I, هي “hy” she

Relative pronouns: الذي “Al*y” who (sm), التي “Alty” who (sf)

Demonstrative pronouns: هذا “h*A” this (sm), هذه “h*h” this (sf)

Prepositions: في “fy” in, على “ElY” on, إلى “<lY” to
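Because a stop word has a single analysis, it can be looked up directly and skip the rest of the analysis. The sketch below assumes a plain dictionary keyed by the standard Buckwalter transliteration and holds only the examples listed above; the real list has 1,368 entries. The names `STOP_WORDS`, `analyze_token` and the `fallback` callback are hypothetical.

```python
# Sample entries from the slide: transliteration -> (category, gloss).
STOP_WORDS = {
    ">nA": ("personal pronoun", "I"),
    "hy": ("personal pronoun", "she"),
    "Al*y": ("relative pronoun", "who (sm)"),
    "Alty": ("relative pronoun", "who (sf)"),
    "h*A": ("demonstrative pronoun", "this (sm)"),
    "h*h": ("demonstrative pronoun", "this (sf)"),
    "fy": ("preposition", "in"),
    "ElY": ("preposition", "on"),
    "<lY": ("preposition", "to"),
}

def analyze_token(token, fallback):
    """Return the single stored analysis for a stop word, else defer to fallback."""
    entry = STOP_WORDS.get(token)
    if entry is not None:
        return [entry]
    return fallback(token)

# The fallback stands in for the full morphological analysis of other words.
print(analyze_token("fy", lambda t: []))
print(analyze_token("ktb", lambda t: ["full analysis needed"]))
```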

Clitics, Prefixes and Suffixes

Proclitics, prefixes, suffixes and enclitics were collected from traditional Arabic grammar books.

Clitics and affixes lists were checked using four Arabic corpora:

• The Qur’an

• The Corpus of Contemporary Arabic (CCA)

• The Penn Arabic Treebank

• The text of the 15 traditional Arabic lexicons as a corpus


215 Proclitics & Prefixes

Prefix | Example | TagP1 | TagP2 | TagP3
وال wAl | والسماء wAlsmA’ | و w: p--t--------------- | ال Al: r---d-----d-------- | -
فست fst | فستذكرون fst*krwn | ف f: p--t--------------- | س s: p--i--------------- | ت t: r---s-nus----------

127 Suffixes & Enclitics

Suffix | Example | TagP1 | TagP2 | TagP3
يون ywn | الحواريون AlHwArywn | ي y: r---j-------------- | ون wn: r---l-mp-n---?----? | -
تموهما tmwhA | أورثتموها >wrvtmwhA | تم tm: r---&-mps??----h--- | و w: r---l-mp-n---?----? | هما hmA: r---&-ndt??----h---


Words are divided into three parts of different sizes.

The first part is searched for in the proclitics & prefixes list.

The third part is searched for in the suffixes & enclitics list.

Analyzed Word: يَعْمَلُونَ yaEomaluwna

First Part | Second Part | Third Part | Prefixes & Suffixes analyses
يع yE | م m | لون lwn | Not accepted
ي y | عمل Eml | ون wn | Candidate analysis
- | يعملو yEmlw | ن n | Not accepted
- | يعملون yEmlwn | - | Candidate analysis
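The three-part division can be sketched as an enumeration of every split of the word, keeping the splits whose first part is a known prefix and whose third part is a known suffix. The two tiny lists below are hypothetical stand-ins for the 215-entry and 127-entry lists, and the empty string stands for "no affix"; `candidate_splits` is a name chosen for this illustration.

```python
def candidate_splits(word, prefixes, suffixes):
    """Return (first, second, third) splits accepted by the affix lists."""
    candidates = []
    for i in range(len(word)):                  # first part is word[:i]
        for j in range(i + 1, len(word) + 1):   # second part word[i:j] is never empty
            first, second, third = word[:i], word[i:j], word[j:]
            if first in prefixes and third in suffixes:
                candidates.append((first, second, third))
    return candidates

# Hypothetical miniature affix lists (Buckwalter transliteration).
PREFIXES = {"", "y"}
SUFFIXES = {"", "wn"}

splits = candidate_splits("yEmlwn", PREFIXES, SUFFIXES)
for split in splits:
    print(split)
```

With these lists, a split such as (yE, m, lwn) is rejected because neither yE nor lwn is a listed affix.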

Root or Stem

The system uses a list of about 12,000 roots extracted by analyzing 15 traditional Arabic language lexicons

The second part of the word is searched for in the root list.

Analyzed Word: يعملون yEmlwn

First part | Second part | Third Part | Affixes analyses | Affixes and Root analyses
ي y | عمل Eml | ون wn | Candidate analysis | Accepted analysis
ي y | عملون Emlwn | - | Candidate analysis | Not accepted analysis
- | يعمل yEml | ون wn | Candidate analysis | Not accepted analysis
- | يعملون yEmlwn | - | Candidate analysis | Not accepted analysis
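The root check then filters the candidate splits. A minimal sketch, assuming the root list is a set of transliterated roots; the three roots below are a hypothetical sample standing in for the list of about 12,000, and `check_roots` is an illustrative name.

```python
# Hypothetical sample of the root list (Buckwalter transliteration).
ROOTS = {"Eml", "ktb", "rHm"}

def check_roots(candidates, roots):
    """Label each (first, second, third) candidate by whether its second part is a root."""
    results = []
    for candidate in candidates:
        accepted = candidate[1] in roots
        label = "Accepted analysis" if accepted else "Not accepted analysis"
        results.append((candidate, label))
    return results

# The four candidate analyses of yEmlwn that pass the affix check.
candidates = [
    ("y", "Eml", "wn"),
    ("y", "Emlwn", ""),
    ("", "yEml", "wn"),
    ("", "yEmlwn", ""),
]
checked = check_roots(candidates, ROOTS)
accepted = [c for c, label in checked if label == "Accepted analysis"]
print(accepted)
```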

Word Pattern

Different words are derived from their roots using certain patterns.

Derived words inherit the morphological features of their derivation patterns.

The system has a list of patterns which are extracted from traditional Arabic language grammar books.

• 2730 verb patterns

• 985 noun patterns

• Morphological-feature POS tags are assigned to each pattern in the list.

• Patterns are fully vowelized.

Verb Patterns | POS Tag
فَعَلْتَ faEalota | v-p---mss---an?-st?-
فَعَلْنَا faEalonaA | v-p---npf---an?-st?-
فَعَلْتُ faEalotu | v-p---nsf---an?-st?-

Noun Patterns | POS Tag
فاعُولاء fAEuwlA’ | nw----??-??----?qt-?
اِفْعِيلال AifoEiylAl | nw----??-??----?qt-?
أُفْعُلاوَى >ufoEulAwaY | nw----??-??----?qt-?

Pattern Matching Algorithms

First algorithm: depends on the word and its root as inputs.

• The root letters of the word are replaced by the letters (Fa’, Ain, Lam, [Lam]) (ف ، ع ، ل ، [ل]).

Replacing the root letters is not an easy task!

Second algorithm: depends on a pre-stored list of patterns.

• Searches the pattern list for patterns of the same size as the analyzed word, after removing its affixes.

• E.g.: the word كتب ktb matches the following patterns:

فَعْل faEol, فَعَل faEal, فَعُل faEul, فَعِل faEil, فُعْل fuEol, فُعَل fuEal, فُعُل fuEul, فُعِل fuEil, فِعْل fiEol, فِعِل fiEil

• Replaces the letters (Fa’, Ain, Lam, [Lam]) (ف ، ع ، ل ، [ل]) of the pattern with the corresponding letters of the word.

Word Pattern: The second algorithm (Example)

Matched Patterns | Tag
يُفْعَلُونَ yufoEaluwna | v-c---mpt--ipn?-tt?
يُفْعِلُونَ yufoEiluwna | v-c---mpt--ipn?-at?
يَفْعَلُونَ yafoEaluwna | v-c---mpt--ian?-st?
يَفْعِلُونَ yafoEiluwna | v-c---mpt--ian?-st?
يَفْعُلُونَ yafoEuluwna | v-c---mpt--ian?-st?

Vowelization

Helps in determining some morphological features of the words.

Analyzed Word: كتب ktb

Pattern | Vowelization
فَعْل faEol | كَتْب katob
فَعَل faEal | كَتَب katab
فَعُل faEul | كَتُب katub
فَعِل faEil | كَتِب katib
فُعْل fuEol | كُتْب kutob
فُعَل fuEal | كُتَب kutab
فُعُل fuEul | كُتُب kutub
فُعِل fuEil | كُتِب kutib
فِعْل fiEol | كِتْب kitob
فِعِل fiEil | كِتِب kitib
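The second pattern-matching algorithm and the vowelization step can be sketched together on Buckwalter-transliterated strings. This is an assumption-laden illustration rather than the system's implementation: `skeleton` and `vowelize` are invented names, every f, E and l in a pattern is treated as a root placeholder (so a pattern containing a literal l or f would need extra handling), and a repeated [Lam] is not checked for consistency.

```python
DIACRITICS = set("aiuo~FNK")   # Buckwalter short vowels, Sukun, Shaddah, tanween
PLACEHOLDERS = set("fEl")      # Fa', Ain, Lam

def skeleton(pattern):
    """Return the letters of a pattern with its diacritics removed."""
    return [ch for ch in pattern if ch not in DIACRITICS]

def vowelize(word, pattern):
    """Return the vowelized word if it fits the pattern, otherwise None."""
    letters = skeleton(pattern)
    if len(letters) != len(word):
        return None                      # pattern and word differ in size
    for p, w in zip(letters, word):
        if p not in PLACEHOLDERS and p != w:
            return None                  # literal affix letters must agree
    word_letters = iter(word)
    return "".join(ch if ch in DIACRITICS else next(word_letters) for ch in pattern)

NOUN_PATTERNS = ["faEol", "faEal", "faEul", "faEil", "fuEol",
                 "fuEal", "fuEul", "fuEil", "fiEol", "fiEil"]
VERB_PATTERNS = ["yufoEaluwna", "yufoEiluwna", "yafoEaluwna",
                 "yafoEiluwna", "yafoEuluwna"]

ktb_forms = [vowelize("ktb", p) for p in NOUN_PATTERNS]
yEmlwn_forms = [vowelize("yEmlwn", p) for p in VERB_PATTERNS]
print(ktb_forms)
print(yEmlwn_forms)
```

Each matched pattern carries its own tag, so every vowelized candidate inherits the morphological features of the pattern that produced it.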

The Part-of-Speech Tag Set is designed following the traditional grammar classifications.

The Tag Set has 22 morphological features of Arabic words.

The Tag consists of 22 characters.

E.g.:

• v at the first position indicates verb, n at the second position indicates proper name. At the seventh position m indicates masculine, and f indicates feminine.

• “-” is used if a certain feature is not applicable to the tagged word.

• “?” is used if a certain feature applies to the word but its value is not available at the moment, or the automatic tagger could not guess it.


http://www.comp.leeds.ac.uk/sawalha/tagset

P | Morphological Features Categories | Arabic
1 | Main POS | أَقسام الكلام الرئيسيّة
2 | POS of Noun | أقسام فرعيّة (الاسم)
3 | POS of Verb | أقسام فرعيّة (الفعل)
4 | POS of Particle | أقسام فرعيّة (الحرف)
5 | Residuals | أقسام فرعيّة (أخرى)
6 | Punctuations | علامات الترقيم
7 | Gender | الجنس
8 | Number | العدد
9 | Person | الشخص
10 | Morphology | الصّرف
11 | Case and Mood | الحالة الإعرابية للاسم أو الفعل
12 | Case and Mood marks | علامة الإعراب أو البناء
13 | Definiteness | المَعْرِفَة والنَّكِرَة
14 | Voice | المَبْني لِلمَعْلُوم والمَبْني لِلمَجْهُول
15 | Emphasize | المُؤكَّد وغيرُ المُؤكَّد
16 | Transitivity | اللازم والمتعدي
17 | Humanness | العاقل وغير العاقل
18 | Variability & Conjugation | التَّصريف
19 | Augmented & Unaugmented | المجرَّد والمزيد
20 | Root letters | عَدَد أحْرُف الجَذْر
21 | Verb Internal Structure | بُنية الفعل
22 | Noun finals | أقسام الاسم تبعاً للفظ آخره
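A positional tag of this kind can be decoded by pairing each character with the feature category at the same position. The sketch below assumes the 22 categories in the order of the table; `decode_tag` is an illustrative name, and the example tag is made up for illustration from the values the slides document (v for verb, m for masculine), not taken from the tagged samples.

```python
FEATURES = [
    "Main POS", "POS of Noun", "POS of Verb", "POS of Particle", "Residuals",
    "Punctuations", "Gender", "Number", "Person", "Morphology",
    "Case and Mood", "Case and Mood marks", "Definiteness", "Voice", "Emphasize",
    "Transitivity", "Humanness", "Variability & Conjugation",
    "Augmented & Unaugmented", "Root letters", "Verb Internal Structure",
    "Noun finals",
]

def decode_tag(tag):
    """Map a 22-character tag to {feature: value}.

    '-' (not applicable) is skipped; '?' (applicable but unknown) maps to None.
    """
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} characters, got {len(tag)}")
    decoded = {}
    for feature, ch in zip(FEATURES, tag):
        if ch == "-":
            continue
        decoded[feature] = None if ch == "?" else ch
    return decoded

# Made-up tag: a verb, masculine, with number applicable but not yet known.
example_tag = "v" + "-" * 5 + "m" + "?" + "-" * 14
print(decode_tag(example_tag))
```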

http://www.comp.leeds.ac.uk/sawalha/tagset


Sample of tagged document using the morphological feature Tag Set

وَوَصَّيْنَا الْإِنسَانَ بِوَالِدَيْهِ حُسْنًا

We have recommended that a person must take good care of their parents.

Word | Transliteration | Gloss | Tag
وَ | wa | And | p--t-----a---------
وَصَّيْ | waS~ayo | Recommended | v-p---npf--iano-at&
نَا | naA | We | p--&---p-n---------
الْإِنسَانَ | Alo<insaAna | human | nq----np-ad-----bt-
بِ | bi | to | p--r-----g---------
وَالِدَيْ | waAlidayo | parents | nw----nd-gd-----at-
هِ | hi | his | p--&-----g---------
حُسْنًا | HusonAF | well | no----nu-ai-----st-

Evaluation and Results: Gold Standard for Evaluation

Gold standards are used to evaluate and measure the actual accuracy of automatic systems.

To construct a gold standard for evaluation, we need to determine:

• The Problem Domain

  • Evaluating morphological analyzers and part-of-speech taggers.

• The Corpora

  • Corpora of different text domains, formats and genres of both vowelized and non-vowelized Arabic text.

  • Two versions of the Qur’an text, vowelized Qur’an text, and non-vowelized Qur’an text.

  • The Corpus of Contemporary Arabic (Al-Sulaiti & Atwell, 2006).

Gold Standard for Evaluation

Gold Standard Format

• Includes morphological and part-of-speech information for each word of the gold standard, one word per line, with fields separated by tabs.

• Contains the root and the pattern information of the words.

• The gold standard will be stored in flat text files using Unicode UTF-8 encoding, or in XML.

Gold Standard Size

• It must be relatively large.

• It should cover most cases that morphological analyzers have to handle.

• It is measured by the number of words it contains.

Morphochallenge 2009 Gold Standard

http://www.cis.hut.fi/morphochallenge2009/

MorphoChallenge aims to develop an unsupervised morphological analyzer to be used for different languages including Arabic.

A gold standard of the Qur’an has been constructed to evaluate morphological analyzers in the Morphochallenge 2009 competition.

• Its size is 78,004 words.

• It contains the full morphological analysis for each word, according to the morphological analysis of the Qur’an in the tagged database of the Qur’an developed at the University of Haifa (Dror et al, 2004).

Morphochallenge 2009 Qur’an Gold Standard

بِسْمِ سم None ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen ,

اللَّهِ None None للَاه+Noun+ProperName+Gen+Def ,

الرَّحْمـنِ رحم فَعلَان رَحمَان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def ,

الرَّحِيمِ رحم فَعِيل رَحِيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def ,

بسم سم None ب+Prep , سم+Noun+Triptotic+Sg+Masc+Gen ,

الله None None للاه+Noun+ProperName+Gen+Def ,

الرحمـن رحم فعلان رحمان+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def ,

الرحيم رحم فعيل رحيم+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def ,

bisomi sm None b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen ,

All~hi None None llaah+Noun+ProperName+Gen+Def ,

Alr~aHom_ani rHm faElaAn raHmaan+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

Alr~aHiymi rHm faEiyl raHiim+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def ,

bsm sm None b+Prep , sm+Noun+Triptotic+Sg+Masc+Gen ,

Allh None None llAh+Noun+ProperName+Gen+Def ,

AlrHm_n rHm fElAn rHmAn+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def

AlrHym rHm fEyl rHym+Noun+Triptotic+Adjective+Sg+Masc+Gen+Def ,
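A line of this gold standard can be read back into its fields. The sketch assumes four tab-separated fields (word, root, pattern, analysis), with the analysis holding comma-separated morphemes of the form lemma+Feature+...; the tab layout follows the format slide, while the function name `parse_gold_line` and the returned dictionary keys are illustrative choices.

```python
def parse_gold_line(line):
    """Parse one gold-standard line into word, root, pattern and morphemes."""
    word, root, pattern, analysis = line.rstrip("\n").split("\t", 3)
    morphemes = []
    for part in analysis.split(","):
        part = part.strip()
        if not part:                      # skip the empty piece after a trailing comma
            continue
        lemma, *features = part.split("+")
        morphemes.append((lemma, features))
    return {
        "word": word,
        "root": None if root == "None" else root,
        "pattern": None if pattern == "None" else pattern,
        "morphemes": morphemes,
    }

# First sample line in non-vowelized transliteration, with tabs between fields.
entry = parse_gold_line("bsm\tsm\tNone\tb+Prep , sm+Noun+Triptotic+Sg+Masc+Gen ,")
print(entry)
```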

Thank you!

Questions?