Introduction to Arabic NLP
Transcript of Introduction to Arabic NLP
-
8/21/2019 Introduction to Arabic NLP
1/87
1
Introduction to ArabicNatural Language Processing
Nizar HabashColumbia University
Center for Computational Learning [email protected]
MEADR 2009Cairo, Egypt
April 21, 2009
CADIM
-
8/21/2019 Introduction to Arabic NLP
2/87
2
• Focus of this
tutorial
– Phenomena – Concepts – Approaches
– Resources• What is ‘Arabic’? – Arabic Script – Arabic Language
• Modern StandardArabic (MSA)
• Arabic Dialects
-
8/21/2019 Introduction to Arabic NLP
3/87
3
Road Map• Introduction
• Orthography
• Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
4/87
4
Road Map• Introduction• Orthography
– Arabic Script
– MSA Phonology and Spelling
– Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… – Encoding Issues
• Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
5/87
5
Arabic Script
-
8/21/2019 Introduction to Arabic NLP
6/87
6
Arabic ScriptArabic script is an alphabet with allographic variants,
optional zero-width diacritics and common ligatures.
Arabic script is used to write many languages: Arabic,Persian, Kurdish, Urdu, Pashto, etc.
رعلا ُطَخلا
-
8/21/2019 Introduction to Arabic NLP
7/87
7
Arabic ScriptAlphabet
• letter forms
• letter marks
• Arabic only
• Other languages
• Persian, Kurdish,Urdu, Pashto, etc.
• OCR output ambiguity; common spelling errors
-
8/21/2019 Introduction to Arabic NLP
8/87
8
Arabic Script
/b/
Alphabet (MSA)
• letters (form+mark)
• Distinctive
• Non-distinctive
/t/
ث
/θ/
س
/s/
ش
/ʃ/
/ʔ/glottal stop aka hamza
ؤا ئ
-
8/21/2019 Introduction to Arabic NLP
9/87
9
Arabic ScriptLetter Shapes
• No distinction between print and handwriting• No capitalization• Right-to-left• Ambiguous
shapes• Connectiveletters
• Disconnective
letters( ة ى رزد ذ و ؤ ا )cause word-internal visualspacing
ز
ن
final
medial
initial
Stand
alone
غ
-
8/21/2019 Introduction to Arabic NLP
10/87
10
Arabic ScriptLetter shaping
/k itāb /آ =بآ تك k tābbook
/k atab /
آ =آ تك
k tbto write
-
8/21/2019 Introduction to Arabic NLP
11/87
11
Arabic Script
/bin/
/bun/
/ban/
NunationDiacritics
• Zero-width characters
• Used for short vowels
آ /katab/ to write
• Nunation is used for
nominal indefinitemarker in MSA
آ ب /kitābun/ a book
/bi/
/bu/
/ba/
Vowel
-
8/21/2019 Introduction to Arabic NLP
12/87
12
Arabic Script
/bb/
DoubleConsonant
/bban/
/bbin//bbu/
Diacritics
• No-vowel marker (sukun)
/maktab/ office
• Double consonant marker (shadda)
آ /kattab/ to dictate • Combinable
/b/
No Vowel
-
8/21/2019 Introduction to Arabic NLP
13/87
13
Arabic Script
ب=َ ب ع َر َب
Putting it together
Simple combination
Ligatures ب=ْ ب
غ َر ْب West /ʁarb/
Arab /ʕarab/
م
م
ل
/Peace /salāmس
م
-
8/21/2019 Introduction to Arabic NLP
14/87
14
Arabic ScriptTatweel
• ‘elongation’
• aka kashida
• used for text highlightand justification
االنس
حقوق
سـنالا
وق قـح
ـ ـ سـنالا
وق ـ ـ قـح
ـ ـ ـ ـ سـنالا
وق ـ ـ ـ ـ قـح
human rights /ħuqūq alʔinsān/
-
8/21/2019 Introduction to Arabic NLP
15/87
15
Arabic Script• Different styles
• High fluidity• Optional ligatures
• Verticalarrangements
/alʤabr / /muħammad / /ʕarabi/
ع بيمحمدالج
محمالج ع
عريبحمماجلرب
algebraMuhammadArabic
-
8/21/2019 Introduction to Arabic NLP
16/87
16
٠
0
٩Eastern Indo-ArabicIran, Pakistan, etc.
٩
Indo-ArabicMiddle East
987654321Western Arabic
Tunisia, Morocco, etc.
Arabic Script “Arabic” Numerals• Decimal system• Numbers written left-to-right in right-to-left text
الفرنس 132بعد1962استقلت اجلزائر يف سنة . عاما من االحتالل
Algeria achieved its independence in 1962 after 132 years of French occupation.
• Three systems of enumeration symbols that vary by region
-
8/21/2019 Introduction to Arabic NLP
17/87
17
Road Map• Introduction
• Orthography – Arabic Script
– MSA Phonology and Spelling
– Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… – Encoding Issues
• Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
18/87
18
MSA Phonology and Spelling• Phonological profile of Standard Arabic
– 28 Consonants – 3 short vowels, 3 long vowels, 2 diphthongs
• Arabic spelling is mostly phonemic … – Letter-sound correspondence
ā ʔt bʤ θx ħδ dz r ssʖ ʃtʖ dʖʕk ʁq f lm
ولقخحج ى ؤء
h nw j ū ī δ
-
8/21/2019 Introduction to Arabic NLP
19/87
19
MSA Phonology and Spelling• Arabic spelling is mostly phonemic …
Except for • Medial short vowels can only appear as
diacritics
• Diacritics are optional in most written text – Except in holy scripture – Present diacritics mark syntactic/semantic
distinctions
• آ /katab/ to write آ /kutib/ to be written• /ħubb/ love /ħabb/ seed• Dual use of ,ا ,و ي as consonant and long vowel
ا – (/‘/,/ā/) و (/w/,/ū/) ي (/j/,/ī/)
-
8/21/2019 Introduction to Arabic NLP
20/87
20
MSA Phonology and Spelling• Arabic spelling is mostly phonemic …
Except for (continued)
• Morphophonemic characters
– Feminine markerة
(ta marbuta)• آ /kabīr/ (big ♂) آ /kabīr a/ (big ♀)
– Derivation marker
• /ʕasa/ (to disobey ) (a stick )• Hamza variants (6 characters for one phoneme!)
– ) ء أإؤئ) /baha’/ + 3MascSing (his glory)
-
8/21/2019 Introduction to Arabic NLP
21/87
21
MSA Phonology and Spelling• Arabic spelling can be ambiguous
– optional diacritics and dual use of letter • But how ambiguous? Really?• Classic example
ths s wht n rbc txt lks lk wth n vwls
this is what an Arabic text looks like with no vowels• Not exactly true
– Long vowels are always written
– Initial vowels are represented by an ‘alef’
– Some final short vowels are represented
ths is wht an Arbc txt lks lik wth no vwls
Will revisit ambiguity in more detail again under morphology discussion
-
8/21/2019 Introduction to Arabic NLP
22/87
22
Proper Name Spelling• The Qadafi-Schwarzenegger problem
– Foreign Proper name spelling is often ad hoc – Multiplicity of spellings causes increased sparsity
Schwarzenegger
زرا
زرازرا
ار
Gadafi Gaddafi Gaddfi GadhafiGhaddafi Kadaffy Qaddafi Qadhafi …
ا
-
8/21/2019 Introduction to Arabic NLP
23/87
23
Road Map• Introduction
• Orthography – Arabic Script – MSA Phonology and Spelling
– Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… – Encoding Issues
• Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
24/87
24
Arabic ScriptOther languages
Arabic
• No more than 3 dots• Dots either above or below• Marks are 1/2/3 dots, hamza (ء)
or madda (~) only• Rare borrowing for foreign words
•,/ /pپ ,/ /vڤ ڤ گ ,/ /gچ / /tʃچ
• regionally variable
Not Arabic• Extra marks: haft (v), ring (o), taa ,(ط)
four dots (::), vertical dots (:)• Some Numerals (,,)
Once you learn the alphabet, it is easier ☺
ړ ژڈ
ڐ
ۅ څ ځڈ
…گ ێ ڤ
أ إ ؤئ ة ت ث ج ح خ
ر ذ ز
ش ص
ض ط ظ ع غ ف
ق ك ل م ن وى ي ء
-
8/21/2019 Introduction to Arabic NLP
25/87
25
Arabic
Not Arabic
-
8/21/2019 Introduction to Arabic NLP
26/87
26
Arabic
Not Arabic
... ا...
ا ن رو او
و
... ا... وا رق اح
او او ر ا واا ا
ا ا ا وا ط و ا ام
... :ورد د
-
8/21/2019 Introduction to Arabic NLP
27/87
27
Arabic
Not Arabic
-
8/21/2019 Introduction to Arabic NLP
28/87
-
8/21/2019 Introduction to Arabic NLP
29/87
29
Encoding Issues• Encoding Arabic
– Data entry, storage, and display – Ease of use for Arabic-illiterate users – Multi-script support – Multilingual support (extended Arabic characters)
• Types of Encoding – Machine character sets
• Graphemic (shape insensitive, logical order)• Allographic (shape/direction sensitive) [obsolete]
– Human accessible• Transliteration• Phonetic spelling (IPA)• Romanization
-
8/21/2019 Introduction to Arabic NLP
30/87
30
Encoding Issues• Many Conflicting Character Sets for Arabic
-
8/21/2019 Introduction to Arabic NLP
31/87
31
Encodings• CP-1256
– Commonly used – 1-byte characters
– Widely supportedinput/display
– Minimal support forextended Arabiccharacters
– bi-script support(Roman/Arabic)
– Tri-lingual support:Arabic, French,English (ala ANSI)
-
8/21/2019 Introduction to Arabic NLP
32/87
32
Encodings• Unicode
– Becoming thestandard more andmore
– 2-byte characters
– Widely supportedinput/display
– Supports extendedArabic characters
– Multi-scriptrepresentation
-
8/21/2019 Introduction to Arabic NLP
33/87
33
Encodings• Unicode
– Supports presentationforms (shapes andligatures)
-
8/21/2019 Introduction to Arabic NLP
34/87
34
Encoding IssuesArabic Display
• Memory (logical order)
ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004.
ر ك ت ف ل س ين (Palestine) يف و ل م ب ي د (Olympics) 2000 و 2004.
or this way for those with direction-bias
.4002 æ 0002 )scipmylO( ÏÇíÈãáæÇ íÝ )enitselaP( äíØÓáÝ ÊßÑÇÔ
.4002 و 0002 )scipmylO( في ولمبي د )enitselaP( ركت فلسين
-
8/21/2019 Introduction to Arabic NLP
35/87
35
Encoding IssuesArabic Display
• Memory (logical order)ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004. ر ك ت ف ل س ين (Palestine) يف و ل م ب ي د (Olympics) 2000 و 2004.
• Display (visual order) – Bidirectional (BiDi) support• Numbers and Roman script. 2004و2000 (Olympics) في ولمبي د (Palestine) سين ل ف ت ك ر
– Letter and ligature shaping. 2004و2000 (Olympics) ود (Palestine) آر
-
8/21/2019 Introduction to Arabic NLP
36/87
36
Display Problems
تدشينمنطقةØ-رة Ùيدبيللتجارةالالكترونية
ÊÏÔêæ åæ×âÉ ÍÑÉáê ÏÈê ääÊÌÇÑÉÇäÇäãÊÑèæêÉ
ÊÏÔíä ãäØÞÉ ÍÑÉÝí ÏÈí ááÊÌÇÑÉÇáÇáßÊÑæäíÉ
د ور
؛
ُ
ع
ع
ع ع ع - عع
ع
ع
ع
،
ع
ع
ع
ع
ع
ع
ï» ؟ ¯ ´ ٹ †ظ
…†·‚ © -± © ط
ٹ
¯ ¨ٹ
ظ
„„ ¬§± © „§„§ط ƒ
±ˆ† ٹ
©
ʏʏʏʏ栥既栥既栥既栥既
ɠɠɠɠ ɠɠɠɠψ
ǑɠɠɠɠǤǤ㊑㊑㊑㊑親親親親ɠɠɠɠ
د ور
×â و هê ش
êل êدب êو èر
ʏʏʏʏ ɠɠɠɠ ɠɠɠɠψǑɠɠɠɠǡǡǡǡǡǡǡǡѦѦѦѦ
آ ف ر دب
د ور
WesternUnicodeISO-8859CP-1256Display Encoding
C P - 1 2 5 6
I S O - 8 8 5 9
U n i
c o d e
A c t u a l E n c
o d i n g
• Wrong encoding • Partial support problems
-
8/21/2019 Introduction to Arabic NLP
37/87
37http://www.cyrillic.com/kbd/btc.html
س م م
Encoding IssuesArabic Input• Standard graphemic keyboard
• Logical order input
-
8/21/2019 Introduction to Arabic NLP
38/87
38
EncodingsBuckwalter Encoding• Romanization
– One-to-one mappingto Arabic script spelling
– Left-to-right – Easy to learn/use – Human & machine compatible
• Commonly used in NLP – Penn Arabic Tree Bank
• Some characters can bemodified to allow use with XMLand regular expressions
• Roman input/display• Monolingual encoding (can’t do
English and Arabic)• Minimal support for extended
Arabic characters
-
8/21/2019 Introduction to Arabic NLP
39/87
39
Road Map• Introduction
• Orthography• Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
40/87
40
Morphology• Type
– Concatenative: prefix, suffix, circumfix – Templatic: root+pattern
• Function
– Derivational• Creating new words• Mostly templatic
– Inflectional• Modifying features of words
– Tense, number, person, mood, aspect
• Mostly concatenative
-
8/21/2019 Introduction to Arabic NLP
41/87
41
Road Map• Introduction
• Orthography• Morphology
– Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
42/87
42
Derivational Morphology• Templatic Morphology
b
م
kt
آ
ا
maktū bwritten
k āti bwriter Lexeme.Meaning =
(Root.Meaning+Pattern.Meaning)*Idiosyncrasy.Random
ك
maū āi
• Root
• Pattern
• Lexeme
-
8/21/2019 Introduction to Arabic NLP
43/87
43
Derivational MorphologyRoot Meaning
• ك KTB = notion of “writing”
آ /katab/
write
آ /kātib/writer
ب /maktūb/
letter
آب /kitāb/
book /maktaba/
library
/maktab/
office
ب /maktūb/
written
-
8/21/2019 Introduction to Arabic NLP
44/87
44
Derivational MorphologyRoot Meaning
laHm
• LHM-1
• Notion of “meat” – /laħm/
• Meat
– م /laħħām/• Butcher
-
8/21/2019 Introduction to Arabic NLP
45/87
45
Derivational MorphologyRoot Meaning
• LHM-2
• Notion of “battle” – /malħama/
• Fierce battle
• Massacre• Epic
D i i l M h l
-
8/21/2019 Introduction to Arabic NLP
46/87
46
• LHM-3
• Notion of “soldering” – /laħam/
• Weld, solder, stick, cling
– ا /iltaħam/• Be welded/soldered/fused
– /multaħim/• Welded, soldered, fused
Derivational MorphologyRoot Meaning
D i ti l M h l
-
8/21/2019 Introduction to Arabic NLP
47/87
47
Derivational MorphologyPattern Meaning
ask/make_writektb AistaktabRequirement Aista12a3X
Turn red/blushHmr AiHmarrTransformation Ai12a33IX
registerktb AiktatabAcquiescence, exaggeration Ai1ta2a3VIII
subscribe/enrollktb AinkatabPassive of Pattern I Ain1a2a3VII
correspondktb takaAtabReflexive of Pattern IIIta1aA2a3VI
learnElm taEal~amReflexive of Pattern IIta1a22a3V
seat jls AjlasCausation Aa12a3IVcorrespond with
ktb kaAtabInteraction with others1aA2a3III
dictatektb kattabIntensification, causation1a22a3II
writektb katabBasic sense of root1a2a3I
GlossExamplePattern MeaningPattern
• Verb Pattern Meaning is hard to define
-
8/21/2019 Introduction to Arabic NLP
48/87
48
Road Map• Introduction
• Orthography• Morphology
– Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
49/87
49
Inflectional Morphology• Derivational Morphology
– Lexeme ≈ Root + Pattern
• Inflectional Morphology – Word = Lexeme + Features
• Part-of-speech
– Traditional : Noun, Verb, Particle – Computational : N, PN, V, Adj, Adv, P, Pron, Num,Conj, Det, Aux, Pun, IJ, and others
– Many tag sets exist ranging from 3 to over 22K tags
• Noun-specific Features• Verb-specific Features• Other Features
-
8/21/2019 Introduction to Arabic NLP
50/87
50
Inflectional Morphology• Noun-specific Features – Number: singular, dual, plural, collective – Gender: masculine, feminine – Definiteness: definite, indefinite – Case: nominative, accusative, genitive – Possessive clitic
• Verb-specific Features – Aspect: perfective, imperfective, imperative – Voice: active, passive – Tense: past, present, future – Mood: indicative, subjunctive, jussive
– Subject (Person, Number, Gender) – Object clitic• Other Features
– Single-letter conjunctions – Single-letter prepositions
Inflectional Morphology
-
8/21/2019 Introduction to Arabic NLP
51/87
51
Inflectional Morphology
Nouns
ت
/walilmaktabāt/
ات++ ا++ wa+li+al+maktaba+āt
and+for +the+library+plural
And for the libraries
conj prepnoun poss plural article
آ
/wakabiyūtinā/
+ ت + ك + wa+ka+ biyūt+nā
and+like+houses+our
And like our houses
• Morphotactics (e.g. ا+ل )• Arabic Broken Plurals (templatic)
Inflectional Morphology
-
8/21/2019 Introduction to Arabic NLP
52/87
52
Inflectional Morphology
Verbs
ه
/faqulnāhā/
ه++ل+فfa+qul+na+hā
so+said+we+it
So we said it
conjverbobject subj tense
/wasanaqūluhā/
ه+++س+ wa+sa+na+qūl+u+hāand+will+we+say+it
And we will say it
• Morphotactics• Subject conjugation (suffix or circumfix)
I fl ti l M h l
-
8/21/2019 Introduction to Arabic NLP
53/87
53
Inflectional Morphologykatab ‘to write’
• Perfect verb subject conjugation (suffixes only )
آ katabā آ katabtū
آ katabtum
آ katabnā
PluralDualSingular
آ kataba3
آ katabtumā آ katabta2
آ katabtu1
• Imperfect verb subject conjugation ( prefix+suffix )
Feminine form and other verb moods not shown
yaktubān ن yaktubūn
ن taktubūn naktubu
PluralDualSingular
yaktubu3
taktubān taktubu2آا aktubu1
-
8/21/2019 Introduction to Arabic NLP
54/87
54
Road Map• Introduction
• Orthography• Morphology
– Derivational Morphology
– Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
55/87
55
Morphological Ambiguity
• Derivational ambiguity – : basis/principle/rule, military base,Qa'ida/Qaeda/Qaida
• Inflectional ambiguity – /taktub/: you write, she writes – Segmentation ambiguity
• و: he found; +و : and+grandfather • : + ل : for a language; ا+ ل : for the language
-
8/21/2019 Introduction to Arabic NLP
56/87
56
Morphological Ambiguity• Spelling ambiguity
– Optional diacritics• آ: /kātib/ writer , /kātab/ to correspond
– Suboptimal spelling• Hamza dropping: ,أ ا
• Undotted ta-marbuta: ة • Undotted final ya:
ي
• Multiple sources of ambiguity
بني
– /bayyana/ Verb he demonstrated – /bayyanna/ Verb they [feminine] demonstrated – /bayyin/ Adj clear/evident/explicit – /bayna/ Prep between/among – /biyin/ Proper Noun in Yen
– /biyn/ Proper Noun Ben
Morphological Disambiguation
-
8/21/2019 Introduction to Arabic NLP
57/87
57
Morphological Disambiguation
in English
• Select a morphological tag that fullydescribes the morphology of a word
• Complete English morphological tag set
(Penn Treebank): 48 tagsVerb:
• Same as “POS Tagging” in English
goesgogonegoingwentgo
VBZVBPVBNVBGVBDVB
Morphological Disambiguation
-
8/21/2019 Introduction to Arabic NLP
58/87
58
• Morphological tag has 14 subtags
corresponding to different linguistic categories – Example:VerbGender(2), Number(3), Person(3), Aspect(3),Mood(3), Voice(2), Pronominal clitic(12),
Conjunction clitic(3)• 22,400 possible tags
– Different possible subsets
• 2,200 appear in Penn Arabic Tree Bank Part 1(140K words)• Example solution: MADA (Habash&Rambow 2005)
Morphological Disambiguation
in Arabic
MADA (Habash&Rambow 2005)(Habash&Rambow 2007)
(Roth et al 2008)
-
8/21/2019 Introduction to Arabic NLP
59/87
59
W-3 W-2 W-1 W0 W1 W2 W3 W4W-4
MORPHOLOGICALANALYZER
MORPHOLOGICALCLASSIFIERS
• Rule-based
• Human-created
• Multiple independentclassifiers
• Corpus-trained
2nd
3rd
5th
4th
1st
RANKER
• Heuristic orcorpus-trained
(Roth et al. 2008)
MADA(Habash&Rambow 2005)(Habash&Rambow 2007)
(Roth et al 2008)
-
8/21/2019 Introduction to Arabic NLP
60/87
60
MADAMorphological Analysis and Disambiguation for Arabic
;;; SENTENCE AsbAnyA tnfy tjmyd AlmsAEdp AlmmnwHp llmgrb;;WORD AsbAnyA;;MADA: AsbAnyA art-NO aspect-NA case-NOCASE clitic-NO conj-NO def-DEF mood-NA n
um-SG part-NO per-3 pos-PN voice-NA*0.78571
-
8/21/2019 Introduction to Arabic NLP
61/87
61
Road Map• Introduction
• Orthography• Morphology
– Derivational Morphology
– Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology
• Syntax
-
8/21/2019 Introduction to Arabic NLP
62/87
62
Arabic Computational Morphology• Representation units• Natural token ت ـ ـ ـ ـ ـ ـ ـ ـ و
– White space separated strings (as is) – Can include extra characters (e.g. tatweel/kashida)
•Word وت
• Segmented word – Can include any degree of morphological analysis – Pure segmentation:
ت
ل
و
– Arabic Treebank tokens (with recovery of somedeleted/modified letters):
ت
ل
و
-
8/21/2019 Introduction to Arabic NLP
63/87
63
Arabic Computational Morphology• Representation units (continued)• Prefix + Stem + Suffix
–و++
– Can create more ambiguity• Lexeme + Features
– [+Plural +Def + +ل ]
• Root + Pattern + Features –
آ
+ة
a3a21aم
+ [+Plural +Def +ل
+و
] – Very abstract
• Root + Pattern + Vocalism + Features – آ + 321م + a.a.a + [+Plural +Def +ل [و+ – Very very abstract
(Habash&Sadat 2006)
-
8/21/2019 Introduction to Arabic NLP
64/87
64
TOKAN• A generalized tokenizer • Assumes disambiguated morphological analysis
– a la MADA
• Declarative specification of tokenization scheme wsyktbhA=[katab_1 POS:V +IV w+ s+ +S:3MS +O:3FS]
• Uses generator (Habash 2004)
w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:ENw+ s+ ktb/VBZS:3MS +hA
w+ f+ b+ k+ l+ REST +P: +O:TBw+ syktb +hA
w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:D3w+ s+ yktb +hA
w+ f+ b+ k+ l+ s+ RESTD2w+ s+ yktbhA
w+ f+ RESTD1w+ syktbhA
SpecificationSchemeExample
-
8/21/2019 Introduction to Arabic NLP
65/87
65
Arabic Computational Morphology• Approaches – Finite state machines (Beesely,2001) (Kiraz,2001) (Habash&Rambow 2006) – Concatenative analysis/generation (Smrz, 2007) (Buckwlater,2002)
(Cavalli-Sforza et al, 2000)
– Lexeme+Feature analysis/generation (Habash, 2004) (Habash&Rambow2006)
– Shallow stemming (Darwish,2002) (Aljlayl and Frieder 2002) – Machine learning (Diab et al,2004) (Lee et al,2003) (Rogati et al, 2003)
(Habash & Rambow 2005a)
– Survey article: (Al-Sughaiyer&Al-Kharashi, 2004)• Issues
– Appropriateness of system representation for an application• Machine Translation vs. Information Retrieval
• Arabic spelling vs. phonetic spelling – System coverage – System extendibility – Availability to researchers – Use for analysis and generation
-
8/21/2019 Introduction to Arabic NLP
66/87
66
Road Map• Introduction
• Orthography• Morphology
• Syntax – Morphology and Syntax
– Sentence Structure – Phrase Structure
– Computational Resources
-
8/21/2019 Introduction to Arabic NLP
67/87
67
Morphology and Syntax• Rich morphology crosses into syntax
– Pro-drop / Subject conjugation
– Verb sub-categorization and object clitics• Verbtransitive+subject+object• Verbintransitive+subject but not Verbintransitive+subject+object• Verbpassive+subject but not Verbpassive+subject+object
• Morphological interactions with syntax – Agreement• Full: e.g. Noun-Adjective on number, gender, and definiteness (for
persons)• Partial: e.g. Verb-Subject on gender (in VSO order)
– Definiteness• Noun compound formation, copular sentences, etc.• Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc.
-
8/21/2019 Introduction to Arabic NLP
68/87
68
Morphology and Syntax• Morphological interactions with syntax (continued)
– Case
• MSA is case marking: nominative, accusative, genitive• Almost-free word order • Case is often marked with optionally written short vowels
– This effectively limits the word-order freedom in published text
• Agglutination – Attached prepositions create words that cross phraseboundaries
ل
+ت
li+Almaktabātfor the-libraries [PP li [NP Almaktabāt]]
• Some morphological analysis (minimally segmentation)is necessary even for statistical approaches to parsing
-
8/21/2019 Introduction to Arabic NLP
69/87
69
Road Map• Introduction
• Orthography• Morphology
• Syntax – Morphology and Syntax
– Sentence Structure – Phrase Structure
– Computational Resources
-
8/21/2019 Introduction to Arabic NLP
70/87
70
Sentence StructureTwo types of Arabic Sentences
• Verbal sentences – [Verb Subject Object] (VSO) – آا دوا
Wrote the-boys the-poems
The boys wrote the poems• Copular sentences (aka nominal sentences)
– [Topic Complement]
– د
و
اا
the-boys poetsThe boys are poets
-
8/21/2019 Introduction to Arabic NLP
71/87
71
Sentence Structure• Verbal sentences
– Verb agreement with gender only• Default singular number • ا آ\وا wrote3MascSing the-boy/the-boys
• آا \ا wrote3FemSing the-girl/the-girls
– Pronominal subjects are conjugated• آ wrote-youMascSing• آ wrote-youMascPlur • آ wrote-theyMascPlur
– Passive verbs• Same structure: Verbpassive SubjectunderlyingObject• Agreement with surface subject
-
8/21/2019 Introduction to Arabic NLP
72/87
72
Sentence Structure• Verbal sentences
– Common structural ambiguity• Third masculine/feminine singular
is structurally
ambiguous
– Verb3MascSingular NounMascVerb subject=he object=Noun
Verb subject=Noun
• Passive and active forms are often similar in
standard orthography – آ /kataba/ he wrote – آ /kutiba/ it was written
-
8/21/2019 Introduction to Arabic NLP
73/87
73
Sentence Structure• Copular sentences
– [Topic Complement]
Definite Topic, Indefinite Complement• ا
the-boy poetThe boy is a poet
– [Auxiliary Topic Complement]Auxiliaries (kāna and her sisters)• Tense, Negation, Transformation, Persistence• ا آن was the-boy poet The boy was a poet • ا is-not the-boy poet The boy is not a poet
– Inverted order is expected in certain cases• Indefinite topic آ ي /ʕindi kitābun/ at-me a-book I have a book
-
8/21/2019 Introduction to Arabic NLP
74/87
74
Sentence Structure• Copular sentences – Types of complements
• Noun/Adjective/Adverb – اآذ the-boy smart The boy is smart
• Prepositional Phrase – ا ا the-boy in the-library The boy is in the library
• Copular-Sentence – اآ آ [the-boy [book-his big]] The boy, his book is big
• Verb-Sentence – اآ اود
[the-boys [wrote3rdMascPlur poems]] The boys wrote the poems
– Full agreement in this order (SVO)
– اوآ ار
[the-poems [wrote3rdMascSing-them the boys]] The poems, the boys wrote
-
8/21/2019 Introduction to Arabic NLP
75/87
75
Road Map• Introduction
• Orthography• Morphology
• Syntax – Morphology and Syntax
– Sentence Structure – Phrase Structure
– Computational Resources
-
8/21/2019 Introduction to Arabic NLP
76/87
76
Phrase Structure• Noun Phrase
– Determiner Noun Adjective PostModifier
• ان this the-writer the-ambitious the-arriving from Japanها ا اح ادمThis ambitious writer from Japan
– Noun-Adjective agreement• number, gender, definiteness
– ا ا the-writer FemSing the-ambitiousFemSing – ا ا the-writer FemPlur the-ambitiousFemPlur
• Exception: Plural non-persons – definiteness agreement; feminine singular default – ا ا the-officeMascSing the-newMascSing – ا ا the-libraryFemSing the-newFemSing – ا ا the-officesMascBPlur the-newFemSing – ا ا the-librariesFemPlur the-newFemSing
-
8/21/2019 Introduction to Arabic NLP
77/87
77
Phrase Structure• Noun Phrase
– Idafa construction (ا)
• Noun1 of Noun2 encoded structurally• Noun1-indefinite Noun2-definite اردن •
king Jordanthe king of Jordan / Jordan’s king
– Noun1 becomes definite• Agrees with definite adjectives
– Idafa chains• N 1
indef
N 2 indef
… N n-1indef
N ndef • آا ةرادا ر ر ا
son uncle neighbor chief committee management the-companyThe cousin of the CEO’s neighbor
-
8/21/2019 Introduction to Arabic NLP
78/87
78
Phrase Structure• Morphological definiteness interacts with syntactic structure
Indefinitedefinite
Noun Phrase
آAn artist(ic) writer
Copular Sentence
اThe writer is an artist
Noun Compound
آاThe writer of the artist
Noun Phrase
ا اThe artist(ic) writer
Word 1 آ writer
W o r
d 2
a r t i s t
d e f i n i t e
i n d e f i n i t e
Agreement in Arabic
-
8/21/2019 Introduction to Arabic NLP
79/87
79
g• Verb-Subject agreement
– Verb agrees with subject in full (gender,number)• Exception: partial agreement (number=singular) in VSO order • Exception: partial agreement (number=singular; gender=feminine) for non-person plural
subjects regardless of order • Noun-Adjective
– Adjective agrees with noun in full (gender, number, definiteness and case)• Exception: partial agreement (number=singular; gender=feminine) for non-person plural
nouns
• Noun-Number
– Number is the syntactic-case head – for numbers [3..10]: Noun is plural+genitive (idafa); number gender is inverted gender
of noun! – for numbers [11..99]: Noun is singular+accusative (tamyiyz/specification); number
gender is even more complicated☺ – for numbers [100,1K,1M]: Noun is singular+genitive (idafa)
Adjectives of plural non-person nouns are Fem+Sg
Numbers agrees bygender inversion
Verbs in VSO order are alwaysSg and agree in gender only
Fem+Sg+GenFem+PL+GenMasc+Sg+NomFem+Sg
jdydp ‘new’ jAmEAt ‘universities’ vlAv ‘three’ bnyt ‘was built’
-
8/21/2019 Introduction to Arabic NLP
80/87
80
Road Map• Introduction
• Orthography• Morphology
• Syntax – Morphology and Syntax
– Sentence Structure – Phrase Structure
– Computational Resources
-
8/21/2019 Introduction to Arabic NLP
81/87
81
Computational Resources• Monolingual corpora for building language models
– Arabic Gigaword
• Agence France Presse• AlHayat News Agency• AnNahar News Agency• Xinhua News Agency
– Arabic Newswire – United Nations Corpus (parallel with other UN languages) – Ummah Corpus (parallel with English)
• Distributors – Linguistic Data Consortium (LDC) – Evaluations and Language resources Distribution Agency
(ELDA)
Computational Resources
-
8/21/2019 Introduction to Arabic NLP
82/87
82
p
• Penn Arabic Treebank (PATB) – Started in 2001
– Goal is 1 Million words – Currently 650K words
• Agence France Presse , AlHayat newspaper, AnNahar newspaper
• POS tags – Buckwalter analyzer – Arabic-tailored POS list
• PATB constituencyrepresentation – Some modifications of Penn English Treebank
• (e.g. Verb-phrase internal subjects)
Computational Resources
-
8/21/2019 Introduction to Arabic NLP
83/87
83
p
• Prague Dependency Treebank
• Partial overlap with PATBand Arabic Gigaword – Agence France Presse,
AlHayat and Xinhua• Morphological analysis
– Extends on PATB
• Dependency representation
Graphic courtesy of Otakar Smrž: http://ckl.mff.cuni.cz/padt/PADT_1.0/docs/slides/2003-eacl-trees.ppt
CATiB:Columbia Arabic Tree Bank(Habash, 2009)
-
8/21/2019 Introduction to Arabic NLP
84/87
84
• CATiB Representation – Lite Dependency Syntax
– Tokenization• CONJ PART BASE PRON – Part of Speech Tag set
• Six tags: VRB, VRB-pass, NOM,PROP, PRT, PNX
– Syntactic Annotation• Eight dependency relations: SBJ,
OBJ, TPC, PRD, IDF, TMZ, MOD, FLAT
• Practical Considerations
– Less information to annotate – Dependencies easier to annotate than phrase structure – Terminology and representation close to traditional
Arabic grammar
Computational Resources
-
8/21/2019 Introduction to Arabic NLP
85/87
85
• Applications using Arabic treebanks – Statistical parsing
• Bikel’s parser (Bikel 2003)
– Same engine used with English, Chinese and Arabic• Nivre’s MALT parser (Nivre et al. 2006) – Base-phrase Chunking
• (Diab et al, 2004; Diab et al. 2007) – POS tagging and morphological disambiguation
• (Diab et al, 2004; Diab et al. 2007; Habash and Rambow, 2005a; Smith etal., 2005; Roth et al. 2008)
• Other non-treebank-based POS tagging efforts: (Khoja, 2001)
• Formalism conversion – Constituency to dependency (Žabokrtský and Smrž 2003; Habash et al.
2007; Tounsi et al., 2009)
– Tree-adjoining grammar extraction (Habash and Rambow 2004)• Automatic diacritization
– Zitouni et al. (2006); Habash&Rambow (2007); Shaalan et al (2008)among others
– Diacritization for MT (Diab et al. 2007)
Other Tutorial Slides
-
8/21/2019 Introduction to Arabic NLP
86/87
86
• Columbia’s Arabic Dialect
Modeling Group (CADIM) – http://www1.ccls.columbia.edu/~cadim/
• Presentations
MEADR 2009Cairo, Egypt
A il 21 2009
-
8/21/2019 Introduction to Arabic NLP
87/87
87
Introduction to ArabicNatural Language Processing
Nizar HabashColumbia University
Center for Computational Learning Systems
April 21, 2009
CADIM