Introduction to Arabic NLP

download Introduction to Arabic NLP

of 28

Transcript of Introduction to Arabic NLP

  • 8/21/2019 Introduction to Arabic NLP

    1/87

    1

    Introduction to ArabicNatural Language Processing

    Nizar HabashColumbia University

    Center for Computational Learning [email protected]

    MEADR 2009Cairo, Egypt

    April 21, 2009

    CADIM

  • 8/21/2019 Introduction to Arabic NLP

    2/87

    2

    • Focus of this

    tutorial

     – Phenomena – Concepts – Approaches

     – Resources• What is ‘Arabic’? – Arabic Script – Arabic Language

    • Modern StandardArabic (MSA)

    • Arabic Dialects

  • 8/21/2019 Introduction to Arabic NLP

    3/87

    3

    Road Map• Introduction

    • Orthography

    • Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    4/87

    4

    Road Map• Introduction• Orthography

     – Arabic Script

     – MSA Phonology and Spelling

     – Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… – Encoding Issues

    • Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    5/87

    5

    Arabic Script

  • 8/21/2019 Introduction to Arabic NLP

    6/87

    6

    Arabic ScriptArabic script is an alphabet with allographic variants,

    optional zero-width diacritics and common ligatures.

    Arabic script is used to write many languages: Arabic,Persian, Kurdish, Urdu, Pashto, etc.

     رعلا ُطَخلا

  • 8/21/2019 Introduction to Arabic NLP

    7/87

    7

    Arabic ScriptAlphabet

    • letter forms

    • letter marks

    • Arabic only

    • Other languages

    • Persian, Kurdish,Urdu, Pashto, etc.

    • OCR output ambiguity; common spelling errors

  • 8/21/2019 Introduction to Arabic NLP

    8/87

    8

    Arabic Script

     

    /b/

    Alphabet (MSA)

    • letters (form+mark)

    • Distinctive

    • Non-distinctive

     

    /t/

    ث

    /θ/

    س

    /s/

    ش

    /ʃ/

    /ʔ/glottal stop aka hamza

    ؤا ئ

  • 8/21/2019 Introduction to Arabic NLP

    9/87

    9

    Arabic ScriptLetter Shapes

    • No distinction between print and handwriting• No capitalization• Right-to-left• Ambiguous

    shapes• Connectiveletters

    • Disconnective

    letters( ة ى رزد ذ و ؤ ا )cause word-internal visualspacing

     

     

     

    ز

     

     ن

    final

    medial

    initial

    Stand

    alone

          

         غ

  • 8/21/2019 Introduction to Arabic NLP

    10/87

    10

    Arabic ScriptLetter shaping

     /k itāb /آ  =بآ تك  k tābbook 

     /k atab /

    آ  =آ تك 

    k tbto write

  • 8/21/2019 Introduction to Arabic NLP

    11/87

    11

    Arabic Script

     

    /bin/

     

    /bun/

     

    /ban/

    NunationDiacritics

    • Zero-width characters

    • Used for short vowels

     آ  /katab/ to write 

    • Nunation is used for

    nominal indefinitemarker in MSA

    آ ب  /kitābun/ a book 

     

    /bi/

     

    /bu/

     

    /ba/

    Vowel

  • 8/21/2019 Introduction to Arabic NLP

    12/87

    12

    Arabic Script

     

    /bb/

    DoubleConsonant

    /bban/

     

    /bbin//bbu/

     

     

    Diacritics

    • No-vowel marker (sukun)

         /maktab/ office 

    • Double consonant marker (shadda)

     آ  /kattab/ to dictate • Combinable

     

    /b/

    No Vowel

  • 8/21/2019 Introduction to Arabic NLP

    13/87

    13

    Arabic Script

     ب=َ  ب    ع َر َب 

    Putting it together 

    Simple combination

    Ligatures  ب=ْ  ب  

    غ َر ْب West /ʁarb/

    Arab /ʕarab/

    م

     م

     ل

     /Peace /salāmس

    م

  • 8/21/2019 Introduction to Arabic NLP

    14/87

    14

    Arabic ScriptTatweel

    • ‘elongation’ 

    • aka kashida

    • used for text highlightand justification

    االنس

     

    حقوق

      سـنالا

     

    وق قـح

      ـ ـ سـنالا

     

    وق ـ ـ قـح

      ـ ـ ـ ـ سـنالا

     

    وق ـ ـ ـ ـ قـح

    human rights /ħuqūq alʔinsān/

  • 8/21/2019 Introduction to Arabic NLP

    15/87

    15

    Arabic Script• Different styles

    • High fluidity• Optional ligatures

    • Verticalarrangements

     /alʤabr  / /muħammad / /ʕarabi/ 

    ع بيمحمدالج

          محمالج ع

    عريبحمماجلرب

    algebraMuhammadArabic

  • 8/21/2019 Introduction to Arabic NLP

    16/87

    16

    ٠

    0

    ٩Eastern Indo-ArabicIran, Pakistan, etc.

    ٩

    Indo-ArabicMiddle East 

    987654321Western Arabic

    Tunisia, Morocco, etc.

    Arabic Script “Arabic” Numerals• Decimal system• Numbers written left-to-right in right-to-left text

    الفرنس  132بعد1962استقلت اجلزائر يف سنة . عاما من االحتالل 

    Algeria achieved its independence in 1962 after 132 years of French occupation.

    • Three systems of enumeration symbols that vary by region

  • 8/21/2019 Introduction to Arabic NLP

    17/87

    17

    Road Map• Introduction

    • Orthography – Arabic Script

     – MSA Phonology and Spelling

     – Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… – Encoding Issues

    • Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    18/87

    18

    MSA Phonology and Spelling• Phonological profile of Standard Arabic

     – 28 Consonants – 3 short vowels, 3 long vowels, 2 diphthongs

    • Arabic spelling is mostly phonemic … – Letter-sound correspondence

    ā ʔt bʤ θx ħδ dz r ssʖ  ʃtʖ dʖʕk    ʁq f lm

    ولقخحج  ى ؤء

    h nw j ū ī  δ

  • 8/21/2019 Introduction to Arabic NLP

    19/87

    19

    MSA Phonology and Spelling• Arabic spelling is mostly phonemic …

    Except for • Medial short vowels can only appear as

    diacritics

    • Diacritics are optional in most written text – Except in holy scripture – Present diacritics mark syntactic/semantic

    distinctions

    • آ /katab/ to write آ /kutib/ to be written•   /ħubb/ love   /ħabb/ seed• Dual use of ,ا ,و ي as consonant and long vowel

    ا –  (/‘/,/ā/) و (/w/,/ū/) ي (/j/,/ī/)

  • 8/21/2019 Introduction to Arabic NLP

    20/87

    20

    MSA Phonology and Spelling• Arabic spelling is mostly phonemic …

    Except for (continued)

    • Morphophonemic characters

     – Feminine markerة

    (ta marbuta)• آ /kabīr/ (big ♂) آ /kabīr a/ (big ♀)

     – Derivation marker 

    • /ʕasa/ (to disobey ) (a stick )• Hamza variants (6 characters for one phoneme!)

     – )   ء أإؤئ) /baha’/ + 3MascSing (his glory)

  • 8/21/2019 Introduction to Arabic NLP

    21/87

    21

    MSA Phonology and Spelling• Arabic spelling can be ambiguous

     – optional diacritics and dual use of letter • But how ambiguous? Really?• Classic example

    ths s wht n rbc txt lks lk wth n vwls

    this is what an Arabic text looks like with no vowels• Not exactly true

     – Long vowels are always written

     – Initial vowels are represented by an ‘alef’

     – Some final short vowels are represented

    ths is wht an Arbc txt lks lik wth no vwls

    Will revisit ambiguity in more detail again under morphology discussion

  • 8/21/2019 Introduction to Arabic NLP

    22/87

    22

    Proper Name Spelling• The Qadafi-Schwarzenegger problem

     – Foreign Proper name spelling is often ad hoc – Multiplicity of spellings causes increased sparsity

    Schwarzenegger 

     زرا

    زرازرا

     ار

    Gadafi Gaddafi Gaddfi GadhafiGhaddafi Kadaffy Qaddafi Qadhafi …

    ا

  • 8/21/2019 Introduction to Arabic NLP

    23/87

    23

    Road Map• Introduction

    • Orthography – Arabic Script – MSA Phonology and Spelling

     – Recognizing Arabic vs. Persian/Urdu/Pashto/Kurdish/Sindhi/… – Encoding Issues

    • Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    24/87

    24

    Arabic ScriptOther languages

    Arabic

    • No more than 3 dots• Dots either above or below• Marks are 1/2/3 dots, hamza (ء)

    or madda (~) only• Rare borrowing for foreign words

    •,/ /pپ  ,/ /vڤ ڤ گ ,/ /gچ /  /tʃچ

    • regionally variable

    Not Arabic• Extra marks: haft (v), ring (o), taa ,(ط)

    four dots (::), vertical dots (:)• Some Numerals (,,)

    Once you learn the alphabet, it is easier ☺

    ړ ژڈ

     ڐ

    ۅ څ ځڈ

    …گ ێ ڤ

      أ إ ؤئ ة ت ث ج ح خ

    ر ذ ز

     ش ص

    ض ط ظ ع غ ف

    ق ك ل م ن وى ي ء

  • 8/21/2019 Introduction to Arabic NLP

    25/87

    25

    Arabic

    Not Arabic

  • 8/21/2019 Introduction to Arabic NLP

    26/87

    26

    Arabic

    Not Arabic

    ... ا...

     ا ن رو   او

      و

    ... ا...  وا  رق اح

       او او   ر ا واا ا

     ا   ا ا وا ط و ا ام

    ...   :ورد د

  • 8/21/2019 Introduction to Arabic NLP

    27/87

    27

    Arabic

    Not Arabic

  • 8/21/2019 Introduction to Arabic NLP

    28/87

  • 8/21/2019 Introduction to Arabic NLP

    29/87

    29

    Encoding Issues• Encoding Arabic

     – Data entry, storage, and display – Ease of use for Arabic-illiterate users – Multi-script support – Multilingual support (extended Arabic characters)

    • Types of Encoding – Machine character sets

    • Graphemic (shape insensitive, logical order)• Allographic (shape/direction sensitive) [obsolete]

     – Human accessible• Transliteration• Phonetic spelling (IPA)• Romanization

  • 8/21/2019 Introduction to Arabic NLP

    30/87

    30

    Encoding Issues• Many Conflicting Character Sets for Arabic

  • 8/21/2019 Introduction to Arabic NLP

    31/87

    31

    Encodings• CP-1256

     – Commonly used – 1-byte characters

     – Widely supportedinput/display

     – Minimal support forextended Arabiccharacters

     – bi-script support(Roman/Arabic)

     – Tri-lingual support:Arabic, French,English (ala ANSI)

  • 8/21/2019 Introduction to Arabic NLP

    32/87

    32

    Encodings• Unicode

     – Becoming thestandard more andmore

     – 2-byte characters

     – Widely supportedinput/display

     – Supports extendedArabic characters

     – Multi-scriptrepresentation

  • 8/21/2019 Introduction to Arabic NLP

    33/87

    33

    Encodings• Unicode

     – Supports presentationforms (shapes andligatures)

  • 8/21/2019 Introduction to Arabic NLP

    34/87

    34

    Encoding IssuesArabic Display

    • Memory (logical order)

    ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004.

      ر ك ت ف ل س ين (Palestine) يف و ل م ب ي د (Olympics) 2000 و 2004.

    or this way for those with direction-bias 

    .4002 æ 0002 )scipmylO( ÏÇíÈãáæÇ íÝ )enitselaP( äíØÓáÝ ÊßÑÇÔ

    .4002 و 0002 )scipmylO( في ولمبي د )enitselaP( ركت فلسين

  • 8/21/2019 Introduction to Arabic NLP

    35/87

    35

    Encoding IssuesArabic Display

    • Memory (logical order)ÔÇÑßÊ ÝáÓØíä (Palestine) Ýí ÇæáãÈíÇÏ (Olympics) 2000 æ 2004.  ر ك ت ف ل س ين (Palestine) يف و ل م ب ي د (Olympics) 2000 و 2004.

    • Display (visual order) – Bidirectional (BiDi) support• Numbers and Roman script. 2004و2000 (Olympics) في ولمبي د (Palestine) سين ل ف ت ك ر

     – Letter and ligature shaping. 2004و2000 (Olympics) ود (Palestine) آر

  • 8/21/2019 Introduction to Arabic NLP

    36/87

    36

    Display Problems

    تدشينمنطقةØ-رة Ùيدبيللتجارةالالكترونية

    ÊÏÔêæ åæ×âÉ ÍÑÉáê ÏÈê ääÊÌÇÑÉÇäÇäãÊÑèæêÉ

    ÊÏÔíä ãäØÞÉ ÍÑÉÝí ÏÈí ááÊÌÇÑÉÇáÇáßÊÑæäíÉ

    د   ور

    ؛

    ُ

     

     

    ع

    ع

     ع ع ع -   عع

      ع

    ع

    ع

     

     ،

     

     

     

    ع

     

    ع

    ع

     

     

    ع

    ع

    ع

     

     ï» ؟ ¯ ´ ٹ †ظ

    …†·‚ © -± © ط

     ٹ

      ¯ ¨ٹ

     ظ

    „„ ¬§± © „§„§ط ƒ

     ±ˆ† ٹ

      © 

    ʏʏʏʏ栥既栥既栥既栥既

    ɠɠɠɠ  ɠɠɠɠψ

    ǑɠɠɠɠǤǤ㊑㊑㊑㊑親親親親ɠɠɠɠ

    د   ور

    ×â و هê ش

    êل êدب  êو  èر

    ʏʏʏʏ ɠɠɠɠ   ɠɠɠɠψǑɠɠɠɠǡǡǡǡǡǡǡǡѦѦѦѦ

      آ     ف  ر  دب

    د   ور

    WesternUnicodeISO-8859CP-1256Display Encoding

       C   P  -   1   2   5   6

       I   S   O  -   8   8   5   9

       U  n   i

      c  o   d  e

       A  c   t  u  a   l   E  n  c

      o   d   i  n  g

    • Wrong encoding • Partial support problems

  • 8/21/2019 Introduction to Arabic NLP

    37/87

    37http://www.cyrillic.com/kbd/btc.html

    س   م م  

    Encoding IssuesArabic Input• Standard graphemic keyboard

    • Logical order input

  • 8/21/2019 Introduction to Arabic NLP

    38/87

    38

    EncodingsBuckwalter Encoding• Romanization

     – One-to-one mappingto Arabic script spelling

     – Left-to-right – Easy to learn/use – Human & machine compatible

    • Commonly used in NLP – Penn Arabic Tree Bank

    • Some characters can bemodified to allow use with XMLand regular expressions

    • Roman input/display• Monolingual encoding (can’t do

    English and Arabic)• Minimal support for extended

    Arabic characters

  • 8/21/2019 Introduction to Arabic NLP

    39/87

    39

    Road Map• Introduction

    • Orthography• Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    40/87

    40

    Morphology• Type

     – Concatenative: prefix, suffix, circumfix – Templatic: root+pattern

    • Function

     – Derivational• Creating new words• Mostly templatic 

     – Inflectional• Modifying features of words

     – Tense, number, person, mood, aspect

    • Mostly concatenative

  • 8/21/2019 Introduction to Arabic NLP

    41/87

    41

    Road Map• Introduction

    • Orthography• Morphology

     – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    42/87

    42

    Derivational Morphology• Templatic Morphology

      

    b

      م

    kt

    آ

    ا

    maktū bwritten

    k āti bwriter Lexeme.Meaning =

    (Root.Meaning+Pattern.Meaning)*Idiosyncrasy.Random

      ك

    maū āi

    • Root

    • Pattern

    • Lexeme

  • 8/21/2019 Introduction to Arabic NLP

    43/87

    43

    Derivational MorphologyRoot Meaning 

    •   ك KTB = notion of “writing” 

     آ /katab/

    write

     آ /kātib/writer

     ب  /maktūb/

    letter

    آب  /kitāb/

    book   /maktaba/

    library

      /maktab/

    office

     ب  /maktūb/

    written

  • 8/21/2019 Introduction to Arabic NLP

    44/87

    44

    Derivational MorphologyRoot Meaning 

     laHm

    • LHM-1

    • Notion of “meat” – /laħm/

    • Meat

     – م /laħħām/• Butcher 

  • 8/21/2019 Introduction to Arabic NLP

    45/87

    45

    Derivational MorphologyRoot Meaning 

    • LHM-2

    • Notion of “battle” –    /malħama/

    • Fierce battle

    • Massacre• Epic

    D i i l M h l

  • 8/21/2019 Introduction to Arabic NLP

    46/87

    46

    • LHM-3

    • Notion of “soldering” – /laħam/

    • Weld, solder, stick, cling

     – ا /iltaħam/• Be welded/soldered/fused

     – /multaħim/• Welded, soldered, fused

    Derivational MorphologyRoot Meaning 

    D i ti l M h l

  • 8/21/2019 Introduction to Arabic NLP

    47/87

    47

    Derivational MorphologyPattern Meaning 

    ask/make_writektb   AistaktabRequirement Aista12a3X

    Turn red/blushHmr   AiHmarrTransformation Ai12a33IX

    registerktb AiktatabAcquiescence, exaggeration Ai1ta2a3VIII

    subscribe/enrollktb   AinkatabPassive of Pattern I Ain1a2a3VII

    correspondktb   takaAtabReflexive of Pattern IIIta1aA2a3VI

    learnElm taEal~amReflexive of Pattern IIta1a22a3V

    seat jls AjlasCausation Aa12a3IVcorrespond with

    ktb kaAtabInteraction with others1aA2a3III

    dictatektb kattabIntensification, causation1a22a3II

    writektb katabBasic sense of root1a2a3I

    GlossExamplePattern MeaningPattern

    • Verb Pattern Meaning is hard to define

  • 8/21/2019 Introduction to Arabic NLP

    48/87

    48

    Road Map• Introduction

    • Orthography• Morphology

     – Derivational Morphology – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    49/87

    49

    Inflectional Morphology• Derivational Morphology

     – Lexeme ≈ Root + Pattern

    • Inflectional Morphology – Word = Lexeme + Features

    • Part-of-speech

     – Traditional : Noun, Verb, Particle – Computational : N, PN, V, Adj, Adv, P, Pron, Num,Conj, Det, Aux, Pun, IJ, and others

     – Many tag sets exist ranging from 3 to over 22K tags

    • Noun-specific Features• Verb-specific Features• Other Features

  • 8/21/2019 Introduction to Arabic NLP

    50/87

    50

    Inflectional Morphology• Noun-specific Features – Number: singular, dual, plural, collective – Gender: masculine, feminine – Definiteness: definite, indefinite – Case: nominative, accusative, genitive – Possessive clitic

    • Verb-specific Features – Aspect: perfective, imperfective, imperative – Voice: active, passive – Tense: past, present, future – Mood: indicative, subjunctive, jussive

     – Subject (Person, Number, Gender) – Object clitic• Other Features

     – Single-letter conjunctions – Single-letter prepositions

    Inflectional Morphology

  • 8/21/2019 Introduction to Arabic NLP

    51/87

    51

    Inflectional Morphology

    Nouns

     ت

     

    /walilmaktabāt/

    ات++ ا++ wa+li+al+maktaba+āt

    and+for +the+library+plural

     And for the libraries

    conj prepnoun poss  plural article

     آ

    /wakabiyūtinā/

    + ت + ك + wa+ka+ biyūt+nā

    and+like+houses+our 

     And like our houses

    • Morphotactics (e.g. ا+ل  )• Arabic Broken Plurals (templatic)

    Inflectional Morphology

  • 8/21/2019 Introduction to Arabic NLP

    52/87

    52

    Inflectional Morphology

    Verbs

     ه

    /faqulnāhā/

    ه++ل+فfa+qul+na+hā

    so+said+we+it

    So we said it 

    conjverbobject subj tense

      

     /wasanaqūluhā/

    ه+++س+ wa+sa+na+qūl+u+hāand+will+we+say+it

     And we will say it 

    • Morphotactics• Subject conjugation (suffix or circumfix)

    I fl ti l M h l

  • 8/21/2019 Introduction to Arabic NLP

    53/87

    53

    Inflectional Morphologykatab ‘to write’ 

    • Perfect verb subject conjugation (suffixes only )

     آ katabā  آ katabtū

     آ  katabtum

     آ katabnā

    PluralDualSingular 

    آ  kataba3

     آ  katabtumā آ katabta2

     آ katabtu1

    • Imperfect verb subject conjugation ( prefix+suffix )

    Feminine form and other verb moods not shown

       yaktubān  ن yaktubūn

      ن taktubūn  naktubu

    PluralDualSingular 

      yaktubu3

       taktubān  taktubu2آا aktubu1

  • 8/21/2019 Introduction to Arabic NLP

    54/87

    54

    Road Map• Introduction

    • Orthography• Morphology

     – Derivational Morphology

     – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    55/87

    55

    Morphological Ambiguity

    • Derivational ambiguity – : basis/principle/rule, military base,Qa'ida/Qaeda/Qaida

    • Inflectional ambiguity – /taktub/: you write, she writes – Segmentation ambiguity

    • و: he found; +و : and+grandfather •  : + ل : for a language; ا+ ل : for the language

  • 8/21/2019 Introduction to Arabic NLP

    56/87

    56

    Morphological Ambiguity• Spelling ambiguity

     – Optional diacritics• آ: /kātib/ writer , /kātab/ to correspond

     – Suboptimal spelling• Hamza dropping: ,أ ا

    • Undotted ta-marbuta: ة   • Undotted final ya:

    ي

     • Multiple sources of ambiguity

    بني

     – /bayyana/ Verb he demonstrated  – /bayyanna/ Verb they [feminine] demonstrated  – /bayyin/ Adj clear/evident/explicit  – /bayna/ Prep between/among  – /biyin/ Proper Noun in Yen

     – /biyn/ Proper Noun Ben

    Morphological Disambiguation

  • 8/21/2019 Introduction to Arabic NLP

    57/87

    57

    Morphological Disambiguation

    in English

    • Select a morphological tag that fullydescribes the morphology of a word

    • Complete English morphological tag set

    (Penn Treebank): 48 tagsVerb:

    • Same as “POS Tagging” in English

    goesgogonegoingwentgo

    VBZVBPVBNVBGVBDVB

    Morphological Disambiguation

  • 8/21/2019 Introduction to Arabic NLP

    58/87

    58

    • Morphological tag has 14 subtags

    corresponding to different linguistic categories – Example:VerbGender(2), Number(3), Person(3), Aspect(3),Mood(3), Voice(2), Pronominal clitic(12),

    Conjunction clitic(3)• 22,400 possible tags

     – Different possible subsets

    • 2,200 appear in Penn Arabic Tree Bank Part 1(140K words)• Example solution: MADA (Habash&Rambow 2005)

    Morphological Disambiguation

    in Arabic 

    MADA (Habash&Rambow 2005)(Habash&Rambow 2007)

    (Roth et al 2008)

  • 8/21/2019 Introduction to Arabic NLP

    59/87

    59

    W-3 W-2 W-1 W0 W1 W2 W3 W4W-4

    MORPHOLOGICALANALYZER 

    MORPHOLOGICALCLASSIFIERS

    • Rule-based

    • Human-created

    • Multiple independentclassifiers

    • Corpus-trained

    2nd

    3rd

    5th

    4th

    1st

    RANKER 

    • Heuristic orcorpus-trained

    (Roth et al. 2008)

    MADA(Habash&Rambow 2005)(Habash&Rambow 2007)

    (Roth et al 2008)

  • 8/21/2019 Introduction to Arabic NLP

    60/87

    60

    MADAMorphological Analysis and Disambiguation for Arabic

    ;;; SENTENCE AsbAnyA tnfy tjmyd AlmsAEdp AlmmnwHp llmgrb;;WORD AsbAnyA;;MADA: AsbAnyA art-NO aspect-NA case-NOCASE clitic-NO conj-NO def-DEF mood-NA n

    um-SG part-NO per-3 pos-PN voice-NA*0.78571

  • 8/21/2019 Introduction to Arabic NLP

    61/87

    61

    Road Map• Introduction

    • Orthography• Morphology

     – Derivational Morphology

     – Inflectional Morphology – Morphological Ambiguity – Arabic Computational Morphology

    • Syntax

  • 8/21/2019 Introduction to Arabic NLP

    62/87

    62

    Arabic Computational Morphology• Representation units• Natural token ت ـ ـ ـ ـ ـ ـ ـ ـ و

     – White space separated strings (as is) – Can include extra characters (e.g. tatweel/kashida)

    •Word وت

    • Segmented word – Can include any degree of morphological analysis – Pure segmentation:

    ت

    ل

     و

     – Arabic Treebank tokens (with recovery of somedeleted/modified letters):

    ت

    ل 

     و

     

  • 8/21/2019 Introduction to Arabic NLP

    63/87

    63

    Arabic Computational Morphology• Representation units (continued)• Prefix + Stem + Suffix

     –و++ 

     – Can create more ambiguity• Lexeme + Features

     – [+Plural +Def + +ل ]

    • Root + Pattern + Features –

    آ

    a3a21aم

    + [+Plural +Def +ل

    ] – Very abstract

    • Root + Pattern + Vocalism + Features – آ + 321م + a.a.a + [+Plural +Def +ل [و+ – Very very abstract

    (Habash&Sadat 2006)

  • 8/21/2019 Introduction to Arabic NLP

    64/87

    64

    TOKAN• A generalized tokenizer • Assumes disambiguated morphological analysis

     – a la MADA

    • Declarative specification of tokenization scheme wsyktbhA=[katab_1 POS:V +IV w+ s+ +S:3MS +O:3FS]

    • Uses generator (Habash 2004)

    w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S:ENw+ s+ ktb/VBZS:3MS +hA

    w+ f+ b+ k+ l+ REST +P: +O:TBw+ syktb +hA

    w+ f+ b+ k+ l+ s+ Al+ REST +P: +O:D3w+ s+ yktb +hA

    w+ f+ b+ k+ l+ s+ RESTD2w+ s+ yktbhA

    w+ f+ RESTD1w+ syktbhA

    SpecificationSchemeExample

  • 8/21/2019 Introduction to Arabic NLP

    65/87

    65

    Arabic Computational Morphology• Approaches – Finite state machines (Beesely,2001) (Kiraz,2001) (Habash&Rambow 2006) – Concatenative analysis/generation (Smrz, 2007) (Buckwlater,2002)

    (Cavalli-Sforza et al, 2000)

     – Lexeme+Feature analysis/generation (Habash, 2004) (Habash&Rambow2006)

     – Shallow stemming (Darwish,2002) (Aljlayl and Frieder 2002) – Machine learning (Diab et al,2004) (Lee et al,2003) (Rogati et al, 2003)

    (Habash & Rambow 2005a)

     – Survey article: (Al-Sughaiyer&Al-Kharashi, 2004)• Issues

     – Appropriateness of system representation for an application• Machine Translation vs. Information Retrieval

    • Arabic spelling vs. phonetic spelling – System coverage – System extendibility – Availability to researchers – Use for analysis and generation

  • 8/21/2019 Introduction to Arabic NLP

    66/87

    66

    Road Map• Introduction

    • Orthography• Morphology

    • Syntax – Morphology and Syntax

     – Sentence Structure – Phrase Structure

     – Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    67/87

    67

    Morphology and Syntax• Rich morphology crosses into syntax

     – Pro-drop / Subject conjugation

     – Verb sub-categorization and object clitics• Verbtransitive+subject+object• Verbintransitive+subject but not Verbintransitive+subject+object• Verbpassive+subject but not Verbpassive+subject+object

    • Morphological interactions with syntax – Agreement• Full: e.g. Noun-Adjective on number, gender, and definiteness (for

    persons)• Partial: e.g. Verb-Subject on gender (in VSO order)

     – Definiteness• Noun compound formation, copular sentences, etc.• Nouns+DefiniteArticle, Proper Nouns, Pronouns, etc.

  • 8/21/2019 Introduction to Arabic NLP

    68/87

    68

    Morphology and Syntax• Morphological interactions with syntax (continued)

     – Case

    • MSA is case marking: nominative, accusative, genitive• Almost-free word order • Case is often marked with optionally written short vowels

     – This effectively limits the word-order freedom in published text

    • Agglutination – Attached prepositions create words that cross phraseboundaries

    ل

     

    li+Almaktabātfor the-libraries [PP li [NP Almaktabāt]]

    • Some morphological analysis (minimally segmentation)is necessary even for statistical approaches to parsing

  • 8/21/2019 Introduction to Arabic NLP

    69/87

    69

    Road Map• Introduction

    • Orthography• Morphology

    • Syntax – Morphology and Syntax

     – Sentence Structure – Phrase Structure

     – Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    70/87

    70

    Sentence StructureTwo types of Arabic Sentences

    • Verbal sentences – [Verb Subject Object] (VSO) – آا دوا 

    Wrote the-boys the-poems

    The boys wrote the poems• Copular sentences (aka nominal sentences)

     – [Topic Complement]

     – د

    و

    اا

    the-boys poetsThe boys are poets

  • 8/21/2019 Introduction to Arabic NLP

    71/87

    71

    Sentence Structure• Verbal sentences

     – Verb agreement with gender only• Default singular number • ا آ\وا wrote3MascSing the-boy/the-boys

    • آا \ا wrote3FemSing the-girl/the-girls

     – Pronominal subjects are conjugated• آ wrote-youMascSing• آ wrote-youMascPlur • آ wrote-theyMascPlur 

     – Passive verbs• Same structure: Verbpassive SubjectunderlyingObject• Agreement with surface subject

  • 8/21/2019 Introduction to Arabic NLP

    72/87

    72

    Sentence Structure• Verbal sentences

     – Common structural ambiguity• Third masculine/feminine singular 

     

    is structurally

    ambiguous

     – Verb3MascSingular NounMascVerb subject=he object=Noun

    Verb subject=Noun

    • Passive and active forms are often similar in

    standard orthography – آ /kataba/ he wrote – آ /kutiba/ it was written

  • 8/21/2019 Introduction to Arabic NLP

    73/87

    73

    Sentence Structure• Copular sentences

     – [Topic Complement]

    Definite Topic, Indefinite Complement• ا

    the-boy poetThe boy is a poet 

     – [Auxiliary Topic Complement]Auxiliaries (kāna and her sisters)• Tense, Negation, Transformation, Persistence•   ا  آن was the-boy poet The boy was a poet •   ا is-not the-boy poet The boy is not a poet 

     – Inverted order is expected in certain cases• Indefinite topic آ ي /ʕindi kitābun/ at-me a-book I have a book 

  • 8/21/2019 Introduction to Arabic NLP

    74/87

    74

    Sentence Structure• Copular sentences – Types of complements

    • Noun/Adjective/Adverb – اآذ the-boy smart The boy is smart 

    • Prepositional Phrase – ا ا  the-boy in the-library The boy is in the library 

    • Copular-Sentence – اآ آ [the-boy [book-his big]] The boy, his book is big 

    • Verb-Sentence –  اآ اود

    [the-boys [wrote3rdMascPlur  poems]] The boys wrote the poems

     – Full agreement in this order (SVO)

     –  اوآ ار

    [the-poems [wrote3rdMascSing-them the boys]] The poems, the boys wrote

  • 8/21/2019 Introduction to Arabic NLP

    75/87

    75

    Road Map• Introduction

    • Orthography• Morphology

    • Syntax – Morphology and Syntax

     – Sentence Structure – Phrase Structure

     – Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    76/87

    76

    Phrase Structure• Noun Phrase

     – Determiner Noun Adjective PostModifier 

    •  ان this the-writer the-ambitious the-arriving from Japanها ا اح ادمThis ambitious writer from Japan

     – Noun-Adjective agreement• number, gender, definiteness

     –  ا ا the-writer FemSing the-ambitiousFemSing –  ا  ا the-writer FemPlur  the-ambitiousFemPlur 

    • Exception: Plural non-persons – definiteness agreement; feminine singular default – ا ا the-officeMascSing the-newMascSing – ا ا the-libraryFemSing the-newFemSing – ا ا the-officesMascBPlur  the-newFemSing – ا  ا the-librariesFemPlur the-newFemSing

  • 8/21/2019 Introduction to Arabic NLP

    77/87

    77

    Phrase Structure• Noun Phrase

     – Idafa construction (ا)

    • Noun1 of Noun2 encoded structurally• Noun1-indefinite Noun2-definite اردن •

    king Jordanthe king of Jordan / Jordan’s king 

     – Noun1 becomes definite• Agrees with definite adjectives

     – Idafa chains• N 1

    indef

    N 2 indef

    … N n-1indef

    N ndef • آا ةرادا  ر ر ا

    son uncle neighbor chief committee management the-companyThe cousin of the CEO’s neighbor 

  • 8/21/2019 Introduction to Arabic NLP

    78/87

    78

    Phrase Structure• Morphological definiteness interacts with syntactic structure

    Indefinitedefinite

    Noun Phrase

     آAn artist(ic) writer 

    Copular Sentence

     اThe writer is an artist

    Noun Compound 

     آاThe writer of the artist

    Noun Phrase

     ا  اThe artist(ic) writer 

    Word 1 آ writer 

       W  o  r

       d   2 

                   a      r        t        i      s        t

       d  e   f   i  n   i   t  e

       i  n   d  e   f   i  n   i   t  e

    Agreement in Arabic

  • 8/21/2019 Introduction to Arabic NLP

    79/87

    79

    g• Verb-Subject agreement

     – Verb agrees with subject in full (gender,number)• Exception: partial agreement (number=singular) in VSO order • Exception: partial agreement (number=singular; gender=feminine) for non-person plural

    subjects regardless of order • Noun-Adjective

     – Adjective agrees with noun in full (gender, number, definiteness and case)• Exception: partial agreement (number=singular; gender=feminine) for non-person plural

    nouns

    • Noun-Number 

     – Number is the syntactic-case head – for numbers [3..10]: Noun is plural+genitive (idafa); number gender is inverted gender

    of noun! – for numbers [11..99]: Noun is singular+accusative (tamyiyz/specification); number

    gender is even more complicated☺ – for numbers [100,1K,1M]: Noun is singular+genitive (idafa)

    Adjectives of plural non-person nouns are Fem+Sg

    Numbers agrees bygender inversion

    Verbs in VSO order are alwaysSg and agree in gender only

    Fem+Sg+GenFem+PL+GenMasc+Sg+NomFem+Sg

     jdydp ‘new’  jAmEAt ‘universities’ vlAv ‘three’ bnyt ‘was built’ 

  • 8/21/2019 Introduction to Arabic NLP

    80/87

    80

    Road Map• Introduction

    • Orthography• Morphology

    • Syntax – Morphology and Syntax

     – Sentence Structure – Phrase Structure

     – Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    81/87

    81

    Computational Resources• Monolingual corpora for building language models

     – Arabic Gigaword

    • Agence France Presse• AlHayat News Agency• AnNahar News Agency• Xinhua News Agency

     – Arabic Newswire – United Nations Corpus (parallel with other UN languages) – Ummah Corpus (parallel with English)

    • Distributors – Linguistic Data Consortium (LDC) – Evaluations and Language resources Distribution Agency

    (ELDA)

    Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    82/87

    82

    p

    • Penn Arabic Treebank (PATB) – Started in 2001

     – Goal is 1 Million words – Currently 650K words

    • Agence France Presse , AlHayat newspaper, AnNahar newspaper

    • POS tags – Buckwalter analyzer – Arabic-tailored POS list

    • PATB constituencyrepresentation – Some modifications of Penn English Treebank

    • (e.g. Verb-phrase internal subjects)

    Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    83/87

    83

    p

    • Prague Dependency Treebank

    • Partial overlap with PATBand Arabic Gigaword – Agence France Presse,

    AlHayat and Xinhua• Morphological analysis

     – Extends on PATB

    • Dependency representation

    Graphic courtesy of Otakar Smrž: http://ckl.mff.cuni.cz/padt/PADT_1.0/docs/slides/2003-eacl-trees.ppt

    CATiB:Columbia Arabic Tree Bank(Habash, 2009)

  • 8/21/2019 Introduction to Arabic NLP

    84/87

    84

    • CATiB Representation – Lite Dependency Syntax 

     – Tokenization• CONJ PART BASE PRON – Part of Speech Tag set

    • Six tags: VRB, VRB-pass, NOM,PROP, PRT, PNX

     – Syntactic Annotation• Eight dependency relations: SBJ,

    OBJ, TPC, PRD, IDF, TMZ, MOD, FLAT

    • Practical Considerations

     – Less information to annotate – Dependencies easier to annotate than phrase structure – Terminology and representation close to traditional

    Arabic grammar 

    Computational Resources

  • 8/21/2019 Introduction to Arabic NLP

    85/87

    85

    • Applications using Arabic treebanks – Statistical parsing

    • Bikel’s parser (Bikel 2003)

     – Same engine used with English, Chinese and Arabic• Nivre’s MALT parser (Nivre et al. 2006) – Base-phrase Chunking

    • (Diab et al, 2004; Diab et al. 2007) – POS tagging and morphological disambiguation

    • (Diab et al, 2004; Diab et al. 2007; Habash and Rambow, 2005a; Smith etal., 2005; Roth et al. 2008)

    • Other non-treebank-based POS tagging efforts: (Khoja, 2001)

    • Formalism conversion – Constituency to dependency (Žabokrtský and Smrž 2003; Habash et al.

    2007; Tounsi et al., 2009)

     – Tree-adjoining grammar extraction (Habash and Rambow 2004)• Automatic diacritization

     – Zitouni et al. (2006); Habash&Rambow (2007); Shaalan et al (2008)among others

     – Diacritization for MT (Diab et al. 2007)

    Other Tutorial Slides

  • 8/21/2019 Introduction to Arabic NLP

    86/87

    86

    • Columbia’s Arabic Dialect

    Modeling Group (CADIM) – http://www1.ccls.columbia.edu/~cadim/

    • Presentations

    MEADR 2009Cairo, Egypt

    A il 21 2009

  • 8/21/2019 Introduction to Arabic NLP

    87/87

    87

    Introduction to ArabicNatural Language Processing

    Nizar HabashColumbia University

    Center for Computational Learning Systems

    [email protected]

    April 21, 2009

    CADIM