Intuitive Coding of the Arabic Lexicon

Ali Farghaly & Jean Senellart

SYSTRAN Software Corporation

San Diego, CA & Soisy, France

Purpose

• To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System

• To describe the methodology and implementation adopted for dictionary building and morphological analysis

Overview

• SYSTRAN’s Arabic-English MT System

• SYSTRAN’s Intuitive Coding Technology

• Intuitive Coding of the Arabic Lexicon

– Stem-based

– Statistical Arabic stem generation

– Internal morphology

– External morphology

SYSTRAN’s Arabic-English MT System

• An end-to-end MT system

• Development started July 2002

• Using SYSTRAN’s NG technology

– Declarative modules

– State-of-the-art Arabic linguistic knowledge

– Transfer approach

– Hybrid approach combining statistical techniques and linguistic knowledge

SYSTRAN’s Intuitive Coding Technology

• Customizing MT systems to improve translation quality

• Building user specific dictionaries

- by the developers

- by the user

- collaboration

• SYSTRAN’s decision:

• Let the user do the customization

Intuitive Coding

• (Senellart et al., 2003)

• Dictionary representation should be simple

• Automatic processing of user information

• Interactive processing

• Multi level coding algorithm

• Complete integration

• Easy to use Graphic Interface

Stem Based Arabic lexicon

• Following the spirit of Senellart et al. (2003), we opted for intuitive coding of the Arabic lexicon:

• What are the building blocks of the Arabic dictionary?

• A - roots

• B - stems

Why Stems?

• Stems are more intuitive than roots

• Eliminates the need for morphological patterns “الميزان الصرفي”

• Eliminates overgeneralization of Arabic stems

• Subcategorization frames, syntactic and semantic information are stem-specific and not root-specific

Sample Entry

• 1016 �ص�ر� َت �ْن ِإ verb plain"[perfect= �ص�ر� َت =imperfect],[ِإْن �ص�ر� َت ,[يْن[passper=ص�ر� َت �ص�ر=imperative],[ِإْن َت ,[ِإْن[passimp=ر�ص� َت "[يْن[+AINT+GPP+HUSUBJ]

Statistical Arabic Stem Generator

• To reduce amount of typing

• To speed up entry creation

• 60% increase in lexicographers’ productivity

• Uses the most productive morphological rules
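
The idea of proposing forms from the most productive rule can be illustrated with a minimal sketch. It is not SYSTRAN's generator: the training pairs, the function names, and the restriction to unvocalized imperfect forms are all assumptions made for illustration. The sketch counts which prefix rewrite most often maps a perfect stem to its imperfect and applies that rewrite to a new stem, leaving the lexicographer to correct irregular cases.

```python
from collections import Counter

# Hypothetical, unvocalized (perfect, imperfect) pairs already in the lexicon.
OBSERVED = [
    ("كتب", "يكتب"),
    ("درس", "يدرس"),
    ("شاهد", "يشاهد"),
    ("انتصر", "ينتصر"),
]


def learn_prefix_rule(pairs):
    """Return the most frequent (strip, add) prefix rewrite in the pairs."""
    counts = Counter()
    for perfect, imperfect in pairs:
        # Measure the longest common suffix of the two forms.
        i = 0
        limit = min(len(perfect), len(imperfect))
        while i < limit and perfect[-1 - i] == imperfect[-1 - i]:
            i += 1
        strip = perfect[: len(perfect) - i]    # prefix removed from the perfect
        add = imperfect[: len(imperfect) - i]  # prefix added for the imperfect
        counts[(strip, add)] += 1
    return counts.most_common(1)[0][0]


def propose_imperfect(perfect, rule):
    """Apply the learned rewrite; return None if the stem does not match it."""
    strip, add = rule
    if not perfect.startswith(strip):
        return None
    return add + perfect[len(strip):]


rule = learn_prefix_rule(OBSERVED)
proposal = propose_imperfect("قال", rule)
entry = f"[perfect=قال],[imperfect={proposal}]"
```

The proposal for a hollow verb such as قال is only a default guess from the dominant pattern, so the generated entry is a draft that saves typing, not a verified analysis.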

Generator Output

• [perfect=قال],[imperfect=يقال],[imperative=إِقال],[passperf=قال],[passimperf=يقال]

• [perfect=

• �َب� �َت َت�َب=imperfect],[َك �ْك َت�َب=imperative],[ي �َك ,[ُأ[passperf= �َب� �َت َت�َب=passimperf],[َك �ْك [ي

Arabic Morphology

• SYSTRAN has two different modules:

• 1. Internal Morphology

• 2. External Morphology

• Two separate modules in a feeding order

Internal Morphology Module

• Generates all different inflected forms of a given stem and adds morphological information to be used in syntactic processing

The Input to Internal Morphology Module

• Input: Two files:

• 1. stem files

• 2. Morphological Rules file

• Output

• Inflected Dictionary file

Sample of output

• كتب verb plain كتبن +past+fem+3P+plural
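
As a rough illustration of how a stem file and a rules file could be combined into inflected dictionary lines of the shape shown above, here is a minimal sketch. The `Rule` structure, the in-memory stand-ins for the two files, and the second rule with its `+masc` tag are assumptions; only the first rule's output line is taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    form: str      # stem form the rule attaches to, e.g. "perfect"
    prefix: str    # inflectional prefix (may be empty)
    suffix: str    # inflectional suffix (may be empty)
    features: str  # morphological tags passed on to syntactic processing


# Stand-in for the stem file: lemma -> (category, {form name: stem}).
STEMS = {
    "كتب": ("verb plain", {"perfect": "كتب", "imperfect": "يكتب"}),
}

# Stand-in for the morphological rules file.
RULES = [
    Rule("perfect", "", "ن", "+past+fem+3P+plural"),    # matches the slide
    Rule("perfect", "", "وا", "+past+masc+3P+plural"),  # hypothetical rule
]


def inflect(stems, rules):
    """Return one inflected-dictionary line per applicable (stem, rule) pair."""
    lines = []
    for lemma, (category, forms) in stems.items():
        for rule in rules:
            stem = forms.get(rule.form)
            if stem is None:
                continue  # this entry has no stem for the rule's form
            surface = rule.prefix + stem + rule.suffix
            lines.append(f"{lemma} {category} {surface} {rule.features}")
    return lines


inflected_lines = inflect(STEMS, RULES)
```

Keeping the rules as data rather than code mirrors the declarative split between stem files and a rules file described in the slides.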

Syntagmatic and Paradigmatic (Halliday 1972) Morphology

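[Figure: forms built on the verb شاهد arranged along two axes labeled Internal and External. Legible tokens include يشاهد, سيشاهد and يشاهدونهن, along with detached affixes such as ف, ل, هم, ها and ه.]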

External Morphology Module

• Decomposes a token into different part-of-speech units

• Follows morphosyntactic rules of the language

• It is the syntax of morphemes

• It has a morphophonemic component

Sample of External Morphology Rules

• WAFA := <CONJ.و | CONJ.ف>

• KABILI := <PREP.كَ | PREP.بِ | PREP.ل>

• LI := <PREP.ل>

• {WAFA}?_{AL}_<NOUN:-PROPERNOUN|ADJ|DET:QUANTIFIER|NUMERIC:CARDINAL>

• {WAFA}?_{NOUNADJ}_<PRON:PERSPOSS>

• {WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS>
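
To make the rule patterns concrete, the following sketch enumerates candidate decompositions of a token according to the three patterns, using the WAFA and KABILI classes from the slide. It is not SYSTRAN's rule engine: the reading of {AL} as the definite article ال, the pronoun list, the tag names, and the function name are assumptions, and no part-of-speech check is performed on the remaining stem.

```python
WAFA = ("و", "ف")         # conjunction proclitics, as in the WAFA class
KABILI = ("ك", "ب", "ل")  # preposition proclitics, as in the KABILI class
AL = "ال"                 # assumed value of {AL}: the definite article
PRON = ("هما", "هم", "هن", "ها", "كم", "نا", "ه", "ك", "ي")  # hypothetical list


def segment(token):
    """Return every candidate decomposition licensed by the three patterns."""
    results = []
    for conj in ("",) + WAFA:
        if not token.startswith(conj):
            continue
        rest = token[len(conj):]
        head = [("CONJ", conj)] if conj else []

        # Pattern 1: {WAFA}?_{AL}_<noun, adjective, quantifier or cardinal>
        if rest.startswith(AL) and len(rest) > len(AL):
            results.append(head + [("AL", AL), ("STEM", rest[len(AL):])])

        # Patterns 2 and 3: {WAFA}?_({KABILI}_)?{NOUNADJ}_<PRON:PERSPOSS>
        for prep in ("",) + KABILI:
            if not rest.startswith(prep):
                continue
            body = rest[len(prep):]
            middle = [("PREP", prep)] if prep else []
            for pron in PRON:
                if body.endswith(pron) and len(body) > len(pron):
                    stem = body[: len(body) - len(pron)]
                    results.append(
                        head + middle + [("STEM", stem), ("PRON", pron)]
                    )
    return results


candidates = segment("وبكتابهم")
```

The function deliberately overgenerates: for وبكتابهم it also proposes stems such as وبكتاب, and a later dictionary lookup is expected to discard candidates whose stem is not a known form.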

Order of Application

• External morphology has to apply before internal morphology and before the lookup in the monolingual inflected dictionary

• Thus the output of the external morphology module feeds the internal morphology module
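
The feeding order can be sketched as a two-step analysis: clitics are stripped first, and only the remaining core is looked up among the inflected forms. The tiny dictionary, the clitic lists, and the function name below are assumptions chosen to keep the example self-contained.

```python
# Stand-in for the inflected dictionary: inflected form -> lemma.
INFLECTED = {
    "كتبن": "كتب",
    "يشاهد": "شاهد",
    "يشاهدون": "شاهد",
}

PROCLITICS = ("", "و", "ف", "س")          # hypothetical proclitic list
ENCLITICS = ("", "هم", "هن", "ها", "ه")   # hypothetical enclitic list


def analyze(token):
    """Strip clitics (external step), then look up the core (internal step)."""
    analyses = []
    for pro in PROCLITICS:
        if not token.startswith(pro):
            continue
        rest = token[len(pro):]
        for enc in ENCLITICS:
            if not rest.endswith(enc):
                continue
            core = rest[: len(rest) - len(enc)]
            # Only cores found in the inflected dictionary survive.
            if core in INFLECTED:
                analyses.append((pro, core, enc, INFLECTED[core]))
    return analyses


future = analyze("سيشاهد")
with_object = analyze("يشاهدونهن")
```

Because the dictionary stores only clitic-free inflected forms, it stays much smaller than one listing every clitic combination, which is one plausible motivation for running the external step first.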

Conclusion

• SYSTRAN’s monolingual dictionary has about 30,000 entries

• Coverage of newspaper discourse is over 90%

• The approach outlined in this paper has greatly accelerated development

• Analysis, homograph resolution and transfer rules are being added and implemented.
