Intuitive Coding of the Arabic Lexicon
Ali Farghaly & Jean Senellart
SYSTRAN Software Corporation
San Diego, CA & Soisy, France
Purpose
• To report on SYSTRAN’s experience in building an Arabic monolingual dictionary as a component of SYSTRAN’s Arabic-English Machine Translation System
• To describe the methodology and implementation adopted for dictionary building and morphological analysis
Overview
• SYSTRAN’s Arabic-English MT System
• SYSTRAN’s Intuitive Coding Technology
• Intuitive Coding of the Arabic Lexicon
– Stem-based
– Statistical Arabic stem generation
– Internal morphology
– External morphology
SYSTRAN’s Arabic-English MT System
• An end-to-end MT system
• Development started July 2002
• Using SYSTRAN’s NG technology
– Declarative modules
– State-of-the-art Arabic linguistic knowledge
– Transfer approach
– Hybrid approach combining statistical techniques and linguistic knowledge
SYSTRAN’s Intuitive Coding Technology
• Customizing MT systems to improve translation quality
• Building user specific dictionaries
- by the developers
- by the user
- collaboration
• SYSTRAN’s decision:
• Let the user do the customization
Intuitive Coding
• (Senellart et al., 2003)
• Dictionary representation should be simple
• Automatic processing of user information
• Interactive processing
• Multi level coding algorithm
• Complete integration
• Easy to use Graphic Interface
Stem Based Arabic lexicon
• Following the spirit of Senellart (2003), we opted for intuitive coding of the Arabic lexicon:
• What are the building blocks of the Arabic dictionary?
• A – roots
• B - stems
Why Stems?
• Stems are more intuitive than roots
• Eliminates the need for morphological patterns “الميزان الصرفي”
• Eliminates overgeneralization of Arabic stems
• Subcategorization frames, syntactic and semantic information are stem-specific and not root-specific
Sample Entry
• 1016 إنتصر verb plain "[perfect=إنتصر],[imperfect=ينتصر],[passper=إنتصر],[imperative=إنتصر],[passimp=ينتصر]"[+AINT+GPP+HUSUBJ]
Statistical Arabic Stem Generator
• To reduce amount of typing
• To speed up entry creation
• 60% increase in lexicographers’ productivity
• Uses morphological rules that are most productive
Generator Output
• [perfect=قال],[imperfect=يقال],[imperative=إقال],[passperf=قال],[passimperf=يقال]
• [perfect=كتب],[imperfect=يكتب],[imperative=أكتب],[passperf=كتب],[passimperf=يكتب]
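A minimal sketch of how a generator of this kind might propose the five forms from an unvocalized perfect stem. The rules used here (prefix ي for the imperfect, an initial alef for the imperative, passive forms spelled like the active ones in unvocalized script) are illustrative assumptions, not SYSTRAN's actual rule set, and `generate_forms` and `format_entry` are hypothetical names.

```python
# Hypothetical default rules for unvocalized principal parts.
# They are assumptions for illustration, not SYSTRAN's productive-rule set.
ALEF_VARIANTS = ("ا", "أ", "إ")


def generate_forms(perfect, imperative_prefix="ا"):
    """Propose five principal parts from an unvocalized perfect stem."""
    # Stems longer than three letters that begin with an alef
    # (e.g. انتصر) drop that alef before a prefix is attached.
    if len(perfect) > 3 and perfect.startswith(ALEF_VARIANTS):
        base = perfect[1:]
    else:
        base = perfect
    return {
        "perfect": perfect,
        "imperfect": "ي" + base,
        "imperative": imperative_prefix + base,
        "passperf": perfect,
        "passimperf": "ي" + base,
    }


def format_entry(forms):
    """Serialize the forms in the bracketed [name=value] notation."""
    return ",".join(f"[{name}={value}]" for name, value in forms.items())


print(format_entry(generate_forms("كتب", imperative_prefix="أ")))
```

The proposals are only defaults: the spelling of the initial hamza is left to the caller through `imperative_prefix`, and irregular stems would still need manual correction by a lexicographer.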
Arabic Morphology
• SYSTRAN has two different modules:
• 1. Internal Morphology
• 2. External Morphology
• Two separate modules in a feeding order
Internal Morphology Module
• Generates all different inflected forms of a given stem and adds morphological information to be used in syntactic processing
The Input to Internal Morphology Module
• Input: Two files:
• 1. stem files
• 2. Morphological Rules file
• Output
• Inflected Dictionary file
Sample of output
• كتب verb plain كتبن +past+fem+3P+plural
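The step from a stem file and a rules file to an inflected dictionary can be sketched as follows. The rule table format, the function names, and all rules except the ن suffix (which appears in the sample output above) are assumptions made for illustration; the real rule file format is not described here.

```python
# Hypothetical rule table: (suffix, feature string).
# Only the "ن" rule is attested in the sample output; the others are
# standard unvocalized past-tense endings added for illustration.
PAST_SUFFIX_RULES = [
    ("", "+past+masc+3P+sing"),
    ("ت", "+past+fem+3P+sing"),
    ("وا", "+past+masc+3P+plural"),
    ("ن", "+past+fem+3P+plural"),
]


def inflect(stem, pos, rules):
    """Return one dictionary line per rule: stem, POS, surface form, features."""
    return [f"{stem} {pos} {stem}{suffix} {features}" for suffix, features in rules]


def build_inflected_dictionary(stems, rules):
    """Apply every rule to every (stem, pos) pair from the stem file."""
    lines = []
    for stem, pos in stems:
        lines.extend(inflect(stem, pos, rules))
    return lines


for line in build_inflected_dictionary([("كتب", "verb plain")], PAST_SUFFIX_RULES):
    print(line)
```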
Syntagmatic and Paradigmatic (Halliday 1972) Morphology
[Diagram: forms of the stem شاهد (including يشاهد, سيشاهد, نشاهد, يشاهدونهن) and attached clitics (ف, ل, ها, هم, ه), grouped under the labels Internal and External]
External Morphology Module
• Decomposes a token into different part-of-speech units
• Follows morphosyntactic rules of the language
• It is the syntax of morphemes
• It has morphophonemic component
Sample of External Morphology Rules
• WAFA := <CONJ.و | CONJ.ف>
• KABILI := <PREP.ك | PREP.ب | PREP.ل>
• LI := <PREP.ل>
• {WAFA}?_{AL}_<NOUN:-PROPERNOUN|ADJ|DET:QUANTIFIER|NUMERIC:CARDINAL>
• {WAFA}?_{NOUNADJ}_<PRON:PERSPOSS>
• {WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS>
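One way to read a rule such as {WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS> is as a sequence of named slots, some optional, that together must cover the whole token. The sketch below compiles that single rule into a regular expression. It is only an approximation under stated assumptions: the macro contents, the pronoun list, and the use of "any run of Arabic letters" as a stand-in for NOUNADJ are all hypothetical, and `compile_rule` and `decompose` are illustrative names rather than part of SYSTRAN's rule engine.

```python
import re

# Hypothetical macro definitions; NOUNADJ is a placeholder that accepts
# any run of two or more Arabic letters instead of consulting a lexicon.
MACROS = {
    "WAFA": "[وف]",
    "KABILI": "[كبل]",
    "NOUNADJ": "[\u0621-\u064A]{2,}?",
    "PRON": "(?:هما|هم|هن|ها|ه|كم|ك|نا|ي)",
}

# (slot name, optional?) pairs for {WAFA}?_{KABILI}_{NOUNADJ}_<PRON:PERSPOSS>
RULE_PARTS = [("WAFA", True), ("KABILI", False), ("NOUNADJ", False), ("PRON", False)]


def compile_rule(parts):
    """Turn a list of slots into one regex with a named group per slot."""
    pattern = ""
    for name, optional in parts:
        pattern += f"(?P<{name}>{MACROS[name]})" + ("?" if optional else "")
    return re.compile(pattern)


RULE = compile_rule(RULE_PARTS)


def decompose(token):
    """Split a token into (slot, text) units, or return None if the rule fails."""
    match = RULE.fullmatch(token)
    if match is None:
        return None
    return [(name, match.group(name)) for name, _ in RULE_PARTS if match.group(name)]


print(decompose("وبكتابه"))
```

Because the NOUNADJ placeholder accepts any letters, a pattern like this over-segments words whose first letter merely looks like a preposition, which is why candidate segmentations need to be checked against the dictionary.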
Order of Application
• External morphology has to apply before internal morphology and before the lookup in the monolingual inflected dictionary
• Thus the output of the external morphology module feeds the internal morphology module
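The feeding order can be illustrated with a toy pipeline: clitics are peeled off first, and only the remaining core is looked up in the inflected dictionary. The dictionary contents, the clitic table, and the `analyze` function are hypothetical and far smaller than a real system would need; the point is only that a direct lookup of the full token fails while the decomposed core succeeds.

```python
# Toy inflected dictionary: surface form -> (stem, POS, features).
# Entries are illustrative, modeled on the sample output line.
INFLECTED = {
    "كتب": ("كتب", "verb plain", "+past+masc+3P+sing"),
    "كتبن": ("كتب", "verb plain", "+past+fem+3P+plural"),
}

# Hypothetical proclitic table covering only the two conjunctions.
PROCLITICS = {"و": "CONJ", "ف": "CONJ"}


def analyze(token):
    """Run the external step (clitic peeling) before the dictionary lookup."""
    # Candidate 1: the token carries no clitic at all.
    candidates = [([], token)]
    # Candidate 2: the first letter is a conjunction attached to the core.
    if len(token) > 1 and token[0] in PROCLITICS:
        candidates.append(([(token[0], PROCLITICS[token[0]])], token[1:]))

    analyses = []
    for clitics, core in candidates:
        entry = INFLECTED.get(core)
        if entry is not None:
            analyses.append((clitics, core, entry))
    return analyses


print(INFLECTED.get("وكتبن"))  # the undecomposed token is not in the dictionary
print(analyze("وكتبن"))
```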
Conclusion
• SYSTRAN’s monolingual dictionary has about 30,000 entries
• Coverage of newspapers’ discourse is over 90%
• The approach outlined in this paper has greatly accelerated development
• Analysis, homograph resolution and transfer rules are being added and implemented.