John Tinsley

Morphological Analysis of Spanish Using Finite-State Transducers

ACL 4

NCLT Seminar Presentation, 7th June 2006

Introduction

What is this project about?
• Provide morphological information on Spanish strings
• Generate strings from morphological descriptions

What were my aims?
• Robust, fast application, easily integrated into other systems
• 80% token coverage on unrestricted text
• 100% coverage of Spanish morphology

Design Methodology
• Formalisation
  - Discovery of Spanish morphological rules
• Implementation
  - Coding of morphological model with Xerox Finite-State Tools
• Evaluation
  - Check for accuracy & well-formedness
  - Assess language coverage

Formalisation

Spanish Morphology - Verbs
• Inflected for person, tense/mood, number
• Regular verbs
  - 3 regular conjugations identified by infinitive endings: ‘-ar’, ‘-er’, and ‘-ir’
• Irregular verbs
  - 66 distinct irregularities
  - Varying degrees of irregularity
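
As a small illustration of the first step above, the conjugation class can be read off the infinitive ending. This is a minimal Python sketch, not part of the original system; the function name is hypothetical, and ‘vivir’ is an illustrative verb that does not appear in the slides. It only identifies the ending and says nothing about whether a verb is regular.

```python
def conjugation_class(infinitive):
    """Return the conjugation ending of an infinitive, or None if unrecognised."""
    for ending in ("ar", "er", "ir"):
        if infinitive.endswith(ending):
            return "-" + ending
    return None


for verb in ("abordar", "conocer", "vivir"):
    print(verb, conjugation_class(verb))
```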

Spanish Morphology - Nouns
• Inflected for number, gender
• 7 types of noun
  - Feminine, masculine, neutral, derivative, profession, number invariant, proper
• Irregularities
  - All arise via pluralisation
  - Accentuation, character alterations

Spanish Morphology - Adjectives
• Inflected for number, gender
• 4 types of adjective
  - Neutral, derivative, profession, irregular
• Adverbs derived from adjectives by addition of suffix ‘mente’

Implementation

Xerox Finite-State Tools - lexc
• Lexicon compiler
• Compiles ‘continuation classes’ into lexical transducers

Xerox Finite-State Tools - xfst
• Xerox finite-state tool
• Compiles regular expressions into networks
• Regular expression replace rules

  [ String -> Replacement || left-context _ right-context ]

Xerox Finite-State Tool - example
• conocer - ‘to know’
• 1st person, pres. ind. ‘conozco’
• Lexical transducer mappings

  conoc:conoc   er+Verb:ε   +PresInd:^PresInd   +1P+Sg:o

Xerox Finite-State Tool - example cont…
• Composed replace rule

  [ c -> {zc} || _ ^PresInd ]

• Triggered by the ^PresInd tag
• Makes the required changes, then the trigger is removed

Lexical: conocer+Verb+PresInd+1P+Sg

Surface: conoc^PresIndo
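
The effect of the replace rule on this example can be sketched in Python. This is only a simulation under stated assumptions, not the xfst implementation: the string conoc^PresIndo is held as a list of symbols so that the tag ^PresInd counts as one unit, and the helper names are hypothetical.

```python
# Symbols are list items so that a multi-character tag such as "^PresInd"
# is a single unit, as it would be in a finite-state network.
TRIGGER = "^PresInd"


def apply_c_to_zc(symbols):
    """Rewrite c as z c when the next symbol is the ^PresInd trigger."""
    out = []
    for i, sym in enumerate(symbols):
        nxt = symbols[i + 1] if i + 1 < len(symbols) else None
        if sym == "c" and nxt == TRIGGER:
            out.extend(["z", "c"])
        else:
            out.append(sym)
    return out


def remove_triggers(symbols):
    """Delete every symbol that starts with '^' (the trigger tags)."""
    return [s for s in symbols if not s.startswith("^")]


intermediate = list("conoc") + [TRIGGER] + ["o"]  # conoc^PresIndo
surface = "".join(remove_triggers(apply_c_to_zc(intermediate)))
print(surface)  # conozco
```

Only the stem-final c is rewritten, because it is the only c directly followed by the trigger.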

Verb Lexicon
• Coded in lexc
• Model has 3 regular paths
• 66 varieties of irregularity

e.g. poder ‘to be able to’

LEXICON Irreg43
0:^UE^VSoue^PRET1^FR    ErV ;

[o -> {ue} || _ Consonant^<4 [%^UE ?* [[%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl]]]]
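
The context of the o -> ue rule is easier to follow when traced on concrete forms. The Python sketch below simulates it over symbol lists. It rests on assumptions: the ending strings (o^1PSg, emos^1PPl) are modelled by analogy with the ‘-ar’ lexicon in the demonstration slide rather than taken from the ‘-er’ lexicon, which is not shown, and all function names are hypothetical.

```python
CONSONANTS = set("bcdfghjklmnñpqrstvwxyz")
TENSE_TRIGGERS = {"^PresInd", "^PresSubj"}
PERSON_TRIGGERS = {"^1PSg", "^2PSg", "^3PSg", "^3PPl"}


def context_matches(rest):
    """Right context: fewer than 4 consonants, then ^UE, then (anywhere later)
    a tense trigger followed (anywhere later) by a person/number trigger."""
    k = 0
    while k < len(rest) and rest[k] in CONSONANTS:
        k += 1
    if k >= 4 or k >= len(rest) or rest[k] != "^UE":
        return False
    tail = rest[k + 1:]
    for j, sym in enumerate(tail):
        if sym in TENSE_TRIGGERS and any(s in PERSON_TRIGGERS for s in tail[j + 1:]):
            return True
    return False


def diphthongise(symbols):
    """Rewrite o as u e wherever the right context matches."""
    out = []
    for i, sym in enumerate(symbols):
        if sym == "o" and context_matches(symbols[i + 1:]):
            out.extend(["u", "e"])
        else:
            out.append(sym)
    return out


def surface(symbols):
    """Drop all trigger symbols and join the remaining letters."""
    return "".join(s for s in symbols if not s.startswith("^"))


STEM = list("pod") + ["^UE", "^VSoue", "^PRET1", "^FR"]
first_sg = STEM + ["^PresInd", "o", "^1PSg"]                # assumed ending
first_pl = STEM + ["^PresInd", "e", "m", "o", "s", "^1PPl"]  # assumed ending
print(surface(diphthongise(first_sg)))  # puedo
print(surface(diphthongise(first_pl)))  # podemos
```

The first person plural keeps its stem vowel because ^1PPl is not among the person/number triggers listed in the rule.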

Noun Lexicon

LEXICON NounFem    ! Feminine Nouns
!STEM     !CONT. CLASS    ! GLOSS
acción    fIsNounEs ;     ! action

LEXICON fIsNounEs    ! feminine pluralised with 'es'
+Noun:0    fNounPluralES ;

LEXICON fNounPluralES
+Sg+Fem:0           # ;
+Pl+Fem:^NZ^NOes    # ;

[z -> c || _ %^NZ]

[ó -> o || _ ?^<5 %^NO ]
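
The two noun rules can be traced with a short Python simulation. It is a sketch under assumptions, not the compiled transducer: strings are symbol lists, the plural path appends ^NZ ^NO es as in the lexicon above, and ‘luz’ is an illustrative stem that does not appear in the slides. Function names are hypothetical.

```python
def z_to_c(symbols):
    """z -> c when the next symbol is the ^NZ trigger."""
    return [
        "c" if s == "z" and i + 1 < len(symbols) and symbols[i + 1] == "^NZ" else s
        for i, s in enumerate(symbols)
    ]


def drop_accent(symbols):
    """ó -> o when ^NO follows after fewer than 5 intervening symbols."""
    out = []
    for i, s in enumerate(symbols):
        window = symbols[i + 1:i + 6]  # ^NO may sit at offsets 0..4 after ó
        out.append("o" if s == "ó" and "^NO" in window else s)
    return out


def pluralise(stem):
    """Follow the +Pl+Fem path, apply both rules, then strip the triggers."""
    symbols = list(stem) + ["^NZ", "^NO"] + list("es")
    symbols = drop_accent(z_to_c(symbols))
    return "".join(s for s in symbols if not s.startswith("^"))


print(pluralise("acción"))  # acciones
print(pluralise("luz"))     # luces
```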

Adjective Lexicon
• Same process as noun lexicon
• Uses the same replace rules
• One exception for adverbs

LEXICON nIsAdjS
+Adj:0                 nAdjPluralS ;
+Adj|+Adv:^AAOmente    # ;

[o -> a || _ %^NAO %^AAO {mente}]

Other Transducers
• Overgeneration Filter
  - llover ‘to rain’

  ~[ $[{llov} ?* [[%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]

• Capitalisation

  [ a (->) A || .#. _ ]

• Trigger Remover

  [ %^IE -> 0 ]

• Execution script
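
What these three auxiliary networks do can be sketched in Python. This is an approximation under assumptions, not the xfst networks themselves: the filter is applied to lexical-side strings written as plain text, the capitalisation rule is limited to the letter a exactly as on the slide, and the function names are hypothetical.

```python
import re

# Person/number combinations excluded for llover: 1st or 2nd person
# (singular or plural) and 3rd person plural.
BLOCKED = re.compile(r"llov.*(?:\+(?:1P|2P)\+(?:Sg|Pl)|\+3P\+Pl)")


def passes_filter(lexical):
    """True if the lexical string is not ruled out by the overgeneration filter."""
    return BLOCKED.search(lexical) is None


def capitalisation_variants(word):
    """Optional a -> A at the start of a word: both spellings are allowed."""
    if word.startswith("a"):
        return {word, "A" + word[1:]}
    return {word}


def strip_trigger(symbols, trigger="^IE"):
    """Delete one trigger symbol from a list of symbols."""
    return [s for s in symbols if s != trigger]


print(passes_filter("llover+Verb+PresInd+3P+Sg"))  # True
print(passes_filter("llover+Verb+PresInd+1P+Sg"))  # False
print(sorted(capitalisation_variants("abordo")))   # ['Abordo', 'abordo']
```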

Evaluation

Testing
• Accuracy
  - Maintaining integrity of existing rules
  - Projection
  - Subtraction
• Well-formedness
  - Ensuring tag order

Assessing Coverage
• Aim: 80% on unrestricted text
• Statistical predictions (Crystal 1997)
• Corpus compilation and processing
  - Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/)
• Phase 1: augmentation
• Phase 2: 81% coverage
• Final assessment: 84.15% coverage
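
Token coverage, as reported above, can be read as the share of corpus tokens that receive at least one analysis. The following Python sketch shows that calculation on toy data. The analyser, the token list, and the function names are all hypothetical stand-ins, and the slides do not state exactly how the reported figures were computed.

```python
def token_coverage(tokens, analyse):
    """Fraction of tokens for which `analyse` returns at least one analysis."""
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyse(token))
    return covered / len(tokens)


# Toy stand-in for the transducer's analysis direction (illustrative only).
TOY_ANALYSES = {
    "conozco": ["conocer+Verb+PresInd+1P+Sg"],
    "acciones": ["acción+Noun+Pl+Fem"],
    "abordo": ["abordar+Verb+PresInd+1P+Sg"],
    "aborda": ["abordar+Verb+PresInd+3P+Sg"],
}


def toy_analyse(token):
    return TOY_ANALYSES.get(token, [])


tokens = ["conozco", "acciones", "abordo", "aborda", "xyz"]
print(f"{token_coverage(tokens, toy_analyse):.0%}")  # 80%
```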

Further Details

Class         # of forms
Nouns         547
Verbs         304
Adjectives    183
Other         378

• Generates approx. 44,000 unique morphological descriptions

• Evaluation corpus – 1.26 analyses per input token on average

Possible improvements
• Increase coverage
  - lexicon augmentation
• Disambiguation using POS tagger
• More derivational morphology
• Deal with different dialects of Spanish

References

(Beesley & Karttunen 2003) Beesley, K. and Karttunen, L. Finite State Morphology. CSLI Publications, United States, 2003.

(Claret 2005) Los Verbos Castellanos Conjugados. Sexta Edición. Editorial Claret, Barcelona, 2005.

(Crystal 1997) Crystal, D. The Cambridge Encyclopedia of Language. 2nd ed. Cambridge University Press, 1997.

(Europarl) Europarl Parallel Corpus. http://people.csail.mit.edu/koehn/publications/europarl/ - Last Accessed 19/05/2006

(Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990.

(Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997.

(RAE) Real Academia Española. http://www.rae.es/ - Last Accessed 25/05/2006

Conclusions

Demonstration

LEXICON ArVerbs
!STEM    !CONT. CLASS    !GLOSS
abord    ArV ;           !to approach

LEXICON ArV
ar+Verb:0    ArConj ;

LEXICON ArConj
!TAGS                !CONT.CLASS
+PresInd:^PresInd    ArPresInd ;
+PretInd:^PretInd    ArPretInd ;

LEXICON ArPresInd    ! Present Indicative
+1P+Sg:o^1PSg     # ;
+2P+Sg:as^2PSg    # ;
+3P+Sg:a^3PSg     # ;
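
How continuation classes chain together can be shown by walking these lexicons in Python. This is a sketch of the idea, not of what lexc does internally: each entry is stored as (lexical side, surface symbols, continuation class), the ArPretInd branch is left out because its lexicon is not shown, and the data structure and function names are hypothetical.

```python
# Each lexicon maps to entries of (lexical side, surface symbols, continuation).
# "#" marks the end of a word, as in lexc.
LEXICONS = {
    "ArVerbs": [("abord", list("abord"), "ArV")],
    "ArV": [("ar+Verb", [], "ArConj")],
    "ArConj": [("+PresInd", ["^PresInd"], "ArPresInd")],
    "ArPresInd": [
        ("+1P+Sg", ["o", "^1PSg"], "#"),
        ("+2P+Sg", ["a", "s", "^2PSg"], "#"),
        ("+3P+Sg", ["a", "^3PSg"], "#"),
    ],
}


def paths(lexicon, upper="", lower=()):
    """Yield (lexical string, surface symbols) for every complete path."""
    for up, low, cont in LEXICONS[lexicon]:
        new_upper = upper + up
        new_lower = tuple(lower) + tuple(low)
        if cont == "#":
            yield new_upper, new_lower
        else:
            yield from paths(cont, new_upper, new_lower)


def surface(symbols):
    """Drop trigger symbols (those starting with '^') and join the letters."""
    return "".join(s for s in symbols if not s.startswith("^"))


pairs = {lexical: surface(symbols) for lexical, symbols in paths("ArVerbs")}
for lexical, form in pairs.items():
    print(lexical, "->", form)
```

Run from ArVerbs, the walk pairs abordar+Verb+PresInd+1P+Sg with abordo, and likewise the second and third person singular with abordas and aborda.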