    Finite-State Description of Vietnamese Reduplication

    Le Hong Phuong

    LORIA, France

    [email protected]

    Nguyen Thi Minh Huyen

    Hanoi Univ. of Science, Vietnam

    [email protected]

    Azim Roussanaly

    LORIA, France

    [email protected]

    Abstract

    We present for the first time a computational model for reduplication in the Vietnamese language. Reduplication is a common phenomenon in Vietnamese in which reduplicative words are created by combining multiple syllables whose sounds are similar. We first give a systematic study of Vietnamese reduplicative words, bringing into focus clear principles for the formation of a large class of bi-syllabic reduplicative words. We then make use of optimal finite-state devices, in particular minimal sequential string-to-string transducers, to build a computational model for very efficient recognition and production of those words. Finally, several applications of this computational model are discussed.

    1 Introduction

    Finite-state technology has been applied successfully to describe the morphological processes of many natural languages since the pioneering works of (Kaplan and Kay, 1994; Koskenniemi, 1983). It has been shown that, while finite-state approaches to most natural languages have generally been very successful, they are less suitable for the non-concatenative phenomena found in some languages, for example the non-concatenative word formation processes in Semitic languages (Cohen-Sygal and Wintner, 2006). A common non-concatenative process is reduplication, the process in which a morpheme or part of it is duplicated.

    Reduplication is a common linguistic phenomenon in many Asian languages, for example Japanese, Mandarin Chinese, Cantonese, Thai, Malay, Indonesian, Chamorro, Hebrew, Bangla, and especially Vietnamese.

    We are concerned with reduplication in Vietnamese. Vietnamese is a monosyllabic language and its word forms never change, in contrast to occidental languages, which make use of morphological variations. Consequently, reduplication is a common and important word formation method that is used extensively to enrich the lexicon. It follows that the Vietnamese lexicon contains a large number of reduplicative words.

    This paper presents for the first time a computational model for the recognition and production of a large class of Vietnamese reduplicative words. We show that Vietnamese reduplication can be simulated efficiently by finite-state devices. We first introduce the Vietnamese lexicon and the structure of Vietnamese syllables. We next give a detailed study of the reduplication phenomenon in Vietnamese, bringing into focus the formation principles of reduplicative words. We then propose optimal finite-state sequential transducers that recognize and produce a substantial class of these words. Finally, we present several applications of this computational model before concluding and discussing future work.

    2 Vietnamese Lexicon

    In this section, we first present some general characteristics of the Vietnamese language. We then give some statistics of the Vietnamese lexicon and introduce the structure of Vietnamese syllables.

    The following basic characteristics of Vietnamese are adopted from (Đoàn, 2003; Đoàn et al., 2003; Hữu et al., 1998; Nguyễn et al., 2006).

    2.1 Language Type

    Vietnamese is classified in the Viet-Muong group of the Mon-Khmer branch, which belongs to the Austro-Asiatic language family. Vietnamese is also known to have similarities with languages in the Tai family. The Vietnamese vocabulary features a large number of Sino-Vietnamese words.

    inria-00421099, version 1 - 30 Sep 2009

    Author manuscript, published in "The 7th Workshop on Asian Language Resources - In conjunction with ACL-IJCNLP 2009 (2009)"

    http://hal.archives-ouvertes.fr/
    http://hal.inria.fr/inria-00421099/fr/

    Moreover, through contact with the French language, Vietnamese was enriched not only in vocabulary but also in syntax by calques of French grammar.

    Vietnamese is an isolating language, which is characterized by the following properties:

    • it is a monosyllabic language;

    • its word forms never change, contrary to occidental languages that make use of morphological variations (plural form, conjugation, etc.);

    • hence, all grammatical relations are manifested by word order and function words.

    2.2 Vocabulary

    Vietnamese has a special unit called tiếng that corresponds at the same time to a syllable with respect to phonology, a morpheme with respect to morpho-syntax, and a word with respect to sentence constituent creation. For convenience, we call these tiếng syllables. The Vietnamese vocabulary contains

    • simple words, which are monosyllabic;

    • reduplicative words composed by phonetic reduplication;

    • compound words composed by semantic coordination and by semantic subordination;

    • complex words phonetically transcribed from foreign languages.

    The Vietnamese lexicon edited recently by the Vietnam Lexicography Center (Vietlex [1]) contains 40,181 words and idioms, which are widely used in contemporary spoken language, newspapers and literature. These words are made up of 7,729 syllables. Table 1 shows some statistics of the word length measured in syllables. 6,303 syllables (about 81.55% of the syllables) are words by themselves. Two-syllable words are the most frequent, making up nearly 71% of the vocabulary.
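The percentages quoted here and in Table 1 can be re-derived from the raw counts. The short Python check below is only a sanity check on the arithmetic; the variable names are illustrative and the counts are copied from Table 1.

```python
# Word counts by length in syllables, copied from Table 1.
counts = {"1": 6303, "2": 28416, "3": 2259, "4": 2784, "5": 419}
syllable_inventory = 7729  # distinct syllables in the lexicon

total = sum(counts.values())

# Share of each word length in the whole lexicon, in percent.
shares = {length: round(100 * n / total, 2) for length, n in counts.items()}

# Share of syllables that are words by themselves, in percent.
monosyllabic_share = round(100 * counts["1"] / syllable_inventory, 2)

for length, share in shares.items():
    print(length, counts[length], share)
print("total", total)
print("syllables that are words:", monosyllabic_share)
```

The counts add up to the stated total of 40,181, and the two-syllable share comes out just under 71%, which matches the figure given in the text.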

    2.3 Syllables

    In this section, we introduce the phonetic attributes of Vietnamese syllables. In addition to its monosyllabic character, Vietnamese is a tonal language in that each syllable has a certain pitch characteristic. The meaning of a syllable varies with its tone. This phonetic mechanism can also be found in other languages such as Chinese or Thai.

    [1] http://www.vietlex.com/

    Length        #        %
    1         6,303    15.69
    2        28,416    70.72
    3         2,259     5.62
    4         2,784     6.93
    5           419     1.04
    Total    40,181   100

    Table 1: Length of words measured in syllables

    No.   Tone              Notation
    1.    low falling       à
    2.    creaky rising     ã
    3.    creaky falling    ạ
    4.    mid level         a
    5.    dipping           ả
    6.    high rising       á

    Table 2: Vietnamese tones

    There are six tones in Vietnamese, as specified in Table 2. The letter a denotes any syllable without a tone mark. These six tones can be roughly classified into two groups corresponding to low and high pitches in pronunciation. The first half of the table contains the three low tones and the second half contains the three high tones. In addition, tones are distinguished by their flat property: the 1st and 4th tones in Table 2 are flat (bằng), and the other tones are non-flat (trắc).
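As a minimal sketch of how these two tone properties could be represented in code, the tone of a written syllable can be read off its Unicode combining mark and looked up in a small table that mirrors Table 2. The names TONES and tone_of are illustrative assumptions, not part of the software described in this paper, and the sketch relies on standard Vietnamese orthography, in which each marked tone is written with one diacritic.

```python
import unicodedata

# Combining mark -> (tone name, pitch register, flat or non-flat).
# The first three rows are the low tones, the last two are high tones;
# the unmarked mid-level tone is handled separately below.
TONES = {
    "\u0300": ("low falling", "low", "flat"),         # grave accent
    "\u0303": ("creaky rising", "low", "non-flat"),   # tilde
    "\u0323": ("creaky falling", "low", "non-flat"),  # dot below
    "\u0309": ("dipping", "high", "non-flat"),        # hook above
    "\u0301": ("high rising", "high", "non-flat"),    # acute accent
}
MID_LEVEL = ("mid level", "high", "flat")


def tone_of(syllable):
    """Return (name, register, flatness) for one written syllable."""
    # NFD splits each accented letter into a base letter plus combining marks,
    # so the tone mark can be found regardless of other vowel diacritics.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONES:
            return TONES[ch]
    return MID_LEVEL


print(tone_of("hao"))
print(tone_of("bằng"))
```

Vowel-quality diacritics such as the circumflex, breve and horn are deliberately absent from the table, so they are ignored when looking for the tone.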

    The structure of a Vietnamese syllable is given in Table 3. Each syllable can be divided into three parts: onset, rhyme and tone. The onset is usually a consonant; however, it may be empty. The rhyme contains a vowel (nucleus), with or without the glide /w/, and an optional consonant (coda). It has been observed that the initial consonant of a syllable does not carry tone information: the Vietnamese tone affects only the rhyme part of the syllable (Tran et al., 2006). This result reinforces the fact that a tone is always marked on the nucleus component of the rhyme, which is a vowel. Readers who are interested in the detailed phonetic composition of Vietnamese syllables may refer to (Tran et al., 2006; Vu et al., 2005).
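The three-part structure can be illustrated with a rough orthographic splitter. This is a sketch under simplifying assumptions: it works on spelling rather than phonemes, the onset list is a hand-written approximation, the glide, nucleus and coda are not separated inside the rhyme, and the spellings gi and qu are always treated as onsets. The function name split_syllable is hypothetical.

```python
import unicodedata

# Combining marks that encode the five marked tones.
TONE_MARKS = {
    "\u0300": "low falling",
    "\u0303": "creaky rising",
    "\u0323": "creaky falling",
    "\u0309": "dipping",
    "\u0301": "high rising",
}

# Orthographic onsets, longest first so that "ngh" wins over "ng" and "n".
ONSETS = sorted(
    ["ngh", "ng", "nh", "ch", "gh", "gi", "kh", "ph", "qu", "th", "tr",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n", "p", "r", "s",
     "t", "v", "x"],
    key=len,
    reverse=True,
)


def split_syllable(syllable):
    """Return (onset, rhyme, tone) for one written syllable."""
    tone = "mid level"
    kept = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]  # the tone mark sits on the nucleus vowel
        else:
            kept.append(ch)
    # Recompose so that vowels such as "â" or "ơ" are single characters again.
    bare = unicodedata.normalize("NFC", "".join(kept))
    onset = next((o for o in ONSETS if bare.startswith(o)), "")
    return onset, bare[len(onset):], tone


print(split_syllable("nghiêng"))
print(split_syllable("ăn"))
```

Removing the tone mark never changes the onset in this sketch, which reflects the observation above that the tone belongs to the rhyme.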

    3 Reduplication in Vietnamese

    Reduplication is one of the methods for creating multi-syllable words in Vietnamese. A reduplicative word is characterized by a phenomenon called phonetic interchange, in which one or several phonetic elements of a syllable are repeated following a certain number of specific rules.

    +-----------------------------------+
    |               Tone                |
    +-------+---------------------------+
    | Onset |           Rhyme           |
    |       +-------+---------+---------+
    |       | Glide | Nucleus |  Coda   |
    +-------+-------+---------+---------+

    Table 3: Phonetic structure of Vietnamese syllables

    From the point of view of sense, reduplication in Vietnamese usually indicates a diminutive of adjectives, which can also be found in Hebrew, or a pluralization as in Malay, Thai and Indonesian, or an intensification like the use of partial reduplication in Japanese, Thai, Cantonese and Chamorro (an Austronesian language spoken on Guam and the Northern Mariana Islands). In this respect, Vietnamese reduplication serves functions similar to those of reduplication in several Asian languages, as reported in an investigation of Asian language reduplication within the NEDO project (Tokunaga et al., 2008a; Tokunaga et al., 2008b).

    Vietnamese reduplication creates an expressive sense closely connected to the phonetic material of Vietnamese, a language of rich melody. Consequently, there are many Vietnamese reduplicative words which are difficult to explain to foreigners, though in general native Vietnamese speakers always use and understand them correctly (Diệp, 1999).

    Vietnamese reduplicative words can be classified into three classes based on the number of syllables they contain: two-syllable (or bi-syllabic) reduplicative words, three-syllable (or tri-syllabic) reduplicative words and four-syllable reduplicative words. The bi-syllabic class is the most important class for two reasons: (1) bi-syllabic reduplicative words make up more than 98% of reduplicative words, that is, almost all reduplicative words have two syllables; and (2) bi-syllabic reduplicative words embody the principal characteristics of the reduplication phenomenon in both the phonetic aspect and the sense formation aspect. For these reasons, in this paper we address only bi-syllabic reduplicative words and call them reduplicative words for short when there is no risk of confusion.

    As presented in the previous section, a syllable has a strict structure containing three parts: the onset, the rhyme and the tone. Based on the phonetic interchange of a syllable, we distinguish two types of reduplication:

    • full reduplication, where the whole syllable is repeated;

    • partial reduplication, where either the onset is repeated or the rhyme and the tone are repeated.
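The distinction between these types can be expressed as a comparison of syllable parts. The Python sketch below is illustrative only: the names parts and reduplication_type, the orthographic onset pattern and the result labels are assumptions made for this example, and the function tests only the surface relation between two syllables, so it would also flag unrelated word pairs that happen to share an onset or a rhyme.

```python
import re
import unicodedata

TONE_MARKS = "\u0300\u0301\u0303\u0309\u0323"
# Approximate orthographic onsets; the last alternative may match nothing.
ONSET = re.compile(r"ngh|ng|nh|ch|gh|gi|kh|ph|qu|th|tr|[bcdđghklmnprstvx]?")


def parts(syllable):
    """Split a written syllable into (onset, rhyme, tone mark)."""
    nfd = unicodedata.normalize("NFD", syllable.lower())
    tone = "".join(c for c in nfd if c in TONE_MARKS)
    bare = unicodedata.normalize(
        "NFC", "".join(c for c in nfd if c not in TONE_MARKS))
    onset = ONSET.match(bare).group()
    return onset, bare[len(onset):], tone


def reduplication_type(first, second):
    """Name the surface relation between two syllables."""
    o1, r1, t1 = parts(first)
    o2, r2, t2 = parts(second)
    if (o1, r1, t1) == (o2, r2, t2):
        return "full"
    if (o1, r1) == (o2, r2):
        return "same segments, different tone"
    if (r1, t1) == (r2, t2):
        return "partial (rhyme and tone repeated)"
    if o1 and o1 == o2:
        return "partial (onset repeated)"
    return "none"


print(reduplication_type("hao", "hao"))
print(reduplication_type("lúng", "túng"))
```

A separate label is returned when only the tone differs, because such pairs share both onset and rhyme and therefore fit neither definition of partial reduplication exactly.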

    In this work, we restrict ourselves to constructing an efficient computational model for reduplicative words which have clear and well-defined formation principles. These words can be classified into three types, which are investigated in detail in the following subsections. In the given examples, the base syllables (or root syllables, or roots for short) are the ones which are underlined. Reduplication that has undefined or incomplete formation rules will be tackled in future work.

    3.1 Full Reduplication

    In this type of reduplication, the root is repeated identically; there is only a slight difference of stress in pronunciation. For example, hao hao (a little similar), lm lm (intentional), ng ng (accidentally determined), l l (silently). In the Vietnamese lexicon there are 274 reduplicative words of this type.

    In principle, there appear to be many reduplicative words of this type whose roots may be any syllable bearing any tone, for instance đỏ đỏ, hớ hớ, sững sững, chậm chậm. However, as a consequence of the difference in stress between the root and the reduplicant, the tone of the reduplicant is changed in order to be in harmony with the root, so that the word is easier to read and easier to hear. This leads to the formation of reduplicative words of the second type, which we call reduplication with tone according.

    3.2 Reduplication with Tone According

    As presented above, the difference between the tones of the root and the reduplicant is a consequence of the difference between their stress, which is expressed by their tones. This creates reduplicative words of the second type; for example, đo đỏ (reddish), hơ hớ (in the bloom of youth), sừng sững (stately, high and majestic), chầm chậm (rather slow). The tone properties (low or high pitch, flat or non-flat) are now put into use.
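To make the role of the two tone properties concrete, the following Python sketch derives a reduplicant from a root under one explicit assumption: the reduplicant keeps the segments of the root and takes the flat tone of the same pitch register (mid level for a high-register root, low falling for a low-register root). This rule is an assumption chosen for illustration and is not claimed to be the exact formation rule of the paper; the function name flat_reduplicant is hypothetical.

```python
import unicodedata

GRAVE = "\u0300"                           # low falling, the flat low tone
LOW_MARKS = {"\u0300", "\u0303", "\u0323"}  # low falling, creaky rising, creaky falling
HIGH_MARKS = {"\u0309", "\u0301"}           # dipping, high rising


def flat_reduplicant(root):
    """Return the root with its tone replaced by the flat tone of its register."""
    out, pending = [], ""
    for ch in unicodedata.normalize("NFD", root):
        if ch in LOW_MARKS:
            pending = GRAVE        # low register: becomes low falling
        elif ch in HIGH_MARKS:
            pending = ""           # high register: becomes unmarked mid level
        else:
            # Emit the new tone mark only after all other marks of the vowel,
            # so that recomposition yields a single precomposed letter.
            if pending and unicodedata.combining(ch) == 0:
                out.append(pending)
                pending = ""
            out.append(ch)
    out.append(pending)
    return unicodedata.normalize("NFC", "".join(out))


for root in ["đỏ", "hớ", "sững", "chậm"]:
    print(flat_reduplicant(root), root)
```

A root that already carries a flat tone is returned unchanged by this sketch, which then coincides with full reduplication.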


    states in which 2 states are final ones. It has 262 transitions, and the maximum number of outgoing transitions from a state is 19.

    Once all three transducers have been constructed, we can unify them by making use of the standard union operation on transducers to obtain a sequential FST which is able to recognize all three classes of reduplication presented above (Mohri, 1996; Mohri, 1997).
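The idea of combining several deterministic machines into one can be sketched on plain acceptors. The Python example below is a simplified analogue, not the transducer union used in this work: it ignores outputs and builds the union of two deterministic finite automata by the standard product construction. The representation of an automaton as a (start, finals, delta) triple and the function names are assumptions made for this illustration.

```python
def union(a, b):
    """Product construction: accept a word if either DFA accepts it.

    Each DFA is (start, finals, delta) with delta[(state, symbol)] -> state.
    A missing transition means rejection; None stands for that dead state.
    """
    (start_a, finals_a, delta_a), (start_b, finals_b, delta_b) = a, b
    symbols = {sym for _, sym in delta_a} | {sym for _, sym in delta_b}
    start = (start_a, start_b)
    finals, delta = set(), {}
    todo, seen = [start], {start}
    while todo:
        state = todo.pop()
        p, q = state
        if p in finals_a or q in finals_b:
            finals.add(state)
        for sym in symbols:
            target = (delta_a.get((p, sym)), delta_b.get((q, sym)))
            if target == (None, None):
                continue  # both components are dead: no transition needed
            delta[(state, sym)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, finals, delta


def accepts(dfa, word):
    """Follow a single path through the deterministic machine."""
    start, finals, delta = dfa
    state = start
    for sym in word:
        state = delta.get((state, sym))
        if state is None:
            return False
    return state in finals


# Two toy automata: one accepts "aa", the other accepts "ab".
dfa_aa = (0, {2}, {(0, "a"): 1, (1, "a"): 2})
dfa_ab = (0, {2}, {(0, "a"): 1, (1, "b"): 2})
both = union(dfa_aa, dfa_ab)
print(accepts(both, "aa"), accepts(both, "ab"), accepts(both, "ba"))
```

The result is again deterministic, so recognition still follows exactly one path per input.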

    4.4 A Software Package

    We have developed a Java software package named vnReduplicator which implements the above-mentioned computational model of Vietnamese reduplication. The core component of this package is a minimal FST which can recognize a substantial number of the reduplicative bi-syllabic words found in the Vietnamese language.

    The first application of this core model which we have developed is a reduplication scanner for Vietnamese. We use the minimal FST of the core model to build a tool for fast detection of reduplication. The tool scans a given input text and produces a list of all the recognized reduplicative words. The detection process is very fast since the underlying transducer operates in optimal time, in the sense that the time to recognize a syllable corresponds to the time required to follow a single path in the deterministic finite-state machine, and the length of the path is the length of the syllable measured in characters.

    As an example, given the following input text

    Anh đi biền biệt. Cô vẫn chờ anh hơn 20 năm đằng đẵng. [2]

    the scanner marks two reduplicative words, biền biệt and đằng đẵng.
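A scanner of this kind can be imitated with a much simpler deterministic structure. The Python sketch below is not the vnReduplicator implementation: it replaces the minimal FST with a character trie built from a tiny hand-made word list (a hypothetical stand-in for the real model), and checks every pair of adjacent tokens by following a single path through the trie. All names in it are illustrative.

```python
import re
import unicodedata

END = ""  # key that marks the end of a stored word


def build_trie(words):
    """Store words character by character in nested dictionaries."""
    root = {}
    for word in words:
        node = root
        for ch in unicodedata.normalize("NFC", word):
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def in_trie(trie, word):
    """Follow one path; the work is proportional to the length of the word."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def scan(text, trie):
    """Return every pair of adjacent tokens that the trie recognizes."""
    tokens = re.findall(r"\w+", unicodedata.normalize("NFC", text).lower())
    pairs = (f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return [pair for pair in pairs if in_trie(trie, pair)]


# Toy lexicon standing in for the real recognizer.
lexicon = build_trie(["biền biệt", "đằng đẵng", "hao hao"])
print(scan("Anh đi biền biệt. Cô vẫn chờ anh hơn 20 năm đằng đẵng.", lexicon))
```

A word list can only recognize words it already contains, whereas a model built from formation rules can also accept reduplicative words that are missing from the lexicon.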

    We are currently investigating another useful application of the core model: partial spell checking of Vietnamese text. It is observed that people may make typographical errors in writing, like ng ng instead of the correct word ngng. In such cases, the computational model can be exploited to detect the potential errors and suggest corrections.

    The reduplication model could also help improve the accuracy of Vietnamese lexical recognizers in particular and of Vietnamese word segmentation systems in general. The reduplication scanner will be integrated into vnTokenizer [3], an open source and highly accurate tokenizer for Vietnamese texts (Le et al., 2008).

    The software and related resources will be distributed under the GNU General Public License [4] and will soon be available online [5].

    [2] He has left behind no traces whatsoever. She has been waiting for him for 20 years.

    5 Conclusion and Future Work

    We have presented for the first time a computational model for reduplication in the Vietnamese language. We have shown that a large class of reduplicative words can be modeled effectively by sequential finite-state string-to-string transducers.

    The analysis of the various patterns of reduplication in Vietnamese makes a twofold contribution. On the one hand, it gives useful information for the identification of spelling variants in Vietnamese texts. On the other hand, it gives an explicit formalization of precedence relationships in the phonology, and as a result helps in ordering and modeling phonological processes before transfer of the presentation to the articulatory interface.

    It is argued that the relation between morphology and phonology is an intimate one, both synchronically and diachronically. As mentioned earlier, Vietnamese reduplication is always accompanied by a modification of phone and tone for a symmetric and harmonic posture. We thus believe that the compact finite-state description of a large class of reduplication would help connect morphosyntactic attributes to individual phonological components of a set of Vietnamese word forms and contribute to the improvement of Vietnamese automatic speech recognition systems.

    As mentioned earlier, the current work does not handle partial reduplication, in which either the onset is repeated or the rhyme and the tone of syllables are repeated, for example bồng bềnh (bob), chúm chím (open one's lips slightly), lẩm cẩm (doting), lúng túng (perplexed, embarrassed). Partial reduplication is a topic which has long been well studied by the community of Vietnamese linguists. It has been shown that partial reduplicative words also have certain principal formation rules (Diệp, 1999; UBKHXH, 1983). Hence, partial reduplicative words could also be generated and recognized by an appropriate finite-state model which encodes their formation rules precisely. This is an interesting topic for our future work on constructing a more complete computational model for Vietnamese bi-syllabic reduplication.

    [3] http://www.loria.fr/lehong/tools/vnTokenizer.php
    [4] http://www.gnu.org/copyleft/gpl.html
    [5] http://www.loria.fr/lehong/projects.php

    Furthermore, in addition to the bi-syllabic reduplication forms, there also exist three- or four-syllable reduplication forms, for example cỏn còn con (very little), tẻo tèo teo (very small), or vội vội vàng vàng (hurry), ng ng nh (deliberate). These reduplication forms involve the copying of morphological structures, which is a non-regular operation. Non-regular operations are problematic in that they cannot be cast in terms of composition, the regular operation of major importance in finite-state devices, and finite-state devices cannot handle unbounded copying. However, the question of whether an elegant account can reduce these specific kinds of reduplication to purely regular mechanisms would be of interest for further research to extend and improve the core reduplication components for Vietnamese. Guessing unknown reduplicative words is another interesting and useful topic since the lexicon can never cover all reduplicative words.

    Acknowledgement

    We gratefully acknowledge helpful comments and valuable suggestions from three anonymous reviewers for improving the paper.

    References

    Yael Cohen-Sygal and Shuly Wintner. 2006. Finite-State Registered Automata for Non-Concatenative Morphology. Computational Linguistics, Vol. 32, No. 1, Pages 49-82.

    Jan Daciuk, Stoyan Mihov, Bruce W. Watson and Richard E. Watson. 2000. Incremental Construction of Minimal Acyclic Finite-State Automata. Computational Linguistics, Vol. 26, No. 1.

    Le H. Phuong, Nguyen T. M. Huyen, Roussanaly A. and Ho T. Vinh. 2008. A hybrid approach to word segmentation of Vietnamese texts. Proceedings of the 2nd International Conference on Language and Automata Theory and Applications, Tarragona, Spain. Springer LNCS 5196.

    Diệp Quang Ban and Hoàng Văn Thung. 1999. Ngữ pháp tiếng Việt (Vietnamese Grammar). NXB Giáo dục, Hà Nội, Việt Nam.

    Đoàn Thiện Thuật. 2003. Ngữ âm tiếng Việt (Vietnamese Phonetics). NXB Đại học Quốc gia Hà Nội, Hà Nội, Việt Nam.

    Đoàn Thiện Thuật (Editor-in-chief), Nguyễn Khánh Hà and Phạm Như Quỳnh. 2003. A Concise Vietnamese Grammar (For Non-native Speakers). Thế Giới Publishers, Hà Nội, Việt Nam.

    Hữu Đạt, Trần Trí Dõi and Đào Thanh Lan. 1998. Cơ sở tiếng Việt (Basis of Vietnamese). NXB Giáo dục, Hà Nội, Việt Nam.

    Ronald Kaplan and Martin Kay. 1994. Regular Models of Phonological Rule Systems. Computational Linguistics, Vol. 20, No. 3, Pages 331-378.

    Kimmo Koskenniemi. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. The Department of General Linguistics, University of Helsinki.

    Mehryar Mohri. 1996. On Some Applications of Finite-State Automata Theory to Natural Language Processing. Natural Language Engineering, Vol. 2, No. 1, Pages 61-80.

    Mehryar Mohri. 1997. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, Vol. 23.

    Nguyễn Thị Minh Huyền, Laurent Romary, Mathias Rossignol and Vũ Xuân Lương. 2006. A Lexicon for Vietnamese Language Processing. Language Resources and Evaluation, Vol. 40, No. 3-4.

    Tokunaga T., Kaplan D., Huang C-R., Hsieh S-K., Calzolari N., Monachini M., Soria C., Shirai K., Sornlertlamvanich V., Charoenporn T. and Xia Y. 2008a. Adapting international standard for Asian language technologies. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008).

    Tokunaga T. et al. 2008b. Developing International Standards of Language Resources for Semantic Web Applications. Research Report of the International Joint Research Program (NEDO Grant) for FY 2007, http://www.tech.nedo.go.jp/PDF/100013569.pdf

    Tran D. D., Castelli E., Serignat J. F., Trinh V. L. and Le X. H. 2006. Linear F0 Contour Model for Vietnamese Tones and Vietnamese Syllable Synthesis with TD-PSOLA. Proceedings of TAL2006, La Rochelle, France.

    Thang Tat Vu, Dung Tien Nguyen, Mai Chi Luong and John-Paul Hosom. 2005. Vietnamese Large Vocabulary Continuous Speech Recognition. Proceedings of Eurospeech 2005, Lisboa.

    Ủy ban Khoa học Xã hội Việt Nam. 1983. Ngữ pháp tiếng Việt (Vietnamese Grammar). Nhà xuất bản Khoa học Xã hội, Hà Nội, Việt Nam.
