[IEEE 2013 International Conference on Asian Language Processing (IALP) - Urumqi, China...


Page 1: [IEEE 2013 International Conference on Asian Language Processing (IALP) - Urumqi, China (2013.08.17-2013.08.19)] 2013 International Conference on Asian Language Processing - Research

Research and Implementation of the Uyghur-Chinese Personal Name Transliteration Based on Syllabification

Alim Murat1

Xinjiang Normal University Urumqi, China

[email protected]

Azragul Yusup2, Yusup Abaydulla3 Xinjiang Normal University

The Key Laboratory for Network Information Security and Public Opinion Analysis

Urumqi, China [email protected]

Abstract- In recent years, many Uyghur-Chinese cross-language applications have appeared, but automatic translation between the two languages still lacks in-depth study. Traditional Uyghur-Chinese personal name transliteration is mostly rule-based, which differs from phoneme-based transliteration. This paper implements Uyghur-Chinese personal name transliteration on the basis of Uyghur syllabification, under the grapheme-based Direct Orthographical Mapping (DOM) transliteration framework.

Keywords- Uyghur, syllabification, Uyghur-Chinese personal name, Transliteration

I. INTRODUCTION

With the rapid development of science and technology, many loanwords have entered the local culture. Most of them are proper names or named entities such as personal names, place names and organization names, and they also serve as parts of sentences and carry the key content of an expression. For these reasons, named entities have become an essential problem that attracts scholars in natural language processing both at home and abroad.

Named entities are especially frequent in journalism, where they carry vital information, so they play an important role in the quality of the whole translation. Personal names account for the highest proportion in the corpus, so the quality of personal name translation matters most.

For a long time there has been no standard specification for Uyghur-Chinese personal name transliteration, so practice has become inconsistent and the same Uyghur name is rendered with more than one sequence of Chinese characters. As a result, many inconveniences have emerged in the daily lives of minority people, for example in boarding, postal remittances, enrollment, residence registration and file management. This research is therefore significant for many NLP applications such as machine translation, cross-language information retrieval, and automatic bilingual dictionary completion.

Personal name transliteration is the process of converting a name from the source language to the target language. Researchers have proposed various name transliteration methods. Among studies related to Uyghur personal names, Hasan Omar [1] proposed a name translation algorithm based on multilayer filtering over rule bases, which implements English-Uyghur personal name transliteration by establishing three rule bases; Imam Hasan [2] proposed new rules based on the rules of pronunciation to implement Uyghur-Chinese transliteration of names; Samat [3] used a method based on conversion rules to translate proper names automatically from Chinese to Uyghur. The approaches mentioned above mostly adopt rule-based methods to transliterate names.

Transliteration methods are divided into phoneme-based and grapheme-based. This paper uses grapheme-based transliteration: the grapheme-based method subdivides the source-language name into a number of syllables and converts it directly to the target-language name by mapping each transliteration unit. This removes the intermediate step and can thereby improve the accuracy of transliteration.

II. UYGHUR SYLLABIFICATION RULE AND IMPLEMENTATION

2.1 Uyghur syllables

First, we give an overview of the Uyghur language. Uyghur belongs to the Eastern branch of the Turkic group of the Altaic language family. At present, its main writing system is an adapted Arabic script, called the Arabic Script of Uyghur.

Grammatically, Uyghur is an agglutinative language. It has 32 letters: 8 vowels and 24 consonants. A syllable is built around a nucleus, which consists of a single vowel or syllabic consonant, optionally surrounded by one or more consonants. Syllables are divided into open and closed syllables according to their pronunciation: open syllables end with a vowel, and closed syllables end with a consonant. The construction of syllables in Uyghur is very regular, but some loanwords from Arabic, Persian, Chinese and European languages cannot be syllabified by the syllabification rule.

2013 International Conference on Asian Language Processing
978-0-7695-5063-3/13 $26.00 © 2013 IEEE. DOI 10.1109/IALP.2013.22

2.2 Uyghur syllabification rule

Syllable structure in Uyghur is regular, with some exceptions to the rule that each syllable contains at least one vowel (the centre of the syllable) and certain consonants. In Uyghur, the number of vowels in a word determines the number of syllables in the word, so the syllable count can be read off from the vowels.

Uyghur has 12 different syllable types [4]: A, BAB, AB, BA, BABB, ABB, BBABB, BBAB, BBA, BAAB, BABBB and BAA (B stands for a consonant, A stands for a vowel). The first six types are the most frequent in native Uyghur words; the others occur mostly in loanwords and do not accord with the native rules of Uyghur syllables.

From these forms of Uyghur syllables, we can derive the basic idea [5] of Uyghur syllabification as follows:

(1) In polysyllabic words, if there is one consonant letter between two vowels, the consonant belongs to the following syllable. Example: “ ” / Grape

(2) In polysyllabic words, if there are two consonant letters between two vowels, the first consonant belongs to the former syllable and the second consonant belongs to the latter. Example: “ ” / Apple

(3) In polysyllabic words, if there are three consonant letters between two vowels, the first two consonants belong to the former syllable and the last consonant belongs to the latter. Example: “ ” / Conditional

(4) In polysyllabic words, if there is no consonant letter between two vowels, the first vowel belongs to the former syllable and the second vowel belongs to the latter. Example: “ ” / Watch
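The four rules above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the name has already been romanized into a Latin script in which every letter is a single character (digraphs such as "ch" or "sh" would have to be tokenized first), and the vowel set `VOWELS` and the function name `syllabify` are illustrative choices. Gaps of more than three consonants are not covered by the rules; the sketch simply extends the same pattern to them.

```python
# Assumed Latin-script vowel inventory (8 vowels); adjust for the script in use.
VOWELS = set("aeéioöuü")


def syllabify(word):
    """Split a word into syllables using rules (1)-(4).

    Between two neighbouring vowels:
      * no consonant      -> cut between the two vowels            (rule 4)
      * one or more cons. -> the last consonant opens the next
                             syllable, the rest close the former   (rules 1-3)
    """
    letters = list(word.lower())
    vowel_pos = [i for i, ch in enumerate(letters) if ch in VOWELS]
    if not vowel_pos:
        # No vowel found: the rules do not apply, return the word unsplit.
        return ["".join(letters)] if letters else []

    cuts = []
    for prev, nxt in zip(vowel_pos, vowel_pos[1:]):
        gap = nxt - prev - 1  # number of consonants between the two vowels
        cuts.append(nxt if gap == 0 else nxt - 1)

    bounds = [0] + cuts + [len(letters)]
    return ["".join(letters[a:b]) for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    # Romanized strings chosen only for illustration.
    for name in ["alma", "saet", "alim", "murat"]:
        print(name, "->", "-".join(syllabify(name)))
```

Under these assumptions the number of syllables returned always equals the number of vowels found, which matches the observation in Section 2.2; "alma" is split as al-ma (rule 2) and "saet" as sa-et (rule 4).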

III. GRAPHEME-BASED DOM TRANSLITERATION FRAMEWORK

3.1 Study on transliteration approaches

Transliteration methods can be categorized into grapheme-based and phoneme-based. Grapheme-based methods perform a direct orthographical mapping (DOM) between source and target words, while phoneme-based approaches use an intermediate phonetic representation. Both grapheme-based and phoneme-based methods usually begin by breaking the source word into segments, and then use a mapping from source segments to target segments to generate the target word. The rules of this mapping are obtained by aligning already available transliterated word pairs.

Figure 1. Process of the Grapheme-based DOM Transliteration Method

We work under the grapheme-based framework, which folds all intermediate phonetic transformation steps into a single direct mapping and is also more effective than phoneme-based methods. The process of the grapheme-based DOM transliteration method (Zhang et al. [7]) is shown in Figure 1.

3.2 Formulation

This paper uses Uyghur-to-Chinese personal name transliteration as the running example. Let a Uyghur personal name be $U = u_1, u_2, \ldots, u_i, \ldots, u_m$, where $u_i$ is a syllable of the name ($i = 1, \ldots, m$; $m$ is the number of syllables in the Uyghur personal name), and let a Chinese personal name be $C = c_1, c_2, \ldots, c_j, \ldots, c_n$, where $c_j$ is a Chinese character ($j = 1, \ldots, n$; $n$ is the number of characters in the Chinese personal name). The Uyghur personal name $U$ and its corresponding Chinese transliteration $C$ are each cut into a series of $k$ substrings, $U = u_1, u_2, \ldots, u_i, \ldots, u_k$ and $C = c_1, c_2, \ldots, c_i, \ldots, c_k$. Each substring is a transliteration unit, that is, a syllable for the Uyghur personal name and a Chinese character for the Chinese personal name, and corresponding units together form transliteration pairs.

Transliteration unit: each substring obtained by the cut. A Chinese character $c_i$ is a Chinese personal name transliteration unit, and a syllable $u_i$ is a Uyghur personal name transliteration unit.

Transliteration pair: formed by one Uyghur personal name transliteration unit $u_i$ and one Chinese personal name transliteration unit $c_i$, and written $\langle u, c \rangle_i$. A transliteration pair reflects the matching relation in both directions, Uyghur to Chinese and Chinese to Uyghur.

$\gamma$ (Alignment): the alignment between the Uyghur personal name $U$ and the Chinese personal name $C$, as follows:

$$\langle u, c \rangle_1 = \langle u_1, c_1 \rangle,\quad \langle u, c \rangle_2 = \langle u_2, c_2 \rangle,\quad \ldots,\quad \langle u, c \rangle_k = \langle u_k, c_k \rangle$$

Example: for the Uyghur personal name “ ” and its corresponding Chinese characters “ ”, the Uyghur word has three syllables and the Chinese word has three characters. Both names therefore have three transliteration units, which gives three corresponding transliteration pairs: < , >, < , > and < , >.

As the example shows, the transliteration units and transliteration pairs are determined by the segmentation approach. Because the Uyghur personal name transliteration units are defined by the Uyghur syllabification rule, once the Uyghur units are determined, the number of Chinese personal name transliteration units is fixed as well.
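The one-to-one pairing described in this section can be illustrated as follows. This is a sketch only: the function names `make_pairs` and `count_pairs` are hypothetical, and the tokens `u1`, `c1`, and so on are placeholders standing in for real syllables and characters. Counting pairs over already aligned names is one simple way to collect the mapping statistics mentioned in Section 3.1; the source does not say how its tables were built.

```python
from collections import Counter


def make_pairs(uyghur_units, chinese_units):
    """Pair the i-th Uyghur syllable with the i-th Chinese character."""
    if len(uyghur_units) != len(chinese_units):
        raise ValueError("one-to-one alignment needs equally many units")
    return list(zip(uyghur_units, chinese_units))


def count_pairs(aligned_names):
    """Count how often each <u, c> pair occurs in a list of aligned names."""
    counts = Counter()
    for uyghur_units, chinese_units in aligned_names:
        counts.update(make_pairs(uyghur_units, chinese_units))
    return counts


if __name__ == "__main__":
    # Placeholder tokens, not data from the paper.
    corpus = [
        (["u1", "u2", "u3"], ["c1", "c2", "c3"]),
        (["u1", "u4"], ["c1", "c5"]),
    ]
    print(count_pairs(corpus).most_common(1))
```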

3.3 Transliteration algorithm

Uyghur-to-Chinese personal name transliteration finds, for a given Uyghur personal name $U$, the corresponding Chinese personal name $C'$ with the largest probability, as in formula (1):

$$C' = \arg\max_{C} P(U, C) \tag{1}$$

Because there are many possible alignments $\gamma$ between $U$ and $C$, all of them need to be considered, so formula (1) becomes:

$$C' = \arg\max_{C} \sum_{\gamma} P(U, C, \gamma) \tag{2}$$

To reduce the complexity of computation, the summation over $\gamma$ is replaced by its maximum value, and formula (2) becomes:

$$C' \approx \arg\max_{C} \Big( \max_{\gamma} P(U, C, \gamma) \Big) \tag{3}$$

Merging the maximizations over $C$ and $\gamma$ gives:

$$C' = \arg\max_{C, \gamma} P(U, C, \gamma) \tag{4}$$

Formula (4) is the final Uyghur-to-Chinese (U2C) process.
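To make the final maximization concrete, here is a small brute-force sketch. It rests on assumptions that the source does not state: the alignment is taken to be the fixed one-to-one alignment given by syllabification, and the joint probability is assumed to factor into independent pair probabilities, $P(U, C, \gamma) = \prod_i P(\langle u, c \rangle_i)$. The function name `best_transliteration`, the table layout and the numbers are all hypothetical.

```python
from itertools import product
from math import prod


def best_transliteration(syllables, pair_prob):
    """Return the candidate C (and its score) that maximizes the joint probability.

    pair_prob maps each syllable to {character: P(<u, c>)}.
    Assumes independent pairs and a fixed one-to-one alignment.
    """
    options = [list(pair_prob[s].items()) for s in syllables]
    best, best_p = None, -1.0
    for combo in product(*options):  # every candidate character sequence
        p = prod(prob for _, prob in combo)
        if p > best_p:
            best, best_p = [ch for ch, _ in combo], p
    return best, best_p


if __name__ == "__main__":
    # Placeholder tokens and made-up probabilities, for illustration only.
    table = {
        "u1": {"c1": 0.7, "c2": 0.3},
        "u2": {"c3": 0.6, "c4": 0.4},
    }
    print(best_transliteration(["u1", "u2"], table))
```

Under the independence assumption the maximum could equally be taken per syllable; the exhaustive search is kept here because it mirrors the form of formula (4) and would still apply if the probability model coupled neighbouring pairs.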

IV. UYGHUR TO CHINESE PERSONAL NAME TRANSLITERATION

This paper works within the framework of grapheme-based DOM transliteration and uses Uyghur syllabification to implement Uyghur-to-Chinese transliteration, with the Uyghur personal name as input and the corresponding Chinese characters as output.

Figure 2. The flowchart of Uyghur to Chinese personal name transliteration on the basis of syllabification

First, the Uyghur personal name is syllabified, because each syllable of the name is one Uyghur personal name transliteration unit. Then the Chinese character corresponding to each Uyghur syllable is looked up in the mapping table. The mapping table cannot contain every syllable that occurs in Uyghur personal names, so if no Chinese character is found for a syllable, that syllable is divided into sub-syllables, which are then looked up in the mapping table.

To express the matching between Uyghur and Chinese units more objectively, we built tables that capture two types of relations: a Uyghur letter to a Chinese character, and a Uyghur syllable to a Chinese character. Overall, the aim is to provide each Uyghur personal name transliteration unit with the Chinese character that best matches it.
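The lookup with sub-syllable fallback can be sketched as below. The source does not specify how a syllable is divided into sub-syllables, so the greedy longest-match split used here is an assumption, as are the function names and the toy table entries, which are illustrative and are not taken from Table I or Table II.

```python
def _split_lookup(syllable, syllable_table, letter_table):
    """Fallback: cover the syllable left to right with the longest known pieces."""
    chars, i = [], 0
    while i < len(syllable):
        for j in range(len(syllable), i, -1):  # try the longest piece first
            piece = syllable[i:j]
            if piece in syllable_table:
                chars.append(syllable_table[piece])
            elif piece in letter_table:
                chars.append(letter_table[piece])
            else:
                continue
            i = j
            break
        else:
            raise KeyError(f"no mapping for {syllable[i]!r} in {syllable!r}")
    return chars


def transliterate(syllables, syllable_table, letter_table):
    """Map each syllable to a Chinese character, splitting it when needed."""
    chars = []
    for syl in syllables:
        if syl in syllable_table:
            chars.append(syllable_table[syl])
        else:
            chars.extend(_split_lookup(syl, syllable_table, letter_table))
    return "".join(chars)


if __name__ == "__main__":
    # Hypothetical table entries for illustration only.
    syllable_table = {"a": "阿", "li": "里"}
    letter_table = {"m": "木"}
    print(transliterate(["a", "lim"], syllable_table, letter_table))
```

In this toy run the syllable "lim" is missing from the syllable table, so it is split into "li" and "m", and each piece is mapped separately.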

TABLE I. UYGHUR LETTER TO CHINESE CHARACTER MAPPING

As shown in Table I, a single Uyghur letter (a consonant or a vowel) serves as the Uyghur personal name transliteration unit.

TABLE II. UYGHUR SYLLABLE TO CHINESE CHARACTER MAPPING

As shown in Table II, a Uyghur syllable serves as the Uyghur personal name transliteration unit.


V. EXPERIMENTS AND ANALYSIS

This paper used 1000 Uyghur personal names. Most of the Uyghur personal names have a one-to-one relationship with their corresponding Chinese personal names; some have a one-to-many relationship. The test result is therefore reported for both one-to-one and one-to-many relations, as given in Table III:

TABLE III. TEST RESULT OF UYGHUR TO CHINESE PERSONAL NAME TRANSLITERATION

Relation      one-to-one    one-to-many
Proportion    620           380
Percentage    62%           38%

The test result shows that one-to-one relations have a significantly higher proportion than one-to-many relations (shown in Table IV) in Uyghur to Chinese personal name transliteration. For one-to-many relations, the most suitable candidate transliteration is chosen using a rules database.

TABLE IV. ONE-TO-MANY RELATION IN UYGHUR TO CHINESE PERSONAL NAME TRANSLITERATION

Uyghur personal name transliteration unit

Chinese personal name transliteration unit

VI. CONCLUSIONS

In designing and implementing Uyghur to Chinese personal name transliteration, we have presented a new approach based on Uyghur syllabification. It is flexible and involves no intermediate process. The approach can also be used in the study of Kazak and Kirgiz personal name transliteration, and it provides a systematic, automatic processing method for Uyghur to Chinese personal name transliteration.

ACKNOWLEDGMENT

This paper is funded by the XJNU Graduate Student Innovation Fund (20121206), the open project of the key laboratory at XJNU (WLYQ2012201), the National Natural Science Foundation of China (1063036, 61132009, 61262066), the Ministry of Education Social Science Fund (10YJA740121), the State Language Commission (YB115-38, YB125-45) and the Xinjiang Normal University Key Laboratory.

REFERENCES

[1] Aishan Omar, Turgun Ibrayim. Researching and Implementation of the English to Uyghur Personal Name Machine Translation Algorithm [J]. Journal of Xinjiang University, 2007, 24(1): 97-100.

[2] Imam Hasan, Abulikim Turdi, Askar Tamdulla. Rule based Uyghur to Chinese personal name Machine Translation Algorithm [J]. Computer Applications and Software, 2010, 27(8): 86-87.

[3] Samat Mamatimin, Yasin Imin. Research on Chinese and Uyghur proper name translation on basis of transforming rules [C]. The Seventh International Conference on Chinese Information Processing, 2007.

[4] Saimaiti Maimaitimin. Study on Phonemic Combination and Syllabic Structure of Modern Uyghur. The Journal of Xinjiang University, Vol. 4, 2004.

[5] Hamit Tomur. Modern Uyghur Grammar [M]. The National Press, 1987: 130-150.

[6] Abida Omar. Research and implement of Uyghur syllabification. Researching of the National Language Information Technology: The Eleventh Session of National Minority Language Information Symposium, 2007.

[7] M. Zhang, H. Z. Li and J. Su. Direct Orthographical Mapping for Machine Transliteration [C]. In Proceedings of the 20th International Conference on Computational Linguistics, 2004: 716-722.