Study on Asian Script Entry and Display Formats · The Khmer script was chosen in this study of...

58
All Copyrights Reserved, CICC Study on Input and Output Method on Asian Scripts -Case Study of he Khmer script- (Extract) 1

Transcript of Study on Asian Script Entry and Display Formats · The Khmer script was chosen in this study of...

All Copyrights Reserved, CICC

Study on Input and Output Method on Asian Scripts

-Case Study of the Khmer script-

(Extract)

1

All Copyrights Reserved, CICC

This is an extract of the report of Study of Input and Output Method of Asian Script Project.

CICC planned this project and ordered to Ecellead Company in July 2002 and this report was

delivered to CICC on November 2002.

There are three objectives for this project.

1. To confirm the technology to compose complex syllable of Asian glyphs out from multiple

character codes.

2. To demonstrate the applicability of multiple input methods for the same character code

3. To find out if more coded elements are needed to handle Asian scripts properly.

Backgrounds:

1. Traditionally, character code are assigned for all necessary repertoire of the target script (such as

USASCII), however, this practice is not applied for syllabic characters of Asian scripts. Some of

the repertoires are expressed by a sequence of the character codes (no specific single code

assigned to the repertoire). Most of Asian countries faced a difficult time to understand how to

handle this new approach of encoding method.

2. As same as representation of a glyph for output such as display and printing, conventional input

method was also one to one correspondence type. One each key on a keyboard is assigned for one

each of corresponding character code. This practice required different keyboard for different

character code. Further more, there is a confusion of Asian countries for input method for the

sequence of codes that is described above. It is necessary to demonstrate that the keyboard

operation can be designed totally independent from character code (like Japanese input method).

3.Because of unique feature of Asian scripts, it is necessary to check if some of additional

functionality as a character code is needed or not.

Results:

Open Type Font like approach is applicable to output Asian script (under environment of glyph

description by a sequence of character codes.)

Japanese input like free input methods is also applicable as an input method for Asian scripts also.

Some kind of common rule to be agreed for WORDWRAPPING of Asian script that do not have a

SPACE between words.

2

All Copyrights Reserved, CICC

1. Outline of Study

1.1 Selection of Scripts to be Surveyed, Etc.

1.2 Survey Method

1.2.1 System

1.2.2 Schedule

2. What Is the Khmer script

2.1 Syllabic Characters

2.2 Characteristics of The Khmer script

3 Implementation of the Khmer Script Font

3.1 Review of Font Implementation Methods

3.2 Detailed Definition of Font Implementation

4. The Khmer script Input Methods

4.1 Current Local Entry Methods

4.2 Review of Entry Method Implementation Policies

5.Verification Program Development -Khmer Font Development

5.1 Outline of OpenType

5.1.1 Description of Main Open Type Tables

5.1.2 Substitution Tables

5.2 Khmer OpenType Font Specifications

5.2.1 Classification of Khmer Characters

5.2.2 Khmer OpenType Table Implementation Specifications

5.3 Font Development Procedure

5.3.1 Development Environment

5.3.2 Development Procedure

6. Verification Program Development

7. Pilot Tests

8. Information Obtained in This Survey and Discussion

8.1 Dealing with Complicated Character Structures

8.2 Handling of Diverse Character Codes for Same Script (Diversity of Display Methods)

8.3 Handling of Diverse Character Codes for Same Script (Diversity of Entry Methods)

8.4 Technical Conclusions

9. Future Tasks (Reference: For Realizing Results of This Survey)

3

All Copyrights Reserved, CICC

1. Outline of Study

1.1 Selection of Scripts to be Surveyed, Etc.

The aim of this study is to conduct survey and research on the input and output methods of

South Asian, South-East Asian, and Arabic scripts.

The Khmer script was chosen in this study of possible input and output methods because it has

the most complicated structure in the range of scripts considered for this study, and it combines

the characteristics of most of all other scripts. To deal with the diversity of this script, the

structure of ISO/IEC 10646 (also called UCS and Unicode, hereafter referred to as UCS)

developed basically for diverse variety of scripts was undertaken and basic tests and studies

were conducted using the UCS Khmer script.

Script Structure

Technically studied, South Asian, South-East Asian, and Arabic scripts are characterized by

syllable synthesis, presentation form, ligature, and accent/tone marks. The Khmer script posses

all the necessary and very complicated elements.

Indian scripts which include the Khmer script are called syllabic characters, and consist of

consonant characters and vowel characters (vowel symbols). When these two are combined,

they represent syllable units comprised of consonants and vowels. Hangul script is also a

clustered syllabic character, but the Indian script requires more complicated syllabification. The

Khmer script also has a unique feature called the foot (subscript character,). This is a type of

consonant sign. When two or three consonants coincide, the second and third consonants

become the subscript character, and change. If a support technique can be developed for the

Khmer script, that technique should be applicable for all Indian scripts.

Multi Character Compatible

UCS carries out encryption based on an architecture aimed at universally dealing with diverse

scripts. On the contrary, by establishing an appropriate output and input architecture for UCS

characters, the chances for dealing with diverse scripts will increase. In the midst of various

characters encoding of the Khmer script which was chosen for its script structure from the

above reasons, the Khmer script of UCS was chosen for experiments and research.

Verifications

Based on the need for verification of the validity of the results of the above survey to be carried

out by researchers who have knowledge of the script concerned, verification was carried out by

Japanese researchers involved in these scripts as well as by inviting native users of these scripts.

4

All Copyrights Reserved, CICC

In particular, as encoding in UCS aims at use for diverse scripts as mentioned earlier, native users

of the scripts would find a script odd on its own. This applies to Cambodia which also used the

Khmer script. There is strong opposition again UCS in the country (doubts about suitability). As

verification by these people should be the most accurate, the applicability of the development

results was tested with governmental organizations of Cambodia.

1.2 Survey Method

This survey was carried out by the following system and schedule.

1.2.1 System

Survey meetings were carried out by setting the working-level meeting members in Japan as

well as conducting field investigations in Cambodia. Collaborators were also invited from

Cambodia for pilot tests.

Working level meeting members

Makoto Minegishi, Tokyo University of Foreign Studies, Research Institute for the

Languages and Cultures of Asia and Africa

Tatsuo Kobayashi, President, Scholex Co., Ltd.

Takayuki K. Sato, Chief Researcher, CICC

Excellead Technology Co., Ltd.

Collaborators from Cambodia

National Information and Communication Authority CAMBODIA (hereafter NiDA)

SORSSAK PAN (Undersecretary of State)

HE, UNG Vuthy

CHEA Sok Huor

TOP Rithy

1.2.2 Schedule

This survey was conducted under the following schedule.

Survey and research period: June and August, 2002

Field investigation (July 15 to 18, 2002)

Survey of natural input methods to Cambodians, and technical surveys of

implementation methods

Development of verification programs: July to September 2002

Development of font and input (copy) programs

Pilot tests, survey report preparation period: August to October 2002

1st pilot test: August 23, 26, 27, 2002

5

All Copyrights Reserved, CICC

Participation of CHEA Sok Huor and TOP Rithy

2nd pilot test: September 25 and 26, 2002

Participation of HE, UNG Vuthy and CHEA Sok Huor

Survey report preparation

6

All Copyrights Reserved, CICC

2. What Is the Khmer script

2.1 Syllabic Characters

Indian scripts such as the Khmer script are called syllabic characters, and take syllables as the unit.

The characters are composed of consonant characters and vowel characters (vowel symbols to be

precise). By combining these two, syllable units composed of consonants and vowels can be

represented.

The syllabification of the various languages of South-East Asia is relatively complicated

compared to the Japanese language which is based on the open syllable of “consonant and vowel”.

The following syllables may be possible if Indian loanwords are included. C however represents a

consonant and V the vowel.

1. CV

2. CCV

3. CVC

4. CCVCC

5. ....

With South-East Asian scripts, such syllables are represented by adding vowel symbols to the top,

bottom, left, and right of the consonant/vowel. In other words, the Khmer script has more or less

the same syllabification as Korean Hangul script.

However, South-East Asian Indian syllabic characters have the following characteristics.

Horizontal writing from left to right. Not written vertically.

No space is left between words. Spaces are left between phrases and characters.

2.2 Characteristics of The Khmer script

*Note by CICC: This section describes a nature of Khmer script that includes all features of most

of other Asian scripts

The Khmer script is the oldest script of all existing South-East Asian scripts. It serves as the

original form of Burmese script, Thai script, and Laotian script. The Khmer script (Cambodian

script) is used for expressing the Khmer language (Cambodian language) as well as for expressing

Sanskrit and Pali languages in Cambodia and Thailand. It is said that the Khmer script was used in

the kingdom of King Ramkhamhaeng of the Sukhothai Dynasty who established the Thai script.

The Gold Scripture exhibited at the Wat Po Temple next to the Thai Palace is written in Pali using

the Khmer script. To Thai monks, the Khmer script is scripture.

7

All Copyrights Reserved, CICC

The shape of the Khmer script is decorative. Characteristically, it is decorated by waves above

characters.

The Khmer script is a type of syllable character made of consonant characters and vowel

characters. Letters can broadly be divided into two groups, mul characters and criana characters.

Mul characters can further be divided into mul characters and khama characters. Mul characters

means “round character, and is used in inscriptions, Buddhist text, titles and headings of printed

matter, public notices, signs, etc. and for purposes requiring decorative effects.

The criana character can further be divided into criana characters and jhara characters. criana

characters means “italic” and is used widely in general printed matter and daily reading and

writing. The jhara character means “upright character" and is the upright form of criana characters

in prints. Therefore, basically, the jhara character is the same as criana characters, and the

difference between the two lies in the difference in typeface. The differences between these four

typefaces lie in the difference between character designs and does not extend to the structure of

script.

The consonant characters of Cambodian language come in two types. One is the consonant/vowel

and the other is the consonant character called subscript character.

consonant/vowel can be written and read separately from other characters, while the subscript

character must always be written below the consonant/vowel. It cannot exist alone.

There are 33 types of consonant/vowel, and order of which follows the concept of sound

classification of India. Therefore plosive and nasal are arranged by place of articulation from near

to the throat to near to the lips (k, c, t, p (b)). They are also arranged in order from other

semivowel, liquids, and fricative.

Vowel symbols are always combined with consonant/vowel, and are not written alone. Vowel

symbols are written at a fixed position on the top, bottom, left, or right side of the

consonant/vowel or the combination of consonant/vowel and subscript character.

Other than the vowel sign, there is the independent vowel character that is not combined with

consonant/vowel. The independent vowel character expresses one syllable with one character.

This has been used for only a few terms, and with the simplification of the orthography in the

recent years, it is gradually being replaced by the combination of consonant/vowel and vowel

symbols.

In addition to this, there are the supplementary signs, numerals, punctuation marks added to

characters. Supplementary signs are signs added to syllables combining consonant characters and

vowel symbols. Numerals on the other hand are used for characteristic numbers and are common

to Thai and Cambodian. How numerals are represented is the same as Arabic numerals. In modern

times, Arabic numerals are also used. In addition, repetitive signs, numerals, and question marks

8

All Copyrights Reserved, CICC

are used in texts. Punctuation marks are used with small paragraphs consisting of several

sentences as the unit. This is a feature also seen in Thai.

The feature of the representation form of the Khmer script lies in the hierarchical structure where

the consonant characters are placed at the top and bottom in representation of multiple consonants.

The normal character serving as the base is called consonant alphabet while the character

changing to one arranged below are called subscript characters.

9

All Copyrights Reserved, CICC

3 Implementation of the Khmer Script Font

3.1 Review of Font Implementation Methods

By modifying the Khek Sangker font, the feasibility of implementing fonts which can be

implemented was verified, and the presence of problems with the existing standard in The Khmer

script was studied.

In implementing Font, it was basically agreed that the concept of Unicode 3.2 reflecting the

opinions pf Cambodia National Body would be taken in for the definition of character, based on

the concepts used in UCS. (Unicode 3.2 eliminates the subscript character, takes multiple coded

character codes as one code and give independent names and display formats to each).

In this project, approval was obtained from the Cambodia National Body that the 33 vowels, 35

consonants, 37 subscript characters, 18 other auxiliary signs, and ten numeral given as The Khmer

script in Unicode 3.2 can be taken as the Khmer script (symbols used in lunar calendar were

excluded).

However, gylphs not contained in Khek Sangker font were not newly created, but those currently

available were implemented where possible.

In the development of Khmer fonts, a major topic discussed was how to treat character code order

and rendering for expressing character code order and rendering for internal processing.

The agreements reached in the field studies in Cambodia can briefly be summarized as follows.

The character code order in internal processing shall basically be closest to the basic

pronunciation order.

*Note by CICC: ISO/IEC TR 14651 recommends a code value order free sorting. Using a

sort key. Thus this is not important issue.

By this, optimize the sorting and search processes

Adjustments with the output shall be carried out with the output engine or font.

After these agreements, information and opinions were exchanged with the person in charge of

implementing the Khmer script at Microsoft (hereafter referred to as MS) regarding OpenType

and Uniscribe, and common awareness was reached on the following points.

The mounting measures of Uniscribe by MS match the agreements and basic directions

of CICC and NiDA.

MS has the aim to reflect the intentions of native speakers faithfully and promptly in

Uniscribe.

Uniscribe shall perform the least required process for preventing loss of the uniqueness

of such language and entry to leave the freedom of Open Type Font where possible.

The Khmer main text font and heading font shall be dealt with the process for Open

Type Font.

10

All Copyrights Reserved, CICC

• MS shall provide technical information and support where possible for this project.

Based on the above agreement, this project carried out the development of font based on Uniscribe,

and adjustments with the MS method shall be made for the encoding and character layout for

internal processing.

Based on the above circumstances, it was decided that in this project, Open Type Font shall be

developed based on the technical information provided by Dr. Paul Nelson of MS.

*Note by CICC: Remember that one of the objectives is to confirm an applicability of Open-Type

like approach for Asian scripts.

For details on the coding undertaken by Uniscribe, visit the following URL.

Http://www.Microsoft.com/typography/otfntdev/khmerot/shaping.htm

3.2 Detailed Definition of Font Implementation

The Khek Sangker font was classified into the following based on the requests of the Cambodian

National Body, under the guidance of Professor Makoto Minegishi, Tokyo University of Foreign

Studies, Research Institute for the Languages and Cultures of Asia and Africa.

Define (1)Consonant, (2)Subscript Characters, (3)Various Signs, (4)Dependent Vowel symbols,

(5)Independent Vowels, (6)Various Signs, (7)Digits based on glyph variations and M width.

(hexadecimal representation).

Romanization is a form of entry method described later.

(1)Consonants

Consonant characters are divided into ten types; CA, CB, CC, CD, CE, CF, CG, CH, CI, and

CJ.

There are altogether 35 consonants. This time, as the two characters CC are not available as

font, 33 characters are implemented.

CA= (42, 44, 4E, 54, 68, 6E, 78) %%%% 7 characters.

In Romanization, /b, D, N, d, h, n, k’/.

The shape does not change.

CB={43, 64, 66, 70, 7A}%%% 5 characters.

In Romanization, /j, T, t’, p’, T’/.

The shape does not change.

However, as the top right is curved up slightly higher than CA, with certain font designs, the

11

All Copyrights Reserved, CICC

glyph positions of superior additional signs that follow SA, SB, SC, and SD, and superior

vowel symbols (DC) may be higher. In this implementation, due to restrictions of current

Khek Bros. font design, it is taken CB=CA because the glyph positions of CA and following

superior additional signs and superior vowel symbols are taken as not changing.

CC={67} %%% 1 character.

The /G/. form does not change in romanization.

However, because the top right is curved slightly higher than CA, the glyph positions of

superior additional signs that follow SA, SB, SC, and SD, and superior vowel symbols (DC)

are higher.

CC={76} %%% 1 character.

The /v/. form does not change in romanization.

However, because the top right is curved slightly higher than CA, the glyph positions of

superior additional signs that follow SA, SB, SC, and SD, and superior vowel symbols

(DC) are higher. In addition, only for Mul characters, these are joined with a superior

vowel sign. Due to restrictions of current Khek Bros. font design, as mul glyphs are

available for CD, CD is taken to be CC this time.

CE={72} %%% 1 character.

The /r/. shape does not change in romanization. (As mentioned later, note the subscript

character and glyph)

Due to the narrow character width as a consonant character, these protrude out from the left

when superior vowel characters and subscript characters are attached, and are unsightly. To

begin with, with /r/ glyph designs, it may be better to increase the character with (as done

with Thai characters) or prepare narrow glyphs for superior vowel symbols and subscript

characters for this character.

This time, CE is taken as CA.

CF={46, 47, 51, 58, 5A, 63, 6C, 6D, 71, 73, 79} %%% 11 characters.

The shape changes.

In Romanization, /d’, q, j’, g’, D’, c, l, m, c’, s, y/.

Except when part of the subscript character (JD) follows, these are joined to the right side of

right-side vowel symbols (DB) and left and right enclosed vowel symbols (DG). With

ISO/IEC 10646 (UCS or Unicode) [179D]/X/ and ISO/IEC 10646 (UCS or Unicode)

[179E]/S/, these characters are used only for Sanskrit, but with the Khek Bros. font, no glyph

12

All Copyrights Reserved, CICC

is available. If characters exist, there will be altogether 14 CFs.

CG={4B, 50, 6B, 74} %%% 4 characters

In Romanization, /g, b’, k. t/.

The shape does not change.

Except when part of the subscript character (JD) follows, these are joined to the right side of

right-side vowel symbols (DB) and left and right enclosed vowel symbols (DG). In addition,

only for the mul character, these are joined with superior vowel symbols. Due to restrictions

of current Khek Bros. font design, as mul glyphs are not available for CG, CG is taken as CF.

CH={4C} %%% 1 character.

In Romanization, /L/.

The shape does not change.

Except when part of the subscript character (JD) follows, this character is joined to the right

side of right-side vowel symbols (DB) and left and right enclosed vowel symbols (DG). In

addition, only for the mul character, it is joined to the right side of right-side vowel symbols

(DB) and left and right enclosed vowel symbols (DG). As the character is lower than the base

line (this character itself is a combination of a consonant and subscript character, and acts

that way), the glyph position of the inferior additional sign and inferior vowel sign is at the

bottom.

CI={6A} Non-glyph {56, D7} %%% 1 character.

In Romanization, /J/.

The shape changes.

This character is lower than the base line. It is a combination of the consonant character and

subscript character /<J/.

Therefore, when the subscript character follows, the consonant character itself changes to a

non-glyph {4A}. In particular, when the subscript character that follows is the same /J/, the

non-glyph {EF} of the subscript character /<J/ is attached to the bottom of {4A}. The glyph

positions of the non-glyph of inferior additional signs which follow SC, SD and inferior

vowel symbols (DD) become the following non-glyph.

CJ={62} Non-glyph {56, D7} %%% 1 character.

The shape changes.

In Romanization, /p/.

When joined with right-side vowel symbols and the right side of left-right enclosed vowel

13

All Copyrights Reserved, CICC

symbols, it becomes difficult to distinguish from the /h/ character. It therefore has a unique

joined non-glyph. {56} is joined with the right side vowel sign (DB) and right side of the left

and right enclosed vowel symbols (DG), {D7} is joined with the right side of the left and

right enclosed vowel symbols (DG). (Note: though glyphs with {4D}/M/ are required for the

top right of {56},but none is found)

(2)Joeung subscript character

Subscript characters are classified into six types; JA, JB, JC, JD, JE, and JF. There are

altogether 35 subscript characters. However, because two characters /X, S/ are not provided

as fonts, 33 characters were implemented.

Of these 33 characters, there are 31 types of typical glyphs of subscript characters. {C2}=/1/

and {C2}=/L/ and {91 or B6} =/T/ and {91 or B6}=/T/ glyphs are the same.

JA={82, 91, 91, A9, B5, B8, B9, BD, C2, C2, C3, C4, C5, CD, CF, D8, E6, EB, EC, ED,

F0, F5, FB, 7E} %%%% 25 characters.

In Romanization, /j, T, t, G, m, b’, p’, T’, I, L, v, t’, K’, c, C’, j, d, D, d’, q, g, b, N, k, n/.

The shape does not change.

With the Khek Bros font, {F5} and {FF} appear like the subscript characters of /b/, but the

differences are not clear. The ISO/IEC 10646 (UCS or Unicode) [17D2-179D] /X. are

characters used only for Sanskrit representation, but the glyph is not provided for Khek Bros.

font. If characters are provided, there will be altogether 25 Jas.

JB={FA} Non-glyph {F1} %%% 1 character

In Romanization, /h/.

The shape does not change, but it has a non-glyph positioned lower {F1}.

Example: Two examples /L, huG//dL, hiikrNO of following consonant character {4C} have

been confirmed.

JC={Glyph} %%%

Test-wise, the above JA {B5}/m/ does not change shape, but a lowered non-glyph similar to

{FA}/h/ is desired. The output environment follows /L/ like /h/ or follows other subscript

characters JA, JB, JD generally. In other words, this is limited to adding an exceptional

subscript character when other subscript characters are already present. For this aggregate, it

has been verified that there is one example of the continuous /lk, s, mii/ borrowed from

Sanskrit. Over the recent years, due to the increasing need to express names of not only the

French, English and Americans, but Russians, etc. as well, there should be a need to provide

14

All Copyrights Reserved, CICC

glyphs of the position of the second subscript character for all subscript characters. (In this

case, for JDs where the subscript character goes to the right side of the consonant character, it

is necessary to not only lower the position, but to provide a tall glyph as well).

JD={A7, B4, BA, CE, F3, F4} %%% 5 characters

In Romanization, /s, y, p, j’, D’, g’/.

The shape does not change.

These characters are joined to the right side of right-side vowel symbols (DB) and left and

right enclosed vowel symbols (DG). With ISO/IEC 10646 (UCS or Unicode) [1792-179E]/S/,

these characters are used only for Sanskrit, but with the Khek Bros. font, no glyph is

available. If characters exist, there will be altogether 14 CCs.

JE={A8} Non-glyph {E5} %%% 1 character.

In Romanization, /r/.

This is the only subscript character positioned on the left side of the consonant character.

(Hereafter, the reverse of the entry order is called Hindi Reverse.)

If following other subscript characters, it becomes a long non-glyph {E5}.

JF={C6} Non-glyph {EF} %%% 1 character.

In Romanization, /j/.

If same as the preceding consonant character /J/, /J/ becomes {4A}, and at the same time, the

subscript character /,J/ becomes the non-glyph {EF}.

(3)Various Signs Added Symbols

Added symbols are classified into four types, SA, SB, SC, SD.

SA={27, 26, 2A, E1, B1} 5(6) characters (Bantok, Samyoksanya, Ahsda, Robat, Kakabaat (,

Viriam)). Superior added symbol. Neither the position nor shape changes. However, for

Viriam (u-17D1) no glyph exists in Khek Bros.

SB={5F} Non-Glyph {5F->AD} 1 character (Toandakhiat)

Superior added symbol.

When the superior vowel sign DC follows, it changes to a tall non-glyph. With the Khek

Bros font, a combination non-gylph AD where 5F is follows by /i/. As this is originally used

in Sanskrit, it may be possible for a character in DC to be combined with /ii/.

SC={22} Non-Glyph {22->75, 22->AC} 1 character (Muusikatoan)

15

All Copyrights Reserved, CICC

Superior added symbol

When the upper vowel sign DC and DB {53} /aM follows, it changes to a non-glyph called

kbias-kraom.With the Khek Bros, the non-glyph has the same shape as the vowel sign.

Like /u/, when added to consonant character CI, and when subscript characters other than JE

follow, it becomes a low non-glyph AC.

(As the input, the Cambodian side requests that when Muusikatoan is key input, and the

vowel sign which follows is a superior vowel sign DC, it should automatically change to 75

or non-glyph of AC.)

SD={DF} {DF->75, DF-?AC} 1 Character (Triisap)

Superior added symbol

Like SC, when the upper vowel sign DC follows, it changes to a non-glyph called

kbias-kraom. Unlike SC, whether it changes or not is optional depending on the writer. There

are two ways of writing. (This can be confirmed in pages 1516 to 1522 on the addition of

superior vowel symbols of “h” in a Buddlish School dictionary.) The non-glyph has the same

shape as the vowel sign /u/ in Khek Bros. Like /u/, when added to consonant character CI and

subscript characters other than JE follow, it becomes a low non-glyph AC.

(When Triisap is entered, and the vowel sign which follows is a superior vowel sign DC,

there is a need to select the writing method to change to the 75 or AC non-glyph.)

(4)Dependent Vowel symbols

Vowel symbols are classified into nine types DA, DB, DC, DD, DE, DF, DG, DH, and DI.

(or 13 types when DJ, DK, DL, DM are added)

DA={4b, EE}%%% 2 Characters

Right side vowel sign. The shape does not change.

In Romanization, /H, Q/.

DB={61, 53} %%% 2 Characters.

Right side vowel sign.

In Romanization, /aa, aM/. When the preceding consonant characters are CF, CG, CH, CI, or

when the preceding subscript character is JD, these characters are joined to them. This time,

no joining glyph is provided. To join, a non-glyph is required.

DC={69, 49, 77, 57, 4D} Non-Glyph {69->5E, 49->E9, 77->B7, 57->E3, 4D->F7} %%% 5

Characters

16

All Copyrights Reserved, CICC

Superior vowel symbols.

In Romanization, /I, ii, w, ww, M/.

If the preceding consonant characters are CC, CD, or when no superior added symbol is

added to the consonant character, these characters become non-glyphs at high positions.

DD={75, 55, 59} Non-Glyph {75->AC, 55->E8, 59->E7} %%% 3 Characters

Inferior vowel symbols.

In Romanization, /u, uu, uo/.

If the preceding consonant characters are CH, CI, or when subscript characters other than JE

follow the consonant character, these characters become non-glyphs at low positions.

DE={4D+75} Non-Glyps {4D->F7, 75->AC}. %%% 1 Character.

Top, bottom enclosed vowel sign.

In Romanization, /uM/.

With current keyboards and fonts, this character is expressed by combining {4D}/M and

{75}/u/. Consequently, height is adjusted to that of {4D} and {75}. Consequently, when the

preceding consonant character is CC and CD, and a superior added symbol is added to the

consonant character, the top part {4D} becomes a non-glyph with a high position of {F7}.

When the preceding consonant character is CH and CI, and subscript characters other than JE

follow the consonant character, the bottom part {75} becomes a non-glyph {AC} with a low

position.

DF={45, 65, AB} %%% 3 Characters.

Left side vowel sign (Hindi Reverse).

In Romanization, /ee, e, ai/.

DG={65+61, 65+41} %%% 2 Characters.

Left right enclosed vowel sign.

In Romanization, /o, au/.

With current keyboards and fonts, these two characters are expressed by combining the left

side {65}/e/ and right parts {4F} and {6F}. The left part is placed on the left of the consonant

character by Hindi Reverse. When the preceding consonant character is CH, CI, and the

subscript characters other than JE follow the consonant character, the glyphs on the right side

become non-glyphs with long bottoms {AF} and {BF}.

DI={65+69} Non-Glyps {69->5E}. %%% 1 Character.

17

All Copyrights Reserved, CICC

Left top enclosed vowel sign.

In Romanization, /ae/.

With current keyboards and fonts, these two characters are expressed by combining the left

side {65}/e/ and right parts {4F} and {6F}. The left part is placed on the left of the consonant

character by Hindi Reverse. For the top part {69}, when the preceding consonant character is

CC, CD and added signs are added to the consonant character, this character becomes a

non-glyph with a high position {5E}.

(5)Independent Vowels

Independent vowels are used when vowels existing in Sanskrit or Parli are not accompanied

by the consonants at the head of the syllable. With the Khmer script, the head consonant of

the syllable {47/q/ is taken as the “zero consonant” in this case, and the orthography is

changed to one which does not use the independent vowel. Generations above middle-age

however tend to use this independent vowels.

{B2} In Romanization, /i/.

{F8} In Romanization, /ii/.

{F2} In Romanization, /RI/.

{C9} In Romanization, /RII/.

{AE} In Romanization, /LI/.

{BE} In Romanization, /LII/.

{D3} In Romanization, /u/.

{FC} In Romanization, /uu/.

{D2} In Romanization, /u, v/. These characters are exceptional, combining the non-glyphs

of vowel /u/ and syllable-marginal consonant /v/ which has no vowel corresponding to

Sanskrit (cannot be used alone currently.)

{E4} In Romanization, /e/.

{DE} In Romanization, /ai/.

{7B} In Romanization, /o/.

{5B}In Romanization, /o y/. Exceptional, combining the non-glyphs of vowel /o/ and

syllable-marginal consonant /y/ which has no vowel corresponding to Sanskrit (cannot be

used alone currently.)

(6)Various Signs

{2E} Punctuation u-17D4 (Khan)

{40} Repeated symbol u-17D7 (Lek Too)

18

All Copyrights Reserved, CICC

{B3} u-17D5 (Bariyoosan)

{D0} Quotation sign u-17D6 (Camnuc Pii Kuuh)

{DA} End sign u-17DA (koomuut)

{DB} Start sign u-17D9 (Phnaek Muan)

{DD} Currency sign u-17DB (Riel)

(7) Digits

{30} 0

{31} 1

{32} 2

{33} 3

{34} 4

{35} 5

{36} 6

{37} 7

{38} 8

{39} 9

19

All Copyrights Reserved, CICC

4. The Khmer script Input Methods

Along with the survey on the Khmer script fonts as described above, the Khmer script input

methods were surveyed and reviewed with the cooperation of Professor Makoto Minegishi, Tokyo

University of Foreign Studies, Research Institute for the Languages and Cultures of Asia and

Africa.

*Note by CICC: One of the objectives of this project is to demonstrate that the input method is

independent from character code. Many input method can be developed for the

same character code. In this project, two methods are developed for

demonstration purpose, but more input methods are able to be developed based

on different idea for the same code.

4.1 Current Local Entry Methods

8 bits single byte codes are assigned to current Khmer fonts to enable use of Khmer characters on

English systems. For this reason, the direct input method where the same keyboard as the English

system corresponds to codes (glyphs) one to one is used.

The Khek Bros fonts provided for this project correspond to the following keyboard layout.

he code assignment is as shown in Tables 5.1 to 5.3 by this layout and corresponding ASCII

ffers the following keyboard labels.

Figure 4.1 Khek Bros Keyboard

T

character string.

Khek Bros also o

20

All Copyrights Reserved, CICC

Figure 4.2 Khek Bros Keyboard Labels

ocal PC ships currently sell the following Khmer keyboard.

Figure 4.4 Key Assignment of Khmer Keyboard

he Cambodian National Body does not consider the Khek Bros and the several existing keyboard

L

Figure 4.3 Khmer Keyboard

T

layouts and proposes the following layout.

21

All Copyrights Reserved, CICC

22

Figure 4.5 Keyboard Layout Proposed by Cambodian National Body

The Cambodian National Body’s request lies especially in the Normal and Shift keys of the above

keyboard layout. Their request is laying out mainly subscript characters in the Shift Space and

Space area, with Space following Normal. In the bottom right area, Space is to follow Shift, and

the top right area is to be as shown.

All Copyrights Reserved, CICC

23

4.2 Review of Entry Method Implementation Policies

Currently, 8 bits single byte codes are assigned to enable use of the Khmer script to utilize

conventional English based systems. With ISO/IEC 10646 (UCS or Unicode), one glyph of the

Khmer script may consist of several character codes. This clearly means that to enable input of the

Khmer script, an input method for converting key input codes to Khmer characters assigned by

ISO/IEC 10646 (UCS or Unicode) according to some type of conversion rules is required.

Since there is a risk that the Cambodian side may lose sight of the fact that there exists multiple

technically general implementations by restricting input methods to one, the following two

methods were proposed, and the approval was obtained from the Cambodian National Body on

the policies.

• Transcription mode method from Latin alphabets (Roman letters)

by Professor Minegishi

• Direct code entry method based on existing keyboard layout

“Transcription mode method from Latin alphabets (Roman letters) by Professor Minegishi” can be

called the Khmer script transliteration mode and is an input method which takes up each Khmer

character whose pronunciation resembles Latin alphabets. For details, refer to the romanization in

“5.1.4 Detailed Definition of Font Implementation”.

“Direct code input method based on existing keyboard layout” is an input method based on the

key assignment proposed by the Cambodian National Body shown in Figure 5.8. This time

however, as the Cambodian National Body focused on such issues as the Space key conversion.

Regarding the potentials of the actual implementation, the following switching method was

therefore proposed in the sense to eliminate stereotypes, and approved.

Bottom left

Top right

Bottom right

Bottom top:

(NOTE: Sama as Figure 4.4 Keyboard Layout s Khmer Keyboard.)

All Copyrights Reserved, CICC

24

5.Verification Program Development -Khmer Font Development-

In this project, ISO/IEC 10646 (UCS or Unicode) compatible Khmer fonts were developed. The

OpenType font format which is the standard font format of general OSs such as Windows, MacOS,

and Solaris was adopted.

5.1 Outline of OpenType

The outline font is a general font format used with Windows and Mac operating systems,

operating systems employed extensively for personal computers. The outline font can mainly be

divided into the PostScript and TrueType. In 1996, the OpenType font format which combines the

two was released and is gradually becoming the standard format. In the future, font formats are

expected to centralize to this OpenType. The following outlines the OpenType font format, and

processing of Indian scripts using the OpenType font.

TrueType, the predecessor of the OpenType, is a font format established jointly by Apple

Computer and Microsoft. It is supported as standard by main operating systems such as Windows,

MacOS, Solaris, etc.

It can be used for both computer screen displays and printing.

With the TrueType font, tables for storing various information such as character outline, character

code mapping, and character design are defined. Fonts are constructed of the aggregate of such

table data. Character outline is expressed by two-dimensional B spline curves.

Microsoft has also established a font format called

the TrueType open as the extended format of

TrueType. With this format, new tables for

replacement of character glyphs, and adjustment of

character glyph positions are defined, which can

store high quality character layout information

inside the font. This character layout information

was originally available at the operating system

side (character layout processing system), but in the

relation between the character layout processing

system and font, it signifies that the role of the font

has become greater.

Head

Glyph Substitution

Character Glyph Image Data

Name

Font Design Information

(Other tables)

.

. .

Figure 5.1 TrueType Font

Structure

All Copyrights Reserved, CICC

25

OpenType is a format jointly developed by Microsoft and Adobe Systems. It can be called the

fusion of the TrueType and CID-Keyed font.

The font has the same structure as the table structure of TrueType. For the outline descriptive

method, the TrueType and Postscript CFF (Compact Font Format) font types can be used. There

are also more table types, and increased information which font venders must package.

The OpenType is supported as standard in Windows 2000/XP, Mac OS X, and Solaris.

By using the Adobe System Adobe Type Manager (ATM), it can be used in conventional

Windows OS and Mac OS.

5.1.1 Description of Main Open Type Tables

GSUB

The character glyph may be replaced by a different shape even if the same character code is given.

This table stores information for replacing the character glyph image with a different shape.

Replacement occurs in the following cases;

(a) Change of glyph depending on which position in the word the character is located such

as Arabic characters, etc.

(b) Change of glyph due to adjoining characters such as Indian characters

(c) Change of glyph due to language, one glyph is made up of more than two characters

(d) Layout of glyph changes in vertical writing

These are tables added in TrueType Open.

GPOS

Stores position control information when the layout position of character glyph differs according

to the combination with other character glyph.

Mort

Table with the same aim as GSUB. However the data structure and replacement algorithm differ.

The GSUB table is defined by Microsoft and is used as standard by Windows OS, while the mort

table us defined by Apple Computer, and is supported as standard by Mac OS.

cmap

Within the font, numbers called glyph IDs are given to each glyph. This table consists of data

describing the relation between character codes and glyph IDs (called mapping table). As this

relation can be described several times, different mapping such as shift JIS and ISO/IEC 10646

(UCS or Unicode) data can be provided.

All Copyrights Reserved, CICC

26

Glph

Table for storing the shapes of TrueType font character outlines.

CFF

Table for storing the shapes of PostScript font character outlines.

Name

Table for storing font names, family names, and copyright indications, etc.

OS/2

Table for describing information on font design characteristics and character sets stored in fonts.

Used by Windows operating system, OS/2.

5.1.2 Substitution Tables

With Indian scripts like Devanagari, the shape of character glyphs and layout position change

intricately according to the combination of other characters. With OpenType, character

substitution and character layout position information can be stored to enable appropriate

processing of Indian scripts on computers. Some main examples are the GSUB and GPOS tables.

GSUB (Glyph Substitution Table)

The GSUB table stores character glyph substitution information. It is classified by function (this

classification is called feature) for the substitution information, and defined with four-character

tags. This feature is related to specific writing systems and languages. Substitution by one or

multisubscript character glyph index numbers defined in cmap and substitution from index

numbers after glyph substitution can be defined. The GSUB table supports the following six types

of substitution methods.

• Substitution of one glyp with another glyph(single substitution). This is used when

the glyph changes according to the character position as in Arabic, or when glyphs for

vertical writing as in Japanese are required.

Figure 5.2 Example of Single Substitution

All Copyrights Reserved, CICC

• One character is substituted to more than two glyphs (multi substitution).

• Character are the same, but substituted to several different glyphs (alternate

substitution). For example, subscript character glyphs with different designs are

provided for selection by the user.

• Substitution of several charactes with totally different glyph(s) (ligature substitution).

An example is character substitution by consonant clustering in Indian scripts.

• Substitution of

substitution). G

(b)classification

(c)defining aggr

• Substitution usi

Extension of co

GPOS (Glyph Positioning T

The GPOS table stored info

layouts. Like GSUB, this ta

and languages.

The layout position of certa

other character glyphs. For

tone mark to alphabets are

same character, the diacritica

with the next glyph, and is a

Figure 5.3 Example of Multi Substitution

Figure 5.4 Example of Alternate Substitution

Figure 5.5 Example of Ligature Substitution

27

one or more glyph patterns by one or more glyphs (contextual

lyph patterns consist of (a) taking glyph strings as patterns,

of glyphs into classes, and taking the class strings as patterns,

egates of glyphs, and taking the aggregate strings as patterns.

ng multiple contextual substitution (chaining contextual substitution).

ntextual substitution.

able)

rmation for minutely controlling layout positions of glyphs in text

ble is classified by function, and related to specific character systems

in character glyphs may change according to the combination with

example, In Vietnamese, character systems adding diacritical marks,

used. When the diacritical mark and tone marks are attached to the

l mark shifts to an upper position. In the Arabic Urdu, the glyph joins

rranged downwards in the direction from right to left. Such character

All Copyrights Reserved, CICC

28

systems require not only the writing direction in character layout processing, but also

two-dimensional glyph position control.

The drawing of this verse was excerpted from

http://www.Microsoft.com/typography/otspec/default.htm.

Figure 5.6 Example of Mark Position in Vietnamese

Figure 5.7 Example of Urdu

All Copyrights Reserved, CICC

29

5.2 Khmer OpenType Font Specifications

This section describes the implementation specifications of OpenType font First the character

categories for implementation are described, followed by the implementation specifications and

development methods of each table of Open Type.

The development of the Unicode Khmer OpenType was based on the 1byte Khmer font Khek

Sangker of Khek Brothers. The format of this font is the TrueType. It is converted to the

OpenType font by adding the required table.

5.2.1 Classification of Khmer Characters

Khmer characters consist of consonant characters, subscript characters (consonant signs),

independent vowel characters, vowel symbols, added symbols, and Khmer numerals. Of this, the

consonant characters serve as the base of characters, and arranged around it are the subscript

character, vowel sign, and added symbol on the top, bottom, left, and right. In some cases,

multiple subscript characters may come under one consonant character. The glyph shape or glyph

position change according to the combination with consonant characters or combination with

other subscript characters, vowel symbols, added symbols, etc. With certain consonant characters,

the glyph shape may change. Independent vowel symbols and Khmer numerals compose

characters independently.

As characters can be classified according to the changes in glyph shape and glyph position, Khmer

font was implemented referring to this classification. The classification of characters conforms to

the classification method of Professor Makoto Minegishi, Tokyo University of Foreign Studies,

Research Institute for the Languages and Cultures of Asia and Africa.

All Copyrights Reserved, CICC

30

Table 5.1 Classification of Characters

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

1781

178C

178E

1791

1793

1796

CA

17A0

The glyph shape does not change.

1787

178A

178B

1790

CB

1795

Depending on the font design, the glyph position of the following superior added symbols SA, SB, SD and superior vowel symbol DC may be at the top.

Consonant character

CC 1784

The glyph position of the following superior added symbols SA, SB, SD and superior vowel symbol DC is at the top.

All Copyrights Reserved, CICC

31

Table 5.2 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

CD 179C

The glyph position of the following superior added symbols SA, SB, SD and superior vowel symbol DC is at the top. In addition, only for the mul character, it is joined with a superior vowel symbol.

CE 179A

The glyph shape does not change, but a narrow glyph should ideally be provided for the superior vowel symbol and subscript character combined with.

1783

1785

1786

1788

178D

1792

1798

1799

179B

179F

Consonant character

CF

17A2

Except when part of the subscript character (JD) follows, this is joined to the right side vowel symbol (DB) and right side of the left right enclosed vowel symbol (DG).

All Copyrights Reserved, CICC

32

Table 5.3 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

1780

1782

178F

Consonant character

CG

1797

Except when part of the subscript character (JD) follows, this is joined to the right side vowel symbol (DB) and right side of the left right enclosed vowel symbol (DG). In addition, only for the mul character, it is joined with a superior vowel symbol.

CH 17A1

Except when part of the subscript character (JD) follows, this is joined to the right side vowel symbol (DB) and right side of the left right enclosed vowel symbol (DG). The glyph position of the following inferior added symbol and inferior vowel symbol is at the bottom.

CI 1789

When the subscript character follows, the glyph shape changes. Particularly, if the following subscript character is the same consonant, the glyph shape of this subscript character changes. The glyph position of a different glyph of the following inferior added symbols SC and SD and inferior vowel symbol (DD) is at the bottom.

CJ 1794

Joined to the right side vowel symbol and right side of the left right enclosed vowel symbol to become a different glyph shape.

All Copyrights Reserved, CICC

33

Table 5.4 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

17D2

1780 17D2

1781 17D2

1782 17D2

1784 17D2

1785 17D2

1786 17D2

1787 17D2

178A 17D2

178B

17D2

178C

17D2

178E 17D2

178F 17D2

1790

Subscript character

JA

17D2

1791

The glyph shape does not change.

All Copyrights Reserved, CICC

34

17D2

1792

Table 5.5 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

17D2

1793

17D2

1795

17D2

1796

17D2

1797

17D2

179B 17D2

179C

JA

17D2

17A2

The glyph shape does not change.

JB 17D2

17A0

Depending on the combination with the consonant character, the glyph position may be at the bottom.

JC 17D2

1798

When following the consonant character CH, or when following the subscript characters JA, JB, JC, the glyph position is at the bottom.

17D2

1783 17D2

1788

Subscript character

JD

17D2

178D

Joined to the right side of the following vowel symbol DB and left right enclosed vowel symbol DG .

All Copyrights Reserved, CICC

35

17D2

1794

17D2

1799

17D2

179F

Table 5.6 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

JE 17D2

179A

When following a different subscript character, becomes a different glyph with a long shape downwards.

Subscript character

JF 17D2

1789

When the precedent consonant character is the same as itself, the glyph shape changes.

17C7

DA

17C8

The glyph shape does not change.

17B6

DB

17B6

17C6

When the precedent consonant character is CF, CG, CH, CI, or the precedent subscript character is JD, joins with each.

17B7

17B9

17BA

17C6

Vowel

symbol

DC

17B8

When the precedent consonant character is CC or CD, and the consonant character is attached with a superior vowel symbol added symbol, the glyph position becomes higher.

All Copyrights Reserved, CICC

36

17BC

17BD

DD

17BB

When the precedent consonant character is CH or CI, and the consonant character is followed by a subscript character other than JE, the glyph position becomes lower.

Table 5.7 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

DE 17BB

17C6

When the precedent consonant character is CC or CD, and the consonant character is attached with a superior vowel symbol added symbol, the top glyph position becomes higher.

17C1

17C2

DF

17C3

When the precedent consonant character is CH or CI, and the consonant character is followed by a subscript character other than JE, the bottom glyph position becomes lower.

17C4

DG

17C5

The right side joins when the precedent consonant character is CF, CG, CH, CI or the precedent subscript character is JD.

17BF

Vowel symbol

DH

17C0

When the precedent consonant character is CH or CI, and the consonant character is followed by a subscript character other than JE, it changes to a glyph shape with the right side long downwards.

All Copyrights Reserved, CICC

37

DI 17BE

When the precedent consonant character is CC or CD, and the consonant character is attached with an added symbol, the position of the upper part becomes higher.

Table 5.8 Classification of Characters (Continuation)

Type of Character

Classification of Prof. Minegishi

ISO/IEC

10646

(UCS or

Unicode)

Glyph Remarks

17CB

17CC

17CE

17CF

SA

17D0

The position and glyph shape do not change.

SB 17CD

When the superiod vowel symbol DC follows, the glyph position becomes higher.

Added symbol

SC 17C9

When the superiod vowel symbols DC and DB (17B6+17C6) follow, the glyph shape changes (same glyph as 17BB). When the glyph shape changes, or when the consonant character CI is attached, or when a subscript character other than JE follows, the glyph position becomes lower.

All Copyrights Reserved, CICC

38

SD 17CA

When the superior vowel symbol DC follows, the glyph shape changes (same glyp as 17BB). When the glyph shape changes, or when the consonant character CI is attached, or when a subscript character other than JE follows, the glyph position becomes lower.

*The glyph shape of the subscript characters, vowel symbols, and added symbols in the tables are

shown combined with consonant character U+1780.

All Copyrights Reserved, CICC

39

5.2.2 Khmer OpenType Table Implementation Specifications

The following describe is the specifications of the OpenType table related to Khmer font.

The implementation of the GSUB and GPOS tables conform to the Khmer OpenType font

specifications defined by Microsoft.

(1)cmap table

A mapping table based on the use of fonts in the Windows environment has been added.

Table 5.9 Mapping table

Data Value Description of Value

Platform ID 3 Table referred to by Microsoft OS

Encoding ID 1 ISO/IEC 10646 (UCS or Unicode) mapping

Format 4 Microsoft standard mapping table format

Glyphs were also mapped for the following character codes. With fonts before OpenType

(TrueType Open to be precise), there was a need to assign different character codes if the

glyph shape or position were different even for the same character. With OpenType, as it is

possible to output different glyphs or change the glyph position with the same character code,

basically one character code is assigned to one character. Consequently, the Khmer

OpenType font consists of many glyphs for which character codes cannot be assigned.

*Khek Sangker consists of fonts missing from character sets defined by ISO/IEC 10646

(UCS or Unicode).

Table 5.10 Glyphs Mapping Character Codes

Character Code Corresponding Glyph

U+0020 Space

U+0030~0039 0-9

U+1780~U+17E9 Khmer Character

*The left superior vowel symbol and left right attached vowel symbol are divided into two as

glyphs (with consonant characters in between), but exist as one as a character code. With the

OpenType mapping, the upper or right side glyph and character codes are related by mapping

tables.

All Copyrights Reserved, CICC

40

(2)Name Table

Though information of 1byte TrueType font name table serving as the base can be used,

information is rewritten for changing the font name.

The information stored in the name table is constructed of sets of Platform ID,

Platform-specific Encoding ID, Language ID, Name ID, and String.

Table 5.11 Setting of name Tables

(Platform ID, Platform-specific Encoding ID Settings 1)

Data Value Description

Platform ID 1 Mac

Platform-specific encoding ID 0 Roman

Language ID 0 English

(Platform ID, Platform-specific Encoding ID Settings 2)

Data Value Description

Platform ID 1 Mac

Platform-specific encoding ID 0 Roman

Language ID 0 English

(Name ID, String Settings for Above Two Sets of Platform ID, Platform-specific Encoding ID)

Name ID String

0(Copyright notice) Copyright 1994-99 by Khek

Brothers. All rights reserved.

Designed by Kantol Khek.

1(Font Family name) Khek Sangker Test

(Font Subfamily name) Regular

3(Unique font identifier) Khek Sangker Test

4(Full font name) Khek Sangker Test

5(Version string) 0.01

6(Postscript name) KhekSangkerTest

(3)OS/2 Table

The following information is rewritten using the OS/2 information of the 1byte TrueType

font serving as the base.

With the OpenType specifications, the OS/2 table version is 2. But in this project, version 1

was mounted.

All Copyrights Reserved, CICC

41

Table 5.12 OS/2 Table Rewritten Areas and Values

Data Value Description

ulUnicodeRange1 0x00000001 Accommodated character set

ulUnicodeRange2 0 Accommodated character set

ulUnicodeRange3 0x00010000 Accommodated character set

ulUnicodeRange4 0 Accommodated character set

UsLastCharIndex 0x17E9 Final character code

ulCodePageRange1 0x00000001 Specify code page

ulCodePageRange2 0 Specify code page

(4)GSUB Table

Microsoft defines the following features for substitution of glyphs implemented for the

Khmer OpenType font.

Table 5.13 Substitution Features Related to the Khmer script Defined by Microsoft

The GSUB and GPOS tables relate the scripts (character system) and language using the

features implemented in this table.

With the Khmer OpenType font, they are related with feature.

Feature Description

pref Information on substitution to subscript character attached to front of consonant character

blwf Information on substitution to subscript character attached to bottom of consonant character

abvf Information on substitution to symbol attached to top of consonant character

pstf Information on substitution to subscript character attached to back of consonant character

pres Information on substitution to different glyph of subscript character attached in front.

Applied to substitution of glyph when arranged with attached subscript character.

blws Information on substitution to different glyph of subscript character attached to the bottom.

abvs Information on substitution to different glyph of superior symbol.

Applied to the substitution of one glyph by joining two superior symbols

psts Information on substitution to different glyph of subscript character attached to the back.

Applied to the substitution of glyph when one consonant character is arranged with an

inferior subscript character.

clig Applied to the substitution of glyph when consonant characters and vowel symbols are

joined.

All Copyrights Reserved, CICC

42

Script kmhr(Khmer) Language dflt(Default)

Script kmhr(Khmer) Language dflt(Default)

The GSUB table sets the following features (substitution information)

As the base Khek Sangker does not accommodate all combined type glyphs, for those

without the combined type glyph, substitution information is not implemented for Khmer

OpenType font.

Under the GSUB column of the table, the implementation format of the GSUB table is

indicated numerically and represent each.

Script kmhr(Khmer) Language dflt(Default)

Script kmhr(Khmer) Language dflt(Default)

Substitution to Subscript characters

With ISO/IEC 10646 (UCS or Unicode), it is defined so that the glyph of subscript characters

are rewritten by U+17D2 (character code of consonant characters). Depending on where the

subscript character is located, in front, bottom, or back of the consonant character, three

features are defined.

Table 5.14 Substitution to Subscript character

Feature Corresponding Character Code GSUB Description

pref U+179A 2 Attached in front of the consonant character

pstf U+1783

U+1788

U+178D

U+1794

U+1799

U+179F

2 Attached behind the consonant character

blwf Attached below the consonant

character

2 Consonant character code other than the

above

All Copyrights Reserved, CICC

43

Different Glyphs of Consonant Characters

Table 5.15 Different Glyphs of Consonant Characters

Feature Character

Code

Character

Classification

Before

Substitution

After

Substitution GSUB Description

blws U+1789 CI

1 The glyph shape

changes when

subscript character

is attached

clig U+1794 CJ

2 Becomes a

different glyph

shape when joined

with right side

vowel symbol and

right side of the left

right enclosed

vowel symbol.

All Copyrights Reserved, CICC

44

Different Glyphs of Subscript characters

Table 5.15 Different Glyphs of Subscript characters

Feature Character

Code

Character

Classification

Before

Substitution

After

Substitution GSUB Description

pres U+17D2

U+179A

JE

2 When following

different subscript

characters,

becomes a different

glyph with a long

bottom.

U+17BF DH

psts

U+17C0 DH

2 When the

precedent

consonant

character is CH or

CI, and subscript

characters other

than JE follow the

consonant

character, becomes

a glyph shape with

a long right side

downwards.

pres U+17D2

U+1789

JF

+<JF>

1 When the

precedent

consonant

character is the

same as itself, the

glyph shape

changes.

All Copyrights Reserved, CICC

45

Different Glyphs of Vowel Symbols

Table 5.15 Different Glyphs of Vowel Symbols

Feature Character

Code

Character

Classification

Before

Substitution

After

Substitution GSUB Description

U+17BF DH

psts

U+17C0 DH

2 When the precedent

consonant character

is CH or CI, and

subscript characters

other than JE follow

the consonant

character, becomes a

glyph shape with a

long right side

downwards.

Different Glyphs of Added Symbols

Table 5.15 Different Glyphs of Added Symbols

Feature Character

Code

Character

Classification

Before

Substitution

After

Substitution GSUB Description

clig U+17C9 SC

1 When the superior

vowel symbol DC and

DB (17B6+17C6)

follow, the glyph

shape changes (same

glyph as 17BB).

clig U+17CA SD

1 When the superior

vowel symbol DC

follows, the glyph

shape changes (same

glyph as 17BB).

All Copyrights Reserved, CICC

46

(5)GPOS Table

Microsoft defines the following features for the position of glyphs implemented for the

Khmer OpenType font.

Table 5.18 Position Features Related to The Khmer script Defined by Microsoft

The GPOS table sets the following features (glyph position adjustment information).

Under the GPOS column of the table, the implementation format of the GPOS table is

indicated numerically and represent each.

1 MarkToBase Attachment Positioning

2 MarkToMark Attachment Positioning

Adjustment of Glyph Position of Subscript characters

Table 5.19 Adjustment of Glyph Position of Subscript characters

Feature Character

Code

Character

Classification Character GPOS Description

blwm - JA

JB

JC

JF

CH 1 The glyph position of the subscript

character is lower.

blwm U+17D2

U+1798

JC JD 2 The glyph position of JC is lower.

mkmk U+17D2

U+1798

JC JA

JB

2 The glyph position of JC is lower.

Feature Description

dist Information on positional adjustment between consonant character, front symbol, and back symbol

blwm Information on positional adjustment between consonant character and bottom symbol

abvm Information on positional adjustment between consonant character and top symbol

mkmk Information on positional adjustment between symbols

All Copyrights Reserved, CICC

47

Adjustment of Glyph Position of Vowel Symbols

Table 5.20 Adjustment of Glyph Position of Vowel Symbols

Feature Character

Code

Character

Classification Character GPOS Description

blwm - DD

DE

CH 1 The glyph position of the inferior vowel

symbol (or bottom of vowel symbol) is

lower.

blwm - DD

DE

CI 1 The glyph position of the inferior vowel

symbol (or bottom of vowel symbol) is

lower.

blwm - DD

DE

JD 2 The glyph position of the inferior vowel

symbol (or bottom of vowel symbol) is

lower.

abvm - DC CC

CD

1 The glyph position of the superior vowel

symbol is higher.

mkmk - DD

DE

JA

JB

JC

JF

2 The glyph position of the inferior vowel

symbol (or bottom of vowel symbol) is

lower.

mkmk - DC SD 2 The glyph position of the superior vowel

symbol is higher.

All Copyrights Reserved, CICC

48

Adjustment of Glyph Position of Added Symbols

Table 5.21 Adjustment of Glyph Position of Added Symbols

(6)GDEF Table

Sets “mark” attributes to superior, inferior subscript characters, vowel symbols, and added

symbols.

Others are “simple” attributes.

(7)DSIG Table

The DSIG table is entered with electronic signatures. This is not unique to Khmer fonts, but

is nonetheless implemented. To acquire the official authentication key for electronic

signature, there is a need to apply to certification organizations like VeriSign. In this project,

a test authentication key was used.

Feature Character

Code

Character

Classification Character GPOS Description

abvm - SA

SB

SC

SD

CC

CD

1 The glyph position of the added

symbol is higher.

mkmk U+17CD U+17CD DC 2 The glyph position of the added

symbol is higher.

All Copyrights Reserved, CICC

49

5.3 Font Development Procedure

5.3.1 Development Environment

The following shows the Khmer OpenType font development hardware and software

environment of this project.

Table 5.22 Development Environment

Hardware Intel architecture machine

Software OS: Windows XP Professional

Uniscribe (dll 1.453.3665.0)

Development Tools • Cygwin (Unix compatible environment)

(Acquired from: http://cygwin.com/)

• gcc 2.95.3-5

(C compiler)

(Acquired from: Cgywin installation)

• VOLT 1.1b

(OpenType Layout Table compilation tool)

(Acquired from:

http://www.microsoft.com/typography/developers/volt/default.htm)

• dsig (Signature generation tool)

(Acquired from:

http://www.microsoft.com/typography/developers/dsig/default.htm)

All Copyrights Reserved, CICC

5.3.2 Development Procedure

The following shows the development procedure of the Khmer OpenType font.

(1)Deletion

The Platfo

cmap table

The unique

delcmap is

The using

(2)Modificat

The origin

mkname is

The using

The name

Deletion of cmap table Modification of name table Modification of OS/2 table

Addition of DSIG table Addition of GSUB, GPOS,

GDEF tables Addition of cmap tables

delcmap

mkname

<Platfor

Figure 6.8 Development Procedure of Khmer OpenType Font

50

of cmap Table

rm ID=3 Encoding ID=1 mapping table is deleted from the original TrueType font

.

tool delcmap developed was used for the deletion.

command based, and the development language is C.

method is:

ion of name Table

al TrueType font name table is rewritten using the unique tool mkname.

command based, and the development language is C.

method is:

table information file stores items set in the name table in text format.

<Font name> <Mapping table index number>

<Font name> <Name table information file>

m ID> <Platform-specific encoding ID> <Language ID> <name ID> <String>

All Copyrights Reserved, CICC

The following shows a specific example.

*Items are divided by tab.

(3)Modification of OS/2 Table

The original TrueType font OS/2 table is rewritten using the unique tool mkos2.

Rewriting is done only for modified parts.

mkos2 is command based, and the development language is C.

The using method is:

T

T

*

1 0 0 1 Khek Sangker Test

1 0 0 2 Regular

1 0 0 3 Khek Sangker Test

mkos2 <Font name> <OS/2 table information file>

he os/2 table information file stores items and values set in the OS/2 table in text format.

<Item 1><Value 1>

<Item 2><Value 2>

….

51

he following shows a specific example.

Items are divided by tab.

ulUnicodeRange1 0x00000001 % bit0

ulUnicodeRange2 0

ulUnicodeRange3 0x00010000 % bit80

ulUnicodeRange4 0

usLastCharIndex 0x17E9

ulCodePageRange1 0x00000001 % bit0

ulCodePageRange2 0

All Copyrights Reserved, CICC

52

(4)Addition of cmap Table

The original TrueType font cmap table is added with ISO/IEC 10646 (UCS or Unicode)

compatible mapping table of Platform ID=3 Encoding ID=1.

The unique tool mkcmap4 developed was used for the addition.

mkcmap4 is the base of the command generating the format 4 mapping table, and the

development language is C.

The using method is:

The cmap table information file stores items and values set in the mapping table in text

format.

(5)Addition of GSUB, GPOS, GDEF Tables

The GSUB, GPOS, GDEF tables were implemented using the VOLT (Visual OpenType

Layout Tool) provided by Microsoft. This is a GUI based tool. The tables can be

implemented while visually checking the changes in the glyph shape and position.

*VOLT independently rewrites the cmap table. But some original cmap table information

may be lost. The lost information must be set again using VOLT.

(6)Addition of DSIG Table

The DSIG table was implemented using dsig (signature generation tool provided by

Microsoft. This is a command base.

mkcmap4 <Font name> <cmap table information file>

<segment count>

<endCount 1>

<endCoount 2>

<startCount 1>

<startCount 2>

….

<character code1> <glyph index 1>

<character code2> <glyph index 2>

All Copyrights Reserved, CICC

53

6. Verification Program Development -Development of Input (Transcription) Programs-

An input (transcription) program was developed as a method for inputting the Khmer script.

The following points were taken into consideration in the development of the program.

• Development in general OS environment

• Natural input method based on intentions of native speakers

• Flexible architecture not limited by technical constraints of specific input methods

The following input method was developed based on the above conditions.

• TCL program running on the Windows Im interface

• Input shall be a free keyboard sequence

• Simple logics can be incorporated

• Outputs corresponding to multiple entries are encoded character strings of UCS

• Has multiple modes which can be switched freely (actually four modes).

The above IM was implemented with two modes based the instructions from Professor Minegishi

and exisiting keyboard layouts obtained in field studies.

• Transcription mode method from Latin alphabets by Professor Minegishi’s method.

• Direct encode input method based on existing keyboard layout.

All Copyrights Reserved, CICC

54

7. Pilot Tests Over the following two occasions, guests from NiDA (Under Secretary of State: Mr. SORASAK

PAN) were invited for user verification by native speakers at the Tokyo University of Foreign

Studies, Research Institute for the Languages and Cultures of Asia and Africa.

• 1st time; August 23, 26, 27, 2002

Participated by CHEA Sok Huor and TOP Rithy

• 2nd time: September 25 and 26, 2002

Participated by HE. UNG Vuthy and CHEA Sok Huor

The results confirmed that the basic functions are satisfactory.

At this point, there are no marked elements that are insufficient as encoded characters.

However, it was found that auxiliary function characters are required for end-of-line processing.

(when at the end-of-the-line, function which clearly prompts starting a new line and function

which clearly stops starting a new line).

In addition, there exists ambiguous combinations of whether to perform a ligature process and it

was clarified that functional elements to inhibit ligature processing in the display of one syllable

are essential.

Examples of such auxiliary function characters are ZWJ (zero width joiner;U+200C) and ZWNJ

(zero width non-joiner;U+200D), ZWSP (zero width space; U+200B) used by Microsoft for trial

implementations.

*Note by CICC: The third objective of this project is to identify needs this kind of additional

functutions.

To prevent confusion in future implementations, it should be necessary to disclose the handling of

these function characters in the Khmer script based on the agreement between related

organizations such as Cambodian National Body, ISO/IEC JTC1, Unicode Technical Committee,

etc.

All Copyrights Reserved, CICC

55

8. Information Obtained in This Survey and Discussion As a result of this survey, the following were implemented for UCS the Khmer script;

(1)All glyph images required for display were implemented, character code strings and relation of

glyphs for display were described by methods such as tables, and it was clarified that the method

for displaying the glyphs of required characters can be put to practical application.

(2)The input results are not used as direct data for various character input methods, and data

strings are made according to conversion programs and reference tables. It was verified that data

strings can be made from the same character codes regardless of the input method, without using

the input results as direct data.

(3)Even the Cambodian agency who had doubts regarding the UCS method in the beginning about

the output and input by this method confirmed the practicability of this method.

Reference information:

The Laos Governmental Science and Technology Environment Agency IT center has also

confirmed the potentials of extending the Laotian characters.

8.1 Dealing with Complicated Character Structures

With conventional character output methods, the structure of multiple characters was defined by

the fixed relation between the attributes of letter shapes for output and character codes. Due to the

need to deal with each character structure by script and character code, this serves as the

bottleneck of corresponding language. With the method verified here, required letter shapes for

output were prepared for all results for this complicated structure, and the relation with character

codes is displayed by separately describing character codes and output a glyph images. The

experiment results show that all rendering requirements can be fulfilled such as fine tuning of

mutual position of each element, minute changes and synthesis of glyphsbylcharacter codes

sythesis, addition of accent/tone marks, etc., adjustment of letter shapes and change in output

golyph according to mutual positional relations within character code strings. (presentation form).

With this method, basically all the required items are prepared, and by defining the relation, the

technology should be applicable to all scripts in the regions studied, and reviews on letter shape

changes of independent, beginning of words, in words, and at the end of words of Arabic

characters.

Furthermore, this method should be applicable to European and American accented characters and

synthesized characters (presentation forms) and should be usable gerenerally.

All Copyrights Reserved, CICC

56

8.2 Handling of Diverse Character Codes for Same Script (Diversity of Display Methods)

This survey was conducted centering around the implementation of UCS character codes.

However, character codes actually used are not limited to UCS. On the contrary, character

codes other than UCS are used extensively, and the diversity is complicating the handling of

character codes.

Though character codes are described as being diverse, eventually only one is displayed. The

difference lies only in the diversity of the methods involved. With the method used to confirm

the effectiveness of the results of this survey, all the items required for output eventually were

prepared, and the relation with separate character codes is described logically. For this reason,

by changing only the description of the relation, it may be possible to output diverse character

codes as intended by eliminating character processing unique to character codes which may

lose the interchangeability of output letter shape development by character code which takes up

the most time and efforts and data processing processes.

8.3 Handling of Diverse Character Codes for Same Script (Diversity of Entry Methods)

For the input of character codes, due to the need for character input methods matching the logical

character of character codes, there is a need to design inputs by character codes for the same

language and same script.

With the methods verified for their usefulness in this survey, diverse character codes of the same

script can be inputted using the same input method.

8.4 Technical Conclusions

The output and input methods verified in this survey were confirmed to be applicable to

South-East Asian languages for which implementation has been slow, and recommendable as

methods useful for diversifying character codes of targeted languages.

All Copyrights Reserved, CICC

57

9. Future Tasks (Reference: For Realizing Results of This Survey) The implementation of UCS characters with international consistency by each Asian country is a

common theme.

Based on the knowledge obtained in this survey, information such as the recommended processes,

precautions, project expenses, etc. for implementing the scripts of each Asian country in the future

are provided as reference in the aim to contribute to the establishment of plans for dealing with

diverse languages.

This survey and research and prototype development project successfully achieved the foal of the

survey and research on the technical development on expandability of Arabic and South and

South-East Asian scripts.

The potentials of technical development for expanding security were successfully clarified by

technically clarifying multi-lingual implementation and verifying the effectiveness of current

international standards (ISO/IEC 10646 UCS).

However, from the viewpoint of the infrastructure for further promoting information provision

and exchange by digitalization and multi-language, in order to realize a practical environment for

outputting and inputting the Khmer scripts using UCS, more diverse support is evidently

necessary. Likewise, the same trends are seen for the other Arabic and South and South-East

Asian scripts, and similar support is indispensable.

With regard to the Khmer script which was taken up in this project, based on the fact that the

development process of UCS was not participated by native speakers and governmental

representatives of the country concerned, the Cambodian National Body (NiDA), which had just

been set up by the good interests and support of volunteers from non-governmental organizations

from several countries including Japan, requested revision of the UCS standards to ISO/IEC

JTC1/SC2. During this process, various proposals on encoding methods, system implementations,

font developments were carried out by groups involved in the development of UCS The Khmer

script, good will non-governmental organization volunteer groups, and IT venders in Cambodia,

however no consideration was given to consistency, sustainability, and succession whatsoever

between these groups.

As a result, problems reviewed for individual proposals and implementations, criticisms of groups

against those proposals and implementations were not accumulated as knowhow. If a proposal or

implementation failed to receive support from the market or National Body, other proposals and

implementations would start all over again from scratch, thus lacking the process of learning from

failure whatsoever.

All Copyrights Reserved, CICC

58

Based on such a background, future development support and technical transfer should require

sustainable and progressive processes taking into consideration collaboration with the Cambodian

National Body, international consistency, marketability, etc.

From this perspective, it should be necessary to undertake projects for establish more practical

input/output environments in a sustainable and progressive manner based on the results of this

survey and prototype development project.

---end----