
CONSOLE BASED INDIAN LANGUAGE SUPPORT FOR THE LINUX OPERATING SYSTEM

A Project Report

Submitted in partial fulfillment of the requirements

for the award of the degree of

Master of Technology

in

Computer Science and Engineering

by

Ratheesh K (CS99739)

Under the guidance of

Dr Hema A Murthy

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY, MADRAS

JANUARY, 2001


CERTIFICATE

This is to certify that the report entitled CONSOLE BASED INDIAN

LANGUAGE SUPPORT FOR THE LINUX OPERATING SYSTEM

submitted by Ratheesh K (CS99739), to the Indian Institute of

Technology, Madras, for the award of the degree of Master of Technology,

is a bonafide record of the project work done by him under my

supervision. The contents of this report, in full or parts, have not been

submitted to any other Institute or University for the award of any degree

or diploma.

Dr Hema A Murthy

Dept. of Computer Science & Engg.

Indian Institute of Technology
Madras – 600036

Place : Madras – 600036
Date :


ACKNOWLEDGEMENTS

I express my sincere thanks to my guide Dr. Hema A Murthy for

the invaluable guidance and encouragement she provided throughout the

project work. Madam was always approachable with any kind of doubt

and she has always been a source of inspiration for me. The weekly

project review meetings greatly helped in arriving at a systematic

approach of work and timely completion of the project.

I am deeply grateful to Prof. T.A. Gonsalves for giving timely

suggestions and for reviewing different phases of the project, which has

greatly helped me to improve the design and also to address various

issues in the implementation.

I thank Dr. S. Raman, our faculty advisor, for his co-operation

throughout my MTech course.

I would like to thank all of the DONLabbers for encouraging me

during the project work. I specially convey my thanks to the IndLinux

team members, Patricia, Shenoi and Sreepriya for their cooperation

and support.

I also wish to thank my teacher, Sri. P.C. Reghuraj, for

encouraging me during the course work.

Finally, I would like to thank my beloved parents and sister for

their affection and moral support.


ABSTRACT

Affordable local language software will play a crucial role in the

process of taking the benefits of "information revolution" to the

marginalized sections of society. The objective of this project is to develop

a console-based local language interface for the Linux Operating System.

Linux is chosen for development as it is a robust and stable operating
system and is freely available under the GNU General Public License.
Indianisation of Linux can be done in two different environments: the
console-based environment and the X-Windows based environment. The focus
of this thesis is to Indianise Linux for the console-based environment.
The Linux console does not support variable-width fonts, which are
required for displaying Indian language scripts. Further, consonant-vowel
clusters in Indian languages produce new, non-trivial glyphs. In the
proposed design, a generic solution for all Indian languages is worked
out. The console and TTY drivers of the Linux OS are modified to interpret
user-defined parse rules and to display multiple glyphs per character,
with all the associated functionality. Utilities are provided for the user
to define the parse rules and multiglyph mappings and to load them into
the kernel. Font files are also developed. The kernel modifications
support Indian language fonts and, since the modifications are in the
kernel, they can be inherited by the various applications running on the
console. To make the OS more user friendly, some common applications

like editor, mailer, browser, command interpreter and compiler have

been customised to make full use of the kernel support. The current

revision of the console based solution uses the ISCII standard approved

by the Bureau of Indian Standards.


TABLE OF CONTENTS

1. OVERVIEW
    1.1 Scope of the Project
    1.2 ISCII – Gateway to Transliteration
    1.3 Issue: Width of Characters
    1.4 Issue: Vowel & Consonant Clusters
    1.5 Issue: Internationalisation of applications
    1.6 Solution Methodology
    1.7 Major Contributions of the Thesis
    1.8 Organization of the Thesis

2. BACKGROUND: DISPLAY MECHANISM IN LINUX CONSOLE
    2.1 Keyboard Input Mechanism at Console Level
    2.2 The PSF file format
    2.3 Kernel Data Structures for Console Display
        2.3.1 Font Information
        2.3.2 Unicode Mapping
        2.3.3 Console Information
        2.3.4 Unimap Directory
    2.4 Loading a Font and Mapping
    2.5 Console Display Pipeline
        2.5.1 Displaying a Regular Character
        2.5.2 Processing control characters

3. DESIGN AND IMPLEMENTATION OF KERNEL PATCH
    3.1 Multiglyph Support
        3.1.1 Display of Characters
        3.1.2 Deletion & Backspacing
        3.1.3 Inserting a Character
        3.1.4 Processing Cursor Positioning
    3.2 Parserule Support
        3.2.1 The Forward and Reverse DFAs
        3.2.2 Forward Parserule Matching
        3.2.3 Reverse Parserule Matching
    3.3 Utilities for Loading Multimaps and Parserules
        3.3.1 Loadmultimap
        3.3.2 Loadparserules

4. LOCALIZATION OF APPLICATION PROGRAMS
    4.1 Localization Using Gettext
    4.2 Applications Modified

5. CONCLUSION
    5.1 Project Results
    5.2 Publication
    5.3 Website
    5.4 Observations
    5.5 Future Enhancements

BIBLIOGRAPHY


Chapter 1

OVERVIEW

1.1 Scope of the Project

Almost all the widely available software today is written and

documented in English, and uses English as the medium to interact with

users. This has the advantage of a common language of communication

between developers, maintainers and users from different countries. But,

in a country like India, an overwhelming majority of the population does

not know English. Given this fact, availability of affordable native

language software will play a crucial role in the process of taking the

benefits of the "information revolution" to the marginalized sections of

society [1] and to achieve appropriate social use of information

technology.

The objective of this project is to develop a native language

interface for the Linux operating system at the console level. Developing

a native language interface at an operating system level is a better

proposition compared to developing it at an application level as the

former enables all the applications running on top of the operating

system to inherit the interface. The choice of Linux as the operating

system has been motivated by the fact that Linux is a robust and stable

operating system and is freely available [2].


There are two modes in which the computer can be used in the

Linux operating system, namely, the console mode and the X-Windows

mode. The requirements on the RAM vary with the mode. In the console

mode the RAM requirement is 4 MB while even a minimal X-Windows

based system requires 6-8 MB. The effort to Indianise Linux thus

consists of two separate tasks of Indianising two different environments

and then customisation of applications to run in both of the

environments. (See figure 1.1)

Figure 1.1 Overview of Linux Indianisation

In either case, the primary goal is to enable applications to inherit

the interface with no or minimal modification. Further, an application

developed in the console-based environment must work without

requiring any modification in the X-environment. In addition, once

support has been developed for a particular Indian language, the effort to

enable any other Indian Language support should require only changes

to the configuration.

The focus of this project is console based local language support.

Indianisation of the X-Windows environment is dealt with in the MTech

thesis by VS Shenoi [3].

[Figure 1.1 box labels: Indian Language Support for Linux; Console Based; X-Windows Based; Kernel Modifications; Applications in Indian Languages; Focus of this Thesis]


1.2 ISCII – Gateway to Transliteration

People who have visited local language web sites might have noticed
the presence of a large number of fonts for the same language. Across
different fonts, not only the font faces but also the mapping between font
indices and the actual letters changes, and the keyboard layout changes
with it. We therefore end up with a number of fonts and keyboard mappings
for the same language.

Efforts to standardise character codes for non-English languages
started in the early eighties. Unicode [4] was proposed by a group of
industry leaders; in this standard, each character is represented by 16
bits. This amounts to 2^16 = 65536 possible character codes, which is
sufficient to represent almost all major languages of the world. Slots in
the range [0, 65536) are allotted to different languages.

One obvious drawback of the Unicode standard is that it requires
100% more storage space than ordinary 8-bit characters. When transmitted
over a network, it requires double the bandwidth. When applied to Indian
languages, however, Unicode has some more serious limitations.

Unicode doesn’t exploit the similarity between the alphabets of

various Indian languages. All Indian language alphabets have a common

direct mapping to phonetic syllables. So, assigning codes to the phonetic

syllables rather than individual alphabets of different languages has

many advantages. First of all, a single encoding standard can be used for

representing all Indian languages. Next, a text in one language when

encoded using these phonetic codes can be read in any Indian language

by just changing the font. This property is called Transliteration. To


explain this further, I will take up an example: Consider a part of a

hypothetical phonetic code given in Table 1.1.

Character   Character Name        Font in       Font in   Font in
Code                              Devanagari    Tamil     Malayalam

200         Pa                    प             ப         പ
201         Ka                    क             க         ക
202         Halant Modifier       ◌्            ◌்        ◌്
203         Sha                   ष             ஷ         ഷ
204         I – Vowel Modifier    ◌ि            ◌ி        ◌ി

Table 1.1 Part of a sample phonetic character code

Now consider the word Pakshi (meaning "bird" in Hindi). It can be
deconstructed into the phonetic syllables Pa + Ka + Halant + Sha + I
Modifier. Thus, the character codes will be 200, 201, 202, 203, 204 in
our hypothetical coding method. Once the word is encoded in this form,
it can be read in Hindi, Malayalam or Tamil, as given in the above
table, by just changing the font. (The ordering of the glyphs may differ
between languages, though: for example, the I modifier comes after the
consonant in Tamil and Malayalam, whereas it precedes the consonant in
Hindi.) This is exactly what is meant by transliteration.
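The idea can be sketched in C as follows. This is only an illustration under stated assumptions: the codes 200 to 204 are the hypothetical ones from Table 1.1 (not real ISCII values), the glyph labels such as "dev:pa" are placeholder strings standing in for font glyphs, and the names render, font_devanagari and font_tamil are invented for this example.

```c
#include <stdio.h>

/* Hypothetical phonetic codes from Table 1.1 (not real ISCII values). */
enum { PH_PA = 200, PH_KA, PH_HALANT, PH_SHA, PH_I_MOD };
#define PH_FIRST PH_PA
#define PH_COUNT 5

/* A "font" here is just a table from phonetic code to a glyph label. */
const char *const font_devanagari[PH_COUNT] = {
    "dev:pa", "dev:ka", "dev:halant", "dev:sha", "dev:i"
};
const char *const font_tamil[PH_COUNT] = {
    "tam:pa", "tam:ka", "tam:halant", "tam:sha", "tam:i"
};

/* Write the glyph labels for codes[0..n-1] into out, separated by spaces.
 * Returns 0 on success, -1 on an unknown code or if out is too small. */
int render(const unsigned char *codes, size_t n,
           const char *const font[], char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (codes[i] < PH_FIRST || codes[i] >= PH_FIRST + PH_COUNT)
            return -1;
        int w = snprintf(out + used, outsz - used, "%s%s",
                         i ? " " : "", font[codes[i] - PH_FIRST]);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;
        used += (size_t)w;
    }
    return 0;
}
```

The same code sequence for Pakshi is passed to render unchanged; only the font table differs between the Devanagari and Tamil renderings. Glyph reordering, mentioned above, is deliberately left out of this sketch.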

This is possible only because Indian language alphabets are deeply
rooted in phonetic syllables (linguistically, this is because they are all
based on the ancient Brahmi script [5]), and this property is not exploited
in Unicode.


The Indian Script Code for Information Interchange (ISCII) standard,
proposed by the Department of Electronics (DOE), Government of India, in
1983 and approved by the Bureau of Indian Standards [6], provides an
optimized and elegant encoding scheme for all Indian languages. It is an
eight-bit character-encoding scheme in which the lower 128 codes are
assigned to the ASCII characters and the upper 128 codes are used for
Indian language characters. Currently 10 Indian scripts are supported.
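The split of the 8-bit code space can be illustrated with a minimal C sketch. The function names are invented for this example; the only fact relied on is the one stated above, namely that codes below 128 are ASCII and the remaining codes carry Indian language characters.

```c
#include <stddef.h>

/* In an 8-bit ISCII stream, bytes below 0x80 are plain ASCII. */
int is_ascii_byte(unsigned char b)
{
    return b < 0x80;
}

/* Count the bytes of s[0..n-1] that fall in the upper (script) half. */
size_t count_script_bytes(const unsigned char *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (!is_ascii_byte(s[i]))
            count++;
    return count;
}
```

Because ASCII is untouched, a mixed stream such as a shell command followed by an Indian language file name can be handled by the same 8-bit code path.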

A keyboard standard for Indian scripts was also brought out by

DOE in 1986, called the Inscript keyboard layout, which is based on the

ISCII characters.

For localizing the Linux operating system, it was proposed to use ISCII
for the following reasons:

• Less storage space and bandwidth are required.
• The same keyboard layout can be used for all languages.
• Transliteration is feasible.
• Many applications assume 8-bit characters internally, and these can
  work smoothly with ISCII as well.
• Any application written for the Linux console will work in X-Windows
  and vice versa, if it is based on ISCII.

1.3 Issue: Width of Characters

Linux uses a font format called PC Screen Font (PSF) for console.

Neither the format, nor the kernel modules implementing the display

mechanism, viz., console and video drivers, support variable width fonts.

Moreover, the width of a font glyph is fixed at 8 pixels. This is not a
problem for English characters, where even the widest glyph, “m”,
can be represented legibly in 8 pixels. Also, the mean deviation of width
in English characters is small; therefore, even if all characters are


represented using the same width, there is no discomfort in terms of

aesthetics to the end user.

But this is not the case with most Indian language characters. A
character such as Ë in Tamil or B in Malayalam cannot be fitted legibly
into an 8-pixel width. One option would be to enable the kernel to support
wider fixed-width fonts (say, all 16 pixels wide). But then there are
characters like S in Malayalam or ® in Hindi which are too narrow, and it
will look odd if characters with this much variation in width are
displayed together on a screen with the same width allocated to all of
them.

An alternative solution should provide support for variable width

fonts in the console mode.

1.4 Issue: Vowel & Consonant Clusters

Another issue that needs to be addressed is the display of
consonant-vowel clusters. In Indian languages, such clusters produce
new, non-trivial glyphs of their own. As an example, the phonetic
syllables Ka, Halant and Sha together produce the sound Ksha, which is
represented in various languages as given in Table 1.2.

Script        Ka    Halant    Sha    Resultant glyph

Devanagari    क     ◌्        ष      क्ष
Tamil         க     ◌்        ஷ      க்ஷ
Malayalam     ക     ◌്        ഷ      ക്ഷ

Table 1.2 Formation of consonant-vowel clusters


Besides, the glyph ordering also differs between languages. The
modifier for the vowel “I” appears before the consonant in Hindi and
after the consonant in Tamil. For example, consider Ka + I modifier. In
Hindi, it is क + ◌ि = कि; but in Tamil it is க + ◌ி = கி. In some cases,
the consonant may get sandwiched between components of the vowel
modifier. For example, consider the modifier for O in Tamil, applied to
the character க (Ka). The resultant character Ko looks like கொ.

Editing operations must also take vowel modifiers into account. For
example, if we press backspace at the character கொ, it should give க,
not கெ. A cursor-positioning request to go to the second column should
place the cursor after கொ, not after கெ.

1.5 Issue: Internationalization of applications

If we want to really use the Local Language Interface (LLI) for

console (or X-Windows for that matter), applications running on it, like

Mailtool, Editor, Web browser and the command interpreter also need to

be modified to give a user interface in local language. This may require:

a) Modification of the application.

b) Generation of the application specific substitution string tables in the

language of interest.

1.6 Solution Methodology

The objective of this project is to develop a local language interface

for the console, considering all the issues involved.

The requirements of a practical solution are:


• Support should be developed at the kernel level, so that all
  applications running on it can inherit the LLI.
• It should be generic, i.e., language independent. The kernel should
  only enable the support for language-specific tables, which must be
  loadable at run time from appropriate configuration files.
• The solution should address all the issues specified in the previous
  sections.
• Applications written for the console environment should work in the
  X-Windows environment as well.

The ISCII standard, in consonance with the Inscript keyboard, is a good
choice for encoding characters in Indian languages (refer to section 1.2).
If applications are based on the ISCII standard, they can be used in
X-Windows as well.

The console display mechanism is handled by the console, TTY and
video device drivers in the kernel. Providing true variable-width font
support would require modification of all these drivers. The solution
adopted in this thesis is instead to display multiple glyphs for a
character code in order to display wider characters. The glyphs are still
of a fixed width, and a one-to-many mapping mechanism is introduced in
the display pipeline. In this design, only the console and TTY drivers
need to be changed.
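A one-to-many mapping of this kind can be sketched as below. This is not the kernel patch itself (which is described in Chapter 3); the structure multiglyph, the function expand_code, the limit of three glyphs per code and the font positions used in the usage note are all assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_GLYPHS_PER_CODE 3   /* illustrative limit, not from the thesis */

/* Hypothetical one-to-many map entry: one character code is drawn as a
 * run of fixed-width glyphs taken from the loaded font. */
struct multiglyph {
    unsigned char  code;                        /* code seen by the console */
    unsigned char  nglyphs;                     /* number of 8-pixel cells  */
    unsigned short glyph[MAX_GLYPHS_PER_CODE];  /* font positions           */
};

/* Expand one character code into screen cells.
 * Codes without a map entry occupy a single cell holding the code itself.
 * Returns the number of cells written, or 0 if cells[] has too little room. */
size_t expand_code(const struct multiglyph *map, size_t nmap,
                   unsigned char code, unsigned short *cells, size_t room)
{
    for (size_t i = 0; i < nmap; i++) {
        if (map[i].code != code)
            continue;
        if (map[i].nglyphs > room)
            return 0;
        for (size_t g = 0; g < map[i].nglyphs; g++)
            cells[g] = map[i].glyph[g];
        return map[i].nglyphs;
    }
    if (room == 0)
        return 0;
    cells[0] = code;
    return 1;
}
```

With a table entry such as { 200, 2, { 300, 301 } }, the code 200 would occupy two adjacent 8-pixel cells and thus appear 16 pixels wide, while an ASCII character still takes one cell.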

Supporting vowel-consonant clusters requires context-sensitive
parsing at the console I/O level. The parsing rules differ between
languages, and hence should be user defined in order to keep the kernel
modifications language independent. Thus, the kernel should be able to
interpret and process the parse rules.
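To make the notion of a user-defined parse rule concrete, the following sketch rewrites a code sequence using a small rule table. It is an assumption-laden illustration: it uses a plain longest-match scan rather than the forward and reverse DFAs of Chapter 3, and the structure parse_rule, the function apply_rules and the cluster code 210 in the usage note are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical parse rule: a short sequence of codes is replaced by a
 * single code that selects the cluster glyph. */
struct parse_rule {
    unsigned char pattern[4];
    size_t        len;       /* number of pattern bytes in use (1..4) */
    unsigned char result;
};

/* Rewrite in[0..n-1] into out[], applying the longest matching rule at
 * each position and copying unmatched codes through unchanged.
 * out must have room for n bytes. Returns the output length. */
size_t apply_rules(const struct parse_rule *rules, size_t nrules,
                   const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i = 0, o = 0;

    while (i < n) {
        const struct parse_rule *best = NULL;

        for (size_t r = 0; r < nrules; r++) {
            const struct parse_rule *p = &rules[r];

            if (p->len > 0 && p->len <= n - i &&
                memcmp(p->pattern, in + i, p->len) == 0 &&
                (best == NULL || p->len > best->len))
                best = p;
        }
        if (best) {
            out[o++] = best->result;
            i += best->len;
        } else {
            out[o++] = in[i++];
        }
    }
    return o;
}
```

Using the hypothetical codes of Table 1.1, a rule mapping Ka + Halant + Sha (201, 202, 203) to an invented cluster code 210 turns the Pakshi sequence 200, 201, 202, 203, 204 into 200, 210, 204. Since the rules live in a table, changing language means loading a different table, not changing the matching code.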


1.7 Major Contributions of the Thesis

The major contributions of the thesis are summarised as follows:

• Kernel modifications for enabling, loading and display of multiple
  glyphs per character code.
• Kernel modifications to enable loading and interpretation of
  user-defined parse rules at the console I/O level.
• Development of utilities for loading the multiglyph map tables and
  parserules.
• Creation of PSF files, parserules and multimap files for one Indian
  language for testing.
• Enabling a mailtool, editor and compiler to support local language
  user interfaces and messages.
• Publication of a paper entitled “Indian Language Support to Linux
  Operating System”, accepted at the International Symposium: Information
  Technology, People’s Development and Culture, to be held in
  February 2001 at JK Institute, Allahabad.
• Hosting of a website to publicize technical information and
  downloadable packages of the project.

1.8 Organization of the Thesis

Chapter 2 of this thesis describes the current console display

pipeline in Linux. We also will go through the important data structures

used.

Chapter 3 describes the design and implementation of the kernel

patch for supporting multiglyphs and parse rules at console display

level.


In Chapter 4, we will address the issues regarding development of

applications in local languages.

Finally, Chapter 5 consolidates the results of the project work, lists
a few observations made regarding the work, and outlines possible future
enhancements.


Chapter 2

BACKGROUND: DISPLAY MECHANISM IN LINUX CONSOLE

This chapter gives a brief overview of the console display process in
the Linux operating system. The data structures and algorithms used
are then described. We also address some of the special cases and issues
therein.

2.1 Keyboard Input Mechanism at Console Level

The keyboard-input mechanism at the console level in Linux [7] is

as follows: When a key is pressed, the keyboard controller sends

scancodes to the kernel keyboard driver. The keyboard driver sends

whatever it receives to the application program when it is in scancode

mode (for e.g., when X-Windows runs). Otherwise, it parses the stream of

scancodes into keycodes, corresponding to key press or key release

events. These keycodes are sent to the application program when it is in

keycode mode. Otherwise, these keycodes are looked up in a keymap and

the character or string found there is transmitted to the application, or

the action described there is performed.
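The three delivery modes can be summarised in a small C sketch. It is a simplification under stated assumptions: only plain single-byte scancodes are covered (bit 7 flags a key release and the low 7 bits identify the key), the keymap is a flat 128-entry table, and the names kbd_mode, key_event, scancode_to_keycode and deliver are invented here rather than taken from the kernel.

```c
/* Illustrative names for the three modes described above. */
enum kbd_mode { MODE_SCANCODE, MODE_KEYCODE, MODE_KEYMAP };

struct key_event {
    unsigned char keycode;
    int           released;   /* 1 for a key release, 0 for a key press */
};

/* Assumption: plain keys only, where bit 7 marks a release and the low
 * 7 bits identify the key. Multi-byte scancode sequences are ignored. */
struct key_event scancode_to_keycode(unsigned char sc)
{
    struct key_event ev = { (unsigned char)(sc & 0x7f), (sc & 0x80) != 0 };
    return ev;
}

/* Return the value handed to the application for one scancode, or -1 if
 * nothing is delivered (a key release or an unmapped key in keymap mode). */
int deliver(enum kbd_mode mode, unsigned char sc,
            const unsigned char keymap[128])
{
    if (mode == MODE_SCANCODE)
        return sc;                       /* passed through untouched */

    struct key_event ev = scancode_to_keycode(sc);

    if (mode == MODE_KEYCODE)
        return ev.keycode | (ev.released ? 0x80 : 0);

    if (ev.released || keymap[ev.keycode] == 0)
        return -1;
    return keymap[ev.keycode];           /* character from the keymap */
}
```

Replacing the keymap table is what a utility such as loadkeys does at a much larger scale; an Inscript layout would be one such table.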

At the console level, Linux allows loading of a font into the
EGA/VGA character generator, with the options of specifying the
screen-font map and/or application character set mapping. Linux also
allows loading of keyboard translation tables. The consolechars utility
performs the former task and loadkeys the latter.

Thus at the console level, Linux allows the flexibility of customizing

the keyboard input as well as the display.

2.2 The PSF file format

The Linux console uses a font format called PC Screen Font (PSF) for
display [8]. A PSF file contains one character font whose width is 8
pixels, i.e., each scan line in a character occupies 1 byte.
It may contain characters of any height between 0 and 255, though
character heights lower than 8 or greater than 32 may not be useful.
Fonts can contain either 256 or 512 characters. The file can optionally
contain a Unicode mapping table. The “file mode” byte controls the font
size (256/512) and indicates whether the file contains a Unicode mapping
table.

The PSF file format is described here in pseudo EBNF notation.

Upper-case words represent terminal symbols, i.e. C types, lower-case

words represent non-terminal symbols, i.e., symbols defined in terms of

other symbols. [sym] denotes an optional symbol, {sym}* is a symbol that

can be repeated 0 or more times. {sym}*N is a symbol that must be

repeated N times. Comments are introduced with a “#” sign. The data

(unsigned shorts) are stored in LITTLE_ENDIAN byte order.

    psf_file      = psf_header raw_fontdata [unicode_data]
    psf_header    = magic_number filemode fontheight
    magic_number  = CHAR=0x36 CHAR=0x04
    fontheight    = CHAR         # measured in scan lines
    filemode      = CHAR         # 0: 256 characters, no unicode_data
                                 # 1: 512 characters, no unicode_data
                                 # 2: 256 characters, with unicode_data
                                 # 3: 512 characters, with unicode_data
    raw_fontdata  = {char_data}*<fontsize>
    char_data     = {BYTE}*<fontheight>
    unicode_data  = {unicode_array psf_separator}*<fontsize>
    unicode_array = {unicode}*   # any necessary number of times
    unicode       = U_SHORT      # UCS2 code
    psf_separator = unicode=0xFFFF
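As an illustration of the grammar above, the four-byte header can be decoded as follows. This is a minimal sketch, not the code of consolechars; the structure psf_info and the function psf_parse_header are names invented for this example, and only the four file modes listed in the grammar are accepted.

```c
#include <stddef.h>

#define PSF_MAGIC0 0x36
#define PSF_MAGIC1 0x04

/* Facts derived from a PSF header (names are illustrative). */
struct psf_info {
    unsigned fontsize;     /* number of glyphs: 256 or 512            */
    unsigned fontheight;   /* scan lines per glyph, 1 byte per line   */
    int      has_unicode;  /* non-zero if unicode_data follows        */
    size_t   glyph_bytes;  /* size of raw_fontdata in bytes           */
};

/* Parse the header at the start of buf.
 * Returns 0 on success, -1 on a short buffer, a bad magic number or a
 * file mode outside the range 0..3 described in the grammar. */
int psf_parse_header(const unsigned char *buf, size_t len,
                     struct psf_info *out)
{
    if (len < 4 || buf[0] != PSF_MAGIC0 || buf[1] != PSF_MAGIC1)
        return -1;

    unsigned mode = buf[2];
    if (mode > 3)
        return -1;

    out->fontsize    = (mode & 1) ? 512 : 256;   /* modes 1 and 3 */
    out->has_unicode = (mode & 2) != 0;          /* modes 2 and 3 */
    out->fontheight  = buf[3];
    out->glyph_bytes = (size_t)out->fontsize * out->fontheight;
    return 0;
}
```

For a 256-glyph font of height 16, raw_fontdata occupies 256 * 16 = 4096 bytes, and the optional unicode_data begins immediately after it.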

The utility consolechars is used for loading the font into the kernel. The

console device driver provides ioctl functions with codes PIO_FONTX,

PIO_UNIMAP etc. to load fonts and Unicode maps into the kernel. The

kernel maintains data structures to store the information regarding the

font glyphs and mapping tables. The consolechars utility processes the

PSF file to create the tables and calls the ioctl functions to load them

finally into the kernel space.

2.3 Kernel Data Structures for Console Display

The Linux kernel uses a set of data structures for storing the font
information and Unicode mappings in an optimized way. It uses a dedicated
data structure for storing all information relevant to each console. The
screen data is stored in a buffer, with each character-attribute
combination occupying two bytes. There are some additional data
structures that are used along with the ioctl function calls.

The kernel data structures used for console display processing are

defined in the files kd.h, console_struct.h, consolemap.h and

vt_kern.h. The files can be located in the include/linux subdirectory of

the main source tree [9]. Some of the important data structures are

discussed in this section.


2.3.1 Font Information

The data structure consolefontdesc stores the font information

that is loaded from a PSF file.

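    struct consolefontdesc {
        unsigned short charcount;   /* characters in the font (256 or 512) */
        unsigned short charheight;  /* scan lines per character (1-32) */
        char *chardata;             /* pointer to the font data */
    };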

charcount gives the number of characters present in the font (256 or

512) and charheight gives the scan lines per character (which is the

height of the character) and can be in the range 1-32. chardata points to

the font data.

The utility consolechars uses this structure for passing the font

information from the PSF file to the kernel.
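One detail worth sketching is how packed PSF glyph data might be prepared before it is handed over in chardata. The sketch below assumes, following the console_ioctl documentation rather than this thesis, that the kernel expects the data in expanded form with 32 bytes reserved per character regardless of charheight; the function name expand_font and the macro KERNEL_GLYPH_STRIDE are invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: the kernel takes font data in "expanded form", i.e. a
 * fixed slot of 32 bytes per glyph, of which only the first charheight
 * bytes are used. */
#define KERNEL_GLYPH_STRIDE 32

/* Copy charcount glyphs of charheight bytes each from the packed PSF
 * layout into the expanded layout, zero-filling the unused scan lines.
 * expanded must hold charcount * 32 bytes.
 * Returns 0 on success, -1 if charheight is outside 1..32. */
int expand_font(const unsigned char *packed, unsigned charcount,
                unsigned charheight, unsigned char *expanded)
{
    if (charheight < 1 || charheight > KERNEL_GLYPH_STRIDE)
        return -1;

    memset(expanded, 0, (size_t)charcount * KERNEL_GLYPH_STRIDE);
    for (unsigned c = 0; c < charcount; c++)
        memcpy(expanded + (size_t)c * KERNEL_GLYPH_STRIDE,
               packed + (size_t)c * charheight, charheight);
    return 0;
}
```

The expanded buffer, together with charcount and charheight, would then fill a struct consolefontdesc for the font-loading ioctl.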

2.3.2 Unicode Mapping

The data structure unipair stores a single Unicode to glyph index

mapping. Note that this is a one-to-one mapping; i.e., a single Unicode

entry is mapped to a single glyph index in the font file.

    struct unipair {
        unsigned short unicode;   /* Unicode value */
        unsigned short fontpos;   /* glyph index in the font */
    };

unicode gives the Unicode value and fontpos gives the glyph index in

the font.

The data structure unimapdesc stores the complete table.

    struct unimapdesc {
        unsigned short entry_ct;    /* number of entries in the table */
        struct unipair *entries;    /* the table itself */
    };

entry_ct gives the number of elements in the table and entries gives

the actual table.

This data structure is used for passing a screen font mapping to
the kernel. The Unicode to glyph index pairs present in the structure are
used for constructing the unimap directory, as explained in section 2.3.4.
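A small sketch shows how such a table can be filled and searched in user space. The two structures are repeated locally so that the example is self-contained (in a real program they come from the kernel header kd.h); the function unimap_find and the glyph positions in the usage note are assumptions made for illustration.

```c
#include <stddef.h>

/* Local copies of the structures shown above, to keep the sketch
 * self-contained. */
struct unipair {
    unsigned short unicode;
    unsigned short fontpos;
};

struct unimapdesc {
    unsigned short  entry_ct;
    struct unipair *entries;
};

/* Linear search of the flat table.
 * Returns the glyph index for ucs, or -1 if ucs has no mapping. */
int unimap_find(const struct unimapdesc *map, unsigned short ucs)
{
    for (unsigned i = 0; i < map->entry_ct; i++)
        if (map->entries[i].unicode == ucs)
            return map->entries[i].fontpos;
    return -1;
}
```

A table with the pairs (0x0915, 1) and (0x092A, 2) returns 2 for 0x092A and -1 for an unmapped value. The linear search costs time proportional to entry_ct, which is one reason the kernel converts the table into the directory structure of section 2.3.4.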

2.3.3 Console Information

The data structure vc_data stores the information regarding a virtual
console. Only the members relevant to this project are shown here.

    struct vc_data {
        unsigned short vc_num;
        unsigned int vc_cols;
        unsigned int vc_rows;
        unsigned int vc_size_row;
        unsigned short *vc_screenbuf;
        unsigned int vc_screenbuf_size;
        unsigned short vc_hi_font_mask;
        .......
        .......
        unsigned int vc_x, vc_y;
        unsigned long vc_uni_pagedir;
        unsigned long *vc_uni_pagedir_loc;
        unsigned int vc_state;
    };


vc_num gives the virtual console number.

vc_cols and vc_rows gives the number of columns and rows in the

screen.

vc_size_row gives the number of bytes per row. Since each character on

screen requires two bytes, this will be vc_cols * 2.

vc_screenbuf points to the display buffer of the console.

vc_screenbuf_size gives the display buffer size.

vc_hi_font_mask gives the attribute bit mask used for the upper 256 glyphs of the font. This will be zero if the current font contains only 256 glyphs (explained in section 2.5.1).

vc_x and vc_y give the column and row numbers of the current cursor position.

vc_uni_pagedir_loc points to the uni_pagedir structure allotted for

the console. (Explained in section 2.3.4)

vc_state gives the state of escape sequence parser. (Explained in

section 2.5.2)
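
The relation between these members can be illustrated with a short sketch. It uses a trimmed-down, hypothetical stand-in for vc_data (named vc_geom here) and assumes that the screen buffer is laid out row by row, with one 16-bit cell per character.

```c
/* Trimmed-down, hypothetical stand-in for vc_data with only the
 * members used in this example. */
struct vc_geom {
    unsigned int vc_cols;      /* columns on the screen */
    unsigned int vc_rows;      /* rows on the screen */
    unsigned int vc_size_row;  /* bytes per row = vc_cols * 2 */
    unsigned int vc_x, vc_y;   /* cursor column and row */
};

/* Bytes per row: each character cell is one 16-bit value. */
unsigned int row_bytes(unsigned int cols)
{
    return cols * 2;
}

/* Index, in 16-bit cells, of the cursor cell inside the screen buffer
 * (assumes a row-major layout). */
unsigned int cursor_cell_index(const struct vc_geom *vc)
{
    return vc->vc_y * vc->vc_cols + vc->vc_x;
}

/* The same position expressed as a byte offset. */
unsigned int cursor_byte_offset(const struct vc_geom *vc)
{
    return vc->vc_y * vc->vc_size_row + vc->vc_x * 2;
}
```

For an 80 x 25 screen, each row occupies 160 bytes, and a cursor in column 5 of row 2 corresponds to cell 165 of the buffer.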

At present, the data structure vc contains just a pointer to an element of type vc_data. This indirection leaves room for future enhancements.

struct vc {

struct vc_data *d;

};

The variable definition

struct vc vc_cons [MAX_NR_CONSOLES];

defines an array of vc structures for different consoles.


2.3.4 Unimap Directory

The Unicode to glyph index mapping is stored in a special data

structure called uni_pagedir defined as follows:

struct uni_pagedir {

unsigned short **uni_pgdir[32];

....

};

A simple method would be to store the mapping in an array. Since there are 2^16 = 65536 Unicode characters, this would require an array of 65536 elements, and each element should be in the range 0-511, which means 2 bytes would be required per element. This comes to 128KB for one console. If there are 8 virtual consoles, it comes to 1MB of memory for the mapping tables alone. This is a large overhead considering that the array will be sparse, with values initialized for less than 1% of the elements.

The uni_pagedir data structure is designed so that a mapping table occupies less space. The element uni_pgdir is an array of 32 double pointers to unsigned shorts. This forms a tree structure of 3 levels, as shown in Figure 2.1.

Figure 2.1 Structure of uni_pagedir (level 1: 32 pointers, indexed 0 to 31; level 2: 32 pointers per table, indexed 0 to 31; level 3: 64 glyph indices per table, indexed 0 to 63)

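Now, suppose a mapping <U, G> is to be inserted, where U is the Unicode value and G is the glyph index (both 16 bits). Let U = b15 b14 .... b0, where the bi's are the bits (0 or 1). The top five bits select the level 1 entry, the next five bits the level 2 entry, and the low six bits the level 3 entry, matching the 32, 32 and 64 entries per level in Figure 2.1. The insertion algorithm is as follows:

PROCEDURE insert (U,G)
BEGIN
    Let pgdir be the uni_pagedir structure corresponding
    to the current console;
    unsigned short ** u1 = pgdir->uni_pgdir[b15b14b13b12b11];
    IF (u1 == NULL) THEN
        Allocate memory for u1 (32 pointers) and store it in
        pgdir->uni_pgdir[b15b14b13b12b11];
    unsigned short * u2 = u1[b10b9b8b7b6];
    IF (u2 == NULL) THEN
        Allocate memory for u2 (64 unsigned shorts) and store it
        in u1[b10b9b8b7b6];
    u2[b5b4b3b2b1b0] = G;
END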
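
The insertion procedure, together with the corresponding lookup, can be sketched in C as follows. This is only an illustration under stated assumptions: struct pagedir is a simplified stand-in for uni_pagedir, the 16 bits are split 5/5/6 to match the 32, 32 and 64 entries per level shown in Figure 2.1, and the value 0xffff is assumed here as a marker for "no glyph".

```c
#include <stdlib.h>

/* Simplified stand-in for uni_pagedir: only the directory itself. */
struct pagedir {
    unsigned short **uni_pgdir[32];
};

/* Insert the mapping <u, g>. Returns 0 on success, -1 if out of memory. */
int pagedir_insert(struct pagedir *p, unsigned short u, unsigned short g)
{
    unsigned int i1 = u >> 11;          /* bits 15..11 */
    unsigned int i2 = (u >> 6) & 0x1f;  /* bits 10..6  */
    unsigned int i3 = u & 0x3f;         /* bits 5..0   */

    if (p->uni_pgdir[i1] == NULL) {
        /* calloc leaves all 32 second-level pointers NULL */
        p->uni_pgdir[i1] = calloc(32, sizeof(unsigned short *));
        if (p->uni_pgdir[i1] == NULL)
            return -1;
    }
    unsigned short **l2 = p->uni_pgdir[i1];
    if (l2[i2] == NULL) {
        l2[i2] = malloc(64 * sizeof(unsigned short));
        if (l2[i2] == NULL)
            return -1;
        for (int k = 0; k < 64; k++)
            l2[i2][k] = 0xffff;         /* assumed "no glyph" marker */
    }
    l2[i2][i3] = g;
    return 0;
}

/* Look up u. Returns the glyph index, or -1 if no mapping exists. */
int pagedir_lookup(const struct pagedir *p, unsigned short u)
{
    unsigned short **l2 = p->uni_pgdir[u >> 11];
    if (l2 == NULL)
        return -1;
    unsigned short *l3 = l2[(u >> 6) & 0x1f];
    if (l3 == NULL || l3[u & 0x3f] == 0xffff)
        return -1;
    return l3[u & 0x3f];
}
```

A lookup therefore costs at most three array accesses, while tables are allocated only for those parts of the Unicode range that actually contain mappings.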

This way the memory requirements are reduced. To illustrate this, consider the mapping given in Table 2.1.

Unicode Value    Glyph Index
0x0100           220
0x0101           221
0x2100           300
0x2200           310
0x203F           320

Table 2.1 A list of Unicode-to-glyph maps

The unimap directory constructed for these maps is shown in Figure 2.2.

Figure 2.2 Unimap directory constructed from glyph mappings of Table 2.1

Here, the total memory required can be calculated as follows:

Level 1: 32 Pointers (Only two are non-NULL)

Level 2: 32 × 2 = 64 pointers (Only 4 values are non-NULL)

Level 3: 64 × 4 = 256 unsigned shorts

Total: (32+64) × 4 + 256 × 2 = 896 bytes (assuming 4 bytes per pointer)

The memory required here is far less than in the array-based implementation, where memory must be allocated for the entire range even if most of the entries are not initialized.
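
The calculation above can be reproduced for any list of Unicode values with the following sketch. The function names are hypothetical; the pointer size is passed as a parameter so that the 4-byte assumption can be applied on any machine.

```c
#include <stddef.h>

/* Count the tables a three-level directory (32 / 32 / 64 entries per
 * level) needs for the given Unicode values and return the total size
 * in bytes. ptr_size is the assumed size of one pointer. */
size_t directory_bytes(const unsigned short *codes, size_t n, size_t ptr_size)
{
    unsigned char l1_used[32] = {0};
    unsigned char l2_used[32][32] = {{0}};
    size_t l1_nodes = 0, l2_nodes = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned int i1 = codes[i] >> 11;          /* level 1 index */
        unsigned int i2 = (codes[i] >> 6) & 0x1f;  /* level 2 index */
        if (!l1_used[i1]) { l1_used[i1] = 1; l1_nodes++; }
        if (!l2_used[i1][i2]) { l2_used[i1][i2] = 1; l2_nodes++; }
    }
    /* 32 root pointers, 32 pointers per level 2 table,
     * 64 unsigned shorts (2 bytes each) per level 3 table. */
    return (32 + 32 * l1_nodes) * ptr_size + 64 * l2_nodes * 2;
}

/* Size of the flat alternative: one 2-byte entry per 16-bit code. */
size_t flat_array_bytes(void)
{
    return (size_t)65536 * 2;
}
```

For the five values of Table 2.1 and 4-byte pointers, directory_bytes() gives 896 bytes, against 131072 bytes (128KB) for the flat array.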

2.4 Loading a Font and Mapping

The consolechars utility is used for loading a PSF font into the kernel. It reads the file, builds a few data structures to be passed to the kernel, and calls the ioctl functions referred to in section 2.2. The kernel in turn stores the font information in the consfontdesc data structure and the mapping table in the uni_pagedir data structure, using the algorithm explained in section 2.3.4.


2.5 Console Display Pipeline

Whenever a character is to be displayed, the function

do_con_write() in the console driver is called. The input to this

function is the character buffer to be displayed and the number of

characters. The characters can include escape sequences or regular

characters. The display pipeline is given in figure 2.3.

If in UTF-8 mode, a sequence of characters together will represent

a Unicode character, and the function first determines the Unicode value

of a sequence. Otherwise the Unicode value is assumed to be same as the

character code.
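
The step of combining a UTF-8 sequence into a single Unicode value can be sketched as follows. This is not the driver code, which processes one byte at a time inside do_con_write(); it is a simplified illustration that decodes one complete sequence of up to three bytes, which is sufficient for the 16-bit range used here. The function name is hypothetical, and overlong encodings are not rejected.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence of up to three bytes. Stores the value in
 * *out and returns the number of bytes consumed, or 0 if the input is
 * malformed or truncated. */
size_t utf8_decode16(const unsigned char *s, size_t len, unsigned short *out)
{
    if (len == 0)
        return 0;
    if (s[0] < 0x80) {                       /* 0xxxxxxx */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0) {             /* 110xxxxx 10xxxxxx */
        if (len < 2 || (s[1] & 0xc0) != 0x80)
            return 0;
        *out = (unsigned short)(((s[0] & 0x1f) << 6) | (s[1] & 0x3f));
        return 2;
    }
    if ((s[0] & 0xf0) == 0xe0) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (len < 3 || (s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
            return 0;
        *out = (unsigned short)(((s[0] & 0x0f) << 12) |
                                ((s[1] & 0x3f) << 6) | (s[2] & 0x3f));
        return 3;
    }
    return 0;  /* longer sequences fall outside the 16-bit range */
}
```

For example, the three bytes 0xE0 0xB4 0x85 decode to the Unicode value 0x0D05, the Malayalam letter A.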

The function first checks whether the character is a control

character or regular character and also checks the current state of the

escape sequence parser. Depending upon that, it processes the

characters appropriately.

2.5.1 Displaying a Regular Character

If the escape sequence parser is in ESNormal state, and if the

current character is a normal character, then it is displayed in the

following way:

For displaying a regular character, first, a function

conv_uni_to_pc() is called. This function gets the glyph index for the

Unicode value. For this, the uni_pagedir structure is traversed.

The screen buffer (see section 2.3.3 about the console structure) has 16 bits per character. Of these, 8 or 9 bits are required for representing the glyph index, depending on whether the font has 256 or 512 glyphs. The remaining bits represent the attributes. With 9-bit glyph indices, the attributes do not necessarily occupy the upper 7 bits; instead, they are packed according to the attribute bit mask given in vc_hi_font_mask.

Figure 2.3 Display Pipeline in Console (flowchart: characters from the buffer; UTF-8 mode? if yes, combine the sequence of characters to form a single Unicode value, otherwise take the character code itself as the Unicode value; parser state = ESNormal AND normal character? if no, process using do_con_trol(); if yes, get the glyph index by calling conv_uni_to_pc(), pack the glyph index with the attributes by referring to hi_font_mask, write the packed data into the screen buffer and advance the cursor coordinates)

The output of conv_uni_to_pc() is combined with the current

attribute character and repacked accordingly and the final value is

written into the screen buffer. Later, the video driver takes care of actual

rendering of the character onto the screen.
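
The packing of a glyph index with the attributes can be sketched as follows. The sketch assumes that the attribute byte occupies the upper 8 bits of the cell and that, for a 512-glyph font, vc_hi_font_mask selects the single bit that carries the ninth bit of the glyph index. The function names, and the mask value 0x0800 used in the example, are assumptions made for illustration.

```c
/* Pack an attribute byte and a glyph index (0..511) into one 16-bit
 * screen buffer cell. hi_font_mask is 0 for a 256-glyph font; for a
 * 512-glyph font it is the bit that carries bit 8 of the glyph index. */
unsigned short pack_cell(unsigned char attr, unsigned short glyph,
                         unsigned short hi_font_mask)
{
    if (hi_font_mask == 0)
        return (unsigned short)((attr << 8) | (glyph & 0xff));

    /* Clear the borrowed attribute bit, then set it from glyph bit 8. */
    unsigned short cell = (unsigned short)((attr << 8) & ~hi_font_mask);
    if (glyph & 0x100)
        cell |= hi_font_mask;
    return (unsigned short)(cell | (glyph & 0xff));
}

/* Recover the glyph index from a packed cell. */
unsigned short cell_glyph(unsigned short cell, unsigned short hi_font_mask)
{
    unsigned short glyph = cell & 0xff;
    if (hi_font_mask && (cell & hi_font_mask))
        glyph |= 0x100;
    return glyph;
}
```

With a mask of 0x0800, glyph 0x141 and attribute 0x0f are packed as 0x0f41, whereas glyph 0x41 with the same attribute is packed as 0x0741; the attribute bit under the mask is thus given up in exchange for the larger font.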

While displaying a regular character, a few special cases are considered. One is insertion: in insert mode, the characters displayed after the current cursor position are moved by one position to make space for the current character. Line wrapping is taken care of by inserting a carriage return and a line feed character when the current column number reaches the screen boundary.

2.5.2 Processing control characters

Control characters are processed by the function do_con_trol(), which uses a trie-like structure implemented in code using switch statements. For example, suppose that the cursor is to be positioned at (5,6) in screen coordinates. The sequence of characters 27(ESC), 91([), 5, 59(;), 6, 72(H) can be used for this purpose and can be passed to this function via write() calls. When the left arrow key is pressed, either the sequence 27(ESC), 91([), 68(D) or the character 8(Backspace) can be used for positioning the cursor. There can be different ways of achieving the same result, and different applications may use different methods for the same purpose.

The do_con_write() function processes this sequence in the

following way: If it sees that the character is a control character, then it

passes it to the function do_con_trol() in the console driver, which

matches it with a set of switch statements, depending upon the current

escape sequence parser state, and also updates the state. Some states

will have actions associated with them, which will be nothing but calls to

functions like gotoxy(), insert_char(), delete_char(),

scr_up(), scr_down() etc.
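
The idea of a parser driven by switch statements can be illustrated with a cut-down example that recognizes only the cursor positioning sequence ESC [ row ; column H. The state names, the structure mini_vt and the function mini_vt_feed() are hypothetical and far simpler than the actual driver. The parameters are treated as one-based, as is conventional for ANSI terminals.

```c
/* Illustrative parser states (not the driver's own set). */
enum esc_state { ST_NORMAL, ST_ESC, ST_PARAMS };

struct mini_vt {
    enum esc_state state;
    unsigned int par[2];   /* numeric parameters collected so far */
    unsigned int npar;     /* index of the parameter being read */
    unsigned int x, y;     /* cursor column and row, zero-based */
};

/* Feed one byte. Returns 1 if it was consumed by the escape parser and
 * 0 if it is an ordinary character that should be displayed. */
int mini_vt_feed(struct mini_vt *vt, unsigned char c)
{
    switch (vt->state) {
    case ST_NORMAL:
        if (c == 27) {                 /* ESC starts a sequence */
            vt->state = ST_ESC;
            return 1;
        }
        return 0;
    case ST_ESC:
        if (c == '[') {                /* parameters follow */
            vt->par[0] = vt->par[1] = 0;
            vt->npar = 0;
            vt->state = ST_PARAMS;
        } else {
            vt->state = ST_NORMAL;     /* sequence not handled here */
        }
        return 1;
    case ST_PARAMS:
        if (c >= '0' && c <= '9') {
            vt->par[vt->npar] = vt->par[vt->npar] * 10 + (c - '0');
        } else if (c == ';' && vt->npar == 0) {
            vt->npar = 1;
        } else {
            if (c == 'H') {            /* action: position the cursor */
                vt->y = vt->par[0] ? vt->par[0] - 1 : 0;
                vt->x = vt->par[1] ? vt->par[1] - 1 : 0;
            }
            vt->state = ST_NORMAL;
        }
        return 1;
    }
    return 0;
}
```

Feeding the bytes of the sequence ESC [ 5 ; 6 H one at a time leaves the parser in its normal state with the cursor at row index 4 and column index 5; in the real driver the action would be a call to gotoxy().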


Chapter 3

DESIGN AND IMPLEMENTATION OF KERNEL PATCH

This chapter describes the design and implementation of the

kernel patch for supporting the multiglyph display and parse rules that

have been developed to support Indian languages. Various issues

concerning the display process as described in chapter 1 are addressed.

3.1 Multiglyph Support

A multiglyph mapping will be of the general form

C = G1 G2 G3 ... Gn

where C is a character code and the Gi's are glyph indices.

3.1.1 Display of Characters

The multiglyph support requires a one-to-many mapping between

the character codes and font glyph indices. For this, a structure similar

to uni_pagedir is introduced, called multimap_pagedir.

struct multimap_pagedir {

unsigned short ***multimap_pgdir[32];

.....

};


Here, one more level is added to the tree structure (Refer section

2.3.4) in the element multimap_pgdir. The traversing and insertion

method is similar to that of uni_pgdir, except that an array of unsigned

short will be inserted instead of a single unsigned short value as in

uni_pagedir.

A function acm_to_multiple() is provided to return the glyph

indices for a given character code. The call to the function

conv_uni_to_pc() in the display pipeline is replaced by a call to

acm_to_multiple() , and the glyph indices that are output from the

function are written into the screen buffer after necessary attribute

packing.

The ioctl function PIO_MULTIMAP is provided for loading the multimap tables into the kernel space. The call to this function requires a structure of the type multimapdesc to be passed, defined as given below:

struct multimapdesc{

unsigned short num_entries;

struct multimappair* entries;

};

num_entries gives the number of multimap rules

entries is an array of multimappair structures defined as follows:

struct multimappair {

unsigned short unicode;

unsigned short *fontpos;

};


unicode gives the Unicode value.

fontpos is an array of glyph indices to which unicode is mapped.

The ioctl function code PIO_MULTIMAPCLR is provided for clearing the

multimap rules.

Callbacks are added in vt.c for the ioctl functions, which will

load the mapping tables into the multimap_pagedir structure [10],[11].

3.1.2 Deletion & Backspacing

One issue to be taken care of while editing is deletion and backspacing. When a multiglyph character is erased using the backspace key or the delete key, all the glyphs corresponding to the character should be erased.

There are two arrays which assist in the process of appropriate

deletion of glyphs. These are declared as follows:

static unsigned char del_bitmap[64];

static unsigned char bs_bitmap[64];

Each of these arrays is 64 bytes, i.e., 64 * 8 = 512 bits long, and each bit gives information about one of the 512 glyphs in the font. Functions are provided to get / set the bit corresponding to a glyph index. These arrays are initialized when the multimap is loaded, using the algorithm ConstructBSandDELBitmaps:

ALGORITHM ConstructBSandDELBitmaps
INPUT Multimap rules of the form C = G1 G2 G3 ... Gn
    where C is the character code and the Gi's are the
    glyph indices.
OUTPUT del_bitmap and bs_bitmap arrays initialized
HELPER FUNCTIONS
    set_bsbit(g) : Sets the bit corresponding to glyph index g in
        bs_bitmap
    set_delbit(g) : Sets the bit corresponding to glyph index g in
        del_bitmap
BEGIN
    FOR EACH multimap rule C = G1 G2 G3 ... Gn DO
    BEGIN
        FOR i:= 2 to n DO set_bsbit(Gi);
        FOR i:= 1 to n-1 DO set_delbit(Gi);
    END
END
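
A possible C rendering of the bitmaps and of the construction step is sketched below. The placement of bits within each byte (glyph index divided by 8 selects the byte, the remainder selects the bit) and the helper names other than set_bsbit, set_delbit and get_bsbit are assumptions made for this illustration.

```c
#include <string.h>

static unsigned char del_bitmap[64];
static unsigned char bs_bitmap[64];

/* Assumed layout: glyph / 8 selects the byte, glyph % 8 the bit. */
static void set_bit(unsigned char *map, unsigned int glyph)
{
    map[(glyph & 511) >> 3] |= (unsigned char)(1u << (glyph & 7));
}

static int get_bit(const unsigned char *map, unsigned int glyph)
{
    return (map[(glyph & 511) >> 3] >> (glyph & 7)) & 1;
}

void set_bsbit(unsigned int g)  { set_bit(bs_bitmap, g); }
void set_delbit(unsigned int g) { set_bit(del_bitmap, g); }
int  get_bsbit(unsigned int g)  { return get_bit(bs_bitmap, g); }
int  get_delbit(unsigned int g) { return get_bit(del_bitmap, g); }

/* Reset both bitmaps, e.g. before a new multimap is loaded. */
void clear_bitmaps(void)
{
    memset(bs_bitmap, 0, sizeof bs_bitmap);
    memset(del_bitmap, 0, sizeof del_bitmap);
}

/* Register one multimap rule C = G1 ... Gn, where glyphs[0] is G1. */
void add_rule_to_bitmaps(const unsigned short *glyphs, unsigned int n)
{
    for (unsigned int i = 1; i < n; i++)      /* G2 .. Gn */
        set_bsbit(glyphs[i]);
    for (unsigned int i = 0; i + 1 < n; i++)  /* G1 .. Gn-1 */
        set_delbit(glyphs[i]);
}
```

After a rule with the glyphs 300 301 302 is registered, the backspace bit is set for 301 and 302 only, and the delete bit for 300 and 301 only. A set backspace bit thus means "continue to the left", and a set delete bit means "continue to the right".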

Now suppose that the cursor is at the end of a multiglyph character and the user erases it using the backspace key. Different applications implement this in different ways. One common way is to write the sequence Backspace, Space, Backspace. This sequence is handled inside the do_con_write() function using the algorithm ProcessBackspace:

ALGORITHM ProcessBackspace

HELPER FUNCTIONS

get_bsbit(i) : Gets the bit in bs_bitmap

corresponding to i

readfrom_screenbuffer() : Returns the character at

cursor position , also decrements the cursor

position

BEGIN

Spacecount = 1;

WHILE get_bsbit(readfrom_screenbuffer()) == 1 DO

Spacecount ++;

END


Using this algorithm, the cursor is moved back by the appropriate number of columns, which is stored in Spacecount. When the space is displayed next, it is duplicated Spacecount times, thus erasing all the glyphs corresponding to the character. When the last backspace is processed, it is again applied Spacecount times, thus completing the erasing operation. The code of do_con_write() is modified to do this.
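
The counting step of ProcessBackspace can be sketched as a stand-alone function. It works on an array holding the glyph indices of the current line rather than on the real screen buffer, and the function name and the bit layout of the bitmap are assumptions made for this example.

```c
/* Return how many columns one backspace should cover. row holds the
 * glyph indices of the current line, cursor is the column just to the
 * right of the character being erased, and bs_bitmap has one bit per
 * glyph (512 bits, glyph / 8 selecting the byte). */
unsigned int backspace_width(const unsigned short *row, unsigned int cursor,
                             const unsigned char bs_bitmap[64])
{
    unsigned int count = 0;

    while (cursor > 0) {
        unsigned short g = row[--cursor];
        count++;
        /* Stop at the first glyph whose backspace bit is clear:
         * it is the leading glyph of the character. */
        if (!((bs_bitmap[(g & 511) >> 3] >> (g & 7)) & 1))
            break;
    }
    return count;
}
```

With the backspace bits of glyphs 301 and 302 set and the line 65 300 301 302 66, a backspace at column 4 covers three columns, whereas a backspace at column 5 or at column 1 covers only one.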

Deletion of a character (i.e., erasing the character to the right of the cursor position) is taken care of in a similar way, but using the del_bitmap array, and the function delete_char() is modified accordingly.

The design imposes some limitations on the glyph design. To illustrate this, consider the following multimap rules:

A = α β γ

B = β γ

Here, the first rule indicates that we have to backspace further after

getting glyph β, but the second rule says that we shouldn't, as it is at the

beginning of a map sequence. Thus, there is a conflict.

But, this is not a serious limitation, considering the following fact:

It is better to provide a margin of one or two pixels to the left and right of

a complete character, for aesthetic purpose. For a multiglyph character C

= G1 G2 G3 ... Gn, glyph G1 will have a left margin and Gn will have a right

margin, and all other glyphs won’t have margins as they have to join with

the glyphs to their right and left to form the complete character. In the

above-specified rules, the glyph β is an exception to this requirement. In

the first rule, it will not have margins, but in the second rule, β will have

a left margin. This requires that the aesthetic issues be addressed at the

time of font design, so that such multiglyph maps never occur.
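
A set of multimap rules could be checked for this kind of conflict before it is loaded. The following sketch is one hypothetical way of doing so and is not part of the utilities described in this report: it reports a conflict when some glyph begins one rule and also occurs at a later position in a rule. The analogous check for del_bitmap (a glyph that ends one rule and also occurs earlier in a rule) is omitted.

```c
/* One multimap rule C = G1 ... Gn (the character code is not needed). */
struct mm_rule {
    const unsigned short *glyphs;
    unsigned int n;
};

/* Return 1 if some glyph starts one rule and also appears at a later
 * position in a rule, which makes its backspace bit ambiguous. */
int has_bs_conflict(const struct mm_rule *rules, unsigned int nrules)
{
    unsigned char first[64] = {0}, later[64] = {0};

    for (unsigned int r = 0; r < nrules; r++) {
        for (unsigned int i = 0; i < rules[r].n; i++) {
            unsigned int g = rules[r].glyphs[i] & 511;
            unsigned char *map = (i == 0) ? first : later;
            map[g >> 3] |= (unsigned char)(1u << (g & 7));
        }
    }
    for (unsigned int b = 0; b < 64; b++)
        if (first[b] & later[b])
            return 1;
    return 0;
}
```

For the rules A = α β γ and B = β γ given above, the glyph β is found in both sets and the function returns 1.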


3.1.3 Inserting a Character

When a character is inserted at some position, the kernel receives an escape sequence that results in a call to the function csi_at(). At present, this function moves the characters to the right by one position to make space for the character to be inserted. For a multiglyph character, however, the shift should be more than one position, and the exact number is known only when the character is about to be displayed. So, a flag process_insertion is introduced and is set in this function, and code is inserted in do_con_write() to check this flag and do the insertion of the additional glyphs.

3.1.4 Processing Cursor Positioning

To address the issue in section 1.4, the gotoxy() function is

modified. Suppose that a request comes to position the cursor at

coordinate <x, y>. The row numbers are not affected by the multiglyph

patch, but the column positioning needs to be modified. The

modifications in gotoxy() refer to the del and bs bitmaps and the

parserules (described in section 3.2 ) to determine the actual column

number that is required.

3.2 Parserule Support

A parserule will be of the general form

[G1-1 G1-2 G1-3 ... G1-m] C = G2-1 G2-2 G2-3 ... G2-n

where the Gi-j's are glyph indices and C is a character code.

3.2.1 The Forward and Reverse DFAs

The parse rules are used to construct two deterministic finite

automata (DFA) in the kernel space. They are called forward_dfa and

reverse_dfa and are of type dfa_node.


struct dfa_node {

struct dfa_transition* transition_list;

unsigned short* data;

unsigned int datalength;

};

data is the output stored at the node if the node corresponds to the end of a complete match (see the procedure UpdateAutomata below). For the forward DFA this is the Right Hand Side (RHS) of the rule; for the reverse DFA it is the Left Hand Side (LHS).

datalength gives the length of data.

transition_list is a linked list of transitions of type dfa_transition, defined below:

struct dfa_transition {

unsigned short trigger;

struct dfa_node* destination;

struct dfa_transition* next_trans;

};

trigger is the label of the transition.

destination is destination state of the transition.

next_trans is the next element in the transition list.

The DFAs are constructed by the algorithm ConstructDFA:

ALGORITHM ConstructDFA;
INPUT : The set of parserules
OUTPUT: The forward and reverse DFAs
BEGIN
    FOR EACH parserule of the form
        [G1-1 G1-2 G1-3 ... G1-m] C = G2-1 G2-2 G2-3 ... G2-n DO
    BEGIN
        UpdateAutomata(ForwardAutomata,
            G1-1 G1-2 G1-3 ... G1-m C, G2-1 G2-2 G2-3 ... G2-n);
        UpdateAutomata(ReverseAutomata,
            G2-1 G2-2 G2-3 ... G2-n, G1-1 G1-2 G1-3 ... G1-m C);
    END
END

PROCEDURE UpdateAutomata(Automata , LHS , RHS )

INPUT: Automata : A DFA of type dfa_node

LHS, RHS : Array of unsigned shorts, the LHS and

RHS of rules.

BEGIN

CurrentNode = Start Node of Automata;

FOR i:= length(LHS) DOWNTO 1 DO

BEGIN

NextNode = Transition from CurrentNode on symbol

LHS[i];

IF NextNode == NULL THEN

BEGIN

Create NextNode;

Add a transition from Current Node to

NextNode on LHS[i];

(This will require adding an entry to

CurrentNode.transition_list)

END

CurrentNode = NextNode;

END

CurrentNode.data = RHS;

CurrentNode.datalength = length(RHS);

END
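
The procedure UpdateAutomata can be sketched in C using the two structures defined above. The sketch is a user-space illustration only: it uses malloc() and calloc() instead of kernel allocators, the helper dfa_step() is a hypothetical name, and the RHS array is assumed to be kept alive by the caller.

```c
#include <stdlib.h>

struct dfa_node;

struct dfa_transition {
    unsigned short trigger;
    struct dfa_node *destination;
    struct dfa_transition *next_trans;
};

struct dfa_node {
    struct dfa_transition *transition_list;
    const unsigned short *data;
    unsigned int datalength;
};

/* Follow the transition labelled sym, or return NULL if there is none. */
struct dfa_node *dfa_step(const struct dfa_node *node, unsigned short sym)
{
    for (struct dfa_transition *t = node->transition_list; t; t = t->next_trans)
        if (t->trigger == sym)
            return t->destination;
    return NULL;
}

/* Add one rule: walk lhs from its last symbol to its first, creating
 * nodes as needed, and store rhs at the final node.
 * Returns 0 on success, -1 if memory runs out. */
int update_automata(struct dfa_node *start,
                    const unsigned short *lhs, unsigned int lhs_len,
                    const unsigned short *rhs, unsigned int rhs_len)
{
    struct dfa_node *cur = start;

    for (unsigned int i = lhs_len; i-- > 0; ) {
        struct dfa_node *next = dfa_step(cur, lhs[i]);
        if (next == NULL) {
            struct dfa_transition *t = malloc(sizeof *t);
            next = calloc(1, sizeof *next);   /* empty node */
            if (t == NULL || next == NULL) {
                free(t);
                free(next);
                return -1;
            }
            t->trigger = lhs[i];
            t->destination = next;
            t->next_trans = cur->transition_list;  /* push on the list */
            cur->transition_list = t;
        }
        cur = next;
    }
    cur->data = rhs;            /* caller keeps rhs alive */
    cur->datalength = rhs_len;
    return 0;
}
```

Because the LHS is walked from its last symbol to its first, rules that end in the same symbols share the initial part of their path from the start node, which is what allows matching to proceed leftwards from the cursor.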


The ioctl function PIO_PARSERULEADD is provided for loading the

parserules. The input to this function is a structure of type

parseruledesc, which is defined as follows:

struct parseruledesc{

unsigned short num_rules;

struct parserule* parserules;

};

num_rules gives the number of parserules and parserules gives the

actual parserules of type parserule defined below:

struct parserule

{

unsigned short* rule_lhs;

unsigned short* rule_rhs;

unsigned int rhs_length;

unsigned int lhs_length;

};

rule_lhs and rule_rhs are the Left Hand Side (LHS) and Right Hand

Side (RHS) of a parserule respectively and lhs_length and rhs_length

are their lengths.

The ioctl function PIO_PARSERULECLR can be used to clear the

parserules. Functions are added as callbacks to the ioctl calls, which

will construct the DFA from the input structure.

As an illustration, the forward and reverse DFAs for a set of four parse rules are given in Figure 3.1.

α β A → a b c
β A → d c
γ A → e c
α β B → a d c

Figure 3.1 A set of parserules and (a) the forward DFA and (b) the reverse DFA produced from these parserules using algorithm ConstructDFA

3.2.2 Forward Parserule Matching

Forward parse rule matching is used while normal characters are being displayed. Suppose that there is a parse rule α β A = a b c, and assume that the glyphs α β are already displayed. When the character code corresponding to A is received, forward DFA matching is started. Provided the rules have been loaded, the traversal goes in the order A → β → α, matching successive characters to the left of the cursor; when it finally encounters a non-matching character, the data element in the last matching DFA node (a b c) is returned.

Meanwhile, a count of the number of characters traversed backwards is kept. Finally, these characters, which form the LHS of the matched parserule, are erased, and the glyphs in the returned data element, which form the RHS of the rule, are written at that position.

The longest match is always taken. The necessary code for this is added to the do_con_write() function.
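
The longest-match behaviour can be illustrated without the DFA. The sketch below deliberately swaps in a simpler technique, a linear scan over a table of rules, in place of the DFA traversal used in the kernel patch; it gives the same answer for small rule sets. The structure and function names are hypothetical.

```c
struct rule {
    const unsigned short *lhs;   /* context glyphs followed by the code C */
    unsigned int lhs_len;
    const unsigned short *rhs;   /* replacement glyphs */
    unsigned int rhs_len;
};

/* screen[0..cursor-1] are the glyphs to the left of the cursor and c is
 * the incoming character code. Returns the index of the rule with the
 * longest matching LHS, or -1 if no rule matches. */
int longest_forward_match(const struct rule *rules, unsigned int nrules,
                          const unsigned short *screen, unsigned int cursor,
                          unsigned short c)
{
    int best = -1;
    unsigned int best_len = 0;

    for (unsigned int r = 0; r < nrules; r++) {
        unsigned int len = rules[r].lhs_len;
        /* The last LHS symbol must be c, and the context must fit. */
        if (len == 0 || len - 1 > cursor || rules[r].lhs[len - 1] != c)
            continue;
        unsigned int ctx = len - 1, k = 0;   /* number of context glyphs */
        while (k < ctx && rules[r].lhs[ctx - 1 - k] == screen[cursor - 1 - k])
            k++;
        if (k == ctx && len > best_len) {
            best = (int)r;
            best_len = len;
        }
    }
    return best;
}
```

With the four rules of Figure 3.1 and the glyphs α β on screen, the character A selects the rule α β A → a b c rather than the shorter β A → d c. The caller would then erase lhs_len - 1 glyphs to the left of the cursor and write the RHS in their place.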

3.2.3 Reverse Parserule Matching

Reverse parserule matching is used when the backspace key is pressed. Suppose that the cursor is positioned to the right of the glyphs a b c, and the backspace key is pressed. Reverse matching is started, and it matches in the order c → b → a. When a non-matching glyph is finally found, it returns the data element of the longest matched rule (if any), which here is α β A. The glyphs a b c are then erased, and the glyphs α β and the character A are displayed. While the glyphs are displayed, forward matching is suspended; otherwise it would produce the previous glyphs, a b c, again. When A is to be displayed, forward matching is enabled but limited to a single character, A itself. This is because A may have a multimap defined in the form of a parse rule, such as A = x y, which should be matched and processed. Matching more than one character is not allowed, because the complete rule would then match and return a b c again.

Figure 3.2 Console I/O flow diagram in the new design (flowchart: keycodes from the keyboard are translated to character codes through the keymap file; control characters other than backspace get the normal processing, with the cursor positioning function calculating the exact cursor position by referring to the parserules and multiglyph mappings; a backspace triggers reverse matching of parserules and, if no match is found, bs_bitmap and del_bitmap are consulted to delete the required number of glyphs; a normal character triggers forward matching of parserules, then a check of the multiglyph mapping tables, and finally display through the screen font map)

The code implementing this is added to do_con_write(). Appropriate modifications are also made to the behavior of the insertion flag, to the line-wrapping information and to the cursor positioning functions. A complete flow diagram of the console I/O process in the new design is given in Figure 3.2.

3.3 Utilities for Loading Multimaps and Parserules

Two utilities have been developed to enable the user to load the

multimap and parserules into the kernel space. This section describes

the design and usage of these applications.

3.3.1 Loadmultimap

The loadmultimap utility can be used to load a multimap file into the kernel space. The usage of this utility is

loadmultimap <multimap filename>

A multimap file should be a text file where each line is of the form

C G1 G2 G3 ... Gn

where C is the character code in the range [0,256) and the Gi's are glyph indices in the range [0,512). Both can be given in decimal or in hexadecimal (prefixed with 0x). This line represents the multimap

C = G1 G2 G3 ... Gn

Help options and man pages are provided for this utility to enable

ease of use.

The utility reads the file, parses it, constructs the structure

multimapdesc, and passes it to the kernel through the ioctl call

PIO_MULTIMAP.
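
The parsing of one line of a multimap file can be sketched as follows. This is not the code of loadmultimap; the function name is hypothetical. The sketch relies on strtoul() with base 0, which accepts both decimal numbers and hexadecimal numbers prefixed with 0x. One consequence of this choice is that a number with a leading 0 is read as octal.

```c
#include <stdlib.h>

/* Parse "C G1 G2 ... Gn" (decimal or 0x-prefixed hexadecimal).
 * On success stores C in *code and the glyph indices in glyphs[], and
 * returns n (at least 1). Returns -1 on a malformed or out-of-range
 * line, or if more than max_glyphs indices are present. */
int parse_multimap_line(const char *line, unsigned short *code,
                        unsigned short *glyphs, int max_glyphs)
{
    char *end;
    unsigned long v = strtoul(line, &end, 0);   /* base 0: accepts 0x.. */
    int n = 0;

    if (end == line || v >= 256)                /* C must be in [0,256) */
        return -1;
    *code = (unsigned short)v;

    for (;;) {
        const char *p = end;
        v = strtoul(p, &end, 0);
        if (end == p)                           /* no further number */
            break;
        if (v >= 512 || n == max_glyphs)        /* Gi must be in [0,512) */
            return -1;
        glyphs[n++] = (unsigned short)v;
    }
    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;
    if (*end != '\0' || n == 0)       /* trailing garbage or no glyphs */
        return -1;
    return n;
}
```

For the line "65 0x12C 301", the function returns 2 with the character code 65 and the glyph indices 300 and 301. The parsed rules would then be collected into a multimapdesc structure before the ioctl call.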

The call “loadmultimap -d” will clear the multimaps in the kernel. This is done by calling the ioctl function PIO_MULTIMAPCLR.


3.3.2 Loadparserules

The loadparserules utility can be used to load a parserule file into the kernel space. The usage of this utility is

loadparserules <parserule filename>

A parserule file should be a text file where each line is of the form

[G1-1 G1-2 G1-3 ... G1-m] C = G2-1 G2-2 G2-3 ... G2-n

where the Gi-j's are glyph indices in the range [0,512) and C is a character code in the range [0,256). Both can be given in decimal or in hexadecimal.

The utility reads the file, parses it, constructs the parseruledesc

structure and passes it to the kernel through the ioctl call

PIO_PARSERULEADD.

The call “loadparserules -d” will clear the parserules in the kernel. This is done by calling the ioctl function PIO_PARSERULECLR.

Again, man pages and help options are provided for the utility.


Chapter 4

LOCALIZATION OF APPLICATION PROGRAMS

The localization support provided in the kernel can be inherited by any application running on the console. Still, in order to make full use of the support, applications need to be modified so that their user interfaces, messages, etc. also appear in the local language. The Pine mailer, Pico editor and gcc compiler have been localized into the Malayalam language, and this chapter describes the details of the implementation.

4.1 Localization Using Gettext

The main task in localizing an application is the translation of

strings given as input to printf() or related functions. An easy method

of translation is to put all strings used in the application as macro

definitions in some header file, and include the header file wherever they

are used.

For example, consider the following part of a file app.c:

app.c:
    .....
    printf("This is a test message\n");
    printf("This is another test\n");
    .....

This can be rewritten as

app.c:

#include "mystrings.h"

.....

printf(STR0001);

printf(STR0002);

mystrings.h:

#ifndef _MYSTRINGS_H_

#define _MYSTRINGS_H_

#define STR0001 "This is a test message\n"

#define STR0002 "This is another test\n"

#endif

For supporting a new language we have to change file mystrings.h to

replace the English strings with equivalent strings in the language.

Even though this approach is simple, there are various problems

associated with it: For each language, we will need to recompile the

application by including the language specific header file (may be in

different names) and separate executables will be produced for each

language.

Another option is to do the translation at run time. The file

mystrings.h can be a kind of configuration file that can be sent with

the executable. The printf() functions can be something like this:

printf(translate("This is a test message\n"));


Where the translate() function can check an environment variable to

determine the current language, and accordingly use a particular

configuration file to get the translated string.
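
A translate() function of this kind might look as follows. Everything in the sketch is hypothetical: the environment variable name APP_LANG, the language code "xx" and the built-in table, which stands in for a configuration file read at run time.

```c
#include <stdlib.h>
#include <string.h>

struct catalog_entry {
    const char *msgid;    /* original string */
    const char *msgstr;   /* translated string */
};

/* Stand-in for a per-language configuration file (demo data only). */
static const struct catalog_entry demo_catalog[] = {
    { "This is a test message\n", "[xx] This is a test message\n" },
    { "This is another test\n",   "[xx] This is another test\n" },
};

/* Translate msg into the language lang, falling back to msg itself
 * when the language or the string is unknown. */
const char *translate_for(const char *lang, const char *msg)
{
    if (lang == NULL || strcmp(lang, "xx") != 0)
        return msg;
    for (size_t i = 0; i < sizeof demo_catalog / sizeof demo_catalog[0]; i++)
        if (strcmp(demo_catalog[i].msgid, msg) == 0)
            return demo_catalog[i].msgstr;
    return msg;
}

/* Pick the language from the (hypothetical) APP_LANG variable. */
const char *translate(const char *msg)
{
    return translate_for(getenv("APP_LANG"), msg);
}
```

Since the original string is returned whenever no translation is found, the application continues to work, in English, even if the configuration file for a language is missing or incomplete.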

The GNU gettext [12] is a very useful utility for doing the

translation of strings in applications and it follows the latter approach.

For using this utility, the strings to be printed should be translated via

the gettext function, which comes with the gettext library. The string

table file has a specific format called portable object (po) format. The

portable object file should be created for any language of interest, and

then it should be compiled using the msgfmt application to produce a

binary file in machine object (mo) format, which needs to be supplied with

the application package.

Inside the application, the bindtextdomain() and textdomain() functions can be used to bind a specific mo filename and a specific path. For example, the gcc application can bind to gcc.mo in the path /usr/share/locale. At run time, the environment variable LANG is checked and the machine object file is taken from the path $LANG/LC_MESSAGES relative to the bound path.

This way, an application need not be recompiled for every new

language - the machine object file only needs to be supplied for

supporting any new language.
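
The calls involved are few. The sketch below shows a typical initialization followed by one translated message; the domain name "demoapp" is a placeholder, and on systems where gettext is not part of the C library the program may have to be linked with a separate libintl.

```c
#include <libintl.h>
#include <locale.h>

/* "demoapp" is a placeholder domain name used only in this example. */
const char *demo_message(void)
{
    setlocale(LC_ALL, "");                           /* honour LANG etc. */
    bindtextdomain("demoapp", "/usr/share/locale");  /* where .mo files live */
    textdomain("demoapp");                           /* default domain */
    return gettext("This is a test message\n");
}
```

If no demoapp.mo is installed for the current language, gettext() returns the original English string, so the application remains usable without any catalog.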

4.2 Applications Modified

The applications that are modified as part of this project are the

Pine mailer, Pico editor and gcc compiler. The applications are ported

into the Malayalam language.

The menu and other user interfaces and also many of the error

messages of Pine (figure 4.1) and Pico (figure 4.2) have been translated.


Figure 4.1 Screen Shot of Pine application in Malayalam


Figure 4.2 Screen Shot of Malayalam Pico application running in Tamil mode (Note that the menus are transliterated).


Figure 4.3 Screen shot of a compilation session with Malayalam GCC.

Figure 4.4 Screen shot of Malayalam Shell

Since Pine and Pico do not currently support gettext, their source code
was modified directly to display the Malayalam strings.

The gcc compiler supports gettext, so machine object files were prepared
for the Malayalam language. Most of the commonly encountered error
messages and warnings have been translated. The programmer can also
write comments and string constants in the local language. Figure 4.3
shows a screen shot of a compilation session with gcc: a C program is
displayed using the ‘cat’ command and then compiled with gcc.

The emacs editor has been modified, and a printer utility called
iscii2ps has been developed, as part of the effort to Indianise the
X-Windows environment for Linux [3]. Since the modifications to emacs
are based on ISCII, they work in the console environment as well. The
user interface and most of the messages of emacs have been translated
into the Malayalam language. The print utility can also be used in the
console environment.

A prototype shell in the Malayalam language, called amjvv (MASH), was
developed. It uses a user-configurable command map to translate
Malayalam commands into the equivalent bash shell commands. Both English
and Malayalam commands can be executed. Figure 4.4 shows a screen shot
of a shell session.
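
The command-map idea can be illustrated with a short sketch. This is not the MASH implementation: the function name, the map format and the transliterated Malayalam command names used as keys are all hypothetical, chosen only to show how a user-configurable map lets Malayalam and English commands coexist.

```python
import shlex

# Hypothetical user-configurable map: Malayalam command name -> bash command.
# The keys are illustrative transliterations, not taken from MASH.
COMMAND_MAP = {
    "pattika": "ls",
    "kaanikkuka": "cat",
}


def to_bash(line, command_map=COMMAND_MAP):
    """Translate the command word of `line`; leave the arguments untouched."""
    words = shlex.split(line)
    if not words:
        return ""
    # Unmapped (for example English) commands pass through unchanged.
    words[0] = command_map.get(words[0], words[0])
    return " ".join(shlex.quote(word) for word in words)
```

Only the first word is looked up, so an English command such as cat is handed to bash as typed, while a mapped Malayalam command is replaced by its equivalent before execution.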

Chapter 5

CONCLUSION

5.1 Project Results

The kernel patch has been developed and tested extensively. Versions of
the patch are available for Linux kernels 2.2.5, 2.2.14, 2.2.16, 2.2.17
and 2.4.0-test7. Test programs were written to exercise the parse rules
and multimap rules, and all tests pass.

As new kernel versions are released, the patch needs to be integrated
into the main Linux source tree so that separate patches need not be
written for each future release. To this end, a discussion has been
initiated with the Linux kernel development group.

5.2 Publication

The paper entitled “Indian Language Support for the Linux Operating
System”, authored by J Patricia, K Ratheesh, V S Shenoi, G Sreepriya,
Dr. Timothy A Gonsalves and Dr. Hema A Murthy, has been accepted for the
International Symposium on Information Technology, People’s Development
and Culture, to be held at JK Institute, Allahabad, in February 2001.

5.3 Website

A web site has been set up to provide a technical overview, usage
instructions and updates about the project. The URL of the site is
http://www.tenet.res.in/Indlinux. The kernel patch, utilities, man
pages, documentation, sample files and installation scripts have been
packaged and posted on the web site.

5.4 Observations

Although the kernel patch has been developed with Indian languages in
mind, nothing in the approach prevents its use for other languages.
Anyone who wants to enable the Linux console for a language that has
wide glyphs or complex character clusters can use this patch
effectively. Supporting a new language involves the following tasks:

• Develop a PC Screen Font file for the language (up to 512 glyphs).

• Develop a multimap file for the language if the font has multi-glyph characters.

• Develop a parserule file for the language.

• Develop a keymap file if required.
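
The 512-glyph ceiling in the first task comes from the PC Screen Font format itself. As a rough illustration, the sketch below reads the four-byte header of a version 1 PSF file (magic bytes 0x36 0x04, a mode byte and the glyph height). The function name is hypothetical, and the newer PSF2 variant is not handled.

```python
PSF1_MAGIC = b"\x36\x04"
PSF1_MODE512 = 0x01  # set when the font holds 512 glyphs instead of 256


def psf1_info(header):
    """Return (glyph_count, bytes_per_glyph) from a PSF1 header."""
    if len(header) < 4 or header[:2] != PSF1_MAGIC:
        raise ValueError("not a PSF1 font")
    mode, charsize = header[2], header[3]
    glyph_count = 512 if mode & PSF1_MODE512 else 256
    # PSF1 glyphs are 8 pixels wide, so each scan line takes one byte.
    return glyph_count, charsize
```

A font file prepared for a new language can be checked with psf1_info(open(path, "rb").read(4)) before it is loaded on the console.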

5.5 Future Enhancements

• Utilities can be provided to convert fonts in other formats to PSF
format and to generate multimap files automatically. Parse rule files
can also be generated with assistance from the user. This will further
simplify the task of localization.

• The number of glyphs supported may have to be increased for some
languages (e.g. Japanese); the format of the display buffer and the
attribute packing will have to be modified in that case.

• Cursor positioning operations can be optimized by using separate
screen buffers for glyph codes and character codes.
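
To see why a larger glyph count forces a change in the display buffer format, consider the standard VGA text-mode cell, which is assumed here for illustration (the exact layout used by the patched kernel is not reproduced). Each cell is 16 bits, with the glyph code in the low byte and the attribute in the high byte. With a 512-glyph font, one attribute bit is borrowed as the ninth glyph bit, leaving no room for more glyphs. The function names below are hypothetical.

```python
HI_FONT_MASK = 0x0800  # attribute bit reused as the 9th glyph bit (512-glyph mode)


def pack_cell(glyph, attr):
    """Pack a glyph index (0..511) and an attribute byte into a 16-bit cell."""
    if not 0 <= glyph < 512:
        raise ValueError("only 512 glyphs fit in this cell format")
    cell = ((attr & 0xFF) << 8) & ~HI_FONT_MASK  # drop the borrowed bit
    cell |= glyph & 0xFF
    if glyph & 0x100:
        cell |= HI_FONT_MASK
    return cell


def unpack_glyph(cell):
    """Recover the glyph index from a 16-bit cell."""
    return (cell & 0xFF) | ((cell & HI_FONT_MASK) >> 3)
```

Supporting more than 512 glyphs under this scheme would mean either taking further bits from the attribute or widening the cell, which is the modification the second item refers to.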

BIBLIOGRAPHY

[1] Kenneth Keniston, Politics, Culture and Software, Economic and
Political Weekly, Vol. XXXIII, No. 3, pp. 105-110, January 17, 1998.

[2] Linux Online – Distributions and FTP sites: available at website

http://www.linux.org/dist/index.html.

[3] V S Shenoi, X-Windows based Indian language support for the Linux

operating system, MTech project thesis, IIT Madras, January 2001.

[4] Unicode Home Page: http://www.unicode.org.

[5] Brahmi Script - History and Description : Information available at

website: http://tied.narod.ru/project/script/brahm.html.

[6] Indian Standard, Indian Script Code for Information Interchange - ISCII,
Bureau of Indian Standards.

[7] Andries Brouwer, Linux Keyboard and Console HOWTO Documentation,
Version 2.8, 25 February 1998, available at
http://www.linux.com/howto/Keyboard-and-Console-HOWTO.html.

[8] Consolechars documentation on screen font file formats, documentation

included with ConsoleTools package, available at web site:

http://www.multimania.com/ydirson/en/lct/.

[9] Linux kernel source code, all versions are available at location

http://www.kernel.org/pub/linux/kernel/.

[10] Michael Beck, Harald Bohme, Mirko Dziadzka, Ulrich Kunitz,

Robert Magnus, Dirk Verworner. Linux Kernel Internals Second

Edition, 1998, Addison-Wesley. Chapters 1-4, 7, Appendices A,B.

[11] Alessandro Rubini, Linux Device Drivers, 1998, O’Reilly &
Associates Inc. Chapters 1-5.

[12] GNU gettext package documentation, available at website:

www.gnu.org/software/gettext/gettext.html.