Unicode Primer for the Uninitiated: A Guide to Unicode and Internationalization (i18n) ~ Character Encoding, Software Internationalization

Internationalization Articles, May 8th, 2008
Source: http://www.lingoport.com/unicode-primer-for-the-uninitiated

Among our friends and clients at Lingoport, we regularly see everything from partial confusion to complete lack of awareness of what Unicode is. So for the less- or under-informed, perhaps this article will help. The advent of Unicode is a key underpinning for global software applications and websites so that they can support worldwide language scripts. So it's a very important standard to be aware of, whether you work in localization, engineering or business management.

Firstly, Unicode is a character set standard used for displaying and processing language data in computer applications. The Unicode character set covers the entire world's characters, including letters, numbers, currencies, symbols and the like, and supports a number of character encodings to make that all happen. Before your eyes glaze over, let me explain what character encoding means. You have to remember that for a computer, all information is represented in zeros and ones (i.e. binary values). So if you think of the letter A in the ASCII standard of zeros and ones, it looks like this: 1000001. That is, a 1, then five zeros, then a 1, for a total of 7 bits. The number behind this binary representation (65, in decimal) is called A's code point, and this mapping of characters to zeros and ones is called the character encoding. In the early days of computing, unless you did something very special, ASCII (7 bits per character) was how your data got managed. The problem is that ASCII doesn't leave you enough zeros and ones to represent extended characters, like the accents and language-specific letters you find in European languages, and it certainly can't support the complex characters that make up the Chinese, Korean and Japanese languages. Extended Western characters need an 8-bit (single-byte) encoding, while those East Asian languages require 16-bit (double-byte) encodings. One important note on all of these single- and double-byte encodings is that they are supersets of 7-bit ASCII encoding, which means that English code points will always be the same regardless of the encoding.
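To make that concrete, here is a minimal Java sketch (Java fits, since the article returns to Java strings later) showing A's code point and its one-byte ASCII encoding; the class name is just for illustration:

    // A minimal sketch: the letter A, its code point, and its ASCII encoding.
    import java.nio.charset.StandardCharsets;

    public class AsciiDemo {
        public static void main(String[] args) {
            int codePoint = "A".codePointAt(0);                     // 65 in decimal
            System.out.println(Integer.toBinaryString(codePoint));  // 1000001 (7 bits)

            // ASCII needs exactly one byte per character.
            byte[] encoded = "A".getBytes(StandardCharsets.US_ASCII);
            System.out.println(encoded.length);                     // 1
        }
    }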

The Bad Old Days

In the early computing days, language-specific single- and double-byte encodings were developed to support various languages. That was very bad, as it meant that software developers needed to build a version of their application for every language they wanted to support that used a different encoding. You'd have the Japanese version, the Western European language version, the English-only version and so on. You'd end up with a horde of individual software code bases, each needing its own testing, updating and ongoing maintenance and support, which is very expensive, and pretty near impossible for businesses to realistically sustain without serious divergence among the various language versions over time. You don't see this problem very often for newly developed applications, but there are plenty of holdovers. We see it typically when a new client has turned over their source code to a particular country partner or marketing agent who was responsible for adapting the code to multiple languages. The worst case I saw was in 2004, when a particular client, who I will leave unmentioned, had a legacy product with 18 separate language versions and no longer had any real idea how functionality varied from language to language. That's no way to grow a corporate empire!

ISO Latin

A single-byte character set that we often see in applications is ISO Latin 1, which is represented in various encoding standards such as ISO-8859-1 for UNIX, Windows-1252 for Windows and MacRoman on guess what platform. This character set supports characters used in Western European languages such as French, Spanish, German, and U.K. English. Since each character requires only a single byte, this character set provides support for multiple languages while avoiding the work required to support either Unicode or a double-byte encoding. The trouble is that it still leaves out much of the world. For example, to support Eastern European languages you need to use a different character set, often referred to as Latin 2, which provides the characters those languages uniquely need. There are also separate character sets for Baltic languages, Turkish, Arabic, Hebrew, and on and on. When internationalizing software for the first time, companies will sometimes start by supporting just ISO Latin 1 if it meets their immediate marketing requirements, and deal with the more extensive work of supporting other languages later. The reason is that going beyond ISO Latin support will likely require major reworking of the encoding support in the database and in the functions, methods and classes within the source code, which means more time and more money, often cascading into later releases and foregone revenues. However, if a software company has truly global ambitions, it will need to take that plunge and provide Unicode support. I'll argue that if companies are supporting global customers, even without doing a bit of translation/localization for the interface, they still need to support Unicode so they can process their customers' global data.
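Here is a minimal Java sketch of that limitation: ISO Latin 1 handles French fine but silently destroys Polish letters that belong to Latin 2 (the sample words are just illustrative):

    // A minimal sketch: ISO-8859-1 covers Western European characters,
    // but anything outside that repertoire is lost on encoding.
    import java.nio.charset.StandardCharsets;

    public class Latin1Demo {
        public static void main(String[] args) {
            String french = "café";   // fits in ISO Latin 1
            String polish = "łódź";   // Polish needs Latin 2 (ISO-8859-2)

            byte[] ok = french.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(new String(ok, StandardCharsets.ISO_8859_1)); // café

            byte[] lossy = polish.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(new String(lossy, StandardCharsets.ISO_8859_1)); // ?ód? (ł and ź are lost)
        }
    }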

Unicode

We come back to Unicode, which, as mentioned above, is a character set created to enable support of any written language worldwide. Now you might find a language or two whose script lacks Unicode support, but such cases are becoming extremely isolated. For instance, Javanese, Loma, and Tai Viet are currently among the scripts not yet supported. Arcane, I suppose, until you need them. I remember a few years ago when we were developing a multilingual site that needed support for Khmer and Armenian, and we were thankful that Unicode had added their support just a few months prior. If you have a marketing requirement for your software to support Japanese or Chinese, think Unicode. That's because you will need to move to a double-byte encoding at the very least, and once you've gone through the trouble to do that, you might as well support Unicode and get the added benefit of support for all languages.

UTF-8

Once you've chosen to support Unicode, you must decide on the specific character encoding you want to use, which will depend on the application's requirements and technologies. UTF-8 is one of the commonly used character encodings defined within the Unicode Standard. It uses a single byte for each character unless it needs more, in which case it can expand up to 4 bytes. People sometimes call this a variable-width encoding, since the width of a character in bytes varies depending upon the character. The advantage of this encoding is that all English (ASCII) characters remain single bytes, saving data space. This is especially desirable for web content, since the underlying HTML markup remains in single-byte ASCII. In general, UNIX platforms are optimized for the UTF-8 character encoding. Concerning databases, where large amounts of application data are integral to the application, a developer may choose UTF-8 encoding to save space if most of the data in the database does not need translation and so can remain in English (which requires only a single byte in UTF-8). Note that some databases will not support UTF-8, specifically Microsoft's SQL Server.
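A minimal Java sketch of that variable width, using one sample character from each byte-length tier:

    // A minimal sketch of UTF-8's variable width: 1 byte for ASCII,
    // more bytes as code points grow.
    import java.nio.charset.StandardCharsets;

    public class Utf8Demo {
        public static void main(String[] args) {
            System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1 byte  (ASCII)
            System.out.println("é".getBytes(StandardCharsets.UTF_8).length);  // 2 bytes (Latin 1 range)
            System.out.println("中".getBytes(StandardCharsets.UTF_8).length); // 3 bytes (CJK)
            System.out.println("𝄞".getBytes(StandardCharsets.UTF_8).length);  // 4 bytes (beyond the 16-bit range)
        }
    }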

UTF-16

UTF-16 is another widely adopted encoding within the Unicode Standard. It assigns two bytes to each character whether you need them or not. So the letter A is 00000000 01000001, or nine zeros, a one, five more zeros, and a one. If a character needs more than 2 bytes, two 2-byte units are combined into a four-byte sequence (a surrogate pair); however, your software must be adapted to handle these four-byte combinations. Java and .NET internally process strings (text and messages) as UTF-16.
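A minimal Java sketch of those 16-bit units; the musical clef character is just an example of a code point that needs a surrogate pair:

    // A minimal sketch: UTF-16 uses one 16-bit unit for most characters
    // and a surrogate pair (4 bytes) for code points beyond U+FFFF.
    public class Utf16Demo {
        public static void main(String[] args) {
            System.out.println("A".length());   // 1 (one 16-bit unit)

            String clef = "𝄞";                  // U+1D11E, outside the 16-bit range
            System.out.println(clef.length());                          // 2 (a surrogate pair of chars)
            System.out.println(clef.codePointCount(0, clef.length()));  // 1 actual character
        }
    }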

For many applications, you can actually support multiple Unicode encodings, so that, for example, your data is stored in your database as UTF-8 but handled within your code as UTF-16, or vice versa. There are various reasons to do this, such as software limitations (different software components supporting different Unicode encodings), storage or performance advantages, etc. But whether that's a good idea is one of those "it depends" kinds of questions. Implementing it can be tricky, and clients pay us good money to solve this.
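In Java, for instance, that split comes almost for free, since strings are UTF-16 in memory and the encoding is chosen at the I/O boundary. A minimal sketch, with an invented sample string:

    // A minimal sketch: store as UTF-8, process as UTF-16.
    // Java strings are UTF-16 internally; conversion happens at the I/O boundary.
    import java.nio.charset.StandardCharsets;

    public class MixedEncodingDemo {
        public static void main(String[] args) {
            String inMemory = "Grüße, 世界";                            // UTF-16 inside the JVM
            byte[] stored = inMemory.getBytes(StandardCharsets.UTF_8); // what would go in the database
            String roundTripped = new String(stored, StandardCharsets.UTF_8);
            System.out.println(inMemory.equals(roundTripped));         // true: a lossless round trip
        }
    }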

Microsoft's SQL Server is a bit of a special case, in that it supports UCS-2, which is like UTF-16 but without the 4-byte characters (only the 16-bit characters are supported).

GB 18030

There's also a special-case character set, mandated by the Chinese government, for software intended for sale in China (PRC). This character set is GB 18030, and it is actually a superset of Unicode, supporting both simplified and traditional Chinese. Similarly to UTF-16, the GB 18030 character encoding allows 4 bytes per character to support characters beyond Unicode's "basic" (16-bit) range, and in practice supporting UTF-16 (or UTF-8) is considered an acceptable approach to supporting GB 18030 (the UCS-2 encoding just mentioned is not, however).
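Java ships a GB 18030 converter among its extended charsets, so you can compare it with UTF-8 directly; a minimal sketch, assuming the JDK's extended charsets are available:

    // A minimal sketch: encoding the same text as GB 18030 and as UTF-8.
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Gb18030Demo {
        public static void main(String[] args) {
            Charset gb18030 = Charset.forName("GB18030");
            String text = "中文";                                 // "Chinese language", simplified
            byte[] gb = text.getBytes(gb18030);                   // 2 bytes per character here
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);  // 3 bytes per character
            System.out.println(gb.length + " vs " + utf8.length); // 4 vs 6
            System.out.println(new String(gb, gb18030));          // round-trips cleanly
        }
    }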

Now, all of this considered, a converse question might be: what happens when you try to make your application support complex scripts that need Unicode, and the support isn't there? Depending upon your system, you get anything from garbled and meaningless gibberish, where data or messages become corrupted characters or weird square boxes, to an application crash that forces a restart. Not good.
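You can reproduce the gibberish flavor of that failure in a couple of lines of Java: encode text as UTF-8, then decode the bytes with the wrong (Latin 1) charset:

    // A minimal sketch of what goes wrong: decoding UTF-8 bytes
    // with the wrong charset produces classic mojibake.
    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            byte[] utf8Bytes = "résumé".getBytes(StandardCharsets.UTF_8);
            String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
            System.out.println(garbled); // prints rÃ©sumÃ©, the kind of gibberish users actually see
        }
    }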

If your application supports Unicode, you are ready to take on the world.
