Internationalization: An Introduction Tutorial from Character Encodings & Unicode.


Transcript of Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Page 1: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Internationalization: An Introduction Tutorial from

Character Encodings & Unicode

Page 2: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

License

This presentation and its associated materials are licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.5 License. You may use these materials without obtaining permission from the author. Any materials used or redistributed must contain this notice. [Derivative works may be permitted with permission of the author.] This work is copyright © 2008-2011 by Addison P. Phillips

Page 3: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Presenter and Presentation

• Addison Phillips
– Globalization Architect, Lab126

• This Presentation
– Part I of the Internationalization and Unicode Conference tutorial: “Internationalization: An Introduction”
– Character Encodings and Unicode

Page 4: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Who is this guy?

• Globalization Architect, Lab126

We make the technology behind the Kindle

• Chair, W3C Internationalization WG

Page 5: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Internationalization is:

• the design and development of a product that is enabled for target audiences that vary in culture, region, or language. [W3C]

• a fundamental architectural approach to software development

Page 6: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Mystic Numbering (M4C N7G)

Opinions differ on capitalization (C12N); choose from: i18N I18n I18n I18N

Very geeky; not very internationalized (I19G?)

I N T E R N A T I O N A L I Z A T I O N

I 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 N

I18N

Localization = L10N
Globalization = G11N
Canonicalization = C14N
Accessibility = A11Y
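The numbering rule above (first letter, count of the letters in between, last letter) can be sketched in a few lines. The function name `numeronym` is made up for this illustration.

```python
def numeronym(word: str) -> str:
    """First letter + number of letters in between + last letter, upper-cased."""
    if len(word) < 3:
        # Nothing to count between the first and last letter.
        return word.upper()
    return f"{word[0]}{len(word) - 2}{word[-1]}".upper()


for w in ("internationalization", "localization", "globalization",
          "canonicalization", "accessibility"):
    print(w, "->", numeronym(w))
```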

Page 7: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

CHARACTER ENCODINGS

The basics of text processing in software.

Page 8: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

The Biggest Source of Woe

“Character encodings consume more than 80% of my work day. They are the source of more mis-information and confusion than any other single thing. And developers aren’t getting any better educated.”

~Glen Perkins
Globalization Architect

Page 9: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

A lot of jargon

Real Jargon

Multibyte

Variable width

Wide character

Character encoding

Coded character set

Bidi or bidirectional

Glyph, character, code unit

Unicode

Potentially Bogus Jargon

kanji

double-byte language

extended ASCII

ANSI, OEM

encoding agnostic

Page 10: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

How the computer sees the world

“bits”: 010000010101101101101000

“byte” or “octet”: 01000001 (0x41)

code unit: a unit of physical storage and information interchange
• represent numbers
• come in various sizes (e.g. 7, 8, 16, 32, 64 bits)

how do we map text to the numbers used by computers?

Page 11: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

From text to bits

Glyphs
– A “glyph” is a screen unit of text: it’s a picture of what users think of as a character.
– A “grapheme” is a single visual unit of text.

Characters
– A “character” is a single logical unit of text.
– A “character set” is a set of characters.
– A “code point” is a number assigned to a character in a character set.
– A “coded character set” is a character set where each character has a code point.

Bytes
– A “character encoding form” maps a sequence of code points (“characters”) to a sequence of code units (such as bytes).
– A “code unit” is a single logical unit of storage.

… 0xC3 0x80 …

U+00C0

À

Page 12: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Coded Character Set
• Collection (repertoire) of characters, that is: a set.
• Organized so that each character has a unique numeric (typically integer) value (code point).
• Examples:
– Unicode
– ASCII (ANSI X3.4)
– ISO 646
– JIS X 0208
– Latin-1 (ISO 8859-1)

Character sets are often associated with a particular language or writing system.

Page 13: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Character Encoding Form

• Maps a sequence of code points (characters) to a sequence of code units (e.g. bytes).
– Some encoding forms use another code unit instead of the byte. For example, some encoding forms use a 16-bit, 32-bit, or 64-bit code unit.

U+00C0 → 0xC3 0x80

Often shortened as “character encoding”, “encoding form”, or, confusingly, “charset”
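A small sketch of the code point to code unit mapping, using Python's built-in codecs. The helper name `code_units` is hypothetical; the codec names are standard Python ones.

```python
def code_units(ch: str, encoding: str) -> str:
    """Return the encoded form of one character as space-separated hex bytes."""
    return ch.encode(encoding).hex(" ")


a_grave = "\u00C0"  # À
print(f"code point: U+{ord(a_grave):04X}")
for enc in ("utf-8", "utf-16-be", "iso-8859-1"):
    # Same character, same code point in Unicode, different bytes per encoding.
    print(enc, "->", code_units(a_grave, enc))
```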

Page 14: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

*(the most important slide in this presentation)

All text has a character encoding

When things go wrong, start by asking what the encoding is, what encoding you expected it to be, and whether the bytes match the encoding.

In memory, on disk, on the network, etc.
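One way to ask "do the bytes match the encoding?" is a strict decode. This is only a sketch: `matches_encoding` is a hypothetical helper, and a successful decode does not prove the guess is right, because single-byte encodings such as ISO 8859-1 accept any byte sequence.

```python
def matches_encoding(data: bytes, encoding: str) -> bool:
    """True if `data` decodes cleanly under `encoding` (necessary, not sufficient)."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


raw = "À la carte".encode("iso-8859-1")   # bytes that are really ISO 8859-1
print(matches_encoding(raw, "utf-8"))       # the 0xC0 byte is not valid UTF-8
print(matches_encoding(raw, "iso-8859-1"))
```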

Page 15: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Common Encoding Problems

Tofu: hollow boxes

Mojibake: garbage characters

Question Marks (conversion not supported)

Page 16: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

It can happen to anyone…

Page 17: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Tofu

• Can appear as either hollow boxes (empty glyph) or as question marks (Firefox, for example)

• Not usually a bug: it’s a display problem

• Can mask or masquerade as character corruption.

Page 18: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Mojibake

When Good Characters Go Bad

Page 19: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Sources of Mojibake

• View text using the wrong encoding

• Apply a transfer encoding and forget to remove it

• Convert to an encoding twice

• Convert to or from the wrong encoding

• Overzealous escaping
• Conversion to entities (“entitization”)
• Multiple conversions
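Two of these sources can be reproduced with Python's standard codecs: viewing UTF-8 bytes as cp1252, and then converting the result to UTF-8 a second time. The variable names are illustrative, and the final "repair" step only works when no bytes were lost along the way.

```python
text = "À"
utf8 = text.encode("utf-8")                 # the correct bytes: C3 80

# 1. View the text using the wrong encoding.
wrong_view = utf8.decode("cp1252")
print(wrong_view)

# 2. Convert to an encoding twice: the mojibake itself gets encoded.
double_encoded = wrong_view.encode("utf-8")
print(double_encoded.hex(" "))

# Undo the wrong view by reversing the mistaken step.
repaired = wrong_view.encode("cp1252").decode("utf-8")
print(repaired == text)
```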

Page 20: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Character Encoding Forms

Their theory, structure, and use

Page 21: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

EBCDIC

Page 22: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

ASCII
• 7 bits = 2^7 = 128 characters
• Enough for “U.S. English”

Page 23: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Latin-1 (ISO 8859-1)

ASCII for characters 0x00 through 0x7F

Accented letters and other symbols 0x80 through 0xFF

Page 24: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

One character—many character sets and many character encodings!

char | cp1252 | cp850
È | 0xC8 | 0xD4
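The same check can be run against Python's built-in `cp1252` and `cp850` codecs. It also shows what happens when the cp850 byte is read back as cp1252.

```python
e_grave = "È"
encoded = {codec: e_grave.encode(codec) for codec in ("cp1252", "cp850")}
for codec, data in encoded.items():
    print(codec, data.hex())

# Reading the cp850 byte with the cp1252 table yields a different character.
misread = encoded["cp850"].decode("cp1252")
print(misread)
```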

Page 25: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Windows Code Pages

Windows’s encodings (called “code pages”) are generally based on standard encodings, plus some additional characters.

Example: CP 1252 is based on ISO 8859-1, but includes 27 “extra” characters in the C1 control range (0x80-0x9F)

Page 26: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Code Page

Originally an IBM character encoding term.

IBM numbered their character sets with “CCSIDs” (coded character set ids) and numbered the corresponding character encoding forms as “code pages”.

Microsoft borrowed code pages to create PC-DOS.

Microsoft defines two kinds of code pages:
– “ANSI” code pages are the ones used by Windows GUI programs.
– “OEM” code pages are the ones used by command shell/command line programs.

Neither “ANSI” nor “OEM” refers to a particular encoding standard or standards body in this context.

Avoid the use of ANSI and OEM when referring to encodings.

Page 27: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Beyond Single Byte Encodings

• So far we’ve been looking at single-byte encodings: one byte per character
– 1 byte = 1 character (= 1 glyph?)
– 256 character maximum
– Good enough for most alphabetic languages

Some languages need more characters.

What about the “double-byte” languages?

Don’t those take two bytes per character?

丏丣並À

Page 28: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Methods of reaching beyond single-byte
• Escape sequences to select another character set
– Example: ISO 2022 uses escape sequences to select various encodings
• Use a larger code unit (“wide” character encoding)
– Example: IBM DBCS code pages or Unicode UTF-16
– 2^16 = 64K characters
– 2^32 = about 4.3 billion characters

• Use a variable-width encoding

Variable width encodings use different numbers of code units to represent different types of characters within the same encoding form.

Page 29: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Multibyte Encodings

One or more bytes per character
– 1 byte != 1 character
– May use 1, 2, 3, or 4 bytes per character -> maximum number of bytes per character varies by encoding form.

– May use shift or escape sequences

– May encode more than one character set

• Single-byte encodings are a special case of multibyte!

Multibyte Encoding: Any “variable-width” encoding that uses the byte as its code unit.

Page 30: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

JIS X 0213: A Coded Character Set whose common encoding forms are multibyte

JIS X 0213: 11,233 characters; two 94x94 character planes

Page 31: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

あ 1-4-2 (code point)

6 1-3-22 (code point)

Page 32: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Simple Multibyte Encoding Forms
• Specific byte ranges encoding characters that take more than one byte.
– A “lead byte”
– One or more “trailing bytes”
• Code point != code unit

あ 1-4-2 (code point) → 0x82 (lead byte) 0xA0 (trail byte)

A 1-3-33 (code point) → 0x41 (single byte)

Page 33: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Shift_JIS: A Multibyte Encoding

• In order to reach more characters, Shift_JIS characters start with a limited range of “lead bytes”

• These can be followed by a larger range of byte values (“trail byte”)

Page 34: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Shift_JIS

Page 35: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Shift_JIS
• Lead bytes can be trail byte values
• Trail bytes include ASCII values
• Trail bytes include special values such as 0x5C (“\”)

char *pos = strchr(mybuf, '@');  /* a byte-wise search can match a trail byte (0x40) instead of a real '@' */
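The trail-byte hazard is easy to reproduce with Python's built-in `shift_jis` codec. This is a sketch, not a statement about any particular C program: the katakana ソ encodes as 0x83 0x5C, so a byte-level search finds a backslash that is not in the text.

```python
text = "ソース"                      # contains no backslash character
data = text.encode("shift_jis")

print(data.hex(" "))
# Byte-level search: finds 0x5C, which here is only a trail byte.
print(b"\\" in data)
# Character-level search after decoding: no backslash.
print("\\" in data.decode("shift_jis"))
```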

Page 36: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

More Complex Multibyte Systems

• Stateful Encodings
– ex. IBM “MBCS” code pages [SI/SO shift between 1-byte and 2-byte characters]
– ISO 2022 [escape sequence changes character set being encoded]

Page 37: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Ad hoc Encodings

Page 38: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Transfer Encodings
• A transfer encoding syntax is a reversible transform of encoded data which may (or may not) include textual data represented in one or more character encoding schemes.
• Email headers
• URIs
• IDN (domain names)

Abcソース → =?UTF-8?B?QWJj44K944O844K5?= → Abcソース
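The email-header example can be reproduced with the Python standard library. This sketch builds the encoded word by hand with `base64` and reverses it with `email.header`; the variable names are illustrative.

```python
import base64
from email.header import decode_header, make_header

text = "Abcソース"

# Character encoding first (UTF-8), then the transfer encoding (Base64).
payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
encoded_word = f"=?UTF-8?B?{payload}?="
print(encoded_word)

# Reverse both layers: remove the transfer encoding, then decode the bytes.
decoded = str(make_header(decode_header(encoded_word)))
print(decoded)
```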

Page 39: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Encoding Conversion

• Document formats often require a single character encoding be used for all parts of the document.

[Diagram: Templates (ISO 8859-1), Content (UTF-8), and Data (Shift_JIS) feed a Process that produces Output (HTML, XML, etc.)]

When data is merged, the same encoding form must be used or some of the data will be “mojibake”.

Common Encoding Conversion Tools and Libraries

• iconv (Unix)

• ICU (C, C++, Java)

• perl Encode

• Java (native2ascii, IO/NIO)

• (etc.)

Page 40: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Encoding Conversion as Filter

Encoding conversion acts as a “filter”
– Replacement characters (“question marks”) replace characters from the source character set that are not present in the target character set.

[Diagram: input in ISO 8859-1 (ÀàС£ »èç), UTF-8 (детски ينس 文字), and Shift_JIS (文字化け) passes through a conversion to ISO 8859-1. The ISO 8859-1 and UTF-8 outputs shown both contain question marks where characters were missing from ISO 8859-1: ÀàС£ ?????? »èç?????????]

? (0x3F) is the replacement character for ISO 8859-1
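The filter effect can be seen with Python's `errors="replace"` handler, which substitutes `?` on encoding. The sample string is made up for the illustration.

```python
mixed = "Àà детски 文字"

# Anything outside the ISO 8859-1 repertoire is replaced by 0x3F ("?").
filtered = mixed.encode("iso-8859-1", errors="replace")
print(filtered)

# The loss is permanent: converting back does not restore the characters.
round_trip = filtered.decode("iso-8859-1")
print(round_trip)
```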

Page 41: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Too Many Fish in the Sea

• Need for more converters and conversion maps

• Difficulty of passing, storing, and processing data in multiple encodings

• Too many character sets…

…leads to what we call “code page hell”

Page 42: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode / ISO-10646

Page 43: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

The Idea Behind Unicode

• Fights mojibake because:
– characters are from the common repertoire;
– characters are encoded according to one of the encoding forms;
– characters are interpreted with Unicode semantics;
– unknown characters are not corrupted

• Basic Principles
– Universal repertoire
– Logical order
– Efficiency
– Unification
– Characters, not glyphs
– Dynamic composition
– Semantics
– Stability
– Plain Text
– Convertibility

Page 44: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode (ISO 10646)

Unicode is a character set that supports all of the world’s languages and writing systems.

Code space runs from U+0000 to U+10FFFF (about 1.1 million code points)

Unicode and ISO 10646 are maintained in sync. Unicode is maintained by an industry consortium. ISO 10646 is maintained by the ISO.

Page 45: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

What are “planes”?

Planes divide Unicode into equal-sized regions of code points.

17 planes (0 through 0x10), each with 65,536 code points.

Plane 0 is called the Basic Multilingual Plane (BMP). > 99% of text in the wild lives in the BMP.

Planes 1 through 0x10 are called supplementary planes.
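Because each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch (the function name `plane` is made up here):

```python
PLANE_SIZE = 0x10000          # code points per plane
PLANE_COUNT = 17              # planes 0 through 0x10


def plane(code_point: int) -> int:
    """Return the plane number (0..16) that a code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    return code_point >> 16


for cp in (0x0041, 0x1251, 0x10338, 0x10FFFF):
    kind = "BMP" if plane(cp) == 0 else "supplementary"
    print(f"U+{cp:04X}: plane {plane(cp)} ({kind})")
```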

Page 46: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode as the Universal Character Set

• An organized collection of characters.

• Each character has a code point, aka Unicode Scalar Value (USV)

• U+0041 <= hex notation

Page 47: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode Character Database

• code point
• name
• character class
• combining level
• bidi class
• case mappings
• canonical decomposition
• mirroring
• default grapheme clustering

ӑ (U+04D1) CYRILLIC SMALL LETTER A WITH BREVE
– letter
– non-combining
– left-to-right
– decomposes to U+0430 U+0306
– Ӑ U+04D0 is uppercase (and titlecase)
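Several of these properties can be read through Python's `unicodedata` module, which exposes a subset of the Unicode Character Database (the version depends on the interpreter). A sketch for the same character:

```python
import unicodedata

ch = "\u04D1"  # ӑ

props = {
    "name": unicodedata.name(ch),
    "category": unicodedata.category(ch),            # Ll = lowercase letter
    "combining class": unicodedata.combining(ch),     # 0 = non-combining
    "bidi class": unicodedata.bidirectional(ch),      # L = left-to-right
    "decomposition": unicodedata.decomposition(ch),
    "uppercase": f"U+{ord(ch.upper()):04X}",
}
for key, value in props.items():
    print(f"{key}: {value}")
```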

Page 48: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Compatibility Characters

Many characters were included in Unicode for round-trip conversion compatibility with legacy encodings:

①②③45Ⅵ¾Lj¼Nj½dž︴︷︻︽﹁﹄ヲィゥォェュ゙

هضممشملسو هيلع هللا ىلص سسfi fl ffi ffl st մե

Compatibility Characters

includes presentation forms

legacy encoding: a term for non-Unicode character encodings.

Page 49: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Byte Order Mark (BOM)

U+FEFF
• Used to indicate the “byte-order” of UTF-16 code units
– 0xFE 0xFF (big-endian); 0xFF 0xFE (little-endian)
• Also used as a Unicode signature by some software (Windows’s Notepad editor, for example) for UTF-8
– 0xEF 0xBB 0xBF

Appears as a character or renders as junk in some formats or on some systems. For example, older browsers render it as three bytes of mojibake.
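Python's `codecs` module exposes the BOM byte sequences as constants, and the `utf-8-sig` codec strips a leading UTF-8 signature on decode. A short sketch:

```python
import codecs

print(codecs.BOM_UTF16_BE.hex(" "), codecs.BOM_UTF16_LE.hex(" "), codecs.BOM_UTF8.hex(" "))

data = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Plain "utf-8" keeps the signature as a leading U+FEFF character.
kept = data.decode("utf-8")
# "utf-8-sig" removes it if present.
stripped = data.decode("utf-8-sig")
print(len(kept), len(stripped))
```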

Page 50: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

The Replacement Character

U+FFFD

Indicates a bad byte sequence or a character that could not be converted.

Equivalent to “question marks” in legacy encoding conversions

� there was a character here, but it is gone now
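In Python, decoding with `errors="replace"` inserts U+FFFD wherever the bytes are not valid in the chosen encoding. The sample bytes are made up for the illustration.

```python
bad = b"caf\xe9"   # "café" in ISO 8859-1; the lone 0xE9 is not valid UTF-8

decoded = bad.decode("utf-8", errors="replace")
print(decoded)
print([f"U+{ord(c):04X}" for c in decoded])
```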

Page 51: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Combining Marks

Composition can create “new” characters

Base + non-spacing (“combining”) characters

A + ˚ = Å: U+0041 + U+030A = U+00C5

a + ˆ + . = ậ: U+0061 + U+0302 + U+0323 = U+1EAD

a + . + ˆ = ậ: U+0061 + U+0323 + U+0302 = U+1EAD
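These compositions can be checked with `unicodedata.normalize` (Form C composes where a precomposed character exists). A sketch:

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


ring = nfc("A\u030A")                     # A + combining ring above
circumflex_first = nfc("a\u0302\u0323")   # a + circumflex + dot below
dot_first = nfc("a\u0323\u0302")          # a + dot below + circumflex

for s in (ring, circumflex_first, dot_first):
    print(s, [f"U+{ord(c):04X}" for c in s])
```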

Page 52: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Complex Scripts

ญั�ตต�ที่��เสนอได้�ผ่�านที่��ประชุ�มด้�วยมต�เอกฉั�นที่

ญั� = ญั +  ั� glyph = consonant + vowel

ญั�ตต�ที่��เสนอได้�ผ่�านที่��ประชุ�มด้�วยมต�เอกฉั�นที่ (word boundaries)

Page 53: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Hindi

What is Unicode? यूनिकोड क्या है?

यू नि को ड

य ू न ि क ो ड: न + ि = नि

Page 54: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Tamil Example

‘ko’

U+0B95 U+0BCA

Combining mark drawn to the “left” of the base character

கொ��

� ொகொ��

Page 55: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

UNICODE'S ENCODING FORMS

Page 56: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode Encoding Forms
• UTF-32
– Uses 32-bit code units.
– All characters are the same width.
• UTF-16
– Uses 16-bit code units.
– BMP characters use one 16-bit code unit.
– Supplementary characters use two special 16-bit code units: a “surrogate pair”.
• UTF-8
– Uses 8-bit code units (bytes!)
– It’s a multi-byte encoding!
– Characters use between 1 and 4 bytes.
– ASCII is ASCII in UTF-8

Page 57: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode Encodings Compared

Character | UTF-32 | UTF-16 | UTF-8
A (U+0041) | 0x00000041 | 0x0041 | 0x41
À (U+00C0) | 0x000000C0 | 0x00C0 | 0xC3 0x80
ቑ (U+1251) | 0x00001251 | 0x1251 | 0xE1 0x89 0x91
𐌸 (U+10338) | 0x00010338 | 0xD800 0xDF38 | 0xF0 0x90 0x8C 0xB8
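The comparison can be regenerated with Python's built-in codecs. The helper `utf_forms` is hypothetical; big-endian variants are used so that no BOM is emitted and the bytes read in the same order as the code units.

```python
def utf_forms(code_point: int) -> dict:
    """Hex bytes of one code point in the three Unicode encoding forms."""
    ch = chr(code_point)
    return {
        "utf-32": ch.encode("utf-32-be").hex(" "),
        "utf-16": ch.encode("utf-16-be").hex(" "),
        "utf-8": ch.encode("utf-8").hex(" "),
    }


for cp in (0x0041, 0x00C0, 0x1251, 0x10338):
    print(f"U+{cp:04X}", utf_forms(cp))
```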

Page 58: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

UTF-32

• Uses 32-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)

• Each character takes exactly one code unit.

U+1251 ቑ 0x00001251

U+10338 𐌸 0x00010338

Page 59: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Advantages and Disadvantages of UTF-32

• Easy to process
– each logical character takes one code unit
– can use pointer arithmetic

• Not commonly used
– Not efficient for storage
  • 11 bits are never used
  • BMP characters are the most common: 16 bits wasted for each of these
– Affected by processor architecture (Big-Endian vs. Little-Endian)

Page 60: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

UTF-16

• Uses 16-bit code units (instead of the more-familiar 8-bit code unit, aka the “byte”)
– BMP characters use one unit
– Supplementary characters use a “surrogate pair”, special code points that don’t do anything else.

0x1251 → U+1251 ቑ

0xD800 0xDF38 → U+10338 𐌸

High Surrogate: 0xD800-DBFF
Low Surrogate: 0xDC00-DFFF
Unique Ranges!
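The surrogate pair for a supplementary code point follows from simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A sketch with a made-up function name:

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points use surrogate pairs")
    offset = code_point - 0x10000           # 20 significant bits remain
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


high, low = to_surrogates(0x10338)
print(f"0x{high:04X} 0x{low:04X}")
```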

Page 61: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Advantages and Disadvantages of UTF-16

• Most common languages and scripts are encoded in the BMP.
– Less wasteful than UTF-32
– Simpler to process (excepting surrogates)
– Commonly supported in major operating environments, programming languages, and libraries

• May not be suitable for all applications
– Affected by processor architecture (Big-Endian vs. Little-Endian)
– Requires more storage, on average, for Western European scripts, ASCII, HTML/XML markup.

Page 62: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

UTF-8

• 7-bit ASCII is itself
• All other characters take 2, 3, or 4 bytes each
– lead bytes have a special pattern
– trailing bytes range from 0x80 to 0xBF

Corresponding Code Point | Lead Byte | Trail Bytes
< 0x80 | 0xxxxxxx |
< 0x800 | 110xxxxx | 10xxxxxx
< 0x10000 | 1110xxxx | 10xxxxxx 10xxxxxx
Supplementary | 11110xxx | 10xxxxxx 10xxxxxx 10xxxxxx
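The bit patterns translate directly into code. This is a teaching sketch of an encoder for a single code point (the name `utf8_encode` is made up); real programs should use the platform's own codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point to UTF-8 following the lead/trail byte patterns."""
    if cp < 0x80:                                   # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                                  # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                                # 1110xxxx 10xxxxxx 10xxxxxx
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogates are not encoded in UTF-8")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                              # 11110xxx + three trail bytes
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("outside the Unicode code space")


for cp in (0x41, 0xC0, 0x1251, 0x10338):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```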

Page 63: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Advantages and Disadvantages of UTF-8

Advantages
• ASCII-compatible
• Default or recommended encoding for many Internet standards
• Bit pattern highly detectable (over longer runs)
• Non-endian
• Streaming
• C char* friendly
• Easy to navigate

Disadvantages
• Multibyte encoding requires additional processing awareness
• Non-shortest form checking needed
• Less efficient than UTF-16 for large runs of Asian text

Page 64: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

HTML
• Set Web server to declare UTF-8 in HTTP Content-Type header
• Declare UTF-8 in META tag header
• Actually use UTF-8 as the encoding!!

<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<title>Вибір і застосування кодування</title>

Page 65: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

WORKING WITH UNICODE

It’s more than just a character set and some encodings…

Page 66: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode Properties, Annexes, and Standards

Unicode provides additional information:
• Character name
• Character class
• “ctype” information, such as if it’s a digit, number, alphabetic, etc.
• Directionality (LTR, RTL, etc.) and the Bidi Algorithm
• Case mappings (UPPER, lower, and Titlecase)
• Default Collation and the Unicode Collation Algorithm (UCA)
• Identifier names
• Regular Expression syntaxes
• Normalization
• Compatibility information

Many of these items are in the form of Unicode Technical Reports: http://www.unicode.org/reports

Page 67: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Normalization

Abc ABC abc abC aBc → abc

Unicode Normalization has to deal with more issues:

• single or multiple combining marks

• compatibility characters

• presentation forms

Ǻ U+01FA

U+00C5 U+0301

U+00C1 U+030A

U+212B U+0301

U+0041 U+0301 U+030A

U+0041 U+030A U+0301

Page 68: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Four Normalization Forms

• Form D: canonical decomposition
• Form C: canonical decomposition followed by composition
• Form KD: kompatibility decomposition
• Form KC: kompatibility decomposition followed by composition

Ways to represent Ǻ:

U+01FA

U+00C5 U+0301

U+00C1 U+030A

U+212B U+0301

U+0041 U+0301 U+030A

U+0041 U+030A U+0301

Page 69: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Normalization in Action

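Original | Form C | Form D | Form KC | Form KD
U+01FA | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+00C5 U+0301 | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+00C1 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A
U+212B U+0301 | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301
U+0041 U+0301 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A | U+00C1 U+030A | U+0041 U+0301 U+030A
U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301 | U+01FA | U+0041 U+030A U+0301

U+01FA decomposes to ring above (U+030A) followed by acute (U+0301). Both marks have the same combining class, so normalization does not reorder them, and the sequences that put the acute first stay distinct from Ǻ.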
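A table like this can be generated with `unicodedata.normalize` rather than filled in by hand. This is a sketch; the results come from the Unicode data bundled with the Python interpreter, and the helper name `fmt` is made up.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")   # Forms C, D, KC, KD


def fmt(s: str) -> str:
    """Render a string as U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


samples = [
    "\u01FA",
    "\u00C5\u0301",
    "\u00C1\u030A",
    "\u212B\u0301",
    "A\u0301\u030A",
    "A\u030A\u0301",
]

rows = {s: {f: unicodedata.normalize(f, s) for f in FORMS} for s in samples}
for original, forms in rows.items():
    print(fmt(original), *(fmt(forms[f]) for f in FORMS), sep=" | ")
```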

Page 70: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Normalization: Not a Panacea

Not all compatibility characters have a compatibility decomposition.

Not all characters that look alike or have similar semantics have a compatibility decomposition. For example, there are many ‘dots’ used as a period.

Not all character variations are handled by normalization. For example, upper, title, and lowercase variations.

Normalization can remove meaning

Page 71: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

A Bit of Bidi

Page 72: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Bi-directional Scripts

• Some languages are written predominantly from left-to-right (LTR).

• Some languages are written predominantly from right-to-left (RTL).

• (A few can be written top-to-bottom or using other schemes)

Unicode defines character “directionality” and a “Bidi” algorithm for rendering text.

Uses logical, not visual, order.

Uses levels of “embedding”.

Requires markup changes (as in HTML) or special controls for certain cases.

Page 73: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Embedding and “Logical Order”

Characters are encoded in logical order. Visual order is determined by the layout.
– Override and bidi control characters
– “Indeterminate” characters

Page 74: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Bidirectional Embedding

Paste in Arabic

Page 75: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode Controls and Markup

Page 76: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Natural Language Processing

Page 77: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode Collation Algorithm

• Defines default collation algorithm and sequences (UTS#10)
– Must be tailored by language and “locale” (culture) and other variations.

Language
– Swedish: z < ö
– German: ö < z

Usage
– German Dictionary: of < öf
– German Telephone: öf < of

Customizations
– Upper-First: A < a
– Lower-First: a < A

Page 78: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Text Segmentation (UAX#29)

Find grapheme, word, and line-break boundaries in text.
• Tailored by language
• Provides good basic default handling

Page 79: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

CLDR and Language Specific Processing…

… is in the next section

Page 80: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

SUMMARY

Page 81: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

“That’s great: I’ll just use Unicode”

Remember “all text has an encoding”?
• user input via forms
• email
• data feeds
• existing, legacy data
• database instances
• uploads

Use UTF-8 for HTML and Web forms

Use UTF-8 in your APIs

Check that data really is UTF-8

Control encoding via code; avoid hard-coding the encoding

Watch out for legacy encodings

Convert to Unicode as soon as practical.

Convert from Unicode as late as possible.

Wrap Unicode-unfriendly technologies

Page 82: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Your System

Map Your System
• APIs: use Unicode encoding; hide internal storage encoding
• Data Stores, Local I/O: use Unicode encoding; consider an encoding conversion plan
• Front Ends: use Unicode encoding
• Back Ends, External Data: Uses Unicode? If not, what encoding? Store the encoding!

[Diagram labels: Input; Detect / Convert; Capture Encoding; Legacy Encoding; Unicode Interface; API; Unicode Cloud; Convert to Legacy]

Page 83: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Counting Things

Be aware of whether you need to count glyphs, characters, or bytes:

– Is the limit “screen positions”, “characters”, or “bytes of storage”?

– Should you be using a different limit? Which one are you actually counting?

यूनिकोड (4 glyphs)

य ू न ि क ो ड (7 characters)

E0-A4-AF E0-A5-82 E0-A4-A8 E0-A4-BF E0-A4-95 E0-A5-8B E0-A4-A1 (21 bytes)

varchar(110)
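The character and byte counts above can be reproduced in Python. The word is written with escapes taken from the UTF-8 bytes listed on the slide. Counting glyphs (grapheme clusters) needs UAX#29 segmentation, which the standard library does not provide, so it is left out of this sketch.

```python
# U+092F U+0942 U+0928 U+093F U+0915 U+094B U+0921
word = "\u092F\u0942\u0928\u093F\u0915\u094B\u0921"

code_points = len(word)                           # "characters"
utf8_bytes = len(word.encode("utf-8"))            # bytes of storage in UTF-8
utf16_units = len(word.encode("utf-16-le")) // 2  # 16-bit code units

print(code_points, utf8_bytes, utf16_units)
print(word.encode("utf-8").hex("-").upper())
```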

Page 84: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Character Encodings

• Code unit
• Code point
• Character
• Glyph

• Multibyte encoding
– Tofu
– Mojibake
– Question Marks

• “All text has an encoding”

Page 85: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Unicode

• 17 planes of goodness
– 1.1 million potential code points
– 150,000 assigned code points

• 3 encodings
– UTF-32
– UTF-16
– UTF-8

• Normalize
• Bidi
• Collation
• Case folding
• … and so much more

Page 86: Internationalization: An Introduction Tutorial from Character Encodings & Unicode.

Q&A

Would you write the code for I18N on the whiteboard before you go?

#define UNICODE
#import I18N.h