Unicode Fundamentals


Presenters

Md. Sami Hassan (Ek-29)

Md. Nazmul Islam (SH-64)

Nafi Md. Kamrul Haque Chowdhury Sadique (SH-04)


Character Encoding

What is a Character?

A Character is the smallest unit of writing which is capable of conveying information.


Examples of character

- Letters

- Digits

- Hyphen

- Mathematical symbols

- Punctuation

- Control characters (typically not visible)


What is Character Encoding?

The set of available characters is called a character repertoire.

The location of a given character within a repertoire is known as its code position, or code point.

The method of representing a code point in digital form, for a given repertoire, is called the character encoding.



Example :

In the repertoire ISO 8859-1,

The “code point” of "Latin capital letter A" is 0x41 (“0x” indicating that 41 is hexadecimal).

The “encoding” of “A” is 01000001.
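The example above can be checked with a few lines of Python. This is only a sketch: it assumes Python's built-in "latin-1" codec as a stand-in for ISO 8859-1, and the variable names are illustrative.

```python
# Code point and single-byte encoding of "A" in ISO 8859-1
# (Python's name for this codec is "latin-1").
ch = "A"
code_point = ord(ch)              # position of the character in the repertoire
encoded = ch.encode("latin-1")    # the byte(s) produced by the encoding
bits = format(encoded[0], "08b")  # that byte written as eight binary digits

print(hex(code_point), bits)
```

For ISO 8859-1 the encoded byte is numerically equal to the code point, which is why the hexadecimal and binary values describe the same number.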


So, in general, we can say,

Character encoding is the way that letters, digits and other symbols are expressed as binary values that a computer can understand.


Why is Character Encoding required?

Computers at their most basic level just deal with binary numbers, 0 and 1. They store letters, numerals and other characters by assigning a binary number for each one.


Pre-Unicode Character Repertoires

- EBCDIC

- ASCII (American Standard Code for Information Interchange)

- US-ASCII (U.S. version, standardised as ISO 646)

- ISO 8859

- ISO 8859-1 (most common version for Western languages)

- ISO 8859-15 (replacement of ISO 8859-1).


A Brief History

In the pre-Unicode environment, we had single-byte (8-bit) character repertoires, which limited us to a maximum of 256 characters. No single encoding could contain enough characters to cover all the languages.

So, hundreds of different encoding systems were developed for assigning numbers to characters.


As a result, these coding systems conflicted with each other. That is, two encodings could use the same number for two different characters, or different numbers for the same character.

Any given computer needed to support many different encodings.

Yet whenever data was passed between different encodings or platforms, that data always ran the risk of corruption.

So, a solution to this problem was necessary.

And so, the solution was Unicode — a character repertoire that contains most of the characters used in the languages of the world.


Unicode

What is Unicode?

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Where is Unicode Used?

- The Unicode Standard has been adopted by many software and hardware vendors.

- Most OSs support Unicode.

- Unicode is required for international document and data interchange.

- Several modern standards use Unicode, such as:

- Programming languages, such as Java, C#, Perl, Python

- Markup languages, such as XML, HTML, XHTML

- JavaScript, LDAP, CORBA, etc.

How many languages are covered by Unicode?

The Unicode Standard encodes characters on a per script basis.

That is, Unicode encodes scripts for languages, rather than languages themselves.

The reason is, many scripts (especially the Latin script) are used to write a large number of languages.

Therefore, the easiest answer is that Unicode covers all of the languages that can be written in the Unicode supported scripts.

Unicode Supported Scripts

How can Unicode cover this many characters?

As we have already seen, in the pre-Unicode environment, we had single-byte (8-bit) character repertoires, which limited us to a maximum of 256 characters, i.e., code points.

Unicode code points, in contrast, are logically divided into 17 planes, each with 65,536 (= 2^16) code points.

In the Unicode standard, planes are contiguous groups of code points.

Planes are identified by the numbers 0 to 16, which correspond to the possible values 0x00–0x10 of the first two positions in the six-position format (hhhhhh).

The only one used in most circumstances is the first plane, known as the basic multilingual plane, or BMP.
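Because each plane holds 2^16 code points, the plane number is simply the code point with its low 16 bits dropped. A minimal Python sketch (the function name `plane_of` is made up for illustration):

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that a Unicode code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    # Each plane spans 2**16 code points, so shifting out the low
    # 16 bits leaves the plane number.
    return code_point >> 16

print(plane_of(0x0058))    # a BMP character
print(plane_of(0x1F600))   # a supplementary-plane character
print(plane_of(0x10FFFD))  # last private-use code point of the last plane
```

Seventeen planes of 2^16 code points give 17 × 65,536 = 1,114,112 code points in total, which is why the code space ends at 0x10FFFF.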

Unicode Planes

How to name a Code Point?

Normally, a Unicode code point is referred to by writing "U+" followed by its hexadecimal number.

For code points in the Basic Multilingual Plane (BMP), four digits are used

- e.g. U+0058 for the character Latin Capital Letter “X”.

For code points outside the BMP, five or six digits are used, as required

- e.g. U+E0001 for the character 'LANGUAGE TAG'

- e.g. U+10FFFD for the character ‘<Plane 16 Private Use, Last>’
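The naming convention above maps directly onto a format string: hexadecimal, upper case, padded to at least four digits. A small Python sketch, with `u_plus` as an illustrative helper name:

```python
def u_plus(code_point: int) -> str:
    """Format a code point in the conventional U+ notation."""
    # "04X" pads to four hex digits; larger values simply use
    # five or six digits as required.
    return f"U+{code_point:04X}"

print(u_plus(ord("X")))
print(u_plus(0xE0001))
print(u_plus(0x10FFFD))
```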

Unicode and ISO

A version of Unicode that has been standardised by ISO (International Organization for Standardization) is called ISO 10646.

There are minor differences between Unicode and ISO 10646, but for practical purposes the two can be treated as the same repertoire.

Mapping of Unicode characters

Unicode/ISO 10646 is only a repertoire.

Therefore, we need an encoding to go with it.

Those encodings are the UTF and UCS families.

UTF and UCS

What are UTF and UCS?

Unicode defines two mapping methods:

- Unicode Transformation Format (UTF) encodings

- Universal Character Set (UCS) encodings

A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.

The ISO 10646 standard uses the term “UCS transformation format” for UTF; the two terms are merely synonyms for the same concept.

UTF vs Unicode

Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111

The application knows this data represents a Unicode string encoded with UTF-8 and must show this as text to the user.

The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

0x68 0x65 0x6C 0x6C 0x6F

Since the app knows this is a Unicode string, it can assume each number represents a character’s code point.

So, it uses the Unicode character repertoire to translate each number to a corresponding character. The resulting string is:

hello
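The two steps can be reproduced in Python as a rough sketch. Python's `bytes.decode` does both steps at once, so the code points are recovered afterwards with `ord` just to make them visible; the variable names are illustrative.

```python
# The five bytes read from disk, written as binary literals.
raw = bytes([0b01101000, 0b01100101, 0b01101100, 0b01101100, 0b01101111])

# Step 1 (UTF-8): bytes -> code points. Step 2 (Unicode): code points -> characters.
text = raw.decode("utf-8")
code_points = [ord(c) for c in text]

print([hex(cp) for cp in code_points])
print(text)
```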


Therefore,

UTF is an encoding used to translate binary data into numbers and vice-versa.

Unicode is a character repertoire used to translate numbers into characters and vice-versa.

UTF Formats

UTF encodings include:

UTF-1 – a retired predecessor of UTF-8, maximizes compatibility with ISO 2022, no longer part of The Unicode Standard.

UTF-7 – a 7-bit encoding sometimes used in e-mail, often considered obsolete, never part of The Unicode Standard.

UTF-EBCDIC – an 8-bit variable-width encoding similar to UTF-8, but designed for compatibility with EBCDIC, not part of The Unicode Standard.


UTF-8 – an 8-bit variable-width encoding which maximizes compatibility with ASCII, used by 77.3% of all the websites.

UTF-16 – a 16-bit, variable-width encoding, used by less than 0.1% of all the websites.

UTF-32 – a 32-bit, fixed-width encoding, very rarely used.


So, at present, Unicode standard defines three encoding forms that allow the same character data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16, or 32-bits per code unit).

These three encoding forms are called UTF-8, UTF-16 and UTF-32 respectively.

UTF-8

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set.

It encodes each of the 1,112,064 code points in the Unicode character set using one to four 8-bit bytes (termed "octets" in the Unicode Standard).

It has the following properties:

1. Characters U+0000 to U+007F (ASCII) are encoded as a single byte 0x00 to 0x7F. This means UTF-8 is fully compatible with ASCII.

2. All characters greater than U+007F are encoded as a sequence of several bytes, all of which are above 0x7F (that is, none is an ASCII byte). This makes it unambiguous whether a byte belongs to a multi-byte character or is an ASCII character.

3. The first byte of a multi-byte sequence (representing a non-ASCII character) is always in the range 0xC2 to 0xF4. All further bytes in the sequence are in the range 0x80 to 0xBF. This makes the boundaries of multi-byte characters unambiguous (in fact, the first byte also indicates how many bytes follow for the character).

4. The bytes 0xC0, 0xC1 and 0xF5 to 0xFF are never used in the UTF-8 encoding.
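The properties listed above follow from how the bits of a code point are spread over one to four bytes. The following is a hand-written sketch of that bit layout in Python, not a replacement for the built-in codec; the function name `utf8_encode` is an assumption made for illustration.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```

In everyday code the built-in `str.encode("utf-8")` does the same job; the sketch only makes the lead-byte and continuation-byte patterns visible.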

As mentioned before, UTF-8 is the most popular encoding form for Unicode.

The reason is that all ASCII characters are encoded as a single byte in UTF-8, which is not only fully backward compatible, but also space efficient for US and many European users.

In general, compared with the legacy encodings, UTF-8 costs no extra space for US ASCII, only a few percent more for ISO 8859-1 (AKA Latin-1, covers most West European languages), 50% more for Chinese/Japanese/Korean, 100% more for Greek and Cyrillic.
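These space figures can be spot-checked per character. The sketch below compares the UTF-8 size of one sample character with its size in a typical legacy encoding; the choice of sample characters and legacy codecs (ISO 8859-7 for Greek, KOI8-R for Cyrillic, GB2312 for Chinese) is an assumption made for illustration.

```python
def sizes(ch: str, legacy: str) -> tuple:
    """Return (bytes in the legacy encoding, bytes in UTF-8) for one character."""
    return len(ch.encode(legacy)), len(ch.encode("utf-8"))

samples = [
    ("A", "ascii"),       # US ASCII
    ("é", "latin-1"),     # ISO 8859-1, accented letter
    ("α", "iso-8859-7"),  # Greek
    ("я", "koi8-r"),      # Cyrillic
    ("中", "gb2312"),      # Chinese
]

for ch, legacy in samples:
    old, new = sizes(ch, legacy)
    print(f"{ch}  {legacy:<10}  {old} byte(s) -> {new} byte(s) in UTF-8")
```

Only the accented letters of a Latin-1 text grow from one byte to two, which is why the overall cost for Western European text is a few percent rather than 100%.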

UTF-16

Like UTF-8, UTF-16 is a variable-width encoding that can represent every character in the Unicode character set.

The difference is that it encodes the 1,112,064 code points in the Unicode character set using one or two 16-bit code units (two or four octets).

Properties :

The Unicode code space is divided into seventeen planes of 2^16 code points each.

The first plane (code points U+0000 to U+FFFF) contains the most frequently used characters and is called the Basic Multilingual Plane or BMP. UTF-16 encodes code points in this range as single 16-bit code units that are numerically equal to the corresponding code points.

Code points from the other planes (called Supplementary Planes) are encoded in UTF-16 as a pair of 16-bit code units called a surrogate pair, by the following scheme:

- 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.

- The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or lead surrogate, which will be in the range 0xD800..0xDBFF.

- The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or trail surrogate, which will be in the range 0xDC00..0xDFFF.
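The three steps of the surrogate scheme translate almost line for line into code. A minimal Python sketch, with `to_surrogates` as an illustrative name:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary-plane code point into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    v = cp - 0x10000               # 20-bit value, 0..0xFFFFF
    lead = 0xD800 + (v >> 10)      # top ten bits  -> 0xD800..0xDBFF
    trail = 0xDC00 + (v & 0x3FF)   # low ten bits  -> 0xDC00..0xDFFF
    return lead, trail

lead, trail = to_surrogates(0x10437)
print(f"{lead:04X} {trail:04X}")
```

The result can be cross-checked against the built-in codec, for example by comparing with `chr(0x10437).encode("utf-16-be")`.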


The Unicode standard permanently reserves the code point values U+D800 to U+DFFF for UTF-16 encoding of the lead and trail surrogates, and they will never be assigned a character, so there should be no reason to encode them.

Use in major operating systems and environments :

- UTF-16 is used for text in the OS API in Microsoft Windows 2000/XP/2003/Vista/CE.

- UTF-16 is used by the Qualcomm BREW operating systems; the .NET environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.

- Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UTF-16.

- The J2SE 5.0 version of Java uses UTF-16.

UTF-32

UTF-32 is a character encoding to encode Unicode characters that uses exactly 32 bits (four octets) per Unicode code point.

It is a fixed-length encoding, whereas all other Unicode transformation formats are variable-length encodings.

Advantage :

The main advantage of UTF-32, versus variable length encodings, is that the Unicode code points are directly indexable.
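"Directly indexable" means that the i-th code point always sits at byte offset 4 × i. A short Python sketch, assuming little-endian UTF-32 without a BOM (`code_point_at` is an illustrative helper, not a library function):

```python
text = "a€😀z"
data = text.encode("utf-32-le")  # exactly 4 bytes per code point, no BOM

def code_point_at(buf: bytes, i: int) -> int:
    """Read the i-th code point from little-endian UTF-32 data."""
    # Fixed width: no scanning is needed, only a multiplication.
    return int.from_bytes(buf[4 * i : 4 * i + 4], "little")

print(len(data))
print(f"U+{code_point_at(data, 2):04X}")
```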

Disadvantage :

The main disadvantage of UTF-32 is that it is space inefficient, using four bytes per code point. Non-BMP characters are so rare in most texts that they may as well be considered non-existent for sizing purposes, making UTF-32 twice the size of UTF-16 and up to four times the size of UTF-8.

Though a fixed number of bytes per code point appears convenient, it is not as useful as it appears.

It does not make it faster to find a particular offset in the string, as an "offset" can be measured in the fixed-size code units of any encoding. Nor does it make calculating the displayed width of a string easier except in limited cases, since even with a “fixed width” font there may be more than one code point per character position or more than one character position per code point.

For these reasons, UTF-32 is very rarely used.

Now, let’s see a problem

Computers speak different languages, like people!

Some write data "left-to-right" and others "right-to-left".

A machine can read its own data just fine - problems happen when one computer stores data and a different type tries to read it.

Then, what may be the solution?

Agree to a common format (i.e., all network traffic follows a single format), or

Always include a header that describes the format of the data. If the header appears backwards, it means data was stored in the other format and needs to be converted.

Which one is taken?

Practically, there's no rule that all computers must use the same language, just like there's no rule all humans need to. Each type of computer is internally consistent (it can read back its own data), but there are no guarantees about how another type of computer will interpret the data it created.

So, the solution is to use a header.

In Unicode, this header is known as Byte Order Mark, or BOM.

Byte Order Mark (BOM)

What is a BOM?

The byte order mark (BOM) is a Unicode character used to signal the “endianness” of a text file or stream.

It consists of the character code U+FEFF at the beginning of a data stream.

What does ‘endian’ mean?

Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last.

The former is called big-endian, the latter little-endian.

Therefore, it can be said that the terms endian and endianness refer to how the bytes of a data word are ordered within memory.

Explanation

Each byte of memory is associated with an index, called its address, which indicates its position. Bytes of a single data word (such as a 32 bit integer datatype) are generally stored in consecutive memory addresses (a 32 bit integer needs 4 such locations). Big-endian systems are systems in which the most significant byte of the word is stored in the smallest address given and the least significant byte is stored in the largest. In contrast, little endian systems are those in which the least significant byte is stored in the smallest address.

Example

Say the data word was "0A 0B 0C 0D" (a set of 4 bytes) and memory addresses starting at a with offsets 0, 1, 2 and 3 are given. Then, in big endian systems, byte 0A is placed in offset 0, 0B in 1, 0C in 2 and 0D in 3. In little-endian systems, the order is the reverse of it (see the diagram).
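The same byte layouts can be produced with Python's standard `struct` module, where ">" requests big-endian and "<" little-endian packing. This is only a sketch of the example word above, with illustrative variable names:

```python
import struct

word = 0x0A0B0C0D

big = struct.pack(">I", word)     # most significant byte at offset 0
little = struct.pack("<I", word)  # least significant byte at offset 0

print("big-endian:   ", big.hex(" "))
print("little-endian:", little.hex(" "))
```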

Where is a BOM useful?

A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format.

The BOM character may also indicate which of the several Unicode representations the text is encoded in.

When is a BOM used?

BOM use is optional.

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use.

In UTF-16, a BOM may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.

Representations of byte order marks by encoding
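As a sketch of what such a listing contains, Python's standard `codecs` module exposes the BOM byte sequences as constants. The snippet prints them and adds a small detector; `sniff_bom` and the label strings are illustrative names, not part of any standard API.

```python
import codecs

# UTF-32 entries come first: the UTF-32 (LE) BOM starts with the
# same two bytes (FF FE) as the UTF-16 (LE) BOM.
BOMS = [
    ("UTF-32 (BE)", codecs.BOM_UTF32_BE),
    ("UTF-32 (LE)", codecs.BOM_UTF32_LE),
    ("UTF-8", codecs.BOM_UTF8),
    ("UTF-16 (BE)", codecs.BOM_UTF16_BE),
    ("UTF-16 (LE)", codecs.BOM_UTF16_LE),
]

def sniff_bom(data: bytes):
    """Return the label of the BOM that data starts with, or None."""
    for name, bom in BOMS:
        if data.startswith(bom):
            return name
    return None

for name, bom in BOMS:
    print(f"{name:<12} {bom.hex(' ').upper()}")

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
```

Each sequence is just U+FEFF written in the corresponding encoding, so a reader that sees the bytes in the "wrong" order knows the data was stored with the opposite endianness. A detector like this is only a heuristic: data without a BOM returns None, and FF FE 00 00 could in principle also be UTF-16 (LE) text that begins with U+0000.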

THAT’S IT FOR TODAY!!!

THANK YOU ALL
