Unicode and character sets


The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

- Joel Spolsky

Co-founder of Stack Overflow

Author of "More Joel on Software"

To a person: A

To a computer: 0100 0001

ASCII: the characters have codes 32~127, which fit in 7 bits and are stored in 8-bit bytes

ISO-8859-1, ISO-8859-2, ISO-8859-3, ... up to ISO-8859-16

In ISO-8859-1, 0xC0 is À

In ISO-8859-7, 0xC0 is ΐ

The same octet has different meanings in different charsets!!
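A minimal sketch of the point above, assuming Perl's core Encode module is available. It decodes the single octet 0xC0 under both charsets and prints the resulting code points; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet, two charsets, two different characters.
my $octet = "\xC0";

my $latin1 = decode('ISO-8859-1', $octet);    # Western European
my $greek  = decode('ISO-8859-7', $octet);    # Greek

printf "ISO-8859-1: U+%04X\n", ord($latin1);  # U+00C0 (À)
printf "ISO-8859-7: U+%04X\n", ord($greek);   # U+0390 (ΐ)
```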

Unicode: Not a Charset

Its goal is to assign a code point to every character in the world

A -> U+0041

http://www.unicode.org/charts/
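As a small illustration, Perl can report the code point of a character and look up its official Unicode name. This sketch assumes only core Perl (ord, chr, and the charnames pragma).

```perl
use strict;
use warnings;
use charnames ();

# A character is identified by its code point, independent of any encoding.
my $cp   = ord('A');
my $name = charnames::viacode($cp);

printf "U+%04X %s\n", $cp, $name;   # U+0041 LATIN CAPITAL LETTER A

# chr() goes the other way: code point to character.
my $char = chr(0x41);
```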

How do we represent Unicode in a computer?

UCS-2 (the fixed-width predecessor of UTF-16)

PROS:

1. Maps code points (U+0000~U+FFFF) directly to two octets

CONS:

1. Incompatible with ASCII

2. Wastes memory when the code point <= U+007F

3. Cannot represent code points > U+FFFF (UTF-16 adds surrogate pairs for these)

A -> U+0041 -> 0x00 0x41 (big-endian)
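The two-octet mapping can be checked with the core Encode module. This is a sketch, not part of the original slides: hex_octets is a hypothetical helper, and U+1F600 is an arbitrary code point above U+FFFF chosen to show the surrogate pair that UTF-16 produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

# A code point in U+0000~U+FFFF maps directly to two octets.
my $bmp = hex_octets(encode('UTF-16BE', 'A'));
print "A       -> $bmp\n";       # 00 41

# A code point above U+FFFF needs a surrogate pair (four octets) in UTF-16.
my $astral = hex_octets(encode('UTF-16BE', "\x{1F600}"));
print "U+1F600 -> $astral\n";    # D8 3D DE 00
```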

UCS-4 (UTF-32)

PROS:

1. Maps every code point (U+0000~U+10FFFF) directly to four octets

CONS:

1. Incompatible with ASCII

2. Wastes a lot of memory

A -> U+0041 -> 0x00 0x00 0x00 0x41
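A short sketch of the fixed four-octet layout, again assuming the core Encode module. The helper name hex_octets is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

# Every character costs four octets, even plain ASCII.
my $ascii = hex_octets(encode('UTF-32BE', 'A'));
my $cjk   = hex_octets(encode('UTF-32BE', "\x{795E}"));

print "A      -> $ascii\n";   # 00 00 00 41
print "U+795E -> $cjk\n";     # 00 00 79 5E
```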

UTF-8

0000 ~ 007F 0xxxxxxx

0080 ~ 07FF 110xxxxx 10xxxxxx

0800 ~ FFFF 1110xxxx 10xxxxxx 10xxxxxx

10000 ~ 10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

A => U+0041 => 1000001 => 01000001 => 0x41

神 => U+795E => 1111001 01011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E
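The bit-packing in the worked examples can be sketched in a few lines of Perl. utf8_octets is a hypothetical helper written for illustration, not how Perl or Encode implement it; the four-octet branch for code points above U+FFFF follows the same pattern as defined by standard UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: pack one code point into its UTF-8 octets by hand.
sub utf8_octets {
    my ($cp) = @_;
    # 0xxxxxxx
    return ($cp) if $cp <= 0x7F;
    # 110xxxxx 10xxxxxx
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;
    # 1110xxxx 10xxxxxx 10xxxxxx
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp <= 0xFFFF;
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp <= 0x10FFFF;
    die sprintf("U+%X is not a Unicode code point\n", $cp);
}

# Hypothetical helper: format a list of octets as hex.
sub as_hex { return join ' ', map { sprintf '0x%02X', $_ } @_ }

print as_hex(utf8_octets(0x0041)), "\n";   # 0x41
print as_hex(utf8_octets(0x795E)), "\n";   # 0xE7 0xA5 0x9E
```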

UTF-8

PROS:

1. Compatible with ASCII

2. Can map every code point to octets

CONS:

1. The algorithm is a little complicated

It does not make sense to have a string without knowing what encoding it uses.

- Joel Spolsky

Programs communicate with each other through octet streams

A sends B the octets E7 A5 9E E9 A9 AC 3F

A must also tell B that these octets are encoded in UTF-8.

Only then can B understand that the received message is "神马?"
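A sketch of what happens on B's side, assuming the core Encode module. The same seven octets are decoded once with the charset A declared and once with a wrong guess; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The octets A sent over the wire.
my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

# B decodes with the charset A declared, and with a wrong guess.
my $right = decode('UTF-8',      $octets);
my $wrong = decode('ISO-8859-1', $octets);

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

printf "UTF-8:      %d chars: %s\n", length($right), code_points($right);
printf "ISO-8859-1: %d chars: %s\n", length($wrong), code_points($wrong);
```

With the correct charset B gets three characters (U+795E U+9A6C U+003F); with the wrong one it gets seven unrelated Latin-1 characters.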

Charsets in Perl

Two ways to get a string in Perl

1. Literal string

2. From I/O

Literal string: the octets depend on the encoding of your source file

# Source file saved as UTF-8

my $a1 = "神马?";

my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

my $a3 = <FH>;

Either way, Perl sees a string of 7 octets.

ISO-8859-1 or UTF-8?

By default, Perl treats it as just a sequence of octets

# Source file saved as UTF-8

my $a1 = "神马?";

print length($a1);   # output is 7

How do we make Perl treat it as a sequence of characters?

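# Source file saved as UTF-8

use Encode;

my $a1 = "神马?";

$a1 = Encode::decode_utf8($a1);          # decode returns the result; it does not modify $a1 in place

# or: $a1 = Encode::decode("utf8", $a1);

# or: Encode::_utf8_on($a1);             # internal: sets the flag in place without validating the octets

print length($a1);                       # output is 3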

What has happened inside?

1. Decode the sequence of octets into code points, interpreting them as UTF-8 (or another charset)

2. Encode the code points into Perl's internal format (utf8)

3. Turn the string's UTF8 flag ON

4. Because the UTF8 flag is on, Perl treats the string as a sequence of characters
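The effect of these steps can be observed from Perl itself. This sketch assumes the core Encode module and uses utf8::is_utf8, which reports the state of the UTF8 flag; the octets are written as escapes so the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";   # raw UTF-8 octets
my $chars  = decode('UTF-8', $octets);         # steps 1 to 3 happen here

# Before decoding: flag off, length counts octets.
my $flag_before = utf8::is_utf8($octets) ? 'on' : 'off';
my $len_before  = length($octets);

# After decoding: flag on, length counts characters (step 4).
my $flag_after = utf8::is_utf8($chars) ? 'on' : 'off';
my $len_after  = length($chars);

print "octets: UTF8 flag $flag_before, length $len_before\n";
print "chars:  UTF8 flag $flag_after, length $len_after\n";
```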

UTF-8? utf8? UTF8?

UTF-8

The standard encoding, designed by Ken Thompson and Rob Pike

utf8

Perl's internal encoding

A lax superset of UTF-8

UTF8

The name of the flag that indicates whether Perl should treat the string as a sequence of characters
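The Encode module exposes the first two names as distinct encodings, which can be checked by asking for their canonical names. A minimal sketch, assuming the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# "UTF-8" resolves to the strict, standards-conforming encoding;
# "utf8" resolves to Perl's lax internal variant.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;

print "UTF-8 -> $strict_name\n";   # utf-8-strict
print "utf8  -> $lax_name\n";      # utf8
```

For data exchanged with other programs, the strict "UTF-8" name is usually the safer choice.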

More Examples

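# Source file saved as UTF-8

use Encode;

use Devel::Peek;   # Dump() writes to STDERR

Dump("神"); Dump("\xE7\xA5\x9E");                              # octets, UTF8 flag off

Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));   # characters, UTF8 flag on

Dump("神" . "\x{795E}");                                       # octets concatenated with characters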

FLAGS = <PADMY,POK,Ppok>

PV = 0x16189d8 "\347\245\236"\0

FLAGS = <PADMY,POK,Ppok,UTF8>

PV = 0x2e7478 "\347\245\236"\0 [UTF8 "\x{795e}"]

FLAGS = <PADMY,POK,Ppok,UTF8>

PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236"\0 [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]

\303\247 = 11000011 10100111 (the utf8 encoding of U+00E7)

\x{e7} = 11100111 (the original octet, upgraded as if it were an ISO-8859-1 character)
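The same mix-up can be reproduced without Devel::Peek. This sketch, assuming the core Encode module, concatenates an undecoded octet string with a character string and then shows the fix, which is to decode first.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA5\x9E";   # UTF-8 octets of U+795E, UTF8 flag off
my $chars  = "\x{795E}";       # one character, UTF8 flag on

# Perl upgrades each octet as if it were an ISO-8859-1 character.
my $mixed = $octets . $chars;
printf "mixed: %d chars, first is U+%04X\n",
    length($mixed), ord(substr($mixed, 0, 1));   # 4 chars, U+00E7

# Decoding the octets before concatenating gives the intended result.
my $fixed = decode('UTF-8', $octets) . $chars;
printf "fixed: %d chars, first is U+%04X\n",
    length($fixed), ord(substr($fixed, 0, 1));   # 2 chars, U+795E
```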

Convert "神" from UTF-8 to GBK

神: E7 A5 9E (UTF-8 encoded), UTF8 flag = off

-> decode ->

神: U+795E (Unicode), stored as E7 A5 9E (utf8 encoded), UTF8 flag = on

-> encode ->

神: C9 F1 (GBK encoded), UTF8 flag = off
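The decode-then-encode conversion can be sketched with the core Encode module, assuming its bundled Chinese encodings (which include GBK) are installed. Variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_octets = "\xE7\xA5\x9E";                 # UTF-8 octets, flag off

my $chars      = decode('UTF-8', $utf8_octets);   # characters, flag on
my $gbk_octets = encode('GBK',   $chars);         # GBK octets, flag off

printf "UTF-8: %s\n", uc unpack('H*', $utf8_octets);   # E7A59E
printf "GBK:   %s\n", uc unpack('H*', $gbk_octets);    # C9F1
```

Encode also provides from_to, which performs the same two steps in place on an octet string.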

Charsets in MySQL

The default charset is inherited: server -> database -> table

CREATE TABLE XXX

……

……

……

DEFAULT CHARSET = utf8 (MySQL's name for UTF-8; use utf8mb4 for code points above U+FFFF)

SET NAMES X is equivalent to:

SET CHARACTER_SET_CLIENT = X;

SET CHARACTER_SET_CONNECTION = X;

SET CHARACTER_SET_RESULTS = X;

Example: MySQL stores UTF-8, Connection_charset = shiftJIS, and two clients connect

Shell (UTF-8): Client_charset = UTF-8, Results_charset = UTF-8

Query: UTF-8 -> shiftJIS (connection) -> UTF-8 (storage)

Result: UTF-8 <- UTF-8

Perl (euc-jp): Client_charset = euc-jp, Results_charset = euc-jp

Query: euc-jp -> shiftJIS (connection) -> UTF-8 (storage)

Result: euc-jp <- UTF-8
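The conversion chain for the euc-jp client can be simulated in Perl without a database. This is only a sketch of the re-encoding steps, not of MySQL itself: convert is a hypothetical helper, the sample text is an arbitrary Japanese string, and the core Encode module with its bundled Japanese encodings is assumed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode octets from one charset to another.
sub convert {
    my ($octets, $from, $to) = @_;
    return encode($to, decode($from, $octets));
}

# Sample text (U+65E5 U+672C U+8A9E) as the euc-jp client would send it.
my $text          = "\x{65E5}\x{672C}\x{8A9E}";
my $client_octets = encode('euc-jp', $text);

# Query path: client charset -> connection charset -> storage charset.
my $connection_octets = convert($client_octets,     'euc-jp',   'shiftjis');
my $stored_octets     = convert($connection_octets, 'shiftjis', 'UTF-8');

# Result path: storage charset -> results charset.
my $result_octets = convert($stored_octets, 'UTF-8', 'euc-jp');

printf "stored as UTF-8: %s\n", uc unpack('H*', $stored_octets);
print  "round trip ok\n" if $result_octets eq $client_octets;
```

Each hop is lossless here only because every character in the sample exists in all three charsets; a character missing from the connection charset would be lost on the way in.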

Thank you!
