Unicode and Character Sets
Uploaded by renchenyu
The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode
and Character Sets (No Excuses!)
- Joel Spolsky
Co-founder of Stack Overflow
Author of "More Joel on Software"
As a person sees it: A
As a computer sees it: 0100 0001
ASCII: printable characters at 32~126 (127 is DEL); 7 bits, stored in an 8-bit octet
ISO-8859-1, ISO-8859-2, ISO-8859-3, ..., ISO-8859-16
In ISO-8859-1, 0xC0 is À
In ISO-8859-7, 0xC0 is ΐ
The same octet has different meanings in different charsets!!
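The point above can be checked with a minimal sketch, assuming Perl's core Encode module is available. It decodes the single octet 0xC0 under both charsets and prints the resulting code points; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One octet, two charsets, two different characters.
my $octet = "\xC0";

my $latin1 = decode('ISO-8859-1', $octet);   # Western European
my $greek  = decode('ISO-8859-7', $octet);   # Greek

printf "ISO-8859-1: U+%04X\n", ord($latin1);   # U+00C0 (À)
printf "ISO-8859-7: U+%04X\n", ord($greek);    # U+0390 (ΐ)
```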
Unicode: Not a Charset
Its goal: assign a code point to every character in the world
A -> U+0041
http://www.unicode.org/charts/
How do we use Unicode in a computer?
UCS-2 (UTF-16)
PROS:
1. Maps code points (U+0000~U+FFFF) directly to two octets
CONS:
1. Incompatible with ASCII
2. Wastes memory when the code point <= U+007F
3. UCS-2 cannot represent code points > U+FFFF (UTF-16 covers them with surrogate pairs)
A -> U+0041 -> 0x00 0x41
UCS-4 (UTF-32)
PROS:
1. Maps every code point directly to four octets (Unicode itself stops at U+10FFFF)
CONS:
1. Incompatible with ASCII
2. Wastes a lot of memory
A -> U+0041 -> 0x00 0x00 0x00 0x41
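A small sketch of the two fixed-width mappings above, assuming the core Encode module and big-endian byte order as on the slides. The helper `hex_octets` and the sample code point U+1F600 are illustrative choices, not part of the original deck.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Format an octet string as space-separated hex pairs, e.g. "00 41".
sub hex_octets {
    my ($octets) = @_;
    return uc(join(' ', unpack('(H2)*', $octets)));
}

my $ucs2   = hex_octets(encode('UCS-2BE',  "A"));
my $utf32  = hex_octets(encode('UTF-32BE', "A"));
# A code point above U+FFFF needs a surrogate pair in UTF-16.
my $astral = hex_octets(encode('UTF-16BE', chr(0x1F600)));

print "UCS-2BE  A       => $ucs2\n";     # 00 41
print "UTF-32BE A       => $utf32\n";    # 00 00 00 41
print "UTF-16BE U+1F600 => $astral\n";   # D8 3D DE 00
```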
UTF-8
0000 ~ 007F    0xxxxxxx
0080 ~ 07FF    110xxxxx 10xxxxxx
0800 ~ FFFF    1110xxxx 10xxxxxx 10xxxxxx
A => U+0041 => 1000001 => 01000001 => 0x41
神 => U+795E => 1111001 01011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E
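The bit layout above can be written out by hand. This is a minimal sketch covering only the three ranges shown (up to U+FFFF); the function name `utf8_octets` is hypothetical, and surrogates and the four-octet form are deliberately not handled.

```perl
use strict;
use warnings;

# Encode one code point (U+0000~U+FFFF) into UTF-8 octets by hand,
# following the three-row bit layout. Returns a list of octet values.
sub utf8_octets {
    my ($cp) = @_;
    if ($cp <= 0x7F) {
        return ($cp);                          # 0xxxxxxx
    }
    if ($cp <= 0x7FF) {
        return (0xC0 | ($cp >> 6),             # 110xxxxx
                0x80 | ($cp & 0x3F));          # 10xxxxxx
    }
    if ($cp <= 0xFFFF) {
        return (0xE0 | ($cp >> 12),            # 1110xxxx
                0x80 | (($cp >> 6) & 0x3F),    # 10xxxxxx
                0x80 | ($cp & 0x3F));          # 10xxxxxx
    }
    die sprintf("U+%X needs the four-octet form, not covered here\n", $cp);
}

for my $cp (0x41, 0x795E) {
    my @hex = map { sprintf('0x%02X', $_) } utf8_octets($cp);
    printf "U+%04X => %s\n", $cp, join(' ', @hex);
}
# U+0041 => 0x41
# U+795E => 0xE7 0xA5 0x9E
```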
UTF-8
PROS:
1. Compatible with ASCII
2. Can map every code point to octets
CONS:
1. The algorithm is a little complicated
It does not make sense to have a string without knowing what
encoding it uses.
- Joel Spolsky
Programs communicate with each other through octet streams
A sends B: E7 A5 9E E9 A9 AC 3F
A should tell B that the octets are encoded as UTF-8.
Then B can work out that the received message is “神马?”
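What B gains from knowing the charset can be sketched as follows, assuming the core Encode module. The same seven octets are decoded once with the announced charset and once with a wrong guess; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The seven octets A puts on the wire.
my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

# B decodes with the charset A announced: three characters.
my $right = decode('UTF-8', $octets);
# B guesses ISO-8859-1 instead: seven unrelated characters.
my $wrong = decode('ISO-8859-1', $octets);

printf "as UTF-8:      %d characters\n", length($right);   # 3
printf "as ISO-8859-1: %d characters\n", length($wrong);   # 7
```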
Charsets in Perl
Two ways to get a string in Perl
1. Literal string
2. From I/O
A literal string depends on the encoding of your source code
# source file encoding: UTF-8
my $a1 = "神马?";
my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
my $a3 = <FH>;
Either way, Perl sees a string of 7 octets.
ISO-8859-1 or UTF-8?
By default, Perl treats it as just a sequence of octets
# source file encoding: UTF-8
my $a1 = "神马?";
print length($a1);   # output is 7
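How do we make Perl treat it as a sequence of characters?
# source file encoding: UTF-8
use Encode;
my $a1 = "神马?";
$a1 = Encode::decode_utf8($a1);          # returns the decoded string
# or: $a1 = Encode::decode("utf8", $a1);
# or: Encode::_utf8_on($a1);             # internal: flips the flag in place, no validation
print length($a1);   # output is 3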
What happens inside?
1. Decode the sequence of octets into code points, reading it as UTF-8 (or another charset)
2. Encode the code points into Perl's internal format (utf8)
3. Turn the string's UTF8 flag ON
4. With the UTF8 flag on, Perl treats the string as a sequence of characters
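The effect of these steps can be observed from the outside. This is a minimal sketch assuming the core Encode module and the built-in `utf8::is_utf8` check; it uses hex escapes so the result does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";   # the UTF-8 octets of the sample string
my $chars  = decode('UTF-8', $octets);         # steps 1 to 3 happen in here

printf "octets: length %d, UTF8 flag %s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';   # 7, off
printf "chars:  length %d, UTF8 flag %s\n",
    length($chars),  utf8::is_utf8($chars)  ? 'on' : 'off';   # 3, on
```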
UTF-8? utf8? UTF8?
UTF-8
The standard encoding designed by Ken Thompson and Rob Pike
utf8
Perl's internal encoding
A lax superset of UTF-8
UTF8
The name of the flag that indicates whether
Perl should treat the string as a sequence of characters
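The Encode module keeps the first two names apart as well. A short sketch, assuming the core Encode module, that asks it which encoding object each spelling resolves to:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $strict = find_encoding('UTF-8')->name;
my $lax    = find_encoding('utf8')->name;

print "UTF-8 resolves to: $strict\n";   # utf-8-strict
print "utf8  resolves to: $lax\n";      # utf8
```

When exchanging data with other programs, the strict `UTF-8` spelling is usually the safer choice, since the lax form accepts sequences that standard UTF-8 rejects.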
More Examples
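# source file encoding: UTF-8
use Encode;
use Devel::Peek;
# Dump writes to STDERR by itself, so no print is needed
Dump("神"); Dump("\xE7\xA5\x9E");
Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));
Dump("\x{795E}" . "神");   # order matches the output recorded below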
Octet strings (literal and hex escapes):
FLAGS = <PADMY,POK,Ppok>
PV = 0x16189d8 "\347\245\236"\0
Character strings (\x{795E} and the decoded octets):
FLAGS = <PADMY,POK,Ppok,UTF8>
PV = 0x2e7478 "\347\245\236"\0 [UTF8 "\x{795e}"]
Concatenation of the two forms:
FLAGS = <PADMY,POK,Ppok,UTF8>
PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236"\0 [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]
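\303\247 = 11000011 10100111 (the utf8 encoding of U+00E7)
\x{e7} = 11100111
Perl upgraded the octet string as if it were ISO-8859-1, so each of its three octets became a separate character.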
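The same pitfall can be seen without Devel::Peek by counting characters. This is a minimal sketch assuming the core Encode module; decoding the octets before concatenating is shown as one possible fix.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA5\x9E";   # UTF-8 octets, UTF8 flag off
my $char   = "\x{795E}";       # one character, UTF8 flag on

# Mixing the two: the octets are upgraded one by one as ISO-8859-1.
my $mixed = $char . $octets;
printf "mixed: %d characters\n", length($mixed);   # 4

# Decoding first keeps the text intact.
my $fixed = $char . decode('UTF-8', $octets);
printf "fixed: %d characters\n", length($fixed);   # 2
```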
Convert “神” from UTF-8 to GBK
神 = E7 A5 9E (UTF-8 encoded), UTF8 flag = off
  -> decode ->
神 = U+795E (Unicode), stored as E7 A5 9E (utf8 encoded), UTF8 flag = on
  -> encode ->
神 = C9 F1 (GBK encoded), UTF8 flag = off
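The decode-then-encode path can be sketched as below, assuming the core Encode module and its `gbk` alias for the GBK charset. The hex helper line is illustrative; the slides give C9 F1 as the GBK result.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $utf8_octets = "\xE7\xA5\x9E";                 # UTF-8 encoded, UTF8 flag off
my $chars       = decode('UTF-8', $utf8_octets);  # U+795E, UTF8 flag on
my $gbk_octets  = encode('gbk', $chars);          # GBK encoded, UTF8 flag off

# Print the GBK octets as hex pairs for comparison with the slide.
printf "GBK octets: %s\n", uc(join(' ', unpack('(H2)*', $gbk_octets)));
```

Encode also offers `from_to`, which performs both steps on an octet string in place.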
Charsets in MySQL
Server -> database -> table
CREATE TABLE XXX
……
……
……
DEFAULT CHARSET = utf8
(MySQL spells it utf8, or utf8mb4 for code points > U+FFFF; "UTF-8" is not accepted)
SET NAMES X is equivalent to:
SET CHARACTER_SET_CLIENT = X
SET CHARACTER_SET_CONNECTION = X
SET CHARACTER_SET_RESULTS = X
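Two clients, one connection charset, data stored as UTF-8:
Shell (UTF-8) <-> MySQL (UTF-8)
  Client_charset = UTF-8, Connection_charset = shiftJIS, Results_charset = UTF-8
  Statement: UTF-8 -> shiftJIS, then shiftJIS -> UTF-8
  Result:    UTF-8 <- UTF-8
Perl (euc-jp) <-> MySQL (UTF-8)
  Client_charset = euc-jp, Connection_charset = shiftJIS, Results_charset = euc-jp
  Statement: euc-jp -> shiftJIS, then shiftJIS -> UTF-8
  Result:    euc-jp <- UTF-8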
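The conversion chain for the euc-jp client can be imitated outside MySQL. This is only a simulation with Perl's core Encode module, not what the server runs internally; the sample text (U+65E5 U+672C U+8A9E) and the variable names are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode from_to);

# Hypothetical statement text as the euc-jp client would send it.
my $sent = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");

my $data = $sent;
from_to($data, 'euc-jp',   'shiftjis');   # client charset -> connection charset
from_to($data, 'shiftjis', 'UTF-8');      # connection charset -> stored charset
my $stored = $data;
from_to($data, 'UTF-8',    'euc-jp');     # stored charset -> results charset

print "stored as UTF-8: ", uc(join(' ', unpack('(H2)*', $stored))), "\n";
print "round trip intact\n" if $data eq $sent;
```

The round trip is lossless here because all three charsets contain the sample characters; a character missing from the connection charset would be lost at the first step.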
Q & A
Thank U!