Unicode (and Python)

Post on 11-May-2015

An introduction to Unicode and its processing in Python.

Juan Manuel Gimeno Illa

jmgimeno@diei.udl.cat

November 2008

J.M.Gimeno (jmgimeno@diei.udl.cat) Unicode November 2008 1 / 21

Outline

1 Before Unicode

2 Unicode
    Unicode Concepts
    Encodings

3 Python’s Unicode Support
    Unicode String Type
    Source Code Encoding

4 Bibliography

Before Unicode

In the beginning, computing was mainly centered in North America and done in English. Characters were stored one per byte, using either

- ASCII (7 bits)
- EBCDIC (8 bits)

In other parts of the world, different ways of storing the local characters were invented

- Japan: various flavours of JIS encodings
- Russia: KOI8
- India: the ISCII standard

In addition, operating system vendors defined some proprietary encodings

ISO-8859-*

For the huge number of people in America, Europe, and the Middle East who use relatively small alphabets, there was ISO-8859, which

- left ASCII as ASCII (range 0 to 127)
- used the range 128 through 255 for different purposes, depending on the part of the standard

    1-4   Different accented characters (e.g. Latin-1)
    5     Cyrillic
    6     Arabic
    7     Greek
    8     Hebrew
    9     Turkish
    10    Nordic languages

But only one part could be in use at a time, so one couldn’t easily mix Greek and Cyrillic in the same file.
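
The one-at-a-time limitation can be seen by decoding the same byte under different parts of ISO-8859. This is a minimal sketch, assuming Python 3 and its standard codec names (`iso8859_1`, `iso8859_5`, `iso8859_7`); the helper `decode_all` is only illustrative and not part of any library.

```python
# The same byte value means a different character in each ISO-8859 part.
PARTS = ("iso8859_1", "iso8859_5", "iso8859_7")  # Latin-1, Cyrillic, Greek


def decode_all(data, codecs=PARTS):
    """Decode the same bytes once per codec and return {codec: text}."""
    return {codec: data.decode(codec) for codec in codecs}


raw = bytes([0xE4])  # a single byte in the 128-255 range
for codec, text in decode_all(raw).items():
    print(f"{codec}: U+{ord(text):04X} {text}")
```

The byte itself carries no hint of which part was meant, so the reader of a file has to learn the encoding from somewhere else.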

Houston, Houston, ...

Clearly this was a very unsatisfactory situation

ISO-2022 provided a partial solution by allowing the encoding to be shifted in the middle of a string

- it was difficult to use
- so it wasn’t widespread

What was needed was a universal way to refer to all the different characters in all the alphabets

- ISO/IEC 10646
- Unicode
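
How ISO-2022 shifts encodings in mid-string can be observed through Python's `iso2022_jp` codec. This is a rough sketch, assuming Python 3; `shift_count` is a made-up helper that counts the ESC bytes introducing each switch.

```python
ESC = b"\x1b"  # every ISO-2022 escape sequence starts with this byte


def shift_count(text, codec="iso2022_jp"):
    """Count the escape sequences the codec emits while encoding text."""
    return text.encode(codec).count(ESC)


mixed = "A\u65e5B"  # 'A', the kanji U+65E5, 'B'
encoded = mixed.encode("iso2022_jp")

print(encoded)              # 7-bit bytes, with escape sequences around the kanji
print(shift_count("ABC"))   # plain ASCII needs no switch
print(shift_count(mixed))   # one switch into the Japanese set, one back to ASCII

# Decoding has to track the current shift state to recover the text.
assert encoded.decode("iso2022_jp") == mixed
```

Because the meaning of a byte depends on the escape sequences seen before it, slicing or searching such a string at arbitrary positions is error-prone, which is part of why the scheme was hard to use.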

Unicode Unicode Concepts

Unicode’s Solution

One encoding for all scripts of the world

ASCII compatibility (and even Latin-1: the first 256 code points coincide with it)

Includes character metadata

- Case mapping information
- Character category information

Accounts for scripts using different orientations

Enables sorting and normalization support
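
Some of this metadata can be queried from Python. This is a minimal sketch, assuming Python 3 and its standard `unicodedata` module; the sample characters are arbitrary choices for illustration.

```python
import unicodedata

# Case mapping information: the mapping is not always one-to-one.
print("\u00e9".upper())   # é -> É
print("\u00df".upper())   # ß -> SS (one character becomes two)

# Orientation: each character has a bidirectional class.
# 'L' is left-to-right, 'R' and 'AL' are right-to-left classes.
for ch in ("a", "\u05d0", "\u0627"):   # Latin a, Hebrew alef, Arabic alef
    print(f"U+{ord(ch):04X} {unicodedata.bidirectional(ch)}")

# Some characters are mirrored when shown in right-to-left text.
print(unicodedata.mirrored("("))   # 1 means the glyph is mirrored
```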

Unicode’s Terminology

Grapheme: this is what users regard as a character

- André

Code points: this is a Unicode encoding of the string

- André = Andr + U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
- André = Andre + U+0301 (COMBINING ACUTE ACCENT)

Code units: this is what the implementation stores (e.g. in UTF-8)

- Andre + 0xCC 0x81

This can be explored in Linux using the program gucharmap
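
The three levels can be told apart in a few lines. This is a minimal sketch, assuming Python 3 (where `str` is a sequence of code points) and the standard `unicodedata` module; the variable names are only illustrative.

```python
import unicodedata

composed = "Andr\u00e9"      # é as the single code point U+00E9
decomposed = "Andre\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# Same graphemes on screen, but different code point sequences.
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 5 and 6 code points

# Code units: what UTF-8 actually stores for each form.
print(composed.encode("utf-8"))     # ends with 0xC3 0xA9
print(decomposed.encode("utf-8"))   # ends with 'e' then 0xCC 0x81

# Normalization maps both spellings to a single canonical form.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

Comparing or counting "characters" therefore depends on which level is meant; normalizing first is the usual way to make the two spellings compare equal.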

Unicode Organization

At the time of writing, Unicode defines just under 100000 code points, but it has space for up to 1114112

They are organized into 17 planes of 2^16 = 65536 code points each, numbered 0 to 16

Plane 0 is called the Basic Multilingual Plane (BMP) and contains pretty well everything useful

The characters in the BMP are laid out more or less West to East

- ASCII characters from 0 to 127
- Latin-1 characters from 128 to 255
- Then moving East in Europe (Greek, Cyrillic)
- Next the Middle East (Arabic, Hebrew)
- Then the Indus (scripts of India)
- Next Southeast Asia (Thai, Lao and so on)
- and ending with China, Japan and Korea

Planes 1 to 16 are sometimes called astral planes; they include exotic, rare and historically important characters (Old Italic, Byzantine musical symbols, etc.)
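
The plane arithmetic can be checked directly. This is a small sketch, assuming Python 3.3 or later (where `sys.maxunicode` is always 0x10FFFF); `plane` and the constant names are illustrative helpers, not standard functions.

```python
import sys

PLANES = 17
PLANE_SIZE = 2 ** 16   # 65536 code points per plane


def plane(ch):
    """Return the plane number (0 to 16) of a single character."""
    return ord(ch) >> 16   # equivalent to ord(ch) // PLANE_SIZE


# Total code space: 17 * 65536 = 1114112, i.e. U+0000 to U+10FFFF.
print(PLANES * PLANE_SIZE, sys.maxunicode + 1)

print(plane("A"))            # 0: ASCII sits at the start of the BMP
print(plane("\u4e2d"))       # 0: a Chinese ideograph, still in the BMP
print(plane("\U00010346"))   # 1: a Gothic letter, in an astral plane
```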

Code Points

Each code point (“character”) gets a number and a name

The number is usually given in hexadecimal and prefixed by U+

(Note that it is not a 16-bit number, because of the astral planes!)

Unicode includes tables with useful character properties (metadata) such as

- this is a number
- this is uppercase
- this is punctuation

The standard also provides

- a helpful picture of a reasonably typical rendition
- rules for line-breaking
- hyphenation
- sorting
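
Numbers, names and properties can be looked up from Python. This is a minimal sketch, assuming Python 3 and the standard `unicodedata` module; `describe` is an illustrative helper, and the bracketed value is the two-letter general category (for example `Nd` for a decimal digit, `Lu` for an uppercase letter, `Po` for punctuation).

```python
import unicodedata


def describe(ch):
    """Format a character as 'U+XXXX NAME [category]'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


for ch in ("5", "A", "!", "\u0416"):
    print(describe(ch))

# Names work in the other direction too, including for astral characters,
# whose numbers do not fit in 16 bits.
faihu = unicodedata.lookup("GOTHIC LETTER FAIHU")
print(describe(faihu), ord(faihu) > 0xFFFF)
```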

Unicode Encodings

Encodings

Along with the code points, Unicode also defines methods for storing them in byte sequences in a computer

There are three approaches named UTF-8, UTF-16 and UTF-32

UTF stands for Unicode Transformation Format or UCS Transformation Format, where UCS stands for Universal Character Set

The characters we will use in the explanations are:

  Number             Name                          Plane
  U+0026  (38)       AMPERSAND                     BMP
  U+0416  (1046)     CYRILLIC CAPITAL LETTER ZHE   BMP
  U+4E2D  (20013)    CJK UNIFIED IDEOGRAPH-4E2D    BMP
  U+10346 (66374)    GOTHIC LETTER FAIHU           Astral
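
To get a first feel for the three approaches, the following sketch (Python 3 assumed; SAMPLES and encoded_sizes are names invented here) encodes the four example characters with Python's built-in codecs and compares how many bytes each one needs.

```python
SAMPLES = ["\u0026", "\u0416", "\u4E2D", "\U00010346"]


def encoded_sizes(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16 and UTF-32."""
    # The explicit big-endian codecs are used so that no byte order mark is added.
    return tuple(len(ch.encode(name)) for name in ("utf-8", "utf-16-be", "utf-32-be"))


for ch in SAMPLES:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

UTF-32 always needs four bytes, UTF-16 two or four, and UTF-8 anywhere from one to four.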

UTF-32

The simplest way of storing characters: you use 32 bits (4 bytes) to store each character

So we store 38, 1046, 20013 and 66374 as 32-bit integers

For Latin-1 characters it wastes too much space

Problems with C strings because most bytes are zero (use wchar_t)

There are several ways of laying out a 4-byte integer across 4 bytes (remember big-endian and little-endian?)

So if you send one of these 4-byte integers to another machine, problems occur if the two machines use different orderings

Solutions:

  Explicitness: the UTF-32BE and UTF-32LE encodings
  Byte Order Mark (BOM): the character U+FEFF (ZERO WIDTH NO-BREAK SPACE) and the guarantee that U+FFFE will never be a character
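
The byte-order issue and both solutions can be sketched in Python 3 as follows. The codecs constants and the utf-32-be / utf-32-le codec names come from the standard library; FAIHU and guess_utf32_order are names made up for this example, and the function only looks at a leading BOM.

```python
import codecs
import struct

FAIHU = 0x10346  # GOTHIC LETTER FAIHU

# UTF-32 is just the code point written as a 4-byte integer, in either byte order.
big = struct.pack(">I", FAIHU)      # bytes 00 01 03 46
little = struct.pack("<I", FAIHU)   # bytes 46 03 01 00
assert big == chr(FAIHU).encode("utf-32-be")
assert little == chr(FAIHU).encode("utf-32-le")


def guess_utf32_order(data):
    """Guess the byte order of UTF-32 data from a leading BOM, or return None."""
    if data.startswith(codecs.BOM_UTF32_BE):   # 00 00 FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF32_LE):   # FF FE 00 00
        return "little"
    return None
```

Reading U+FEFF with the wrong byte order would give 0xFFFE0000, which is not a valid code point, so the mark is unambiguous.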

UTF-16

UTF-16 stores Unicode characters in 16-bit chunks

  - all the BMP characters appear as themselves
  - some trickery is needed for the astral plane ones

There are two blocks of code points in the BMP called surrogate blocks

  High surrogates: from U+D800 to U+DBFF
  Low surrogates: from U+DC00 to U+DFFF

Astral plane characters are split into two surrogates

  - first, 0x10000 = 2^16 is subtracted from the code point
  - next, the resulting 20 bits are split, using the low surrogate for the low ten bits and the high surrogate for the high ten bits

This gives 20 bits, or 2^20 characters, which fits the 16 = 2^4 astral planes with 2^16 characters each

So U+10346 is represented as the 16-bit integers 0xD800 0xDF46

It also has byte-ordering problems, hence UTF-16BE, UTF-16LE or the use of the BOM

Nightmare in C: embedded zeros and not necessarily the same size as wchar_t

The most space-efficient way to store Asian characters
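
The surrogate arithmetic described above can be written out directly. This is a minimal sketch in Python 3; to_surrogates and from_surrogates are hypothetical helper names, not part of any library.

```python
def to_surrogates(code_point):
    """Split an astral code point into its (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral plane code point")
    offset = code_point - 0x10000           # at most 20 bits remain
    high = 0xD800 + (offset >> 10)          # high ten bits
    low = 0xDC00 + (offset & 0x3FF)         # low ten bits
    return high, low


def from_surrogates(high, low):
    """Rebuild the code point from a (high, low) surrogate pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10346)
print(hex(high), hex(low))
```

For U+10346 the offset is 0x346, whose high ten bits are all zero, which is why the high surrogate is exactly 0xD800.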

UTF-8

UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike.

It works like this:

  - characters whose value is less than 128 (ASCII) are encoded as themselves in one byte
  - the rest have their bits ripped apart and dealt out into several (from two to four) bytes as follows:
      - The first byte has a bunch of high-order one bits telling how many bytes are used to encode the character, followed by a zero bit
      - The rest of the bytes each begin with a single one bit followed by a zero bit
      - The bits of the character are dealt out in the space left over after these signalling bits
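
The rules above translate into a few lines of bit manipulation. The following is a hand-written sketch for illustration (Python 3 assumed; utf8_encode is a made-up name). In real code you would simply call str.encode("utf-8"), and unlike the built-in codec this sketch does not reject surrogate code points.

```python
def utf8_encode(code_point):
    """Encode one code point (0..0x10FFFF) as UTF-8 bytes, by hand."""
    if code_point < 0x80:
        return bytes([code_point])          # ASCII: encoded as itself
    if code_point < 0x800:
        n_bytes, lead_marker = 2, 0b11000000
    elif code_point < 0x10000:
        n_bytes, lead_marker = 3, 0b11100000
    elif code_point <= 0x10FFFF:
        n_bytes, lead_marker = 4, 0b11110000
    else:
        raise ValueError("code point out of range")

    out = []
    for _ in range(n_bytes - 1):
        # Each trailing byte is 10xxxxxx: two marker bits plus the next 6 low bits.
        out.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever is left fits in the free bits of the first byte.
    out.append(lead_marker | code_point)
    return bytes(reversed(out))
```

The bytes are produced from the last one backwards, so the list is reversed before it is returned.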

The following table summarizes the rules:

  Hex range        Binary                        UTF-8
  000000–00007F    0zzzzzzz                      0zzzzzzz
  000080–0007FF    00000yyy yyzzzzzz             110yyyyy 10zzzzzz
  000800–00FFFF    xxxxyyyy yyzzzzzz             1110xxxx 10yyyyyy 10zzzzzz
  010000–10FFFF    000wwwxx xxxxyyyy yyzzzzzz    11110www 10xxxxxx 10yyyyyy 10zzzzzz

Our examples result in:

  Character    Binary                        UTF-8
  U+0026       00100110                      00100110
  U+0416       00000100 00010110             11010000 10010110
  U+4E2D       01001110 00101101             11100100 10111000 10101101
  U+10346      00000001 00000011 01000110    11110000 10010000 10001101 10000110

Using hexadecimal:

  Character    Hexadecimal
  U+0026       0x26
  U+0416       0xD0 0x96
  U+4E2D       0xE4 0xB8 0xAD
  U+10346      0xF0 0x90 0x8D 0x86
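
Read from right to left, the rule table also tells you how to decode. The sketch below (Python 3 assumed; utf8_decode_one is a name invented for this example) looks at the first byte to decide how many bytes follow and then collects six bits from each continuation byte. It is deliberately simplified: it does not check for truncated input or overlong encodings, which a real decoder must reject.

```python
def utf8_decode_one(data, pos=0):
    """Decode the character starting at data[pos]; return (code_point, next_pos)."""
    first = data[pos]
    if first < 0b10000000:          # 0zzzzzzz
        return first, pos + 1
    if first >> 5 == 0b110:         # 110yyyyy
        n_bytes, code_point = 2, first & 0b00011111
    elif first >> 4 == 0b1110:      # 1110xxxx
        n_bytes, code_point = 3, first & 0b00001111
    elif first >> 3 == 0b11110:     # 11110www
        n_bytes, code_point = 4, first & 0b00000111
    else:
        raise ValueError("not the first byte of a character")
    for byte in data[pos + 1:pos + n_bytes]:
        if byte >> 6 != 0b10:       # every other byte must be 10zzzzzz
            raise ValueError("bad continuation byte")
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point, pos + n_bytes
```

Calling it repeatedly with the returned position walks through a buffer one character at a time.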

UTF-8's space cost is uneven and favours Latin-script text:
- anglophones get one byte per character
- most people west of the Indus river get away with two bytes
- India and points east need three bytes per character

Processing UTF-8 characters sequentially is about as efficient as in any other encoding

But you can't easily index into a buffer (this is the same as UTF-16). To reach the n-th character you must either
- count characters, or
- keep an array of positions

UTF-8 has no embedded zero bytes (only U+0000 itself encodes to a zero byte), so some C routines that rely on zero-terminated strings still work

No byte-ordering problems
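These trade-offs can be observed directly. The sketch below assumes Python 3; the helper char_offsets and the sample strings are illustrative choices, not part of the original slides. It compares character and byte counts, then builds the "array of positions" with one linear scan.

```python
def char_offsets(data):
    """Return the byte offset at which each character of UTF-8 data starts."""
    # Continuation bytes look like 10xxxxxx; any other byte starts a character.
    return [i for i, byte in enumerate(data) if (byte & 0xC0) != 0x80]


samples = {
    "English": u"hello",
    "Russian": u"\u043f\u0440\u0438\u0432\u0435\u0442",
    "Chinese": u"\u4e2d\u6587",
}
for language, text in samples.items():
    data = text.encode("utf-8")
    # No zero bytes appear unless the text itself contains U+0000.
    assert 0 not in data
    print(language, len(text), "characters,", len(data), "bytes")

data = u"a\u4e2db".encode("utf-8")
offsets = char_offsets(data)            # one linear scan builds the array
second = data[offsets[1]:offsets[2]]    # bytes of the second character
print(offsets, len(second))
```

Once the offset array exists, indexing by character is a constant-time lookup, at the price of extra memory and the initial scan.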

Python’s Unicode Support Unicode String Type

Python’s Unicode type

Python has a built-in Unicode type

Unicode string literals have the same syntax as normal ones, with a u or U prefix before the quotes (e.g. u"This is Unicode")

Unicode literals can include the escape sequence \uXXXX to denote code point U+XXXX and \UXXXXXXXX for U+XXXXXXXX (e.g. u"\u0026\u0416\u4e2d\U00010346")

Unicode characters can be named using the escape sequence \N{name} (e.g. u"\N{Ampersand}")

unichr(i) returns a one-character Unicode string for code point i (the inverse is ord)
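The literal forms above can be tried as follows. This is a sketch assuming Python 3.3 or later, where every str is a Unicode string, the u prefix is accepted but optional, and the role of unichr is played by chr; in the Python 2 the slides describe, replace chr with unichr.

```python
# Four code points written with \u and \U escapes.
text = u"\u0026\u0416\u4e2d\U00010346"

# The same first character, referred to by its Unicode name.
named = u"\N{AMPERSAND}"

code_points = [ord(ch) for ch in text]
print(["U+%04X" % cp for cp in code_points])

assert named == text[0]
assert chr(0x0416) == text[1]          # chr is Python 3's unichr
assert ord(chr(0x4E2D)) == 0x4E2D      # ord is the inverse
```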

Encoding and Decoding

You can convert between plain string objects (bytes) and Unicode string objects by means of a codec

s.encode(codec=None, errors='strict')
Returns a plain string encoded from the (plain or Unicode) string s using the given encoding (for example 'ascii', 'latin-1', 'utf-8') and error handling ('strict', 'replace' or 'ignore')

s.decode(codec=None, errors='strict')
Returns a Unicode string decoded from the plain string s using the given encoding and error handling. This is the same as: unicode(s, codec=None, errors='strict')
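A short round trip shows the three error handlers side by side. The sketch assumes Python 3, where encode lives on str and decode on bytes, and str(data, "utf-8") stands in for the Python 2 call unicode(s, codec); the sample text is an illustrative choice.

```python
text = u"\u0416 & \u4e2d"   # CYRILLIC ZHE, ampersand, CJK ideograph

# Encoding turns the Unicode string into bytes; decoding reverses it.
utf8_bytes = text.encode("utf-8")
assert utf8_bytes == b"\xd0\x96 & \xe4\xb8\xad"
assert utf8_bytes.decode("utf-8") == text
assert str(utf8_bytes, "utf-8") == text

# ASCII cannot represent two of the characters, so the handler matters.
replaced = text.encode("ascii", "replace")   # unencodable -> b"?"
ignored = text.encode("ascii", "ignore")     # unencodable -> dropped

try:
    text.encode("ascii")                     # 'strict' is the default
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

print(replaced, ignored, type(strict_error).__name__)
```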

Modules related to Unicode

The codecs module
- This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling look-up process
- open(filename, mode[, encoding[, errors[, buffering]]])
  Open an encoded file using the given mode and return a wrapped version providing transparent encoding/decoding
- EncodedFile(file, input[, output[, errors]])
  Return a wrapped version of file which provides transparent encoding translation

The unicodedata module
- Supplies easy access to the Unicode Character Database
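Both modules can be exercised in a few lines. The sketch below assumes Python 3 and uses a temporary file whose name is an arbitrary choice; codecs.open is used because the slide documents it, although Python 3's built-in open(..., encoding=...) covers the same need.

```python
import codecs
import os
import tempfile
import unicodedata

text = u"\u0416 & \u4e2d"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.txt")

    # The wrapper encodes on write and decodes on read.
    with codecs.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with codecs.open(path, "r", encoding="utf-8") as handle:
        round_tripped = handle.read()

    # Reading the raw bytes shows what was actually stored on disk.
    with open(path, "rb") as raw:
        raw_bytes = raw.read()

# The Unicode Character Database maps characters to names and categories.
names = [unicodedata.name(ch) for ch in u"\u0026\u0416\U00010346"]
category = unicodedata.category(u"&")
print(round_tripped == text, raw_bytes, names, category)
```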

Python’s Unicode Support Source Code Encoding

Source Code Encodings

By default, Python source must only contain characters from the ASCII set

But you are allowed to tell Python that you use a superset of ASCII

The extra characters can only appear
- in comments
- in string literals

To accomplish this, put a comment like this in the first line of your source file (or in the second line, if there is a shebang line):

# -*- coding: latin-1 -*-
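The effect of the declaration can be checked without creating a file, because compile() accepts source as bytes and honours the coding comment. This is a sketch assuming Python 3; the variable name word and the sample text are illustrative.

```python
# Build the bytes of a tiny source file saved in latin-1.
source = (
    u"# -*- coding: latin-1 -*-\n"
    u"word = u'caf\u00e9'\n"
).encode("latin-1")

# The accented letter is stored as the single latin-1 byte 0xE9.
assert b"\xe9" in source

# The coding comment tells the compiler how to decode the literal.
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
word = namespace["word"]
print(len(word), "U+%04X" % ord(word[-1]))
```

The string literal comes back as four characters ending in U+00E9, which shows that the single byte was decoded using the declared encoding.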

Bibliography

Kumar McMillan, Unicode In Python, Completely Demystified (recorded in this video), PyCon 2008, Chicago

Tim Bray, On the Goodness of Unicode, Characters vs. Bytes

Roman Czyborra, Unicode's Characters

Michael Foord, A Crash Course in Character Encoding

A.M. Kuchling, The Unicode-HOWTO

Markus Kuhn, UTF-8 and Unicode FAQ for Unix/Linux

Marc-Andre Lemburg, PEP-100: Python Unicode Integration, PEP-263: Defining Python Source Code Encodings, Developing Unicode-aware Applications in Python

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Unicode Consortium, The Unicode Home Page

License

This work is licensed under the Creative Commons Attribution-ShareAlike 2.5 Spain license. To view a copy, visit

http://creativecommons.org/licenses/by-sa/2.5/es/

or send a letter to

Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA