Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity

description

Every developer will eventually feel the pain of character encoding issues. We will cover the fundamentals every Python developer should know about character encoding and Unicode. We will show how to identify the kinds of problems that occur when dealing with character encoding, and outline a set of best practices and useful libraries for avoiding and fixing them.

Transcript of Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity

Page 1: Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity

ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&Atilde;&sup3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&Atilde;&sup3;"
became "&amp;Atilde;&amp;sup3;"
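The three mangling steps in the poem can be reproduced in a few lines. This is only a sketch: the helper name `to_named_entities` is hypothetical, and the named-entity step is an assumption about how such a "smartening" program might behave. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from html.entities import codepoint2name      # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name     # Python 2.7


def to_named_entities(text):
    """Hypothetical helper: swap non-ASCII characters for named HTML entities."""
    pieces = []
    for char in text:
        name = codepoint2name.get(ord(char))
        if ord(char) > 127 and name:
            pieces.append(u"&%s;" % name)
        else:
            pieces.append(char)
    return u"".join(pieces)


little_o = u"\u00f3"  # the accented o from the poem

# Step 1: the UTF-8 bytes are misread as latin1, giving two characters.
step1 = little_o.encode("utf-8").decode("latin-1")

# Step 2: the two wrong characters are "smartened" into HTML entities.
step2 = to_named_entities(step1)

# Step 3: the ampersands themselves get escaped.
step3 = step2.replace(u"&", u"&amp;")
```

Each step is individually "reasonable", which is why this kind of damage tends to go unnoticed until it reaches a shipping label.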

Character Encoding & Unicode
How to (╯°□°)╯︵ ┻━┻ with dignity

Esther Nam & Travis Fischer
PyCon US 2014, Montréal

Uni-wat?!

┻━┻ ︵ヽ ノ( ┻━┻

How to (╯°□°)╯︵ ┻━┻ with dignity

“You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...”

– Luke Sneeringer | Program Committee Chair

Python 3 is out of scope

The Fundamentals of Unicode

Humans use text. Computers speak bytes.

a -> 01100001

Character   ASCII      ISO-8859-15 (latin-9)   CP-1252 (Windows 1252)   UTF-8
a           01100001   01100001                01100001                 01100001
€           NA         10100100                10000000                 11100010 10000010 10101100
¤           NA         NA                      10100100                 11000010 10100100
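The table can be regenerated with the standard codecs. The sketch below is illustrative: the helper name `to_bits` is made up here, and the codec names are the aliases Python uses for the four encodings in the table.

```python
ENCODINGS = ["ascii", "iso8859-15", "cp1252", "utf-8"]


def to_bits(char, encoding):
    """Return the encoded bytes as groups of 8 bits, or "NA" if unencodable."""
    try:
        encoded = bytearray(char.encode(encoding))
    except UnicodeEncodeError:
        return "NA"
    return " ".join(format(byte, "08b") for byte in encoded)


# One row per character: a, euro sign, generic currency sign.
for char in [u"a", u"\u20ac", u"\u00a4"]:
    row = [to_bits(char, encoding) for encoding in ENCODINGS]
    print(" | ".join(row))
```

Note that the byte 10100100 means € in latin-9 but ¤ in CP-1252, which is exactly the kind of overlap that causes silent corruption.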

π — ‽ ☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊

☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☖ ☗ ☘ ☙

☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧

☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛ ♜ ♝ ♞ ♟ ♠ ♡ ♢ ♣ ♤ ♥

♦ ♧ ♨ ♩ ♪ ♫ ♬ ♭ ♮♯ ♰ ♲ ♳ ♻ ♼ ♽ ♾ ⚀ ⚁ ⚂ ⚃ ⚄ ⚅ ⚆ ⚇ ⚈

Unicode

Character -> Unicode Code Point
a -> U+0061 (LATIN SMALL LETTER A)
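The mapping from a character to its code point and official name can be looked up with the standard library. A minimal sketch, where `describe` is a hypothetical helper name:

```python
import unicodedata


def describe(char):
    """Format a single character as 'U+XXXX NAME'."""
    return "U+%04X %s" % (ord(char), unicodedata.name(char))


print(describe(u"a"))
print(describe(u"\u20ac"))
```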

Computers speak bytes.

Unicode

Unicode Code Point -> Binary Encoding
a: U+0061 -> 01100001

UTF-8
Unicode Transformation Format

Unicode != UTF-8

Unicode: Code Points, e.g. U+0061
UTF-8: Binary Encoding, e.g. 01100001
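One way to see that Unicode and UTF-8 are different things: a single code point has several binary forms, depending on which transformation format is chosen. A small sketch using the euro sign (the variable names are illustrative):

```python
euro = u"\u20ac"  # one code point, U+20AC

utf8 = euro.encode("utf-8")        # three bytes
utf16 = euro.encode("utf-16-be")   # two bytes
utf32 = euro.encode("utf-32-be")   # four bytes

# Different bytes on disk, same code point once decoded.
same_text = (utf8.decode("utf-8")
             == utf16.decode("utf-16-be")
             == utf32.decode("utf-32-be")
             == euro)
```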

Layers of Abstraction

• Display (Glyphs | Fonts): Let them eat cake!
• Text (Unicode | Code Points): U+0061
• Storage (Binary | UTF-8): 01100001

Unicode & Python [Python 2.7]

str type

>>> euro_bytestring = '€'
>>> type(euro_bytestring)
<type 'str'>

[Python 2.7]

unicode type

>>> # € code point
>>> euro_unicode = u'\u20ac'
>>> type(euro_unicode)
<type 'unicode'>

[Python 2.7]

Unicode (code points): u'\u20ac'
Bytes (UTF-8): '\xe2\x82\xac'

Bytes -> Unicode: '\xe2\x82\xac'.decode('utf8')
Unicode -> Bytes: u'\u20ac'.encode('utf8')

[Python 2.7]
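The two directions can be checked as a round trip. This sketch uses explicit `b''` and `u''` literals so that it behaves the same on Python 2.7 and Python 3; the variable names are illustrative.

```python
euro_bytes = b"\xe2\x82\xac"            # storage: three UTF-8 bytes

euro_text = euro_bytes.decode("utf8")   # bytes -> unicode (decode)
round_trip = euro_text.encode("utf8")   # unicode -> bytes (encode)

byte_length = len(euro_bytes)   # counts bytes
text_length = len(euro_text)    # counts code points
```

The differing lengths are a handy sanity check when it is unclear which of the two types a variable holds.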

You CANNOT infer an encoding from a bytestring
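A short sketch of why the encoding has to come from outside the data: the same bytes decode without error under several encodings and yield different text. The byte string here is a made-up example.

```python
raw = b"9400\xa4"  # hypothetical price field of unknown origin

as_latin9 = raw.decode("iso8859-15")   # ends with the euro sign
as_cp1252 = raw.decode("cp1252")       # ends with the generic currency sign

# Both decodes succeed, so nothing in the bytes says which one was intended.
ambiguous = as_latin9 != as_cp1252
```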

#! /usr/bin/python
# -*- coding: utf8 -*-

# Opened file should be latin-1 encoded!
# If it’s not, call tech support ASAP
with open("input_file.csv") as input_file:

Date: Wed, 11 Apr 2014 11:15:55 -0600
To: [email protected]
From: [email protected]
Subject: Character encoding
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD …>
<html xmlns="http://www.w3.org/1999/xhtml" …>

Best Practices

Example Application

Author          Review
G. van Rossum   If you decide to design your own car there are thousands sort of car…
R. Ebert        Every great car should feel new every time you drive it.
L. Torvalds     Volvo isn’t evil, they just make really crappy cars.

Application Processes Text

PSQL

Encoding: Windows 1252 (CP-1252)

Montreal -> Montréal

psql=# set server_encoding to "utf-8";

Sample Review Text

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:

    author, date, review_text = row_text.split(",")
    converted_review = review_text.replace("Montreal", "Montréal")
    DB.insert(ReviewTable, author, date, converted_review)

[Python 2.7]

Output from UTF-8 encoded PSQL database

My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.

Original CP-1252 Data

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.

Mixed CP-1252 & UTF-8

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.

Interpreted as UTF-8 by database

My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
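The three stages can be reproduced without a database. This sketch assumes that reading the mixed bytes leniently as UTF-8 is a fair stand-in for what the database output showed; the variable names are illustrative.

```python
original_text = (u"My friend said: \u201cI cannot believe this is a Volvo! "
                 u"I had a car just like this when I lived in Montreal.\u201d "
                 u"He told me he had paid 9400\u20ac for his.")

# Stage 1: the file on disk holds CP-1252 bytes.
cp1252_bytes = original_text.encode("cp1252")

# Stage 2: the script's source is UTF-8, so its replacement literal is UTF-8
# bytes. The row now mixes two encodings.
mixed_bytes = cp1252_bytes.replace(b"Montreal",
                                   u"Montr\u00e9al".encode("utf-8"))

# Stage 3: everything is read back as UTF-8. The CP-1252 quote and euro bytes
# are invalid UTF-8 and become U+FFFD replacement characters.
as_seen_by_database = mixed_bytes.decode("utf-8", "replace")
replacement_count = as_seen_by_database.count(u"\ufffd")
```

Only the new "é" survives, because it is the only non-ASCII character that was actually written as UTF-8.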

Best Practice #1: Know your encodings

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable, author, date, converted_review)

[Python 2.7]

Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
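In Python 2.7, calling `decode()` with no argument falls back to the default ASCII codec, which is why the traceback names 'ascii'. The sketch below passes the codec explicitly so it behaves the same on Python 3, where the no-argument default is UTF-8 instead.

```python
quote_byte = b"\x93"  # the opening curly quote in CP-1252

try:
    quote_byte.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    # 0x93 is outside the 7-bit ASCII range (0-127).
    decoded_as_ascii = False

# Naming the real encoding of the data makes the decode succeed.
decoded_as_cp1252 = quote_byte.decode("cp1252")
```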

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))

[Python 2.7]

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.

Best Practice #2: Use the Unicode Sandwich

Decode as early as possible.
Unicode everywhere in the middle.
Encode as late as possible.
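A file-to-file version of the sandwich might look like the sketch below. It swaps the database insert for a UTF-8 output file so that it is self-contained; `convert_reviews`, the file names, and the sample row are all hypothetical. `io.open` is used because it accepts an `encoding` argument on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def convert_reviews(source_path, target_path):
    # Bottom slice: decode as early as possible, right at the file boundary.
    with io.open(source_path, encoding="cp1252") as source:
        rows = [line.rstrip(u"\n").split(u",") for line in source]

    # Filling: only unicode objects in the middle.
    converted = [[field.replace(u"Montreal", u"Montr\u00e9al") for field in row]
                 for row in rows]

    # Top slice: encode as late as possible, right at the output boundary.
    with io.open(target_path, "w", encoding="utf-8") as target:
        for row in converted:
            target.write(u",".join(row) + u"\n")


# Usage with a made-up CP-1252 input row.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "reviews_cp1252.csv")
target_path = os.path.join(workdir, "reviews_utf8.csv")

with io.open(source_path, "wb") as handle:
    handle.write(u"R. Ebert,2014-04-11,Paid 9400\u20ac in Montreal\n".encode("cp1252"))

convert_reviews(source_path, target_path)

with io.open(target_path, "rb") as handle:
    result = handle.read()
```

Keeping the two encodings confined to the `io.open` calls means the code in the middle never has to know what is on disk.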

Best Practice #3: Test Your (Text Related) Code

Test encoding ranges & boundaries

test_strings = ['Hello Montreal!',
                'ǝɥןןɐǝɹʇuoɯ oן¡',
                'ђєɭɭ๏ ๓๏ภՇгєคɭ!']

func_under_test(test_strings)
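One way to pick boundary values is to take the code points where UTF-8 switches from one byte length to the next. The sketch below is an assumption about what a boundary test could look like; `utf8_length` stands in for the real function under test.

```python
def utf8_length(text):
    """Hypothetical function under test: size of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))


# (input, expected byte count) pairs on each side of the UTF-8 length boundaries.
BOUNDARY_CASES = [
    (u"\u007f", 1),        # last one-byte code point (ASCII)
    (u"\u0080", 2),        # first two-byte code point
    (u"\u07ff", 2),        # last two-byte code point
    (u"\u0800", 3),        # first three-byte code point
    (u"\uffff", 3),        # last three-byte code point
    (u"\U00010000", 4),    # first four-byte code point
    (u"\U0010ffff", 4),    # last valid code point
]

failures = [(text, expected, utf8_length(text))
            for text, expected in BOUNDARY_CASES
            if utf8_length(text) != expected]
```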

Test interfaces against both Python text types

test_bytes = 'I am a bytestring mwahaha'
test_unicode = u'ι αм υηι¢σ∂є!'

i_expect_unicode(test_bytes)
i_expect_bytes(test_unicode)
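An interface that tolerates both text types usually normalizes at its edge. A minimal sketch, assuming a hypothetical helper named `ensure_unicode` and that callers can say which encoding their bytes are in:

```python
def ensure_unicode(value, encoding="utf-8"):
    """Hypothetical helper: accept bytes or text, always return text."""
    if isinstance(value, bytes):       # `bytes` is `str` on Python 2.7
        return value.decode(encoding)
    return value


from_utf8_bytes = ensure_unicode(b"caf\xc3\xa9")
from_cp1252_bytes = ensure_unicode(b"caf\xe9", "cp1252")
from_text = ensure_unicode(u"caf\u00e9")
```

Feeding the same function both types in a test quickly shows whether it relies on an implicit ASCII conversion somewhere.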

Test handling of incorrect encoding

def ascii_handling_function(ascii_str):
    ...
    ascii_str.decode('ascii')
    ...

utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8')

with assertRaises(UnicodeDecodeError):
    line = ascii_handling_function(utf8_str)
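As a runnable sketch, the same check can be written as a `unittest` test case. The body of `ascii_handling_function` is elided on the slide, so the one-line version here is only a stand-in, and the class and test names are made up.

```python
import unittest


def ascii_handling_function(ascii_bytes):
    """Stand-in for the elided function: it only decodes as ASCII."""
    return ascii_bytes.decode("ascii")


class IncorrectEncodingTest(unittest.TestCase):

    def test_utf8_bytes_are_rejected(self):
        utf8_bytes = u"UTF-8 t\u00e9st".encode("utf8")
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_bytes)

    def test_ascii_bytes_are_accepted(self):
        self.assertEqual(ascii_handling_function(b"plain"), u"plain")


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(IncorrectEncodingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```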

Best Practices

1. Know your encodings
2. Use the Unicode sandwich
3. Test your (text related) code

Issues We Can’t Control

Incorrect encoding

Declared as “CP-1252”

Is actually “UTF-8”

unicode_row = row_text.decode("cp1252")

UnicodeDecodeError
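A sketch of why UTF-8 data decoded as CP-1252 only sometimes raises: Python's cp1252 codec leaves a handful of byte values undefined, and the UTF-8 form of the closing curly quote happens to contain one of them (0x9d). Other characters decode "successfully" into mojibake instead.

```python
closing_quote_utf8 = u"\u201d".encode("utf-8")   # ends in byte 0x9d
opening_quote_utf8 = u"\u201c".encode("utf-8")   # ends in byte 0x9c

try:
    closing_quote_utf8.decode("cp1252")
    closing_quote_failed = False
except UnicodeDecodeError:
    # 0x9d has no character assigned in Python's cp1252 table.
    closing_quote_failed = True

# 0x9c is assigned, so this decode succeeds and quietly yields three
# wrong characters.
silent_mojibake = opening_quote_utf8.decode("cp1252")
```

An exception is the lucky case here; the silent variant is what ends up stored and displayed.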

How to Deal

• Ask
• Guess (with chardet library)
• You wrote tests, right?
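To show what "guessing" means without a third-party dependency, the sketch below uses plain trial decoding. This is not how chardet works (chardet's `detect()` function reports an encoding together with a confidence score based on statistical analysis); `guess_encoding` and its candidate list are assumptions made for this example.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate that decodes `data` without error, else None.

    Order matters: strict encodings go first, because a permissive
    single-byte encoding such as cp1252 accepts almost any input.
    """
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


guess_plain = guess_encoding(b"Montreal")
guess_utf8 = guess_encoding(u"Montr\u00e9al".encode("utf-8"))
guess_cp1252 = guess_encoding(b"Montr\xe9al")
```

A successful decode only proves the bytes are valid in that encoding, not that the guess is right, which is why asking comes first on the list.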

Mixed encodings or corrupted bytes

John Smithâ€™s Autoplex

Broken text&hellip; it&#x2019;s fantastic!

Hello ^[[30m; World

MOJIBAKE

u"John Smithâ€™s Autoplex"

>>> u'John Smithâ€™s Autoplex'.encode('cp1252')
'John Smith\xe2\x80\x99s Autoplex' (bytestring)

>>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8')
u'John Smith’s Autoplex'
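The encode-then-decode repair above can be wrapped in a small function. This is a sketch for the single case shown (UTF-8 text that was wrongly decoded as CP-1252); `undo_cp1252_mojibake` is a hypothetical name, and text that does not fit the pattern is returned unchanged.

```python
def undo_cp1252_mojibake(text):
    """Reverse one round of 'UTF-8 bytes decoded as CP-1252', if possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake, so leave the text alone.
        return text


broken = u"John Smith\u00e2\u20ac\u2122s Autoplex"
repaired = undo_cp1252_mojibake(broken)
untouched = undo_cp1252_mojibake(u"Montr\u00e9al")
```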

UTF8: U+2019 ’ -> \xe2\x80\x99

CP1252: \xe2\x80\x99 -> U+00e2 â, U+20ac €, U+2122 ™

str_dealer = 'John Smith\xe2\x80\x99s Autoplex'  # UTF-8 encoded bytestring

def manually_convert_encoding(str_dealer):
    """
    Manually replace incorrect, UTF8-encoded bytes
    with CP1252 bytes for the same character
    """
    # replace() returns a new string, so keep each result
    str_dealer = str_dealer.replace('\xe2\x80\x98', '\x91')  # ‘
    str_dealer = str_dealer.replace('\xe2\x80\x99', '\x92')  # ’
    str_dealer = str_dealer.replace('\xe2\x80\x9c', '\x93')  # “
    str_dealer = str_dealer.replace('\xe2\x80\x9d', '\x94')  # ”
    str_dealer = str_dealer.replace('\xe2\x80\x94', '\x97')  # —
    str_dealer = str_dealer.replace('\xe2\x84\xa2', '\x99')  # ™
    str_dealer = str_dealer.replace('\xe2\x82\xac', '\x80')  # €
    return str_dealer

python-ftfy fixes mojibake

>>> dealer_name = u"John Smithâ€™s Autoplex"
>>> from ftfy import fix_text
>>> fix_text(dealer_name)
u"John Smith's Autoplex"

Target encoding can’t handle source data

Source Data (UTF-8) -> ? -> Target Application Data (CP-1252)

>>> u'☃ Brrrr!'.encode('cp1252', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>

>>> u'☃ Brrrr!'.encode('cp1252', 'ignore')
' Brrrr!'

>>> u'☃ Brrrr!'.encode('cp1252', 'replace')
'? Brrrr!'

[Python 2.7]
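Besides 'strict', 'ignore' and 'replace', the standard codecs also accept handlers that keep the lost character recoverable. The sketch below compares them side by side; which one is acceptable depends on what the target application can tolerate.

```python
snowman = u"\u2603 Brrrr!"

# Encode the same text with each lossy or escaping error handler.
results = {
    handler: snowman.encode("cp1252", handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

try:
    snowman.encode("cp1252", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

'xmlcharrefreplace' and 'backslashreplace' keep the code point visible in the output, so the original character can still be reconstructed downstream.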

U+0004
END OF TRANSMISSION

Thank you ツ

Cars.com / NewCars.com Tech Team
SoCal Piggies
Ned Batchelder (for his Pragmatic Unicode talk)

The fundamentals

Pragmatic Unicode
http://nedbatchelder.com/text/unipain.html

The Absolute Minimum You Must Know
http://www.joelonsoftware.com/articles/Unicode.html

Chapter on Strings in “Dive into Python” by Mark Pilgrim
http://getpython3.com/diveintopython3/strings.html

General questions, relating to UTF or Encoding Form
http://www.unicode.org/faq/utf_bom.html

Unicode HOWTO (Python 2.7)
http://docs.python.org/2/howto/unicode.html

Further reading

“Just what the dickens is ‘Unicode’?”
https://pythonhosted.org/kitchen/unicode-frustrations.html

Differences between these commonly confused encodings
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

“Latin-1” in MySQL is more like “CP-1252”
https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Why it's important to write tests with character boundary values
http://labs.spotify.com/2013/06/18/creative-usernames/

Tools

chardet
https://pypi.python.org/pypi/chardet

python-ftfy
https://github.com/LuminosoInsight/python-ftfy

@estherbester @travisfischer

Slides at http://bit.ly/flip_tables

IRC