Character Encoding & Unicode - How to (╯°□°）╯︵ ┻━┻ with dignity

ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&ATILDE;&SUP3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&ATILDE;&SUP3;"
became "&AMP;AMP;ATILDE;&AMP;AMP;SUP3;"

Esther Nam & Travis Fischer
PyCon US 2014, Montréal

Uni-wat?!

┻━┻ ︵ヽﾉ（ ┻━┻

How to (╯°□°）╯︵ ┻━┻ with dignity

“You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...”

– Luke Sneeringer | Program Committee Chair

Python 3 is out of scope

The Fundamentals of Unicode

Humans use text. Computers speak bytes.

a -> 01100001

     ASCII      ISO-8859-15   CP-1252          UTF-8
                (latin-9)     (Windows 1252)

a    01100001   01100001      01100001         01100001
€    NA         10100100      10000000         11100010 10000010 10101100
¤    NA         NA            10100100         11000010 10100100
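The rows of this table can be reproduced with Python's built-in codecs. What follows is a minimal sketch, not part of the original slides: the helper name `bits` is hypothetical, the codec names `ascii`, `iso-8859-15`, `cp1252` and `utf-8` are the standard library's, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
ENCODINGS = ['ascii', 'iso-8859-15', 'cp1252', 'utf-8']


def bits(char, encoding):
    """Return the encoded bytes of `char` as bit strings, or 'NA'."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        # This encoding has no byte sequence for the character.
        return 'NA'
    # bytearray yields integers on both Python 2 and Python 3.
    return ' '.join(format(byte, '08b') for byte in bytearray(encoded))


# a, EURO SIGN, CURRENCY SIGN
for char in [u'a', u'\u20ac', u'\u00a4']:
    row = [bits(char, encoding) for encoding in ENCODINGS]
    print(' | '.join(row))
```

The same byte, 10100100, means € in ISO-8859-15 and ¤ in CP-1252, which is why the encoding has to be known before bytes can be read as text.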

π — ‽ ☠ ☁ ☂ ☃ ☄ ★ ☆ ☇ ☈ ☉ ☊

☋ ☌ ☍ ☎ ☏ ☐ ☑ ☒ ☓ ☖ ☗ ☘ ☙

☚ ☛ ☜ ☝ ☞ ☟ ☠ ☡ ☢ ☣ ☤ ☥ ☦ ☧

☨ ☩ ☪ ☫ ☬ ☭ ☮ ☯ ☸ ☹ ☺ ☻ ☼ ☽ ☾ ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛ ♜ ♝ ♞ ♟ ♠ ♡ ♢ ♣ ♤ ♥

♦ ♧ ♨ ♩ ♪ ♫ ♬ ♭ ♮♯ ♰ ♲ ♳ ♻ ♼ ♽ ♾ ⚀ ⚁ ⚂ ⚃ ⚄ ⚅ ⚆ ⚇ ⚈

Unicode

a -> U+0061
(Character -> Unicode Code Point)

U+0061 is named LATIN SMALL LETTER A.

Computers speak bytes.

Unicode

U+0061 -> 01100001
(Unicode Code Point -> Binary Encoding)

UTF-8: Unicode Transformation Format

Unicode != UTF-8

Unicode: Code Points (U+0061)
UTF-8: Binary Encoding (01100001)

Layers of Abstraction

• Display (Glyphs | Fonts) Let them eat cake!

• Text (Unicode | Code Points) U+0061

• Storage (Binary | UTF-8) 01100001
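The text and storage layers can be inspected directly from Python. This is a small sketch that is not in the original slides; it uses only the standard library (`ord`, `unicodedata.name`, `encode`) and is written to run on both Python 2.7 and Python 3. The display layer (glyphs and fonts) is handled by the terminal or editor, so it is not shown.

```python
# -*- coding: utf-8 -*-
import unicodedata

char = u'a'

# Text layer: the character is identified by an integer code point.
code_point = ord(char)
name = unicodedata.name(char)

# Storage layer: the code point becomes bytes only once an encoding is chosen.
stored = char.encode('utf-8')

print('U+%04X' % code_point)
print(name)
print(format(bytearray(stored)[0], '08b'))
```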

Unicode & Python

[Python 2.7]

str type

>>> euro_bytestring = '€'
>>> type(euro_bytestring)
<type 'str'>

[Python 2.7]

unicode type

# € code point
>>> euro_unicode = u'\u20ac'
>>> type(euro_unicode)
<type 'unicode'>

[Python 2.7]

Unicode (code points): u'\u20ac'

Bytes (UTF-8): '\xe2\x82\xac'

[Python 2.7]

'\xe2\x82\xac'.decode('utf8')

[Python 2.7]

u'\u20ac'.encode('utf8')
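The two calls above are inverses of each other. A minimal round-trip sketch, not from the original slides; the `b''` prefix is used so the same lines run on Python 2.7 (where it is the plain `str` type) and on Python 3:

```python
euro_unicode = u'\u20ac'        # text: a single code point
euro_bytes = b'\xe2\x82\xac'    # bytes: the UTF-8 encoding of that code point

# encode goes from text to bytes, decode goes from bytes to text.
assert euro_unicode.encode('utf8') == euro_bytes
assert euro_bytes.decode('utf8') == euro_unicode

# One character of text is stored as three bytes in UTF-8.
assert len(euro_unicode) == 1
assert len(euro_bytes) == 3
```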

[Python 2.7]

Mnemonic (these are not real methods): Bytes get dEcoded, UNicode gets eNcoded.

'\xe2\x82\xac'.becode('utf8')

u'\u20ac'.uncode('utf8')

[Python 2.7]

You CANNOT infer an encoding from a bytestring
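To see why, note that the same bytes decode without error under several encodings and give different text each time. A short sketch, not from the original slides, written to run on both Python 2.7 and Python 3:

```python
raw = b'\xe2\x82\xac'  # three bytes of unknown origin

as_utf8 = raw.decode('utf-8')      # U+20AC: one character, the euro sign
as_cp1252 = raw.decode('cp1252')   # U+00E2 U+201A U+00AC: three characters
as_latin1 = raw.decode('latin-1')  # U+00E2 U+0082 U+00AC: three characters

for label, text in [('utf-8', as_utf8),
                    ('cp1252', as_cp1252),
                    ('latin-1', as_latin1)]:
    print(label, len(text), [hex(ord(char)) for char in text])
```

None of the three calls fails, so nothing in the bytes themselves says which reading is the intended one. The encoding has to be declared somewhere outside the data.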

#! /usr/bin/python
# -*- coding: utf8 -*-

# Opened file should be latin-1 encoded!
# If it’s not, call tech support ASAP
with open("input_file.csv") as input_file:

Date: Wed, 11 Apr 2014 11:15:55 -0600
To: foo@bar.com
From: bar@foo.com
Subject: Character encoding
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD …>
<html xmlns="http://www.w3.org/1999/xhtml" …>

Best Practices

Example Application

Author          Review

G. van Rossum   If you decide to design your own car there are thousands sort of car…
R. Ebert        Every great car should feel new every time you drive it.
L. Torvalds     Volvo isn’t evil, they just make really crappy cars.

Application Processes Text

Author / Review data
Encoding: Windows 1252 (CP-1252)

Montreal -> Montréal

psql=# set server_encoding to "utf-8";

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
(Sample Review Text)

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    author, date, review_text = row_text.split(",")
    converted_review = review_text.replace("Montreal", "Montréal")
    DB.insert(ReviewTable, author, date, converted_review)

[Python 2.7]

My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
(Output from UTF-8 encoded PSQL database)

[Python 2.7]

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
(Original CP-1252 Data)

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.
(Mixed CP-1252 & UTF-8)

My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
(Interpreted as UTF-8 by database)
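The three stages above can be replayed on a shortened row. This is a sketch under stated assumptions, not the original application: the sample row is abbreviated, and decoding with the `replace` error handler stands in for how the database displayed the invalid bytes. It runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Stage 1: the file on disk is CP-1252, so quotes and the euro are single bytes.
cp1252_row = u'\u201cMontreal\u201d 9400\u20ac'.encode('cp1252')

# Stage 2: the source file is declared UTF-8, so the literal "Montréal"
# is spliced in as UTF-8 bytes. The row now mixes two encodings.
replacement = u'Montr\xe9al'.encode('utf-8')
mixed_row = cp1252_row.replace(b'Montreal', replacement)

# Stage 3: a reader that assumes UTF-8 understands only the spliced part.
# Each stray CP-1252 byte becomes U+FFFD, the replacement character.
shown = mixed_row.decode('utf-8', 'replace')
print(repr(shown))
```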

Best Practice #1: Know your encodings

[Python 2.7]

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(",")
    converted_review = review_text.replace("Montreal", "Montréal")
    DB.insert(ReviewTable, author, date, converted_review)

[Python 2.7]

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode()
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable, author, date, converted_review)

Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)

[Python 2.7]

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable, author, date, converted_review)
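The failure and the fix can be seen in isolation on a single row. This is a minimal sketch with a shortened sample row, not the original application. In Python 2.7, `decode()` with no argument falls back to ASCII; Python 3 defaults to UTF-8 instead, so the sketch names `'ascii'` explicitly to behave the same on both.

```python
row_text = b'My friend said: \x93I cannot believe this is a Volvo!\x94'

# Decoding with the wrong codec fails on the first CP-1252 curly quote.
try:
    row_text.decode('ascii')
    bad_position = None
except UnicodeDecodeError as error:
    bad_position = error.start

# Decoding with the encoding the file really uses succeeds.
unicode_row = row_text.decode('cp1252')

print(bad_position)
print(repr(unicode_row))
```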

[Python 2.7]

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))

My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.

Best Practice #2: Use the Unicode Sandwich

Decode as early as possible.
Unicode everywhere in the middle.
Encode as late as possible.

[Python 2.7]

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    # Decode as early as possible.
    unicode_row = row_text.decode("cp1252")
    # Unicode everywhere in the middle.
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    # Encode as late as possible.
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))
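The same sandwich can be sketched without a file or database so that it can be run and tested on its own. The function name `convert_rows`, its parameters, and the sample row are hypothetical, introduced only for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
def convert_rows(raw_rows, source_encoding='cp1252', target_encoding='utf-8'):
    """Decode on input, work in unicode, encode on output."""
    encoded_rows = []
    for raw_row in raw_rows:
        # Bottom slice: decode bytes as soon as they enter the program.
        unicode_row = raw_row.decode(source_encoding)

        # Filling: every operation in the middle uses unicode text only.
        author, date, review_text = unicode_row.split(u',', 2)
        review_text = review_text.replace(u'Montreal', u'Montr\xe9al')

        # Top slice: encode only when the data leaves the program.
        fields = [author, date, review_text]
        encoded_rows.append([field.encode(target_encoding) for field in fields])
    return encoded_rows


# Made-up CP-1252 sample row: \x80 is the euro sign in that encoding.
rows = [b'R. Ebert,2014-04-11,I lived in Montreal. It cost 9400\x80']
result = convert_rows(rows)
print(result)
```

Keeping both encodings as parameters puts the "know your encodings" decision in one visible place instead of scattering it through the processing code.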

[Python 2.7]

Best Practice #3: Test Your (Text Related) Code

Test encoding ranges & boundaries

test_strings = ['Hello Montreal!',
                'ǝɥןןɐǝɹʇuoɯ oן¡',
                'ђєɭɭ๏ ๓๏ภՇгєคɭ!']

func_under_test(test_strings)

Test interfaces against both Python text types

test_bytes = 'I am a bytestring mwahaha'
test_unicode = u'ι αм υηι¢σ∂є!'

i_expect_unicode(test_bytes)
i_expect_bytes(test_unicode)

Test handling of incorrect encoding

def ascii_handling_function(ascii_str):
    ...
    ascii_str.decode('ascii')
    ...

utf8_str = u'UՇF-8 ՇєsՇ'.encode('utf8')

with assertRaises(UnicodeDecodeError):
    line = ascii_handling_function(utf8_str)
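The incorrect-encoding test can be written out as a runnable `unittest` case. This is a sketch, not the original test suite: `ascii_handling_function` is reduced to the single `decode` call shown on the slide, and the class name `AsciiHandlingTest` and the helper `run_tests` are hypothetical. It runs on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import unittest


def ascii_handling_function(ascii_str):
    """Stand-in for the function under test: accepts ASCII bytes only."""
    return ascii_str.decode('ascii')


class AsciiHandlingTest(unittest.TestCase):
    def test_accepts_ascii_bytes(self):
        self.assertEqual(ascii_handling_function(b'Hello Montreal!'),
                         u'Hello Montreal!')

    def test_rejects_utf8_bytes(self):
        # UTF-8 bytes for a non-ASCII character must not pass silently.
        utf8_str = u'Montr\xe9al'.encode('utf8')
        with self.assertRaises(UnicodeDecodeError):
            ascii_handling_function(utf8_str)


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(AsciiHandlingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_tests()
```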

Best Practices

1. Know your encodings
2. Use the Unicode sandwich
3. Test your (text related) code

Issues We Can’t Control

Incorrect encoding

Author / Review data

Declared as “CP-1252”
Is actually “UTF-8”

# -*- coding: utf-8 -*-

reviews_file = open("reviews_file.csv")

for row_text in reviews_file:
    unicode_row = row_text.decode("cp1252")
    author, date, review_text = unicode_row.split(u",")
    converted_review = review_text.replace(u"Montreal", u"Montréal")
    DB.insert(ReviewTable,
              author.encode("utf8"),
              date.encode("utf8"),
              converted_review.encode("utf8"))

UnicodeDecodeError

How to Deal

• Ask
• Guess (with chardet library)
• You wrote tests, right?
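When asking is not possible, the idea of guessing can be illustrated with the standard library alone. This sketch is not chardet and does not use its statistical approach. It simply tries a list of candidate encodings in order; the function name `guess_encoding` and the candidate list are assumptions made for illustration. It runs on both Python 2.7 and Python 3.

```python
def guess_encoding(data, candidates=('ascii', 'utf-8', 'cp1252')):
    """Return the first candidate that decodes `data` without error, else None."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


print(guess_encoding(b'Montreal'))
print(guess_encoding(b'Montr\xc3\xa9al'))
print(guess_encoding(b'Montr\xe9al'))
```

A successful decode proves only that the bytes are valid in that encoding, not that the guess is right. CP-1252 accepts almost any byte sequence, so it comes last in the list, and the result should still be checked by tests or by a person.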

Mixed encodings or corrupted bytes

John Smithâ€™s Autoplex

Broken text… it’s fantastic!

Hello ^[[30m; World

MOJIBAKE

u"John Smithâ€™s Autoplex"

>>> u'John Smithâ€™s Autoplex'.encode('cp1252')
'John Smith\xe2\x80\x99s Autoplex' (bytestring)

>>> 'John Smith\xe2\x80\x99s Autoplex'.decode('utf8')
u'John Smith’s Autoplex'
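The two-step repair can be checked with explicit code points, which avoids depending on how the mojibake is typed or displayed. A minimal sketch, not from the original slides, that runs on both Python 2.7 and Python 3:

```python
# "John Smithâ€™s Autoplex" spelled with explicit code points:
# U+00E2 (â), U+20AC (€), U+2122 (™).
mojibake = u'John Smith\xe2\u20ac\u2122s Autoplex'

# Step 1: undo the wrong decode by encoding back to CP-1252 bytes.
misread_bytes = mojibake.encode('cp1252')

# Step 2: decode those bytes with the encoding they were really written in.
repaired = misread_bytes.decode('utf-8')

print(repr(misread_bytes))
print(repr(repaired))
```

This works only when the whole string went through exactly one wrong decode. Strings that mix encodings or were mangled more than once need a different approach.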

U+2019 ’

UTF-8 bytes: \xe2\x80\x99

The same bytes read as CP1252:

\xe2 -> U+00e2 â
\x80 -> U+20ac €
\x99 -> U+2122 ™

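# UTF-8 bytes that display as "John Smithâ€™s Autoplex" when read as CP1252
str_dealer = 'John Smith\xe2\x80\x99s Autoplex'

def manually_convert_encoding(str_dealer):
    """
    Manually replace incorrect, UTF8-encoded bytes
    with CP1252 bytes for the same character
    """
    # replace() returns a new string, so each result must be reassigned.
    str_dealer = str_dealer.replace('\xe2\x80\x98', '\x91')  # ‘
    str_dealer = str_dealer.replace('\xe2\x80\x99', '\x92')  # ’
    str_dealer = str_dealer.replace('\xe2\x80\x9c', '\x93')  # “
    str_dealer = str_dealer.replace('\xe2\x80\x9d', '\x94')  # ”
    str_dealer = str_dealer.replace('\xe2\x80\x94', '\x97')  # —
    str_dealer = str_dealer.replace('\xe2\x84\xa2', '\x99')  # ™
    str_dealer = str_dealer.replace('\xe2\x82\xac', '\x80')  # €
    return str_dealer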

>>> dealer_name = u"John Smithâ€™s Autoplex"
>>> from ftfy import fix_text
>>> fix_text(dealer_name)
u"John Smith's Autoplex"

python-ftfy fixes mojibake

Target encoding can’t handle source data

Source Data (UTF-8)

Target Application Data (CP-1252)

>>> u'☃ Brrrr!'.encode('cp1252', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>

[Python 2.7]

>>> u'☃ Brrrr!'.encode('cp1252', 'ignore')
' Brrrr!'

[Python 2.7]

>>> u'☃ Brrrr!'.encode('cp1252', 'replace')
'? Brrrr!'
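The three error handlers can be compared side by side. This sketch also includes `xmlcharrefreplace`, a standard handler that is not shown on the slides and may suit targets that render HTML or XML. It runs on both Python 2.7 and Python 3.

```python
snowman = u'\u2603 Brrrr!'

# 'strict' (the default) refuses to lose data and raises instead.
try:
    snowman.encode('cp1252', 'strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops the character; 'replace' substitutes '?'.
ignored = snowman.encode('cp1252', 'ignore')
replaced = snowman.encode('cp1252', 'replace')

# Not on the slides: keeps the information as a numeric character reference.
as_xml = snowman.encode('cp1252', 'xmlcharrefreplace')

print(strict_failed, ignored, replaced, as_xml)
```

Whichever handler is chosen, the loss or substitution is a deliberate decision made at the encoding boundary rather than an accident further downstream.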

[Python 2.7]

U+0004

END OF TRANSMISSION

Cars.com / NewCars.com Tech Team

SoCal Piggies

Ned Batchelder (for his Pragmatic Unicode talk)

Thank you ツ

Pragmatic Unicode
http://nedbatchelder.com/text/unipain.html

The Absolute Minimum You Must Know
http://www.joelonsoftware.com/articles/Unicode.html

Chapter on Strings in “Dive into Python” by Mark Pilgrim
http://getpython3.com/diveintopython3/strings.html

General questions, relating to UTF or Encoding Form
http://www.unicode.org/faq/utf_bom.html

Unicode HOWTO (Python 2.7)
http://docs.python.org/2/howto/unicode.html

The fundamentals

“Just what the dickens is ‘Unicode’?” https://pythonhosted.org/kitchen/unicode-frustrations.html

Differences between these commonly confused encodings
http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

“Latin-1” in MySQL is more like “CP-1252”
https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html

Why it's important to write tests with character boundary values
http://labs.spotify.com/2013/06/18/creative-usernames/
