Using Unicode with PHP

Transcript of Using Unicode with PHP

1. USING UNICODE WITH PHP: translation, localization, and 100% less mojibake, guaranteed or your users won't come back!

2. The whole world uses the internet.
3. Why is internationalization important? (Charts: content language of websites; percentage of Internet users by language.)
4. Worse than no internationalization? Mojibake.
5. Unicode is the solution! Well, kind of: (1) different encodings; (2) OSs have different default implementations; (3) all software encodings have to match or convert. Unicode idea == simple. Unicode implementation == hard.
6. Back to basics: WHAT IS UNICODE?
7. Unicode /ˈyo͞onəˌkōd/, noun, computing: an international encoding standard for use with different languages and scripts, by which each letter, digit, or symbol is assigned a unique numeric value that applies across different platforms and programs.
8. In the beginning, there was ASCII.
9. Code pages: in which things get really weird.
10. ASCII vs. Unicode. ASCII maps one character directly to bits in memory (A -> 0100 0001). Unicode maps a character to an abstract code point (A -> U+0041). Representing characters differently. But how do we represent this in memory?
11. Encoding madness: UTF (Unicode Transformation Format) maps a code point to a byte sequence.
12. What is a character? (A + COMBINING RING, or A-RING?) How long is the string? (1) In bytes? (2) In code units? (3) In code points? (4) In graphemes?
13. Crash course in computer memory. Big endian systems: most significant bytes of a number in the upper left corner, decreasing significance. Little endian systems: most significant bytes of a number in the lower right, increasing significance.
14. Big endian? Little endian? You're hurting my brain. Hello -> U+0048 U+0065 U+006C U+006C U+006F. Big endian: 00 48 00 65 00 6C 00 6C 00 6F. Little endian: 48 00 65 00 6C 00 6C 00 6F 00. But... both are the same encoding of Unicode (UTF-16), just in a different byte order. Now I have a headache!
15. UTF-8 to the rescue! Hello in ANSI -> 48 65 6C 6C 6F. Hello in UTF-8 -> 48 65 6C 6C 6F.
16. Moral of the story: Unicode is a standard, not an implementation. Text is never plain. Every string has an encoding: from a file, from a db, from an HTTP POST or GET (or PUT or file upload). No encoding?
Start praying to the Mojibake gods. If you do web, use UTF-8.
17. Mojibake on rye with swiss: WHY DO YOU NEED UNICODE?
18. Helgi Þormar Þorbjörnsson
19. Laurence
20. More than just UTF-8: BEYOND STRINGS
21. I18N and L10N. Internationalization: adaptation of products for potential use virtually everywhere. Localization: addition of special features for use in a specific locale.
22. Date and time formats: 30 juin 2009 (fr_FR); 30.06.2009 (de_DE); Jun 30, 2009 (en_US). And don't forget the time zones!
23. Currency and numbers. fr_FR: 123 456 and 345 987,246. de_DE: 123.456 and 345.987,246. en_US: 123,456 and 345,987.246. French (France), Euro: 9 876 543,21 €. German (Germany), Euro: 9.876.543,21 €. English (United States), US Dollar: $9,876,543.21.
24. Collation (sorting). The letters A-Z can be sorted in a different order than in English: for example, in Lithuanian, "y" is sorted between "i" and "k". Combinations of letters can be treated as if they were one letter: for example, in traditional Spanish, "ch" is treated as a single letter and sorted between "c" and "d". Accented letters can be treated as minor variants of the unaccented letter: for example, "é" can be treated as equivalent to "e". Accented letters can be treated as distinct letters: for example, "Å" in Danish is treated as a separate letter that sorts just after "Z".
25. String translation. Translation is never one to one, especially when inserting items like numbers. Some languages have different grammars and formats for the strangest things. Usually translated strings are separated into messages and stored, then mapped depending on the locale. Large amounts of text need even more: separate tables in a database, files in directories, or more.
26. Layout and design: reading order (right to left, left to right, top to bottom); word order; cultural taboos (human images, for example).
27. 3.5 extensions for triple the pain! HOW TO UNICODE WITH PHP
28.
Upgrade to at least 5.3. No, really, I'm entirely serious. If you're not on 5.3, you're not ready for Unicode. At all. You have far bigger issues to deal with, like no security updates (oh, and the extensions and APIs you need either don't exist or won't work right).
29. Install the bare minimum: the intl extension (bundled since PHP 5.3); mbstring (if you need zend_multibyte support or on-the-fly conversion, but almost anything else it can do, intl does better); the iconv extension (optional, but excellent for dealing with files); pcre, which MUST have UTF-8 support (CHECK!).
30. PHP strings 101
31. C strings and encoding. char: 1 byte (usually 8 bits). char *: a pointer to an array of chars stored in memory. Can handle code page encodings, although you generally need special APIs for dealing with multibyte code pages. Usually null terminated... well, unless it's a binary string. Unix cleverly supports UTF-8 with APIs. Windows does not.
32. Introducing a new type: wchar_t. C90 standard (horribly ambiguous). Windows set it at 16 bits and defined A and W versions of everything. Unix set it at 32 bits. C11 and C++11 add char16_t and char32_t to fix the craziness. Non-portable, and API support is sketchy. Libraries to fix this exist. Few are cross-platform. Except for ICU, which just rocks.
33. Why do we care? PHP talks ONLY to ANSI APIs on Windows. PHP functions assume ASCII or binary encodings (except for a few special ones). Although most functions are now marked binary safe and don't flip out on null bytes within a string, some still assume a null-terminated string. String handling functions treat strings as a sequence of single-byte characters.
34. Non-stupid PHP functionality: utf8_encode (only ISO-8859-1 to UTF-8), utf8_decode (only UTF-8 to ISO-8859-1), html_entity_decode, htmlentities, htmlspecialchars_decode, htmlspecialchars.
35. C locales, or how to make servers cry. setlocale is per process. I will repeat that: setlocale sets PER PROCESS. Locales are slightly different on different OSs. Windows does not support UTF-8 properly.
36.
What setlocale will break: gettext extension, strtoupper, strtolower, number_format, money_format, ucfirst, ucwords, strftime.
37. INTL to the rescue! Wrapper around the excellent ICU library. Standardized locales; set the default locale per script. Number formatting; currency formatting; message formatting (replaces gettext); calendars, dates, timezones and time; Transliterator; Spoofchecker; resource bundles; converters; IDN support; graphemes; collation; iterators.
38. Some intl caveats. New stuff is only in newer PHP versions. All strings in and out must be UTF-8, except for UConverter. Intl doesn't yet support zend_multibyte. Intl doesn't support HTTP input/output conversion. Intl doesn't support function overloading.
39. mbstring: enables zend_multibyte support; supports transparent HTTP in and out encoding; provides some wrappers for functionality such as strtoupper (including overloading the PHP function version).
40. iconv: primarily for charset conversion; output buffer handler; MIME encoding functionality; conversion; some string helpers (len, substr, strpos, strrpos); stream filter: stream_filter_append($fp, 'convert.iconv.ISO-2022-JP/EUC-JP');
41. What do you mean MySQL is giving me garbage? BEYOND THE CODE
42. Browser considerations. Set Content-Type AND charset. Use HTTP headers AND meta tags (not just meta). Use accept-charset on forms to make sure your data is coming in right. JavaScript: string literals, regular expression literals, and any code unit can also be expressed via a Unicode escape sequence \uHHHH. Specify Content-Type AND charset headers for JavaScript!!
43. Databases: table/schema encoding and connection. MySQL: you need to set the charset right on the table AND set the charset right on the connection (NOT SET NAMES, it does not do enough) AND don't use the mysql extension; use mysqli or PDO. PostgreSQL: pg_set_client_encoding. Oracle: passed in the connect call. sqlite(3): make sure it was compiled with Unicode and the intl extension is available. sqlsrv/pdo_sqlsrv: CharacterSet in options.
44.
Other gotchas. Plain text is not plain text; files will have encodings. Files will be loaded as binary if you add the b flag to fopen (here's a hint: always use the b flag). You can convert files on the fly with the iconv filter. You cannot use Unicode file names with PHP on Windows at all (no, not even UTF-8) unless you find a 3rd-party PHP extension. Beware of sending anything but ASCII to exec, proc_open, and other command line calls.
45. The best and worst in PHP apps: CASE STUDIES
46. Applications. WordPress: gettext (sigh). Drupal: gettext files, but NOT the gettext API.
47. Frameworks. ZF and ZF2 (http://framework.zend.com/manual/1.12/en/performance.localization.html): multiple adapters; gettext allows using fast .po files, but doesn't use the setlocale/gettext extension. Symfony 1 and 2 (http://symfony.com/doc/current/book/translation.html): multiple formats to hold translations; doesn't use gettext.
48. Resources: http://www.joelonsoftware.com/articles/Unicode.html; http://unicode.org; http://www.slideshare.net/andreizm/the-good-the-bad-and-the-ugly-what-happened-to-unicode-and-php-6; http://php.net; http://www.2ality.com/2013/09/javascript-unicode.html; http://htmlpurifier.org/docs/enduser-utf8.html
49. My Little Project: get everything needed into intl from mbstring and iconv so you need only 1 solution. Stream filter from iconv; output handler from iconv; zend_multibyte support from mbstring; HTTP in and output conversion from mbstring; some simplified APIs to make overloading doable.
50. Contact: @auroraeosrose; http://emsmith.net; http://github.com/auroraeosrose. Freenode: #phpwomen, #phpmentoring, #php-gtk.
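Appendix sketch for slide 12 ("How long is the string?"): a minimal example, not taken from the slides, that counts one string four ways. The helper name string_lengths is hypothetical. It assumes the input is valid UTF-8, takes "code units" to mean UTF-16 code units, and uses only strlen and PCRE with UTF-8 support (the "CHECK!" item on slide 29) so it runs without intl or mbstring loaded.

```php
<?php
// Hypothetical helper (not from the slides): the four "lengths" of slide 12.
// Assumes $utf8 is valid UTF-8; preg_match_all() returns false otherwise.
function string_lengths($utf8)
{
    // 1. Bytes: strlen() counts raw bytes, whatever the encoding is.
    $bytes = strlen($utf8);

    // 3. Code points: with the /u modifier "." matches one whole code point
    //    (the /s modifier lets it match newlines too).
    $codePoints = preg_match_all('/./su', $utf8, $unused);

    // 2. UTF-16 code units: code points above U+FFFF need a surrogate pair,
    //    so each of them counts twice.
    $supplementary = preg_match_all('/[\x{10000}-\x{10FFFF}]/u', $utf8, $unused);
    $codeUnits = $codePoints + $supplementary;

    // 4. Graphemes: \X matches one grapheme cluster
    //    (a base character plus its combining marks).
    $graphemes = preg_match_all('/\X/u', $utf8, $unused);

    return array(
        'bytes'       => $bytes,
        'code_units'  => $codeUnits,
        'code_points' => $codePoints,
        'graphemes'   => $graphemes,
    );
}

// "A" followed by U+030A COMBINING RING ABOVE, written as UTF-8 bytes.
print_r(string_lengths("A\xCC\x8A"));
// bytes = 3, code_units = 2, code_points = 2, graphemes = 1

// Precomposed U+00C5 (A-RING) looks the same on screen but measures differently.
print_r(string_lengths("\xC3\x85"));
// bytes = 2, code_units = 1, code_points = 1, graphemes = 1
```

With the intl extension loaded, grapheme_strlen() should give the same grapheme count; the PCRE version is used here only to keep the sketch dependency-free.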