Character sets and iconv

This presentation is about character sets and the iconv library (with usage examples in PHP)

By Daniel Rhodes of Warp Asylum, http://www.warpasylum.co.uk

What is a character set?

A mapping: character x in human language y is stored as value z

Western European languages often use 8-bit ISO 8859-1

English possible in 7-bit ASCII!

Some languages have complex / numerous characters and need 2, 3 or even 4 bytes to represent one character!

So, many different character sets exist
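To make the "character x is value z" idea concrete, here is a minimal sketch that writes one character, é (U+00E9), as raw bytes in three encodings and counts the bytes. The choice of character and the variable names are illustrative assumptions, not taken from the original slides.

```php
<?php
// The same character, é (U+00E9), written out as raw bytes in three encodings.
$encodings = [
    'ISO-8859-1' => "\xE9",      // one byte
    'UTF-8'      => "\xC3\xA9",  // two bytes
    'UTF-16BE'   => "\x00\xE9",  // two bytes, but different values from UTF-8
];

$lengths = [];
foreach ($encodings as $name => $bytes) {
    $lengths[$name] = strlen($bytes); // strlen() counts bytes, not characters
    echo str_pad($name, 12), bin2hex($bytes), "\n";
}
```

Same character, three different byte sequences: this is why software has to know which character set a string is in.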

More about character sets

Even same language may have many different character sets

Character sets tend not to be compatible

So, conversion is necessary and useful

But Unicode is coming through as a modernising, unifying character set

Unicode is one HUGE character set that can be used to represent any character from any language!

Character sets? Who cares!

Anglophones very lucky as everything seems to just work (even if in the background different character sets are interacting)

English is not the only language!

An app expecting character set x but getting y (or an incorrect character set conversion) will result in mojibake

Mojibake? What's that?

A great Japanese word meaning garbled (bake) characters (moji)

Often encountered in Japanese computing with its two traditional character sets (Shift_JIS and EUC-JP), Unicode and a separate character set for emails (ISO-2022-JP)!

Shouldn't really happen at all in modern computing

But it still does, mostly due to lack of implementation knowledge

Mojibake in English

A slight case of mojibake here, the pound symbols (£) have been garbled
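This kind of garbling can be reproduced on purpose. A minimal sketch, assuming the pound sign arrives as UTF-8 but is wrongly treated as ISO-8859-1 (the variable names are illustrative):

```php
<?php
$utf8Pound = "\xC2\xA3"; // "£" encoded as UTF-8 (two bytes)

// Wrong assumption: treat those two bytes as ISO-8859-1 and convert to UTF-8.
// Each byte becomes its own character, so "£" turns into "Â£".
$garbled = iconv('ISO-8859-1', 'UTF-8', $utf8Pound);

// Reversing the mistaken conversion recovers the original bytes.
$repaired = iconv('UTF-8', 'ISO-8859-1', $garbled);

echo bin2hex($utf8Pound), ' -> ', bin2hex($garbled), ' -> ', bin2hex($repaired), "\n";
```

The repair only works here because nothing was lost along the way. Once a lossy conversion has dropped or replaced characters, the original cannot be recovered.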

Mojibake in German

More severe now, umlauted vowels (ä, ö and ü) have been garbled

Mojibake in Japanese

Ouch!

What is the iconv library?

API to convert between character sets

Works on strings

Some support for transliteration (changing / substituting characters in source character set that don't exist in target character set)

Your implementation may vary, but a HUGE number of character sets are supported

Some iconv use cases

Convert legacy character set to Unicode

Convert between backend and frontend character sets

Convert file's character set for import / export

Transliterate to remove unwanted characters

Transliterate to make safe for URL / filename

Let's look at some iconv usage examples in PHP...

What is PHP's iconv extension?

Interface to iconv library

See http://uk.php.net/manual/en/book.iconv.php

iconv library should be on your OS

If not, need to install it before using the PHP extension

See http://www.gnu.org/software/libiconv

iconv extension presence

phpinfo() will include an iconv section when the extension is loaded
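Presence can also be checked from code. A minimal sketch using extension_loaded() and the ICONV_IMPL / ICONV_VERSION constants that the extension defines (the variable name is an illustrative assumption):

```php
<?php
$hasIconv = extension_loaded('iconv');

if ($hasIconv) {
    // ICONV_IMPL names the underlying library (for example glibc or libiconv),
    // ICONV_VERSION gives its version string.
    echo 'iconv available: ', ICONV_IMPL, ' ', ICONV_VERSION, "\n";
} else {
    echo "iconv extension not loaded\n";
}
```

Knowing the implementation matters later on, because transliteration and //IGNORE behave differently between iconv libraries.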

A few directives

iconv.input_encoding currently unused

iconv.output_encoding for ob_iconv_handler() [iconv handler for PHP's output buffering]

iconv.internal_encoding for ob_iconv_handler(), iconv_mime_*() and iconv's string utility functions (which are present from PHP 5)

First play

Basic usage

iconv() is the conversion function

Pass it the input string's character set,

the desired output character set

and the input string

BUT within reason...
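The basic call described above might look like the following. This is a minimal sketch with an illustrative input string, given as raw bytes so that it does not depend on the encoding of the script file:

```php
<?php
$latin1 = "\xA3" . "10"; // "£10" in ISO-8859-1: the pound sign is the single byte 0xA3

// Argument order: input character set, output character set, input string.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), ' -> ', bin2hex($utf8), "\n"; // a33130 -> c2a33130

// Converting back gives the original bytes, since every character exists in both sets.
$roundTrip = iconv('UTF-8', 'ISO-8859-1', $utf8);
```

iconv() returns false on failure, so production code should check the return value before using it.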

Within reason

Character mapping

You might not get every character from set x present in set y

So what to do if character absent? Bomb out and return an empty string?

NO! iconv gives us a few options

Let's look at transliteration first...

First transliteration

Transliteration

Append //TRANSLIT to output character set as passed to iconv()

Approximates characters not present in output character set with closest equivalent

Closest equivalent might simply be '?' for wildly different character sets
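A hedged sketch of //TRANSLIT in use. The exact replacements depend on the iconv implementation and on the current locale, so no expected output is shown. The locale names passed to setlocale() are assumptions and may not exist on your system, and some implementations reject //TRANSLIT altogether, in which case iconv() returns false.

```php
<?php
// Assumed locale names; setlocale() tries each in turn and returns false if none exist.
setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

$input = "caf\xC3\xA9 \xE2\x82\xAC" . "5"; // "café €5" in UTF-8

// Ask for plain ASCII, approximating characters that ASCII lacks.
// @ suppresses the notice or warning raised when conversion is not possible.
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);

if ($ascii === false) {
    echo "transliteration not available on this system\n";
} else {
    echo $ascii, "\n"; // implementation dependent, e.g. é may become e, 'e or ?
}
```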

More realistic transliteration

Ignore option

We can also append //IGNORE to the output character set as passed to iconv()

This will simply skip over any characters that are absent from the output character set

Ignore example
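A minimal sketch of //IGNORE, not the original slide's example. Because the PHP extension may raise a notice and return false even when //IGNORE is given, the sketch suppresses the notice and checks the return value instead of assuming success.

```php
<?php
$input = "caf\xC3\xA9"; // "café" in UTF-8; é has no ASCII equivalent

// Ask iconv to skip characters that the output character set cannot represent.
$ascii = @iconv('UTF-8', 'ASCII//IGNORE', $input);

if ($ascii === false) {
    // Seen with some PHP / iconv library combinations despite //IGNORE.
    echo "conversion reported failure\n";
} else {
    echo $ascii, "\n"; // typically the input with the unrepresentable character dropped
}
```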

Transliterate and ignore

You may (or may not!) be able to combine the //TRANSLIT and //IGNORE behaviours

This will transliterate transliteratable characters and ignore the rest

Action it by appending //TRANSLIT//IGNORE to the output character set as passed to iconv()
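Combining the two suffixes suits the "make safe for URL / filename" use case. The sketch below is one possible approach: slugify() is a hypothetical helper name, and the fallback branch is an assumption for systems where the combined suffix is rejected.

```php
<?php
// Hypothetical helper: turn a UTF-8 title into a lowercase ASCII slug.
function slugify(string $utf8): string
{
    // Transliterate what can be transliterated, drop the rest.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);

    if ($ascii === false) {
        // Fallback (assumption): strip every non-ASCII byte ourselves.
        $ascii = (string) preg_replace('/[^\x00-\x7F]/', '', $utf8);
    }

    // Collapse anything that is not a letter or digit into single hyphens.
    $slug = (string) preg_replace('/[^A-Za-z0-9]+/', '-', $ascii);

    return strtolower(trim($slug, '-'));
}

echo slugify('Hello World'), "\n"; // hello-world
echo slugify("Caf\xC3\xA9 au lait!"), "\n"; // exact result depends on the iconv library
```

Only the ASCII input has a guaranteed result. How é is approximated (or dropped) varies between iconv implementations and locales.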

Output buffer handler

We also get a handler for PHP's output buffering

Allows us to, for example, output everything to the browser as ISO-8859-1 though our PHP scripts etc are using UTF-8

An automatic way to convert character sets for output without necessarily touching anything internally

Let's take a look...

ob_iconv_handler
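The built-in route is ob_start('ob_iconv_handler'), which takes its encodings from the ini directives. The sketch below deliberately swaps in a plain closure that calls iconv() directly, so the encodings are explicit and no ini settings are assumed. capture_as_latin1() is a hypothetical helper, and the outer buffer exists only so the converted bytes can be inspected.

```php
<?php
// Hypothetical helper: run $render and return its output converted from UTF-8 to ISO-8859-1.
function capture_as_latin1(callable $render): string
{
    ob_start(); // outer buffer: only captures the converted bytes for inspection

    // Inner buffer: its callback converts everything the script echoes.
    ob_start(function (string $buffer): string {
        $converted = @iconv('UTF-8', 'ISO-8859-1', $buffer);
        return $converted === false ? $buffer : $converted; // on failure, pass through unchanged
    });

    $render();
    ob_end_flush(); // push the inner buffer through the callback into the outer buffer

    return (string) ob_get_clean();
}

$bytes = capture_as_latin1(function (): void {
    echo "Price: \xC2\xA3" . "5"; // "Price: £5" in UTF-8
});

echo bin2hex($bytes), "\n";
```

In a real page only the inner ob_start() call would be needed, together with a Content-Type header announcing charset=ISO-8859-1.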

Utility functions

As of PHP 5, we also get some non-conversion utility functions

iconv_strlen()

iconv_strpos(), iconv_strrpos()

iconv_substr()

Character equivalents of core strlen(), strpos(), strrpos() and substr() [which are really byte functions]

Quite trivial so we'll look only at one, iconv_strlen()...

iconv_strlen()
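A minimal sketch contrasting the byte-counting core functions with their iconv character-counting equivalents. The sample word is an illustrative assumption, and the character set is passed explicitly rather than relying on the ini directives.

```php
<?php
$word = "na\xC3\xAFve"; // "naïve" in UTF-8: the ï occupies two bytes

$bytes = strlen($word);                 // 6, counts bytes
$chars = iconv_strlen($word, 'UTF-8');  // 5, counts characters

// iconv_substr() takes character offsets, so it never cuts a character in half.
$firstThree = iconv_substr($word, 0, 3, 'UTF-8'); // "naï"

echo $bytes, ' bytes, ', $chars, " characters\n";
```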

Food for thought

Unicode is the character set of the future

PHP iconv extension uses system locale [setlocale()] for transliteration

PHP iconv extension issues a notice even when //IGNORE is used

iconv library has no mechanism for custom character maps

Summary

iconv library can be accessed on the command line

But there are also extensions for PHP (and many other languages!)

Many character sets supported

All or nothing conversion or softer transliteration

Links

Should be able to get a PHP source code pack from wherever you got this presentation

http://spin.atomicobject.com/2011/07/13/some-useful-iconv-functionality

http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

http://blog.grayproductions.net/articles/encoding_conversion_with_iconv

http://czyborra.com/charsets/iso8859.html
