UNICODE - Hacking The International Character System

Introduction

• Standard for representing text for most of the world’s writing systems

• The most recent version is Unicode 6.0

• Widely adopted by most programming platforms, operating systems and The Web

• The most widely used Unicode encodings are UTF-8 and UTF-16

Introduction to UTF-8

• UTF-8 (UCS Transformation Format, 8-bit)

• Backwards compatible with ASCII

• Simple ASCII chars are represented by a single byte

• Other characters take up to 4 bytes in current UTF-8 (RFC 3629); the original design allowed up to 31 bits spread across 6 bytes

UTF-8 Encoding Table

Bits  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
7     0XXXXXXX
11    110XXXXX  10XXXXXX
16    1110XXXX  10XXXXXX  10XXXXXX
21    11110XXX  10XXXXXX  10XXXXXX  10XXXXXX
26    111110XX  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX
31    1111110X  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX

UTF-8 Encoding Rules

• Every ASCII character is also a valid UTF-8 character (up to 7 bits or 128 characters)

• For every other UTF-8 byte sequence the first byte indicates the length of the sequence in bytes

• The rest of the bytes from the byte sequence have 10 as the two most significant bits

• This helps to easily find where a byte sequence starts and ends

• There are more rules but this is a good start...
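The rules above can be sketched in a few lines of Python. This is an illustration only: the function names `sequence_length` and `split_sequences` are made up for this example, and the code follows the original 1 to 6 byte scheme from the table rather than the stricter checks a real decoder performs.

```python
def sequence_length(first_byte):
    """Return the sequence length announced by a lead byte (1-6 byte scheme)."""
    if first_byte & 0b10000000 == 0:
        return 1  # 0XXXXXXX: plain ASCII
    if first_byte & 0b11000000 == 0b10000000:
        raise ValueError("continuation byte, not the start of a sequence")
    # Count the leading 1 bits: 110..., 1110..., 11110... and so on.
    length = 0
    mask = 0b10000000
    while first_byte & mask:
        length += 1
        mask >>= 1
    if length > 6:
        raise ValueError("0xFE and 0xFF are never valid lead bytes")
    return length


def split_sequences(data):
    """Split raw bytes into per-character byte sequences."""
    sequences = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the lead byte must look like 10XXXXXX.
        if len(chunk) != n or any(b & 0b11000000 != 0b10000000 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence at offset %d" % i)
        sequences.append(chunk)
        i += n
    return sequences


print(split_sequences("a\u200e\u00e9".encode("utf-8")))
```

Because continuation bytes always start with 10, a reader that lands in the middle of a stream can skip forward to the next byte that does not, which is the property the slide refers to.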

Interesting UTF-8 Characters

• Unicode also defines a lot of formatting (control) characters, shown here as UTF-8 byte sequences, such as

• Byte Order Mark (BOM) - 0xEF, 0xBB, 0xBF are placed at the start of the document to indicate UTF-8

• Left to Right Mark (LRM) - 0xE2, 0x80, 0x8E are placed to indicate text orientation

• In HTML - &lrm;, &#x200E; or &#8206;

• Right to Left Mark (RLM) - 0xE2, 0x80, 0x8F are placed to indicate text orientation

• In HTML - &rlm;, &#x200F; or &#8207;

• Left to Right Embedding (LRE) - 0xE2, 0x80, 0xAA

• In HTML - &#x202A;

• Right to Left Embedding (RLE) - 0xE2, 0x80, 0xAB

• In HTML - &#x202B;

• There are more...
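The byte sequences listed above can be checked by encoding the corresponding code points. A minimal sketch using only Python's built-in `str.encode`; the `MARKS` table and the `utf8_hex` helper are names chosen for this example.

```python
# Code points for the characters on this slide.
MARKS = {
    "BOM": "\ufeff",
    "LRM": "\u200e",
    "RLM": "\u200f",
    "LRE": "\u202a",
    "RLE": "\u202b",
}


def utf8_hex(ch):
    """Format the UTF-8 bytes of a character as '0xNN, 0xNN, ...'."""
    return ", ".join("0x%02X" % b for b in ch.encode("utf-8"))


for name, ch in MARKS.items():
    print("%s: U+%04X -> %s" % (name, ord(ch), utf8_hex(ch)))
```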

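Clarifications

• How exactly does the hex sequence 0xE2, 0x80, 0x8E map to &lrm; in HTML?

• 0xE2, 0x80, 0x8E is the UTF-8 encoding of the code point U+200E

• &#x200E; refers to that same code point, which is 0x20, 0x0E in UTF-16 (big-endian)

• also known as 0x0000200E in UTF-32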

• There is no magic! You simply need to know which encoding system you are working with and find out what characters it supports.

• http://www.decodeunicode.org - is a good reference
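To make the "no magic" point concrete, the sketch below resolves the HTML references to a code point with the standard library's `html.unescape` and then encodes that one code point three ways. The variable names are arbitrary.

```python
import html

# All three HTML spellings resolve to the same code point, U+200E.
lrm = html.unescape("&lrm;")
same_as_hex = html.unescape("&#x200E;") == lrm
same_as_decimal = html.unescape("&#8206;") == lrm

# One code point, three encodings (big-endian variants, no BOM).
encoded = {name: lrm.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

print("U+%04X" % ord(lrm), same_as_hex, same_as_decimal)
for name, raw in encoded.items():
    print(name, raw.hex())
```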

Multiple Representations

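• The same character can be represented in multiple ways by padding it into a longer ("overlong") UTF-8 sequence; the standard forbids these forms, but lenient decoders still accept them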

• For example

• . (DOT) is represented as 0x2E

• It is also the equivalent of 0xC0, 0xAE

• It is also the equivalent of 0xE0, 0x80, 0xAE

• It is also the equivalent of 0xF0, 0x80, 0x80, 0xAE

• It is also the equivalent of 0xF8, 0x80, 0x80, 0x80, 0xAE

• It is also the equivalent of 0xFC, 0x80, 0x80, 0x80, 0x80, 0xAE

Translating the . (DOT)

HEX                Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
2E                 00101110
C0 AE              11000000  10101110
E0 80 AE           11100000  10000000  10101110
F0 80 80 AE        11110000  10000000  10000000  10101110
F8 80 80 80 AE     11111000  10000000  10000000  10000000  10101110
FC 80 80 80 80 AE  11111100  10000000  10000000  10000000  10000000  10101110
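The table can be reproduced programmatically. The following is a sketch, not a standard API: `overlong` is a hypothetical helper that packs a code point into exactly the requested number of bytes while ignoring the shortest-form rule, and `strict_decoder_accepts` simply asks Python's own UTF-8 decoder, which rejects such sequences.

```python
def overlong(code_point, length):
    """Encode code_point in exactly `length` bytes (2-6), shortest form or not."""
    if not 2 <= length <= 6:
        raise ValueError("length must be between 2 and 6")
    payload_bits = (7 - length) + 6 * (length - 1)
    if code_point >= 1 << payload_bits:
        raise ValueError("code point does not fit in that many bytes")
    # Fill continuation bytes (10XXXXXX) from the least significant bits up.
    tail = []
    for _ in range(length - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Lead byte: `length` one bits, a zero bit, then the remaining payload.
    lead_marker = (0xFF << (8 - length)) & 0xFF
    return bytes([lead_marker | code_point] + tail[::-1])


def strict_decoder_accepts(data):
    """Report whether Python's built-in UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for n in range(2, 7):
    raw = overlong(0x2E, n)
    print(raw.hex(" "), strict_decoder_accepts(raw))
```

A conforming decoder refuses every row of the table except the first, so the attack surface is limited to software that decodes leniently.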

Half and Full Width Forms

• Graphic characters are traditionally classed as halfwidth and fullwidth characters

• In a fixed-width font a halfwidth character takes half the width of a fullwidth character

• In Unicode you can find characters which are presented in their halfwidth and fullwidth forms

• http://www.unicode.org/charts/PDF/UFF00.pdf - for more information

Fullwidth Latin Characters

• Halfwidth and Fullwidth notations make sense when used for characters such as those found in the Japanese and Chinese character sets

• The specification also defines Latin characters in fullwidth forms

• As a result the following mappings are possible

• A - 0x41 (halfwidth) = Ａ - 0xEF, 0xBC, 0xA1 (fullwidth)

• B - 0x42 (halfwidth) = Ｂ - 0xEF, 0xBC, 0xA2 (fullwidth)

• etc.
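The mapping is regular: the fullwidth forms U+FF01 to U+FF5E mirror ASCII 0x21 to 0x7E at a fixed offset of 0xFEE0. A short sketch, with `to_fullwidth` and `fold_width` as illustrative helper names; the folding step relies on NFKC normalization from the standard `unicodedata` module.

```python
import unicodedata


def to_fullwidth(text):
    """Map printable ASCII (0x21-0x7E) to its fullwidth form (U+FF01-U+FF5E)."""
    return "".join(
        chr(ord(c) + 0xFEE0) if 0x21 <= ord(c) <= 0x7E else c for c in text
    )


def fold_width(text):
    """Fold fullwidth forms back to ASCII using NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


wide = to_fullwidth("AB")
print(wide.encode("utf-8").hex(" "))
print(fold_width(wide) == "AB")
```

Any filter that compares raw bytes sees the fullwidth string as different from "AB", while a component that normalizes later treats the two as equal.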

Security Considerations

• Visual Security Issues

• Internationalized names

• Left to Right and Right to Left representations

• Charset Translation Issues

• Occur when strings are normalized before and after translation between character sets

• Characters with multiple representations

• The same character can be represented in multiple ways

Case Study: Windows Filename Mangling

• Consider the following files

• [RTLO]cod.stnemucodtnatropmi.exe

• [RTLO]cod.yrammusevituc[LTRO]n1c[LTRO].exe

• [RTLO]gpj.!nuf_stohsnee[LTRO]n1c[LTRO].scr

• On screen, these file names are displayed differently

• exe.importantdocuments.doc

• n1c.executivesummary.doc

• n1c.screenshots_fun!.jpg
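A defensive check for the first file name above might look like the sketch below. All names here are illustrative, and `naive_display` is only a rough approximation for a single leading override: it is not the Unicode bidirectional algorithm and does not handle the nested overrides in the other two examples.

```python
# Directional formatting characters discussed in this deck, plus PDF and LRO.
BIDI_CONTROLS = {
    "\u200e": "LRM",
    "\u200f": "RLM",
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\u202d": "LRO",
    "\u202e": "RLO",
}


def find_bidi_controls(name):
    """Return (index, label) for every directional control in the name."""
    return [(i, BIDI_CONTROLS[c]) for i, c in enumerate(name) if c in BIDI_CONTROLS]


def real_extension(name):
    """Extension as the operating system sees it, ignoring the controls."""
    stripped = "".join(c for c in name if c not in BIDI_CONTROLS)
    return stripped.rsplit(".", 1)[-1]


def naive_display(name):
    """Approximate rendering when the whole name follows one leading RLO."""
    if name.startswith("\u202e"):
        return name[1:][::-1]
    return name


name = "\u202e" + "cod.stnemucodtnatropmi.exe"
print(find_bidi_controls(name))
print(real_extension(name))
print(naive_display(name))
```

Rejecting or flagging any file name for which `find_bidi_controls` returns a non-empty list is a simple mitigation, at the cost of refusing some legitimate right-to-left names.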

Case Study: The PAYPAL Scam

• What is the difference between paypal.com and paypai.com or between intel.com and lntel.com?

• How about citybank.com?

• 0000000: d181 6974 7962 616e 6b2e 636f 6d ..itybank.com

• 0xd1, 0x81 is the Cyrillic letter с (es, U+0441), which looks like the Latin letter c (U+0063) although they are different characters
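One rough way to spot such a name is to look at the Unicode character names of its letters. This is a heuristic sketch, not a complete confusable detector: `scripts_used` is a made-up helper that takes the first word of each letter's Unicode name as a script hint.

```python
import unicodedata


def scripts_used(label):
    """First word of each letter's Unicode name, as a rough script hint."""
    return {unicodedata.name(c).split()[0] for c in label if c.isalpha()}


genuine = "citybank.com"
spoofed = "\u0441itybank.com"  # starts with CYRILLIC SMALL LETTER ES

print(spoofed.encode("utf-8").hex())
print(sorted(scripts_used(genuine)))
print(sorted(scripts_used(spoofed)))
```

A name that mixes scripts within one label is not always malicious, so a result like this is better treated as a warning than as proof of spoofing.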

Case Study: Directory Traversal

• Let’s say an application shows images by requesting /getimage.jsp?name=image.jpg

• The attacker tries to retrieve an arbitrary file by requesting /getimage.jsp?name=../../../../boot.ini

• Unfortunately the attack fails because the application checks for the presence of the ../ character sequence

• ../ is 0x2E, 0x2E, 0x2F in hex

• ../ is also 0x2E, 0xC0, 0xAE, 0x2F when the second dot is written in overlong UTF-8

• Since 0x2E, 0xC0, 0xAE, 0x2F is not equal to 0x2E, 0x2E, 0x2F the security check is bypassed; if a lenient decoder later turns 0xC0, 0xAE back into a dot, the file content is retrieved
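The bypass can be simulated end to end. Everything below is hypothetical: `naive_filter` stands in for the application's check on raw bytes, and `lenient_decode` stands in for a non-conforming decoder that accepts 2-byte overlong forms (it treats every other byte as a single character, which is enough for this demonstration).

```python
def naive_filter(raw):
    """Hypothetical check: allow the request if the raw bytes lack '../'."""
    return b"../" not in raw


def lenient_decode(raw):
    """Hypothetical lenient decoder that accepts 2-byte overlong sequences."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        is_two_byte_lead = b & 0b11100000 == 0b11000000
        has_continuation = i + 1 < len(raw) and raw[i + 1] & 0b11000000 == 0b10000000
        if is_two_byte_lead and has_continuation:
            # No shortest-form check here: this is the bug being illustrated.
            out.append(chr(((b & 0b00011111) << 6) | (raw[i + 1] & 0b00111111)))
            i += 2
        else:
            out.append(chr(b))
            i += 1
    return "".join(out)


payload = b".\xc0\xae/boot.ini"
print(naive_filter(payload))
print(lenient_decode(payload))
```

Decoding strictly before validating removes the gap: Python's own `payload.decode("utf-8")` raises `UnicodeDecodeError` on this input, so the request never reaches the path check.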

References

• http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters

• http://decodeunicode.org

• http://unicode.org/reports/tr36/

• http://www.fileformat.info

• http://blog.commtouch.com/cafe/email-security-news/using-unicode-to-trick-users-to-install-malware/

• https://dc414.org/wp-content/uploads/2011/01/righttoleften-override.pdf

• http://norman.com/security_center/security_center_archive/2011/rtlo_unicode_hole/

• http://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms

• http://www.unicode.org/charts/PDF/UFF00.pdf