Unicode - Hacking The International Character System
In this presentation we explore some of the problems of Unicode and how they can be used for nefarious purposes to exploit a range of critical vulnerabilities, including SQL injection, XSS, and many others.
- UNICODE Hacking The International Character System
- Introduction: Unicode is a standard for representing text in most of the world's writing systems. The most recent version at the time of this presentation is Unicode 6.0. It is widely adopted by most programming platforms, operating systems, and the Web. The most widely used Unicode encodings are UTF-8 and UTF-16.
- Introduction to UTF-8: UTF-8 (UCS Transformation Format, 8-bit) is backwards compatible with ASCII. Simple ASCII characters are represented by a single byte. Other characters take several bytes: valid Unicode code points need at most 4 bytes, while the original design allows up to 31 bits in total, spanning 6 physical bytes.
- UTF-8 Encoding Table
  Bits  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
  7     0XXXXXXX
  11    110XXXXX  10XXXXXX
  16    1110XXXX  10XXXXXX  10XXXXXX
  21    11110XXX  10XXXXXX  10XXXXXX  10XXXXXX
  26    111110XX  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX
  31    1111110X  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX  10XXXXXX
- UTF-8 Encoding Rules: Every ASCII character is also a valid UTF-8 character (up to 7 bits, or 128 characters). For every other UTF-8 byte sequence, the first byte indicates the length of the sequence in bytes. The remaining bytes of the sequence have 10 as their two most significant bits. This makes it easy to find where a byte sequence starts and ends. There are more rules, but this is a good start...
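The rules and the table above can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: the function name `utf8_encode_legacy` is made up for this example, and it deliberately follows the original 1 to 6 byte layout from the table rather than the modern 4-byte limit.

```python
def utf8_encode_legacy(code_point: int) -> bytes:
    """Encode one code point with the original 1-6 byte UTF-8 layout (illustrative)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        return bytes([code_point])  # 7 bits: plain ASCII, one byte

    # (payload bits, lead-byte prefix) for sequences of 2, 3, 4, 5 and 6 bytes
    layouts = [(11, 0xC0), (16, 0xE0), (21, 0xF0), (26, 0xF8), (31, 0xFC)]
    for length, (bits, prefix) in enumerate(layouts, start=2):
        if code_point < (1 << bits):
            tail = []
            for _ in range(length - 1):
                # every continuation byte is 10XXXXXX and carries 6 payload bits
                tail.append(0x80 | (code_point & 0x3F))
                code_point >>= 6
            # the lead byte encodes the sequence length plus the remaining bits
            return bytes([prefix | code_point] + tail[::-1])
    raise ValueError("code point needs more than 31 bits")


if __name__ == "__main__":
    for cp in (0x2E, 0xE9, 0x200E, 0x1F600):
        encoded = utf8_encode_legacy(cp)
        print(f"U+{cp:04X}", " ".join(f"{b:02X}" for b in encoded))
```

For code points that are valid in modern Unicode (and are not surrogates), the result should match Python's built-in `str.encode("utf-8")`.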
- Interesting UTF-8 Characters: UTF-8 also provides a number of function characters, such as:
  Byte Order Mark (BOM) - 0xEF, 0xBB, 0xBF, placed at the start of a document to indicate UTF-8
  Left to Right Mark (LRM) - 0xE2, 0x80, 0x8E, placed to indicate text orientation. In HTML: &#x200E; or &lrm;
  Right to Left Mark (RLM) - 0xE2, 0x80, 0x8F, placed to indicate text orientation. In HTML: &#x200F; or &rlm;
  Left to Right Embedding (LRE) - 0xE2, 0x80, 0xAA. In HTML: &#x202A;
  Right to Left Embedding (RLE) - 0xE2, 0x80, 0xAB. In HTML: &#x202B;
  There are more...
- Clarifications: How exactly does the hex sequence 0xE2, 0x80, 0x8E map to &#x200E; in HTML? 0xE2, 0x80, 0x8E in UTF-8 is 0x20, 0x0E in UTF-16, also known as 0x0000200E in UTF-32. All three encode the same code point, U+200E, which is the number the HTML entity refers to. There is no magic! You simply need to know which encoding system you are working with and find out which characters it supports. http://www.decodeunicode.org is a good reference.
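To check these byte values yourself, the following sketch prints a few of the control characters in three encodings using only Python's built-in codecs. The helper name `hex_bytes` and the `MARKS` table are just for this example; the big-endian codec variants are used so that Python does not prepend a BOM of its own.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded form of `text` as space-separated hex bytes."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# Code points of the function characters discussed above
MARKS = {
    "BOM": "\ufeff",
    "LRM": "\u200e",
    "RLM": "\u200f",
    "LRE": "\u202a",
    "RLE": "\u202b",
}

if __name__ == "__main__":
    for name, ch in MARKS.items():
        print(
            name,
            f"U+{ord(ch):04X}",
            "| UTF-8:", hex_bytes(ch, "utf-8"),
            "| UTF-16:", hex_bytes(ch, "utf-16-be"),
            "| UTF-32:", hex_bytes(ch, "utf-32-be"),
        )
```

The same code point number appears in all three columns, only packed differently, which is why the HTML numeric entity can be read straight off the UTF-16 or UTF-32 form.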
- Multiple Representations: The same character can be represented in multiple ways. For example, . (DOT) is represented as 0x2E. It is also the equivalent of each of these longer (overlong) sequences:
  0xC0, 0xAE
  0xE0, 0x80, 0xAE
  0xF0, 0x80, 0x80, 0xAE
  0xF8, 0x80, 0x80, 0x80, 0xAE
  0xFC, 0x80, 0x80, 0x80, 0x80, 0xAE
- Translating the . (DOT)
  HEX                Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
  2E                 00101110
  C0 AE              11000000  10101110
  E0 80 AE           11100000  10000000  10101110
  F0 80 80 AE        11110000  10000000  10000000  10101110
  F8 80 80 80 AE     11111000  10000000  10000000  10000000  10101110
  FC 80 80 80 80 AE  11111100  10000000  10000000  10000000  10000000  10101110
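The padded sequences in the table can be generated mechanically. The sketch below is illustrative only: `overlong` and `strict_decoder_accepts` are names invented for this example. It also shows that Python's built-in UTF-8 decoder, which is strict, refuses every one of the padded forms, so the trick depends on a more lenient decoder being somewhere in the chain.

```python
PREFIX = {2: 0xC0, 3: 0xE0, 4: 0xF0, 5: 0xF8, 6: 0xFC}
PAYLOAD_BITS = {2: 11, 3: 16, 4: 21, 5: 26, 6: 31}


def overlong(code_point: int, length: int) -> bytes:
    """Encode code_point in exactly `length` bytes (2..6), padding with zero bits."""
    if length not in PREFIX:
        raise ValueError("length must be between 2 and 6")
    if not 0 <= code_point < (1 << PAYLOAD_BITS[length]):
        raise ValueError("code point does not fit in that many bytes")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))  # 10XXXXXX continuation byte
        code_point >>= 6
    return bytes([PREFIX[length] | code_point] + tail[::-1])


def strict_decoder_accepts(data: bytes) -> bool:
    """True if Python's built-in (strict) UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for n in range(2, 7):
        seq = overlong(0x2E, n)
        print(" ".join(f"{b:02X}" for b in seq), strict_decoder_accepts(seq))
```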
- Half and Full Width Forms: Graphic characters are traditionally classed as halfwidth or fullwidth. In a fixed-width font, a halfwidth character takes half the width of a fullwidth character. In Unicode you can find characters that are provided in both their halfwidth and fullwidth forms. See http://www.unicode.org/charts/PDF/UFF00.pdf for more information.
- Fullwidth Latin Characters: Halfwidth and fullwidth notations make sense for characters such as those found in the Japanese and Chinese character sets. The specifications also cover Latin characters presented in their fullwidth forms. As a result, the following mappings are possible:
  A - 0x41 (halfwidth) = Ａ - 0xEF, 0xBC, 0xA1 (fullwidth)
  B - 0x42 (halfwidth) = Ｂ - 0xEF, 0xBC, 0xA2 (fullwidth)
  etc.
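One way such a mapping can surface in practice is compatibility normalization. The sketch below assumes the application applies NFKC normalization somewhere after its input checks, which is an assumption made for illustration; `fold_fullwidth` is a made-up helper name. It uses Python's standard `unicodedata` module.

```python
import unicodedata


def fold_fullwidth(text: str) -> str:
    """Apply NFKC normalization, which maps fullwidth forms to their ASCII equivalents."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    fullwidth = "\uff21\uff22"  # FULLWIDTH LATIN CAPITAL LETTER A and B
    print(" ".join(f"{b:02X}" for b in fullwidth.encode("utf-8")))
    print(fold_fullwidth(fullwidth))

    # A filter that looks for an ASCII quote before normalization misses
    # the fullwidth apostrophe (U+FF07), which becomes ' afterwards.
    user_input = "\uff07 OR 1=1"
    print("'" in user_input, "'" in fold_fullwidth(user_input))
```

If normalization is needed at all, running it before validation rather than after avoids this mismatch.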
- Security Considerations
  Visual security issues: internationalized names; left-to-right and right-to-left representations.
  Charset translation issues: these occur when strings are normalized before and after translation between character sets.
  Characters in multiple representations: the same character can be represented in multiple ways.
- Case Study: Windows Filename Mangling. Consider the following files, where [RTLO] and [LTRO] stand for the right-to-left and left-to-right override control characters:
  [RTLO]cod.stnemucodtnatropmi.exe
  [RTLO]cod.yrammusevituc[LTRO]n1c[LTRO].exe
  [RTLO]gpj.!nuf_stohsnee[LTRO]n1c[LTRO].scr
  Visually these files look different. They are displayed as:
  exe.importantdocuments.doc
  n1c.executivesummary.doc
  n1c.screenshots_fun!.jpg
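A simple defensive check is to look for directional control characters in a filename before showing it to the user. The sketch below assumes that the RTLO marker in the examples is U+202E (RIGHT-TO-LEFT OVERRIDE); the table `BIDI_CONTROLS` and the function `find_bidi_controls` are hypothetical names, and the list of characters is a reasonable starting set rather than an exhaustive one.

```python
# Directional formatting characters that rarely belong in a filename
BIDI_CONTROLS = {
    "\u200e": "LRM",
    "\u200f": "RLM",
    "\u202a": "LRE",
    "\u202b": "RLE",
    "\u202c": "PDF",
    "\u202d": "LRO",
    "\u202e": "RLO",
}


def find_bidi_controls(name: str) -> list:
    """Return (index, label) for each directional control character in `name`."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(name) if ch in BIDI_CONTROLS]


if __name__ == "__main__":
    suspicious = "\u202ecod.stnemucodtnatropmi.exe"
    print(find_bidi_controls(suspicious))    # the override sits at index 0
    print(suspicious.rsplit(".", 1)[-1])     # the real extension, whatever is displayed
    print(find_bidi_controls("report.doc"))  # a clean name yields an empty list
```

Checking the stored extension instead of the rendered name is the important part: the bytes on disk still end in `.exe`.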
- Case Study:The PAYPAL Scam What is the difference between paypal.com and paypai.com or between intel.com and lntel.com? How about citybank.com? 0000000: d181 6974 7962 616e 6b2e 636f 6d ..itybank.com 0xd1, 0x81 is the Cyrillic letter c which looks like the latin letter c although they are very different
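A rough way to spot this kind of lookalike is to check whether a name mixes scripts. The sketch below is a heuristic only: it uses the first word of each character's Unicode name (for example LATIN or CYRILLIC) as a stand-in for its script, which is an approximation chosen for brevity, and `scripts_used` is a made-up helper name.

```python
import unicodedata


def scripts_used(label: str) -> set:
    """Approximate the scripts in `label` from the first word of each letter's name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER ES" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts


if __name__ == "__main__":
    # Bytes taken from the hex dump on the slide
    spoofed = bytes.fromhex("d18169747962616e6b2e636f6d").decode("utf-8")
    genuine = "citybank.com"

    print(spoofed == genuine)             # the strings differ despite looking alike
    print(sorted(scripts_used(spoofed)))  # more than one script: worth flagging
    print(sorted(scripts_used(genuine)))
```

A name that mixes Latin with another script is not automatically malicious, but it is a cheap signal for flagging a domain for closer inspection.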
- Case Study: Directory Traversal. Let's say an application shows images by requesting /getimage.jsp?name=image.jpg. The attacker tries to retrieve an arbitrary file by requesting /getimage.jsp?name=../../../../boot.ini. Unfortunately for the attacker, the attack fails because the application checks for the presence of the ../ character sequence. ../ is 0x2E, 0x2E, 0x2F in hex. ../ is also 0x2E, 0xC0, 0xAE, 0x2F when the second dot is written in overlong UTF-8. Since 0x2E, 0xC0, 0xAE, 0x2F is not equal to 0x2E, 0x2E, 0x2F, the check does not match. If a later stage decodes the name with a decoder that accepts overlong forms, the sequence becomes ../ again, the security check is bypassed, and the file content is retrieved.
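The bypass can be reproduced in miniature. Everything in the sketch below is hypothetical: `naive_filter` stands in for the application's check, `lenient_decode` imitates a legacy decoder that accepts two-byte overlong forms (Python's own decoder does not), and `safer_filter` shows one possible ordering of the steps, namely decoding strictly first and validating the decoded text afterwards.

```python
def naive_filter(raw: bytes) -> bool:
    """Allow the request unless the raw bytes contain a literal '../'."""
    return b"../" not in raw


def lenient_decode(raw: bytes) -> str:
    """Imitation of a legacy decoder that accepts 2-byte overlong sequences."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(raw) and (raw[i + 1] & 0xC0) == 0x80:
            # no minimum-value check here: this is the flaw being demonstrated
            out.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F)))
            i += 2
        else:
            out.append("\ufffd")  # anything else is replaced in this sketch
            i += 1
    return "".join(out)


def safer_filter(raw: bytes) -> bool:
    """Decode strictly first, then check the decoded path components."""
    try:
        name = raw.decode("utf-8")  # rejects overlong forms
    except UnicodeDecodeError:
        return False
    return ".." not in name.replace("\\", "/").split("/")


if __name__ == "__main__":
    payload = b".\xc0\xae/.\xc0\xae/boot.ini"
    print(naive_filter(payload))    # the byte-level check lets it through
    print(lenient_decode(payload))  # the decoded name contains ../ again
    print(safer_filter(payload))    # strict decoding rejects the request
```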
- References
  http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters
  http://decodeunicode.org
  http://unicode.org/reports/tr36/
  http://www.fileformat.info
  http://blog.commtouch.com/cafe/email-security-news/using-unicode-to-trick-users-to-install-malware/
  https://dc414.org/wp-content/uploads/2011/01/righttoleften-override.pdf
  http://norman.com/security_center/security_center_archive/2011/rtlo_unicode_hole/
  http://en.wikipedia.org/wiki/Halfwidth_and_Fullwidth_Forms
  http://www.unicode.org/charts/PDF/UFF00.pdf