expect("").length.toBe(1)

55
expect(" ").length .toBe(1) No. Itā€™s not ā€œjust stringsā€

Transcript of expect("").length.toBe(1)

Page 1: expect("").length.toBe(1)

expect("šŸ’©").length .toBe(1)

No. Itā€™s not ā€œjust stringsā€

Page 2: expect("").length.toBe(1)

Whereā€™s the problem?

ā€¢ "".length == 0

ā€¢ "a".length == 1

ā€¢ "Ƥ".length == 1

Page 3: expect("").length.toBe(1)

See? No problem here

Page 4: expect("").length.toBe(1)
Page 5: expect("").length.toBe(1)

What gives?

Page 6: expect("").length.toBe(1)

Question is: What are we counting?

Page 7: expect("").length.toBe(1)

But first: Even more fundamental

Page 8: expect("").length.toBe(1)

What is a string?

ā€¢ Computers are really good at dealing with numbers

ā€¢ Humans are really good at dealing with text

ā€¢ Marketing demands that computers do text

ā€¢ We need to translate between text and numbers

Page 9: expect("").length.toBe(1)

When all you have is a hammer, every problem looks like a nail

Page 10: expect("").length.toBe(1)

In the beginning

ā€¢ Array of characters

ā€¢ C says char*

ā€¢ char is defined as the ā€œsmallest addressable unit that can contain basic character setā€. Integer type. Might be signed or unsigned

ā€¢ char ends up being a byte

ā€¢ Assign some meaning to each byte value

Page 11: expect("").length.toBe(1)

Interacting with the world

ā€¢ Just dump the contents of the memory into a file

ā€¢ Read back the same contents and put it in memory

ā€¢ Problem solved.

ā€¢ Until you need to do this across machines

Page 12: expect("").length.toBe(1)

Interoperability

ā€¢ char and the mapping from number to letter is inherently implementation dependent

ā€¢ So is by definition the file you dump your char* into

ā€¢ Canā€™t move files between machines

Page 13: expect("").length.toBe(1)

ASCII

ā€¢ ā€œAmerican Standard Code for Information Interchangeā€

ā€¢ Published 1963

ā€¢ Uses 7 bits per character (circumventing the signedness-issue)

ā€¢ Perfectly fine for what everybody is using (English)

Page 14: expect("").length.toBe(1)

But I need Ć¼mlĆ¤Ć¼teā€¢ Machines were used where people speak strange

languages (i.e. not English)

ā€¢ ASCII is 7bit.

ā€¢ Most machine have 8 bit bytes.

ā€¢ Using that additional bit bit gives us another 127 characters!

ā€¢ Depending on your country, these upper 127 characters had different meanings

ā€¢ No problem as texts usually donā€™t leave their country

Page 15: expect("").length.toBe(1)

remember ā€œchcp 850ā€?

Page 16: expect("").length.toBe(1)

Then the Internet happened

Page 17: expect("").length.toBe(1)

ThĆ¼s wƤs nƶt pюssїŅŒlŅæ!

I apologize to all Russians for butchering their script.

Page 18: expect("").length.toBe(1)

Unicode 1.0

ā€¢ 16 bits per character

ā€¢ Published in 1991, revised in 1992

ā€¢ Jumped on by everybody who wanted ā€œto do it rightā€

ā€¢ APIs were made Unicode compliant by extending the size of a character to 16 bits.

Page 19: expect("").length.toBe(1)

Itā€™s the wild west 90ies

Page 20: expect("").length.toBe(1)

Still just dumping memory

ā€¢ wchar is 16 bits

ā€¢ Endianness? See if we care!

ā€¢ To save to a file: Dump memory contents.

ā€¢ To load from a file: Read file into memory

ā€¢ Note they didnā€™t dare extending char to 16 bits

ā€¢ Letā€™s call this ā€œUnicodeā€

Page 21: expect("").length.toBe(1)

16 bits everywhere

ā€¢ Windows API (XxxxXxxW uses wchar which is 16 bit wide)

ā€¢ Java uses 16 bits

ā€¢ Objective C uses 16 bits

ā€¢ And of course, JavaScript uses 16 bits

ā€¢ C and by extension Unix stayed away from this.

Page 22: expect("").length.toBe(1)

Thatā€™s perfect. By using 16 bit characters, we

can store all of Unicode!

Page 23: expect("").length.toBe(1)

65K characters are enough for everybody

Page 24: expect("").length.toBe(1)

640K are enough for everybody

Page 25: expect("").length.toBe(1)

It didnā€™t work out so well

ā€¢ By just dumping memory, thereā€™s no way to know how to read it back

ā€¢ You have no clue whether you have just read Unicode or old-style data.

ā€¢ Remember: itā€™s just numbers.

ā€¢ Heuristics suck (try typing ā€œBush hid the factsā€ in Windows Notepad, saving, reloading)

Page 26: expect("").length.toBe(1)

BOM

Page 27: expect("").length.toBe(1)

We learned

ā€¢ Unicode Transformation Format has happened

ā€¢ specifically UTF-8 happened

ā€¢ Unicode 2.0 happened

ā€¢ Programming environments learned

Page 28: expect("").length.toBe(1)

Unicode 2.0+

ā€¢ Theoretically unlimited code space

ā€¢ Doesnā€™t talk about bits any more

ā€¢ The terminology is code point.

ā€¢ Currently 1.1M code points

ā€¢ The old characters (0000 - FFFF) are on the BMP

Page 29: expect("").length.toBe(1)

Unicode Transformation Format

ā€¢ Specifies how to store Unicode on disk

ā€¢ Specifies exact byte encoding for every Unicode code point

ā€¢ Available for 8-, 16- and 32 bit encodings per code point

ā€¢ Not every byte sequence is a valid UTF byte sequence (finally!)

Page 30: expect("").length.toBe(1)

UTF-8

ā€¢ Uses an 8bit encoding to store code points

ā€¢ Is the same as ASCII for whateverā€™s in ASCII

ā€¢ Uses multiple bytes to encode code points outside of ASCII

ā€¢ The old algorithms donā€™t work any more

Page 31: expect("").length.toBe(1)

UTF-16

ā€¢ Combines the worst of both worlds

ā€¢ Uses 16bit to encode a code point

ā€¢ Uses surrogate pairs to encode a code point outside of the BMP

ā€¢ Wastes memory for ASCII, has byte-ordering-issues and still breaks the old algorithms.

ā€¢ Is the only way for these 16bit bandwagon jumpers to support Unicode 2.0 and later

Page 32: expect("").length.toBe(1)

UTF-32

ā€¢ 4 bytes per character

ā€¢ Byte ordering issues

ā€¢ Still breaking the old algorithms due to combining marks

Page 33: expect("").length.toBe(1)

So. What is a String?

Page 34: expect("").length.toBe(1)

Itā€™s not a collection of bytes

ā€¢ A string is a sequence of graphemes

ā€¢ Or a sequence of Unicode Code Points

ā€¢ A byte array is a sequence of bytes

ā€¢ Both are incompatible with each other

ā€¢ You can encode a string into a byte array

ā€¢ You can decode a byte array into a string

Page 35: expect("").length.toBe(1)

Back to counting

Page 36: expect("").length.toBe(1)

Intermission: Combining Marks

ā€¢ Ƥ is not the same as Ƥ

ā€¢ Ƥ can be ā€œLATIN SMALL LETTER A WITH DIAERESISā€

ā€¢ it can also be ā€œLATIN SMALL LETTER Aā€ followed by ā€œCOMBINING DIAERESISā€

ā€¢ both look exactly the same

Page 37: expect("").length.toBe(1)

Counting Lengths

ā€¢ You could count the length in graphemes

ā€¢ Or the length in unicode code points

ā€¢ Or the length of your binary blob once your string has been encoded.

ā€¢ Or something in between because you were trigger-happy in the 90ies

Page 38: expect("").length.toBe(1)

Which brings us back to JS, Java and friends

ā€¢ Live back in 1996

ā€¢ Strings specified as being stored in UCS-2 (Fixed 16 bits per character)

ā€¢ Leak its implementation in the API

ā€¢ Donā€™t know about Unicode 2.0

Page 39: expect("").length.toBe(1)

Cheating abound

ā€¢ Applications still want to support Unicode 2.0

ā€¢ We need to display these piles of poo!

ā€¢ String APIs use UTF-16, encoding characters outside of the BMP as surrogate pairs

ā€¢ String APIs often donā€™t know about UTF-16

Page 40: expect("").length.toBe(1)

Letā€™s talk EcmaScript

Page 41: expect("").length.toBe(1)

String methods are leaky

ā€¢ String.length returns mish-mash of byte length and code point length for strings outside the BMP

ā€¢ substr() can break strings

ā€¢ charAt() can return non-existing code-points

ā€¢ and letā€™s not talk about to*Case

Page 42: expect("").length.toBe(1)

Samples

Page 43: expect("").length.toBe(1)

Et tu RegEx?

ā€¢ Character classes donā€™t work right

ā€¢ Counting characters doesnā€™t work right

ā€¢ Can break strings

Page 44: expect("").length.toBe(1)

No Normalization

Page 45: expect("").length.toBe(1)

Real-World example

Ā«("šŸ’©")Ā» is 5 graphemes long. I counted 6 underscores :-)

Also: Mixed Content Warning? What gives?

Page 46: expect("").length.toBe(1)

Some did it ok

ā€¢ Python 3.3 (PEP 393)

ā€¢ Ruby 1.9 (avoids political issues by giving a lot of freedom)

ā€¢ Perl (awesome libraries since forever)

ā€¢ Swift after a very rough start

ā€¢ ICU, ICU4C (http://icu-project.org/)

Page 47: expect("").length.toBe(1)

ES2015

Page 48: expect("").length.toBe(1)

still some šŸž around

Page 49: expect("").length.toBe(1)

This was just the tip of the iceberg!

ā€¢ Localization issues (Collation, Case change)

ā€¢ Security issues (Encoding, Homographs)

ā€¢ Broken Software (including ā€œUS UTF-8ā€)

Page 50: expect("").length.toBe(1)

Highly recommended Literature

Page 51: expect("").length.toBe(1)

Pop Quiz

Page 52: expect("").length.toBe(1)

"#".length

[ā€¦"#"].length

Page 53: expect("").length.toBe(1)

Thank you!

ā€¢ @pilif on twitter

ā€¢ https://github.com/pilif/

Page 54: expect("").length.toBe(1)

In case you answered 11 and 8, I salute you

Page 55: expect("").length.toBe(1)

ā€¢ U+1F468 (MAN) šŸ‘Ø

ā€¢ U+200D (ZERO WIDTH JOINER)

ā€¢ U+2764 (HEAVY BLACK HEART) ā¤

ā€¢ U+FE0F (VARIATION SELECTOR-16)

ā€¢ U+200D (ZERO WIDTH JOINER)

ā€¢ U+1F48B (KISS MARK) šŸ’‹

ā€¢ U+200D (ZERO WIDTH JOINER)

ā€¢ U+1F468 (MAN) šŸ‘Ø