expect("💩".length).toBe(1)
philip-hofstetter
expect("💩".length).toBe(1)
No. It's not "just strings"
Where's the problem?
• "".length == 0
• "a".length == 1
• "ä".length == 1
See? No problem here
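The cases above really do behave; a quick sketch (runnable in Node or a browser console) of where the intuition stops holding, using the character from the talk's title:

```javascript
// For ASCII and Latin-1 characters, .length matches intuition:
console.log("".length);   // 0
console.log("a".length);  // 1
console.log("ä".length);  // 1

// But for a character outside the Basic Multilingual Plane, such as
// U+1F4A9 (PILE OF POO), .length reports 2, not 1:
console.log("💩".length); // 2
```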
What gives?
Question is: What are we counting?
But first: Even more fundamental
What is a string?
• Computers are really good at dealing with numbers
• Humans are really good at dealing with text
• Marketing demands that computers do text
• We need to translate between text and numbers
When all you have is a hammer, every problem looks like a nail
In the beginning
• Array of characters
• C says char*
• char is defined as the "smallest addressable unit that can contain basic character set". Integer type. Might be signed or unsigned
• char ends up being a byte
• Assign some meaning to each byte value
Interacting with the world
• Just dump the contents of the memory into a file
• Read back the same contents and put it in memory
• Problem solved.
• Until you need to do this across machines
Interoperability
• char and the mapping from number to letter is inherently implementation dependent
• So, by definition, is the file you dump your char* into
• Can't move files between machines
ASCII
• "American Standard Code for Information Interchange"
• Published 1963
• Uses 7 bits per character (circumventing the signedness issue)
• Perfectly fine for what everybody is using (English)
But I need ümläüte
• Machines were used where people speak strange languages (i.e. not English)
• ASCII is 7 bit.
• Most machines have 8 bit bytes.
• Using that additional bit gives us another 128 characters!
• Depending on your country, these upper 128 characters had different meanings
• No problem, as texts usually don't leave their country
remember "chcp 850"?
Then the Internet happened
Thüs wäs nöt рössіblе!
I apologize to all Russians for butchering their script.
Unicode 1.0
• 16 bits per character
• Published in 1991, revised in 1992
• Jumped on by everybody who wanted "to do it right"
• APIs were made Unicode compliant by extending the size of a character to 16 bits.
It's the wild west 90ies
Still just dumping memory
• wchar is 16 bits
• Endianness? See if we care!
• To save to a file: Dump memory contents.
• To load from a file: Read file into memory
• Note they didn't dare extend char to 16 bits
• Let's call this "Unicode"
16 bits everywhere
• Windows API (XxxxXxxW uses wchar which is 16 bits wide)
• Java uses 16 bits
• Objective-C uses 16 bits
• And of course, JavaScript uses 16 bits
• C, and by extension Unix, stayed away from this.
That's perfect. By using 16 bit characters, we can store all of Unicode!
65K characters are enough for everybody
640K are enough for everybody
It didnāt work out so well
• By just dumping memory, there's no way to know how to read it back
• You have no clue whether you have just read Unicode or old-style data.
• Remember: it's just numbers.
• Heuristics suck (try typing "Bush hid the facts" in Windows Notepad, saving, reloading)
BOM
We learned
• Unicode Transformation Format has happened
• specifically UTF-8 happened
• Unicode 2.0 happened
• Programming environments learned
Unicode 2.0+
• Theoretically unlimited code space
• Doesn't talk about bits any more
• The terminology is code point.
• Currently 1.1M code points
• The old characters (U+0000 to U+FFFF) are on the BMP (Basic Multilingual Plane)
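In JavaScript terms, codePointAt (added in ES2015) reports the code point itself, which makes it easy to check whether a character lives on the BMP; a small sketch:

```javascript
// "ä" (U+00E4) is on the BMP: its code point fits in 16 bits.
console.log("ä".codePointAt(0).toString(16));  // "e4"

// U+1F4A9 (PILE OF POO) is outside the BMP (> U+FFFF), one of the
// roughly 1.1M code points that Unicode 2.0+ makes room for.
console.log("💩".codePointAt(0).toString(16)); // "1f4a9"
console.log("💩".codePointAt(0) > 0xFFFF);     // true
```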
Unicode Transformation Format
• Specifies how to store Unicode on disk
• Specifies exact byte encoding for every Unicode code point
• Available for 8-, 16- and 32-bit encodings per code point
• Not every byte sequence is a valid UTF byte sequence (finally!)
UTF-8
• Uses an 8 bit encoding to store code points
• Is the same as ASCII for whatever's in ASCII
• Uses multiple bytes to encode code points outside of ASCII
• The old algorithms don't work any more
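These properties can be observed directly with the standard TextEncoder API (which always encodes to UTF-8, available in modern browsers and Node):

```javascript
const enc = new TextEncoder(); // always produces UTF-8

// ASCII characters keep their single-byte ASCII encoding:
console.log(enc.encode("a"));  // Uint8Array [ 0x61 ]

// Code points outside ASCII take multiple bytes:
console.log(enc.encode("ä"));  // Uint8Array [ 0xc3, 0xa4 ]             (2 bytes)
console.log(enc.encode("💩")); // Uint8Array [ 0xf0, 0x9f, 0x92, 0xa9 ] (4 bytes)
```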
UTF-16
• Combines the worst of both worlds
• Uses 16 bits to encode a code point
• Uses surrogate pairs to encode code points outside of the BMP
• Wastes memory for ASCII, has byte-ordering issues and still breaks the old algorithms.
• Is the only way for the 16 bit bandwagon jumpers to support Unicode 2.0 and later
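JavaScript strings expose exactly this UTF-16 representation, so a surrogate pair is directly observable:

```javascript
// U+1F4A9 lies outside the BMP, so UTF-16 encodes it as a
// surrogate pair: a high surrogate followed by a low surrogate.
const poo = "💩";
console.log(poo.length);                      // 2 code units
console.log(poo.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(poo.charCodeAt(1).toString(16));  // "dca9" (low surrogate)

// The pair reassembles into the real code point:
console.log(poo.codePointAt(0).toString(16)); // "1f4a9"
```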
UTF-32
• 4 bytes per character
• Byte ordering issues
• Still breaking the old algorithms due to combining marks
So. What is a String?
It's not a collection of bytes
• A string is a sequence of graphemes
• Or a sequence of Unicode code points
• A byte array is a sequence of bytes
• Both are incompatible with each other
• You can encode a string into a byte array
• You can decode a byte array into a string
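In JavaScript this encode/decode boundary is the TextEncoder/TextDecoder pair; a sketch of the round trip, including the "not every byte sequence is valid" case:

```javascript
const bytes = new TextEncoder().encode("ümläüte");    // string -> Uint8Array (UTF-8)
const text  = new TextDecoder("utf-8").decode(bytes); // Uint8Array -> string

console.log(bytes instanceof Uint8Array); // true
console.log(text === "ümläüte");          // true: a lossless round trip

// Not every byte sequence decodes: 0xFF never appears in valid UTF-8.
// In fatal mode the decoder throws instead of guessing.
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xff]));
} catch (e) {
  console.log("invalid UTF-8 rejected");
}
```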
Back to counting
Intermission: Combining Marks
• ä is not the same as ä
• ä can be "LATIN SMALL LETTER A WITH DIAERESIS"
• it can also be "LATIN SMALL LETTER A" followed by "COMBINING DIAERESIS"
• both look exactly the same
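This is easy to verify; String.prototype.normalize converts between the two forms:

```javascript
const composed   = "\u00E4"; // ä: LATIN SMALL LETTER A WITH DIAERESIS
const decomposed = "a\u0308"; // a followed by COMBINING DIAERESIS

// They render identically, but are different strings:
console.log(composed === decomposed);            // false
console.log(composed.length, decomposed.length); // 1 2

// normalize() maps both to a common form:
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true
```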
Counting Lengths
• You could count the length in graphemes
• Or the length in Unicode code points
• Or the length of your binary blob once your string has been encoded.
• Or something in between because you were trigger-happy in the 90ies
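The counts really do diverge for the same string; a sketch using modern APIs (Intl.Segmenter requires a reasonably recent engine, hence the guard):

```javascript
const s = "💩"; // U+1F4A9

// Length in UTF-16 code units (what JS .length actually gives you):
console.log(s.length);                           // 2

// Length in Unicode code points:
console.log([...s].length);                      // 1

// Length of the encoded binary blob (UTF-8 here):
console.log(new TextEncoder().encode(s).length); // 4

// Length in graphemes, where Intl.Segmenter is available:
if (typeof Intl !== "undefined" && Intl.Segmenter) {
  const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
  console.log([...seg.segment(s)].length);       // 1
}
```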
Which brings us back to JS, Java and friends
• Live back in 1996
• Strings specified as being stored in UCS-2 (fixed 16 bits per character)
• Leak this implementation in their APIs
• Don't know about Unicode 2.0
Cheating abounds
• Applications still want to support Unicode 2.0
• We need to display these piles of poo!
• String APIs use UTF-16, encoding characters outside of the BMP as surrogate pairs
• String APIs often don't know about UTF-16
Let's talk EcmaScript
String methods are leaky
• String.length returns the UTF-16 code unit count: a mish-mash of byte length and code point length for strings outside the BMP
• substr() can break strings
• charAt() can return non-existent code points
• and let's not talk about to*Case
Samples
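A few illustrative samples of these leaks (my own examples, not necessarily the ones shown on the slide):

```javascript
const poo = "💩";

// .length: 2 code units for 1 code point / 1 grapheme
console.log(poo.length); // 2

// substring() happily cuts a surrogate pair in half:
const broken = poo.substring(0, 1);
console.log(broken.length);                     // 1
console.log(broken.charCodeAt(0).toString(16)); // "d83d": a lone high surrogate

// charAt() returns that non-character too:
console.log(poo.charAt(0) === "\uD83D"); // true

// and to*Case is locale-blind by default:
console.log("I".toLowerCase());           // "i" — wrong for Turkish
console.log("I".toLocaleLowerCase("tr")); // "ı" in engines with full locale data
```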
Et tu RegEx?
• Character classes don't work right
• Counting characters doesn't work right
• Can break strings
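For example, the . character class matches code units, not code points (ES2015's u flag fixes this):

```javascript
// Without the u flag, . matches a single UTF-16 code unit, so a
// non-BMP character is "two characters" to the regex engine:
console.log(/^.$/.test("💩"));  // false
console.log(/^..$/.test("💩")); // true (!)

// The ES2015 u flag makes regexes code-point aware:
console.log(/^.$/u.test("💩")); // true

// And a replace can cut a surrogate pair in half, breaking the string:
console.log("💩".replace(/^./, "")); // a lone low surrogate remains
```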
No Normalization
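Regex matching does no Unicode normalization either: composed and decomposed forms of the same grapheme don't match each other, a sketch:

```javascript
const decomposed = "a\u0308"; // a + COMBINING DIAERESIS, renders as ä

// The pattern contains the precomposed ä (U+00E4):
console.log(/\u00E4/.test(decomposed));                  // false: no match
console.log(/\u00E4/.test(decomposed.normalize("NFC"))); // true, but only after
                                                         // normalizing yourself
```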
Real-World example
«("💩")» is 5 graphemes long. I counted 6 underscores :-)
Also: Mixed Content Warning? What gives?
Some did it ok
• Python 3.3 (PEP 393)
• Ruby 1.9 (avoids political issues by giving a lot of freedom)
• Perl (awesome libraries since forever)
• Swift after a very rough start
• ICU, ICU4C (http://icu-project.org/)
ES2015
still some 💩 around
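ES2015 added code-point-aware tools: string iteration, codePointAt, String.fromCodePoint and the regex u flag. A quick tour, including what it did not fix:

```javascript
// for...of and spread iterate code points, not code units:
console.log([..."💩"].length);                 // 1
for (const ch of "💩") console.log(ch);        // a single iteration

// Code point in and out:
console.log("💩".codePointAt(0).toString(16)); // "1f4a9"
console.log(String.fromCodePoint(0x1F4A9));    // "💩"

// The 💩 that's still around: .length, charAt, substring and friends
// remain code-unit based, and none of this understands graphemes.
console.log("💩".length);                      // still 2
```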
This was just the tip of the iceberg!
• Localization issues (Collation, Case change)
• Security issues (Encoding, Homographs)
• Broken Software (including "US UTF-8")
Highly recommended Literature
Pop Quiz
"👨‍❤️‍💋‍👨".length
[..."👨‍❤️‍💋‍👨"].length
In case you answered 11 and 8, I salute you
• U+1F468 (MAN) 👨
• U+200D (ZERO WIDTH JOINER)
• U+2764 (HEAVY BLACK HEART) ❤
• U+FE0F (VARIATION SELECTOR-16)
• U+200D (ZERO WIDTH JOINER)
• U+1F48B (KISS MARK) 💋
• U+200D (ZERO WIDTH JOINER)
• U+1F468 (MAN) 👨
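The arithmetic checks out in a console; the code point sequence above is the single emoji 👨‍❤️‍💋‍👨:

```javascript
const kiss = "\u{1F468}\u{200D}\u{2764}\u{FE0F}\u{200D}\u{1F48B}\u{200D}\u{1F468}";

// .length counts UTF-16 code units: U+1F468 and U+1F48B need a
// surrogate pair (2 each), the five BMP code points take 1 each.
console.log(kiss.length);      // 11

// Spreading iterates code points: 8 of them, all joined by ZWJs
// into what renders as one grapheme.
console.log([...kiss].length); // 8
```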