Strings and Unicode in JavaScript
Published on 8 May 2020
• 5 min
This page is also available in French.
Time for the fifth article of our daily series: “19 nuggets of vanilla JS.” We’ll explore the rich and complex relationship between JavaScript strings and Unicode; because the time of ISO-Latin-15 / Windows-1252 is long gone, folks…
The series of 19
Check out surrounding posts from the series:
- Properly formatting a number
- Array#splice
- Strings and Unicode (this post)
- Short-circuiting nested loops
- Inverting two values with destructuring
- …and beyond! (fear not, all 19 are scheduled already)…
A complex relationship…
We’re told that the `String` type has always “been Unicode.” After all, we can indeed put any Unicode-defined glyph in there and it works. But in practice, the encoding they went with, along with the misleading terminology of the API, has some issues.
Incidentally, almost everything we’ll discuss also applies to Java’s `String`, because both were designed at the same time in a quasi-identical way. (But JavaScript doesn’t look anything like Java, don’t go pretending I ever said such a thing!)
UTF-16
JavaScript strings are encoded using UTF-16 (or UCS-2, depending on the implementation; it’s a distinction without much of a difference). Every string position therefore refers to 16 bits of data, or 2 bytes. This is indeed enough to encode any Unicode codepoint in the U+0000 to U+FFFF range, but not beyond (and there is a truckload beyond: all told, Unicode defines around 144,000 characters). This 16-bit block is called a code unit.
For instance, emojis, along with many lesser-known or ancient alphabets (such as Ugaritic or Phoenician) and graphical sets (Mahjong tiles, dominoes, cards…), lie beyond the 16-bit range, so they require two combined values, each of which is invalid when standalone: the combo is called a surrogate pair. Sure, this mostly pertains to extended graphical glyphs and extinct languages (like, extinct), but still.
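A quick illustration of that combo (a hedged sketch; the escapes below spell out the surrogate pair for 😅):
'\ud83d'.length // => 1 (a lone lead surrogate: a legal string, but no real glyph)
'\ud83d' + '\ude05' // => '😅' (lead + trail combine into a single codepoint)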
Characters, codepoints, code units and surrogate pairs
For most people, a “character” is a full-blown entity; a “cell” in the Unicode table, so to speak. Such a cell is actually called a codepoint.
This analogy only goes so far, as many codepoints do not represent a full character, but more discrete, technical elements of one that can be invisible (such as the hyphenation point or the zero-width joiner) or diacritical marks (e.g. the acute accent).
Still, for practical purposes, a Chinese ideogram, a Georgian digit, a Babylonian pictogram or an emoji are all things we perceive as “a character” when looking at a text.
Yet in practice, because of the UTF-16 / UCS-2 encoding, many codepoints require a surrogate pair, hence two code units, which through the API of JavaScript’s `String` means “two `char`s.” In fact, a length of 2.
You read that right! `charAt` returns a code unit, not a codepoint. The same goes for `charCodeAt`, based on code units, and `length`, which gives the number of code units. Basically, almost the whole `String` API is based on code units. Check this out:
'🃑🃑🃑🃞🃞'.length // => 10 (5 cards / surrogate pairs)
'😅'.charCodeAt(1) // => 56837 / U+DE05, which is a “trail surrogate”
'😅'.split('') // => ['\ud83d', '\ude05']
Do notice the escape sequence `\uXXXX` we’ve always been able to use in `String` literals (or in CSS): it allows 4 hexadecimal digits, covering only 16 bits, hence a code unit, not a codepoint! How do you write a non-literal emoji, then? You need to grab its codepoint, convert it to a surrogate pair and type it all out: `'\ud83d\ude05' === '😅'`.
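The conversion itself is mechanical. Here is a minimal sketch of it (the helper name is mine, not a standard API):
// Derive the surrogate pair for a codepoint beyond U+FFFF,
// following the UTF-16 encoding rules.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000
  const lead = 0xd800 + (offset >> 10) // high surrogate
  const trail = 0xdc00 + (offset & 0x3ff) // low surrogate
  return [lead, trail]
}

toSurrogatePair(0x1f605).map((cu) => cu.toString(16)) // => ['d83d', 'de05']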
Quite the clusterfuck, wouldn’t you say?
Well, some things do work fine, like `toLocaleUpperCase()`, `localeCompare(…)` and their friends, which do know about the encoding (and are barely impacted), so there’s that…
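For instance (a hedged illustration: for `localeCompare`, the spec only guarantees the sign of the result, with zero meaning equivalence):
'ß'.toLocaleUpperCase() // => 'SS' (case mapping can even change the length!)
'déjà'.localeCompare('deja', 'fr', { sensitivity: 'base' }) // => 0 (accent-insensitive comparison)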
ES2015: literal codepoints
ES2015 (formerly known as “ES6”) brought a lot to the table when it comes to Unicode.
Let’s start with escape sequences: having to compute the surrogate pair for anything beyond the first 16 bits was super annoying, even though getting the codepoint itself was easy enough.
We thus get a new escape sequence for Unicode, using curly braces: `\u{…}`. It accepts the whole codepoint, making it instantly more dev-friendly.
'\u{1f605}' === '😅' // => true
I myself favor dropping the literal character unless it’s invisible, such as `'\u202f'` (the narrow no-break space used as a group delimiter or unit separator in French number formatting) or the soft hyphen (`\u00ad`), a truly invisible character that represents a hyphenation point which, when activated, results in a hyphen mark. For such use cases, an explicit escape sequence is easily identified when browsing the code.
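A quick hedged check of why escapes help here (the soft hyphen below is invisible in most renderings, yet very much present):
'ex\u00adample'.length // => 8 (the invisible soft hyphen still counts as a code unit)
'ex\u00adample' === 'example' // => false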
ES2015: codepoint-based APIs
We also get three new `String` methods pertaining to Unicode.
codePointAt(index)
It’s just like `charCodeAt(index)`, except that when the given position (expressed in code units) lands on the beginning of a surrogate pair (referred to as the lead surrogate or high surrogate), instead of just returning that, it grabs the rest of the codepoint from the following code unit (the trail surrogate or low surrogate).
'😅'.charCodeAt(0) // => 55357 / U+D83D “lead surrogate”
'😅'.charCodeAt(1) // => 56837 / U+DE05, “trail surrogate”
'😅'.codePointAt(0) // => 128517 / U+1F605 (full codepoint)
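One gotcha worth noting (a hedged aside): positions are still expressed in code units, so pointing inside a pair yields just the lone trail surrogate.
'😅'.codePointAt(1) // => 56837 / U+DE05 (no pair starts at index 1, so you get the lone trail surrogate)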
String.fromCodePoint(…)
It’s akin to `String.fromCharCode(…)`, but accepts codepoints instead of code units.
String.fromCharCode(0xd83d, 0xde05) // => '😅'
String.fromCodePoint(0x1f605, 0x25b6, 0xfe0f, 0x1f60d) // => '😅▶️😍'
normalize([form])
When exploring Unicode, you usually stumble upon the concept of normalization, along with notions of canonical, compatible and (de)composed forms. I won’t get into a full-blown course on this topic, which is not too hard but falls outside the scope of this post.
A key tenet is that many codepoints actually represent glyphs that happen to be composable from other, separate glyphs. For instance, the “acute uppercase E” (that is, “É”) can in Unicode be represented either by slapping an acute accent diacritic (`\u0301`) right after an E glyph (`\u0045`), or by using the actual uppercase acute E glyph (`\u00c9`):
const obfuscated = '\u0045\u0301lodie' // => 'Élodie'
const plain = 'Élodie'
obfuscated === plain // => false
obfuscated.length // => 7
plain.length // => 6
You can use the various types of normalization (canonical or compatible, composed or decomposed) to work with homogeneous representations of texts. This can prove useful when said texts come from a variety of sources that did not necessarily take the same Unicode approach, but you want to compare them correctly.
obfuscated.normalize('NFC') === plain // => true
obfuscated === plain.normalize('NFD') // => true
The default target form is `NFC`, the composed canonical form: the one that usually ends up consuming the fewest code units, folding down to the canonical codepoint whenever possible.
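Normalization also enables handy tricks. Here is a hedged sketch of a common one: stripping accents by decomposing first, then removing the combining marks (the U+0300–U+036F block):
'Élodie'.normalize('NFD').replace(/[\u0300-\u036f]/g, '') // => 'Elodie' (NFD splits base letters from combining marks)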
Dig further into `normalize(…)`.
ES2015: codepoint-based iterability
ES2015 surfaces the hitherto-hidden notion of iterability, especially through the iterable protocol that we can implement in our own objects, and that many native or host objects (e.g. `String`, `Array`, `Map`, `Set` or `NodeList`) implement out of the box.
For strings, iterability is based on codepoints 🎉. Yay!
What this means is that any code consuming iterables will, when operating on strings, use full codepoints. This includes the `for…of` loop, the spread operator (`...`), positional destructuring and all the iterable-consuming APIs (e.g. `Array.from`, `new Set`, `Promise.all`).
const text = 'Want some 🀄️ with your 🍵?'
for (const codePoint of text) {
console.log(codePoint)
// => won’t mistakenly break down 🀄️ or 🍵
}
Array.from(text)
// => ['W', 'a', 'n', 't', ' ', 's', 'o', 'm', 'e', ' ', '🀄️'…]
This means you can very much count the full codepoints in a text by converting it to an array (using its iterability) and grabbing the array’s size:
text.length // => 🔴 26 (each emoji counts double)
Array.from(text).length // => ✅ 24 (one entry per codepoint)
[...text].length // => ✅ 24
Do note this won’t necessarily match our perception of more advanced combinations as a “single” character. For instance, the emoji for “Family: man, woman, boy, boy” (👨‍👩‍👦‍👦) actually comprises “man” (👨), “woman” (👩) and twice “boy” (👦), joined by zero-width joiners, or ZWJs (pronounced “zwidje”), for a combined total of 7 codepoints, adding up to 11 code units (each emoji takes 2)! Normalization won’t help us on this one.
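A hedged check of those numbers, writing the ZWJs explicitly so they show up in the source:
const family = '👨\u200d👩\u200d👦\u200d👦' // man, woman, boy, boy, joined by ZWJs
family.length // => 11 code units (4 emojis × 2, plus 3 ZWJs)
[...family].length // => 7 codepoints (the ZWJs count!)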
If you’re into emoji combos using ZWJs (that let us alter gender, skin tone, hair color…), this is a cool page.
Bonus: Unicode in regexes
We’ll get back to the Unicode aspects of regular expressions in a later “nugget” article, but know that there is now a `u` flag on regexes, which lets us use codepoint-based escape sequences in there (you know, `\u{…}`) along with Unicode properties; that later article will uncover the full power of these.
'😅'.match(/\u{1f605}/) // => null
'😅'.match(/\u{1f605}/u) // => MatchResult ([ '😅', index: 0… ])
Careful though: the `u` flag does not alter the semantics of predefined classes, such as `\w` (alphanumerics) or `\d` (digits), which remain ASCII-based.
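A short hedged demo (the property escapes teased above, `\p{…}`, actually landed later, in ES2018, and require the `u` flag):
'٣'.match(/\d/u) // => null ('٣' is U+0663, the Arabic-Indic digit three; \d stays [0-9])
'٣'.match(/\p{Nd}/u) // => MatchResult ([ '٣', index: 0… ])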