Efficiently extracting a substring
Published on 5 May 2020 • 3 min

Here is the second article of our daily series: “19 nuggets of vanilla JS.” This time we’ll talk about extracting part of a string, and see that there are no fewer than three ways to go about it… but only one should stick with you 😉

The series of 19

Check out surrounding posts from the series:

  1. Efficiently deduplicating an array
  2. Efficiently extracting a substring (this post)
  3. Properly formatting a number
  4. Array#splice
  5. …and beyond! (fear not, all 19 are scheduled already)…

The Ugly: substr(…)

Did you know? Strings have a substr method. You didn’t know? Good for you! It can’t be trusted and is not even handy.

  • It’s not quite official. It lives in Annex B of the spec. That annex has been “normative” since ES2015 (it was merely “informative” before), but it covers the parts of the language and its standard library that were never quite clean and have been actively discouraged, sometimes for a long time (substr has been frowned upon ever since ES3; that’s 1999, folks).
  • It has an unusual signature: substr(index, length). Yes, length. Not two indices, but one index and one length.
  • It has incompatible implementations. In particular, although it explicitly allows negative indices to start from the end (which is good!), this facet doesn’t work in JScript, the JS engine in Internet Explorer pre-9.0.
'DEBBIE is a missionary'.substr(4, 12).replace('i', 'e')
// => … You know you’ll run this ;-)
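
To make the index-plus-length signature concrete, here is a minimal sketch that puts substr next to the equivalent slice call. The sample string is made up for illustration and is not from the examples above.

```javascript
// Hypothetical sample string, chosen only for illustration.
const text = 'vanilla JS';

// substr takes a start index and a LENGTH...
const viaSubstr = text.substr(2, 4);

// ...so the equivalent slice call needs an end index of start + length.
const viaSlice = text.slice(2, 2 + 4);

console.log(viaSubstr); // 'nill'
console.log(viaSlice);  // 'nill'
```

Same result, but with slice the second argument reads as a position in the string rather than a count.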

It also sports a lousy name, truncated haphazardly, which reminds me of the dark early days of PHP (nl2br, yes, I’m looking at you—and many others).

So throw this method to the trash.

The Bad: substring(…)

Many fine folks use substring. Many folks indeed. Way too many folks. It’s kinda like this !@# parseInt: everybody thinks that yeah, okay, I got this. Then right when you do your most critical deployment ever, bam! The hidden bug. The caveat. The pitfall.

The name is clear though, I’ll give it that. And arguments are indices, which is cool.

BUT—!

  • Indices can’t be negative (no end-of-string comfort there)
  • There’s a Nasty Joke™ if the second argument is less than the first.
'I have you darling'.substring(10, 0).replace('v', 't')
// => 😤💩 dammit!

Guessed it? Yup, if the second index is less than the first, they get inverted! What could go wrong?! Sure, it has to be exactly what we intended, just like new Date(2020, 0, -6) lands on Christmas 2019, that makes perfect sense!
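
Both pitfalls are easy to see side by side with slice. This is a small sketch on a made-up string; note how substring quietly treats a negative index as 0 instead of counting from the end.

```javascript
// Hypothetical sample string, chosen only for illustration.
const text = 'vanilla JS';

// Reversed arguments: substring silently swaps them, slice returns ''.
const swapped = text.substring(7, 0); // behaves like substring(0, 7)
const empty = text.slice(7, 0);

// Negative index: substring clamps it to 0, slice counts from the end.
const clamped = text.substring(-2); // behaves like substring(0)
const fromEnd = text.slice(-2);

console.log(swapped); // 'vanilla'
console.log(empty);   // ''
console.log(clamped); // 'vanilla JS'
console.log(fromEnd); // 'JS'
```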

Thank you, next!

The Good: slice(…)

Here’s our good friend at last! You probably know slice from arrays, well it’s also available on strings, and the API is exactly the same, which is nifty: there are more than enough APIs to remember, so when we can reuse one… Many good things to say, then:

  • 100% API-compatible with the slice from Array
  • Two indices, both allowing negative values (and as usual, the second one is exclusive)
  • No weird-ass inversion if the second one is less than the first one

Gotta love it! 😍

There are two more niceties. It shares them with the two prior candidates, so they’re not exactly benefits, but I’ll list them anyway:

  • Omit the second index: go to the end of the string
  • Omit even the first index: grab the whole string
'<love>'.slice(1, -1)
// => 'love'
'Living on the Edge'.slice(-4)
// => 'Edge'
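
To illustrate the “same API as arrays” point, here is a quick sketch that runs identical slice calls on a string and on an array of its letters. The variable names are mine, picked for the example.

```javascript
// Hypothetical sample data, chosen only for illustration.
const word = 'nuggets';
const letters = Array.from(word); // ['n', 'u', 'g', 'g', 'e', 't', 's']

// Same arguments, same meaning on both types.
const fromString = word.slice(1, -1);
const fromArray = letters.slice(1, -1).join('');

// Omitting indices works the same way too.
const tail = word.slice(3); // from index 3 to the end
const whole = word.slice(); // the whole string

console.log(fromString); // 'ugget'
console.log(fromArray);  // 'ugget'
console.log(tail);       // 'gets'
console.log(whole);      // 'nuggets'
```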

Finally!

“Yeah but that doesn’t do kawaii!”

As you no doubt have gathered, slice is my friend. Still, like all traditional String APIs, it often stumbles on Unicode. We’ll circle back to this soon (spoiler alert), but JS strings are, much like Java’s (argh!), sequences of UTF-16 code units (a UCS-2 heritage), and what the API incorrectly refers to as characters (charAt, charCodeAt, etc.) are actually 16-bit (2-byte) code units. This is plenty for Latin characters, digits, the usual Western punctuation and even most everyday Chinese ideograms and Japanese kanji, but the moment we go past U+FFFF, say with rarer CJK ideograms or straight-up emojis, things start falling apart and we need a surrogate pair:

'😍'.length // => 2 🤔
'😍 👨‍👩‍👦‍👦'.length // => 14 😱

Yup, '😍' actually holds two code units. Normalized as ASCII source, we’d need to write '\ud83d\ude0d'. Lovely, right? One emoji, but a string of “length” 2. One codepoint, two code units making up a surrogate pair. Hence:

'For real 😍'.slice(9, 10) // => '\ud83d', a lone surrogate (invalid character)
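
If you want to see the two halves for yourself, the following sketch inspects the code units and checks what slice left behind. The emoji is written as an escape so the snippet survives any editor encoding; the variable names are illustrative.

```javascript
const heart = '\u{1F60D}'; // the 😍 emoji, written as a code point escape

// Two 16-bit code units, as seen by charCodeAt...
const units = [heart.charCodeAt(0), heart.charCodeAt(1)].map((u) => u.toString(16));

// ...but a single code point, as seen by codePointAt.
const codePoint = heart.codePointAt(0).toString(16);

// Cutting between the two units leaves only the high surrogate.
const half = ('For real ' + heart).slice(9, 10);
const isLoneHighSurrogate = /^[\uD800-\uDBFF]$/.test(half);

console.log(units);               // [ 'd83d', 'de0d' ]
console.log(codePoint);           // '1f60d'
console.log(isLoneHighSurrogate); // true
```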

So how can we extract a segment “in a codepoint sense?” If we really need to, we can leverage the fact that since ES2015, strings are iterable by codepoints, not by code units. Turn them into an array of codepoints, slice that array and rebuild the string from it:

Array.from('For real 😍').slice(9, 10).join('') // => '😍'
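
If this comes up more than once, the one-liner can be wrapped in a small function. sliceCodePoints is a hypothetical helper name of mine, not a built-in; it is only a sketch and accepts the same arguments as slice, negative indices included.

```javascript
// Hypothetical helper (not a built-in): slice by code points, not code units.
function sliceCodePoints(text, start, end) {
  // Array.from iterates the string by code points, so surrogate pairs stay whole.
  return Array.from(text).slice(start, end).join('');
}

const sample = 'For real \u{1F60D}'; // 'For real 😍'

console.log(sliceCodePoints(sample, 9, 10)); // '😍'
console.log(sliceCodePoints(sample, -1));    // '😍'
console.log(sliceCodePoints(sample, 0, 3));  // 'For'
```

Bear in mind that this builds a full array for every call, so it is best kept for short strings or occasional use.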

Phew! That still won’t handle codepoint combinations based on ZWJs (Zero-Width Joiners), so we’re not always in the clear…

Array.from('😍 👨‍👩‍👦‍👦').slice(2, 3).join('')
// => '👨' -- The Mrs hightailed it with the kids

…but still, with a bit of luck we can reunite the whole family:

Array.from('😍 👨‍👩‍👦‍👦').slice(2).join('') // => '👨‍👩‍👦‍👦'
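
For what it’s worth, luck is not the only option. This is a different technique from the Array.from trick above: a sketch that slices by grapheme clusters using Intl.Segmenter, assuming your runtime ships it (recent browsers and Node.js versions do). sliceGraphemes is a hypothetical helper name, and the emojis are written as escapes to keep the ZWJs visible.

```javascript
// Hypothetical helper (not a built-in): slice by user-perceived characters.
// Assumes Intl.Segmenter is available in the runtime.
function sliceGraphemes(text, start, end) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // Each segment is one grapheme cluster; ZWJ sequences stay in one piece.
  const graphemes = Array.from(segmenter.segment(text), (part) => part.segment);
  return graphemes.slice(start, end).join('');
}

// Man + ZWJ + woman + ZWJ + boy + ZWJ + boy: the family emoji.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}';
const text = '\u{1F60D} ' + family;

const wholeFamily = sliceGraphemes(text, 2, 3);
console.log(wholeFamily === family); // true
```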

I love a happy ending.