Properly sorting texts
Published on 13 May 2020 • 5 min

This page is also available in French.

We’re already at the tenth post of our daily series “19 nuggets of vanilla JS.” Today we look at a recurring theme: sorting arrays in a smart (and clean) manner, even when the formatting gets complex. We seldom need to query the server or pull in a library for that: the JS standard library provides some outstanding capabilities!

The series of 19

Check out surrounding posts from the series:

  8. Easily stripping “blank values” from an array
  9. Long live numeric separators!
  10. Properly sorting texts (this post)
  11. Extracting emojis from a text
  12. Properly defining optional named parameters
  13. …and beyond! (fear not, all 19 are scheduled already)…

Array#sort(…)

It’s a classic: the sort(…) method on Array. I’m sure you’ve used it already; it’s easy:

const languages = ['Rust', 'C#', 'Haskell', 'Prolog', 'JS', 'Elixir']
languages.sort()
// => ['C#', 'Elixir', 'Haskell', 'JS', 'Prolog', 'Rust']

We could do that in our sleep…

Watch out for these mutants!

The first pitfall is that sort(…) is mutative: it modifies the array in-place, instead of generating a freshly derived one. True, this is not the only mutative method on Array: you’ll also find copyWithin, fill, pop, push, reverse, shift, splice and unshift. But it remains a minority (9 methods out of 32), and when you’re used to immutable behavior (original array untouched), this can bite:

const languages = ['Rust', 'C#', 'Haskell', 'Prolog', 'JS', 'Elixir']
const sortedLanguages = languages.sort()
languages[0] // => 'C#' -- d’oh.
languages === sortedLanguages // => true

It’s done that way for performance reasons: most of the time, you do want to sort the original array. If that trips you up, you can always clone it first:

const sortedLanguages = Array.from(languages).sort()
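Array.from(…) is not the only way to get that copy. Here is a quick sketch of two other common idioms (the variable names are just for illustration); all of them make a shallow copy, which is all sort(…) needs:

```javascript
const languages = ['Rust', 'C#', 'Haskell', 'Prolog', 'JS', 'Elixir']

// Both lines copy the array first, so sort() only mutates the copy
const viaSpread = [...languages].sort()
const viaSlice = languages.slice().sort()

const firstOriginal = languages[0] // => 'Rust' (original order preserved)
const firstSorted = viaSpread[0] // => 'C#'
const sameArray = viaSpread === languages // => false
```

Pick whichever reads best in your codebase; the result is the same.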

Sorting numbers

Look at this mess:

const fibonacciTerms = [1, 1, 2, 8, 3, 21, 5, 13, 34, 144, 55, 89]
fibonacciTerms.sort()
// => [1, 1, 13, 144, 2, 21, 3, 34, 5, 55, 8, 89]

[Image: Luke Skywalker yelling “Nooooooo!”]

He he, that old trap. As JavaScript uses dynamic typing, classical arrays (the Array type, as opposed to numeric typed arrays) can end up holding values of multiple types, for instance String, Number, etc.

As a result, to sort it all, we need a way to represent values in an interoperable way. And the only type all other types can converge to is String (which is why all objects have a toString() method).

This is what sort() defaults to: it converts all values to String before comparing them, using a good old < operator, of all things.

Not cool.
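To make that default tangible, here is a rough sketch of what it amounts to. The defaultLikeCompare function is a made-up helper, not something the engine exposes, and it glosses over details such as undefined values:

```javascript
const fibonacciTerms = [1, 1, 2, 8, 3, 21, 5, 13, 34, 144, 55, 89]

// Hypothetical helper: roughly what the default comparison does
function defaultLikeCompare(a, b) {
  const s1 = String(a)
  const s2 = String(b)
  if (s1 < s2) return -1
  if (s1 > s2) return 1
  return 0
}

const explicit = [...fibonacciTerms].sort(defaultLikeCompare)
const implicit = [...fibonacciTerms].sort()
// both => [1, 1, 13, 144, 2, 21, 3, 34, 5, 55, 8, 89]

const thirteenBeforeTwo = '13' < '2' // => true, because '1' comes before '2'
```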

So what’s a developer to do? We need to provide our own comparator, my friend! sort(…) accepts an optional argument that is a comparison function:

  • It will receive two arguments: two values from the array to compare
  • It returns a negative number if the first argument comes earlier, a positive one if the first argument comes later, and a neutral one (zero) if they are deemed equivalent.

If you have a Java background, it’s much like doing an anonymous implementation of the java.util.Comparator interface. But less verbose.
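Here is a minimal sketch of that contract, spelled out the long way. The byLength function is a hypothetical comparator that orders strings by their length:

```javascript
// Hypothetical comparator: shorter strings first
function byLength(s1, s2) {
  if (s1.length < s2.length) return -1 // s1 comes earlier
  if (s1.length > s2.length) return 1 // s1 comes later
  return 0 // deemed equivalent
}

const languages = ['Rust', 'C#', 'Haskell', 'Prolog', 'JS', 'Elixir']
languages.sort(byLength)
// => ['C#', 'JS', 'Rust', 'Prolog', 'Elixir', 'Haskell']
```

Items the comparator deems equivalent ('C#' and 'JS', 'Prolog' and 'Elixir') keep their original relative order, as sort(…) is stable in engines that implement ES2019.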

For ascending sorts, we can simply return the difference between the two numbers: if a is smaller, the result will be negative, so a will be sorted earlier. For a descending sort, we use b - a instead, which negates the sign!

const fibonacciTerms = [1, 1, 2, 8, 3, 21, 5, 13, 34, 144, 55, 89]
fibonacciTerms.sort((a, b) => a - b)
// => [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]

Lovely.
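For completeness, here is a small sketch of the descending variant, plus the typed-array case mentioned earlier: since typed arrays can only hold numbers, their sort() compares numerically out of the box.

```javascript
const fibonacciTerms = [1, 1, 2, 8, 3, 21, 5, 13, 34, 144, 55, 89]

// Descending: swap the operands so that larger values come first
const descendingTerms = [...fibonacciTerms].sort((a, b) => b - a)
// => [144, 89, 55, 34, 21, 13, 8, 5, 3, 2, 1, 1]

// Typed arrays sort numerically by default, no comparator needed
const typedTerms = Uint8Array.from(fibonacciTerms).sort()
// => Uint8Array [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
```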

Advanced text sorting

It’s not just numbers that can trip you up. By default, comparison between strings is lexicographical: it uses the character table’s ordering (strictly speaking UTF-16 code units, which for most text boils down to Unicode code point order). Except this is not a natural order at all, folks:

const names = ['Élodie', 'christophe', 'Maxence', 'Elliott', 'Mårk']
names.sort()
// => ['Elliott', 'Maxence', 'Mårk', 'christophe', 'Élodie'] 😱😵

WTF?! I highly doubt any of your users would be happy with that unexpected result. Linguistic ordering usually follows a set of particular rules about diacritics (e.g. acute accents, cedillas), case (upper / lower), punctuation, and more. None of that here.

This corpus of rules, which can vary drastically from one locale to the next, is usually referred to as collation. You might have seen that in SQL, when defining tables and columns or writing a query, so that your order by on textual fields yields something reasonable. It’s a fairly universal data processing concept.

Well guess what? str1 < str2 doesn’t give a hoot about collation.
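If you wonder why the default order came out that way, a quick peek at the underlying code units explains it. This is just an illustrative sketch; the accented letters are written as escapes so we’re sure to get the precomposed characters:

```javascript
// '\u00C9' is 'É' and '\u00E5' is 'å'
const names = ['\u00C9lodie', 'christophe', 'Maxence', 'Elliott', 'M\u00E5rk']

const firstCodes = names.map((name) => `${name[0]}=${name.charCodeAt(0)}`)
// => ['É=201', 'c=99', 'M=77', 'E=69', 'M=77']

const upperBeforeLower = 'Z' < 'a' // => true: 90 < 97
const accentsLast = 'z' < '\u00C9' // => true: 122 < 201
const xBeforeARing = 'x' < '\u00E5' // => true: 120 < 229, hence 'Maxence' before 'Mårk'
```

All uppercase ASCII letters come before all lowercase ones, and accented letters come after both, whatever the language says about it.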

Respecting the locale

Let’s go with something better.

String#localeCompare(…)

At a minimum, we can use the basic version of localeCompare(…), a String method that’s been around since ES3 (1999). It compares two strings by following the rules of the active locale, and returns a negative number, zero or a positive number (often -1, 0 or 1, but only the sign is guaranteed). Even when that’s all you’ve got and you can’t use an explicit locale, it’s miles better than <, especially since a majority of locales tend to converge about their sorting rules:

const names = ['Élodie', 'christophe', 'Maxence', 'Elliott', 'Mårk']
names.sort((s1, s2) => s1.localeCompare(s2))
// => ['christophe', 'Elliott', 'Élodie', 'Mårk', 'Maxence'] 😍

That is already awesome, but when our JS engine supports ECMA-402, the standard for the Intl API (which is an integral part of JavaScript’s standard library), this method becomes much more powerful, as it turns into a shortcut for features made available by Intl.

In practice, we’ve had that since IE11, Firefox 60, Chrome 74, Edge 15, Safari 10, Opera 61 and Node 0.12! Quite enough…

We mentioned already in nugget #3 that it gave superpowers to Number#toLocaleString(…), by putting it on top of Intl.NumberFormat. This happens here too: String#localeCompare(…) becomes a wrapper around Intl.Collator.
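Here is a small sketch of that equivalence, assuming an ECMA-402 engine. The sample pairs and the options are arbitrary picks for illustration; we only compare signs, as the exact numbers returned are not guaranteed:

```javascript
const options = { sensitivity: 'base' } // ignore case and accents
const collator = new Intl.Collator('en', options)

const pairs = [
  ['a', 'A'],
  ['resume', 'r\u00E9sum\u00E9'], // 'résumé'
  ['apple', 'banana'],
  ['pear', 'fig'],
]

// Run each pair through both routes and keep only the sign
const results = pairs.map(([s1, s2]) => ({
  viaString: Math.sign(s1.localeCompare(s2, 'en', options)),
  viaCollator: Math.sign(collator.compare(s1, s2)),
}))
// Both routes agree on every pair: 0, 0, -1 and 1 respectively
```

When sorting large arrays, building the collator once and passing its compare function around (as in the following examples) is usually the better call, since localeCompare(…) with explicit arguments has to resolve the locale and options on every single comparison.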

Intl.Collator

This object deals with, you guessed it, collation. Just as with Intl.NumberFormat, we provide a locale and any option we’d need. And what options these are! So many cool tricks. Let’s talk about my two favorite ones.

First, numeric can be set to true to sort numerically the segments that are… numeric (cough) in our texts:

const products = ['Nimbus 2000', 'Nimbus 3', 'Nimbus 400']
const sorter = new Intl.Collator('en-GB', { numeric: true })
products.sort(sorter.compare)
// => ['Nimbus 3', 'Nimbus 400', 'Nimbus 2000']

Isn’t that adorable? And this doesn’t need to happen at the trailing end of your texts, it can be anywhere, and you can have multiple numeric segments.
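As a quick sketch of that, here are some made-up file names with two numeric segments each, neither of which sits at the end:

```javascript
// Hypothetical file names, for illustration only
const files = ['img12_v2.png', 'img2_v10.png', 'img2_v9.png', 'img1_v1.png']
const sorter = new Intl.Collator('en-GB', { numeric: true })
files.sort(sorter.compare)
// => ['img1_v1.png', 'img2_v9.png', 'img2_v10.png', 'img12_v2.png']
```

The first segment decides most of the order (1, 2, 2, 12), and the second one breaks the tie between the two img2 files (9 before 10).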

My other favorite option is ignorePunctuation, which you can set to true to… ignore punctuation. I know. Let’s first look at the behavior without the option:

const names = ['Jean-Pascal', 'Jeanne', 'Jean “Ze J” Louis']
const sorter = new Intl.Collator('fr-FR')
names.sort(sorter.compare)
// => ['Jean “Ze J” Louis', 'Jean-Pascal', 'Jeanne'] 😒

Yeah, not so good. Let’s try with the option then:

const names = ['Jean-Pascal', 'Jeanne', 'Jean “Ze J” Louis']
const sorter = new Intl.Collator('fr-FR', { ignorePunctuation: true })
names.sort(sorter.compare)
// => ['Jeanne', 'Jean-Pascal', 'Jean “Ze J” Louis'] 😍

I loooooooove it!

There are more exotic options too, such as caseFirst, sensitivity or usage, but I seldom encounter them in production. I’ll leave it to you to read the docs 😱 if you’re curious…
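If you’d rather get a quick taste before hitting the docs, here is a minimal sketch of two of them. The sample values are mine, and the variable names are just for illustration:

```javascript
// sensitivity: 'base' treats case and accent differences as equal
const accentBlind = new Intl.Collator('en', { sensitivity: 'base' })
const sameName = accentBlind.compare('\u00C9lodie', 'elodie') // => 0 ('Élodie' vs 'elodie')

// caseFirst: 'upper' puts uppercase ahead of lowercase when letters tie
const upperFirst = new Intl.Collator('en', { caseFirst: 'upper' })
const letters = ['a', 'A', 'b', 'B']
letters.sort(upperFirst.compare)
// => ['A', 'a', 'B', 'b']
```

The first one is handy for forgiving lookups (think search-as-you-type), the second for lists where capitalized entries should lead.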

Examples from outer space

Not so long ago, one of my clients hit a snag with one of their sorting needs (browser-side). They had texts that described dimensions. Units were homogeneous, but this was still a mess:

const dimensions = [
  '40cm × 50cm',
  '40cm* × 45cm',
  '100cm × 120cm',
  '100cm × 50cm',
  '40cm × 40cm',
  '30cm × 40cm',
]
dimensions.sort()
// => ['100cm × 120cm', '100cm × 50cm', '30cm × 40cm',
// => '40cm × 40cm', '40cm × 50cm', '40cm* × 45cm']

Let’s start with a numeric sort:

const sorter = new Intl.Collator('en-US', { numeric: true })
dimensions.sort(sorter.compare)
// => ['30cm × 40cm', '40cm × 40cm', '40cm × 50cm',
// => '40cm* × 45cm', '100cm × 50cm', '100cm × 120cm']

That’s a lot better already, but this asterisk prevents its dimension from getting properly sorted… Let’s add another option:

const sorter = new Intl.Collator('en-US', {
  numeric: true,
  ignorePunctuation: true,
})
dimensions.sort(sorter.compare)
// => ['30cm × 40cm', '40cm × 40cm', '40cm* × 45cm',
// '40cm × 50cm', '100cm × 50cm', '100cm × 120cm']

Profit!

Need a reverse sort?

Should you need a reverse sort (a descending-order sort, basically), don’t go around sorting then reversing: you’d pay a performance penalty for the extra full traversal of your array, which would be a shame.

Instead, just negate the sign of your comparator:

const names = ['Élodie', 'christophe', 'Maxence', 'Elliott', 'Mårk']
names.sort((s1, s2) => -s1.localeCompare(s2))
// => ['Maxence', 'Mårk', 'Élodie', 'Elliott', 'christophe']

dimensions.sort((s1, s2) => -sorter.compare(s1, s2))
// => ['100cm × 120cm', '100cm × 50cm', '40cm × 50cm',
// => '40cm* × 45cm', '40cm × 40cm', '30cm × 40cm']
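If you flip comparators often, you could wrap that trick in a tiny helper. This is only a sketch: descending is a hypothetical name, and it swaps the arguments rather than negating the result, which comes down to the same ordering:

```javascript
// Hypothetical helper: flips any comparator by swapping its arguments
const descending = (compare) => (a, b) => compare(b, a)

const sorter = new Intl.Collator('en-US', { numeric: true })
const products = ['Nimbus 2000', 'Nimbus 3', 'Nimbus 400']
products.sort(descending(sorter.compare))
// => ['Nimbus 2000', 'Nimbus 400', 'Nimbus 3']
```

It works with any comparison function, be it a collator’s compare or a plain (a, b) => a - b.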
