Efficiently deduplicating an array
Published on 4 May 2020 • 4 min

This page is also available in French.

This post opens our “19 nuggets of vanilla JS” series, with one short, nibble-sized article every day on a facet of the pure JavaScript language, or a pro tip, a best practice, a little-known capability, some mythbusting, some demystifying, etc. 19 reasons to come back!

The series of 19

Check out surrounding posts from the series:

  1. Efficiently deduplicating an array (this post)
  2. Efficiently extracting a substring
  3. Properly formatting a number
  4. …and beyond! (fear not, all 19 are scheduled already)…

“Deduplicate?”

Deduplicate: the action of stripping from a list all extraneous occurrences of values, to only retain unique values in the end.

The list doesn’t have to be sorted and values may have multiple types. The implementation can choose whether to preserve order of the original values (be “stable”) or not.
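
For instance (with made-up values), deduplicating [2, 1, 2, 3, 1] while preserving order yields [2, 1, 3].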

This is the kind of word non-IT folks seldom use in their daily lives.

Old-school

Can’t escape traversing the whole list.

If we can’t rely on the list being sorted (and its values being Strings), it’s even worse: at every step, we’ll need to traverse the ongoing result to check for a prior encounter of the current value! A traversal in a traversal: the algorithm is quadratic, of complexity O(n²). Roughly speaking, if your array has a thousand items, you’ll do a million turns… A rather naive implementation (especially considering we appear to be using ES2015+) could look like this:

function naiveUniq(items) {
  const result = []
  for (const item of items) {
    if (!result.includes(item)) {
      result.push(item)
    }
  }
  return result
}

Feeling jarred by for…of or Array#includes()? You’re more the ES3 (pre-2009) type, so here’s a version for you that is actually even slower:

function lousyUniq(items) {
  var result = [],
    len = items.length
  for (var index = 0; index < len; ++index) {
    var item = items[index]
    if (result.lastIndexOf(item) === -1) {
      result.push(item)
    }
  }
  return result
}

Now, if we can assume items is “sorted” to begin with (identical values are adjacent), we can save a ton of computation by avoiding the inner traversal of result. Our algorithm goes from quadratic to linear (O(n), so a thousand items yield a thousand turns):

function sortedUniq(items) {
  const result = []
  let lastSeen
  let first = true
  for (const item of items) {
    if (first || lastSeen !== item) {
      result.push(item)
      first = false
      lastSeen = item
    }
  }
  return result
}

Still, not so neat. If all our values are Strings, we could optimize the existence check a bit, but barely, by using a dictionary object and filling up its keys, to check with a seen.hasOwnProperty(item) or a rougher seen[item], but the gain may not be noticeable (altering the “shape” of seen every time we add a newly-encountered key kills most internal lookup optimizations by the JS engine).
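
For the record, here is a minimal sketch of that key-based variant (the stringUniq name is made up, and it is only safe when all values are Strings, since object keys get coerced to strings):

function stringUniq(items) {
  const seen = {}
  const result = []
  for (const item of items) {
    // hasOwnProperty only looks at the dictionary's own keys
    if (!seen.hasOwnProperty(item)) {
      seen[item] = true
      result.push(item)
    }
  }
  return result
}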

“Yeah, but Lodash!”

Absolutely! Lodash has long featured uniq and friends (uniqWith, uniqBy, sortedUniq, etc.). It wasn’t even first to the punch, as Prototype.js did that all the way back in 2005.
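
For illustration, assuming Lodash is installed and imported as _ (the sample values are mine):

import _ from 'lodash'

_.uniq([2, 1, 2]) // => [2, 1]
_.uniqBy([2.1, 1.2, 2.3], Math.floor) // => [2.1, 1.2]
_.sortedUniq([1, 1, 2]) // => [1, 2]
_.uniqWith([{ x: 1 }, { x: 2 }, { x: 1 }], _.isEqual) // => [{ x: 1 }, { x: 2 }]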

Like a boss

Sure, if we have advanced needs (custom comparator, computed keys, etc.) we’ll need Lodash or some other help. But for the common case of deduplicating raw values, the trick since ES2015 is to use a Set.

Set is one of two new collection types (with Map) that came out in 2015 in the language’s standard library. You’ll find a similar type in many ecosystems, representing a set in the mathematical sense. In particular, sets have two important characteristics:

  • There is no intrinsic order
  • All values are unique

In practice, undefined, null and NaN are treated here like any other value. Set compares values using the SameValueZero pseudo-algorithm laid out in the JavaScript specification. It’s much like strict equality (the === operator, which first checks that types match, then compares values), with a teeny-tiny difference: it considers NaN equal to itself (and, just like ===, it treats -0, 0 and +0 as identical). I am quite certain you don’t mind 😉
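
To see SameValueZero at work, here’s a quick illustrative check, with made-up values (Array#includes happens to use SameValueZero too, whereas indexOf sticks to ===):

const values = [NaN, NaN, undefined, null, null, 0, -0]
new Set(values)      // => Set { NaN, undefined, null, 0 } (a single NaN, a single zero)
values.includes(NaN) // => true (Array#includes relies on SameValueZero too)
values.indexOf(NaN)  // => -1 (indexOf sticks to ===, where NaN !== NaN)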

It so happens that a Set can be built based on any iterable. There are quite a few kinds of iterables (making our code even more generic and useful!) but Array is undoubtedly the most well-known kind.
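
For instance, with made-up values: strings and Map key iterators are iterables too, so they feed a Set just as well as an Array would:

new Set('banana')                             // => Set { 'b', 'a', 'n' } (a string iterates character by character)
new Set(new Map([['a', 1], ['b', 2]]).keys()) // => Set { 'a', 'b' } (Map keys are an iterable too)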

So we could go:

const result = new Set(items)

Building a Set by passing it an iterable is sort of the optimized version of manually traversing that iterable (say, with for…of) and calling add(…) every time. The semantics are the same, it’s just faster. As a nice bonus, the order of insertion is preserved, making our algorithm “stable” that way.
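
In other words, the constructor call behaves roughly like this hand-rolled sketch (for intuition only, not the engine’s actual code), using the same items as before:

const result = new Set()
for (const item of items) {
  result.add(item) // an `item` already present in the Set is silently ignored
}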

The definition of add(…) mandates that it ignores any argument already present in the Set. Internally, it is implemented using data structures optimized for storing and verifying the existence of values, regardless of any ordering. This is why set.has(x) usually runs with complexity O(log(n)), which is potentially much more performant than arr.includes(x), which runs in O(n). What would you prefer, 3–4 turns or 1,000 turns?

OK, but now that we have a Set, not a nice Array chock-full of cool handy methods… How do we land back on an array?

A Set is an iterable too, which means we can spread it using the ... operator, or go with the Array.from(…) method. So to get an Array back, we could write this:

const result = [...new Set(items)]

Spreading inside an array literal ([]) produces, well… an actual Array. We consume the entire iterable, which means an O(n) run, for a total cost around O(2n log(n)), which is very, very much better than O(n²). Here’s a small comparative table, assuming 10ns for the SameValueZero comparison, which is quite reasonable:

Size        O(n²)       O(2n log(n))
100         100µs       4µs
1,000       10ms        60µs
10,000      1s          800µs
100,000     1m40s       10ms
1,000,000   2h46m40s    120ms

So there.

Voilà!

function lovelyUniq(items) {
  return [...new Set(items)]
}
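
A quick sanity check, with made-up sample values (notice that first-occurrence order is preserved):

lovelyUniq([3, 'three', 3, NaN, 'three', NaN, 1]) // => [3, 'three', NaN, 1]

And if you prefer Array.from(new Set(items)) over spread, it produces exactly the same array.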

If items were “sorted,” we could try and measure for that specific case whether a manual accumulation with a control Set would yield better results, but I doubt it: manual iterations are usually more costly than the ones implemented internally by the native code of the standard library. The potential benefit would likely be far too thin to justify the extra volume of code and its consequences (documentation, testing, etc.).