#
Efficiently deduplicating an array
Published on 4 May 2020
• 4 min

Cette page est également disponible en français.

This post opens our “19 nuggets of vanilla JS” post series, with one daily article *(not too long, nibble-size)* on a facet of pure JavaScript language; or a protip, best practice, poorly known ability, mythbusting, demystifying, etc. 19 reasons to come back!

## The series of 19

Check out surrounding posts from the series:

**Efficiently deduplicating an array**(this post)- Efficiently extracting a substring
- Properly formatting a number
*…and beyond! (fear not, all 19 are scheduled already)…*

## “Deduplicate?”

**Deduplicate**: the action of stripping from a list all extraneous occurrences of values, to only retain **unique values** in the end.

The list doesn’t have to be sorted and values may have multiple types. The implementation can choose whether to preserve order of the original values (be “stable”) or not.

This is the kind of words non-IT folks seldom use in their daily life.

## Old-school

Can’t escape traversing the whole list.

If we can’t rely on the list being sorted (and its values being `String`

s), it’s even worse: at every step, we’ll need to traverse the ongoing result to check for a prior encounter of the current value! A traversal in a traversal: the algorithm is quadratic, of complexity *O(n²)*. Roughly speaking, if your array has **a thousand** items, you’ll do **a million** turns… A rather naive implementation (especially considering we appear to be using ES2015+) could look like this:

`function naiveUniq(items) {`

const result = []

for (const item of items) {

if (!result.includes(item)) {

result.push(item)

}

}

return result

}

Feeling jarred by `for…of`

or `Array#includes()`

? You’re more the ES3 (pre-2009) type, so here’s a version for you that is actually even slower:

`function lousyUniq(items) {`

var result = [],

len = items.length

for (var index = 0; index < len; ++index) {

var item = items[index]

if (result.lastIndexOf(item) === -1) {

result.push(item)

}

}

return result

}

Now, if we can assume `items`

is “sorted” to begin with (identical values are adjacent), we can save a ton of computation by avoiding the inner traversal of `result`

. Our algorithm goes from quadratic to **linear** (*O(n)*, so a thousand items yield a thousand turns):

`function sortedUniq(items) {`

const result = []

let lastSeen

let first = true

for (const item of items) {

if (first || lastSeen !== item) {

result.push(item)

first = false

lastSeen = item

}

}

return result

}

Still, not so neat. If all our values are `String`

s, we could optimize the existence check a bit, but barely, by using a dictionary object and filling up its keys, to check with a `seen.hasOwnProperty(item)`

or a rougher `seen[item]`

, but the gain may not be noticeable (altering the “shape” of `seen`

every time we add a newly-encountered key kills most internal lookup optimizations by the JS engine).

### “Yeah, but Lodash!”

Absolutely! Lodash has featured for a long time `uniq`

and friends (`uniqWith`

, `uniqBy`

, `sortedUniq`

, etc.). It wasn’t even first to the key, as Prototype.js did that all the way back in 2005.

## Like a boss

Sure, if we have advanced needs (custom comparator, computed keys, etc.) we’ll need Lodash or some other help. But for the common case: **deduplicating raw values**, the trick, since ES2015, is to use a `Set`

.

`Set`

is one of two new collection types (with `Map`

) that came out in 2015 in the **language’s standard library**. You’ll find a similar type in many ecosystems, representing a *set* in the mathematical sense. In particular, sets have two important characteristics:

- There is no instrinsic order
- All values are unique

In practice, `undefined`

, `null`

and `NaN`

are treated here like any other value. `Set`

compares values using the **SameValueZero** pseudo-algorithm laid out in the JavaScript specification. It’s much like strict equality (the `===`

operator, that first checks types are the same, then compares values), with a teeny-tiny difference: it considers `-0`

, `0`

and `+0`

as identical. I am quite certain you don’t mind 😉

It so happens that a `Set`

can be built based on any iterable. There are quite a few kinds of iterables (making our code even more generic and useful!) but `Array`

is undoubtedly the most well-known kind.

So we could go:

`const result = new Set(items)`

Building a `Set`

by passing it an iterable is sort of the optimized version of manually traversing that iterable (say, with `for…of`

) and calling `add(…)`

every time. The semantics are the same, it’s just faster. As a nice bonus, the order of insertion is preserved, making our algorithm “stable” that way.

The definition of `add(…)`

mandates that it ignores any argument already present in the `Set`

. Internally, it is implemented using data structures optimized for storing and verifying the existence of values, regardless of any ordering. This is why `set.has(x)`

usually runs with complexity *O(log(n))*, which is potentially *much* more performant than `arr.includes(x)`

, which runs in *O(n)*. What would you prefer, 3–4 turns or 1,000 turns?

OK, but now that we have a `Set`

, not a nice `Array`

chock-full of cool handy methods… How do we land back on an array?

A `Set`

is an iterable too, which means we can *spread* it using the `...`

operator, or go with the `Array.from(…)`

method. So to get an `Array`

back, we could write this:

`const result = [...new Set(items)]`

*Spreading* inside an array literal (`[`

…`]`

) produces, well… an actual `Array`

. We consume the entire iterable, which means an *O(n)* run, for a total cost around *O(2n log(n))*, which is very, very much better than *O(n²)*. Here’s a small comparative table, assuming 10ns for the *SameValueZero* comparison, which is quite reasonable:

Size | O(n²) |
O(2n log(n)) |
---|---|---|

100 | 100µs | 4µs |

1,000 | 10ms | 60µs |

10,000 | 1s | 800µs |

100,000 | 1m40s | 10ms |

1,000,000 | 2h46m40s | 120ms |

So there.

**Voilà!**

`function lovelyUniq(items) {`

return [...new Set(items)]

}

If `items`

were “sorted,” we could try and measure for that specific case whether a manual accumulation with a control `Set`

would yield better results, but I doubt it: manual iterations are usually more costly than the ones implemented internally by the native code of the standard library. The potential benefit would likely be far too thin to justify the extra volume of code and its consequences (documentation, testing, etc.).