One of the features we wanted to add to the company website was the ability to tag and filter posts so that posts sharing a tag could be displayed on their own index page. That meant we needed to read through all the YAML frontmatter, collate the tags into a collection, then deduplicate that collection so we could create an index page per tag.
We set out to do just that, until we hit a snag: what's the best way to deduplicate a JavaScript array (since the website was using Next.js)?
You might expect that JavaScript, being a mature language used on almost all modern websites, would have a built-in method for removing duplicate elements from an array.
For example, Ruby's arrays have a built-in method uniq that does what you'd expect:
array = [
  "scrum",
  "scrum",
  "scrum",
  "engineering",
  "culture",
  "culture",
  "remote-work",
]

array.uniq
There's no such thing in JavaScript land, so we set out and looked towards the vast plains of the Internet searching for whatever wisdom we could find.
We eventually ended up settling on one of the solutions, but we realized there might be some value in exploring the variety of solutions we'd collected along the way. We asked ourselves: how do they work, and how do they fare in terms of performance?
So we took a few of those solutions that we felt best represented a certain concept, and ran benchmarks on them.
const iterations = 1000000;
const array = [
  "scrum",
  "scrum",
  "scrum",
  "engineering",
  "culture",
  "culture",
  "remote-work",
];
const methods = {
  usingSet: (arr) => [...new Set(arr)],
  usingFilter: (arr) => arr.filter((e, i, a) => a.indexOf(e) === i),
  usingFilterIncludes: (arr) =>
    arr.filter(function (e) {
      return !this.includes(e) && this.push(e);
    }, []),
  usingObjectKeys: (arr) =>
    Object.keys(arr.reduce((acc, cur) => (acc[cur] = cur && acc), {})),
  usingMagic: (arr, t = {}) => arr.filter((e) => !(t[e] = e in t)),
};
const performanceObj = Object.entries(methods).reduce((acc, [key, value]) => {
  const start = performance.now();
  for (let i = 0; i < iterations; i++) value(array);
  acc[key] = `${performance.now() - start} ms`;
  return acc;
}, {});
performanceObj;
We actually have two more examples for deduplicating JavaScript arrays, but they either require external libraries or are currently considered experimental technologies that need a polyfill, so we won't be digging too deep into them.
// use underscore.js (advantage is that they take care of optimizing the algorithm)
_.uniq(array);
// groupBy and keys, experimental and needs polyfill
Object.keys(array.groupBy((e) => e));
The benchmarks fluctuate depending on the resources your browser is currently using, but the ranking mostly remains the same. For reference, these are the results that we got:
Object {
  "usingFilter": "73 ms",
  "usingFilterIncludes": "100.89999997615814 ms",
  "usingMagic": "256.5 ms",
  "usingObjectKeys": "143.60000002384186 ms",
  "usingSet": "130.29999995231628 ms",
}
Let's look at them one by one, in order of benchmark performance.
Using filter() and indexOf()
const usingFilter = (arr) => arr.filter((e, i, a) => a.indexOf(e) === i);
We start off with the fastest of the bunch, and it's also sweet and succinct (albeit needing some thought to understand how it works).
One uncommon thing here is the usage of all three arguments of the filter callback: the current element, the current index, and the array being iterated on.
The filter function finds the index of e in the array a and checks if the current index i is the same. If so, that means this is the first time the current e was encountered, because the current index and the location of e in the array are the same.
If they are not the same, this check will evaluate to false, because indexOf will return the index of the earliest occurrence of e, and the current index i would not be the same as that index.
Although the filter function ends up doing multiple passes over the array to find the index of the current element (at worst O(n^2) complexity when all elements are already unique), it should still be fast enough for simple use cases.
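As a quick illustration (ours, not part of the original benchmark), applying it to the tag array from earlier keeps only the first occurrence of each string:

const tags = [
  "scrum",
  "scrum",
  "scrum",
  "engineering",
  "culture",
  "culture",
  "remote-work",
];
const usingFilter = (arr) => arr.filter((e, i, a) => a.indexOf(e) === i);

// Only the first occurrence of each tag passes the index check.
usingFilter(tags); // ["scrum", "engineering", "culture", "remote-work"]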
Using filter() and includes()
const usingFilterIncludes = (arr) =>
  arr.filter(function (e) {
    return !this.includes(e) && this.push(e);
  }, []);
While similar to the first one, this one uses some this-chicanery to avoid having to declare a temporary variable to hold the comparison list.
In its callback function mode (not using an arrow function), Array.prototype.filter() takes two arguments: the callback function (which can be defined elsewhere or written inline) and something called a thisArg. The thisArg is the value to be used as this when the callback is executed.
That means within the scope of the function, this would refer to [] since that was passed in as the second parameter to filter().
The callback function will then use includes() to check if e is present in the array; if not, we negate that result, continue past the boolean &&, and push the element into the array.
This results in duplicate elements returning true from the includes() test, which is negated, which then short-circuits the boolean and skips the array push.
It's only slightly slower than the indexOf() version, and depending on who you ask it may be more understandable than that one. The only unusual thing here is the usage of the thisArg, which can be surprising to developers who are used to arrow functions and don't need to worry about what this references.
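If the thisArg trick feels too clever, a rough equivalent (our own sketch, with a hypothetical name, not one of the collected solutions) does the same thing with an explicit seen array and an arrow function:

const usingFilterIncludesExplicit = (arr) => {
  const seen = []; // plays the role of the [] passed as thisArg above
  return arr.filter((e) => {
    if (seen.includes(e)) return false; // duplicate, drop it
    seen.push(e); // first occurrence, remember it
    return true;
  });
};

usingFilterIncludesExplicit(["scrum", "scrum", "culture"]); // ["scrum", "culture"]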
Using Set
const usingSet = (arr) => [...new Set(arr)];
We like this one a lot due to its simplicity and elegance. You simply create a new Set object from the array, then use the spread syntax to create a new array with the elements of the created Set. The MDN docs even provide this exact example of deduplicating an array as part of its sample usage documentation.
It's almost like having unique elements is its sole reason for living (it is, mostly).
Why would you need to create an array from a Set when a Set should be expected to behave much like an array, you might ask? If you look at the documentation of Set, you'd notice that the array methods you'd mostly expect to use, like map() or reduce(), aren't present (you can still use forEach() though).
You also don't get any of the array-like behaviors, like being able to retrieve an element via its index using [] or being able to have "holes" in the sequence by directly assigning a value to non-sequential indexes.
This is what we finally ended up using in our implementation.
Using Object Keys
const usingObjectKeys = (arr) =>
  Object.keys(arr.reduce((acc, cur) => (acc[cur] = cur && acc), {}));
There are two tricks being (ab)used here:
- JavaScript Object property names (aka keys) don't have duplicates
- Array.prototype.reduce() takes two arguments: the reducer function, and an optional initial value
The reducer function then creates a property named after the current element cur in the accumulator acc and returns the accumulator. As expected in reducer functions, the returned accumulator is made available to the next iteration.
The trick here is that if the property already exists, then no new property is created and the assignment simply happens to a property that is already there.
We then call Object.keys() to get the property names of the accumulator object, which gives us back a deduplicated array.
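If the compressed (acc[cur] = cur && acc) expression is hard to parse, here's a spelled-out equivalent (our own sketch, assuming string elements) that does the same thing:

const usingObjectKeysVerbose = (arr) =>
  Object.keys(
    arr.reduce((acc, cur) => {
      acc[cur] = true; // creating the same key twice just overwrites it
      return acc; // hand the accumulator to the next iteration
    }, {})
  );

usingObjectKeysVerbose(["scrum", "scrum", "culture"]); // ["scrum", "culture"]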
While using an Object's properties as some sort of deduplication data structure was unavoidable in past versions of JavaScript, we now have constructs like Set or Map that are specialized for this kind of operation. In fact, the MDN documentation hints at Maps as a replacement for using Objects for hash-like or dictionary-like structures.
The downside of using a Map instead of an Object is similar to that of using Set: you don't have familiar methods like map or reduce, and probably worse would be the confusion between setting Object properties and setting Map keys with []= (after all, a Map is still a full-fledged JavaScript Object).
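If you did want to lean on a Map, one possible sketch (ours, not one of the benchmarked solutions) maps each element to itself and then reads the keys back out:

const usingMapKeys = (arr) => [...new Map(arr.map((e) => [e, e])).keys()];

usingMapKeys(["scrum", "scrum", "culture"]); // ["scrum", "culture"]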
Using Magic
const usingMagic = (arr, t = {}) => arr.filter((e) => !(t[e] = e in t));
This is a fun one. Similar to using Object keys, we initialize a temporary variable t and (ab)use the fact that JavaScript Object keys are unique. We then use Array.prototype.filter() to loop through the array and kick out duplicate elements.
So how does the actual filter function work? Let's dissect the expression !(t[e] = e in t):
- we start evaluating the expression right to left, starting with the in operator
- the in operator checks whether e is present in the property list of Object t and returns true or false
- the result (true or false) gets assigned to t[e], but this isn't really important; what we want is to create the key e in Object t if it doesn't exist yet
- what's important though is that this expression evaluates to true or false depending on whether property e is present in Object t
- since the expression returns true if e is in t, we want to negate this, because if e is already present in t then we're dealing with a duplicate
- Array.prototype.filter() will reject elements for which the filter function returns false, leaving us with a deduplicated array
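To see it in action, here's a small trace of the expression (our own illustration) for a first occurrence versus a duplicate:

const t = {};

// first "scrum": "scrum" in t is false, so t.scrum is set to false and !false keeps the element
!(t["scrum"] = "scrum" in t); // true

// second "scrum": the key now exists (regardless of its value), so the check is true and !true drops it
!(t["scrum"] = "scrum" in t); // false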
There are a lot of tricks and gotchas here, and we do not recommend using this in production code unless you have specific reasons (like trying to confuse your colleagues so they pay attention during your code review :P). It also came in dead last in the benchmark, taking almost double the time of the second-worst algorithm.
Going Beyond the Primitive Case
With all that said, we think Set checks all the boxes we were looking for to get our blog tags feature done. It also looks very trendy, using modern JavaScript constructs like the spread syntax.
const dedupe = (arr) => [...new Set(arr)];
So there you have it, this is definitely the best way™ to remove duplicate elements from a JavaScript array.
Let's check our assumptions for a bit and try to see where our algorithms break down. Right now all we've been dealing with are arrays filled with strings. What would happen if our array happens to contain objects instead? Heck, let's throw in a couple of duplicated arrays in the mix as well.
Here's an example using Array.prototype.filter() and Array.prototype.indexOf():
const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];
const usingFilter = (arr) => arr.filter((e, i, a) => a.indexOf(e) === i);
usingFilter(array);
It seems that we didn't get what we were expecting. Strangely, none of the duplicated elements were filtered out.
How about using the Set object?
const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];
const usingSet = (arr) => [...new Set(arr)];
usingSet(array);
This didn't work either. We get the same result as the filter() and indexOf() version.
What gives?
Maybe the magical ways will fare better:
const array = [
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "scrum", title: "Scrum" },
  { slug: "remote-work", title: "Remote Work" },
  { slug: "culture", title: "Culture" },
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
  [1, [2, 3], [[4, 5], [6, 7], 8], 9],
];
const usingMagic = (arr, t = {}) => arr.filter((e) => !(t[e] = e in t));
usingMagic(array);
Now we get something different, but it's still definitely not what we can consider the "correct" result. One would have assumed deduplication was a simple enough task, so it's quite a surprise not to get the expected answer.
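A hint at what's happening with the magic version (our own illustration, not from the original write-up): Object property keys are strings, so every element gets coerced to a string before being used as a key in t, and very different values can collapse into the same key:

String({ slug: "scrum", title: "Scrum" }); // "[object Object]"
String({ slug: "culture", title: "Culture" }); // "[object Object]", the same key
String([1, [2, 3], [[4, 5], [6, 7], 8], 9]); // "1,2,3,4,5,6,7,8,9"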
Speaking of assumptions, we've been talking about deduplication but neglecting to consider the elephant in the room: what does it mean for something to be a duplicate of another? In other words: what makes something different from another?
We've been taking for granted that the functions we've been using have implicit tests of equality, and those tests may not necessarily match what we expect "equality" to mean between array elements.
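For instance (our own illustration; part 2 digs into this properly), both indexOf() and Set compare objects by reference, so two structurally identical literals never count as duplicates:

const a = { slug: "scrum", title: "Scrum" };
const b = { slug: "scrum", title: "Scrum" };

a === b; // false, different references
[a, b].indexOf(b); // 1, so the filter check keeps both elements
new Set([a, b]).size; // 2
new Set([a, a]).size; // 1, only an identical reference counts as a duplicate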
We'll dive deeper into existential questions of equality and truth in part 2.