Extracting emojis from a text
Published on 14 May 2020
• 3 min
Here’s the eleventh post of our daily series “19 nuggets of vanilla JS.” Today we’re talking emojis. They’re everywhere, but they’re hard to identify, extract and collect from a string. They form an ever-expanding list and, in JavaScript `String`s, most of them are encoded as surrogate pairs because of their higher-range codepoints… Fortunately, an ES2018 feature makes it easier for us!
The series of 19
Check out surrounding posts from the series:
- Long live numeric separators!
- Properly sorting texts
- Extracting emojis from a text (this post)
- Properly defining optional named parameters
- `const` is the new `var`
- …and beyond! (fear not, all 19 are scheduled already)…
Emojis, Unicode and surrogate pairs
In our #5 “nugget” post, “Strings and Unicode in JavaScript,” we already discussed how Unicode is handled by the `String` type. In particular, we saw that text is encoded as UTF-16, with 2-byte code units, which requires a combination of 2 individually-invalid code units for high-enough codepoints, something called a surrogate pair.
This is the common scenario for emojis, as pretty much all of them have codepoints in the U+1Fxxx range, plus numerous modifiers going all the way to the U+Exxxx range.
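To make the surrogate-pair mechanics concrete, here is a small sketch using the party popper emoji (U+1F389). The variable name is only illustrative.

```javascript
// The party popper emoji sits at codepoint U+1F389, beyond the 16-bit range.
const popper = '🎉';

// UTF-16 needs two code units (a surrogate pair) to encode it.
console.log(popper.length); // 2
console.log(popper.charCodeAt(0).toString(16)); // 'd83c' (high surrogate)
console.log(popper.charCodeAt(1).toString(16)); // 'df89' (low surrogate)

// Codepoint-aware APIs see a single character.
console.log(popper.codePointAt(0).toString(16)); // '1f389'
console.log([...popper].length); // 1
```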
Such diversity means it’s rather tedious and error-prone to “manually” identify emojis in a `String`. The “traditional” regex to achieve this would be rather intense (and would likely perform a bit poorly)…
The Unicode flag for regexes
ES2015 introduced a `u` flag on regexes, which triggers Unicode handling.
Before ES2018, this “only” allowed using the codepoint literal syntax (i.e. `\u{xxxxx}`) in addition to the legacy code unit literal syntax (`\uXXXX`). But since ES2018, this also lets us describe positive or negative matches with Unicode properties.
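As a quick illustration of what the `u` flag changes, the sketch below compares the two literal syntaxes and shows that, in Unicode mode, the dot consumes a whole codepoint rather than a single code unit. The sample string is made up for the occasion.

```javascript
const sample = 'Party time 🎉';

// With the u flag, \u{…} denotes a full codepoint, whatever its range.
console.log(/\u{1F389}/u.test(sample)); // true

// The legacy syntax has to spell out both halves of the surrogate pair.
console.log(/\uD83C\uDF89/.test(sample)); // true

// Without the u flag, the dot only consumes one code unit, so a lone emoji
// (two code units) does not fit the "exactly one character" pattern.
console.log(/^.$/.test('🎉')); // false
console.log(/^.$/u.test('🎉')); // true
```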
Unicode properties
The Unicode standard assigns each codepoint a series of properties. These are cross-cutting categories, so to speak. As an example, consider the U+2778 glyph: ❸ (fondly known as Dingbat Negative Circled Digit Three). Some of its properties are:
- Script: `Common` / `Zyyy` (here’s a [list from the ES spec](https://tc39.es/ecma262/#table-unicode-script-values) and [another from the excellent Compart site](https://www.compart.com/en/unicode/scripts))
- General Category: `Other_Number` / `No` (here’s the [ES spec’s list](https://tc39.es/ecma262/#table-unicode-general-category-values) and the [Compart list](https://www.compart.com/en/unicode/category)). Transitively, it’s also part of the more generic `Number` / `N` category.
By the way, most property values have a long form (e.g. `Other_Number`) and a shorthand (e.g. `No`). As always, do favor the longer (more legible) version to make your code a bit easier to understand and maintain…
Unicode Property Escapes to the rescue!
ES2018 brings a new syntax for regexes that lets us match Unicode properties: Unicode Property Escapes. It reads `\p{…}`. As is usual for escape sequences in regexes, the positive variant is lowercase, and the negative variant is uppercase (`\P{…}`). Just like `\s` says “whitespace” and `\S` says “anything but whitespace.”
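Here is a small sketch of the positive and negative variants side by side, using the binary `Uppercase` property. The `acronym` helper is a hypothetical name, introduced only for this illustration.

```javascript
// \p{Uppercase} keeps uppercase letters, whatever their script…
const capitals = 'Hello World, Привет Мир'.match(/\p{Uppercase}/gu);
console.log(capitals); // ['H', 'W', 'П', 'М']

// …while \P{Uppercase} matches everything else, so stripping it builds an acronym.
const acronym = (title) => title.replace(/\P{Uppercase}+/gu, '');
console.log(acronym('Portable Network Graphics')); // 'PNG'
```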
Properties can be binary (yes/no; list from the ES spec) or more general (anything else, such as `General_Category` or `Script`). For binary properties, their name alone is enough; for others, you’ll need to provide a value.
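To see both kinds of properties at work, this sketch tests the ❸ glyph from earlier against a few escapes. It assumes nothing beyond the property names and aliases listed in the ES spec tables.

```javascript
const three = '\u2778'; // ❸ DINGBAT NEGATIVE CIRCLED DIGIT THREE

// Non-binary properties need a Name=Value pair…
console.log(/\p{Script=Common}/u.test(three)); // true
console.log(/\p{General_Category=Other_Number}/u.test(three)); // true

// …which also accepts the short aliases (less legible, though).
console.log(/\p{gc=No}/u.test(three)); // true

// It is a number, but not a decimal digit.
console.log(/\p{General_Category=Number}/u.test(three)); // true
console.log(/\p{General_Category=Decimal_Number}/u.test(three)); // false

// Binary properties only need their name.
console.log(/\p{White_Space}/u.test(three)); // false
console.log(/\P{White_Space}/u.test(three)); // true
```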
General category values can also be used directly, as if they were binary properties. For instance, you can indifferently write `\p{Decimal_Number}` or `\p{General_Category=Decimal_Number}`. Script values, on the other hand, always need their `Script=` prefix (e.g. `\p{Script=Greek}`). Some useful names you can use on their own include `Alphabetic`, `Uppercase`, `Lowercase`, `Number` (especially `Decimal_Number`), `Diacritic`, `Emoji`, `White_Space` (which largely overlaps with the legacy `\s`)…
So here’s our solution for extracting any sequence of emojis from a text!
'Awesome 🎉 I love it! 🤗😍'.match(/\p{Emoji}+/gu)
// => ['🎉', '🤗😍']
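One caveat worth knowing: the `Emoji` property also covers ASCII digits, `#` and `*`, because they are the base characters of keycap emojis. A possible refinement, sketched below, relies on the `Extended_Pictographic` property instead, assuming your engine is recent enough to know it. `extractEmojis` is a hypothetical helper name, not something from a library.

```javascript
// Hypothetical helper: returns every run of pictographic codepoints,
// or an empty array when the text contains none.
function extractEmojis(text) {
  return text.match(/\p{Extended_Pictographic}+/gu) || [];
}

// \p{Emoji} picks up the digits too…
console.log('Room 42 🎉'.match(/\p{Emoji}+/gu)); // ['42', '🎉']

// …whereas \p{Extended_Pictographic} leaves them alone.
console.log(extractEmojis('Room 42 🎉')); // ['🎉']
console.log(extractEmojis('No emoji here')); // []
```

This remains a sketch: combined emojis (skin-tone modifiers, joiner sequences, flags) may be split or only partially matched, and would call for a more elaborate pattern.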
This can be super cool for lots of other needs, as you might expect:
'42 (yes, 𝟜𝟚) or ٤٢, or even ೪೨, is the answer…'.match(/\p{Decimal_Number}+/gu)
// => ['42', '𝟜𝟚', '٤٢', '೪೨']
(Yup, be they ASCII, mathematical double-struck, Arabic-Indic or Kannada, these are still decimal digits…)
The MDN docs are, as always, excellent.
Bonus: the singleline/dotall flag
One of many regex syntax improvements that came out with ES2018 is the long-awaited `s` flag, for “single line” (also known as “dotall”), that extends the any-character class (`.`, a single period) to also match line terminators such as line feeds and carriage returns:
'<p>one\ntwo</p>'.match(/<p>.*<\/p>/) // => null
'<p>one\ntwo</p>'.match(/<p>.*<\/p>/s) // => ['<p>one\ntwo</p>', …]
Before that, we had to resort to rather puzzling hacks (instead of `.`), such as `[^]` (a class meaning “everything except nothing”) or `[\s\S]` (meaning “all whitespaces and all non-whitespaces”), which made the intent rather unclear…
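For comparison, here is a short sketch running the two legacy workarounds and the `s` flag against the same input. The variable names are illustrative only.

```javascript
const html = '<p>one\ntwo</p>';

// The two legacy workarounds and the ES2018 flag all match across the line break.
const patterns = [/<p>[^]*<\/p>/, /<p>[\s\S]*<\/p>/, /<p>.*<\/p>/s];
const results = patterns.map((pattern) => pattern.test(html));
console.log(results); // [true, true, true]

// The flag is also exposed as a boolean property on the regex.
console.log(/./s.dotAll); // true
console.log(/./.dotAll); // false
```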