Using named captures
Published on 17 May 2020
• 4 min
This is already the fourteenth installment of our daily series “19 nuggets of vanilla JS,” and we’re talking again about regular expressions to shed some light on one of the nicer regex novelties in ES2018: named capturing groups, also known as “named captures”.
The series of 19
Check out surrounding posts from the series:
- Properly defining optional named parameters
- `const` is the new `var`
- Using named captures (this post)
- Object spread vs. `Object.assign`
- Converting an object to `Map` and vice-versa
- …and beyond! (fear not, all 19 are scheduled already)…
A quick recap on groups
In a regular expression, we use groups to apply quantifiers or alternatives to more than one character.
Let’s say we want to express “the letter ‘b’ at least once”: we would write `b+`. But to say “the text ‘ba’ at least once”, we can’t just go with `ba+`: that pattern would mean “the letter ‘b’, followed by at least one letter ‘a’”. We thus create a group around that text, on which the quantifier applies: `(ba)+`.
In the same way, `baba|bébé` means “‘baba’ or ‘bébé’”, but to say “‘hi’, followed by either ‘baba’ or ‘bébé’, followed by ‘!’” we would have to write `hi (baba|bébé)!` to restrict the scope of the alternative: without the group, it would mean “‘hi baba’ or ‘bébé!’”.
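Both scoping rules are easy to check in a console. The snippet below is a small sketch: the variable names are made up for illustration, and the `^` and `$` anchors are added only so that `test` has to match the whole string.

```javascript
// Quantifier scope: without a group, `+` only applies to the last character.
const withoutGroup = /^ba+$/
const withGroup = /^(ba)+$/

withoutGroup.test('baaa') // => true  ('b' followed by several 'a')
withoutGroup.test('baba') // => false
withGroup.test('baba') // => true  ('ba' repeated twice)
withGroup.test('baaa') // => false

// Alternative scope: without a group, `|` splits the whole pattern in two.
const unscoped = /^hi baba|bébé!$/
const scoped = /^hi (baba|bébé)!$/

scoped.test('hi bébé!') // => true
scoped.test('bébé!') // => false (the 'hi ' part is mandatory)
unscoped.test('bébé!') // => true  (matches the second half on its own)
```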
Capturing groups
By default, groups are capturing: the part of the scanned text that ends up matching them is isolated in a captured group with an index. Group zero is always there: it contains the expression’s entire match. Groups starting at one (1) are the captured groups. As a result, if you look into a match result (an extended `Array` object returned by `match` or `exec`) produced by the expression in the code below, you’ll find, among other things, properties `1`, `2` and `3` holding the three captured groups.
const REGEX_US_PHONE = /\b(\d{3})-(\d{3})-(\d{4})\b/
const result = 'Twitter HQ: 415-222-9670'.match(REGEX_US_PHONE)
result[0] // => '415-222-9670'
result[1] // => '415'
result[2] // => '222'
result[3] // => '9670'
Capturing groups are also handy for backrefs (back references): these let us express that our pattern should contain “the same source text as the one that matched an earlier spot in our expression.” Let’s say we want to match an HTML attribute, the value of which can be surrounded by single or double quotes (`'` or `"`). The critical thing is, we need the same delimiter on both sides. We can use a backref with the proper captured group index for this:
// Intentionally simplified "name" part here…
const REGEX_HTML_ATTR = /[\w-]+=(['"])(.+?)\1/
REGEX_HTML_ATTR.exec(`name="foo"`) // => ['name="foo"', '"', 'foo']
REGEX_HTML_ATTR.exec(`name='foo'`) // => ["name='foo'", "'", 'foo']
REGEX_HTML_ATTR.exec(`name='foo"`) // => null
REGEX_HTML_ATTR.exec(`name="foo'`) // => null
In the regex above, the delimiter pattern (`['"]`) is in the first capturing group: to use a backref on it, we will therefore type `\1`.
Along the same lines, when using a regex with the `String#replace` API and providing a text-based replacement pattern, we can reference captured groups with the `$index` notation. Look at this:
'415-222-9670'.replace(REGEX_US_PHONE, '$3/$2/$1')
// => '9670/222/415'
Non-capturing groups
These group indices quickly get out of hand, though. The moment we add a group somewhere, it offsets all the later indices! Say we want to allow a phone number to be prefixed with “tel:”. The new group shifts every index after it by one:
const REGEX_US_PHONE = /\b(tel:)?(\d{3})-(\d{3})-(\d{4})\b/
const result = 'Twitter HQ: 415-222-9670'.match(REGEX_US_PHONE)
result[0] // => '415-222-9670'
result[1] // => undefined -- ARGH!
result[2] // => '415' -- Drats.
result[3] // => '222' -- Shoot.
result[4] // => '9670' -- Son of a gun.
For all we know, we may not even care whether the “tel:” protocol is there or not, we just wanted it to be part of the pattern matching! We didn’t mean to wreak havoc on our captured group indexing. In such scenarios, we can use non-capturing groups by starting them with `(?:` instead of just `(`:
const REGEX_US_PHONE = /\b(?:tel:)?(\d{3})-(\d{3})-(\d{4})\b/
const result = 'Twitter HQ: 415-222-9670'.match(REGEX_US_PHONE)
result[0] // => '415-222-9670'
result[1] // => '415' -- Yay!
result[2] // => '222' -- Yowza!
result[3] // => '9670' -- Banzaï!
Group specialization
As a general rule, any group specialization starts with `(?`:
- `(?:` for non-capturing groups,
- `(?=` for lookaheads,
- `(?!` for negative lookaheads,
- `(?<=` for lookbehinds,
- `(?<!` for negative lookbehinds.
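To see a few of these at work, here is a minimal sketch. The `PRICE` pattern and the sample strings are made up for illustration. Note that lookbehinds, like named captures, only arrived with ES2018, so they need a reasonably recent engine.

```javascript
// Lookbehind + lookahead: digits preceded by '$' and followed by ' USD'.
// Neither assertion consumes text or creates a captured group.
const PRICE = /(?<=\$)\d+(?= USD)/
const priceMatch = 'Total: $42 USD'.match(PRICE)

priceMatch[0] // => '42'
priceMatch.length // => 1 (group zero only, nothing was captured)

// Negative lookahead: 'foo' unless it is followed by 'bar'.
const FOO_NOT_BAR = /foo(?!bar)/
FOO_NOT_BAR.test('foobaz') // => true
FOO_NOT_BAR.test('foobar') // => false
```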
Named capturing groups
Many languages feature a better way to capture groups: by naming them. It is more readable and more resilient to changes in the expression: no surprise reference shifting as with the indices.
ES2018 finally provides this! The API changes for this are many:
- You can define a named capturing group with `(?<name>expr)` (so between angle brackets, before the pattern).
- You can do a backref with `\k<name>`.
- The match result features a `groups` property, which becomes an object whose properties use the capturing groups’ names.
- The textual replacement pattern in `String#replace` allows `$<name>` for referencing named captured groups.
As a side note, these groups also have indices but We Just Don’t Care™.
Revisiting our previous examples as named captures:
const HTML_ATTR = /(?<name>[\w-]+)=(?<delim>['"])(?<value>.+?)\k<delim>/
HTML_ATTR.exec(`name="foo"`).groups
// => { name: 'name', delim: '"', value: 'foo' }
HTML_ATTR.exec(`name='foo'`).groups
// => { name: 'name', delim: "'", value: 'foo' }
const US_PHONE = /\b(?<area>\d{3})-(?<prefix>\d{3})-(?<line>\d{4})\b/
'415-222-9670'.replace(US_PHONE, '$<line>/$<prefix>/$<area>')
// => '9670/222/415'
I mean, just 😍.
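Two more things fall out of this nicely, sketched below with illustrative variable names: `groups` plays well with destructuring, and when you pass a function to `String#replace`, the `groups` object arrives as its last argument. The snippet also shows the numeric indices that are still there in the background.

```javascript
const US_PHONE = /\b(?<area>\d{3})-(?<prefix>\d{3})-(?<line>\d{4})\b/

// Destructure straight out of the match result's `groups` property.
const phoneMatch = 'Twitter HQ: 415-222-9670'.match(US_PHONE)
const { area, prefix, line } = phoneMatch.groups
// area => '415', prefix => '222', line => '9670'

// Named groups keep their numeric indices too.
phoneMatch[1] === phoneMatch.groups.area // => true

// With a replacer function, the groups object is the last argument.
const pretty = '415-222-9670'.replace(US_PHONE, (...args) => {
  const groups = args[args.length - 1]
  return `(${groups.area}) ${groups.prefix}-${groups.line}`
})
// pretty => '(415) 222-9670'
```

In real code the match may be `null`, so guard before reading `groups`.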
Where can I get that?!
Pretty much everywhere that matters: this is supported natively since Chrome 64, Firefox 78, Edge 79, Safari 11.1, Opera 51 and Node 10.
Otherwise, Babel transpiles it (including with the `env` and `latest` presets).
Bonus trick: String#matchAll(…)
A long-time gripe with `String#match` (and its `RegExp#exec` counterpart) is that you could not quite have your cake and eat it too when using capturing groups:
- Either you used the `g` flag (for global, which returns all matches of the entire pattern) and you got an array of full-pattern matches, without their individual groups.
- Or you did not use the flag, and got either `null` or a match result, with individual captured groups.
Here it is in all its (infamous) glory:
const US_PHONES = /\b(?<area>\d{3})-(?<prefix>\d{3})-(?<line>\d{4})\b/g
const text = `
HQ: 412-222-9670
Support: 415-865-5405
`
text.match(US_PHONES)
// => ['412-222-9670', '415-865-5405']
// --> Wait a minute — Where did my captured groups go!?
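Getting every match together with its groups used to mean looping over `RegExp#exec` by hand, since a global regex remembers where to resume through its `lastIndex` property. Here is a minimal sketch of that workaround. The names `PHONES`, `sample` and `found` are made up for illustration.

```javascript
const PHONES = /\b(?<area>\d{3})-(?<prefix>\d{3})-(?<line>\d{4})\b/g
const sample = 'HQ: 412-222-9670, Support: 415-865-5405'

const found = []
let current
// Each call resumes at PHONES.lastIndex; exec returns null once exhausted.
while ((current = PHONES.exec(sample)) !== null) {
  found.push({ full: current[0], ...current.groups })
}
// found => [
//   { full: '412-222-9670', area: '412', prefix: '222', line: '9670' },
//   { full: '415-865-5405', area: '415', prefix: '865', line: '5405' },
// ]
```

It works, but it is stateful and easy to get wrong: forget the `g` flag and the loop never ends.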
Since ES2020 however, we finally get `String#matchAll`, which returns an iterator (even better than a dumb `Array`) over match results:
Array.from(text.matchAll(US_PHONES))
// => [
// ['412-222-9670', '412', '222', '9670'],
// ['415-865-5405', '415', '865', '5405'],
// ]
Array.from(text.matchAll(US_PHONES)).map((a) => a.groups)
// => [
// { area: '412', prefix: '222', line: '9670' },
// { area: '415', prefix: '865', line: '5405' },
// ]
That’s 👏 Just 👏 Spiffy!
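Since the result is an iterator, you don’t need `Array.from` at all if you just want to loop. A small sketch, with illustrative variable names, which also shows the one gotcha: `matchAll` insists on a global regex and throws a `TypeError` otherwise.

```javascript
const US_PHONES = /\b(?<area>\d{3})-(?<prefix>\d{3})-(?<line>\d{4})\b/g
const text = `
HQ: 412-222-9670
Support: 415-865-5405
`

// Iterate directly, destructuring named groups as we go.
const areas = []
for (const { groups: { area } } of text.matchAll(US_PHONES)) {
  areas.push(area)
}
// areas => ['412', '415']

// Without the `g` flag, matchAll refuses to run.
let error = null
try {
  text.matchAll(/\d{3}/)
} catch (e) {
  error = e
}
// error instanceof TypeError => true
```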
This has been natively supported since Chrome 73, Firefox 67, Edge 79, Opera 60, Safari 13 and Node 12. Obviously no IE. core-js (used by Babel among other things) started polyfilling it in version 3.4.