JavaScript for impatient programmers (beta)

40 Regular expressions (RegExp)



  Availability of features

Unless stated otherwise, each regular expression feature has been available since ES3.

40.1 Creating regular expressions

40.1.1 Literal vs. constructor

The two main ways of creating regular expressions are:
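
As a minimal sketch of the two approaches (the pattern abc and the flags ui are chosen purely for illustration), a literal and a constructor call can describe the same regular expression:

```javascript
// Literal: the pattern sits between slashes, the flags follow.
const viaLiteral = /abc/ui;

// Constructor: pattern and flags are passed as strings.
const viaConstructor = new RegExp('abc', 'ui');

console.log(viaLiteral.source, viaLiteral.flags); // abc iu
console.log(viaConstructor.source, viaConstructor.flags); // abc iu
```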

Both regular expressions have the same two parts:

40.1.2 Cloning and non-destructively modifying regular expressions

There are two variants of the constructor RegExp():

The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them. For example:

function copyAndAddFlags(regExp, flagsToAdd='') {
  // The constructor doesn’t allow duplicate flags,
  // make sure there aren’t any:
  const newFlags = [...new Set(regExp.flags + flagsToAdd)].join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');

40.2 Syntax

40.2.1 Syntax characters

At the top level of a regular expression, the following syntax characters are special. They are escaped by prefixing a backslash (\).

\ ^ $ . * + ? ( ) [ ] { } |

In regular expression literals, you must escape slashes:

> /\//.test('/')
true

In the argument of new RegExp(), you don’t have to escape slashes:

> new RegExp('/').test('/')
true

40.2.2 Basic atoms

Atoms are the basic building blocks of regular expressions.

40.2.3 Unicode property escapes [ES2018]

40.2.3.1 Unicode character properties

In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:

The semantics of a character are determined by its identity, normative properties, and behavior.

These are a few examples of properties:

40.2.3.2 Unicode property escapes

Unicode property escapes look like this:

  1. \p{prop=value}: matches all characters whose property prop has the value value.
  2. \P{prop=value}: matches all characters that do not have a property prop whose value is value.
  3. \p{bin_prop}: matches all characters whose binary property bin_prop is True.
  4. \P{bin_prop}: matches all characters whose binary property bin_prop is False.

Comments:

Examples:
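
The following checks are a small sketch of the four forms listed above. The sample strings are arbitrary, and the flag /u is assumed throughout because \p and \P are only available in Unicode mode:

```javascript
// \p{prop=value}: Script is a non-binary property
console.log(/^\p{Script=Greek}+$/u.test('αβγ')); // true

// \P{prop=value}: the negated form
console.log(/^\P{Script=Greek}+$/u.test('abc')); // true

// \p{bin_prop}: White_Space is a binary property
console.log(/^\p{White_Space}+$/u.test('\t \n\r')); // true

// \P{bin_prop}: the negated form
console.log(/^\P{White_Space}+$/u.test('abc')); // true
```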

Further reading:

40.2.4 Character classes

A character class wraps class ranges in square brackets. The class ranges specify a set of characters:
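
As a small sketch (the classes and input strings are made up for illustration), these checks show a plain class, a negated class, and a class that combines two ranges:

```javascript
// [abc] matches exactly one of the listed characters
console.log(/^[abc]$/.test('b')); // true
console.log(/^[abc]$/.test('d')); // false

// [^abc] matches any single character that is not listed
console.log(/^[^abc]$/.test('d')); // true

// [a-z0-9] combines two ranges
console.log(/^[a-z0-9]+$/.test('abc123')); // true
```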

Rules for class ranges:

40.2.5 Groups

40.2.6 Quantifiers

By default, all of the following quantifiers are greedy (they match as many characters as possible):

To make them reluctant (so that they match as few characters as possible), put question marks (?) after them:

> /".*"/.exec('"abc"def"')[0]  // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'

40.2.7 Assertions

40.2.7.1 Lookahead

Positive lookahead: (?=«pattern») matches if pattern matches what comes next.

Example: sequences of lowercase letters that are followed by an X.

> 'abcX def'.match(/[a-z]+(?=X)/g)
[ 'abc' ]

Note that the X itself is not part of the matched substring.

Negative lookahead: (?!«pattern») matches if pattern does not match what comes next.

Example: sequences of lowercase letters that are not followed by an X.

> 'abcX def'.match(/[a-z]+(?!X)/g)
[ 'ab', 'def' ]

40.2.7.2 Lookbehind [ES2018]

Positive lookbehind: (?<=«pattern») matches if pattern matches what came before.

Example: sequences of lowercase letters that are preceded by an X.

> 'Xabc def'.match(/(?<=X)[a-z]+/g)
[ 'abc' ]

Negative lookbehind: (?<!«pattern») matches if pattern does not match what came before.

Example: sequences of lowercase letters that are not preceded by an X.

> 'Xabc def'.match(/(?<!X)[a-z]+/g)
[ 'bc', 'def' ]

Example: replace “.js” with “.html”, but not in “Node.js”.

> 'Node.js: index.js and main.js'.replace(/(?<!Node)\.js/g, '.html')
'Node.js: index.html and main.html'

40.2.8 Disjunction (|)

Caveat: this operator has low precedence. Use groups if necessary:
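
A brief sketch of why the grouping matters (the patterns are illustrative only): without a group, the assertions ^ and $ each belong to only one alternative.

```javascript
// Without a group: "starts with aa" OR "ends with zz"
console.log(/^aa|zz$/.test('aaXXX')); // true

// With a group: the whole string must be 'aa' or 'zz'
console.log(/^(aa|zz)$/.test('aaXXX')); // false
console.log(/^(aa|zz)$/.test('zz')); // true
```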

40.3 Flags

Table 20: These are the regular expression flags supported by JavaScript.

Literal flag  Property name  ES      Description
g             global         ES3     Match multiple times
i             ignoreCase     ES3     Match case-insensitively
m             multiline      ES3     ^ and $ match per line
s             dotAll         ES2018  Dot matches line terminators
u             unicode        ES6     Unicode mode (recommended)
y             sticky         ES6     No characters between matches

The following regular expression flags are available in JavaScript (tbl. 20 provides a compact overview):
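
As a rough sketch of what some of these flags change (the input strings are made up for illustration):

```javascript
// /i: case-insensitive matching
console.log(/abc/i.test('ABC')); // true

// /m: ^ and $ match at the start and end of each line
console.log('a1\na2'.match(/^a./gm)); // [ 'a1', 'a2' ]

// /s: the dot also matches line terminators
console.log(/^.$/.test('\n')); // false
console.log(/^.$/s.test('\n')); // true
```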

40.3.1 Flag: Unicode mode via /u

The flag /u switches on a special Unicode mode for regular expressions. That mode enables several features:

The following subsections explain the last item in more detail. They use the following Unicode character to explain when the atomic units are Unicode characters and when they are JavaScript characters:

const codePoint = '🙂';
const codeUnits = '\uD83D\uDE42'; // UTF-16

assert.equal(codePoint, codeUnits); // same string!

I’m only switching between 🙂 and \uD83D\uDE42, to illustrate how JavaScript sees things. Both are equivalent and can be used interchangeably in strings and regular expressions.

40.3.1.1 Consequence: you can put Unicode characters in character classes

With /u, the two code units of 🙂 are treated as a single character:

> /^[🙂]$/u.test('🙂')
true

Without /u, 🙂 is treated as two characters:

> /^[\uD83D\uDE42]$/.test('\uD83D\uDE42')
false
> /^[\uD83D\uDE42]$/.test('\uDE42')
true

Note that ^ and $ demand that the input string have a single character. That’s why the first result is false.

40.3.1.2 Consequence: the dot operator (.) matches Unicode characters, not JavaScript characters

With /u, the dot operator matches Unicode characters:

> '🙂'.match(/./gu).length
1

.match() plus /g returns an Array with all the matches of a regular expression.

Without /u, the dot operator matches JavaScript characters:

> '\uD83D\uDE42'.match(/./g).length
2

40.3.1.3 Consequence: quantifiers apply to Unicode characters, not JavaScript characters

With /u, a quantifier applies to the whole preceding Unicode character:

> /^🙂{3}$/u.test('🙂🙂🙂')
true

Without /u, a quantifier only applies to the preceding JavaScript character:

> /^\uD83D\uDE42{3}$/.test('\uD83D\uDE42\uDE42\uDE42')
true

40.4 Properties of regular expression objects

Noteworthy:

40.4.1 Flags as properties

Each regular expression flag exists as a property, with a longer, more descriptive name:

> /a/i.ignoreCase
true
> /a/.ignoreCase
false

This is the complete list of flag properties:

40.4.2 Other properties

Each regular expression also has the following properties:
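
A short sketch of three such properties, .source, .flags, and .lastIndex (the regular expression and the input string are arbitrary examples):

```javascript
const regExp = /abc/gi;

// .source: the pattern as a string
console.log(regExp.source); // abc

// .flags: all flags as a string [ES6]
console.log(regExp.flags); // gi

// .lastIndex: where the next match attempt starts (used with /g and /y)
console.log(regExp.lastIndex); // 0
regExp.test('xABCx');
console.log(regExp.lastIndex); // 4
```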

40.5 Methods for working with regular expressions

40.5.1 In general, regular expressions match anywhere in a string

Note that, in general, regular expressions match anywhere in a string:

> /a/.test('__a__')
true

You can change that by using assertions such as ^ or by using the flag /y:

> /^a/.test('__a__')
false
> /^a/.test('a__')
true
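
The flag /y can be sketched in the same way. With it, a match must start exactly at .lastIndex (the variable name sticky is only illustrative):

```javascript
const sticky = /a/y;

// .lastIndex is 0, and there is no 'a' at index 0
console.log(sticky.test('__a__')); // false

// Move the starting position to the index of 'a'
sticky.lastIndex = 2;
console.log(sticky.test('__a__')); // true

// After a successful match, .lastIndex points behind the match
console.log(sticky.lastIndex); // 3
```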

40.5.2 regExp.test(str): is there a match? [ES3]

The regular expression method .test() returns true if regExp matches str:

> /bc/.test('ABCD')
false
> /bc/i.test('ABCD')
true
> /\.mjs$/.test('main.mjs')
true

With .test() you should normally avoid the /g flag. If you use it, you generally don’t get the same result every time you call the method:

> const r = /a/g;
> r.test('aab')
true
> r.test('aab')
true
> r.test('aab')
false

The results are due to /a/ having two matches in the string. After all of those were found, .test() returns false.

40.5.3 str.search(regExp): at what index is the match? [ES3]

The string method .search() returns the first index of str at which there is a match for regExp:

> '_abc_'.search(/abc/)
1
> 'main.mjs'.search(/\.mjs$/)
4
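
Two further behaviors are worth a quick sketch: .search() returns -1 if there is no match, and it always searches from the start of the string, even if the regular expression has the flag /g and a non-zero .lastIndex (the names below are illustrative):

```javascript
// No match: the result is -1
console.log('_abc_'.search(/xyz/)); // -1

// .search() neither depends on nor changes .lastIndex
const regExpWithG = /b/g;
regExpWithG.lastIndex = 3;
console.log('abcabc'.search(regExpWithG)); // 1
console.log(regExpWithG.lastIndex); // 3
```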

40.5.4 regExp.exec(str): capturing groups [ES3]

40.5.4.1 Getting a match object for the first match

Without the flag /g, .exec() returns the captures of the first match for regExp in str:

assert.deepEqual(
  /(a+)b/.exec('ab aab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aab',
    groups: undefined,
  }
);

The result is a match object with the following properties:
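
As a sketch of how these properties are read in practice, using the same regular expression and input as above:

```javascript
const match = /(a+)b/.exec('ab aab');

console.log(match[0]); // ab (the complete match)
console.log(match[1]); // a (the capture of group 1)
console.log(match.index); // 0 (where the match starts)
console.log(match.input); // ab aab (the complete input string)
console.log(match.groups); // undefined (there are no named groups)

// If there is no match, .exec() returns null
console.log(/(a+)b/.exec('xyz')); // null
```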

40.5.4.2 Named capture groups [ES2018]

The previous example contained a single positional group. The following example demonstrates named groups:

assert.deepEqual(
  /(?<as>a+)b/.exec('ab aab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aab',
    groups: { as: 'a' },
  }
);

In the result of .exec(), you can see that a named group is also a positional group – its capture exists twice:
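
A minimal sketch of reading the same capture both ways (the variable names are illustrative):

```javascript
const match = /(?<as>a+)b/.exec('ab aab');

// Access by position
console.log(match[1]); // a

// Access by name
console.log(match.groups.as); // a

// Named captures also work well with destructuring
const {groups: {as}} = match;
console.log(as); // a
```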

40.5.4.3 Looping over multiple matches

If you want to retrieve all matches of a regular expression (not just the first one), you need to switch on the flag /g. Then you can call .exec() multiple times and get one match each time. After the last match, .exec() returns null.

> const regExp = /(a+)b/g;
> regExp.exec('ab aab')
{ 0: 'ab', 1: 'a', index: 0, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
{ 0: 'aab', 1: 'aa', index: 3, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
null

Therefore, you can loop over all matches as follows:

const regExp = /(a+)b/g;
const str = 'ab aab';

let match;
// Check for null via truthiness
// Alternative: while ((match = regExp.exec(str)) !== null)
while (match = regExp.exec(str)) {
  console.log(match[1]);
}
// Output:
// 'a'
// 'aa'

Sharing regular expressions with /g has a few pitfalls, which are explained later.

  Exercise: Extract quoted text via .exec()

exercises/regexps/extract_quoted_test.mjs

40.5.5 str.match(regExp): return all matching substrings [ES3]

Without /g, .match() works like .exec() – it returns a single match object.

With /g, .match() returns all substrings of str that match regExp:

> 'ab aab'.match(/(a+)b/g)
[ 'ab', 'aab' ]

If there is no match, .match() returns null:

> 'xyz'.match(/(a+)b/g)
null

You can use the Or operator to protect yourself against null:

const numberOfMatches = (str.match(regExp) || []).length;

40.5.6 str.replace(searchValue, replacementValue) [ES3]

.replace() is overloaded – it works differently, depending on the types of its parameters:
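
The following sketch shows how the type of searchValue and the flag /g affect how many occurrences are replaced (the input string is an arbitrary example):

```javascript
// searchValue is a string: only the first occurrence is replaced
console.log('a.b.c'.replace('.', '-')); // a-b.c

// searchValue is a regular expression without /g: only the first match
console.log('a.b.c'.replace(/\./, '-')); // a-b.c

// searchValue is a regular expression with /g: every match
console.log('a.b.c'.replace(/\./g, '-')); // a-b-c
```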

The next two subsubsections assume that a regular expression with /g is being used.

40.5.6.1 replacementValue is a string

If the replacement value is a string, the dollar sign has special meaning – it inserts text matched by the regular expression:

Text     Result
$$       single $
$&       complete match
$`       text before match
$'       text after match
$n       capture of positional group n (n > 0)
$<name>  capture of named group name [ES2018]

Example: Inserting the text before, inside, and after the matched substring.

> 'a1 a2'.replace(/a/g, "($`|$&|$')")
'(|a|1 a2)1 (a1 |a|2)2'

Example: Inserting the captures of positional groups.

> const regExp = /^([A-Za-z]+): (.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $1, VALUE: $2')
'KEY: first, VALUE: Jane'

Example: Inserting the captures of named groups.

> const regExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $<key>, VALUE: $<value>')
'KEY: first, VALUE: Jane'

40.5.6.2 replacementValue is a function

If the replacement value is a function, you can compute each replacement. In the following example, we multiply each non-negative integer that we find by two.

assert.equal(
  '3 cats and 4 dogs'.replace(/[0-9]+/g, (all) => 2 * Number(all)),
  '6 cats and 8 dogs'
);

The replacement function gets the following parameters. Note how similar they are to match objects. These parameters are all positional, but I’ve included how one might name them:
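
As a sketch, the parameters could be named as follows. The names all, g1, g2, index, input, and groups are only suggestions, and the last parameter is assumed to be present because the regular expression has named groups [ES2018]:

```javascript
const calls = [];
const result = 'a1 b2'.replace(
  /(?<letter>[a-z])(?<digit>[0-9])/g,
  (all, g1, g2, index, input, groups) => {
    // all: complete match; g1, g2: positional captures
    // index: where the match starts; input: the whole string
    // groups: captures of the named groups
    calls.push({all, g1, g2, index, input, groups});
    return g2 + g1; // swap letter and digit
  }
);

console.log(result); // 1a 2b
console.log(calls[1].all, calls[1].index); // b2 3
console.log(calls[1].groups.letter); // b
```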

  Exercise: Change quotes via .replace() and a named group

exercises/regexps/change_quotes_test.mjs

40.5.7 Other methods for working with regular expressions

String.prototype.split() is described in the chapter on strings. Its first parameter is either a string or a regular expression. If it is the latter, then captures of groups appear in the result:

> 'a:b : c'.split(':')
[ 'a', 'b ', ' c' ]
> 'a:b : c'.split(/ *: */)
[ 'a', 'b', 'c' ]
> 'a:b : c'.split(/( *):( *)/)
[ 'a', '', '', 'b', ' ', ' ', 'c' ]

40.6 Flag /g and its pitfalls

The following two regular expression methods work differently if /g is switched on:

Then they can be called repeatedly and deliver all matches inside a string. Property .lastIndex of the regular expression is used to track the current position inside the string. For example:

const r = /a/g;
assert.equal(r.lastIndex, 0);

assert.equal(r.test('aa'), true); // 1st match?
assert.equal(r.lastIndex, 1); // after 1st match

assert.equal(r.test('aa'), true); // 2nd match?
assert.equal(r.lastIndex, 2); // after 2nd match

assert.equal(r.test('aa'), false); // 3rd match?
assert.equal(r.lastIndex, 0); // start over

The next subsections explain the pitfalls of using /g. They are followed by a subsection that explains how to work around those pitfalls.

40.6.1 Pitfall: You can’t inline a regular expression with flag /g

A regular expression with /g can’t be inlined. For example, in the following while loop, the regular expression is created fresh every time the condition is checked. Therefore, its .lastIndex is always zero and the loop never terminates.

let count = 0;
// Infinite loop
while (/a/g.test('babaa')) {
  count++;
}

40.6.2 Pitfall: Removing /g can break code

If code expects a regular expression with /g and has a loop over the results of .exec() or .test(), then a regular expression without /g can cause an infinite loop:

function countMatches(regExp) {
  let count = 0;
  // Infinite loop
  while (regExp.exec('babaa')) {
    count++;
  }
  return count;
}
countMatches(/a/); // Missing: flag /g

Why? Because .exec() always returns the first result, a match object, and never null.

40.6.3 Pitfall: Adding /g can break code

With .test(), there is another caveat: If you want to check exactly once whether a regular expression matches a string, then the regular expression must not have /g. Otherwise, you generally get a different result every time you call .test():

function isMatching(regExp) {
  return regExp.test('Xa');
}
const myRegExp = /^X/g;
assert.equal(isMatching(myRegExp), true);
assert.equal(isMatching(myRegExp), false);

Normally, you won’t add /g if you intend to use .test() in this manner. But it can happen if, e.g., you use the same regular expression for testing and for replacing.

40.6.4 Pitfall: Code can break if .lastIndex isn’t zero

If you match a regular expression multiple times via .exec() or .test(), the current position inside the input string is stored in the regular expression property .lastIndex. Therefore, code that matches multiple times may break if .lastIndex is not zero:

function countMatches(regExp) {
  let count = 0;
  while (regExp.exec('babaa')) {
    count++;
  }
  return count;
}

const myRegExp = /a/g;
myRegExp.lastIndex = 4;
assert.equal(countMatches(myRegExp), 1); // should be 3

Note that .lastIndex is always zero in newly created regular expressions, but it may not be if the same regular expression is used multiple times.

40.6.5 Dealing with /g and .lastIndex

As an example of dealing with /g and .lastIndex, we will implement the following function.

countMatches(regExp, str)

It counts how often regExp has a match inside str. How do we prevent a wrong regExp from breaking our code? Let’s look at three approaches.

First, we can throw an exception if /g isn’t set or .lastIndex isn’t zero:

function countMatches(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  if (regExp.lastIndex !== 0) {
    throw new Error('regExp.lastIndex must be zero');
  }
  
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}

Second, we can clone the parameter. That has the added benefit that regExp won’t be changed.

function countMatches(regExp, str) {
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);

  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}
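
A quick check of the cloning approach, as a sketch. The function is repeated under the hypothetical name countMatchesViaClone so that the snippet is self-contained:

```javascript
function countMatchesViaClone(regExp, str) {
  // Add /g if it is missing; the clone starts with .lastIndex === 0
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);

  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}

// A non-zero .lastIndex no longer affects the result...
const myRegExp = /a/g;
myRegExp.lastIndex = 4;
console.log(countMatchesViaClone(myRegExp, 'babaa')); // 3

// ...and the original regular expression is left untouched
console.log(myRegExp.lastIndex); // 4

// A missing flag /g no longer causes an infinite loop
console.log(countMatchesViaClone(/a/, 'babaa')); // 3
```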

Third, we can use .match() to count occurrences – which doesn’t depend on .lastIndex: with /g, .match() always starts at index 0 and leaves .lastIndex at zero.

function countMatches(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  return (str.match(regExp) || []).length;
}

40.7 Techniques for working with regular expressions

40.7.1 Escaping arbitrary text for regular expressions

The following function escapes an arbitrary text so that it is matched verbatim if you put it inside a regular expression:

function escapeForRegExp(str) {
  return str.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&'); // (A)
}
assert.equal(escapeForRegExp('[yes?]'), String.raw`\[yes\?\]`);
assert.equal(escapeForRegExp('_g_'), String.raw`_g_`);

In line A, we escape all syntax characters. We have to be selective, because the regular expression flag /u forbids many escapes. For example: \a \: \-
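
The restriction can be observed with a small sketch. The helper compiles() is hypothetical and only reports whether the RegExp constructor accepts a pattern:

```javascript
// Hypothetical helper: does the pattern compile with the given flags?
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// Without /u, escaping a non-syntax character is tolerated
console.log(compiles('\\-', '')); // true

// With /u, the same escapes are syntax errors
console.log(compiles('\\-', 'u')); // false
console.log(compiles('\\:', 'u')); // false

// Escaped syntax characters are fine in both modes
console.log(compiles('\\.', 'u')); // true
```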

The string method .replace() only lets you replace plain text once: if searchValue is a string, only its first occurrence is replaced. With escapeForRegExp(), we can work around that limitation and replace plain text multiple times:

const plainText = ':-)';
const regExp = new RegExp(escapeForRegExp(plainText), 'ug');
assert.equal(
  ':-) :-) :-)'.replace(regExp, '🙂'), '🙂 🙂 🙂');

40.7.2 Matching everything or nothing

Sometimes, you may need a regular expression that matches everything or nothing, for example as a default value.
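
Two commonly used patterns for this purpose, as a sketch (the constant names are only illustrative): the empty group (?:) matches at every position, while .^ can never match without the flag /m, because the start of the input cannot come after a character.

```javascript
// Matches everything: the empty pattern matches any string
const matchEverything = /(?:)/;
console.log(matchEverything.test('abc')); // true
console.log(matchEverything.test('')); // true

// Matches nothing: a character followed by the start of the input
const matchNothing = /.^/;
console.log(matchNothing.test('abc')); // false
console.log(matchNothing.test('')); // false
```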