RegExp
Availability of features
Unless stated otherwise, each regular expression feature has been available since ES3.
The two main ways of creating regular expressions are:
Literal: compiled statically (at load time).
Constructor: compiled dynamically (at runtime).
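As a minimal sketch of the two ways of creating a regular expression: the pattern abc and the flags u and i are illustrative values, and the variable names are hypothetical.

```javascript
// Literal: compiled statically (at load time)
const literal = /abc/ui;
// Constructor: compiled dynamically (at runtime)
const constructed = new RegExp('abc', 'ui');

console.log(literal.source, literal.flags);          // abc iu
console.log(constructed.source, constructed.flags);  // abc iu
```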
Both regular expressions have the same two parts:
The pattern abc – the actual regular expression.
The flags u and i. Flags configure how the pattern is interpreted. For example, i enables case-insensitive matching. A list of available flags is given later in this chapter.
There are two variants of the constructor RegExp():
new RegExp(pattern : string, flags = '') [ES3]
A new regular expression is created as specified via pattern. If flags is missing, the empty string '' is used.
new RegExp(regExp : RegExp, flags = regExp.flags) [ES6]
regExp is cloned. If flags is provided then it determines the flags of the clone.
The second variant is useful for cloning regular expressions, optionally while modifying them. Flags are immutable and this is the only way of changing them. For example:
function copyAndAddFlags(regExp, flagsToAdd='') {
  // The constructor doesn’t allow duplicate flags,
  // make sure there aren’t any:
  const newFlags = [...new Set(regExp.flags + flagsToAdd)].join('');
  return new RegExp(regExp, newFlags);
}
assert.equal(/abc/i.flags, 'i');
assert.equal(copyAndAddFlags(/abc/i, 'g').flags, 'gi');
At the top level of a regular expression, the following syntax characters are special. They are escaped by prefixing a backslash (\).
\ ^ $ . * + ? ( ) [ ] { } |
In regular expression literals, you must escape slashes.
In the argument of new RegExp(), you don’t have to escape slashes.
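The difference between the two notations can be sketched as follows. The input string '//' and the variable names are illustrative assumptions.

```javascript
// In a literal, an unescaped slash would end the pattern, so it must be escaped
const viaLiteral = /\/\//;
// In a string argument, slashes have no special meaning
const viaConstructor = new RegExp('//');

console.log(viaLiteral.test('//'));      // true
console.log(viaConstructor.test('//'));  // true
```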
Atoms are the basic building blocks of regular expressions.
Pattern characters: all characters except syntax characters (^, $, etc.). Pattern characters match themselves. Examples: A b %
. matches any character. You can use the flag /s (dotAll) to control if the dot matches line terminators or not (more below).
Character escapes:
\f: form feed (FF)
\n: line feed (LF)
\r: carriage return (CR)
\t: character tabulation
\v: line tabulation
\cA (Ctrl-A), …, \cZ (Ctrl-Z): control characters
\u00E4: code unit escape
\u{1F44D}: code point escape (requires the flag /u)
Character class escapes:
\d: digits (same as [0-9])
\D: non-digits
\w: “word” characters (same as [A-Za-z0-9_], related to identifiers in programming languages)
\W: non-word characters
\s: whitespace (space, tab, line terminators, etc.)
\S: non-whitespace
Unicode property escapes: \p{White_Space}, \P{White_Space}, etc. They require the flag /u.
In the Unicode standard, each character has properties – metadata describing it. Properties play an important role in defining the nature of a character. Quoting the Unicode Standard, Sect. 3.3, D3:
The semantics of a character are determined by its identity, normative properties, and behavior.
These are a few examples of properties:
Name: a unique name, composed of uppercase letters, digits, hyphens and spaces. For example:
A: Name = LATIN CAPITAL LETTER A
🙂: Name = SLIGHTLY SMILING FACE
General_Category: categorizes characters. For example:
General_Category = Lowercase_Letter
General_Category = Currency_Symbol
White_Space: used for marking invisible spacing characters, such as spaces, tabs and newlines. For example:
White_Space = True
White_Space = False
Age: version of the Unicode Standard in which a character was introduced. For example: The Euro sign € was added in version 2.1 of the Unicode standard.
€: Age = 2.1
Block: a contiguous range of code points. Blocks don’t overlap and their names are unique. For example:
Block = Basic_Latin (range U+0000..U+007F)
🙂: Block = Emoticons (range U+1F600..U+1F64F)
Script: a collection of characters used by one or more writing systems.
Script = Greek
Script = Cyrillic
Unicode property escapes look like this:
(1) \p{prop=value}: matches all characters whose property prop has the value value.
(2) \P{prop=value}: matches all characters that do not have a property prop whose value is value.
(3) \p{bin_prop}: matches all characters whose binary property bin_prop is True.
(4) \P{bin_prop}: matches all characters whose binary property bin_prop is False.
Comments:
You can only use Unicode property escapes if the flag /u is set. Without /u, \p is the same as p.
Forms (3) and (4) can be used as abbreviations if the property is General_Category. For example, the following two escapes are equivalent:
\p{Lowercase_Letter}
\p{General_Category=Lowercase_Letter}
Examples:
Checking for whitespace:
Checking for Greek letters:
Deleting any letters:
Deleting lowercase letters:
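The four use cases listed above might look as follows. This is a sketch: the input strings are made up for illustration, and every pattern needs the flag /u.

```javascript
// Checking for whitespace
console.log(/^\p{White_Space}+$/u.test('\t \n\r'));  // true

// Checking for Greek letters
console.log(/^\p{Script=Greek}+$/u.test('αβγ'));  // true

// Deleting any letters
console.log('1π2a3Я4'.replace(/\p{Letter}/ug, ''));  // 1234

// Deleting lowercase letters
console.log('AbCdEf'.replace(/\p{Lowercase_Letter}/ug, ''));  // ACE
```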
A character class wraps class ranges in square brackets. The class ranges specify a set of characters:
[«class ranges»] matches any character in the set.
[^«class ranges»] matches any character not in the set.
Rules for class ranges:
Non-syntax characters stand for themselves: [abc]
Only the following four characters are special and must be escaped via backslashes:
^ \ - ]
^ only has to be escaped if it comes first.
- need not be escaped if it comes first or last.
Character escapes (\n, \u{1F44D}, etc.) have the usual meaning.
Inside a character class, \b stands for backspace. Elsewhere in a regular expression, it matches word boundaries.
Character class escapes (\d, \p{White_Space}, etc.) have the usual meaning.
Ranges of characters are specified via dashes: [a-z]
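The rules for class ranges can be checked with a few small patterns. This is only a sketch with made-up input strings.

```javascript
console.log(/^[abc]+$/.test('cab'));     // true: characters stand for themselves
console.log(/^[^abc]+$/.test('xyz'));    // true: negated class
console.log(/^[a-z]+$/.test('hello'));   // true: range

// All four special characters, escaped via backslashes
console.log(/^[\^\\\-\]]+$/.test('^\\-]'));  // true

// - comes first and ^ does not come first: no escaping needed
console.log(/^[-a^]+$/.test('a-^'));  // true

// \b is a backspace inside a class, a word boundary elsewhere
console.log(/[\b]/.test('\b'));  // true
console.log(/\b/.test('\b'));    // false
```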
Positional capture group: (#+)
Backreference: \1, \2, etc.
Named capture group: (?<hashes>#+)
Backreference: \k<hashes>
Non-capturing group: (?:#+)
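A short sketch of the three kinds of groups. The surrounding text abc and the input strings are illustrative assumptions.

```javascript
// Positional group plus backreference \1: the same run of # must appear twice
console.log(/^(#+)abc\1$/.test('##abc##'));  // true
console.log(/^(#+)abc\1$/.test('#abc##'));   // false

// Named group plus backreference \k<hashes>
console.log(/^(?<hashes>#+)abc\k<hashes>$/.test('##abc##'));  // true

// A non-capturing group groups without producing a capture
console.log(/(?:#+)abc/.exec('##abc').length);  // 1
console.log(/(#+)abc/.exec('##abc').length);    // 2
```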
By default, all of the following quantifiers are greedy (they match as many characters as possible):
?: match never or once
*: match zero or more times
+: match one or more times
{n}: match n times
{n,}: match n or more times
{n,m}: match at least n times, at most m times.
To make them reluctant (so that they match as few characters as possible), put question marks (?) after them:
> /".*"/.exec('"abc"def"')[0] // greedy
'"abc"def"'
> /".*?"/.exec('"abc"def"')[0] // reluctant
'"abc"'
^ matches only at the beginning of the input
$ matches only at the end of the input
\b matches only at a word boundary
\B matches only when not at a word boundary
Positive lookahead: (?=«pattern») matches if pattern matches what comes next.
Example: sequences of lowercase letters that are followed by an X.
Note that the X itself is not part of the matched substring.
Negative lookahead: (?!«pattern») matches if pattern does not match what comes next.
Example: sequences of lowercase letters that are not followed by an X.
Positive lookbehind: (?<=«pattern») matches if pattern matches what came before.
Example: sequences of lowercase letters that are preceded by an X.
Negative lookbehind: (?<!«pattern») matches if pattern does not match what came before.
Example: sequences of lowercase letters that are not preceded by an X.
Example: replace “.js” with “.html”, but not in “Node.js”.
> 'Node.js: index.js and main.js'.replace(/(?<!Node)\.js/g, '.html')
'Node.js: index.html and main.html'
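The four lookaround examples described above (lowercase letters followed or preceded by an X) can be sketched like this. The input strings are assumptions chosen for illustration.

```javascript
// Positive lookahead: followed by X (the X is not part of the match)
console.log('abcX def'.match(/[a-z]+(?=X)/g));   // [ 'abc' ]

// Negative lookahead: not followed by X
console.log('abcX def'.match(/[a-z]+(?!X)/g));   // [ 'ab', 'def' ]

// Positive lookbehind: preceded by X
console.log('Xabc def'.match(/(?<=X)[a-z]+/g));  // [ 'abc' ]

// Negative lookbehind: not preceded by X
console.log('Xabc def'.match(/(?<!X)[a-z]+/g));  // [ 'bc', 'def' ]
```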
Disjunction (|)
Caveat: this operator has low precedence. Use groups if necessary:
^aa|zz$ matches all strings that start with aa and/or end with zz. Note that | has a lower precedence than ^ and $.
^(aa|zz)$ matches the two strings 'aa' and 'zz'.
^a(a|z)z$ matches the two strings 'aaz' and 'azz'.
Literal flag | Property name | ES | Description
---|---|---|---
g | global | ES3 | Match multiple times
i | ignoreCase | ES3 | Match case-insensitively
m | multiline | ES3 | ^ and $ match per line
s | dotAll | ES2018 | Dot matches line terminators
u | unicode | ES6 | Unicode mode (recommended)
y | sticky | ES6 | No characters between matches
The following regular expression flags are available in JavaScript (tbl. 20 provides a compact overview):
/g (.global): fundamentally changes how the following methods work.
RegExp.prototype.test()
RegExp.prototype.exec()
String.prototype.match()
How is explained in §40.6 “Flag /g and its pitfalls”. In a nutshell: Without /g, the methods only consider the first match for a regular expression in an input string. With /g, they consider all matches.
/i (.ignoreCase): switches on case-insensitive matching.
/m (.multiline): If this flag is on, ^ matches the beginning of each line and $ matches the end of each line. If it is off, ^ matches the beginning of the whole input string and $ matches the end of the whole input string.
/u (.unicode): This flag switches on the Unicode mode for a regular expression. That mode is explained in the next subsection.
/y (.sticky): This flag mainly makes sense in conjunction with /g. When both are switched on, any match must directly follow the previous one (that is, it must start at index .lastIndex of the regular expression object). Therefore, the first match must be at index 0.
> 'a1a2 a3'.match(/a./gy)
[ 'a1', 'a2' ]
> '_a1a2 a3'.match(/a./gy) // first match must be at index 0
null
> 'a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
> '_a1a2 a3'.match(/a./g)
[ 'a1', 'a2', 'a3' ]
The main use case for /y is tokenization (during parsing).
/s (.dotAll): By default, the dot does not match line terminators. With this flag, it does.
Work-around if /s isn’t supported: Use [^] instead of a dot.
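A minimal sketch of how the flags /i, /m and /s change matching. The input strings are made up for illustration.

```javascript
// /i: case-insensitive matching
console.log(/a/.test('A'));   // false
console.log(/a/i.test('A'));  // true

// /m: ^ and $ match per line
console.log('a1\na2\na3'.match(/^a./g));   // [ 'a1' ]
console.log('a1\na2\na3'.match(/^a./gm));  // [ 'a1', 'a2', 'a3' ]

// /s: the dot also matches line terminators
console.log(/./.test('\n'));    // false
console.log(/./s.test('\n'));   // true
console.log(/[^]/.test('\n'));  // true (work-around without /s)
```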
/u
The flag /u switches on a special Unicode mode for regular expressions. That mode enables several features:
In patterns, you can use Unicode code point escapes such as \u{1F42A} to specify characters. Code unit escapes such as \u03B1 only have a range of four hexadecimal digits (which corresponds to the basic multilingual plane).
In patterns, you can use Unicode property escapes such as \p{White_Space}.
Many escapes are now forbidden. For example: \a \- \:
Pattern characters always match themselves.
Without /u, there are some pattern characters that still match themselves if you escape them with backslashes.
With /u, \p starts a Unicode property escape.
The atomic units for matching are Unicode characters (code points), not JavaScript characters (code units).
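The stricter escape rules of /u can be observed as follows. This is a sketch: compiles() is a hypothetical helper (not part of the chapter) that reports whether a pattern can be compiled.

```javascript
// Without /u, these escaped pattern characters still match themselves
console.log(/\p\a\-\:/.test('pa-:'));  // true

// Hypothetical helper: does the pattern compile with the given flags?
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (err) {
    return false; // a SyntaxError was thrown
  }
}

console.log(compiles('\\a', ''));   // true
console.log(compiles('\\a', 'u'));  // false
console.log(compiles('\\-', 'u'));  // false
console.log(compiles('\\:', 'u'));  // false
console.log(compiles('\\p{White_Space}', 'u'));  // true
```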
The following subsections explain the last item in more detail. They use the following Unicode character to explain when the atomic units are Unicode characters and when they are JavaScript characters:
const codePoint = '🙂';
const codeUnits = '\uD83D\uDE42'; // UTF-16
assert.equal(codePoint, codeUnits); // same string!
I’m only switching between 🙂 and \uD83D\uDE42 to illustrate how JavaScript sees things. Both are equivalent and can be used interchangeably in strings and regular expressions.
With /u, the two code units of 🙂 are treated as a single character.
Without /u, 🙂 is treated as two characters.
Note that ^ and $ demand that the input string have a single character. That’s why the first result is false.
The dot operator (.) matches Unicode characters, not JavaScript characters
With /u, the dot operator matches Unicode characters.
.match() plus /g returns an Array with all the matches of a regular expression.
Without /u, the dot operator matches JavaScript characters.
With /u, a quantifier applies to the whole preceding Unicode character.
Without /u, a quantifier only applies to the preceding JavaScript character.
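The three consequences above (character classes, the dot operator, quantifiers) can be sketched as follows. The code writes 🙂 as escapes so that the code units stay visible, and the patterns are illustrative assumptions.

```javascript
const smiley = '\u{1F642}'; // 🙂, the same string as '\uD83D\uDE42'

// Character classes: without /u, the class contains two separate code units
console.log(/^[\uD83D\uDE42]$/.test(smiley));    // false
console.log(/^[\uD83D\uDE42]$/.test('\uDE42'));  // true
console.log(/^[\u{1F642}]$/u.test(smiley));      // true

// The dot operator
console.log(smiley.match(/./gu).length);  // 1
console.log(smiley.match(/./g).length);   // 2

// Quantifiers
console.log(/^-\u{1F642}{3}-$/u.test('-' + smiley.repeat(3) + '-'));    // true
console.log(/^-\uD83D\uDE42{3}-$/.test('-\uD83D\uDE42\uDE42\uDE42-'));  // true
```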
Noteworthy:
.lastIndex is a real instance property. All other properties are implemented via getters.
.lastIndex is the only mutable property. All other properties are read-only. If you want to change them, you need to copy the regular expression (consult §40.1.2 “Cloning and non-destructively modifying regular expressions” for details).
Each regular expression flag exists as a property, with a longer, more descriptive name.
This is the complete list of flag properties:
.dotAll (/s)
.global (/g)
.ignoreCase (/i)
.multiline (/m)
.sticky (/y)
.unicode (/u)
Each regular expression also has the following properties:
.source [ES3]: The regular expression pattern
.flags [ES6]: The flags of the regular expression
.lastIndex [ES3]: Used when flag /g is switched on. Consult §40.6 “Flag /g and its pitfalls” for details.
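A brief sketch of reading these properties. The regular expression /abc/ug is an arbitrary example.

```javascript
const regExp = /abc/ug;

// Flag properties
console.log(regExp.global, regExp.unicode, regExp.ignoreCase);  // true true false
console.log(/./s.dotAll);  // true

// Other properties
console.log(regExp.source);     // abc
console.log(regExp.flags);      // gu
console.log(regExp.lastIndex);  // 0

// .lastIndex is the only property that can be changed
regExp.lastIndex = 2;
console.log(regExp.lastIndex);  // 2
```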
Note that, in general, regular expressions match anywhere in a string:
You can change that by using assertions such as ^ or by using the flag /y.
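A small sketch of matching anywhere versus matching at a fixed position. The input strings are illustrative.

```javascript
// By default, a match can be anywhere in the string
console.log(/a/.test('__a__'));  // true

// The assertion ^ anchors the match at the beginning
console.log(/^a/.test('__a__'));  // false
console.log(/^a/.test('a__'));    // true

// With /y, the match must start at .lastIndex (initially 0)
console.log(/a/y.test('__a__'));  // false
console.log(/a/y.test('a__'));    // true
```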
regExp.test(str): is there a match? [ES3]
The regular expression method .test() returns true if regExp matches str.
With .test() you should normally avoid the /g flag. If you use it, you generally don’t get the same result every time you call the method.
The results are due to /a/ having two matches in the string. After all of those were found, .test() returns false.
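Both behaviors of .test() can be sketched as follows. The patterns and input strings are assumptions chosen for illustration; the string 'aab' contains two matches for /a/.

```javascript
// Without /g: the same result every time
console.log(/bc/.test('ABCD'));          // false
console.log(/bc/i.test('ABCD'));         // true
console.log(/\.mjs$/.test('main.mjs'));  // true

// With /g: the result depends on how often .test() was called before
const r = /a/g;
console.log(r.test('aab'));  // true
console.log(r.test('aab'));  // true
console.log(r.test('aab'));  // false
```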
str.search(regExp): at what index is the match? [ES3]
The string method .search() returns the first index of str at which there is a match for regExp.
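A minimal sketch of .search() with made-up inputs. If there is no match, the method returns -1.

```javascript
console.log('_abc_'.search(/abc/));   // 1
console.log('a2b'.search(/[0-9]/));   // 1
console.log('abc'.search(/[0-9]/));   // -1
```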
regExp.exec(str): capturing groups [ES3]
Without the flag /g, .exec() returns the captures of the first match for regExp in str:
assert.deepEqual(
  /(a+)b/.exec('ab aab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aab',
    groups: undefined,
  }
);
The result is a match object with the following properties:
[0]: the complete substring matched by the regular expression
[1]: capture of positional group 1 (etc.)
.index: where did the match occur?
.input: the string that was matched against
.groups: captures of named groups
The previous example contained a single positional group. The following example demonstrates named groups:
assert.deepEqual(
  /(?<as>a+)b/.exec('ab aab'),
  {
    0: 'ab',
    1: 'a',
    index: 0,
    input: 'ab aab',
    groups: { as: 'a' },
  }
);
In the result of .exec(), you can see that a named group is also a positional group – its capture exists twice:
Once as a positional capture (property '1').
Once as a named capture (property groups.as).
If you want to retrieve all matches of a regular expression (not just the first one), you need to switch on the flag /g. Then you can call .exec() multiple times and get one match each time. After the last match, .exec() returns null.
> const regExp = /(a+)b/g;
> regExp.exec('ab aab')
{ 0: 'ab', 1: 'a', index: 0, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
{ 0: 'aab', 1: 'aa', index: 3, input: 'ab aab', groups: undefined }
> regExp.exec('ab aab')
null
Therefore, you can loop over all matches as follows:
const regExp = /(a+)b/g;
const str = 'ab aab';
let match;
// Check for null via truthiness
// Alternative: while ((match = regExp.exec(str)) !== null)
while (match = regExp.exec(str)) {
  console.log(match[1]);
}
// Output:
// 'a'
// 'aa'
Sharing regular expressions with /g has a few pitfalls, which are explained later.
Exercise: Extract quoted text via .exec()
exercises/regexps/extract_quoted_test.mjs
str.match(regExp): return all matching substrings [ES3]
Without /g, .match() works like .exec() – it returns a single match object.
With /g, .match() returns all substrings of str that match regExp.
If there is no match, .match() returns null.
You can use the Or operator to protect yourself against null.
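The behaviors of .match() described above can be sketched as follows. The pattern and input strings are illustrative, and numberOfMatches is a hypothetical variable name.

```javascript
// Without /g: a single match object, as with .exec()
const single = 'ab aab'.match(/(a+)b/);
console.log(single[0], single[1], single.index);  // ab a 0

// With /g: all matching substrings
console.log('ab aab'.match(/(a+)b/g));  // [ 'ab', 'aab' ]

// No match: null
console.log('xyz'.match(/(a+)b/g));  // null

// The Or operator (||) provides a fallback for null
const numberOfMatches = ('xyz'.match(/(a+)b/g) || []).length;
console.log(numberOfMatches);  // 0
```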
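str.replace(searchValue, replacementValue) [ES3]
.replace() is overloaded – it works differently, depending on the types of its parameters:
If searchValue is:
Regular expression without /g: Replace first match of this regular expression.
Regular expression with /g: Replace all matches of this regular expression.
String: Replace first occurrence of this string. The string is interpreted verbatim and not as a regular expression (for example, '*' becomes /\*/).
If replacementValue is:
String: Replace matches with this string. The character $ has special meaning and lets you insert captures of groups and more (read on for details).
Function: Compute the strings that replace matches via this function.
The next two subsubsections assume that a regular expression with /g is being used.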
replacementValue is a string
If the replacement value is a string, the dollar sign has special meaning – it inserts text matched by the regular expression:
Text | Result
---|---
$$ | single $
$& | complete match
$` | text before match
$' | text after match
$n | capture of positional group n (n > 0)
$<name> | capture of named group name [ES2018]
Example: Inserting the text before, inside, and after the matched substring.
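One way to sketch this, assuming the input 'a1 a2' and the pattern /a/g: each match is replaced by the text before it, the match itself, and the text after it, separated by vertical bars.

```javascript
const replaced = 'a1 a2'.replace(/a/g, "($`|$&|$')");
console.log(replaced);  // (|a|1 a2)1 (a1 |a|2)2
```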
Example: Inserting the captures of positional groups.
> const regExp = /^([A-Za-z]+): (.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $1, VALUE: $2')
'KEY: first, VALUE: Jane'
Example: Inserting the captures of named groups.
> const regExp = /^(?<key>[A-Za-z]+): (?<value>.*)$/ug;
> 'first: Jane'.replace(regExp, 'KEY: $<key>, VALUE: $<value>')
'KEY: first, VALUE: Jane'
replacementValue is a function
If the replacement value is a function, you can compute each replacement. In the following example, we multiply each non-negative integer that we find by two.
assert.equal(
  '3 cats and 4 dogs'.replace(/[0-9]+/g, (all) => 2 * Number(all)),
  '6 cats and 8 dogs'
);
The replacement function gets the following parameters. Note how similar they are to match objects. These parameters are all positional, but I’ve included how one might name them:
all: complete match
g1: capture of positional group 1
index: where did the match occur?
input: the string in which we are replacing
groups [ES2018]: captures of named groups (an object)
Exercise: Change quotes via .replace() and a named group
exercises/regexps/change_quotes_test.mjs
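The parameters of a replacement function could be used like this. It is a sketch with a made-up pattern and input. Note that groups is only passed if the regular expression has named groups.

```javascript
const result = 'first: Jane'.replace(
  /(?<key>[a-z]+): (?<value>.*)/,
  // Two positional groups, so the parameters are: all, g1, g2, index, input, groups
  (all, g1, g2, index, input, groups) => {
    return `${groups.value} <- ${groups.key} (g1=${g1}, index=${index})`;
  }
);
console.log(result);  // Jane <- first (g1=first, index=0)
```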
String.prototype.split() is described in the chapter on strings. Its first parameter is either a string or a regular expression. If it is the latter, then captures of groups appear in the result:
> 'a:b : c'.split(':')
[ 'a', 'b ', ' c' ]
> 'a:b : c'.split(/ *: */)
[ 'a', 'b', 'c' ]
> 'a:b : c'.split(/( *):( *)/)
[ 'a', '', '', 'b', ' ', ' ', 'c' ]
Flag /g and its pitfalls
The following two regular expression methods work differently if /g is switched on:
RegExp.prototype.exec()
RegExp.prototype.test()
Then they can be called repeatedly and deliver all matches inside a string. Property .lastIndex of the regular expression is used to track the current position inside the string. For example:
const r = /a/g;
assert.equal(r.lastIndex, 0);
assert.equal(r.test('aa'), true); // 1st match?
assert.equal(r.lastIndex, 1); // after 1st match
assert.equal(r.test('aa'), true); // 2nd match?
assert.equal(r.lastIndex, 2); // after 2nd match
assert.equal(r.test('aa'), false); // 3rd match?
assert.equal(r.lastIndex, 0); // start over
The next subsections explain the pitfalls of using /g. They are followed by a subsection that explains how to work around those pitfalls.
A regular expression with /g can’t be inlined
If a regular expression with /g is written directly into the condition of a while loop, it is created fresh every time the condition is checked. Therefore, its .lastIndex is always zero and the loop never terminates.
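The problem can be sketched as follows. The counter checks is an assumption added only so that this example terminates; without it, the first loop would run forever.

```javascript
let count = 0;
let checks = 0;
// Problematic: /a/g is created fresh on every check, so .lastIndex is always 0.
// The guard on `checks` is only here so that this sketch terminates.
while (/a/g.test('babaa') && checks < 5) {
  count++;
  checks++;
}
console.log(count);  // 5 (stopped by the guard, not by .test())

// Fix: create the regular expression once, outside the loop
const regExp = /a/g;
let fixedCount = 0;
while (regExp.test('babaa')) {
  fixedCount++;
}
console.log(fixedCount);  // 3
```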
Removing /g can break code
If code expects a regular expression with /g and has a loop over the results of .exec() or .test(), then a regular expression without /g can cause an infinite loop:
function countMatches(regExp) {
  let count = 0;
  // Infinite loop
  while (regExp.exec('babaa')) {
    count++;
  }
  return count;
}
countMatches(/a/); // Missing: flag /g
Why? Because .exec() always returns the first result, a match object, and never null.
Adding /g can break code
With .test(), there is another caveat: If you want to check exactly once if a regular expression matches a string then the regular expression must not have /g. Otherwise, you generally get a different result every time you call .test():
function isMatching(regExp) {
  return regExp.test('Xa');
}
const myRegExp = /^X/g;
assert.equal(isMatching(myRegExp), true);
assert.equal(isMatching(myRegExp), false);
Normally, you won’t add /g if you intend to use .test() in this manner. But it can happen if, e.g., you use the same regular expression for testing and for replacing.
Code can break if .lastIndex isn’t zero
If you match a regular expression multiple times via .exec() or .test(), the current position inside the input string is stored in the regular expression property .lastIndex. Therefore, code that matches multiple times may break if .lastIndex is not zero:
function countMatches(regExp) {
  let count = 0;
  while (regExp.exec('babaa')) {
    count++;
  }
  return count;
}
const myRegExp = /a/g;
myRegExp.lastIndex = 4;
assert.equal(countMatches(myRegExp), 1); // should be 3
Note that .lastIndex is always zero in newly created regular expressions, but it may not be if the same regular expression is used multiple times.
Dealing with /g and .lastIndex
As an example of dealing with /g and .lastIndex, we will implement the following function.
countMatches(regExp, str)
It counts how often regExp has a match inside str. How do we prevent a wrong regExp from breaking our code? Let’s look at three approaches.
First, we can throw an exception if /g isn’t set or .lastIndex isn’t zero:
function countMatches(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  if (regExp.lastIndex !== 0) {
    throw new Error('regExp.lastIndex must be zero');
  }
  let count = 0;
  while (regExp.test(str)) {
    count++;
  }
  return count;
}
Second, we can clone the parameter. That has the added benefit that regExp won’t be changed.
function countMatches(regExp, str) {
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);
  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}
Third, we can use .match() to count occurrences – which doesn’t change or depend on .lastIndex.
function countMatches(regExp, str) {
  if (!regExp.global) {
    throw new Error('Flag /g of regExp must be set');
  }
  return (str.match(regExp) || []).length;
}
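To see how the cloning approach copes with both pitfalls, here is a usage sketch. It repeats the second version of countMatches() so that it runs on its own; the variable names withoutG and withOffset are hypothetical.

```javascript
function countMatches(regExp, str) {
  // Clone so that regExp neither needs /g nor gets changed
  const cloneFlags = regExp.flags + (regExp.global ? '' : 'g');
  const clone = new RegExp(regExp, cloneFlags);
  let count = 0;
  while (clone.test(str)) {
    count++;
  }
  return count;
}

// Missing /g: no infinite loop
const withoutG = /a/;
console.log(countMatches(withoutG, 'babaa'));  // 3

// Non-zero .lastIndex: ignored, because the clone starts at 0
const withOffset = /a/g;
withOffset.lastIndex = 4;
console.log(countMatches(withOffset, 'babaa'));  // 3
console.log(withOffset.lastIndex);               // 4 (unchanged)
```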
The following function escapes an arbitrary text so that it is matched verbatim if you put it inside a regular expression:
function escapeForRegExp(str) {
  return str.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&'); // (A)
}
assert.equal(escapeForRegExp('[yes?]'), String.raw`\[yes\?\]`);
assert.equal(escapeForRegExp('_g_'), String.raw`_g_`);
In line A, we escape all syntax characters. We have to be selective, because the regular expression flag /u forbids many escapes. For example: \a \: \-
The string method .replace() only lets you replace plain text once. With escapeForRegExp(), we can work around that limitation and replace plain text multiple times:
const plainText = ':-)';
const regExp = new RegExp(escapeForRegExp(plainText), 'ug');
assert.equal(
':-) :-) :-)'.replace(regExp, '🙂'), '🙂 🙂 🙂');
Sometimes, you may need a regular expression that matches everything or nothing. For example, as a default value.
Match everything: /(?:)/
The empty group () matches everything. We make it non-capturing (via ?:), to avoid unnecessary work.
Match nothing: /.^/
^ only matches at the beginning of a string. The dot moves matching beyond the first character and now ^ doesn’t match anymore.
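Both patterns can be checked quickly. This is a sketch; the variable names are hypothetical.

```javascript
const matchEverything = /(?:)/;
const matchNothing = /.^/;

console.log(matchEverything.test(''), matchEverything.test('abc'));  // true true
console.log(matchNothing.test(''), matchNothing.test('abc'));        // false false
```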