Strings • JavaScript for impatient programmers (beta)

18 Strings

18.1 Plain string literals
- 18.1.1 Escaping
18.2 Accessing characters and code points
- 18.2.1 Accessing JavaScript characters
- 18.2.2 Accessing Unicode code point characters via for-of and spreading
18.3 String concatenation via +
18.4 Converting to string
- 18.4.1 Stringifying objects
- 18.4.2 Customizing the stringification of objects
- 18.4.3 An alternate way of stringifying values
18.5 Comparing strings
18.6 Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
- 18.6.1 Working with code points
- 18.6.2 Working with code units (char codes)
- 18.6.3 Caveat: grapheme clusters
18.7 Quick reference: Strings
- 18.7.1 Converting to string
- 18.7.2 Numeric values of characters
- 18.7.3 String operators
- 18.7.4 String.prototype: finding and matching
- 18.7.5 String.prototype: extracting
- 18.7.6 String.prototype: combining
- 18.7.7 String.prototype: transforming
- 18.7.8 Sources

Strings are primitive values in JavaScript and immutable. That is, string-related operations always produce new strings and never change existing strings.

18.1 Plain string literals

Plain string literals are delimited by either single quotes or double quotes:

const str1 = 'abc';
const str2 = "abc";
assert.equal(str1, str2);

Single quotes are used more often because they make it easier to mention HTML, where double quotes are preferred.

The next chapter covers template literals, which give you:

String interpolation
Multiple lines
Raw string literals (backslash has no special meaning)

18.1.1 Escaping

The backslash lets you create special characters:

Unix line break: '\n'
Windows line break: '\r\n'
Tab: '\t'
Backslash: '\\'

The backslash also lets you use the delimiter of a string literal inside that literal:

assert.equal(
  'She said: "Let\'s go!"',
  "She said: \"Let's go!\"");

18.2 Accessing characters and code points

18.2.1 Accessing JavaScript characters

JavaScript has no extra data type for characters – characters are always represented as strings.

const str = 'abc';

// Reading a character at a given index
assert.equal(str[1], 'b');

// Counting the characters in a string:
assert.equal(str.length, 3);

18.2.2 Accessing Unicode code point characters via `for-of` and spreading

Iterating over strings via for-of or spreading (...) visits Unicode code point characters. Each code point character is encoded by 1–2 JavaScript characters. For more information, see §18.6 “Atoms of text: Unicode characters, JavaScript characters, grapheme clusters”.

This is how you iterate over the code point characters of a string via for-of:

for (const ch of 'x🙂y') {
  console.log(ch);
}
// Output:
// 'x'
// '🙂'
// 'y'

And this is how you convert a string into an Array of code point characters via spreading:

assert.deepEqual([...'x🙂y'], ['x', '🙂', 'y']);
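One place where the difference between JavaScript characters and code point characters shows up is reversing a string. The following is a minimal sketch; the helper name `reverseCodePoints` is made up for this illustration and is not part of the standard library.

```javascript
// Hypothetical helper: reverses a string code point by code point.
function reverseCodePoints(str) {
  return [...str].reverse().join('');
}

const reversed = reverseCodePoints('x🙂y');
console.log(reversed); // 'y🙂x'

// Splitting into JavaScript characters tears the emoji apart:
// its two code units end up in the wrong order.
const broken = 'x🙂y'.split('').reverse().join('');
console.log(broken === 'y🙂x'); // false
```

Note that this sketch still works with code points only; it can separate the parts of a grapheme cluster (see §18.6.3).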

18.3 String concatenation via `+`

If at least one operand is a string, the plus operator (+) converts any non-strings to strings and concatenates the result:

assert.equal(3 + ' times ' + 4, '3 times 4');

The assignment operator += is useful if you want to assemble a string, piece by piece:

let str = ''; // must be `let`!
str += 'Say it';
str += ' one more';
str += ' time';

assert.equal(str, 'Say it one more time');

Concatenating via + is efficient

Using + to assemble strings is quite efficient, because most JavaScript engines internally optimize it.

Exercise: Concatenating strings

exercises/strings/concat_string_array_test.mjs

18.4 Converting to string

These are three ways of converting a value x to a string:

String(x)
''+x
x.toString() (does not work for undefined and null)

Recommendation: use the descriptive and safe String().

Examples:

assert.equal(String(undefined), 'undefined');
assert.equal(String(null), 'null');

assert.equal(String(false), 'false');
assert.equal(String(true), 'true');

assert.equal(String(123.45), '123.45');
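The following sketch illustrates why String() can be considered the safe choice. It relies on symbols, which are covered elsewhere in this book; the variable names are only for illustration.

```javascript
// x.toString() throws for undefined and null:
let toStringFailed = false;
try {
  undefined.toString();
} catch (err) {
  toStringFailed = err instanceof TypeError;
}
console.log(toStringFailed); // true

// ''+x throws for symbols:
let plusFailed = false;
try {
  '' + Symbol('id');
} catch (err) {
  plusFailed = err instanceof TypeError;
}
console.log(plusFailed); // true

// String(x) works in both cases:
const fromUndefined = String(undefined);
const fromSymbol = String(Symbol('id'));
console.log(fromUndefined); // 'undefined'
console.log(fromSymbol); // 'Symbol(id)'
```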

Pitfall for booleans: If you convert a boolean to a string via String(), you generally can’t convert it back via Boolean():

> String(false)
'false'
> Boolean('false')
true

The only string for which Boolean() returns false is the empty string.
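If you do need to convert a stringified boolean back, you have to compare strings yourself. A minimal sketch, assuming that only the exact strings 'true' and 'false' are valid input (the helper `parseBooleanString` is hypothetical):

```javascript
// Hypothetical helper: converts the output of String(bool) back to a boolean.
function parseBooleanString(str) {
  if (str === 'true') return true;
  if (str === 'false') return false;
  throw new TypeError('Not a stringified boolean: ' + str);
}

console.log(parseBooleanString(String(false))); // false
console.log(parseBooleanString(String(true))); // true
```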

18.4.1 Stringifying objects

Plain objects have a default string representation that is not very useful:

> String({a: 1})
'[object Object]'

Arrays have a better string representation, but it still hides much information:

> String(['a', 'b'])
'a,b'
> String(['a', ['b']])
'a,b'

> String([1, 2])
'1,2'
> String(['1', '2'])
'1,2'

> String([true])
'true'
> String(['true'])
'true'
> String(true)
'true'

Stringifying functions returns their source code:

> String(function f() {return 4})
'function f() {return 4}'

18.4.2 Customizing the stringification of objects

You can override the built-in way of stringifying objects by implementing the method toString():

const obj = {
  toString() {
    return 'hello';
  }
};

assert.equal(String(obj), 'hello');
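The same mechanism is used wherever a value is converted to a string, for example by the plus operator or when an Array is stringified. The following sketch uses a made-up class `Point` to show this:

```javascript
// Hypothetical class, only for illustration
class Point {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  toString() {
    return '(' + this.x + ', ' + this.y + ')';
  }
}

const point = new Point(3, 4);
console.log(String(point)); // '(3, 4)'
console.log('Point: ' + point); // 'Point: (3, 4)'
console.log(String([point, point])); // '(3, 4),(3, 4)'
```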

18.4.3 An alternate way of stringifying values

The JSON data format is a text representation of JavaScript values. Therefore, JSON.stringify() can also be used to convert values to strings:

> JSON.stringify({a: 1})
'{"a":1}'
> JSON.stringify(['a', ['b']])
'["a",["b"]]'

The caveat is that JSON only supports null, booleans, numbers, strings, Arrays and objects (which it always treats as if they were created by object literals).
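The following sketch shows what this caveat means in practice for a few values that JSON does not support (the variable names are only for illustration):

```javascript
// In objects, properties whose values are undefined or functions are omitted:
const jsonObj = JSON.stringify({a: undefined, b() {}, c: 1});
console.log(jsonObj); // '{"c":1}'

// In Arrays, such values become null:
const jsonArr = JSON.stringify([undefined, 1]);
console.log(jsonArr); // '[null,1]'

// At the top level, the result is not even a string:
const jsonUndefined = JSON.stringify(undefined);
console.log(jsonUndefined); // undefined

// NaN and Infinity become null:
const jsonNumbers = JSON.stringify([NaN, Infinity]);
console.log(jsonNumbers); // '[null,null]'
```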

Tip: The third parameter lets you switch on multi-line output and specify how much to indent. For example:

console.log(JSON.stringify({first: 'Jane', last: 'Doe'}, null, 2));

This statement produces the following output.

{
  "first": "Jane",
  "last": "Doe"
}

18.5 Comparing strings

Strings can be compared via the following operators:

< <= > >=

There is one important caveat to consider: These operators compare based on the numeric values of JavaScript characters. That means that the order that JavaScript uses for strings is different from the one used in dictionaries and phone books:

> 'A' < 'B' // ok
true
> 'a' < 'B' // not ok
false
> 'ä' < 'b' // not ok
false

Properly comparing text is beyond the scope of this book. It is supported via the ECMAScript Internationalization API (Intl).
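As a brief taste of that API, the following sketch compares default sorting with sorting via Intl.Collator. It assumes an engine with internationalization support and uses the locale 'en' as an example; other locales may order characters differently.

```javascript
// Default sorting compares the numeric values of JavaScript characters:
const byCodeUnit = ['ä', 'b', 'a'].sort();
console.log(byCodeUnit); // [ 'a', 'b', 'ä' ]

// A collator compares according to the rules of a locale:
const collator = new Intl.Collator('en');
const byLocale = ['ä', 'b', 'a'].sort(collator.compare);
console.log(byLocale); // [ 'a', 'ä', 'b' ]

// Negative result: the first operand is sorted before the second one
console.log(collator.compare('a', 'B') < 0); // true
```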

18.6 Atoms of text: Unicode characters, JavaScript characters, grapheme clusters

Quick recap of §17 “Unicode – a brief introduction”:

Unicode characters are represented by code points: numbers with a range of 21 bits.
In JavaScript strings, Unicode is implemented via code units based on the encoding format UTF-16. Each code unit is a 16-bit number. One or two code units are needed to encode a single code point.
- Therefore, each JavaScript character is represented by a code unit. In the JavaScript standard library, code units are also called char codes, which is what they are: numbers for JavaScript characters.
Grapheme clusters (user-perceived characters) are written symbols, as displayed on screen or paper. One or more Unicode characters are needed to encode a single grapheme cluster.

The following code demonstrates that a single Unicode character comprises one or two JavaScript characters. We count the latter via .length:

// 3 Unicode characters, 3 JavaScript characters:
assert.equal('abc'.length, 3);

// 1 Unicode character, 2 JavaScript characters:
assert.equal('🙂'.length, 2);

The following table summarizes the concepts we have just explored:

Entity	Numeric representation	Size	Encoded via
Grapheme cluster			1+ code points
Unicode character	Code point	21 bits	1–2 code units
JavaScript character	UTF-16 code unit	16 bits	–

18.6.1 Working with code points

Let’s explore JavaScript’s tools for working with code points.

A code point escape lets you specify a code point hexadecimally. It produces one or two JavaScript characters.

> '\u{1F642}'
'🙂'

String.fromCodePoint() converts a single code point to 1–2 JavaScript characters:

> String.fromCodePoint(0x1F642)
'🙂'

.codePointAt() converts 1–2 JavaScript characters to a single code point:

> '🙂'.codePointAt(0).toString(16)
'1f642'

You can iterate over a string, which visits Unicode characters (not JavaScript characters). Iteration is described later in this book. One way of iterating is via a for-of loop:

const str = '🙂a';
assert.equal(str.length, 3);

for (const codePointChar of str) {
  console.log(codePointChar);
}

// Output:
// '🙂'
// 'a'

Spreading (...) into Array literals is also based on iteration and visits Unicode characters:

> [...'🙂a']
[ '🙂', 'a' ]

That makes it a good tool for counting Unicode characters:

> [...'🙂a'].length
2
> '🙂a'.length
3

18.6.2 Working with code units (char codes)

Indices and lengths of strings are based on JavaScript characters (as represented by UTF-16 code units).

To specify a code unit hexadecimally, you can use a code unit escape:

> '\uD83D\uDE42'
'🙂'

And you can use String.fromCharCode(). Char code is the standard library’s name for code unit:

> String.fromCharCode(0xD83D) + String.fromCharCode(0xDE42)
'🙂'

To get the char code of a character, use .charCodeAt():

> '🙂'.charCodeAt(0).toString(16)
'd83d'
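To see where the two code units of '🙂' come from, the following sketch performs the UTF-16 arithmetic for code points above 0xFFFF by hand. The helper `toSurrogatePair` is hypothetical; in practice, String.fromCodePoint() does this for you.

```javascript
// Hypothetical helper: encodes a code point above 0xFFFF
// as two UTF-16 code units (a so-called surrogate pair).
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // 20 bits remain
  const lead = 0xD800 + (offset >> 10); // upper 10 bits
  const trail = 0xDC00 + (offset & 0x3FF); // lower 10 bits
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1F642);
console.log(lead.toString(16)); // 'd83d'
console.log(trail.toString(16)); // 'de42'
console.log(String.fromCharCode(lead, trail) === '🙂'); // true
```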

18.6.3 Caveat: grapheme clusters

When working with text that may be written in any human language, it’s best to split at the boundaries of grapheme clusters, not at the boundaries of Unicode characters.

Intl.Segmenter, which is part of the ECMAScript Internationalization API, supports Unicode segmentation (along grapheme cluster boundaries, word boundaries, sentence boundaries, etc.).

In engines that don’t support it, you can use one of several libraries that are available (do a web search for “JavaScript grapheme”).
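Where Intl.Segmenter is available (an assumption about your engine; recent versions of Node.js and of the major browsers ship it), counting grapheme clusters could look like the following sketch. The locale 'en' is only an example.

```javascript
// 'e' followed by a combining acute accent, then '!':
// two grapheme clusters, three code points, three code units
const text = 'e\u0301!';

const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
const graphemes = Array.from(segmenter.segment(text), s => s.segment);

console.log(graphemes.length); // 2
console.log([...text].length); // 3
console.log(text.length); // 3
```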

18.7 Quick reference: Strings

Strings are immutable; none of the string methods ever modify their strings.

18.7.1 Converting to string

Tbl. 13 describes how various values are converted to strings.

Table 13: Converting values to strings.
`x`	`String(x)`
`undefined`	`'undefined'`
`null`	`'null'`
Boolean value	`false` → `'false'`, `true` → `'true'`
Number value	Example: `123` `→` `'123'`
String value	`x` (input, unchanged)
An object	Configurable via, e.g., `toString()`

18.7.2 Numeric values of characters

Char code: represents a JavaScript character numerically. JavaScript’s name for Unicode code unit.
- Size: 16 bits, unsigned
- Convert number to character: String.fromCharCode() ^[ES1]
- Convert character to number: string method .charCodeAt() ^[ES1]
Code point: represents a Unicode character numerically.
- Size: 21 bits, unsigned (17 planes, 16 bits each)
- Convert number to character: String.fromCodePoint() ^[ES6]
- Convert character to number: string method .codePointAt() ^[ES6]

18.7.3 String operators

// Access characters via []
const str = 'abc';
assert.equal(str[1], 'b');

// Concatenate strings via +
assert.equal('a' + 'b' + 'c', 'abc');
assert.equal('take ' + 3 + ' oranges', 'take 3 oranges');

18.7.4 `String.prototype`: finding and matching

(String.prototype is where the methods of strings are stored.)

.endsWith(searchString: string, endPos=this.length): boolean ^[ES6]

Returns true if the string would end with searchString if its length were endPos. Returns false, otherwise.
```
> 'foo.txt'.endsWith('.txt')
true
> 'abcde'.endsWith('cd', 4)
true
```
.includes(searchString: string, startPos=0): boolean ^[ES6]

Returns true if the string contains the searchString and false, otherwise. The search starts at startPos.
```
> 'abc'.includes('b')
true
> 'abc'.includes('b', 2)
false
```
.indexOf(searchString: string, minIndex=0): number ^[ES1]

Returns the lowest index at which searchString appears within the string, or -1 if it does not appear. Any returned index will be minIndex or higher.
```
> 'abab'.indexOf('a')
0
> 'abab'.indexOf('a', 1)
2
> 'abab'.indexOf('c')
-1
```
.lastIndexOf(searchString: string, maxIndex=Infinity): number ^[ES1]

Returns the highest index at which searchString appears within the string, or -1 if it does not appear. Any returned index will be maxIndex or lower.
```
> 'abab'.lastIndexOf('ab', 2)
2
> 'abab'.lastIndexOf('ab', 1)
0
> 'abab'.lastIndexOf('ab')
2
```
[1 of 2] .match(regExp: string | RegExp): RegExpMatchArray | null ^[ES3]

If regExp is a regular expression with flag /g not set, then .match() returns the first match for regExp within the string, or null if there is no match. If regExp is a string, it is used to create a regular expression (think parameter of new RegExp()) before performing the previously mentioned steps.

The result has the following type:
```
interface RegExpMatchArray extends Array<string> {
  index: number;
  input: string;
  groups: undefined | {
    [key: string]: string
  };
}
```
Numbered capture groups become Array indices (which is why this type extends Array). Named capture groups (ES2018) become properties of .groups. In this mode, .match() works like RegExp.prototype.exec().

Examples:
```
> 'ababb'.match(/a(b+)/)
{ 0: 'ab', 1: 'b', index: 0, input: 'ababb', groups: undefined }
> 'ababb'.match(/a(?<foo>b+)/)
{ 0: 'ab', 1: 'b', index: 0, input: 'ababb', groups: { foo: 'b' } }
> 'abab'.match(/x/)
null
```
[2 of 2] .match(regExp: RegExp): string[] | null ^[ES3]

If flag /g of regExp is set, .match() returns either an Array with all matches or null if there was no match.
```
> 'ababb'.match(/a(b+)/g)
[ 'ab', 'abb' ]
> 'ababb'.match(/a(?<foo>b+)/g)
[ 'ab', 'abb' ]
> 'abab'.match(/x/g)
null
```
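With /g, .match() returns only the complete matches and drops the captures. If you need captures for every match, one option is .matchAll() (ES2020), which returns an iterable of match objects. A brief sketch:

```javascript
const matches = [...'ababb'.matchAll(/a(b+)/g)];

// Each element has the same shape as the result of .match() without /g
const complete = matches.map(m => m[0]);
const captures = matches.map(m => m[1]);
const indices = matches.map(m => m.index);

console.log(complete); // [ 'ab', 'abb' ]
console.log(captures); // [ 'b', 'bb' ]
console.log(indices); // [ 0, 2 ]
```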
.search(regExp: string | RegExp): number ^[ES3]

Returns the index of the first match of regExp within the string, or -1 if there is no match. If regExp is a string, it is used to create a regular expression (think parameter of new RegExp()).
```
> 'a2b'.search(/[0-9]/)
1
> 'a2b'.search('[0-9]')
1
```
.startsWith(searchString: string, startPos=0): boolean ^[ES6]

Returns true if searchString occurs in the string at index startPos. Returns false, otherwise.
```
> '.gitignore'.startsWith('.')
true
> 'abcde'.startsWith('bc', 1)
true
```

18.7.5 `String.prototype`: extracting

.slice(start=0, end=this.length): string ^[ES3]

Returns the substring of the string that starts at (including) index start and ends at (excluding) index end. If an index is negative, it is added to .length before it is used (-1 means this.length-1, etc.).
```
> 'abc'.slice(1, 3)
'bc'
> 'abc'.slice(1)
'bc'
> 'abc'.slice(-2)
'bc'
```
.split(separator: string | RegExp, limit?: number): string[] ^[ES3]

Splits the string into an Array of substrings – the strings that occur between the separators. The separator can be a string:
```
> 'a | b | c'.split('|')
[ 'a ', ' b ', ' c' ]
```
It can also be a regular expression:
```
> 'a : b : c'.split(/ *: */)
[ 'a', 'b', 'c' ]
> 'a : b : c'.split(/( *):( *)/)
[ 'a', ' ', ' ', 'b', ' ', ' ', 'c' ]
```
The last invocation demonstrates that captures made by groups in the regular expression become elements of the returned Array.

Warning: .split('') splits a string into JavaScript characters. That doesn’t work well when dealing with astral Unicode characters (which are encoded as two JavaScript characters). For example, most emojis are astral:
```
> '🙂X🙂'.split('')
[ '\uD83D', '\uDE42', 'X', '\uD83D', '\uDE42' ]
```
Instead, it is better to use spreading:
```
> [...'🙂X🙂']
[ '🙂', 'X', '🙂' ]
```
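The optional parameter limit caps the number of elements in the returned Array. As the following sketch shows, anything beyond the limit is discarded and not appended to the last element:

```javascript
const firstTwo = 'a,b,c,d'.split(',', 2);
console.log(firstTwo); // [ 'a', 'b' ]

const none = 'a,b,c,d'.split(',', 0);
console.log(none); // []

// A limit that is larger than the number of parts has no effect
const all = 'a,b,c,d'.split(',', 10);
console.log(all); // [ 'a', 'b', 'c', 'd' ]
```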
.substring(start: number, end=this.length): string ^[ES1]

Use .slice() instead of this method. .substring() wasn’t implemented consistently in older engines and doesn’t support negative indices.
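The following sketch shows two cases where the two methods differ (the variable names are only for illustration):

```javascript
// Negative indices: .slice() counts from the end,
// .substring() treats them as 0
const sliceNegative = 'abc'.slice(-1);
const substringNegative = 'abc'.substring(-1);
console.log(sliceNegative); // 'c'
console.log(substringNegative); // 'abc'

// start greater than end: .slice() returns the empty string,
// .substring() swaps the two indices
const sliceSwapped = 'abc'.slice(2, 0);
const substringSwapped = 'abc'.substring(2, 0);
console.log(sliceSwapped); // ''
console.log(substringSwapped); // 'ab'
```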

18.7.6 `String.prototype`: combining

.concat(...strings: string[]): string ^[ES3]

Returns the concatenation of the string and strings. 'a'.concat('b') is equivalent to 'a'+'b'. The latter is much more popular.
```
> 'ab'.concat('cd', 'ef', 'gh')
'abcdefgh'
```
.padEnd(len: number, fillString=' '): string ^[ES2017]

Appends (fragments of) fillString to the string until it has the desired length len. If it already has or exceeds len, then it is returned without any changes.
```
> '#'.padEnd(2)
'# '
> 'abc'.padEnd(2)
'abc'
> '#'.padEnd(5, 'abc')
'#abca'
```
.padStart(len: number, fillString=' '): string ^[ES2017]

Prepends (fragments of) fillString to the string until it has the desired length len. If it already has or exceeds len, then it is returned without any changes.
```
> '#'.padStart(2)
' #'
> 'abc'.padStart(2)
'abc'
> '#'.padStart(5, 'abc')
'abca#'
```
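A common use of .padStart() is adding leading zeros to numbers. A minimal sketch, with a hypothetical helper `formatTime`:

```javascript
// Hypothetical helper: formats hours and minutes as 'HH:MM'
function formatTime(hours, minutes) {
  return String(hours).padStart(2, '0')
    + ':' + String(minutes).padStart(2, '0');
}

console.log(formatTime(7, 5)); // '07:05'
console.log(formatTime(12, 30)); // '12:30'
```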
.repeat(count=0): string ^[ES6]

Returns the string, concatenated count times.
```
> '*'.repeat()
''
> '*'.repeat(3)
'***'
```

18.7.7 `String.prototype`: transforming

.normalize(form: 'NFC'|'NFD'|'NFKC'|'NFKD' = 'NFC'): string ^[ES6]

Normalizes the string according to the Unicode Normalization Forms.
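Normalization matters when the same text can be encoded in more than one way. The following sketch compares two encodings of 'é'; the variable names are only for illustration.

```javascript
const composed = '\u00E9'; // 'é' as a single code point
const decomposed = 'e\u0301'; // 'e' plus a combining acute accent

console.log(composed === decomposed); // false
console.log(decomposed.normalize('NFC') === composed); // true
console.log(composed.normalize('NFD') === decomposed); // true
console.log(composed.normalize('NFD').length); // 2
```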
[1 of 2] .replace(searchValue: string | RegExp, replaceValue: string): string ^[ES3]

Replaces matches of searchValue with replaceValue. If searchValue is a string, only the first verbatim occurrence is replaced. If searchValue is a regular expression without flag /g, only the first match is replaced. If searchValue is a regular expression with /g, all matches are replaced.
```
> 'x.x.'.replace('.', '#')
'x#x.'
> 'x.x.'.replace(/./, '#')
'#.x.'
> 'x.x.'.replace(/./g, '#')
'####'
```
Special characters in replaceValue are:
- $$: becomes $
- $n: becomes the capture of numbered group n (alas, $0 stands for the string '$0'; it does not refer to the complete match)
- $&: becomes the complete match
- $`: becomes everything before the match
- $': becomes everything after the match
Examples:
```
> 'a 2020-04 b'.replace(/([0-9]{4})-([0-9]{2})/, '|$2|')
'a |04| b'
> 'a 2020-04 b'.replace(/([0-9]{4})-([0-9]{2})/, '|$&|')
'a |2020-04| b'
> 'a 2020-04 b'.replace(/([0-9]{4})-([0-9]{2})/, '|$`|')
'a |a | b'
```
Named capture groups (ES2018) are supported, too:
- $<name> becomes the capture of named group name
Example:
```
assert.equal(
  'a 2020-04 b'.replace(
    /(?<year>[0-9]{4})-(?<month>[0-9]{2})/, '|$<month>|'),
  'a |04| b');
```
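A frequent task is replacing every verbatim occurrence of a string. The following sketch shows two ways of doing that: a regular expression with /g in which the dot is escaped, and .replaceAll() (ES2021), which is assumed to be available in your engine.

```javascript
// A string as searchValue only replaces the first occurrence:
const first = 'x.x.'.replace('.', '#');
console.log(first); // 'x#x.'

// A regular expression with /g and an escaped dot replaces all of them:
const allViaRegExp = 'x.x.'.replace(/\./g, '#');
console.log(allViaRegExp); // 'x#x#'

// .replaceAll() does the same for a string searchValue:
const allViaString = 'x.x.'.replaceAll('.', '#');
console.log(allViaString); // 'x#x#'
```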
[2 of 2] .replace(searchValue: string | RegExp, replacer: (...args: any[]) => string): string ^[ES3]

If the second parameter is a function, occurrences are replaced with the strings it returns. Its parameters args are:
- matched: string. The complete match
- g1: string|undefined. The capture of numbered group 1
- g2: string|undefined. The capture of numbered group 2
- (Etc.)
- offset: number. Where was the match found in the input string?
- input: string. The whole input string
```
const regexp = /([0-9]{4})-([0-9]{2})/;
const replacer = (all, year, month) => '|' + all + '|';
assert.equal(
  'a 2020-04 b'.replace(regexp, replacer),
  'a |2020-04| b');
```
Named capture groups (ES2018) are supported, too. If there are any, an argument is added at the end, with an object whose properties contain the captures:
```
const regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})/;
const replacer = (...args) => {
  const groups = args.pop();
  return '|' + groups.month + '|';
};
assert.equal(
  'a 2020-04 b'.replace(regexp, replacer),
  'a |04| b');
```
.toUpperCase(): string ^[ES1]

Returns a copy of the string in which all lowercase alphabetic characters are converted to uppercase. How well that works for various alphabets depends on the JavaScript engine.
```
> '-a2b-'.toUpperCase()
'-A2B-'
> 'αβγ'.toUpperCase()
'ΑΒΓ'
```
.toLowerCase(): string ^[ES1]

Returns a copy of the string in which all uppercase alphabetic characters are converted to lowercase. How well that works for various alphabets depends on the JavaScript engine.
```
> '-A2B-'.toLowerCase()
'-a2b-'
> 'ΑΒΓ'.toLowerCase()
'αβγ'
```
.trim(): string ^[ES5]

Returns a copy of the string, in which all leading and trailing whitespace (spaces, tabs, line terminators, etc.) is gone.
```
> '\r\n#\t  '.trim()
'#'
> '  abc  '.trim()
'abc'
```
.trimEnd(): string ^[ES2019]

Similar to .trim(), but only the end of the string is trimmed:
```
> '  abc  '.trimEnd()
'  abc'
```
.trimStart(): string ^[ES2019]

Similar to .trim(), but only the beginning of the string is trimmed:
```
> '  abc  '.trimStart()
'abc  '
```

18.7.8 Sources

Exercise: Using string methods

exercises/strings/remove_extension_test.mjs

Quiz

See quiz app.