Regular Expressions: Unicode Regex



What is Unicode in JavaScript?

View Answer:
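Interview Response: Unicode is a standard that assigns a unique code point to every character and symbol across the world's writing systems. JavaScript strings are sequences of UTF-16 code units, so most characters occupy one code unit and characters outside the Basic Multilingual Plane occupy two.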

How does JavaScript handle Unicode in Regex?

View Answer:
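Interview Response: JavaScript uses the "u" flag in Regex to enable full Unicode matching. With the flag set, the pattern and the input are processed as Unicode code points rather than UTF-16 code units, and the \u{...} code point escape and \p{...} property escape become available.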

Code Example:

let str = "hello js!";
let match = str.match(/\p{L}/gu);
// match will be ["h", "e", "l", "l", "o", "j", "s"]
console.log(match);

What are the implications of the "u" flag in JavaScript Regex?

View Answer:
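Interview Response: The "u" flag switches the regex from matching UTF-16 code units to matching Unicode code points. As a result, quantifiers, ranges in character classes, and the dot operator treat a character outside the Basic Multilingual Plane as one unit instead of two. The flag also enables the \u{...} and \p{...} escapes and turns unrecognized escape sequences such as \a into syntax errors instead of silently dropping the backslash.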

Code Example:

let str = "😊";

console.log(str.length); // Outputs: 2 (length counts UTF-16 code units)
console.log([...str].length); // Outputs: 1 (the string iterator yields code points)

let regexWithoutU = /^.$/;
console.log(regexWithoutU.test(str)); // Outputs: false (without "u" flag)

let regexWithU = /^.$/u;
console.log(regexWithU.test(str)); // Outputs: true (with "u" flag)

In this example, the string contains a single emoji, which is represented as a Unicode surrogate pair. Without the "u" flag, JavaScript treats the surrogate pair as two separate characters, hence the regex /^.$/ (which matches a string of exactly one character) fails to match the string. However, with the "u" flag, the surrogate pair is treated as a single character, so the regex /^.$/u matches the string.
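The same effect shows up with quantifiers and character-class ranges. The sketch below is illustrative only; the helper name compiles is hypothetical, and new RegExp is used for the range so that the syntax error can be caught at run time instead of failing the whole script.

```javascript
// Quantifiers: without "u", {2} applies only to the emoji's second code unit.
const twoSmiles = "😊😊";
const quantifierWithoutU = /^😊{2}$/.test(twoSmiles);
const quantifierWithU = /^😊{2}$/u.test(twoSmiles);
console.log(quantifierWithoutU); // false
console.log(quantifierWithU); // true

// Hypothetical helper: reports whether a pattern source compiles with the given flags.
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

// The string escapes below produce the literal characters U+1F600 and U+1F60F.
const emojiRange = "[\u{1F600}-\u{1F60F}]";
const rangeCompilesWithoutU = compiles(emojiRange, ""); // range is read per code unit and is out of order
const rangeCompilesWithU = compiles(emojiRange, "u");
const rangeMatches = new RegExp(emojiRange, "u").test("\u{1F60A}");
console.log(rangeCompilesWithoutU); // false
console.log(rangeCompilesWithU); // true
console.log(rangeMatches); // true
```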


What does the \p{} notation in JavaScript Regex do?

View Answer:
Interview Response: The `\p{}` notation in JavaScript regular expressions, used with the "u" flag, matches characters based on their Unicode properties, such as `Script`, `General_Category`, `Script_Extensions`, etc.

Code Example:

let str = "hello 123! 你好 नमस्ते";
let match = str.match(/\p{Script=Latin}/gu);
console.log(match);
// Outputs: ["h", "e", "l", "l", "o"]
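As a further sketch, the short General_Category aliases (such as Lu for uppercase letters and Nd for decimal digits) and the negated form \P{...} work the same way. The sample string and variable names below are made up for illustration.

```javascript
// Escapes keep the sample unambiguous: "\u00DC" is Ü, "\u00EF" is ï, "\u2713" is ✓.
const sample = "\u00DCn\u00EFcode 2024 \u2713";

const uppercaseLetters = sample.match(/\p{Lu}/gu); // General_Category=Uppercase_Letter
const decimalDigits = sample.match(/\p{Nd}/gu); // General_Category=Decimal_Number
const lettersOnly = sample.replace(/\P{L}/gu, ""); // \P{...} negates the property

console.log(uppercaseLetters); // [ 'Ü' ]
console.log(decimalDigits); // [ '2', '0', '2', '4' ]
console.log(lettersOnly); // Ünïcode
```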

What is a surrogate pair in JavaScript Unicode handling?

View Answer:
Interview Response: A surrogate pair is a pair of 16-bit values that JavaScript uses to represent a single Unicode character outside the Basic Multilingual Plane.

Code Example:

let str = "\uD83D\uDE00"; // This is the surrogate pair for 😀 (U+1F600)
console.log(str); // Outputs: 😀

let regex = /\uD83D\uDE00/u;
console.log(regex.test(str)); // Outputs: true

In this example, the string str uses a surrogate pair to represent the grinning face emoji 😀. The regular expression /\uD83D\uDE00/u uses the same surrogate pair to match this emoji. The u flag enables full Unicode matching, which treats the surrogate pair as a single character.
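To make the relationship between a code point and its surrogate pair concrete, here is a minimal sketch of the UTF-16 arithmetic. The function name toSurrogatePair is hypothetical and assumes its input is a code point above U+FFFF.

```javascript
// Hypothetical helper: splits an astral code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000; // 20 bits remain after removing the BMP
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// The built-in APIs agree with the manual calculation.
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1f600)); // true
console.log("\uD83D\uDE00".codePointAt(0).toString(16)); // 1f600
```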


What is a Unicode property escape in JavaScript?

View Answer:
Interview Response: A Unicode property escape (\p{}) is a type of escape sequence in a regular expression that matches characters based on their general category, script, or other properties in the Unicode standard.

Can you use the Unicode range in JavaScript regex?

View Answer:
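Interview Response: Yes. A character class can contain a range of code points written with \uXXXX escapes, such as [\u4e00-\u9fff]. With the "u" flag set, you can also use the \u{...} code point escape, which is needed for ranges that extend above U+FFFF.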

Code Example:

let str = "hello, 你好, नमस्ते!";
let match = str.match(/[\u4e00-\u9fff]+/gu);
console.log(match);
// Outputs: [ '你好' ]

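In this example, the regex /[\u4e00-\u9fff]+/gu matches any run of characters in the range U+4E00 to U+9FFF (the CJK Unified Ideographs block), which includes most common Chinese characters. The g flag makes the regex match globally. The u flag enables full Unicode matching; it is not strictly required for a range inside the Basic Multilingual Plane like this one, but it is required for \u{...} escapes and for ranges above U+FFFF.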
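For ranges above U+FFFF, the \u{...} code point escape with the "u" flag is the practical option. The sketch below assumes the Emoticons block (U+1F600 to U+1F64F) as the target range; the name hasEmoticon is hypothetical.

```javascript
// Matches any code point in the Emoticons block; requires the "u" flag.
const emoticonBlock = /[\u{1F600}-\u{1F64F}]/u;

// Hypothetical helper for illustration.
function hasEmoticon(text) {
  return emoticonBlock.test(text);
}

console.log(hasEmoticon("nice \u{1F60A}")); // true
console.log(hasEmoticon("plain text")); // false

// Without "u", \u{61} is not an escape: it means the letter "u" repeated 61 times.
console.log(/^\u{61}$/u.test("a")); // true
console.log(/^\u{61}$/.test("a")); // false
console.log(/^\u{61}$/.test("u".repeat(61))); // true
```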


How does the "u" flag change the behavior of \b in JavaScript?

View Answer:
Interview Response: The "u" flag does not make \b Unicode-aware. With or without it, \b is defined in terms of \w, which only treats ASCII letters, digits, and the underscore as word characters. Accented and non-Latin letters therefore count as non-word characters on either side of a boundary.

Code Example:

let str = "café";
let regexWithoutU = /\bcafé\b/;
console.log(regexWithoutU.test(str)); // Outputs: false (é is not a \w character, so there is no boundary after it)

let regexWithU = /\bcafé\b/u;
console.log(regexWithU.test(str)); // Outputs: false (the "u" flag does not change \b)
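Because \b relies on the ASCII-only \w class, a Unicode-aware word boundary has to be emulated. One possible sketch uses lookbehind and lookahead with property escapes. The function name hasWholeWord is hypothetical, and it assumes the word contains no regex metacharacters.

```javascript
// Hypothetical helper: true if `word` occurs in `text` with no letter, mark,
// digit, or underscore directly before or after it.
// Assumption: `word` contains no regex metacharacters.
function hasWholeWord(text, word) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const pattern = new RegExp(
    `(?<!${wordChar})${word.normalize("NFC")}(?!${wordChar})`,
    "u"
  );
  return pattern.test(text.normalize("NFC"));
}

console.log(/\bcaf\u00E9\b/u.test("caf\u00E9")); // false: \b stays ASCII-only
console.log(hasWholeWord("un caf\u00E9 noir", "caf\u00E9")); // true
console.log(hasWholeWord("caf\u00E9t\u00E9ria", "caf\u00E9")); // false
```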

What is an astral symbol in relation to Regex?

View Answer:
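Interview Response: An astral symbol is a Unicode character whose code point lies above U+FFFF, outside the Basic Multilingual Plane (BMP). In JavaScript strings it is stored as a surrogate pair of two 16-bit code units, so a regex without the "u" flag sees it as two separate characters.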

Code Example:

Here's a simple JavaScript code example illustrating how the 'u' flag in regex allows astral symbols to be matched as single characters:

let str = "𝌆"; // This is an astral symbol (U+1D306), stored as two UTF-16 code units

let regexWithoutU = /./; // Regex without 'u' flag
let matchWithoutU = str.match(regexWithoutU);

console.log(matchWithoutU[0].length); // Outputs: 1, because the dot matched only the first code unit (a lone surrogate)

let regexWithU = /./u; // Regex with 'u' flag
let matchWithU = str.match(regexWithU);

console.log(matchWithU[0].length); // Outputs: 2, because the dot matched the whole symbol (both code units)
console.log(matchWithU[0] === str); // Outputs: true

In this example, you can see how the 'u' flag enables the dot to consume the astral symbol as a single character. Without the flag, the dot matches only half of the surrogate pair. Note that length always counts UTF-16 code units, so the complete symbol still reports a length of 2.


How does JavaScript handle astral symbols in Regex?

View Answer:
Interview Response: Astral symbols are matched as a single unit in JavaScript Regex when the "u" flag is set, rather than being interpreted as two separate code units. Astral symbols are Unicode characters that are outside of the Basic Multilingual Plane (BMP), requiring two 16-bit code units in UTF-16.

Code Example:

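let str = "I love 🍕!";
let regexWithoutU = /🍕/;
console.log(regexWithoutU.test(str)); // Outputs: true (the literal matches as two consecutive code units)

let regexWithU = /🍕/u;
console.log(regexWithU.test(str)); // Outputs: true (with "u" flag)

console.log(/^[🍕]$/.test("🍕")); // Outputs: false (the class holds two surrogates and matches only one code unit)
console.log(/^[🍕]$/u.test("🍕")); // Outputs: true (the class holds one code point)

A literal astral symbol in a pattern matches with or without the "u" flag, because the two code units simply have to appear in sequence. The difference appears wherever the regex needs to treat the symbol as one unit, such as inside a character class, before a quantifier, or when matched by the dot. Without the "u" flag, the pizza emoji is treated as two separate characters in those positions; with the flag, it is correctly treated as a single character.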
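As a small sketch of working with astral symbols under the "u" flag, the helpers below count code points and detect whether a string contains any character above U+FFFF. Both function names are hypothetical.

```javascript
// Hypothetical helper: counts code points by matching any single character in "u" mode.
function countCodePoints(text) {
  const matches = text.match(/[\s\S]/gu);
  return matches ? matches.length : 0;
}

// Hypothetical helper: true if the text contains a character outside the BMP.
function containsAstral(text) {
  return /[\u{10000}-\u{10FFFF}]/u.test(text);
}

const pizzaText = "I love \u{1F355}!"; // "\u{1F355}" is the pizza emoji
console.log(pizzaText.length); // 10 (UTF-16 code units)
console.log(countCodePoints(pizzaText)); // 9 (code points)
console.log(containsAstral(pizzaText)); // true
console.log(containsAstral("I love pizza!")); // false
```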


How can you match any Unicode letter in JavaScript Regex?

View Answer:
Interview Response: You can match any Unicode letter in JavaScript Regex using Unicode property escapes: \p{Letter}, with the "u" flag set.

Code Example:

let str = "hello, 你好, नमस्ते!";
let match = str.match(/\p{L}/gu);
console.log(match);
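// Outputs: ['h', 'e', 'l', 'l', 'o', '你', '好', 'न', 'म', 'स', 'त']
// The virama (्) and the vowel sign (े) are combining marks (\p{M}), not letters, so they are not matched.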
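When extracting whole words rather than single letters, combining marks usually need to be included alongside letters, otherwise scripts such as Devanagari get split. The following is a minimal sketch; the function name extractWords is hypothetical.

```javascript
// Hypothetical helper: returns runs of letters together with their combining marks.
function extractWords(text) {
  return text.match(/[\p{L}\p{M}]+/gu) || [];
}

const greeting = "hello, 你好, नमस्ते!";
console.log(extractWords(greeting)); // [ 'hello', '你好', 'नमस्ते' ]

// Matching letters alone breaks the Devanagari word at each combining mark.
console.log(greeting.match(/\p{L}+/gu)); // [ 'hello', '你好', 'नमस', 'त' ]
```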

What does JavaScript's \p{Script=} do in Regex?

View Answer:
Interview Response: The \p{Script=} in JavaScript Regex is a Unicode property escape that matches any character that is a part of the specified script, such as Latin, Greek, etc.

Code Example:

let str = "こんにちは (Hello in Japanese Hiragana)";
let match = str.match(/\p{Script=Hiragana}/gu);
console.log(match);
// Outputs: [ 'こ', 'ん', 'に', 'ち', 'は' ]

In this example, the regex /\p{Script=Hiragana}/gu matches any character from the Hiragana script. The g flag makes the regex match globally, and the u flag enables full Unicode matching. It matches all the Hiragana letters in the string.


Can JavaScript regex match emoji using Unicode?

View Answer:
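Interview Response: Yes, JavaScript regex can match emoji using the Unicode property escape \p{Emoji} when the "u" flag is set. Note that \p{Emoji} also matches the digits 0-9, #, and *, because they are the base characters of keycap emoji. Optionally, you can match a specific emoji literally, as in /🍕/u.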

Code Example:

let str = "I love 🍕!";
let regex = /\p{Emoji}/u;
console.log(regex.test(str)); // Outputs: true

// Alternatively, match one specific emoji literally (reusing str from above):

let pizzaRegex = /🍕/u;
console.log(pizzaRegex.test(str)); // Outputs: true
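One caveat worth illustrating is that \p{Emoji} also matches plain digits. The sketch below uses \p{Extended_Pictographic} as one possible stricter alternative, assuming an engine that supports that property (it was added to the language after the original set of property escapes). The function name containsPictographic is hypothetical.

```javascript
// Digits carry the Emoji property because they form keycap sequences.
console.log(/\p{Emoji}/u.test("123")); // true

// Hypothetical helper using a property that excludes digits, #, and *.
function containsPictographic(text) {
  return /\p{Extended_Pictographic}/u.test(text);
}

console.log(containsPictographic("123")); // false
console.log(containsPictographic("I love \u{1F355}!")); // true
```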

How can you match all whitespace characters, including Unicode spaces, in JavaScript regex?

View Answer:
Interview Response: To match all whitespace characters, including Unicode spaces, in JavaScript regex using property escapes, you can use the `\p{White_Space}` Unicode property escape with the `u` (Unicode) flag.

Code Example:

let str = "Hello \t\n\u{2003}World!"; // Normal space, tab, newline, and em space characters
let match = str.match(/\p{White_Space}/gu);
console.log(match);
// Outputs: [' ', '\t', '\n', '\u2003'] (the last entry is the em space)

In this example, the regex /\p{White_Space}/gu matches any Unicode whitespace character in the string. The \p{White_Space} is a Unicode property escape that matches any kind of whitespace character as defined by Unicode, including regular spaces, tabs, newlines, and other types of spaces like the em space. The g flag makes the regex match globally, and the u flag enables full Unicode matching. It matches all the different types of spaces in the string.
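A common use is collapsing any run of Unicode whitespace into a single space, sketched below with a hypothetical collapseWhitespace helper. The comparison at the end shows that the built-in \s class is also Unicode-aware but differs slightly from \p{White_Space}: \s includes U+FEFF and excludes U+0085 (next line), while the property does the opposite.

```javascript
// Hypothetical helper: replaces each run of Unicode whitespace with one ASCII space.
function collapseWhitespace(text) {
  return text.replace(/\p{White_Space}+/gu, " ");
}

console.log(collapseWhitespace("a\u2003\u00A0b\t\nc")); // a b c

// \s and \p{White_Space} disagree on these two characters.
console.log(/\s/.test("\u0085"), /\p{White_Space}/u.test("\u0085")); // false true
console.log(/\s/.test("\uFEFF"), /\p{White_Space}/u.test("\uFEFF")); // true false
```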


Can you perform Unicode case-insensitive matching in JavaScript regex?

View Answer:
Interview Response: Yes, by using both the "u" and "i" flags, JavaScript regex can perform Unicode case-insensitive matching.

Code Example:

let str = "Hello hElLo HELLO";
let regex = /hello/giu;
console.log(str.match(regex));
// Outputs: ['Hello', 'hElLo', 'HELLO']

In this example, the regular expression /hello/giu matches the word "hello" in any case. The i flag makes the regex case-insensitive, the g flag makes it match globally, and the u flag enables full Unicode matching. It matches all variations of "hello" in the string, regardless of their case.
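The ASCII example above behaves the same with or without the "u" flag. The difference appears with characters whose case mapping crosses into ASCII, because "u" mode uses Unicode case folding. The sketch below uses the Kelvin sign (U+212A) and the long s (U+017F) as illustrations.

```javascript
// The Kelvin sign folds to "k" only when "u" is combined with "i".
console.log(/\u212A/i.test("k")); // false
console.log(/\u212A/iu.test("k")); // true

// The long s folds to "s" in the same way.
console.log(/\u017F/i.test("s")); // false
console.log(/\u017F/iu.test("s")); // true

// Ordinary non-ASCII letters are case-insensitive either way.
console.log(/\u03C3/i.test("\u03A3")); // true (Greek sigma)
```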


What's the impact of using the dot (.) in a JavaScript regex with the "u" flag?

View Answer:
Interview Response: The dot . matches any single character except line terminators (like newline). When used with the "u" (Unicode) flag, it can also match any Unicode astral symbol, which would otherwise be seen as two characters.

Code Example:

let str = "😄"; // An astral symbol
let regexWithoutU = /^.$/;
console.log(regexWithoutU.test(str)); // Outputs: false (without "u" flag)

let regexWithU = /^.$/u;
console.log(regexWithU.test(str)); // Outputs: true (with "u" flag)

In this example, the emoji is a Unicode astral symbol represented by a surrogate pair in JavaScript. Without the "u" flag, JavaScript treats the surrogate pair as two separate characters, so the regex ^.$ fails to match. However, with the "u" flag, JavaScript treats the surrogate pair as a single character, so the regex ^.$/u matches successfully.


What's the significance of Unicode normalization in JavaScript?

View Answer:
Interview Response: Unicode normalization is significant in JavaScript because it helps ensure that text is in a standard, consistent form, even when there are multiple valid sequences of Unicode code points that could produce the same text.

Code Example:

let str1 = "caf\u00E9"; // Composed form (é is one code point, U+00E9)
let str2 = "cafe\u0301"; // Decomposed form (e followed by the combining acute accent U+0301)

console.log(str1 === str2); // Outputs: false (not normalized)

// Normalize to composed form (NFC)
console.log(str1.normalize("NFC") === str2.normalize("NFC")); // Outputs: true

In this example, str1 and str2 look identical but are represented differently at the Unicode level. Without normalization, JavaScript considers them different strings. However, by normalizing to the same form ("NFC" for composed form), they are recognized as the same string. This is particularly important for string comparisons and when working with international text.
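The same issue affects regex matching, since a pattern written in composed form does not match decomposed input. A minimal sketch, assuming NFC as the chosen form and using a hypothetical helper named testNormalized:

```javascript
// Hypothetical helper: normalizes the input to NFC before testing the pattern.
// Assumption: the pattern itself is written in NFC and has no "g" or "y" flag.
function testNormalized(pattern, text) {
  return pattern.test(text.normalize("NFC"));
}

const composedPattern = /^caf\u00E9$/u; // é as a single code point
const decomposedInput = "cafe\u0301"; // e + combining acute accent

console.log(composedPattern.test(decomposedInput)); // false
console.log(testNormalized(composedPattern, decomposedInput)); // true
```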


How many bytes are Unicode characters?

View Answer:
Interview Response: It depends on the encoding. A Unicode character takes 1 to 4 bytes in UTF-8, 2 or 4 bytes in UTF-16, and always 4 bytes in UTF-32.

Technical Response: JavaScript strings are encoded as UTF-16. Most characters take one 16-bit code unit (2 bytes), which can represent at most 65,536 values. That range is not big enough to encode every character, so rarer ones such as 𝒳 (mathematical script X), 😄 (a smile), and some Chinese ideographs are encoded with two code units (4 bytes), known as a surrogate pair. So, the simple answer is 2 bytes for regular "old" characters and 4 bytes for "new" characters that need a surrogate pair.

When the JavaScript language was created, Unicode encoding was more straightforward; there were no 4-byte characters. So, some language features still mishandle them. By default, regular expressions also treat a 4-byte "long character" as a pair of 2-byte ones. And, as it happens with strings, that may lead to odd results.

Code Example:

// Both strings report a length of 2 even though each holds one character,
// because length counts UTF-16 code units and these are astral symbols
console.log('😄'.length); // 2
console.log('𝒳'.length); // 2
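To compare encodings directly, the sketch below reports the size of a string in UTF-8, UTF-16, and UTF-32. The function name byteLengths is hypothetical, and the code assumes a runtime that provides the global TextEncoder (modern browsers and recent Node.js versions).

```javascript
// Hypothetical helper: byte size of `text` in three Unicode encodings.
function byteLengths(text) {
  return {
    utf8: new TextEncoder().encode(text).length, // TextEncoder always emits UTF-8
    utf16: text.length * 2, // length counts 16-bit code units
    utf32: [...text].length * 4, // the iterator yields code points
  };
}

console.log(byteLengths("A")); // { utf8: 1, utf16: 2, utf32: 4 }
console.log(byteLengths("\u00E9")); // { utf8: 2, utf16: 2, utf32: 4 }
console.log(byteLengths("\u4F60")); // { utf8: 3, utf16: 2, utf32: 4 }
console.log(byteLengths("\u{1F604}")); // { utf8: 4, utf16: 4, utf32: 4 }
```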

How are Unicode properties expressed in regular expressions?

View Answer:
Interview Response: Unicode properties are expressed in regular expressions using the \p{Property} notation, provided the "u" flag is set.

Technical Response: In simple terms, Unicode properties are denoted or expressed as \p{…}. When we need to use \p{…}, a regular expression must have flag u. For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.

Code Example:

let str = 'A ბ ㄱ';

console.log(str.match(/\p{L}/gu)); // output: A,ბ,ㄱ
console.log(str.match(/\p{L}/g)); // output: null
// null (no matches, \p does not work without the flag "u")

Is there a way to find or match a Hexadecimal number using Unicode properties?

View Answer:
Interview Response: Yes, using Unicode properties in a regular expression allows you to match hexadecimal numbers. The Unicode property \p{Hex_Digit} matches any hex digit: 0-9, A-F, and a-f, as well as their fullwidth forms. To restrict matching to ASCII, use \p{ASCII_Hex_Digit} instead.

Code Example:

let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;

console.log('number: xAF'.match(regexp)); // ["xAF"]
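As a practical sketch, a six-digit hex color check might use \p{ASCII_Hex_Digit} so that fullwidth look-alikes are rejected. The function name isHexColor is hypothetical.

```javascript
// Hypothetical helper: validates strings like "#1a2B3c".
function isHexColor(text) {
  return /^#\p{ASCII_Hex_Digit}{6}$/u.test(text);
}

console.log(isHexColor("#1a2B3c")); // true
console.log(isHexColor("#12345G")); // false

// Hex_Digit accepts the fullwidth letter A (U+FF21); ASCII_Hex_Digit does not.
console.log(/\p{Hex_Digit}/u.test("\uFF21")); // true
console.log(/\p{ASCII_Hex_Digit}/u.test("\uFF21")); // false
```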

What approach should we use to handle script-based languages, like Chinese, in regular expressions?

View Answer:
Interview Response: To match text by writing system, such as Cyrillic, Greek, Arabic, or Han (Chinese), we should use the Unicode Script property, which we achieve with the \p{Script=‹value›} syntax or its short form \p{sc=‹value›}.

Code Example:

let regexp = /\p{sc=Han}/gu; // matches Han (Chinese) characters

let str = `Hello Привет 你好 123_456`;

console.log(str.match(regexp)); // 你,好
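Building on the Script property, the sketch below reports which writing systems appear in a string. The list KNOWN_SCRIPTS and the function detectScripts are hypothetical names, and the list is deliberately short; any valid Script value could be added.

```javascript
// Hypothetical, deliberately short list of Script values to probe for.
const KNOWN_SCRIPTS = ["Latin", "Cyrillic", "Greek", "Arabic", "Han"];

// Hypothetical helper: returns the scripts from KNOWN_SCRIPTS found in `text`.
function detectScripts(text) {
  return KNOWN_SCRIPTS.filter((script) =>
    new RegExp(`\\p{Script=${script}}`, "u").test(text)
  );
}

console.log(detectScripts("Hello Привет 你好 123_456"));
// [ 'Latin', 'Cyrillic', 'Han' ]
```

Digits and the underscore belong to the Common script, so they do not contribute to the result.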

What Unicode property should we use in regular expressions?

View Answer:
Interview Response: The Unicode property to use in regex depends on the specific requirement; options include categories like \p{Letter}, \p{Number}, \p{Punctuation}, \p{Emoji}, and more. For example, characters that denote a currency, such as $, €, and ¥, have the Unicode property \p{Currency_Symbol}, with the short alias \p{Sc}.

Code Example:

let regexp = /\p{Sc}\d/gu;

let str = `Prices: $2, €1, ¥9`;

console.log(str.match(regexp)); // $2,€1,¥9
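Extending the currency example, capture groups can separate the symbol from the amount. This is a minimal sketch; the function name extractPrices is hypothetical, and it assumes amounts use a dot as the decimal separator and a runtime that supports String.prototype.matchAll.

```javascript
// Hypothetical helper: returns each currency symbol with its numeric amount.
function extractPrices(text) {
  const pricePattern = /(\p{Sc})(\d+(?:\.\d+)?)/gu;
  return [...text.matchAll(pricePattern)].map(([, symbol, amount]) => ({
    symbol,
    amount: Number(amount),
  }));
}

console.log(extractPrices("Prices: $2, €1.50, ¥9"));
// [ { symbol: '$', amount: 2 }, { symbol: '€', amount: 1.5 }, { symbol: '¥', amount: 9 } ]
```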