Regular Expressions: Unicode Regex

How many bytes does a Unicode character take up in a JavaScript string?

View Answer:
Interview Response: JavaScript strings are encoded as UTF-16. Most characters are stored in a single 2-byte (16-bit) code unit, but 2 bytes can represent at most 65,536 distinct values. That range is too small to cover every Unicode character, so rarer characters, such as 𝒳 (mathematical script X), 😄 (a smiling emoji), and some hieroglyphs, are encoded with 4 bytes as a pair of 2-byte code units called a surrogate pair. So, the simple answer is 2 bytes for most common characters and 4 bytes for characters stored as surrogate pairs. When JavaScript was created, Unicode was simpler and there were no 4-byte characters, so some language features still mishandle them. By default, regular expressions also treat a 4-byte character as a pair of 2-byte ones, and, as with strings, that may lead to odd results. The u flag makes a regular expression treat such a character as a single unit.

Code Example:

// Both strings report a length of 2 instead of 1,
// because each character is stored as a surrogate pair
alert('😄'.length); // 2
alert('𝒳'.length); // 2
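
As a minimal sketch of how to work around this, the snippet below counts code points instead of UTF-16 code units and shows the effect of the u flag on the dot. The helper name countCodePoints is hypothetical and exists only for illustration; console.log stands in for alert so the snippet also runs outside a browser.

```javascript
// Hypothetical helper: counts Unicode code points rather than UTF-16 code units.
// String iteration (used by the spread operator) walks the string by code point.
function countCodePoints(str) {
  return [...str].length;
}

console.log('😄'.length);           // 2 (two UTF-16 code units)
console.log(countCodePoints('😄')); // 1 (one code point)

// Without the u flag, the dot matches a single code unit, so it cannot
// match the whole surrogate pair; with the u flag it matches the code point.
const withoutU = /^.$/.test('😄');
const withU = /^.$/u.test('😄');
console.log(withoutU, withU);       // false true
```

Counting by code point is enough for these examples, though it is not a general answer to "how many characters does the user see".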

How are Unicode properties expressed in regular expressions?

View Answer:
Interview Response: In simple terms, Unicode properties are written as \p{…}. To use \p{…}, a regular expression must have the u flag. For instance, \p{Letter} denotes a letter in any language. We can also write \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.

Code Example:

let str = 'A ბ ㄱ';

alert(str.match(/\p{L}/gu)); // A,ბ,ㄱ
alert(str.match(/\p{L}/g)); // null (no matches, \p does not work without the flag "u")
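
To illustrate the \p{L} property a little further, here is a small sketch that pulls whole words out of mixed-language text. The helper name extractWords is hypothetical, and console.log is used in place of alert so the snippet also runs outside a browser.

```javascript
// Hypothetical helper: returns every run of letters in the text, in any language.
function extractWords(text) {
  // \p{L}+ matches one or more letters; the u flag is required for \p{…}.
  return text.match(/\p{L}+/gu) || [];
}

const words = extractWords('Hello Привет 你好 123');
console.log(words); // [ 'Hello', 'Привет', '你好' ]

// The long form \p{Letter} matches the same characters as the alias \p{L}.
const longForm = 'A ბ ㄱ'.match(/\p{Letter}/gu);
const shortForm = 'A ბ ㄱ'.match(/\p{L}/gu);
console.log(longForm.join() === shortForm.join()); // true
```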

Is there a way to denote Hexadecimal numbers in a regular expression?

View Answer:
Interview Response: Yes, a hex digit is denoted by the \p{Hex_Digit} Unicode property.

Code Example:

let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;

alert('number: xAF'.match(regexp)); // xAF
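
As a hedged sketch of a practical use, the function below validates a six-digit hex color. The name isHexColor is hypothetical. One assumption worth stating: \p{Hex_Digit} also covers the fullwidth forms of the hex digits, so the stricter \p{ASCII_Hex_Digit} property is used here to accept ASCII characters only.

```javascript
// Hypothetical helper: checks for a six-digit hex color such as "#1a2B3c".
function isHexColor(value) {
  // Assumption: only ASCII digits should be accepted, hence ASCII_Hex_Digit.
  return /^#\p{ASCII_Hex_Digit}{6}$/u.test(value);
}

console.log(isHexColor('#1a2B3c')); // true
console.log(isHexColor('#12345g')); // false ("g" is not a hex digit)

// "\uFF21" is FULLWIDTH LATIN CAPITAL LETTER A: a Hex_Digit,
// but not an ASCII_Hex_Digit.
const fullwidthA = '\uFF21';
console.log(/\p{Hex_Digit}/u.test(fullwidthA));       // true
console.log(/\p{ASCII_Hex_Digit}/u.test(fullwidthA)); // false
```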

What approach should we use to match characters from a specific writing system (script), such as Chinese, in regular expressions?

View Answer:
Interview Response: To match characters from a particular writing system, such as Cyrillic, Greek, Arabic, or Han (Chinese), we should use the Unicode Script property, which we write as \p{Script=‹value›} or, with its short alias, \p{sc=‹value›}.

Code Example:

let regexp = /\p{sc=Han}/gu; // matches Han (Chinese) characters

let str = `Hello Привет 你好 123_456`;

alert(str.match(regexp)); // 你,好
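
The same idea extends to other scripts. The sketch below builds the pattern from a script name; matchScript is a hypothetical helper, and it assumes the caller passes a valid Unicode script name, since an unknown name makes the RegExp constructor throw a SyntaxError.

```javascript
// Hypothetical helper: collects runs of characters that belong to one script.
function matchScript(text, script) {
  // Assumption: `script` is a valid Unicode script name such as "Han".
  const regexp = new RegExp(`\\p{Script=${script}}+`, 'gu');
  return text.match(regexp) || [];
}

const sample = 'Hello Привет 你好 123_456';
console.log(matchScript(sample, 'Cyrillic')); // [ 'Привет' ]
console.log(matchScript(sample, 'Han'));      // [ '你好' ]
console.log(matchScript(sample, 'Greek'));    // []
```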

What Unicode property should we use to match currency symbols in regular expressions?

View Answer:
Interview Response: Characters that denote a currency, such as $, €, and ¥, have the Unicode property \p{Currency_Symbol}, with the short alias \p{Sc}, and that is the property we should use.

Code Example:

let regexp = /\p{Sc}\d/gu;

let str = `Prices: $2, €1, ¥9`;

alert(str.match(regexp)); // $2,€1,¥9
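
Building on this, a minimal sketch of splitting each price into its symbol and amount could look like the following. The helper name extractPrices and the assumed price format (a symbol immediately followed by digits with an optional decimal part) are illustrative choices, not something the answer above prescribes.

```javascript
// Hypothetical helper: finds prices written as a currency symbol plus a number.
function extractPrices(text) {
  // Group 1: any currency symbol; group 2: digits with an optional fraction.
  const regexp = /(\p{Sc})(\d+(?:\.\d+)?)/gu;
  return [...text.matchAll(regexp)].map(([, symbol, amount]) => ({
    symbol,
    amount: Number(amount),
  }));
}

const prices = extractPrices('Prices: $2.50, €1, ¥9');
console.log(prices);
// [ { symbol: '$', amount: 2.5 }, { symbol: '€', amount: 1 }, { symbol: '¥', amount: 9 } ]
```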