
ECMAScript 6. Regular expressions with Unicode support

ECMAScript 6 introduces two new flags for regular expressions:



  1. y turns on sticky matching mode.
  2. u enables a set of Unicode-related features.


This article explains the effect of the u flag. It will be most useful if you are already familiar with the Unicode problems in JavaScript.





NB! Publishing this article revealed problems with displaying some of the Unicode characters it uses, so some of the code was replaced with images linking to JSFiddle.



Impact on syntax



Setting the u flag on a regular expression enables ES6 Unicode code point escape sequences ( \u{...} ) in the pattern.






Without the u flag, things like \u{1234} may technically still occur in patterns, but they are not interpreted as Unicode code point escape sequences. /\u{1234}/ is equivalent to /u{1234}/ , which matches 1234 consecutive u characters rather than the character with code point U+1234.
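The difference can be sketched as follows. The variable names are illustrative, U+1F4A9 is just an arbitrary astral code point, and a short quantifier ( {3} ) stands in for {1234} to keep the example readable.

```javascript
// With the u flag, \u{...} is a code point escape sequence.
const withFlag = /^\u{1F4A9}$/u;
const matchesCodePoint = withFlag.test('\uD83D\uDCA9');

// Without the u flag, \u is just the letter u and {3} is a quantifier.
const withoutFlag = /^\u{3}$/;
const matchesThreeUs = withoutFlag.test('uuu');
const matchesU0003 = withoutFlag.test('\u0003');

console.log(matchesCodePoint, matchesThreeUs, matchesU0003); // true true false
```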



JavaScript engines do this for compatibility reasons. With the u flag set, however, things like \a (where a is not an escape character) are no longer equivalent to a . So even though /\a/ is treated as /a/ , /\a/u throws an error, because \a is not a reserved escape sequence. This makes it possible to extend the u flag's functionality in a future version of ECMAScript. For example, /\p{Script=Greek}/u throws an exception in ES6, but could become a regular expression matching all characters of the Greek script according to the Unicode database once the corresponding syntax is added to the specification.
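A small sketch of the stricter parsing. The compiles helper is hypothetical and exists only for this illustration; it uses the RegExp constructor so that the syntax error can be caught at run time.

```javascript
// Hypothetical helper: reports whether a pattern compiles with the given flags.
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (error) {
    return false; // a SyntaxError was thrown while parsing the pattern
  }
}

const legacyAccepted = compiles('\\a', '');   // /\a/ is treated as /a/
const unicodeAccepted = compiles('\\a', 'u'); // /\a/u is rejected

console.log(legacyAccepted, unicodeAccepted); // true false
```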



Effect on the . operator



Without the u flag, . matches any BMP (Basic Multilingual Plane) character except line terminators. With the ES6 u flag set, . also matches astral symbols.
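As a minimal illustration (the astral symbol is written as an escape sequence to avoid display problems, and the variable names are illustrative):

```javascript
// U+1F4A9 is a single code point stored as two UTF-16 code units.
const astral = '\u{1F4A9}';

// Without u, the dot consumes only the high surrogate, so $ cannot match.
const dotLegacy = /^.$/.test(astral);

// With u, the dot consumes the whole code point.
const dotUnicode = /^.$/u.test(astral);

console.log(dotLegacy, dotUnicode); // false true
```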







Impact on quantifiers



The following quantifiers are available in JavaScript regular expressions (along with their non-greedy variants): * , + , ? , and {2} , {2,} , {2,4} . Without the u flag, a quantifier that follows an astral symbol applies only to that symbol's low surrogate.
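A sketch of that behavior without the u flag, with the astral symbol U+1F4A9 written as its surrogate pair ( \uD83D\uDCA9 ); the names are illustrative:

```javascript
const pair = '\uD83D\uDCA9'; // U+1F4A9 as two surrogate halves

// {2} binds to \uDCA9 only: one high surrogate, then two low surrogates.
const legacyQuantifier = /^\uD83D\uDCA9{2}$/;

const legacyTwoSymbols = legacyQuantifier.test(pair + pair);
const legacyRepeatedLow = legacyQuantifier.test('\uD83D\uDCA9\uDCA9');

console.log(legacyTwoSymbols, legacyRepeatedLow); // false true
```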







With the ES6 u flag, quantifiers apply to whole characters, even astral ones.
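The same kind of pattern with the u flag can be sketched like this (illustrative names, symbol written as an escape sequence):

```javascript
// {2} now applies to the whole code point U+1F4A9.
const unicodeQuantifier = /^\u{1F4A9}{2}$/u;

const unicodeTwoSymbols = unicodeQuantifier.test('\u{1F4A9}\u{1F4A9}');
const unicodeRepeatedLow = unicodeQuantifier.test('\uD83D\uDCA9\uDCA9');

console.log(unicodeTwoSymbols, unicodeRepeatedLow); // true false
```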







Effect on character classes



In the absence of the u flag, any given character class can only match BMP characters. Things like [bcd] work as we expect:



    const regex = /^[bcd]$/;
    console.log(
      regex.test('a'), // false
      regex.test('b'), // true
      regex.test('c'), // true
      regex.test('d'), // true
      regex.test('e')  // false
    );


However, when an astral symbol is used in a character class, the JavaScript engine treats it as two separate “characters”: one for each of its surrogate halves.
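A minimal sketch of this, with U+1F4A9 written as its surrogate pair and illustrative variable names:

```javascript
// Without u, this class contains two members: \uD83D and \uDCA9.
const legacyClass = /^[\uD83D\uDCA9]$/;

const legacyWhole = legacyClass.test('\uD83D\uDCA9'); // the whole symbol
const legacyHigh = legacyClass.test('\uD83D');        // high surrogate alone
const legacyLow = legacyClass.test('\uDCA9');         // low surrogate alone

console.log(legacyWhole, legacyHigh, legacyLow); // false true true
```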







The ES6 u flag lets you use whole astral symbols in character classes.
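For comparison, a sketch of the same character class with the u flag set (illustrative names):

```javascript
// With u, the class contains exactly one member: U+1F4A9.
const unicodeClass = /^[\u{1F4A9}]$/u;

const unicodeWhole = unicodeClass.test('\u{1F4A9}'); // the whole symbol
const unicodeHigh = unicodeClass.test('\uD83D');     // lone high surrogate

console.log(unicodeWhole, unicodeHigh); // true false
```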







Consequently, whole astral symbols can also be used in character class ranges, and everything works as expected as long as the u flag is set.
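A sketch of an astral range, assuming the arbitrary range U+1F4A9 to U+1F4AB. The second half uses the RegExp constructor to show that the same range written as surrogate pairs without the u flag is rejected, because the engine reads \uDCA9-\uD83D as a range that is out of order.

```javascript
const astralRange = /^[\u{1F4A9}-\u{1F4AB}]$/u;

const inRange = ['\u{1F4A9}', '\u{1F4AA}', '\u{1F4AB}']
  .every((symbol) => astralRange.test(symbol));
const outOfRange = astralRange.test('\u{1F4AC}');

// Without u: [\uD83D, \uDCA9-\uD83D, \uDCAB] contains a reversed range.
let legacyRangeError = null;
try {
  new RegExp('[\\uD83D\\uDCA9-\\uD83D\\uDCAB]');
} catch (error) {
  legacyRangeError = error;
}

console.log(inRange, outOfRange, legacyRangeError instanceof SyntaxError);
// true false true
```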







The u flag also affects negated character classes. For example, /[^a]/ is equivalent to /[\0-\x60\x62-\uFFFF]/ , which matches any BMP character except a . With the u flag, /[^a]/u matches the much larger set of all Unicode characters except a .
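A brief sketch with an astral symbol as the test input (illustrative names):

```javascript
const symbol = '\u{1F4A9}';

// Without u, [^a] matches one code unit, so it cannot cover the whole symbol.
const negatedLegacy = /^[^a]$/.test(symbol);

// With u, [^a] matches any code point other than a.
const negatedUnicode = /^[^a]$/u.test(symbol);

// The excluded character itself is rejected either way.
const negatedExcluded = /^[^a]$/u.test('a');

console.log(negatedLegacy, negatedUnicode, negatedExcluded); // false true false
```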







Effect on escape sequences



The u flag affects the meaning of the \D , \S , and \W escape sequences. Without the u flag, \D , \S , and \W match any BMP character that is not matched by \d , \s , and \w , respectively.







With the u flag set, \D , \S , and \W also match astral symbols.
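Both cases can be compared in one sketch. The results array and its property names are illustrative; the patterns are built with the RegExp constructor so that the same loop covers all three escape sequences.

```javascript
const astralSymbol = '\u{1F4A9}';

// Test each escape sequence against a whole astral symbol, without and with u.
const results = ['\\D', '\\S', '\\W'].map((sequence) => ({
  sequence,
  legacy: new RegExp('^' + sequence + '$').test(astralSymbol),
  unicode: new RegExp('^' + sequence + '$', 'u').test(astralSymbol),
}));

// Each entry has legacy === false and unicode === true.
console.log(results);
```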







The u flag does not affect their inverse counterparts \d , \s , and \w . It was proposed to make \d and \w (and \b ) more Unicode-aware, but the proposal was rejected.



Impact on the i flag



When both the i and u flags are set, all characters are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared.



    const es5regex = /[a-z]/i;
    const es6regex = /[a-z]/iu;
    console.log(
      es5regex.test('s'), es6regex.test('s'), // true true
      es5regex.test('S'), es6regex.test('S'), // true true
      // Note: U+017F canonicalizes to `S`.
      es5regex.test('\u017F'), es6regex.test('\u017F'), // false true
      // Note: U+212A canonicalizes to `K`.
      es5regex.test('\u212A'), es6regex.test('\u212A') // false true
    );


Case folding is applied to the characters in the regular expression pattern as well as to the characters in the string being matched.



    console.log(
      /\u212A/iu.test('K'), // true
      /\u212A/iu.test('k'), // true
      /\u017F/iu.test('S'), // true
      /\u017F/iu.test('s')  // true
    );


This case-folding logic also applies to the \w and \W escape sequences, which in turn affects the \b and \B escape sequences. /\w/iu matches [0-9A-Z_a-z] but also U+017F , because U+017F in the string being matched canonicalizes to S . The same goes for U+212A and K . Thus, /\W/iu is equivalent to /[^0-9a-zA-Z_\u{017F}\u{212A}]/u .



    console.log(
      /\w/iu.test('\u017F'), // true
      /\w/iu.test('\u212A'), // true
      /\W/iu.test('\u017F'), // false
      /\W/iu.test('\u212A'), // false
      /\W/iu.test('s'), // false
      /\W/iu.test('S'), // false
      /\W/iu.test('K'), // false
      /\W/iu.test('k'), // false
      /\b/iu.test('\u017F'), // true
      /\b/iu.test('\u212A'), // true
      /\b/iu.test('s'), // true
      /\b/iu.test('S'), // true
      /\B/iu.test('\u017F'), // false
      /\B/iu.test('\u212A'), // false
      /\B/iu.test('s'), // false
      /\B/iu.test('S'), // false
      /\B/iu.test('K'), // false
      /\B/iu.test('k')  // false
    );


Impact on HTML documents



Believe it or not, the u flag also affects HTML documents.



The pattern attribute of input elements lets you specify a regular expression to validate user input against. The browser then gives you styling and scripting hooks for adjusting behavior based on the validity of the input.







The u flag is always enabled for regular expressions compiled through the HTML pattern attribute. Here is a demo .
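What this means in practice can be approximated in plain JavaScript. The matchesPatternAttribute function below is a hypothetical helper, not a browser API. It assumes that the browser compiles the attribute value with the u flag, as described above, and that the whole input value has to match, which is modeled here by wrapping the pattern in ^(?: ... )$ .

```javascript
// Hypothetical helper that approximates validation via the pattern attribute.
// Assumption: the attribute value is compiled with the u flag and must match
// the entire input value.
function matchesPatternAttribute(pattern, value) {
  const regex = new RegExp('^(?:' + pattern + ')$', 'u');
  return regex.test(value);
}

const acceptsAstral = matchesPatternAttribute('.', '\u{1F4A9}');
const acceptsLetters = matchesPatternAttribute('[a-z]+', 'abc');
const rejectsPartial = matchesPatternAttribute('[a-z]+', 'abc1');

console.log(acceptsAstral, acceptsLetters, rejectsPartial); // true true false
```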



Support



Currently, the ES6 u flag for regular expressions is available in stable versions of all major browsers. Browsers are gradually starting to apply it to the HTML pattern attribute as well.

Browser(s)     | JavaScript engine | u flag                                  | u flag for pattern attribute
Edge           | Chakra            | issue #1102227 + issue #517 + issue #1181 | issue #7113940
Firefox        | SpiderMonkey      | bug #1135377 + bug #1281739             | bug #1227906
Chrome / Opera | V8                | V8 issue #2952 + issue #5080            | issue #535441
WebKit         | JavaScriptCore    | bug #154842 + bug #151597 + bug #158505 | bug #151598


Recommendations for developers





Transpiling ES6 Unicode regular expressions to ES5



I created regexpu , a transpiler that rewrites ES6 Unicode regular expressions into equivalent ES5 code that works today. This lets you play with the new, evolving functionality.
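To give an idea of what such a rewrite involves, here is a hand-written and simplified ES5-compatible counterpart of /^.$/u . It is not regexpu's actual output and it ignores lone surrogates; it only shows the general technique of spelling out surrogate pairs explicitly.

```javascript
// ES6 original: the dot matches one whole code point.
const es6Dot = /^.$/u;

// Simplified ES5-style rewrite (an illustration, not regexpu's output):
// either a full surrogate pair, or any single non-line-terminator code unit.
const es5Dot = /^(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\n\r\u2028\u2029])$/;

// Compare both versions on a few sample inputs.
const samples = ['a', '\uD83D\uDCA9', '\n', 'ab'];
const agreement = samples.map((sample) => es6Dot.test(sample) === es5Dot.test(sample));

console.log(agreement); // [true, true, true, true]
```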







Full-scale ES6 / ES7 transpilers, such as Traceur and Babel, depend on regexpu for u transpilation. Let me know if you can break it.

Source: https://habr.com/ru/post/338366/


