
Unicode character properties in V8 regular expressions

JavaScript regular expressions are gradually catching up with PCRE.

The lookbehind feature mentioned recently has moved on to the staging phase (the --es_staging flag).

V8 developers have also begun adding Unicode properties to regular expressions (see the general description and the specification of this character class).

In my opinion, the rollouts of lookbehind and of character properties differ in two ways: the former introduces very little new syntax compared to the latter, but the latter changes the behavior of the engine as a whole less (compare the number of V8 source files touched by the changes behind the two links mentioned). In fact, Unicode properties are just convenient abbreviations, synonyms for particular groups of code points, so you can expect few nasty surprises from them once they are integrated into the system.
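To make the "abbreviation" point concrete, here is a minimal sketch. It assumes an engine where \p{...} is available (behind the flag in the versions discussed here, or natively in engines that shipped the feature later) and uses the u flag, which the later standardized syntax requires. The names byProperty, byHand and digitStrings are made up for illustration, and the hand-written class deliberately lists only three of the many digit ranges.

```javascript
// Decimal digits via the general category: one short name for the whole set.
const byProperty = /^\p{Nd}+$/u;

// A hand-written approximation listing only three of the many digit ranges:
// ASCII 0-9, Arabic-Indic U+0660-U+0669, Devanagari U+0966-U+096F.
const byHand = /^[0-9\u0660-\u0669\u0966-\u096F]+$/u;

const digitStrings = [
  '2016',                        // ASCII digits
  '\u0662\u0660\u0661\u0666',    // Arabic-Indic digits
  '\u0968\u0966\u0967\u096C',    // Devanagari digits
  '\u0E52\u0E50',                // Thai digits, not covered by byHand
];

for (const s of digitStrings) {
  console.log(JSON.stringify(s), byProperty.test(s), byHand.test(s));
}
```

The property-based expression accepts every sample, while the hand-written class misses any range that was not listed explicitly.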

Of course, neither feature is recommended for production use: no browser other than Google Chrome implements them, and Node.js simply follows the corresponding version of V8, in which they are still behind flags.

But for personal needs (text-processing utilities and the like), they seem quite usable to me. Perhaps code from the V8 developers, even experimental code, can sometimes be trusted at no greater risk than the assorted libraries on npmjs or GitHub.

In Google Chrome, even in the current stable v50, the feature can be tested behind a flag:

chrome.exe --js-flags="--harmony_regexp_property"

In Node.js, this feature appears in v6.0 (the first release candidates are already available):

node --harmony_regexp_property test.js

In Google Chrome v50 and Node.js v6.0, the current version of V8 (5.0.71.32) contains only the first part of the implementation: the very first commit, from Feb 10, 2016. Still, this is a huge leap forward, because it lets you work with the general categories of characters (description and specification). The characters that make up each category can be viewed here.

Sample script for testing features.

At the beginning, an object is created whose keys are category names and whose values are three characters from the respective category. If a category is composite (that is, it simply combines several other categories), the value is a function that concatenates the strings of the member categories. Whatever can be displayed more or less legibly I entered as the characters themselves; whatever is invisible or merges with its neighbors (control characters, diacritics, etc.) is entered using escape sequences.

Then the script iterates over the properties of the object, creates a regular expression from each key (the category name) and tests the value (the string of examples) against it. The result is output to the console. If a category is not implemented, an error message is displayed. In the mentioned versions, Google Chrome v50 and Node.js v6.0, only one category is not implemented: the composite category \p{LC}. It is easy to emulate by hand by combining its members in a regular expression, and in later versions of V8 this omission has already been fixed. If the search is unsuccessful, null is output. In the script this happens only with the category \p{Cn}, because by definition no character is assigned to it, so no matching examples can be provided.
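The description above can be sketched roughly as follows. This is not the original script, only a shortened illustration: the names samples, testCategory and manualLC, the choice of sample characters, the trailing + in the pattern and the wording of the error message are all assumptions. The u flag is the one required by the later standardized syntax. The manual \p{LC} fallback uses alternation rather than a character class, on the assumption that the earliest implementation did not yet accept \p{} inside character classes.

```javascript
// Keys are general category names; values are sample strings, or a function
// that joins the samples of the member categories (for composite categories).
const samples = {
  Lu: 'AБΩ',
  Ll: 'aбω',
  Lt: '\u01C5\u01C8\u01CB',   // titlecase digraphs
  Mn: '\u0300\u0301\u0302',   // combining diacritics
  Nd: '059',
  Cc: '\u0000\u0009\u001F',   // control characters
  Cn: '',                     // unassigned: no examples can be given
  LC: () => samples.Lu + samples.Ll + samples.Lt,
};

// Builds a regular expression from the category name and applies it to the
// sample string. Returns the match array, null, or an error description.
function testCategory(name) {
  const value = samples[name];
  const text = typeof value === 'function' ? value() : value;
  try {
    const re = new RegExp(`\\p{${name}}+`, 'u');
    return text.match(re);
  } catch (err) {
    return `not implemented: ${err.message}`;
  }
}

for (const name of Object.keys(samples)) {
  console.log(name, testCategory(name));
}

// Manual stand-in for \p{LC}: alternation of its member categories.
const manualLC = /(?:\p{Lu}|\p{Ll}|\p{Lt})+/u;
console.log('LC (manual)', samples.LC().match(manualLC));
```

On an engine that lacks a given category, the RegExp constructor throws and the catch branch reports it, which matches the behavior described for \p{LC} in the older V8.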

The beginning of the script's output in Node.js 6.0.0-rc.2 (V8 5.0.71.32, the initial stage of the implementation of Unicode character properties):



The beginning of the script's output in Google Chrome Canary 52.0.2710.0 (V8 5.2.26, the current stage of the implementation; note the difference in the handling of \p{LC}):



As the list of implemented features shows, Google Chrome Canary already lets you test a much larger set of features: scripts, loose matching for binary names, \p{} in character classes, binary and enumerated properties. These will reach Node.js soon.
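A few of these features can be sketched as below. One caveat: the syntax shown is the one that was later standardized, so it needs an engine with final support for property escapes, and the experimental builds discussed here may have differed in details. The variable names and sample strings are made up for illustration.

```javascript
// Script property: a run of Greek characters.
const greek = /\p{Script=Greek}+/u;
console.log('abc αβγ'.match(greek));

// \p{...} inside a character class: uppercase letters or decimal digits.
const upperOrDigit = /[\p{Lu}\p{Nd}]+/gu;
console.log('ab CD 12 ef'.match(upperOrDigit));

// Binary property: strings made of alphabetic characters only.
const alphabetic = /^\p{Alphabetic}+$/u;
console.log(alphabetic.test('word'), alphabetic.test('1234'));

// Enumerated property written out as name=value.
const letter = /^\p{General_Category=Letter}+$/u;
console.log(letter.test('слово'), letter.test('word1'));
```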

Happy testing, and stay careful.

P.S. "Unicode property escapes in JavaScript regular expressions" by Mathias Bynens: a brief description of the upcoming specification, with examples and useful links.

Source: https://habr.com/ru/post/281755/

