RegExp Unicode Property Escapes have moved to Stage 4 and will be included in ES2018.
In V8, they have been available without a flag since v6.4, so they are ready for use in all current Google Chrome channels, from stable to Canary.
In Node.js, they will be available without a flag starting with v10 (due out in April). Earlier versions require a flag: --harmony_regexp_property (Node.js v6 – v9) or --harmony (Node.js v8 – v9). To try them without a flag right now, use either the nightly builds or the v8-canary branch.
Keep in mind that Node.js builds compiled without ICU support cannot use this class of regular expressions (for details, see Internationalization Support). This applies, for example, to the popular Android build from the Termux community.
For more information about support in other engines and environments, see the well-known table (after following the link, scroll up a little).
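Since support depends both on the engine version and on ICU, it can be useful to check for it at runtime. Below is a minimal sketch of such a check; the function name supportsUnicodePropertyEscapes is made up for this illustration, and the assumption is that an engine without support throws when compiling the pattern.

```javascript
'use strict';

// Hypothetical helper (not part of any library): returns true when the
// current engine can compile and run a Unicode property escape.
function supportsUnicodePropertyEscapes() {
  try {
    // The RegExp constructor is used on purpose: a regular expression
    // literal would fail at parse time in an engine without support,
    // before this try block could catch anything.
    return new RegExp('\\p{Script=Cyrillic}', 'u').test('\u0451');
  } catch (error) {
    return false;
  }
}

console.log(`Unicode property escapes: ${supportsUnicodePropertyEscapes()}`);

// In Node.js, process.versions.icu is only present in builds with ICU.
console.log(`ICU version: ${process.versions.icu || 'none (built without ICU)'}`);
```

With such a check, a script can fall back to another approach, or exit with a clear message, instead of failing on a SyntaxError.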
I will not repeat the descriptions of this long-awaited feature; I will just refer to several articles by well-known experts:
I would also like to mention a couple of not-so-obvious details.
When I started getting acquainted with this new feature, I missed two conveniences: a way to programmatically get a list of all valid options in what is now the most extensive class of regular expressions, and a way to get the list of properties that apply to a particular Unicode character.
If someone feels the same need, I hope these notes save them some time :)
At the moment, the ECMAScript specification itself is the authoritative and comprehensive source listing all possible properties, in particular the tables in Runtime Semantics: UnicodeMatchProperty (p) and Runtime Semantics: UnicodeMatchPropertyValue (p, v) (careful: the links lead to a heavy page).
If loading the entire specification is inconvenient, you can make do with the specification of the proposal, which contains the same tables. There is an even more lightweight option: these tables exist as four separate files in the root of the ECMAScript specification repository. In fact, they are the only tables kept as separate files and imported into the specification, which alone probably hints at their unprecedented size. The tables can be viewed with relative convenience using the native sub-service.
I extracted this data and sketched out a tiny library. It contains a structured list of all possible names and values and exports it as a flattened array of all possible members of this class of regular expressions.
All subsections are in alphabetical order, except for the General_Category values, where the order of the Unicode document is more convenient and familiar. The list contains no synonyms, and abbreviations are used only for the names of the non-binary properties (gc, sc, scx), which saves a noticeable amount of space in later operations with the library.
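To illustrate why leaving synonyms out loses nothing, here is a small sketch showing that different spellings of the same property are interchangeable. The chosen property (General_Category=Letter) and the test characters are just examples picked for this illustration.

```javascript
'use strict';

// Five spellings of the same property: long and short property names,
// long and short values, and the lone-value form that the specification
// allows for General_Category.
const synonyms = [
  /\p{General_Category=Letter}/u,
  /\p{gc=Letter}/u,
  /\p{gc=L}/u,
  /\p{Letter}/u,
  /\p{L}/u,
];

const letter = '\u0451'; // CYRILLIC SMALL LETTER IO
const digit = '1';

// Every spelling gives the same answer for the same character.
console.log(synonyms.map(re => re.test(letter)));
console.log(synonyms.map(re => re.test(digit)));
```

Keeping a single spelling per property means each character is tested once per property rather than once per synonym.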
With a simple script and this library, you can get a list in JSON format containing the sources of the regular expressions. An example of such a script and its output can be found in the same place, in the comments: 372 variants in total in the current version of the specification.
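The idea of a structured list that is flattened into an array of regular expressions, and then dumped as JSON, can be sketched as follows. Everything here is a stand-in: the properties object is a tiny hand-written sample, and its shape and the flatten function are assumptions made for illustration, not the actual layout of the library.

```javascript
'use strict';

// Hypothetical, heavily shortened stand-in for the structured list.
const properties = {
  nonBinary: {
    gc: ['Letter', 'Cased_Letter', 'Lowercase_Letter'],
    sc: ['Cyrillic', 'Latin'],
    scx: ['Cyrillic', 'Latin'],
  },
  binary: ['Alphabetic', 'Any', 'Lowercase'],
};

// Turns the structured list into a flat array of RegExp objects.
function flatten({ nonBinary, binary }) {
  const sources = [];
  for (const [name, values] of Object.entries(nonBinary)) {
    for (const value of values) {
      sources.push(`\\p{${name}=${value}}`);
    }
  }
  for (const name of binary) {
    sources.push(`\\p{${name}}`);
  }
  // The u flag is required for \p{...} to be a property escape.
  return sources.map(source => new RegExp(source, 'u'));
}

const reUnicodeProperties = flatten(properties);

// A JSON list of the sources, one string per regular expression.
const json = JSON.stringify(reUnicodeProperties.map(re => re.source), null, 2);
console.log(json);
```

Because the result is an ordinary array, it can be filtered and mapped directly.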
The described library lets us use this class of regular expressions for an unusual purpose: not to find characters by their properties, but to obtain the properties of a character we already have. Several uses come to mind right away.
A caveat: for the sake of simplicity, I did not add error handling to the scripts, so you will need to take care of it separately.
A small utility takes a single command-line argument, either a character or its hexadecimal number in the Unicode database (code point), and prints the list of properties that can later be used to search for that character or for a general class of characters.
'use strict';

const reUnicodeProperties = require('./re-unicode-properties.js');

const RADIX = 16;
const PAD_MAX = 4;

const [, , arg] = process.argv;

let character;
let codePoint;

if ([...arg].length === 1) {
  // A single character was passed: derive its code point.
  character = arg;
  codePoint = `U+${character.codePointAt(0).toString(RADIX).padStart(PAD_MAX, '0')}`;
} else {
  // Otherwise treat the argument as a hexadecimal code point.
  character = String.fromCodePoint(Number.parseInt(arg, RADIX));
  codePoint = `U+${arg.padStart(PAD_MAX, '0')}`;
}

// Keep the regular expressions that match, then strip the \p{...} wrapper.
const characterProperties = reUnicodeProperties
  .filter(re => re.test(character))
  .map(re => re.source)
  .join('\n')
  .replace(/\\p\{|\}/g, '');

console.log(
  `${JSON.stringify(character)} (${codePoint})\n${characterProperties}`,
);
Example output:
$ node re-unicode-properties.character-info.js ё
"ё" (U+0451)
gc=Letter
gc=Cased_Letter
gc=Lowercase_Letter
sc=Cyrillic
scx=Cyrillic
Alphabetic
Any
Assigned
Cased
Changes_When_Casemapped
Changes_When_Titlecased
Changes_When_Uppercased
Grapheme_Base
ID_Continue
ID_Start
Lowercase
XID_Continue
XID_Start
This version of the script runs for 2-3 minutes on my machine and eats about a gigabyte of memory, so be careful. For a one-off run that produces the complete database, this is tolerable; if necessary, you can write the output to a file incrementally instead of building the whole database in memory and writing it out in one go.
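As an aside, the incremental output mentioned above could look roughly like the sketch below. It is not the author's script: the short reUnicodeProperties array is a hand-written stand-in for the library, the writeRange function and the output file name are made up for this illustration, and only a small range of code points is processed.

```javascript
'use strict';

const { openSync, writeSync, closeSync } = require('fs');
const { tmpdir } = require('os');
const { join } = require('path');

// Stand-in for the library: a few hand-picked property escapes.
const reUnicodeProperties = [
  /\p{gc=Letter}/u,
  /\p{sc=Latin}/u,
  /\p{ASCII}/u,
  /\p{White_Space}/u,
];

// Writes one line per code point as soon as it is computed,
// so nothing but the current line is kept in memory.
function writeRange(filePath, first, last) {
  const fd = openSync(filePath, 'w');
  try {
    for (let codePoint = first; codePoint <= last; codePoint++) {
      const character = String.fromCodePoint(codePoint);
      const properties = reUnicodeProperties
        .filter(re => re.test(character))
        .map(re => re.source.replace(/\\p\{|\}/g, ''));
      const key = codePoint.toString(16).padStart(6, '0');
      writeSync(fd, `${key} ${JSON.stringify(character)} ${properties.join(' ')}\n`);
    }
  } finally {
    closeSync(fd);
  }
}

// Hypothetical output location in the system temporary directory.
const outputPath = join(tmpdir(), 're-unicode-properties.sample.txt');
writeRange(outputPath, 0x20, 0x7e);
console.log(`Written to ${outputPath}`);
```

One synchronous write per line keeps the sketch simple; for the full range of code points, batching several lines per write would likely be faster.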
The full-database script can be run without parameters; it then writes the database in a simplified text format, one character with its properties per line. If we add the json parameter, we get a readable database in JSON. By the way, hexadecimal numbers in string form do not work as keys: the order of the result stops being determined by the order in which the keys were created, because JavaScript enumerates keys that look like integers first, in ascending numeric order. So we add the U+ prefix to the numeric key: the order is preserved, and it also becomes more convenient to search for the character on the web if you need its full set of properties and a detailed description rather than just the list suitable for a regular expression. File size is not something we try to economize on here.
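The key-ordering issue is easy to reproduce in isolation. The sketch below uses three sample code points chosen for this illustration; the variable names are arbitrary.

```javascript
'use strict';

// Three code points in ascending order: U+00FFFF, U+100000, U+10FFFF.
const codePoints = [0x00ffff, 0x100000, 0x10ffff];
const toHex = codePoint => codePoint.toString(16).padStart(6, '0');

const bareKeys = {};
const prefixedKeys = {};

for (const codePoint of codePoints) {
  bareKeys[toHex(codePoint)] = true;
  prefixedKeys[`U+${toHex(codePoint)}`] = true;
}

// '100000' looks like an integer index, so it is enumerated first,
// ahead of '00ffff', which was created earlier.
console.log(Object.keys(bareKeys));
// → [ '100000', '00ffff', '10ffff' ]

// With the prefix, no key looks like an integer,
// so creation order is kept.
console.log(Object.keys(prefixedKeys));
// → [ 'U+00ffff', 'U+100000', 'U+10ffff' ]
```

JSON.stringify enumerates keys in the same order, so without the prefix the affected code points would move to the top of the JSON file.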
'use strict';

const { writeFileSync } = require('fs');
const reUnicodeProperties = require('./re-unicode-properties.js');

const [, , format] = process.argv;

const LAST_CODE_POINT = 0x10FFFF;
const RADIX = 16;
const PAD_MAX = LAST_CODE_POINT.toString(RADIX).length;

const data = {};

// Test every code point against every regular expression.
let codePoint = 0;
while (codePoint <= LAST_CODE_POINT) {
  const character = String.fromCodePoint(codePoint);

  // The U+ prefix keeps the keys in creation order.
  data[`U+${codePoint.toString(RADIX).padStart(PAD_MAX, '0')}`] = [
    character,
    ...reUnicodeProperties
      .filter(re => re.test(character))
      .map(re => re.source.replace(/\\p\{|\}/g, '')),
  ];

  codePoint++;
}

if (format === 'json') {
  writeFileSync(
    're-unicode-properties.code-points.json',
    `\uFEFF${JSON.stringify(data, null, 2)}\n`,
  );
} else {
  // Text format: code point, quoted character, properties on one line.
  writeFileSync(
    're-unicode-properties.code-points.txt',
    `\uFEFF${
      Object.entries(data)
        .map(([k, v]) => `${k.replace('U+', '')} ${JSON.stringify(v.shift())} ${v.join(' ')}`)
        .join('\n')
    }\n`,
  );
}
Examples of fragments in both formats:
000020 " " gc=Separator gc=Space_Separator sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_White_Space White_Space
000021 "!" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax Sentence_Terminal Terminal_Punctuation
000022 "\"" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax Quotation_Mark
000023 "#" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Emoji Emoji_Component Grapheme_Base Pattern_Syntax
000024 "$" gc=Symbol gc=Currency_Symbol sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax
000025 "%" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax
000026 "&" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax
000027 "'" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Case_Ignorable Grapheme_Base Pattern_Syntax Quotation_Mark
000028 "(" gc=Punctuation gc=Open_Punctuation sc=Common scx=Common ASCII Any Assigned Bidi_Mirrored Grapheme_Base Pattern_Syntax
000029 ")" gc=Punctuation gc=Close_Punctuation sc=Common scx=Common ASCII Any Assigned Bidi_Mirrored Grapheme_Base Pattern_Syntax
00002a "*" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Emoji Emoji_Component Grapheme_Base Pattern_Syntax
00002b "+" gc=Symbol gc=Math_Symbol sc=Common scx=Common ASCII Any Assigned Grapheme_Base Math Pattern_Syntax
00002c "," gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax Terminal_Punctuation
00002d "-" gc=Punctuation gc=Dash_Punctuation sc=Common scx=Common ASCII Any Assigned Dash Grapheme_Base Pattern_Syntax
00002e "." gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Case_Ignorable Grapheme_Base Pattern_Syntax Sentence_Terminal Terminal_Punctuation
00002f "/" gc=Punctuation gc=Other_Punctuation sc=Common scx=Common ASCII Any Assigned Grapheme_Base Pattern_Syntax
{
  "U+000020": [
    " ",
    "gc=Separator",
    "gc=Space_Separator",
    "sc=Common",
    "scx=Common",
    "ASCII",
    "Any",
    "Assigned",
    "Grapheme_Base",
    "Pattern_White_Space",
    "White_Space"
  ],
  "U+000021": [
    "!",
    "gc=Punctuation",
    "gc=Other_Punctuation",
    "sc=Common",
    "scx=Common",
    "ASCII",
    "Any",
    "Assigned",
    "Grapheme_Base",
    "Pattern_Syntax",
    "Sentence_Terminal",
    "Terminal_Punctuation"
  ]
}
If you wish, you can download the full databases as archives: .txt (5 MB in the archive, ~60 MB of text) or .json (5.5 MB in the archive, ~112 MB of text). When viewing them, do not forget to use good fonts.
This is a variant of the previous script that produces not the complete character database, but only the set of characters found in a given file. The first parameter of the script is the path to the file; the second, optional one is the format (text by default; you can also specify json). The output is similar to the previous one, only smaller. Since the file is read as a stream, texts of any reasonable size can be processed. A one-gigabyte file of mine took five minutes to process, and the script used about 60 megabytes of memory the whole time.
'use strict';

const { createReadStream, writeFileSync } = require('fs');
const { basename } = require('path');
const reUnicodeProperties = require('./re-unicode-properties.js');

const [, , filePath, format] = process.argv;

const LAST_CODE_POINT = 0x10FFFF;
const RADIX = 16;
const PAD_MAX = LAST_CODE_POINT.toString(RADIX).length;

const data = {};

(async function main() {
  const fileStream = createReadStream(filePath);
  fileStream.setEncoding('utf8');

  // Collect the unique characters of the file chunk by chunk.
  const characters = new Set();

  for await (const chunk of fileStream) {
    [...chunk].forEach((character) => { characters.add(character); });
  }

  // Sort by code point: the default sort compares UTF-16 code units
  // and would misplace characters outside the Basic Multilingual Plane.
  [...characters]
    .sort((a, b) => a.codePointAt(0) - b.codePointAt(0))
    .forEach((character) => {
      data[`U+${character.codePointAt(0).toString(RADIX).padStart(PAD_MAX, '0')}`] = [
        character,
        ...reUnicodeProperties
          .filter(re => re.test(character))
          .map(re => re.source.replace(/\\p\{|\}/g, '')),
      ];
    });

  if (format === 'json') {
    writeFileSync(
      `re-unicode-properties.file-info.${basename(filePath)}.json`,
      `\uFEFF${JSON.stringify(data, null, 2)}\n`,
    );
  } else {
    writeFileSync(
      `re-unicode-properties.file-info.${basename(filePath)}.txt`,
      `\uFEFF${
        Object.entries(data)
          .map(([k, v]) => `${k.replace('U+', '')} ${JSON.stringify(v.shift())} ${v.join(' ')}`)
          .join('\n')
      }\n`,
    );
  }
})();
That is probably all. Thank you for your time.
Source: https://habr.com/ru/post/350448/