One way to find unescaped characters with new JavaScript tools

1. How it all began

Recently, I needed to write yet another utility that processes a text file in a format resembling simplified BBCode: DSL (Dictionary Specification Language), the source format for ABBYY Lingvo dictionaries. (Not to be confused with the other DSL, domain-specific language; it is a curious case of a hyponym that is also a homonym of its hypernym.)

Suffice it to say that the language uses tags in square brackets and that square brackets can be escaped with a backslash if you want to use them as part of plain text.

One of the utility's tasks was to find these tags while skipping escaped combinations.
Since lookbehind assertions have recently become available in JavaScript regular expressions (so far only for personal use), I wondered whether the search could be implemented with them, especially since this kind of lookbehind allows variable-length expressions.

2. Preliminary remarks

To follow the experiment below, it helps to be familiar with a few new JavaScript features.

1. Template literals: the long-awaited strings with variable interpolation.

2. String.raw(). This function can be compared with single quotes in Perl and the r'' prefix in Python: all of them create strings in which escape sequences are taken literally.

3. Lookbehind assertions (see also how to enable them in Google Chrome and Node.js).
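
As a quick illustration of the three features listed above, here is a minimal sketch. The variable names and sample strings are made up for the example, and it assumes an engine in which lookbehind is already enabled.

```javascript
'use strict';

// 1. Template literal: ${...} interpolates a value into the string.
const name = 'tag';
const interpolated = `[${name}]`;        // the string "[tag]"

// 2. String.raw: backslashes are kept literally, escapes are not processed.
const cooked = `\x5b`;                   // one character: "["
const raw = String.raw`\x5b`;            // four characters: \ x 5 b

// 3. Lookbehind assertion: match "b" only when "a" stands right before it.
//    The "a" is checked but is not included in the match.
const afterA = /(?<=a)b/g;
const found = 'ab cb ab'.match(afterA);  // the "b" after "c" is skipped

console.log(interpolated, cooked, raw.length, found);
```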

3. Implementation

The script below contains a trial (naive) implementation of the search together with a check:

    /******************************************************************************/
    'use strict';
    /******************************************************************************/
    // Short alias for String.raw, similar to the r'' prefix in Python.
    const r = String.raw;

    // Valid predecessors of a tag.
    const startOfString = '^';
    const notEscapeSymbol = r`[^\x5c]`;
    const escapedEscapeSymbols = r`(?:${startOfString}|${notEscapeSymbol})(?:\x5c{2})+`;

    // The tag itself: "[", one or more characters other than "]", then "]".
    const tag = r`\x5b[^\x5d]+\x5d`;

    // A tag is matched only if one of the predecessors stands right before it.
    const tagRE = new RegExp(
      `(?<=${startOfString}|${notEscapeSymbol}|${escapedEscapeSymbols})${tag}`, 'g'
    );

    // Check: plain tags, tags after escaped backslashes, escaped tags.
    console.log(r`[tag]text[/tag]`.match(tagRE));
    console.log(r`\\[tag]text\\\\[/tag]`.match(tagRE));
    console.log(r`\[tag]text\\\[/tag]`.match(tagRE));
    /******************************************************************************/

First, we create an alias for String.raw so that the short form can be used, like the r'' prefix in Python.

Then we create the component parts of the future regular expression.

I proceeded from the assumption that a valid tag can be preceded by one of three things: the beginning of the string, any character except a backslash, or an escaped backslash (that is, a pair of backslashes). At the same time, we must make sure that the escaping backslash is not itself escaped: in other words, only an even number of backslashes may precede the tag, and that run must in turn be preceded by either the beginning of the string or a character other than a backslash.

Thus, the complex regular expression needs four key elements: the tag itself and its three valid predecessors, namely the beginning of the string, any character except a backslash, and an escaped backslash repeated one or more times. The third predecessor can be expressed as one of the first two followed by one or more pairs of backslashes.

To keep the expression easy on the eyes, I replaced all literal backslashes and square brackets with hexadecimal escapes ([ is \x5b, \ is \x5c, ] is \x5d).
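
The three predecessors can also be probed one at a time. The sketch below reuses the pattern strings from the script; the helper precedesBracket is a hypothetical name introduced only for this illustration and is not part of the original utility.

```javascript
'use strict';

const r = String.raw;

// The three predecessors, written exactly as in the script.
const startOfString = '^';
const notEscapeSymbol = r`[^\x5c]`;
const escapedEscapeSymbols = r`(?:${startOfString}|${notEscapeSymbol})(?:\x5c{2})+`;

// Hypothetical helper: is there a "[" in `text` that is immediately
// preceded by the given predecessor pattern?
function precedesBracket(predecessor, text) {
  return new RegExp(`(?<=${predecessor})\\x5b`).test(text);
}

const probes = [
  ['start of string', startOfString, r`[a]`],
  ['not a backslash', notEscapeSymbol, r`x[a]`],
  ['pair of backslashes', escapedEscapeSymbols, r`\\[a]`],
  ['single backslash', escapedEscapeSymbols, r`\[a]`],
  ['three backslashes', escapedEscapeSymbols, r`x\\\[a]`],
];

for (const [label, predecessor, text] of probes) {
  console.log(label, text, precedesBracket(predecessor, text));
}
```

An odd number of backslashes fails the third predecessor because, after the pairs are consumed, the character before them is still a backslash rather than the beginning of the string or another character.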
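
To see that the hexadecimal escapes change only the notation, the tag part can be written both ways and compared. This is a small sketch with an invented sample string, and it deliberately leaves the lookbehind out, so escaped brackets are matched as well.

```javascript
'use strict';

// Tag pattern with hexadecimal escapes, as in the script...
const tagHex = /\x5b[^\x5d]+\x5d/g;
// ...and the same pattern with backslash-escaped brackets.
const tagLiteral = /\[[^\]]+\]/g;

// Sample text: one plain tag and one tag whose "[" follows a backslash.
const sample = String.raw`a[b]c\[d]`;

const hexMatches = sample.match(tagHex);
const literalMatches = sample.match(tagLiteral);

// Both notations describe the same characters, so the results coincide.
console.log(hexMatches, literalMatches);
```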

The expression assembled from these parts is equivalent to the following literal (it can replace the whole first part of the script by being assigned directly to the variable tagRE):

/(?<=^|[^\x5c]|(?:^|[^\x5c])(?:\x5c{2})+)\x5b[^\x5d]+\x5d/g
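
The equivalence can be checked by running both forms on the same inputs. The sketch below rebuilds the composed expression as in the script and compares it with the literal; the names composed, literal, samples and sameResults are introduced here only for the comparison.

```javascript
'use strict';

const r = String.raw;

// Composed from parts, as in the script.
const startOfString = '^';
const notEscapeSymbol = r`[^\x5c]`;
const escapedEscapeSymbols = r`(?:${startOfString}|${notEscapeSymbol})(?:\x5c{2})+`;
const tag = r`\x5b[^\x5d]+\x5d`;
const composed = new RegExp(
  `(?<=${startOfString}|${notEscapeSymbol}|${escapedEscapeSymbols})${tag}`, 'g'
);

// Written out as a single regular expression literal.
const literal = /(?<=^|[^\x5c]|(?:^|[^\x5c])(?:\x5c{2})+)\x5b[^\x5d]+\x5d/g;

// The three test strings from the script.
const samples = [
  r`[tag]text[/tag]`,
  r`\\[tag]text\\\\[/tag]`,
  r`\[tag]text\\\[/tag]`,
];

// Compare the match results of both forms on every sample.
const sameResults = samples.every(
  (s) => JSON.stringify(s.match(composed)) === JSON.stringify(s.match(literal))
);

console.log(composed.source);
console.log(sameResults);
```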

At the end of the script, the resulting expression is tested on a minimal set of valid and escaped tags. The first string contains a tag at the beginning of the string and a tag after a character other than a backslash. The second string contains tags after escaped backslashes (one pair or several), which in turn are preceded by either the beginning of the string or a character other than a backslash. The third string contains escaped tags.

The following result is output to the console:

[ '[tag]', '[/tag]' ]
[ '[tag]', '[/tag]' ]
null

When evaluating this solution, two caveats should be kept in mind:

1. This is an implementation for home use, not for production (until lookbehind assertions are enabled without a flag in Node.js and Google Chrome and are implemented in other browsers).

2. The expression is not intended to validate the contents of the tags themselves, only to distinguish tags from escaped combinations.

I would be grateful for pointers to any risks I have overlooked and for optimization tips.
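
Both points above can be illustrated with a short sketch. The function names supportsLookbehind and findUnescapedBrackets are hypothetical, and the feature check assumes that an engine without lookbehind throws when it compiles the pattern.

```javascript
'use strict';

// Point 1: feature detection. The pattern is compiled from a string inside
// try/catch, because a regex literal with unsupported syntax would make the
// whole script fail to parse.
function supportsLookbehind() {
  try {
    new RegExp('(?<=a)b');
    return true;
  } catch (err) {
    return false;
  }
}

// Point 2: the expression looks only at what precedes the brackets,
// not at whether their contents form a valid tag.
function findUnescapedBrackets(text) {
  if (!supportsLookbehind()) return null; // this sketch has no fallback
  const tagRE = new RegExp(
    String.raw`(?<=^|[^\x5c]|(?:^|[^\x5c])(?:\x5c{2})+)\x5b[^\x5d]+\x5d`, 'g'
  );
  return text.match(tagRE);
}

console.log(supportsLookbehind());
// Neither bracketed fragment is a meaningful tag, yet both are candidates.
console.log(findUnescapedBrackets('[not a real tag!] [ ]'));
```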

Source: https://habr.com/ru/post/282275/
