... namely, how to write LL parsers for not very complex structures by constructing a complex parser from simpler ones. Occasionally there is a need to parse something simple, say some XML-like structure or some kind of data URL, and then usually there is either a sheet of tricky difficult-to-read code or dependence on some more complex and tricky parsing library. Here I am going to combine several well-known ideas (some of them came across on Habré) and show how you can simply and concisely write quite complex parsers while doing very few lines of code. For example, I will write a parser for an XML-like structure. And yes, I will not insert a picture here to attract attention. There are no pictures in the article at all, so it will be difficult to read.
main idea
')
It lies in the fact that each piece of input text is parsed by a separate function (let's call it a “pattern”) and by combining these functions you can get more complex functions that can parse more complex texts. So, a pattern is such an object that has an
exec method that performs parsing. This function is indicated as where from where to parse and it returns parse and the place where the parsing ended:
var digit = { exec: function (str, pos) { var chr = str.charAt(pos); if (chr >= "0" && chr <= "9") return { res: +chr, end: pos + 1}; } };
Now the digit is a pattern of the digit digit and can be used as follows:
assert.deepEqual(digit.exec("5", 0), { res: 5, end: 1 }); assert.deepEqual(digit.exec("Q", 0), void 0);
Why such an interface? Because in JS there is a built-in class RegExp with a very similar interface. For convenience, we introduce the Pattern class (look at it as an analogue of RegExp) whose instances will be these patterns:
function Pattern(exec) { this.exec = exec; }
Then we will introduce some simple patterns that will be useful practically in a more or less complex parser.
Simple patterns
The simplest pattern is txt - it parses a fixed, predetermined string of text:
function txt(text) { return new Pattern(function (str, pos) { if (str.substr(pos, text.length) == text) return { res: text, end: pos + text.length }; }); }
It is applied like this:
assert.deepEqual(txt("abc").exec("abc", 0), { res: "abc", end: 3 }); assert.deepEqual(txt("abc").exec("def", 0), void 0);
If there were default constructors in JS (as in C ++), then the txt (“abc”) record could be reduced to “abc” in the context of constructing a more complex pattern.
Then we introduce its analogue for regular expressions and call it rgx:
function rgx(regexp) { return new Pattern(function (str, pos) { var m = regexp.exec(str.slice(pos)); if (m && m.index === 0) return { res: m[0], end: pos + m[0].length }; }); }
Apply it like this:
assert.deepEqual(rgx(/\d+/).exec("123", 0), { res: "123", end: 3 }); assert.deepEqual(rgx(/\d+/).exec("abc", 0), void 0);
Again, if the default constructors were in JS, then the rgx (/ abc /) entry could be reduced to / abc /.
Combinators Patterns
Now you need to enter a few patterns that combine existing patterns. The simplest of such “combinators” is opt - it makes any pattern optional, i.e. if the source pattern p cannot parse the text, then opt (p) on the same text will say that everything has parsed, only the result of parsing is empty:
function opt(pattern) { return new Pattern(function (str, pos) { return pattern.exec(str, pos) || { res: void 0, end: pos }; }); }
Usage example:
assert.deepEqual(opt(txt("abc")).exec("abc"), { res: "abc", end: 3 }); assert.deepEqual(opt(txt("abc")).exec("123"), { res: void 0, end: 0 });
If operator overload and default constructors were possible in JS, then the opt (p) record could be reduced to p || void 0 (this is clearly seen from how opt is implemented).
The next most complex combinator is exc - it only parses what the first pattern can parse and cannot parse the second one:
function exc(pattern, except) { return new Pattern(function (str, pos) { return !except.exec(str, pos) && pattern.exec(str, pos); }); }
If W (p) is a set of texts that parses the pattern p, then W (exc (p, q)) = W (p) \ W (q). This is useful for example when you need to parse all large letters except the letter H:
var p = exc(rgx(/[AZ]/), txt("H")); assert.deepEqual(p.exec("R", 0), { res: "R", end: 1 }); assert.deepEqual(p.exec("H", 0), void 0);
If JS had operator overloading, then exc (p1, p2) could be reduced to p1 - p2 or! P2 && p1 (for this, however, you need to enter the pattern combinator all / and which will work as an && operator) .
Then there is a pattern-combinator any - it takes several patterns and constructs a new one, which parses what the first of these patterns parses. We can say that W (any (p1, p2, p3, ...)) = W (p1) v W (p2) v W (p3) v ...
function any(...patterns) { return new Pattern(function (str, pos) { for (var r, i = 0; i < patterns.length; i++) if (r = patterns[i].exec(str, pos)) return r; }); }
I used the ... patterns (harmony: rest_parameters) construct to avoid a clumsy code like [] .slice.call (arguments, 0). Example of using any:
var p = any(txt("abc"), txt("def")); assert.deepEqual(p.exec("abc", 0), { res: "abc", end: 3 }); assert.deepEqual(p.exec("def", 0), { res: "def", end: 3 }); assert.deepEqual(p.exec("ABC", 0), void 0);
If in JS there was an operator overload, then any (p1, p2) could be reduced to p1 || p2.
The following pattern combinator is seq - it sequentially parses the text of the sequence of patterns given to it and produces an array of results:
function seq(...patterns) { return new Pattern(function (str, pos) { var i, r, end = pos, res = []; for (i = 0; i < patterns.length; i++) { r = patterns[i].exec(str, end); if (!r) return; res.push(r.res); end = r.end; } return { res: res, end: end }; }); }
It is applied like this:
var p = seq(txt("abc"), txt("def")); assert.deepEqual(p.exec("abcdef"), { res: ["abc", "def"], end: 6 }); assert.deepEqual(p.exec("abcde7"), void 0);
If JS had operator overloading, then seq (p1, p2) could be reduced to p1, p2 (the comma operator is overloaded).
Finally, the pattern combinator rep - it repeatedly applies the known pattern to the text and produces an array of results. As a rule, this is used to parse some of the same type of structures separated by, say, commas, so the rep takes two arguments: the main pattern whose results we are interested in and the dividing pattern whose results are discarded:
function rep(pattern, separator) { var separated = !separator ? pattern : seq(separator, pattern).then(r => r[1]); return new Pattern(function (str, pos) { var res = [], end = pos, r = pattern.exec(str, end); while (r && r.end > end) { res.push(r.res); end = r.end; r = separated.exec(str, end); } return { res: res, end: end }; }); }
You can add a couple more parameters min and max which will control how many repetitions are valid. Here I used the arrow function r => r [1] (harmony: arrow_functions) in order not to write function (z) {return z [1]}. Notice how rep is reduced to seq with Pattern # then (the idea is taken from Promise # then):
function Pattern(exec) { ... this.then = function (transform) { return new Pattern(function (str, pos) { var r = exec(str, pos); return r && { res: transform(r.res), end: r.end }; }); }; }
This method allows you to derive another from one pattern by applying an arbitrary transformation to the results of the first. By the way, experts of Haskel, can we say that this Pattern # then makes a pattern of monad?
Well, rep is used this way:
var p = rep(rgx(/\d+/), txt(",")); assert.deepEqual(p.exec("1,23,456", 0), { res: ["1", "23", "456"], end: 8 }); assert.deepEqual(p.exec("123ABC", 0), { res: ["123"], end: 3 }); assert.deepEqual(p.exec("ABC", 0), void 0);
Any distinct analogy with operator overload for rep does not occur to me.
The result is about 70 lines for all these rep / seq / any. This completes the list of combinator patterns and you can proceed to the actual construction of a pattern that recognizes XML-like text.
Parser XML-like texts
We confine ourselves to such XML-like texts:
<?xml version="1.0" encoding="utf-8"?> <book title="Book 1"> <chapter title="Chapter 1"> <paragraph>123</paragraph> <paragraph>456</paragraph> </chapter> <chapter title="Chapter 2"> <paragraph>123</paragraph> <paragraph>456</paragraph> <paragraph>789</paragraph> </chapter> <chapter title="Chapter 3"> ... </chapter> </book>
To begin with, we will write a pattern recognizing a named attribute of the form name = "value" - it is obviously often found in XML:
var name = rgx(/[az]+/i).then(s => s.toLowerCase()); var char = rgx(/[^"&]/i); var quoted = seq(txt('"'), rep(char), txt('"')).then(r => r[1].join('')); var attr = seq(name, txt('='), quoted).then(r => ({ name: r[0], value: r[2] }));
Here, attr parsits a named attribute with a value as a string, quoted - parses the string in quotes, char - parses one letter in the string (why write it as a separate pattern? Then, what then to “teach” this char to parse the so-called xml entities ), well, the name parses the name of the attribute (note that it parses both large and small letters, but returns the parse name where all the letters are small). Using attr looks like this:
assert.deepEqual( attr.exec('title="Chapter 1"', 0), { res: { name: "title", value: "Chapter 1" }, end: 17 });
Next, we construct a pattern that can parse a header like <? Xml ...?>:
var wsp = rgx(/\s+/); var attrs = rep(attr, wsp).then(r => { var m = {}; r.forEach(a => (m[a.name] = a.value)); return m; }); var header = seq(txt('<?xml'), wsp, attrs, txt('?>')).then(r => r[2]);
Here wsp parsit one or more spaces, attrs parses one or more named attributes and returns parsed as a dictionary (rep would return an array of name-value pairs, but the dictionary is more convenient, so the array is converted to a dictionary inside then), and the header parses the header itself and returns only header attributes in the form of the dictionary itself:
assert.deepEqual( header.exec('<?xml version="1.0" encoding="utf-8"?>', 0), { res: { version: "1.0", encoding: "utf-8" }, end: ... });
We now turn to parsing the constructions of the form <node ...> ...:
var text = rep(char).then(r => r.join('')); var subnode = new Pattern((str, pos) => node.exec(str, pos)); var node = seq( txt('<'), name, wsp, attrs, txt('>'), rep(any(text, subnode), opt(wsp)), txt('</'), name, txt('>')) .then(r => ({ name: r[1], attrs: r[3], nodes: r[5] }));
Here text parsits the text inside the node (node) and uses the char pattern which can be taught to recognize xml entities, the internal node has its subnode and its node has attributes and internal nodes. Why such a clever definition of subnode? If in the definition of node we refer directly to node (something like this: node = seq (..., node, ...)) then it turns out that at the time of determining node this variable is still empty. The subnode trick allows you to remove this circular dependency.
It remains to determine the pattern recognizing the entire file with the header:
var xml = seq(header, node).then(r => ({ root: r[1], attrs: r[0] }));
The application is accordingly:
assert.deepEqual( xml.exec(src), { attrs: { version: '1.0', encoding: 'utf-8' }, root: { name: 'book', attrs: { title: 'Book 1' }, nodes: [ { name: 'chapter', attrs: { title: 'Chapter 1' }, nodes: [...] }, ... ] } });
Here I call Pattern # exec with one argument and the meaning of this is that I want to parse the line from the very beginning and make sure that it has been parted to the end, and since it has been parted to the end, it is enough to return only parser without a pointer to that place. where the parser stopped (I already know that this is the end of the line):
function Pattern(name, exec) { ... this.exec = function (str, pos) { var r = exec(str, pos || 0); return pos >= 0 ? r : !r ? null : r.end != str.length ? null : r.res; }; }
Actually the entire parser is 20 lines (do not forget about those 70 that implement rep, seq, any, etc.):
var name = rgx(/[az]+/i).then(s => s.toLowerCase()); var char = rgx(/[^"&]/i); var quoted = seq(txt('"'), rep(char), txt('"')).then(r => r[1].join('')); var attr = seq(name, txt('='), quoted).then(r => ({ name: r[0], value: r[2] })); var wsp = rgx(/\s+/); var attrs = rep(attr, wsp).then(r => { var m = {}; r.forEach(a => (m[a.name] = a.value)); return m; }); var header = seq(txt('<?xml'), wsp, attrs, txt('?>')).then(r => r[2]); var text = rep(char).then(r => r.join('')); var subnode = new Pattern((str, pos) => node.exec(str, pos)); var node = seq( txt('<'), name, wsp, attrs, txt('>'), rep(any(text, subnode), opt(wsp)), txt('</'), name, txt('>')) .then(r => ({ name: r[1], attrs: r[3], nodes: r[5] })); var xml = seq(header, node).then(r => ({ root: r[1], attrs: r[0] }));
With operator overloading in JS (or in C ++), it would look something like this:
var name = rgx(/[az]+/i).then(s => s.toLowerCase()); var char = rgx(/[^"&]/i); var quoted = ('"' + rep(char) + '"').then(r => r[1].join('')); var attr = (name + '=' + quoted).then(r => ({ name: r[0], value: r[2] })); var wsp = rgx(/\s+/); var attrs = rep(attr, wsp).then(r => { var m = {}; r.forEach(a => (m[a.name] = a.value)); return m; }); var header = ('<?xml' + wsp + attrs + '?>').then(r => r[2]); var text = rep(char).then(r => r.join('')); var subnode = new Pattern((str, pos) => node.exec(str, pos)); var node = ('<' + name + wsp + attrs + '>' + rep(text | subnode) + (wsp | null) + '</' + name + '>') .then(r => ({ name: r[1], attrs: r[3], nodes: r[5] })); var xml = (header + node).then(r => ({ root: r[1], attrs: r[0] }));
It should be noted that each var here strictly corresponds to one ABNF rule and therefore if it is necessary to parse something according to the description in the RFC (and ABNF is loved there), then the transfer of those rules is a mechanical matter. Moreover, since the ABNF rules themselves (as well as EBNF and PEG) are strictly formal, we can write a parser of these rules and then, instead of calls to rep, seq, etc., write something like this:
var dataurl = new ABNF('"data:" mime ";" attrs, "," data', { mime: /[az]+\/[az]+/i, attrs: ..., data: /.*/ }).then(r => ({ mime: r[1], attrs: r[3], data: r[5] }));
And apply as usual:
assert.deepEqual( dataurl.exec('data:text/plain;charset="utf-8",how+are+you%3f'), { mime: "text/plain", attrs: { charset: "utf-8" }, data: "how are you?" });
Some more beeches
Why does Pattern # exec return null / undefined if parsing failed? Why not throw an exception? If you use exceptions in this way, the parser will become slower every twenty. Exceptions are good for exceptional cases.
The described method allows you to write LL parsers that are not suitable for all purposes. For example, if you need to parse an XML like
<book attr1="..." attr2="..." ???
So in the place where it is ??? may be either /> or just>. Now, if LL parser tried to parse it all as <book ...> and in place ??? turned out to be />, then the parser will drop everything that it took so long for parsil and start anew, assuming that it is <book ... />. LR parsers do not have this drawback, but it is difficult to write them.
Also LL parsers are poorly suited for parsing various syntactic / mathematical expressions where there are operators with different priorities, etc. LL parser can of course be written, but it will be somewhat confused and will work slowly. The LR parser will be messed up by itself, but it will be fast. Therefore, such expressions are convenient to parse so-called. Pratt's algorithm that
explained Crockford quite well (if you have a purple link, then you probably understand the parsers better than me and you must have been very bored reading all this).
I hope someone this is useful. At one time I wrote parsers of varying degrees of clumsiness and the method described above was a discovery for me.