MySQL grammar on ANTLR 4

Application-level firewall is designed to analyze and filter traffic for any application or class of applications, such as web applications or DBMS . When building it, it becomes necessary to speak the language of this application. For a relational DBMS, this is the SQL dialect. Suppose you need to build a firewall for the DBMS. In this case, you will need to recognize and analyze SQL statements in order to make a decision about their compliance with a given security policy. Depending on the tasks being solved (for example, detection of attacks like SQL injection, access control, correlation of SQL and HTTP requests), one or another depth of SQL analysis will be needed. Either way, you need to perform lexical, syntactic and semantic analysis of SQL statements.

Introduction

Formal grammar of the language

In order to get a complete picture of the structure of the language and analyze it, you can use the formal grammar of this language. With its help, it is possible both to generate sentences of a language, and, using a parser , to recognize sentences in it.

According to the Chomsky hierarchy , there are four main types of languages and, accordingly, four types of grammars. Grammars differ in their generative possibilities. MySQL is a context-sensitive language. However, the list of language constructs that can only be generated by context-sensitive grammar is small. As a rule, language constructions are used in practice, for generating which generation context-free grammar is sufficient. This article describes the details of the development of context-free grammar for MySQL.

Terminology used

The language is determined on the basis of the alphabet - the set of characters. The letters of the alphabet are combined into meaningful sequences called lexemes. Tokens can be of different types (identifiers, strings, keywords, etc.). A token is a tuple consisting of a token and a type name. A phrase is a sequence of tokens arranged in a special order. From sentences can be built sentences. Further, a sentence is understood as a certain complete sequence of lexemes, which in the context of a given language has an independent meaning. The concept of a only makes sense in the applied field.

Applications built on the basis of a language operate with sentences of this language, for example, they perform or interpret them. Phrases, from the applied point of view, are incomplete constructions; they can form part of a sentence. But when building a grammar it is convenient to use phrases. In fact, phrases and sentences do not differ in terms of grammar - they correspond to some of its rules, on the right side of which there are non-terminal symbols.

The use of language involves the construction or recognition of sentences. The recognition task assumes that the sequence of tokens is input, and the output gives an answer to the question whether this sequence is a set of correct sentences in this language .

MySQL language

The MySQL language is a dialect of the SQL language for writing queries to the MySQL DBMS. By SQL language is meant a standard or, formally, a series of ISO / IEC 9075 standards “Information technology - Database languages - SQL” .

The MySQL dialect is a concrete implementation of the standard with some limitations and additions. Most MySQL sentences can be described by context-free grammar, but there are some sentences that need context-sensitive grammar rules to describe. In simple terms, if the lexeme influences the further recognition of phrases, such a phrase is described by a context-dependent grammar rule.

MySQL has some expressions based on this principle. For example:

 DELIMITER SOME_LITERAL

In this case, you need to remember SOME_LITERAL , because in future sentences the literal SOME_LITERAL should be used as a symbol of their completion, and not ; .

In a procedural extension, loop operators and block clauses can be labeled. Their structure is as follows:

 label somephrases label

In this case, tag IDs must be the same. You can display such a sentence only in context-sensitive grammar.

ANTLR

ANTLR parser generator was chosen for the development of the MySQL parser. The main advantages of this generator:

ANTLR involves a two-step algorithm for generating a recognition code. First, the lexical structure of the language is described, that is, it is determined what are tokens. The following describes the syntactic structure of the language, that is, the recognized tokens are grouped into sentences. The lexical and syntactic structures in ANTLR are described using rules. The lexical structure is determined by the type (token handle) and value. To describe the value, a language with elements of regular expressions is used, but with support for recursion. The syntactic structure rule is made up of token descriptors based on the sentence construction rules in ANTLR 4, which allow determining the structure of the location of lexemes in a sentence or phrase within a sentence.

When constructing rules, one should take into account the basic principle of lexical analysis, which is also implemented in ANTLR. Lexer first tries to recognize the longest sequence of characters from the input stream, suitable for any lexical rule. If there are several such rules, then the first in the order of definition will be activated.

Without the use of semantic predicates in ANTLR, only context-free grammar can be constructed. The advantage is that in this case the resulting grammar will be independent of the execution environment. The proposed grammar for the MySQL dialect is built without the use of semantic predicates.

Lexer

Where to begin

When developing a grammar, the first thing to do is to determine the list of types of tokens found in the language. When recognizing tokens, the characters of the alphabet of the language will be input to the recognizer's input, from which it is necessary to make lexemes; however, characters that are not involved in the construction of tokens can be filtered. Such characters are whitespace and comments. After filtering, only meaningful lexemes of the language will be included in the further analysis. You can filter whitespace and comments like this:

 SPACE: [ \t\r\n]+ -> channel(HIDDEN); COMMENT_INPUT: '/*' .*? '*/' -> channel(HIDDEN); LINE_COMMENT: ('-- ' | '#') ~[\r\n]* ('\r'? '\n' | EOF) -> channel(HIDDEN);

You can also immediately take into account potential lexical errors and skip unknown characters:

 ERROR_RECONGNIGION: . -> channel(ERRORCHANNEL);

If any characters are not recognized by any lexical rule, they are recognized by the ERROR_RECONGNIGION rule. It is placed at the end of the grammar, since the last rule has the lowest priority.

Now you can proceed to the selection of tokens. As a rule, the following types of tokens can be distinguished:

keywords;
identifiers;
literals;
Special symbols.

If there is no explicit (or implicit) intersection of these types of lexemes in the language, there are no problems and it is required to simply describe all the lexemes. However, if intersections occur somewhere, they must be resolved. The situation is complicated by the fact that regular grammar is used to recognize individual lexemes. In MySQL, this problem occurs with " fully qualified name " and with keywords that can be identifiers.

Dot identifiers

When recognizing MySQL tokens, such as identifiers that begin with numbers , there are some problems: the "." can occur both in full column names and in real literals:

 select table_name.column_name as full_column_name ...

 select 1.e as real_number ...

Thus, it is necessary to correctly recognize the full column name in the first case and the real literal in the second. Intersection arises from the fact that identifiers in MySQL can begin with numbers.

From the point of view of the MySQL language, the phrase:

 someTableName.1SomeColumn

is a sequence of three tokens:

 (someTableName, ), (. , -), (1SomeColumn, )

To do this, it is natural to use the rules:

 DOT: .; ID_LITERAL: [0-9]*[a-zA-Z_$][a-zA-Z_$0-9]*;

And for numbers like this:

 DECIMAL_LITERAL: [0-9]+;

After tokenization, a sequence of four tokens is obtained:

 (someTableName, ), (. , -), (1, ), (SomeColumn, )

In order to avoid the ambiguity problem, you can introduce an auxiliary construction for identifying identifiers:

 fragment ID_LITERAL: [0-9]*[a-zA-Z_$][a-zA-Z_$0-9]*;

and determine the rules, sorted by priority:

 DOT_ID: '.' ID_LITERAL; ... ID: ID_LITERAL; ... DOT: '.'

Since ANTLR recognizes sequences of maximum length, you can not be afraid of the "." will be recognized as a separate character.

Strings

Using the example of strings, one more rule of lexical analysis, implemented in ANTLR, can be given. A string in MySQL is a sequence of almost any characters enclosed in single or double quotes. Single quotation strings cannot contain a single backslash or quotation mark, because the lexer does not determine where the string ends. If you still need to use such signs, then shielding is used, which consists in replacing one quote with two consecutive quotes. In addition, the escape character within a string cannot occur by itself, it must escape something. Therefore, the individual appearances of this symbol must also be prohibited. The result is the following fragment of the lexical rule:

 fragment SQUOTA_STRING: '\'' ('\\'. | '\'\'' | ~('\'' | '\\'))* '\'';

'\\'. - allows backslash and the character that it escapes;
'\'\'' - allows a sequence of two single quotes;
~('\'' | '\\') - prohibits the appearance of a single single quote or a single screen character.

Keywords

The ANTLR lexer, unlike the parser, uses rules in order of priority. Rules defined earlier have a higher priority than those described later. This gives an obvious instruction on how to sort the rules: first, there are specific rules that define keywords and special characters, and then general rules that define literals, variables, identifiers, etc.

MySQL special comment type

MySQL uses a multi-line comment of a special kind . Such comments allow you to create queries compatible with other DBMSs, isolating the specifics of MySQL. MySQL when analyzing the query will analyze the text from such comments. To recognize special MySQL comments, you can use the rule:

 SPEC_MYSQL_COMMENT: '/*!' .+? '*/' -> channel(MYSQLCOMMENT);

However, it alone is not enough for the correct parsing of requests.

Suppose that a request comes in as follows:

 select name, info /*!, secret_info */ from users;

When using such a rule, we get the following sequence of tokens:

 (SELECT, 'select') (ID, 'name') (COMMA, ',') (ID, 'info') (SPEC_MYSQL_COMMENT, '/*!, secret_info */') (FROM, 'from') (ID, 'users') (SEMI, ';')

At the same time, the standard MySQL lexer recognizes several other tokens:

 (SELECT, 'select') (ID, 'name') (COMMA, ',') (ID, 'info') (COMMA, ',') (ID, 'secret_info') (FROM, 'from') (ID, 'users') (SEMI, ';')

Therefore, for correct recognition of a special type of comment in MySQL, additional processing is required:

The source text is recognized by a special pre-processing lexer.
SPEC_MYSQL_COMMENT are extracted from the SPEC_MYSQL_COMMENT tokens and a new text is generated, which will be processed only by the MySQL server.
Processing of the newly constructed text is performed using the usual parser and lexer.

A preprocessing lexer breaks the input stream into phrases related to:

to a special comment ( SPEC_MYSQL_COMMENT );
basic query ( TEXT );

and can be built as follows:

 lexer grammar mysqlPreprocessorLexer; channels { MYSQLCOMMENT } TEXT: ~'/'+; SPEC_MYSQL_COMMENT: '/*!' .+? '*/'; //-> channel(MYSQLCOMMENT); SLASH: '/' -> type(TEXT);

As a result of the work of the pre-lexer, the request code is split into a sequence of SPEC_MYSQL_COMMENT and TEXT tokens. If the MySQL dialect sentence is being processed, values are combined from the SPEC_MYSQL_COMMENT tokens, which are combined with the values of the TEXT tokens. After that, the usual MySQL lexer is started for the resulting text. In the case of another SQL dialect, the SPEC_MYSQL_COMMENT tokens SPEC_MYSQL_COMMENT simply deleted or placed in an isolated channel.

Register Independence

Almost all the lexemes in MySQL are case-insensitive, which means that the two following queries are identical:

 select * from t; SelECT * fROm t;

Unfortunately, in ANTLR there is no support for case-insensitive tokens, and for tokens one has to use the following entry using fragment tokens , which are used to build real tokens:

 SELECT: SELECT; FROM: FROM; fragment S: [sS]; fragment E: [eE];

This reduces grammar readability. Moreover, for each character, the lecturer has to choose from two options — a character in upper or lower case — which negatively affects performance.

In order for the lexer code to be cleaner and the performance higher, the input stream of characters needs to be normalized, that is, lead to upper or lower case. ANTLR supports a special stream, which does not consider the register during lexical analysis, but retains the register of the original token values. These tokens can be used during tree traversal.

The implementation of such a stream for different runtimes was proposed by KvanTTT . It can be found in the DAGE project, the cross-platform editor of grammars ANTLR 4.

As a result, all tokens are written either in lower or upper case. Since SQL keywords in queries are usually written in upper case, it was decided to use it also for grammar:

 SELECT: "SELECT"; FROM: "FROM";

Parser

To describe the syntactic structure of a language, it is necessary to determine the writing order:

sentences in the text;
phrases in the sentence;
tokens and phrases in more general phrases.

MySQL text structure

For MySQL, there is an excellent grammar description , albeit distributed throughout the reference manual. The structure of the location of sentences in the text is in the description section of the messaging protocol between the server and the MySQL client. You can see that all offers, except perhaps the last, use a separator ; . In addition, there is a nuance concerning a line comment : the last sentence in the text can end with such a comment. As a result, it turns out that any correct sequence of MySQL statements should be presented as:

 root : sqlStatements? MINUSMINUS? EOF ; sqlStatements : (sqlStatement MINUSMINUS? SEMI | emptyStatement)* (sqlStatement (MINUSMINUS? SEMI)? | emptyStatement) ; ... MINUSMINUS: '--';

The power of context-free grammar is not enough to fully support these rules, since the MySQL client can use the DELIMITER command , with which you can set the current delimiter. In this case, you need to remember and use the separator in other rules. Thus, if this directive is used, correctly written SQL statements using the grammar in question will not be recognized.

MySQL sentence types

MySQL sentences are of the following types :

DDL;
DML;
transaction management offers;
cluster management suggestions;
Dynamic query creation suggestions
server administration suggestions;
offers of performance of service actions;
proposals for procedural extensions.

After translating the documentation into the ANTLR 4 grammar, the root sentence will be as follows:

 sqlStatement : ddlStatement | dmlStatement | transactionStatement | replicationStatement | preparedStatement | administrationStatement | utilityStatement

There is also an empty sentence consisting of one semicolon:

 empty_statement : SEMI ; SEMI: ';';

Subsequent paragraphs of official documentation are transformed into ANTLR rules in a similar way.

SELECT

Perhaps the most interesting and extensive sentence in SQL in general and in MySQL in particular is the SELECT statement. When writing grammar, the main attention was paid to the following parts:

the definition of tables;
definition of expressions;
combining using UNION .

Let's start with the definition of the tables. In MySQL, there is quite a serious description of what can be used in the FROM field of a SELECT query (hereinafter, we will call this “table links”). After careful study and testing on existing versions, it becomes clear that the "table links" are a construction of the form:

   1,   2, …,   N

in which the "Tabular Object" is one of four constructs:

separate table;
join tables;
subquery;
table references enclosed in brackets.

If we start from the less general, then we find that the table object is inductively defined as a table or as a table-based construction. The latter may be:

join tables;
subquery;
a sequence of table objects enclosed in brackets.

Next, we find that the FROM field simply defines a sequence of table objects consisting of at least one table object. Of course, the grammar defines additional constructions like "join conditions", references to partitions (PARTITION), etc., but the general structure is as follows:

 tableSources : tableSource (',' tableSource)* ; tableSource : tableSourceItem joinPart* | '(' tableSourceItem joinPart* ')' ; tableSourceItem : tableName (PARTITION '(' uidList ')' )? (AS? alias=uid)? (indexHint (',' indexHint)* )? #atomTableItem | (subquery | '(' parenthesisSubquery=subquery ')') AS? alias=uid #subqueryTableItem | '(' tableSources ')' #tableSourcesItem ;

Expressions

Expressions are used everywhere in MySQL, wherever there is a need to compute a value (value vector). An inductive expression can be defined as follows:

An expression is any type token:
- constant (literal) value;
- variable;
- object identifier.
An expression is a superposition of expressions united by any transformations.

Transformations include operations, operators (including set-theoretic, comparison operators), functions, queries, parentheses.

UNION

Unlike other dialects, there are only two set-theoretic operations on tables in MySQL. The first - JOIN - has already been reviewed. Experimentally, it was found that the description of UNION in official documentation is somewhat incomplete. We have added it as follows:

 selectStatement : querySpecification lockClause? #simpleSelect | queryExpression lockClause? #parenthesisSelect | querySpecificationNointo unionStatement+ ( UNION (ALL | DISTINCT)? (querySpecification | queryExpression) )? orderByClause? limitClause? lockClause? #unionSelect | queryExpressionNointo unionParenthesis+ ( UNION (ALL | DISTINCT)? queryExpression )? orderByClause? limitClause? lockClause? #unionParenthesisSelect ;

When using UNION individual queries can be enclosed in parentheses. The requirement to use them is not mandatory, except when ORDER BY and LIMIT are used in queries. However, if the first query in UNION enclosed in parentheses, all subsequent queries must be enclosed in them.

Mistake:

 (select 1) union select 2; (select 1) union (select 2) union select 3;

Right:

 (((select 1))) union (select 2); (select 1) union ((select 2)) union (select 3);

Grammar usage

Grammar is written to solve problems of syntactic and lexical analysis. On the one hand, it is necessary that the recognition be carried out as quickly as possible, and on the other, that the code of the generated lexer and parser can be safely used in own applications without affecting the capabilities and speed.

An application that uses the parser is likely to use one of two design patterns — Visitor or Observer . Each of them involves the analysis of a specific subset of the nodes of the parse tree . The nodes of the parse tree that are not leaves correspond to any syntactic rules of grammar. When analyzing the nodes of the parse tree, you need to refer to the child nodes corresponding to the fragments of the original rule. And you can apply to both individual nodes and groups of nodes.

"" . "" , . ANTLR . , Visitor, . , MySQL :

 tableSourceItem : tableName (PARTITION '(' uidList ')' )? (AS? alias=uid)? (indexHint (',' indexHint)* )? | (subquery | '(' parenthesisSubquery=subquery ')') AS? alias=uid | '(' tableSources ')' ;

, :

;
;
, .

 tableSourceItem : tableName (PARTITION '(' uidList ')' )? (AS? alias=uid)? (indexHint (',' indexHint)* )? #atomTableItem | (subquery | '(' parenthesisSubquery=subquery ')') AS? alias=uid #subqueryTableItem | '(' tableSources ')' #tableSourcesItem ;

. . , () . . , .

, :

 loadXmlStatement : LOAD XML (LOW_PRIORITY | CONCURRENT)? LOCAL? INFILE STRING_LITERAL (REPLACE | IGNORE)? INTO TABLE tableName (CHARACTER SET charsetName)? (ROWS IDENTIFIED BY '<' STRING_LITERAL '>')? ( IGNORE decimalLiteral (LINES | ROWS) )? ( '(' assignmentField (',' assignmentField)* ')' )? (SET updatedElement (',' updatedElement)*)? ;

, , LOAD XML . , LOAD XML :

- , ;
;
.

, , :

 loadXmlStatement : LOAD XML priority=(LOW_PRIORITY | CONCURRENT)? LOCAL? INFILE file=STRING_LITERAL violation=(REPLACE | IGNORE)? INTO TABLE tableName (CHARACTER SET charsetName)? (ROWS IDENTIFIED BY '<' tag=STRING_LITERAL '>')? ( IGNORE decimalLiteral linesFormat=(LINES | ROWS) )? ( '(' assignmentField (',' assignmentField)* ')' )? (SET updatedElement (',' updatedElement)*)? ;

, .

Conclusion

SQL- , , , , - . , MySQL , , , , MySQL . MySQL , WordPress Bitrix, , - . grammars-v4 MIT.

Source: https://habr.com/ru/post/339336/

All Articles