Creating a state machine for parsing an HTTP request

Deterministic state machine can be used to implement a very fast way to parse the input sequence. It requires only one pass through the input sequence, and minimal steps at each step. Unfortunately, this model has limitations - it is not always possible to build a DFA for the existing Nondeterministic finite state machine (regular expression, grammar). Or even if it is possible to build, an automaton may have too many states.

Nevertheless, I decided to try to create a parser for an HTTP request based on DKA. The main task is not just to verify the correctness of the HTTP request, but to select elements in the input line that correspond to certain values of the HTTP request fields. The machine should be generated from the BNF rules (scattered over) RFC2616. Everything is implemented in C #, the automaton at the output is also in C #. Although it is clear that when the machine is ready, to generate it in any language, in any form is not a problem.

To generate the DCA, the following chain of actions is implemented:
1. Analysis of BNF rules
2. Creation of an NCA based on them
3. Convert NCA to DCA
4. Minimize DCA
5. Creating files with source text and jump table

I used Irony to parse BNF grammar, ABNF grammar is well described in RFC5234. According to the syntax tree, the program builds the NCA according to the rules of Thomson. I used standard algorithms to convert the NCA to the DCA, and to minimize, I will not describe them either.
')
Since the HTTP message is a relatively simple object, the output of the parser is a class (this is enough), with fields corresponding to HTTP headers. In order for the automaton to parse the message during the passage, simple actions modifying class variables are added to the DFA during generation. For example, you need to get the value of the Host header, the class has two variables, the beginning and end of the header. In the states that are triggered when the machine goes to the value of the Host field, actions are added to change these variables. In the DCA itself, it is impossible to determine which state is responsible for what. Therefore, the states are marked when creating an NCA, and then these labels are preserved, with all the transformations of the automaton. To simplify the marking of states and the assignment of actions, a simple language was made, in which the names of variables are specified and what values they will receive. On the basis of this data, a class is then generated and in which the parser places the result of the work.

A similar approach is used in regular expressions (although not all implementations do a conversion to a DFA). But with direct access to the machine, it is possible to make a number of interesting optimizations. For example, the Content-Lenght field contains the length of the message body; for this field, conversion to a number occurs directly in the machine; at the output, we have a field of the int ContentLenght class. Another interesting optimization, for example, for the HTTP request is a set of methods (GET, HOST ...). At the output of the parser, you can immediately get a constant corresponding to the method. And all this in one pass.

It is also very important that the parser accepts an array of bytes as input, this is especially true for modern languages like C #, where the string is not an array of bytes as in C ++, and simply casting a pointer will not work.

In general, the system looks like this. There is a BNF grammar, and a description of the class that we want to get at the output. All this is fed to the input of the generator, and we get a DKA on C #. With a ready-made machine, you can work as follows:

dfa.SetDefaultValue();
dfa.Parse(message0, 0, message0.Length);
dfa.SetArray(message0);

Console.WriteLine("Method: {0}", dfa.Method);
Console.WriteLine("Request-URI: |{0}|", dfa.RequestUri.ToString());
Console.WriteLine("Host: |{0}|", dfa.Host.ToString());
Console.WriteLine("Content-Type: |{0}|", dfa.ContentType.Value.ToString());
Console.WriteLine("Content-Length: {0}", dfa.ContentLength);
Console.WriteLine("Referer: |{0}|", dfa.Referer.ToString());

In practice, a DFA for an HTTP request contains before minimizing 150K states, after 1K-2K. It is clear, firstly, that 1K-2K is quite acceptable value for real use. In the second 150K states, it requires some memory and time to compute (at least in my implementation) - minutes x-tsat.

The speed of the parser is very high, the formal rough estimate for this approach is ~ C * O (n), and it is probably difficult to fundamentally improve it. In reality, the C # parser on my machine managed to parse 1M times in 3.3 seconds a message 220 bytes long.

Project source code is available: code.google.com/p/siprevo

The code at the output of the generator tried to be accurate, but the gene generator itself was slightly (if not more) in the draft version. It’s good (but not during debugging) that such systems almost cannot fail, it either works or does not work, in this case it works.

Update: the “compiler” of the automaton as a separate utility lives here .

Source: https://habr.com/ru/post/141743/

All Articles

Creating a state machine for parsing an HTTP request

More articles: