In November of what is now last year, Hola announced a programming contest for mail filters in js, and recently published the results. I shared second place with Ilya Makarov, and now I will tell you...
How it was
After deciding to take part in the contest and thinking the task over, I quickly realized that the given email patterns can be converted to regular expressions simply by replacing '*' with '.*' and '?' with '.' and escaping all the special characters. One could have stopped there, as Nadav Ivgi did, receiving a special prize for the shortest solution. :)
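For illustration, a minimal sketch of such a conversion (the function name and the exact escaping set are mine, not taken from any contest entry):

```js
// Convert a mail-filter pattern into an equivalent RegExp:
// '*' matches any sequence of characters, '?' matches exactly one,
// everything else is matched literally.
function patternToRegExp(pattern) {
    var source = pattern
        .replace(/[.+^$(){}|[\]\\]/g, '\\$&') // escape regex metacharacters
        .replace(/\*/g, '.*')                 // '*' -> '.*'
        .replace(/\?/g, '.');                 // '?' -> '.'
    return new RegExp('^' + source + '$');
}

// patternToRegExp('*@hola.*').test('john.doe@hola.org'); // true
```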
But this solution has a fatal flaw: its complexity is O((number of rules) * (number of messages)), since each message must be checked against every rule.
There is a better way. First, the two regexes for the .from and .to fields can be joined into one by separating them with a character that never occurs in the data (say '\x1F', since by the problem statement only characters from '\x20' to '\x7F' inclusive are used), and the same can be done with the corresponding message fields:
{from: 'john.doe@foo.org', to: '*@hola.*', action: 'act1'} -> {signature: 'john.doe@foo.org\x1F*@hola.*', action: 'act1'}
This halves the number of pattern checks required.
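A sketch of this joining; treating a missing from or to as 'match anything' is my assumption about the task, and the names are illustrative:

```js
// Join the 'from' and 'to' parts of a rule and of a message into a single
// string, separated by '\x1F', which cannot occur in the data itself.
function ruleSignature(rule) {
    return (rule.from || '*') + '\x1F' + (rule.to || '*');
}

function messageSignature(message) {
    return message.from + '\x1F' + message.to;
}

// One combined check per rule instead of two separate ones:
// patternToRegExp(ruleSignature(rule)).test(messageSignature(message));
```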
Second, these regexes can be compiled into a single deterministic finite automaton (DFA), whose matching time does not depend on how many of them there are. The standard RegExp will not help us here. This is exactly what the first version of my solution did.
Having successfully tested it on the data from the assignment, I decided to automate the process and implemented a generator of input data and a comparison of the results against the reference implementation. Not in vain, as it turned out: a bug was found in the DFA minimization procedure, and I temporarily excluded it from the algorithm (it only affects performance, not correctness).
My script hammered the reference implementation as frequently as it could, and after a couple of days Hola imposed a rate limit on requests. Heh. I shrugged and introduced a delay between requests. Testing became slower, but it was livable.
Then the optimization began. The first idea was to compile the DFA into one huge switch in js, the way lex and other similar tools do.
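The output of such a generator would look roughly like this; the states and transitions below are made up purely to show the shape, a real generator would emit one case per DFA state:

```js
// Hand-written illustration of a DFA compiled into a switch: each case is a
// state, each input character selects the next state.
function runCompiledDfa(signature) {
    var state = 0;
    for (var i = 0; i < signature.length; i++) {
        var c = signature.charCodeAt(i);
        switch (state) {
            case 0:
                state = c === 0x40 /* '@' */ ? 1 : 0;
                break;
            case 1:
                state = c === 0x1F ? 2 : 1;
                break;
            case 2:
                state = 2;
                break;
        }
    }
    return state; // the caller maps accepting states to the matched rules' actions
}
```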
Having implemented such a code generator, I realized that I had no way to tell whether I had actually improved the result, and that for this I needed a much more extensive set of test data. I decided that I could not easily generate datasets with plausible statistical characteristics, and that it would be better to use real data. Fairly quickly I found the Enron Email Dataset, the email archive of the bankrupt company of the same name. A couple of Python scripts extracted the messages and generated rules from their addresses, and the benchmark was ready.
I launched it and... stepped on a combinatorial mine. The thing is, the algorithm for constructing a DFA has exponential complexity, O(2^(number of states of the non-deterministic finite automaton (NFA))), and I had underestimated just how terrible that is. I could not wait for the automaton to finish building for a mere twenty rules. Instead of solving this algorithmic problem, I had been spending time optimizing an already fast part of the system. Truly, premature optimization is the root of all evil.
Struck by enlightenment, the first thing I did was separate .from and .to back out, thus obtaining one automaton over .from for all the rules, and one automaton over .to for each group of rules whose .from patterns lead to the same final state of the first automaton:

This reduced the size of each NFA by at least half, and therefore significantly reduced the size of the corresponding DFA. Performance rose by an order of magnitude; now it was feasible to wait for the automaton to be built for hundreds of rules.
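Dispatching through this two-level structure might look roughly like the sketch below; finalStateFor and the shape of toAutomataByFromState are assumptions made for illustration, not the actual contest API:

```js
// Run the first automaton over .from; its final state identifies the group of
// rules whose .from patterns match this sender, and that group's automaton is
// then run over .to.
function matchMessage(message, fromAutomaton, toAutomataByFromState) {
    var fromState = fromAutomaton.finalStateFor(message.from); // assumed API
    var group = fromState && toAutomataByFromState[fromState.id];
    if (!group) return [];
    var toState = group.toAutomaton.finalStateFor(message.to);
    return (toState && toState.actions) || [];
}
```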
But it was still too long. It became clear that the approach had to change, and about a week later the insight came: there is no need to build the entire deterministic automaton at once; it can be built lazily, one state at a time.
The implementation of this idea surprised me greatly: it was several times faster than the previous version on the same input data, and moreover it easily chewed through input of much larger size. Apparently, most of the states of the full DFA are never visited. I even went through Enron again to extract more data.
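To make the idea concrete, here is a small self-contained sketch of lazy (on-demand) subset construction over glob-style patterns. The representation (positions packed into numbers, plain objects as caches) is mine and heavily simplified; it only illustrates the principle:

```js
// Each NFA "state" is a (pattern index, position) pair packed into a number;
// a DFA state is a set of such positions. Transitions are computed on demand
// and cached, so only the DFA states that are actually visited get built.
function buildLazyMatcher(patterns) {
    function encode(p, i) { return p * 1000 + i; } // assumes patterns shorter than 1000 chars

    // Epsilon closure: a '*' may be skipped without consuming a character.
    function closure(states) {
        var seen = {};
        var stack = states.slice();
        while (stack.length) {
            var s = stack.pop();
            if (seen[s]) continue;
            seen[s] = true;
            var p = Math.floor(s / 1000), i = s % 1000;
            if (patterns[p][i] === '*') stack.push(encode(p, i + 1));
        }
        return Object.keys(seen).map(Number).sort(function (a, b) { return a - b; });
    }

    var dfaStates = {}; // string key -> DFA state, i.e. the memoized subset construction
    function getDfaState(nfaStates) {
        var key = nfaStates.join(',');
        var state = dfaStates[key];
        if (!state) {
            state = dfaStates[key] = {
                nfaStates: nfaStates,
                transitions: {}, // char -> next DFA state, filled lazily
                accepting: nfaStates.some(function (s) {
                    return s % 1000 === patterns[Math.floor(s / 1000)].length;
                })
            };
        }
        return state;
    }

    function step(state, ch) {
        var cached = state.transitions[ch];
        if (cached) return cached;
        var next = [];
        state.nfaStates.forEach(function (s) {
            var p = Math.floor(s / 1000), i = s % 1000, c = patterns[p][i];
            if (c === '*') next.push(s);                                // '*' loops on any character
            else if (c === '?' || c === ch) next.push(encode(p, i + 1));
        });
        return (state.transitions[ch] = getDfaState(closure(next)));
    }

    var startState = getDfaState(closure(patterns.map(function (_, p) { return encode(p, 0); })));
    return function matches(input) {
        var state = startState;
        for (var i = 0; i < input.length; i++) state = step(state, input[i]);
        return state.accepting;
    };
}

// var matches = buildLazyMatcher(['*@hola.*', 'john.doe@foo.org']);
// matches('alice@hola.org'); // true
```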
I decided that this was what I needed, and began to optimize.
Optimization
I tried several different things; I will describe the ones that gave tangible results.
Memoization
I didn't measure it, but I have reason to believe this saves the most time. It is used in two places:
- The lazy DFA construction remembers states it has already passed through (as in the sketch above), rather than creating them anew on every pass.
- The final state of the automaton is remembered for each email, eliminating the need to walk the DFA graph again. Since a typical email archive contains far more messages than addresses between which they are sent, the benefit is substantial. In particular, this means that in the result of filter, messages with the same set of actions actually refer to the same array object: action arrays are not filled anew for each new message (see the sketch after this list).
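A sketch of the second point; walkDfa here stands in for the actual DFA traversal and its interface is an assumption:

```js
// The final DFA state, and hence the array of actions, is cached per
// sender/recipient pair, so repeated pairs never walk the DFA graph again and
// share the very same actions array object.
var actionsBySignature = Object.create(null);

function actionsFor(message, walkDfa) {
    var signature = message.from + '\x1F' + message.to;
    var cached = actionsBySignature[signature];
    if (cached) return cached;                   // the same array object is returned again
    var finalState = walkDfa(signature);         // assumed to return the final DFA state or null
    var actions = finalState ? finalState.actions : [];
    return (actionsBySignature[signature] = actions);
}
```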
Searching for the DFA state corresponding to a set of NFA states
This is where I spent the most time. The essence is as follows: you need an efficient mapping from a set of Objects (NFA states) to another Object (the DFA state). Since js has no proper hash tables with non-string keys, I had to improvise.
I did the following: each NFA state gets a unique integer .id when the NFA is created (a simple increment). Given a set of such states, I put their .ids into a Buffer, sort them, and interpret the result as a string key via .toString().
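A sketch of building such a key; the 16-bit id width and the 'binary' encoding are assumptions made for the illustration:

```js
// Turn a set of NFA states into a string usable as a plain-object key.
function keyForStateSet(nfaStates) {
    var ids = nfaStates
        .map(function (s) { return s.id; })
        .sort(function (a, b) { return a - b; });
    var buf = new Buffer(ids.length * 2);
    for (var i = 0; i < ids.length; i++) buf.writeUInt16LE(ids[i], i * 2);
    return buf.toString('binary');
}

// dfaStatesByKey[keyForStateSet(nfaStates)] then looks up or stores the DFA state.
```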
Using the specifics of the NFA structure
The NFA that results from the email patterns has some particular properties:
- Each state has at most two outgoing transitions. Initially this was implemented as an array of arbitrary length; when I reimplemented it as a pair of properties, the execution time dropped by 20% (see the sketch after this list).
- At creation time, the NFA states can be numbered (their .id filled in) in such a way that the sets of them built later come out already sorted, eliminating the need to sort when constructing the string key described above.
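A sketch of an NFA state shaped accordingly (names are illustrative):

```js
// At most two outgoing transitions, stored as plain properties instead of an
// array, and ids assigned in creation order so that state sets built later
// come out already sorted.
var nextId = 0;

function NfaState(char) {
    this.id = nextId++; // monotonically increasing
    this.char = char;   // '*', '?' or the literal character this state consumes
    this.next = null;   // first outgoing transition
    this.alt = null;    // optional second outgoing transition
}
```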
Minimizing memory allocation frequency
- When building the DFA, arrays of NFA states are constantly created, and are often discarded right after a string key has been built from them and the corresponding DFA state has been found (if it is not found, a new one is created and the array becomes part of it).
- When constructing the string key, the buffer holding the .ids of the NFA states is always discarded.
In both cases you can create the array and the buffer once, with the maximum required size, when the DFA is created, and reuse them without re-allocating memory (a sketch follows).
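A sketch of this reuse; the builder shape and the sizes are illustrative:

```js
// The scratch array and the key buffer are allocated once, sized for the worst
// case, and reused on every transition instead of being re-allocated.
function DfaBuilder(maxNfaStates) {
    this.scratchStates = new Array(maxNfaStates);   // reused set of NFA states
    this.scratchIds = new Buffer(maxNfaStates * 2); // reused key buffer
    this.scratchCount = 0;                          // length of the filled prefix
}

DfaBuilder.prototype.currentKey = function () {
    // Only the filled prefix of the reused buffer participates in the key.
    return this.scratchIds.toString('binary', 0, this.scratchCount * 2);
};
```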
Final assessment of correctness
By this time I was running out of ideas for optimization, and I still had no complete confidence in the correctness of the solution. To correct this unfortunate omission, I wrote another Python script that slowly, over several days, drove a small piece of the Enron dataset (about 500 messages against 1000 rules) through the reference implementation. And, as it turned out, not in vain: the resulting test cases revealed a bug in the edge case where two stars in an email pattern are separated by a single character. It was hard to find but easy to fix.
Commentary on the solution evaluation methodology
The organizers' benchmark runs the test modules through Node's vm module instead of a plain require, which, as it turned out, has unexpected effects on performance.
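For reference, the two ways of loading a solution module look roughly like this; './solution.js' is a placeholder path, and the sandbox setup is only an approximation of the organizers' harness:

```js
var fs = require('fs');
var vm = require('vm');

// Plain require: the module is loaded the usual way.
var viaRequire = require('./solution.js');

// Through the vm module, with the module source evaluated in a fresh context.
var code = fs.readFileSync('./solution.js', 'utf8');
var sandbox = { module: { exports: {} }, require: require, console: console };
vm.runInNewContext(code, sandbox, 'solution.js');
var viaVm = sandbox.module.exports;
```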
The organizers decided to review the results. I was also surprised by the relatively small size of the input data used to evaluate the solutions. I drove the final version up to 500,000 messages and 1,000 rules, and expected even larger sizes in the organizers' benchmark. On such a dataset, and launched through require, the three official leaders show the following results:
- Yuri Kilochek ~ 2300ms (me)
- Denis Kreshikhin ~ 2850ms
- Ilya Makarov ~ 7500ms
For those interested, I have published my solution together with the benchmarks. To avoid such misunderstandings in the future, I suggest averaging the result over input data of different sizes, or holding the competition in several 'weight categories'.