📜 ⬆️ ⬇️

3rd place in 11 steps in the Hola JavaScript contest

Surely many of you have already flashed the headlines of articles from the competition from Hola , which recently came to its logical conclusion. In the final results I was lucky to be in 3rd place. For this reason, I allowed myself to share a description of my decision , as well as how I came to him.

1. Why am I worse?


The very first decision, naturally, "in a forehead". We form a pair of regular expressions for each rule, and then we start the cycle according to the messages with a nested loop according to the rules. The code to the disgrace is simple (and, as we already know, it easily fits into this frightening number of 666 bytes), so I don’t see the point here.

2. Long live dynamic functions!


The next step I decided to expand the inner loop. To do this, preliminarily, based on the rules, we dynamically form the getActions (from, to) function and call it in a loop for each message. The filter takes about the following form:

function filter(messages, rules) { //   var builder = []; builder.push('var actions = [];'); for ( var i = 0; i < rules.length; i++ ) { var rxFrom = createRegex(rules[i].from); if ( rxFrom ) builder.push(escapeRegex(rxFrom), '.test(from) && '); var rxTo = createRegex(rules[i].to); if ( rxTo ) builder.push(escapeRegex(rxTo), '.test(to) && '); builder.push('actions.push(\'', escapeString(rules[i].action), '\');'); } builder.push('return actions;'); var getActions = new Function('from, to', builder.join('')); //   var result = {}; for ( var key in messages ) { var message = messages[key]; result[key] = getActions(message.from, message.to); } return result; } 

Below is an example of the generated getActions function:
')
 function getActions(from, to) { var actions = []; /^.*@work\.com$/.test(from) && actions.push('tag work'); /^.*@spam\.com$/.test(from) && actions.push('tag spam'); /^jack@example\.com$/.test(from) && /^jill@example\.org$/.test(to) && actions.push('folder jack'); /^jill@example\.com$/.test(to) && actions.push('forward to jill@elsewhere.com'); return actions; } 

The advantage of this approach is that inside the function we are free to pre-program any conditions necessary for a particular rule. Specifically in the above example, if the rule does not specify the from or to property, then there is no need to add to the function body a check of the corresponding message property for this rule. And if none of these properties is specified in the rule, then the action specified in the rule is simply written into the resulting array without any checks.

When using the same nested loop, to achieve this behavior, we would have to insert additional checks into the loop body, calculated at each iteration, which, of course, imposes additional costs.

3. Himself, all by yourself!


Further, I was still honored to get away from regular expressions and wrote the match (value, pattern) function, which determines whether the address matches the specified pattern or not, returning true or false, respectively. Initially, the function I used to operate with characters / strings, but this approach was not very effective, so it became mainly operate with ASCII codes, which are obtained via String.charCodeAt (index) (there was also an attempt to use String.codePointAt (index), but it did not pay off in terms of effectiveness). The function is cumbersome and not very interesting, so I will not bring it here.

To call the match function from getActions, I passed it with an additional parameter after from and to.

4. Sorry, you are definitely not suitable for us.


Looking for a way to further increase performance, I decided that an eigenfunction is, of course, great, but it should be called less often, and again focused on getActions. Actually, at this stage, the task was to, on the basis of some signs, knowingly exclude some of the rules and not call a resource-intensive function for them, and the determination of these signs by itself should not take a lot of time.

One of these signs was just the first character of the template. It can only be used if it was not a "?" or "*". The second sign, length, could be used when there is no “*” symbol in the pattern (at that time I had not yet thought that the minimum length of the address could be checked).

In order not to call a bunch of methods in the from and to properties of the message being processed when checking each rule, at the beginning of the getActions function, we calculate additional variables and use them as necessary:

 function getActions(from, to, match) { var actions = []; var fb = from.charCodeAt(0), fb61 = fb == 0x61, fb74 == 0x74 /*, ... */; var tb = to.charCodeAt(0), tb66 = tb === 0x66 /*, ... */; var fl = from.length, fl14 = fl == 14, fl15 = fl == 15 /*, ... */; var tl = to.length, tl16 = tl == 16 /*, ... */; fb61 && tb66 && fl15 && tl16 && match(from, 'abc@example.com') && match(to, 'fake@example.com') && actions.push('action 1'); fb74 && fl14 && match(from, 'test@gmail.com') && match(to, '*@any.net') && actions.push('action 2'); // ... return actions; } 

This change gave a noticeable increase in productivity. But you can see that in any template that starts with the symbol "*" it is impossible to navigate either the first character or the length. And such templates should definitely meet.

5. And a few more questions.


Therefore, the next step I decided to try to classify templates also by domain. Since anywhere in the domain can also meet "*", then I decided that I would take from it the maximum possible level of the domain up to the 3rd. Accordingly, at the beginning of the getActions function, it is necessary to select the level 1, 2, and 3 domains from the properties of the message from and to and form a large batch of variables indicating that the message address belongs to one or another domain.

As far as I can remember now, the method is not that completely non-working, but with different test data, it manifested itself in different ways. Distinguishing domains from addresses and preliminary calculation of variables at each iteration ate a lot of time. Here I tried a lot of different options. For example, I grouped rarely meeting domains to reduce the number of calculated variables at first (by the way, I also conducted similar groupings with the first characters and lengths), tried to calculate the hash of the domains and compare it. But in the end, I still refused to compare domains.

6. Somewhere nearby


Meanwhile, the thought was all in my head, and not to try to build a single tree from the beginning (and then from the end) of the templates to the first “*” symbol, and shove into its leaves lists of rules that need to be checked? Either I just didn’t really like this method, and I subconsciously built it inefficiently, or it really takes so long to build, but after the first experiments I decided that the creation of this structure is a rather slow thing, and I don’t want to mess with it.

7. Why reinvent the wheel?


I went back to the previous version, but instead of classifying by domains, I simply added a check of the last character of the template.

It seems all is well, but not very. Now it turns out that when checking every rule before the call of functions, the match could be checked up to 6 variables (and initially I wanted to reduce the number of checks), and even calculate all of them for each message. I didn't like it. Besides, experimenting with the number of checks for each rule and their sequence, I came to the conclusion that, with such a long check, the three signs of benefit are not much more than two. Hence the next step.

8. In cramped, not mad


I decided that comparing the lengths of addresses and patterns is the most useless feature, and I excluded features from the excluding rules. Only the first and last character matches are left. From the conditions of the competition, remember that in the lines there can only be characters with codes in the range from 0x20 to 0x7F, which means that all the first and last characters of the patterns can easily fit into one Integer variable and only make one comparison for each rule before calling match. There was only the problem that some templates at the beginning or end could have the characters "?" and "*", which should not participate in the comparison. But this problem is easily solved if before comparing the values ​​to impose on them a mask that resets the extra octets. We get about the following:

 function getActions(from, to, match) { var actions = []; var hash = (from.charCodeAt(from.length-1)<<24) | (from.charCodeAt(0)<<16) | (to.charCodeAt(to.length-1)<<8) | to.charCodeAt(0); (hash & 0xFFFFFFFF) == 0x6D616D66 && match(from, 'abc@example.com') && match(to, 'fake@example.com') && actions.push('action 1'); (hash & 0xFFFFFF00) == 0x6D747400 && match(from, 'test@gmail.com') && match(to, '*@any.net') && actions.push('action 2'); // ... return actions; } 

And of course, the imposition of masks on the hash is also logical to pre-calculate the variables (they can not be more than 16) at the beginning of the function, so as not to perform this operation for each rule.

9. Divide and conquer!


The match function worked quite well, but it was universal for all templates, and that was its drawback. For templates that do not contain the "*" symbol, it was not necessary to go through all these additional procedures, but it is enough to compare two strings character-by-character considering the features of the "?" in the template. The same applies to templates containing only one character "*". In this case, it is logical to check the parts of the template before "*" and after.

As a result, instead of one match function, there are five of them:


In addition, I am passing the template to these functions as an array of character codes, and not as a string (so as not to pull the String.charCodeAt (index) method once again, I don’t really know if it has benefited).

Accordingly, in the process of its formation, the getActions function is substituted for calls to the necessary functions (which are still passed through additional parameters). At the same time, the consecutive "*" characters are removed from the templates, since this does not affect the result, and the comparison function is not. (In the final solution, the implementation of this process is inside the createPatternMatchFunc function.)

10. Only shhh ...


There is one weak point in this decision. In the worst case, if suddenly all templates begin with the same character, and even end with the same character, all of them will have the same hash and the template validation functions inside getActions will be called for absolutely all rules (if the addresses in the message , of course, fit).

In this case, it would be logical to compare not the first, but the second or third characters from the beginning and end of the pattern (provided that they were not "*"). But how to determine which option is most optimal?

I went the following way:

  1. For the first and last three positions of the templates I calculated the frequency of occurrence of certain character codes;
  2. For each position, calculated the standard deviation of these frequencies;
  3. I chose at the beginning and end of the position with the minimum standard deviation (but not those in which there is only one character code).

In principle, it all worked, but the process of forming the getActions function slowed down, which affected the overall speed of the filter. In addition, I was not bothered by the idea that the position with two variants of codes with equal frequencies and the position with eight variants of codes with equal frequencies have the same standard deviations, although it is logically clear that the second option is more appropriate for use. Given the small performance failure, I did not try to balance all this, but simply returned to the analysis of the first and last characters (we were promised real data).

11. Practical magic and not only.



At this stage, the decision was sent for review.

12. Of the unfulfilled


To test the performance of my solution, I prepared test data, which included 600,000 messages and 1000 rules. And among the messages there was not one with a pair of duplicate addresses. Perhaps that is why I somehow had no idea about caching the results.

By adding the caching of the getActions result, with the data used in the contest, on my machine I got a performance increase of 15% for large-data and 2% for xlarge-data (on my data, the performance degraded by 13%). That is, when sending the solution in this form, probably in the final table, my result would be about 248 and 2303 ms instead of 292 and 2351 ms, respectively, and the article would be called “2nd place in 12 steps”. But, alas and oh!

Thanks to the organizers, waiting for new contests!

Source: https://habr.com/ru/post/275343/


All Articles