
Everyone knows the social network Facebook. Many have heard of the programming puzzles its administration published as a way of recruiting programmers (although, judging by the forum comments, this practice has long since been suspended). Some have tried to solve these puzzles. Some have even succeeded. But only a few have shared their experience, and that experience, I must say, is very, very useful. So, having gathered my thoughts, I decided to correct this omission a little.
A small disclaimer: this article will not feature beautiful code that follows OOP principles, accepted conventions, and the other things popular nowadays. There is always time to refactor working code nicely later; the task of this article is to write code that actually works.
So, breathalyzer. This is a snack-complexity task in Facebook's classification, i.e. by their standards it is not complicated at all. That did not stop me from spending a good couple of weeks on it (partly because of a principled desire to solve it in Ruby). It was the second puzzle I attempted, and it was this one that led me to the main thought that pushed me to put so much effort into finding a solution. And the thought was this: I do not know how to program...

Yes. I graduated from a decent technical university with an IT degree. I have been making a living with Delphi programming for several years, and at my leisure I write web applications in RoR for self-development and for the sake of my ego, but I never did learn to program. What led me to this thought? Let's find out together.
The problem statement
So, you are a Facebook programmer, which in itself is not bad. The administration of the social network has become concerned about the number of wall posts written by drunk people (who themselves sometimes regret them later), and so you are given the task of building an online breathalyzer that determines a person's degree of intoxication from the text they write. The number of messages posted to the walls of the social network is far too large to moderate manually. But that is no obstacle to a good programmer: the process can be automated.

The terms are simple. If we omit the details about the formatting of input texts and results, as well as the target environment of the code, the program is given a text with errors and a dictionary of correct words. The task is to count the number of errors in each word and print the total number of errors for the whole text. The number of errors in a word is the number of corrections needed to bring it into full agreement with one of the dictionary words, where a correction is inserting one letter, deleting one letter, or replacing one letter with any other (for example, 'tast' is one correction away from 'taste': insert the missing 'e'). Hmm... it's not immediately obvious how to count this, but fine: the eyes fear, the hands do. Let's go. Oh yes, an important addition in the problem statement says that the program must be time-efficient.
Solution. Iteration 1
The first problem to solve is actually counting the errors in a word. A quick Google search shows that everything has been invented before us: the quantitative measure of the difference between words under the given conditions is called the Levenshtein distance. A description of the algorithm for finding this distance has already appeared on Habr, so I will not dwell on it in detail; I will just give its implementation in Ruby. So, the heart of our solution looks like this:
```ruby
def calc_levenshtein_distance(s, t)
  n = s.length
  m = t.length
  return m if n == 0
  return n if m == 0
  d = Array.new(n + 1).map! { Array.new(m + 1) }
  for i in 0..n
    d[i][0] = i
  end
  for j in 0..m
    d[0][j] = j
  end
  for i in 1..n
    for j in 1..m do
      if s[i - 1] == t[j - 1]
        cost = 0
      else
        cost = 1
      end
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  return d[n][m]
end
```
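A quick sanity check of the function on a couple of pairs (my own examples, not from the original article):

```ruby
puts calc_levenshtein_distance('kitten', 'sitting')  # => 3 (two substitutions and one insertion)
puts calc_levenshtein_distance('tast', 'taste')      # => 1 (one insertion)
```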
A link to the dictionary is given in the problem statement (just in case, I have duplicated it), and several sample input texts for testing have been posted by enthusiasts on the forum (187.in, david_22719.in, 4.in, 12.in), so we can proceed. The plan is simple. We read in the entire dictionary (it is conveniently formatted, one word per line, sorted alphabetically), read the input text, and start a loop: for each word of the input text we look for the closest word in the dictionary. How do we look? We run through the whole dictionary, invoking the Levenshtein distance function:
```ruby
def find_word(word, dic)
  result_word = ''
  result_cost = 999
  dic.each do |dic_word|
    ld = calc_levenshtein_distance(word, dic_word)
    if ld < result_cost
      result_word = dic_word
      result_cost = ld
    end
  end
  return {:word => result_word, :cost => result_cost}
end
```
The main body of the program, minus the reading of the dictionary:
```ruby
total_cost = 0
words.each do |word|
  total_cost += find_word(word, dic_words)[:cost]
end
```
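For completeness, the reading part might look something like this (a sketch of my own; if memory serves, the puzzle specification placed the word list at /var/tmp/twl06.txt, but treat both file paths here as assumptions):

```ruby
# Read the dictionary (one word per line) and the input text, upper-casing both.
dic_words = File.read('/var/tmp/twl06.txt').split(/\s+/).map { |w| w.upcase }
words = File.read(ARGV[0]).split(/\s+/).map { |w| w.upcase }
```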
The full text of the first version can be found here. This is where the first doubts visited me, but fine: the first run on what would become my favorite test file (on almost every iteration I checked against 187.in first)... Hmm. After a ten-minute wait it became clear that this program would hardly be considered time-efficient.
Well, time to find where the bottleneck is buried. The main part of the program is the Levenshtein distance function wrapped in two loops: over the words of the incoming text and over the dictionary. The dictionary contains nearly 180 thousand words. Plenty of work there...
Solution. Iteration 2
To begin with, let's make the simplest optimizations. First, if the word in question is present in the dictionary unchanged, then without further ado we take its error count to be zero:

```ruby
return {:word => word, :cost => 0} if dic.include?(word)
```
Secondly, some inputs contain repeated words, so we cache the search result for every word we have already processed:

```ruby
res = already_found.include?(word) ? already_found[word] : find_word(word, dic_words)
already_found[word] ||= res
```
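The same caching could be written a bit more idiomatically with a Hash default block; this variant is mine, not the author's:

```ruby
# A hash that computes and memoizes the search result on first access.
already_found = Hash.new { |h, w| h[w] = find_word(w, dic_words) }
total_cost = words.inject(0) { |sum, word| sum + already_found[word][:cost] }
```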
Thirdly, let's add debug output. According to the problem statement the program should print a single number to the console, but during development it is nice to see a bit more information (the full text of the second version is here). This iteration achieved a relatively sane run time on the 4.in and 12.in input files (about three minutes on my home computer) and, just as importantly, produced correct results. Waiting out 187.in is still not easy, even with every processed word printed to the console. Well, at least we can be sure the program works correctly. For now.
Solution. Iteration 3
The problem statement says that errors in words can be corrected in only three ways: add a letter, remove a letter, replace a letter with any other. It follows that two words whose lengths differ by N are at Levenshtein distance at least N from each other (for example, no three-letter word can be closer than distance 2 to any five-letter word). What does this give us? We can split the dictionary into groups by word length. When analyzing the next word of the input text, we start the dictionary search with the words of exactly the same length, continue with those differing in length by 1, and so on. As soon as the difference in length exceeds the minimum distance found so far, we know there is no point in continuing.
First, we construct a dictionary of the required structure:

```ruby
dictionary = {}
dic_words.each do |w|
  dictionary[w.length] ||= []
  dictionary[w.length] << w
end
```
Then we rework the dictionary search function so that it walks the words in the required order:

```ruby
def find_word(word, dic)
  if dic[word.length]
    return {:word => word, :cost => 0} if dic[word.length].include?(word)
  end
  result_word = ''
  result_cost = 999
  for offset in 0..15 do
    return {:word => result_word, :cost => result_cost} if result_cost <= offset
    indexes = [word.length - offset, word.length + offset].uniq
    indexes.each do |index|
      next unless dic[index]
      dictionary_part = dic[index]
      dictionary_part.each do |dic_word|
        ld = calc_levenshtein_distance(word, dic_word)
        if ld < result_cost
          result_word = dic_word
          result_cost = ld
        end
      end
    end
  end
  return {:word => result_word, :cost => result_cost}
end
```
Hooray! Program execution on 12.in and 4.in dropped to 70 seconds (the full text can be found here). It even became possible to wait out the result on 187.in: about 17 minutes. As my math teacher used to say, better already, but still a failing grade. The david_22719.in file remains out of reach.
Solution. Iteration 4
Forgive me, Rubyists and lovers of other interpreted languages, but somewhere around this stage I realized that in tasks where execution time is critical, such languages cope poorly. Somewhere around this stage I implemented the task in C and saw the light. Brute-forcing each word across the entire dictionary processed 187.in in seconds. Within a few minutes the program handled even david_22719.in. The solution submitted to the Facebook bot came back with the note:
After running your solution to breathalyzer (received on January 18, 2011, 10:40 am), I have determined it to be correct. Your solution ran for 1585.775 ms on its longest test case.
The Ruby solution, time after time, came back rejected:

Thank you for your puzzle! After running your solution to breathalyzer (received on January 18, 2011, 12:20 pm)...
Well, that means it's time to look for other avenues of improvement. The heart of the program is the Levenshtein distance function, for which more than one algorithm exists; but all of them except the simplest are... how shall I put it... somewhat nontrivial. Having dug around for a few hours, I decided to look for a different approach. And I found one. Unfortunately, I can't find the link right now (it was long ago), but the idea was to build the list of all possible first-order errors for a given word and check whether at least one of them is in the dictionary. The function that builds the list of typos:
```ruby
def get_possible_edits(word)
  letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.split(//)
  res = []
  # Deletions: remove one letter.
  for i in 0..(word.length - 1)
    str = String.new(word)
    str[i] = ''
    res << str
  end
  # Substitutions: replace one letter with any other.
  for c in letters
    for i in 0..(word.length - 1)
      next if word[i] == c
      str = String.new(word)
      str[i] = c
      res << str
    end
  end
  # Insertions: insert one letter at any position.
  for c in letters
    for i in 0..(word.length)
      str = String.new(word).insert(i, c)
      res << str
    end
  end
  return res
end
```
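To get a sense of scale (my own estimate, not from the article): for a word of n letters the function produces n deletions, about 26·n substitution variants, and 26·(n+1) insertions, i.e. roughly 53·n candidate strings:

```ruby
p get_possible_edits('DOG').length  # => a couple hundred candidates for a 3-letter word
```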
And we add to the dictionary search function the line:

```ruby
minimum_errors = 1
```
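The article's snippet shows only this line; judging by the second-order code given below, the first-order check that accompanies it presumably looked something like the following (my reconstruction, not the author's verbatim code):

```ruby
# If any single-edit variant of the word is a dictionary word, the cost is exactly 1.
edits = get_possible_edits(word)
edits.each do |edit|
  return {:word => edit, :cost => 1} if dic[edit.length] ? dic[edit.length].include?(edit) : false
end
# No single-edit variant matched, so the word contains at least two errors.
minimum_errors = 2
```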
Now the further search can be cut short, taking into account that the word contains at least minimum_errors errors (since we have already enumerated all the cheaper options):

```ruby
for offset in 0..15 do
  return {:word => result_word, :cost => result_cost} if (result_cost <= offset) or (result_cost <= minimum_errors)
  ...
end
```
These changes improved the run time on 187.in to 14 minutes and on 12.in to 64 seconds, while 4.in runs in a fraction of a second (not surprising: it contains no words with more than one error). A small win. Now we could go even further and catch second-order errors by the same method. It should be understood, however, that the number of possible first-order typos grows linearly with the length of the word, and the number of second-order typos grows quadratically with the number of first-order ones (for a six-letter word that is already on the order of a hundred thousand candidate strings). So if we simply add this piece of code:
```ruby
edits.each do |edit|
  edits2 = get_possible_edits(edit)
  edits2.each do |edit2|
    return {:word => edit2, :cost => 2} if dic[edit2.length] ? dic[edit2.length].include?(edit2) : false
  end
end
minimum_errors = 3
```
then with remarkable ease we increase the program's run time on the 12.in input file tenfold. It only makes sense to apply this method to short words. Experimentally I found that the biggest gain comes from applying it to words no longer than six characters:

```ruby
if word.length <= 6
  edits.each do |edit|
    edits2 = get_possible_edits(edit)
    edits2.each do |edit2|
      return {:word => edit2, :cost => 2} if dic[edit2.length] ? dic[edit2.length].include?(edit2) : false
    end
  end
  minimum_errors = 3
end
```
On the 187.in file this trick wins back about 10% of the execution time. At this point the trivial optimizations are exhausted, which means it is time to dive into the math.
Solution. Iteration 5
The Levenshtein distance is not a new thing in mathematics, and many algorithms have been invented for computing it; however, understanding them requires solid mathematical preparation, unlimited time, and an unshakable belief that Olympian gods figure somewhere in your family tree. I spent more than one evening working through the mathematical-algorithmic papers and eventually settled on the Ukkonen algorithm.

Here, before writing code, we will have to speculate a little with the help of mathematics. I am not sure that I understood the essence of the algorithm correctly, especially since a literal implementation of what is described there did not work for me. But I can tell you what I did understand, and what works.
Picture the Levenshtein distance table. For simplicity, consider the case where it is wider than it is tall (we lose no generality, since we can always swap the two words). We are interested in the value in the bottom-right cell of the table. In the standard algorithm it is computed last. But it turns out you do not need to fill in the entire table to obtain it! The essence of Ukkonen's method is that only a few diagonals are computed. Which diagonals, and how to choose them correctly? That is the main question.
Let's call the diagonal that starts in the top-left corner the main diagonal. If the two words have the same number of letters, it also ends in the desired bottom-right corner. This diagonal is always computed, no matter what. How many diagonals are needed in total? I am not ready to give a mathematical justification, but empirically I found that there must be at least three.
If the word lengths differ by at least two characters, then all the diagonals to the right of the main one are computed, up to the one that ends in the bottom-right corner (remember, we agreed that when the lengths differ, the longer word is written horizontally).
[Figure: the computed diagonals when the word lengths differ by two or more]
If the words differ in length by one letter, then one more diagonal is added to the left of the main one; if the words are the same length, one diagonal is added on each side of the main one:
[Figure: the computed diagonals for lengths differing by one and for equal lengths]
In the illustrations, the main diagonal is highlighted in green, the computed diagonals in yellow, and the solution cell in red.
At each step, the table-filling function computes the value of a new cell as the minimum over the three cells adjoining it (plus the transition costs), so to keep the missing cells of the table from affecting the calculation, they have to be initialized with some obviously large number, for example 50. The top-left cell, as before, is initialized to zero. In code it looks like this:
```ruby
def min3(a, b, c)
  m = a < b ? a : b
  return m < c ? m : c
end

def calc_levenshtein_distance(s, t)
  # Make sure s is the longer word (it is written "horizontally").
  return calc_levenshtein_distance(t, s) if s.length < t.length
  n = s.length
  m = t.length
  return m if n == 0
  return n if m == 0
  d = Array.new(m + 1).map! { Array.new(n + 1).map! { 50 } }
  d[0][0] = 0
  p = (n == m) ? 1 : 0
  # (The article's listing breaks off here; the rest is reconstructed
  # from the C version given in the "White flag" section below.)
  q = (n - m <= 1) ? 1 : 0
  loop1 = [n, n - m + p].min
  for j in 1..loop1
    d[0][j] = d[0][j - 1] + 1
  end
  for i in 1..m
    loop2 = [n, i + n - m + p].min
    for j in (i - q)..loop2 do
      if j == 0
        d[i][j] = d[i - 1][j] + 1
      else
        cost = (s[j - 1] == t[i - 1]) ? 0 : 1
        d[i][j] = min3(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
      end
    end
  end
  return d[m][n]
end
```
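A quick sanity check of the banded function on pairs with small distances, where the band is wide enough to be exact (my own examples):

```ruby
p calc_levenshtein_distance('HELLO', 'HELP')   # => 2 (one substitution, one deletion)
p calc_levenshtein_distance('HELLO', 'HELLO')  # => 0
```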
The improvement is obvious: 187.in now runs in just 6 minutes, and 12.in in 25 seconds. Still, it would be good to come up with something more. The full text of the fifth iteration is here.
Solution. Iteration 6
How else can you speed up a function that has already been accelerated as far as possible? Well, for example, you can simply not call it. There is no need to compute the Levenshtein distance to words that obviously differ from ours by no less than the best distance already found. The question is how to estimate that difference faster than computing the Levenshtein distance itself, so that there is any gain. I do not claim my solution is ideal, but I did come up with something. I decided to store, for each word, an array of 26 values, each counting how many times the corresponding letter of the English alphabet occurs in the word. The difference between two words over these arrays is the sum of the absolute differences of the corresponding elements:
```ruby
def arr_diff(a1, a2)
  res = 0
  for i in 0..25 do
    res += (a1[i] - a2[i]).abs
  end
  return res
end
```
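Building the letter-count arrays is straightforward; a minimal sketch (the letter_counts helper is my name, not the author's):

```ruby
# Build the 26-element letter-count array for an upper-case word.
def letter_counts(word)
  counts = Array.new(26, 0)
  word.each_byte { |b| counts[b - 65] += 1 }  # 65 is the code of 'A'
  counts
end

# Anagrams have identical counts: arr_diff is 0 here, even though the
# Levenshtein distance between 'SILENT' and 'LISTEN' is 4.
p arr_diff(letter_counts('SILENT'), letter_counts('LISTEN'))  # => 0
```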
Such a difference is, of course, an even cruder measure than the Levenshtein distance, but it is much cheaper to compute. How can it help? How is the Levenshtein distance related to the difference of these letter-count arrays? The key connection is that inserting or deleting a single letter adds one to both values. If you swap two letters within a word, the Levenshtein distance of the new word from the original is 2, but the array difference is zero, since the multiset of letters is preserved. And if you replace one letter with any other, the array difference grows by 2, while the Levenshtein distance grows by only 1. The bound from below is therefore rather weak: beyond the difference in word lengths there can be any number of letter permutations that increase the Levenshtein distance without increasing the array difference (the only extra lower-bound correction available is one, for the case where the minimum number of errors in the word is odd, because an odd Levenshtein distance cannot be produced by permutations of letters alone). The bound from above: the array difference cannot exceed twice the Levenshtein distance.

Based on the above, for two words at Levenshtein distance diff whose lengths differ by offset, the array difference lies in the interval [offset, 2 * diff - offset]. In fact, the lower bound can also take into account the estimated minimum number of errors in the word (for example, if enumerating all first-order typos did not find the word, we know it contains at least two errors). If the difference between the minimum number of errors and the length difference is odd (that is, of the minimum of N errors remaining beyond the length difference, N is odd), the minimum possible array difference can be raised by one: [offset + (min_errors - offset) % 2, 2 * diff - offset] (for even N no additional constraint can be imposed, since an even number of Levenshtein edits can be realized as letter permutations that do not affect the array difference).
Let us try to use this knowledge to filter the words fed to the Levenshtein distance function. We will compute the arrays for dictionary words not at dictionary-reading time but lazily, as needed, so as not to do useless work on short input files; while reading, we only allocate the corresponding arrays:

```ruby
dictionary = {}
dic_words.each do |w|
  dictionary[w.length] ||= {}
  dictionary[w.length][w] = Array.new(26, 0)
end
```
On each comparison of the current word against the dictionary we know the result of the previous comparisons. If we have already found a word at distance 3 from the target, words at a greater distance no longer interest us, so we can impose the corresponding restriction on the array difference. The dictionary search function now begins like this:

```ruby
def find_word(word, dic)
  if dic[word.length]
    return {:word => word, :cost => 0} if dic[word.length].include?(word)
  end
  minimum_errors = 1
```
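The listing breaks off here in the article; a minimal sketch of how the filter could plug into the search loop from iteration 3 (my reconstruction under stated assumptions: the letter_counts helper from above, lazily filled count arrays, and the first-order-edits check omitted for brevity):

```ruby
# Reconstruction sketch, not the author's verbatim code.
def find_word(word, dic)
  if dic[word.length]
    return {:word => word, :cost => 0} if dic[word.length].include?(word)
  end
  minimum_errors = 1  # the first-order-edits check from iteration 4 would run here
  word_arr = letter_counts(word)
  result_word, result_cost = '', 999
  for offset in 0..15 do
    return {:word => result_word, :cost => result_cost} if result_cost <= offset or result_cost <= minimum_errors
    [word.length - offset, word.length + offset].uniq.each do |index|
      next unless dic[index]
      dic[index].each do |dic_word, dic_arr|
        # For any word at distance < result_cost, the array difference is at
        # most 2 * (result_cost - 1) - offset; skip everything that exceeds it.
        next if arr_diff(word_arr, dic_arr) > 2 * (result_cost - 1) - offset
        ld = calc_levenshtein_distance(word, dic_word)
        if ld < result_cost
          result_word = dic_word
          result_cost = ld
        end
      end
    end
  end
  return {:word => result_word, :cost => result_cost}
end
```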
This brings the run time for 187.in down to 66 seconds (12.in: 5.5 seconds). That is quite good compared to the time shown (or rather, not shown) by the first version of the program, and it even makes it possible, after a couple of hours' wait, to see the answer for david_22719.in; but, as it turned out, it is still not enough to get past the Facebook bot.
Solution. Iteration 7
And here I hit a wall. I could not come up with any further algorithmic speed-up, so I resorted to purely technical tricks based on the peculiarities of Ruby in general and of version 1.8.6 in particular. For example, I found experimentally that the ternary operator (a ? b : c) runs faster than a traditional if-else. I learned that ugly computations crammed into one line can win fractions of a second over variants with temporary variables. I debunked the myth that short variable names speed things up. I found out that parsing a whole file with a regular expression works faster than line-by-line reading... Many hours of polishing the code, hunting down a spare object initialization here or an unnecessary memory allocation there, knocked a couple more seconds off the time, but it still was not enough to get past the bot. And then I gave up.
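The kind of micro-benchmark such comparisons rest on might look like this (my own sketch using Ruby's standard benchmark library, not code from the article):

```ruby
require 'benchmark'

n = 1_000_000
Benchmark.bm(10) do |bm|
  bm.report('ternary') { x = nil; n.times { |i| x = i > 500_000 ? 1 : 0 } }
  bm.report('if-else') { x = nil; n.times { |i| if i > 500_000 then x = 1 else x = 0 end } }
end
```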
Solution. The white flag
I decided to give up not quite in the usual sense of the phrase. I abandoned the idea of solving the problem in Ruby. At least, in pure Ruby... Yes, you understood correctly: I decided to cheat a little and turn to the power of C. Having smoked the relevant manuals for a couple of hours, I wrote a Ruby extension that computes the Levenshtein distance between two strings:
```c
#include <ruby.h>
#include <string.h>
#include <stdlib.h>

#ifndef max
#define max(a, b) (((a) > (b)) ? (a) : (b))
#endif
#ifndef min
#define min(a, b) (((a) < (b)) ? (a) : (b))
#endif

void Init_levenshtein();
VALUE distance(VALUE self, VALUE rwm, VALUE rwn);
int LeveDist(char *s, char *t);

void Init_levenshtein()
{
    VALUE cLevenshtein = rb_define_class("Levenshtein", rb_cObject);
    rb_define_singleton_method(cLevenshtein, "distance", distance, 2);
}

VALUE distance(VALUE self, VALUE rwm, VALUE rwn)
{
    char *wm, *wn;
    int res;
    wm = StringValuePtr(rwm);
    wn = StringValuePtr(rwn);
    res = LeveDist(wm, wn);
    return INT2NUM(res);
}

int min3(int a, int b, int c)
{
    int Min = a;
    if (b < Min) Min = b;
    if (c < Min) Min = c;
    return Min;
}

int LeveDist(char *s, char *t)
{
    char *wm;       /* the shorter word */
    char *wn;       /* the longer word */
    int m, n, q, p, cost, i, j;
    int loop1, loop2;
    char *d;
    int res = -1;

    if (strlen(s) > strlen(t)) { wm = t; wn = s; }
    else                       { wm = s; wn = t; }
    m = strlen(wm);
    n = strlen(wn);

    d = calloc((m + 1) * (n + 1), sizeof(char));
    memset(d, 50, (m + 1) * (n + 1));  /* the "obviously large" filler */
    p = (n == m) ? 1 : 0;
    q = (n - m <= 1) ? 1 : 0;
    *d = 0;

    loop1 = min(n, n - m + p);
    for (j = 1; j <= loop1; j++)
        *(d + j) = *(d + j - 1) + 1;

    for (i = 1; i <= m; i++) {
        loop2 = min(n, i + n - m + p);
        for (j = i - q; j <= loop2; j++) {
            if (j == 0) {
                *(d + i * (n + 1) + j) = *(d + (i - 1) * (n + 1) + j) + 1;
            } else {
                cost = 1;
                if (*(wm + i - 1) == *(wn + j - 1)) cost = 0;
                *(d + i * (n + 1) + j) = min3(*(d + (i - 1) * (n + 1) + j) + 1,
                                              *(d + i * (n + 1) + j - 1) + 1,
                                              *(d + (i - 1) * (n + 1) + j - 1) + cost);
            }
        }
    }

    res = *(d + m * (n + 1) + n);
    free(d);
    return res;
}
```
On the Ruby side, using the extension comes down to:

```ruby
require 'levenshtein'
...
Levenshtein.distance(word, dic_word)
```
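The article does not show the build step, but for a one-file extension like this the conventional mkmf recipe applies (my sketch):

```ruby
# extconf.rb
require 'mkmf'
create_makefile('levenshtein')
```

Then `ruby extconf.rb && make` produces the loadable levenshtein library.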
Solution. Victory!
After sorting out the build first under Windows and then under Linux, the answer finally came:
Thank you for your submission of a puzzle solution to Facebook! After running your solution to breathalyzer (received on January 25, 2011, 3:04 am), I have determined it to be correct. Your solution ran for 5097.469 ms on its longest test case.
Conclusion
Sorry for the wall of text. Here is what I took away from this whole adventure:
- All the timings quoted in the article were measured on my home computer (Core i7, 2.66 GHz).
- In tasks where execution time is critical, pure Ruby copes poorly.
- The brute-force solution in C chews through 187.in in mere seconds.