ABBYY Cup: Debriefing

As it is known to those interested, more than a month ago ABBYY Cup, a student online sports programming contest, was held. For those who have not heard anything about it, I recommend reading this topic first .

In each round, we offered participants 6 tasks each, you could get 100 points for each, but for the light division Codeforces composed an additional seventh to entertain those participants that make the light division seem very easy.

As in the past year, solutions were evaluated using automatic tests of various levels of complexity. There were several groups of tests at the ABBYY Cup, in the lightweight division - two, in the difficult - three. Different tests are needed to distinguish those who did the task well, from those who did very well. Did you run the code within the specified limits with 20 input values? Get on input 50, we'll see :)
')
In addition, there was an additional rating system. If a person tried to solve problems many times and sent, say, the wrong decision 10 times and sent the correct decision on the 11th, he will be lower in the rating than the person who solved longer, but sent a solution that immediately passed all the tests.

Representative sample

The task

ABBYY's Smart Beaver has been a long-time collaborator with the Cytology and Genetics Research Institute. Recently, the workers of this scientific research institute offered Beaver a new task. The essence of the problem was as follows.

There is a set of n proteins (not necessarily different). Each protein is a string consisting of small letters of the Latin alphabet. The task that scientists have set for Smart Beaver is to select a subset of power k from the initial set of proteins, and the representativeness of the selected subset of proteins should be maximized.

The ABBYY Smart Beaver did a little research and came to the conclusion that the representativeness of a set of proteins can be assessed by one number, which is quite simple to calculate. Suppose we have a set { a ₁ , ..., a _k } of k lines describing proteins. The representativeness of this set is the following value:

where f ( x , y ) is the length of the greatest common prefix of the strings x and y ; for example, f ("abc", "abd") = 2, and f ("ab", "bcd") = 0. Thus, the representativeness of the set of proteins {"abc", "abd", "abe"} is 6 , and the set {"aaa", "ba", "ba"} - 2.

After this discovery, the Smart Beaver from ABBYY asked the participants of the Cup to write a program that chooses from the given set of proteins a subset of the power k , which has the greatest possible value of representativeness. Help him with this task!

Task analysis

Immediately proceed to the complete solution of the problem. Let's sort the lines lexicographically and calculate lcp [ i ] - the length of the greatest common prefix of the lines i and i + 1. It is easy to prove that the greatest common prefix of any two lines i and j is now equal to min ( lcp [ i ], lcp [ i + 1] , ..., lcp [ j - 1]) (a similar statement is used when using a suffix array). This is useful.

Now apply the dynamic programming method. Let's make the following DP: F ( i , j , k ) is the maximum possible answer to the problem, if you need to select exactly k lines from lines i ... j . The transition is as follows: we find the position of p on the segment i ... j - 1 with the minimum lcp [ p ], then F ( i , j , k ) = max _{q = 0 ... k} ( F ( i , p , q ) + F ( p + 1, j , k - q ) + q · ( k - q ) · lcp [ p ]), that is, we assume that q is selected from rows from i ... p , k - q lines from p + 1 ... j , and for of all pairs of these lines, the length of the longest common prefix is lcp [ p ] (this follows from the statement in the first paragraph).

This solution is enough to get 50 points. So far it seems that the complexity of this solution is of order O ( N ⁴ ). It remains to show that in fact it is possible to implement this solution so that its complexity is equal to O ( N ² ). To begin, we note that the various segments i ... j , on which we are interested in the values of the function F , are in fact exactly 2 N - 1, not O ( N ² ). Why this is so can be understood from the transitions of our DP. The segment 1 ... N is divided into two segments 1 ... p and p + 1 ... N , each of these segments is further divided into two segments, and so on; we see that between positions i and i + 1 for any i, exactly once in the entire calculation of the DP, there will be a “cut”, and this cut will generate two new segments in which we are interested in the values of the DP. Hence, we obtain an estimate of 2 N - 1 necessary segments, therefore, our solution already has a complexity of no more than O ( N ³ ).

The last and most difficult thing is to guess that the solution above may have O ( N ² ) complexity. For example, you can implement it like this. Let's get the recursive function F ( i , j ), which returns an array of length j - i + 2 - the actual values of the DP F ( i , j , k ) (it is clear that we are only interested in the values of k from the segment from 0 to j - i + 1 ). Inside this function we do the following: find p , call F ( i , p ) and F ( p + 1, j ), now iterate x from 0 to p - i + 1, iterate y from 0 to j - p and update the value of F ( i , j , x + y ), if it has improved (more details can be seen in the author’s solution, everything is quite transparent there). The maximum time of the function F , called on the segment i ... j of length j - i + 1 = N , can be written as T ( N ) = max _{X = 1 .. N - 1} ( T ( X ) + T ( N - X ) + ( X + 1) · ( N - X + 1)). It remains to understand that T ( N ) = O ( N ² ). This can be done, for example, by writing a simple DP to calculate T. If you think about it, you can understand it this way: when we cycle through x = 0 ... p - i + 1 and y = 0 ... j - p , let's imagine that we are sorting like a couple of lines: x = i ... p and y = p + 1 ... j (here the search over x and y takes one less iteration, but it doesn’t matter); then each pair of lines for the entire DP calculation will be moved exactly once, which gives us the final complexity O ( N ² ).

Here you can see the author's decision.

Many participants handed over decisions using boron. Such solutions may also have O ( N ² ) complexity, however, they work longer and use much more memory. However, time and memory restrictions were quite loyal to different solutions :)

You can look at all the tasks on Codeforces: light division , light division analysis , complex division , complex division analysis .

In general, our colleagues note the high level of participants in the ABBYY Cup. In the difficult division, all 6 problems were solved by 10 people - this is quite a lot. A good preparation of the participants is indicated by the fact that in the unofficial competition (it was possible to participate not only for students, but also for schoolchildren), the schoolboy, the winner of the All-Russian Programming Contest for Schoolchildren, entered the top five.

Judging by the reviews, most of the participants liked to solve our problems, the heuristic recognition problem seemed especially interesting to the guys. Well, next year our colleagues will definitely prepare something interesting for them in the same vein. Like last year, in the beginning of July, ABBYY Cup participants will come to our office, at their request we will hold another small one, this time a full-time competition.

Source: https://habr.com/ru/post/146081/

All Articles

ABBYY Cup: Debriefing

Representative sample

More articles: