
JS Programming Contest: Word qualifier (about testing)

First of all, we apologize to all participants in the programming contest for the delay in the results. Today we are publishing all submitted solutions and the official scripts for test generation and testing, and giving an update on how the verification of solutions is going.

The English version of this post is available on GitHub.

Testing 312 solutions


Many thanks to all participants! We received 603 submissions from 312 participants. Since we accept each participant's most recently submitted solution for testing, we need to test 312 solutions. This was an unexpected result, and we hope it partly explains why testing is taking so long.

We have published all the solutions under test in the submissions directory on GitHub. The name of each subdirectory is the unique identifier of a solution; if you sent us a solution, you received an automatic confirmation email containing this identifier. The names or pseudonyms of the solutions' authors will be published along with the final results.
Each directory contains a solution.js file (the actual contest solution) and, if there is one, a data file named data or data.gz (the latter if it is compressed with gzip). This layout matches what the official testing scripts expect. The src subdirectory contains the sources that the author sent in the supplementary archive; the testing scripts ignore it.

Of course, it is no longer possible to change your solution. However, you can send us additional files that you would like to see published in the src directory but did not include in the source archive, for example a README file. This will not affect the test results, but anything that helps us understand a solution better may improve its chances of winning a special prize for originality. If you want to send additional files, drop us an email.

Test Generator


As promised, we are publishing the test generator. It is a command-line interface to the same module that we used in the web service to generate tests. For the same initial values of the pseudo-random number generator, it is guaranteed to produce the same results as the web service. The --help option prints usage instructions.

When it starts, the generator analyzes the dictionary and builds Markov models of various orders (this is why initialization takes considerable time). It then uses a pseudo-random number generator to produce test cases. With probability ½ a random word is taken from the dictionary; in the other cases one of 9 models is chosen at random for generation: Markov chains of orders 1 through 8 and a white-noise generator. The higher the order of a Markov model, the more its output resembles English words. If a generated string happens to be present in the dictionary, it is discarded and a new one is generated. Since this happens more often with high-order models, non-words produced by those models occur less often in the tests than those produced by low-order models. If you are interested in the details, we recommend reading the generator's source.
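
For intuition, here is a minimal sketch of the idea in JavaScript. It is not the official generator: the function names are ours, the mixing of the nine models is omitted, and it uses Math.random instead of a seeded PRNG.

    // A toy order-k Markov word generator, for illustration only. The
    // official generator (see its source) builds nine models over the
    // dictionary and draws from its own seeded PRNG; here we keep a
    // plain transition table and use Math.random.
    function buildModel(words, k) {
      const table = new Map(); // k-letter context -> letters that may follow
      for (const word of words) {
        const padded = '^'.repeat(k) + word + '$'; // '^' pads the start, '$' marks the end
        for (let i = k; i < padded.length; i++) {
          const ctx = padded.slice(i - k, i);
          if (!table.has(ctx)) table.set(ctx, []);
          table.get(ctx).push(padded[i]);
        }
      }
      return table;
    }

    function generateWord(table, k) {
      let ctx = '^'.repeat(k);
      let word = '';
      for (;;) {
        const candidates = table.get(ctx);
        const ch = candidates[Math.floor(Math.random() * candidates.length)];
        if (ch === '$') return word; // reached an end-of-word transition
        word += ch;
        ctx = (ctx + ch).slice(-k); // shift the context window
      }
    }

    // The higher k is, the more the output resembles dictionary words.
    // Example: generateWord(buildModel(['cat', 'cart', 'dog'], 2), 2)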

Serious testing requires a large number of 100-word blocks. To generate them, we use a text string (a “seed string”) to initialize a master sequence of pseudo-random numbers, from which the block numbers are drawn. The same string always produces the same sequence of blocks, and the sequence can be extended indefinitely.
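
A sketch of what such a scheme can look like, with a stand-in hash (FNV-1a) and PRNG (a 32-bit linear congruential generator) of our own choosing; the official generator uses its own algorithms, which you can check in its source.

    // Illustration only: derive a reproducible master sequence of block
    // numbers from a seed string. The hash and PRNG here are stand-ins,
    // not what the official generator actually uses.
    function hashSeedString(str) {
      let h = 2166136261; // FNV-1a offset basis
      for (let i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 16777619) >>> 0;
      }
      return h;
    }

    function makeMasterSequence(seedString) {
      let state = hashSeedString(seedString);
      return function nextBlockNumber() { // a simple 32-bit LCG
        state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
        return state;
      };
    }

    // The same seed string always yields the same, indefinitely
    // extensible sequence of block numbers.
    const next = makeMasterSequence('example seed');
    console.log(next(), next(), next());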

In a few days the web version of the test generator will be shut down. Please use the command-line generator instead; it produces exactly the same results on your computer.

"Battle" test system


For mass testing of solutions we have developed a new, more powerful and flexible test system. The --help option prints usage instructions.

The new test system is functionally equivalent to the old one: for the same tests and programs under test, both systems give the same results. One of its distinctive features is the statistics displayed on screen during testing. Here is what they mean.

Top line: Total is the total number of tested strings, W the number of words, NW the number of non-words, and NW[0] … NW[8] the number of non-words produced by each model: 0 stands for the white-noise generator, 1-8 for the Markov chains of the corresponding orders.

For each solution: Correct is the percentage of correct answers (exactly what determines the place in the final standings), F- is the percentage of false negatives, F+ the percentage of false positives, F+[0] … F+[8] the percentage of false positives for each model separately, and ms/100w the average processing time of one 100-word block in milliseconds. If a solution is restarted when duplicates appear, (R) is displayed to the left of its name. The machine-readable JSON file with the results contains the same information in a self-explanatory format.
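
As a rough illustration of how the headline percentages relate to the raw counts (our own sketch, not the test system's code):

    // Normalizing F- and F+ by the grand total, rather than by the
    // word/non-word counts, is our assumption here.
    function summarize({ wordsRight, nonWordsRight, falseNeg, falsePos }) {
      const total = wordsRight + nonWordsRight + falseNeg + falsePos;
      return {
        Correct: (100 * (wordsRight + nonWordsRight)) / total, // the ranking metric
        'F-': (100 * falseNeg) / total,  // real words rejected as non-words
        'F+': (100 * falsePos) / total,  // non-words accepted as words
      };
    }

    // Example: out of 1,000 strings, 20 words missed and 30 non-words
    // accepted gives Correct = 95, F- = 2, F+ = 3.
    console.log(summarize({ wordsRight: 480, nonWordsRight: 470,
                            falseNeg: 20, falsePos: 30 }));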

Official test sequence


We opened the Twitter feed of @ladygaga :-) and took the first tweet posted after the deadline. The text of this tweet became the official initialization string. To generate the official test sequence, pass the following option to generate.js and test2.js:

--text 'Lady Gaga has accompanied Mario Andretti in a two-seater race car at #Indy500 today!'

(Addition: unfortunately, this tweet has since been deleted; only this retweet remains. Since testing has already started, we are not changing the initialization string.)

The sequence begins with the following block numbers: 416943456, 1130928813, 1690493249, 1560679505, 1146154276, 403245362, 1509995579, 2138666856, 1841363379, 1083285606.
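
If you want to sanity-check a local run, a trivial helper like the one below (ours) can compare the numbers your generator prints against this published prefix:

    // Quick sanity check against the published start of the official
    // sequence. How you capture the numbers your local generator prints
    // is up to you; the expected values are the official ones above.
    const expectedStart = [
      416943456, 1130928813, 1690493249, 1560679505, 1146154276,
      403245362, 1509995579, 2138666856, 1841363379, 1083285606,
    ];

    function matchesOfficialStart(produced) {
      return expectedStart.every((n, i) => produced[i] === n);
    }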

Number of blocks and self-learning


First, we ran all the solutions on a very small number of blocks to find out which ones had a chance of taking the lead. Then we took the 8 most promising solutions and ran them on a very large number of blocks. With several different initialization strings, the top three places stabilized within the first 1,500 blocks: increasing the number of blocks further refined the percentages but did not change the relative positions of the leaders. We decided to test all solutions on 10,000 blocks (a million words).

Some solutions take an original approach: they remember every word the test system presents to them. Since the dictionary contains fewer than 700,000 words while the variety of pseudo-random non-words is much wider, a string that occurs twice is more likely to be a word than a non-word. For solutions that exploit this phenomenon, accuracy improves over time. In theory, as the number of blocks grows without bound, such a solution can achieve an arbitrarily high score.
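
The trick fits in a few lines; here is a simplified sketch (ours, with a stand-in baseClassifier for whatever heuristic a solution normally uses):

    // Self-learning wrapper: repeated strings are far more likely to be
    // dictionary words than repeated random non-words, so remembering
    // everything we have seen improves accuracy over time.
    const seen = new Set();

    function isWord(word, baseClassifier) {
      if (seen.has(word)) return true; // a repeat is almost certainly a real word
      seen.add(word);
      return baseClassifier(word); // the solution's usual heuristic
    }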

This is an interesting idea, but for the contest it is a problem: on a sufficiently large test set, such a solution could in theory beat any non-learning solution. Not only would this simple technique become a “magic wand” that beats everything else, it would also leave the jury with the problem of choosing between several learning solutions. Had we thought of this before the start of the contest, we could easily have solved the problem by stating that the test system may restart a solution at any moment during testing. However, we explicitly promised that initialization would happen only once.

We decided to settle on 10,000 blocks and to run each solution both with restarts and without, in order to measure the effect of learning separately. Over that many blocks, the results of non-learning solutions have plenty of time to stabilize. Learning really does improve the results of some solutions (so far we see one solution that is in the top ten thanks to learning). Of course, the learning solutions would have shown even better results on a larger number of blocks, but that would be unfair to the participants whose programs deliver high recognition accuracy immediately after initialization.

In short: we let learning solutions learn, but we do not use numbers of blocks so huge that the learning effect becomes decisive.

Status


At the moment, some solutions are still being tested.

We did not specify any performance requirements in the problem statement. Had we set limits, we would have had to describe the hardware and operating system precisely, run solutions many times to measure accurately, and deal with the distortions introduced by the sandbox. Everyone prefers to stay away from the headaches of performance measurement. As a result, some of the solutions we received are very slow.

It seems we will have to stop the slowest of them early and use the results of a partial run. At the moment, none of those solutions makes it into the top ten. In addition, at least one solution has hung; we will find out whether the sandbox causes this problem or whether it is a bug in the solution itself.

In a few days, when the testing of the slow solutions finishes, we will post the results. You can already find out your own result without waiting for the full table: use the new test system with the official initialization string given above. Set 10,000 blocks, and you will get exactly the same result as we do (unless your solution uses Math.random, in which case the results will differ slightly). So far, the best result we have seen on 10,000 blocks is 83.67%.
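
If your solution does use randomness and you want bit-for-bit reproducible runs, one common workaround (our suggestion, not a contest requirement) is to replace Math.random with a small seeded PRNG such as mulberry32:

    // A fixed seed makes every run identical. mulberry32 is a small,
    // well-known generator; we scale its output to [0, 1) like the
    // original Math.random.
    function mulberry32(seed) {
      return function () {
        seed = (seed + 0x6d2b79f5) >>> 0;
        let t = seed;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
      };
    }

    Math.random = mulberry32(42); // same seed -> same results every run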

Thank you for your patience. Stay with us!

Source: https://habr.com/ru/post/302922/

