Creating and maintaining a complex product with a large technology zoo, alone and without outside funding, is a troublesome and tedious business. So, having learned about a contest with an interesting task, here at Megalent I thought about arranging a “sabbatical” for myself and taking a break for a while from work on the new version.
The task was to write a JS program that determines whether a given word belongs to a dictionary of English words or not. It sounds simple, but there are a couple of restrictions that make the task, strictly speaking, impossible:
- A word counts not as just any valid English word, but as a word present in the provided dictionary of 600K+ words.
- The dictionary is not available at execution time and cannot be downloaded, and the size of the program, including its data, must not exceed 64K. External libraries cannot be used either, but the data file may be compressed.
Because of these conditions, instead of an exact answer the program can only estimate, with the highest possible probability, whether the word is in the dictionary.
I’ll say right away that I didn’t submit a solution, because I wasn’t satisfied with the result: the version that gave at least 80% accuracy I could only fit into 120-130K, and within the 64K limit I squeezed out at most 70%.
Nevertheless, I consider the experience quite interesting and worth an article. Under the cut: a lot of SQL, JS, Python, neural networks, as well as the sad truth about CPU performance on hosted servers.
The prizes were a few thousand dollars and an invitation to an interview. A couple of thousand dollars wouldn’t hurt the household budget, but they wouldn’t change much either, and I’m not actively looking for a job (although, of course, there may be offers I couldn’t refuse). Nevertheless, the task itself “hooked” me, and I decided to set aside a week to participate.
Since both the contest article and the comments to it suggested that neural networks would be appropriate for this problem, I started looking for libraries for Node.js (on which the result would be checked), or at least for Python, in the hope of training the network in Python and then porting the prediction code to JS. I found Brain, ConvNetJS, Synaptic, PyBrain, Scikit. I realized that this was quite difficult, that I lacked the theoretical background, and decided to try without them for now.
In the meantime, just in case, I decided to start collecting incorrect words. On several servers I started gathering them into a MongoDB database:
    import time
    import random
    import requests
    from constants import *          # provides MONGO_SERVER_IP
    from pymongo import *

    client = MongoClient(MONGO_SERVER_IP, socketKeepAlive=True, socketTimeoutMS=30000)
    db = client.challenge
    words = db.words
    wordcount = 0

    while True:
        # Request a random test page from the contest server
        page = random.randrange(0, 600000000)
        url = "https://hola.org/challenges/word_classifier/testcase/" + str(page)
        response = None
        try:
            try:
                response = requests.get(url)
            except:
                time.sleep(5)
                response = requests.get(url)
            if response and response.content:
                result = response.json()
                if result and len(result):
                    for word in result:
                        try:
                            wordcount += 1
                            if not wordcount % 1000:
                                print(wordcount)
                            # Store only the words marked as incorrect
                            if not result[word]:
                                words.replace_one({"_id": word}, {"_id": word}, upsert=True)
                        except Exception as err:
                            time.sleep(1)
                            words.replace_one({"_id": word}, {"_id": word}, upsert=True)
        except Exception as err:
            print(err)
Now I would like to take a closer look at the dictionary. What is the best way to quickly analyze large amounts of data? SQL queries, of course! I install MSSQL locally, load word.txt into the database (with the import wizard, no frills; we will get to bcp later) and do a first “visual” analysis of the data:
First, look at the number of words:
select count(*) from words
661686
Just in case, check for duplicates.
select count(distinct word) from words
630404
Oops... for some reason there are duplicates in the file. Re-creating the table without duplicates:
    select distinct word into #words from words
    drop table words
    select * into words from #words
    drop table #words
Create a primary key:
    alter table words alter column word varchar(100) not null
    alter table words add constraint pk_words primary key clustered (word)
We estimate the number of long words:
select len(word),count(*) from words where len(word) >20 group by len(word) order by len(word)
Length | Count |
---|---|
20 | 709 |
21 | 348 |
22 | 151 |
23 | 67 |
24 | 38 |
25 | 18 |
26 | 3 |
27 | 5 |
28 | 3 |
29 | 6 |
30 | 2 |
31 | 2 |
32 | 1 |
33 | 1 |
34 | 2 |
45 | 2 |
58 | 1 |
60 | 1 |
We can see that for words longer than 21 characters it is easier to simply list them all and reject anything not on that “white list”.
We save the list of long words somewhere and delete them from the table:
    select * into longwords from words where len(word) > 21
    delete words where len(word) > 21
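On the JS side, the corresponding run-time check could look like this minimal sketch; `checkLongWord` is my own name, and `longWords` is assumed to be the white list of 21+ character words shipped in the data file (its loading is shown further below):

    // Sketch: words longer than 21 characters are valid only if they are
    // explicitly present in the white list taken from the dictionary.
    function checkLongWord(word, longWords) {
        if (word.length <= 21) return null;        // this filter has no opinion
        return longWords.indexOf(word) >= 0;
    }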
I start dreaming up ways to weed out incorrect words “head-on”. Let me remind you that the priority is to use a minimum of reference data for this.
The first thing that lies on the surface: pairs and triples of characters that never occur in the dictionary, and the number of vowels/consonants in a word.
I create tables.
Vowels:
    create table vowels (letter char(1))
    insert into vowels
    select 'a' union select 'e' union select 'i' union select 'o' union select 'u' union select 'y'
Consonants:
    create table consonants (letter char(1))
    insert into consonants
    select 'b' union select 'c' union select 'd' union select 'f' union select 'g' union
    select 'h' union select 'j' union select 'k' union select 'l' union select 'm' union
    select 'n' union select 'p' union select 'q' union select 'r' union select 's' union
    select 't' union select 'v' union select 'w' union select 'x' union select 'z'
A view combining all of them (plus the apostrophe):
    create view allchars as
    select * from vowels
    union all
    select * from consonants
    union all
    select ''''
I find the missing pairs:
    select v1.letter l1, v2.letter l2
    into notexists2
    from allchars v1
    cross join allchars v2
    where not exists (select * from words where word like '%'+v1.letter+v2.letter+'%')
Found a total of 14 pairs.
Missing triples:
    select v1.letter l1, v2.letter l2, v3.letter l3
    into notexists3
    from allchars v1
    cross join allchars v2
    cross join allchars v3
    where not exists (select * from words where word like '%'+v1.letter+v2.letter+v3.letter+'%')
Found 8114 triples.
With similar queries I get triples and quadruples of vowels and consonants. Quadruples of vowels turn out to be too few, of consonants too many. By this time a rather impressive amount of test data has accumulated; to analyze it, it needs to be transferred from MongoDB on the hosting to the local SQL server.
I write a simple Python export of the words into plain-text files, a million per file:
    from constants import *          # provides MONGO_SERVER_IP
    from pymongo import MongoClient

    client = MongoClient(MONGO_SERVER_IP)
    db = client.challenge
    words = db.words
    pages = range(0, 16)
    for page in pages:
        # Each file gets roughly a million words, separated by semicolons
        f = open("result_" + str(page) + ".txt", "w")
        wordsresult = words.find({}).skip(page * 1000000).limit(1000000 + 10)
        i = 0
        for pword in wordsresult:
            i += 1
            if not i % 50000:
                print(i)
            f.write(pword["_id"] + ";")
        f.close()
I download the files and load them via bcp. Bcp (Bulk Copy Program) is a console utility from Microsoft for fast loading/unloading of large amounts of data to/from MSSQL. It really is fast, in particular because, with the special “bulk logged” logging mode enabled, the data load is not written to the transaction log. For example, 60M words were loaded from a text file in just a few minutes. Usage example:
bcp tmpdata in result1_0.txt -S localhost -d test -U sa -PP@ssw0rd -f bcp.fmt
The specified bcp.fmt is a file describing the source data format: the types of the fields and the delimiters. If you do not specify it, the utility will ask the necessary questions itself and offer to create such a file for later use.
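For reference, a hypothetical one-column non-XML format file matching the semicolon-separated export above might look roughly like this (the version header, field length, column name and collation are assumptions that depend on your bcp version and the target table):

    13.0
    1
    1       SQLCHAR       0       100     ";"       1     word       SQL_Latin1_General_CP1_CI_AS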
On 66M incorrect words, we check how many of them are eliminated by simple filters:
Length:
delete learndata where len(word)>21
55M remain.
Pairs:
delete learndata where exists (select * from notexists2 where word like '%'+l1+l2+'%')
46M remain.
Triples:
delete learndata where exists (select * from notexists3 where word like '%'+l1+l2+l3+'%')
44M remain.
That looks decent. Next, I add the missing pairs and triples anchored at the beginning of a word, at the end of a word, and at offsets of 2-3 characters.
I check on the test array. Out of every million words, the filters reject the following:
Length: 121576
Missing pairs: 37903
Missing triples: 162205
Missing first pairs: 453
Missing first triples: 13905
Missing second pairs: 0
Missing second triples: 6276
Missing third triples: 40114
Missing last pairs: 594
Missing last triples: 6844
Missing next-to-last pairs: 0
Missing next-to-last triples: 6844
Missing third-from-last triples: 4846
Not bad for a start. Moving on.
Now these checks need to be assembled into a JavaScript program and tested “live” on Node.js.
For this I write a small JS harness that loads the test data and calls the contest script:
    var fs = require('fs');
    var contest = require('./contest.js');
    var zlib = require('zlib');

    // Load and unpack the data file, then read the test set
    var data = fs.readFileSync('data.txt.gz');
    data = zlib.gunzipSync(data);
    var testdata = JSON.parse(fs.readFileSync('test.txt'));

    var total = 0;
    var right = 0;
    for (var testword in testdata) {
        var result = contest(testword, testdata[testword]);
        total++;
        if (testdata[testword] == (result !== null ? result : true)) right++;
    }
    console.log(right / total);
Next I need to figure out how to make the data take up as little space as possible. Let me remind you: there are about 8000 non-existent triples, and besides them there are also the first triples, the last triples, and so on. I would not want the entire 64K to be eaten up by these lookup tables alone.
I decide to build a single combined table of triples with a bit mask, in which each bit is responsible for a position in the word where this combination does not occur. The first bit means that the combination does not occur at all, which makes all the subsequent bits unnecessary and saves a little more space.
Also, the predictable alternation of letters and digits allows the separators to be dropped, saving space once again. Thus, the data will look like this:
'aa1eaa64oaa106 etc ...
For example, “'aa1” means that the combination “'aa” cannot occur at all, and “eaa64” means that “eaa” cannot be the next-to-last three characters. Since other data sets are expected besides this one, I decided to separate them from each other with a semicolon.
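Purely for illustration, here is a minimal sketch of how such a packed string could be produced on the generation side; `packTriples` and the `masks` object are my own names, and the bit values are assumed to follow the decoding scheme shown below:

    // Sketch: pack a {triple: bitmask} lookup table into a separator-free string.
    // masks maps a three-letter combination to the mask of positions where it
    // never occurs (bit 1 meaning "never occurs anywhere").
    function packTriples(masks) {
        var out = '';
        for (var triple in masks) {
            out += triple + masks[triple];   // three letters, then the decimal mask
        }
        return out;                          // e.g. "'aa1eaa64oaa106..."
    }

The decoder can rely on the fact that a digit can never be part of a letter combination, so no separators are needed within a chunk.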
The loading code will look like this:
    var notExistsMatrix3 = {}, notExistsMatrix2 = {}, longWords = [];
    var chunk = 1, word, digit;                         // chunk 1 = triples, 2 = pairs, 3 = long words

    for (var i = 0; i < data.length; i++) {
        if (data[i] == 59) { chunk++; i++; }            // 59 = ';' separates the data sets
        if (chunk == 1) {
            // three letters followed by the decimal bit mask
            word = String.fromCharCode(data[i]) + String.fromCharCode(data[++i]) + String.fromCharCode(data[++i]);
            digit = String.fromCharCode(data[++i]);
            while (data[++i] > 47 && data[i] < 58) {    // keep reading digits
                digit += String.fromCharCode(data[i]);
            }
            notExistsMatrix3[word] = parseInt(digit);
            i--;
        } else if (chunk == 2) {
            // two letters followed by the decimal bit mask
            word = String.fromCharCode(data[i]) + String.fromCharCode(data[++i]);
            digit = String.fromCharCode(data[++i]);
            while (data[++i] > 47 && data[i] < 58) {
                digit += String.fromCharCode(data[i]);
            }
            notExistsMatrix2[word] = parseInt(digit);
            i--;
        } else if (chunk == 3) {
            // long words, one per entry
            word = "";
            while (data[++i] != 59 && data[i] != 10) {  // until ';' or newline
                word += String.fromCharCode(data[i]);
            }
            longWords.push(word);
            i--;
        }
    }
Here it is assumed that the lookup tables of missing triples, missing pairs and long words are shipped in the data file.
The checking code will look like this (example for pairs):
    for (var letters in notExistsMatrix2) {
        if (word.indexOf(letters) >= 0) {
            // bit 1: the pair never occurs anywhere in a valid word
            if (notExistsMatrix2[letters] & 1) { result = false; break; }
            // bits 2, 4, 8: the pair never occurs as the 1st, 2nd or 3rd pair from the start
            if (notExistsMatrix2[letters] & 2 && letters[0] == word[0] && letters[1] == word[1]) { result = false; break; }
            if (notExistsMatrix2[letters] & 4 && letters[0] == word[1] && letters[1] == word[2]) { result = false; break; }
            if (notExistsMatrix2[letters] & 8 && letters[0] == word[2] && letters[1] == word[3]) { result = false; break; }
            // bits 16, 32, 64: the pair never occurs as the last, next-to-last or third-from-last pair
            if (notExistsMatrix2[letters] & 16 && letters[0] == word[word.length - 2] && letters[1] == word[word.length - 1]) { result = false; break; }
            if (notExistsMatrix2[letters] & 32 && letters[0] == word[word.length - 3] && letters[1] == word[word.length - 2]) { result = false; break; }
            if (notExistsMatrix2[letters] & 64 && letters[0] == word[word.length - 4] && letters[1] == word[word.length - 3]) { result = false; break; }
        }
    }
So, in theory everything looks fine; time to run it.
The run gives 60%. And that only because the proportion of correct and incorrect words in the test pages is roughly the same. The first disappointment.
Returning to the topic of neural networks. I find out that, by and large, they can only work with binary input and output data; at best, with algorithms using the sigmoid function, which take input values between zero and one.
I want to try it, but for that I need to prepare a table with the input data. The first thought was to feed it a data set in which every word is spelled out letter by letter.
To reduce the size of the set, I decide to represent each letter as a set of five bits corresponding to the letter's ordinal number in the alphabet.
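As a sketch of this encoding (the helper below is mine, not part of the contest code), each letter maps to its 1-based alphabet index and then to five bits, least significant bit first, matching the SQL that follows:

    // Sketch: encode a letter as five bits of its alphabet index (apostrophe = 27).
    function letterBits(ch) {
        var n = ch === "'" ? 27 : ch.charCodeAt(0) - 96;      // 'a' -> 1 ... 'z' -> 26
        var bits = [];
        for (var b = 0; b < 5; b++) bits.push((n >> b) & 1);  // least significant bit first
        return bits;
    }
    // letterBits('s') -> [1, 1, 0, 0, 1], since 's' = 19 = 10011 in binary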
The query for creating such a data set looks like this:
    -- Correct words from the dictionary (isValid = 1)
    select word, 1 as isValid,
        case substring(word,1,1)
            when '''' then 27 when 'a' then 1 when 'b' then 2 when 'c' then 3 when 'd' then 4
            when 'e' then 5 when 'f' then 6 when 'g' then 7 when 'h' then 8 when 'i' then 9
            when 'j' then 10 when 'k' then 11 when 'l' then 12 when 'm' then 13 when 'n' then 14
            when 'o' then 15 when 'p' then 16 when 'q' then 17 when 'r' then 18 when 's' then 19
            when 't' then 20 when 'u' then 21 when 'v' then 22 when 'w' then 23 when 'x' then 24
            when 'y' then 25 when 'z' then 26 else 0
        end l1,
        ...
        case substring(word,21,1)
            when '''' then 27 when 'a' then 1 when 'b' then 2 when 'c' then 3 when 'd' then 4
            when 'e' then 5 when 'f' then 6 when 'g' then 7 when 'h' then 8 when 'i' then 9
            when 'j' then 10 when 'k' then 11 when 'l' then 12 when 'm' then 13 when 'n' then 14
            when 'o' then 15 when 'p' then 16 when 'q' then 17 when 'r' then 18 when 's' then 19
            when 't' then 20 when 'u' then 21 when 'v' then 22 when 'w' then 23 when 'x' then 24
            when 'y' then 25 when 'z' then 26 else 0
        end l21
    into formining
    from words
    union all
    -- Incorrect words collected from the test pages (isValid = 0)
    select word, 0 as isValid,
        case substring(word,1,1)
            when '''' then 27 when 'a' then 1 when 'b' then 2 when 'c' then 3 when 'd' then 4
            when 'e' then 5 when 'f' then 6 when 'g' then 7 when 'h' then 8 when 'i' then 9
            when 'j' then 10 when 'k' then 11 when 'l' then 12 when 'm' then 13 when 'n' then 14
            when 'o' then 15 when 'p' then 16 when 'q' then 17 when 'r' then 18 when 's' then 19
            when 't' then 20 when 'u' then 21 when 'v' then 22 when 'w' then 23 when 'x' then 24
            when 'y' then 25 when 'z' then 26
        end l1,
        ...
        case substring(word,21,1)
            when '''' then 27 when 'a' then 1 when 'b' then 2 when 'c' then 3 when 'd' then 4
            when 'e' then 5 when 'f' then 6 when 'g' then 7 when 'h' then 8 when 'i' then 9
            when 'j' then 10 when 'k' then 11 when 'l' then 12 when 'm' then 13 when 'n' then 14
            when 'o' then 15 when 'p' then 16 when 'q' then 17 when 'r' then 18 when 's' then 19
            when 't' then 20 when 'u' then 21 when 'v' then 22 when 'w' then 23 when 'x' then 24
            when 'y' then 25 when 'z' then 26
        end l21
    from wrongwords
Translation to bits:
    select
        l1_1 = IIF(l1 & 1 = 1, 1, 0),
        l1_2 = IIF(l1 & 2 = 2, 1, 0),
        l1_3 = IIF(l1 & 4 = 4, 1, 0),
        l1_4 = IIF(l1 & 8 = 8, 1, 0),
        l1_5 = IIF(l1 & 16 = 16, 1, 0),
        ...
        l21_1 = IIF(l21 & 1 = 1, 1, 0),
        l21_2 = IIF(l21 & 2 = 2, 1, 0),
        l21_3 = IIF(l21 & 4 = 4, 1, 0),
        l21_4 = IIF(l21 & 8 = 8, 1, 0),
        l21_5 = IIF(l21 & 16 = 16, 1, 0)
    into formining1
    from formining
    where length > 2
I try to feed this data set to a neural network.
I try brain.js, ConvNetJS, Synaptic, and also some Python libraries; the results are not encouraging. On toy models with a few thousand input records they do fine, but when I try to feed them my real set, even just the five-character words, everything falls apart: training takes forever, and there is no accuracy to speak of. Maybe it is the number of hidden layers, but I tried 10, 100, and even 1000. Most likely I simply don't know how to prepare them properly, but for now the only conclusion I have come to is that I need a very large number of training iterations (tens of thousands at least), and on this amount of data the contest will end before the network learns.
There is an idea to use hosted servers. But, as it turned out, the truth about real CPU performance on hosting had a letdown waiting for me. I naively spin up a dozen virtual machines on Azure in order to run training on them in parallel, only to see that training on an Azure D2 V2 instance runs (by eye) about twenty times slower than on my home i7-4770. In both cases a single core was used and the GPU was not involved. Thinking something is wrong with Azure, I remember that I have a server at flops.ru and try there; the result is the same. The sad conclusion: for some reason the CPU on any dedicated i7 server turns out to be many times (in my particular case an order of magnitude) faster than on a virtual machine. Realizing that I will not manage hundreds of thousands of training iterations on 20M records, I give up on the neural network.
I remember that besides neural networks there are other mining mechanisms, for example decision trees and association-rule algorithms. I decide to try them, and for this I add to the data set the word length, the number of vowels and consonants, and the number of occurrences of each letter in the word. To do this, I add the following columns to the long SQL above that builds the training data set:
    -- word length
    length = len(word),
    -- consonant count: strip out the vowels and measure what remains
    len(replace(replace(replace(replace(replace(word,'a',''),'e',''),'i',''),'o',''),'u','')) consonants,
    -- vowel count: strip out the consonants and measure what remains
    len(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(word,'b',''),'c',''),'d',''),'f',''),'g',''),'h',''),'j',''),'k',''),'l',''),'m',''),'n',''),'p',''),'q',''),'r',''),'s',''),'t',''),'v',''),'w',''),'x',''),'y',''),'z','')) vowels,
    -- occurrences of each letter: strip out everything except that letter
    len(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(word,'''',''),'y',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'z','')) a,
    ...
    len(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(word,'''',''),'a',''),'b',''),'c',''),'d',''),'e',''),'f',''),'g',''),'h',''),'i',''),'j',''),'k',''),'l',''),'m',''),'n',''),'o',''),'p',''),'q',''),'r',''),'s',''),'t',''),'u',''),'v',''),'w',''),'x',''),'y','')) z
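The same features, computed for a single word in JS, might look like the sketch below (the function name is mine; here I treat 'y' as a vowel, as in the vowels table earlier, and skip apostrophes, which differs slightly from the SQL above):

    // Sketch: word length, vowel/consonant counts and per-letter counts.
    function wordFeatures(word) {
        var vowels = "aeiouy";
        var features = { length: word.length, vowels: 0, consonants: 0 };
        for (var c = 97; c <= 122; c++) features[String.fromCharCode(c)] = 0;  // a..z counters
        for (var i = 0; i < word.length; i++) {
            var ch = word[i];
            if (ch === "'") continue;                       // apostrophe counted as neither
            if (vowels.indexOf(ch) >= 0) features.vowels++; else features.consonants++;
            features[ch]++;
        }
        return features;
    }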
I feed this to Microsoft's decision tree algorithm.
Here is the MS DataMining decision tree:
At first glance it looks promising. The figure shows that the algorithm picked length as the main factor: for example, among words of 19 characters or longer, out of the 855253 loaded variants only 2079 were correct, and within that group the main influencing factor became the number of letters “K” in the word. On the one hand the information is interesting; on the other, it is useless for our task. Finally, I ran a couple more experiments with this algorithm, including looking at what happens if the predicted attribute's type is set to “continuous” instead of “discrete”. In that case an interesting result comes out:
As you can see, the node contains a formula:

    l17 < 21 or >= 24
    Existing Cases: 3290857
    Missing Cases: 0
    Is Valid = 0.007 - 0.005*(K-0.144) - 0.005*(W-0.106) + 0.002*(O-1.671) + 0.0003*(l21-15.271) - 0.004*(B-0.346)
In other words, this is effectively a statement that for words whose 17th letter is not the 21st, 22nd or 23rd letter of the alphabet, an estimate of validity can be obtained by substituting into the formula the number of occurrences of each listed letter in the word and, in place of l21, the alphabet index of the word's 21st letter.
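To make that interpretation concrete, here is how I read the regression node as code (a hypothetical sketch: `count` is the number of occurrences of a letter in the word, and `l21` is the alphabet index of the word's 21st letter, 0 if the word is shorter):

    // Sketch: literal reading of the regression formula from the mining node.
    function count(word, ch) {
        return word.split(ch).length - 1;              // occurrences of ch in word
    }
    function isValidScore(word) {
        var l21 = word.length >= 21 ? word.charCodeAt(20) - 96 : 0;
        return 0.007
            - 0.005  * (count(word, 'k') - 0.144)
            - 0.005  * (count(word, 'w') - 0.106)
            + 0.002  * (count(word, 'o') - 1.671)
            + 0.0003 * (l21 - 15.271)
            - 0.004  * (count(word, 'b') - 0.346);
    }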
There it is! I was glad that I had finally obtained the magic formula! But no such luck: these formulas did not pass the reality check.
I will stop here for now; further experiments (and there were many more) are described in the second part.
Source: https://habr.com/ru/post/302020/