
The most common sentence structures in Russian, according to the Flibusta library

I am a PHP programmer, but I wanted to broaden my horizons and learn something new, so I decided to study other languages and technologies. The choice fell on Perl, Python and MySQL.

I took the wonderful pymorphy package, the Flibusta library (.fb2 files only), Sedna for storing the fb2 files, MySQL (Percona 5.1) for storing the statistics, and a bit of hand-filing to make it all fit. I created a primitive MyISAM table that records a description of the parts of speech of a sentence and how many times a sentence with that structure occurred.
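To make the "description of the parts of speech" more concrete, here is a minimal sketch in Python 3 of turning a sentence into such a description and counting identical ones. The `POS` dictionary, the `NOUN`/`VERB` labels and the function names are hypothetical stand-ins; in the article the part of speech comes from pymorphy, and the exact description format used there is not shown.

```python
import re
from collections import Counter

# Hypothetical word -> part-of-speech lookup. In the article this role is
# played by pymorphy; the labels here are illustrative only.
POS = {
    "мама": "NOUN",
    "мыла": "VERB",
    "раму": "NOUN",
    "папа": "NOUN",
    "читал": "VERB",
    "книгу": "NOUN",
}

# Runs of Cyrillic letters (the text is lowercased before matching).
WORD_RE = re.compile(r"[а-яё]+")


def signature(sentence, pos=POS, unknown="?"):
    """Return the parts of speech of a sentence as one space-joined string."""
    words = WORD_RE.findall(sentence.lower())
    return " ".join(pos.get(word, unknown) for word in words)


def count_structures(sentences):
    """Count how many sentences share each part-of-speech signature."""
    return Counter(signature(s) for s in sentences)


stats = count_structures(["Мама мыла раму.", "Папа читал книгу.", "Папа читал."])
```

Sentences with the same sequence of parts of speech collapse into a single key, which is what makes a table of "structure, count" rows possible.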

I put a unique text index on the description, but forgot to add an index on the numeric field (I thought it would not be needed).
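The table layout described above might look roughly like the following sketch. It uses Python's built-in `sqlite3` as a stand-in for MySQL/MyISAM so that it runs anywhere; the table name `sentence_stats`, the column names and the helper functions are assumptions, not the author's schema.

```python
import sqlite3


def open_stats(path=":memory:"):
    """Create the statistics table (hypothetical schema) and return a connection."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sentence_stats ("
        " description TEXT NOT NULL UNIQUE,"  # the unique text index
        " cnt INTEGER NOT NULL)"
    )
    # The index on the numeric field; much cheaper to create while the table is empty.
    db.execute("CREATE INDEX IF NOT EXISTS idx_cnt ON sentence_stats (cnt)")
    return db


def bump(db, description):
    """Increment the counter for a structure, inserting the row on first sight."""
    cur = db.execute(
        "UPDATE sentence_stats SET cnt = cnt + 1 WHERE description = ?",
        (description,),
    )
    if cur.rowcount == 0:
        db.execute(
            "INSERT INTO sentence_stats (description, cnt) VALUES (?, 1)",
            (description,),
        )


def top(db, n=10):
    """Return the n most frequent structures; this query benefits from idx_cnt."""
    return db.execute(
        "SELECT description, cnt FROM sentence_stats ORDER BY cnt DESC LIMIT ?",
        (n,),
    ).fetchall()


db = open_stats()
for d in ["NOUN VERB NOUN", "NOUN VERB", "NOUN VERB NOUN"]:
    bump(db, d)
```

The unique index is what makes the "update or insert" step cheap; the numeric index only matters later, when sorting by frequency.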

The fb2 files from Flibusta were loaded into a Sedna database, which came out at roughly 90 GB.
As a first step, I collected all the unique words found in the library. There turned out to be about 14 million of them: unique words, their inflected forms, and assorted garbage.
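A minimal Python 3 sketch of this first step, assuming the book texts are already available as plain strings (the article pulled them from Sedna, which is not shown here). The regular expression and the function names are illustrative, not the author's code.

```python
import re
from collections import Counter

# Cyrillic words, optionally hyphenated (e.g. "кто-то"); text is lowercased first.
WORD_RE = re.compile(r"[а-яё]+(?:-[а-яё]+)*")


def iter_words(text):
    """Return all Cyrillic word tokens of a text, lowercased."""
    return WORD_RE.findall(text.lower())


def collect_unique(texts):
    """Count every distinct word form across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(iter_words(text))
    return counts


counts = collect_unique([
    "Кто-то стучит в дверь.",
    "Дверь открылась, и кто-то вошёл.",
])
```

Because nothing is lemmatized at this stage, each inflected form counts as its own entry, which is consistent with the list containing both words and their forms.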

Experiments showed that in pickle mode pymorphy spends a long time on initialization, so calling it from PHP through the shell for every word would take far too long. Since I had only just started learning Python and needed pymorphy to get the part of speech, I decided to run it as a server. This removed the slow-initialization problem, and here is what came out:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: a tiny TCP server that keeps pymorphy loaded in memory.
from pymorphy import get_morph
import re, sys, socket, pprint, json, chardet, ConfigParser


def do_word(data):
    if data == False or data == '':
        print "error incoming data !!!"
    # Strip the trailing line break sent by the client
    tmpWord = re.sub(r"\r\n", '', data)
    # pymorphy expects an upper-case unicode string
    word = tmpWord.decode('utf-8').upper()
    if DEBUG == 1:
        print word
    info = morph._decline(word)
    if DEBUG == 1:
        pprint.pprint(info)
    sjSon = json.dumps(info)
    return sjSon


# Settings are read from the [decline] section of pymorphy_conf.ini
config = ConfigParser.ConfigParser()
config.read('pymorphy_conf.ini')
DEBUG = config.getint('decline', 'DEBUG')
HOST = config.get('decline', 'HOST')
PORT = config.getint('decline', 'PORT')

# The slow part: load the dictionaries once, at startup
morph = get_morph("/pymorphy/dicts/converted/ru/morphs.pickle", 'pickle')

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind((HOST, PORT))
while 1:
    if DEBUG == 1:
        print " ", PORT
    srv.listen(1)
    sock, addr = srv.accept()
    # Serve one client at a time until it closes the connection
    while 1:
        pal = sock.recv(1024)
        if not pal:
            break
        lap = do_word(pal)
        sock.send(lap)
    sock.close()



Then, by querying this makeshift server, I tagged the previously collected words with their parts of speech. All of this took one evening and caused no particular difficulty.
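The article's client was written in PHP and is not shown. As an illustration only, a Python 3 client for a server of this kind could look like the sketch below. The default host and port are hypothetical (the real values live in `pymorphy_conf.ini`), and the framing is an assumption: one word per connection, with the reply read until the server closes the socket.

```python
import json
import socket


def query_word(word, host="127.0.0.1", port=9999, timeout=5.0):
    """Send one word to the server and return the decoded JSON reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("utf-8") + b"\r\n")
        # Signal that nothing more will be sent, so the server's recv()
        # returns an empty string and it closes the connection after replying.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return json.loads(b"".join(chunks).decode("utf-8"))
```

A call such as `query_word("стол", port=...)` would return whatever structure the server serialized with `json.dumps`. Opening a new connection per word is simple but adds overhead; a long-lived connection would need an explicit message delimiter, which the server above does not define.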

After that came the simplest part: collecting the statistics. I collected them for 2 months, after which I got tired of it. It is hard to say what the bottleneck was. Looking up the part of speech of a word takes about 0.0046 s, and the other operations were also reasonably fast, since the words themselves had been collected quickly. Sedna had been tested at the earlier stages too; it certainly did not fly, but the words could be pulled out of it in a couple of hours, so its performance should have been enough for collecting statistics on sentences.
The result was the following data:

image

Etc.
Update: statistics by word (this data is not only from Flibusta; a few other texts are mixed in)

TOP | 1500 and less | 150 and less | 15 and less
not 60300 | window sill 1500 | nazis 150 | Eich 15
I 53847 | wooden 1500 | crooked 150 | Lanter 15
Yes 50813 | threw off 1499 | Jure 150 | Butovsan 15
what 48819 | by names 1499 | the afterlife 150 | iksaytov 15
So 45926 | grew up 1499 | stalls 150 | tragedy 15
everything 44672 | high 1499 | Alla 150 | Atcal 15
this 42825 | DEEP 1499 | sheep 150 | Yummy 15
he 42154 | actively 1499 | shifted 150 | molluscore 15
you 38073 | miracles 1498 | lasers 150 | medvyany 15
me 35976 | Explain 1498 | transporter 150 | chronosurgeons 15
but 35900 | sounded 1498 | brunettes 150 | Iwashura 15
him 35394 | broke into 1498 | schoolgirls 150 | chronus accelerator 15
as 34347 | a figure 1498 | thieves 150 | oscillations 15
to me 34078 | nights 1498 | scouts 150 | pentarch 15
and 33836 | WORKS 1498 | physiological 150 | Gorshin 15
she is 31373 | ok 1498 | poetic 150 | roida 15
OK 30932 | discontent 1498 | kicked up 150 | marshallas 15
you 30738 | bucks 1497 | cylinders 150 | evmenarch 15
here 29969 | called 1497 | triple 150 | Eumenarchus 15
there is 29579 | turned 1497 | debates 150 | chronogen 15
It was 29297 | rushed 1497 | a minimum 150 | Kovalyakh 15
but 29009 | champagne 1497 | characteristic 150 | Yagya 15
there 28955 | bomb 1497 | raid 150 | energy information 15
Well 28612 | demanded 1496 | counters 150 | consortline 15
nothing 28605 | thinking 1496 | exiled 150 | ephanalysis 15
a business 27653 | delighted 1496 | genetic 150 | pisspheres 15
same 26973 | inspect 1496 | Anniversary 150 | foreing 15
here 26898 | reached 1495 | old 150 | Reflesians 15
still 26640 | of sources 1495 | wasteland 150 | Reflesians 15
her 26556 | turned 1495 | onion 150 | Slayer 15
person 26415 | got up 1495 | by generator 150 | Levikov 15
now 26408 | pulled out 1495 | unpretentious 150 | Glenke 15
need to 25657 | gray 1495 | wrappers 150 | Irrashi 15
of course 25568 | by misfortune 1494 | mouthwatering 150 | camra 15
why 25216 | looked at 1494 | turned 150 | Arvaroche 15
can 25012 | amazingly 1494 | tablets 150 | Uanduk 15
we 24847 | uncles 1493 | interception 150 | chefattashe 15
I KNOW 24843 | checks 1493 | cockroaches 150 | Tarkhanov 15
they 24070 | invitations 1493 | scanty 150 | blackhead 15
you 23922 | heavy 1493 | glade 150 | Ornumhoniorov 15
at 23600 | intervene 1493 | unprecedented 150 | Ganfal 15
time 23598 | sure 1493 | intended 150 | Hamelinam 15
their 22956 | have seen 1493 | profits 150 | Uraine 15
you 22780 | continue 1492 | feeding 150 | Ellata 15
that 22730 | objected 1492 | commenting 150 | Uraina 15
time 22508 | whole 1492 | chic 150 | gervrites 15
Now 22488 | Eibogu 1492 | enterprises 150 | Harrens 15
not 22450 | I'll start 1492 | languishing 150 | giazir 15
you 22107 | snapshot 1492 | Drozdov 150 | Cutletkin 15
his 22014 | Understand 1491 | valid 150 | black magic 15
said 21970 | meeting 1491 | will obey 150 | Durnevu 15
us 21950 | in heat 1491 | foggy 150 | Cutletkin 15
eyes 21763 | jump 1491 | departure 150 | Spurius 15
Who 21740 | imagine 1491 | Indifferent 150 | ROBUSGROBUS 15
myself 21656 | riddle 1491 | habitual 150 | Yoha 15
will be 21642 | Vasiliy 1491 | referred to as 150 | Tetlucoacle 15
one 21576 | sounds 1490 | valerian women 150 | Yaraat 15
then 21438 | connected 1490 | looked like 150 | maglody 15
be 21415 | quietly 1490 | cores 150 | Ligul 15
people 21368 | slower 1490 | responding 150 | Lamas 15
a life 21322 | of yesterday 1489 | painted 150 | Varnana 15
true 20972 | thin 1489 | audible 150 | Beliria 15
with 20645 | told 1489 | ladders 150 | Nimrobec 15
only 20603 | dived 1489 | waterways 150 | Dunk 15
of life 20517 | return 1489 | rampant 150 | Legiara 15
about 20516 | flag 1489 | civilizations 150 | Charlock 15
whether 20450 | up 1489 | scandalous 150 | Midwinter 15
more 20328 | call 1489 | Englishman 150 | Bow 15
to tell 20318 | fidelity 1489 | protested 150 | Zorik 15
to myself 20299 | replaced 1488 | strangle 150 | outland 15
than 20291 | serious 1488 | round 150 | Bjart 15
then 20244 | change 1488 | saucepan 150 | Renaldo 15
it is better 20047 | Vladimir 1488 | then 150 | Samonenko 15
you 20024 | the atmosphere 1487 | mocking 150 | Dyushka 15
farther 19819 | ran away 1487 | extreme 150 | Morphichev 15
Where 19600 | opposite 1487 | roar 150 | Varykin 15
arms 19545 | a raincoat 1487 | devilish 150 | Gryzach 15
do 19525 | snorted 1487 | meat 150 | Conch 15
if a 19522 | the eldest 1487 | slippers 150 | Growlis 15
on 19444 | was starting 1487 | have called 150 | Lebedinskaya 15
years old 19444 | to drag 1487 | mineral 150 | Valyushok 15
houses 19361 | a pair 1487 | legends 150 | I wanted 15
him 19096 | to retreat 1487 | slammed 150 | Schaaaas 15
the words 19026 | incomprehensible 1487 | so 150 | Gugim 15
can 18966 | affably 1487 | adventurers 150 | glongov 15
of this 18860 | widow 1487 | romantic 150 | Oberporuchik 15
day 18772 | made a mistake 1486 | sheepskin coat 150 | Beckham 15
what for 18730 | have given away 1486 | touching 150 | In which 15
people 18680 | gave it away 1486 | cured 150 | Kutevanova 15
also 18583 | we will wait 1486 | painted 150 | kopach 15
was 18450 | fresh 1486 | responsible 150 | Strobach 15
would 18438 | fresh 1486 | cultural 150 | Zonova 15
means 18361 | a photo 1486 | mixing up 150 | Portfolio 15
human 18312 | slave 1486 | will read 150 | Lael 15
by him 18293 | lanterns 1485 | intimate 150 | Bepe 15
question 18276 | furnishings 1485 | food 150 | clastapug 15
backwards 18186 | having time 1485 | unworthy 150 | Kupling 15
us 18167 | behavior 1485 | Shudder 150 | Kuplinga 15
or 18155 | the subject 1485 | new arrivals 150 | Tables 15
such a 18116 | result 1485 | collar 150 | Power 15


A total of 90 611 059 sentences were processed.

Having forgotten to create the index on the numeric field up front, I ran into a serious problem. Once the table had grown to 58 million records and 12 GB, building the index ran for more than a day and still did not finish. What rescued me was raising the MyISAM sort buffer to 1 GB:
myisam_sort_buffer_size=1024M
With that setting, the index was built within hours.

P.S. Server configuration: AMD Athlon(tm) II X4 635 Processor, 16 GB DDR3, WDC WD7500AACS-00D6B1

Source: https://habr.com/ru/post/138172/

