
Parsing the Dictionary of the Russian Language by Andrei Anatolyevich Zaliznyak

I once needed to collect a large number of Russian nouns in the nominative singular, so I started searching the Internet. Everything I came across was either in a format inconvenient for me or an amateur collection. I wanted a more official source that could be converted into my own format, for example a MySQL database table.



On September 1, 2009, an order of the Ministry of Education and Science came into force approving a list of dictionaries, grammars and reference books recommended by the Interdepartmental Commission on the Russian Language at the Ministry. Among the four approved books is the Grammar Dictionary of the Russian Language by A. A. Zaliznyak.



I settled on this dictionary, first, because it contains a morphological description of each word, which makes it possible to extract, for example, only perfective verbs. Second, because I was able to find an electronic version of it.


There was another option: scraping Wiktionary.org, Category: Russian nouns. It may make sense to combine the two sources, but for now let's stick with Zaliznyak.



Dictionary



I found Zaliznyak's dictionary on the site of the “Tower of Babel” project, which is dedicated to comparative historical linguistics. The Ozhegov, Zaliznyak and Vasmer dictionaries are available there both online and for download.



Download the file dicts.exe dated 11/27/2004 and install it. The files will be placed in the folder c:\StarSoft\dict\. We need only those whose names start with Z_ (from Z_160 to Z_239). The words in the files are grouped by first letter: the file Z_160 contains all words beginning with the letter A, Z_161 those beginning with B, and so on.
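The file selection described above can be sketched in PHP roughly as follows. This is only an illustration: the helper names isZaliznyakFile and listZaliznyakFiles are made up for this example, and the optional file extension is an assumption, since the article does not say whether the installed files have one.

```php
<?php
// Hypothetical helper: true if a file name looks like one of the Zaliznyak
// files, i.e. "Z_" followed by a number from 160 to 239.
// Assumption: an optional extension (such as ".txt") may follow the number.
function isZaliznyakFile(string $name): bool
{
    if (!preg_match('/^Z_(\d+)(?:\.\w+)?$/i', $name, $m)) {
        return false;
    }
    $n = (int) $m[1];
    return $n >= 160 && $n <= 239;
}

// Hypothetical helper: collect the matching files from a directory.
function listZaliznyakFiles(string $dir): array
{
    $found = [];
    foreach (new DirectoryIterator($dir) as $file) {
        if ($file->isFile() && isZaliznyakFile($file->getFilename())) {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}
```

Copying only the files returned by listZaliznyakFiles() into the parser's working folder keeps the Ozhegov and Vasmer files out of the run.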



Parser



The files are in OEM 866 encoding. For convenience, I converted them to UTF-8 using Notepad++. Then I wrote a simple PHP parser. I needed only masculine and feminine nouns; you can adjust the regular expression to suit your needs.
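<?php

mb_internal_encoding('utf-8');

// Iterate over the dictionary files in the dict/ folder next to this script.
$dir = new DirectoryIterator(dirname(__FILE__) . '/dict/');
foreach ($dir as $file)
{
    if ($file->isDot()) {
        continue;
    }

    // Capture the headword (two or more letters) at the start of each line,
    // followed by a number and a tag.
    // NOTE: the alternatives inside (?:||||) are missing here; presumably they
    // were the Cyrillic tags for masculine and feminine nouns and must be
    // restored before the pattern is used.
    if (!preg_match_all('/^(\\p{L}{2,})\\s+\\d+\\s+(?:||||)\\s+/um', file_get_contents($file->getPathname()), $matches)) {
        continue;
    }

    foreach ($matches[1] as $word)
    {
        // $word holds one noun; store it here, e.g. in a database table
    }
}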
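The Notepad++ conversion step could also be done in PHP itself. The sketch below assumes the mbstring extension is available (the parser already relies on it) and that the files are plain CP866 text. The function name cp866ToUtf8 is made up for this example.

```php
<?php
// Hypothetical helper: convert raw bytes from OEM 866 (CP866) to UTF-8.
// Uses mb_convert_encoding() from the mbstring extension.
function cp866ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'CP866');
}

// Possible use inside the parser loop, in place of a plain file_get_contents():
// $text = cp866ToUtf8(file_get_contents($file->getPathname()));
```

With this in place the original files can be parsed as installed, without re-saving each one by hand.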


As a result, I got a table with 39361 nouns.
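For loading the words into MySQL, a batched insert through PDO is one possible approach. The sketch below is an assumption rather than the code used for this article: the table name nouns, the column name word, the batch size and both function names are hypothetical.

```php
<?php
// Hypothetical helper: build one multi-row INSERT with a placeholder per word.
// Table `nouns` and column `word` are made-up names for this example.
function buildInsertSql(int $rowCount): string
{
    if ($rowCount < 1) {
        throw new InvalidArgumentException('need at least one row');
    }
    $placeholders = implode(', ', array_fill(0, $rowCount, '(?)'));
    return "INSERT INTO nouns (word) VALUES {$placeholders}";
}

// Hypothetical helper: insert words in batches through an existing PDO connection.
function insertWords(PDO $pdo, array $words, int $batchSize = 500): void
{
    foreach (array_chunk($words, $batchSize) as $batch) {
        $stmt = $pdo->prepare(buildInsertSql(count($batch)));
        $stmt->execute($batch);
    }
}
```

The words collected in the parser loop would be gathered into an array and passed to insertWords() once per file, with $pdo created beforehand for your own MySQL database.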

Source: https://habr.com/ru/post/97440/


