
Ripping online dictionaries with Node.js, part 1: static pages; CLI; DSL to TXT, PDF, DjVu; related tasks

ABBYY has created a good software shell for working with dictionaries, but a by-product of the development of ABBYY Lingvo, the DSL markup language, has become no less of a contribution to digital lexicography. It has long since outgrown Lingvo, becoming an independent standard and a format supported by other dictionary shells, including one of the most famous of its kind, GoldenDict.

But ABBYY alone would not have achieved such success without the help of a large army of enthusiastic lexicographers who, year after year, tirelessly digitized paper dictionaries and converted digital ones, from miniature specialized works to huge general-purpose volumes.

One of the most famous and fruitful groups has long been working on the site forum.ru-board.com. Over time it has accumulated both a vast collection of dictionaries and a thorough knowledge base with tools to help their creators and editors. Many scripts and programs have been written there, and the set reflects the history and shifting popularity of programming languages more or less suited to text processing: Perl and Python, batch languages of various shells, MS Word and Excel macros, and compiled programs in general-purpose languages.
Until recently, however, one language was almost unrepresented in this field. I would like to fill that gap and pay tribute to the rapid growth of the power, functionality and popularity of JavaScript. It can be of great assistance to today's programmer-lexicographers, especially at the border between online and local lexicography.

Creating a local copy of an online dictionary usually takes several stages: saving HTML pages with programs like Teleport, clearing them of tags with regular expressions (in text editors, macros or scripts), and final markup in DSL. JavaScript in its Node.js incarnation can significantly shorten and ease this path: the language is native to the web and can operate on network data without dropping to the shaky, ever-changing level of raw markup and regular expressions, working instead at the level of DOM elements.
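To give a first taste of this approach, here is a minimal sketch of DOM-level page handling with the jsdom module used throughout the article (jsdom.env is its legacy API, current at the time of writing; newer jsdom versions offer JSDOM.fromURL instead). The URL is real, the rest is purely illustrative:

    const jsdom = require('jsdom');

    jsdom.env('https://www.urbandictionary.com/browse.php?character=A',
      function (err, window) {
        if (err) return console.error(err);
        // query the page as a DOM tree instead of wrestling with raw markup
        const links = window.document.querySelectorAll('a');
        Array.from(links).forEach(a => console.log(a.href));
      });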

I will try to illustrate the possibilities of the language and some of its libraries with the example of creating a local copy of one of the richest and most popular explanatory English dictionaries born on the net: the Urban Dictionary. The fruits of these efforts can be judged by the releases on popular trackers:

rutracker.org/forum/viewtopic.php?t=5106848
nnm-club.me/forum/viewtopic.php?t=951668
kinozal.tv/details.php?id=1389116

If you are not yet planning to save any online dictionary, you may still want to look at the third part of the article: it collects examples of other common tasks in working with electronic dictionaries that can be solved with the help of Node.js.

It should be said that programming is just a hobby for me. This is both a warning about the amateurishness of the examples that follow and an encouragement for those who, like me, have only a humanities education.

It is assumed that the reader knows JavaScript in its core and applied forms and has worked out the basics of Node.js on their own. If not, you will have to start with the essentials or fill in the gaps: JavaScript, the DOM and Node.js.

In this article we limit ourselves to processing static pages (meaning pages whose key content does not change with JavaScript turned off) with console scripts. In the next part we will look at saving dynamic sites (whose key content is built by scripts) and touch on programs with a GUI.

Since we will run the scripts only on our own machines while using new language features, I recommend installing the latest version of Node.js.

I. Preliminary stage: getting the list of addresses of dictionary entries



There are at least three ways to create a local copy of an online dictionary.

1. In the worst case, the dictionary provides no reliable way to enumerate all its articles. Then you have to analyze the address pattern and substitute into it words from some more or less complete list of lexemes of the language (a set of headwords can be borrowed from the largest digitized explanatory dictionary), discarding failed requests.

2. Some dictionaries let you walk a chain from the first headword to the last (via a "next word" link or a set of links to the following headwords). This is the easiest way, but not the most transparent: it is hard to estimate the total number of headwords in advance and then to monitor the progress of copying. So although the Urban Dictionary provides this possibility (each word's page carries a column of links to the nearest preceding and following articles), we will use the third method.

3. If the dictionary has a separate index of links to all dictionary entries, we first copy that entire set of links to a file. This gives us an idea of the upcoming volume of requests and lets us track the percentage completed. For example, in the Urban Dictionary, pages like www.urbandictionary.com/browse.php?character=A , www.urbandictionary.com/browse.php?character=A&page=2 , etc. list the addresses of all articles whose headwords begin with the given letter (such as www.urbandictionary.com/define.php?term=a , www.urbandictionary.com/define.php?term=a%5E_%5E , etc.).

So the whole process of saving the dictionary is divided into two stages, with a separate script responsible for each.

Here is the first script, which saves the list of links to dictionary entries:

UD.get_toc.js

1. In the initial part of the script we load the necessary library modules (or just the needed methods from them, if those are called frequently). Almost all the modules are built in and installed along with Node.js itself. Of the external ones we need only jsdom: Node.js by itself cannot parse HTML pages and turn them into a DOM tree, and that ability is supplied by this module. Installing modules is simple, since npm comes with Node.js: open a console, go to the folder with the script, type npm install jsdom and wait for the download and installation to finish. The module itself and the dependencies it needs will appear in the node_modules folder, where our script will find them.

After loading the modules, the script determines the folder the files will be saved to (if the user did not specify one with the first command-line argument, the script's own folder is chosen) and creates three future documents in it: the list of addresses of dictionary entries; the list of processed index pages from which those addresses are taken; and a report on any errors that occur.

At the end of the first part, four service variables are created (a sketch of the whole preamble follows this list):

- an array of the English alphabet (for taking letters one at a time when building the URLs of index pages; the last element added to the array is *, which stands for the list of headwords beginning with special characters);
- the previous and current request URLs (so that, in case of errors, we can tell whether we are still retrying the same ill-fated address or the error concerns a new address and must go into the report);
- a flag of user interruption of the script.
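A minimal sketch of this preamble, with assumed file and variable names rather than the script's actual identifiers:

    const fs = require('fs');
    const path = require('path');

    // output folder: the first command-line argument or the script's own folder
    const dir = process.argv[2] || __dirname;

    // the three output files: entry addresses, processed index pages, errors
    const tocFile = path.join(dir, 'toc.txt');
    const doneFile = path.join(dir, 'done.txt');
    const errFile = path.join(dir, 'errors.txt');

    // service variables
    const alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ*'.split(''); // '*' = special characters
    let prevUrl = '';
    let curUrl = '';
    let interrupted = false;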

2. In the second part we set handlers for two events: any termination of the script (there we close all the files and call the function that draws the user's attention to important events with a sound) and interruption of the program by a user command (triggered by pressing Ctrl+C; it toggles the interrupt flag, which is checked before each new network request).
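A sketch of the two handlers, continuing the assumptions above (playAlert() is the sound function described below):

    process.on('exit', () => {
      // close the output files here, then draw the user's attention
      playAlert();
    });

    process.on('SIGINT', () => {          // Ctrl+C
      interrupted = true;                 // checked before each new request
      console.log('Finishing the current request, then stopping...');
    });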

3. In the third part we launch the query cycle that will receive and save the lists of dictionary-entry addresses. This part is divided into two logical blocks.

a. If the report file of processed pages is empty, the script is starting from scratch rather than resuming after an abnormal termination or a user interruption. In this case we print the first letter of the alphabet in the console window and its title bar, extract that letter from the alphabet array and call the page-fetching function with the appropriate URL pattern.

b. If the file is not empty, the script has already run. We need to extract the last processed address from the file to form the request for the next one in line. Since the report file can be large, we do not load it into memory as a whole but use a module that reads the file line by line (ignoring blank lines, just in case). Having reached the end, we will have the desired address in a variable. By parsing this address we learn which letter the script was processing last and determine the index page that follows the last one saved before the program terminated. Based on this we trim the alphabet array up to and including the required letter, print the new starting point to the console and call the page-fetching function with a template reflecting the desired letter and page. A sketch of the line-by-line reading follows.
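The built-in readline module makes such reading straightforward; a sketch under the same assumptions (fs and doneFile as in the preamble sketch):

    const readline = require('readline');

    let lastUrl = '';
    readline.createInterface({ input: fs.createReadStream(doneFile) })
      .on('line', line => { if (line.trim() !== '') lastUrl = line; })
      .on('close', () => {
        // parse lastUrl here: extract the letter and the page number,
        // trim the alphabet array and resume the request cycle
      });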

This completes the procedural part of the script. Three functions follow: one auxiliary and two main ones, which call each other in turn in the query cycle.

4. For the audio alert in the auxiliary playAlert() function I chose the console cross-platform player from the ffmpeg suite (see its launch options on the developers' website), but you can use any other player or a sound-generation module that relies on system tools. The sound itself can also be any you like.

5. The getDoc(url) function sends the request for the next index page with a list of dictionary-entry addresses. First it checks whether the user has demanded to interrupt the script (the script runs for several hours, so a break may be needed). The function then updates the variables of the past and upcoming requests. Finally, it commands the jsdom module to request the page, passing along the function that must be called when the page is received.
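A sketch of getDoc(url) under the same assumptions; jsdom.env is the legacy jsdom API whose callback signature matches processDoc(err, window):

    const jsdom = require('jsdom');

    function getDoc(url) {
      if (interrupted) process.exit(0);   // honor the user's Ctrl+C
      prevUrl = curUrl;                   // remember the previous request
      curUrl = url;
      jsdom.env(url, processDoc);         // processDoc(err, window) runs
    }                                     // when the page arrives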

Two additional features are commented out in the code.

a. If you plan to run several scripts in parallel to speed up downloading, it is better to do so through an anonymizing proxy. I tested the Fiddler + Tor combination (Tor in the non-browser Expert Bundle version), though I did not use it for the whole run: it slows down a single process's communication with the server, and I did not want to complicate the job by splitting it into parallel processes. See an example implementation here.

If you still want to parallelize the script, you will need either to specify different folders for the output files at launch or to run separate copies of the script from different folders. Those folders must already contain report files of processed pages consisting of at least one line with the address immediately preceding the assigned portion of addresses.

b. Another precaution against a server-side ban is a delay between requests. It is enough to wrap the request call in setTimeout and experiment with the pause length. My experience showed that the natural pauses between requests are enough for the Urban Dictionary servers; no extra breaks are required.
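The delay amounts to one line; the 500 ms here is an arbitrary illustration:

    setTimeout(() => jsdom.env(url, processDoc), 500);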

6. The processDoc(err, window) function is called by the jsdom module once it has received a page or run into an error, hence the two corresponding arguments.

First the function checks the err argument: if it is defined, the request failed. In this case the script signals with a sound, writes a message to the error file (if this is the first error for the given URL rather than another one in a chain of retries), prints the information to the console window and title bar, and restarts the request by calling getDoc(url) with the same address.

If the err argument is empty, the function begins to analyze the received document. There can be several outcomes and reactions to them; a schematic of the branches follows the list.

a. The page contains a portion of links to dictionary entries. Then the function writes the addresses of these links to the dictionary contents file, writes the address of the current page to the file of processed pages, reports the number of saved links to the console and tries to find the URL of the next index page. If the search succeeds, the program prints the details of the next request (letter and page number) to the console and hands the URL to the already familiar getDoc(url). If the search fails, the program checks the alphabet array: if letters remain, it moves on to a new one; if it is empty, it quits.

b. If there are no links on the page but the address matches the requested one, the server most likely suffered a transient failure (this happens, for example, when the server reports temporary unavailability). In this case the script repeats the request.

c. If there are no links and the address does not match the requested one, a redirect has occurred. This is possible because of a peculiarity of the Urban Dictionary's index: sometimes the estimated number of index pages for the current letter exceeds the actual number, and a request for a nonexistent page number at the end of a letter block redirects the user to the main page. In this case the script moves on to the next letter, if the alphabet array is not empty.

d. If the array is empty, the script exits.
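Schematically, continuing the same sketch (the selector and the makeUrl helper are placeholders, not the script's actual code):

    function processDoc(err, window) {
      if (err) {
        playAlert();                                  // sound the alarm
        if (curUrl !== prevUrl) fs.appendFileSync(errFile, curUrl + '\n');
        return getDoc(curUrl);                        // retry the same address
      }
      const doc = window.document;
      const links = doc.querySelectorAll('a.entry');  // placeholder selector
      if (links.length > 0) {                         // a. links found:
        // save them, log the page, find the next page or take a new letter
      } else if (window.location.href === curUrl) {   // b. same address, no links:
        getDoc(curUrl);                               //    transient failure, retry
      } else if (alphabet.length > 0) {               // c. redirect: next letter
        getDoc(makeUrl(alphabet.shift(), 1));
      } else {
        process.exit(0);                              // d. nothing left: quit
      }
    }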

As a result we get a file with the dictionary's table of contents. The other two files are auxiliary: once you have read through any recorded errors, they can be deleted.

II. The main stage: obtaining the text of dictionary entries



The structure of the second script is similar; the differences mainly stem from a substantial increase both in running time (now measured in days rather than hours) and in the complexity of page processing:

UD.get_dic.js

1. In the first part we again load almost the same modules, and then check two command-line arguments: the first is the folder with the input file (where the script will look for the list of links to dictionary entries saved at the previous stage), the second is the path to the folder for the new output files. In both cases, if an argument is missing, the script's own folder is used. The script then checks for the presence of the links file; if it is not found, the program exits with an appropriate message.

Next, we define several variables:

- a regular expression for formatting large numbers and the number of milliseconds in an hour; both will be used regularly later on;
- containers for permanent or temporary data (the list of links to dictionary entries, the list of headings of the current article and the list of its sections, that is, of the different user definitions of the headword);
- the already familiar variables for the previous and upcoming requests;
- variables for calculating and displaying the speed of the script;
- user interruption flag.

2. The second part introduces the event handlers already described above: the completion of the script and the user's commands to interrupt the work.

3. In the third part we check, by the size of the main output file, whether this is the program's first launch or a resumption after a break. If the work is only beginning, we write the BOM and the initial DSL format directives into the file of the future dictionary. A sketch follows.
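A sketch of that initial write; the file name is assumed, the directives are the standard DSL header:

    const fs = require('fs');
    const dslFile = 'UrbanDictionary.dsl';  // assumed name

    if (!fs.existsSync(dslFile) || fs.statSync(dslFile).size === 0) {
      fs.appendFileSync(dslFile,
        '\uFEFF' +                          // BOM
        '#NAME "Urban Dictionary"\n' +
        '#INDEX_LANGUAGE "English"\n' +
        '#CONTENTS_LANGUAGE "English"\n\n');
    }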

4. The fourth part completes the procedural section of the program. In it we first read the input file with the list of links to dictionary entries into the container that will drive further requests (it plays the same role as the alphabet array in the first script: it becomes the starting point of our request cycle). Then, as in the previous script, we check the report file of already processed addresses. If there is something in it, we find the final line, which records the last successfully saved dictionary entry, trim the array of addresses awaiting processing, remember the amount of remaining work, and launch a function that once an hour will calculate the script's speed and roughly predict the finishing time (in the hypothetical case of uninterrupted work). Then we print the number of remaining addresses to the console (it is a large number, so we separate its digit groups with spaces for readability) and start the usual cycle of requesting and saving pages. If the report file is empty, we skip reading it and go straight to the second half of the listed actions.

If the input file of links to dictionary entries turns out to be empty, we inform the user and terminate the program until better times.

Then come the functions: a few small auxiliary ones and the two main ones that make up the turns of the query cycle already familiar to us.

5. The playAlert() function is no different from the one in the first script.

6. The secure(str, isHeadword) function will be used regularly when saving dictionary entries to the DSL file. It has two tasks: to translate control characters (characters from the initial ASCII block) found in the online text into a conditionally readable form that will not confuse the DSL compiler, and to shorten overlong words in article bodies that exceed the DSL format's limits (headings are shortened by different rules).
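A sketch of such a function; the hex notation and the length limit are my assumptions, not the script's actual rules:

    function secure(str, isHeadword) {
      // show control characters as readable <XX> hex codes
      str = str.replace(/[\x00-\x1F]/g,
        c => '<' + c.charCodeAt(0).toString(16).toUpperCase() + '>');
      if (!isHeadword) {
        // shorten overlong 'words' in card bodies (limit chosen for illustration)
        str = str.replace(/\S{100,}/g, w => w.slice(0, 99) + '\u2026');
      }
      return str;
    }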

7. The setSpeedInfo() function works in parallel with the main program flow. Once an hour it replaces the information line showing the script's speed and the remaining time (at the start the line holds only question marks, which are replaced with numbers after the first hour). The function is fairly transparent; only two notes are needed: the restMark variable stores the number of addresses that remained at the previous speed calculation, and the audio signal about the recalculation is launched asynchronously (the script does not wait for the sound to end), for which we stored in a variable a separate method for asynchronously launching child processes.
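A sketch of the hourly recalculation; restCount (the number of addresses left) and the player command line are assumptions:

    const { execFile } = require('child_process');   // asynchronous launcher
    const HOUR = 60 * 60 * 1000;

    function setSpeedInfo() {
      let restMark = restCount;             // addresses left at the last check
      setInterval(() => {
        const speed = restMark - restCount; // entries saved during the hour
        const hoursLeft = speed > 0 ? (restCount / speed).toFixed(1) : '?';
        restMark = restCount;
        console.log(`~${speed} entries/hour, ~${hoursLeft} h left`);
        execFile('ffplay', ['-nodisp', '-autoexit', 'alert.wav']); // async sound
      }, HOUR);
    }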

8. The getDoc(url) function that sends the network request is no different from the one described in the previous section, including the commented-out precautions against a server ban and the ways to speed up the work.

9. The processDoc(err, window) function keeps the framework of its namesake in the previous script but differs significantly in how it processes and saves the information from the received page: after all, we now have to analyze and convert a whole block of data rather than merely write out a set of links.

The beginning of the function, however, is unchanged: we still check the err argument and, if it is defined, write the information to the error report file and restart the failed request.

If there is no error, we begin to analyze the page. The following outcomes are possible.

a. The page contains the expected dictionary entry, and the page address gives no reason to suspect a redirect.

In this case we turn all the parts of the dictionary entry, that is, all the user definitions of the word, into an array, and then move on to analyzing each element.

Each user definition consists, as a rule, of three main subsections: the heading (which may be identical to the main heading of the article or a variant of it with minor deviations), the definition itself, and examples (the last part is optional).

All the headings are accumulated in a special buffer (to avoid duplicates we use the newer JavaScript data structure Set, which keeps only unique elements). Before that we pass each heading through the secure(str, isHeadword) function and then create two variants: a heading for the headword section of the DSL card and a heading to be placed at the very beginning of the card body, since these areas have different requirements. In each variant we escape the required characters. If the first variant is too long, we shorten it to the format's limit before placing it in the buffer.

Since the jsdom module extracts text from DOM elements through the textContent property, which has a few drawbacks, we also insure ourselves against the loss of line breaks by additionally inserting their symbolic variants before some br tags.
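A sketch of that insurance, where doc stands for the page's document and el for the element being read:

    Array.from(doc.querySelectorAll('br')).forEach(br => {
      br.parentNode.insertBefore(doc.createTextNode('\n'), br);
    });
    const text = el.textContent;   // the line breaks are now preserved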

Then we sequentially process the definition and example parts before storing them in temporary variables: we trim whitespace at the beginnings and ends of lines, collapse repeated spaces, escape the required special characters, insert the initial indents required for the card body, and insure against the loss of empty divider lines during future DSL compilation.

Having finished with the main part, we save the service information into variables: the votes for and against each definition and its creation time (removing a part that is redundant for us and appears in anonymous subsections).

At the end, we merge all the parts into the next element of the buffer accumulating the parts of the dictionary entry.

Then we check whether the article is spread over several pages. If so, we request the next page to repeat the analysis and extend our heading and definition buffers. If not, we assemble the finished card from the accumulated parts, write it to the dictionary file, record the processed address in the report file and move on to the next request.

b. The page does not contain a dictionary entry, although the address matches the requested one. Most likely the server has failed temporarily, and the script repeats the request.

c. The address does not match the requested one: a redirect has occurred. The script notes this and moves on to the next address in the list.

When the list of addresses is exhausted, the script finishes its work, and we get the dictionary file in DSL format (the auxiliary files can again be deleted after reviewing any errors).

III. Related tasks



To finish, here are examples of solving other common tasks that arise when working with electronic dictionaries.

1. Converting between encodings



ABBYY Lingvo expects DSL sources in UTF-16, while GoldenDict also understands UTF-8 (among other encodings), so dictionary files sometimes have to be converted from one encoding to the other.

Node.js handles such conversions easily. Here are two simple scripts, one for each direction:

utf8_2_utf16.js

utf16_2_utf8.js
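A minimal sketch of both directions (file names assumed); Node.js natively understands the 'utf8' and 'utf16le' encodings, and the BOM is handled by hand:

    const fs = require('fs');

    // UTF-8 -> UTF-16 (Lingvo wants UTF-16LE with a BOM)
    const u8 = fs.readFileSync('dic.utf8.dsl', 'utf8').replace(/^\uFEFF/, '');
    fs.writeFileSync('dic.utf16.dsl', '\uFEFF' + u8, 'utf16le');

    // UTF-16 -> UTF-8
    const u16 = fs.readFileSync('dic.utf16.dsl', 'utf16le').replace(/^\uFEFF/, '');
    fs.writeFileSync('dic.utf8.dsl', '\uFEFF' + u16, 'utf8');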

2. Replacements in text files



Mass search-and-replace operations come up constantly when editing dictionary files. Several scripts of this kind follow.

a. Replacements in an arbitrary file

Replacements of this kind were needed more than once while working on the Urban Dictionary file. The following script performs them:

replace.js

b. Removing the BOM

When the text is converted to PDF, the BOM at the beginning of the file can end up in the PDF as a stray character. The following script removes the BOM from a UTF-8 file:

deBOM.js
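A sketch of such removal (file names assumed); when read with the 'utf8' encoding, the BOM surfaces as a leading \uFEFF character:

    const fs = require('fs');

    const text = fs.readFileSync('dic.txt', 'utf8');
    fs.writeFileSync('dic.noBOM.txt', text.replace(/^\uFEFF/, ''));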

The next script performs a set of replacements specific to the DSL file, among them:

- adjustments to the DSL header directives;
- adding or removing the #ICON_FILE directive, which names the icon embedded in the compiled LSD dictionary;
- changes to the indentation tags of card lines (such as [m1] and [m2]).

replace_in_dsl.js

3. Counting dictionary elements

To describe the finished Urban Dictionary release, I needed to count its elements: headwords, cards and definitions. The following script does the counting:

count_dsl_elements.js
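A sketch of such counting under the DSL conventions (header lines start with #, headwords start at column 0, card bodies are indented); the file name is assumed:

    const fs = require('fs');
    const readline = require('readline');

    let headwords = 0, cards = 0, prevWasBody = true;
    readline.createInterface({ input: fs.createReadStream('dic.dsl') })
      .on('line', line => {
        if (line === '' || line.startsWith('#')) { prevWasBody = true; return; }
        if (/^\s/.test(line)) { prevWasBody = true; }  // a body line
        else {                                         // a headword line
          headwords++;
          if (prevWasBody) cards++;  // first headword after a body opens a card
          prevWasBody = false;
        }
      })
      .on('close', () => console.log(headwords + ' headwords, ' + cards + ' cards'));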

4. Extracting headwords

The next script extracts the full list of headwords from a DSL file (useful, for example, for comparing dictionaries):

extract_headwords.js
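A sketch under the same DSL conventions, with assumed file names:

    const fs = require('fs');
    const readline = require('readline');

    const out = fs.createWriteStream('headwords.txt');
    readline.createInterface({ input: fs.createReadStream('dic.dsl') })
      .on('line', line => {
        // keep non-empty lines that are neither directives nor indented bodies
        if (line !== '' && !line.startsWith('#') && !/^\s/.test(line))
          out.write(line + '\n');
      })
      .on('close', () => out.end());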

5. Checking headword uniqueness

Duplicate headwords are best caught before compiling the dictionary. The following script checks a list of headwords for uniqueness:

check_headword_uniqueness.js
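A sketch of the check, again with a Set, which keeps only unique elements; the file name is assumed:

    const fs = require('fs');
    const readline = require('readline');

    const seen = new Set();
    readline.createInterface({ input: fs.createReadStream('headwords.txt') })
      .on('line', line => {
        if (seen.has(line)) console.log('duplicate: ' + line);
        else seen.add(line);
      })
      .on('close', () => console.log(seen.size + ' unique headwords'));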

6. Converting DSL to plain text

To produce the PDF and DjVu versions of the dictionary, the DSL source first has to be turned into plain text without markup:

dsl_2_txt.js

Two external modules deserve a note:

string-width measures the display width of a string, taking wide (for example, CJK) characters into account; it is used in the hasCJK(str) helper function.

wordwrap wraps long lines to a given width.

The script strips the DSL markup while preserving the layout of the text: escape sequences are resolved, the indentation tags (such as [mN]) become the corresponding number of leading spaces, and the remaining tags are removed.

7. Paginating the text file

Before converting the plain text to PDF, it is worth splitting it into pages:

paginate.js


8. Splitting files

Finally, two scripts for splitting overlong files into parts.

a. Splitting by alphabet

A single huge file is inconvenient to handle; Adobe Acrobat, for example, may choke on it. The following script splits the dictionary into parts by the first letters of the headwords:

split_by_abc.js

The newPart(chr) function, which opens each new part, does the necessary bookkeeping: it closes the previous file, creates the next one, and writes the BOM and the DSL header directives into it.

b. Splitting by page count

The paginated text destined for PDF runs into one more limit: 65000 pages. The following script splits the text into chunks of no more than 65000 pages each, which Adobe Acrobat can then digest:

split_by_pages.js


That is all for now. I hope these examples have shown how useful Node.js can be to a programmer-lexicographer, from saving online dictionaries to converting and splitting the results. )


Source: https://habr.com/ru/post/274475/

