Ripping online dictionaries with Node.js, part 2: dynamic pages; bringing in NW.js

In the previous part, I described the basic operations and related tasks involved in copying online dictionaries with Node.js. This part describes an important additional tool for converting web sources of a certain level of complexity.

I. Why do we need NW.js?

1. The more complex the structure of a dictionary's web pages, the more reason there is to rely on the full range of capabilities of a mature browser engine. JSDOM is a fairly advanced library, but even it cannot compare to the full toolset of Chromium.

2. People who create and convert digital dictionaries are, to a large extent, humanities scholars brought into IT by the will of fate. They are sometimes more comfortable with a GUI than with the command line, especially if they do not write the utilities themselves but use the ready-made work of their colleagues. NW.js provides easy ways to build a GUI for simple applications that analyze, process, and convert web pages.

As an example for a brief description of this tool, I chose the site www.wordspy.com. Word Spy is a constantly growing dictionary of English neologisms that have already become part of the language. That is, they were not coined and used once by an author for private needs (such words are called "occasionalisms"); they have appeared in several print and online sources of different origin. Compared to the Urban Dictionary, which served as the illustration for the first article, Word Spy has two significant differences: the page contents are built asynchronously by scripts, and the structure of those pages is largely unpredictable and complex (whereas the Urban Dictionary used a small set of tags whose order and combinations were uniform). This was the decisive reason to turn to NW.js.

I do not plan to repeat parts of the official documentation here; it is already quite complete and systematic. If you are not familiar with NW.js, it is better to start there (then you can skim the wiki pages on GitHub: many of them are outdated, but they still contain interesting things not mentioned in the main documentation). I will confine myself to notes on applying the project to the chosen task.

II. Preparatory stage

1. Getting the list of entry addresses

On the whole, the first preparatory script closely resembles the program from the first article. For now we will not even bring in NW.js, since we only need to pull the necessary links out of the pages, and JSDOM copes with that well.

I will note only the significant differences.

a. By the time the page has loaded and the function reacting to that event is called, the window and document objects of the page are not yet ready, so we need an additional loop of checks. Since the page is filled by asynchronous scripts, tracking the load event gives us nothing; we could attach event handlers for DOM changes, but in this situation that seems an unnecessary complication. After analyzing how the site scripts work, we pick some significant element of the page whose presence means that the structure we need has been built (in this case, the block with the list of links to dictionary entries). We define the selector of this element alongside the variables we already know (selectorsToCheck in the initial code block; later, when different pages require different check elements, we will make this variable an array). The second addition is the number of milliseconds that sets how often the key element is checked (checkFrequency).

b. Word Spy has a convenient two-level table of contents for the whole dictionary: 1) a list of all tags, divided into several thematic blocks; 2) each tag link opens a list of all headwords related to that tag. We will add both the top-level list of tags and all the per-tag headword lists to our dictionary. To do this, we add the starting page with the tags to our initial array of addresses (tocURLs), which serves as the source for the list of entries. Also, unlike the script from the first article, where this array was called abc, we make it a list of URLs right away rather than building the URLs on the fly from the alphabet, since the address of the tags page does not fit a single URL pattern.

c. One thing in the current task gets simpler: Word Spy is orders of magnitude smaller than the Urban Dictionary, so both the address lists and the dictionary entries fit on a single page each. We do not have to check for multipage continuations, either in this script or in the script that saves the dictionary itself, which simplifies URL construction and the corresponding sections of code.

d. In the getDoc function, the request to the JSDOM library changes slightly: the Urban Dictionary was a static dictionary, but here we have to require that the scripts on the pages be loaded and executed, which is reflected in the request options.

Since there is now another asynchronous step in our code, we split the old processDoc function in two: the checkDoc function checks both for possible errors and for the completion of the site scripts, and passes the finished document to the deferred processDoc function. The check loop runs a limited number of iterations (for example, until 5 seconds have passed). If the check element appears during this time, we move on to the document processing function. If the element is still missing after the timeout, we check whether there was a redirect: if not, we can suspect a hiccup on the server and repeat the request; if the server redirected us somewhere, all that remains is to warn the user and stop the program for the time being. In practice, the site scripts usually finished in 100–400 milliseconds, although sometimes the delay was several seconds and only occasionally exceeded the timeout (in such cases, one repeated request was enough).

e. Processing the finished page and extracting the necessary links does not differ significantly from what was described in the first article, except that we add the address of the thematic list of all tags to the list of entry URLs, so that this general table of contents is also saved for ease of navigation in the future dictionary.
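The check loop from item d can be sketched roughly as follows. This is only an illustration under assumptions, not the script from the article: checkFrequency is named in the text, while checkTimeout, nextStep, the step names, and the signature of checkDoc are invented here.

```javascript
// Assumed values: the text mentions checkFrequency and a timeout of about 5 seconds.
const checkFrequency = 100; // ms between checks of the key element
const checkTimeout = 5000;  // ms before we give up waiting (hypothetical name)

// Decide what to do after one check of the page (hypothetical helper).
function nextStep(found, elapsed, redirected) {
  if (found) return 'process';            // key element is there: parse the page
  if (elapsed < checkTimeout) return 'wait';
  return redirected ? 'abort' : 'retry';  // redirect: warn and stop; otherwise ask again
}

// Poll the document until the key element shows up or the timeout expires.
function checkDoc(doc, selector, redirected, onDone, elapsed = 0) {
  const found = doc.querySelector(selector) !== null;
  const step = nextStep(found, elapsed, redirected);
  if (step === 'wait') {
    setTimeout(checkDoc, checkFrequency, doc, selector, redirected, onDone,
               elapsed + checkFrequency);
  } else {
    onDone(step);
  }
}
```

Keeping the decision in a small pure function makes the three outcomes (process, retry, abort) easy to check without a live page.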

2. Making a list of tags

Since the page structure of the chosen dictionary is complex and predictable only up to a point, we add a preliminary pass over all the necessary pages before saving, in order to collect information on the types and frequency of the HTML tags used. To do this, we create a script that is in many respects similar to the dictionary-saving script, except that for now it extracts extremely simple information (so we can still limit ourselves to JSDOM).

This script is a partial hybrid of the familiar script for saving the Urban Dictionary and the script described just above: it reads the finished address list (from the beginning, or from the place where it stopped, which it recorded in a special log before stopping), loads each page, runs its scripts, and waits until they build all the necessary contents of the dictionary entry. I will point out only a few new parts.

a. When saving the dictionary, we created three files: the dictionary source itself, the process log recording saved addresses, and the error log. Here two files are enough: we record the tags in such a way that the tag file can, if necessary, double as the log of work done, for recovery after an interruption.

b. The selectorsToCheck array of key selectors now contains two elements: one for ordinary dictionary pages and one for pages with a list of tags (or of headwords grouped under one tag).

c. To avoid overloading the analysis with unnecessary information, we make a rough preliminary assessment and identify some elements that we will not save to the dictionary and therefore do not need to break down into tags now: we put the selectors of these elements in the selectorsToDelete variable and remove the elements before parsing.

d. The analysis of each page consists of extracting all tags from the element of interest, registering their names in the cumulative tags object (incrementing the count for each tag), and writing the page address and the list of tags found on it to the file. At the end of the script, the final tags object is also written to the file. Thus we get both the overall tag statistics and their distribution across pages, which lets us see examples of a tag's use by opening any of the addresses under which the tag is recorded. If the script is interrupted, we can restore the tags statistics object from the information already written to the file. These two similar processes, reading pages and reading entries from the log, appear in two corresponding places in the script: in the initial part (under the line console.log('Reading the tag file...');) and in the processDoc function.
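The counting and recovery described in item d might look roughly like this. It is a sketch under assumptions: the function names and the line format of the tag file (page URL, a tab, then space-separated tag names) are hypothetical and not taken from the original script.

```javascript
// Add the tag names found on one page to the running statistics object.
function addTags(stats, tagNames) {
  for (const name of tagNames) {
    stats[name] = (stats[name] || 0) + 1;
  }
  return stats;
}

// Rebuild the statistics from lines already written to the tag file.
// Assumed line format: "<page URL>\t<tag> <tag> ..."
function restoreStats(logText) {
  const stats = {};
  for (const line of logText.split('\n')) {
    const [url, tags] = line.split('\t');
    if (url && tags) addTags(stats, tags.split(' '));
  }
  return stats;
}
```

For a live page, the tag names could be collected with something like `Array.from(root.querySelectorAll('*'), el => el.tagName)`, where `root` is the element of interest.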

The rest of the program code does not contain anything unfamiliar.

III. Saving the dictionary

Programs on NW.js consist of at least two files: a service file in JSON format that describes the main parameters of the program, and an HTML page that describes the GUI and contains the scripts. The scripts can be moved to a separate file or files and referenced by a local or network address.

1. `package.json`

Here is the minimum content of our service file:

    {
      "name": "NW.WordSpy.get_dic",
      "main": "WordSpy.get_dic.html",
      "window": {
        "title": "Save WordSpy.com"
      }
    }

On first start, an NW.js program creates its own subfolder in the system folder for user data; the subfolder's name is taken from the name field.

The main field contains the path to the main file with the GUI elements and the main script of the program.

The optional window subsection contains the parameters of the program window being created; for now we limit ourselves to the title.

More information about the format and fields of the service file can be found in the documentation.

2. `WordSpy.get_dic.html`

The window of our program will be relatively simple and in some ways even resemble a console application.


In the head of the markup, in addition to the necessary minimum, you can add an arbitrary block of CSS. Here it is purely illustrative, and we will not dwell on it.

The first elements of our GUI are two fields for the parameters that we previously set through command-line switches: the input file with the addresses of the dictionary pages, and the folder in which the output files will be created (the dictionary itself, the log of saved pages, and the error log). Previously we specified the folder containing the input file and hard-coded the file name so that the switch would be shorter; this is no longer necessary, and we can select the address file directly. More information about the features of file fields in NW.js can be found in the NW.js documentation.

Next comes the button that launches the main action of the program, followed by the field for displaying information. With CSS and a few script tweaks we make it look like a console output window, to stay close to the familiar console versions of our scripts.

The invisible audio element serves to attract the user's attention; we used to use a console player for this. The address of the sound file can be anything; I used one of the system files from the standard sound scheme.

Finally, the last element plays the key role of the "browser": we load our pages into this embedded frame to analyze them and extract data. The features of frames in NW.js, and some precautions related to them, are described at the link we already know.

The look of the program at the beginning and at the end of saving the dictionary can be judged from the screenshots:

We moved the script part of the program into a separate file for convenience. It is better to reference it at the end of the page, so that by the time the script starts it can find all the necessary window elements and begin interacting with them.

3. `WordSpy.get_dic.js`

In the comments I will dwell only on the differences and innovations compared to the console script from the previous article, because the main structure and many code sections are the same.

a. We see the first difference at the beginning of the introductory part. Variables appear for the window and document of the program itself (so that they are not confused with the window and document of the loaded pages), and then for each GUI element. Since the file paths are built dynamically (not once and for all from command-line switches, but in response to user actions), we store them as mutable properties of the io object rather than as a separate set of constants. Another difference is the sets of selectors for different purposes, which make it more convenient to manipulate a complex document structure (they are already familiar from the previous script for analyzing tags). Finally, since interactivity increases, at the end of the introductory part we create several flag variables for the current state of the program and for user commands.

b. With a GUI, the temptation to close the window at the wrong moment is greater than with the console. Therefore we build a somewhat larger system of safeguards against incorrect termination of the program. To begin with, we assign the onExit() function as the handler for closing the window; its actions are described below.

c. As we could see from the documentation, the standard HTML5 precautions remain in force, and we cannot set the final file addresses through the attributes or properties of our file fields; this can only be done by a user action in the dialog box. But we can save the user time and effort by remembering the path to the folder in which the user is offered to select a file (and if the "file" is a folder, the initial and final addresses conveniently coincide). To do this, we use another service file in JSON format, config.json, which stores an object with two properties, one for each path we need. On startup, the program checks whether this file exists: if it does, the program reads its contents into the config object and writes the saved paths to the nwworkingdir properties of both fields. If there is no file, the object stays empty and the initial directory is determined in the browser's usual way.
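A minimal sketch of this settings handling is given below, assuming that the helper names (loadConfig, applyConfig, saveConfig) and the shape of the `fields` argument are hypothetical. Unlike the text, it does not test for the file first: a missing or unreadable file is simply treated as an empty config.

```javascript
const fs = require('fs');

// Read the saved paths; a missing or broken file yields an empty config.
function loadConfig(configPath) {
  try {
    return JSON.parse(fs.readFileSync(configPath, 'utf8'));
  } catch (err) {
    return {};
  }
}

// Pre-set the starting folder of each file dialog via the nwworkingdir attribute.
// `fields` maps (hypothetical) config keys to <input type="file"> elements.
function applyConfig(config, fields) {
  for (const key of Object.keys(fields)) {
    if (config[key]) fields[key].setAttribute('nwworkingdir', config[key]);
  }
}

// Store the current paths for the next run.
function saveConfig(configPath, config) {
  fs.writeFileSync(configPath, JSON.stringify(config), 'utf8');
}
```

The final file paths still come only from the user's choice in the dialog; the config merely decides where the dialog opens.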

d. After checking the saved settings file, we set event handlers for all interactive elements and call the first of them to force the elements into the correct initial state.

e. The checkDirs() function checks that all the necessary paths are defined: if at least one of them is missing, it displays a message in the information block; otherwise it writes the data to the saved settings file and unlocks the button that launches the main process.

f. The onStop() function responds to the command to interrupt the main process: it merely switches the flag for this command on, so that the process can be interrupted at a convenient moment.

g. The onExit() function reacts to an attempt to close the program window. If the dictionary is being saved at that moment, it asks for confirmation. On confirmation, the flags for interrupting the process and for exiting the program are switched on, so that both actions happen later at a convenient moment. If the user does not confirm, the attempt is ignored. If no save is in progress, the program closes without any questions.
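The logic of the stop and exit flags can be sketched like this. The flag names and the function signatures are assumptions made for illustration; the original functions probably work with global variables rather than a passed-in state object.

```javascript
// Hypothetical state flags for the program.
const state = { saving: false, stopRequested: false, exitRequested: false };

// Interrupt command: only raise the flag, the save loop reacts later.
function onStop(state) {
  state.stopRequested = true;
}

// Returns true when the window may close right away.
function onExit(state, confirmFn) {
  if (!state.saving) return true; // nothing in progress: close without questions
  if (confirmFn('The dictionary is being saved. Stop and exit?')) {
    state.stopRequested = true;   // the loop will stop at a convenient moment
    state.exitRequested = true;   // ...and the program will close afterwards
  }
  return false;                   // never close in the middle of a save
}

// In NW.js this could be wired roughly as follows (not run here):
// const win = nw.Window.get();
// win.on('close', () => {
//   if (onExit(state, (msg) => window.confirm(msg))) win.close(true);
// });
```

Registering a close handler means the window no longer closes by itself, so the handler has to force the close with `win.close(true)` once it is safe to do so.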

h. In the setSpeedInfo() function, the only significant change concerns the sound signal. For now I have left the update frequency and the format of the speed information as they were (once an hour), but they can be adjusted if necessary: the Urban Dictionary took many days to save, while Word Spy takes about an hour and a half, so it makes sense to bring both the update frequency and the unit of measurement down to minutes.

i. The updateInfo(str) function makes the information block behave like a console. We set the buffer size to 10 lines and cut off the extra lines from the top (the oldest information is there), scrolling the block to the last line. Through this function we continuously display current information while saving. For small dictionaries this behavior can be disabled (then the entire log of the rip is preserved), but in a long process such limits save memory and remove redundancy (especially since everything necessary is written to the logs).
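The 10-line buffer can be illustrated with a small helper. The names pushLine, maxLines, buffer, and infoBlock are hypothetical; only the limit of 10 lines and the scrolling to the last line come from the text.

```javascript
const maxLines = 10; // buffer size mentioned in the text

// Append a line and drop the oldest lines beyond the limit.
function pushLine(lines, str, limit = maxLines) {
  lines.push(str);
  if (lines.length > limit) lines.splice(0, lines.length - limit);
  return lines;
}

// With a DOM element (hypothetical `infoBlock`) and an array `buffer`,
// updateInfo could then look like this (not run here):
// function updateInfo(str) {
//   infoBlock.textContent = pushLine(buffer, str).join('\n');
//   infoBlock.scrollTop = infoBlock.scrollHeight; // scroll to the last line
// }
```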

j. The logError(evt) function is designed to respond to an error event inside the window of the embedded frame. So far it has never fired for me.

k. The secureLow(str) function performs low-level processing of the text of the loaded pages to bring it in line with the requirements of DSL, namely, to escape special characters. The secureHigh function, in turn, processes text blocks (removing extra spaces, inserting indents before the body of DSL entries, and adding a special insertion to preserve blank lines). In the console version from the first article we managed with one function, but here the order of extracting and formatting information changes somewhat, so we have to split this processing.
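The split into two levels might look roughly like the sketch below. Both bodies are assumptions: the set of escaped characters and the use of a backslash followed by a space to keep blank lines follow common DSL conventions as I understand them, and are not copied from the original script.

```javascript
// Low level: escape characters that DSL treats as markup (assumed set).
function secureLow(str) {
  return str.replace(/[\\\[\]{}()~@^#]/g, '\\$&');
}

// High level: collapse runs of spaces, indent each body line with a tab,
// and keep blank lines as an escaped space so they are not lost (assumption).
function secureHigh(str) {
  return str
    .split('\n')
    .map((line) => line.replace(/[ \t]+/g, ' ').trim())
    .map((line) => '\t' + (line === '' ? '\\ ' : line))
    .join('\n');
}
```

Running secureLow on the raw text nodes first and secureHigh on assembled blocks later keeps the backslashes added by the second step from being escaped again.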

l. saveDic() is the main function of the program, launched by clicking the button for saving the dictionary. It largely corresponds to the initial, procedural part of our console script from the first article, but there are a number of differences. First of all, we switch on the flag variable for the saving process and change the appearance and behavior of the main button: from now on it is responsible for interrupting the process. We also disable the file fields, which have served their purpose. Then we perform the familiar file manipulations: we check that the address list exists, create the dictionary and log files, read the address list, read the information about already saved pages if the save log contains any and shorten the task accordingly, and finally start the save loop by requesting the first page in the list. What is new in this code segment is the assignment of the load and error event handlers for the window of the embedded frame, which our loop needs in order to work.

m. getDoc(url) is the first link in the circular chain of saving. We call this function at the beginning of the loop and after processing each page. It starts by checking the interrupt flag: if the flag is on, the loop is broken and the process stops. If it is off, then after the familiar operations we change the address of the frame, forcing it to load a new page.

n. The checkDoc() function starts automatically in response to a full page load in our built-in browser. It is partly familiar from the previous scripts of this article. Only now we begin it by creating variables that keep us from confusing the main objects of the program window with those of the window of the loaded page. Then comes the familiar loop that checks whether the page content is ready. Depending on its results, we either proceed to processing the information, reload the page, or exit with a message about an unknown error.

o. The processDoc(iWin, iDoc, iLoc, iter) function extracts, processes, and saves the dictionary data of the page. It differs the most from the corresponding console part of the code, both because of the differences between the dictionaries and because of the features of the new tool.

We start by removing the unnecessary parts of the dictionary entry. Then we locate its key element and get a list of all its text parts (XPath lets us get exactly the final text nodes, without nested HTML elements, so that we can change their contents without risking damage to the structure of the document), and then subject all of these nodes to the low-level cleaning mentioned above. This way special characters are escaped throughout the text of the entry from the very beginning, and DSL tags can later be added on top of this text without serious consequences.
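Collecting the text nodes with XPath and cleaning them in place could be sketched as follows. The function name escapeTextNodes and its parameters are hypothetical; `document.evaluate` and the snapshot result type are standard DOM features available in the Chromium engine of NW.js.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Apply `escape` to every text node under `root`, leaving elements untouched.
// Returns the number of text nodes processed.
function escapeTextNodes(doc, root, escape) {
  const found = doc.evaluate('.//text()', root, null,
                             ORDERED_NODE_SNAPSHOT_TYPE, null);
  for (let i = 0; i < found.snapshotLength; i++) {
    const node = found.snapshotItem(i);
    node.nodeValue = escape(node.nodeValue);
  }
  return found.snapshotLength;
}
```

A snapshot result is used because it stays valid while the nodes are being modified, which an iterator result would not.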

Then we form the title of the future dictionary entry.

NW.js gives us a new option here: innerText. With JSDOM we had to use textContent.

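Pages with lists get headwords that start with «# ». For example, the general tag index becomes «# Tags by Category», and the page of a single tag becomes, say, «# acronyms and abbreviations».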


insertAdjacentHTML() , .

, , , : , innerText ; hr ; ( , , ) .

. . . (, smirk flame ). , CSS innerText .

DSL .

. , , CSS . - ( ), - : span , . ( HTML, DSL ) — , DOM .


: , , ; DSL ; URL ( , DSL , ).


: HTML , — DSL , . innerText , HTML , DSL, .

, secureHigh . ( , ), , , .

. endSaving() , . , /, , . , .

4.

NW.js , , , - . , , , , . 138 . — ( - ).


( 16.02.2016) rghost.net drive.google.com . DSL- UTF-8 UTF-16, LSD ABBYY Lingvo. : 5827; : 3419; : 9311.

Thank you for your attention.

Source: https://habr.com/ru/post/277513/
