What and why
I once needed to scrape information from a website, so I picked up Node.js and got down to business.
The site consisted of sections, and each section consisted of pages. Processing a single section took a lot of requests, in proportion to the number of pages.
At that point I ran into a limitation: the site started returning errors when requests came too often (more than a few per second). No problem, I thought, and solved it in the well-known way, with a kind of "asynchronous loop": once one page had been processed, I started a timer to process the next one.
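The "asynchronous loop" mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code I actually used at the time; the name processWithDelay, its parameters and the page names are made up for the example.

```javascript
// Processes items one at a time, waiting `periodMs` between items.
// `processWithDelay` is an illustrative name, not part of any library.
function processWithDelay(items, handler, periodMs, done) {
  let index = 0;

  function step() {
    if (index >= items.length) {
      if (done) done();
      return;
    }
    handler(items[index]);
    index += 1;
    // Schedule the next item with a timer instead of looping synchronously.
    setTimeout(step, periodMs);
  }

  step();
}

// Usage: "request" three hypothetical pages, one every 100 ms.
const visited = [];
processWithDelay(['page1', 'page2', 'page3'], function (page) {
  visited.push(page);
}, 100, function () {
  console.log('Visited:', visited.join(', '));
});
```

The first item is handled right away and every later one is handled from a timer callback, so the event loop stays free between items.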
Then I remembered that I had to parse several different sections of the site, and realized that this approach was becoming too inconvenient. So I wrote a tool, Conveyor, that processes “data elements” (that is, applies a handler function to the given objects) with a time delay between them. It also turned out to be handy for “heavy” computations that would otherwise run for a long time in a loop.
The Conveyor code is on GitHub; you can install it via npm (the package is called dataconveyor). More structured documentation is also on GitHub. You can use it however and wherever you like, without restrictions.
Below is a description of the Conveyor tool.
How to use
First, create an instance of the Conveyor object, specifying a data handler for it:
var conveyor = new Conveyor(function (element) {
    console.log(element);
}, { period: 100 });
Here we create an object that writes each element to the console, with an interval of 100 ms between elements. After initialization, add the data to be processed:
conveyor.add(12);
conveyor.add("Ahoj, Habr!");
conveyor.add([firstElement, secondElement]);
Note that when an array is passed, the elements firstElement and secondElement are processed separately, not the array as a whole. New data can also be added while processing is under way, i.e. conveyor.add() can be called inside the handler passed to the constructor.
Once the data has been added (processing, by the way, starts immediately after the addition), we can set a function that will be called after all handlers have run and the interval has elapsed:
conveyor.whenStop(function() { console.log('Done.'); });
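To make the behaviour described so far more concrete, here is a minimal stand-in that mimics it. It is not the dataconveyor source: the class name MiniConveyor and its internals are assumptions made for this sketch, and only the add / whenStop / period names are taken from the description above.

```javascript
// A minimal stand-in that mimics the described behaviour.
// NOT the real library: `MiniConveyor` and its fields are hypothetical.
class MiniConveyor {
  constructor(handler, options) {
    this.handler = handler;
    this.period = (options && options.period) || 0;
    this.pending = [];
    this.running = false;
    this.onStop = null;
  }

  // Arrays are unpacked so that each element is handled separately.
  add(data) {
    const elements = Array.isArray(data) ? data : [data];
    this.pending.push(...elements);
    if (!this.running) {
      this.running = true;
      this.tick(); // processing starts right after the first addition
    }
  }

  whenStop(callback) {
    this.onStop = callback;
  }

  tick() {
    if (this.pending.length === 0) {
      this.running = false;
      if (this.onStop) this.onStop();
      return;
    }
    this.handler(this.pending.shift());
    // Wait one period before taking the next element.
    setTimeout(() => this.tick(), this.period);
  }
}

// Usage, mirroring the calls shown above.
const seen = [];
const conveyor = new MiniConveyor(function (element) {
  seen.push(element);
}, { period: 100 });
conveyor.add(12);
conveyor.add(['a', 'b']);
conveyor.whenStop(function () {
  console.log(seen);
});
```

In this sketch the stop callback fires one period after the last element, once the queue is found empty. The real tool may differ in such details.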
This simple approach lets us process data at whatever rate we need. It solved the problem of downloading information from many pages, but another problem emerged.
Having written a function like parseAllPages() (which loads information from all pages of a single section), I had not anticipated that I would want to call it for different sections simultaneously and asynchronously. To load information from several categories, I called this hypothetical parseAllPages() function from the handler of another Conveyor. But separate Conveyors are not synchronized with each other, so together they can issue more requests per second than the limit allows.
To fix this flaw, a boolean option useQueue (false by default) was added to the Conveyor parameters. Setting it enables sequential processing: the next element is processed only after the previous one has been processed. This mode makes it possible to synchronize several interconnected Conveyor objects. Example:
var categoriesConveyor = new Conveyor(function (category, cb) {
    // Call cb() once the whole category is done, so the next one can start.
    parseAllPages(category, function () {
        cb();
    });
}, { period: 100, useQueue: true });
That is, the categories were processed sequentially, while the pages within each category were not: they went through the timer-based algorithm described earlier.
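The idea of sequential processing can be sketched on its own, without the library. The helper runQueue below is hypothetical, and so are the category names. The text does not say whether the real tool counts the period from the start or from the end of the previous element; this sketch assumes it is counted from the end.

```javascript
// Sequential ("queue") processing: the next element starts only after the
// handler for the previous one has called its callback.
// `runQueue` is a hypothetical helper, not part of dataconveyor.
function runQueue(items, handler, periodMs, done) {
  const queue = items.slice();

  function next() {
    if (queue.length === 0) {
      if (done) done();
      return;
    }
    const element = queue.shift();
    // The handler signals completion through the callback; only then do we
    // wait `periodMs` and move on to the next element.
    handler(element, function () {
      setTimeout(next, periodMs);
    });
  }

  next();
}

// Usage: each hypothetical category takes 50 ms of asynchronous work.
const log = [];
runQueue(['books', 'music'], function (category, cb) {
  log.push('start ' + category);
  setTimeout(function () {
    log.push('end ' + category);
    cb();
  }, 50);
}, 100, function () {
  console.log(log.join(' -> '));
  // prints "start books -> end books -> start music -> end music"
});
```

Because a category never starts before the previous one has finished, only one inner page loop is running at any moment, and the overall request rate stays within the limit.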
The function Conveyor.wait(count) covers the case where more elements will be added later, at a point when the whenStop function would otherwise already have been called. The function passed to whenStop will not be called until conveyor.add() has been called count times. If it turns out that no more data needs to be added, you can call Conveyor.unwait(count). The expected element count can also be set when the Conveyor is created, via the expectedElementsCounter parameter.
If you need to stop processing and discard the elements that have not been processed yet, call Conveyor.forceStop().
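The counting logic behind wait and unwait can be illustrated with a small standalone sketch. The helper createStopGate and its added / drained methods are invented for this example and only model the rule described above; the real tool may keep its counters differently.

```javascript
// Models the rule: the stop callback fires only when nothing is left to
// process AND no further additions are expected.
// `createStopGate`, `added` and `drained` are hypothetical names.
function createStopGate(onStop) {
  let expected = 0;  // additions that are still expected
  let idle = true;   // true while there is nothing left to process

  function check() {
    if (idle && expected === 0) onStop();
  }

  return {
    wait: function (count) { expected += count; },
    unwait: function (count) {
      expected = Math.max(0, expected - count);
      check();
    },
    // Call when an element is added: one expected addition has arrived.
    added: function () {
      expected = Math.max(0, expected - 1);
      idle = false;
    },
    // Call when every queued element has been processed.
    drained: function () {
      idle = true;
      check();
    }
  };
}

// Usage
const events = [];
const gate = createStopGate(function () { events.push('stopped'); });
gate.wait(2);    // two more additions are expected
gate.drained();  // the queue is empty, but we are still waiting: no stop
gate.added();    // the first expected addition arrives
gate.drained();  // one addition is still outstanding: no stop
gate.unwait(1);  // give up on the second one: the stop callback fires
console.log(events); // [ 'stopped' ]
```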
This tool really helped me, and I hope it will be useful to someone else as well.
I would be grateful for feedback, especially on my JavaScript code style.