Some time ago, together with a small team of developers, we began work on a technically interesting analytical project. Its main purpose was processing data collected from various web pages: the data had to be cleaned up, brought into a convenient form, and the resulting statistics analyzed.
While the volume of data was small, we ran into no unusual problems and all of our solutions were fairly straightforward. But the project grew, and the amount of collected information, slowly at first, kept increasing. The code base grew as well. After a while we faced a sad fact: because of all the crutches and quick fixes, we had violated almost every possible design principle. And while the organization of the code did not matter much at first, over time it became clear that without serious refactoring we would not get far.
After some discussion and reflection, we decided that for our purposes the architecture of this Internet “parser” should first of all be service-oriented (SOA). Starting from that approach, we identified three main parts of the future system, responsible for the following tasks:
- Retrieving the contents of pages, data from various services through the API, data from structured files
- Structuring the information received
- Analyzing statistics and making recommendations
As a result of this separation, we ended up with three independent services: Fetch Service, Parse Service, and Analyze Service.
* From here on I will use the English names of the services for brevity and ease of reading.
Next came the question of how these services would communicate with each other. As the general mechanism, we decided to use pipeline processing: a clear and simple approach for cases where information needs to be processed sequentially, passing it from one node to the next. A queue mechanism based on RabbitMQ was chosen as the communication bus.

So, the main architectural model was decided: quite simple, understandable, and extensible enough. Next I will describe what each service consists of and what allows them to scale.
Service Components and Technologies
Let's talk a little about the technologies used inside each individual service. In this article I will mainly describe how the Fetch Service works; the other services have a similar architecture. Below I will describe the common points, or rather the main components. There are three of them.
The first is the data processing module, which contains all the basic logic of working with data. It is a set of workers that perform tasks, plus the clients that create those tasks. Gearman is used as the task server, together with its API. The workers and clients themselves are separate processes controlled by Supervisord.
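To make this concrete, here is a minimal sketch of such a worker using the python-gearman library; the task name fetch_page and the payload format are my own assumptions for illustration, not the project's actual code.

```python
import gearman

# Worker process (managed by Supervisord): register a task and wait for jobs.
def fetch_page(worker, job):
    url = job.data  # hypothetical payload: the URL to fetch
    # ... fetch the page here and store the result ...
    return 'ok'

gm_worker = gearman.GearmanWorker(['localhost:4730'])
gm_worker.register_task('fetch_page', fetch_page)  # 'fetch_page' is an assumed task name
gm_worker.work()  # blocks forever, processing incoming jobs
```

The client side is equally short: something like `gearman.GearmanClient(['localhost:4730']).submit_job('fetch_page', url, background=True)` would queue a job for such a worker.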
The next component is the results storage, a MongoDB database. Most of the data is extracted from web pages or comes from various APIs that return JSON, and MongoDB is quite convenient for storing this kind of information. In addition, the structure of the results may vary, new metrics may appear, and so on, and in that case we can easily change the structure of the documents.
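For example, saving a fetch result might look roughly like this (a sketch using pymongo; the database and collection names fetch_results / pages and the document fields are assumptions for illustration):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['fetch_results']  # assumed database name

# The schema is flexible: new fields can be added later without migrations.
db.pages.insert_one({
    'url': 'https://example.com',
    'status_code': 200,
    'headers': {'Content-Type': 'text/html'},
    'body': '<html>...</html>',
    'fetched_at': datetime.now(timezone.utc),
})
```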
And finally, the third component of the system is the queues. There are two kinds. The first kind carries requests sent to a service from other services or from external clients (not to be confused with Gearman clients); such queues are called Request Queues. In the case of the Fetch Service mentioned earlier, a JSON string is sent to this type of queue, containing the URL of the desired page or the parameters for a request to a third-party API.
The second kind is Notification Queues. Into a queue of this type, a service puts information about requests that have been processed, so that the result can be fetched from the storage. This is how asynchronous execution of requests for fetching, processing, and analyzing data is achieved.
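Pushing a request into the Fetch Service's request queue could then look like this (a sketch using pika; the queue name fetch.requests and the message fields are assumed for illustration):

```python
import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='fetch.requests', durable=True)  # assumed queue name

# A request for the Fetch Service: just a JSON string with the target URL.
message = json.dumps({'url': 'https://example.com', 'type': 'page'})
channel.basic_publish(
    exchange='',
    routing_key='fetch.requests',
    body=message,
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message
)
connection.close()
```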
RabbitMQ was chosen as the message broker. It is a good solution and works reliably, albeit with a few quirks. However, for a system like this it is somewhat heavyweight, so it may be worth replacing it with something simpler.

Communication
So, communication is handled through queues, which is an obvious and convenient way to connect different services. Below I will describe the communication process in more detail.
There are two types of communication: inside the system, between services, and between an end client and the system as a whole.
For example, suppose the Parse Service needs new data. It sends a request to the Fetch Service's queue and goes on with its own business, since requests are executed asynchronously.
When the Fetch Service receives the request from the queue, it performs the actions needed to extract the data from the desired source (web page, file, API) and stores it in the storage (MongoDB). It then sends a notification that the operation has completed, which the Parse Service in turn receives in order to process the fresh data.
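The consuming side of that exchange might look roughly like this (a sketch with pika and pymongo; the queue name fetch.notifications and a message format carrying a MongoDB document id are assumptions for illustration):

```python
import json
import pika
from bson import ObjectId
from pymongo import MongoClient

db = MongoClient('mongodb://localhost:27017')['fetch_results']  # assumed names

def on_notification(channel, method, properties, body):
    # The notification only says "request X is done"; the payload lives in MongoDB.
    note = json.loads(body)
    document = db.pages.find_one({'_id': ObjectId(note['document_id'])})
    # ... hand the raw document over to the parsing logic ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='fetch.notifications', durable=True)
channel.basic_consume(queue='fetch.notifications', on_message_callback=on_notification)
channel.start_consuming()
```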

Fetch service
Finally, a little more detail about the service responsible for obtaining raw data from external sources.
This basic part of the system is the first stage of the data processing pipeline. It is responsible for the following tasks (see the sketch after the list):
- Getting data from an external source
- Exception and error handling at this stage (for example, processing HTTP responses)
- Providing basic information about the received data (headers, statistics of file changes, etc.)
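As a rough illustration of these responsibilities, the core of a fetch worker can be quite simple (a sketch using the requests library; the function name and result fields are assumptions for illustration):

```python
import requests

def fetch_url(url, timeout=10):
    """Fetch a page and return both the data and basic metadata about it."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as error:
        # Error handling at this stage: the caller gets a structured failure.
        return {'ok': False, 'url': url, 'error': str(error)}

    return {
        'ok': True,
        'url': url,
        'status_code': response.status_code,
        'headers': dict(response.headers),  # basic info about the received data
        'last_modified': response.headers.get('Last-Modified'),
        'body': response.text,
    }
```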
Extracting raw data is an important part of most systems where data is later structured, and the service-oriented approach is very convenient here.
We simply say, “Give me this data,” and get what we want. In addition, this approach lets us create different workers for obtaining information from specific sources without forcing clients to think about where the material for processing comes from. Different APIs with different formats and protocols can be used; the entire logic of obtaining the target data is isolated at this level.
In turn, other services can be built on top of this one, implementing more specific logic such as crawling sites, parsing, aggregation, and so on, without having to deal with network interactions and handle all the edge cases every time.
That is probably where I will stop. Of course, there are many more aspects to developing such systems, but the most important thing to remember is to think about the architecture first and to follow the single responsibility principle. Isolate the components of the system and connect them in a simple and understandable way, and you will get a result that is easy to scale, easy to control, and very convenient to work with in the future.