Real-time data processing in AWS Cloud. Part 2

In the first part of the article, we described one of the tasks we faced while working on a public service for storing and analyzing the results of biological research. The requirements provided by the customer and several possible implementation options based on existing products were reviewed.

Today we will talk about the decision that was implemented.

Proposed Architecture

')
Front end

User requests come to the front-end, validated for compliance with the format and transferred to the back-end. Each request will eventually return to the front-end with an image or a set of points if the client wants to build such an image on his own.

In the future, it is possible to install LRU cache for storing repetitive results with a short element lifetime - commensurate with the duration of a user session.

Back end

For each such request back end

validates the incoming request and checks its validity from the point of view of security policy,
determines which data needs to be read out from S3 and forms a common task, the processing result of which should be returned to the front-end,
splits tasks into subtasks, taking into account the features, the location of the data on S3, to avoid double readings, etc.,
puts tasks in a queue built on the basis of RabbitMQ,
processes the results obtained from the queue, putting together sets of points,
renders an image if the request it assumes
returns the results to the front end.

Processing subtasks occurs by parallel queuing each subtask in RPC-style (set the task, waited, got the result). For this, a thread pool is used that is global for the back-end application. Each thread in this pool is responsible for interacting with the broker: sending a message, waiting, receiving a result.

In addition, the use of a pool of threads of a known size allows you to control the number of simultaneously processed subtask messages. And the launch of threads for processing known subtasks makes it possible to plan exactly which common tasks are being performed at the moment, predicting the timing of readiness of each common task.

Sustainability requires that you follow three things:

The processing time of one subtask / number of subtasks queued at the time - with an increase in this parameter, the throughput of the queue is required to increase.
Prioritize the processing of subtasks so that each common task is processed in the least possible time.
The number of common tasks in processing is to avoid overflowing the JVM heap on the back end due to the need to keep intermediate results in memory.

Items 2 and 3 are achieved by manipulating the size of the thread pool and the approach to placing subtasks in the queue. When changing the average processing time for subtasks (item 1), it is required to increase or, accordingly, reduce the number of working nodes for processing subtasks.

Worker Workers

Subscribers to the RabbitMQ queue are standalone applications, which, for definiteness, are called workers. Each of them occupies one of the EC2 instances completely, most effectively using the CPU, RAM and network bandwidth.

Subtasks formed on the back end are consumed by some of the workers. The processing of such a subtask does not imply a global context, because the worker works independently of his own kind.

The important point is that Amazon S3 provides random access to any data . This means that instead of downloading a file of 500 MB in size, most of which is not needed for processing this request, we can only read what is really needed. That is, by dividing the general task into pozdacha in the correct way, one can always achieve the absence of double readings of the same data.

In the case of runtime errors (out of memory, failure, etc.), the task simply goes back to the queue, where it is distributed to another node automatically. For system stability, each of the workers is periodically restarted by cron to avoid possible problems with memory leaks and JVM heap overflow.

Scaling

There may be several reasons leading to the need to change the number of application nodes:

The increase in the average processing time of subtasks, which ultimately leads to problems in the delivery of the final result to users in the required time frame.
Lack of proper workload on the node-workers.
Overload of the back-end on the CPU or memory consumption.

To solve problems 1 and 2, we used the API provided by EC2, and created a separate module-scaler that operates instances. Each new instance is created on the basis of a preconfigured operating system image (Amazon Machine Image, AMI) and is launched using spot requests, which saves you about five times the cost of hosting payment.

The disadvantage of this approach is that it takes about 4-5 minutes from the moment of creating a spot-request to an instance until its commissioning. By this time, the peak of the load may have already been passed, and the need to increase the number of nodes may disappear by itself.

To get into such situations less often, we use statistics on the number of requests, the geographical location of users and the time of day. With its help, we increase or reduce the number of working nodes “in advance”. Almost all users work with our service only during the working day. Therefore, bursts are well noticeable at the beginning of the working day in the States (especially US West) and in China. And if problems with the queue overload still occur, then we have time to smooth them in 4-5 minutes.

Problem number 3 has not yet been resolved and is for us the most vulnerable spot. The current connectivity of three things: control of access to data, knowledge of their specifics and location, and post-processing of computed data (Reduce step) - is far-fetched and subject to processing into separate layers.

To be fair, the Reduce process comes down to System.arraycopy (...), and the total amount of data in memory (requests + parts of ready subtasks) on one instance of the back end has never exceeded 1 GB, which easily fits into JVM heap.

Deployment

Any changes in the existing system go through several stages of testing:

Unit testing. This process is integrated into a build that runs on TeamCity after each commit.
Integration testing. Once a day (sometimes less) TeamCity launches several builds that check the interaction of modules. As test data, we use pre-prepared files, the processing result of which is known. As we expand the set of functional features, we add specific cases to the test code.
If the changes concern the user interface, then human intervention is sometimes required at the final stage.

For the described subsystem, the changes, in their basis, relate to the performance and support of new types of input data. Therefore, unit and integration testing is usually sufficient.

After each successful build from the “production” branch, TeamCity publishes artifacts, which are ready-to-use JARs and scripts that control a set of parameters for running the application. When starting a new instance from a pre-prepared AMI (or reloading an existing one), the startup script loads the latest production build from TeamCity and starts the application using the script supplied with the build.

Thus, all you need to do for deploying a new version in production is to wait for the end of the tests and click on the “magic” button that restarts the instances. By monitoring the set of running instances and dividing the flow of tasks into different RabbitMQ queues, you can perform A / B testing for groups of users.

Hostess note

Know how your data is arranged. Provide random access to any part in the shortest possible time. [Keywords]: Amazon S3, random access.
Use spot queries to save money. [Keywords]: Amazon EC2, spot requests.
Be sure to build prototypes based on existing solutions. At a minimum - get experience. As a maximum - get a virtually turnkey solution.

And in the end I will tell...

In this review article, we described our approach to solving a fairly typical problem. The project continues to evolve, becoming every day more and more functional. We will be happy to share our experience with the audience and answer your questions.

Source: https://habr.com/ru/post/262431/

All Articles

Real-time data processing in AWS Cloud. Part 2

Proposed Architecture

Hostess note

And in the end I will tell...

More articles: