
Real-time data processing in AWS Cloud. Part 1

Hello!

Today I want to talk about one of the typical tasks in the fields of cloud computing and Big Data, and the approach to solving it that we arrived at in TeamDev.

We ran into Big Data issues while developing a public service for a company that stores and analyzes the results of biological research. The customer's goal for the next stage was real-time visualization of certain slices of that data.

Let's try to formalize the task.



Input data: tens of thousands of files, each consisting of several related matrices of int and float values. File sizes vary, roughly 2-4 GB each. All of the data is to be uploaded to cloud storage.

Output: a set of points from which a high-resolution image is built. The processing involves summing and finding maximum values in arrays within given boundaries, so it is a fairly heavy consumer of CPU time. The size of a result depends on the user's request, from ~50 KB to ~20 MB. The amount of source data that has to be processed to produce one answer exceeds the size of the answer by a factor of 30-200; that is, to send a 100 KB response, about 3-20 MB has to be read and processed.
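
To make the kind of processing concrete, here is a minimal sketch of the per-request aggregation, assuming the requested slice of one float matrix is already in memory (the class and method names are ours, purely for illustration):

```java
// Illustrative only: aggregates one rectangular slice of a matrix, assuming
// the data is already in memory as float[rows][cols]. The real service works
// on multi-gigabyte files in S3, so reading and chunking are the hard part.
public final class SliceAggregator {

    public static final class SliceResult {
        public final double sum;
        public final float max;
        SliceResult(double sum, float max) { this.sum = sum; this.max = max; }
    }

    /** Sums and finds the maximum inside [rowFrom, rowTo) x [colFrom, colTo). */
    public static SliceResult aggregate(float[][] matrix,
                                        int rowFrom, int rowTo,
                                        int colFrom, int colTo) {
        double sum = 0.0;
        float max = Float.NEGATIVE_INFINITY;
        for (int r = rowFrom; r < rowTo; r++) {
            for (int c = colFrom; c < colTo; c++) {
                float v = matrix[r][c];
                sum += v;
                if (v > max) {
                    max = v;
                }
            }
        }
        return new SliceResult(sum, max);
    }
}
```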

Requirements:



Baseline:

As the cloud provider, the customer chose Amazon Web Services, with Amazon S3 as an “endless” repository for the source data and Amazon EC2 as the host for worker nodes.

The front end, which serves images to browsers and a desktop client, is written in Java and runs on Amazon EC2.

The back end, which implements the business logic, including data access control, is written in Java and runs on another EC2 instance.

The descriptive part of the data (where it is stored, who owns it, what it is) lives in MySQL on Amazon RDS.

Where to begin?



With an attempt to solve the problem head-on! What if we create a single application that processes a set of user requests in parallel? It would take data from S3 and return either a structure representing a set of points or a fully rendered image.
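
The data-access half of that head-on approach boils down to range reads from S3, roughly like this (a sketch assuming the AWS SDK for Java v1; the bucket name, key, and byte range are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.IOException;
import java.io.InputStream;

public class NaiveS3Reader {

    public static void main(String[] args) throws IOException {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Read only the byte range that covers the requested matrix slice,
        // instead of downloading the whole 2-4 GB file.
        GetObjectRequest request = new GetObjectRequest("example-bucket", "research/file-0001.bin")
                .withRange(0, 4 * 1024 * 1024 - 1);    // first 4 MB, placeholder range

        try (S3Object object = s3.getObject(request);
             InputStream content = object.getObjectContent()) {
            byte[] buffer = new byte[64 * 1024];
            long total = 0;
            int read;
            while ((read = content.read(buffer)) != -1) {
                total += read;                          // here the bytes would be parsed and aggregated
            }
            System.out.println("Read " + total + " bytes from S3");
        }
    }
}
```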

The set of difficulties encountered is obvious:

  1. The average response size is 4 MB. With 100 simultaneous or nearly simultaneous requests, the total volume of results reaches 4 MB * 100 = 400 MB. The source data is 30 or more times larger than the responses, so at least 30 * 400 MB ~ 12 GB, and as much as 200 * 400 MB ~ 80 GB, of data would have to be read from storage at the same time.
  2. Having 100 simultaneous requests, each of which needs CPU time, calls for a comparable number of CPU cores.
  3. The theoretical maximum network bandwidth between Amazon S3 and an EC2 instance is 1 Gbit/s, i.e. 125 MB/s. Even under ideal conditions, reading just 12 GB of data would therefore take about 12 * 1024 / 125 ~ 98 seconds (see the back-of-the-envelope estimate after this list).
  4. A single instance cannot possibly serve users in different parts of the world at equal speed.
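
A quick back-of-the-envelope sketch of points 1 and 3, using the figures quoted above (the exact results differ slightly from the rounded numbers in the list):

```java
// Back-of-the-envelope estimate for 100 concurrent requests, using the
// figures quoted in the article (average response 4 MB, source data
// 30-200x larger, S3-to-EC2 bandwidth capped at 1 Gbit/s = 125 MB/s).
public class LoadEstimate {

    public static void main(String[] args) {
        int concurrentRequests = 100;
        double avgResponseMb = 4.0;

        double totalResponsesMb = concurrentRequests * avgResponseMb;      // 400 MB
        double minSourceGb = totalResponsesMb * 30 / 1024;                 // ~11.7 GB
        double maxSourceGb = totalResponsesMb * 200 / 1024;                // ~78.1 GB

        double bandwidthMbPerSec = 125.0;                                  // 1 Gbit/s
        double minReadSeconds = minSourceGb * 1024 / bandwidthMbPerSec;    // ~96 s
        double maxReadSeconds = maxSourceGb * 1024 / bandwidthMbPerSec;    // ~640 s

        System.out.printf("Responses: %.0f MB, source data: %.1f-%.1f GB%n",
                totalResponsesMb, minSourceGb, maxSourceGb);
        System.out.printf("Read time on one instance: %.0f-%.0f s%n",
                minReadSeconds, maxReadSeconds);
    }
}
```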


Amazon EC2 lets you create instances of different sizes, but even the largest of them, d2.8xlarge (36 x Intel Xeon E5-2676v3, 244 GB of RAM), solves only problems №1 and №2. Problem №3, the time it takes to load the data, is two orders of magnitude longer than the expected time to return results. In addition, the scalability of such a solution tends to zero.

An off-the-shelf solution?



Elastic MapReduce

For tasks like this, AWS offers the cloud-based Elastic MapReduce, which is essentially hosted Apache Hadoop. With it, all the difficulties listed above can be overcome by distributing the load across the nodes of a cluster. However, this option brings new problems:

  1. Hadoop job startup time is measured in seconds, which is several times longer than the required response time.
  2. The cluster has to be pre-warmed and the selected data loaded from S3 into HDFS. This calls for extra work when choosing an operating strategy for the cluster (or clusters) to balance the load.
  3. The result is written to S3 or HDFS, so additional infrastructure is needed to deliver it to the end user.


In general, Elastic MapReduce is well suited to tasks that do not call for a near-instant result. But it cannot be used to solve our problem, mainly because it does not meet the request-processing-time requirement.

Apache Storm

As an alternative that preserves the advantages of the MapReduce approach while allowing results to be obtained in near real time, Apache Storm is a good fit. This framework was built for Twitter's needs (processing analytics events) and is designed to handle streams of millions of tasks.

Setting up Storm in the AWS cloud is well thought out: there are ready-made deployment scripts that automatically launch all the necessary nodes, plus a ZooKeeper instance to keep the system healthy.
However, on closer examination (we built a prototype), this solution also turned out to have several drawbacks:

  1. Changing the configuration of a Storm cluster (adding nodes, deploying new versions) cannot be done transparently on the fly. To be precise, in many cases changes are only guaranteed to take effect after the cluster is restarted.
  2. Storm's model for processing messages in RPC mode implies at least three stages to implement MapReduce: splitting the work into parts, processing each part, and combining the results (see the sketch after this list). Each of these stages generally runs on its own node, which in turn leads to extra serialization and deserialization of the binary content of the messages.
  3. Integration testing is not the easiest: spinning up a whole test cluster takes resources and time.
  4. An intrusive API (a matter of taste, but still).
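
To illustrate point 2, here is a toy DRPC topology, assuming the classic Storm 0.9.x API (backtype.storm packages); the bolts are trivial placeholders rather than our actual processing logic:

```java
// A toy DRPC topology showing the stages from point 2. A real topology would
// split each request into parts in the first stage and aggregate the partial
// results in a batch bolt at the end.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class DrpcSketch {

    // Stage "process a part of the work": runs on worker nodes.
    public static class ProcessPartBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            Object requestId = input.getValue(0);   // DRPC request id, must be passed along
            String request = input.getString(1);    // request argument
            collector.emit(new Values(requestId, "processed:" + request));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "partial"));
        }
    }

    // Stage "combine results": typically a separate node, hence the extra
    // serialization/deserialization hop between stages.
    public static class CombineBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getValue(0), input.getString(1).toUpperCase()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("id", "result"));
        }
    }

    public static void main(String[] args) throws Exception {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("render");
        builder.addBolt(new ProcessPartBolt(), 4);
        builder.addBolt(new CombineBolt(), 2).fieldsGrouping(new Fields("id"));

        LocalDRPC drpc = new LocalDRPC();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("drpc-sketch", new Config(), builder.createLocalTopology(drpc));

        System.out.println(drpc.execute("render", "some-request"));

        cluster.shutdown();
        drpc.shutdown();
    }
}
```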


The proof of concept built with Storm met the requirements, including the speed requirement. But because of the constant maintenance tasks this solution would bring (points 1 and 3) and the time lost to the “extra” serialization (point 2), we decided to abandon it.

Elastic Beanstalk

Another option was to write our own application and host it in Amazon Elastic Beanstalk. That would solve all the problems in one fell swoop: a pool of EC2 instances to spread the CPU and network load, automatic scaling, metrics, and health monitoring for all the nodes. But on closer inspection, doubts arose:

  1. Vendor lock-in. After discussion with the customer, it turned out that in addition to the public service, their plans also included shipping boxed solutions with similar functionality. And while alternatives to Amazon EC2 and Amazon S3 can be found among intranet-oriented products (for example, the Pivotal product line), there is no adequate replacement for Beanstalk.
  2. Insufficient flexibility in the scaling settings. Request statistics showed clear spikes tied to the start of the working day, but the system did not allow server pre-warming to be bound to the time of day. [This capability has since appeared; see the sketch after this list.]
  3. The Amazon SQS message delivery service that is part of Beanstalk is not the most reliable; the Amazon Developer Forums are full of reports of problems with this service's SDK.
  4. A cumbersome deployment procedure.
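
For reference on point 2: time-based pre-warming can be expressed as a scheduled scaling action on the Auto Scaling group behind the environment. A minimal sketch, assuming the AWS SDK for Java v1; the group name, schedule, and sizes are placeholders:

```java
import com.amazonaws.services.autoscaling.AmazonAutoScaling;
import com.amazonaws.services.autoscaling.AmazonAutoScalingClientBuilder;
import com.amazonaws.services.autoscaling.model.PutScheduledUpdateGroupActionRequest;

public class ScheduledPrewarm {

    public static void main(String[] args) {
        AmazonAutoScaling autoScaling = AmazonAutoScalingClientBuilder.defaultClient();

        // Scale the group up shortly before the working day starts (08:00 UTC, Mon-Fri).
        autoScaling.putScheduledUpdateGroupAction(new PutScheduledUpdateGroupActionRequest()
                .withAutoScalingGroupName("my-env-worker-group")   // placeholder group name
                .withScheduledActionName("prewarm-before-workday")
                .withRecurrence("0 8 * * 1-5")                     // cron-style schedule
                .withMinSize(4)
                .withDesiredCapacity(8)
                .withMaxSize(16));
    }
}
```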


We turned this solution down mainly because of the first point. It is worth noting, though, that Beanstalk is developing rapidly, and we will certainly look at it again in future projects.

Reinventing the wheel



Two opinions are widespread in our field: “everything has already been written before us, you just need to know how to search” and “if you want something done well, do it yourself”. Based on the experience gained during the search, we decided in favor of a home-grown system.

(more about it in the next part of the article).

Source: https://habr.com/ru/post/262175/

