
Overview of the second day of Data Science Weekend 2018. Data Engineering, ETL, search services and more

A few days ago we published a review of the first day of Data Science Weekend 2018, which took place on March 2-3 at the Rambler & Co loft. Having covered the practical use of machine learning algorithms, we now turn to the second day of the conference, during which the speakers talked about data engineering tools for data platforms, ETL, search services and much more.



GridGain


The second day of DSWknd2018 was opened by Yuri Babak, head of development of the machine learning module for the Apache Ignite platform at GridGain, who described how the company approached optimizing distributed machine learning on a cluster.

Usually, when people talk about optimizing the machine learning process, they primarily mean optimizing the training itself. We decided to approach it from the other side and tackle the problem that data is usually stored separately from the system where it is processed and models are trained.
So we decided to focus on the problem of long ETL (after all, it is pointless to compete with TensorFlow on raw machine learning performance), and as a result we were able to train models across the entire cluster we have. The outcome is a new approach to ML that relies not on parallel but on genuinely distributed learning.

We achieved this by developing a new machine learning module in Apache Ignite that relies on functionality the platform already provides (streaming, native persistence and more). I would like to highlight two features the module builds on: distributed key-value storage and collocated computing.
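To give a feel for the key-value side, here is a minimal sketch using the pyignite thin client; the connection address, cache name and stored vector are illustrative assumptions, and the collocated-compute API itself lives in Ignite's Java interfaces rather than in this client.

```python
# A minimal sketch of Ignite's distributed key-value storage via the
# pyignite thin client; host, port and cache name are illustrative.
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)  # default thin-client port

# Feature vectors are stored as cache entries partitioned across the cluster;
# collocated computations in the ML module run where the partition lives.
features = client.get_or_create_cache('ml_features')
features.put(42, [0.1, 0.7, 1.3])
print(features.get(42))

client.close()
```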







More information about Apache Ignite can be found here, here and here.

Raiffeisenbank


After that, Alexey Kuznetsov and Mikhail Setkin, graduates of our Data Engineer and Big Data Specialist programs, shared their experience of building a Real-Time Decision Making (RTDM) platform based on the Hortonworks Data Platform and Hortonworks DataFlow.

Every organization has data describing events that can be placed on a timeline and used in some way to make real-time decisions. For example, imagine a scenario where a bank customer's card payment fails twice, then he calls the call center, logs into the online bank, leaves a request, and so on. Naturally, we would like to receive signals based on these events so that we can promptly offer the client a service, a special offer or help with his problem.

Every RTDM system is built on top of streaming, which works like this: there are sources from which we can capture events and put them onto a data bus, and then pick them up from there, but we cannot join events from different sources or aggregate them in any way.

The next level of abstraction above streaming is Complex Event Processing (CEP), which differs from plain streaming in that many sources are processed at the same time: we see events together, joins between events become possible, we can aggregate them on the fly, and so on.
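As a toy illustration of the difference, here is a conceptual Python sketch (not taken from the speakers' code) that joins events from two hypothetical sources by client id within a time window and aggregates them on the fly:

```python
# Conceptual sketch of CEP-style processing: events from two sources are
# joined by client_id within a 10-minute window and aggregated on the fly.
# Event shapes and the window size are illustrative assumptions.
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)
recent = defaultdict(list)  # client_id -> [(timestamp, source, event_type)]

def on_event(client_id, timestamp, source, event_type):
    """Collect an event and emit a signal when a pattern appears in the window."""
    recent[client_id] = [e for e in recent[client_id] if timestamp - e[0] <= WINDOW]
    recent[client_id].append((timestamp, source, event_type))
    types = {e[2] for e in recent[client_id]}
    if {'card_declined', 'call_center_call'} <= types:
        print(f'signal: offer help to client {client_id}')

now = datetime.now()
on_event(7, now, 'cards', 'card_declined')
on_event(7, now + timedelta(minutes=3), 'telephony', 'call_center_call')
```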



The final level of abstraction is the RTDM system itself. Its main difference from CEP is that it has a set of pre-configured decisions and actions that can be taken online: a call from the call center when a consultation request is left in the online bank, an SMS with a special offer when the account is topped up with a large amount, and so on.

How do you implement such a system? You can go to vendors, and in 99% of cases that is the standard practice in the industry. On the other hand, we have a team of data engineers who can build everything themselves from open-source solutions; the main drawback of most of them is the lack of a user interface.

However, we managed to find a platform that suited us: the choice fell on Hortonworks DataFlow 3.0, whose new version has exactly what we needed, Streaming Analytics Manager (SAM) with a graphical interface. And since Hortonworks was already running in our production, we took the path of least resistance.

Let us turn to the architecture of our RTDM solution. Data from the sources arrives on the data bus, where it is aggregated, and then HDF collects it and puts it into Kafka. Next comes SAM: through its user interface the user launches a campaign for execution, after which a JAR file is compiled and submitted to Apache Storm.
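For the first hop of this pipeline, publishing aggregated events into Kafka might look roughly like the sketch below (using kafka-python; the broker address, topic name and event fields are assumptions, not the bank's actual code):

```python
# Sketch of pushing events from the data bus into Kafka with kafka-python.
# Broker address, topic name and event structure are illustrative.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

event = {'client_id': 7, 'source': 'internet_bank', 'type': 'consultation_request'}
producer.send('rtdm-events', value=event)
producer.flush()
```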



The central element of the entire system is SAM, thanks to which all this has become possible. Here is what its interface looks like:



SAM gives the user a lot of flexibility: a choice of data sources, a set of processors for working with the data stream, groups of filters, branching, and a choice of action for a particular client, whether it is a fully personalized offer or a common action for a certain group of people.

Lamoda


Our Data Science Weekend was in full swing, and next in line was Igor Mosyagin, another Data Engineer graduate and an R&D developer at Lamoda. Igor spoke about how they optimized search suggestions on the site and tried to get Apache Solr, Golang and Airflow to play nicely together.

Our site, like many others, has a search field where people type something and see suggestions along the way, with which we try to predict what the user needs. At that point we already had a working solution, but it did not quite suit us. For example, the system did not respond at all if the user typed with the wrong keyboard layout.

Apache Solr is the heart of the whole system; we used it in the old solution too, but this time we decided to add Airflow, which I learned about on the Data Engineer program. A request coming from the user hits our service written in Go, which prepares it and sends it to Solr, then receives the answer and returns it to the user. In parallel, Airflow runs regularly at a predefined time, goes into the database and, if necessary, kicks off a data import into Solr. The key figure in all of this is that only 50 ms pass from the user's request to the reply, of which the lion's share, about 40 ms, is the request to Solr and receiving its response.
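The Airflow part of that scheme can be sketched roughly like this (Airflow 1.x style; the schedule, DAG id and import function are assumptions, not Lamoda's actual code):

```python
# Sketch of a scheduled Airflow DAG that checks the database and, if new
# data appeared, pushes it into Solr. Details are illustrative assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def import_into_solr(**context):
    # 1) query the catalogue database for fresh products/queries
    # 2) POST documents to Solr's update handler and commit
    pass

dag = DAG(
    dag_id='solr_suggest_import',
    start_date=datetime(2018, 3, 1),
    schedule_interval='0 * * * *',  # hourly
    catchup=False,
)

PythonOperator(
    task_id='import_into_solr',
    python_callable=import_into_solr,
    provide_context=True,
    dag=dag,
)
```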



Generally speaking, Apache Solr is a big "colossus" with good documentation. It has many suggesters that follow different logic: some return an answer only on an exact match, others can find a substring far from the beginning of a word, and so on. In total there are 7 or 8 suggester variants, but we used only 3, since the rest in most cases were simply too slow.
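Querying a Solr suggester from the service side looks roughly like this sketch with Python requests (the host, core name and suggester dictionary name are assumptions):

```python
# Sketch of querying Solr's suggest handler; host, core name and the
# suggester dictionary name ('fuzzy') are illustrative assumptions.
import requests

resp = requests.get(
    'http://solr:8983/solr/products/suggest',
    params={
        'suggest': 'true',
        'suggest.dictionary': 'fuzzy',
        'suggest.q': 'winter bo',
        'wt': 'json',
    },
    timeout=0.05,  # the whole round trip is expected to fit into ~50 ms
)
suggestions = resp.json()['suggest']['fuzzy']['winter bo']['suggestions']
print([s['term'] for s in suggestions])
```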

It is also important to update the weights from time to time, as query frequencies change. For example, if winter boots are among the most popular queries this month, that of course has to be taken into account.

Aligned Research Group


Next it was the turn of Nikolai Markov, Senior Data Engineer at Aligned Research Group and also a lecturer on our Big Data Specialist and Data Engineer programs. Nikolai spoke about the advantages and disadvantages of the Hadoop ecosystem, and why analyzing and processing data on the command line can be a good alternative.

If you look at Hadoop from a modern engineering point of view, you can find not only many advantages but also a number of drawbacks. For example, MapReduce is a general-purpose paradigm: in practice it boils down to spending a lot of time translating your algorithm into MapReduce terms just to get something computed, and making plenty of mistakes along the way. Often you need to chain several MapReduce jobs to do one more thing, and it turns out that instead of writing business logic you spend your time on MapReduce plumbing.
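Even a trivial task illustrates the overhead. Here is a word-count sketch in the MapReduce style using mrjob (the library choice here is mine, not the speaker's); anything more complex quickly turns into chains of such jobs:

```python
# Word count expressed as MapReduce with mrjob; even this trivial logic
# has to be split into mapper and reducer phases.
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # emit (word, 1) for every word in the input line
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the per-word counters produced by all mappers
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()
```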



An advantage of Hadoop is, of course, Python support. Python is good at everything, I write in it and recommend it to everyone, but the problem is that when we write Python on top of Hadoop we need a lot of engineering support to make it all work in production: all the analytical packages (Pandas, NumPy and so on) must be present on the worker nodes, and all of this must be deployed automatically. As a result we either lock ourselves to a specific vendor that allows its own package versions to be installed there, or we need a configuration management system to handle the deployment.

Naturally, one of the main drawbacks of Hadoop, and at the same time the main reason Spark appeared, is that Hadoop always writes intermediate results to disk and reads them back from disk. So even if we spread this process across many nodes, it will still run at roughly the speed of the (average) disk.

You can, of course, solve the speed problem by scaling Hadoop out, that is, simply throwing money at it. However, there are more effective alternatives. One of them is analyzing and processing data on the command line. It really is possible to solve serious analytical problems there, and it can be several times faster than on Hadoop. The only downside is shown below:



This may seem unreadable to some, but this thing works an order of magnitude faster than a Python script, so it’s great if you have an engineer who can write such things on the command line.

Also, I do not quite understand why, in many companies, business logic absolutely has to be tied to a relational database, because nothing prevents you from taking a modern non-relational database (MongoDB, for example). There is no need to hide behind the excuse that you have a bunch of joins there and cannot do without SQL. Today there are plenty of databases, and you can choose the one that suits you best.
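A minimal sketch of that idea with MongoDB and pymongo (the database, collection and field names are made up for illustration):

```python
# Sketch of keeping business entities in MongoDB instead of a relational DB;
# database, collection and field names are illustrative.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
orders = client['shop']['orders']

# the document embeds what would otherwise require joins across several tables
orders.insert_one({
    'client_id': 7,
    'items': [{'sku': 'boots-41', 'qty': 1, 'price': 5990}],
    'status': 'paid',
})

for doc in orders.find({'status': 'paid'}, {'_id': 0, 'client_id': 1, 'items': 1}):
    print(doc)
```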

If you still cannot do without SQL, then you can try Presto, an extensible engine for distributed queries over many data sources at once. You write a plugin for your data source and can then extract anything you want with SQL. Hive follows the same logic in principle, but it is tied to the Hadoop infrastructure, whereas Presto is an independent project. A nice bonus is the integration with Apache Zeppelin, a handsome front-end where you can write SQL queries and immediately get charts.
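Querying Presto from Python can be sketched like this (using the PyHive client; the coordinator host, catalog, schema and table names are assumptions):

```python
# Sketch of running a distributed SQL query through Presto via PyHive;
# the coordinator host, catalog, schema and table are illustrative.
from pyhive import presto

cursor = presto.connect(
    host='presto-coordinator',
    port=8080,
    catalog='hive',
    schema='default',
).cursor()

cursor.execute('SELECT event_type, count(*) FROM events GROUP BY event_type')
for event_type, cnt in cursor.fetchall():
    print(event_type, cnt)
```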

Rambler & Co


The honor of closing our productive weekend went to Alexander Shorin, an instructor on the Data Engineer program and a senior Python developer at Rambler & Co. This time the focus was on the engineering side of the project.

The original pipeline looks like this:



The cameras upload photos to WebDAV, then an Airflow task picks up the new photos and sends them to the API, which packages all of this into separate tasks and publishes them to RabbitMQ. Workers take these tasks from the "rabbit", apply some transformations and send the results back.
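The worker side of this pipeline might be sketched with pika roughly as follows (the queue name and the body of the processing function are assumptions):

```python
# Sketch of a worker consuming photo-processing tasks from RabbitMQ with pika;
# the queue name and the body of process_task are illustrative.
import json
import pika

def process_task(channel, method, properties, body):
    task = json.loads(body)
    # ... run segmentation / counting on the photo referenced by the task ...
    print('processed', task.get('photo_id'))
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters('rabbitmq'))
channel = connection.channel()
channel.queue_declare(queue='photo_tasks', durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue='photo_tasks', on_message_callback=process_task)
channel.start_consuming()
```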

How can we scale this process from a technical point of view? How many more machines do we need to handle the whole stream of photos? To answer this question, we take a profiler. We decided to use Uber's PyFlame, which is essentially a wrapper over ptrace: it attaches to a Linux process, watches what it is doing, and records what was executed and how many times.

We ran a test dataset of 472 photos, and it was processed in 293 seconds. Is that a lot or a little? The PyFlame report turns into the following beautiful picture:



Here we can see the "gorge" of model loading, the "valley" of segmentation and other interesting things. The report makes it clear that it is our own code that is slow, since the huge bar in the center of the image belongs to it.

In fact, it turned out we only needed to change a single line in the Jupyter notebook to optimize the segmentation: the run time fell from 293 to 223 seconds. We also switched from PIL, which everyone and their dog has criticized for sluggishness, to OpenCV, which shaved off another 20 seconds. Finally, using Pillow-SIMD for image processing, which Alexander Karpinsky described in his talk at the PiterPy #4 conference, brought the task execution time down to 183 seconds. True, this barely changed the PyFlame picture:
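The kind of swap being described is roughly this (a sketch only; file names, sizes and interpolation modes are illustrative):

```python
# Sketch of replacing a PIL-based resize with OpenCV (or Pillow-SIMD, which
# keeps the PIL API but uses SIMD internally); paths and sizes are illustrative.
import cv2
from PIL import Image

# PIL / Pillow-SIMD variant: same API, Pillow-SIMD is a faster drop-in replacement
pil_img = Image.open('frame.jpg').resize((640, 480), Image.BILINEAR)

# OpenCV variant: works directly on numpy arrays
cv_img = cv2.imread('frame.jpg')
cv_img = cv2.resize(cv_img, (640, 480), interpolation=cv2.INTER_LINEAR)
```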



As you can see, PyTorch still stands out here, so let us poke at it. What can we do with it? In PyTorch, before data is sent to the GPU it first goes through preprocessing and is then fed into a DataLoader.

Having studied how DataLoader works, we saw that it spawns workers, processes the data and then kills the workers. The question is: why constantly spawn and kill workers if there are few people in the cinema hall and processing a photo takes about a second? Why spawn and kill two processes every second when it is so inefficient?

We optimized DataLoader by modifying it: now it does not kill the workers, and it does not use them at all if there are fewer than 24 people in the hall (a number picked more or less out of thin air). This optimization did not noticeably increase processing speed, but the average CPU utilization dropped from ~600% to ~200%, that is, threefold.
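The gist of the idea can be approximated with standard DataLoader parameters (a sketch, not the actual patched code; the dataset, the threshold of 24 and the batch size are illustrative, and newer PyTorch releases also expose a built-in persistent_workers flag that keeps workers alive between epochs):

```python
# Sketch of the idea behind the DataLoader tweak: skip worker processes
# entirely for small workloads instead of spawning and killing them each time.
from torch.utils.data import DataLoader

def make_loader(dataset, people_in_hall):
    workers = 0 if people_in_hall < 24 else 4  # 0 means "load in the main process"
    return DataLoader(dataset, batch_size=8, num_workers=workers)
```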

Other improvements include a lighter Conv2d implementation, removing an extra lambda from the neural network model, and converting the Image to np.array before ToTensor.



Finally, here is some feedback about our conference from speakers and attendees:

“A very comfortable get-together where you can chat about the industry on the sidelines and catch speakers with your questions. As a speaker, I want to note the professionalism of the organizers: it is clear that everything is done so that both speakers and listeners feel as comfortable as possible.” - Igor Mosyagin, speaker, R&D developer, Lamoda.

“I liked it very much. Friendly audience, smart questions.” - Mikhail Setkin, speaker, Product Manager, Raiffeisenbank.

Full recordings of all the talks can be viewed on our Facebook page. See you soon at our other events!

Source: https://habr.com/ru/post/352010/

