The other day we sat down with Dmitry Bugaychenko (dmitrybugaychenko), one of the instructors on our "Data Analysis in Scala" program, to discuss current issues around using Scala in Data Science and Data Engineering. Dmitry is an analyst at Odnoklassniki.
- Dima, you work at Odnoklassniki. Tell us, what do you do there?
I joined Odnoklassniki in 2011 to work on the music recommendation project. It was a very interesting and difficult task: most music recommendation services at the time were built around well-cataloged publisher content, whereas we had genuine UGC (user-generated content) that first had to be cleaned up and cataloged. The resulting system turned out well enough that we decided to extend the experience to other sections of the site: group recommendations, friend suggestions, feed ranking, and so on. In parallel, the team grew, the infrastructure developed, and new algorithms and technologies were introduced. Now I have a fairly wide range of responsibilities: coordinating the data scientists' efforts, developing the DS infrastructure, research projects, etc.
- When did you start using Spark? What created the need?
Our first attempts to get on friendly terms with Spark date back to 2013, but they were unsuccessful. We had an urgent need for a powerful interactive tool that would let us test hypotheses quickly, but the Spark of that time could not provide the stability and scalability we needed. We made a second attempt a year later, in 2014, and this time everything went much better. That same year we began introducing streaming analytics tools based on Kafka and Samza and tried Spark Streaming, but back then it failed to take off. Because of our relatively early adoption, by 2017 we found ourselves playing catch-up for a while: a large amount of code written against Spark 1.x kept us from moving to 2.x, but in the summer of 2018 we solved that problem and are now running 2.3.3. In this version streaming works much more stably, and we have already built some new production tasks on it.
- As I understand it, you use the Scala API rather than Python, as most people do. Why is that?
Honestly, I do not see any reason to use Python for working with Spark other than laziness. The Scala API is more flexible and much more efficient, and no harder to use. If you stick to the standard features of Spark SQL, the Scala code is almost identical to the corresponding Python code, and the speed will be identical too. But as soon as you try to write even the simplest user-defined function, the difference becomes obvious: the Scala code remains just as efficient, while the Python code turns a multi-core cluster into a pumpkin and starts burning kilowatt-hours on completely unproductive activity. At the scale we have to work at, we simply cannot afford that kind of waste.
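To make the difference concrete, here is a minimal sketch (our own illustration, not code from the interview) of a user-defined function in the Scala API. Because it compiles to JVM bytecode, it runs inside the executor process, whereas an equivalent Python UDF would ship every row to a separate Python worker process and back:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scala-udf-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("alice", 3), ("bob", 7)).toDF("user", "visits")

    // A JVM-compiled UDF: executed inside the executor process,
    // no data is serialized out to an external interpreter per row.
    val visitBucket = udf((visits: Int) => if (visits > 5) "active" else "casual")

    df.withColumn("bucket", visitBucket($"visits")).show()
    spark.stop()
  }
}
```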
- With Python that is clear. But if you compare with Java, is Scala somehow better for data analysis in general? A lot of the big data stack is written in Java.
We use Java very widely, including in machine learning, and we try not to pull Scala into the most heavily loaded applications. But when it comes to interactive analysis and rapid prototyping, Scala's conciseness becomes a plus. One must always keep in mind, though, that Scala makes it very easy to shoot yourself in the foot, all the way up to the ears: many constructs may not behave the way common sense would suggest, and some simple operations can cause unnecessary copying or attempts to materialize a huge dataset in memory.
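A toy illustration of that last point (our own example, not from the interview): in Scala, chained collection transformations each materialize a full intermediate copy, while a view fuses them lazily.

```scala
object CopyingSketch {
  def main(args: Array[String]): Unit = {
    val big = (1 to 10000000).toVector

    // Each step allocates a complete new collection before the next runs:
    // two intermediate multi-million-element copies just to keep 10 values.
    val eager = big.map(_ * 2).filter(_ % 3 == 0).take(10)

    // A view fuses the steps lazily, so only the elements actually
    // needed are computed; no intermediate collections are built.
    val lazily = big.view.map(_ * 2).filter(_ % 3 == 0).take(10).toVector

    println(eager == lazily) // true
  }
}
```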
- With all these advantages, why is Scala still not that popular? It clearly beats both Python and Java, doesn't it?
Scala is a very powerful tool that demands fairly high qualifications from whoever uses it. In addition, team development imposes extra requirements on the overall engineering culture: Scala code is very easy to write, but not always easy to read, even for its own author some time later, and under the hood of a simple API all sorts of wildness may be going on. So special attention has to be paid to maintaining a uniform style and to functional and load testing of the solutions.
And when comparing JVM languages, one cannot fail to mention Kotlin: it is gaining popularity, many consider it more "ideologically correct", and it even supports Spark through the sparklin project, though in a very limited form so far. We do not use it for Spark yet, but we are following its development closely.
- Let's get back to Spark. As I understand it, even the functionality of the Scala API did not satisfy you, and you wrote some kind of fork of Spark?
It would be wrong to call our PravdaML project a fork: this library does not replace the functionality of SparkML but complements it with new features. We arrived at the solutions implemented there while trying to scale feed ranking and put model training on reproducible rails. The thing is, when developing efficient distributed machine learning algorithms, you have to account for many "technical" factors: how to properly distribute data across nodes, at what point to cache, where to downsample, etc. Standard SparkML gives no way to control these aspects, so they have to be pushed outside the ML pipeline, which hurts manageability and reproducibility.
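For contrast, here is what this looks like in plain SparkML (a hedged sketch of our own with made-up paths and column names; PravdaML's actual API is not shown here). Steps like repartitioning and caching cannot be pipeline stages, so they live outside the Pipeline object and are not captured when the fitted pipeline is saved:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical input; replace with a real dataset.
    val training = spark.read.parquet("/path/to/training")

    // These "technical" steps sit outside the ML pipeline in standard
    // SparkML, which is exactly the reproducibility gap described above.
    val prepared = training.repartition(128).cache()

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setLabelCol("label")

    val model = new Pipeline().setStages(Array(assembler, lr)).fit(prepared)
    model.write.overwrite().save("/path/to/model") // caching choices are lost
  }
}
```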
- I remember you had two variants of the name ...
Yes, the original name, ok-ml-pipelines, seemed boring to the guys, so we are now in the process of "rebranding" under the new name PravdaML.
- Do many people use it outside of your team?
I don't think there are many yet, but we are working on it. :)
- Now let's talk about roles and professions in the data field. Tell me, should a data scientist write production code, or is that already a different profession and role?
In answering this question there is my opinion, and there is harsh reality. I have always believed that to successfully put ML solutions into production, a person must understand where and why it is all being deployed (who the user is, what their needs are, and what the business needs), must understand what mathematical methods can be used to build a solution, and how those methods can work from a technical point of view. Therefore, at Odnoklassniki we still try to adhere to a single-responsibility model, where one person comes up with an initiative, then implements and rolls it out. Of course, to solve particular issues, whether it is an efficient DBMS or interactive front-end layout, you can always bring in people with more experience in those areas, but integrating all of this into a single mechanism remains with the data scientist, as the person who best understands what should come out in the end.
But then there is the harsh reality of the labor market, which is badly overheated in the ML area right now, with the result that many young specialists see no need to study anything besides ML itself. As a consequence, finding a "full-cycle" specialist is becoming increasingly difficult. A good alternative has emerged recently, though: practice has shown that good programmers pick up ML fairly quickly and quite well. :)
- Does a data engineer need to know Scala? How well, by the way? Is it necessary to go deep into the jungle of functional programming?
You definitely need to know Scala, if only because two such fundamental tools as Kafka and Spark are written in it, and you should be able to read their sources. As for the "jungle of functional programming", I would strongly advise against overusing it: the more developers can read and understand the code, the better. Even if that sometimes means unrolling an "elegant" functional construct into a banal loop.
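A toy example of that trade-off (ours, not Dmitry's): both functions below compute the sum of squares of the even numbers, one as a compact functional chain and one as the "banal loop" any reader can step through.

```scala
object ReadabilitySketch {
  // Compact functional chain: short, but denser to read.
  def sumSquaresFp(xs: Array[Int]): Long =
    xs.iterator.filter(_ % 2 == 0).map(x => x.toLong * x).sum

  // The "banal loop" version: more lines, but trivially readable.
  def sumSquaresLoop(xs: Array[Int]): Long = {
    var acc = 0L
    var i = 0
    while (i < xs.length) {
      val x = xs(i)
      if (x % 2 == 0) acc += x.toLong * x
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(1, 2, 3, 4, 5, 6)
    println(sumSquaresFp(xs) == sumSquaresLoop(xs)) // true: 4 + 16 + 36 = 56
  }
}
```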
- Has the universe of professions in this field stopped expanding, or are new professions still ahead?
I think that in the foreseeable future ML and DS will see a turning point related to automation: the main patterns people follow when engineering features, choosing a model and its parameters, and validating quality will be automated. This will sharply reduce the demand for specialists who "tune the parameters", while demand will appear for AutoML engineers able to implement and develop automated solutions.
- You teach actively, as I understand it. Why do you think that is important? What is the motivation behind it?
All of us will someday step away from work, and the quality of our life will strongly depend on those who come to replace us. That is why investing in the education of the next generation is among the most important investments.
- You will be leading several sessions on our "Data Analysis in Scala" program. Tell us briefly about them. Why do they matter?
In these classes we will study exactly how engineering and mathematics fit together: how to organize the process correctly without introducing unnecessary barriers in the ETL -> ML -> Prod chain. The course will be built around the capabilities of Spark ML: basic concepts, supported transformations, implemented algorithms and their limitations. We will also touch on the area where the existing capabilities of SparkML are not enough and it becomes necessary to use extensions such as PravdaML. And there will definitely be practice, not only at the level of "assemble a solution from ready-made building blocks", but also on how to recognize that a new "block" is needed here, and how to implement it.
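As a taste of what implementing a new "building block" can look like, here is a minimal custom SparkML Transformer (a hedged sketch of our own; the names are illustrative and not taken from the course materials):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lower
import org.apache.spark.sql.types.StructType

// A custom "block": a Transformer that lower-cases a "text" column.
// Once written, it can be dropped into a Pipeline like any built-in stage.
class LowerCaseTransformer(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("lowercase"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("text", lower(ds("text")))

  // The schema is unchanged: same column names and types.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```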
- Do you have a favorite play on words with Scala? In Russian "skala" means "rock", hence climbing wall, rock climber, rock paintings - do any of these come up in your daily life?
Only perhaps the epithet "Indoscale", which we apply to particularly remarkable pieces of open source whose author clearly wanted to demonstrate an outstanding ability to construct unreadable code out of functional abstractions.
- Moscow or St. Petersburg?
Each city has its own charm. Moscow is a rich, well-kept city with a fast pace of life. St. Petersburg is calmer and filled with the charm of a former European capital. So I like to visit Moscow, but I prefer to live in St. Petersburg.
Source: https://habr.com/ru/post/442812/