
Why learn Spark?

Why do developers learn Spark? How can you master the technology at home? What can Spark do, what can't it do, and what lies ahead for it? We talked about all this with Alexey Zinoviev, Java and Big Data trainer at EPAM.


- You're a Java and Big Data trainer. What does that mean? What do you do?

- At EPAM I prepare and run trainings, at the request of teams, for senior and lead engineers. Covering every topic that starts with the letter J in depth is beyond any one person, so I specialize in the following: Java Concurrency, JVM internals (those very guts), Spring Boot, Kafka, Spark, Lombok, the actor model - in short, everything that helps both raise a developer's own productivity and speed up their application. Of course, if needed, I can prepare a training on Java EE or design patterns, but there are already plenty of such materials both inside and outside EPAM.
- You have listed quite a lot of different topics.

- Even within these topics there are always so many new questions and tasks that almost every morning I have to tell myself: "Stop, I'm not doing this." So, by process of elimination, I end up with a handful of areas I actually work in. One of them is Spark. The Spark family of frameworks keeps growing and expanding, so even there you have to pick one area to become a real expert in. This year I chose Structured Streaming, so I can understand what is going on at the level of its source code and quickly solve problems as they come up.

- Why should a developer learn to work with Spark?

- Three years ago, if you wanted to do Big Data, you had to be able to spin up Hadoop, tune it, write gnarly MapReduce jobs, and so on. Now knowledge of Apache Spark is just as important. At an interview any Big Data engineer will still be grilled on Hadoop, but perhaps not as thoroughly, and hands-on production experience won't be required.

While with Hadoop, building integration bridges to other data formats, platforms, and frameworks was long and painful, with Spark we see a different picture. The community developing it races to hook up the next NoSQL database by writing a connector for it.

As a result, many large companies are paying attention to Spark and migrating to it: most of what they want is already implemented there. Previously they would clone Hadoop in broad strokes while adding their own twist: support for extra operations, some kind of internal optimizer, and so on.

- There is a whole zoo of Spark frameworks. What can you do with them?

- First, the Spark zoo helps you quickly build reports and extract facts and aggregates from large volumes of data, both static and rapidly streaming into your Data Lake.

Second, it solves the problem of combining machine learning with distributed data that is spread across a cluster and processed in parallel. This is done quite easily, and thanks to the R and Python connectors Spark's capabilities are available to data scientists who could not be further from the problems of building high-performance backends.
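
As a rough illustration of that point, here is a minimal spark.ml sketch in Scala (the same pipeline is reachable from Python and R through the connectors mentioned above). The data and column names are invented, and it needs the spark-mllib dependency:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// A toy spark.ml sketch: the data is made up, but the same code runs whether
// the DataFrame fits on a laptop or is partitioned across a cluster.
object MlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ml-sketch")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val training = Seq(
      (1.0, 0.5, 1.2),
      (0.0, 2.3, 0.1),
      (1.0, 0.7, 1.8),
      (0.0, 1.9, 0.3)
    ).toDF("label", "f1", "f2")

    // Assemble raw columns into the single vector column spark.ml expects
    val assembled = new VectorAssembler()
      .setInputCols(Array("f1", "f2"))
      .setOutputCol("features")
      .transform(training)

    // Fit a logistic regression and look at its predictions
    val model = new LogisticRegression().fit(assembled)
    model.transform(assembled).select("label", "prediction").show()
  }
}
```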

Third, it handles the problem of integrating everything with everything: everyone writes connectors for Spark. Spark can also be used as a quick filter to reduce the volume of incoming data. For example, pulling a stream out of Kafka, filtering and aggregating it, and writing the result into MySQL - why not?
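
A minimal sketch of that Kafka-to-MySQL idea might look like this. The broker, topic, table and credentials are placeholders; note also that foreachBatch only appeared in Spark 2.4, older versions would need a custom ForeachWriter or the DStream API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object KafkaToMySqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-filter-aggregate")
      .master("local[*]") // local mode is enough for a laptop experiment
      .getOrCreate()

    import spark.implicits._

    // Read the raw stream from a Kafka topic
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "clicks")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // Keep only "error" events and count them per one-minute window
    val errorsPerMinute = events
      .filter($"value".contains("error"))
      .groupBy(window($"timestamp", "1 minute"))
      .count()
      .select($"window.start".as("minute"), $"count")

    // Dump every micro-batch into MySQL over JDBC (the MySQL driver must be on
    // the classpath; a real job would upsert instead of blindly appending)
    val query = errorsPerMinute.writeStream
      .outputMode("update")
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        batch.write
          .format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/metrics")
          .option("dbtable", "errors_per_minute")
          .option("user", "spark")
          .option("password", "secret")
          .mode("append")
          .save()
      }
      .start()

    query.awaitTermination()
  }
}
```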

- Are there problems that Spark can't handle?

- Of course there are, since we are not at a framework fair where I could sell you the perfect hammer that also paints walls. Take the same machine learning: work on building an ideal framework is still in progress. Many lances have been broken over the final API design, and some of the algorithms are simply not parallelized (there are only papers and single-threaded implementations).

There is also the problem that Spark Core has already gone through three generations of APIs: RDD, DataFrame, Dataset. Many components are still built on RDD (I mean Streaming, most of the MLlib algorithms, and large-scale graph processing).
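
To make the three generations concrete, here is a small sketch, assuming a local SparkSession and invented data, that computes the same per-user totals with each API:

```scala
import org.apache.spark.sql.SparkSession

case class Click(user: String, count: Long)

object ThreeApis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("three-apis")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val raw = Seq(("alice", 3L), ("bob", 5L), ("alice", 2L))

    // 1. RDD: low-level functional transformations, no query optimizer
    val rddTotals = spark.sparkContext
      .parallelize(raw)
      .reduceByKey(_ + _)

    // 2. DataFrame: named columns, goes through the Catalyst optimizer
    val dfTotals = raw.toDF("user", "count")
      .groupBy("user")
      .sum("count")

    // 3. Dataset: same optimizer, plus compile-time types via a case class
    val dsTotals = raw.map { case (u, c) => Click(u, c) }.toDS()
      .groupByKey(_.user)
      .mapValues(_.count)
      .reduceGroups(_ + _)

    rddTotals.collect().foreach(println)
    dfTotals.show()
    dsTotals.show()
  }
}
```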

- What about new Spark frameworks?

- Not all of them are mature enough for production yet. The most production-ready right now is Structured Streaming, which has finally emerged from its experimental underground. But it cannot yet, for example, join two streams; you have to fall back and write a mix of DStreams and DataFrames. On the other hand, there is almost no problem with developers breaking the API from version to version. Things are quite calm here, and code written for Spark a couple of years ago will still run with minor changes.
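
For context, that "DStreams/DataFrames mix" usually looks something like this rough sketch: data arrives through the old DStream API, and inside each micro-batch you switch to DataFrames and SQL. The socket source, host and port are placeholders, and it needs the spark-streaming dependency:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamMixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dstream-dataframe-mix")
      .master("local[2]") // at least two threads: one for the receiver, one for processing
      .getOrCreate()

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      import spark.implicits._
      // Hop from the RDD world into the DataFrame world for this batch
      val df = rdd.toDF("line")
      df.groupBy("line").count().show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```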

- Where is Spark heading? What tasks will it be able to solve in the near future?

- Spark is moving toward a completely tabular view of the world, DataFrames everywhere, for every component. That will make it possible to safely drop RDD support in Spark 3.0 and concentrate entirely on the engine that optimizes the "Spark assembler" your high-level set of table operations gets compiled into. Spark is also moving toward tight integration with deep learning, in particular through the TensorFrames project.
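
You can peek at that "Spark assembler" yourself: explain(true) prints the logical and physical plans Catalyst produces for a high-level table expression. A tiny sketch with toy data in local mode:

```scala
import org.apache.spark.sql.SparkSession

object ExplainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("explain-sketch")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val clicks = Seq(("alice", 3), ("bob", 5), ("alice", 7)).toDF("user", "clicks")

    // Prints the parsed, analyzed, and optimized logical plans and the physical plan
    clicks.filter($"clicks" > 4).groupBy("user").count().explain(true)
  }
}
```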

- What should we expect in a year, say?

- I think that in 2018 there will be more monitoring tools, deployment tools, and other services than there are now, offering a "Spark cluster in one click, fully integrated with everything else, plus a visual designer" for reasonable money or almost for free, billing only for server time.

- On YouTube there are plenty of videos on how to install Spark in two clicks, but few materials on what to do next. What do you recommend?

- I can recommend several resources:


- What level of developer should learn Spark?

- You could, of course, sit someone down to write Spark code whose whole background is a couple of lab assignments in Pascal or Python. They would run "Hello World" without any problems, but what would be the point?
It seems to me that learning Spark pays off for developers who have already worked in the "bloody enterprise", built backends, written stored procedures. The same goes for those who have solid experience configuring a DBMS and optimizing queries, who have not yet forgotten their Computer Science, and who enjoy thinking about how to process data while shaving the constant off an algorithm's complexity estimate. If you have been doing the same thing for several years and "digging through the sources" is not for you, it is better to walk past Spark.

- Is it possible to master Spark at home?

- You can start with a laptop that has at least 8 GB of RAM and a couple of cores. Just install IDEA Community Edition + the Scala plugin + sbt (Maven works too), add a couple of dependencies and off you go. This will work even on Windows, but of course it's better to set everything up on Ubuntu / CentOS right away.
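
Those "couple of dependencies" might look roughly like this in a build.sbt; the version numbers are only illustrative, pick whatever is current for you:

```scala
// build.sbt - a minimal project for local Spark experiments
name := "spark-playground"

version := "0.1.0"

scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.2.0",
  "org.apache.spark" %% "spark-sql"       % "2.2.0",
  "org.apache.spark" %% "spark-streaming" % "2.2.0",
  // Kafka source for Structured Streaming, if you want to experiment with streams
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.0"
)
```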

After that you can deploy a small Spark cluster in the cloud for a project that collects data from the Web, or process any open dataset from github.com/caesar0301/awesome-public-datasets . And read my GitBook, of course.

- What difficulties do you usually encounter when working with Spark?

- What works on a small dataset (testing approaches, certain JVM settings) often behaves differently on large heaps in production.

Another difficulty for a Java developer is learning Scala: most of the code base and the function signatures mean reading Scala code "with a dictionary". But it is a pleasant difficulty.

And last but not least: even a pet project on a "small cluster" with a "medium-sized dataset" is quite expensive. The Amazon bills are noticeably bigger than for a web toy thrown together to show off the next Java framework.

On September 9, in St. Petersburg, I will run an Apache Spark training for Java developers. I will share my experience and explain which Spark components are worth using right away, how to set up the environment, how to build your ETL process, how to work with the newest version of Spark, and more.

Source: https://habr.com/ru/post/336090/

