Big Data is a problem: the amount of information grows every day, accumulating like a snowball. Fortunately, this problem has solutions; in the JVM world alone, tens of thousands of projects process ever more data.
In 2012, the Apache Spark framework was released: written in Scala and designed to speed up certain classes of Big Data processing tasks. Over four years the project has matured to version 2.0, which (in fact, starting with versions 1.3-1.5) offers a powerful and convenient Java API. To understand who needs all this, and which tasks should and should not be solved with Spark, we spoke with Evgeny Borisov, the author of the training "Welcome to Spark", which will be held October 12-13 in St. Petersburg.
JUG.RU: Evgeny, hello! Let's start from the beginning: tell us briefly what Spark is and what it's all about.
Evgeny: First of all, Apache Spark is a framework: an API that lets you process Big Data on an essentially unlimited pool of resources, on machines that scale independently. To make it clear to Java developers, think of good old JDBC, which lets you talk to a database: read something from it, write something to it. Spark likewise lets you read and write data, except that your code scales out indefinitely.
And then a reasonable question arises: where can you write to and where can you read from? You can write to
Apache Hadoop, you can work with HDFS, you can with Amazon S3. There are plenty of distributed storage systems; for many of them a Spark API already exists, and for others it is being written. For example,
Apache Cassandra has its own connector for this (in DataStax Enterprise), which makes it possible to use Spark with Cassandra. Finally, you can even work with the local file system: scaling obviously isn't possible there, so this option is usually used for testing.
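As a small illustration of that last point, here is a hedged sketch (assuming the Spark 2.x SparkSession API; the file path is invented): the same read that would target HDFS or S3 on a cluster can be pointed at a local file and run with a local master for tests.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class LocalSparkTest {
    public static void main(String[] args) {
        // "local[*]" means no cluster at all: Spark uses the cores of this machine.
        SparkSession spark = SparkSession.builder()
                .appName("local-test")
                .master("local[*]")
                .getOrCreate();

        // Reading from the local file system - handy for tests.
        // On a real cluster the same call would point to HDFS, S3, etc.
        Dataset<String> lines = spark.read().textFile("data/sample.txt");
        System.out.println("lines read: " + lines.count());

        spark.stop();
    }
}
```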
More and more information accumulates every year, and with it the desire to process it with an unlimited amount of resources.
JUG.RU: So Spark implies a distributed infrastructure. Does that mean the project is entirely "enterprise", or can it in principle be used in personal projects?
Evgeny: Today it's rather an enterprise framework, but Big Data is now spreading at such a pace that soon there will be no getting away from it at all: more and more information accumulates every year, and with it the desire to process it with an unlimited amount of resources. Today you have little data and code that processes only it; when more accumulates, you'll have to rewrite everything, right? Whereas if everything had been processed with Spark from the start, you could simply add a few more machines to the cluster, and the code wouldn't need to change at all.
JUG.RU: You say Spark is still enterprise-class. Then an important question: what about stability and backward compatibility?
Evgeny: Since Spark is written in Scala, and Scala is rather bad at backward compatibility, Spark suffers from this a little too. It happens that you upgrade and some functionality suddenly breaks, but to a much lesser extent than in Scala itself. The API here is much more stable, and nothing is critical: such breakages are usually fixed locally, point by point.
Now the second version of Spark has been released; it looks very cool, but so far I can't say how much has broken there, since hardly anyone has switched to it yet. By the training I'll have time to prepare an overview and show what has changed and what has been updated.
It's worth adding here that although Spark itself is written in Scala, I'm not much of a supporter of people who don't know Scala writing in it. I'm often reproached: "there you go again, attacking Scala". I'm not attacking Scala! I just think that if someone doesn't know Scala but wants to write for Spark, that's no reason to learn Scala.
JUG.RU: Okay, so we've established that in principle it's better to write for Spark in Scala, but if you can't do Scala, then Ja...
Evgeny: No, no, no! I didn't say it's better to write in Scala. I said that it's perfectly fine for people who know Scala well to write for Spark in it.
But if a person says: "Well, I don't know Scala, but I have to write for Spark because I've realized what a cool thing it is. Does that mean I now have to learn Scala?", I say there's absolutely no point in that! And it was funny for me to read the comments of people who wrote: "we tried writing in Java, because we didn't know Scala but we did know Java, but then everything got so bad that in the end we had to switch to Scala, and now we're happy."
Today that's not the case at all. It would be true if we were talking about Java 7 and the old Spark, the one still built on RDD: yes, back then you really did end up with three-story constructions that were completely impossible to understand.
Today we have Java 8, and Spark has data frames (since version 1.3), which provide an API that lets you live perfectly well without Scala: it makes no real difference whether you write in Scala or in Java.
JUG.RU: If I can write in Java 8, what's the entry threshold for Spark? Will I have to learn a lot of things and read clever books?
Evgeny: Very low, especially if you already have practical experience with Java 8. Take Java 8 streams: the ideology is very similar, and most of the methods are even called the same (the sketch after this answer shows the parallel). You can sit down and figure it out on your own.
Training is needed more to deal with all sorts of subtleties, tricks and nuances. In addition, since the training is for Java developers, I'll show how you can wire everything up with Spring: how to build the infrastructure on top of it so that the performance-related Spark tricks can be applied with annotations and everything works out of the box.
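For illustration, here is a minimal sketch of that parallel (the word list and app name are invented; Spark 2.x Dataset API assumed): the same filter/map/count pipeline written first with a Java 8 stream and then with a Spark Dataset.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class StreamsVsSpark {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java", "scala", "jvm");

        // Plain Java 8 stream
        long javaCount = words.stream()
                .filter(w -> w.length() > 3)
                .map(String::toUpperCase)
                .count();

        // The same pipeline on a Spark Dataset: the methods read almost identically
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("streams-demo").getOrCreate();
        Dataset<String> ds = spark.createDataset(words, Encoders.STRING());
        long sparkCount = ds
                .filter((FilterFunction<String>) w -> w.length() > 3)
                .map((MapFunction<String, String>) String::toUpperCase, Encoders.STRING())
                .count();

        // Both pipelines count the same three words
        System.out.println(javaCount + " / " + sparkCount);
        spark.stop();
    }
}
```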
JUG.RU: Everyone writes and says that Spark is great for working with Big Data, and that's understandable. But a question that comes up less often: what is Spark not suitable for? What are its limitations, and which tasks shouldn't it be used for?
Evgeny: There's no point in taking Spark for tasks that don't scale: if a task doesn't scale by its very nature, Spark won't help.
One example is window functions: they do exist in Spark (in general, you can do everything in Spark), but they work really slowly. Over time that should improve; the developers are moving in that direction.
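For reference, a window function in Spark's Java API looks roughly like this (a hedged sketch; the call schema is invented): ranking each subscriber's calls by duration. It works, but as noted above, on large data such windows can be noticeably slower than plain aggregations.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class WindowExample {

    // A simple bean describing one call record (invented schema)
    public static class Call implements java.io.Serializable {
        private String caller;
        private int duration;
        public Call() {}
        public Call(String caller, int duration) { this.caller = caller; this.duration = duration; }
        public String getCaller() { return caller; }
        public void setCaller(String caller) { this.caller = caller; }
        public int getDuration() { return duration; }
        public void setDuration(int duration) { this.duration = duration; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("window-demo").getOrCreate();

        Dataset<Row> calls = spark.createDataFrame(Arrays.asList(
                new Call("alice", 120), new Call("alice", 30), new Call("bob", 45)), Call.class);

        // Rank each subscriber's calls by duration within a per-caller window
        WindowSpec byCaller = Window.partitionBy("caller").orderBy(col("duration").desc());
        calls.withColumn("rank", row_number().over(byCaller)).show();

        spark.stop();
    }
}
```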
JUG.RU: By the way, that's a good segue: where is Spark heading? It's clear that it's already possible to process data quickly and well.
Evgeny: The first Spark had the RDD (Resilient Distributed Dataset), which lets you process data with code. But data usually has a columnar structure, and it turns out you can't refer to column names: there's simply no such thing in the RDD API. So if you have a file with a huge number of columns and a huge number of rows, and you write logic that processes all of that, you end up with very unreadable code.
Data frames made it possible to process data while preserving its structure, using column names: the code became much more readable, and people familiar with SQL felt right at home in this world. On the other hand, fine-grained control over the logic was lacking.
The result was that some things were convenient to do with RDD and others with data frames. In the second Spark all of this is combined in a structure called Dataset, where you can work both ways without leaving a single API. Plus, everything has become much faster. So if we're talking about where Spark is heading: they're doing a lot of different optimizations, and the framework runs faster and faster.
The framework is running faster and faster.
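To make that concrete, here is a hedged sketch (the input path and field names are invented) of mixing both styles on a single Spark 2.x Dataset: a column-name filter in the DataFrame style followed by arbitrary Java logic in a typed lambda, without leaving one API.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DatasetStyles {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("styles-demo").getOrCreate();
        Dataset<Row> calls = spark.read().json("data/calls.json"); // hypothetical input

        // "DataFrame style": filter by column name, readable for anyone who knows SQL
        Dataset<Row> longCalls = calls.filter(col("duration").gt(60));

        // "RDD style": arbitrary Java logic in a typed lambda on the same Dataset
        Dataset<Row> weekendCalls = longCalls.filter(
                (FilterFunction<Row>) row -> isWeekend(row.<String>getAs("startedAt")));

        weekendCalls.show();
        spark.stop();
    }

    private static boolean isWeekend(String timestamp) {
        // Placeholder for real date logic
        return timestamp != null && (timestamp.contains("Sat") || timestamp.contains("Sun"));
    }
}
```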
JUG.RU: Clear: moving towards speed and flexibility. Now it's time to ask about infrastructure: what tools does Spark work with? In your JPoint talk you explain in detail that you can work with Hadoop, without Hadoop, and so on. But in the comments to the previous article there was a feeling that Spark without YARN is bad form, and that there's no proper resource management without it.
Evgeny: I disagree with that opinion. Let's first look at how it all starts: there is something that coordinates the work of the workers on which our code is launched and run in parallel. YARN, of course, coordinates all of this much better, and it also knows how to monitor the state of the workers and restart them if necessary. But you can work without YARN if you need to. There is Spark Standalone, which is of course slower and not as powerful, and besides that there are alternatives: Apache Mesos, for example, and others are still being developed. I'm sure that in five years there will be plenty of them; not everything is tied to YARN. And for distributed storage there's also a whole bunch of tools, as I said at the beginning.
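As a rough sketch of that separation between the application code and the cluster manager (the master URL is normally supplied via spark-submit rather than hard-coded; this is just an illustration):

```java
import org.apache.spark.sql.SparkSession;

public class MasterUrls {
    public static void main(String[] args) {
        // The application code stays the same; only the master URL changes:
        //   local[*]            - no cluster manager at all, handy for tests
        //   spark://host:7077   - Spark Standalone
        //   yarn                - Hadoop YARN
        //   mesos://host:5050   - Apache Mesos
        String master = args.length > 0 ? args[0] : "local[*]"; // illustrative default
        SparkSession spark = SparkSession.builder()
                .appName("master-demo")
                .master(master)
                .getOrCreate();
        System.out.println("Running against: " + spark.sparkContext().master());
        spark.stop();
    }
}
```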
JUG.RU: The theory is more or less sorted out, and what Spark is not needed for is also clear. Can you give examples of Spark applications from your own experience? Surely there was something interesting in the Big Data area.
Evgeny: I don't know about "interesting"; after all, I mostly worked on enterprise projects, and there isn't much fun there. What was interesting is that writing them was fast and convenient.
Of the interesting cases, I can recall a service for telephone companies: imagine you've flown to another country, haven't changed your SIM, and a roaming provider has to be chosen. How is it chosen, on what basis? Cheap or expensive, profitable or not: to make such decisions, telephone companies have to analyze all their data. Every call, who called whom, where from and where to, whether the connection was good, everything is recorded. This particular project crunched that data for all calls around the world: it analyzed all of it and produced various statistics.
The second example is Slice. There are people who open up access to their mailboxes so that purchases, orders, tickets and so on can be extracted from their mail, to better target advertising and to receive relevant offers. Here again a wild number of emails has to be processed; it's all stored in Amazon Redshift, and everything needs to be structured, computed and processed quickly in order to produce statistics on the basis of which customers are shown targeted ads or recommendations. We bolted Spark on to improve performance; without it everything worked very slowly.
JUG.RU: Does Spark process data in real time?
Evgeny: You can do it in real time, and you can do it in batches.
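A hedged sketch of both modes with the Spark 2.x API (the paths and the schema source are assumptions): the same Dataset operations run once over existing files as a batch, and then continuously over new files with Structured Streaming.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class BatchVsStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("stream-demo").getOrCreate();

        // Batch: process everything that is in the directory right now
        Dataset<Row> batch = spark.read().json("data/events");
        batch.groupBy("type").count().show();

        // Streaming: keep processing files as they appear in the same directory
        Dataset<Row> stream = spark.readStream()
                .schema(batch.schema())   // streaming file sources need an explicit schema
                .json("data/events");
        StreamingQuery query = stream.groupBy("type").count()
                .writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```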
JUG.RU: I see. And what about data validation? Are there tools that simplify checking data integrity or correctness?
Evgeny: Well, this is all solved at the code level: you take a million lines of data, and the first thing you do is throw out the invalid ones. By the way, statistics are usually collected on them too: how much of such data there is and why it's invalid, and that's also done with Spark.
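A minimal sketch of that approach (the validity rule and the input path are invented): split the input into valid and invalid records and collect simple statistics on the rejected ones with the same Spark code.

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class ValidationExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]").appName("validation-demo").getOrCreate();

        Dataset<String> lines = spark.read().textFile("data/input.csv"); // hypothetical path

        // "Valid" here simply means the line has the expected number of fields
        FilterFunction<String> isValid = line -> line.split(",", -1).length == 5;

        Dataset<String> valid = lines.filter(isValid);
        Dataset<String> invalid = lines.filter((FilterFunction<String>) line -> !isValid.call(line));

        // Statistics on the rejected data, computed by Spark as well
        System.out.println("valid: " + valid.count() + ", invalid: " + invalid.count());
        spark.stop();
    }
}
```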
JUG.RU: Finally, I can't help but touch on the "Java vs Scala in Spark" question once more, more out of curiosity than anything. Whose side are you on?
Evgeny: I'd rather take Java's side, even though many people dislike me for it. It's understandable: I've been writing in Java for fifteen years, and for several years I've been writing in Groovy, which is of course a step up from Java, but with the release of Java 8 things stopped being so clear-cut. Now, when starting a new project, I think every time about whether to start it in Java 8 or Groovy.
But Scala is a different world altogether! And it's harder to get a feel for it as a tool. There are no established macro-patterns there. There was a period when I had to write in Scala, and I suffered terribly. Naturally, when you suffer, you go to other people for advice. You ask one person how to build the architecture, and he says one thing. Ask another and you get a completely different answer! In the Java world everything is much more settled: much more experience, more people, a community. I have services, I have DAOs, I have dependency injection, I have Spring, and for this I have Spring Data or, say, Spring MVC. Throw away this whole ton of knowledge people have accumulated and go to Scala to learn everything from scratch? What for, if in Java everything works no worse? I understand: if Scala ran two or three times faster, or the API were ten times more convenient, I'd agree to spend a year and a half learning it.
I remember a funny incident. In Lviv a man came up to me after a talk and said:
- Listen, you don't like Scala, do you?
“I didn’t say that,” I answered.
- Well, that's the impression I get.
- Well, I just think that for a project that already has Java programmers, there is no point in wasting time transferring them to Scala.
- And have you yourself written in Scala?
- Well, I wrote a little.
- How long?
- About half a year.
- Ha, half a year... You can't understand Scala in half a year! You need at least two or three years.
It was at that moment that I realized I was absolutely right. That's where the entry threshold is high: two to three years. The question isn't whether the language is good or bad. The point is that if you can write in Java, write in Java; with Spark, in any case, that's easy and straightforward. Less and less often do developers on my projects run into situations where Google turns up a solution for Scala but not for Java. There used to be a lot of that, but now there's practically none.
Actually, I'll say it once more: Scala's mission changed not so long ago. Instead of moving developers over to Scala (which hasn't happened in 12 years), the goal now is to make all the Scala-based products usable from Java, and this is already felt very strongly: with each new version, the same Spark becomes more and more polished for Java as well, and the gap keeps shrinking.
If a year ago I compared the number of Spark projects on GitHub in Java and in Scala, the split was around 3000 versus 7000; now these numbers have drawn closer together. And even then the gap wasn't that big: that's a huge number of programmers who write for Spark in Java, and everything works fine for them.
JUG.RU: Evgeny, thank you, and see you at the training!
If the topic of Java on Spark sounds interesting to you, we'll be glad to see you at our training. There will be plenty of exercises and live coding, and in the end you'll leave with enough knowledge to start working with Spark on your own in the familiar world of Java. You can read the details on the corresponding page.
And if the training isn't for you, you can meet Evgeny at the Joker conference, where he will give two talks:
- Myths about Spark, or can a regular Java developer use Spark (a very brief version of the training);
- Maven vs. Gradle: at the dawn of automation (with Baruch Sadogursky).