
Apache Spark as the core of the project. Part 2. Streaming, and what we ran into

Hello, colleagues. Yes, less than three years have passed since the first article, but the project whirlpool has only now let me go. I want to share my thoughts and the problems we ran into with Spark Streaming in conjunction with Kafka. Perhaps some of you have had successful experience with this setup, so I will be glad to discuss it in the comments.

So, our project needs to make decisions in real time. We already use Spark successfully for batch processing, so we decided to use it for real-time processing as well. This gives us a single technology platform and a single code base.

The workflow looks like this: all events go into a queue (Apache Kafka) and are then read and processed by Spark Streaming consumers. The consumers have to solve two tasks:

- deliver the "raw" event log to HDFS;
- update user profile attributes in HBase.

The data that arrives in Kafka must ultimately end up in HDFS as "raw" log files converted to Parquet, and in HBase as user profile attributes. At one time we quite successfully used Apache Flume for this kind of routing, but this time we decided to delegate the job to Spark Streaming. Out of the box Spark can work with both HDFS and HBase, and on top of that the developers guarantee "exactly once" semantics. A rough sketch of what the Parquet part of such a dump might look like is shown below.
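
To make the routing concrete, here is a minimal sketch (not our production code) of how raw events from a DStream could be dumped to Parquet on HDFS with Spark 1.6. The RawEvent case class, the dumpRawEvents helper and the HDFS path are assumptions made purely for illustration.

```scala
// A minimal sketch: dumping raw events from a DStream into Parquet on HDFS (Spark 1.6).
// The case class, helper name and HDFS path are assumptions for illustration.
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

case class RawEvent(userId: String, payload: String)

def dumpRawEvents(events: DStream[RawEvent]): Unit = {
  events.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // One Parquet directory per micro-batch, named after the batch time.
      rdd.toDF().write.parquet(s"hdfs:///data/raw_events/batch=${time.milliseconds}")
    }
  }
}
```
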
And now let's take a slightly closer look at message delivery semantics (Message Delivery Semantics). There are three kinds:

- At most once: a message is delivered no more than once, but may be lost;
- At least once: a message is never lost, but may be delivered more than once;
- Exactly once: a message is delivered exactly once, with no losses and no duplicates.

And this is where the biggest misunderstanding lies. When the Spark developers talk about exactly once semantics, they mean Spark alone. That is, once the data has made it into a Spark process, it will be delivered exactly once to every user function involved in the processing, including those running on other hosts.

But, as you understand, the full workflow does not consist of Spark alone. Three parties are involved in our pipeline, and the semantics have to be considered for the whole chain.

As a result we have two problems: delivering data from Kafka to Spark, and delivering data from Spark to the storage layer (HDFS, HBase).

From Kafka to Spark


In theory, the problem of delivering data from Kafka to Spark is solved in one of two ways.

Method one, old (Receiver-based Approach)


There is a receiver that uses the Kafka high-level consumer API to keep track of what has been read (the offsets). These offsets, in the classics of the genre, are stored in ZooKeeper. Everything would be fine, but there is a non-zero probability of a message being delivered more than once at the moment of a failure, and that is At least once.
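
For reference, a minimal sketch of the receiver-based approach with the spark-streaming-kafka API of Spark 1.6; the ZooKeeper quorum, consumer group and topic names are placeholders, not our real configuration.

```scala
// A sketch of the receiver-based approach (Spark 1.6, spark-streaming-kafka artifact).
// ZooKeeper quorum, consumer group and topic are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverBasedExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("receiver-based"), Seconds(10))

    // The high-level consumer tracks offsets and commits them to ZooKeeper.
    val stream = KafkaUtils.createStream(
      ssc,
      "zk1:2181,zk2:2181,zk3:2181", // ZooKeeper quorum
      "events-consumer-group",      // consumer group id
      Map("events" -> 2)            // topic -> number of receiver threads
    )

    // The stream yields (key, message) pairs; here we just count messages per batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```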

Method Two, New (Direct Approach (No Receivers))


The developers implemented a new Spark driver that handles offset tracking itself. It stores information about what has been read in HDFS, in so-called checkpoints. This approach guarantees exactly once semantics, and this is the one we use.
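
A minimal sketch of the direct approach, again against the Spark 1.6 API: the offsets are saved as part of the checkpoint, and on restart the context is recovered from it with StreamingContext.getOrCreate. The broker list, topic and checkpoint path are placeholders.

```scala
// A sketch of the direct (receiverless) approach with checkpointing in HDFS (Spark 1.6).
// Broker list, topic and checkpoint path are placeholders.
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  val checkpointDir = "hdfs:///user/spark/checkpoints/events"

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-stream"), Seconds(10))
    ssc.checkpoint(checkpointDir) // offsets are stored here as part of the checkpoint

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.map(_._2).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, recover the context (and offsets) from the checkpoint if it exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```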

Problem number one


Spark sometimes corrupts its checkpoints so badly that it can no longer work with them afterwards and falls into a state resembling heavy intoxication: it stops reading data, but keeps hanging in memory and telling everyone that everything is fine. What causes this problem is not clear at all. So we kill the process, delete the checkpoints and start over, reading everything either from the beginning or from the end. Which, again, is not exactly once. :) For historical reasons we are on version 1.6.0 on Cloudera. Perhaps an upgrade would make this go away.

Problem number two


Kafka strikes rarely, but when it does, it hurts. It happens that one of the brokers goes down. Understanding why it crashed is next to impossible, because the logs are utterly uninformative. Losing a single broker is not scary in itself, Kafka is designed for that. But if you miss it and the broker is not restarted in time, the whole cluster becomes inoperable. That certainly does not happen within an hour, but still.

From Spark to external storage


Here things are not so rosy. The developer has to take care of delivery guarantees from Spark to external storage on their own, which adds a hefty overhead to development and architecture. If exactly once semantics is needed at this level, you will have to put in real effort, typically by making the output operations idempotent or wrapping them in transactions. By the way, we have not solved this problem in this part of the system and are content with At most once semantics.
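
For completeness, here is a rough sketch (assuming the HBase 1.x client API, which is not necessarily what we run) of one common way to approach this: make the write idempotent by deriving the HBase row key from the record itself, so that reprocessing a record after a failure simply overwrites the same cell. The table name, column family and record layout are invented for the example.

```scala
// A sketch of an idempotent write to HBase from a Spark Streaming job.
// Table name, column family and the (userId, value) layout are assumptions.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeProfiles(stream: DStream[(String, String)]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One HBase connection per partition, not per record.
      val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val table = conn.getTable(TableName.valueOf("user_profiles"))
      try {
        records.foreach { case (userId, value) =>
          // The row key comes from the data, so writing the same record twice
          // just overwrites the same cell -- the operation is idempotent.
          val put = new Put(Bytes.toBytes(userId))
          put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("last_event"), Bytes.toBytes(value))
          table.put(put)
        }
      } finally {
        table.close()
        conn.close()
      }
    }
  }
}
```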

Results:


My feeling is that you can use Spark Streaming, but only if your data has no particular financial value and you are not afraid to lose it. The ideal case is when the data is guaranteed to reach storage through some other, more reliable subsystem, and Spark Streaming is used as a coarse tool that produces approximate conclusions or imprecise statistics in real time, followed by a refining pass in batch processing mode.

Source: https://habr.com/ru/post/330986/

