
Apache Spark: What's under the hood?

Introduction


Recently the Apache Spark project has attracted immense attention: a large number of small practical articles have been written about it, and it has become part of Hadoop 2.0. On top of that, it quickly gained additional frameworks such as Spark Streaming, SparkML, Spark SQL and GraphX, and besides these "official" frameworks a sea of projects has appeared: various connectors, algorithms, libraries and so on. Getting a quick, confident grasp of this zoo is not easy given the lack of serious documentation, especially since Spark contains building blocks of other Berkeley projects (for example BlinkDB). So I decided to write this article to make life easier for busy people.

A little background:


Spark is a UC Berkeley lab project that began around 2009. Its founders are well-known database researchers, and by its philosophy Spark is in some sense a response to MapReduce. Spark now lives under the Apache "roof", but its ideologues and main developers are still the same people.

Spoiler: Spark in 2 words


Spark can be described in one phrase: it is the internals of a massively parallel database engine. That is, Spark does not push its own storage but works on top of others (HDFS, the Hadoop distributed file system, as well as HBase, JDBC, Cassandra, ...). It is worth mentioning the IndexedRDD project, a key/value store for Spark, which will probably be integrated into the core project soon. Also, Spark does not care about transactions, but in every other respect it is an MPP DBMS engine.

RDD - the basic concept of Spark


The key to understanding Spark is the RDD: Resilient Distributed Dataset. In essence it is a reliable distributed table (strictly speaking, an RDD holds an arbitrary collection, but it is most convenient to work with tuples, as in a relational table). An RDD can be completely virtual: it merely knows how it was produced, so that it can, for example, be recomputed if a node fails. Or it can be materialized: distributed, in memory or on disk (or in memory with spilling to disk). Internally, an RDD is split into partitions: a partition is the minimum chunk of an RDD that each worker node processes.
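To make the notion of partitions concrete, here is a minimal sketch (not from the original article; the object name and the local[4] master are just for illustration) that distributes a local collection and inspects its partitions:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionsDemo {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("partitions-demo").setMaster("local[4]"))

    // distribute a local collection over 4 partitions - each partition is
    // the minimum chunk of the RDD that a worker node processes
    val rdd = sc.parallelize(1 to 1000, numSlices = 4)
    println(rdd.partitions.length)            // 4

    // a simple pipelined transformation keeps the same partitioning
    println(rdd.map(_ * 2).partitions.length) // 4

    sc.stop()
  }
}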


Everything interesting in Spark happens through operations on RDDs. That is, a Spark application usually looks like this: we create an RDD (for example, read data from HDFS), transform it (map, reduce, join, groupBy, aggregate, reduce, ...), and do something with the result, for example write it back to HDFS.

With this in mind, Spark should be seen as a parallel environment for complex analytical jobs, with a master that coordinates the job and a set of worker nodes that take part in the execution.

Let's look at a simple application in detail (we will write it in Scala; here is a reason to learn this trendy language):

A Spark application example (not everything is shown, e.g. the imports):


We will separately analyze what happens at each step.

def main(args: Array[String]) {
  // set up the configuration and create the Spark context
  val conf = new SparkConf().setAppName(appName).setMaster(master)
  val sc = new SparkContext(conf)

  // read a text file from HDFS - this gives us the first RDD
  val myRDD = sc.textFile("hdfs://mydata.txt")

  // split each line on spaces; the first token becomes the key and the
  // whole line the value - a simple "pipelined" transformation
  val afterSplitRDD = myRDD.map( x => ( x.split(" ")(0), x ) )

  // group by key - a real distributed grouping
  val groupByRDD = afterSplitRDD.groupByKey()

  // count the number of lines per key
  val resultRDD = groupByRDD.map( x => ( x._1, x._2.size ) )

  // write the result back to HDFS
  resultRDD.saveAsTextFile("hdfs://myoutput.txt")
}


And what happens there?


Now let's run through this program and see what happens.

First of all, the program is launched on the cluster master, and before any parallel processing starts there is a chance to quietly do something in a single thread. Then, as you have probably noticed, every operation on an RDD creates another RDD (except saveAsTextFile). All RDDs are created lazily: execution begins only when we ask to write to a file or, for example, to load the result into memory on the master. That is, execution runs like a database query, as a pipeline, where the element flowing through the pipeline is a partition.
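A tiny sketch of this laziness, assuming the sc and the hypothetical file path from the example above:

// nothing is read or computed here - these calls only record the lineage
val lines   = sc.textFile("hdfs://mydata.txt")
val lengths = lines.map( x => x.length )

// only an action triggers execution, and partitions then flow through
// the whole pipeline of transformations
val totalChars = lengths.reduce( (a, b) => a + b )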

What happens to the very first RDD, the one we made from the HDFS file? Spark is well integrated with Hadoop, so each worker node reads its own subset of the data, and it reads it partition by partition (partitions here coincide with HDFS blocks). That is, every node reads its first block and execution moves on down the pipeline.
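For illustration (reusing the myRDD from the example), the partitioning produced by the read can be inspected, and a larger lower bound on the number of partitions can be requested:

// each HDFS block of the file becomes one partition of the RDD
println(myRDD.partitions.length)

// if more parallelism is wanted, a minimum number of partitions can be requested
val finerRDD = sc.textFile("hdfs://mydata.txt", minPartitions = 16)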

After the read from disk comes a map; it runs trivially on each worker node.

Next comes groupBy. This is no longer a simple pipelined operation but a real distributed grouping. Ideally it is better to avoid this operator: for now it is not implemented very cleverly, does not track data locality, and is comparable to a distributed sort in cost. Something to keep in mind.
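As a point of comparison (my sketch, not from the article, reusing afterSplitRDD from the example above): for counting lines per key, reduceByKey pre-aggregates on each node before the shuffle, so much less data crosses the network than with groupByKey:

// groupByKey ships every (key, line) pair across the network before counting
val countsViaGroup  = afterSplitRDD.groupByKey().map( x => ( x._1, x._2.size ) )

// reduceByKey first combines locally within each partition, then shuffles
// only the partial counts
val countsViaReduce = afterSplitRDD.map( x => ( x._1, 1 ) ).reduceByKey( _ + _ )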

Let's think about the state of affairs at the moment of the groupBy. All the RDDs before it were pipelined: they did not store anything anywhere. In the event of a failure they would simply pull the missing data from HDFS again and push it through the pipeline. But groupBy breaks the pipeline, and as a result we get a cached RDD. If it is lost, we now have to redo all the RDDs up to the groupBy from scratch.

To avoid having to recompute the whole pipeline when failures hit a complex Spark application, Spark lets the user control caching with the persist operator. It can cache in memory (in which case data is recomputed when it is evicted from memory, which can happen when the cache fills up), on disk (not always fast enough), or in memory with spilling to disk when the cache overflows.
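A sketch of how this looks in code (reusing afterSplitRDD from the example; the second output path is hypothetical):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val grouped = afterSplitRDD.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)

// the first action materializes the cached partitions ...
grouped.count()

// ... later actions reuse them instead of replaying the lineage from HDFS
grouped.map( x => ( x._1, x._2.size ) ).saveAsTextFile("hdfs://myoutput2.txt")

// release the cached data when it is no longer needed
grouped.unpersist()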

After that we apply another map and write the result to HDFS.

Well, now it is more or less clear what happens inside Spark at a basic level.

But what about the details?


For example, say you want to know exactly how the groupBy operation works. Or the reduceByKey operation, and why it is much more efficient than groupBy. Or how join and leftOuterJoin work. Unfortunately, most of the details are easiest to find out only from the Spark sources or by asking on the mailing list (by the way, I recommend subscribing to it if you plan to do something serious or non-standard with Spark).
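For the curious, here is a small sketch of join and leftOuterJoin on pair RDDs (the data and names are made up for illustration, assuming an existing SparkContext sc):

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "lamp")))

// inner join keeps only the keys present on both sides
val joined = users.join(orders)           // RDD[(Int, (String, String))]

// leftOuterJoin keeps every user and wraps the right side in an Option
val withAll = users.leftOuterJoin(orders) // RDD[(Int, (String, Option[String]))]

withAll.collect().foreach(println)
// prints (order may vary): (1,(alice,Some(book))), (1,(alice,Some(pen))), (2,(bob,None))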

It is even harder to understand what is going on in the various connectors to Spark, and to what extent they can be used at all. For example, we had to abandon the idea of integrating with Cassandra for a while because of the unclear state of support for the Spark connector. But there is hope that decent documentation will appear in the near future.

And what interesting things do we have on top of Spark?


Source: https://habr.com/ru/post/251507/

