
Configuring Spark on YARN

Hello, Habr! Yesterday, at the Apache Spark meetup organized by the folks from Rambler & Co, participants asked quite a few questions about configuring this tool. We decided to follow up and share our experience. The topic is not an easy one, so we invite you to share your experience in the comments as well; maybe we are doing something wrong too.

A short introduction to how we use Spark. We run a three-month “Big Data Specialist” program, and throughout the entire second module our participants work with this tool. Accordingly, our task as organizers is to prepare the cluster for this kind of usage.

A peculiarity of our setup is that the number of people working with Spark simultaneously can be the whole group, for example during a seminar when everyone tries something at the same time, repeating after the instructor. And that is not a small number: up to 40 people at a time. There are probably not many companies in the world facing such a use case.

Below I will describe how and why we chose particular configuration parameters.
Let's start from the beginning. Spark has three ways to run on a cluster: standalone, on Mesos, and on YARN. We went with the third option because it made sense for us: we already had a Hadoop cluster, and our participants were already well acquainted with its architecture. So, YARN it is.

spark.master=yarn 

It gets more interesting from here. Each of these three cluster managers supports two deploy modes: client and cluster. Based on the documentation and various sources on the internet, the conclusion is that client mode suits interactive work, for example via a Jupyter notebook, while cluster mode is better for production jobs. In our case we were interested in interactive work, therefore:

 spark.submit.deployMode=client 
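
If you set these properties when creating a session from a Jupyter notebook rather than in spark-defaults.conf, it looks roughly like this (a minimal PySpark sketch; the application name is made up):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-seminar")                      # made-up application name
    .master("yarn")                                # spark.master=yarn
    .config("spark.submit.deployMode", "client")   # client mode for interactive work
    .getOrCreate()
)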

In principle, from this point on Spark will work on YARN one way or another, but that was not enough for us. Since our program is about big data, participants sometimes needed more than what an even, static split of resources could give them. And here we found an interesting feature: dynamic resource allocation. In short, the idea is this: if you have a heavy task and the cluster is free (say, in the morning), Spark can grant you additional resources. The need for them is computed there by a clever formula. We will not go into details; it works well.

 spark.dynamicAllocation.enabled=true 

We set this parameter, and on launch Spark complained and refused to start. Rightly so, because we should have read the documentation more carefully: it says that for everything to be OK you also need to enable an additional parameter.

 spark.shuffle.service.enabled=true 

Why is it needed? When our job no longer requires that many resources, Spark should return them to the common pool. The most time-consuming stage in almost any MapReduce-style job is the shuffle stage. This parameter makes it possible to keep the data produced at that stage and release the executors anyway. An executor is the process that does the actual computation on a worker node; it has a certain number of processor cores and a certain amount of memory.
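
For reference, the external shuffle service itself runs inside the YARN NodeManagers. On a plain Hadoop setup it is enabled roughly like this in yarn-site.xml (a sketch, not our exact config; on HDP, Ambari exposes the same settings):

 yarn.nodemanager.aux-services=mapreduce_shuffle,spark_shuffle 
 yarn.nodemanager.aux-services.spark_shuffle.class=org.apache.spark.network.yarn.YarnShuffleService 

In addition, the spark-<version>-yarn-shuffle.jar has to be on the NodeManager classpath for the service to start.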

We added this parameter, and everything seemed to work: it was noticeable that participants really did get more resources when they needed them. But another problem arose: at some point other participants would wake up, also want to use Spark, find everything taken, and be unhappy. You can understand them. We went back to the documentation and found that there are a number of parameters that let you influence the process. For example, if an executor is idle, after how long can its resources be taken away?

 spark.dynamicAllocation.executorIdleTimeout=120s 

In our case: if your executors do nothing for two minutes, please return them to the common pool. But this parameter was not always enough. It would be obvious that a person had not done anything for a long time, yet the resources were not being released. It turned out there is a separate parameter for when to reclaim executors that hold cached data. By default it was set to infinity! We corrected it.

 spark.dynamicAllocation.cachedExecutorIdleTimeout=600s 

That is, if your executors with cached data do nothing for ten minutes, they go back to the common pool. With this setup, the speed at which resources were released and handed out became decent for a large number of users, and the amount of discontent dropped. But we decided to go further and cap the maximum number of executors per application, in effect per program participant.

 spark.dynamicAllocation.maxExecutors=19 

Now, of course, there were dissatisfied people on the other side: “the cluster is idle, and I only get 19 executors”. But you have to strike some kind of balance; you cannot make everyone happy.
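
To summarize, here are all the dynamic allocation settings described above collected into one spark-defaults.conf excerpt (the values are the ones we ended up with):

 spark.dynamicAllocation.enabled=true 
 spark.shuffle.service.enabled=true 
 spark.dynamicAllocation.executorIdleTimeout=120s 
 spark.dynamicAllocation.cachedExecutorIdleTimeout=600s 
 spark.dynamicAllocation.maxExecutors=19 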

And another small story related to the specifics of our case. One day several people were late for a practical lesson, and for some reason Spark would not start for them. We checked the amount of free resources: they seemed to be there, so Spark should start. Fortunately, by then the documentation had already lodged itself somewhere in the back of our minds, and we remembered that on launch Spark looks for a port to start on. If the first port in the range is busy, it moves to the next one in order; if that one is free, it takes it. And there is a parameter that sets the maximum number of attempts. The default is 16, which is fewer than the number of people in our group in class. Accordingly, after 16 attempts Spark gave up and said it could not start. We corrected this parameter.

 spark.port.maxRetries=50 
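
To illustrate what the port search amounts to, here is a rough sketch (this is not Spark's actual code, just the idea: start from a base port, such as 4040 for the driver UI, and walk forward until a free port is found or the retry budget runs out):

import socket

def find_free_port(base_port=4040, max_retries=50):
    # Try base_port, base_port + 1, ... until a port can be bound,
    # or until the number of attempts allowed by spark.port.maxRetries runs out.
    for offset in range(max_retries + 1):
        port = base_port + offset
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
                return port   # free port found, this is where the service would start
            except OSError:
                continue      # port is busy, try the next one
    raise RuntimeError("no free port found within spark.port.maxRetries attempts")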

Next, I will cover a few settings that are not as closely tied to the specifics of our case.

For a faster startup, Spark recommends archiving the jars folder from the SPARK_HOME directory and putting it on HDFS. Then it will not waste time shipping those jars to the workers every time.

 spark.yarn.archive=hdfs:///tmp/spark-archive.zip 
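
The archive itself can be built and uploaded roughly like this (example paths, assuming SPARK_HOME is set):

 cd $SPARK_HOME/jars 
 zip -q -r /tmp/spark-archive.zip . 
 hdfs dfs -put /tmp/spark-archive.zip /tmp/spark-archive.zip 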

Also, for better performance it is recommended to use Kryo as the serializer; it is more efficient than the default one.

 spark.serializer=org.apache.spark.serializer.KryoSerializer 

And there is the old Spark problem of frequently falling over with out-of-memory errors. This often happens at the moment when the workers have finished computing and send the result to the driver. We increased the corresponding limit: by default it is 1 GB, we made it 3.

 spark.driver.maxResultSize=3072 
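
This limit usually comes into play when an action such as collect() pulls a large result back into the driver process; a made-up example (using the session from the sketch above):

# Made-up example: collect() ships every partition's result to the driver.
# If the serialized result exceeds spark.driver.maxResultSize, Spark aborts the job
# instead of letting the driver die with an out-of-memory error.
big_result = spark.range(10 ** 8).collect()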

And the last one, for dessert: how to upgrade Spark to version 2.1 on the Hortonworks distribution, HDP 2.5.3.0. This version of HDP ships with Spark 2.0 pre-installed, but we had decided for ourselves that Spark is developing quite actively, and every new version fixes bugs and adds features, including in the Python API, so we decided an upgrade was needed.

We downloaded the build for Hadoop 2.7 from the official site, unpacked it, and dropped it into the HDP folder. We set up the symlinks as required. We start it, and it does not start, printing a very cryptic error.

 java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig 

After some googling, we found out that Spark had decided not to wait for a new Hadoop release and switched to a newer version of Jersey; the developers are still arguing with each other about this in JIRA. The solution was to download Jersey version 1.17.1, drop it into the jars folder in SPARK_HOME, zip it up again, and upload it to HDFS.

That got us past this error, but a new and rather vague one appeared.

 org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master 

At the same time, running version 2.0 worked fine. Try to guess what was going on. We dug into the logs of the application and saw something like this:

 /usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar 

In short, for some reason hdp.version was not being resolved. Googling gave us a solution: in Ambari, go to the YARN settings and add the parameter to the custom yarn-site there:

 hdp.version=2.5.3.0-37 
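
In the generated yarn-site.xml this ends up as an ordinary property entry, roughly:

 <property> 
   <name>hdp.version</name> 
   <value>2.5.3.0-37</value> 
 </property> 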

This magic helped, and Spark started up. We tested several of our Jupyter notebooks; everything works. We are ready for the first Spark lesson on Saturday (tomorrow)!

UPD. The lesson revealed one more problem. At some point YARN stopped handing out containers to Spark. We had to change a YARN parameter whose default was 0.2:

 yarn.scheduler.capacity.maximum-am-resource-percent=0.8 

That is, by default only 20% of the cluster resources could go to application masters. After changing the parameter and restarting YARN, the problem was solved, and the rest of the participants were able to start a Spark context as well.

Source: https://habr.com/ru/post/327556/

