Our story began with a seemingly simple task: set up analytical tools for data scientists and data analysts. We were approached with this request by colleagues from the retail risk and CRM divisions, where the concentration of data science specialists has historically been high. The customers had a simple wish - to write code in Python, import advanced libraries (xgboost, pytorch, tensorflow, etc.) and run algorithms on data pulled from an HDFS cluster.

Everything seems simple and clear. But there turned out to be so many pitfalls that we decided to write a post about it and publish a ready-made solution on GitHub.
First, some details about the source infrastructure:
- An HDFS data warehouse (12 Oracle Big Data Appliance nodes, Cloudera distribution). It holds about 130 TB of data from various internal systems of the bank, plus heterogeneous information from external sources.
- Two application servers on which the analytical tools were to be deployed. It is worth mentioning that these servers run more than just advanced analytics workloads, so one of the requirements was to use containerization (Docker) to manage server resources and to set up and isolate the various environments.
As the main environment for the analysts' work we chose JupyterHub, which has de facto become one of the standards for working with data and developing machine learning models. You can read more about it here. Further down the road we are also considering JupyterLab.
It would seem simple: take the Python + Anaconda + Spark stack and configure it, install JupyterHub on the application server, integrate it with LDAP, connect Spark or reach the data in HDFS in some other way, and off you go - build models!
If we dig into all the inputs and requirements, the more detailed list looks like this:
- Run JupyterHub in Docker (base OS - Oracle Linux 7)
- A Cloudera CDH 5.15.1 + Spark 2.3.0 cluster with Kerberos authentication against Active Directory and a dedicated MIT KDC in the cluster (see Cluster-Dedicated MIT KDC with Active Directory), running Oracle Linux 6
- Active Directory integration
- Transparent authentication in Hadoop and Spark
- Python 2 and 3 support
- Spark 1 and 2 (with the ability to use cluster resources for training models and to parallelize data processing with pyspark)
- Ability to limit host resources
- A set of required libraries
This post is intended for IT professionals who face the need to solve similar problems in their work.
Solution Description
Run in Docker + Cloudera Cluster Integration
There is nothing unusual here. JupyterHub and the Cloudera product clients are installed in the container (how exactly - see below), and the configuration files are mounted from the host machine:
start-hub.sh

    VOLUMES="-v/var/run/docker.sock:/var/run/docker.sock:Z \
             -v/var/lib/pbis/.lsassd:/var/lib/pbis/.lsassd:Z \
             -v/var/lib/pbis/.netlogond:/var/lib/pbis/.netlogond:Z \
             -v/var/jupyterhub/home:/home/BANK/:Z \
             -v/u00/:/u00/:Z \
             -v/tmp:/host/tmp:Z \
             -v${CONFIG_DIR}/krb5.conf:/etc/krb5.conf:ro \
             -v${CONFIG_DIR}/hadoop/:/etc/hadoop/conf.cloudera.yarn/:ro \
             -v${CONFIG_DIR}/spark/:/etc/spark/conf.cloudera.spark_on_yarn/:ro \
             -v${CONFIG_DIR}/spark2/:/etc/spark2/conf.cloudera.spark2_on_yarn/:ro \
             -v${CONFIG_DIR}/jupyterhub/:/etc/jupyterhub/:ro"
    docker run -p0.0.0.0:8000:8000/tcp ${VOLUMES} \
        -e VOLUMES="${VOLUMES}" \
        -e HOST_HOSTNAME=`hostname -f` \
        dsai1.2
Active Directory integration
For integrating hosts, physical and otherwise, with Active Directory / Kerberos, the standard in our company is PBIS Open. Technically, this product is a set of services that talk to Active Directory; clients, in turn, communicate with those services through unix domain sockets. The product integrates with Linux PAM and NSS.
We used a standard Docker technique - the unix domain sockets of the host services were mounted into the container (we found the sockets empirically by simple manipulations with the lsof command):
start-hub.sh (excerpt; the full script is shown above)

    VOLUMES="... \
             -v/var/lib/pbis/.lsassd:/var/lib/pbis/.lsassd:Z \
             -v/var/lib/pbis/.netlogond:/var/lib/pbis/.netlogond:Z \
             ... \
             -v${CONFIG_DIR}/krb5.conf:/etc/krb5.conf:ro \
             ..."
The PBIS packages themselves are installed inside the container, but without running the post-install scripts. This way we get only the executables and libraries, and the PBIS services are not started inside the container - we do not need them there, since we connect to the host's services instead. The PAM and NSS integration commands are run manually.
Dockerfile

    # Install PAM itself and standard PAM configuration packages.
    RUN yum install -y pam util-linux \
        # Here we just download PBIS RPM packages then install them omitting scripts.
        # We don't need scripts since they start PBIS services, which are not used - we connect to the host services instead.
        && find /var/yum/localrepo/ -type f -name 'pbis-open*.rpm' | xargs rpm -ivh --noscripts \
        # Enable PBIS PAM integration.
        && domainjoin-cli configure --enable pam \
        # Make pam_loginuid.so module optional (Docker requirement) and add pam_mkhomedir.so to have home directories created automatically.
        && mv /etc/pam.d/login /tmp \
        && awk '{ if ($1 == "session" && $2 == "required" && $3 == "pam_loginuid.so") { print "session optional pam_loginuid.so"; print "session required pam_mkhomedir.so skel=/etc/skel/ umask=0022";} else { print $0; } }' /tmp/login > /etc/pam.d/login \
        && rm /tmp/login \
        # Enable PBIS nss integration.
        && domainjoin-cli configure --enable nsswitch
As a result, the PBIS clients inside the container talk to the PBIS services on the host. JupyterHub uses the PAM authenticator, and with PBIS properly configured on the host everything works out of the box.
To avoid letting every AD user into JupyterHub, you can use a setting that restricts access to specific AD groups:
config-example/jupyterhub/jupyterhub_config.py

    c.DSAIAuthenticator.group_whitelist = ['COMPANY\\domain^users']
Transparent authentication in Hadoop and Spark
When a user logs in to JupyterHub, PBIS caches the user's Kerberos ticket in a file in the /tmp directory. For transparent authentication it is therefore enough to mount the host's /tmp directory into the container and set the KRB5CCNAME variable to the right value (this is done in our authenticator class).
start-hub.sh (excerpt; the full script is shown above)

    VOLUMES="... \
             -v/tmp:/host/tmp:Z \
             ..."
assets/jupyterhub/dsai.py

    env['KRB5CCNAME'] = '/host/tmp/krb5cc_%d' % pwd.getpwnam(self.user.name).pw_uid
Thanks to the code above, a JupyterHub user can execute hdfs commands from a Jupyter terminal and run Spark jobs without any extra authentication steps. Mounting the host's entire /tmp directory into the container is not safe - we are aware of this problem, but a solution is still in the works.
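As a quick sanity check, a user can verify from a notebook that the ticket really is picked up inside the container. A minimal sketch (the cache file name simply follows the krb5cc_<uid> convention used above; the hdfs call assumes the Cloudera clients are on PATH):

    # Minimal sanity check from a notebook cell: the Kerberos cache mounted from
    # the host should be visible and valid inside the user's container.
    import os
    import subprocess

    print(os.environ.get('KRB5CCNAME'))  # expected: something like /host/tmp/krb5cc_<uid>
    print(subprocess.check_output(['klist'], universal_newlines=True))
    # With a valid ticket, hdfs commands work without an explicit kinit:
    print(subprocess.check_output(['hdfs', 'dfs', '-ls', '/'], universal_newlines=True))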
Python versions 2 and 3
Here everything would seem simple: install the required Python versions and register them with Jupyter by creating the corresponding kernels. This topic is already well covered elsewhere; we use Conda to manage the Python environments. Why the simplicity is only apparent will become clear in the next section. An example kernel for Python 3.6 (this file is not in git - all kernel files are generated by code):
/opt/cloudera/parcels/Anaconda-5.3.1-dsai1.0/envs/python3.6.6/share/jupyter/kernels/python3.6.6/kernel.json

    {
      "argv": [
        "/opt/cloudera/parcels/Anaconda-5.3.1-dsai1.0/envs/python3.6.6/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
      ],
      "display_name": "Python 3",
      "language": "python"
    }
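Since kernel files like the one above are generated rather than kept in git, the generation can be sketched as a loop over the conda environments in the parcel. This is only an illustration - the environment names and display names below are assumptions, not the exact production generator:

    # Sketch: generate a kernel.json for every conda environment in the parcel.
    # ANACONDA_HOME and the environment list are illustrative assumptions.
    import json
    import os

    ANACONDA_HOME = '/opt/cloudera/parcels/Anaconda-5.3.1-dsai1.0'
    ENVS = {'python2.7': 'Python 2', 'python3.5': 'Python 3.5', 'python3.6.6': 'Python 3'}

    for env, display_name in ENVS.items():
        prefix = os.path.join(ANACONDA_HOME, 'envs', env)
        kernel_dir = os.path.join(prefix, 'share', 'jupyter', 'kernels', env)
        os.makedirs(kernel_dir, exist_ok=True)
        spec = {
            'argv': [os.path.join(prefix, 'bin', 'python'),
                     '-m', 'ipykernel_launcher', '-f', '{connection_file}'],
            'display_name': display_name,
            'language': 'python',
        }
        with open(os.path.join(kernel_dir, 'kernel.json'), 'w') as f:
            json.dump(spec, f, indent=2)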
Spark 1 and 2
To integrate with the Spark clients you also need to create kernels. An example kernel for Python 3.6 and Spark 2:
/opt/cloudera/parcels/Anaconda-5.3.1-dsai1.0/envs/python3.6.6/share/jupyter/kernels/python3.6.6-pyspark2/kernel.json

    {
      "argv": [
        "/opt/cloudera/parcels/Anaconda-5.3.1-dsai1.0/envs/python3.6.6/bin/python",
        "-m",
        "ipykernel_launcher",
        "-f",
        "{connection_file}"
      ],
      "display_name": "Python 3 + PySpark 2",
      "language": "python",
      "env": {
        "JAVA_HOME": "/usr/java/default/",
        "SPARK_HOME": "/opt/cloudera/parcels/SPARK2/lib/spark2/",
        "PYTHONSTARTUP": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/pyspark/shell.py",
        "PYTHONPATH": "/opt/cloudera/parcels/SPARK2/lib/spark2/python/:/opt/cloudera/parcels/SPARK2/lib/spark2/python/lib/py4j-0.10.7-src.zip",
        "PYSPARK_PYTHON": "/opt/cloudera/parcels/Anaconda-5.3.1-dsai1.0/envs/python3.6.6/bin/python"
      }
    }
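With such a kernel selected, pyspark/shell.py is executed at startup (via PYTHONSTARTUP), so the user already has a ready `spark` session and `sc` context. A typical notebook cell might then look like the sketch below; the HDFS path and column name are purely illustrative:

    # Typical cell in a "Python 3 + PySpark 2" notebook. pyspark/shell.py has
    # already created the `spark` session and `sc` context on kernel startup.
    # The HDFS path and column name below are illustrative.
    df = spark.read.parquet('/data/example/transactions')
    sample = df.filter(df['amount'] > 0).limit(100000).toPandas()  # pull a sample to the driver
    print(sample.describe())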
We note right away that the requirement to support Spark 1 is purely historical. Still, someone may face similar constraints - for example, being unable to install Spark 2 in the cluster. So we describe here the pitfalls we ran into along the way.
First, Spark 1.6.1 does not work with Python 3.6. Interestingly, this was fixed in CDH 5.12.1, but for some reason not in 5.15.1. At first we wanted to solve the problem by simply applying the appropriate patch, but we later had to abandon that idea, since it would require installing a modified Spark in the cluster, which turned out to be unacceptable for us. The solution was to create a separate Conda environment with Python 3.5.
The second problem prevents Spark 1 from working inside Docker at all. The Spark driver opens a port to which the workers connect; for that, the driver sends them its IP address. When the driver runs in Docker, the workers try to connect to the container's IP, which, naturally, does not work with network=bridge.
The obvious solution is to advertise the host's IP instead of the container's, which was implemented in Spark 2 by adding the corresponding configuration settings. We creatively reworked that patch and applied it to Spark 1. Spark modified in this way does not need to be installed on the cluster hosts, so issues like the Python 3.6 incompatibility do not arise.
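For Spark 2, where these settings exist out of the box, the idea can be illustrated as follows. This is only a sketch and assumes the HOST_HOSTNAME variable set in start-hub.sh is propagated to the notebook environment:

    # Sketch for Spark 2: bind the driver inside the container but advertise the
    # host's address to the executors (the settings introduced by SPARK-4563).
    # Assumes HOST_HOSTNAME (set in start-hub.sh) reaches the notebook process.
    import os
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName('dsai-example')
             .config('spark.driver.bindAddress', '0.0.0.0')             # listen inside the container
             .config('spark.driver.host', os.environ['HOST_HOSTNAME'])  # advertised to executors
             .getOrCreate())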
Regardless of the Spark version, for it to work the cluster must have the same Python versions as the container. To avoid installing Anaconda directly, bypassing Cloudera Manager, we had to learn how to do two things:
- build our own parcel with Anaconda and all the required environments
- install it in Docker (for consistency)
Building the Anaconda parcel
This turned out to be a fairly simple task. All it takes is:
- Prepare the parcel contents by installing the appropriate versions of Anaconda and the Python environments
- Create the metadata file(s) and put them in the meta directory
- Package the parcel with plain tar
- Validate the parcel with Cloudera's validator utility
The process is described in more detail on GitHub, where the validator code also lives. We borrowed the metadata from the official Anaconda parcel for Cloudera and creatively reworked it.
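For reference, the packaging step itself boils down to archiving the prepared directory. A rough sketch under the assumption that the directory already contains the environments and meta/parcel.json; the names below are illustrative, the real build lives in 03-anaconda-dsai-parcel-1.0:

    # Sketch of the packaging step: a parcel is essentially a gzipped tar of the
    # prepared directory (envs/ plus meta/parcel.json). Names are illustrative.
    import tarfile

    parcel_dir = 'Anaconda-5.3.1-dsai1.0'      # prepared contents, including meta/
    parcel_file = parcel_dir + '-el7.parcel'   # archive name picked up by Cloudera Manager

    with tarfile.open(parcel_file, 'w:gz') as tar:
        tar.add(parcel_dir, arcname=parcel_dir)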
Installing the parcel in Docker
We needed this for two reasons:
- Spark operability - without a parcel it is impossible to get Anaconda onto the cluster
- Spark 2 is distributed only as a parcel - one could, of course, install it in the container simply as jar files, but we rejected that approach
As a bonus, by solving the problems above we also got:
- easy configuration of the Hadoop and Spark clients - with the same parcel installed in Docker and in the cluster, the paths in the container and on the cluster are identical
- an easy way to keep the environments in the container and in the cluster identical - when the cluster is updated, the Docker image is simply rebuilt with the same parcels that were installed in the cluster
To install a parcel into Docker, we first install Cloudera Manager from RPM packages. The actual parcel installation is done by Java code that calls the Cloudera Manager API (the Java client can do things the Python client cannot, so we had to use Java and lose some uniformity).
assets/install-parcels/src/InstallParcels.java

    ParcelsResourceV5 parcels = clusters.getParcelsResource(clusterName);
    for (int i = 1; i < args.length; i += 2) {
        result = installParcel(api, parcels, args[i], args[i + 1], pause);
        if (!result) {
            System.exit(1);
        }
    }
Host resource limits
Host resources are managed with DockerSpawner, the component that runs each end user's Jupyter in a separate Docker container, together with cgroups, the Linux resource management mechanism. The Docker API allows setting a parent cgroup for a container, but stock DockerSpawner does not expose this, so we wrote simple code that lets you map AD entities to parent cgroups in the configuration.
assets/jupyterhub/dsai.py

    def set_extra_host_config(self):
        extra_host_config = {}
        if self.user.name in self.user_cgroup_parent:
            cgroup_parent = self.user_cgroup_parent[self.user.name]
        else:
            pw_name = pwd.getpwnam(self.user.name).pw_name
            group_found = False
            for g in grp.getgrall():
                if pw_name in g.gr_mem and g.gr_name in self.group_cgroup_parent:
                    cgroup_parent = self.group_cgroup_parent[g.gr_name]
                    group_found = True
                    break
            if not group_found:
                cgroup_parent = self.cgroup_parent
        extra_host_config['cgroup_parent'] = cgroup_parent
We also made a small modification so that Jupyter is launched from the same image that JupyterHub itself was launched from. This removes the need to maintain more than one image.
assets/jupyterhub/dsai.py

    current_container = None
    host_name = socket.gethostname()
    for container in self.client.containers():
        if container['Id'][0:12] == host_name:
            current_container = container
            break
    self.image = current_container['Image']
Whether to run Jupyter or JupyterHub in a given container is decided in the startup script based on environment variables:
assets/jupyterhub/dsai.py

    #!/bin/bash
    ANACONDA_PATH="/opt/cloudera/parcels/Anaconda/"
    DEFAULT_ENV=`cat ${ANACONDA_PATH}/envs/default`
    source activate ${DEFAULT_ENV}
    if [ -z "${JUPYTERHUB_CLIENT_ID}" ]; then
        while true; do
            jupyterhub -f /etc/jupyterhub/jupyterhub_config.py
        done
    else
        HOME=`su ${JUPYTERHUB_USER} -c 'echo ~'`
        cd ~
        su ${JUPYTERHUB_USER} -p -c "jupyterhub-singleuser --KernelSpecManager.ensure_native_kernel=False --ip=0.0.0.0"
    fi
The ability to start Jupyter Docker containers from the JupyterHub Docker container is achieved by mounting the Docker daemon's socket into the JupyterHub container:
start-hub.sh VOLUMES="-<b>v/var/run/docker.sock:/var/run/docker.sock:Z -v/var/lib/pbis/.lsassd:/var/lib/pbis/.lsassd:Z</b> -v/var/lib/pbis/.netlogond:/var/lib/pbis/.netlogond:Z -v/var/jupyterhub/home:/home/BANK/:Z -v/u00/:/u00/:Z -v/tmp:/host/tmp:Z -v${CONFIG_DIR}/krb5.conf:/etc/krb5.conf:ro -v${CONFIG_DIR}/hadoop/:/etc/hadoop/conf.cloudera.yarn/:ro -v${CONFIG_DIR}/spark/:/etc/spark/conf.cloudera.spark_on_yarn/:ro -v${CONFIG_DIR}/spark2/:/etc/spark2/conf.cloudera.spark2_on_yarn/:ro -v${CONFIG_DIR}/jupyterhub/:/etc/jupyterhub/:ro" docker run -p0.0.0.0:8000:8000/tcp ${VOLUMES} -e VOLUMES="${VOLUMES}" -e HOST_HOSTNAME=`hostname -f` dsai1.2
In the future we plan to move away from this approach in favor of, for example, ssh.
Using DockerSpawner together with Spark raises another problem: the Spark driver opens random ports to which the workers then connect from outside. The range these ports are picked from can be controlled via the Spark configuration, but the ranges must differ between users, because we cannot run Jupyter containers with identical published ports. To solve this we wrote code that derives port ranges from the user id in the JupyterHub database and starts the Docker container and Spark with the matching configuration:
assets/jupyterhub/dsai.py

    def set_extra_create_kwargs(self):
        user_spark_driver_port, user_spark_blockmanager_port, user_spark_ui_port, user_spark_max_retries = self.get_spark_ports()
        if user_spark_driver_port == 0 or user_spark_blockmanager_port == 0 or user_spark_ui_port == 0 or user_spark_max_retries == 0:
            return
        ports = {}
        for p in range(user_spark_driver_port, user_spark_driver_port + user_spark_max_retries):
            ports['%d/tcp' % p] = None
        for p in range(user_spark_blockmanager_port, user_spark_blockmanager_port + user_spark_max_retries):
            ports['%d/tcp' % p] = None
        for p in range(user_spark_ui_port, user_spark_ui_port + user_spark_max_retries):
            ports['%d/tcp' % p] = None
        self.extra_create_kwargs = { 'ports' : ports }
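The get_spark_ports() helper used above is not shown here in full; the idea behind it can be sketched as deriving a disjoint block of ports from the user's id in the JupyterHub database. The base ports and block size below are illustrative assumptions, not the production values:

    # Sketch of the idea behind get_spark_ports(): give every user a disjoint
    # block of driver, block manager and UI ports derived from the user id.
    # Base ports and block size are illustrative assumptions.
    def get_spark_ports_sketch(user_id,
                               driver_base=30000,
                               blockmanager_base=31000,
                               ui_base=32000,
                               max_retries=16):
        offset = user_id * max_retries
        return (driver_base + offset,
                blockmanager_base + offset,
                ui_base + offset,
                max_retries)

    # Example: user id 3 gets driver ports 30048..30063, and so on.
    print(get_spark_ports_sketch(3))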
The drawback of this solution is that when the JupyterHub container is restarted, everything breaks because the database is lost. Therefore, when JupyterHub needs to be restarted - to pick up a configuration change, for example - we do not touch the container itself and restart only the JupyterHub process inside it:
restart-hub.sh

    #!/bin/bash
    docker ps | fgrep 'dsai1.2' | fgrep -v 'jupyter-' | awk '{ print $1; }' | while read ID; do
        docker exec $ID /bin/bash -c "kill \$( cat /root/jupyterhub.pid )";
    done
The cgroups themselves are created with standard Linux tools; the mapping between AD entities and cgroups in the configuration looks like this:
config-example/jupyterhub/jupyterhub_config.py

    c.DSAISpawner.user_cgroup_parent = {
        'bank\\user1' : '/jupyter-cgroup-1', # user 1
        'bank\\user2' : '/jupyter-cgroup-1', # user 2
        'bank\\user3' : '/jupyter-cgroup-2', # user 3
    }
    c.DSAISpawner.cgroup_parent = '/jupyter-cgroup-3'
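To illustrate what "standard Linux tools" means here: on a cgroup v1 host such as Oracle Linux / RHEL 7, a parent cgroup with a memory cap can be prepared roughly as below. The path and limit are illustrative; in practice the same can be done with mkdir/echo or cgcreate:

    # Sketch: create a parent cgroup on the host (cgroup v1 layout) and cap the
    # total memory available to all containers placed under it. Run as root.
    import os

    cgroup = '/sys/fs/cgroup/memory/jupyter-cgroup-1'
    os.makedirs(cgroup, exist_ok=True)
    with open(os.path.join(cgroup, 'memory.limit_in_bytes'), 'w') as f:
        f.write(str(8 * 1024 ** 3))  # 8 GiB for everything under this parent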
Code in Git
Our solution is publicly available on GitHub: https://github.com/DS-AI/dsai/ (DSAI - Data Science and Artificial Intelligence). The code is split into directories with sequence numbers - the code in each subsequent directory may use artifacts from the previous ones. The result of running the code in the last directory is the Docker image.
Each directory contains the following files:
- assets.sh - creates the artifacts needed for the build (downloads them from the Internet or copies them from the directories of previous steps)
- build.sh - runs the build
- clean.sh - removes the build artifacts
To completely rebuild the Docker image, run clean.sh, assets.sh and build.sh in each directory, in order of the directory sequence numbers.
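A full rebuild can therefore be scripted as a simple loop over the numbered directories; a minimal sketch:

    # Sketch: rebuild everything by walking the numbered directories in order
    # and running clean.sh, assets.sh and build.sh in each of them.
    import os
    import subprocess

    dirs = sorted(d for d in os.listdir('.') if os.path.isdir(d) and d[:2].isdigit())
    for d in dirs:
        for script in ('clean.sh', 'assets.sh', 'build.sh'):
            subprocess.check_call(['bash', script], cwd=d)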
For the build we use a Linux machine with RedHat 7.4 and Docker 17.05.0-ce. The machine has 8 cores, 32 GB of RAM and 250 GB of disk space. We strongly advise against building on a host with less RAM or disk.
For reference, here is what the directory names mean:
- 01-spark-patched - Spark 1.6.1 RPMs with the SPARK-4563 and SPARK-19019 patches applied
- 02-validator - the parcel validator
- 03-anaconda-dsai-parcel-1.0 - the Anaconda parcel with the required Python versions (2, 3.5 and 3.6)
- 04-cloudera-manager-api - the Cloudera Manager API libraries
- 05-dsai1.2-offline - the final image
Alas, the build can fail for reasons we have not fixed (for example, tar crashing while packing the parcel). In that case it is usually enough to restart the build, although even that does not always help (for example, the Spark build depends on external Cloudera resources that may become unavailable, etc.).
Another drawback is that the parcel build is not reproducible: since the libraries are constantly updated, repeating the build may produce a result different from the previous one.
Grand finale
Users are now successfully working with the tools; their number has grown past several dozen and keeps growing. Going forward, we plan to try JupyterLab and to think about adding GPUs to the cluster, since the computing resources of the two fairly powerful application servers are no longer enough.