Spark integration documentation

Hagen · December 5, 2018, 2:20am

Hello everybody!
As @betatim asked me here, this is my (hopefully) complete documentation on how to run Spark on K8s on z2jh.

As some of you might know Spark introduced native K8s support with version 2.3 and K8s support for PySpark/R-Spark with version 2.4. This documentation is about the integration on pangeo as I use it for ‘native’ Dask support. For vanilla z2jhk8s this probably differs but the differences should be pretty few ones. My hope was to integrate Spark as painless as possible into the pangeo workflow to empower our users to use it rather than to configure it.

Disclaimer: I am pretty new to the whole kubernetes/cloud/container topic. I tried to use best practices as good as possible but maybe there are some newbie mistakes. As starting point I used this documentation.:

As first step I added this configuration to my K8s cluster:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
 namespace: default
 name: spark-role
rules:
- apiGroups: [“”]
 resources: [“pods”]
 verbs: [“get”, “watch”, “list”, “edit”, “create”, “delete”]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
 name: spark
 namespace: default
subjects:
- kind: User
 name: system:serviceaccount:pangeo:daskkubernetes
 apiGroup: “”
roleRef:
 kind: Role
 name: spark-role
 apiGroup: “”

This adds the spark-role to the service account daskkubernetes which is used in pangeo to create dask workers and makes sure that the pods are spawned in an autoscaling worker pool with preemptible instances. As far as I understood it the namespace for the spark executor pods has to be default.

The next thing is to create an image for the spark executor pods. Spark 2.4 is shipped with a script that creates these images. To make sure that the versions are 100% compatible I decided to create them from the environment within my spark user image. First you run a container with Spark installed. I used the pyspark notebook from the docker stacks.

docker run -i --rm -e GRANT_SUDO=yes \
-v /var/run/docker.sock:/var/run/docker.sock \ # This is important to expose the hosts docker daemon
jupyter/pyspark-notebook:5b2160dfd919 # Tag with spark 2.4

The exposure of the docker daemon is important to use docker inside the container (It’s used by the spark script)

Enter the container

docker exec -it -u root <CONTAINER-ID> bash

And install docker in the container

sudo apt update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
sudo apt update
sudo apt install docker-ce

Then we can create the images with:

cd $SPARK_HOME
./bin/docker-image-tool.sh -r <repo> -t my-tag build 
./bin/docker-image-tool.sh -r <repo> -t my-tag push

My current (working) images are available at

idalab/spark
idalab/spark-r
idalab/spark-py

if you just want to try it out yourself without doing all of this. In this case make sure to use the correct pyspark image from above.

Afterwards you can start up a user pod with the pyspark image and use the following configuration to get Spark running:

from pyspark import *
import os
import socket

conf = SparkConf()

# set spark to k8s mode and provide master IP
conf.setMaster('k8s://https://kubernetes.default.svc')
# configure executor image
conf.set('spark.kubernetes.container.image', 'idalab/spark-py:spark')
# Spark on K8s works ONLY in client mode (driver runs on client)
conf.set('spark.submit.deployMode', 'client')
# set # of executors
conf.set('spark.executor.instances', '2')
# set name
conf.setAppName('pyspark-shell')
# set IP of driver. This is always the user pod
conf.set('spark.driver.host', socket.gethostbyname(socket.gethostname()))
# settings for the spark ui to work as it did not work as intended with nbserverproxy 
conf.set('spark.ui.proxyBase', os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040')
os.environ['SPARK_PUBLIC_DNS'] = <JHUB_ADDRESS> + os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040/jobs/'
# set python versions explicitly. otherwise the executors will use python 2.7
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'


sc = SparkContext(conf=conf)

Some comments on the configuration:

Right now this happens in the user notebook which is pretty messy. easel proposed (see link to original issue at the top) to put this in the hub configuration like this:

    extraEnv:
...
      SPARK_OPTS: >-
        --deploy-mode=client
        --master=k8s://https://kubernetes.default.svc
        --conf spark.driver.host=$(hostname -i)
        --conf spark.kubernetes.container.image=idalab/spark-py:spark
        --conf spark.ui.proxyBase=${JUPYTERHUB_SERVICE_PREFIX}proxy/4040
        --conf spark.executor.instances=2
        --driver-java-options=-Xms1024M # pre set in jupyter image. reset to not overwrite it
        --driver-java-options=-Xmx4096M # pre set in jupyter image.
        --driver-java-options=-Dlog4j.logLevel=info # pre set in jupyter image.

      PYSPARK_PYTHON: python3
      PYSPARK_DRIVER_PYTHON: python3
      SPARK_PUBLIC_DNS: hub.idalab.de${JUPYTERHUB_SERVICE_PREFIX}proxy/4040/jobs/

Sadly this does not work for me as the shell commands are not correctly evaluated but are set as strings like

'--deploy-mode=client --master=k8s://https://kubernetes.default.svc --conf spark.driver.host=$(hostname -i) --conf spark.kubernetes.container.image=idalab/spark-py:spark --conf spark.ui.proxyBase=${JUPYTERHUB_SERVICE_PREFIX}proxy/4040

Setting of the spark.ui.proxyBase and SPARK_PUBLIC_DNS was necessary as spark and the nbserverproxy did not like each other. As far as I know @ryan addressed this problem already but I do not understand the details good enough to say how ‘good’ the fix is.

Hopefully this was everything. Please feel free if something is unclear. Thank everybody for your help.

Cheers,
Hagen

P.S.: Limiting newbies to only use 2 hyperlinks is not so helpful for documenting stuff with references

Hagen · December 5, 2018, 2:24am

Links that fit not in the post above:

Ran · March 6, 2019, 12:24pm

Great documentation.
I see that you don’t specify spark.driver.port, so how come it works?
I chose a default port, but to get it work I have to manually create a headless service for each user that spawns a notebook. Im trying to find away around it…

Topic		Replies	Views
Executing PySpark code in a Jupyter Notebook using Z2JK where the user configures the Spark Session with the spark deployment mode set to "client" with the Spark executors running in their own dedicated Kubernetes cluster Zero to JupyterHub on Kubernetes jupyterlab , jupyterhub	1	837	June 16, 2022
Single user server as driver node in spark cluster k8s Zero to JupyterHub on Kubernetes jupyterlab , jupyterhub , how-to , help-wanted	1	922	April 13, 2022
Jupyter Enterprise Gateway with Spark-on-k8s Enterprise Gateway	3	1403	July 25, 2023
JHub + Spark on K8S / Workers cant connect to Drivers Zero to JupyterHub on Kubernetes	0	608	June 25, 2021
Jupyter Notebook connecting to existing Spark/Yarn Cluster General	7	15735	April 1, 2019

Spark integration documentation

Related topics