Hello everybody!
As @betatim asked me to do here, this is my (hopefully) complete documentation on how to run Spark on K8s with z2jh.
As some of you might know, Spark introduced native K8s support with version 2.3 and K8s support for PySpark/SparkR with version 2.4. This documentation covers the integration on pangeo, as I use it for ‘native’ Dask support. For vanilla z2jh this probably differs a bit, but the differences should be minor. My hope was to integrate Spark as painlessly as possible into the pangeo workflow, so that our users spend their time using it rather than configuring it.
Disclaimer: I am pretty new to the whole Kubernetes/cloud/container topic. I tried to follow best practices as well as possible, but there may be some newbie mistakes. As a starting point I used this documentation:
- As a first step I added this configuration to my K8s cluster:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "edit", "create", "delete"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark
  namespace: default
subjects:
- kind: User
  name: system:serviceaccount:pangeo:daskkubernetes
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
This adds the spark-role to the service account daskkubernetes, which is used in pangeo to create Dask workers, and makes sure that the pods are spawned in an autoscaling worker pool with preemptible instances. As far as I understood it, the namespace for the Spark executor pods has to be default.
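A quick way to sanity-check the role binding from inside a user pod is to list pods with the Kubernetes Python client (just a sketch; it assumes the kubernetes package is installed in the user image):
from kubernetes import client, config

# use the pod's service account token to authenticate against the API server
config.load_incluster_config()
v1 = client.CoreV1Api()
# if the role binding works this succeeds; without it the API returns a 403
pods = v1.list_namespaced_pod(namespace='default')
print([pod.metadata.name for pod in pods.items])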
- The next thing is to create an image for the Spark executor pods. Spark 2.4 ships with a script that creates these images. To make sure that the versions are 100% compatible, I decided to build them from the environment inside my Spark user image. First, run a container with Spark installed; I used the pyspark notebook from the docker stacks.
# mounting the host's docker socket is important to expose the host's docker daemon
docker run -i --rm -e GRANT_SUDO=yes \
  -v /var/run/docker.sock:/var/run/docker.sock \
  jupyter/pyspark-notebook:5b2160dfd919  # tag with Spark 2.4
Exposing the host's docker daemon is important so that docker can be used inside the container (the Spark script relies on it).
- Enter the container
docker exec -it -u root <CONTAINER-ID> bash
- And install docker in the container
sudo apt update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
sudo apt update
sudo apt install docker-ce
- Then we can create the images with:
cd $SPARK_HOME
./bin/docker-image-tool.sh -r <repo> -t my-tag build
./bin/docker-image-tool.sh -r <repo> -t my-tag push
My current (working) images are available at idalab/spark, idalab/spark-r and idalab/spark-py if you just want to try it out without doing all of this yourself. In that case make sure to use the matching pyspark notebook image from above.
- Afterwards you can start up a user pod with the pyspark image and use the following configuration to get Spark running:
from pyspark import SparkConf, SparkContext
import os
import socket
conf = SparkConf()
# set spark to k8s mode and point it at the in-cluster API server
conf.setMaster('k8s://https://kubernetes.default.svc')
# configure the executor image
conf.set('spark.kubernetes.container.image', 'idalab/spark-py:spark')
# Spark on K8s works ONLY in client mode (the driver runs on the client)
conf.set('spark.submit.deployMode', 'client')
# set the number of executors
conf.set('spark.executor.instances', '2')
# set the application name
conf.setAppName('pyspark-shell')
# set the IP of the driver. This is always the user pod
conf.set('spark.driver.host', socket.gethostbyname(socket.gethostname()))
# settings for the Spark UI, which did not work as intended with nbserverproxy
conf.set('spark.ui.proxyBase', os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040')
os.environ['SPARK_PUBLIC_DNS'] = '<JHUB_ADDRESS>' + os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040/jobs/'  # replace <JHUB_ADDRESS> with your hub's address
# set the python version explicitly. Otherwise the executors will use python 2.7
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'
sc = SparkContext(conf=conf)
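Once the context is created, a small job is a quick way to check that the executor pods really pick up work (nothing more than a smoke test):
# simple smoke test: the sum is computed on the executor pods started above
print(sc.parallelize(range(1000)).map(lambda x: x * x).sum())
# should report the parallelism provided by the two executors
print(sc.defaultParallelism)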
Some comments on the configuration:
- Right now this happens in the user notebook, which is pretty messy. easel proposed (see the link to the original issue at the top) to put it into the hub configuration like this:
extraEnv:
  ...
  # the --driver-java-options values below are preset in the jupyter image and repeated here so they are not overwritten
  SPARK_OPTS: >-
    --deploy-mode=client
    --master=k8s://https://kubernetes.default.svc
    --conf spark.driver.host=$(hostname -i)
    --conf spark.kubernetes.container.image=idalab/spark-py:spark
    --conf spark.ui.proxyBase=${JUPYTERHUB_SERVICE_PREFIX}proxy/4040
    --conf spark.executor.instances=2
    --driver-java-options=-Xms1024M
    --driver-java-options=-Xmx4096M
    --driver-java-options=-Dlog4j.logLevel=info
  PYSPARK_PYTHON: python3
  PYSPARK_DRIVER_PYTHON: python3
  SPARK_PUBLIC_DNS: hub.idalab.de${JUPYTERHUB_SERVICE_PREFIX}proxy/4040/jobs/
Sadly this does not work for me, as the shell expressions are not evaluated but end up as a literal string like
'--deploy-mode=client --master=k8s://https://kubernetes.default.svc --conf spark.driver.host=$(hostname -i) --conf spark.kubernetes.container.image=idalab/spark-py:spark --conf spark.ui.proxyBase=${JUPYTERHUB_SERVICE_PREFIX}proxy/4040 ...'
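Until that is solved, one workaround is to expand the placeholders from Python in the notebook (just a sketch of the idea, not a proper fix):
import os
import socket

raw = os.environ.get('SPARK_OPTS', '')
# ${JUPYTERHUB_SERVICE_PREFIX} can be expanded with expandvars,
# $(hostname -i) has to be emulated by hand
expanded = os.path.expandvars(raw).replace(
    '$(hostname -i)', socket.gethostbyname(socket.gethostname()))
print(expanded)
From there the values can be applied to the SparkConf exactly like in the notebook configuration above.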
- Setting spark.ui.proxyBase and SPARK_PUBLIC_DNS was necessary because Spark and nbserverproxy did not like each other. As far as I know @ryan has already addressed this problem, but I do not understand the details well enough to say how ‘good’ the fix is.
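For completeness, with these settings the Spark UI of a running application is reachable through the notebook server's proxy; the path can be printed from the notebook like this:
import os
# the driver's Spark UI (port 4040) as exposed through nbserverproxy
print(os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040/jobs/')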
Hopefully this was everything. Please feel free to ask if something is unclear. Thanks to everybody for your help.
Cheers,
Hagen
P.S.: Limiting newbies to only 2 hyperlinks is not so helpful when documenting stuff with references.