Spark integration documentation

Hello everybody!
As @betatim asked me to, here is my (hopefully) complete documentation on how to run Spark on K8s on z2jh.

As some of you might know, Spark introduced native K8s support with version 2.3 and K8s support for PySpark/SparkR with version 2.4. This documentation is about the integration on Pangeo, which I use for its 'native' Dask support. For vanilla z2jh this probably differs, but the differences should be few. My hope was to integrate Spark as painlessly as possible into the Pangeo workflow, to empower our users to use it rather than having to configure it.

Disclaimer: I am pretty new to the whole Kubernetes/cloud/container topic. I tried to follow best practices as well as possible, but there may be some newbie mistakes. As a starting point I used this documentation:

  1. As a first step, I added this configuration to my K8s cluster:
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "edit", "create", "delete"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: spark
  namespace: default
subjects:
- kind: User
  name: system:serviceaccount:pangeo:daskkubernetes
  apiGroup: ""
roleRef:
  kind: Role
  name: spark-role
  apiGroup: ""

This grants the spark-role to the service account daskkubernetes, which is used in Pangeo to create Dask workers and ensures that the pods are spawned in an autoscaling worker pool with preemptible instances. As far as I understood it, the namespace for the Spark executor pods has to be default.
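
A minimal sketch of applying it, assuming the manifest above is saved as spark-rbac.yaml (the file name is just an example):
kubectl apply -f spark-rbac.yaml
# check that the role and the binding exist in the default namespace
kubectl get role spark-role -n default
kubectl get rolebinding spark -n default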

  2. The next step is to create an image for the Spark executor pods. Spark 2.4 ships with a script that creates these images. To make sure that the versions are 100% compatible, I decided to create them from the environment within my Spark user image. First, run a container with Spark installed. I used the pyspark notebook from the docker-stacks.
# Mounting /var/run/docker.sock is important: it exposes the host's docker daemon inside the container
docker run -i --rm -e GRANT_SUDO=yes \
  -v /var/run/docker.sock:/var/run/docker.sock \
  jupyter/pyspark-notebook:5b2160dfd919   # tag with Spark 2.4

Exposing the docker daemon is important so that docker can be used inside the container (it is needed by the Spark script).

  3. Enter the container:
docker exec -it -u root <CONTAINER-ID> bash
  4. And install docker in the container:
sudo apt update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
sudo apt update
sudo apt install docker-ce
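To make sure the docker client inside the container actually talks to the host daemon through the mounted socket, an optional quick check:
docker version   # should report both client and server (the host daemon)
docker ps        # lists the host's running containers, including this one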
  5. Then we can create the images with:
cd $SPARK_HOME
./bin/docker-image-tool.sh -r <repo> -t my-tag build 
./bin/docker-image-tool.sh -r <repo> -t my-tag push
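If the build succeeded, the three resulting images (spark, spark-py, spark-r) should show up locally, with <repo> being the registry prefix used above:
docker images | grep <repo>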

My current (working) images are available at

idalab/spark
idalab/spark-r
idalab/spark-py

if you just want to try it out yourself without doing all of this. In that case, make sure to use the matching pyspark notebook image from above.
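
For example, to pull the Python executor image that the configuration below refers to:
docker pull idalab/spark-py:spark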

  6. Afterwards you can start up a user pod with the pyspark image and use the following configuration to get Spark running:
from pyspark import SparkConf, SparkContext
import os
import socket

conf = SparkConf()

# set spark to k8s mode and provide master IP
conf.setMaster('k8s://https://kubernetes.default.svc')
# configure executor image
conf.set('spark.kubernetes.container.image', 'idalab/spark-py:spark')
# Spark on K8s works ONLY in client mode (driver runs on client)
conf.set('spark.submit.deployMode', 'client')
# set # of executors
conf.set('spark.executor.instances', '2')
# set name
conf.setAppName('pyspark-shell')
# set IP of driver. This is always the user pod
conf.set('spark.driver.host', socket.gethostbyname(socket.gethostname()))
# settings to make the Spark UI work behind nbserverproxy (it did not work as intended otherwise)
conf.set('spark.ui.proxyBase', os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040')
os.environ['SPARK_PUBLIC_DNS'] = '<JHUB_ADDRESS>' + os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040/jobs/'  # replace <JHUB_ADDRESS> with your hub address, e.g. hub.idalab.de
# set python versions explicitly. otherwise the executors will use python 2.7
os.environ['PYSPARK_PYTHON'] = 'python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = 'python3'


sc = SparkContext(conf=conf)
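
If everything is set up correctly, a small job like the following (a minimal sketch; any toy computation will do) should be executed on executor pods spawned in the cluster:
# simple sanity check: the work below runs on the Kubernetes executors
rdd = sc.parallelize(range(1000), numSlices=4)
print(rdd.map(lambda x: x * x).sum())
# while the job runs, the executor pods are visible via `kubectl get pods -n default`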

Some comments on the configuration:

  1. Right now this happens in the user notebook, which is pretty messy. easel proposed (see the link to the original issue at the top) to put this into the hub configuration like this:
    extraEnv:
...
      SPARK_OPTS: >-
        --deploy-mode=client
        --master=k8s://https://kubernetes.default.svc
        --conf spark.driver.host=$(hostname -i)
        --conf spark.kubernetes.container.image=idalab/spark-py:spark
        --conf spark.ui.proxyBase=${JUPYTERHUB_SERVICE_PREFIX}proxy/4040
        --conf spark.executor.instances=2
        --driver-java-options=-Xms1024M # already set in the jupyter image; repeated here so it is not overwritten
        --driver-java-options=-Xmx4096M # already set in the jupyter image
        --driver-java-options=-Dlog4j.logLevel=info # already set in the jupyter image

      PYSPARK_PYTHON: python3
      PYSPARK_DRIVER_PYTHON: python3
      SPARK_PUBLIC_DNS: hub.idalab.de${JUPYTERHUB_SERVICE_PREFIX}proxy/4040/jobs/

Sadly, this does not work for me: the shell expressions are not evaluated but end up as literal strings, like

'--deploy-mode=client --master=k8s://https://kubernetes.default.svc --conf spark.driver.host=$(hostname -i) --conf spark.kubernetes.container.image=idalab/spark-py:spark --conf spark.ui.proxyBase=${JUPYTERHUB_SERVICE_PREFIX}proxy/4040

(Presumably this is because Kubernetes sets environment variables verbatim; it does not run a shell, so substitutions like $(hostname -i) and ${JUPYTERHUB_SERVICE_PREFIX} are passed through literally.)

  2. Setting spark.ui.proxyBase and SPARK_PUBLIC_DNS was necessary because Spark and nbserverproxy did not play well together. As far as I know, @ryan has already addressed this problem, but I do not understand the details well enough to say how 'good' the fix is.
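
With these settings the Spark UI should be reachable through the notebook server's proxy; a small sketch that prints the proxied URL, with <JHUB_ADDRESS> again standing for your hub address:
import os
# the Spark UI on port 4040 is proxied under the user's JupyterHub service prefix
print('<JHUB_ADDRESS>' + os.environ['JUPYTERHUB_SERVICE_PREFIX'] + 'proxy/4040/jobs/')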

Hopefully this was everything. Please feel free to ask if something is unclear. Thanks everybody for your help.

Cheers,
Hagen

P.S.: Limiting newbies to only use 2 hyperlinks is not so helpful for documenting stuff with references :wink:


Links that did not fit in the post above:

  1. How to install docker in a container

  2. nbserverproxy issue about getting the UI to work

Great documentation.
I see that you don’t specify spark.driver.port, so how come it works?
I chose a default port, but to get it to work I have to manually create a headless service for each user that spawns a notebook. I'm trying to find a way around it…