Jupyter Notebook connecting to existing Spark/Yarn Cluster

I want to deploy JupyterHub on a Kubernetes cluster using the following Jupyter notebook image.
I have been trying to use the recipe here to build a Docker image that can use our Spark/YARN cluster. I changed the Dockerfile slightly, so this is my final Dockerfile:

FROM jupyter/all-spark-notebook

USER $NB_USER
# Set env vars for pydoop
ENV HADOOP_HOME=/usr/local/hadoop-2.7.3
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_CONF_HOME=/usr/local/hadoop-2.7.3/etc/hadoop
ENV HADOOP_CONF_DIR=/usr/local/hadoop-2.7.3/etc/hadoop
ENV HADOOP_PREFIX=/usr/local/hadoop-2.7.3

USER root
# Add proper open-jdk-8 not just the jre, needed for pydoop
RUN echo "deb http://archive.ubuntu.com/ubuntu trusty-backports main restricted universe multiverse" >> /etc/apt/sources.list.d/trusty-backports.list && \
apt-get -y update && \
apt-get install --no-install-recommends -t trusty-backports -y openjdk-8-jdk && \
rm /etc/apt/sources.list.d/trusty-backports.list && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz && \
tar -xvf hadoop-2.7.3.tar.gz -C /usr/local && \
chown -R $NB_USER:users /usr/local/hadoop-2.7.3 && \
rm -f hadoop-2.7.3.tar.gz && \
apt-get update && \
apt-get install --no-install-recommends -y python-pip python3-pip  build-essential python-dev python3-dev libsasl2-dev python-setuptools python-wheel python3-setuptools python3-wheel && \
apt-get install -y vim && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*  && \
rm -f /usr/local/hadoop-2.7.3/etc/hadoop/*

# Remove the example hadoop configs and replace
# with those for our cluster.
#rm -f /usr/local/hadoop-2.7.3/etc/hadoop/*
# I will mount it as a volume
# Download this from ambari / cloudera manager and copy here
COPY hadoop-conf/ /usr/local/hadoop-2.7.3/etc/hadoop/

# Spark-Submit doesn't work unless I set the following
RUN echo "spark.driver.extraJavaOptions -Dhdp.version=2.5.3.0-37" >> /usr/local/spark/conf/spark-defaults.conf  && \
echo "spark.yarn.am.extraJavaOptions -Dhdp.version=2.5.3.0-37" >> /usr/local/spark/conf/spark-defaults.conf && \
echo "spark.master=yarn" >>  /usr/local/spark/conf/spark-defaults.conf && \
echo "spark.hadoop.yarn.timeline-service.enabled=false" >> /usr/local/spark/conf/spark-defaults.conf && \
chown -R $NB_USER:users /usr/local/spark/conf/spark-defaults.conf && \
chown $NB_USER:users /usr/local/spark/conf/ && \
mkdir -p /etc/hadoop/conf/ && \
mkdir -p /usr/local/hadoop-2.7.3/etc/hadoop/conf/ && \
chown $NB_USER:users /etc/hadoop/conf/ && \
chown $NB_USER:users /usr/local/hadoop-2.7.3/etc/hadoop/conf/

USER $NB_USER
#USER root

# Install useful jupyter extensions and python libraries like :
# - Dashboards
# - PyDoop
# - PyHive
RUN echo $HADOOP_HOME && \
echo $HADOOP_CONF_DIR && \
pip install pyhive thrift sasl thrift_sasl && \
pip install --pre pydoop

USER root
# Ensure we overwrite the kernel config so that toree connects to cluster
RUN jupyter toree install --sys-prefix --spark_opts="--master yarn --deploy-mode client --driver-memory 512m  --executor-memory 512m  --executor-cores 1 --driver-java-options -Dhdp.version=2.5.3.0-37 --conf spark.hadoop.yarn.timeline-service.enabled=false"
#USER $NB_USER


RUN chown jovyan -R /home/jovyan/.local


USER $NB_USER

Using this recipe, I copy the Hadoop conf files into the HADOOP_CONF_DIR path, and I also mount the Hadoop conf dir when deploying JupyterHub on the Kubernetes cluster.
First of all, I need to set the env var SPARK_USER to the user who is logged in to JupyterHub, but I don't know how to set this in my YAML file. Otherwise it falls back to the user jovyan, which doesn't have permission. And if the user sets this env var after logging in, it is lost after logging out.
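The closest I have come is setting it from the spawner side rather than baking it into the image. A minimal sketch of what I mean, assuming KubeSpawner is used and that Spawner.environment values may be callables evaluated per user at spawn time (I have not verified this against our setup yet):

# jupyterhub_config.py (e.g. via hub.extraConfig in the Helm chart values)
# Assumption: environment values can be callables that receive the spawner,
# so SPARK_USER can be derived from the logged-in hub user at spawn time.
c.KubeSpawner.environment = {
    "SPARK_USER": lambda spawner: spawner.user.name,
}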

I tried setting this env var to my own user (for testing), and after a long time processing I got this error:

An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. :
org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
...

Unfortunately, I can’t retrieve the whole error right now, but I will update it later.

I was hoping there were more helpful documents on this.

Thanks,

Looks like you are trying to use JupyterHub and enable “remote kernels” in your Kubernetes environment to access Spark. In the past, I have written a blog post that accomplishes most of what you want using JupyterHub and Jupyter Enterprise Gateway. You can probably follow those steps and customize the Enterprise Gateway image to have the necessary Spark/Hadoop configuration so it can perform the spark-submit of the kernels to your own Spark/YARN cluster.

Going back to your specific issue, it seems that you are getting a timeout while creating the context, and this might be due to the Docker image not being able to connect to YARN during context creation. Please see issue #369, which originated the recipe, for more configuration details when using JupyterHub.

Please let us know if this helps, and if you have any other questions.


Also note that, in this case, your kernel will be running in Spark client mode, where your context will be initialized locally and will use resources from your pod. So make sure you also have enough resources for the kernel, or for multiple kernels.
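If you are using KubeSpawner, the pod resources can be raised along these lines; this is only a sketch, and the numbers are placeholders you would size for your driver and kernels:

# jupyterhub_config.py -- resources for each single-user pod (placeholder values)
c.KubeSpawner.mem_guarantee = "1G"   # memory requested for the pod
c.KubeSpawner.mem_limit = "4G"       # hard memory limit for the pod
c.KubeSpawner.cpu_guarantee = 0.5    # CPU requested
c.KubeSpawner.cpu_limit = 2          # CPU limit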

@lresende Thank you for your quick response. I hadn't thought of Jupyter Enterprise Gateway yet, so I need to read more about it.
About the timeout error: I set Docker to use the host network interface, so I was expecting there wouldn't be any issue. Do you think I need to set more configs?

About your last comment, I'm a little confused. Doesn't client mode mean the tasks will be sent to the YARN ResourceManager (outside the pod) and distributed among the worker nodes? I don't want to use the pod's resources for the Spark tasks.

I was mostly going through the issue and noticed some other requirements. Having said that, I have not actually validated these.

Jupyter Notebook today only supports running Spark in YARN client mode.

In client mode, the Spark driver (which is responsible for task scheduling/management and for managing data aggregations/shuffles) runs locally in your pod, while the workers are created on the Spark cluster.
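As a rough PySpark illustration (not specific to your image; the memory values are placeholders, and depending on your Spark version the driver memory may need to be set in spark-defaults.conf before the JVM starts, as you already do), the driver below lives inside your pod while YARN provides the executors:

from pyspark.sql import SparkSession

# Client mode: the driver starts here, inside the notebook pod;
# only the executors are launched on the YARN cluster.
spark = (SparkSession.builder
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .config("spark.driver.memory", "512m")     # consumed from the pod
         .config("spark.executor.memory", "512m")   # consumed on the cluster
         .getOrCreate())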

If you want all the Spark processing to be executed in the cluster, then you want to use Spark on YARN cluster mode, in which case the kernels will be remote and you must use Jupyter Enterprise Gateway to enable remote kernel lifecycle management.

This page can give you more details about the Spark Driver.

This Stack Overflow post also explains a little more about client versus cluster mode.

@lresende Sorry, I may have misspoken in the last comment. I actually want client mode. I'm not sure yet how much resource it will need, but I will figure that out.
But for the timeout error, is it possible it's being blocked by the server because of the Docker container? Also, do you know how I can set random ports for each single-user notebook in Kubernetes?

You could probably troubleshoot that by testing network connectivity to the server from inside the spawned Jupyter image.
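Even a simple TCP check from a notebook cell will tell you whether the pod can reach the ResourceManager at all; the host and port below are just examples, so use the values from your own yarn-site.xml (yarn.resourcemanager.address, default port 8032):

import socket

# Hypothetical ResourceManager address -- replace with the value from your yarn-site.xml.
host, port = "resourcemanager.example.com", 8032

try:
    with socket.create_connection((host, port), timeout=5):
        print("Reachable: {}:{}".format(host, port))
except OSError as err:
    print("Cannot reach {}:{}: {}".format(host, port, err))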

All the infrastructure related to spawning and proxying a given user to the proper Jupyter server should be handled by JupyterHub and should be transparent to users (and to you). Are you seeing specific issues in this area?

@lresende Thank you. I think I found the problem with creating the Spark context. It is probably due to memory, so I asked our admin to increase the resource memory and add some extra configuration. If that fixes the problem, I will post it here.

As far as I know, KubeSpawner's default port is 8888 (based on its GitHub repo), and I set the single-user container to use the host network. So if multiple users' pods are deployed on one host, I will have a problem because all of them will try to use port 8888. That's why I want it to choose a random port. Do you know anything about this?

Thanks,