Pyspark library is missing from jupyter/pyspark-notebook when running with jupyterhub/zero-to-jupyterhub-k8s

The pyspark library is not available in the pyspark-notebook image when running on JupyterHub.

When running jupyter/pyspark-notebook locally, I can import pyspark as I would expect:

from pyspark import SparkConf, SparkContext

→ no errors

When I run jupyter/pyspark-notebook with my jupyterhub/zero-to-jupyterhub-k8s helm installation, the same code returns:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-ee46be618299> in <module>
----> 1 from pyspark import SparkConf, SparkContext

ModuleNotFoundError: No module named 'pyspark'

Running python -m pip list --format columns returns a lot of installed modules, but not the pyspark module.
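The same check can be made from inside the kernel by asking the import system where it would load a module from. This is only a small sketch for diagnosis; the helper name describe_module is made up for this example and is not part of any Jupyter or Spark API.

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file a top-level module would be loaded from, or None if it cannot be found.

    Note: namespace packages also report None as their origin.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # In the failing JupyterHub pod this is expected to print None for pyspark.
    print("pyspark found at:", describe_module("pyspark"))
    print("interpreter:", sys.executable)
    for entry in sys.path:
        print("  sys.path entry:", entry)
```

Comparing the printed sys.path entries between the local container and the JupyterHub pod shows whether the Spark directories are on the import path at all.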

Having read https://stackoverflow.com/a/31541051/5930295, I tried updating sys.path, because I found that pyspark is installed in the container under the following locations. Running:

import sys
sys.path.insert(0, '/usr/local/spark/python')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.10.9-src.zip')

from pyspark import SparkConf, SparkContext

→ No errors
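A slightly more general form of this workaround, sketched under the assumption that SPARK_HOME is set as in the environment listings in this thread (falling back to /usr/local/spark otherwise). The function names are hypothetical; the py4j zip is matched by pattern because its version changes between Spark releases.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Build the sys.path entries for the PySpark sources bundled under spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so match the zip by pattern.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return py4j_zips + [python_dir]


def add_spark_to_sys_path(spark_home=None):
    """Prepend the PySpark paths to sys.path and return the entries that were added."""
    # Assumption: fall back to the location used by jupyter/pyspark-notebook.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/usr/local/spark")
    added = []
    # Insert in reverse so the final order is: py4j zip(s) first, then the python dir.
    for path in reversed(spark_python_paths(spark_home)):
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

Calling add_spark_to_sys_path() in the first notebook cell, before from pyspark import ..., has the same effect as the two hard-coded sys.path.insert calls above. It remains a workaround: it does not explain why the paths are missing in the first place.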

Further information

I thought the issue must be with the image, so I originally opened a ticket with jupyter/docker-stacks (Issue #1255, "pyspark library is missing from jupyter/pyspark-notebook when running with jupyterhub/zero-to-jupyterhub-k8s").

Your personal set up

jupyterhub/zero-to-jupyterhub-k8s Helm Chart

jupyter/pyspark-notebook:9fe5186aba96 as default image

singleuser:
  defaultUrl: "/lab"
  image:
    name: jupyter/pyspark-notebook
    tag: 9fe5186aba96

Not Working Environment

env | sort outputs:

APACHE_AIRFLOW_PORT=tcp://10.43.46.18:8080
APACHE_AIRFLOW_PORT_8080_TCP=tcp://10.43.46.18:8080
APACHE_AIRFLOW_PORT_8080_TCP_ADDR=10.43.46.18
APACHE_AIRFLOW_PORT_8080_TCP_PORT=8080
APACHE_AIRFLOW_PORT_8080_TCP_PROTO=tcp
APACHE_AIRFLOW_POSTGRESQL_PORT=tcp://10.43.64.222:5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP=tcp://10.43.64.222:5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_ADDR=10.43.64.222
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_PORT=5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_PROTO=tcp
APACHE_AIRFLOW_POSTGRESQL_SERVICE_HOST=10.43.64.222
APACHE_AIRFLOW_POSTGRESQL_SERVICE_PORT=5432
APACHE_AIRFLOW_POSTGRESQL_SERVICE_PORT_TCP_POSTGRESQL=5432
APACHE_AIRFLOW_REDIS_MASTER_PORT=tcp://10.43.14.5:6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP=tcp://10.43.14.5:6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_ADDR=10.43.14.5
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_PORT=6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_PROTO=tcp
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_HOST=10.43.14.5
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_PORT=6379
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_PORT_TCP_REDIS=6379
APACHE_AIRFLOW_SERVICE_HOST=10.43.46.18
APACHE_AIRFLOW_SERVICE_PORT=8080
APACHE_AIRFLOW_SERVICE_PORT_HTTP=8080
APACHE_SPARK_MASTER_SVC_PORT=tcp://10.43.67.11:7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP=tcp://10.43.67.11:7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_ADDR=10.43.67.11
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_PORT=7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_PROTO=tcp
APACHE_SPARK_MASTER_SVC_PORT_80_TCP=tcp://10.43.67.11:80
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_ADDR=10.43.67.11
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_PORT=80
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_PROTO=tcp
APACHE_SPARK_MASTER_SVC_SERVICE_HOST=10.43.67.11
APACHE_SPARK_MASTER_SVC_SERVICE_PORT=7077
APACHE_SPARK_MASTER_SVC_SERVICE_PORT_CLUSTER=7077
APACHE_SPARK_MASTER_SVC_SERVICE_PORT_HTTP=80
APACHE_SPARK_VERSION=3.1.1
CLICOLOR=1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
GIT_PAGER=cat
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=jupyter-hersam
HUB_PORT=tcp://10.43.0.141:8081
HUB_PORT_8081_TCP=tcp://10.43.0.141:8081
HUB_PORT_8081_TCP_ADDR=10.43.0.141
HUB_PORT_8081_TCP_PORT=8081
HUB_PORT_8081_TCP_PROTO=tcp
HUB_SERVICE_HOST=10.43.0.141
HUB_SERVICE_PORT=8081
JPY_API_TOKEN=c5353ea5222b413c85a1e08306ebfbb3
JPY_PARENT_PID=7
JUPYTERHUB_ACTIVITY_URL=http://hub:8081/hub/api/users/hersam/activity
JUPYTERHUB_ADMIN_ACCESS=1
JUPYTERHUB_API_TOKEN=c5353ea5222b413c85a1e08306ebfbb3
JUPYTERHUB_API_URL=http://hub:8081/hub/api
JUPYTERHUB_BASE_URL=/
JUPYTERHUB_CLIENT_ID=jupyterhub-user-hersam
JUPYTERHUB_HOST=
JUPYTERHUB_OAUTH_CALLBACK_URL=/user/hersam/oauth_callback
JUPYTERHUB_SERVER_NAME=
JUPYTERHUB_SERVICE_PREFIX=/user/hersam/
JUPYTERHUB_USER=hersam
JUPYTER_IMAGE=jupyter/pyspark-notebook:latest
JUPYTER_IMAGE_SPEC=jupyter/pyspark-notebook:latest
KUBERNETES_PORT=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.43.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MEM_GUARANTEE=1073741824
MINIFORGE_VERSION=4.9.2-7
MPLBACKEND=module://ipykernel.pylab.backend_inline
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PAGER=cat
PATH=/opt/conda/bin:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
PROXY_API_PORT=tcp://10.43.170.224:8001
PROXY_API_PORT_8001_TCP=tcp://10.43.170.224:8001
PROXY_API_PORT_8001_TCP_ADDR=10.43.170.224
PROXY_API_PORT_8001_TCP_PORT=8001
PROXY_API_PORT_8001_TCP_PROTO=tcp
PROXY_API_SERVICE_HOST=10.43.170.224
PROXY_API_SERVICE_PORT=8001
PROXY_PUBLIC_PORT=tcp://10.43.159.248:80
PROXY_PUBLIC_PORT_80_TCP=tcp://10.43.159.248:80
PROXY_PUBLIC_PORT_80_TCP_ADDR=10.43.159.248
PROXY_PUBLIC_PORT_80_TCP_PORT=80
PROXY_PUBLIC_PORT_80_TCP_PROTO=tcp
PROXY_PUBLIC_SERVICE_HOST=10.43.159.248
PROXY_PUBLIC_SERVICE_PORT=80
PROXY_PUBLIC_SERVICE_PORT_HTTP=80
PWD=/home/jovyan
SHELL=/bin/bash
SPARK_HOME=/usr/local/spark
SPARK_NODE_PORT_PORT=tcp://10.43.21.88:7077
SPARK_NODE_PORT_PORT_7077_TCP=tcp://10.43.21.88:7077
SPARK_NODE_PORT_PORT_7077_TCP_ADDR=10.43.21.88
SPARK_NODE_PORT_PORT_7077_TCP_PORT=7077
SPARK_NODE_PORT_PORT_7077_TCP_PROTO=tcp
SPARK_NODE_PORT_PORT_80_TCP=tcp://10.43.21.88:80
SPARK_NODE_PORT_PORT_80_TCP_ADDR=10.43.21.88
SPARK_NODE_PORT_PORT_80_TCP_PORT=80
SPARK_NODE_PORT_PORT_80_TCP_PROTO=tcp
SPARK_NODE_PORT_SERVICE_HOST=10.43.21.88
SPARK_NODE_PORT_SERVICE_PORT=7077
SPARK_NODE_PORT_SERVICE_PORT_CLUSTER=7077
SPARK_NODE_PORT_SERVICE_PORT_HTTP=80
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm-color
XDG_CACHE_HOME=/home/jovyan/.cache/

import os
for k, v in sorted(os.environ.items()):
    print(f'{k}={v}')

outputs exactly the same list as env | sort above; in particular, PYTHONPATH is not set.

Working Environment (running container locally with docker)

env | sort outputs:

APACHE_SPARK_VERSION=3.1.1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=2a600aae602f
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MINIFORGE_VERSION=4.9.2-7
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
SHELL=/bin/bash
SPARK_HOME=/usr/local/spark
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm
XDG_CACHE_HOME=/home/jovyan/.cache/

import os
for k, v in sorted(os.environ.items()):
    print(f'{k}={v}')

outputs:

APACHE_SPARK_VERSION=3.1.1
CLICOLOR=1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
GIT_PAGER=cat
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=2a600aae602f
JPY_PARENT_PID=7
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MINIFORGE_VERSION=4.9.2-7
MPLBACKEND=module://ipykernel.pylab.backend_inline
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PAGER=cat
PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
PWD=/home/jovyan
PYSPARK_PYTHONPATH_SET=1
PYTHONPATH=/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/usr/local/spark/python:
SHELL=/bin/bash
SHLVL=0
SPARK_CONF_DIR=/usr/local/spark/conf
SPARK_HOME=/usr/local/spark
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm-color
XDG_CACHE_HOME=/home/jovyan/.cache/
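Spotting the difference between two such listings by eye is tedious. A minimal sketch of a helper that parses KEY=VALUE dumps and reports the variables present in only one of them is shown below; the function names are made up for this example, and the two sample dumps are abbreviated from the kernel listings above.

```python
def parse_env(text):
    """Parse 'KEY=VALUE' lines (as printed by `env | sort`) into a dict."""
    env = {}
    for line in text.splitlines():
        # Split on the first '=' only, since values such as SPARK_OPTS contain '='.
        key, sep, value = line.partition("=")
        if sep:  # skip blank lines and lines without '='
            env[key.strip()] = value
    return env


def only_in_first(first, second):
    """Return the sorted variable names set in `first` but missing from `second`."""
    return sorted(set(first) - set(second))


# Abbreviated samples taken from the kernel environments listed above.
working = parse_env("""\
PYSPARK_PYTHONPATH_SET=1
PYTHONPATH=/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/usr/local/spark/python:
SPARK_CONF_DIR=/usr/local/spark/conf
SPARK_HOME=/usr/local/spark
""")
not_working = parse_env("""\
JUPYTERHUB_USER=hersam
SPARK_HOME=/usr/local/spark
""")

print(only_in_first(working, not_working))
# → ['PYSPARK_PYTHONPATH_SET', 'PYTHONPATH', 'SPARK_CONF_DIR']
```

Run against the full listings, the local kernel additionally reports SHLVL; the variables relevant to the import error are PYTHONPATH and PYSPARK_PYTHONPATH_SET, which exist only in the working environment.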

Initial findings:

The pyspark image adds a /usr/local/bin/before-notebook.d hook which sets PYTHONPATH.

The hooks in that directory are run by start.sh.

But by default Z2JH calls jupyterhub-singleuser directly, so start.sh and the hook never run and PYTHONPATH stays unset.

Awesome, thanks!

Using this config.yaml it works:

singleuser:
  defaultUrl: "/lab"
  image:
    name: jupyter/pyspark-notebook
    tag: latest
  cmd: ["/usr/local/bin/start.sh", "jupyterhub-singleuser"]

Great that you figured it out. It looks like setting an empty cmd, i.e. falling back to the default command from the Docker image, also works:

singleuser:
  cmd:

Sorry to revive this old topic, but I am facing the exact same problem on mybinder.org. When I build and run my Docker container based on pyspark-notebook locally, everything is fine because PYTHONPATH points to the correct location inside the container. When running on mybinder.org, however, PYTHONPATH is empty and pyspark is therefore not found. Manually tweaking sys.path works. Unlike in the solution proposed in this topic, though, I do not control the configuration. Any ideas?

Since you’re using a custom Docker image, you could try overriding your ENTRYPOINT.
I’ve opened an issue to see if the behaviour can be made more intuitive.