pyspark library is not available in the pyspark-notebook when running on jupyterhub.
When running jupyter/pyspark-notebook locally, I can import pyspark as I would expect:
from pyspark import SparkConf, SparkContext
→ no errors
When I run jupyter/pyspark-notebook with my jupyterhub/zero-to-jupyterhub-k8s helm installation, the same code returns:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-ee46be618299> in <module>
----> 1 from pyspark import SparkConf, SparkContext
ModuleNotFoundError: No module named 'pyspark'
Running python -m pip list --format columns returns a long list of installed modules, but pyspark is not among them.
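pip list reports installed distributions, so a module that is only reachable through PYTHONPATH or sys.path may not show up there. As a complementary check, here is a minimal sketch that asks the import system directly whether it can resolve a module and from where. It uses only the standard library; the helper name locate_module is hypothetical and not part of any image or library.

```python
import importlib.util
import sys


def locate_module(name):
    """Return where the import system would load top-level module `name` from.

    Returns None when the module cannot be resolved by the running interpreter.
    (Hypothetical helper, for illustration only.)
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Regular modules and packages have an origin (a file path or "built-in");
    # namespace packages only expose their search locations.
    return spec.origin or list(spec.submodule_search_locations or [])


# Run this in the notebook kernel that raises ModuleNotFoundError.
print("pyspark ->", locate_module("pyspark"))
print("sys.path entries:")
for entry in sys.path:
    print("  ", entry)
```

If the first line prints None, the kernel's sys.path does not contain a directory that provides pyspark, whatever pip reports.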
After reading https://stackoverflow.com/a/31541051/5930295, I tried updating sys.path, since I found that pyspark is present in the container at the locations below. Running:
import sys
sys.path.insert(0, '/usr/local/spark/python')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.10.9-src.zip')
from pyspark import SparkConf, SparkContext
→ No errors
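The workaround above hardcodes the py4j version in the zip file name. A slightly more general sketch, assuming the layout seen in this container ($SPARK_HOME/python plus a py4j-*-src.zip under $SPARK_HOME/python/lib), could derive the paths from SPARK_HOME instead. The function names are hypothetical, and the fallback to /usr/local/spark is taken from the SPARK_HOME value in the environment dumps in this post.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and py4j zip(s) that make pyspark importable.

    Assumes the layout <spark_home>/python and <spark_home>/python/lib/py4j-*-src.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir, *py4j_zips]


def add_spark_to_sys_path(spark_home=None):
    """Prepend the existing Spark Python paths to sys.path; return what was added."""
    spark_home = spark_home or os.environ.get("SPARK_HOME", "/usr/local/spark")
    added = []
    for path in spark_python_paths(spark_home):
        # Skip paths that do not exist or are already on sys.path.
        if os.path.exists(path) and path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added


print(add_spark_to_sys_path())
```

This is still only a per-notebook workaround; it does not explain why the paths are missing in the JupyterHub deployment.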
Further information
I thought the issue must be with the image, so I originally opened a ticket with jupyter/docker-stacks: "pyspark library is missing from jupyter/pyspark-notebook when running with jupyterhub/zero-to-jupyterhub-k8s" (Issue #1255 on GitHub).
Your personal set up
jupyterhub/zero-to-jupyterhub-k8s Helm Chart
jupyter/pyspark-notebook:9fe5186aba96 as default image
singleuser:
  defaultUrl: "/lab"
  image:
    name: jupyter/pyspark-notebook
    tag: 9fe5186aba96
Not Working Environment
env | sort
outputs:
APACHE_AIRFLOW_PORT=tcp://10.43.46.18:8080
APACHE_AIRFLOW_PORT_8080_TCP=tcp://10.43.46.18:8080
APACHE_AIRFLOW_PORT_8080_TCP_ADDR=10.43.46.18
APACHE_AIRFLOW_PORT_8080_TCP_PORT=8080
APACHE_AIRFLOW_PORT_8080_TCP_PROTO=tcp
APACHE_AIRFLOW_POSTGRESQL_PORT=tcp://10.43.64.222:5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP=tcp://10.43.64.222:5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_ADDR=10.43.64.222
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_PORT=5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_PROTO=tcp
APACHE_AIRFLOW_POSTGRESQL_SERVICE_HOST=10.43.64.222
APACHE_AIRFLOW_POSTGRESQL_SERVICE_PORT=5432
APACHE_AIRFLOW_POSTGRESQL_SERVICE_PORT_TCP_POSTGRESQL=5432
APACHE_AIRFLOW_REDIS_MASTER_PORT=tcp://10.43.14.5:6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP=tcp://10.43.14.5:6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_ADDR=10.43.14.5
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_PORT=6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_PROTO=tcp
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_HOST=10.43.14.5
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_PORT=6379
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_PORT_TCP_REDIS=6379
APACHE_AIRFLOW_SERVICE_HOST=10.43.46.18
APACHE_AIRFLOW_SERVICE_PORT=8080
APACHE_AIRFLOW_SERVICE_PORT_HTTP=8080
APACHE_SPARK_MASTER_SVC_PORT=tcp://10.43.67.11:7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP=tcp://10.43.67.11:7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_ADDR=10.43.67.11
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_PORT=7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_PROTO=tcp
APACHE_SPARK_MASTER_SVC_PORT_80_TCP=tcp://10.43.67.11:80
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_ADDR=10.43.67.11
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_PORT=80
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_PROTO=tcp
APACHE_SPARK_MASTER_SVC_SERVICE_HOST=10.43.67.11
APACHE_SPARK_MASTER_SVC_SERVICE_PORT=7077
APACHE_SPARK_MASTER_SVC_SERVICE_PORT_CLUSTER=7077
APACHE_SPARK_MASTER_SVC_SERVICE_PORT_HTTP=80
APACHE_SPARK_VERSION=3.1.1
CLICOLOR=1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
GIT_PAGER=cat
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=jupyter-hersam
HUB_PORT=tcp://10.43.0.141:8081
HUB_PORT_8081_TCP=tcp://10.43.0.141:8081
HUB_PORT_8081_TCP_ADDR=10.43.0.141
HUB_PORT_8081_TCP_PORT=8081
HUB_PORT_8081_TCP_PROTO=tcp
HUB_SERVICE_HOST=10.43.0.141
HUB_SERVICE_PORT=8081
JPY_API_TOKEN=c5353ea5222b413c85a1e08306ebfbb3
JPY_PARENT_PID=7
JUPYTERHUB_ACTIVITY_URL=http://hub:8081/hub/api/users/hersam/activity
JUPYTERHUB_ADMIN_ACCESS=1
JUPYTERHUB_API_TOKEN=c5353ea5222b413c85a1e08306ebfbb3
JUPYTERHUB_API_URL=http://hub:8081/hub/api
JUPYTERHUB_BASE_URL=/
JUPYTERHUB_CLIENT_ID=jupyterhub-user-hersam
JUPYTERHUB_HOST=
JUPYTERHUB_OAUTH_CALLBACK_URL=/user/hersam/oauth_callback
JUPYTERHUB_SERVER_NAME=
JUPYTERHUB_SERVICE_PREFIX=/user/hersam/
JUPYTERHUB_USER=hersam
JUPYTER_IMAGE=jupyter/pyspark-notebook:latest
JUPYTER_IMAGE_SPEC=jupyter/pyspark-notebook:latest
KUBERNETES_PORT=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.43.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MEM_GUARANTEE=1073741824
MINIFORGE_VERSION=4.9.2-7
MPLBACKEND=module://ipykernel.pylab.backend_inline
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PAGER=cat
PATH=/opt/conda/bin:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
PROXY_API_PORT=tcp://10.43.170.224:8001
PROXY_API_PORT_8001_TCP=tcp://10.43.170.224:8001
PROXY_API_PORT_8001_TCP_ADDR=10.43.170.224
PROXY_API_PORT_8001_TCP_PORT=8001
PROXY_API_PORT_8001_TCP_PROTO=tcp
PROXY_API_SERVICE_HOST=10.43.170.224
PROXY_API_SERVICE_PORT=8001
PROXY_PUBLIC_PORT=tcp://10.43.159.248:80
PROXY_PUBLIC_PORT_80_TCP=tcp://10.43.159.248:80
PROXY_PUBLIC_PORT_80_TCP_ADDR=10.43.159.248
PROXY_PUBLIC_PORT_80_TCP_PORT=80
PROXY_PUBLIC_PORT_80_TCP_PROTO=tcp
PROXY_PUBLIC_SERVICE_HOST=10.43.159.248
PROXY_PUBLIC_SERVICE_PORT=80
PROXY_PUBLIC_SERVICE_PORT_HTTP=80
PWD=/home/jovyan
SHELL=/bin/bash
SPARK_HOME=/usr/local/spark
SPARK_NODE_PORT_PORT=tcp://10.43.21.88:7077
SPARK_NODE_PORT_PORT_7077_TCP=tcp://10.43.21.88:7077
SPARK_NODE_PORT_PORT_7077_TCP_ADDR=10.43.21.88
SPARK_NODE_PORT_PORT_7077_TCP_PORT=7077
SPARK_NODE_PORT_PORT_7077_TCP_PROTO=tcp
SPARK_NODE_PORT_PORT_80_TCP=tcp://10.43.21.88:80
SPARK_NODE_PORT_PORT_80_TCP_ADDR=10.43.21.88
SPARK_NODE_PORT_PORT_80_TCP_PORT=80
SPARK_NODE_PORT_PORT_80_TCP_PROTO=tcp
SPARK_NODE_PORT_SERVICE_HOST=10.43.21.88
SPARK_NODE_PORT_SERVICE_PORT=7077
SPARK_NODE_PORT_SERVICE_PORT_CLUSTER=7077
SPARK_NODE_PORT_SERVICE_PORT_HTTP=80
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm-color
XDG_CACHE_HOME=/home/jovyan/.cache/
import os
for k, v in sorted(os.environ.items()):
    print(f'{k}={v}')
outputs:
APACHE_AIRFLOW_PORT=tcp://10.43.46.18:8080
APACHE_AIRFLOW_PORT_8080_TCP=tcp://10.43.46.18:8080
APACHE_AIRFLOW_PORT_8080_TCP_ADDR=10.43.46.18
APACHE_AIRFLOW_PORT_8080_TCP_PORT=8080
APACHE_AIRFLOW_PORT_8080_TCP_PROTO=tcp
APACHE_AIRFLOW_POSTGRESQL_PORT=tcp://10.43.64.222:5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP=tcp://10.43.64.222:5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_ADDR=10.43.64.222
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_PORT=5432
APACHE_AIRFLOW_POSTGRESQL_PORT_5432_TCP_PROTO=tcp
APACHE_AIRFLOW_POSTGRESQL_SERVICE_HOST=10.43.64.222
APACHE_AIRFLOW_POSTGRESQL_SERVICE_PORT=5432
APACHE_AIRFLOW_POSTGRESQL_SERVICE_PORT_TCP_POSTGRESQL=5432
APACHE_AIRFLOW_REDIS_MASTER_PORT=tcp://10.43.14.5:6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP=tcp://10.43.14.5:6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_ADDR=10.43.14.5
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_PORT=6379
APACHE_AIRFLOW_REDIS_MASTER_PORT_6379_TCP_PROTO=tcp
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_HOST=10.43.14.5
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_PORT=6379
APACHE_AIRFLOW_REDIS_MASTER_SERVICE_PORT_TCP_REDIS=6379
APACHE_AIRFLOW_SERVICE_HOST=10.43.46.18
APACHE_AIRFLOW_SERVICE_PORT=8080
APACHE_AIRFLOW_SERVICE_PORT_HTTP=8080
APACHE_SPARK_MASTER_SVC_PORT=tcp://10.43.67.11:7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP=tcp://10.43.67.11:7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_ADDR=10.43.67.11
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_PORT=7077
APACHE_SPARK_MASTER_SVC_PORT_7077_TCP_PROTO=tcp
APACHE_SPARK_MASTER_SVC_PORT_80_TCP=tcp://10.43.67.11:80
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_ADDR=10.43.67.11
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_PORT=80
APACHE_SPARK_MASTER_SVC_PORT_80_TCP_PROTO=tcp
APACHE_SPARK_MASTER_SVC_SERVICE_HOST=10.43.67.11
APACHE_SPARK_MASTER_SVC_SERVICE_PORT=7077
APACHE_SPARK_MASTER_SVC_SERVICE_PORT_CLUSTER=7077
APACHE_SPARK_MASTER_SVC_SERVICE_PORT_HTTP=80
APACHE_SPARK_VERSION=3.1.1
CLICOLOR=1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
GIT_PAGER=cat
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=jupyter-hersam
HUB_PORT=tcp://10.43.0.141:8081
HUB_PORT_8081_TCP=tcp://10.43.0.141:8081
HUB_PORT_8081_TCP_ADDR=10.43.0.141
HUB_PORT_8081_TCP_PORT=8081
HUB_PORT_8081_TCP_PROTO=tcp
HUB_SERVICE_HOST=10.43.0.141
HUB_SERVICE_PORT=8081
JPY_API_TOKEN=c5353ea5222b413c85a1e08306ebfbb3
JPY_PARENT_PID=7
JUPYTERHUB_ACTIVITY_URL=http://hub:8081/hub/api/users/hersam/activity
JUPYTERHUB_ADMIN_ACCESS=1
JUPYTERHUB_API_TOKEN=c5353ea5222b413c85a1e08306ebfbb3
JUPYTERHUB_API_URL=http://hub:8081/hub/api
JUPYTERHUB_BASE_URL=/
JUPYTERHUB_CLIENT_ID=jupyterhub-user-hersam
JUPYTERHUB_HOST=
JUPYTERHUB_OAUTH_CALLBACK_URL=/user/hersam/oauth_callback
JUPYTERHUB_SERVER_NAME=
JUPYTERHUB_SERVICE_PREFIX=/user/hersam/
JUPYTERHUB_USER=hersam
JUPYTER_IMAGE=jupyter/pyspark-notebook:latest
JUPYTER_IMAGE_SPEC=jupyter/pyspark-notebook:latest
KUBERNETES_PORT=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443
KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1
KUBERNETES_PORT_443_TCP_PORT=443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_HOST=10.43.0.1
KUBERNETES_SERVICE_PORT=443
KUBERNETES_SERVICE_PORT_HTTPS=443
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MEM_GUARANTEE=1073741824
MINIFORGE_VERSION=4.9.2-7
MPLBACKEND=module://ipykernel.pylab.backend_inline
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PAGER=cat
PATH=/opt/conda/bin:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
PROXY_API_PORT=tcp://10.43.170.224:8001
PROXY_API_PORT_8001_TCP=tcp://10.43.170.224:8001
PROXY_API_PORT_8001_TCP_ADDR=10.43.170.224
PROXY_API_PORT_8001_TCP_PORT=8001
PROXY_API_PORT_8001_TCP_PROTO=tcp
PROXY_API_SERVICE_HOST=10.43.170.224
PROXY_API_SERVICE_PORT=8001
PROXY_PUBLIC_PORT=tcp://10.43.159.248:80
PROXY_PUBLIC_PORT_80_TCP=tcp://10.43.159.248:80
PROXY_PUBLIC_PORT_80_TCP_ADDR=10.43.159.248
PROXY_PUBLIC_PORT_80_TCP_PORT=80
PROXY_PUBLIC_PORT_80_TCP_PROTO=tcp
PROXY_PUBLIC_SERVICE_HOST=10.43.159.248
PROXY_PUBLIC_SERVICE_PORT=80
PROXY_PUBLIC_SERVICE_PORT_HTTP=80
PWD=/home/jovyan
SHELL=/bin/bash
SPARK_HOME=/usr/local/spark
SPARK_NODE_PORT_PORT=tcp://10.43.21.88:7077
SPARK_NODE_PORT_PORT_7077_TCP=tcp://10.43.21.88:7077
SPARK_NODE_PORT_PORT_7077_TCP_ADDR=10.43.21.88
SPARK_NODE_PORT_PORT_7077_TCP_PORT=7077
SPARK_NODE_PORT_PORT_7077_TCP_PROTO=tcp
SPARK_NODE_PORT_PORT_80_TCP=tcp://10.43.21.88:80
SPARK_NODE_PORT_PORT_80_TCP_ADDR=10.43.21.88
SPARK_NODE_PORT_PORT_80_TCP_PORT=80
SPARK_NODE_PORT_PORT_80_TCP_PROTO=tcp
SPARK_NODE_PORT_SERVICE_HOST=10.43.21.88
SPARK_NODE_PORT_SERVICE_PORT=7077
SPARK_NODE_PORT_SERVICE_PORT_CLUSTER=7077
SPARK_NODE_PORT_SERVICE_PORT_HTTP=80
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm-color
XDG_CACHE_HOME=/home/jovyan/.cache/
Working Environment (running container locally with docker)
env | sort
outputs:
APACHE_SPARK_VERSION=3.1.1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=2a600aae602f
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MINIFORGE_VERSION=4.9.2-7
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
SHELL=/bin/bash
SPARK_HOME=/usr/local/spark
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm
XDG_CACHE_HOME=/home/jovyan/.cache/
import os
for k, v in sorted(os.environ.items()):
    print(f'{k}={v}')
outputs:
APACHE_SPARK_VERSION=3.1.1
CLICOLOR=1
CONDA_DIR=/opt/conda
CONDA_VERSION=4.9.2
DEBIAN_FRONTEND=noninteractive
GIT_PAGER=cat
HADOOP_VERSION=3.2
HOME=/home/jovyan
HOSTNAME=2a600aae602f
JPY_PARENT_PID=7
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
MINIFORGE_VERSION=4.9.2-7
MPLBACKEND=module://ipykernel.pylab.backend_inline
NB_GID=100
NB_UID=1000
NB_USER=jovyan
PAGER=cat
PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
PWD=/home/jovyan
PYSPARK_PYTHONPATH_SET=1
PYTHONPATH=/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/usr/local/spark/python:
SHELL=/bin/bash
SHLVL=0
SPARK_CONF_DIR=/usr/local/spark/conf
SPARK_HOME=/usr/local/spark
SPARK_OPTS=--driver-java-options=-Xms1024M --driver-java-options=-Xmx4096M --driver-java-options=-Dlog4j.logLevel=info
TERM=xterm-color
XDG_CACHE_HOME=/home/jovyan/.cache/
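Comparing the dumps above by eye is tedious, so here is a small sketch for diffing two KEY=VALUE dumps. The function names parse_env and diff_env are hypothetical; the sample strings are short excerpts copied from the kernel-level dumps in this post (local docker versus JupyterHub), not the full output.

```python
def parse_env(dump):
    """Parse KEY=VALUE lines (as printed by `env | sort`) into a dict."""
    env = {}
    for line in dump.splitlines():
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            env[key.strip()] = value
    return env


def diff_env(working, broken):
    """Return (keys only in working, keys only in broken, keys whose values differ)."""
    only_working = sorted(k for k in working if k not in broken)
    only_broken = sorted(k for k in broken if k not in working)
    changed = sorted(k for k in working.keys() & broken.keys() if working[k] != broken[k])
    return only_working, only_broken, changed


# Excerpts from the Python-level dumps above.
working_dump = """\
PATH=/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
PYSPARK_PYTHONPATH_SET=1
PYTHONPATH=/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/usr/local/spark/python:
SPARK_HOME=/usr/local/spark"""

broken_dump = """\
PATH=/opt/conda/bin:/usr/local/spark/python:/usr/local/spark/python/lib/py4j-0.10.9-src.zip:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/spark/bin
SPARK_HOME=/usr/local/spark"""

only_working, only_broken, changed = diff_env(parse_env(working_dump), parse_env(broken_dump))
print("only in working:", only_working)
print("only in broken: ", only_broken)
print("different value:", changed)
```

On these excerpts, PYTHONPATH and PYSPARK_PYTHONPATH_SET appear only in the working kernel, and PATH differs between the two. Running the same comparison on the full dumps would also list the Kubernetes service and JupyterHub variables as present only in the hub deployment.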