Minimal image compatible with NVIDIA RAPIDS?

I was looking for an example configuration or pre-built Docker image that already includes NVIDIA RAPIDS. The RAPIDS team provides good install docs; however, following the conda-based instructions here to try to add these dependencies on top of an existing Jupyter Docker image (even the most minimal one) creates an image of over 40 GB, too big to build successfully on GH Actions even after some tricks that free up additional space!
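
(For context, the conda route boils down to something like the command below; the channels and version pins are taken from the RAPIDS install selector as best I recall, so treat them as approximate. It is the full CUDA/cuDF/cuML dependency tree this pulls in that blows up the image size.)

# Approximate install from the RAPIDS conda instructions (versions indicative only)
mamba install -y -c rapidsai -c conda-forge -c nvidia \
    rapids=25.02 python=3.12 'cuda-version>=12.0,<=12.8'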

Fortunately, the RAPIDS team also provides a number of pre-built Docker images, e.g. nvcr.io/nvidia/rapidsai/base:25.02-cuda12.0-py3.12, which are also much smaller. I think it should be a straightforward task to add the necessary JupyterHub pieces on top of this image, so I tried this:

FROM nvcr.io/nvidia/rapidsai/notebooks:25.02-cuda12.0-py3.12
RUN mamba install -c conda-forge --yes jupyterhub notebook
CMD ["jupyter", "lab", "--ip", "0.0.0.0"]

but the resulting docker image,

ghcr.io/rocker-org/rapids@sha256:c508daa3143add94f8a6af878f4dc3e62a36166373dbacb771a51a0fff35bc7a

won’t start up on my JupyterHub. (The CUDA-enabled official Jupyter images work fine, and this one seems to work fine if I just run it locally in Docker.) What did I miss?

Can you turn on debug logging and share your logs?

Thanks! Here are the kubectl events from kubectl describe pod; it looks like the PostStartHook fails. (Not sure what that does. Is there anything that needs to be set as a default entrypoint or default command when building an image?)

Events:
  Type     Reason               Age                From                            Message
  ----     ------               ----               ----                            -------
  Normal   Scheduled            44s                testjuypterhelm-user-scheduler  Successfully assigned testjupyter/jupyter-cboettig to thelio
  Normal   Pulled               45s                kubelet                         Container image "quay.io/jupyterhub/k8s-network-tools:4.0.0" already present on machine
  Normal   Created              45s                kubelet                         Created container block-cloud-metadata
  Normal   Started              44s                kubelet                         Started container block-cloud-metadata
  Normal   Pulled               43s                kubelet                         Successfully pulled image "ghcr.io/rocker-org/rapids@sha256:c508daa3143add94f8a6af878f4dc3e62a36166373dbacb771a51a0fff35bc7a" in 518ms (518ms including waiting)
  Normal   Pulled               42s                kubelet                         Successfully pulled image "ghcr.io/rocker-org/rapids@sha256:c508daa3143add94f8a6af878f4dc3e62a36166373dbacb771a51a0fff35bc7a" in 525ms (525ms including waiting)
  Normal   Pulled               28s                kubelet                         Successfully pulled image "ghcr.io/rocker-org/rapids@sha256:c508daa3143add94f8a6af878f4dc3e62a36166373dbacb771a51a0fff35bc7a" in 608ms (608ms including waiting)
  Normal   Created              28s (x3 over 43s)  kubelet                         Created container notebook
  Normal   Started              28s (x3 over 43s)  kubelet                         Started container notebook
  Warning  FailedPostStartHook  28s (x3 over 43s)  kubelet                         PostStartHook failed
  Normal   Killing              28s (x3 over 43s)  kubelet                         FailedPostStartHook
  Warning  BackOff              15s (x3 over 42s)  kubelet                         Back-off restarting failed container notebook in pod jupyter-cboettig_testjupyter(c0ff394d-ac48-4603-87a1-9f3ac231870a)

Can you share your Z2JH config? postStart isn’t configured by default

I did have a postStart config, but I think that’s unrelated. I just removed the postStart task, and the image still won’t launch:

Events:
  Type     Reason     Age                From                            Message
  ----     ------     ----               ----                            -------
  Normal   Scheduled  32s                testjuypterhelm-user-scheduler  Successfully assigned testjupyter/jupyter-cboettig to thelio
  Normal   Pulled     32s                kubelet                         Container image "quay.io/jupyterhub/k8s-network-tools:4.0.0" already present on machine
  Normal   Created    32s                kubelet                         Created container block-cloud-metadata
  Normal   Started    32s                kubelet                         Started container block-cloud-metadata
  Normal   Pulled     31s                kubelet                         Successfully pulled image "ghcr.io/rocker-org/rapids:latest" in 541ms (541ms including waiting)
  Normal   Pulled     30s                kubelet                         Successfully pulled image "ghcr.io/rocker-org/rapids:latest" in 531ms (531ms including waiting)
  Normal   Pulling    16s (x3 over 31s)  kubelet                         Pulling image "ghcr.io/rocker-org/rapids:latest"
  Normal   Pulled     16s                kubelet                         Successfully pulled image "ghcr.io/rocker-org/rapids:latest" in 485ms (485ms including waiting)
  Normal   Created    15s (x3 over 31s)  kubelet                         Created container notebook
  Normal   Started    15s (x3 over 31s)  kubelet                         Started container notebook
  Warning  BackOff    3s (x3 over 29s)   kubelet                         Back-off restarting failed container notebook in pod jupyter-cboettig_testjupyter(90e614a9-b60a-4ce9-8966-a3774efe22cc)

For the record, postStart was just:

#  lifecycleHooks:
#    postStart:
#      exec:
#        command: ["/bin/bash", "-c", "if [ -f '/opt/share/start.sh' ]; then /bin/bash '/opt/share/start.sh'; fi"]

Can you share the logs for the failed pod? It looks like

kubectl logs -n testjupyter jupyter-cboettig

should give you them.

Thanks @manics, the pod doesn’t start, so it doesn’t have logs (waiting for pod creation). I have pasted the event logs from kubectl describe pod in the previous reply.

Just to double-check, is there anything other than installing jupyterhub and notebook conda packages in the default conda environment that should be necessary here?

Starting the single-user server is a bit more involved than running jupyter lab --ip 0.0.0.0 as you are doing in your image. Refer to the base image Dockerfile to see how it is done. You can try replacing CMD with ["jupyterhub-singleuser"] to see if it works.
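
As a rough sketch, reusing the tag and packages from your Dockerfile above, that change might look like:

FROM nvcr.io/nvidia/rapidsai/notebooks:25.02-cuda12.0-py3.12
# jupyterhub provides the jupyterhub-singleuser launcher that the hub talks to
RUN mamba install -c conda-forge --yes jupyterhub notebook
# Let the hub-aware launcher start the server instead of plain jupyter lab
CMD ["jupyterhub-singleuser"]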

Can anyone point me to the documentation about what is required for a Docker image to be compatible to start on JupyterHub?

For instance, here is a relatively minimal Dockerfile: it merely installs jupyterhub 4.* and notebook on top of the default micromamba image. This works just fine when deployed on JupyterHub (using the "bring my own Docker image" option).
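
Roughly, it amounts to something like this (the base tag and pin here are illustrative rather than the exact file):

FROM mambaorg/micromamba:latest
# Install a hub-compatible single-user server into the default environment
RUN micromamba install -y -n base -c conda-forge 'jupyterhub=4.*' notebook && \
    micromamba clean --all --yes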

Note that this minimal example works just fine with or without the CMD; I believe the Z2JH config already provides the necessary default command on startup.
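
If I'm reading the Z2JH chart defaults correctly, the relevant setting is along these lines (an assumption on my part; override it under singleuser if you need something different):

singleuser:
  # Default command for user pods; set to null to fall back to the image's own CMD
  cmd: jupyterhub-singleuser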

I'm still not sure why the NVIDIA RAPIDS image cannot start up, though, or how to debug it. Is anyone else able to test that image?

I think the only requirement is that a compatible version of jupyterhub-singleuser is in the container’s default PATH. What happens if you docker run ... IMAGE jupyterhub-singleuser --help?
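
For example, with the tag from earlier in the thread (adjust to whatever you actually pushed):

# Quick check that the hub launcher is on PATH inside the image
docker run --rm ghcr.io/rocker-org/rapids:latest jupyterhub-singleuser --help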

Ah ha! Apparently I was just being either too slow or too fast for the kubectl logs to show up. I was able to run that again and get logs this time, which show a permission issue on their custom entrypoint file! Sorry for the noise, and thanks all for the help. Testing now, but this looks promising.
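
(In case anyone else hits the same thing: you can see what entrypoint and default command an image bakes in with something like the sketch below, and a derived Dockerfile can clear a problematic entrypoint with an empty ENTRYPOINT [].)

# Show the entrypoint and command the image declares
docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' \
    ghcr.io/rocker-org/rapids:latest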

Progress: a new error message.

I unset the ENTRYPOINT on the image, but now it produces the following error log:

kubectl logs -n testjupyter jupyter-cboettig


Defaulted container "notebook" out of: notebook, block-cloud-metadata (init)
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = (not set)
  program name = '/opt/conda/bin/python3.12'
  isolated = 0
  environment = 1
  user site = 1
  safe_path = 0
  import site = 1
  is in build tree = 0
  stdlib dir = '/opt/conda/lib/python3.12'
  sys._base_executable = '/opt/conda/bin/python3.12'
  sys.base_prefix = '/opt/conda'
  sys.base_exec_prefix = '/opt/conda'
  sys.platlibdir = 'lib'
  sys.executable = '/opt/conda/bin/python3.12'
  sys.prefix = '/opt/conda'
  sys.exec_prefix = '/opt/conda'
  sys.path = [
    '/opt/conda/lib/python312.zip',
    '/opt/conda/lib/python3.12',
    '/opt/conda/lib/python3.12/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007da95b38b740 (most recent call first):
  <no Python frame>

Any ideas?

Sounds like either your Python installation has been corrupted (perhaps by installing incompatible packages?), or something’s gone wrong with your Python-related paths, which means it’s looking in the wrong place for system modules. I’ve no idea how that happened, though.
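
One way to narrow it down (just a sketch, reusing the /opt/conda paths from your traceback and the image tag from earlier in the thread) is to poke around inside the image directly:

# Check for stray PYTHONHOME/PYTHONPATH values and confirm the stdlib (encodings) is really on disk
docker run --rm --entrypoint "" ghcr.io/rocker-org/rapids:latest bash -c \
    'env | grep -i python; ls /opt/conda/lib/python3.12/encodings | head'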

Do you have the Dockerfile from Nvidia used to build the original base image?

Thanks, yes, I believe it is built from this Dockerfile.