Integrate JupyterHub on Kubernetes (Z2JH) with Slurm

Hello everyone,

We are currently using Z2JH to start Jupyter notebooks in a Kubernetes cluster. The cluster also contains GPU nodes, which we can select via a dedicated profile in singleuser.profileList. This all works very well!

But now we also want to integrate our HPC environment, which is still operated in the classic way with Slurm.

I am now wondering how we can best do this with Z2JH. In a non-Kubernetes JupyterHub installation, we would use the ‘batchspawner.SlurmSpawner’ in combination with ‘wrapspawner.ProfilesSpawner’.

Can you point me in the right direction for integrating Slurm with Z2JH? :slight_smile:

Many thanks in advance and best regards,
Martin

This should be doable, but it requires a bit of work. As you have guessed, it can be done using SlurmSpawner and ProfilesSpawner. But there are a few things to think about here:

  • SlurmSpawner spawns single-user servers as batch scripts submitted via native commands like sbatch, squeue and scancel. This means the pod where JupyterHub is running must have routing to the SLURM internal network and munge key authentication. Moreover, SlurmSpawner uses sudo to submit the batch job as the "user" that is spawning the single-user server. This means you will have to run the JupyterHub pod as privileged (I am not totally sure of that tho…) and, more importantly, you will have to maintain the list of SLURM users and groups in your k8s environment as well. Finally, you will need to install the SLURM client tools in the JupyterHub pod.
  • A way to simplify the above scenario is to use SLURM's REST API server and submit the jobs via API requests, which is more k8s "friendly". In this case you will have to override the SlurmSpawner's methods to use the REST API instead of the native SLURM client commands (a very rough sketch follows below). Even in this case you will need a network route between the SLURM compute nodes and the pod where JupyterHub is running, because JupyterHub does health checks on the single-user servers, so bi-directional communication between the Hub and the single-user servers is necessary.
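
Just to illustrate that REST API option: below is a very rough, untested sketch of overriding the spawner to talk to slurmrestd instead of shelling out. The base URL, API version, payload shape and the get_slurm_token() helper are assumptions that depend on your slurmrestd version and authentication setup, and the overridden hook names (submit_batch_script, read_job_state, cancel_batch_job) should be checked against the batchspawner release you use.

import json

import batchspawner
from tornado.httpclient import AsyncHTTPClient, HTTPRequest

SLURMRESTD_URL = "http://slurmrestd.example.com:6820"  # assumption: your slurmrestd endpoint
API_VERSION = "v0.0.39"                                # assumption: depends on your Slurm version


class SlurmRestSpawner(batchspawner.SlurmSpawner):
    """Sketch of a SlurmSpawner that submits jobs via the Slurm REST API."""

    def get_slurm_token(self, username):
        # Hypothetical: fetch a per-user JWT for slurmrestd (e.g. from a
        # secret store or a wrapper around `scontrol token`); site-specific.
        raise NotImplementedError("site-specific slurmrestd token lookup")

    def _headers(self):
        # X-SLURM-USER-NAME / X-SLURM-USER-TOKEN are the slurmrestd auth headers.
        return {
            "X-SLURM-USER-NAME": self.user.name,
            "X-SLURM-USER-TOKEN": self.get_slurm_token(self.user.name),
            "Content-Type": "application/json",
        }

    async def submit_batch_script(self):
        # Minimal payload only -- real slurmrestd versions require more fields,
        # and batchspawner normally renders self.batch_script with the request
        # variables before submitting.
        payload = {
            "script": self.batch_script,
            "job": {
                "name": "spawner-jupyterhub",
                "environment": [f"{k}={v}" for k, v in self.get_env().items()],
            },
        }
        req = HTTPRequest(
            url=f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/submit",
            method="POST",
            headers=self._headers(),
            body=json.dumps(payload),
        )
        resp = await AsyncHTTPClient().fetch(req)
        self.job_id = str(json.loads(resp.body)["job_id"])
        return self.job_id

    async def read_job_state(self):
        req = HTTPRequest(
            url=f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/{self.job_id}",
            method="GET",
            headers=self._headers(),
        )
        resp = await AsyncHTTPClient().fetch(req)
        job = json.loads(resp.body)["jobs"][0]
        # SlurmSpawner expects the '%T %B' squeue format ("<STATE> <node>")
        # when parsing the job state and the execution host.
        self.job_status = f"{job['job_state']} {job.get('batch_host', '')}"
        return self.job_status

    async def cancel_batch_job(self):
        req = HTTPRequest(
            url=f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/{self.job_id}",
            method="DELETE",
            headers=self._headers(),
        )
        await AsyncHTTPClient().fetch(req)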

Once you decide which way to go, you can use KubeSpawner and SlurmSpawner as different profiles in ProfilesSpawner to spawn the single-user servers in the different environments. As a reference, we use a ProfilesSpawner that combines SSH and SLURM spawners on our HPC platform.
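
As a rough illustration (profile names, partition and resource values below are just placeholders), such a combined configuration could look something like this:

# Minimal sketch of combining KubeSpawner and SlurmSpawner via ProfilesSpawner.
# Each profile is a tuple: (display_name, name, spawner_class, kwargs for that spawner).
c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
c.ProfilesSpawner.profiles = [
    ('Kubernetes (CPU only)', 'k8s-default', 'kubespawner.KubeSpawner',
     {'ip': '0.0.0.0', 'port': 0}),
    ('HPC via Slurm', 'slurm', 'batchspawner.SlurmSpawner',
     dict(req_partition='batch', req_nprocs='4', req_runtime='4:00:00')),
]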


Hello @mahendrapaipuri, many thanks for your reply.

I've decided to give the existing SlurmSpawner a try. Using Slurm's REST API indeed looks technically cleaner to me, but it currently involves more effort.

I was able to get the SlurmSpawner up and running inside Kubernetes by opening an SSH connection to the login node via the "exec_prefix". In doing so, I encountered a few hurdles that I was able to solve with the help of existing workarounds. I needed to build a custom hub image with ssh, wrapspawner and batchspawner installed. For anyone interested in doing the same, I've put my learnings below. :slightly_smiling_face:

Parts of the HelmChart Config

hub:
  image:
    # Custom jupyterhub image with ssh, wrapspawner and batchspawner installed.
    name: myregistry.example.com/k8s-hub-custom
  extraFiles:
    # SSH Private Key to connect to HPC login node - just as proof of concept.
    #
    # (!) PLEASE NOTE THAT THIS IS A SECURITY RISK IF THE KEY IS NOT PROTECTED 
    #     PROPERLY. THIS IS JUST A PROOF OF CONCEPT.
    00-ssh-key:
      mountPath: /id_ed25519
      mode: 0400
      stringData: |
        -----BEGIN OPENSSH PRIVATE KEY-----
        XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
        -----END OPENSSH PRIVATE KEY-----
  extraConfig:
    00-global: |
      c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
      c.JupyterHub.log_level = 'DEBUG'
      c.Spawner.http_timeout = 180
      c.Spawner.start_timeout = 300
    01-batchspawner-slurm: |
      import batchspawner
    
      # WORKAROUND: No local slurm users in the hub pod, which leads to an error
      # when calling pwd.getpwnam(self.user.name).pw_dir in the _req_homedir_default function.
      # 
      # Patch SlurmSpawner so that it does not require local users.
      # Thanks to https://gist.github.com/zonca/55f7949983e56088186e99db53548ded
      #
      class SlurmSpawnerNoLocalUsers(batchspawner.SlurmSpawner):
        def user_env(self, env):
          """get user environment"""
          env['USER'] = self.user.name
          return env

        def _req_homedir_default(self):
          return "/home/{}/".format(self.user.name)

      # WORKAROUND: Environment variables are not passed to the spawned process with ssh
      #
      # Thx to https://github.com/jupyterhub/batchspawner/issues/123
      #
      c.SlurmSpawner.batch_submit_cmd = " ".join(
          [
              "env", "{% for var in keepvars.split(',') %}{{var}}=\"'${{'{'}}{{var}}{{'}'}}'\" {% endfor %}",
              "sbatch --parsable",
          ]
      )

      # WORKAROUND: Error: squeue: error: Unrecognized option: %B
      #
      # Take care of quoting in the squeue command in combination with exec_prefix.
      # Thx to https://github.com/jupyterhub/batchspawner/issues/123#issuecomment-2157902069
      #
      c.SlurmSpawner.batch_query_cmd = "squeue -h -j {job_id} -o \"'%T %B'\""
      
      #
      # Slurm settings
      #
      c.SlurmSpawner.exec_prefix = "ssh -o StrictHostKeyChecking=accept-new -i /id_ed25519 MYSSHUSER@login-node.example.com"      
      c.SlurmSpawner.batch_script = '''#!/bin/bash
      #SBATCH --output=/home/{username}/jupyterhub_slurmspawner_%j.log
      #SBATCH --job-name=spawner-jupyterhub
      #SBATCH --chdir=/home/{username}
      #SBATCH --export=HOME,PATH,JUPYTERHUB_API_TOKEN,JPY_API_TOKEN,JUPYTERHUB_CLIENT_ID,JUPYTERHUB_COOKIE_HOST_PREFIX_ENABLED,JUPYTERHUB_HOST,JUPYTERHUB_OAUTH_CALLBACK_URL,JUPYTERHUB_OAUTH_SCOPES,JUPYTERHUB_OAUTH_ACCESS_SCOPES,JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES,JUPYTERHUB_USER,JUPYTERHUB_SERVER_NAME,JUPYTERHUB_API_URL,JUPYTERHUB_ACTIVITY_URL,JUPYTERHUB_BASE_URL,JUPYTERHUB_SERVICE_PREFIX,JUPYTERHUB_SERVICE_URL,JUPYTERHUB_PUBLIC_URL,JUPYTERHUB_PUBLIC_HUB_URL,USER,HOME,SHELL
      #SBATCH --get-user-env=L

      echo "***************************************************************"
      hostname
      ml load Python
      python3 -m venv ./venv_jupyterhub
      source ./venv_jupyterhub/bin/activate
      pip3 install batchspawner
      pip3 install jupyterhub
      pip3 install jupyterlab
      pip3 install jupyter-server

      echo "***************************************************************"
      # Most likely the following modifications are not needed if the Hub can communicate directly
      # with the compute nodes and vice versa. Unfortunately, this is not yet the case in our environment, so I had to do some SSH port forwarding magic in the background.
      export JUPYTERHUB_API_URL=http://login-node.example.com:8085/hub/api
      export JUPYTERHUB_ACTIVITY_URL=http://login-node.example.com:8085/hub/api/users/{username}/activity
      export JUPYTERHUB_SERVICE_URL=http://localhost:19999

      echo "***************************************************************"
      which batchspawner-singleuser
      which jupyterhub-singleuser
      env
      echo $HOME
      echo $PATH

      echo "***************************************************************"
      batchspawner-singleuser jupyterhub-singleuser --debug --ServerApp.port=9999
      '''
    10-profilespawner: |
      # WORKAROUND: Stuck in "Started container notebook"
      #
      # KubeSpawner with ProfilesSpawner leads to a hanging "Started container notebook".
      # Thanks to https://github.com/jupyterhub/wrapspawner/issues/58#issuecomment-1882918661, I took a
      # further look into the variables. In contrast to the issue, I had to set cmd to ['jupyterhub-singleuser'].
      #
      # NOTE:
      #
      # A profile in ProfilesSpawner.profiles is a tuple with the following parameters:
      #
      #   1. display_name: The name of the profile that is shown in the dropdown menu.
      #   2. name: The internal name of the profile.
      #   3. spawner_class: The spawner class used for this profile.
      #   4. kwargs: The parameters that are passed to the spawner class.
      #
      # When using the KubeSpawner, please be aware of the 'name' parameter (the second one). It must match a profile
      # in the singleuser.profileList. If a profile in the singleuser.profileList does not contain a name, then the
      # display_name is used as the name, with spaces removed and everything lowercase.
      # E.g. display_name: "GPU Node" will be used as name: "gpunode".
      #
      c.ProfilesSpawner.profiles = [
        ('K8S - Default', 'default', 'kubespawner.KubeSpawner', {'ip':'0.0.0.0', 'port': 0, 'cmd': ['jupyterhub-singleuser']}),
        ('K8S - GPU-Node', 'gpu-node', 'kubespawner.KubeSpawner', {'ip':'0.0.0.0', 'port': 0, 'cmd': ['jupyterhub-singleuser']}),
        ('HPC - Partition XY (2 cores, 4 GB, 8 hours)', 'singleuser', SlurmSpawnerNoLocalUsers, dict(req_partition='partition.xy', req_nprocs='2', req_memory='4gb', req_runtime='8:00:00'))
      ]
singleuser:
  # Profiles for Kubernetes Spawner
  profileList:
    - display_name: "Default"
      description: "Your code will run on a shared machine with CPU only."
      default: True
    - display_name: "GPU-Node"
      description: "Spawns a notebook server with access to a GPU"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"  

Modified jupyterhub image with ssh, wrapspawner and batchspawner installed

FROM quay.io/jupyterhub/k8s-hub:4.2.0

USER root

RUN export DEBIAN_FRONTEND=noninteractive \
 && apt-get update \
 && apt-get install -y --no-install-recommends openssh-client \
 && rm -rf /var/lib/apt/lists/*

USER jovyan

RUN pip3 install wrapspawner batchspawner

Thanks again,
Martin
