Jupyter spawner cannot spawn pods with NVIDIA GPU


I created a node with GPU. Overall, I configured everything correctly: when I simply create a static resource pod and add the limit nvidia.com/gpu: 1, everything works as expected.

But when I try to do the same thing in the profileList of JupyterHub (add GPU request/limit), I get the following error in describe:

   extra_resource_requests:
     nvidia.com/gpu: 1

Warning  FailedScheduling  2m14s  default-scheduler  0/8 nodes are available: 1 Insufficient nvidia.com/gpu. 
preemption: 0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption is not helpful for scheduling.

To clarify: the resources are definitely there — as I already mentioned, creating pods directly works fine. This behavior only appears when the pod is spawned through JupyterHub.

P.S. I already tried the obvious thing — switched from jupyter-scheduler to default-scheduler, it didn’t help.

Hey there, KubeSpawner does not have an extra_resource_requests option; it has extra_resource_guarantees and extra_resource_limits instead. Setting one of those may resolve your issue.

Hi! Yeah, sorry, my fault! I meant that I'm trying to do this:

- display_name: "ML with GPU"
  description: "Ubuntu 22.04"
  kubespawner_override:
    scheduler_name: default-scheduler
    image: some_image_with_nvidia_smi
    extra_resource_limits:
      nvidia.com/gpu: 1


Hmm, that's odd. Did you take a look at the "Customizing User Resources" page of the Zero to JupyterHub with Kubernetes documentation and verify that the command to check GPU availability succeeds?

Yes! The command to check GPU availability succeeds:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia.com/gpu'
NAME       GPUs
gpu-node   2
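Worth noting: the scheduler looks at a node's allocatable resources, not its capacity, so it can help to compare both. A small sketch (the backslash-escaped dot is how custom-columns handles resource names containing dots):

```shell
# Compare capacity vs allocatable; the scheduler only considers allocatable,
# which the NVIDIA device plugin must advertise on the node.
kubectl get nodes -o custom-columns="NAME:.metadata.name,CAP:.status.capacity.nvidia\.com/gpu,ALLOC:.status.allocatable.nvidia\.com/gpu"
```

If allocatable shows <none> or 0 on the GPU node, the device plugin is not advertising the GPUs there.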

Can you show us the full YAML for the pod?

Full YAML of the test pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:13.0.0-base-ubuntu22.04
    command: ["nvidia-smi", "-l", "5"]
    resources:
      limits:
        nvidia.com/gpu: 1

Sorry, I meant the full YAML of the failing jupyter pod (kubectl get pod jupyter-…. -oyaml)

apiVersion: v1
kind: Pod
metadata:
  annotations:
    hub.jupyter.org/jupyterhub-version: 5.4.3
    hub.jupyter.org/kubespawner-version: 7.0.0
    hub.jupyter.org/username: ikol006
  creationTimestamp: "2026-01-22T03:06:14Z"
  labels:
    app: jupyterhub
    app.kubernetes.io/component: singleuser-server
    app.kubernetes.io/instance: jupyterhub
    app.kubernetes.io/managed-by: kubespawner
    app.kubernetes.io/name: jupyterhub
    chart: jupyterhub-4.3.2
    component: singleuser-server
    helm.sh/chart: jupyterhub-4.3.2
    hub.jupyter.org/network-access-hub: "true"
    hub.jupyter.org/servername: ""
    hub.jupyter.org/username: ikol006
    release: jupyterhub
  name: jupyter-ikol006
  namespace: jupyterhub
  resourceVersion: "930369902"
  uid: ccc01307-09bb-4b2a-b870-88c82764d874
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: hub.jupyter.org/node-purpose
            operator: In
            values:
            - user
        weight: 100
  automountServiceAccountToken: false
  containers:
  - args:
    - jupyterhub-singleuser
    env:
    - name: JPY_API_TOKEN
      value: 66sdfe0e9e7af47a91b43205aa6dc8c4
    - name: JUPYTERHUB_ACTIVITY_URL
      value: http://hub:8081/hub/api/users/ikol006/activity
    - name: JUPYTERHUB_ADMIN_ACCESS
      value: "1"
    - name: JUPYTERHUB_API_TOKEN
      value: 66sdfe0e9e7af47a91b43205aa6dc8c4
    - name: JUPYTERHUB_API_URL
      value: http://hub:8081/hub/api
    - name: JUPYTERHUB_BASE_URL
      value: /
    - name: JUPYTERHUB_CLIENT_ID
      value: jupyterhub-user-ikol006
    - name: JUPYTERHUB_COOKIE_HOST_PREFIX_ENABLED
      value: "0"
    - name: JUPYTERHUB_DEBUG
      value: "1"
    - name: JUPYTERHUB_HOST
    - name: JUPYTERHUB_OAUTH_ACCESS_SCOPES
      value: '["access:servers!server=ikol006/", "access:servers!user=ikol006"]'
    - name: JUPYTERHUB_OAUTH_CALLBACK_URL
      value: /user/ikol006/oauth_callback
    - name: JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES
      value: '[]'
    - name: JUPYTERHUB_OAUTH_SCOPES
      value: '["access:servers!server=ikol006/", "access:servers!user=ikol006"]'
    - name: JUPYTERHUB_PUBLIC_HUB_URL
    - name: JUPYTERHUB_PUBLIC_URL
    - name: JUPYTERHUB_SERVER_NAME
    - name: JUPYTERHUB_SERVICE_PREFIX
      value: /user/ikol006/
    - name: JUPYTERHUB_SERVICE_URL
      value: http://0.0.0.0:8888/user/ikol006/
    - name: JUPYTERHUB_USER
      value: ikol006
    - name: JUPYTER_IMAGE
      value: ml-with-gpu-notebook:x86_64-ubuntu-22.04
    - name: JUPYTER_IMAGE_SPEC
      value: ml-with-gpu-notebook:x86_64-ubuntu-22.04
    - name: MEM_GUARANTEE
      value: "1073741824"
    - name: PIP_INDEX_URL
      value: https://nexus.local/repository/pypi-proxy/simple
    - name: PIP_TIMEOUT
      value: "60"
    - name: PIP_TRUSTED_HOST
      value: nexus.local
    - name: SSL_CERT_FILE
      value: /etc/ssl/certs/ca-certificates.crt
    image: ml-with-gpu-notebook:x86_64-ubuntu-22.04
    imagePullPolicy: Always
    lifecycle:
      postStart:
        exec:
          command:
          - sh
          - -c
          - |
            cp /etc/ssl/certs/ca-certificates.crt /opt/conda/lib/python3.11/site-packages/certifi/cacert.pem
    name: notebook
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        memory: "1073741824"
        nvidia.com/gpu: "1"
    securityContext:
      allowPrivilegeEscalation: false
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /etc/ssl/certs/ca-certificates.crt
      name: files
      subPath: ca-certificates.crt
    - mountPath: /home/jovyan
      name: volume-ikol006
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - iptables
    - --append
    - OUTPUT
    - --protocol
    - tcp
    - --destination
    - 111.127.222.127
    - --destination-port
    - "80"
    - --jump
    - DROP
    image: quay.io/jupyterhub/k8s-network-tools:4.3.2
    imagePullPolicy: IfNotPresent
    name: block-cloud-metadata
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  preemptionPolicy: PreemptLowerPriority
  priority: 1000
  priorityClassName: develop
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: hub.jupyter.org/dedicated
    operator: Equal
    value: user
  - effect: NoSchedule
    key: hub.jupyter.org_dedicated
    operator: Equal
    value: user
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: files
    secret:
      defaultMode: 420
      items:
      - key: ca-certificates.crt
        mode: 420
        path: ca-certificates.crt
      secretName: singleuser
  - name: volume-ikol006
    persistentVolumeClaim:
      claimName: claim-ikol006
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2026-01-22T03:06:15Z"
    message: '0/8 nodes are available: 1 Insufficient nvidia.com/gpu. preemption:
      0/8 nodes are available: 1 No preemption victims found for incoming pod, 7 Preemption
      is not helpful for scheduling..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

Can you try creating your gpu-test pod with these requests as well as limits?


From JupyterHub, or just as a standalone resource pod?

The test pod that you said was working in post #7 of this thread.

Try creating that pod immediately after your JupyterHub pod fails to spawn

The test pod can be created under any circumstances, even right after a Jupyter pod fails to spawn or crashes.

Creating a test pod with only requests and no limits doesn't work:

The Pod “gpu-test” is invalid: spec.containers[0].resources.limits: Required value: Limit must be set for non overcommitable resources

If I specify only limits in the test pod, the requests section still appears in the manifest obtained via kubectl get pods -o yaml.
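That matches how extended resources behave: they are non-overcommittable, so when only the limit is set, the API server copies it into requests. A quick way to confirm (a sketch; assumes the gpu-test pod from earlier is still present):

```shell
# Show the resources the API server actually stored for the container;
# requests will mirror limits for nvidia.com/gpu even if you omitted them.
kubectl get pod gpu-test -o jsonpath='{.spec.containers[0].resources}'
```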

I recall there being some subtlety around this: declare only the nvidia limit, not the request, and see if it works.

Yes, that’s correct — I specify only limits, and requests are added automatically in the pod manifest.


This will be quite tedious, but can you manually (not via JupyterHub) create a pod based on the YAML above, but:

  • Change args to jupyterlab, which should allow the pod to run as a standalone JupyterLab server
  • Delete the fields managed by Kubernetes (e.g. creationTimestamp, status, etc.)

Presumably that should fail with the same error as in your first post (Insufficient nvidia.com/gpu). Then try removing fields until you eventually approach your working example above.
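One way to make the first step less tedious is to strip the server-managed fields mechanically before editing. A sketch using yq v4 (an assumption about your tooling; adjust the field list as needed):

```shell
# Dump the failing pod, drop fields owned by the API server,
# and save a clean copy for bisection.
kubectl get pod jupyter-ikol006 -n jupyterhub -o yaml \
  | yq 'del(.metadata.creationTimestamp, .metadata.resourceVersion,
            .metadata.uid, .metadata.managedFields, .status)' \
  > jupyter-pod-clean.yaml
```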


I removed the server-managed metadata such as status, creationTimestamp, and so on, and also removed the PVC attachment. After that, the pod started on the required node where GPU access is available.

So in the end, the manifest looked like this:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    hub.jupyter.org/jupyterhub-version: 5.4.3
    hub.jupyter.org/kubespawner-version: 7.0.0
    hub.jupyter.org/username: ikol006
  labels:
    app: jupyterhub
    app.kubernetes.io/component: singleuser-server
    app.kubernetes.io/instance: jupyterhub
    app.kubernetes.io/managed-by: kubespawner
    app.kubernetes.io/name: jupyterhub
    chart: jupyterhub-4.3.2
    component: singleuser-server
    helm.sh/chart: jupyterhub-4.3.2
    hub.jupyter.org/network-access-hub: "true"
    hub.jupyter.org/servername: ""
    hub.jupyter.org/username: ikol006
    release: jupyterhub
  name: jupyter-ikol006
  namespace: test
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: hub.jupyter.org/node-purpose
            operator: In
            values:
            - user
        weight: 100
  automountServiceAccountToken: false
  containers:
  - args:
    - jupyterhub-singleuser
    env:
    - name: JPY_API_TOKEN
      value: Laiv5Jie0Fie9Ue3auf5ohnge
    - name: JUPYTERHUB_ACTIVITY_URL
      value: http://hub:8081/hub/api/users/ikol006/activity
    - name: JUPYTERHUB_ADMIN_ACCESS
      value: "1"
    - name: JUPYTERHUB_API_TOKEN
      value: Laiv5Jie0Fie9Ue3auf5ohnge
    - name: JUPYTERHUB_API_URL
      value: http://hub:8081/hub/api
    - name: JUPYTERHUB_BASE_URL
      value: /
    - name: JUPYTERHUB_CLIENT_ID
      value: jupyterhub-user-ikol006
    - name: JUPYTERHUB_COOKIE_HOST_PREFIX_ENABLED
      value: "0"
    - name: JUPYTERHUB_DEBUG
      value: "1"
    - name: JUPYTERHUB_HOST
    - name: JUPYTERHUB_OAUTH_ACCESS_SCOPES
      value: '["access:servers!server=ikol006/", "access:servers!user=ikol006"]'
    - name: JUPYTERHUB_OAUTH_CALLBACK_URL
      value: /user/ikol006/oauth_callback
    - name: JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES
      value: '[]'
    - name: JUPYTERHUB_OAUTH_SCOPES
      value: '["access:servers!server=ikol006/", "access:servers!user=ikol006"]'
    - name: JUPYTERHUB_PUBLIC_HUB_URL
    - name: JUPYTERHUB_PUBLIC_URL
    - name: JUPYTERHUB_SERVER_NAME
    - name: JUPYTERHUB_SERVICE_PREFIX
      value: /user/ikol006/
    - name: JUPYTERHUB_SERVICE_URL
      value: http://0.0.0.0:8888/user/ikol006/
    - name: JUPYTERHUB_USER
      value: ikol006
    - name: JUPYTER_IMAGE
      value: registry.mycompany.com/internals/jupyterhub/ml-with-gpu-notebook:x86_64-ubuntu-22.04
    - name: JUPYTER_IMAGE_SPEC
      value: registry.mycompany.com/internals/jupyterhub/ml-with-gpu-notebook:x86_64-ubuntu-22.04
    - name: MEM_GUARANTEE
      value: "1073741824"
    - name: PIP_INDEX_URL
      value: https://repo.mycompany.com/repository/pypi-proxy/simple
    - name: PIP_TIMEOUT
      value: "60"
    - name: PIP_TRUSTED_HOST
      value: repo.mycompany.com
    - name: SSL_CERT_FILE
      value: /etc/ssl/certs/ca-certificates.crt
    image: jupyterhub/ml-with-gpu-notebook:x86_64-ubuntu-22.04
    imagePullPolicy: Always
    lifecycle:
      postStart:
        exec:
          command:
          - sh
          - -c
          - |
            cp /etc/ssl/certs/ca-certificates.crt /opt/conda/lib/python3.11/site-packages/certifi/cacert.pem
    name: notebook
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        memory: "1073741824"
        nvidia.com/gpu: "1"
    securityContext:
      allowPrivilegeEscalation: false
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    # volumeMounts:
    # - mountPath: /etc/ssl/certs/ca-certificates.crt
    #   name: files
    #   subPath: ca-certificates.crt
    # - mountPath: /home/jovyan
    #   name: volume-ikol006
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - iptables
    - --append
    - OUTPUT
    - --protocol
    - tcp
    - --destination
    - 111.222.333.444
    - --destination-port
    - "80"
    - --jump
    - DROP
    image: jupyterhub/k8s-network-tools:4.3.2
    imagePullPolicy: IfNotPresent
    name: block-cloud-metadata
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  preemptionPolicy: PreemptLowerPriority
  priority: 1000
  priorityClassName: develop
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: hub.jupyter.org/dedicated
    operator: Equal
    value: user
  - effect: NoSchedule
    key: hub.jupyter.org_dedicated
    operator: Equal
    value: user
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  # volumes:
  # - name: files
  #   secret:
  #     defaultMode: 420
  #     items:
  #     - key: ca-certificates.crt
  #       mode: 420
  #       path: ca-certificates.crt
  #     secretName: singleuser
  # - name: volume-ikol006
  #   persistentVolumeClaim:
  #     claimName: claim-ikol006

What storage provider are you using? Are persistent volumes always mountable from all nodes?


Is there a daemonset resource related to storage that runs on many nodes, but not on the GPU node because it doesn't tolerate a taint? Then make it tolerate the taint and try again!
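To check that quickly, you could compare the GPU node's taints against the toleration keys declared by every daemonset. A sketch (assumes jq is available and the node is named gpu-node, as in the earlier output):

```shell
# Taints on the GPU node
kubectl get node gpu-node -o jsonpath='{.spec.taints}'

# Toleration keys declared by each daemonset in the cluster;
# "*" marks a keyless (match-everything) toleration.
kubectl get ds -A -o json | jq -r '
  .items[]
  | [.metadata.namespace, .metadata.name,
     ([.spec.template.spec.tolerations[]?.key] | map(. // "*") | join(","))]
  | @tsv'
```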

I’m using the vSphere CSI driver, but I removed the PVC from the manifest since I didn’t create any PVCs for the pod.
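If the PVC turns out to be the blocker, it may still be worth verifying that the vSphere CSI node plugin (which also runs as a daemonset) is registered on the GPU node, since a missing driver there would make volumes unmountable from that node. A sketch:

```shell
# List which CSI drivers are registered on each node; the vSphere driver
# (csi.vsphere.vmware.com) should appear on the GPU node as well.
kubectl get csinodes -o custom-columns=NAME:.metadata.name,DRIVERS:".spec.drivers[*].name"
```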