JupyterHub pods all going to only one node on the cluster

Hello All,
Need help figuring out why my deployment is sending all pods to the same node. I tested the cluster by deploying 28 nginx pods and they were evenly spread across the cluster. I tested both chart versions 2.0.0 and 3.0.1. My config is:

debug:
  enabled: true

scheduling:
  userScheduler:
    enabled: true
  podPriority:
    enabled: true
  userPlaceholder:
    enabled: true
    replicas: 4
  userPods:
    nodeAffinity:
      matchNodePurpose: require

cull:
  enabled: true
  timeout: 3600
  every: 300

prePuller:
  continuous:
    enabled: false
  hook:
    enabled: false

I also tested with different combinations of the scheduling settings. Any ideas?
Thanks!

The Z2JH user scheduler tries to pack as many user pods as possible onto the smallest number of nodes so that unused nodes can be scaled down; see the Optimizations page in the Z2JH documentation.

Try disabling it to use the default K8s scheduler.
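For example, something like this in your config should switch user pods back to the default Kubernetes scheduler (a minimal sketch, adjust for your chart version):

scheduling:
  userScheduler:
    enabled: false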

Hi Manics,

Thank you for the link. I have tried all the combinations of those settings that I could see. I even deleted the deployment between changes. Any other ideas or information I can post?

Thank you!
Tony

Also wanted to add that the documentation seems to say that I can use this:

singleuser:
  schedulerStrategy: spread

But when upgrading I get this error:

main.newUpgradeCmd.func2
	helm.sh/helm/v3/cmd/helm/upgrade.go:209
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra@v1.6.1/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra@v1.6.1/command.go:1044
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra@v1.6.1/command.go:968
main.main
	helm.sh/helm/v3/cmd/helm/helm.go:83
runtime.main
	runtime/proc.go:250
runtime.goexit
	runtime/asm_amd64.s:1598

Which documentation are you looking at? This isn’t mentioned in the Optimizations page of the Zero to JupyterHub with Kubernetes documentation.

Have you tried looking at your K8s logs and events for your singleuser and other (nginx) pods and comparing them? There may be clues as to why K8s has chosen particular nodes.

Can you show us the output of kubectl get pod <podname> -o yaml for a singleuser and nginx pod for comparison?

Thank you again for looking at this.

Regarding the “schedulerStrategy: spread”, in desperation I was reading this:
https://test-zerotojh.readthedocs.io/en/edit-awseks/optimization.html

The non-working pod (a singleuser server that landed on jup5):

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: 3308882e8c3234a7032b8f4687113888429e1470fa13313ab5f2688ef3d22cac
    cni.projectcalico.org/podIP: 10.244.4.172/32
    cni.projectcalico.org/podIPs: 10.244.4.172/32
    hub.jupyter.org/username: tony_cricelli
  creationTimestamp: "2023-09-01T21:39:04Z"
  labels:
    app: jupyterhub
    chart: jupyterhub-2.0.0
    component: singleuser-server
    heritage: jupyterhub
    hub.jupyter.org/network-access-hub: "true"
    hub.jupyter.org/servername: ""
    hub.jupyter.org/username: tony-5fcricelli
    release: ugba88
  name: jupyter-tony-5fcricelli
  namespace: ugba88
  resourceVersion: "268926"
  uid: 7cf7ad9e-c2ca-4661-a7de-f3750e7e56ea
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: hub.jupyter.org/node-purpose
            operator: In
            values:
            - user
        weight: 100
  automountServiceAccountToken: false
  containers:
  - env:
    - name: CPU_GUARANTEE
      value: "0.5"
    - name: CPU_LIMIT
      value: "2.0"
    - name: JPY_API_TOKEN
      value: 0fc02e85cf6747da85e243be5d634d04
    - name: JUPYTERHUB_ACTIVITY_URL
      value: http://hub:8081/hub/api/users/tony_cricelli/activity
    - name: JUPYTERHUB_ADMIN_ACCESS
      value: "1"
    - name: JUPYTERHUB_API_TOKEN
      value: 0fc02e85cf6747da85e243be5d634d04
    - name: JUPYTERHUB_API_URL
      value: http://hub:8081/hub/api
    - name: JUPYTERHUB_BASE_URL
      value: /
    - name: JUPYTERHUB_CLIENT_ID
      value: jupyterhub-user-tony_cricelli
    - name: JUPYTERHUB_DEBUG
      value: "1"
    - name: JUPYTERHUB_DEFAULT_URL
      value: /tree/
    - name: JUPYTERHUB_HOST
    - name: JUPYTERHUB_OAUTH_ACCESS_SCOPES
      value: '["access:servers!server=tony_cricelli/", "access:servers!user=tony_cricelli"]'
    - name: JUPYTERHUB_OAUTH_CALLBACK_URL
      value: /user/tony_cricelli/oauth_callback
    - name: JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES
      value: '[]'
    - name: JUPYTERHUB_OAUTH_SCOPES
      value: '["access:servers!server=tony_cricelli/", "access:servers!user=tony_cricelli"]'
    - name: JUPYTERHUB_SERVER_NAME
    - name: JUPYTERHUB_SERVICE_PREFIX
      value: /user/tony_cricelli/
    - name: JUPYTERHUB_SERVICE_URL
      value: http://0.0.0.0:8888/user/tony_cricelli/
    - name: JUPYTERHUB_SINGLEUSER_APP
      value: notebook.notebookapp.NotebookApp
    - name: JUPYTERHUB_USER
      value: tony_cricelli
    - name: JUPYTER_IMAGE
      value: montereytony/ugba88:jup8-23-fall-v16
    - name: JUPYTER_IMAGE_SPEC
      value: montereytony/ugba88:jup8-23-fall-v16
    - name: MEM_GUARANTEE
      value: "1073741824"
    - name: MEM_LIMIT
      value: "6442450944"
    image: montereytony/ugba88:jup8-23-fall-v16
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - sh
          - -c
          - |
            mkdir -p my-work; #        /bin/sh /tmp/fixer.sh
    name: notebook
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        memory: "6442450944"
      requests:
        cpu: 500m
        memory: "1073741824"
    securityContext:
      allowPrivilegeEscalation: true
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/jovyan
      name: home
      subPath: homes/tony-5fcricelli
    - mountPath: /home/jovyan/shared
      name: jupyterhub-shared
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - iptables
    - -A
    - OUTPUT
    - -d
    - 169.254.169.254
    - -j
    - DROP
    image: jupyterhub/k8s-network-tools:2.0.0
    imagePullPolicy: IfNotPresent
    name: block-cloud-metadata
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  nodeName: jup5
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  priorityClassName: ugba88-default-priority
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: hub.jupyter.org/dedicated
    operator: Equal
    value: user
  - effect: NoSchedule
    key: hub.jupyter.org_dedicated
    operator: Equal
    value: user
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: home
    persistentVolumeClaim:
      claimName: ugba88-pvc
  - name: jupyterhub-shared
    persistentVolumeClaim:
      claimName: ugba88-shared-pvc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T21:39:05Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T21:39:06Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T21:39:06Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T21:39:04Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://c143d4357d2f2871c594f9a003ea902d14287e8f7d250d1ef356f2c7ea9cafbf
    image: docker.io/montereytony/ugba88:jup8-23-fall-v16
    imageID: docker.io/montereytony/ugba88@sha256:021792c506eb22f4c3560ca1e2e1994814f6dc925a7413ff7400a1884ec424a8
    lastState: {}
    name: notebook
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-09-01T21:39:05Z"
  hostIP: 192.168.2.145
  initContainerStatuses:
  - containerID: containerd://0d07504ddd47b4256c9d84f357d0deaa6e21016364165c2b57c5d65c3502cd39
    image: docker.io/jupyterhub/k8s-network-tools:2.0.0
    imageID: docker.io/jupyterhub/k8s-network-tools@sha256:ab4172a025721495c0c65bd2a6165a6cd625bae39e0e5231c06e149c2ffc5dab
    lastState: {}
    name: block-cloud-metadata
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://0d07504ddd47b4256c9d84f357d0deaa6e21016364165c2b57c5d65c3502cd39
        exitCode: 0
        finishedAt: "2023-09-01T21:39:04Z"
        reason: Completed
        startedAt: "2023-09-01T21:39:04Z"
  phase: Running
  podIP: 10.244.4.172
  podIPs:
  - ip: 10.244.4.172
  qosClass: Burstable
  startTime: "2023-09-01T21:39:04Z"

I did label my worker nodes with hub.jupyter.org/node-purpose=user

Here is a pod from the working nginx deployment, where the pods are evenly distributed across the nodes:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: e48b4342343612406f065d682ae87b34d9dff64a8090c33df7bb8080a787c2bf
    cni.projectcalico.org/podIP: 10.244.4.209/32
    cni.projectcalico.org/podIPs: 10.244.4.209/32
  creationTimestamp: "2023-09-01T22:10:56Z"
  generateName: nginx-deployment-6595874d85-
  labels:
    app: nginx
    pod-template-hash: 6595874d85
  name: nginx-deployment-6595874d85-44sg6
  namespace: ugba88
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: nginx-deployment-6595874d85
    uid: 980f0bdc-f270-41df-9b73-058994c5402b
  resourceVersion: "274383"
  uid: 58063796-8bbb-4a8b-a40d-b5c58bb231f9
spec:
  containers:
  - image: nginx:1.14.2
    imagePullPolicy: IfNotPresent
    name: nginx
    ports:
    - containerPort: 80
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-8zl52
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: jup5
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-8zl52
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T22:10:56Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T22:11:02Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T22:11:02Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T22:10:56Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://a647ead1c00fe93e4bb2656edd2da2c39a6bc803468119c5df5cbca8cfd760b3
    image: docker.io/library/nginx:1.14.2
    imageID: docker.io/library/nginx@sha256:f7988fb6c02e0ce69257d9bd9cf37ae20a60f1df7563c3a2a6abe24160306b8d
    lastState: {}
    name: nginx
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2023-09-01T22:11:02Z"
  hostIP: 192.168.2.145
  phase: Running
  podIP: 10.244.4.209
  podIPs:
  - ip: 10.244.4.209
  qosClass: BestEffort
  startTime: "2023-09-01T22:10:56Z"

Here is a pending pod that should have been scheduled on a different node:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    hub.jupyter.org/username: xxxxxxxxx
  creationTimestamp: "2023-09-01T22:19:53Z"
  labels:
    app: jupyterhub
    chart: jupyterhub-2.0.0
    component: singleuser-server
    heritage: jupyterhub
    hub.jupyter.org/network-access-hub: "true"
    hub.jupyter.org/servername: ""
    hub.jupyter.org/username: xxxxxxxxx
    release: ugba88
  name: jupyter-xxxxxxxx
  namespace: ugba88
  resourceVersion: "275786"
  uid: 62c3019b-18ce-4a62-b907-1c2fbff34551
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: hub.jupyter.org/node-purpose
            operator: In
            values:
            - user
        weight: 100
  automountServiceAccountToken: false
  containers:
  - env:
    - name: CPU_GUARANTEE
      value: "0.5"
    - name: CPU_LIMIT
      value: "2.0"
    - name: JPY_API_TOKEN
      value: fb7020912758439dbc9964babfde146b
    - name: JUPYTERHUB_ACTIVITY_URL
      value: http://hub:8081/hub/api/users/xxxxxxxxx/activity
    - name: JUPYTERHUB_ADMIN_ACCESS
      value: "1"
    - name: JUPYTERHUB_API_TOKEN
      value: fb7020912758439dbc9964babfde146b
    - name: JUPYTERHUB_API_URL
      value: http://hub:8081/hub/api
    - name: JUPYTERHUB_BASE_URL
      value: /
    - name: JUPYTERHUB_CLIENT_ID
      value: jupyterhub-user-sarikapasumarthy
    - name: JUPYTERHUB_DEBUG
      value: "1"
    - name: JUPYTERHUB_DEFAULT_URL
      value: /tree/
    - name: JUPYTERHUB_HOST
    - name: JUPYTERHUB_OAUTH_ACCESS_SCOPES
      value: '["access:servers!server=xxxxxxx/", "access:servers!user=xxxxxxxxx"]'
    - name: JUPYTERHUB_OAUTH_CALLBACK_URL
      value: /user/xxxxxxx/oauth_callback
    - name: JUPYTERHUB_OAUTH_CLIENT_ALLOWED_SCOPES
      value: '[]'
    - name: JUPYTERHUB_OAUTH_SCOPES
      value: '["access:servers!server=xxxxxxx/", "access:servers!user=xxxxxxx"]'
    - name: JUPYTERHUB_SERVER_NAME
    - name: JUPYTERHUB_SERVICE_PREFIX
      value: /user/xxxxxx/
    - name: JUPYTERHUB_SERVICE_URL
      value: http://0.0.0.0:8888/user/xxxxxx/
    - name: JUPYTERHUB_SINGLEUSER_APP
      value: notebook.notebookapp.NotebookApp
    - name: JUPYTERHUB_USER
      value: xxxxx
    - name: JUPYTER_IMAGE
      value: montereytony/ugba88:jup8-23-fall-v16
    - name: JUPYTER_IMAGE_SPEC
      value: montereytony/ugba88:jup8-23-fall-v16
    - name: MEM_GUARANTEE
      value: "1073741824"
    - name: MEM_LIMIT
      value: "6442450944"
    image: montereytony/ugba88:jup8-23-fall-v16
    imagePullPolicy: IfNotPresent
    lifecycle:
      postStart:
        exec:
          command:
          - sh
          - -c
          - |
            mkdir -p my-work; #        /bin/sh /tmp/fixer.sh
    name: notebook
    ports:
    - containerPort: 8888
      name: notebook-port
      protocol: TCP
    resources:
      limits:
        cpu: "2"
        memory: "6442450944"
      requests:
        cpu: 500m
        memory: "1073741824"
    securityContext:
      allowPrivilegeEscalation: true
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/jovyan
      name: home
      subPath: homes/xxxxxx
    - mountPath: /home/jovyan/shared
      name: jupyterhub-shared
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - iptables
    - -A
    - OUTPUT
    - -d
    - 169.254.169.254
    - -j
    - DROP
    image: jupyterhub/k8s-network-tools:2.0.0
    imagePullPolicy: IfNotPresent
    name: block-cloud-metadata
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  priorityClassName: ugba88-default-priority
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 100
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: hub.jupyter.org/dedicated
    operator: Equal
    value: user
  - effect: NoSchedule
    key: hub.jupyter.org_dedicated
    operator: Equal
    value: user
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: home
    persistentVolumeClaim:
      claimName: ugba88-pvc
  - name: jupyterhub-shared
    persistentVolumeClaim:
      claimName: ugba88-shared-pvc
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-09-01T22:19:53Z"
    message: '0/8 nodes are available: 1 Insufficient cpu, 1 node(s) had untolerated
      taint {node-role.kubernetes.io/master: }, 6 node(s) had volume node affinity
      conflict. preemption: 0/8 nodes are available: 1 No preemption victims found
      for incoming pod, 7 Preemption is not helpful for scheduling.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

All pods are in the same namespace.

Here is the output of kubectl get pods -o wide, which shows the JupyterHub pods all on jup5 and the nginx pods spread across the nodes:

jupyter-xxnran                      1/1     Running   0             31m   10.244.4.199   jup5     <none>           <none>
jupyter-xxuntsier                   1/1     Running   0             31m   10.244.4.174   jup5     <none>           <none>
jupyter-xxora05                     1/1     Running   0             31m   10.244.4.200   jup5     <none>           <none>
jupyter-siyabudddev                 1/1     Running   0             31m   10.244.4.204   jup5     <none>           <none> 
jupyter-xxcratesj-2xxsorio          1/1     Running   0             31m   10.244.4.185   jup5     <none>           <none>
jupyter-xxota                       1/1     Running   0             31m   10.244.4.183   jup5     <none>           <none>
jupyter-xxny-5fcxxcelli             1/1     Running   0             32m   10.244.4.172   jup5     <none>           <none>
jupyter-xxnsh2004                   1/1     Running   0             33m   10.244.4.171   jup5     <none>           <none>
jupyter-xxxinia-2exu                1/1     Running   0             31m   10.244.4.208   jup5     <none>           <none>
jupyter-xxxxyons10                  1/1     Running   0             31m   10.244.4.195   jup5     <none>           <none>
jupyter-xxxxisura                   1/1     Running   0             31m   10.244.4.201   jup5     <none>           <none>
jupyter-xxxxanli                    1/1     Running   0             31m   10.244.4.198   jup5     <none>           <none>
nginx-deployment-6595874d85-44sg6   1/1     Running   0             11s   10.244.4.209   jup5     <none>           <none>
nginx-deployment-6595874d85-56r7x   1/1     Running   0             11s   10.244.5.71    jup6     <none>           <none>
nginx-deployment-6595874d85-6bjtj   1/1     Running   0             12s   10.244.2.28    jup3     <none>           <none>
nginx-deployment-6595874d85-6d78p   1/1     Running   0             11s   10.244.1.37    jup2     <none>           <none>
nginx-deployment-6595874d85-77nwc   1/1     Running   0             11s   10.244.1.38    jup2     <none>           <none>
nginx-deployment-6595874d85-7r8l7   1/1     Running   0             11s   10.244.5.68    jup6     <none>           <none>
nginx-deployment-6595874d85-7sqqh   1/1     Running   0             11s   10.244.1.36    jup2     <none>           <none>
nginx-deployment-6595874d85-88v5r   1/1     Running   0             11s   10.244.6.21    jup7     <none>           <none>
nginx-deployment-6595874d85-8m79n   1/1     Running   0             11s   10.244.3.35    jup4     <none>           <none>
nginx-deployment-6595874d85-9zwsw   1/1     Running   0             11s   10.244.7.31    jup9     <none>           <none>
nginx-deployment-6595874d85-g2mpc   1/1     Running   0             11s   10.244.5.70    jup6     <none>           <none>
nginx-deployment-6595874d85-gndmr   1/1     Running   0             11s   10.244.3.34    jup4     <none>           <none>
nginx-deployment-6595874d85-gplrg   1/1     Running   0             11s   10.244.7.32    jup9     <none>           <none>
nginx-deployment-6595874d85-kztvb   1/1     Running   0             11s   10.244.2.30    jup3     <none>           <none>
nginx-deployment-6595874d85-mgx4p   1/1     Running   0             12s   10.244.6.19    jup7     <none>           <none>
nginx-deployment-6595874d85-msqsl   1/1     Running   0             11s   10.244.4.210   jup5     <none>           <none>
nginx-deployment-6595874d85-rbsbp   1/1     Running   0             11s   10.244.2.29    jup3     <none>           <none>
nginx-deployment-6595874d85-rcvkw   1/1     Running   0             12s   10.244.7.30    jup9     <none>           <none>
nginx-deployment-6595874d85-rk9bj   1/1     Running   0             11s   10.244.3.33    jup4     <none>           <none>
nginx-deployment-6595874d85-tp9kw   1/1     Running   0             11s   10.244.6.20    jup7     <none>           <none>
nginx-deployment-6595874d85-vdxpr   1/1     Running   0             11s   10.244.5.69    jup6     <none>           <none>
proxy-6bc5f57fd7-9g6nf              1/1     Running   0             34m   10.244.3.32    jup4     <none>           <none>

How many nodes are labelled with this?

The default value scheduling.userPods.nodeAffinity.matchNodePurpose="prefer"

should mean the pods are spread over all nodes with that label. Try setting it to ignore, or alternatively remove the hub.jupyter.org/node-purpose=user label from your node(s)?
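For the former, roughly this in your values (a sketch, untested against your setup):

scheduling:
  userPods:
    nodeAffinity:
      matchNodePurpose: ignore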

This might also be a problem: how is your dynamic storage set up? Some storage controllers create volumes that are tied to a single node.
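For example, a PV that is pinned to a single node will usually show a stanza like this in kubectl get pv <name> -o yaml (illustrative only; the key and values depend on the provisioner):

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - some-node  # the volume can only be used from this node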

I have 7 nodes with that label. I just tested with ignore, and also with removing hub.jupyter.org/node-purpose=user, with no joy. I think you are onto something with the storage. I will test that next. It is pretty much the only thing I have not looked at. Thanks again!

I think you are correct that it is the storage, but I am not able to figure it out. First I tried:

singleuser:
  storage:
    type: none

I also tried

singleuser:
  storage:
    dynamic:
      storageClass: ugba88-2-sc

Each time I started up 150 users, and they always went to the same node.

I have NFS-mounted common storage on all the nodes, so I assumed that since the storage is “local” I could just point to it.

I defined a storage class:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ugba88-2-sc
provisioner: kubernetes.io/no-provisioner
parameters:
  # The path to the local storage on the node.
  local: /mnt/ist/jhub-stor/2023/fall/ugba88/

Then I defined a PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ugba88-2-pv
spec:
  capacity:
    storage: 300Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ugba88-2-sc
  local:
    path: /mnt/ist/jhub-stor/2023-fall/ugba88
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - jup2
          - jup3
          - jup4
          - jup5
          - jup6
          - jup7
          - jup8
          - jup9

and I defined a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ugba88-2-pvc
spec:
  storageClassName: ugba88-2-sc
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Is there a better way to do this? Many years ago I was able to use a hostPath directory in the JupyterHub YAML, which was less complicated :slight_smile:

I think the problem may be related to the access modes. I am going to destroy everything and rebuild, making sure the modes are all ReadWriteMany on the storage.
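Roughly what I have in mind is a statically provisioned NFS PV/PVC with ReadWriteMany, something like this (untested sketch; the NFS server and export path are placeholders, not my real values):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ugba88-2-pv
spec:
  capacity:
    storage: 300Gi
  accessModes:
  - ReadWriteMany
  storageClassName: ugba88-2-sc
  nfs:
    server: nfs.example.internal          # placeholder NFS server
    path: /jhub-stor/2023-fall/ugba88     # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ugba88-2-pvc
spec:
  storageClassName: ugba88-2-sc
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi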

This issue is solved. Thank you for the advice and suggestions. I made the following changes:

Redefined the storageclass, PV and PVCs and made sure all storage was set to ReadWriteMany

Removed node label: hub.jupyter.org/node-purpose=user

Changed my config to:

scheduling:
  userScheduler:
    enabled: false
  userPods:
    nodeAffinity:
      # matchNodePurpose valid options:
      # - ignore
      # - prefer (the default)
      # - require
      matchNodePurpose: ignore
  corePods:
    nodeAffinity:
      matchNodePurpose: ignore

I then did a “start all” in the control panel, and 150 single-user servers were launched and spread evenly across my nodes.

Thank you again!
Tony
