Hub not starting on bare metal cluster: Readiness probe failed / api_request to the proxy failed with status code 599

Dear community,

I’ve been trying to set up JupyterHub on my k8s cluster for quite some time now, but I haven’t got it running yet. I suspect it is some kind of networking issue, but all the other topics here, on GitLab, or on SO end without a solution, or with one that doesn’t help in my case, e.g. here or here. That’s why I’m asking you for support.

My setup is as follows:

I have two nodes in my setup: one control-plane node called gpu-0-bio (which actually has no GPU) and another node called gpu-3-bio. (Just in case you wonder: gpu-1-bio and gpu-2-bio exist as well, but I disconnected them to keep the setup simpler.) I’m using Calico as the CNI.

$ kubectl get pods -A
NAMESPACE      NAME                                       READY   STATUS             RESTARTS        AGE
gpu-operator   nvidia-device-plugin-1658416951-trbgz      1/1     Running            0               8m42s
gpu-operator   nvidia-device-plugin-1658416951-zvrrw      0/1     CrashLoopBackOff   6 (2m34s ago)   8m42s
jhub           continuous-image-puller-vwv22              1/1     Running            0               29m
jhub           continuous-image-puller-wmts6              1/1     Running            0               29m
jhub           hook-image-awaiter-25fsd                   0/1     Error              0               29m
jhub           hook-image-awaiter-dq46j                   0/1     Error              0               40m
jhub           hook-image-awaiter-fkkqc                   0/1     Error              0               27m
jhub           hook-image-awaiter-jkmtc                   0/1     Error              0               34m
jhub           hook-image-awaiter-k2pqx                   0/1     Error              0               23m
jhub           hook-image-awaiter-lgmr9                   0/1     Error              0               25m
jhub           hook-image-awaiter-n6g6w                   0/1     Error              0               20m
jhub           hook-image-puller-vs684                    1/1     Running            0               40m
jhub           hook-image-puller-xzxzp                    1/1     Running            0               40m
jhub           hub-7c5cc995fd-hhbx4                       0/1     CrashLoopBackOff   6 (8s ago)      9m36s
jhub           proxy-7f9c944765-sql72                     1/1     Running            0               11m
jhub           user-scheduler-7c57c8b84d-6m7ll            1/1     Running            0               29m
jhub           user-scheduler-7c57c8b84d-tnmdj            1/1     Running            0               29m
kube-system    calico-kube-controllers-555bc4b957-lw9pb   1/1     Running            2 (3h42m ago)   5h13m
kube-system    calico-node-qzdhq                          1/1     Running            1 (3h43m ago)   5h13m
kube-system    calico-node-z6gg6                          1/1     Running            1 (3h43m ago)   5h13m
kube-system    coredns-6d4b75cb6d-fth2t                   1/1     Running            1 (3h43m ago)   5h13m
kube-system    coredns-6d4b75cb6d-nws6s                   1/1     Running            1 (3h43m ago)   5h13m
kube-system    etcd-gpu-0-bio                             1/1     Running            1 (3h43m ago)   5h13m
kube-system    kube-apiserver-gpu-0-bio                   1/1     Running            1 (3h43m ago)   5h13m
kube-system    kube-controller-manager-gpu-0-bio          1/1     Running            1 (3h43m ago)   5h13m
kube-system    kube-proxy-5v7tc                           1/1     Running            1 (3h43m ago)   5h13m
kube-system    kube-proxy-d42zh                           1/1     Running            1 (3h43m ago)   5h13m
kube-system    kube-scheduler-gpu-0-bio                   1/1     Running            1 (3h43m ago)   5h13m
$ kubectl describe pod -n jhub hub-7c5cc995fd-hhbx4 
Name:         hub-7c5cc995fd-hhbx4
Namespace:    jhub
Priority:     0
Node:         gpu-3-bio/10.162.15.45
Start Time:   Thu, 21 Jul 2022 17:21:38 +0200
Labels:       app=jupyterhub
              component=hub
              hub.jupyter.org/network-access-proxy-api=true
              hub.jupyter.org/network-access-proxy-http=true
              hub.jupyter.org/network-access-singleuser=true
              pod-template-hash=7c5cc995fd
              release=jhub1
Annotations:  checksum/config-map: 2655ca5c5669782f1e9645c88a8580d99db2cff7592bd2452f39886fe35201a9
              checksum/secret: aee640eb515de506b406d466e32da1c5382f054c1f81711cb52ebc21920acabc
Status:       Running
IP:           172.17.0.3
IPs:
  IP:           172.17.0.3
Controlled By:  ReplicaSet/hub-7c5cc995fd
Containers:
  hub:
    Container ID:  docker://aa7abdf1a6dd9d6dfced1bef89145d6d8287ac329665275b0492304f8457c690
    Image:         jupyterhub/k8s-hub:1.2.0
    Image ID:      docker-pullable://jupyterhub/k8s-hub@sha256:e4770285aaf7230b930643986221757c2cc2e9420f5e21ac892582c96a57ce1c
    Port:          8081/TCP
    Host Port:     0/TCP
    Args:
      jupyterhub
      --config
      /usr/local/etc/jupyterhub/jupyterhub_config.py
      --debug
      --upgrade-db
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 21 Jul 2022 17:25:28 +0200
      Finished:     Thu, 21 Jul 2022 17:25:56 +0200
    Ready:          False
    Restart Count:  4
    Liveness:       http-get http://:http/hub/health delay=300s timeout=3s period=10s #success=1 #failure=30
    Readiness:      http-get http://:http/hub/health delay=0s timeout=1s period=2s #success=1 #failure=1000
    Environment:
      PYTHONUNBUFFERED:        1
      HELM_RELEASE_NAME:       jhub1
      POD_NAMESPACE:           jhub (v1:metadata.namespace)
      CONFIGPROXY_AUTH_TOKEN:  <set to the key 'hub.config.ConfigurableHTTPProxy.auth_token' in secret 'hub'>  Optional: false
    Mounts:
      /srv/jupyterhub from pvc (rw)
      /usr/local/etc/jupyterhub/config/ from config (rw)
      /usr/local/etc/jupyterhub/jupyterhub_config.py from config (rw,path="jupyterhub_config.py")
      /usr/local/etc/jupyterhub/secret/ from secret (rw)
      /usr/local/etc/jupyterhub/z2jh.py from config (rw,path="z2jh.py")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ggz94 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      hub
    Optional:  false
  secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  hub
    Optional:    false
  pvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  hub-db-dir
    ReadOnly:   false
  kube-api-access-ggz94:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 hub.jupyter.org/dedicated=core:NoSchedule
                             hub.jupyter.org_dedicated=core:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  Normal   Scheduled  5m27s                   default-scheduler  Successfully assigned jhub/hub-7c5cc995fd-hhbx4 to gpu-3-bio
  Normal   Pulled     4m53s (x2 over 5m26s)   kubelet            Container image "jupyterhub/k8s-hub:1.2.0" already present on machine
  Normal   Created    4m53s (x2 over 5m26s)   kubelet            Created container hub
  Normal   Started    4m53s (x2 over 5m26s)   kubelet            Started container hub
  Warning  Unhealthy  4m52s (x19 over 5m25s)  kubelet            Readiness probe failed: Get "http://172.17.0.3:8081/hub/health": dial tcp 172.17.0.3:8081: connect: connection refused
  Warning  BackOff    14s (x15 over 4m19s)    kubelet            Back-off restarting failed container
$ kubectl logs -n jhub hub-7c5cc995fd-hhbx4 
[D 2022-07-21 15:25:28.342 JupyterHub application:730] Looking for /usr/local/etc/jupyterhub/jupyterhub_config in /srv/jupyterhub
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
[D 2022-07-21 15:25:28.515 JupyterHub application:752] Loaded config file: /usr/local/etc/jupyterhub/jupyterhub_config.py
[I 2022-07-21 15:25:28.533 JupyterHub app:2479] Running JupyterHub version 1.5.0
[I 2022-07-21 15:25:28.533 JupyterHub app:2509] Using Authenticator: jupyterhub.auth.DummyAuthenticator-1.5.0
[I 2022-07-21 15:25:28.533 JupyterHub app:2509] Using Spawner: kubespawner.spawner.KubeSpawner-1.1.0
[I 2022-07-21 15:25:28.533 JupyterHub app:2509] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.5.0
[D 2022-07-21 15:25:28.533 JupyterHub app:1721] Connecting to db: sqlite:///jupyterhub.sqlite
[D 2022-07-21 15:25:28.540 JupyterHub orm:815] database schema version found: 4dc2d5a8c53c
[D 2022-07-21 15:25:28.544 JupyterHub orm:815] database schema version found: 4dc2d5a8c53c
[W 2022-07-21 15:25:28.546 JupyterHub app:1828] No admin users, admin interface will be unavailable.
[W 2022-07-21 15:25:28.546 JupyterHub app:1829] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2022-07-21 15:25:28.546 JupyterHub app:1858] Not using allowed_users. Any authenticated user will be allowed.
[D 2022-07-21 15:25:28.577 JupyterHub app:2010] Purging expired APITokens
[D 2022-07-21 15:25:28.579 JupyterHub app:2010] Purging expired OAuthAccessTokens
[D 2022-07-21 15:25:28.580 JupyterHub app:2010] Purging expired OAuthCodes
[D 2022-07-21 15:25:28.584 JupyterHub app:2133] Initializing spawners
[D 2022-07-21 15:25:28.585 JupyterHub app:2266] Loaded users:
    
[I 2022-07-21 15:25:28.585 JupyterHub app:2546] Initialized 0 spawners in 0.001 seconds
[I 2022-07-21 15:25:28.586 JupyterHub app:2758] Not starting proxy
[D 2022-07-21 15:25:28.586 JupyterHub proxy:832] Proxy: Fetching GET http://proxy-api:8001/api/routes
[W 2022-07-21 15:25:28.587 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:28.741 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:33.837 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:33.987 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:39.869 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:41.515 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:45.028 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:46.570 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:51.575 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:56.342 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[E 2022-07-21 15:25:56.343 JupyterHub app:2989]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2987, in launch_instance_async
        await self.start()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2762, in start
        await self.proxy.get_all_routes()
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 898, in get_all_routes
        resp = await self.api_request('', client=client)
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 862, in api_request
        result = await exponential_backoff(
      File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
        raise TimeoutError(fail_message)
    TimeoutError: Repeated api_request to proxy path "" failed.
    
[D 2022-07-21 15:25:56.346 JupyterHub application:834] Exiting application: jupyterhub
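
So the hub never gets an HTTP response from http://proxy-api:8001. For what it’s worth, a direct check of that service from a throwaway pod should separate DNS problems from routing problems; a sketch of what I have in mind (the label is needed to satisfy the chart’s NetworkPolicy for proxy-api access, and even a 403 would prove connectivity, since the CHP API expects an auth token):

# hit the proxy-api service from a scratch pod in the jhub namespace;
# a timeout points at DNS or the pod network, any HTTP status means it works
$ kubectl -n jhub run nettest --rm -it --restart=Never \
    --labels=hub.jupyter.org/network-access-proxy-api=true \
    --image=curlimages/curl \
    -- curl -sv -m 5 http://proxy-api:8001/api/routes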
$ kubectl -n jhub describe pod proxy-7f9c944765-sql72 
Name:         proxy-7f9c944765-sql72
Namespace:    jhub
Priority:     0
Node:         gpu-3-bio/10.162.15.45
Start Time:   Thu, 21 Jul 2022 17:19:23 +0200
Labels:       app=jupyterhub
              component=proxy
              hub.jupyter.org/network-access-hub=true
              hub.jupyter.org/network-access-singleuser=true
              pod-template-hash=7f9c944765
              release=jhub1
Annotations:  checksum/auth-token: a42c
              checksum/proxy-secret: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
Status:       Running
IP:           172.17.0.2
IPs:
  IP:           172.17.0.2
Controlled By:  ReplicaSet/proxy-7f9c944765
Containers:
  chp:
    Container ID:  docker://17cd38201108f684d836b286dfc1dfc15a8ec6ddca4aa363e168dc9bafbb841b
    Image:         jupyterhub/configurable-http-proxy:4.5.0
    Image ID:      docker-pullable://jupyterhub/configurable-http-proxy@sha256:8ced0a2f8073bd14e9d9609089c8144e95473c0d230a14ef49956500ac8d24ac
    Ports:         8000/TCP, 8001/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      configurable-http-proxy
      --ip=
      --api-ip=
      --api-port=8001
      --default-target=http://hub:$(HUB_SERVICE_PORT)
      --error-target=http://hub:$(HUB_SERVICE_PORT)/hub/error
      --port=8000
      --log-level=debug
    State:          Running
      Started:      Thu, 21 Jul 2022 17:19:24 +0200
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:http/_chp_healthz delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:http/_chp_healthz delay=0s timeout=1s period=2s #success=1 #failure=3
    Environment:
      CONFIGPROXY_AUTH_TOKEN:  <set to the key 'hub.config.ConfigurableHTTPProxy.auth_token' in secret 'hub'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rnvql (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  kube-api-access-rnvql:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 hub.jupyter.org/dedicated=core:NoSchedule
                             hub.jupyter.org_dedicated=core:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  9m27s  default-scheduler  Successfully assigned jhub/proxy-7f9c944765-sql72 to gpu-3-bio
  Normal  Pulled     9m27s  kubelet            Container image "jupyterhub/configurable-http-proxy:4.5.0" already present on machine
  Normal  Created    9m27s  kubelet            Created container chp
  Normal  Started    9m27s  kubelet            Started container chp

Journalctl on gpu-3-bio seems to hold some more information:

$ journalctl | tail -n 30
Jul 21 17:39:18 gpu-3-bio kubelet[1593]: I0721 17:39:18.931411    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:18 gpu-3-bio kubelet[1593]: E0721 17:39:18.931598    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:39:19 gpu-3-bio cri-dockerd[1839]: time="2022-07-21T17:39:19+02:00" level=info msg="Failed to read pod IP from plugin/docker: Couldn't find network status for jhub/hub-5fd96df75b-4c57q through plugin: invalid network status for"
Jul 21 17:39:24 gpu-3-bio kubelet[1593]: I0721 17:39:24.780385    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:24 gpu-3-bio kubelet[1593]: E0721 17:39:24.781358    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:39:35 gpu-3-bio kubelet[1593]: I0721 17:39:35.897564    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:35 gpu-3-bio kubelet[1593]: E0721 17:39:35.898490    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:39:49 gpu-3-bio kubelet[1593]: I0721 17:39:49.896654    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:49 gpu-3-bio kubelet[1593]: E0721 17:39:49.896855    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:00 gpu-3-bio kubelet[1593]: I0721 17:40:00.897837    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:00 gpu-3-bio kubelet[1593]: E0721 17:40:00.900315    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:15 gpu-3-bio kubelet[1593]: I0721 17:40:15.897367    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:15 gpu-3-bio kubelet[1593]: E0721 17:40:15.898310    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:26 gpu-3-bio kubelet[1593]: I0721 17:40:26.897279    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:26 gpu-3-bio kubelet[1593]: E0721 17:40:26.898229    1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:40 gpu-3-bio kubelet[1593]: I0721 17:40:40.897381    1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:40 gpu-3-bio systemd[1]: var-lib-docker-overlay2-709bbf521028fac943c90a240bdf4b5b3fdc44566fed1afa81e38a8200c45304\x2dinit-merged.mount: Deactivated successfully.
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944526779+02:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944558464+02:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944565178+02:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944640586+02:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/2d4f15c1ed3975f64106c023b73eaede35f5b4c6f7178f8249013d5bf0e6648e pid=451841 runtime=io.containerd.runc.v2
Jul 21 17:40:41 gpu-3-bio systemd[1]: Started libcontainer container 2d4f15c1ed3975f64106c023b73eaede35f5b4c6f7178f8249013d5bf0e6648e.
Jul 21 17:40:41 gpu-3-bio cri-dockerd[1839]: time="2022-07-21T17:40:41+02:00" level=info msg="Failed to read pod IP from plugin/docker: Couldn't find network status for jhub/hub-5fd96df75b-4c57q through plugin: invalid network status for"
Jul 21 17:40:46 gpu-3-bio NetworkManager[702]: <info>  [1658418046.0507] manager: NetworkManager state is now CONNECTED_SITE
Jul 21 17:40:46 gpu-3-bio dbus-daemon[700]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.15' (uid=0 pid=702 comm="/usr/sbin/NetworkManager --no-daemon " label="unconfined")
Jul 21 17:40:46 gpu-3-bio systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 21 17:40:46 gpu-3-bio dbus-daemon[700]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 21 17:40:46 gpu-3-bio systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 21 17:40:46 gpu-3-bio NetworkManager[702]: <info>  [1658418046.2966] manager: NetworkManager state is now CONNECTED_GLOBAL
Jul 21 17:40:56 gpu-3-bio systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
$ kubectl version

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:30:46Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:23:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
$ helm version

version.BuildInfo{Version:"v3.9.1", GitCommit:"a7c043acb5ff905c261cfdc923a35776ba5e66e4", GitTreeState:"clean", GoVersion:"go1.17.5"}

I wonder if it might have something to do with the RX packet drops on my network adapter eno1 (the output below is from gpu-0-bio)?

$ ifconfig

docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        inet6 fe80::42:2ff:fe64:c0d2  prefixlen 64  scopeid 0x20<link>
        ether 02:42:02:64:c0:d2  txqueuelen 0  (Ethernet)
        RX packets 97322  bytes 10453861 (10.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 101236  bytes 31099925 (31.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.162.15.116  netmask 255.255.255.0  broadcast 10.162.15.255
        inet6 fe80::bee1:9cbb:3569:eba  prefixlen 64  scopeid 0x20<link>
        ether 4c:52:62:a4:9f:04  txqueuelen 1000  (Ethernet)
        RX packets 3986279  bytes 908807048 (908.8 MB)
        RX errors 0  dropped 3317347  overruns 0  frame 0
        TX packets 173004  bytes 49126213 (49.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 16  memory 0x91200000-91220000  

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2235430  bytes 506963410 (506.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2235430  bytes 506963410 (506.9 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tunl0: flags=193<UP,RUNNING,NOARP>  mtu 1480
        inet 10.244.121.64  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth4ccd98e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::7c0e:1aff:fec9:8c9e  prefixlen 64  scopeid 0x20<link>
        ether 7e:0e:1a:c9:8c:9e  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 29  bytes 3353 (3.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth52867c0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::4c1a:86ff:fe29:9d9b  prefixlen 64  scopeid 0x20<link>
        ether 4e:1a:86:29:9d:9b  txqueuelen 0  (Ethernet)
        RX packets 8501  bytes 1183286 (1.1 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8932  bytes 3787738 (3.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth66490d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::bc5b:c6ff:fec2:b537  prefixlen 64  scopeid 0x20<link>
        ether be:5b:c6:c2:b5:37  txqueuelen 0  (Ethernet)
        RX packets 14516  bytes 1373723 (1.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15771  bytes 1524341 (1.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth8d69212: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::b498:7eff:fec1:3c46  prefixlen 64  scopeid 0x20<link>
        ether b6:98:7e:c1:3c:46  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 38  bytes 4120 (4.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth9fcd7f4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::ecba:d6ff:fe4a:6fbf  prefixlen 64  scopeid 0x20<link>
        ether ee:ba:d6:4a:6f:bf  txqueuelen 0  (Ethernet)
        RX packets 9856  bytes 821271 (821.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9696  bytes 9300398 (9.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethba6b718: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::84a4:9eff:fe7b:84a6  prefixlen 64  scopeid 0x20<link>
        ether 86:a4:9e:7b:84:a6  txqueuelen 0  (Ethernet)
        RX packets 14522  bytes 1373757 (1.3 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 15688  bytes 1520793 (1.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethbebb629: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::9ccb:f2ff:feac:cced  prefixlen 64  scopeid 0x20<link>
        ether 9e:cb:f2:ac:cc:ed  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 34  bytes 3749 (3.7 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
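
To see whether that drop counter on eno1 keeps climbing, a quick check like the following should do (ethtool counter names vary by NIC driver):

# rerun after a few minutes and compare the counters
$ ip -s link show eno1
$ ethtool -S eno1 | grep -iE 'drop|miss|err'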

My init command on the control plane:

sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/cri-dockerd.sock --apiserver-advertise-address=10.162.15.116
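
(Side note: if I understand correctly, Calico’s default IPv4 pool is 192.168.0.0/16 unless the manifest overrides it, so one check on my list is whether the pool actually matches the CIDR above:)

# should print 10.244.0.0/16 if the pool matches --pod-network-cidr
$ kubectl get ippools.crd.projectcalico.org default-ipv4-ippool \
    -o jsonpath='{.spec.cidr}'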

config.yaml for jupyterhub:

$ cat config.yaml

proxy:
  secretToken: "2fdeb3679d666277bdb1c93102a08f5b894774ba796e60af7957cb5677f40706"

prePuller:
  hook:
    enabled: false

singleuser:
  storage:
    type: none

hub:
  db:
    pvc:
      storageClassName: 'local-storage'

debug:
  enabled: true
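
For completeness, the deployment itself is the standard z2jh Helm install, roughly like this (chart version 1.2.0, matching the k8s-hub:1.2.0 image shown above):

$ helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
$ helm upgrade --install jhub1 jupyterhub/jupyterhub \
    --namespace jhub --create-namespace \
    --version 1.2.0 --values config.yaml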

Things I’ve tried:

  • Starting with a clean install, i.e. partitioning both nodes and setting everything up from scratch.
  • Using rook-ceph for storage. I had some networking issues there as well, but once I switched it to host networking, it worked.
  • Setting hostNetwork to true for the hub (see the patch sketch after this list).
  • Trying with and without MetalLB, and with various settings in the proxy network config.

All of these attempts ended with the same result.
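
For the hostNetwork attempt above: as far as I can tell the chart has no first-class switch for it, so it came down to patching the deployment directly, along these lines (dnsPolicy has to change as well, otherwise in-cluster DNS lookups break):

$ kubectl -n jhub patch deployment hub --type merge -p \
    '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'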

Do you have any idea which part of the configuration I should change? As mentioned, I suspect it has to do with the network config, but I really don’t know which other parts to adapt. Just let me know if you need any other logs.

I appreciate any help. Thanks a lot in advance.

Best wishes
Henning

Hi! Do you have any other multi-pod applications on your cluster? Are they working?

Hi,

thank you for your answer, and sorry for my late reply; I was away for the last two weeks.

I just installed rook-ceph (which comes with lots of pods) on my two-node cluster, as well as Prometheus, and they’re working fine (both Prometheus and JupyterHub use rook-ceph for volume provisioning):

$ kubectl get all -o wide --all-namespaces

NAMESPACE        NAME                                                READY   STATUS             RESTARTS       AGE     IP              NODE        NOMINATED NODE   READINESS GATES
cert-manager     pod/cert-manager-5bb9dd7d5d-vqwbd                   1/1     Running            0              72m     172.17.0.7      gpu-3-bio   <none>           <none>
cert-manager     pod/cert-manager-cainjector-6586bddc69-rrfvc        1/1     Running            0              72m     172.17.0.6      gpu-3-bio   <none>           <none>
cert-manager     pod/cert-manager-webhook-6fc8f4666b-6mtz4           1/1     Running            0              72m     172.17.0.5      gpu-3-bio   <none>           <none>
default          pod/prom1-kube-state-metrics-d9db975bc-vj97q        1/1     Running            0              67m     172.17.0.17     gpu-3-bio   <none>           <none>
default          pod/prom1-prometheus-alertmanager-d4c6cd7f5-l47p9   2/2     Running            0              67m     172.17.0.20     gpu-3-bio   <none>           <none>
default          pod/prom1-prometheus-node-exporter-b8g9b            1/1     Running            0              67m     10.162.15.45    gpu-3-bio   <none>           <none>
default          pod/prom1-prometheus-pushgateway-8498b45dc-6sqgc    1/1     Running            0              67m     172.17.0.18     gpu-3-bio   <none>           <none>
default          pod/prom1-prometheus-server-6bcbf58967-kv5mp        2/2     Running            0              67m     172.17.0.19     gpu-3-bio   <none>           <none>
jhub             pod/continuous-image-puller-k75m9                   1/1     Running            0              8m15s   172.17.0.4      gpu-3-bio   <none>           <none>
jhub             pod/hub-5bfd466f44-8w22n                            0/1     CrashLoopBackOff   5 (103s ago)   8m15s   172.17.0.24     gpu-3-bio   <none>           <none>
jhub             pod/proxy-c5f8b5687-h8v2p                           1/1     Running            0              8m15s   172.17.0.22     gpu-3-bio   <none>           <none>
jhub             pod/user-scheduler-7c57c8b84d-2l2km                 1/1     Running            0              8m15s   172.17.0.23     gpu-3-bio   <none>           <none>
jhub             pod/user-scheduler-7c57c8b84d-6qft9                 1/1     Running            0              8m15s   172.17.0.21     gpu-3-bio   <none>           <none>
kube-system      pod/calico-kube-controllers-555bc4b957-pgsds        1/1     Running            0              73m     172.17.0.2      gpu-3-bio   <none>           <none>
kube-system      pod/calico-node-q4x99                               1/1     Running            0              73m     10.162.15.116   gpu-0-bio   <none>           <none>
kube-system      pod/calico-node-sq8rl                               1/1     Running            0              73m     10.162.15.45    gpu-3-bio   <none>           <none>
kube-system      pod/coredns-6d4b75cb6d-jct4n                        1/1     Running            0              73m     172.17.0.3      gpu-0-bio   <none>           <none>
kube-system      pod/coredns-6d4b75cb6d-nj94s                        1/1     Running            0              73m     172.17.0.2      gpu-0-bio   <none>           <none>
kube-system      pod/etcd-gpu-0-bio                                  1/1     Running            0              73m     10.162.15.116   gpu-0-bio   <none>           <none>
kube-system      pod/kube-apiserver-gpu-0-bio                        1/1     Running            0              73m     10.162.15.116   gpu-0-bio   <none>           <none>
kube-system      pod/kube-controller-manager-gpu-0-bio               1/1     Running            0              73m     10.162.15.116   gpu-0-bio   <none>           <none>
kube-system      pod/kube-proxy-rss24                                1/1     Running            0              73m     10.162.15.116   gpu-0-bio   <none>           <none>
kube-system      pod/kube-proxy-xqstp                                1/1     Running            0              73m     10.162.15.45    gpu-3-bio   <none>           <none>
kube-system      pod/kube-scheduler-gpu-0-bio                        1/1     Running            0              73m     10.162.15.116   gpu-0-bio   <none>           <none>
metallb-system   pod/metallb-controller-5ffbcf4b7f-gjhv6             1/1     Running            0              9m9s    172.17.0.3      gpu-3-bio   <none>           <none>
metallb-system   pod/metallb-speaker-7l2gc                           1/1     Running            0              9m9s    10.162.15.45    gpu-3-bio   <none>           <none>
metallb-system   pod/metallb-speaker-9gbsm                           1/1     Running            0              9m9s    10.162.15.116   gpu-0-bio   <none>           <none>
rook-ceph        pod/csi-cephfsplugin-provisioner-5c6c4c7785-874ns   6/6     Running            0              71m     172.17.0.12     gpu-3-bio   <none>           <none>
rook-ceph        pod/csi-cephfsplugin-provisioner-5c6c4c7785-fwcp2   0/6     Pending            0              71m     <none>          <none>      <none>           <none>
rook-ceph        pod/csi-cephfsplugin-vfnct                          3/3     Running            0              71m     10.162.15.45    gpu-3-bio   <none>           <none>
rook-ceph        pod/csi-rbdplugin-provisioner-7c756d9bd7-qfjz7      0/6     Pending            0              71m     <none>          <none>      <none>           <none>
rook-ceph        pod/csi-rbdplugin-provisioner-7c756d9bd7-vhh5d      6/6     Running            0              71m     172.17.0.11     gpu-3-bio   <none>           <none>
rook-ceph        pod/csi-rbdplugin-wbv6g                             3/3     Running            0              71m     10.162.15.45    gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-mds-myfs-a-774b5cd5cc-7xln8           1/1     Running            0              68m     172.17.0.16     gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-mds-myfs-b-554c7586f9-t446f           1/1     Running            0              68m     172.17.0.9      gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-mgr-a-677b6f9b5-mszjg                 1/1     Running            0              70m     172.17.0.14     gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-mon-a-8b96db755-5ttqx                 1/1     Running            0              71m     172.17.0.10     gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-operator-8485948986-rwk79             1/1     Running            0              72m     172.17.0.8      gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-osd-0-55f48487d-p7226                 1/1     Running            0              70m     172.17.0.15     gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-osd-prepare-gpu-3-bio-b6cm2           0/1     Completed          0              70m     172.17.0.9      gpu-3-bio   <none>           <none>
rook-ceph        pod/rook-ceph-tools-7cd7457fb5-mgdgm                1/1     Running            0              72m     172.17.0.13     gpu-3-bio   <none>           <none>

NAMESPACE        NAME                                     TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE     SELECTOR
cert-manager     service/cert-manager                     ClusterIP      10.99.216.227    <none>        9402/TCP                 72m     app.kubernetes.io/component=controller,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=cert-manager
cert-manager     service/cert-manager-webhook             ClusterIP      10.96.33.164     <none>        443/TCP                  72m     app.kubernetes.io/component=webhook,app.kubernetes.io/instance=cert-manager,app.kubernetes.io/name=webhook
default          service/kubernetes                       ClusterIP      10.96.0.1        <none>        443/TCP                  73m     <none>
default          service/prom1-kube-state-metrics         ClusterIP      10.99.42.108     <none>        8080/TCP                 67m     app.kubernetes.io/instance=prom1,app.kubernetes.io/name=kube-state-metrics
default          service/prom1-prometheus-alertmanager    ClusterIP      10.107.145.55    <none>        80/TCP                   67m     app=prometheus,component=alertmanager,release=prom1
default          service/prom1-prometheus-node-exporter   ClusterIP      10.104.197.188   <none>        9100/TCP                 67m     app=prometheus,component=node-exporter,release=prom1
default          service/prom1-prometheus-pushgateway     ClusterIP      10.107.190.125   <none>        9091/TCP                 67m     app=prometheus,component=pushgateway,release=prom1
default          service/prom1-prometheus-server          ClusterIP      10.109.92.202    <none>        80/TCP                   67m     app=prometheus,component=server,release=prom1
jhub             service/hub                              ClusterIP      10.105.186.158   <none>        8081/TCP                 8m15s   app=jupyterhub,component=hub,release=jhub1
jhub             service/proxy-api                        ClusterIP      10.97.143.150    <none>        8001/TCP                 8m15s   app=jupyterhub,component=proxy,release=jhub1
jhub             service/proxy-public                     LoadBalancer   10.100.219.151   <pending>     80:31084/TCP             8m15s   component=proxy,release=jhub1
kube-system      service/kube-dns                         ClusterIP      10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   73m     k8s-app=kube-dns
metallb-system   service/metallb-webhook-service          ClusterIP      10.111.75.246    <none>        443/TCP                  9m9s    app.kubernetes.io/component=controller,app.kubernetes.io/instance=metallb,app.kubernetes.io/name=metallb
rook-ceph        service/csi-cephfsplugin-metrics         ClusterIP      10.111.42.102    <none>        8080/TCP,8081/TCP        71m     contains=csi-cephfsplugin-metrics
rook-ceph        service/csi-rbdplugin-metrics            ClusterIP      10.108.62.223    <none>        8080/TCP,8081/TCP        71m     contains=csi-rbdplugin-metrics
rook-ceph        service/rook-ceph-mgr                    ClusterIP      10.111.101.248   <none>        9283/TCP                 70m     app=rook-ceph-mgr,ceph_daemon_id=a,rook_cluster=rook-ceph
rook-ceph        service/rook-ceph-mgr-dashboard          ClusterIP      10.102.118.112   <none>        7000/TCP                 70m     app=rook-ceph-mgr,ceph_daemon_id=a,rook_cluster=rook-ceph
rook-ceph        service/rook-ceph-mon-a                  ClusterIP      10.101.130.242   <none>        6789/TCP,3300/TCP        71m     app=rook-ceph-mon,ceph_daemon_id=a,mon=a,mon_cluster=rook-ceph,rook_cluster=rook-ceph

NAMESPACE        NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE     CONTAINERS                                              IMAGES                                                                                                                  SELECTOR
default          daemonset.apps/prom1-prometheus-node-exporter   1         1         1       1            1           <none>                   67m     prometheus-node-exporter                                quay.io/prometheus/node-exporter:v1.3.1                                                                                 app=prometheus,component=node-exporter,release=prom1
jhub             daemonset.apps/continuous-image-puller          1         1         1       1            1           <none>                   8m15s   pause                                                   k8s.gcr.io/pause:3.5                                                                                                    app=jupyterhub,component=continuous-image-puller,release=jhub1
kube-system      daemonset.apps/calico-node                      2         2         2       2            2           kubernetes.io/os=linux   73m     calico-node                                             docker.io/calico/node:v3.23.3                                                                                           k8s-app=calico-node
kube-system      daemonset.apps/kube-proxy                       2         2         2       2            2           kubernetes.io/os=linux   73m     kube-proxy                                              k8s.gcr.io/kube-proxy:v1.24.3                                                                                           k8s-app=kube-proxy
metallb-system   daemonset.apps/metallb-speaker                  2         2         2       2            2           kubernetes.io/os=linux   9m9s    speaker                                                 quay.io/metallb/speaker:v0.13.4                                                                                         app.kubernetes.io/component=speaker,app.kubernetes.io/instance=metallb,app.kubernetes.io/name=metallb
rook-ceph        daemonset.apps/csi-cephfsplugin                 1         1         1       1            1           <none>                   71m     driver-registrar,csi-cephfsplugin,liveness-prometheus   k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.5.0,quay.io/cephcsi/cephcsi:v3.6.1,quay.io/cephcsi/cephcsi:v3.6.1   app=csi-cephfsplugin
rook-ceph        daemonset.apps/csi-rbdplugin                    1         1         1       1            1           <none>                   71m     driver-registrar,csi-rbdplugin,liveness-prometheus      k8s.gcr.io/sig-storage/csi-node-driver-registrar:v2.5.0,quay.io/cephcsi/cephcsi:v3.6.1,quay.io/cephcsi/cephcsi:v3.6.1   app=csi-rbdplugin

[..., cut because of character limit]

One thing I’m stumbling over is that jhub’s proxy-public service stays in the pending state and no longer takes the IP I assign in the config file:

proxy:
  secretToken: "abcdef"
  service:
    loadBalancerIP: 10.162.15.200

Not sure if this is related though.
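
For reference, MetalLB v0.13 (the version in the listing above) is configured through CRDs instead of the old ConfigMap, and as far as I understand an address pool covering the requested IP has to exist before a loadBalancerIP can be assigned. A sketch of what I mean (pool and advertisement names, and the end of the range, are placeholders):

$ kubectl apply -f - <<EOF
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: jhub-pool        # placeholder name
  namespace: metallb-system
spec:
  addresses:
  - 10.162.15.200-10.162.15.210
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: jhub-l2          # placeholder name
  namespace: metallb-system
spec:
  ipAddressPools:
  - jhub-pool
EOF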

Thank you again.

Best wishes
Henning

It definitely sounds like a networking issue with your cluster, unrelated to JupyterHub.

I presume this is why rook is working: on the host network it bypasses the pod network entirely. It’s hiding an underlying problem, though, so it’s probably best to start from basics. E.g. start with a plain two-node cluster (default networking instead of Calico) and check whether pod-to-pod communication across nodes works: run Nginx on one node and try to wget/curl it from a pod on the other node, and vice versa. If that works, slowly add features back to your cluster.
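
A concrete version of that test could look like this (pod names and images are just examples; swap the two node names for the reverse direction):

# pin an nginx pod to one node
$ kubectl run nginx-test --image=nginx \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"gpu-0-bio"}}'
$ kubectl get pod nginx-test -o wide    # note the pod IP
# fetch it from a pod pinned to the other node
$ kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl \
    --overrides='{"apiVersion":"v1","spec":{"nodeName":"gpu-3-bio"}}' \
    -- curl -m 5 http://<nginx-pod-ip>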


Thanks for the advice. I’ll try what you describe and get back once I have any relevant news or a solution.