Dear community,
I’ve been trying to set up JupyterHub on my k8s cluster for quite some time now, but I haven’t gotten it running. I suspect some kind of networking issue, but all the other topics here, on GitLab, or on Stack Overflow either end without a solution or with one that doesn’t help in my case, e.g. here or here. That is why I’m asking you for support.
My setup is as follows:
I have two nodes: a control plane called gpu-0-bio (which, despite the name, has no GPU) and a worker node called gpu-3-bio. (In case you wonder: gpu-1-bio and gpu-2-bio exist as well, but I disconnected them to keep the setup simple.) I’m using Calico as the CNI.
$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
gpu-operator nvidia-device-plugin-1658416951-trbgz 1/1 Running 0 8m42s
gpu-operator nvidia-device-plugin-1658416951-zvrrw 0/1 CrashLoopBackOff 6 (2m34s ago) 8m42s
jhub continuous-image-puller-vwv22 1/1 Running 0 29m
jhub continuous-image-puller-wmts6 1/1 Running 0 29m
jhub hook-image-awaiter-25fsd 0/1 Error 0 29m
jhub hook-image-awaiter-dq46j 0/1 Error 0 40m
jhub hook-image-awaiter-fkkqc 0/1 Error 0 27m
jhub hook-image-awaiter-jkmtc 0/1 Error 0 34m
jhub hook-image-awaiter-k2pqx 0/1 Error 0 23m
jhub hook-image-awaiter-lgmr9 0/1 Error 0 25m
jhub hook-image-awaiter-n6g6w 0/1 Error 0 20m
jhub hook-image-puller-vs684 1/1 Running 0 40m
jhub hook-image-puller-xzxzp 1/1 Running 0 40m
jhub hub-7c5cc995fd-hhbx4 0/1 CrashLoopBackOff 6 (8s ago) 9m36s
jhub proxy-7f9c944765-sql72 1/1 Running 0 11m
jhub user-scheduler-7c57c8b84d-6m7ll 1/1 Running 0 29m
jhub user-scheduler-7c57c8b84d-tnmdj 1/1 Running 0 29m
kube-system calico-kube-controllers-555bc4b957-lw9pb 1/1 Running 2 (3h42m ago) 5h13m
kube-system calico-node-qzdhq 1/1 Running 1 (3h43m ago) 5h13m
kube-system calico-node-z6gg6 1/1 Running 1 (3h43m ago) 5h13m
kube-system coredns-6d4b75cb6d-fth2t 1/1 Running 1 (3h43m ago) 5h13m
kube-system coredns-6d4b75cb6d-nws6s 1/1 Running 1 (3h43m ago) 5h13m
kube-system etcd-gpu-0-bio 1/1 Running 1 (3h43m ago) 5h13m
kube-system kube-apiserver-gpu-0-bio 1/1 Running 1 (3h43m ago) 5h13m
kube-system kube-controller-manager-gpu-0-bio 1/1 Running 1 (3h43m ago) 5h13m
kube-system kube-proxy-5v7tc 1/1 Running 1 (3h43m ago) 5h13m
kube-system kube-proxy-d42zh 1/1 Running 1 (3h43m ago) 5h13m
kube-system kube-scheduler-gpu-0-bio 1/1 Running 1 (3h43m ago) 5h13m
$ kubectl describe pod -n jhub hub-7c5cc995fd-hhbx4
Name: hub-7c5cc995fd-hhbx4
Namespace: jhub
Priority: 0
Node: gpu-3-bio/10.162.15.45
Start Time: Thu, 21 Jul 2022 17:21:38 +0200
Labels: app=jupyterhub
component=hub
hub.jupyter.org/network-access-proxy-api=true
hub.jupyter.org/network-access-proxy-http=true
hub.jupyter.org/network-access-singleuser=true
pod-template-hash=7c5cc995fd
release=jhub1
Annotations: checksum/config-map: 2655ca5c5669782f1e9645c88a8580d99db2cff7592bd2452f39886fe35201a9
checksum/secret: aee640eb515de506b406d466e32da1c5382f054c1f81711cb52ebc21920acabc
Status: Running
IP: 172.17.0.3
IPs:
IP: 172.17.0.3
Controlled By: ReplicaSet/hub-7c5cc995fd
Containers:
hub:
Container ID: docker://aa7abdf1a6dd9d6dfced1bef89145d6d8287ac329665275b0492304f8457c690
Image: jupyterhub/k8s-hub:1.2.0
Image ID: docker-pullable://jupyterhub/k8s-hub@sha256:e4770285aaf7230b930643986221757c2cc2e9420f5e21ac892582c96a57ce1c
Port: 8081/TCP
Host Port: 0/TCP
Args:
jupyterhub
--config
/usr/local/etc/jupyterhub/jupyterhub_config.py
--debug
--upgrade-db
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 21 Jul 2022 17:25:28 +0200
Finished: Thu, 21 Jul 2022 17:25:56 +0200
Ready: False
Restart Count: 4
Liveness: http-get http://:http/hub/health delay=300s timeout=3s period=10s #success=1 #failure=30
Readiness: http-get http://:http/hub/health delay=0s timeout=1s period=2s #success=1 #failure=1000
Environment:
PYTHONUNBUFFERED: 1
HELM_RELEASE_NAME: jhub1
POD_NAMESPACE: jhub (v1:metadata.namespace)
CONFIGPROXY_AUTH_TOKEN: <set to the key 'hub.config.ConfigurableHTTPProxy.auth_token' in secret 'hub'> Optional: false
Mounts:
/srv/jupyterhub from pvc (rw)
/usr/local/etc/jupyterhub/config/ from config (rw)
/usr/local/etc/jupyterhub/jupyterhub_config.py from config (rw,path="jupyterhub_config.py")
/usr/local/etc/jupyterhub/secret/ from secret (rw)
/usr/local/etc/jupyterhub/z2jh.py from config (rw,path="z2jh.py")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ggz94 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: hub
Optional: false
secret:
Type: Secret (a volume populated by a Secret)
SecretName: hub
Optional: false
pvc:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: hub-db-dir
ReadOnly: false
kube-api-access-ggz94:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: hub.jupyter.org/dedicated=core:NoSchedule
hub.jupyter.org_dedicated=core:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m27s default-scheduler Successfully assigned jhub/hub-7c5cc995fd-hhbx4 to gpu-3-bio
Normal Pulled 4m53s (x2 over 5m26s) kubelet Container image "jupyterhub/k8s-hub:1.2.0" already present on machine
Normal Created 4m53s (x2 over 5m26s) kubelet Created container hub
Normal Started 4m53s (x2 over 5m26s) kubelet Started container hub
Warning Unhealthy 4m52s (x19 over 5m25s) kubelet Readiness probe failed: Get "http://172.17.0.3:8081/hub/health": dial tcp 172.17.0.3:8081: connect: connection refused
Warning BackOff 14s (x15 over 4m19s) kubelet Back-off restarting failed container
$ kubectl logs -n jhub hub-7c5cc995fd-hhbx4
[D 2022-07-21 15:25:28.342 JupyterHub application:730] Looking for /usr/local/etc/jupyterhub/jupyterhub_config in /srv/jupyterhub
Loading /usr/local/etc/jupyterhub/secret/values.yaml
No config at /usr/local/etc/jupyterhub/existing-secret/values.yaml
[D 2022-07-21 15:25:28.515 JupyterHub application:752] Loaded config file: /usr/local/etc/jupyterhub/jupyterhub_config.py
[I 2022-07-21 15:25:28.533 JupyterHub app:2479] Running JupyterHub version 1.5.0
[I 2022-07-21 15:25:28.533 JupyterHub app:2509] Using Authenticator: jupyterhub.auth.DummyAuthenticator-1.5.0
[I 2022-07-21 15:25:28.533 JupyterHub app:2509] Using Spawner: kubespawner.spawner.KubeSpawner-1.1.0
[I 2022-07-21 15:25:28.533 JupyterHub app:2509] Using Proxy: jupyterhub.proxy.ConfigurableHTTPProxy-1.5.0
[D 2022-07-21 15:25:28.533 JupyterHub app:1721] Connecting to db: sqlite:///jupyterhub.sqlite
[D 2022-07-21 15:25:28.540 JupyterHub orm:815] database schema version found: 4dc2d5a8c53c
[D 2022-07-21 15:25:28.544 JupyterHub orm:815] database schema version found: 4dc2d5a8c53c
[W 2022-07-21 15:25:28.546 JupyterHub app:1828] No admin users, admin interface will be unavailable.
[W 2022-07-21 15:25:28.546 JupyterHub app:1829] Add any administrative users to `c.Authenticator.admin_users` in config.
[I 2022-07-21 15:25:28.546 JupyterHub app:1858] Not using allowed_users. Any authenticated user will be allowed.
[D 2022-07-21 15:25:28.577 JupyterHub app:2010] Purging expired APITokens
[D 2022-07-21 15:25:28.579 JupyterHub app:2010] Purging expired OAuthAccessTokens
[D 2022-07-21 15:25:28.580 JupyterHub app:2010] Purging expired OAuthCodes
[D 2022-07-21 15:25:28.584 JupyterHub app:2133] Initializing spawners
[D 2022-07-21 15:25:28.585 JupyterHub app:2266] Loaded users:
[I 2022-07-21 15:25:28.585 JupyterHub app:2546] Initialized 0 spawners in 0.001 seconds
[I 2022-07-21 15:25:28.586 JupyterHub app:2758] Not starting proxy
[D 2022-07-21 15:25:28.586 JupyterHub proxy:832] Proxy: Fetching GET http://proxy-api:8001/api/routes
[W 2022-07-21 15:25:28.587 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:28.741 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:33.837 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:33.987 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:39.869 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:41.515 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:45.028 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:46.570 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:51.575 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[W 2022-07-21 15:25:56.342 JupyterHub proxy:851] api_request to the proxy failed with status code 599, retrying...
[E 2022-07-21 15:25:56.343 JupyterHub app:2989]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2987, in launch_instance_async
await self.start()
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/app.py", line 2762, in start
await self.proxy.get_all_routes()
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 898, in get_all_routes
resp = await self.api_request('', client=client)
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/proxy.py", line 862, in api_request
result = await exponential_backoff(
File "/usr/local/lib/python3.8/dist-packages/jupyterhub/utils.py", line 184, in exponential_backoff
raise TimeoutError(fail_message)
TimeoutError: Repeated api_request to proxy path "" failed.
[D 2022-07-21 15:25:56.346 JupyterHub application:834] Exiting application: jupyterhub
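Those 599s mean the hub never gets a TCP connection to the proxy’s REST API at proxy-api:8001. To rule out DNS vs. plain TCP problems, I could run a small probe from inside a pod on the same node. This is just a sketch; that bash and its /dev/tcp support are available inside the container is an assumption:

```shell
# Hypothetical probe, to be run inside a pod via `kubectl exec`:
# reports whether a TCP endpoint accepts connections within 2 seconds.
port_open() {
  local host="$1" port="$2"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

port_open proxy-api 8001   # the endpoint the hub's api_requests go to
```

If that reports "closed" while the proxy pod is Ready, the problem would be between the Service and the pod rather than in the hub itself.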
$ kubectl -n jhub describe pod proxy-7f9c944765-sql72
Name: proxy-7f9c944765-sql72
Namespace: jhub
Priority: 0
Node: gpu-3-bio/10.162.15.45
Start Time: Thu, 21 Jul 2022 17:19:23 +0200
Labels: app=jupyterhub
component=proxy
hub.jupyter.org/network-access-hub=true
hub.jupyter.org/network-access-singleuser=true
pod-template-hash=7f9c944765
release=jhub1
Annotations: checksum/auth-token: a42c
checksum/proxy-secret: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
Status: Running
IP: 172.17.0.2
IPs:
IP: 172.17.0.2
Controlled By: ReplicaSet/proxy-7f9c944765
Containers:
chp:
Container ID: docker://17cd38201108f684d836b286dfc1dfc15a8ec6ddca4aa363e168dc9bafbb841b
Image: jupyterhub/configurable-http-proxy:4.5.0
Image ID: docker-pullable://jupyterhub/configurable-http-proxy@sha256:8ced0a2f8073bd14e9d9609089c8144e95473c0d230a14ef49956500ac8d24ac
Ports: 8000/TCP, 8001/TCP
Host Ports: 0/TCP, 0/TCP
Command:
configurable-http-proxy
--ip=
--api-ip=
--api-port=8001
--default-target=http://hub:$(HUB_SERVICE_PORT)
--error-target=http://hub:$(HUB_SERVICE_PORT)/hub/error
--port=8000
--log-level=debug
State: Running
Started: Thu, 21 Jul 2022 17:19:24 +0200
Ready: True
Restart Count: 0
Liveness: http-get http://:http/_chp_healthz delay=60s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:http/_chp_healthz delay=0s timeout=1s period=2s #success=1 #failure=3
Environment:
CONFIGPROXY_AUTH_TOKEN: <set to the key 'hub.config.ConfigurableHTTPProxy.auth_token' in secret 'hub'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rnvql (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-rnvql:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: hub.jupyter.org/dedicated=core:NoSchedule
hub.jupyter.org_dedicated=core:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m27s default-scheduler Successfully assigned jhub/proxy-7f9c944765-sql72 to gpu-3-bio
Normal Pulled 9m27s kubelet Container image "jupyterhub/configurable-http-proxy:4.5.0" already present on machine
Normal Created 9m27s kubelet Created container chp
Normal Started 9m27s kubelet Started container chp
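Since the proxy pod itself reports Ready and chp is listening on 8001, I would next verify that the proxy-api Service actually resolves and selects this pod. These are the commands I would use for that (output omitted, as it is cluster-specific):

```shell
# Does the proxy-api Service exist, and does it have endpoints?
kubectl -n jhub get svc proxy-api -o wide
kubectl -n jhub get endpoints proxy-api

# The endpoint IP should match the proxy pod IP (172.17.0.2 above).
kubectl -n jhub get pod -l component=proxy -o wide
```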
journalctl on gpu-3-bio holds some more information:
$ journalctl | tail -n 30
Jul 21 17:39:18 gpu-3-bio kubelet[1593]: I0721 17:39:18.931411 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:18 gpu-3-bio kubelet[1593]: E0721 17:39:18.931598 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:39:19 gpu-3-bio cri-dockerd[1839]: time="2022-07-21T17:39:19+02:00" level=info msg="Failed to read pod IP from plugin/docker: Couldn't find network status for jhub/hub-5fd96df75b-4c57q through plugin: invalid network status for"
Jul 21 17:39:24 gpu-3-bio kubelet[1593]: I0721 17:39:24.780385 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:24 gpu-3-bio kubelet[1593]: E0721 17:39:24.781358 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:39:35 gpu-3-bio kubelet[1593]: I0721 17:39:35.897564 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:35 gpu-3-bio kubelet[1593]: E0721 17:39:35.898490 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:39:49 gpu-3-bio kubelet[1593]: I0721 17:39:49.896654 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:39:49 gpu-3-bio kubelet[1593]: E0721 17:39:49.896855 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:00 gpu-3-bio kubelet[1593]: I0721 17:40:00.897837 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:00 gpu-3-bio kubelet[1593]: E0721 17:40:00.900315 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:15 gpu-3-bio kubelet[1593]: I0721 17:40:15.897367 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:15 gpu-3-bio kubelet[1593]: E0721 17:40:15.898310 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:26 gpu-3-bio kubelet[1593]: I0721 17:40:26.897279 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:26 gpu-3-bio kubelet[1593]: E0721 17:40:26.898229 1593 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"hub\" with CrashLoopBackOff: \"back-off 1m20s restarting failed container=hub pod=hub-5fd96df75b-4c57q_jhub(d8fbb716-f84f-41f0-8ecb-8f2334c6657e)\"" pod="jhub/hub-5fd96df75b-4c57q" podUID=d8fbb716-f84f-41f0-8ecb-8f2334c6657e
Jul 21 17:40:40 gpu-3-bio kubelet[1593]: I0721 17:40:40.897381 1593 scope.go:110] "RemoveContainer" containerID="ea882fad67bb1b2fb3e2e5186caf4fca741c68a8aa697d7a4ad173f5a8f51631"
Jul 21 17:40:40 gpu-3-bio systemd[1]: var-lib-docker-overlay2-709bbf521028fac943c90a240bdf4b5b3fdc44566fed1afa81e38a8200c45304\x2dinit-merged.mount: Deactivated successfully.
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944526779+02:00" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944558464+02:00" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944565178+02:00" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jul 21 17:40:40 gpu-3-bio containerd[852]: time="2022-07-21T17:40:40.944640586+02:00" level=info msg="starting signal loop" namespace=moby path=/run/containerd/io.containerd.runtime.v2.task/moby/2d4f15c1ed3975f64106c023b73eaede35f5b4c6f7178f8249013d5bf0e6648e pid=451841 runtime=io.containerd.runc.v2
Jul 21 17:40:41 gpu-3-bio systemd[1]: Started libcontainer container 2d4f15c1ed3975f64106c023b73eaede35f5b4c6f7178f8249013d5bf0e6648e.
Jul 21 17:40:41 gpu-3-bio cri-dockerd[1839]: time="2022-07-21T17:40:41+02:00" level=info msg="Failed to read pod IP from plugin/docker: Couldn't find network status for jhub/hub-5fd96df75b-4c57q through plugin: invalid network status for"
Jul 21 17:40:46 gpu-3-bio NetworkManager[702]: <info> [1658418046.0507] manager: NetworkManager state is now CONNECTED_SITE
Jul 21 17:40:46 gpu-3-bio dbus-daemon[700]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.15' (uid=0 pid=702 comm="/usr/sbin/NetworkManager --no-daemon " label="unconfined")
Jul 21 17:40:46 gpu-3-bio systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 21 17:40:46 gpu-3-bio dbus-daemon[700]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Jul 21 17:40:46 gpu-3-bio systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 21 17:40:46 gpu-3-bio NetworkManager[702]: <info> [1658418046.2966] manager: NetworkManager state is now CONNECTED_GLOBAL
Jul 21 17:40:56 gpu-3-bio systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
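The cri-dockerd lines about “Couldn’t find network status … through plugin: invalid network status” look suspicious to me. To isolate CNI-related messages from the noise, I would filter the kubelet journal like this (a sketch; the grep patterns are guesses):

```shell
# Show only CNI/Calico-related kubelet messages from the last hour.
journalctl -u kubelet --since "-1h" --no-pager | grep -iE 'cni|calico|network status'
```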
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:30:46Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:23:26Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
$ helm version
version.BuildInfo{Version:"v3.9.1", GitCommit:"a7c043acb5ff905c261cfdc923a35776ba5e66e4", GitTreeState:"clean", GoVersion:"go1.17.5"}
I wonder whether it might have something to do with the RX packet drops on my network adapter eno1 (here on gpu-0-bio; note the `dropped 3317347` counter below, while overruns are 0)?
$ ifconfig
docker0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
inet6 fe80::42:2ff:fe64:c0d2 prefixlen 64 scopeid 0x20<link>
ether 02:42:02:64:c0:d2 txqueuelen 0 (Ethernet)
RX packets 97322 bytes 10453861 (10.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 101236 bytes 31099925 (31.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.162.15.116 netmask 255.255.255.0 broadcast 10.162.15.255
inet6 fe80::bee1:9cbb:3569:eba prefixlen 64 scopeid 0x20<link>
ether 4c:52:62:a4:9f:04 txqueuelen 1000 (Ethernet)
RX packets 3986279 bytes 908807048 (908.8 MB)
RX errors 0 dropped 3317347 overruns 0 frame 0
TX packets 173004 bytes 49126213 (49.1 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 16 memory 0x91200000-91220000
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 2235430 bytes 506963410 (506.9 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 2235430 bytes 506963410 (506.9 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tunl0: flags=193<UP,RUNNING,NOARP> mtu 1480
inet 10.244.121.64 netmask 255.255.255.255
tunnel txqueuelen 1000 (IPIP Tunnel)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth4ccd98e: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::7c0e:1aff:fec9:8c9e prefixlen 64 scopeid 0x20<link>
ether 7e:0e:1a:c9:8c:9e txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 29 bytes 3353 (3.3 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth52867c0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::4c1a:86ff:fe29:9d9b prefixlen 64 scopeid 0x20<link>
ether 4e:1a:86:29:9d:9b txqueuelen 0 (Ethernet)
RX packets 8501 bytes 1183286 (1.1 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8932 bytes 3787738 (3.7 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth66490d1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::bc5b:c6ff:fec2:b537 prefixlen 64 scopeid 0x20<link>
ether be:5b:c6:c2:b5:37 txqueuelen 0 (Ethernet)
RX packets 14516 bytes 1373723 (1.3 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 15771 bytes 1524341 (1.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth8d69212: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::b498:7eff:fec1:3c46 prefixlen 64 scopeid 0x20<link>
ether b6:98:7e:c1:3c:46 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 38 bytes 4120 (4.1 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth9fcd7f4: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::ecba:d6ff:fe4a:6fbf prefixlen 64 scopeid 0x20<link>
ether ee:ba:d6:4a:6f:bf txqueuelen 0 (Ethernet)
RX packets 9856 bytes 821271 (821.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 9696 bytes 9300398 (9.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethba6b718: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::84a4:9eff:fe7b:84a6 prefixlen 64 scopeid 0x20<link>
ether 86:a4:9e:7b:84:a6 txqueuelen 0 (Ethernet)
RX packets 14522 bytes 1373757 (1.3 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 15688 bytes 1520793 (1.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethbebb629: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::9ccb:f2ff:feac:cced prefixlen 64 scopeid 0x20<link>
ether 9e:cb:f2:ac:cc:ed txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 34 bytes 3749 (3.7 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
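The eno1 counters show millions of dropped RX packets. To see which counter class is responsible, I would look at the driver-level NIC statistics (counter names vary by driver, so the grep pattern is a guess):

```shell
# Per-driver NIC statistics; counter names are driver-specific.
ethtool -S eno1 | grep -iE 'drop|err|miss'

# Generic per-interface counters as a cross-check.
ip -s link show eno1
```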
The kubeadm init command I used on the control plane:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --cri-socket=unix:///var/run/cri-dockerd.sock --apiserver-advertise-address=10.162.15.116
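Comparing this with the pod descriptions above, one thing strikes me: I passed --pod-network-cidr=10.244.0.0/16, yet the hub and proxy pods got 172.17.0.x addresses, which is the docker0 bridge subnet rather than Calico’s range (tunl0 sits at 10.244.121.64). A trivial sanity check with the values hard-coded from the outputs above:

```shell
# Values copied from the outputs above.
POD_IP="172.17.0.3"        # hub pod IP from `kubectl describe pod`
CIDR_PREFIX="10.244."      # from --pod-network-cidr=10.244.0.0/16

case "$POD_IP" in
  "$CIDR_PREFIX"*) echo "pod IP is inside the kubeadm pod CIDR" ;;
  *)               echo "pod IP is OUTSIDE the kubeadm pod CIDR" ;;
esac
# prints: pod IP is OUTSIDE the kubeadm pod CIDR
```

If I read this correctly, it might mean the pods are being attached to the docker bridge instead of Calico, but I may be misinterpreting the outputs.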
The config.yaml I pass to the JupyterHub Helm chart:
cat config.yaml
proxy:
secretToken: "2fdeb3679d666277bdb1c93102a08f5b894774ba796e60af7957cb5677f40706"
prePuller:
hook:
enabled: false
singleuser:
storage:
type: none
hub:
db:
pvc:
storageClassName: 'local-storage'
debug:
enabled: true
Things I’ve tried:
- Starting with a clean install, i.e. re-partitioning both nodes and setting everything up from scratch.
- Using rook-ceph for storage. I had some networking issues there as well, but once I switched it to the host network, it worked.
- Setting hostNetwork to true for the hub.
- Trying with and without MetalLB, and with various settings in the proxy network config.
However, all of these attempts ended with the same result.
Would you have any idea which part of the configuration to change? As mentioned, I suspect it is related to the network config, but I really don’t know which other parts to adapt. Just let me know if you need any other logs.
I appreciate any help. Thanks a lot in advance.
Best wishes
Henning