We’re trying to resize (reduce) a z2jh deployment running on an OpenStack Magnum-created k8s cluster, and are now seeing problems with the hub pod which may be related to networking.
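As a baseline check after the resize, something like the following should confirm that the remaining nodes are all Ready and show where the z2jh pods have landed (jhub is the namespace used in the events further down):

kubectl get nodes -o wide          # every remaining node should be Ready
kubectl get pods -n jhub -o wide   # which node each hub/proxy pod is scheduled on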
From the hub logs:
[I 2023-11-07 17:59:20.318 JupyterHub app:1984] Not using allowed_users. Any authenticated user will be allowed.
[D 2023-11-07 17:59:20.344 JupyterHub app:2343] Purging expired APITokens
[D 2023-11-07 17:59:20.346 JupyterHub app:2343] Purging expired OAuthCodes
[D 2023-11-07 17:59:20.348 JupyterHub app:2179] Loading role assignments from config
[D 2023-11-07 17:59:20.364 JupyterHub app:2502] Initializing spawners
[D 2023-11-07 17:59:20.365 JupyterHub app:2633] Loaded users:
[I 2023-11-07 17:59:20.365 JupyterHub app:2928] Initialized 0 spawners in 0.002 seconds
[I 2023-11-07 17:59:20.370 JupyterHub metrics:278] Found 1 active users in the last ActiveUserPeriods.twenty_four_hours
[I 2023-11-07 17:59:20.370 JupyterHub metrics:278] Found 5 active users in the last ActiveUserPeriods.seven_days
[I 2023-11-07 17:59:20.371 JupyterHub metrics:278] Found 28 active users in the last ActiveUserPeriods.thirty_days
[I 2023-11-07 17:59:20.371 JupyterHub app:3142] Not starting proxy
[D 2023-11-07 17:59:20.372 JupyterHub proxy:880] Proxy: Fetching GET http://proxy-api:8001/api/routes
[W 2023-11-07 17:59:40.392 JupyterHub proxy:899] api_request to the proxy failed with status code 599, retrying...
[W 2023-11-07 18:00:00.555 JupyterHub proxy:899] api_request to the proxy failed with status code 599, retrying...
[E 2023-11-07 18:00:00.556 JupyterHub app:3382]
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/jupyterhub/app.py", line 3380, in launch_instance_async
await self.start()
File "/usr/local/lib/python3.11/site-packages/jupyterhub/app.py", line 3146, in start
await self.proxy.get_all_routes()
File "/usr/local/lib/python3.11/site-packages/jupyterhub/proxy.py", line 946, in get_all_routes
resp = await self.api_request('', client=client)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/jupyterhub/proxy.py", line 910, in api_request
result = await exponential_backoff(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/jupyterhub/utils.py", line 237, in exponential_backoff
raise asyncio.TimeoutError(fail_message)
TimeoutError: Repeated api_request to proxy path "" failed.
[D 2023-11-07 18:00:00.558 JupyterHub application:1031] Exiting application: jupyterhub
and from kubectl describe on the hub pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned jhub/hub-6bf4b55dd4-v9rgb to mscjupyter-k8s-1-23-6-2023-0-2-4cjjzmnjq6p6-node-7
Normal Pulled 12m kubelet Container image "jupyterhub/k8s-hub:3.1.0" already present on machine
Normal Created 12m kubelet Created container hub
Normal Started 12m kubelet Started container hub
Warning Unhealthy 11m (x22 over 12m) kubelet Readiness probe failed: Get "http://10.100.116.147:8081/hub/health": dial tcp 10.100.116.147:8081: connect: connection refused
Normal SandboxChanged 7m27s (x3 over 7m42s) kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 7m26s kubelet Container image "jupyterhub/k8s-hub:3.1.0" already present on machine
Normal Created 7m26s kubelet Created container hub
Normal Started 7m25s kubelet Started container hub
Warning Unhealthy 6m52s (x19 over 7m25s) kubelet Readiness probe failed: Get "http://10.100.116.148:8081/hub/health": dial tcp 10.100.116.148:8081: connect: connection refused
Warning BackOff 2m42s (x12 over 6m41s) kubelet Back-off restarting failed container
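The 599s in the hub log mean the hub never got any response at all from the proxy’s REST API. A way to confirm whether the proxy-api Service has endpoints and is reachable over the pod network (the pod name and curl image below are just for illustration):

kubectl get svc,endpoints proxy-api -n jhub    # the endpoints should list the proxy pod's IP
# throwaway pod hitting the same URL the hub uses; the API needs an auth token,
# so even an HTTP error back from the proxy (rather than a timeout) would show
# the network path itself is fine
kubectl run curl-test -n jhub --rm -it --restart=Never \
  --image=curlimages/curl --command -- curl -sv http://proxy-api:8001/api/routes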
I’ve tried restarting the proxy pod, and this hasn’t made any difference.
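For reference, by “restarting” I mean recreating the pod via its deployment, roughly:

kubectl rollout restart deployment/proxy -n jhub       # z2jh's proxy deployment
kubectl get pods -n jhub -l component=proxy -o wide    # new proxy pod comes up Running
kubectl logs -n jhub -l component=proxy --tail=50      # look for the CHP API listening on 8001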
Here’s our singleuser config:
singleuser:
  nodeSelector:
    node.kubernetes.io/instance-type: m2.highmem
  cpu:
    guarantee: 0.5
    limit: 1
  memory:
    guarantee: 2G
    limit: 2G
  storage:
    dynamic:
      storageClass: csi-sc-cinderplugin
  image:
    <image stuff>
There’s also hub authentication config and HTTPS proxy config; none of that has changed, so it’s unlikely to be implicated.
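Given the nodeSelector above, one cheap thing to rule out is that the resize removed all of the m2.highmem nodes (that would only affect user pods, not the hub, but it’s easy to check):

kubectl get nodes -l node.kubernetes.io/instance-type=m2.highmem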
Nearly all pods are showing as Running, except some hook-image-awaiter pods:
2023/11/07 17:22:52 GET https://kubernetes.default.svc:443/apis/apps/v1/namespaces/jhub/daemonsets/hook-image-puller giving up after 6 attempt(s): Get "https://kubernetes.default.svc:443/apis/apps/v1/namespaces/jhub/daemonsets/hook-image-puller": dial tcp: lookup kubernetes.default.svc on 10.254.0.10:53: read udp 10.100.15.241:58500->10.254.0.10:53: i/o timeout
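That’s a cluster-DNS timeout (10.254.0.10 is presumably the kube-dns/CoreDNS Service IP), so it looks like the same pod-network problem rather than anything image-puller-specific. A quick DNS test from a throwaway pod, plus a look at the DNS pods themselves (the image and label here are the common defaults and may differ on Magnum):

kubectl run dns-test -n jhub --rm -it --restart=Never \
  --image=busybox:1.28 -- nslookup kubernetes.default.svc
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide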
We’re using Calico for the networking. I’ve tried restarting one or two of the Calico pods and, not surprisingly, this hasn’t helped. I’m reluctant to touch the calico-kube-controllers pod, which is showing as Running.
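Before touching calico-kube-controllers, a less invasive check is whether calico-node is Running and Ready on every remaining node, especially the one the hub pod lands on (the label is Calico’s usual one; the pod name is a placeholder):

kubectl get pods -n kube-system -l k8s-app=calico-node -o wide   # one per node, all should be Ready
kubectl logs -n kube-system <calico-node-pod-on-the-hub-node> --tail=50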
It’s z2jh 3.1.0