Unhealthy hub pod, possible network problems?

We’re trying to resize (reduce) a z2jh deployment running on an OpenStack Magnum-created Kubernetes cluster, and are now seeing some problems with the hub pod, which may be related to networking.

From the hub logs:

[I 2023-11-07 17:59:20.318 JupyterHub app:1984] Not using allowed_users. Any authenticated user will be allowed.
[D 2023-11-07 17:59:20.344 JupyterHub app:2343] Purging expired APITokens
[D 2023-11-07 17:59:20.346 JupyterHub app:2343] Purging expired OAuthCodes
[D 2023-11-07 17:59:20.348 JupyterHub app:2179] Loading role assignments from config
[D 2023-11-07 17:59:20.364 JupyterHub app:2502] Initializing spawners
[D 2023-11-07 17:59:20.365 JupyterHub app:2633] Loaded users:

[I 2023-11-07 17:59:20.365 JupyterHub app:2928] Initialized 0 spawners in 0.002 seconds
[I 2023-11-07 17:59:20.370 JupyterHub metrics:278] Found 1 active users in the last ActiveUserPeriods.twenty_four_hours
[I 2023-11-07 17:59:20.370 JupyterHub metrics:278] Found 5 active users in the last ActiveUserPeriods.seven_days
[I 2023-11-07 17:59:20.371 JupyterHub metrics:278] Found 28 active users in the last ActiveUserPeriods.thirty_days
[I 2023-11-07 17:59:20.371 JupyterHub app:3142] Not starting proxy
[D 2023-11-07 17:59:20.372 JupyterHub proxy:880] Proxy: Fetching GET http://proxy-api:8001/api/routes
[W 2023-11-07 17:59:40.392 JupyterHub proxy:899] api_request to the proxy failed with status code 599, retrying...
[W 2023-11-07 18:00:00.555 JupyterHub proxy:899] api_request to the proxy failed with status code 599, retrying...
[E 2023-11-07 18:00:00.556 JupyterHub app:3382]
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/app.py", line 3380, in launch_instance_async
        await self.start()
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/app.py", line 3146, in start
        await self.proxy.get_all_routes()
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/proxy.py", line 946, in get_all_routes
        resp = await self.api_request('', client=client)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/proxy.py", line 910, in api_request
        result = await exponential_backoff(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/jupyterhub/utils.py", line 237, in exponential_backoff
        raise asyncio.TimeoutError(fail_message)
    TimeoutError: Repeated api_request to proxy path "" failed.

[D 2023-11-07 18:00:00.558 JupyterHub application:1031] Exiting application: jupyterhub

and from describe on the hub pod:

Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       12m                     default-scheduler  Successfully assigned jhub/hub-6bf4b55dd4-v9rgb to mscjupyter-k8s-1-23-6-2023-0-2-4cjjzmnjq6p6-node-7
  Normal   Pulled          12m                     kubelet            Container image "jupyterhub/k8s-hub:3.1.0" already present on machine
  Normal   Created         12m                     kubelet            Created container hub
  Normal   Started         12m                     kubelet            Started container hub
  Warning  Unhealthy       11m (x22 over 12m)      kubelet            Readiness probe failed: Get "http://10.100.116.147:8081/hub/health": dial tcp 10.100.116.147:8081: connect: connection refused
  Normal   SandboxChanged  7m27s (x3 over 7m42s)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled          7m26s                   kubelet            Container image "jupyterhub/k8s-hub:3.1.0" already present on machine
  Normal   Created         7m26s                   kubelet            Created container hub
  Normal   Started         7m25s                   kubelet            Started container hub
  Warning  Unhealthy       6m52s (x19 over 7m25s)  kubelet            Readiness probe failed: Get "http://10.100.116.148:8081/hub/health": dial tcp 10.100.116.148:8081: connect: connection refused
  Warning  BackOff         2m42s (x12 over 6m41s)  kubelet            Back-off restarting failed container

I’ve tried restarting the proxy pod, and this hasn’t made any difference.

Here’s our singleuser config:

singleuser:
  nodeSelector:
    node.kubernetes.io/instance-type: m2.highmem
  cpu:
    guarantee: 0.5
    limit: 1
  memory:
    guarantee: 2G
    limit: 2G
  storage:
    dynamic:
      storageClass: csi-sc-cinderplugin
  image:
    <image stuff>

There’s also hub authentication config and HTTPS proxy config; none of that has changed, so it’s unlikely to be implicated.

Nearly all pods are showing as Running, except some hook-image-awaiter pods, whose logs show errors like:

2023/11/07 17:22:52 GET https://kubernetes.default.svc:443/apis/apps/v1/namespaces/jhub/daemonsets/hook-image-puller giving up after 6 attempt(s): Get "https://kubernetes.default.svc:443/apis/apps/v1/namespaces/jhub/daemonsets/hook-image-puller": dial tcp: lookup kubernetes.default.svc on 10.254.0.10:53: read udp 10.100.15.241:58500->10.254.0.10:53: i/o timeout

We’re using Calico for the networking. I’ve tried restarting one or two of the Calico pods and, not surprisingly, this hasn’t helped. I’m reluctant to touch the calico-kube-controllers pod, which is showing as Running.

It’s z2jh 3.1.0.

As you’ve guessed, this is almost certainly a problem with your k8s cluster rather than JupyterHub, especially if it was working fine before. Both the hub’s timeouts reaching http://proxy-api:8001/api/routes and the hook-image-awaiter’s DNS lookup timeouts against 10.254.0.10:53 point at pod-to-pod traffic or cluster DNS being broken.

To help with debugging you can try things like disabling all network policies and forcing all pods to run on a single node (see the sketch below). If either of those solves the problem, that gives you some pointers on where to look in Calico/k8s.
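
For that experiment, a Helm values override along these lines should work with z2jh 3.1.0. This is only a sketch, not something I’ve tested against your cluster: <node-name> is a placeholder for whichever node you pick (matched via its kubernetes.io/hostname label), and overriding singleuser.nodeSelector will temporarily replace your m2.highmem selector.

hub:
  networkPolicy:
    enabled: false
  nodeSelector:
    kubernetes.io/hostname: <node-name>
proxy:
  chp:
    networkPolicy:
      enabled: false
    nodeSelector:
      kubernetes.io/hostname: <node-name>
  # if you use the chart's autohttps pod, proxy.traefik.networkPolicy.enabled also exists
singleuser:
  networkPolicy:
    enabled: false
  nodeSelector:
    kubernetes.io/hostname: <node-name>

Remember to re-enable the network policies once you’ve narrowed things down, since they provide the chart’s isolation between pods.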