Liveness probe in jupyter-server

Hi, we’re using zero-to-jupyterhub-k8s 0.11.1 with jupyterhub 1.3.0, jupyterlab 3.1.7 and jupyter-server 1.10.2. Our singleuser-notebook pods are backed by object storage for the file system rather than persistent volumes. From time to time there is a hiccup with the object storage connection which breaks the server and we see an error like this in the logs:

Sep 1 09:41:43 jupyter-60fa26b0bdc80340c8a98b6a notebook WARNING WARNING 2021-09-01T14:41:43.234Z [SingleUserLabApp handlers:603] No such file or directory:

The pod status is Running but when trying to exec into the pod we get an error, something like this:

error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec “63f044962ed1edbb5c60378873bf998d4a8dbc2465fd29f6c6f93c844332c34e”: OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: chdir to cwd ("/home/jovyan") set in config.json failed: transport endpoint is not connected: unknown

Deleting the pod and having the user restart their notebook server resolves the issue. Normally we would direct users to the Hub Control Panel in the JupyterLab interface so they can stop and start their notebook server pod themselves, but in this case users were getting a Directory not found error when trying to load the File menu, which prevented them from reaching the Hub Control Panel.

What I’m wondering is whether there is a way to write some kind of custom liveness probe and package it into the notebook server pod, so that Kubernetes kills the pod when the server can no longer work with the file system (s3fs). I looked through the jupyter-server docs and config options, but no supported hook for this stood out to me. I saw the extra_services option, but that looks more like a way to add API handler extensions to the server web app, which isn’t what we’re thinking of here. Are there better hook points for something like this, or are we better off just writing a script that runs on a cron inside the notebook server image?
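For the cron-script route, the core of such a check can be quite small. A sketch (the marker-file name is arbitrary, and in the pod the directory would be the s3fs mount, e.g. /home/jovyan — the temp dir below is just so the example runs anywhere):

```python
# Minimal filesystem health check: verify a directory still accepts writes
# by touching and then removing a marker file.
import pathlib
import sys
import tempfile


def filesystem_ok(directory: str) -> bool:
    """Return True if we can touch and remove a marker file in `directory`."""
    marker = pathlib.Path(directory) / ".fs-healthcheck"
    try:
        marker.touch()
        marker.unlink()
        return True
    except OSError:
        # Covers the "transport endpoint is not connected" class of errors.
        return False


# In the pod this would check the s3fs mount, e.g. filesystem_ok("/home/jovyan").
if filesystem_ok(tempfile.gettempdir()):
    print("filesystem ok")
else:
    # A cron wrapper could react here, e.g. by killing the server process
    # so the container exits and Kubernetes restarts it.
    sys.exit(1)
```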

Thanks for any help.


I think a jupyter-server health check endpoint would be pretty neat, and potentially something that could be integrated into KubeSpawner and therefore the Z2JH Helm Chart.

Is there any chance this extraPodConfig hook in z2jh could be used? I see that it maps to extra_pod_config in kubespawner. I’m not totally sure that will work, though, since livenessProbe lives under the containers item in the pod spec and I don’t know whether we can reach it from there.

It looks like extra_pod_config is merged in with the pod spec, but it’s a shallow merge, so you wouldn’t be able to recurse into containers[].livenessProbe.
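To illustrate the problem (this is a toy sketch, not KubeSpawner’s actual merge code): with a shallow merge, a top-level key like containers from extra_pod_config replaces the spawner’s containers list wholesale instead of being merged into it.

```python
# Why a shallow merge of extra_pod_config can't reach containers[].livenessProbe:
# the top-level 'containers' key is overwritten, not recursively merged.
pod_spec = {
    "restartPolicy": "OnFailure",
    "containers": [{"name": "notebook", "image": "jupyter/base-notebook"}],
}
extra_pod_config = {
    "containers": [
        {"livenessProbe": {"exec": {"command": ["touch", "/home/jovyan"]}}}
    ],
}

merged = dict(pod_spec)
merged.update(extra_pod_config)  # shallow: replaces 'containers' entirely

# The original container entry (name, image) is gone from the merged spec.
assert "image" not in merged["containers"][0]
```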

OK that’s what I figured, thanks for confirming.


Is c.KubeSpawner.extra_container_config something that can be set from z2jh? I’m not seeing any reference to that config option in the z2jh repo, but it seems like exactly what we want in our case: a liveness probe that runs a command inside the container to make sure the file system is OK.

It doesn’t look like there’s an explicit config for it in Z2JH. The recommended way to enable it is to use hub.extraConfig which lets you add arbitrary Python configuration.

If you’re feeling adventurous there’s a new hub.config parameter which is intended to map directly to the Traitlets configuration. At the moment it’s only supported for configuring Authentication as there may be conflicts or inter-dependencies with some other parameters, but in future it should mean most Traitlets can be used without needing to modify the Z2JH helm chart.
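For illustration, the two options look roughly like this in the helm values (the extraConfig key name "myConfig" is arbitrary, and the traitlet values are just examples):

```yaml
hub:
  # hub.extraConfig: arbitrary Python appended to the hub's configuration.
  extraConfig:
    myConfig: |
      c.KubeSpawner.environment = {"EXAMPLE_FLAG": "1"}
  # hub.config: maps directly onto Traitlets; currently documented for
  # Authentication-related classes only.
  config:
    Authenticator:
      admin_users:
        - admin-user
```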

Thanks for the tip. We ended up using that with c.KubeSpawner.extra_container_config to set a liveness probe like this:

    spawnerConfig: |
      c.KubeSpawner.extra_container_config = {
        'livenessProbe': {
          'exec': {
            'command': [
              'touch',
              '/home/jovyan/.jupyter'
            ]
          },
          'initialDelaySeconds': 10,
          'periodSeconds': 60,
          'timeoutSeconds': 30
        }
      }

We’ve got that deployed in our pre-production environment and so far so good, it shows up in the pod spec as expected:

Liveness: exec [touch /home/jovyan/.jupyter] delay=10s timeout=30s period=60s #success=1 #failure=3


Just an update on this. It turns out this type of exec liveness probe doesn’t help with the issue we were having, where the file system connection was essentially broken. The exec call itself could not run, and Kubernetes treats that as an unknown error rather than a probe failure, so the container was still effectively unusable but the probe never killed it.

So we developed a simple jupyter-server extension (kudos to the docs and examples in that repo, which went a long way toward letting me write the extension in a day) that implements the same touch check behind a REST API, so we could use an HTTP probe in the container spec.
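For anyone following along, here’s a minimal sketch of what such an extension can look like. The route name (fs-health), the marker-file approach, and the hard-coded check directory are illustrative, not our exact code; the handler is left unauthenticated here so a plain probe can call it, but you could require the API token instead.

```python
# Hypothetical jupyter-server extension exposing a filesystem health endpoint:
# GET <base_url>/fs-health touches a marker file on the (s3fs-backed) volume
# and returns 503 when that fails.
import pathlib

import tornado.web
from jupyter_server.base.handlers import APIHandler
from jupyter_server.utils import url_path_join

CHECK_DIR = "/home/jovyan"  # example path; make this configurable in practice


def filesystem_ok(directory: str) -> bool:
    """Return True if we can touch and remove a marker file in `directory`."""
    marker = pathlib.Path(directory) / ".fs-healthcheck"
    try:
        marker.touch()
        marker.unlink()
        return True
    except OSError:
        return False


class FsHealthHandler(APIHandler):
    # Intentionally no @tornado.web.authenticated, so the probe needs no token.
    def get(self):
        if not filesystem_ok(CHECK_DIR):
            raise tornado.web.HTTPError(503, "filesystem check failed")
        self.finish({"status": "ok"})


def _load_jupyter_server_extension(server_app):
    """Register the handler when jupyter-server loads the extension."""
    web_app = server_app.web_app
    route = url_path_join(web_app.settings["base_url"], "fs-health")
    web_app.add_handlers(".*$", [(route, FsHealthHandler)])
```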

The problem we hit with that is that the route to the extension in the notebook container is GET /user/:userid/<extension path>, and while the user ID is available to the container via environment variables, it isn’t known when the liveness probe spec is applied, so we couldn’t use it directly: Kubernetes doesn’t expand environment variables in an httpGet probe path, so the probe always failed trying to GET a literal route like /user/$JUPYTERHUB_USER/<extension path>.

So to work around that issue, we developed a simple little app that exposes a GET /health endpoint and proxies the GET /user/:userid/<extension path> API. We package it and deploy it as a sidecar container in the singleuser-server pod using singleuser.extraContainers in the z2jh helm chart.
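A stdlib-only sketch of such a sidecar (the route name fs-health, the UPSTREAM_HOST variable, and the port are examples; auth is omitted on the assumption the extension endpoint is unauthenticated):

```python
# Hypothetical sidecar: a tiny HTTP server exposing GET /health that forwards
# to the jupyter-server extension route, filling in the user name from the
# environment -- which the probe spec itself cannot do.
import os
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def upstream_url() -> str:
    # JUPYTERHUB_USER is set by the spawner in the pod's environment.
    user = os.environ.get("JUPYTERHUB_USER", "unknown")
    host = os.environ.get("UPSTREAM_HOST", "localhost:8888")
    return f"http://{host}/user/{user}/fs-health"


class HealthProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        try:
            with urllib.request.urlopen(upstream_url(), timeout=10) as resp:
                healthy = resp.status == 200
        except (urllib.error.URLError, OSError):
            # Upstream unreachable or returned an error status.
            healthy = False
        self.send_response(200 if healthy else 503)
        self.end_headers()
        self.wfile.write(b"ok" if healthy else b"unhealthy")

    def log_message(self, *args):
        # Keep the once-a-minute probe traffic out of the sidecar logs.
        pass


def main(port: int = 8000):
    HTTPServer(("", port), HealthProxyHandler).serve_forever()
```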

Then we changed the notebook liveness probe to use GET /health on the sidecar container, which proxies the jupyter-server extension endpoint that does the touch, and it’s all working swimmingly.
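Wired together in the helm chart, the result looks roughly like this (the image name, port, and probe timings are illustrative; containers in a pod share a network namespace, so the notebook container’s probe can target the sidecar’s port):

```yaml
singleuser:
  extraContainers:
    - name: fs-health-proxy
      image: example.registry/fs-health-proxy:latest  # hypothetical image
      ports:
        - containerPort: 8000
hub:
  extraConfig:
    spawnerConfig: |
      c.KubeSpawner.extra_container_config = {
        'livenessProbe': {
          'httpGet': {
            'path': '/health',
            'port': 8000
          },
          'initialDelaySeconds': 10,
          'periodSeconds': 60,
          'timeoutSeconds': 30
        }
      }
```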

That took a lot more work than initially expected, but I learned a lot along the way: I wrote my first jupyter-server extension and added our first extra container to the singleuser-server pod via the helm chart. All of that was very easy once we figured out what we needed to do.
