Liveness probe in jupyter-server

Hi, we’re using zero-to-jupyterhub-k8s 0.11.1 with jupyterhub 1.3.0, jupyterlab 3.1.7 and jupyter-server 1.10.2. Our singleuser-notebook pods are backed by object storage for the file system rather than persistent volumes. From time to time there is a hiccup with the object storage connection which breaks the server and we see an error like this in the logs:

Sep 1 09:41:43 jupyter-60fa26b0bdc80340c8a98b6a notebook WARNING WARNING 2021-09-01T14:41:43.234Z [SingleUserLabApp handlers:603] No such file or directory:

The pod status is Running but when trying to exec into the pod we get an error, something like this:

error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "63f044962ed1edbb5c60378873bf998d4a8dbc2465fd29f6c6f93c844332c34e": OCI runtime exec failed: exec failed: container_linux.go:380: starting container process caused: chdir to cwd ("/home/jovyan") set in config.json failed: transport endpoint is not connected: unknown

Deleting the pod and having the user restart the notebook server resolves the issue. Normally, with the JupyterLab interface, we direct users to the Hub Control Panel to stop and start their notebook server pod themselves. In this case, however, users got a "Directory not found" error when trying to open the File menu, which prevented them from reaching the Hub Control Panel.

What I’m wondering is whether there is a way to write a custom liveness probe and package it into the notebook server pod so that the pod is killed if it can no longer work with the file system (s3fs). I looked through the jupyter-server docs and config options, but no supported hook stood out to me. I saw the extra_services option, but that looks more like a way to add API handler extensions to the server web app, which isn’t what we have in mind here. Are there better hook points for something like this, or are we better off just writing a script that runs on a cron within the notebook server image?
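For illustration, here is a minimal sketch of the kind of check such a probe or cron script could run. It is not part of jupyter-server; the names `filesystem_ok` and `main` and the default path `/home/jovyan` are assumptions chosen for this example. It simply tries to create and remove a small file in the target directory and reports failure on any OS-level error.

```python
import sys
import tempfile


def filesystem_ok(path: str) -> bool:
    """Return True if a small file can be created, written and removed under path."""
    try:
        # NamedTemporaryFile creates the file and deletes it again on close.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".probe-") as handle:
            handle.write(b"ok")
            handle.flush()
        return True
    except OSError:
        # Covers a missing directory, a disconnected mount, permission errors, etc.
        return False


def main(argv: list) -> int:
    """Return 0 if the directory is usable and 1 otherwise (exit-code convention)."""
    # Hypothetical default; pass another directory as the first argument.
    target = argv[1] if len(argv) > 1 else "/home/jovyan"
    return 0 if filesystem_ok(target) else 1
```

To use it as a command, the script would end with `raise SystemExit(main(sys.argv))`. A call against a hung mount may block rather than fail, so whatever runs the check would also need its own timeout.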

Thanks for any help.


I think a jupyter-server health check endpoint would be pretty neat, and potentially something that could be integrated into KubeSpawner and therefore the Z2JH Helm Chart.

Is there any chance this extraPodConfig hook in z2jh could be used? I see that maps to extra_pod_config in kubespawner. I’m not totally sure how that will work since the livenessProbe is part of the containers item in the pod spec and I’m not sure if we can touch that from here.

It looks like extra_pod_config is merged in with the pod spec, but it’s a shallow merge, so you wouldn’t be able to recurse into containers[].readiness_probe.
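To make the shallow-merge point concrete, here is a small illustration using plain dicts. This is not KubeSpawner's code; the pod spec contents and the image name are made up, and it assumes top-level keys are simply overwritten.

```python
# Illustration only: plain dicts standing in for a pod spec.
pod_spec = {
    "restartPolicy": "OnFailure",
    "containers": [{"name": "notebook", "image": "example/notebook:latest"}],
}

# Hypothetical extra config that tries to add a probe to the notebook container.
extra_pod_config = {
    "containers": [{"livenessProbe": {"exec": {"command": ["true"]}}}],
}

# A shallow merge replaces each top-level key wholesale.
merged = {**pod_spec, **extra_pod_config}

# The original container entry (name, image) is gone, not extended.
print(merged["containers"])
```

The whole `containers` list is swapped out rather than having the probe added to the existing container, which is why a container-level option is needed instead.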

OK that’s what I figured, thanks for confirming.


Is c.KubeSpawner.extra_container_config something that can be set from z2jh? I’m not seeing any reference to that config option in the z2jh repo, but it seems like that’s what we’d want in our case. We just want a liveness probe hook that runs a command within the container to make sure the file system is OK.

It doesn’t look like there’s an explicit config for it in Z2JH. The recommended way to enable it is to use hub.extraConfig which lets you add arbitrary Python configuration.

If you’re feeling adventurous there’s a new hub.config parameter which is intended to map directly to the Traitlets configuration. At the moment it’s only supported for configuring Authentication as there may be conflicts or inter-dependencies with some other parameters, but in future it should mean most Traitlets can be used without needing to modify the Z2JH helm chart.

Thanks for the tip. We ended up using hub.extraConfig with c.KubeSpawner.extra_container_config to set a liveness probe like this:

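    # Entry under hub.extraConfig; the block below is Python added to the hub configuration.
    spawnerConfig: |
      # Extra fields for the notebook container spec.
      c.KubeSpawner.extra_container_config = {
        'livenessProbe': {
          # Fails if the path under the home directory cannot be touched.
          'exec': {
            'command': [
              'touch',
              '/home/jovyan/.jupyter'
            ]
          },
          # Wait 10s after start, then probe every 60s with a 30s timeout.
          'initialDelaySeconds': 10,
          'periodSeconds': 60,
          'timeoutSeconds': 30
        }
      }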

We’ve got that deployed in our pre-production environment and so far so good. It shows up in the pod spec as expected:

Liveness: exec [touch /home/jovyan/.jupyter] delay=10s timeout=30s period=60s #success=1 #failure=3
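As a back-of-the-envelope check on those numbers, the sketch below estimates how long a broken mount could go unnoticed before the container is restarted. It assumes probes fire every `periodSeconds` regardless of how long each one takes, that a restart follows `#failure` consecutive failures, and it ignores the initial delay and any scheduling jitter. The function name is made up for this example.

```python
from typing import Tuple


def restart_window_seconds(period: int, timeout: int, failure_threshold: int) -> Tuple[int, int]:
    """Rough (best, worst) seconds from the fault starting to the restart being triggered."""
    # Best case: the fault starts just before a probe and every probe fails immediately.
    best = period * (failure_threshold - 1)
    # Worst case: the fault starts just after a passing probe and every probe hangs until the timeout.
    worst = period * failure_threshold + timeout
    return best, worst


# Values from the probe above: period=60s, timeout=30s, #failure=3.
print(restart_window_seconds(period=60, timeout=30, failure_threshold=3))  # (120, 210)
```

So with these settings a user could be stuck for roughly two to three and a half minutes; lowering `periodSeconds` would shorten that at the cost of more frequent probe commands.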
