JupyterLab singleuser kernels are disconnecting due to apparent auth and WebSocket errors

I am running a z2jh setup with relatively vanilla config options. Everything appears to run smoothly until about 20 lines into executing any notebook, at which point the kernel crashes (see screenshot).

I don’t see any obvious candidate for what’s getting in the way here. In other threads I have found and checked the following:

Version issues: the singleuser pods have jupyterhub==4.1.6, jupyterlab==4.2.4, and websocket-client==1.8.0 installed, and the main hub pod has jupyterhub==4.1.6.

Ingress issues: I have seen that traefik can cause this behavior when used as an ingress, but I am not using one. I'm just using Let's Encrypt for HTTPS, nothing else.

Feeling pretty stuck here and would appreciate advice!

Possibly related to Couldn't authenticate WebSocket connection - #8 by Yan_Vulich?

If there were an issue with websockets, your notebook would not work in the first place. Since this happens only after executing certain cells, I assume you are hitting some resource constraint imposed on your single-user server pod by k8s. Could you share your pod logs, which should give more details?


@mahendrapaipuri excellent point that websockets are unlikely to be the culprit here. Perhaps those errors only appear after the crash.

I’ve reproduced the crash while tailing the logs; here are the relevant lines:

[C 2024-09-18 15:26:00.931 ServerApp] received signal 15, stopping
[I 2024-09-18 15:26:00.932 ServerApp] Shutting down 7 extensions
[I 2024-09-18 15:26:00.933 ServerApp] Shutting down 2 kernels
[I 2024-09-18 15:26:00.936 ServerApp] Kernel shutdown: 96e22a0d-b684-4deb-b8db-f0c8693f91c2
[I 2024-09-18 15:26:00.937 ServerApp] Kernel shutdown: 684af038-581f-4709-abd5-571dee097d9a

All other lines in the logs are simple 200 responses to various APIs, so not relevant.

Unfortunately “signal 15” (SIGTERM) is an incredibly generic Linux signal that can have many root causes, so I’ve got a bit of research to do. At least there appear to be some other threads about it on this forum.


OK, progress! tl;dr: it’s a memory allocation issue.

I watched head -3 /proc/meminfo on the singleuser pod while running the notebook and saw the available memory slowly decrease as I worked through it. Eventually I got exit code 137 (128 + SIGKILL), and then the ssh session dropped.

I then ran kubectl describe pod <Pod_Name> and confirmed the container was OOMKilled.

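For anyone else debugging this, the same evidence shows up in the pod status, e.g. via kubectl get pod <Pod_Name> -o yaml. A rough sketch of the relevant fragment (the container name and restart count are placeholders; z2jh's singleuser container is usually called notebook):

status:
  containerStatuses:
    - name: notebook          # singleuser container (placeholder; usually "notebook" in z2jh)
      restartCount: 1
      lastState:
        terminated:
          reason: OOMKilled   # the kernel's OOM killer terminated the process
          exitCode: 137       # 128 + 9 (SIGKILL)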

The OOMKill in turn matches the logic described in step 6 of the Explicit Memory and CPU Allocated to Core Pods Containers documentation:

A container running out of memory will get its process killed by a Linux Out Of Memory Killer (OOMKiller). When this happens you should see a trace of it by using kubectl describe pod --namespace <k8s-namespace> <k8s-pod-name> and kubectl logs --previous --namespace <k8s-namespace> <k8s-pod-name>. When a container’s process has been killed, the container will restart if the container’s restartPolicy allows it, and otherwise the pod will be evicted.

Looks like I’ll be able to manage this by populating the singleuser.memory guarantee and limit (the former is what k8s calls a request), which default to 1G and null, respectively.

I’ll likely set the limit higher and leave the guarantee null (see the sketch below), which according to the k8s docs will result in the following behavior:

If you specify a limit for a resource, but do not specify any request, and no admission-time mechanism has applied a default request for that resource, then Kubernetes copies the limit you specified and uses it as the requested value for the resource.
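
For reference, a minimal sketch of what that might look like in the Helm values file; the 4G figure is purely an illustrative placeholder, not a recommendation:

singleuser:
  memory:
    # Hard cap: the container is OOMKilled if it exceeds this amount.
    limit: 4G
    # z2jh's name for the k8s memory request; left unset (null) here so that
    # Kubernetes copies the limit into the request, per the docs quoted above.
    guarantee:

Setting the guarantee equal to the limit instead would avoid memory overcommit on the nodes, at the cost of packing fewer users onto each one.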
