Which is the correct way to cull idle kernels and notebook servers?

While the two aim in general to solve the same category of problem (wasted resources), they have different metrics available and different levels of action to take.

The notebook-environment configuration should in general produce better, more fine-grained results, because it can cull individual unused kernels and it knows each kernel’s idle/busy and connected status. This lets the notebook server make more intelligent choices, like “shut down a kernel if it’s been idle for 5 minutes, BUT not if there’s an open tab currently connected to it and/or it’s in the middle of running a long computation”. Then, finally, it can shut down the server itself if there have been no API requests and no running kernels for NotebookApp.shutdown_no_activity_timeout seconds. This is better for users, because it makes it easier for them to keep their kernels and/or servers running without being inappropriately culled (see various discussions on mybinder.org about how the current culling logic is deleting sessions that people feel like they are still using).
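To make that per-kernel rule concrete, here is a minimal sketch of the decision as I understand it. It is a simplified model rather than the notebook server's actual code: `KernelState` and `should_cull_kernel` are hypothetical names, and only the keyword arguments mirror real `MappingKernelManager` options.

```python
from dataclasses import dataclass


@dataclass
class KernelState:
    """Hypothetical snapshot of what the notebook server knows about one kernel."""
    idle_seconds: float  # time since the kernel's last recorded activity
    connections: int     # open websocket connections (e.g. browser tabs)
    busy: bool           # True while the kernel is executing code


def should_cull_kernel(kernel, cull_idle_timeout=300,
                       cull_connected=False, cull_busy=False):
    """Return True if this kernel would be culled under the given settings."""
    if cull_idle_timeout <= 0:
        return False  # a timeout of 0 disables kernel culling
    if kernel.idle_seconds <= cull_idle_timeout:
        return False  # not idle for long enough yet
    if kernel.connections > 0 and not cull_connected:
        return False  # an open tab is still attached
    if kernel.busy and not cull_busy:
        return False  # a long computation is still running
    return True


# Idle for 10 minutes with no tab attached: culled.
print(should_cull_kernel(KernelState(600, 0, False)))
# Same idle time, but a tab is still connected: kept.
print(should_cull_kernel(KernelState(600, 1, False)))
```

With cull_connected and cull_busy left at False, a kernel is only culled when all three signals agree that nobody is using it.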

One critical shortcoming of the internal culling logic is that terminal activity is not measured: open terminals are always considered active and never register as idle. If you leave a terminal running, the internal culler will never shut down the server itself. This should be considered a missing feature that we need to implement.
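A rough sketch of the server-level check shows why an open terminal blocks shutdown. Again, this is a simplified model with a hypothetical function name, not the actual notebook source; only `shutdown_no_activity_timeout` is a real option name.

```python
def should_shut_down_server(running_kernels, open_terminals,
                            seconds_since_last_activity,
                            shutdown_no_activity_timeout):
    """Return True if the notebook server would shut itself down."""
    if shutdown_no_activity_timeout <= 0:
        return False  # 0 disables the self-shutdown
    if running_kernels > 0:
        return False  # kernels are still running (the kernel culler may remove them later)
    if open_terminals > 0:
        return False  # terminals have no idle measure, so they always count as active
    return seconds_since_last_activity > shutdown_no_activity_timeout


# Two hours of silence, but one forgotten terminal: the server stays up.
print(should_shut_down_server(0, 1, 7200, 600))
# Same silence with no terminals: the server shuts down.
print(should_shut_down_server(0, 0, 7200, 600))
```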

The external jupyterhub culler has much less granular information to act on: only whether there has been network traffic to the service, as measured by the proxy. It can also only shut down the whole server. It can simply measure “has anybody talked to me in the last X minutes?”, with no information about what operations were performed, whether anything is running, etc. It’s vulnerable to false activity registered by left-open tabs. The external culler is also insensitive to the user’s environment, which matters because we wouldn’t want users on mybinder.org to set their own cull parameters, which they could if the internal culler were our only mechanism.
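By contrast, the external culler's decision boils down to a single timestamp comparison. A minimal sketch, assuming the culler sees one last-activity timestamp per server (the function name is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def should_cull_server(last_activity, now, timeout_seconds):
    """Return True if nobody has talked to the server within the timeout."""
    return (now - last_activity) > timedelta(seconds=timeout_seconds)


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)

# No traffic for two hours against a one-hour timeout: culled.
print(should_cull_server(now - timedelta(hours=2), now, 3600))
# A left-open tab produced traffic a minute ago, so the server looks active.
print(should_cull_server(now - timedelta(minutes=1), now, 3600))
```

Any request that refreshes the timestamp resets the clock, which is exactly how a forgotten tab can keep a server alive indefinitely.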

I would consider it a best practice in general to use both of these, because they can both be fooled in different ways. How exactly you configure them will depend on your relationship with your users and computational resources. Generally, I would say that the internal culler should have shorter timeouts because it can be smarter, especially if cull_connected and cull_busy are False. This is both because it’s less likely to have a false positive for shutdown, and because losing a kernel is less disruptive than losing a whole notebook server (no notebook data loss, only kernel state). Then the external jupyterhub culler can use a more coarse-grained timeout (say, 1 hour).
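As a starting point, the two layers might be configured something like this. The timeout values are illustrative only. The first fragment uses the notebook server's kernel-culling options; the second assumes the external culler is the jupyterhub-idle-culler package running as a Hub service, which may not match your deployment.

```python
# jupyter_notebook_config.py (single-user server: the internal culler)
c = get_config()  # noqa: F821 (get_config is injected when Jupyter loads this file)

c.MappingKernelManager.cull_idle_timeout = 600     # cull kernels idle for 10 minutes
c.MappingKernelManager.cull_interval = 60          # check for idle kernels every minute
c.MappingKernelManager.cull_connected = False      # keep kernels with an open tab
c.MappingKernelManager.cull_busy = False           # keep kernels that are still computing
c.NotebookApp.shutdown_no_activity_timeout = 1200  # then stop the server after 20 quiet minutes
```

```python
# jupyterhub_config.py (Hub side: the external culler)
import sys

c = get_config()  # noqa: F821 (get_config is injected when JupyterHub loads this file)

c.JupyterHub.services = [
    {
        "name": "idle-culler",
        # Assumption: older-style permission grant. Newer JupyterHub releases
        # expect roles/scopes here instead, so check the docs for your version.
        "admin": True,
        "command": [
            sys.executable, "-m", "jupyterhub_idle_culler",
            "--timeout=3600",  # coarse-grained: cull whole servers after 1 hour
        ],
    }
]
```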

On mybinder.org, we use the internal culler to cull more aggressively, so that left-open websocket connections don’t block shutdown. I’d have to do some digging to find how often each culler is responsible for a given pod’s shutdown.

If the first method isn’t working at all, I’d first add --debug to the single-user launch command and make sure that the configuration is being loaded in the first place. Then you might be able to dig into why it doesn’t think things are idle if they should be.
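One way to pass that flag, assuming the single-user servers are launched by a JupyterHub Spawner, is through the Spawner's extra arguments:

```python
# jupyterhub_config.py
c = get_config()  # noqa: F821 (get_config is injected when JupyterHub loads this file)

# Extra command-line arguments appended to the single-user server command.
c.Spawner.args = ["--debug"]
```

With debug logging on, the single-user server's log should show which config files were loaded, so you can confirm that your culling settings were actually picked up.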
