A "cull idle user" service that deletes PVs

Has anyone investigated, coded, or thought about using the activity information of JupyterHub users to remove the PVCs and PVs of users who haven’t been active for more than 30 days (or some other configurable period)?

The background is the JupyterHub of the OpenHumans foundation. They offer storage to everyone who logs in, but as usual many people only log in once or twice and then never again. Over time (in this case more than 2 years) you accumulate a lot of PVs that will never be used again. Automating the clean-up would help save money and remove a manual task.

Another use case could be hubs used for teaching, where students come for a course lasting a few weeks, a semester, or several semesters. Eventually they stop using the hub, but their PVs continue to cost money.

An alternative approach is to add an NFS server (or use your cloud vendor’s managed equivalent) to provide shared storage to users. This removes the need to clean up PVs but adds complexity, or increases cost if you don’t need the minimum storage size your cloud vendor sets.

The current idea is to create a JupyterHub service similar to the “cull idle users” service. The service would check when a user was last active and remove their PV after a long period of inactivity. Deleting data is never nice, so maybe the service could even notify users by email before it deletes their data.
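To make the "check when a user was last active" step concrete, here is a minimal sketch. It assumes the service already holds a list of user records shaped like the user models of the JupyterHub REST API (a `name` and an ISO 8601 `last_activity` that may be null). The function names and the 30-day default are illustrative choices, not existing code.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp such as '2020-05-01T12:00:00.000000Z'."""
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def inactive_users(users, now, max_idle=timedelta(days=30)):
    """Return the names of users whose last activity is older than max_idle.

    Users without any recorded activity are skipped rather than treated
    as inactive, so a missing timestamp never leads to a deletion.
    """
    names = []
    for user in users:
        last_activity = user.get("last_activity")
        if last_activity is None:
            continue
        if now - parse_timestamp(last_activity) > max_idle:
            names.append(user["name"])
    return names


# Example records (made up for illustration).
users = [
    {"name": "alice", "last_activity": "2020-06-20T09:30:00.000000Z"},
    {"name": "bob", "last_activity": "2020-01-15T18:00:00.000000Z"},
    {"name": "carol", "last_activity": None},
]
now = datetime(2020, 7, 1, tzinfo=timezone.utc)
print(inactive_users(users, now))
```

Passing `now` in explicitly keeps the selection logic easy to test, and the same list could feed a "your data will be deleted soon" email step before anything is removed.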

If you know of existing code or discussions, I’d be happy to hear about them.

I just want to add myself to the discussion because it sounds interesting to me as well.

@betatim what exactly do PVC and PV stand for?

PV = persistent volume and PVC = persistent volume claim

These are Kubernetes terms. Super simplified, they mean “a hard drive you can connect to a pod”. https://kubernetes.io/docs/concepts/storage/persistent-volumes/ is probably a good place to start reading.

Ah great thanks! I haven’t used kubernetes before and I was happy with the options of docker-compose & classic linux administration.

I don’t have much knowledge of the relationship between JupyterHub and the Spawner…
I just want to suggest an idea: what about adding a trigger to the delete-user API that clears the spawner’s resources? Then cull_idle_users could delete a user who is both inactive and old enough. (Different age and inactivity thresholds should be applied, though.)
DELETE, /api/users/

I’ve been thinking it’s a bit strange that deleting a user clears the user’s data from the DB but leaves the user’s storage as it was. If a deleted user accesses the service again, they can keep working with their previous data. That could be better UX, but it could sometimes lead to bad situations, especially in terms of privacy…

I think the idea is that culling a user is something that happens a lot, even during normal operations. As a company running a hub you might cull inactive users every 16h or faster and active users after 5 days or so. This means you want to decouple deleting a user from deleting their PVs.

The other thought was that deleting a PV is very final. If you did that by accident it is very unlikely that you could revert it. So better to make it a manual step.

I think having a flag that allows DELETE /api/users/<name> to tell a spawner that it should run clean-up tasks for that user would be an interesting idea. Then kubespawner could offer to clean up PVs, though I think we should keep it turned off by default and put a big warning label on the option in kubespawner.
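A rough sketch of what such an opt-in flag could look like. Everything here is hypothetical: `SketchSpawner`, `delete_storage_on_user_delete`, `cleanup_storage` and `delete_user` are made-up names for illustration and are not part of the JupyterHub or kubespawner API.

```python
class SketchSpawner:
    """Stand-in for a spawner; not the real JupyterHub Spawner class."""

    def __init__(self, username, delete_storage_on_user_delete=False):
        self.username = username
        # Hypothetical option: off by default because deleting storage
        # cannot be undone.
        self.delete_storage_on_user_delete = delete_storage_on_user_delete
        self.storage_deleted = False

    def cleanup_storage(self):
        """Hypothetical hook; a real spawner would delete the PVC here."""
        self.storage_deleted = True


def delete_user(spawner, cleanup=False):
    """Handle a user deletion request, optionally cleaning up storage.

    Storage is removed only if the request asks for it AND the spawner
    has been configured to allow it.
    """
    if cleanup and spawner.delete_storage_on_user_delete:
        spawner.cleanup_storage()
    return spawner.storage_deleted


print(delete_user(SketchSpawner("bob"), cleanup=True))
print(delete_user(SketchSpawner("bob", delete_storage_on_user_delete=True), cleanup=True))
```

Requiring both the request flag and the spawner option means that a culler which deletes users aggressively cannot remove storage unless the admin has explicitly opted in.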

Interesting idea. I don’t think it would be too hard to do. I might even take a try at writing such a service. Just at some interval (e.g. once a day), run a query to see if any users match and, if so, ask Kubernetes to remove their persistent volumes.
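The "ask Kubernetes to remove it" half could be sketched roughly as below. The claim naming scheme `claim-{username}` is an assumption and has to match however your spawner actually names PVCs. `delete_claim` is an injected callable so the sketch runs without a cluster. In a real deployment it could wrap `CoreV1Api.delete_namespaced_persistent_volume_claim(name, namespace)` from the official Kubernetes Python client.

```python
def claim_name(username, template="claim-{username}"):
    """Build the PVC name for a user (assumed naming scheme)."""
    return template.format(username=username)


def remove_storage(usernames, delete_claim, dry_run=True):
    """Delete the PVC of each user and return the claim names handled.

    delete_claim is any callable taking a claim name. With dry_run=True
    nothing is deleted; only the names that would be deleted are returned.
    """
    handled = []
    for username in usernames:
        name = claim_name(username)
        if not dry_run:
            delete_claim(name)
        handled.append(name)
    return handled


deleted = []  # records what a real delete call would have removed
print(remove_storage(["bob"], deleted.append))
print(remove_storage(["bob"], deleted.append, dry_run=False), deleted)
```

Note that deleting a PVC only removes the underlying PV if the volume's reclaim policy is `Delete`, so that is worth checking before relying on this to save money.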

Let me know if you get around to doing this. I’d be happy to inherit a script and/or exchange ideas and/or contribute.

Thinking about it a bit more, I was even thinking that something you run manually, and that potentially asks you to confirm each delete, would be a good start. Until you’ve run it a few times and start feeling like “yeah, this script probably won’t delete data of the wrong users if I automate it with a cron job” :smiley:
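A confirm-each-delete step could be as small as the sketch below. `confirm_deletions` is a made-up helper name, and the prompt source is passed in as `ask` so the function can be exercised without a terminal. In a real manual run you would leave it at the default `input`.

```python
def confirm_deletions(usernames, ask=input):
    """Ask for confirmation per user; return only the confirmed names.

    Anything other than an explicit 'y' or 'yes' counts as a refusal.
    """
    confirmed = []
    for username in usernames:
        answer = ask(f"Delete storage of {username!r}? [y/N] ")
        if answer.strip().lower() in ("y", "yes"):
            confirmed.append(username)
    return confirmed


# Scripted answers stand in for someone typing at the prompt.
answers = iter(["y", "", "no"])
print(confirm_deletions(["bob", "dave", "erin"], ask=lambda prompt: next(answers)))
```

Defaulting to "no" on an empty answer means that hammering Enter through the list never deletes anything.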

I would need such a “cull idle user and eventually delete all their data” service for removing Docker volumes, but I don’t use Kubernetes. If there were an abstraction layer in between, maybe both could benefit? Then only the implementation of how the user data is purged would differ. And as said in the initial post, jupyterhub-idle-culler could be a great starting point.
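One possible shape for such an abstraction layer, as a sketch only. `StorageBackend`, `purge` and `purge_users` are hypothetical names. The in-memory backend stands in for a real Kubernetes or Docker implementation, and the volume name `jupyterhub-user-{username}` is an assumed naming scheme.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical interface: how one user's data is purged."""

    @abstractmethod
    def purge(self, username):
        """Permanently remove the storage that belongs to username."""


class InMemoryBackend(StorageBackend):
    """Test double standing in for a Kubernetes or Docker backend."""

    def __init__(self, volumes):
        self.volumes = set(volumes)

    def purge(self, username):
        # Assumed naming scheme; a real backend would use its own.
        self.volumes.discard(f"jupyterhub-user-{username}")


def purge_users(usernames, backend):
    """Backend-agnostic part: purge every listed user."""
    for username in usernames:
        backend.purge(username)


backend = InMemoryBackend({"jupyterhub-user-alice", "jupyterhub-user-bob"})
purge_users(["bob"], backend)
print(sorted(backend.volumes))
```

With this split, the "who is inactive" logic lives in one place, and a Docker or Kubernetes deployment only has to supply its own `purge`.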

I started a discussion at https://github.com/jupyterhub/jupyterhub-idle-culler/issues/8 to include people who are not (frequently) visiting this forum.

Coincidentally we were just talking about this so good timing on the thread. :slight_smile:

I tend to agree that culling old notebook storage for culled users should be separate from the existing cull_idle_servers.py script since as @betatim said that can get run pretty aggressively.

To throw a wrinkle into this, we don’t have a PVC per notebook pod, we have a single PVC per environment that is backed by object storage, so blindly deleting that single PVC would be…not good.

The current idea is to create a JupyterHub service similar to the “cull idle users” service. The service would check when a user was last active and remove their PV after a long period of inactivity.

We have our cull-idle service set up to also cull idle users, so this probably wouldn’t work for us, unless I’m missing something. Consider a scenario where the per-notebook culler stops a pod after an hour of inactivity, and the hub-managed cull-idle service deletes the user after, let’s say, 5 days of inactivity. We still might not want to delete their storage for like 30 days or longer, something like that. My point being, if we’ve culled the user record from the database, then it seems we’d have to work backward from the storage: check for each volume whether its user still exists and, if not, delete the storage.
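Working backward from the storage could look roughly like this sketch. It assumes you can map each claim to its owner (for example via a label or annotation) and that the service can persist when it first saw a claim without an owner. `expired_orphans` and its parameters are illustrative names, not existing code.

```python
from datetime import datetime, timedelta, timezone


def expired_orphans(claim_owners, hub_users, orphaned_since, now,
                    grace=timedelta(days=30)):
    """Return claims whose owner has been gone from the hub for > grace.

    claim_owners maps claim name -> username, hub_users is the set of
    users the hub still knows, and orphaned_since records when each claim
    was first seen without an owner. orphaned_since is updated in place.
    """
    expired = []
    for claim, owner in claim_owners.items():
        if owner in hub_users:
            # The user exists (again): forget any earlier orphan record.
            orphaned_since.pop(claim, None)
            continue
        first_seen = orphaned_since.setdefault(claim, now)
        if now - first_seen > grace:
            expired.append(claim)
    return expired


seen = {}  # would need to be persisted between runs in a real service
claims = {"claim-alice": "alice", "claim-bob": "bob"}
day0 = datetime(2020, 7, 1, tzinfo=timezone.utc)
print(expired_orphans(claims, {"alice"}, seen, day0))
print(expired_orphans(claims, {"alice"}, seen, day0 + timedelta(days=31)))
```

Because the grace period starts when the claim is first seen orphaned, the 30-day storage window stays independent of how aggressively users are culled from the hub database.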

I guess we all agree that this needs to be handled with great care: nobody should lose data because of a stupid misconfiguration. I believe there could be a general template to make the whole task easier, but data persistence can be very use-case specific. Therefore, apart from a common framework, at some point everyone needs to write their own script to solve their own problems.

At https://github.com/jupyterhub/jupyterhub-idle-culler/issues/8#issuecomment-652481646 there is one possible solution showing how the cull_idle_servers script (as well as the JupyterHub config) can be altered in a way that could also be applied to this use case.

Given this and the points made by @mriedem, a separate service with slightly altered code but a very different name (to avoid confusion for new administrators) sounds like an easy and implementable option to me. The only issue with the above-mentioned solution is that it requires monkey-patching, which is not the cleanest of all options.

Some discussion regarding the implementation can be found at https://github.com/jupyterhub/kubespawner/issues/415 and https://github.com/jupyterhub/dockerspawner/issues/384
