Has anyone investigated, coded, or thought about using the activity information of JupyterHub users to remove the PVC and PV of users who haven’t been active for more than 30 days (or some other configurable period)?
Background is the JupyterHub of the OpenHumans foundation. They offer storage to everyone who logs in, but as usual many people only log in once or twice and then never again. Over time (in this case >2 years) you accumulate a lot of PVs that will never be used again. Automating the clean-up would help save money and remove a manual task.
Another use case could be hubs used for teaching, where students come for a few-week course, a semester, or several semesters. Eventually they stop using the hub, and their PVs continue to cost money.
An alternative approach is to add an NFS server (or use your cloud vendor’s version of this) to provide shared storage to users. This removes the need to clean up PVs, but it comes with more complexity, or increased cost if the minimum storage size your cloud vendor sets is more than you need.
The current idea is to create a JupyterHub service similar to the “cull idle users” service. The service would check when a user was last active and remove their PV after a long period of inactivity. Deleting data is never nice, so maybe the service could even notify users by email before it deletes their data.
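As a sketch of what the activity check could look like, here is a small helper built around the `last_activity` field that the JupyterHub REST API reports per user. The 30-day default and the treatment of users with no recorded activity are policy choices invented for this sketch, not part of any existing service:

```python
# Hypothetical sketch: decide whether a user's storage is "stale" based on
# the `last_activity` field from GET /hub/api/users.
from datetime import datetime, timedelta, timezone

def storage_is_stale(last_activity, now=None, max_age_days=30):
    """True if the user's last activity is older than max_age_days.

    last_activity: ISO-8601 string as reported by the JupyterHub REST API,
    or None if the hub never recorded any activity for this user.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_activity is None:
        return True  # policy choice: no recorded activity counts as stale
    # JupyterHub emits timestamps like "2020-07-01T12:00:00.000000Z";
    # fromisoformat() on older Pythons needs the "Z" spelled as an offset.
    last = datetime.fromisoformat(last_activity.replace("Z", "+00:00"))
    return now - last > timedelta(days=max_age_days)
```

The actual service would fetch the user list from the hub API with an API token and then hand the stale usernames to whatever does the deletion.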
If you know of existing code or discussions I’d be happy to hear about it.
I just want to add myself to the discussion because it sounds interesting to me as well.
@betatim what exactly do PVC and PV stand for?
PV = persistent volume and PVC = persistent volume claim
These are kubernetes terms; super simplified, they mean “a hard drive you can connect to a pod”. https://kubernetes.io/docs/concepts/storage/persistent-volumes/ is probably a good place to start reading.
Ah great thanks! I haven’t used kubernetes before and I was happy with the options of docker-compose & classic linux administration.
I don’t have much knowledge of the relation between JupyterHub and the Spawner…
Just want to suggest my idea - what about adding a trigger that clears spawner resources inside the user-deletion API? Then cull_idle_users could delete an inactive and aged user. (Different age and inactivity thresholds should be applied, though.)
I’ve been thinking it’s a bit strange that deleting a user clears the user's data from the DB but leaves the user's storage as it was. If a deleted user accesses the service again, they can keep working with their previous data. That could be better UX, but sometimes it could lead to really bad situations, especially in terms of privacy…
I think the idea is that culling a user is something that happens a lot, even during normal operations. As a company running a hub you might cull inactive users every 16h or faster, and active users after 5 days or so. This means you want to decouple deleting a user from deleting their PVs.
The other thought was that deleting a PV is very final. If you did that by accident it is very unlikely that you could revert it. So better to make it a manual step.
I think having a flag that allows DELETE /api/user/<id> to tell a spawner that it should run clean-up tasks for that user would be an interesting idea. Then kubespawner could offer to clean up PVs, though I think we should keep it turned off by default and put a big warning label on the option in kubespawner.
Interesting idea. I don’t think it would be too hard to do. I might even take a stab at writing such a service: at some interval (e.g. once a day), run a query to see if any users match, and if so ask kubernetes to remove their persistent volume.
Let me know if you get around to doing this. I’d be happy to inherit a script and/or exchange ideas and/or contribute.
Thinking about it a bit more, I was even thinking something that you run manually, and that potentially asks you to confirm each delete, would be a good start. Until you’ve run it a few times and start feeling like “yeah, this script probably won’t delete data of the wrong users if I automate it with a cron job”.
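A minimal sketch of such a confirm-each-delete loop, with the delete function injected so the same loop works with a real kubernetes call or a dry-run stub (all names here are invented for illustration):

```python
# Hypothetical manual cull script: ask before each delete, and only act
# on an explicit "y". `delete_fn` does the actual removal.
def cull_pvcs(candidates, delete_fn, confirm=input):
    """Prompt for each candidate PVC name; delete only on 'y'. Returns the deleted names."""
    deleted = []
    for name in candidates:
        answer = confirm(f"Delete PVC {name}? [y/N] ")
        if answer.strip().lower() == "y":
            delete_fn(name)
            deleted.append(name)
    return deleted

# Against a real cluster, delete_fn could be built on the kubernetes
# Python client, e.g. (assumption, untested here):
#   from kubernetes import client
#   v1 = client.CoreV1Api()
#   delete_fn = lambda name: v1.delete_namespaced_persistent_volume_claim(name, "jhub")
```

Passing a function that merely logs the name gives you a dry-run mode for free, which fits the "build trust before automating" idea above.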
I would need such a “cull idle users and eventually delete all their data” service for removing docker volumes, but I don’t use kubernetes. If there were an abstraction layer in between, maybe both could profit? Then only the implementation of how the user data is purged differs. And as said in the initial post, jupyterhub-idle-culler could be a great starting point.
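One way that abstraction layer could be sketched (class names and the Docker volume naming are assumptions for illustration, not an existing API): a small purger interface with one implementation per storage backend, so the shared culling logic never touches backend details.

```python
# Hypothetical abstraction: the culling logic talks to a StoragePurger,
# and each backend (docker volumes, kubernetes PVCs, ...) implements purge().
from abc import ABC, abstractmethod
import subprocess

class StoragePurger(ABC):
    @abstractmethod
    def purge(self, username: str) -> None:
        """Delete the given user's storage, whatever that means for this backend."""

class DryRunPurger(StoragePurger):
    """Safe default: only records what would have been deleted."""
    def __init__(self):
        self.would_delete = []

    def purge(self, username):
        self.would_delete.append(username)

class DockerVolumePurger(StoragePurger):
    """Removes the user's docker volume. The 'jupyterhub-user-{username}'
    naming is an assumption; check what your DockerSpawner config produces."""
    def purge(self, username):
        subprocess.run(
            ["docker", "volume", "rm", f"jupyterhub-user-{username}"],
            check=True,
        )
```

A kubernetes-backed implementation would delete the user's PVC instead; the culling loop stays identical.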
I started a discussion at https://github.com/jupyterhub/jupyterhub-idle-culler/issues/8 to include people who are not (frequently) visiting this forum.
Coincidentally we were just talking about this so good timing on the thread.
I tend to agree that culling old notebook storage for culled users should be separate from the existing cull_idle_servers.py script since, as @betatim said, that can get run pretty aggressively.
To throw a wrinkle into this, we don’t have a PVC per notebook pod, we have a single PVC per environment that is backed by object storage, so blindly deleting that single PVC would be…not good.
The current idea is to create a JupyterHub service similar to the “cull idle users” service. The service would check when a user was last active and remove their PV after a long period of inactivity.
We have our cull-idle service set up to also cull idle users, so this probably wouldn’t work for us, unless I’m missing something. Consider a scenario where the per-notebook culler stops a pod after an hour of inactivity, and the hub-managed cull-idle service deletes the user after, let’s say, 5 days of inactivity. We still might not want to delete their storage for 30 days or longer, something like that. My point being, if we’ve culled the user record from the database, then it seems we’d have to work backward from the storage: check whether the user still exists for each volume, and if not, delete the storage.
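Working backward from storage could look like this sketch: list the PVC names, strip the claim prefix, and flag claims whose owner is no longer in the hub's user list. The `claim-` prefix matches KubeSpawner's default naming, but it is configurable, so treat it as an assumption to verify for your deployment:

```python
# Hypothetical helper: find PVCs whose owning user no longer exists.
def orphaned_pvcs(pvc_names, known_users, prefix="claim-"):
    """Return PVC names whose user is missing from known_users.

    pvc_names: PVC names as listed from the kubernetes API.
    known_users: usernames currently returned by GET /hub/api/users.
    """
    orphans = []
    for name in pvc_names:
        if not name.startswith(prefix):
            continue  # not a user claim (e.g. the hub database PVC)
        if name[len(prefix):] not in known_users:
            orphans.append(name)
    return orphans
```

This only answers "does the user still exist"; the 30-days-after-culling grace period described above would need the deletion time recorded somewhere, e.g. in an annotation on the PVC.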
I guess we all agree that this needs to be handled with great care - nobody should lose data because of a stupid misconfiguration. I believe there could be a general template to make the whole task easier, but data persistence can be very use-case specific. Therefore, beyond a common framework, at some point everyone needs to write a script to solve their own problems.
At https://github.com/jupyterhub/jupyterhub-idle-culler/issues/8#issuecomment-652481646 there is one possible solution showing how the cull_idle_servers script (as well as the JupyterHub config) can be altered in a way that could also be applied to this use case.
This, plus the points of @mriedem, makes a separate service with slightly altered code but a very different name (to avoid confusing new administrators) sound like an easy and implementable option to me. The only issue with the above-mentioned solution is that it requires monkey-patching, and that is not the cleanest of all options.
Is there any update or fix for this issue? We are also facing the same issue discussed here.
We have JupyterHub deployed in a k8s cluster and we are using EFS (Elastic File System) as the PV (Persistent Volume). When we delete a JupyterHub user using the admin panel, the user is deleted, but the PVC (Persistent Volume Claim) associated with that user is not deleted. If we create a new user with the same name, the old PVC gets attached to the new user with the same name.
The current state is as documented in the GitHub issues that are referenced in this thread.
First, we need to extend the jupyterhub-idle-culler as described in these issues:
The culler is an isolated service that is unaware of the JupyterHub spawner, such as DockerSpawner or KubeSpawner, and it does not access the JupyterHub settings. Hence, we need to add that functionality in the spawner because of the volume naming strategy, i.e.:
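For example, KubeSpawner exposes the naming strategy through its `pvc_name_template` configuration, so an external cleanup tool would have to reproduce whatever template the deployment uses. The value below is only illustrative:

```python
# jupyterhub_config.py fragment (illustrative value; check your
# deployment's actual setting before relying on it in a cleanup script)
c.KubeSpawner.pvc_name_template = "claim-{username}"
```

A standalone culler that hardcodes a different pattern would either miss the real PVCs or, worse, match the wrong ones, which is exactly why this logic belongs next to the spawner.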
You are more than welcome to join the discussions and contribute. I guess until now mostly general discussions have been started, but we need somebody to try out some of the mentioned options.