Best practices for regulating disk usage in JupyterHub

Dear JupyterHub maintainers,
We are running a JupyterHub for our university (~50k people, of whom
about 2k have used the service), for casual use of Jupyter (interactive
sessions, with persistent storage of users' home directories). We are
reaching the point where disk consumption is becoming a problem and we
would like to regulate disk usage. Usage patterns vary quite a bit, so we
would like to avoid one-size-fits-all solutions like stringent disk
quotas. Instead, we would like to:

  • provide useful feedback to our users, to educate them and help them
    keep their home directories tidy;
  • improve the configuration to reduce unconscious waste of disk space.

Indeed, we have noticed that the vast majority of disk space is used
without the user even being aware of it. For example, 800 of our users
have a .cache/yarn directory that reaches 400 MB. We presume that in
most cases this was populated directly by the system, with little if
any user intervention (e.g. triggering a JupyterLab rebuild from the UI).

We would love to hear how you handle this. E.g. do you have scripts
that do automatic cleanup? Special configuration so that packages
get installed temporarily in the container rather than in the home
directories? Have you set up regular automatic emails to your users
summarizing their disk usage and providing directly actionable tips like:

These files weigh more than XXX MB; if you don't need them, you
may use `rm ...` from a terminal to delete them.

You have XXX MB of conda packages installed in your home directory.
If you don't need them, you may clear them with ...
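For what it's worth, generating such messages from the scan results is mostly string formatting. A hypothetical sketch (the mapping of cache directories to cleanup commands is our guess at sensible suggestions, not an official list):

```python
def usage_tips(user: str, caches: dict[str, float]) -> str:
    """Build a plain-text disk-usage summary with actionable cleanup tips.

    `caches` maps a cache path (relative to the home directory) to its
    size in MB. The suggested commands are examples; adjust to your setup.
    """
    CLEANUP = {
        ".cache/yarn": "yarn cache clean",
        ".conda/pkgs": "conda clean --packages",
        ".cache/pip": "pip cache purge",
    }
    lines = [f"Hi {user}, here is your current disk usage summary:"]
    # Largest caches first, so the most effective tip comes on top.
    for rel, mb in sorted(caches.items(), key=lambda kv: -kv[1]):
        cmd = CLEANUP.get(rel, f"rm -r ~/{rel}")
        lines.append(
            f"- ~/{rel} uses {mb:.0f} MB; if you don't need it, "
            f"you may run `{cmd}` from a terminal."
        )
    return "\n".join(lines)
```

The resulting text could then be mailed to each user on a schedule, or shown in a login banner.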

Thanks in advance for your feedback!
