[Request for Implementations] Disabling downloads from a JupyterHub

“How do I disable downloading files from JupyterHub?” is an extremely common question, especially from folks working with sensitive data. This is impossible to do in the absolute sense: if someone can see data on a screen, they can copy and paste it.

However, there are ways to make it harder, and I want to list some of those ideas here. It’s important that these don’t affect users at all when they’re doing ‘regular’ data analysis work - security at the cost of usability forces people to find holes in the system so they can get their damn work done. I think these ideas try to be progressively effective without drastically affecting the user experience.

Hopefully someone can then contribute code / config to make it happen :smiley:

  1. Disable the download buttons in notebook & lab. This is very minimal, but extremely helpful. It forces users to use non-GUI methods to download things, and that’s already a win.
  2. Wrap the default ContentsManager so it can deny access to non .ipynb files. ContentsManager is the primary way to get things off your filesystem to your browser. I think only .ipynb files are needed for notebook / lab to work, and disabling all other access makes downloading things harder. To prevent people from just renaming data files to .ipynb, you can also validate that it is a notebook file before serving it.
  3. In containers, make sure the user can’t actually modify the notebook package that is used to run the Jupyter Notebook server. Python is a dynamic language, and (1) and (2) can be easily subverted if users can just edit the Python files containing that logic! So with standard Linux permissions, you must lock down the environment where the notebook package is installed. Additionally, you’d also need to lock down config for the notebook, so users can’t just change the config. Blocking write access to the paths listed by jupyter --paths would do the trick here.
  4. Block all outgoing internet access from users, except for specifically allowed targets. This prevents people from just sending out your data to the internet and downloading it from there. Consider using a proxy for outgoing connections here, so you can log as you wish.
  5. Throttle network connections to the user in such a way that regular usage (ipynb loading, frontend assets, etc) is fine, but larger data downloads are intolerably slow. This could be applied just between the user server and the proxy, since all access to user servers from users goes via the proxy. Maybe something that starts throttling once a TCP connection has transferred a certain amount of data?
  6. [Audit] Efficient payload logging at the JupyterHub proxy level, so we can attribute downloads when needed.
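
As a starting point for (2), here is a minimal sketch of the wrapping logic, not a finished implementation. It assumes the `get(path, content, type, format)` signature and the model dicts (with `type` and `content` keys) used by Jupyter's ContentsManager; `NotebookOnlyMixin`, `DownloadDenied` and `is_notebook_model` are hypothetical names made up for illustration.

```python
class DownloadDenied(Exception):
    """Hypothetical error type; a real server would more likely raise
    tornado.web.HTTPError(403) so the browser gets a proper response."""


def is_notebook_model(model):
    """Check that a contents model really is a parsed notebook.

    A data file renamed to .ipynb will not come back as a notebook
    model with a list of cells.
    """
    content = model.get("content")
    return (
        model.get("type") == "notebook"
        and isinstance(content, dict)
        and isinstance(content.get("cells"), list)
    )


class NotebookOnlyMixin:
    """Place ahead of a real contents manager in the class bases."""

    def get(self, path, content=True, type=None, format=None):
        model = super().get(path, content=content, type=type, format=format)
        # Directory listings and metadata-only lookups carry no file
        # contents, and the file browser needs them to work.
        if model.get("type") == "directory" or not content:
            return model
        # Only genuine notebooks are served with their contents.
        if path.endswith(".ipynb") and is_notebook_model(model):
            return model
        raise DownloadDenied(f"refusing to serve contents of {path!r}")
```

To try it, one would presumably combine the mixin with the stock manager (something like `class NotebookOnlyManager(NotebookOnlyMixin, FileContentsManager)`) and point the server's `contents_manager_class` setting at it. That wiring is untested here, and an async contents manager would need an `async def get` variant. Note that this does nothing about data pasted into a valid notebook, and it is only as strong as (3).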

If you have users on your JupyterHub, you semi-trust them. For sensitive environments, strong contracts & other non-technical measures are just as important as technical safeguards. Auditing is extremely important in those cases, and not covered in this post at all. However, that only works if you are an organization large enough to go after people who violate your contracts :smiley: In those situations, making it harder to download data technically is very important.

As far as I know, there are no public & well-documented implementations of any of these. I would love:

  1. A classic notebook extension for (1)
  2. A JupyterLab extension for (1)
  3. A wrapper ContentsManager for (2)
  4. Detailed guidelines for (3), mostly around building containers where this is true. z2jh will also need to be configured correctly for this to work.
  5. A test suite that ensures that (3) is really true
  6. Guidelines on how to do (4) in the most common environments. This can probably just live in z2jh.
  7. A Kubernetes sidecar for doing (5). This would share the user pod’s network namespace, and could use anything from tc to eBPF. If there is a pre-existing Kubernetes solution for this, documentation on how to deploy that with z2jh would be most helpful.
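
For the test suite that checks (3), one possible building block is a function that reports which paths the current user could modify. This is a rough sketch under stated assumptions: `nearest_existing` and `writable_paths` are hypothetical helper names, and the check only looks at plain write permission for the current user (no sudo, capabilities, or writable mounts).

```python
import os
from pathlib import Path


def nearest_existing(path):
    """Walk up until a path that exists is found.

    A config directory that does not exist yet is still a problem if
    the user is allowed to create it, so the closest existing ancestor
    is the thing to check.
    """
    path = Path(path).expanduser().absolute()
    while not path.exists():
        path = path.parent
    return path


def writable_paths(paths):
    """Return the entries the current user could modify or create."""
    return [str(p) for p in paths if os.access(nearest_existing(p), os.W_OK)]
```

Run as the notebook user inside the container, the input would be everything printed by `jupyter --paths` plus the directory where the notebook package is installed, and the test passes only if the result is empty. Some of those paths normally live under the user's home directory, so expect them to show up until they are dealt with explicitly.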

If you have solutions to these, would love to see them!


This is a great list of ideas and I hope it can act as a kind of central place to find solutions (or attempts at solutions) to the ideas given here.

Making it easy to verify that a measure works is super important (see (5) in the second list of points). We use tc and eBPF to limit the speed of network connections for users of mybinder.org. One day I tried to really understand how and why they work, and whether they really do what we want and think they do. I left thinking “probably not”.

However maybe my tests weren’t testing what I thought they were testing :joy:

Most blog posts and howto guides seem to not go much beyond trivial examples and hardly ever have before/after tests that you can run to demonstrate things work.

TL;DR: sharing ways to test the measures is as useful and valuable a contribution as one that implements one of the measures.
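
As an illustration of what a before/after check for the throttling idea could look like, here is a small sketch. It is not what mybinder.org runs; `measure_throughput`, `is_throttled` and the `max_ratio` threshold are hypothetical names and values chosen for the example.

```python
import time


def measure_throughput(stream, chunk_size=64 * 1024, clock=time.monotonic):
    """Read a stream to exhaustion; return (total_bytes, bytes_per_second).

    `stream` is anything with a read(n) method, for example the
    response object returned by urllib.request.urlopen.
    """
    start = clock()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    elapsed = clock() - start
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate


def is_throttled(small_rate, large_rate, max_ratio=0.1):
    """True if the large transfer ran at most max_ratio times as fast
    as the small one."""
    return large_rate <= small_rate * max_ratio
```

The idea would be to fetch a notebook-sized file and a data-sized file through the proxy, once with the throttle off and once with it on, and compare the rates. Small transfers are dominated by latency, so the rates are only roughly comparable and the threshold needs tuning per deployment.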
