We are running a JupyterHub on a server for several thousand people, with k8s and a whole infrastructure behind it that handles authentication for access to sensitive data.
Is it possible to create a kind of “blacklist” of libraries that is checked whenever a user tries to install, in their user space, libraries that we think might cause damage or security breaches?
How are users installing the packages? Are they using PyPI/conda? One way to fix this would be to block all access to PyPI and anaconda.org, and run an internal mirror to distribute packages. This way users would have access to only “approved” packages.
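As a rough sketch of the pip side of this, the snippet below renders a pip.conf that points pip at a single internal index. The mirror URL is made up for illustration, and where the file ends up (for example baked into the user image) is left to the deployment.

```python
import configparser
import io


def render_pip_conf(index_url: str) -> str:
    """Render a pip.conf whose [global] section uses one internal index."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical internal mirror URL; replace with the real one.
PIP_CONF = render_pip_conf("https://pypi.internal.example/simple")
print(PIP_CONF)
```

This only changes pip's default. A user can still pass a different index on the command line, so it needs to be paired with actually blocking outbound access to the public indexes.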
They could still get the packages if they really wanted (e.g. directly from GitHub), assuming they still have an internet connection.
Yerp: the only safe computer is one not connected to the internet.
In k8s, individual pods should only be able to connect to exactly what the custodian of the infrastructure/data chooses, and a fully virtualized/containerized/buzzworded deployment should make this possible, and explicit. A policy of “no data from anywhere except the following domains” is a much better place to be starting from, but of course can’t prevent someone from hand-jamming packages in by upload once given interactive tools.
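To make the “only the following destinations” idea concrete, here is a minimal sketch that builds a Kubernetes NetworkPolicy manifest as a Python dict and prints it as JSON. The policy name, namespace, pod label and CIDR are all assumptions for illustration. Note that a standard NetworkPolicy selects by IP block or pod/namespace label rather than by domain name, so the allowed “domains” have to be expressed as addresses here.

```python
import json


def egress_allowlist_policy(name, namespace, pod_labels, allowed_cidrs, port=443):
    """Build a NetworkPolicy that only allows TCP egress to the given CIDRs."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Applies to pods carrying these labels (assumed label below).
            "podSelector": {"matchLabels": pod_labels},
            # Declaring Egress means anything not listed is denied.
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [{"ipBlock": {"cidr": cidr}} for cidr in allowed_cidrs],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


# All values below are hypothetical placeholders.
policy = egress_allowlist_policy(
    name="singleuser-egress-allowlist",
    namespace="jhub",
    pod_labels={"component": "singleuser-server"},
    allowed_cidrs=["10.20.30.40/32"],  # e.g. the internal package mirror
)
print(json.dumps(policy, indent=2))
```

The JSON can be fed to kubectl the same way as YAML. Two caveats: the cluster's network plugin has to enforce NetworkPolicy for this to do anything, and a real policy would need a further egress rule for DNS, which is left out here.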
Inside the container: locking down environments is certainly possible, but a fully 700 file system, owned by some other system user than the user themselves, can’t actually be used for interactive computing, and can’t fully exclude data exfiltration.
On the client: requiring access to the hub be gated behind a VPN, with lots of logging and monitoring, etc. is a place to start, but again, users, by definition, use these tools from a computer where they can presumably run programs, hand-type malicious code, etc.
If arbitrary packages are needed, running a package proxy is the right play. Presumably this deployment is operating at a scale where one could handle running an enterprise tool (I’m not going to shill any of them here; “enterprise package management” has enough SEO that you’d find the top couple of contenders). This would move the concern to the perimeter, and these tools offer caching mirrors, block/allow lists, and continuous scanning (given subscriptions). Even failing that, a plain old proxy would potentially do the job, but would take a lot more work.
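For a feel of what the blocklist part of such a proxy does, here is a minimal sketch that checks requested requirement strings against a set of blocked package names. The function names and the example package names are made up, and this is not how any particular product implements it. Names are compared after the usual PyPI-style normalization, so "Evil.Pkg" and "evil_pkg" count as the same package.

```python
import re


def normalize(name: str) -> str:
    """Normalize a package name: lowercase, runs of -, _ and . become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement: str) -> str:
    """Extract the normalized package name from a simple requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return normalize(match.group(1))


def find_blocked(requirements, blocklist):
    """Return the requirement strings whose package name is on the blocklist."""
    blocked = {normalize(name) for name in blocklist}
    return [req for req in requirements if requirement_name(req) in blocked]


# Hypothetical example data.
requested = ["Requests==2.31", "evil_pkg>=1.0", "numpy"]
rejected = find_blocked(requested, {"Evil.Pkg"})
print(rejected)
```

A name check like this only catches installs that go through the index. It does nothing about a package pulled from a direct URL or uploaded by hand, which is why it belongs at the perimeter alongside the network restrictions rather than in place of them.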