Many JupyterHub installations provide direct access to cloud resources like object storage (S3, GCS), cloud databases (RDS, Google Cloud SQL, etc.) or provider-specific services (BigQuery, Athena, Spanner, etc.). Cloud providers have their own notion of identity (AWS IAM, Google Cloud service accounts, etc.), and access to cloud resources is granted on the basis of that identity. So, for access control & auditing purposes, it is important to attach a cloud provider identity to the users of your JupyterHub.
Currently, folks do this in ad-hoc ways, using mechanisms like IRSA (IAM Roles for Service Accounts) or Google Cloud Workload Identity. However, most of them map all the users on the JupyterHub to the same cloud provider identity. This works, but it means every user gets identical permissions and they all look the same in audit logs.
What I’d like is a way to provide each individual user their own cloud identity, based on some criteria. This could be per-hub, per-user-group, per-user, etc. This gives us two advantages:
- Fine-grained control over who can access what. This is very useful in multi-tenant situations, and for providing ‘scratch space’ for users (a per-user S3 bucket, for example).
- Auditing who accessed which cloud resource when. Things like GCP Data Access audit logs and CloudTrail record access information based on cloud identity. Giving each user a distinct identity makes it possible to provide attribution on who accessed what, when. Very important in highly sensitive deployments.
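To make the "based on some criteria" part concrete, here is a minimal sketch of how a per-user, per-group, or per-hub lookup might be ordered, with the most specific match winning. Everything in it is hypothetical: the `HubUser` class, the mapping tables, and the identity names are placeholders for illustration, not anything JupyterHub provides today.

```python
from dataclasses import dataclass, field


@dataclass
class HubUser:
    # Hypothetical stand-in for whatever user info the hub exposes.
    name: str
    groups: list[str] = field(default_factory=list)


# Hypothetical mapping tables; all identity names are placeholders.
PER_USER = {"alice": "alice-scratch"}
PER_GROUP = {"genomics": "genomics-readers"}
HUB_DEFAULT = "hub-default"


def resolve_identity(user: HubUser) -> str:
    """Return the most specific cloud identity configured for this user."""
    # 1. A per-user identity wins if one is configured.
    if user.name in PER_USER:
        return PER_USER[user.name]
    # 2. Otherwise use the first of the user's groups that has a mapping.
    for group in user.groups:
        if group in PER_GROUP:
            return PER_GROUP[group]
    # 3. Fall back to the identity shared by the whole hub.
    return HUB_DEFAULT
```

With a lookup like this, the "everyone shares one identity" setup is just the case where both tables are empty, so a hub could start there and add finer-grained mappings later.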
There are many ways this could be accomplished:
- Create a Kubernetes service account & an associated cloud identity for each user, and map those two together somehow. This is probably the cleanest, but could leave you with a lot of cloud identities you have to manage. It might also require the hub pod to have elevated access credentials in your cloud, since you’ll have to create new cloud identities.
- Provide a service that maps JupyterHub users to cloud identities based on some criteria, and only provides temporary access credentials to the user notebook. This is probably simpler and gives you more control over who gets access to what, when.
- ???
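As a rough illustration of the first two options on AWS, the sketch below builds (but does not apply or call) the two pieces involved: a per-user Kubernetes ServiceAccount annotated for IRSA, and the parameters for an STS `AssumeRole` request that a credential-vending service might make. The function names, the `jupyter-` prefix, and the role ARN are assumptions for illustration; the `eks.amazonaws.com/role-arn` annotation and the `RoleArn` / `RoleSessionName` / `DurationSeconds` parameters are the real ones used by EKS and STS. On GKE the analogous annotation would be `iam.gke.io/gcp-service-account`.

```python
import re


def service_account_manifest(username: str, role_arn: str) -> dict:
    """Option 1: a per-user ServiceAccount bound to an IAM role via IRSA."""
    # Restrict the name to lowercase alphanumerics and '-' so it is a valid
    # Kubernetes object name. Distinct usernames can collide after this step;
    # a real implementation would need to handle that.
    safe = re.sub(r"[^a-z0-9-]", "-", username.lower()).strip("-")
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": f"jupyter-{safe}",  # hypothetical naming scheme
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


def assume_role_params(username: str, role_arn: str, duration_seconds: int = 3600) -> dict:
    """Option 2: keyword arguments for an STS AssumeRole request."""
    # STS session names allow only [\w+=,.@-] and at most 64 characters.
    # Putting the hub username here is what ties the temporary credentials
    # back to a person in CloudTrail.
    session = re.sub(r"[^\w+=,.@-]", "-", username, flags=re.ASCII)[:64]
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session,
        "DurationSeconds": duration_seconds,
    }
```

With boto3, the second function's result could be passed as `boto3.client("sts").assume_role(**params)`, and the temporary keys come back under `Credentials` in the response. Note the tradeoff this makes visible: option 1 needs a role (and a ServiceAccount) per user, while option 2 can reuse one role per group and still get per-user attribution through the session name.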
I’m sure there are many possible implementations of this with different tradeoffs. If you are an organization that’s already doing this, I’d love for you to either talk about your approach, or open source your code. The world of folks who aren’t as versed in cloud work will thank you for it.