[Request for Implementation] Mapping notebook user identities to cloud provider identities

Many JupyterHub installations provide direct access to cloud resources like object storage (S3, GCS), cloud databases (RDS, Google Cloud SQL, etc.) or provider-specific services (BigQuery, Athena, Spanner, etc.). Cloud providers have their own notion of identity (AWS IAM, Google service accounts, etc.), and access to cloud resources is granted on the basis of cloud provider identity. So, for access control & auditing purposes, it is important to attach a cloud provider identity to the users of your JupyterHub.

Currently, folks do this in ad-hoc ways, with mechanisms like IRSA (IAM Roles for Service Accounts) or Google Cloud Workload Identity. However, mostly they map all the users on the JupyterHub to the same cloud provider identity. This works, but it means you can neither restrict access per user nor tell users apart in audit logs.

What I’d like is a way to provide each individual user their own cloud identity, based on some criteria. This could be per-hub, per-user-group, per-user, etc. This gives us two advantages:

  1. Fine-grained control over who can access what. This is very useful in multi-tenant situations, and for providing ‘scratch space’ for users (a per-user S3 bucket, for example).
  2. Auditing who accessed which cloud resource when. Things like GCP Data Access audit logs and CloudTrail record access information based on cloud identity. Giving each user a distinct identity makes it possible to provide attribution on who accessed what, when. Very important in highly sensitive deployments.

There are many ways this could be accomplished:

  1. Create a Kubernetes Service account & an associated cloud identity for each user, and map those two together somehow. This is probably the cleanest, but could leave you with a lot of cloud identities you have to manage. It might also require the hub pod to have elevated access credentials in your cloud, since you’ll have to create new cloud identities.
  2. Provide a service that maps JupyterHub users to cloud identities based on some criteria, and only provides temporary access credentials to the user notebook. This is probably simpler and gives you more control over who gets access to what, when.
  3. ???
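
To make option 2 a bit more concrete, here is a minimal sketch of just the mapping step, assuming AWS-style role ARNs. Everything in it is hypothetical: the `HubUser` type, the lookup tables, the account ID and the role names are placeholders, not part of JupyterHub or any existing service. A real service would then exchange the chosen role for short-lived credentials (on AWS, for example, through an STS AssumeRole call), which is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HubUser:
    """Hypothetical stand-in for whatever the hub knows about a user."""
    name: str
    groups: frozenset = frozenset()


# Placeholder lookup tables; the ARNs are made up for illustration.
USER_ROLES = {"alice": "arn:aws:iam::123456789012:role/hub-user-alice"}
GROUP_ROLES = [("research", "arn:aws:iam::123456789012:role/hub-research")]
HUB_DEFAULT_ROLE = "arn:aws:iam::123456789012:role/hub-default"


def cloud_identity_for(user):
    """Pick a cloud identity: per-user first, then per-group, then per-hub."""
    if user.name in USER_ROLES:
        return USER_ROLES[user.name]
    for group, role in GROUP_ROLES:
        if group in user.groups:
            return role
    return HUB_DEFAULT_ROLE
```

The lookup order (user, then group, then hub default) is only one possible policy; the point is that the criteria live in one place that the hub admin controls.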

I’m sure there are many possible implementations of this with different tradeoffs. If you are an organization that’s already doing this, I’d love for you to either talk about your approach, or open source your code. The world of folks who aren’t as versed in cloud work will thank you for it.

Agreed this would be really useful @yuvipanda! I explored this a while back on AWS wanting to give each user a private S3 prefix to use (https://github.com/pangeo-data/pangeo-cloud-federation/issues/610). We ended up not pursuing anything fancy and stuck with the approach you mention of “map all the users on the JupyterHub to the same cloud provider identity.”

But I did try mapping an IAM account to each authenticated user, with some success. For that I found these two resources very useful: https://gravitational.com/blog/aws-github-sso/, https://auth0.com/docs/aws-api-setup. Of course this relies on Auth0 for authentication…

The existing JupyterHub notion here is “auth state” for Authenticators. This is a blob of JSONable info, usually including an access token, which can be passed to the Spawner environment. An environment variable like GITHUB_ACCESS_TOKEN can be used to configure e.g. default GitHub push access.
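
As a rough sketch of that hand-off, a function like the one below could be registered as the Spawner's `auth_state_hook`. The function name is made up, and the `access_token` key is an assumption about what the Authenticator in use stores in auth state. The `SimpleNamespace` object only stands in for a real spawner so the function can be tried outside a hub.

```python
from types import SimpleNamespace


def pass_token_to_environment(spawner, auth_state):
    """Copy the access token from auth_state into the single-user environment."""
    if not auth_state:
        # auth_state is empty when auth state persistence is not enabled
        return
    token = auth_state.get("access_token")  # key name is an assumption
    if token:
        spawner.environment["GITHUB_ACCESS_TOKEN"] = token


# In jupyterhub_config.py, where `c` is provided by JupyterHub:
#   c.Authenticator.enable_auth_state = True
#   c.Spawner.auth_state_hook = pass_token_to_environment

# Stand-in spawner, only for trying the function without a running hub.
fake_spawner = SimpleNamespace(environment={})
pass_token_to_environment(fake_spawner, {"access_token": "example-token"})
```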

I think there are two big cases in your description that also need to be addressed, as the needs are very different:

  • identities that already exist, and need only to be passed along (auth_state can probably already suffice here)
  • identities that need to be created on the fly (this would be a bigger task, though could be implemented)

I think the big questions are:

  • what resources need to access this identity and how should it propagate? (i.e. are we only talking about ServiceAccounts?)
  • Are we mainly talking about kubernetes, or should this be a general topic?
  • Should Cloud Identity be a concept JupyterHub has, or is direct auth_state->kubespawner negotiation sufficient?
  • What levels / resources should consume this information? The pod service account, server extensions, kernel code, etc.? All of the above?

My hunch is that JupyterHub should not add any awareness of “cloud identity” and instead only make sure that sufficient hooks exist for Authenticators and Spawners to communicate what they need. A sketch of what resources should be created when, and based on what sources of data, will help to determine whether we need anything in JupyterHub, or whether this should perhaps live as a z2jh / kubespawner feature.
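
One possible shape for the kubespawner side, sketched under assumptions: a `pre_spawn_hook` that points each user's pod at a per-user Kubernetes ServiceAccount through KubeSpawner's `service_account` setting. The naming scheme and both function names are hypothetical, and the sketch does not create the ServiceAccount or bind it to a cloud identity; something else would have to do that.

```python
import re
from types import SimpleNamespace


def service_account_name(username, prefix="jupyter-user-"):
    """Derive a Kubernetes-safe name from a hub username (hypothetical scheme)."""
    # Keep lowercase letters, digits and dashes; collapse everything else to "-".
    slug = re.sub(r"[^a-z0-9-]+", "-", username.lower()).strip("-")
    # Truncate to 63 characters and avoid ending on a dash.
    return (prefix + slug)[:63].rstrip("-")


def assign_service_account(spawner):
    """Point the user's pod at their own ServiceAccount before it starts."""
    spawner.service_account = service_account_name(spawner.user.name)


# In jupyterhub_config.py, where `c` is provided by JupyterHub:
#   c.KubeSpawner.pre_spawn_hook = assign_service_account

# Stand-in spawner, only for trying the hook without a running hub.
fake_spawner = SimpleNamespace(user=SimpleNamespace(name="Bob_42"), service_account=None)
assign_service_account(fake_spawner)
```

Note that distinct usernames can collapse to the same name under this scheme (for example `a.b` and `a_b`), so a real implementation would need a collision-free encoding.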