This is an attempt to capture a discussion between myself and @yuvipanda on Gitter. I am still researching/understanding the problem, so there may be errors in understanding on my part. If anybody has any thoughts about this or is aware of prior work in this direction, I would be very interested in hearing about it.
The essential question is this: as an admin of a JupyterHub, can you automate actions in a user server, performed on behalf of the user? My question is specifically with regards to a z2jh-based deployment, but some of the discussion could be adapted to other kinds of JupyterHub deployments. The specific use-case I have in mind is allowing users to schedule automated jobs. I am envisioning a workflow like this:
- The user writes a script/notebook that performs an action (possibly with side effects, possibly not)
- The user schedules the script/notebook to be run at some regular interval. It would be nice to schedule this as an Airflow DAG. The UI for this is yet to be determined.
- The scheduler (i.e., Airflow) launches the user server on behalf of the user, runs the script, and shuts down the server.
In discussion with @yuvipanda, we identified two broad approaches to this problem:
- Use the JupyterHub APIs to launch the server and perform the actions.
- Use k8s APIs to launch a user pod with the right PVC and perform the actions.
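The first approach can be sketched against the JupyterHub REST API: an admin token (or a token with appropriate permissions) can start and stop a user's named server via the documented `/users/{name}/servers/{server_name}` endpoints. A minimal stdlib-only sketch; the hub URL, username, and server name below are placeholders:

```python
import json
import urllib.request

def server_path(user, server=""):
    """REST API path for a user's default or named server."""
    if server:
        return f"/users/{user}/servers/{server}"
    return f"/users/{user}/server"

def hub_request(hub_url, token, method, path, body=None):
    """Make an authenticated call to the JupyterHub REST API."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        f"{hub_url}/hub/api{path}",
        method=method,
        data=data,
        headers={"Authorization": f"token {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example (requires a running hub and a valid token):
# hub_request("https://hub.example.org", token, "POST",
#             server_path("alice", "scheduled-job"))   # start the server
# hub_request("https://hub.example.org", token, "DELETE",
#             server_path("alice", "scheduled-job"))   # stop it when done
```

The scheduler would start the server, run the script against it, and then `DELETE` the server when the job completes.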
**Pros of JupyterHub APIs**
- It could work for non-k8s-based deployments
- If the spawner customizes the user environment beyond what the user image does (e.g., sets environment variables, `auth_state`, etc.), then that can be captured.
- It more naturally gives access to user identity.
**Pros of k8s APIs**
- It might allow access to existing tooling around scheduling Kubernetes pods, e.g., Airflow's `KubernetesPodOperator`.
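For comparison, the k8s route amounts to creating a pod that mounts the user's storage directly. A sketch that builds such a pod manifest as a plain dict, ready to hand to the Kubernetes API or a scheduling tool; the image, mount path, and PVC name below are assumptions loosely based on typical z2jh defaults:

```python
def job_pod_manifest(username, image, command, pvc_name):
    """Build a pod manifest that runs `command` with the user's home PVC mounted.

    The mount path and naming conventions are assumptions, not z2jh requirements.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"scheduled-job-{username}"},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "job",
                "image": image,
                "command": command,
                "volumeMounts": [{"name": "home", "mountPath": "/home/jovyan"}],
            }],
            "volumes": [{
                "name": "home",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        },
    }

# e.g. job_pod_manifest("alice", "my-user-image:latest",
#                       ["python", "/home/jovyan/job.py"], "claim-alice")
```

Note that this bypasses the spawner entirely, which is exactly the trade-off listed above: no hub-provided environment variables, `auth_state`, or user identity.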
Based on this, I think it might be better to use the JupyterHub APIs, but that is subject to revision.
@danielballan has created a JupyterHub/JupyterLab extension, jupyterhub-share-link, that lets a user create a shareable link to a file on their server. Another user may then use that link to access the file (at least until it expires). Under the hood, it uses the JupyterHub APIs to launch the relevant user servers and copy the shared file between them. Much of the logic around launching the user servers could be adapted/extracted for this use case.
How do we handle user secrets? These could include API keys, database credentials, etc. Many of the scheduled jobs could rely on these things, so it would be good to support their use in a controlled way.
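On k8s, one controlled option would be to store each user's credentials in a per-user Kubernetes Secret and inject them into the scheduled pod as environment variables. A sketch of the container `env` entries; the per-user secret naming scheme is an assumption:

```python
def secret_env(secret_name, keys):
    """Build container env entries that pull values from a Kubernetes Secret.

    Each key in `keys` becomes an environment variable sourced from the
    secret of the same name via `secretKeyRef`.
    """
    return [
        {
            "name": key,
            "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
        }
        for key in keys
    ]

# e.g. merged into a container spec (hypothetical naming):
# container["env"] = secret_env("job-secrets-alice", ["API_KEY", "DB_PASSWORD"])
```

This keeps the secret values out of the pod spec itself, though it still leaves open how users would register and rotate their secrets in the first place.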
What do we do if the user server is in use? We should be able to launch a new named server as the user, but that launch can fail if the user's PVC is already mounted. We need to investigate persistent volume access modes to see which is most appropriate, and how to handle this case. In particular, the Kubernetes documentation on access modes notes:
> **Important!** A volume can only be mounted using one access mode at a time, even if it supports many. For example, a GCEPersistentDisk can be mounted as ReadWriteOnce by a single node or ReadOnlyMany by many nodes, but not at the same time.
How big of a restriction is this? Is it good enough to schedule jobs for night-time, when user servers are unlikely to be running (in my particular deployment we pretty much know the time zone of our users)? Can we notify the user if a job fails because their PVC could not be mounted?
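Since dynamically provisioned volumes are commonly ReadWriteOnce, one pragmatic mitigation is to check whether any pod currently mounts the user's claim before launching the job, and notify or retry otherwise. A sketch over pod specs as returned by the Kubernetes API, represented here as plain dicts:

```python
def pvc_in_use(pods, claim_name):
    """Return True if any pod in `pods` mounts the PVC named `claim_name`.

    `pods` is a list of pod specs as plain dicts (e.g. from the Kubernetes
    API's list-pods response); a real implementation might also filter on
    pod phase, since terminated pods no longer hold the volume.
    """
    for pod in pods:
        for volume in pod.get("spec", {}).get("volumes", []):
            pvc = volume.get("persistentVolumeClaim")
            if pvc and pvc.get("claimName") == claim_name:
                return True
    return False
```

This is inherently racy (the user could start their server between the check and the launch), so it would complement, not replace, handling the mount failure itself.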