Performing actions on behalf of users

This is an attempt to capture a discussion between myself and @yuvipanda on Gitter. I am still researching the problem, so there may be errors in my understanding. If anybody has any thoughts about this or is aware of prior work in this direction, I would be very interested in hearing about it.

Problem statement

The essential question is this: as an admin of a JupyterHub, can you automate actions in a user server, performed on behalf of the user? My question is specifically with regard to a z2jh-based deployment, but some of the discussion could be adapted to other kinds of JupyterHub deployments. The specific use-case I have in mind is allowing users to schedule automated jobs. I am envisioning a workflow like this:

  1. The user writes a script/notebook that performs an action (possibly with side effects, possibly not)
  2. The user schedules the script/notebook to be run at some regular interval. It would be nice to schedule this as an Airflow DAG. The UI for this is yet to be determined.
  3. The scheduler (i.e., Airflow) launches the user server on behalf of the user, runs the script, and shuts down the server.

Possible solution

In discussion with @yuvipanda, we identified two broad classes of approaches to this problem:

  1. Use the JupyterHub APIs to launch the server and perform the actions.
  2. Use k8s APIs to launch a user pod with the right PVC and perform the actions.

Pros of JupyterHub APIs

  • It could work for non-k8s-based deployments
  • If the spawner customizes the user environment beyond what the user image does (e.g., setting environment variables, auth_state, etc.), then that can be captured.
  • It more naturally gives access to user identity.

Pros of k8s APIs

  • It might allow access to tooling around scheduling Kubernetes pods, e.g., the Airflow KubernetesPodOperator (sketched below).
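To make this concrete, here is a rough, untested sketch of what the k8s route might look like as an Airflow task. The namespace, image, claim name, and schedule are assumptions for illustration, and the exact import path for KubernetesPodOperator and some parameter names vary by Airflow version:

```python
from datetime import datetime

from airflow import DAG
# The import path differs across Airflow versions; this is the provider-package location.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

with DAG(
    "nightly-user-job",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 3 * * *",  # run at night, when the user server is likely idle
) as dag:
    run_notebook = KubernetesPodOperator(
        task_id="run-notebook",
        name="run-notebook",
        namespace="jhub",                       # assumed z2jh namespace
        image="jupyter/scipy-notebook:latest",  # ideally the same image the user server runs
        cmds=["jupyter", "nbconvert", "--to", "notebook",
              "--execute", "/home/jovyan/job.ipynb"],
        volumes=[
            k8s.V1Volume(
                name="home",
                persistent_volume_claim=k8s.V1PersistentVolumeClaimVolumeSource(
                    claim_name="claim-someuser"  # the PVC kubespawner created for this user
                ),
            )
        ],
        volume_mounts=[k8s.V1VolumeMount(name="home", mount_path="/home/jovyan")],
    )
```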

Based on this, I think it might be better to use the JupyterHub APIs, but that is subject to revision.
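For the JupyterHub API route, I imagine the scheduler doing something like the following. This is a rough sketch rather than working code: the Hub URL and token are placeholders, the endpoints are JupyterHub's REST API, and run_notebook_on_server() is a hypothetical helper, since actually executing the notebook inside the server is the part that still needs a real implementation.

```python
import time

import requests

HUB_API = "https://hub.example.org/hub/api"  # assumed Hub URL
TOKEN = "..."                                # admin (or suitably scoped) API token
HEADERS = {"Authorization": f"token {TOKEN}"}


def run_notebook_on_server(username: str, server_name: str):
    """Hypothetical helper: execute the user's notebook inside the running server.

    How to do this (a server extension, calls into the server's kernel API, etc.)
    is exactly the open question.
    """
    raise NotImplementedError


def run_scheduled_job(username: str, server_name: str = "scheduled-job"):
    # 1. Ask the Hub to spawn a named server for the user.
    requests.post(f"{HUB_API}/users/{username}/servers/{server_name}", headers=HEADERS)

    # 2. Poll the user model until the named server reports ready.
    for _ in range(120):
        user = requests.get(f"{HUB_API}/users/{username}", headers=HEADERS).json()
        if user.get("servers", {}).get(server_name, {}).get("ready"):
            break
        time.sleep(5)
    else:
        raise TimeoutError(f"{username}/{server_name} never became ready")

    # 3. Run the user's script/notebook inside that server.
    run_notebook_on_server(username, server_name)

    # 4. Shut the named server down again.
    requests.delete(f"{HUB_API}/users/{username}/servers/{server_name}", headers=HEADERS)
```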

Prior art

@danielballan has created a JupyterHub/JupyterLab extension, jupyterhub-share-link, that allows a user to create a shareable link to a file on their server. Another user may then use that link to access the file (at least until it expires). The implementation uses the JupyterHub APIs to launch the relevant user servers and copy the shared file between them. Much of the logic around launching the user servers could be adapted/extracted for this use-case.

Outstanding questions

Secrets

How do we handle user secrets? These could include API keys, database credentials, etc. Many of the scheduled jobs could rely on these things, so it would be good to support their use in a controlled way.

Volumes

What do we do if the user server is in use? We should be able to launch a new named server as the user, but that can fail if the user's PVC is already in use. We need to investigate persistent volume access modes to see what is most appropriate and how to handle this case. In particular, the Kubernetes documentation on persistent volume access modes notes:

Important! A volume can only be mounted using one access mode at a time, even if it supports many. For example, a GCEPersistentDisk can be mounted as ReadWriteOnce by a single node or ReadOnlyMany by many nodes, but not at the same time.

How big of a restriction is this? Is it possibly good enough to schedule jobs for night-time (in my particular deployment we pretty much know the time zone of our users)? Can we notify the user if a job fails because their PVC could not be mounted?
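As a starting point, the scheduler could at least inspect the user's PVC before launching anything, and notify or reschedule if a concurrent mount isn't possible. A small sketch with the kubernetes Python client, assuming the usual kubespawner claim-{username} naming, a "jhub" namespace, and an in-cluster service account:

```python
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core = client.CoreV1Api()


def can_mount_concurrently(username: str, namespace: str = "jhub") -> bool:
    """Return True if the user's home PVC supports being mounted from multiple pods."""
    pvc = core.read_namespaced_persistent_volume_claim(
        name=f"claim-{username}", namespace=namespace
    )
    # ReadWriteOnce volumes can only be attached to one node at a time, so a job pod
    # may fail to mount them while the user's server is running on another node.
    return "ReadWriteMany" in (pvc.spec.access_modes or [])
```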


This seems great and useful!

@jcrist has begun a related project over at https://github.com/jcrist/papermillhub. As the name suggests, it was conceptually inspired by papermill, but I believe it aims to be more general than parameterized notebook execution. If I recall correctly, job scheduling was originally in scope for papermillhub, but it may have evolved toward a tighter scope: just executing headless notebooks via a server extension, with no Hub integration necessary. Maybe @jcrist can chime in to clarify that point.

I like the idea of incorporating Airflow, and I second your view that using Hub APIs has strong advantages. I think it is probably good enough to fail when there is contention for volumes. Trying to work around this in some way seems like it would just raise secondary, tricky issues.

Thanks for pinging me on this. This is something @danielballan, @rabernat, and I discussed at SciPy this year. I started a prototype at https://github.com/jcrist/papermillhub, but had to put things on pause due to work. I’d be quite happy to see this work continue.

Looks like Ian also raised an issue on that repo. To save myself some typing, see this comment (https://github.com/jcrist/papermillhub/issues/2#issuecomment-534176947) for my responses to the above.


Just want to give a big :+1: to this discussion. In Pangeo, we get a lot of questions from users about how to deal with long-running notebooks, batch jobs, etc. It would be great to have some sort of solution for this common use case.

In terms of task scheduling, we recently experimented with Prefect and liked it a lot: https://prefect.io

Yes, there seem to be approximately 123 actively-maintained open-source Python workflow managers. In the long term it might be good to avoid binding ourselves too tightly to Airflow, but from what I can tell Airflow is a good initial case study, and Prefect shares conceptual DNA.


Thanks for writing this up, @ian-r-rose!

I’ve been thinking particularly about the shared storage piece of this. In particular, I want a solution that satisfies the following criteria:

  1. One dedicated home directory ‘disk’ (Google PD, AWS EBS volume, etc.) per user. This is more secure than putting all users on the same NFS share (via AWS EFS or Google Filestore), gives better control over how many resources each user gets, etc.
  2. A way to access this single ‘disk’ from the many different places a user’s code could be running: named servers in JupyterHub, batch job containers, dask workers, etc.

Basically, I want to get ReadWriteMany semantics from a ReadWriteOnce disk, so we can have one home directory per user rather than one per pod/container.

The solution I’ve been thinking about is this:

  1. For each user, provision a PVC with dynamic storage - this will give you a Google PD, AWS EBS store, etc. This PVC will be ReadWriteOnce.
  2. Attach this PVC to a per-user StatefulSet that runs something like NFS Ganesha to export an NFS share
  3. Create a PV that binds to this NFS share. It’ll be ReadWriteMany. (A rough sketch of this step is below.)
  4. Various pods / containers can then generate PVCs that bind to this PV, and get access to the same underlying storage.
  5. You can enforce security with NetworkPolicy.

Just like most problems, we’ve now ‘solved’ this with a layer of indirection!
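As a rough sketch of step 3 with the kubernetes Python client (the per-user NFS Service DNS name, capacity, and reclaim policy here are made up for illustration):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

username = "someuser"
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name=f"home-nfs-{username}"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "10Gi"},
        access_modes=["ReadWriteMany"],
        # Points at the per-user NFS Ganesha StatefulSet's Service (hypothetical name).
        nfs=client.V1NFSVolumeSource(
            server=f"nfs-{username}.jhub.svc.cluster.local", path="/"
        ),
        persistent_volume_reclaim_policy="Retain",
    ),
)
core.create_persistent_volume(pv)
```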

Big problems with this are:

  1. If the NFS server pod dies, your storage is unavailable until it comes back.
  2. You’re now running not just one but N NFS servers (N = number of users). You must sacrifice a baby goat to $DEITY whenever you run an NFS server to increase the chances that it will not turn into a nightmare of processes stuck in D state on all your nodes (or other horrors you can listen to by mentioning the name NFS to people who have maintained them in the long run), so you need N baby goats (N = number of users)
  3. In addition to baby goats, you also sacrifice some performance.

However, if you want persistent, POSIX-compatible storage (which I think you want for home directories), this is probably the only way to go.

We aren’t the first people to want this, so Rook has already done most of the work here. Now we need to hook this up with KubeSpawner and see what breaks.

Quick question about one assumption, which I don’t see made explicit: the named-server requirement assumes that this runner tool must be in total control of starting and stopping the notebook server. This begets the complicated ReadWriteMany storage problem, because the probability of two servers running at the same time with access to the same data is high. If, instead, the requirement were only that the notebook server be running, this could be simplified a great deal (not without its own costs):

  1. ensure the user's server is running (reuse it if it already is, rather than starting a new one)
  2. run the job on the server
  3. let the idle culler shut it down (we can’t know that the user didn’t show up while we were running)

This eliminates the need both to control the server and to guarantee concurrent multi-server disk access. It does so at the expense of fitting the job's resource requests within those of a possibly-in-use server, rather than allowing independent resource requests for offline jobs.
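A minimal sketch of this variant against the Hub REST API, assuming the same placeholder Hub URL and token as above (the exact fields of the user model differ a bit between JupyterHub versions):

```python
import requests

HUB_API = "https://hub.example.org/hub/api"  # assumed Hub URL
HEADERS = {"Authorization": "token ..."}


def ensure_default_server(username: str):
    """Start the user's default server only if it isn't already running or pending."""
    user = requests.get(f"{HUB_API}/users/{username}", headers=HEADERS).json()
    if "" not in user.get("servers", {}):  # the default server is keyed by ""
        requests.post(f"{HUB_API}/users/{username}/server", headers=HEADERS)
    # ...then poll for readiness as in the earlier sketch, run the job, and do NOT
    # delete the server afterwards; the idle culler will reclaim it eventually.
```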

Back to the broader question of the JupyterHub API vs. Kube directly: as I usually do, I think both make sense. In the long term, though, I think talking directly to kube via airflow/papermill/what have you is the lighter-weight, more robust approach. The challenge, then, is the Jupyter server configuration/extension that makes it convenient to load all the necessary properties, env, etc. into the pod/job template. Most if not all should be accessible from the user’s notebook server environment already. While JupyterHub does a lot with credentials, ultimately that’s all environment variables in the pod, so passing them on to another pod via the creation of a job-specific secret shouldn’t be the biggest hurdle.
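For example, the thing that launches the job could copy an allow-list of variables out of the user's server environment into a short-lived, job-specific Secret that the job pod references. A sketch with made-up variable names and namespace:

```python
import os

from kubernetes import client, config

config.load_incluster_config()  # running inside the user's server pod
core = client.CoreV1Api()


def make_job_secret(job_name: str, namespace: str = "jhub") -> str:
    # Forward only an explicit allow-list of variables, not the whole environment.
    wanted = ["JUPYTERHUB_API_TOKEN", "MY_DATABASE_URL"]  # hypothetical allow-list
    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name=f"job-{job_name}-env"),
        string_data={k: os.environ[k] for k in wanted if k in os.environ},
    )
    core.create_namespaced_secret(namespace=namespace, body=secret)
    return secret.metadata.name
```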

Concurrent PV access does come back as an issue, though there are ways around that, mainly by saying that jobs don’t actually have access to home, e.g. bundling a job context as a copy (a la docker build when using docker-machine). This doesn’t scale for large data (just try running docker build if you have a local node_modules directory…), but large data is hopefully coming from something other than the home volume, i.e. storage that can already be multiply mounted.

If going the JupyterHub API path, I would go with the above simplified proposal of ensuring a server is running rather than owning it entirely, and save the more tailored approach for when you want separate resources/concurrency/etc., like you get from direct use of a job system.


Thanks for the insight @minrk. That’s a good point about reusing the user server. In many cases that would probably be good enough, and it does indeed simplify a lot of questions about the PV.

In my case, there are a lot of users who would probably struggle with bundling a job context (“why can’t I just access the CSV in my home directory?”), so I’m initially aiming for a solution that Just Works™, even if there is a cost in how big/expensive the jobs can be.

In the long term, it makes sense that using the kube APIs directly with a job scheduler would be more robust, and the lifecycle of a job would be simpler to control. These solutions are also not mutually exclusive (though I don’t really see much implementation logic that would be shared between them).

@yuvipanda thanks for writing this - it was the only post on the entire internet that confirmed the problems I’ve been tackling! (“Get ReadWriteMany semantics from a ReadWriteOnce disk”)

I wondered if you or anyone else had an update after trying to translate your ideas into action.

In my case, I’m enabling ContainDS Dashboards on z2jh installations. The data scientist writes a notebook or uploads a script to their regular default My Server on JupyterHub, then Dashboards can automatically spin up a new named server that runs it as a Voila (or similar) dashboard server. The problem is ensuring the new server has access to the same files that were created on the user’s original server.

This works fine for ReadWriteMany PVCs, or for the static storage type.

Your ideas could help when RWO is the only available option, although I may be able to use something more lightweight. I may be happy just cloning the contents of the server tree at launch time, in which case a separate service might be able to temporarily hold the files until the new server is ready to receive them. Think of something like nbgitpuller, but using a simpler temporary service in place of git (and also implementing the ‘commit’ part automatically).

Any thoughts from anyone also tackling this are much appreciated, and I’ll update if I make any progress myself…