Performing actions on behalf of users

Quick question about one assumption, which I don’t see made explicit: the named-server requirement assumes that this runner tool must be in total control of starting and stopping the notebook server. That begets the complicated ReadWriteMany storage problem, because it makes two servers running at the same time with access to the same data quite likely. If, instead, the requirement were only that the notebook server be running, this could be simplified a great deal (not without its own costs):

  1. ensure the user’s server is running (reuse it if it is, rather than starting a new one)
  2. run the job on that server (a rough sketch of steps 1-2 against the JupyterHub REST API follows this list)
  3. let the idle culler shut it down (we can’t know that the user didn’t show up while we were running)
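
Here is a rough, untested sketch of steps 1-2 using the JupyterHub REST API (`GET /users/{name}` and `POST /users/{name}/server`), assuming a token with access to the user (e.g. an admin or service token) in `JUPYTERHUB_API_TOKEN`; polling and error handling are simplified, and step 3 is just left to the idle culler.

```python
import os
import time

import requests

HUB_API = os.environ.get("JUPYTERHUB_API_URL", "http://hub:8081/hub/api")
HEADERS = {"Authorization": f"token {os.environ['JUPYTERHUB_API_TOKEN']}"}


def ensure_server(username: str) -> str:
    """Start the user's default server if it isn't running; return its URL prefix."""
    user = requests.get(f"{HUB_API}/users/{username}", headers=HEADERS).json()
    if not user.get("servers"):
        # 201 = started, 202 = spawn pending; either way, poll until ready below
        requests.post(f"{HUB_API}/users/{username}/server", headers=HEADERS)
    while True:
        user = requests.get(f"{HUB_API}/users/{username}", headers=HEADERS).json()
        server = user.get("servers", {}).get("")
        if server and server.get("ready"):
            return server["url"]  # e.g. /user/<name>/
        time.sleep(2)
```

Step 2 would then talk to the running server at that URL (kernel API, a server extension, whatever the runner uses), and step 3 belongs to the culler, not the runner.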

This should eliminate the need to control the server and to guarantee concurrent disk access across multiple servers. It does so at the expense of fitting the job’s resource requests within those of a possibly-in-use server, rather than allowing independent resource requests for offline jobs.

Back to the broader question of the JupyterHub API vs. Kubernetes directly: as I usually do, I think both make sense. In the long term, though, I think talking directly to Kubernetes via Airflow/papermill/what have you is the lighter-weight, more robust approach. The challenge, then, is the Jupyter server configuration/extension that makes it convenient to load all the necessary properties, environment variables, etc. into the pod/Job template. Most if not all of them should already be accessible from the user’s notebook server environment. While JupyterHub does a lot with credentials, ultimately that’s all environment variables in the pod, so passing them on to another pod by creating a job-specific Secret shouldn’t be the biggest hurdle.
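
As a purely hypothetical illustration of that last point, using the official kubernetes Python client from inside the notebook pod: copy a handful of credential environment variables into a job-specific Secret and reference it from the Job’s pod via `envFrom`. The variable list, namespace lookup, image, and job name below are all assumptions, not anything JupyterHub provides out of the box.

```python
import os

from kubernetes import client, config

config.load_incluster_config()  # running inside the user's notebook pod
core = client.CoreV1Api()
batch = client.BatchV1Api()

namespace = os.environ.get("POD_NAMESPACE", "jupyterhub")  # assumed downward-API env var
job_name = "papermill-job-example"

# 1. Bundle the credentials the job needs into its own Secret.
wanted = ["JUPYTERHUB_API_TOKEN", "JUPYTERHUB_USER", "AWS_SECRET_ACCESS_KEY"]
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name=f"{job_name}-env"),
    string_data={k: os.environ[k] for k in wanted if k in os.environ},
)
core.create_namespaced_secret(namespace, secret)

# 2. Create a Job whose pod loads that Secret via envFrom.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name=job_name),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="runner",
                        image=os.environ.get("JUPYTER_IMAGE_SPEC", "jupyter/scipy-notebook"),
                        command=["papermill", "input.ipynb", "output.ipynb"],
                        env_from=[
                            client.V1EnvFromSource(
                                secret_ref=client.V1SecretEnvSource(name=f"{job_name}-env")
                            )
                        ],
                    )
                ],
            )
        )
    ),
)
batch.create_namespaced_job(namespace, job)
```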

Concurrent PV access does come back as an issue, though there are ways around that, mainly by saying that jobs don’t actually have access to home - e.g. bundling a job context as a copy (a la docker build when using docker-machine). This doesn’t scale for large data (just try running docker build if you have a local node_modules directory…), but large data is hopefully coming from something other than the home volume anyway: a source that can already be mounted by multiple pods, etc.
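
For what that context-bundling could look like, a toy sketch (the exclusion list and size cap are arbitrary examples): copy just the notebook and small inputs into a tarball the job pod unpacks, much like a docker build context.

```python
import os
import tarfile
from typing import Optional

EXCLUDE_DIRS = {".git", ".cache", "node_modules", "data"}
MAX_FILE_BYTES = 10 * 1024 * 1024  # skip anything over 10 MB


def build_context(src_dir: str, out_path: str = "job-context.tar.gz") -> str:
    """Bundle a working directory into a tarball, leaving out the big stuff."""

    def keep(info: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
        parts = info.name.split("/")
        if any(p in EXCLUDE_DIRS for p in parts):
            return None
        if info.isfile() and info.size > MAX_FILE_BYTES:
            return None
        return info

    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".", filter=keep)
    return out_path


# e.g. build_context(os.path.expanduser("~/my-analysis")), then ship the archive
# to the job pod (object storage, a ConfigMap, an init container, ...).
```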

If going the JupyterHub API path, I would go with the above simplified proposal of ensuring a server is running rather than owning it entirely, and save the more tailored approach for when you want separate resources/concurrency/etc., like you get from using a job system directly.
