A Persistent BinderHub Deployment

Follow-up to jupyterhub/binderhub/issues/794

We deployed a test instance on notebooks-test.gesis.org, where you can try the described setup. To save build time, use the pre-built repositories:

Beyond making the code available for everyone interested, we plan to introduce this in our production environment and appreciate any feedback and suggestions!

Our goal is to bring persistence to BinderHub. We want to unite the best of JupyterHub and BinderHub. From a user’s point of view, we think the way forward is to enable a binder form on the home page of every user of a JupyterHub installation. To achieve this, we added two new features to BinderHub: authentication and persistent storage.

Authentication

As a first step, authentication was introduced and has been supported by BinderHub since jupyterhub/binderhub/pull/666. You can find more information about enabling authentication in the BinderHub documentation. The config we used on our staging server is as follows:

binderhub:
  config:
    BinderHub:
      auth_enabled: true

  jupyterhub:
    cull:
      # don't cull authenticated users
      users: False
    custom:
      binderauth_enabled: true
    hub:
      redirectToServer: false
      services:
        binder:
          oauth_redirect_uri: "https://notebooks-test.gesis.org/oauth_callback"
          oauth_client_id: "binder-oauth-client-test"

    singleuser:
      # to make notebook servers aware of hub
      cmd: jupyterhub-singleuser

    auth:
      type: github
      github:
        callbackUrl: "https://notebooks-test.gesis.org/hub/oauth_callback"
        clientId: "###secret###"
        clientSecret: "###secret###"
      scopes:
        - "read:user"
      admin:
        users: ['bitnik', 'arnim']

Persistent Storage

The overall desiderata for persistence were to enable multiple projects while keeping the behavior and established directory structure of vanilla Binder environments. This led to the following landmarks that guided our development:

  1. provide each user pod with a PV (Persistent Volume), where multiple projects of a single user can reside, each project in a separate folder
  2. mount the user’s PV somewhere other than the home folder (e.g. /projects), so that users can access files across multiple projects
  3. mount a selected project folder (from user’s PV) into the home folder (/home/jovyan)
  4. start the notebook server in /home/jovyan, which is the default behavior of BinderHub
  5. have the project folder contain the same content as provided by repo2docker, without introducing any additional logic. This is particularly important because projects may use further features of repo2docker such as the postBuild script. As a consequence, we don’t want to use git clone or nbgitpuller to fetch content in this step.
  6. use repo2docker with the default configuration, so we can share output images with other BinderHub deployments, such as at GESIS Notebooks
  7. support the ability to migrate existing users on a JupyterHub without the loss of information

/home/jovyan is also where repo2docker clones repository content to by default. So we had to find a way to copy repo content into the PV before it is mounted into the user pod. For this, we decided to use an initContainer which

  • has the same image as the notebook container
  • has the PV containing all of a user’s projects mounted at /projects/
  • deletes project folders if the user deleted any through the Your Projects table
  • copies the content of the home folder into /projects/<project_folder_name> if the <project_folder_name> folder doesn’t exist
  # example
  initContainers:
  - name: project-manager
    image: <image-name-tag-created-by-repo2docker>
    volumeMounts:
    - mountPath: /projects/
      name: volume-bitnik
    command:
    - /bin/sh
    - -c
    - <first delete projects, then copy content of current repo>

Once the initContainer is done, the user’s notebook container is ready to start. We can then mount the same PV into two different locations: /home/jovyan, with a sub-path of the project folder, and /projects/, where the user can reach all projects:

spec:
  containers:
    volumeMounts:
    - mountPath: /home/jovyan
      name: volume-bitnik
      subPath: <project_folder_name>
    - mountPath: /projects/
      name: volume-bitnik
  volumes:
  - name: volume-bitnik
    persistentVolumeClaim:
      claimName: claim-bitnik

The initContainer and the PV of the user pod are configured for each user during spawn, in the start method of PersistentBinderSpawner. PersistentBinderSpawner customizes KubeSpawner to:

  • save all the projects a user has in Spawner's state (JSONDict) field under the projects key
  • cache deleted projects under the deleted_projects key until their actual removal
  • get the image name and tag from user_options, which is produced by the binder build process
  • configure initContainers as mentioned above
  • configure PV of the user pod as mentioned above
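The steps above could be assembled roughly as follows. This is a minimal sketch under assumptions: the helper name build_storage_config and its return shape are ours, invented for illustration; the real logic lives in the start method of PersistentBinderSpawner, a KubeSpawner subclass:

```python
def build_storage_config(username, project, deleted_projects, image):
    """Hypothetical helper: build the pod's storage configuration
    (volumes, mounts, initContainer) for one user and one project."""
    volume = f"volume-{username}"

    # shell command for the project-manager initContainer:
    # first delete removed projects, then seed the project folder on first launch
    init_cmd = "; ".join(
        [f"rm -rf /projects/{name}" for name in deleted_projects]
        + [f"[ -d /projects/{project} ] || cp -a /home/jovyan /projects/{project}"]
    )

    return {
        "volumes": [{
            "name": volume,
            "persistentVolumeClaim": {"claimName": f"claim-{username}"},
        }],
        "volume_mounts": [
            # the project's subfolder of the PV becomes the home directory...
            {"name": volume, "mountPath": "/home/jovyan", "subPath": project},
            # ...while the full PV stays reachable under /projects/
            {"name": volume, "mountPath": "/projects/"},
        ],
        "init_containers": [{
            "name": "project-manager",
            "image": image,  # image:tag from user_options, produced by the binder build
            "volumeMounts": [{"name": volume, "mountPath": "/projects/"}],
            "command": ["/bin/sh", "-c", init_cmd],
        }],
    }
```

In the actual spawner these values would be set on the corresponding KubeSpawner attributes before calling the parent start method.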

Notes:

  • Users can launch one project at a time on the test instance and have up to 5 projects in total
  • When a user launches a repo from the Your Projects table, the user continues this project where they left off, with the same image and code base
  • The code base is only copied from the image when the project folder is missing in the PV
  • Users can update a project’s image by using the binder form
  • Users can use git or nbgitpuller to manually update the repository content

Deployment repository

gesiscss/example-binderhub-deployments is a repository where we hold config files for different kinds of BinderHub deployments. Here we want to point to some important files for our persistent BinderHub deployment:

Last but not least we (@arnim and @bitnik) want to thank the incredible Binder community for supporting this with awesome contributions and invaluable advice.


This is one of the coolest new features to arrive in BinderHub land!

Thanks a lot for working on this and having the patience with slow reviews, nitpicking, and questioning things. I am super happy to see this idea that can be described in 5min but takes months to build and get right “in production”!

Now … how can we offer this on mybinder.org :slight_smile: I’ll be off searching for a rainbow with a pot of gold at the end :rainbow::trophy:.


This sounds really handy…

Being a bear of little brain, I’m trying to think this through and clarify some key differences. If I understand correctly:

  • in JupyterHub, if I spawn environments via Docker, then each user can have a persistent volume associated with each Docker environment; files from the userX-imageY data volume are mounted into containerY when it runs;

  • in persistent BinderHub, each user has a single data volume (userX) with several project directories; a specific project directory (userX/projZ) is mounted into a particular Binder project environment (binder-repoZ/mountpoint) when it runs?

@psychemedia

each user can have a persistent volume associated with each docker environment

There is one persistent volume per user that is always available via /projects, independent of which project/docker-env is currently running. /projects/$project_name is the subfolder for the currently active $project_name and is mirrored to /home/jovyan.

@bitnik @arnim what does custom.binderauth_enabled do? I can’t seem to find an explanation in the docs? Is this something that’s available in the current BinderHub helm chart or have you configured it elsewhere?

Wonderful! Thank you for sharing this work and writing it up in detail.

I like the structure, placing the user in a project directory with files from other projects still accessible in /projects. I happen to have built a duct-tape-and-hot-glue implementation with the same structure on an HPC system (no Kubernetes available, unfortunately). One question that has come up is what should happen when users modify the software environment interactively.

In your test deployment, just as with normal Binder, the user has the ability to install additional software (i.e. to modify the content of /opt/conda/) interactively. That flexibility is important for experimentation. Of course, if a user stops and re-launches the Project, the software environment will be reset because a fresh container is launched. This seems good to me because it prohibits users from diverging from the Binder specification over time and creating a long-lived, irreproducible “junk drawer”. If the user finds themselves consistently needing additional software, it’s time to make a new Binder repo.

I have, however, heard some interest in persisting changes to the software environment by storing the stopped container and restarting it when the Project is restarted, a feature that another JupyterHub-like project provides. My gut feeling is to regard this as an anti-feature. Users may initially find it inconvenient to start from a clean slate each time, but in my view it guides them toward practices that will be beneficial in the long run. Have others given any thought to this question and how to respond to it?


@sgibson91 it is available in the current BinderHub and it is used to tell BinderSpawner that auth is enabled: https://github.com/jupyterhub/binderhub/blob/8c51534a9517d40f82fa2546e99660e88d94f5e7/helm-chart/binderhub/values.yaml#L62-L73
related PR: https://github.com/jupyterhub/binderhub/pull/1023

So what does that mean exactly? Why does BinderSpawner need to know? Can it be used to allow access to private repos or is it just for labelling a pvc with the authenticated username?

BinderSpawner starts the notebook server differently depending on whether auth is enabled (https://github.com/jupyterhub/binderhub/blob/8c51534a9517d40f82fa2546e99660e88d94f5e7/helm-chart/binderhub/values.yaml#L80). Before, we had to define a new BinderSpawner whenever we wanted to enable authentication, but now we just have to set that option (https://github.com/jupyterhub/binderhub/pull/1023/files#diff-1e5341c6eb671dbeb82d2c741e17f209). That is the purpose of that PR.

That setting is not related to accessing private repos or PVCs.


Thank you! :sparkles:

Hey again,

I’m trying to implement this myself on a test hub, but the redirection after authentication doesn’t seem to be working correctly. The output of kubectl logs HUB_POD shows that my authentication was successful but I’m not redirected from the “Sign in with GitHub” JupyterHub page. I followed the documentation here and here. Do you have any tips please? :slightly_smiling_face:

Have you tried it with dummy auth? This will help narrow down whether the problem is the interaction with the external OAuth workflow or an internal one:

jupyterhub:
  auth:
    type: dummy
    dummy:
      password: 'password'
    whitelist:
      users:
        - test

So I started with just the dummy authenticator, which was fine, and then upgraded to GitHub oauth, also good. The issue started when I tried to implement the persistent storage with it. I’m finding it very difficult to distill what I actually need to do to make that jump from the example repo alone. Is there an accompanying blog post (or a plan to write one) that is a guide to deploying this and explaining where the IPs go? I think it would be really beneficial.


We plan to put forward more documentation. However, this is still very early, and even at GESIS we do not have this in production.
