Let JupyterHub configure and mount group volumes

On Z2JH there is no support for group-based volumes. To get group-based volumes, two issues must be solved:

  1. The user’s group/team information should be available.
  2. The group-based volumes must reside on a network file system (e.g. NFS).

The reason for 2 is that users can be assigned to different physical machines in the cloud, so the volume’s underlying filesystem must support concurrent reads and writes from distinct machines. Issue 1 can already be solved by manually writing each user’s group(s) in the config file, but this is tedious and requires changing the config whenever a user leaves a group or joins another one. It certainly wouldn’t scale to a setting with more than 100 users spread across many groups.

For issue 1 I have already proposed a solution, and an issue has been opened on the JupyterHub/OAuthenticator GitHub, which has already been noticed and commented on.

Issue 2 is still open; solving it consists of getting JupyterHub to automatically spawn the NFS server.

When both of these issues are solved, the hub config file should support something similar to this:

# The config should support the following:
hub:
  config:
    group_volumes: true
    group_volumes_server: "NFS" # Could be NFS by default

Over and over again I find posts from people asking about NFS on Z2JH or about group functionality (myself included), which is why I really want this to be a part of JupyterHub itself. Here are just some examples of previous posts regarding NFS, group volumes, or both:
Mar '22 - Asking about group volumes (my own post)
Mar '22 - Asking about group shares
Jun '21 - Manually getting group information in KubeSpawner class
Oct '20 - Asking about groups and shares
Sep '20 - Asking about NFS
Nov '19 - Talks about NFS on Z2JH


It’s been quite a while since this was posted, and a lot of progress has been made in the meantime. The option hub.config.GitHubOAuthenticator.populate_teams_in_auth_state has been added, which helps a lot in getting a user’s team and organization information.

Here is a thorough example of how to create and mount team-based persistent volumes, so that users belonging to the same GitHub team can write simultaneously to a team-specific shared hard drive even if they are on distinct physical nodes in the cloud. This approach is suitable if you are using a cluster with the cluster autoscaler enabled, which results in users spawning on distinct physical machines.

First, a service that provides networked file storage must be created, e.g. an NFS server. Kubernetes has its own example of how to set up an NFS server here. Simply put, you just need three files to create the NFS server:

(File1) Here is the single hard drive that the NFS server uses:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-exports
  namespace: jhub # Use your own namespace
spec:
  accessModes: [ "ReadWriteOnce" ] # NFS handles concurrent writes and reads
  resources:
    requests:
      storage: 200Gi # This is the huge disk, determine the size as needed

(File2) Here is the NFS server itself:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server
  namespace: jhub
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - name: nfs-server
        image: k8s.gcr.io/volume-nfs:0.8
        ports:
          - name: nfs
            containerPort: 2049
          - name: mountd
            containerPort: 20048
          - name: rpcbind
            containerPort: 111
        securityContext:
          privileged: true
        volumeMounts:
          - name: export
            mountPath: /exports
      volumes:
        - name: export
          persistentVolumeClaim:
            claimName: nfs-pvc-exports

(File3) Here is the service that exposes the ports of the NFS server to all other pods in the cluster:

kind: Service
apiVersion: v1
metadata:
  name: nfs-server
  namespace: jhub
spec:
  ports:
    - name: nfs
      port: 2049
    - name: mountd
      port: 20048
    - name: rpcbind
      port: 111
  selector:
    role: nfs-server

Now just kubectl apply these three files, and you have an NFS server up and running in your cluster.
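
As a minimal sketch (assuming you saved the three manifests above as nfs-pvc.yaml, nfs-server.yaml and nfs-service.yaml; the filenames are arbitrary and my own choice), applying and checking the server looks like this:

kubectl apply -f nfs-pvc.yaml -f nfs-server.yaml -f nfs-service.yaml
kubectl get pods -n jhub -l role=nfs-server # the pod should reach the Running state

Now, to get the organization and team names of a user logging in, make sure to have the following content in your hub’s config.yaml file: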

hub:
  config:
    JupyterHub:
      authenticator_class: github
    GitHubOAuthenticator: 
      # If you don't know what the next 3 lines are, read this link from the z2jh official guide:
      # https://zero-to-jupyterhub.readthedocs.io/en/latest/administrator/authentication.html#github
      client_id: <your-client-id>
      client_secret: <your-client-secret>
      oauth_callback_url: <https://your-jupyterhub-domain/hub/oauth_callback>
      enable_auth_state: true # This enables us to store user information.
      populate_teams_in_auth_state: true # This is what saves the team information of a user.
      allowed_organizations:
        - OrgA:TeamAlpha
        - OrgA:TeamBeta
        - OrgA:TeamGamma
      scope:
        - read:org # Required, otherwise the user's team information cannot be retrieved.

The above code of course implies that you have set up a GitHub OAuth App, but this is really simple to do and only takes 5-10 minutes. If in doubt, just follow the official Z2JH guide mentioned in the comment above.
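
For reference, with enable_auth_state and populate_teams_in_auth_state turned on, each user’s auth state ends up containing a teams list shaped roughly like this (an illustrative sketch trimmed to the fields the custom spawner below uses; the real objects returned by GitHub contain more fields):

# Illustrative only: the rough shape of the auth state the custom spawner reads.
auth_state = {
    # ... other OAuth fields (access token, user profile, ...)
    "teams": [
        {
            "name": "TeamAlpha",                # team display name (used by the spawner below)
            "slug": "teamalpha",                # URL-safe team name, also returned by GitHub
            "organization": {"login": "OrgA"},  # owning organization (used by the spawner below)
        },
        # ... one entry per team the user is a member of
    ],
}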

Now, when users log in, we get their team and organization information, so we can mount volumes that are specific to those teams. To create a persistent volume for each team, the following template YAML file, org-team-template.yaml, will be used:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: ${ORG_TEAM_NAME}-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server.jhub.svc.cluster.local
    path: "/${ORG_TEAM_NAME}"
  mountOptions:
    - nfsvers=4.2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${ORG_TEAM_NAME}-pvc
  namespace: jhub
spec:
  accessModes:
    - ReadWriteMany # Users can use this volume from distinct physical machines
  storageClassName: ""
  resources:
    requests:
      storage: 1Gi
  volumeName: ${ORG_TEAM_NAME}-pv

Now comes the only manual labor part of this process, which is creating the team volumes.

Human Labor :neutral_face:
Copy your allowed org and team names into a file; let’s name the file allowed-teams.txt:

OrgA:TeamAlpha
OrgA:TeamBeta
OrgA:TeamGamma

Then get the name of your nfs-server pod with the command: kubectl get pods -n <namespace>. The nfs-server has a semi-random name like nfs-server-asd38-alks3d.
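
If you would rather not copy the pod name by hand, one way to grab it (using the role=nfs-server label defined in the Deployment above) is:

NFS_POD=$(kubectl get pods -n <your-namespace> -l role=nfs-server -o jsonpath='{.items[0].metadata.name}')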

Now we just need to loop over the team names in allowed-teams.txt, create a PersistentVolume and claim for each team, and create a corresponding folder on the NFS server. One detail: Kubernetes object names must be lowercase and cannot contain ':', so the script below also normalizes names like OrgA:TeamAlpha to orga.teamalpha, which is the same naming the custom spawner further down expects. The following script reads the team names line by line from allowed-teams.txt:

#!/bin/bash
# For each team in allowed-teams.txt: create the PV/PVC pair and a matching folder on the NFS server.
while read -r LINE; do
    # Kubernetes object names must be lowercase and cannot contain ':',
    # so "OrgA:TeamAlpha" becomes "orga.teamalpha" (the naming the spawner below expects).
    # The variable must be exported, otherwise envsubst cannot see it.
    export ORG_TEAM_NAME=$(echo "$LINE" | tr ':' '.' | tr '[:upper:]' '[:lower:]')
    envsubst < org-team-template.yaml | kubectl apply -f -
    kubectl exec <nfs-server-pod-name> -n <your-namespace> -- sh -c "mkdir -p /exports/$ORG_TEAM_NAME && chmod -R 777 /exports/$ORG_TEAM_NAME"
done < allowed-teams.txt
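
For the example allowed-teams.txt above, this creates one folder per team on the NFS server (e.g. /exports/orga.teamalpha) and one PV/PVC pair per team (e.g. orga.teamalpha-pv and orga.teamalpha-pvc). A quick sanity check:

kubectl get pv,pvc -n jhub # one PV and one PVC per team
kubectl exec <nfs-server-pod-name> -n <your-namespace> -- ls /exports # one folder per team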

With the team folders and persistent volumes in place, the only part left to do is to add the following code to the config file:

hub:
  extraConfig:
      # A post was made about deeply customizing the KubeSpawner, and from there an example of overriding the start function was taken:
      # Discourse post: https://discourse.jupyter.org/t/advanced-z2jh-deeply-customizing-the-spawner/8432
      # Github link:    github.com/berkeley-dsep-infra/datahub/blob/21e4a45c9f694578ec297c2947a0537f3bdcaa5b/hub/values.yaml#L296
      00-custom_spawner.py: |
        from kubespawner import KubeSpawner

        class CustomSpawner(KubeSpawner):

          async def start(self):
            auth_state = await self.user.get_auth_state()
            # For each team the user belongs to, mount the corresponding team volume.
            for name_index, team in enumerate(auth_state['teams']):
              # Must match the PV/PVC names created by the script above: lowercase, '.' separator.
              # (team['name'] is the display name; team['slug'] may be safer if your team names contain spaces.)
              nfs_pv_name = (team['organization']['login'] + '.' + team['name'] + '-pv').lower()
              nfs_pvc_name = nfs_pv_name + 'c'
              self.volumes += [{'name': str(name_index), 'persistentVolumeClaim': {'claimName': nfs_pvc_name}}]
              self.volume_mounts += [{'mountPath': '/home/' + nfs_pv_name[:-3], 'name': str(name_index)}]

            return await super().start()

        c.JupyterHub.spawner_class = CustomSpawner

The for loop reads as “for each team the user belongs to, mount the corresponding team volume”, and this code runs every time a user’s server is started. Done.
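
To make it concrete: for a user who is a member of OrgA:TeamAlpha and OrgA:TeamBeta, the loop above ends up adding the equivalent of the following (illustrative values, using the normalized names created earlier):

self.volumes += [
    {'name': '0', 'persistentVolumeClaim': {'claimName': 'orga.teamalpha-pvc'}},
    {'name': '1', 'persistentVolumeClaim': {'claimName': 'orga.teambeta-pvc'}},
]
self.volume_mounts += [
    {'mountPath': '/home/orga.teamalpha', 'name': '0'},
    {'mountPath': '/home/orga.teambeta', 'name': '1'},
]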

Finally, a question for the Z2JH community and maintainers: could something like this configuration be added to the Helm chart, so that it can be easily enabled in the config file?
