Mounting server data on each user's pod

(copy of post from #794)

We are working on an open paper (in the spirit of Distill), focused on neuroimaging for now. Any user could access a reproducible paper through Binder, and anyone could also upload a new article there.

We are trying to let users have data downloaded from the web onto our server using https://github.com/SIMEXP/Repo2Data (discussed in jupyter/repo2docker#460).

@bitnik We have a BinderHub running on our server and were wondering how to “mount” the data in the user’s notebook, and how to launch repo2data every time a user uploads a new repository. Could you explain in more detail how we could do that, for example using https://github.com/gesiscss/example-binderhub-deployments/blob/69c9efad09df8df795a82cfd21f1c56d27f11f43/persistent_storage/config.yaml? (For now we are not concerned with authentication.) Here is our config file:

jupyterhub:
  ingress:
    enabled: true
    hosts:
      - conp7.calculquebec.cloud
    annotations:
      ingress.kubernetes.io/proxy-body-size: 64m
      kubernetes.io/ingress.class: nginx
      kubernetes.io/tls-acme: 'true'

  hub:
    baseUrl: /jupyter/
  proxy:
    service:
      type: NodePort
  singleuser:
    memory:
      guarantee: 4G
    cpu:
      guarantee: 2

# BinderHub config
config:
  BinderHub:
    hub_url: https://conp7.calculquebec.cloud/jupyter
    use_registry: true
    image_prefix: cmdntrf/conp7.calculquebec.cloud-

service:
  type: NodePort

storage:
  capacity: 2G

ingress:
  enabled: true
  hosts:
    - conp7.calculquebec.cloud
  annotations:
    kubernetes.io/ingress.class: nginx
  https:
    enabled: true
    type: kube-lego
  config:
    # Allow POSTs of up to 64MB, for large notebook support.
    proxy-body-size: 64m

Thank you,


Basically, we do not want users to have write access to our server, which is why repo2data would be launched on the server every time a user uploads a new repository.

The process can be summarized as follows:

  1. A user uploads their work (a notebook), which uses some datasets (from https://openneuro.org/ for example).
  2. They provide a configuration file, data_requirements.json, in the repo, which specifies where the data lives.
  3. After the Docker image is built, we launch repo2data on our server, which reads the user’s data_requirements.json.
  4. The dataset is downloaded into a folder /data on our server (unless it already exists there).
  5. /data is accessible read-only to every user (every notebook running on our Binder).
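Steps 3 and 4 above could be sketched roughly as follows. This is only an illustration, not how repo2data works internally: `ensure_dataset` and the `download` callable are hypothetical names, and the `projectName` and `src` fields are assumed to be present in the requirements file. The point is the "download only if missing" check, plus a temporary folder so that an interrupted download is not mistaken for a complete dataset.

```python
import json
from pathlib import Path
from typing import Callable


def ensure_dataset(requirements_file: Path, data_root: Path,
                   download: Callable[[str, Path], None]) -> Path:
    """Fetch the dataset described in the requirements file unless it is cached."""
    spec = json.loads(requirements_file.read_text())
    # Assumption: the file names the dataset ("projectName") and its source ("src").
    target = data_root / spec["projectName"]
    if target.exists():
        return target  # step 4: already on the server, nothing to download

    # Download next to the final location, then rename, so a failed
    # download never leaves a half-filled folder under the final name.
    partial = data_root / (spec["projectName"] + ".partial")
    partial.mkdir(parents=True, exist_ok=True)
    download(spec["src"], partial)
    partial.rename(target)
    return target
```

Here `download` stands in for whatever actually fetches the data (repo2data in our case); a second call for the same dataset returns immediately.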

@ltetrel sorry for the late reply.

To mount the /data folder on your server into each user pod (notebook), you can use additional storage volumes: https://zero-to-jupyterhub.readthedocs.io/en/latest/user-storage.html#additional-storage-volumes

Here is an example with the hostPath volume type:

jupyterhub:
  singleuser:
    storage:
      extraVolumes:
      - name: shared-data
        hostPath:
          path: /path/to/shared/data
      extraVolumeMounts:
      - name: shared-data
        mountPath: /data  # where each user can reach the shared data
        readOnly: true

Depending on your setup and where your data lives, you can choose an appropriate volume type: https://kubernetes.io/docs/concepts/storage/volumes/#types-of-volumes

To launch repo2data each time a repository is launched, I think a pre_spawn_hook should work:

jupyterhub:
  hub:
    extraConfig:
      myExtraConfig: |
        async def my_pre_spawn_hook(spawner):
            repo_url = spawner.user_options.get('repo_url')
            ref = spawner.user_options.get('image').split(':')[-1]  # commit hash
            # TODO get data_requirement.json from repo
            # TODO run repo2data

        c.KubeSpawner.pre_spawn_hook = my_pre_spawn_hook

This will work under two conditions:

  1. The hub has access to the same data volume that is mounted into the user pods, with write access (not read-only). Here is an example that complements the singleuser config above:
jupyterhub:
  hub:
    extraVolumes:
    - name: shared-data
      hostPath:
        path: /path/to/shared/data
    extraVolumeMounts:
    - name: shared-data
      mountPath: /data  # where hub can reach the shared data
  2. Your hub image has repo2data and its requirements installed. So you have to extend the hub image and then use your own image in your config:
jupyterhub:
  hub:
    image:
      name: <your_hub_image_name>
      tag: <tag>
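As a rough sketch of how the two TODO steps in the hook could be filled in, assuming a public GitHub repository: `raw_file_url` and `fetch_and_install` are hypothetical helper names, the file name `data_requirement.json` is taken from the TODO comment, and the `repo2data -r <file>` invocation is an assumption to check against the Repo2Data README.

```python
import subprocess
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def raw_file_url(repo_url: str, ref: str,
                 filename: str = "data_requirement.json") -> str:
    """Build the raw.githubusercontent.com URL of a file at a given commit."""
    parts = urlparse(repo_url)
    if parts.netloc != "github.com":
        raise ValueError(f"only github.com repositories are handled here: {repo_url}")
    path = parts.path.strip("/")
    if path.endswith(".git"):
        path = path[:-4]
    owner, repo = path.split("/")[:2]
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{filename}"


def fetch_and_install(repo_url: str, ref: str) -> None:
    """Download the requirements file and pass it to repo2data (blocking call)."""
    with tempfile.TemporaryDirectory() as tmp:
        req = Path(tmp) / "data_requirement.json"
        urllib.request.urlretrieve(raw_file_url(repo_url, ref), req)
        # Assumption: the repo2data CLI takes "-r <requirements file>".
        subprocess.run(["repo2data", "-r", str(req)], check=True)
```

Since the hook is async and the download blocks, it is probably safer to run it in a thread from inside the hook, e.g. `await asyncio.get_running_loop().run_in_executor(None, fetch_and_install, repo_url, ref)`, so the hub stays responsive while the data is downloaded.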

Hope this helps :)


Thank you @bitnik

I will try that and let you know.

Here is an update on this post.
I managed to do what we wanted; you can see our configuration file here.
Here are some issues we ran into:

  1. Don’t forget to create/mount a persistent volume and share it among your nodes (where the pods live).
  2. The folder should be writable by the hub so that BinderHub can push data into it.
  3. Create a hub image on Docker Hub and reference it in the configuration file.
  4. Give the user read permission on the data.
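A quick way to check points 2 and 4 is to run something like the following inside a pod: in a user notebook the folder should be readable but not writable, while in the hub pod it should be writable. This is just a sketch; `mount_status` is a hypothetical helper and `/data` is the mount path used in the config above.

```python
import os
from pathlib import Path


def mount_status(path: str = "/data") -> dict:
    """Report whether a mounted folder exists and how this process can access it."""
    p = Path(path)
    return {
        "exists": p.is_dir(),
        # Listing a directory needs both read and execute permission.
        "readable": os.access(p, os.R_OK | os.X_OK),
        "writable": os.access(p, os.W_OK),
    }


if __name__ == "__main__":
    print(mount_status("/data"))
```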

We still have some problems with the HTTP timeout. Sometimes the data download takes a long time and we hit a timeout. I saw that the timeouts can be changed here, but I am not sure which one matters: https://jupyterhub.readthedocs.io/en/stable/api/spawner.html#spawner
Do you have any ideas, @bitnik?

Thank you for your help again!