Repo2DockerSpawner - alternative version

danlester · March 11, 2020, 10:50am

As part of a project I’ve been working on, I have open sourced a Repo2DockerSpawner for JupyterHub.

This allows the user to select a repo when they spawn a new server (or just use the default ‘blank’ image if they prefer):

It runs repo2docker in a Docker container, and streams logs and the progress bar to the /spawn-pending/ page:

I only came across the @yuvipanda version of this same concept this morning - sorry, it might have been better to combine functionality!

But I still thought this different approach might be useful to some people.

The main differences are:

User options form supplied as default
Progress logs displayed
Runs r2d in a Docker container
Image name caching is carried out by r2d rather than natively in the spawner

Please let me know if you have any questions!

yuvipanda · April 9, 2020, 5:14am

Just wanted to say this is amazing, and I’m glad you built your own! <3 I hope this gets more traction

danlester · April 13, 2020, 12:45pm

Thank you for your enthusiasm - it means a lot!

As before, I think some people will need bits from yours, and some will need mine… but we might as well wait to see if anyone feeds back on either before progressing…

Dan

1kastner · April 13, 2020, 1:41pm

Thank you two for those great contributions! I always love good documentation so I might give it a try soon, @danlester!

Do you think it would be difficult to extend your two implementations to digest a ZIP file instead of a git repository? I want to use a Learning Management System to distribute the material when it is time to work on the exercise. This kind of control we could force on a git repository but you know what people say about hammers and nails - I don’t believe it would be a good solution.

danlester · April 14, 2020, 7:15pm

Thanks for your enthusiasm!

A similar process could certainly work for a ZIP file, e.g. given the URL of a ZIP file. That could be similar to the ‘local folder’ option in Repo2Docker, where the source ‘repo’ is just a collection of files on the local hard disk instead of e.g. a git repo.

However, it gets a bit more complicated to check whether the ZIP file has been updated since the image was last built, at least without downloading it first and/or relying on the server returning HTTP details about the ZIP file reliably. And you wouldn’t want to rebuild the image every time a user creates a server based on the ZIP.

To be clear, it’s not something that my Repo2DockerSpawner can do as it stands, and I’m not sure it fits too neatly into the ‘Binder’ philosophy without some agreed standards. I think it would need an extra ‘wrapper’ to handle the ZIP download and check, compared to the other current source repo options. (Maybe @yuvipanda has more experience here.)

Presumably in your workflow there is a point at which the ZIP file is created - but where from, and how is it ZIPped and uploaded somewhere etc… it might make sense to generate the Docker image at that point, depending on how your users are going to access JupyterHub(s) to use the image. Maybe they just need to be given an extra image in the list of available images for use in the standard DockerSpawner.

If you want to generate ideas at that level, it could be worth writing up your workflow and requirements in a separate post to see if anyone else can suggest a more direct solution. If you do, please link to it from here!

1kastner · April 15, 2020, 6:21am

Thank you very much for your input @danlester, I will follow that lead!

betatim · April 15, 2020, 8:09am

We have had “a ZIP file provider” on the roadmap for a while for repo2docker: https://github.com/jupyter/repo2docker/issues/812

This would be a good contribution to get started learning about how the content provider part of repo2docker works. I think we already have some ZIP file (or archive) handling in the Zenodo/Figshare providers that you can look at for inspiration.

I’d implement the caching based on the value of the ETag header that a server sends. This needs the server to cooperate a bit (i.e. send an ETag header), but I think almost all web servers do that today. My idea would be to use the value of the ETag the same way we use the resolved commit hash of a git repository. This means a ZIP file content provider would make a HEAD request to get the ETag value and, based on that, decide whether it needs to build or not.
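A minimal sketch of the ETag idea above, assuming a plain HTTP(S) URL for the ZIP file. This is not how repo2docker implements anything today; the function names, the `zip-` tag prefix, and the set of already-built tags are all hypothetical and only illustrate using the ETag in place of a resolved commit hash.

```python
import hashlib
import urllib.request
from typing import Optional, Set


def fetch_etag(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the ETag header, or None if the server sends none."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("ETag")


def image_tag_for(url: str, etag: str) -> str:
    """Derive a stable image tag from URL + ETag (hypothetical naming scheme)."""
    digest = hashlib.sha256(f"{url}\n{etag}".encode("utf-8")).hexdigest()
    return f"zip-{digest[:12]}"


def needs_build(url: str, etag: Optional[str], built_tags: Set[str]) -> bool:
    """Return True if no image for this exact URL + ETag has been built yet."""
    if etag is None:
        # Without an ETag we cannot tell whether the ZIP changed, so rebuild.
        return True
    return image_tag_for(url, etag) not in built_tags
```

A caller would run `fetch_etag(url)` first and pass the result to `needs_build`; if the server omits the header, this sketch falls back to rebuilding every time, which is the costly case mentioned earlier in the thread.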

I think a ZIP file fits very well with the Binder philosophy. While it all started with Git repositories on GitHub we now support lots of other content providers. In hindsight maybe repo2docker is doubly misnamed:

it should be “directory-like-thing” instead of “repo”
it should be “container” not “docker”

Though I guess repo2docker is a bit more catchy than directory-like-thing2container. For sure it is less to type.

danlester · April 15, 2020, 11:19am

Thanks for clarifying from the Repo2Docker point of view, @betatim.

Yes, the etag was what I was thinking in “relying on the server returning HTTP details”.

Once ZIP is available in Repo2Docker, it would be a simple case of just updating the UI in Repo2DockerSpawner.

However, even if ZIP was available, I still think it is worth taking a step back and thinking through your whole process. It might not make sense for your users to have to copy and paste the ZIP URL (and/or potentially any other URL) in order to get their server running.

betatim · April 15, 2020, 4:12pm

If they want to use a zip file as the source, how else would you do it? Someone at some point has to construct the URL that points to the source. You can build shortcuts (like the Zenodo content provider) where you type something else (in this case a DOI) but in the end a URL to a zip file is created and downloaded.

danlester · April 15, 2020, 7:03pm

Oh yes, that’s the most obvious way if ZIP through Binder is indeed the solution.

I’m probably just meddling, but encouraged @1kastner to take a step back and think whether it makes sense for ZIP through Binder to be the best way of getting the required image to his students, or whether that was an opportunistic conclusion given the subject of the original post here.

i.e. there must be some workflow to create the workspace/environment required in the first place, before it gets Binderized into a ZIP. Could there be an earlier point in the workflow where there is an opportunity to generate the Docker image, which could then be used directly by students?

I’m interested to hear more of the background story if useful, but also appreciate I might actually be able to take his question at face value - maybe he does just want ZIP through Binder without me interfering!

betatim · April 15, 2020, 7:39pm

Ah ok. It would be interesting to learn how/why people want to use a ZIP file. Or generally why they want to do whatever it is they want to do.

I had read your comment as somehow not using a URL to a ZIP file if you wanted to start from a ZIP file, which seemed pretty wild.

1kastner · April 15, 2020, 8:24pm

@danlester then let’s get to the story behind this.

How files are currently shared

As members of our organization, we are supposed to use a certain platform called Stud.IP to share files. Now we might set up a JupyterHub (everything still very hypothetical). Anyhow, since this is the very beginning, using the JupyterHub should not be enforced and it is not (yet) an officially supported tool of the organization. Hence, it is necessary to stick to the old file distribution system. Furthermore, since I want to control which course content is distributed at which time, I need some method to hide my internal progress. I might have already prepared some files, and I keep them in an internal git repository. This does not mean I want to share my results the moment I have obtained (or updated) them. I need full control over this. Distributing ZIP files through that system gives me that specific control.

Is the JupyterHub the only solution?

In my context, it should not be obligatory to use the JupyterHub. I believe that running some code through Anaconda on your personal laptop can be an empowering experience when you start learning programming. Since the course participants are administrators on their own laptops, installing additional libraries etc. is easy. By using PowerShell etc. they see their own machine (e.g. folder structure navigation through “cd”) with new eyes. On the other hand, if people connect to some strange remote Linux server, they don’t know what they “perceive” and everything is alien. So the JupyterHub is more of an additional option for people with poor hardware, e.g. tablets. It should harmonize well with Jupyter Notebook users.

Keeping this simple

We could work with Docker images as well. We have an organization-internal Docker registry that I could push the image to at the moment I want to publish it. However, this adds yet another tool that the course participants need to install and learn, which is unnecessary complexity. We just want to explain to them what directly helps them work with the Jupyter Notebooks. This is what it is all about.

I believe that every (non-IT) course participant can open a ZIP file and can work through a README file. I doubt this for docker images.

danlester · April 17, 2020, 8:14am

Thank you so much for detailing all of this. It’s a really interesting perspective.

As you say, a ZIP is likely to be meaningful to your students (who might not know git). Even if some students are using JupyterHub, it could be reassuring for them to see that they are starting with the same ZIP URL as everyone else, just feeding the URL into a Binder process instead of exploring manually on their laptop.

I’m sure you’ve digested our input from the technical detail side of Repo2DockerSpawner etc, but to summarise my thoughts:

To use ZIP URL as a source would require repo2docker to be updated to accept this in the first place.
Repo2DockerSpawner would (probably) need minor UI adjustments to support this.
You would need Stud.IP to reliably return etags so that it can use a cached Docker image when more students come with the same URL. Furthermore, and probably a bigger issue, you would need Stud.IP to allow your JupyterHub to have direct unauthenticated access to the URL. That would probably mean having it open to the wider internet, unless everything can be behind a private network or similar.

If Stud.IP is anything like Dropbox, for example, the URL the user clicks to see the ZIP file in the Stud.IP UI will not be directly usable by Repo2DockerSpawner - it is an HTML page, not the ZIP itself. If there is an alternate URL direct to the ZIP, does it need to check for authentication cookies?

If it needs to be authenticated, maybe there is a way to supply credentials (or a different ‘share’ URL) to Repo2DockerSpawner’s UI, but that would need a much bigger change in Repo2DockerSpawner as well as Repo2Docker… As would allowing the ZIP file to be downloaded manually from Stud.IP and then manually uploaded to Repo2DockerSpawner in JupyterHub (pretty cumbersome on a tablet anyway).

So I think a lot of this comes down to the precise behavior of Stud.IP and how your network is configured.

All of the above requires someone to make some code changes to repo2docker etc.

In my view, the immediately-available workaround is for you to build the images yourself and add to the list of images in DockerSpawner - if that’s what you’re using in JupyterHub. You could actually give the ZIP name as the ‘friendly name’ that users see for the image anyway, so they know exactly what they are selecting in relation to the ZIP files shared with all students. Other spawners may not have the same functionality, perhaps allowing only one named image to be available at any time (KubeSpawner, I think).
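A rough sketch of that workaround, assuming DockerSpawner and images that were built and pushed by hand beforehand. The registry path and the helper function are hypothetical; the option is called `allowed_images` in recent DockerSpawner releases and `image_whitelist` in older ones, so check which one your installed version expects.

```python
from pathlib import PurePosixPath


def images_from_zip_names(zip_names, registry="registry.example.org/course"):
    """Map each ZIP file name (shown to users) to an image reference (hypothetical naming)."""
    images = {}
    for name in zip_names:
        stem = PurePosixPath(name).stem  # "week01.zip" -> "week01"
        images[name] = f"{registry}/{stem.lower()}:latest"
    return images


course_images = images_from_zip_names(["week01.zip", "week02.zip"])

# In jupyterhub_config.py, where `c` is provided by JupyterHub:
# c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
# c.DockerSpawner.allowed_images = course_images  # older releases: image_whitelist
```

The dictionary keys are what users see when choosing an image, so using the ZIP file names keeps the choice recognisable for students who already know the ZIPs from the file-sharing platform.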

Sorry again if this is telling you everything you already know! Please keep us updated.

Dan

1kastner · April 17, 2020, 10:29am

Thank you very much for your input. Stud.IP is more a Learning Management System than Dropbox - I know I will need to program some intermediate layer. I considered writing a JupyterLab extension for logging in, choosing the exercise and downloading the ZIP file to the server where JupyterLab is executed. I can’t modify Stud.IP’s behavior, so whatever needs to be done I need to implement separately.

Your workaround seems quite applicable. It means more work for setting up each exercise but less programming.

EDIT: Note to myself - there is the image whitelist attribute not covered in the repo readme which seems to be the main documentation.

yuvipanda · April 21, 2020, 7:31am

Great conversation, and I learnt a lot about this use case!

In persistent systems, I always think of repo2docker as providing the environment (libraries, packages, config files, etc) and nbgitpuller as providing the content.

I really wanna extend nbgitpuller to pull from arbitrary sources, and worked on it a little bit. The idea was:

Create a temporary, read-only git repository someplace
Run f2git each time an nbgitpuller link is clicked. This will reach out to your CMS (currently just canvas), fetch the files, add them to this read-only git repository, and commit it
Then nbgitpuller will actually pull from this repo

So we’re using f2git (possibly with something like rclone?) to pull from arbitrary file sources and put them in a hidden git repository. Then we use nbgitpuller to pull from this repository to the student’s home directory. The students have no idea that git is being used - from their perspective, they clicked a link and they see their content! For instructors, they can just use whatever their CMS uses to store files - Canvas, etc. They can keep materials internally someplace, and have a ‘student-visible’ place that f2git can pull from. This way, the only people who need to know git exists are the infrastructure set up people.
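For reference, the link a student clicks in this flow could be generated roughly as below. This is a sketch that assumes nbgitpuller's usual `git-pull` link format with `repo`, `branch` and `urlpath` query parameters; the hub and repository URLs are placeholders, and the f2git step itself is not shown.

```python
from urllib.parse import urlencode


def nbgitpuller_link(hub_url, repo_url, branch="main", urlpath=None):
    """Build a link that asks nbgitpuller to pull repo_url into the user's home directory."""
    params = {"repo": repo_url, "branch": branch}
    if urlpath:
        # Optional: where to send the user after the pull finishes.
        params["urlpath"] = urlpath
    return f"{hub_url.rstrip('/')}/hub/user-redirect/git-pull?{urlencode(params)}"


link = nbgitpuller_link(
    "https://hub.example.org/",
    "https://git.example.org/course/released.git",
)
```

Here `repo_url` would point at the hidden, read-only repository that f2git maintains, so the instructor only ever shares the generated link.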

A major advantage of this is that nbgitpuller works on the order of seconds, while forcing a full image rebuild takes a while (see @psychemedia’s blog post). Also, nbgitpuller will merge the instructor’s changes with the students’ changes, so students never lose changes. With this, instructors can confidently release content as they wish, and make modifications to released content if they need to - students will not have to see merge conflicts.

This requires a bit of work, but IMO is a better content distribution solution than putting content in docker images. What do y’all think?

psychemedia · April 21, 2020, 9:21am

Re: naming, is archive2docker more generic? Although archive does have other connotations too (like “old, broken stuff put in a cupboard somewhere”…)

1kastner · April 21, 2020, 9:41am

The general description sounds great! nbgitpuller sounds like a good way to have a one-way communication and f2git I would need to modify according to the ILIAS API. How do you ensure that f2git is run whenever the nbgitpuller link is clicked? How do you add that hook?

From a first glance it looks like this method would create a new docker image for each student since nbgitpuller, f2git etc. are run in the Jupyter Notebook docker container. That way, repo2docker would be inefficient. Anyhow, in my particular use case the libraries do not differ that much between the weeks so the advantage of environment isolation can be sacrificed. Hence, the approach of @yuvipanda that does not make use of repo2docker also sounds very interesting. Since we have already moved far away from the initial topic in this thread, I suggest we should maybe continue the discussion elsewhere? I am not experienced with this forum software so any solution I could accept. Thank you to everyone for your interesting and really helpful input!

1kastner · April 21, 2020, 9:48am

My first idea when I read archive2docker was that it is rather a project2docker, because we take a whole project file structure. I am still not happy with the name, but isn’t it about automatically starting a project in isolation as a standalone application? This makes a programming project executable (not in the sense of binary files, of course). That perspective would also allow project names such as project-executor, even though I still dislike that name. But it is more about opening the discussion of what exactly repo2docker does.

yuvipanda · April 21, 2020, 10:35am

I opened an issue, “Add plugin hook to execute commands before git pulling” (Issue #119, jupyterhub/nbgitpuller on GitHub), with information on how we can add this hook.

psychemedia · April 21, 2020, 2:08pm

@1kastner So how about riffing around “runnables” (I’m reminded of the word by this old issue that recently resurfaced around data packages…)
