Brainstorming: Repo2Docker Action -> VM on GCP, AWS, Azure?

Today we launched an official repo2docker GitHub Action that makes it even easier for people to leverage the power of repo2docker in their workflows.

One problem I have been thinking about is the desire for folks to quickly launch a ready-to-go Jupyter server, with dependencies loaded via repo2docker, on the compute and cloud of their choice. This is very useful if you want specialized compute like a GPU or a high-memory/CPU instance and, for whatever reason, mybinder.org doesn’t suit your needs. Google Colab is OK, but you have to install all your dependencies yourself, so I think there is a gap repo2docker can fill: getting and sharing an environment with others that is ready to go, with the appropriate compute, with no hassle.

What I was thinking is that it would be nice to provide a high-level API that takes as input (1) your cloud credentials and (2) an instance type, and in return provides a URL for you to access your Jupyter server. Ideally this would happen automatically when you fork a repository, with guided steps along the way — for example at a conference training session involving Jupyter notebooks — or it could be triggered manually. We could build something like this with GitHub Actions.
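As a concrete sketch of what such a workflow could look like, here is a GitHub Actions file along these lines. Note that the action name, inputs, and secret names here are hypothetical placeholders, not a published interface:

```yaml
name: launch-jupyter-vm
on:
  workflow_dispatch:          # triggered manually from the Actions tab
jobs:
  launch:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Hypothetical composite action: build the repo with repo2docker,
      # provision a VM on the chosen cloud, and report back a URL
      # (or an ssh tunnel command) for the running Jupyter server
      - uses: example/repo2vm-action@v1
        with:
          cloud: gcp
          instance-type: n1-standard-8
          credentials: ${{ secrets.CLOUD_SA_KEY }}
```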

Some questions I have:

  • Is this something that people would want in the Jupyter community?
  • Is there anything inappropriate about integrating with a cloud provider like this? Any guidelines or suggestions on how to approach it, alternative ideas, or tips on keeping it as cloud-agnostic as possible? I want to prototype this on one cloud to begin with, and was thinking Google Cloud since I have no affiliation with them, to reinforce the neutrality of my intentions. I’m also happy not to work on this if it’s a bad idea; I just wanted to discuss it.
  • There might be security concerns: a publicly reachable URL could be abused and become a vector for malicious actors to exploit users of this tool. Some ideas I have to mitigate this:
    • Restrict the GitHub Action to doing these things only on private repositories.
    • Don’t provide a URL; instead have people ssh and port-forward to localhost, and provide the command so they can just copy and paste it into their terminal. This requires some additional setup steps but might be OK.
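For the ssh route, the Action could print a ready-made tunnel command for the user to paste. A minimal sketch, assuming Jupyter listens on port 8888 inside the VM (the IP address and user name are placeholders):

```shell
VM_IP="203.0.113.10"   # placeholder: the VM's external IP
LOCAL_PORT=8888
REMOTE_PORT=8888       # port Jupyter listens on inside the VM
# -N: run no remote command; -L: forward the local port to the VM's Jupyter port
TUNNEL_CMD="ssh -N -L ${LOCAL_PORT}:localhost:${REMOTE_PORT} jovyan@${VM_IP}"
echo "Run this, then open http://localhost:${LOCAL_PORT} in your browser:"
echo "${TUNNEL_CMD}"
```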

Aside: I was contemplating using ngrok to generate the URL by running it on the VMs, which in my experiments seems to work. We would just have to discuss whether it can be secured sufficiently, or whether we need to go down the ssh-tunnel route instead.
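For reference, the ngrok variant is a one-liner run on the VM; older ngrok releases accept a basic-auth flag on the tunnel, which helps but doesn’t fully answer the abuse concern. The credentials here are placeholders, and the exact flag depends on the ngrok version:

```shell
# Expose the VM's local Jupyter port through an ngrok tunnel with basic auth
NGROK_CMD='ngrok http -auth "user:change-me" 8888'
echo "${NGROK_CMD}"
```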

I haven’t fleshed this idea out completely, but I wanted to get general opinions and guidance on how or if I should even try to work on this. Really looking forward to everyone’s input.

cc: @betatim @choldgraf @willingc @MSeal


Having recently done some user testing in this area, these are my findings:

Summary: a persistent box running your image lets you develop/demo/qa in a highly reproducible environment.

  1. The easiest way to play around with a setup like this is to set up a GCP VM and call gcloud compute instances update-container myvm --zone us-central1-a to update the image.
  2. The main advantage of a persistent VM over a binder instance is the persistence (obviously). This means it’s best to deploy to it only when explicitly requested (e.g. when a tag is pushed), to avoid overwriting changes.
  3. Where binder is more useful for trying someone else’s code, a VM is better for development; therefore it’s ideally personalised (easy to set up your git credentials etc.) and contains dev dependencies.
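Expanding on point 1, the full build-push-update loop might look like the following, printed here as copy-paste commands. The project, image, and VM names are placeholders, and this assumes a container VM was created earlier with gcloud compute instances create-with-container:

```shell
IMAGE="gcr.io/my-project/my-repo:latest"   # placeholder image name
# The loop, shown as commands to copy and paste:
cat <<EOF
jupyter-repo2docker --no-run --image-name ${IMAGE} .
docker push ${IMAGE}
gcloud compute instances update-container myvm --zone us-central1-a --container-image ${IMAGE}
EOF
```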

This is an interesting initiative and I reckon it will require more core changes to r2d images to make it really useful.


I think this is a super interesting avenue to explore. It goes along the lines of “bring your own compute” for BinderHub.

I think @yuvipanda at some point made a small script that takes a repo2docker image and spins up a GCP node for you to run it. Can’t find it right now, though.


What kind of changes are needed? My containers built with r2d seem to always launch perfectly out of the box, so I’m curious what you are thinking here.

Sounds like I really need to meet @yuvipanda! This is a great tip; I’ll try to reach out as well to get more info.

Kind of related: I wrote an experimental AnsibleSpawner for spinning up docker/podman/cloud-VMs/storage/anything-else with JupyterHub:


Ansible has loads of modules for working with cloud providers. It doesn’t provide a full abstraction across clouds; for example, the openstack modules behave differently from the AWS modules.

Perhaps related, I notice that Docker are exploring a simplified CLI route to running arbitrary docker containers / docker-compose setups on AWS: https://www.docker.com/blog/from-docker-straight-to-aws/

And there are recipes for running docker containers on eg Digital Ocean: https://www.digitalocean.com/community/tutorials/how-to-use-a-remote-docker-server-to-speed-up-your-workflow
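The Digital Ocean recipe boils down to Docker’s context mechanism, which routes ordinary docker CLI commands over ssh to a remote daemon. A minimal sketch (the address and image name are placeholders, shown as copy-paste commands):

```shell
REMOTE="ssh://jovyan@203.0.113.10"   # placeholder: user and address of the VM
# Create a named context pointing at the remote daemon, then run images
# there while typing commands on your local machine:
CREATE_CMD="docker context create myvm --docker host=${REMOTE}"
RUN_CMD="docker --context myvm run -p 8888:8888 my-r2d-image"
echo "${CREATE_CMD}"
echo "${RUN_CMD}"
```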

I also note a couple of things from the datasette project, specifically:

  1. the plugin mechanism that allows third party / extension functionality to be added to datasette;
  2. the publish mechanism, which allows the datasette server to be packaged in a container and launched on various third party servers.

The two work together, as for example in the https://github.com/simonw/datasette-publish-fly plugin.

I’ve often wondered:

a) would it be possible to crib these approaches for repo2docker?
b) would it be possible to abstract out the datasette publish elements so that they could be reused in various projects, e.g. use the same X publish package in datasette or repo2docker?


I’d say ideally there is some entrypoint which sets up an individual dev environment.

E.g. logs me into AWS/GCP/GitHub.

This way I don’t need to do that manually every time I update the image.

Do you mean a binderhub entrypoint that logs you in to AWS/GCP with OAuth, spins up a VM, and builds or deploys the repo2docker image on that? Or did you mean a dev environment inside the image built by repo2docker?

The latter.

Ideally the action gives me a JupyterLab link on my VM. When I land inside the VM, I’m hooked up to GitHub/AWS/GCP.

What use is a dev VM without integration with these services? It may as well be a short-lived binder instance in that case, IMO.

An extremely powerful tool in this space is packer, which can make just about anything, using just about anything, with a sane, no-templating JSON language: e.g. you can build a docker image with ansible playbooks, etc. It claims to support:

Amazon EC2, CloudStack, DigitalOcean, Docker, Google Compute Engine, Microsoft Azure, QEMU, VirtualBox, VMware, and more.

I’ve primarily used it to keep a full audit log while generating seed OVAs from original distribution ISOs, then doing sub-builds off that base to generate the actual VMs to be deployed — but it supports far more than that.

A “repo2packer” might make a lot of sense.
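To make the “repo2packer” idea concrete, a minimal packer template might look like this sketch. The builder fields and provisioner contents are placeholders; a real template would run the repo2docker-generated install steps in the provisioner:

```json
{
  "builders": [
    {
      "type": "googlecompute",
      "project_id": "my-project",
      "source_image_family": "ubuntu-1804-lts",
      "zone": "us-central1-a",
      "image_name": "repo2packer-demo"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "sudo apt-get update",
        "echo 'install the repo2docker-built environment here'"
      ]
    }
  ]
}
```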


This sounds pretty cool. I’ll have to look into it.


My friend also just told me about a project called Caliban, which I have been looking at.

I like the idea @hamel! I think one place this would have a real use case is integrating it into a scheduler like Airflow, to make an operator that can launch your repo and run a notebook in the correct dynamic environment. This would help automate pairing the dependencies with the notebook to be executed, without a user needing to know as much infrastructure design.

As for security, I’d expect you’d run into basically the same set of requirements that the binder team hit. @willingc might know the right person to discuss that subject with.

A “repo2packer” might make a lot of sense.

That’s got some potential for sure for VM-sourced development.


Thanks @MSeal, that sounds like an intriguing idea. I would like to explore the simpler “repo2vm” idea before moving on to hooks into pipelines and such.

As you mentioned, I am pretty paranoid about security, and I’m still exploring the best tradeoffs there, per your earlier advice.


I’ve just written a repo2docker extension, repo2shellscript:

It takes the intermediate Dockerfile and other files created by repo2docker, converts the Dockerfile into a bash script, and returns a folder containing the script and associated files. This folder can be copied into an Ubuntu 18.04 VM and used to install the environment. I reckon it shouldn’t be too difficult to get this into packer; if I have time I’ll have a go :slightly_smiling_face:.

It relies on this repo2docker PR which defines an abstract interface for the container engine used by repo2docker:

repo2shellscript simply replaces the (docker) build command with the script-creation steps.
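To illustrate the kind of translation involved (a simplified illustration of the idea, not repo2shellscript’s actual output):

```shell
# Dockerfile directives map onto plain shell roughly like this:
#   FROM ubuntu:18.04           -> start from an Ubuntu 18.04 VM instead
#   ENV APP_BASE=/srv           -> export APP_BASE=/srv
#   RUN apt-get install -y git  -> apt-get install -y git
#   COPY src dest               -> cp -r src dest
export APP_BASE=/srv
echo "APP_BASE is ${APP_BASE}"
```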
