An unfolding story of my first contribution to repo2docker

consideRatio · April 19, 2019, 7:32pm

Session 1 - 19 April

Dear rubber duck
Have you ever thought that it was helpful to speak to someone about something, even though the other person did not say much? I don’t have anyone around to be that person right now, and I don’t own a rubber duck, so I figured I’ll write to you in this forum!

It is my hope that by documenting this process I may provide some insight into what contributing to an open source project can look like.

Defined my goal: to make mybinder.org / repo2docker support pipfiles
I’m starting out on a journey to solve a problem that I really want solved. I want to make mybinder.org able to understand how to use the Python package dependency files named pipfile and pipfile.lock. I want this as I’ve found myself, twice or more, about to suggest that the authors of a repo with jupyter notebooks also add a MyBinder.org badge, only to spot the pipfiles.

MyBinder.org currently understands how to use environment.yml and requirements.txt python package dependency files, but not the pipfile’s. Getting MyBinder.org to support these pipfile’s is really a question of making repo2docker support them though, so that is where I’ll work - towards repo2docker!

This work will be an attempt to close issue #174! (ping: @yuvipanda @minrk @choldgraf @jezcope @jzf2101 @trallard @Madhu94 @betatim)

Found CONTRIBUTING.md
I’ve already got started and read the README.md file of repo2docker but there was nothing on how to get started with a contribution in the file itself. But, I spotted the CONTRIBUTING.md file! I read it through and picked up on how to setup a local development environment.

Read more documentation
But, as repo2docker is quite new to me, I figured I’ll avoid a past mistake of running into issues I could have avoided by simply reading a bit of the documentation ahead of time.

What I learned
Repo2Docker will inspect a git repository and ask its buildpacks, in a specific order, if they can figure out how to create a Dockerfile for the repository; this is the Detection phase. We need to add code to detect pipfile or pipfile.lock in an existing buildpack or create a new one.
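The detection step for the pipenv files could be sketched roughly as below. This is a minimal illustration and not repo2docker's actual code: the helper name `detects_pipfile` is hypothetical, and checking only the repository root is an assumption made to keep the example short.

```python
import os
import tempfile

# Hypothetical helper, not part of repo2docker: report whether a
# repository directory contains a Pipfile or a Pipfile.lock.
PIPENV_FILES = ("Pipfile", "Pipfile.lock")


def detects_pipfile(repo_dir):
    """Return True if either pipenv file exists directly in repo_dir."""
    return any(
        os.path.exists(os.path.join(repo_dir, name)) for name in PIPENV_FILES
    )


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as repo:
    before = detects_pipfile(repo)  # nothing in the directory yet
    open(os.path.join(repo, "Pipfile"), "w").close()
    after = detects_pipfile(repo)   # a Pipfile now exists
    print(before, after)
```

Finding either file is enough for the function to say yes, which matches the idea that one buildpack handles both files.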

Questions!
At this point, I better write down some of the questions I’ve ended up with before I lose track of them. I’d love to get your help with input about them!

Question 1: Should I add a buildpack or augment one?

Hmmm… I think I should add one, but I’m a bit confused… I saw fewer than expected in the repository code base, one named conda but none seemed associated with requirements.txt. Perhaps it’s part of the conda buildpack? Hmmm…

OK - Session 2: I’m quite confident I should augment the logic in the PythonBuildPack now.

Question 2: If we add a buildpack, it should be put in the ordered list for the detection phase, but at what position would make sense?

Hmmm… I think this is a question, along with Q1, that could be answered by those who have contributed a lot to the project already, if I ask them.

OK - Session 2: No longer a relevant question due to not adding a buildpack.

Question 3: What makes sense when finding the various combinations of pipfile.lock and pipfile?

OK - Session 1: Oh I think I got this one myself after simply writing it down! I think if we find either one of these, we will let pipenv install do the job for us! I think pipenv install will use the lock-file if there is one, or use the less tightly pinned packages from the pipfile if there is no pipfile.lock to be found. So, pipenv install will solve the logic for us, we just need to find either one of these files I think.

OK Correction - Session 3: pipenv install will work on the Pipfile while pipenv sync will work on the Pipfile.lock. So, let’s prioritize the locked file and the sync command and follow up with the install command if there is none.

OK Correction - Session 3: I use the pipenv install command no matter what in order to be able to use the --system flag that isn’t available in the pipenv sync command. The install command can accomplish the same thing if passed two additional parameters: --ignore-pipfile and --deploy after having created a Pipfile.lock if there were none.
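The approach described in this last correction can be sketched as a small function that produces the shell commands to run. It is an illustrative sketch rather than the code in the PR: `pipenv_commands` is a hypothetical name, and the only flags used are the ones named above (`--system`, `--ignore-pipfile` and `--deploy`).

```python
def pipenv_commands(has_lockfile, pipenv="pipenv"):
    """Return the pipenv commands to run, in order.

    has_lockfile: whether the repository already contains a Pipfile.lock.
    pipenv: name or path of the pipenv executable (an assumption here).
    """
    commands = []
    if not has_lockfile:
        # Create a Pipfile.lock from the Pipfile first.
        commands.append("{} lock".format(pipenv))
    # Install what the lock file pins, without creating a virtualenv.
    commands.append(
        "{} install --ignore-pipfile --deploy --system".format(pipenv)
    )
    return commands


for command in pipenv_commands(has_lockfile=False):
    print(command)
```

Generating the lock file first means the install step is the same whether or not the repository shipped a Pipfile.lock.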

Question 4: What should we do if we find a combination of environment.yml / requirements.txt / pipfile?

Hmmm… I think this relates closely to Q2.

OK - Session 2: We should only care about environment.yml, but if there was no such file but requirements.txt and Pipfile or Pipfile.lock then we should ignore requirements.txt I think.
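The precedence described in this answer can be written down as a short function. This is only a sketch of the rule as stated: the function name `choose_dependency_file` is hypothetical, and preferring Pipfile.lock over Pipfile when both exist is an assumption borrowed from the answer to Question 3.

```python
def choose_dependency_file(filenames):
    """Pick the dependency file that wins, or None if there is none."""
    present = set(filenames)
    # environment.yml wins over everything else.
    if "environment.yml" in present:
        return "environment.yml"
    # Either pipenv file beats requirements.txt; prefer the lock file.
    for name in ("Pipfile.lock", "Pipfile"):
        if name in present:
            return name
    if "requirements.txt" in present:
        return "requirements.txt"
    return None


print(choose_dependency_file(["requirements.txt", "Pipfile"]))
print(choose_dependency_file(["environment.yml", "Pipfile", "requirements.txt"]))
```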

Session 2 - April 20

I’ve set up a developer environment and solved a minor challenge along the way that I documented in a post below as something to fix at some point. For now though, I want to progress towards the goal and not get stuck, so I wrote it down and continued.

I’m looking into the source code trying to understand how things work as best as I can. I realize I need a better understanding of the buildpacks in place. So, I’m starting to write down some overview about them. Perhaps I can answer Q1, if I should add a new buildpack or augment one.

Overview of the detect() function of the buildpacks

The ordering of the buildpacks detect functionality goes as follows:

LegacyBinderDockerBuildPack, will detect a Dockerfile with a FROM andrewosh/binder-base statement.
DockerBuildPack, inherits from BuildPack, will detect a Dockerfile.
JuliaProjectTomlBuildPack, inherits from PythonBuildPack, will detect either Project.toml or JuliaProject.toml.
JuliaRequireBuildPack, inherits from PythonBuildPack, will detect a REQUIRE file and requires a Project.toml to not be found.
NixBuildPack, inherits from BaseImage > BuildPack, will detect a default.nix file.
RBuildPack, inherits from PythonBuildPack
CondaBuildPack, inherits from BaseImage > BuildPack, detects environment.yml
PythonBuildPack, inherits from CondaBuildPack, detects python in runtime.txt, setup.py in root folder, and requirements.txt.

Hmmmm, leaning towards the idea of augmenting the PythonBuildPack, I think pipenv files compete with requirements.txt files and PythonBuildPack is working with them.

I learned about the test setup

The tests folder contained a conftest.py file that had a useful docstring!

Each directory that has a script named ‘verify’ is considered
a test. jupyter-repo2docker is run on that directory,
and then ./verify is run inside the built container. It should
return a non-zero exit code for the test to be considered a
success.
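To picture the first sentence of that docstring, here is a rough sketch of how directories containing a verify script could be found. It is not the code in conftest.py, just an illustration: the function name `find_verify_tests` and the example folder names are made up.

```python
import os
import tempfile


def find_verify_tests(tests_root):
    """Return the sorted directories under tests_root that hold a 'verify' file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(tests_root):
        if "verify" in filenames:
            found.append(dirpath)
    return sorted(found)


# Demonstration with a throwaway tests folder (folder names are invented).
with tempfile.TemporaryDirectory() as root:
    with_verify = os.path.join(root, "venv", "example-a")
    without_verify = os.path.join(root, "venv", "example-b")
    os.makedirs(with_verify)
    os.makedirs(without_verify)
    open(os.path.join(with_verify, "verify"), "w").close()
    print(find_verify_tests(root))
```

Only the folder that contains a verify file is reported, so adding a new test comes down to adding a new folder with such a script.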

That is excellent! I figure why not start out by creating some tests, that way I’d define the functionality I want to achieve and can communicate that to the maintainers of the repo through concrete code as well!

Test 1 - Stub done: I want pipfile or pipfile.lock to take precedence over a requirements.txt file.
Test 2 - Stub done: I want an environment.yml to take precedence over a pipfile or pipfile.lock, for the same reason I imagine it takes precedence over a requirements.txt file: conda can install more than pip/pipenv can, so we should not limit ourselves.
Test 3 - Stub done: I want pipfile or pipfile.lock to take the same kind of precedence as requirements.txt over setup.py. Oh… I learned now that setup.py is installed after requirements.txt anyhow. I also found no test associated with setup.py. Let's make this test anyhow at some point, where setup.py is verified to be installed after the pipenv installation.

I made a [WIP] PR
I submitted a [WIP] PR to jupyter/repo2docker! See #649.

Session 3 - April 21

I got a basic idea of how things work and I have created some tests that should succeed, along with all the other tests that should still not fail while doing that.

I first ran a single test to verify I could do that.

# run a specific test and get lots of output
pipenv run pytest -s tests/venv/pipfile/environment-yml/

It worked out great and I could understand clearly that a Dockerfile was created, built, and tested. This takes quite a while. So, I decided to run all tests so I could cache a lot of work.

# lets run all tests to cache a lot of work for the future
pipenv run pytest

Questions!

Question 5: Should we use pipenv install --dev or pipenv install by default?

Hmmm… I think --dev currently should be added, but I’m not sure.

OK: I decided to use --dev flag.

Question 6: pipenv install will do nothing for us unless we enter the environment as well I think, hmmm… One could also make pipenv install install things without a virtual environment as the Dockerfile kinda is one anyhow and it would reduce potential complexity down the road I think. Okay so the question becomes: should I install a virtual environment and enter it with pipenv shell, or should I make the pipenv install install things directly, which I think we can make it do but I don’t know right now how.

Hmmm… I’ll look and learn from how things have been done for other buildpacks such as the conda and python (also referred in tests as venv) buildpacks.

OK: In this repo2docker code section I notice the answer should probably be to use a specific pip binary to do the install, or at least I realize we should avoid the complexity of doing pipenv shell or similar to enter an environment.

OK Correction: Apparently entering pipenv shell wasn’t easy from a Dockerfile, so by using --system and --python we install it directly.

Question 7: How to make pipenv install not in a virtual environment, but instead use a specific pip binary to install things?

Hmmm… I should read up a lot on the command line options for pipenv.

Hmmm… Multiple options show up on how to use.

Option 1 - Generate a requirements.txt: I could let pipenv generate a requirements.txt file and use the pre-existing system within repo2docker to manage these. I would need to look out for all interactions with such a file though. I’m specifically cautious to not overlook something relating to how python versions are managed. I recall reading some code about extraction of a python version.

Option 2 - Specify a python executable: What would it mean to use the --python flag?

Option 3 - Specify eh… --system? What would it mean to use the --system flag? I don’t think this is relevant to us. I think this influences the choice of having packages installed at the user / system level, but I’m not especially confident about these aspects.

I’ll need to decide on option 1 or 2 I think.

Hmmm… If we generate a requirements.txt file we may deviate from expected behavior where .env files are loaded, and perhaps also something relating to the Python version. Perhaps we need to choose option 2 to do this properly because with option 1 that won’t happen.

OK: I’m going with option 2. This may be less straightforward but it should be the most robust solution long term I think.

OK Correction: I’m going with option 3. See Question 6’s final entry.

Question 8: How is the python_version() function used, and should I adjust something now that Pipfile’s are introduced, since they can potentially specify Python versions? I also know that requirements.txt can include python=3.7 statements etc. How is that different from using runtime.txt for repo2docker etc?

Question 9: I notice that a Pipfile can explicitly install a package with setup.py. So, should we really have a logic that installs setup.py after what was installed with Pipfile?

OK: I decided to enforce the logic that if you have a setup.py file and also a Pipfile, the Pipfile needs to have imported the local package like this, where the local dummy package is installed.
[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[packages]
there = "*"
dummy = {path=".", editable=true}

Question 10: I found $NB_PYTHON_PREFIX and $KERNEL_PYTHON_PREFIX within the code and now understand that the python environment that starts up the notebook server is one, and the actual environment that the Python kernel to be used within it will or at least can be another one. In the scripts I’ve seen pip being invoked in three different ways and I’m now lost. What are the differences between the three pips?

${KERNEL_PYTHON_PREFIX}/bin/pip

${NB_PYTHON_PREFIX}/bin/pip

pip

Hmmm… Is the third option simply the same as one of the others? Where should pipenv be installed?

Question 11: Why are we installing this version of pip?

       elif os.path.exists(requirements_file):
           assemble_scripts.append((
               '${NB_USER}',
               'pip install "pip<19" && ' + \
               '{} install --no-cache-dir -r "{}"'.format(pip, requirements_file)
           ))

Question 12: In which of the two or three python environments does it make sense for me to install pipenv?

Question 13: If we use --system and not pipenv shell etc, we won’t get the benefits of loading the .env right? Perhaps we can do an additional plug for this? See: Automatic loading of .env.

Session summary
I worked a lot with defining the tests and struggled a while with the setup.py tests as I got very confused about being able to import a local package even though it wasn’t installed. But, it was because it was locally available but it did not really get installed with dependencies etc. So when I figured out I could check to see if it got a dependency installed as well things turned around.

I also spent a lot of time figuring out how to actually do the pipenv install part and get packages to be detected in the right environment. Now everything seems to work though, I added commits up to 3397068 in #649!

I think the key part that remains relates to Python versions.

Session 4 - Evening April 21

The goal is to start learning about pinning Python versions. I added a test to install Python 3.5 to get started. I quickly concluded that the test failed, and I got warnings about not having Python 3.5 etc. I remembered reading that if we have PyEnv installed, things may be managed for us, so I set out to install that and see what happens.

Installing PyEnv isn’t trivial.
GitHub - pyenv/pyenv: Simple Python version management
We need various apt-get dependencies:
Common build problems · pyenv/pyenv Wiki · GitHub

Questions!

Question 14: Where should we install pyenv?

Hmmm… Various files have been put in /tmp I’ve noticed.

Question 15: What apt-get packages are already installed and which need adding?

Question 16: Where should I install these apt-get build dependencies for PyEnv?

Question 17: Should I use pyenv or resort to overriding python_version() instead?

Hmmm… For now, after realizing the effort of getting pyenv installed, I’ll try overriding python_version() in a similar way that the CondaBuildPack does it. They inspect the environment.yml file and choose a python version based on that.

Question 17: When overriding the python_version() function that normally inspects runtime.txt, what makes most sense to do when both a runtime.txt and a Pipfile with, for example, python_version = "3.5" declared in it are present? Should I prioritize one or the other?

Hmmm… I’m leaning towards overriding runtime.txt with the python_version specified in the Pipfile, I’d like to scream some feedback to the user about this though…

Hmmm… For now I’ll go with ignoring runtime.txt entirely if there is a Pipfile or Pipfile.lock, it is simple.

OK: I went with giving priority to Pipfile.lock, then Pipfile, then runtime.txt.
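That priority order could look roughly like the sketch below. It is not the code from the PR. The function name `pinned_python_version` is hypothetical, the Pipfile is read with a regular expression instead of a real TOML parser to keep the example dependency free, and runtime.txt is assumed to contain a line like `python-3.5`. Falling through to the next file when a file exists but pins no version is also an assumption.

```python
import json
import os
import re
import tempfile


def pinned_python_version(repo_dir):
    """Return a version string such as "3.5", or None if nothing pins one."""
    # 1. Pipfile.lock is JSON; the version sits under _meta -> requires.
    lock_path = os.path.join(repo_dir, "Pipfile.lock")
    if os.path.exists(lock_path):
        with open(lock_path) as f:
            lock = json.load(f)
        version = lock.get("_meta", {}).get("requires", {}).get("python_version")
        if version:
            return version

    # 2. Pipfile is TOML; a regex is a simplification for this sketch.
    pipfile_path = os.path.join(repo_dir, "Pipfile")
    if os.path.exists(pipfile_path):
        with open(pipfile_path) as f:
            match = re.search(r'^\s*python_version\s*=\s*"([^"]+)"', f.read(), re.M)
        if match:
            return match.group(1)

    # 3. runtime.txt, assumed here to look like "python-3.5".
    runtime_path = os.path.join(repo_dir, "runtime.txt")
    if os.path.exists(runtime_path):
        with open(runtime_path) as f:
            match = re.match(r"python-(\d+(?:\.\d+)*)", f.read().strip())
        if match:
            return match.group(1)
    return None


# Demonstration: a Pipfile overrides runtime.txt in a throwaway repo.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "runtime.txt"), "w") as f:
        f.write("python-3.6\n")
    with open(os.path.join(repo, "Pipfile"), "w") as f:
        f.write('[requires]\npython_version = "3.5"\n')
    print(pinned_python_version(repo))
```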

Question 18: What makes things work with py36 but not py35?

Step 40/47 : RUN ${KERNEL_PYTHON_PREFIX}/bin/pipenv lock --python ${KERNEL_PYTHON_PREFIX}/bin/python
---> Running in c7ae30385795
Creating a virtualenv for this project…
Pipfile: /home/erik/Pipfile
Using /srv/conda/bin/python (3.5.5) to create virtualenv…
⠙ Creating virtual environment...Already using interpreter /srv/conda/bin/python
Using base prefix '/srv/conda'
New python executable in /home/erik/.local/share/virtualenvs/erik-zof0I2Qp/bin/python
ERROR: The executable /home/erik/.local/share/virtualenvs/erik-zof0I2Qp/bin/python is not functioning
ERROR: It thinks sys.prefix is '/home/erik' (should be '/home/erik/.local/share/virtualenvs/erik-zof0I2Qp')
ERROR: virtualenv is not compatible with this system or executable

✘ Failed creating virtual environment 
[pipenv.exceptions.VirtualenvCreationException]:   File "/srv/conda/lib/python3.5/site-packages/pipenv/vendor/click/decorators.py", line 17, in new_func
[pipenv.exceptions.VirtualenvCreationException]:       return f(get_current_context(), *args, **kwargs)
[pipenv.exceptions.VirtualenvCreationException]:   File "/srv/conda/lib/python3.5/site-packages/pipenv/cli/command.py", line 319, in lock
[pipenv.exceptions.VirtualenvCreationException]:       ensure_project(three=state.three, python=state.python, pypi_mirror=state.pypi_mirror)
[pipenv.exceptions.VirtualenvCreationException]:   File "/srv/conda/lib/python3.5/site-packages/pipenv/core.py", line 574, in ensure_project
[pipenv.exceptions.VirtualenvCreationException]:       pypi_mirror=pypi_mirror,
[pipenv.exceptions.VirtualenvCreationException]:   File "/srv/conda/lib/python3.5/site-packages/pipenv/core.py", line 506, in ensure_virtualenv
[pipenv.exceptions.VirtualenvCreationException]:       python=python, site_packages=site_packages, pypi_mirror=pypi_mirror
[pipenv.exceptions.VirtualenvCreationException]:   File "/srv/conda/lib/python3.5/site-packages/pipenv/core.py", line 935, in do_create_virtualenv
[pipenv.exceptions.VirtualenvCreationException]:       extra=[crayons.blue("{0}".format(c.err)),]
[pipenv.exceptions.VirtualenvCreationException]: /home/erik/.local/share/virtualenvs/erik-zof0I2Qp/bin/python: error while loading shared libraries: libpython3.5m.so.1.0: cannot open shared object file: No such file or directory

Failed to create virtual environment.

Question 19: Oh… I think I have a bug introduced by not specifying where my Pipfile resides when I do pipenv lock and pipenv install because of the binder folder. I better ensure to specify the file explicitly.

OK: I could confirm that was the case, I added a test that failed as expected. Then I added a commit to fix the test and problem solved!
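One way to be explicit about which Pipfile is meant is sketched below. It is not necessarily how the commit fixed it. The helper names are hypothetical, the rule that a binder folder takes over from the repository root is an assumption for this sketch, and PIPENV_PIPFILE is the environment variable pipenv reads for a custom Pipfile location.

```python
import os
import tempfile


def pipfile_dir(repo_dir):
    """Return the directory expected to hold the Pipfile.

    Assumption for this sketch: when a 'binder' folder exists, the
    configuration files are read from there instead of the repo root.
    """
    binder_dir = os.path.join(repo_dir, "binder")
    return binder_dir if os.path.isdir(binder_dir) else repo_dir


def pipenv_environment(repo_dir):
    """Environment variables that point pipenv at an explicit Pipfile."""
    return {"PIPENV_PIPFILE": os.path.join(pipfile_dir(repo_dir), "Pipfile")}


# Demonstration: the location changes once a binder folder exists.
with tempfile.TemporaryDirectory() as repo:
    print(pipenv_environment(repo))
    os.mkdir(os.path.join(repo, "binder"))
    print(pipenv_environment(repo))
```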

consideRatio · April 19, 2019, 7:33pm

Potential offshoot PRs to repo2docker

When working towards my goal I end up with a lot of insight into my own developer experience (DX) and realize potential improvements to the repo to make it easier for others in the future. But instead of straying from my goal to fix those one at a time, I try to focus on the goal and instead write them down.

Mention CONTRIBUTING.md in README.md

Perhaps we should mention how to get going with a development environment.

Dependency of `semver` not installed with `pipenv install --dev`

After following the installation instructions for pipenv I got the following error when trying to run repo2docker from my virtual environment with pipenv run repo2docker.

erik@xps:~/dev/contrib/repo2docker$ pipenv run repo2docker
Traceback (most recent call last):
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/bin/repo2docker", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3241, in <module>
    @_call_aside
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3225, in _call_aside
    f(*args, **kwargs)
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3254, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/lib/python3.6/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/lib/python3.6/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/erik/.local/share/virtualenvs/repo2docker-MBJmfNIh/lib/python3.6/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'semver' distribution was not found and is required by jupyter-repo2docker

By writing pipenv install semver this error went away.

Minor base.py Dockerfile optimization

In this docker documentation we find the following section:

Official Debian and Ubuntu images automatically run apt-get clean, so explicit invocation is not required.

So, we could remove apt-get -qq clean && \ from four places in the base.py file.

Clarification of buildpacks ordering

I could really use an example to grasp the ordering here.

https://github.com/jupyterhub/repo2docker/blob/746e4d92e0c2c7aa84fcddf3d731b67f7b60e5cd/repo2docker/app.py#L84-L99

          buildpacks = List(
              [
                  LegacyBinderDockerBuildPack,
                  DockerBuildPack,
                  JuliaProjectTomlBuildPack,
                  JuliaRequireBuildPack,
                  NixBuildPack,
                  RBuildPack,
                  CondaBuildPack,
                  PythonBuildPack,
              ],
              config=True,
              help="""
              Ordered list of BuildPacks to try when building a git repository.
              """
          )

I initially thought that LegacyBinderDockerBuildPack was very specific and should override whatever is found later, but then I realized that PythonBuildPack inherits from CondaBuildPack and I got a bit confused. It overrides the detect functionality of CondaBuildPack… Hmmm… Will only one build pack be selected from this list for use? I think so, and the idea of the composability of buildpacks comes from inheritance.

Make a visual overview of configuration logic

Inspired by @leportella’s visual overviews I think it would be useful to have some kind of flow chart or visualization to demonstrate what buildpack does what etc. It took a while to figure out and I’m still not 100%. I got to read more docs and code still.

Optimize tests - test building minimalistic packages with no dependencies

Our tests install various packages for testing, but some are bigger than others. I’ve seen numpy being installed for example. Perhaps we can go with some dummy packages. I looked for such packages but ended up choosing to use the packages requests and there, along with numpy, in my added tests for now. I want to avoid numpy if possible though as I think it can be quite big and slow to resolve relative to other packages.

Optimize CI - ordering of tests

I understand it as various tests are run in parallel, but the order they are executed in could be optimized given the limited number of parallel runners, four I think.

We could optimize it so that the last test to start isn’t also one that takes up most time because then we will end up using four runners for a long time but then in the end only use a single runner for a long time. It would be better to have continuous use of four runners and try to make the most relevant tests be run first and put the quickest and least commonly failing tests last.
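The effect of the ordering can be illustrated with a tiny simulation. The durations below are made up and are not measurements of the real test suite, and the scheduling rule (each test goes to whichever runner frees up first) is an assumption about how the CI behaves. Relevance of the tests is not modelled, only their duration.

```python
import heapq


def makespan(durations, runners=4):
    """Total wall time when each test goes to the first runner that frees up."""
    finish_times = [0] * runners
    heapq.heapify(finish_times)
    for duration in durations:
        # Take the runner that becomes free earliest and give it this test.
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


durations = [2, 2, 2, 2, 2, 2, 2, 2, 8]  # made-up minutes, slow test last
print(makespan(durations))                        # 12
print(makespan(sorted(durations, reverse=True)))  # 8
```

With the slow test started last, three runners sit idle while it finishes. Starting it first lets the short tests fill the other runners in the meantime.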

Clarify the repo2docker repo’s relationship between its `(dev|doc)-requirements.txt` and `Pipfile`

There is no description of how these are to be used together or individually. I ended up confused and spent a while figuring things out.

Perhaps there are about three different scenarios for developers.

The actual developer that wants to get all relevant packages installed for development. (Pipfile that includes both the other requirements.txt files contents but also the package itself).
The CI test pipeline, that only needs what it needs (dev-requirements.txt)
The Docs builder, that only needs what it needs (docs/doc-requirements.txt)

Document docker-compose.test.yml

I don’t know what this file should be used for or is used for, and it is one of various things that leaves a question that could have been answered by at least some inline comment in the file.

(Became part of PR) Add some content generated during `pytest` to .gitignore

I ran all the tests on my computer and ended up with the following remnant files that I don’t want to include by mistake in a commit.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	tests/dockerfile/legacy/._binder.Dockerfile
	tests/dockerfile/legacy/apt-sources.list
	tests/dockerfile/legacy/python3.frozen.yml
	tests/dockerfile/legacy/root.frozen.yml

consideRatio · April 19, 2019, 7:33pm

Placeholder post 2
I wanted to leave some space for myself in the top of the topic.

jhermann · April 19, 2019, 9:46pm

If indeed the README doesn’t mention the contribution docs anywhere, it should.

betatim · April 23, 2019, 8:58pm

This thread is fantastic! I’m still reading and thinking. Below some first thoughts.

I think most of your points in An unfolding story of my first contribution to repo2docker - #2 by consideRatio should become issues or PRs.

We try each build pack in turn from the start of the list, if a build pack detect()s it is chosen and that is the end of the story. So LegacyBinderDockerBuildPack gets tried first and if it says “yes” we stop. Because it doesn’t use any of the other build packs (it’s the legacy one after all) it just does its thing. The DockerBuildPack is similar.
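As a rough sketch of that first-match rule, with stand-in classes rather than repo2docker's real ones: the `BuildPack` class here, its `trigger_file` attribute and the `pick_buildpack` function are all hypothetical and exist only for illustration. Returning None when nothing matches is a simplification.

```python
class BuildPack:
    """Minimal stand-in for a build pack; not repo2docker's real class."""

    def __init__(self, name, trigger_file):
        self.name = name
        self.trigger_file = trigger_file

    def detect(self, repo_files):
        # Say "yes" when the file this build pack cares about is present.
        return self.trigger_file in repo_files


def pick_buildpack(buildpacks, repo_files):
    """Return the first build pack whose detect() says yes, else None."""
    for buildpack in buildpacks:
        if buildpack.detect(repo_files):
            return buildpack
    return None


ORDERED = [
    BuildPack("DockerBuildPack", "Dockerfile"),
    BuildPack("CondaBuildPack", "environment.yml"),
    BuildPack("PythonBuildPack", "requirements.txt"),
]

chosen = pick_buildpack(ORDERED, {"environment.yml", "requirements.txt"})
print(chosen.name)
```

Both files are present in the example, but only the earlier entry in the list gets chosen, which is how the list order turns into precedence.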

The PythonBuildPack is tried after the CondaBuildPack in order to give requirements.txt a lower precedence compared to environment.yml files. However we need to install Python itself which we do via conda, this means the PythonBuildPack inherits from the conda one (to use the assemble and build parts of it).

I think numpy should be quick to install as there are wheels (binary packages) available for most versions of Python on Ubuntu. The package named there is a tiny package we use a lot for speed. Smaller is better. Though we don’t want to start using totally unknown packages as we already have enough flaky tests due to packages being pulled for security reasons or not being updated, or just being unreliable.

New build pack or modify existing one?

I would make a new build pack and place it above the existing Python build packs in the search order. This way Pipfiles take precedence over environment.yml and requirements.txt. I think this is what “everyone” wants.

Another thing to keep in mind is that we should plan for having two environments. One in which the notebook server and notebook extensions are installed, and one in which we install the kernel and dependencies for the repository. We have been using this for Python 2 kernels for a while. It works well as the notebook server can use a modern version while the user’s kernel uses Python 2. This solves the problem that notebook and co aren’t creating releases for legacy Python any more. We are now facing the problem that there are already packages that have stopped making packages for Python 3.5 (and conda-forge has dropped it as well). This means we will need to introduce this “two environments” approach for all Python versions. Worth planning for now. (If you want to discuss this let’s start a new thread as this is a discussion and a half by itself.)

jhermann · April 23, 2019, 11:06pm

If you go for Pipfile, you might want to consider to also include project.toml support.
