I wanted to share some stuff I’ve built and lessons learned regarding the use of Jupyter notebooks as part of a larger research project. Namely, how to keep things reproducible (with a single command) while retaining the interactivity benefits that notebooks provide.
That’s a nice project! I like the way you integrate the code, data, and setup into the workflow specification. You have pinned down the ML setup in a flexible, user-specific way using Docker and/or Conda/venv. Only few have realized that the “open setup” is (besides code and data) a growing issue for ML experiments, and that solutions are therefore needed!
I’ve recently published a research article (the preprint is on ResearchGate; I’m still waiting for the official DOI) in which I (1) searched for and compared solutions for reproducible machine learning and (2) proposed a framework of our own.
We compared the solutions based on their suitability for flexible Deep Learning experiments (GPU support, OS-level isolation, taggable image versions, flexibility in languages and libraries, support for custom builds, and IDE integration).
The proposed solution is GPU-Jupyter, which meets these criteria well. It offers a very robust and flexible image for Deep Learning. Moreover, it allows custom builds and makes the whole customized setup reproducible in a single line.
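To give an idea of what that single line looks like in practice, here is a minimal sketch (the tag is only an example; check Docker Hub for the current ones):

```bash
# Minimal sketch: start GPU-Jupyter with GPU access and mount a local data directory.
# The tag is only an example; see cschranz/gpu-jupyter on Docker Hub for current tags.
docker run --gpus all -d -p 8888:8888 \
  -v "$(pwd)/data:/home/jovyan/work" \
  cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04
```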
However, the integration of data and the workflow is not strictly defined, mainly because engineering needs more flexibility. For classical ML experiments, we suggested referring to a codemeta.json as part of the FAIR4RS principles (see the demo repository on GitHub under iot-salzburg/reproducible-research-with-gpu-jupyter/blob/main/codemeta.json; I’m not allowed to post more than two links). This codemeta.json defines the Docker image with its additional installations to some degree, but I think this aspect is currently underrepresented.
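To sketch what I mean (this is a simplified, hypothetical example using standard CodeMeta properties, not the actual content of the file in the repository):

```json
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "reproducible-research-with-gpu-jupyter",
  "runtimePlatform": "Docker image based on cschranz/gpu-jupyter",
  "softwareRequirements": ["torchsummary"]
}
```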
I’ve seen that calkit allows specifying the Docker image. Therefore, a combination with our solution could be very interesting, and better than sticking to the codemeta.json specification.
What do you think about this quasi-standard?
How do you define whole data-science projects in calkit when they require a directed graph of preprocessing steps?
Btw, I stick to Jupyter’s Docker Stacks in GPU-Jupyter to provide a similar UI. I have already posted a question regarding GPU support here: Proposal: GPU-Support.
GPU-Jupyter looks super interesting. Workflows that use both notebooks and GPUs are becoming more prevalent, and the complexity will certainly make reproducibility more difficult.
I definitely think a combination of Calkit and GPU-Jupyter would be interesting to explore. Pre-processing steps are defined as additional stages in the pipeline. These can be notebooks, scripts, or even shell commands, and their outputs can be defined as inputs to other stages to form a DAG.
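To sketch the idea, here is roughly the kind of DVC pipeline a Calkit project compiles down to, with two preprocessing stages feeding a training stage (the stage and file names here are hypothetical):

```yaml
# Hypothetical compiled dvc.yaml: outputs of one stage are declared as deps of the next, forming a DAG.
stages:
  clean-data:
    cmd: python scripts/clean_data.py
    deps:
      - scripts/clean_data.py
      - data/raw
    outs:
      - data/clean
  extract-features:
    cmd: python scripts/extract_features.py
    deps:
      - scripts/extract_features.py
      - data/clean
    outs:
      - data/features
  train:
    cmd: python scripts/train.py
    deps:
      - scripts/train.py
      - data/features
    outs:
      - models/model.pt
```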
Also allowing non-sequential preprocessing brings very important flexibility and is an advantage over many existing solutions.
Yes, this example project is a good way to try GPU-Jupyter for reproducibility (the README will be shortened soon). I would be very glad to see a demo that uses GPU-Jupyter as the base image within calkit. I expect that both the base image plus a torchsummary installation and the custom image https://hub.docker.com/repository/docker/cschranz/reproducible-research-with-gpu-jupyter can be used in calkit, right?
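For the first case, I imagine something like the following Dockerfile (the tag is only an example):

```dockerfile
# Hypothetical Dockerfile: extend the GPU-Jupyter base image with torchsummary.
# The tag is only an example; see cschranz/gpu-jupyter on Docker Hub for current ones.
FROM cschranz/gpu-jupyter:v1.5_cuda-11.6_ubuntu-20.04
RUN pip install --no-cache-dir torchsummary
```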
I would also like to try calkit this week to better understand it.
Thanks @petebachant, I’ll check it out later this week.
What makes me wonder is that (some of) the resulting errors are different, even though only the notebook package was changed. Do you know why this is the case?
Additionally, there seem to be a lot of redundant files within the .calkit directory. This is necessary for version-controlling the state of the setup under which notebooks and code are executed, if I understand correctly, right?
I will have to look in more detail at how to compile the project using calkit. I may ask you about this point later.
I assume this non-reproducibility comes from PyTorch, but I’m not sure. Maybe it’s worth running it on different machines.
That’s right. Since Calkit compiles a DVC pipeline, which relies on files to determine stage staleness, Calkit generates a cleaned version of the notebook so the inputs can be isolated. It also generates some executed versions to save as artifacts. Whether or not they get committed to version control is configurable, however. For example, you can set the storage settings in the pipeline stage to null and those files will not be committed.
I eventually had time to dive into calkit. That’s an impressive and comprehensive software project you’ve created! It took quite some time to get into it, but I guess it would be much faster for subsequent projects. I think a hands-on video tutorial could help a lot of users.
Overall, I think calkit is very interesting, and it pins down each step needed to reproduce work with (IMHO) the minimal effort required, both for reproduction and for making existing work reproducible.
Supporting the CUDA drivers in Docker in an OS-agnostic way would be an important point for me, because without a GPU the calkit run command would take very long for deep learning projects.
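For reference, the host currently needs the NVIDIA Container Toolkit (i.e., Linux, or Windows via WSL2) for GPU passthrough to work; a quick smoke test looks something like this (the CUDA image tag is only an example):

```bash
# Quick check that Docker can access the GPU via the NVIDIA Container Toolkit.
# The CUDA image tag is only an example.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```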