I feel like all of the components are in place in the Jupyter ecosystem to enable something close to a full publishing pipeline for reproducible computation. These are the major pieces I can think of:
Jupyter Notebooks (.ipynb or text-based notebooks) for storing the content + computation + (optionally) results
JupyterHub / BinderHub (for providing environments where people can interact with the computation, or providing kernels that can power visualizations remotely)
Jupyter Book (for providing more “publication-ready” documents and syntax, such as citations and figures, and for providing a lightweight way to share content with others)
JupyterLab (for authoring notebooks and interacting with them)
GitHub Actions (to automate some of the connections between services, and manage a review pipeline on GitHub)
From a user’s perspective
I want to be able to do the following:
Work on my notebook in a JupyterLab interface, author my computation in that notebook along with rich syntax for publishing (figures, etc.)
In one or two clicks, publish that notebook so that:
Others can quickly read what I created, in a beautiful and “publication-style” form
Others can quickly discover what I created, potentially along with creations from others on a hub
Others can quickly interact with what I created on the infrastructure where I created it.
Others can quickly comment on what I created to provide feedback.
Others can find a DOI for my notebook
I think that we have the building blocks needed to do much of this:
author: JupyterLab (with improvements to Lab markdown so that some publication-relevant MyST markdown syntax is supported). We could also leverage the recent myst-js effort for multi-document authoring.
discover: something like the Jupyter Book gallery or the Pangeo Gallery
interact: either thebe integration via Jupyter Book, or links that connect back to a JupyterHub / BinderHub (see the config sketch after this list)
comment: a commenting system built into the published page, like Hypothesis in Jupyter Book, or relying on a service like Curvenote
DOI: connect to one of the DOI minting services like Crossref
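For the "interact" piece, Jupyter Book can already add Thebe / Binder launch buttons through its _config.yml. A minimal sketch, assuming the book source lives in the current directory; the BinderHub URL and repository below are placeholders, not a specific deployment:

```bash
# Minimal sketch: enable Thebe / Binder launch buttons for a Jupyter Book by
# appending launch settings to its _config.yml. The repository and BinderHub
# URL are placeholders.
cat >> _config.yml <<'EOF'
repository:
  url: "https://github.com/example-org/example-book"   # placeholder repo
launch_buttons:
  thebe: true                           # run code cells in the page via Thebe
  binderhub_url: "https://mybinder.org" # BinderHub that backs the sessions
EOF
```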
Development needed
I think there are two kinds of development we’d need:
Focused development within each tool, improving it in targeted ways (e.g., MyST support in JupyterLab).
“Glue” development focused on building better connections between these tools (e.g., a JupyterLab extension to publish a notebook as a Jupyter Book on GitHub, or in a gallery).
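To make the "glue" point concrete, here is a sketch of the manual pipeline that a one-click publish extension could wrap; the book directory is a placeholder:

```bash
# Sketch of the manual steps a "publish this as a book" button could automate:
# build the book, then push the rendered HTML to a gh-pages branch.
# The book directory (mybook/) is a placeholder.
pip install jupyter-book ghp-import
jupyter-book build mybook/              # renders HTML into mybook/_build/html
ghp-import -n -p -f mybook/_build/html  # commit and push the HTML to gh-pages
```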
People needed
I think we’d need the following roles (covered by one or more people):
A JupyterHub/Binder person with knowledge about the infrastructure + cloud
A JupyterLab person to help with the interface and extension updates
A Jupyter Book person who understands MyST markdown and some of the publishing features in Jupyter Book. They’d either be a Sphinx/Python person (if Jupyter Book is the focus) or a JavaScript / front-end / publishing person (if MyST-JS is the focus)
One or more use cases for publishing that we can use to design the tech + process around
A person with relevant background in publishing use-cases to notice edge-cases and blind spots we haven’t considered (e.g. a librarian, publishing expert, etc)
Just wanted to get this idea written down to see what others think! Perhaps this is the kind of thing a group of us can raise some $$$ around building? Would love to know the reaction of others to this idea!
Thanks to @rabernat for helping me think through some of these ideas. I’ll continue to edit this post if my thinking evolves!
… involves developing the accompanying user services and engagement needed to attract, nurture, and grow a robust research community that is actively involved in determining directions for the infrastructure as well as management of the infrastructure
We are currently working on that for our platform, https://www.neurolibre.com/, where submissions are rendered using Jupyter Books. Our focus now is to build the Jupyter Book and enable a preview on our infrastructure.
The direction I am taking at the moment is to build the Jupyter Book using a postBuild file, so the build artifacts live inside the Docker image (book artifacts are not rebuilt if a new environment is spawned and the source does not change). The rendering happens through the JupyterHub interface.
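A rough sketch of that kind of postBuild file (the book source path is an assumption, not the actual repository layout):

```bash
#!/bin/bash
# postBuild: repo2docker runs this once at image build time, so the rendered
# book is baked into the Docker image rather than rebuilt per session.
# The content/ path is an assumption about where the book source lives.
set -e
pip install jupyter-book
jupyter-book build content/
# The built HTML now lives in content/_build/html inside the image.
```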
Check this WIP: GitHub - ltetrel/nha2020-nilearn.
This does not work when loading index.html:
403 : Forbidden
Blocking request from unknown origin
But it works when loading the individual HTML files.
The image is also quite big and takes too much time to push to DockerHub.
At this stage, the Jupyter Book does not exist and is not hosted anywhere. The idea is to build it on our end and host it temporarily on BinderHub (our test server) so the user can have a preview of their submission.
Thanks for getting this conversation going.
I think this is a fantastic idea, and I agree the Jupyter ecosystem now has the necessary pieces for an amazing publishing platform.
@ltetrel already mentioned that we have been working on a project in that direction for a couple of years now, called NeuroLibre, which should be open for submissions very soon. In short, NeuroLibre is going to be a preprint service for neuroscience that hosts Jupyter Books and associated data. The point @ltetrel brought up about NeuroLibre is very technical and relates to an issue we are having at the moment, but I would like to share some more general thoughts as well.
My experience with mybinder.org is that start-up time (and start-up success) is quite variable. I think it is an amazing service for the community given that anyone can start a pod, and the level of reliability is great given the low bar to entry. For a publishing platform, though, there would need to be more reliability. This should be possible because the review process will limit the number of submissions, but it may require some technical development too. This gets me to my next point.
At least in my field, papers rely on fairly large datasets, on the order of tens of GB. Downloading these data is a major limiting factor for quickly reproducing an analysis. @ltetrel created a mechanism for NeuroLibre to build a local cache of data at build time (GitHub - SIMEXP/Repo2Data: Automatic data fetcher from the web). We also have a local Docker registry (integrating this into Jupyter was discussed at some point). This means we can reliably spin up a pod with all the necessary data in a matter of seconds. But it also means we need to store a lot of data long-term in the cloud, which again gets me to my next point.
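For illustration, the general shape of that build-time caching pattern looks something like the sketch below; the URL and target directory are placeholders, and this is not Repo2Data's actual interface:

```bash
#!/bin/bash
# Illustration of caching data at image build time (e.g. from a postBuild-style
# script) so every pod spawned from the image already has the data on disk.
# The URL and target directory are placeholders, not Repo2Data's interface.
set -e
DATA_DIR=./data/example-project
mkdir -p "$DATA_DIR"
wget -q -O "$DATA_DIR/dataset.tar.gz" "https://example.org/dataset.tar.gz"
tar -xzf "$DATA_DIR/dataset.tar.gz" -C "$DATA_DIR"
```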
We decided to rely on a local cloud. First, we tried to use the public Canadian high-performance computing infrastructure, but we ran into reliability issues. Recently, we moved to being hosted at McGill University, which decided to support our IT through the Canadian Open Neuroscience Platform (CONP), which is also funding NeuroLibre. Cancer Computing donated a fairly large number of servers to this effort, which is now set up with OpenNebula, Kubernetes, and JupyterHub. This solution is not perfect (we haven’t gotten Terraform to work and the setup requires some manual intervention), but it works. Note that we’ve been creating some documentation along the way, even if there is some catch-up to do (NeuroLibre — NeuroLibre v0.1 documentation). The rationale for using this type of infrastructure rather than a commercial cloud is data hosting. Even though our cloud is not hosted on Compute Canada, we are connected to them over a high-speed link (I believe a 100 Gb/s line), and they have the capacity to host tape storage very cheaply, at prices that could not be matched by commercial providers (at least when I did my price analysis a few years back). As this infrastructure is built and maintained as part of a national and university investment anyway, I think it’s an excellent solution in terms of sustainability for data and compute hosting.
For submission and review (or, in our case, technical screening), we first tried working purely with GitHub Actions and piloted an entire system that way. @emdupre eventually convinced us to build on top of the system used by the Journal of Open Source Software (JOSS) instead. The main rationale is to contribute to an existing, successful project rather than start something new. I was skeptical about the ease of adopting their system; Elizabeth responded by basically writing the JOSS installation instructions for them. The system is up and running, and we are now trying to extend their build system to include the Jupyter Books, and not just a PDF. Building a Jupyter Book with a lot of data is time-consuming, and we need this service to be hosted by our BinderHub instance. @ltetrel’s comment was related to that (hopefully last) development needed to get everything working.
Those are just thoughts, and I realize you may disagree with some of the design decisions we made, in particular relying on an academic cloud. One last point: the NeuroLibre manifesto includes contributing upstream as a founding principle (in particular to Jupyter and JOSS). So the NeuroLibre team will be happy to contribute to the Jupyter publishing platform as much as possible.
Regarding funding, NeuroLibre is in the process of renewing CONP and just applied to a Wellcome fund. This could have worked for the Jupyter publishing platform, but that deadline has passed. I would be happy to help in any way I can to get the Jupyter publishing infrastructure funded.
@pbellec thanks so much for this detailed synopsis of NeuroLibre’s experience so far! (I can’t believe I forgot to ping you in this thread in the first place, so I’m glad you and @ltetrel found it!).
The commercial cloud issue you bring up is really important - I still don’t know the right answer there. I wonder if @rabernat has thoughts from the geo community as it pertains to building publishing pipelines.
I would be interested in brainstorming funding opportunities that could go towards this. I think many scholarly communities would benefit from a pattern that they could follow for their infrastructure!
First, we have a bash file that is responsible for getting the proper metadata; if the checks pass (a config exists and/or the book is already built), the jupyter-book build process is launched.
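A minimal sketch of what such a check-then-build script can look like (file names and paths here are assumptions, not the actual jb_build.bash):

```bash
#!/bin/bash
# Sketch of a check-then-build script: only run the (slow) jupyter-book build
# when a book config exists and the HTML has not already been built.
# BOOK_DIR and the exact checks are assumptions, not the actual NeuroLibre script.
BOOK_DIR="${BOOK_DIR:-$HOME/book}"
if [ ! -f "$BOOK_DIR/_config.yml" ]; then
    echo "No Jupyter Book config found; nothing to build." && exit 0
fi
if [ -d "$BOOK_DIR/_build/html" ]; then
    echo "Book already built; skipping." && exit 0
fi
jupyter-book build "$BOOK_DIR"
```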
This process runs just before a user starts their session, which is possible thanks to the latest JupyterHub changes:
The cluster can then be upgraded with: sudo helm upgrade (...) --set-file jupyterhub.singleuser.extraFiles.jb_build.stringData=./jb_build.bash (...)