Feature Idea: JupyterHub/BinderHub + Jupyter Book as a publishing platform

choldgraf · March 16, 2021, 6:01pm

Background

I feel like all of the components are in place in the Jupyter ecosystem to enable something close to a full publishing pipeline for reproducible computation. These are the major pieces I can think of:

- Jupyter Notebooks (.ipynb or text-based notebooks) for storing the content + computation + (optionally) results
- JupyterHub / BinderHub (for providing environments where people can interact with the computation, or providing kernels that can power visualizations remotely)
- Jupyter Book (for providing more “publication-ready” documents and syntax, such as citations and figures, and for providing a lightweight way to share content with others)
- JupyterLab (for authoring and working interactively with notebooks)
- GitHub Actions (to automate some of the connections between services, and manage a review pipeline on GitHub)

From a user’s perspective

I want to be able to do the following:

- Work on my notebook in a JupyterLab interface, authoring my computation in that notebook along with rich syntax for publishing (figures, etc.)
- In one or two clicks, publish that notebook so that:
  - Others can quickly read what I created, in a beautiful and “publication-style” form
  - Others can quickly discover what I created, potentially along with creations from others on a hub
  - Others can quickly interact with what I created on the infrastructure where I created it
  - Others can quickly comment on what I created to provide feedback
  - Others can find a DOI for my notebook

I think that we have the building blocks needed to do much of this:

- author: JupyterLab (with improvements to Lab markdown so that some publication-relevant MyST markdown syntax is supported). We could also leverage the recent myst-js effort for multi-document authoring.
- discover: something like the Jupyter Book gallery or the Pangeo Gallery
- interact: either Thebe integration via Jupyter Book, or links that connect back to a JupyterHub / BinderHub
- comment: a commenting system built into the published page, like Hypothesis in Jupyter Book, or a service like Curvenote
- DOI: connect to one of the DOI minting services, like Crossref

Development needed

I think there are two kinds of development we’d need:

- Focused development within each tool to improve it in targeted ways (e.g., MyST support in JupyterLab).
  - For example, more improvements to the myst-js and Jupyter Book documentation systems to handle all the major publication cases people want.
- “Glue” development focused on building better connections between these tools (e.g., a JupyterLab extension to publish a notebook as a Jupyter Book on GitHub, or in a gallery).

People needed

I think we’d need the following roles (covered by one or more people):

- A JupyterHub/Binder person with knowledge about the infrastructure + cloud
- A JupyterLab person to help with the interface and extension updates
- A Jupyter Book person who understands MyST markdown and some of the publishing features in Jupyter Book. They’d either be a Sphinx/Python person (if Jupyter Book is the focus) or a JavaScript / front-end / publishing person (if MyST-JS is the focus)
- One or more use cases for publishing that we can use to design tech + process around
- A person with relevant background in publishing use cases to notice edge cases and blind spots we haven’t considered (e.g., a librarian, publishing expert, etc.)

Just wanted to get this idea written down to see what others think! Perhaps this is the kind of thing a group of us can raise some $$$ around building? Would love to know the reaction of others to this idea!

Thanks to @rabernat for helping me think through some of these ideas. I’ll continue to edit this post if my thinking evolves!

HashRocketSyntax · March 17, 2021, 6:19pm

Been meaning to send you this NSF grant about Community Research Infrastructure:
https://www.nsf.gov/funding/pgm_summ.jsp?pims_id=12810&org=CISE&sel_org=CISE&from=fund

… involves developing the accompanying user services and engagement needed to attract, nurture, and grow a robust research community that is actively involved in determining directions for the infrastructure as well as management of the infrastructure

ltetrel · March 17, 2021, 11:39pm

Hi!

We are currently working on that for our platform, https://www.neurolibre.com/, where submissions are rendered using Jupyter Book. Our focus now is to build the Jupyter Book and enable the preview on our infrastructure.
The direction I am taking at the moment is to build the Jupyter Book using a postBuild file, so that the build artifacts live inside the Docker image (book artifacts are not rebuilt if a new environment is spawned and the source does not change). The rendering happens through the JupyterHub interface.
Check this WIP: GitHub - ltetrel/nha2020-nilearn.
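For readers unfamiliar with the postBuild approach, here is a minimal sketch of what such a file could look like. It is not the file from the WIP repository: the `build_book` helper, the `BOOK_SRC` variable and the two guard checks are hypothetical, and the `content` directory is assumed from the build script shared later in this thread. repo2docker runs a file named `postBuild` after installing the environment, so whatever `jupyter-book build` writes ends up inside the image.

```shell
#!/bin/sh
# postBuild sketch (hypothetical): build the book while the image is being built,
# so the rendered HTML is stored in the image itself.

build_book() {
  src="$1"
  # Skip quietly when the repository has no book configuration.
  if [ ! -f "${src}/_config.yml" ]; then
    echo "skip: ${src}/_config.yml not found"
    return 0
  fi
  # Skip when the jupyter-book CLI is not part of the environment.
  if ! command -v jupyter-book >/dev/null 2>&1; then
    echo "skip: jupyter-book is not installed"
    return 0
  fi
  # Writes the rendered site under ${src}/_build/html.
  jupyter-book build "${src}"
}

# "content" is an assumed default location for the book sources.
build_book "${BOOK_SRC:-content}"
```

Because the build call is the last command, a failing book build makes the script exit with a non-zero status. The trade-off mentioned in this post still applies: every built page adds to the image size.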

This is not working when loading the index.html:

403 : Forbidden
Blocking request from unknown origin

But it works when loading the individual HTML files.
The image is also quite big and takes too much time to push to Docker Hub.

That’s why, ideally, I would like to build the book separately (and somehow inject the built artifacts into the user’s running environment), but I cannot do this until this PR moves forward: Optional build volumes and build init_containers, REPO_URL accessible to pod by ltetrel · Pull Request #1081 · jupyterhub/binderhub · GitHub.

Any feedback is also appreciated.

choldgraf · March 18, 2021, 12:08am

Could you try hosting the book on github-pages, and then just embedding it as an iframe in JupyterLab or something? e.g. with: GitHub - timkpaine/jupyterlab_iframe: JupyterLab iframe widget

ltetrel · March 18, 2021, 3:23pm

At this stage, the Jupyter Book does not exist and is not hosted anywhere. The idea is to build it on our end and host it temporarily on BinderHub (our test server) so that users can preview their submission.

pbellec · March 19, 2021, 2:32pm

Hi Chris,

Thanks for getting this conversation going.
I think this is a fantastic idea, and I agree the Jupyter ecosystem now has the necessary pieces for an amazing publishing platform.

@ltetrel already mentioned that we have been working on a project in that direction for a couple of years now, called NeuroLibre, which should be open for submissions very soon. In short, NeuroLibre is going to be a preprint service for neuroscience that hosts Jupyter Books and associated data. The point @ltetrel brought up on NeuroLibre is very technical and relates to an issue we are having at the moment, but I would like to share some more general thoughts as well.

- My experience with myBinder is that start-up time (and start-up success) is quite variable. I think this is an amazing service for the community given that anyone can start a pod, and the level of reliability is great given the low bar to entry. A publishing platform, however, would need to be more reliable. This should be possible because the review process will limit the number of submissions, but it may require some technical developments too. This gets me to my next point.
- At least in my field, papers rely on fairly large datasets, on the order of tens of GB. Downloading these data is a major obstacle to quickly reproducing an analysis. @ltetrel created a mechanism for NeuroLibre to create a local cache of data at build time (GitHub - SIMEXP/Repo2Data: Automatic data fetcher from the web). We also have a local Docker registry (integration of this in Jupyter was discussed at some point). This means we can reliably spin up a pod with all necessary data in a matter of seconds. But that also means we need to store a lot of data long-term in the cloud, which again gets me to my next point.
- We decided to rely on a local cloud. First, we tried to use the public Canadian high-performance infrastructure, but we ran into reliability issues. Recently, we moved to being hosted at McGill University, which decided to support our IT through the Canadian Open Neuroscience Platform (CONP), which is also funding NeuroLibre. Cancer computing donated a fairly large number of servers to this effort, which is now set up with OpenNebula, Kubernetes and JupyterHub. This solution is not perfect (we haven’t got Terraform to work and the setup requires some manual intervention), but it works. Note that we’ve been creating some documentation along the way, even if there is some catching up to do (NeuroLibre v0.1 documentation). The rationale for using this type of infrastructure rather than a commercial cloud is data hosting. Even though our cloud is not hosted on Compute Canada, we are connected to them with a high-speed connection (I believe a 100 Gb/s line), and they can host data on tape very cheaply, at a price commercial providers could not match (at least when I did my price analysis a few years back). As this infrastructure is built and maintained as part of a national and university investment anyway, I think it’s an excellent solution in terms of sustainability for data and compute hosting.
- For submission and review (or in our case technical screening), we tried working purely with GitHub Actions, and piloted an entire system using that. @emdupre eventually convinced us to build on top of the system used by the Journal of Open Source Software (JOSS) instead. The main rationale is to contribute to an existing successful project rather than start something new. I was skeptical about the ease of adopting their system. Elizabeth responded by basically writing the JOSS installation instructions for them. The system is up and running, and we are now trying to extend their build system to include the Jupyter Books, and not just a PDF. Building a Jupyter Book with a lot of data is time-consuming, and we need this service to be hosted by our BinderHub instance. @ltetrel’s comment was related to that (hopefully last) development needed to get everything working.

Those are just thoughts and I realize you may disagree with some of the design decisions we made, in particular relying on an academic cloud. One last point is that the “neurolibre” manifesto includes contributing upstream as a founding principle (in particular Jupyter and JOSS). So the neurolibre team will be happy to contribute to the Jupyter publishing platform as much as possible.

Regarding funding, neurolibre is in the process of renewing its CONP funding and just applied to a Wellcome fund. This could have worked for the Jupyter publishing platform, but that deadline has passed. I would be happy to help in any way I can to get the Jupyter publishing infrastructure funded.

choldgraf · March 20, 2021, 12:38am

@pbellec thanks so much for this detailed synopsis of NeuroLibre’s experience so far! (I can’t believe I forgot to ping you in this thread in the first place, so I’m glad you and @ltetrel found it!).

The commercial cloud issue you bring up is really important - I still don’t know the right answer there. I wonder if @rabernat has thoughts from the geo community as it pertains to building publishing pipelines.

I would be interested in brainstorming funding opportunities that could go towards this. I think many scholarly communities would benefit from a pattern that they could follow for their infrastructure!

ltetrel · November 10, 2021, 5:44pm

Some updates on that topic, @choldgraf.

You can check the latest development here:

First, we have a bash file that is responsible for getting the proper metadata; if the checks pass (config exists and/or book already built), the jupyter-book build process is launched.

github.com

neurolibre/terraform-binderhub/blob/99c253a06d21fb05be2335a5ad8a2e2101f0f056/terraform-modules/binderhub/assets/jb_build.bash

#!/bin/bash

# repo parameters
IFS='/'; BINDER_PARAMS=(${BINDER_REF_URL}); unset IFS;
PROVIDER_NAME=${BINDER_PARAMS[-5]}
USER_NAME=${BINDER_PARAMS[-4]}
REPO_NAME=${BINDER_PARAMS[-3]}
COMMIT_REF=${BINDER_PARAMS[-1]}
# paths
CONFIG_FILE="content/_config.yml"
BOOK_DST_PATH="/mnt/books/${USER_NAME}/${PROVIDER_NAME}/${REPO_NAME}/${COMMIT_REF}"
BOOK_BUILT_FLAG="${BOOK_DST_PATH}/successfully_built"
BOOK_BUILD_LOG="${BOOK_DST_PATH}/book-build.log"

# checking if book build is necessary
echo "Checking if jupyter book build will be done..."
if [ -f "${CONFIG_FILE}" ]; then
  echo -e "\t ${CONFIG_FILE} exists."
else
  echo -e "\t ${CONFIG_FILE} not found."

This file has been truncated.
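To make the negative array indices in the excerpt easier to follow, here is a small sketch that extracts the same four fields with plain parameter expansion instead of an array. It assumes the reference URL has the shape `<scheme>://<provider>/<user>/<repo>/<kind>/<ref>`, which is what the indices `-5`, `-4`, `-3` and `-1` imply. The `parse_binder_ref_url` helper and the example URL are hypothetical; on a pod the value would come from `${BINDER_REF_URL}`.

```shell
#!/bin/sh
# Sketch: split a reference URL into the fields used by jb_build.bash.
# Assumed URL shape: <scheme>://<provider>/<user>/<repo>/<kind>/<ref>

parse_binder_ref_url() {
  url="$1"
  COMMIT_REF=${url##*/}        # last field, index -1 in the original script
  rest=${url%/*/*}             # drop the trailing "/<kind>/<ref>"
  REPO_NAME=${rest##*/}        # index -3
  rest=${rest%/*}
  USER_NAME=${rest##*/}        # index -4
  rest=${rest%/*}
  PROVIDER_NAME=${rest##*/}    # index -5
}

# Hypothetical example value, used only for illustration.
parse_binder_ref_url "https://github.com/example-user/example-repo/tree/0123abc"

# Same destination layout as in the excerpt above.
BOOK_DST_PATH="/mnt/books/${USER_NAME}/${PROVIDER_NAME}/${REPO_NAME}/${COMMIT_REF}"
echo "${BOOK_DST_PATH}"
```

With the example URL this prints `/mnt/books/example-user/github.com/example-repo/0123abc`. Keying the destination on the commit reference gives every revision of a repository its own book directory.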

This process runs right before a user starts their session, which is possible thanks to the latest JupyterHub changes:

github.com

neurolibre/terraform-binderhub/blob/99c253a06d21fb05be2335a5ad8a2e2101f0f056/terraform-modules/binderhub/assets/config.yaml#L109-L116

    
      
          extraFiles:
            jb_build:
              mountPath: /usr/local/share/jb_build.bash
              mode: 0755
          lifecycleHooks:
            postStart:
              exec:
                command: ["bash", "/usr/local/share/jb_build.bash"]

The cluster can then be upgraded with:
sudo helm upgrade (...) --set-file jupyterhub.singleuser.extraFiles.jb_build.stringData=./jb_build.bash (...)

Hope that this will help!

PS: Ideally I would like to use an initContainer, but I cannot because of this issue: binder url request accessible to hub singleuser initContainers · Issue #1429 · jupyterhub/binderhub · GitHub.
It would be even better if this process ran just once, after the Docker build: https://github.com/jupyterhub/binderhub/pull/1081

manics · November 12, 2021, 12:40pm

This sounds cool! Have you thought about demo-ing it in one of the Jupyter Community Calls? Jupyter Community Calls - #70 by isabela-pf

ltetrel · November 12, 2021, 2:59pm

Yes, I can definitely do that!
Do you know when the next call for November will be?

isabela-pf · November 13, 2021, 1:54am

Yes! I’m so excited to see this!

@ltetrel the next community call is on November 30. All the details of the call can be found on this repo. You can add yourself to the November agenda to make sure we save time for you. Thanks so much for your interest!

(Thanks for the mention @manics )
