Some folks at UFF and NYU did a study of the reproducibility of Jupyter notebooks found on GitHub.
Highlighted data points: 1.4M notebooks found in 264k repos; 864k notebooks executed, with 24% completing without errors and only 4% producing the “same” results (“same” is quite strictly defined, and their execution practices have drawn some criticism).
They made a few choices for measuring reproducibility that I think are peculiar. In particular, when a notebook had out-of-order prompts, indicating re-run cells, they chose to follow the displayed execution order rather than what you’d get from “restart and run all”, which would be my definition. Additionally, if reproducibility requires byte-for-byte identical output, then you need to make sure there are no time- or memory-sensitive outputs (e.g. printing default object reprs, which embed a memory address), differences that may well have no bearing on actual reproducibility. Byte-for-byte comparison is by far the simplest and strictest measure of reproducibility, but not the most useful.
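To make the out-of-order point concrete, here’s a minimal sketch (my own, not the paper’s method) of how you might flag such notebooks, assuming the standard nbformat v4 JSON layout where each code cell carries an `execution_count`:

```python
import json

def has_out_of_order_prompts(path):
    """Return True if the notebook's code cells show non-monotonic
    execution counts, i.e. the displayed prompts imply cells were
    re-run in some order other than top-to-bottom."""
    with open(path) as f:
        nb = json.load(f)
    counts = [
        cell.get("execution_count")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
        and cell.get("execution_count") is not None
    ]
    return counts != sorted(counts)
```

And the memory-sensitive output problem is easy to demonstrate: printing any object that lacks a custom `__repr__` emits its memory address, which changes on every run even when the computation itself is perfectly reproducible:

```python
class Result:
    """A made-up class with no custom __repr__."""
    def __init__(self, value):
        self.value = value

r = Result(42)
print(r)        # e.g. <__main__.Result object at 0x7f3a2c1d5f40>
                # the 0x... address differs across runs, so byte-for-byte
                # comparison fails even though nothing meaningful changed
print(r.value)  # 42 -- stable across runs
```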
They also define some best practices for reproducibility, which I think are worth attention and discussion.