Hello Everyone! I am new to the Jupyter landscape and am hoping to get some feedback on the best way to structure a notebook. I am in the early phases of my project and want to get this sorted out now while changing my structure is still manageable.
The simplified version of my project is as follows:
1. Pull a large dataset out of a pickle file and use it to populate dictionaries for dataset1 and dataset2.
2. Do some manipulation and generate plots for dataset1.
3. Do some manipulation and generate plots for dataset2.
Currently I have three separate Python files: one that populates the dictionaries, one that plots dataset1, and one that plots dataset2. I plan to do a lot more work with the data and will add another Python file for each additional task.
My Python files import each other (more specifically, the dataset1 and dataset2 plotting files import the dictionaries file), and I run each file in a notebook cell using the %run command. The problem is that I think this basically wastes time and resources, since the dictionaries file is re-run every time I run the scripts that generate my plots.
I could get around this by putting all of my Python code directly into the notebook cells, but the files are 400+ lines. I feel like this really clutters up the notebook and makes editing the code much more tedious (versus writing the Python files in VS Code).
Does anyone have any advice on the best way to structure my notebook so that I can efficiently run large Python scripts? The easiest way I've thought of is to simply write the code in VS Code and copy/paste it into the Jupyter cells, but that seems really clunky.
You may be able to adapt your scripts to store or pickle your dictionaries, and then check whether a file with that data is already present and read it in instead of building it again. Not everything can be serialized, but there are options at several levels. I usually give the file I'm going to check for a name with a part that is unique to that type of data and a part specific to that run, such as a date/time stamp. That way I can check whether a file containing the part unique to that type of data is present, and if it isn't, go ahead and make one.
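For example, something along these lines; the function and file names here are just placeholders for illustration:

```python
import os
import pickle

def load_or_build(build_func, cache_path):
    """Load pickled dictionaries if a cache file already exists; otherwise build and pickle them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build_func()                      # the expensive dictionary-building step
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data

# The cache file name can encode the data type plus a run-specific tag, e.g. a date stamp:
# dicts = load_or_build(build_dataset_dicts, "dataset_dicts_2024-05-01.pkl")
```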
As for keeping things together: some scripts get used in multiple projects, or may be eventually, so it doesn't always make sense to keep them in a single project folder. If you can store your scripts in a central repository like GitHub, that can make things easier. While you may not want to make your notebook and results available during development, often the scripts themselves don't offer much to anyone who finds them. That way you can keep your data and notebooks local and build in fetching the scripts if they aren't already present in your working directory.
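A rough sketch of the "fetch if missing" idea (the repository URL and script name below are placeholders, not real locations):

```python
import os
import urllib.request

def fetch_if_missing(script_name, base_url):
    """Download a script from a central repository only if it isn't already in the working directory."""
    if not os.path.exists(script_name):
        urllib.request.urlretrieve(base_url + script_name, script_name)

# e.g. fetch_if_missing("build_dicts.py",
#                       "https://raw.githubusercontent.com/<user>/<repo>/main/")
```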
And because your scripts are still being developed and may be used multiple times, it makes more sense not to build them into your notebook. You can also import functions from your scripts rather than using %run (see the sketch below). You also have much more ability to scale when the code lives in scripts rather than in a bloated notebook.
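As a minimal sketch of what the notebook side could look like (the module and function names are made up; yours will differ):

```python
# Keep the heavy code in .py files edited in VS Code, and import it in the notebook.
from build_dicts import load_dicts             # hypothetical module that populates the dictionaries
from plot_dataset1 import make_dataset1_plots  # hypothetical plotting modules
from plot_dataset2 import make_dataset2_plots

dicts = load_dicts("big_data.pkl")             # the expensive step runs once per kernel session
make_dataset1_plots(dicts)
make_dataset2_plots(dicts)
```

If you edit the .py files while the notebook is running, IPython's autoreload extension (%load_ext autoreload followed by %autoreload 2) will pick up your changes without restarting the kernel.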
The other thing to bear in mind is that there are tools like snakemake that let you build workflows/pipelines which can run notebooks, and even generate notebooks, if you want. As you are new, it's probably something to just keep in mind for now. One of snakemake's features is that if some of your data changes, you can re-run the pipeline and only the code needed to produce the pertinent results/reports/analysis for the new data gets re-run. It depends on making things as modular as you can, and it's probably best tried after you've done a few analyses without it, since you don't need the extra load of learning new tech while you're still working out how you like to work and build your analyses.
There are also ways to run your notebook from the command line that are good for these types of multi-step notebook runs. Jupytext is usually my go-to for that.