Question: How to remember the state of a variable

Hi everyone. I’ve faced a problem and need your help.
Let’s imagine we’ve created a script that parses data from a website, and it takes about 10 minutes to run. You’ve written the parsing code into a cell, stored the output in a variable, and then started to analyze, correct, and clean that data using the variable.

Question: When you close the service (Jupyter Notebook) and come back later, how do you restore the state of the variable so that you don’t have to go back to the first stage and re-scrape for 10 minutes?

Two ideas, which use a database or files on disk:

  • database: in the case of fetching stuff, if it’s using requests, requests-cache is wonderful.
    • import it, run one line, and then all the requests.get calls end up in a SQLite database in the current working directory.
    • as long as nothing about a request changes, it will return the cached response and not touch the internet (it even works offline)
      • it caches based on the whole request, so you can partially invalidate it with an extra ? param
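With requests-cache itself, the setup really is one line (`requests_cache.install_cache()`). To illustrate the idea of keying the cache on the whole request, here is a minimal stdlib-only sketch; the `fetch` function is a hypothetical stand-in for the real network call:

```python
import sqlite3

# requests-cache uses a file on disk; an in-memory db keeps the sketch self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (url TEXT PRIMARY KEY, body TEXT)")

calls = []  # track how often we "hit the network"

def fetch(url):
    """Hypothetical network call; stands in for requests.get(url).text."""
    calls.append(url)
    return f"<html>response for {url}</html>"

def cached_get(url):
    # key on the whole URL, query string included -- adding an extra
    # ?param changes the key and so invalidates just that one entry
    row = conn.execute("SELECT body FROM cache WHERE url = ?", (url,)).fetchone()
    if row:
        return row[0]  # cached: the network is not touched
    body = fetch(url)
    conn.execute("INSERT INTO cache VALUES (?, ?)", (url, body))
    conn.commit()
    return body

cached_get("https://example.com/search?page=1")
cached_get("https://example.com/search?page=1")      # served from cache
cached_get("https://example.com/search?page=1&v=2")  # extra param: fresh fetch
print(len(calls))  # 2 network calls for 3 gets
```

The second call with the extra `&v=2` param shows the partial-invalidation trick from the bullet above.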
  • pile of files: for the general case, organize tasks with doit, and have them leave stuff on disk
    • it’s like GNU make (which is also excellent and worth learning), but in python
    • with small tasks that have good targets and file_dep, it’s really good at avoiding rework: consider the classic scraping problem of paginated search results
      • do one task that requests the first page, which includes how many pages
      • do one task per page of results
      • do one task per result
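The task breakdown above can be sketched as a dodo.py. The URLs, paths, and the `fetch` helper are made up for illustration; in real code the page count would come from parsing the first page rather than being hard-coded:

```python
# dodo.py -- run with `doit`; doit reruns a task only when its file_dep
# changed or its targets are missing
from pathlib import Path

def fetch(url, target):
    """Hypothetical fetch helper; in real code this would call requests.get."""
    Path(target).parent.mkdir(parents=True, exist_ok=True)
    Path(target).write_text(f"fetched {url}")

def task_first_page():
    # one task that requests the first page (which says how many pages there are)
    return {
        "actions": [(fetch, ["https://example.com/search?page=1", "raw/page-1.html"])],
        "targets": ["raw/page-1.html"],
    }

def task_pages():
    # one sub-task per page of results; the range here is a placeholder for
    # the page count parsed out of raw/page-1.html
    for page in range(2, 4):
        yield {
            "name": f"page-{page}",
            "actions": [(fetch, [f"https://example.com/search?page={page}",
                                 f"raw/page-{page}.html"])],
            "file_dep": ["raw/page-1.html"],
            "targets": [f"raw/page-{page}.html"],
        }
```

Because each page is its own sub-task with its own target file, rerunning `doit` after a crash only redoes the pages that are missing.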

In general, once past fetching, the more the whole pipeline caches its steps along the way, the happier it will be when things fail. So:

  • requests.get(url) → raw/{some-id}.html
  • raw/{some-id}.html → parsed/{some-id}.json
  • parsed/{some-id}.json → report/{some-id}.html

With this approach, the pipeline can scale to multiple processes or computers, you can invalidate just part of the work, etc.
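A tiny make-style helper shows the skip-if-already-done behaviour of this raw → parsed → report layout; the paths and the toy parse step are invented for the example:

```python
import json
import tempfile
from pathlib import Path

def step(dep, target, action):
    """Run action(dep, target) only if target is missing or older than dep."""
    dep, target = Path(dep), Path(target)
    if target.exists() and target.stat().st_mtime >= dep.stat().st_mtime:
        return False  # already done: nothing to redo
    target.parent.mkdir(parents=True, exist_ok=True)
    action(dep, target)
    return True

def parse(dep, target):
    # toy "parse": record the length of the raw html as json
    target.write_text(json.dumps({"length": len(dep.read_text())}))

root = Path(tempfile.mkdtemp())
(root / "raw").mkdir()
(root / "raw/item-1.html").write_text("<html>hello</html>")

ran_first = step(root / "raw/item-1.html", root / "parsed/item-1.json", parse)
ran_again = step(root / "raw/item-1.html", root / "parsed/item-1.json", parse)
print(ran_first, ran_again)  # True False -- the second call is a no-op
```

This is the same timestamp comparison GNU make does; doit adds checksums, dependency graphs, and parallel execution on top of the same idea.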

There are much heavier-weight systems like airflow, luigi and dagster, but each of these is basically an ecosystem that requires learning.