Hi everyone. I’ve faced a problem and need your help.
Let’s imagine we’ve created a script which parses data from a website, and it takes about 10 minutes to run. You’ve written this parsing code into a cell, stored the output in a variable, and started to analyze, correct, and clean the data using that variable.
Question: when you close the service (Jupyter Notebook) and come back later, how do you restore the state of that variable so you don’t have to repeat the first stage, the 10-minute scrape?
Two ideas, one using a database and one using files on disk:
- database: in the case of fetching stuff, if it’s using requests, requests-cache is wonderful.
  - import it, run one line, and then all the requests.get calls end up in a sqlite database in the current working directory.
  - as long as nothing about a request changes, it will return the cached response and not touch the internet (it even works offline).
  - it caches based on the whole request, so changing any part of a request (even one extra parameter) invalidates just that entry.
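To make the "caches based on the whole request" point concrete, here is a minimal stdlib sketch of that idea: the cache key is a hash of the entire request, so an identical request is served from sqlite and a changed request goes back to the network. This is just an illustration of the mechanism, not requests-cache's actual implementation; `fetch` and `fake_request` are made-up names.

```python
import hashlib
import json
import sqlite3

# In-memory db for the sketch; requests-cache writes a .sqlite file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, body TEXT)")

def cache_key(method, url, params):
    # Key on the whole request: same request -> same key.
    raw = json.dumps([method, url, params], sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def fetch(method, url, params, do_request):
    """Return a cached body if this exact request was seen before."""
    key = cache_key(method, url, params)
    row = conn.execute("SELECT body FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]                          # cache hit: no network touched
    body = do_request(method, url, params)     # cache miss: really fetch
    conn.execute("INSERT INTO cache (key, body) VALUES (?, ?)", (key, body))
    conn.commit()
    return body

calls = []
def fake_request(method, url, params):
    calls.append(url)                          # stand-in for the real network call
    return f"page for {url}"

fetch("GET", "https://example.com/a", {}, fake_request)
fetch("GET", "https://example.com/a", {}, fake_request)       # served from cache
fetch("GET", "https://example.com/a", {"p": 2}, fake_request)  # new key: refetched
print(len(calls))  # → 2
```

With requests-cache itself, the one line is `requests_cache.install_cache("some_name")`, after which plain `requests.get` calls behave like `fetch` above.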
- pile of files: for the general case, organize tasks with doit, and have them leave their outputs on disk.
  - it’s like GNU make (which is also excellent and worth learning), but in Python.
  - with small tasks and good file_dep declarations, it can be really good at avoiding rework: consider a classic scraping problem, paginated search results:
    - one task requests the first page, which also says how many pages there are
    - one task per page of results
    - one task per result
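The task layout above might look roughly like this as a doit `dodo.py`; the URL, file names, and page count are made up for the sketch, and in real use the page count would be parsed out of the first page’s output rather than hard-coded.

```python
# dodo.py — sketch of the paginated-scrape layout, assuming a hypothetical
# search endpoint. Each task leaves a file on disk, so doit's
# file_dep/targets tracking can skip work that is already done.
SEARCH_URL = "https://example.com/search"  # hypothetical

def task_first_page():
    """Fetch page 1, which also tells us how many pages exist."""
    return {
        "actions": [f"curl -s '{SEARCH_URL}?page=1' -o page_1.json"],
        "targets": ["page_1.json"],
    }

def task_pages():
    """One sub-task per page of results."""
    n_pages = 5  # in real use: parse this out of page_1.json
    for page in range(2, n_pages + 1):
        yield {
            "name": f"page_{page}",
            "actions": [f"curl -s '{SEARCH_URL}?page={page}' -o page_{page}.json"],
            "file_dep": ["page_1.json"],       # rerun if page 1 changed
            "targets": [f"page_{page}.json"],  # skip if already on disk
        }
```

Running `doit` then only executes tasks whose targets are missing or whose file_dep changed; a one-task-per-result layer would follow the same pattern, depending on its page file.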
In general, once you’re past fetching, the more the pipeline caches its intermediate steps along the way, the happier you will be when things fail. With this approach it can also scale to multiple processes/computers, let you invalidate just some of the work, etc.
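For the original notebook question, the smallest version of "cache each step on disk" is a helper that pickles each step’s result next to the notebook; `cached_step` and the `step_cache` directory are my own names, not from any library.

```python
import pickle
from pathlib import Path

CACHE_DIR = Path("step_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_step(name, func, *args):
    """Run func once and pickle its result; later calls reload from disk.

    Delete step_cache/<name>.pkl to invalidate just that one step.
    """
    path = CACHE_DIR / f"{name}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)      # step already done: skip the work
    result = func(*args)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result

# In a notebook: the slow scrape runs once, then survives kernel restarts.
raw = cached_step("scrape", lambda: ["row1", "row2"])                  # slow step
clean = cached_step("clean", lambda rows: [r.upper() for r in rows], raw)
print(clean)  # → ['ROW1', 'ROW2']
```

This is the same idea doit gives you, minus the dependency tracking: if a later step fails, rerunning the notebook reloads the earlier steps from disk instead of redoing the 10-minute scrape.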
There are much heavier-weight systems like dagster, but each of these is basically a whole ecosystem that requires learning.