Hi everyone. I’ve faced a problem and need your help.
Let’s imagine we’ve created a script that parses data from a website, and it takes about 10 minutes to run. You’ve written this parsing code into a cell, stored the output in a variable, and started to analyze, correct and clean the data using that variable.
Question: when you close the service (Jupyter Notebook) and come back later, how do you restore the state of the variable so that you don’t have to go back to the first stage, the 10-minute scrape?
Two ideas, one using a database and one using files on disk:
- database: for the fetching part, if it uses `requests`, `requests-cache` is wonderful (see the first sketch after this list).
  - import it, run one line, and then all the `requests.get`s end up in a sqlite database in the current working directory.
  - as long as nothing about a request changes, it will return the cached response and not touch the internet (it even works offline)
  - it caches based on the whole request, so you can force a fresh fetch of just some requests by adding an extra `?` param
- pile of files: for the general case, organize tasks with `doit`, and have them leave stuff on disk (see the second sketch after this list)
  - it’s like GNU `make` (which is also excellent and worth learning), but in Python
  - with small tasks that have good `targets` and `file_dep`, it can be really good at not doing rework: consider the classic scraping problem of search results
    - do one task that requests the first page, which includes how many pages there are
    - do one task per page of results
    - do one task per result
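A minimal sketch of the `requests-cache` idea, assuming the package is installed (`pip install requests-cache`); the cache name and URLs are placeholders:

```python
import requests
import requests_cache

# One line of setup: from here on, every requests.get() is read from /
# written to scrape_cache.sqlite in the current working directory.
requests_cache.install_cache("scrape_cache")

resp = requests.get("https://example.com/search?page=1")  # hits the network
resp = requests.get("https://example.com/search?page=1")  # served from sqlite
print(resp.from_cache)  # True on the second call

# The cache key is the whole request, so an extra query param forces a
# fresh fetch of just this one request:
resp = requests.get("https://example.com/search?page=1&cachebust=2")
```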
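And a sketch of the per-page `doit` idea as a `dodo.py` file; the URL, the file layout and the hard-coded page count (which in real code you would parse out of the first page) are assumptions for illustration:

```python
# dodo.py
import pathlib
import requests

BASE_URL = "https://example.com/search"  # placeholder
NUM_PAGES = 5  # placeholder: really this comes from parsing raw/page1.html

def fetch(url, target):
    path = pathlib.Path(target)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(requests.get(url).text)

def task_fetch_page():
    """One doit sub-task per page of results, each with its own target file."""
    for page in range(1, NUM_PAGES + 1):
        target = f"raw/page{page}.html"
        yield {
            "name": f"page{page}",
            "actions": [(fetch, [f"{BASE_URL}?page={page}", target])],
            "targets": [target],
            # no file_dep, so the missing-target check is what triggers a re-run
            "uptodate": [True],
        }
```

Running `doit` in that directory should then fetch only the pages that aren’t already sitting in `raw/`.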
In general, once past fetching, the more the whole pipeline caches its steps along the way, the happier it will be when things fail. So:
- `requests.get(url)` → `raw/{some-id}.html`
- `raw/{some-id}.html` → `parsed/{some-id}.json`
- `parsed/{some-id}.json` → `report/{some-id}.html`
With this approach, the work can be scaled across multiple processes/computers, partially invalidated, etc.
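Sketched as `doit` tasks for a single hypothetical id, so each step re-runs only when its input changed or its output is missing (the URL, the ids and the “parsing” are placeholders):

```python
import json
import pathlib
import requests

RAW, PARSED, REPORT = "raw/item-1.html", "parsed/item-1.json", "report/item-1.html"

def _write(path, text):
    p = pathlib.Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_text(text)

def fetch():
    _write(RAW, requests.get("https://example.com/item/1").text)  # placeholder URL

def parse():
    # stand-in for real parsing (BeautifulSoup, lxml, ...)
    _write(PARSED, json.dumps({"chars": len(pathlib.Path(RAW).read_text())}))

def report():
    chars = json.loads(pathlib.Path(PARSED).read_text())["chars"]
    _write(REPORT, f"<p>item-1 has {chars} characters</p>")

def task_fetch():
    # no file_dep: the missing-target check decides whether to fetch again
    return {"actions": [fetch], "targets": [RAW], "uptodate": [True]}

def task_parse():
    # re-runs only when the raw file changes or the parsed file is missing
    return {"actions": [parse], "file_dep": [RAW], "targets": [PARSED]}

def task_report():
    return {"actions": [report], "file_dep": [PARSED], "targets": [REPORT]}
```

A crash halfway through reporting then costs a re-run of only the report step, not the 10-minute fetch.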
There are much heavier-weight systems like airflow, luigi and dagster, but each of those is basically a whole ecosystem that requires learning.