For you sir @yuvipanda I threw together this little repo that’s the basic picture of the pattern we’re using.
We haven’t released the real code we’re using yet, but we are toying with a SciPy 2021 submission about it and what we’ve learned from it. In fact we use JupyterHub+Dask to crunch the numbers and then Voila to show the results in a dashboard. I am hoping a paper deadline will motivate our IPO people to say OK to us releasing it…?
The basic idea is to split the duties into 2 parts. The first is figuring out what you actually want to record when the hook fires, and the second is how you want to record it. In the demo the first part just compares the contents of sys.modules to a list of packages of interest and returns the matches. That could be made arbitrarily complicated, to weed out stdlib or whatever. We use it in particular to watch for a number of big-deal packages but also some more speculative ones.
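A minimal sketch of that first "inspect" step could look like the following. The watch list and the function name are made up for illustration; they are not taken from the demo repo or the real code.

```python
import sys

# Hypothetical watch list; the real list of packages of interest is not shown here.
PACKAGES_OF_INTEREST = {"numpy", "scipy", "dask"}


def inspect(watched=PACKAGES_OF_INTEREST, modules=None):
    """Return the sorted watched top-level packages that are currently imported."""
    if modules is None:
        modules = sys.modules
    # Reduce "pkg.sub.mod" to "pkg" so submodules count toward their package.
    imported = {name.split(".", 1)[0] for name in list(modules)}
    return sorted(imported & set(watched))
```

Taking the module table as a parameter is only there to make the function easy to test. Anything more elaborate, like filtering out the stdlib, would go inside the same function.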
The second part, the recording, is in the demo just a dump to a JSON file. The Blue Waters paper also used a directory on the file system to collect the data. But we have this path through syslog to Elastic and we try to gather all this kind of stuff there, so we’re using that. I don’t think a filesystem solution would hold up.
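For the "report" half, a rough sketch of the JSON-file flavor might be something like this. The record fields, the file naming scheme, and the default directory are all assumptions for illustration, not what the demo necessarily does.

```python
import json
import os
import tempfile
import time


def report(packages, directory=None):
    """Write one JSON record for this process and return the file path."""
    # Assumption: fall back to the system temp directory if none is given.
    directory = directory or tempfile.gettempdir()
    record = {
        "pid": os.getpid(),
        "timestamp": time.time(),
        "packages": list(packages),
    }
    # One file per process, so concurrent jobs do not clobber each other.
    path = os.path.join(directory, f"imports-{os.getpid()}.json")
    with open(path, "w") as f:
        json.dump(record, f)
    return path
```

A syslog-based reporter would keep the same call signature and only swap out the body, which is the point of separating the two parts.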
The interfaces for both things in the demo look pretty much like the interfaces in the real code we’re using but there might be another way to look at the problem than this “inspect and report” model.
There’s a third bit, which I left out of the demo: how to format the messages. We have a way of standardizing them and collecting other details based on what we can find out at exit time: How big is the job? Who’s the user? Is this a batch job? Is it running in a container? Is the user staff (so we can filter out staff jobs)? What is the executable called? What Python interpreter was invoked? All that.
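To give a flavor of what collecting those extra bits at exit time could involve, here is a small sketch using only the standard library. The field names are invented, and the use of `SLURM_JOB_ID` assumes a Slurm-style scheduler that exports it inside jobs; neither is confirmed by the real code.

```python
import getpass
import os
import sys


def gather_context(environ=None):
    """Collect a few facts about the running process for the report message."""
    environ = os.environ if environ is None else environ
    try:
        user = getpass.getuser()
    except Exception:
        # getuser() can fail in stripped-down environments; record nothing then.
        user = None
    return {
        "user": user,
        "executable": sys.argv[0] if sys.argv else None,
        "interpreter": sys.executable,
        # Assumption: the scheduler exports SLURM_JOB_ID inside a job.
        "job_id": environ.get("SLURM_JOB_ID"),
    }
```

Questions like "is the user staff" or "is this in a container" are site-specific and would need their own lookups.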
Thoughts about sitecustomize.py: in the demo I put it on sys.path directly. There’s no way to install sitecustomize.py using normal Python packaging tools other than copying it into place; I would have liked one. We ended up going with PYTHONPATH, with the paths in it baked into the compute node images (to avoid bad Python import behavior), and if users really object to it they can opt out. Also, the thing fires on everything, like activating conda environments (you’ll find multiprocessing is inordinately important), but if it breaks badly it’s really noisy about it, so that’s good…
If you go a similar route you need to think about what you might miss. If containers shut down super-abruptly then the exit hook won’t get a chance to fire, but again you might be able to rig up some kind of signal handling to trigger a nice shutdown.