[Request for Implementation] Instrument libraries actively used by users on a JupyterHub

Problem

We have a Docker image for our students on JupyterHub that has a lot of libraries. We don't fully know which libraries are actively being used, which makes upgrades and library removals difficult. Wouldn't it be great if we knew which packages are actually being used?

Suggested solution

Python now fires audit events (PEP 578) for various runtime operations. We can register a hook to listen for those events and record them somewhere. In particular, there's an audit event for imports that we can hook into. Then we need to send the records to a centralized location where we can determine which libraries are actively being used.
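A minimal sketch of what that could look like, assuming Python 3.8+ (where sys.addaudithook from PEP 578 is available); the set just dedupes so each top-level package is recorded once:

import sys

seen = set()

def record_import(event, args):
    # PEP 578 audit hook: the "import" event fires on every import,
    # and args[0] is the dotted module name.
    if event == "import" and args[0]:
        seen.add(args[0].split(".")[0])  # keep top-level package names only

sys.addaudithook(record_import)

Note that audit hooks can't be removed once added and they fire for many events, so the hook body has to stay cheap; actually shipping the contents of seen anywhere central is the hard part.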

Possible issues

Privacy! We don't need to record user names at all. That doesn't guarantee privacy, especially if you also have access to out-of-band information, but it's a good start.

Performance! A lot of imports happen in Python! If we make a network request on every import, everything will become extremely slow. We need to figure out how to record this without making everything crawl.
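One possible shape for this (a sketch, not a recommendation): have the hook only push names onto an in-memory queue, and let a daemon thread ship batches, so the import path itself never touches the network. ship_batch here is a hypothetical placeholder for whatever sender you end up with:

import queue
import threading

events = queue.SimpleQueue()

def ship_batch(batch):
    # Hypothetical: in reality this would POST the batch to a collector.
    print(f"shipping {len(batch)} import events")

def flusher():
    while True:
        batch = [events.get()]  # block until at least one event arrives
        try:
            while len(batch) < 100:
                batch.append(events.get_nowait())
        except queue.Empty:
            pass
        ship_batch(batch)  # one network call per batch, off the hot path

threading.Thread(target=flusher, daemon=True).start()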


@ryan showed me this cool trick for doing the same in R:

library <- function(package, ...) {
    # Accept the package name quoted or unquoted, like base::library() does.
    package <- as.character(substitute(package))
    if (length(package) != 1L)
        stop("'package' must be of length 1")
    # Log user, host, package name, and timestamp via an external script,
    # then hand off to the real base::library().
    system(paste("python /usr/local/common/lib/shells/RpkgLogWrite.py",
                 Sys.info()["user"], Sys.info()["nodename"], package,
                 date(), 0, sep = " "))
    eval(substitute(base::library(XYZ, ...), list(XYZ = package)))
}

We have some experience here. What we do is inject a sitecustomize.py that registers an exit hook; the hook inspects sys.modules and sends the list of packages to elastic. We didn't want to inject monitoring at import time because that would work horribly at scale, and if the monitoring went sideways, it would hopefully happen after the user was done with whatever they were doing.
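In sketch form, assuming you just spool JSON lines to a local file (the path here is made up; the real system sends to elastic instead):

# sitecustomize.py -- minimal sketch of the exit-hook pattern
import atexit
import json
import sys
import time

def report_imports():
    # Everything imported during the process's lifetime is in sys.modules;
    # keep only the top-level package names.
    packages = sorted({name.split(".")[0] for name in sys.modules})
    record = {"timestamp": time.time(), "packages": packages}
    try:
        with open("/var/spool/import-log/imports.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
    except OSError:
        pass  # monitoring must never break the user's process

atexit.register(report_imports)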

With only one exception it has worked basically flawlessly for years, and that one case involved Python 2/3 runpy shenanigans; had we known about them ahead of time, I think we could have avoided it. The blind spots happen when Python shuts down without going through the normal exit hook, but the hook can be triggered through a signal if you can make sure one is always delivered. Users could also manipulate sys.modules to get around reporting their imports, but that works against them. We also submit canary jobs to check that the monitoring is actually working, and so far it's been nearly perfect.
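The signal trick can be as simple as this sketch, assuming SIGTERM is what your environment delivers on shutdown:

import signal
import sys

def graceful_exit(signum, frame):
    # sys.exit() raises SystemExit, which unwinds normally, so
    # atexit-registered hooks (like the sys.modules reporter) still run.
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_exit)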

The basic idea comes from here, though our implementation is our own. It has the benefit of being really easy to do and leveraging normal Python machinery.


Maybe @arokem would be interested in this topic?


@yuvipanda That R function was written by Chris Paciorek, who isn't on this discourse. It masks the built-in library(), and then you can do whatever you want with the package info. It works fine at our scale.

I'm interested in this topic, as we have no corresponding solution for Python.


Woah, that’s amazing! Is there code you can share?

I don't wanna run elastic, but maybe something smaller could answer just the question I have, which is 'when was the last time this library was used?', maybe along with a frequency counter.
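Something as small as one SQLite row per package might be enough for that; a rough sketch (the path and schema are made up):

import sqlite3
import time

conn = sqlite3.connect("/srv/import-log/usage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS usage (
        package   TEXT PRIMARY KEY,
        last_used REAL,
        count     INTEGER NOT NULL DEFAULT 0
    )""")

def record(packages):
    # Upsert: refresh the last-used timestamp and bump the counter.
    now = time.time()
    conn.executemany(
        """INSERT INTO usage (package, last_used, count) VALUES (?, ?, 1)
           ON CONFLICT(package) DO UPDATE SET
               last_used = excluded.last_used,
               count = count + 1""",
        [(p, now) for p in packages])
    conn.commit()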

For you, sir @yuvipanda, I threw together this little repo that shows the basic picture of the pattern we're using.

We haven't released the real code we're using yet, but we are toying with a SciPy 2021 submission about it and what we've found from it; in fact we use JupyterHub+Dask to crunch the numbers and then Voila to show the results in a dashboard. I am hoping a paper deadline will motivate our IPO people to say OK to us releasing it…?

The basic idea is to split the duties into two parts: first, figure out what you actually want to record when the hook fires; second, decide how you want to record it. In the demo, the first part just compares the list of modules in sys.modules to a list of packages of interest and returns the matches. That could be made arbitrarily complicated, to weed out the stdlib or whatever. We use it in particular to watch for the big-deal packages, but also some more speculative ones.
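In sketch form, that first part can be as small as this (the watchlist is made up):

import sys

PACKAGES_OF_INTEREST = {"numpy", "pandas", "dask", "scipy"}  # hypothetical watchlist

def interesting_imports():
    # Compare top-level names in sys.modules against the watchlist.
    top_level = {name.split(".")[0] for name in sys.modules}
    return sorted(top_level & PACKAGES_OF_INTEREST)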

The second part, recording, in the demo just drops the result into a JSON file. The Blue Waters paper also used a directory on the file system to collect the data. But we have this path through syslog to elastic, and we try to gather all this kind of stuff there, so that's what we use. I don't think a filesystem solution would hold up at our scale.
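For flavor, a hedged sketch of what the syslog half of that path can look like (the logger name is arbitrary; /dev/log assumes Linux):

import json
import logging
import logging.handlers

logger = logging.getLogger("import-usage")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

def record_via_syslog(record):
    # One JSON line per record; downstream syslog plumbing forwards to elastic.
    logger.info(json.dumps(record))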

The interfaces for both pieces in the demo look pretty much like the interfaces in the real code we're using, but there might be other ways to look at the problem than this "inspect and report" model.

There's a third bit, which I elide from the demo: how to format the messages. We have a way of standardizing them and collecting other bits based on what we can find out at exit time: How big is the job? Who's the user? Is this a batch job? Is it running in a container? Is the user staff (so we can filter out staff jobs)? What is the executable called? Which Python interpreter was invoked? All that.
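A sketch of gathering that kind of exit-time context; every heuristic here is illustrative, not what we actually use:

import getpass
import os
import sys

def collect_context():
    return {
        "executable": sys.argv[0],
        "interpreter": sys.executable,
        "in_container": os.path.exists("/.dockerenv"),  # common Docker marker
        "batch_job": "SLURM_JOB_ID" in os.environ,      # site-specific heuristic
        "user": getpass.getuser(),  # omit or hash this if privacy matters
    }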

Some thoughts about sitecustomize.py. In the demo I put it onto sys.path directly. There's no way to install sitecustomize.py using normal Python packaging tools other than copying it into place, which I would have liked. We ended up going with PYTHONPATH, with the paths baked into the compute node images (to avoid bad Python import behavior), and if users really object they can opt out. Also, the thing fires on everything, including activating conda environments (you'll find multiprocessing is inordinately important), but if it breaks badly it's really noisy about it, so that's good…

If you go a similar route you need to think about what you might miss. If containers shut down abruptly, things won't get a chance to fire the exit hook, but again you might be able to rig up a signal handler to trigger the clean shutdown path.
