I opened a [Request for Implementation] a while ago about a privacy-preserving way to collect data on which installed libraries are actually being used by users on a JupyterHub. The primary goal is to help find and remove unused libraries, but I'm sure that aggregated usage stats for libraries in a given community have other uses too.
python-popularity-contest does just that! It collects pre-aggregated, anonymized data on which installed libraries are being actively used by your users. Privacy is very important here, so we collect only the minimum amount of data needed.
We want to collect just enough data to help with the following tasks:
Remove unused libraries that have never been imported. These can probably be removed without much breakage for individual users
Provide aggregate statistics about the ‘popularity’ of a library to add a data point for understanding how important a particular library is to a group of users. This can help with funding requests, better training recommendations, etc.
To collect the smallest amount of data possible, we aggregate at the source. Only overall global counts are stored, without any individual per-import records. This is much better than storing per-user or per-process records. The collector filters out standard library modules and any local modules the user has written. It also ignores all modules imported by IPython when starting a kernel, so the result is as close to ‘just the libraries used by the user in a notebook’ as possible.
It also reports libraries (something you can pip or conda install) rather than modules (what you import). This offers a bit more privacy, and is also more useful for answering our questions.
I deployed this on a Berkeley hub yesterday, and it produces data of this form:
I'm excited to use this to slim down my images!
I’ve tried to provide technical and deployment information in the project README. Please check it out and let me know what you think!