How could data_files be improved?

Hello everyone. I’m currently discussing the possibility of a replacement for data_files in setuptools and the rest of the Python packaging ecosystem over on the Python packaging discourse (Should there be a new standard for installing arbitrary data files? - Packaging - Discussions on Python.org). Given that Jupyter is a significant user of data_files, I thought I would highlight the discussion here because you may have useful insight into how Jupyter uses data_files, why it’s used over other solutions, and how you would want to see data_files change if the feature were redesigned. These insights would certainly help us move the discussion forward.
Thanks!


Thanks for stopping by!

Anticipating that this day was coming, and out of a desire to allow other snake people who use things other than setuptools to integrate with the Jupyter stack (especially with editable installs), we’ve been fighting some of this stuff out.

Our biggest challenge is that every Jupyter user is forced to play the role of a full-stack (web, but thankfully less now) engineer, with configurability and extensibility at almost every level of the stack. As a Python-native, but ideally language-agnostic, system, the filesystem is the touchpoint, and it is a contract we hold with many languages, most of which have their own highly opinionated way of putting files on disk (nodejs, julia, rust, go). As such, many power users easily have tens of first-party Jupyter packages and potentially more third-party extensions… and while the 1,000 packages we were using for testing aren’t a real-world workload today, they do point to what could feasibly happen.

The pass down the entry_points route above points (ha) to some challenges that come from not being able to know that everything is in a single, well-known place on disk at install time:

  • upstreams like tornado and jinja2 do not like looking in potentially hundreds of places on disk to find static assets and templates
  • even cutting a ton of corners, loading hundreds of entry_points is pretty slow on fast developer machines with SSDs, and we have a lot of users who are on NFS shares, etc.

PEP 420 namespace packages might work, but nobody has taken the time to explore that route and see what other gremlins exist… and as kind of an “also-ran” feature, a lot of tools don’t handle them particularly well (pyreverse), or have explicitly said they won’t (flit).
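For anyone exploring that route, the aggregation behavior PEP 420 provides can be demonstrated in a few lines; the directory and package names below are made up for the demo:

```python
import os
import sys
import tempfile
import importlib

# Build two on-disk "portions" of a PEP 420 namespace package, as if two
# separate distributions had each installed a piece of it.
root = tempfile.mkdtemp()
for portion in ("dist_a", "dist_b"):
    # No __init__.py, so "example_ns" becomes an implicit namespace package.
    pkg_dir = os.path.join(root, portion, "example_ns")
    os.makedirs(pkg_dir)
    sys.path.insert(0, os.path.join(root, portion))

ns = importlib.import_module("example_ns")

# __path__ lists every directory contributing to the namespace; a static-asset
# collector could walk each portion to gather files at startup.
portions = sorted(ns.__path__)
print(len(portions))  # 2
```

Note that this aggregation happens at import time, which is exactly where a single broken portion can cause trouble for the whole namespace.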


@bollwyvl I’ve been reading up on your and @jasongrout’s notes, and have been messing around with some of the related code.

It seems like the biggest likely issue with namespace packages (aside from the obvious ones, like requiring a very specific project layout for all extension devs) is that one broken subpackage in the namespace can end up borking the whole thing. To me that feels like a dealbreaker, since the current behavior of notebook and jupyter_server is to tolerate the failure of any given server extension.

Actually, come to think of it, server extensions can fail pretty quietly in a lot of cases, so we may want to think about a “strict” extension execution mode anyway… but that’s a whole separate bag of something.
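The tolerant-vs-strict distinction could be as simple as the following sketch; `load_server_extensions` is a made-up helper for illustration, not the actual notebook or jupyter_server API:

```python
import importlib

def load_server_extensions(names, strict=False):
    """Import each named extension module.

    In tolerant mode (the default, matching current behavior), a broken
    extension is recorded and skipped; in strict mode the first failure
    is raised immediately instead of passing quietly.
    """
    loaded, failed = [], []
    for name in names:
        try:
            loaded.append(importlib.import_module(name))
        except Exception as err:
            if strict:
                raise
            failed.append((name, err))
    return loaded, failed

# Tolerant mode keeps going past the broken extension:
ok, bad = load_server_extensions(["json", "definitely_missing_ext"])
```

In tolerant mode the failure list could then be surfaced in logs, rather than disappearing silently as it can today.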

I’m glad this has been raised; I was thinking about the issue again only today, and noticed that it made it onto the python mailing list.

From my perspective, the main ‘issue’ we have right now is the deprecation in setuptools. It is slightly frustrating that using data_files locks us into a particular build backend, but as @bollwyvl elegantly notes, the supported alternatives don’t seem viable at the moment, and are not friendly to the non-Python world. The scheme doesn’t seem to be being deprecated from the wheel format itself, so I’m not clear on what is actually planned for the feature. Perhaps someone on the setuptools team could clarify things here?
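For context, the kind of declaration at stake is the classic setuptools pattern below. The package name and file paths are illustrative, but the etc/jupyter and share/jupyter destinations follow Jupyter’s documented conventions for installing config and extension assets relative to sys.prefix:

```python
from setuptools import setup

setup(
    name="myextension",  # hypothetical package
    data_files=[
        # Installed relative to sys.prefix, e.g. <prefix>/etc/jupyter/...,
        # which is how the server discovers extensions without importing them.
        ("etc/jupyter/jupyter_server_config.d",
         ["jupyter-config/myextension.json"]),
        ("share/jupyter/labextensions/myextension",
         ["myextension/labextension/package.json"]),
    ],
)
```

It is exactly this prefix-relative layout that other build backends have no standardized way to express, hence the lock-in.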