Description
The current status quo for development integration and synchronization between the importlib backports and the CPython upstream isn't optimal.
Before anything else, I must properly acknowledge @jaraco's monumental and tireless effort in maintaining the importlib backports and handling the complex synchronization with the CPython upstream, not to mention the continued development of these modules. It has been instrumental in getting things to the state they are in today, and none of the issues discussed in this thread should reflect negatively on him, but rather on our failure to ensure these projects got the resources they need — a far too common tale in open source.
Here are some issues I think we should improve:
- Synchronization process — even though @jaraco has left comments in some PRs describing his workflow, there's no properly documented process
- Authorship stripping — the current way changes are synced to and from the backports strips commit authorship
- Documentation fragmentation, resulting in sub-optimal documentation
  - E.g. `importlib.metadata.Distribution`, and other classes, do not document their attributes (I regularly have to resort to the source code; a short illustration follows this list)
- CLA enforcement — the backports do not enforce the CLA
- Segmented development workflow — issues and changes happen in both places
- Source history — the current way changes are synced to and from the backports strips commit history
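To make the documentation point concrete, here is the kind of introspection I end up doing to discover `importlib.metadata.Distribution`'s attributes. This is only an illustration; the distribution name is just an example of one that is commonly installed:

```python
# Discovering Distribution's attributes by introspection, since they are not
# all listed in the rendered documentation.
from importlib.metadata import distribution

dist = distribution("pip")  # any installed distribution name works here

print(dist.name)                 # project name
print(dist.version)              # version string
print(dist.metadata["Summary"])  # metadata mapping (email.Message-like)
print(dist.requires)             # list of requirement strings, or None
print(dist.files)                # list of PackagePath objects, or None

# dir() is often the quickest way to see what is actually available.
print([name for name in dir(dist) if not name.startswith("_")])
```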
cc @python/importlib-team
Activity
FFY00 commented on Jan 26, 2025
I would like to propose officially defining a development upstream, and enforcing it.
The solution that I think would most cleanly handle the fragmentation, history, authorship, and CLA issues is to select CPython as the upstream. An approach to implement this could be to track the backport version here and, when it is updated, have CI automation update the backport repos, just as we do when backporting to older Python versions.
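As a rough sketch of the kind of CI automation I have in mind (not an existing tool; the paths, repository names, and commit message below are purely illustrative):

```python
# Sketch: mirror the importlib.metadata sources from a CPython checkout into
# the backport repository whenever the tracked version changes.
import shutil
import subprocess
from pathlib import Path

CPYTHON = Path("cpython")              # checkout of python/cpython
BACKPORT = Path("importlib_metadata")  # checkout of the backport repository

def sync_backport() -> None:
    src = CPYTHON / "Lib" / "importlib" / "metadata"
    dst = BACKPORT / "importlib_metadata"
    shutil.rmtree(dst, ignore_errors=True)
    shutil.copytree(src, dst)
    # The tests would need a similar mapping; omitted here for brevity.
    subprocess.run(["git", "-C", str(BACKPORT), "add", "--all"], check=True)
    subprocess.run(
        ["git", "-C", str(BACKPORT), "commit", "-m", "Sync with CPython main"],
        check=True,
    )

if __name__ == "__main__":
    sync_backport()
```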
While I think that's cleaner, it is a major change to how these modules are currently developed, and the implementation might be too complex, so I think it's more likely that we'll go with the backports as the upstream. If so, there are a couple of things I think we should do:
AA-Turner commented on Jan 30, 2025
I think that this makes sense, especially as both of the importlib modules are no longer "provisional". Useful parallels can be drawn with PEP 360, which used to record "externally maintained" packages, and was updated in 2006 to say:
Another parallel is the changes to Pathlib that Barney has recently been making. He published the `pathlib-abc` package as a backport/preview, rather than primary development being in that package.

Whilst having a brief look at the history, I found that Jason noted in a comment from a few years ago that:
The other two recently externally-developed modules seem to be `tomli` and `zoneinfo` (please let me know if I'm missing any).

Three of the four PRs to `tomllib` since it was added as a package were to synchronise with `tomli` as an upstream (#128907, #126428, and #124587). These have each been quite minor, and each has been opened as an individual PR, rather than an omnibus "sync with version X" update.

`zoneinfo` used to have some sync PRs, but the last one seems to be four years ago (#20499), and the backport package has not been updated for two years (last release 2020-06).

There have also been some problems with synchronising the documentation, as the d.p.o documentation used to point to the backport (python/importlib_metadata#485), with one user going so far as to manipulate Sphinx internals (stefan6419846/license_tools#63) to solve this problem. Ultimately, documentation was removed from the backport package (python/importlib_metadata#466).
To Jason's quoted comment above about pace of development eventually slowing, I wonder if at some point we should seek to update the backport packages less frequently, and to mirror Python releases. There is prior art for this with `zipfile3{x}` packages on PyPI. This would ease the burden of the actual backporting, as it would be done less often. The backport package could also use a rebase or fast-forward merge, which would preserve authorship details.
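For illustration, the sync step could be as simple as the following sketch, assuming the backport repository has a remote (here called `cpython-sync`, an assumed name) carrying the commits to mirror:

```python
# Sketch: sync the backport by fast-forward merge so original commit
# authorship survives, in contrast to a squash merge, which collapses
# everything into a single commit attributed to whoever performs the sync.
import subprocess

def git(*args: str, cwd: str = "importlib_metadata") -> None:
    """Run git in the (hypothetical) backport checkout."""
    subprocess.run(["git", "-C", cwd, *args], check=True)

git("fetch", "cpython-sync")                    # assumed remote name
git("merge", "--ff-only", "cpython-sync/main")  # keeps history and authors as-is
```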
As such, I would be in favour of this `python/cpython` repo being the one where new features are developed, discussed, and merged for `importlib.resources` and `importlib.metadata` (and also `tomllib`).

A
jaraco commented on Feb 8, 2025
Thanks @FFY00 for raising this issue. It's been a lingering concern of mine as well, and I've had a lot of thoughts on the matter.
My instinct is the same, that ideally the stdlib should be the canonical implementation and upstream. That's the case for several other backports I maintain (backports.tarfile, singledispatch, configparser, ...).
The main reason I haven't taken the packages in this direction is that the third-party libraries are more capable and thus drastically easier to develop. I have in fact documented the methodology; I should probably link that document from the READMEs of the third-party projects for increased visibility.
In general, the third-party packages get a much more modern, complete, and sophisticated treatment. It's the presence of these documented advantages that has kept me reluctant to move the upstream to the stdlib.
I've been thinking about ways to make the integration (and attribution) better. There are some factors that make the integration more difficult.
In an ideal world, the canonical source for something like "importlib metadata" would exist somewhat independently, be linked into the various target projects, and have customizations overlay and extend the canonical source. I can imagine a couple of ways to model these concerns using VCS tools.
A branch per project
Imagine having a separate branch in CPython for each project, with its history rooted independently of the CPython history. This branch would carry either the raw source or possibly the full third-party package, and when merged into CPython, would track the new location and CPython-specific requirements.
This approach doesn't work in the CPython pull request model due to the squashed merges (the tracking is lost).
That's why instead, each of these projects carries their own cpython branch to track those concerns.
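For concreteness, a sketch of what that merge step might look like, assuming a branch named `importlib-metadata-upstream` (a hypothetical name) whose history is rooted independently of CPython's:

```python
# Sketch: merge an independently-rooted branch into the CPython tree, grafting
# it under Lib/importlib/metadata via the "subtree" merge option.
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

git("merge", "-X", "subtree=Lib/importlib/metadata",
    "importlib-metadata-upstream")

# A squash-merged pull request collapses all of this into a single new commit,
# so the merge tracking (and the original authorship) is lost.
```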
Submodules
Another way to model subsets of an implementation is through Git Submodules. Some companies and projects use submodules as a way to compose larger systems from smaller components.
You could imagine the importlib subprojects to each be a submodule attached at `Lib/importlib/{submodule}`, and have branches off of those submodule repos implement the third-party packages and merge/cherry-pick changes.
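Roughly, the git operations that layout would involve look like the following sketch (the repository URLs are only assumptions about what would be used):

```python
# Sketch: attach each subproject as a submodule under Lib/importlib/.
import subprocess

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

git("submodule", "add",
    "https://github.com/python/importlib_metadata", "Lib/importlib/metadata")
git("submodule", "add",
    "https://github.com/python/importlib_resources", "Lib/importlib/resources")

# The backport packages would then live on branches of those submodule
# repositories, merging or cherry-picking changes from the branch CPython tracks.
```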
This approach is fraught with problems:
- The sources for each subproject live not only under `Lib/importlib/*` but also under `Lib/test/test_importlib/*`.
- The tests could instead be moved to `Lib/importlib/metadata/tests` (or similar), but that would violate established conventions.

This approach could be applied across all of the stdlib, but that would be a highly disruptive migration. In my opinion, it would be a better outcome overall, but I'm not confident it can be executed safely.

Last year, I kicked off work on the essential layout, which aims to solve some of these problems and empower projects to be composable in this way, but it's already had to concede some of the purity of the design (pyproject.toml and .github) and still has some problems yet unsolved (it's incompatible with RTD).
Ultimately, I don't feel these options are very attractive, so I'm left limping along with the current methodology.
I quite like the suggestions Filipe has brought. They all sound reasonable - let's revisit them in light of the documented methodology.
One last thing I wanted to mention - although I dislike it, I sometimes batch several changes from the third-party packages into CPython, mainly because it's a bit of work to get everything synchronized, and re-submitting each contribution in multiple places would involve an impractical amount of toil. If we had automation to mechanically apply changes to both projects together, that would be ideal.