Support on-demand content in repair_metadata #848
Conversation
Good work on the logic; the grouping of the packages by name-version, and then the further grouping by the shared URL of the RAs, looks right (it is definitely a bit complex!).
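For anyone skimming the diff, here is a minimal sketch of that two-level grouping. The `contentartifact_set`/`remoteartifact_set` traversal and the variable names are assumptions based on pulpcore's usual related names, not the PR's exact code:

```python
from collections import defaultdict

# First level: group the on-demand packages by (name, version).
grouped_by_name_version = defaultdict(list)
for package in on_demand_content:
    for ca in package.contentartifact_set.all():
        for ra in ca.remoteartifact_set.all():
            grouped_by_name_version[(package.name, package.version)].append((package, ra))

# Second level: within each release, group the (package, RA) pairs by the
# remote URL that can serve them.
for (name, version), pairs in grouped_by_name_version.items():
    grouped_by_url = defaultdict(list)
    for package, ra in pairs:
        grouped_by_url[ra.remote.url].append((package, ra))
```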
```python
grouped_by_url[ra.remote.url].append((package, ra))

# Prioritize the URL that can serve the most packages
for url, pkg_ra in sorted(
```
`pkg_ra` is a list of `(package, ra)` tuples, so I think this variable should be named to reflect that.
Would `pkg_ra_pairs` be better?
```diff
@@ -189,6 +191,19 @@ def artifact_to_python_content_data(filename, artifact, domain=None):
     return data


+def fetch_json_release_metadata(name: str, version: str, remote: Remote) -> dict:
```
You didn't change this to use the remote's downloader.
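A minimal sketch of what that could look like, assuming pulpcore's `Remote.get_downloader()`/`fetch()` API and PyPI's documented JSON route `{base}/pypi/{name}/{version}/json`:

```python
import json
from urllib.parse import urljoin

from pulpcore.plugin.models import Remote


def fetch_json_release_metadata(name: str, version: str, remote: Remote) -> dict:
    # Build the JSON-API URL relative to the remote's base URL; the exact
    # path is an assumption based on PyPI's JSON API.
    url = urljoin(remote.url, f"pypi/{name}/{version}/json")
    # Using the remote's downloader (rather than a bare HTTP client) means
    # the remote's proxy, auth, and TLS settings are honored.
    downloader = remote.get_downloader(url=url)
    result = downloader.fetch()  # sync wrapper around the async download
    with open(result.path) as fp:
        return json.load(fp)
```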
```python
# Keep track of on-demand packages that need to be repaired
# A package is removed from this variable once it is successfully repaired
pkgs_not_repaired = set(on_demand_content.values_list("pk", flat=True))
```
Why not make this an empty set and then add the packages that we failed to repair? That should be less expensive.
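That is, something like this sketch (names follow the diff; `group_set` is the per-group tracking set suggested further down):

```python
# Start empty and record only failures, instead of materializing every
# on-demand pk up front with values_list().
pkgs_not_repaired = set()
...
# After each group is processed, whatever could not be repaired remains
# in group_set and gets recorded here.
pkgs_not_repaired.update(group_set)
```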
```python
    continue

for package, ra in pkg_ra:
    if package.pk not in pkgs_not_repaired:
```
Let's use `group_list` (which should probably be changed to a set) to track whether we've repaired the package yet, and then remove packages from it as they are processed.
```python
        )
        batch = []
        set_of_update_fields.clear()
pkgs_not_repaired.remove(package.pk)
```
```diff
-pkgs_not_repaired.remove(package.pk)
+group_set.remove(package.pk)
+pkgs_not_repaired.update(group_set)
```
After a group finishes processing, any leftover packages we didn't remove can then be added to `pkgs_not_repaired`.
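Putting the two suggestions together, the per-group bookkeeping might look like this sketch (`sorted_urls` and `pkg_ra_pairs` are placeholder names, not the PR's code):

```python
group_set = {package.pk for package, _ in pkg_ra_pairs}
for url, pairs in sorted_urls:
    if not group_set:
        break  # nothing left to repair in this group
    for package, ra in pairs:
        if package.pk not in group_set:
            continue  # already repaired via an earlier URL
        # ... fetch metadata and repair the package ...
        group_set.remove(package.pk)
# Anything still in group_set could not be repaired by any URL.
pkgs_not_repaired.update(group_set)
```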
```python
# All packages have the same URL, so pick a random remote
remote = pkg_ra[0][1].remote
try:
    json_data = fetch_json_release_metadata(name, version, remote)
```
```diff
+if not group_set:
+    break  # No packages left to repair, move on to the next group
 # All packages have the same URL, so pick a random remote
 remote = pkg_ra[0][1].remote
 try:
     json_data = fetch_json_release_metadata(name, version, remote)
```
By switching to tracking the repaired packages per group, you can break out of the RAs loop early when there are no remaining packages to process.
```python
changed = False
for field, value in new_data.items():
    if getattr(package, field) != value:
        setattr(package, field, value)
        set_of_update_fields.add(field)
        changed = True
if changed:
    batch.append(package)
if len(batch) == 1000:
    total_repaired += len(batch)
    PythonPackageContent.objects.bulk_update(
        batch, set_of_update_fields
    )
    batch = []
    set_of_update_fields.clear()
```
Since this is now repeated twice, we can probably move it into an inline function.
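For example, a closure along these lines (a sketch; the `nonlocal` bookkeeping assumes these names live in the enclosing task function):

```python
def flush_batch():
    # Flush the pending bulk_update and reset the accumulators.
    nonlocal total_repaired
    if batch:
        total_repaired += len(batch)
        PythonPackageContent.objects.bulk_update(batch, set_of_update_fields)
        batch.clear()  # clear in place so the closure keeps the same list
        set_of_update_fields.clear()
```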
This PR extends the `repair_metadata` endpoint for repositories to also support on-demand content, provided the associated remote supports the PyPI JSON version API.

For on-demand content, we expect that:

- `PythonPackageContent` always has a correct `name` and `version`, and one `ContentArtifact`
- `RemoteArtifact` always has a correct `sha256`

Closes #849