Skip to content

Support on-demand content in repair_metadata #848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jobselko
Copy link
Contributor

@jobselko jobselko commented May 11, 2025

This PR extends the repair_metadata endpoint for repositories to also support on-demand content if the associated remote supports the PyPI json version API.

For on-demand content, we expect that:

  1. PythonPackageContent always has correct name and version, and one ContentArtifact
  2. RemoteArtifact always has correct sha256

Closes #849

@jobselko jobselko self-assigned this May 11, 2025
@jobselko jobselko force-pushed the repair_on_demand_content branch 4 times, most recently from fc1102d to 99180e4 Compare May 16, 2025 18:54
@jobselko jobselko force-pushed the repair_on_demand_content branch 8 times, most recently from 9e27aaf to 04c2e4f Compare May 22, 2025 11:33
@jobselko jobselko force-pushed the repair_on_demand_content branch 2 times, most recently from 9875d50 to 925e392 Compare May 22, 2025 11:38
@jobselko jobselko changed the title DRAFT: Add on_demand content to repair_metadata Support on-demand content in repair_metadata May 22, 2025
@jobselko jobselko marked this pull request as ready for review May 22, 2025 11:59
@jobselko jobselko requested a review from gerrod3 May 22, 2025 12:25
@jobselko jobselko force-pushed the repair_on_demand_content branch 2 times, most recently from 2b14e83 to 79d7acf Compare May 22, 2025 12:51
@jobselko jobselko force-pushed the repair_on_demand_content branch from 79d7acf to b2b74fa Compare May 22, 2025 12:53
@github-actions github-actions bot removed the no-issue label May 22, 2025
@jobselko jobselko force-pushed the repair_on_demand_content branch 2 times, most recently from 42e9834 to 0b91e31 Compare May 22, 2025 13:42
@jobselko jobselko marked this pull request as draft May 22, 2025 18:03
@jobselko jobselko force-pushed the repair_on_demand_content branch from 0b91e31 to 7c3cb5c Compare May 25, 2025 20:40
@jobselko jobselko force-pushed the repair_on_demand_content branch from 7c3cb5c to 95f6e5f Compare May 25, 2025 21:44
@jobselko jobselko marked this pull request as ready for review May 26, 2025 11:43
Copy link
Contributor

@gerrod3 gerrod3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work on the logic, the grouping of the packages by name-version and then further grouping by shared url of the RAs looks right (it is definitely a bit complex!)

grouped_by_url[ra.remote.url].append((package, ra))

# Prioritize the URL that can serve the most packages
for url, pkg_ra in sorted(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg_ra is a list of (package, ra) tuples, so I think this variable should be named to reflect it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would pkg_ra_pairs be better?

@@ -189,6 +191,19 @@ def artifact_to_python_content_data(filename, artifact, domain=None):
return data


def fetch_json_release_metadata(name: str, version: str, remote: Remote) -> dict:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You didn't change this to use the remote's downloader.

Comment on lines +80 to +82
# Keep track of on-demand packages that need to be repaired
# A package is removed from this variable once it is successfully repaired
pkgs_not_repaired = set(on_demand_content.values_list("pk", flat=True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make this an empty set and then add the packages that we failed to repair. Should be less expensive.

continue

for package, ra in pkg_ra:
if package.pk not in pkgs_not_repaired:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use group_list (probably should change it to a set) to track if we repaired the packaged yet. And then remove packages from it when we processed it.

)
batch = []
set_of_update_fields.clear()
pkgs_not_repaired.remove(package.pk)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pkgs_not_repaired.remove(package.pk)
group_set.remove(package.pk)
pkgs_not_repaired.update(group_set)

After a group is finished processing, any left over packages we didn't remove can then be added to pkgs_not_repaired

Comment on lines +129 to +132
# All packages have the same URL, so pick a random remote
remote = pkg_ra[0][1].remote
try:
json_data = fetch_json_release_metadata(name, version, remote)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# All packages have the same URL, so pick a random remote
remote = pkg_ra[0][1].remote
try:
json_data = fetch_json_release_metadata(name, version, remote)
if not group_set:
break # No packages left to repair, move onto the next group
# All packages have the same URL, so pick a random remote
remote = pkg_ra[0][1].remote
try:
json_data = fetch_json_release_metadata(name, version, remote)

By switching to tracking the repaired packages by group you can break out of the RAs loop early if there is no remaining packages to process.

Comment on lines +151 to +165
changed = False
for field, value in new_data.items():
if getattr(package, field) != value:
setattr(package, field, value)
set_of_update_fields.add(field)
changed = True
if changed:
batch.append(package)
if len(batch) == 1000:
total_repaired += len(batch)
PythonPackageContent.objects.bulk_update(
batch, set_of_update_fields
)
batch = []
set_of_update_fields.clear()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is now repeated twice we can probably move it into an inline function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support on-demand content in repair_metadata endpoint
2 participants