Fix #172: Add cache listing and proper retry with shutdown awareness#173
Fix #172: Add cache listing and proper retry with shutdown awareness#173cramt wants to merge 3 commits into
Conversation
…with shutdown awareness This commit addresses GitHub Actions Cache rate limiting issues by implementing two key improvements: 1. List existing cache keys on startup - Fetch all existing cache entries via GitHub REST API on daemon startup - Store keys in memory and skip uploads for paths that already exist - Dramatically reduces redundant uploads of unchanged packages - Adds uploads_skipped metric to track effectiveness 2. Implement proper retry with Retry-After header support - Parse Retry-After header from 429 responses - Retry with exponential backoff (1s, 2s, 4s, 8s, 16s, max 60s) - Retry up to 5 times before tripping circuit breaker - Critically: Stop retrying immediately when shutdown is signaled AND rate limited - Prevents CI from hanging when /api/workflow-finish is called during retry The shutdown behavior ensures we: - Continue retrying during normal operation (resilience) - Exit immediately when both shutdown + 429 occur (responsiveness) - Never hang CI waiting for retry-after during workflow completion Fixes DeterminateSystems#172 Related to DeterminateSystems#147, DeterminateSystems#154
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 57 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughImplemented GitHub Actions Cache rate-limit retry handling with exponential backoff, shutdown signaling, and cache-key discovery/deduplication. API operations now retry rate-limit-like failures, expose shutdown controls, parse Changes
Sequence Diagram(s)sequenceDiagram
participant Main as Main
participant Gha as GhaCache
participant Api as API Layer
participant GitHub as GitHub Actions Cache API
Main->>Gha: new() [async]
activate Gha
Gha->>Api: list_existing_cache_keys()
activate Api
Api->>GitHub: GET /actions/cache (paged)
GitHub-->>Api: 200 OK (pages)
Api-->>Gha: HashSet(existing_keys)
deactivate Api
Gha->>Gha: spawn worker(existing_keys)
deactivate Gha
alt Worker handling upload
Gha->>Gha: upload_path -> compute key
alt key in existing_keys
Gha->>Gha: skip upload, increment uploads_skipped
else
Gha->>Api: execute_with_retry(upload request)
activate Api
Api->>GitHub: POST cache entry
alt 429 / rate-limit
GitHub-->>Api: 429 (Retry-After?)
Api->>Api: parse Retry-After, exponential backoff
Api->>GitHub: retry POST (up to MAX_RETRIES)
end
GitHub-->>Api: success / final error
Api-->>Gha: result or Error::ShuttingDown
deactivate Api
Gha->>Gha: insert key into existing_keys on success
end
end
Main->>Gha: shutdown()
activate Gha
Gha->>Api: signal_shutdown()
Gha->>Gha: drain queue, best-effort uploads (no retries)
deactivate Gha
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Goodness gracious, this is great -- thank you @cramt! I'll get some awareness of this to get it reviewed. Thank you! |
yeah sorry for the big pr, but this kinda needed a lot 😅 |
|
@grahamc sorry to rush you but githubs caching behavior seems to be getting more and more annoying, could you see if this could get reviewed 🙏 |
|
@grahamc any chance this can be reviewed? |
|
@grahamc A review on this would be great! :D |
…limiting # Conflicts: # gha-cache/src/api.rs
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
magic-nix-cache/src/gha.rs (1)
150-170: Drain aftersignal_shutdown()is mostly no-op work — consider just discarding.
shutdown()callsself.api.signal_shutdown()before sendingRequest::Shutdown, so by the time the worker hits this branch the API'sexecute_with_retrywill short-circuit withError::ShuttingDownon the very firstallocate_file_with_random_suffix/upload_filecall. The best-effort drain will therefore fail fast for every queued item, pay the cost ofquery_path_infoand NAR streaming setup, and emit info logs per cancellation.If the intent is really to flush whatever can still succeed before the API is fully closed, consider calling
signal_shutdown()after the drain completes instead of before. Otherwise, dropping the queued items without invokingupload_pathwould be cheaper and equivalent in behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magic-nix-cache/src/gha.rs` around lines 150 - 170, The current shutdown drains the channel and calls upload_path for each Request::Upload while the API has already had shutdown signaled, causing immediate Error::ShuttingDown failures; either move the API.shutdown call to after the drain or skip attempting uploads during Request::Shutdown. Concretely, in the shutdown flow where shutdown() calls api.signal_shutdown(), either (A) change the order so signal_shutdown() is invoked only after the loop that handles Request::Shutdown completes, or (B) modify the Request::Shutdown branch to drop queued Upload requests instead of calling upload_path (avoid calling execute_with_retry / allocate_file_with_random_suffix / upload_file and skip query_path_info and NAR streaming setup). Ensure the chosen approach updates the Request::Shutdown handling and avoids calling upload_path when the API is shutting down.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magic-nix-cache/src/gha.rs`:
- Around line 278-280: The comment claims we insert the full allocated key with
nonce suffix, but the code calls
existing_keys.write().await.insert(narinfo_path.clone()) which inserts the base
"{hash}.narinfo"; either (A) change the insert to use the actual key returned by
allocate_file_with_random_suffix (use the variable that holds the allocated
filename/URL with the "-XXXX" suffix) so existing_keys stores the exact
allocated key, or (B) if intended to store the base key, update the comment to
say the base key (no suffix) is inserted and keep relying on the starts_with
pre-check; refer to existing_keys, narinfo_path, and
allocate_file_with_random_suffix when making the change.
- Around line 222-234: The dedup check is skipping uploads across cache versions
because existing_keys from Api::list_existing_cache_keys() aren't filtered by
cache version; update the fix by either filtering keys by self.version in
Api::list_existing_cache_keys() before populating existing_keys or change the
dedup logic in this function to check the cache_version embedded in each
returned key against self.version when testing keys.iter().any(...); also
deduplicate the repeated string computation by computing the narinfo key once
(currently narinfo_key_prefix and narinfo_path use the same format via
path.to_hash()) and reuse that variable in both the existence check and later
upload logic (reference: Api::list_existing_cache_keys, existing_keys,
narinfo_key_prefix, narinfo_path, path.to_hash()).
---
Nitpick comments:
In `@magic-nix-cache/src/gha.rs`:
- Around line 150-170: The current shutdown drains the channel and calls
upload_path for each Request::Upload while the API has already had shutdown
signaled, causing immediate Error::ShuttingDown failures; either move the
API.shutdown call to after the drain or skip attempting uploads during
Request::Shutdown. Concretely, in the shutdown flow where shutdown() calls
api.signal_shutdown(), either (A) change the order so signal_shutdown() is
invoked only after the loop that handles Request::Shutdown completes, or (B)
modify the Request::Shutdown branch to drop queued Upload requests instead of
calling upload_path (avoid calling execute_with_retry /
allocate_file_with_random_suffix / upload_file and skip query_path_info and NAR
streaming setup). Ensure the chosen approach updates the Request::Shutdown
handling and avoids calling upload_path when the API is shutting down.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 72f36aca-fd93-4475-bc18-b29548da8dd6
📒 Files selected for processing (2)
gha-cache/src/api.rsmagic-nix-cache/src/gha.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- gha-cache/src/api.rs
| // Add to existing_keys to prevent re-upload in same run | ||
| // Note: We use the full key with suffix that was allocated | ||
| existing_keys.write().await.insert(narinfo_path.clone()); |
There was a problem hiding this comment.
Comment contradicts the code: the base key (no nonce suffix) is inserted.
narinfo_path is "{hash}.narinfo", but the actually allocated key is "{hash}.narinfo-XXXX" (see allocate_file_with_random_suffix). The current starts_with pre-check at line 226 still matches, so there is no functional bug today, but the comment "We use the full key with suffix that was allocated" is misleading and will trip up anyone tightening the check to exact match later.
✏️ Proposed clarification
- // Add to existing_keys to prevent re-upload in same run
- // Note: We use the full key with suffix that was allocated
- existing_keys.write().await.insert(narinfo_path.clone());
+ // Add to existing_keys to prevent re-upload in same run.
+ // We insert the base key (without the random nonce suffix) because the
+ // pre-upload check at the top of this function uses `starts_with` against
+ // this same base prefix. If that check is ever changed to an exact match,
+ // this insert must switch to the suffixed key returned by allocation.
+ existing_keys.write().await.insert(narinfo_path.clone());🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@magic-nix-cache/src/gha.rs` around lines 278 - 280, The comment claims we
insert the full allocated key with nonce suffix, but the code calls
existing_keys.write().await.insert(narinfo_path.clone()) which inserts the base
"{hash}.narinfo"; either (A) change the insert to use the actual key returned by
allocate_file_with_random_suffix (use the variable that holds the allocated
filename/URL with the "-XXXX" suffix) so existing_keys stores the exact
allocated key, or (B) if intended to store the base key, update the comment to
say the base key (no suffix) is inserted and keep relying on the starts_with
pre-check; refer to existing_keys, narinfo_path, and
allocate_file_with_random_suffix when making the change.
- Filter cache keys by version in list_existing_cache_keys to prevent cross-version dedup conflicts - Fix shutdown drain to drop queued uploads instead of attempting them after API shutdown has been signaled - Fix misleading comments about key suffix and version filtering - Deduplicate narinfo_path computation
Summary
This PR addresses the GitHub Actions Cache rate limiting issues reported in #172 by implementing two key improvements:
1. List Existing Cache Keys on Startup
Problem: The cache was uploading the entire closure on every build, including unchanged NixOS base packages, leading to excessive uploads and rate limiting.
Solution:
HashSet<String>)uploads_skippedmetricBenefits:
2. Implement Proper Retry with Retry-After Support
Problem: The circuit breaker tripped permanently on the first 429, and there was no retry logic. Additionally, retries could hang CI if they occurred during shutdown.
Solution:
Retry-Afterheader from 429 responses/api/workflow-finish)Benefits:
Implementation Details
Files Modified
gha-cache/src/credentials.rs: Addedgithub_tokenandgithub_repositoryfields for REST API accessgha-cache/src/api.rs:list_existing_cache_keys()method with paginationshutting_downflag andsignal_shutdown()methodexecute_with_retry()wrapper with exponential backoffRetry-Afterheadermagic-nix-cache/src/gha.rs:existing_keysHashSet toGhaCachestructapi.signal_shutdown()inshutdown()methodShuttingDownerror gracefullymagic-nix-cache/src/telemetry.rs: Addeduploads_skippedmetricmagic-nix-cache/src/main.rs: Updated to call asyncGhaCache::new()Shutdown Behavior
The retry logic ensures we balance resilience with responsiveness:
Normal Operation:
During Shutdown:
ShuttingDownerror (no retry)ShuttingDownerrorThis prevents the CI from hanging indefinitely when
/api/workflow-finishis called while rate limited.Usage
The GitHub Action needs to pass
GITHUB_TOKENfor cache listing to work:The
GITHUB_REPOSITORYvariable is already set by GitHub Actions automatically.Testing
The code changes are syntactically correct. Full compilation requires system dependencies (
nix-main,protoc) that are available in the Nix build environment.Expected Impact
Before:
After:
Fixes #172
Related to #147, #154
Summary by CodeRabbit