Fix #172: Add cache listing and proper retry with shutdown awareness (#173)

Open
cramt wants to merge 3 commits into DeterminateSystems:main from cramt:fix/issue-172-rate-limiting

cramt commented Jan 26, 2026

Summary

This PR addresses the GitHub Actions Cache rate limiting issues reported in #172 by implementing two key improvements:

1. List Existing Cache Keys on Startup

Problem: The cache was uploading the entire closure on every build, including unchanged NixOS base packages, leading to excessive uploads and rate limiting.

Solution:

  • On startup, fetch all existing cache entries via GitHub REST API
  • Store cache keys in memory (HashSet<String>)
  • Before uploading any path, check if it already exists and skip if present
  • Track skipped uploads with new uploads_skipped metric

Benefits:

  • One paginated listing on startup instead of N individual checks during upload
  • Dramatically reduces redundant uploads
  • Prevents re-uploading unchanged dependencies
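
As a rough illustration of the pre-upload check described above, here is a minimal sketch. It assumes keys are matched by prefix (the review further down mentions a `starts_with` check against `{hash}.narinfo`); `already_cached` and `plan_uploads` are hypothetical names, not functions from this PR.

```rust
use std::collections::HashSet;

/// Returns true when any known cache key begins with `prefix`.
/// Prefix matching is an assumption of this sketch; an exact-match
/// check would use `HashSet::contains` instead.
fn already_cached(existing_keys: &HashSet<String>, prefix: &str) -> bool {
    existing_keys.iter().any(|key| key.starts_with(prefix))
}

/// Hypothetical helper: splits candidate keys into those that still need
/// an upload and a count of skipped ones (the `uploads_skipped` metric).
fn plan_uploads<'a>(
    existing_keys: &HashSet<String>,
    candidates: &[&'a str],
) -> (Vec<&'a str>, u64) {
    let mut to_upload = Vec::new();
    let mut uploads_skipped = 0;
    for &candidate in candidates {
        if already_cached(existing_keys, candidate) {
            uploads_skipped += 1;
        } else {
            to_upload.push(candidate);
        }
    }
    (to_upload, uploads_skipped)
}
```

Note that the prefix scan is linear in the number of known keys per lookup. That is probably acceptable for a sketch, but a real implementation may prefer to normalize keys so an O(1) `contains` lookup works.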

2. Implement Proper Retry with Retry-After Support

Problem: The circuit breaker tripped permanently on the first 429, and there was no retry logic. Additionally, retries could hang CI if they occurred during shutdown.

Solution:

  • Parse Retry-After header from 429 responses
  • Implement exponential backoff retry (1s, 2s, 4s, 8s, 16s, max 60s)
  • Retry up to 5 times before tripping circuit breaker
  • Critically: Stop retrying immediately when BOTH conditions are met:
    1. Shutdown has been signaled (via /api/workflow-finish)
    2. AND we receive a 429 rate limit error

Benefits:

  • Respects GitHub's rate limits properly
  • Resilient to temporary rate limiting during normal operation
  • Never hangs CI waiting for retry when workflow is finishing
  • Only trips circuit breaker after exhausting all retries
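
To make the delay schedule concrete, here is a minimal sketch of how the wait before each retry could be computed. The constant names mirror the ones listed in the walkthrough (`BASE_RETRY_DELAY_MS`, `MAX_RETRY_DELAY_MS`). The function names are hypothetical. It assumes that only the delta-seconds form of `Retry-After` is handled and that the server hint takes precedence over the computed backoff, which may not match the actual code.

```rust
use std::time::Duration;

const BASE_RETRY_DELAY_MS: u64 = 1_000;
const MAX_RETRY_DELAY_MS: u64 = 60_000;

/// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, 8s, 16s, ...
/// capped at 60s.
fn backoff_delay(attempt: u32) -> Duration {
    // checked_shl returns None for shifts of 64 bits or more; saturate in that case.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let millis = BASE_RETRY_DELAY_MS
        .saturating_mul(factor)
        .min(MAX_RETRY_DELAY_MS);
    Duration::from_millis(millis)
}

/// Parses a Retry-After value given as a whole number of seconds.
/// The HTTP-date form is not handled in this sketch and yields None.
fn parse_retry_after(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}

/// Assumption: prefer the server's hint when it parses, otherwise fall
/// back to exponential backoff.
fn retry_delay(attempt: u32, retry_after: Option<&str>) -> Duration {
    retry_after
        .and_then(parse_retry_after)
        .unwrap_or_else(|| backoff_delay(attempt))
}
```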

Implementation Details

Files Modified

  • gha-cache/src/credentials.rs: Added github_token and github_repository fields for REST API access
  • gha-cache/src/api.rs:
    • Added list_existing_cache_keys() method with pagination
    • Added shutting_down flag and signal_shutdown() method
    • Implemented execute_with_retry() wrapper with exponential backoff
    • Updated error handling to parse Retry-After header
    • Wrapped critical operations (reserve, commit, finalize, get) with retry logic
  • magic-nix-cache/src/gha.rs:
    • Added existing_keys HashSet to GhaCache struct
    • Fetch existing keys on startup (best effort, doesn't fail if unavailable)
    • Check before upload and skip if key exists
    • Call api.signal_shutdown() in shutdown() method
    • Updated worker to handle ShuttingDown error gracefully
  • magic-nix-cache/src/telemetry.rs: Added uploads_skipped metric
  • magic-nix-cache/src/main.rs: Updated to call async GhaCache::new()

Shutdown Behavior

The retry logic ensures we balance resilience with responsiveness:

Normal Operation:

  • Retry on 429 with exponential backoff
  • Keep trying up to 5 attempts
  • Only give up after MAX_RETRIES

During Shutdown:

  • If operation succeeds: Complete normally
  • If 429 received: Immediately return ShuttingDown error (no retry)
  • If waiting for retry: Abort wait and return ShuttingDown error

This prevents the CI from hanging indefinitely when /api/workflow-finish is called while rate limited.
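
The decision rules above can be sketched roughly as follows. This is a blocking, std-only illustration; the real code is async and presumably structured differently. `signal_shutdown`, `is_shutting_down`, and the `ShuttingDown` outcome come from the PR description, while `Outcome`, `on_rate_limited`, and `wait_or_abort` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

const MAX_RETRIES: u32 = 5;

#[derive(Debug, PartialEq)]
enum Outcome {
    Retry,
    ShuttingDown,
    GiveUp,
}

struct Api {
    shutting_down: AtomicBool,
}

impl Api {
    fn new() -> Self {
        Api { shutting_down: AtomicBool::new(false) }
    }

    fn signal_shutdown(&self) {
        self.shutting_down.store(true, Ordering::SeqCst);
    }

    fn is_shutting_down(&self) -> bool {
        self.shutting_down.load(Ordering::SeqCst)
    }

    /// What to do after a 429, given how many retries were already made.
    fn on_rate_limited(&self, retries_so_far: u32) -> Outcome {
        if self.is_shutting_down() {
            Outcome::ShuttingDown
        } else if retries_so_far >= MAX_RETRIES {
            Outcome::GiveUp
        } else {
            Outcome::Retry
        }
    }

    /// Sleeps for `delay` in short slices; returns false as soon as
    /// shutdown is observed, true if the full delay elapsed without one.
    fn wait_or_abort(&self, delay: Duration) -> bool {
        let poll = Duration::from_millis(50);
        let mut remaining = delay;
        while remaining > Duration::from_millis(0) {
            if self.is_shutting_down() {
                return false;
            }
            let step = if remaining < poll { remaining } else { poll };
            thread::sleep(step);
            remaining -= step;
        }
        !self.is_shutting_down()
    }
}
```

Polling in slices bounds the shutdown latency to roughly one poll interval. In async code, selecting between a timer and a shutdown notification would likely avoid the polling altogether.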

Usage

The GitHub Action needs to pass GITHUB_TOKEN for cache listing to work:

- uses: DeterminateSystems/magic-nix-cache-action@main
  env:
    GITHUB_TOKEN: ${{ github.token }}

The GITHUB_REPOSITORY variable is already set by GitHub Actions automatically.

Testing

The code changes are syntactically correct. Full compilation requires system dependencies (nix-main, protoc) that are available in the Nix build environment.

Expected Impact

Before:

  • Uploads entire closure every time (including unchanged base packages)
  • Hits 200/minute rate limit easily
  • Circuit breaker trips permanently on first 429
  • CI can hang if 429 happens during shutdown

After:

  • Skips uploads for paths that already exist
  • Retries with proper backoff on 429
  • Only trips circuit breaker after 5 failed retries
  • Shuts down immediately when requested, even during retry wait

Fixes #172
Related to #147, #154

Summary by CodeRabbit

  • New Features
    • Automatic retries with exponential backoff for API rate limits and graceful abort on shutdown.
    • Preloads and tracks existing cache keys to skip redundant uploads.
    • Shutdown now drains queued uploads and treats shutdown-cancelled operations as expected.
    • New telemetry metric reporting skipped uploads.
    • Optional environment-configurable GitHub token/repository support for cache operations.

…with shutdown awareness

This commit addresses GitHub Actions Cache rate limiting issues by implementing two key improvements:

1. List existing cache keys on startup
   - Fetch all existing cache entries via GitHub REST API on daemon startup
   - Store keys in memory and skip uploads for paths that already exist
   - Dramatically reduces redundant uploads of unchanged packages
   - Adds uploads_skipped metric to track effectiveness

2. Implement proper retry with Retry-After header support
   - Parse Retry-After header from 429 responses
   - Retry with exponential backoff (1s, 2s, 4s, 8s, 16s, max 60s)
   - Retry up to 5 times before tripping circuit breaker
   - Critically: Stop retrying immediately when shutdown is signaled AND rate limited
   - Prevents CI from hanging when /api/workflow-finish is called during retry

The shutdown behavior ensures we:
- Continue retrying during normal operation (resilience)
- Exit immediately when both shutdown + 429 occur (responsiveness)
- Never hang CI waiting for retry-after during workflow completion

Fixes DeterminateSystems#172
Related to DeterminateSystems#147, DeterminateSystems#154

coderabbitai Bot commented Jan 26, 2026

Warning

Rate limit exceeded

@cramt has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 57 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 57 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 26ba0b97-025e-4174-9121-b59b65f7d667

📥 Commits

Reviewing files that changed from the base of the PR and between 0a6f987 and ec3e371.

📒 Files selected for processing (2)
  • gha-cache/src/api.rs
  • magic-nix-cache/src/gha.rs
📝 Walkthrough

Implemented GitHub Actions Cache rate-limit retry handling with exponential backoff, shutdown signaling, and cache-key discovery/deduplication. API operations now retry rate-limit-like failures, expose shutdown controls, parse Retry-After, and provide listing of existing cache keys. Consumer initializes asynchronously and skips redundant uploads while tracking skip metrics.

Changes

| Cohort / File(s) | Summary |
|---|---|
| API: retry, shutdown, cache listing (gha-cache/src/api.rs) | Added execute_with_retry with exponential backoff and retry-after handling; new constants (MAX_RETRIES, BASE_RETRY_DELAY_MS, MAX_RETRY_DELAY_MS); parse Retry-After into Error::ApiError { ... retry_after }; added Error::ShuttingDown; shutdown signalling (signal_shutdown, is_shutting_down) and retry abort on shutdown; added list_existing_cache_keys() with pagination models; wrapped commit_cache, finalize_cache, get_cache_entry, reserve_cache with retry logic. |
| Credentials: env metadata (gha-cache/src/credentials.rs) | Added optional github_token: Option<String> and github_repository: Option<String> fields populated from GITHUB_TOKEN and GITHUB_REPOSITORY env vars (serde aliases and debug-ignore for token). |
| Cache manager: async init, dedupe, shutdown behavior (magic-nix-cache/src/gha.rs) | Made GhaCache::new async and await api.list_existing_cache_keys() to populate existing_keys: Arc<RwLock<HashSet<String>>>; pass existing_keys into worker/upload paths; pre-upload existence check to skip uploads and increment uploads_skipped; insert new keys post-upload; shutdown() signals API shutdown and worker drains queue performing best-effort uploads, treating ShuttingDown as expected. |
| CLI: async constructor use (magic-nix-cache/src/main.rs) | Updated call to GhaCache::new(...) to await the async constructor. |
| Telemetry: skip metric (magic-nix-cache/src/telemetry.rs) | Added uploads_skipped: Metric to TelemetryReport, propagated through initialization and emitted via fact!(recorder, uploads_skipped). |

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main
    participant Gha as GhaCache
    participant Api as API Layer
    participant GitHub as GitHub Actions Cache API

    Main->>Gha: new() [async]
    activate Gha
    Gha->>Api: list_existing_cache_keys()
    activate Api
    Api->>GitHub: GET /actions/cache (paged)
    GitHub-->>Api: 200 OK (pages)
    Api-->>Gha: HashSet(existing_keys)
    deactivate Api
    Gha->>Gha: spawn worker(existing_keys)
    deactivate Gha

    alt Worker handling upload
        Gha->>Gha: upload_path -> compute key
        alt key in existing_keys
            Gha->>Gha: skip upload, increment uploads_skipped
        else
            Gha->>Api: execute_with_retry(upload request)
            activate Api
            Api->>GitHub: POST cache entry
            alt 429 / rate-limit
                GitHub-->>Api: 429 (Retry-After?)
                Api->>Api: parse Retry-After, exponential backoff
                Api->>GitHub: retry POST (up to MAX_RETRIES)
            end
            GitHub-->>Api: success / final error
            Api-->>Gha: result or Error::ShuttingDown
            deactivate Api
            Gha->>Gha: insert key into existing_keys on success
        end
    end

    Main->>Gha: shutdown()
    activate Gha
    Gha->>Api: signal_shutdown()
    Gha->>Gha: drain queue, best-effort uploads (no retries)
    deactivate Gha

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰
With whiskers twitching, I retry and wait,
Backoff measured, patient—no more rate hate.
I spy existing keys and skip the repeat,
Signal shutdown gentle, finish what we meet.
Hop, cache, repeat—small wins, carrot-sweet. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: cache listing functionality and retry mechanism with shutdown awareness, matching the core objectives. |
| Linked Issues check | ✅ Passed | All coding objectives from #172 are met: Retry-After header parsing, exponential backoff (max 5 attempts), non-permanent circuit breaker, interruptible shutdown behavior, and cache-key deduplication to reduce thrashing. |
| Out of Scope Changes check | ✅ Passed | All changes align with #172 objectives: credential fields support API access, retry/shutdown logic addresses rate limiting, cache-key listing reduces thrashing, and telemetry tracks deduplication effectiveness. |

grahamc commented Feb 17, 2026

Member

Goodness gracious, this is great -- thank you @cramt! I'll get some awareness of this to get it reviewed. Thank you!

cramt commented Feb 17, 2026

Author

> Goodness gracious, this is great -- thank you @cramt! I'll get some awareness of this to get it reviewed. Thank you!

yeah sorry for the big pr, but this kinda needed a lot 😅

cramt commented Mar 8, 2026

Author

@grahamc sorry to rush you, but GitHub's caching behavior seems to be getting more and more annoying. Could you see if this could get reviewed? 🙏

dezren39 commented Apr 5, 2026

@grahamc any chance this can be reviewed?

@benkoppe

@grahamc A review on this would be great! :D

…limiting

# Conflicts:
#	gha-cache/src/api.rs

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
magic-nix-cache/src/gha.rs (1)

150-170: Drain after signal_shutdown() is mostly no-op work — consider just discarding.

shutdown() calls self.api.signal_shutdown() before sending Request::Shutdown, so by the time the worker hits this branch the API's execute_with_retry will short-circuit with Error::ShuttingDown on the very first allocate_file_with_random_suffix/upload_file call. The best-effort drain will therefore fail fast for every queued item, pay the cost of query_path_info and NAR streaming setup, and emit info logs per cancellation.

If the intent is really to flush whatever can still succeed before the API is fully closed, consider calling signal_shutdown() after the drain completes instead of before. Otherwise, dropping the queued items without invoking upload_path would be cheaper and equivalent in behavior.
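
For illustration, the second alternative (dropping queued items instead of attempting them) might look roughly like this. It is a synchronous sketch using `std::sync::mpsc` rather than whatever channel the worker actually uses. `Request::Upload` and `Request::Shutdown` are named in this review, while `run_worker` and its return value are hypothetical.

```rust
use std::sync::mpsc::Receiver;

enum Request {
    Upload(String),
    Shutdown,
}

/// Processes requests until shutdown. Returns the paths that were handled
/// and the number of queued uploads discarded at shutdown.
fn run_worker(rx: Receiver<Request>) -> (Vec<String>, usize) {
    let mut uploaded = Vec::new();
    let mut dropped = 0;
    while let Ok(request) = rx.recv() {
        match request {
            // Stand-in for the real upload_path call.
            Request::Upload(path) => uploaded.push(path),
            Request::Shutdown => {
                // Discard whatever is still queued without touching the API,
                // since every call would fail fast once shutdown is signaled.
                while let Ok(queued) = rx.try_recv() {
                    if let Request::Upload(_) = queued {
                        dropped += 1;
                    }
                }
                break;
            }
        }
    }
    (uploaded, dropped)
}
```

Counting the dropped items keeps the behavior observable, for example as a single summary log line instead of one log entry per cancelled upload.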

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magic-nix-cache/src/gha.rs` around lines 150 - 170, The current shutdown
drains the channel and calls upload_path for each Request::Upload while the API
has already had shutdown signaled, causing immediate Error::ShuttingDown
failures; either move the API.shutdown call to after the drain or skip
attempting uploads during Request::Shutdown. Concretely, in the shutdown flow
where shutdown() calls api.signal_shutdown(), either (A) change the order so
signal_shutdown() is invoked only after the loop that handles Request::Shutdown
completes, or (B) modify the Request::Shutdown branch to drop queued Upload
requests instead of calling upload_path (avoid calling execute_with_retry /
allocate_file_with_random_suffix / upload_file and skip query_path_info and NAR
streaming setup). Ensure the chosen approach updates the Request::Shutdown
handling and avoids calling upload_path when the API is shutting down.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magic-nix-cache/src/gha.rs`:
- Around line 278-280: The comment claims we insert the full allocated key with
nonce suffix, but the code calls
existing_keys.write().await.insert(narinfo_path.clone()) which inserts the base
"{hash}.narinfo"; either (A) change the insert to use the actual key returned by
allocate_file_with_random_suffix (use the variable that holds the allocated
filename/URL with the "-XXXX" suffix) so existing_keys stores the exact
allocated key, or (B) if intended to store the base key, update the comment to
say the base key (no suffix) is inserted and keep relying on the starts_with
pre-check; refer to existing_keys, narinfo_path, and
allocate_file_with_random_suffix when making the change.
- Around line 222-234: The dedup check is skipping uploads across cache versions
because existing_keys from Api::list_existing_cache_keys() aren't filtered by
cache version; update the fix by either filtering keys by self.version in
Api::list_existing_cache_keys() before populating existing_keys or change the
dedup logic in this function to check the cache_version embedded in each
returned key against self.version when testing keys.iter().any(...); also
deduplicate the repeated string computation by computing the narinfo key once
(currently narinfo_key_prefix and narinfo_path use the same format via
path.to_hash()) and reuse that variable in both the existence check and later
upload logic (reference: Api::list_existing_cache_keys, existing_keys,
narinfo_key_prefix, narinfo_path, path.to_hash()).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 72f36aca-fd93-4475-bc18-b29548da8dd6

📥 Commits

Reviewing files that changed from the base of the PR and between c82fd30 and 0a6f987.

📒 Files selected for processing (2)
  • gha-cache/src/api.rs
  • magic-nix-cache/src/gha.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • gha-cache/src/api.rs

Comment thread magic-nix-cache/src/gha.rs
Comment thread magic-nix-cache/src/gha.rs Outdated
Comment on lines +278 to +280
// Add to existing_keys to prevent re-upload in same run
// Note: We use the full key with suffix that was allocated
existing_keys.write().await.insert(narinfo_path.clone());

⚠️ Potential issue | 🟡 Minor

Comment contradicts the code: the base key (no nonce suffix) is inserted.

narinfo_path is "{hash}.narinfo", but the actually allocated key is "{hash}.narinfo-XXXX" (see allocate_file_with_random_suffix). The current starts_with pre-check at line 226 still matches, so there is no functional bug today, but the comment "We use the full key with suffix that was allocated" is misleading and will trip up anyone tightening the check to exact match later.

✏️ Proposed clarification
-    // Add to existing_keys to prevent re-upload in same run
-    // Note: We use the full key with suffix that was allocated
-    existing_keys.write().await.insert(narinfo_path.clone());
+    // Add to existing_keys to prevent re-upload in same run.
+    // We insert the base key (without the random nonce suffix) because the
+    // pre-upload check at the top of this function uses `starts_with` against
+    // this same base prefix. If that check is ever changed to an exact match,
+    // this insert must switch to the suffixed key returned by allocation.
+    existing_keys.write().await.insert(narinfo_path.clone());

- Filter cache keys by version in list_existing_cache_keys to prevent
  cross-version dedup conflicts
- Fix shutdown drain to drop queued uploads instead of attempting them
  after API shutdown has been signaled
- Fix misleading comments about key suffix and version filtering
- Deduplicate narinfo_path computation
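
As a rough sketch of the version filtering mentioned in the first bullet: the `CacheEntry` struct, its field names, and `keys_for_version` are assumptions for illustration, not the types used in this PR. It assumes each listed entry carries a `key` and a `version`.

```rust
use std::collections::HashSet;

/// One entry from a cache listing. The field names are assumptions
/// made for this sketch.
struct CacheEntry {
    key: String,
    version: String,
}

/// Keeps only the keys whose entry matches the cache version in use, so
/// entries written under another version never suppress an upload.
fn keys_for_version(entries: &[CacheEntry], version: &str) -> HashSet<String> {
    entries
        .iter()
        .filter(|entry| entry.version == version)
        .map(|entry| entry.key.clone())
        .collect()
}
```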

Development

Successfully merging this pull request may close these issues.

Time sensitive: GitHub Actions Cache rate limits (from GitHub staff)

4 participants