Fix #172: Add cache listing and proper retry with shutdown awareness by cramt · Pull Request #173 · DeterminateSystems/magic-nix-cache

cramt · 2026-01-26T12:32:39Z

Summary

This PR addresses the GitHub Actions Cache rate limiting issues reported in #172 by implementing two key improvements:

1. List Existing Cache Keys on Startup

Problem: The cache was uploading the entire closure on every build, including unchanged NixOS base packages, leading to excessive uploads and rate limiting.

Solution:

On startup, fetch all existing cache entries via GitHub REST API
Store cache keys in memory (HashSet<String>)
Before uploading any path, check if it already exists and skip if present
Track skipped uploads with new uploads_skipped metric

Benefits:

Single API call on startup vs N individual checks during upload
Dramatically reduces redundant uploads
Prevents re-uploading unchanged dependencies
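
The skip-before-upload idea can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `UploadPlanner` and its methods are hypothetical names, and it uses an exact-match lookup for simplicity.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not a type from the PR): holds the keys listed at
/// startup and counts how many uploads were skipped.
pub struct UploadPlanner {
    existing_keys: HashSet<String>,
    pub uploads_skipped: u64,
}

impl UploadPlanner {
    pub fn new(existing_keys: HashSet<String>) -> Self {
        Self { existing_keys, uploads_skipped: 0 }
    }

    /// Returns true if `key` still needs uploading; otherwise records a skip.
    pub fn should_upload(&mut self, key: &str) -> bool {
        if self.existing_keys.contains(key) {
            self.uploads_skipped += 1;
            false
        } else {
            true
        }
    }

    /// Remember a successful upload so the same key is skipped later in this run.
    pub fn mark_uploaded(&mut self, key: &str) {
        self.existing_keys.insert(key.to_string());
    }
}
```

The lookup is an in-memory set membership test, so no extra API request is needed per path once the initial listing has completed.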

2. Implement Proper Retry with Retry-After Support

Problem: The circuit breaker tripped permanently on the first 429, and there was no retry logic. Additionally, retries could hang CI if they occurred during shutdown.

Solution:

Parse Retry-After header from 429 responses
Implement exponential backoff retry (1s, 2s, 4s, 8s, 16s, max 60s)
Retry up to 5 times before tripping circuit breaker
Critically: Stop retrying immediately when BOTH conditions are met:
1. Shutdown has been signaled (via /api/workflow-finish)
2. AND we receive a 429 rate limit error

Benefits:

Respects GitHub's rate limits properly
Resilient to temporary rate limiting during normal operation
Never hangs CI waiting for retry when workflow is finishing
Only trips circuit breaker after exhausting all retries
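
As a rough sketch of the delay schedule described above: the constant names match those mentioned in the review walkthrough, but their values, the choice to cap a server-provided `Retry-After` at the maximum, and the helper functions themselves are assumptions made for illustration.

```rust
use std::time::Duration;

// Assumed values chosen to match the description (5 retries, 1s base, 60s cap).
const MAX_RETRIES: u32 = 5;
const BASE_RETRY_DELAY_MS: u64 = 1_000;
const MAX_RETRY_DELAY_MS: u64 = 60_000;

/// Hypothetical helper: delay before retry number `attempt` (0-based),
/// or `None` once the retry budget is exhausted.
pub fn retry_delay(attempt: u32, retry_after: Option<Duration>) -> Option<Duration> {
    if attempt >= MAX_RETRIES {
        return None;
    }
    let cap = Duration::from_millis(MAX_RETRY_DELAY_MS);
    // Prefer the server's hint; otherwise double the base delay each attempt.
    let delay = retry_after
        .unwrap_or_else(|| Duration::from_millis(BASE_RETRY_DELAY_MS << attempt));
    Some(delay.min(cap))
}

/// Hypothetical parser for the delta-seconds form of `Retry-After`.
/// The HTTP-date form is not handled here and yields `None`.
pub fn parse_retry_after(value: &str) -> Option<Duration> {
    value.trim().parse::<u64>().ok().map(Duration::from_secs)
}
```

With these assumed constants, attempts 0 through 4 wait 1s, 2s, 4s, 8s, and 16s when no `Retry-After` header is present.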

Implementation Details

Files Modified

gha-cache/src/credentials.rs: Added github_token and github_repository fields for REST API access
gha-cache/src/api.rs:
- Added list_existing_cache_keys() method with pagination
- Added shutting_down flag and signal_shutdown() method
- Implemented execute_with_retry() wrapper with exponential backoff
- Updated error handling to parse Retry-After header
- Wrapped critical operations (reserve, commit, finalize, get) with retry logic
magic-nix-cache/src/gha.rs:
- Added existing_keys HashSet to GhaCache struct
- Fetch existing keys on startup (best effort, doesn't fail if unavailable)
- Check before upload and skip if key exists
- Call api.signal_shutdown() in shutdown() method
- Updated worker to handle ShuttingDown error gracefully
magic-nix-cache/src/telemetry.rs: Added uploads_skipped metric
magic-nix-cache/src/main.rs: Updated to call async GhaCache::new()
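
The paginated listing could be structured along these lines. This is a sketch only: `collect_all_keys` is a hypothetical name, and the page fetcher is passed in as a closure so the example does not depend on any particular HTTP client or on the exact shape of the REST response.

```rust
use std::collections::HashSet;

/// Hypothetical pagination loop: calls `fetch_page(1)`, `fetch_page(2)`, ...
/// until a page comes back empty or shorter than `per_page`.
pub fn collect_all_keys<E>(
    per_page: usize,
    mut fetch_page: impl FnMut(usize) -> Result<Vec<String>, E>,
) -> Result<HashSet<String>, E> {
    let mut keys = HashSet::new();
    let mut page = 1;
    loop {
        let batch = fetch_page(page)?;
        // A short or empty page means there is nothing further to request.
        let last_page = batch.is_empty() || batch.len() < per_page;
        keys.extend(batch);
        if last_page {
            break;
        }
        page += 1;
    }
    Ok(keys)
}
```

For the best-effort startup behavior described above, a caller could fall back to an empty set on error, for example with `collect_all_keys(..).unwrap_or_default()`, so that a failed listing only disables skipping rather than failing startup.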

Shutdown Behavior

The retry logic ensures we balance resilience with responsiveness:

Normal Operation:

Retry on 429 with exponential backoff
Keep trying up to 5 attempts
Only give up after MAX_RETRIES

During Shutdown:

If operation succeeds: Complete normally
If 429 received: Immediately return ShuttingDown error (no retry)
If waiting for retry: Abort wait and return ShuttingDown error

This prevents the CI from hanging indefinitely when /api/workflow-finish is called while rate limited.
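
A simplified, synchronous sketch of this decision logic is shown below. It is not the PR's `execute_with_retry`: `retry_unless_shutting_down` and `CallError` are hypothetical names, the real code is async, and the sleep function is injected here so the example can be exercised without waiting.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CallError {
    RateLimited, // stands in for an HTTP 429
    ShuttingDown,
    Other(String),
}

/// Retries `op` on rate limiting, but gives up at once if shutdown is signaled.
pub fn retry_unless_shutting_down<T>(
    shutting_down: &AtomicBool,
    max_retries: u32,
    mut op: impl FnMut() -> Result<T, CallError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, CallError> {
    let mut attempt: u32 = 0;
    loop {
        match op() {
            Err(CallError::RateLimited) => {
                // Shutdown plus 429: do not retry.
                if shutting_down.load(Ordering::SeqCst) {
                    return Err(CallError::ShuttingDown);
                }
                if attempt >= max_retries {
                    return Err(CallError::RateLimited);
                }
                sleep(Duration::from_secs(1u64 << attempt.min(5)));
                attempt += 1;
                // Shutdown may have been signaled while we were waiting.
                if shutting_down.load(Ordering::SeqCst) {
                    return Err(CallError::ShuttingDown);
                }
            }
            // Success and non-rate-limit errors are returned unchanged.
            other => return other,
        }
    }
}
```

An async version would more likely race the sleep against a shutdown notification so the wait is interrupted immediately, rather than checking the flag only after the sleep returns.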

Usage

The GitHub Action needs to pass GITHUB_TOKEN for cache listing to work:

- uses: DeterminateSystems/magic-nix-cache-action@main
  env:
    GITHUB_TOKEN: ${{ github.token }}

The GITHUB_REPOSITORY variable is already set by GitHub Actions automatically.
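
On the daemon side, treating these two variables as optional might look like the sketch below. `RestCredentials` and `rest_credentials` are hypothetical names, not the PR's `credentials.rs`; the lookup is passed in as a closure so the logic can be checked without touching the process environment.

```rust
/// Hypothetical pair of values needed for REST cache listing.
pub struct RestCredentials {
    pub token: String,
    pub repository: String,
}

/// Returns `None` when either variable is missing or empty, in which case
/// cache listing would simply be skipped.
pub fn rest_credentials(lookup: impl Fn(&str) -> Option<String>) -> Option<RestCredentials> {
    let token = lookup("GITHUB_TOKEN").filter(|v| !v.is_empty())?;
    let repository = lookup("GITHUB_REPOSITORY").filter(|v| !v.is_empty())?;
    Some(RestCredentials { token, repository })
}
```

In a real process the lookup would typically be `|name| std::env::var(name).ok()`.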

Testing

The code changes are syntactically correct. Full compilation requires system dependencies (nix-main, protoc) that are available in the Nix build environment.

Expected Impact

Before:

Uploads entire closure every time (including unchanged base packages)
Hits 200/minute rate limit easily
Circuit breaker trips permanently on first 429
CI can hang if 429 happens during shutdown

After:

Skips uploads for paths that already exist
Retries with proper backoff on 429
Only trips circuit breaker after 5 failed retries
Shuts down immediately when requested, even during retry wait

Fixes #172
Related to #147, #154

Summary by CodeRabbit

New Features
- Automatic retries with exponential backoff for API rate limits and graceful abort on shutdown.
- Preloads and tracks existing cache keys to skip redundant uploads.
- Shutdown now drains queued uploads and treats shutdown-cancelled operations as expected.
- New telemetry metric reporting skipped uploads.
- Optional environment-configurable GitHub token/repository support for cache operations.

…with shutdown awareness

This commit addresses GitHub Actions Cache rate limiting issues by implementing two key improvements:

1. List existing cache keys on startup
- Fetch all existing cache entries via GitHub REST API on daemon startup
- Store keys in memory and skip uploads for paths that already exist
- Dramatically reduces redundant uploads of unchanged packages
- Adds uploads_skipped metric to track effectiveness

2. Implement proper retry with Retry-After header support
- Parse Retry-After header from 429 responses
- Retry with exponential backoff (1s, 2s, 4s, 8s, 16s, max 60s)
- Retry up to 5 times before tripping circuit breaker
- Critically: Stop retrying immediately when shutdown is signaled AND rate limited
- Prevents CI from hanging when /api/workflow-finish is called during retry

The shutdown behavior ensures we:
- Continue retrying during normal operation (resilience)
- Exit immediately when both shutdown + 429 occur (responsiveness)
- Never hang CI waiting for retry-after during workflow completion

Fixes DeterminateSystems#172
Related to DeterminateSystems#147, DeterminateSystems#154

coderabbitai · 2026-01-26T12:32:57Z

Warning

Rate limit exceeded

@cramt has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 47 minutes and 57 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 47 minutes and 57 seconds.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 26ba0b97-025e-4174-9121-b59b65f7d667

📥 Commits

Reviewing files that changed from the base of the PR and between 0a6f987 and ec3e371.

📒 Files selected for processing (2)

gha-cache/src/api.rs
magic-nix-cache/src/gha.rs

📝 Walkthrough

Implemented GitHub Actions Cache rate-limit retry handling with exponential backoff, shutdown signaling, and cache-key discovery/deduplication. API operations now retry rate-limit-like failures, expose shutdown controls, parse Retry-After, and provide listing of existing cache keys. Consumer initializes asynchronously and skips redundant uploads while tracking skip metrics.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| API: retry, shutdown, cache listing `gha-cache/src/api.rs` | Added `execute_with_retry` with exponential backoff and retry-after handling; new constants (`MAX_RETRIES`, `BASE_RETRY_DELAY_MS`, `MAX_RETRY_DELAY_MS`); parse `Retry-After` into `Error::ApiError { ... retry_after }`; added `Error::ShuttingDown`; shutdown signalling (`signal_shutdown`, `is_shutting_down`) and retry abort on shutdown; added `list_existing_cache_keys()` with pagination models; wrapped `commit_cache`, `finalize_cache`, `get_cache_entry`, `reserve_cache` with retry logic. |
| Credentials: env metadata `gha-cache/src/credentials.rs` | Added optional `github_token: Option<String>` and `github_repository: Option<String>` fields populated from `GITHUB_TOKEN` and `GITHUB_REPOSITORY` env vars (serde aliases and debug-ignore for token). |
| Cache manager: async init, dedupe, shutdown behavior `magic-nix-cache/src/gha.rs` | Made `GhaCache::new` async and await `api.list_existing_cache_keys()` to populate `existing_keys: Arc<RwLock<HashSet<String>>>`; pass `existing_keys` into worker/upload paths; pre-upload existence check to skip uploads and increment `uploads_skipped`; insert new keys post-upload; `shutdown()` signals API shutdown and worker drains queue performing best-effort uploads, treating `ShuttingDown` as expected. |
| CLI: async constructor use `magic-nix-cache/src/main.rs` | Updated call to `GhaCache::new(...)` to await the async constructor. |
| Telemetry: skip metric `magic-nix-cache/src/telemetry.rs` | Added `uploads_skipped: Metric` to `TelemetryReport`, propagated through initialization and emitted via `fact!(recorder, uploads_skipped)`. |

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main
    participant Gha as GhaCache
    participant Api as API Layer
    participant GitHub as GitHub Actions Cache API

    Main->>Gha: new() [async]
    activate Gha
    Gha->>Api: list_existing_cache_keys()
    activate Api
    Api->>GitHub: GET /actions/cache (paged)
    GitHub-->>Api: 200 OK (pages)
    Api-->>Gha: HashSet(existing_keys)
    deactivate Api
    Gha->>Gha: spawn worker(existing_keys)
    deactivate Gha

    alt Worker handling upload
        Gha->>Gha: upload_path -> compute key
        alt key in existing_keys
            Gha->>Gha: skip upload, increment uploads_skipped
        else
            Gha->>Api: execute_with_retry(upload request)
            activate Api
            Api->>GitHub: POST cache entry
            alt 429 / rate-limit
                GitHub-->>Api: 429 (Retry-After?)
                Api->>Api: parse Retry-After, exponential backoff
                Api->>GitHub: retry POST (up to MAX_RETRIES)
            end
            GitHub-->>Api: success / final error
            Api-->>Gha: result or Error::ShuttingDown
            deactivate Api
            Gha->>Gha: insert key into existing_keys on success
        end
    end

    Main->>Gha: shutdown()
    activate Gha
    Gha->>Api: signal_shutdown()
    Gha->>Gha: drain queue, best-effort uploads (no retries)
    deactivate Gha

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰
With whiskers twitching, I retry and wait,
Backoff measured, patient—no more rate hate.
I spy existing keys and skip the repeat,
Signal shutdown gentle, finish what we meet.
Hop, cache, repeat—small wins, carrot-sweet. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main changes: cache listing functionality and retry mechanism with shutdown awareness, matching the core objectives. |
| Linked Issues check | ✅ Passed | All coding objectives from `#172` are met: Retry-After header parsing, exponential backoff (max 5 attempts), non-permanent circuit breaker, interruptible shutdown behavior, and cache-key deduplication to reduce thrashing. |
| Out of Scope Changes check | ✅ Passed | All changes align with `#172` objectives: credential fields support API access, retry/shutdown logic addresses rate limiting, cache-key listing reduces thrashing, and telemetry tracks deduplication effectiveness. |

grahamc · 2026-02-17T01:15:08Z

Goodness gracious, this is great -- thank you @cramt! I'll get some awareness of this to get it reviewed. Thank you!

cramt · 2026-02-17T09:20:56Z

> Goodness gracious, this is great -- thank you @cramt! I'll get some awareness of this to get it reviewed. Thank you!

yeah sorry for the big pr, but this kinda needed a lot 😅

cramt · 2026-03-08T11:45:53Z

@grahamc sorry to rush you, but GitHub's caching behavior seems to be getting more and more annoying. Could you see if this could get reviewed? 🙏

dezren39 · 2026-04-05T07:19:56Z

@grahamc any chance this can be reviewed?

benkoppe · 2026-04-16T18:16:27Z

@grahamc A review on this would be great! :D

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

magic-nix-cache/src/gha.rs (1)
150-170: Drain after signal_shutdown() is mostly no-op work — consider just discarding.

shutdown() calls self.api.signal_shutdown() before sending Request::Shutdown, so by the time the worker hits this branch the API's execute_with_retry will short-circuit with Error::ShuttingDown on the very first allocate_file_with_random_suffix/upload_file call. The best-effort drain will therefore fail fast for every queued item, pay the cost of query_path_info and NAR streaming setup, and emit info logs per cancellation.

If the intent is really to flush whatever can still succeed before the API is fully closed, consider calling signal_shutdown() after the drain completes instead of before. Otherwise, dropping the queued items without invoking upload_path would be cheaper and equivalent in behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magic-nix-cache/src/gha.rs` around lines 150 - 170, The current shutdown
drains the channel and calls upload_path for each Request::Upload while the API
has already had shutdown signaled, causing immediate Error::ShuttingDown
failures; either move the API.shutdown call to after the drain or skip
attempting uploads during Request::Shutdown. Concretely, in the shutdown flow
where shutdown() calls api.signal_shutdown(), either (A) change the order so
signal_shutdown() is invoked only after the loop that handles Request::Shutdown
completes, or (B) modify the Request::Shutdown branch to drop queued Upload
requests instead of calling upload_path (avoid calling execute_with_retry /
allocate_file_with_random_suffix / upload_file and skip query_path_info and NAR
streaming setup). Ensure the chosen approach updates the Request::Shutdown
handling and avoids calling upload_path when the API is shutting down.
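
Option (B) from the comment above could look roughly like this. It is a synchronous sketch using `std::sync::mpsc` rather than whatever async channel the project actually uses, and `Request` and `run_worker` here are simplified stand-ins rather than the PR's definitions.

```rust
use std::sync::mpsc::Receiver;

/// Simplified stand-in for the worker's request type.
pub enum Request {
    Upload(String),
    Shutdown,
}

/// Processes uploads until `Shutdown` arrives, then discards anything still
/// queued instead of attempting it. Returns `(uploaded, dropped)`.
pub fn run_worker(rx: Receiver<Request>, mut upload: impl FnMut(&str)) -> (usize, usize) {
    let mut uploaded = 0;
    while let Ok(req) = rx.recv() {
        match req {
            Request::Upload(path) => {
                upload(&path);
                uploaded += 1;
            }
            Request::Shutdown => {
                // Drain without calling `upload`: the API would reject these anyway.
                let dropped = rx
                    .try_iter()
                    .filter(|r| matches!(r, Request::Upload(_)))
                    .count();
                return (uploaded, dropped);
            }
        }
    }
    // All senders were dropped without an explicit shutdown request.
    (uploaded, 0)
}
```

Counting the dropped requests makes it possible to emit a single summary log line instead of one log line per cancelled upload.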

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magic-nix-cache/src/gha.rs`:
- Around line 278-280: The comment claims we insert the full allocated key with
nonce suffix, but the code calls
existing_keys.write().await.insert(narinfo_path.clone()) which inserts the base
"{hash}.narinfo"; either (A) change the insert to use the actual key returned by
allocate_file_with_random_suffix (use the variable that holds the allocated
filename/URL with the "-XXXX" suffix) so existing_keys stores the exact
allocated key, or (B) if intended to store the base key, update the comment to
say the base key (no suffix) is inserted and keep relying on the starts_with
pre-check; refer to existing_keys, narinfo_path, and
allocate_file_with_random_suffix when making the change.
- Around line 222-234: The dedup check is skipping uploads across cache versions
because existing_keys from Api::list_existing_cache_keys() aren't filtered by
cache version; update the fix by either filtering keys by self.version in
Api::list_existing_cache_keys() before populating existing_keys or change the
dedup logic in this function to check the cache_version embedded in each
returned key against self.version when testing keys.iter().any(...); also
deduplicate the repeated string computation by computing the narinfo key once
(currently narinfo_key_prefix and narinfo_path use the same format via
path.to_hash()) and reuse that variable in both the existence check and later
upload logic (reference: Api::list_existing_cache_keys, existing_keys,
narinfo_key_prefix, narinfo_path, path.to_hash()).
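
The version filtering suggested in the second finding might be sketched as follows. `CacheEntry` is a hypothetical struct that assumes each listed entry exposes a key and a version string, and `keys_for_version` is an illustrative name rather than a function in the PR.

```rust
use std::collections::HashSet;

/// Hypothetical listing entry; assumes the listing exposes both fields.
pub struct CacheEntry {
    pub key: String,
    pub version: String,
}

/// Keeps only keys whose cache version matches ours, so an entry written
/// under a different version never suppresses an upload.
pub fn keys_for_version(entries: Vec<CacheEntry>, version: &str) -> HashSet<String> {
    entries
        .into_iter()
        .filter(|e| e.version == version)
        .map(|e| e.key)
        .collect()
}
```

Filtering once at listing time keeps the per-upload check a plain set lookup, instead of comparing versions on every call.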

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 72f36aca-fd93-4475-bc18-b29548da8dd6

📥 Commits

Reviewing files that changed from the base of the PR and between c82fd30 and 0a6f987.

📒 Files selected for processing (2)

gha-cache/src/api.rs
magic-nix-cache/src/gha.rs

🚧 Files skipped from review as they are similar to previous changes (1)

gha-cache/src/api.rs

coderabbitai · 2026-04-17T07:49:27Z

+    // Add to existing_keys to prevent re-upload in same run
+    // Note: We use the full key with suffix that was allocated
+    existing_keys.write().await.insert(narinfo_path.clone());


⚠️ Potential issue | 🟡 Minor

Comment contradicts the code: the base key (no nonce suffix) is inserted.

narinfo_path is "{hash}.narinfo", but the actually allocated key is "{hash}.narinfo-XXXX" (see allocate_file_with_random_suffix). The current starts_with pre-check at line 226 still matches, so there is no functional bug today, but the comment "We use the full key with suffix that was allocated" is misleading and will trip up anyone tightening the check to exact match later.

✏️ Proposed clarification

- // Add to existing_keys to prevent re-upload in same run
- // Note: We use the full key with suffix that was allocated
- existing_keys.write().await.insert(narinfo_path.clone());
+ // Add to existing_keys to prevent re-upload in same run.
+ // We insert the base key (without the random nonce suffix) because the
+ // pre-upload check at the top of this function uses `starts_with` against
+ // this same base prefix. If that check is ever changed to an exact match,
+ // this insert must switch to the suffixed key returned by allocation.
+ existing_keys.write().await.insert(narinfo_path.clone());
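
To illustrate why inserting the base key is compatible with a `starts_with` pre-check, here is a small sketch. `already_cached` is a hypothetical helper, and the suffix format in the usage below is only an example.

```rust
use std::collections::HashSet;

/// Hypothetical prefix check: true if any stored key begins with `base_key`.
/// Matches both suffixed keys from a listing ("{hash}.narinfo-XXXX") and base
/// keys inserted during the current run ("{hash}.narinfo").
pub fn already_cached(existing: &HashSet<String>, base_key: &str) -> bool {
    existing.iter().any(|k| k.starts_with(base_key))
}
```

This scan is linear in the number of stored keys. An exact-match lookup would be cheaper, but it would then require storing and comparing the suffixed key, which is the caveat the proposed comment spells out.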

- Filter cache keys by version in list_existing_cache_keys to prevent cross-version dedup conflicts
- Fix shutdown drain to drop queued uploads instead of attempting them after API shutdown has been signaled
- Fix misleading comments about key suffix and version filtering
- Deduplicate narinfo_path computation

cramt mentioned this pull request Jan 26, 2026

Time sensitive: GitHub Actions Cache rate limits (from GitHub staff) #172

Open

Merge remote-tracking branch 'upstream/main' into fix/issue-172-rate-…

0a6f987

…limiting # Conflicts: # gha-cache/src/api.rs

coderabbitai Bot reviewed Apr 17, 2026

This was referenced Jun 15, 2026

fix(ci): use magic-nix-cache PR #173 to reduce rate limiting johnspade/mybriefcase-bookmarks#37

Closed

fix(ci): use magic-nix-cache with rate-limit fix johnspade/mybriefcase-bookmarks#47

Merged

cramt wants to merge 3 commits into DeterminateSystems:main from cramt:fix/issue-172-rate-limiting