Closed
Description
CI Run Link: https://github.com/coder/coder/actions/runs/19636845700
Branch: main
Commit: 6882c43b391ebc0762e22b5d52984333c3fd1603 (Atif Ali) — coder/coder@6882c43
Timing: Failures occurred within the same minute as the Slack alert (required failed at 14:07:33Z; Slack at 14:07:38Z).
Summary
- Offlinedocs job failed during Install Protoc: unzip reported an invalid/partial archive.
- Build job failed during Install rcodesign: wget received 503 Service Unavailable from GitHub Releases.
- All test suites (Go, race, JS, E2E) passed. The ‘required’ aggregator failed because some required jobs failed (offlinedocs, build).
Evidence
Offlinedocs (job id 56229725255):
mkdir -p /tmp/proto
pushd /tmp/proto
curl -L -o protoc.zip https://github.com/protocolbuffers/protobuf/releases/download/v23.4/protoc-23.4-linux-x86_64.zip
unzip protoc.zip
...
End-of-central-directory signature not found. Either this file is not a zipfile...
unzip: cannot find zipfile directory in one of protoc.zip or protoc.zip.zip
##[error]Process completed with exit code 9.
Build (job id 56230013463):
wget -O /tmp/rcodesign.tar.gz https://github.com/indygreg/apple-platform-rs/releases/download/apple-codesign%2F0.22.0/apple-codesign-0.22.0-x86_64-unknown-linux-musl.tar.gz
...
HTTP request sent, awaiting response... 503 Service Unavailable
2025-11-24 14:02:43 ERROR 503: Service Unavailable.
##[error]Process completed with exit code 8.
Required aggregator:
- Fails after detecting failed required checks (as expected).
Classification
- Infrastructure: external artifact downloads (GitHub Releases) were unavailable/partial.
- Not a test flake, not a data race, and no panic/OOM indicators.
Duplicates search (coder/internal)
- Queries:
- offlinedocs protoc unzip
- rcodesign 503 Service Unavailable apple-codesign
- external download failure GitHub Releases
- required checks failed offlinedocs build
- No open/closed duplicates found specific to protoc unzip/rcodesign 503.
- Related but different: flake: CI required checks aggregator fails when gen job is cancelled #1081 (required aggregator surfaced a cancelled gen job; different failure mode).
Ownership analysis (component area)
- Affects CI workflows (.github/workflows/ci.yaml). Recent substantive CI ownership signals include:
- Ethan Dickson: required checks/CI gating changes (#20131, #19722) and CI failure notification tuning.
- Cian Johnston: temporary infra workarounds in CI (e.g., get.helm.sh outage) later reverted.
- Multiple routine dependency bumps by dependabot.
- This appears to be general CI reliability (DevEx/Release Eng) rather than a product test owner.
- Assignment is ambiguous because the offlinedocs/build install steps have no specific workflow owner; leaving this unassigned and tagging DevEx/CI maintainers for triage.
Root cause and mitigation
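- Root cause: Transient failures when downloading external artifacts from GitHub Releases (partial/invalid zip; HTTP 503). The protoc step runs curl without --fail, so an HTTP error body may have been saved as protoc.zip; the logs do not confirm this.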
- Mitigations to reduce flakes:
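- Add retries with backoff to the curl/wget steps. For curl, e.g. curl --fail --retry 5 --retry-all-errors, which backs off exponentially by default; adding --retry-delay 2 switches to a fixed 2-second delay instead. For wget, e.g. wget --tries=5 --waitretry=2 --retry-connrefused --retry-on-http-error=503 --no-verbose; as the build log shows, wget gives up on a 503 immediately unless that status is listed as retryable.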
- Validate archives before unzip/tar (check magic bytes; unzip -t; tar -tzf) and retry on failure.
- Consider mirroring critical artifacts (protoc/rcodesign) to a reliable cache or use package managers when possible.
- Prefer GitHub’s API with asset IDs + retries over plain URLs to avoid redirect/edge cache hiccups.
- Optionally gate these steps behind a small wrapper script with built-in retries and checksum validation to standardize behavior across jobs.
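The retry mitigation above can be sketched as a small shell function. This is a minimal sketch rather than the project's actual helper: the name retry_with_backoff, the RETRY_INITIAL_DELAY variable, and the default 2-second starting delay are assumptions made for illustration. It runs whatever command it is given, doubling the wait after each failed attempt, and assumes bash or another shell that supports local.

```shell
# Hypothetical helper (not part of ci.yaml): run a command up to
# max_attempts times, doubling the delay after each failure.
# RETRY_INITIAL_DELAY is an illustrative knob; it defaults to 2 seconds.
retry_with_backoff() {
  local max_attempts="$1"
  shift
  local delay="${RETRY_INITIAL_DELAY:-2}"
  local attempt=1

  while true; do
    # Run the wrapped command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example call (commented out so the sketch has no network dependency):
# retry_with_backoff 5 curl --fail --location --output protoc.zip "$PROTOC_URL"
```

Wrapping the whole command keeps the behavior the same for curl, wget, or an unzip -t check, at the cost of not reusing each tool's own retry flags. PROTOC_URL in the commented example is a placeholder.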
Reproduction/Next steps
- Re-run the workflow to confirm transient nature.
- Update ci.yaml offlinedocs Install Protoc step and build Install rcodesign step to include retries and validation.
- Optionally add a central download helper script used by both jobs.
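One piece of such a shared helper could be the magic-byte check listed under mitigations. The sketch below is illustrative only: the function names archive_kind and require_archive are hypothetical, and it recognizes just the two formats involved here (a zip local-file header, bytes 50 4b 03 04, and a gzip header, bytes 1f 8b). It assumes od and tr are available and a shell that supports local.

```shell
# Hypothetical helpers (not part of ci.yaml).

# Print "zip", "gzip", or "unknown" based on the first bytes of a file.
archive_kind() {
  local magic
  # Read the first four bytes as lowercase hex with no separators.
  magic="$(od -An -tx1 -N4 "$1" | tr -d ' \n')"
  case "$magic" in
    504b0304*) echo zip ;;   # "PK\003\004": zip local file header
    1f8b*) echo gzip ;;      # gzip magic number
    *) echo unknown ;;
  esac
}

# Return non-zero (with a message on stderr) when the file is not the
# expected kind, so a CI step can fail fast or trigger a retry.
require_archive() {
  local expected="$1"
  local path="$2"
  local actual
  actual="$(archive_kind "$path")"
  if [ "$actual" != "$expected" ]; then
    echo "expected ${expected} archive but ${path} looks like: ${actual}" >&2
    return 1
  fi
}

# Example calls (commented out; paths are placeholders):
# require_archive zip protoc.zip && unzip protoc.zip
# require_archive gzip /tmp/rcodesign.tar.gz && tar -xzf /tmp/rcodesign.tar.gz
```

This only catches a file that is plainly not an archive, such as a saved error page. A truncated download still begins with valid magic bytes, so unzip -t, tar -tzf, or a checksum comparison is still needed to detect partial files.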
Quality Checklist
- Downloaded and reviewed complete logs for failing jobs
- Verified timing alignment with Slack alert
- Searched coder/internal with multiple query patterns, including closed issues last 30 days
- Classified root cause (Infrastructure) and provided concrete evidence
- Not a duplicate of existing issue
- Assignment based on component area (CI infra); not PR/commit author