flake: Build/offlinedocs external download failures (protoc unzip; rcodesign 503) #1144

@flake-investigator

Description

CI Run Link: https://github.com/coder/coder/actions/runs/19636845700

Branch: main
Commit: 6882c43b391ebc0762e22b5d52984333c3fd1603 (Atif Ali) — coder/coder@6882c43
Timing: The ‘required’ job failed at 14:07:33Z, within the same minute as the Slack alert (14:07:38Z).

Summary

  • Offlinedocs job failed during Install Protoc: unzip reported an invalid/partial archive.
  • Build job failed during Install rcodesign: wget received 503 Service Unavailable from GitHub Releases.
  • All test suites (Go, race, JS, E2E) passed. The ‘required’ aggregator failed only because two of the jobs it depends on (offlinedocs, build) failed.

Evidence
Offlinedocs (job id 56229725255):

mkdir -p /tmp/proto
pushd /tmp/proto
curl -L -o protoc.zip https://github.com/protocolbuffers/protobuf/releases/download/v23.4/protoc-23.4-linux-x86_64.zip
unzip protoc.zip
...
End-of-central-directory signature not found. Either this file is not a zipfile...
unzip:  cannot find zipfile directory in one of protoc.zip or protoc.zip.zip
##[error]Process completed with exit code 9.

Build (job id 56230013463):

wget -O /tmp/rcodesign.tar.gz https://github.com/indygreg/apple-platform-rs/releases/download/apple-codesign%2F0.22.0/apple-codesign-0.22.0-x86_64-unknown-linux-musl.tar.gz
...
HTTP request sent, awaiting response... 503 Service Unavailable
2025-11-24 14:02:43 ERROR 503: Service Unavailable.
##[error]Process completed with exit code 8.

Required aggregator:

  • Fails after detecting failed required checks (as expected).

Classification

  • Infrastructure: external artifact downloads (GitHub Releases) were unavailable/partial.
  • Not a test flake, not a data race, and no panic/OOM indicators.

Duplicates search (coder/internal)

  • Queries:
    • offlinedocs protoc unzip
    • rcodesign 503 Service Unavailable apple-codesign
    • external download failure GitHub Releases
    • required checks failed offlinedocs build
  • No open/closed duplicates found specific to protoc unzip/rcodesign 503.
  • Related but different: flake: CI required checks aggregator fails when gen job is cancelled #1081 (required aggregator surfaced a cancelled gen job; different failure mode).

Ownership analysis (component area)

  • Affects CI workflows (.github/workflows/ci.yaml). Recent substantive CI ownership signals include:
    • Ethan Dickson: required checks/CI gating changes (#20131, #19722) and CI failure notification tuning.
    • Cian Johnston: temporary infra workarounds in CI (e.g., get.helm.sh outage) later reverted.
    • Multiple routine dependency bumps by dependabot.
  • This appears to be general CI reliability (DevEx/Release Eng) rather than a product test owner.
  • Assignment is ambiguous because the offlinedocs/build install steps have no specific workflow owner; leaving unassigned and tagging DevEx/CI maintainers for triage.

Root cause and mitigation

  • Root cause: Transient failures when downloading external artifacts from GitHub Releases (partial/invalid zip; HTTP 503).
  • Mitigations to reduce flakes:
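    • Add retries with backoff to the curl/wget steps. For curl, e.g. curl --fail --retry 5 --retry-all-errors: --fail makes curl exit non-zero on an HTTP error instead of saving the error body as the archive, and --retry without --retry-delay uses curl's built-in exponential backoff (adding --retry-delay 2 would replace it with a fixed 2-second delay). For wget, e.g. wget --tries=5 --waitretry=2 --retry-connrefused --retry-on-http-error=503 --no-verbose: --waitretry backs off linearly, and without --retry-on-http-error wget does not retry a 503.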
    • Validate archives before unzip/tar (check magic bytes; unzip -t; tar -tzf) and retry on failure.
    • Consider mirroring critical artifacts (protoc/rcodesign) to a reliable cache or use package managers when possible.
    • Prefer GitHub’s API with asset IDs + retries over plain URLs to avoid redirect/edge cache hiccups.
    • Optionally gate these steps behind a small wrapper script with built-in retries and checksum validation to standardize behavior across jobs.
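The retry and validation mitigations above could be sketched as two small shell functions. This is only an illustrative sketch, not what ci.yaml does today: the names retry_with_backoff and validate_archive are hypothetical, and it assumes head, od, tar and unzip are available on the runner.

```shell
#!/usr/bin/env bash
set -eu

# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND...
# (hypothetical helper) Runs COMMAND until it succeeds, doubling the
# sleep after every failed attempt; gives up after MAX_ATTEMPTS.
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))
  done
}

# validate_archive FILE
# (hypothetical helper) Checks the leading magic bytes, then asks the
# matching tool to test the archive without extracting it.
validate_archive() {
  local file="$1" magic
  magic=$(head -c 4 "$file" | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    504b0304) unzip -tq "$file" >/dev/null ;;  # "PK\003\004": zip
    1f8b*)    tar -tzf "$file" >/dev/null ;;   # gzip-compressed tar
    *)        echo "not a zip or gzip archive: $file" >&2; return 1 ;;
  esac
}
```

A step could then call, for example, retry_with_backoff 5 2 curl --fail -L -o protoc.zip "$URL" followed by validate_archive protoc.zip. To also retry when validation fails, wrap the download and the validation in one function and pass that function to retry_with_backoff.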

Reproduction/Next steps

  • Re-run the workflow to confirm transient nature.
  • Update ci.yaml offlinedocs Install Protoc step and build Install rcodesign step to include retries and validation.
  • Optionally add a central download helper script used by both jobs.
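A central download helper shared by both jobs might look roughly like the sketch below. It is a starting point under stated assumptions rather than a finished script: fetch_artifact and verify_sha256 are hypothetical names, sha256sum is assumed to exist on the Linux runner, curl is assumed to be new enough to support --retry-all-errors (7.71.0 or later), and the expected checksums would have to be pinned by the maintainers.

```shell
#!/usr/bin/env bash
set -eu

# verify_sha256 FILE EXPECTED_HEX
# (hypothetical helper) Fails unless FILE hashes to EXPECTED_HEX.
verify_sha256() {
  local file="$1" expected="$2" actual
  actual=$(sha256sum "$file" | cut -d ' ' -f 1)
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file: got $actual, want $expected" >&2
    return 1
  fi
}

# fetch_artifact URL DEST EXPECTED_SHA256
# (hypothetical helper) Downloads to a temporary name, verifies the
# checksum, and only then moves the file into place, so a partial or
# corrupted download never appears at DEST.
fetch_artifact() {
  local url="$1" dest="$2" sha256="$3"
  local partial="${dest}.partial"
  # --fail: exit non-zero on HTTP errors such as 503.
  # --retry without --retry-delay: curl's built-in exponential backoff.
  curl --fail --location --silent --show-error \
       --retry 5 --retry-all-errors \
       --output "$partial" "$url"
  if ! verify_sha256 "$partial" "$sha256"; then
    rm -f "$partial"
    return 1
  fi
  mv "$partial" "$dest"
}
```

Both the Install Protoc and Install rcodesign steps could then source this file and call fetch_artifact with the release URL, the destination path, and a pinned checksum before running unzip or tar.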

Quality Checklist

  • Downloaded and reviewed complete logs for failing jobs
  • Verified timing alignment with Slack alert
  • Searched coder/internal with multiple query patterns, including issues closed in the last 30 days
  • Classified root cause (Infrastructure) and provided concrete evidence
  • Not a duplicate of existing issue
  • Assignment based on component area (CI infra); not PR/commit author
