feat: Improve ArgoCD sync action resilience (async sync + error visibility) #152

@gandalf-at-lerian

Description

Request Type

Performance improvement

Affected Workflow (if applicable)

Infrastructure (gitops-update, build, helm-update-chart, dispatch-helm)

Problem / Motivation

The github-actions-argocd-sync action can report a failure even when the underlying sync succeeds. The failure was first attributed to the ArgoCD server taking longer than the CLI's default timeout (~90s) to respond to a sync request, but that hypothesis turned out to be wrong (see the root cause below). It was observed on the Reporter repo (run #23253916955), where firmino-reporter-dev failed after 5 sync attempts even though the sync had been applied on ArgoCD and the new image was running correctly in dev.

Root cause (confirmed): The argocd app sync command completes the sync (it logs "successfully synced (all tasks run)"), but the CLI then exits with code 1 because orphaned resources require pruning:

{"level":"fatal","msg":"2 resources require pruning","time":"2026-03-18T14:39:07-03:00"}

The orphaned resources are:

  • ClusterRole reporter-manager-midaz-plugins-dev
  • ClusterRoleBinding reporter-manager-midaz-plugins-dev

These were left behind after a rename from namespace-suffixed names to plain reporter-manager. The current entrypoint.sh redirects all output of the sync command to /dev/null, which hides this error. The script then retries 5 times; each retry syncs successfully but exits 1 again because of the same pruning requirement.

The initial hypothesis (a timeout) was incorrect. The roughly one minute per attempt was the actual sync duration, not a timeout, and the exit code 1 came from the pruning fatal log, not from a gRPC timeout.
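To illustrate why the redirect made this so hard to diagnose, here is a minimal sketch that uses a stand-in function instead of the real CLI. The name fake_sync is hypothetical; it only mimics the behavior described above (a success message, a fatal log line, exit code 1) and is not how argocd is implemented.

```shell
# Stand-in for `argocd app sync` (hypothetical): prints the success message,
# then the fatal pruning log on stderr, and exits with code 1.
fake_sync() {
  echo 'successfully synced (all tasks run)'
  echo '{"level":"fatal","msg":"2 resources require pruning"}' >&2
  return 1
}

# Current behavior: everything is discarded, only the exit code survives.
hidden_output=$(fake_sync > /dev/null 2>&1)
hidden_status=$?

# Proposed behavior: keep stdout and stderr so the log shows the reason.
visible_output=$(fake_sync 2>&1)
visible_status=$?

echo "hidden:  status=$hidden_status output='$hidden_output'"
echo "visible: status=$visible_status"
echo "$visible_output"
```

In both cases the exit status is 1, but only the second form leaves the pruning message in the GitHub Actions log.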

Proposed Solution

Changes to github-actions-argocd-sync/entrypoint.sh:

  1. Remove > /dev/null 2>&1 from the sync command — expose the actual error message so failures are diagnosable from the GitHub Actions log. This is the most critical change — without it, the real error is invisible.

  2. Add --prune flag support — new optional input prune (default: false). When enabled, pass --prune to argocd app sync so orphaned resources are cleaned up automatically during sync. This prevents the "resources require pruning" fatal from causing false failures.

  3. Use --async on argocd app sync — fire the sync without waiting for completion. The script already has an argocd app wait step afterward that handles the confirmation. This separates sync dispatch from sync verification.

  4. Increase retry interval from 5s to 30s — give time for a previous sync attempt to complete before retrying.

  5. Add explicit --timeout to the sync and wait commands (e.g., --timeout 180, in seconds) for predictable behavior regardless of CLI defaults.
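The five changes above could be combined roughly as follows. This is a sketch under assumptions, not the current entrypoint.sh: the function name sync_with_retry, its parameter order, and its defaults are hypothetical, and the real script may build the app name and read its inputs differently. Only the flags named in this issue (--async, --prune, --timeout) and the existing argocd app wait step are used.

```shell
# Hypothetical helper: dispatch a sync, then verify it with `argocd app wait`.
# Args: app name, prune ("true"/"false"), timeout (s), attempts, retry interval (s).
# Written for POSIX sh, so the variables below are not function-local.
sync_with_retry() {
  app="$1"; prune="${2:-false}"; timeout="${3:-180}"
  attempts="${4:-5}"; interval="${5:-30}"

  # Build the sync arguments; --prune is opt-in.
  set -- app sync "$app" --async --timeout "$timeout"
  if [ "$prune" = "true" ]; then
    set -- "$@" --prune
  fi

  i=1
  while [ "$i" -le "$attempts" ]; do
    # No redirect to /dev/null: any CLI error stays visible in the job log.
    if argocd "$@"; then
      # Dispatch succeeded; verification is a separate step.
      argocd app wait "$app" --timeout "$timeout"
      return $?
    fi
    echo "sync attempt $i/$attempts failed" >&2
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 1
}
```

A call such as sync_with_retry firmino-reporter-dev true 180 5 30 would dispatch the sync with pruning enabled and then wait for it. Whether a failed wait should also trigger a retry is left open here; the sketch returns the wait status as-is.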

Alternatives Considered

  • Only removing /dev/null (helps diagnosis but doesn't prevent the failure)
  • Always pruning (risky in production — better as opt-in flag)
  • Adding --force to sync retries (risky, could cause unintended overwrites)

Example Usage

# Existing usage remains the same (backward compatible)
- uses: LerianStudio/github-actions-argocd-sync@main
  with:
    app-name: firmino-reporter
    argo-cd-token: ${{ secrets.ARGOCD_TOKEN }}
    argo-cd-url: ${{ secrets.ARGOCD_URL }}
    env-prefix: dev
    skip-if-not-exists: true

# New: with safe pruning enabled
- uses: LerianStudio/github-actions-argocd-sync@main
  with:
    app-name: firmino-reporter
    argo-cd-token: ${{ secrets.ARGOCD_TOKEN }}
    argo-cd-url: ${{ secrets.ARGOCD_URL }}
    env-prefix: dev
    skip-if-not-exists: true
    prune: true

Would This Be a Breaking Change?

No — fully backward compatible

Checklist

  • I searched existing issues and this is not a duplicate.
  • This feature aligns with the repository's goal of providing reusable, organization-wide workflows.

Additional Context

  • Related Jira ticket: DSINT-860
  • Reported by Arthur Ribeiro in #devops-team
  • Investigated by Lucas Bedatty — confirmed root cause via local sync without /dev/null redirect
  • Orphaned resources from PR fix/reporter-cluster-role-unique-names (March 12) — namespace suffix added then reverted, old resources left behind

Labels

enhancement: New feature or request
