Description
Request Type
Performance improvement
Affected Workflow (if applicable)
Infrastructure (gitops-update, build, helm-update-chart, dispatch-helm)
Problem / Motivation
The github-actions-argocd-sync action fails intermittently on sync. The failure was first attributed to the ArgoCD server taking longer than the CLI's default timeout (~90s) to respond to a sync request. It was observed on the Reporter repo (run #23253916955), where firmino-reporter-dev failed after 5 sync attempts even though the sync had been applied successfully on ArgoCD and the new image was running correctly in dev.
Root cause (confirmed): The argocd app sync command completes the sync successfully and reports "successfully synced (all tasks run)", but the CLI then exits with code 1 because orphaned resources require pruning:
{"level":"fatal","msg":"2 resources require pruning","time":"2026-03-18T14:39:07-03:00"}
The orphaned resources are:
- ClusterRole reporter-manager-midaz-plugins-dev
- ClusterRoleBinding reporter-manager-midaz-plugins-dev
These were left behind after a rename from namespace-suffixed names to plain reporter-manager. The current entrypoint.sh redirects all output to /dev/null, hiding this error. It then retries 5 times — each retry successfully syncs but also exits 1 due to the same pruning requirement.
Initial hypothesis (timeout) was incorrect. The ~1min per attempt was the actual sync duration, not a timeout. The exit code 1 was from the pruning fatal log, not a gRPC timeout.
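Since the pruning message is what separates this case from a genuine sync failure, a script could look for it explicitly once the output is no longer discarded. A minimal sketch, assuming the CLI output has been captured as text; the helper name needs_pruning is hypothetical, and the match string is taken from the log line quoted above:

```shell
# Hypothetical helper: reads argocd CLI output on stdin and succeeds (exit 0)
# when it contains the "resources require pruning" fatal message.
needs_pruning() {
  grep -q 'resources require pruning'
}

# Example using the log line from the failing run.
log='{"level":"fatal","msg":"2 resources require pruning","time":"2026-03-18T14:39:07-03:00"}'
if printf '%s\n' "$log" | needs_pruning; then
  echo "sync applied, but orphaned resources still need pruning"
fi
```

This only classifies the failure; it does not change the exit code, so the action would still need the prune option or a manual cleanup to go green.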
Proposed Solution
Changes to github-actions-argocd-sync/entrypoint.sh:
- Remove > /dev/null 2>&1 from the sync command: expose the actual error message so failures are diagnosable from the GitHub Actions log. This is the most critical change; without it, the real error is invisible.
- Add --prune flag support: new optional input prune (default: false). When enabled, pass --prune to argocd app sync so orphaned resources are cleaned up automatically during sync. This prevents the "resources require pruning" fatal from causing false failures.
- Use --async on argocd app sync: fire the sync without waiting for completion. The script already has an argocd app wait step afterward that handles the confirmation. This separates sync dispatch from sync verification.
- Increase retry interval from 5s to 30s: give time for a previous sync attempt to complete before retrying.
- Add explicit --timeout to the sync and wait commands (e.g., --timeout 180) for predictable behavior regardless of CLI defaults.
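The changes above could fit together roughly as follows. This is a sketch under assumptions, not the current entrypoint.sh: the function names (run_sync, wait_for_sync, retry) and the ARGOCD override variable are hypothetical, and reading the new input from INPUT_PRUNE assumes the action runs as a Docker container action. The argocd flags used (--async, --prune, --timeout) are the ones named in the list.

```shell
# Sketch only: run_sync, wait_for_sync, retry and ARGOCD are hypothetical names.

# Dispatch the sync. Output is NOT redirected, so CLI errors reach the
# GitHub Actions log.
# $1 = app name, $2 = prune ("true" or "false"), $3 = timeout in seconds
run_sync() (
  app="$1"; prune="${2:-false}"; timeout="${3:-180}"
  set -- app sync "$app" --async --timeout "$timeout"
  if [ "$prune" = "true" ]; then
    set -- "$@" --prune
  fi
  "${ARGOCD:-argocd}" "$@"
)

# Verify the sync separately from dispatching it.
# $1 = app name, $2 = timeout in seconds
wait_for_sync() (
  "${ARGOCD:-argocd}" app wait "$1" --timeout "${2:-180}"
)

# Run a command up to $1 times, sleeping $2 seconds between attempts.
retry() (
  attempts="$1"; interval="$2"; shift 2
  n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${interval}s" >&2
    sleep "$interval"
    n=$((n + 1))
  done
)

# Intended call sequence (commented out so the sketch has no side effects):
# retry 5 30 run_sync "firmino-reporter-dev" "${INPUT_PRUNE:-false}" 180
# wait_for_sync "firmino-reporter-dev" 180
```

Keeping the binary behind an overridable variable is only a convenience for exercising the argument logic without a cluster, for example by setting ARGOCD=echo.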
Alternatives Considered
- Only removing /dev/null (helps diagnosis but doesn't prevent the failure)
- Always pruning (risky in production — better as opt-in flag)
- Adding --force to sync retries (risky, could cause unintended overwrites)
Example Usage
# Existing usage remains the same (backward compatible)
- uses: LerianStudio/github-actions-argocd-sync@main
  with:
    app-name: firmino-reporter
    argo-cd-token: ${{ secrets.ARGOCD_TOKEN }}
    argo-cd-url: ${{ secrets.ARGOCD_URL }}
    env-prefix: dev
    skip-if-not-exists: true

# New: with safe pruning enabled
- uses: LerianStudio/github-actions-argocd-sync@main
  with:
    app-name: firmino-reporter
    argo-cd-token: ${{ secrets.ARGOCD_TOKEN }}
    argo-cd-url: ${{ secrets.ARGOCD_URL }}
    env-prefix: dev
    skip-if-not-exists: true
    prune: true
Would This Be a Breaking Change?
No — fully backward compatible
Checklist
- I searched existing issues and this is not a duplicate.
- This feature aligns with the repository's goal of providing reusable, organization-wide workflows.
Additional Context
- Related Jira ticket: DSINT-860
- Reported by Arthur Ribeiro in #devops-team
- Investigated by Lucas Bedatty — confirmed root cause via local sync without /dev/null redirect
- Orphaned resources from PR fix/reporter-cluster-role-unique-names (March 12): namespace suffix added then reverted, old resources left behind