feat(ci): automated rollback on deployment failure with health-check gating #950
Open
Pranav-chaudhari-2006 wants to merge 1 commit into
…gating
- Add .github/workflows/deploy.yml: production deploy pipeline that builds and pushes a sha-tagged Docker image to GHCR, waits for kubectl rollout status, polls /healthz/ready for 60 s, and automatically calls kubectl rollout undo on any failure, reporting the result in a GitHub Step Summary.
- Add deploy/scripts/rollback.sh: standalone emergency rollback script with --namespace, --deployment, --health-url, --timeout, and --revision flags. Verifies health after undo and exits non-zero on failure.
- Add .github/workflows/staging-smoke.yml: smoke-test gate that runs on every develop push and blocks production promotion if liveness, readiness, or the POST /api/chat check fails.
- Modify deploy/k8s/deployment.example.yaml: add RollingUpdate strategy (maxUnavailable=0, maxSurge=1), minReadySeconds=15, and progressDeadlineSeconds=180 so that Kubernetes marks a stalled rollout as failed (it does not undo the rollout on its own).
- Modify docker-compose.yml: add /healthz/ready healthcheck to backend, restart: on-failure:3 on backend and frontend, and change frontend depends_on to condition: service_healthy.

Closes #<issue-number>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves deployment reliability by adding health-gated startup in Docker Compose, introducing Kubernetes rollout safety knobs, and adding GitHub Actions workflows/scripts to support automated and manual rollback flows.
Changes:
- Add Docker Compose healthchecks and gate frontend startup on backend readiness.
- Add a manual Kubernetes rollback script plus example deployment rollout strategy settings.
- Add GitHub Actions workflows for staging smoke tests and production deploy with auto-rollback.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| docker-compose.yml | Adds backend healthcheck + restart policy and gates frontend on backend health. |
| deploy/scripts/rollback.sh | Introduces a manual rollback helper that undoes a deployment and polls readiness. |
| deploy/k8s/deployment.example.yaml | Documents and configures safer rolling update parameters for Deployments. |
| .github/workflows/staging-smoke.yml | Adds a staging smoke workflow to probe health endpoints and a basic API call. |
| .github/workflows/deploy.yml | Adds build/push + deploy workflow with an explicit health gate and auto-rollback. |

    # Poll /healthz/ready so Docker Compose (and the frontend service) know
    # when the backend is truly ready to serve traffic.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/healthz/ready"]

Comment on lines 36 to +38

      depends_on:
    -   - backend
    +   backend:
    +     condition: service_healthy

Comment on lines +125 to +126

    HTTP_CODE=$(curl -sf -o /dev/null -w "%{http_code}" "$READY_URL" 2>/dev/null || echo "000")
    if [[ "$HTTP_CODE" == "200" ]]; then

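With `-f`, curl still writes the `-w` status before it exits non-zero on an HTTP error, so a `|| echo "000"` fallback can append a second value to the captured string (for example `503` followed by `000`). The comparison against `200` still fails in that case, but the logged code is no longer clean. A minimal sketch of a probe that always yields exactly one status code, plus a bounded polling loop; the helper names `http_status` and `wait_until_ready` are hypothetical and not taken from the PR:

```bash
#!/usr/bin/env bash

# Print exactly one three-digit status for a URL; "000" means no HTTP response.
# Hypothetical helper, not part of rollback.sh or deploy.yml.
http_status() {
  local url=$1 code
  # No -f here: an HTTP error is still a response with its own status code.
  # If curl itself fails (refused, timeout, bad URL), fall back to 000.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" 2>/dev/null) || code="000"
  # Guard against anything that is not a plain three-digit code.
  [[ "$code" =~ ^[0-9]{3}$ ]] || code="000"
  printf '%s\n' "$code"
}

# Poll a probe command until it prints 200 or the attempt budget runs out.
# $1 = number of attempts, $2 = seconds between attempts, rest = probe command.
wait_until_ready() {
  local attempts=$1 interval=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    [[ "$("$@")" == "200" ]] && return 0
    # Skip the sleep after the final attempt.
    ((i < attempts)) && sleep "$interval"
  done
  return 1
}
```

A call such as `wait_until_ready 12 5 http_status "$READY_URL"` would poll for roughly 60 s; the attempt count and interval are illustrative values.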
Comment on lines +29 to +32

    # If the rollout hasn't finished within 3 minutes, Kubernetes marks it
    # Failed and the deploy controller automatically issues a rollout undo.
    # This is the native K8s auto-rollback trigger.
    progressDeadlineSeconds: 180

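Kubernetes does not roll a Deployment back on its own when `progressDeadlineSeconds` expires. It sets the `Progressing` condition to `False` with reason `ProgressDeadlineExceeded`, and `kubectl rollout status` exits non-zero, so the undo has to be issued by whoever is watching the rollout. A minimal sketch of that pairing, assuming a hypothetical function name `rollout_or_undo` and a `KUBECTL` override variable introduced only so the logic can be exercised without a cluster:

```bash
#!/usr/bin/env bash

# Command used to talk to the cluster; overridable for dry runs (assumption).
KUBECTL=${KUBECTL:-kubectl}

# Wait for a rollout; if it fails or stalls, request the rollback explicitly.
# $1 = deployment name, $2 = namespace, $3 = timeout (defaults to 180s)
rollout_or_undo() {
  local deployment=$1 namespace=$2 timeout=${3:-180s}
  if "$KUBECTL" rollout status "deployment/$deployment" -n "$namespace" --timeout="$timeout"; then
    return 0
  fi
  # Kubernetes leaves the failed rollout in place, so undo it here.
  echo "rollout of $deployment failed; undoing" >&2
  "$KUBECTL" rollout undo "deployment/$deployment" -n "$namespace"
  return 1
}
```

Returning non-zero after the undo keeps the pipeline step red, so a rolled-back deploy is not reported as a success.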
Comment on lines +99 to +101

    | Liveness \`/healthz/live\` | ${{ steps.*.outcome }} |
    | Readiness \`/healthz/ready\` | |
    | API \`POST /api/chat\` | |

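The `steps.*.outcome` expression does not name a single step, and two of the three rows have no value at all. One way to fill the table is to pass each step's outcome in explicitly. A minimal sketch, assuming a hypothetical helper `write_smoke_summary` and hypothetical step IDs (`live`, `ready`, `chat`) on the workflow side:

```bash
#!/usr/bin/env bash

# Append a three-row result table to the GitHub step summary.
# $1..$3 = outcome of the liveness, readiness and API steps
# (for example "success", "failure" or "skipped").
write_smoke_summary() {
  # Fall back to stdout when not running inside GitHub Actions.
  local out=${GITHUB_STEP_SUMMARY:-/dev/stdout}
  {
    echo '| Check | Outcome |'
    echo '|---|---|'
    echo "| Liveness \`/healthz/live\` | ${1:-unknown} |"
    echo "| Readiness \`/healthz/ready\` | ${2:-unknown} |"
    echo "| API \`POST /api/chat\` | ${3:-unknown} |"
  } >> "$out"
}
```

In a workflow step this could be called as `write_smoke_summary "${{ steps.live.outcome }}" "${{ steps.ready.outcome }}" "${{ steps.chat.outcome }}"`, provided the three steps carry those IDs.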
Comment on lines +32 to +33

    HTTP=$(curl -sf -o /tmp/live.json -w "%{http_code}" \
      "${{ env.BASE_URL }}/healthz/live" || echo "000")

Comment on lines +68 to +69

    # Accept 200 or 422 (validation error is OK — means the API is up)
    if [[ "$HTTP" != "200" && "$HTTP" != "422" && "$HTTP" != "401" ]]; then

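The comment lists 200 and 422 as acceptable, while the condition also lets 401 through. Keeping the accepted codes in one small function makes the comment and the check harder to drift apart. A minimal sketch, assuming a hypothetical helper `api_is_up`; treating 401 as "up" mirrors the condition in the snippet and is itself an assumption about the intended behaviour:

```bash
#!/usr/bin/env bash

# Return 0 when the status code shows the API process answered the request.
api_is_up() {
  case "$1" in
    200) return 0 ;;      # request succeeded
    401|422) return 0 ;;  # rejected by auth or validation, but the API responded
    *) return 1 ;;        # 5xx, 404, 000 (no response) or anything malformed
  esac
}
```

The smoke step would then reduce to `api_is_up "$HTTP" || exit 1`.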

    # How long (seconds) to poll /healthz/ready after rollout reports success
    HEALTH_CHECK_TIMEOUT: 60
    # URL of the service's health endpoint — override via repository variable
    SERVICE_HEALTH_URL: ${{ vars.SERVICE_HEALTH_URL || 'http://localhost:8000' }}

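If the repository variable is missing, the fallback resolves to `http://localhost:8000`, which on a CI runner is the runner itself rather than the deployed service (assuming the service runs in the cluster). A minimal sketch of a guard that fails fast instead of probing the wrong host; the helper name `require_remote_url` is hypothetical:

```bash
#!/usr/bin/env bash

# Fail fast when the health URL is empty, not http(s), or points at the runner.
require_remote_url() {
  local url=${1:-}
  case "$url" in
    '')
      echo "SERVICE_HEALTH_URL is not set" >&2
      return 1 ;;
    http://localhost*|https://localhost*|http://127.*|https://127.*)
      echo "SERVICE_HEALTH_URL points at the runner: $url" >&2
      return 1 ;;
    http://*|https://*)
      return 0 ;;
    *)
      echo "SERVICE_HEALTH_URL is not an http(s) URL: $url" >&2
      return 1 ;;
  esac
}
```

Calling `require_remote_url "$SERVICE_HEALTH_URL" || exit 1` before the health gate turns a silent misconfiguration into an explicit failure.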
Comment on lines +160 to +161

    HTTP_CODE=$(curl -sf -o /dev/null -w "%{http_code}" "$URL" 2>/dev/null || echo "000")
    if [ "$HTTP_CODE" = "200" ]; then

Summary
Failed deployments can leave the service in a broken state with no automatic recovery. This PR introduces a multi-layer automated rollback system that triggers rollbacks based on health checks and deployment outcomes — covering GitHub Actions CI/CD, Kubernetes, and Docker Compose environments.
Changes Made
🆕 New Files
.github/workflows/deploy.yml
A full production deployment pipeline that runs automatically after the `CI` workflow succeeds on `main`.
Pipeline flow:
- … `latest` and a pinned `sha-<commit>` tag, so the previous image is always reachable for rollback
deploy/scripts/rollback.sh
A standalone Bash script for manual or emergency rollbacks from any CI system or terminal.
What it does:
- … `kubectl rollout undo`
- … `/healthz/ready` for up to 90 s to confirm the rollback itself is healthy
.github/workflows/staging-smoke.yml
A lightweight smoke-test workflow that runs on every push to `develop` and blocks production promotion if any check fails.
Checks performed:
- `GET /healthz/live`
- `GET /healthz/ready`
- `POST /api/chat`
✏️ Modified Files
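deploy/k8s/deployment.example.yaml
Added fields that make Kubernetes's own rollout controller a further line of defence: old pods keep serving until new ones are ready, and a stalled rollout is marked as failed. Kubernetes does not undo the rollout by itself; the rollback is issued by the deploy workflow or `rollback.sh`.
| Field | Value |
|---|---|
| `strategy.type` | `RollingUpdate` |
| `maxUnavailable` | `0` |
| `maxSurge` | `1` |
| `minReadySeconds` | `15` |
| `progressDeadlineSeconds` | `180` |
docker-compose.yml
Health-check and restart improvements for local and staging Docker Compose environments: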
Setup & Requirements
GitHub Secrets & Variables
| Name | Example value or command |
|---|---|
| `KUBECONFIG_B64` | `base64 -w0 ~/.kube/config` |
| `SERVICE_HEALTH_URL` | `https://api.yourdomain.com` |
| `STAGING_URL` | `https://staging.yourdomain.com` |
Go to Settings → Secrets and variables → Actions to add these.
Permissions
The `deploy.yml` workflow uses `GITHUB_TOKEN` for GHCR pushes, so no additional PAT is needed. The `production` environment should be configured in Settings → Environments with any required approval rules.
How to Run / Test
Trigger the deploy pipeline
Push to `main` after CI passes; `deploy.yml` runs automatically.
Or trigger it manually: Actions → Deploy with Automated Rollback → Run workflow.
Trigger the staging smoke tests
Push to `develop`; `staging-smoke.yml` runs automatically.
Or trigger manually via Actions → Staging Smoke Tests → Run workflow (you can override the staging URL in the inputs).
Run the rollback script manually
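A minimal sketch of a manual invocation, using only the flags the script is described as accepting (`--namespace`, `--deployment`, `--health-url`, `--timeout`, `--revision`). The wrapper name `emergency_rollback`, the `DRY_RUN` switch, and every argument value are hypothetical; that `--timeout` takes seconds and that `--health-url` takes the full readiness URL are assumptions as well.

```bash
#!/usr/bin/env bash

# Script path from the PR; every value passed to it below is a placeholder.
ROLLBACK_SCRIPT=${ROLLBACK_SCRIPT:-deploy/scripts/rollback.sh}

# Roll a deployment back, optionally to a specific revision.
# $1 = namespace, $2 = deployment, $3 = health URL, $4 = revision (optional)
emergency_rollback() {
  local args=(--namespace "$1" --deployment "$2" --health-url "$3" --timeout 90)
  # Only pass --revision when a revision number was supplied.
  [[ -n "${4:-}" ]] && args+=(--revision "$4")
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    # Show the command instead of running it.
    echo "$ROLLBACK_SCRIPT ${args[*]}"
  else
    "$ROLLBACK_SCRIPT" "${args[@]}"
  fi
}
```

Running `DRY_RUN=1 emergency_rollback production backend https://api.yourdomain.com/healthz/ready` prints the command first, which is a cheap check before touching a live cluster.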
Test the Docker Compose health checks locally
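A minimal sketch of a local check: start the stack, then poll the backend container's Docker health status until it reports `healthy`. The helper name `wait_container_healthy` and the `DOCKER` override variable are hypothetical; `backend` is the service name used in the PR's `docker-compose.yml`.

```bash
#!/usr/bin/env bash

# Docker CLI to use; overridable so the loop can be exercised without Docker.
DOCKER=${DOCKER:-docker}

# Poll a container's health status until it reports "healthy".
# $1 = container name or ID, $2 = attempts, $3 = seconds between attempts
wait_container_healthy() {
  local container=$1 attempts=${2:-30} interval=${3:-2} i status
  for ((i = 1; i <= attempts; i++)); do
    # inspect fails if the container is missing or has no healthcheck.
    status=$("$DOCKER" inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status="unknown"
    [[ "$status" == "healthy" ]] && return 0
    # Skip the sleep after the final attempt.
    ((i < attempts)) && sleep "$interval"
  done
  echo "container $container is ${status:-unknown}" >&2
  return 1
}
```

After `docker compose up -d`, a call such as `wait_container_healthy "$(docker compose ps -q backend)"` should succeed once `/healthz/ready` answers; the frontend should only start after that point because of `condition: service_healthy`.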
Staging Rollback Test Checklist
Before enabling in production, validate these scenarios in staging:
- … `kubectl rollout status` fails and rollback fires
- … `/healthz/ready` always returns 503 → confirm health-gate catches it
- … `staging-smoke.yml` manually → confirm all three checks pass
- … `rollback.sh --revision <n>` → confirm it rolls to the correct revision
- … `kubectl rollout history`
References
- `deploy.yml`: Production deploy workflow with auto-rollback
- `staging-smoke.yml`: Staging smoke-test gate
- `rollback.sh`: Manual rollback script
- `deployment.example.yaml`: Kubernetes manifest with rollout strategy
- `docker-compose.yml`: Local/staging health-check config
- `workflow_run` trigger
Rollback Decision Tree