feat(ci): automated rollback on deployment failure with health-check gating #950

Open
Pranav-chaudhari-2006 wants to merge 1 commit into imDarshanGK:main from Pranav-chaudhari-2006:feat/automated-rollback

Conversation

@Pranav-chaudhari-2006

Summary

Failed deployments can leave the service in a broken state with no automatic recovery. This PR introduces a multi-layer automated rollback system that triggers rollbacks based on health checks and deployment outcomes — covering GitHub Actions CI/CD, Kubernetes, and Docker Compose environments.


Changes Made

🆕 New Files

.github/workflows/deploy.yml

A full production deployment pipeline that runs automatically after the CI workflow succeeds on main.

Pipeline flow:

Build & tag Docker image (sha-<short> + latest)
  └─► Push to GHCR
        └─► kubectl set image (rolling update)
              └─► kubectl rollout status --timeout=3m
                    ├─ FAILURE → kubectl rollout undo ──► Step Summary ──► ❌ workflow fails
                    └─ SUCCESS → Poll /healthz/ready for 60s
                          ├─ TIMEOUT → kubectl rollout undo ──► Step Summary ──► ❌ workflow fails
                          └─ HTTP 200 → ✅ Deployment summary written
  • Images are tagged with both latest and a pinned sha-<commit> tag, so the previous image is always reachable for rollback
  • A GitHub Step Summary is written on rollback, showing the failed image, the reverted image, and next-step instructions
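
The gating logic in the flow above can be sketched in shell. This is a minimal illustration under stated assumptions, not the contents of deploy.yml: the function name deploy_with_gate, the KUBECTL override, and the idea of passing the health probe in as a command are all hypothetical, chosen so the flow can be dry-run with stubs.

```shell
# Command used to talk to the cluster; override it to dry-run the flow.
KUBECTL="${KUBECTL:-kubectl}"

# deploy_with_gate DEPLOYMENT CONTAINER IMAGE PROBE_CMD [PROBE_ARGS...]
# (hypothetical helper) Returns 0 only if the rollout completes and the probe passes.
deploy_with_gate() {
  local deployment="$1" container="$2" image="$3"
  shift 3

  # Start the rolling update.
  "$KUBECTL" set image "deployment/$deployment" "$container=$image" || return 1

  # Gate 1: the rollout itself must finish within 3 minutes.
  if ! "$KUBECTL" rollout status "deployment/$deployment" --timeout=3m; then
    "$KUBECTL" rollout undo "deployment/$deployment"
    return 1
  fi

  # Gate 2: the caller-supplied health probe must succeed.
  if ! "$@"; then
    "$KUBECTL" rollout undo "deployment/$deployment"
    return 1
  fi
}
```

Both failure branches end in the same undo and a non-zero return, which is what makes the workflow step fail after the rollback has been issued.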

deploy/scripts/rollback.sh

A standalone Bash script for manual or emergency rollbacks from any CI system or terminal.

# Roll back to the previous revision
bash deploy/scripts/rollback.sh

# Roll back to a specific revision
bash deploy/scripts/rollback.sh --revision 5 --namespace production

# Override the health endpoint
bash deploy/scripts/rollback.sh --health-url https://api.yourdomain.com --timeout 120

What it does:

  1. Captures current image + revision before undoing
  2. Calls kubectl rollout undo
  3. Waits for the undo rollout to stabilise
  4. Polls /healthz/ready for up to 90 s to confirm the rollback itself is healthy
  5. Exits non-zero if the rollback image is also unhealthy (prevents silent failures)
  6. Annotates the deployment with the rollback reason for audit history
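
Steps 4 and 5 amount to a bounded polling loop. The sketch below shows one way to write it; it is not taken from rollback.sh, and the names poll_ready, http_status, and HTTP_STATUS_CMD are assumptions. It deliberately calls curl without -f, on the understanding that with -f an HTTP error still prints the status code and then exits non-zero, so a `|| echo "000"` fallback would append to the code rather than replace it.

```shell
# Command that prints the HTTP status code for a URL; override to stub it out.
HTTP_STATUS_CMD="${HTTP_STATUS_CMD:-http_status}"

# Print the status code only. curl reports 000 when no response was received,
# so no extra fallback value is needed.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1" || true
}

# poll_ready URL TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as URL answers 200, or 1 once the deadline has passed.
poll_ready() {
  local url="$1" timeout="$2" interval="${3:-5}"
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    [ "$("$HTTP_STATUS_CMD" "$url")" = "200" ] && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}
```

A call such as `poll_ready "https://api.yourdomain.com/healthz/ready" 90 || exit 1` would then give the non-zero exit described in step 5.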

.github/workflows/staging-smoke.yml

A lightweight smoke-test workflow that runs on every push to develop and blocks production promotion if any check fails.

Checks performed:

| Check | Endpoint | Pass condition |
| --- | --- | --- |
| Liveness | GET /healthz/live | HTTP 200 |
| Readiness | GET /healthz/ready | HTTP 200 |
| API smoke | POST /api/chat | HTTP 200, 401, or 422 |
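
The pass conditions can be expressed as a small lookup. This is an illustrative sketch rather than the workflow's actual code; check_passes and the check names liveness, readiness, and api are hypothetical labels. The API check accepts 401 and 422 because an authentication or validation error still shows that the API is up and answering.

```shell
# check_passes CHECK_NAME HTTP_CODE
# Returns 0 when the status code satisfies the pass condition for that check.
check_passes() {
  case "$1:$2" in
    liveness:200|readiness:200) return 0 ;;  # health endpoints must answer 200
    api:200|api:401|api:422)    return 0 ;;  # API is reachable and responding
    *)                          return 1 ;;  # anything else fails the gate
  esac
}
```

For example, `check_passes api 422` succeeds while `check_passes readiness 503` fails.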

✏️ Modified Files

deploy/k8s/deployment.example.yaml

Added fields that make the rolling update safer and let Kubernetes flag a stalled rollout:

| Field | Value | Effect |
| --- | --- | --- |
| strategy.type | RollingUpdate | Explicitly documented |
| maxUnavailable | 0 | Never drop below full replica count during rollout |
| maxSurge | 1 | Bring up 1 extra pod before killing an old one |
| minReadySeconds | 15 | Pod must stay ready for 15 s before it counts as available |
| progressDeadlineSeconds | 180 | After 3 min without progress, K8s sets the Progressing condition to False with reason ProgressDeadlineExceeded |

progressDeadlineSeconds does not roll anything back by itself: Kubernetes only reports the stalled rollout in the Deployment status and takes no further action. Its value here is that kubectl rollout status exits non-zero once the deadline is exceeded, which lets the GitHub Actions workflow (or any other tooling watching the condition) issue kubectl rollout undo.
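
Kubernetes surfaces an exceeded deadline as a Progressing condition with reason ProgressDeadlineExceeded on the Deployment status. A script outside the workflow could act on that condition roughly as follows. This is a sketch under assumptions: rollout_stalled and undo_if_stalled are hypothetical helper names, and the KUBECTL override exists only so the logic can be exercised without a cluster.

```shell
# Command used to talk to the cluster; override it to test without one.
KUBECTL="${KUBECTL:-kubectl}"

# rollout_stalled DEPLOYMENT [NAMESPACE]
# Succeeds when the Progressing condition reports ProgressDeadlineExceeded.
rollout_stalled() {
  local deployment="$1" namespace="${2:-default}" reason
  reason="$("$KUBECTL" get deployment "$deployment" -n "$namespace" \
    -o 'jsonpath={.status.conditions[?(@.type=="Progressing")].reason}')"
  [ "$reason" = "ProgressDeadlineExceeded" ]
}

# undo_if_stalled DEPLOYMENT [NAMESPACE]
# Issues a rollout undo only when the deadline has been exceeded.
undo_if_stalled() {
  if rollout_stalled "$@"; then
    "$KUBECTL" rollout undo "deployment/$1" -n "${2:-default}"
  fi
}
```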


docker-compose.yml

Health-check and restart improvements for local and staging Docker Compose environments:

# backend
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:8000/healthz/ready"]
  interval: 10s
  timeout: 5s
  retries: 5
  start_period: 15s
restart: on-failure:3   # stops crash-restart loops after 3 failures

# frontend
depends_on:
  backend:
    condition: service_healthy   # won't start until backend is confirmed healthy
restart: on-failure:3

Setup & Requirements

GitHub Secrets & Variables

| Name | Type | Value |
| --- | --- | --- |
| KUBECONFIG_B64 | Secret | Output of base64 -w0 ~/.kube/config |
| SERVICE_HEALTH_URL | Variable | e.g. https://api.yourdomain.com |
| STAGING_URL | Variable | e.g. https://staging.yourdomain.com |

Go to Settings → Secrets and variables → Actions to add these.
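
The -w0 flag is specific to GNU base64. Where it is unavailable, the same single-line value for KUBECONFIG_B64 can likely be produced by stripping newlines instead. The helper name encode_kubeconfig below is hypothetical, and this is only a portability sketch.

```shell
# encode_kubeconfig [PATH]
# Print the file as one unwrapped base64 line (defaults to ~/.kube/config).
encode_kubeconfig() {
  base64 < "${1:-$HOME/.kube/config}" | tr -d '\n'
}
```

Running `encode_kubeconfig` with no argument prints the value to paste into the secret.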

Permissions

The deploy.yml workflow uses GITHUB_TOKEN for GHCR pushes — no additional PAT needed. The production environment should be configured in Settings → Environments with any required approval rules.


How to Run / Test

Trigger the deploy pipeline

Push to main after CI passes — deploy.yml runs automatically.
Or trigger it manually: Actions → Deploy with Automated Rollback → Run workflow.

Trigger the staging smoke tests

Push to develop and staging-smoke.yml runs automatically.
Or trigger manually via Actions → Staging Smoke Tests → Run workflow (you can override the staging URL in the inputs).

Run the rollback script manually

# Make executable (Linux/macOS)
chmod +x deploy/scripts/rollback.sh

# Run with defaults
bash deploy/scripts/rollback.sh

# Full options
bash deploy/scripts/rollback.sh \
  --namespace production \
  --deployment qyverixai \
  --health-url https://api.yourdomain.com \
  --timeout 120 \
  --revision 5

Test the Docker Compose health checks locally

docker compose up --build
# Backend must pass /healthz/ready before frontend container starts
docker compose ps   # check STATUS column — should show "(healthy)"
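
To check the health state from a script instead of reading the STATUS column, the container's health can be queried directly. The sketch below assumes the Compose service is named backend, as in the snippet in this PR; service_health and the DOCKER override are hypothetical names introduced for illustration.

```shell
# Docker CLI to use; override it to test without a running daemon.
DOCKER="${DOCKER:-docker}"

# service_health SERVICE
# Print the health status (starting, healthy, unhealthy) of a Compose service.
service_health() {
  local id
  # Resolve the service name to its container ID.
  id="$("$DOCKER" compose ps -q "$1")" || return 1
  [ -n "$id" ] || return 1
  "$DOCKER" inspect --format '{{.State.Health.Status}}' "$id"
}
```

For example, `[ "$(service_health backend)" = "healthy" ]` succeeds once the backend has passed its health check.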

Staging Rollback Test Checklist

Before enabling in production, validate these scenarios in staging:

  • Push a non-existent image tag → confirm kubectl rollout status fails and rollback fires
  • Push an image whose /healthz/ready always returns 503 → confirm health-gate catches it
  • Run staging-smoke.yml manually → confirm all three checks pass
  • Run rollback.sh --revision <n> → confirm it rolls to the correct revision
  • Confirm rollback annotations appear in kubectl rollout history

Rollback Decision Tree

Deploy triggered on main
│
├─[K8s] progressDeadlineSeconds=180 exceeded (no rollout progress for 3 min)
│         └─► K8s reports ProgressDeadlineExceeded; the undo is issued by the CI step below or by an operator
│
├─[CI] kubectl rollout status times out
│         └─► deploy.yml → kubectl rollout undo + GitHub Step Summary
│
├─[CI] /healthz/ready doesn't return 200 within 60s
│         └─► deploy.yml → kubectl rollout undo + GitHub Step Summary
│
└─[Manual] On-call engineer
              └─► bash deploy/scripts/rollback.sh

…gating

- Add .github/workflows/deploy.yml: production deploy pipeline that builds
  and pushes a sha-tagged Docker image to GHCR, waits for kubectl rollout
  status, polls /healthz/ready for 60 s, and automatically calls
  kubectl rollout undo on any failure with a rich GitHub Step Summary.

- Add deploy/scripts/rollback.sh: standalone emergency rollback script
  with --namespace, --deployment, --health-url, --timeout, and --revision
  flags. Verifies health after undo and exits non-zero on failure.

- Add .github/workflows/staging-smoke.yml: smoke-test gate that runs on
  every develop push and blocks production promotion if liveness,
  readiness, or the POST /api/chat check fails.

- Modify deploy/k8s/deployment.example.yaml: add RollingUpdate strategy
  (maxUnavailable=0, maxSurge=1), minReadySeconds=15, and
  progressDeadlineSeconds=180 so Kubernetes reports a stalled rollout
  that the workflow can then roll back.

- Modify docker-compose.yml: add /healthz/ready healthcheck to backend,
  restart: on-failure:3 on backend and frontend, and change frontend
  depends_on to condition: service_healthy.

Closes #<issue-number>
Copilot AI review requested due to automatic review settings June 8, 2026 10:21

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves deployment reliability by adding health-gated startup in Docker Compose, introducing Kubernetes rollout safety knobs, and adding GitHub Actions workflows/scripts to support automated and manual rollback flows.

Changes:

  • Add Docker Compose healthchecks and gate frontend startup on backend readiness.
  • Add a manual Kubernetes rollback script plus example deployment rollout strategy settings.
  • Add GitHub Actions workflows for staging smoke tests and production deploy with auto-rollback.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| docker-compose.yml | Adds backend healthcheck + restart policy and gates frontend on backend health. |
| deploy/scripts/rollback.sh | Introduces a manual rollback helper that undoes a deployment and polls readiness. |
| deploy/k8s/deployment.example.yaml | Documents and configures safer rolling update parameters for Deployments. |
| .github/workflows/staging-smoke.yml | Adds a staging smoke workflow to probe health endpoints and a basic API call. |
| .github/workflows/deploy.yml | Adds build/push + deploy workflow with an explicit health gate and auto-rollback. |

Comment thread docker-compose.yml
# Poll /healthz/ready so Docker Compose (and the frontend service) know
# when the backend is truly ready to serve traffic.
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/healthz/ready"]
Comment thread docker-compose.yml
Comment on lines 36 to +38
depends_on:
- backend
backend:
condition: service_healthy
Comment on lines +125 to +126
HTTP_CODE=$(curl -sf -o /dev/null -w "%{http_code}" "$READY_URL" 2>/dev/null || echo "000")
if [[ "$HTTP_CODE" == "200" ]]; then
Comment on lines +29 to +32
# If the rollout hasn't finished within 3 minutes, Kubernetes marks it
# Failed and the deploy controller automatically issues a rollout undo.
# This is the native K8s auto-rollback trigger.
progressDeadlineSeconds: 180
Comment on lines +99 to +101
| Liveness \`/healthz/live\` | ${{ steps.*.outcome }} |
| Readiness \`/healthz/ready\` | |
| API \`POST /api/chat\` | |
Comment on lines +32 to +33
HTTP=$(curl -sf -o /tmp/live.json -w "%{http_code}" \
"${{ env.BASE_URL }}/healthz/live" || echo "000")
Comment on lines +68 to +69
# Accept 200 or 422 (validation error is OK — means the API is up)
if [[ "$HTTP" != "200" && "$HTTP" != "422" && "$HTTP" != "401" ]]; then
# How long (seconds) to poll /healthz/ready after rollout reports success
HEALTH_CHECK_TIMEOUT: 60
# URL of the service's health endpoint — override via repository variable
SERVICE_HEALTH_URL: ${{ vars.SERVICE_HEALTH_URL || 'http://localhost:8000' }}
Comment on lines +160 to +161
HTTP_CODE=$(curl -sf -o /dev/null -w "%{http_code}" "$URL" 2>/dev/null || echo "000")
if [ "$HTTP_CODE" = "200" ]; then