Skip to content

feat: AI-powered remediation - allow users to request fixes directly from the dashboard #1

@mason5052

Description

@mason5052

Overview

Argus-Ops is designed for non-experts who operate Kubernetes clusters without deep K8s knowledge. Today the tool can detect problems and explain why they happen (AI Diagnosis). The next logical step is to let users ask AI to fix the problem directly from the dashboard -- without needing to know kubectl commands or Kubernetes internals.

This issue tracks the design and implementation of on-demand AI remediation.


Problem Statement

Current user journey:

  1. Dashboard shows a finding: "Pod shopify-rpa is in CrashLoopBackOff (restarts: 12)"
  2. AI Diagnosis explains: "Likely OOM kill due to memory limit of 256Mi being too low"
  3. User is stuck -- they do not know how to fix it in Kubernetes

Target user journey:

  1. Same finding + diagnosis as above
  2. User clicks "Ask AI to Fix" button next to the incident
  3. AI generates a safe remediation plan with full explanation
  4. User reviews the plan (shows exact commands or YAML changes)
  5. User clicks "Apply" to execute -- or copies the commands to run manually
  6. Dashboard confirms the action was taken and re-scans

Functional Requirements

FR-1: Remediation Plan Generation (Phase 1 - No execution)

  • Add a "Suggest Fix" button on each AI Diagnosis card
  • POST /api/remediate/{incident_id} endpoint: calls LLM with the incident + diagnosis context
  • LLM returns a structured remediation plan:
    • Summary: one-sentence description of what will be done
    • Steps: ordered list of actions with explanations in plain English
    • Commands: exact kubectl commands or YAML patches to apply
    • Risk level: low / medium / high with justification
    • Rollback: how to undo the change if it makes things worse
  • Display the plan in a modal dialog before any execution
  • Phase 1 is read-only -- user copies commands and runs them manually

FR-2: Safe Auto-Execution (Phase 2 - With approval gate)

  • "Apply Fix" button becomes available after reviewing the plan
  • Execution is gated by:
    • Explicit user confirmation dialog ("I understand this will change my cluster")
    • Risk level check: high risk actions require typing the resource name to confirm
    • Dry-run first: run kubectl apply --dry-run=server and show output before real apply
  • Supported remediation actions (safe subset only):
    • kubectl rollout restart deployment/<name> -n <namespace> -- restart a crashing deployment
    • kubectl scale deployment/<name> --replicas=<n> -n <namespace> -- scale up/down
    • kubectl set resources deployment/<name> --limits=memory=<new> -n <namespace> -- bump memory limit
    • kubectl patch for simple field changes (e.g., image tag update)
    • kubectl delete pod/<name> -n <namespace> -- force-delete a stuck pod
  • Actions that are never auto-executed (always manual-only):
    • Deleting namespaces, PVCs, or StatefulSets
    • Modifying RBAC (ClusterRole, ClusterRoleBinding)
    • Any action outside the namespaces argus-ops is configured to watch
    • Changes to nodes (cordon, drain, delete)

FR-3: Audit Trail

  • Every remediation action (suggested or executed) is logged with:
    • Timestamp, incident ID, action taken, user-initiated vs. automatic
    • The exact command that was run
    • Before/after state (findings count before and after re-scan)
  • Audit log accessible via GET /api/audit and visible in a new "Audit" tab in the dashboard

FR-4: Re-scan After Fix

  • After applying a remediation, automatically trigger a fresh cluster scan
  • Compare new findings against pre-fix findings
  • Show a summary: "Fixed: 2 issues resolved. Remaining: 1 issue still present."

Technical Design

New RBAC permissions required

Current ClusterRole argus-ops-reader is read-only. Phase 2 requires write permissions scoped to specific verbs:

# Additional rules for remediation (Phase 2 only)
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["patch", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete"]
- apiGroups: ["apps"]
  resources: ["deployments/scale"]
  verbs: ["patch", "update"]

A separate argus-ops-remediator ClusterRole should be created so users can opt-in to write permissions independently of read permissions.

New WatchService methods

def remediate_now(self, incident_id: str) -> RemediationPlan:
    """Generate a remediation plan for the given incident (no execution)."""
    ...

def execute_remediation(self, plan_id: str, confirmed: bool) -> RemediationResult:
    """Execute a previously generated and user-confirmed plan."""
    ...

New API endpoints

POST /api/remediate/{incident_id}    -> RemediationPlan (generation only, safe)
POST /api/remediate/{plan_id}/apply  -> RemediationResult (execution, requires confirmation)
GET  /api/audit                      -> list of AuditEntry

New data models

@dataclass
class RemediationStep:
    description: str       # plain English explanation for non-experts
    command: str           # exact kubectl command
    is_destructive: bool

@dataclass
class RemediationPlan:
    plan_id: str
    incident_id: str
    summary: str
    steps: list[RemediationStep]
    risk_level: str        # "low" | "medium" | "high"
    risk_reason: str
    rollback_steps: list[RemediationStep]
    generated_at: datetime
    model_used: str

@dataclass
class RemediationResult:
    plan_id: str
    executed_at: datetime
    success: bool
    output: str            # stdout/stderr of kubectl commands
    error: str | None

AI Prompt Design

The remediation prompt must:

  1. Include the full finding + diagnosis context
  2. Explicitly instruct the LLM to only suggest safe, reversible actions
  3. Require structured JSON output (use LiteLLM response_format)
  4. Include the cluster namespace list so the LLM knows the scope
  5. Explain each step in plain English suitable for non-experts

UX Design

Dashboard changes

  • AI Diagnoses tab: each diagnosis card gets a "Suggest Fix" button
  • Clicking opens a Remediation Modal:
    • Header: incident summary + risk badge (green/yellow/red)
    • Body: numbered steps, each with plain-English description + kubectl command in a code block
    • Rollback section (collapsed by default)
    • Footer: "Copy All Commands" button + "Apply Fix" button (Phase 2 only)
  • After applying: inline status shows each step running, then green check or red X per step

Non-expert language requirement

All AI-generated text must use plain language. The LLM prompt must explicitly instruct the model to avoid Kubernetes jargon and always explain why a fix works.

Examples:

  • Bad: "Scale the Deployment resource to increase replica count"
  • Good: "Create more copies of this app so it can handle more load (currently 1 copy, will set to 2)"

Implementation Phases

Phase 1 - Suggest Fix only, no execution (first PR)

  • Add RemediationPlan and RemediationStep models to models.py
  • Add generate_remediation() method to AI provider (ai/provider.py)
  • Add remediate_now(incident_id) to WatchService
  • Add POST /api/remediate/{incident_id} to api.py
  • Add "Suggest Fix" button and modal to dashboard.html
  • Add GET /api/audit endpoint (log generation events only in Phase 1)
  • Unit tests: mock LLM response, verify plan structure and risk level validation

Phase 2 - Execute with approval gate (follow-up PR)

  • Add execute_remediation() to WatchService (uses Kubernetes Python client, no subprocess)
  • Add POST /api/remediate/{plan_id}/apply with mandatory dry-run step
  • Add separate argus-ops-remediator ClusterRole in deploy/k8s/rbac.yaml
  • Add confirmation dialog with risk-based gate (typing resource name for high risk)
  • Add Audit tab to dashboard
  • Add automatic re-scan trigger after successful execution with before/after diff
  • Integration tests: mock Kubernetes client calls, verify audit log entries

Security Considerations

  • Remediation execution uses the same in-cluster ServiceAccount as the rest of argus-ops -- no privilege escalation path
  • high risk actions are blocked by default; require explicit opt-in flag: remediation.allow_high_risk: false
  • All commands are logged before execution (write-ahead log pattern)
  • Namespace scope strictly enforced: argus-ops will not remediate resources outside its configured namespaces list
  • No --force flags permitted in generated commands
  • No destructive cluster-level operations (namespace delete, node delete, PVC delete)

Open Questions

  1. Should Phase 2 execution use the Kubernetes Python client directly (type-safe, no subprocess) or shell out to kubectl (simpler output capture, easier for users to reproduce manually)?
  2. Should the "Apply Fix" button be gated by a config flag (remediation.enabled: false by default) so deployers have explicit control?
  3. For multi-step plans, should execution be all-or-nothing (auto-rollback on failure) or step-by-step with a pause and confirm between each step?

Related Files

  • Current AI Diagnosis: src/argus_ops/web/watch_service.py (diagnose_now())
  • Current RBAC: deploy/k8s/rbac.yaml
  • Dashboard AI tab: src/argus_ops/web/templates/dashboard.html
  • AI provider: src/argus_ops/ai/provider.py

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions