-
Notifications
You must be signed in to change notification settings - Fork 0
Description
Overview
Argus-Ops is designed for non-experts who operate Kubernetes clusters without deep K8s knowledge. Today the tool can detect problems and explain why they happen (AI Diagnosis). The next logical step is to let users ask AI to fix the problem directly from the dashboard -- without needing to know kubectl commands or Kubernetes internals.
This issue tracks the design and implementation of on-demand AI remediation.
Problem Statement
Current user journey:
- Dashboard shows a finding: "Pod
shopify-rpais in CrashLoopBackOff (restarts: 12)" - AI Diagnosis explains: "Likely OOM kill due to memory limit of 256Mi being too low"
- User is stuck -- they do not know how to fix it in Kubernetes
Target user journey:
- Same finding + diagnosis as above
- User clicks "Ask AI to Fix" button next to the incident
- AI generates a safe remediation plan with full explanation
- User reviews the plan (shows exact commands or YAML changes)
- User clicks "Apply" to execute -- or copies the commands to run manually
- Dashboard confirms the action was taken and re-scans
Functional Requirements
FR-1: Remediation Plan Generation (Phase 1 - No execution)
- Add a "Suggest Fix" button on each AI Diagnosis card
POST /api/remediate/{incident_id}endpoint: calls LLM with the incident + diagnosis context- LLM returns a structured remediation plan:
- Summary: one-sentence description of what will be done
- Steps: ordered list of actions with explanations in plain English
- Commands: exact
kubectlcommands or YAML patches to apply - Risk level:
low/medium/highwith justification - Rollback: how to undo the change if it makes things worse
- Display the plan in a modal dialog before any execution
- Phase 1 is read-only -- user copies commands and runs them manually
FR-2: Safe Auto-Execution (Phase 2 - With approval gate)
- "Apply Fix" button becomes available after reviewing the plan
- Execution is gated by:
- Explicit user confirmation dialog ("I understand this will change my cluster")
- Risk level check:
highrisk actions require typing the resource name to confirm - Dry-run first: run
kubectl apply --dry-run=serverand show output before real apply
- Supported remediation actions (safe subset only):
kubectl rollout restart deployment/<name> -n <namespace>-- restart a crashing deploymentkubectl scale deployment/<name> --replicas=<n> -n <namespace>-- scale up/downkubectl set resources deployment/<name> --limits=memory=<new> -n <namespace>-- bump memory limitkubectl patchfor simple field changes (e.g., image tag update)kubectl delete pod/<name> -n <namespace>-- force-delete a stuck pod
- Actions that are never auto-executed (always manual-only):
- Deleting namespaces, PVCs, or StatefulSets
- Modifying RBAC (ClusterRole, ClusterRoleBinding)
- Any action outside the namespaces argus-ops is configured to watch
- Changes to nodes (cordon, drain, delete)
FR-3: Audit Trail
- Every remediation action (suggested or executed) is logged with:
- Timestamp, incident ID, action taken, user-initiated vs. automatic
- The exact command that was run
- Before/after state (findings count before and after re-scan)
- Audit log accessible via
GET /api/auditand visible in a new "Audit" tab in the dashboard
FR-4: Re-scan After Fix
- After applying a remediation, automatically trigger a fresh cluster scan
- Compare new findings against pre-fix findings
- Show a summary: "Fixed: 2 issues resolved. Remaining: 1 issue still present."
Technical Design
New RBAC permissions required
Current ClusterRole argus-ops-reader is read-only. Phase 2 requires write permissions scoped to specific verbs:
# Additional rules for remediation (Phase 2 only)
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["patch", "update"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["delete"]
- apiGroups: ["apps"]
resources: ["deployments/scale"]
verbs: ["patch", "update"]A separate argus-ops-remediator ClusterRole should be created so users can opt-in to write permissions independently of read permissions.
New WatchService methods
def remediate_now(self, incident_id: str) -> RemediationPlan:
"""Generate a remediation plan for the given incident (no execution)."""
...
def execute_remediation(self, plan_id: str, confirmed: bool) -> RemediationResult:
"""Execute a previously generated and user-confirmed plan."""
...New API endpoints
POST /api/remediate/{incident_id} -> RemediationPlan (generation only, safe)
POST /api/remediate/{plan_id}/apply -> RemediationResult (execution, requires confirmation)
GET /api/audit -> list of AuditEntry
New data models
@dataclass
class RemediationStep:
description: str # plain English explanation for non-experts
command: str # exact kubectl command
is_destructive: bool
@dataclass
class RemediationPlan:
plan_id: str
incident_id: str
summary: str
steps: list[RemediationStep]
risk_level: str # "low" | "medium" | "high"
risk_reason: str
rollback_steps: list[RemediationStep]
generated_at: datetime
model_used: str
@dataclass
class RemediationResult:
plan_id: str
executed_at: datetime
success: bool
output: str # stdout/stderr of kubectl commands
error: str | NoneAI Prompt Design
The remediation prompt must:
- Include the full finding + diagnosis context
- Explicitly instruct the LLM to only suggest safe, reversible actions
- Require structured JSON output (use LiteLLM response_format)
- Include the cluster namespace list so the LLM knows the scope
- Explain each step in plain English suitable for non-experts
UX Design
Dashboard changes
- AI Diagnoses tab: each diagnosis card gets a "Suggest Fix" button
- Clicking opens a Remediation Modal:
- Header: incident summary + risk badge (green/yellow/red)
- Body: numbered steps, each with plain-English description + kubectl command in a code block
- Rollback section (collapsed by default)
- Footer: "Copy All Commands" button + "Apply Fix" button (Phase 2 only)
- After applying: inline status shows each step running, then green check or red X per step
Non-expert language requirement
All AI-generated text must use plain language. The LLM prompt must explicitly instruct the model to avoid Kubernetes jargon and always explain why a fix works.
Examples:
- Bad: "Scale the Deployment resource to increase replica count"
- Good: "Create more copies of this app so it can handle more load (currently 1 copy, will set to 2)"
Implementation Phases
Phase 1 - Suggest Fix only, no execution (first PR)
- Add
RemediationPlanandRemediationStepmodels tomodels.py - Add
generate_remediation()method to AI provider (ai/provider.py) - Add
remediate_now(incident_id)toWatchService - Add
POST /api/remediate/{incident_id}toapi.py - Add "Suggest Fix" button and modal to
dashboard.html - Add
GET /api/auditendpoint (log generation events only in Phase 1) - Unit tests: mock LLM response, verify plan structure and risk level validation
Phase 2 - Execute with approval gate (follow-up PR)
- Add
execute_remediation()toWatchService(uses Kubernetes Python client, no subprocess) - Add
POST /api/remediate/{plan_id}/applywith mandatory dry-run step - Add separate
argus-ops-remediatorClusterRole indeploy/k8s/rbac.yaml - Add confirmation dialog with risk-based gate (typing resource name for
highrisk) - Add Audit tab to dashboard
- Add automatic re-scan trigger after successful execution with before/after diff
- Integration tests: mock Kubernetes client calls, verify audit log entries
Security Considerations
- Remediation execution uses the same in-cluster ServiceAccount as the rest of argus-ops -- no privilege escalation path
highrisk actions are blocked by default; require explicit opt-in flag:remediation.allow_high_risk: false- All commands are logged before execution (write-ahead log pattern)
- Namespace scope strictly enforced: argus-ops will not remediate resources outside its configured
namespaceslist - No
--forceflags permitted in generated commands - No destructive cluster-level operations (namespace delete, node delete, PVC delete)
Open Questions
- Should Phase 2 execution use the Kubernetes Python client directly (type-safe, no subprocess) or shell out to
kubectl(simpler output capture, easier for users to reproduce manually)? - Should the "Apply Fix" button be gated by a config flag (
remediation.enabled: falseby default) so deployers have explicit control? - For multi-step plans, should execution be all-or-nothing (auto-rollback on failure) or step-by-step with a pause and confirm between each step?
Related Files
- Current AI Diagnosis:
src/argus_ops/web/watch_service.py(diagnose_now()) - Current RBAC:
deploy/k8s/rbac.yaml - Dashboard AI tab:
src/argus_ops/web/templates/dashboard.html - AI provider:
src/argus_ops/ai/provider.py