feat: AI-powered remediation - allow users to request fixes directly from the dashboard

## Overview

Argus-Ops is designed for non-experts who operate Kubernetes clusters without deep K8s knowledge. Today the tool can **detect problems** and **explain why they happen** (AI Diagnosis). The next logical step is to let users **ask AI to fix the problem** directly from the dashboard -- without needing to know kubectl commands or Kubernetes internals.

This issue tracks the design and implementation of on-demand AI remediation.

---

## Problem Statement

Current user journey:
1. Dashboard shows a finding: "Pod `shopify-rpa` is in CrashLoopBackOff (restarts: 12)"
2. AI Diagnosis explains: "Likely OOM kill due to memory limit of 256Mi being too low"
3. **User is stuck** -- they do not know how to fix it in Kubernetes

Target user journey:
1. Same finding + diagnosis as above
2. User clicks **"Ask AI to Fix"** button next to the incident
3. AI generates a safe remediation plan with full explanation
4. User reviews the plan (shows exact commands or YAML changes)
5. User clicks **"Apply"** to execute -- or copies the commands to run manually
6. Dashboard confirms the action was taken and re-scans

---

## Functional Requirements

### FR-1: Remediation Plan Generation (Phase 1 - No execution)
- Add a "Suggest Fix" button on each AI Diagnosis card
- `POST /api/remediate/{incident_id}` endpoint: calls LLM with the incident + diagnosis context
- LLM returns a structured remediation plan:
  - **Summary**: one-sentence description of what will be done
  - **Steps**: ordered list of actions with explanations in plain English
  - **Commands**: exact `kubectl` commands or YAML patches to apply
  - **Risk level**: `low` / `medium` / `high` with justification
  - **Rollback**: how to undo the change if it makes things worse
- Display the plan in a modal dialog before any execution
- Phase 1 is **read-only** -- user copies commands and runs them manually

### FR-2: Safe Auto-Execution (Phase 2 - With approval gate)
- "Apply Fix" button becomes available after reviewing the plan
- Execution is gated by:
  - Explicit user confirmation dialog ("I understand this will change my cluster")
  - Risk level check: `high` risk actions require typing the resource name to confirm
  - Dry-run first: run `kubectl apply --dry-run=server` and show output before real apply
- Supported remediation actions (safe subset only):
  - `kubectl rollout restart deployment/<name> -n <namespace>` -- restart a crashing deployment
  - `kubectl scale deployment/<name> --replicas=<n> -n <namespace>` -- scale up/down
  - `kubectl set resources deployment/<name> --limits=memory=<new> -n <namespace>` -- bump memory limit
  - `kubectl patch` for simple field changes (e.g., image tag update)
  - `kubectl delete pod/<name> -n <namespace>` -- force-delete a stuck pod
- Actions that are **never auto-executed** (always manual-only):
  - Deleting namespaces, PVCs, or StatefulSets
  - Modifying RBAC (ClusterRole, ClusterRoleBinding)
  - Any action outside the namespaces argus-ops is configured to watch
  - Changes to nodes (cordon, drain, delete)

### FR-3: Audit Trail
- Every remediation action (suggested or executed) is logged with:
  - Timestamp, incident ID, action taken, user-initiated vs. automatic
  - The exact command that was run
  - Before/after state (findings count before and after re-scan)
- Audit log accessible via `GET /api/audit` and visible in a new "Audit" tab in the dashboard

### FR-4: Re-scan After Fix
- After applying a remediation, automatically trigger a fresh cluster scan
- Compare new findings against pre-fix findings
- Show a summary: "Fixed: 2 issues resolved. Remaining: 1 issue still present."

---

## Technical Design

### New RBAC permissions required
Current ClusterRole `argus-ops-reader` is read-only. Phase 2 requires write permissions scoped to specific verbs:

```yaml
# Additional rules for remediation (Phase 2 only)
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["patch", "update"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["delete"]
- apiGroups: ["apps"]
  resources: ["deployments/scale"]
  verbs: ["patch", "update"]
```

A separate `argus-ops-remediator` ClusterRole should be created so users can opt-in to write permissions independently of read permissions.

### New WatchService methods
```python
def remediate_now(self, incident_id: str) -> RemediationPlan:
    """Generate a remediation plan for the given incident (no execution)."""
    ...

def execute_remediation(self, plan_id: str, confirmed: bool) -> RemediationResult:
    """Execute a previously generated and user-confirmed plan."""
    ...
```

### New API endpoints
```
POST /api/remediate/{incident_id}    -> RemediationPlan (generation only, safe)
POST /api/remediate/{plan_id}/apply  -> RemediationResult (execution, requires confirmation)
GET  /api/audit                      -> list of AuditEntry
```

### New data models
```python
@dataclass
class RemediationStep:
    description: str       # plain English explanation for non-experts
    command: str           # exact kubectl command
    is_destructive: bool

@dataclass
class RemediationPlan:
    plan_id: str
    incident_id: str
    summary: str
    steps: list[RemediationStep]
    risk_level: str        # "low" | "medium" | "high"
    risk_reason: str
    rollback_steps: list[RemediationStep]
    generated_at: datetime
    model_used: str

@dataclass
class RemediationResult:
    plan_id: str
    executed_at: datetime
    success: bool
    output: str            # stdout/stderr of kubectl commands
    error: str | None
```

### AI Prompt Design
The remediation prompt must:
1. Include the full finding + diagnosis context
2. Explicitly instruct the LLM to only suggest safe, reversible actions
3. Require structured JSON output (use LiteLLM response_format)
4. Include the cluster namespace list so the LLM knows the scope
5. Explain each step in plain English suitable for non-experts

---

## UX Design

### Dashboard changes
- AI Diagnoses tab: each diagnosis card gets a "Suggest Fix" button
- Clicking opens a **Remediation Modal**:
  - Header: incident summary + risk badge (green/yellow/red)
  - Body: numbered steps, each with plain-English description + kubectl command in a code block
  - Rollback section (collapsed by default)
  - Footer: "Copy All Commands" button + "Apply Fix" button (Phase 2 only)
- After applying: inline status shows each step running, then green check or red X per step

### Non-expert language requirement
All AI-generated text must use plain language. The LLM prompt must explicitly instruct the model to avoid Kubernetes jargon and always explain *why* a fix works.

Examples:
- Bad: "Scale the Deployment resource to increase replica count"
- Good: "Create more copies of this app so it can handle more load (currently 1 copy, will set to 2)"

---

## Implementation Phases

### Phase 1 - Suggest Fix only, no execution (first PR)
- [ ] Add `RemediationPlan` and `RemediationStep` models to `models.py`
- [ ] Add `generate_remediation()` method to AI provider (`ai/provider.py`)
- [ ] Add `remediate_now(incident_id)` to `WatchService`
- [ ] Add `POST /api/remediate/{incident_id}` to `api.py`
- [ ] Add "Suggest Fix" button and modal to `dashboard.html`
- [ ] Add `GET /api/audit` endpoint (log generation events only in Phase 1)
- [ ] Unit tests: mock LLM response, verify plan structure and risk level validation

### Phase 2 - Execute with approval gate (follow-up PR)
- [ ] Add `execute_remediation()` to `WatchService` (uses Kubernetes Python client, no subprocess)
- [ ] Add `POST /api/remediate/{plan_id}/apply` with mandatory dry-run step
- [ ] Add separate `argus-ops-remediator` ClusterRole in `deploy/k8s/rbac.yaml`
- [ ] Add confirmation dialog with risk-based gate (typing resource name for `high` risk)
- [ ] Add Audit tab to dashboard
- [ ] Add automatic re-scan trigger after successful execution with before/after diff
- [ ] Integration tests: mock Kubernetes client calls, verify audit log entries

---

## Security Considerations

- Remediation execution uses the same in-cluster ServiceAccount as the rest of argus-ops -- no privilege escalation path
- `high` risk actions are blocked by default; require explicit opt-in flag: `remediation.allow_high_risk: false`
- All commands are logged before execution (write-ahead log pattern)
- Namespace scope strictly enforced: argus-ops will not remediate resources outside its configured `namespaces` list
- No `--force` flags permitted in generated commands
- No destructive cluster-level operations (namespace delete, node delete, PVC delete)

---

## Open Questions

1. Should Phase 2 execution use the Kubernetes Python client directly (type-safe, no subprocess) or shell out to `kubectl` (simpler output capture, easier for users to reproduce manually)?
2. Should the "Apply Fix" button be gated by a config flag (`remediation.enabled: false` by default) so deployers have explicit control?
3. For multi-step plans, should execution be all-or-nothing (auto-rollback on failure) or step-by-step with a pause and confirm between each step?

---

## Related Files

- Current AI Diagnosis: `src/argus_ops/web/watch_service.py` (`diagnose_now()`)
- Current RBAC: `deploy/k8s/rbac.yaml`
- Dashboard AI tab: `src/argus_ops/web/templates/dashboard.html`
- AI provider: `src/argus_ops/ai/provider.py`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: AI-powered remediation - allow users to request fixes directly from the dashboard #1

Overview

Problem Statement

Functional Requirements

FR-1: Remediation Plan Generation (Phase 1 - No execution)

FR-2: Safe Auto-Execution (Phase 2 - With approval gate)

FR-3: Audit Trail

FR-4: Re-scan After Fix

Technical Design

New RBAC permissions required

New WatchService methods

New API endpoints

New data models

AI Prompt Design

UX Design

Dashboard changes

Non-expert language requirement

Implementation Phases

Phase 1 - Suggest Fix only, no execution (first PR)

Phase 2 - Execute with approval gate (follow-up PR)

Security Considerations

Open Questions

Related Files

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat: AI-powered remediation - allow users to request fixes directly from the dashboard #1

Description

Overview

Problem Statement

Functional Requirements

FR-1: Remediation Plan Generation (Phase 1 - No execution)

FR-2: Safe Auto-Execution (Phase 2 - With approval gate)

FR-3: Audit Trail

FR-4: Re-scan After Fix

Technical Design

New RBAC permissions required

New WatchService methods

New API endpoints

New data models

AI Prompt Design

UX Design

Dashboard changes

Non-expert language requirement

Implementation Phases

Phase 1 - Suggest Fix only, no execution (first PR)

Phase 2 - Execute with approval gate (follow-up PR)

Security Considerations

Open Questions

Related Files

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions