
feat: redesign Node Health tab into full cluster topology view (nodes + pods + namespaces + services) #2

@mason5052

Description


Overview

The Node Health tab in the argus-ops dashboard currently shows a flat card grid of nodes with minimal information. Users who are not Kubernetes experts cannot answer basic operational questions from this view:

  • Which nodes are master (control-plane) nodes and which are worker nodes?
  • What pods are running on each node?
  • What application does each pod represent?
  • What namespace does each pod belong to?
  • What is the health status of each individual container inside a pod?
  • Is there a Service exposing any pod to external traffic, and what is its stable endpoint?

This issue tracks the redesign of the Node Health tab into a full cluster topology view that answers all of these questions without requiring any kubectl knowledge.


Problem Statement

Current behavior

GET /api/nodes returns only the node list collected by KubernetesCollector._collect_nodes(). The response includes: name, conditions, labels, allocatable/capacity, os, arch, unschedulable, taints. It does not include pods or services.

The dashboard renders one card per node showing: Ready/Not-Ready badge, OS, arch, condition flags. There is no pod information, no role distinction, no namespace grouping, and no service information.

Impact on non-expert users

A user monitoring their cluster sees a grid of node cards with names like ip-10-1-1-71 and has no way to know:

  1. Whether that node is a master or worker
  2. What is actually running on it
  3. Whether the application they care about (e.g., shopify-rpa) is healthy
  4. What URL or IP address routes traffic to their application

Functional Requirements

FR-1: Node Role Classification

Each node card must display its role clearly:

  • Detect master/control-plane nodes by checking for the label node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master (legacy)
  • Detect worker nodes by the absence of control-plane labels or the presence of custom role labels (e.g., rpa-worker, app-worker, node-pool)
  • Display role as a prominent badge: MASTER (dark) or WORKER (green) at the top of each node card
  • Show all custom role labels (e.g., node-pool=dashboard, rpa-worker=w2) as secondary tags below the role badge

FR-2: Pod List Per Node

Each node card must show all pods scheduled on that node:

  • Group pods by namespace with a namespace header (e.g., monitoring, rpa, default)
  • For each pod, display:
    • Pod name (shortened: remove random suffix if deterministic name exists, e.g., argus-ops-7d9f... -> argus-ops)
    • Application label (app= or app.kubernetes.io/name=) as the human-readable app name
    • Phase badge: Running (green), Pending (yellow), Failed / CrashLoopBackOff (red), Completed (gray)
    • Container count and ready count (e.g., 2/2 ready)
    • Restart count if > 0 (shown as warning: 12 restarts)
  • System pods on master nodes (kube-system namespace) should be collapsed by default with a "Show system pods" toggle
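The pod-name shortening described above can be approximated with a regex heuristic. This is a sketch, not code from the repo; the suffix alphabet is an assumption about how Deployment/ReplicaSet-generated names look:

```python
import re

# Heuristic: Deployment pods end in "-<replicaset-hash>-<5-char id>"
# (e.g. "argus-ops-7d9f8c6b5-x2k4q"); DaemonSet pods end in just the
# 5-char id. StatefulSet ordinals like "-0" are left untouched.
_GENERATED_SUFFIX = re.compile(r"(-[0-9a-z]{6,10})?-[a-z0-9]{5}$")

def shorten_pod_name(name: str) -> str:
    """Strip the generated suffix so 'argus-ops-7d9f8c6b5-x2k4q' -> 'argus-ops'."""
    short = _GENERATED_SUFFIX.sub("", name)
    return short or name  # fall back if the whole name looked like a suffix
```

Because this is a heuristic, the full pod name should still be available on hover or expand.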

FR-3: Container Detail (Expandable)

Clicking on a pod row should expand to show per-container detail:

  • Container name and image (image tag highlighted)
  • Individual container state: Running, Waiting (reason), Terminated (exit code)
  • CPU and memory requests/limits
  • Restart count
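Extracting this per-container detail from the `V1Pod` objects returned by the Python kubernetes client could look like the sketch below (the helper name `container_details` is hypothetical):

```python
def container_details(pod) -> list:
    """Per-container detail for FR-3, read from a kubernetes-client V1Pod."""
    # Index runtime statuses by container name; a just-scheduled pod may
    # have no container_statuses yet.
    statuses = {s.name: s for s in (pod.status.container_statuses or [])}
    details = []
    for c in pod.spec.containers:
        st = statuses.get(c.name)
        state = "Unknown"
        if st is not None:
            # V1ContainerState has exactly one of running/waiting/terminated set
            if st.state.running:
                state = "Running"
            elif st.state.waiting:
                state = f"Waiting ({st.state.waiting.reason})"
            elif st.state.terminated:
                state = f"Terminated (exit code {st.state.terminated.exit_code})"
        details.append({
            "name": c.name,
            "image": c.image,  # UI highlights the tag portion
            "state": state,
            "restarts": st.restart_count if st else 0,
            "requests": (c.resources.requests or {}) if c.resources else {},
            "limits": (c.resources.limits or {}) if c.resources else {},
        })
    return details
```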

FR-4: Namespace Summary Panel

Add a separate "Namespace" panel or sidebar alongside the node grid:

  • List all active namespaces discovered during the scan
  • For each namespace: total pod count, running count, failed count
  • Clicking a namespace filters the pod view to show only pods in that namespace across all nodes
  • Color-code namespaces: green if all pods healthy, yellow if any pending, red if any failed/crashloop
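The color-coding rule reduces to a small helper; the pod-dict shape (a `phase` key) is an assumption matching the pod entries described in FR-2:

```python
def namespace_color(pods: list) -> str:
    """Traffic-light color for a namespace per FR-4: red beats yellow beats green."""
    phases = [p.get("phase", "") for p in pods]
    if any(ph in ("Failed", "CrashLoopBackOff") for ph in phases):
        return "red"
    if any(ph == "Pending" for ph in phases):
        return "yellow"
    return "green"
```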

FR-5: Service Information

Expose Kubernetes Service resources in the dashboard:

  • Collect services via v1.list_namespaced_service() in KubernetesCollector
  • For each service, capture:
    • name, namespace, type (ClusterIP / NodePort / LoadBalancer)
    • clusterIP
    • ports: list of {port, targetPort, nodePort (if NodePort), protocol}
    • selector: label selector used to route traffic to pods
    • externalIPs and loadBalancerIngress (if applicable)
  • In the pod view, show a "Service" badge next to pods that are targeted by a service, with the stable endpoint (NodePort URL or LoadBalancer IP) shown on hover or expand

New API endpoint required:

GET /api/services  ->  { services: ServiceInfo[], total: int, last_scan: str }
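The selector-to-pod match and the stable-endpoint derivation behind the Service badge might look like this sketch; both helper names and the service dict shape (mirroring the fields listed above) are hypothetical:

```python
from typing import Optional

def service_targets_pod(selector: dict, pod_labels: dict) -> bool:
    """A Service routes to a pod iff every selector key/value matches the
    pod's labels. Empty selectors (manually-managed endpoints) select nothing."""
    if not selector:
        return False
    return all(pod_labels.get(k) == v for k, v in selector.items())

def stable_endpoint(svc: dict, node_ip: str) -> Optional[str]:
    """Stable endpoint for the badge tooltip: NodePort URL or the first
    LoadBalancer ingress address; None for ClusterIP-only services."""
    if svc["type"] == "NodePort":
        np = next((p["node_port"] for p in svc["ports"] if "node_port" in p), None)
        return f"http://{node_ip}:{np}" if np else None
    if svc["type"] == "LoadBalancer":
        for ing in svc.get("load_balancer_ingress", []):
            addr = ing.get("ip") or ing.get("hostname")
            if addr:
                return f"http://{addr}"
    return None
```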

FR-6: Pod-to-Node Assignment in API

The current /api/nodes endpoint does not include pod data. Two options:

Option A (preferred): Add pod data into /api/nodes response as pods_by_namespace:

{
  "name": "ip-10-1-1-81",
  "role": "worker",
  "role_labels": ["node-pool=dashboard", "app-worker=w4"],
  "pods_by_namespace": {
    "monitoring": [ { "name": "argus-ops-xxx", "app": "argus-ops", "phase": "Running", ... } ],
    "zrpa-demo":  [ { "name": "simple-web-xxx", "app": "simple-web", "phase": "Running", ... } ]
  },
  ...
}

Option B: Add a new GET /api/topology endpoint that returns the full cluster topology in one response (nodes + pods grouped by node + services).

FR-7: Visual Layout

Replace the current flat card grid with a structured layout:

  • Left panel (25% width): Namespace list with health summary badges
  • Main panel (75% width): Node cards, each expandable, sorted: master nodes first, then workers
  • Each node card has two sections:
    • Header: node name, role badge, OS, arch, Ready status, CPU/memory allocatable
    • Body (collapsible): namespace-grouped pod list

Technical Design

Collector changes (src/argus_ops/collectors/k8s.py)

_collect_nodes() additions:

# Derive role from labels
labels = node.metadata.labels or {}
is_master = (
    "node-role.kubernetes.io/control-plane" in labels
    or "node-role.kubernetes.io/master" in labels
)
role = "master" if is_master else "worker"

# Collect custom role-like labels for display (skip well-known node labels
# and the node-role.* labels already reflected in the role badge)
role_labels = [
    f"{k}={v}" for k, v in labels.items()
    if not k.startswith("node-role.kubernetes.io/")
    and k not in {"kubernetes.io/os", "kubernetes.io/arch",
                  "kubernetes.io/hostname", "beta.kubernetes.io/os",
                  "beta.kubernetes.io/arch"}
    and ("role" in k or "pool" in k or "worker" in k)
]

node_info["role"] = role
node_info["role_labels"] = role_labels

New _collect_services() method:

def _collect_services(self, v1: Any, namespace: str) -> HealthSnapshot:
    """Collect Service resources in one namespace (FR-5)."""
    services = v1.list_namespaced_service(namespace)
    service_data = []
    for svc in services.items:
        ports = []
        for p in (svc.spec.ports or []):
            port_info = {
                "port": p.port,
                "target_port": str(p.target_port),
                "protocol": p.protocol,
            }
            if p.node_port:
                port_info["node_port"] = p.node_port
            ports.append(port_info)

        service_data.append({
            "name": svc.metadata.name,
            "namespace": namespace,
            "type": svc.spec.type,
            "cluster_ip": svc.spec.cluster_ip,
            "ports": ports,
            "selector": svc.spec.selector or {},
            "external_ips": svc.spec.external_i_ps or [],
            "load_balancer_ingress": [
                {"ip": i.ip, "hostname": i.hostname}
                for i in (svc.status.load_balancer.ingress or [])
            ] if svc.status.load_balancer else [],
        })
    return HealthSnapshot(
        collector_name=self.name,
        infra_type=self.infra_type,
        target=f"k8s://{namespace}/services",
        data={"services": service_data, "namespace": namespace},
        metrics={f"services.{namespace}.total": float(len(service_data))},
    )

Pod-to-node mapping in WatchService:

_run_scan() currently stores nodes and findings separately. After this change, it should also store pods (all pods across all namespaces, each with node_name) and services. The node grid rendering then joins pods to nodes client-side or server-side.
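If the join is done server-side, it is a simple grouping pass. The helper name `group_pods` and the pod dict keys are hypothetical, assuming `_run_scan()` stores each pod with its `node_name` and `namespace`:

```python
from collections import defaultdict

def group_pods(pods: list) -> dict:
    """Join the flat pod list onto nodes: node_name -> namespace -> [pods]."""
    by_node = defaultdict(lambda: defaultdict(list))
    for pod in pods:
        by_node[pod["node_name"]][pod["namespace"]].append(pod)
    # Convert to plain dicts so the result JSON-serializes cleanly
    return {node: dict(namespaces) for node, namespaces in by_node.items()}
```

The result plugs directly into the `pods_by_namespace` field proposed in FR-6 Option A.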

API changes (src/argus_ops/web/api.py)

  • /api/nodes: extend response to include role, role_labels, and pods_by_namespace per node
  • /api/services: new endpoint returning all service objects grouped by namespace
  • /api/topology (optional): single endpoint returning nodes + pods + services in one response for efficient dashboard load
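A framework-agnostic sketch of assembling the `GET /api/services` payload described in FR-5 (the per-namespace snapshot shape passed in is an assumption, not the actual WatchService state):

```python
from datetime import datetime, timezone

def build_services_response(service_snapshots: list) -> dict:
    """Flatten per-namespace service snapshots into the /api/services shape:
    { services: ServiceInfo[], total: int, last_scan: str }"""
    services = [svc
                for snap in service_snapshots
                for svc in snap.get("services", [])]
    return {
        "services": services,
        "total": len(services),
        "last_scan": datetime.now(timezone.utc).isoformat(),
    }
```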

Dashboard changes (src/argus_ops/web/templates/dashboard.html)

  • Replace .node-grid flex layout with two-panel layout (namespace sidebar + node cards)
  • Add CSS for .ns-panel, .ns-item, .ns-badge, .pod-list, .pod-row, .pod-expand, .svc-badge
  • Add loadTopology() function replacing loadNodes()
  • Add namespace filter state: clicking a namespace highlights it and filters pod rows
  • Add expand/collapse per node card body
  • Add expand/collapse per pod row (shows container detail)
  • System pod toggle per node card

UX Design

Node Card (after redesign)

+--------------------------------------------------+
| [MASTER]  ip-10-1-1-71              [Ready v]    |
| OS: linux / arch: amd64 | CPU: 4 | Mem: 15Gi     |
+--------------------------------------------------+
| kube-system  (collapsed)  [Show 8 system pods]   |
| monitoring                                        |
|   [Running]  argus-ops         2/2    0 restarts  |
|   [Running]  prometheus-0      1/1    0 restarts  |
+--------------------------------------------------+

+--------------------------------------------------+
| [WORKER]  ip-10-1-1-81              [Ready v]    |
| node-pool=dashboard | app-worker=w4              |
| OS: linux / arch: amd64 | CPU: 4 | Mem: 15Gi     |
+--------------------------------------------------+
| zrpa-demo                                        |
|   [Running]  simple-web    [NodePort :30800]  1/1 |
| monitoring                                       |
|   [Running]  argus-ops     [NodePort :30880]  1/1 |
+--------------------------------------------------+

Service badge tooltip (on hover)

Service: argus-ops (NodePort)
  Port 8080 -> NodePort 30880
  Stable URL: http://<node-ip>:30880
  Selector: app=argus-ops

Non-expert language guidelines

All labels and states must use plain language:

| Technical term     | Dashboard display                      |
|--------------------|----------------------------------------|
| control-plane node | Master Node (runs the cluster brain)   |
| worker node        | Worker Node (runs your apps)           |
| pod phase: Running | Running (healthy)                      |
| CrashLoopBackOff   | Crashing (restarting repeatedly)       |
| NodePort service   | External Access: port XXXXX            |
| ClusterIP service  | Internal only (no external access)     |
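These substitutions can live in a small lookup table on the dashboard side; `display_phase` and the dict names are hypothetical helpers built from the table above, with unknown values falling through unchanged:

```python
ROLE_DISPLAY = {
    "master": "Master Node (runs the cluster brain)",
    "worker": "Worker Node (runs your apps)",
}

PHASE_DISPLAY = {
    "Running": "Running (healthy)",
    "CrashLoopBackOff": "Crashing (restarting repeatedly)",
}

def display_phase(phase: str) -> str:
    """Plain-language pod phase; unknown phases pass through unchanged."""
    return PHASE_DISPLAY.get(phase, phase)
```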

Implementation Phases

Phase 1 - Node role + pod list (first PR)

  • Add role and role_labels fields to _collect_nodes() in k8s.py
  • Store pod snapshots in WatchService and join pods to nodes in get_state()
  • Extend /api/nodes to include pods_by_namespace per node
  • Update Node Health tab: add role badge, pod list grouped by namespace, system pod toggle
  • Add namespace sidebar panel

Phase 2 - Service information (follow-up PR)

  • Add _collect_services() to KubernetesCollector
  • Add services to WatchService state
  • Add GET /api/services endpoint
  • Add service badge on pod rows with stable endpoint display
  • Add service list section to namespace sidebar

Phase 3 - Container detail expand (follow-up PR)

  • Add expandable container detail row to each pod
  • Show image, state, CPU/memory requests/limits, restart count

Related Files

  • Node collection: src/argus_ops/collectors/k8s.py (_collect_nodes(), _collect_pods())
  • State storage: src/argus_ops/web/watch_service.py (_run_scan(), get_state())
  • API: src/argus_ops/web/api.py (/api/nodes)
  • Dashboard: src/argus_ops/web/templates/dashboard.html (Node Health tab, loadNodes())

Metadata

Labels: enhancement (New feature or request)