## Overview
The Node Health tab in the argus-ops dashboard currently shows a flat card grid of nodes with minimal information. Users who are not Kubernetes experts cannot answer basic operational questions from this view:
- Which nodes are master (control-plane) nodes and which are worker nodes?
- What pods are running on each node?
- What application does each pod represent?
- What namespace does each pod belong to?
- What is the health status of each individual container inside a pod?
- Is there a Service exposing any pod to external traffic, and what is its stable endpoint?
This issue tracks the redesign of the Node Health tab into a full cluster topology view that answers all of these questions without requiring any kubectl knowledge.
## Problem Statement

### Current behavior
`GET /api/nodes` returns only the node list collected by `KubernetesCollector._collect_nodes()`. The response includes `name`, `conditions`, `labels`, allocatable/capacity, `os`, `arch`, `unschedulable`, and `taints`. It does not include pods or services.
The dashboard renders one card per node showing: Ready/Not-Ready badge, OS, arch, condition flags. There is no pod information, no role distinction, no namespace grouping, and no service information.
### Impact on non-expert users
A user monitoring their cluster sees a grid of node cards with names like `ip-10-1-1-71` and has no way to know:
- Whether that node is a master or worker
- What is actually running on it
- Whether the application they care about (e.g., `shopify-rpa`) is healthy
- What URL or IP address routes traffic to their application
## Functional Requirements

### FR-1: Node Role Classification
Each node card must display its role clearly:
- Detect master (control-plane) nodes by checking for the label `node-role.kubernetes.io/control-plane` or `node-role.kubernetes.io/master` (legacy)
- Detect worker nodes by the absence of control-plane labels, or by the presence of custom role labels (e.g., `rpa-worker`, `app-worker`, `node-pool`)
- Display the role as a prominent badge at the top of each node card: `MASTER` (dark) or `WORKER` (green)
- Show all custom role labels (e.g., `node-pool=dashboard`, `rpa-worker=w2`) as secondary tags below the role badge
### FR-2: Pod List Per Node
Each node card must show all pods scheduled on that node:
- Group pods by namespace with a namespace header (e.g., `monitoring`, `rpa`, `default`)
- For each pod, display:
  - Pod name (shortened: remove the random suffix if a deterministic name exists, e.g., `argus-ops-7d9f...` -> `argus-ops`)
  - Application label (`app=` or `app.kubernetes.io/name=`) as the human-readable app name
  - Phase badge: `Running` (green), `Pending` (yellow), `Failed`/`CrashLoopBackOff` (red), `Completed` (gray)
  - Container count and ready count (e.g., `2/2 ready`)
  - Restart count if > 0 (shown as a warning, e.g., `12 restarts`)
- System pods on master nodes (`kube-system` namespace) should be collapsed by default behind a "Show system pods" toggle
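The pod-name shortening can be done heuristically, since Kubernetes generates pod suffixes and pod-template hashes from a restricted alphabet (no vowels, no 0/1/3). A minimal sketch — the helper name `shorten_pod_name` is hypothetical, not an existing function in the codebase:

```python
import re

# Generated trailing segments (ReplicaSet pod-template-hash and the random
# pod suffix) are drawn from a restricted alphabet; strip up to two of them.
# Best-effort: assumes typical Deployment/StatefulSet naming.
_GENERATED = re.compile(r"^[bcdfghjklmnpqrstvwxz2456789]{5,10}$")

def shorten_pod_name(name: str) -> str:
    """Best-effort removal of generated suffixes from a pod name."""
    parts = name.split("-")
    stripped = 0
    while len(parts) > 1 and stripped < 2 and _GENERATED.fullmatch(parts[-1]):
        parts.pop()
        stripped += 1
    return "-".join(parts)
```

StatefulSet pods like `prometheus-0` keep their ordinal, since a single digit never matches the generated-segment pattern.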
### FR-3: Container Detail (Expandable)
Clicking on a pod row should expand to show per-container detail:
- Container name and image (image tag highlighted)
- Individual container state: `Running`, `Waiting` (with reason), `Terminated` (with exit code)
- CPU and memory requests/limits
- Restart count
### FR-4: Namespace Summary Panel
Add a separate "Namespace" panel or sidebar alongside the node grid:
- List all active namespaces discovered during the scan
- For each namespace: total pod count, running count, failed count
- Clicking a namespace filters the pod view to show only pods in that namespace across all nodes
- Color-code namespaces: green if all pods healthy, yellow if any pending, red if any failed/crashloop
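The color-coding rule above is a simple precedence check (failed beats pending beats healthy). A minimal sketch — the function name is hypothetical, and pod dicts are assumed to carry a `phase` key as in the FR-2 pod records:

```python
def namespace_color(pods: list) -> str:
    """Traffic-light color for a namespace per FR-4:
    red if any pod failed/crashlooping, yellow if any pending,
    green otherwise (including an empty namespace)."""
    phases = [p.get("phase", "") for p in pods]
    if any(ph in ("Failed", "CrashLoopBackOff") for ph in phases):
        return "red"
    if "Pending" in phases:
        return "yellow"
    return "green"
```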
### FR-5: Service Information
Expose Kubernetes Service resources in the dashboard:
- Collect services via `v1.list_namespaced_service()` in `KubernetesCollector`
- For each service, capture:
  - `name`, `namespace`, `type` (ClusterIP / NodePort / LoadBalancer)
  - `clusterIP`
  - `ports`: list of `{port, targetPort, nodePort (if NodePort), protocol}`
  - `selector`: the label selector used to route traffic to pods
  - `externalIPs` and `loadBalancerIngress` (if applicable)
- In the pod view, show a "Service" badge next to pods that are targeted by a service, with the stable endpoint (NodePort URL or LoadBalancer IP) shown on hover or expand

New API endpoint required:

```
GET /api/services -> { services: ServiceInfo[], total: int, last_scan: str }
```
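The "Service" badge requires deciding which service targets which pod. In Kubernetes, a Service routes to a pod when every key/value pair of the service's selector appears in the pod's labels. A minimal sketch (helper name hypothetical):

```python
def service_targets_pod(selector: dict, pod_labels: dict) -> bool:
    """True when the service's selector is a subset of the pod's labels.
    Services with no selector (e.g. ExternalName, or manually managed
    Endpoints) match nothing here by design."""
    if not selector:
        return False
    return all(pod_labels.get(k) == v for k, v in selector.items())
```

This subset match is what `kube-proxy` effectively implements via Endpoints, so it should agree with the cluster's actual routing for selector-based services.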
### FR-6: Pod-to-Node Assignment in API
The current `/api/nodes` endpoint does not include pod data. There are two options:
Option A (preferred): Add pod data into /api/nodes response as pods_by_namespace:
{
"name": "ip-10-1-1-81",
"role": "worker",
"role_labels": ["node-pool=dashboard", "app-worker=w4"],
"pods_by_namespace": {
"monitoring": [ { "name": "argus-ops-xxx", "app": "argus-ops", "phase": "Running", ... } ],
"zrpa-demo": [ { "name": "simple-web-xxx", "app": "simple-web", "phase": "Running", ... } ]
},
...
}Option B: Add a new GET /api/topology endpoint that returns the full cluster topology in one response (nodes + pods grouped by node + services).
### FR-7: Visual Layout
Replace the current flat card grid with a structured layout:
- Left panel (25% width): Namespace list with health summary badges
- Main panel (75% width): Node cards, each expandable, sorted: master nodes first, then workers
- Each node card has two sections:
- Header: node name, role badge, OS, arch, Ready status, CPU/memory allocatable
- Body (collapsible): namespace-grouped pod list
## Technical Design

### Collector changes (`src/argus_ops/collectors/k8s.py`)
Additions to `_collect_nodes()`:

```python
# Derive role from labels
labels = node.metadata.labels or {}
is_master = (
    "node-role.kubernetes.io/control-plane" in labels
    or "node-role.kubernetes.io/master" in labels
)
role = "master" if is_master else "worker"

# Collect all role-like labels for display
role_labels = [
    f"{k}={v}" for k, v in labels.items()
    if k not in {"kubernetes.io/os", "kubernetes.io/arch",
                 "kubernetes.io/hostname", "beta.kubernetes.io/os",
                 "beta.kubernetes.io/arch"}
    and ("role" in k or "pool" in k or "worker" in k)
]

node_info["role"] = role
node_info["role_labels"] = role_labels
```

New `_collect_services()` method:
```python
def _collect_services(self, v1: Any, namespace: str) -> HealthSnapshot:
    services = v1.list_namespaced_service(namespace)
    service_data = []
    for svc in services.items:
        ports = []
        for p in (svc.spec.ports or []):
            port_info = {
                "port": p.port,
                "target_port": str(p.target_port),
                "protocol": p.protocol,
            }
            if p.node_port:
                port_info["node_port"] = p.node_port
            ports.append(port_info)
        service_data.append({
            "name": svc.metadata.name,
            "namespace": namespace,
            "type": svc.spec.type,
            "cluster_ip": svc.spec.cluster_ip,
            "ports": ports,
            "selector": svc.spec.selector or {},
            # The Python client maps the externalIPs field to external_i_ps
            "external_ips": svc.spec.external_i_ps or [],
            "load_balancer_ingress": [
                {"ip": i.ip, "hostname": i.hostname}
                for i in (svc.status.load_balancer.ingress or [])
            ] if svc.status.load_balancer else [],
        })
    return HealthSnapshot(
        collector_name=self.name,
        infra_type=self.infra_type,
        target=f"k8s://{namespace}/services",
        data={"services": service_data, "namespace": namespace},
        metrics={"services." + namespace + ".total": float(len(service_data))},
    )
```

Pod-to-node mapping in `WatchService`:
`_run_scan()` currently stores nodes and findings separately. After this change, it should also store pods (all pods across all namespaces, each with `node_name`) and services. The node grid rendering then joins pods to nodes client-side or server-side.
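The server-side variant of that join could be sketched as follows. The helper name and the flat pod-record shape (each pod carrying `node_name` and `namespace`) are assumptions based on the description above:

```python
from collections import defaultdict

def group_pods_by_node(pods: list) -> dict:
    """Build {node_name: {namespace: [pod, ...]}} from the flat pod list
    stored by WatchService, ready to merge into each /api/nodes entry
    as pods_by_namespace."""
    by_node = defaultdict(lambda: defaultdict(list))
    for pod in pods:
        node = pod.get("node_name")
        if not node:  # unscheduled (Pending) pods have no node yet
            continue
        by_node[node][pod["namespace"]].append(pod)
    # Convert nested defaultdicts to plain dicts for JSON serialization
    return {n: dict(ns_map) for n, ns_map in by_node.items()}
```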
### API changes (`src/argus_ops/web/api.py`)
- `/api/nodes`: extend the response to include `role`, `role_labels`, and `pods_by_namespace` per node
- `/api/services`: new endpoint returning all service objects grouped by namespace
- `/api/topology` (optional): single endpoint returning nodes + pods + services in one response for efficient dashboard load
### Dashboard changes (`src/argus_ops/web/templates/dashboard.html`)
- Replace the `.node-grid` flex layout with a two-panel layout (namespace sidebar + node cards)
- Add CSS for `.ns-panel`, `.ns-item`, `.ns-badge`, `.pod-list`, `.pod-row`, `.pod-expand`, `.svc-badge`
- Add a `loadTopology()` function replacing `loadNodes()`
- Add namespace filter state: clicking a namespace highlights it and filters pod rows
- Add expand/collapse per node card body
- Add expand/collapse per pod row (shows container detail)
- System pod toggle per node card
## UX Design

### Node Card (after redesign)
```
+--------------------------------------------------+
| [MASTER] ip-10-1-1-71                  [Ready v] |
| OS: linux / arch: amd64 | CPU: 4 | Mem: 15Gi     |
+--------------------------------------------------+
| kube-system (collapsed)    [Show 8 system pods]  |
| monitoring                                       |
|   [Running] argus-ops       2/2   0 restarts     |
|   [Running] prometheus-0    1/1   0 restarts     |
+--------------------------------------------------+

+--------------------------------------------------+
| [WORKER] ip-10-1-1-81                  [Ready v] |
| node-pool=dashboard | app-worker=w4              |
| OS: linux / arch: amd64 | CPU: 4 | Mem: 15Gi     |
+--------------------------------------------------+
| zrpa-demo                                        |
|   [Running] simple-web  [NodePort :30800]  1/1   |
| monitoring                                       |
|   [Running] argus-ops   [NodePort :30880]  1/1   |
+--------------------------------------------------+
```
### Service badge tooltip (on hover)

```
Service: argus-ops (NodePort)
Port 8080 -> NodePort 30880
Stable URL: http://<node-ip>:30880
Selector: app=argus-ops
```
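Producing the "Stable URL" means substituting a real node address for `<node-ip>`. One possible sketch, preferring an ExternalIP and falling back to an InternalIP; the helper name is hypothetical, and the address-list shape mirrors the Kubernetes node `status.addresses` structure:

```python
from typing import Optional

def node_port_url(node_addresses: list, node_port: int) -> Optional[str]:
    """Build a NodePort URL from a node's address list
    ([{"type": "InternalIP"|"ExternalIP"|..., "address": str}, ...]),
    preferring ExternalIP over InternalIP. Returns None if neither exists."""
    best = None
    for addr in node_addresses:
        if addr.get("type") == "ExternalIP":
            best = addr["address"]
            break
        if addr.get("type") == "InternalIP" and best is None:
            best = addr["address"]
    return f"http://{best}:{node_port}" if best else None
```

Any Ready node's address works, since NodePort services listen on every node; picking one deterministically keeps the displayed URL stable between scans.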
### Non-expert language guidelines
All labels and states must use plain language:
| Technical term | Dashboard display |
|---|---|
| control-plane node | Master Node (runs the cluster brain) |
| worker node | Worker Node (runs your apps) |
| pod phase: Running | Running (healthy) |
| CrashLoopBackOff | Crashing (restarting repeatedly) |
| NodePort service | External Access: port XXXXX |
| ClusterIP service | Internal only (no external access) |
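The table above could be carried in code as a simple lookup so the API and dashboard stay consistent. A sketch; the constant and function names are hypothetical:

```python
from typing import Optional

# Plain-language display strings keyed by technical term,
# mirroring the guideline table; unlisted terms fall back unchanged.
PLAIN_LANGUAGE = {
    "control-plane": "Master Node (runs the cluster brain)",
    "worker": "Worker Node (runs your apps)",
    "Running": "Running (healthy)",
    "CrashLoopBackOff": "Crashing (restarting repeatedly)",
    "ClusterIP": "Internal only (no external access)",
}

def display_term(term: str) -> str:
    return PLAIN_LANGUAGE.get(term, term)

def display_service(svc_type: str, node_port: Optional[int] = None) -> str:
    # NodePort needs the actual port number substituted in
    if svc_type == "NodePort" and node_port:
        return f"External Access: port {node_port}"
    return PLAIN_LANGUAGE.get(svc_type, svc_type)
```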
## Implementation Phases

### Phase 1 - Node role + pod list (first PR)
- Add `role` and `role_labels` fields to `_collect_nodes()` in `k8s.py`
- Store pod snapshots in `WatchService` and join pods to nodes in `get_state()`
- Extend `/api/nodes` to include `pods_by_namespace` per node
- Update the Node Health tab: add role badge, pod list grouped by namespace, system pod toggle
- Add the namespace sidebar panel
### Phase 2 - Service information (follow-up PR)
- Add `_collect_services()` to `KubernetesCollector`
- Add `services` to the `WatchService` state
- Add the `GET /api/services` endpoint
- Add a service badge on pod rows with stable endpoint display
- Add a service list section to the namespace sidebar
### Phase 3 - Container detail expand (follow-up PR)
- Add expandable container detail row to each pod
- Show image, state, CPU/memory requests/limits, restart count
## Related Files

- Node collection: `src/argus_ops/collectors/k8s.py` (`_collect_nodes()`, `_collect_pods()`)
- State storage: `src/argus_ops/web/watch_service.py` (`_run_scan()`, `get_state()`)
- API: `src/argus_ops/web/api.py` (`/api/nodes`)
- Dashboard: `src/argus_ops/web/templates/dashboard.html` (Node Health tab, `loadNodes()`)