
feat: redesign Node Health tab into full cluster topology view (nodes + pods + namespaces + services) #2

@mason5052

Description


Overview

The Node Health tab in the argus-ops dashboard currently shows a flat card grid of nodes with minimal information. Users who are not Kubernetes experts cannot answer basic operational questions from this view:

  • Which nodes are master (control-plane) nodes and which are worker nodes?
  • What pods are running on each node?
  • What application does each pod represent?
  • What namespace does each pod belong to?
  • What is the health status of each individual container inside a pod?
  • Is there a Service exposing any pod to external traffic, and what is its stable endpoint?

This issue tracks the redesign of the Node Health tab into a full cluster topology view that answers all of these questions without requiring any kubectl knowledge.


Problem Statement

Current behavior

GET /api/nodes returns only the node list collected by KubernetesCollector._collect_nodes(). The response includes: name, conditions, labels, allocatable/capacity, os, arch, unschedulable, taints. It does not include pods or services.

The dashboard renders one card per node showing: Ready/Not-Ready badge, OS, arch, condition flags. There is no pod information, no role distinction, no namespace grouping, and no service information.

Impact on non-expert users

A user monitoring their cluster sees a grid of node cards with names like ip-10-1-1-71 and has no way to know:

  1. Whether that node is a master or worker
  2. What is actually running on it
  3. Whether the application they care about (e.g., shopify-rpa) is healthy
  4. What URL or IP address routes traffic to their application

Functional Requirements

FR-1: Node Role Classification

Each node card must display its role clearly:

  • Detect master/control-plane nodes by checking for the label node-role.kubernetes.io/control-plane or node-role.kubernetes.io/master (legacy)
  • Detect worker nodes by the absence of control-plane labels or the presence of custom role labels (e.g., rpa-worker, app-worker, node-pool)
  • Display role as a prominent badge: MASTER (dark) or WORKER (green) at the top of each node card
  • Show all custom role labels (e.g., node-pool=dashboard, rpa-worker=w2) as secondary tags below the role badge

FR-2: Pod List Per Node

Each node card must show all pods scheduled on that node:

  • Group pods by namespace with a namespace header (e.g., monitoring, rpa, default)
  • For each pod, display:
    • Pod name (shortened: remove random suffix if deterministic name exists, e.g., argus-ops-7d9f... -> argus-ops)
    • Application label (app= or app.kubernetes.io/name=) as the human-readable app name
    • Phase badge: Running (green), Pending (yellow), Failed / CrashLoopBackOff (red), Completed (gray)
    • Container count and ready count (e.g., 2/2 ready)
    • Restart count if > 0 (shown as warning: 12 restarts)
  • System pods on master nodes (kube-system namespace) should be collapsed by default with a "Show system pods" toggle
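The pod-name shortening described above can be approximated with a regex heuristic. This is a sketch, not code from the repo; the suffix alphabet is an assumption about how Deployment/ReplicaSet-generated names look:

```python
import re

# Heuristic: Deployment pods end in "-<replicaset-hash>-<5-char id>"
# (e.g. "argus-ops-7d9f8c6b5-x2k4q"); DaemonSet pods end in just the
# 5-char id. StatefulSet ordinals like "-0" are left untouched.
_GENERATED_SUFFIX = re.compile(r"(-[0-9a-z]{6,10})?-[a-z0-9]{5}$")

def shorten_pod_name(name: str) -> str:
    """Strip the generated suffix so 'argus-ops-7d9f8c6b5-x2k4q' -> 'argus-ops'."""
    short = _GENERATED_SUFFIX.sub("", name)
    return short or name  # fall back if the whole name looked like a suffix
```

Because this is a heuristic, the full pod name should still be available on hover or expand.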

FR-3: Container Detail (Expandable)

Clicking on a pod row should expand to show per-container detail:

  • Container name and image (image tag highlighted)
  • Individual container state: Running, Waiting (reason), Terminated (exit code)
  • CPU and memory requests/limits
  • Restart count
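Extracting this per-container detail from the `V1Pod` objects returned by the Python kubernetes client could look like the sketch below (the helper name `container_details` is hypothetical):

```python
def container_details(pod) -> list:
    """Per-container detail for FR-3, read from a kubernetes-client V1Pod."""
    # Index runtime statuses by container name; a just-scheduled pod may
    # have no container_statuses yet.
    statuses = {s.name: s for s in (pod.status.container_statuses or [])}
    details = []
    for c in pod.spec.containers:
        st = statuses.get(c.name)
        state = "Unknown"
        if st is not None:
            # V1ContainerState has exactly one of running/waiting/terminated set
            if st.state.running:
                state = "Running"
            elif st.state.waiting:
                state = f"Waiting ({st.state.waiting.reason})"
            elif st.state.terminated:
                state = f"Terminated (exit code {st.state.terminated.exit_code})"
        details.append({
            "name": c.name,
            "image": c.image,  # UI highlights the tag portion
            "state": state,
            "restarts": st.restart_count if st else 0,
            "requests": (c.resources.requests or {}) if c.resources else {},
            "limits": (c.resources.limits or {}) if c.resources else {},
        })
    return details
```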

FR-4: Namespace Summary Panel

Add a separate "Namespace" panel or sidebar alongside the node grid:

  • List all active namespaces discovered during the scan
  • For each namespace: total pod count, running count, failed count
  • Clicking a namespace filters the pod view to show only pods in that namespace across all nodes
  • Color-code namespaces: green if all pods healthy, yellow if any pending, red if any failed/crashloop
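The color-coding rule reduces to a small helper; the pod-dict shape (a `phase` key) is an assumption matching the pod entries described in FR-2:

```python
def namespace_color(pods: list) -> str:
    """Traffic-light color for a namespace per FR-4: red beats yellow beats green."""
    phases = [p.get("phase", "") for p in pods]
    if any(ph in ("Failed", "CrashLoopBackOff") for ph in phases):
        return "red"
    if any(ph == "Pending" for ph in phases):
        return "yellow"
    return "green"
```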

FR-5: Service Information

Expose Kubernetes Service resources in the dashboard:

  • Collect services via v1.list_namespaced_service() in KubernetesCollector
  • For each service, capture:
    • name, namespace, type (ClusterIP / NodePort / LoadBalancer)
    • clusterIP
    • ports: list of {port, targetPort, nodePort (if NodePort), protocol}
    • selector: label selector used to route traffic to pods
    • externalIPs and loadBalancerIngress (if applicable)
  • In the pod view, show a "Service" badge next to pods that are targeted by a service, with the stable endpoint (NodePort URL or LoadBalancer IP) shown on hover or expand

New API endpoint required:

GET /api/services  ->  { services: ServiceInfo[], total: int, last_scan: str }
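The selector-to-pod match and the stable-endpoint derivation behind the Service badge might look like this sketch; both helper names and the service dict shape (mirroring the fields listed above) are hypothetical:

```python
from typing import Optional

def service_targets_pod(selector: dict, pod_labels: dict) -> bool:
    """A Service routes to a pod iff every selector key/value matches the
    pod's labels. Empty selectors (manually-managed endpoints) select nothing."""
    if not selector:
        return False
    return all(pod_labels.get(k) == v for k, v in selector.items())

def stable_endpoint(svc: dict, node_ip: str) -> Optional[str]:
    """Stable endpoint for the badge tooltip: NodePort URL or the first
    LoadBalancer ingress address; None for ClusterIP-only services."""
    if svc["type"] == "NodePort":
        np = next((p["node_port"] for p in svc["ports"] if "node_port" in p), None)
        return f"http://{node_ip}:{np}" if np else None
    if svc["type"] == "LoadBalancer":
        for ing in svc.get("load_balancer_ingress", []):
            addr = ing.get("ip") or ing.get("hostname")
            if addr:
                return f"http://{addr}"
    return None
```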

FR-6: Pod-to-Node Assignment in API

The current /api/nodes endpoint does not include pod data. Two options:

Option A (preferred): Add pod data into /api/nodes response as pods_by_namespace:

{
  "name": "ip-10-1-1-81",
  "role": "worker",
  "role_labels": ["node-pool=dashboard", "app-worker=w4"],
  "pods_by_namespace": {
    "monitoring": [ { "name": "argus-ops-xxx", "app": "argus-ops", "phase": "Running", ... } ],
    "zrpa-demo":  [ { "name": "simple-web-xxx", "app": "simple-web", "phase": "Running", ... } ]
  },
  ...
}

Option B: Add a new GET /api/topology endpoint that returns the full cluster topology in one response (nodes + pods grouped by node + services).

FR-7: Visual Layout

Replace the current flat card grid with a structured layout:

  • Left panel (25% width): Namespace list with health summary badges
  • Main panel (75% width): Node cards, each expandable, sorted: master nodes first, then workers
  • Each node card has two sections:
    • Header: node name, role badge, OS, arch, Ready status, CPU/memory allocatable
    • Body (collapsible): namespace-grouped pod list

Technical Design

Collector changes (src/argus_ops/collectors/k8s.py)

_collect_nodes() additions:

# Derive role from labels
labels = node.metadata.labels or {}
is_master = (
    "node-role.kubernetes.io/control-plane" in labels
    or "node-role.kubernetes.io/master" in labels
)
role = "master" if is_master else "worker"

# Collect custom role-like labels for display (skip well-known node labels
# and the node-role.* labels already reflected in the role badge)
role_labels = [
    f"{k}={v}" for k, v in labels.items()
    if not k.startswith("node-role.kubernetes.io/")
    and k not in {"kubernetes.io/os", "kubernetes.io/arch",
                  "kubernetes.io/hostname", "beta.kubernetes.io/os",
                  "beta.kubernetes.io/arch"}
    and ("role" in k or "pool" in k or "worker" in k)
]

node_info["role"] = role
node_info["role_labels"] = role_labels

New _collect_services() method:

def _collect_services(self, v1: Any, namespace: str) -> HealthSnapshot:
    """Collect Service resources in one namespace (FR-5)."""
    services = v1.list_namespaced_service(namespace)
    service_data = []
    for svc in services.items:
        ports = []
        for p in (svc.spec.ports or []):
            port_info = {
                "port": p.port,
                "target_port": str(p.target_port),
                "protocol": p.protocol,
            }
            if p.node_port:
                port_info["node_port"] = p.node_port
            ports.append(port_info)

        service_data.append({
            "name": svc.metadata.name,
            "namespace": namespace,
            "type": svc.spec.type,
            "cluster_ip": svc.spec.cluster_ip,
            "ports": ports,
            "selector": svc.spec.selector or {},
            "external_ips": svc.spec.external_i_ps or [],
            "load_balancer_ingress": [
                {"ip": i.ip, "hostname": i.hostname}
                for i in (svc.status.load_balancer.ingress or [])
            ] if svc.status.load_balancer else [],
        })
    return HealthSnapshot(
        collector_name=self.name,
        infra_type=self.infra_type,
        target=f"k8s://{namespace}/services",
        data={"services": service_data, "namespace": namespace},
        metrics={f"services.{namespace}.total": float(len(service_data))},
    )

Pod-to-node mapping in WatchService:

_run_scan() currently stores nodes and findings separately. After this change, it should also store pods (all pods across all namespaces, each with node_name) and services. The node grid rendering then joins pods to nodes client-side or server-side.
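If the join is done server-side, it is a simple grouping pass. The helper name `group_pods` and the pod dict keys are hypothetical, assuming `_run_scan()` stores each pod with its `node_name` and `namespace`:

```python
from collections import defaultdict

def group_pods(pods: list) -> dict:
    """Join the flat pod list onto nodes: node_name -> namespace -> [pods]."""
    by_node = defaultdict(lambda: defaultdict(list))
    for pod in pods:
        by_node[pod["node_name"]][pod["namespace"]].append(pod)
    # Convert to plain dicts so the result JSON-serializes cleanly
    return {node: dict(namespaces) for node, namespaces in by_node.items()}
```

The result plugs directly into the `pods_by_namespace` field proposed in FR-6 Option A.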

API changes (src/argus_ops/web/api.py)

  • /api/nodes: extend response to include role, role_labels, and pods_by_namespace per node
  • /api/services: new endpoint returning all service objects grouped by namespace
  • /api/topology (optional): single endpoint returning nodes + pods + services in one response for efficient dashboard load
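A framework-agnostic sketch of assembling the `GET /api/services` payload described in FR-5 (the per-namespace snapshot shape passed in is an assumption, not the actual WatchService state):

```python
from datetime import datetime, timezone

def build_services_response(service_snapshots: list) -> dict:
    """Flatten per-namespace service snapshots into the /api/services shape:
    { services: ServiceInfo[], total: int, last_scan: str }"""
    services = [svc
                for snap in service_snapshots
                for svc in snap.get("services", [])]
    return {
        "services": services,
        "total": len(services),
        "last_scan": datetime.now(timezone.utc).isoformat(),
    }
```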

Dashboard changes (src/argus_ops/web/templates/dashboard.html)

  • Replace .node-grid flex layout with two-panel layout (namespace sidebar + node cards)
  • Add CSS for .ns-panel, .ns-item, .ns-badge, .pod-list, .pod-row, .pod-expand, .svc-badge
  • Add loadTopology() function replacing loadNodes()
  • Add namespace filter state: clicking a namespace highlights it and filters pod rows
  • Add expand/collapse per node card body
  • Add expand/collapse per pod row (shows container detail)
  • System pod toggle per node card

UX Design

Node Card (after redesign)

+--------------------------------------------------+
| [MASTER]  ip-10-1-1-71              [Ready v]    |
| OS: linux / arch: amd64 | CPU: 4 | Mem: 15Gi     |
+--------------------------------------------------+
| kube-system  (collapsed)  [Show 8 system pods]   |
| monitoring                                        |
|   [Running]  argus-ops         2/2    0 restarts  |
|   [Running]  prometheus-0      1/1    0 restarts  |
+--------------------------------------------------+

+--------------------------------------------------+
| [WORKER]  ip-10-1-1-81              [Ready v]    |
| node-pool=dashboard | app-worker=w4              |
| OS: linux / arch: amd64 | CPU: 4 | Mem: 15Gi     |
+--------------------------------------------------+
| zrpa-demo                                        |
|   [Running]  simple-web    [NodePort :30800]  1/1 |
| monitoring                                       |
|   [Running]  argus-ops     [NodePort :30880]  1/1 |
+--------------------------------------------------+

Service badge tooltip (on hover)

Service: argus-ops (NodePort)
  Port 8080 -> NodePort 30880
  Stable URL: http://<node-ip>:30880
  Selector: app=argus-ops

Non-expert language guidelines

All labels and states must use plain language:

| Technical term     | Dashboard display                      |
|--------------------|----------------------------------------|
| control-plane node | Master Node (runs the cluster brain)   |
| worker node        | Worker Node (runs your apps)           |
| pod phase: Running | Running (healthy)                      |
| CrashLoopBackOff   | Crashing (restarting repeatedly)       |
| NodePort service   | External Access: port XXXXX            |
| ClusterIP service  | Internal only (no external access)     |
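These substitutions can live in a small lookup table on the dashboard side; `display_phase` and the dict names are hypothetical helpers built from the table above, with unknown values falling through unchanged:

```python
ROLE_DISPLAY = {
    "master": "Master Node (runs the cluster brain)",
    "worker": "Worker Node (runs your apps)",
}

PHASE_DISPLAY = {
    "Running": "Running (healthy)",
    "CrashLoopBackOff": "Crashing (restarting repeatedly)",
}

def display_phase(phase: str) -> str:
    """Plain-language pod phase; unknown phases pass through unchanged."""
    return PHASE_DISPLAY.get(phase, phase)
```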

Implementation Phases

Phase 1 - Node role + pod list (first PR)

  • Add role and role_labels fields to _collect_nodes() in k8s.py
  • Store pod snapshots in WatchService and join pods to nodes in get_state()
  • Extend /api/nodes to include pods_by_namespace per node
  • Update Node Health tab: add role badge, pod list grouped by namespace, system pod toggle
  • Add namespace sidebar panel

Phase 2 - Service information (follow-up PR)

  • Add _collect_services() to KubernetesCollector
  • Add services to WatchService state
  • Add GET /api/services endpoint
  • Add service badge on pod rows with stable endpoint display
  • Add service list section to namespace sidebar

Phase 3 - Container detail expand (follow-up PR)

  • Add expandable container detail row to each pod
  • Show image, state, CPU/memory requests/limits, restart count

Related Files

  • Node collection: src/argus_ops/collectors/k8s.py (_collect_nodes(), _collect_pods())
  • State storage: src/argus_ops/web/watch_service.py (_run_scan(), get_state())
  • API: src/argus_ops/web/api.py (/api/nodes)
  • Dashboard: src/argus_ops/web/templates/dashboard.html (Node Health tab, loadNodes())

Metadata

Labels: enhancement (New feature or request)