feat(server): /metrics endpoint + HTTP request metrics #35

Open
Zorlin wants to merge 1 commit into main from feat/metrics-endpoint

Zorlin (Collaborator) commented Jun 26, 2026

What

Adds a Prometheus /metrics endpoint + an HTTP request middleware so the serving
pipeline — the surface that starves under concurrent PXE imaging — can be
scraped and bisected. Today there is zero visibility (no /metrics; only
TraceLayer INFO logs).

Surface

GET /metrics (text/plain; version=0.0.4) exposes, per bounded endpoint label:

  • http_requests_in_flight (gauge)
  • http_requests_total (counter: method, status)
  • http_request_duration_seconds (histogram)
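
One way an in-flight gauge like the one above can stay accurate when a handler returns early or errors is a drop guard. The sketch below is std-only and illustrative: `InFlight` and `InFlightGuard` are hypothetical names, and an `AtomicI64` stands in for the real gauge. It is not taken from this PR's middleware.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Hypothetical stand-in for the `http_requests_in_flight` gauge.
pub struct InFlight(AtomicI64);

impl InFlight {
    pub const fn new() -> Self {
        InFlight(AtomicI64::new(0))
    }

    /// Current number of requests between `enter()` and guard drop.
    pub fn current(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }

    /// Increment now; the decrement happens when the returned guard is dropped.
    pub fn enter(&self) -> InFlightGuard<'_> {
        self.0.fetch_add(1, Ordering::Relaxed);
        InFlightGuard(&self.0)
    }
}

/// Decrements the gauge on drop, so early returns cannot leak a count.
pub struct InFlightGuard<'a>(&'a AtomicI64);

impl Drop for InFlightGuard<'_> {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }
}
```

A middleware would take the guard before calling the inner service and let it fall out of scope once the response is produced.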

Endpoint labels are a coarse classify() of the path, not the raw path
(/boot/{mac} is high-cardinality). The labels are boot / image / os_image /
boot_debian_asset / pxelinux_config / ipxe_artifact / api / infra. This
keeps the latency histogram meaningful under a boot storm and lets us see which
serve surface saturates (the iPXE-script per-boot store read vs the big OS image
vs JIT QCOW2 conversion).
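
A minimal sketch of what a bounded classifier of this kind might look like. The label names come from the list above, but every path pattern here (`/images`, `/os-images`, and so on) is an assumption for illustration, not the actual dragonfly-server route table. Several labels (boot_debian_asset, pxelinux_config, ipxe_artifact) are left out for brevity.

```rust
/// Map a request path to one of a small, fixed set of endpoint labels.
/// Path patterns are hypothetical; only the label names come from the PR text.
pub fn classify(path: &str) -> &'static str {
    // Look only at the first path segment so per-machine suffixes
    // (for example the MAC in /boot/{mac}) never reach the label.
    let first = path
        .trim_start_matches('/')
        .split('/')
        .next()
        .unwrap_or("");

    match first {
        "boot" => "boot",
        "images" => "image",        // assumed route prefix
        "os-images" => "os_image",  // assumed route prefix
        "api" => "api",
        // Everything else, including /metrics itself, falls into one bucket.
        _ => "infra",
    }
}
```

Because the return type is `&'static str` drawn from a closed `match`, the label set cannot grow with traffic, whatever paths clients request.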

How

  • Deps: metrics + metrics-exporter-prometheus (avoided axum-prometheus: it
    pulled in tower 0.5 + tower-http 0.6 alongside the workspace's 0.4/0.5).
  • Recorder installed at server start (run); handle cached in a OnceLock so
    /metrics needs no AppState change.
  • Middleware applied as the outermost router layer; /metrics registered as a
    public route (no auth extraction — like the /boot PXE endpoints).
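
The OnceLock caching described above can be sketched with the standard library alone. `RenderHandle`, `install`, and `render_metrics` are hypothetical names. In the real code the cached value would presumably be the exporter's handle (metrics-exporter-prometheus provides `PrometheusHandle`, whose `render()` returns the exposition text), but this sketch uses a stand-in so it compiles without external crates.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the exporter's render handle.
pub struct RenderHandle(pub String);

impl RenderHandle {
    /// Return the current exposition text (here: a fixed snapshot).
    pub fn render(&self) -> String {
        self.0.clone()
    }
}

/// Process-wide slot for the handle, so handlers need no shared app state.
static HANDLE: OnceLock<RenderHandle> = OnceLock::new();

/// Called at server start. Returns true if this call installed the handle,
/// false if one was already installed (the first one is kept).
pub fn install(handle: RenderHandle) -> bool {
    HANDLE.set(handle).is_ok()
}

/// Body for GET /metrics; None if the recorder was never installed.
pub fn render_metrics() -> Option<String> {
    HANDLE.get().map(RenderHandle::render)
}
```

The trade-off is a global in exchange for leaving the state type untouched; a second `install` call is a no-op rather than a panic, which matters when tests build the router more than once.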

Tests

  • classify truth-table + a cardinality-bound property test (1000 distinct MACs
    → one boot label).
  • End-to-end: drive a request through the instrumented router, scrape /metrics,
    assert 200 + text/plain + the counter/histogram/labels present.
  • Full dragonfly-server suite green (193).
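
For the end-to-end assertion on labels, a small helper along these lines is one option. It is a sketch only: `has_sample` is a hypothetical name, and the label key `endpoint` is an assumption about how the middleware names its label.

```rust
/// True if `body` (Prometheus text exposition) contains a sample line whose
/// metric name starts with `metric` and which carries `endpoint="<label>"`.
/// The `endpoint` label key is an assumption, not confirmed by the PR text.
pub fn has_sample(body: &str, metric: &str, endpoint: &str) -> bool {
    let needle = format!("endpoint=\"{endpoint}\"");
    body.lines()
        // Skip `# HELP` / `# TYPE` comment lines.
        .filter(|line| !line.starts_with('#'))
        .any(|line| line.starts_with(metric) && line.contains(&needle))
}
```

Matching on the name prefix also accepts suffixed series such as `_sum` and `_count`, which is convenient when checking that the duration metric is present at all.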

Why

First piece of the repeatable PXE-starvation harness — a Jetpack playbook repro
instrumented on controller / server / client. This is the server plane. Next:
jetpack controller telemetry + the repro/collect playbooks in infra.

🤖 Generated with Claude Code

Concurrent PXE imaging can starve Dragonfly's HTTP serving pipeline, but there
was no visibility into it (no /metrics, only TraceLayer logs). Add a Prometheus
/metrics endpoint plus a request middleware that records, per bounded endpoint
label: in-flight gauge, request counter (method/status), and latency histogram.

Endpoint labels are a coarse classification of the path (classify), not the raw
path — /boot/{mac} is high-cardinality; collapsing to "boot"/"image"/"os_image"/
etc. keeps the histogram meaningful under a boot storm. Uses metrics +
metrics-exporter-prometheus (axum-prometheus was avoided: it dragged tower 0.5 +
tower-http 0.6 alongside the workspace's 0.4/0.5). The recorder handle is cached
in a OnceLock so /metrics needs no AppState change.

Wire: install the recorder at server start (run), register GET /metrics, and
apply the middleware as the outermost router layer.

Tests: classify truth-table + cardinality bound, and an end-to-end test driving
a request through the instrumented router and scraping /metrics for the
counter/histogram/labels. Full dragonfly-server suite green (193).

Co-Authored-By: Claude <noreply@anthropic.com>