fix: eliminate dashboard/snapshot data race and event-loop blocking #8

Open
henryiii wants to merge 1 commit into pfackeldey:main from henryiii:fix/dashboard-async-correctness
@henryiii

Contributor

🤖 AI text below 🤖

The gRPC server (grpc.aio) and the FastAPI dashboard run in the same process and event loop, and the dashboard reads the live ChunkedHist objects that the gRPC handlers mutate. This PR fixes three related async-correctness issues. The common approach: take an atomic copy on the event-loop thread (where synchronous code cannot be preempted by gRPC handlers) and offload only the heavy work to asyncio.to_thread.

Finding #4 — data race between dashboard worker thread and gRPC mutation

Both get_histogram (REST) and _send_hist_data (WS) offloaded histogram_to_plot_json to a worker thread. Inside, _chunk_values calls .tolist() on the live numpy chunk array, while a concurrent gRPC Fill mutates that same array in place (target[...] += source) or reassigns self._chunks. NumPy can release the GIL during operations on large arrays, so the worker could observe torn data or a dict resized mid-lookup.

Fix: split the work in histogram_json.py. capture_plot_json_inputs() does the chunk lookup and an independent copy() synchronously on the loop (new ChunkedHist.chunk_view_copy), returning a snapshot that owns its array; only the .tolist() conversion (to_plot_json) is offloaded. The bridge now calls capture-on-loop then to_thread(to_plot_json, ...). JSON shape and the version field (derived from entry.last_access) are unchanged.
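The capture-then-offload split described above can be sketched as follows. This is a minimal illustration and not the code in this PR: `LiveChunk`, `PlotInputs`, and `capture_inputs` are hypothetical stand-ins, and a stdlib `array` takes the place of the numpy chunk so the example has no dependencies.

```python
import asyncio
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class PlotInputs:
    """Snapshot that owns its data, so a worker thread can read it safely."""
    values: array
    version: float


class LiveChunk:
    """Hypothetical stand-in for a live chunk that gRPC handlers mutate."""

    def __init__(self, n: int) -> None:
        self.values = array("d", [0.0] * n)
        self.last_access = 0.0

    def fill(self, index: int, weight: float = 1.0) -> None:
        # In-place mutation, as a Fill handler running on the loop would do.
        self.values[index] += weight
        self.last_access += 1.0


def capture_inputs(chunk: LiveChunk) -> PlotInputs:
    # Synchronous and free of awaits: no other coroutine on the loop can
    # run between reading the chunk and finishing the copy.
    return PlotInputs(values=array("d", chunk.values), version=chunk.last_access)


def to_plot_json(inputs: PlotInputs) -> dict:
    # The expensive conversion only ever touches the private copy.
    return {"values": inputs.values.tolist(), "version": inputs.version}


async def get_histogram(chunk: LiveChunk) -> dict:
    inputs = capture_inputs(chunk)                         # on the event loop
    return await asyncio.to_thread(to_plot_json, inputs)   # in a worker thread


async def demo() -> dict:
    chunk = LiveChunk(4)
    chunk.fill(1)
    pending = asyncio.create_task(get_histogram(chunk))
    await asyncio.sleep(0)  # let get_histogram capture and start the thread
    chunk.fill(1)           # later mutation does not reach the snapshot
    return await pending
```

Running `asyncio.run(demo())` returns the state as of the capture, regardless of the fill that happens while the conversion is in flight.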

Finding #6 — gRPC Snapshot blocks the event loop

Snapshot serialized large histograms (.tobytes() over many chunks) synchronously on the loop, blocking all other RPCs for the duration. Fix: capture an atomic ChunkedHist.copy() (new method copying _chunks and each axis's known_keys) on the loop, then await asyncio.to_thread(serialize_chunked_hist_payload, ...). Applies to both the full-snapshot and partial-selector paths (the latter already returns a fresh copy via __getitem__); the delete_from_server path copies after pop. TypeError/ValueError raised inside the thread still propagate to the existing FAILED_PRECONDITION handler when awaited.
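As a rough sketch of the Snapshot flow described above, under the assumption that serialization is a pure function of the copied histogram: `Hist`, `serialize`, and `FailedPrecondition` below are hypothetical stand-ins (the real handler works with ChunkedHist, serialize_chunked_hist_payload, and the gRPC status code). The point is that an exception raised in the worker thread resurfaces at the `await`, so an existing `except` clause still catches it.

```python
import asyncio


class FailedPrecondition(Exception):
    """Hypothetical stand-in for aborting the RPC with FAILED_PRECONDITION."""


class Hist:
    """Hypothetical stand-in for ChunkedHist: chunk key -> raw buffer."""

    def __init__(self) -> None:
        self._chunks: dict[tuple[int, ...], bytearray] = {}

    def copy(self) -> "Hist":
        # Copy the dict and every buffer so the result shares no mutable state.
        new = Hist()
        new._chunks = {key: bytearray(buf) for key, buf in self._chunks.items()}
        return new


def serialize(hist: Hist) -> bytes:
    # Heavy work; runs in a worker thread on the private copy.
    if not hist._chunks:
        raise ValueError("nothing to serialize")
    return b"".join(bytes(hist._chunks[key]) for key in sorted(hist._chunks))


async def snapshot(live: Hist) -> bytes:
    frozen = live.copy()  # synchronous copy on the loop, no await in between
    try:
        return await asyncio.to_thread(serialize, frozen)
    except (TypeError, ValueError) as exc:
        # Errors raised inside the thread are re-raised here when awaited.
        raise FailedPrecondition(str(exc)) from exc
```

While `serialize` runs in the thread, the loop stays free to serve other coroutines, and any of them may mutate `live` without affecting `frozen`.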

Finding #5 — push-loop task can be garbage-collected

create_app's _lifespan started the push loop with a bare asyncio.create_task(...) whose result was discarded; per the asyncio docs such a task can be GC'd mid-flight. Fix: keep a strong reference in the lifespan and cancel() + await it on shutdown.
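The strong-reference pattern can be sketched without FastAPI as below. This is an illustration only: `push_loop` and the `ticks` list are hypothetical, and a real FastAPI lifespan receives the app object instead of the argument used here.

```python
import asyncio
import contextlib


async def push_loop(ticks: list[int]) -> None:
    # Hypothetical periodic job standing in for the dashboard push loop.
    n = 0
    while True:
        ticks.append(n)
        n += 1
        await asyncio.sleep(0.01)


@contextlib.asynccontextmanager
async def lifespan(ticks: list[int]):
    # The local variable keeps a strong reference for the whole lifespan;
    # the event loop itself only holds weak references to tasks.
    task = asyncio.create_task(push_loop(ticks))
    try:
        yield task
    finally:
        task.cancel()
        # Awaiting the cancelled task lets it finish unwinding before exit.
        with contextlib.suppress(asyncio.CancelledError):
            await task


async def demo() -> tuple[bool, bool]:
    ticks: list[int] = []
    async with lifespan(ticks) as task:
        await asyncio.sleep(0.05)
    return task.cancelled(), len(ticks) > 0
```

Awaiting the task after `cancel()` matters: without it, shutdown can proceed while the task is still mid-cancellation.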

Validation

  • uvx ty check src — no new diagnostics (the 2 pre-existing IntCategory/StrCategory arg-type errors in _chunk_axis_for_spec are unrelated and present on main)
  • uv run pytest tests/test_dashboard -q — 36 passed
  • uv run pytest tests/test_grpc_integration.py -q -k snapshot — 14 passed
  • full suite: 112 passed

Part of #4

🤖 Generated with Claude Code

The gRPC server and the FastAPI dashboard share one process and event loop,
and the dashboard reads the live ChunkedHist objects the gRPC handlers mutate.
This addresses three related async-correctness issues:

- Data race (pfackeldey#4): the dashboard offloaded histogram_to_plot_json to a worker
  thread, where .tolist() ran on the live numpy chunk array while a concurrent
  gRPC Fill mutated it in place (NumPy drops the GIL on large arrays) or
  resized self._chunks. Split the work: capture_plot_json_inputs() copies the
  chunk synchronously on the loop (atomic vs. gRPC handlers), and only the
  .tolist() conversion (to_plot_json) is offloaded to the thread.

- Event-loop blocking (pfackeldey#6): the Snapshot handler serialized large histograms
  (.tobytes() over many chunks) synchronously on the loop, blocking all other
  RPCs. Capture an atomic ChunkedHist.copy() on the loop, then serialize that
  copy via asyncio.to_thread. The partial-selector path already returns a fresh
  copy; the delete path copies after pop. TypeError/ValueError raised inside
  the thread still propagate to the existing FAILED_PRECONDITION handler.

- Task GC (pfackeldey#5): the dashboard push loop was started with a bare create_task()
  whose result was discarded, so it could be garbage-collected mid-flight. Keep
  a strong reference in the lifespan and cancel it on shutdown.

Public REST/WS JSON shape (including the version field) is unchanged.

Assisted-by: ClaudeCode:claude-opus-4.8