fix: eliminate dashboard/snapshot data race and event-loop blocking#8
Open
henryiii wants to merge 1 commit into
Open
fix: eliminate dashboard/snapshot data race and event-loop blocking#8henryiii wants to merge 1 commit into
henryiii wants to merge 1 commit into
Conversation
The gRPC server and the FastAPI dashboard share one process and event loop, and the dashboard reads the live ChunkedHist objects the gRPC handlers mutate. This addresses three related async-correctness issues: - Data race (pfackeldey#4): the dashboard offloaded histogram_to_plot_json to a worker thread, where .tolist() ran on the live numpy chunk array while a concurrent gRPC Fill mutated it in place (NumPy drops the GIL on large arrays) or resized self._chunks. Split the work: capture_plot_json_inputs() copies the chunk synchronously on the loop (atomic vs. gRPC handlers), and only the .tolist() conversion (to_plot_json) is offloaded to the thread. - Event-loop blocking (pfackeldey#6): the Snapshot handler serialized large histograms (.tobytes() over many chunks) synchronously on the loop, blocking all other RPCs. Capture an atomic ChunkedHist.copy() on the loop, then serialize that copy via asyncio.to_thread. The partial-selector path already returns a fresh copy; the delete path copies after pop. TypeError/ValueError raised inside the thread still propagate to the existing FAILED_PRECONDITION handler. - Task GC (pfackeldey#5): the dashboard push loop was started with a bare create_task() whose result was discarded, so it could be garbage-collected mid-flight. Keep a strong reference in the lifespan and cancel it on shutdown. Public REST/WS JSON shape (including the version field) is unchanged. Assisted-by: ClaudeCode:claude-opus-4.8
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🤖 AI text below 🤖
The gRPC server (
grpc.aio) and the FastAPI dashboard run in the same process and event loop, and the dashboard reads the liveChunkedHistobjects that the gRPC handlers mutate. This PR fixes three related async-correctness issues. The common approach: take an atomic copy on the event-loop thread (where synchronous code cannot be preempted by gRPC handlers) and offload only the heavy work toasyncio.to_thread.Finding #4 — data race between dashboard worker thread and gRPC mutation
Both
get_histogram(REST) and_send_hist_data(WS) offloadedhistogram_to_plot_jsonto a worker thread. Inside,_chunk_valuescalls.tolist()on the live numpy chunk array, while a concurrent gRPCFillmutates that same array in place (target[...] += source) or reassignsself._chunks. NumPy releases the GIL on large arrays, so the worker could observe torn data or a dict resized mid-lookup.Fix: split the work in
histogram_json.py.capture_plot_json_inputs()does the chunk lookup and an independentcopy()synchronously on the loop (newChunkedHist.chunk_view_copy), returning a snapshot that owns its array; only the.tolist()conversion (to_plot_json) is offloaded. The bridge now calls capture-on-loop thento_thread(to_plot_json, ...). JSON shape and theversionfield (derived fromentry.last_access) are unchanged.Finding #6 — gRPC
Snapshotblocks the event loopSnapshotserialized large histograms (.tobytes()over many chunks) synchronously on the loop, blocking all other RPCs for the duration. Fix: capture an atomicChunkedHist.copy()(new method copying_chunksand each axis'sknown_keys) on the loop, thenawait asyncio.to_thread(serialize_chunked_hist_payload, ...). Applies to both the full-snapshot and partial-selector paths (the latter already returns a fresh copy via__getitem__); thedelete_from_serverpath copies after pop.TypeError/ValueErrorraised inside the thread still propagate to the existingFAILED_PRECONDITIONhandler when awaited.Finding #5 — push-loop task can be garbage-collected
create_app's_lifespanstarted the push loop with a bareasyncio.create_task(...)whose result was discarded; per the asyncio docs such a task can be GC'd mid-flight. Fix: keep a strong reference in the lifespan andcancel()+ await it on shutdown.Validation
uvx ty check src— no new diagnostics (the 2 pre-existingIntCategory/StrCategoryarg-type errors in_chunk_axis_for_specare unrelated and present onmain)uv run pytest tests/test_dashboard -q— 36 passeduv run pytest tests/test_grpc_integration.py -q -k snapshot— 14 passedPart of #4
🤖 Generated with Claude Code