fix(shm): Add memory barriers for cross-process shared memory visibility #29819
base: main
Conversation
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte writes to signal between writer and reader processes. On multi-core systems, these writes may stay in CPU store buffers and not be visible to other processes running on different cores, causing indefinite spinning/freeze under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release (which provides full barrier semantics per POSIX.1-2008) at four critical points:
- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes a freeze observed during sustained concurrent batch inference (~500+ requests) where both writer and readers would spin indefinitely waiting for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <[email protected]>
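For orientation, here is a minimal sketch of the kind of fence helper the patch describes, assembled from the fragments quoted in the review below. The lock name _memory_fence_lock and the function name memory_fence come from that review; the exact signature and placement in the real patch may differ.

```python
import threading

# Module-level lock used purely for its memory-barrier side effect.
# The name matches the `_memory_fence_lock` snippet quoted in the review below.
_memory_fence_lock = threading.Lock()


def memory_fence() -> None:
    """Issue a full memory barrier.

    Acquiring and releasing an uncontended lock goes through
    pthread_mutex_lock/unlock (or the platform equivalent), which the PR
    relies on for its "synchronizes memory" guarantee: stores issued before
    the fence become visible to other cores, and loads after it are not
    served from stale values.
    """
    with _memory_fence_lock:
        pass
```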
Code Review
This pull request correctly identifies and fixes a critical race condition in the shared memory IPC mechanism by introducing memory barriers. The use of threading.Lock for this purpose is a standard and portable approach. The analysis in the pull request description is excellent. I have a couple of suggestions to improve the robustness and documentation of the new memory_fence function.
Snippet under review (docstring and comment from the patch):

```python
    Implementation acquires and immediately releases a lock. On POSIX systems,
    pthread_mutex_lock/unlock have full memory barrier semantics (sequentially
    consistent). This is a lightweight operation (~20ns) that guarantees:
    - All stores before the barrier are visible to other threads/processes
    - All loads after the barrier see the latest values
    Reference: POSIX.1-2008 specifies that mutex operations synchronize memory.
    """
    # Lock acquire/release provides full memory barrier semantics on POSIX
```
The documentation and comments are very POSIX-centric. While correct for POSIX, threading.Lock also provides memory barrier semantics on other platforms like Windows, making this implementation portable. The documentation should reflect this to avoid confusion for developers on other platforms.
Suggested change:

```diff
-    Implementation acquires and immediately releases a lock. On POSIX systems,
-    pthread_mutex_lock/unlock have full memory barrier semantics (sequentially
-    consistent). This is a lightweight operation (~20ns) that guarantees:
-    - All stores before the barrier are visible to other threads/processes
-    - All loads after the barrier see the latest values
-    Reference: POSIX.1-2008 specifies that mutex operations synchronize memory.
-    """
-    # Lock acquire/release provides full memory barrier semantics on POSIX
+    Implementation acquires and immediately releases a lock. This provides full
+    memory barrier semantics on major platforms (e.g., POSIX, Windows), making
+    it a portable way to enforce memory ordering. This lightweight operation
+    (~20ns) guarantees:
+    - All stores before the barrier are visible to other threads/processes
+    - All loads after the barrier see the latest values
+    """
+    # Lock acquire/release provides full memory barrier semantics.
```
Snippet under review:

```python
    _memory_fence_lock.acquire()
    _memory_fence_lock.release()
```
For improved robustness, it's better to use a with statement for the lock. This ensures that the lock is always released, even if an unexpected exception (like a KeyboardInterrupt) occurs between acquire() and release(). While unlikely in this short function, it's a best practice that guarantees lock release and prevents potential deadlocks.
Suggested change:

```diff
-    _memory_fence_lock.acquire()
-    _memory_fence_lock.release()
+    with _memory_fence_lock:
+        pass
```
Thanks @kitaekatt! I'm just a bit surprised that this hasn't been encountered more often if it's so easy to reproduce. Are you pinning CPUs or setting some particular NUMA config?
Summary
Fixes freeze/hang during sustained concurrent batch inference caused by missing memory barriers in the shared memory ring buffer protocol.
Root Cause
The shm_broadcast.py shared memory IPC uses plain byte writes (metadata_buffer[0] = 1) to signal between writer and reader processes. On multi-core systems, these writes stay in CPU store buffers and may not be visible to other processes running on different cores. This causes indefinite spinning where both the writer and the readers wait for flags that were updated but are not yet visible across CPU cores (illustrated in the sketch below).
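For illustration only, a stripped-down version of the failing handshake with hypothetical helper names (writer_signal, reader_spin); the real shm_broadcast.py ring buffer manages multiple blocks and per-reader flags, but the visibility problem is the same.

```python
# Hypothetical, simplified stand-ins for the real writer/reader paths.
# `metadata_buffer` represents the shared-memory metadata region.

def writer_signal(metadata_buffer) -> None:
    # Plain byte store: per the failure mode described above, this can sit
    # in the writing core's store buffer without being visible to a reader
    # process on another core.
    metadata_buffer[0] = 1


def reader_spin(metadata_buffer) -> None:
    # Plain byte load in a spin loop: keeps observing the stale value and
    # never exits, which is the freeze reported in this PR.
    while metadata_buffer[0] != 1:
        pass
```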
The Fix
Add explicit memory barriers using the threading.Lock acquire/release pattern (which provides full memory barrier semantics per POSIX.1-2008) at four critical points (sketched below):
- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag
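A simplified sketch of where the four fences sit, reusing a compact version of the memory_fence() helper and assuming a single written flag plus per-reader read flags; the real acquire_write()/acquire_read() in shm_broadcast.py are context managers over a ring of blocks, so this is illustrative rather than the actual patch.

```python
import threading

_fence_lock = threading.Lock()


def memory_fence() -> None:
    with _fence_lock:  # uncontended acquire/release acts as a full barrier
        pass


def acquire_write(metadata_buffer):
    while True:
        memory_fence()                 # 1) barrier before reading the flags
        if metadata_buffer[0] == 0:    # block free: not yet written
            # ... copy the payload into the shared data region ...
            metadata_buffer[0] = 1     # set the written flag
            memory_fence()             # 2) barrier after setting the flag
            return


def acquire_read(metadata_buffer, reader_rank: int):
    while True:
        memory_fence()                 # 3) barrier before reading the flags
        if metadata_buffer[0] == 1:    # a chunk has been written
            # ... read the payload from the shared data region ...
            metadata_buffer[1 + reader_rank] = 1  # set this reader's read flag
            memory_fence()             # 4) barrier after setting the flag
            return
```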
Why threading.Lock?
On POSIX systems, pthread_mutex_lock/unlock provides sequentially consistent memory barrier semantics. The lock acquire/release pattern (~20ns overhead) is the most portable and well-defined way to get memory barriers in Python without requiring platform-specific code.
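The ~20ns figure is hardware- and interpreter-dependent; a quick micro-benchmark along these lines (an upper bound, since it also counts Python loop overhead) can be used to check it on a given machine.

```python
import threading
import time

_lock = threading.Lock()
N = 1_000_000

start = time.perf_counter()
for _ in range(N):
    _lock.acquire()
    _lock.release()
elapsed = time.perf_counter() - start

# Includes interpreter loop overhead, so the fence itself costs less than this.
print(f"~{elapsed / N * 1e9:.0f} ns per uncontended acquire/release pair")
```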
Test Results
Before fix: Freeze at batch ~41 (~492 concurrent requests)
After fix: Successfully completed 120 batches (1440 requests) without freeze
Test configuration:
Test Plan