fix(shm): Add memory barriers for cross-process shared memory visibility #29819
base: main
Conversation
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte writes to signal between writer and reader processes. On multi-core systems, these writes may stay in CPU store buffers and not be visible to other processes running on different cores, causing indefinite spinning/freeze under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release (which provides full barrier semantics per POSIX.1-2008) at four critical points:
- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes a freeze observed during sustained concurrent batch inference (~500+ requests) where both writer and readers would spin indefinitely waiting for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <[email protected]>
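For orientation, here is a minimal sketch of the kind of fence helper the patch describes, assembled from the fragments quoted in the review below. The lock name _memory_fence_lock and the function name memory_fence come from that review; the exact signature and placement in the real patch may differ.

```python
import threading

# Module-level lock used purely for its memory-barrier side effect.
# The name matches the `_memory_fence_lock` snippet quoted in the review below.
_memory_fence_lock = threading.Lock()


def memory_fence() -> None:
    """Issue a full memory barrier.

    Acquiring and releasing an uncontended lock goes through
    pthread_mutex_lock/unlock (or the platform equivalent), which the PR
    relies on for its "synchronizes memory" guarantee: stores issued before
    the fence become visible to other cores, and loads after it are not
    served from stale values.
    """
    with _memory_fence_lock:
        pass
```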
Code Review
This pull request correctly identifies and fixes a critical race condition in the shared memory IPC mechanism by introducing memory barriers. The use of threading.Lock for this purpose is a standard and portable approach. The analysis in the pull request description is excellent. I have a couple of suggestions to improve the robustness and documentation of the new memory_fence function.
Snippet under review (docstring and comment from the patch):

```python
    Implementation acquires and immediately releases a lock. On POSIX systems,
    pthread_mutex_lock/unlock have full memory barrier semantics (sequentially
    consistent). This is a lightweight operation (~20ns) that guarantees:
    - All stores before the barrier are visible to other threads/processes
    - All loads after the barrier see the latest values
    Reference: POSIX.1-2008 specifies that mutex operations synchronize memory.
    """
    # Lock acquire/release provides full memory barrier semantics on POSIX
```
The documentation and comments are very POSIX-centric. While correct for POSIX, threading.Lock also provides memory barrier semantics on other platforms like Windows, making this implementation portable. The documentation should reflect this to avoid confusion for developers on other platforms.
Suggested change:

```diff
-    Implementation acquires and immediately releases a lock. On POSIX systems,
-    pthread_mutex_lock/unlock have full memory barrier semantics (sequentially
-    consistent). This is a lightweight operation (~20ns) that guarantees:
-    - All stores before the barrier are visible to other threads/processes
-    - All loads after the barrier see the latest values
-    Reference: POSIX.1-2008 specifies that mutex operations synchronize memory.
-    """
-    # Lock acquire/release provides full memory barrier semantics on POSIX
+    Implementation acquires and immediately releases a lock. This provides full
+    memory barrier semantics on major platforms (e.g., POSIX, Windows), making
+    it a portable way to enforce memory ordering. This lightweight operation
+    (~20ns) guarantees:
+    - All stores before the barrier are visible to other threads/processes
+    - All loads after the barrier see the latest values
+    """
+    # Lock acquire/release provides full memory barrier semantics.
```
Snippet under review:

```python
    _memory_fence_lock.acquire()
    _memory_fence_lock.release()
```
For improved robustness, it's better to use a with statement for the lock. This ensures that the lock is always released, even if an unexpected exception (like a KeyboardInterrupt) occurs between acquire() and release(). While unlikely in this short function, it's a best practice that guarantees lock release and prevents potential deadlocks.
Suggested change:

```diff
-    _memory_fence_lock.acquire()
-    _memory_fence_lock.release()
+    with _memory_fence_lock:
+        pass
```
Thanks @kitaekatt! I'm just a bit surprised that this hasn't been encountered more often if it's so easy to reproduce. Are you pinning CPUs or setting some particular NUMA config?
Summary
Fixes freeze/hang during sustained concurrent batch inference caused by missing memory barriers in the shared memory ring buffer protocol.
Root Cause
The shm_broadcast.py shared memory IPC uses plain byte writes (metadata_buffer[0] = 1) to signal between writer and reader processes. On multi-core systems, these writes stay in CPU store buffers and may not be visible to other processes running on different cores. This causes indefinite spinning where both the writer and the readers wait for flags that were updated but are not yet visible across CPU cores (illustrated in the sketch below).
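For illustration only, a stripped-down version of the failing handshake with hypothetical helper names (writer_signal, reader_spin); the real shm_broadcast.py ring buffer manages multiple blocks and per-reader flags, but the visibility problem is the same.

```python
# Hypothetical, simplified stand-ins for the real writer/reader paths.
# `metadata_buffer` represents the shared-memory metadata region.

def writer_signal(metadata_buffer) -> None:
    # Plain byte store: per the failure mode described above, this can sit
    # in the writing core's store buffer without being visible to a reader
    # process on another core.
    metadata_buffer[0] = 1


def reader_spin(metadata_buffer) -> None:
    # Plain byte load in a spin loop: keeps observing the stale value and
    # never exits, which is the freeze reported in this PR.
    while metadata_buffer[0] != 1:
        pass
```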
The Fix
Add explicit memory barriers using the threading.Lock acquire/release pattern (which provides full memory barrier semantics per POSIX.1-2008) at four critical points (sketched below):
- In acquire_write(): before reading flags and after setting the written flag
- In acquire_read(): before reading flags and after setting the read flag
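A simplified sketch of where the four fences sit, reusing a compact version of the memory_fence() helper and assuming a single written flag plus per-reader read flags; the real acquire_write()/acquire_read() in shm_broadcast.py are context managers over a ring of blocks, so this is illustrative rather than the actual patch.

```python
import threading

_fence_lock = threading.Lock()


def memory_fence() -> None:
    with _fence_lock:  # uncontended acquire/release acts as a full barrier
        pass


def acquire_write(metadata_buffer):
    while True:
        memory_fence()                 # 1) barrier before reading the flags
        if metadata_buffer[0] == 0:    # block free: not yet written
            # ... copy the payload into the shared data region ...
            metadata_buffer[0] = 1     # set the written flag
            memory_fence()             # 2) barrier after setting the flag
            return


def acquire_read(metadata_buffer, reader_rank: int):
    while True:
        memory_fence()                 # 3) barrier before reading the flags
        if metadata_buffer[0] == 1:    # a chunk has been written
            # ... read the payload from the shared data region ...
            metadata_buffer[1 + reader_rank] = 1  # set this reader's read flag
            memory_fence()             # 4) barrier after setting the flag
            return
```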
Why threading.Lock?
On POSIX systems, pthread_mutex_lock/unlock provides sequentially consistent memory barrier semantics. The lock acquire/release pattern (~20ns overhead) is the most portable and well-defined way to get memory barriers in Python without requiring platform-specific code.
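The ~20ns figure is hardware- and interpreter-dependent; a quick micro-benchmark along these lines (an upper bound, since it also counts Python loop overhead) can be used to check it on a given machine.

```python
import threading
import time

_lock = threading.Lock()
N = 1_000_000

start = time.perf_counter()
for _ in range(N):
    _lock.acquire()
    _lock.release()
elapsed = time.perf_counter() - start

# Includes interpreter loop overhead, so the fence itself costs less than this.
print(f"~{elapsed / N * 1e9:.0f} ns per uncontended acquire/release pair")
```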
Test Results
Before fix: Freeze at batch ~41 (~492 concurrent requests)
After fix: Successfully completed 120 batches (1440 requests) without freeze
Test configuration:
Test Plan