
Conversation

@kitaekatt

Summary

Fixes freeze/hang during sustained concurrent batch inference caused by missing memory barriers in the shared memory ring buffer protocol.

Root Cause

The shm_broadcast.py shared memory IPC uses plain byte writes (metadata_buffer[0] = 1) to signal between writer and reader processes. On multi-core systems, these writes can sit in CPU store buffers and may not be promptly visible to processes running on other cores. This causes indefinite spinning (sketched below), where:

  • Writer waits for readers to set read flags (that are already set but not visible)
  • Readers wait for writer to set written flag (that is already set but not visible)
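For illustration, here is a heavily simplified, single-reader sketch of that spin pattern (this is not the actual shm_broadcast.py code; the flag layout and function names are invented for the example):

    import time

    # Toy layout (assumption): byte 0 = "written" flag set by the writer,
    #                          byte 1 = "read" flag set by the reader.
    WRITTEN, READ = 0, 1

    def writer_step(buf):
        while buf[READ] != 1:     # wait for the reader to release the previous block
            time.sleep(0)         # plain byte load; nothing forces a fresh value
        buf[READ] = 0
        buf[WRITTEN] = 1          # plain byte store; per the analysis above, this can
                                  # linger unseen by the reader on another core

    def reader_step(buf):
        while buf[WRITTEN] != 1:  # wait for the writer to publish a block
            time.sleep(0)
        buf[WRITTEN] = 0
        buf[READ] = 1             # plain byte store; same visibility concern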

The Fix

Add explicit memory barriers using the threading.Lock acquire/release pattern (which provides full memory barrier semantics per POSIX.1-2008) at four critical points (a placement sketch follows the list):

  1. acquire_write() - before reading flags: Ensures writer sees latest read flags
  2. acquire_write() - after setting written flag: Ensures write is globally visible
  3. acquire_read() - before reading flags: Ensures reader sees latest written flag
  4. acquire_read() - after setting read flag: Ensures read completion is visible to writer
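Continuing the simplified model above, a rough placement sketch for the four fences (the real acquire_write()/acquire_read() in shm_broadcast.py are context managers with timeouts and per-reader flags, so the exact call sites in the patch may differ):

    import threading
    import time

    _memory_fence_lock = threading.Lock()

    def memory_fence():
        # Acquire and immediately release a private lock; the lock's internal
        # synchronization acts as a full memory barrier (see "Why threading.Lock?").
        with _memory_fence_lock:
            pass

    WRITTEN, READ = 0, 1  # same toy layout as the sketch above

    def acquire_write(buf):
        memory_fence()              # (1) observe the reader's latest read flag
        while buf[READ] != 1:
            time.sleep(0)
        buf[READ] = 0
        # ... payload is copied into the data buffer here ...
        buf[WRITTEN] = 1
        memory_fence()              # (2) make the written flag globally visible

    def acquire_read(buf):
        memory_fence()              # (3) observe the writer's latest written flag
        while buf[WRITTEN] != 1:
            time.sleep(0)
        # ... payload is read out of the data buffer here ...
        buf[WRITTEN] = 0
        buf[READ] = 1
        memory_fence()              # (4) make the read flag visible to the writer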

Why threading.Lock?

On POSIX systems, pthread_mutex_lock/unlock provide sequentially consistent memory barrier semantics. The lock acquire/release pattern (~20ns overhead) is the most portable and well-defined way to get memory barriers in Python without requiring platform-specific code.
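The ~20ns figure depends on the platform and Python build; a quick, stand-alone way to check the per-call cost of the lock-based fence on a given host (this harness is not part of the patch):

    import threading
    import timeit

    _memory_fence_lock = threading.Lock()

    def memory_fence():
        # Acquire/release with no critical section: used purely for its
        # memory-synchronization side effect.
        with _memory_fence_lock:
            pass

    n = 1_000_000
    elapsed = timeit.timeit(memory_fence, number=n)
    print(f"memory_fence overhead: {elapsed / n * 1e9:.1f} ns per call")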

Test Results

Before fix: freeze at around batch 41 (~492 requests processed)
After fix: all 120 batches (1440 requests) completed without a freeze

Test configuration:

  • Model: Qwen2.5-32B-Instruct-AWQ
  • max_num_seqs: 62
  • Concurrent requests per batch: 12
  • Test repeated multiple times with consistent success

Test Plan

  • Stress test with sustained concurrent load (120 batches, 1440 total requests)
  • Verified fix eliminates the freeze that occurred reliably at ~500 requests
  • Multiple test runs confirm reliability

The shared memory ring buffer protocol in shm_broadcast.py uses plain byte
writes to signal between writer and reader processes. On multi-core systems,
these writes may stay in CPU store buffers and not be visible to other
processes running on different cores, causing indefinite spinning/freeze
under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release
(which provides full barrier semantics per POSIX.1-2008) at four critical
points:
- In acquire_write(): before reading flags and after setting written flag
- In acquire_read(): before reading flags and after setting read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+
requests) where both writer and readers would spin indefinitely waiting
for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <[email protected]>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.


@gemini-code-assist (bot) left a comment


Code Review

This pull request correctly identifies and fixes a critical race condition in the shared memory IPC mechanism by introducing memory barriers. The use of threading.Lock for this purpose is a standard and portable approach. The analysis in the pull request description is excellent. I have a couple of suggestions to improve the robustness and documentation of the new memory_fence function.

Comment on lines +60 to +68:

    Implementation acquires and immediately releases a lock. On POSIX systems,
    pthread_mutex_lock/unlock have full memory barrier semantics (sequentially
    consistent). This is a lightweight operation (~20ns) that guarantees:
    - All stores before the barrier are visible to other threads/processes
    - All loads after the barrier see the latest values
    Reference: POSIX.1-2008 specifies that mutex operations synchronize memory.
    """
    # Lock acquire/release provides full memory barrier semantics on POSIX

Severity: high

The documentation and comments are very POSIX-centric. While correct for POSIX, threading.Lock also provides memory barrier semantics on other platforms like Windows, making this implementation portable. The documentation should reflect this to avoid confusion for developers on other platforms.

Suggested change:

    Implementation acquires and immediately releases a lock. This provides full
    memory barrier semantics on major platforms (e.g., POSIX, Windows), making
    it a portable way to enforce memory ordering. This lightweight operation
    (~20ns) guarantees:
    - All stores before the barrier are visible to other threads/processes
    - All loads after the barrier see the latest values
    """
    # Lock acquire/release provides full memory barrier semantics.

Comment on lines +70 to +71:

    _memory_fence_lock.acquire()
    _memory_fence_lock.release()

Severity: high

For improved robustness, it's better to use a with statement for the lock. This ensures that the lock is always released, even if an unexpected exception (like a KeyboardInterrupt) occurs between acquire() and release(). While unlikely in this short function, it's a best practice that guarantees lock release and prevents potential deadlocks.

Suggested change:

    with _memory_fence_lock:
        pass


njhill commented Dec 2, 2025

Thanks @kitaekatt! I'm just a bit surprised that this hasn't been encountered more often if it's so easy to reproduce. max_num_seqs=62 isn't very high concurrency.

Are you pinning CPUs / setting some particular NUMA config?
