@ann-qin-lu ann-qin-lu commented Jan 24, 2026

Summary

  • Fix CLOSE_WAIT connection issues when using --use-rollout-routing-replay with large response payloads
  • Replace response.json() with response.aread() + json.loads() to ensure full body consumption
  • Apply fix to both _post() and get() functions in http_utils.py

Problem

When using routing replay, SGLang returns large routed_experts data (10-15MB per response for 8K sequences). The response.json() method may not fully read the response body, leaving bytes in the TCP receive buffer. Connections then hang in CLOSE_WAIT state, and rollout generation gets stuck at ~99% completion.

Test plan

  • Run training with --use-rollout-routing-replay --use-slime-router
  • Monitor TCP connections: netstat -tnp | grep <rollout_manager_pid>
  • Verify no connections remain in CLOSE_WAIT state after rollout completes
  • Confirm rollout generation completes to 100% without hanging
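
The CLOSE_WAIT check in the test plan can be scripted. Below is a minimal sketch that parses `netstat -tnp` text and keeps the CLOSE_WAIT sockets owned by one PID. The helper names (`parse_netstat`, `close_wait_for_pid`) are hypothetical, the column layout is assumed to match the netstat output quoted in this PR, and the ESTABLISHED sample line is made up for contrast.

```python
from typing import List, NamedTuple


class TcpConn(NamedTuple):
    recv_q: int
    send_q: int
    local: str
    remote: str
    state: str
    owner: str  # "pid/program" column from `netstat -p`


def parse_netstat(text: str) -> List[TcpConn]:
    """Parse `netstat -tnp` output, skipping headers and non-TCP lines."""
    conns = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 7 or not parts[0].startswith("tcp"):
            continue
        conns.append(
            TcpConn(int(parts[1]), int(parts[2]), parts[3], parts[4], parts[5], parts[6])
        )
    return conns


def close_wait_for_pid(text: str, pid: int) -> List[TcpConn]:
    """Return the connections in CLOSE_WAIT that belong to the given PID."""
    return [
        c
        for c in parse_netstat(text)
        if c.state == "CLOSE_WAIT" and c.owner.split("/")[0] == str(pid)
    ]


# First two lines are from this PR; the ESTABLISHED line is a hypothetical addition.
SAMPLE = """\
tcp  1  0 10.209.226.13:50468  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
tcp  1  0 10.209.226.13:52128  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
tcp  0  0 10.209.226.13:40000  10.209.226.13:4327  ESTABLISHED 215773/ray::Rollout
"""

stuck = close_wait_for_pid(SAMPLE, 215773)
print(f"{len(stuck)} connection(s) in CLOSE_WAIT")
```

After a successful rollout the list should be empty; any remaining entries point at connections the client never closed.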

🤖 Generated with Claude Code

This fixes CLOSE_WAIT connection issues with large response payloads (e.g. large router replay logits). The response.json() method may not fully read
the response body, leaving bytes in the TCP receive buffer and causing
connections to hang in CLOSE_WAIT state.

Changes:
- Use response.aread() + json.loads() instead of response.json()
- Apply fix to both _post() and get() functions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ann-qin-lu
Author

Here is some additional context: HTTP Connection CLOSE_WAIT Issue When Enabling Routing Replay

Issue Summary

When using --use-rollout-routing-replay with --use-slime-router, the rollout generation would get stuck at ~99% completion (e.g., 8176/8192 samples). The job would hang indefinitely waiting for HTTP responses that never arrived.

Symptoms

  1. Progress bar stuck at near-completion (e.g., Rollout generation: 100%|█████████▉| 8176/8192)
  2. Multiple TCP connections in CLOSE_WAIT state with 1 byte in receive queue:
    tcp  1  0 10.209.226.13:50468  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
    tcp  1  0 10.209.226.13:52128  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
    ...
    
  3. Stack trace showing RolloutManager.generate waiting on asyncio.wait() for pending tasks

Root Cause

The issue was in slime/utils/http_utils.py where the HTTP client used response.json() directly without first fully consuming the response body:

# Before (problematic)
response = await client.post(url, json=payload or {})
response.raise_for_status()
output = response.json()  # Does not guarantee full body consumption

Why This Causes CLOSE_WAIT

  1. Large Response Payloads: With --use-rollout-routing-replay, SGLang returns routed_experts data in the response. For long sequences (8K tokens), this can be 10-15MB per response (seq_len × 48 layers × 8 topk × 4 bytes).

  2. Incomplete Body Consumption: The response.json() method in httpx may not fully read the response body from the underlying TCP connection before returning. This leaves bytes in the TCP receive buffer.

  3. Server Closes Connection: The slime-router (or SGLang worker) closes its side of the connection after sending the response.

  4. Client Connection Stuck: The client's TCP stack receives the server's FIN and moves the socket to CLOSE_WAIT, where it stays until the client application closes the socket. With unread data still in the buffer, the httpx connection pool cannot properly close these connections.

  5. Asyncio Tasks Never Complete: The async tasks waiting for these HTTP responses never resolve because the connection is in a broken state, causing asyncio.wait() to hang indefinitely.
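
For reference, the payload estimate in step 1 can be worked through directly. This is a small sketch of that arithmetic only: the function name is hypothetical, `seq_len=8192` is an assumed reading of "8K tokens", and the result is the raw array size, not the size after JSON encoding.

```python
def routed_experts_bytes(
    seq_len: int,
    num_layers: int = 48,      # from the formula above
    top_k: int = 8,            # from the formula above
    bytes_per_entry: int = 4,  # from the formula above
) -> int:
    """Raw size of routed_experts: seq_len x layers x topk x bytes per entry."""
    return seq_len * num_layers * top_k * bytes_per_entry


size = routed_experts_bytes(8192)
print(f"{size} bytes = {size / 2**20:.1f} MiB")  # 12582912 bytes = 12.0 MiB
```

That lands inside the 10-15MB range quoted for 8K sequences.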

Fix

Explicitly read the full response body using response.aread() before parsing JSON:

# After (fixed); assumes `import json` at the top of the module
response = await client.post(url, json=payload or {})
response.raise_for_status()
content = await response.aread()  # Fully consume response body
output = json.loads(content)

Changes Made

File: slime/utils/http_utils.py

  1. _post() function:
    • Changed from response.json() to await response.aread() + json.loads(content)
  2. get() function:
    • Changed from response.json() to await response.aread() + json.loads(content)
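
Since the same change is applied in two places, the pattern could also be factored into one helper. The sketch below is not the code in http_utils.py: `read_json_body` and `FakeResponse` are hypothetical names, and the fake class only mimics the two httpx.Response methods used here (`raise_for_status()` and the async `aread()`) so the example runs without a server.

```python
import asyncio
import json
from typing import Any


async def read_json_body(response: Any) -> Any:
    """Check the status, read the whole body, then parse it as JSON."""
    response.raise_for_status()
    content = await response.aread()  # full body as bytes
    return json.loads(content)        # json.loads accepts bytes


class FakeResponse:
    """Hypothetical stand-in for httpx.Response, for offline demonstration."""

    def __init__(self, body: bytes, status_code: int = 200) -> None:
        self._body = body
        self.status_code = status_code
        self.body_read = False

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

    async def aread(self) -> bytes:
        self.body_read = True
        return self._body


def demo() -> Any:
    fake = FakeResponse(b'{"text": "ok", "routed_experts": [1, 2, 3]}')
    return asyncio.run(read_json_body(fake))


print(demo())
```

With a real client, the call site would look like `output = await read_json_body(await client.post(url, json=payload or {}))`.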

Technical Details

TCP Connection States

  • ESTABLISHED: Active connection, data can flow both ways
  • CLOSE_WAIT: Remote side sent FIN, local side hasn't closed yet
  • TIME_WAIT: Local side initiated the close and has acknowledged the remote FIN; the socket lingers for a timeout (2×MSL) before it is fully released

The CLOSE_WAIT with "1" in recv-q indicates the TCP FIN packet (or trailing data) hasn't been consumed by the application layer.
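
The CLOSE_WAIT behaviour can be reproduced locally with plain sockets. This is an illustrative sketch, not taken from slime or httpx: the server sends a payload and closes, the client reads until EOF, and from that point its socket is in CLOSE_WAIT until it calls close(). The function names and the payload size are arbitrary choices.

```python
import socket
import threading
from typing import Tuple


def serve_once(listener: socket.socket, payload: bytes) -> None:
    """Accept one connection, send the payload, then close (which sends FIN)."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(payload)


def read_until_eof(payload: bytes = b"x" * 100_000) -> Tuple[int, bytes]:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    listener.settimeout(5)
    server = threading.Thread(target=serve_once, args=(listener, payload))
    server.start()

    client = socket.create_connection(listener.getsockname(), timeout=5)
    total = 0
    while True:
        chunk = client.recv(65536)
        if not chunk:  # b"" means the peer's FIN has been consumed
            break
        total += len(chunk)
    server.join()

    # The peer has closed, so this socket is now in CLOSE_WAIT.
    # Further reads keep returning b""; only close() moves it out of that state.
    after_eof = client.recv(1)
    client.close()
    listener.close()
    return total, after_eof


total, after_eof = read_until_eof()
print(total, after_eof)
```

If the client never reaches the close() call, the socket stays in CLOSE_WAIT, which is the state the stuck connections in the netstat output were in.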

Why response.json() Is Problematic

httpx's response.json() parses response.content, the body buffered by an earlier read, rather than reading from the socket itself, so it may not properly handle:

  • Very large responses that exceed internal buffers
  • Connection pooling edge cases
  • Async context cleanup

Using response.aread() explicitly ensures the entire response body is read into memory before any further processing, allowing the connection to be properly returned to the pool or closed.
