fix: use aread() to fully consume HTTP response body #1488

ann-qin-lu · 2026-01-24T07:54:34Z

Summary

Fix CLOSE_WAIT connection issues when using --use-rollout-routing-replay with large response payloads
Replace response.json() with response.aread() + json.loads() to ensure full body consumption
Apply fix to both _post() and get() functions in http_utils.py

Problem

When using routing replay, SGLang returns large routed_experts data (10-15MB per response for 8K sequences). The response.json() method may not fully read the response body, leaving bytes in the TCP receive buffer. This causes connections to hang in CLOSE_WAIT state, making rollout generation stuck at ~99% completion.

Test plan

Run training with --use-rollout-routing-replay --use-slime-router
Monitor TCP connections: netstat -tnp | grep <rollout_manager_pid>
Verify no connections remain in CLOSE_WAIT state after rollout completes
Confirm rollout generation completes to 100% without hanging
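
The CLOSE_WAIT check in the test plan can be scripted rather than eyeballed. Below is a minimal sketch that filters `netstat -tnp` text for CLOSE_WAIT rows. The function name `close_wait_connections` is hypothetical, and the ESTABLISHED row in the sample is made up for contrast; only the CLOSE_WAIT row comes from this PR.

```python
from typing import List


def close_wait_connections(netstat_output: str) -> List[str]:
    """Return the netstat rows whose state column is CLOSE_WAIT."""
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # `netstat -tnp` rows are laid out as:
        # proto recv-q send-q local-address foreign-address state pid/program
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "CLOSE_WAIT":
            stuck.append(line.strip())
    return stuck


# First row is taken from the PR; the second row is an invented example.
SAMPLE = """\
tcp  1  0 10.209.226.13:50468  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
tcp  0  0 10.209.226.13:50470  10.209.226.13:4327  ESTABLISHED 215773/ray::Rollout
"""

print(len(close_wait_connections(SAMPLE)))  # prints 1
```

An empty result after the rollout finishes corresponds to the third item of the test plan.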

🤖 Generated with Claude Code

This fixes CLOSE_WAIT connection issues with large response payloads (e.g. large router replay logits). The response.json() method may not fully read the response body, leaving bytes in the TCP receive buffer and causing connections to hang in CLOSE_WAIT state.

Changes:
- Use response.aread() + json.loads() instead of response.json()
- Apply fix to both _post() and get() functions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

ann-qin-lu · 2026-01-24T08:00:27Z

Here is additional context: HTTP Connection CLOSE_WAIT Issue When Enabling Routing Replay

Issue Summary

When using --use-rollout-routing-replay with --use-slime-router, the rollout generation would get stuck at ~99% completion (e.g., 8176/8192 samples). The job would hang indefinitely waiting for HTTP responses that never arrived.

Symptoms

Progress bar stuck at near-completion (e.g., Rollout generation: 100%|█████████▉| 8176/8192)

Multiple TCP connections in CLOSE_WAIT state with 1 byte in receive queue:

tcp  1  0 10.209.226.13:50468  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
tcp  1  0 10.209.226.13:52128  10.209.226.13:4327  CLOSE_WAIT  215773/ray::Rollout
...

Stack trace showing RolloutManager.generate waiting on asyncio.wait() for pending tasks

Root Cause

The issue was in slime/utils/http_utils.py where the HTTP client used response.json() directly without first fully consuming the response body:

# Before (problematic)
response = await client.post(url, json=payload or {})
response.raise_for_status()
output = response.json()  # Does not guarantee full body consumption

Why This Causes CLOSE_WAIT

Large Response Payloads: With --use-rollout-routing-replay, SGLang returns routed_experts data in the response. For long sequences (8K tokens), this can be 10-15MB per response (seq_len × 48 layers × 8 topk × 4 bytes).
Incomplete Body Consumption: The response.json() method in httpx may not fully read the response body from the underlying TCP connection before returning. This leaves bytes in the TCP receive buffer.
Server Closes Connection: The slime-router (or SGLang worker) closes its side of the connection after sending the response.
Client Connection Stuck: The client's TCP stack receives the server's FIN while unread data is still in the buffer, and the socket stays in CLOSE_WAIT until the client side closes it. The httpx connection pool cannot properly close these connections.
Asyncio Tasks Never Complete: The async tasks waiting for these HTTP responses never resolve because the connection is in a broken state, causing asyncio.wait() to hang indefinitely.
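
The payload estimate in the first item can be checked with a few lines of arithmetic. This is a sketch of the raw size only: the defaults (48 layers, top-k of 8, 4 bytes per entry) are the figures quoted above, the function name `routed_experts_bytes` is hypothetical, and the size on the wire will also depend on how the data is serialized, which is not described here.

```python
def routed_experts_bytes(seq_len: int, num_layers: int = 48,
                         top_k: int = 8, bytes_per_entry: int = 4) -> int:
    """Raw size of routed_experts: seq_len x layers x top-k x bytes per entry."""
    return seq_len * num_layers * top_k * bytes_per_entry


size = routed_experts_bytes(8192)  # 8K-token sequence
print(size)                        # prints 12582912
print(size / 2**20)                # prints 12.0 (MiB)
```

An 8K-token sequence therefore lands at 12 MiB, inside the 10-15MB range quoted above.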

Fix

Explicitly read the full response body using response.aread() before parsing JSON:

# After (fixed)
response = await client.post(url, json=payload or {})
response.raise_for_status()
content = await response.aread()  # Fully consume response body
output = json.loads(content)
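
For reference, the same pattern as a self-contained helper. This is a sketch, not the actual `_post()` from http_utils.py: the name `post_json` is hypothetical, `client` is assumed to behave like an `httpx.AsyncClient`, and `FakeClient` / `FakeResponse` are stand-ins that exist only so the example runs without a server.

```python
import asyncio
import json
from typing import Any, Optional


async def post_json(client: Any, url: str, payload: Optional[dict] = None) -> Any:
    """POST `payload` and parse the reply only after the whole body is read."""
    response = await client.post(url, json=payload or {})
    response.raise_for_status()
    content = await response.aread()  # bytes of the complete response body
    return json.loads(content)        # json.loads accepts bytes directly


class FakeResponse:
    """Stand-in exposing only the response methods used above."""

    def __init__(self, body: bytes) -> None:
        self._body = body

    def raise_for_status(self) -> None:
        pass  # pretend the status code was 2xx

    async def aread(self) -> bytes:
        return self._body


class FakeClient:
    """Stand-in for an async HTTP client; always replies with a fixed body."""

    async def post(self, url: str, **kwargs: Any) -> FakeResponse:
        return FakeResponse(b'{"ok": true}')


print(asyncio.run(post_json(FakeClient(), "http://example.invalid/generate")))
# prints {'ok': True}
```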

Changes Made

File: slime/utils/http_utils.py

_post() function:
- Changed from response.json() to await response.aread() + json.loads(content)
get() function:
- Changed from response.json() to await response.aread() + json.loads(content)

Technical Details

TCP Connection States

ESTABLISHED: Active connection, data can flow both ways
CLOSE_WAIT: Remote side sent FIN, local side hasn't closed yet
TIME_WAIT: Local side initiated close and has sent the final ACK; the socket lingers (2 × MSL) so delayed segments can expire

The CLOSE_WAIT with "1" in recv-q indicates the TCP FIN packet (or trailing data) hasn't been consumed by the application layer.
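
How a peer's FIN reaches the application can be shown with plain sockets on loopback. This is a generic illustration and does not reproduce the httpx behaviour discussed in this PR; the function name `read_until_fin` is hypothetical. The server sends one byte and closes. From that point the client socket is in CLOSE_WAIT; the client sees the FIN as an empty read once the buffered byte is drained, and leaves CLOSE_WAIT only when it calls close().

```python
import socket
from typing import List


def read_until_fin(timeout: float = 5.0) -> List[bytes]:
    """Read from a loopback TCP peer that sends one byte and then closes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.settimeout(timeout)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname(), timeout=timeout)
    server, _ = listener.accept()

    server.sendall(b"x")
    server.close()  # sends FIN; the client socket moves to CLOSE_WAIT

    chunks = []
    while True:
        chunk = client.recv(4096)
        chunks.append(chunk)
        if chunk == b"":  # empty read: the FIN has reached the application
            break

    client.close()  # the local close is what ends CLOSE_WAIT
    listener.close()
    return chunks


print(read_until_fin())  # prints [b'x', b'']
```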

Why response.json() Is Problematic

httpx's response.json() parses response.content and does not itself read from the connection, so it relies on the body having been fully read beforehand. The suspected weak points are:

Very large responses that exceed internal buffers
Connection pooling edge cases
Async context cleanup

Using response.aread() explicitly ensures the entire response body is read into memory before any further processing, allowing the connection to be properly returned to the pool or closed.

Merge branch 'main' into fix/httpx-close-wait

173c272
