fix: add TCP keepalive to default httpx transport to prevent NAT hangs#3368
fix: add TCP keepalive to default httpx transport to prevent NAT hangs#3368Oxygen56 wants to merge 3 commits into
Conversation
httpx's default transport does not enable TCP keepalive, causing non-streaming requests to hang silently behind NAT when idle connections are dropped by the NAT gateway. This adds socket-level TCP keepalive options (SO_KEEPALIVE, TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) to the default HTTP client transport. Fixes openai#3269 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca02bf25f7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| kwargs.setdefault("timeout", DEFAULT_TIMEOUT) | ||
| kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS) | ||
| kwargs.setdefault("follow_redirects", True) | ||
| kwargs.setdefault("transport", httpx.HTTPTransport(socket_options=_build_keepalive_socket_options())) |
There was a problem hiding this comment.
Raise the httpx floor before using socket_options
This package still declares httpx>=0.23.0, <1 in pyproject.toml, but socket_options was only added to HTTPTransport/AsyncHTTPTransport in httpx 0.25.0. In environments that satisfy the current dependency with httpx 0.23.x or 0.24.x, constructing the default OpenAI client will raise TypeError: ... unexpected keyword argument 'socket_options' before any request is made. Please either guard this argument for older httpx versions or bump the minimum dependency.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed without bumping the httpx floor. Instead of unconditionally passing socket_options, the SDK now detects support at runtime via inspect.signature(httpx.HTTPTransport.__init__) (cached in _HTTPX_TRANSPORT_SUPPORTS_SOCKET_OPTIONS). On httpx 0.23.x/0.24.x the argument is simply skipped, so constructing the default client no longer raises TypeError; keepalive is enabled on 0.25.0+. This keeps httpx>=0.23.0 intact and avoids a breaking change for downstream installs.
| kwargs.setdefault("timeout", DEFAULT_TIMEOUT) | ||
| kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS) | ||
| kwargs.setdefault("follow_redirects", True) | ||
| kwargs.setdefault("transport", httpx.HTTPTransport(socket_options=_build_keepalive_socket_options())) |
There was a problem hiding this comment.
Preserve httpx client options when adding keepalive
Creating and passing a concrete transport here causes httpx.Client to use that transport as-is, so the limits, verify, cert, trust_env, http1, and http2 kwargs set on DefaultHttpxClient(...) are no longer applied to the default non-proxy transport. This regresses both the SDK's DEFAULT_CONNECTION_LIMITS and documented customization such as DefaultHttpxClient(http2=True) unless callers manually build their own transport. Please build the transport with the effective kwargs or use an approach that does not bypass httpx's transport initialization.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed. The transport is now built with the effective client options forwarded into it: _build_keepalive_transport() propagates limits, verify, cert, trust_env, http1, and http2 from the resolved kwargs (so DEFAULT_CONNECTION_LIMITS and customizations like DefaultHttpxClient(http2=True) are preserved), and only adds socket_options when httpx supports it. It also only builds a transport when the caller did not supply one, so a user-provided transport is never overridden. Verified locally: the default client's connection pool reports max_connections=1000 / max_keepalive_connections=100, and http2=True reaches the pool.
| kwargs.setdefault("timeout", DEFAULT_TIMEOUT) | ||
| kwargs.setdefault("limits", DEFAULT_CONNECTION_LIMITS) | ||
| kwargs.setdefault("follow_redirects", True) | ||
| kwargs.setdefault("transport", httpx.AsyncHTTPTransport(socket_options=_build_keepalive_socket_options())) |
There was a problem hiding this comment.
Preserve async httpx client options when adding keepalive
The async default client has the same issue: supplying a prebuilt AsyncHTTPTransport means httpx.AsyncClient does not apply its limits, TLS, env, or HTTP/2 options to the default transport. Users relying on DefaultAsyncHttpxClient(http2=True) or the SDK's default connection limits silently get httpx transport defaults instead. Please pass the effective options into the transport or avoid replacing httpx's normal transport construction.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed the same way for the async client. _DefaultAsyncHttpxClient now routes through the shared _build_keepalive_transport(httpx.AsyncHTTPTransport, kwargs) helper, which forwards limits, TLS (verify/cert), trust_env, and http1/http2 into the AsyncHTTPTransport, and only adds socket_options when supported. A caller-supplied transport is preserved. So DefaultAsyncHttpxClient(http2=True) and the SDK default connection limits are no longer silently dropped.
|
Added unit tests covering the keepalive behavior (10 cases):
Design note on why this PR doesn't bump the httpx version requirement (unlike #3270): the keepalive transport works fine with httpx >=0.23, so there's no reason to force a version bump that could break existing installations. |
…lient options - Detect httpx socket_options support at runtime instead of bumping the httpx floor (httpx<0.25 compatible) - Forward effective client options (limits, verify, cert, trust_env, http1, http2) to the prebuilt transport so DEFAULT_CONNECTION_LIMITS and http2=True are no longer dropped - Preserve user-supplied transport
Fixes #3269 — Non-streaming calls silently hang forever behind NAT.
httpx's default transport does not enable TCP keepalive, causing non-streaming requests to hang silently behind NAT when idle connections are dropped by the NAT gateway.
This adds socket-level TCP keepalive options to the default HTTP client transport:
SO_KEEPALIVE— enable keepalive probesTCP_KEEPIDLE(Linux) /TCP_KEEPALIVE(macOS) — 60s idle before first probeTCP_KEEPINTVL— 60s between subsequent probesTCP_KEEPCNT— 5 unacknowledged probes before declaring deadUses
kwargs.setdefaultso any caller-supplied custom transport is completely unaffected.This is the same approach used by the Anthropic Python SDK.