[serve] Run HAProxy drain/undrain test without SO_REUSEPORT #64216

Draft: eicherseiji wants to merge 2 commits into ray-project:master from eicherseiji:serve-reuseport-default
@eicherseiji

Contributor

Why

test_drain_and_undrain_haproxy_manager set SERVE_SOCKET_REUSE_PORT_ENABLED=1 so the per-node HAProxies could share *:8000 on the single test host. Each HAProxy also binds fixed stats (:8404) and metrics (:9101) ports, so SO_REUSEPORT was the only thing letting the co-located HAProxy processes coexist. The goal is to run the Serve test suite with SO_REUSEPORT disabled (the default), so the override has to go.
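
To illustrate why SO_REUSEPORT was the only thing letting co-located processes share a port, here is a minimal standalone sketch using plain TCP sockets rather than HAProxy. It assumes a Linux-like platform where `socket.SO_REUSEPORT` is available; `try_bind_pair` is a hypothetical helper and is not part of this PR.

```python
import errno
import socket


def try_bind_pair(reuse_port: bool) -> bool:
    """Return True if two TCP sockets can bind the same loopback port.

    Hypothetical helper for illustration only. Assumes a platform that
    exposes socket.SO_REUSEPORT (for example Linux).
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            # Both sockets must opt in before bind() for sharing to work.
            first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        # Port 0 asks the kernel for any free port, avoiding fixed numbers.
        first.bind(("127.0.0.1", 0))
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
            second.listen()
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True
    finally:
        first.close()
        second.close()
```

Without the option the second bind is expected to fail with EADDRINUSE, which is the collision the fixed stats and metrics ports would hit once the override is removed.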

What

Give the per-node HAProxies distinct ports instead of sharing them:

  • worker frontend via the existing TEST_WORKER_NODE_HTTP_PORT (head :8000, worker :8001)
  • worker stats/metrics ports via per-node env_vars passed to cluster.add_node (:8405 / :9102)

The test now runs with SO_REUSEPORT disabled and no SERVE_SOCKET_REUSE_PORT_ENABLED override. No application-code changes.
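
As a rough sanity check of the port layout above, the following sketch verifies that no two nodes claim the same port. The port numbers come from this description; the dictionary keys (`frontend`, `stats`, `metrics`) and the `find_port_conflicts` helper are illustrative assumptions, not names from the test or from Ray Serve.

```python
from collections import defaultdict
from typing import Dict, List

# Port numbers taken from the PR description; key names are hypothetical.
PORT_PLAN: Dict[str, Dict[str, int]] = {
    "head": {"frontend": 8000, "stats": 8404, "metrics": 9101},
    "worker": {"frontend": 8001, "stats": 8405, "metrics": 9102},
}


def find_port_conflicts(plan: Dict[str, Dict[str, int]]) -> Dict[int, List[str]]:
    """Map each port claimed more than once to the 'node.role' names claiming it."""
    owners: Dict[int, List[str]] = defaultdict(list)
    for node, ports in plan.items():
        for role, port in ports.items():
            owners[port].append(f"{node}.{role}")
    return {port: names for port, names in owners.items() if len(names) > 1}
```

An empty result for `PORT_PLAN` means every HAProxy on the single test host can bind its ports without SO_REUSEPORT. Giving the worker the head's ports would report all three as conflicts.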

Limitations

The cluster is reduced to head + one worker. The frontend port has no per-node knob beyond the uniform TEST_WORKER_NODE_HTTP_PORT, which applies the same port to every worker, so two co-located worker nodes would collide on one frontend port unless SO_REUSEPORT is enabled. With one worker, the test asserts the drain/undrain outcome (the worker proxy is removed when its node loses all replicas and re-added as healthy when replicas return) rather than the transient DRAINING status. The HEALTHY/DRAINING/DRAINED state machine is covered by python/ray/serve/tests/unit/test_proxy_state.py.
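
The outcome-based assertion can be sketched as two predicates plus a polling loop. This is a simplified stand-in, not the test's code: it assumes proxy state is available as a plain mapping from node ID to a status string, and `wait_until`, `worker_proxy_removed`, and `worker_proxy_healthy` are hypothetical names.

```python
import time
from typing import Callable, Dict


def worker_proxy_removed(proxies: Dict[str, str], worker_id: str) -> bool:
    """Drain outcome: the worker's proxy no longer appears at all."""
    return worker_id not in proxies


def worker_proxy_healthy(proxies: Dict[str, str], worker_id: str) -> bool:
    """Undrain outcome: the worker's proxy is back and reports HEALTHY."""
    return proxies.get(worker_id) == "HEALTHY"


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 5.0,
    interval_s: float = 0.01,
) -> None:
    """Poll predicate until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")
```

Checking only these end states avoids racing against DRAINING, which may be visible too briefly to observe reliably when the draining period is short.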

Testing

RAY_SERVE_ENABLE_HA_PROXY=1 pytest python/ray/serve/tests/test_haproxy.py::test_drain_and_undrain_haproxy_manager passes with SO_REUSEPORT disabled.

test_drain_and_undrain_haproxy_manager forced
SERVE_SOCKET_REUSE_PORT_ENABLED=1 so the three per-node HAProxies could
share *:8000 on the single test host. Each HAProxy also binds fixed stats
(8404) and metrics (9101) ports, so SO_REUSEPORT was the only thing letting
the co-located processes coexist.

Give the per-node HAProxies distinct ports instead: the worker frontend via
TEST_WORKER_NODE_HTTP_PORT and the worker stats/metrics ports via per-node
env vars passed to add_node. The test now runs with SO_REUSEPORT disabled
(the default) and no SERVE_SOCKET_REUSE_PORT_ENABLED override, with no
application-code changes.

The cluster is head + one worker: the frontend port has no per-node knob
beyond the uniform TEST_WORKER_NODE_HTTP_PORT, so two workers cannot get
distinct frontends. The test asserts the drain/undrain outcome (the worker
proxy is removed when its node loses all replicas and re-added healthy when
they return). The HEALTHY/DRAINING/DRAINED state machine is covered by
tests/unit/test_proxy_state.py.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the HAProxy draining test in test_haproxy.py to simplify its structure, removing the use of async/threading and assigning distinct ports to the head and worker HAProxies to avoid port collision without relying on SO_REUSEPORT. The review feedback recommends keeping the draining period short by setting RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S to prevent test timeouts in CI, and safely handling serve_details.proxies in case it is None during transient states to avoid potential AttributeError failures.
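
The None-handling suggestion could look roughly like the sketch below. `FakeServeDetails` is a hypothetical stand-in for whatever object exposes `proxies`; only the attribute name comes from the review text, and the real type in Ray Serve may differ.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FakeServeDetails:
    """Hypothetical stand-in: proxies may be None during transient states."""

    proxies: Optional[Dict[str, str]] = None


def proxy_statuses(details: FakeServeDetails) -> Dict[str, str]:
    """Return a copy of the proxies mapping, treating None as empty."""
    return dict(details.proxies or {})
```

Normalizing to an empty mapping lets polling code call `.get()` or iterate without an AttributeError, so a transient None simply reads as "no proxies yet" and the poll retries.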

Comment threads (2): python/ray/serve/tests/test_haproxy.py
@eicherseiji self-assigned this on Jun 18, 2026

The default RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S is 30s, leaving a thin
margin under the 40s drain-completion wait. Set it to 1s so the worker proxy
drains quickly. Addresses review feedback.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>