[serve] Run HAProxy drain/undrain test without SO_REUSEPORT #64216
eicherseiji wants to merge 2 commits into
test_drain_and_undrain_haproxy_manager forced SERVE_SOCKET_REUSE_PORT_ENABLED=1 so the three per-node HAProxies could share *:8000 on the single test host. Each HAProxy also binds fixed stats (8404) and metrics (9101) ports, so SO_REUSEPORT was the only thing letting the co-located processes coexist.

Give the per-node HAProxies distinct ports instead: the worker frontend via TEST_WORKER_NODE_HTTP_PORT and the worker stats/metrics ports via per-node env vars passed to add_node. The test now runs with SO_REUSEPORT disabled (the default) and no SERVE_SOCKET_REUSE_PORT_ENABLED override, with no application-code changes.

The cluster is head + one worker: the frontend port has no per-node knob beyond the uniform TEST_WORKER_NODE_HTTP_PORT, so two workers cannot get distinct frontends. The test asserts the drain/undrain outcome (the worker proxy is removed when its node loses all replicas and re-added healthy when they return). The HEALTHY/DRAINING/DRAINED state machine is covered by tests/unit/test_proxy_state.py.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Code Review
This pull request refactors the HAProxy draining test in test_haproxy.py to simplify its structure, removing the use of async/threading and assigning distinct ports to the head and worker HAProxies to avoid port collision without relying on SO_REUSEPORT. The review feedback recommends keeping the draining period short by setting RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S to prevent test timeouts in CI, and safely handling serve_details.proxies in case it is None during transient states to avoid potential AttributeError failures.
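The second review suggestion (tolerating a missing proxies mapping) can be illustrated with a minimal sketch. The dataclasses below are hypothetical stand-ins, not Ray's actual Serve schema classes, and `proxy_statuses` is an illustrative helper name; the only assumption carried over from the review is that `proxies` may be None during transient states.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FakeProxyDetails:
    """Hypothetical stand-in for a per-node proxy record."""
    node_id: str
    status: str


@dataclass
class FakeServeDetails:
    """Hypothetical stand-in for the Serve details object; proxies may be None."""
    proxies: Optional[Dict[str, FakeProxyDetails]] = None


def proxy_statuses(details: FakeServeDetails) -> Dict[str, str]:
    """Map node id to proxy status, treating a None proxies field as empty."""
    proxies = details.proxies or {}  # avoids AttributeError on None.items()
    return {node_id: proxy.status for node_id, proxy in proxies.items()}
```

A polling assertion built on such a helper simply sees an empty mapping while the details are not yet populated, and retries instead of raising.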
The default RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S is 30s, leaving a thin margin under the 40s drain-completion wait. Set it to 1s so the worker proxy drains quickly. Addresses review feedback.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
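How the commit actually sets the variable is not shown here. One way to scope such an override to a single test, sketched with the standard library only, is `unittest.mock.patch.dict` on `os.environ`; the helper name `short_draining_period` is hypothetical.

```python
import os
from unittest import mock

DRAIN_ENV = "RAY_SERVE_PROXY_MIN_DRAINING_PERIOD_S"


def short_draining_period(seconds: int = 1):
    """Return a context manager that temporarily sets the draining period.

    Hypothetical helper: the original environment is restored on exit.
    """
    return mock.patch.dict(os.environ, {DRAIN_ENV: str(seconds)})


with short_draining_period():
    # Processes spawned inside this block inherit the 1s value.
    assert os.environ[DRAIN_ENV] == "1"
```

The override only affects processes started while the block is active, so under this approach a test cluster would have to be launched inside it.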
Why
test_drain_and_undrain_haproxy_manager set SERVE_SOCKET_REUSE_PORT_ENABLED=1 so the per-node HAProxies could share *:8000 on the single test host. Each HAProxy also binds fixed stats (:8404) and metrics (:9101) ports, so SO_REUSEPORT was the only thing letting the co-located HAProxy processes coexist. The goal is to run the Serve test suite with SO_REUSEPORT disabled (the default), so the override has to go.

What
Give the per-node HAProxies distinct ports instead of sharing them:
- Worker frontend via TEST_WORKER_NODE_HTTP_PORT (head :8000, worker :8001)
- Worker stats/metrics via env_vars passed to cluster.add_node (:8405/:9102)

The test now runs with SO_REUSEPORT disabled and no SERVE_SOCKET_REUSE_PORT_ENABLED override. No application-code changes.

Limitations
The cluster is reduced to head + one worker. The frontend port has no per-node knob beyond the uniform TEST_WORKER_NODE_HTTP_PORT, so two worker nodes cannot get distinct frontends without SO_REUSEPORT. With one worker the test asserts the drain/undrain outcome (the worker proxy is removed when its node loses all replicas and re-added healthy when replicas return) rather than the transient DRAINING status. The HEALTHY/DRAINING/DRAINED state machine is covered by python/ray/serve/tests/unit/test_proxy_state.py.

Testing
RAY_SERVE_ENABLE_HA_PROXY=1 pytest python/ray/serve/tests/test_haproxy.py::test_drain_and_undrain_haproxy_manager passes with SO_REUSEPORT disabled.
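
For context on why the co-located proxies needed either SO_REUSEPORT or distinct ports, the collision can be reproduced with plain sockets. This is an illustrative sketch that does not involve HAProxy or Ray; the function name `can_bind_twice` is hypothetical, and it binds an OS-chosen ephemeral port on loopback rather than the ports used by the test.

```python
import errno
import socket


def can_bind_twice(reuse_port: bool) -> bool:
    """Return True if two TCP sockets can bind the same address and port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        if reuse_port:
            # SO_REUSEPORT must be set on every socket sharing the port.
            first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            second.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        first.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            if exc.errno != errno.EADDRINUSE:
                raise
            return False  # second bind rejected: address already in use
        return True
    finally:
        first.close()
        second.close()
```

Without the option, the second bind is rejected with EADDRINUSE, which is the situation each per-node HAProxy would hit on a shared host; giving every process its own port sidesteps the conflict entirely.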