[serve][ci] Run all serve tests with HAProxy, drop the test whitelist #64210
eicherseiji wants to merge 4 commits into
The HAProxy CI step previously ran only the curated list of targets in ci/ray_ci/serve_hap_test_names.txt. HAProxy is now the default ingress and stable enough that the whole serve test suite should run against it. Replace the whitelist with //python/ray/serve/tests/... and delete the file.

The step still excludes tags that have their own steps or build images (ha_integration, serve_tracing, direct_ingress) plus gpu and post_wheel_build. Bump parallelism from 3 to 6 to absorb the larger target set.

This is a draft to surface, via premerge CI, which serve tests do not yet pass with HAProxy enabled.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Code Review
This pull request updates the HAProxy test configuration in Buildkite to run the full serve test suite dynamically instead of using a curated whitelist file, which has been deleted. It also increases the step's parallelism from 3 to 6. The reviewer suggests expanding the test targets to include '//python/ray/serve/...' and '//python/ray/tests/...' to ensure complete test coverage and maintain consistency with standard serve tests.
  # whitelist. Excludes tags handled by their own steps/images.
  # Exit 42 skips whole-step retry (per @kouroshHakha); per-test retry is enough.
- - bazel run //ci/ray_ci:test_in_docker -- $(cat ci/ray_ci/serve_hap_test_names.txt) serve
+ - bazel run //ci/ray_ci:test_in_docker -- //python/ray/serve/tests/... serve
To ensure complete test coverage and consistency with the standard serve tests step (defined on line 44), we should run the same set of targets: //python/ray/serve/... and //python/ray/tests/.... Using only //python/ray/serve/tests/... might miss serve-related tests located in python/ray/tests/ or other subdirectories under python/ray/serve/.
- bazel run //ci/ray_ci:test_in_docker -- //python/ray/serve/... //python/ray/tests/... serve

Match the target set used by the standard serve test steps (//python/ray/serve/... //python/ray/tests/...) instead of the narrower //python/ray/serve/tests/..., so this step stays uniform with the rest of the file and picks up any future team:serve test added under python/ray/tests. Tighten the step comment to one line.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Premerge of the whitelist removal showed a set of serve tests cannot pass with the HAProxy ingress: gRPC tests (HAProxy does not proxy gRPC yet, see haproxy.py), tests that probe the native Serve HTTP proxy directly, and a few lifecycle/shutdown tests whose timing differs under HAProxy.

Add a skip_if_haproxy(reason) helper (pytest.mark.skipif keyed on RAY_SERVE_ENABLE_HA_PROXY) and apply it to those tests. This replaces the external serve_hap_test_names.txt allowlist with inline, self-documenting skips colocated with each test. The skips only trigger in the HAProxy step; the tests still run normally in every other serve step, so no coverage is lost, and the full suite now runs under HAProxy minus the documented skips.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
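For readers who have not seen the helper, a minimal sketch of what a skip_if_haproxy(reason) marker could look like follows. It is not the code from this PR: the function haproxy_enabled, the reason prefix, the rule that only the value "1" enables HAProxy, and the example test name are all assumptions made for illustration.

```python
import os

import pytest

# Environment variable named in the commit message above.
HAPROXY_ENV_VAR = "RAY_SERVE_ENABLE_HA_PROXY"


def haproxy_enabled() -> bool:
    """Return True when the HAProxy ingress is switched on.

    Assumption: only the literal value "1" counts as enabled.
    The real helper may parse the variable differently.
    """
    return os.environ.get(HAPROXY_ENV_VAR, "0") == "1"


def skip_if_haproxy(reason: str):
    """Build a skipif marker that fires only in the HAProxy CI step.

    The condition is evaluated when the marker is created, i.e. at
    test-module import time, so the variable must be set before
    pytest starts.
    """
    return pytest.mark.skipif(
        haproxy_enabled(),
        reason=f"Skipped under HAProxy: {reason}",
    )


# Hypothetical test, shown only to illustrate how the marker is applied.
@skip_if_haproxy("HAProxy does not proxy gRPC yet")
def test_grpc_roundtrip():
    pass
```

Because the marker sits next to the test and carries its own reason string, the skip list stays reviewable in the same diff as the test it affects, which is the property the commit message attributes to the inline approach.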
A second premerge run (#68597) surfaced stragglers the first partial harvest missed: the native-proxy metrics tests in test_metrics.py (proxy/serve metrics that HAProxy emits differently, covered by test_metrics_haproxy.py) and test_deployment_scheduler_downscale's downscale_fallback_node. Mark them with skip_if_haproxy.

test_regression.test_replica_memory_growth failed once but passed on retry, so it is flaky rather than incompatible and is left running.

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
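The flaky-versus-incompatible distinction drawn above can be written down as a small rule. The sketch below is illustrative only: classify_under_haproxy and its labels are hypothetical names, and it assumes each test's per-attempt outcomes are available as a list of booleans.

```python
from typing import List


def classify_under_haproxy(attempts: List[bool]) -> str:
    """Classify a test from its per-attempt results (True = passed).

    - every attempt passed        -> "passing"
    - mixed passes and failures   -> "flaky" (leave it running)
    - every attempt failed        -> "incompatible" (candidate for a skip)
    """
    if not attempts:
        raise ValueError("need at least one attempt")
    if all(attempts):
        return "passing"
    if any(attempts):
        return "flaky"
    return "incompatible"


# A test that failed once and then passed on retry is treated as flaky.
print(classify_under_haproxy([False, True]))  # flaky
```

A consistently failing test still needs a human to confirm the failure is caused by the ingress before it is marked as skipped; the rule only narrows the candidates.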
Why
The HAProxy CI step (serve: HAProxy tests) ran only the curated list of targets in ci/ray_ci/serve_hap_test_names.txt. HAProxy is now the default Serve ingress, so the whole serve test suite should run against it rather than a hand-maintained subset.
What
- Replace $(cat ci/ray_ci/serve_hap_test_names.txt) with //python/ray/serve/tests/... in the HAProxy step.
- Delete ci/ray_ci/serve_hap_test_names.txt.
- Keep --except-tags post_wheel_build,gpu,ha_integration,serve_tracing,direct_ingress so tests that have their own dedicated steps or build images are not pulled in here. The haproxy-tagged tests still run only in this step.
- Bump parallelism from 3 to 6 to absorb the larger target set.
Status
Draft. The point of this PR is to surface, via premerge CI, which serve tests do not yet pass with RAY_SERVE_ENABLE_HA_PROXY=1. Failures here are the work items before the whitelist can be removed for real.
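To make the shape of the resulting step command concrete, here is a rough sketch that assembles it from the pieces named in this description. The function haproxy_step_command is hypothetical, the position of --except-tags after the team name is an assumption, and any other flags the real step passes are not shown in this PR text and are left out.

```python
import shlex
from typing import List

# Tags excluded from the HAProxy step, as listed in the PR description.
EXCLUDED_TAGS = [
    "post_wheel_build",
    "gpu",
    "ha_integration",
    "serve_tracing",
    "direct_ingress",
]


def haproxy_step_command(targets: List[str]) -> str:
    """Assemble the test_in_docker invocation for the given Bazel targets.

    Assumption: the excluded tags follow the team name ("serve").
    The real step may order or extend its flags differently.
    """
    argv = [
        "bazel", "run", "//ci/ray_ci:test_in_docker", "--",
        *targets,
        "serve",
        "--except-tags", ",".join(EXCLUDED_TAGS),
    ]
    return shlex.join(argv)


# Wildcard target from this description, replacing the whitelist file.
print(haproxy_step_command(["//python/ray/serve/tests/..."]))
```

Passing a wildcard target pattern instead of the contents of a file means new tests are picked up automatically, at the cost of a larger target set per run.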