
Conversation

@Sparks0219 (Contributor) commented Nov 21, 2025

> Briefly describe what this PR accomplishes and why it's needed.

Creating core chaos network release tests by adding iptables variations to the current chaos release tests. Also added a basic chaos release test for streaming generators and object ref borrowing.
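
For context, here is a minimal sketch of the kind of Ray API usage the streaming generator and object ref borrowing workloads exercise. This is illustrative only (not the actual release test code), assuming a recent Ray version where generator tasks return a streaming generator of ObjectRefs:

```python
import ray

ray.init()

# Streaming generator: the task yields results one at a time and the
# driver consumes them as a stream of ObjectRefs.
@ray.remote
def stream(n):
    for i in range(n):
        yield i

for ref in stream.remote(5):
    print(ray.get(ref))

# Object ref borrowing: passing an ObjectRef inside a container makes
# the receiving task a borrower of the underlying object.
@ray.remote
def borrower(refs):
    return ray.get(refs[0])

obj = ray.put("payload")
assert ray.get(borrower.remote([obj])) == "payload"
```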

Did a minor refactor by moving each chaos test workload (tasks/actors/streaming generator/borrowing) into its own Python file so it's easier to add tests in the future rather than growing one huge monolithic file. Added metrics for total runtime and peak head node memory usage. Also removed the per-failure-type baseline run, since it was repeated across all chaos failure types and only needs to run once in total. As a result, each workload now has 4 tests (baseline, EC2 instance killer, raylet killer, iptables network failure).
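
As a rough illustration of the metrics piece, below is a minimal sketch of how total runtime and peak head node memory could be sampled with psutil. The function name, output path, and wiring are hypothetical and not the actual release test harness:

```python
import json
import threading
import time

import psutil


def run_with_metrics(workload, output_path="/tmp/release_test_output.json"):
    """Run a workload callable, recording total runtime and peak node
    memory usage by sampling psutil in a background thread.

    `workload` and `output_path` are placeholders for illustration.
    """
    peak_used = 0
    stop = threading.Event()

    def sample():
        nonlocal peak_used
        while not stop.is_set():
            peak_used = max(peak_used, psutil.virtual_memory().used)
            time.sleep(1)

    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    start = time.monotonic()
    try:
        workload()
    finally:
        stop.set()
        sampler.join()

    metrics = {
        "total_runtime_s": time.monotonic() - start,
        "peak_head_node_memory_bytes": peak_used,
    }
    with open(output_path, "w") as f:
        json.dump(metrics, f)
```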

Note that for the iptables tests you'll need to add these 4 config variables:
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
The first 3 prevent the raylet from failing the GCS health check during the transient network error window, and the last prevents the process from being killed by the GCS client's connection check, which exits if it can't initially connect to the GCS within 5 seconds.
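
For a quick local reproduction, these overrides can be exported as environment variables in the process that starts the Ray node; in the release tests themselves they live in the cluster configuration. A minimal sketch, assuming a single local node started via ray.init():

```python
import os

# RAY_-prefixed env vars are picked up by the Ray processes started from
# this environment; the values mirror the release test settings above.
os.environ["RAY_health_check_period_ms"] = "10000"
os.environ["RAY_health_check_timeout_ms"] = "100000"
os.environ["RAY_health_check_failure_threshold"] = "10"
os.environ["RAY_gcs_rpc_server_connect_timeout_s"] = "60"

import ray

ray.init()  # starts a local node that picks up the config overrides
```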

Also deleted test_chaos.py in python/ray/tests, since the release chaos tests cover similar functionality.

Sparks0219 marked this pull request as ready for review November 21, 2025 09:27
Sparks0219 added the core (Issues that should be addressed in Ray Core) and release-test (release test) labels Nov 21, 2025
Sparks0219 assigned dayshah and unassigned dayshah Nov 21, 2025
Sparks0219 added the go (add ONLY when ready to merge, run all tests) label Nov 21, 2025
@edoakes (Collaborator) left a comment


Stacking all of these tests into one mega-script is a little ugly... is there some lightweight refactoring we can do to clean it up? Like move out shared utils and put each test in its own file. If you don't think this is worthwhile, that's fine.

also, are there any structured metrics we should be tracking for these tests beyond pass/fail?

Comment on lines +3457 to +3460
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
Collaborator

why do we need to set custom config options here? this should be called out in the PR description with an explanation

Contributor Author

Top 3 are to disable the heartbeat checks which can cause the node to die when transient network errors occur; the last one is a timeout in the GCS client that causes the process to die if it can't initially connect within a certain amount of time. The default is 5 seconds.

I'll make a note of this in the PR.

Collaborator

> Top 3 are to disable the heartbeat checks which can cause the node to die when transient network errors occur

Wouldn't we want to test the default behavior here? (which would include the node dying due to healthcheck failures)

Contributor Author

That's a good point; perhaps it would be a good idea to bump the defaults? These were probably chosen before fault tolerance was a thing (for example, the GCS 5-second initial connection timeout). We could be a bit more generous now that transient network errors should be handled.

Sparks0219 requested a review from a team as a code owner November 22, 2025 01:23

@Sparks0219 (Contributor Author)

> Stacking all of these tests into one mega-script is a little ugly... is there some lightweight refactoring we can do to clean it up? Like move out shared utils and put each test in its own file. If you don't think this is worthwhile, that's fine.
>
> also, are there any structured metrics we should be tracking for these tests beyond pass/fail?

The main test script and argument parsing are the same across all workloads, so I kept those as is. I moved each workload's Python function into its own file so it's cleaner to extend if people want to add additional tests in the future.

One thing I noticed was that each failure type previously ran the workload twice, with the first run being the baseline. This is unnecessary, because the EC2 instance killer and the raylet killer would both run the baseline even though it only needs to be run once in total. It's also broken for iptables, since I start the network failure injection from the beginning of the Python file's execution. So I just run the workload once, and have a separate release test that just runs the workload with no failure injection.

Also, following up from our conversation, I added metrics for peak head node memory usage and total runtime, which should give us an indicator if things are going wrong.

Sparks0219 requested a review from edoakes November 24, 2025 19:54

@edoakes (Collaborator) commented Nov 24, 2025

> So I just run the workload once, and have a separate release test that just runs the workload with no failure injection.

Nice, was going to suggest this :)

@edoakes (Collaborator) left a comment

🚀

edoakes merged commit 907fdbb into ray-project:master Nov 24, 2025 (6 checks passed)
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025