
Conversation

@Sparks0219 (Contributor) commented Nov 21, 2025

> Briefly describe what this PR accomplishes and why it's needed.

Creating core chaos network release tests by adding iptables variations to the current chaos release tests. Also added a basic chaos release test for streaming generators and object ref borrowing.
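
For context, here is a minimal sketch of the kind of Ray API usage the streaming generator and object ref borrowing workloads exercise. This is illustrative only (not the actual release test code), assuming a recent Ray version where generator tasks return a streaming generator of ObjectRefs:

```python
import ray

ray.init()

# Streaming generator: the task yields results one at a time and the
# driver consumes them as a stream of ObjectRefs.
@ray.remote
def stream(n):
    for i in range(n):
        yield i

for ref in stream.remote(5):
    print(ray.get(ref))

# Object ref borrowing: passing an ObjectRef inside a container makes
# the receiving task a borrower of the underlying object.
@ray.remote
def borrower(refs):
    return ray.get(refs[0])

obj = ray.put("payload")
assert ray.get(borrower.remote([obj])) == "payload"
```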

Did a minor refactor by moving each chaos test workload (tasks/actors/streaming generator/borrowing) into its own Python file so it's easier to add tests in the future rather than growing one huge monolithic file. Added metrics for total runtime and peak head node memory usage. Also removed the per-failure-type baseline run, since it was repeated across all chaos failure types and only needs to run once in total. As a result, each workload now has 4 tests (baseline, EC2 instance killer, raylet killer, iptables network failure).
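
As a rough illustration of the metrics piece, below is a minimal sketch of how total runtime and peak head node memory could be sampled with psutil. The function name, output path, and wiring are hypothetical and not the actual release test harness:

```python
import json
import threading
import time

import psutil


def run_with_metrics(workload, output_path="/tmp/release_test_output.json"):
    """Run a workload callable, recording total runtime and peak node
    memory usage by sampling psutil in a background thread.

    `workload` and `output_path` are placeholders for illustration.
    """
    peak_used = 0
    stop = threading.Event()

    def sample():
        nonlocal peak_used
        while not stop.is_set():
            peak_used = max(peak_used, psutil.virtual_memory().used)
            time.sleep(1)

    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    start = time.monotonic()
    try:
        workload()
    finally:
        stop.set()
        sampler.join()

    metrics = {
        "total_runtime_s": time.monotonic() - start,
        "peak_head_node_memory_bytes": peak_used,
    }
    with open(output_path, "w") as f:
        json.dump(metrics, f)
```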

Note that for the iptables tests you'll need to add these 4 config variables:
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
The first 3 prevent the raylet from failing the GCS health check during the transient network error window, and the last prevents the process from being killed by the GCS client's connection check, which exits if it can't initially connect to the GCS within 5 seconds.
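
For a quick local reproduction, these overrides can be exported as environment variables in the process that starts the Ray node; in the release tests themselves they live in the cluster configuration. A minimal sketch, assuming a single local node started via ray.init():

```python
import os

# RAY_-prefixed env vars are picked up by the Ray processes started from
# this environment; the values mirror the release test settings above.
os.environ["RAY_health_check_period_ms"] = "10000"
os.environ["RAY_health_check_timeout_ms"] = "100000"
os.environ["RAY_health_check_failure_threshold"] = "10"
os.environ["RAY_gcs_rpc_server_connect_timeout_s"] = "60"

import ray

ray.init()  # starts a local node that picks up the config overrides
```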

Also deleted test_chaos.py in python/ray/tests, since the release chaos tests cover similar functionality.

Sparks0219 marked this pull request as ready for review November 21, 2025 09:27
Sparks0219 added the core (Issues that should be addressed in Ray Core) and release-test (release test) labels Nov 21, 2025
Sparks0219 assigned dayshah and unassigned dayshah Nov 21, 2025
Sparks0219 added the go (add ONLY when ready to merge, run all tests) label Nov 21, 2025
@edoakes (Collaborator) left a comment


Stacking all of these tests into one mega-script is a little ugly... is there some lightweight refactoring we can do to clean it up? Like move out shared utils and put each test in its own file. If you don't think this is worthwhile, that's fine.

also, are there any structured metrics we should be tracking for these tests beyond pass/fail?

Comment on lines +3457 to +3460
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
Collaborator

why do we need to set custom config options here? this should be called out in the PR description with an explanation

Contributor Author

Top 3 are to disable the heartbeat checks which can cause the node to die when transient network errors occur; the last one is a timeout in the GCS client that causes the process to die if it can't initially connect within a certain amount of time. The default is 5 seconds.

I'll make a note of this in the PR.

Collaborator

> Top 3 are to disable the heartbeat checks which can cause the node to die when transient network errors occur

Wouldn't we want to test the default behavior here? (which would include the node dying due to healthcheck failures)

Contributor Author

That's a good point; perhaps it would be a good idea to bump the defaults? These were probably chosen before fault tolerance was a thing (for example, the GCS 5-second initial connection timeout). We could be a bit more generous now that transient network errors should be handled.

Sparks0219 requested a review from a team as a code owner November 22, 2025 01:23

@Sparks0219 (Contributor Author)

> Stacking all of these tests into one mega-script is a little ugly... is there some lightweight refactoring we can do to clean it up? Like move out shared utils and put each test in its own file. If you don't think this is worthwhile, that's fine.
>
> also, are there any structured metrics we should be tracking for these tests beyond pass/fail?

The main test script and argument parsing are the same across all workloads, so I kept those as is. I moved each workload's Python function into its own file so it's cleaner to extend if people want to add additional tests in the future.

One thing I noticed was that each failure type previously ran the workload twice, with the first run being the baseline. This is unnecessary, because the EC2 instance killer and the raylet killer would both run the baseline even though it only needs to be run once in total. It's also broken for iptables, since I start the network failure injection from the beginning of the Python file's execution. So I just run the workload once, and have a separate release test that just runs the workload with no failure injection.

Also, following up from our conversation, I added metrics for peak head node memory usage and total runtime, which should give us an indicator if things are going wrong.

Sparks0219 requested a review from edoakes November 24, 2025 19:54

@edoakes (Collaborator) commented Nov 24, 2025

> So I just run the workload once, and have a separate release test that just runs the workload with no failure injection.

Nice, was going to suggest this :)

@edoakes (Collaborator) left a comment

🚀

edoakes merged commit 907fdbb into ray-project:master Nov 24, 2025 (6 checks passed)
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025