[core] Introduce core chaos network release tests #58868
Conversation
edoakes
left a comment
Stacking all of these tests into one mega-script is a little ugly... is there some lightweight refactoring we can do to clean it up? Like move out shared utils and put each test in its own file. If you don't think this is worthwhile, that's fine.
Also, are there any structured metrics we should be tracking for these tests beyond pass/fail?
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60
Why do we need to set custom config options here? This should be called out in the PR description with an explanation.
The top 3 disable the heartbeat checks, which can cause the node to die when transient network errors occur. The last one is a timeout in the GCS client that kills the process if it can't initially connect within a set amount of time; the default is 5 seconds.
I'll make a note of this in the PR.
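For intuition, here's a rough back-of-the-envelope sketch of the partition window these values tolerate (assuming the GCS declares a raylet dead only after `failure_threshold` consecutive failed checks, issued every `period_ms`, each timing out after `timeout_ms` — the exact semantics may differ):

```python
# Back-of-the-envelope only; assumes the GCS marks a raylet dead after
# `failure_threshold` consecutive failed health checks, one issued every
# `period_ms`, each timing out after `timeout_ms`.
period_s = 10_000 / 1000      # RAY_health_check_period_ms
timeout_s = 100_000 / 1000    # RAY_health_check_timeout_ms
threshold = 10                # RAY_health_check_failure_threshold

# Worst case, the final check must also time out, so the node survives a
# partition of roughly threshold * period + timeout.
tolerated_s = threshold * period_s + timeout_s
print(f"node survives a partition of roughly {tolerated_s:.0f}s")  # ~200s
```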
> Top 3 are to disable the heartbeat checks which can cause the node to die when transient network errors occur
Wouldn't we want to test the default behavior here? (which would include the node dying due to healthcheck failures)
That's a good point. Perhaps it would be a good idea to bump the defaults? These were probably chosen before fault tolerance was a thing (like, for example, the GCS 5-second initial connection timeout). We could be a bit more generous now that transient network errors should be handled.
The main test script and argument parsing are the same across all workloads, so I kept that as is. I moved each workload's Python function into its own file so it's cleaner to extend if people want to add additional tests in the future.

One thing I noticed, though: each failure type currently runs the workload twice, with the first run being the baseline. That's unnecessary, because the EC2 instance killer and the raylet killer would each run the baseline when it should only be run once in total. It's also broken for iptables, since the network failure injection starts at the beginning of the Python file's execution. So I now run the workload just once, and added a separate release test that runs the workload with no failure injection.

Also followed up from our conversation and added metrics for head node peak memory usage + total runtime, which should give us an indicator if things are going wrong. A rough sketch of the new layout is below.
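Sketch of one per-workload file (the path, function names, and task counts here are illustrative, not necessarily what's in the PR):

```python
# release/chaos_test/workloads/many_tasks.py  (hypothetical path and names)
import time

import ray


def run_workload(num_tasks: int = 2000) -> dict:
    """A self-contained chaos workload: submit many small retryable tasks
    and check that they all complete despite injected failures."""

    @ray.remote(max_retries=-1)  # retry indefinitely through transient failures
    def noop(i: int) -> int:
        return i

    start = time.time()
    results = ray.get([noop.remote(i) for i in range(num_tasks)])
    assert results == list(range(num_tasks))
    # Structured metrics for the shared driver to report alongside pass/fail.
    return {"total_runtime_s": time.time() - start}
```

The shared driver keeps the argument parsing and metrics emission, and just imports each workload's `run_workload` function.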
Nice, was going to suggest this :)
edoakes
left a comment
🚀
Creating core chaos network release tests by adding iptables variations to the current chaos release tests. Also added a basic chaos release test for streaming generators and object ref borrowing.

Did a minor refactor by moving each chaos test workload (tasks/actors/streaming gen/borrowing) into its own Python file so it's easier to add additional tests in the future rather than growing a huge mono file. Added metrics for total runtime + peak head node memory usage. Furthermore, removed the baseline run, as it was repeated across all chaos failure types and should only be run once in total. Hence, for each workload we now have 4 tests (baseline, EC2 instance killer, raylet killer, iptables network failure).

Note that for the iptables tests you'll need to add these 4 config variables:
- RAY_health_check_period_ms=10000
- RAY_health_check_timeout_ms=100000
- RAY_health_check_failure_threshold=10
- RAY_gcs_rpc_server_connect_timeout_s=60

The top 3 prevent the raylet from failing the GCS health check for the duration of the transient network error, and the last prevents the process from being killed by the GCS client connection check, which exits if it can't initially connect to the GCS within 5 seconds.

Also deleted test_chaos.py, located in python/ray/tests, as the release chaos tests cover similar functionality.
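For concreteness, a minimal sketch of the kind of transient network-failure injection these tests perform (the helper name, port, and exact iptables rule are illustrative assumptions, not necessarily what the test uses):

```python
# Illustrative only: the release test's actual injection helper may differ.
import subprocess
import time


def inject_transient_partition(duration_s: float, gcs_port: int = 6379) -> None:
    """Drop outbound traffic to the GCS port for `duration_s` seconds,
    then restore connectivity (assumes the default GCS port 6379)."""
    rule = ["-p", "tcp", "--dport", str(gcs_port), "-j", "DROP"]
    subprocess.run(["iptables", "-I", "OUTPUT", *rule], check=True)
    try:
        time.sleep(duration_s)
    finally:
        # -D removes the exact rule inserted above, ending the partition.
        subprocess.run(["iptables", "-D", "OUTPUT", *rule], check=True)
```

With the config variables above, a partition shorter than the health-check tolerance window should leave the node alive and let the workload's retries succeed once connectivity returns.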