Enhance link down handler to avoid race conditions in the failover path #438
Conversation
…tor on link down handler
Seems like you are on the right track here.
Tested this against the example from #405, and the issue there appears to be resolved.
…ulk update EVCs dict; delegate evc.setup_failover_path to consistency routine; create new event to cleanup old failover path
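Since the mef_eline internals aren't shown in this conversation, the sketch below is only a hypothetical illustration of the pattern summarized in that commit message: move the affected EVCs onto their failover paths under a single lock, persist them with one bulk update of the EVCs dict, delegate building the new failover_path to the consistency routine, and emit one event so the old failover path can be cleaned up asynchronously. All class, method, and event names here are made up for illustration (only setup_failover_path is mentioned in the PR itself).

```python
# Hypothetical sketch only: LinkDownCoordinator, deploy_to_failover_path,
# _bulk_save and the event name are illustrative, not the real mef_eline API.
import threading
from typing import Callable, Dict, List


class LinkDownCoordinator:
    """Handle a link-down event for many EVCs inside one critical section."""

    def __init__(self, evcs: Dict[str, object],
                 emit_event: Callable[[str, dict], None]) -> None:
        self._evcs = evcs              # EVC id -> EVC object
        self._lock = threading.Lock()  # serializes concurrent link-down handlers
        self._emit_event = emit_event  # event bus callback (assumed injected)

    def handle_link_down(self, link_id: str) -> None:
        old_failover_paths: Dict[str, List] = {}
        with self._lock:
            affected = [evc for evc in self._evcs.values()
                        if link_id in getattr(evc, "current_links", set())]
            for evc in affected:
                # Record the old failover path; it is removed later, outside
                # the critical section, by a dedicated cleanup event.
                old_failover_paths[evc.id] = list(getattr(evc, "failover_path", []))
                evc.deploy_to_failover_path()
            # Persist all affected EVCs in one bulk update of the EVCs dict
            # instead of one write per EVC, shrinking the race window.
            self._bulk_save(affected)
        # Setting up the *new* failover_path is delegated to the periodic
        # consistency routine; removing the stale one happens asynchronously
        # via this event, so the handler itself stays short.
        self._emit_event("failover_old_path_cleanup",
                         {"old_paths": old_failover_paths})

    def _bulk_save(self, evcs: List) -> None:
        """Placeholder for a single bulk write of the affected EVCs."""
        ...
```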
Hi @viniarck, based on our discussion on the performance impact of
After adding those two changes, I re-executed the local tests, performance tests, and end-to-end tests. Results are presented below.
Local tests
Same strategy as before, simulating the extreme scenario where two links between adjacent switches exist and both are subject to simultaneous failure, making it challenging for mef_eline to handle two events that affect both current_path and failover_path.
A) Packet loss without this change/fix (same result as before):
B) Packet loss with the new changes:
Here we can still see that the newly proposed changes really improved the failure handling in this scenario: the packet loss is much lower than previously measured, meaning mef_eline didn't suffer from race conditions when handling the multiple failures and was actually able to redeploy a new path for the EVCs.
Performance test
New results from the performance test simulating a link failure with 100 EVCs using the failed Link, to measure convergence:
To help understand the result above, we have to compare it with previous results. Without this change:
With this change (previous version):
In summary: there is something wrong with the results above; they got much worse.
End-to-end test
Nicely done @italovalcy, and much appreciated your help with this one; this was a tough one, and I'm glad with how succinct and great the solution has become. Glad to see the results too. I also explored with 200 EVCs, and from what I've seen it also worked well on my local env.
Other than that, I opened a few threads; none of them are blockers. Some are just points to be aware of regarding future work on telemetry_int, one is regarding the consistency check, and the rest are just minor code implementation details, so it's up to you whether it's worth slightly changing them or not. Once the e2e tests and changelog are updated, feel free to merge.
…ducing try_setup_failover when EVC gets redeployed
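The commit above (introducing try_setup_failover when an EVC gets redeployed) suggests a best-effort helper; the snippet below is only my guess at its shape, reusing the evc.setup_failover_path call mentioned earlier in this conversation. The real method in mef_eline may be structured differently.

```python
# Hypothetical sketch of a best-effort "try_setup_failover" helper.
import logging

log = logging.getLogger(__name__)


def try_setup_failover(evc) -> bool:
    """Attempt to provision a failover path right after an EVC redeploy.

    On any failure it only logs and returns False, leaving the periodic
    consistency routine to retry later instead of failing the redeploy.
    """
    try:
        evc.setup_failover_path()
        return True
    except Exception:  # never let failover setup break the redeploy itself
        log.exception("failover path setup deferred for EVC %s",
                      getattr(evc, "id", "unknown"))
        return False
```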
Hi,
New results:
The problem was basically the testing methodology, which started the link failure simulation a few seconds after finishing the creation of the circuits (which didn't give enough time to set up the failover path). I changed the testing methodology to give more time between EVC creation and the link failure simulation. With the changes introduced in commit 9fb511b this won't even be necessary. To help understand the result above, we have to compare it with previous results. Without this change:
With this change (previous version):
LGTM. Excellent PR @italovalcy, appreciated the updates; glad to see that the perf results ended up great and it was just a methodology detail in the last run. Also appreciated the issues you've mapped for us to be aware of and address on 2024.1.
Before merging, make sure to also update the CHANGELOG.rst, thanks.
Closes #405
Summary
See the updated changelog file (in fact, the changelog already summarizes this fix: fixed race condition in failover_path when handling simultaneous Link Down events, leading to inconsistencies on some EVCs).
Local Tests
Using the topology presented in the issue, I executed tests with and without this change to evaluate how a user (EVC service) would be impacted. Tests consist basically of all nodes pinging h1 (h2-h1, h3-h1, h4-h1, and h5-h1), and on each step we disable both interfaces that directly connect a pair of nodes (first disabling the h1-h2 links, then the h1-h3 links, then the h1-h4 links). The link-down event is simulated at around 25% of the data transfer on each step. The metric used here is packet loss, so 0 means no packet loss (good) while 100 means 100% packet loss (bad); results are presented as the average with a 95% confidence interval.
The results above basically mean that: 1) without the change, tests from h2 to h1 show 75% packet loss on the first step and then 100% packet loss on all other steps -- this means the EVC got stuck after the first failover routine because the path was wrongly provisioned; the same happens for tests from h3 to h1, except that step 1 has no impact because those links do not affect communication from h3 to h1, but starting on step 2 we see pretty much the same behavior; 2) with the change we do see some packet loss, but no EVC gets stuck -- we can confirm this because the packet loss is lower than the fraction of the transfer remaining after the event happens (in other words, packet loss is way lower than 75%). One could expect 0% packet loss, but that is not possible in this scenario because, unfortunately, the failover path is chosen over a link that is itself subject to failures (the hypothesis here is that those two links share the same underlying physical media, e.g., an optical system) -- see Issue #439 for more context.
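For reproducibility, here is a small, self-contained sketch of how the packet-loss metric and its 95% confidence interval could be computed from repeated ping runs. The ping parsing and the normal-approximation interval are assumptions on my part, not necessarily the exact tooling used in these tests.

```python
# Sketch only: compute mean packet loss and a 95% confidence interval from
# repeated ping runs (normal approximation, ~1.96 * standard error).
import math
import re
import statistics
import subprocess
from typing import List, Tuple


def ping_loss(dst: str, count: int = 100) -> float:
    """Run ping and return the reported packet-loss percentage."""
    out = subprocess.run(
        ["ping", "-c", str(count), dst],
        capture_output=True, text=True, check=False,
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else 100.0


def mean_ci95(samples: List[float]) -> Tuple[float, float]:
    """Return (mean, half-width of a 95% confidence interval)."""
    mean = statistics.mean(samples)
    if len(samples) < 2:
        return mean, 0.0
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * sem


if __name__ == "__main__":
    # Hypothetical address: h1 as seen from one of the other hosts.
    losses = [ping_loss("10.0.0.1", count=50) for _ in range(5)]
    avg, ci = mean_ci95(losses)
    print(f"packet loss: {avg:.1f}% +/- {ci:.1f}%")
```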
Another evaluation I executed was a performance test on link failover convergence, to understand whether the changes here would impact performance in ideal scenarios (where we can really benefit from the failover_path). The results are below (the test basically measures how long it takes for Kytos to fail over a Link with 100 EVCs):
Without this change:
With this change:
We can see that the values are very similar, but with the change, the performance was slightly better.
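The convergence measurement method isn't spelled out in this thread; one common approximation (an assumption on my part, not necessarily the method used here) is to probe continuously at a short interval while the link failure happens and estimate downtime as lost probes times the probe interval, roughly like this:

```python
# Rough downtime estimate from continuous fast pings; assumes a Linux ping,
# where sub-200 ms intervals typically require root privileges.
import re
import subprocess


def estimate_downtime(dst: str, interval_s: float = 0.05,
                      duration_s: float = 30.0) -> float:
    """Approximate dataplane downtime (seconds) as lost_probes * interval."""
    count = int(duration_s / interval_s)
    out = subprocess.run(
        ["ping", "-i", str(interval_s), "-c", str(count), dst],
        capture_output=True, text=True, check=False,
    ).stdout
    match = re.search(r"(\d+) packets transmitted, (\d+) received", out)
    if not match:
        return float("nan")
    sent, received = int(match.group(1)), int(match.group(2))
    return (sent - received) * interval_s


# Trigger the link-down while the probes run, then read the estimate, e.g.:
# print(estimate_downtime("10.0.0.1"))  # hypothetical EVC endpoint address
```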
End-to-End Tests
Results from the end-to-end tests using this branch (Job #59047):