
Enhance link down handler to avoid race conditions in the failover path #438

Merged: 7 commits, Mar 8, 2024

Conversation

@italovalcy commented Feb 22, 2024

Closes #405

Summary

See the updated changelog file and/or any other summarized helpful information for reviewers (in fact, the changelog already describes this fix: fixed a race condition in failover_path when handling simultaneous Link Down events that was leading to inconsistencies on some EVCs).
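For context on the class of bug being addressed, here is a minimal sketch of the general pattern for avoiding this kind of race: serialize link-down handling per EVC so that concurrent events cannot interleave their updates to current_path and failover_path. This is not the actual mef_eline implementation; all names (LinkDownHandler, evc.redeploy, etc.) are illustrative.

```python
# Illustrative sketch only -- not the actual mef_eline code.
# Serializes link-down handling per EVC so that two simultaneous link-down
# events cannot interleave their updates to current_path / failover_path.
from collections import defaultdict
from threading import Lock


class LinkDownHandler:
    def __init__(self):
        # One lock per EVC id, created lazily on first access.
        self._locks = defaultdict(Lock)

    def handle_link_down(self, evc, link):
        with self._locks[evc.id]:
            if link in evc.failover_path:
                # The failover path contains the failed link: discard it so it
                # can never be taken over while broken.
                evc.failover_path = []
            if link in evc.current_path:
                if evc.failover_path:
                    # Take over with the pre-installed failover path.
                    evc.current_path, evc.failover_path = evc.failover_path, []
                else:
                    # No usable failover path: compute and deploy a new one.
                    evc.redeploy()
```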

Local Tests

Using the topology presented in the issue, I executed tests with and without this change to evaluate how a user (EVC service) would be impacted. The tests basically consist of all nodes pinging h1 (h2-h1, h3-h1, h4-h1, and h5-h1); on each step, we disable both interfaces that directly connect a pair of nodes (first disabling the h1-h2 links, then the h1-h3 links, then the h1-h4 links). The link-down event is simulated at around 25% of the data transfer on each step. The metric used here is packet loss, so 0 means no packet loss (good) while 100 means 100% packet loss (bad); results are presented as an average with a 95% confidence interval.
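For reference, a minimal sketch of how the reported mean and 95% confidence interval could be computed from repeated packet-loss measurements; this is not the actual test harness, and the 1.96 normal-approximation z value is an assumption.

```python
# Sketch of the statistic reported below (mean_95_CI) -- not the real harness.
import statistics


def mean_95_ci(losses):
    """Return (mean, 95% CI half-width) for a list of packet-loss percentages."""
    mean = statistics.mean(losses)
    if len(losses) < 2:
        return mean, 0.0
    # Normal approximation: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(losses) / (len(losses) ** 0.5)
    return mean, half_width


# Example with hypothetical per-run packet-loss values for one step:
mean, ci = mean_95_ci([75.0, 75.2, 74.9, 75.1])
print(f"--> step 1 mean_95_CI {mean:.3f} +- {ci:.3f}")
```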

  1. Without this change/fix:
Test h1-h2
--> step 1 mean_95_CI 75.050 +- 0.113
--> step 2 mean_95_CI 100.000 +- 0.000
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 75.250 +- 0.189
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h4-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 75.300 +- 0.185
--> step 4 mean_95_CI 100.000 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 75.400 +- 0.151
  2. With the change/fix:
Test h1-h2
--> step 1 mean_95_CI 2.500 +- 0.000
--> step 2 mean_95_CI 9.000 +- 0.000
--> step 3 mean_95_CI 8.500 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.100 +- 0.278
--> step 2 mean_95_CI 8.900 +- 0.278
--> step 3 mean_95_CI 8.500 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000
Test h4-h1
--> step 1 mean_95_CI 2.000 +- 2.105
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 8.500 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000

The results above basically mean that: 1) without the change, tests from h2 to h1 show 75% packet loss on the first step and then 100% packet loss in all other steps -- in other words, the EVC got stuck after the first failover routine because the path was wrongly provisioned; the same happens for tests from h3 to h1, except that step 1 has no impact because those links do not affect communication from h3 to h1, but starting on step 2 we see pretty much the same behavior; 2) with the change we still see some packet loss, but no EVC gets stuck -- we can confirm this because the packet loss is lower than the fraction of the transfer remaining after the event happens (in other words, the packet loss is way lower than 75%). One could expect 0% packet loss, but that is not possible in this scenario because, unfortunately, the failover path is chosen over a link that is itself subject to failures (the hypothesis here is that those two links share the underlying physical media, e.g., an optical system) -- see issue #439 for more info on this context.

Another evaluation I executed was a performance test on link failover convergence, to understand whether the changes here would impact performance in ideal scenarios (where we can really benefit from the failover_path). The results are below (the test basically measures how long it takes for Kytos to fail over a link with 100 EVCs):
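For illustration, the convergence measurement can be outlined roughly as below. This is a hypothetical sketch, not the actual benchmark; the two callables passed in are assumptions about how the link failure is triggered and how completion is detected.

```python
# Hypothetical convergence-measurement loop -- not the actual benchmark.
# Assumes a wait_until_failover_done() helper that polls Kytos (e.g. via the
# mef_eline API) until all affected EVCs report an active, consistent path.
import time


def measure_failover(trigger_link_down, wait_until_failover_done):
    """Return the seconds between the link-down event and full convergence."""
    start = time.monotonic()
    trigger_link_down()          # e.g. bring an interface down on the switch
    wait_until_failover_done()   # poll until the 100 EVCs are redeployed
    return time.monotonic() - start


# Repeating the measurement yields the min / percentiles / mean_95_CI
# statistics reported in this PR.
```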

Without this change:

min 0.376 / 25pct 0.597 / 50pct 1.099 / 90pcl 1.499 / max 1.757 / mean_95_CI 0.996 +- 0.036

With this change:

min 0.376 / 25pct 0.552 / 50pct 0.889 / 90pcl 1.469 / max 1.547 / mean_95_CI 0.933 +- 0.036

We can see that the values are very similar, but with the change, the performance was slightly better.

End-to-End Tests

Results from the end-to-end tests using this branch (Job #59047):

============================= test session starts ==============================
platform linux -- Python 3.9.2, pytest-7.2.0, pluggy-1.4.0
rootdir: /builds/amlight/kytos-end-to-end-tester/kytos-end-to-end-tests
plugins: rerunfailures-10.2, timeout-2.1.0, anyio-3.6.2
collected 257 items

tests/test_e2e_01_kytos_startup.py ..                                    [  0%]
tests/test_e2e_05_topology.py ....................                       [  8%]
tests/test_e2e_10_mef_eline.py ..........ss.....x.....x................  [ 24%]
tests/test_e2e_11_mef_eline.py ......                                    [ 26%]
tests/test_e2e_12_mef_eline.py .....Xx.                                  [ 29%]
tests/test_e2e_13_mef_eline.py ....Xs.s.....Xs.s.XXxX.xxxx..X........... [ 45%]
.                                                                        [ 45%]
tests/test_e2e_14_mef_eline.py x                                         [ 46%]
tests/test_e2e_15_mef_eline.py .....                                     [ 48%]
tests/test_e2e_16_mef_eline.py .                                         [ 48%]
tests/test_e2e_20_flow_manager.py .....................                  [ 56%]
tests/test_e2e_21_flow_manager.py ...                                    [ 57%]
tests/test_e2e_22_flow_manager.py ...............                        [ 63%]
tests/test_e2e_23_flow_manager.py ..............                         [ 69%]
tests/test_e2e_30_of_lldp.py ....                                        [ 70%]
tests/test_e2e_31_of_lldp.py ...                                         [ 71%]
tests/test_e2e_32_of_lldp.py ...                                         [ 73%]
tests/test_e2e_40_sdntrace.py ..............                             [ 78%]
tests/test_e2e_41_kytos_auth.py ........                                 [ 81%]
tests/test_e2e_42_sdntrace.py ..                                         [ 82%]
tests/test_e2e_50_maintenance.py ........................                [ 91%]
tests/test_e2e_60_of_multi_table.py .....                                [ 93%]
tests/test_e2e_70_kytos_stats.py ........                                [ 96%]
tests/test_e2e_80_pathfinder.py ss......                                 [100%]

=============================== warnings summary ===============================
= 233 passed, 8 skipped, 9 xfailed, 7 xpassed, 1143 warnings in 12324.60s (3:25:24) =

@Ktmi left a comment

Seems like you are on the right track here

@Ktmi commented Feb 22, 2024

Tested this against the example from #405, and the issue there appears to be resolved.

@italovalcy marked this pull request as ready for review February 23, 2024 18:04
@italovalcy requested a review from a team as a code owner February 23, 2024 18:04
…ulk update EVCs dict; delegate evc.setup_failover_path to consistency routine; create new event to cleanup old failover path
@italovalcy (Author) commented Mar 6, 2024

Hi @viniarck, based on our discussion above on the performance impact of evc.sync() and evc.setup_failover_path(), I've submitted new commits to:

  1. Implement a bulk write for the EVCs updated by the failover-path takeover
  2. Delegate the setup of a new failover path to the consistency check routine, with an extra filter so the failover path is only set up after the EVC has been stable for some time (currently, after one consistency check round) -- a rough sketch of both ideas is shown right below
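A rough sketch of the shape of both ideas, with hypothetical names (bulk_update_evcs, try_setup_failover_path, stable_rounds); this is not the actual mef_eline code, just an outline of the approach described above.

```python
# Illustrative sketch only -- hypothetical helper names, not the actual code.

def handle_link_down(evcs, link, controller):
    """Fail over all EVCs affected by `link` and persist them in one write."""
    updated = {}
    for evc in evcs:
        if link in evc.current_path and evc.failover_path:
            # Take over with the pre-computed failover path...
            evc.current_path, evc.failover_path = evc.failover_path, []
            # ...but do NOT recompute a new failover path here.
            updated[evc.id] = evc.as_dict()
    if updated:
        # 1) Single bulk write instead of one evc.sync() per EVC.
        controller.bulk_update_evcs(updated)


def consistency_check_round(evcs, stable_rounds_required=1):
    """2) Failover-path setup is deferred to the periodic consistency check."""
    for evc in evcs:
        if evc.is_active() and not evc.failover_path:
            # Only set it up after the EVC has been stable for at least one
            # consistency-check round, to avoid racing with ongoing failovers.
            if evc.stable_rounds >= stable_rounds_required:
                evc.try_setup_failover_path()
            else:
                evc.stable_rounds += 1
```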

After adding those two changes, I re-executed the local tests, performance tests and also end-to-end tests. Results are presented below.

Local tests

Same strategy as before: simulating the extreme scenario where two links between adjacent switches exist and are subject to simultaneous failure, making it challenging for mef_eline to handle the two events that affect both current_path and failover_path.

A) Packet loss without this change/fix (same result as before):

Test h1-h2
--> step 1 mean_95_CI 75.050 +- 0.113
--> step 2 mean_95_CI 100.000 +- 0.000
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 75.250 +- 0.189
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h4-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 75.300 +- 0.185
--> step 4 mean_95_CI 100.000 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 75.400 +- 0.151

B) Packet loss with the new changes:

Test h1-h2
--> step 1 mean_95_CI 2.000 +- 0.000
--> step 2 mean_95_CI 4.600 +- 0.278
--> step 3 mean_95_CI 11.500 +- 0.000
--> step 4 mean_95_CI 13.000 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 7.200 +- 0.340
--> step 3 mean_95_CI 10.500 +- 0.000
--> step 4 mean_95_CI 12.300 +- 0.340
Test h4-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 8.700 +- 0.340
--> step 4 mean_95_CI 10.500 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 7.500 +- 0.439

Here we can see that the newly proposed changes really improved the failure handling in this scenario, because the packet loss is much lower than previously measured, meaning mef_eline didn't suffer from race conditions when handling the multiple failures and was actually able to redeploy a new path for the EVCs.

Performance test

New results from the performance test simulating a link failure with 100 EVCs using the failed Link, to measure the convergence:

min 23.873 / 25pct 26.955 / 50pct 27.342 / 90pcl 28.712 / max 29.817 / mean_95_CI 27.563 +- 0.062

To help understand the result above, we have to compare it with the previous results:

Without this change:

min 0.376 / 25pct 0.597 / 50pct 1.099 / 90pcl 1.499 / max 1.757 / mean_95_CI 0.996 +- 0.036

With this change (previous version):

min 0.376 / 25pct 0.552 / 50pct 0.889 / 90pcl 1.469 / max 1.547 / mean_95_CI 0.933 +- 0.036

In summary: there is something wrong with the results above; they got much worse.

End-to-end test

============================= test session starts ==============================
platform linux -- Python 3.9.2, pytest-7.2.0, pluggy-1.4.0
rootdir: /builds/amlight/kytos-end-to-end-tester/kytos-end-to-end-tests
plugins: rerunfailures-10.2, timeout-2.1.0, anyio-3.6.2
collected 261 items

tests/test_e2e_01_kytos_startup.py ..                                    [  0%]
tests/test_e2e_05_topology.py ....................                       [  8%]
tests/test_e2e_10_mef_eline.py ..........ss.....x.....x................  [ 23%]
tests/test_e2e_11_mef_eline.py ......                                    [ 26%]
tests/test_e2e_12_mef_eline.py .....Xx.                                  [ 29%]
tests/test_e2e_13_mef_eline.py ....Xs.s.....Xs.s.XXxX.xxxx..X........... [ 44%]
.                                                                        [ 45%]
tests/test_e2e_14_mef_eline.py x                                         [ 45%]
tests/test_e2e_15_mef_eline.py .....                                     [ 47%]
tests/test_e2e_16_mef_eline.py .                                         [ 47%]
tests/test_e2e_20_flow_manager.py .....................                  [ 55%]
tests/test_e2e_21_flow_manager.py ...                                    [ 57%]
tests/test_e2e_22_flow_manager.py ...............                        [ 62%]
tests/test_e2e_23_flow_manager.py ..............                         [ 68%]
tests/test_e2e_30_of_lldp.py ....                                        [ 69%]
tests/test_e2e_31_of_lldp.py ...                                         [ 70%]
tests/test_e2e_32_of_lldp.py ...                                         [ 72%]
tests/test_e2e_40_sdntrace.py ..............                             [ 77%]
tests/test_e2e_41_kytos_auth.py ........                                 [ 80%]
tests/test_e2e_42_sdntrace.py ..                                         [ 81%]
tests/test_e2e_50_maintenance.py ............................            [ 91%]
tests/test_e2e_60_of_multi_table.py .....                                [ 93%]
tests/test_e2e_70_kytos_stats.py ........                                [ 96%]
tests/test_e2e_80_pathfinder.py ss......                                 [100%]

=============================== warnings summary ===============================
------------------------------- start/stop times -------------------------------
= 237 passed, 8 skipped, 9 xfailed, 7 xpassed, 1143 warnings in 12264.43s (3:24:24) =

@viniarck self-requested a review March 6, 2024 12:17
@viniarck (Member) left a comment

Nicely done @italovalcy, and much appreciated your help with this one; this was a tough one and I'm glad how succinct and great the solution has become. Glad to see the results too. I also explored with 200 EVCs, and from what I've seen it also worked well on my local env.

Other than that, I opened a few threads, none of them blockers: some are just points to be aware of regarding future work on telemetry_int, one is regarding the consistency check, and the rest is just minor code implementation details, so it's up to you whether it's worth slightly changing them or not. Once e2e and the changelog are updated, feel free to merge.

…ducing try_setup_failover when EVC gets redeployed
@italovalcy (Author)

Hi,

In summary: there is something wrong with the results above; they got much worse.

New results:

min 0.331 / 25pct 0.508 / 50pct 0.967 / 90pcl 1.492 / max 1.646 / mean_95_CI 0.920 +- 0.026

The problem was basically the testing methodology, which was starting the link failure simulation a few seconds after finishing the circuit creation (which didn't give enough time to set up the failover path). I changed the testing methodology to give more time between EVC creation and the link failure simulations. With the changes introduced in commit 9fb511b, this won't even be necessary.

To help understand the result above, we have to compare it with the previous results:

Without this change:

min 0.376 / 25pct 0.597 / 50pct 1.099 / 90pcl 1.499 / max 1.757 / mean_95_CI 0.996 +- 0.036

With this change (previous version):

min 0.376 / 25pct 0.552 / 50pct 0.889 / 90pcl 1.469 / max 1.547 / mean_95_CI 0.933 +- 0.036
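As a side note on commit 9fb511b mentioned above, the idea can be sketched roughly as below (hypothetical names, not the actual diff): once an EVC gets redeployed, a failover path is attempted right away instead of waiting for the next consistency-check round, which is why the extra wait in the test methodology becomes unnecessary.

```python
# Illustrative sketch with hypothetical names -- not the actual commit.
def on_evc_redeployed(evc):
    """Called right after an EVC's current_path has been redeployed."""
    if not evc.failover_path:
        # Try to pre-compute and install a failover path immediately, so the
        # EVC is protected without waiting for the next consistency round.
        evc.try_setup_failover_path()
```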

@italovalcy requested a review from viniarck March 8, 2024 14:21
@viniarck (Member) left a comment

LGTM. Excellent PR @italovalcy, appreciated the updates; glad to see that the perf results ended up great and it was just a methodology detail in the last run. Also appreciated the issues you've mapped for us to be aware of and address in 2024.1.

Before merging, make sure to also update the CHANGELOG.rst, thanks.


Successfully merging this pull request may close these issues.

failover path still subject to race condition when multiple link down events affect an EVC