
Enhance link down handler to avoid race conditions in the failover path #438

Merged: 7 commits, Mar 8, 2024

Conversation

@italovalcy commented Feb 22, 2024

Closes #405

Summary

See the updated changelog file and/or any other summarized helpful information for reviewers (in fact, the changelog already describes this fix: fixed a race condition in failover_path when handling simultaneous Link Down events that was leading to inconsistencies on some EVCs).
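For context on the class of bug being addressed, here is a minimal sketch of the general pattern for avoiding this kind of race: serialize link-down handling per EVC so that concurrent events cannot interleave their updates to current_path and failover_path. This is not the actual mef_eline implementation; all names (LinkDownHandler, evc.redeploy, etc.) are illustrative.

```python
# Illustrative sketch only -- not the actual mef_eline code.
# Serializes link-down handling per EVC so that two simultaneous link-down
# events cannot interleave their updates to current_path / failover_path.
from collections import defaultdict
from threading import Lock


class LinkDownHandler:
    def __init__(self):
        # One lock per EVC id, created lazily on first access.
        self._locks = defaultdict(Lock)

    def handle_link_down(self, evc, link):
        with self._locks[evc.id]:
            if link in evc.failover_path:
                # The failover path contains the failed link: discard it so it
                # can never be taken over while broken.
                evc.failover_path = []
            if link in evc.current_path:
                if evc.failover_path:
                    # Take over with the pre-installed failover path.
                    evc.current_path, evc.failover_path = evc.failover_path, []
                else:
                    # No usable failover path: compute and deploy a new one.
                    evc.redeploy()
```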

Local Tests

Using the topology presented in the issue, I executed tests with and without this change to evaluate how a user (EVC service) would be impacted. The tests basically consist of all nodes pinging h1 (h2-h1, h3-h1, h4-h1, and h5-h1); on each step, we disable both interfaces that directly connect a pair of nodes (first disabling the h1-h2 links, then the h1-h3 links, then the h1-h4 links). The link-down event is simulated at around 25% of the data transfer on each step. The metric used here is packet loss, so 0 means no packet loss (good) while 100 means 100% packet loss (bad); results are presented as an average with a 95% confidence interval.
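For reference, a minimal sketch of how the reported mean and 95% confidence interval could be computed from repeated packet-loss measurements; this is not the actual test harness, and the 1.96 normal-approximation z value is an assumption.

```python
# Sketch of the statistic reported below (mean_95_CI) -- not the real harness.
import statistics


def mean_95_ci(losses):
    """Return (mean, 95% CI half-width) for a list of packet-loss percentages."""
    mean = statistics.mean(losses)
    if len(losses) < 2:
        return mean, 0.0
    # Normal approximation: 1.96 * standard error of the mean.
    half_width = 1.96 * statistics.stdev(losses) / (len(losses) ** 0.5)
    return mean, half_width


# Example with hypothetical per-run packet-loss values for one step:
mean, ci = mean_95_ci([75.0, 75.2, 74.9, 75.1])
print(f"--> step 1 mean_95_CI {mean:.3f} +- {ci:.3f}")
```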

  1. Without this change/fix:
Test h1-h2
--> step 1 mean_95_CI 75.050 +- 0.113
--> step 2 mean_95_CI 100.000 +- 0.000
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 75.250 +- 0.189
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h4-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 75.300 +- 0.185
--> step 4 mean_95_CI 100.000 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 75.400 +- 0.151
  2. With the change/fix:
Test h1-h2
--> step 1 mean_95_CI 2.500 +- 0.000
--> step 2 mean_95_CI 9.000 +- 0.000
--> step 3 mean_95_CI 8.500 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.100 +- 0.278
--> step 2 mean_95_CI 8.900 +- 0.278
--> step 3 mean_95_CI 8.500 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000
Test h4-h1
--> step 1 mean_95_CI 2.000 +- 2.105
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 8.500 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 13.500 +- 0.000

The results above basically mean that: 1) without the change, tests from h2 to h1 show 75% packet loss on the first step and then 100% packet loss in all other steps -- in other words, the EVC got stuck after the first failover routine because the path was wrongly provisioned; the same happens for tests from h3 to h1, except that step 1 has no impact because those links do not affect communication from h3 to h1, but starting on step 2 we see pretty much the same behavior; 2) with the change we still see some packet loss, but no EVC gets stuck -- we can confirm this because the packet loss is lower than the fraction of the transfer remaining after the event happens (in other words, the packet loss is way lower than 75%). One could expect 0% packet loss, but that is not possible in this scenario because, unfortunately, the failover path is chosen over a link that is itself subject to failures (the hypothesis here is that those two links share the underlying physical media, e.g., an optical system) -- see issue #439 for more info on this context.

Another evaluation I executed was a performance test on link failover convergence, to understand whether the changes here would impact performance in ideal scenarios (where we can really benefit from the failover_path). The results are below (the test basically measures how long it takes for Kytos to fail over a link with 100 EVCs):
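For illustration, the convergence measurement can be outlined roughly as below. This is a hypothetical sketch, not the actual benchmark; the two callables passed in are assumptions about how the link failure is triggered and how completion is detected.

```python
# Hypothetical convergence-measurement loop -- not the actual benchmark.
# Assumes a wait_until_failover_done() helper that polls Kytos (e.g. via the
# mef_eline API) until all affected EVCs report an active, consistent path.
import time


def measure_failover(trigger_link_down, wait_until_failover_done):
    """Return the seconds between the link-down event and full convergence."""
    start = time.monotonic()
    trigger_link_down()          # e.g. bring an interface down on the switch
    wait_until_failover_done()   # poll until the 100 EVCs are redeployed
    return time.monotonic() - start


# Repeating the measurement yields the min / percentiles / mean_95_CI
# statistics reported in this PR.
```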

Without this change:

min 0.376 / 25pct 0.597 / 50pct 1.099 / 90pcl 1.499 / max 1.757 / mean_95_CI 0.996 +- 0.036

With this change:

min 0.376 / 25pct 0.552 / 50pct 0.889 / 90pcl 1.469 / max 1.547 / mean_95_CI 0.933 +- 0.036

We can see that the values are very similar, but with the change, the performance was slightly better.

End-to-End Tests

Results from the end-to-end tests using this branch (Job #59047):

============================= test session starts ==============================
platform linux -- Python 3.9.2, pytest-7.2.0, pluggy-1.4.0
rootdir: /builds/amlight/kytos-end-to-end-tester/kytos-end-to-end-tests
plugins: rerunfailures-10.2, timeout-2.1.0, anyio-3.6.2
collected 257 items

tests/test_e2e_01_kytos_startup.py ..                                    [  0%]
tests/test_e2e_05_topology.py ....................                       [  8%]
tests/test_e2e_10_mef_eline.py ..........ss.....x.....x................  [ 24%]
tests/test_e2e_11_mef_eline.py ......                                    [ 26%]
tests/test_e2e_12_mef_eline.py .....Xx.                                  [ 29%]
tests/test_e2e_13_mef_eline.py ....Xs.s.....Xs.s.XXxX.xxxx..X........... [ 45%]
.                                                                        [ 45%]
tests/test_e2e_14_mef_eline.py x                                         [ 46%]
tests/test_e2e_15_mef_eline.py .....                                     [ 48%]
tests/test_e2e_16_mef_eline.py .                                         [ 48%]
tests/test_e2e_20_flow_manager.py .....................                  [ 56%]
tests/test_e2e_21_flow_manager.py ...                                    [ 57%]
tests/test_e2e_22_flow_manager.py ...............                        [ 63%]
tests/test_e2e_23_flow_manager.py ..............                         [ 69%]
tests/test_e2e_30_of_lldp.py ....                                        [ 70%]
tests/test_e2e_31_of_lldp.py ...                                         [ 71%]
tests/test_e2e_32_of_lldp.py ...                                         [ 73%]
tests/test_e2e_40_sdntrace.py ..............                             [ 78%]
tests/test_e2e_41_kytos_auth.py ........                                 [ 81%]
tests/test_e2e_42_sdntrace.py ..                                         [ 82%]
tests/test_e2e_50_maintenance.py ........................                [ 91%]
tests/test_e2e_60_of_multi_table.py .....                                [ 93%]
tests/test_e2e_70_kytos_stats.py ........                                [ 96%]
tests/test_e2e_80_pathfinder.py ss......                                 [100%]

=============================== warnings summary ===============================
= 233 passed, 8 skipped, 9 xfailed, 7 xpassed, 1143 warnings in 12324.60s (3:25:24) =

@Ktmi left a comment

Seems like you are on the right track here

@Ktmi commented Feb 22, 2024

Tested this against the example from #405, and the issue there appears to be resolved.

@italovalcy marked this pull request as ready for review February 23, 2024 18:04
@italovalcy requested a review from a team as a code owner February 23, 2024 18:04
…ulk update EVCs dict; delegate evc.setup_failover_path to consistency routine; create new event to cleanup old failover path
@italovalcy (Author) commented Mar 6, 2024

Hi @viniarck, based on our discussion above on the performance impact of evc.sync() and evc.setup_failover_path(), I've submitted new commits to:

  1. Implement a bulk write for the EVCs updated by the failover-path takeover
  2. Delegate the setup of a new failover path to the consistency check routine, with an extra filter so the failover path is only set up after the EVC has been stable for some time (currently, after one consistency check round) -- a rough sketch of both ideas is shown right below
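A rough sketch of the shape of both ideas, with hypothetical names (bulk_update_evcs, try_setup_failover_path, stable_rounds); this is not the actual mef_eline code, just an outline of the approach described above.

```python
# Illustrative sketch only -- hypothetical helper names, not the actual code.

def handle_link_down(evcs, link, controller):
    """Fail over all EVCs affected by `link` and persist them in one write."""
    updated = {}
    for evc in evcs:
        if link in evc.current_path and evc.failover_path:
            # Take over with the pre-computed failover path...
            evc.current_path, evc.failover_path = evc.failover_path, []
            # ...but do NOT recompute a new failover path here.
            updated[evc.id] = evc.as_dict()
    if updated:
        # 1) Single bulk write instead of one evc.sync() per EVC.
        controller.bulk_update_evcs(updated)


def consistency_check_round(evcs, stable_rounds_required=1):
    """2) Failover-path setup is deferred to the periodic consistency check."""
    for evc in evcs:
        if evc.is_active() and not evc.failover_path:
            # Only set it up after the EVC has been stable for at least one
            # consistency-check round, to avoid racing with ongoing failovers.
            if evc.stable_rounds >= stable_rounds_required:
                evc.try_setup_failover_path()
            else:
                evc.stable_rounds += 1
```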

After adding those two changes, I re-executed the local tests, performance tests and also end-to-end tests. Results are presented below.

Local tests

Same strategy as before: simulating the extreme scenario where two links between adjacent switches exist and are subject to simultaneous failure, making it challenging for mef_eline to handle the two events that affect both current_path and failover_path.

A) Packet loss without this change/fix (same result as before):

Test h1-h2
--> step 1 mean_95_CI 75.050 +- 0.113
--> step 2 mean_95_CI 100.000 +- 0.000
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 75.250 +- 0.189
--> step 3 mean_95_CI 100.000 +- 0.000
--> step 4 mean_95_CI 100.000 +- 0.000
Test h4-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 75.300 +- 0.185
--> step 4 mean_95_CI 100.000 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 75.400 +- 0.151

B) Packet loss with the new changes:

Test h1-h2
--> step 1 mean_95_CI 2.000 +- 0.000
--> step 2 mean_95_CI 4.600 +- 0.278
--> step 3 mean_95_CI 11.500 +- 0.000
--> step 4 mean_95_CI 13.000 +- 0.000
Test h3-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 7.200 +- 0.340
--> step 3 mean_95_CI 10.500 +- 0.000
--> step 4 mean_95_CI 12.300 +- 0.340
Test h4-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 8.700 +- 0.340
--> step 4 mean_95_CI 10.500 +- 0.000
Test h5-h1
--> step 1 mean_95_CI 0.000 +- 0.000
--> step 2 mean_95_CI 0.000 +- 0.000
--> step 3 mean_95_CI 0.000 +- 0.000
--> step 4 mean_95_CI 7.500 +- 0.439

Here we can see that the newly proposed changes really improved the failure handling in this scenario, because the packet loss is much lower than previously measured, meaning mef_eline didn't suffer from race conditions when handling the multiple failures and was actually able to redeploy a new path for the EVCs.

Performance test

New results from the performance test simulating a link failure with 100 EVCs using the failed Link, to measure the convergence:

min 23.873 / 25pct 26.955 / 50pct 27.342 / 90pcl 28.712 / max 29.817 / mean_95_CI 27.563 +- 0.062

To help understand the result above, we have to compare it with the previous results:

Without this change:

min 0.376 / 25pct 0.597 / 50pct 1.099 / 90pcl 1.499 / max 1.757 / mean_95_CI 0.996 +- 0.036

With this change (previous version):

min 0.376 / 25pct 0.552 / 50pct 0.889 / 90pcl 1.469 / max 1.547 / mean_95_CI 0.933 +- 0.036

In summary: there is something wrong with the results above; they got much worse.

End-to-end test

============================= test session starts ==============================
platform linux -- Python 3.9.2, pytest-7.2.0, pluggy-1.4.0
rootdir: /builds/amlight/kytos-end-to-end-tester/kytos-end-to-end-tests
plugins: rerunfailures-10.2, timeout-2.1.0, anyio-3.6.2
collected 261 items

tests/test_e2e_01_kytos_startup.py ..                                    [  0%]
tests/test_e2e_05_topology.py ....................                       [  8%]
tests/test_e2e_10_mef_eline.py ..........ss.....x.....x................  [ 23%]
tests/test_e2e_11_mef_eline.py ......                                    [ 26%]
tests/test_e2e_12_mef_eline.py .....Xx.                                  [ 29%]
tests/test_e2e_13_mef_eline.py ....Xs.s.....Xs.s.XXxX.xxxx..X........... [ 44%]
.                                                                        [ 45%]
tests/test_e2e_14_mef_eline.py x                                         [ 45%]
tests/test_e2e_15_mef_eline.py .....                                     [ 47%]
tests/test_e2e_16_mef_eline.py .                                         [ 47%]
tests/test_e2e_20_flow_manager.py .....................                  [ 55%]
tests/test_e2e_21_flow_manager.py ...                                    [ 57%]
tests/test_e2e_22_flow_manager.py ...............                        [ 62%]
tests/test_e2e_23_flow_manager.py ..............                         [ 68%]
tests/test_e2e_30_of_lldp.py ....                                        [ 69%]
tests/test_e2e_31_of_lldp.py ...                                         [ 70%]
tests/test_e2e_32_of_lldp.py ...                                         [ 72%]
tests/test_e2e_40_sdntrace.py ..............                             [ 77%]
tests/test_e2e_41_kytos_auth.py ........                                 [ 80%]
tests/test_e2e_42_sdntrace.py ..                                         [ 81%]
tests/test_e2e_50_maintenance.py ............................            [ 91%]
tests/test_e2e_60_of_multi_table.py .....                                [ 93%]
tests/test_e2e_70_kytos_stats.py ........                                [ 96%]
tests/test_e2e_80_pathfinder.py ss......                                 [100%]

=============================== warnings summary ===============================
------------------------------- start/stop times -------------------------------
= 237 passed, 8 skipped, 9 xfailed, 7 xpassed, 1143 warnings in 12264.43s (3:24:24) =

@viniarck self-requested a review March 6, 2024 12:17
@viniarck (Member) left a comment

Nicely done @italovalcy, and much appreciated your help with this one; this was a tough one and I'm glad how succinct and great the solution has become. Glad to see the results too. I also explored with 200 EVCs, and from what I've seen it also worked well on my local env.

Other than that, I opened a few threads, none of them blockers: some are just points to be aware of regarding future work on telemetry_int, one is regarding the consistency check, and the rest is just minor code implementation details, so it's up to you whether it's worth slightly changing them or not. Once e2e and the changelog are updated, feel free to merge.

…ducing try_setup_failover when EVC gets redeployed
@italovalcy (Author)

Hi,

In summary: there is something wrong with the results above; they got much worse.

New results:

min 0.331 / 25pct 0.508 / 50pct 0.967 / 90pcl 1.492 / max 1.646 / mean_95_CI 0.920 +- 0.026

The problem was basically the testing methodology, which was starting the link failure simulation a few seconds after finishing the circuit creation (which didn't give enough time to set up the failover path). I changed the testing methodology to give more time between EVC creation and the link failure simulations. With the changes introduced in commit 9fb511b, this won't even be necessary.

To help understand the result above, we have to compare it with the previous results:

Without this change:

min 0.376 / 25pct 0.597 / 50pct 1.099 / 90pcl 1.499 / max 1.757 / mean_95_CI 0.996 +- 0.036

With this change (previous version):

min 0.376 / 25pct 0.552 / 50pct 0.889 / 90pcl 1.469 / max 1.547 / mean_95_CI 0.933 +- 0.036
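As a side note on commit 9fb511b mentioned above, the idea can be sketched roughly as below (hypothetical names, not the actual diff): once an EVC gets redeployed, a failover path is attempted right away instead of waiting for the next consistency-check round, which is why the extra wait in the test methodology becomes unnecessary.

```python
# Illustrative sketch with hypothetical names -- not the actual commit.
def on_evc_redeployed(evc):
    """Called right after an EVC's current_path has been redeployed."""
    if not evc.failover_path:
        # Try to pre-compute and install a failover path immediately, so the
        # EVC is protected without waiting for the next consistency round.
        evc.try_setup_failover_path()
```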

@italovalcy requested a review from viniarck March 8, 2024 14:21
@viniarck (Member) left a comment

LGTM. Excellent PR @italovalcy, appreciated the updates; glad to see that the perf results ended up great and it was just a methodology detail in the last run. Also appreciated the issues you've mapped for us to be aware of and address in 2024.1.

Before merging, make sure to also update the CHANGELOG.rst, thanks.


Successfully merging this pull request may close these issues.

failover path still subject to race condition when multiple link down events affect an EVC