Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redeploy of an EVC no longer fixes inconsistencies #543

Open
italovalcy opened this issue Oct 21, 2024 · 14 comments · May be fixed by #563
Open

Redeploy of an EVC no longer fixes inconsistencies #543

italovalcy opened this issue Oct 21, 2024 · 14 comments · May be fixed by #563
Assignees
Labels

Comments

@italovalcy
Copy link

Hi,

In the past we use redeploy feature to run a fresh setup of the EVC paths, and use it a lot in situations of problems like the EVC is having inconsistencies, the user is complaining about not forwarding, etc.

It happens that on 2023.2 we no longer can count on the redeploy feature to restart the deployment of the EVC:

> curl -X GET -u fulano -H 'Content-type: application/json' "https://xxxx/api/kytos/mef_eline/v2/evc/" > /tmp/evcs
> cat /tmp/evcs | jq -r '.["5679bb42a3a146"].current_path[]|.endpoint_a.id + " " + .endpoint_b.id + " " + (.metadata|tostring)'
00:00:00:00:00:xx:xx:03:1 00:00:00:00:00:xx:xx:04:1 {"s_vlan":{"tag_type":"vlan","value":15}}
00:00:00:00:00:xx:xx:10:25 00:00:00:00:00:xx:xx:04:32 {"s_vlan":{"tag_type":"vlan","value":1120}}
00:00:00:00:00:xx:xx:10:31 00:00:00:00:00:xx:xx:16:31 {"s_vlan":{"tag_type":"vlan","value":104}}
00:00:00:00:00:xx:xx:15:3 00:00:00:00:00:xx:xx:16:3 {"s_vlan":{"tag_type":"vlan","value":5}}
> curl -X PATCH -u fulano -H 'Content-type: application/json' "https://xxxx/api/kytos/mef_eline/v2/evc/{5679bb42a3a146}/redeploy"
Enter host password for user 'fulano':
{"response":"Circuit 5679bb42a3a146 redeploy received."}

> curl -X GET -u fulano -H 'Content-type: application/json' "https://xxxx/api/kytos/mef_eline/v2/evc/" > /tmp/evcs
> cat /tmp/evcs | jq -r '.["5679bb42a3a146"].current_path[]|.endpoint_a.id + " " + .endpoint_b.id + " " + (.metadata|tostring)'
00:00:00:00:00:xx:xx:03:1 00:00:00:00:00:xx:xx:04:1 {"s_vlan":{"tag_type":"vlan","value":15}}
00:00:00:00:00:xx:xx:10:25 00:00:00:00:00:xx:xx:04:32 {"s_vlan":{"tag_type":"vlan","value":1120}}
00:00:00:00:00:xx:xx:10:31 00:00:00:00:00:xx:xx:16:31 {"s_vlan":{"tag_type":"vlan","value":104}}
00:00:00:00:00:xx:xx:15:3 00:00:00:00:00:xx:xx:16:3 {"s_vlan":{"tag_type":"vlan","value":5}}

#
# As you can see, the service VLAN were still the same.
# 

# Then we decided to redeploy the other EVC (11a6a9a8778945) which was conflicting with 5679bb42a3a146

> curl -X PATCH -u fulano -H 'Content-type: application/json' "https://xxxx/api/kytos/mef_eline/v2/evc/{11a6a9a8778945}/redeploy"
Enter host password for user 'fulano':
{"response":"Circuit 11a6a9a8778945 redeploy received."}

> curl -X GET -u fulano -H 'Content-type: application/json' "https://xxxx/api/kytos/mef_eline/v2/evc/" > /tmp/evcs
Enter host password for user 'fulano':
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1892k  100 1892k    0     0   463k      0  0:00:04  0:00:04 --:--:--  463k

> cat /tmp/evcs | jq -r '.["5679bb42a3a146"].current_path[]|.endpoint_a.id + " " + .endpoint_b.id + " " + (.metadata|tostring)'
00:00:00:00:00:xx:xx:03:1 00:00:00:00:00:xx:xx:04:1 {"s_vlan":{"tag_type":"vlan","value":15}}
00:00:00:00:00:xx:xx:10:25 00:00:00:00:00:xx:xx:04:32 {"s_vlan":{"tag_type":"vlan","value":1120}}
00:00:00:00:00:xx:xx:10:31 00:00:00:00:00:xx:xx:16:31 {"s_vlan":{"tag_type":"vlan","value":104}}
00:00:00:00:00:xx:xx:15:3 00:00:00:00:00:xx:xx:16:3 {"s_vlan":{"tag_type":"vlan","value":5}}

> cat /tmp/evcs | jq -r '.["11a6a9a8778945"].current_path[]|.endpoint_a.id + " " + .endpoint_b.id + " " + (.metadata|tostring)'
00:00:00:00:00:xx:xx:10:25 00:00:00:00:00:xx:xx:04:32 {"s_vlan":{"tag_type":"vlan","value":1120}}
00:00:00:00:00:xx:xx:10:31 00:00:00:00:00:xx:xx:16:31 {"s_vlan":{"tag_type":"vlan","value":157}}
00:00:00:00:00:xx:xx:16:29 00:00:00:00:00:xx:xx:03:29 {"s_vlan":{"tag_type":"vlan","value":3}}
00:00:00:00:00:xx:xx:03:30 00:00:00:00:00:xx:xx:03:32 {"s_vlan":{"tag_type":"vlan","value":8}}
00:00:00:00:00:xx:xx:03:32 00:00:00:00:00:xx:xx:03:30 {"s_vlan":{"tag_type":"vlan","value":6}}

As can be seen above, even after some redeploys on two different EVCs, the service VLAN remained the same and as so the inconsistencies on the log, impacting the user. What triggered those inconsistencies? Just redeploying the EVCs in sequencial order (spaced with 5s). I had to disable both EVCs and then enable them again.

@viniarck
Copy link
Member

Indeed no same s vlans should be using in different EVCs, that's the root issue to be fixed:

00:00:00:00:00:xx:xx:10:25 00:00:00:00:00:xx:xx:04:32 {"s_vlan":{"tag_type":"vlan","value":1120}}
00:00:00:00:00:xx:xx:10:25 00:00:00:00:00:xx:xx:04:32 {"s_vlan":{"tag_type":"vlan","value":1120}}

Now regarding redeploy not using a next svlan, indeed after removal + install it'll pick up the same svlan that has just been released since now the vlans use ranges there. In prior versions the just released was appended to the end.

Now using that to fix conflicting problems that shouldn't be in the first place was handy for network opertations, we could consider introducing a feature where that's an option for the redeploy such as instead of getting the next available, pick the last available svlan (that's not difficult to implement and cover with tests), and or also consider exposing which s-vlan should be used in all links (not sure how useful but could be a feature too)

@italovalcy
Copy link
Author

Indeed no same s vlans should be using in different EVCs, that's the root issue to be fixed:

Yes, I believe so.

Now using that to fix conflicting problems that shouldn't be in the first place was handy for network opertations, we could consider introducing a feature where that's an option for the redeploy such as instead of getting the next available, pick the last available svlan (that's not difficult to implement and cover with tests), and or also consider exposing which s-vlan should be used in all links (not sure how useful but could be a feature too)

You are right, I agree with you Vinicius. In this case, I suggest we use the same endpoint for redeploy and the default option to be choosing new s-vlans (but also adding an option to allow keep with the same). I'm saying this because the main usage of the redeployment now is exactly to handle those errors, so I think for usability it is better that the old behavior remains the same. What do you think?

@viniarck
Copy link
Member

viniarck commented Oct 21, 2024

redeploy and the default option to be choosing new s-vlans

That can be done, highly recommend to still add an option for that and persist in the DB too, because even though it was being used to also force new s-vlans, still using the same s-vlans without race conditions can also be valuable. But, yes you decide the default behavior of the option @italovalcy, whatever you think it's best. Since Ops were also used to use new s-vlans, that can be the default.

Implementation-wise, the easiest way that also minimizes other states to potentially implement this would be to remember/store if in the last deployemtn if the s-vlans was allocated as the next or last, and then in the next deployment pick up a different direction (from either next or last), since that's easy to get. But trying to remember which s vlan per link and storing all of that and passing down the the new deployment wouldn't be too trivial and a expressive amount of code. cc'ing @Alopalao to think about and contribute to the discussion since he was the main contributor of this part.

@Alopalao
Copy link

I agree with Italo, that would be the fastest implementation instead of comparing existing path with new one and trying to pick a different VLAN.
Now, about the cause of the issue, @italovalcy do you know if the flows for these conflicted EVCs were installed before and after the issue?

@Alopalao
Copy link

Alopalao commented Nov 7, 2024

The behavior presented in this issue is expected. The VLAN is made available when the path is deleted and immediately used by the same EVC, resulting in the same VLAN.

We can implement two solutions to avoid the same VLAN but maintain some predictability.

  • Add an option to EVC to choose the "last" or "first" available VLAN for the links.
  • Remember the previous VLAN so the new VLAN is different. For example in case that the EVC is choosing the "first" VLAN available, the VLAN would be the smaller VLAN different than the previous VLAN. Naive implementation:
previous_vlan = 2
for vlan in range(1, 4095):
    if vlan != previous_vlan:
        vlan_taken = vlan
        break

In case of "last", the next VLAN would be the biggest VLAN different that the previous one.

Changes to mef_eline:

  • Add a field "last_tag" to EVC, if true the "last" VLAN will be taken. False is the default value.
  • Add a parameter to PATCH /mef_eline/v2/evc/{evc_id}/redeploy. last_tag a bool to determine new taken VLAN and it may change the last_tag value from the EVC if different. If there is not parameter, then the EVC will redeploy obeying the value stored in last_tag.

@italovalcy
Copy link
Author

The behavior presented in this issue is expected. The VLAN is made available when the path is deleted and immediately used by the same EVC, resulting in the same VLAN.

We can implement two solutions to avoid the same VLAN but maintain some predictability.

  • Add an option to EVC to choose the "last" or "first" available VLAN for the links.
  • Remember the previous VLAN so the new VLAN is different. For example in case that the EVC is choosing the "first" VLAN available, the VLAN would be the smaller VLAN different than the previous VLAN. Naive implementation:
previous_vlan = 2
for vlan in range(1, 4095):
    if vlan != previous_vlan:
        vlan_taken = vlan
        break

In case of "last", the next VLAN would be the biggest VLAN different that the previous one.

Changes to mef_eline:

  • Add a field "last_tag" to EVC, if true the "last" VLAN will be taken. False is the default value.
  • Add a parameter to PATCH /mef_eline/v2/evc/{evc_id}/redeploy. last_tag a bool to determine new taken VLAN and it may change the last_tag value from the EVC if different. If there is not parameter, then the EVC will redeploy obeying the value stored in last_tag.

Hi Aldo and Vinicius, thanks for looking into this issue and providing this proposal. I would like to suggest another approach and see what you guys think.

If I understood correctly, the root cause of the problem is that we first release the resource, then we request new resources and finally we end up getting the same resource released before. My suggestion is for us to invert the logic here: request new resources, then release the current resources, then provision them. Using this logic we even avoid another problem: #394

In other words, it seems like a correct assumption that we should release the current resources only if new resources are available.

What do you guys think?

@Alopalao
Copy link

Alopalao commented Nov 8, 2024

Seems reasonable and the final result would be the same as in memorizing the previous VLAN. I think keeping the current_path will require a more careful implementation. Maybe even break down the cases in which the EVC are when changes in the topology occurred. As it is right now, most of the EVC path modification will require to delete the not longer used path. wdyt @viniarck

@Alopalao
Copy link

Alopalao commented Nov 12, 2024

Checking for the case where we switch the order, I noticed that telemetry_int needs to know that an EVC has been "undeployed". For example, when updating an EVC and its needs deployment, the EVC will be remove before deploying. Maybe just send the "undeployed" event instead of also deleting paths. For redeployment, we first delete as well but omitting it seems safe.

@viniarck
Copy link
Member

If I understood correctly, the root cause of the problem is that we first release the resource, then we request new resources and finally we end up getting the same resource released before.

That's true, but the root cause with the current implementation is the fact that now deallocating a s_vlan will put it back in order in the available_tags pool as opposed to in the end, so leaving room for it to get picked again sooner or even a new next, since they're allocated in order by default.

My suggestion is for us to invert the logic here: request new resources, then release the current resources, then provision them. Using this logic we even avoid another problem: #394

It's reasonable @italovalcy, but it's a medium/major change as you know and as @Alopalao pointed out, and especially since this is being asked to be backported, if we want to minimize refactoring risks and stability, I think we should deliver this incrementally: a) only fix s_vlan being different while providing a basic/minimal configuration option, and then later on 2024.2+ b) start considering other optimizations as you pointed out on #394 when it gets prioritized.

Speaking of optimizations of not deallocating s_vlan first, there's a corner case with dynamic paths, where if a shared link has all s_vlans taken then it wouldn't be able to fully provision, whereas if deallocating that shared link first too it would give it back the s_vlan allowing it to provision, this is very rare but I think we need to reassess this type of thing too when considering other optimizations, that's different than the one you pointd out on 394, but it's related one.

That said, the option that @Alopalao suggested by remembering the s_vlan I think is good middle ground with also minimizing refactoring changes, since it's very lightweight implementation-wise we can provide guarantees that the s_vlan will be different (depending on the s_vlan specified strategy), while providing flexibility for in the future to change the s_vlan allocation strategy. And then, when b) gets prioritized this shouldn't impact on since the implementation would operate at a higher level so it could decide not to remove the path yet, and if a link up comes up and that path becomes active again then no convergence operation happens (that's assuming what we currently have that it doesn't need to try to search for a better path).

Can we go for a more incremental approach for now @italovalcy @Alopalao? Especially considering that's important to be backported for prod, unless that changed let us know, then we can consider the other adjacent proposals here too.

@viniarck
Copy link
Member

@Alopalao, regarding picking a different vlan, notice that in this example naive implementation in some cases it'd still need to give an extra s_vlan back too if it conflicted with the prior_s_vlan or old_s_vlan, it can work but now imagine with multiple EVCs doing all of this concurrently, it'll work too and the locks won't be held for too long since it's not too computationally expensive, but it can cause a bit of attrition with the underlying link and endpoints locks.

Since your implementation of picking the last tag has landed on master, can we also store which direction in the metadata was used and then pick the opposite side since that minimizes conflicts? The only exception would be when there's only one tag, but in that case it can either use the same or consider that's it's not available. In the future we can also provide an option to try to allocate a specific s_vlan too if ever needed, so storing both of these link metadata prior_s_vlan|old_s_vlan = number and s_vlan_prior_side|s_vlan_old_side = "first|last" might be handy I believe.

And then in the EVC we could expose this updatable attr: svlan_alloc_strategy: str = "different_or_same|different", where "different_or_same" would allocate a diff s_vlan alternating, and if there's only one it'd try to use the same, or different where it would be a hard requirement where it always need to be different, and then we set either of this values as a default. The prior implementation was equivalent to "different_or_same" since it was fully rotating for each link on each dealloc/alloc operation.

Food for thought. What do you think? cc'ing also @italovalcy if he has other ideas to contribute too.

@viniarck
Copy link
Member

Checking for the case where we switch the order, I noticed that telemetry_int needs to know that an EVC has been "undeployed". For example, when updating an EVC and its needs deployment, the EVC will be remove before deploying. Maybe just send the "undeployed" event instead of also deleting paths. For redeployment, we first delete as well but omitting it seems safe.

Yes, in summary, telemetry_int always follows mef_eline path changes + also the convergence of the external loops. So, whatever mef_eline does and will do in the future convergence path wise, telemetry_int has to keep following and subscribing to it. Also, telemetry_int also removes first in certain specific telemetry_int events like redeploy. Historically the full removal was desired to have and provided extra sanity in some cases, excluding the optimization part of what was being discussed here.

@Alopalao
Copy link

so storing both of these link metadata prior_s_vlan|old_s_vlan = number and s_vlan_prior_side|s_vlan_old_side = "first|last" might be handy I believe.

While this behavior can be achieved with minor changes, storing it in the link might require more changes. The problem is that we always find a new path for API request update and redeploy. mef_eline discards the previous path. Is it desirable to reuse the old path with only different VLANs for these cases? This would change the redeployment to instead re-installed flows in case where the path can be deployed again.

@viniarck
Copy link
Member

so storing both of these link metadata prior_s_vlan|old_s_vlan = number and s_vlan_prior_side|s_vlan_old_side = "first|last" might be handy I believe.

While this behavior can be achieved with minor changes, storing it in the link might require more changes. The problem is that we always find a new path for API request update and redeploy. mef_eline discards the previous path.

Oh yes, good observation. That would require keeping more states than I anticipated.

@viniarck
Copy link
Member

Is it desirable to reuse the old path with only different VLANs for these cases? This would change the redeployment to instead re-installed flows in case where the path can be deployed again.

That's a good point. For static EVCs, I'd say yes based on what Italo asked, the old path (either primary or backup) would always use a different s_vlan. But, now for this case for dynamic EVCs, there's the case that Italo was mentioning that if there's no other path the removal of this path could be delayed since a subsequent link up might or might not make it active again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants