
[release-4.18] OCPBUGS-50582: Graceful cleanup of IPsec states #2644

Open
wants to merge 6 commits into release-4.18
Conversation

pperiyasamy
Member

@pperiyasamy pperiyasamy commented Feb 12, 2025

This PR backports the following fixes to prevent unnecessary ipsec service restarts and ip xfrm state/policy cleanups while bringing up the ipsec-host pod. This avoids re-establishing IKE SAs during ipsec pod restarts and lets OVN pod traffic continue without packet drops.

  1. There is an incorrect check in the ipsec pod cleanup logic that removes the /etc/ipsec.d/openshift.conf file and the ip xfrm state and policy entries in all cases; these must be removed only when the ipsec mode changes from Full to External or Disabled.
  2. The narrowing=yes option no longer needs to be set explicitly, because the system default crypto policies are now commented out; it was previously needed to avoid a transient TS_UNACCEPTABLE error at ipsec service restart.
  3. The IPsec service restart is needed only for specific IPsec config changes, so it is now done only when commenting out the default crypto-policies conf file.
  4. After IPsec is deployed, when machine config pools go into a progressing/degraded state while processing other machine configs, or when a node reboots, CNO deletes the IPsec pods, which is the known behavior per commit 6aa2f0f, but it also accidentally disables IPsec in OVN, which is not expected. This causes ovs-monitor-ipsec to refresh existing ipsec connections unnecessarily when the IPsec pod comes back up (once the machine config pool settles), as it cannot find remote_name from the tunnel.
  5. It also removes dead code related to the IPsec 4.13 upgrade, which is no longer valid for upgrade scenarios >= 4.15. This change is intended to be backported down to 4.15, so the legacy upgrade (4.14->4.15) handling is kept.
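The cleanup guard described in point 1 can be sketched roughly as follows. This is a minimal illustration only; the type and function names are hypothetical stand-ins, not the actual CNO identifiers:

```go
package main

import "fmt"

// IPsecMode mirrors the API-level IPsec setting; the names are
// illustrative, not the real CNO constants.
type IPsecMode string

const (
	IPsecFull     IPsecMode = "Full"
	IPsecExternal IPsecMode = "External"
	IPsecDisabled IPsecMode = "Disabled"
)

// shouldCleanupIPsecState reports whether /etc/ipsec.d/openshift.conf and
// the ip xfrm state/policy entries should be removed when the ipsec-host
// pod goes away: only when the mode is no longer Full, not on every pod
// deletion (the bug was cleaning up unconditionally).
func shouldCleanupIPsecState(mode IPsecMode) bool {
	return mode == IPsecExternal || mode == IPsecDisabled
}

func main() {
	fmt.Println(shouldCleanupIPsecState(IPsecFull))     // false: plain pod restart, keep SAs
	fmt.Println(shouldCleanupIPsecState(IPsecDisabled)) // true: mode changed, tear down state
}
```

Keeping the guard on the mode rather than the pod lifecycle is what preserves established IKE SAs across pod restarts.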

Manually cherry-picked master commits 864bdc5, ece9fbb, ea1d489, 4e57dcd, and ff0b147; no conflicts seen.

There is an incorrect check while cleaning up ipsec state when the ipsec pod is deleted,
which removes the state in all cases; this fix removes the state only when ipsec is
no longer in Full mode.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit 864bdc5)
This reverts commit e0bfa7e.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit ece9fbb)
Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit ea1d489)
   
The machine pool condition
`status.MachineCount == status.UpdatedMachineCount && hasSourceInMachineConfigStatus(status, machineConfigs)`
was introduced in PR openshift#2349 to ensure the IPsec machine config is always
installed on all nodes in the cluster, so the IPsec daemonset is deleted per the
CNO IPsec state machine when the condition is not met. But this also accidentally
disables IPsec in OVN, which is not the expected behavior. It causes
ovs-monitor-ipsec to refresh existing ipsec connections unnecessarily when the
IPsec pod comes up, as it cannot find remote_name from the tunnel. It may also
delete IPsec connection entries from the openshift.conf file if ovs-monitor-ipsec
is not killed in time when the ipsec daemonset is removed.
So this commit keeps the ovn ipsec option enabled as long as the API is set to Full mode.
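The decoupling this commit describes can be sketched as below. The struct and function names are hypothetical illustrations of the behavior, not the actual CNO code:

```go
package main

import "fmt"

// MachinePoolStatus holds the two counts used by the PR openshift#2349
// condition (field names are illustrative).
type MachinePoolStatus struct {
	MachineCount        int
	UpdatedMachineCount int
}

// daemonsetWanted mirrors the openshift#2349 condition: the IPsec daemonset
// is deployed only once every machine has the IPsec machine config applied.
func daemonsetWanted(s MachinePoolStatus, hasIPsecMCSource bool) bool {
	return s.MachineCount == s.UpdatedMachineCount && hasIPsecMCSource
}

// ovnIPsecEnabled is tied to the API mode alone after this fix, so a
// progressing or degraded pool no longer flips IPsec off in OVN.
func ovnIPsecEnabled(apiModeFull bool) bool {
	return apiModeFull
}

func main() {
	pool := MachinePoolStatus{MachineCount: 3, UpdatedMachineCount: 2} // pool still progressing
	fmt.Println(daemonsetWanted(pool, true)) // false: daemonset removed while the pool updates
	fmt.Println(ovnIPsecEnabled(true))       // true: OVN ipsec stays enabled regardless
}
```

The point is that the two decisions are now independent: the daemonset may come and go with machine config pool state, while OVN's ipsec setting follows only the API.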

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit 4e57dcd)
This removes stale 4.13 IPsec upgrade handling code which
is no longer valid for >= 4.15 upgrade scenarios.

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit 3c99d4f)
Commit 4e57dcd is incomplete because OVNIPsecStatus is still
not set when no IPsec daemonset exists on the cluster while
machine config pools are updating or a node is rebooting.
Fix this by always setting OVNIPsecStatus to reflect the
cluster's ipsec deployment state, and by updating the render
pipeline to render ovn ipsec for the scenarios above.

This renders ovn ipsec even before the ipsec daemonsets are
deployed when IPsec is freshly enabled on the cluster. That is
fine because it only takes effect once the ovs-monitor-ipsec
script starts, which happens only when the ipsec pod is running,
so it is safe to ignore.

When IPsec is disabled from the API, ovn ipsec is disabled first,
and then the ipsec machine config and ipsec daemonset stop being
rendered. While the ipsec machine configs are removed, the
ovnkube-node daemonset goes into a progressing state and the
OVNIPsecActive condition becomes true again. Hence this commit
also considers machine config status so that the rendering
pipeline does not render the IPsec machine configs again.
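The two rendering decisions described above can be sketched as follows. The function and parameter names are hypothetical illustrations, not the actual CNO render-pipeline code:

```go
package main

import "fmt"

// ovnIPsecRendered: OVNIPsecStatus now always reflects the desired ipsec
// deployment state, so ovn ipsec is rendered whenever the API requests Full
// mode, even while no IPsec daemonset exists (pools updating, node reboot,
// or fresh enablement before the daemonsets come up).
func ovnIPsecRendered(apiModeFull, daemonsetExists bool) bool {
	return apiModeFull
}

// ipsecMachineConfigsRendered: the pipeline consults the API mode, not just
// the transient OVNIPsecActive condition, so the machine configs are not
// re-rendered while IPsec is being disabled and ovnkube-node is progressing.
func ipsecMachineConfigsRendered(apiModeFull, ovnIPsecActive bool) bool {
	return apiModeFull
}

func main() {
	// Disablement in progress: OVNIPsecActive has flipped back to true,
	// but the API says IPsec is off, so nothing is re-rendered.
	fmt.Println(ipsecMachineConfigsRendered(false, true)) // false
	// Full mode with pools updating and the daemonset deleted: ovn ipsec
	// is still rendered.
	fmt.Println(ovnIPsecRendered(true, false)) // true
}
```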

Signed-off-by: Periyasamy Palanisamy <[email protected]>
(cherry picked from commit ff0b147)
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 12, 2025
@openshift-ci-robot
Contributor

@pperiyasamy: This pull request references Jira Issue OCPBUGS-50582, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-50582 to depend on a bug targeting a version in 4.19.0 and in one of the following states: MODIFIED, ON_QA, VERIFIED, but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from abhat and kyrtapz February 12, 2025 08:57
Contributor

openshift-ci bot commented Feb 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pperiyasamy
Once this PR has been reviewed and has the lgtm label, please assign dougbtv for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pperiyasamy
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 12, 2025
@openshift-ci-robot
Contributor

@pperiyasamy: This pull request references Jira Issue OCPBUGS-50582, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-50616 is in the state ON_QA, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-50616 targets the "4.19.0" version, which is one of the valid target versions: 4.19.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@pperiyasamy
Member Author

/assign @jcaamano @huiran0826

@openshift-ci-robot
Contributor

@pperiyasamy: This pull request references Jira Issue OCPBUGS-50582, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-50616 is in the state ON_QA, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-50616 targets the "4.19.0" version, which is one of the valid target versions: 4.19.0
  • bug has dependents

Requesting review from QA contact:
/cc @huiran0826

In response to this:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from huiran0826 February 12, 2025 09:54
Contributor

openshift-ci bot commented Feb 12, 2025

@pperiyasamy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 813104e link false /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-aws-hypershift-ovn-kubevirt 813104e link false /test e2e-aws-hypershift-ovn-kubevirt
ci/prow/e2e-network-mtu-migration-ovn-ipv4 813104e link false /test e2e-network-mtu-migration-ovn-ipv4
ci/prow/e2e-gcp-ovn-upgrade 813104e link true /test e2e-gcp-ovn-upgrade
ci/prow/okd-scos-e2e-aws-ovn 813104e link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-windows 813104e link true /test e2e-aws-ovn-windows
ci/prow/e2e-azure-ovn-upgrade 813104e link true /test e2e-azure-ovn-upgrade
ci/prow/security 813104e link false /test security
ci/prow/4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade 813104e link false /test 4.18-upgrade-from-stable-4.17-e2e-azure-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.