Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix syncEgressFirewall (truncate ACL names) and SetupMaster (stop recreating cluster-wide PGs) #3466

Merged
merged 3 commits into from
Mar 7, 2023

Conversation

tssurya
Copy link
Member

@tssurya tssurya commented Mar 6, 2023

This commit ensures we truncate names as a precaution
also in CreateOrUpdateACLsOps so that our bases are
covered since not all code snippets call BuildACL
directly

NOTE: The first commit adds a test to showcase that we actually weren't truncating. Test server is not smart enough to scream like real ovsdb server so it won't complain :/ See ovn-org/libovsdb#338.
Second commit does the fix and changes the test case

@tssurya tssurya requested a review from trozet March 6, 2023 22:39
This commit adds a test to showcase that
since syncEgressFirewall isn't calling libovsdbops.BuildACL
directly, we are not truncating ACL names.

Note that we really need ovn-org/libovsdb#338
for our test server to start screaming for long names.

Signed-off-by: Surya Seetharaman <[email protected]>
@tssurya tssurya force-pushed the fix-acl-naming-efw-truncation branch from e42cecb to f752ec4 Compare March 6, 2023 22:44
This commit ensures we truncate names as a precaution
also in CreateOrUpdateACLsOps so that our bases are
covered since not all code snippets call BuildACL
directly

Signed-off-by: Surya Seetharaman <[email protected]>
@tssurya tssurya force-pushed the fix-acl-naming-efw-truncation branch from f752ec4 to 7bfe50e Compare March 6, 2023 22:58
@trozet
Copy link
Contributor

trozet commented Mar 7, 2023

I think there is yet another bug, perhaps introduced during 4.13 when @npinaeva moved the EF rules to port groups. When I restart OVNK master I see nothing is happening in sync egress firewall:

[trozet@fedora contrib]$ oc logs -n ovn-kubernetes ovnkube-master-78c4f69485-8jhl7 -c ovnkube-master |grep -i TROZET
I0307 02:40:42.257490      41 egressfirewall.go:133] TROZET STARTING EF SYNC
I0307 02:40:42.258374      41 egressfirewall.go:230] TROZET END EF SYNC

In the function, it should print every single egress firewall found when it tries to find via predicate:

func (oc *DefaultNetworkController) syncEgressFirewall(egressFirewalls []interface{}) error {
	klog.Infof("TROZET STARTING EF SYNC")
	// Lookup all ACLs used for egress Firewalls
	aclPred := func(item *nbdb.ACL) bool {
		return item.Priority >= types.MinimumReservedEgressFirewallPriority && item.Priority <= types.EgressFirewallStartPriority
	}
	egressFirewallACLs, err := libovsdbops.FindACLsWithPredicate(oc.nbClient, aclPred)
	if err != nil {
		return fmt.Errorf("unable to list egress firewall ACLs, cannot cleanup old stale data, err: %v", err)
	}

	for _, acl := range egressFirewallACLs {
		klog.Infof("TROZET found ACL: %s", *acl.Name)
	}

However, it finds nothing, which means no ACLs actually exist after restart. Even though there was one:

_uuid               : fd0fa6c8-1d6e-4234-8a67-c35b1808b19e
acls                : [f91e50c7-ae05-48f3-87a6-da6a772f0718]
external_ids        : {name=clusterPortGroup}
name                : clusterPortGroup
ports               : [809c2092-c0fa-4043-9178-4859162869b2, d891f7eb-f0db-4b9f-aa59-3439abe1555e]

_uuid               : 550f4499-d42e-4e78-8aa1-c9c1124e8375
acls                : []
external_ids        : {name=clusterRtrPortGroup}
name                : clusterRtrPortGroup
ports               : []


_uuid               : f91e50c7-ae05-48f3-87a6-da6a772f0718
action              : drop
direction           : to-lport
external_ids        : {egressFirewall=aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
label               : 0
log                 : false
match               : "(ip4.dst == 0.0.0.0/0 && ip4.dst != 10.244.0.0/16) && ip4.src == $a6686003692170172026"
meter               : acl-logging
name                : egressFirewall_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
options             : {}
priority            : 9999
severity            : []

So I notice in the logs this ACL was deleted, because the port group was updated to have no ACLs....

I0307 02:40:42.231758      41 model_client.go:358] Update operations generated as: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:clusterPortGroup]} ports:{GoSet:
[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {fd0fa6c8-1d6e-4234-8a67-c35b1808b19e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0307 02:40:42.231781      41 transact.go:41] Configuring OVN: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:clusterPortGroup]} ports:{GoSet:[]}] Rows:[] Columns
:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {fd0fa6c8-1d6e-4234-8a67-c35b1808b19e}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUIDName:}]
I0307 02:40:42.232198      41 cache.go:1083] cache "msg"="deleting row" "database"="OVN_Northbound" "model"="&{UUID:f91e50c7-ae05-48f3-87a6-da6a772f0718 Action:drop Direction:to-lport ExternalIDs:map[egressFirewall:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa] Label:0 Log:false Match:(ip4.dst == 0.0.0.0/0 && ip4.dst != 10.244.0.0/16) && ip4.src == $a6686003692170172026 Meter:0xc0004af3b0 Name:0xc0004af3c0 Options:map[] Priority:9999 Severity:<nil>}" "table"="ACL" "uuid"="f91e50c7-ae05-48f3-87a6-da6a772f0718"

Now I see in the port group create code:

// SetupMaster creates the central router and load-balancers for the network
func (oc *DefaultNetworkController) SetupMaster(existingNodeNames []string) error {
	// Create default Control Plane Protection (COPP) entry for routers
	logicalRouter, err := oc.createOvnClusterRouter()
	if err != nil {
		return err
	}
	oc.defaultCOPPUUID = *(logicalRouter.Copp)

	// Create a cluster-wide port group that all logical switch ports are part of
	pg := libovsdbops.BuildPortGroup(types.ClusterPortGroupName, types.ClusterPortGroupName, nil, nil)
	err = libovsdbops.CreateOrUpdatePortGroups(oc.nbClient, pg)
	if err != nil {
		klog.Errorf("Failed to create cluster port group: %v", err)
		return err
	}

port group ops code includes updating ACL list:

// CreateOrUpdatePortGroupsOps creates or updates the provided port groups
// returning the corresponding ops
func CreateOrUpdatePortGroupsOps(nbClient libovsdbclient.Client, ops []libovsdb.Operation, pgs ...*nbdb.PortGroup) ([]libovsdb.Operation, error) {
	opModels := make([]operationModel, 0, len(pgs))
	for i := range pgs {
		pg := pgs[i]
		opModel := operationModel{
			Model:          pg,
			OnModelUpdates: getAllUpdatableFields(pg),
			ErrNotFound:    false,
			BulkOp:         false,
		}
		opModels = append(opModels, opModel)
	}

	m := newModelClient(nbClient)
	return m.CreateOrUpdateOps(ops, opModels...)
}

// getAllUpdatableFields returns a list of all of the columns/fields that can be updated for a model
func getAllUpdatableFields(model model.Model) []interface{} {
	switch t := model.(type) {
	case *nbdb.LogicalSwitchPort:
		return []interface{}{&t.Addresses, &t.Type, &t.TagRequest, &t.Options, &t.PortSecurity}
	case *nbdb.PortGroup:
		return []interface{}{&t.ACLs, &t.Ports, &t.ExternalIDs}
	default:
		panic(fmt.Sprintf("getAllUpdatableFields: unknown model %T", t))
	}
}

We need to fix this too. Do you mind including it in this PR please @tssurya ?

@npinaeva
Copy link
Member

npinaeva commented Mar 7, 2023

there is one more acl that needs unique name, and includes namespace name https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/multicast.go#L95-L97
we need to add extenalIDs for that case too, it can be in another pr as well, but we probably will need to backport it all together

@npinaeva
Copy link
Member

npinaeva commented Mar 7, 2023

@trozet your bug can be fixed separately as all acls will be recreated on restart, the only problem is that there will be no egress firewall acls for some time during sync

@tssurya
Copy link
Member Author

tssurya commented Mar 7, 2023

I think there is yet another bug, perhaps introduced during 4.13 when @npinaeva moved the EF rules to port groups. When I restart OVNK master I see nothing is happening in sync egress firewall:
We need to fix this too. Do you mind including it in this PR please @tssurya ?

yup fixed this, if an existing clusterRtrPG or clusterPG exists then no need to re-update/re-create with empty acls/ports there

[surya@hidden-temple egf]$ oc logs -n ovn-kubernetes ovnkube-master-74c44fbb5b-xwj92 ovnkube-master | grep TROZET
I0307 11:23:04.392360      41 egressfirewall.go:135] TROZET STARTING EF SYNC
I0307 11:23:04.392386      41 egressfirewall.go:144] TROZET found ACL: egressFirewall_allow-traffic-apache-server-on-lbdns-node-run1-1
I0307 11:23:04.392393      41 egressfirewall.go:144] TROZET found ACL: egressFirewall_allow-traffic-apache-server-on-lbdns-node-run1-1
I0307 11:23:04.394401      41 egressfirewall.go:224] TROZET END EF SYNC

fix seems to work, I'll add this as a new commit.

In SetupMaster, we always call CreateOrUpdatePortGroupsOps
with empty ACLs and PGs for the cluster-wide port group
and cluster-wide-router-PG. This is disruptive during
upgrades since momentarily all efw ACLs and multicast ACLs
will be wiped out.

This commit changes this to first check if the PG already exists,
if then no need to do anything.
Each of those features are responsible for ensuring ACLs, Ports
are good on those PGs they own.

NOTE: This bug was an issue for multicast and started being an
issue for egf from ovn-org@bd29f41
Before that we didn't have ACLs on cluster wide PG.

Signed-off-by: Surya Seetharaman <[email protected]>
@tssurya
Copy link
Member Author

tssurya commented Mar 7, 2023

there is one more acl that needs unique name, and includes namespace name https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/multicast.go#L95-L97 we need to add extenalIDs for that case too, it can be in another pr as well, but we probably will need to backport it all together

After talking with Nadia, we decided to add a namespaceKey to extID func to keep things unique and comply with isEquivalentACL match criteria. Will do a new commit for this one too.

@tssurya tssurya changed the title Fix syncEgressFirewall to truncate ACL names Fix syncEgressFirewall (truncate ACL names) and SetupMaster (stop recreating cluster-wide PGs) Mar 7, 2023
@tssurya
Copy link
Member Author

tssurya commented Mar 7, 2023

there is one more acl that needs unique name, and includes namespace name https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/multicast.go#L95-L97 we need to add extenalIDs for that case too, it can be in another pr as well, but we probably will need to backport it all together

After talking with Nadia, we decided to add a namespaceKey to extID func to keep things unique and comply with isEquivalentACL match criteria. Will do a new commit for this one too.

#3470 keeping multicast acl issue separate from EFW.

if pg == nil {
// we didn't find an existing clusterPG, let's create a new empty PG (fresh cluster install)
// Create a cluster-wide port group that all logical switch ports are part of
pg := libovsdbops.BuildPortGroup(types.ClusterPortGroupName, types.ClusterPortGroupName, nil, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, but cleaning up ports from this port group even when it exists may be intended, since there is no cleanup for ports AFAIK. So we may need some new sync function for nodes to delete possibly stale ports?

Copy link
Member Author

@tssurya tssurya Mar 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea that was also a concern I had but couldn't really find if this was intended or something that we just did. I hope the ACLs are synced elsewhere in each of these features.

I'd Hope when node add/dels happen we do remove these ports from these port groups. Also these mp0 ports and router ports are not supposed to change that frequently in real envs.
Regardless a cleanup on syncNodes might be a good part for syncing things when nodes come up and ensuring we are keeping a clean list? -> since risk is less I could do this in a FUP if it makes sense
@trozet -> WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we saw delete node will remove the LSP from the port group so the chances of leaking are small and shouldn't have meaningful impact

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trozet checked a node deletion, we do remove k8s* ports when a node goes away, that automatically removes it from the PG. So we are good here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants