@eshitachandwani
Member

This PR moves the LDS and RDS watchers to the dependency manager without changing the current functionality or behaviour. This is part of the implementation of gRFC A74.

RELEASE NOTES: None

@eshitachandwani eshitachandwani added this to the 1.77 Release milestone Oct 14, 2025
@eshitachandwani eshitachandwani added Type: Internal Cleanup Refactors, etc Area: xDS Includes everything xDS related, including LB policies used with xDS. labels Oct 14, 2025
@codecov

codecov bot commented Oct 15, 2025

Codecov Report

❌ Patch coverage is 69.82249% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.97%. Comparing base (ae62635) to head (c279455).
⚠️ Report is 6 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/xds/xdsdependencymanager/watch_service.go | 65.00% | 14 Missing and 7 partials ⚠️ |
| ...xds/xdsdependencymanager/xds_dependency_manager.go | 78.94% | 10 Missing and 6 partials ⚠️ |
| internal/xds/xdsclient/xdsresource/xdsconfig.go | 0.00% | 8 Missing ⚠️ |
| internal/grpctest/tlogger.go | 73.91% | 4 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8651      +/-   ##
==========================================
+ Coverage   81.21%   82.97%   +1.76%     
==========================================
  Files         416      421       +5     
  Lines       41002    32464    -8538     
==========================================
- Hits        33298    26936    -6362     
+ Misses       6226     4129    -2097     
+ Partials     1478     1399      -79     
| Files with missing lines | Coverage Δ |
| --- | --- |
| internal/xds/xdsdependencymanager/logging.go | 100.00% <100.00%> (ø) |
| internal/grpctest/tlogger.go | 74.82% <73.91%> (+3.90%) ⬆️ |
| internal/xds/xdsclient/xdsresource/xdsconfig.go | 0.00% <0.00%> (ø) |
| ...xds/xdsdependencymanager/xds_dependency_manager.go | 78.94% <78.94%> (ø) |
| internal/xds/xdsdependencymanager/watch_service.go | 65.00% <65.00%> (ø) |

... and 365 files with indirect coverage changes

@easwars
Contributor

easwars commented Oct 15, 2025

The tests are failing. Is this ready for review?

Comment on lines 22 to 23
// XDSConfig holds the complete and resolved xDS resource configuration
// including LDS, RDS, CDS and endpoints.
Contributor

Suggested change
- // XDSConfig holds the complete and resolved xDS resource configuration
- // including LDS, RDS, CDS and endpoints.
+ // XDSConfig holds the complete gRPC client-side xDS configuration
+ // containing all necessary resources.

// including LDS, RDS, CDS and endpoints.
type XDSConfig struct {
// Listener is the listener resource update
Listener ListenerUpdate
Contributor

We are moving the ResourceChanged methods on the resource watchers to accept a pointer to the update struct (instead of accepting the update by value). So, I think it would make sense for us to store them as pointers here as well.

See: #8652
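
A minimal sketch of what storing pointers could look like. The update types below are simplified stand-ins for the real `xdsresource` types, their fields are hypothetical, and `newConfig` is an illustrative helper that does not appear in the PR.

```go
package main

import "fmt"

// ListenerUpdate and RouteConfigUpdate are simplified stand-ins for the real
// xdsresource types. The single field on each is hypothetical.
type ListenerUpdate struct{ RouteConfigName string }
type RouteConfigUpdate struct{ VirtualHostCount int }

// XDSConfig stores pointers, so the struct handed to a watcher callback can be
// kept as-is instead of being copied by value.
type XDSConfig struct {
	Listener    *ListenerUpdate
	RouteConfig *RouteConfigUpdate
}

// newConfig (hypothetical helper) builds a config that shares the given
// updates rather than copying them.
func newConfig(l *ListenerUpdate, r *RouteConfigUpdate) *XDSConfig {
	return &XDSConfig{Listener: l, RouteConfig: r}
}

func main() {
	l := &ListenerUpdate{RouteConfigName: "route-a"}
	cfg := newConfig(l, &RouteConfigUpdate{VirtualHostCount: 1})
	fmt.Println(cfg.Listener == l) // true: the config points at the same struct
}
```

Because the config shares memory with whoever produced the update, this layout only works if consumers treat the pointed-to structs as read-only.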

// XDSConfig holds the complete and resolved xDS resource configuration
// including LDS, RDS, CDS and endpoints.
type XDSConfig struct {
// Listener is the listener resource update
Contributor

Nit: Let's try to consistently use the word configuration or config instead of update in these docstrings.

So, maybe something like:
// Listener holds the listener configuration.

Comment on lines 28 to 29
// RouteConfig is the route configuration resource update. It will be
// populated even if RouteConfig is inlined into the Listener resource.
Contributor

Suggested change
- // RouteConfig is the route configuration resource update. It will be
- // populated even if RouteConfig is inlined into the Listener resource.
+ // RouteConfig holds the route configuration. It will be
+ // populated even if the route configuration was inlined into the Listener resource.

Comment on lines 32 to 33
// VirtualHost is the virtual host from the route configuration matched with
// dataplane authority .
Contributor

Maybe?

Suggested change
- // VirtualHost is the virtual host from the route configuration matched with
- // dataplane authority .
+ // VirtualHost selected from the route configuration whose domain field
+ // offers the best match against the provided dataplane authority.

Comment on lines 36 to 42
// Clusters maps the cluster name with the ClusterResult which will have
// either the cluster configuration or error. It will have an error status
// if either
//
// (a) there was an error and we did not already have a valid resource or
//
// (b) the resource does not exist.
Contributor

How about making it much simpler and leaving more of the documentation to the individual structs?

// Clusters is a map from cluster name to its configuration.

Clusters map[string]*ClusterResult
}

// ClusterResult contains either a cluster's configuration or an error.
Contributor

Something like?

// ClusterResult contains a cluster's configuration when we receive a 
// valid resource from the management server. It contains an error when:
// - we receive an invalid resource from the management server and
//   we did not already have a valid resource or
// - the cluster resource does not exist on the management server
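
The two error cases in that docstring could be modelled roughly as below. This is a sketch under assumptions: only the `Err` field is visible in the diff, so the `Config` field name, the `ClusterConfig` stand-in, and both helper functions are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterConfig is a simplified stand-in. The Name field is hypothetical.
type ClusterConfig struct{ Name string }

// ClusterResult holds either a cluster's configuration or an error.
// The Config field name is an assumption; only Err appears in the diff.
type ClusterResult struct {
	Config *ClusterConfig
	Err    error
}

// errInvalid is an illustrative validation error.
var errInvalid = errors.New("invalid cluster resource")

// onInvalidResource models an invalid resource from the management server.
// A previously received valid configuration is kept; the error is stored
// only when there is nothing valid to fall back on.
func onInvalidResource(prev *ClusterResult, err error) *ClusterResult {
	if prev != nil && prev.Config != nil {
		return prev
	}
	return &ClusterResult{Err: err}
}

// onResourceMissing models a resource that does not exist on the management
// server: the result always carries an error.
func onResourceMissing(name string) *ClusterResult {
	return &ClusterResult{Err: fmt.Errorf("cluster %q does not exist", name)}
}

func main() {
	valid := &ClusterResult{Config: &ClusterConfig{Name: "c1"}}
	fmt.Println(onInvalidResource(valid, errInvalid).Err) // <nil>: old config kept
	fmt.Println(onInvalidResource(nil, errInvalid).Err)   // invalid cluster resource
	fmt.Println(onResourceMissing("c2").Err)
}
```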

Err error
}

// ClusterConfig contains cluster configuration for a single cluster.
Contributor

Nit: ClusterConfig contains configuration for a single cluster.

// ClusterConfig contains cluster configuration for a single cluster.
type ClusterConfig struct {
Cluster ClusterUpdate // Cluster configuration. Always present.
EndpointConfig EndpointConfig // Endpoint configuration for leaf clusters which will of type EDS or DNS.
Contributor

I think just "Endpoint configuration for leaf clusters" should suffice.

AggregateConfig AggregateConfig // List of children for aggregate clusters.
}

// AggregateConfig contains a list of leaf cluster names.
Contributor

This need not technically be all leaf clusters. Aggregate clusters can have children that are aggregate clusters as well.

Member Author

@eshitachandwani eshitachandwani Oct 22, 2025

Right!

LeafClusters []string
}

// EndpointConfig contains resolved endpoints for a leaf cluster either from DNS
Contributor

Technically, this contains more than just resolved endpoints, at least for the EDS case. So, maybe the comment can be more generic.

// EndpointConfig contains configuration corresponding to the endpoints in a cluster.

And we should also clarify that only one of the three fields can be populated at any given point in time.

Contributor

Or is it the case that ResolutionNote can have a non-nil error even when one of EDSUpdate or DNSEndpoints is set? If so, we need to clarify that.

Member Author

Yes, the ResolutionNote will also be set when we have an ambient error along with the old endpoints.
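
A rough sketch of the behaviour described in this reply, where an ambient error is recorded next to endpoints that were resolved earlier. The field names come from the thread; the field types, the `EndpointsUpdate` stand-in, and `onAmbientError` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// EndpointsUpdate is a simplified stand-in for the EDS update type.
type EndpointsUpdate struct{ Addresses []string }

// EndpointConfig uses the field names from the review thread. The field
// types are assumptions made for this sketch.
type EndpointConfig struct {
	EDSUpdate      *EndpointsUpdate // endpoints received over EDS
	DNSEndpoints   []string         // endpoints resolved through DNS
	ResolutionNote error            // may be set together with older endpoints
}

// errAmbient is an illustrative transient error.
var errAmbient = errors.New("management server temporarily unreachable")

// onAmbientError (hypothetical helper) keeps the endpoints already present in
// cfg and records err alongside them.
func onAmbientError(cfg EndpointConfig, err error) EndpointConfig {
	cfg.ResolutionNote = err
	return cfg
}

func main() {
	cfg := EndpointConfig{DNSEndpoints: []string{"10.0.0.1:443"}}
	cfg = onAmbientError(cfg, errAmbient)
	fmt.Println(len(cfg.DNSEndpoints), cfg.ResolutionNote != nil) // 1 true
}
```

Under this reading, `ResolutionNote` is not mutually exclusive with the other two fields, which is the point the docstring would need to spell out.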

// including LDS, RDS, CDS and EDS and sends update once we have all the
// resources and sends an error when we get error in listener or route
// resources.
func New(listenername, dataplaneAuthority string, xdsClient xdsclient.XDSClient, watcher ConfigWatcher) *DependencyManager {
Contributor

s/listenername/listenerName

// resources and sends an error when we get error in listener or route
// resources.
func New(listenername, dataplaneAuthority string, xdsClient xdsclient.XDSClient, watcher ConfigWatcher) *DependencyManager {
// Builds the dependency manager and starts the listener watch.
Contributor

Nit: nix this comment, as it is obvious that you are creating the struct here. Also, the listener watch is not started here.

Comment on lines 110 to 112
// ConfigWatcher is notified of the XDSConfig resource updates and errors that
// are received by the xDS client from the management server. It only receives a
// XDSConfig update after all the xds resources have been received.
Contributor

How about:

// ConfigWatcher is the interface for consumers of aggregated xDS configuration
// from the DependencyManager. The only consumer of this configuration is
// currently the xDS resolver.

// ConfigWatcher is notified of the XDSConfig resource updates and errors that
// are received by the xDS client from the management server. It only receives a
// XDSConfig update after all the xds resources have been received.
type ConfigWatcher interface {
Contributor

Please consider moving this to the top of the file so that the methods of the DependencyManager stay together and are not mixed in with another type's definition.


func (m *DependencyManager) maybeSendUpdate() {
if m.logger.V(2) {
m.logger.Infof("Sending update to watcher: Listener: %v, RouteConfig: %v", pretty.ToJSON(m.currentListenerUpdate), pretty.ToJSON(m.currentRouteConfig))
Contributor

We've had many performance problems with using pretty.ToJSON for printing structs. I would recommend using %+v or some other native formatting directive instead.

Another thing to consider is also whether the xDS resolver also outputs this log. If so, we don't want the same information being repeated twice.

Member Author

Yeah, I think adding this in the resolver looks better.

Contributor

@easwars easwars left a comment

I haven't looked at the tests yet. But I guess these comments will give you enough to make progress.

type ConfigWatcher interface {
// OnUpdate is invoked by the dependency manager to provide a new,
// validated xDS configuration to the watcher.
OnUpdate(xdsresource.XDSConfig)
Contributor

Given that we are changing the resource watcher APIs to accept pointers to resource update structs, it would make more sense to store pointers to them in the XDSConfig struct as well.

And continuing in that same vein, we could pass a pointer to the XDSConfig struct here. Also, it would make sense to document that the watcher must not modify the XDSConfig it receives and should treat it as read-only.

// OnError is invoked when an error is received in listener or route
// resource. This includes cases where:
// - The listener or route resource watcher reports a resource error.
// - The received listener resource is a socket listener, not an API listener - TODO : This is not yet implemented, tracked here #8114
Contributor

You could generalize this and specify that any resource validations performed at the DependencyManager that fail, also lead to OnError being invoked on the watcher.

Member Author

That is not necessarily true, since cluster errors are stored separately in a struct and endpoint errors are stored in the resolution note. Only errors in the Listener and route resources are sent using the OnError function.
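
The split described in this reply might look roughly like the sketch below. It assumes the pointer-based `OnUpdate` signature suggested earlier in the review; `recordingWatcher` and the two `report...` helpers are hypothetical and exist only to make the routing visible.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterResult and XDSConfig are trimmed stand-ins for the types under review.
type ClusterResult struct{ Err error }

type XDSConfig struct {
	Clusters map[string]*ClusterResult
}

// ConfigWatcher follows the interface in the diff, with OnUpdate taking a
// pointer as suggested in the review (an assumption, not the merged API).
type ConfigWatcher interface {
	OnUpdate(*XDSConfig)
	OnError(error)
}

// recordingWatcher is a hypothetical watcher that remembers what it was told.
type recordingWatcher struct {
	updates []*XDSConfig
	errs    []error
}

func (w *recordingWatcher) OnUpdate(c *XDSConfig) { w.updates = append(w.updates, c) }
func (w *recordingWatcher) OnError(err error)     { w.errs = append(w.errs, err) }

// errNACK is an illustrative resource error.
var errNACK = errors.New("resource rejected")

// reportListenerError: listener and route errors leave no usable
// configuration, so they go to OnError.
func reportListenerError(w ConfigWatcher, err error) {
	w.OnError(fmt.Errorf("listener resource error: %v", err))
}

// reportClusterError: a cluster error is recorded inside the configuration
// and travels through OnUpdate.
func reportClusterError(w ConfigWatcher, cfg *XDSConfig, name string, err error) {
	cfg.Clusters[name] = &ClusterResult{Err: err}
	w.OnUpdate(cfg)
}

func main() {
	w := &recordingWatcher{}
	cfg := &XDSConfig{Clusters: map[string]*ClusterResult{}}
	reportClusterError(w, cfg, "c1", errNACK)
	reportListenerError(w, errNACK)
	fmt.Println(len(w.updates), len(w.errs)) // 1 1
}
```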

OnError(error)
}

func (m *DependencyManager) maybeSendUpdate() {
Contributor

Why is this called maybeSendUpdate? Under what conditions will it not send an update? Can this be captured in its docstring?

Member Author

We want to check whether the whole cluster tree is resolved. If even one leaf endpoint is missing, we might not send the update, and the checks for that will go in this function. Also, C++ and Java both use the same name, so that got stuck in my head... Let me know if we should change it.
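
One way to read the "maybe" in the name, sketched under assumptions: the `manager` type, its boolean fields, and the `pendingClusters` counter are hypothetical simplifications, and the real function would deliver an XDSConfig to the watcher rather than bump a counter.

```go
package main

import "fmt"

// manager is a hypothetical, stripped-down dependency manager.
type manager struct {
	haveListener    bool
	haveRouteConfig bool
	// pendingClusters counts cluster or endpoint resources that are still
	// unresolved. The counter is an assumption made for this sketch.
	pendingClusters int
	sent            int // number of updates delivered so far
}

// maybeSendUpdate delivers an update only once the listener, the route
// configuration, and the whole cluster tree are resolved. It reports whether
// an update was sent.
func (m *manager) maybeSendUpdate() bool {
	if !m.haveListener || !m.haveRouteConfig || m.pendingClusters > 0 {
		return false
	}
	m.sent++
	return true
}

func main() {
	m := &manager{haveListener: true, haveRouteConfig: true, pendingClusters: 1}
	fmt.Println(m.maybeSendUpdate()) // false: one cluster is still pending
	m.pendingClusters = 0
	fmt.Println(m.maybeSendUpdate()) // true
}
```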

// Only executed in the context of a serializer callback.
func (m *DependencyManager) onListenerResourceUpdate(update *xdsresource.ListenerUpdate) {
if m.logger.V(2) {
m.logger.Infof("Received update for Listener resource %q: %v", m.ldsResourceName, pretty.ToJSON(update))
Contributor

For all these usages of pretty.ToJSON, please consider switching them to native formatting directives. Experiment with a few of them, like %v, %+v and %#v, see which one provides the best output, and use that.
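
For comparison, here is what the native verbs print for a pointer to a small struct. The `listenerUpdate` type and its fields are hypothetical stand-ins; the real update structs are larger and nested.

```go
package main

import "fmt"

// listenerUpdate is a hypothetical stand-in for a resource update struct.
type listenerUpdate struct {
	RouteConfigName string
	MaxStreamSecs   int
}

// describe formats u with %v and %+v so the two outputs can be compared.
func describe(u *listenerUpdate) (plain, withNames string) {
	return fmt.Sprintf("%v", u), fmt.Sprintf("%+v", u)
}

func main() {
	u := &listenerUpdate{RouteConfigName: "route-a", MaxStreamSecs: 30}
	plain, withNames := describe(u)
	fmt.Println(plain)     // &{route-a 30}
	fmt.Println(withNames) // &{RouteConfigName:route-a MaxStreamSecs:30}
	fmt.Printf("%#v\n", u) // Go-syntax form, including the qualified type name
}
```

Note that none of these verbs follow nested pointer fields; those print as addresses, which may matter once the update structs hold pointers.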

}

func (m *DependencyManager) applyRouteConfigUpdate(update xdsresource.RouteConfigUpdate) {
matchVh := xdsresource.FindBestMatchingVirtualHost(m.dataplaneAuthority, update.VirtualHosts)
Contributor

s/matchVh/matchVH to comply with Go initialisms.

// Only executed in the context of a serializer callback.
func (m *DependencyManager) onListenerResourceError(err error) {
if m.logger.V(2) {
m.logger.Infof("Received resource error for Listener resource %q: %v", m.ldsResourceName, err)
Contributor

I believe we have some code in the xDS client to ensure that the returned errors contain the xDS node ID. Could you please ensure that that property still holds. Thanks.

Comment on lines +160 to +164
m.rdsResourceName = ""
if m.routeConfigWatcher != nil {
m.routeConfigWatcher.stop()
m.routeConfigWatcher = nil
}
Contributor

Don't we have to set the matching virtual host to nil here?

It would be nice if we have a method to do all the cleanup when a listener resource error or a listener resource update invalidates the previously received route config. I see similar code in onListenerResourceError, but that one sets the matching virtual host to nil as well.

Member Author

We don't set the virtual host to nil here because we are going to update it in the function below using the inline route resource that we get. We set it to nil in `onListenerResourceError` because there we want to invalidate the route resource. Here we are just updating it and cancelling only the watcher, since we get the resource inline.
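
The distinction drawn in this reply can be sketched as follows. All types here are hypothetical simplifications: the virtual host is reduced to a string, and `InlineVirtualHost` stands in for an inlined route configuration. What happens to the previous virtual host while a newly named route config is still pending is not covered by the thread, so the sketch leaves it untouched.

```go
package main

import "fmt"

// routeWatcher is a hypothetical stand-in for the RDS watcher.
type routeWatcher struct {
	name    string
	stopped bool
}

func (w *routeWatcher) stop() { w.stopped = true }

// listenerUpdate is simplified: InlineVirtualHost stands in for an inlined
// route configuration and is empty when the listener names an RDS resource.
type listenerUpdate struct {
	RouteConfigName   string
	InlineVirtualHost string
}

type manager struct {
	rdsResourceName    string
	routeConfigWatcher *routeWatcher
	currentVirtualHost string
}

// stopRouteWatch cancels any RDS watch without touching the virtual host.
func (m *manager) stopRouteWatch() {
	m.rdsResourceName = ""
	if m.routeConfigWatcher != nil {
		m.routeConfigWatcher.stop()
		m.routeConfigWatcher = nil
	}
}

// onListenerUpdate: with an inline route config the watch is cancelled and
// the virtual host is replaced, not cleared.
func (m *manager) onListenerUpdate(u listenerUpdate) {
	if u.InlineVirtualHost != "" {
		m.stopRouteWatch()
		m.currentVirtualHost = u.InlineVirtualHost
		return
	}
	if u.RouteConfigName == m.rdsResourceName {
		return // already watching this resource
	}
	m.stopRouteWatch()
	m.rdsResourceName = u.RouteConfigName
	m.routeConfigWatcher = &routeWatcher{name: u.RouteConfigName}
}

// onListenerError: the route data is invalidated, so the virtual host is
// cleared as well.
func (m *manager) onListenerError() {
	m.stopRouteWatch()
	m.currentVirtualHost = ""
}

func main() {
	m := &manager{}
	m.onListenerUpdate(listenerUpdate{RouteConfigName: "route-a"})
	m.onListenerUpdate(listenerUpdate{InlineVirtualHost: "vh-inline"})
	fmt.Println(m.routeConfigWatcher == nil, m.currentVirtualHost) // true vh-inline
	m.onListenerError()
	fmt.Println(m.currentVirtualHost == "") // true
}
```

Sharing `stopRouteWatch` between both paths is one way to get the single cleanup method the reviewer asked about while still letting each caller decide what to do with the virtual host.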

m.rdsResourceName = ""
m.currentVirtualHost = nil
m.routeConfigWatcher = nil
m.watcher.OnError(status.Errorf(codes.Unavailable, "Listener resource error : %v", err))
Contributor

Again, if the watcher is going to be given status errors, we need to document that clearly along with what status codes are returned when. And why?

Member Author

Removed. We do not need status code. I looked at C++ code and got confused between absl status codes and gRPC status codes.

Comment on lines 218 to 221
if m.rdsResourceName != resourceName {
// Drop updates from canceled watchers.
return
}
Contributor

I'm guessing we are going to need code like this for cluster and endpoint watchers as well. Can we make this part of the watcher instead?

Member Author

Sure! I have changed the listener and route watchers. Let me know if it looks good.
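
One possible shape for moving the check into the watcher, as a sketch only: the `stopped` flag and the `deliver` callback are hypothetical, and the sketch assumes `stop` and `onUpdate` both run on the same callback serializer, so no locking is shown.

```go
package main

import "fmt"

// routeConfigWatcher is a hypothetical watcher that filters its own stale
// callbacks, so the parent does not need to compare resource names.
type routeConfigWatcher struct {
	resourceName string
	stopped      bool
	// deliver hands an update to the parent. The signature is an assumption.
	deliver func(resourceName, update string)
}

// stop marks the watcher as cancelled. Later callbacks are dropped.
func (w *routeConfigWatcher) stop() { w.stopped = true }

// onUpdate forwards the update unless the watcher has been cancelled.
func (w *routeConfigWatcher) onUpdate(update string) {
	if w.stopped {
		return // drop updates from a cancelled watch
	}
	w.deliver(w.resourceName, update)
}

func main() {
	var got []string
	w := &routeConfigWatcher{
		resourceName: "route-a",
		deliver: func(name, update string) {
			got = append(got, name+":"+update)
		},
	}
	w.onUpdate("v1")
	w.stop()
	w.onUpdate("v2") // ignored
	fmt.Println(got) // [route-a:v1]
}
```

The same pattern would carry over to cluster and endpoint watchers without each handler in the manager repeating the name comparison.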

Comment on lines 231 to 234
//If update is not for the current watcher
if m.rdsResourceName != resourceName {
return
}
Contributor

Same here.

// RouteConfig is the route configuration resource update. It will be
// populated even if RouteConfig is inlined into the Listener resource.
RouteConfig RouteConfigUpdate
Member Author

Will change the other resources to pointers after #8652 is merged.


func (l *listenerWatcher) stop() {
l.cancel()
l.parent.logger.Infof("Canceling watch on Listener resource %q", l.resourceName)
Member Author

Changed, but I wanted to know where the log should be printed unconditionally and where we should put a V(2) check. I thought shutdown and cancel should be logged by default because that is useful information.

Comment on lines +47 to +48
serializer *grpcsync.CallbackSerializer
serializerCancel context.CancelFunc
Member Author

From what I understand, the ResourceChanged and error methods call the parent's serializer (like here and here), and the parent is going to be the dependency manager now.

m.rdsResourceName = ""
m.currentVirtualHost = nil
m.routeConfigWatcher = nil
m.watcher.OnError(fmt.Errorf("listener resource error : %v", err))
Member Author

Should we annotate the errors passed to other components too, or just make sure they are annotated with the node ID when they are actually printed?

Labels

Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Internal Cleanup Refactors, etc

3 participants