Skip to content

xds: add "resource_timer_is_transient_failure" server feature #12063

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jul 15, 2025

Conversation

shivaspeaks
Copy link
Member

@shivaspeaks shivaspeaks commented May 14, 2025

As per gRFC A88 (grpc/proposal#466).

@shivaspeaks shivaspeaks changed the title add resource_timer_is_transient_failure server feature add "resource_timer_is_transient_failure" server feature May 14, 2025
for (ResourceWatcher<T> watcher : watchers.keySet()) {
if (processingTracker != null) {
processingTracker.startTask();
}
watchers.get(watcher).execute(() -> {
try {
/*This will go after xdsClient watcher APIs are in.
watcher.onResourceChanged(StatusOr.fromStatus(Status.UNAVAILABLE.withDescription(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just call onError(). That's equivalent for the current API as there is no data yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a lot of new code has been merged since then. There are a lot of new failures in files like CdsLoadBalancer2Test, GrpcXdsClientImplTestBase, GrpcXdsClientImplV3Test, etc.
Same files and lines were also getting changed in xds client watcher API changes. Perhaps this should wait until xds client watcher api changes goes in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should there be any new failures? Sounds like you aren't setting resourceTimerIsTransientError correctly to preserve the old behavior for the majority of the tests.

@shivaspeaks shivaspeaks marked this pull request as ready for review June 25, 2025 08:40
for (ResourceWatcher<T> watcher : watchers.keySet()) {
if (processingTracker != null) {
processingTracker.startTask();
}
watchers.get(watcher).execute(() -> {
try {
watcher.onResourceDoesNotExist(resource);
if (resourceTimerIsTransientError) {
watcher.onError(Status.UNAVAILABLE.withDescription(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test for watcher.onError and the timeoutSec value when resource timer is transient error.

@shivaspeaks shivaspeaks changed the title add "resource_timer_is_transient_failure" server feature xds: add "resource_timer_is_transient_failure" server feature Jul 15, 2025
if (serverInfo.target().equals(SERVER_URI)) {
return new GrpcXdsTransport(channel);
}
if (serverInfo.target().equals(SERVER_URI_CUSTOME_AUTHORITY)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: CUSTOM_AUTHORITY

fakeClock.getPendingTasks(CDS_RESOURCE_FETCH_TIMEOUT_TASK_FILTER));
assertThat(task.getDelay(TimeUnit.SECONDS))
.isEqualTo(XdsClientImpl.EXTENDED_RESOURCE_FETCH_TIMEOUT_SEC);
fakeClock.runDueTasks();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could delete this statement.

Copy link
Member Author

@shivaspeaks shivaspeaks Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was doing that exactly earlier but after every test method there is an @After function that runs:

@After
  public void tearDown() {
    XdsClusterResource.enableLeastRequest = originalEnableLeastRequest;
    xdsClient.shutdown();
    channel.shutdown();  // channel not owned by XdsClient
    assertThat(adsEnded.get()).isTrue();
    assertThat(lrsEnded.get()).isTrue();
    assertThat(fakeClock.getPendingTasks()).isEmpty();
  }

This fails while asserting for empty pending tasks. We need to run all the due tasks at the end.

@shivaspeaks shivaspeaks merged commit 5a8326f into grpc:master Jul 15, 2025
20 of 21 checks passed
@shivaspeaks shivaspeaks deleted the server_feature_resource_timer branch July 15, 2025 10:03
@@ -256,7 +260,7 @@ public UpdateFailureState getErrorState() {
* config_dump.proto</a>
*/
public enum ResourceMetadataStatus {
UNKNOWN, REQUESTED, DOES_NOT_EXIST, ACKED, NACKED
UNKNOWN, REQUESTED, DOES_NOT_EXIST, ACKED, NACKED, TIMEOUT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used by CSDS, and looks like it crashes if you don't update it:

static ClientResourceStatus metadataStatusToClientStatus(ResourceMetadataStatus status) {
switch (status) {
case UNKNOWN:
return ClientResourceStatus.UNKNOWN;
case DOES_NOT_EXIST:
return ClientResourceStatus.DOES_NOT_EXIST;
case REQUESTED:
return ClientResourceStatus.REQUESTED;
case ACKED:
return ClientResourceStatus.ACKED;
case NACKED:
return ClientResourceStatus.NACKED;
default:
throw new AssertionError("Unexpected ResourceMetadataStatus: " + status);
}
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in the case of TIMEOUT, CSDS service should return either ClientResourceStatus.REQUESTED or ClientResourceStatus.DOES_NOT_EXIST. What do you think? DOES_NOT_EXIST fits better.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is clearly not something for us to arbitrarily decide during implementation. That's defined by the gRFC. They map 1:1 with the CSDS enum values.

Those states map directly to the states in the CSDS ClientResourceStatus enum.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I see now: envoyproxy/envoy@2aaa544. We need to do envoy proto sync to grpc java to have these.

@@ -738,6 +740,9 @@ void restartTimer() {
// When client becomes ready, it triggers a restartTimer for all relevant subscribers.
return;
}
ServerInfo serverInfo = activeCpc.getServerInfo();
int timeoutSec = xdsDataErrorHandlingEnabled && serverInfo.resourceTimerIsTransientError()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking xdsDataErrorHandlingEnabled doesn't seem to do anything, as the flag protecting is happening in BootstrapperImpl.

}

@Test
@SuppressWarnings("unchecked")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of putting this on the entire method, can you put it on just the ResourceWatcher<CdsUpdate> timeoutWatcher definition? Ditto for resourceTimerIsTransientError_schedulesExtendedTimeout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants