-
Notifications
You must be signed in to change notification settings - Fork 3.9k
xds: add "resource_timer_is_transient_failure" server feature #12063
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
xds: add "resource_timer_is_transient_failure" server feature #12063
Conversation
for (ResourceWatcher<T> watcher : watchers.keySet()) { | ||
if (processingTracker != null) { | ||
processingTracker.startTask(); | ||
} | ||
watchers.get(watcher).execute(() -> { | ||
try { | ||
/*This will go after xdsClient watcher APIs are in. | ||
watcher.onResourceChanged(StatusOr.fromStatus(Status.UNAVAILABLE.withDescription( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just call onError()
. That's equivalent for the current API as there is no data yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a lot of new code has been merged since then. There are a lot of new failures in files like CdsLoadBalancer2Test, GrpcXdsClientImplTestBase, GrpcXdsClientImplV3Test, etc.
Same files and lines were also getting changed in xds client watcher API changes. Perhaps this should wait until xds client watcher api changes goes in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should there be any new failures? Sounds like you aren't setting resourceTimerIsTransientError
correctly to preserve the old behavior for the majority of the tests.
…r_feature_resource_timer
…r_feature_resource_timer
…ivaspeaks/grpc-java into server_feature_resource_timer
for (ResourceWatcher<T> watcher : watchers.keySet()) { | ||
if (processingTracker != null) { | ||
processingTracker.startTask(); | ||
} | ||
watchers.get(watcher).execute(() -> { | ||
try { | ||
watcher.onResourceDoesNotExist(resource); | ||
if (resourceTimerIsTransientError) { | ||
watcher.onError(Status.UNAVAILABLE.withDescription( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test for watcher.onError and the timeoutSec value when resource timer is transient error.
if (serverInfo.target().equals(SERVER_URI)) { | ||
return new GrpcXdsTransport(channel); | ||
} | ||
if (serverInfo.target().equals(SERVER_URI_CUSTOME_AUTHORITY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: CUSTOM_AUTHORITY
fakeClock.getPendingTasks(CDS_RESOURCE_FETCH_TIMEOUT_TASK_FILTER)); | ||
assertThat(task.getDelay(TimeUnit.SECONDS)) | ||
.isEqualTo(XdsClientImpl.EXTENDED_RESOURCE_FETCH_TIMEOUT_SEC); | ||
fakeClock.runDueTasks(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could delete this statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was doing that exactly earlier but after every test method there is an @After
function that runs:
@After
public void tearDown() {
XdsClusterResource.enableLeastRequest = originalEnableLeastRequest;
xdsClient.shutdown();
channel.shutdown(); // channel not owned by XdsClient
assertThat(adsEnded.get()).isTrue();
assertThat(lrsEnded.get()).isTrue();
assertThat(fakeClock.getPendingTasks()).isEmpty();
}
This fails while asserting for empty pending tasks. We need to run all the due tasks at the end.
@@ -256,7 +260,7 @@ public UpdateFailureState getErrorState() { | |||
* config_dump.proto</a> | |||
*/ | |||
public enum ResourceMetadataStatus { | |||
UNKNOWN, REQUESTED, DOES_NOT_EXIST, ACKED, NACKED | |||
UNKNOWN, REQUESTED, DOES_NOT_EXIST, ACKED, NACKED, TIMEOUT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used by CSDS, and looks like it crashes if you don't update it:
grpc-java/xds/src/main/java/io/grpc/xds/CsdsService.java
Lines 240 to 255 in 5a8326f
static ClientResourceStatus metadataStatusToClientStatus(ResourceMetadataStatus status) { | |
switch (status) { | |
case UNKNOWN: | |
return ClientResourceStatus.UNKNOWN; | |
case DOES_NOT_EXIST: | |
return ClientResourceStatus.DOES_NOT_EXIST; | |
case REQUESTED: | |
return ClientResourceStatus.REQUESTED; | |
case ACKED: | |
return ClientResourceStatus.ACKED; | |
case NACKED: | |
return ClientResourceStatus.NACKED; | |
default: | |
throw new AssertionError("Unexpected ResourceMetadataStatus: " + status); | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in the case of TIMEOUT, CSDS service should return either ClientResourceStatus.REQUESTED
or ClientResourceStatus.DOES_NOT_EXIST
. What do you think? DOES_NOT_EXIST
fits better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is clearly not something for us to arbitrarily decide during implementation. That's defined by the gRFC. They map 1:1 with the CSDS enum values.
Those states map directly to the states in the CSDS ClientResourceStatus enum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I see now: envoyproxy/envoy@2aaa544. We need to do envoy proto sync to grpc java to have these.
@@ -738,6 +740,9 @@ void restartTimer() { | |||
// When client becomes ready, it triggers a restartTimer for all relevant subscribers. | |||
return; | |||
} | |||
ServerInfo serverInfo = activeCpc.getServerInfo(); | |||
int timeoutSec = xdsDataErrorHandlingEnabled && serverInfo.resourceTimerIsTransientError() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking xdsDataErrorHandlingEnabled
doesn't seem to do anything, as the flag protecting is happening in BootstrapperImpl.
} | ||
|
||
@Test | ||
@SuppressWarnings("unchecked") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of putting this on the entire method, can you put it on just the ResourceWatcher<CdsUpdate> timeoutWatcher
definition? Ditto for resourceTimerIsTransientError_schedulesExtendedTimeout.
As per gRFC A88 (grpc/proposal#466).