Reduce node fetch dependency for RPCs #5954

sorindumitru · 2025-03-18T07:57:31Z

Currently, for every RPC the agent calls, the server verifies that the caller is actually an agent by fetching the attested node information (a "node fetch" operation). This lets us verify that the agent is in possession of either the current or the next (immediately after re-attestation or renewal) SVID, and that the agent wasn't banned or evicted. Because we need up-to-date information about the agent, we fetch it over the rw connection, that is, from the primary node. This makes every RPC called by the agent dependent on the primary node of the database being available.

For most operations that the agent performs, this is probably not a big issue. It can delay synchronization of authorized entries and renewal of X509-SVIDs, but as long as the datastore recovers, the agent will also recover automatically.

For JWT-SVID issuing, this can be a problem. SVIDs are not preloaded, so the first time one is requested it has to be fetched from the server. If the primary node changes, for example due to database maintenance, this can lead to visible spikes in failures to fetch JWT-SVIDs. It would be good if we could ride out these short periods of unavailability without any visible failures.

I propose we change the current algorithm, which fetches the latest node information from the rw connection, to the following:

1. Fetch node information from some potentially stale data and verify the agent against that. If the caller is attested as an agent based on that information, let the request through.
2. Otherwise, fetch node information from the rw connection and verify it against that.

This has the downside that the agent is potentially trusted for a bit longer than it should be; for example, eviction operations no longer take effect immediately. As long as the period of time for which we trust the stale information is bounded, this may be an acceptable compromise.

For the source of the stale node information we could use either:

- An in-memory cache of the node attestation data. This would allow us to better control how long we trust the stale information. For example, we could consider the information in the cache usable for 30 seconds, but require it to be refreshed after 10-15 seconds.
- The read-only datastore connection. This allows us to read this information from any of the read replicas that are still available, in the hope that they can still serve the data. We don't have much control over how stale the data is, but we will usually have up-to-date information since replication is quick.

Some operations, such as RenewAgent, still need up-to-date attested node information, so they will have to fetch that information themselves. They already do that today.

There were some other people complaining about this in #4484 (comment) as well as some discussion of it a long time ago in the contributors sync.

amartinezfayo added the triage/in-progress Issue triage is in progress label Mar 18, 2025
