
Create a custom thrift protocol to ensure client/server compatibility #5398


Draft · wants to merge 47 commits into main
Conversation

DomGarguilo (Member)

Addresses #5390

@DomGarguilo DomGarguilo added this to the 4.0.0 milestone Mar 11, 2025
@DomGarguilo DomGarguilo requested a review from ctubbsii March 11, 2025 20:06
@DomGarguilo DomGarguilo self-assigned this Mar 11, 2025
@DomGarguilo (Member Author)

Created this in draft since, for now, it only addresses the task outlined in the ticket:

> create a custom protocol with the versioning information stored in a header and checks on the server side to validate it

I'm not sure if any of the other proposed tasks should be added here or as follow-on PRs.

```java
public static class AccumuloProtocol extends TCompactProtocol {

  private static final int MAGIC_NUMBER = 0x41434355; // "ACCU" in ASCII
  private static final byte PROTOCOL_VERSION = 1;
  // ...
}
```
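In isolation, a header built from those two constants might be framed and checked along these lines. This is a standalone sketch of the byte layout only; the real AccumuloProtocol wraps TCompactProtocol, and the class and method names here are illustrative, not the PR's actual code:

```java
import java.nio.ByteBuffer;

// Standalone sketch of the header byte layout: a 4-byte magic number
// followed by a 1-byte protocol version. Class and method names are
// illustrative, not the actual Accumulo implementation.
class ProtocolHeader {
  static final int MAGIC_NUMBER = 0x41434355; // "ACCU" in ASCII
  static final byte PROTOCOL_VERSION = 1;

  // Serialize the header as the client side would write it.
  static byte[] encode() {
    return ByteBuffer.allocate(5).putInt(MAGIC_NUMBER).put(PROTOCOL_VERSION).array();
  }

  // Server-side check: reject connections that did not come from a
  // compatible Accumulo client before any Thrift message is processed.
  static void validate(byte[] header) {
    ByteBuffer buf = ByteBuffer.wrap(header);
    if (buf.getInt() != MAGIC_NUMBER) {
      throw new IllegalStateException("Not an Accumulo RPC connection");
    }
    byte version = buf.get();
    if (version != PROTOCOL_VERSION) {
      throw new IllegalStateException("Unsupported protocol version: " + version);
    }
  }
}
```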
Contributor

Should probably add a comment to AccumuloDataVersion.CURRENT_VERSION that references this and mentions that when changing CURRENT_VERSION, this may also need to change.

Member

I was wondering about that... whether the protocol version should be independent, so we can version the serialization format of the protocol header.

We will probably want to include the AccumuloDataVersion.CURRENT_VERSION in the protocol header directly, though. We could also include the actual Accumulo release version in here, too (major.minor.patch; e.g. 4.0.0). I think to verify, we'd probably want to ensure the same major and same minor version, if we want to be even more restrictive than the data version. We really shouldn't have Accumulo servers or clients/servers talking across major.minor versions (different patch releases are okay, and is important for rolling upgrades).
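As a sketch, the major.minor rule described here (same major and same minor required, with the patch component free to differ) could look like the following; the class name, method name, and use of plain version strings are assumptions for illustration, not the PR's actual code:

```java
// Sketch of the major.minor compatibility rule: two releases may talk RPC
// only when their major and minor components match; the patch component may
// differ, which is what allows rolling upgrades across patch releases.
class VersionCheck {
  static boolean compatible(String a, String b) {
    String[] pa = a.split("\\.");
    String[] pb = b.split("\\.");
    return pa[0].equals(pb[0]) && pa[1].equals(pb[1]);
  }
}
```

Under this rule, 4.0.0 and 4.0.1 are compatible, while 4.0.0 and 4.1.0 are not.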

Member Author

I left it separate for now just in case we wanted to have multiple protocol versions able to talk to each other but am not sure about that.

The data version and protocol version definitely seem related but I'm not sure how closely tied we want to have those versions.

Member

I think the protocol version can be independently numbered, not connected to the data version. I think it's sufficient to check the Accumulo version (Constants.VERSION) major.minor. We don't need to use the data version in here at all. The protocol version is the version of the RPC protocol... the data version is the version of the data persisted in ZK and HDFS... and the Accumulo version is the overall version of the software. Only the RPC protocol version and the Accumulo version are relevant here.

@ctubbsii (Member) left a comment

I'm not sure if we actually need to distinguish between client and server for the spans.

@DomGarguilo (Member Author)

Marking this as ready for review. I tried to address the following two suggestions from the ticket:

  1. Primarily, create a custom protocol with the versioning information stored in a header and checks on the server side to validate it, then
  2. Secondarily, simplify the thrift APIs by adding tracing info to the protocol header

There are a lot of changed files in this PR since I had to update the thrift code. There are also many places where the only change is the removal of the TInfo object as the first parameter of a method call, which was a trivial change. Hopefully that helps reviewers skip or skim those portions of the diff.

@DomGarguilo DomGarguilo marked this pull request as ready for review March 17, 2025 20:57
@DomGarguilo (Member Author)

@kevinrr888 your suggestions should all be addressed as of 29c1234

In that commit I:

  • refactored the protocol version compatibility check method to throw an exception directly instead of returning a boolean, and updated the wording of that exception message to make it clearer
  • addressed the IDE warnings regarding nullness in various places
  • added javadoc to the static client and server factory methods in ThriftUtil to make it clearer which should be used
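In rough outline, the throw-instead-of-return refactor described in the first bullet might look like this; the exception type and message wording are illustrative, not the actual code:

```java
// Sketch of a compatibility check that throws with a descriptive message
// instead of returning a boolean, so callers cannot silently ignore a
// mismatch. Exception type and wording are assumptions.
class ProtocolVersionCheck {
  static final byte PROTOCOL_VERSION = 1;

  static void validateVersion(byte remoteVersion) {
    if (remoteVersion != PROTOCOL_VERSION) {
      throw new IllegalArgumentException("RPC protocol version mismatch: this side speaks version "
          + PROTOCOL_VERSION + " but the remote sent version " + remoteVersion
          + "; ensure client and server are running compatible Accumulo releases");
    }
  }
}
```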

@ctubbsii (Member)

> One additional thought I had was that clients should send the instanceId, and the server should verify it, so that we don't get connections from clients for a different instance.

> I was trying to work this in but am not sure of the best place to pull in the instanceId from. I tried to pass it in as a parameter but had to reach pretty high up the chain, which changed a lot of code. Was hoping you had a better idea for where I can add this in from.

The ServerContext and/or ClientContext should be able to provide the InstanceId, and the protocol factory could be constructed with it. I'm not sure how much code that touches. I can try to take a look if you hit a wall here.

@DomGarguilo (Member Author)

> One additional thought I had was that clients should send the instanceId, and the server should verify it, so that we don't get connections from clients for a different instance.
>
> I was trying to work this in but am not sure of the best place to pull in the instanceId from. I tried to pass it in as a parameter but had to reach pretty high up the chain, which changed a lot of code. Was hoping you had a better idea for where I can add this in from.
>
> The ServerContext and/or ClientContext should be able to provide the InstanceId, and the protocol factory could be constructed with it. I'm not sure how much code that touches. I can try to take a look if you hit a wall here.

Okay I added the instance ID validation in 769d38f. Had to reach pretty far up the chain to get the instanceId so there were quite a few files changed but all of those are just parameter changes.
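The server-side check itself could be as simple as the following sketch; the class and field names are assumptions, and in the PR the expected value would presumably come from ServerContext via the protocol factory:

```java
import java.util.Objects;

// Illustrative sketch of an instance ID check: the client writes its
// instanceId into the RPC header and the server compares it against its
// own, rejecting connections intended for a different instance.
// Names here are assumptions, not the actual Accumulo code.
class InstanceIdCheck {
  private final String expectedInstanceId;

  InstanceIdCheck(String expectedInstanceId) {
    this.expectedInstanceId = Objects.requireNonNull(expectedInstanceId);
  }

  void validate(String clientInstanceId) {
    if (!expectedInstanceId.equals(clientInstanceId)) {
      throw new IllegalStateException("Client instance id " + clientInstanceId
          + " does not match server instance id " + expectedInstanceId);
    }
  }
}
```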

@DomGarguilo DomGarguilo requested a review from ctubbsii April 16, 2025 19:49
@ctubbsii (Member)

Can you provide an example of the opentelemetry tracing output (screenshot or log sequences or something)? I'd be curious what they look like when a client performs a traced operation over the new RPC protocol.

@DomGarguilo (Member Author)

> Can you provide an example of the opentelemetry tracing output (screenshot or log sequences or something)? I'd be curious what they look like when a client performs a traced operation over the new RPC protocol.

I'm not too sure how much this shows but here is a screenshot from the "graph view" in Jaeger capturing traces from a scan from the shell
[screenshot: Jaeger graph view]

@keith-turner (Contributor)

@DomGarguilo and @ctubbsii curious what the status of this PR is. Would be nice to get this merged in if possible.

@DomGarguilo (Member Author)

> @DomGarguilo and @ctubbsii curious what the status of this PR is. Would be nice to get this merged in if possible.

Ready to be merged unless anyone else has blockers here.

@ctubbsii (Member)

ctubbsii commented Jun 7, 2025

I think this is probably ready to be merged, but I'd like to go over the visual with @DomGarguilo to make sure I understand what I'm looking at and everything is working as it should. Sorry for holding this up. I think it can be merged next week. @DomGarguilo could you give me a call next week to go over it first, if you have time?

@DomGarguilo (Member Author)

I am actually not sure that traces are being propagated correctly. I'm looking into things but I don't think this should be merged yet.

@DomGarguilo (Member Author)

DomGarguilo commented Jun 13, 2025

> I am actually not sure that traces are being propagated correctly. I'm looking into things but I don't think this should be merged yet.

After debugging, the issue is fixed via a683811 and I can see spans from client -> server in Jaeger. The problem was that we were not closing the scope or span in AccumuloProtocol.readMessageEnd(), so the traces were never closed on the server side, and thus those server spans never finished (and couldn't be linked back to the client spans in Jaeger).

In the screenshot below, in a shell I ran `trace on`, then `scan` (on an existing table with an entry), then `trace off`.
[screenshot: Jaeger trace of a shell scan]

This demonstrates that cross-component tracing is working.
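Modeled with stand-in Span and Scope types (not the real OpenTelemetry classes, and not the PR's actual code), the fix amounts to closing both in readMessageEnd:

```java
// Minimal model of the a683811 fix: the span and scope opened when the RPC
// message header is parsed must be closed in readMessageEnd, or server-side
// spans never finish. Span and Scope are stand-in types, not OpenTelemetry's.
class ServerTracing {
  static class Span {
    boolean ended;
    void end() { ended = true; }
  }

  static class Scope implements AutoCloseable {
    boolean closed;
    @Override
    public void close() { closed = true; }
  }

  Span span;
  Scope scope;

  void readMessageBegin() {
    span = new Span();   // started when the RPC message header is parsed
    scope = new Scope(); // makes the span current on this thread
  }

  void readMessageEnd() {
    // The bug: omitting these two calls left the server span open forever.
    if (scope != null) { scope.close(); scope = null; }
    if (span != null) { span.end(); span = null; }
  }
}
```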

@ctubbsii (Member) left a comment

Looking at the code again today, after that fix for the readMessageEnd...

I think the version checks and all the related changes are good, and we should definitely merge that part soon. And I really wanted the tracing changes too, because I was hoping they would allow us to simplify a lot of the Thrift code.

However, I'm starting to wonder if pushing the tracing stuff into this is actually feasible. When I looked at it again today, I think I'm starting to realize that readMessageEnd might finish before the actual action that the RPC was requesting is executed. If that's the case, then the span is already closed, and any subsequent spans on the server side are going to be disconnected. That seems to match what I'm seeing in the second screenshot, where we only see the handleRpcMessage span, but no child spans for it on the server.

The current code creates a span that wraps the entire execution of the RPC call, but I think in the protocol, it's only wrapping the message parsing, not the execution. I am not too familiar with Thrift internals, but I suspect a subsequent writeMessage is done on the server side to send a message back to the client with the results of the execution. If so, we really want to close the handleRpcMessage span at that time, and not before.

It may still be possible to salvage these tracing changes, and close the span at the end of the RPC execution inside the protocol factory. However, if that's not possible, we may have to abandon the thrift tracing changes, and continue doing that the way it is currently done, so it wraps the whole RPC operation, and only keep the portions that do the version/compatibility checks.
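One possible shape for that salvage, sketched with a stand-in RpcProcessor interface rather than Thrift's actual TProcessor: wrap the processor so the span ends only after the whole RPC has executed and the response has been written, instead of when message parsing finishes.

```java
// Sketch of wrapping the processor so the tracing span covers the entire
// RPC execution. RpcProcessor and endSpan are stand-ins for illustration,
// not Thrift's TProcessor or the PR's actual code.
interface RpcProcessor {
  void process();
}

class TracedProcessor implements RpcProcessor {
  private final RpcProcessor delegate;
  private final Runnable endSpan; // e.g. span::end in real tracing code

  TracedProcessor(RpcProcessor delegate, Runnable endSpan) {
    this.delegate = delegate;
    this.endSpan = endSpan;
  }

  @Override
  public void process() {
    try {
      delegate.process(); // parse request, execute handler, write response
    } finally {
      endSpan.run(); // span closes only after the full RPC, not after parsing
    }
  }
}
```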
