Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarifying start and length in ReadRel #367

Open
save-buffer opened this issue Nov 4, 2022 · 7 comments
Open

Clarifying start and length in ReadRel #367

save-buffer opened this issue Nov 4, 2022 · 7 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted No one is currently implementing but it seems like a good idea

Comments

@save-buffer
Copy link

Hi, I just came a cross an issue when running Velox on a parquet file. My plan omitted the start and length fields of the LocalFiles message which caused both to be set to 0, which meant that Velox read 0 bytes at offset 0, producing no results. I think in a huge percentage of cases people will want to read the entire file. Would it be reasonable to specify these fields as optional or to make 0 for the length mean to read the whole thing?
I think technically Velox is conformant right now, but this isn't the behavior I'd expect.

@westonpace
Copy link
Member

My understanding was that the default (length = 0, start = 0) would be to read the entire file. Though I see this isn't exactly specified anywhere.

CC @rui-mo

@rui-mo
Copy link
Contributor

rui-mo commented Nov 7, 2022

Understand this issue. But if (length = 0, start = 0) means to read the entire file, what can be used for an empty file? From my view, optional may be better. If not configured, then read the entire file.

@jvanstraten
Copy link
Contributor

My two cents:

  • There is no way to differentiate (penultimate paragraph) between 0/false/empty string/first enum variant and "field not set" for protobuf primitive aka scalar fields. By design, the serializer always encodes the default value as "field not set," and the deserializer always treats "field not set" as the default value. Only message fields have a discernible "unspecified" state.
  • Slicing parameters (to which these fields belong) are listed as optional in the table here.

Therefore, setting both values to zero == unspecified should indicate the whole file. I figure setting only length to zero means reading from a set start point all the way to the end, and vice versa (more obviously) for the other way around. But I would agree that the docs could be improved.

@westonpace
Copy link
Member

I don't understand why you would want to read an empty file but I agree with @jvanstraten that, if we do want to support that behavior, we will need to change something. The most intuitive change, IMHO, would be a dedicated message:

message Slice {
      // The start position in byte to read from this item
      uint64 start = 1;

      // The length in byte to read from this item
      uint64 length = 2;
}

Slice slice = 9;

reserved 7;
reserved "start";
reserved 8;
reserved "length";

@westonpace westonpace added enhancement New feature or request good first issue Good for newcomers help wanted No one is currently implementing but it seems like a good idea labels Mar 8, 2023
@thefloofe22
Copy link

Is this issue still up for grabs? Would love to do this if it is still an open issue

@EpsilonPrime
Copy link
Member

Is this issue still up for grabs? Would love to do this if it is still an open issue

Looks like it to me given the labels on this issue.

@westonpace
Copy link
Member

Yes, I don't think anyone is working on this at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted No one is currently implementing but it seems like a good idea
Projects
None yet
Development

No branches or pull requests

6 participants