Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: FileFormats in Substrait #183

Closed
wants to merge 4 commits into from

Conversation

sanjibansg
Copy link
Contributor

This PR introduces support for various File formats in Substrait. With reference to Issue #174, this PR currently provides the implementation of CSV format based on Apache Arrow's CSV Reader implementation. Implementations of other file formats and a generic format for all of them will also be developed in this PR.

Comment on lines -106 to -112
uint64 partition_index = 6;
uint64 partition_index = 7;

// the start position in byte to read from this item
uint64 start = 7;
uint64 start = 8;

// the length in byte to read from this item
uint64 length = 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't change existing field numbers when you can help it, as it breaks binary compatibility unnecessarily. Just use field number 9 for the new field in the oneof, they don't need to be consecutive.

Comment on lines +117 to +149
message CSVConvertOptions{
bool ignore_check_utf8 = 1;
repeated string null_values = 2;
repeated string true_values = 3;
repeated string false_values = 4;
bool strings_can_be_null = 5;
bool quoted_strings_cannot_be_null = 6;
bool auto_dict_encode = 7;
int32 auto_dict_max_cardinality = 8;
string decimal_point = 9;
repeated string include_columns = 10;
bool include_missing_columns = 11;
}

message CSVReadOptions{
bool no_use_threads = 1;
int32 block_size = 2;
int32 skip_rows = 3;
int32 skip_rows_after_names = 4;
repeated string column_names = 5;
bool autogenerate_column_names = 6;
}

message CSVParseOptions{
string delimiter = 1;
bool quoting = 2;
string quote_char = 3;
bool double_quote = 4;
bool escaping = 5;
string escape_char = 6;
bool newlines_in_values = 7;
bool ignore_empty_lines = 8;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless this set of options corresponds to some kind of de-facto standard CSV reader that I'm not aware of, I'm not sure all these options really belong in Substrait as such. In fact, column_names is redundant as it is also part of base_schema, and autogenerate_column_names can't work for the same reason. I also suspect quoting/quote_char and escaping/escape_char to be redundant; would it not make more sense in protobuf to define only a string field and specify that an empty string means the feature is disabled?

IMO any options that are not generally applicable to any and every reasonable CSV reader implementation shouldn't be here, and should instead be part of an AdvancedExtension field. The intended behavior of the options that remain should also really be documented.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second the comment on the CSV options. (I thought I already added this comment here.) Let's focus on common things like delimiter, quote char, etc. And among those options, maybe try to think about what is the best way to represent instead of possibly what is here.

Comment on lines +103 to +106
oneof file_type{
FileFormat format = 5;
CSVOptions csv_options = 6;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, a backward-compatible solution (field numbers aside). Might be a good idea to deprecate format and add a oneof option for Parquet in a later PR.

I would rename the csv_options field to just csv though, it'll make more sense in the JSON serialization that way.

@jacques-n jacques-n mentioned this pull request May 18, 2022
@jacques-n
Copy link
Contributor

Now that #169 is merged, you should be able to extend it for your purposes.

@jacques-n
Copy link
Contributor

Closing this due to inactivity. Please reopen if you want to pick it up.

@jacques-n jacques-n closed this Jul 25, 2022
rkondakov pushed a commit to rkondakov/substrait that referenced this pull request Nov 21, 2023
feat(calcite): support VarCharLiteral and FixedBinaryLiteral conversions

fix(isthmus): convert StrLiteral to VARCHAR
BREAKING CHANGE: StrLiteral is no longer converted to CHAR(<length>)

fix(isthmus): convert BinaryLiteral to VARBINARY 
BREAKING CHANGE: BinaryLiteral is no longer converted to BINARY<length>)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants