-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: FileFormats in Substrait #183
Conversation
uint64 partition_index = 6; | ||
uint64 partition_index = 7; | ||
|
||
// the start position in byte to read from this item | ||
uint64 start = 7; | ||
uint64 start = 8; | ||
|
||
// the length in byte to read from this item | ||
uint64 length = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't change existing field numbers when you can help it, as it breaks binary compatibility unnecessarily. Just use field number 9 for the new field in the oneof
, they don't need to be consecutive.
message CSVConvertOptions{ | ||
bool ignore_check_utf8 = 1; | ||
repeated string null_values = 2; | ||
repeated string true_values = 3; | ||
repeated string false_values = 4; | ||
bool strings_can_be_null = 5; | ||
bool quoted_strings_cannot_be_null = 6; | ||
bool auto_dict_encode = 7; | ||
int32 auto_dict_max_cardinality = 8; | ||
string decimal_point = 9; | ||
repeated string include_columns = 10; | ||
bool include_missing_columns = 11; | ||
} | ||
|
||
message CSVReadOptions{ | ||
bool no_use_threads = 1; | ||
int32 block_size = 2; | ||
int32 skip_rows = 3; | ||
int32 skip_rows_after_names = 4; | ||
repeated string column_names = 5; | ||
bool autogenerate_column_names = 6; | ||
} | ||
|
||
message CSVParseOptions{ | ||
string delimiter = 1; | ||
bool quoting = 2; | ||
string quote_char = 3; | ||
bool double_quote = 4; | ||
bool escaping = 5; | ||
string escape_char = 6; | ||
bool newlines_in_values = 7; | ||
bool ignore_empty_lines = 8; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless this set of options corresponds to some kind of de-facto standard CSV reader that I'm not aware of, I'm not sure all these options really belong in Substrait as such. In fact, column_names
is redundant as it is also part of base_schema
, and autogenerate_column_names
can't work for the same reason. I also suspect quoting
/quote_char
and escaping
/escape_char
to be redundant; would it not make more sense in protobuf to define only a string field and specify that an empty string means the feature is disabled?
IMO any options that are not generally applicable to any and every reasonable CSV reader implementation shouldn't be here, and should instead be part of an AdvancedExtension
field. The intended behavior of the options that remain should also really be documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second the comment on the CSV options. (I thought I already added this comment here.) Let's focus on common things like delimiter, quote char, etc. And among those options, maybe try to think about what is the best way to represent instead of possibly what is here.
oneof file_type{ | ||
FileFormat format = 5; | ||
CSVOptions csv_options = 6; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, a backward-compatible solution (field numbers aside). Might be a good idea to deprecate format
and add a oneof option for Parquet in a later PR.
I would rename the csv_options
field to just csv
though, it'll make more sense in the JSON serialization that way.
Now that #169 is merged, you should be able to extend it for your purposes. |
Closing this due to inactivity. Please reopen if you want to pick it up. |
feat(calcite): support VarCharLiteral and FixedBinaryLiteral conversions fix(isthmus): convert StrLiteral to VARCHAR BREAKING CHANGE: StrLiteral is no longer converted to CHAR(<length>) fix(isthmus): convert BinaryLiteral to VARBINARY BREAKING CHANGE: BinaryLiteral is no longer converted to BINARY<length>)
This PR introduces support for various File formats in Substrait. With reference to Issue #174, this PR currently provides the implementation of CSV format based on Apache Arrow's CSV Reader implementation. Implementations of other file formats and a generic format for all of them will also be developed in this PR.