feat: FileFormats in Substrait #183

sanjibansg · 2022-05-04T10:09:28Z

This PR introduces support for various file formats in Substrait. With reference to issue #174, it currently implements the CSV format, modeled on Apache Arrow's CSV reader. Implementations of other file formats, and a generic format covering all of them, will also be developed in this PR.

jvanstraten · 2022-05-09T12:03:02Z

proto/substrait/algebra.proto

-      uint64 partition_index = 6;
+      uint64 partition_index = 7;

      // the start position in byte to read from this item
-      uint64 start = 7;
+      uint64 start = 8;

      // the length in byte to read from this item
-      uint64 length = 8;


Don't change existing field numbers if you can help it; doing so breaks binary compatibility unnecessarily. Just use field number 9 for the new field in the oneof. Field numbers don't need to be consecutive.
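A minimal sketch of the numbering suggested above, assuming the new oneof member is the CSVOptions message from this PR. The package name, the enclosing message name FileOrFilesSketch, and the empty stub types are placeholders so the fragment compiles on its own; only the field names and the numbers 5 through 9 come from this thread.

```protobuf
syntax = "proto3";

package sketch;  // hypothetical package, for illustration only

// Placeholder stand-ins so this fragment compiles standalone.
enum FileFormat {
  FILE_FORMAT_UNSPECIFIED = 0;
}
message CSVOptions {}

message FileOrFilesSketch {
  oneof file_type {
    FileFormat format = 5;       // existing number, unchanged
    CSVOptions csv_options = 9;  // new member takes the next free number
  }

  uint64 partition_index = 6;  // unchanged

  // the start position in byte to read from this item
  uint64 start = 7;  // unchanged

  // the length in byte to read from this item
  uint64 length = 8;  // unchanged
}
```

The field number is part of every encoded field's tag, so data written with start = 7 would be decoded as partition_index by a reader built from the renumbered schema. Both fields are uint64, so the mismatch would not even raise a parse error.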

jvanstraten · 2022-05-09T12:20:18Z

proto/substrait/algebra.proto

+      message CSVConvertOptions{
+        bool ignore_check_utf8 = 1;
+        repeated string null_values = 2;
+        repeated string true_values = 3;
+        repeated string false_values = 4;
+        bool strings_can_be_null = 5;
+        bool quoted_strings_cannot_be_null = 6;
+        bool auto_dict_encode = 7;
+        int32 auto_dict_max_cardinality = 8;
+        string decimal_point = 9;
+        repeated string include_columns = 10;
+        bool include_missing_columns = 11;
+      }
+
+      message CSVReadOptions{
+        bool no_use_threads = 1;
+        int32 block_size = 2;
+        int32 skip_rows = 3;
+        int32 skip_rows_after_names = 4;
+        repeated string column_names = 5;
+        bool autogenerate_column_names = 6;
+      }
+
+      message CSVParseOptions{
+        string delimiter = 1;
+        bool quoting = 2;
+        string quote_char = 3; 
+        bool double_quote = 4;
+        bool escaping = 5;
+        string escape_char = 6;
+        bool newlines_in_values = 7;
+        bool ignore_empty_lines = 8;
+      }


Unless this set of options corresponds to some kind of de-facto standard CSV reader that I'm not aware of, I'm not sure all these options really belong in Substrait as such. In fact, column_names is redundant as it is also part of base_schema, and autogenerate_column_names can't work for the same reason. I also suspect quoting/quote_char and escaping/escape_char to be redundant; would it not make more sense in protobuf to define only a string field and specify that an empty string means the feature is disabled?

IMO any options that are not generally applicable to any and every reasonable CSV reader implementation shouldn't be here, and should instead be part of an AdvancedExtension field. The intended behavior of the options that remain should also really be documented.

jacques-n · 2022-05-10

I second the comment on the CSV options. (I thought I had already added this comment here.) Let's focus on common things like the delimiter, quote char, etc. And for those options, let's think about the best way to represent them rather than simply keeping what is here.
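One way the trimmed-down message discussed in this thread might look, as a sketch only. The message name CSVOptionsSketch, the package name, the field numbers, and the choice of which options survive are assumptions for illustration; the thread does not settle them. The import assumes the file is compiled inside the Substrait proto tree, where substrait/extensions/extensions.proto defines AdvancedExtension.

```protobuf
syntax = "proto3";

package sketch;  // hypothetical package, for illustration only

// Assumes compilation inside the Substrait proto tree.
import "substrait/extensions/extensions.proto";

message CSVOptionsSketch {
  // Character separating fields within a row.
  // Assumption: an empty (unset) value would need a documented default.
  string delimiter = 1;

  // Character used to quote values. Empty string: quoting is disabled.
  // Replaces the separate `bool quoting` / `string quote_char` pair.
  string quote_char = 2;

  // Character used to escape special characters.
  // Empty string: escaping is disabled.
  // Replaces the separate `bool escaping` / `string escape_char` pair.
  string escape_char = 3;

  // Reader-specific settings (for example block_size or auto_dict_encode)
  // would travel here instead of as first-class fields.
  substrait.extensions.AdvancedExtension advanced_extension = 10;
}
```

Note that an unset proto3 string is the empty string, so under this convention a message with nothing set disables quoting and escaping. Whether that is the right default is a design decision the comments above leave open.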

jvanstraten · 2022-05-09T12:24:40Z

proto/substrait/algebra.proto

+      oneof file_type{
+        FileFormat format = 5;
+        CSVOptions csv_options = 6;
+      }


Nice, a backward-compatible solution (field numbers aside). Might be a good idea to deprecate format and add a oneof option for Parquet in a later PR.

I would rename the csv_options field to just csv, though; it'll make more sense in the JSON serialization that way.
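A sketch of the oneof after the rename and deprecation suggested above. The package name, the message name FileOrFilesSketch, the stub types, the ParquetOptions member, and the field numbers 9 and 10 are assumptions for illustration (9 follows the numbering advice from the earlier review comment).

```protobuf
syntax = "proto3";

package sketch;  // hypothetical package, for illustration only

// Placeholder stand-ins so this fragment compiles standalone.
enum FileFormat {
  FILE_FORMAT_UNSPECIFIED = 0;
}
message CSVOptions {}
message ParquetOptions {}  // hypothetical, for a later PR

message FileOrFilesSketch {
  oneof file_type {
    // Kept for backward compatibility; new plans would use a typed member.
    FileFormat format = 5 [deprecated = true];

    // Serialized under the JSON key "csv".
    CSVOptions csv = 9;

    // Hypothetical future member, serialized under the JSON key "parquet".
    ParquetOptions parquet = 10;
  }
}
```

The proto3 JSON mapping converts field names to lowerCamelCase, so the original csv_options would appear as "csvOptions" in a JSON plan, while csv reads as a plain format tag alongside a future "parquet".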

jacques-n · 2022-06-05T23:16:48Z

Now that #169 is merged, you should be able to extend it for your purposes.

jacques-n · 2022-07-25T16:11:51Z

Closing this due to inactivity. Please reopen if you want to pick it up.


sanjibansg added 4 commits April 17, 2022 14:59

Initial Commit: FileFormat for CSV

9ab8033

feat: oneof file_type

7db17ff

feat: CSVConvertOptions

34a9992

fix: CSVOptions

69cbd5b

jvanstraten reviewed May 9, 2022


jacques-n mentioned this pull request May 18, 2022

Add orc file format #202

Closed

jacques-n closed this Jul 25, 2022
