Add CSV FileFormat in Substrait #174

Open
sanjibansg opened this issue Apr 26, 2022 · 2 comments
Labels
enhancement New feature or request help wanted No one is currently implementing but it seems like a good idea

Comments

@sanjibansg
Contributor

sanjibansg commented Apr 26, 2022

With reference to #138, we can implement the CSV file format by defining the required messages. (Prototype code can be found here)

message CSVConvertOptions {
  bool ignore_check_utf8 = 1;
  repeated string null_values = 2;
  repeated string true_values = 3;
  repeated string false_values = 4;
  bool strings_can_be_null = 5;
  bool quoted_strings_cannot_be_null = 6;
  bool auto_dict_encode = 7;
  int32 auto_dict_max_cardinality = 8;
  string decimal_point = 9;
  repeated string include_columns = 10;
  bool include_missing_columns = 11;
}

message CSVReadOptions {
  bool no_use_threads = 1;
  int32 block_size = 2;
  int32 skip_rows = 3;
  int32 skip_rows_after_names = 4;
  repeated string column_names = 5;
  bool autogenerate_column_names = 6;
}

message CSVParseOptions {
  string delimiter = 1;
  bool quoting = 2;
  string quote_char = 3;
  bool double_quote = 4;
  bool escaping = 5;
  string escape_char = 6;
  bool newlines_in_values = 7;
  bool ignore_empty_lines = 8;
}

message CSVOptions {
  CSVParseOptions parse_options = 1;
  CSVConvertOptions convert_options = 2;
  CSVReadOptions read_options = 3;
}

The file_type can then be defined using a oneof:

      oneof file_type{
        FileFormat format = 5;
        CSVOptions csv_options = 6;
      }

We can proceed with this and then develop a generic implementation using google.protobuf.Any, with separate .proto files defining the various file formats.
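To make the google.protobuf.Any idea concrete, here is a minimal sketch of what the generic variant might look like. Everything except the `file_type` oneof and the `FileFormat format = 5` field is an assumption for illustration: the package name, the `FileSketch` wrapper, the `format_options` field name, and the placeholder enum values are hypothetical and are not taken from the Substrait spec.

```protobuf
syntax = "proto3";

package csv_sketch;  // hypothetical package name

import "google/protobuf/any.proto";

// Placeholder standing in for the existing FileFormat type referenced above.
// The values listed here are illustrative only.
enum FileFormat {
  FILE_FORMAT_UNSPECIFIED = 0;
  FILE_FORMAT_PARQUET = 1;
}

// Hypothetical wrapper message; only the oneof mirrors the proposal.
message FileSketch {
  oneof file_type {
    FileFormat format = 5;
    // Format-specific options, for example a packed CSVOptions message
    // defined in its own .proto file. Consumers that do not recognize the
    // packed type URL would have to reject or skip the file.
    google.protobuf.Any format_options = 6;
  }
}
```

With this shape, adding a new file format would not require changing the core message, at the cost of consumers having to know each packed type in advance.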

@westonpace
Member

Thanks @sanjibansg. I can add a bit of context. These are based on Arrow's CSV reader implementation. There is a similar "giant block of CSV options" in pandas. I think my big question (for the Substrait community) would be whether something like this is in scope for Substrait and, if so, how it should be added.

@jacques-n
Contributor

I think it should be partially added to core Substrait. Some of these things seem very Arrow-specific, while others seem very generic (specific: use threads; generic: delimiter).

Let's start by focusing on adding the things that are common to most delimited text readers. Then we can potentially define some structured hints that may be useful but could be ignored. For example, use threads feels like a hint, not a semantic piece of information (implementations could ignore it and still provide logically equivalent results). Some of these things also don't really make sense. For example, I don't know what column names would mean in the context of Substrait (and there are several properties focused on this).
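One way to picture the split between semantic options and ignorable hints is the sketch below. It is not an agreed design: all message and field names are hypothetical, and apart from delimiter (named above as generic) and use threads (named above as a hint), the placement of each field is an assumption.

```protobuf
syntax = "proto3";

package csv_sketch;  // hypothetical package name

// Options that change the logical result of the read. Every consumer
// would need to honor these to produce equivalent output.
message DelimitedTextOptions {
  string delimiter = 1;
  string quote_char = 2;   // assumed to be generic
  string escape_char = 3;  // assumed to be generic
}

// Hints that a consumer may ignore without changing the result.
message DelimitedTextHints {
  bool use_threads = 1;
  int32 block_size = 2;    // assumed to be a hint
}

// Hypothetical container keeping the two groups separate.
message DelimitedTextRead {
  DelimitedTextOptions options = 1;
  DelimitedTextHints hints = 2;
}
```

Keeping hints in their own message would let a consumer drop the whole `hints` field without inspecting it, while column-name settings are left out entirely because their meaning in Substrait is the open question raised above.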

@westonpace westonpace added enhancement New feature or request help wanted No one is currently implementing but it seems like a good idea labels Mar 1, 2023
rkondakov pushed a commit to rkondakov/substrait that referenced this issue Nov 21, 2023