Conversation
src/main/flatbuf/parquet3.fbs
Outdated
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
There was a problem hiding this comment.
lets just name this parquet.format for now?
src/main/flatbuf/parquet3.fbs
Outdated
| // 1. Statistics are stored in integral types if their size is fixed, otherwise prefix + suffix | ||
| // 2. ColumnMetaData.encoding_stats are removed, they are replaced with | ||
| // ColumnMetaData.is_fully_dict_encoded. | ||
| // 3. RowGroups are limited to 2GB in size, so we can use int for sizes. |
There was a problem hiding this comment.
I think this and the item below are out of date (we are using long now) and can keep things absolute?
src/main/flatbuf/parquet3.fbs
Outdated
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
|
|
|||
| // Optimization notes | |||
There was a problem hiding this comment.
Can we expand this comment to be explicit about the relationship between this FBS and parquet.thrift.
| // Note: Match the thrift enum values so that we can cast between them. | ||
| enum Encoding : byte { | ||
| PLAIN = 0, | ||
| // GROUP_VAR_INT = 1, |
There was a problem hiding this comment.
Call out commented out entries as deprecated to make it clear why they are commented out?
src/main/flatbuf/parquet3.fbs
Outdated
| GZIP = 2, | ||
| LZO = 3, | ||
| BROTLI = 4, | ||
| // LZ4 = 5, |
There was a problem hiding this comment.
same comment on deprecation.
src/main/flatbuf/parquet3.fbs
Outdated
| scale: int; | ||
| } | ||
| enum TimeUnit : byte { | ||
| MS = 0, |
There was a problem hiding this comment.
Can we please make these match parquet.thrift for names (Millisecond, Microsecond, Nanosecond)?
| // Logical types. | ||
| /////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
|
||
| table Empty {} |
There was a problem hiding this comment.
I think we want detailed docs (same level as parquet.thrift if we intend this to be the new footer)?
src/main/flatbuf/parquet3.fbs
Outdated
| /////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
|
||
| table Empty {} | ||
| table DecimalOpts { |
There was a problem hiding this comment.
| table DecimalOpts { | |
| table DecimalOptions { |
Should be spell out type names to make it easer on reader?
src/main/flatbuf/parquet3.fbs
Outdated
| // - BYTE_ARRAY: | ||
| // prefix: the longest common prefix of min/max | ||
| // lo8+hi8 zero padded 16 bytes (big-endian) of the suffix | ||
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the |
There was a problem hiding this comment.
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the | |
| // min_len/max_len: the length for the suffix of the value after removing the prefix. If > 16 then the |
src/main/flatbuf/parquet3.fbs
Outdated
| // prefix: the longest common prefix of min/max | ||
| // lo8+hi8 zero padded 16 bytes (big-endian) of the suffix | ||
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the | ||
| // value is inexact |
There was a problem hiding this comment.
| // value is inexact | |
| // value is inexact (it is exact otherwise). |
src/main/flatbuf/parquet3.fbs
Outdated
| // - BOOLEAN: none | ||
| // - INT32/FLOAT: lo4 (little-endian) | ||
| // - INT64/DOUBLE: lo8 (little-endian) | ||
| // - INT96: lo4+lo8 (little-endian) |
There was a problem hiding this comment.
for composite values, I think this is complicated enough that providing concrete examples would be belpful for implementors?
src/main/flatbuf/parquet3.fbs
Outdated
| DATA_PAGE_V2 = 3, | ||
| } | ||
|
|
||
| table KV { |
There was a problem hiding this comment.
nit
| table KV { | |
| table KeyValue { |
Lets keep name consistent if possible?
| codec: CompressionCodec; | ||
| num_values: long = null; // only present if not equal to rg.num_rows | ||
| total_uncompressed_size: long; | ||
| total_compressed_size: long; |
There was a problem hiding this comment.
It would be nice to keep total unencoded size here which I think is generally useful? But I suppose it can be added after?
src/main/flatbuf/parquet3.fbs
Outdated
| dictionary_page_offset: long = null; | ||
| statistics: Statistics; | ||
| is_fully_dict_encoded: bool; | ||
| bloom_filter_offset: long = null; |
There was a problem hiding this comment.
Should we this be made a struct/value to make the bloom filter info more self contained?
src/main/flatbuf/parquet3.fbs
Outdated
| row_groups: [RowGroup]; | ||
| kv: [KV]; | ||
| created_by: string; | ||
| // column_orders: [ColumnOrder]; // moved to SchemaElement |
There was a problem hiding this comment.
remove this row for now?
emkornfield
left a comment
There was a problem hiding this comment.
I think we also need to add an apache header here, and CI to make sure this compiles?
|
Hi @rok and @emkornfield , could you help to have another look of this pr? |
src/main/flatbuf/parquet3.fbs
Outdated
| // It can currently be attached as a footer extension, and may fully replace the | ||
| // Thrift footer in the future. |
There was a problem hiding this comment.
Maybe we should explicitly state thrift footer is still required as of now?
|
|
||
| /** Optional Bloom filter information for this column chunk */ | ||
| bloom_filter: BloomFilterInfo; | ||
| } |
There was a problem hiding this comment.
Currently this proposal omits optional GeospatialStatistics which with bounding boxes (8 doubles) makes for nice pruning metadata and would likely be worth keeping.
parquet-format/src/main/thrift/parquet.thrift
Line 944 in 38818fa
src/main/flatbuf/parquet3.fbs
Outdated
| min_lo4: uint; | ||
| min_lo8: ulong; | ||
| min_hi8: ulong; | ||
| min_len: byte = null; |
There was a problem hiding this comment.
| min_len: byte = null; | |
| min_len: int = null; |
Original suffix lenght could exceed int8 range of byte type.
src/main/flatbuf/parquet3.fbs
Outdated
| max_lo4: uint; | ||
| max_lo8: ulong; | ||
| max_hi8: ulong; | ||
| max_len: byte = null; |
There was a problem hiding this comment.
As above:
| max_len: byte = null; | |
| max_len: int = null; |
src/main/flatbuf/parquet3.fbs
Outdated
| } | ||
|
|
||
| table Statistics { | ||
| null_count: int = null; |
There was a problem hiding this comment.
| null_count: int = null; | |
| null_count: long = null; |
To match row group size range.
|
|
||
| union ColumnCryptoMetadata { | ||
| EncryptionWithFooterKey:Empty, | ||
| EncryptionWithColumnKey:Empty, | ||
| } |
There was a problem hiding this comment.
Used by readers to recover the column key from KMS services.
| union ColumnCryptoMetadata { | |
| EncryptionWithFooterKey:Empty, | |
| EncryptionWithColumnKey:Empty, | |
| } | |
| table EncryptionWithColumnKey { | |
| /** Column path in schema **/ | |
| path_in_schema: [string]; | |
| /** Retrieval metadata of column encryption key **/ | |
| key_metadata: [byte]; | |
| } | |
| union ColumnCryptoMetadata { | |
| EncryptionWithFooterKey:Empty, | |
| EncryptionWithColumnKey:EncryptionWithColumnKey, | |
| } |
src/main/flatbuf/parquet3.fbs
Outdated
|
|
||
| /** repetition of the field. The root of the schema does not have a repetition_type. | ||
| * All other nodes must have one */ | ||
| repetition_type: FieldRepetitionType; |
There was a problem hiding this comment.
To allow for root to not have repetition type. In thrift we have optional:
parquet-format/src/main/thrift/parquet.thrift
Line 518 in 38818fa
| repetition_type: FieldRepetitionType; | |
| repetition_type: FieldRepetitionType = null; |
src/main/flatbuf/parquet3.fbs
Outdated
| // - INT32/FLOAT: min_lo4/max_lo4 (little-endian, 4 bytes) | ||
| // - INT64/DOUBLE: min_lo8/max_lo8 (little-endian, 8 bytes) | ||
| // - INT96: lo4 contains the low 4 bytes, lo8 contains the high 8 bytes (little-endian, 12 bytes total) | ||
| // - FIXED_LEN_BYTE_ARRAY: |
There was a problem hiding this comment.
Perhaps note it's the same as for BYTE_ARRAY?
| // - FIXED_LEN_BYTE_ARRAY: | |
| // - FIXED_LEN_BYTE_ARRAY: Encoded the same way as BYTE_ARRAY below |
src/main/flatbuf/parquet3.fbs
Outdated
| max_lo8: ulong; | ||
| max_hi8: ulong; | ||
| max_len: byte = null; | ||
| prefix: string; |
| table DecimalOptions { | ||
| precision: int; | ||
| scale: int; | ||
| } |
There was a problem hiding this comment.
In thrift these are in reverse order. Any reason to change that here?
parquet-format/src/main/thrift/parquet.thrift
Lines 343 to 346 in 38818fa
src/main/flatbuf/parquet3.fbs
Outdated
| // 1. Statistics use fixed-width integral types when possible; otherwise they are | ||
| // encoded as prefix + suffix. |
There was a problem hiding this comment.
SizeStatistics and Statistics.distinct_count are removed.
|
@Jiayi-Wang-db Feel free to resolve comments you feel were addressed to make this more readable? |
Rationale for this change
Improve wide table support.
What changes are included in this PR?
Add parquet flatbuf schema.
Do these changes have PoC implementations?
apache/arrow#48431