GH-43695: [C++][Parquet] flatbuffers metadata integration#48431
GH-43695: [C++][Parquet] flatbuffers metadata integration#48431Jiayi-Wang-db wants to merge 23 commits intoapache:mainfrom
Conversation
| auto To(const format::ColumnMetaData& cm) { | ||
| if (!cm.encoding_stats.empty()) { | ||
| for (auto&& e : cm.encoding_stats) { | ||
| if (e.page_type != format::PageType::DATA_PAGE && | ||
| e.page_type != format::PageType::DATA_PAGE_V2) | ||
| continue; | ||
| if (e.encoding != format::Encoding::PLAIN_DICTIONARY && | ||
| e.encoding != format::Encoding::RLE_DICTIONARY) { | ||
| return false; | ||
| } | ||
| } | ||
| return true; | ||
| } | ||
| bool has_plain_dictionary_encoding = false; | ||
| bool has_non_dictionary_encoding = false; | ||
| for (auto encoding : cm.encodings) { | ||
| switch (encoding) { | ||
| case format::Encoding::PLAIN_DICTIONARY: | ||
| // PLAIN_DICTIONARY encoding was present, which means at | ||
| // least one page was dictionary encoded and v1.0 encodings are used. | ||
| has_plain_dictionary_encoding = true; | ||
| break; | ||
| case format::Encoding::RLE: | ||
| case format::Encoding::BIT_PACKED: | ||
| // Other than for boolean values, RLE and BIT_PACKED are only used for | ||
| // repetition or definition levels. Additionally booleans are not dictionary | ||
| // encoded hence it is safe to disregard the case where some boolean data pages | ||
| // are dictionary encoded and some boolean pages are RLE/BIT_PACKED encoded. | ||
| break; | ||
| default: | ||
| has_non_dictionary_encoding = true; | ||
| break; | ||
| } | ||
| } | ||
| if (has_plain_dictionary_encoding) { | ||
| // Return true, if there are no encodings other than dictionary or | ||
| // repetition/definition levels. | ||
| return !has_non_dictionary_encoding; | ||
| } | ||
|
|
||
| // If PLAIN_DICTIONARY wasn't present, then either the column is not | ||
| // dictionary-encoded, or the 2.0 encoding, RLE_DICTIONARY, was used. | ||
| // For 2.0, this cannot determine whether a page fell back to non-dictionary encoding | ||
| // without page encoding stats. | ||
| return false; | ||
| } |
There was a problem hiding this comment.
This is not the same logic as parquet::IsColumnChunkFullyDictionaryEncoded, but it is the same as parquet-mr DistionaryFilte::HasNonDictionaryPages.
Need advice on what's the difference and which approach to follow.
There was a problem hiding this comment.
Could you summarize the difference?
|
Great to see things moving here! |
### Rationale for this change Add link to flatbuf footer ticket with proposal. ### What changes are included in this PR? ### Do these changes have PoC implementations? apache/arrow#48431
### Rationale for this change Add link to flatbuf footer ticket with proposal. ### What changes are included in this PR? ### Do these changes have PoC implementations? apache/arrow#48431
| // Returns the size of the flatbuffer if found (and writes to out_flatbuffer), | ||
| // returns 0 if no flatbuffer extension is present, or returns the required | ||
| // buffer size if the input buffer is too small. | ||
| ::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer); |
There was a problem hiding this comment.
Since FileMetaData::Make takes uint32_t as metadata_len it might make sense to return it here?
| ::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer); | |
| ::arrow::Result<uint32_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer); |
| #include "arrow/result.h" | ||
| #include "flatbuffers/flatbuffers.h" | ||
| #include "generated/parquet3_generated.h" | ||
| #include "generated/parquet_types.h" |
There was a problem hiding this comment.
Is metadata3.h meant to be public? If so this will make generated thrift header public as well. Perhaps we could introduce MakeFromFlatbuffer in metadata.h/cc instead so we can use it in file_reader.cc:457.
static std::shared_ptr<FileMetaData> MakeFromFlatbuffer(
const uint8_t* flatbuffer_data,
size_t flatbuffer_size,
uint32_t metadata_len,
const ReaderProperties& properties = default_reader_properties());There was a problem hiding this comment.
Some effort is made to not make thrift structs public, I think we should take the same approach with Flatbuffer.
|
FYI @emkornfield @prtkgaur if you want to take a look |
| format3::GetFileMetaData(flatbuffer_data.data()); | ||
| auto thrift_metadata = | ||
| std::make_unique<format::FileMetaData>(FromFlatbuffer(fb_metadata)); | ||
| file_metadata_ = FileMetaData::Make( |
There was a problem hiding this comment.
FileMetadata is already a wrapper around thrift, is there a reason we don't have a different implementation that is made purely from the FileMetadata?
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
There was a problem hiding this comment.
I left comments on the PR for the FBS file in parquet-format, we should resync after those are adressed.
| void set_footer_read_size(size_t size) { footer_read_size_ = size; } | ||
| size_t footer_read_size() const { return footer_read_size_; } | ||
|
|
||
| // If enabled, try to read the metadata3 footer from the file. |
There was a problem hiding this comment.
| // If enabled, try to read the metadata3 footer from the file. | |
| // If enabled, try to read the flatbuffer metadata footer from the file as an extension (i.e. a PAR1 file). |
| // If it fails, fall back to Thrift footer decoding. | ||
| bool read_metadata3() const { return read_metadata3_; } | ||
| void set_read_metadata3(bool read_metadata3) { read_metadata3_ = read_metadata3; } | ||
|
|
There was a problem hiding this comment.
I guess we need to finalize PAR2 or PAR3 footer to be able to write this out without extension, I think that can be follow-up work but it would be nice to do this as part of the FBS work to ensure we can eventually move away from thrift.
|
|
||
| // If enabled, try to read the metadata3 footer from the file. | ||
| // If it fails, fall back to Thrift footer decoding. | ||
| bool read_metadata3() const { return read_metadata3_; } |
There was a problem hiding this comment.
| bool read_metadata3() const { return read_metadata3_; } | |
| bool read_flatbuffer_metadata_if_present() const { return read_metadata3_; } |
| bool page_checksum_verification_ = false; | ||
| // Used with a RecordReader. | ||
| bool read_dense_for_nullable_ = false; | ||
| bool read_metadata3_ = false; |
There was a problem hiding this comment.
I think this should default to true? otherwise I worry about readers getting the benefit?
| LZ4_RAW = 7, | ||
| }; | ||
|
|
||
| auto GetNumChildren( |
There was a problem hiding this comment.
style nit: Is auto needed here, generally we wouldn't use it unless it was needed for templating, etc.
|
|
||
| auto GetName(const std::vector<format::SchemaElement>& s, size_t i) { return s[i].name; } | ||
|
|
||
| class ColumnMap { |
| BuildParents(s); | ||
| } | ||
|
|
||
| size_t ToSchema(size_t cc_idx) const { return colchunk2schema_[cc_idx]; } |
| std::vector<uint32_t> parents_; | ||
| }; | ||
|
|
||
| struct MinMax { |
| uint8_t* const p = reinterpret_cast<uint8_t*>(out.data()) + n + 1; | ||
|
|
||
| // Compute and store checksums and lengths | ||
| uint32_t crc32 = ::arrow::internal::crc32(0, reinterpret_cast<const uint8_t*>(out.data()), n + 1); |
There was a problem hiding this comment.
Is this format documented (I might have missed it in the parquet-format pull request).
662128e to
3aea095
Compare
| auto footer_buffer, | ||
| source_->ReadAt(source_size_ - footer_read_size, footer_read_size)); | ||
| uint32_t metadata_len = ParseFooterLength(footer_buffer, footer_read_size); | ||
| if (properties_.read_flatbuffer_metadata_if_present()) { |
There was a problem hiding this comment.
nit: could we move most of this to a helper, this function is already getting kind of long?
| auto result = ExtractFlatbuffer(footer_buffer, &flatbuffer_data); | ||
| if (result.ok()) { | ||
| uint32_t required_or_consumed = *result; | ||
| if (required_or_consumed > static_cast<uint32_t>(footer_buffer->size()) && |
There was a problem hiding this comment.
we are on C++20 now, so using std::cmp_greater I think would make a lot of these comparison more succinct.
| // Get flatbuffer metadata and convert to thrift | ||
| const format3::FileMetaData* fb_metadata = | ||
| format3::GetFileMetaData(flatbuffer_data.data()); | ||
| auto thrift_metadata = |
There was a problem hiding this comment.
I'm not sure this addresses a previous comment about not requiring the conversion to thrift? Parquet C++ already tries to abstract the footer metadata behind an interface is there a reason we can't make that consume the flatbuffer metadata directly?
| std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR); | ||
|
|
||
| /// \brief Create a FileMetaData from an already-deserialized thrift object. | ||
| static std::shared_ptr<FileMetaData> Make( |
There was a problem hiding this comment.
I don't think we want this constructor here, there has been a lot of care taken to make format::FileMetadata an internal class.
|
|
||
| void WriteTo(::arrow::io::OutputStream* dst, | ||
| const std::shared_ptr<Encryptor>& encryptor = NULLPTR) const; | ||
| void WriteToWithMetadata3(::arrow::io::OutputStream* dst) const; |
There was a problem hiding this comment.
| void WriteToWithMetadata3(::arrow::io::OutputStream* dst) const; | |
| void WriteWithFlatbufferMetadataTo(::arrow::io::OutputStream* dst) const; |
| return this; | ||
| } | ||
|
|
||
| Builder* enable_write_metadata3() { |
There was a problem hiding this comment.
| Builder* enable_write_metadata3() { | |
| Builder* enable_write_flatbuffer_metadata_extension() { |
emkornfield
left a comment
There was a problem hiding this comment.
Two high level changes:
- We should be aiming to encapsulate the flatbuffer metadata behind a separate implementation inside of FileMetadata, and not translating to the thrift struct. If this requires a lot of additional refactor, please lets do the refactor first in a separate PR.
- Lets not version metadata method names/file names. There are two use-cases for flatbuffer metadata:
- Extension of the existing thrift metadata
- As a new footer format for PAR3 files
I think coupling the naming to a version makes it more confusing to understand both use-cases especially because we haven't actually made progress on finalizing what the a v3 footer would look like.
Rationale for this change
Integrate flatbuffers metadata into thrift footer.
The detailed design and experiment doc:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=sharing)
What changes are included in this PR?
Definition of the FlatBuffer footer and the generated FlatBuffer file
To/FromFlatBuffer functions to convert between FlatBuffer and Thrift footer
Append/Extract FlatBuffer to/from the extension field of the Thrift footer
Use append/extract operations based on reader/writer flags
Are these changes tested?
Yes, with newly added tests.
Are there any user-facing changes?
Yes, users can write and read the FlatBuffer footer to speed up footer parsing.