Skip to content

GH-43695: [C++][Parquet] flatbuffers metadata integration#48431

Open
Jiayi-Wang-db wants to merge 23 commits intoapache:mainfrom
Jiayi-Wang-db:flatbuf3
Open

GH-43695: [C++][Parquet] flatbuffers metadata integration#48431
Jiayi-Wang-db wants to merge 23 commits intoapache:mainfrom
Jiayi-Wang-db:flatbuf3

Conversation

@Jiayi-Wang-db
Copy link
Copy Markdown

@Jiayi-Wang-db Jiayi-Wang-db commented Dec 10, 2025

Rationale for this change

Integrate flatbuffers metadata into thrift footer.
The detailed design and experiment doc:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=sharing)

What changes are included in this PR?

  • Definition of the FlatBuffer footer and the generated FlatBuffer file

  • To/FromFlatBuffer functions to convert between FlatBuffer and Thrift footer

  • Append/Extract FlatBuffer to/from the extension field of the Thrift footer

  • Use append/extract operations based on reader/writer flags

Are these changes tested?

Yes, with newly added tests.

Are there any user-facing changes?

Yes, users can write and read the FlatBuffer footer to speed up footer parsing.

Comment on lines +874 to +919
auto To(const format::ColumnMetaData& cm) {
if (!cm.encoding_stats.empty()) {
for (auto&& e : cm.encoding_stats) {
if (e.page_type != format::PageType::DATA_PAGE &&
e.page_type != format::PageType::DATA_PAGE_V2)
continue;
if (e.encoding != format::Encoding::PLAIN_DICTIONARY &&
e.encoding != format::Encoding::RLE_DICTIONARY) {
return false;
}
}
return true;
}
bool has_plain_dictionary_encoding = false;
bool has_non_dictionary_encoding = false;
for (auto encoding : cm.encodings) {
switch (encoding) {
case format::Encoding::PLAIN_DICTIONARY:
// PLAIN_DICTIONARY encoding was present, which means at
// least one page was dictionary encoded and v1.0 encodings are used.
has_plain_dictionary_encoding = true;
break;
case format::Encoding::RLE:
case format::Encoding::BIT_PACKED:
// Other than for boolean values, RLE and BIT_PACKED are only used for
// repetition or definition levels. Additionally booleans are not dictionary
// encoded hence it is safe to disregard the case where some boolean data pages
// are dictionary encoded and some boolean pages are RLE/BIT_PACKED encoded.
break;
default:
has_non_dictionary_encoding = true;
break;
}
}
if (has_plain_dictionary_encoding) {
// Return true, if there are no encodings other than dictionary or
// repetition/definition levels.
return !has_non_dictionary_encoding;
}

// If PLAIN_DICTIONARY wasn't present, then either the column is not
// dictionary-encoded, or the 2.0 encoding, RLE_DICTIONARY, was used.
// For 2.0, this cannot determine whether a page fell back to non-dictionary encoding
// without page encoding stats.
return false;
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the same logic as parquet::IsColumnChunkFullyDictionaryEncoded, but it is the same as parquet-mr DistionaryFilte::HasNonDictionaryPages.
Need advice on what's the difference and which approach to follow.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you summarize the difference?

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 11, 2025
@rok
Copy link
Copy Markdown
Member

rok commented Dec 11, 2025

Great to see things moving here!

alkis added a commit to alkis/parquet-format that referenced this pull request Dec 12, 2025
### Rationale for this change

Add link to flatbuf footer ticket with proposal.

### What changes are included in this PR?


### Do these changes have PoC implementations?

apache/arrow#48431
Fokko pushed a commit to apache/parquet-format that referenced this pull request Dec 12, 2025
### Rationale for this change

Add link to flatbuf footer ticket with proposal.

### What changes are included in this PR?


### Do these changes have PoC implementations?

apache/arrow#48431
Comment thread cpp/src/parquet/metadata3.h Outdated
// Returns the size of the flatbuffer if found (and writes to out_flatbuffer),
// returns 0 if no flatbuffer extension is present, or returns the required
// buffer size if the input buffer is too small.
::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since FileMetaData::Make takes uint32_t as metadata_len it might make sense to return it here?

Suggested change
::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
::arrow::Result<uint32_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);

#include "arrow/result.h"
#include "flatbuffers/flatbuffers.h"
#include "generated/parquet3_generated.h"
#include "generated/parquet_types.h"
Copy link
Copy Markdown
Member

@rok rok Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is metadata3.h meant to be public? If so this will make generated thrift header public as well. Perhaps we could introduce MakeFromFlatbuffer in metadata.h/cc instead so we can use it in file_reader.cc:457.

  static std::shared_ptr<FileMetaData> MakeFromFlatbuffer(
      const uint8_t* flatbuffer_data,
      size_t flatbuffer_size,
      uint32_t metadata_len,
      const ReaderProperties& properties = default_reader_properties());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some effort is made to not make thrift structs public, I think we should take the same approach with Flatbuffer.

@alkis
Copy link
Copy Markdown
Contributor

alkis commented Jan 21, 2026

FYI @emkornfield @prtkgaur if you want to take a look

Comment thread cpp/src/parquet/file_reader.cc Outdated
format3::GetFileMetaData(flatbuffer_data.data());
auto thrift_metadata =
std::make_unique<format::FileMetaData>(FromFlatbuffer(fb_metadata));
file_metadata_ = FileMetaData::Make(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileMetadata is already a wrapper around thrift, is there a reason we don't have a different implementation that is made purely from the FileMetadata?

@@ -0,0 +1,224 @@
namespace parquet.format3;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left comments on the PR for the FBS file in parquet-format, we should resync after those are adressed.

Comment thread cpp/src/parquet/properties.h Outdated
void set_footer_read_size(size_t size) { footer_read_size_ = size; }
size_t footer_read_size() const { return footer_read_size_; }

// If enabled, try to read the metadata3 footer from the file.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// If enabled, try to read the metadata3 footer from the file.
// If enabled, try to read the flatbuffer metadata footer from the file as an extension (i.e. a PAR1 file).

// If it fails, fall back to Thrift footer decoding.
bool read_metadata3() const { return read_metadata3_; }
void set_read_metadata3(bool read_metadata3) { read_metadata3_ = read_metadata3; }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need to finalize PAR2 or PAR3 footer to be able to write this out without extension, I think that can be follow-up work but it would be nice to do this as part of the FBS work to ensure we can eventually move away from thrift.

Comment thread cpp/src/parquet/properties.h Outdated

// If enabled, try to read the metadata3 footer from the file.
// If it fails, fall back to Thrift footer decoding.
bool read_metadata3() const { return read_metadata3_; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool read_metadata3() const { return read_metadata3_; }
bool read_flatbuffer_metadata_if_present() const { return read_metadata3_; }

Comment thread cpp/src/parquet/properties.h Outdated
bool page_checksum_verification_ = false;
// Used with a RecordReader.
bool read_dense_for_nullable_ = false;
bool read_metadata3_ = false;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should default to true? otherwise I worry about readers getting the benefit?

Comment thread cpp/src/parquet/metadata3.cc Outdated
LZ4_RAW = 7,
};

auto GetNumChildren(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit: Is auto needed here, generally we wouldn't use it unless it was needed for templating, etc.


auto GetName(const std::vector<format::SchemaElement>& s, size_t i) { return s[i].name; }

class ColumnMap {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add docs.

Comment thread cpp/src/parquet/metadata3.cc Outdated
BuildParents(s);
}

size_t ToSchema(size_t cc_idx) const { return colchunk2schema_[cc_idx]; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs.

std::vector<uint32_t> parents_;
};

struct MinMax {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs.

Comment thread cpp/src/parquet/metadata3.cc Outdated
uint8_t* const p = reinterpret_cast<uint8_t*>(out.data()) + n + 1;

// Compute and store checksums and lengths
uint32_t crc32 = ::arrow::internal::crc32(0, reinterpret_cast<const uint8_t*>(out.data()), n + 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this format documented (I might have missed it in the parquet-format pull request).

@rok rok force-pushed the flatbuf3 branch 11 times, most recently from 662128e to 3aea095 Compare March 6, 2026 20:10
auto footer_buffer,
source_->ReadAt(source_size_ - footer_read_size, footer_read_size));
uint32_t metadata_len = ParseFooterLength(footer_buffer, footer_read_size);
if (properties_.read_flatbuffer_metadata_if_present()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could we move most of this to a helper, this function is already getting kind of long?

auto result = ExtractFlatbuffer(footer_buffer, &flatbuffer_data);
if (result.ok()) {
uint32_t required_or_consumed = *result;
if (required_or_consumed > static_cast<uint32_t>(footer_buffer->size()) &&
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are on C++20 now, so using std::cmp_greater I think would make a lot of these comparison more succinct.

// Get flatbuffer metadata and convert to thrift
const format3::FileMetaData* fb_metadata =
format3::GetFileMetaData(flatbuffer_data.data());
auto thrift_metadata =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this addresses a previous comment about not requiring the conversion to thrift? Parquet C++ already tries to abstract the footer metadata behind an interface is there a reason we can't make that consume the flatbuffer metadata directly?

std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);

/// \brief Create a FileMetaData from an already-deserialized thrift object.
static std::shared_ptr<FileMetaData> Make(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this constructor here, there has been a lot of care taken to make format::FileMetadata an internal class.


void WriteTo(::arrow::io::OutputStream* dst,
const std::shared_ptr<Encryptor>& encryptor = NULLPTR) const;
void WriteToWithMetadata3(::arrow::io::OutputStream* dst) const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void WriteToWithMetadata3(::arrow::io::OutputStream* dst) const;
void WriteWithFlatbufferMetadataTo(::arrow::io::OutputStream* dst) const;

return this;
}

Builder* enable_write_metadata3() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Builder* enable_write_metadata3() {
Builder* enable_write_flatbuffer_metadata_extension() {

Copy link
Copy Markdown
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two high level changes:

  1. We should be aiming to encapsulate the flatbuffer metadata behind a separate implementation inside of FileMetadata, and not translating to the thrift struct. If this requires a lot of additional refactor, please lets do the refactor first in a separate PR.
  2. Lets not version metadata method names/file names. There are two use-cases for flatbuffer metadata:
    • Extension of the existing thrift metadata
    • As a new footer format for PAR3 files

I think coupling the naming to a version makes it more confusing to understand both use-cases especially because we haven't actually made progress on finalizing what the a v3 footer would look like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants