feat(format): optimize avro read performance and support reading various nested types#76
Merged
Merged
Conversation
lxy-9602
reviewed
Jan 26, 2026
Collaborator
|
Please enhance the PR description with more details, such as:
|
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 63 out of 91 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
test/inte/scan_and_read_inte_test.cpp:2716
- The new Avro integration tests (TestAvroWithAppendTable and TestAvroWithPkTable) check for file_format == "avro" at the beginning, but the GetTestValuesForScanAndReadInteTest function doesn't include avro in the returned test values. This means these tests will never execute. Consider adding avro test values similar to how ORC is conditionally added, for example:
#ifdef PAIMON_ENABLE_AVRO
values.emplace_back("avro", false);
values.emplace_back("avro", true);
#endifsrc/paimon/format/avro/avro_format_writer.cpp:123
- Inconsistent error message formatting: The error messages at lines 72, 84, 99, and 123 use "{}" placeholders but don't call fmt::format(). These should either use fmt::format() like line 70, 82, 97, and 121, or should use simple string concatenation with "+". For example, line 72 should be:
return Status::Invalid(fmt::format("avro format writer create failed: {}", e.what()));The same fix applies to lines 84, 99, and 123.
return Status::Invalid("avro format writer create failed: {}", e.what());
} catch (...) {
return Status::Invalid("avro format writer create failed: unknown exception");
}
}
Status AvroFormatWriter::Flush() {
try {
writer_->flush();
} catch (const ::avro::Exception& e) {
return Status::Invalid(fmt::format("avro writer flush failed. {}", e.what()));
} catch (const std::exception& e) {
return Status::Invalid("avro writer flush failed: {}", e.what());
} catch (...) {
return Status::Invalid("avro writer flush failed: unknown exception");
}
return Status::OK();
}
Status AvroFormatWriter::Finish() {
try {
avro_output_stream_->FlushBuffer(); // we need flush buffer before close writer
writer_->close();
} catch (const ::avro::Exception& e) {
return Status::Invalid(fmt::format("avro writer close failed. {}", e.what()));
} catch (const std::exception& e) {
return Status::Invalid("avro writer close failed: {}", e.what());
} catch (...) {
return Status::Invalid("avro writer close failed: unknown exception");
}
return Status::OK();
}
Result<bool> AvroFormatWriter::ReachTargetSize(bool suggested_check, int64_t target_size) const {
return Status::NotImplemented("not support yet");
}
Status AvroFormatWriter::AddBatch(ArrowArray* batch) {
assert(batch);
PAIMON_ASSIGN_OR_RAISE_FROM_ARROW(std::shared_ptr<arrow::Array> arrow_array,
arrow::ImportArray(batch, data_type_));
PAIMON_ASSIGN_OR_RAISE(std::vector<::avro::GenericDatum> datums,
adaptor_->ConvertArrayToGenericDatums(arrow_array, avro_schema_));
try {
for (const auto& datum : datums) {
writer_->write(datum);
}
} catch (const ::avro::Exception& e) {
return Status::Invalid(fmt::format("avro writer add batch failed. {}", e.what()));
} catch (const std::exception& e) {
return Status::Invalid("avro writer add batch failed: {}", e.what());
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
lucasfang
reviewed
Jan 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Linked issue: #31
Performance implications of the current approach:
In avro's default impl
GenericDatum, it usestd::anyfor data store. When read and write, it usestd::any_castto cast to destination data type. It accounts for 40% to 50% of the performance hotspots.Detailed optimize strategy:
Refer to the implementation of Apache Iceberg C++: Directly decode Avro data to Arrow array builders without GenericDatum. Eliminates the GenericDatum intermediate layer by directly calling Avro decoder methods and immediately appending to Arrow builders.
More TODO:
Support avro write in next PR.
Tests
API and Format
Documentation