fix: use read schema field names when rebuilding non-partition struct#292
fix: use read schema field names when rebuilding non-partition struct#292Smith-Cruise wants to merge 2 commits into
Conversation
|
Good catch! The inconsistency you spotted (using That said, I'd suggest a slightly different fix: change line 110 to: casted_field_names.push_back(non_partition_info_.non_partition_data_schema[i].Name());instead of changing line 115. The reasoning is that Could you also add a test case that exercises the interaction between RENAME COLUMN and TYPE CAST to verify correctness? Something like: TEST_F(FieldMappingReaderTest, TestReadWithSchemaEvolutionRenameOnCastField) {
std::vector<DataField> data_fields = {
DataField(0, arrow::field("f0", arrow::utf8())),
DataField(1, arrow::field("f1", arrow::int32())),
DataField(2, arrow::field("f2", arrow::int32())),
DataField(3, arrow::field("f3", arrow::int32())),
};
auto data_schema = DataField::ConvertDataFieldsToArrowSchema(data_fields);
auto data_array = std::dynamic_pointer_cast<arrow::StructArray>(
arrow::ipc::internal::json::ArrayFromJSON(arrow::struct_(data_schema->fields()),
R"([
["Bob", 100, 10, 1],
["Emily", 200, 20, 2],
["Alice", 300, 30, 3]
])")
.ValueOrDie());
std::vector<DataField> read_fields = {
DataField(0, arrow::field("f0", arrow::utf8())),
DataField(1, arrow::field("new_f1", arrow::int32())),
DataField(2, arrow::field("f2", arrow::utf8())),
DataField(3, arrow::field("new_f3", arrow::utf8())),
};
auto read_schema = DataField::ConvertDataFieldsToArrowSchema(read_fields);
auto expected = std::dynamic_pointer_cast<arrow::StructArray>(
arrow::ipc::internal::json::ArrayFromJSON(arrow::struct_(read_schema->fields()),
R"([
["Bob", 100, "10", "1"],
["Emily", 200, "20", "2"],
["Alice", 300, "30", "3"]
])")
.ValueOrDie());
CheckResult(data_schema, data_array, read_schema, /*predicate=*/nullptr,
/*partition_keys=*/{}, BinaryRow::EmptyRow(), expected);
}This covers the case where some fields are both renamed AND cast (f3→new_f3, int32→utf8), some are only renamed (f1→new_f1), and some are only cast (f2, int32→utf8), ensuring the fix doesn't break any combination. |
|
I've already fixed this issue and added the corresponding tests in #293 , which has been merged. Closing this PR to avoid duplication. |
Purpose
It looks like a typo, we need to reconstruct RecordBatch's schema from read_schema, not from data_schema.
Tests
API and Format
Documentation
Generative AI tooling