Skip to content

fix: use read schema field names when rebuilding non-partition struct#292

Closed
Smith-Cruise wants to merge 2 commits into
alibaba:mainfrom
Smith-Cruise:fix-schema-bug
Closed

fix: use read schema field names when rebuilding non-partition struct#292
Smith-Cruise wants to merge 2 commits into
alibaba:mainfrom
Smith-Cruise:fix-schema-bug

Conversation

@Smith-Cruise
Copy link
Copy Markdown
Contributor

Purpose

It looks like a typo, we need to reconstruct RecordBatch's schema from read_schema, not from data_schema.

Tests

API and Format

Documentation

Generative AI tooling

Signed-off-by: Smith Cruise <chendingchao1@126.com>
@lxy-9602
Copy link
Copy Markdown
Collaborator

Good catch! The inconsistency you spotted (using non_partition_data_schema name in one branch vs non_partition_read_schema in the other) indeed exists. However, it won't affect the final result because MappingFields is called right after CastNonPartitionArrayIfNeed and overwrites the field names using non_partition_read_schema[i].Name().

That said, I'd suggest a slightly different fix: change line 110 to:

casted_field_names.push_back(non_partition_info_.non_partition_data_schema[i].Name());

instead of changing line 115.

The reasoning is that CastNonPartitionArrayIfNeed should only be responsible for type casting, not for name mapping. Field name remapping (e.g. due to RENAME COLUMN) should be left entirely to MappingFields, which is the dedicated function for that purpose. This keeps a cleaner separation of concerns.

Could you also add a test case that exercises the interaction between RENAME COLUMN and TYPE CAST to verify correctness? Something like:

TEST_F(FieldMappingReaderTest, TestReadWithSchemaEvolutionRenameOnCastField) {
    std::vector<DataField> data_fields = {
        DataField(0, arrow::field("f0", arrow::utf8())),
        DataField(1, arrow::field("f1", arrow::int32())),
        DataField(2, arrow::field("f2", arrow::int32())),
        DataField(3, arrow::field("f3", arrow::int32())),
    };
    auto data_schema = DataField::ConvertDataFieldsToArrowSchema(data_fields);
    auto data_array = std::dynamic_pointer_cast<arrow::StructArray>(
        arrow::ipc::internal::json::ArrayFromJSON(arrow::struct_(data_schema->fields()),
                                                  R"([
        ["Bob", 100, 10, 1],
        ["Emily", 200, 20, 2],
        ["Alice", 300, 30, 3]
    ])")
            .ValueOrDie());

    std::vector<DataField> read_fields = {
        DataField(0, arrow::field("f0", arrow::utf8())),
        DataField(1, arrow::field("new_f1", arrow::int32())),
        DataField(2, arrow::field("f2", arrow::utf8())),
        DataField(3, arrow::field("new_f3", arrow::utf8())),
    };
    auto read_schema = DataField::ConvertDataFieldsToArrowSchema(read_fields);

    auto expected = std::dynamic_pointer_cast<arrow::StructArray>(
        arrow::ipc::internal::json::ArrayFromJSON(arrow::struct_(read_schema->fields()),
                                                  R"([
        ["Bob", 100, "10", "1"],
        ["Emily", 200, "20", "2"],
        ["Alice", 300, "30", "3"]
    ])")
            .ValueOrDie());

    CheckResult(data_schema, data_array, read_schema, /*predicate=*/nullptr,
                /*partition_keys=*/{}, BinaryRow::EmptyRow(), expected);
}

This covers the case where some fields are both renamed AND cast (f3→new_f3, int32→utf8), some are only renamed (f1→new_f1), and some are only cast (f2, int32→utf8), ensuring the fix doesn't break any combination.

@lxy-9602
Copy link
Copy Markdown
Collaborator

I've already fixed this issue and added the corresponding tests in #293 , which has been merged. Closing this PR to avoid duplication.
Please feel free to comment if you have any questions or concerns.

@lxy-9602 lxy-9602 closed this May 20, 2026
@Smith-Cruise Smith-Cruise deleted the fix-schema-bug branch May 20, 2026 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants