Conversation

@martin-augment
Owner

20063: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a regression that caused dict_ids to be lost when ScalarValue::List instances, particularly those containing dictionary arrays, were serialized to and deserialized from physical plan protobufs. The changes ensure that dictionary IDs are consistently managed across the serialization boundary, preventing data corruption or incorrect interpretation of dictionary-encoded data, thereby improving the robustness of physical plan representation.

Highlights

  • Regression Fix: Addressed a regression where dict_ids were not correctly preserved during the serialization and deserialization of ScalarValue::List containing dictionary arrays in the physical plan protobuf.
  • Serialization Enhancement: Modified the serialization process in to_proto to pre-allocate dictionary IDs by encoding the schema first, ensuring consistent ID assignment required by the IPC writer.
  • Deserialization Improvement: Updated the deserialization logic in from_proto to reconstruct dictionary IDs deterministically by round-tripping the schema through IPC and utilizing the read_dictionary function for robust dictionary batch processing.
  • New Test Case: Introduced a new roundtrip test case (roundtrip_call_null_scalar_struct_dict) to validate the fix for ScalarValue::List with nested dictionary types, ensuring the issue is resolved and prevented from recurring.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Jan 30, 2026

Walkthrough

This change improves IPC dictionary handling in serialization and deserialization. The from_proto module now reconstructs dictionary IDs deterministically through IPC schema round-tripping, replacing map-based dictionary handling with explicit per-dictionary decoding and ID accumulation. The to_proto module adds pre-population of the IPC dictionary tracker before encoding scalar nested values. A new test verifies roundtrip behavior for complex nested types containing dictionary fields within filter predicates.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Jan 30, 2026

PR Review: Fix regression of dict_id in physical plan proto

Summary

This PR addresses a regression in dictionary ID handling during protobuf serialization/deserialization of physical plans. The fix ensures that dictionary batch IDs are correctly reconstructed when converting from protobuf format, which is critical for scalar values containing dictionary-encoded data.


Positive Aspects

Root Cause Fix: The PR correctly identifies that protobuf Schema doesn't preserve IPC dictionary batch IDs and implements a deterministic reconstruction by round-tripping through IPC encoding.

Symmetry: The changes maintain symmetry between encoding (to_proto) and decoding (from_proto) paths - both now pre-populate the dictionary tracker with schema-derived IDs.

Test Coverage: The new test roundtrip_call_null_scalar_struct_dict specifically targets the regression case with dictionary-encoded fields inside struct types.

Use of Arrow APIs: Correctly uses read_dictionary instead of manual dictionary reconstruction, which is more robust and handles edge cases properly.


Code Quality Observations

datafusion/proto-common/src/from_proto/mod.rs:414-441
The schema round-trip logic is well-commented and necessary. However, there's a minor inefficiency:

  • The schema is round-tripped through IPC encoding/decoding just to assign dictionary IDs
  • This happens for every scalar value deserialization
  • Consider if there's a way to cache or optimize this for performance-sensitive paths

datafusion/proto-common/src/from_proto/mod.rs:457-485
The refactoring from .map().collect() to an explicit loop with read_dictionary is correct and cleaner:

  • ✅ Properly uses read_dictionary which mutates dict_by_id in place
  • ✅ Better error handling with context
  • ✅ More readable code flow

datafusion/proto-common/src/to_proto/mod.rs:1028-1034
Good fix - pre-populating the dictionary tracker before encoding ensures consistent dictionary IDs:

  • ✅ The comment clearly explains why this is needed
  • ✅ Minimal change to existing logic

Potential Issues & Suggestions

1. Performance Concern (Minor)

Location: from_proto/mod.rs:417-441

The schema round-trip through IPC encoding is performed for every scalar value deserialization. For workloads with many scalar values (especially in filter predicates), this could add overhead.

Suggestion: Consider caching the IPC-encoded schema if the same schema is used repeatedly, or document the performance characteristics if this is unavoidable.

2. Error Message Inconsistency

Location: from_proto/mod.rs:430, 436, 465

Error messages still reference "ScalarValue::List" even though the code also handles StructValue, MapValue, etc.

Suggestion: Update error messages to be more generic:

"Error IPC schema message while deserializing nested ScalarValue: {e}"

3. Test Coverage Gap

Location: roundtrip_physical_plan.rs:2568-2585

The test only covers dictionary-in-struct case. The fix also affects List, LargeList, FixedSizeList, and Map types.

Suggestion: Consider adding test cases for:

  • Dictionary inside List
  • Dictionary inside Map
  • Nested dictionaries (struct containing list containing dictionary)

4. Missing Documentation

The PR description mentions issue 20063 but the issue isn't accessible.

Suggestion: Add more context in the commit message or PR description about:

  • What specific regression occurred
  • What behavior was broken
  • How to reproduce the original issue

Security Considerations

✅ No security concerns identified. The changes:

  • Use safe Arrow IPC APIs
  • Don't introduce unsafe code
  • Properly handle buffer boundaries
  • Include appropriate error handling

Performance Considerations

⚠️ Minor concern: The schema round-trip adds computational overhead. Profile this in scenarios with:

  • High volume of scalar value serialization
  • Complex nested dictionary types
  • Large schemas

Best Practices

✅ Follows Rust idioms and DataFusion patterns
✅ Appropriate use of error contexts
✅ Good code comments explaining non-obvious logic
⚠️ Could benefit from more comprehensive test coverage


Recommendation

APPROVE with suggestions for follow-up improvements.

The fix correctly addresses the dictionary ID regression and maintains code quality. The minor suggestions above would enhance robustness but don't block merging.

Action Items for Follow-up (non-blocking):

  1. Profile performance impact of schema round-tripping
  2. Add test cases for other dictionary-containing nested types
  3. Update error messages for consistency
  4. Consider schema caching optimization if profiling shows impact

Review generated with assistance from Claude Code


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a regression related to dict_id serialization in the physical plan protobuf format. The fix is well-implemented and involves two main parts: pre-populating the DictionaryTracker during serialization to ensure dictionary IDs are assigned, and reconstructing the schema with these IDs during deserialization via an IPC round-trip. The change to use read_dictionary is also a good cleanup. The new test case effectively covers the failing scenario. Overall, the changes are correct and improve the robustness of dictionary serialization. I have one suggestion to make the error handling more idiomatic.

Comment on lines +427 to +433
root_as_message(encoded_schema.ipc_message.as_slice()).map_err(
    |e| {
        Error::General(format!(
            "Error IPC schema message while deserializing ScalarValue::List: {e}"
        ))
    },
)?;


medium

The current error handling loses the original ArrowError type by wrapping it in Error::General. It's better to convert it to a DataFusionError to preserve the error type and add context, which improves debuggability. This can be done using arrow_datafusion_err! and the .context() method.

This pattern can be applied to similar root_as_message calls in this file (e.g., lines 443-447 and 463-467).

                        root_as_message(encoded_schema.ipc_message.as_slice()).map_err(|e| {
                            arrow_datafusion_err!(e)
                                .context("Error IPC schema message while deserializing ScalarValue::List")
                        })?

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! Using the arrow_datafusion_err!() macro preserves the original error type and captures its backtrace at the point of the call. Creating a new custom Error only records the Display rendering of the error and does not initialize the backtrace from the correct call site. The PR keeps the custom Error, however, for consistency with the rest of the code in this file.

@augmentcode

augmentcode bot commented Jan 30, 2026

🤖 Augment PR Summary

Summary: Fixes a regression in physical plan protobuf round-tripping involving dictionary IDs for nested scalar values.

Changes:

  • When encoding nested scalars (List/Struct/Map, etc.), pre-encodes the Arrow schema to populate DictionaryTracker with stable dictionary IDs before writing IPC dictionaries.
  • When decoding nested scalars, reconstructs the Arrow schema’s IPC dictionary IDs deterministically by round-tripping the protobuf schema through Arrow IPC schema encoding/decoding.
  • Switches dictionary decoding to Arrow’s read_dictionary helper to populate the dictionaries-by-id map.
  • Adds a regression test covering a null struct scalar containing a dictionary-encoded field in a physical plan.

Technical Notes: This keeps protobuf schemas (which don’t store IPC dictionary IDs) interoperable with Arrow IPC dictionary batches by re-deriving IDs consistently on both encode and decode.

🤖 Was this summary useful? React with 👍 or 👎


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

@martin-augment
Owner Author

2. Error Message Inconsistency

Location: from_proto/mod.rs:430, 436, 465

Error messages still reference "ScalarValue::List" even though the code also handles StructValue, MapValue, etc.

Suggestion: Update error messages to be more generic:

"Error IPC schema message while deserializing nested ScalarValue: {e}"

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The specific type in the error message is not always accurate: it would be wrong when the ScalarValue is a Map, Struct, or even LargeList. It would be better to remove the "::List" from the error message and use a more generic message.
