Conversation

@martin-augment
Owner

20063: To review by AI

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a regression that caused dict_ids to be lost when ScalarValue::List instances, particularly those containing dictionary arrays, were serialized to and deserialized from physical plan protobufs. The changes ensure that dictionary IDs are consistently managed across the serialization boundary, preventing data corruption or incorrect interpretation of dictionary-encoded data, thereby improving the robustness of physical plan representation.

Highlights

  • Regression Fix: Addressed a regression where dict_ids were not correctly preserved during the serialization and deserialization of ScalarValue::List containing dictionary arrays in the physical plan protobuf.
  • Serialization Enhancement: Modified the serialization process in to_proto to pre-allocate dictionary IDs by encoding the schema first, ensuring consistent ID assignment required by the IPC writer.
  • Deserialization Improvement: Updated the deserialization logic in from_proto to reconstruct dictionary IDs deterministically by round-tripping the schema through IPC and utilizing the read_dictionary function for robust dictionary batch processing.
  • New Test Case: Introduced a new roundtrip test case (roundtrip_call_null_scalar_struct_dict) to validate the fix for ScalarValue::List with nested dictionary types, ensuring the issue is resolved and prevented from recurring.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai

coderabbitai bot commented Jan 30, 2026

Walkthrough

This change improves IPC dictionary handling in serialization and deserialization. The from_proto module now reconstructs dictionary IDs deterministically through IPC schema round-tripping, replacing map-based dictionary handling with explicit per-dictionary decoding and ID accumulation. The to_proto module adds pre-population of the IPC dictionary tracker before encoding scalar nested values. A new test verifies roundtrip behavior for complex nested types containing dictionary fields within filter predicates.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude

claude bot commented Jan 30, 2026

PR Review: Fix regression of dict_id in physical plan proto

Summary

This PR addresses a regression in dictionary ID handling during protobuf serialization/deserialization of physical plans. The fix ensures that dictionary batch IDs are correctly reconstructed when converting from protobuf format, which is critical for scalar values containing dictionary-encoded data.


Positive Aspects

Root Cause Fix: The PR correctly identifies that protobuf Schema doesn't preserve IPC dictionary batch IDs and implements a deterministic reconstruction by round-tripping through IPC encoding.

Symmetry: The changes maintain symmetry between encoding (to_proto) and decoding (from_proto) paths - both now pre-populate the dictionary tracker with schema-derived IDs.

Test Coverage: The new test roundtrip_call_null_scalar_struct_dict specifically targets the regression case with dictionary-encoded fields inside struct types.

Use of Arrow APIs: Correctly uses read_dictionary instead of manual dictionary reconstruction, which is more robust and handles edge cases properly.


Code Quality Observations

datafusion/proto-common/src/from_proto/mod.rs:414-441
The schema round-trip logic is well-commented and necessary. However, there's a minor inefficiency:

  • The schema is round-tripped through IPC encoding/decoding just to assign dictionary IDs
  • This happens for every scalar value deserialization
  • Consider if there's a way to cache or optimize this for performance-sensitive paths

datafusion/proto-common/src/from_proto/mod.rs:457-485
The refactoring from .map().collect() to an explicit loop with read_dictionary is correct and cleaner:

  • ✅ Properly uses read_dictionary which mutates dict_by_id in place
  • ✅ Better error handling with context
  • ✅ More readable code flow

datafusion/proto-common/src/to_proto/mod.rs:1028-1034
Good fix - pre-populating the dictionary tracker before encoding ensures consistent dictionary IDs:

  • ✅ The comment clearly explains why this is needed
  • ✅ Minimal change to existing logic

Potential Issues & Suggestions

1. Performance Concern (Minor)

Location: from_proto/mod.rs:417-441

The schema round-trip through IPC encoding is performed for every scalar value deserialization. For workloads with many scalar values (especially in filter predicates), this could add overhead.

Suggestion: Consider caching the IPC-encoded schema if the same schema is used repeatedly, or document the performance characteristics if this is unavoidable.

2. Error Message Inconsistency

Location: from_proto/mod.rs:430, 436, 465

Error messages still reference "ScalarValue::List" even though the code also handles StructValue, MapValue, etc.

Suggestion: Update error messages to be more generic:

"Error IPC schema message while deserializing nested ScalarValue: {e}"

3. Test Coverage Gap

Location: roundtrip_physical_plan.rs:2568-2585

The test only covers dictionary-in-struct case. The fix also affects List, LargeList, FixedSizeList, and Map types.

Suggestion: Consider adding test cases for:

  • Dictionary inside List
  • Dictionary inside Map
  • Nested dictionaries (struct containing list containing dictionary)

4. Missing Documentation

The PR description mentions issue 20063 but the issue isn't accessible.

Suggestion: Add more context in the commit message or PR description about:

  • What specific regression occurred
  • What behavior was broken
  • How to reproduce the original issue

Security Considerations

✅ No security concerns identified. The changes:

  • Use safe Arrow IPC APIs
  • Don't introduce unsafe code
  • Properly handle buffer boundaries
  • Include appropriate error handling

Performance Considerations

⚠️ Minor concern: The schema round-trip adds computational overhead. Profile this in scenarios with:

  • High volume of scalar value serialization
  • Complex nested dictionary types
  • Large schemas

Best Practices

✅ Follows Rust idioms and DataFusion patterns
✅ Appropriate use of error contexts
✅ Good code comments explaining non-obvious logic
⚠️ Could benefit from more comprehensive test coverage


Recommendation

APPROVE with suggestions for follow-up improvements.

The fix correctly addresses the dictionary ID regression and maintains code quality. The minor suggestions above would enhance robustness but don't block merging.

Action Items for Follow-up (non-blocking):

  1. Profile performance impact of schema round-tripping
  2. Add test cases for other dictionary-containing nested types
  3. Update error messages for consistency
  4. Consider schema caching optimization if profiling shows impact

Review generated with assistance from Claude Code


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes a regression related to dict_id serialization in the physical plan protobuf format. The fix is well-implemented and involves two main parts: pre-populating the DictionaryTracker during serialization to ensure dictionary IDs are assigned, and reconstructing the schema with these IDs during deserialization via an IPC round-trip. The change to use read_dictionary is also a good cleanup. The new test case effectively covers the failing scenario. Overall, the changes are correct and improve the robustness of dictionary serialization. I have one suggestion to make the error handling more idiomatic.

Comment on lines +427 to +433
root_as_message(encoded_schema.ipc_message.as_slice()).map_err(
    |e| {
        Error::General(format!(
            "Error IPC schema message while deserializing ScalarValue::List: {e}"
        ))
    },
)?;


medium

The current error handling loses the original ArrowError type by wrapping it in Error::General. It's better to convert it to a DataFusionError to preserve the error type and add context, which improves debuggability. This can be done using arrow_datafusion_err! and the .context() method.

This pattern can be applied to similar root_as_message calls in this file (e.g., lines 443-447 and 463-467).

                        root_as_message(encoded_schema.ipc_message.as_slice()).map_err(|e| {
                            arrow_datafusion_err!(e)
                                .context("Error IPC schema message while deserializing ScalarValue::List")
                        })?

Owner Author


value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! Using the arrow_datafusion_err!() macro preserves the original error type and captures its backtrace at the point of the call. Creating a new custom Error only records the Display rendering of the error and does not initialize the backtrace from the correct call site. The PR keeps the custom Error, however, for consistency with the rest of the code in this file.

@augmentcode

augmentcode bot commented Jan 30, 2026

🤖 Augment PR Summary

Summary: Fixes a regression in physical plan protobuf round-tripping involving dictionary IDs for nested scalar values.

Changes:

  • When encoding nested scalars (List/Struct/Map, etc.), pre-encodes the Arrow schema to populate DictionaryTracker with stable dictionary IDs before writing IPC dictionaries.
  • When decoding nested scalars, reconstructs the Arrow schema’s IPC dictionary IDs deterministically by round-tripping the protobuf schema through Arrow IPC schema encoding/decoding.
  • Switches dictionary decoding to Arrow’s read_dictionary helper to populate the dictionaries-by-id map.
  • Adds a regression test covering a null struct scalar containing a dictionary-encoded field in a physical plan.

Technical Notes: This keeps protobuf schemas (which don’t store IPC dictionary IDs) interoperable with Arrow IPC dictionary batches by re-deriving IDs consistently on both encode and decode.

🤖 Was this summary useful? React with 👍 or 👎


@augmentcode augmentcode bot left a comment


Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

@martin-augment
Owner Author

2. Error Message Inconsistency

Location: from_proto/mod.rs:430, 436, 465

Error messages still reference "ScalarValue::List" even though the code also handles StructValue, MapValue, etc.

Suggestion: Update error messages to be more generic:

"Error IPC schema message while deserializing nested ScalarValue: {e}"

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The specific type in the error message is not always accurate: it would be wrong when the ScalarValue is a Map, Struct, or even LargeList. It would be better to remove the "::List" from the error message and use a more generic message.
