-
Notifications
You must be signed in to change notification settings - Fork 0
20063: fix: regression of dict_id in physical plan proto
#218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a regression that caused Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis change improves IPC dictionary handling in serialization and deserialization. The from_proto module now reconstructs dictionary IDs deterministically through IPC schema round-tripping, replacing map-based dictionary handling with explicit per-dictionary decoding and ID accumulation. The to_proto module adds pre-population of the IPC dictionary tracker before encoding scalar nested values. A new test verifies roundtrip behavior for complex nested types containing dictionary fields within filter predicates. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Review: Fix regression of
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request fixes a regression related to dict_id serialization in the physical plan protobuf format. The fix is well-implemented and involves two main parts: pre-populating the DictionaryTracker during serialization to ensure dictionary IDs are assigned, and reconstructing the schema with these IDs during deserialization via an IPC round-trip. The change to use read_dictionary is also a good cleanup. The new test case effectively covers the failing scenario. Overall, the changes are correct and improve the robustness of dictionary serialization. I have one suggestion to make the error handling more idiomatic.
| root_as_message(encoded_schema.ipc_message.as_slice()).map_err( | ||
| |e| { | ||
| Error::General(format!( | ||
| "Error IPC schema message while deserializing ScalarValue::List: {e}" | ||
| )) | ||
| }, | ||
| )?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current error handling loses the original ArrowError type by wrapping it in Error::General. It's better to convert it to a DataFusionError to preserve the error type and add context, which improves debuggability. This can be done using arrow_datafusion_err! and the .context() method.
This pattern can be applied to similar root_as_message calls in this file (e.g., lines 443-447 and 463-467).
root_as_message(encoded_schema.ipc_message.as_slice()).map_err(|e| {
arrow_datafusion_err!(e)
.context("Error IPC schema message while deserializing ScalarValue::List")
})?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:good-to-have; category:bug; feedback:The Gemini AI reviewer is correct! By using arrow_datafusion_err!() macro the original error type will be preserved and also its stacktrace will be collected. Creating a new custom Error will use the Display view of the error and will not init the backtrace from the correct call. But it uses the custom Error for consistency with the rest of the code in this file.
🤖 Augment PR SummarySummary: Fixes a regression in physical plan protobuf round-tripping involving dictionary IDs for nested scalar values. Changes:
Technical Notes: This keeps protobuf schemas (which don’t store IPC dictionary IDs) interoperable with Arrow IPC dictionary batches by re-deriving IDs consistently on both encode and decode. 🤖 Was this summary useful? React with 👍 or 👎 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review completed. No suggestions at this time.
Comment augment review to trigger a new review at any time.
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The specific type in the error message is not always correct. It would be wrong when the ScalarValue is a Map or Struct or even LargeList. It would be better to remove the "::List" from the error message and use a more generic message. |
20063: To review by AI