Variant shredding #2
Conversation
# Which issue does this PR close?
- Related to apache#7395
- Closes apache#7495
- Closes apache#7377

# Rationale for this change
Let's update tonic to the latest. Given the open and unresolved questions on @rmn-boiko's PR apache#7377 from @Xuanwo and @sundy-li, I thought a new PR would result in a faster resolution.

# What changes are included in this PR?
This PR is based on apache#7495 from @MichaelScofield -- I resolved some merge conflicts and updated Cargo.toml in the integration tests.

# Are these changes tested?
Yes, by CI

# Are there any user-facing changes?
New dependency version

---------

Co-authored-by: LFC <[email protected]>
…pache#7922)

# Which issue does this PR close?
- Part of apache#7896

# Rationale for this change
In apache#7896, we saw that inserting a large number of field names takes a long time -- in this case ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront, which avoids having to reallocate and rehash the entire `IndexSet` during field name insertion. Using this method requires at least 2 string allocations for each field name -- 1 to declare field names upfront and 1 to insert the actual field name during object building.

This PR adds a new method `with_field_name_capacity` which allows you to reserve space in the metadata builder without needing to allocate the field names themselves upfront. In this case, we see a modest performance improvement when inserting the field names during object building.

Before: [Screenshot 2025-07-13 at 12 08 43 PM](https://github.com/user-attachments/assets/6ef0d9fe-1e08-4d3a-8f6b-703de550865c)

After: [Screenshot 2025-07-13 at 12 08 55 PM](https://github.com/user-attachments/assets/2faca4cb-0a51-441b-ab6c-5baa1dae84b3)
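As a rough illustration of the difference (a minimal sketch: `with_field_name_capacity` comes from the description above, the surrounding builder calls follow the existing parquet-variant builder API, and exact signatures are assumed):

```rust
use parquet_variant::VariantBuilder;

fn main() {
    // Reserve capacity for 3 field names up front; the name strings
    // themselves are only allocated when the fields are inserted below.
    let mut builder = VariantBuilder::new().with_field_name_capacity(3);

    let mut obj = builder.new_object();
    obj.insert("a", 1i64);
    obj.insert("b", 2i64);
    obj.insert("c", 3i64);
    obj.finish();

    let (_metadata, _value) = builder.finish();
}
```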
…che#7914)

# Which issue does this PR close?
- Fixes apache#7907

# Rationale for this change
Trying to append `VariantObject`s or `VariantList`s directly on the `VariantBuilder` currently panics.

# Changes to the public API
`VariantBuilder` now has these additional methods (a hedged usage sketch follows below):
- `append_object`, will panic if shallow validation fails or the object has duplicate field names
- `try_append_object`, will perform full validation on the object before appending
- `append_list`, will panic if shallow validation fails
- `try_append_list`, will perform full validation on the list before appending

---------

Co-authored-by: Andrew Lamb <[email protected]>
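A sketch of how this might look in practice (the `append_object`/`try_append_object` names come from the list above; the surrounding construction calls are assumptions):

```rust
use parquet_variant::{Variant, VariantBuilder};

fn main() {
    // Build an object once...
    let mut src = VariantBuilder::new();
    let mut obj = src.new_object();
    obj.insert("x", 1i64);
    obj.finish();
    let (metadata, value) = src.finish();

    // ...then append it wholesale to another builder instead of
    // re-inserting every field by hand.
    let mut dst = VariantBuilder::new();
    if let Variant::Object(object) = Variant::try_new(&metadata, &value).unwrap() {
        dst.append_object(object); // panics on shallow-validation failure
        // dst.try_append_object(object) would perform full validation instead
    }
}
```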
# Which issue does this PR close?
- Closes apache#7893

# What changes are included in this PR?
In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define a path to access a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a `VariantArray`. In the future, this would also cast the values to a specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.

Current limitations:
- It can only return another VariantArray. Casts are not implemented yet.
- Only top-level object/list access is supported. It panics on finding a nested object/list. Needs apache#7914 to fix this.
- Perf is a TODO.

(A rough sketch of the kernel's shape appears after this description.)

# Are these changes tested?
Some basic unit tests are added.

# Are there any user-facing changes?
Yes

---------

Co-authored-by: Andrew Lamb <[email protected]>
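The kernel's shape, very roughly (a sketch only: `VariantArrayBuilder` and the method names here are assumptions for illustration, not the PR's exact API):

```rust
use parquet_variant::path::VariantPath;
use parquet_variant_compute::{VariantArray, VariantArrayBuilder};

// Apply `Variant::get_path` row by row, appending null where the path
// does not resolve. Casting to non-variant types is left as future work.
fn variant_get_sketch(input: &VariantArray, path: &VariantPath) -> VariantArray {
    let mut out = VariantArrayBuilder::new(input.len());
    for i in 0..input.len() {
        match input.value(i).get_path(path) {
            Some(v) => out.append_variant(v),
            None => out.append_null(),
        }
    }
    out.build()
}
```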
woohoo!
…he#7774)

# Which issue does this PR close?
- Part of apache#7762

# Rationale for this change
As part of apache#7762 I want to optimize applying filters by adding a new code path. To ensure that works well, let's ensure the filtered code path is well covered with tests.

# What changes are included in this PR?
1. Add tests for filtering batches with 0.01%, 1%, 10% and 90% selectivity and varying data types

# Are these changes tested?
Only tests, no functional changes

# Are there any user-facing changes?
…ath (apache#7942)

# Which issue does this PR close?
- Follow on to apache#7919

# Rationale for this change
While reviewing apache#7919 from @Samyak2 I found I wanted to write some additional tests directly for `Variant::get_path`. When I started doing that I found it was somewhat awkward to write examples, so I added some new conversion routines to make it easier.

# What changes are included in this PR?
1. Add doc examples (and thus tests) of `VariantGet` and `VariantPath`
2. Add more documentation

# Are these changes tested?
Yes, by doc examples which run in CI

# Are there any user-facing changes?
New conversion routines and documentation.
# Which issue does this PR close?
- Closes apache#7901.

---------

Signed-off-by: codephage2020 <[email protected]>
…pache#7931)

# Which issue does this PR close?
- Closes #NNN. This is minor but I can create an issue if needed.

# Rationale for this change
`make_builder` currently errors with `Data type Utf8View is not currently supported`.

# What changes are included in this PR?
Support `DataType::Utf8View` and `DataType::BinaryView` in `make_builder`.

# Are these changes tested?
Only via the exhaustive enum match. It doesn't look like there are any tests for `make_builder` in that file?

# Are there any user-facing changes?
No
# Rationale for this change
- Closes apache#7948

This PR introduces a custom implementation of `PartialEq` for variant objects. According to the spec, field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values. Instead of comparing the raw bytes of two variant objects, this implementation recursively checks whether the field values are equal -- regardless of their order.
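In other words, two objects that only differ in insertion order should now compare equal (a minimal sketch using the existing builder API; exact construction details assumed):

```rust
use parquet_variant::{Variant, VariantBuilder};

fn main() {
    let mut b1 = VariantBuilder::new();
    let mut o1 = b1.new_object();
    o1.insert("a", 1i64);
    o1.insert("b", 2i64);
    o1.finish();
    let (m1, v1) = b1.finish();

    // Same fields, inserted in the opposite order.
    let mut b2 = VariantBuilder::new();
    let mut o2 = b2.new_object();
    o2.insert("b", 2i64);
    o2.insert("a", 1i64);
    o2.finish();
    let (m2, v2) = b2.finish();

    // Field-by-field comparison, not raw byte comparison.
    assert_eq!(
        Variant::try_new(&m1, &v1).unwrap(),
        Variant::try_new(&m2, &v2).unwrap()
    );
}
```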
Not quite sure the right way to review this -- is it better to wait for it to become an arrow-rs PR instead?
```diff
@@ -149,6 +150,154 @@ impl VariantArray {
         // spec says fields order is not guaranteed, so we search by name
         self.inner.column_by_name("value").unwrap()
     }

     /// Get the metadata bytes for a specific index
     pub fn metadata(&self, index: usize) -> &[u8] {
```
Suggested change:
```diff
-    pub fn metadata(&self, index: usize) -> &[u8] {
+    pub fn metadata_bytes(&self, index: usize) -> &[u8] {
```
?
```rust
for element in path.elements() {
    match element {
        crate::field_operations::VariantPathElement::Field(field_name) => {
            current_variant = current_variant.get_object_field(field_name)?;
        }
        crate::field_operations::VariantPathElement::Index(idx) => {
            current_variant = current_variant.get_list_element(*idx)?;
        }
    }
}

Some(current_variant)
```
I think this is just an `Iterator::try_fold`?
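For reference, the loop above folds down to something like this (assuming `path.elements()` returns a slice; behavior unchanged):

```rust
// try_fold over Option short-circuits at the first missing field or
// index, exactly like the `?` operator did in the loop version.
let result = path.elements().iter().try_fold(current_variant, |v, element| {
    match element {
        crate::field_operations::VariantPathElement::Field(field_name) => {
            v.get_object_field(field_name)
        }
        crate::field_operations::VariantPathElement::Index(idx) => {
            v.get_list_element(*idx)
        }
    }
});
```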
```rust
match FieldOperations::remove_field_bytes(
    self.metadata(i),
    self.value_bytes(i),
    field_name,
)? {
    Some(new_value) => {
        builder.append_variant_buffers(self.metadata(i), &new_value);
    }
    None => {
        // Field didn't exist, use original value
        builder.append_variant_buffers(self.metadata(i), self.value_bytes(i));
    }
}
```
I think there's some redundancy here?
Suggested change (note: `as_deref`, so the `Option<Vec<u8>>` borrows as `&[u8]`):
```diff
-match FieldOperations::remove_field_bytes(
-    self.metadata(i),
-    self.value_bytes(i),
-    field_name,
-)? {
-    Some(new_value) => {
-        builder.append_variant_buffers(self.metadata(i), &new_value);
-    }
-    None => {
-        // Field didn't exist, use original value
-        builder.append_variant_buffers(self.metadata(i), self.value_bytes(i));
-    }
-}
+let new_value = FieldOperations::remove_field_bytes(
+    self.metadata(i),
+    self.value_bytes(i),
+    field_name,
+)?;
+// Use original value if the field didn't exist
+let new_value = new_value.as_deref().unwrap_or_else(|| self.value_bytes(i));
+builder.append_variant_buffers(self.metadata(i), new_value);
```
(again below)
```rust
/// Primitive type variants
#[derive(Debug, Clone, PartialEq)]
pub enum PrimitiveType {
```
It seems like several of the types here duplicate similar ones defined elsewhere? Is there a way to harmonize them?
(a lot of redundant logic as well)
```rust
match primitive_type {
    0 => Ok(PrimitiveType::Null),
    1 => Ok(PrimitiveType::True),
```
Honest question: is it more readable to have a bunch of `Ok(...)`? Or to pull out the result and wrap it once?

```rust
let result = match primitive_type {
    0 => PrimitiveType::Null,
    1 => PrimitiveType::True,
    ...
    _ => {
        return Err(ArrowError::InvalidArgumentError(format!(...)));
    }
};
Ok(result)
```
```rust
if length > 13 {
    return Err(ArrowError::InvalidArgumentError(format!(
        "Short string length {} exceeds maximum of 13",
        length
```
Isn't the string length a 6-bit value? The spec says:

> The "short string" basic type may be used as an optimization to fold string length into the type byte for strings less than 64 bytes.
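If so, the decode is just the upper six bits of the type byte, which makes a bound of 13 surprising (a sketch, assuming the spec's header layout with the 2-bit basic type in the low bits):

```rust
// The short-string length is the 6-bit value header, so it is
// 0..=63 by construction; a `> 13` check rejects lengths the spec allows.
fn short_string_len(type_byte: u8) -> usize {
    ((type_byte >> 2) & 0x3F) as usize
}
```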
```rust
    | PrimitiveType::TimestampNtz
    | PrimitiveType::TimestampLtz => 8,
    PrimitiveType::Decimal16 => 16,
    PrimitiveType::Binary | PrimitiveType::String => 0, // Variable length, need to read from data
```
I wonder if this method should return `Option<usize>` to distinguish null/true/false from binary/string?
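Something like this, perhaps (a minimal illustration with the enum reduced to a few variants):

```rust
enum PrimitiveType { Null, True, False, Decimal16, Binary, String }

// None marks variable-length types, so a genuine zero-byte value
// (Null/True/False) is no longer conflated with "length lives in the data".
fn fixed_data_len(t: &PrimitiveType) -> Option<usize> {
    match t {
        PrimitiveType::Null | PrimitiveType::True | PrimitiveType::False => Some(0),
        PrimitiveType::Decimal16 => Some(16),
        PrimitiveType::Binary | PrimitiveType::String => None,
    }
}
```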
# Which issue does this PR close?
Optimize the `partition_validity` function used in sort kernels:
- Preallocate vectors based on known null counts
- Avoid dynamic dispatch by calling `NullBuffer::is_valid` instead of `Array::is_valid`
- Avoid capacity checks inside the loop by writing to `spare_capacity_mut` instead of using `push` (a hedged sketch of this pattern follows below)
- Closes apache#7936.

# Rationale for this change
Microbenchmark results for `sort_kernels` compared to `main`, only looking at benchmarks matching "nulls to indices":

```
sort i32 nulls to indices 2^10
        time:   [4.9325 µs 4.9370 µs 4.9422 µs]
        change: [−20.303% −20.133% −19.974%] (p = 0.00 < 0.05)
        Performance has improved.

sort i32 nulls to indices 2^12
        time:   [20.096 µs 20.209 µs 20.327 µs]
        change: [−26.819% −26.275% −25.697%] (p = 0.00 < 0.05)
        Performance has improved.

sort f32 nulls to indices 2^12
        time:   [26.329 µs 26.366 µs 26.406 µs]
        change: [−29.487% −29.331% −29.146%] (p = 0.00 < 0.05)
        Performance has improved.

sort string[0-10] nulls to indices 2^12
        time:   [70.667 µs 70.762 µs 70.886 µs]
        change: [−20.057% −19.935% −19.819%] (p = 0.00 < 0.05)
        Performance has improved.

sort string[0-100] nulls to indices 2^12
        time:   [101.98 µs 102.44 µs 102.99 µs]
        change: [−0.3501% +0.0835% +0.4918%] (p = 0.71 > 0.05)
        No change in performance detected.

sort string[0-400] nulls to indices 2^12
        time:   [84.952 µs 85.024 µs 85.102 µs]
        change: [−5.3969% −4.9827% −4.6421%] (p = 0.00 < 0.05)
        Performance has improved.

sort string[10] nulls to indices 2^12
        time:   [72.486 µs 72.664 µs 72.893 µs]
        change: [−14.937% −14.781% −14.599%] (p = 0.00 < 0.05)
        Performance has improved.

sort string[100] nulls to indices 2^12
        time:   [71.354 µs 71.606 µs 71.902 µs]
        change: [−17.207% −16.795% −16.373%] (p = 0.00 < 0.05)
        Performance has improved.

sort string[1000] nulls to indices 2^12
        time:   [73.088 µs 73.195 µs 73.311 µs]
        change: [−16.705% −16.599% −16.483%] (p = 0.00 < 0.05)
        Performance has improved.

sort string_view[10] nulls to indices 2^12
        time:   [32.592 µs 32.654 µs 32.731 µs]
        change: [−15.722% −15.512% −15.310%] (p = 0.00 < 0.05)
        Performance has improved.

sort string_view[0-400] nulls to indices 2^12
        time:   [32.981 µs 33.074 µs 33.189 µs]
        change: [−25.570% −25.132% −24.700%] (p = 0.00 < 0.05)
        Performance has improved.

sort string_view_inlined[0-12] nulls to indices 2^12
        time:   [28.467 µs 28.496 µs 28.529 µs]
        change: [−22.978% −22.786% −22.574%] (p = 0.00 < 0.05)
        Performance has improved.

sort string[10] dict nulls to indices 2^12
        time:   [94.463 µs 94.503 µs 94.542 µs]
        change: [−11.386% −11.165% −10.961%] (p = 0.00 < 0.05)
        Performance has improved.
```

# Are these changes tested?
Covered by existing tests

# Are there any user-facing changes?
No, the method is internal to the sort kernels.
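The `spare_capacity_mut` pattern from the list above looks roughly like this (a standalone sketch, not the PR's actual code: a plain `&[bool]` stands in for `NullBuffer`, and the caller supplies an accurate valid count):

```rust
// Partition indices into valid/null groups without per-push capacity
// checks: write into preallocated spare capacity, then set_len once.
fn partition_indices(validity: &[bool], valid_count: usize) -> (Vec<u32>, Vec<u32>) {
    let null_count = validity.len() - valid_count;
    let mut valid = Vec::with_capacity(valid_count);
    let mut nulls = Vec::with_capacity(null_count);
    let (mut v, mut n) = (0usize, 0usize);
    {
        let vs = valid.spare_capacity_mut();
        let ns = nulls.spare_capacity_mut();
        for (i, &is_valid) in validity.iter().enumerate() {
            if is_valid {
                vs[v].write(i as u32);
                v += 1;
            } else {
                ns[n].write(i as u32);
                n += 1;
            }
        }
    }
    // SAFETY: exactly `v` and `n` elements were initialized above.
    unsafe {
        valid.set_len(v);
        nulls.set_len(n);
    }
    (valid, nulls)
}
```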
# Which issue does this PR close?
- Closes apache#7949

# Rationale for this change
I would like it to be easier / more ergonomic to make objects.

# What changes are included in this PR?
1. Add `ObjectBuilder::with_field` (a hedged sketch of the chained style follows below)
2. Add documentation w/ examples
3. Rewrite some tests

# Are these changes tested?
Yes, by doc tests

# Are there any user-facing changes?
Yes, a new API
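The chained style, roughly (a sketch: `with_field` is named in this PR, and is assumed here to consume and return the object builder):

```rust
use parquet_variant::VariantBuilder;

fn main() {
    let mut builder = VariantBuilder::new();
    // Chain fields fluently instead of repeated insert() calls.
    builder
        .new_object()
        .with_field("first_name", "Ferris")
        .with_field("age", 10i64)
        .finish();
    let (_metadata, _value) = builder.finish();
}
```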
…ntArray (apache#7945)

# Which issue does this PR close?
- Closes apache#7920.

# Are these changes tested?
Tests were already implemented

# Are there any user-facing changes?
None
…che#7956)

# Which issue does this PR close?
- Follow-up to apache#7901

# Rationale for this change
- apache#7934 introduced a minor regression, in (accidentally?) forbidding the empty string as a dictionary key. Fix the bug and simplify the code a bit further while we're at it.

# What changes are included in this PR?
Revert the unsorted dictionary check back to what it had been (it just uses `Iterator::is_sorted_by` now, instead of `primitive.slice::is_sorted_by`). Remove the redundant offset monotonicity check from the ordered dictionary path, relying on the fact that string slice extraction will anyway fail if the offsets are not monotonic. Improve the error message now that it does double duty.

# Are these changes tested?
New unit tests for dictionaries containing the empty string. As a side effect, we now have at least a little coverage for sorted dictionaries -- somehow, I couldn't find any existing unit test that creates a sorted dictionary??

# Are there any user-facing changes?
No

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?
- Part of apache#7911
- Part of apache#6736
- Follow on to apache#7905

# Rationale for this change
I wrote benchmarks for some changes to the json decoder in apache#7911, but they are non-trivial. To keep apache#7911 easier to review I have pulled the benchmarks out to their own PR.

# What changes are included in this PR?
1. Add new json benchmark
2. Include the `variant_get` benchmark added in apache#7919 by @Samyak2

# Are these changes tested?
I tested them manually and clippy CI coverage ensures they compile

# Are there any user-facing changes?
No, these are only benchmarks
# Which issue does this PR close?
- Closes apache#7951.

# Are these changes tested?
Yes

# Are there any user-facing changes?
New API

Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close?
- Follow on to apache#7943
- Part of apache#7948

# Rationale for this change
I found a few more tests I would like to have seen while reviewing apache#7943.

# What changes are included in this PR?
Add some list equality tests

# Are these changes tested?
It is only tests, no functionality changes

# Are there any user-facing changes?
No
# Which issue does this PR close?
- Closes apache#7947.

# Are these changes tested?
Yes

Signed-off-by: codephage2020 <[email protected]>
# Which issue does this PR close?
- Related to apache#6736

# Rationale for this change
I noticed in apache#7956 that some Clippy errors were introduced but not caught by CI.

# What changes are included in this PR?
Add `parquet-variant-compute` to the CI for parquet-variant related PRs

# Are these changes tested?
It is only tests

# Are there any user-facing changes?
No
Thanks for the review @scovich, it's already an arrow-rs draft PR but I can make these changes still, thank you
Here is a PR with some potential tests for shredded variant access:
…r functionalities
I think there was a bad merge of some kind? I see a lot of seemingly unrelated changes in other crates like `arrow-flight` and `arrow-select`?
```diff
@@ -60,7 +60,7 @@ jobs:
       cargo test -p arrow-flight --all-features
     - name: Test --examples
       run: |
-        cargo test -p arrow-flight --features=flight-sql,tls --examples
+        cargo test -p arrow-flight --features=flight-sql,tls-ring --examples
```
unrelated/noise?
Path-based Field Extraction for VariantArray

This PR implements efficient path-based field extraction and manipulation capabilities for `VariantArray`, enabling direct access to nested fields without expensive unshredding operations. The implementation provides both high-level convenience methods and low-level byte operations to support various analytical workloads on variant data.

Relationship to Concurrent PRs

This work builds directly on the path navigation concepts introduced in PR #7919, sharing the fundamental `VariantPathElement` design with `Field` and `Index` variants. While PR apache#7919 provides a compute kernel approach with a `variant_get` function, this PR provides instance-based methods directly on `VariantArray` with a fluent builder API using owned strings rather than PR apache#7919's vector-based approach. This PR is complementary to PR #7921, which implements schema-driven shredding during array construction. This PR provides runtime path-based access to both shredded and unshredded data, creating a complete solution for both efficient construction and efficient access of variant data.
What This PR Contributes
This PR introduces three entirely original capabilities missing from both concurrent PRs. Field removal operations through methods like `remove_field` and `remove_fields` enable efficient removal of specific fields from variant data, crucial for shredding operations where temporary or debug fields need to be stripped. A complete byte-level operations module (`field_operations.rs`) provides direct binary manipulation through functions like `get_path_bytes`, `extract_field_bytes`, and `remove_field_bytes` that operate on the raw binary format without constructing intermediate objects. A comprehensive binary parser (`variant_parser.rs`) supports all variant types with specialized parsers for 17 different primitive types, providing the foundation for efficient binary navigation.

How This Benefits PR apache#7919
The performance-critical byte operations could serve as the underlying implementation for PR apache#7919's compute kernel, potentially providing better performance for batch operations by avoiding object construction overhead. The field removal capabilities could extend PR apache#7919's functionality beyond extraction to comprehensive field manipulation. The instance-based approach provides different ergonomics that complement PR apache#7919's compute kernel approach.
Implementation Details
The implementation follows a three-tier architecture: high-level instance methods returning `Variant` objects for convenient manipulation, mid-level path operations using `VariantPath` and `VariantPathElement` types for type-safe nested access, and low-level byte operations for maximum performance where object construction overhead is prohibitive. This directly addresses the performance concerns identified in PR apache#7919 by providing direct binary navigation without full object reconstruction, enabling efficient batch operations, and implementing selective field access that prevents the quadratic work patterns identified in the original performance analysis.
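A sketch of how the three tiers might fit together (all signatures assumed; `variant_array` is a hypothetical `VariantArray` value, and the method and type names are taken from this PR's description):

```rust
// Mid level: build a type-safe path (owned strings, per this PR's design).
let path = VariantPath::new(vec![
    VariantPathElement::Field("address".to_string()),
    VariantPathElement::Field("city".to_string()),
]);

// High level: instance method returning a Variant for row 0.
let city = variant_array.get_path(0, &path);

// Low level: byte operations that skip intermediate object construction.
let raw = FieldOperations::get_path_bytes(
    variant_array.metadata(0),
    variant_array.value_bytes(0),
    &path,
);
```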
What Remains Pending

This PR focuses on runtime access and manipulation rather than construction-time optimization, leaving build-time schema-driven shredding to PR apache#7921. Future work could explore integration with PR apache#7919's compute kernel approach, potentially using this PR's byte-level operations as the underlying implementation.