Skip to content

Variant shredding #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 45 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
7b7aad2
Upgrade tonic dependencies to 0.13.0 version (try 2) (#7839)
alamb Jul 16, 2025
0055f57
[Variant] Reserve capacity beforehand during large object building (#…
friendlymatthew Jul 16, 2025
7af62d5
[Variant] Support appending complex variants in `VariantBuilder` (#7914)
friendlymatthew Jul 16, 2025
d4c0a32
[Variant] Add `variant_get` compute kernel (#7919)
Samyak2 Jul 16, 2025
03a837e
Add tests for `BatchCoalescer::push_batch_with_filter`, fix bug (#7774)
alamb Jul 16, 2025
d809f19
[Variant] Add documentation, tests and cleaner api for Variant::get_p…
alamb Jul 17, 2025
7089786
[Variant] Avoid collecting offset iterator (#7934)
codephage2020 Jul 17, 2025
dfe907f
Minor: Support BinaryView and StringView builders in `make_builder` (…
kylebarron Jul 17, 2025
d0fa24e
[Variant] Impl `PartialEq` for VariantObject (#7943)
friendlymatthew Jul 17, 2025
233dad3
Optimize partition_validity function used in sort kernels (#7937)
jhorstmann Jul 18, 2025
722ef59
[Variant] Add ObjectBuilder::with_field for convenience (#7950)
alamb Jul 18, 2025
a984ca7
[Variant] Adding code to store metadata and value references in Varia…
abacef Jul 18, 2025
a5afda2
[Variant] VariantMetadata is allowed to contain the empty string (#7956)
scovich Jul 18, 2025
71dd48e
[Variant] Add `variant_kernels` benchmark (#7944)
alamb Jul 18, 2025
a15f345
[Variant] Add ListBuilder::with_value for convenience (#7959)
codephage2020 Jul 18, 2025
4f5ab12
[Test] Add tests for VariantList equality (#7953)
alamb Jul 18, 2025
55fbf5c
[Variant] remove VariantMetadata::dictionary_size (#7958)
codephage2020 Jul 18, 2025
99eb1bc
Add missing `parquet-variant-compute` crate to CI jobs (#7963)
alamb Jul 18, 2025
82821e5
arrow-ipc: Remove all abilities to preserve dict IDs (#7940)
brancz Jul 18, 2025
291e6e5
Add arrow-avro support for Impala Nullability (#7954)
veronica-m-ef Jul 21, 2025
b726b6f
Add additional integration tests to arrow-avro (#7974)
nathaniel-d-ef Jul 22, 2025
ed02131
arrow-schema: Remove dict_id from being required equal for merging (#…
brancz Jul 22, 2025
d4f1cfa
Implement Improved arrow-avro Reader Zero-Byte Record Handling (#7966)
jecsand838 Jul 22, 2025
6874ffa
[Variant] Avoid extra allocation in object builder (#7935)
klion26 Jul 22, 2025
dff67c9
GH-7686: [Parquet] Fix int96 min/max stats (#7687)
rahulketch Jul 22, 2025
f39461c
[Variant] Revisit VariantMetadata and Object equality (#7961)
friendlymatthew Jul 22, 2025
ec81db3
Add decimal32 and decimal64 support to Parquet, JSON and CSV readers …
CurtHagenlocher Jul 22, 2025
4c1d6f2
[ADD] Path-based field extraction for VariantArray
carpecodeum Jul 16, 2025
5ac22a7
[FIX] sanitise variant_array file
carpecodeum Jul 16, 2025
1ef8926
[ADD] add hybrid approach for field access
carpecodeum Jul 16, 2025
d782197
[FIX] fix variant_array implementation
carpecodeum Jul 16, 2025
948bb39
[ADD] add support for path operations on different data types
carpecodeum Jul 16, 2025
e16af07
[FIX] minor fixes
carpecodeum Jul 16, 2025
3da46b8
[FIX] fix formatting issues
carpecodeum Jul 16, 2025
7c03e21
[FIX] remove redundancy
carpecodeum Jul 20, 2025
eb8bb69
[FIX] improve the tests
carpecodeum Jul 20, 2025
397c717
[FIX] refactor code for modularity
carpecodeum Jul 21, 2025
dda30ea
[FIX] fix issues with the spec
carpecodeum Jul 21, 2025
32c55ea
remove redundancy with field_operations.rs and variant_parser.rs
carpecodeum Jul 21, 2025
3b3c191
[REMOVE] revert field_operations.rs
carpecodeum Jul 21, 2025
01f0be7
[REMOVE] remove extra lines in cargo.toml
carpecodeum Jul 21, 2025
eb23834
[REMOVE] remove variant_parser.rs file as decoder.rs already has majo…
carpecodeum Jul 21, 2025
cc5e149
[FIX] make code modular
carpecodeum Jul 21, 2025
30e9cd2
[FIX] clippy and lint issues
carpecodeum Jul 22, 2025
7dd6c23
[FIX] remove unsafe functions doing byte operations
carpecodeum Jul 23, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/arrow_flight.yml
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,7 @@ jobs:
cargo test -p arrow-flight --all-features
- name: Test --examples
run: |
cargo test -p arrow-flight --features=flight-sql,tls --examples
cargo test -p arrow-flight --features=flight-sql,tls-ring --examples
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated/noise?


vendor:
name: Verify Vendored Code
Expand Down
16 changes: 12 additions & 4 deletions .github/workflows/parquet-variant.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,8 @@ on:
pull_request:
paths:
- parquet-variant/**
- parquet-variant-json/**
- parquet-variant-compute/**
- .github/**

jobs:
Expand All @@ -50,6 +52,8 @@ jobs:
run: cargo test -p parquet-variant
- name: Test parquet-variant-json
run: cargo test -p parquet-variant-json
- name: Test parquet-variant-compute
run: cargo test -p parquet-variant-compute

# test compilation
linux-features:
Expand All @@ -63,10 +67,12 @@ jobs:
submodules: true
- name: Setup Rust toolchain
uses: ./.github/actions/setup-builder
- name: Check compilation
- name: Check compilation (parquet-variant)
run: cargo check -p parquet-variant
- name: Check compilation
- name: Check compilation (parquet-variant-json)
run: cargo check -p parquet-variant-json
- name: Check compilation (parquet-variant-compute)
run: cargo check -p parquet-variant-compute

clippy:
name: Clippy
Expand All @@ -79,7 +85,9 @@ jobs:
uses: ./.github/actions/setup-builder
- name: Setup Clippy
run: rustup component add clippy
- name: Run clippy
- name: Run clippy (parquet-variant)
run: cargo clippy -p parquet-variant --all-targets --all-features -- -D warnings
- name: Run clippy
- name: Run clippy (parquet-variant-json)
run: cargo clippy -p parquet-variant-json --all-targets --all-features -- -D warnings
- name: Run clippy (parquet-variant-compute)
run: cargo clippy -p parquet-variant-compute --all-targets --all-features -- -D warnings
2 changes: 2 additions & 0 deletions arrow-array/src/builder/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -447,6 +447,7 @@ pub fn make_builder(datatype: &DataType, capacity: usize) -> Box<dyn ArrayBuilde
DataType::Float64 => Box::new(Float64Builder::with_capacity(capacity)),
DataType::Binary => Box::new(BinaryBuilder::with_capacity(capacity, 1024)),
DataType::LargeBinary => Box::new(LargeBinaryBuilder::with_capacity(capacity, 1024)),
DataType::BinaryView => Box::new(BinaryViewBuilder::with_capacity(capacity)),
DataType::FixedSizeBinary(len) => {
Box::new(FixedSizeBinaryBuilder::with_capacity(capacity, *len))
}
Expand All @@ -464,6 +465,7 @@ pub fn make_builder(datatype: &DataType, capacity: usize) -> Box<dyn ArrayBuilde
),
DataType::Utf8 => Box::new(StringBuilder::with_capacity(capacity, 1024)),
DataType::LargeUtf8 => Box::new(LargeStringBuilder::with_capacity(capacity, 1024)),
DataType::Utf8View => Box::new(StringViewBuilder::with_capacity(capacity)),
DataType::Date32 => Box::new(Date32Builder::with_capacity(capacity)),
DataType::Date64 => Box::new(Date64Builder::with_capacity(capacity)),
DataType::Time32(TimeUnit::Second) => {
Expand Down
1 change: 1 addition & 0 deletions arrow-avro/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -58,6 +58,7 @@ crc = { version = "3.0", optional = true }
uuid = "1.17"

[dev-dependencies]
arrow-data = { workspace = true }
rand = { version = "0.9.1", default-features = false, features = [
"std",
"std_rng",
Expand Down
126 changes: 109 additions & 17 deletions arrow-avro/src/codec.rs
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,7 @@ impl<'a> TryFrom<&Schema<'a>> for AvroField {
match schema {
Schema::Complex(ComplexType::Record(r)) => {
let mut resolver = Resolver::default();
let data_type = make_data_type(schema, None, &mut resolver, false)?;
let data_type = make_data_type(schema, None, &mut resolver, false, false)?;
Ok(AvroField {
data_type,
name: r.name.to_string(),
Expand All @@ -161,6 +161,60 @@ impl<'a> TryFrom<&Schema<'a>> for AvroField {
}
}

/// Builder for an [`AvroField`]
#[derive(Debug)]
pub struct AvroFieldBuilder<'a> {
schema: &'a Schema<'a>,
use_utf8view: bool,
strict_mode: bool,
}

impl<'a> AvroFieldBuilder<'a> {
/// Creates a new [`AvroFieldBuilder`]
pub fn new(schema: &'a Schema<'a>) -> Self {
Self {
schema,
use_utf8view: false,
strict_mode: false,
}
}

/// Enable or disable Utf8View support
pub fn with_utf8view(mut self, use_utf8view: bool) -> Self {
self.use_utf8view = use_utf8view;
self
}

/// Enable or disable strict mode.
pub fn with_strict_mode(mut self, strict_mode: bool) -> Self {
self.strict_mode = strict_mode;
self
}

/// Build an [`AvroField`] from the builder
pub fn build(self) -> Result<AvroField, ArrowError> {
match self.schema {
Schema::Complex(ComplexType::Record(r)) => {
let mut resolver = Resolver::default();
let data_type = make_data_type(
self.schema,
None,
&mut resolver,
self.use_utf8view,
self.strict_mode,
)?;
Ok(AvroField {
name: r.name.to_string(),
data_type,
})
}
_ => Err(ArrowError::ParseError(format!(
"Expected a Record schema to build an AvroField, but got {:?}",
self.schema
))),
}
}
}
/// An Avro encoding
///
/// <https://avro.apache.org/docs/1.11.1/specification/#encodings>
Expand Down Expand Up @@ -409,6 +463,7 @@ fn make_data_type<'a>(
namespace: Option<&'a str>,
resolver: &mut Resolver<'a>,
use_utf8view: bool,
strict_mode: bool,
) -> Result<AvroDataType, ArrowError> {
match schema {
Schema::TypeName(TypeName::Primitive(p)) => {
Expand All @@ -428,12 +483,20 @@ fn make_data_type<'a>(
.position(|x| x == &Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)));
match (f.len() == 2, null) {
(true, Some(0)) => {
let mut field = make_data_type(&f[1], namespace, resolver, use_utf8view)?;
let mut field =
make_data_type(&f[1], namespace, resolver, use_utf8view, strict_mode)?;
field.nullability = Some(Nullability::NullFirst);
Ok(field)
}
(true, Some(1)) => {
let mut field = make_data_type(&f[0], namespace, resolver, use_utf8view)?;
if strict_mode {
return Err(ArrowError::SchemaError(
"Found Avro union of the form ['T','null'], which is disallowed in strict_mode"
.to_string(),
));
}
let mut field =
make_data_type(&f[0], namespace, resolver, use_utf8view, strict_mode)?;
field.nullability = Some(Nullability::NullSecond);
Ok(field)
}
Expand All @@ -456,6 +519,7 @@ fn make_data_type<'a>(
namespace,
resolver,
use_utf8view,
strict_mode,
)?,
})
})
Expand All @@ -469,8 +533,13 @@ fn make_data_type<'a>(
Ok(field)
}
ComplexType::Array(a) => {
let mut field =
make_data_type(a.items.as_ref(), namespace, resolver, use_utf8view)?;
let mut field = make_data_type(
a.items.as_ref(),
namespace,
resolver,
use_utf8view,
strict_mode,
)?;
Ok(AvroDataType {
nullability: None,
metadata: a.attributes.field_metadata(),
Expand Down Expand Up @@ -535,7 +604,8 @@ fn make_data_type<'a>(
Ok(field)
}
ComplexType::Map(m) => {
let val = make_data_type(&m.values, namespace, resolver, use_utf8view)?;
let val =
make_data_type(&m.values, namespace, resolver, use_utf8view, strict_mode)?;
Ok(AvroDataType {
nullability: None,
metadata: m.attributes.field_metadata(),
Expand All @@ -549,6 +619,7 @@ fn make_data_type<'a>(
namespace,
resolver,
use_utf8view,
strict_mode,
)?;

// https://avro.apache.org/docs/1.11.1/specification/#logical-types
Expand Down Expand Up @@ -630,7 +701,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Int, "date");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::Date32));
}
Expand All @@ -640,7 +711,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Int, "time-millis");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::TimeMillis));
}
Expand All @@ -650,7 +721,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Long, "time-micros");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::TimeMicros));
}
Expand All @@ -660,7 +731,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Long, "timestamp-millis");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::TimestampMillis(true)));
}
Expand All @@ -670,7 +741,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Long, "timestamp-micros");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::TimestampMicros(true)));
}
Expand All @@ -680,7 +751,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Long, "local-timestamp-millis");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::TimestampMillis(false)));
}
Expand All @@ -690,7 +761,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Long, "local-timestamp-micros");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::TimestampMicros(false)));
}
Expand Down Expand Up @@ -745,7 +816,7 @@ mod tests {
let schema = create_schema_with_logical_type(PrimitiveType::Int, "custom-type");

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert_eq!(
result.metadata.get("logicalType"),
Expand All @@ -758,7 +829,7 @@ mod tests {
let schema = Schema::TypeName(TypeName::Primitive(PrimitiveType::String));

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, true).unwrap();
let result = make_data_type(&schema, None, &mut resolver, true, false).unwrap();

assert!(matches!(result.codec, Codec::Utf8View));
}
Expand All @@ -768,7 +839,7 @@ mod tests {
let schema = Schema::TypeName(TypeName::Primitive(PrimitiveType::String));

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false).unwrap();
let result = make_data_type(&schema, None, &mut resolver, false, false).unwrap();

assert!(matches!(result.codec, Codec::Utf8));
}
Expand Down Expand Up @@ -796,7 +867,7 @@ mod tests {
let schema = Schema::Complex(ComplexType::Record(record));

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, true).unwrap();
let result = make_data_type(&schema, None, &mut resolver, true, false).unwrap();

if let Codec::Struct(fields) = &result.codec {
let first_field_codec = &fields[0].data_type().codec;
Expand All @@ -805,4 +876,25 @@ mod tests {
panic!("Expected Struct codec");
}
}

#[test]
fn test_union_with_strict_mode() {
let schema = Schema::Union(vec![
Schema::TypeName(TypeName::Primitive(PrimitiveType::String)),
Schema::TypeName(TypeName::Primitive(PrimitiveType::Null)),
]);

let mut resolver = Resolver::default();
let result = make_data_type(&schema, None, &mut resolver, false, true);

assert!(result.is_err());
match result {
Err(ArrowError::SchemaError(msg)) => {
assert!(msg.contains(
"Found Avro union of the form ['T','null'], which is disallowed in strict_mode"
));
}
_ => panic!("Expected SchemaError"),
}
}
}
Loading
Loading