Skip to content

chore(rust): Remove the wrap/unwrap workaround#12

Merged
jiayuasu merged 46 commits intoapache:mainfrom
paleolimbot:remove-wrap-unwrap
Sep 4, 2025
Merged

chore(rust): Remove the wrap/unwrap workaround#12
jiayuasu merged 46 commits intoapache:mainfrom
paleolimbot:remove-wrap-unwrap

Conversation

@paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Sep 2, 2025

Now that DataFusion propagates field metadata through more types of expressions, we can remove the wrap/unwrap workaround! We now have plenty of integration tests (+ SedonaBench) such that we should be able to detect any regressions caused by DataFusion internals that haven't considered metadata yet.

Broadly, the changes are:

  • Remove all of the functions that wrapped schemas containing extension types or unwrapped them. All schemas and record batches now have the same representation outside the engine and inside the engine.
  • Removed SedonaType::from_data_type(). Previously a DataType could unambiguously be a geometry type or an Arrow type, but now a DataType is ambiguous (could be an Arrow type or the storage type of a geometry). Most uses of this were changed to SedonaType::from_storage_field(), where the extension metadata is available to let us know if it is an extension type or not.
  • Removed TryFrom<DataType> for SedonaType: we had a lot of code that looked like DataType::Boolean.try_into().unwrap(). This can now be DataType::Boolean.into() or SedonaType::Arrow(DataType::Boolean). I changed all internal usage to SedaonType::Arrow(DataType::Boolean) because it is more explicit and I was paranoid about dropping extension metadata by accident while doing this change.
  • Some tests weren't using the ScalarUdfTester and had to be rewritten to use it. It is no longer trivial to call a scalar function and the tester is pretty much required. These are the most verbose of the changes.
  • I removed SedonaType::data_type() because it is ambiguous. Where we want the underlying storage type we can use .storage_type() but mostly we want to_storage_field() because it doesn't drop metadata.

Three outstanding issues are:

  • Right/Left Anti/Semi joins drop metadata for some reason and now error where they previously worked. It is possibly a DataFusion bug but I would like to track this down independently of this PR.
  • Aggregate functions via FFI no longer work. We weren't using them yet and there's a PR to fix this upstream already (I had filed an issue).
  • Failures for assertions that used assert_scalar_equal() from sedona_testing are now awful (they show WKB instead of a WKT diff). This is because ScalarValue and ArrayRef no longer can be uniquely identified as "geometry" because they require field metadata to provide this context. This is solvable but I'd like to do it in a separate PR.

@paleolimbot
Copy link
Member Author

@Kontinuation I'm getting close here but I have the following test failures:

---- exec::tests::test_left_joins::join_type_2_JoinType__LeftSemi stdout ----
Error: Context("type_coercion", NotImplemented("st_intersects([Arrow(Binary), Wkb(Planar, None)]): No kernel matching arguments"))

---- exec::tests::test_left_joins::join_type_3_JoinType__LeftAnti stdout ----
Error: Context("type_coercion", NotImplemented("st_intersects([Arrow(Binary), Wkb(Planar, None)]): No kernel matching arguments"))

---- exec::tests::test_right_joins::join_type_2_JoinType__RightSemi stdout ----
Error: Context("type_coercion", NotImplemented("st_intersects([Wkb(Planar, None), Arrow(Binary)]): No kernel matching arguments"))

---- exec::tests::test_right_joins::join_type_3_JoinType__RightAnti stdout ----
Error: Context("type_coercion", NotImplemented("st_intersects([Wkb(Planar, None), Arrow(Binary)]): No kernel matching arguments"))


failures:
    exec::tests::test_left_joins::join_type_2_JoinType__LeftSemi
    exec::tests::test_left_joins::join_type_3_JoinType__LeftAnti
    exec::tests::test_right_joins::join_type_2_JoinType__RightSemi
    exec::tests::test_right_joins::join_type_3_JoinType__RightAnti

I'll look more closely tomorrow, but is there anything you can think of quickly in our code where we need to do some kind of wrapping or unwrapping specifically for semi and/or anti joins? (It's possible this is a bug in DataFusion, too).

#[tokio::test]
async fn test_left_joins(
#[values(JoinType::Left, JoinType::LeftSemi, JoinType::LeftAnti)] join_type: JoinType,
#[values(JoinType::Left, /* JoinType::LeftSemi, JoinType::LeftAnti */)] join_type: JoinType,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reminder to myself to circle back to this line. These two tests are failing and I'm not sure why yet.

#[tokio::test]
async fn test_right_joins(
#[values(JoinType::Right, JoinType::RightSemi, JoinType::RightAnti)] join_type: JoinType,
#[values(JoinType::Right, /* JoinType::RightSemi, JoinType::RightAnti */)]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this line

@Kontinuation
Copy link
Member

I'll look more closely tomorrow, but is there anything you can think of quickly in our code where we need to do some kind of wrapping or unwrapping specifically for semi and/or anti joins? (It's possible this is a bug in DataFusion, too).

There's no special treatment for semi/anti joins. Outer/semi/anti joins are handled uniformly by utils::adjust_indices_by_join_type. The semi/anti join failures are very likely caused by a datafusion bug.

@paleolimbot
Copy link
Member Author

Thank you! I'll take a look.

Comment on lines -274 to -293
#[test]
#[should_panic(expected = "actual ScalarValue != expected ScalarValue:
actual ScalarValue has type Wkb(Spherical, None), expected ScalarValue has type Wkb(Planar, None)")]
fn value_scalar_not_equal() {
assert_value_equal(
&create_scalar_value(None, &WKB_GEOGRAPHY),
&create_scalar_value(None, &WKB_GEOMETRY),
);
}

#[test]
#[should_panic(expected = "actual Array != expected Array:
actual Array has type Wkb(Spherical, None), expected Array has type Wkb(Planar, None)")]
fn value_array_not_equal() {
assert_value_equal(
&create_array_value(&[], &WKB_GEOGRAPHY),
&create_array_value(&[], &WKB_GEOMETRY),
);
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other place we need to circle back to. The assert_value_equal()/assert_array_equal()/assert_scalar_equal() functions used to give nice diffs for geometry arrays, but now they can't detect that the geometry arrays are geometry. We probably need create_array() to return a new struct ArrayWithMetadata that works in ScalarUdfTester::invoke_xxx(). create_scalar() should return a Literal, which can hold extra metadata already.

@paleolimbot paleolimbot marked this pull request as ready for review September 4, 2025 16:29
@jiayuasu jiayuasu merged commit 4b74b56 into apache:main Sep 4, 2025
5 checks passed
@paleolimbot paleolimbot deleted the remove-wrap-unwrap branch September 10, 2025 01:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants