chore(rust): Remove the wrap/unwrap workaround #12
@Kontinuation I'm getting close here but I have the following test failures: I'll look more closely tomorrow, but is there anything you can think of quickly in our code where we need to do some kind of wrapping or unwrapping specifically for semi and/or anti joins? (It's possible this is a bug in DataFusion, too).
 #[tokio::test]
 async fn test_left_joins(
-    #[values(JoinType::Left, JoinType::LeftSemi, JoinType::LeftAnti)] join_type: JoinType,
+    #[values(JoinType::Left, /* JoinType::LeftSemi, JoinType::LeftAnti */)] join_type: JoinType,
A reminder to myself to circle back to this line. These two tests are failing and I'm not sure why yet.
 #[tokio::test]
 async fn test_right_joins(
-    #[values(JoinType::Right, JoinType::RightSemi, JoinType::RightAnti)] join_type: JoinType,
+    #[values(JoinType::Right, /* JoinType::RightSemi, JoinType::RightAnti */)]
There's no special treatment for semi/anti joins. Outer/semi/anti joins are handled uniformly by

Thank you! I'll take a look.
#[test]
#[should_panic(expected = "actual ScalarValue != expected ScalarValue:
actual ScalarValue has type Wkb(Spherical, None), expected ScalarValue has type Wkb(Planar, None)")]
fn value_scalar_not_equal() {
    assert_value_equal(
        &create_scalar_value(None, &WKB_GEOGRAPHY),
        &create_scalar_value(None, &WKB_GEOMETRY),
    );
}

#[test]
#[should_panic(expected = "actual Array != expected Array:
actual Array has type Wkb(Spherical, None), expected Array has type Wkb(Planar, None)")]
fn value_array_not_equal() {
    assert_value_equal(
        &create_array_value(&[], &WKB_GEOGRAPHY),
        &create_array_value(&[], &WKB_GEOMETRY),
    );
}
This is the other place we need to circle back to. The assert_value_equal()/assert_array_equal()/assert_scalar_equal() functions used to give nice diffs for geometry arrays, but now they can't detect that the geometry arrays are geometry. We probably need create_array() to return a new struct ArrayWithMetadata that works in ScalarUdfTester::invoke_xxx(). create_scalar() should return a Literal, which can hold extra metadata already.
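
To make the ArrayWithMetadata idea a bit more concrete, here is a rough, standard-library-only sketch. Everything in it is hypothetical: the real struct would presumably wrap an Arrow ArrayRef rather than a Vec, and the field names and the describe_mismatch() helper are invented for illustration. The only facts borrowed from outside are the Arrow extension metadata key ("ARROW:extension:name") and the GeoArrow extension name ("geoarrow.wkb"). The point is only that once the metadata travels with the values, an assertion helper can again tell geometry apart from plain binary.

```rust
use std::collections::HashMap;

// Hypothetical stand-in: the real struct would hold an Arrow ArrayRef.
#[derive(Debug, Clone, PartialEq)]
struct ArrayWithMetadata {
    // Raw storage values (for example WKB bytes); None represents a null.
    values: Vec<Option<Vec<u8>>>,
    // Field metadata that would otherwise be lost when passing a bare array.
    metadata: HashMap<String, String>,
}

impl ArrayWithMetadata {
    // True when the metadata marks the storage as a GeoArrow WKB extension array.
    fn is_geometry(&self) -> bool {
        self.metadata.get("ARROW:extension:name").map(String::as_str) == Some("geoarrow.wkb")
    }

    // Label used in failure messages.
    fn kind(&self) -> &'static str {
        if self.is_geometry() { "geometry" } else { "binary" }
    }
}

// Hypothetical helper: returns None when equal, otherwise a message that
// names each side using its metadata rather than its storage type alone.
fn describe_mismatch(actual: &ArrayWithMetadata, expected: &ArrayWithMetadata) -> Option<String> {
    if actual == expected {
        return None;
    }
    Some(format!(
        "actual {} array != expected {} array",
        actual.kind(),
        expected.kind()
    ))
}

fn main() {
    let mut geometry_metadata = HashMap::new();
    geometry_metadata.insert("ARROW:extension:name".to_string(), "geoarrow.wkb".to_string());

    // Identical storage values, different metadata.
    let geometry = ArrayWithMetadata { values: vec![None], metadata: geometry_metadata };
    let binary = ArrayWithMetadata { values: vec![None], metadata: HashMap::new() };

    if let Some(message) = describe_mismatch(&geometry, &binary) {
        println!("{message}");
    }
}
```

A real version would use the same check to decide whether to render values as WKT before diffing; that part is left out here.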
Now that DataFusion propagates field metadata through more types of expressions, we can remove the wrap/unwrap workaround! We now have plenty of integration tests (+ SedonaBench) such that we should be able to detect any regressions caused by DataFusion internals that haven't considered metadata yet.
Broadly, the changes are:
- SedonaType::from_data_type(). Previously, a DataType could unambiguously be a geometry type or an Arrow type, but now a DataType is ambiguous (it could be an Arrow type or the storage type of a geometry). Most uses of this were changed to SedonaType::from_storage_field(), where the extension metadata is available to tell us whether it is an extension type.
- TryFrom<DataType> for SedonaType: we had a lot of code that looked like DataType::Boolean.try_into().unwrap(). This can now be DataType::Boolean.into() or SedonaType::Arrow(DataType::Boolean). I changed all internal usage to SedonaType::Arrow(DataType::Boolean) because it is more explicit and I was paranoid about dropping extension metadata by accident while making this change.
- ScalarUdfTester and had to be rewritten to use it. It is no longer trivial to call a scalar function, and the tester is pretty much required. These are the most verbose of the changes.
- SedonaType::data_type() because it is ambiguous. Where we want the underlying storage type we can use .storage_type(), but mostly we want to_storage_field() because it doesn't drop metadata.

Three outstanding issues are:

- assert_scalar_equal() from sedona_testing are now awful (they show WKB instead of a WKT diff). This is because ScalarValue and ArrayRef can no longer be uniquely identified as "geometry": they require field metadata to provide this context. This is solvable, but I'd like to do it in a separate PR.
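
For readers skimming the SedonaType changes listed above, here is a minimal sketch of why a bare DataType is no longer enough. It is a toy model, not the real API: DataType, Field, and SedonaType below are simplified stand-ins defined in the snippet, and the real from_storage_field()/to_storage_field() signatures may differ (for example by returning a Result). The metadata key and extension name follow the Arrow and GeoArrow conventions.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration; NOT the real Arrow or Sedona types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Binary,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    data_type: DataType,
    metadata: HashMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
enum SedonaType {
    Arrow(DataType),
    Wkb,
}

const EXTENSION_NAME_KEY: &str = "ARROW:extension:name";
const WKB_EXTENSION_NAME: &str = "geoarrow.wkb";

impl SedonaType {
    // Only the field (storage type plus metadata) can disambiguate.
    fn from_storage_field(field: &Field) -> SedonaType {
        match field.metadata.get(EXTENSION_NAME_KEY).map(String::as_str) {
            Some(WKB_EXTENSION_NAME) => SedonaType::Wkb,
            _ => SedonaType::Arrow(field.data_type.clone()),
        }
    }

    // Drops the extension information: geometry and plain binary look the same.
    fn storage_type(&self) -> DataType {
        match self {
            SedonaType::Arrow(data_type) => data_type.clone(),
            SedonaType::Wkb => DataType::Binary,
        }
    }

    // Keeps the extension information in the field metadata.
    fn to_storage_field(&self) -> Field {
        let mut metadata = HashMap::new();
        if let SedonaType::Wkb = self {
            metadata.insert(EXTENSION_NAME_KEY.to_string(), WKB_EXTENSION_NAME.to_string());
        }
        Field { data_type: self.storage_type(), metadata }
    }
}

// Infallible: a bare DataType is always treated as a plain Arrow type.
impl From<DataType> for SedonaType {
    fn from(data_type: DataType) -> Self {
        SedonaType::Arrow(data_type)
    }
}

fn main() {
    let geometry = SedonaType::Wkb;
    let plain_binary = SedonaType::Arrow(DataType::Binary);

    // Same storage type, so the DataType alone cannot tell them apart.
    assert_eq!(geometry.storage_type(), plain_binary.storage_type());

    // Round-tripping through the storage field preserves the distinction.
    assert_eq!(SedonaType::from_storage_field(&geometry.to_storage_field()), geometry);
    assert_eq!(SedonaType::from_storage_field(&plain_binary.to_storage_field()), plain_binary);

    // The two spellings mentioned above are equivalent in this model.
    let boolean: SedonaType = DataType::Boolean.into();
    assert_eq!(boolean, SedonaType::Arrow(DataType::Boolean));
}
```

In this model, converting through storage_type() alone silently turns a geometry into plain binary, which is the accidental metadata drop the explicit SedonaType::Arrow(...) spelling is meant to make harder.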