update to DataFusion 50 #146
Conversation
src/protobuf/distributed_codec.rs
Outdated
// Create a default SessionContext for the protobuf parsing
// TODO: This loses the original function registry, but DataFusion 50.0.0 requires SessionContext
let session_ctx = SessionContext::new();
let partioning = parse_protobuf_partitioning(
    partitioning.as_ref(),
    registry,
    &session_ctx,
I'm not sure what DataFusion intends to do here, or what the consequences of this are. It needs some investigation.
src/test_utils/insta.rs
Outdated
pub fn settings() -> insta::Settings {
    env::set_var("INSTA_WORKSPACE_ROOT", env!("CARGO_MANIFEST_DIR"));
    unsafe {
From 2024 edition
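For reference, this is roughly what the 2024 edition requires here: std::env::set_var became an unsafe function, so the call has to be wrapped (the variable name is taken from the snippet above):

```rust
// Under the Rust 2024 edition, std::env::set_var is unsafe because mutating
// the process environment can race with other threads reading it.
unsafe {
    std::env::set_var("INSTA_WORKSPACE_ROOT", env!("CARGO_MANIFEST_DIR"));
}
```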
channel = "1.85.1" | ||
profile = "default" |
This locks in a specific version. The MSRV in Cargo.toml is a minimum version, which is much more flexible.
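For illustration, the difference between the two (the exact MSRV value here is assumed from the toolchain pin above):

```toml
# rust-toolchain.toml: pins every build of this repo to exactly this toolchain.
[toolchain]
channel = "1.85.1"
profile = "default"
```

```toml
# Cargo.toml: declares only a minimum supported Rust version; newer toolchains still build.
[package]
rust-version = "1.85.1"
```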
Cargo.toml
Outdated
arrow = { version = "56.1.0", optional = true }
tokio-stream = { version = "0.1.17", optional = true }
hyper-util = { version = "0.1.16", optional = true }
chrono = { version = "0.4.42", optional = true }
Same as #143, need to rebase
This is likely to end up having conflicts - @robtandy @gabotechs could you review?
Awesome! 🔥 Thanks for this. It would be great if we could delay this upgrade at least for one or two weeks. Unfortunately we do not support DataFusion v50 yet, and this upgrade will block our internal testing. We also do not want to bother other contributors, so here's my ask: would it be fine to delay this upgrade just one or two weeks? Hopefully by that time we can have v50 running on our system. If by then we have not yet upgraded we can just merge this and we'll figure it out.
Yeah no problem, it's all experimental on our end, we can work off of a branch.
tests/tpch_validation_test.rs
Outdated
@@ -88,7 +88,7 @@ mod tests {
 │ HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(p_partkey@0, ps_partkey@0)], projection=[p_partkey@0, p_mfgr@1, ps_suppkey@3, ps_supplycost@4]
 │ CoalescePartitionsExec
 │ NetworkCoalesceExec read_from=Stage 2, output_partitions=8, input_tasks=4
-│ DataSourceExec: file_groups={6 groups: [[/testdata/tpch/data/partsupp/1.parquet:0..753508, /testdata/tpch/data/partsupp/10.parquet:0..748125, /testdata/tpch/data/partsupp/11.parquet:0..496934], [/testdata/tpch/data/partsupp/11.parquet:496934..744689, /testdata/tpch/data/partsupp/12.parquet:0..749068, /testdata/tpch/data/partsupp/13.parquet:0..750296, /testdata/tpch/data/partsupp/14.parquet:0..251448], [/testdata/tpch/data/partsupp/14.parquet:251448..748674, /testdata/tpch/data/partsupp/15.parquet:0..744298, /testdata/tpch/data/partsupp/16.parquet:0..748035, /testdata/tpch/data/partsupp/2.parquet:0..9008], [/testdata/tpch/data/partsupp/2.parquet:9008..753010, /testdata/tpch/data/partsupp/3.parquet:0..756405, /testdata/tpch/data/partsupp/4.parquet:0..498160], [/testdata/tpch/data/partsupp/4.parquet:498160..746724, /testdata/tpch/data/partsupp/5.parquet:0..746720, /testdata/tpch/data/partsupp/6.parquet:0..750184, /testdata/tpch/data/partsupp/7.parquet:0..253099], ...]}, projection=[ps_partkey, ps_suppkey, ps_supplycost], file_type=parquet
+│ DataSourceExec: file_groups={6 groups: [[/testdata/tpch/data/partsupp/1.parquet:0..753508, /testdata/tpch/data/partsupp/10.parquet:0..748125, /testdata/tpch/data/partsupp/11.parquet:0..496934], [/testdata/tpch/data/partsupp/11.parquet:496934..744689, /testdata/tpch/data/partsupp/12.parquet:0..749068, /testdata/tpch/data/partsupp/13.parquet:0..750296, /testdata/tpch/data/partsupp/14.parquet:0..251448], [/testdata/tpch/data/partsupp/14.parquet:251448..748674, /testdata/tpch/data/partsupp/15.parquet:0..744298, /testdata/tpch/data/partsupp/16.parquet:0..748035, /testdata/tpch/data/partsupp/2.parquet:0..9008], [/testdata/tpch/data/partsupp/2.parquet:9008..753010, /testdata/tpch/data/partsupp/3.parquet:0..756405, /testdata/tpch/data/partsupp/4.parquet:0..498160], [/testdata/tpch/data/partsupp/4.parquet:498160..746724, /testdata/tpch/data/partsupp/5.parquet:0..746720, /testdata/tpch/data/partsupp/6.parquet:0..750184, /testdata/tpch/data/partsupp/7.parquet:0..253099], ...]}, projection=[ps_partkey, ps_suppkey, ps_supplycost], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ true ] AND DynamicFilterPhysicalExpr [ true ]
The diff in all of these is the extra `DynamicFilterPhysicalExpr [ true ]`. I'm guessing we're ending up with multiple because there are multiple optimizer passes being run? I'm not sure, tbh. They also obviously aren't working (they will never be updated from `true`), but they also shouldn't be much overhead (it's an extra literal `true` to evaluate during expression evaluation).
Changing this would require a lot of upstream work in DataFusion, which we want to do anyway, but I don't think it would be a good idea to block this PR for that.
@gabotechs I've rebased on the other PRs. No problem with waiting. But there are some open questions here that maybe we could start to think about while we wait?
Thanks for taking the time to put this together 😄! Just a couple of comments, but otherwise looking good.
let stage =
    stage_from_proto(stage_proto, &ctx, &self.runtime, &codec).map_err(|err| {
We need the context here to contain all the UDFs, UDAFs and other things necessary for properly decoding the stage node. Before, we were passing the `&session_state`, which is the one built by the user containing all their custom stuff, but now we are passing a freshly built `SessionContext::new()` that knows nothing about the custom things provided by the user in `&session_state`.
This means that we should probably be passing:
let ctx = SessionContext::from(session_state);
instead.
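A minimal sketch of how that could look, assuming `stage_from_proto` keeps the call shape quoted above (the error-mapping closure is left as a placeholder):

```rust
// Derive the decoding context from the user's SessionState so that stage
// decoding can resolve their registered UDFs/UDAFs, instead of starting
// from an empty SessionContext::new().
let ctx = SessionContext::from(session_state);

let stage = stage_from_proto(stage_proto, &ctx, &self.runtime, &codec)
    .map_err(|err| {
        // original error mapping, unchanged
        err
    })?;
```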
🟢 green light to ship this upgrade! I think the only missing thing would be this comment and making the pipelines pass.
Great news! I did a rebase on this repo's …
Yeap, I also saw it failing when trying to point this repo to our https://github.com/DataDog/datafusion fork. cc @jayshrivastava it's one of the new tests for #160, it seems to not play well with DataFusion 50. It should be fine to leave it ignored in this PR I think.
+1, please feel free to skip/ignore the failing unit test and file a corresponding ticket :) The issue is that this query does not get distributed anymore under this config.
Skipped the test.
I'm not sure what to do about the tpch tests. They pass locally; I tried regenerating the data.
We probably need to tell insta to normalize those integer ranges that seem to vary slightly depending on the machine running the test. Something like adding `settings.add_filter(r"\d+\.\.\d+", "<int>..<int>");` should work.
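If it helps, a minimal sketch of how that filter could be bound in a test, assuming insta's `filters` feature is enabled (the regex and replacement are taken from the suggestion above):

```rust
// Normalize machine-dependent byte ranges such as `0..753508` in the plan
// snapshots before insta compares them against the stored snapshot.
let mut settings = insta::Settings::clone_current();
settings.add_filter(r"\d+\.\.\d+", "<int>..<int>");
settings.bind(|| {
    // insta assertions executed inside this closure see the filtered output
});
```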
Thanks @gabotechs, we're 🟢 now!
💯 let's go
Thanks for getting this across the line! We should probably create a tracking issue to integrate the upstream change in apache/datafusion#17601 when DataFusion 50.1.0 is released.
DataFusion 50 has been released. I had to bump the MSRV and took the opportunity to upgrade the edition (I see no reason not to, unless I'm missing some internal reasoning).