
Conversation


@adriangb adriangb commented Sep 22, 2025

DataFusion 50 has been released. I had to bump the MSRV and took the opportunity to upgrade the edition (I see no reason not to, unless I'm missing some internal reasoning).

Comment on lines 59 to 64
// Create a default SessionContext for the protobuf parsing
// TODO: This loses the original function registry, but DataFusion 50.0.0 requires SessionContext
let session_ctx = SessionContext::new();
let partioning = parse_protobuf_partitioning(
partitioning.as_ref(),
registry,
&session_ctx,
adriangb (Contributor, Author)

I'm not sure what DataFusion intends to do here, or what the consequences of this are. It needs some investigation.


pub fn settings() -> insta::Settings {
env::set_var("INSTA_WORKSPACE_ROOT", env!("CARGO_MANIFEST_DIR"));
unsafe {
adriangb (Contributor, Author)

This comes from the 2024 edition: std::env::set_var is now an unsafe fn, so the call has to move inside an unsafe block.
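A minimal sketch of the wrapped call, keeping the names from the snippet above:

    // 2024 edition: std::env::set_var is an `unsafe` fn, because mutating the
    // process environment can race with other threads that read it.
    unsafe {
        env::set_var("INSTA_WORKSPACE_ROOT", env!("CARGO_MANIFEST_DIR"));
    }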

@adriangb adriangb marked this pull request as ready for review September 23, 2025 01:23
Comment on lines 2 to 3
channel = "1.85.1"
profile = "default"
adriangb (Contributor, Author)

This pins contributors and CI to one specific toolchain version. The MSRV in Cargo.toml is only a minimum version, which is much more flexible.
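For contrast, a hedged sketch of the two approaches (the exact contents in this repo may differ):

    # rust-toolchain.toml: pins every contributor and CI run to exactly this toolchain
    [toolchain]
    channel = "1.85.1"
    profile = "default"

    # Cargo.toml: declares only a lower bound; any toolchain >= 1.85.1 is accepted
    [package]
    rust-version = "1.85.1"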

Cargo.toml Outdated
arrow = { version = "56.1.0", optional = true }
tokio-stream = { version = "0.1.17", optional = true }
hyper-util = { version = "0.1.16", optional = true }
chrono = { version = "0.4.42", optional = true }
adriangb (Contributor, Author)

Same as #143, need to rebase

@adriangb (Contributor, Author)

This is likely to end up having conflicts - @robtandy @gabotechs could you review?

@adriangb (Contributor, Author)

@robtandy there's now a large diff from running cargo fmt under the 2024 edition. I split that change out into #152, which we should handle before this PR; that will make this PR much smaller and more manageable.

@gabotechs (Collaborator)

Awesome! 🔥 Thanks for this.

It would be great if we could delay this upgrade for at least one or two weeks. Unfortunately we do not support DataFusion v50 yet, and this upgrade will block our internal testing.

We also do not want to bother other contributors, so here's my ask: would it be fine to delay this upgrade just one or two weeks? Hopefully by that time we can have v50 running on our system.

If by then we have not yet upgraded we can just merge this and we'll figure it out.

@adriangb (Contributor, Author)

Yeah no problem, it's all experimental on our end, we can work off of a branch

@@ -88,7 +88,7 @@ mod tests {
│ HashJoinExec: mode=CollectLeft, join_type=Inner, on=[(p_partkey@0, ps_partkey@0)], projection=[p_partkey@0, p_mfgr@1, ps_suppkey@3, ps_supplycost@4]
│ CoalescePartitionsExec
│ NetworkCoalesceExec read_from=Stage 2, output_partitions=8, input_tasks=4
│ DataSourceExec: file_groups={6 groups: [[/testdata/tpch/data/partsupp/1.parquet:0..753508, /testdata/tpch/data/partsupp/10.parquet:0..748125, /testdata/tpch/data/partsupp/11.parquet:0..496934], [/testdata/tpch/data/partsupp/11.parquet:496934..744689, /testdata/tpch/data/partsupp/12.parquet:0..749068, /testdata/tpch/data/partsupp/13.parquet:0..750296, /testdata/tpch/data/partsupp/14.parquet:0..251448], [/testdata/tpch/data/partsupp/14.parquet:251448..748674, /testdata/tpch/data/partsupp/15.parquet:0..744298, /testdata/tpch/data/partsupp/16.parquet:0..748035, /testdata/tpch/data/partsupp/2.parquet:0..9008], [/testdata/tpch/data/partsupp/2.parquet:9008..753010, /testdata/tpch/data/partsupp/3.parquet:0..756405, /testdata/tpch/data/partsupp/4.parquet:0..498160], [/testdata/tpch/data/partsupp/4.parquet:498160..746724, /testdata/tpch/data/partsupp/5.parquet:0..746720, /testdata/tpch/data/partsupp/6.parquet:0..750184, /testdata/tpch/data/partsupp/7.parquet:0..253099], ...]}, projection=[ps_partkey, ps_suppkey, ps_supplycost], file_type=parquet
│ DataSourceExec: file_groups={6 groups: [[/testdata/tpch/data/partsupp/1.parquet:0..753508, /testdata/tpch/data/partsupp/10.parquet:0..748125, /testdata/tpch/data/partsupp/11.parquet:0..496934], [/testdata/tpch/data/partsupp/11.parquet:496934..744689, /testdata/tpch/data/partsupp/12.parquet:0..749068, /testdata/tpch/data/partsupp/13.parquet:0..750296, /testdata/tpch/data/partsupp/14.parquet:0..251448], [/testdata/tpch/data/partsupp/14.parquet:251448..748674, /testdata/tpch/data/partsupp/15.parquet:0..744298, /testdata/tpch/data/partsupp/16.parquet:0..748035, /testdata/tpch/data/partsupp/2.parquet:0..9008], [/testdata/tpch/data/partsupp/2.parquet:9008..753010, /testdata/tpch/data/partsupp/3.parquet:0..756405, /testdata/tpch/data/partsupp/4.parquet:0..498160], [/testdata/tpch/data/partsupp/4.parquet:498160..746724, /testdata/tpch/data/partsupp/5.parquet:0..746720, /testdata/tpch/data/partsupp/6.parquet:0..750184, /testdata/tpch/data/partsupp/7.parquet:0..253099], ...]}, projection=[ps_partkey, ps_suppkey, ps_supplycost], file_type=parquet, predicate=DynamicFilterPhysicalExpr [ true ] AND DynamicFilterPhysicalExpr [ true ]
adriangb (Contributor, Author)

The diff in all of these is the extra DynamicFilterPhysicalExpr [ true ].

I'm guessing we end up with multiple of them because multiple optimizer passes are being run? I'm not sure, tbh.

They also obviously aren't doing anything useful (they will never be updated from true), but they shouldn't add much overhead either (it's just an extra literal true to evaluate during expression evaluation).

Changing this would require a lot of upstream work in DataFusion, which we want to do anyway, but I don't think it would be a good idea to block this PR on that.

@adriangb (Contributor, Author)

@gabotechs I've rebased on the other PRs. No problem with waiting. But there are some open questions here that maybe we could start to think about while we wait?

@gabotechs gabotechs left a comment (Collaborator)

Thanks for taking the time to put this together 😄! Just a couple of comments, but otherwise it's looking good.

Comment on lines +85 to +98
let stage =
stage_from_proto(stage_proto, &ctx, &self.runtime, &codec).map_err(|err| {
gabotechs (Collaborator)

We need the context here to contain all the UDFs, UDAFs, and anything else necessary to properly decode the stage node. Before, we were passing &session_state, which is the one built by the user and contains all their custom registrations, but now we are passing a freshly built SessionContext::new() that knows nothing about the custom things provided by the user in &session_state.

This means that we should probably be passing:

let ctx = SessionContext::from(session_state);

instead
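A minimal sketch of how that would look in the snippet above, assuming session_state is still in scope at that point (the existing error mapping is elided):

    // Build the decoding context from the user's SessionState so custom UDFs,
    // UDAFs, etc. are visible when decoding the stage node, instead of using a
    // bare SessionContext::new() that knows nothing about them.
    let ctx = SessionContext::from(session_state.clone());
    let stage = stage_from_proto(stage_proto, &ctx, &self.runtime, &codec)?;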

@gabotechs (Collaborator)

🟢 green light to ship this upgrade!

I think the only things missing are addressing this comment and making the pipelines pass.

@adriangb (Contributor, Author)

> 🟢 green light to ship this upgrade!
>
> I think the only things missing are addressing this comment and making the pipelines pass.

Great news! I did a rebase on this repo's main and addressed #146 (comment). Unfortunately there is now a failing test 😢

@gabotechs (Collaborator)

gabotechs commented Oct 1, 2025

> Great news! I did a rebase on this repo's main and addressed #146 (comment). Unfortunately there is now a failing test 😢

Yeap, I also saw it failing when trying to point this repo to our https://github.com/DataDog/datafusion fork. cc @jayshrivastava: it's one of the new tests for #160, and it seems to not play well with DataFusion 50.

It should be fine to leave it ignored in this PR I think.

@jayshrivastava (Collaborator)

+1 please feel free to skip/ignore the failing unit test and file a corresponding ticket :)

The issue is that this query no longer gets distributed under this config:

SELECT sum(balance) / 7.0 as avg_yearly
FROM table2
WHERE name LIKE 'customer%'
  AND balance < (
    SELECT 0.2 * avg(balance)
    FROM table2 t2_inner
    WHERE t2_inner.id = table2.id
  )

let config = SessionConfig::new().with_target_partitions(2);

let state = SessionStateBuilder::new()
    .with_default_features()
    .with_config(config)
    .with_distributed_channel_resolver(InMemoryChannelResolver::new())
    .with_physical_optimizer_rule(Arc::new(
        DistributedPhysicalOptimizerRule::default()
            .with_network_coalesce_tasks(2)
            .with_network_shuffle_tasks(2),
    ))
    .build();

@adriangb (Contributor, Author)

adriangb commented Oct 1, 2025

Skipped the test.
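For reference, a sketch of what the skip can look like; the test name, attribute message, and tokio harness below are placeholders rather than the actual test:

    // Skipped until the distribution regression under DataFusion 50 is resolved.
    #[ignore = "query no longer gets distributed under DataFusion 50"]
    #[tokio::test]
    async fn correlated_subquery_is_distributed() {
        // ...
    }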

@adriangb (Contributor, Author)

adriangb commented Oct 1, 2025

I'm not sure what to do about the TPCH tests. They pass locally; I tried regenerating the data.

@gabotechs (Collaborator)

> I'm not sure what to do about the TPCH tests. They pass locally; I tried regenerating the data.

We probably need to tell insta to normalize those random integer ranges, which seem to vary slightly depending on the machine running the test.

Something like adding:

    settings.add_filter(r"\d+\.\.\d+", "<int>..<int>");

to the shared insta settings helper should work.
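Concretely, a sketch of the shared settings helper with that filter added (assuming the helper shown earlier in this PR; the exact body may differ):

    pub fn settings() -> insta::Settings {
        // env::set_var must be wrapped in `unsafe` on the 2024 edition.
        unsafe {
            env::set_var("INSTA_WORKSPACE_ROOT", env!("CARGO_MANIFEST_DIR"));
        }
        let mut settings = insta::Settings::clone_current();
        // Normalize machine-dependent byte ranges like `0..753508` in the
        // file_groups output so the TPCH plan snapshots match across machines.
        settings.add_filter(r"\d+\.\.\d+", "<int>..<int>");
        settings
    }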

@adriangb (Contributor, Author)

adriangb commented Oct 2, 2025

Thanks @gabotechs, we're 🟢 now!

@gabotechs gabotechs left a comment (Collaborator)

💯 let's go

@gabotechs gabotechs merged commit d2b392d into datafusion-contrib:main Oct 2, 2025
4 checks passed
@adriangb adriangb deleted the update-50 branch October 2, 2025 12:45
@adriangb (Contributor, Author)

adriangb commented Oct 2, 2025

Thanks for getting this across the line! We should probably create a tracking issue to integrate the upstream change in apache/datafusion#17601 when DataFusion 50.1.0 is released.

@adriangb (Contributor, Author)

adriangb commented Oct 2, 2025

#173
