
Conversation

@jayshrivastava (Collaborator) commented Nov 19, 2025

This change introduces a new property_based.rs test utility which lets us evaluate
correctness using properties. These are useful when we do not know the expected output
of a test (e.g. if we were to fuzz the database with randomized data or randomized
queries, we can only verify the output using properties). The two oracles are:

  • ResultSetOracle: Compares the result sets produced by single-node and distributed DataFusion
  • OrderingOracle: Uses plan properties to figure out the expected ordering and asserts it

This change does not introduce a fuzz test, but it does introduce a TPC-DS test. This test
randomly generates data using the duckdb CLI and runs the 99 TPC-DS queries on a distributed
cluster. The query outputs are validated against single-node DataFusion using the test utils
in metamorphic.rs. This test also randomizes the test cluster parameters; there's no harm
in doing so.
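
For illustration, here is a minimal sketch of what the result-set check boils down to (the context names and the helper function are placeholders, not the actual API in the test utils):

use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Sketch of a result-set oracle: run the same SQL against a distributed and a
// single-node SessionContext, then compare the pretty-printed rows after sorting
// so the check is insensitive to row order.
async fn assert_same_result_set(
    distributed_ctx: &SessionContext, // hypothetical: ctx wired to the distributed cluster
    single_node_ctx: &SessionContext, // hypothetical: plain single-node DataFusion ctx
    sql: &str,
) -> Result<()> {
    let distributed = distributed_ctx.sql(sql).await?.collect().await?;
    let single_node = single_node_ctx.sql(sql).await?.collect().await?;

    let mut left: Vec<String> = pretty_format_batches(&distributed)?
        .to_string()
        .lines()
        .map(str::to_owned)
        .collect();
    let mut right: Vec<String> = pretty_format_batches(&single_node)?
        .to_string()
        .lines()
        .map(str::to_owned)
        .collect();
    left.sort();
    right.sort();
    assert_eq!(left, right, "distributed and single-node results differ for: {sql}");
    Ok(())
}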

Next steps:

  • Add fuzzing
    • Now that we have property-based testing utils, we can properly fuzz the project
      using SQLancer
    • SQLancer produces INSERT and SELECT statements which we could point at a distributed
      DataFusion cluster and verify against single-node DataFusion
    • Although it doesn't support nested SELECT statements, 70% of the queries it generated
      were valid DataFusion queries, meaning these are good test cases for us
  • Add a metrics oracle to validate the output_rows metric / metrics propagation

jayshrivastava changed the title from Js/tpcds fuzz to Add fuzzing infrastructure for distributed DataFusion on Nov 19, 2025
jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from 4684bbe to 281c72e, on November 19, 2025 at 22:43
jayshrivastava marked this pull request as ready for review on November 19, 2025 at 22:43
jayshrivastava changed the title from Add fuzzing infrastructure for distributed DataFusion to add tpc-ds tests and property-based testing utilities on Nov 20, 2025
jayshrivastava force-pushed the js/tpcds-fuzz branch 10 times, most recently from 9bd3bfc to ce974fc, on November 20, 2025 at 19:07
jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from d5d6f42 to 14e923a, on November 20, 2025 at 20:29
jayshrivastava force-pushed the js/tpcds-fuzz branch 2 times, most recently from 9f7e288 to aaab00b, on November 20, 2025 at 21:07
@jayshrivastava (Collaborator Author) commented:

@gabotechs This is ready for another review 🙇🏽

Comment on lines 60 to 67
- name: Upload test artifacts on failure
  if: failure() || steps.test.outcome == 'failure'
  uses: actions/upload-artifact@v4
  with:
    name: tpcds-test-artifacts-${{ github.run_id }}
    path: testdata/tpcds/data/**
    retention-days: 7
    if-no-files-found: ignore
Collaborator

🤔 why would we want to upload the data on failure as an artifact?

Collaborator Author

Ah I previously thought that data gen was random, but it's deterministic based on the scale factor. Removed.

Comment on lines 68 to 73
- name: Clean up test data
  run: |
    rm -rf testdata/tpcds/data/*
    rm -f $HOME/.local/bin/duckdb
    rm -rf /home/runner/.duckdb
    df -h
Collaborator

GitHub runners are stateless: each job runs on a freshly spawned machine that is destroyed after the run, so this cleanup step is unnecessary.

Collaborator Author

Done

Comment on lines +51 to +56
run: |
  curl https://install.duckdb.org | sh
  mkdir -p $HOME/.local/bin
  mv /home/runner/.duckdb/cli/latest/duckdb $HOME/.local/bin/
  echo "$HOME/.local/bin" >> $GITHUB_PATH
Collaborator

How is it not sufficient to just do:

curl https://install.duckdb.org | sh

I imagine that will already install the cli in the appropriate place accessible from $PATH

Collaborator Author

Looking at the script, it does not add /home/runner/.duckdb/cli/latest/duckdb to the $PATH.

Cargo.toml Outdated
bytes = "1.10.1"

# integration_tests deps
base64 = { version = "0.22", optional = true }
Collaborator

Shouldn't this and rand_chacha also be added to dev-dependencies?

Collaborator Author

Removed both dependencies. Since we aren't randomizing anything, we don't need either. Removed the rand test util as well.

Comment on lines 29 to 53
pub async fn new(test_ctx: SessionContext, compare_ctx: SessionContext) -> Result<Self> {
    let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
        vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

    Ok(Validator {
        test_ctx,
        compare_ctx,
        oracles,
    })
}

/// Create a new Validator with ordering checks enabled.
pub async fn new_with_ordering(
    test_ctx: SessionContext,
    compare_ctx: SessionContext,
) -> Result<Self> {
    let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
        vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

    Ok(Validator {
        test_ctx,
        compare_ctx,
        oracles,
    })
}
Collaborator

These two functions are exactly the same, and the second one is unused, so I think it can be removed

Collaborator Author

Done. Removed as a part of the refactor

Comment on lines 21 to 24
let num_workers = rng.gen_range(3..=8);
let files_per_task = rng.gen_range(2..=4);
let cardinality_task_count_factor = rng.gen_range(1.1..=3.0);

Collaborator

I'm still not convinced about this idea. If this happens to pick num_workers=3 and cardinality_task_count_factor=3.0, then very little will actually get distributed, making this test randomly succeed in situations where it should be failing.

I'm struggling to find value in randomizing these parameters. It seems like we are trying to brute-force our way into finding issues that could perfectly well be reproduced with specific, well-chosen numbers, and I'm afraid the result of the randomness is just going to be flaky tests instead.

Collaborator

In other words, by the same reasoning that leads us to randomize config values in this test, we should also be randomizing config values in every other test in this project. I don't think the TPC-DS tests have anything special over all the other tests that makes them more suitable for randomized parameters.

Collaborator Author

Done. I hardcoded these:

const NUM_WORKERS: usize = 4;
const FILES_PER_TASK: usize = 2;
const CARDINALITY_TASK_COUNT_FACTOR: f64 = 2.0;

Let me know if different numbers make more sense.

Comment on lines 90 to 100
eprintln!(
    "Test summary - Success: {} Invalid: {} Failed: {} Valid %: {:.2}%",
    successful,
    invalid,
    failed,
    if successful + invalid > 0 {
        (successful as f64 / (successful + invalid) as f64) * 100.0
    } else {
        0.0
    }
);
@gabotechs (Collaborator) commented Nov 21, 2025

Rather than having our own print statements for reporting test results, I would arrange the tests here so that plain cargo test is what performs the reporting.

This is how it's done upstream:

https://github.com/apache/datafusion/blob/7fa2a694bbc18608a46f85974e721fafa2503219/datafusion/core/tests/tpcds_planning.rs#L32-L32

And we also do the same in this project for the TPCH test.

Having independent tests per query is also going to allow us to rerun just the query that fails, instead of running them all even if we only care about one.

Having one test per query is going to result in ~1000 LOC to write, so it's a very good opportunity to learn about Vim macros.
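
A minimal sketch of what that per-query arrangement could look like (run_tpcds_query here is a hypothetical helper, not the actual test util):

// One independent #[tokio::test] per TPC-DS query, generated by a small macro, so that
// plain `cargo test` does the reporting and a single failing query can be rerun in
// isolation, e.g. `cargo test tpcds_q72`.
macro_rules! tpcds_test {
    ($name:ident, $query_no:expr) => {
        #[tokio::test]
        async fn $name() -> datafusion::common::Result<()> {
            run_tpcds_query($query_no).await
        }
    };
}

tpcds_test!(tpcds_q1, 1);
tpcds_test!(tpcds_q2, 2);
// ... one invocation per query, up to tpcds_q99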

Collaborator Author

Done. I did not learn Vim macros 😅; I asked Claude.

Comment on lines 95 to 101
async fn validate(
    &self,
    _test_ctx: &SessionContext,
    compare_ctx: &SessionContext,
    query: &str,
    test_result: &Result<Vec<RecordBatch>>,
) -> Result<()> {
Collaborator

Do you think this can be simplified to just use a couple of functions?

Having this trait here is:

  • forcing you to do some juggling by ignoring certain arguments in certain implementations
  • running the query unnecessarily twice in single-node mode, as in the OrderingOracle (first in Validator::run and again in OrderingOracle::validate)
  • having an extra struct that glues all implementations together

I think that unless we start using a fuzzing framework that requires us to implement some traits, we are perfectly fine with simple pure functions; for example, something like this:

be2c567

It trims ~100 lines of extra traits and structs while running the same code.
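
As an illustration of that function-based shape (not the actual code in that commit), an ordering check can be a standalone helper; this sketch assumes the expected sort column is already known and is a non-nullable Int64:

use datafusion::arrow::array::{Array, Int64Array};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;

// Plain function instead of an Oracle impl: assert that the collected batches are
// non-decreasing on a single Int64 column.
fn assert_sorted_ascending(batches: &[RecordBatch], column: &str) -> Result<()> {
    let mut prev: Option<i64> = None;
    for batch in batches {
        let idx = batch.schema().index_of(column)?;
        let array = batch
            .column(idx)
            .as_any()
            .downcast_ref::<Int64Array>()
            .expect("sketch assumes an Int64 sort column");
        for i in 0..array.len() {
            let value = array.value(i);
            if let Some(previous) = prev {
                assert!(
                    previous <= value,
                    "rows out of order on {column}: {previous} > {value}"
                );
            }
            prev = Some(value);
        }
    }
    Ok(())
}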

Collaborator Author

Done. I copied your commit

jayshrivastava and others added 3 commits November 21, 2025 13:47
This commit adds tpcds-randomized-test to CI. It relies on the duckdb CLI for TPC-DS database generation. It also saves the artifacts if the test fails so we can reproduce issues.
jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from 862613f to 9dc1b85, on November 21, 2025 at 19:20
@gabotechs (Collaborator) left a comment

Great!

echo "Scale factor must be greater than or equal to 0"
exit 1
fi

Collaborator

One trick for ensuring this script can be run from an arbitrary directory is to do something like this:

# https://stackoverflow.com/questions/59895/how-do-i-get-the-directory-where-a-bash-script-is-located-from-within-the-script
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

Like what upstream has here:
https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh#L28-L30

It makes the script a bit more future-proof.

Comment on lines +5 to +7
- Queries 47 and 57 were modified to add explicit ORDER BY d_moy to avg() window function. DataFusion requires explicit ordering in window functions with PARTITION BY for deterministic results.
- Query 72 was modified to support date functions in datafusion

Collaborator

👍 I also faced this in one TPCH query. What I did was to just do a string replacement in the test itself in order to keep the .sql files untouched, but what you did here is also fine

    let sql = get_test_tpch_query(10);
    // There is a chance that this query returns non-deterministic results if two entries
    // happen to have the exact same revenue. With small scales, this never happens, but with
    // bigger scales, this is very likely to happen.
    // This extra ordering accounts for it.
    let sql = sql.replace("revenue desc", "revenue, c_acctbal desc");
    test_tpch_query(sql).await
}

ctx: &SessionContext,
query_sql: &str,
) -> (Arc<dyn ExecutionPlan>, Result<Vec<RecordBatch>>) {
let df = ctx.sql(&query_sql).await.unwrap();
Collaborator

nit: you should be fine with:

Suggested change:
- let df = ctx.sql(&query_sql).await.unwrap();
+ let df = ctx.sql(query_sql).await.unwrap();

.gitignore Outdated
/benchmarks/data/
testdata/tpch/*
!testdata/tpch/queries
testdata/tpch/data/
Collaborator

Suggested change:
- testdata/tpch/data/

This can be removed; it's already ignored on line 4.

@@ -0,0 +1,587 @@
#[cfg(all(feature = "integration", feature = "tpcds", test))]
mod tests {
Collaborator

These tests are failing for me: it created a bunch of empty folders and now it won't try to regenerate them again.

Probably something went wrong along the way, but I did not get any error message at all, just test failures claiming "column not found".

jayshrivastava merged commit ce5218b into main on Dec 4, 2025
5 checks passed
jayshrivastava deleted the js/tpcds-fuzz branch on December 4, 2025 at 22:36