
Conversation

@jayshrivastava (Collaborator) commented Nov 19, 2025

This change introduces a new property_based.rs test utility which lets us evaluate
correctness using properties. These are useful when we do not know the expected output
of a test (e.g. if we were to fuzz the database with randomized data or randomized
queries, we can only verify the output using properties). The two oracles are:

  • ResultSetOracle: Compares the result sets produced by single-node and distributed DataFusion
  • OrderingOracle: Uses plan properties to figure out the expected ordering and asserts it

This change does not introduce a fuzz test, but it does introduce a TPC-DS test. This test
randomly generates data using the duckdb CLI and runs the 99 TPC-DS queries on a distributed
cluster. The query outputs are validated against single-node DataFusion using the test utils
in metamorphic.rs. This test also randomizes the test cluster parameters; there's no harm
in doing so.
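
For illustration, here is a minimal sketch of what the result-set check boils down to (the context names and the helper function are placeholders, not the actual API in the test utils):

use datafusion::arrow::util::pretty::pretty_format_batches;
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

// Sketch of a result-set oracle: run the same SQL against a distributed and a
// single-node SessionContext, then compare the pretty-printed rows after sorting
// so the check is insensitive to row order.
async fn assert_same_result_set(
    distributed_ctx: &SessionContext, // hypothetical: ctx wired to the distributed cluster
    single_node_ctx: &SessionContext, // hypothetical: plain single-node DataFusion ctx
    sql: &str,
) -> Result<()> {
    let distributed = distributed_ctx.sql(sql).await?.collect().await?;
    let single_node = single_node_ctx.sql(sql).await?.collect().await?;

    let mut left: Vec<String> = pretty_format_batches(&distributed)?
        .to_string()
        .lines()
        .map(str::to_owned)
        .collect();
    let mut right: Vec<String> = pretty_format_batches(&single_node)?
        .to_string()
        .lines()
        .map(str::to_owned)
        .collect();
    left.sort();
    right.sort();
    assert_eq!(left, right, "distributed and single-node results differ for: {sql}");
    Ok(())
}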

Next steps:

  • Add fuzzing
    • Now that we have property-based testing utils, we can properly fuzz the project
      using SQLancer
    • SQLancer produces INSERT and SELECT statements which we could point at a distributed
      DataFusion cluster and verify against single-node DataFusion
    • Although it doesn't support nested SELECT statements, 70% of the queries it generated
      were valid DataFusion queries, meaning these are good test cases for us
  • Add a metrics oracle to validate the output_rows metric / metrics propagation

jayshrivastava changed the title from Js/tpcds fuzz to Add fuzzing infrastructure for distributed DataFusion on Nov 19, 2025
jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from 4684bbe to 281c72e, on November 19, 2025 at 22:43
jayshrivastava marked this pull request as ready for review on November 19, 2025 at 22:43
jayshrivastava changed the title from Add fuzzing infrastructure for distributed DataFusion to add tpc-ds tests and property-based testing utilities on Nov 20, 2025
jayshrivastava force-pushed the js/tpcds-fuzz branch 10 times, most recently from 9bd3bfc to ce974fc, on November 20, 2025 at 19:07
jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from d5d6f42 to 14e923a, on November 20, 2025 at 20:29
jayshrivastava force-pushed the js/tpcds-fuzz branch 2 times, most recently from 9f7e288 to aaab00b, on November 20, 2025 at 21:07
@jayshrivastava (Collaborator Author) commented:

@gabotechs This is ready for another review 🙇🏽

Comment on lines 60 to 67
- name: Upload test artifacts on failure
  if: failure() || steps.test.outcome == 'failure'
  uses: actions/upload-artifact@v4
  with:
    name: tpcds-test-artifacts-${{ github.run_id }}
    path: testdata/tpcds/data/**
    retention-days: 7
    if-no-files-found: ignore
Collaborator

🤔 why would we want to upload the data on failure as an artifact?

Collaborator Author

Ah I previously thought that data gen was random, but it's deterministic based on the scale factor. Removed.

Comment on lines 68 to 73
- name: Clean up test data
  run: |
    rm -rf testdata/tpcds/data/*
    rm -f $HOME/.local/bin/duckdb
    rm -rf /home/runner/.duckdb
    df -h
Collaborator

GitHub runners are stateless: each job runs on a freshly spawned machine that is destroyed after the run, so this cleanup step is unnecessary.

Collaborator Author

Done

Comment on lines +51 to +56
run: |
  curl https://install.duckdb.org | sh
  mkdir -p $HOME/.local/bin
  mv /home/runner/.duckdb/cli/latest/duckdb $HOME/.local/bin/
  echo "$HOME/.local/bin" >> $GITHUB_PATH
Collaborator

How is it not sufficient to just do:

curl https://install.duckdb.org | sh

I imagine that will already install the cli in the appropriate place accessible from $PATH

Collaborator Author

Looking at the script, it does not add /home/runner/.duckdb/cli/latest/duckdb to the $PATH.

Cargo.toml Outdated
bytes = "1.10.1"

# integration_tests deps
base64 = { version = "0.22", optional = true }
Collaborator

Shouldn't this and rand_chacha also be added to dev-dependencies?

Collaborator Author

Removed both dependencies. Since we aren't randomizing anything, we don't need either. Removed the rand test util as well.

Comment on lines 29 to 53
pub async fn new(test_ctx: SessionContext, compare_ctx: SessionContext) -> Result<Self> {
    let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
        vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

    Ok(Validator {
        test_ctx,
        compare_ctx,
        oracles,
    })
}

/// Create a new Validator with ordering checks enabled.
pub async fn new_with_ordering(
    test_ctx: SessionContext,
    compare_ctx: SessionContext,
) -> Result<Self> {
    let oracles: Vec<Box<dyn Oracle + Send + Sync>> =
        vec![Box::new(ResultSetOracle {}), Box::new(OrderingOracle {})];

    Ok(Validator {
        test_ctx,
        compare_ctx,
        oracles,
    })
}
Collaborator

These two functions are exactly the same, and the second one is unused, so I think it can be removed

Collaborator Author

Done. Removed as a part of the refactor

Comment on lines 21 to 24
let num_workers = rng.gen_range(3..=8);
let files_per_task = rng.gen_range(2..=4);
let cardinality_task_count_factor = rng.gen_range(1.1..=3.0);

Collaborator

I'm still not convinced about this idea. If this happens to pick num_workers=3 and cardinality_task_count_factor=3.0, then very little will actually get distributed, making this test randomly succeed in situations where it should be failing.

I'm struggling to find value in randomizing these parameters. It seems like we are trying to brute-force our way into finding issues that could perfectly well be reproduced with specific, well-chosen numbers, and I'm afraid the result of the randomness is just going to be flaky tests instead.

Collaborator

In other words, by the same reasoning that leads us to randomize config values in this test, we should also be randomizing config values in every other test in this project. I don't think the TPC-DS tests have anything special over all the other tests that makes them more suitable for randomized parameters.

Collaborator Author

Done. I hardcoded these:

const NUM_WORKERS: usize = 4;
const FILES_PER_TASK: usize = 2;
const CARDINALITY_TASK_COUNT_FACTOR: f64 = 2.0;

Let me know if different numbers make more sense.

Comment on lines 90 to 100
eprintln!(
    "Test summary - Success: {} Invalid: {} Failed: {} Valid %: {:.2}%",
    successful,
    invalid,
    failed,
    if successful + invalid > 0 {
        (successful as f64 / (successful + invalid) as f64) * 100.0
    } else {
        0.0
    }
);
@gabotechs (Collaborator) commented Nov 21, 2025

Rather than having our own print statements for reporting test results, I would arrange the tests here so that plain cargo test is what performs the reporting.

This is how it's done upstream:

https://github.com/apache/datafusion/blob/7fa2a694bbc18608a46f85974e721fafa2503219/datafusion/core/tests/tpcds_planning.rs#L32-L32

And we also do the same in this project for the TPCH test.

Having independent tests per query is also going to allow us to rerun just the query that fails, instead of running them all even if we only care about one.

Having one test per query is going to result in ~1000 LOC to write, so it's a very good opportunity to learn about Vim macros.
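
A minimal sketch of what that per-query arrangement could look like (run_tpcds_query here is a hypothetical helper, not the actual test util):

// One independent #[tokio::test] per TPC-DS query, generated by a small macro, so that
// plain `cargo test` does the reporting and a single failing query can be rerun in
// isolation, e.g. `cargo test tpcds_q72`.
macro_rules! tpcds_test {
    ($name:ident, $query_no:expr) => {
        #[tokio::test]
        async fn $name() -> datafusion::common::Result<()> {
            run_tpcds_query($query_no).await
        }
    };
}

tpcds_test!(tpcds_q1, 1);
tpcds_test!(tpcds_q2, 2);
// ... one invocation per query, up to tpcds_q99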

Collaborator Author

Done. I did not learn Vim macros 😅; I asked Claude.

Comment on lines 95 to 101
async fn validate(
    &self,
    _test_ctx: &SessionContext,
    compare_ctx: &SessionContext,
    query: &str,
    test_result: &Result<Vec<RecordBatch>>,
) -> Result<()> {
Collaborator

Do you think this can be simplified to just use a couple of functions?

Having this trait here is:

  • forcing you to do some juggling by ignoring certain arguments in certain implementations
  • running the query unnecessarily twice in single-node mode, as in the OrderingOracle (first in Validator::run and again in OrderingOracle::validate)
  • having an extra struct that glues all implementations together

I think that unless we start using a fuzzing framework that requires us to implement some traits, we are perfectly fine with simple pure functions; for example, something like this:

be2c567

It trims ~100 lines of extra traits and structs while running the same code.
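
As an illustration of that function-based shape (not the actual code in that commit), an ordering check can be a standalone helper; this sketch assumes the expected sort column is already known and is a non-nullable Int64:

use datafusion::arrow::array::{Array, Int64Array};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::Result;

// Plain function instead of an Oracle impl: assert that the collected batches are
// non-decreasing on a single Int64 column.
fn assert_sorted_ascending(batches: &[RecordBatch], column: &str) -> Result<()> {
    let mut prev: Option<i64> = None;
    for batch in batches {
        let idx = batch.schema().index_of(column)?;
        let array = batch
            .column(idx)
            .as_any()
            .downcast_ref::<Int64Array>()
            .expect("sketch assumes an Int64 sort column");
        for i in 0..array.len() {
            let value = array.value(i);
            if let Some(previous) = prev {
                assert!(
                    previous <= value,
                    "rows out of order on {column}: {previous} > {value}"
                );
            }
            prev = Some(value);
        }
    }
    Ok(())
}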

Collaborator Author

Done. I copied your commit

jayshrivastava and others added 3 commits November 21, 2025 13:47
This commit adds tpcds-randomized-test to CI. It relies on the duckdb CLI for TPC-DS database generation. It also saves the artifacts if the test fails so we can reproduce issues.
jayshrivastava force-pushed the js/tpcds-fuzz branch 6 times, most recently from 862613f to 9dc1b85, on November 21, 2025 at 19:20
@gabotechs (Collaborator) left a comment

Great!

echo "Scale factor must be greater than or equal to 0"
exit 1
fi

Collaborator

One trick for ensuring this script can be run from an arbitrary directory is to do something like this:

# https://stackoverflow.com/questions/59895/how-do-i-get-the-directory-where-a-bash-script-is-located-from-within-the-script
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

Like what upstream has here:
https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh#L28-L30

It makes the script a bit more future-proof.

Comment on lines +5 to +7
- Queries 47 and 57 were modified to add explicit ORDER BY d_moy to avg() window function. DataFusion requires explicit ordering in window functions with PARTITION BY for deterministic results.
- Query 72 was modified to support date functions in datafusion

Collaborator

👍 I also faced this in one TPCH query. What I did was to just do a string replacement in the test itself in order to keep the .sql files untouched, but what you did here is also fine

    let sql = get_test_tpch_query(10);
    // There is a chance that this query returns non-deterministic results if two entries
    // happen to have the exact same revenue. With small scales, this never happens, but with
    // bigger scales, this is very likely to happen.
    // This extra ordering accounts for it.
    let sql = sql.replace("revenue desc", "revenue, c_acctbal desc");
    test_tpch_query(sql).await
}

ctx: &SessionContext,
query_sql: &str,
) -> (Arc<dyn ExecutionPlan>, Result<Vec<RecordBatch>>) {
let df = ctx.sql(&query_sql).await.unwrap();
Collaborator

nit: you should be fine with:

Suggested change:
- let df = ctx.sql(&query_sql).await.unwrap();
+ let df = ctx.sql(query_sql).await.unwrap();

.gitignore Outdated
/benchmarks/data/
testdata/tpch/*
!testdata/tpch/queries
testdata/tpch/data/
Collaborator

Suggested change:
- testdata/tpch/data/

This can be removed; it's already ignored on line 4.

@@ -0,0 +1,587 @@
#[cfg(all(feature = "integration", feature = "tpcds", test))]
mod tests {
Collaborator

These tests are failing for me: it created a bunch of empty folders and now it won't try to regenerate them again.

Probably something went wrong along the way, but I did not get any error message at all, just test failures claiming "column not found".

jayshrivastava merged commit ce5218b into main on Dec 4, 2025
5 checks passed
jayshrivastava deleted the js/tpcds-fuzz branch on December 4, 2025 at 22:36