@gabotechs
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

Running ./bench.sh run tpcds right after generating the data with ./bench.sh data tpcds fails with the following error:

Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds

This PR fixes it.

What changes are included in this PR?

Fixes the TPCDS_DIR variable in run_tpcds

Are these changes tested?

just benchmark scripts

Are there any user-facing changes?

No.

Contributor

@comphead comphead left a comment

Thanks @gabotechs, but I think it shouldn't be there. By default the script checks the datafusion-benchmarks repo here https://github.com/apache/datafusion-benchmarks/tree/main/tpcds/data/sf1 and there is no tpcds-sf1 there.

You can specify your own DATA_DIR like

export DATA_DIR=../../datafusion-benchmarks/tpcds/data/sf1/
and then run the TPC-DS benchmarks.

@gabotechs
Contributor Author

gabotechs commented Jan 12, 2026

🤔 Are you sure? I get the impression that this is why the benchmark run commands are failing:

#19761 (comment)

Also, note how the data_tpcds() function counterpart actually has this same line:

https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh#L633

# Downloads TPC-DS data
data_tpcds() {
    TPCDS_DIR="${DATA_DIR}/tpcds_sf1"
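For illustration, here is a minimal sketch of one way to keep the data and run steps pointing at the same directory. The helpers tpcds_dir and check_tpcds_data are hypothetical names that do not exist in bench.sh, and the ./data fallback for DATA_DIR is an assumption made only so the snippet runs on its own. Only the "${DATA_DIR}/tpcds_sf1" path and the error text are taken from this thread.

```shell
#!/usr/bin/env bash
# Assumption: fall back to ./data when DATA_DIR is not set by the caller.
DATA_DIR="${DATA_DIR:-./data}"

# Hypothetical helper: the single place that knows where TPC-DS data lives.
tpcds_dir() {
    printf '%s\n' "${DATA_DIR}/tpcds_sf1"
}

# Hypothetical helper: fail with the same hint quoted in the PR description
# when the directory produced by the data step is missing.
check_tpcds_data() {
    local dir
    dir="$(tpcds_dir)"
    if [ ! -d "${dir}" ]; then
        echo "Please prepare TPC-DS data first by following instructions:"
        echo "  ./bench.sh data tpcds"
        return 1
    fi
    echo "Found TPC-DS data in ${dir}"
}
```

If both the download function and the run function obtained the path from one helper like this, the two could not drift apart in the way described above.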

@comphead
Contributor

comphead commented Jan 13, 2026

I just ran the TPC-DS benchmark for #19635 using the commands provided in
https://github.com/apache/datafusion/blob/main/benchmarks/README.md#comparing-performance-of-main-and-a-pr
and it worked fine.

@gabotechs
Contributor Author

gabotechs commented Jan 13, 2026

Ok, I see what happened here:

IMO it would be nicer if ./benchmarks/bench.sh data tpcds && ./benchmarks/bench.sh run tpcds worked out of the box, without requiring users to set the DATA_DIR env variable, the same way it does for the TPC-H benchmark.

In fact, I'd bet the intention behind this code here https://github.com/apache/datafusion/blob/main/benchmarks/bench.sh#L644-L646 is that it works that way, as it's explicitly extracting the contents to "${DATA_DIR}/tpcds_sf1":

        echo "Extracting TPC-DS parquet data to ${TPCDS_DIR}..."
        unzip -o -j -d "${TPCDS_DIR}" "${DATA_DIR}/datafusion-benchmarks.zip" datafusion-benchmarks-main/tpcds/data/sf1/*
        echo "TPC-DS data extracted."
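As a rough sketch of what the extraction step above does, the function below wraps the same unzip call with a guard for a missing archive. The name extract_tpcds and its two parameters are hypothetical and are not part of bench.sh. The flags are standard Info-ZIP unzip options: -o overwrites existing files without prompting, -j discards the directory structure stored in the archive, and -d sets the destination directory.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: extract the sf1 parquet files from a downloaded
# datafusion-benchmarks archive into a flat destination directory.
extract_tpcds() {
    local archive="$1"   # e.g. "${DATA_DIR}/datafusion-benchmarks.zip"
    local dest="$2"      # e.g. "${DATA_DIR}/tpcds_sf1"

    if [ ! -f "${archive}" ]; then
        echo "Archive not found: ${archive}" >&2
        return 1
    fi

    mkdir -p "${dest}"
    echo "Extracting TPC-DS parquet data to ${dest}..."
    # The pattern is quoted so that unzip, not the shell, matches it against
    # the entries inside the archive.
    unzip -o -j "${archive}" 'datafusion-benchmarks-main/tpcds/data/sf1/*' -d "${dest}"
    echo "TPC-DS data extracted."
}
```

Because of -j, the files land directly in the destination directory rather than under datafusion-benchmarks-main/tpcds/data/sf1/, which is why a run step would need to look in that same flat directory.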

However, I'm happy to follow your lead here; I can survive setting an extra env variable.

@Dandandan Dandandan added this pull request to the merge queue Jan 13, 2026
Merged via the queue into apache:main with commit 36880d8 Jan 13, 2026
28 checks passed

3 participants