Skip to content

Update tpc-h data generation#296

Open
TomAugspurger wants to merge 1 commit intorapidsai:mainfrom
TomAugspurger:tom/datagen
Open

Update tpc-h data generation#296
TomAugspurger wants to merge 1 commit intorapidsai:mainfrom
TomAugspurger:tom/datagen

Conversation

@TomAugspurger
Copy link
Copy Markdown
Contributor

This updates the data generation code in velox-testing to use the strategy / options from https://github.com/TomAugspurger/tpchgen-rs/tree/tom/sync-upstream-clean/scripts.

These rely on a fork of tpcgen-rs adds new features to control data generation. Users will need to ensure that they're in an enviroment with the appropriate tpchgen-cli on their path.

This updates the data generation code in velox-testing to use the
strategy / options from https://github.com/TomAugspurger/tpchgen-rs/tree/tom/sync-upstream-clean/scripts.

These rely on a fork of tpcgen-rs adds new features to control data generation. Users will
need to ensure that they're in an enviroment with the appropriate tpchgen-cli on their path.
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Mar 30, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Copy link
Copy Markdown
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd appreciate help on integrating this the rest of the way into velox-testing.

I've updated one call site that used benchmark_data_tools/generate_data_files.py to use the new subcommands. But I haven't attempted to handle the Python environment changes, and I haven't updated any of the other spots.

And I'd appreciate feedback on any other uses of benchmark_data_tools/generate_data_files.py. Are they using that requirements.txt? Are there places where updating that requirements.txt to be "correct" and point to the tpchgen-rs fork is harmful? For example, if you're installing from requirements.txt in a spot that runs data generation only if the data aren't present, then building tpcgen-rs from source (and possibly failing) would be unfortunate.

six==1.17.0
urllib3==2.5.0
tpchgen-cli==2.0.1
tpchgen-cli @ git+https://github.com/TomAugspurger/tpchgen-rs@tom/sync-upstream-clean#subdirectory=tpchgen-cli
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change means you'd need a rust toolchain on your path. I don't know whether most users of this script / requirements.txt will have that.

Without this, we'd attempt to pass invalid parameters to the tpcgen-cli CLI.

In general, I'd recommend using the container image instead of trying to build from source. I think the more common scenario would be mounting velox-testing (so you have access to scripts/generate_data_files.py) into a container that has the right tpchgen-cli / tpchgen-rs in it.

Copy link
Copy Markdown
Contributor

@bdice bdice Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we would just call docker via subprocess or something like that. Is that viable? I think the probability of having docker is higher than having a rust toolchain.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to @bdice's comment, I was thinking that we would be generating the tpchgen-cli binary via a docker container.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is for expressing the python dependencies.

If we want to bring in our python dependencies in a different way (through a docker image with Python and some / all of the deps), we would need to change that a level above this at the consumers of this file. In this repo that would be

pip install -q -r benchmark_data_tools/requirements.txt
and
pip install -r benchmark_data_tools/requirements.txt
.

Who else is relying on this file?

if args.output == "text":
print(f"Inspecting {data_dir}")

tables = [
Copy link
Copy Markdown
Contributor Author

@TomAugspurger TomAugspurger Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the only tpc-h specific bits of this script. But save that for later.

@TomAugspurger TomAugspurger marked this pull request as ready for review March 30, 2026 21:05
@TomAugspurger TomAugspurger requested a review from a team as a code owner March 30, 2026 21:05
Copy link
Copy Markdown
Contributor

@paul-aiyedun paul-aiyedun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please narrow the scope of this PR to just the use of the tpchgen-cli fork (without changing the generate_data_files.py interface)? We can add support for different column encodings in a subsequent PR.

Comment on lines -98 to -101
--benchmark-type tpch \
--data-dir-path "${TPCH_DATA_DIR}" \
--scale-factor 0.01 \
--convert-decimals-to-floats
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to keep the existing script interface while changing the underlying data generator.

Copy link
Copy Markdown
Contributor Author

@TomAugspurger TomAugspurger Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that a goal, and if so do you have time to handle that?

We have some scripts relying on the interface from https://github.com/TomAugspurger/tpchgen-rs/tree/tom/sync-upstream-clean. We can update those, but that'll take time.

six==1.17.0
urllib3==2.5.0
tpchgen-cli==2.0.1
tpchgen-cli @ git+https://github.com/TomAugspurger/tpchgen-rs@tom/sync-upstream-clean#subdirectory=tpchgen-cli
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to @bdice's comment, I was thinking that we would be generating the tpchgen-cli binary via a docker container.

@TomAugspurger
Copy link
Copy Markdown
Contributor Author

Apologies, but I probably won't have time to break this into smaller PRs, at least not for a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants