Conversation
This updates the data generation code in velox-testing to use the strategy / options from https://github.com/TomAugspurger/tpchgen-rs/tree/tom/sync-upstream-clean/scripts. These rely on a fork of tpcgen-rs adds new features to control data generation. Users will need to ensure that they're in an enviroment with the appropriate tpchgen-cli on their path.
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
TomAugspurger
left a comment
There was a problem hiding this comment.
I'd appreciate help on integrating this the rest of the way into velox-testing.
I've updated one call site that used benchmark_data_tools/generate_data_files.py to use the new subcommands. But I haven't attempted to handle the Python environment changes, and I haven't updated any of the other spots.
And I'd appreciate feedback on any other uses of benchmark_data_tools/generate_data_files.py. Are they using that requirements.txt? Are there places where updating that requirements.txt to be "correct" and point to the tpchgen-rs fork is harmful? For example, if you're installing from requirements.txt in a spot that runs data generation only if the data aren't present, then building tpcgen-rs from source (and possibly failing) would be unfortunate.
| six==1.17.0 | ||
| urllib3==2.5.0 | ||
| tpchgen-cli==2.0.1 | ||
| tpchgen-cli @ git+https://github.com/TomAugspurger/tpchgen-rs@tom/sync-upstream-clean#subdirectory=tpchgen-cli |
There was a problem hiding this comment.
This change means you'd need a rust toolchain on your path. I don't know whether most users of this script / requirements.txt will have that.
Without this, we'd attempt to pass invalid parameters to the tpcgen-cli CLI.
In general, I'd recommend using the container image instead of trying to build from source. I think the more common scenario would be mounting velox-testing (so you have access to scripts/generate_data_files.py) into a container that has the right tpchgen-cli / tpchgen-rs in it.
There was a problem hiding this comment.
I was thinking we would just call docker via subprocess or something like that. Is that viable? I think the probability of having docker is higher than having a rust toolchain.
There was a problem hiding this comment.
Similar to @bdice's comment, I was thinking that we would be generating the tpchgen-cli binary via a docker container.
There was a problem hiding this comment.
This file is for expressing the python dependencies.
If we want to bring in our python dependencies in a different way (through a docker image with Python and some / all of the deps), we would need to change that a level above this at the consumers of this file. In this repo that would be
and .Who else is relying on this file?
| if args.output == "text": | ||
| print(f"Inspecting {data_dir}") | ||
|
|
||
| tables = [ |
There was a problem hiding this comment.
This might be the only tpc-h specific bits of this script. But save that for later.
paul-aiyedun
left a comment
There was a problem hiding this comment.
Can we please narrow the scope of this PR to just the use of the tpchgen-cli fork (without changing the generate_data_files.py interface)? We can add support for different column encodings in a subsequent PR.
| --benchmark-type tpch \ | ||
| --data-dir-path "${TPCH_DATA_DIR}" \ | ||
| --scale-factor 0.01 \ | ||
| --convert-decimals-to-floats |
There was a problem hiding this comment.
We should be able to keep the existing script interface while changing the underlying data generator.
There was a problem hiding this comment.
Is that a goal, and if so do you have time to handle that?
We have some scripts relying on the interface from https://github.com/TomAugspurger/tpchgen-rs/tree/tom/sync-upstream-clean. We can update those, but that'll take time.
| six==1.17.0 | ||
| urllib3==2.5.0 | ||
| tpchgen-cli==2.0.1 | ||
| tpchgen-cli @ git+https://github.com/TomAugspurger/tpchgen-rs@tom/sync-upstream-clean#subdirectory=tpchgen-cli |
There was a problem hiding this comment.
Similar to @bdice's comment, I was thinking that we would be generating the tpchgen-cli binary via a docker container.
|
Apologies, but I probably won't have time to break this into smaller PRs, at least not for a while. |
This updates the data generation code in velox-testing to use the strategy / options from https://github.com/TomAugspurger/tpchgen-rs/tree/tom/sync-upstream-clean/scripts.
These rely on a fork of tpcgen-rs adds new features to control data generation. Users will need to ensure that they're in an enviroment with the appropriate tpchgen-cli on their path.