-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the sqlite database with TPCH data for the demo #11
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments regarding how the testing is being set up
select | ||
sum(l_extendedprice * l_discount) as revenue | ||
from | ||
lineitem | ||
where | ||
l_shipdate >= '1994-01-01' | ||
and l_shipdate < '1995-01-01' | ||
and l_discount between 0.05 and 0.07 | ||
and l_quantity < 24 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to a .sql
file in a data
folder so we can do the following:
- Have a parameterized fixture for the 22 queries by number.
- Have a
get_tpch_query
fixture which is a function (similar to ourdatapath
fixture) that takes in a query number and returns the query text. - also store the refsols in the same folder (parquet? csv?). At least the following 14 are all definitely short enough to make it practical to do so: q1, q3, q4, q5, q6, q7, q8, q9, q10, q12, q14, q17, q19, q22 (and the other 8 might be small enough for it to be reasonable, and if not we can skip those 8 in the parameterized fixture) and have a
get_tpch_answer
fixture that fetches the answer by number.
Then the TPCH correctness tests look like this:
def test_tpch_correctness(tpch_db: Cursor, tpch_query_id: int, get_tpch_query: Callable[[int], str], get_tpch_answer: Callable[[int], pd.DataFrame]) -> None:
query: str = get_tpch_query(tpch_query_id)
answer: pd.DataFrame = get_tpch_answer(tpch_query_id)
result: Result = tpch_db.execute(query)
check_result(result, answer, ...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to do all 14 (or 22) in this PR, but we should set this up "properly" now at least for q6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes feel excess since we are just verifying the download installed correctly. Are you concerned that we are downloading something that isn't actually TPCH because this feels excessive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this goal is to build infrastructure for testing the PyDough version of each TPCH query by executing the PyDough version and comparing it to a SQL version then I think this is feasible, but that wouldn't have any overlap with this test which just aims to validate the download.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you'd like me to setup the PyDough testing infrastructure as a followup for comparing PyDough to the TPCH queries using this database I can do this next in a followup PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal with this suggestion is to have us have a single, singular, simple API for accessing all TPC-H stuff that we can use in any tests (this one, or the real deal) & trivially extend to the other TPC-H queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's reasonable, but I still think it would be better to have a followup to add testing with the TPCH queries themselves. I'll add the github issue.
on: | ||
pull_request: | ||
branches: | ||
- main | ||
paths: | ||
# Only run on changes to the TPCH demo. | ||
- "tpch_demo/**" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also enable this if a regular PR's commit message includes [RUN-TPCH]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not my intention to have this pipeline directly test PyDough functionality yet. When we opt to add the end to end tests that seek to use this data to answer the PyDough version of the TPCH queries then I think we can add that functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the harm of having a special [RUN-TPCH]
command though? I do think we should use different types of [RUN-XXX]
in both Bodo & BodoSQL to make it easier for developers to control subsets of tests that are run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is harm in adding code we aren't actively using/testing. It makes the repo more complicated (for anyone to use it we have to write all of these down) and if it breaks but we don't use it then we won't know.
I totally agree with your point in general and I think we should have greater configuration, I just think we should introduce it once we have tests that will depend on the actual PyDough project.
on: | ||
pull_request: | ||
branches: | ||
- main | ||
paths: | ||
# Only run on changes to the TPCH demo. | ||
- "tpch_demo/**" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the harm of having a special [RUN-TPCH]
command though? I do think we should use different types of [RUN-XXX]
in both Bodo & BodoSQL to make it easier for developers to control subsets of tests that are run.
Adds the dependency and github actions steps to install a sqlite database with TPCH SF1 data.