Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the sqlite database with TPCH data for the demo #11

Merged
merged 14 commits into from
Oct 25, 2024
Merged

Conversation

njriasan
Copy link
Contributor

Adds the dependency and github actions steps to install a sqlite database with TPCH SF1 data.

@njriasan njriasan changed the title [WIP] Add the sqlite database with TPCH data for the demo Add the sqlite database with TPCH data for the demo Oct 22, 2024
@njriasan njriasan requested a review from knassre-bodo October 22, 2024 15:57
Copy link
Contributor

@knassre-bodo knassre-bodo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments regarding how the testing is being set up

Comment on lines +29 to +37
select
sum(l_extendedprice * l_discount) as revenue
from
lineitem
where
l_shipdate >= '1994-01-01'
and l_shipdate < '1995-01-01'
and l_discount between 0.05 and 0.07
and l_quantity < 24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to a .sql file in a data folder so we can do the following:

  • Have a parameterized fixture for the 22 queries by number.
  • Have a get_tpch_query fixture which is a function (similar to our datapath fixture) that takes in a query number and returns the query text.
  • also store the refsols in the same folder (parquet? csv?). At least the following 14 are all definitely short enough to make it practical to do so: q1, q3, q4, q5, q6, q7, q8, q9, q10, q12, q14, q17, q19, q22 (and the other 8 might be small enough for it to be reasonable, and if not we can skip those 8 in the parameterized fixture) and have a get_tpch_answer fixture that fetches the answer by number.

Then the TPCH correctness tests look like this:

def test_tpch_correctness(tpch_db: Cursor, tpch_query_id: int, get_tpch_query: Callable[[int], str], get_tpch_answer: Callable[[int], pd.DataFrame]) -> None:
 query: str = get_tpch_query(tpch_query_id)
 answer: pd.DataFrame = get_tpch_answer(tpch_query_id)
 result: Result = tpch_db.execute(query)
 check_result(result, answer, ...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to do all 14 (or 22) in this PR, but we should set this up "properly" now at least for q6.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes feel excess since we are just verifying the download installed correctly. Are you concerned that we are downloading something that isn't actually TPCH because this feels excessive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this goal is to build infrastructure for testing the PyDough version of each TPCH query by executing the PyDough version and comparing it to a SQL version then I think this is feasible, but that wouldn't have any overlap with this test which just aims to validate the download.

Copy link
Contributor Author

@njriasan njriasan Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd like me to setup the PyDough testing infrastructure as a followup for comparing PyDough to the TPCH queries using this database I can do this next in a followup PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My goal with this suggestion is to have us have a single, singular, simple API for accessing all TPC-H stuff that we can use in any tests (this one, or the real deal) & trivially extend to the other TPC-H queries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's reasonable, but I still think it would be better to have a followup to add testing with the TPCH queries themselves. I'll add the github issue.

Comment on lines +3 to +9
on:
pull_request:
branches:
- main
paths:
# Only run on changes to the TPCH demo.
- "tpch_demo/**"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also enable this if a regular PR's commit message includes [RUN-TPCH]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not my intention to have this pipeline directly test PyDough functionality yet. When we opt to add the end to end tests that seek to use this data to answer the PyDough version of the TPCH queries then I think we can add that functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the harm of having a special [RUN-TPCH] command though? I do think we should use different types of [RUN-XXX] in both Bodo & BodoSQL to make it easier for developers to control subsets of tests that are run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is harm in adding code we aren't actively using/testing. It makes the repo more complicated (for anyone to use it we have to write all of these down) and if it breaks but we don't use it then we won't know.

I totally agree with your point in general and I think we should have greater configuration, I just think we should introduce it once we have tests that will depend on the actual PyDough project.

@njriasan njriasan requested a review from knassre-bodo October 25, 2024 03:08
Comment on lines +3 to +9
on:
pull_request:
branches:
- main
paths:
# Only run on changes to the TPCH demo.
- "tpch_demo/**"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the harm of having a special [RUN-TPCH] command though? I do think we should use different types of [RUN-XXX] in both Bodo & BodoSQL to make it easier for developers to control subsets of tests that are run.

@njriasan njriasan merged commit d220fc5 into main Oct 25, 2024
7 checks passed
@njriasan njriasan deleted the nick/demo_setup branch October 25, 2024 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants