Add the sqlite database with TPCH data for the demo #11

njriasan · 2024-10-21T20:50:35Z

Adds the dependency and GitHub Actions steps to install a SQLite database with TPCH SF1 data.

knassre-bodo

A few comments regarding how the testing is being set up

tpch_demo/test_tpch_download.py

knassre-bodo · 2024-10-22T18:56:33Z

tpch_demo/test_tpch_download.py

+        select
+            sum(l_extendedprice * l_discount) as revenue
+        from
+            lineitem
+        where
+            l_shipdate >= '1994-01-01'
+            and l_shipdate < '1995-01-01'
+            and l_discount between 0.05 and 0.07
+            and l_quantity < 24
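
For context, the hunk above is TPC-H Q6. A minimal sketch of how such a query could be run through Python's standard `sqlite3` module follows. It uses an in-memory table with a handful of made-up rows rather than the downloaded SF1 database, and the names `run_q6` and `make_toy_db` are hypothetical, not taken from the PR's test file.

```python
import sqlite3
from typing import Optional

# Query text copied from the diff hunk (TPC-H Q6).
Q6 = """
select
    sum(l_extendedprice * l_discount) as revenue
from
    lineitem
where
    l_shipdate >= '1994-01-01'
    and l_shipdate < '1995-01-01'
    and l_discount between 0.05 and 0.07
    and l_quantity < 24
"""


def make_toy_db() -> sqlite3.Connection:
    """Build an in-memory stand-in for the real database (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table lineitem ("
        "l_extendedprice real, l_discount real, "
        "l_quantity real, l_shipdate text)"
    )
    rows = [
        (1000.0, 0.06, 10, "1994-03-15"),  # passes every filter
        (2000.0, 0.05, 23, "1994-12-31"),  # passes every filter
        (500.0, 0.10, 5, "1994-06-01"),    # discount outside the range
        (800.0, 0.06, 30, "1994-06-01"),   # quantity too large
        (900.0, 0.06, 10, "1995-01-01"),   # ship date outside the year
    ]
    conn.executemany("insert into lineitem values (?, ?, ?, ?)", rows)
    return conn


def run_q6(conn: sqlite3.Connection) -> Optional[float]:
    """Execute Q6 and return the single revenue value (None if no rows match)."""
    (revenue,) = conn.execute(Q6).fetchone()
    return revenue
```

Storing ship dates as ISO-8601 text is what makes the string comparisons in the `where` clause behave like date comparisons in SQLite. Pointing `sqlite3.connect` at the downloaded database file instead of `":memory:"` would exercise the real data.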


Let's move this to a .sql file in a data folder so we can do the following:

Have a parameterized fixture for the 22 queries by number.

Have a get_tpch_query fixture which is a function (similar to our datapath fixture) that takes in a query number and returns the query text.

Also store the reference solutions (refsols) in the same folder (Parquet? CSV?) and have a get_tpch_answer fixture that fetches the answer by number. At least the following 14 are definitely short enough to make this practical: q1, q3, q4, q5, q6, q7, q8, q9, q10, q12, q14, q17, q19, q22. The other 8 might be small enough for it to be reasonable; if not, we can skip those 8 in the parameterized fixture.

Then the TPCH correctness tests look like this:

    def test_tpch_correctness(
        tpch_db: Cursor,
        tpch_query_id: int,
        get_tpch_query: Callable[[int], str],
        get_tpch_answer: Callable[[int], pd.DataFrame],
    ) -> None:
        query: str = get_tpch_query(tpch_query_id)
        answer: pd.DataFrame = get_tpch_answer(tpch_query_id)
        result: Result = tpch_db.execute(query)
        check_result(result, answer, ...)

Don't need to do all 14 (or 22) in this PR, but we should set this up "properly" now at least for q6.
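
The loader functions proposed above could be sketched roughly as follows. This is only an illustration under assumptions: the file layout (`q6.sql` and `q6.csv` side by side in one data folder), the CSV format for refsols, and the factory names `make_query_loader` and `make_answer_loader` are all hypothetical. It uses only the standard library, so answers come back as lists of row dicts rather than a `pd.DataFrame`; in a `conftest.py`, each factory's return value is what a `get_tpch_query` or `get_tpch_answer` fixture would hand to the test.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List

# The 14 queries named in the comment as short enough to store refsols for.
QUERIES_WITH_REFSOLS = [1, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 17, 19, 22]

Rows = List[Dict[str, str]]


def make_query_loader(data_dir: Path) -> Callable[[int], str]:
    """Return a function mapping a query number to its SQL text."""

    def get_tpch_query(query_id: int) -> str:
        # Assumed naming scheme: q<number>.sql
        return (data_dir / f"q{query_id}.sql").read_text()

    return get_tpch_query


def make_answer_loader(data_dir: Path) -> Callable[[int], Rows]:
    """Return a function mapping a query number to its reference rows."""

    def get_tpch_answer(query_id: int) -> Rows:
        # Assumed naming scheme: q<number>.csv with a header row
        with open(data_dir / f"q{query_id}.csv", newline="") as f:
            return list(csv.DictReader(f))

    return get_tpch_answer
```

With this shape, the parameterized `tpch_query_id` fixture would iterate over `QUERIES_WITH_REFSOLS`, and adding another query means dropping two more files into the folder.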

These changes feel excessive since we are just verifying that the download installed correctly. Are you concerned that we are downloading something that isn't actually TPCH?

If the goal is to build infrastructure for testing the PyDough version of each TPCH query by executing the PyDough version and comparing it to a SQL version, then I think this is feasible, but that wouldn't have any overlap with this test, which just aims to validate the download.

If you'd like me to set up the PyDough testing infrastructure for comparing PyDough to the TPCH queries using this database, I can do this next in a followup PR.

My goal with this suggestion is to give us a single, simple API for accessing all TPC-H stuff that we can use in any tests (this one, or the real deal) & trivially extend to the other TPC-H queries.

I think that's reasonable, but I still think it would be better to have a followup to add testing with the TPCH queries themselves. I'll add the github issue.

knassre-bodo · 2024-10-22T19:03:29Z

.github/workflows/tpch_db.yml

+on:
+  pull_request:
+    branches:
+      - main
+    paths:
+      # Only run on changes to the TPCH demo.
+      - "tpch_demo/**"


Let's also enable this if a regular PR's commit message includes [RUN-TPCH]

It's not my intention to have this pipeline directly test PyDough functionality yet. When we opt to add the end-to-end tests that use this data to answer the PyDough version of the TPCH queries, then I think we can add that functionality.

What's the harm of having a special [RUN-TPCH] command though? I do think we should use different types of [RUN-XXX] in both Bodo & BodoSQL to make it easier for developers to control subsets of tests that are run.

I think there is harm in adding code we aren't actively using/testing. It makes the repo more complicated (for anyone to use it we have to write all of these down) and if it breaks but we don't use it then we won't know.

I totally agree with your point in general and I think we should have greater configuration; I just think we should introduce it once we have tests that will depend on the actual PyDough project.
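
To make the [RUN-XXX] idea in this thread concrete, here is a small sketch of how such tags could be extracted from a commit message. It is illustrative only: the function name `requested_suites` is hypothetical, the accepted tag shapes are inferred from the `[RUN-TPCH]` suggestion above and the `[run CI]` tags in this PR's commit messages, and how the workflow actually consumes such tags is not shown in this PR.

```python
import re
from typing import Set

# Matches tags such as "[RUN-TPCH]" or "[run CI]" (case-insensitive).
_TAG_PATTERN = re.compile(r"\[run[- ]([a-z0-9_]+)\]", re.IGNORECASE)


def requested_suites(commit_message: str) -> Set[str]:
    """Return the upper-cased suite names requested via [RUN-XXX] style tags."""
    return {name.upper() for name in _TAG_PATTERN.findall(commit_message)}
```

A CI step could call this on the head commit message and skip the TPCH job unless `"TPCH"` is in the returned set, which keeps each tag's meaning in one place instead of scattered across workflow files.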

njriasan added 5 commits October 21, 2024 14:32

added a readme

2aa2c75

Added the files

7f2ead4

Added the TPCH test, need to add the github actions steps

b32c4fd

Added basic functional test, need to add env variables

8d6ca1b

Tried to update the env variables

8227e02

njriasan changed the title ~~[WIP] Add the sqlite database with TPCH data for the demo~~ Add the sqlite database with TPCH data for the demo Oct 22, 2024

njriasan added 7 commits October 22, 2024 11:45

added a new commit message [run CI]

498ba46

Fixed the tests that run [run CI]

cac44f6

Fixed the actions yaml

4c916f6

Added a comment [run CI]

20df68a

add uv back [run CI]

12b6d78

Fuse download and testing steps [run CI]

abf8799

Fixed my typo [run CI]

dd70273

njriasan requested a review from knassre-bodo October 22, 2024 15:57

knassre-bodo reviewed Oct 24, 2024

njriasan added 2 commits October 24, 2024 22:57

Merge branch 'main' into nick/demo_setup

093fbc9

Fixed the testing [run CI]

83cfd3d

njriasan requested a review from knassre-bodo October 25, 2024 03:08

knassre-bodo approved these changes Oct 25, 2024

njriasan merged commit d220fc5 into main Oct 25, 2024
7 checks passed

njriasan deleted the nick/demo_setup branch October 25, 2024 19:25
