Conversation
| vdata = per_query_validation.get(vkey) | ||
| validation_result = ( | ||
| { | ||
| "status": "expected-failure" if vdata["status"] == "xfail" else vdata["status"], |
If you're able to, I'd recommend standardizing on "expected-failure" as the string to represent an expected failure, rather than converting between xfail and expected-failure.
But if other parts of this project already use xfail then nvm.
| # Handle failed queries that may not have times | ||
| for query_name, error_info in failed_queries.items(): | ||
| if query_name not in raw_times: | ||
| vkey = "q" + query_name.lstrip("Q").lower() |
If the comment above is accurate ("failed queries that may not have times") won't this be "not-validated" by definition?
If a query fails before producing a result file then it should fall back to "not-validated". I've added a comment to clarify.
| try: | ||
| validation_results = json.loads(validation_results_path.read_text()) | ||
| except (json.JSONDecodeError, FileNotFoundError) as e: | ||
| print(f" Warning: could not load validation results: {e}", file=sys.stderr) |
Is continuing here the right behavior?
I'd recommend either raising here or recording some kind of error status and ensuring that the process exits with a non-zero status code.
I'm now propagating the exception and exiting if the JSON is malformed.
| """ | ||
| Validate TPC-H query results against expected parquet files. | ||
| Validation logic is ported from cudf_polars's assert_tpch_result_equal |
I think vendoring the code here is the right call, at least for now. Maybe longer term we can try to find a home for this that both velox-testing and cudf-polars can depend on.
So a request to our future selves: try to remember to fix issues in both places.
| - Schema (dtypes) is NOT checked — Presto may produce different parquet | ||
| types than polars for the same logical values. | ||
| - Decimal columns are cast to Float64 before comparison (same as polars). | ||
| - Floating-point values are compared with rel_tol=1e-5, abs_tol=1e-8. |
Nice, I think cudf-polars only validates with abs_tol=1e-2.
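For reference, the tolerance semantics described in the docstring above can be sketched with the stdlib's math.isclose (the constant and function names here are illustrative, not taken from the script):

```python
import math

# Tolerances stated in the docstring above: rel_tol=1e-5, abs_tol=1e-8.
REL_TOL = 1e-5
ABS_TOL = 1e-8

def floats_match(a: float, b: float) -> bool:
    """True when a and b agree within the relative OR absolute tolerance."""
    return math.isclose(a, b, rel_tol=REL_TOL, abs_tol=ABS_TOL)
```

math.isclose passes when either tolerance is satisfied, so the absolute tolerance only matters for values near zero.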
| # sort_by entries: (column_name, descending) | ||
| # --------------------------------------------------------------------------- | ||
| QUERY_CONFIG: dict[str, dict] = { |
Is velox-testing running mypy / some other type checker?
| QUERY_CONFIG: dict[str, dict] = { | |
| class SortLimit(TypedDict): | |
| sort_by: list[tuple[str, bool]] | None | |
| limit: int | None | |
| xfail_if_empty: bool | |
| QUERY_CONFIG: dict[str, SortLimit] = { |
Ah, I guess xfail_if_empty is only sometimes there. There is xfail_if_empty: typing.NotRequired[bool] for that.
Done. mypy isn't used in this repo (to my knowledge). We use some other checkers as part of the pre-commit hooks, but I'm not sure whether any of them apply type-checking rules.
| def _polars_assert_frame_equal(left: pl.DataFrame, right: pl.DataFrame, **kwargs: Any) -> None: | ||
| """Call polars.testing.assert_frame_equal, handling rel_tol/abs_tol API differences.""" | ||
| try: |
FWIW, I think requiring a new enough polars is reasonable here (cudf-polars needs to support older versions).
| def _reconcile_presto_col_names(result: pl.DataFrame, expected: pl.DataFrame) -> pl.DataFrame: | ||
| """ | ||
| Rename Presto's anonymous aggregate columns (_col0, _col1, ...) to match |
Maybe out of scope for this PR, but does presto offer the option to rename these anonymous columns in the query? When implementing validation in cudf-polars, we had some similar issues that we decided to fix by adjusting the query.
We could rename the columns in the query, but I'd rather leave the SQL alone. I think it's easy enough to keep the original SQL and just reconcile them here.
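The reconciliation discussed here might look something like the following positional rename (a sketch on plain name lists; the real helper operates on polars DataFrames):

```python
import re

_ANONYMOUS = re.compile(r"^_col\d+$")  # Presto's anonymous aggregate columns

def reconcile_col_names(actual_cols: list[str], expected_cols: list[str]) -> list[str]:
    """Positionally rename _colN columns to the expected names; leave
    explicitly named columns untouched so real mismatches still surface."""
    if len(actual_cols) != len(expected_cols):
        return actual_cols  # let the later column check report the mismatch
    return [
        expected if _ANONYMOUS.match(actual) else actual
        for actual, expected in zip(actual_cols, expected_cols)
    ]
```

Only anonymous names are rewritten, so a genuinely wrong column name still fails the subsequent column check.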
| polars_kwargs: dict[str, Any] = { | ||
| "check_row_order": True, | ||
| "check_column_order": True, | ||
| "check_dtypes": False, # Presto types may differ from polars types |
This feels slightly risky. Do you worry at all about unexpectedly getting different dtypes here?
On the cudf-polars side, we handled a similar issue by explicitly listing the casts required to match duckdb: https://github.com/rapidsai/cudf/blob/9ba0eb36f55712dae230ebb1b40b7fa1326fe147/python/cudf_polars/cudf_polars/experimental/benchmarks/pdsh.py#L44-L105. For tpc-h this wasn't too bad. For something like tpc-ds it might not be tenable.
On the other hand, for both cudf-polars and velox-cudf, we do compare against the CPU engine. So it's not like we're completely ignoring dtype validation.
I've turned the dtype detection on and added a duckdb->presto mapping of types. As far as I can tell, it's just occasional int32 vs int64 differences between the two.
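The mapping mentioned in this reply could be as small as a set of acceptable dtype pairs (the names and pairs below are illustrative, not the actual table from the PR):

```python
# Dtype pairs that DuckDB (reference) and Presto (actual) may legitimately
# disagree on while holding identical values.
EQUIVALENT_DTYPES: set[frozenset[str]] = {
    frozenset({"int32", "int64"}),
    frozenset({"float32", "float64"}),
}

def dtypes_compatible(actual: str, expected: str) -> bool:
    """Exact match, or a pair explicitly allowed by the equivalence table."""
    return actual == expected or frozenset({actual, expected}) in EQUIVALENT_DTYPES
```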
| query_id: str, | ||
| actual: pl.DataFrame, | ||
| expected: pl.DataFrame, | ||
| ) -> tuple[str, str | None]: |
I'd recommend encoding the status as a Literal or enum. And if possible use the same values as the API expects (expected-failure).
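A sketch of the Literal-based status suggested here, using the same strings the API expects (the classify helper is a hypothetical name, not from the PR):

```python
from typing import Literal

ValidationStatus = Literal["passed", "failed", "expected-failure", "not-validated"]

def classify(passed: bool, expected_to_fail: bool) -> ValidationStatus:
    """Map a comparison outcome onto one of the four allowed statuses."""
    if passed:
        return "passed"
    return "expected-failure" if expected_to_fail else "failed"
```

With a Literal return type, a type checker such as mypy will flag any misspelled status string at the return site.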
TomAugspurger
left a comment
Looks good. Just one note about a recent change to assert_tpch_result_equal in cudf-polars.
| right: pl.DataFrame, | ||
| *, | ||
| sort_by: list[tuple[str, bool]], | ||
| limit: int | None = None, |
@Matt711 just added a new nulls_last keyword to support tpc-ds: rapidsai/cudf@109634d. I'd recommend trying to adopt that now, or making an issue to do it in the future.
| --skip-drop-cache Skip dropping system caches before each benchmark query (dropped by default). | ||
| -m, --metrics Collect detailed metrics from Presto REST API after each query. | ||
| Metrics are stored in query-specific directories. | ||
| --expected-dir Path to a directory containing expected TPC-H parquet files. |
Can we consistently use --reference-results-dir as done in the integration test script?
| VALIDATE_SCRIPT="${SCRIPT_DIR}/../../benchmark_reporting_tools/validate_results.py" | ||
| # Determine the expected directory. | ||
| # If --expected-dir was not provided, auto-detect from ${PRESTO_DATA_DIR}/sf${SCALE_FACTOR}_expected |
Why is the reference results directory placed under PRESTO_DATA_DIR (which is supposed to be for the datasets)?
Since the reference results are tied to the source data, the convention so far has been that the two should live together. This is overridable, but the default location for reference results is the source data directory plus the _expected suffix.
To that end, I've rewritten this section so that it prioritizes an explicit path parameter, and otherwise defaults to looking in the same data directory the source data was taken from, with that "_expected" suffix. If an explicit path is provided but there are no reference results, it's an error; if no path is specified and there are no implicit reference results, it's "not-validated".
the convention so far has been that the two should live together.
Where is this convention defined? I don't think any of the existing scripts does this?
| EXPECTED_DIR_EXPLICIT=false | ||
| if [[ -n ${EXPECTED_DIR} ]]; then | ||
| EXPECTED_DIR_EXPLICIT=true | ||
| elif [[ -n ${BENCHMARK_EXPECTED_BASE_DIR:-${PRESTO_DATA_DIR}} ]]; then | ||
| BENCHMARK_RESULT_JSON="${ACTUAL_OUTPUT_DIR}/benchmark_result.json" | ||
| SCALE_FACTOR_FROM_DATA="$(python3 -c " | ||
| import json | ||
| try: | ||
| d = json.load(open('${BENCHMARK_RESULT_JSON}')) | ||
| sf = d.get('context', {}).get('scale_factor') | ||
| if sf is not None: | ||
| sf = float(sf) | ||
| print(int(sf) if sf == int(sf) else sf) | ||
| except Exception: | ||
| pass | ||
| " 2>/dev/null)" | ||
| if [[ -n ${SCALE_FACTOR_FROM_DATA} ]]; then | ||
| EXPECTED_DIR="${BENCHMARK_EXPECTED_BASE_DIR:-${PRESTO_DATA_DIR}}/sf${SCALE_FACTOR_FROM_DATA}_expected" | ||
| fi | ||
| fi |
Can all these be moved into the python validation script?
| """ | ||
| Validate TPC-H query results against expected parquet files. | ||
| Validation logic is ported from cudf_polars's assert_tpch_result_equal |
We seem to have created divergence here. Why are we not re-using the validation logic in test_utils.py?
We intentionally want to use the same comparison logic as cudf-polars. I don't know if we want the integration tests to use the same validation logic, but I do think we want to make sure that we are using the same logic as polars. Ideally, we want to pull this out as shared code between multiple projects.
I think there is an open question here about how we want to consolidate the verification logic going forward, but I think for now it would be fine to leave our existing integration test logic alone and use the polars logic for benchmark verification as we do now. We can adjust later as necessary.
The validation logic should be semantically the same in both projects, so what is the reason for the change? Using the same module for validation across projects would be ideal, but I don't think code duplication is the best way to get there.
mattgara
left a comment
LGTM. Awesome, I've been waiting for a Polars verification utility; it will really help with validation of larger runs!
| times = raw_times[query_name] | ||
| is_failed = query_name in failed_queries | ||
| # Look up validation result for this query (keys are lowercase e.g. "q1") |
nit: This loop logic plus that around lines 444 look to be nearly identical, consider pulling out a shared base?
I've updated this PR to deduplicate the code between the Polars validation and what already existed in integration tests. Current output looks like: Case with failures:
@misiugodfrey can you walk me through what those two outputs are saying? IIUC, it's that with the previous code there's some tests somewhere (where?) running something and validating the results, and all the tests passed validation. But with the new validation code, some of those tests are failing with validation errors? Based on the traceback, it looks like the values are different, so it's a good thing there's a validation error there (but the tests or implementation will need to be updated?). Or am I misreading things?
@TomAugspurger I should clarify, the two cases I posted above were for the sake of showing the new output format. The "failing" case was a set of parquet files I intentionally changed to fail to test that path. So far everything seems to validate correctly (tested up to 3k). |
paul-aiyedun
left a comment
Changes overall make sense to me. However, I had a number of questions.
| # PRESTO_EXPECTED_RESULTS_DIR env var is the implicit fallback (warning if missing). | ||
| # Explicit --reference-results-dir was already validated before the benchmark ran. | ||
| if [[ -n ${PRESTO_EXPECTED_RESULTS_DIR} && ! -d ${PRESTO_EXPECTED_RESULTS_DIR} ]]; then | ||
| echo "[Validation] Warning: PRESTO_EXPECTED_RESULTS_DIR not found: ${PRESTO_EXPECTED_RESULTS_DIR}; validation skipped." |
Unless I am missing something, we still seem to proceed with validation in this case. Should we exit after this line?
We proceeded to the next step, where the script would output more details about how the directory does not exist and that validation was skipped. I think you are right, though, that an early exit is a better idea here: the extra output is effectively redundant, since we are already stating that we are skipping due to a missing directory. I've changed this to an early exit.
| # Compute the actual output directory (mirrors pytest's --output-dir / --tag logic). | ||
| ACTUAL_OUTPUT_DIR="${OUTPUT_DIR:-$(pwd)/benchmark_output}" | ||
| if [[ -n ${TAG} ]]; then | ||
| ACTUAL_OUTPUT_DIR="${ACTUAL_OUTPUT_DIR}/${TAG}" | ||
| fi |
Consider moving this logic into the validate_results.py script and reusing the same function that sets this.
I've refactored this to use the same logic.
| VALIDATE_REQUIREMENTS="${SCRIPT_DIR}/../../benchmark_reporting_tools/requirements.txt" | ||
| echo "[Validation] Running validation: ${RESULTS_DIR} vs ${PRESTO_EXPECTED_RESULTS_DIR:-<not set>}" | ||
| pip install -q -r "${VALIDATE_REQUIREMENTS}" | ||
| python "${VALIDATE_SCRIPT}" "${VALIDATE_ARGS[@]}" |
Did you consider using run_py_script.sh?
Switching to use the script.
| def _get_validation_result(query_name): | ||
| # Look up validation result for this query (keys are lowercase e.g. "q1") | ||
| vkey = "q" + query_name.lstrip("Q").lower() | ||
| vdata = per_query_validation.get(vkey) |
What happens if per_query_validation is an empty dictionary (from line 400)?
If per_query_validation is empty then per_query_validation.get(vkey) should return None for each query and the status will be returned as "not-validated".
| print(f" Node count: {payload['node_count']}", file=sys.stderr) | ||
| print(f" Query logs: {len(payload['query_logs'])}", file=sys.stderr) | ||
| print(f" Validation status: {payload['validation_status']}", file=sys.stderr) | ||
| xfail_queries = [ |
What does the x prefix mean here?
The "x" prefix was short for "expected". As in this is an "expected failure". This convention was set in the Benchmarking DB where one of the possible validation states is "XFAIL" for this particular case.
| # 1 & 2. Column reconciliation and validation | ||
| actual = _reconcile_col_names(actual, expected) | ||
| if list(actual.columns) != list(expected.columns): | ||
| extra = set(actual.columns) - set(expected.columns) | ||
| missing = set(expected.columns) - set(actual.columns) | ||
| raise AssertionError( | ||
| f"Column name mismatch — extra: {extra}, missing: {missing}\n" | ||
| f" actual: {list(actual.columns)}\n" | ||
| f" expected: {list(expected.columns)}" | ||
| ) |
Why are we concerned about column names for this validation?
This is because the Polars validation did this. If we are unconcerned with the column names I could remove this, but I would rather we keep as close to them as we can.
Perhaps, @TomAugspurger can speak to why this is done for Polars, but I believe the TPC-H specification states that column names are optional
2.1.3.4 (a) Columns appear in the order specified by the SELECT list of either the functional query definition or an
approved variant. Column headings are optional.
Also, it is possible to have queries without clear column names. For instance, Q18 projects sum(l_quantity) without an alias and so, the column name would be query engine defined.
Having the column names is helpful for debugging, and simplifies the rest of the validation logic (which can now assume column names match) and error reporting (it's easier to say and read "the data type of column '<name>' doesn't match" rather than something about a positional index or left_name=... right_name=...)
the column name would be query engine defined
We made minor changes to our polars expressions to match (e.g. the .alias("sum(l_quantity)") to match duckdb here). Nothing too onerous.
| # Decimal → float64 | ||
| if _is_decimal_like(actual[col]): | ||
| actual[col] = pd.to_numeric(actual[col], errors="coerce") | ||
| if _is_decimal_like(expected[col]): |
Instead of repeated if statements, can we have this function do normalization for one dataframe at a time (similar to the normalize_rows function that existed before)?
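The suggested one-frame-at-a-time shape could look like this pure-Python sketch (the real code operates on pandas columns via pd.to_numeric; the helper names are hypothetical):

```python
from datetime import date, datetime
from decimal import Decimal

def normalize_value(v):
    """Normalize one cell: Decimal -> float, date -> midnight datetime."""
    if isinstance(v, Decimal):
        return float(v)
    if isinstance(v, date) and not isinstance(v, datetime):
        return datetime(v.year, v.month, v.day)
    return v

def normalize_rows(rows):
    """Apply the same normalization to a whole frame, one frame at a time,
    so both sides go through identical code instead of paired if statements."""
    return [[normalize_value(v) for v in row] for row in rows]
```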
| # Safety net: compare as Timestamps when one side is a date | ||
| # string (e.g. Presto '1995-03-05') and the other is a | ||
| # Timestamp object that slipped through _normalize_dtypes. | ||
| try: | ||
| if pd.Timestamp(v1) == pd.Timestamp(v2): | ||
| continue | ||
| except Exception: | ||
| pass |
Why is the column type not checked here?
I'll add a check.
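With the requested type check added, the safety net might look like this (a sketch using stdlib datetime.fromisoformat; the real code uses pd.Timestamp):

```python
from datetime import datetime

def timestamps_equal(v1, v2) -> bool:
    """Only compare when exactly one side is a date string and the other a
    datetime; anything else is rejected rather than coerced blindly."""
    if isinstance(v1, datetime) and isinstance(v2, str):
        v1, v2 = v2, v1  # normalize to (str, datetime)
    if not (isinstance(v1, str) and isinstance(v2, datetime)):
        return False
    try:
        return datetime.fromisoformat(v1) == v2
    except ValueError:
        return False
```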
| if not sort_by: | ||
| # No ORDER BY (or unparsable) — sort both sides and compare | ||
| _assert_frames_equal(_sort_for_comparison(actual), _sort_for_comparison(expected)) | ||
| return |
Is this check sufficient for queries with limits?
I'm not sure there's more that we can do in this case, as a LIMIT without a sort_by means any limit-sized subset of the data is valid.
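One thing that could still be checked in the LIMIT-without-ORDER-BY case is that the actual rows form a valid multiset subset of the expected rows (a sketch of that idea; this is not in the PR):

```python
from collections import Counter

def is_valid_limit_subset(actual_rows, expected_rows, limit):
    """With LIMIT but no ORDER BY, any limit-sized multiset subset of the
    expected result is a legal answer, so check size and containment only."""
    if len(actual_rows) != min(limit, len(expected_rows)):
        return False
    actual = Counter(map(tuple, actual_rows))
    expected = Counter(map(tuple, expected_rows))
    # Every actual row (with multiplicity) must exist in the expected rows.
    return all(expected[row] >= n for row, n in actual.items())
```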
| Extract ORDER BY information from SQL using sqlglot. | ||
| Returns: | ||
| (sort_by, nulls_last) where sort_by is [(col_name, descending), ...] |
Why are we concerned about nulls_last?
It's a DuckDB thing. DuckDB defaults to ASC → NULLS LAST, DESC → NULLS FIRST, and we need to track this to know where we should expect nulls in ORDER BY columns.
Why do we only get it for the first sorted column?
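For reference, DuckDB's default null placement described above (ASC sorts NULLS LAST, DESC sorts NULLS FIRST) can be emulated per column like this (a sketch, not code from the PR):

```python
def sort_like_duckdb(values, descending=False):
    """DuckDB defaults: NULL compares as largest, so it sorts last in
    ascending order and first in descending order."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [None] * sum(v is None for v in values)
    return nulls + non_null if descending else non_null + nulls
```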
misiugodfrey
left a comment
Addressed recent feedback.
| DuckDB reference) and benchmark validation (comparing result parquet files | ||
| against expected parquet files). | ||
| Comparison behaviour |
I'll strip the module docstring down and let the function docstring contain the details.
common/testing is not a Python project, so I don't think requirements.txt should be here. The dependencies should probably be managed by the project that uses the shared modules/files.
| ACTUAL_OUTPUT_DIR="${OUTPUT_DIR:-$(pwd)/benchmark_output}" | ||
| [[ -n ${TAG} ]] && ACTUAL_OUTPUT_DIR="${ACTUAL_OUTPUT_DIR}/${TAG}" |
We are not using the same python function per #275 (comment).
| if limit is None: | ||
| # ORDER BY, no LIMIT — sort by non-float cols for tie-breaking | ||
| _assert_frames_equal(_sort_for_comparison(actual), _sort_for_comparison(expected)) |
Why are we sorting the results in this case?
Summary
The reference results directory is resolved in priority order:
a. --reference-results-dir (explicit; missing directory is a fatal error)
b. PRESTO_EXPECTED_RESULTS_DIR env var (implicit; missing directory is a warning, validation skipped)
c. Auto-detection from benchmark_result.json (fallback; missing → not-validated)
The following updates to validation were added based on the Polars validation scripts:
Validation status priority: failed > not-validated > xfail > passed