
19042: Add sorted data benchmark. #94

Open
martin-augment wants to merge 9 commits into main from pr-19042-2025-12-02-13-53-05

Conversation

@martin-augment
Owner

19042: To review by AI

@coderabbitai

coderabbitai bot commented Dec 2, 2025

Walkthrough

This change introduces a new sorted data benchmark workflow for ClickBench. It adds two bash functions (data_sorted_clickbench and run_data_sorted_clickbench) to orchestrate data preparation and benchmark execution with sort order preservation. A new Python script sorts Parquet files by EventTime using PyArrow with configurable settings. The Rust code is extended with optional sorting parameters (sorted_by and sort_order) that enable creation of external tables with explicit sort ordering and force single-partition execution. A new SQL query file supports the sorted benchmark variant. These components work together to enable benchmarking on pre-sorted datasets.



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


# Batch size for writing
write_batch_size=min(row_group_size, 1024 * 64),
# Enable compression for all columns
compression_level=None, # Use default compression level

The --compression-level CLI option is parsed but not applied when writing the Parquet file (the writer sets compression_level=None), so the flag currently has no effect. Consider wiring this option through to the writer to avoid confusing behavior.


Owner Author


value:good-to-have; category:bug; feedback:The Augment AI reviewer is correct! The compression level option is parsed but never used. Fixing this prevents user confusion about why the files are not compressed at the requested level.

parser.add_argument(
'--row-group-size',
type=int,
default=64 * 1024, # 64K rows


The examples and inline comment indicate a 1M row-group default, but the parser default here is 64K, which is inconsistent and could confuse users. Consider aligning the help text and default value to match the intended default.


Owner Author

@martin-augment martin-augment Dec 2, 2025


value:useful; category:bug; feedback:The Augment AI reviewer is correct! The default value shown in the usage example does not match the default value of the command-line option. They should be aligned.


if __name__ == '__main__':
if len(sys.argv) == 1:
print("Usage: python3 sort_clickbench_enhanced.py <input_file> <output_file>")


The usage message references sort_clickbench_enhanced.py, but the file name is sort_clickbench.py. This mismatch is user-facing and may cause confusion when following the printed instructions.


Owner Author

@martin-augment martin-augment Dec 2, 2025


value:useful; category:typo; feedback:The Augment AI reviewer is correct! The script is named sort_clickbench.py, so the usage message needs to be updated. This prevents confusion for users trying to execute the script.

type=int,
default=3,
help='Compression level (default: 3 for zstd)'
)


Bug: Command-line compression level argument is never used

The --compression-level command-line argument is parsed via argparse but never passed to sort_clickbench_data(). The call at lines 238-244 omits args.compression_level, and the function hardcodes compression_level=None when calling pq.write_table(). Users setting this flag would see no effect on compression behavior.


Owner Author


value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct! The compression level option is parsed but never used. Fixing this prevents user confusion about why the files are not compressed at the requested level.

# specific language governing permissions and limitations
# under the License.

#!/usr/bin/env python3


Bug: Duplicate shebang line in Python script

There are two shebang lines in the script - one at line 1 (#!/usr/bin/env python3) and another at line 20. The second shebang is likely a copy-paste artifact from merging different versions of the script. While it doesn't cause a runtime error (it's treated as a comment after line 1), it indicates accidentally committed code.


Owner Author


value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct! The shebang must be the very first line of a script; the second occurrence is interpreted as an ordinary comment and should be removed.

@claude

claude bot commented Dec 2, 2025

Code Review - PR #94: Add sorted data benchmark

Thank you for this contribution! This PR adds a valuable benchmark for testing sort optimization in DataFusion. Here is my comprehensive review:


Strengths

  1. Clear Purpose: The PR effectively implements a benchmark for testing sort elimination optimization with pre-sorted data
  2. Good Documentation: Inline comments and docstrings explain the rationale well
  3. Proper Apache License Headers: All new files include the required ASF license headers
  4. Thoughtful Design: Using hits_0.parquet (~150MB) instead of the full dataset makes testing faster

Bugs and Issues

Critical Issues

1. Incorrect Filename in Usage Message (benchmarks/sort_clickbench.py:249)

  • Line 249 references sort_clickbench_enhanced.py but the file is named sort_clickbench.py
  • Fix: Change the usage message to reference the correct filename

2. Unused compression_level Parameter (benchmarks/sort_clickbench.py:210-214, 241)

  • The argument is parsed but never passed to sort_clickbench_data()
  • The function does not accept this parameter
  • Fix: Either remove the argument or implement compression level support

3. Inconsistent Default Row Group Size (benchmarks/sort_clickbench.py:40 vs 201)

  • Function signature default: row_group_size: int = 1024 * 1024 (1M rows)
  • Argument parser default: default=64 * 1024 (64K rows)
  • Fix: Align both defaults (64K seems more reasonable for better query performance)

Medium Priority Issues

4. Missing Validation for sort_order Parameter (benchmarks/src/clickbench.rs:89-90)

  • No validation that the value is ASC or DESC
  • Could cause runtime errors in SQL generation
  • Fix: Add validation or use an enum

5. No Error Handling for Invalid Sort Column (benchmarks/src/clickbench.rs:267-268)

  • If user provides a non-existent column name, it will fail during SQL execution
  • Better to validate against the schema first

Performance Considerations

Good Practices

  • Forcing target_partitions=1 when sorted data is used (line 142) is correct for preserving sort order
  • Using zstd compression by default in Python script is good for file size

Potential Improvements

6. Memory Usage in Python Sort (benchmarks/sort_clickbench.py:62)

  • Loads entire table into memory before sorting
  • For hits_0.parquet (~150MB) this is fine, but consider chunked processing for larger files

7. Redundant Filter Pushdown Configuration (benchmarks/src/clickbench.rs:161-165)

  • The code sets filter pushdown twice: once in the if self.pushdown block and again in the if self.sorted_by block
  • This should be clarified or consolidated

Security Concerns

8. Shell Injection Risk in bench.sh (benchmarks/bench.sh:1250)

  • Variables are properly quoted which is good
  • However, SCRIPT_DIR should also be validated or quoted when used in paths

9. Path Validation in Rust (benchmarks/src/clickbench.rs:255)

  • Uses .unwrap() which will panic on non-UTF-8 paths
  • Better to handle the error gracefully with proper error propagation

Test Coverage

Missing Tests

10. No Unit Tests for New Rust Code

  • The new register_hits logic with WITH ORDER clause has no tests
  • Should add tests for valid sort column registration, invalid sort column handling, ASC/DESC order handling, and verification that sort order metadata is actually applied

11. No Integration Tests for Shell Scripts

  • The data_sorted_clickbench and run_data_sorted_clickbench functions should be tested
  • Consider adding a simple test that runs the benchmark on small sample data

12. Python Script Has No Tests

  • Should test sorting correctness, row group size configuration, compression options, and error handling

Code Quality Improvements

13. Typo in Comments (benchmarks/sort_clickbench.py:19-20)

  • Has duplicate shebang line

14. Inconsistent Emoji Usage

  • Uses warning emoji for warnings but checkmark/X for pass/fail
  • Consider consistent style (all emoji or none)

15. Debug Print Statements Left In (benchmarks/src/clickbench.rs:171-175)

  • These println statements should probably be behind a debug flag or use proper logging

16. Hardcoded Magic Numbers (benchmarks/sort_clickbench.py:79)

  • The 1000 constant should be a named constant

Documentation Gaps

17. Missing README or Documentation

  • Should add documentation explaining what sorted data benchmarks test, how to run them, expected performance characteristics, and interpretation of results

18. Query File Comment Confusion (benchmarks/queries/clickbench/queries/sorted_data/q0.sql:1-2)

  • Comments about setting binary_as_string but does not explain why only DESC query is tested
  • Should explain this is testing reverse scan optimization

Recommendations

High Priority

  1. Fix the filename reference in the error message (bug #1)
  2. Remove or implement the compression_level parameter (bug #2)
  3. Add validation for the sort_order parameter (bug #4)
  4. Replace .unwrap() with proper error handling (bug #9)

Medium Priority

  1. Align row group size defaults (bug #3)
  2. Add unit tests for the Rust changes
  3. Remove debug print statements or add proper logging
  4. Add documentation explaining the benchmark purpose

Low Priority

  1. Fix duplicate shebang line
  2. Consider memory-efficient sorting for Python script
  3. Add integration tests

Overall Assessment

This is a solid contribution that adds valuable benchmarking capabilities. The core implementation is sound, but there are several bugs and missing tests that should be addressed before merging. The biggest concerns are:

  1. The filename mismatch in the Python script
  2. Lack of test coverage
  3. Missing input validation

With these issues addressed, this will be a great addition to the DataFusion benchmark suite!

Recommendation: Request changes for the critical bugs, then approve after fixes.


Review generated with Claude Code


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
benchmarks/src/clickbench.rs (1)

161-167: Consider consolidating pushdown enablement logic.

When both --pushdown and --sorted-by are provided, the same settings are applied twice. While harmless, this could be simplified:

-            if self.pushdown {
-                parquet_options.pushdown_filters = true;
-                parquet_options.reorder_filters = true;
-            }
-
-            if self.sorted_by.is_some() {
-                // We should compare the dynamic topk optimization when data is sorted, so we make the
-                // assumption that filter pushdown is also enabled in this case.
-                parquet_options.pushdown_filters = true;
-                parquet_options.reorder_filters = true;
-            }
+            if self.pushdown || self.sorted_by.is_some() {
+                // Enable filter pushdown for explicit --pushdown flag or sorted data benchmarks
+                // (sorted data benchmarks need pushdown for dynamic topk optimization comparison)
+                parquet_options.pushdown_filters = true;
+                parquet_options.reorder_filters = true;
+            }
benchmarks/sort_clickbench.py (2)

39-45: Default values inconsistency between function signature and CLI.

The function defaults differ from CLI defaults:

  • row_group_size: function=1M rows vs CLI=64K rows
  • compression: function=snappy vs CLI=zstd

This creates confusion as the function defaults are effectively dead code when invoked via CLI.

Align the defaults:

 def sort_clickbench_data(
         input_path: str,
         output_path: str,
-        row_group_size: int = 1024 * 1024,  # 1M rows default
-        compression: str = 'snappy',
+        row_group_size: int = 64 * 1024,  # 64K rows default
+        compression: str = 'zstd',
         verify: bool = True
 ):

Also applies to: 198-208


115-119: Remove extraneous f prefix from strings without placeholders.

Lines 115 and 119 use f-strings without any placeholders.

-        print(f"\n✓ Successfully created sorted file!")
+        print("\n✓ Successfully created sorted file!")
         print(f"  Input:  {input_size_mb:.1f} MB")
         print(f"  Output: {output_size_mb:.1f} MB")
         print(f"  Compression ratio: {input_size_mb/output_size_mb:.2f}x")
-        print(f"\nRow Group Statistics:")
+        print("\nRow Group Statistics:")

Based on static analysis hints.

benchmarks/bench.sh (2)

1209-1211: Remove leftover development comments.

Lines 1209-1211 and 1266-1267 contain "Add these functions to bench.sh" comments that appear to be leftover from development.

-# Sorted Data Benchmark Functions (Optimized for hits_0.parquet)
-# Add these functions to bench.sh
-
 # Creates sorted ClickBench data from hits_0.parquet (partitioned dataset)
-# Sorted Data Benchmark Functions for bench.sh
-# Add these functions to your bench.sh script
-
 # Runs the sorted data benchmark (sorted only) with sort order information

Also applies to: 1266-1267


1236-1248: Consider adding Python version check for consistency.

Unlike data_h2o and data_h2o_join which check for Python 3.10+, this function uses python3 directly. While pyarrow works with Python 3.8+, adding version checking would improve consistency with other data generation functions.

This is optional since pyarrow has broader Python version support than falsa.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between feff701 and 88f84d9.

📒 Files selected for processing (4)
  • benchmarks/bench.sh (4 hunks)
  • benchmarks/queries/clickbench/queries/sorted_data/q0.sql (1 hunks)
  • benchmarks/sort_clickbench.py (1 hunks)
  • benchmarks/src/clickbench.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/src/clickbench.rs (4)
benchmarks/src/util/options.rs (1)
  • config (68-70)
datafusion/datasource-parquet/src/file_format.rs (1)
  • parquet_options (1192-1194)
datafusion/core/src/execution/context/mod.rs (1)
  • new_with_config_rt (349-356)
datafusion/execution/src/config.rs (1)
  • target_partitions (275-277)
🪛 Ruff (0.14.7)
benchmarks/sort_clickbench.py

1-1: Shebang is present but file is not executable

(EXE001)


20-20: Shebang is present but file is not executable

(EXE001)


20-20: Shebang should be at the beginning of the file

(EXE005)


115-115: f-string without any placeholders

Remove extraneous f prefix

(F541)


119-119: f-string without any placeholders

Remove extraneous f prefix

(F541)


134-134: Do not catch blind exception: Exception

(BLE001)


164-164: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: claude-review
  • GitHub Check: Analyze (rust)
🔇 Additional comments (8)
benchmarks/queries/clickbench/queries/sorted_data/q0.sql (1)

1-3: LGTM - Query correctly tests reverse scan on sorted data.

The ORDER BY "EventTime" DESC on data sorted ASC tests the sort elimination optimization for reverse scans. The binary_as_string setting is correctly applied programmatically in the Rust code.

benchmarks/src/clickbench.rs (3)

138-148: LGTM - Forcing single partition preserves sort order.

This is the correct approach to ensure the benchmark measures the pure performance benefit of pre-sorted data without partition-induced reordering.


264-269: Column name escaping is simplistic but acceptable for benchmark code.

The current logic assumes column names either already have quotes or need them added. This works for controlled benchmark inputs like "EventTime" but wouldn't handle edge cases like embedded quotes. For a benchmark utility with known inputs, this is acceptable.


172-177: LGTM - Useful diagnostic output.

benchmarks/sort_clickbench.py (1)

141-165: Verification is a lightweight sanity check.

The verify_sorted_file function only compares the first and last EventTime values across row groups, which wouldn't detect unsorted data within row groups. For a benchmark utility where the sorting is done programmatically, this quick sanity check is acceptable. The main sort_clickbench_data function performs a more thorough check on the first 1000 rows.

benchmarks/bench.sh (3)

102-104: LGTM - Documentation added for the new benchmark.

The new data_sorted_clickbench benchmark is properly documented in the usage section with a clear description.


328-330: LGTM - Command dispatch entries follow existing patterns.

Also applies to: 514-516


1270-1287: LGTM - Run function follows established patterns.

The function correctly:

  • Ensures data exists before running
  • Passes --sorted-by and --sort-order parameters to enable sort order awareness
  • Uses the dedicated sorted_data queries path
  • Follows the same structure as other run_* functions

Comment on lines +1 to +24
#!/usr/bin/env python3

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#!/usr/bin/env python3
"""
Sort ClickBench data by EventTime for reverse scan benchmark.
Enhanced version with configurable row group size and optimization options.
"""


⚠️ Potential issue | 🟡 Minor

Remove duplicate shebang line.

There are two shebang lines (lines 1 and 20). The second one at line 20 is invalid as shebangs must be at the beginning of the file.

 #!/usr/bin/env python3
 
 # Licensed to the Apache Software Foundation (ASF) under one
 ...
 # under the License.
 
-#!/usr/bin/env python3
 """
 Sort ClickBench data by EventTime for reverse scan benchmark.

Based on static analysis hints.



Owner Author


value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The shebang must be the very first line of a script; the second occurrence is interpreted as an ordinary comment and should be removed.

Comment on lines +104 to +105
compression_level=None, # Use default compression level
)


⚠️ Potential issue | 🟡 Minor

--compression-level CLI argument is parsed but not used.

The compression_level argument is defined in argparse (line 211-215) but never passed to pq.write_table(). Line 104 hardcodes compression_level=None.

Either remove the unused argument or wire it through:

 def sort_clickbench_data(
         input_path: str,
         output_path: str,
         row_group_size: int = 64 * 1024,
         compression: str = 'zstd',
+        compression_level: int = None,
         verify: bool = True
 ):

And in pq.write_table():

-            compression_level=None,  # Use default compression level
+            compression_level=compression_level,

And in main():

     sort_clickbench_data(
         args.input,
         args.output,
         row_group_size=args.row_group_size,
         compression=args.compression,
+        compression_level=args.compression_level,
         verify=not args.no_verify
     )

Also applies to: 210-215


Owner Author


value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The compression level option is parsed but never used. Fixing this prevents user confusion about why the files are not compressed at the requested level.

Comment on lines +81 to +90

/// Column name that the data is sorted by (e.g., "EventTime")
/// If specified, DataFusion will be informed that the data has this sort order
/// using CREATE EXTERNAL TABLE with WITH ORDER clause
#[structopt(long = "sorted-by")]
sorted_by: Option<String>,

/// Sort order: ASC or DESC (default: ASC)
#[structopt(long = "sort-order", default_value = "ASC")]
sort_order: String,


⚠️ Potential issue | 🟡 Minor

Validate sort_order to prevent invalid SQL generation.

The sort_order parameter accepts any string but is directly interpolated into SQL. Invalid values like "INVALID" would produce malformed DDL.

Consider validating the value:

+    /// Sort order: ASC or DESC (default: ASC)
+    #[structopt(long = "sort-order", default_value = "ASC", possible_values = &["ASC", "DESC", "asc", "desc"])]
     sort_order: String,

Alternatively, validate at runtime before building the SQL:

if !["ASC", "DESC"].contains(&self.sort_order.to_uppercase().as_str()) {
    return Err(exec_datafusion_err!("sort_order must be ASC or DESC"));
}

Owner Author


value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! At the moment there is no validation of the provided user input, and the value is passed directly to the SQL engine. It could even be exploited for SQL injection.

@martin-augment
Owner Author

2. Unused compression_level Parameter (benchmarks/sort_clickbench.py:210-214, 241)

  • The argument is parsed but never passed to sort_clickbench_data()
  • The function does not accept this parameter
  • Fix: Either remove the argument or implement compression level support

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! The compression level option is parsed but never used. Fixing this prevents user confusion about why the files are not compressed at the requested level.

@martin-augment
Owner Author

martin-augment commented Dec 2, 2025

3. Inconsistent Default Row Group Size (benchmarks/sort_clickbench.py:40 vs 201)

  • Function signature default: row_group_size: int = 1024 * 1024 (1M rows)
  • Argument parser default: default=64 * 1024 (64K rows)
  • Fix: Align both defaults (64K seems more reasonable for better query performance)

value:useful; category:bug; feedback:The Claude AI reviewer is correct! The default value shown in the usage example does not match the default value of the command-line option. They should be aligned.

@martin-augment
Owner Author

martin-augment commented Dec 2, 2025

39-45: Default values inconsistency between function signature and CLI.

The function defaults differ from CLI defaults:

  • row_group_size: function=1M rows vs CLI=64K rows
  • compression: function=snappy vs CLI=zstd

This creates confusion as the function defaults are effectively dead code when invoked via CLI.

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The default value shown in the usage example does not match the default value of the command-line option. They should be aligned.

@martin-augment
Owner Author

martin-augment commented Dec 2, 2025

1. Incorrect Filename in Usage Message (benchmarks/sort_clickbench.py:249)

  • Line 249 references sort_clickbench_enhanced.py but the file is named sort_clickbench.py
  • Fix: Change the usage message to reference the correct filename

value:useful; category:typo; feedback:The Claude AI reviewer is correct! The script is named sort_clickbench.py, so the usage message needs to be updated. This prevents confusion for users trying to execute the script.

@martin-augment
Owner Author

4. Missing Validation for sort_order Parameter (benchmarks/src/clickbench.rs:89-90)

  • No validation that the value is ASC or DESC
  • Could cause runtime errors in SQL generation
  • Fix: Add validation or use an enum

value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! At the moment there is no validation of the provided user input, and the value is passed directly to the SQL engine. It could even be exploited for SQL injection.
