Conversation
Walkthrough

This change introduces a new sorted data benchmark workflow for ClickBench. It adds two bash functions.
```python
# Batch size for writing
write_batch_size=min(row_group_size, 1024 * 64),
# Enable compression for all columns
compression_level=None,  # Use default compression level
```
The --compression-level CLI option is parsed but not applied when writing the Parquet file (the writer sets compression_level=None), so the flag currently has no effect. Consider wiring this option through to the writer to avoid confusing behavior.
value:good-to-have; category:bug; feedback:The Augment AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
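Several reviewers flag the same wiring gap. As an illustration only, here is a minimal stdlib sketch of the fix; `write_table_stub` is a hypothetical stand-in for `pq.write_table` that just records its arguments, so one can check that the CLI value actually reaches the writer:

```python
import argparse

def write_table_stub(table, path, compression, compression_level):
    # Stand-in for pyarrow.parquet.write_table: records what it was
    # called with, so the wiring can be verified without pyarrow.
    return {"compression": compression, "compression_level": compression_level}

def sort_clickbench_data(input_path, output_path, compression="zstd",
                         compression_level=None):
    # The fix: accept compression_level and forward it instead of
    # hardcoding compression_level=None in the writer call.
    return write_table_stub(None, output_path,
                            compression=compression,
                            compression_level=compression_level)

parser = argparse.ArgumentParser()
parser.add_argument("--compression", default="zstd")
parser.add_argument("--compression-level", type=int, default=3)
args = parser.parse_args(["--compression-level", "7"])

result = sort_clickbench_data("in.parquet", "out.parquet",
                              compression=args.compression,
                              compression_level=args.compression_level)
```

With this wiring in place, a user-supplied `--compression-level 7` is visible at the writer call rather than being silently dropped.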
```python
parser.add_argument(
    '--row-group-size',
    type=int,
    default=64 * 1024,  # 64K rows
```
The examples and inline comment indicate a 1M row-group default, but the parser default here is 64K, which is inconsistent and could confuse users. Consider aligning the help text and default value to match the intended default.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The usage example states a default for row_group_size that does not match the command-line option's default. They should be aligned.
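One way to keep the help text and the parser from drifting apart is a single shared constant for the default. This is a hedged sketch, not the script's actual code; `DEFAULT_ROW_GROUP_SIZE` and `build_parser` are illustrative names:

```python
import argparse

# Single source of truth for the default, so the help text, the
# function signature, and the CLI parser cannot disagree.
DEFAULT_ROW_GROUP_SIZE = 64 * 1024  # 64K rows

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--row-group-size",
        type=int,
        default=DEFAULT_ROW_GROUP_SIZE,
        help=f"Rows per Parquet row group (default: {DEFAULT_ROW_GROUP_SIZE})",
    )
    return parser

args = build_parser().parse_args([])
```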
```python
if __name__ == '__main__':
    if len(sys.argv) == 1:
        print("Usage: python3 sort_clickbench_enhanced.py <input_file> <output_file>")
```
The usage message references sort_clickbench_enhanced.py, but the file name is sort_clickbench.py. This mismatch is user-facing and may cause confusion when following the printed instructions.
value:useful; category:typo; feedback:The Augment AI reviewer is correct! The name of the script is sort_clickbench.py, so the usage message needs to be updated. Fixing it prevents confusion for users trying to execute this script.
```python
    type=int,
    default=3,
    help='Compression level (default: 3 for zstd)'
)
```
Bug: Command-line compression level argument is never used
The --compression-level command-line argument is parsed via argparse but never passed to sort_clickbench_data(). The call at lines 238-244 omits args.compression_level, and the function hardcodes compression_level=None when calling pq.write_table(). Users setting this flag would see no effect on compression behavior.
value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
```python
# specific language governing permissions and limitations
# under the License.

#!/usr/bin/env python3
```
Bug: Duplicate shebang line in Python script
There are two shebang lines in the script - one at line 1 (#!/usr/bin/env python3) and another at line 20. The second shebang is likely a copy-paste artifact from merging different versions of the script. While it doesn't cause a runtime error (it's treated as a comment after line 1), it indicates accidentally committed code.
value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct! The shebang must be the very first line of a script; the second shebang should be removed, since it is interpreted as a normal comment.
Code Review - PR #94: Add sorted data benchmark

Thank you for this contribution! This PR adds a valuable benchmark for testing sort optimization in DataFusion. Here is my comprehensive review:

Strengths

Bugs and Issues

Critical Issues

1. Missing File Extension in Python Script (benchmarks/sort_clickbench.py:249)
2. Unused compression_level Parameter (benchmarks/sort_clickbench.py:210-214, 241)
3. Inconsistent Default Row Group Size (benchmarks/sort_clickbench.py:40 vs 201)

Medium Priority Issues

4. Missing Validation for sort_order Parameter (benchmarks/src/clickbench.rs:89-90)
5. No Error Handling for Invalid Sort Column (benchmarks/src/clickbench.rs:267-268)

Performance Considerations

Good Practices

Potential Improvements

6. Memory Usage in Python Sort (benchmarks/sort_clickbench.py:62)
7. Redundant Filter Pushdown Configuration (benchmarks/src/clickbench.rs:161-165)

Security Concerns

8. Shell Injection Risk in bench.sh (benchmarks/bench.sh:1250)
9. Path Validation in Rust (benchmarks/src/clickbench.rs:255)

Test Coverage

Missing Tests

10. No Unit Tests for New Rust Code
11. No Integration Tests for Shell Scripts
12. Python Script Has No Tests

Code Quality Improvements

13. Typo in Comments (benchmarks/sort_clickbench.py:19-20)
14. Inconsistent Emoji Usage
15. Debug Print Statements Left In (benchmarks/src/clickbench.rs:171-175)
16. Hardcoded Magic Numbers (benchmarks/sort_clickbench.py:79)

Documentation Gaps

17. Missing README or Documentation
18. Query File Comment Confusion (benchmarks/queries/clickbench/queries/sorted_data/q0.sql:1-2)

Recommendations

High Priority

Medium Priority

Low Priority

Overall Assessment

This is a solid contribution that adds valuable benchmarking capabilities. The core implementation is sound, but there are several bugs and missing tests that should be addressed before merging. The biggest concerns are:

With these issues addressed, this will be a great addition to the DataFusion benchmark suite! Recommendation: Request changes for the critical bugs, then approve after fixes.

Review generated with Claude Code
Actionable comments posted: 3
🧹 Nitpick comments (5)
benchmarks/src/clickbench.rs (1)
161-167: Consider consolidating pushdown enablement logic.

When both `--pushdown` and `--sorted-by` are provided, the same settings are applied twice. While harmless, this could be simplified:

```diff
- if self.pushdown {
-     parquet_options.pushdown_filters = true;
-     parquet_options.reorder_filters = true;
- }
-
- if self.sorted_by.is_some() {
-     // We should compare the dynamic topk optimization when data is sorted, so we make the
-     // assumption that filter pushdown is also enabled in this case.
-     parquet_options.pushdown_filters = true;
-     parquet_options.reorder_filters = true;
- }
+ if self.pushdown || self.sorted_by.is_some() {
+     // Enable filter pushdown for explicit --pushdown flag or sorted data benchmarks
+     // (sorted data benchmarks need pushdown for dynamic topk optimization comparison)
+     parquet_options.pushdown_filters = true;
+     parquet_options.reorder_filters = true;
+ }
```

benchmarks/sort_clickbench.py (2)
39-45: Default values inconsistency between function signature and CLI.

The function defaults differ from the CLI defaults:

- `row_group_size`: function = 1M rows vs CLI = 64K rows
- `compression`: function = `snappy` vs CLI = `zstd`

This creates confusion, as the function defaults are effectively dead code when invoked via the CLI. Align the defaults:

```diff
 def sort_clickbench_data(
     input_path: str,
     output_path: str,
-    row_group_size: int = 1024 * 1024,  # 1M rows default
-    compression: str = 'snappy',
+    row_group_size: int = 64 * 1024,  # 64K rows default
+    compression: str = 'zstd',
     verify: bool = True
 ):
```

Also applies to: 198-208
115-119: Remove extraneous `f` prefix from strings without placeholders.

Lines 115 and 119 use f-strings without any placeholders.

```diff
- print(f"\n✓ Successfully created sorted file!")
+ print("\n✓ Successfully created sorted file!")
  print(f"  Input: {input_size_mb:.1f} MB")
  print(f"  Output: {output_size_mb:.1f} MB")
  print(f"  Compression ratio: {input_size_mb/output_size_mb:.2f}x")
- print(f"\nRow Group Statistics:")
+ print("\nRow Group Statistics:")
```

Based on static analysis hints.

benchmarks/bench.sh (2)
benchmarks/bench.sh (2)
1209-1211: Remove leftover development comments.

Lines 1209-1211 and 1266-1267 contain "Add these functions to bench.sh" comments that appear to be leftover from development.

```diff
-# Sorted Data Benchmark Functions (Optimized for hits_0.parquet)
-# Add these functions to bench.sh
-
 # Creates sorted ClickBench data from hits_0.parquet (partitioned dataset)
```

```diff
-# Sorted Data Benchmark Functions for bench.sh
-# Add these functions to your bench.sh script
-
 # Runs the sorted data benchmark (sorted only) with sort order information
```

Also applies to: 1266-1267
1236-1248: Consider adding a Python version check for consistency.

Unlike `data_h2o` and `data_h2o_join`, which check for Python 3.10+, this function uses `python3` directly. While pyarrow works with Python 3.8+, adding a version check would improve consistency with other data generation functions.

This is optional, since pyarrow has broader Python version support than `falsa`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- benchmarks/bench.sh (4 hunks)
- benchmarks/queries/clickbench/queries/sorted_data/q0.sql (1 hunks)
- benchmarks/sort_clickbench.py (1 hunks)
- benchmarks/src/clickbench.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/src/clickbench.rs (4)

- benchmarks/src/util/options.rs (1): `config` (68-70)
- datafusion/datasource-parquet/src/file_format.rs (1): `parquet_options` (1192-1194)
- datafusion/core/src/execution/context/mod.rs (1): `new_with_config_rt` (349-356)
- datafusion/execution/src/config.rs (1): `target_partitions` (275-277)
🪛 Ruff (0.14.7)
benchmarks/sort_clickbench.py
1-1: Shebang is present but file is not executable
(EXE001)
20-20: Shebang is present but file is not executable
(EXE001)
20-20: Shebang should be at the beginning of the file
(EXE005)
115-115: f-string without any placeholders
Remove extraneous f prefix
(F541)
119-119: f-string without any placeholders
Remove extraneous f prefix
(F541)
134-134: Do not catch blind exception: Exception
(BLE001)
164-164: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
🔇 Additional comments (8)
benchmarks/queries/clickbench/queries/sorted_data/q0.sql (1)
1-3: LGTM - Query correctly tests reverse scan on sorted data.

The `ORDER BY "EventTime" DESC` on data sorted ASC tests the sort elimination optimization for reverse scans. The `binary_as_string` setting is correctly applied programmatically in the Rust code.

benchmarks/src/clickbench.rs (3)
138-148: LGTM - Forcing single partition preserves sort order.

This is the correct approach to ensure the benchmark measures the pure performance benefit of pre-sorted data without partition-induced reordering.
264-269: Column name escaping is simplistic but acceptable for benchmark code.

The current logic assumes column names either already have quotes or need them added. This works for controlled benchmark inputs like `"EventTime"` but wouldn't handle edge cases like embedded quotes. For a benchmark utility with known inputs, this is acceptable.
172-177: LGTM - Useful diagnostic output.benchmarks/sort_clickbench.py (1)
141-165: Verification is a lightweight sanity check.

The `verify_sorted_file` function only compares the first and last EventTime values across row groups, which wouldn't detect unsorted data within row groups. For a benchmark utility where the sorting is done programmatically, this quick sanity check is acceptable. The main `sort_clickbench_data` function performs a more thorough check on the first 1000 rows.
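The kind of boundary-only check described above can be sketched with plain Python. This is an illustrative model, not the script's code: each row group is reduced to a hypothetical `(first_value, last_value)` pair, and only the ordering between groups is verified:

```python
def row_groups_sorted(boundaries):
    """Check global ordering from per-row-group (first_value, last_value) pairs.

    This only verifies ordering between row groups (and that each group's
    first value does not exceed its last). Rows shuffled in the middle of
    a group go undetected, which is exactly the blind spot noted above.
    """
    prev_last = None
    for first, last in boundaries:
        if first > last:
            return False
        if prev_last is not None and first < prev_last:
            return False
        prev_last = last
    return True

ok = row_groups_sorted([(1, 10), (10, 25), (30, 31)])
bad = row_groups_sorted([(1, 10), (5, 25)])  # second group starts before the first ends
```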
102-104: LGTM - Documentation added for the new benchmark.

The new `data_sorted_clickbench` benchmark is properly documented in the usage section with a clear description.
328-330: LGTM - Command dispatch entries follow existing patterns.

Also applies to: 514-516
1270-1287: LGTM - Run function follows established patterns.

The function correctly:

- Ensures data exists before running
- Passes `--sorted-by` and `--sort-order` parameters to enable sort order awareness
- Uses the dedicated `sorted_data` queries path
- Follows the same structure as other `run_*` functions
```python
#!/usr/bin/env python3

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#!/usr/bin/env python3
"""
Sort ClickBench data by EventTime for reverse scan benchmark.
Enhanced version with configurable row group size and optimization options.
"""
```
Remove duplicate shebang line.

There are two shebang lines (lines 1 and 20). The second one at line 20 is invalid, as shebangs must be at the beginning of the file.

```diff
 #!/usr/bin/env python3
 # Licensed to the Apache Software Foundation (ASF) under one
 ...
 # under the License.
-#!/usr/bin/env python3
 """
 Sort ClickBench data by EventTime for reverse scan benchmark.
```

Based on static analysis hints.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The shebang must be the very first line of a script; the second shebang should be removed, since it is interpreted as a normal comment.
```python
    compression_level=None,  # Use default compression level
)
```
`--compression-level` CLI argument is parsed but not used.

The `compression_level` argument is defined in argparse (lines 211-215) but never passed to `pq.write_table()`. Line 104 hardcodes `compression_level=None`.
Either remove the unused argument or wire it through:

```diff
 def sort_clickbench_data(
     input_path: str,
     output_path: str,
     row_group_size: int = 64 * 1024,
     compression: str = 'zstd',
+    compression_level: int = None,
     verify: bool = True
 ):
```

And in `pq.write_table()`:

```diff
-    compression_level=None,  # Use default compression level
+    compression_level=compression_level,
```

And in `main()`:

```diff
     sort_clickbench_data(
         args.input,
         args.output,
         row_group_size=args.row_group_size,
         compression=args.compression,
+        compression_level=args.compression_level,
         verify=not args.no_verify
     )
```

Also applies to: 210-215
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
```rust
/// Column name that the data is sorted by (e.g., "EventTime")
/// If specified, DataFusion will be informed that the data has this sort order
/// using CREATE EXTERNAL TABLE with WITH ORDER clause
#[structopt(long = "sorted-by")]
sorted_by: Option<String>,

/// Sort order: ASC or DESC (default: ASC)
#[structopt(long = "sort-order", default_value = "ASC")]
sort_order: String,
```
Validate sort_order to prevent invalid SQL generation.
The sort_order parameter accepts any string but is directly interpolated into SQL. Invalid values like "INVALID" would produce malformed DDL.
Consider validating the value:

```diff
+    /// Sort order: ASC or DESC (default: ASC)
+    #[structopt(long = "sort-order", default_value = "ASC", possible_values = &["ASC", "DESC", "asc", "desc"])]
     sort_order: String,
```

Alternatively, validate at runtime before building the SQL:

```rust
if !["ASC", "DESC"].contains(&self.sort_order.to_uppercase().as_str()) {
    return Err(exec_datafusion_err!("sort_order must be ASC or DESC"));
}
```
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! There is currently no validation of the provided user input, and the value is passed directly to the SQL engine. It could even be exploited for SQL injection.
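The runtime guard suggested above is language-agnostic; here is the same idea as a small, self-contained Python sketch (the function name is hypothetical): normalize the user input, accept only the two valid keywords, and reject everything else before any SQL string is built.

```python
def normalize_sort_order(value: str) -> str:
    """Normalize a user-supplied sort order to 'ASC' or 'DESC'.

    Rejecting anything else before it reaches SQL generation prevents
    both malformed DDL and injection via the interpolated string.
    """
    upper = value.strip().upper()
    if upper not in ("ASC", "DESC"):
        raise ValueError(f"sort order must be ASC or DESC, got {value!r}")
    return upper

asc = normalize_sort_order("asc")
desc = normalize_sort_order(" DESC ")
try:
    normalize_sort_order("INVALID; DROP TABLE hits")
    rejected = False
except ValueError:
    rejected = True
```

Allow-listing the whole token (rather than escaping it) is the simplest safe choice here, since only two values are ever legal.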
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The usage example states a default for row_group_size that does not match the command-line option's default. They should be aligned.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The usage example states a default for row_group_size that does not match the command-line option's default. They should be aligned.
value:useful; category:typo; feedback:The Claude AI reviewer is correct! The name of the script is sort_clickbench.py, so the usage message needs to be updated. Fixing it prevents confusion for users trying to execute this script.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! There is currently no validation of the provided user input, and the value is passed directly to the SQL engine. It could even be exploited for SQL injection.