Conversation
Walkthrough

This change introduces a new sorted data benchmark workflow for ClickBench. It adds two bash functions.
```python
# Batch size for writing
write_batch_size=min(row_group_size, 1024 * 64),
# Enable compression for all columns
compression_level=None,  # Use default compression level
```
The --compression-level CLI option is parsed but not applied when writing the Parquet file (the writer sets compression_level=None), so the flag currently has no effect. Consider wiring this option through to the writer to avoid confusing behavior.
value:good-to-have; category:bug; feedback:The Augment AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
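Several reviewers flag the same wiring gap. As an illustration only, here is a minimal stdlib sketch of the fix; `write_table_stub` is a hypothetical stand-in for `pq.write_table` that just records its arguments, so one can check that the CLI value actually reaches the writer:

```python
import argparse

def write_table_stub(table, path, compression, compression_level):
    # Stand-in for pyarrow.parquet.write_table: records what it was
    # called with, so the wiring can be verified without pyarrow.
    return {"compression": compression, "compression_level": compression_level}

def sort_clickbench_data(input_path, output_path, compression="zstd",
                         compression_level=None):
    # The fix: accept compression_level and forward it instead of
    # hardcoding compression_level=None in the writer call.
    return write_table_stub(None, output_path,
                            compression=compression,
                            compression_level=compression_level)

parser = argparse.ArgumentParser()
parser.add_argument("--compression", default="zstd")
parser.add_argument("--compression-level", type=int, default=3)
args = parser.parse_args(["--compression-level", "7"])

result = sort_clickbench_data("in.parquet", "out.parquet",
                              compression=args.compression,
                              compression_level=args.compression_level)
```

With this wiring in place, a user-supplied `--compression-level 7` is visible at the writer call rather than being silently dropped.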
```python
parser.add_argument(
    '--row-group-size',
    type=int,
    default=64 * 1024,  # 64K rows
```
The examples and inline comment indicate a 1M row-group default, but the parser default here is 64K, which is inconsistent and could confuse users. Consider aligning the help text and default value to match the intended default.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The usage example states a default for row_group_size that does not match the command-line option's default. They should be aligned.
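One way to keep the help text and the parser from drifting apart is a single shared constant for the default. This is a hedged sketch, not the script's actual code; `DEFAULT_ROW_GROUP_SIZE` and `build_parser` are illustrative names:

```python
import argparse

# Single source of truth for the default, so the help text, the
# function signature, and the CLI parser cannot disagree.
DEFAULT_ROW_GROUP_SIZE = 64 * 1024  # 64K rows

def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--row-group-size",
        type=int,
        default=DEFAULT_ROW_GROUP_SIZE,
        help=f"Rows per Parquet row group (default: {DEFAULT_ROW_GROUP_SIZE})",
    )
    return parser

args = build_parser().parse_args([])
```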
```python
if __name__ == '__main__':
    if len(sys.argv) == 1:
        print("Usage: python3 sort_clickbench_enhanced.py <input_file> <output_file>")
```
The usage message references sort_clickbench_enhanced.py, but the file name is sort_clickbench.py. This mismatch is user-facing and may cause confusion when following the printed instructions.
value:useful; category:typo; feedback:The Augment AI reviewer is correct! The name of the script is sort_clickbench.py, so the usage message needs to be updated. Fixing it prevents confusion for users trying to execute this script.
```python
    type=int,
    default=3,
    help='Compression level (default: 3 for zstd)'
)
```
Bug: Command-line compression level argument is never used
The --compression-level command-line argument is parsed via argparse but never passed to sort_clickbench_data(). The call at lines 238-244 omits args.compression_level, and the function hardcodes compression_level=None when calling pq.write_table(). Users setting this flag would see no effect on compression behavior.
value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
```python
# specific language governing permissions and limitations
# under the License.

#!/usr/bin/env python3
```
Bug: Duplicate shebang line in Python script
There are two shebang lines in the script - one at line 1 (#!/usr/bin/env python3) and another at line 20. The second shebang is likely a copy-paste artifact from merging different versions of the script. While it doesn't cause a runtime error (it's treated as a comment after line 1), it indicates accidentally committed code.
value:good-to-have; category:bug; feedback:The Bugbot AI reviewer is correct! The shebang must be the very first line of a script; the second shebang should be removed, since it is interpreted as a normal comment.
Code Review - PR #94: Add sorted data benchmark

Thank you for this contribution! This PR adds a valuable benchmark for testing sort optimization in DataFusion. Here is my comprehensive review:

Strengths

Bugs and Issues

Critical Issues

1. Missing File Extension in Python Script (benchmarks/sort_clickbench.py:249)
2. Unused compression_level Parameter (benchmarks/sort_clickbench.py:210-214, 241)
3. Inconsistent Default Row Group Size (benchmarks/sort_clickbench.py:40 vs 201)

Medium Priority Issues

4. Missing Validation for sort_order Parameter (benchmarks/src/clickbench.rs:89-90)
5. No Error Handling for Invalid Sort Column (benchmarks/src/clickbench.rs:267-268)

Performance Considerations

Good Practices

Potential Improvements

6. Memory Usage in Python Sort (benchmarks/sort_clickbench.py:62)
7. Redundant Filter Pushdown Configuration (benchmarks/src/clickbench.rs:161-165)

Security Concerns

8. Shell Injection Risk in bench.sh (benchmarks/bench.sh:1250)
9. Path Validation in Rust (benchmarks/src/clickbench.rs:255)

Test Coverage

Missing Tests

10. No Unit Tests for New Rust Code
11. No Integration Tests for Shell Scripts
12. Python Script Has No Tests

Code Quality Improvements

13. Typo in Comments (benchmarks/sort_clickbench.py:19-20)
14. Inconsistent Emoji Usage
15. Debug Print Statements Left In (benchmarks/src/clickbench.rs:171-175)
16. Hardcoded Magic Numbers (benchmarks/sort_clickbench.py:79)

Documentation Gaps

17. Missing README or Documentation
18. Query File Comment Confusion (benchmarks/queries/clickbench/queries/sorted_data/q0.sql:1-2)

Recommendations

High Priority

Medium Priority

Low Priority

Overall Assessment

This is a solid contribution that adds valuable benchmarking capabilities. The core implementation is sound, but there are several bugs and missing tests that should be addressed before merging. The biggest concerns are:

With these issues addressed, this will be a great addition to the DataFusion benchmark suite! Recommendation: Request changes for the critical bugs, then approve after fixes.

Review generated with Claude Code
Actionable comments posted: 3
🧹 Nitpick comments (5)
benchmarks/src/clickbench.rs (1)
161-167: Consider consolidating pushdown enablement logic.

When both `--pushdown` and `--sorted-by` are provided, the same settings are applied twice. While harmless, this could be simplified:

```diff
- if self.pushdown {
-     parquet_options.pushdown_filters = true;
-     parquet_options.reorder_filters = true;
- }
-
- if self.sorted_by.is_some() {
-     // We should compare the dynamic topk optimization when data is sorted, so we make the
-     // assumption that filter pushdown is also enabled in this case.
-     parquet_options.pushdown_filters = true;
-     parquet_options.reorder_filters = true;
- }
+ if self.pushdown || self.sorted_by.is_some() {
+     // Enable filter pushdown for explicit --pushdown flag or sorted data benchmarks
+     // (sorted data benchmarks need pushdown for dynamic topk optimization comparison)
+     parquet_options.pushdown_filters = true;
+     parquet_options.reorder_filters = true;
+ }
```

benchmarks/sort_clickbench.py (2)
39-45: Default values inconsistency between function signature and CLI.

The function defaults differ from the CLI defaults:

- `row_group_size`: function = 1M rows vs CLI = 64K rows
- `compression`: function = `snappy` vs CLI = `zstd`

This creates confusion, as the function defaults are effectively dead code when invoked via the CLI. Align the defaults:

```diff
 def sort_clickbench_data(
     input_path: str,
     output_path: str,
-    row_group_size: int = 1024 * 1024,  # 1M rows default
-    compression: str = 'snappy',
+    row_group_size: int = 64 * 1024,  # 64K rows default
+    compression: str = 'zstd',
     verify: bool = True
 ):
```

Also applies to: 198-208
115-119: Remove extraneous `f` prefix from strings without placeholders.

Lines 115 and 119 use f-strings without any placeholders.

```diff
- print(f"\n✓ Successfully created sorted file!")
+ print("\n✓ Successfully created sorted file!")
  print(f"  Input: {input_size_mb:.1f} MB")
  print(f"  Output: {output_size_mb:.1f} MB")
  print(f"  Compression ratio: {input_size_mb/output_size_mb:.2f}x")
- print(f"\nRow Group Statistics:")
+ print("\nRow Group Statistics:")
```

Based on static analysis hints.

benchmarks/bench.sh (2)
benchmarks/bench.sh (2)
1209-1211: Remove leftover development comments.

Lines 1209-1211 and 1266-1267 contain "Add these functions to bench.sh" comments that appear to be leftover from development.

```diff
-# Sorted Data Benchmark Functions (Optimized for hits_0.parquet)
-# Add these functions to bench.sh
-
 # Creates sorted ClickBench data from hits_0.parquet (partitioned dataset)
```

```diff
-# Sorted Data Benchmark Functions for bench.sh
-# Add these functions to your bench.sh script
-
 # Runs the sorted data benchmark (sorted only) with sort order information
```

Also applies to: 1266-1267
1236-1248: Consider adding a Python version check for consistency.

Unlike `data_h2o` and `data_h2o_join`, which check for Python 3.10+, this function uses `python3` directly. While pyarrow works with Python 3.8+, adding a version check would improve consistency with other data generation functions.

This is optional, since pyarrow has broader Python version support than `falsa`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- benchmarks/bench.sh (4 hunks)
- benchmarks/queries/clickbench/queries/sorted_data/q0.sql (1 hunks)
- benchmarks/sort_clickbench.py (1 hunks)
- benchmarks/src/clickbench.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/src/clickbench.rs (4)

- benchmarks/src/util/options.rs (1): `config` (68-70)
- datafusion/datasource-parquet/src/file_format.rs (1): `parquet_options` (1192-1194)
- datafusion/core/src/execution/context/mod.rs (1): `new_with_config_rt` (349-356)
- datafusion/execution/src/config.rs (1): `target_partitions` (275-277)
🪛 Ruff (0.14.7)
benchmarks/sort_clickbench.py
1-1: Shebang is present but file is not executable
(EXE001)
20-20: Shebang is present but file is not executable
(EXE001)
20-20: Shebang should be at the beginning of the file
(EXE005)
115-115: f-string without any placeholders
Remove extraneous f prefix
(F541)
119-119: f-string without any placeholders
Remove extraneous f prefix
(F541)
134-134: Do not catch blind exception: Exception
(BLE001)
164-164: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
- GitHub Check: Analyze (rust)
🔇 Additional comments (8)
benchmarks/queries/clickbench/queries/sorted_data/q0.sql (1)
1-3: LGTM - Query correctly tests reverse scan on sorted data.

The `ORDER BY "EventTime" DESC` on data sorted ASC tests the sort elimination optimization for reverse scans. The `binary_as_string` setting is correctly applied programmatically in the Rust code.

benchmarks/src/clickbench.rs (3)
138-148: LGTM - Forcing single partition preserves sort order.

This is the correct approach to ensure the benchmark measures the pure performance benefit of pre-sorted data without partition-induced reordering.
264-269: Column name escaping is simplistic but acceptable for benchmark code.

The current logic assumes column names either already have quotes or need them added. This works for controlled benchmark inputs like `"EventTime"` but wouldn't handle edge cases like embedded quotes. For a benchmark utility with known inputs, this is acceptable.
172-177: LGTM - Useful diagnostic output.benchmarks/sort_clickbench.py (1)
141-165: Verification is a lightweight sanity check.

The `verify_sorted_file` function only compares the first and last EventTime values across row groups, which wouldn't detect unsorted data within row groups. For a benchmark utility where the sorting is done programmatically, this quick sanity check is acceptable. The main `sort_clickbench_data` function performs a more thorough check on the first 1000 rows.
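The kind of boundary-only check described above can be sketched with plain Python. This is an illustrative model, not the script's code: each row group is reduced to a hypothetical `(first_value, last_value)` pair, and only the ordering between groups is verified:

```python
def row_groups_sorted(boundaries):
    """Check global ordering from per-row-group (first_value, last_value) pairs.

    This only verifies ordering between row groups (and that each group's
    first value does not exceed its last). Rows shuffled in the middle of
    a group go undetected, which is exactly the blind spot noted above.
    """
    prev_last = None
    for first, last in boundaries:
        if first > last:
            return False
        if prev_last is not None and first < prev_last:
            return False
        prev_last = last
    return True

ok = row_groups_sorted([(1, 10), (10, 25), (30, 31)])
bad = row_groups_sorted([(1, 10), (5, 25)])  # second group starts before the first ends
```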
102-104: LGTM - Documentation added for the new benchmark.

The new `data_sorted_clickbench` benchmark is properly documented in the usage section with a clear description.
328-330: LGTM - Command dispatch entries follow existing patterns.

Also applies to: 514-516
1270-1287: LGTM - Run function follows established patterns.

The function correctly:

- Ensures data exists before running
- Passes `--sorted-by` and `--sort-order` parameters to enable sort order awareness
- Uses the dedicated `sorted_data` queries path
- Follows the same structure as other `run_*` functions
```python
#!/usr/bin/env python3

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#!/usr/bin/env python3
"""
Sort ClickBench data by EventTime for reverse scan benchmark.
Enhanced version with configurable row group size and optimization options.
"""
```
Remove duplicate shebang line.

There are two shebang lines (lines 1 and 20). The second one at line 20 is invalid, as shebangs must be at the beginning of the file.

```diff
 #!/usr/bin/env python3
 # Licensed to the Apache Software Foundation (ASF) under one
 ...
 # under the License.
-#!/usr/bin/env python3
 """
 Sort ClickBench data by EventTime for reverse scan benchmark.
```

Based on static analysis hints.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The shebang must be the very first line of a script; the second shebang should be removed, since it is interpreted as a normal comment.
```python
    compression_level=None,  # Use default compression level
)
```
`--compression-level` CLI argument is parsed but not used.

The `compression_level` argument is defined in argparse (lines 211-215) but never passed to `pq.write_table()`. Line 104 hardcodes `compression_level=None`.
Either remove the unused argument or wire it through:

```diff
 def sort_clickbench_data(
     input_path: str,
     output_path: str,
     row_group_size: int = 64 * 1024,
     compression: str = 'zstd',
+    compression_level: int = None,
     verify: bool = True
 ):
```

And in `pq.write_table()`:

```diff
-    compression_level=None,  # Use default compression level
+    compression_level=compression_level,
```

And in `main()`:

```diff
     sort_clickbench_data(
         args.input,
         args.output,
         row_group_size=args.row_group_size,
         compression=args.compression,
+        compression_level=args.compression_level,
         verify=not args.no_verify
     )
```

Also applies to: 210-215
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
```rust
/// Column name that the data is sorted by (e.g., "EventTime")
/// If specified, DataFusion will be informed that the data has this sort order
/// using CREATE EXTERNAL TABLE with WITH ORDER clause
#[structopt(long = "sorted-by")]
sorted_by: Option<String>,

/// Sort order: ASC or DESC (default: ASC)
#[structopt(long = "sort-order", default_value = "ASC")]
sort_order: String,
```
Validate sort_order to prevent invalid SQL generation.
The sort_order parameter accepts any string but is directly interpolated into SQL. Invalid values like "INVALID" would produce malformed DDL.
Consider validating the value:

```diff
+    /// Sort order: ASC or DESC (default: ASC)
+    #[structopt(long = "sort-order", default_value = "ASC", possible_values = &["ASC", "DESC", "asc", "desc"])]
     sort_order: String,
```

Alternatively, validate at runtime before building the SQL:

```rust
if !["ASC", "DESC"].contains(&self.sort_order.to_uppercase().as_str()) {
    return Err(exec_datafusion_err!("sort_order must be ASC or DESC"));
}
```
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! There is currently no validation of the provided user input, and the value is passed directly to the SQL engine. It could even be exploited for SQL injection.
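The runtime guard suggested above is language-agnostic; here is the same idea as a small, self-contained Python sketch (the function name is hypothetical): normalize the user input, accept only the two valid keywords, and reject everything else before any SQL string is built.

```python
def normalize_sort_order(value: str) -> str:
    """Normalize a user-supplied sort order to 'ASC' or 'DESC'.

    Rejecting anything else before it reaches SQL generation prevents
    both malformed DDL and injection via the interpolated string.
    """
    upper = value.strip().upper()
    if upper not in ("ASC", "DESC"):
        raise ValueError(f"sort order must be ASC or DESC, got {value!r}")
    return upper

asc = normalize_sort_order("asc")
desc = normalize_sort_order(" DESC ")
try:
    normalize_sort_order("INVALID; DROP TABLE hits")
    rejected = False
except ValueError:
    rejected = True
```

Allow-listing the whole token (rather than escaping it) is the simplest safe choice here, since only two values are ever legal.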
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! The compression level option is ignored and never used. Fixing it prevents user confusion about why the files are not compressed at the requested level.
value:useful; category:bug; feedback:The Claude AI reviewer is correct! The usage example states a default for row_group_size that does not match the command-line option's default. They should be aligned.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct! The usage example states a default for row_group_size that does not match the command-line option's default. They should be aligned.
value:useful; category:typo; feedback:The Claude AI reviewer is correct! The name of the script is sort_clickbench.py, so the usage message needs to be updated. Fixing it prevents confusion for users trying to execute this script.
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! There is currently no validation of the provided user input, and the value is passed directly to the SQL engine. It could even be exploited for SQL injection.