@KKould KKould commented Oct 20, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

#1218 (comment)

This issue reported that on large_statement, the Databend parser performs significantly worse than the DataFusion parser.

This PR tries to identify the differences that cause the poor performance and fix them:

  • run_pratt_parser clones the entire statement for error display, which incurs an extremely high copy cost and can cause a stack overflow on large_statement (about 50% of the time spent)
  • For simple tokens, using rule! causes repeated parsing. This PR removes rule! for binary_op, unary_op, and json_op and matches on TokenKind directly (binary_op, unary_op, and json_op accounted for about 15% of the time spent)
  • Similar parse rules can cause severe parse backtracking; see feat(parse): Simplify the matching pattern when parse function, avoid exponential backtracking #17942
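
The first two points can be illustrated with a minimal sketch. The `TokenKind` and `BinaryOp` enums and all function names below are hypothetical stand-ins, not Databend's real types, and the loop over alternatives is only a simplified model of how a combinator-style rule might try one alternative after another. The last function shows the other idea: borrow the tokens while parsing and copy them only when an error actually has to be reported.

```rust
// Hypothetical token kinds; the real `TokenKind` enum is assumed to be much larger.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TokenKind {
    Plus,
    Minus,
    Multiply,
    Divide,
    Ident,
}

// Hypothetical operator type produced by the parser.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

/// Simplified model of combinator-style matching: each alternative is tried
/// in turn, so a token that matches a late alternative is inspected many times.
fn binary_op_alternatives(kind: TokenKind, attempts: &mut usize) -> Option<BinaryOp> {
    let alternatives = [
        (TokenKind::Plus, BinaryOp::Add),
        (TokenKind::Minus, BinaryOp::Sub),
        (TokenKind::Multiply, BinaryOp::Mul),
        (TokenKind::Divide, BinaryOp::Div),
    ];
    for (expected, op) in alternatives {
        *attempts += 1;
        if kind == expected {
            return Some(op);
        }
    }
    None
}

/// Direct matching: a single `match` on the token kind decides the result.
fn binary_op_direct(kind: TokenKind) -> Option<BinaryOp> {
    match kind {
        TokenKind::Plus => Some(BinaryOp::Add),
        TokenKind::Minus => Some(BinaryOp::Sub),
        TokenKind::Multiply => Some(BinaryOp::Mul),
        TokenKind::Divide => Some(BinaryOp::Div),
        _ => None,
    }
}

/// Deferred error context: the token slice is only borrowed during parsing
/// and is copied (`to_vec`) solely on the failure path.
fn parse_ops(tokens: &[TokenKind]) -> Result<Vec<BinaryOp>, Vec<TokenKind>> {
    tokens
        .iter()
        .map(|&kind| binary_op_direct(kind).ok_or_else(|| tokens.to_vec()))
        .collect()
}
```

In this model, `binary_op_alternatives` needs four attempts to recognize `TokenKind::Divide`, while `binary_op_direct` decides in one `match`. `parse_ops` never copies its input when every token is a valid operator.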

Bench code: #1218 (comment)

before this PR:

bench                               fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ dummy                                          │               │               │               │         │
   ├─ large_statement_1_databend    158.7 ms      │ 168.3 ms      │ 161 ms        │ 162.3 ms      │ 4       │ 4
   ╰─ large_statement_1_datafusion  4.843 ms      │ 5.276 ms      │ 5.021 ms      │ 5.023 ms      │ 100     │ 100

after this PR:
Tip: remove src/query/ast/src/parser/parser.rs:58 before running the benchmark. I don't know why debug_assertions is still true under cargo bench.

bench                               fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ dummy                                          │               │               │               │         │
   ├─ large_statement_1_databend    52.28 ms      │ 55.9 ms       │ 52.78 ms      │ 53.16 ms      │ 10      │ 10
   ╰─ large_statement_1_datafusion  5.088 ms      │ 6.517 ms      │ 5.308 ms      │ 5.358 ms      │ 94      │ 94

I also tried minimizing the branching of expr for large_statement (keeping only #binary_op | #function_call | #column_ref | #literal):

bench                               fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ dummy                                          │               │               │               │         │
   ├─ large_statement_1_databend    15.95 ms      │ 20.12 ms      │ 16.39 ms      │ 16.53 ms      │ 30      │ 30
   ╰─ large_statement_1_datafusion  4.903 ms      │ 5.606 ms      │ 5.004 ms      │ 5.007 ms      │ 100     │ 100
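
To make the three tables easier to compare, the medians can be turned into ratios. The sketch below copies the median values from the tables above; the constant and function names are illustrative only. By these numbers the PR itself is roughly a 3x speedup, the minimized-branching experiment is just under 10x, and the gap to DataFusion shrinks from about 32x to about 3.3x.

```rust
/// Median timings in milliseconds, copied from the three benchmark tables
/// (before this PR, after this PR, minimized expr branching).
const DATABEND_MEDIAN_MS: [f64; 3] = [161.0, 52.78, 16.39];
const DATAFUSION_MEDIAN_MS: [f64; 3] = [5.021, 5.308, 5.004];

/// Speedup of each run relative to the first (baseline) run.
fn speedups_vs_baseline(medians: &[f64]) -> Vec<f64> {
    medians.iter().map(|m| medians[0] / m).collect()
}

/// Remaining gap: how many times slower `ours` is than `theirs`, run by run.
fn gaps(ours: &[f64], theirs: &[f64]) -> Vec<f64> {
    ours.iter().zip(theirs).map(|(a, b)| a / b).collect()
}
```

These are single-machine medians with few samples on the Databend side (4, 10, and 30), so the ratios are best read as rough indicators.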

Currently, there is still a lot of backtracking on large_statement, which is the next direction for optimizing the parser.
flamegraph
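
As a general illustration of why backtracking gets expensive, here is a toy recursive-descent parser. It is not Databend's grammar; the grammar, the `Parser` type, and `count_primary_calls` are all made up for this sketch. Two alternatives share the prefix `primary`, so the naive version re-parses that prefix whenever the first alternative fails, and the work doubles at each nesting level. The left-factored version parses the shared prefix once.

```rust
/// Toy grammar over bytes (hypothetical, for illustration only):
///   expr    := primary '+' expr | primary
///   primary := '(' expr ')' | 'n'
struct Parser<'a> {
    input: &'a [u8],
    /// Number of times `primary` was entered.
    calls: usize,
}

impl<'a> Parser<'a> {
    fn primary(&mut self, pos: usize, factored: bool) -> Option<usize> {
        self.calls += 1;
        match self.input.get(pos).copied()? {
            b'n' => Some(pos + 1),
            b'(' => {
                let after = self.expr(pos + 1, factored)?;
                if self.input.get(after) == Some(&b')') {
                    Some(after + 1)
                } else {
                    None
                }
            }
            _ => None,
        }
    }

    fn expr(&mut self, pos: usize, factored: bool) -> Option<usize> {
        if factored {
            // Left-factored: parse the shared prefix once, then look for '+'.
            let after = self.primary(pos, factored)?;
            if self.input.get(after) == Some(&b'+') {
                if let Some(end) = self.expr(after + 1, factored) {
                    return Some(end);
                }
            }
            Some(after)
        } else {
            // Naive: try `primary '+' expr` first.
            if let Some(after) = self.primary(pos, factored) {
                if self.input.get(after) == Some(&b'+') {
                    if let Some(end) = self.expr(after + 1, factored) {
                        return Some(end);
                    }
                }
            }
            // Backtrack and parse `primary` again from the same position.
            self.primary(pos, factored)
        }
    }
}

/// Returns whether the whole input parsed, and how often `primary` ran.
fn count_primary_calls(input: &str, factored: bool) -> (bool, usize) {
    let mut parser = Parser { input: input.as_bytes(), calls: 0 };
    let ok = parser.expr(0, factored) == Some(input.len());
    (ok, parser.calls)
}
```

For the input `((n))`, the naive version enters `primary` 14 times and the left-factored version 3 times. With ten levels of parentheses the counts are 4094 and 11.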

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@KKould KKould self-assigned this Oct 20, 2025
@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Oct 20, 2025
@KKould KKould requested review from BohuTANG, b41sh and sundy-li October 21, 2025 06:06
@KKould KKould marked this pull request as ready for review October 21, 2025 06:06
@BohuTANG BohuTANG merged commit 16c7380 into databendlabs:main Oct 21, 2025
250 of 256 checks passed