feat: optimize expression parse #18871

KKould · 2025-10-20T08:22:26Z

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This issue reports that, on large_statement, the Databend parser is significantly slower than the DataFusion parser.

This PR attempts to identify the differences that cause the poor performance and to fix them:

run_pratt_parser clones the entire statement for error display, which incurs an extremely high copy cost and can overflow the stack on large_statement (50% of the parse time).
For simple tokens, using rule! causes repeated parsing. This PR removes rule! from binary_op, unary_op, and json_op and matches on TokenKind directly (binary_op, unary_op, and json_op accounted for 15% of the parse time).
Similar parse rules can cause severe backtracking; see feat(parse): Simplify the matching pattern when parse function, avoid exponential backtracking #17942
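
To illustrate the second point, here is a minimal sketch of the difference between trying alternatives one at a time and matching on the token kind directly. The `TokenKind` and `BinaryOp` enums and both functions are hypothetical, simplified stand-ins; they are not Databend's real definitions, and the loop only approximates what an ordered-choice combinator does.

```rust
// Hypothetical, simplified types for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TokenKind {
    Plus,
    Minus,
    Multiply,
    Divide,
    Ident,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

// Ordered alternatives, as a combinator-style rule would list them.
const ALTERNATIVES: [(TokenKind, BinaryOp); 4] = [
    (TokenKind::Plus, BinaryOp::Add),
    (TokenKind::Minus, BinaryOp::Sub),
    (TokenKind::Multiply, BinaryOp::Mul),
    (TokenKind::Divide, BinaryOp::Div),
];

// Tries every alternative in order; each failed attempt inspects the same
// token again. `attempts` counts how many alternatives were tried.
fn binary_op_by_alternatives(kind: TokenKind, attempts: &mut usize) -> Option<BinaryOp> {
    for &(expected, op) in ALTERNATIVES.iter() {
        *attempts += 1;
        if kind == expected {
            return Some(op);
        }
    }
    None
}

// Inspects the token kind once and dispatches with a single `match`.
fn binary_op_by_match(kind: TokenKind) -> Option<BinaryOp> {
    match kind {
        TokenKind::Plus => Some(BinaryOp::Add),
        TokenKind::Minus => Some(BinaryOp::Sub),
        TokenKind::Multiply => Some(BinaryOp::Mul),
        TokenKind::Divide => Some(BinaryOp::Div),
        _ => None,
    }
}
```

In this sketch, the last alternative costs four attempts in the ordered version, while the `match` version looks at the token exactly once regardless of which operator it is.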

bench code on #1218 (comment)

before this PR:

bench                               fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ dummy                                          │               │               │               │         │
   ├─ large_statement_1_databend    158.7 ms      │ 168.3 ms      │ 161 ms        │ 162.3 ms      │ 4       │ 4
   ╰─ large_statement_1_datafusion  4.843 ms      │ 5.276 ms      │ 5.021 ms      │ 5.023 ms      │ 100     │ 100

after this PR:
Tip: remove src/query/ast/src/parser/parser.rs:58 before running the benchmark. I don't know why debug_assertions is still true under cargo bench.

bench                               fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ dummy                                          │               │               │               │         │
   ├─ large_statement_1_databend    52.28 ms      │ 55.9 ms       │ 52.78 ms      │ 53.16 ms      │ 10      │ 10
   ╰─ large_statement_1_datafusion  5.088 ms      │ 6.517 ms      │ 5.308 ms      │ 5.358 ms      │ 94      │ 94

I also tried minimizing the branches of expr for large_statement (keeping only #binary_op | #function_call | #column_ref | #literal):

bench                               fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ dummy                                          │               │               │               │         │
   ├─ large_statement_1_databend    15.95 ms      │ 20.12 ms      │ 16.39 ms      │ 16.53 ms      │ 30      │ 30
   ╰─ large_statement_1_datafusion  4.903 ms      │ 5.606 ms      │ 5.004 ms      │ 5.007 ms      │ 100     │ 100

Currently, there is still a lot of backtracking under large_statement, so reducing it is the next direction for optimizing the parser.
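
As a rough illustration of why backtracking is costly, the sketch below counts parser calls on a toy grammar where two alternatives share the prefix `'(' expr ')'`. The grammar, function names, and input are all hypothetical and unrelated to Databend's actual rules; the point is only that re-parsing a shared prefix doubles the work at every nesting level, while a left-factored version parses it once.

```rust
// Toy grammar (an assumption for illustration):
//   expr := '(' expr ')' '+' 'x'  |  '(' expr ')'  |  'x'

// Parses '(' expr ')' starting at `pos`, using the backtracking `parse_naive`.
fn parse_parenthesized_naive(input: &[u8], pos: usize, calls: &mut usize) -> Option<usize> {
    if input.get(pos).copied() != Some(b'(') {
        return None;
    }
    let inner_end = parse_naive(input, pos + 1, calls)?;
    if input.get(inner_end).copied() == Some(b')') {
        Some(inner_end + 1)
    } else {
        None
    }
}

// Ordered choice with backtracking: when alternative 1 fails after the
// shared prefix, alternative 2 parses that prefix again from scratch.
fn parse_naive(input: &[u8], pos: usize, calls: &mut usize) -> Option<usize> {
    *calls += 1;
    // Alternative 1: '(' expr ')' '+' 'x'
    if let Some(end) = parse_parenthesized_naive(input, pos, calls) {
        if input.get(end).copied() == Some(b'+') && input.get(end + 1).copied() == Some(b'x') {
            return Some(end + 2);
        }
    }
    // Alternative 2: '(' expr ')'
    if let Some(end) = parse_parenthesized_naive(input, pos, calls) {
        return Some(end);
    }
    // Alternative 3: 'x'
    if input.get(pos).copied() == Some(b'x') {
        return Some(pos + 1);
    }
    None
}

// Left-factored version: parse the shared prefix once, then check for the
// optional '+' 'x' suffix.
fn parse_factored(input: &[u8], pos: usize, calls: &mut usize) -> Option<usize> {
    *calls += 1;
    match input.get(pos).copied() {
        Some(b'(') => {
            let inner_end = parse_factored(input, pos + 1, calls)?;
            if input.get(inner_end).copied() != Some(b')') {
                return None;
            }
            let end = inner_end + 1;
            if input.get(end).copied() == Some(b'+') && input.get(end + 1).copied() == Some(b'x') {
                Some(end + 2)
            } else {
                Some(end)
            }
        }
        Some(b'x') => Some(pos + 1),
        _ => None,
    }
}

// Builds `depth` opening parentheses, an 'x', then `depth` closing ones.
fn nested(depth: usize) -> Vec<u8> {
    let mut s = vec![b'('; depth];
    s.push(b'x');
    s.extend(std::iter::repeat(b')').take(depth));
    s
}
```

For an input nested `n` levels deep, the backtracking version makes `2^(n+1) - 1` calls to `parse_naive`, while the left-factored version makes `n + 1` calls to `parse_factored`; both accept the same inputs.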

Tests

Unit Test
Logic Test
Benchmark Test
No Test - Explain why

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Breaking Change (fix or feature that could cause existing functionality not to work as expected)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):


KKould added 2 commits October 20, 2025 16:08

perf: optimize expression parse

8284881

chore: update ast bench comment

379e13a

KKould self-assigned this Oct 20, 2025

github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Oct 20, 2025

KKould added 2 commits October 20, 2025 17:46

chore: fix 03_0016_insert_into_values.test

24ae206

chore: codefmt

4ea85ed

KKould requested review from BohuTANG, b41sh and sundy-li October 21, 2025 06:06

KKould marked this pull request as ready for review October 21, 2025 06:06

sundy-li approved these changes Oct 21, 2025


BohuTANG merged commit 16c7380 into databendlabs:main Oct 21, 2025
250 of 256 checks passed
