Skip to content

Conversation

@geooo109
Copy link
Collaborator

@geooo109 geooo109 commented Oct 20, 2025

Fixes #6137

We should avoid simplifying cases like SELECT 1 AND x or SELECT 0 or x.
So, this PR makes this check tighter by checking if the LHS or RHS is Boolean type.

@georgesittas
Copy link
Collaborator

Discussed offline– we should check whether all engines evaluate connectors into boolean values, instead of treating them as control flow. If SELECT <always true value> AND other is always boolean, we can't just simplify it to SELECT other, because the semantics change. Unfortunately, this is what we do today, which is incorrect.

@geooo109 geooo109 changed the title fix(optimizer)!: simplify AND with static BOOLEAN type fix(optimizer)!: simplify AND/OR with static BOOLEAN type Oct 21, 2025
@geooo109 geooo109 force-pushed the geooo109/simplify_and branch 2 times, most recently from 306b884 to 66ffdd7 Compare October 22, 2025 12:55
@georgesittas georgesittas marked this pull request as draft October 23, 2025 09:49
@georgesittas
Copy link
Collaborator

We discovered that post-simplifying we lose some types. That is a bug and we decided to address it in this PR instead of postponing, because that will reduce the PR's complexity as a side-effect.

@georgesittas
Copy link
Collaborator

@barakalon @tobymao curious what your thoughts are on this.

The problem: today we infer types & then run other rules that transform the AST, resulting in building new, untyped, expressions. This is a bug on its own, and it also affects type-sensitive rules. E.g., simplify now checks the boolean type to apply certain transformations but inbetween types are lost and so it cannot always work.

My worry is that fixing each call site introduces a new pattern where we have to be alert for rules after annotate_types and ensure new AST nodes are always typed. OTOH, not sure how else we can unblock these rules in simplify that run up to a fix point...

@geooo109 geooo109 force-pushed the geooo109/simplify_and branch from e5d9ed2 to 052bdcc Compare October 29, 2025 13:37
@geooo109 geooo109 marked this pull request as ready for review October 29, 2025 15:06
@geooo109
Copy link
Collaborator Author

/benchmark

@github-actions
Copy link
Contributor

Benchmark Results

Legend:

  • 🟢🟢 = 2x+ faster
  • 🟢 = 1.1x - 2x faster
  • ⚪ = No significant change (< 1.1x)
  • 🔴 = 1.1x - 2x slower
  • 🔴🔴 = 2x+ slower

Parsing Benchmark

Benchmark bench_parse_main bench_parse_pr
parse_sqlglot_tpch 5.51 ms 5.61 ms: ⚪ 1.02x slower
parse_sqlglot_short 346 us 348 us: ⚪ 1.01x slower
parse_sqlglot_long 5.28 ms 5.31 ms: ⚪ 1.01x slower
parse_sqlglotrs_tpch 4.03 ms 4.08 ms: ⚪ 1.01x slower
parse_sqlglotrs_short 284 us 288 us: ⚪ 1.01x slower
parse_sqlglotrs_long 3.87 ms 3.99 ms: ⚪ 1.03x slower
parse_sqlglotrs_crazy 11.5 ms 11.9 ms: ⚪ 1.03x slower
Geometric mean (ref) 1.02x slower

Benchmark hidden because not significant (1): parse_sqlglot_crazy


Optimization Benchmark

Benchmark bench_optimize_main bench_optimize_pr
optimize_tpch 778 ms 788 ms: ⚪ 1.01x slower
optimize_condition_100 96.3 ms 97.0 ms: ⚪ 1.01x slower
Geometric mean (ref) 1.00x slower

Benchmark hidden because not significant (2): optimize_condition_10, optimize_condition_1000

@geooo109
Copy link
Collaborator Author

/benchmark

@github-actions
Copy link
Contributor

Benchmark Results

Legend:

  • 🟢🟢 = 2x+ faster
  • 🟢 = 1.1x - 2x faster
  • ⚪ = No significant change (< 1.1x)
  • 🔴 = 1.1x - 2x slower
  • 🔴🔴 = 2x+ slower

Parsing Benchmark

Benchmark bench_parse_main bench_parse_pr
parse_sqlglot_long 5.32 ms 5.30 ms: ⚪ 1.00x faster
parse_sqlglotrs_long 3.88 ms 3.91 ms: ⚪ 1.01x slower
Geometric mean (ref) 1.00x slower

Benchmark hidden because not significant (6): parse_sqlglot_tpch, parse_sqlglot_short, parse_sqlglot_crazy, parse_sqlglotrs_tpch, parse_sqlglotrs_short, parse_sqlglotrs_crazy


Optimization Benchmark

Benchmark bench_optimize_main bench_optimize_pr
optimize_condition_100 96.7 ms 96.0 ms: ⚪ 1.01x faster
Geometric mean (ref) 1.00x faster

Benchmark hidden because not significant (3): optimize_tpch, optimize_condition_10, optimize_condition_1000

@geooo109
Copy link
Collaborator Author

geooo109 commented Oct 30, 2025

Current benchmarks don't show a significant performance slow down.

On the other hand, queries like the following (create a lot of AST transformation due to the NOT (x CONNECTOR y) => re-compute hash), create significant slow downs:

SELECT * FROM t WHERE
    NOT (
      NOT (
        NOT (
          NOT (
            NOT (
              (
                NOT ((a = 1 AND b = 2 AND TRUE) OR (c = 3 AND d = 4 OR FALSE)) AND
                NOT ((e = 5 OR f = 6 AND TRUE) AND (g = 7 OR h = 8 OR FALSE))
              ) OR (
                NOT (
                  NOT ((i = 9 AND j = 10) OR (k = 11 AND l = 12)) AND
                  NOT ((m = 13 OR n = 14) AND (o = 15 OR p = 16))
                )
              )
            ) AND (
              NOT (
                NOT (
                  NOT ((q = 17 AND r = 18) OR (s = 19 AND t = 20)) OR
                  NOT ((u = 21 OR v = 22) AND (w = 23 OR x = 24))
                ) AND (
                  (y = 25 OR NOT (z = 26 AND aa = 27))
                )
              )
            )
          ) OR (
            NOT (
              NOT (
                NOT (
                  (bb = 28 AND cc = 29) OR (dd = 30 AND ee = 31)
                ) AND (
                  (ff = 32 OR gg = 33) AND (hh = 34 OR ii = 35)
                )
              ) OR (
                NOT (
                  NOT ((jj = 36 AND kk = 37) OR (ll = 38 AND mm = 39))
                )
              )
            )
          )
        ) AND (
          NOT (
            NOT (
              NOT ((nn = 40 AND oo = 41) OR (pp = 42 AND qq = 43)) AND
              NOT ((rr = 44 OR ss = 45) AND (tt = 46 OR uu = 47))
            )
          ) OR (
            NOT (
              (vv = 48 AND NOT (ww = 49 OR xx = 50)) OR
              (NOT (yy = 51 AND zz = 52) AND aaa = 53)
            )
          )
        ) 
      )
    );

Old simplify (using the optimize function):
Total time for 50 runs: 8.7888s
Average time per run: 0.1758s

New simplify (using the optimize function):
Total time for 50 runs: 12.5861s
Average time per run: 0.2517s

@georgesittas
Copy link
Collaborator

I see, so stressing the "unhappy" code paths can result in a ~1.4x slowdown.. That's annoying, although somewhat expected. Let's pause for a bit and disuss to see how we can improve this- FYI @tobymao.

@geooo109 geooo109 force-pushed the geooo109/simplify_and branch from bea403f to 5fb0d4b Compare October 31, 2025 11:45
Comment on lines +180 to +183
dialect = kwargs.get("dialect")
annotator = TypeAnnotator(
schema=ensure_schema(None, dialect=dialect), overwrite_types=False
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe: having the right dialect is a must here, otherwise we may use incorrect inference rules.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering whether we should just introduce a class that encapsulates the simplify rules so that we can share state like the type annotator and dialect among them, similar to what we do in the annotate_types module.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgesittas yep, I aggree, this isn't the final solution for this. We can do that, I think is compact.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Aight, let's go with the class approach tomorrow then before benchmarking again.

Comment on lines -230 to +253
)
),
copy=False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we're copying in main as well. Did you compare this branch to main w/ copy=False in both?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, I took this fix up into consideration for the benchmark, and this branch is 1.24x slower compared the fixed main.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@georgesittas I can push the change also in the main and run a new benchmark here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's make sure this fix is both in main and here (I'd double-check/prove that there are no mutation edge cases) and benchmark both branches w/out copying in this paren call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The optimizer changes the query's result from a boolean value to the column's raw value

4 participants