Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer equality when comparing values #34164

Open
ranma42 opened this issue Jul 5, 2024 · 7 comments · May be fixed by #34166
Open

Prefer equality when comparing values #34164

ranma42 opened this issue Jul 5, 2024 · 7 comments · May be fixed by #34166

Comments

@ranma42
Copy link
Contributor

ranma42 commented Jul 5, 2024

Most databases can take advantage of indexes (and in some cases even auto-generate them) when queries perform equality comparisons (instead of inequality).

In some places EFCore already tries to lean towards equality

// use equality where possible
// !(a == true) -> a == false
// !(a == false) -> a == true
SqlBinaryExpression { OperatorType: ExpressionType.Equal, Right: SqlConstantExpression { Value: bool } } binary
=> Equal(binary.Left, Not(binary.Right)),
// !(true == a) -> false == a
// !(false == a) -> true == a
SqlBinaryExpression { OperatorType: ExpressionType.Equal, Left: SqlConstantExpression { Value: bool } } binary
=> Equal(Not(binary.Left), binary.Right),

In other places it disregards this and in fact can even replace equality comparisons with inequalities

// a == b <=> !a == !b -> a == b
// !a == b <=> a == !b -> a != b
// a != b <=> !a != !b -> a != b
// !a != b <=> a != !b -> a == b
nullable = false;
return sqlBinaryExpression.OperatorType == ExpressionType.Equal ^ leftNegated == rightNegated
? _sqlExpressionFactory.NotEqual(left, right)
: _sqlExpressionFactory.Equal(left, right);
// a == !b and !a == b in SQL evaluate the same as a != b
body = _sqlExpressionFactory.NotEqual(left, right);

It would be better to avoid converting equalities into inequalities and, when possible, convert inequalities into equalities.

@ranma42
Copy link
Contributor Author

ranma42 commented Jul 12, 2024

On Sqlite

create table test (a, b);

inequality has a much shorter plan:

explain select * from test as t1, test as t2 where t1.a != t2.b;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     16    0                    0
1     OpenRead       0     2     0     2              0
2     OpenRead       1     2     0     2              0
3     Rewind         0     15    0                    0
4       Rewind         1     15    0                    0
5         Column         0     0     1                    0
6         Column         1     1     2                    0
7         Eq             2     13    1     BINARY-8       81
8         Column         0     0     3                    0
9         Column         0     1     4                    0
10        Column         1     0     5                    0
11        Column         1     1     6                    0
12        ResultRow      3     4     0                    0
13      Next           1     5     0                    1
14    Next           0     4     0                    1
15    Halt           0     0     0                    0
16    Transaction    0     0     1     0              1
17    Goto           0     1     0                    0

but it performs a full table scan of the whole table for each row.

Equal-to-not has a more complex plan:

explain select * from test as t1, test as t2 where t1.a == NOT(t2.b);
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     30    0                    0
1     OpenRead       1     2     0     2              0
2     OpenRead       0     2     0     2              0
3     Rewind         1     29    0                    0
4       Once           0     16    0                    0
5       OpenAutoindex  2     3     0     k(3,B,,)       0
6       Explain        6     0     0     BLOOM FILTER ON t1 (a=?) 0
7       Blob           10000 1     0                    0
8       Rewind         0     16    0                    0
9         Column         0     0     3                    0
10        Column         0     1     4                    0
11        Rowid          0     5     0                    0
12        MakeRecord     3     3     2                    0
13        FilterAdd      1     0     3     1              0
14        IdxInsert      2     2     0                    16
15      Next           0     9     0                    3
16      Column         1     1     2                    0
17      Not            2     6     0                    0
18      IsNull         6     28    0                    0
19      Filter         1     28    6     1              0
20      SeekGE         2     28    6     1              0
21        IdxGT          2     28    6     1              0
22        Column         2     0     7                    0
23        Column         2     1     8                    0
24        Column         1     0     9                    0
25        Column         1     1     10                   0
26        ResultRow      7     4     0                    0
27      Next           2     21    0                    0
28    Next           1     4     0                    1
29    Halt           0     0     0                    0
30    Transaction    0     0     1     0              1
31    Goto           0     1     0                    0

which automatically creates a temporary index for column a and takes advantage of the index.

@ranma42
Copy link
Contributor Author

ranma42 commented Jul 12, 2024

IIUC similar results also hold true in postgres:

create table test (a boolean, b boolean);
explain select * from test as t1, test as t2 where t1.a != t2.b;

QUERY PLAN                                                                     |
-------------------------------------------------------------------------------+
Nested Loop  (cost=0.00..111057.20 rows=3699200 width=4)                       |
  Join Filter: (t1.a <> t2.b)                                                  |
  ->  Seq Scan on test t1  (cost=0.00..37.20 rows=2720 width=2)                |
  ->  Materialize  (cost=0.00..50.80 rows=2720 width=2)                        |
        ->  Seq Scan on test t2  (cost=0.00..37.20 rows=2720 width=2)          |
JIT:                                                                           |
  Functions: 6                                                                 |
  Options: Inlining false, Optimization false, Expressions true, Deforming true|
explain select * from test as t1, test as t2 where t1.a = NOT(t2.b);

QUERY PLAN                                                           |
---------------------------------------------------------------------+
Hash Join  (cost=71.20..41731.20 rows=3699200 width=4)               |
  Hash Cond: (t1.a = (NOT t2.b))                                     |
  ->  Seq Scan on test t1  (cost=0.00..37.20 rows=2720 width=2)      |
  ->  Hash  (cost=37.20..37.20 rows=2720 width=2)                    |
        ->  Seq Scan on test t2  (cost=0.00..37.20 rows=2720 width=2)|

@ranma42
Copy link
Contributor Author

ranma42 commented Jul 12, 2024

SqlServer 2022:

create table test (a bit, b bit);

SET SHOWPLAN_ALL ON;
StmtText                                                                                                           |StmtId|NodeId|Parent|PhysicalOp  |LogicalOp |Argument                                                                            |DefinedValues     |EstimateRows|EstimateIO|EstimateCPU|AvgRowSize|TotalSubtreeCost|OutputList                            |Warnings|Type    |Parallel|EstimateExecutions|
-------------------------------------------------------------------------------------------------------------------+------+------+------+------------+----------+------------------------------------------------------------------------------------+------------------+------------+----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+
select * from test as t1, test as t2 where t1.a != t2.b;                                                           |     1|     1|     0|            |          |1                                                                                   |                  |         1.0|          |           |          |      0.00657068|                                      |        |SELECT  |       0|                  |
  |--Nested Loops(Inner Join, WHERE:([master].[dbo].[test].[a] as [t1].[a]<>[master].[dbo].[test].[b] as [t2].[b]))|     1|     2|     1|Nested Loops|Inner Join|WHERE:([master].[dbo].[test].[a] as [t1].[a]<>[master].[dbo].[test].[b] as [t2].[b])|                  |         1.0|       0.0| 0.00000418|         9|      0.00657068|[t1].[a], [t1].[b], [t2].[a], [t2].[b]|        |PLAN_ROW|       0|               1.0|
       |--Table Scan(OBJECT:([master].[dbo].[test] AS [t2]))                                                       |     1|     3|     2|Table Scan  |Table Scan|OBJECT:([master].[dbo].[test] AS [t2])                                              |[t2].[a], [t2].[b]|         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t2].[a], [t2].[b]                    |        |PLAN_ROW|       0|               1.0|
       |--Table Scan(OBJECT:([master].[dbo].[test] AS [t1]))                                                       |     1|     4|     2|Table Scan  |Table Scan|OBJECT:([master].[dbo].[test] AS [t1])                                              |[t1].[a], [t1].[b]|         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t1].[a], [t1].[b]                    |        |PLAN_ROW|       0|               1.0|
StmtText                                                                                                              |StmtId|NodeId|Parent|PhysicalOp    |LogicalOp     |Argument                                                                                 |DefinedValues                                    |EstimateRows|EstimateIO|EstimateCPU|AvgRowSize|TotalSubtreeCost|OutputList                            |Warnings|Type    |Parallel|EstimateExecutions|
----------------------------------------------------------------------------------------------------------------------+------+------+------+--------------+--------------+-----------------------------------------------------------------------------------------+-------------------------------------------------+------------+----------+-----------+----------+----------------+--------------------------------------+--------+--------+--------+------------------+
select * from test as t1, test as t2 where t1.a = ~(t2.b)                                                             |     1|     1|     0|              |              |1                                                                                        |                                                 |         1.0|          |           |          |     0.024334392|                                      |        |SELECT  |       0|                  |
  |--Hash Match(Inner Join, HASH:([Expr1004])=([t1].[a]), RESIDUAL:([master].[dbo].[test].[a] as [t1].[a]=[Expr1004]))|     1|     2|     1|Hash Match    |Inner Join    |HASH:([Expr1004])=([t1].[a]), RESIDUAL:([master].[dbo].[test].[a] as [t1].[a]=[Expr1004])|                                                 |         1.0|       0.0|0.017765092|         9|     0.024334392|[t1].[a], [t1].[b], [t2].[a], [t2].[b]|        |PLAN_ROW|       0|               1.0|
       |--Compute Scalar(DEFINE:([Expr1004]=~[master].[dbo].[test].[b] as [t2].[b]))                                  |     1|     3|     2|Compute Scalar|Compute Scalar|DEFINE:([Expr1004]=~[master].[dbo].[test].[b] as [t2].[b])                               |[Expr1004]=~[master].[dbo].[test].[b] as [t2].[b]|         1.0|       0.0|  0.0000001|         9|       0.0032832|[t2].[a], [t2].[b], [Expr1004]        |        |PLAN_ROW|       0|               1.0|
       |    |--Table Scan(OBJECT:([master].[dbo].[test] AS [t2]))                                                     |     1|     4|     3|Table Scan    |Table Scan    |OBJECT:([master].[dbo].[test] AS [t2])                                                   |[t2].[a], [t2].[b]                               |         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t2].[a], [t2].[b]                    |        |PLAN_ROW|       0|               1.0|
       |--Table Scan(OBJECT:([master].[dbo].[test] AS [t1]))                                                          |     1|     5|     2|Table Scan    |Table Scan    |OBJECT:([master].[dbo].[test] AS [t1])                                                   |[t1].[a], [t1].[b]                               |         1.0|  0.003125|  0.0001581|         9|       0.0032831|[t1].[a], [t1].[b]                    |        |PLAN_ROW|       0|               1.0|

@ranma42
Copy link
Contributor Author

ranma42 commented Jul 12, 2024

An example session showing the performance difference on Sqlite:

SQLite version 3.46.0 2024-05-23 13:25:27
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .timer on
sqlite> create table test (a, b);
Run Time: real 0.001 user 0.000172 sys 0.000000
sqlite> insert into test (a,b) select value & 1, NOT(value & 2) from generate_series(1, 10000);
Run Time: real 0.001 user 0.000942 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a != t2.b;
50000000
Run Time: real 3.297 user 3.366575 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a = NOT(t2.b);
50000000
Run Time: real 1.150 user 1.174196 sys 0.000000
sqlite> insert into test (a,b) select value & 1, NOT(value & 2) from generate_series(1, 10000);
Run Time: real 0.001 user 0.000879 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a != t2.b;
200000000
Run Time: real 13.137 user 13.424913 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a = NOT(t2.b);
200000000
Run Time: real 5.320 user 4.663320 sys 0.000000
sqlite> delete from test;
Run Time: real 0.000 user 0.000198 sys 0.000000
sqlite> insert into test (a,b) select
    CASE random() % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END,
    CASE random() % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END
from generate_series(1, 20000);
Run Time: real 0.004 user 0.003828 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a != t2.b;
968190
Run Time: real 12.874 user 12.323147 sys 0.000000
sqlite> select COUNT(*) from test as t1, test as t2 where t1.a = NOT(t2.b);
968190
Run Time: real 0.030 user 0.031159 sys 0.000000

TL;DR: about 3 times faster for a 50/50 distribution of true/false; much faster for 5/5/90 false/true/null

@ranma42
Copy link
Contributor Author

ranma42 commented Jul 12, 2024

A similar experiment on SqlServer 2022:

set statistics time ON

create table test (a BIT, b BIT);

insert into test (a,b) select value & 1, CAST(value & 2 AS BIT) from generate_series(1, 10000000);

select COUNT_BIG (*) from test as t1, test as t2 where t1.a != t2.b;
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 4027 ms,  elapsed time = 2281 ms.

select COUNT_BIG (*) from test as t1, test as t2 where t1.a = ~(t2.b);
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 3578 ms,  elapsed time = 198 ms.

delete from test;
insert into test (a,b) select
    CASE value % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END,
    CASE (value / 20) % 20 WHEN 0 THEN 0 WHEN 1 THEN 1 END
from generate_series(1, 10000000);

select COUNT_BIG (*) from test as t1, test as t2 where t1.a != t2.b;
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 5176 ms,  elapsed time = 3350 ms.

select COUNT_BIG (*) from test as t1, test as t2 where t1.a = ~(t2.b);
--SQL Server parse and compile time: 
--   CPU time = 0 ms, elapsed time = 0 ms.
--
-- SQL Server Execution Times:
--   CPU time = 3682 ms,  elapsed time = 199 ms.

TL;DR: SqlServer saves about 15% of the CPU time when filtering with = and it apparently can run it in parallel (10x faster wall-time).

@ErikEJ
Copy link
Contributor

ErikEJ commented Jul 12, 2024

I see the same perf pattern on my informal test based on the script above!

@ranma42
Copy link
Contributor Author

ranma42 commented Jul 12, 2024

Ah, I mentioned it in the comment #34166 (comment) but maybe it should also be repeated here: these benchmarks are completely synthetic and do not represent a real-world workload.
They are meant to demonstrate that a = NOT(b) is usually optimized better than <>.
One might expect these optimizations to be automatically handled by strictly-typed DB engines, but apparently that is not the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants