Make ScalarFn array validity lazy when the function defines a validity expression #8336
Merging this PR will degrade performance by 17.42%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | baseline_eq[4, 65536] | 186.4 µs | 238.8 µs | -21.93% |
| ❌ | Simulation | baseline_lt[16, 65536] | 219 µs | 276.4 µs | -20.78% |
| ❌ | Simulation | baseline_lt[4, 65536] | 202.2 µs | 254 µs | -20.39% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 215.3 ns | 244.4 ns | -11.93% |
| ❌ | Simulation | baseline_eq[16, 65536] | 231.3 µs | 261.2 µs | -11.45% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/cool-bardeen-l8jlsy-4-lazy-scalarfn-validity (f9b0a16) with claude/cool-bardeen-l8jlsy-3-definitely-all-invalid (a13f0ab)
Investigated the CodSpeed
The same flavor of noise shows in the rest of the stack: the
I can't acknowledge regressions on the CodSpeed dashboard from this session; that needs someone with dashboard access.
https://claude.ai/code/session_01VPQ7dfZtijfrsjAipwXvEj
Generated by Claude Code
  /// Transforms the expression into one representing the validity of this expression.
  pub fn validity(&self, expr: &Expression) -> VortexResult<Expression> {
-     Ok(self.0.validity(expr)?.unwrap_or_else(|| {
+     Ok(self.validity_opt(expr)?.unwrap_or_else(|| {
do you want to remove the TODO?
d72800b to f4e8d62
5d2c247 to e515a28
9673272 to e9341da
e515a28 to e507aa3
If two Mask::AllTrue or two Mask::AllFalse values are passed, we don't need a bit buffer to check equality.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
e507aa3 to 2d2498f
e9341da to 587baf9
2d2498f to a13f0ab
587baf9 to 6fe8231
## Summary

Adds docs for `StatsRewriteRule` and its functions. Can be considered a follow-up for #8345, but can be merged individually.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
## Summary

**PR 2 of a 4-PR stack** (stacked on #8333) preparing `Validity` for lazy validity arrays.

`mask_eq` previously returned `false` for any mixed-variant pairing without executing, e.g. a `Validity::Array` that resolves to all-true compared against `Validity::AllValid`. With lazy validity arrays, unresolved `Array` variants frequently hold constant masks, making this silently wrong rather than merely conservative.

---------

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
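To make the mixed-variant problem concrete, here is a minimal sketch using a toy model. `ToyValidity`, `mask_eq_by_variant`, and `mask_eq_resolved` are hypothetical names invented for illustration; they are not the Vortex `Validity` type or its `mask_eq` implementation.

```rust
/// Toy stand-in for a validity value. NOT the Vortex `Validity` type.
#[derive(Debug, Clone)]
enum ToyValidity {
    AllValid,
    AllInvalid,
    /// An unresolved per-row mask (true = valid).
    Array(Vec<bool>),
}

impl ToyValidity {
    /// Resolve ("execute") the validity into a concrete mask of `len` rows.
    fn to_mask(&self, len: usize) -> Vec<bool> {
        match self {
            ToyValidity::AllValid => vec![true; len],
            ToyValidity::AllInvalid => vec![false; len],
            ToyValidity::Array(bits) => bits.clone(),
        }
    }
}

/// Variant-only comparison: any mixed-variant pairing is reported as
/// unequal without looking at the data.
fn mask_eq_by_variant(a: &ToyValidity, b: &ToyValidity) -> bool {
    match (a, b) {
        (ToyValidity::AllValid, ToyValidity::AllValid) => true,
        (ToyValidity::AllInvalid, ToyValidity::AllInvalid) => true,
        (ToyValidity::Array(x), ToyValidity::Array(y)) => x == y,
        _ => false,
    }
}

/// Comparison that resolves both sides first, so an `Array` holding a
/// constant mask compares equal to the matching constant variant.
fn mask_eq_resolved(a: &ToyValidity, b: &ToyValidity, len: usize) -> bool {
    a.to_mask(len) == b.to_mask(len)
}
```

In this toy model, `ToyValidity::Array(vec![true; 4])` compared against `ToyValidity::AllValid` is unequal under the variant-only check but equal once both sides are resolved, which is the discrepancy the commit message describes.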
## Summary

Adds import/export from `vortex-json` to Arrow's JSON canonical extension type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
For Array-vs-constant pairings, the min/max statistics of the validity array decide all-valid/all-invalid exactly, so consult them first and only fall back to executing the validity array into a Mask when statistics are unavailable.

Constant variants with opposite masks now short-circuit on length instead of building two masks.

https://claude.ai/code/session_01VPQ7dfZtijfrsjAipwXvEj

Signed-off-by: Claude <noreply@anthropic.com>
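The statistics-first strategy can be sketched roughly as follows. This is a toy model under stated assumptions: `ToyValidityArray`, its `min_max` field, and the `executions` counter are hypothetical and do not mirror how Vortex stores statistics or executes arrays.

```rust
use std::cell::Cell;

/// Toy boolean validity array with optional cached statistics.
/// Hypothetical type for illustration, not a Vortex array.
struct ToyValidityArray {
    bits: Vec<bool>,
    /// Cached (min, max) over `bits`, if known. For booleans, false < true.
    min_max: Option<(bool, bool)>,
    /// Counts how often the array had to be scanned ("executed").
    executions: Cell<usize>,
}

impl ToyValidityArray {
    fn new(bits: Vec<bool>, min_max: Option<(bool, bool)>) -> Self {
        Self { bits, min_max, executions: Cell::new(0) }
    }

    /// Fallback path: scan the data and record that we did so.
    fn scan(&self) -> &[bool] {
        self.executions.set(self.executions.get() + 1);
        &self.bits
    }

    /// min == true means every row is true, so no scan is needed.
    fn all_valid(&self) -> bool {
        match self.min_max {
            Some((min, _max)) => min,
            None => self.scan().iter().all(|&b| b),
        }
    }

    /// max == false means every row is false, so no scan is needed.
    fn all_invalid(&self) -> bool {
        match self.min_max {
            Some((_min, max)) => !max,
            None => self.scan().iter().all(|&b| !b),
        }
    }
}

/// An all-valid mask and an all-invalid mask of the same length are
/// equal only when there are no rows at all.
fn opposite_constants_eq(len: usize) -> bool {
    len == 0
}
```

With statistics present, neither query touches the data; without them, each query performs one scan. The length check covers the opposite-constants case without building either mask.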
https://claude.ai/code/session_01VPQ7dfZtijfrsjAipwXvEj Signed-off-by: Claude <noreply@anthropic.com>
…aths

Replace scattered matches!(.., Validity::AllInvalid) checks with a named helper, symmetric with definitely_no_nulls(). The name makes the conservative semantics explicit: a Validity::Array may still resolve to all-invalid once executed, so a false result means "unknown without compute", not "definitely has valid values". Call sites that assert an exact variant in tests keep the raw matches!.

https://claude.ai/code/session_01VPQ7dfZtijfrsjAipwXvEj

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
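A minimal sketch of the conservative-helper idea, assuming a toy three-variant enum. `ToyValidity` is hypothetical, and the bodies below are a guess at the shape of such helpers rather than the actual Vortex implementations of `definitely_all_invalid()` and `definitely_no_nulls()`.

```rust
/// Toy stand-in for a validity value. NOT the Vortex `Validity` type.
enum ToyValidity {
    AllValid,
    AllInvalid,
    /// An unresolved per-row mask (true = valid).
    Array(Vec<bool>),
}

impl ToyValidity {
    /// True only when the variant alone proves every row is invalid.
    /// A `false` result means "unknown without compute".
    fn definitely_all_invalid(&self) -> bool {
        matches!(self, ToyValidity::AllInvalid)
    }

    /// True only when the variant alone proves there are no nulls.
    /// A `false` result means "unknown without compute".
    fn definitely_no_nulls(&self) -> bool {
        matches!(self, ToyValidity::AllValid)
    }
}
```

Note that an `Array` variant whose mask happens to be entirely `false` answers `false` to both helpers here: neither property can be established from the variant alone.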
The non-test code now uses definitely_all_invalid() and no longer references the Validity type directly; the test module has its own import. https://claude.ai/code/session_01VPQ7dfZtijfrsjAipwXvEj Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…y expression

Previously ValidityVTable<ScalarFn> always eagerly executed the validity expression via the legacy session. Now, when the scalar function provides a validity expression over its inputs, the expression is converted into a lazy ScalarFn array DAG instead: Literal nodes become ConstantArrays, ArrayExpr leaves unwrap to the child arrays they hold, and interior nodes become lazy ScalarFn arrays. Constant results are folded back into AllValid/AllInvalid via child_to_validity.

Functions that do not define a validity expression (e.g. Kleene logic and/or, where validity depends on the computed values) keep the eager path. The erased fallback for these is is_not_null over the expression itself, so a lazy representation would be self-referential: resolving the validity of the inner node spawns another is_not_null DAG, which recurses without ever shrinking (this manifested as a stack overflow in element-wise execution paths).

ScalarFnRef::validity_opt is added to expose whether a function defines its own validity expression.

https://claude.ai/code/session_01VPQ7dfZtijfrsjAipwXvEj

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
6fe8231 to f9b0a16
Summary
PR 4 of a 4-PR stack (stacked on #8335), the payoff: lazy validity for `ScalarFn` arrays.

Previously `ValidityVTable<ScalarFn>` always eagerly executed the validity expression via the legacy session. Now, when the scalar function provides a validity expression over its inputs, the expression is converted into a lazy ScalarFn array DAG instead:

- `Literal` nodes → `ConstantArray`
- `ArrayExpr` leaves → unwrap to the child array they hold
- Interior nodes → lazy `ScalarFn` arrays via `Array::<ScalarFn>::try_new`
- Constant results → folded back into `AllValid`/`AllInvalid` via `child_to_validity`
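
The conversion described above can be sketched with a toy expression tree. Everything here is hypothetical and simplified for illustration: `Expr`, `LazyArray`, `to_lazy`, and `toy_child_to_validity` are invented names, only a single `And` interior node is modeled, and none of this reflects the real Vortex expression or array types.

```rust
use std::rc::Rc;

/// Toy validity expression over a function's inputs.
enum Expr {
    Literal(bool),
    /// Leaf that already holds a child array.
    ArrayLeaf(Rc<Vec<bool>>),
    /// Interior node; only `and` is modeled in this sketch.
    And(Box<Expr>, Box<Expr>),
}

/// Toy lazy array node: nothing is computed when it is built.
enum LazyArray {
    Constant(bool),
    Child(Rc<Vec<bool>>),
    ScalarFn { name: &'static str, children: Vec<LazyArray> },
}

/// Convert the expression into a lazy DAG without executing it.
fn to_lazy(expr: &Expr) -> LazyArray {
    match expr {
        // Literal nodes become constant arrays.
        Expr::Literal(value) => LazyArray::Constant(*value),
        // Array leaves unwrap to the child array they hold (shared, not copied).
        Expr::ArrayLeaf(array) => LazyArray::Child(Rc::clone(array)),
        // Interior nodes become lazy scalar-function nodes over their children.
        Expr::And(lhs, rhs) => LazyArray::ScalarFn {
            name: "and",
            children: vec![to_lazy(lhs), to_lazy(rhs)],
        },
    }
}

/// Toy validity result.
enum ToyValidity {
    AllValid,
    AllInvalid,
    Array(LazyArray),
}

/// Fold constant results back into the constant variants; keep
/// everything else as an unresolved lazy array.
fn toy_child_to_validity(array: LazyArray) -> ToyValidity {
    match array {
        LazyArray::Constant(true) => ToyValidity::AllValid,
        LazyArray::Constant(false) => ToyValidity::AllInvalid,
        other => ToyValidity::Array(other),
    }
}
```

In this sketch, an expression such as `And(ArrayLeaf(..), Literal(true))` turns into a two-child `ScalarFn` node that stays unresolved, while a bare `Literal(true)` folds straight to `AllValid`. The sketch does not simplify `and` with a constant operand; whether the real code does so is not stated in the PR text.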