GaijinEntertainment · borisbat · May 16, 2026 · May 16, 2026
diff --git a/benchmarks/sql/LINQ.md b/benchmarks/sql/LINQ.md
@@ -0,0 +1,119 @@
+# linq_fold — splice-mode macro rewrite
+
+Project notes and progress for the `daslib/linq_fold` macro family, modeled after the dispatch-macro architecture in `modules/dasSQLITE/daslib/sqlite_linq`. Other `.md` files on this subject (operator-specific design notes, post-mortems, etc.) get filed in this directory alongside this one.
+
+## What this is
+
+The current `_fold(...)` macro in linq_boost wraps LINQ pipelines into intermediate `array<T>` per stage and pattern-matches a small set of common shapes (`where+count`, `where+select`, `select+where`, `order+distinct`, `where`, `select`) for ad-hoc fusion. Everything else falls through to a default emitter that builds nested `var pass_N <- pass_(N-1) |> next(...)` — one fresh array per stage, every predicate called via lambda dispatch.
+
+The goal is a planner-driven dispatch macro that emits one fused for-loop with predicates inlined (splice mode), materializing only when an operator genuinely needs random access. Three output modes:
+
+1. **Splice** (default): producer body inlined into consumer's loop. Zero allocation, zero per-element dispatch.
+2. **Array intermediate**: when a downstream op needs random access / multi-pass / length (`sort`, `reverse`, `distinct`, `groupby`). Once we go array, we stay array (iterating an array is faster than iterating an iterator).
+3. **Helper-call fallback**: when splice can't apply at all (escape into `let`, opaque source). Emits calls to named helper functions in `linq_fold`.
+
+Lambda inlining is best-effort: literal `@(x) => expr` at the call site → splice the body; otherwise → call.
+
+See `~/.claude/plans/keen-hopping-balloon.md` for the long-form plan.
+
+## Phase status
+
+| Phase | Description | Status |
+|---|---|---|
+| 0 | Rename `_fold` → `_old_fold` in linq_boost; extract `_fold` and `_old_fold` into new `daslib/linq_fold.das` module; `linq_boost` `require linq_fold public` for re-export | ✅ done |
+| 1 | Benchmark suite: 24 files under `benchmarks/sql/`, each 4-way (m1 `_sql` / m3 plain linq / m3f_old `_old_fold` / m3f `_fold`) at 100K rows; baseline numbers captured | ✅ done |
+| 2 | Splice planner + initial operators (`count`, `sum`, `to_array`, `where` with literal-lambda inlining); pattern tests for "spliced" vs "fell back" | ⏳ next |
+| 3+ | Per-operator splice PRs: `select`, terminal aggregates with early-exit (`first`, `any`, `all`, `min`, `max`, `average`), `take`/`skip`/chained `where`, then buffer-required ops (`distinct`, `sort`, `groupby`, `zip`, `join`) | ⏳ |
+| 4 | Final coverage pass + docs; full 4-way comparison table refresh; parity-test sweep | ⏳ |
+
+## Baselines (100K rows, INTERP mode)
+
+Captured 2026-05-16 on commit `e691abe1b` + foundation PR. ns/op is **per element** (chunk_size = n = 100K), so 30 ns/op means ~3ms for the full operation. Smaller is better. m3f and m3f_old are intentionally identical in this PR — they diverge once Phase 2 lands.
+
+Notation: `—` means the variant is not applicable for this benchmark (operator has no clean form in that mode).
+
+| Benchmark | Shape | m1 (sql) | m3 (linq) | m3f_old | m3f |
+|---|---|---:|---:|---:|---:|
+| count_aggregate | `where → count` | 29 | 29 | 5 | 5 |
+| sum_aggregate | `select → sum` | 29 | 30 | 8 | 8 |
+| sum_where | `where → select → sum` | 33 | 43 | 12 | 12 |
+| min_aggregate | `select → min` | 30 | 38 | 25 | 25 |
+| max_aggregate | `select → max` | 31 | 36 | 23 | 23 |
+| average_aggregate | `select → average` | 30 | 34 | 20 | 20 |
+| first_match | `where → first` | 0\* | 28 | 15 | 15 |
+| any_match | `where → first_opt`/`any` | 0\* | 0\* | 0\* | 0\* |
+| all_match | `count(where ¬p)==0` / `all` | 27 | 20 | 24 | 25 |
+| to_array_filter | `where → select → to_array` | 70 | 43 | 11 | 11 |
+| take_count | `take → to_array` | 3 | 0\* | 0\* | 0\* |
+| skip_take | `skip → take → to_array` | 0\* | 16 | 23 | 23 |
+| distinct_count | `select → distinct → to_array` | 41 | 43 | 33 | 33 |
+| sort_first | `order_by → first` | 37 | 2170 | 2206 | 2238 |
+| sort_take | `order_by → take` | 38 | 2188 | 2247 | 2269 |
+| groupby_count | `group_by → select(_, length)` | 140 | 70 | 76 | 76 |
+| groupby_sum | `group_by → select(_, sum)` | 172 | 101 | 107 | 107 |
+| chained_where | `where → where → count` | 36 | 45 | 17 | 17 |
+| zip_dot_product | `zip → select → sum` | — | 53 | 37 | 37 |
+| join_count | `join → count` | —\*\* | 116 | 121 | 122 |
+| count_aggregate (existing) | `where → count` | 29 | 29 | 5 | 5 |
+| select_where (existing) | `where → to_array` | 7 | 50 | 12 | 12 |
+| select_where_order_take (existing) | `where → order_by → take` | 36 | 1024 | 1007 | 1014 |
+| indexed_lookup (existing) | `where id==k → count` | 1460\*\*\* | 2003299 | 336129 | 328207 |
+
+\* Sub-nanosecond per element — early-exit operation hits answer in O(1) regardless of N; per-element timing collapses to 0/near-0 noise.
+
+\*\* `_sql` join requires the inner `select_from` to bind the db handle inside the `_sql` analyzer; that wiring isn't exposed for direct authoring without more sqlite_linq plumbing. Foundation PR omits m1 here; revisit if/when needed.
+
+\*\*\* PK lookup vs linear scan — inverse-asymmetry benchmark. SQLite's b-tree index wins by ~5 orders of magnitude over the array linear scan. The interesting comparison is m3 vs m3f_old: the existing fold's `where+count` fusion already gives ~6× speedup.
+
+### Reading the table
+
+- **m1 vs m3** shows the SQLite-vs-in-memory-LINQ cost gap. SQL wins on `indexed_lookup` (b-tree) and on sorted-take patterns (engine partial-sort + LIMIT). Arrays win on raw aggregates where the SQL overhead exceeds the in-memory work.
+- **m3 vs m3f_old** shows what the *current* `_fold` macro already achieves. Big wins on the patterns it explicitly recognizes (`where+count` 6×, `where+select+to_array` ~4×, `chained_where+count` 2.6×). Negligible difference where it falls through to the default emitter.
+- **m3f vs m3f_old** is the target of Phase 2+. Currently identical by construction. Each PR in the splice series adds a splice path for one operator family and updates this table with the new ratio.
+
+## Operator-coverage checklist (parity tests)
+
+The 24 benchmarks above cover the most common shapes. The end-game target is one benchmark per `_fold`-applicable scenario in the broader `tests/linq/` operator suite. Tracking the long-tail coverage below; PRs that add splice support for new operators should add a benchmark here if not already present.
+
+| Source test file | Operator group | Covered by benchmark | Status |
+|---|---|---|---|
+| `test_linq.das` | comprehension basics | count_aggregate, sum_aggregate | ✅ |
+| `test_linq_aggregation.das` | count/sum/min/max/avg/aggregate | count/sum/min/max/average_aggregate, sum_where | ✅ core; `aggregate(seed, fn)` ⏳ |
+| `test_linq_querying.das` | any/all/contains | any_match, all_match | ✅ core; `contains` ⏳ |
+| `test_linq_transform.das` | select/select_many/zip | to_array_filter, zip_dot_product | ✅ select/zip; `select_many` ⏳ |
+| `test_linq_sorting.das` | order/order_by/reverse | sort_first, sort_take, select_where_order_take | ✅ ascending; `order_descending` + `reverse` ⏳ |
+| `test_linq_group_by.das` | group_by/group_by_lazy/having | groupby_count, groupby_sum | ✅ basic; `having_` ⏳ |
+| `test_linq_join.das` | join/left_join/right_join/full_outer/cross | join_count | ✅ inner; outer joins + cross ⏳ |
+| `test_linq_partition.das` | take/skip/take_while/skip_while/chunk | take_count, skip_take | ✅ take/skip; `_while` + `chunk` ⏳ |
+| `test_linq_set.das` | distinct/union/except/intersect/unique | distinct_count | ✅ distinct; set ops ⏳ |
+| `test_linq_element.das` | first/last/single/element_at + _or_default | first_match | ✅ first; last/single/element_at ⏳ |
+| `test_linq_concat.das` | concat/prepend/append | — | ⏳ |
+| `test_linq_generation.das` | range/repeat/etc. | — | ⏳ |
+| `test_linq_bugs.das` | regression cases | — | ⏳ as bugs surface |
+
+## How to run
+
+```sh
+# All benchmarks at 100K (the suite default)
+./bin/daslang dastest/dastest.das -- --bench --test benchmarks/sql --test-names none
+
+# One benchmark
+./bin/daslang dastest/dastest.das -- --bench --test benchmarks/sql/count_aggregate.das --test-names none
+
+# Filter by [benchmark] function-name prefix
+./bin/daslang dastest/dastest.das -- --bench --bench-names sum_ --test benchmarks/sql --test-names none
+```
+
+dastest reports `ns/op` in INTERP mode by default. To bump dataset size as the splice work progresses (the interpreter is the bottleneck; 100K is a fast-iteration target, larger N gives more accurate asymptotics), change the `n` argument in each file's `[benchmark]` entries. AOT/JIT modes will be added when the splice path lands and benefits from compilation.
+
+## Design decisions
+
+**`_old_fold` lives alongside `_fold` in `linq_fold`, not in `linq_boost`.** Both macros share the entire dispatch infrastructure (`linqCalls`, `g_foldSeq`, `fold_*`, `flatten_linq`, `fold_linq_default`). Keeping them in one module avoids duplication; the only difference today is the macro-name string passed into `fold_linq_default`'s recursive sub-fold call.
+
+**Recursive macro-name is parameterized.** `fold_linq_default(expr, recursiveMacroName)` — `_fold` passes `"_fold"`, `_old_fold` passes `"_old_fold"`. This keeps the frozen baseline truly frozen once `_fold` diverges in Phase 2+: when `_fold` starts emitting splice loops, `_old_fold` keeps emitting the historical comprehension/invoke shape because its recursive sub-folds still target `_old_fold`.
+
+**100K rows.** daslang is interpreter-first; 100K gives sub-second-per-variant benchmark turnaround and clearly shows the asymmetries we care about. Bump later if AOT/JIT numbers warrant.
+
+**`PERF009` suppression in `fold_linq_default`.** The macro's `var pass_N = call` + later `return <- pass_N` pattern triggers PERF009 on single-pass chains (e.g. `take_count`). Rewriting to direct `return <- call` would change `_old_fold`'s baseline; we suppress inline at the qmacro_expr emission site and document why.
+
+**Benchmark variants where SQL has no clean form.** `zip` (not a relational op), `_all(pred)` (no direct `_all` chain terminal in sqlite_linq), `join` with inner-select-from (wiring not exposed), `distinct |> count` (no `COUNT(DISTINCT col)` yet), `take/skip` before aggregate (LIMIT/OFFSET semantics conflict with aggregate-collapse). We either reformulate to a SQL-friendly shape (`count(where ¬p)` for all_match), omit the m1 column (zip, join), or terminate the chain in `to_array` instead of an aggregate (take/skip/distinct).
diff --git a/benchmarks/sql/_common.das b/benchmarks/sql/_common.das
@@ -8,28 +8,70 @@ require sqlite/sqlite_linq public
 require dastest/testing_boost public
 require daslib/fio public
 
+let public BRAND_COUNT = 5
+let public DEALER_COUNT = 100
+
 [sql_table(name = "Cars")]
 struct Car {
     @sql_primary_key id : int
     name : string
     price : int
+    brand : int
+    year : int
+    dealer_id : int
+}
+
+[sql_table(name = "Dealers")]
+struct Dealer {
+    @sql_primary_key id : int
+    name : string
 }
 
 def public fixture_db(db : SqlRunner; n : int) {
     db |> create_table(type<Car>)
-    var rows : array<Car>
-    rows |> resize(n)
+    db |> create_table(type<Dealer>)
+    var cars : array<Car>
+    cars |> resize(n)
     for (i in range(n)) {
-        rows[i] = Car(id = i + 1, name = "Car{i}", price = (i * 37) % 1000)
+        cars[i] = Car(
+            id = i + 1,
+            name = "Car{i}",
+            price = (i * 37) % 1000,
+            brand = i % BRAND_COUNT,
+            year = 2010 + (i * 7) % 16,
+            dealer_id = (i % DEALER_COUNT) + 1
+        )
     }
-    db |> insert(rows)
+    db |> insert(cars)
+    var dealers : array<Dealer>
+    dealers |> resize(DEALER_COUNT)
+    for (i in range(DEALER_COUNT)) {
+        dealers[i] = Dealer(id = i + 1, name = "Dealer{i}")
+    }
+    db |> insert(dealers)
 }
 
 def public fixture_array(n : int) : array<Car> {
     var arr : array<Car>
     arr |> resize(n)
     for (i in range(n)) {
-        arr[i] = Car(id = i + 1, name = "Car{i}", price = (i * 37) % 1000)
+        arr[i] = Car(
+            id = i + 1,
+            name = "Car{i}",
+            price = (i * 37) % 1000,
+            brand = i % BRAND_COUNT,
+            year = 2010 + (i * 7) % 16,
+            dealer_id = (i % DEALER_COUNT) + 1
+        )
+    }
+    return <- arr
+}
+
+def public fixture_dealers_array() : array<Dealer> {
+    var arr : array<Dealer>
+    arr |> resize(DEALER_COUNT)
+    for (i in range(DEALER_COUNT)) {
+        arr[i] = Dealer(id = i + 1, name = "Dealer{i}")
     }
     return <- arr
 }
diff --git a/benchmarks/sql/all_match.das b/benchmarks/sql/all_match.das
@@ -0,0 +1,72 @@
+options gen2
+options persistent_heap
+
+require _common public
+
+// Predicate that is ALWAYS true (price < 9999), so the full scan runs.
+// SQL equivalent of `all(pred)` is `count(where NOT pred) == 0` — no direct
+// `_all` terminal in sqlite_linq, so we phrase it as a counter-example count.
+// Array variants use `_all(pred)` which short-circuits on first failing element;
+// always-true predicate forces full-array traversal (fair upper-bound timing).
+
+def run_m1(b : B?; n : int) {
+    with_sqlite(":memory:") $(db) {
+        fixture_db(db, n)
+        b |> run("m1_sql/{n}", n) {
+            let bad = _sql(db |> select_from(type<Car>) |> _where(_.price >= 9999) |> count())
+            if (bad != 0) {
+                b->failNow()
+            }
+        }
+    }
+}
+
+def run_m3(b : B?; n : int) {
+    let arr <- fixture_array(n)
+    b |> run("m3_array/{n}", n) {
+        let yes = arr |> _all(_.price < 9999)
+        if (!yes) {
+            b->failNow()
+        }
+    }
+}
+
+def run_m3f_old(b : B?; n : int) {
+    let arr <- fixture_array(n)
+    b |> run("m3f_old_array_fold/{n}", n) {
+        let yes = _old_fold(each(arr)._all(_.price < 9999))
+        if (!yes) {
+            b->failNow()
+        }
+    }
+}
+
+def run_m3f(b : B?; n : int) {
+    let arr <- fixture_array(n)
+    b |> run("m3f_array_fold/{n}", n) {
+        let yes = _fold(each(arr)._all(_.price < 9999))
+        if (!yes) {
+            b->failNow()
+        }
+    }
+}
+
+[benchmark]
+def all_match_m1(b : B?) {
+    run_m1(b, 100000)
+}
+
+[benchmark]
+def all_match_m3(b : B?) {
+    run_m3(b, 100000)
+}
+
+[benchmark]
+def all_match_m3f_old(b : B?) {
+    run_m3f_old(b, 100000)
+}
+
+[benchmark]
+def all_match_m3f(b : B?) {
+    run_m3f(b, 100000)
+}
diff --git a/benchmarks/sql/any_match.das b/benchmarks/sql/any_match.das
@@ -0,0 +1,72 @@
+options gen2
+options persistent_heap
+
+require _common public
+
+let THRESHOLD = 500
+
+// SQL: SELECT ... LIMIT 1 — engine stops at first hit. Use _first_opt then is_some
+// since sqlite_linq doesn't expose `_any()` as a chain terminal.
+// Array variants use `_any(pred)` which short-circuits at the first match.
+
+def run_m1(b : B?; n : int) {
+    with_sqlite(":memory:") $(db) {
+        fixture_db(db, n)
+        b |> run("m1_sql/{n}", n) {
+            let opt = _sql(db |> select_from(type<Car>) |> _where(_.price > THRESHOLD) |> _first_opt())
+            if (!is_some(opt)) {
+                b->failNow()
+            }
+        }
+    }
+}
+
+def run_m3(b : B?; n : int) {
+    let arr <- fixture_array(n)
+    b |> run("m3_array/{n}", n) {
+        let yes = arr |> _any(_.price > THRESHOLD)
+        if (!yes) {
+            b->failNow()
+        }
+    }
+}
+
+def run_m3f_old(b : B?; n : int) {
+    let arr <- fixture_array(n)
+    b |> run("m3f_old_array_fold/{n}", n) {
+        let yes = _old_fold(each(arr)._any(_.price > THRESHOLD))
+        if (!yes) {
+            b->failNow()
+        }
+    }
+}
+
+def run_m3f(b : B?; n : int) {
+    let arr <- fixture_array(n)
+    b |> run("m3f_array_fold/{n}", n) {
+        let yes = _fold(each(arr)._any(_.price > THRESHOLD))
+        if (!yes) {
+            b->failNow()
+        }
+    }
+}
+
+[benchmark]
+def any_match_m1(b : B?) {
+    run_m1(b, 100000)
+}
+
+[benchmark]
+def any_match_m3(b : B?) {
+    run_m3(b, 100000)
+}
+
+[benchmark]
+def any_match_m3f_old(b : B?) {
+    run_m3f_old(b, 100000)
+}
+
+[benchmark]
+def any_match_m3f(b : B?) {
+    run_m3f(b, 100000)
+}