Merged
119 changes: 119 additions & 0 deletions benchmarks/sql/LINQ.md
@@ -0,0 +1,119 @@
# linq_fold — splice-mode macro rewrite

Project notes and progress for the `daslib/linq_fold` macro family, modeled after the dispatch-macro architecture in `modules/dasSQLITE/daslib/sqlite_linq`. Other `.md` files on this subject (operator-specific design notes, post-mortems, etc.) get filed in this directory alongside this one.

## What this is

The current `_fold(...)` macro in linq_boost materializes an intermediate `array<T>` for every pipeline stage and pattern-matches a small set of common shapes (`where+count`, `where+select`, `select+where`, `order+distinct`, `where`, `select`) for ad-hoc fusion. Everything else falls through to a default emitter that builds nested `var pass_N <- pass_(N-1) |> next(...)`: one fresh array per stage, with every predicate called via lambda dispatch.

The goal is a planner-driven dispatch macro that emits one fused for-loop with predicates inlined (splice mode), materializing only when an operator genuinely needs random access. Three output modes:

1. **Splice** (default): producer body inlined into consumer's loop. Zero allocation, zero per-element dispatch.
2. **Array intermediate**: when a downstream op needs random access / multi-pass / length (`sort`, `reverse`, `distinct`, `groupby`). Once we go array, we stay array (iterating an array is faster than iterating an iterator).
3. **Helper-call fallback**: when splice can't apply at all (escape into `let`, opaque source). Emits calls to named helper functions in `linq_fold`.

Lambda inlining is best-effort: literal `@(x) => expr` at the call site → splice the body; otherwise → call.
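
To make the contrast between the default emitter and splice mode concrete, here is a minimal language-neutral sketch in Python. It is not daslang and not the macro's actual output; the names `staged_sum` and `spliced_sum` and the sample predicate and projection are made up for illustration. The shape is `where → select → sum`.

```python
def staged_sum(prices):
    """Default-emitter shape: one fresh list per stage, callables per element."""
    pred = lambda p: p > 500      # predicate reached through a callable
    proj = lambda p: p * 2        # projection reached through a callable
    pass_0 = [p for p in prices if pred(p)]   # intermediate array 1
    pass_1 = [proj(p) for p in pass_0]        # intermediate array 2
    return sum(pass_1)


def spliced_sum(prices):
    """Splice shape: a single loop, bodies inlined, no intermediate arrays."""
    total = 0
    for p in prices:
        if p > 500:          # predicate body inlined
            total += p * 2   # projection body inlined into the accumulator
    return total


# Same price formula as the benchmark fixture, smaller n.
prices = [(i * 37) % 1000 for i in range(1000)]
assert staged_sum(prices) == spliced_sum(prices)
```

Both functions return the same value; the second one allocates nothing and makes no per-element calls, which is the property splice mode is after.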

See `~/.claude/plans/keen-hopping-balloon.md` for the long-form plan.

## Phase status

| Phase | Description | Status |
|---|---|---|
| 0 | Rename `_fold` → `_old_fold` in linq_boost; extract `_fold` and `_old_fold` into new `daslib/linq_fold.das` module; `linq_boost` `require linq_fold public` for re-export | ✅ done |
| 1 | Benchmark suite: 24 files under `benchmarks/sql/`, each 4-way (m1 `_sql` / m3 plain linq / m3f_old `_old_fold` / m3f `_fold`) at 100K rows; baseline numbers captured | ✅ done |
| 2 | Splice planner + initial operators (`count`, `sum`, `to_array`, `where` with literal-lambda inlining); pattern tests for "spliced" vs "fell back" | ⏳ next |
| 3+ | Per-operator splice PRs: `select`, terminal aggregates with early-exit (`first`, `any`, `all`, `min`, `max`, `average`), `take`/`skip`/chained `where`, then buffer-required ops (`distinct`, `sort`, `groupby`, `zip`, `join`) | ⏳ |
| 4 | Final coverage pass + docs; full 4-way comparison table refresh; parity-test sweep | ⏳ |

## Baselines (100K rows, INTERP mode)

Captured 2026-05-16 on commit `e691abe1b` + foundation PR. ns/op is **per element** (chunk_size = n = 100K), so 30 ns/op means ~3 ms for the full operation. Smaller is better. m3f and m3f_old are intentionally identical in this PR, so any gap between those two columns (e.g. 24 vs 25 on all_match) is measurement noise; they diverge once Phase 2 lands.
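
As a quick check of the per-element convention, the conversion to wall time for one full pass can be sketched as follows (Python, with a hypothetical helper name `total_ms`; the default `n` matches the 100K rows used here):

```python
def total_ms(ns_per_element: float, n: int = 100_000) -> float:
    """Convert a per-element ns/op figure into milliseconds for one full pass."""
    return ns_per_element * n / 1_000_000


print(total_ms(30))    # 3.0   (a typical aggregate row)
print(total_ms(2170))  # 217.0 (sort_first, m3 column)
```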

Notation: `—` means the variant is not applicable for this benchmark (operator has no clean form in that mode).

| Benchmark | Shape | m1 (sql) | m3 (linq) | m3f_old | m3f |
|---|---|---:|---:|---:|---:|
| count_aggregate | `where → count` | 29 | 29 | 5 | 5 |
| sum_aggregate | `select → sum` | 29 | 30 | 8 | 8 |
| sum_where | `where → select → sum` | 33 | 43 | 12 | 12 |
| min_aggregate | `select → min` | 30 | 38 | 25 | 25 |
| max_aggregate | `select → max` | 31 | 36 | 23 | 23 |
| average_aggregate | `select → average` | 30 | 34 | 20 | 20 |
| first_match | `where → first` | 0\* | 28 | 15 | 15 |
| any_match | `where → first_opt`/`any` | 0\* | 0\* | 0\* | 0\* |
| all_match | `count(where ¬p)==0` / `all` | 27 | 20 | 24 | 25 |
| to_array_filter | `where → select → to_array` | 70 | 43 | 11 | 11 |
| take_count | `take → to_array` | 3 | 0\* | 0\* | 0\* |
| skip_take | `skip → take → to_array` | 0\* | 16 | 23 | 23 |
| distinct_count | `select → distinct → to_array` | 41 | 43 | 33 | 33 |
| sort_first | `order_by → first` | 37 | 2170 | 2206 | 2238 |
| sort_take | `order_by → take` | 38 | 2188 | 2247 | 2269 |
| groupby_count | `group_by → select(_, length)` | 140 | 70 | 76 | 76 |
| groupby_sum | `group_by → select(_, sum)` | 172 | 101 | 107 | 107 |
| chained_where | `where → where → count` | 36 | 45 | 17 | 17 |
| zip_dot_product | `zip → select → sum` | — | 53 | 37 | 37 |
| join_count | `join → count` | —\*\* | 116 | 121 | 122 |
| count_aggregate (existing) | `where → count` | 29 | 29 | 5 | 5 |
| select_where (existing) | `where → to_array` | 7 | 50 | 12 | 12 |
| select_where_order_take (existing) | `where → order_by → take` | 36 | 1024 | 1007 | 1014 |
| indexed_lookup (existing) | `where id==k → count` | 1460\*\*\* | 2003299 | 336129 | 328207 |

\* Sub-nanosecond per element — early-exit operation hits answer in O(1) regardless of N; per-element timing collapses to 0/near-0 noise.

\*\* `_sql` join requires the inner `select_from` to bind the db handle inside the `_sql` analyzer; that wiring isn't exposed for direct authoring without more sqlite_linq plumbing. Foundation PR omits m1 here; revisit if/when needed.

\*\*\* PK lookup vs linear scan: an inverse-asymmetry benchmark. SQLite's b-tree index wins by ~3 orders of magnitude over the plain-linq array scan (1460 vs 2003299 ns/op, roughly 1400×). The interesting comparison is m3 vs m3f_old: the existing fold's `where+count` fusion already gives ~6× speedup.
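
For the `0*` entries, the following Python sketch replays the `any_match` case using the fixture's price formula `(i * 37) % 1000` and `THRESHOLD = 500`. The helper name `first_match_index` is hypothetical, and the loop only mirrors what a short-circuiting `any` does; it is not the library code.

```python
def first_match_index(n, threshold):
    """Return the index of the first row whose price exceeds threshold, else None."""
    for i in range(n):
        if (i * 37) % 1000 > threshold:
            return i          # early exit: the remaining rows are never visited
    return None


idx = first_match_index(100_000, 500)
print(idx)  # 14
```

Only 15 of the 100K rows are visited, so dividing the elapsed time by n rounds to zero.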

### Reading the table

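- **m1 vs m3** shows the SQLite-vs-in-memory-LINQ cost gap. SQL wins on `indexed_lookup` (b-tree) and on sorted-take patterns (engine partial-sort + LIMIT). Plain linq wins where the SQL overhead exceeds the in-memory work (`to_array_filter` 43 vs 70, `groupby_count` 70 vs 140); on the scalar aggregates (`count`, `sum`, `min`, `max`, `average`) SQL is level or slightly ahead.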
- **m3 vs m3f_old** shows what the *current* `_fold` macro already achieves. Big wins on the patterns it explicitly recognizes (`where+count` 6×, `where+select+to_array` ~4×, `chained_where+count` 2.6×). Negligible difference where it falls through to the default emitter.
- **m3f vs m3f_old** is the target of Phase 2+. Currently identical by construction. Each PR in the splice series adds a splice path for one operator family and updates this table with the new ratio.
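
The ratios quoted in this list can be recomputed from the baseline table. A small Python sketch (the `rows` dictionary and the `speedup` helper are names made up for this note; the numbers are copied from the table):

```python
# ns/op values from the baseline table: m1 = sql, m3 = plain linq, m3f_old = existing fold.
rows = {
    "count_aggregate": {"m1": 29, "m3": 29, "m3f_old": 5},
    "to_array_filter": {"m1": 70, "m3": 43, "m3f_old": 11},
    "chained_where": {"m1": 36, "m3": 45, "m3f_old": 17},
    "indexed_lookup": {"m1": 1460, "m3": 2003299, "m3f_old": 336129},
}


def speedup(slow, fast):
    """How many times faster `fast` is than `slow` (both in ns/op)."""
    return slow / fast


for name, r in rows.items():
    print(f"{name}: m3 / m3f_old = {speedup(r['m3'], r['m3f_old']):.1f}x")

# SQL index vs plain-linq scan on the PK lookup.
print(f"indexed_lookup: m3 / m1 = {speedup(2003299, 1460):.0f}x")
```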

## Operator-coverage checklist (parity tests)

The 24 benchmarks above cover the most common shapes. The end-game target is one benchmark per `_fold`-applicable scenario in the broader `tests/linq/` operator suite. Tracking the long-tail coverage below; PRs that add splice support for new operators should add a benchmark here if not already present.

| Source test file | Operator group | Covered by benchmark | Status |
|---|---|---|---|
| `test_linq.das` | comprehension basics | count_aggregate, sum_aggregate | ✅ |
| `test_linq_aggregation.das` | count/sum/min/max/avg/aggregate | count/sum/min/max/average_aggregate, sum_where | ✅ core; `aggregate(seed, fn)` ⏳ |
| `test_linq_querying.das` | any/all/contains | any_match, all_match | ✅ core; `contains` ⏳ |
| `test_linq_transform.das` | select/select_many/zip | to_array_filter, zip_dot_product | ✅ select/zip; `select_many` ⏳ |
| `test_linq_sorting.das` | order/order_by/reverse | sort_first, sort_take, select_where_order_take | ✅ ascending; `order_descending` + `reverse` ⏳ |
| `test_linq_group_by.das` | group_by/group_by_lazy/having | groupby_count, groupby_sum | ✅ basic; `having_` ⏳ |
| `test_linq_join.das` | join/left_join/right_join/full_outer/cross | join_count | ✅ inner; outer joins + cross ⏳ |
| `test_linq_partition.das` | take/skip/take_while/skip_while/chunk | take_count, skip_take | ✅ take/skip; `_while` + `chunk` ⏳ |
| `test_linq_set.das` | distinct/union/except/intersect/unique | distinct_count | ✅ distinct; set ops ⏳ |
| `test_linq_element.das` | first/last/single/element_at + _or_default | first_match | ✅ first; last/single/element_at ⏳ |
| `test_linq_concat.das` | concat/prepend/append | — | ⏳ |
| `test_linq_generation.das` | range/repeat/etc. | — | ⏳ |
| `test_linq_bugs.das` | regression cases | — | ⏳ as bugs surface |

## How to run

```sh
# All benchmarks at 100K (the suite default)
./bin/daslang dastest/dastest.das -- --bench --test benchmarks/sql --test-names none

# One benchmark
./bin/daslang dastest/dastest.das -- --bench --test benchmarks/sql/count_aggregate.das --test-names none

# Filter by [benchmark] function-name prefix
./bin/daslang dastest/dastest.das -- --bench --bench-names sum_ --test benchmarks/sql --test-names none
```

dastest reports `ns/op` in INTERP mode by default. To bump dataset size as the splice work progresses (the interpreter is the bottleneck; 100K is a fast-iteration target, larger N gives more accurate asymptotics), change the `n` argument in each file's `[benchmark]` entries. AOT/JIT modes will be added when the splice path lands and benefits from compilation.

## Design decisions

**`_old_fold` lives alongside `_fold` in `linq_fold`, not in `linq_boost`.** Both macros share the entire dispatch infrastructure (`linqCalls`, `g_foldSeq`, `fold_*`, `flatten_linq`, `fold_linq_default`). Keeping them in one module avoids duplication; the only difference today is the macro-name string passed into `fold_linq_default`'s recursive sub-fold call.

**Recursive macro-name is parameterized.** `fold_linq_default(expr, recursiveMacroName)` — `_fold` passes `"_fold"`, `_old_fold` passes `"_old_fold"`. This keeps the frozen baseline truly frozen once `_fold` diverges in Phase 2+: when `_fold` starts emitting splice loops, `_old_fold` keeps emitting the historical comprehension/invoke shape because its recursive sub-folds still target `_old_fold`.
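
The idea can be sketched in a few lines of Python. This is a toy model, not the daslang macro: `fold_default`, the list-of-stages representation, and the emitted text are all assumptions made for illustration. The point is only that a nested pipeline is re-wrapped in whichever macro name the caller passed in.

```python
def fold_default(stages, recursive_macro_name):
    """Emit pseudo-code for a pipeline; nested pipelines re-enter the named macro."""
    out = []
    for stage in stages:
        if isinstance(stage, list):
            # Nested sub-pipeline: recurse, then wrap in the caller's macro name.
            inner = fold_default(stage, recursive_macro_name)
            out.append(f"{recursive_macro_name}({inner})")
        else:
            out.append(stage)
    return " |> ".join(out)


pipeline = ["each(arr)", ["where", "count"]]
print(fold_default(pipeline, "_fold"))      # each(arr) |> _fold(where |> count)
print(fold_default(pipeline, "_old_fold"))  # each(arr) |> _old_fold(where |> count)
```

Because the name travels with the recursion, changing what `_fold` emits cannot leak into sub-folds produced on behalf of `_old_fold`.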

**100K rows.** daslang is interpreter-first; 100K gives sub-second-per-variant benchmark turnaround and clearly shows the asymmetries we care about. Bump later if AOT/JIT numbers warrant.

**`PERF009` suppression in `fold_linq_default`.** The macro's `var pass_N = call` + later `return <- pass_N` pattern triggers PERF009 on single-pass chains (e.g. `take_count`). Rewriting to direct `return <- call` would change `_old_fold`'s baseline; we suppress inline at the qmacro_expr emission site and document why.

**Benchmark variants where SQL has no clean form.** `zip` (not a relational op), `_all(pred)` (no direct `_all` chain terminal in sqlite_linq), `join` with inner-select-from (wiring not exposed), `distinct |> count` (no `COUNT(DISTINCT col)` yet), `take/skip` before aggregate (LIMIT/OFFSET semantics conflict with aggregate-collapse). We either reformulate to a SQL-friendly shape (`count(where ¬p)` for all_match), omit the m1 column (zip, join), or terminate the chain in `to_array` instead of an aggregate (take/skip/distinct).
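
The `all_match` reformulation rests on the identity `all(pred)` ⇔ `count(where ¬pred) == 0`. A minimal check using Python's standard `sqlite3` module (not sqlite_linq; the table layout is a cut-down stand-in for the `Cars` fixture and `all_via_count` is a hypothetical helper):

```python
import sqlite3


def all_via_count(conn, bound):
    """all(price < bound), phrased as: no row violates the predicate."""
    bad = conn.execute(
        "SELECT COUNT(*) FROM Cars WHERE price >= ?", (bound,)
    ).fetchone()[0]
    return bad == 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cars (id INTEGER PRIMARY KEY, price INTEGER)")
prices = [(i * 37) % 1000 for i in range(1000)]   # fixture price formula
conn.executemany(
    "INSERT INTO Cars (id, price) VALUES (?, ?)",
    [(i + 1, p) for i, p in enumerate(prices)],
)

# Always-true predicate (as in all_match) and one that fails.
assert all_via_count(conn, 9999) == all(p < 9999 for p in prices)
assert all_via_count(conn, 500) == all(p < 500 for p in prices)
```

The identity holds here because `price` is never NULL; with nullable columns the negated predicate would need extra care.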
52 changes: 47 additions & 5 deletions benchmarks/sql/_common.das
@@ -8,28 +8,70 @@ require sqlite/sqlite_linq public
 require dastest/testing_boost public
 require daslib/fio public

+let public BRAND_COUNT = 5
+let public DEALER_COUNT = 100
+
 [sql_table(name = "Cars")]
 struct Car {
     @sql_primary_key id : int
     name : string
     price : int
+    brand : int
+    year : int
+    dealer_id : int
 }

+[sql_table(name = "Dealers")]
+struct Dealer {
+    @sql_primary_key id : int
+    name : string
+}
+
 def public fixture_db(db : SqlRunner; n : int) {
     db |> create_table(type<Car>)
-    var rows : array<Car>
-    rows |> resize(n)
+    db |> create_table(type<Dealer>)
+    var cars : array<Car>
+    cars |> resize(n)
     for (i in range(n)) {
-        rows[i] = Car(id = i + 1, name = "Car{i}", price = (i * 37) % 1000)
+        cars[i] = Car(
+            id = i + 1,
+            name = "Car{i}",
+            price = (i * 37) % 1000,
+            brand = i % BRAND_COUNT,
+            year = 2010 + (i * 7) % 16,
+            dealer_id = (i % DEALER_COUNT) + 1
+        )
     }
-    db |> insert(rows)
+    db |> insert(cars)
+    var dealers : array<Dealer>
+    dealers |> resize(DEALER_COUNT)
+    for (i in range(DEALER_COUNT)) {
+        dealers[i] = Dealer(id = i + 1, name = "Dealer{i}")
+    }
+    db |> insert(dealers)
 }

 def public fixture_array(n : int) : array<Car> {
     var arr : array<Car>
     arr |> resize(n)
     for (i in range(n)) {
-        arr[i] = Car(id = i + 1, name = "Car{i}", price = (i * 37) % 1000)
+        arr[i] = Car(
+            id = i + 1,
+            name = "Car{i}",
+            price = (i * 37) % 1000,
+            brand = i % BRAND_COUNT,
+            year = 2010 + (i * 7) % 16,
+            dealer_id = (i % DEALER_COUNT) + 1
+        )
     }
     return <- arr
 }
+
+def public fixture_dealers_array() : array<Dealer> {
+    var arr : array<Dealer>
+    arr |> resize(DEALER_COUNT)
+    for (i in range(DEALER_COUNT)) {
+        arr[i] = Dealer(id = i + 1, name = "Dealer{i}")
+    }
+    return <- arr
+}
72 changes: 72 additions & 0 deletions benchmarks/sql/all_match.das
@@ -0,0 +1,72 @@
options gen2
options persistent_heap

require _common public

// Predicate that is ALWAYS true (price < 9999), so the full scan runs.
// SQL equivalent of `all(pred)` is `count(where NOT pred) == 0` — no direct
// `_all` terminal in sqlite_linq, so we phrase it as a counter-example count.
// Array variants use `_all(pred)` which short-circuits on first failing element;
// always-true predicate forces full-array traversal (fair upper-bound timing).

def run_m1(b : B?; n : int) {
    with_sqlite(":memory:") $(db) {
        fixture_db(db, n)
        b |> run("m1_sql/{n}", n) {
            let bad = _sql(db |> select_from(type<Car>) |> _where(_.price >= 9999) |> count())
            if (bad != 0) {
                b->failNow()
            }
        }
    }
}

def run_m3(b : B?; n : int) {
    let arr <- fixture_array(n)
    b |> run("m3_array/{n}", n) {
        let yes = arr |> _all(_.price < 9999)
        if (!yes) {
            b->failNow()
        }
    }
}

def run_m3f_old(b : B?; n : int) {
    let arr <- fixture_array(n)
    b |> run("m3f_old_array_fold/{n}", n) {
        let yes = _old_fold(each(arr)._all(_.price < 9999))
        if (!yes) {
            b->failNow()
        }
    }
}

def run_m3f(b : B?; n : int) {
    let arr <- fixture_array(n)
    b |> run("m3f_array_fold/{n}", n) {
        let yes = _fold(each(arr)._all(_.price < 9999))
        if (!yes) {
            b->failNow()
        }
    }
}

[benchmark]
def all_match_m1(b : B?) {
    run_m1(b, 100000)
}

[benchmark]
def all_match_m3(b : B?) {
    run_m3(b, 100000)
}

[benchmark]
def all_match_m3f_old(b : B?) {
    run_m3f_old(b, 100000)
}

[benchmark]
def all_match_m3f(b : B?) {
    run_m3f(b, 100000)
}
72 changes: 72 additions & 0 deletions benchmarks/sql/any_match.das
@@ -0,0 +1,72 @@
options gen2
options persistent_heap

require _common public

let THRESHOLD = 500

// SQL: SELECT ... LIMIT 1 — engine stops at first hit. Use _first_opt then is_some
// since sqlite_linq doesn't expose `_any()` as a chain terminal.
// Array variants use `_any(pred)` which short-circuits at the first match.

def run_m1(b : B?; n : int) {
    with_sqlite(":memory:") $(db) {
        fixture_db(db, n)
        b |> run("m1_sql/{n}", n) {
            let opt = _sql(db |> select_from(type<Car>) |> _where(_.price > THRESHOLD) |> _first_opt())
            if (!is_some(opt)) {
                b->failNow()
            }
        }
    }
}

def run_m3(b : B?; n : int) {
    let arr <- fixture_array(n)
    b |> run("m3_array/{n}", n) {
        let yes = arr |> _any(_.price > THRESHOLD)
        if (!yes) {
            b->failNow()
        }
    }
}

def run_m3f_old(b : B?; n : int) {
    let arr <- fixture_array(n)
    b |> run("m3f_old_array_fold/{n}", n) {
        let yes = _old_fold(each(arr)._any(_.price > THRESHOLD))
        if (!yes) {
            b->failNow()
        }
    }
}

def run_m3f(b : B?; n : int) {
    let arr <- fixture_array(n)
    b |> run("m3f_array_fold/{n}", n) {
        let yes = _fold(each(arr)._any(_.price > THRESHOLD))
        if (!yes) {
            b->failNow()
        }
    }
}

[benchmark]
def any_match_m1(b : B?) {
    run_m1(b, 100000)
}

[benchmark]
def any_match_m3(b : B?) {
    run_m3(b, 100000)
}

[benchmark]
def any_match_m3f_old(b : B?) {
    run_m3f_old(b, 100000)
}

[benchmark]
def any_match_m3f(b : B?) {
    run_m3f(b, 100000)
}