DO-NOT-MERGE: #657 re-measure on post-#665 main #724
Hot keyed list-diff path (grid steady state) allocated heavily and missed fast-paths even when keys were unchanged. Eliminate per-diff allocations and add the missing fast-path, preserving diff behavior exactly.

ChildReconciler.cs:
- Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit that the positional path has, so stable keyed rows no longer re-diff every tick (#30); cache children.Count once instead of re-reading the COM IVector.get_Size per suffix iteration (#37).
- Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto filling a pooled bool mask; pool tails/tailIndices/predecessors from ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests.
- Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared and returned on every exit (#33).
- Filter: count-pass + single Element[] fill, no List+ToArray (#36).
- GetKey: cache Type.Name via ConcurrentDictionary (#38).

KeyedListDiff.cs:
- Rent newKeys from ArrayPool<string>, threading explicit newCount (rented array may be larger); returned clearArray:true in a finally (#2).
- Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate scan so the steady-state grid case never allocates/scans a dup set (#1).
- Fold the churn-bailout decision into ApplyGeneral, computed from the same post-prefix/suffix scratch map the general walk builds (diff-range churn == full-range churn), removing the O(2n) null-marker pre-pass (#34).
- Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over [0,removes), cleared+returned) (#3).
- Lazy-allocate movedRows only on an actual move under an ambient (#35).
- In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10).
- HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh HashSet (#11); cached null-key diagnostic sample (#41).

Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs oracle with survivor identity, pooled doomed removes, churn with shared prefix+suffix, Scratch reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full Reactor.Tests suite green (9693 passed); core lib Release build is warning-clean (AOT/trim).

Closes #653

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
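The ComputeLISInto change above can be pictured with a small sketch. This is not the code from ChildReconciler.cs, which this page does not show; it is a minimal illustration assuming a patience-sorting LIS with binary search, a caller-supplied bool mask, and ArrayPool scratch arrays. The name `ComputeLisInto`, its signature, and the sample input are hypothetical.

```csharp
using System;
using System.Buffers;

// Hypothetical stand-in for the ComputeLISInto described above (assumed signature).
// Marks inLis[i] = true for the positions of one longest strictly increasing
// subsequence of source, skipping entries below 0 (the -1 "unmapped" sentinel).
// Returns the length of that subsequence.
static int ComputeLisInto(ReadOnlySpan<int> source, Span<bool> inLis)
{
    int count = source.Length;
    inLis.Slice(0, count).Clear();   // leading clear: the mask may be a dirty pooled buffer
    if (count == 0) return 0;

    // tailIndices[k] = index of the smallest tail value of an increasing run of length k + 1.
    int[] tailIndices = ArrayPool<int>.Shared.Rent(count);
    // predecessors[i] = index of the element before i in the best run ending at i, or -1.
    int[] predecessors = ArrayPool<int>.Shared.Rent(count);
    try
    {
        int length = 0;
        for (int i = 0; i < count; i++)
        {
            int value = source[i];
            if (value < 0) continue;   // unmapped entry, never part of the LIS

            // Binary search for the first tail whose value is >= value.
            int lo = 0, hi = length;
            while (lo < hi)
            {
                int mid = (lo + hi) >> 1;
                if (source[tailIndices[mid]] < value) lo = mid + 1;
                else hi = mid;
            }

            predecessors[i] = lo > 0 ? tailIndices[lo - 1] : -1;
            tailIndices[lo] = i;
            if (lo == length) length++;
        }

        // Walk the predecessor chain back from the last tail and mark the mask.
        int k = length > 0 ? tailIndices[length - 1] : -1;
        while (k >= 0)
        {
            inLis[k] = true;
            k = predecessors[k];
        }
        return length;
    }
    finally
    {
        // int scratch holds no references, so no clear is needed on return.
        ArrayPool<int>.Shared.Return(tailIndices);
        ArrayPool<int>.Shared.Return(predecessors);
    }
}

// Usage: a made-up new-index -> old-index map with one unmapped (inserted) row.
int[] newToOld = { 3, -1, 0, 1, 4, 2 };
bool[] mask = new bool[newToOld.Length];
int lisLength = ComputeLisInto(newToOld, mask);
Console.WriteLine($"LIS length: {lisLength}");
```

Rows whose mask entry is true can stay in place; the rest are the moves or inserts. Returning a mask instead of a HashSet is what removes the per-diff allocation in this sketch.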
Review follow-ups on #657 (keyed list-diff allocation work):
- ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear.
- Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length.
- Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
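The clearArray distinction in the first follow-up can be sketched as below. The two helpers (`CountMatches`, `TotalKeyLength`) and their inputs are hypothetical and only illustrate the rent/return pattern described above: value-typed scratch is initialized over its used range on rent and returned without a wipe, while reference-typed scratch is cleared on return so the pool does not keep objects alive.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Value-typed scratch (bool[]): clear only the used range on rent,
// then skip the full-array wipe on return.
static int CountMatches(ReadOnlySpan<int> newToOld)
{
    int count = newToOld.Length;
    bool[] matched = ArrayPool<bool>.Shared.Rent(count);   // may be longer than count and hold stale values
    try
    {
        Array.Clear(matched, 0, count);   // initialize [0, count) before any read
        int hits = 0;
        for (int i = 0; i < count; i++)
        {
            if (newToOld[i] >= 0)
            {
                matched[i] = true;
                hits++;
            }
        }
        return hits;
    }
    finally
    {
        // No references inside, so a wipe here would be wasted O(capacity) work.
        ArrayPool<bool>.Shared.Return(matched, clearArray: false);
    }
}

// Reference-typed scratch (string[]): thread an explicit count, because the
// rented array may be larger, and clear on return.
static int TotalKeyLength(IReadOnlyList<string> keys)
{
    int newCount = keys.Count;
    string[] newKeys = ArrayPool<string>.Shared.Rent(newCount);
    try
    {
        for (int i = 0; i < newCount; i++) newKeys[i] = keys[i];

        int total = 0;
        for (int i = 0; i < newCount; i++) total += newKeys[i].Length;   // iterate newCount, never newKeys.Length
        return total;
    }
    finally
    {
        // Clearing drops the string references so the pooled array does not root them.
        ArrayPool<string>.Shared.Return(newKeys, clearArray: true);
    }
}

// Usage with made-up inputs.
Console.WriteLine(CountMatches(new[] { 3, -1, 0 }));
Console.WriteLine(TotalKeyLength(new[] { "row-1", "row-2" }));
```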
/perf
foreach (var k in next)
    if (s.ByKey.TryGetValue(k, out var row)) survivorsBefore[k] = row;
⚡ Reactor perf comparison

Workload: Regression vs
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 2.48 | 2.61 | +2.4% 95% CI [-3.6, +8.5] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 130.0 | 130.9 | -1.5% 95% CI [-3.8, +0.8] | ≈ within noise |
| Avg Diff (ms) ↓ | 119.8 | 118.1 | -1.2% 95% CI [-3.6, +1.2] | ≈ within noise |
| Avg Memory (MB) ↓ | 283.8 | 284.6 | 0.0% 95% CI [-0.8, +0.8] | ≈ within noise |
Low-mutation skip-floor (--percent 0)
At --percent 0 the workload mutates few cells per tick (always at least one), so reconcile/diff isolate the O(n) per-tick child skip-walk floor that higher mutation rates dilute — ChildReconciler re-walks every child each tick even when nothing moved. The closer --percent is to 0, the more this floor is the signal, so a structural-skip optimization shows up cleanly where the headline table above buries it. Δ is the mean paired change with a 95% CI.
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 16.36 | 15.91 | -1.6% 95% CI [-10.7, +7.6] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 37.5 | 37.9 | +5.4% 95% CI [-8.0, +18.8] | ≈ within noise |
| Avg Diff (ms) ↓ | 35.4 | 35.9 | +5.5% 95% CI [-8.2, +19.2] | ≈ within noise |
| Avg Memory (MB) ↓ | 267.0 | 266.3 | -0.1% 95% CI [-0.4, +0.3] | ≈ within noise |
Allocation (Reactor) — lower is better
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Alloc bytes/render ↓ | 5774436 | 5773474 | +0.1% 95% CI [-1.0, +1.3] | ≈ within noise |
| Gen0 GC / 1k renders ↓ | 230.77 | 230.77 | -0.3% 95% CI [-10.9, +10.2] | ≈ within noise |
Keyed-list workload (StressPerf.KeyedList, --percent 50)
A separate macro workload: a ~500-row stably keyed list whose rows are reordered / inserted / removed each tick. Because every child carries a key, the child reconciler takes its keyed arm (ReconcileKeyed → ReconcileKeyedMiddle, the LIS-based minimal-move pass) instead of the positional re-walk the StocksGrid tables above measure — so this is the sensitive macro signal for keyed-diff work the positional cells can never reach. Same interleaved paired-Δ 95% CI as the headline table.
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 20.94 | 20.90 | -1.3% 95% CI [-3.4, +0.7] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 16.0 | 15.6 | +0.8% 95% CI [-1.3, +2.9] | ≈ within noise |
| Avg Diff (ms) ↓ | 15.7 | 15.4 | +0.4% 95% CI [-1.6, +2.5] | ≈ within noise |
| Avg Memory (MB) ↓ | 168.9 | 172.2 | +1.9% 95% CI [+1.4, +2.4] |
Allocation (keyed-list) — lower is better
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Alloc bytes/render ↓ | 313777 | 217985 | -30.5% 95% CI [-30.8, -30.2] | ✅ improvement |
| Gen0 GC / 1k renders ↓ | 17.78 | 13.61 | -22.4% 95% CI [-30.0, -14.7] | ✅ improvement |
Reconciler micro-benchmarks (PerfBench.ControlModel)
Production --variant Reactor control-model path, ns-resolution and WinUI-undiluted (spec-047 M1–M13) — ↓ lower is better. Status tracks allocated bytes/op, the authoritative signal here; it is deterministic for structurally-fixed benches, while dispatcher / background-thread benches carry a small process-to-process offset, so a bench is flagged only when its 95% CI clears a ±3% minimum-effect band (real structural alloc changes are several percent to many-x). ns/op is shown for context but is not auto-flagged (its paired CI is rep-interleaved but the flag remains dormant pending a real-CI identical-binary band calibration). Δ is the mean paired change with a 95% CI.
| Bench | main ns/op | Δ ns (95% CI) | main B/op | Δ alloc (95% CI) | Status |
|---|---|---|---|---|---|
| M1 Mount_Leaf_NoCallback | 148354.3 | +0.8% 95% CI [-0.2, +1.9] | 1140.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M2 Mount_Leaf_OneCallback | 108358.3 | -2.8% 95% CI [-8.0, +2.4] | 3383.3 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M3 Mount_Leaf_ThreeCallbacks | 217328.5 | 0.0% 95% CI [-4.0, +4.1] | 8460.3 | +0.1% 95% CI [-2.6, +2.7] | ≈ within noise |
| M4 Dispatch_Switch_Cold | 104469.9 | -2.2% 95% CI [-4.5, +0.2] | 1767.8 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M5 Dispatch_Switch_Warm | 106110.0 | -2.8% 95% CI [-7.0, +1.5] | 1766.0 | -0.4% 95% CI [-1.8, +1.1] | ≈ within noise |
| M6 Dispatch_ExternalType | 90500.7 | +3.1% 95% CI [-0.7, +7.0] | 987.6 | -0.6% 95% CI [-3.2, +2.0] | ≈ within noise |
| M7 Update_NoChange | 55148.5 | +0.8% 95% CI [-0.1, +1.6] | 452.1 | +0.7% 95% CI [-7.1, +8.4] | ≈ within noise |
| M8 Update_OneLeafChanged | 41393.3 | -0.2% 95% CI [-1.8, +1.5] | 536.0 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M9 Update_AllChanged | 2805582.0 | -0.3% 95% CI [-1.4, +0.8] | 184278.1 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M10 EventHandlerState_Alloc | 85233.1 | -1.6% 95% CI [-2.8, -0.5] | 3095.2 | 0.0% 95% CI [0.0, +0.1] | ≈ within noise |
| M11 ModifierEHS_Frequency | 45870.8 | +0.4% 95% CI [-1.0, +1.9] | 638.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M12 Pool_Rent_HotPath | 116699.7 | +0.6% 95% CI [-0.2, +1.5] | 1099.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M13 Setters_Suppression_Scope | 96.8 | -0.3% 95% CI [-9.6, +9.0] | 26.7 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M14 Dsl_Rebuild_Cascade | 1515895.0 | -0.3% 95% CI [-1.5, +1.0] | 2231828.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| C207 ChangeHandler_DpRead_Coalesce | 1228.7 | -5.0% 95% CI [-8.9, -1.1] | 0.6 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| OAlloc Optional_Element_Alloc | 216.6 | -2.4% 95% CI [-7.6, +2.9] | 528.0 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| OUpdate Optional_Reconciler_Update | 12072.2 | +1.1% 95% CI [0.0, +2.2] | 2772.3 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
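The flagging rule for the Status column in this table can be read as a small decision function. The sketch below is one interpretation, not the workflow's code: it assumes "clears the ±3% band" means the whole 95% CI of the alloc Δ lies beyond the band, and that lower allocation is better. The function name `ClassifyAlloc` and the label strings are hypothetical.

```csharp
using System;

// Hypothetical classifier for the allocated-bytes/op delta of one bench.
// ciLowPct / ciHighPct are the 95% CI bounds in percent; bandPct is the
// minimum-effect band (3% in the description above).
static string ClassifyAlloc(double ciLowPct, double ciHighPct, double bandPct = 3.0)
{
    if (ciLowPct > ciHighPct)
        throw new ArgumentException("CI lower bound must not exceed the upper bound");

    if (ciHighPct < -bandPct) return "improvement";   // entire CI below -band: fewer bytes/op
    if (ciLowPct > bandPct) return "regression";      // entire CI above +band: more bytes/op
    return "within noise";                            // CI overlaps the band or zero
}

// Usage: the M7 alloc CI from the table, then two made-up intervals.
Console.WriteLine(ClassifyAlloc(-7.1, 8.4));
Console.WriteLine(ClassifyAlloc(-12.0, -9.5));
Console.WriteLine(ClassifyAlloc(4.2, 6.0));
```

Under this reading a CI such as [-2.8, -0.5] excludes zero but still counts as within noise, because it never leaves the ±3% band.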
Cross-framework reference (same StocksGrid workload)
| Metric | vanilla WinUI3¹ | Rust windows-reactor² | Reactor (this PR) |
|---|---|---|---|
| Renders/sec ↑ | 3.27 | 4.90 | 2.61 |
| Avg Reconcile (ms) ↓ | n/a | 18.5 | 130.9 |
| Avg Diff (ms) ↓ | n/a | 16.1 | 118.1 |
| Avg Memory (MB) ↓ | 264.8 | 195.7 | 284.6 |
↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement /
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
Reconciler micro-benchmarks run PerfBench.ControlModel --variant Reactor (M1–M13) as a headless loop bracketed by per-thread alloc + GC counters — ns-resolution and free of WinUI render / working-set dilution, so they resolve Core/Reconciler allocation deltas the macro StocksGrid workload cannot. main and PR each link their own src/Reactor build and are rep-interleaved (a fresh alternated process per rep); Δ is the paired 95% CI over per-rep means. The Status column tracks allocated bytes/op (deterministic for identical code); ns/op is informational — its paired CI is now unbiased but the flag stays dormant pending a real-CI identical-binary band calibration.
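As a rough illustration of "the paired 95% CI over per-rep means", the sketch below computes a per-rep percent change and a two-sided Student-t interval on its mean. The harness's actual statistics are not shown on this page, so the formula, the t-table fallback, the function name `PairedDeltaCi`, and the sample numbers are all assumptions.

```csharp
using System;

// Hypothetical helper: percent change of PR vs main for each interleaved rep,
// then mean +/- t * standard error over those paired changes.
static (double Mean, double Low, double High) PairedDeltaCi(double[] mainReps, double[] prReps)
{
    if (mainReps.Length != prReps.Length || mainReps.Length < 2)
        throw new ArgumentException("need at least two paired reps of equal count");

    int n = mainReps.Length;
    double[] delta = new double[n];
    double sum = 0;
    for (int i = 0; i < n; i++)
    {
        delta[i] = (prReps[i] - mainReps[i]) / mainReps[i] * 100.0;   // percent change for rep i
        sum += delta[i];
    }
    double mean = sum / n;

    double squares = 0;
    for (int i = 0; i < n; i++) squares += (delta[i] - mean) * (delta[i] - mean);
    double standardError = Math.Sqrt(squares / (n - 1)) / Math.Sqrt(n);

    // Two-sided 95% Student-t critical values for df = 1..10.
    // Beyond the table this falls back to 1.96, which is slightly too narrow.
    double[] t95 = { 12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228 };
    int df = n - 1;
    double t = df <= t95.Length ? t95[df - 1] : 1.96;

    return (mean, mean - t * standardError, mean + t * standardError);
}

// Usage with made-up per-rep means (ms); real runs would use more reps.
double[] mainMs = { 100.0, 100.0, 100.0 };
double[] prMs = { 90.0, 95.0, 85.0 };
var ci = PairedDeltaCi(mainMs, prMs);
Console.WriteLine($"mean {ci.Mean:F1}% CI [{ci.Low:F1}, {ci.High:F1}]");
```

Because the interval is built from per-rep differences, the mean Δ need not equal the percent difference between the two column averages shown in the tables.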
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
Runner: CPU: AMD EPYC 7763 64-Core Processor · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042996823.
Generated by .github/workflows/perf-compare.yml · PR 03db34e vs main b9ace1e · 2026-06-27T04:31:11Z · run log.
Throwaway measurement PR. Rebases #657 (keyed-list diff alloc, head 13679c9) onto current origin/main (b9ace1e = #692+M14+#665+#649) to re-measure the keyed-list block on the fresh baseline. DO NOT MERGE: the original #657 stays pristine. This PR will be closed after /perf completes.