[perf-measure only · DO NOT MERGE] #657 rebased past #694 — keyed-leg /perf#705
azchohfi wants to merge 2 commits
Hot keyed list-diff path (grid steady state) allocated heavily and missed fast-paths even when keys were unchanged. Eliminate per-diff allocations and add the missing fast-path, preserving diff behavior exactly.

ChildReconciler.cs:
- Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit that the positional path has, so stable keyed rows no longer re-diff every tick (#30); cache children.Count once instead of re-reading the COM IVector.get_Size per suffix iteration (#37).
- Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto filling a pooled bool mask; pool tails/tailIndices/predecessors from ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests.
- Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared and returned on every exit (#33).
- Filter: count-pass + single Element[] fill, no List+ToArray (#36).
- GetKey: cache Type.Name via ConcurrentDictionary (#38).

KeyedListDiff.cs:
- Rent newKeys from ArrayPool<string>, threading explicit newCount (rented array may be larger); returned clearArray:true in a finally (#2).
- Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate scan so the steady-state grid case never allocates/scans a dup set (#1).
- Fold the churn-bailout decision into ApplyGeneral, computed from the same post-prefix/suffix scratch map the general walk builds (diff-range churn == full-range churn), removing the O(2n) null-marker pre-pass (#34).
- Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over [0,removes), cleared+returned) (#3).
- Lazy-allocate movedRows only on an actual move under an ambient (#35).
- In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10).
- HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh HashSet (#11); cached null-key diagnostic sample (#41).

Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs oracle with survivor identity, pooled doomed removes, churn with shared prefix+suffix, Scratch reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full Reactor.Tests suite green (9693 passed); core lib Release build is warning-clean (AOT/trim).

Closes #653

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
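To make the bool-mask idea concrete, here is a minimal sketch of a longest-increasing-subsequence pass that fills a caller-supplied mask and rents its scratch arrays from `ArrayPool<int>.Shared`. It is not the Reactor implementation: the class name `LisSketch`, the method name `ComputeLisInto`, and its signature are assumptions made for illustration. Only `ArrayPool`, the `-1` unmapped sentinel, and the explicit `count` (because a rented array may be larger than requested) come from the commit message.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch, not the actual ChildReconciler code.
public static class LisSketch
{
    // Sets inLis[i] = true for the positions of one longest strictly increasing
    // subsequence of source[0..count). Entries below zero (the -1 "unmapped"
    // sentinel) are skipped. Returns the subsequence length.
    // Only the range [0, count) of any array is read or written, because pooled
    // arrays can be longer than the requested size.
    public static int ComputeLisInto(int[] source, int count, bool[] inLis)
    {
        Array.Clear(inLis, 0, count); // leading clear of the used range only
        if (count == 0) return 0;

        // tailIndices[len - 1] = index of the smallest tail value among
        // increasing runs of length len found so far.
        int[] tailIndices = ArrayPool<int>.Shared.Rent(count);
        // predecessors[i] = index of the element that precedes i in its run.
        int[] predecessors = ArrayPool<int>.Shared.Rent(count);
        try
        {
            int length = 0;
            for (int i = 0; i < count; i++)
            {
                int value = source[i];
                if (value < 0) continue; // unmapped entry

                // Binary search for the first tail whose value is >= value.
                int lo = 0, hi = length;
                while (lo < hi)
                {
                    int mid = lo + (hi - lo) / 2;
                    if (source[tailIndices[mid]] < value) lo = mid + 1;
                    else hi = mid;
                }

                predecessors[i] = lo > 0 ? tailIndices[lo - 1] : -1;
                tailIndices[lo] = i;
                if (lo == length) length++;
            }

            // Walk the predecessor chain back from the last tail to mark the mask.
            int cursor = length > 0 ? tailIndices[length - 1] : -1;
            while (cursor >= 0)
            {
                inLis[cursor] = true;
                cursor = predecessors[cursor];
            }
            return length;
        }
        finally
        {
            // int is a value type, so the arrays are returned without clearing.
            ArrayPool<int>.Shared.Return(tailIndices);
            ArrayPool<int>.Shared.Return(predecessors);
        }
    }
}
```

The mask replaces a returned set: the caller asks `inLis[i]` instead of `set.Contains(i)`, and nothing is allocated per diff once the pool is warm.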
Review follow-ups on #657 (keyed list-diff allocation work):
- ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear.
- Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length.
- Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
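The independent oracle described in the test follow-up can be sketched as below. This is an illustrative reconstruction, not the code in ChildReconcilerLisIntoTests: the names `LisOracleSketch`, `BruteForceLisLength`, and `IsValidMaximalMask` are hypothetical. It shows the two properties the commit says are asserted: the marked entries are strictly increasing (and never a `-1` slot), and their count equals the LIS length computed by a separate O(n²) DP.

```csharp
using System;

// Hypothetical test helper; names are assumptions for illustration.
public static class LisOracleSketch
{
    // O(n^2) dynamic program: length of the longest strictly increasing
    // subsequence of source[0..count), ignoring negative (unmapped) entries.
    public static int BruteForceLisLength(int[] source, int count)
    {
        var best = new int[count]; // best[i] = longest run ending at i
        int max = 0;
        for (int i = 0; i < count; i++)
        {
            if (source[i] < 0) continue; // -1 sentinel never joins a run
            best[i] = 1;
            for (int j = 0; j < i; j++)
            {
                if (source[j] >= 0 && source[j] < source[i] && best[j] + 1 > best[i])
                    best[i] = best[j] + 1;
            }
            if (best[i] > max) max = best[i];
        }
        return max;
    }

    // True when the mask marks a strictly increasing subsequence, marks no
    // sentinel slot, and marks exactly as many entries as the DP length.
    public static bool IsValidMaximalMask(int[] source, int count, bool[] mask)
    {
        int marked = 0;
        int previous = int.MinValue;
        for (int i = 0; i < count; i++)
        {
            if (!mask[i]) continue;
            if (source[i] < 0 || source[i] <= previous) return false;
            previous = source[i];
            marked++;
        }
        return marked == BruteForceLisLength(source, count);
    }
}
```

Checking validity plus length, rather than comparing against one expected mask, keeps the test correct when several different subsequences share the maximal length.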
/perf

    foreach (var k in next)
        if (s.ByKey.TryGetValue(k, out var row)) survivorsBefore[k] = row;
⚡ Reactor perf comparison
Workload: Regression vs

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 2.85 | 2.64 | -2.1% 95% CI [-5.9, +1.8] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 145.9 | 146.5 | +1.0% 95% CI [-1.1, +3.1] | ≈ within noise |
| Avg Diff (ms) ↓ | 132.6 | 132.1 | +0.2% 95% CI [-1.7, +2.0] | ≈ within noise |
| Avg Memory (MB) ↓ | 304.4 | 300.2 | -0.6% 95% CI [-1.3, 0.0] | ✅ improvement |
Low-mutation skip-floor (--percent 0)
At --percent 0 the workload mutates few cells per tick (always at least one), so reconcile/diff isolate the O(n) per-tick child skip-walk floor that higher mutation rates dilute — ChildReconciler re-walks every child each tick even when nothing moved. The closer --percent is to 0, the more this floor is the signal, so a structural-skip optimization shows up cleanly where the headline table above buries it. Δ is the mean paired change with a 95% CI.
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 15.07 | 14.69 | -0.5% 95% CI [-4.4, +3.5] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 25.9 | 26.1 | -1.3% 95% CI [-4.4, +1.9] | ≈ within noise |
| Avg Diff (ms) ↓ | 24.1 | 24.3 | -1.2% 95% CI [-4.8, +2.3] | ≈ within noise |
| Avg Memory (MB) ↓ | 268.2 | 267.9 | 0.0% 95% CI [-0.5, +0.4] | ≈ within noise |
Allocation (Reactor) — lower is better
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Alloc bytes/render ↓ | 9745167 | 9814008 | +0.3% 95% CI [-0.2, +0.7] | ≈ within noise |
| Gen0 GC / 1k renders ↓ | 333.33 | 357.14 | +2.8% 95% CI [-1.4, +6.9] | ≈ within noise |
Keyed-list workload (StressPerf.KeyedList, --percent 50)
A separate macro workload: a ~500-row stably keyed list whose rows are reordered / inserted / removed each tick. Because every child carries a key, the child reconciler takes its keyed arm (ReconcileKeyed → ReconcileKeyedMiddle, the LIS-based minimal-move pass) instead of the positional re-walk the StocksGrid tables above measure — so this is the sensitive macro signal for keyed-diff work the positional cells can never reach. Same interleaved paired-Δ 95% CI as the headline table.
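As a rough picture of what the keyed arm has to compute before any LIS pass, the sketch below trims the common keyed prefix and suffix and then maps each key in the remaining new middle to its old index, using `-1` for keys with no old row. It is an assumption-laden illustration, not ReconcileKeyed itself: `KeyedMiddleSketch` and `MapMiddle` are hypothetical names, keys are assumed unique, and unlike the PR this version allocates its dictionary and result array instead of pooling them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the prefix/suffix trim and new-to-old mapping.
public static class KeyedMiddleSketch
{
    // Returns newToOld for the middle range only: newToOld[i] is the old index
    // of newKeys[prefix + i], or -1 when that key did not exist before (insert).
    // Assumes keys are unique within each list.
    public static int[] MapMiddle(string[] oldKeys, string[] newKeys,
                                  out int prefix, out int suffix)
    {
        int oldCount = oldKeys.Length, newCount = newKeys.Length;
        int limit = Math.Min(oldCount, newCount);

        // Rows that keep the same key at the same leading position.
        prefix = 0;
        while (prefix < limit && oldKeys[prefix] == newKeys[prefix]) prefix++;

        // Rows that keep the same key at the same trailing position,
        // never overlapping the prefix.
        suffix = 0;
        while (suffix < limit - prefix
               && oldKeys[oldCount - 1 - suffix] == newKeys[newCount - 1 - suffix])
        {
            suffix++;
        }

        // key -> old index, restricted to the old middle range.
        var oldIndexByKey = new Dictionary<string, int>(StringComparer.Ordinal);
        for (int i = prefix; i < oldCount - suffix; i++) oldIndexByKey[oldKeys[i]] = i;

        int middle = newCount - prefix - suffix;
        var newToOld = new int[middle];
        for (int i = 0; i < middle; i++)
        {
            newToOld[i] = oldIndexByKey.TryGetValue(newKeys[prefix + i], out int oldIndex)
                ? oldIndex
                : -1;
        }
        return newToOld;
    }
}
```

For old keys `a b c d e` and new keys `a c x b e`, the prefix and suffix are one row each and the middle maps to `[2, -1, 1]`. An LIS over the mapped entries picks the rows that can stay in place; the remaining mapped rows move, `-1` entries are inserts, and old middle rows that no new key maps to (here `d`) are removes.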
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 22.88 | 22.34 | -2.5% 95% CI [-4.7, -0.4] | |
| Avg Reconcile (ms) ↓ | 13.2 | 13.2 | +2.5% 95% CI [-2.6, +7.6] | ≈ within noise |
| Avg Diff (ms) ↓ | 13.1 | 13.1 | +2.3% 95% CI [-3.0, +7.7] | ≈ within noise |
| Avg Memory (MB) ↓ | 177.6 | 181.9 | +2.1% 95% CI [+1.3, +3.0] | |
Allocation (keyed-list) — lower is better
| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Alloc bytes/render ↓ | 316844 | 219994 | -30.6% 95% CI [-30.8, -30.5] | ✅ improvement |
| Gen0 GC / 1k renders ↓ | 12.85 | 8.79 | -31.5% 95% CI [-33.0, -30.1] | ✅ improvement |
Reconciler micro-benchmarks (PerfBench.ControlModel)
Production --variant Reactor control-model path, ns-resolution and WinUI-undiluted (spec-047 M1–M13) — ↓ lower is better. Status tracks allocated bytes/op, the authoritative signal here; it is deterministic for structurally-fixed benches, while dispatcher / background-thread benches carry a small process-to-process offset, so a bench is flagged only when its 95% CI clears a ±3% minimum-effect band (real structural alloc changes are several percent to many-x). ns/op is shown for context but is not auto-flagged (its paired CI is rep-interleaved but the flag remains dormant pending a real-CI identical-binary band calibration). Δ is the mean paired change with a 95% CI.
| Bench | main ns/op | Δ ns (95% CI) | main B/op | Δ alloc (95% CI) | Status |
|---|---|---|---|---|---|
| M1 Mount_Leaf_NoCallback | 112111.5 | +0.5% 95% CI [-0.5, +1.4] | 1140.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M2 Mount_Leaf_OneCallback | 82876.1 | +0.6% 95% CI [-1.3, +2.5] | 3383.3 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M3 Mount_Leaf_ThreeCallbacks | 165104.0 | +1.2% 95% CI [-0.9, +3.3] | 8179.0 | +0.2% 95% CI [0.0, +0.3] | ≈ within noise |
| M4 Dispatch_Switch_Cold | 81557.1 | -3.5% 95% CI [-9.5, +2.4] | 1768.1 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M5 Dispatch_Switch_Warm | 81854.1 | +2.0% 95% CI [-1.4, +5.4] | 1766.0 | 0.0% 95% CI [-1.7, +1.8] | ≈ within noise |
| M6 Dispatch_ExternalType | 70801.2 | +0.2% 95% CI [-0.4, +0.7] | 987.6 | +0.1% 95% CI [-3.0, +3.2] | ≈ within noise |
| M7 Update_NoChange | 44317.0 | +0.2% 95% CI [-1.1, +1.4] | 452.1 | +2.2% 95% CI [-4.6, +9.0] | ≈ within noise |
| M8 Update_OneLeafChanged | 32192.2 | -2.3% 95% CI [-9.3, +4.8] | 536.0 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M9 Update_AllChanged | 2783983.3 | -1.3% 95% CI [-4.0, +1.5] | 184278.1 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M10 EventHandlerState_Alloc | 64810.4 | +1.4% 95% CI [-1.0, +3.8] | 3013.2 | 0.0% 95% CI [-1.1, +1.1] | ≈ within noise |
| M11 ModifierEHS_Frequency | 34713.0 | +0.2% 95% CI [-0.8, +1.1] | 638.8 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M12 Pool_Rent_HotPath | 90760.7 | -0.5% 95% CI [-1.7, +0.6] | 1099.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M13 Setters_Suppression_Scope | 108.6 | +2.0% 95% CI [-5.0, +8.9] | 26.7 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| C207 ChangeHandler_DpRead_Coalesce | 1151.3 | -3.6% 95% CI [-6.8, -0.4] | 0.6 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| OAlloc Optional_Element_Alloc | 163.2 | +1.6% 95% CI [-6.0, +9.3] | 528.0 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| OUpdate Optional_Reconciler_Update | 14270.7 | -0.6% 95% CI [-2.4, +1.2] | 5892.8 | 0.0% 95% CI [-0.5, +0.5] | ≈ within noise |
Cross-framework reference (same StocksGrid workload)
| Metric | vanilla WinUI3¹ | Rust windows-reactor² | Reactor (this PR) |
|---|---|---|---|
| Renders/sec ↑ | 3.15 | 5.20 | 2.64 |
| Avg Reconcile (ms) ↓ | n/a | 21.6 | 146.5 |
| Avg Diff (ms) ↓ | n/a | 19.3 | 132.1 |
| Avg Memory (MB) ↓ | 262.9 | 197.9 | 300.2 |
↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement /
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
Reconciler micro-benchmarks run PerfBench.ControlModel --variant Reactor (M1–M13) as a headless loop bracketed by per-thread alloc + GC counters — ns-resolution and free of WinUI render / working-set dilution, so they resolve Core/Reconciler allocation deltas the macro StocksGrid workload cannot. main and PR each link their own src/Reactor build and are rep-interleaved (a fresh alternated process per rep); Δ is the paired 95% CI over per-rep means. The Status column tracks allocated bytes/op (deterministic for identical code); ns/op is informational — its paired CI is now unbiased but the flag stays dormant pending a real-CI identical-binary band calibration.
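A headless loop bracketed by a per-thread allocation counter can be sketched as follows. This is not the PerfBench.ControlModel harness: `AllocProbeSketch` and `AllocatedBytesPerOp` are hypothetical names, and the single warm-up call is an assumption about how one-time costs might be excluded. The only real API relied on is `GC.GetAllocatedBytesForCurrentThread()`, which reports bytes allocated by the calling thread.

```csharp
using System;

// Hypothetical measurement helper, not the actual benchmark harness.
public static class AllocProbeSketch
{
    // Runs the operation `iterations` times on the current thread and returns
    // the average number of bytes it allocated per call.
    public static double AllocatedBytesPerOp(Action operation, int iterations)
    {
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        // Warm-up call so one-time JIT and static-initializer allocations are
        // not charged to the measured loop (an assumption of this sketch).
        operation();

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++) operation();
        long after = GC.GetAllocatedBytesForCurrentThread();

        return (after - before) / (double)iterations;
    }
}
```

Because the counter is per thread, work that an operation hands to a dispatcher or background thread is not captured by this bracket.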
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
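The "within noise" rule in the legend (the 95% CI of the paired Δ includes 0) can be illustrated with a small paired-difference calculation. This sketch assumes a Student-t interval over per-rep percent changes; the report does not state the estimator perf-compare.yml actually uses. `PairedDeltaSketch`, `PairedPercentDelta`, and `IncludesZero` are hypothetical names, and the caller supplies the t critical value (for example 3.182 for four reps, i.e. three degrees of freedom, at 95%).

```csharp
using System;

// Hypothetical sketch of a paired-delta confidence interval.
public static class PairedDeltaSketch
{
    // baseline[i] and candidate[i] are the per-rep means from the same
    // interleaved rep. Returns the mean percent change and its interval.
    public static (double Mean, double Low, double High) PairedPercentDelta(
        double[] baseline, double[] candidate, double tCritical)
    {
        if (baseline.Length != candidate.Length || baseline.Length < 2)
            throw new ArgumentException("Need at least two paired reps of equal length.");

        int n = baseline.Length;
        var deltas = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++)
        {
            deltas[i] = (candidate[i] - baseline[i]) / baseline[i] * 100.0;
            sum += deltas[i];
        }
        double mean = sum / n;

        // Sample variance of the per-pair deltas (n - 1 in the denominator).
        double squares = 0;
        for (int i = 0; i < n; i++) squares += (deltas[i] - mean) * (deltas[i] - mean);
        double standardError = Math.Sqrt(squares / (n - 1)) / Math.Sqrt(n);

        double half = tCritical * standardError;
        return (mean, mean - half, mean + half);
    }

    // "Within noise" in the legend's sense: the interval contains zero.
    public static bool IncludesZero((double Mean, double Low, double High) ci)
        => ci.Low <= 0 && ci.High >= 0;
}
```

Pairing each PR rep with the main rep that ran alongside it cancels runner drift common to both, which is why the Δ is more trustworthy than the absolute values.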
Runner: CPU: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042900980.
Generated by .github/workflows/perf-compare.yml · PR 9e08764 vs main 52baebb · 2026-06-26T17:48:39Z · run log.
Measurement complete; perf-compare verified + banked by coordinator. Closing temp measurement PR (DO-NOT-MERGE artifact). Original #657 branch untouched/pristine.
MEASUREMENT-ONLY · DO NOT MERGE · throwaway.
This is #657 (azchohfi-perf-keyed-list-diff-allocations @ 13679c9) rebased onto main 52baebb to pull in the #694 StressPerf.KeyedList workload so the keyed-LIS leg actually renders under /perf (the original branch predates #694, so its whole instrument was dark on positional StocksGrid).

Verified on the rebased head: Reactor.csproj Release 0/0; Reactor.Tests 9730 passed / 0 failed / 64 skipped (--arch arm64).

Original #657 left pristine (gate-clean-HELD). This PR will be closed and its branch deleted once the perf comparison lands. Tracks #657.