[perf-measure only · DO NOT MERGE] #657 rebased past #694 — keyed-leg /perf by azchohfi · Pull Request #705 · microsoft/microsoft-ui-reactor

azchohfi · 2026-06-26T17:13:20Z

MEASUREMENT-ONLY · DO NOT MERGE · throwaway.

This is #657 (azchohfi-perf-keyed-list-diff-allocations @ 13679c9) rebased onto main 52baebb to pull in the #694 StressPerf.KeyedList workload so the keyed-LIS leg actually renders under /perf (the original branch predates #694, so its whole instrument was dark on positional StocksGrid).

Verified on the rebased head: Reactor.csproj Release 0/0; Reactor.Tests 9730 passed / 0 failed / 64 skipped (--arch arm64).

Original #657 left pristine (gate-clean-HELD). This PR will be closed and its branch deleted once the perf comparison lands. Tracks #657.

Hot keyed list-diff path (grid steady state) allocated heavily and missed fast-paths even when keys were unchanged. Eliminate per-diff allocations and add the missing fast-path, preserving diff behavior exactly.

ChildReconciler.cs:
- Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit that the positional path has, so stable keyed rows no longer re-diff every tick (#30); cache children.Count once instead of re-reading the COM IVector.get_Size per suffix iteration (#37).
- Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto filling a pooled bool mask; pool tails/tailIndices/predecessors from ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests.
- Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared and returned on every exit (#33).
- Filter: count-pass + single Element[] fill, no List+ToArray (#36).
- GetKey: cache Type.Name via ConcurrentDictionary (#38).

KeyedListDiff.cs:
- Rent newKeys from ArrayPool<string>, threading explicit newCount (rented array may be larger); returned clearArray:true in a finally (#2).
- Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate scan so the steady-state grid case never allocates/scans a dup set (#1).
- Fold the churn-bailout decision into ApplyGeneral, computed from the same post-prefix/suffix scratch map the general walk builds (diff-range churn == full-range churn), removing the O(2n) null-marker pre-pass (#34).
- Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over [0,removes), cleared+returned) (#3).
- Lazy-allocate movedRows only on an actual move under an ambient (#35).
- In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10).
- HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh HashSet (#11); cached null-key diagnostic sample (#41).

Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs oracle with survivor identity, pooled doomed removes, churn with shared prefix+suffix, Scratch reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full Reactor.Tests suite green (9693 passed); core lib Release build is warning-clean (AOT/trim).

Closes #653

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
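The rent-with-explicit-count pattern described for newKeys can be sketched roughly as follows. This is an illustration only, not the code in KeyedListDiff.cs: the class PooledKeysSketch, the method KeysUnchanged, and its parameters are hypothetical names, and the comparison it performs is a stand-in for the no-op fast path. Only ArrayPool<T>.Shared.Rent and Return are real APIs.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Hypothetical sketch; names are illustrative and not taken from the PR.
static class PooledKeysSketch
{
    // Returns true when the incoming keys match the last keys in order.
    public static bool KeysUnchanged(IReadOnlyList<string> lastKeys, IReadOnlyList<string> source)
    {
        int newCount = source.Count;

        // Rent may return an array longer than requested, so every loop
        // below is bounded by newCount, never by newKeys.Length.
        string[] newKeys = ArrayPool<string>.Shared.Rent(newCount);
        try
        {
            for (int i = 0; i < newCount; i++)
                newKeys[i] = source[i];

            if (lastKeys.Count != newCount)
                return false;

            for (int i = 0; i < newCount; i++)
            {
                if (!string.Equals(lastKeys[i], newKeys[i], StringComparison.Ordinal))
                    return false;
            }
            return true;
        }
        finally
        {
            // The buffer holds references, so clear it on return; otherwise
            // the pool would keep the key strings reachable.
            ArrayPool<string>.Shared.Return(newKeys, clearArray: true);
        }
    }
}
```

Placing the Return in a finally block means the early returns above cannot leak the rented buffer.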

Review follow-ups on #657 (keyed list-diff allocation work):
- ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear.
- Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length.
- Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
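For readers unfamiliar with the mask-based LIS described above, here is a minimal sketch of the general technique: a patience-style longest strictly increasing subsequence that skips a -1 "unmapped" sentinel and writes its result into a caller-supplied bool mask. It is not the ComputeLISInto implementation from ChildReconciler.cs; the class name, method signature, and use of spans are assumptions made for illustration.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch; the signature is illustrative and not taken from the PR.
static class LisMaskSketch
{
    // Marks in inLis[0..source.Length) the positions of one longest strictly
    // increasing subsequence of source, ignoring negative (unmapped) entries.
    // Returns the subsequence length. inLis must be at least source.Length long.
    public static int ComputeLisInto(ReadOnlySpan<int> source, Span<bool> inLis)
    {
        int count = source.Length;
        inLis.Slice(0, count).Clear(); // leading clear of the used range only
        if (count == 0) return 0;

        int[] tailIndices = ArrayPool<int>.Shared.Rent(count);
        int[] predecessors = ArrayPool<int>.Shared.Rent(count);
        try
        {
            int length = 0;
            for (int i = 0; i < count; i++)
            {
                int value = source[i];
                if (value < 0) continue; // -1 sentinel: not part of any subsequence

                // Binary search for the first tail whose value is >= value.
                int lo = 0, hi = length;
                while (lo < hi)
                {
                    int mid = (lo + hi) >> 1;
                    if (source[tailIndices[mid]] < value) lo = mid + 1;
                    else hi = mid;
                }

                predecessors[i] = lo > 0 ? tailIndices[lo - 1] : -1;
                tailIndices[lo] = i;
                if (lo == length) length++;
            }

            // Walk the predecessor chain back from the last tail to fill the mask.
            int k = length > 0 ? tailIndices[length - 1] : -1;
            for (int n = 0; n < length; n++)
            {
                inLis[k] = true;
                k = predecessors[k];
            }
            return length;
        }
        finally
        {
            // int buffers hold no references and every slot read above was
            // written first, so they go back to the pool without clearing.
            ArrayPool<int>.Shared.Return(tailIndices);
            ArrayPool<int>.Shared.Return(predecessors);
        }
    }
}
```

For the input { 3, -1, 0, 1, 2 } this returns 3 and sets the mask at indices 2, 3, and 4. Rows whose mask entry is true can stay in place; the rest are the candidates for moves.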

azchohfi · 2026-06-26T17:13:29Z

/perf

+            foreach (var k in next)
+                if (s.ByKey.TryGetValue(k, out var row)) survivorsBefore[k] = row;


github-actions · 2026-06-26T17:48:43Z

⚡ Reactor perf comparison

Workload: StressPerf.ReactorOptimized StocksGrid · --percent 50 --duration 10 · x64 Release · median of 12 paired runs (2 warmup dropped); Δ is the mean change with a 95% CI · PR head and main built and run interleaved on the same runner.

Regression vs `main` baseline

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Renders/sec ↑	2.85	2.64	-2.1% _{95% CI [-5.9, +1.8]}	≈ within noise
Avg Reconcile (ms) ↓	145.9	146.5	+1.0% _{95% CI [-1.1, +3.1]}	≈ within noise
Avg Diff (ms) ↓	132.6	132.1	+0.2% _{95% CI [-1.7, +2.0]}	≈ within noise
Avg Memory (MB) ↓	304.4	300.2	-0.6% _{95% CI [-1.3, 0.0]}	✅ improvement

Low-mutation skip-floor (`--percent 0`)

At --percent 0 the workload mutates few cells per tick (always at least one), so reconcile/diff isolate the O(n) per-tick child skip-walk floor that higher mutation rates dilute — ChildReconciler re-walks every child each tick even when nothing moved. The closer --percent is to 0, the more this floor is the signal, so a structural-skip optimization shows up cleanly where the headline table above buries it. Δ is the mean paired change with a 95% CI.

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Renders/sec ↑	15.07	14.69	-0.5% _{95% CI [-4.4, +3.5]}	≈ within noise
Avg Reconcile (ms) ↓	25.9	26.1	-1.3% _{95% CI [-4.4, +1.9]}	≈ within noise
Avg Diff (ms) ↓	24.1	24.3	-1.2% _{95% CI [-4.8, +2.3]}	≈ within noise
Avg Memory (MB) ↓	268.2	267.9	0.0% _{95% CI [-0.5, +0.4]}	≈ within noise

Allocation (Reactor) — lower is better

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Alloc bytes/render ↓	9745167	9814008	+0.3% _{95% CI [-0.2, +0.7]}	≈ within noise
Gen0 GC / 1k renders ↓	333.33	357.14	+2.8% _{95% CI [-1.4, +6.9]}	≈ within noise

Keyed-list workload (`StressPerf.KeyedList`, `--percent 50`)

A separate macro workload: a ~500-row stably keyed list whose rows are reordered / inserted / removed each tick. Because every child carries a key, the child reconciler takes its keyed arm (ReconcileKeyed → ReconcileKeyedMiddle, the LIS-based minimal-move pass) instead of the positional re-walk the StocksGrid tables above measure — so this is the sensitive macro signal for keyed-diff work the positional cells can never reach. Same interleaved paired-Δ 95% CI as the headline table.

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Renders/sec ↑	22.88	22.34	-2.5% _{95% CI [-4.7, -0.4]}	⚠️ regression
Avg Reconcile (ms) ↓	13.2	13.2	+2.5% _{95% CI [-2.6, +7.6]}	≈ within noise
Avg Diff (ms) ↓	13.1	13.1	+2.3% _{95% CI [-3.0, +7.7]}	≈ within noise
Avg Memory (MB) ↓	177.6	181.9	+2.1% _{95% CI [+1.3, +3.0]}	⚠️ regression

Allocation (keyed-list) — lower is better

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Alloc bytes/render ↓	316844	219994	-30.6% _{95% CI [-30.8, -30.5]}	✅ improvement
Gen0 GC / 1k renders ↓	12.85	8.79	-31.5% _{95% CI [-33.0, -30.1]}	✅ improvement

Reconciler micro-benchmarks (`PerfBench.ControlModel`)

Production --variant Reactor control-model path, ns-resolution and WinUI-undiluted (spec-047 M1–M13) — ↓ lower is better. Status tracks allocated bytes/op, the authoritative signal here; it is deterministic for structurally-fixed benches, while dispatcher / background-thread benches carry a small process-to-process offset, so a bench is flagged only when its 95% CI clears a ±3% minimum-effect band (real structural alloc changes are several percent to many-x). ns/op is shown for context but is not auto-flagged (its paired CI is rep-interleaved but the flag remains dormant pending a real-CI identical-binary band calibration). Δ is the mean paired change with a 95% CI.

Bench	`main` ns/op	Δ ns (95% CI)	`main` B/op	Δ alloc (95% CI)	Status
`M1` Mount_Leaf_NoCallback	112111.5	+0.5% _{95% CI [-0.5, +1.4]}	1140.9	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M2` Mount_Leaf_OneCallback	82876.1	+0.6% _{95% CI [-1.3, +2.5]}	3383.3	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M3` Mount_Leaf_ThreeCallbacks	165104.0	+1.2% _{95% CI [-0.9, +3.3]}	8179.0	+0.2% _{95% CI [0.0, +0.3]}	≈ within noise
`M4` Dispatch_Switch_Cold	81557.1	-3.5% _{95% CI [-9.5, +2.4]}	1768.1	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M5` Dispatch_Switch_Warm	81854.1	+2.0% _{95% CI [-1.4, +5.4]}	1766.0	0.0% _{95% CI [-1.7, +1.8]}	≈ within noise
`M6` Dispatch_ExternalType	70801.2	+0.2% _{95% CI [-0.4, +0.7]}	987.6	+0.1% _{95% CI [-3.0, +3.2]}	≈ within noise
`M7` Update_NoChange	44317.0	+0.2% _{95% CI [-1.1, +1.4]}	452.1	+2.2% _{95% CI [-4.6, +9.0]}	≈ within noise
`M8` Update_OneLeafChanged	32192.2	-2.3% _{95% CI [-9.3, +4.8]}	536.0	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M9` Update_AllChanged	2783983.3	-1.3% _{95% CI [-4.0, +1.5]}	184278.1	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M10` EventHandlerState_Alloc	64810.4	+1.4% _{95% CI [-1.0, +3.8]}	3013.2	0.0% _{95% CI [-1.1, +1.1]}	≈ within noise
`M11` ModifierEHS_Frequency	34713.0	+0.2% _{95% CI [-0.8, +1.1]}	638.8	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M12` Pool_Rent_HotPath	90760.7	-0.5% _{95% CI [-1.7, +0.6]}	1099.9	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`M13` Setters_Suppression_Scope	108.6	+2.0% _{95% CI [-5.0, +8.9]}	26.7	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`C207` ChangeHandler_DpRead_Coalesce	1151.3	-3.6% _{95% CI [-6.8, -0.4]}	0.6	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`OAlloc` Optional_Element_Alloc	163.2	+1.6% _{95% CI [-6.0, +9.3]}	528.0	0.0% _{95% CI [0.0, 0.0]}	≈ within noise
`OUpdate` Optional_Reconciler_Update	14270.7	-0.6% _{95% CI [-2.4, +1.2]}	5892.8	0.0% _{95% CI [-0.5, +0.5]}	≈ within noise

Cross-framework reference (same StocksGrid workload)

Metric	vanilla WinUI3¹	Rust `windows-reactor`²	Reactor (this PR)
Renders/sec ↑	3.15	5.20	2.64
Avg Reconcile (ms) ↓	n/a	21.6	146.5
Avg Diff (ms) ↓	n/a	19.3	132.1
Avg Memory (MB) ↓	262.9	197.9	300.2

_{↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement / ⚠️ regression require the CI to exclude 0.}
_{Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).}
_{Reconciler micro-benchmarks run PerfBench.ControlModel --variant Reactor (M1–M13) as a headless loop bracketed by per-thread alloc + GC counters — ns-resolution and free of WinUI render / working-set dilution, so they resolve Core/Reconciler allocation deltas the macro StocksGrid workload cannot. main and PR each link their own src/Reactor build and are rep-interleaved (a fresh alternated process per rep); Δ is the paired 95% CI over per-rep means. The Status column tracks allocated bytes/op (deterministic for identical code); ns/op is informational — its paired CI is now unbiased but the flag stays dormant pending a real-CI identical-binary band calibration.}
_{¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.}
_{² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.}
_{Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.}
_{Runner: CPU: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042900980.}
_{Generated by .github/workflows/perf-compare.yml · PR 9e08764 vs main 52baebb · 2026-06-26T17:48:39Z · run log.}
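
The Status rules stated in the notes above can be read as the following small sketch. It is an interpretation of the report text, not the logic in perf-compare.yml: the type and method names are hypothetical, the inputs are assumed to be unrounded CI bounds in percent, and "clears a ±3% band" is taken to mean the whole interval lies beyond the band.

```csharp
using System;

// Hypothetical sketch of the Status column rules; not the workflow's code.
static class PerfStatusSketch
{
    public enum Status { WithinNoise, Improvement, Regression }

    // Macro tables: flag only when the 95% CI of the paired delta excludes 0.
    public static Status Classify(double ciLow, double ciHigh, bool higherIsBetter)
    {
        if (ciLow <= 0 && ciHigh >= 0) return Status.WithinNoise;

        bool increased = ciLow > 0; // CI excludes 0, so it is entirely above or below
        return increased == higherIsBetter ? Status.Improvement : Status.Regression;
    }

    // Micro-benchmarks (allocated bytes/op, lower is better): flag only when
    // the whole CI sits outside a minimum-effect band such as 3 percent.
    public static Status ClassifyWithBand(double ciLow, double ciHigh, double bandPercent)
    {
        if (ciLow > bandPercent) return Status.Regression;
        if (ciHigh < -bandPercent) return Status.Improvement;
        return Status.WithinNoise;
    }
}
```

Applied to the keyed-list rows, Classify(-4.7, -0.4, higherIsBetter: true) gives Regression for Renders/sec and Classify(-30.8, -30.5, higherIsBetter: false) gives Improvement for Alloc bytes/render, matching the table. ClassifyWithBand(-4.6, 9.0, 3.0) gives WithinNoise, matching the M7 row.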

azchohfi · 2026-06-26T18:51:28Z

Measurement complete; perf-compare verified + banked by coordinator. Closing temp measurement PR (DO-NOT-MERGE artifact). Original #657 branch untouched/pristine.

azchohfi and others added 2 commits June 26, 2026 10:10

github-code-quality Bot found potential problems Jun 26, 2026

Comment thread tests/Reactor.Tests/Internal/KeyedListDiffPoolingTests.cs

Comment on lines +284 to +285

foreach (var k in next)
    if (s.ByKey.TryGetValue(k, out var row)) survivorsBefore[k] = row;

azchohfi closed this Jun 26, 2026

azchohfi deleted the azchohfi-657-keyed-perf-remeasure branch June 26, 2026 18:51
