perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff by azchohfi · Pull Request #657 · microsoft/microsoft-ui-reactor

azchohfi · 2026-06-25T19:02:03Z

Closes #653

What

Eliminates per-diff allocations in the keyed list-diff path and restores the missing fast-path on the keyed prefix/suffix loops — a top Diff-gap contributor on the data-grid stress workload (StocksGrid), where stable grid rows go through the keyed path and re-diff every tick. Diff behavior is preserved exactly (which rows move/insert/remove); these are allocation/fast-path changes only.

Scope is limited to src/Reactor/Core/ChildReconciler.cs and src/Reactor/Core/Internal/KeyedListDiff.cs (+ tests).

ChildReconciler.cs

[flagship] Keyed prefix/suffix CanSkipUpdate early-exit (docs(skills): split SKILL.md into root + on-demand sub-skills #30): mirror the positional path's structural-skip so stable keyed rows avoid the children.Get COM call + full property diff every tick. Includes the ForceRenderThroughWrapper guard and the callback-Tag refresh.
Cache children.Count (feat(trace): ETW provider for correlated Reactor + WinUI profiling #37): stop re-reading the COM IVector.get_Size per suffix iteration.
Allocation-free LIS (docs(skills): split SKILL.md into sub-skills; recompile guide #31/feat(024): AI agent devtools — MCP server + UIA-driven automation tools (phases 1–4) #32): replace the HashSet<int>-returning ComputeLIS (which was copied into a 2nd identical set) with ComputeLISInto(arr, length, bool[] mask); tails/tailIndices/predecessors come from ArrayPool. A thin ComputeLIS(int[]) wrapper is kept for existing tests.
Pool ReconcileKeyedMiddle buffers (test(d3): port upstream d3 tests to raise CI coverage #33): newToOld/matched/inLis from ArrayPool; the two key→index dicts from a re-entrancy-safe [ThreadStatic] pool. Cleared + returned on every exit (incl. exceptions).
Filter (fix(reconciler): preserve Grid children on child type flip (#34) #36): count-pass + single Element[] fill instead of List + ToArray.
GetKey (test(devtools): add stress harnesses for MCP startup/shutdown lifecycle #38): cache Type.Name via ConcurrentDictionary.

KeyedListDiff.cs

[flagship] No-op/empty fast paths above the duplicate scan (This repo is missing a LICENSE file #1): the steady-state grid case (keys unchanged) no longer allocates/scans a dup set before the SequenceEqual fast path.
[flagship] Rent newKeys from ArrayPool<string> (Adding Microsoft SECURITY.MD #2): an explicit newCount is threaded everywhere (the rented array may be larger); returned clearArray:true in a finally via an Apply/ApplyCore split.
Fold churn-bailout into ApplyGeneral (DataGrid cells lost from UIA tree after async commit triggers row state transition #34): computed from the same post-prefix/suffix scratch map the general walk already builds (diff-range churn ≡ full-range churn), removing the O(2n) null-marker pre-pass + second populate.
Pool the doomed ReactorRow[] (This repo is missing important files #3), lazy movedRows (refactor: rename Factories.Text → TextBlock, TextElement → TextBlockElement #35), in-place SyncLastKeysToSource instead of Clear()+Add (Add MIT LICENSE file #10), HasDuplicates reuses state.Scratch for 4+ keys (docs: add Blazor as a Microsoft framework in comparative analysis #11), cached null-key diagnostic sample (chore(coverage): clean up coverage config, document unit+selftest merge #41).

Tests

New coverage in tests/Reactor.Tests: pooled-buffer non-corruption (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs. a trivial oracle with survivor-identity checks, pooled doomed removes, churn with shared prefix and suffix, Scratch reuse across a dup bailout) and ComputeLISInto bool-mask coverage (matches the legacy set, clears stale markers, honors length on an oversized pooled mask).

Validation

dotnet test tests/Reactor.Tests — 9693 passed, 0 failed, 64 skipped (the full existing keyed-diff oracle suite stays green, validating exact behavior preservation).
dotnet build src/Reactor -c Release — 0 warnings, 0 errors (core lib promotes AOT/trim warnings to errors).
Perf impact validates via /perf once the benchmark workflow merges.

Copilot

Pull request overview

This PR targets a known perf hotspot in Reactor’s keyed child reconciliation + keyed list diff pipeline by eliminating per-diff allocations (via pooling) and restoring missing fast paths, while adding regression coverage to ensure diff correctness and buffer-reuse safety.

Changes:

Add pooled/length-threaded buffer handling and new early fast paths in KeyedListDiff (including moving the no-op check ahead of duplicate detection).
Optimize keyed reconciliation in ChildReconciler (prefix/suffix CanSkipUpdate early-exit, pooled LIS mask/buffers, pooled key→index maps, and cheaper filtering/key generation).
Add new unit tests covering pooling non-corruption scenarios and ComputeLISInto mask semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
src/Reactor/Core/ChildReconciler.cs	Keyed reconcile fast paths + pooled LIS/dictionaries/arrays to reduce per-tick work/allocations.
src/Reactor/Core/Internal/KeyedListDiff.cs	ArrayPool-backed key materialization and reordered fast paths; pooled removal buffers and reduced scratch churn.
tests/Reactor.Tests/Internal/KeyedListDiffPoolingTests.cs	New regression suite for pooled keyed diff correctness and reuse safety.
tests/Reactor.Tests/ChildReconcilerLisIntoTests.cs	New tests validating `ComputeLISInto` mask behavior vs legacy set-based wrapper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Review follow-ups on #657 (keyed list-diff allocation work): - ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear. - Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length. - Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

azchohfi · 2026-06-26T07:20:47Z

/perf

Hot keyed list-diff path (grid steady state) allocated heavily and missed fast-paths even when keys were unchanged. Eliminate per-diff allocations and add the missing fast-path, preserving diff behavior exactly. ChildReconciler.cs: - Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit that the positional path has, so stable keyed rows no longer re-diff every tick (#30); cache children.Count once instead of re-reading the COM IVector.get_Size per suffix iteration (#37). - Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto filling a pooled bool mask; pool tails/tailIndices/predecessors from ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests. - Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared and returned on every exit (#33). - Filter: count-pass + single Element[] fill, no List+ToArray (#36). - GetKey: cache Type.Name via ConcurrentDictionary (#38). KeyedListDiff.cs: - Rent newKeys from ArrayPool<string>, threading explicit newCount (rented array may be larger); returned clearArray:true in a finally (#2). - Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate scan so the steady-state grid case never allocates/scans a dup set (#1). - Fold the churn-bailout decision into ApplyGeneral, computed from the same post-prefix/suffix scratch map the general walk builds (diff-range churn == full-range churn), removing the O(2n) null-marker pre-pass (#34). - Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over [0,removes), cleared+returned) (#3). - Lazy-allocate movedRows only on an actual move under an ambient (#35). - In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10). - HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh HashSet (#11); cached null-key diagnostic sample (#41). Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs oracle with survivor identity, pooled doomed removes, churn with shared prefix+suffix, Scratch reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full Reactor.Tests suite green (9693 passed); core lib Release build is warning-clean (AOT/trim). Closes #653 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Review follow-ups on #657 (keyed list-diff allocation work): - ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear. - Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length. - Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

azchohfi · 2026-06-26T07:30:13Z

/perf

github-actions · 2026-06-26T07:30:47Z

⚡ Reactor perf comparison

Workload: StressPerf.ReactorOptimized StocksGrid · --percent 50 --duration 10 · x64 Release · median of 12 paired runs (2 warmup dropped); Δ is the mean change with a 95% CI · PR head and main built and run interleaved on the same runner.

Regression vs `main` baseline

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Renders/sec ↑	2.68	2.62	-0.5% _{95% CI [-5.0, +3.9]}	≈ within noise
Avg Reconcile (ms) ↓	145.9	146.7	+1.2% _{95% CI [-1.2, +3.7]}	≈ within noise
Avg Diff (ms) ↓	134.1	135.0	+1.6% _{95% CI [-1.9, +5.1]}	≈ within noise
Avg Memory (MB) ↓	293.7	293.2	-0.4% _{95% CI [-1.3, +0.6]}	≈ within noise

Allocation (Reactor) — lower is better

Metric	`main` (baseline)	This PR	Δ (95% CI)	Status
Alloc bytes/render ↓	9554925	9558250	0.0% _{95% CI [-0.7, +0.7]}	≈ within noise
Gen0 GC / 1k renders ↓	275.86	275.86	-0.2% _{95% CI [-8.5, +8.0]}	≈ within noise

Cross-framework reference (same StocksGrid workload)

Metric	vanilla WinUI3¹	Rust `windows-reactor`²	Reactor (this PR)
Renders/sec ↑	3.39	5.00	2.62
Avg Reconcile (ms) ↓	n/a	18.4	146.7
Avg Diff (ms) ↓	n/a	15.9	135.0
Avg Memory (MB) ↓	263.7	196.3	293.2

_{↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement / ⚠️ regression require the CI to exclude 0.}
_{Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).}
_{¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.}
_{² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.}
_{Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.}
_{Runner: CPU: AMD EPYC 7763 64-Core Processor · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042824877.}
_{Generated by .github/workflows/perf-compare.yml · PR 13679c9 vs main 66d38dc · 2026-06-26T07:42:40Z · run log.}

…698) * perf(ci): wire StressPerf.KeyedList into /perf as a third macro leg Add the already-merged (#694) StressPerf.KeyedList workload to the /perf comparison as a third interleaved A/B macro leg, so /perf can resolve keyed-diff optimizations the positional StocksGrid cells never exercise. StocksGrid (StressPerf.ReactorOptimized) mutates cells in place by index, so its child diff always takes ChildReconciler.ReconcilePositional. The keyed LIS arm (ReconcileKeyed -> ReconcileKeyedMiddle) is invisible to it by construction -- the same blind spot that made the headline-only comparison unable to resolve keyed work (P2/#657, keyed structural-skip P4-C2). StressPerf.KeyedList renders ~500 stably keyed rows reordered/inserted/ removed each tick, driving that keyed arm on every tick. Harness changes (measurement tooling only; 0 src/Reactor): - Run-PerfBenchmark.ps1: AppRegistry KeyedList entry; -IncludeKeyedList (default on) + 'KeyedList' in the local -Apps set; best-effort build (try/catch -> omit on failure, like micro); third interleaved leg at the headline -Percent with the same pair-aligned drop-both collection and full Warmup+Reps budget; aggregation, ctx samples, result.json, and Format-PerfComment threading. - PerfLib.ps1: Format-PerfKeyedListSection (renders the 4 headline metrics with the same paired-delta 95% CI / direction-aware status as the headline table; empty when either side is null); Format-PerfComment -MainKeyed/ -PrKeyed params, rendered after the Allocation table and before the micro suite. - Tests: PerfLib.Tests.ps1 (216 assertions) + RunPerfBenchmark.Tests.ps1 (34 assertions) cover the new section + the static wiring contract. - Docs: README params table + "The comment" bullet; METHODOLOGY subsection; perf-compare.yml header + timeout 60 -> 75 (third macro leg + build). Verified: both test suites green; a local Run-PerfBenchmark.ps1 smoke (stub harness exes on this ARM64/OneDrive box) renders the keyed table in the right position with a direction-correct paired delta, and omits it under -IncludeKeyedList:$false. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * perf(ci): address pr-review findings on the keyed-list leg Internal pr-review (8 dimensions + multi-model cross-check) on #698 surfaced 1 low + 3 medium, all confirmed by the cross-check, all in this PR's own tooling/tests. No critical/high. Addressed: - api-ergonomics (low): the run-mode log line printed `keyed-list={on/off}` from -IncludeKeyedList even in LOCAL mode, where the workload set actually comes from -Apps (so a default local run logged keyed-list=on while not running KeyedList). Make the mode line mode-aware: COMPARE keeps skip-floor/keyed-list tied to the include switches; LOCAL reports `apps=<-Apps>` instead. - test-coverage (medium x3) in the keyed leg's tests: * RunPerfBenchmark.Tests.ps1: lock the opt-out + best-effort build fallback (build guarded by -IncludeKeyedList; a build failure flips it $false; the run leg is skipped when off) and the one-sided-failure drop-both pair alignment ($mm -and $pm appends both; $mm -or $pm drops both). * PerfLib.Tests.ps1: assert keyed verdicts for ALL four headline metrics -- add Avg Diff (improvement) and Avg Memory (within noise). The keyed fixture now gives memory a small symmetric per-pair jitter so its paired CI straddles 0 (a proper within-noise case, not a degenerate zero-variance delta). PerfLib.Tests.ps1 218/218 (+2), RunPerfBenchmark.Tests.ps1 39/39 (+5); all four scripts parse clean; 0 src/Reactor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

azchohfi requested a review from codemonkeychris as a code owner June 25, 2026 19:02

azchohfi marked this pull request as draft June 25, 2026 19:03

github-code-quality Bot found potential problems Jun 25, 2026

View reviewed changes

Comment thread tests/Reactor.Tests/Internal/KeyedListDiffPoolingTests.cs

azchohfi requested a review from Copilot June 25, 2026 19:59

Copilot started reviewing on behalf of azchohfi June 25, 2026 19:59 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Comment thread src/Reactor/Core/ChildReconciler.cs Outdated

azchohfi requested a review from Copilot June 25, 2026 21:01

Copilot started reviewing on behalf of azchohfi June 25, 2026 21:02 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

azchohfi and others added 2 commits June 26, 2026 00:24

azchohfi force-pushed the azchohfi-perf-keyed-list-diff-allocations branch from 83599a0 to 13679c9 Compare June 26, 2026 07:30

azchohfi mentioned this pull request Jun 26, 2026

perf(ci): wire StressPerf.KeyedList into /perf as a third macro leg #698

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff#657

perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff#657
azchohfi wants to merge 2 commits into
mainfrom
azchohfi-perf-keyed-list-diff-allocations

azchohfi commented Jun 25, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

azchohfi commented Jun 26, 2026

Uh oh!

azchohfi commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

azchohfi commented Jun 25, 2026

What

ChildReconciler.cs

KeyedListDiff.cs

Tests

Validation

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

azchohfi commented Jun 26, 2026

Uh oh!

azchohfi commented Jun 26, 2026

Uh oh!

github-actions Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚡ Reactor perf comparison

Regression vs main baseline

Allocation (Reactor) — lower is better

Cross-framework reference (same StocksGrid workload)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 26, 2026 •

edited

Loading

Regression vs `main` baseline