perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff#657
perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff#657azchohfi wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR targets a known perf hotspot in Reactor’s keyed child reconciliation + keyed list diff pipeline by eliminating per-diff allocations (via pooling) and restoring missing fast paths, while adding regression coverage to ensure diff correctness and buffer-reuse safety.
Changes:
- Add pooled/length-threaded buffer handling and new early fast paths in
KeyedListDiff(including moving the no-op check ahead of duplicate detection). - Optimize keyed reconciliation in
ChildReconciler(prefix/suffixCanSkipUpdateearly-exit, pooled LIS mask/buffers, pooled key→index maps, and cheaper filtering/key generation). - Add new unit tests covering pooling non-corruption scenarios and
ComputeLISIntomask semantics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/Reactor/Core/ChildReconciler.cs | Keyed reconcile fast paths + pooled LIS/dictionaries/arrays to reduce per-tick work/allocations. |
| src/Reactor/Core/Internal/KeyedListDiff.cs | ArrayPool-backed key materialization and reordered fast paths; pooled removal buffers and reduced scratch churn. |
| tests/Reactor.Tests/Internal/KeyedListDiffPoolingTests.cs | New regression suite for pooled keyed diff correctness and reuse safety. |
| tests/Reactor.Tests/ChildReconcilerLisIntoTests.cs | New tests validating ComputeLISInto mask behavior vs legacy set-based wrapper. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Review follow-ups on #657 (keyed list-diff allocation work): - ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear. - Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length. - Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/perf |
Hot keyed list-diff path (grid steady state) allocated heavily and missed fast-paths even when keys were unchanged. Eliminate per-diff allocations and add the missing fast-path, preserving diff behavior exactly. ChildReconciler.cs: - Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit that the positional path has, so stable keyed rows no longer re-diff every tick (#30); cache children.Count once instead of re-reading the COM IVector.get_Size per suffix iteration (#37). - Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto filling a pooled bool mask; pool tails/tailIndices/predecessors from ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests. - Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared and returned on every exit (#33). - Filter: count-pass + single Element[] fill, no List+ToArray (#36). - GetKey: cache Type.Name via ConcurrentDictionary (#38). KeyedListDiff.cs: - Rent newKeys from ArrayPool<string>, threading explicit newCount (rented array may be larger); returned clearArray:true in a finally (#2). - Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate scan so the steady-state grid case never allocates/scans a dup set (#1). - Fold the churn-bailout decision into ApplyGeneral, computed from the same post-prefix/suffix scratch map the general walk builds (diff-range churn == full-range churn), removing the O(2n) null-marker pre-pass (#34). - Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over [0,removes), cleared+returned) (#3). - Lazy-allocate movedRows only on an actual move under an ambient (#35). - In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10). - HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh HashSet (#11); cached null-key diagnostic sample (#41). Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs oracle with survivor identity, pooled doomed removes, churn with shared prefix+suffix, Scratch reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full Reactor.Tests suite green (9693 passed); core lib Release build is warning-clean (AOT/trim). Closes #653 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review follow-ups on #657 (keyed list-diff allocation work): - ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis bool[] buffers with clearArray:false instead of true (Copilot review). Both are value-typed (no reference pinning) and have their used range fully (re)initialized before any read - matched via the Array.Clear on rent, inLis via ComputeLISInto leading clear - so the full-array wipe on return was avoidable O(rented-capacity) work on the hot path. Matches how the int[] newToOld already returns. Reference-typed pooled buffers (string[]/ReactorRow[] in KeyedListDiff) still clear. - Tests (ChildReconcilerLisIntoTests): replace the circular oracle that compared ComputeLISInto against ComputeLIS (now a thin wrapper over it) with an independent brute-force LIS-length DP that honors the -1 unmapped sentinel, asserting the mask marks a valid strictly-increasing subsequence of maximal length. - Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed CanSkipUpdate fast path - an identical keyed list skips every row with no ops and no child-control access, across repeated stable frames. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
83599a0 to
13679c9
Compare
|
/perf |
⚡ Reactor perf comparisonWorkload: Regression vs
|
| Metric | main (baseline) |
This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Renders/sec ↑ | 2.68 | 2.62 | -0.5% 95% CI [-5.0, +3.9] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 145.9 | 146.7 | +1.2% 95% CI [-1.2, +3.7] | ≈ within noise |
| Avg Diff (ms) ↓ | 134.1 | 135.0 | +1.6% 95% CI [-1.9, +5.1] | ≈ within noise |
| Avg Memory (MB) ↓ | 293.7 | 293.2 | -0.4% 95% CI [-1.3, +0.6] | ≈ within noise |
Allocation (Reactor) — lower is better
| Metric | main (baseline) |
This PR | Δ (95% CI) | Status |
|---|---|---|---|---|
| Alloc bytes/render ↓ | 9554925 | 9558250 | 0.0% 95% CI [-0.7, +0.7] | ≈ within noise |
| Gen0 GC / 1k renders ↓ | 275.86 | 275.86 | -0.2% 95% CI [-8.5, +8.0] | ≈ within noise |
Cross-framework reference (same StocksGrid workload)
| Metric | vanilla WinUI3¹ | Rust windows-reactor² |
Reactor (this PR) |
|---|---|---|---|
| Renders/sec ↑ | 3.39 | 5.00 | 2.62 |
| Avg Reconcile (ms) ↓ | n/a | 18.4 | 146.7 |
| Avg Diff (ms) ↓ | n/a | 15.9 | 135.0 |
| Avg Memory (MB) ↓ | 263.7 | 196.3 | 293.2 |
↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement /
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
Runner: CPU: AMD EPYC 7763 64-Core Processor · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042824877.
Generated by .github/workflows/perf-compare.yml · PR 13679c9 vs main 66d38dc · 2026-06-26T07:42:40Z · run log.
…698) * perf(ci): wire StressPerf.KeyedList into /perf as a third macro leg Add the already-merged (#694) StressPerf.KeyedList workload to the /perf comparison as a third interleaved A/B macro leg, so /perf can resolve keyed-diff optimizations the positional StocksGrid cells never exercise. StocksGrid (StressPerf.ReactorOptimized) mutates cells in place by index, so its child diff always takes ChildReconciler.ReconcilePositional. The keyed LIS arm (ReconcileKeyed -> ReconcileKeyedMiddle) is invisible to it by construction -- the same blind spot that made the headline-only comparison unable to resolve keyed work (P2/#657, keyed structural-skip P4-C2). StressPerf.KeyedList renders ~500 stably keyed rows reordered/inserted/ removed each tick, driving that keyed arm on every tick. Harness changes (measurement tooling only; 0 src/Reactor): - Run-PerfBenchmark.ps1: AppRegistry KeyedList entry; -IncludeKeyedList (default on) + 'KeyedList' in the local -Apps set; best-effort build (try/catch -> omit on failure, like micro); third interleaved leg at the headline -Percent with the same pair-aligned drop-both collection and full Warmup+Reps budget; aggregation, ctx samples, result.json, and Format-PerfComment threading. - PerfLib.ps1: Format-PerfKeyedListSection (renders the 4 headline metrics with the same paired-delta 95% CI / direction-aware status as the headline table; empty when either side is null); Format-PerfComment -MainKeyed/ -PrKeyed params, rendered after the Allocation table and before the micro suite. - Tests: PerfLib.Tests.ps1 (216 assertions) + RunPerfBenchmark.Tests.ps1 (34 assertions) cover the new section + the static wiring contract. - Docs: README params table + "The comment" bullet; METHODOLOGY subsection; perf-compare.yml header + timeout 60 -> 75 (third macro leg + build). Verified: both test suites green; a local Run-PerfBenchmark.ps1 smoke (stub harness exes on this ARM64/OneDrive box) renders the keyed table in the right position with a direction-correct paired delta, and omits it under -IncludeKeyedList:$false. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * perf(ci): address pr-review findings on the keyed-list leg Internal pr-review (8 dimensions + multi-model cross-check) on #698 surfaced 1 low + 3 medium, all confirmed by the cross-check, all in this PR's own tooling/tests. No critical/high. Addressed: - api-ergonomics (low): the run-mode log line printed `keyed-list={on/off}` from -IncludeKeyedList even in LOCAL mode, where the workload set actually comes from -Apps (so a default local run logged keyed-list=on while not running KeyedList). Make the mode line mode-aware: COMPARE keeps skip-floor/keyed-list tied to the include switches; LOCAL reports `apps=<-Apps>` instead. - test-coverage (medium x3) in the keyed leg's tests: * RunPerfBenchmark.Tests.ps1: lock the opt-out + best-effort build fallback (build guarded by -IncludeKeyedList; a build failure flips it $false; the run leg is skipped when off) and the one-sided-failure drop-both pair alignment ($mm -and $pm appends both; $mm -or $pm drops both). * PerfLib.Tests.ps1: assert keyed verdicts for ALL four headline metrics -- add Avg Diff (improvement) and Avg Memory (within noise). The keyed fixture now gives memory a small symmetric per-pair jitter so its paired CI straddles 0 (a proper within-noise case, not a degenerate zero-variance delta). PerfLib.Tests.ps1 218/218 (+2), RunPerfBenchmark.Tests.ps1 39/39 (+5); all four scripts parse clean; 0 src/Reactor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes #653
What
Eliminates per-diff allocations in the keyed list-diff path and restores the missing fast-path on the keyed prefix/suffix loops — a top Diff-gap contributor on the data-grid stress workload (StocksGrid), where stable grid rows go through the keyed path and re-diff every tick. Diff behavior is preserved exactly (which rows move/insert/remove); these are allocation/fast-path changes only.
Scope is limited to
src/Reactor/Core/ChildReconciler.csandsrc/Reactor/Core/Internal/KeyedListDiff.cs(+ tests).ChildReconciler.cs
CanSkipUpdateearly-exit (docs(skills): split SKILL.md into root + on-demand sub-skills #30): mirror the positional path's structural-skip so stable keyed rows avoid thechildren.GetCOM call + full property diff every tick. Includes theForceRenderThroughWrapperguard and the callback-Tag refresh.children.Count(feat(trace): ETW provider for correlated Reactor + WinUI profiling #37): stop re-reading the COMIVector.get_Sizeper suffix iteration.HashSet<int>-returningComputeLIS(which was copied into a 2nd identical set) withComputeLISInto(arr, length, bool[] mask);tails/tailIndices/predecessorscome fromArrayPool. A thinComputeLIS(int[])wrapper is kept for existing tests.ReconcileKeyedMiddlebuffers (test(d3): port upstream d3 tests to raise CI coverage #33):newToOld/matched/inLisfromArrayPool; the two key→index dicts from a re-entrancy-safe[ThreadStatic]pool. Cleared + returned on every exit (incl. exceptions).Filter(fix(reconciler): preserve Grid children on child type flip (#34) #36): count-pass + singleElement[]fill instead ofList+ToArray.GetKey(test(devtools): add stress harnesses for MCP startup/shutdown lifecycle #38): cacheType.NameviaConcurrentDictionary.KeyedListDiff.cs
SequenceEqualfast path.newKeysfromArrayPool<string>(Adding Microsoft SECURITY.MD #2): an explicitnewCountis threaded everywhere (the rented array may be larger); returnedclearArray:truein afinallyvia anApply/ApplyCoresplit.ApplyGeneral(DataGrid cells lost from UIA tree after async commit triggers row state transition #34): computed from the same post-prefix/suffix scratch map the general walk already builds (diff-range churn ≡ full-range churn), removing the O(2n) null-marker pre-pass + second populate.doomedReactorRow[](This repo is missing important files #3), lazymovedRows(refactor: rename Factories.Text → TextBlock, TextElement → TextBlockElement #35), in-placeSyncLastKeysToSourceinstead ofClear()+Add(Add MIT LICENSE file #10),HasDuplicatesreusesstate.Scratchfor 4+ keys (docs: add Blazor as a Microsoft framework in comparative analysis #11), cached null-key diagnostic sample (chore(coverage): clean up coverage config, document unit+selftest merge #41).Tests
New coverage in
tests/Reactor.Tests: pooled-buffer non-corruption (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs. a trivial oracle with survivor-identity checks, pooleddoomedremoves, churn with shared prefix and suffix,Scratchreuse across a dup bailout) andComputeLISIntobool-mask coverage (matches the legacy set, clears stale markers, honorslengthon an oversized pooled mask).Validation
dotnet test tests/Reactor.Tests— 9693 passed, 0 failed, 64 skipped (the full existing keyed-diff oracle suite stays green, validating exact behavior preservation).dotnet build src/Reactor-c Release— 0 warnings, 0 errors (core lib promotes AOT/trim warnings to errors)./perfonce the benchmark workflow merges.