Skip to content

perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff#657

Draft
azchohfi wants to merge 2 commits into
mainfrom
azchohfi-perf-keyed-list-diff-allocations
Draft

perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff#657
azchohfi wants to merge 2 commits into
mainfrom
azchohfi-perf-keyed-list-diff-allocations

Conversation

@azchohfi

Copy link
Copy Markdown
Collaborator

Closes #653

What

Eliminates per-diff allocations in the keyed list-diff path and restores the missing fast-path on the keyed prefix/suffix loops — a top Diff-gap contributor on the data-grid stress workload (StocksGrid), where stable grid rows go through the keyed path and re-diff every tick. Diff behavior is preserved exactly (which rows move/insert/remove); these are allocation/fast-path changes only.

Scope is limited to src/Reactor/Core/ChildReconciler.cs and src/Reactor/Core/Internal/KeyedListDiff.cs (+ tests).

ChildReconciler.cs

KeyedListDiff.cs

Tests

New coverage in tests/Reactor.Tests: pooled-buffer non-corruption (rented-buffer-larger-than-count, interleaved independent states, randomized stress vs. a trivial oracle with survivor-identity checks, pooled doomed removes, churn with shared prefix and suffix, Scratch reuse across a dup bailout) and ComputeLISInto bool-mask coverage (matches the legacy set, clears stale markers, honors length on an oversized pooled mask).

Validation

  • dotnet test tests/Reactor.Tests9693 passed, 0 failed, 64 skipped (the full existing keyed-diff oracle suite stays green, validating exact behavior preservation).
  • dotnet build src/Reactor -c Release0 warnings, 0 errors (core lib promotes AOT/trim warnings to errors).
  • Perf impact validates via /perf once the benchmark workflow merges.

@azchohfi azchohfi marked this pull request as draft June 25, 2026 19:03
Comment thread tests/Reactor.Tests/Internal/KeyedListDiffPoolingTests.cs

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets a known perf hotspot in Reactor’s keyed child reconciliation + keyed list diff pipeline by eliminating per-diff allocations (via pooling) and restoring missing fast paths, while adding regression coverage to ensure diff correctness and buffer-reuse safety.

Changes:

  • Add pooled/length-threaded buffer handling and new early fast paths in KeyedListDiff (including moving the no-op check ahead of duplicate detection).
  • Optimize keyed reconciliation in ChildReconciler (prefix/suffix CanSkipUpdate early-exit, pooled LIS mask/buffers, pooled key→index maps, and cheaper filtering/key generation).
  • Add new unit tests covering pooling non-corruption scenarios and ComputeLISInto mask semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/Reactor/Core/ChildReconciler.cs Keyed reconcile fast paths + pooled LIS/dictionaries/arrays to reduce per-tick work/allocations.
src/Reactor/Core/Internal/KeyedListDiff.cs ArrayPool-backed key materialization and reordered fast paths; pooled removal buffers and reduced scratch churn.
tests/Reactor.Tests/Internal/KeyedListDiffPoolingTests.cs New regression suite for pooled keyed diff correctness and reuse safety.
tests/Reactor.Tests/ChildReconcilerLisIntoTests.cs New tests validating ComputeLISInto mask behavior vs legacy set-based wrapper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/Reactor/Core/ChildReconciler.cs Outdated
azchohfi added a commit that referenced this pull request Jun 25, 2026
Review follow-ups on #657 (keyed list-diff allocation work):

- ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis
  bool[] buffers with clearArray:false instead of true (Copilot review).
  Both are value-typed (no reference pinning) and have their used range
  fully (re)initialized before any read - matched via the Array.Clear on
  rent, inLis via ComputeLISInto leading clear - so the full-array wipe on
  return was avoidable O(rented-capacity) work on the hot path. Matches how
  the int[] newToOld already returns. Reference-typed pooled buffers
  (string[]/ReactorRow[] in KeyedListDiff) still clear.

- Tests (ChildReconcilerLisIntoTests): replace the circular oracle that
  compared ComputeLISInto against ComputeLIS (now a thin wrapper over it)
  with an independent brute-force LIS-length DP that honors the -1 unmapped
  sentinel, asserting the mask marks a valid strictly-increasing subsequence
  of maximal length.

- Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed
  CanSkipUpdate fast path - an identical keyed list skips every row with no
  ops and no child-control access, across repeated stable frames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azchohfi azchohfi requested a review from Copilot June 25, 2026 21:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@azchohfi

Copy link
Copy Markdown
Collaborator Author

/perf

azchohfi and others added 2 commits June 26, 2026 00:24
Hot keyed list-diff path (grid steady state) allocated heavily and missed
fast-paths even when keys were unchanged. Eliminate per-diff allocations and
add the missing fast-path, preserving diff behavior exactly.

ChildReconciler.cs:
- Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit
  that the positional path has, so stable keyed rows no longer re-diff every
  tick (#30); cache children.Count once instead of re-reading the COM
  IVector.get_Size per suffix iteration (#37).
- Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto
  filling a pooled bool mask; pool tails/tailIndices/predecessors from
  ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests.
- Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two
  key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared
  and returned on every exit (#33).
- Filter: count-pass + single Element[] fill, no List+ToArray (#36).
- GetKey: cache Type.Name via ConcurrentDictionary (#38).

KeyedListDiff.cs:
- Rent newKeys from ArrayPool<string>, threading explicit newCount (rented
  array may be larger); returned clearArray:true in a finally (#2).
- Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate
  scan so the steady-state grid case never allocates/scans a dup set (#1).
- Fold the churn-bailout decision into ApplyGeneral, computed from the same
  post-prefix/suffix scratch map the general walk builds (diff-range churn ==
  full-range churn), removing the O(2n) null-marker pre-pass (#34).
- Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over
  [0,removes), cleared+returned) (#3).
- Lazy-allocate movedRows only on an actual move under an ambient (#35).
- In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10).
- HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh
  HashSet (#11); cached null-key diagnostic sample (#41).

Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count,
interleaved independent states, randomized stress vs oracle with survivor
identity, pooled doomed removes, churn with shared prefix+suffix, Scratch
reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full
Reactor.Tests suite green (9693 passed); core lib Release build is
warning-clean (AOT/trim).

Closes #653

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Review follow-ups on #657 (keyed list-diff allocation work):

- ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis
  bool[] buffers with clearArray:false instead of true (Copilot review).
  Both are value-typed (no reference pinning) and have their used range
  fully (re)initialized before any read - matched via the Array.Clear on
  rent, inLis via ComputeLISInto leading clear - so the full-array wipe on
  return was avoidable O(rented-capacity) work on the hot path. Matches how
  the int[] newToOld already returns. Reference-typed pooled buffers
  (string[]/ReactorRow[] in KeyedListDiff) still clear.

- Tests (ChildReconcilerLisIntoTests): replace the circular oracle that
  compared ComputeLISInto against ComputeLIS (now a thin wrapper over it)
  with an independent brute-force LIS-length DP that honors the -1 unmapped
  sentinel, asserting the mask marks a valid strictly-increasing subsequence
  of maximal length.

- Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed
  CanSkipUpdate fast path - an identical keyed list skips every row with no
  ops and no child-control access, across repeated stable frames.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azchohfi azchohfi force-pushed the azchohfi-perf-keyed-list-diff-allocations branch from 83599a0 to 13679c9 Compare June 26, 2026 07:30
@azchohfi

Copy link
Copy Markdown
Collaborator Author

/perf

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

⚡ Reactor perf comparison

Workload: StressPerf.ReactorOptimized StocksGrid · --percent 50 --duration 10 · x64 Release · median of 12 paired runs (2 warmup dropped); Δ is the mean change with a 95% CI · PR head and main built and run interleaved on the same runner.

Regression vs main baseline

Metric main (baseline) This PR Δ (95% CI) Status
Renders/sec ↑ 2.68 2.62 -0.5% 95% CI [-5.0, +3.9] ≈ within noise
Avg Reconcile (ms) ↓ 145.9 146.7 +1.2% 95% CI [-1.2, +3.7] ≈ within noise
Avg Diff (ms) ↓ 134.1 135.0 +1.6% 95% CI [-1.9, +5.1] ≈ within noise
Avg Memory (MB) ↓ 293.7 293.2 -0.4% 95% CI [-1.3, +0.6] ≈ within noise

Allocation (Reactor) — lower is better

Metric main (baseline) This PR Δ (95% CI) Status
Alloc bytes/render ↓ 9554925 9558250 0.0% 95% CI [-0.7, +0.7] ≈ within noise
Gen0 GC / 1k renders ↓ 275.86 275.86 -0.2% 95% CI [-8.5, +8.0] ≈ within noise

Cross-framework reference (same StocksGrid workload)

Metric vanilla WinUI3¹ Rust windows-reactor² Reactor (this PR)
Renders/sec ↑ 3.39 5.00 2.62
Avg Reconcile (ms) ↓ n/a 18.4 146.7
Avg Diff (ms) ↓ n/a 15.9 135.0
Avg Memory (MB) ↓ 263.7 196.3 293.2

↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement / ⚠️ regression require the CI to exclude 0.
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
Runner: CPU: AMD EPYC 7763 64-Core Processor · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042824877.
Generated by .github/workflows/perf-compare.yml · PR 13679c9 vs main 66d38dc · 2026-06-26T07:42:40Z · run log.

azchohfi added a commit that referenced this pull request Jun 26, 2026
…698)

* perf(ci): wire StressPerf.KeyedList into /perf as a third macro leg

Add the already-merged (#694) StressPerf.KeyedList workload to the /perf
comparison as a third interleaved A/B macro leg, so /perf can resolve
keyed-diff optimizations the positional StocksGrid cells never exercise.

StocksGrid (StressPerf.ReactorOptimized) mutates cells in place by index, so
its child diff always takes ChildReconciler.ReconcilePositional. The keyed
LIS arm (ReconcileKeyed -> ReconcileKeyedMiddle) is invisible to it by
construction -- the same blind spot that made the headline-only comparison
unable to resolve keyed work (P2/#657, keyed structural-skip P4-C2).
StressPerf.KeyedList renders ~500 stably keyed rows reordered/inserted/
removed each tick, driving that keyed arm on every tick.

Harness changes (measurement tooling only; 0 src/Reactor):
- Run-PerfBenchmark.ps1: AppRegistry KeyedList entry; -IncludeKeyedList
  (default on) + 'KeyedList' in the local -Apps set; best-effort build
  (try/catch -> omit on failure, like micro); third interleaved leg at the
  headline -Percent with the same pair-aligned drop-both collection and full
  Warmup+Reps budget; aggregation, ctx samples, result.json, and
  Format-PerfComment threading.
- PerfLib.ps1: Format-PerfKeyedListSection (renders the 4 headline metrics
  with the same paired-delta 95% CI / direction-aware status as the headline
  table; empty when either side is null); Format-PerfComment -MainKeyed/
  -PrKeyed params, rendered after the Allocation table and before the micro
  suite.
- Tests: PerfLib.Tests.ps1 (216 assertions) + RunPerfBenchmark.Tests.ps1
  (34 assertions) cover the new section + the static wiring contract.
- Docs: README params table + "The comment" bullet; METHODOLOGY subsection;
  perf-compare.yml header + timeout 60 -> 75 (third macro leg + build).

Verified: both test suites green; a local Run-PerfBenchmark.ps1 smoke (stub
harness exes on this ARM64/OneDrive box) renders the keyed table in the right
position with a direction-correct paired delta, and omits it under
-IncludeKeyedList:$false.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* perf(ci): address pr-review findings on the keyed-list leg

Internal pr-review (8 dimensions + multi-model cross-check) on #698 surfaced
1 low + 3 medium, all confirmed by the cross-check, all in this PR's own
tooling/tests. No critical/high. Addressed:

- api-ergonomics (low): the run-mode log line printed `keyed-list={on/off}`
  from -IncludeKeyedList even in LOCAL mode, where the workload set actually
  comes from -Apps (so a default local run logged keyed-list=on while not
  running KeyedList). Make the mode line mode-aware: COMPARE keeps
  skip-floor/keyed-list tied to the include switches; LOCAL reports
  `apps=<-Apps>` instead.
- test-coverage (medium x3) in the keyed leg's tests:
  * RunPerfBenchmark.Tests.ps1: lock the opt-out + best-effort build fallback
    (build guarded by -IncludeKeyedList; a build failure flips it $false; the
    run leg is skipped when off) and the one-sided-failure drop-both pair
    alignment ($mm -and $pm appends both; $mm -or $pm drops both).
  * PerfLib.Tests.ps1: assert keyed verdicts for ALL four headline metrics --
    add Avg Diff (improvement) and Avg Memory (within noise). The keyed fixture
    now gives memory a small symmetric per-pair jitter so its paired CI
    straddles 0 (a proper within-noise case, not a degenerate zero-variance
    delta).

PerfLib.Tests.ps1 218/218 (+2), RunPerfBenchmark.Tests.ps1 39/39 (+5); all
four scripts parse clean; 0 src/Reactor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: eliminate per-diff allocations in ChildReconciler + KeyedListDiff

2 participants