[perf-measure only · DO NOT MERGE] #657 rebased past #694 — keyed-leg /perf#705

Closed
azchohfi wants to merge 2 commits into main from azchohfi-657-keyed-perf-remeasure

@azchohfi (Collaborator)

MEASUREMENT-ONLY · DO NOT MERGE · throwaway.

This is #657 (azchohfi-perf-keyed-list-diff-allocations @ 13679c9) rebased onto main 52baebb to pull in the #694 StressPerf.KeyedList workload, so that the keyed-LIS leg actually renders under /perf. The original branch predates #694, so its whole instrument was dark on the positional StocksGrid.

Verified on the rebased head: Reactor.csproj Release 0/0; Reactor.Tests 9730 passed / 0 failed / 64 skipped (--arch arm64).

Original #657 left pristine (gate-clean-HELD). This PR will be closed and its branch deleted once the perf comparison lands. Tracks #657.

azchohfi and others added 2 commits June 26, 2026 10:10
Hot keyed list-diff path (grid steady state) allocated heavily and missed
fast-paths even when keys were unchanged. Eliminate per-diff allocations and
add the missing fast-path, preserving diff behavior exactly.

ChildReconciler.cs:
- Keyed prefix/suffix loops now take the Element.CanSkipUpdate early-exit
  that the positional path has, so stable keyed rows no longer re-diff every
  tick (#30); cache children.Count once instead of re-reading the COM
  IVector.get_Size per suffix iteration (#37).
- Replace HashSet-returning ComputeLIS with allocation-free ComputeLISInto
  filling a pooled bool mask; pool tails/tailIndices/predecessors from
  ArrayPool (#31/#32). Keep a thin ComputeLIS(int[]) wrapper for tests.
- Pool ReconcileKeyedMiddle's working arrays (ArrayPool) and the two
  key->index maps (re-entrancy-safe ThreadStatic dict pool); buffers cleared
  and returned on every exit (#33).
- Filter: count-pass + single Element[] fill, no List+ToArray (#36).
- GetKey: cache Type.Name via ConcurrentDictionary (#38).
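
The bool-mask LIS idea from the ChildReconciler.cs notes above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `LisSketch`, the method name `ComputeLisInto`, and its span-based signature are assumptions, and the real `ComputeLISInto` may differ in shape. It assumes the input is a new-to-old index map where -1 marks an unmapped entry, and it rents its scratch arrays from `ArrayPool<int>.Shared` instead of allocating a `HashSet`.

```csharp
using System;
using System.Buffers;

public static class LisSketch
{
    // Hypothetical signature (not the PR's): sets inLis[i] = true for every index
    // that belongs to one longest strictly increasing subsequence of source.
    // Entries equal to -1 are treated as unmapped and never join the subsequence.
    public static int ComputeLisInto(ReadOnlySpan<int> source, Span<bool> inLis)
    {
        int n = source.Length;
        inLis.Slice(0, n).Clear();   // leading clear: the caller may pass a dirty rented buffer
        if (n == 0) return 0;

        // tailIndices[k] = index of the smallest tail value among runs of length k + 1.
        int[] tailIndices = ArrayPool<int>.Shared.Rent(n);
        // predecessors[i] = index that precedes i in the run ending at i, or -1.
        int[] predecessors = ArrayPool<int>.Shared.Rent(n);
        try
        {
            int length = 0;
            for (int i = 0; i < n; i++)
            {
                int value = source[i];
                if (value == -1) continue;   // skip the unmapped sentinel

                // Binary search for the first tail whose value is >= value.
                int lo = 0, hi = length;
                while (lo < hi)
                {
                    int mid = lo + (hi - lo) / 2;
                    if (source[tailIndices[mid]] < value) lo = mid + 1; else hi = mid;
                }

                predecessors[i] = lo > 0 ? tailIndices[lo - 1] : -1;
                tailIndices[lo] = i;
                if (lo == length) length++;
            }

            // Walk the predecessor chain back from the tail of the longest run.
            int cursor = length > 0 ? tailIndices[length - 1] : -1;
            while (cursor >= 0)
            {
                inLis[cursor] = true;
                cursor = predecessors[cursor];
            }
            return length;
        }
        finally
        {
            // int[] holds no references, so returning without clearing is sufficient.
            ArrayPool<int>.Shared.Return(tailIndices);
            ArrayPool<int>.Shared.Return(predecessors);
        }
    }
}
```

For the input `{ 3, -1, 0, 1, 5, 2 }` this sketch returns 3 and marks indices 2, 3 and 5 (values 0, 1, 2). Rows outside the mask are the ones a keyed reconciler would have to move.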

KeyedListDiff.cs:
- Rent newKeys from ArrayPool<string>, threading explicit newCount (rented
  array may be larger); returned clearArray:true in a finally (#2).
- Move the no-op (SequenceEqual) + empty/empty fast paths ABOVE the duplicate
  scan so the steady-state grid case never allocates/scans a dup set (#1).
- Fold the churn-bailout decision into ApplyGeneral, computed from the same
  post-prefix/suffix scratch map the general walk builds (diff-range churn ==
  full-range churn), removing the O(2n) null-marker pre-pass (#34).
- Pool the doomed ReactorRow[] (ArrayPool, descending IComparer sort over
  [0,removes), cleared+returned) (#3).
- Lazy-allocate movedRows only on an actual move under an ambient (#35).
- In-place SyncLastKeysToSource instead of LastKeys Clear()+Add (#10).
- HasDuplicates reuses state.Scratch (TryAdd) for 4+ keys instead of a fresh
  HashSet (#11); cached null-key diagnostic sample (#41).
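
The rent-with-explicit-count pattern described for `newKeys` above can be sketched as below. This is only an illustration under stated assumptions: `PooledKeysSketch` and `WithRentedKeys` are hypothetical names, and the callback stands in for whatever the real diff does with the buffer. The `ArrayPool<string>.Shared.Rent` and `Return(array, clearArray: true)` calls are the standard .NET APIs.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

public static class PooledKeysSketch
{
    // Hypothetical helper: copies keys into a rented buffer, runs the callback over
    // exactly `count` entries, and always returns the buffer to the pool cleared.
    public static TResult WithRentedKeys<TResult>(
        IReadOnlyList<string> keys,
        Func<string[], int, TResult> diff)
    {
        int count = keys.Count;
        // The rented array may be larger than requested, so count travels with it.
        string[] rented = ArrayPool<string>.Shared.Rent(count);
        try
        {
            for (int i = 0; i < count; i++) rented[i] = keys[i];
            return diff(rented, count);   // the callee must honor count, never rented.Length
        }
        finally
        {
            // clearArray: true drops the string references so the pool does not keep them alive.
            ArrayPool<string>.Shared.Return(rented, clearArray: true);
        }
    }
}
```

A caller would pass the key list and a delegate, for example `PooledKeysSketch.WithRentedKeys(keys, (buf, n) => string.Join(",", buf, 0, n))`. The `finally` block is what guarantees the buffer goes back on every exit path, including exceptions.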

Tests: add pooling non-corruption coverage (rented-buffer-larger-than-count,
interleaved independent states, randomized stress vs oracle with survivor
identity, pooled doomed removes, churn with shared prefix+suffix, Scratch
reuse for the dup scan) and ComputeLISInto bool-mask coverage. Full
Reactor.Tests suite green (9693 passed); core lib Release build is
warning-clean (AOT/trim).

Closes #653

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Review follow-ups on #657 (keyed list-diff allocation work):

- ChildReconciler.ReconcileKeyedMiddle: return the pooled matched and inLis
  bool[] buffers with clearArray:false instead of true (Copilot review).
  Both are value-typed (no reference pinning) and have their used range
  fully (re)initialized before any read - matched via the Array.Clear on
  rent, inLis via ComputeLISInto leading clear - so the full-array wipe on
  return was avoidable O(rented-capacity) work on the hot path. Matches how
  the int[] newToOld already returns. Reference-typed pooled buffers
  (string[]/ReactorRow[] in KeyedListDiff) still clear.

- Tests (ChildReconcilerLisIntoTests): replace the circular oracle that
  compared ComputeLISInto against ComputeLIS (now a thin wrapper over it)
  with an independent brute-force LIS-length DP that honors the -1 unmapped
  sentinel, asserting the mask marks a valid strictly-increasing subsequence
  of maximal length.

- Tests (ChildReconcilerKeyedSkipTests, new): cover the #30 keyed
  CanSkipUpdate fast path - an identical keyed list skips every row with no
  ops and no child-control access, across repeated stable frames.
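
An independent oracle of the kind described in the ChildReconcilerLisIntoTests item above might look like the following sketch. The names `LisOracleSketch`, `BruteForceLisLength`, and `IsValidMaximalMask` are hypothetical, and the actual test code may be structured differently. The point is that the O(n²) DP shares no logic with the implementation under test, so the comparison is not circular.

```csharp
using System;

public static class LisOracleSketch
{
    // O(n^2) DP: best[i] is the length of the longest strictly increasing run ending at i.
    // Entries equal to -1 (the unmapped sentinel) never participate.
    public static int BruteForceLisLength(int[] source)
    {
        int[] best = new int[source.Length];
        int max = 0;
        for (int i = 0; i < source.Length; i++)
        {
            if (source[i] == -1) continue;
            best[i] = 1;
            for (int j = 0; j < i; j++)
            {
                if (source[j] != -1 && source[j] < source[i] && best[j] + 1 > best[i])
                    best[i] = best[j] + 1;
            }
            if (best[i] > max) max = best[i];
        }
        return max;
    }

    // True when mask selects a strictly increasing subsequence of maximal length
    // that contains no unmapped (-1) entries.
    public static bool IsValidMaximalMask(int[] source, bool[] mask)
    {
        int count = 0;
        int previous = 0;
        bool first = true;
        for (int i = 0; i < source.Length; i++)
        {
            if (!mask[i]) continue;
            if (source[i] == -1) return false;
            if (!first && source[i] <= previous) return false;
            previous = source[i];
            first = false;
            count++;
        }
        return count == BruteForceLisLength(source);
    }
}
```

Checking validity plus maximal length, rather than an exact mask, keeps the test correct even when several longest subsequences exist.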

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azchohfi (Collaborator, Author)

/perf

Comment on lines +284 to +285
foreach (var k in next)
    if (s.ByKey.TryGetValue(k, out var row)) survivorsBefore[k] = row;

@github-actions

⚡ Reactor perf comparison

Workload: StressPerf.ReactorOptimized StocksGrid · --percent 50 --duration 10 · x64 Release · median of 12 paired runs (2 warmup dropped); Δ is the mean change with a 95% CI · PR head and main built and run interleaved on the same runner.

Regression vs main baseline

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Renders/sec ↑ | 2.85 | 2.64 | -2.1% 95% CI [-5.9, +1.8] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 145.9 | 146.5 | +1.0% 95% CI [-1.1, +3.1] | ≈ within noise |
| Avg Diff (ms) ↓ | 132.6 | 132.1 | +0.2% 95% CI [-1.7, +2.0] | ≈ within noise |
| Avg Memory (MB) ↓ | 304.4 | 300.2 | -0.6% 95% CI [-1.3, 0.0] | ✅ improvement |

Low-mutation skip-floor (--percent 0)

At --percent 0 the workload mutates few cells per tick (always at least one), so reconcile/diff isolate the O(n) per-tick child skip-walk floor that higher mutation rates dilute — ChildReconciler re-walks every child each tick even when nothing moved. The closer --percent is to 0, the more this floor is the signal, so a structural-skip optimization shows up cleanly where the headline table above buries it. Δ is the mean paired change with a 95% CI.

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Renders/sec ↑ | 15.07 | 14.69 | -0.5% 95% CI [-4.4, +3.5] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 25.9 | 26.1 | -1.3% 95% CI [-4.4, +1.9] | ≈ within noise |
| Avg Diff (ms) ↓ | 24.1 | 24.3 | -1.2% 95% CI [-4.8, +2.3] | ≈ within noise |
| Avg Memory (MB) ↓ | 268.2 | 267.9 | 0.0% 95% CI [-0.5, +0.4] | ≈ within noise |

Allocation (Reactor) — lower is better

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Alloc bytes/render ↓ | 9745167 | 9814008 | +0.3% 95% CI [-0.2, +0.7] | ≈ within noise |
| Gen0 GC / 1k renders ↓ | 333.33 | 357.14 | +2.8% 95% CI [-1.4, +6.9] | ≈ within noise |

Keyed-list workload (StressPerf.KeyedList, --percent 50)

A separate macro workload: a ~500-row stably keyed list whose rows are reordered / inserted / removed each tick. Because every child carries a key, the child reconciler takes its keyed arm (ReconcileKeyed → ReconcileKeyedMiddle, the LIS-based minimal-move pass) instead of the positional re-walk that the StocksGrid tables above measure. This makes it the sensitive macro signal for keyed-diff work, which the positional cells can never reach. Same interleaved paired-Δ 95% CI as the headline table.

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Renders/sec ↑ | 22.88 | 22.34 | -2.5% 95% CI [-4.7, -0.4] | ⚠️ regression |
| Avg Reconcile (ms) ↓ | 13.2 | 13.2 | +2.5% 95% CI [-2.6, +7.6] | ≈ within noise |
| Avg Diff (ms) ↓ | 13.1 | 13.1 | +2.3% 95% CI [-3.0, +7.7] | ≈ within noise |
| Avg Memory (MB) ↓ | 177.6 | 181.9 | +2.1% 95% CI [+1.3, +3.0] | ⚠️ regression |

Allocation (keyed-list) — lower is better

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Alloc bytes/render ↓ | 316844 | 219994 | -30.6% 95% CI [-30.8, -30.5] | ✅ improvement |
| Gen0 GC / 1k renders ↓ | 12.85 | 8.79 | -31.5% 95% CI [-33.0, -30.1] | ✅ improvement |

Reconciler micro-benchmarks (PerfBench.ControlModel)

Production --variant Reactor control-model path, ns-resolution and WinUI-undiluted (spec-047 M1–M13) — ↓ lower is better. Status tracks allocated bytes/op, the authoritative signal here; it is deterministic for structurally-fixed benches, while dispatcher / background-thread benches carry a small process-to-process offset, so a bench is flagged only when its 95% CI clears a ±3% minimum-effect band (real structural alloc changes are several percent to many-x). ns/op is shown for context but is not auto-flagged (its paired CI is rep-interleaved but the flag remains dormant pending a real-CI identical-binary band calibration). Δ is the mean paired change with a 95% CI.

| Bench | main ns/op | Δ ns (95% CI) | main B/op | Δ alloc (95% CI) | Status |
| --- | --- | --- | --- | --- | --- |
| M1 Mount_Leaf_NoCallback | 112111.5 | +0.5% 95% CI [-0.5, +1.4] | 1140.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M2 Mount_Leaf_OneCallback | 82876.1 | +0.6% 95% CI [-1.3, +2.5] | 3383.3 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M3 Mount_Leaf_ThreeCallbacks | 165104.0 | +1.2% 95% CI [-0.9, +3.3] | 8179.0 | +0.2% 95% CI [0.0, +0.3] | ≈ within noise |
| M4 Dispatch_Switch_Cold | 81557.1 | -3.5% 95% CI [-9.5, +2.4] | 1768.1 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M5 Dispatch_Switch_Warm | 81854.1 | +2.0% 95% CI [-1.4, +5.4] | 1766.0 | 0.0% 95% CI [-1.7, +1.8] | ≈ within noise |
| M6 Dispatch_ExternalType | 70801.2 | +0.2% 95% CI [-0.4, +0.7] | 987.6 | +0.1% 95% CI [-3.0, +3.2] | ≈ within noise |
| M7 Update_NoChange | 44317.0 | +0.2% 95% CI [-1.1, +1.4] | 452.1 | +2.2% 95% CI [-4.6, +9.0] | ≈ within noise |
| M8 Update_OneLeafChanged | 32192.2 | -2.3% 95% CI [-9.3, +4.8] | 536.0 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M9 Update_AllChanged | 2783983.3 | -1.3% 95% CI [-4.0, +1.5] | 184278.1 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M10 EventHandlerState_Alloc | 64810.4 | +1.4% 95% CI [-1.0, +3.8] | 3013.2 | 0.0% 95% CI [-1.1, +1.1] | ≈ within noise |
| M11 ModifierEHS_Frequency | 34713.0 | +0.2% 95% CI [-0.8, +1.1] | 638.8 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M12 Pool_Rent_HotPath | 90760.7 | -0.5% 95% CI [-1.7, +0.6] | 1099.9 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| M13 Setters_Suppression_Scope | 108.6 | +2.0% 95% CI [-5.0, +8.9] | 26.7 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| C207 ChangeHandler_DpRead_Coalesce | 1151.3 | -3.6% 95% CI [-6.8, -0.4] | 0.6 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| OAlloc Optional_Element_Alloc | 163.2 | +1.6% 95% CI [-6.0, +9.3] | 528.0 | 0.0% 95% CI [0.0, 0.0] | ≈ within noise |
| OUpdate Optional_Reconciler_Update | 14270.7 | -0.6% 95% CI [-2.4, +1.2] | 5892.8 | 0.0% 95% CI [-0.5, +0.5] | ≈ within noise |

Cross-framework reference (same StocksGrid workload)

| Metric | vanilla WinUI3¹ | Rust windows-reactor² | Reactor (this PR) |
| --- | --- | --- | --- |
| Renders/sec ↑ | 3.15 | 5.20 | 2.64 |
| Avg Reconcile (ms) ↓ | n/a | 21.6 | 146.5 |
| Avg Diff (ms) ↓ | n/a | 19.3 | 132.1 |
| Avg Memory (MB) ↓ | 262.9 | 197.9 | 300.2 |

↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement / ⚠️ regression require the CI to exclude 0.
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
Reconciler micro-benchmarks run PerfBench.ControlModel --variant Reactor (M1–M13) as a headless loop bracketed by per-thread alloc + GC counters — ns-resolution and free of WinUI render / working-set dilution, so they resolve Core/Reconciler allocation deltas the macro StocksGrid workload cannot. main and PR each link their own src/Reactor build and are rep-interleaved (a fresh alternated process per rep); Δ is the paired 95% CI over per-rep means. The Status column tracks allocated bytes/op (deterministic for identical code); ns/op is informational — its paired CI is now unbiased but the flag stays dormant pending a real-CI identical-binary band calibration.
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
Runner: CPU: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042900980.
Generated by .github/workflows/perf-compare.yml · PR 9e08764 vs main 52baebb · 2026-06-26T17:48:39Z · run log.
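
The status rule stated in the legend above (within noise when the 95% CI includes 0, improvement or regression otherwise, and a ±3% minimum-effect band for the micro-benchmarks) can be sketched as a small classifier. This is an illustration only: `PerfStatus`, `PerfStatusSketch`, and `Classify` are hypothetical names, the workflow's real logic is not shown in this PR, and reading "clears the band" as "the whole CI lies beyond the band" is an assumption.

```csharp
using System;

public enum PerfStatus { WithinNoise, Improvement, Regression }

public static class PerfStatusSketch
{
    // ciLow / ciHigh: bounds of the 95% CI of the paired change, in percent.
    // lowerIsBetter: true for ms, MB, and allocation metrics; false for renders/sec.
    // minEffectPercent: 0 for the macro tables; 3 would model the micro-benchmark band
    // (assumed interpretation: the entire CI must lie outside the band).
    public static PerfStatus Classify(
        double ciLow, double ciHigh, bool lowerIsBetter, double minEffectPercent = 0.0)
    {
        bool clearlyDown = ciHigh < -minEffectPercent;
        bool clearlyUp = ciLow > minEffectPercent;
        if (!clearlyDown && !clearlyUp) return PerfStatus.WithinNoise;

        bool better = lowerIsBetter ? clearlyDown : clearlyUp;
        return better ? PerfStatus.Improvement : PerfStatus.Regression;
    }
}
```

Applied to the keyed-list rows, `Classify(-4.7, -0.4, lowerIsBetter: false)` yields `Regression` and `Classify(-30.8, -30.5, lowerIsBetter: true)` yields `Improvement`, matching the report. Bounds printed as 0.0 are rounded in the tables, so a classifier fed the displayed values can disagree with the report on borderline rows.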

@azchohfi (Collaborator, Author)

Measurement complete; perf-compare verified + banked by coordinator. Closing temp measurement PR (DO-NOT-MERGE artifact). Original #657 branch untouched/pristine.

@azchohfi azchohfi closed this Jun 26, 2026
@azchohfi azchohfi deleted the azchohfi-657-keyed-perf-remeasure branch June 26, 2026 18:51